Re: [Languagetool] Hunspell spellcheck performance

Marcin Miłkowski Fri, 22 Jun 2012 00:01:15 -0700

On 2012-06-22 07:39, Dominique Pellé wrote:
> Daniel Naber<list2...@danielnaber.de>  wrote:
>
>> Hi,
>>
>> I know Marcin has warned about this several times but I only today noticed
>> how slow spell check suggestions really are. Checking a long German blog
>> entry takes 100 seconds. Without suggestions it takes 6 seconds. 100
>> seconds seems not acceptable, so I suggest we keep the spell checking but
>> disable the suggestions. In the next version we can then have an
>> alternative rule that does spell checking with suggestions (introducing
>> that now would mean new strings that need translation). Any thoughts?


I'm against removing suggestions completely. Hunspell without
suggestions is no better than using just our taggers for spell-checking,
and for some languages, like English, Hunspell is fast enough.
German simply has a huge dictionary, all because of compounding.

The only thing that seems practical to me is to subclass HunspellRule
for languages with really slow suggestions (as HunspellNoSuggestRule)
and use that subclass for this release. I do not think that removing
spell-check suggestions for US English is a good idea at all.

For languages without compounds and diacritics, we can run unmunch on
the Hunspell dictionaries and convert them quickly to morfologik-speller
format, if you're worried about the speed. The morfologik-speller module
is experimental and does not support some of Hunspell's tricks (I didn't
have time to implement REP etc.), but it's good enough.

> I'm fine with this if that.
> Will this be made configurable in command line mode?
> 3 possible modes:
>
> (1) no hunspell  (fastest) (we can already do that with -d HUNSPELL_RULE)
> (2) hunspell without spelling suggestions
> (3) hunspell with corrections (slowest)

I'm afraid this is not possible and will not be possible with 1.8. The
rules are not configurable at all from the command line, and we're
already in the freeze period, so no new features are being introduced.
Rule configuration is a non-trivial thing to implement.

>
> On top of that, there is also the idea of using Hunspell
> only on words with UNKNOWN POS tag which may work
> fine for some languages.

This algorithm would be a waste of time: we can already use untagged
words to display an error, and that will be faster than any Hunspell
rule. But most Hunspell dictionaries cover more words than our taggers,
so it should not make any difference in timing, especially because we
would have to do some string processing for every sentence to map string
portions to tokens. In some languages it won't help at all, if Hunspell's
tokenization is different. This is why I didn't bother with this.
Moreover, checking time is negligible. The crucial thing is the time
spent creating suggestions.
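
To make the point about words without a tag concrete, here is a minimal
sketch, again with made-up names (Token, posTag and UntaggedWordCheck are
not LanguageTool classes): a token the tagger could not tag is reported
directly, with no Hunspell call and no suggestions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token: posTag is null when the tagger does not know the word.
class Token {
    final String text;
    final String posTag;

    Token(String text, String posTag) {
        this.text = text;
        this.posTag = posTag;
    }
}

class UntaggedWordCheck {
    // Returns the text of every token that has no POS tag, in sentence order.
    static List<String> findUntagged(List<Token> tokens) {
        List<String> unknown = new ArrayList<>();
        for (Token token : tokens) {
            if (token.posTag == null) {
                unknown.add(token.text);
            }
        }
        return unknown;
    }
}
```

This is a single pass over tokens we already have, which is why it is
cheap; it only catches what the tagger dictionary does not cover, though.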

Regards,
Marcin


