On 2012-06-04 22:51, Dominique Pellé wrote:
> Hi
>
> Another problem with spell checking is the quality
> of the Esperanto Hunspell dictionary.  It's not good
> enough. Too many correct words are highlighted
> because they are missing from the dictionary. That's
> not LT's fault here.

OK, I disabled HunspellRule for Esperanto.

> To be fair, it's hard to make an Esperanto dictionary
> with Hunspell because Esperanto is an agglutinative
> language. Hunspell only supports two prefixes/suffixes,
> but Esperanto words can often use more of them.
> Working around it is messy I think.

Actually, hunspell is much smarter than ispell: it can create compounds,
so agglutination should not be so hard in principle (it was created for
Hungarian, after all). Yet for Finnish it is not enough. However,
compounding plus two-level affixation rules should be quite satisfactory
for Esperanto. There are people on this list with some experience
creating such dictionaries (for example, Ruud). But that's a major
project. For now, I did not remove the Esperanto hunspell files from the
repository.

>
> Given all the unresolved issues at least in all the languages
> that I maintain (br, fr, eo), can we consider turning
> Hunspell off by default? I'm concerned that people
> downloading the nightly build will experience many
> spurious errors.

Well, I fixed errors for Breton, and there's a fix for French, so maybe 
it's not a big problem?

> As an experiment, I also commented out the Hunspell
> rule in src/java/org/languagetool/language/Breton.java
> and LT is then more than twice as fast (even when comparing
> using -d HUNSPELL_RULE).

That's because loading a language calls the HunspellRule constructor,
and the constructor reads files from disk, so the cost is paid even when
the rule is disabled with -d.
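
One common way to avoid paying for a rule that ends up disabled is to
defer the file read until the rule is first used. The sketch below is
not LanguageTool code: LazySpellingRule, loader and load_count are
hypothetical names, and it is written in Python only for brevity.

```python
import functools


class LazySpellingRule:
    """Hypothetical rule that defers dictionary loading until first use."""

    def __init__(self, loader):
        # Keep only the loader callable; nothing is read from disk here.
        self._loader = loader
        self.load_count = 0

    @functools.cached_property
    def words(self):
        # Runs once, on first access; the result is cached on the instance.
        self.load_count += 1
        return frozenset(self._loader())

    def is_misspelled(self, word):
        return word not in self.words
```

With this shape, constructing the rule is cheap, and a rule that is
never asked to check a word never triggers the load.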

I'm not a fan of hunspell; I think it takes the wrong approach to
creating suggestions, because the computational complexity of its
algorithm is simply too high. It should use something else, such as the
composition of a Levenshtein distance automaton with a dictionary
automaton, which would be really fast (such an approach is used by the
suggest methods in Lucene). Its "user-friendly" representation of
affixation could be even nicer with twolc/lexc files for creating
automata. Anyway, voikko (the Finnish speller) might have scripts to
convert hunspell files to such automata, and we might, in the future,
use a better algorithm. If my algorithm for morfologik turns out to be
implemented correctly, we may also use it for some languages; Polish is
*pathetically* slow in hunspell. Right now, however, for pragmatic
reasons, I would vote for hunspell in LanguageTool 1.8.
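
To make the automaton idea concrete, here is a minimal sketch. It does
not build an explicit Levenshtein automaton and it is neither Lucene's
nor morfologik's implementation; instead it walks a dictionary trie
while carrying one row of the edit-distance table, which simulates the
same intersection and prunes whole subtrees that cannot match. The names
build_trie and suggest, and the use of "" as an end-of-word key, are
assumptions made for the example.

```python
def build_trie(words):
    """Build a nested-dict trie; the key "" stores a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = word
    return root


def suggest(trie, query, max_dist):
    """Return sorted (distance, word) pairs within max_dist edits of query."""
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # Extend the edit-distance table by one row for trie character ch.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1,           # extra query character
                           prev_row[i] + 1,          # extra dictionary character
                           prev_row[i - 1] + cost))  # match or substitution
        if "" in node and row[-1] <= max_dist:
            results.append((row[-1], node[""]))
        # Prune: if every cell exceeds max_dist, no longer word can match.
        if min(row) <= max_dist:
            for next_ch, child in node.items():
                if next_ch != "":
                    walk(child, next_ch, row)

    for ch, child in trie.items():
        if ch != "":
            walk(child, ch, first_row)
    return sorted(results)
```

For the word list hundo, mondo, mono, rondo, calling
suggest(trie, "mondo", 1) returns mondo at distance 0 and mono and rondo
at distance 1; hundo is two edits away and is dropped. The point of the
design is that shared prefixes are scored once and dead branches are
abandoned early, rather than comparing the query against every word.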

Regards,
Marcin

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
