On 2012-06-04 22:51, Dominique Pellé wrote:
> Hi
>
> Another problem with spell checking, is the quality
> of the Esperanto Hunspell dictionary. It's not good
> enough. Too many correct words are highlighted
> because they are missing in the dictionary. That's
> not LT's faults here.

OK, I disabled HunspellRule for Esperanto.

> To be fair, it's hard to make an Esperanto dictionary
> with Hunspell because Esperanto is an agglutinative
> language. Hunspell only supports two prefixes/suffixes
> But Esperanto words can often use more of them.
> Working around it is messy I think.

Actually, hunspell is much smarter than ispell: it can create compounds, so agglutination should not, in principle, be so hard (it was created for Hungarian, after all). Yet for Finnish it is not enough. However, compounding plus two-level affixation rules should be quite satisfactory for Esperanto. There are people on this list with some experience in creating such dictionaries (for example, Ruud). But that's a major, hard project. For now, I did not remove the Esperanto hunspell files from the repository.

> Given all the unresolved issues at least in all the languages
> that I maintain (br, fr, eo), can we consider turning
> Hunspell off by default? I'm concerned that people
> downloading the nightly build will experience many
> spurious errors.

Well, I fixed errors for Breton, and there's a fix for French, so maybe it's not a big problem?

> As an experiment, I also commented out the Hunspell
> rule in src/java/org/languagetool/language/Breton.java
> and LT is then more than twice faster (even when comparing
> using -d HUNSPELL_RULE).

That's because loading a language invokes the HunspellRule constructor, and the constructor reads files from disk, so the cost is paid even when the rule is disabled with -d.

I'm not a fan of hunspell; I think it takes the wrong approach to creating suggestions, because the computational complexity of its algorithm is simply too high. It should use something else, such as composing a Levenshtein distance automaton with a dictionary automaton, which would be really fast (such an approach is used by the suggest methods in Lucene). Its "user-friendly" representation of affixation could be even nicer with twolc/lexc files for creating automata.

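To make the automaton idea above a bit more concrete, here is a minimal sketch in Java. It is not what Lucene or morfologik actually do: it uses a plain trie instead of a minimized dictionary automaton, and it computes edit-distance rows on the fly while walking the trie instead of building an explicit Levenshtein automaton. The effect is similar, though: only dictionary prefixes that can still end up within the distance limit are explored. The names FuzzyTrie, add and suggest are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: a trie dictionary searched under a bounded edit distance. */
final class FuzzyTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so results come out in a stable order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    /** Adds one word to the dictionary trie. */
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns all dictionary words within maxDist edits of the query. */
    List<String> suggest(String query, int maxDist) {
        // Row for the empty prefix: distance to query[0..i) is i deletions.
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        List<String> results = new ArrayList<>();
        walk(root, new StringBuilder(), firstRow, query, maxDist, results);
        return results;
    }

    private void walk(Node node, StringBuilder prefix, int[] prevRow,
                      String query, int maxDist, List<String> results) {
        // The last cell is the distance between the whole query and this prefix.
        if (node.isWord && prevRow[query.length()] <= maxDist) {
            results.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            char c = e.getKey();
            int[] row = new int[prevRow.length];
            row[0] = prevRow[0] + 1;
            int rowMin = row[0];
            for (int i = 1; i < row.length; i++) {
                int substitute = prevRow[i - 1] + (query.charAt(i - 1) == c ? 0 : 1);
                int skipQueryChar = row[i - 1] + 1;
                int skipTrieChar = prevRow[i] + 1;
                row[i] = Math.min(substitute, Math.min(skipQueryChar, skipTrieChar));
                rowMin = Math.min(rowMin, row[i]);
            }
            // Prune: if every cell exceeds the limit, no extension can match.
            if (rowMin <= maxDist) {
                prefix.append(c);
                walk(e.getValue(), prefix, row, query, maxDist, results);
                prefix.setLength(prefix.length() - 1);
            }
        }
    }
}
```

To use it, add the dictionary words once with add() and then call suggest(misspelledWord, 1) or suggest(misspelledWord, 2). The pruning step is what keeps the search cheap: whole subtrees of the dictionary are skipped as soon as their common prefix is already too far from the query.
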
Anyway, voikko (the Finnish speller) might have scripts to convert hunspell files to such automata, and we might, in the future, use a better algorithm. If my algorithm for morfologik turns out to be implemented correctly, we may also use it for some languages -- Polish is *pathetically* slow in hunspell.

Right now, however, for pragmatic reasons, I would vote for hunspell in LanguageTool 1.8.

Regards,
Marcin

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel