On 2014-05-07 16:16, Daniel Naber wrote:
> Hi,
>
> as you may know, After the Deadline is an Open Source text checker,
> quite similar to LT. It's not maintained anymore, so why not use some of
> its ideas in LT? A paper describing AtD is available at [1], it's
> well-written and provides a good overview of AtD.
>
> One interesting idea is to detect wrong words based on statistics. AtD
> has a (manually created) set of words that can be easily confused. If
> such a word is found in a text, the probability of that word in its
> context is calculated and compared to the probability of the similar
> words in the same context. If the word from the text is less probable,
> an error is assumed, and a more probable word is suggested.
>
> If this approach works, it's easier than writing rules: just add a set
> of easily confused words like "adapt, adopt" to a file, and the rest
> will happen automatically. What you need though is a huge corpus to
> calculate the probabilities. The Google n-gram corpus[2] might be used
> for that.
>
> AtD has been evaluated against a dyslexia corpus[3] with a recall of
> 27%. Running LT on the same corpus (see RealWordCorpusEvaluator), we get
> only 19% recall, and that only considers if an error was detected, not
> if the correction was correct. So there's clearly something to gain for
> LT here.
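
For illustration, the confusion-set idea quoted above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not AtD's or LT's actual code: the names `CONFUSION_SETS`, `train_bigrams`, `context_score` and `check` are hypothetical, raw bigram counts from a tiny toy corpus stand in for real probabilities from something like the Google n-gram corpus, and the context is just the immediately neighbouring words.

```python
from collections import Counter

# Hypothetical confusion sets; a real list would be curated by hand.
CONFUSION_SETS = [{"adapt", "adopt"}, {"their", "there"}]


def train_bigrams(sentences):
    """Count adjacent word pairs, with sentence boundary markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def context_score(prev_word, word, next_word, counts):
    """Crude stand-in for P(word | context): left plus right bigram counts."""
    return counts[(prev_word, word)] + counts[(word, next_word)]


def check(sentence, counts):
    """Return (position, word, suggestion) for words that lose to an alternative."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    findings = []
    for i in range(1, len(tokens) - 1):
        word = tokens[i]
        for conf_set in CONFUSION_SETS:
            if word not in conf_set:
                continue
            # Pick the member of the set that fits this context best.
            best = max(
                sorted(conf_set),
                key=lambda w: context_score(tokens[i - 1], w, tokens[i + 1], counts),
            )
            best_score = context_score(tokens[i - 1], best, tokens[i + 1], counts)
            own_score = context_score(tokens[i - 1], word, tokens[i + 1], counts)
            # Flag only if an alternative is strictly more frequent here.
            if best_score > own_score:
                findings.append((i - 1, word, best))
    return findings
```

With counts trained on the four sentences "we adopt a child", "they adopt a policy", "animals adapt to change" and "we adapt to it", calling `check("we adapt a policy", counts)` flags "adapt" and suggests "adopt", while "we adapt to it" is left alone. A real implementation would need smoothing and a threshold on the score ratio to keep false alarms down.
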
That may be true, but at the same time I found that AtD almost never
found mistakes in my English that LT did find. So I think a hybrid
approach is a nice idea (but see below).

I have also started to play with collocations, and our rule editor could
use some of the collocation statistics for detecting word confusion:

http://pelcra.pl/hask_pl/Home

The idea is similar to what I used when generating our rules
automatically. BTW, I got around 100% recall and 40% precision with my
method, which is definitely better than AtD. I simply did not generate
the word confusion sets, as I never had the time, and my code was a mix
of scripts in different languages (ultimately, I did not use Java TBL).
See my paper here:

http://arxiv.org/abs/1211.6887

Note that I never used just a confusion set; instead, I seeded a clean
corpus with mistakes. The details are in the paper.

Regards,
Marcin