Good work has been done with this and with more sophisticated tools at the universities of Nijmegen and Tilburg, by A. v d Bosch et al.
Their tools are also fully open source and have been made public as 'valkuil.net' and 'fowl.net'. They require quite a heavy server, though.

In case you are interested, Prof. v d Bosch is one of my LinkedIn contacts:
https://www.linkedin.com/profile/view?id=5639559&authType=OUT_OF_NETWORK&authToken=4n9Z&locale=en_US&srchid=525963551399472523579&srchindex=1&srchtotal=3524&trk=vsrp_people_res_name&trkInfo=VSRPsearchId%3A525963551399472523579%2CVSRPtargetId%3A5639559%2CVSRPcmpt%3Aprimary

Ruud

On 07-05-14 16:16, Daniel Naber wrote:
> Hi,
>
> as you may know, After the Deadline is an Open Source text checker,
> quite similar to LT. It's not maintained anymore, so why not use some of
> its ideas in LT? A paper describing AtD is available at [1], it's
> well-written and provides a good overview of AtD.
>
> One interesting idea is to detect wrong words based on statistics. AtD
> has a (manually created) set of words that can be easily confused. If
> such a word is found in a text, the probability of that word in its
> context is calculated and compared to the probability of the similar
> words in the same context. If the word from the text is less probable,
> an error is assumed, and a more probable word is suggested.
>
> If this approach works, it's easier than writing rules: just add a set
> of easily confused words like "adapt, adopt" to a file, and the rest
> will happen automatically. What you need though is a huge corpus to
> calculate the probabilities. The Google n-gram corpus[2] might be used
> for that.
>
> AtD has been evaluated against a dyslexia corpus[3] with a recall of
> 27%. Running LT on the same corpus (see RealWordCorpusEvaluator), we get
> only 19% recall, and that only considers if an error was detected, not
> if the correction was correct. So there's clearly something to gain for
> LT here.
>
> I have checked in some prototypical work for a statistical homophone
> rule in LT into the new 'confusion-rule' branch.
>
> Regards
>  Daniel
>
> [1] http://aclweb.org/anthology-new/W/W10/W10-0404.pdf
> [2] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
> [3] http://www.dcs.bbk.ac.uk/~jenny/resources.html
>
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
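
To make the confusion-set idea from Daniel's mail a bit more concrete, here is a minimal sketch in Python. It is not AtD's or LT's code. The confusion sets, the function names (`train_trigram_counts`, `check`) and the `factor` threshold are all made up for illustration, and a real setup would take its counts from a large corpus such as the Google n-grams rather than from a handful of sentences. Because all candidates are scored in the same left/right context, comparing smoothed trigram counts gives the same ranking as comparing the words' probabilities in that context.

```python
from collections import Counter

# Hypothetical confusion sets; AtD's real list is manually curated and larger.
CONFUSION_SETS = [{"adapt", "adopt"}, {"their", "there"}]


def train_trigram_counts(sentences):
    """Count word trigrams over whitespace-tokenised, lowercased sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts


def score(counts, left, word, right):
    """Add-one smoothed count of `word` between `left` and `right`."""
    return counts[(left, word, right)] + 1


def check(sentence, counts, factor=2.0):
    """Return (position, found_word, suggested_word) for suspicious words.

    A word is flagged when another member of its confusion set is at least
    `factor` times more frequent in the same context (assumed threshold).
    """
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    suggestions = []
    for i in range(1, len(tokens) - 1):
        word = tokens[i]
        for confusion_set in CONFUSION_SETS:
            if word not in confusion_set:
                continue
            alternatives = sorted(confusion_set - {word})
            if not alternatives:
                continue
            left, right = tokens[i - 1], tokens[i + 1]
            own = score(counts, left, word, right)
            best = max(alternatives, key=lambda w: score(counts, left, w, right))
            if score(counts, left, best, right) > factor * own:
                # i - 1 is the 0-based position in the original sentence.
                suggestions.append((i - 1, word, best))
    return suggestions
```

With counts trained on a few sentences in which "adopt a" is common, `check("we adapt a plan", counts)` would flag "adapt" and propose "adopt". Adding a new pair then only means adding one more set to `CONFUSION_SETS`, which is the appeal of the approach over hand-written rules.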