Good work has been done with this approach, and with more sophisticated
tools, at the universities of Nijmegen and Tilburg by A. v d Bosch et al.

Their tools are also fully open source.

These tools were made public as 'valkuil.net' and 'fowl.net'. They require
quite a heavy server.

In case you are interested, Prof. v d Bosch is one of my LinkedIn contacts:

https://www.linkedin.com/profile/view?id=5639559&authType=OUT_OF_NETWORK&authToken=4n9Z&locale=en_US&srchid=525963551399472523579&srchindex=1&srchtotal=3524&trk=vsrp_people_res_name&trkInfo=VSRPsearchId%3A525963551399472523579%2CVSRPtargetId%3A5639559%2CVSRPcmpt%3Aprimary

Ruud


On 07-05-14 16:16, Daniel Naber wrote:
> Hi,
>
> as you may know, After the Deadline is an Open Source text checker,
> quite similar to LT. It's not maintained anymore, so why not use some of
> its ideas in LT? A paper describing AtD is available at [1]; it's
> well-written and provides a good overview of AtD.
>
> One interesting idea is to detect wrong words based on statistics. AtD
> has a (manually created) set of words that can be easily confused. If
> such a word is found in a text, the probability of that word in its
> context is calculated and compared to the probability of the similar
> words in the same context. If the word from the text is less probable,
> an error is assumed, and a more probable word is suggested.
>
> If this approach works, it's easier than writing rules: just add a set
> of easily confused words like "adapt, adopt" to a file, and the rest
> will happen automatically. What you need though is a huge corpus to
> calculate the probabilities. The Google n-gram corpus[2] might be used
> for that.
>
> AtD has been evaluated against a dyslexia corpus[3] with a recall of
> 27%. Running LT on the same corpus (see RealWordCorpusEvaluator), we get
> only 19% recall, and that only considers whether an error was detected, not
> whether the correction was correct. So there's clearly something to gain for
> LT here.
>
> I have checked in some prototypical work for a statistical homophone
> rule in LT into the new 'confusion-rule' branch.
>
> Regards
>    Daniel
>
> [1] http://aclweb.org/anthology-new/W/W10/W10-0404.pdf
> [2] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
> [3] http://www.dcs.bbk.ac.uk/~jenny/resources.html
>
>
> ------------------------------------------------------------------------------
> Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
> • 3 signs your SCM is hindering your productivity
> • Requirements for releasing software faster
> • Expert tips and advice for migrating your SCM now
> http://p.sf.net/sfu/perforce
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
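
For readers who want to see the statistical idea from the quoted message in concrete form, here is a minimal sketch. It is not AtD's or LT's actual code. The confusion sets, the toy corpus, the trigram window (one word on each side), and the add-one smoothing are all assumptions made for illustration; a real setup would use a large n-gram corpus such as the Google one mentioned above.

```python
from collections import Counter

# Hypothetical confusion sets; AtD's real list is manually created and larger.
CONFUSION_SETS = [{"adapt", "adopt"}, {"their", "there"}]


def train_trigrams(sentences):
    """Count word trigrams in a toy corpus (stand-in for a huge n-gram corpus)."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts


def context_score(counts, left, word, right):
    """Add-one smoothed count of `word` between its two neighbours."""
    return counts[(left, word, right)] + 1


def check(text, counts):
    """Return (position, word, suggestion) for each word that is less
    probable in its context than another member of its confusion set."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    findings = []
    for i in range(1, len(tokens) - 1):
        word, left, right = tokens[i], tokens[i - 1], tokens[i + 1]
        for conf_set in CONFUSION_SETS:
            if word not in conf_set:
                continue
            own = context_score(counts, left, word, right)
            # Score every alternative in the same context, keep the best one.
            scored = [(context_score(counts, left, alt, right), alt)
                      for alt in sorted(conf_set - {word})]
            best_score, best = max(scored)
            if best_score > own:
                # Position is the 0-based word index in the original text.
                findings.append((i - 1, word, best))
    return findings


corpus = [
    "we adopt a child",
    "they adopt a policy",
    "plants adapt to drought",
    "animals adapt to cold",
]
counts = train_trigrams(corpus)
print(check("we adapt a child", counts))
print(check("plants adapt to drought", counts))
```

Adding a new pair of easily confused words only means extending CONFUSION_SETS; the comparison itself stays the same, which is the appeal of the approach over hand-written rules.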


