Re: homophone detection

Marcin Miłkowski Wed, 07 May 2014 09:48:38 -0700

On 2014-05-07 16:16, Daniel Naber wrote:
> Hi,
>
> as you may know, After the Deadline is an Open Source text checker,
> quite similar to LT. It's not maintained anymore, so why not use some of
> its ideas in LT? A paper describing AtD is available at [1]; it's
> well written and provides a good overview of AtD.
>
> One interesting idea is to detect wrong words based on statistics. AtD
> has a (manually created) set of words that can be easily confused. If
> such a word is found in a text, the probability of that word in its
> context is calculated and compared to the probability of the similar
> words in the same context. If the word from the text is less probable,
> an error is assumed, and a more probable word is suggested.
>
> If this approach works, it's easier than writing rules: just add a set
> of easily confused words like "adapt, adopt" to a file, and the rest
> will happen automatically. What you need though is a huge corpus to
> calculate the probabilities. The Google n-gram corpus[2] might be used
> for that.
>
> AtD has been evaluated against a dyslexia corpus[3] with a recall of
> 27%. Running LT on the same corpus (see RealWordCorpusEvaluator), we get
> only 19% recall, and that only considers if an error was detected, not
> if the correction was correct. So there's clearly something to gain for
> LT here.


That may be true, but in my experience AtD almost never found the
mistakes in my English that LT did catch. So I think a hybrid
approach is a nice idea (but see below).
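
To make the quoted confusion-set idea concrete, here is a minimal sketch in
Python. It is not AtD's or LT's code: the confusion sets, the function names,
and the use of raw trigram counts from a tiny in-memory corpus (standing in
for something like the Google n-gram data) are all assumptions made for
illustration. Tokenization is a plain whitespace split.

```python
from collections import Counter

# Hypothetical, manually created confusion sets (illustration only).
CONFUSION_SETS = [{"adapt", "adopt"}, {"their", "there"}]


def trigram_counts(corpus_sentences):
    """Count (left, word, right) trigrams in a list of clean sentences."""
    counts = Counter()
    for sentence in corpus_sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            counts[(tokens[i - 1], tokens[i], tokens[i + 1])] += 1
    return counts


def context_score(counts, left, candidate, right):
    """Raw trigram count, used here as a stand-in for a probability."""
    return counts[(left, candidate, right)]


def check(sentence, counts, confusion_sets=CONFUSION_SETS):
    """Return (token_index, found_word, suggested_word) for suspicious words."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    suggestions = []
    for i in range(1, len(tokens) - 1):
        word, left, right = tokens[i], tokens[i - 1], tokens[i + 1]
        for cset in confusion_sets:
            if word not in cset:
                continue
            # Pick the member of the set that is most frequent in this context.
            best = max(sorted(cset),
                       key=lambda c: context_score(counts, left, c, right))
            # Flag the word only if an alternative is strictly more frequent.
            if context_score(counts, left, best, right) > \
                    context_score(counts, left, word, right):
                suggestions.append((i - 1, word, best))
    return suggestions


corpus = ["we adopt a child", "they adopt a policy", "plants adapt to heat"]
counts = trigram_counts(corpus)
print(check("we adapt a child", counts))
```

A real setup would need smoothing and a threshold on how much more probable
the alternative has to be; otherwise sparse counts produce false alarms.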

I also started to play with collocations, and our rule editor could use 
some of the collocation statistics for detecting word confusion:

http://pelcra.pl/hask_pl/Home

The idea is similar to what I used to generate our rules
automatically. BTW, I got around 100% recall and 40% precision with
my method, which is definitely better than AtD. I simply did not
generate the word confusion sets, as I never had the time, and my code
was a mix of scripts in different languages (ultimately, I did not
use Java TBL). See my paper here:

http://arxiv.org/abs/1211.6887

Note that I never used just a confusion set, but I seeded a clean corpus 
with mistakes. The details are in the paper.
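
For readers who want a rough picture of what seeding a clean corpus with
mistakes can look like, here is a small Python sketch. It is not the
procedure from the paper: the substitution table `ERROR_MAP`, the `rate`
parameter, and the function name are hypothetical and only show the general
shape of the idea, namely corrupting correct text in a controlled way while
remembering where the original words were.

```python
import random

# Hypothetical table: correct word -> erroneous forms (illustration only).
ERROR_MAP = {"their": ["there"], "adopt": ["adapt"]}


def seed_errors(sentences, error_map=ERROR_MAP, rate=0.5, seed=0):
    """Inject errors into clean sentences.

    Returns a list of (corrupted_sentence, labels) pairs, where labels is a
    list of (token_index, original_word) for every replaced token.
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    seeded = []
    for sentence in sentences:
        tokens = sentence.split()
        labels = []
        for i, tok in enumerate(tokens):
            wrong_forms = error_map.get(tok.lower())
            if wrong_forms and rng.random() < rate:
                labels.append((i, tok))          # remember the correct word
                tokens[i] = rng.choice(wrong_forms)
        seeded.append((" ".join(tokens), labels))
    return seeded


print(seed_errors(["we adopt their plan"], rate=1.0))
```

The corrupted sentences together with the labels give paired wrong/right
data that a rule learner can be trained and evaluated on.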

Regards,
Marcin

