Here's the use case: identifying the language of a mid-size document such
as a newspaper article or a technical report. The problem has been
tackled fairly successfully by extracting 2- and 3-letter sequences from
bodies of text in various languages, then comparing each language's set
with the set of 2- and 3-letter sequences from the document.
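
To make the idea concrete, here is a minimal Python sketch of what I mean. The sample sentences, the function names, and the use of cosine similarity as the comparison are all my own assumptions for illustration; published approaches differ in how they rank and compare the sequences, and a real profile would need far more text per language.

```python
from collections import Counter
import math

def ngram_counts(text, sizes=(2, 3)):
    """Count 2- and 3-letter sequences in lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Tiny illustrative samples (assumption); real profiles need much more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "die katze sitzt auf der matte mit den anderen tieren",
}
PROFILES = {lang: ngram_counts(text) for lang, text in SAMPLES.items()}

def guess_language(document):
    """Return the language whose profile is most similar to the document."""
    doc = ngram_counts(document)
    return max(PROFILES, key=lambda lang: cosine(doc, PROFILES[lang]))
```

Calling guess_language() on a short English or German phrase should pick the matching profile, though with samples this small the result is only a toy demonstration.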

This would be for text indexing in Lucene, so the model should be
memory-resident and its dataset small. It is better if the computation
is front-loaded, like video compression: the heavy lifting happens in a
model preparation phase, and then working from the model is fast. A
confidence rating for the classification would be nice.
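
As a rough sketch of the front-loaded shape I have in mind (again Python, and again hypothetical: the JSON model format, the 300-entry cap, and the "margin over the runner-up" confidence are my own assumptions, not an established method or a calibrated probability):

```python
import json
from collections import Counter

def top_trigrams(text, limit=300):
    """Return the `limit` most frequent 3-letter sequences, most frequent first."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [gram for gram, _ in counts.most_common(limit)]

def build_model(corpora, limit=300):
    """Preparation phase: run once, offline. Returns a small JSON string."""
    return json.dumps({lang: top_trigrams(text, limit) for lang, text in corpora.items()})

def load_model(serialized):
    """Load phase: keep each profile in memory as a set for fast lookups."""
    return {lang: frozenset(grams) for lang, grams in json.loads(serialized).items()}

def classify(model, document):
    """Return (language, confidence) with confidence in [0, 1].

    Each language is scored by the fraction of the document's trigrams found
    in its profile; confidence is the gap between the best and second-best
    score (a heuristic, not a probability).
    """
    text = " ".join(document.lower().split())
    grams = [text[i:i + 3] for i in range(len(text) - 2)]
    if not grams or not model:
        return None, 0.0
    scores = {lang: sum(g in profile for g in grams) / len(grams)
              for lang, profile in model.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_lang, best - runner_up

# Tiny illustrative corpora (assumption); real ones would be much larger.
CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "die katze sitzt auf der matte mit den anderen tieren",
}
MODEL = load_model(build_model(CORPORA))
```

The expensive part, build_model(), runs once; classify() then costs one set lookup per trigram in the document, and the stored model is just a short list of strings per language.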

Open-license (Apache-compatible) code would be great, as would
non-patented algorithms.

Any suggestions?

-- 
Lance Norskog
[email protected]
