Hi Lance,

On Jan 14, 2011, at 1:04am, Lance Norskog wrote:

Here's the use case: deciding the language of a mid-size document like
a newspaper article or a technical report. The problem has been
tackled fairly successfully by pulling 2- and 3-letter sequences from
bodies of text in various languages, then comparing them against the
2- and 3-letter sequences found in the document.
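For what it's worth, that n-gram comparison fits in a few lines. Here's a minimal Python sketch of the classic Cavnar & Trenkle "out-of-place" ranking; the function names and the top_k cutoff are illustrative choices of mine, not from any particular library:

```python
from collections import Counter

def ngram_profile(text, ns=(2, 3), top_k=300):
    """Rank the most frequent 2- and 3-character sequences in the text."""
    counts = Counter()
    for n in ns:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    """Cavnar-Trenkle distance: sum of rank differences, with a maximum
    penalty for n-grams missing from the language profile."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(rank.get(gram, penalty) - r)
               for r, gram in enumerate(doc_profile))

def detect(text, models):
    """Pick the language whose precomputed profile is closest."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))
```

The heavy lifting (building each language's profile from training text) happens once up front, which matches the front-loaded requirement; detection is then a single profile build plus a few rank lookups against a small, memory-resident table per language.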

This would be for text indexing in Lucene, so it should be
memory-resident, and the implementation should have a small data
footprint. It is better if the computation is front-loaded, like video
compression: the heavy lifting happens in a model-preparation phase,
and then working from the model is fast. A confidence rating for the
classification would also be nice.

Open-license (Apache-compatible) code would be great, as would
non-patented algorithms.

Any suggestions?

I can't currently recommend the language detector in Tika - see https://issues.apache.org/jira/browse/TIKA-369 for details.

That issue has a link to a review of other options, though it's slightly dated.

Want to code up the LLR-based approach that Ted described in the PDF attached to the issue? :)
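For anyone else following the thread: I won't re-derive Ted's exact formulation from the PDF here, but the general shape of a likelihood-based detector is to score the document under a smoothed character n-gram model for each language and compare log-likelihoods; the gap between the best and second-best score doubles as the confidence rating Lance asked about. A hedged sketch, where the add-alpha smoothing and the function names are my own assumptions rather than anything from the attachment:

```python
import math
from collections import Counter

def train(text, n=3, alpha=0.5):
    """Front-loaded step: count character trigrams for one language.
    alpha is an add-alpha smoothing constant (an arbitrary choice here)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {"counts": counts, "total": sum(counts.values()),
            "vocab": len(counts) + 1, "n": n, "alpha": alpha}

def log_likelihood(text, m):
    """Log-probability of the document's trigrams under one model."""
    ll = 0.0
    for i in range(len(text) - m["n"] + 1):
        c = m["counts"].get(text[i:i + m["n"]], 0)
        ll += math.log((c + m["alpha"]) / (m["total"] + m["alpha"] * m["vocab"]))
    return ll

def detect_with_confidence(text, models):
    """Return (best language, log-likelihood ratio over the runner-up)."""
    scored = sorted(((log_likelihood(text, m), lang)
                     for lang, m in models.items()), reverse=True)
    (best_ll, best_lang), (second_ll, _) = scored[0], scored[1]
    return best_lang, best_ll - second_ll
```

As with the profile approach, training is the expensive part; scoring a document is one pass per candidate language over its trigrams.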

That would be a killer contribution...

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




