Hi Lance,

On Jan 14, 2011, at 1:04am, Lance Norskog wrote:
Here's the use case: deciding the language of a mid-size document like a newspaper article or a technical report. The problem has been tackled fairly successfully by pulling 2- and 3-letter sequences from bodies of text in various languages, and comparing them with the set of 2- and 3-letter sequences from the document. This would be for text indexing in Lucene, so it should be memory-resident. The implementation should have a small dataset. It is better if the computation is front-loaded, like video compression: the heavy lifting happens in a model preparation phase, and then working from the model is fast. A confidence rating for the classification would be nice. Open license (Apache-compatible) code would be great, as would non-patented algorithms. Any suggestions?
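For reference, the n-gram comparison described above could be sketched roughly as follows. This is only an illustration under assumptions, not Tika's code and not the LLR approach: it assumes cosine similarity between normalized 2- and 3-letter profiles, and uses the relative margin over the runner-up as a crude confidence rating. The function names and the toy training strings are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, sizes=(2, 3)):
    """Yield overlapping character n-grams of the given sizes."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    for n in sizes:
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def build_profile(text):
    """Model preparation: n-gram counts scaled to a unit-length vector."""
    counts = Counter(ngrams(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {g: c / norm for g, c in counts.items()} if norm else {}


def classify(text, models):
    """Return (language, confidence) by cosine similarity to each model."""
    doc = build_profile(text)
    scores = {
        lang: sum(w * model.get(g, 0.0) for g, w in doc.items())
        for lang, model in models.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Assumed confidence measure: relative margin over the runner-up, in [0, 1].
    confidence = (best - runner_up) / best if best > 0 else 0.0
    return best_lang, confidence


# Toy training text (hypothetical; real models need far more data per language).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis le chien dort au soleil",
}

# The expensive step happens once, up front; classification is then a cheap lookup.
MODELS = {lang: build_profile(text) for lang, text in TRAINING.items()}
```

The profiles are plain dictionaries, so they stay memory-resident, and truncating each one to its most frequent n-grams would keep the dataset small.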
I can't currently recommend the language detector in Tika - see https://issues.apache.org/jira/browse/TIKA-369 for details.
That issue has a link to a review of other options, though it's slightly dated.
Want to code up the LLR-based approach that Ted described in the PDF attached to the issue? :)
That would be a killer contribution...

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
