Tika has a language classifier which, I think, has been updated to use competitive techniques.
See https://issues.apache.org/jira/browse/TIKA-369 for details.

On Fri, Jan 14, 2011 at 1:04 AM, Lance Norskog <[email protected]> wrote:
> Here's the use case: deciding the language of a mid-size document like
> a newspaper article or a technical report. The problem has been
> tackled fairly successfully by pulling 2- and 3-letter sequences from
> bodies of text in various languages, and comparing the set of 2- and
> 3-letter sequences from the document.
>
> This would be for text indexing in Lucene, so it should be
> memory-resident. The implementation should have a small dataset. It is
> better if the computation is front-loaded, like video compression: the
> heavy lifting happens in a model preparation phase, and then working
> from the model is fast. A confidence rating for the classification
> would be nice.
>
> Open license (Apache-compatible) code would be great, as are
> non-patented algorithms.
>
> Any suggestions?
>
> --
> Lance Norskog
> [email protected]
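For reference, here is a rough sketch of the 2- and 3-letter sequence approach described in the quoted message. It is not Tika's implementation. The function names (`ngram_profile`, `train`, `classify`), the cosine-style scoring, and the margin-based confidence are all my own assumptions, chosen only to show the front-loaded "build profiles once, then compare quickly" shape.

```python
from collections import Counter
import math


def ngram_profile(text, sizes=(2, 3)):
    """Count 2- and 3-letter sequences and scale the counts to unit length."""
    cleaned = " ".join(text.lower().split())  # lowercase, collapse whitespace
    counts = Counter(
        cleaned[i:i + n]
        for n in sizes
        for i in range(len(cleaned) - n + 1)
    )
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {gram: c / norm for gram, c in counts.items()}


def train(samples):
    """Model preparation phase: one profile per language (hypothetical helper)."""
    return {lang: ngram_profile(text) for lang, text in samples.items()}


def classify(text, model):
    """Return (language, confidence) using cosine similarity of profiles.

    Confidence here is an assumed heuristic: the relative margin between
    the best and second-best score, which always falls in [0, 1].
    """
    doc = ngram_profile(text)
    scores = {
        lang: sum(w * profile.get(gram, 0.0) for gram, w in doc.items())
        for lang, profile in model.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    confidence = (best - runner_up) / best if best > 0 else 0.0
    return best_lang, confidence


# Toy training text; a real model would be built from much larger corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the cat sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt den faulen hund an "
          "und dann sitzt die katze auf der matte",
}

if __name__ == "__main__":
    model = train(SAMPLES)
    print(classify("the dog and the cat are on the mat", model))
```

The profiles are plain dictionaries, so the model stays memory-resident and small, and classifying a document is one pass over its n-grams per language. Truncating each profile to its most frequent few hundred n-grams would shrink it further, though I have not measured the effect on accuracy.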
