Here's the use case: identifying the language of a mid-size document such as a newspaper article or a technical report. The problem has been tackled fairly successfully by extracting 2- and 3-letter sequences from bodies of text in various languages and comparing them against the 2- and 3-letter sequences found in the document.
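To make the idea concrete, here is a minimal sketch in Python of the approach as I understand it. Everything in it is my own assumption rather than an existing library: the function names (`ngram_profile`, `train`, `classify`), the choice of cosine similarity for the comparison, and the "margin over the runner-up" confidence score are all just one plausible way to do it. The per-language profiles are built once up front, and classifying a document is then a handful of dictionary lookups.

```python
import math
from collections import Counter


def ngram_profile(text, sizes=(2, 3)):
    """Return a unit-length vector of character n-gram counts (hypothetical helper)."""
    # Lowercase and collapse whitespace so spacing differences do not matter.
    text = " ".join(text.lower().split())
    counts = Counter(
        text[i:i + n]
        for n in sizes
        for i in range(len(text) - n + 1)
    )
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}


def train(samples):
    """Build one profile per language. samples maps language name -> training text."""
    return {lang: ngram_profile(text) for lang, text in samples.items()}


def classify(model, document):
    """Return (best_language, confidence) for a document.

    The score is the cosine similarity between the document profile and each
    language profile. Confidence is an assumed heuristic: the relative gap
    between the best score and the runner-up, in the range 0..1.
    """
    doc = ngram_profile(document)
    scores = {
        lang: sum(w * profile.get(gram, 0.0) for gram, w in doc.items())
        for lang, profile in model.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    confidence = (best - runner_up) / best if best > 0 else 0.0
    return best_lang, confidence


# Toy training data, far too small for real use.
model = train({
    "en": "the quick brown fox jumps over the lazy dog and then runs "
          "through the forest to find the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "läuft dann durch den wald zu den anderen tieren",
})
language, confidence = classify(model, "the dog runs through the forest")
```

A real model would need far more training text per language, and would probably keep only the few hundred most frequent sequences per language to keep the dataset small.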
This would be for text indexing in Lucene, so the model should be memory-resident and its dataset small. Ideally the computation is front-loaded, as in video compression: the heavy lifting happens in a model preparation phase, and then working from the model is fast. A confidence rating for the classification would be nice. Openly licensed (Apache-compatible) code would be great, as would non-patented algorithms. Any suggestions? -- Lance Norskog
