And, there is a patch that is close to being committed for Solr.

On Jan 14, 2011, at 11:33 AM, Ted Dunning wrote:
> Tika has a classifier which I think has been updated to use competitive
> techniques.
>
> See https://issues.apache.org/jira/browse/TIKA-369 for details.
>
> On Fri, Jan 14, 2011 at 1:04 AM, Lance Norskog <[email protected]> wrote:
>
>> Here's the use case: deciding the language of a mid-size document like
>> a newspaper article or a technical report. The problem has been
>> tackled fairly successfully by pulling 2- and 3-letter sequences from
>> bodies of text in various languages, and comparing the set of 2- and
>> 3-letter sequences from the document.
>>
>> This would be for text indexing in Lucene, so it should be
>> memory-resident. The implementation should have a small dataset. It is
>> better if the computation is front-loaded, like video compression: the
>> heavy lifting happens in a model preparation phase, and then working
>> from the model is fast. A confidence rating for the classification
>> would be nice.
>>
>> Open license (Apache-compatible) code would be great, as are
>> non-patented algorithms.
>>
>> Any suggestions?
>>
>> --
>> Lance Norskog
>> [email protected]
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search
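
For anyone following the thread, the approach Lance describes (profiles of 2- and 3-letter sequences built ahead of time, then a fast comparison against the document) can be sketched roughly as follows. This is only an illustration in Python, not the Tika classifier or the Solr patch. The function names, the toy training sentences, the use of cosine similarity, and the margin-based confidence value are all assumptions made for the example; a real model would be trained on far larger bodies of text.

```python
from collections import Counter
import math

# Toy training text, purely illustrative (an assumption, not real training data).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the children "
          "are reading this newspaper article about the weather",
    "de": "der schnelle braune fuchs springt über den faulen hund und die "
          "kinder lesen diesen zeitungsartikel über das wetter",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et les "
          "enfants lisent cet article de journal sur le temps",
}


def ngram_profile(text, sizes=(2, 3)):
    """Count character 2- and 3-grams in lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    counts = Counter()
    for n in sizes:
        for i in range(len(cleaned) - n + 1):
            counts[cleaned[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(samples):
    """Model preparation phase: build one n-gram profile per language."""
    return {lang: ngram_profile(text) for lang, text in samples.items()}


def classify(model, text):
    """Return (best language, confidence) for a non-empty model.

    Confidence here is just the gap between the best and second-best
    similarity scores, a crude stand-in chosen for this sketch.
    """
    doc = ngram_profile(text)
    scores = {lang: cosine(doc, profile) for lang, profile in model.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_lang, best_score - runner_up


if __name__ == "__main__":
    model = train(SAMPLES)  # done once, up front
    print(classify(model, "the children are reading the article"))
```

The profiles are plain in-memory counters, so the model stays small and memory-resident, and classification is a single pass over the document plus one similarity computation per language. Truncating each profile to its most frequent n-grams would shrink the model further, at some cost in accuracy.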
