There is also a patch for Solr that is close to being committed.

On Jan 14, 2011, at 11:33 AM, Ted Dunning wrote:

> Tika has a classifier which I think has been updated to use competitive
> techniques.
> 
> See https://issues.apache.org/jira/browse/TIKA-369 for details.
> 
> On Fri, Jan 14, 2011 at 1:04 AM, Lance Norskog <[email protected]> wrote:
> 
>> Here's the use case: deciding the language of a mid-size document like
>> a newspaper article or a technical report. The problem has been
>> tackled fairly successfully by pulling 2- and 3-letter sequences from
>> bodies of text in various languages, and comparing them with the set of
>> 2- and 3-letter sequences from the document.
>> 
>> This would be for text indexing in Lucene, so it should be
>> memory-resident. The implementation should have a small dataset. It is
>> better if the computation is front-loaded, like video compression: the
>> heavy lifting happens in a model preparation phase, and then working
>> from the model is fast. A confidence rating for the classification
>> would be nice.
>> 
>> Open license (Apache-compatible) code would be great, as would
>> non-patented algorithms.
>> 
>> Any suggestions?
>> 
>> --
>> Lance Norskog
>> [email protected]
>> 
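
For reference, the 2- and 3-letter sequence approach described in the quoted
message can be sketched roughly as below. This is only an illustration, not
the Tika or Solr code: the function names (ngrams, normalize, train, classify),
the cosine-similarity scoring, and the margin-based confidence value are all
assumptions made for this sketch. The model is built once up front, and
classifying a document afterwards is a handful of dictionary lookups.

```python
from collections import Counter
import math


def ngrams(text, sizes=(2, 3)):
    """Count the 2- and 3-letter sequences in a piece of text."""
    cleaned = " ".join(text.lower().split())  # lowercase, collapse whitespace
    counts = Counter()
    for n in sizes:
        for i in range(len(cleaned) - n + 1):
            counts[cleaned[i:i + n]] += 1
    return counts


def normalize(counts):
    """Scale counts to unit length so documents of any size are comparable."""
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {g: c / norm for g, c in counts.items()} if norm else {}


def train(samples):
    """Model preparation phase: one normalized n-gram profile per language.

    `samples` maps a language code to a body of text in that language.
    """
    return {lang: normalize(ngrams(text)) for lang, text in samples.items()}


def classify(model, document):
    """Return (best_language, confidence) for a document.

    The score is cosine similarity between the document profile and each
    language profile. The confidence is the relative margin between the
    best and second-best score (an assumption of this sketch), in [0, 1].
    """
    doc = normalize(ngrams(document))
    scores = {
        lang: sum(w * profile.get(g, 0.0) for g, w in doc.items())
        for lang, profile in model.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    confidence = (best - runner_up) / best if best > 0 else 0.0
    return best_lang, confidence
```

Usage would look like model = train({"en": english_text, "de": german_text})
followed by classify(model, article_text). A real model would be trained on
much larger bodies of text and would probably keep only the most frequent
n-grams per language to keep the in-memory dataset small.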

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search
