https://issues.apache.org/jira/browse/SOLR-1979

Nice.  How effective is the Tika language stuff?
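
For anyone skimming the thread, here is a rough sketch of the 2- and 3-letter sequence approach described in the quoted message below, with a front-loaded model preparation step and a crude confidence value. This is not what Tika or the Solr patch does. The function names, the cosine-style scoring, the "share of total similarity" confidence, and the toy training sentences are all assumptions for illustration. A real model would be trained on much larger bodies of text.

```python
from collections import Counter
import math


def ngram_profile(text, sizes=(2, 3)):
    """Count 2- and 3-letter sequences and scale the counts to unit length."""
    # Keep letters only; pad each word with spaces so word boundaries count.
    words = "".join(c.lower() if c.isalpha() else " " for c in text).split()
    counts = Counter()
    for word in words:
        padded = f" {word} "
        for n in sizes:
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {g: v / norm for g, v in counts.items()} if norm else {}


def train(samples):
    """Model preparation phase: build one n-gram profile per language."""
    return {lang: ngram_profile(text) for lang, text in samples.items()}


def classify(model, text):
    """Return (best_language, confidence); (None, 0.0) if nothing matches."""
    doc = ngram_profile(text)
    # Dot product of unit-length profiles, i.e. cosine similarity.
    scores = {
        lang: sum(w * profile.get(g, 0.0) for g, w in doc.items())
        for lang, profile in model.items()
    }
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    # Assumed confidence measure: the winner's share of all similarity.
    return best, scores[best] / total


# Toy training text, purely illustrative (hypothetical data).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs "
          "through the forest with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "läuft dann mit den anderen tieren durch den wald",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et "
          "court ensuite dans la forêt avec les autres animaux",
}

model = train(SAMPLES)
language, confidence = classify(model, "the dog runs with the fox")
```

The profiles are small dictionaries that stay in memory, and classifying a document is a single pass over its n-grams, which fits the "heavy lifting up front" requirement.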

On Fri, Jan 14, 2011 at 3:13 PM, Grant Ingersoll <[email protected]> wrote:
> And, there is a patch that is close to being committed for Solr.
>
> On Jan 14, 2011, at 11:33 AM, Ted Dunning wrote:
>
>> Tika has a classifier which I think has been updated to use competitive
>> techniques.
>>
>> See https://issues.apache.org/jira/browse/TIKA-369 for details.
>>
>> On Fri, Jan 14, 2011 at 1:04 AM, Lance Norskog <[email protected]> wrote:
>>
>>> Here's the use case: deciding the language of a mid-size document like
>>> a newspaper article or a technical report. The problem has been
>>> tackled fairly successfully by pulling 2- and 3-letter sequences from
>>> bodies of text in various languages, and comparing them with the 2- and
>>> 3-letter sequences from the document.
>>>
>>> This would be for text indexing in Lucene, so it should be
>>> memory-resident. The implementation should have a small dataset. It is
>>> better if the computation is front-loaded, like video compression: the
>>> heavy lifting happens in a model preparation phase, and then working
>>> from the model is fast. A confidence rating for the classification
>>> would be nice.
>>>
>>> Open license (Apache-compatible) code would be great, as would
>>> non-patented algorithms.
>>>
>>> Any suggestions?
>>>
>>> --
>>> Lance Norskog
>>> [email protected]
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>



-- 
Lance Norskog
[email protected]
