On Dec 13, 2013, at 8:04pm, Albretch Mueller <lbrt...@gmail.com> wrote:

> In section 7.2 (pg. 115) ... of "Tika in Action", they talk in
> very general terms about that theme and mention that Tika currently
> uses n-grams but may change the underlying algorithm in the future.
> 
> Could you/committers/the authors share a little more about Tika's
> language detection internals and/or your probable future
> decisions/plans?

Currently it's based on some code that came over from Nutch, with a few 
improvements.
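
For anyone unfamiliar with the general n-gram approach, here is a minimal
sketch in Python. It is not Tika's or Nutch's actual code: the toy training
sentences, the trigram size, the cosine similarity measure, and the names
ngram_profile, cosine, and detect are all assumptions made for illustration.
Real profiles are built from far larger corpora.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams in text."""
    # Lowercase and mark word boundaries with "_" so that word starts
    # and ends contribute their own n-grams.
    cleaned = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(w * q.get(g, 0.0) for g, w in p.items())
    norm = (math.sqrt(sum(w * w for w in p.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0


# Toy training text (an assumption for this sketch, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and then the dog runs away from the fox",
    "de": "der schnelle braune fuchs springt auf den faulen hund "
          "und dann rennt der hund vor dem fuchs weg",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux "
          "et puis le chien s'enfuit loin du renard",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def detect(text):
    """Return the language code whose profile is most similar to text."""
    profile = ngram_profile(text)
    scores = {lang: cosine(profile, ref) for lang, ref in PROFILES.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for sample in ("the dog and the fox",
                   "der hund und der fuchs",
                   "le chien et le renard"):
        print(sample, "->", detect(sample))
```

Short inputs and closely related languages are where a scheme like this
tends to struggle.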

It has a number of issues, e.g. see…

https://issues.apache.org/jira/browse/TIKA-369

https://issues.apache.org/jira/browse/TIKA-856

https://issues.apache.org/jira/browse/TIKA-354

https://issues.apache.org/jira/browse/TIKA-568

https://issues.apache.org/jira/browse/TIKA-496

https://issues.apache.org/jira/browse/TIKA-993

https://issues.apache.org/jira/browse/TIKA-465

There's a proposal to replace this with language-detection, a separate library 
that is more accurate and much faster. See…

https://issues.apache.org/jira/browse/TIKA-369

And yes, that's been sitting on my plate for way too long. If somebody wants to 
put a release stake in the ground, it would help motivate me to at least close 
out that issue :)

Regards,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




