On Dec 13, 2013, at 8:04pm, Albretch Mueller <lbrt...@gmail.com> wrote:
> On the sections 7.2 (pg. 115) ... of "tika in action", they talk in
> very general terms about that theme and mentioned that tika currently
> uses n-grams but may change the underlying algorithm in the future
>
> Could you/committers/the authors share a little more about tika's
> language detection internals and/or your probable future
> decisions/plans?

Currently it's based on some code that came over from Nutch, with a few improvements.

It has a number of issues, e.g. see…

https://issues.apache.org/jira/browse/TIKA-369
https://issues.apache.org/jira/browse/TIKA-856
https://issues.apache.org/jira/browse/TIKA-354
https://issues.apache.org/jira/browse/TIKA-568
https://issues.apache.org/jira/browse/TIKA-496
https://issues.apache.org/jira/browse/TIKA-993
https://issues.apache.org/jira/browse/TIKA-465

There's a proposal to replace this with language-detection, a separate library that has better accuracy and much faster performance. See…

https://issues.apache.org/jira/browse/TIKA-369

And yes, that's been sitting on my plate for way too long. If somebody wants to put a release stake in the ground, it would help motivate me to at least close out that issue :)

Regards,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr