I have also compared Tika's performance with the Nutch language detector in version 1.0. Nutch seems to be far faster than Tika (5 to 6 times faster). But my use case is quite special (short texts, ~140 characters long) and I don't have time to investigate, so I have not reported it. Maybe you can also compare against the language detector in Nutch 1.0. I know that the Tika language detector is derived from Nutch, but it has since been reimplemented and the code has changed: 1-grams, 2-grams and 4-grams have been omitted for a faster startup time and smaller language profiles.
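To make the n-gram point concrete, here is a minimal sketch of a trigram-only profile and a comparison between two profiles. It is not the Tika or Nutch code: the function names, the `_` boundary marker and the use of cosine similarity are all assumptions chosen for illustration.

```python
from collections import Counter


def trigram_profile(text):
    """Count character 3-grams in text (hypothetical helper, not Tika's API)."""
    # Join words with a boundary marker so word starts and ends
    # contribute trigrams too (the marker choice is an assumption).
    padded = "_" + "_".join(text.lower().split()) + "_"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    # An empty profile has no direction, so treat it as "no match".
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With only 3-grams there is a single table per language to load and compare, which is where the faster startup and smaller profiles would come from. For texts of ~140 characters the input profile holds at most about 140 trigrams, so each comparison has little evidence to work with.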
Regards,
Reinhard

On 25.10.2011 18:12, Michael McCandless wrote:
> OK I posted the 3rd post about CLD, this time testing perf by
> comparing to Tika and language-detection (Google Code project):
>
>   http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>
> Net/net all three do very well (>= 97% accuracy); I had to remove 4
> languages from consideration because we don't support them.
>
> Tika seems to have a lot of trouble with Spanish (confuses w/
> Galician) and Danish (confuses with Dutch).
>
> Also, Tika's performance is substantially slower than the other two... not
> sure what's up.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, Oct 24, 2011 at 4:53 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>
>> On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler
>> <kkrugler_li...@transpac.com> wrote:
>>
>>> Sounds like a great idea - see the recent comment thread on
>>> https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
>>>
>>> And there's also https://issues.apache.org/jira/browse/TIKA-539
>>
>> Those do look related (if you swap charset in for language)!
>>
>> It's tricky to know just how much to "trust" what the server
>> (Content-Type HTTP header) and content (http-equiv meta tag) says,
>> though I do like CLD's approach: they never fully "trust" what was
>> declared but rather use the declaration as a hint to boost language
>> priors.
>>
>> And then to figure out what priors to assign for each hint they have
>> these tables trained from a large content set (10% of Base).
>>
>> If we have access to a biggish crawl we could presumably do something
>> similar, ie record how often the hint is wrong and translate that into
>> appropriate prior boosts, ie make it a hint instead of fully trusting
>> it.
>>
>> Does anyone know how ICU translates the encoding "hint" into priors
>> for each encoding?
>>
>>> Also, what will you be using to test language detection? Wikipedia pages?
>>
>> I'm using the corpus from here:
>>
>>   http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/
>>
>> It's a random subset of europarl (1000 strings from each of 21 langs).
>>
>> Wikipedia would be great too!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com