In Solr, we added support for pluggable language detectors, one of which is Tika's. See http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/langid/

The detectLanguage() method returns a list of DetectedLanguage objects, each with a normalized certainty between 0.0 and 1.0. I think it's a step in the right direction.
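To make the idea concrete, here is a minimal sketch of what such a common interface might look like. It is not the code from the Solr contrib module: only the names detectLanguage() and DetectedLanguage come from the description above, and their bodies here are guesses. LanguageDetector, StopwordLanguageDetector, and LangDetectDemo are hypothetical names, and the stopword-ratio scoring is a toy stand-in for a real detector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** One language guess. Illustrative stand-in, not Solr's actual class. */
final class DetectedLanguage {
    private final String langCode;
    private final double certainty;

    DetectedLanguage(String langCode, double certainty) {
        if (certainty < 0.0 || certainty > 1.0) {
            throw new IllegalArgumentException("certainty must be in [0.0, 1.0]");
        }
        this.langCode = langCode;
        this.certainty = certainty;
    }

    String getLangCode() { return langCode; }
    double getCertainty() { return certainty; }
}

/** Hypothetical common interface that every pluggable detector implements. */
interface LanguageDetector {
    /** Returns candidate languages, most certain first; empty if nothing matched. */
    List<DetectedLanguage> detectLanguage(String text);
}

/** Toy implementation (an assumption of this sketch): certainty is the share of tokens that are known stopwords. */
final class StopwordLanguageDetector implements LanguageDetector {
    private final Map<String, Set<String>> stopwords = new LinkedHashMap<>();

    StopwordLanguageDetector() {
        stopwords.put("en", new HashSet<>(Arrays.asList("the", "and", "is", "of")));
        stopwords.put("fr", new HashSet<>(Arrays.asList("le", "la", "et", "est")));
    }

    @Override
    public List<DetectedLanguage> detectLanguage(String text) {
        // Split on anything that is not a letter and drop empty tokens.
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        List<DetectedLanguage> result = new ArrayList<>();
        if (tokens.isEmpty()) {
            return result;
        }
        for (Map.Entry<String, Set<String>> e : stopwords.entrySet()) {
            int hits = 0;
            for (String t : tokens) {
                if (e.getValue().contains(t)) {
                    hits++;
                }
            }
            if (hits > 0) {
                // hits never exceeds tokens.size(), so the ratio is already within (0.0, 1.0].
                result.add(new DetectedLanguage(e.getKey(), (double) hits / tokens.size()));
            }
        }
        // Sort by descending certainty.
        result.sort((a, b) -> Double.compare(b.getCertainty(), a.getCertainty()));
        return result;
    }
}

class LangDetectDemo {
    public static void main(String[] args) {
        // Callers depend only on the interface, so the implementation can be swapped.
        LanguageDetector detector = new StopwordLanguageDetector();
        for (DetectedLanguage d : detector.detectLanguage("le chat et the dog")) {
            System.out.println(d.getLangCode() + " " + d.getCertainty());
        }
    }
}
```

Because the caller only sees LanguageDetector, replacing the toy class with a wrapper around Tika's detector or a third-party library would not touch the calling code. How the implementation gets selected (a configuration property, java.util.ServiceLoader, or something else) is left open here.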
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 22 March 2012, at 11:22, Julien Nioche wrote:

> If you mean integrating a better third-party detector - that's exactly my
> point. We don't develop and maintain our own parsers, why should we follow
> a different logic when it comes to language identification? There are other
> resources around, why don't we just use them? I assume that by default our
> existing detector (improved or not) could still be used, all we need is
> just a mechanism to be able to select an alternative implementation and a
> common interface. That's probably not a big deal to implement. Any thoughts
> on how to do it? Are there any things we should reuse from the way we deal
> with the parsers?
>
> Thanks for your comments
>
> Julien
>
>
> On 21 March 2012 16:55, Ken Krugler <kkrugler_li...@transpac.com> wrote:
>
>>
>> On Mar 21, 2012, at 8:51am, Julien Nioche wrote:
>>
>>> Hi guys,
>>>
>>> Just wondering about the best way to make the language detection pluggable
>>> instead of having it hard-wired as it is now. We know that the resources
>>> that are currently in Tika are both slow and inaccurate [1] and there are
>>> other libraries that we could leverage. Why not have the option to select
>>> a different implementation just like we do for parsers? Obviously we'd need
>>> a common interface for the parsers etc...
>>>
>>> What do you think?
>>
>> I'd be more in favor of using that time to integrate a better language
>> detector into Tika, so that everybody wins from the work :)
>>
>> -- Ken
>>
>>
>>> [1] http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>>>
>>> --
>>> *
>>> *Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>
>> --------------------------
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble