https://issues.apache.org/jira/browse/SOLR-1979
Nice. How effective is the Tika language stuff?

On Fri, Jan 14, 2011 at 3:13 PM, Grant Ingersoll <[email protected]> wrote:
> And, there is a patch that is close to being committed for Solr.
>
> On Jan 14, 2011, at 11:33 AM, Ted Dunning wrote:
>
>> Tika has a classifier which I think has been updated to use competitive
>> techniques.
>>
>> See https://issues.apache.org/jira/browse/TIKA-369 for details.
>>
>> On Fri, Jan 14, 2011 at 1:04 AM, Lance Norskog <[email protected]> wrote:
>>
>>> Here's the use case: deciding the language of a mid-size document like
>>> a newspaper article or a technical report. The problem has been
>>> tackled fairly successfully by pulling 2- and 3-letter sequences from
>>> bodies of text in various languages, and comparing the set of 2- and
>>> 3-letter sequences from the document.
>>>
>>> This would be for text indexing in Lucene, so it should be
>>> memory-resident. The implementation should have a small dataset. It is
>>> better if the computation is front-loaded, like video compression: the
>>> heavy lifting happens in a model preparation phase, and then working
>>> from the model is fast. A confidence rating for the classification
>>> would be nice.
>>>
>>> Open license (Apache-compatible) code would be great, as are
>>> non-patented algorithms.
>>>
>>> Any suggestions?
>>>
>>> --
>>> Lance Norskog
>>> [email protected]
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

--
Lance Norskog
[email protected]
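
For reference, the 2- and 3-letter sequence approach from the original question can be sketched roughly as below. This is only an illustration, not what Tika or the Solr patch does: the names (ngram_profile, train, classify, SAMPLES) are made up for the example, cosine similarity is an assumed choice for comparing profiles, and the two training sentences are toy data standing in for real per-language corpora. The "front-loaded" part is train(), which builds small in-memory profiles once; classify() then only does dictionary lookups and returns a rough confidence.

```python
from collections import Counter
import math


def ngram_profile(text, sizes=(2, 3)):
    """Count 2- and 3-letter sequences and scale the counts to unit length."""
    cleaned = " ".join(text.lower().split())  # lowercase, collapse whitespace
    counts = Counter(
        cleaned[i:i + n]
        for n in sizes
        for i in range(len(cleaned) - n + 1)
    )
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {gram: c / norm for gram, c in counts.items()}


def train(samples):
    """Model preparation phase: one small profile per language."""
    return {lang: ngram_profile(text) for lang, text in samples.items()}


def classify(text, model):
    """Return (best_language, confidence) for a document.

    The score is the cosine similarity between the document profile and
    each language profile. Confidence is the best score's share of all
    scores, a crude heuristic rather than a calibrated probability.
    """
    doc = ngram_profile(text)
    scores = {
        lang: sum(w * profile.get(gram, 0.0) for gram, w in doc.items())
        for lang, profile in model.items()
    }
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = scores[best] / total if total else 0.0
    return best, confidence


# Toy training text (assumption: real use needs far larger samples).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund "
          "und dann schläft der hund in der sonne",
}
MODEL = train(SAMPLES)

if __name__ == "__main__":
    print(classify("the dog and the fox", MODEL))
    print(classify("der hund und der fuchs", MODEL))
```

With real corpora the profiles would usually be truncated to the most frequent few hundred sequences per language, which keeps the memory-resident dataset small.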
