Word-level classifiers don't work as well for short strings or for small amounts of training data. They also assume word segmentation, which is a bother in many languages, especially if you don't know what language the text is in. Over-training is also an issue with small training sets, which are fairly common.
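To make the segmentation point concrete, here is a minimal sketch of a classifier that scores overlapping character trigrams instead of words, so it never needs a tokenizer. It is only an illustration under my own assumptions: the names `char_ngrams` and `CharNgramLanguageId` are hypothetical, the model is a plain multinomial with additive smoothing and a uniform prior over languages, and the two training sentences are made up. It is not the Tika or Mahout code.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams; no word segmentation is needed."""
    # Pad so that the start and end of the string contribute n-grams too.
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramLanguageId:
    """Hypothetical multinomial model over character n-grams (illustration only)."""

    def __init__(self, n=3, alpha=0.5):
        self.n = n
        self.alpha = alpha   # additive smoothing; matters a lot for tiny training sets
        self.counts = {}     # language -> Counter of n-gram counts
        self.totals = {}     # language -> total number of n-grams seen
        self.vocab = set()   # every n-gram seen in any language

    def train(self, language, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.vocab.update(grams)

    def log_likelihood(self, language, text):
        counts = self.counts[language]
        # The +1 reserves probability mass for n-grams never seen in training.
        denom = self.totals[language] + self.alpha * (len(self.vocab) + 1)
        return sum(
            math.log((counts[g] + self.alpha) / denom)
            for g in char_ngrams(text, self.n)
        )

    def classify(self, text):
        # Uniform prior over languages, so the best likelihood wins.
        return max(self.counts, key=lambda lang: self.log_likelihood(lang, text))


if __name__ == "__main__":
    model = CharNgramLanguageId()
    # Made-up training sentences, far smaller than any realistic corpus.
    model.train("en", "the quick brown fox jumps over the lazy dog and then "
                      "the dog sleeps in the warm sun while the fox runs away")
    model.train("es", "el rápido zorro marrón salta sobre el perro perezoso y "
                      "luego el perro duerme bajo el sol mientras el zorro se va")
    print(model.classify("the dog"))
    print(model.classify("el perro"))
```

Because the features are character windows, even a two-word query produces eight or nine of them, which is why this kind of model tends to hold up better on short strings than a word-level one.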
See here for an alternative to word-level classification: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.1958

On Sun, Jan 16, 2011 at 10:37 PM, Robin Anil <[email protected]> wrote:

> I would say you don't need any fancy stuff
>
> Complementary Naive Bayes classifier. Put high-frequency words (stop words)
> from various languages into bayes format. Train the model (very small model
> gets generated). The classifier is surprisingly accurate. I have used it for
> many projects and have never needed to tweak anything
>
> Robin
>
>
> On Mon, Jan 17, 2011 at 8:50 AM, Ted Dunning <[email protected]> wrote:
>
> > TIKA-369 is still open. Apparently the new code isn't committed yet.
> >
> > On Sun, Jan 16, 2011 at 7:15 PM, Lance Norskog <[email protected]> wrote:
> >
> > > https://issues.apache.org/jira/browse/SOLR-1979
> > >
> > > Nice. How effective is the Tika language stuff?
> > >
> > > On Fri, Jan 14, 2011 at 3:13 PM, Grant Ingersoll <[email protected]> wrote:
> > > > And, there is a patch that is close to being committed for Solr.
> > > >
> > > > On Jan 14, 2011, at 11:33 AM, Ted Dunning wrote:
> > > >
> > > >> Tika has a classifier which I think has been updated to use competitive
> > > >> techniques.
> > > >>
> > > >> See https://issues.apache.org/jira/browse/TIKA-369 for details.
