Word-level classifiers don't work as well on short strings or with small
amounts of training data.  They also assume word segmentation, which is a
bother in many languages, especially if you don't know what language the
text is in.  Over-training is also an issue with small training sets, which
are fairly common.

See here for an alternative:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.1958
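
To make the segmentation point concrete, here is a rough sketch of a
classifier that scores character trigrams instead of words, so the text never
has to be split into words.  This is only an illustration and not necessarily
the method in the paper linked above; the function names, the add-one
smoothing and the tiny training sentences are all assumptions made up for the
example.

```python
import math
from collections import Counter

# Toy training text per language (made up for this example; real use would
# need far more data per language).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt und dann sucht der hund den ball in der sonne",
}


def char_ngrams(text, n=3):
    """Return overlapping character n-grams; no word segmentation is needed."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count table per language label."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def classify(models, text, n=3):
    """Return the label whose smoothed n-gram model scores the text highest."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    v = len(vocab) + 1  # the extra 1 reserves probability mass for unseen n-grams
    best_lang, best_score = None, float("-inf")
    for lang, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothed log-likelihood of the text under this language.
        score = sum(
            math.log((counts[gram] + 1) / (total + v))
            for gram in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


models = train(SAMPLES)
print(classify(models, "the dog"))
print(classify(models, "der hund"))
```

With training sets this small the counts are very sparse, which is exactly
the over-training risk mentioned above, so the smoothing matters a lot.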

On Sun, Jan 16, 2011 at 10:37 PM, Robin Anil <[email protected]> wrote:

> I would say you don't need any fancy stuff.
>
> Complementary Naive Bayes classifier. Put high-frequency words (stop words)
> from various languages into bayes format. Train the model (a very small
> model gets generated). The classifier is surprisingly accurate. I have used
> it for many projects and have never needed to tweak anything.
>
> Robin
>
>
> On Mon, Jan 17, 2011 at 8:50 AM, Ted Dunning <[email protected]>
> wrote:
>
> > TIKA-369 is still open.  Apparently the new code isn't committed yet.
> >
> > On Sun, Jan 16, 2011 at 7:15 PM, Lance Norskog <[email protected]>
> > wrote:
> >
> > > https://issues.apache.org/jira/browse/SOLR-1979
> > >
> > > Nice.  How effective is the Tika language stuff?
> > >
> > > On Fri, Jan 14, 2011 at 3:13 PM, Grant Ingersoll <[email protected]>
> > > wrote:
> > > > And, there is a patch that is close to being committed for Solr.
> > > >
> > > > On Jan 14, 2011, at 11:33 AM, Ted Dunning wrote:
> > > >
> > > >> Tika has a classifier which I think has been updated to use
> > > >> competitive techniques.
> > > >>
> > > >> See https://issues.apache.org/jira/browse/TIKA-369 for details.
> >
>
