Re: Document Classification

Jörn Kottmann Thu, 26 Apr 2012 00:46:03 -0700

On 04/26/2012 03:37 AM, Lance Norskog wrote:

Cool! Yeah, Tika has one also.


Now for the annoying use case: older web sites and pre-web text in
Southeast Asia and India/Pakistan are written in phonetic US-ASCII.
(They only had that technology available.) Does anybody do
classification on that kind of text?


I never did. It's only doing bag-of-words feature generation, so
to make that work you need to tokenize your input text.
We have a learnable tokenizer (which must be trained), a
character-class tokenizer, and a whitespace tokenizer.
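
To illustrate how the tokenizer choice changes bag-of-words features on
romanized text, here is a minimal sketch in plain Python. It does not use
the OpenNLP API; the function names (whitespace_tokenize,
char_class_tokenize, bag_of_words) are hypothetical, and the
character-class rule is only an approximation of the idea (split where
letters, digits, and other symbols meet).

```python
import re
from collections import Counter


def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def char_class_tokenize(text):
    """Split where the character class changes (assumed rule, for illustration).

    Runs of letters and runs of digits become one token each;
    every other non-space character becomes its own token.
    """
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]|_", text)


def bag_of_words(tokens):
    """Count lower-cased tokens; word order is discarded."""
    return Counter(token.lower() for token in tokens)


if __name__ == "__main__":
    sample = "Namaste dost, namaste!"
    print(bag_of_words(whitespace_tokenize(sample)))
    print(bag_of_words(char_class_tokenize(sample)))
```

With whitespace tokenization, "namaste" and "namaste!" end up as
different features; with the character-class split they collapse into
one feature counted twice, which is usually what a bag-of-words
classifier wants.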

Jörn
