On 04/26/2012 03:37 AM, Lance Norskog wrote:
Cool! Yeah, Tika has one also.

Now for the annoying use case: older web sites and pre-web text in
Southeast Asia and India/Pakistan are written in phonetic US-ASCII.
(They only had that technology available.) Does anybody do
classification on that kind of text?


I never did. It's only doing bag-of-words feature generation, so
to make that work you need to tokenize your input text.
We have a learnable tokenizer (it must be trained), a character-class
tokenizer, and a whitespace tokenizer.
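
To illustrate the difference, here is a rough sketch in plain Python of
whitespace and character-class tokenization feeding a bag-of-words count.
This is not the OpenNLP code or API; the function names are made up for
the example, and the romanized sample sentence is only an assumption about
what such input might look like.

```python
import re
from collections import Counter


def whitespace_tokenize(text):
    # Split on runs of whitespace only; punctuation stays attached to words.
    return text.split()


def char_class_tokenize(text):
    # Split whenever the character class changes: runs of letters,
    # runs of digits, and each remaining non-space symbol on its own.
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]|_", text)


def bag_of_words(tokens):
    # Bag-of-words features: lowercased token counts, word order discarded.
    return Counter(token.lower() for token in tokens)


sample = "Namaste, aap kaise hain? Aap theek hain."
print(whitespace_tokenize(sample))
print(char_class_tokenize(sample))
print(bag_of_words(char_class_tokenize(sample)))
```

With the whitespace tokenizer "hain?" and "hain." end up as two different
features, while the character-class tokenizer gives "hain" twice, which is
usually what you want for bag-of-words.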

Jörn
