Blog post on extracting text features using Tika

Ken Krugler Thu, 11 Jul 2013 13:52:47 -0700

Hi all,

I just posted part 1 of a series on extracting text features for machine 
learning:


http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/

It uses a modified version of the Tika RFC822 parser to process mbox files.

I decided it was time to try to share some of what I'd learned over the years 
in processing text for classification, clustering and other related ML tasks.

Some parts are undoubtedly unclear or even incorrect, so please 
comment :)

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
