Hi all,

I just posted part 1 of a series on extracting text features for machine 
learning…

http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/

The post uses a modified version of the Tika RFC822 parser to process mbox files.

I decided it was time to try to share some of what I'd learned over the years 
in processing text for classification, clustering and other related ML tasks.

It undoubtedly has some things that are unclear or even incorrect, so please 
comment :)

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




