Hi all, I just posted part 1 of a series on extracting text features for machine learning…
http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/

It uses a modified version of the Tika RFC822 parser to process mbox files.

I decided it was time to try to share some of what I'd learned over the years in processing text for classification, clustering and other related ML tasks.

It undoubtedly has some things that are unclear or even incorrect, so please comment :)

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr