How to convert input key:value text format to mahout digestible format

Sarang Deshpande Tue, 25 Sep 2012 11:56:54 -0700

Dear Mahout users,
I am trying to use the Bayes classifier from the Mahout 0.7 distribution. My 
training set is a text file in the following format: one document per line, 
where the first entry on the line is the label (key) and the rest is the 
evidence (value = document contents). In Mahout 0.5, the trainclassifier 
command accepted a directory of files in this format as input, but in Mahout 
0.7 the seqdirectory command needs an input directory with one file per 
document. My training set contains millions of small documents, so I am trying 
to avoid having millions of tiny files on HDFS.
Is there an easy way to convert these files into sequence files that the 
seq2sparse command can then consume?


Thanks much
~Sarang
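
For illustration only, here is a minimal sketch of the conversion being asked 
about, written in plain Java so it runs without a cluster. It splits each line 
into a label and the document text and hands the resulting key/value pair to a 
sink. Everything named here is hypothetical (LabeledLineConverter, toKeyValue, 
convert), and two points are assumptions rather than facts from this post: 
that the label is separated from the text by whitespace, and that a key of the 
form /label/docId is what the downstream Mahout 0.7 tools expect. In a real 
job the sink would presumably wrap Hadoop's SequenceFile.Writer.append with 
Text keys and values, so that all documents end up in a few large sequence 
files rather than millions of small ones.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

public class LabeledLineConverter {

    /**
     * Splits one "label<whitespace>document text" line into {key, value}.
     * Assumption: the key layout "/label/docId" is what the downstream
     * tools read the label from; change it if your setup differs.
     */
    public static String[] toKeyValue(String line, long docId) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts[0].isEmpty()) {
            throw new IllegalArgumentException("line has no label");
        }
        String label = parts[0];
        // A line with a label but no text yields an empty document.
        String text = parts.length > 1 ? parts[1] : "";
        return new String[] {"/" + label + "/" + docId, text};
    }

    /**
     * Feeds every non-blank line to the sink as a (key, value) pair and
     * returns the number of documents emitted. The sink is the place where
     * a sequence-file writer would be plugged in (assumption, not shown).
     */
    public static long convert(Iterable<String> lines,
                               BiConsumer<String, String> sink) {
        long docId = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines instead of failing on them
            }
            String[] kv = toKeyValue(line, docId);
            sink.accept(kv[0], kv[1]);
            docId++;
        }
        return docId;
    }

    public static void main(String[] args) {
        // Made-up sample input: one document per line, label first.
        List<String> sample = Arrays.asList(
                "sports the team won again",
                "politics the vote was delayed");
        long count = convert(sample,
                (key, value) -> System.out.println(key + "\t" + value));
        System.out.println(count + " documents converted");
    }
}
```

Keeping the line parsing separate from the writer makes it easy to check the 
key layout on a small sample before running it over the full training set.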
