Dear Mahout users,
I am trying to use the Bayes classifier from the Mahout 0.7 distribution. My 
training set is a text file in the following format: one document per line, 
where the first entry on the line is the label (key) and the rest is the 
evidence (value = document contents). In Mahout 0.5, the trainclassifier 
command took a directory of files in this format as input, but in Mahout 0.7 
the seqdirectory command needs an input directory with one file per document. 
My training set contains millions of small documents, so I am trying to avoid 
having millions of tiny files on HDFS.
Is there an easy way to convert files in this format into sequence files that 
the seq2sparse command can then consume?
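
To make the conversion concrete, here is a rough sketch of the per-line step 
I have in mind. It is not a confirmed Mahout recipe. It assumes the label is 
separated from the document text by whitespace, and that a key shaped like 
"/label/docId" (mirroring what seqdirectory produces for a file stored at 
label/docId) is acceptable downstream. The class and method names are 
hypothetical, and the sketch uses only the Java standard library.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns "label<whitespace>text" lines into (key, value) pairs.
final class LabeledLines {

    // Splits a line at the first run of whitespace into (label, text).
    // Returns null for blank lines and for lines that have a label but no text.
    static Map.Entry<String, String> parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2 || parts[0].isEmpty()) {
            return null;
        }
        return new SimpleEntry<>(parts[0], parts[1]);
    }

    // ASSUMPTION: keys of the form "/label/docId" imitate seqdirectory output,
    // so that a later step could recover the label from the key.
    static String toKey(String label, long docId) {
        return "/" + label + "/" + docId;
    }

    // Converts all usable lines into (key, document text) records.
    // docId is a running counter over accepted lines, so keys stay unique.
    static List<Map.Entry<String, String>> toRecords(Iterable<String> lines) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        long docId = 0;
        for (String line : lines) {
            Map.Entry<String, String> parsed = parse(line);
            if (parsed == null) {
                continue; // skip blank or label-only lines
            }
            records.add(new SimpleEntry<>(toKey(parsed.getKey(), docId), parsed.getValue()));
            docId++;
        }
        return records;
    }
}
```

Each resulting pair could then be appended to a Hadoop SequenceFile.Writer 
opened with Text as both the key and the value class, which as far as I can 
tell is the record type seqdirectory writes. That would give a few large 
sequence files instead of millions of tiny files on HDFS.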

Thanks much
~Sarang
