Re: mahout guide or tutorial or how to for test and run kmean on hadoop

Jeff Eastman Sat, 28 Aug 2010 16:22:24 -0700

It has been reported recently that some of our jobs fail quietly and/or in unexpected ways when inputs are not correct. If you can duplicate this behavior, please submit a JIRA and we will look into it. The 0.4 release is coming up, maybe next month, so please help us improve our user experience. To get a batch of correct files to inspect, try running examples/bin/build-reuters.sh.


On 8/28/10 9:33 AM, Valerio wrote:

  Jeff Eastman <jdog at windwardsolutions.com> writes:
   Try naming the input *directory* not the particular input file.

I tried, but the result was the same.
However, I found what looks like a bug in Mahout.

When I try to convert a text file into a sequence file with the command line:

bin/mahout seqdirectory --input <PATH> --output <PATH> --charset UTF-8

and then into sparse vectors with:

bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles/ --norm 2 --weight TF --output <PATH>/content/reuters/seqfiles-TF/ --minDF 5 --maxDFPercent 90

If the original file isn't valid, or the path is incorrect, Mahout creates a
dummy chunk-0 that is of no use to seq2sparse, and the second command then
creates further useless output because the files are empty. You can see this
because the file part-00000 in the vector folder is only around 90 bytes.
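
A rough way to spot such near-empty outputs is to list any part files below a
size threshold. This is only a sketch: the function name find_tiny_parts and
the 200-byte default are made up for illustration, and it assumes the output
sits on the local filesystem (for HDFS you would read the sizes from
`hadoop fs -ls` instead).

```shell
# Hypothetical helper: print part-* files under a directory that are
# smaller than a byte threshold (default 200, an arbitrary choice).
# Assumes a local filesystem path, not HDFS.
find_tiny_parts() {
  dir="$1"
  min_bytes="${2:-200}"
  # "-size -Nc" matches files strictly smaller than N bytes.
  find "$dir" -type f -name 'part-*' -size "-${min_bytes}c"
}
```

For example, `find_tiny_parts <PATH>/content/reuters/seqfiles-TF/` prints
nothing when every part file is at least 200 bytes.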

I think this came up in an old answer of yours to a problem similar to mine ^^

Do you have a link to a site where I can download a correct text file to use
as a dataset? Then I can try converting it into sequence files and then into
vectors to see what Mahout k-means produces.

Thanks in advance!
