Hi all, good afternoon. I ran the following three steps and got clustered output as expected.
My input data is 1124 objects (in key:value format). However, the output contains only 491 objects. What happened to the other 1124 - 491 = 633 objects? I checked the options of seq2sparse, kmeans, and clusterdump, but I could not find any option that indicates it filters out data. So I wonder: in which step did the data go missing? Was it ignored or filtered out? When and how? Is there any way I could stop the filtering?

Step 1:

    mahout seq2sparse \
      -i /group/tbdev/zhimo.bmz/mahout/data/videotags_seq \
      -o /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors \
      -ow \
      -a org.apache.lucene.analysis.WhitespaceAnalyzer \
      -wt tfidf \
      -x 90 \
      -seq \
      -n 2 \
      -nv

Step 2:

    mahout kmeans \
      -i /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors/tfidf-vectors/ \
      -c /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors/videotags-initial-clusters \
      -o /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters \
      -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
      -cd 0.5 \
      -k 150 \
      -x 20 \
      -cl \
      -ow

Step 3:

    mahout clusterdump \
      --seqFileDir /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusters-2/ \
      --pointsDir /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000 \
      --output /home/zhimo.bmz/work/videotagsAnalyze_15.txt \
      -d /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors/dictionary.file-0 \
      -dt sequencefile

Please help!
