Hi All,

Good afternoon.
I ran the following three steps and got the clustered data I expected.

My input data is 1124 objects (in key:value format). However, the output
contains only 491 objects.
What happened to the other 1124 - 491 = 633 objects?

I checked the options of seq2sparse, kmeans, and clusterdump, but I could not
find any option that filters out data.
So I wonder: in which step did the data go missing? Was it ignored or
filtered out? When and how?
Is there any way I can stop the filtering?

Step1:
mahout seq2sparse \
-i /group/tbdev/zhimo.bmz/mahout/data/videotags_seq \
-o /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors \
-ow \
-a org.apache.lucene.analysis.WhitespaceAnalyzer \
-wt tfidf \
-x 90 \
-seq \
-n 2 \
-nv

Step2:
mahout kmeans \
-i /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors/tfidf-vectors/ \
-c /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors/videotags-initial-clusters \
-o /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-cd 0.5 \
-k 150 \
-x 20 \
-cl \
-ow

Step3:
mahout clusterdump \
--seqFileDir /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusters-2/ \
--pointsDir /group/tbdev/zhimo.bmz/mahout/output/videotags-kmeans-clusters/clusteredPoints/part-m-00000 \
--output /home/zhimo.bmz/work/videotagsAnalyze_15.txt \
-d /group/tbdev/zhimo.bmz/mahout/vectors/videotags-vectors/dictionary.file-0 \
-dt sequencefile


Please help!
