Thanks again for the reference! One more question, if you don't mind: how do I get the text keys for each cluster? At the beginning we have the input file for seq2sparse, and we know its format is

textkey1 text1 textkey2 text2 ..

After running mahout kmeans and clusterdump, how do I find out from the command line which texts belong to, for example, cluster 1? I thought the output from clusterdump had what I want, but so far it is still elusive.

Best,
Baoqiang

On Mon, Mar 19, 2012 at 4:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> I guess you figured this out, but the cluster drivers take "-cl", which tells
> them to put points into the calculated clusters and output to the
> clusteredPoints directory. Then you pass that in to clusterdump.
>
> instructions here:
> https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
>
> the --help for the mahout cluster drivers is incomplete, check cwiki for
> differences
>
>
>
> On 3/14/12 3:18 PM, Baoqiang Cao wrote:
>>
>> Thanks a lot. But I don't know if I'm missing something in front of my
>> teary eyes because of Wednesday afternoon or what. I have inputs
>> equivalent to yours:
>>
>> mahout clusterdump -s /mahout/kmeans/clusters-15-final -d
>> /mahout/sparse/dictionary.file-0 -dt sequencefile -p /mahout/points
>>
>> The cluster files after 15 iterations are in
>> /mahout/kmeans/clusters-15-final. /mahout/points is a directory I
>> created beforehand. On screen, the output is something like
>> "VL-1721020{n=186 c=[...". There just aren't any output files under that
>> directory.
>>
>> Any help, please.
>>
>> On Wed, Mar 14, 2012 at 2:13 PM, Pat Ferrel<p...@occamsmachete.com> wrote:
>>>
>>> The -p parameter is an input. You should pass in the clusteredPoints/
>>> directory that was generated by the cluster driver you used.
>>>
>>> My use of fkmeans might be an example:
>>>
>>> mahout fkmeans -i wikipedia-vectors/tfidf-vectors/ -c
>>> wikipedia-fkmeans-centroids -o wikipedia-fkmeans-clusters -k 100 -m
>>> 2 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>>
>>> This will create wikipedia-fkmeans-clusters/clusteredPoints/part-m-00000
>>> which is the file with the clustered points.
>>> I then did a clusterdump
>>>
>>> mahout clusterdump -s
>>> wikipedia-fkmeans-clusters/clusters/clusters-1/part-r-00000 -p
>>> wikipedia-fkmeans-clusters/clusteredPoints/ -d
>>> wikipedia-fkmeans-clusters/dictionary.file-0 -dt sequencefile -dm
>>> org.apache.mahout.common.distance.CosineDistanceMeasure
>>>
>>> This will output to the screen. Use -o to specify an output file.
>>>
>>> Good advice for any user of mahout is to read the output of the help very
>>> carefully. IMHO it is very easy to misunderstand the parameters, inputs,
>>> and outputs. I think I only understand about 10%. Try:
>>>
>>> mahout fkmeans --help
>>>
>>>
>>>
>>> On 3/14/12 10:52 AM, Baoqiang Cao wrote:
>>>>
>>>> Hi,
>>>>
>>>> Very sorry for such a trivial question, but I've run out of luck. I'm
>>>> trying to see which points (through point ids) belong to which cluster
>>>> center. Here is what I did:
>>>>
>>>> mahout clusterdump -s /mahout/kmeans/clusters-15-final -d
>>>> /mahout/sparse/dictionary.file-0 -dt sequencefile -p /mahout/points
>>>>>
>>>>> out
>>>>
>>>> The onscreen output is:
>>>>
>>>> 12/03/14 12:39:52 INFO common.AbstractJob: Command line arguments:
>>>> {--dictionary=/mahout/sparse/dictionary.file-0,
>>>> --dictionaryType=sequencefile,
>>>> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>>> --endPhase=2147483647, --outputFormat=TEXT,
>>>> --pointsDir=/mahout/points,
>>>> --seqFileDir=/mahout/kmeans/clusters-15-final, --startPhase=0,
>>>> --tempDir=temp}
>>>> 12/03/14 12:39:55 WARN snappy.LoadSnappy: Snappy native library is available
>>>> 12/03/14 12:39:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>> 12/03/14 12:39:55 INFO snappy.LoadSnappy: Snappy native library loaded
>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>> 12/03/14 12:42:07 INFO clustering.ClusterDumper: Wrote 5188 clusters
>>>> 12/03/14 12:42:07 INFO driver.MahoutDriver: Program took 135276 ms
>>>> (Minutes: 2.2546)
>>>>
>>>>
>>>> There is nothing under "/mahout/points". Any help on why and how?
>>>>
>>>> Thanks in advance.
>>>> Baoqiang
>>>>
>
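
A rough sketch for the question at the top of the thread (which text keys end up in which cluster), in case it is useful. It assumes the clustering was run with -cl and that clusterdump was given the clusteredPoints directory through -p, as described above, with the text output saved to a file. The line shapes it looks for are assumptions and may differ between Mahout versions: a cluster header that starts like "VL-1721020{n=186 c=[...", followed by point lines of the form "<weight>: <textkey> = [...". The function name keys_by_cluster is made up for illustration and is not part of Mahout.

```python
import re
from collections import defaultdict

# ASSUMPTION: a cluster header starts like "VL-1721020{n=186 c=[..."
# (optionally preceded by ":"); "CL-" ids are accepted as well.
CLUSTER_RE = re.compile(r"^\s*:?((?:VL|CL)-\d+)\{")

# ASSUMPTION: each clustered point is printed as "<weight>: <key> = [...]",
# for example "\t1.0: textkey1 = [term:0.5, ...]".
POINT_RE = re.compile(r"^\s*[-+0-9.eE]+\s*:\s*(.+?)\s*=\s*\[")


def keys_by_cluster(lines):
    """Group point keys under the cluster header that precedes them."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        header = CLUSTER_RE.match(line)
        if header:
            current = header.group(1)
            clusters[current]  # register the cluster even if it has no points
            continue
        point = POINT_RE.match(line)
        if point and current is not None:
            clusters[current].append(point.group(1))
    return dict(clusters)
```

To try it, write the clusterdump output to a file with -o (the file name dump.txt here is only an example) and call keys_by_cluster(open("dump.txt")). The result maps each cluster id to the list of keys found under it. If the lists come back empty, the point lines in your Mahout version probably look different from what is assumed here, and POINT_RE will need adjusting.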