Thanks again for the reference!

One more question, if you don't mind: how do I get the text keys for
each cluster? At the beginning we have the input file for seq2sparse,
and we know its format is

textkey1 text1
textkey2 text2
..

After running mahout kmeans and clusterdump, how do I find out from the
command line which texts belong to, for example, cluster 1? I thought
the clusterdump output would have what I want, but so far it has eluded
me.

Best,
Baoqiang
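
One possible way to answer the question above, as a rough sketch rather
than a confirmed recipe: dump the clusteredPoints directory to text (for
example with mahout seqdumper; check its --help for the input flag in
your version) and group the vector names by cluster id. The line shape
parsed below ("Key: <cluster id>: Value: <weight>: <name> = [...") is an
assumption about how that text dump looks, and it only carries the text
keys if the vectors were created as named vectors. The function name
group_keys_by_cluster is made up for this illustration.

```python
import re
from collections import defaultdict

# ASSUMPTION: each dumped point is one line shaped like
#   Key: <cluster id>: Value: <weight>: <text key> = [<index>:<value>, ...]
# Adjust the pattern if your Mahout version prints something different.
POINT_LINE = re.compile(
    r"^Key:\s*(?P<cluster>\S+):\s*Value:\s*(?P<weight>[^:]+):\s*"
    r"(?P<name>.+?)\s*=\s*\["
)


def group_keys_by_cluster(lines):
    """Map each cluster id to the list of text keys assigned to it.

    Lines that do not match the assumed shape (log messages, unnamed
    vectors, blank lines) are silently skipped.
    """
    clusters = defaultdict(list)
    for line in lines:
        match = POINT_LINE.match(line)
        if match:
            clusters[match.group("cluster")].append(match.group("name"))
    return dict(clusters)
```

To use it, pass any iterable of lines, such as an open file holding the
text dump, and look up the cluster id you are interested in in the
returned dictionary.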



On Mon, Mar 19, 2012 at 4:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> I guess you figured this out, but the cluster drivers take "-cl", which tells
> them to assign points to the calculated clusters and write them to the
> clusteredPoints directory. You then pass that directory to clusterdump.
>
> instructions here:
> https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
>
> The --help output for the mahout cluster drivers is incomplete; check the
> cwiki for differences.
>
>
>
> On 3/14/12 3:18 PM, Baoqiang Cao wrote:
>>
>> Thanks a lot. Maybe it is my tired eyes on a Wednesday afternoon, but I
>> cannot see what I am missing. My inputs are equivalent to yours:
>>
>> mahout clusterdump -s /mahout/kmeans/clusters-15-final -d
>> /mahout/sparse/dictionary.file-0 -dt sequencefile   -p /mahout/points
>>
>> The cluster files after 15 iterations are in
>> /mahout/kmeans/clusters-15-final. /mahout/points is a directory I
>> created beforehand. On screen, the output looks something like
>> "VL-1721020{n=186 c=[...", but no output files appear under that
>> directory.
>>
>> Any help, please?
>>
>> On Wed, Mar 14, 2012 at 2:13 PM, Pat Ferrel<p...@occamsmachete.com>  wrote:
>>>
>>> The -p parameter is an input. You should pass in the clusteredPoints/
>>> directory that was generated by the cluster driver you used.
>>>
>>> My use of fkmeans might be an example:
>>>
>>>   mahout fkmeans -i wikipedia-vectors/tfidf-vectors/ -c
>>>   wikipedia-fkmeans-centroids -o wikipedia-fkmeans-clusters -k 100 -m
>>>   2 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>>
>>> This will create wikipedia-fkmeans-clusters/clusteredPoints/part-m-00000,
>>> which is the file with the clustered points. I then did a clusterdump:
>>>
>>>   mahout clusterdump -s
>>>   wikipedia-fkmeans-clusters/clusters/clusters-1/part-r-00000 -p
>>>   wikipedia-fkmeans-clusters/clusteredPoints/ -d
>>>  wikipedia-fkmeans-clusters/dictionary.file-0 -dt sequencefile -dm
>>>   org.apache.mahout.common.distance.CosineDistanceMeasure
>>>
>>> This will output to the screen. Use -o to specify an output file.
>>>
>>> Good advice for any user of mahout is to read the help output very
>>> carefully. IMHO it is very easy to misunderstand the parameters, inputs,
>>> and outputs. I think I only understand about 10%. Try:
>>>
>>>   mahout fkmeans --help
>>>
>>>
>>>
>>> On 3/14/12 10:52 AM, Baoqiang Cao wrote:
>>>>
>>>> Hi,
>>>>
>>>> Very sorry for such a trivial question, but I have run out of luck. I'm
>>>> trying to see which points (through point IDs) belong to which cluster
>>>> center. Here is what I did:
>>>>
>>>> mahout clusterdump -s /mahout/kmeans/clusters-15-final -d
>>>> /mahout/sparse/dictionary.file-0 -dt sequencefile   -p /mahout/points > out
>>>>
>>>> The onscreen output is:
>>>>
>>>> 12/03/14 12:39:52 INFO common.AbstractJob: Command line arguments:
>>>> {--dictionary=/mahout/sparse/dictionary.file-0,
>>>> --dictionaryType=sequencefile,
>>>> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>>> --endPhase=2147483647, --outputFormat=TEXT,
>>>> --pointsDir=/mahout/points,
>>>> --seqFileDir=/mahout/kmeans/clusters-15-final, --startPhase=0,
>>>> --tempDir=temp}
>>>> 12/03/14 12:39:55 WARN snappy.LoadSnappy: Snappy native library is
>>>> available
>>>> 12/03/14 12:39:55 INFO util.NativeCodeLoader: Loaded the native-hadoop
>>>> library
>>>> 12/03/14 12:39:55 INFO snappy.LoadSnappy: Snappy native library loaded
>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>> 12/03/14 12:42:07 INFO clustering.ClusterDumper: Wrote 5188 clusters
>>>> 12/03/14 12:42:07 INFO driver.MahoutDriver: Program took 135276 ms
>>>> (Minutes: 2.2546)
>>>>
>>>>
>>>> There is nothing under "/mahout/points". Any idea why, and how to fix it?
>>>>
>>>> Thanks in advance.
>>>> Baoqiang
>>>>
>
