Proper way to dump kmeans clusters?

Drew Farris Thu, 25 Feb 2010 20:54:04 -0800

I'm trying to dump the clusters generated using kmeans -- I am running
on the 20-news data prepped by SequenceFileFromDirectory and
SparseVectorsFromSequenceFiles.

I'm running with the 301 patch in place; the files are on HDFS and
the necessary Hadoop env vars are set for the mahout script.

./mahout clusterdump -s mahout/20news-sv/kmeans/clusters-10 \
  -o mahout/20news-sv/kmeans-dump \
  -p mahout/20news-sv/kmeans/points \
  -d mahout/20news-sv/dictionary.file-0 -dt sequencefile

I get the error:

java.lang.NullPointerException
        at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:323)
        at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:85)
        at org.apache.mahout.utils.clustering.ClusterDumper.<init>(ClusterDumper.java:78)

It seems to work fine if I copy the files from HDFS to my local
filesystem. I suspect this is because ClusterDumper uses java.io
filesystem primitives to locate the points file instead of the
Hadoop primitives (lines 316-321).
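
To illustrate the failure mode I suspect, here is a minimal sketch. It is
not the actual ClusterDumper code; the class and method names are made up
for illustration. It only shows that java.io.File.listFiles() returns null
when the path is not a directory on the local filesystem, which is what
happens when the points directory exists only on HDFS. Iterating over that
null result without a check would throw a NullPointerException.

```java
import java.io.File;

// Hypothetical class for illustration only; not part of Mahout.
public class LocalListingSketch {

    // Returns the number of entries java.io.File sees under the given path,
    // or -1 when listFiles() returns null because the path is not a
    // directory on the local filesystem.
    public static int countEntries(String path) {
        File[] children = new File(path).listFiles();
        return children == null ? -1 : children.length;
    }

    public static void main(String[] args) {
        // Same relative path as the -p argument above. If it lives only on
        // HDFS, java.io.File cannot see it and countEntries returns -1.
        String pointsDir = args.length > 0 ? args[0] : "mahout/20news-sv/kmeans/points";
        int n = countEntries(pointsDir);
        if (n < 0) {
            System.out.println(pointsDir + " is not a local directory");
        } else {
            System.out.println(pointsDir + " has " + n + " local entries");
        }
    }
}
```

If this is indeed the cause, listing the directory through the Hadoop
FileSystem API (for example FileSystem.listStatus(Path)) instead of
java.io.File should resolve the path against HDFS.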

Also, if I run the entire job locally, SparseVectorsFromSequenceFiles
generates multiple dictionaries: dictionary.file-0 and
dictionary.file-1. How would I use these as input to the dumper?

Thanks,

Drew
