I'm trying to dump the clusters generated using kmeans -- I am running on the 20-news data prepped by SequenceFileFromDirectory and SparseVectorsFromSequenceFiles.
I'm running with the 301 patch in place; the files are on hdfs and the necessary hadoop env vars are set for the mahout script. When I run:

./mahout clusterdump -s mahout/20news-sv/kmeans/clusters-10 -o mahout/20news-sv/kmeans-dump -p mahout/20news-sv/kmeans/points -d mahout/20news-sv/dictionary.file-0 -dt sequencefile

I get the error:

java.lang.NullPointerException
    at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:323)
    at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:85)
    at org.apache.mahout.utils.clustering.ClusterDumper.<init>(ClusterDumper.java:78)

It seems to work fine if I copy the files from hdfs to my local filesystem. I suspect this is because ClusterDumper uses java.io filesystem primitives to locate the points file (lines 316-321) instead of the Hadoop primitives.

Also, if I run the entire job locally, SparseVectorsFromSequenceFiles generates multiple dictionaries: dictionary.file-0 and dictionary.file-1 -- how would I use these as input to the dumper?

Thanks,
Drew
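
P.S. Here is a minimal sketch of the failure mode I suspect. It is not the actual ClusterDumper code; the class and method names are hypothetical and it uses only java.io. The one fact it relies on is that File.listFiles() returns null when the path is not a directory on the local filesystem, which would be the case for a path that only exists on hdfs. Iterating over that null array then throws a NullPointerException.

```java
import java.io.File;

public class ListFilesNullDemo {

    // Hypothetical helper: counts directory entries using java.io only.
    // listFiles() returns null (not an empty array) when 'dir' is not a
    // directory on the local filesystem.
    static int countEntries(String dir) {
        File[] files = new File(dir).listFiles();
        int count = 0;
        for (File f : files) { // NullPointerException here if files == null
            count++;
        }
        return count;
    }

    // Returns true if countEntries fails with a NullPointerException.
    static boolean throwsNpe(String dir) {
        try {
            countEntries(dir);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Assumed not to exist locally, as for a path that lives only on hdfs.
        System.out.println(throwsNpe("no-such-local-dir-for-demo/kmeans/points"));
    }
}
```

If this is what is going on, listing the directory through the Hadoop filesystem API instead of java.io.File should avoid the problem, but I have not verified that against the ClusterDumper source.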