Run clusterdump -s canopyCentroids/clusters-0 to see the canopies. Mahout arguments are generally directories full of part-n files rather than individual files. After KMeans, you can also run clusterdump -s clusterOutput/clusters-n -p .../clusteredPoints to see the results of your clustering, where n is the last iteration number.
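While checking the canopy count, it may also be worth comparing the canopy thresholds in the quoted code (T1 = 250, T2 = 120 with EuclideanDistanceMeasure) against the scale of the vectors. The sketch below is plain Java with no Mahout dependency. It assumes that norm = 2 passed to TFIDFConverter.processTfIdf means each document vector is scaled to unit L2 length and that TF-IDF weights are non-negative; the three document vectors are made up for illustration. Under those assumptions no two vectors can be farther apart than sqrt(2), so every point would fall within T2 of the first canopy centre and canopy generation would produce a single canopy. This does not by itself explain the empty vectors ("[]") in the printed output.

```java
public class CanopyThresholdCheck {

    // Scale a vector to unit L2 length; an all-zero vector is returned unchanged.
    static double[] l2Normalize(double[] v) {
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += x * x;
        }
        if (sumSq == 0.0) {
            return v.clone();
        }
        double norm = Math.sqrt(sumSq);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    // Plain Euclidean distance between two vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sumSq = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sumSq += d * d;
        }
        return Math.sqrt(sumSq);
    }

    // Largest distance between any two vectors in the set.
    static double maxPairwiseDistance(double[][] vectors) {
        double max = 0.0;
        for (int i = 0; i < vectors.length; i++) {
            for (int j = i + 1; j < vectors.length; j++) {
                max = Math.max(max, euclidean(vectors[i], vectors[j]));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical non-negative TF-IDF weights for three documents.
        double[][] docs = {
            {3, 0, 1, 0},
            {0, 2, 0, 5},
            {1, 1, 1, 1}
        };
        double[][] normalized = new double[docs.length][];
        for (int i = 0; i < docs.length; i++) {
            normalized[i] = l2Normalize(docs[i]);
        }

        double t2 = 120.0; // T2 from the CanopyDriver.run call quoted in this thread
        double maxDist = maxPairwiseDistance(normalized);

        System.out.println("max pairwise distance = " + maxDist);
        System.out.println("all points within T2  = " + (maxDist < t2));
    }
}
```

If the maximum distance comes out far below T2, as it does here, thresholds on the order of the actual distances between vectors would be needed to get more than one canopy.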
-----Original Message-----
From: surf reta [mailto:[email protected]]
Sent: Wednesday, August 10, 2011 9:19 AM
To: [email protected]
Subject: Re: issues on Mahout clustering result using K-means

Hi Jeff,

I first converted a set of text files into sequence files with the custom program below. It uses the Mahout utility SequenceFilesFromDirectory.

public class TestSequenceFileConverter {
  public static void main(String args[]){
    String inputDir = "testdataset";
    String outputDir = "sequenceInputDir";
    try{
      SequenceFilesFromDirectory.main(new String[] {
          "--input", inputDir.toString(),
          "--output", outputDir.toString(),
          "--chunkSize", "64",
          "--charset", Charsets.UTF_8.name()});
    }
    catch(Exception e){
      System.out.println("");
    }
  }
}

Then I ran the K-means program, borrowed from NewsKMeansClustering, an example program given in Mahout-in-Action, against the generated sequence files. I just checked the generated clusters-0 directory; it has a file called part-r-00000. How can I read this file and get the useful information from it?

Thanks.

The NewsKMeansClustering is listed here for your reference:

public class NewsKMeansClustering {

  public static void main(String args[]) throws Exception {

    int minSupport = 5;
    int minDf = 5;
    int maxDFPercent = 95;
    int maxNGramSize = 2;
    int minLLRValue = 50;
    int reduceTasks = 1;
    int chunkSize = 200;
    int norm = 2;
    boolean sequentialAccessOutput = true;

    // String inputDir = "inputDir";
    String inputDir = "sequenceInputDir";

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    /*
     * SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, new Path(inputDir, "documents.seq"),
     * Text.class, Text.class); for (Document d : Database) { writer.append(new Text(d.getID()), new
     * Text(d.contents())); } writer.close();
     */

    String outputDir = "newsClusters";
    HadoopUtil.delete(conf, new Path(outputDir));
    Path tokenizedPath = new Path(outputDir,
        DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
    MyAnalyzer analyzer = new MyAnalyzer();
    DocumentProcessor.tokenizeDocuments(new Path(inputDir), analyzer.getClass()
        .asSubclass(Analyzer.class), tokenizedPath, conf);

    DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
        new Path(outputDir), conf, minSupport, maxNGramSize, minLLRValue, 2, true, reduceTasks,
        chunkSize, sequentialAccessOutput, false);
    TFIDFConverter.processTfIdf(
        new Path(outputDir , DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir), conf, chunkSize, minDf,
        maxDFPercent, norm, true, sequentialAccessOutput, false, reduceTasks);

    Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
    Path canopyCentroids = new Path(outputDir , "canopy-centroids");
    Path clusterOutput = new Path(outputDir , "clusters");

    CanopyDriver.run(vectorsFolder, canopyCentroids,
        new EuclideanDistanceMeasure(), 250, 120, false, false);
    KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0"),
        clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs,
        new Path(clusterOutput+"/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);
        // new Path(clusterOutput+"/clusteredPoints"+"/part-m-00000"),conf);

    IntWritable key = new IntWritable();
    WeightedVectorWritable value = new WeightedVectorWritable();
    while (reader.next(key, value)) {
      System.out.println(key.toString() + " belongs to cluster "
          + value.toString());
    }
    reader.close();
  }
}

On Wed, Aug 10, 2011 at 11:40 AM, Jeff Eastman <[email protected]> wrote:

> What do your input vectors look like?
> How many canopies did you get in clusters-0?
>
> -----Original Message-----
> From: eric skinner [mailto:[email protected]]
> Sent: Wednesday, August 10, 2011 8:33 AM
> To: [email protected]
> Subject: issues on Mahout clustering result using K-means
>
> I ran the K-means clustering algorithm against a set of sequence files.
> However, the generated result looks like this:
>
> 0 belongs to cluster 1.0: []
> 0 belongs to cluster 1.0: []
> 0 belongs to cluster 1.0: []
> 0 belongs to cluster 1.0: []
> 0 belongs to cluster 1.0: []
> 0 belongs to cluster 1.0: []
>
> Would you like to let me know why I get this type of result? Is that because
> of any specific parameter setting requirement or anything else?
>
> The program I use is borrowed from NewsKMeansClustering.java, an example
> given in chapter 9 of Mahout-in-Action.
>
> The core clustering code in this program is
>
> CanopyDriver.run(vectorsFolder, canopyCentroids,
>     new EuclideanDistanceMeasure(), 250, 120, false, false);
>
> KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0"),
>     clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, false);
>
