[ https://issues.apache.org/jira/browse/MAHOUT-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karthik Prakhya updated MAHOUT-1330:
------------------------------------
    Attachment: NewsKMeansClustering-output.txt

I uploaded the most complete version of the output of my shell script. The previous file was incomplete. Sorry about the mistake.

> Unable to do K-means clustering on Reuters dataset
> --------------------------------------------------
>
>                 Key: MAHOUT-1330
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1330
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>         Environment: Linux
>            Reporter: Karthik Prakhya
>             Fix For: 0.8
>
>         Attachments: df-count.txt, frequency-file.txt, hadoop-core-1.1.2.jar,
> lucene-analyzers-common-4.3.0.jar, lucene-core-4.3.0.jar,
> mahout-core-0.8.jar, mahout-integration-0.8.jar, mahout-math-0.8.jar,
> MyAnalyzer.java, NewsKMeansClustering.java, NewsKMeansClustering-output.txt,
> reuters-seqfiles.zipx, test-kmeans-clustering-reuters-java-api.sh,
> tfidf-vectors.txt
>
>
> The attached code uses the Mahout API to run k-means clustering on the Reuters
> dataset, generating the initial centroids with the canopy algorithm. The
> parameters are exactly the same as those in the Scala example presented at
> the following link:
> http://sujitpal.blogspot.com/2012/09/learning-mahout-clustering.html
> The code compiles without error, but the k-means algorithm cannot start
> because the initial centroids are not being generated. This in turn is
> because the TF-IDF vectors are not being generated.
> Since this code compiles and is based on earlier Scala code that worked,
> this suggests there may be a bug in the Mahout source code that needs
> fixing. I thought I should bring it to your attention.
> I have attached the source code, the required JAR files, and the shell script
> (test-kmeans-clustering-reuters-java-api.sh) that compiles and runs the
> code. The output of the shell script is in
> NewsKMeansClustering-output.txt. Please note that you may need to change the
> path to the JAR files in the shell script (see the environment variable
> JARPATH) depending on where you put the JARs. I also attached the output of
> the clusterdump utility as .txt files for the intermediate outputs of my
> code, such as the TF vectors and TF-IDF vectors (see tf-vectors.txt,
> tfidf-vectors.txt, df-count.txt and frequency-file.txt).
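
The report describes a chain of dependencies: k-means needs initial centroids from canopy, and canopy needs the TF-IDF vectors. One way to narrow down where such a chain breaks is to check each stage's output directory in order. The following is only a minimal sketch using the standard Java library, not part of the attached code. The class name and the directory names (tf-vectors, tfidf-vectors, canopy-centroids) are assumptions made for illustration, and it inspects a local directory rather than HDFS. It only detects missing or zero-byte part-* files; a file that contains a header but no records would still pass this check.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StageOutputCheck {

    /** Returns true if dir exists and holds at least one non-empty part-* file. */
    static boolean hasNonEmptyParts(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : parts) {
                if (Files.isRegularFile(p) && Files.size(p) > 0) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Returns the first stage directory without output, or null if every stage has some. */
    static String firstEmptyStage(Path root, String... stages) throws IOException {
        for (String stage : stages) {
            if (!hasNonEmptyParts(root.resolve(stage))) {
                return stage;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Root of the job's working directory; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // Hypothetical stage directory names, listed in pipeline order.
        String missing = firstEmptyStage(root, "tf-vectors", "tfidf-vectors", "canopy-centroids");
        System.out.println(missing == null
                ? "all stages produced output"
                : "first stage without output: " + missing);
    }
}
```

Run against the job's working directory, this would print the earliest stage that left no output, which is the first place to look in the log.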