[jira] [Created] (MAHOUT-1330) Unable to do K-means clustering on Reuters dataset

Karthik Prakhya (JIRA) Tue, 10 Sep 2013 12:23:48 -0700

Karthik Prakhya created MAHOUT-1330:
---------------------------------------


             Summary: Unable to do K-means clustering on Reuters dataset
                 Key: MAHOUT-1330
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1330
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.8
         Environment: Linux
            Reporter: Karthik Prakhya
             Fix For: 0.8


The attached code uses the Mahout API to do k-means clustering on the Reuters 
dataset and generates the initial centroids using the canopy algorithm. The 
parameters are exactly the same as the ones in the Scala example presented in 
the following link:

http://sujitpal.blogspot.com/2012/09/learning-mahout-clustering.html

The code compiles without errors, but the k-means algorithm cannot start 
because the initial centroids are not generated. This in turn happens because 
the TF-IDF vectors are not generated.
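
One way to narrow down which stage first stops producing output is to check 
each intermediate directory in pipeline order. The following is only a minimal 
sketch under stated assumptions, not part of the attached code: it assumes the 
job writes to the local filesystem (not HDFS), and the stage directory names 
are guesses based on the dump file names mentioned below ("canopy-centroids" is 
purely hypothetical). Note that a SequenceFile with zero records still has a 
header, so a non-empty file does not prove that vectors were written; an empty 
or missing directory, however, does show where the chain breaks.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

/** Reports the first pipeline stage whose output directory is missing or empty. */
class StageOutputCheck {

    // Assumed stage directory names, in pipeline order. These are illustrative
    // guesses; adjust them to whatever the actual code writes.
    static final List<String> STAGES = Arrays.asList(
            "tf-vectors", "df-count", "tfidf-vectors", "canopy-centroids");

    /**
     * True if dir exists and contains at least one non-empty regular file.
     * Files whose names start with '_' or '.' (for example _SUCCESS markers
     * and .crc checksum files) are ignored.
     */
    static boolean hasData(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString();
                        return !name.startsWith("_") && !name.startsWith(".");
                    })
                    .anyMatch(p -> p.toFile().length() > 0);
        }
    }

    /** Returns the first stage under root without data, or empty if all have data. */
    static Optional<String> firstEmptyStage(Path root, List<String> stages)
            throws IOException {
        for (String stage : stages) {
            if (!hasData(root.resolve(stage))) {
                return Optional.of(stage);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // The output root is taken from the first argument, defaulting to ".".
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Optional<String> broken = firstEmptyStage(root, STAGES);
        System.out.println(broken
                .map(s -> "First stage without output: " + s)
                .orElse("All stages produced output"));
    }
}
```

Running it with the job's output directory as the only argument prints the 
name of the earliest stage that has no data, which is the stage to inspect 
first.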

Since this code compiles and is based on earlier Scala code that worked, this 
suggests that there may be a bug in the Mahout source code that needs fixing. I 
thought I should bring it to your attention.

I have attached the source code, the included JAR files in a zip archive, and 
the shell script (test-kmeans-clustering-reuters-java-api.sh) that compiles and 
runs the code. The output of the shell script is in 
NewsKMeansClustering-output.txt. Please note that you may need to change the 
path to the JAR files in the shell script (see the environment variable 
JARPATH) based on where you put the JARs. I have also attached the output of 
the clusterdump utility as .txt files for the intermediate outputs of my code, 
such as the TF vectors and TF-IDF vectors (see tf-vectors.txt, 
tfidf-vectors.txt, df-count.txt and frequency-file.txt).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
