[jira] [Updated] (MAHOUT-1330) Unable to do K-means clustering on Reuters dataset

Karthik Prakhya (JIRA) Tue, 10 Sep 2013 12:26:48 -0700

     [ 
https://issues.apache.org/jira/browse/MAHOUT-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Karthik Prakhya updated MAHOUT-1330:
------------------------------------

    Attachment: reuters-seqfiles.zipx

The attached reuters-seqfiles.zipx contains the sequence files generated from 
the Reuters dataset with the seqdirectory command. These files are the input 
to my Java code, NewsKMeansClustering.java.
                
> Unable to do K-means clustering on Reuters dataset
> --------------------------------------------------
>
>                 Key: MAHOUT-1330
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1330
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>         Environment: Linux
>            Reporter: Karthik Prakhya
>             Fix For: 0.8
>
>         Attachments: df-count.txt, frequency-file.txt, hadoop-core-1.1.2.jar, 
> lucene-analyzers-common-4.3.0.jar, lucene-core-4.3.0.jar, 
> mahout-core-0.8.jar, mahout-integration-0.8.jar, mahout-math-0.8.jar, 
> MyAnalyzer.java, NewsKMeansClustering.java, NewsKMeansClustering-output.txt, 
> reuters-seqfiles.zipx, test-kmeans-clustering-reuters-java-api.sh, 
> tfidf-vectors.txt
>
>
> The attached code uses the Mahout API to do k-means clustering on the Reuters 
> dataset and generates the initial centroids using the canopy algorithm. The 
> parameters are exactly the same as the ones in the Scala example presented in 
> the following link:
> http://sujitpal.blogspot.com/2012/09/learning-mahout-clustering.html
> The code compiles without errors, but k-means cannot start because the 
> initial centroids are not generated. That, in turn, happens because the 
> TF-IDF vectors are not generated.
> Since this code compiles and is based on earlier Scala code that worked, 
> this suggests there may be a bug in the Mahout source code that needs 
> fixing. I thought I should bring it to your attention.
> I have attached the source code, the included JAR files in a zip folder, and 
> the shell script (test-kmeans-clustering-reuters-java-api.sh) that compiles 
> and runs the code. The output of the shell script is in 
> NewsKMeansClustering-output.txt. Please note that you may need to change the 
> path to the JAR files in the shell script (see the environment variable 
> JARPATH), depending on where you put the JARs. I also attached the output of 
> the clusterdump utility as .txt files for the intermediate outputs of my 
> code, such as the TF vectors and TF-IDF vectors (see tf-vectors.txt, 
> tfidf-vectors.txt, df-count.txt and frequency-file.txt).

