The unit tests don't care which format is used, as long as it is consistent within each job, and the compiler helps enforce that. kMeans will run and its tests will pass; so will Canopy. Only when somebody runs the kMeans example, which chains the two jobs, does it hit the file format difference. Are all the examples run by the install? I'd be surprised.

Jeff


Palleti, Pallavi wrote:
Yeah. But I am wondering how the test cases succeeded. I ran them using the 
"mvn clean install" command.

Thanks
Pallavi

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Thursday, March 19, 2009 9:56 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

The Synthetic Control kMeans job calls the Canopy job to build its initial 
clusters as is commonly done. If the kMeans record format was changed and the 
Canopy not changed accordingly, then everything would still compile but there 
would be a mismatch when the kMeans mapper tried to read in the clusters.

Jeff


Richard Tomsett (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]

Richard Tomsett commented on MAHOUT-99:
---------------------------------------

Yup, just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get 
the same error on the Synthetic Control example. It seems to be because the new 
KMeans code uses a KeyValueLineRecordReader object to read the input cluster 
centres from the canopy clustering output, but the canopy clustering job 
outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the 
cluster centres). Think that's the problem at least, I'll have a quick play.
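
To illustrate the kind of mismatch described above, here is a minimal sketch in plain Java. It does not use Hadoop: writeBinaryRecord, readBinaryRecord and readTextRecord are hypothetical stand-ins for a binary key/value writer, its matching reader, and a line-oriented "key<TAB>value" reader. How the real KeyValueLineRecordReader reacts to SequenceFile bytes is not shown in this thread, so the exception below is only an assumption made for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class FormatMismatchSketch {

    // Hypothetical producer: writes one binary key/value record
    // (a simplified stand-in for a job that emits a binary file format).
    static byte[] writeBinaryRecord(String key, String value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(key);
            out.writeUTF(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Matching consumer: reads the record back in the same binary format.
    static String[] readBinaryRecord(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String key = in.readUTF();
            String value = in.readUTF();
            return new String[] {key, value};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mismatched consumer: assumes the bytes are a text line "key<TAB>value".
    // Throwing on a missing tab is a choice made for this sketch only.
    static String[] readTextRecord(byte[] data) {
        String line = new String(data, StandardCharsets.UTF_8);
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalStateException("not a key<TAB>value text line");
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        byte[] data = writeBinaryRecord("C0", "1.0,2.0");
        // Writer and reader agree, so a round-trip test of either side passes.
        System.out.println("matching reader: " + String.join(" -> ", readBinaryRecord(data)));
        // Both readers accept byte[], so the compiler cannot catch the mismatch.
        try {
            readTextRecord(data);
        } catch (IllegalStateException e) {
            System.out.println("mismatched reader: " + e.getMessage());
        }
    }
}
```

Each side is internally consistent, so tests that exercise one job at a time can pass; the failure only appears when the output of one is fed to the other.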

Improving speed of KMeans
-------------------------

                Key: MAHOUT-99
                URL: https://issues.apache.org/jira/browse/MAHOUT-99
            Project: Mahout
         Issue Type: Improvement
         Components: Clustering
           Reporter: Pallavi Palleti
           Assignee: Grant Ingersoll
            Fix For: 0.1

Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch


Improved the speed of KMeans by passing only the cluster ID from mapper to 
reducer. Previously, the whole cluster info was sent as a formatted string.
Also removed the implicit assumption that the combiner runs exactly once; the 
code is modified so that it does not misbehave when the combiner runs zero 
times or more than once.
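
The combiner point can be sketched in plain Java, without Hadoop. This is not the code from the attached patches, which are not shown here; the Partial class and the combine method are hypothetical names. The idea is that the value emitted under each cluster ID is a (count, coordinate sum) pair, and merging such pairs is associative, so the final centroid is the same whether the combiner runs zero times, once, or repeatedly.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSafeSketch {

    // Hypothetical partial result for one cluster ID:
    // how many points were seen and their coordinate-wise sum.
    static final class Partial {
        final long count;
        final double[] sum;

        Partial(long count, double[] sum) {
            this.count = count;
            this.sum = sum;
        }

        // What a mapper would emit for a single point.
        static Partial ofPoint(double[] point) {
            return new Partial(1, point.clone());
        }

        // Associative merge: counts and sums simply add up.
        Partial merge(Partial other) {
            double[] merged = sum.clone();
            for (int i = 0; i < merged.length; i++) {
                merged[i] += other.sum[i];
            }
            return new Partial(count + other.count, merged);
        }

        // The division happens only at the very end, in the reducer.
        double[] centroid() {
            double[] c = new double[sum.length];
            for (int i = 0; i < c.length; i++) {
                c[i] = sum[i] / count;
            }
            return c;
        }
    }

    // Usable as both combiner and reducer logic: input and output
    // have the same type, so it can be applied any number of times.
    static Partial combine(List<Partial> partials) {
        Partial acc = partials.get(0);
        for (Partial p : partials.subList(1, partials.size())) {
            acc = acc.merge(p);
        }
        return acc;
    }

    public static void main(String[] args) {
        Partial a = Partial.ofPoint(new double[] {1, 1});
        Partial b = Partial.ofPoint(new double[] {3, 5});
        Partial c = Partial.ofPoint(new double[] {5, 0});
        Partial d = Partial.ofPoint(new double[] {7, 2});

        // Combiner never runs: the reducer sees all four values.
        Partial none = combine(Arrays.asList(a, b, c, d));
        // Combiner runs once per pair before the reducer.
        Partial once = combine(Arrays.asList(
                combine(Arrays.asList(a, b)), combine(Arrays.asList(c, d))));

        System.out.println(Arrays.toString(none.centroid()));
        System.out.println(Arrays.toString(once.centroid()));
    }
}
```

A combiner that emitted an already averaged centroid would break this property, because an average of averages loses the point counts.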


