Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
Never mind on this. I read some emails out of context and now realize this has been addressed.

On Mar 19, 2009, at 6:57 AM, Grant Ingersoll (JIRA) wrote:

> [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683426#action_12683426 ]
>
> Grant Ingersoll commented on MAHOUT-99:
> ---
> For the record, I ran Canopy independently, and that worked just fine.
>
>> Improving speed of KMeans
>> -
>> Key: MAHOUT-99
>> URL: https://issues.apache.org/jira/browse/MAHOUT-99
>> Project: Mahout
>> Issue Type: Improvement
>> Components: Clustering
>> Reporter: Pallavi Palleti
>> Assignee: Grant Ingersoll
>> Fix For: 0.1
>> Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch
>>
>> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string.
>> Also removed the implicit assumption that the combiner runs only once; the code is modified so that it does not break when the combiner runs zero times or more than once.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
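The combiner point in the issue description can be illustrated with a minimal sketch in plain Java (no Hadoop classes). It assumes that what travels from mapper to reducer for each cluster ID is a partial vector sum plus a point count; the class `PartialCentroid` and its methods are hypothetical and are not taken from the MAHOUT-99 patch. Because merging such partials is associative, the final centroid is the same whether the combiner ran zero times, once, or several times.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical partial result for one cluster ID: running vector sum and point count. */
final class PartialCentroid {
    final double[] sum;
    final long count;

    PartialCentroid(double[] sum, long count) {
        this.sum = sum.clone();
        this.count = count;
    }

    /** What a mapper might emit for a single point assigned to a cluster. */
    static PartialCentroid ofPoint(double[] point) {
        return new PartialCentroid(point, 1);
    }

    /** Merge step that can serve as both combiner and reducer logic. */
    static PartialCentroid merge(List<PartialCentroid> parts) {
        double[] total = new double[parts.get(0).sum.length];
        long n = 0;
        for (PartialCentroid p : parts) {
            for (int i = 0; i < total.length; i++) {
                total[i] += p.sum[i];
            }
            n += p.count;
        }
        return new PartialCentroid(total, n);
    }

    /** Final centroid: component-wise sum divided by the number of points. */
    double[] centroid() {
        double[] c = new double[sum.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = sum[i] / count;
        }
        return c;
    }

    public static void main(String[] args) {
        List<PartialCentroid> mapped = Arrays.asList(
            ofPoint(new double[] {1, 2}),
            ofPoint(new double[] {3, 4}),
            ofPoint(new double[] {5, 9}));

        // Combiner never ran: the reducer sees the raw mapper output.
        double[] noCombiner = merge(mapped).centroid();

        // Combiner ran once over the first two records.
        PartialCentroid combined = merge(mapped.subList(0, 2));
        double[] once = merge(Arrays.asList(combined, mapped.get(2))).centroid();

        // Combiner ran a second time over its own output.
        PartialCentroid recombined = merge(Arrays.asList(combined));
        double[] twice = merge(Arrays.asList(recombined, mapped.get(2))).centroid();

        System.out.println(Arrays.toString(noCombiner));
        System.out.println(Arrays.toString(once));
        System.out.println(Arrays.toString(twice));
    }
}
```

All three lines printed by `main` show the same centroid. A design that emitted an already averaged centroid from the combiner would not have this property, which is presumably the kind of assumption the patch removes.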
RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
It depends on the kind of output. If we are outputting only numeric values, a SequenceFile is preferable because the data is written as binary. Otherwise, it is preferable to write plain text: a text file is human-readable, whereas binary is not. Since the reducers of both Canopy and KMeans treat the data as text, I don't see any performance improvement from using SequenceFile. So I used TextInputFormat, which is easier to read.

Thanks
Pallavi

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Thursday, March 19, 2009 10:19 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Also why not consider just converting canopy? Which reader is better?

Jeff Eastman wrote:
> Sure, why don't you go ahead and post a patch?
>
> Pallavi Palleti (JIRA) wrote:
>> [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 ]
>>
>> Pallavi Palleti commented on MAHOUT-99:
>> ---
>> I have used KeyValueLineRecordReader internally for my code and forgot to revert back to SequenceFileReader. Would it be sufficient to add another patch on the latest code that modifies only KMeansDriver to use SequenceFileReader? Kindly let me know.
>>
>> Thanks
>> Pallavi
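To make the trade-off in the reply above concrete, here is a small sketch in plain Java rather than Hadoop code. It encodes the same array of doubles once as raw 8-byte binary values (a stand-in for binary serialization in general, not the actual SequenceFile layout) and once as comma-separated decimal text. The class name `EncodingSizes` and the sample values are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper comparing a binary and a text encoding of the same vector. */
final class EncodingSizes {
    /** Fixed-width binary: 8 bytes per double, not human-readable. */
    static byte[] asBinary(double[] values) {
        ByteBuffer buf = ByteBuffer.allocate(8 * values.length);
        for (double v : values) {
            buf.putDouble(v);
        }
        return buf.array();
    }

    /** Comma-separated decimal text: readable, size depends on the digits printed. */
    static byte[] asText(double[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(values[i]);
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        double[] centre = {Math.PI, Math.E, Math.sqrt(2), 1.0 / 3};
        byte[] text = asText(centre);
        System.out.println("binary bytes: " + asBinary(centre).length);
        System.out.println("text bytes:   " + text.length);
        System.out.println("text form:    " + new String(text, StandardCharsets.UTF_8));
    }
}
```

The binary form is always 8 bytes per value, while the text length varies with the number of digits, so short decimals can actually be smaller as text. Text also has to be parsed back into numbers by whichever job reads it, which is part of what the "which reader is better" question comes down to.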
RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
There is a test case in TestKMeansClustering.java that uses the output of Canopy as its input, and it succeeded without any issue. However, it uses the local file system rather than HDFS, which might be why it succeeded.

Thanks
Pallavi

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Thursday, March 19, 2009 10:14 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

The unit tests don't care which format is used as long as it is consistent. The compiler helps enforce that. kMeans will run and its tests will pass. So will Canopy. When somebody runs the kMeans example, it encounters the file format differences. Are all the examples run by the install? I'd be surprised.

Jeff

Palleti, Pallavi wrote:
> Yeah. But I am wondering how the test cases succeeded. I ran them using the "mvn clean install" command.
>
> Thanks
> Pallavi
>
> -----Original Message-----
> From: Jeff Eastman [mailto:j...@windwardsolutions.com]
> Sent: Thursday, March 19, 2009 9:56 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
>
> The Synthetic Control kMeans job calls the Canopy job to build its initial clusters, as is commonly done. If the kMeans record format was changed and Canopy was not changed accordingly, then everything would still compile but there would be a mismatch when the kMeans mapper tried to read in the clusters.
>
> Jeff
>
> Richard Tomsett (JIRA) wrote:
>> [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]
>>
>> Richard Tomsett commented on MAHOUT-99:
>> ---
>> Yup, just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). Think that's the problem at least, I'll have a quick play.
Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
Also why not consider just converting canopy? Which reader is better?

Jeff Eastman wrote:
> Sure, why don't you go ahead and post a patch?
>
> Pallavi Palleti (JIRA) wrote:
>> [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 ]
>>
>> Pallavi Palleti commented on MAHOUT-99:
>> ---
>> I have used KeyValueLineRecordReader internally for my code and forgot to revert back to SequenceFileReader. Would it be sufficient to add another patch on the latest code that modifies only KMeansDriver to use SequenceFileReader? Kindly let me know.
>>
>> Thanks
>> Pallavi
Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
The unit tests don't care which format is used as long as it is consistent. The compiler helps enforce that. kMeans will run and its tests will pass. So will Canopy. When somebody runs the kMeans example, it encounters the file format differences. Are all the examples run by the install? I'd be surprised.

Jeff

Palleti, Pallavi wrote:
> Yeah. But I am wondering how the test cases succeeded. I ran them using the "mvn clean install" command.
>
> Thanks
> Pallavi
>
> -----Original Message-----
> From: Jeff Eastman [mailto:j...@windwardsolutions.com]
> Sent: Thursday, March 19, 2009 9:56 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
>
> The Synthetic Control kMeans job calls the Canopy job to build its initial clusters, as is commonly done. If the kMeans record format was changed and Canopy was not changed accordingly, then everything would still compile but there would be a mismatch when the kMeans mapper tried to read in the clusters.
>
> Jeff
>
> Richard Tomsett (JIRA) wrote:
>> [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]
>>
>> Richard Tomsett commented on MAHOUT-99:
>> ---
>> Yup, just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). Think that's the problem at least, I'll have a quick play.
Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
Sure, why don't you go ahead and post a patch?

Pallavi Palleti (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 ]
>
> Pallavi Palleti commented on MAHOUT-99:
> ---
> I have used KeyValueLineRecordReader internally for my code and forgot to revert back to SequenceFileReader. Would it be sufficient to add another patch on the latest code that modifies only KMeansDriver to use SequenceFileReader? Kindly let me know.
>
> Thanks
> Pallavi
RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
Yeah. But I am wondering how the test cases succeeded. I ran them using the "mvn clean install" command.

Thanks
Pallavi

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Thursday, March 19, 2009 9:56 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

The Synthetic Control kMeans job calls the Canopy job to build its initial clusters, as is commonly done. If the kMeans record format was changed and Canopy was not changed accordingly, then everything would still compile but there would be a mismatch when the kMeans mapper tried to read in the clusters.

Jeff

Richard Tomsett (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]
>
> Richard Tomsett commented on MAHOUT-99:
> ---
> Yup, just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). Think that's the problem at least, I'll have a quick play.
Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
Are the examples run automatically in the build?

Pallavi Palleti (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683297#action_12683297 ]
>
> Pallavi Palleti commented on MAHOUT-99:
> ---
> Yup. That must be the issue. But I am wondering how the test case succeeded?
Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
The Synthetic Control kMeans job calls the Canopy job to build its initial clusters, as is commonly done. If the kMeans record format was changed and Canopy was not changed accordingly, then everything would still compile but there would be a mismatch when the kMeans mapper tried to read in the clusters.

Jeff

Richard Tomsett (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]
>
> Richard Tomsett commented on MAHOUT-99:
> ---
> Yup, just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). Think that's the problem at least, I'll have a quick play.
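The failure mode described above can be reproduced in miniature without Hadoop. In the sketch below, a hypothetical producer writes a cluster centre as binary (standing in for Canopy's SequenceFile output) and a hypothetical consumer parses its input as tab-separated key/value text (standing in for a KeyValueLineRecordReader-style reader). Both sides exchange only bytes, so the compiler cannot see the mismatch; it surfaces only when the consumer runs. All class and method names are invented for illustration and do not come from Mahout.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical producer/consumer pair that disagree on the record format. */
final class FormatMismatchDemo {
    /** Producer A: binary record (key length, key bytes, value count, doubles). */
    static byte[] writeBinary(String key, double[] centre) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + keyBytes.length + 4 + 8 * centre.length);
        buf.putInt(keyBytes.length).put(keyBytes).putInt(centre.length);
        for (double v : centre) {
            buf.putDouble(v);
        }
        return buf.array();
    }

    /** Producer B: text record of the form key<TAB>v1,v2,... */
    static byte[] writeText(String key, double[] centre) {
        StringBuilder line = new StringBuilder(key).append('\t');
        for (int i = 0; i < centre.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(centre[i]);
        }
        return line.append('\n').toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Consumer: expects the text format and rejects anything else at run time. */
    static double[] readTextCentre(byte[] data) {
        String firstLine = new String(data, StandardCharsets.UTF_8).split("\n", 2)[0];
        int tab = firstLine.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no tab-separated key/value in record");
        }
        String[] fields = firstLine.substring(tab + 1).split(",");
        double[] centre = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            centre[i] = Double.parseDouble(fields[i]); // NumberFormatException on garbage
        }
        return centre;
    }

    public static void main(String[] args) {
        double[] centre = {1.5, 2.5};
        // Matching formats: works.
        System.out.println("text input: read "
            + readTextCentre(writeText("C0", centre)).length + " values");
        // Mismatched formats: compiles fine, fails only when executed.
        try {
            readTextCentre(writeBinary("C0", centre));
        } catch (IllegalArgumentException e) {
            System.out.println("binary input failed at run time: " + e.getMessage());
        }
    }
}
```

A unit test that feeds each job its own output would pass in this setup as well, which matches the observation that only the example chaining Canopy into kMeans exposed the problem.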
Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
On my Mac, I have:

$ echo $JAVA_HOME
/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home

-Grant

On Mar 18, 2009, at 2:10 PM, Jeff Eastman wrote:

> I'm running the example in Eclipse using the stand-alone mode in the hadoop-0.19.1 jar file. It works fine, as does the hadoop compile in Eclipse. I cannot, however, get any hadoop stuff to work from the command line. Even though my JAVA_HOME environment is set to /Library/Java/Home and java -version yields:
>
> Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
> Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)
>
> ... the hadoop build script and the start-all.sh commands all complain about class version errors. Can any other Mac users help me out?
>
> Jeff
>
> Grant Ingersoll (JIRA) wrote:
>> [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683077#action_12683077 ]
>>
>> Grant Ingersoll commented on MAHOUT-99:
>> ---
>> Yeah, what version of Hadoop are you running? I got it w/ 0.19.1, but maybe I didn't set something up right.
>> {code}
>> bin/hadoop jar ~/projects/lucene/mahout/mahout-clean/examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
>> {code}
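Class version errors of this kind typically mean that the JVM actually launched is older than the one the classes were compiled for, which can happen when JAVA_HOME and the `java` on the PATH point at different installations. One way to check, sketched here on the assumption that a small class can be run with the same launcher the scripts use, is to print the runtime's `java.home` and version alongside the class-file major version of a compiled class. The class name `ClassVersionCheck` is hypothetical; the header layout (magic number `0xCAFEBABE`, then minor and major version) is the standard class-file format, in which Java 5 is major version 49 and Java 6 is 50.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

/** Hypothetical diagnostic: which JVM is running, and which version was a class built for? */
final class ClassVersionCheck {
    /** Extracts the major version from the first 8 bytes of a class file. */
    static int majorVersion(byte[] classFileHeader) {
        // ByteBuffer is big-endian by default, as is the class-file format.
        ByteBuffer buf = ByteBuffer.wrap(classFileHeader);
        if (buf.remaining() < 8 || buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file header");
        }
        buf.getShort();                 // minor version, ignored here
        return buf.getShort() & 0xFFFF; // major version
    }

    public static void main(String[] args) throws IOException {
        System.out.println("java.home          = " + System.getProperty("java.home"));
        System.out.println("java.version       = " + System.getProperty("java.version"));
        System.out.println("java.class.version = " + System.getProperty("java.class.version"));

        // Inspect this class's own compiled form, if it is available as a resource.
        try (InputStream in =
                 ClassVersionCheck.class.getResourceAsStream("ClassVersionCheck.class")) {
            if (in == null) {
                System.out.println("compiled class file not found on the class path");
                return;
            }
            byte[] header = new byte[8];
            new DataInputStream(in).readFully(header);
            System.out.println("this class was compiled as major version " + majorVersion(header));
        }
    }
}
```

If the major version of the compiled classes is higher than the `java.class.version` reported by the JVM that the scripts start, that JVM will refuse to load them.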
Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
I'm running the example in Eclipse using the stand-alone mode in the hadoop-0.19.1 jar file. It works fine, as does the hadoop compile in Eclipse. I cannot, however, get any hadoop stuff to work from the command line. Even though my JAVA_HOME environment is set to /Library/Java/Home and java -version yields:

Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

... the hadoop build script and the start-all.sh commands all complain about class version errors. Can any other Mac users help me out?

Jeff

Grant Ingersoll (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683077#action_12683077 ]
>
> Grant Ingersoll commented on MAHOUT-99:
> ---
> Yeah, what version of Hadoop are you running? I got it w/ 0.19.1, but maybe I didn't set something up right.
> {code}
> bin/hadoop jar ~/projects/lucene/mahout/mahout-clean/examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
> {code}
RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
Hi Grant,

I am Rohini and I work on the same team as Pallavi. Pallavi is out of the office until the end of this month, so I will be taking care of this issue now. I will look into the issue you have pointed out and get back to you.

Thanks,
-Rohini

-----Original Message-----
From: Grant Ingersoll (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Sunday, December 07, 2008 7:32 AM
To: mahout-dev@lucene.apache.org
Subject: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654168#action_12654168 ]

Grant Ingersoll commented on MAHOUT-99:
---
Hi Pallavi,

The core code works, but the change to the KMeansDriver causes a compile error in the KMeans demo code in examples because it now asks for the number of map tasks and the number of centroids. Could you document these new parameters, put in reasonable defaults, and update the patch? One thing I'm not certain of, though, is why we need to pass in the number of map tasks. Isn't that already a config setting when you set up Hadoop?

> Improving speed of KMeans
> -
> Key: MAHOUT-99
> URL: https://issues.apache.org/jira/browse/MAHOUT-99
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Reporter: Pallavi Palleti
> Assignee: Grant Ingersoll
> Attachments: MAHOUT-99.patch
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string.
> Also removed the implicit assumption that the combiner runs only once; the code is modified so that it does not break when the combiner runs zero times or more than once.