[jira] Updated: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos
[ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhao zhendong updated MAHOUT-232: - Attachment: SVMonMahout0.5.1.patch MapReduce/MapReduceUtil.java should have been mapreduce/MapReduceUtil.java; the folders are NOT in camel case. I still see camel casing everywhere. >> Done. Changed MapReduce -> mapreduce, ParallelAlgorithms -> >> parallelalgorithms and SequentialAlgorithms -> sequentialalgorithms + public static final String DEFAULT_HDFS_SERVER = "hdfs://localhost:12009"; + // For HBASE + public static final String DEFAULT_HBASE_SERVER = "localhost:6"; These are read from the hadoop conf and hbase configuration file. Mahout shouldn't be doing any sort of configuration internally. >> The hard-coded hadoop and hbase configuration has been removed. The default >> HDFS and HBase settings in SVMParameters are only runtime defaults for the >> MapReduce application. No System.out.println; use the Logger instead. >> Done. HDFSConfig.java, HDFSReader.java - do away with any hdfs configuration in the code. As I said, opening a FileSystem using the Configuration object would in turn decide between local fs or hdfs based on the execution context. >> Yes, the sequential algorithms use the principle you mentioned: they determine >> which file system to choose according to whether the "hdfs" parameter is given >> in the training and prediction procedures. HDFSReader only serves the >> sequential algorithms, not the parallel algorithms based on the >> Map/Reduce framework. 
> Implementation of sequential SVM solver based on Pegasos > > > Key: MAHOUT-232 > URL: https://issues.apache.org/jira/browse/MAHOUT-232 > Project: Mahout > Issue Type: New Feature > Components: Classification >Affects Versions: 0.4 >Reporter: zhao zhendong > Fix For: 0.4 > > Attachments: SequentialSVM_0.1.patch, SequentialSVM_0.2.2.patch, > SequentialSVM_0.3.patch, SequentialSVM_0.4.patch, SVMDataset.patch, > SVMonMahout0.5.1.patch, SVMonMahout0.5.patch > > > After discussing with people in this community, I decided to re-implement a > sequential SVM solver based on Pegasos for the Mahout platform (mahout command > line style, SparseMatrix and SparseVector etc.). Eventually, it will > support HDFS. > Sequential SVM based on Pegasos. > Maxim zhao (zhaozhendong at gmail dot com) > --- > Currently, this package provides (Features): > --- > 1. Sequential SVM linear solver, including training and testing. > 2. Supports both the general file system and HDFS right now. > 3. Supports large-scale data set training. > Because Pegasos only needs to sample a certain number of examples, this package > can pre-fetch a certain number of samples (e.g. the maximum iteration count) into memory. > For example: if the data set has 100,000,000 samples, since the > default maximum number of iterations is 10,000, > the package randomly loads only 10,000 samples into memory. > 4. Sequential data set testing, so the package can support large-scale data > sets in both training and testing. > 5. Supports parallel classification (testing phase only) based on the > Map-Reduce framework. > 6. Supports multi-classification based on the Map-Reduce framework (wholly > parallelized version). > 7. Supports regression. > --- > TODO: > --- > 1. Multi-classification probability prediction > 2. 
Performance Testing > --- > Usage: > --- > >> > Classification: > >> > > @@ Training: @@ > > SVMPegasosTraining.java > The default argument is: > -tr ../examples/src/test/resources/svmdataset/train.dat -m > ../examples/src/test/resources/svmdataset/SVM.model > ~~ > @ For the case that the training data set is on HDFS:@ > ~~ > 1 Ensure that your training data set has been submitted to hdfs > hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset > 2 Revise the argument: > -tr /user/hadoop/train.dat -m > ../examples/src/test/resources/svmdataset/SVM.model -hdfs > hdfs://localhost:12009 > ~~ > @ Multi-class Training [Based on MapReduce Framework]:@
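The random pre-fetch described in the feature list above can be sketched with reservoir sampling: stream the training file once and keep a uniform random sample of at most the maximum-iteration count in memory. This is a hedged illustration in plain Java; the class and method names are hypothetical, not from the actual patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch of the pre-fetch strategy: keep a uniform random sample
// of at most k training lines (reservoir sampling, Algorithm R) while reading
// the data set in a single pass. Names are hypothetical, not the patch's API.
class PegasosSampler {
  static List<String> sample(Iterable<String> lines, int k, long seed) {
    List<String> reservoir = new ArrayList<>(k);
    Random rnd = new Random(seed);
    int n = 0;                          // number of lines seen so far
    for (String line : lines) {
      n++;
      if (reservoir.size() < k) {
        reservoir.add(line);            // fill the reservoir first
      } else {
        int j = rnd.nextInt(n);         // uniform in [0, n)
        if (j < k) reservoir.set(j, line);
      }
    }
    return reservoir;
  }
}
```

With 100,000,000 input lines and k = 10,000 this holds only 10,000 lines in memory at any time, which is the behavior item 3 describes.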
Re: Welcome Drew Farris
Welcome Drew =D On Fri, Feb 19, 2010 at 5:02 AM, Grant Ingersoll wrote: > > On Feb 18, 2010, at 8:32 PM, Drew Farris wrote: > >> There's lots more stuff I'd like to get in there, >> now I only need to figure how to squeeze 48 hours of consciousness >> into a day. > > I believe there is a compression algorithm for that. >
[jira] Updated: (MAHOUT-299) Collocations: improve performance by making Gram BinaryComparable
[ https://issues.apache.org/jira/browse/MAHOUT-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-299: --- Status: Patch Available (was: Open) > Collocations: improve performance by making Gram BinaryComparable > - > > Key: MAHOUT-299 > URL: https://issues.apache.org/jira/browse/MAHOUT-299 > Project: Mahout > Issue Type: Improvement > Components: Utils >Affects Versions: 0.3 >Reporter: Drew Farris >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-299.patch > > > Robin's profiling indicated that a large portion of a run was spent in > readFields() in Gram due to the deserialization occurring as a part of Gram > comparisons for sorting. He pointed me to BinaryComparable and the > implementation in Text. > Like Text, in this new implementation, Gram stores its string in binary form. > When encoding the string at construction time we allocate an extra > character's worth of data to hold the Gram type information. When sorting > Grams, the binary arrays are compared instead of deserializing and comparing > fields. > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-299) Collocations: improve performance by making Gram BinaryComparable
[ https://issues.apache.org/jira/browse/MAHOUT-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-299: --- Attachment: MAHOUT-299.patch Patch as described above. Other cleanups included: * Gram is no longer mutable, except in the case of readFields of course. * Added an explicit NGRAM type, removed constructors that implicitly set the type. * Added unit tests for constructors and writability. One should be added for sortability/comparison. * Better unigram handling in the mappers/reducers (no need to setType on these anymore) * Switched to adjustOrPutValue when accumulating frequencies in OpenObjectIntHashMaps Also, NGramCollector and NGramCollectorTest should be removed from the repo. They are no longer relevant. Applying this patch with -E will empty and erase these files, but it's up to svn to do the rest. > Collocations: improve performance by making Gram BinaryComparable > - > > Key: MAHOUT-299 > URL: https://issues.apache.org/jira/browse/MAHOUT-299 > Project: Mahout > Issue Type: Improvement > Components: Utils >Affects Versions: 0.3 >Reporter: Drew Farris >Priority: Minor > Fix For: 0.3 > > Attachments: MAHOUT-299.patch > > > Robin's profiling indicated that a large portion of a run was spent in > readFields() in Gram due to the deserialization occurring as a part of Gram > comparisons for sorting. He pointed me to BinaryComparable and the > implementation in Text. > Like Text, in this new implementation, Gram stores its string in binary form. > When encoding the string at construction time we allocate an extra > character's worth of data to hold the Gram type information. When sorting > Grams, the binary arrays are compared instead of deserializing and comparing > fields. > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
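The adjustOrPutValue change above replaces a get-then-put pair with a single map operation per gram occurrence. A rough JDK-only stand-in for the pattern (using HashMap.merge rather than the actual OpenObjectIntHashMap API; the class name is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Accumulate gram frequencies with one map call per occurrence instead of a
// containsKey/get/put sequence. This mirrors the adjustOrPutValue idiom
// mentioned in the patch notes, but with the standard library's Map.merge.
class FrequencyAccumulator {
  final Map<String, Integer> counts = new HashMap<>();

  void add(String gram, int freq) {
    counts.merge(gram, freq, Integer::sum);  // put if absent, else add freq
  }
}
```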
[jira] Created: (MAHOUT-299) Collocations: improve performance by making Gram BinaryComparable
Collocations: improve performance by making Gram BinaryComparable - Key: MAHOUT-299 URL: https://issues.apache.org/jira/browse/MAHOUT-299 Project: Mahout Issue Type: Improvement Components: Utils Affects Versions: 0.3 Reporter: Drew Farris Priority: Minor Fix For: 0.3 Robin's profiling indicated that a large portion of a run was spent in readFields() in Gram due to the deserialization occurring as a part of Gram comparisons for sorting. He pointed me to BinaryComparable and the implementation in Text. Like Text, in this new implementation, Gram stores its string in binary form. When encoding the string at construction time we allocate an extra character's worth of data to hold the Gram type information. When sorting Grams, the binary arrays are compared instead of deserializing and comparing fields. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
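The core idea, comparing the encoded gram bytes plus a type slot directly rather than deserializing first, might look roughly like this. This is an illustrative sketch, not Gram's actual code; the single trailing type byte is an assumption for the example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of byte-wise gram comparison in the spirit of
// BinaryComparable/Text: store the string in binary form with a trailing
// type slot, and compare raw unsigned bytes when sorting. Field and method
// names are hypothetical, not Gram's actual API.
class BinaryGram {
  private final byte[] bytes;  // encoded string plus one trailing type byte

  BinaryGram(String gram, byte type) {
    byte[] s = gram.getBytes(StandardCharsets.UTF_8);
    bytes = new byte[s.length + 1];
    System.arraycopy(s, 0, bytes, 0, s.length);
    bytes[s.length] = type;    // the extra slot holds the gram type
  }

  /** Lexicographic unsigned-byte comparison; no deserialization needed. */
  int compareTo(BinaryGram other) {
    int n = Math.min(bytes.length, other.bytes.length);
    for (int i = 0; i < n; i++) {
      int a = bytes[i] & 0xff, b = other.bytes[i] & 0xff;
      if (a != b) return a - b;
    }
    return bytes.length - other.bytes.length;
  }
}
```

Because the comparator touches only the serialized bytes, the sort never pays the readFields() cost that showed up in Robin's profile.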
Re: Welcome Drew Farris
On Feb 18, 2010, at 8:32 PM, Drew Farris wrote: > There's lots more stuff I'd like to get in there, > now I only need to figure how to squeeze 48 hours of consciousness > into a day. I believe there is a compression algorithm for that.
Re: Welcome Drew Farris
Hi Drew, Congrats, and thank you for all the help with the dependencies stuff. Cheers, Zhendong On Fri, Feb 19, 2010 at 6:27 AM, Drew Farris wrote: > Hi Grant, fellow Mahouts, > > Thanks for the chance to join the team. I really look forward to > contributing my skills to the project and learning a great deal as > well. > > So, a little bit about myself; > > It all started with an Apple //+ back in 1982. Growing up, I never > thought I'd do something serious with computers. In college I studied > Computer Graphics in the Art School, Architecture and ended up getting > a Masters in Information Resource Management on top of that. > > Since then I've been a software developer who has brushed up > against information retrieval, search and NLP for many years. I got my > start in search and content management working as a web-developer for > a newspaper in the early days of the Internet. > > As Grant mentioned, I've worked at TextWise for a number of years. The > company grew out of an NLP-oriented research group headed by Liz Liddy > at Syracuse University and continues to focus on the commercial > applications of text-oriented technologies albeit with a more > statistical orientation as of late. > > While at TextWise, I've worked on projects ranging from > cross-language IR to contextual advertising. Mostly I've been involved > in developing the glue that holds the core algorithms together, > helping them scale and combining the various moving parts of a system > into a cohesive whole. I've had a chance to do everything from web > crawling, document processing, database, visualization, web-app and > distributed systems work. To that end, I've worked on and off with > Lucene, Nutch, and many other projects from the Apache ecosystem for > years. > > Reading "Programming Collective Intelligence" a couple years back > really solidified my interest in machine learning algorithms. 
After > building a number of different systems to process large amounts of > content, the ability to quickly and effortlessly scale things up with > hadoop/mapreduce really appeals to me. The Mahout project is > wonderful to me in that it combines the things I'm interested in > personally, has relevance to the things I do for work and has a really > outstanding group of people working on it. > > I'm looking forward to working with you all, > > Drew > > On Thu, Feb 18, 2010 at 4:05 PM, Robin Anil wrote: > > Welcome Drew > > > > @Grant: No customary introduction? :) > > > > Robin > > > > On Fri, Feb 19, 2010 at 2:33 AM, Grant Ingersoll >wrote: > > > >> On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the > >> newest member of the Mahout committer family. Drew has been > contributing > >> some really nice work to Mahout in recent months and I look forward to > his > >> continuing involvement with Mahout. > >> > >> Congrats, Drew! > >> > >> > >> -Grant > > > -- - Zhen-Dong Zhao (Maxim) <><<><><><><><><><>><><><><><>> Department of Computer Science School of Computing National University of Singapore >>><><><><><><><><<><>><><<
Re: Profiling SequentialAccessSparseVector
Can't remember. I suppose I should look at the code. :-) On Thu, Feb 18, 2010 at 6:19 PM, Jake Mannix wrote: > Don't we already have generalized scalar aggregation? -- Ted Dunning, CTO DeepDyve
Re: Profiling SequentialAccessSparseVector
Don't we already have generalized scalar aggregation? I thought I committed that a while back. It's very useful for inner products, distances, and stats. Vector accumulation using a BinaryFunction as a map just needs to be made more efficient (sparsity and random accessibility taken into account), but works. The only remaining piece is something like accumulate(Vector v, BinaryFunction map, BinaryFunction aggregator) - a method on Matrix, which aggregates partial map() combinations of each row with the input Vector, and returns a Vector. This generalizes times(Vector). I guess Matrix.assign(Vector v, BinaryFunction map) could be useful for mutating a matrix, but on HDFS would operate by making new sequencefiles. -jake On Feb 18, 2010 5:11 PM, "Ted Dunning" wrote: On Thu, Feb 18, 2010 at 4:43 PM, Jake Mannix wrote: > What would this metho... This method would apply the mapFunction to each corresponding pair of elements from the two vectors and then aggregate the results using the aggregatorFunction. The unit is the unit of the aggregator and would only be needed if the vectors have no entries. We could probably do without it. This could be a static function or could be a method on vectorA. Putting the method on vectorA would probably be better because it could drive many common optimizations. Examples of this pattern include sum-squared-difference (agg = plus, map = compose(sqr, minus)), dot (agg = plus, map = times). This can be composed with a temporary output vector or sometimes by mutating one of the operands. This is not as desirable as just accumulating the results on the fly, however. The reason why we need a specialized function is to do things in a nicely > mutating way: Hadoop M... We definitely need that too. > The only thing more we need than what we have now is in the assign method > - > currently we ha... That can work, but very often requires an extra copy of the vector as in the distance case that Robin brought up. 
The contract there says neither operand can be changed which forces a vector copy in the current API. A mapReduce operation in addition to a map would allow us to avoid that important case.
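The map/aggregate pattern discussed in this thread can be sketched as follows, with plain double[] standing in for Mahout Vectors and the aggregator's unit fixed at 0; the method name and signature are illustrative, not a proposed final API.

```java
import java.util.function.DoubleBinaryOperator;

// Sketch of aggregate(a, b, aggregator, map): apply `map` to each pair of
// corresponding elements, then fold the results with `aggregator`. A real
// implementation would exploit sparsity and accumulate on the fly instead
// of materializing any intermediate vector.
class VectorOps {
  static double aggregate(double[] a, double[] b,
                          DoubleBinaryOperator aggregator,
                          DoubleBinaryOperator map) {
    double acc = 0.0;  // unit of the aggregator, assumed 0 here (e.g. for plus)
    for (int i = 0; i < a.length; i++) {
      acc = aggregator.applyAsDouble(acc, map.applyAsDouble(a[i], b[i]));
    }
    return acc;
  }
}
```

With this sketch, dot is `aggregate(a, b, Double::sum, (x, y) -> x * y)` and sum-squared-difference is `aggregate(a, b, Double::sum, (x, y) -> (x - y) * (x - y))`, matching the agg/map pairings Ted lists above.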
Re: Welcome Drew Farris
On Thu, Feb 18, 2010 at 7:45 PM, Jake Mannix wrote: > Welcome Drew! I've been using your excellent colloc code quite a bit > in testing my svd stuff (produces nicely bigger vectors out of text!), > looking > forward to more cool stuff (NLP package! Bring it on! :) ). > Heh, great to hear! There's lots more stuff I'd like to get in there, now I only need to figure how to squeeze 48 hours of consciousness into a day.
[jira] Created: (MAHOUT-298) 2 test case fails while trying to mvn clean install after checking out revision 911542 of trunk
2 test case fails while trying to mvn clean install after checking out revision 911542 of trunk --- Key: MAHOUT-298 URL: https://issues.apache.org/jira/browse/MAHOUT-298 Project: Mahout Issue Type: Test Components: Clustering Affects Versions: 0.3 Environment: Windows 7 with Cygwin Reporter: Anish Shah Priority: Minor I checked out revision 911542 from trunk and am seeing the following failed tests when I run mvn clean install: Results : Failed tests: testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering) testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmeansClustering) Tests run: 338, Failures: 2, Errors: 0, Skipped: 0 [INFO] [ERROR] BUILD FAILURE [INFO] [INFO] There are test failures. I looked in the surefire-reports and see the following details on the failures: testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering) Time elapsed: 11.169 sec <<< FAILURE! junit.framework.AssertionFailedError: clusters[3] expected:<4> but was:<2> at junit.framework.Assert.fail(Assert.java:47) ... testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmeansClustering) Time elapsed: 3.35 sec <<< FAILURE! junit.framework.AssertionFailedError: num points[0] expected:<4> but was:<1> at junit.framework.Assert.fail(Assert.java:47) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: New to Mahout - question about the failed test cases
I have created https://issues.apache.org/jira/browse/MAHOUT-298 to track this. On Thu, Feb 18, 2010 at 6:59 PM, Ted Dunning wrote: > Darn. That uses up all of my ideas. > > I would vote for a platform issue. > > On Thu, Feb 18, 2010 at 3:41 PM, Anish Shah wrote: > > > I checked out revision 911542 (after removing the mahout-trunk from my > > local > > machine) and > > tried again and still getting the same 2 failures upon running mvn clean > > install! > > > > > > -- > Ted Dunning, CTO > DeepDyve >
Re: Profiling SequentialAccessSparseVector
On Thu, Feb 18, 2010 at 4:43 PM, Jake Mannix wrote: > What would this method mean? aggregatorUnit means what? What would this > be a method on? > This method would apply the mapFunction to each corresponding pair of elements from the two vectors and then aggregate the results using the aggregatorFunction. The unit is the unit of the aggregator and would only be needed if the vectors have no entries. We could probably do without it. This could be a static function or could be a method on vectorA. Putting the method on vectorA would probably be better because it could drive many common optimizations. Examples of this pattern include sum-squared-difference (agg = plus, map = compose(sqr, minus)), dot (agg = plus, map = times). This can be composed with a temporary output vector or sometimes by mutating one of the operands. This is not as desirable as just accumulating the results on the fly, however. The reason why we need a specialized function is to do things in a nicely > mutating way: Hadoop M/R is functional in the Lisp sense: read-only > immutable objects (once on the filesystem). > We definitely need that too. > The only thing more we need than what we have now is in the assign method > - > currently we have it with a map, with reduce being the identity (with > replacement - > the calling object becomes the output of the reduce - ie the output of the > map): > That can work, but very often requires an extra copy of the vector as in the distance case that Robin brought up. The contract there says neither operand can be changed which forces a vector copy in the current API. A mapReduce operation in addition to a map would allow us to avoid that important case.
Re: Welcome Drew Farris
Welcome Drew! I've been using your excellent colloc code quite a bit in testing my svd stuff (produces nicely bigger vectors out of text!), looking forward to more cool stuff (NLP package! Bring it on! :) ). -jake On Thu, Feb 18, 2010 at 2:27 PM, Drew Farris wrote: > Hi Grant, fellow Mahouts, > > Thanks for the chance to join the team. I really look forward to > contributing my skills to the project and learning a great deal as > well. > > So, a little bit about myself; > > It all started with an Apple //+ back in 1982. Growing up, I never > thought I'd do something serious with computers. In college I studied > Computer Graphics in the Art School, Architecture and ended up getting > a Masters in Information Resource Management on top of that. > > Since then I've been a software developer who has brushed up > against information retrieval, search and NLP for many years. I got my > start in search and content management working as a web-developer for > a newspaper in the early days of the Internet. > > As Grant mentioned, I've worked at TextWise for a number of years. The > company grew out of an NLP-oriented research group headed by Liz Liddy > at Syracuse University and continues to focus on the commercial > applications of text-oriented technologies albeit with a more > statistical orientation as of late. > > While at TextWise, I've worked on projects ranging from > cross-language IR to contextual advertising. Mostly I've been involved > in developing the glue that holds the core algorithms together, > helping them scale and combining the various moving parts of a system > into a cohesive whole. I've had a chance to do everything from web > crawling, document processing, database, visualization, web-app and > distributed systems work. To that end, I've worked on and off with > Lucene, Nutch, and many other projects from the Apache ecosystem for > years. 
> > Reading "Programming Collective Intelligence" a couple years back > really solidified my interest in machine learning algorithms. After > building a number of different systems to process large amounts of > content, the ability to quickly and effortlessly scale things up with > hadoop/mapreduce really appeals to me. Thje Mahout project is > wonderful to me in that it combines the these things I'm interested in > personally, has relevance to the things I do for work and has a really > outstanding group of people working on it. > > I'm looking forward to working with you all, > > Drew > > On Thu, Feb 18, 2010 at 4:05 PM, Robin Anil wrote: > > Welcome Drew > > > > @Grant: No customary introduction? :) > > > > Robin > > > > On Fri, Feb 19, 2010 at 2:33 AM, Grant Ingersoll >wrote: > > > >> On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the > >> newest member of the Mahout committer family. Drew has been > contributing > >> some really nice work to Mahout in recent months and I look forward to > his > >> continuing involvement with Mahout. > >> > >> Congrats, Drew! > >> > >> > >> -Grant > > >
Re: Profiling SequentialAccessSparseVector
On Thu, Feb 18, 2010 at 3:58 PM, Ted Dunning wrote: > Actually, this makes the case that we should have something like: > > microMapReduce(aggregatorFunction, aggregatorUnit, binaryMapFunction, > vectorA, vectorB) > What would this method mean? aggregatorUnit means what? What would this be a method on? The reason why we need a specialized function is to do things in a nicely mutating way: Hadoop M/R is functional in the Lisp sense: read-only immutable objects (once on the filesystem). The only thing more we need than what we have now is in the assign method - currently we have it with a map, with reduce being the identity (with replacement - the calling object becomes the output of the reduce, i.e. the output of the map):

Vector.assign(Vector other, BinaryFunction map) {
  // implemented effectively as follows in AbstractVector
  Iterator<Element> it = sparse ? other.iterateNonZero() : other.iterateAll();
  while (it.hasNext()) {
    Element e = it.next();
    int i = e.index();
    setQuick(i, map.apply(getQuick(i), e.get()));
  }
  // do stuff with the reduce - what exactly?
  return this;
}

(is the reduce necessary?) -jake
Re: New to Mahout - question about the failed test cases
Darn. That uses up all of my ideas. I would vote for a platform issue. On Thu, Feb 18, 2010 at 3:41 PM, Anish Shah wrote: > I checked out revision 911542 (after removing the mahout-trunk from my > local > machine) and > tried again and still getting the same 2 failures upon running mvn clean > install! > -- Ted Dunning, CTO DeepDyve
Re: Profiling SequentialAccessSparseVector
Actually, this makes the case that we should have something like: microMapReduce(aggregatorFunction, aggregatorUnit, binaryMapFunction, vectorA, vectorB) The name should be changed after its rhetorical effect has worn off. As the Chukwa guys tend to say, it's turtles all the way down. We can have map-reduce inside map-reduce. On Thu, Feb 18, 2010 at 3:41 PM, Robin Anil wrote: > TODO: sum of minus to be optimised without having to hold the intermediate > vector. > -- Ted Dunning, CTO DeepDyve
Re: Profiling SequentialAccessSparseVector
Yes. addTo is just a specialization of a very, very common case. On Thu, Feb 18, 2010 at 1:06 PM, Sean Owen wrote: > Isn't this basically what assign() is for? > -- Ted Dunning, CTO DeepDyve
Re: Profiling SequentialAccessSparseVector
TODO: sum of minus to be optimised without having to hold the intermediate vector.
Re: New to Mahout - question about the failed test cases
Ted, I checked out revision 911542 (after removing the mahout-trunk from my local machine) and tried again and still getting the same 2 failures upon running mvn clean install! Anish On Thu, Feb 18, 2010 at 11:51 AM, Ted Dunning wrote: > Note the different version number here. > > I think that Anish has somehow gotten stuck on an old version. Anish, can > you do a clean checkout and build? > > On Thu, Feb 18, 2010 at 6:16 AM, Robin Anil wrote: > > > I am building Revision: 911405 on a Mac, and things work fine for me. I > am > > assuming same is the case for sean(mac) > > > > ... > > > > > > On Thu, Feb 18, 2010 at 6:11 PM, Anish Shah wrote: > > > > ... > > > $ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk > > > Checked out revision 911364. > > > > > > > > > -- > Ted Dunning, CTO > DeepDyve >
[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center
[ https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-297: -- Attachment: MAHOUT-297.patch Last one had a correctness problem in manhattan. This one fixes it. > Canopy and Kmeans clustering slows down on using SeqAccVector for center > > > Key: MAHOUT-297 > URL: https://issues.apache.org/jira/browse/MAHOUT-297 > Project: Mahout > Issue Type: Improvement > Components: Clustering >Affects Versions: 0.4 >Reporter: Robin Anil >Assignee: Robin Anil > Fix For: 0.4 > > Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch, > MAHOUT-297.patch, MAHOUT-297.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Welcome Drew Farris
We have already enjoyed working with you and look forward to more of it. Good to have you on board. On Thu, Feb 18, 2010 at 2:27 PM, Drew Farris wrote: > I'm looking forward to working with you all, -- Ted Dunning, CTO DeepDyve
Re: Welcome Drew Farris
Hi Grant, fellow Mahouts, Thanks for the chance to join the team. I really look forward to contributing my skills to the project and learning a great deal as well. So, a little bit about myself; It all started with an Apple //+ back in 1982. Growing up, I never thought I'd do something serious with computers. In college I studied Computer Graphics in the Art School, Architecture and ended up getting a Masters in Information Resource Management on top of that. Since then I've been a software developer who has brushed up against information retrieval, search and NLP for many years. I got my start in search and content management working as a web-developer for a newspaper in the early days of the Internet. As Grant mentioned, I've worked at TextWise for a number of years. The company grew out of an NLP-oriented research group headed by Liz Liddy at Syracuse University and continues to focus on the commercial applications of text-oriented technologies albeit with a more statistical orientation as of late. While at TextWise, I've worked on projects ranging from cross-language IR to contextual advertising. Mostly I've been involved in developing the glue that holds the core algorithms together, helping them scale and combining the various moving parts of a system into a cohesive whole. I've had a chance to do everything from web crawling, document processing, database, visualization, web-app and distributed systems work. To that end, I've worked on and off with Lucene, Nutch, and many other projects from the Apache ecosystem for years. Reading "Programming Collective Intelligence" a couple years back really solidified my interest in machine learning algorithms. After building a number of different systems to process large amounts of content, the ability to quickly and effortlessly scale things up with hadoop/mapreduce really appeals to me. 
The Mahout project is wonderful to me in that it combines the things I'm interested in personally, has relevance to the things I do for work and has a really outstanding group of people working on it. I'm looking forward to working with you all, Drew On Thu, Feb 18, 2010 at 4:05 PM, Robin Anil wrote: > Welcome Drew > > @Grant: No customary introduction? :) > > Robin > > On Fri, Feb 19, 2010 at 2:33 AM, Grant Ingersoll wrote: > >> On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the >> newest member of the Mahout committer family. Drew has been contributing >> some really nice work to Mahout in recent months and I look forward to his >> continuing involvement with Mahout. >> >> Congrats, Drew! >> >> >> -Grant >
Re: Profiling SequentialAccessSparseVector
I have made all the changes; take a look. The same could be done for fuzzy kmeans, dirichlet and lda. Haven't had time to look at the internals yet. On Fri, Feb 19, 2010 at 3:35 AM, Robin Anil wrote: > 2 second canopy clustering over reuters :D > > > > On Fri, Feb 19, 2010 at 3:33 AM, Robin Anil wrote: > >> This really doesn't work for me; I can't modify any vectors inside the distance >> measure. So I wrote the subtraction inside the Manhattan distance itself. Works >> great for now >> >> >> On Fri, Feb 19, 2010 at 3:10 AM, Jake Mannix wrote: >> >>> currentVector.assign(otherVector, minus) takes the other vector, and >>> subtracts >>> it from currentVector, which mutates currentVector. If currentVector is >>> DenseVector, >>> this is already optimized. It could be optimized if currentVector is >>> RandomAccessSparse. >>> >>> -jake >>> >>> On Thu, Feb 18, 2010 at 1:29 PM, Robin Anil >>> wrote: >>> >>> > Just to be clear, this does: >>> > currentVector-otherVector ? >>> > >>> > currentVector.assign(otherVector, Functions.minus); >>> > >>> > >>> > >>> > On Fri, Feb 19, 2010 at 2:57 AM, Jake Mannix >>> > wrote: >>> > >>> > > to do subtractFrom, you can instead just do >>> > > >>> > > Vector.assign(otherVector, Functions.minus); >>> > > >>> > > The problem is that while DenseVector has an optimization here: if >>> the >>> > > BinaryFunction passed in is additive (it's an instance of PlusMult), >>> > > sparse iteration over "otherVector" is executed, applying the binary >>> > > function and mutating self. AbstractVector should have this >>> optimization >>> > > in general, as it would be useful in RandomAccessSparseVector >>> (although >>> > > not terribly useful in SequentialAccessSparseVector, but still better >>> > than >>> > > current). >>> > > >>> > > -jake >>> > > >>> > > On Thu, Feb 18, 2010 at 1:19 PM, Robin Anil >>> > wrote: >>> > > >>> > > > I just had to change it at one place (and the tests pass, which is >>> > scary). >>> > > > Canopy is really fast now :). 
Still could be pushed >>> > > > Now the bottleneck is minus >>> > > > >>> > > > maybe a subtractFrom on the lines of addTo? or a mutable negate >>> > function >>> > > > for >>> > > > vector, before adding to >>> > > > >>> > > > Robin >>> > > > >>> > > > >>> > > > >>> > > > On Fri, Feb 19, 2010 at 2:43 AM, Jake Mannix < >>> jake.man...@gmail.com> >>> > > > wrote: >>> > > > >>> > > > > I use it (addTo) in decomposer, for exactly this performance >>> issue. >>> > > > > Changing >>> > > > > plus into addTo requires care, because since plus() leaves >>> arguments >>> > > > > immutable, >>> > > > > there may be code which *assumes* that this is the case, and >>> doing >>> > > > addTo() >>> > > > > leaves side effects which might not be expected. This bit me >>> hard on >>> > > svd >>> > > > > migration, because I had other assumptions about mutability in >>> there. >>> > > > > >>> > > > > -jake >>> > > > > >>> > > > > On Thu, Feb 18, 2010 at 1:09 PM, Robin Anil < >>> robin.a...@gmail.com> >>> > > > wrote: >>> > > > > >>> > > > > > ah! Its not being used anywhere :). Should we make that a big >>> task >>> > > > before >>> > > > > > 0.3 ? Sweep through code(mainly clustering) and change all >>> these >>> > > > things. >>> > > > > > >>> > > > > > Robin >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen >>> > wrote: >>> > > > > > >>> > > > > > > Isn't this basically what assign() is for? >>> > > > > > > >>> > > > > > > On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil < >>> > robin.a...@gmail.com> >>> > > > > > wrote: >>> > > > > > > > Now the big perf bottle neck is immutability >>> > > > > > > > >>> > > > > > > > Say for plus its doing vector.clone() before doing anything >>> > else. >>> > > > > > > > There should be both immutable and mutable plus functions >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> >> >> >
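The "subtract inside the distance measure" workaround discussed above amounts to accumulating |a[i] - b[i]| on the fly instead of materializing currentVector.minus(otherVector). A minimal dense sketch with plain double[] in place of Mahout Vectors (purely illustrative, not the committed code):

```java
// Sketch of Manhattan distance computed without an intermediate difference
// vector: the subtraction happens inside the accumulation loop, so no
// temporary vector is allocated and no operand is mutated.
class ManhattanSketch {
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += Math.abs(a[i] - b[i]);   // subtract-and-accumulate in one pass
    }
    return sum;
  }
}
```

A sparse version would additionally iterate only the union of non-zero entries, which is where the real savings in Canopy come from.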
[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center
[ https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-297:
------------------------------

    Attachment: MAHOUT-297.patch

Changed the Euclidean distance measure to use v2 to iterate and v1 to access.

> Canopy and Kmeans clustering slows down on using SeqAccVector for center
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-297
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-297
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.4
>
>         Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch
>

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
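The patch comment above ("use v2 to iterate and v1 to access") can be illustrated with a minimal sketch: iterate only the sparse point's nonzeros while randomly accessing the center, using the identity d^2 = |v1|^2 + |v2|^2 - 2 * (v1 . v2). The class name and the array/map representation below are hypothetical stand-ins, not the actual Mahout code.

```java
import java.util.Map;

// Hypothetical sketch: squared Euclidean distance between a dense-ish
// center (v1, random access) and a sparse point (v2, iterated), so the
// work is proportional to the point's nonzero count, not the dimension.
final class EuclideanSketch {
  // v1: center as double[]; v2: sparse point as index -> value
  static double distanceSquared(double[] v1, Map<Integer, Double> v2) {
    double v1NormSq = 0.0;
    for (double x : v1) {
      v1NormSq += x * x;            // in practice this would be cached per center
    }
    double v2NormSq = 0.0;
    double dot = 0.0;
    for (Map.Entry<Integer, Double> e : v2.entrySet()) {
      double val = e.getValue();
      v2NormSq += val * val;
      dot += v1[e.getKey()] * val;  // O(1) random access into the center
    }
    return v1NormSq + v2NormSq - 2.0 * dot;
  }
}
```

With |v1|^2 cached per center, each distance computation touches only the point's nonzero entries.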
[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center
[ https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-297:
------------------------------

    Attachment: MAHOUT-297.patch

Improvements in TanimotoDistanceMeasure.
[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center
[ https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-297:
------------------------------

    Attachment: MAHOUT-297.patch

Really fast now.
Re: Profiling SequentialAccessSparseVector
2-second canopy clustering over Reuters :D

On Fri, Feb 19, 2010 at 3:33 AM, Robin Anil wrote:
> This really doesn't work for me; I can't modify any vectors inside a
> distance measure. So I have written a subtract inside the Manhattan
> distance itself. Works great for now.
Re: Profiling SequentialAccessSparseVector
This really doesn't work for me; I can't modify any vectors inside a distance measure. So I have written a subtract inside the Manhattan distance itself. It works great for now.

On Fri, Feb 19, 2010 at 3:10 AM, Jake Mannix wrote:
> currentVector.assign(otherVector, minus) takes the other vector and
> subtracts it from currentVector, which mutates currentVector. If
> currentVector is a DenseVector, this is already optimized. It could be
> optimized if currentVector is RandomAccessSparse.
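Robin's "subtract inside the Manhattan distance itself" can be sketched as a single merge pass over two sorted sparse vectors, accumulating |a_i - b_i| directly with no cloned intermediate vector. The parallel index/value arrays below are an illustrative stand-in for SequentialAccessSparseVector, not the actual patch.

```java
// Hypothetical sketch: Manhattan (L1) distance between two sequential-access
// sparse vectors, each represented as sorted parallel arrays of indices and
// values. One merge pass, no temporary "a minus b" vector allocated.
final class ManhattanSketch {
  static double distance(int[] ia, double[] va, int[] ib, double[] vb) {
    double sum = 0.0;
    int i = 0;
    int j = 0;
    while (i < ia.length && j < ib.length) {
      if (ia[i] == ib[j]) {          // index present in both vectors
        sum += Math.abs(va[i] - vb[j]);
        i++;
        j++;
      } else if (ia[i] < ib[j]) {    // nonzero only in a
        sum += Math.abs(va[i++]);
      } else {                       // nonzero only in b
        sum += Math.abs(vb[j++]);
      }
    }
    while (i < ia.length) sum += Math.abs(va[i++]);  // leftover tail of a
    while (j < ib.length) sum += Math.abs(vb[j++]);  // leftover tail of b
    return sum;
  }
}
```

The cost is O(nnz(a) + nnz(b)), which is why it beats cloning one vector and subtracting the other into it.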
Re: Profiling SequentialAccessSparseVector
currentVector.assign(otherVector, minus) takes the other vector and subtracts it from currentVector, which mutates currentVector. If currentVector is a DenseVector, this is already optimized. It could also be optimized if currentVector is RandomAccessSparse.

-jake

On Thu, Feb 18, 2010 at 1:29 PM, Robin Anil wrote:
> Just to be clear, does this compute currentVector - otherVector?
>
> currentVector.assign(otherVector, Functions.minus);
Re: Welcome Drew Farris
On Feb 18, 2010, at 4:05 PM, Robin Anil wrote:
> Welcome Drew
>
> @Grant: No customary introduction? :)

Sorry, I forgot that. Drew, the tradition is that new committers give a little background on themselves. I can add one tidbit: I worked with Drew way back at TextWise, so I'm glad he showed up here!
Re: Profiling SequentialAccessSparseVector
Just to be clear, does this compute currentVector - otherVector?

currentVector.assign(otherVector, Functions.minus);

On Fri, Feb 19, 2010 at 2:57 AM, Jake Mannix wrote:
> To do subtractFrom, you can instead just do
>
> Vector.assign(otherVector, Functions.minus);
Re: Profiling SequentialAccessSparseVector
To do subtractFrom, you can instead just do

Vector.assign(otherVector, Functions.minus);

The problem is that only DenseVector has an optimization here: if the BinaryFunction passed in is additive (an instance of PlusMult), sparse iteration over "otherVector" is executed, applying the binary function and mutating self. AbstractVector should have this optimization in general, as it would be useful in RandomAccessSparseVector (although not terribly useful in SequentialAccessSparseVector, but still better than the current behavior).

-jake

On Thu, Feb 18, 2010 at 1:19 PM, Robin Anil wrote:
> I just had to change it in one place (and the tests pass, which is scary).
> Canopy is really fast now :). It could still be pushed further.
> Now the bottleneck is minus.
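The optimization Jake describes can be sketched as follows. Because f(x, 0) = x for plus and minus, entries of `self` where `other` is zero are unchanged, so it suffices to iterate only `other`'s nonzeros. The names and types below are illustrative stand-ins, not Mahout's actual API.

```java
import java.util.Map;
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch of "this = f(this, other)" applied in place with
// sparse iteration: when f(x, 0) == x (true for plus and minus), only the
// indices where `other` is nonzero need to be touched.
final class AssignSketch {
  // self: dense target vector (mutated); otherNonzeros: index -> value
  static void assign(double[] self, Map<Integer, Double> otherNonzeros,
                     DoubleBinaryOperator f) {
    for (Map.Entry<Integer, Double> e : otherNonzeros.entrySet()) {
      int idx = e.getKey();
      self[idx] = f.applyAsDouble(self[idx], e.getValue());  // mutates self
    }
  }
}
```

The cost is O(nnz(other)) rather than O(dimension), which is exactly what makes the DenseVector fast path worth generalizing.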
Re: Profiling SequentialAccessSparseVector
I just had to change it in one place (and the tests pass, which is scary). Canopy is really fast now :). It could still be pushed further; now the bottleneck is minus.

Maybe a subtractFrom along the lines of addTo? Or a mutable negate function for Vector, before adding?

Robin

On Fri, Feb 19, 2010 at 2:43 AM, Jake Mannix wrote:
> I use it (addTo) in decomposer, for exactly this performance issue.
Re: Profiling SequentialAccessSparseVector
I use it (addTo) in decomposer, for exactly this performance issue. Changing plus into addTo requires care: since plus() leaves its arguments immutable, there may be code which *assumes* that this is the case, and addTo() leaves side effects which might not be expected. This bit me hard on the SVD migration, because I had other assumptions about mutability in there.

-jake

On Thu, Feb 18, 2010 at 1:09 PM, Robin Anil wrote:
> Ah! It's not being used anywhere :). Should we make that a big task before
> 0.3? Sweep through the code (mainly clustering) and change all these things.
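The plus()/addTo() trade-off under discussion can be sketched with plain arrays standing in for vectors: the immutable version clones before adding, while the mutable version writes into an existing target and therefore aliases it (the side effect Jake warns about). This is an illustrative sketch, not Mahout's implementation.

```java
// Hypothetical sketch of immutable plus() versus mutable addTo().
final class PlusSketch {
  // Immutable: allocates and copies, leaves both arguments untouched.
  static double[] plus(double[] a, double[] b) {
    double[] result = a.clone();                 // the clone() hot spot
    for (int i = 0; i < b.length; i++) {
      result[i] += b[i];
    }
    return result;
  }

  // Mutable: no allocation, but callers must expect `target` to change.
  static void addTo(double[] source, double[] target) {
    for (int i = 0; i < source.length; i++) {
      target[i] += source[i];
    }
  }
}
```

Swapping plus() for addTo() in a loop removes one allocation and copy per iteration, at the price of callers having to know the target is mutated.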
Re: Profiling SequentialAccessSparseVector
Ah! It's not being used anywhere :). Should we make that a big task before 0.3? Sweep through the code (mainly clustering) and change all these things.

Robin

On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen wrote:
> Isn't this basically what assign() is for?
Re: Profiling SequentialAccessSparseVector
addTo() is mutable plus.

On Thu, Feb 18, 2010 at 1:04 PM, Robin Anil wrote:
> Now the big perf bottleneck is immutability.
>
> Say, for plus it's doing vector.clone() before doing anything else.
> There should be both immutable and mutable plus functions.
Re: Profiling SequentialAccessSparseVector
Isn't this basically what assign() is for?

On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil wrote:
> Now the big perf bottleneck is immutability.
>
> Say, for plus it's doing vector.clone() before doing anything else.
> There should be both immutable and mutable plus functions.
Re: Welcome Drew Farris
Welcome Drew!

@Grant: No customary introduction? :)

Robin

On Fri, Feb 19, 2010 at 2:33 AM, Grant Ingersoll wrote:
> On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the
> newest member of the Mahout committer family. Drew has been contributing
> some really nice work to Mahout in recent months, and I look forward to
> his continuing involvement with Mahout.
>
> Congrats, Drew!
>
> -Grant
Re: Profiling SequentialAccessSparseVector
If it's as obvious a win as it sounds, I'd say 0.3. We aren't in lockdown yet, are we?

-Grant

On Feb 18, 2010, at 3:37 PM, Jake Mannix wrote:
> I dunno, we can file it for whenever (0.4), and if it turns out it's a
> really easy change we can always commit it for 0.3.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Profiling SequentialAccessSparseVector
Now the big perf bottleneck is immutability.

Say, for plus it's doing vector.clone() before doing anything else. There should be both immutable and mutable plus functions.

Robin

On Fri, Feb 19, 2010 at 2:07 AM, Jake Mannix wrote:
> I dunno, we can file it for whenever (0.4), and if it turns out it's a
> really easy change we can always commit it for 0.3.
Welcome Drew Farris
On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the newest member of the Mahout committer family. Drew has been contributing some really nice work to Mahout in recent months and I look forward to his continuing involvement with Mahout. Congrats, Drew! -Grant
[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center
[ https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-297:
------------------------------

    Attachment: MAHOUT-297.patch

Converts centers to random access on first creation.
[jira] Created: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center
Canopy and Kmeans clustering slows down on using SeqAccVector for center
------------------------------------------------------------------------

                 Key: MAHOUT-297
                 URL: https://issues.apache.org/jira/browse/MAHOUT-297
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
    Affects Versions: 0.4
            Reporter: Robin Anil
            Assignee: Robin Anil
             Fix For: 0.4
         Attachments: MAHOUT-297.patch
Re: Profiling SequentialAccessSparseVector
I dunno, we can file it for whenever (0.4), and if it turns out it's a really easy change we can always commit it for 0.3.

-jake

On Thu, Feb 18, 2010 at 12:29 PM, Robin Anil wrote:
> File it for 0.3?
Re: Profiling SequentialAccessSparseVector
File it for 0.3?

Robin

On Fri, Feb 19, 2010 at 1:56 AM, Jake Mannix wrote:
> SequentialAccessSparseVector should only be used in a read-only fashion.
> If you are creating an average centroid which is sparse but mutating, it
> should be a RandomAccessSparseVector.
Re: Profiling SequentialAccessSparseVector
On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil wrote:

> I was trying out SeqAccessSparseVector on Canopy Clustering using Manhattan
> distance. I found performance to be really bad. So I profiled it with
> Yourkit(Thanks a lot for providing us free license)
>
> Since i was trying out manhattan distance, there were a lot of A-B which
> created a lot of clone operation 5% of the total time
> there were also so many A+B for adding a point to the canopy to average.
> this was also creating a lot of clone operations. 90% of the total time

SequentialAccessSparseVector should only be used in a read-only fashion. If you are creating an average centroid which is sparse, but it is mutating, then it should be RandomAccessSparseVector. The points which are being used to create it can be SequentialAccessSparseVector (if they themselves never change), but then the method called should be SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this exploits the fast sequential iteration of SeqAcc, and the fast random-access mutability of RandAcc.

> So we definitely needs to improve that..
>
> For a small hack. I made the cluster centers RandomAccess Vector. Things
> are fast again. I dont know whether to commit or not. But something to look
> into in 0.4?

Yeah, cluster *centers* should indeed be RandomAccess. JIRA / patch so we can see exactly what the change is?

-jake
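Jake's advice can be illustrated with a small toy sketch. This is not Mahout's actual API — the SeqSparse class and the Map-based accumulator below are hypothetical stand-ins — but it shows the intended pattern: iterate the read-only sequential-access points in sorted order and accumulate into a random-access (hash-based) centroid, so no intermediate clones are created the way repeated A.plus(B) calls would be.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy illustration (NOT Mahout's real classes) of why a sequential-access
 * sparse vector is good for iteration but bad for mutation, and why the
 * "addTo a random-access accumulator" pattern avoids the clone-heavy
 * A.plus(B) path seen in the profile.
 */
public class SparseVectorSketch {

    /** Read-only style: parallel sorted arrays, cheap to iterate in order. */
    static class SeqSparse {
        final int[] indices;
        final double[] values;

        SeqSparse(int[] indices, double[] values) {
            this.indices = indices;
            this.values = values;
        }

        /** Mutates the accumulator in place -- no clone of either side. */
        void addTo(Map<Integer, Double> accumulator) {
            for (int i = 0; i < indices.length; i++) {
                accumulator.merge(indices[i], values[i], Double::sum);
            }
        }
    }

    public static void main(String[] args) {
        SeqSparse p1 = new SeqSparse(new int[]{0, 3}, new double[]{1.0, 2.0});
        SeqSparse p2 = new SeqSparse(new int[]{3, 7}, new double[]{4.0, 8.0});

        // Random-access accumulator (stands in for RandomAccessSparseVector).
        Map<Integer, Double> centroid = new HashMap<>();
        p1.addTo(centroid);
        p2.addTo(centroid);

        System.out.println(centroid.get(3)); // 6.0
    }
}
```

The key design point is asymmetry: the per-point structure is optimized for one pass over its non-zeros, while the mutating centroid pays O(1) per update instead of shifting a sorted array.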
Re: Profiling SequentialAccessSparseVector
Sorry about the attachment; see it here: http://yfrog.com/4epicture1pfp

On Fri, Feb 19, 2010 at 1:25 AM, Robin Anil wrote:

> I was trying out SeqAccessSparseVector on Canopy Clustering using Manhattan
> distance. I found performance to be really bad. So I profiled it with
> Yourkit(Thanks a lot for providing us free license)
>
> Since i was trying out manhattan distance, there were a lot of A-B which
> created a lot of clone operation 5% of the total time
> there were also so many A+B for adding a point to the canopy to average.
> this was also creating a lot of clone operations. 90% of the total time
>
> So we definitely needs to improve that..
>
> For a small hack. I made the cluster centers RandomAccess Vector. Things
> are fast again. I dont know whether to commit or not. But something to look
> into in 0.4?
>
> Robin
Profiling SequentialAccessSparseVector
I was trying out SeqAccessSparseVector on Canopy Clustering using Manhattan distance. I found performance to be really bad, so I profiled it with YourKit (thanks a lot for providing us a free license).

Since I was using Manhattan distance, there were a lot of A-B operations, which created a lot of clones (5% of the total time). There were also many A+B operations for adding a point to the canopy average; these were also creating a lot of clones (90% of the total time).

So we definitely need to improve that.

As a small hack, I made the cluster centers RandomAccess vectors, and things are fast again. I don't know whether to commit or not, but something to look into in 0.4?

Robin
Re: New to Mahout - question about the failed test cases
Note the different version number here. I think that Anish has somehow gotten stuck on an old version. Anish, can you do a clean checkout and build? On Thu, Feb 18, 2010 at 6:16 AM, Robin Anil wrote: > I am building Revision: 911405 on a Mac, and things work fine for me. I am > assuming same is the case for sean(mac) > > ... > > > On Thu, Feb 18, 2010 at 6:11 PM, Anish Shah wrote: > > ... > > $ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk > > Checked out revision 911364. > > > -- Ted Dunning, CTO DeepDyve
[jira] Closed: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key
[ https://issues.apache.org/jira/browse/MAHOUT-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil closed MAHOUT-296. - > TestClassifier takes correctLabel from filename instead of from the key > --- > > Key: MAHOUT-296 > URL: https://issues.apache.org/jira/browse/MAHOUT-296 > Project: Mahout > Issue Type: Bug >Affects Versions: 0.3 >Reporter: Robin Anil >Assignee: Robin Anil > Fix For: 0.3 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key
[ https://issues.apache.org/jira/browse/MAHOUT-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-296. --- Resolution: Fixed > TestClassifier takes correctLabel from filename instead of from the key > --- > > Key: MAHOUT-296 > URL: https://issues.apache.org/jira/browse/MAHOUT-296 > Project: Mahout > Issue Type: Bug >Affects Versions: 0.3 >Reporter: Robin Anil >Assignee: Robin Anil > Fix For: 0.3 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h >
[jira] Work started: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key
[ https://issues.apache.org/jira/browse/MAHOUT-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on MAHOUT-296 started by Robin Anil. > TestClassifier takes correctLabel from filename instead of from the key > --- > > Key: MAHOUT-296 > URL: https://issues.apache.org/jira/browse/MAHOUT-296 > Project: Mahout > Issue Type: Bug >Affects Versions: 0.3 >Reporter: Robin Anil >Assignee: Robin Anil > Fix For: 0.3 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h >
[jira] Created: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key
TestClassifier takes correctLabel from filename instead of from the key --- Key: MAHOUT-296 URL: https://issues.apache.org/jira/browse/MAHOUT-296 Project: Mahout Issue Type: Bug Affects Versions: 0.3 Reporter: Robin Anil Assignee: Robin Anil Fix For: 0.3
Re: New to Mahout - question about the failed test cases
I am building Revision: 911405 on a Mac, and things work fine for me. I am assuming the same is the case for Sean (Mac).

One reason I assume an error could come up is that on Windows the output directories don't get deleted if the filesystem locks them (for reasons I cannot fathom). Could you try deleting the testdata and output directories that get created, and try again? Those directories could have some data used by the KMeans test that is deleted before the canopy test.

Other than that, if you are building Mahout for usage, do: mvn clean install -DskipTests=true

Robin

On Thu, Feb 18, 2010 at 6:11 PM, Anish Shah wrote:

> I tried again by first syncing from the trunk and running mvn install using
> the following
> and getting the same test failures! I am running this on Windows 7 machine
> using Cigwin.
>
> $ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk
> Checked out revision 911364.
>
> $ mvn clean install
> .. lots of junit test successes and the following failure
>
> Results :
>
> Failed tests:
> testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
>
> testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
>
> Tests run: 338, Failures: 2, Errors: 0, Skipped: 0
>
> On Thu, Feb 18, 2010 at 5:40 AM, Robin Anil wrote:
>
> > Yeah me neither. Could you try syncing from the trunk
> >
> > On Thu, Feb 18, 2010 at 4:08 PM, Sean Owen wrote:
> >
> > > I'm not seeing any such failures myself, from head.
> > >
> > > In case Robin just fixed something, try again from head?
> > >
> > > On Wed, Feb 17, 2010 at 11:04 PM, Anish Shah wrote:
> > > > Hi,
> > > >
> > > > I am new to Mahout and going through the initial steps of setting the
> > > > development environment on my machine. I checked out the latest code
> > > > from trunk and seeing the following failed tests when I ran
> > > > mvn clean install:
Re: New to Mahout - question about the failed test cases
I tried again by first syncing from the trunk and running mvn install as follows, and I am getting the same test failures! I am running this on a Windows 7 machine using Cygwin.

$ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk
Checked out revision 911364.

$ mvn clean install
.. lots of junit test successes and the following failure

Results :

Failed tests:
testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmeansClustering)

Tests run: 338, Failures: 2, Errors: 0, Skipped: 0

On Thu, Feb 18, 2010 at 5:40 AM, Robin Anil wrote:

> Yeah me neither. Could you try syncing from the trunk
>
> On Thu, Feb 18, 2010 at 4:08 PM, Sean Owen wrote:
>
> > I'm not seeing any such failures myself, from head.
> >
> > In case Robin just fixed something, try again from head?
> >
> > On Wed, Feb 17, 2010 at 11:04 PM, Anish Shah wrote:
> > > Hi,
> > >
> > > I am new to Mahout and going through the initial steps of setting the
> > > development environment on my machine. I checked out the latest code
> > > from trunk and seeing the following failed tests when I ran
> > > mvn clean install:
Re: Fuzzy K Means
Yeah, by killing the job in between, I find all the centers to be the same :( which is the main problem. On Thu, Feb 18, 2010 at 5:51 PM, Jeff Eastman wrote: > It sounds like k-means is looping because point memberships are oscillating > between two stable states. Try increasing the convergence delta and you will > likely terminate. > > > Robin Anil wrote: > >> Yeah, Canopy issue is sorted out. Was thinking of adding a flag to add >> point >> to a single canopy instead of adding it to all canopies. This would help a >> lot on large datasets. There is no point of adding to all canopies, you >> will >> get approximate clustering anyways >> >> I have cleaned up most of SoftCluster. Still the error exists. It seems to >> be looping forever now. I will post a patch on the issue take please take >> a >> look >> >> Robin >> >> On Wed, Feb 17, 2010 at 3:35 PM, Jeff Eastman > >wrote: >> >> >> >>> Robin Anil wrote: >>> >>> >>> Hadoop reuses the *same* instance whenever it uses readFields and I've > been > bitten more than once by assuming otherwise. > > > > Yep!. Thats our bug. Always assume mutability in Hadoop :) . I will see the where the writable is causing the error. Best is if we could have some test data and make a check to see if the algorithm is working. >>> Good hunting. I notice that some of the code in the fuzzy MR unit test >>> has >>> been commented out but have not looked into it further. >>> >>> I assume also you have sorted out the canopy issue you were having? >>> >>> Jeff
Re: Fuzzy K Means
It sounds like k-means is looping because point memberships are oscillating between two stable states. Try increasing the convergence delta and you will likely terminate.

Robin Anil wrote:

Yeah, Canopy issue is sorted out. Was thinking of adding a flag to add point to a single canopy instead of adding it to all canopies. This would help a lot on large datasets. There is no point of adding to all canopies, you will get approximate clustering anyways

I have cleaned up most of SoftCluster. Still the error exists. It seems to be looping forever now. I will post a patch on the issue take please take a look

Robin

On Wed, Feb 17, 2010 at 3:35 PM, Jeff Eastman wrote:

Robin Anil wrote:

Hadoop reuses the *same* instance whenever it uses readFields and I've been bitten more than once by assuming otherwise.

Yep!. Thats our bug. Always assume mutability in Hadoop :) . I will see the where the writable is causing the error. Best is if we could have some test data and make a check to see if the algorithm is working.

Good hunting. I notice that some of the code in the fuzzy MR unit test has been commented out but have not looked into it further.

I assume also you have sorted out the canopy issue you were having?

Jeff
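The "Hadoop reuses the *same* instance" pitfall mentioned in this thread is worth a concrete illustration. The sketch below has no Hadoop dependency — Record and readFieldsFrom() are hypothetical stand-ins for a Writable and its readFields() — but it reproduces the bug class: if you store a reference to the reused instance instead of a copy, every stored record silently aliases the last one read.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy demonstration (no Hadoop dependency) of the instance-reuse pitfall:
 * Hadoop deserializes into the *same* object across readFields() calls,
 * so storing references without copying aliases every record.
 */
public class WritableReuseSketch {

    static class Record {
        double[] center;

        /** Stands in for readFields(): overwrites this instance in place. */
        void readFieldsFrom(double[] data) {
            this.center = data.clone();
        }

        /** Defensive deep copy, the fix for the aliasing bug. */
        Record copy() {
            Record r = new Record();
            r.center = center.clone();
            return r;
        }
    }

    public static void main(String[] args) {
        double[][] input = { {1.0, 1.0}, {5.0, 5.0} };

        // Buggy pattern: keep a reference to the reused instance.
        Record reused = new Record();
        List<Record> aliased = new ArrayList<>();
        for (double[] row : input) {
            reused.readFieldsFrom(row);
            aliased.add(reused);          // every element is the same object!
        }
        System.out.println(aliased.get(0).center[0]); // 5.0 -- first record clobbered

        // Correct pattern: defensively copy before storing.
        List<Record> copied = new ArrayList<>();
        for (double[] row : input) {
            reused.readFieldsFrom(row);
            copied.add(reused.copy());
        }
        System.out.println(copied.get(0).center[0]); // 1.0 -- preserved
    }
}
```

This is exactly why "all the centers come out the same" is a classic symptom: each cluster ends up holding a reference to the one reused buffer.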
Re: Fuzzy K Means
+1 Looks like what I did too.

Robin Anil wrote:

I am pasting the patch for SoftCluster here..

Index: core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/SoftCluster.java
===
--- core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/SoftCluster.java (revision 910924)
+++ core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/SoftCluster.java (working copy)
@@ -21,21 +21,14 @@
 import java.io.DataOutput;
 import java.io.IOException;

-import org.apache.hadoop.io.Writable;
+import org.apache.mahout.clustering.ClusterBase;
 import org.apache.mahout.math.AbstractVector;
-import org.apache.mahout.math.RandomAccessSparseVector;
 import org.apache.mahout.math.Vector;
 import org.apache.mahout.math.VectorWritable;
 import org.apache.mahout.math.function.SquareRootFunction;

-public class SoftCluster implements Writable {
-
-  // this cluster's clusterId
-  private int clusterId;
-
-  // the current center
-  private Vector center = new RandomAccessSparseVector(0);
-
+public class SoftCluster extends ClusterBase{
+
   // the current centroid is lazy evaluated and may be null
   private Vector centroid = null;

@@ -90,7 +83,7 @@
   @Override
   public void write(DataOutput out) throws IOException {
-    out.writeInt(clusterId);
+    out.writeInt(this.getId());
     out.writeBoolean(converged);
     Vector vector = computeCentroid();
     VectorWritable.writeVector(out, vector);
@@ -98,13 +91,13 @@
   @Override
   public void readFields(DataInput in) throws IOException {
-    clusterId = in.readInt();
+    this.setId(in.readInt());
     converged = in.readBoolean();
     VectorWritable temp = new VectorWritable();
     temp.readFields(in);
-    center = temp.get();
+    this.setCenter(temp.get());
     this.pointProbSum = 0;
-    this.weightedPointTotal = center.like();
+    this.weightedPointTotal = getCenter().like();
   }

@@ -112,6 +105,7 @@
    *
    * @return the new centroid
    */
+  @Override
   public Vector computeCentroid() {
     if (pointProbSum == 0) {
       return weightedPointTotal;
@@ -132,7 +126,7 @@
    * the center point
    */
   public SoftCluster(Vector center) {
-    this.center = center;
+    setCenter(center);
     this.pointProbSum = 0;
     this.weightedPointTotal = center.like();

@@ -145,8 +139,8 @@
    * the center point
    */
   public SoftCluster(Vector center, int clusterId) {
-    this.clusterId = clusterId;
-    this.center = center;
+    this.setId(clusterId);
+    this.setCenter(center);
     this.pointProbSum = 0;
     this.weightedPointTotal = center.like();
   }

@@ -154,7 +148,7 @@
   /** Construct a new softcluster with the given clusterID */
   public SoftCluster(String clusterId) {
-    this.clusterId = Integer.parseInt(clusterId.substring(1));
+    this.setId(Integer.parseInt(clusterId.substring(1)));
     this.pointProbSum = 0;
     // this.weightedPointTotal = center.like();
     this.converged = clusterId.charAt(0) == 'V';
@@ -162,14 +156,15 @@
   @Override
   public String toString() {
-    return getIdentifier() + " - " + center.asFormatString();
+    return getIdentifier() + " - " + getCenter().asFormatString();
   }

+  @Override
   public String getIdentifier() {
     if (converged) {
-      return "V" + clusterId;
+      return "V" + this.getId();
     } else {
-      return "C" + clusterId;
+      return "C" + this.getId();
     }
   }

@@ -212,7 +207,7 @@
     centroid = null;
     pointProbSum += ptProb;
     if (weightedPointTotal == null) {
-      weightedPointTotal = point.clone().times(ptProb);
+      weightedPointTotal = point.times(ptProb);
     } else {
       weightedPointTotal = weightedPointTotal.plus(point.times(ptProb));
     }
@@ -234,19 +229,15 @@
     }
   }

-  public Vector getCenter() {
-    return center;
-  }
-
   public double getPointProbSum() {
     return pointProbSum;
   }

   /** Compute the centroid and set the center to it. */
   public void recomputeCenter() {
-    center = computeCentroid();
+    this.setCenter(computeCentroid());
     pointProbSum = 0;
-    weightedPointTotal = center.like();
+    weightedPointTotal = getCenter().like();
   }

   public Vector getWeightedPointTotal() {
@@ -265,8 +256,9 @@
     this.converged = converged;
   }

-  public int getClusterId() {
-    return clusterId;
+  @Override
+  public String asFormatString() {
+    return formatCluster(this);
   }
 }
Re: Fuzzy K Means
Very similar, especially when you consider that k-means only adds the whole point value to the single, closest cluster (i.e. weightedPointTotal += 1), whereas fuzzy adds it partially to all.

I don't think the other clustering routines require/expect numPoints to be an integer and the instvar could probably be generalized to double weightedPointTotal without impact. Perhaps better to consider that change separately, as there are a number of tests which compare getNumPoints() with an integer value and would have to be adjusted. Likely it would be just adding an (int) cast as the values in non-fuzzy tests would always be whole numbers.

Pallavi Palleti wrote:

Yes. But not the total number of points. So, the numpoints from ClusterBase will not be used in SoftCluster. numpoints is specific to Kmeans similar to weightedpoint total for fuzzy kmeans.

Robin Anil wrote:

the center is still the averaged out centroid right? weightedtotalvector/totalprobWeight

On Wed, Feb 17, 2010 at 5:10 PM, Pallavi Palleti <pallavi.pall...@corp.aol.com> wrote:

I haven't yet gone thru ClusterDumper. However, ClusterBase would be having number of points to average out (pointTotal/numPoints as per kmeans) where as SoftCluster will have weighted point total. So, I am wondering how can we reuse ClusterBase here?

Thanks
Pallavi

Robin Anil wrote:

yes. So that cluster dumper can print it out.

On Wed, Feb 17, 2010 at 5:02 PM, Pallavi Palleti <pallavi.pall...@corp.aol.com> wrote:

Hi Robin, when you meant by reusing ClusterBase, are you planning to extend ClusterBase in SoftCluster? For example, SoftCluster extends ClusterBase?

Thanks
Pallavi

Robin Anil wrote:

I have been trying to convert FuzzyKMeans SoftCluster (which should ideally be named FuzzyKmeansCluster) to use the ClusterBase. I am getting *the same center* for all the clusters. To aid the conversion all i did was remove the center vector from the SoftCluster class and reuse the same from the ClusterBase.
These changes essentially make no difference in the tests, which pass correctly. So I am questioning whether the implementation keeps the average center at all. Has anyone who has used FuzzyKMeans experienced this?

Robin
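For reference, the centroid bookkeeping being debated in this thread (Robin's "weightedtotalvector/totalprobWeight") can be sketched in a few lines. This is a simplified, hypothetical stand-in, not the real SoftCluster class: each point is added to the cluster weighted by its membership probability, and the centroid is weightedPointTotal / pointProbSum.

```java
/**
 * Minimal sketch of the fuzzy-k-means centroid update: every point
 * contributes to the cluster in proportion to its membership probability,
 * and the centroid is the probability-weighted average.
 */
public class FuzzyCentroidSketch {
    final double[] weightedPointTotal;
    double pointProbSum = 0.0;

    FuzzyCentroidSketch(int dims) {
        weightedPointTotal = new double[dims];
    }

    /** Accumulate point.times(ptProb), as in SoftCluster.addPoint. */
    void addPoint(double[] point, double prob) {
        pointProbSum += prob;
        for (int i = 0; i < point.length; i++) {
            weightedPointTotal[i] += point[i] * prob;
        }
    }

    /** centroid = weightedPointTotal / pointProbSum */
    double[] computeCentroid() {
        double[] c = new double[weightedPointTotal.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = weightedPointTotal[i] / pointProbSum;
        }
        return c;
    }

    public static void main(String[] args) {
        FuzzyCentroidSketch cluster = new FuzzyCentroidSketch(2);
        cluster.addPoint(new double[]{0.0, 0.0}, 0.75); // mostly this cluster
        cluster.addPoint(new double[]{4.0, 4.0}, 0.25); // mostly elsewhere
        System.out.println(cluster.computeCentroid()[0]); // (0*0.75 + 4*0.25) / 1.0 = 1.0
    }
}
```

Contrast with hard k-means, where each point's probability is effectively 1 for exactly one cluster, so pointProbSum degenerates into the integer numPoints that ClusterBase tracks.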
Re: New to Mahout - question about the failed test cases
Yeah me neither. Could you try syncing from the trunk On Thu, Feb 18, 2010 at 4:08 PM, Sean Owen wrote: > I'm not seeing any such failures myself, from head. > > In case Robin just fixed something, try again from head? > > > On Wed, Feb 17, 2010 at 11:04 PM, Anish Shah wrote: > > Hi, > > > > I am new to Mahout and going through the initial steps of setting the > > development > > environment on my machine. I checked out the latest code from trunk and > > seeing > > the following failed tests when I ran mvn clean install: > > >
Re: New to Mahout - question about the failed test cases
I'm not seeing any such failures myself, from head. In case Robin just fixed something, try again from head? On Wed, Feb 17, 2010 at 11:04 PM, Anish Shah wrote: > Hi, > > I am new to Mahout and going through the initial steps of setting the > development > environment on my machine. I checked out the latest code from trunk and > seeing > the following failed tests when I ran mvn clean install: >