[jira] Issue Comment Edited: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos
[ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794809#action_12794809 ] zhao zhendong edited comment on MAHOUT-232 at 12/30/09 7:07 AM:

I am still working on it :(. I can attach a patch tomorrow or the day after, maybe. I will check the code of MAHOUT-228.

--
Zhen-Dong Zhao (Maxim)
Department of Computer Science, School of Computing, National University of Singapore
Homepage: http://zhaozhendong.googlepages.com
Mail: zhaozhend...@gmail.com

> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
> Key: MAHOUT-232
> URL: https://issues.apache.org/jira/browse/MAHOUT-232
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.1
> Reporter: zhao zhendong
> Attachments: SequentialSVM_0.1.patch
>
> After discussing with people in this community, I decided to re-implement a
> sequential SVM solver based on Pegasos for the Mahout platform (Mahout
> command-line style, SparseMatrix and SparseVector, etc.). Eventually it will
> support HDFS.
> The plan for sequential Pegasos:
> 1. Supporting the general file system (almost finished);
> 2. Supporting HDFS.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos
[ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhao zhendong updated MAHOUT-232:

Attachment: SequentialSVM_0.1.patch
[jira] Updated: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos
[ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhao zhendong updated MAHOUT-232:

Affects Version/s: 0.1
Status: Patch Available (was: Open)

Sequential SVM based on Pegasos.

---
Currently, this package provides (Features):
---
1. A sequential linear SVM solver, including training and testing.
2. Support for the general file system right now; HDFS support is near-term future work.
3. Support for large-scale data sets (assign the argument "trainSampleNum"). Because Pegasos only needs to sample a certain number of examples, this package can pre-fetch a bounded number of samples (e.g. the maximum iteration count) into memory. For example, if the data set has 100,000,000 samples, and the default maximum number of iterations is 10,000, this package randomly loads only 10,000 samples into memory.

---
TODO:
---
1. Support HDFS.
2. Because it adopts mahout.math.SparseMatrix and mahout.math.SparseVectorUnsafe, the cardinality of a matrix must be assigned when it is created. That is awkward when reading data sets in the SVM-light or libsvm formats, which are very popular in the machine-learning community, because such data sets store neither the number of samples nor the dimensionality. Currently I still use a crude method: read the data into a map<> first, then dump it into the SparseMatrix. Does anyone know a smarter method, or another matrix type that supports this?

---
Usage:
---
Training: SVMPegasosTraining.java
The arguments are hard-coded in this file; if you want to customize them yourself, please uncomment the first line in the main function. The default arguments are:
-tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model

Testing: SVMPegasosTesting.java
The arguments are hard-coded in this file; if you want to customize them yourself, please uncomment the first line in the main function.
The default arguments are:
-te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model
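For readers unfamiliar with Pegasos, here is a minimal sketch of the update the solver performs. The class and method names are hypothetical (the patch's actual API may differ): each iteration samples one example, takes a sub-gradient step on the hinge loss with step size 1/(lambda * t), and projects the weights onto the ball of radius 1/sqrt(lambda).

```java
import java.util.Random;

// A toy sketch of Pegasos training (hypothetical names, dense arrays
// instead of Mahout's SparseMatrix/SparseVector).
public class PegasosSketch {
  public static double[] train(double[][] x, int[] y, double lambda, int iterations) {
    double[] w = new double[x[0].length];
    Random rnd = new Random(42);
    for (int t = 1; t <= iterations; t++) {
      int i = rnd.nextInt(x.length);      // sample one example per iteration
      double eta = 1.0 / (lambda * t);    // Pegasos step-size schedule
      double margin = y[i] * dot(w, x[i]);
      for (int j = 0; j < w.length; j++) {
        w[j] *= (1.0 - eta * lambda);     // shrink from the regularizer
        if (margin < 1.0) {
          w[j] += eta * y[i] * x[i][j];   // hinge-loss sub-gradient step
        }
      }
      // project onto the ball of radius 1/sqrt(lambda)
      double norm = Math.sqrt(dot(w, w));
      double bound = 1.0 / Math.sqrt(lambda);
      if (norm > bound) {
        for (int j = 0; j < w.length; j++) {
          w[j] *= bound / norm;
        }
      }
    }
    return w;
  }

  static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int k = 0; k < a.length; k++) s += a[k] * b[k];
    return s;
  }
}
```

The sampling step is why the package above only needs to pre-fetch roughly one sample per iteration rather than the full data set.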
Re: [math] watch out for Windows
Last time I tried, running Hadoop 0.20 on Windows was impossible for me... should we still try to support Windows? I found that installing Ubuntu in VirtualBox is the easiest way to use Hadoop on Windows.

On Mon, Dec 28, 2009 at 8:47 PM, Benson Margulies wrote:
> Robin & I just established that the new code generator isn't working
> on Windows at all. I'm in process on a repair.
Re: [jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques
Hadoop put their MurmurHash in utils, so that might be a consideration. But for Mahout it fits better, imo, in org.apache.mahout.common with other code that has a similar philosophy and purpose. I assume that others will want to add alternative hash tools, so I'd create a "hash" package in mahout.common. The randomizers I'd put in org.apache.mahout.math due to their interaction with Vector, either at that very depth or in org.apache.mahout.math.randomizer, as .math is getting dense based on its number of modules. I imagine using the priors outside of sgd, so they could be moved to org.apache.mahout.math as well, where they may merit their own subpackage.

--- On Tue, 12/29/09, Ted Dunning (JIRA) wrote:

> From: Ted Dunning (JIRA)
> Subject: [jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques
> To: mahout-dev@lucene.apache.org
> Date: Tuesday, December 29, 2009, 12:29 PM
>
> [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795138#action_12795138 ]
>
> Ted Dunning commented on MAHOUT-228:
>
> This is the time. The MurmurHash and Randomizer classes both seem ripe for promotion to other packages.
> What I will do is file some additional JIRAs that include just those classes (one JIRA for Murmur, one for Randomizer/Vectorizer). Those patches will probably make it in before this one does because they are simpler. At that point, I will rework the patch on this JIRA to not include those classes.
> Where would you recommend these others go?
> > Need sequential logistic regression implementation using SGD techniques
> > -----------------------------------------------------------------------
> >
> > Key: MAHOUT-228
> > URL: https://issues.apache.org/jira/browse/MAHOUT-228
> > Project: Mahout
> > Issue Type: New Feature
> > Components: Classification
> > Reporter: Ted Dunning
> > Fix For: 0.3
> >
> > Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
> >
> > Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> > I often need to have a logistic regression in Java as well, so that is a reasonable place to start.
[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques
[ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795144#action_12795144 ] Robin Anil commented on MAHOUT-228:

I say let the hash functions be in math. The text randomizers can go in util.vectors; vectors.lucene, vectors.arff etc. are there currently. Or do we move all of these to core along with the Randomizers and DictionaryBased?
[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques
[ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795141#action_12795141 ] Jake Mannix commented on MAHOUT-228:

bq. Where would you recommend these others go?

Somewhere in the math module; which package name, I don't know.
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795140#action_12795140 ] Jake Mannix commented on MAHOUT-220:

bq. Robin: This is a library, our job is to have options for people like us to debate over. So let's agree upon a common mechanism.

Yep, agreed. We need fully deterministic techniques as well as probabilistic ones (which will often scale better), and we should let people use whatever works for them and whatever they are comfortable with.

> Mahout Bayes Code cleanup
> -------------------------
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
> Issue Type: Improvement
> Components: Classification
> Affects Versions: 0.3
> Reporter: Robin Anil
> Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
> Following Isabel's checkstyle, I am adding a whole slew of code cleanup with the following exceptions:
> 1. Line length used is 120 instead of 80.
> 2. static final log is kept as is, not LOG.
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795139#action_12795139 ] Robin Anil commented on MAHOUT-220:

The current Bayes implementation is an island. If you skim through the training mechanism, it's very optimized (with the fewest map/reduces), and the kind of information I store in HBase and in memory is very specific to that paper. First there is the weight, which is a matrix with features as rows, labels as columns, and the weight in each cell. Second, there are the sums of the columns and rows, stored along with the weight matrix. Then there are special rows containing the theta normalizer, the alpha smoothing value, etc. You can see it's not really doing Bayes' rule; it is reproducing the math of the CBayes paper. So I see no way of it directly using the sgd model. We could have a Bayes algorithm implementation specific to the model you are training, if that's OK?
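For concreteness, the storage layout Robin describes can be sketched as a toy in-memory stand-in (hypothetical names; the real implementation keeps this in HBase, with the sums and normalizers as special rows rather than separate arrays):

```java
// A toy sketch of the CBayes-style layout: a weight matrix with features
// as rows and labels as columns, plus row and column sums kept alongside.
public class BayesWeightStore {
  final double[][] weight;    // weight[feature][label]
  final double[] featureSum;  // row sums: total weight per feature
  final double[] labelSum;    // column sums: total weight per label
  double totalSum;

  public BayesWeightStore(int numFeatures, int numLabels) {
    weight = new double[numFeatures][numLabels];
    featureSum = new double[numFeatures];
    labelSum = new double[numLabels];
  }

  public void add(int feature, int label, double w) {
    weight[feature][label] += w;
    featureSum[feature] += w;  // maintained incrementally so lookups stay O(1)
    labelSum[label] += w;
    totalSum += w;
  }
}
```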
[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques
[ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795138#action_12795138 ] Ted Dunning commented on MAHOUT-228:

This is the time. The MurmurHash and Randomizer classes both seem ripe for promotion to other packages. What I will do is file some additional JIRAs that include just those classes (one JIRA for Murmur, one for Randomizer/Vectorizer). Those patches will probably make it in before this one does because they are simpler. At that point, I will rework the patch on this JIRA to not include those classes. Where would you recommend these others go?
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795137#action_12795137 ] Jake Mannix commented on MAHOUT-220:

bq. The extreme case is the DenseRandomizer. Every term gets spread out to every feature so you have collisions on every term on every feature. Because of the random weighting, you preserve enough information to allow effective learning.

Right, this is the use case in the stochastic decomposition setting, cool.

bq. Should we generalize this concept to Vectorizer? The dictionary approach can accept a previously computed dictionary (possibly augmenting it on the fly) and might be called a DictionaryVectorizer or WeightedDictionaryVectorizer. At the level I have been working, the storage of the dictionary is an open question. The randomizers could inherit from the same basic interface (or abstract class).

Definitely.
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795136#action_12795136 ] Ted Dunning commented on MAHOUT-220:

{quote}
For the sgd algorithm, I suggest you define your own matrix names, row indices and column indices, which your algorithm and your datastore agree upon.
{quote}

That is fine if sgd is an island, but it plausibly should be able to output models to be used by the Bayes classifier in a map-reduce setting. That requires some documentation of how DataStore is used by the Bayes models.
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795135#action_12795135 ] Ted Dunning commented on MAHOUT-220:

{quote}
Robin: I am not very clear what is happening there when two words have the same hash? Aren't we losing out on a lot of information? The one I am proposing is going to do exact numbering of the features.
{quote}

That is the point of the "probes" parameter, which allows for multiple hashing as Jake is suggesting. If you have, for example, 4 probes for each word, the chance of a complete collision is minuscule, and where there are collisions, the learning algorithm puts the weight on the non-colliding probes.

The extreme case is the DenseRandomizer: every term gets spread out to every feature, so you have collisions on every term on every feature. Because of the random weighting, you preserve enough information to allow effective learning.

See Vowpal Wabbit for a practical example. They handle 10^12 (very) sparse features in memory and can learn at disk bandwidth in some applications.

{quote}
Jake: They might belong in a more general place, actually. If I'm going to use some of this stuff in the decompositions (although I'm not sure yet of the efficacy of the single hash for doing SVD), it should go somewhere in the math module.
{quote}

Should we generalize this concept to Vectorizer? The dictionary approach can accept a previously computed dictionary (possibly augmenting it on the fly) and might be called a DictionaryVectorizer or WeightedDictionaryVectorizer. At the level I have been working, the storage of the dictionary is an open question. The randomizers could inherit from the same basic interface (or abstract class).
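To make the multi-probe idea concrete, here is a toy sketch. All names are hypothetical and a seeded String hash stands in for MurmurHash; Mahout's actual TermRandomizer API may differ. Each term is hashed with several independent seeds ("probes") into a fixed-size vector, so two terms only collide completely if all of their probes collide at once.

```java
// A toy multi-probe hashed vectorizer (hypothetical names): terms are
// mapped into a fixed-size feature vector by several independent hashes.
public class HashedVectorizer {
  private final int numFeatures;
  private final int probes;

  public HashedVectorizer(int numFeatures, int probes) {
    this.numFeatures = numFeatures;
    this.probes = probes;
  }

  public double[] vectorize(String[] terms) {
    double[] v = new double[numFeatures];
    for (String term : terms) {
      for (int p = 0; p < probes; p++) {
        // seed each probe differently; a real implementation would use
        // MurmurHash with the probe index as the seed
        int h = Math.floorMod((term + "#" + p).hashCode(), numFeatures);
        v[h] += 1.0;  // each probe contributes weight at its own slot
      }
    }
    return v;
  }
}
```

Note that the output dimensionality is fixed up front, which is exactly the property Robin asks about: growing the feature-set size changes where every term lands.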
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795133#action_12795133 ] Robin Anil commented on MAHOUT-220:

Anyway, I guess we are sounding like ML engineers here. This is a library; our job is to have options for people like us to debate over :). So let's agree upon a common mechanism, i.e. have different ways to create a term-frequency vector, i.e. List => SparseVector from documents. Once the SparseVector is created, use uniform M/R jobs to do things like tf-idf weighting and log-likelihood (although I think we need the original file to get the co-occurrences, not the SparseVector). Any ideas?
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795131#action_12795131 ] Jake Mannix commented on MAHOUT-220:

bq. I am not very clear what is happening there when two words have the same hash? Aren't we losing out on a lot of information?

You can lose some information, sure, but there are *tons* of words, and you don't lose much information. It is a probabilistic technique, though. Personally I prefer the multi-hash approach, because at least there I really believe the projection is preserving distances properly. In the single-hash case, sometimes (i.e. for some single-word documents with different words) the collapse of distance is extreme (as Robin is alluding to).
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795127#action_12795127 ] Robin Anil commented on MAHOUT-220:

I am not very clear on what happens when two words have the same hash. Aren't we losing out on a lot of information? The one I am proposing is going to do exact numbering of the features. One thing my method suffers from is the addition of new data: it will take another couple of M/R jobs to create the new dictionary file while preserving the old ids. It's cumbersome but doable. What happens in the randomizer approach? Since you are fixing the feature-set size, the hash ids will also change when that feature-set size increases, right?
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795128#action_12795128 ] Jake Mannix commented on MAHOUT-220:

Anil, your map-reduces look great; that's the kind of thing I've done for this as well. Good stuff.

As for HBase and caching layers, I'd say it's still not fully scalable, as it's limited by whatever cache size you set and your hit/miss ratio. It seems the Datastore interface really is just a wrapper around Matrix and Vector, calling out to the entries. Doing so in a random-access fashion seems like the reverse of the way I'd do it: pass the Algorithm *to* the Datastore, and have the computations be done where the data lives (iterate over the Datastore internally, either in memory, or, if it knows it's backed by MySQL, say, by batching calls to the db and pulling chunks into memory; if it's HDFS-backed, it can fire off an M/R job, etc.).
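The inversion Jake describes can be sketched as a visitor-style interface. All names here are hypothetical, not Mahout's actual Datastore API: the algorithm is handed to the store, and the store iterates however is cheapest for its backing storage.

```java
// Hypothetical sketch: instead of the algorithm making random-access calls
// into the Datastore, the Datastore drives iteration and feeds the algorithm.
interface VectorVisitor {
  void visit(String docId, double[] vector);
}

interface Datastore {
  void forEach(VectorVisitor algorithm);  // the store decides how to iterate
}

class InMemoryDatastore implements Datastore {
  private final java.util.Map<String, double[]> docs = new java.util.LinkedHashMap<>();

  void put(String docId, double[] v) {
    docs.put(docId, v);
  }

  @Override
  public void forEach(VectorVisitor algorithm) {
    // in-memory case: plain iteration; an HDFS-backed implementation could
    // instead launch an M/R job, and a DB-backed one could batch reads
    for (java.util.Map.Entry<String, double[]> e : docs.entrySet()) {
      algorithm.visit(e.getKey(), e.getValue());
    }
  }
}
```

The design point is that the random-access `get` calls disappear from the algorithm's view; the cost model lives entirely inside the store.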
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795124#action_12795124 ] Jake Mannix commented on MAHOUT-220:

Ted, while I'm totally down with using the randomizer / hashing techniques in places, I don't think we should totally wed ourselves to them either: having the option of using the "real" vector representation should probably be implemented too, as people understand it better and it's pretty standard.

bq. If you like these, we can promote them to a common area under classifier.

They might belong in a more general place, actually. If I'm going to use some of this stuff in the decompositions (although I'm not sure yet of the efficacy of the single hash for doing SVD), it should go somewhere in the math module.
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795122#action_12795122 ] Ted Dunning commented on MAHOUT-220:

Anil, see classifier.sgd.TermRandomizer (and the implementations DenseRandomizer and BinaryRandomizer) for a term-list-to-vector converter. These are in the MAHOUT-228 patch. It has the virtue of converting term lists to vectors of fixed size. It currently does not do term weighting, but that would be a very easy fix. The approach is roughly along the lines of http://arxiv.org/PS_cache/arxiv/pdf/0902/0902.2206v2.pdf or the stochastic decomposition work. If you like these, we can promote them to a common area under classifier.
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795117#action_12795117 ] Robin Anil commented on MAHOUT-220:

A caching layer is implemented in HbaseDatastore; you can set the cache size. Take a look at MAHOUT-124 for more details.

I am just porting the feature mapper and tf-idf mapper from bayes classifier common over to make the new text vectorizer. Take a look at them: it's a fully distributed way of doing tf-idf in two map/reduces. For the vector converter, here is the idea in steps:

M/R1: Count the frequencies of the words tokenized using a configurable Lucene Analyzer.
SEQ1: Read the frequency list, prune words below minSupport, and create the dictionary file (string => long) and the frequency file (string => long).
M/R2 (done in chunks, repeatedly keeping one block of the dictionary file in memory): Run over the input documents, replacing each string with its integer id, and create (docid => sparsevector). This sparse vector has TF weights but is incomplete. Then run a map/reduce over the incomplete sparse vectors, grouping by docid; in the reducer, merge the sparse vectors. The initial SparseVectors data set is then ready.

function multiplyIDF() {
M/R3: Calculate DF from the SparseVector data set.
M/R4: Run over the SparseVector TF data set and apply IDF.
}

This is the first plan, at least until I finish. The second is to convert each document into a stream of integers using the dictionary file; subsequent functions can then run M/R jobs to calculate LLR and make bigrams. For this, the sparse-vector merge map/reduce function should be generic enough.
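As a sanity check on the math of Robin's pipeline, here is a sequential toy analogue of those steps (hypothetical names, no Hadoop): build a dictionary with a minSupport cutoff, compute per-document sparse TF maps, then weight them by IDF computed from the document frequencies.

```java
import java.util.*;

// A sequential toy analogue of the distributed tf-idf pipeline.
public class TfIdfSketch {
  public static List<Map<Integer, Double>> vectorize(List<List<String>> docs, int minSupport) {
    // "M/R1" + "SEQ1": count word frequencies, build a pruned dictionary
    Map<String, Integer> freq = new HashMap<>();
    for (List<String> doc : docs)
      for (String w : doc) freq.merge(w, 1, Integer::sum);
    Map<String, Integer> dict = new HashMap<>();
    for (Map.Entry<String, Integer> e : freq.entrySet())
      if (e.getValue() >= minSupport) dict.put(e.getKey(), dict.size());

    // "M/R2": docid => sparse TF vector over dictionary ids
    List<Map<Integer, Double>> tf = new ArrayList<>();
    int[] df = new int[dict.size()];
    for (List<String> doc : docs) {
      Map<Integer, Double> v = new HashMap<>();
      for (String w : doc) {
        Integer id = dict.get(w);
        if (id != null) v.merge(id, 1.0, Double::sum);
      }
      for (int id : v.keySet()) df[id]++;  // "M/R3": document frequencies
      tf.add(v);
    }

    // "M/R4": multiply each TF weight by IDF
    int n = docs.size();
    for (Map<Integer, Double> v : tf)
      for (Map.Entry<Integer, Double> e : v.entrySet())
        e.setValue(e.getValue() * Math.log((double) n / df[e.getKey()]));
    return tf;
  }
}
```

The distributed version splits each loop above into a map/reduce pass; the merge-by-docid step corresponds to combining the partial sparse vectors in the reducer.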
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795114#action_12795114 ] Jake Mannix commented on MAHOUT-220:

Robin,

To really be scalable here, I'm down with the M/R approach for the classifiers. The random-access nature of the current Datastore interface definitely seems limiting; even using HBase this way means we're making lots of remote calls, while a traditional Hadoop job would do the nice "put the code where the data lives" instead. Switching over to SparseVectors and processing the data set sequentially out of SequenceFiles of them seems definitely the way I'd see this going. Is that what your current hadoopified version of this does?

bq. I am currently writing a Map/reduce job to convert text documents to vectors without relying on Lucene.

How are you doing this? Is this a bag-of-words representation (what form of tf are you using? how are you putting in idf if it's fully distributed?)?
[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques
[ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795064#action_12795064 ] Steve Umfleet commented on MAHOUT-228:
--

Hi Ted. Watching your progress on SGD was instructive; thanks for the "template" of how to submit and proceed with an issue. At what point in the process are decisions about packages resolved? For example, MurmurHash, at first glance and based on its own documentation, seems like it might be broadly useful outside of org.apache.mahout.classifier.

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
> Key: MAHOUT-228
> URL: https://issues.apache.org/jira/browse/MAHOUT-228
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Reporter: Ted Dunning
> Fix For: 0.3
>
> Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.
[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance
[ https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795060#action_12795060 ] Benson Margulies commented on MAHOUT-230:
-

Heck, a read of the reference cited in the JDK 1.6 doc would prove rewarding, no doubt. Anyone else willing?

> Replace org.apache.mahout.math.Sorting with code of clear provenance
> --------------------------------------------------------------------
>
> Key: MAHOUT-230
> URL: https://issues.apache.org/jira/browse/MAHOUT-230
> Project: Mahout
> Issue Type: Bug
> Components: Math
> Affects Versions: 0.3
> Reporter: Benson Margulies
> Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: replace-sorting.diff
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> org.apache.mahout.math.Sorting looks as if the original author borrowed from the Sun JRE, based on the private internal function names and contents. That code has a restrictive license. We need to take the equivalent file (java.util.Arrays) from Apache Harmony and use it as the basis for a clean replacement.
> The problematic code is the quickSort and mergeSort functions, which extend 'Arrays' by supporting slices of arrays and custom sorting predicate functions.
> One might also wistfully note that more recent JDKs from Sun have deployed different (and, one hopes, better) sort algorithms than 1.5 and/or Harmony, so a really energetic person might build matching implementations in here. However, expediency calls for just bashing on the Harmony implementation to solve the problem at hand.
[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance
[ https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795058#action_12795058 ] Robin Anil commented on MAHOUT-230:
---

What about Hadoop? I guess it's a core operation for them. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/MergeSort.html
[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance
[ https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795055#action_12795055 ] Grant Ingersoll commented on MAHOUT-230:

I seem to recall Lucene having its own merge sort; maybe we should look there? Not saying it's faster, but it might be worth looking at.
[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance
[ https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795054#action_12795054 ] Benson Margulies commented on MAHOUT-230:
-

And it's all in the merge sorts.
[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance
[ https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795053#action_12795053 ] Grant Ingersoll commented on MAHOUT-230:

Committed revision 894390.
[jira] Issue Comment Edited: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795050#action_12795050 ] Robin Anil edited comment on MAHOUT-220 at 12/29/09 1:36 PM:
-

Datastore is an interface which allows you to pick a named vector or a named matrix and look up a cell. The Bayes classifier code is based entirely on tokens, not SparseVectors. The names of the matrix, the row, and the column are therefore strings, and the contract between the algorithm and the Datastore is decided per algorithm. For the CBayes/Bayes algorithms, we have HBaseBayesDatastore.java and InMemoryBayesDatastore.java.

{code}
double getWeight(String matrixName, String row, String column) throws InvalidDatastoreException;
double getWeight(String vectorName, String index) throws InvalidDatastoreException;
{code}

For the SGD algorithm, I suggest you define your own matrix names, row indices, and column indices, which your algorithm and your datastore agree upon. I know this creates a limitation: you can't use integer-based column and row names. Maybe we can parameterize it, or change the Bayes package to use Vectors instead of the current string-token-based implementation.

I am currently writing a Map/Reduce job to convert text documents to vectors without relying on Lucene. Once that is done, I will overhaul the classifier package to use SparseVectors. Before that, I need to know if this patch is OK in terms of code style; I will then patch it and start with the enhancements.

was (Author: robinanil): Datastore is an interface which allows you to pick a named vector or a matrix and look up a cell. The Bayes classifier code is based entirely on tokens, not SparseVectors; the names of the matrix, the row, and the column are up to the implementation. For the CBayes/Bayes algorithms, we have HBaseBayesDatastore.java and InMemoryBayesDatastore.java.

{code}
double getWeight(String matrixName, String row, String column) throws InvalidDatastoreException;
double getWeight(String vectorName, String index) throws InvalidDatastoreException;
{code}

For the SGD algorithm, I suggest you define your own matrix names, row indices, and column indices, which your algorithm and datastore agree upon. I know this creates a limitation: you can't use integer-based column and row names. Maybe we can parameterize it, or change the Bayes package to use Vectors instead of the current string-token-based implementation. I am currently writing a Map/Reduce job to convert text documents to vectors without relying on Lucene. Once that is done, I will overhaul the classifier package to use SparseVectors. Before that, I need to know if this patch is OK in terms of code style; I will then patch it and start with the enhancements.
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795050#action_12795050 ] Robin Anil commented on MAHOUT-220:
---

Datastore is an interface which allows you to pick a named vector or a matrix and look up a cell. The Bayes classifier code is based entirely on tokens, not SparseVectors; the names of the matrix, the row, and the column are up to the implementation. For the CBayes/Bayes algorithms, we have HBaseBayesDatastore.java and InMemoryBayesDatastore.java.

{code}
double getWeight(String matrixName, String row, String column) throws InvalidDatastoreException;
double getWeight(String vectorName, String index) throws InvalidDatastoreException;
{code}

For the SGD algorithm, I suggest you define your own matrix names, row indices, and column indices, which your algorithm and datastore agree upon. I know this creates a limitation: you can't use integer-based column and row names. Maybe we can parameterize it, or change the Bayes package to use Vectors instead of the current string-token-based implementation.

I am currently writing a Map/Reduce job to convert text documents to vectors without relying on Lucene. Once that is done, I will overhaul the classifier package to use SparseVectors. Before that, I need to know if this patch is OK in terms of code style; I will then patch it and start with the enhancements.
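As a concrete illustration of that contract, a map-backed store in the spirit of InMemoryBayesDatastore could look like the sketch below. The class name and storage layout are hypothetical, and for brevity it returns 0.0 for missing cells rather than throwing InvalidDatastoreException as the real interface does:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical map-backed datastore honoring the string-keyed getWeight contract.
public class InMemoryDatastoreSketch {
    // matrixName -> row -> column -> weight
    private final Map<String, Map<String, Map<String, Double>>> matrices = new HashMap<>();
    // vectorName -> index -> weight
    private final Map<String, Map<String, Double>> vectors = new HashMap<>();

    public void putWeight(String matrixName, String row, String column, double weight) {
        matrices.computeIfAbsent(matrixName, k -> new HashMap<>())
                .computeIfAbsent(row, k -> new HashMap<>())
                .put(column, weight);
    }

    public void putWeight(String vectorName, String index, double weight) {
        vectors.computeIfAbsent(vectorName, k -> new HashMap<>()).put(index, weight);
    }

    public double getWeight(String matrixName, String row, String column) {
        return matrices.getOrDefault(matrixName, Map.of())
                       .getOrDefault(row, Map.of())
                       .getOrDefault(column, 0.0);
    }

    public double getWeight(String vectorName, String index) {
        return vectors.getOrDefault(vectorName, Map.of()).getOrDefault(index, 0.0);
    }
}
```

Because every key is a string, an algorithm that naturally indexes by integer (such as SGD over SparseVectors) has to stringify its indices, which is exactly the limitation discussed above.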
Re: [jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance
Simple answer: the Harmony team didn't code as well as the Sun people did. This is not my metier, so if someone else can suggest algorithmic improvements ...

On Tue, Dec 29, 2009 at 8:06 AM, Robin Anil (JIRA) wrote:
>
> [ https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795047#action_12795047 ]
>
> Robin Anil commented on MAHOUT-230:
> ---
>
> I ran the SortingTest; instead of 3.2 seconds it now takes 5.2 seconds. I reapplied the patch and rechecked. Any idea why the perf dip? Notice the perf drop in the second block, i.e. after the line break in each block.
>
> {code:xml|title=Original Output}
> <testcase name="testBinarySearch"/>
> <testcase name="testBinarySearchObjects"/>
> <testcase name="testQuickSortBytes"/>
> <testcase name="testQuickSortChars"/>
> <testcase name="testQuickSortInts"/>
> <testcase name="testQuickSortLongs"/>
> <testcase name="testQuickSortShorts"/>
> <testcase name="testQuickSortFloats"/>
> <testcase name="testQuickSortDoubles"/>
> <testcase name="testMergeSortBytes"/>
>
> <testcase name="testMergeSortChars"/>
> <testcase name="testMergeSortInts"/>
> <testcase name="testMergeSortLongs"/>
> <testcase name="testMergeSortShorts"/>
> <testcase name="testMergeSortFloats"/>
> <testcase name="testMergeSortDoubles"/>
> {code}
>
> {code:xml|title=After Patching}
> <testcase name="testBinarySearch"/>
> <testcase name="testBinarySearchObjects"/>
> <testcase name="testQuickSortBytes"/>
> <testcase name="testQuickSortChars"/>
> <testcase name="testQuickSortInts"/>
> <testcase name="testQuickSortLongs"/>
> <testcase name="testQuickSortShorts"/>
> <testcase name="testQuickSortFloats"/>
> <testcase name="testQuickSortDoubles"/>
> <testcase name="testMergeSortBytes"/>
>
> <testcase name="testMergeSortChars"/>
> <testcase name="testMergeSortInts"/>
> <testcase name="testMergeSortLongs"/>
> <testcase name="testMergeSortShorts"/>
> <testcase name="testMergeSortFloats"/>
> <testcase name="testMergeSortDoubles"/>
> {code}
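One way to chase a regression like this is a tiny standalone timing harness that runs competing sorts over identical inputs. The sketch below is hypothetical, not the SortingTest from the patch, and times Arrays.sort against itself as a placeholder; the Harmony-derived mergeSort would be substituted for the candidate to reproduce the comparison:

```java
import java.util.Arrays;
import java.util.Random;

// Minimal timing harness: compare a baseline sort against a candidate sort
// on identical random inputs, taking the best of several rounds.
public class SortBench {

    interface Sorter { void sort(int[] a); }

    static long bestMillis(Sorter sorter, int[] template, int rounds) {
        long best = Long.MAX_VALUE;
        for (int r = 0; r < rounds; r++) {
            int[] copy = Arrays.copyOf(template, template.length); // same input each round
            long start = System.nanoTime();
            sorter.sort(copy);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1_000_000;
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000_000).toArray();
        long baseline = bestMillis(a -> Arrays.sort(a), data, 5);
        // Substitute the Harmony-derived mergeSort here as the candidate.
        long candidate = bestMillis(a -> Arrays.sort(a), data, 5);
        System.out.println("baseline=" + baseline + "ms candidate=" + candidate + "ms");
    }
}
```

Taking the best of several rounds damps JIT warm-up noise, which matters when the gap under investigation is on the order of a couple of seconds across a whole test suite.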
[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance
[ https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795048#action_12795048 ] Grant Ingersoll commented on MAHOUT-230:

I think we need to commit and then worry about performance; the legal issues outweigh the performance issues at this point. I'll commit, and then we can open a new issue for performance.
[jira] Assigned: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance
[ https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned MAHOUT-230:
--

Assignee: Grant Ingersoll (was: Benson Margulies)
[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance
[ https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795047#action_12795047 ] Robin Anil commented on MAHOUT-230:
---

I ran the SortingTest; instead of 3.2 seconds it now takes 5.2 seconds. I reapplied the patch and rechecked. Any idea why the perf dip? Notice the perf drop in the second block, i.e. after the line break in each block.

{code:xml|title=Original Output}
{code}

{code:xml|title=After Patching}
{code}