issues found
I just did an exercise of implementing a faster sparse vector. In the process, I uncovered a bunch of warts. I will be filing some JIRAs and patches as soon as I can get to them, but here is a preview:

a) Most serialization code for vectors would write vectors with null names out as if they had empty names. This causes grief in tests and seems wrong.

b) The Vector/SparseVector hierarchy was oddly split out. I added a HashVector and moved the current SparseVector into IntDoubleMappingVector, with SparseVector remaining as an abstract class. This unfortunately caused lots of upheaval even up into the Vector class. I have yet to sort this out cleanly.

c) The squared distance functions were defined in multiple places. I centralized these into SquaredEuclideanDistance as a static function. These definitions were also wrong: they would ignore any components that had opposite sign and equal magnitude. In fixing this, I went ahead and wrote an implementation that makes use of any sparsity present. To do that, I added a method to Vector so that I could tell whether a vector is sparse. This subsumes all of the distance optimizations that we have discussed. It also makes it very easy for new code to use these optimizations without knowing about them.

d) The nonZeroIterator had to be substantially refactored to work in an abstract class without knowing too much about the internals of everything.

e) In order to make sure that all metrics work reasonably with all types, I have heavily refactored the testing structure for metrics. I will be doing more of this as well. The goal is to test all vector types against all distance metrics to make sure that they give the same results as DenseVector. Then I will build tests for DenseVector to verify that it produces the correct result. This is important because so many vectors have special code to help metric computation.

f) I have uncovered some strangeness in the ARFF I/O code that I introduced with the SparseVector abstraction. The old code will work as it did, but it won't understand the new HashVector.

So I will be trying to break this down into pieces as small as I can, but the total will be a bunch of interdependent patches. If anybody can help me apply these as quickly as possible, we should have minimal problems with it all. If it drags out, it will get hard to keep rebasing the patch sequence all the time.

-- Ted Dunning, CTO DeepDyve
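For what it's worth, the sparsity-exploiting squared distance described in c) can be sketched roughly as follows. This is illustrative Java only, not the actual Mahout SquaredEuclideanDistance API, and the map-based "vector" is a stand-in for the real classes. It uses the identity d^2 = |x|^2 + |y|^2 - 2*x.y, which counts every component by construction, including pairs with opposite sign and equal magnitude that a naive merge over non-zero entries might drop.

```java
import java.util.HashMap;
import java.util.Map;

class SparseDistanceSketch {
  // A sparse vector here is just a map from index to its non-zero value.
  static double squaredDistance(Map<Integer, Double> x, Map<Integer, Double> y) {
    double xx = 0.0, yy = 0.0, xy = 0.0;
    for (double v : x.values()) xx += v * v;   // |x|^2 over non-zeros only
    for (double v : y.values()) yy += v * v;   // |y|^2 over non-zeros only
    // Dot product: iterate the smaller map, probe the larger one.
    Map<Integer, Double> small = x.size() <= y.size() ? x : y;
    Map<Integer, Double> big = (small == x) ? y : x;
    for (Map.Entry<Integer, Double> e : small.entrySet()) {
      Double other = big.get(e.getKey());
      if (other != null) xy += e.getValue() * other;
    }
    return xx + yy - 2.0 * xy;
  }
}
```

Cost is proportional to the number of non-zeros, not the dimensionality, which is the point of letting the metric ask whether a vector is sparse.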
[jira] Updated: (MAHOUT-145) PartialData mapreduce Random Forests
[ https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deneche A. Hakim updated MAHOUT-145: Attachment: partial_August_31.patch * Corrected some bugs in the new code when testing in a pseudo-distributed cluster PartialData mapreduce Random Forests Key: MAHOUT-145 URL: https://issues.apache.org/jira/browse/MAHOUT-145 Project: Mahout Issue Type: New Feature Components: Classification Reporter: Deneche A. Hakim Priority: Minor Attachments: partial_August_10.patch, partial_August_13.patch, partial_August_15.patch, partial_August_17.patch, partial_August_19.patch, partial_August_2.patch, partial_August_24.patch, partial_August_27.patch, partial_August_31.patch, partial_August_9.patch This implementation is based on a suggestion by Ted: modify the original algorithm to build multiple trees for different portions of the data. That loses some of the solidity of the original method, but could actually do better if the splits exposed non-stationary behavior. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: issues found
On Aug 31, 2009, at 2:55 AM, Ted Dunning wrote: [...]

These all sound reasonable. If you're happy w/ the changes, just put up a big patch on an issue, give it a few days to percolate and then commit.

-- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
[jira] Updated: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth
[ https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-157: -- Attachment: MAHOUT-157-August-31.patch Performance Improvements in sequential version Frequent Pattern Mining using Parallel FP-Growth Key: MAHOUT-157 URL: https://issues.apache.org/jira/browse/MAHOUT-157 Project: Mahout Issue Type: New Feature Affects Versions: 0.2 Reporter: Robin Anil Fix For: 0.2 Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, MAHOUT-157-Combinations-BSD-License.patch, MAHOUT-157-Combinations-BSD-License.patch, MAHOUT-157-inProgress-August-5.patch Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Request for brainstorming: how to represent item-item diffs efficiently
On Mon, Aug 31, 2009 at 8:29 AM, Sean Owen sro...@gmail.com wrote: And I think it is possible to do away with the two-level data structure organized by row, then column. I think. Again thinking of efficient ways to access by row without perhaps the overhead of storing by row. If this is for off-line storage, then you may not need to store by row ... just store in a combined store with an xor of row and column and sort it all out later. -- Ted Dunning, CTO DeepDyve
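For concreteness, the "two ints together" key that comes up later in this thread is usually realized with a shifted pack rather than the xor Ted suggests (a plain xor is fine for the off-line case where you can sort things out later, but it is not invertible on its own). A sketch, with made-up names:

```java
// Illustrative only: pack an (row, column) pair of ints into one long key,
// so an item-item diff store can use a single flat long->value map instead
// of a two-level row-then-column structure.
class PairKey {
  static long pack(int row, int col) {
    // Mask the column before OR-ing so its sign bit doesn't smear
    // over the row's 32 bits.
    return ((long) row << 32) | (col & 0xFFFFFFFFL);
  }

  static int row(long key) {
    return (int) (key >>> 32);   // unsigned shift recovers the high half
  }

  static int col(long key) {
    return (int) key;            // truncation recovers the low half
  }
}
```

The pack is order-preserving by row then column, so sorting the longs also sorts by row, which keeps row-at-a-time access cheap.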
Re: About Java Code Performance Profiling
Robin Anil wrote:

Please suggest some good performance profilers for Java. There is the Eclipse Test and Performance Platform, which for some reason is screwed up with Eclipse Galileo. Then there is JProbe, which has great reviews, but it's not free. YourKit also has great reviews; there is an option to get a free license (for the open-source community) provided we reference YourKit on our website: http://www.yourkit.com/purchase/index.jsp Discussions and suggestions welcome. Robin

You can get a YourKit license for work on your Apache project. If I remember right, your PMC chair should contact them with a list of committers (with email addresses) who would like to have a license. Each committer on the list will then receive their license key individually. I'm pretty sure there's no requirement to list them on your home page (which I'm not sure the ASF would allow, anyway). I know of more than one Apache project that's using YourKit, so this should not be a problem.

--Thilo
Re: Request for brainstorming: how to represent item-item diffs efficiently
Not really. With off-line use, you can amortize the cost of sorting across all items. This allows very tight encodings in some cases. In on-line (real-time actually) use you have the problem that you can't amortize anything because you have a worst case constraint. Essentially, this is the difference between hadoop and OLTP relational databases. On Mon, Aug 31, 2009 at 10:34 AM, Sean Owen sro...@gmail.com wrote: This is definitely for on-line use. Though I imagine strategies for efficient storage pertain well to both the off- and on-line case. -- Ted Dunning, CTO DeepDyve
[jira] Created: (MAHOUT-168) Need integer compression routines
Need integer compression routines - Key: MAHOUT-168 URL: https://issues.apache.org/jira/browse/MAHOUT-168 Project: Mahout Issue Type: Improvement Components: Matrix Reporter: Ted Dunning A selection of these algorithms would be very nice to have: www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Request for brainstorming: how to represent item-item diffs efficiently
Ignoring the diff for the moment, you will probably win with some coding for the actual counts. My vision would be a row-coded complex matrix. Each row would have a list of id's, a list of counts, and a list of diffs. If you have your ids coded so that the most commonly rated items come first, then simple integer compression should help on the list of id's. The counts are already heavily skewed to small numbers so a bit-variable representation should win big on that side. The diffs are probably pretty degenerate as well. If you look at the distribution of values you will either have a small number of very common values (that can be encoded with small integers) or a pretty tight distribution (that can be encoded by picking ranges at the level of accuracy you want, encoding the range as an integer and then huffman coding the integers). The canonical reference for compressing integers is Williams and Zobel, http://www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf I can try to work up an example implementation for these codes, but my time this week is kind of tight(er). On Mon, Aug 31, 2009 at 11:01 AM, Sean Owen sro...@gmail.com wrote: . I believe I will proceed with investigating a structure that maps a long (really two ints together) to a count and to a total diff. -- Ted Dunning, CTO DeepDyve
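The "byte-variable representation" mentioned above can be sketched in a few lines. This is an illustration of the general variable-byte idea (7 data bits per byte, high bit as a continuation flag), not code from any Mahout patch, and the class and method names are made up:

```java
import java.io.ByteArrayOutputStream;

class VarByteCodec {
  // Encode each non-negative int in 7-bit groups, low bits first; a set
  // high bit means "more bytes follow". Small values, the common case
  // for counts, take a single byte.
  static byte[] encode(int[] values) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int v : values) {
      while ((v & ~0x7F) != 0) {
        out.write((v & 0x7F) | 0x80);  // 7 data bits + continuation flag
        v >>>= 7;
      }
      out.write(v);                    // final byte, high bit clear
    }
    return out.toByteArray();
  }

  static int[] decode(byte[] bytes, int count) {
    int[] out = new int[count];
    int pos = 0;
    for (int i = 0; i < count; i++) {
      int v = 0, shift = 0, b;
      do {
        b = bytes[pos++] & 0xFF;
        v |= (b & 0x7F) << shift;      // reassemble 7 bits at a time
        shift += 7;
      } while ((b & 0x80) != 0);
      out[i] = v;
    }
    return out;
  }
}
```

Because decoding works a whole byte at a time instead of bit by bit, this tends to be the fastest of the classic integer codes, at a modest cost in compression compared to the bit-level ones.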
Re: About Java Code Performance Profiling
I use JProfiler 5 just because I am used to it and ended up buying a license. Works well. I imagine free solutions work fine too.

On Aug 31, 2009 4:39 PM, Robin Anil robin.a...@gmail.com wrote: [...]
Re: About Java Code Performance Profiling
YourKit has an open source license option that asks you to mail them and tell them what project you work on. I've used both YourKit and JProfiler and both are quite good. I've also used the NetBeans one, which can now plug in to IntelliJ too, but I don't think it is as good as YourKit. Over time, we will want to develop some performance benchmarking tools too, similar to Lucene's benchmark contrib, as having some standard ways to talk about performance and share settings is crucial for this stuff in open source.

On Aug 31, 2009, at 8:39 AM, Robin Anil wrote: [...]
[jira] Commented: (MAHOUT-168) Need integer compression routines
[ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749703#action_12749703 ] Grant Ingersoll commented on MAHOUT-168: Can we leverage some of Lucene's capabilities here? It doesn't have all the algorithms mentioned, but does have delta encodings.
[jira] Assigned: (MAHOUT-168) Need integer compression routines
[ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning reassigned MAHOUT-168: Assignee: Ted Dunning
[jira] Updated: (MAHOUT-168) Need integer compression routines
[ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-168: Status: Patch Available (was: Open) First version. 100% test coverage except for BitInputStream at 95%. Needs review for API and obvious speed improvements.
[jira] Commented: (MAHOUT-168) Need integer compression routines
[ https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749721#action_12749721 ] Ted Dunning commented on MAHOUT-168: I considered that, but it seemed easier to write a version from scratch. Somebody should look at the Lucene code to see if they did anything earth-shakingly clever for speed. A delta code, by the way, includes gamma and unary codes, so it is complete enough to be interesting. It should be very little work to add Golomb and byte-variable codes to this. Only the byte-variable code seems like a good candidate, because it can be made faster than these other forms pretty easily.
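As a reference point for the gamma code mentioned in the comment, an Elias gamma coder can be sketched over a string of bit characters. This is purely illustrative (the patch's BitInputStream-based API is not shown here, and these names are invented): a value v >= 1 is written as floor(log2 v) zero bits followed by the binary form of v, so the unary prefix tells the decoder how many bits the value occupies.

```java
class GammaCodeSketch {
  // Elias gamma encoding of a single positive int as '0'/'1' characters.
  static String encode(int v) {
    String bin = Integer.toBinaryString(v);      // starts with a 1 bit
    StringBuilder sb = new StringBuilder();
    for (int i = 1; i < bin.length(); i++) {
      sb.append('0');                            // unary length prefix
    }
    return sb.append(bin).toString();
  }

  // Decode one gamma code occupying the whole string.
  static int decode(String bits) {
    int zeros = 0;
    while (bits.charAt(zeros) == '0') zeros++;
    // The next (zeros + 1) bits are the value in binary.
    return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
  }
}
```

For example, 1 codes to "1" and 5 to "00101": small values get short codes, which is what makes gamma attractive for heavily skewed count distributions.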