Moving this discussion to the mailing list, as a -1 is a fairly serious thing when it comes to Apache and it merits a discussion.
On Nov 12, 2011, at 2:06 AM, Sean Owen (Commented) (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/MAHOUT-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149034#comment-13149034 > ] > > Sean Owen commented on MAHOUT-881: > ---------------------------------- > > -1 Grant I thought we discussed this on the mailing list? I don't see that > this achieves anything. You are still doing an implicit n log n heap sort to > attain an ordered result. You still allocate a container object for the > result. This seems like swapping some code for more complex (but equally > working) code, and adding a Lucene dependency. Just b/c some piece of code isn't considered a bottleneck at the moment means it can't be improved? This new code removes a fair bit of complexity (the code was passing over the same data at least 3 times in some cases, when it only needed to pass over it once) and puts in place an approach that is far fewer operations AND has a faster underlying data structure. It's a no-brainer. I would think we would be happy that we are finally to the point where doing these kinds of optimizations is a worthwhile exercise. As for the Lucene dependency, we already have it and it isn't going anywhere anytime soon. As for n log n, yeah, I know, but really the current code is nlogn (PQ operations) + n(copying arrays) + nlogn(sort). Not too mention several extra memory allocations and branch conditions. We all know that the constant factor and the additive factors hidden by Big O notation can still make a difference in real systems. Besides, I have some ideas on other optimizations we can make too, like precomputing more stuff, but that is a separate discussion. So, do you really think this is worth vetoing? This is nowhere near a veto-able piece of code, IMO. Not even in the same solar system. Vetoes are for stuff that moves something in a completely opposite direction from the project's goal. Us getting rid of Watchmaker is a veto-able offense if someone feels near and dear to that code. This is a micro optimization in a single class file. Sure, there are other, bigger optimizations we can make, but that doesn't mean we can't fix things. > > Do you have a load test that shows it would be notably faster? the patch does > have such a thing. I'm working on it, which is why I haven't committed yet, b/c I know it bears some scrutiny. I was putting it up to show how much simpler it is. But it doesn't take a benchmarking tool to see that removing n inserts, a sort of an array of size n and then another n inserts in favor of a single set of n inserts is going to be faster. Not too mention it also kills several extraneous array allocations and branch condition checks. This isn't the kind of stuff the JVM can optimize for you. -Grant