Vetoes and [jira] [Commented] (MAHOUT-881) Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting

Grant Ingersoll Sat, 12 Nov 2011 07:41:42 -0800

Moving this discussion to the mailing list, as a -1 is a fairly serious thing 
when it comes to Apache and it merits a discussion.

On Nov 12, 2011, at 2:06 AM, Sean Owen (Commented) (JIRA) wrote:

> 
>    [ 
> https://issues.apache.org/jira/browse/MAHOUT-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149034#comment-13149034
>  ] 
> 
> Sean Owen commented on MAHOUT-881:
> ----------------------------------
> 
> -1 Grant I thought we discussed this on the mailing list? I don't see that 
> this achieves anything. You are still doing an implicit n log n heap sort to 
> attain an ordered result. You still allocate a container object for the 
> result. This seems like swapping some code for more complex (but equally 
> working) code, and adding a Lucene dependency.

Just b/c some piece of code isn't considered a bottleneck at the moment means 
it can't be improved?  This new code removes a fair bit of complexity (the code 
was passing over the same data at least 3 times in some cases, when it only 
needed to pass over it once) and puts in place an approach that is far fewer 
operations AND has a faster underlying data structure.  It's a no-brainer.  I 
would think we would be happy that we are finally to the point where doing 
these kinds of optimizations is a worthwhile exercise.

As for the Lucene dependency, we already have it and it isn't going anywhere 
anytime soon.  

As for n log n, yeah, I know, but really the current code is nlogn (PQ 
operations) + n(copying arrays) + nlogn(sort).  Not too mention several extra 
memory allocations and branch conditions.  We all know that the constant factor 
and the additive factors hidden by Big O notation can still make a difference 
in real systems.   Besides, I have some ideas on other optimizations we can 
make too, like precomputing more stuff, but that is a separate discussion.

So, do you really think this is worth vetoing?   This is nowhere near a 
veto-able piece of code, IMO.  Not even in the same solar system.  Vetoes are 
for stuff that moves something in a completely opposite direction from the 
project's goal.  Us getting rid of Watchmaker is a veto-able offense if someone 
feels near and dear to that code.  This is a micro optimization in a single 
class file.  Sure, there are other, bigger optimizations we can make, but that 
doesn't mean we can't fix things.

> 
> Do you have a load test that shows it would be notably faster? the patch does 
> have such a thing.

I'm working on it, which is why I haven't committed yet, b/c I know it bears 
some scrutiny.  I was putting it up to show how much simpler it is.  But it 
doesn't take a benchmarking tool to see that removing n inserts, a sort of an 
array of size n and then another n inserts in favor of a single set of n 
inserts is going to be faster.  Not too mention it also kills several 
extraneous array allocations and branch condition checks.  This isn't the kind 
of stuff the JVM can optimize for you.

-Grant

Vetoes and [jira] [Commented] (MAHOUT-881) Refactor TopItems to use Lucene's PriorityQueue and remove excessive sorting

Reply via email to