I was surprised to see a JIRA pop up when we had an open thread on the
topic on the mailing list. Did you see my reply? I believe it had
answered your questions... then you seemed to be about to make a
change anyway. I don't think anything is "serious" here, but I suppose
that's what I was flagging with a -1. Indulge me below first please.

I am sure this can easily be agreed on. I detest classic Apache-style
JIRA fights; I'm not trying to pick any such fight here!


> Just b/c some piece of code isn't considered a bottleneck at the moment means 
> it can't be improved?  This new code removes a fair bit of complexity (the 
> code was passing over the same data at least 3 times in some cases, when it 
> only needed to pass over it once) and puts in place an approach that is far 
> fewer operations AND has a faster underlying data structure.  It's a 
> no-brainer.  I would think we would be happy that we are finally to the point 
> where doing these kinds of optimizations is a worthwhile exercise.

No disagreement about optimizing and improving stuff even for its own
sake. The question is whether this particular change is an improvement,
and I'm not sure I agree with that reasoning (see next). It adds a net
40 lines of code -- I'm not sure that counts as a simplification. It's
no big deal either way, but I suppose I'm not seeing the win.


> As for n log n, yeah, I know, but really the current code is nlogn (PQ 
> operations) + n(copying arrays) + nlogn(sort).  Not to mention several extra 
> memory allocations and branch conditions.  We all know that the constant 
> factor and the additive factors hidden by Big O notation can still make a 
> difference in real systems.   Besides, I have some ideas on other 
> optimizations we can make too, like precomputing more stuff, but that is a 
> separate discussion.

I don't think this analysis is correct, and I think if we delve into
it, maybe what I'm saying will be clearer.

The current code does not perform n log n heap operations -- what are
you referring to? It doesn't even do n inserts. That's the benefit of
keeping around the lowest value currently in the top N and comparing
each candidate against it before touching the heap. This seems to be
exactly the strategy the Lucene class uses too:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/3.4.0/org/apache/lucene/util/PriorityQueue.java
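
For concreteness, this is the pattern I have in mind. It's only a
minimal sketch with made-up names -- not the actual Taste or Lucene
code -- and it uses java.util.PriorityQueue for brevity:

import java.util.PriorityQueue;

// Sketch of the "compare against the lowest retained value" strategy:
// once the heap holds n items, most candidates are rejected with a
// single comparison and never touch the heap at all.
final class TopNSketch {
  static PriorityQueue<Double> topN(double[] scores, int n) {
    PriorityQueue<Double> heap = new PriorityQueue<Double>(n); // min-heap of the best n so far
    double lowestTopValue = Double.NEGATIVE_INFINITY;
    for (double score : scores) {
      if (heap.size() < n) {
        heap.add(score);
        lowestTopValue = heap.peek();
      } else if (score > lowestTopValue) {
        heap.poll();                  // evict the current lowest of the top n
        heap.add(score);
        lowestTopValue = heap.peek();
      }
      // else: rejected with one comparison, no heap operation
    }
    return heap;
  }
}

For typical input only a small fraction of the n elements ever pass
that lowestTopValue check, so the number of real heap operations is
far below n, let alone n log n.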

That's not what I meant, though. Constructing the sorted list of
results from a heap necessarily involves a sort: your code does a heap
sort, implicitly, by calling pop() to clear the PriorityQueue. There is
no difference there; the new method spends just as much time sorting.
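
Put differently -- again just an illustrative sketch, using
java.util.PriorityQueue's poll() where Lucene's class would use pop():
however you get from the heap of the k surviving results to a sorted
list, you pay the same ~k log k.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Two ways to turn a heap of the k best results into a sorted list.
// Both cost about k log k; popping the heap is heap sort in disguise.
final class DrainHeapSketch {

  // Drain by repeated poll(): k polls at O(log k) each.
  static List<Double> bySuccessivePops(PriorityQueue<Double> heap) {
    List<Double> result = new ArrayList<Double>(heap.size());
    while (!heap.isEmpty()) {
      result.add(heap.poll());   // comes out in ascending order
    }
    return result;
  }

  // Copy out and sort explicitly: also O(k log k).
  static List<Double> byCopyAndSort(PriorityQueue<Double> heap) {
    List<Double> result = new ArrayList<Double>(heap);
    Collections.sort(result);
    return result;
  }
}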

There's also no difference in allocating a container for the result in
getTopItems(). Both fill out a List. What's the win there?

There *is* a difference in getTopUsers()! You are going straight to
the long[] result, bypassing the intermediate array of results. That's
a small win, but a win.
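
Something like the following is the distinction I mean -- purely
illustrative shapes with hypothetical method bodies, not the real
getTopUsers() code:

import java.util.PriorityQueue;

// The difference is whether the heap contents pass through an
// intermediate object array before reaching the long[] result,
// or are written into it directly.
final class TopUsersSketch {

  // Roughly the old shape: heap -> intermediate Long[] -> long[]
  static long[] viaIntermediate(PriorityQueue<Long> heap) {
    Long[] intermediate = new Long[heap.size()];
    for (int i = 0; i < intermediate.length; i++) {
      intermediate[i] = heap.poll();
    }
    long[] result = new long[intermediate.length];
    for (int i = 0; i < result.length; i++) {
      result[i] = intermediate[i];   // extra allocation plus a second pass
    }
    return result;
  }

  // Roughly the new shape: heap -> long[] directly, one allocation fewer
  static long[] direct(PriorityQueue<Long> heap) {
    long[] result = new long[heap.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = heap.poll();       // unbox straight into the primitive array
    }
    return result;
  }
}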


> So, do you really think this is worth vetoing?   This is nowhere near a 
> veto-able piece of code, IMO.  Not even in the same solar system.  Vetoes are 
> for stuff that moves something in a completely opposite direction from the 
> project's goal.  Us getting rid of Watchmaker is a veto-able offense if 
> someone feels near and dear to that code.  This is a micro optimization in a 
> single class file.  Sure, there are other, bigger optimizations we can make, 
> but that doesn't mean we can't fix things.

I was concerned about a perceived process problem above and just
wanted to put up a stop-sign in Apache-speak. Is it so serious? To
rationalize it: I thought you were indicating you were moving ahead
with a change that I had open questions about. That seems like
something we shouldn't do; it's against mom, apple pie and the Apache
way. It seems OK to flag that with a -1. Right? I don't know. That's
my meaning, anyway.


> I'm working on it, which is why I haven't committed yet, b/c I know it bears 
> some scrutiny.  I was putting it up to show how much simpler it is.  But it 
> doesn't take a benchmarking tool to see that removing n inserts, a sort of an 
> array of size n and then another n inserts in favor of a single set of n 
> inserts is going to be faster.  Not to mention it also kills several 
> extraneous array allocations and branch condition checks.  This isn't the 
> kind of stuff the JVM can optimize for you.

Since you've already done some good work to put that in place, it
would really be good to see the results. In practice it may well be
faster, theoretical arguments aside -- and that would speak for itself.
But even if it turns out a bit slower, and you've heard all the
thinking here and would still prefer to make the change, I'm OK with
it. At least then it will have come after useful discussion and actual
measurement.
