On Fri, Oct 3, 2008 at 11:49 AM, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> Here's an alternate implementation of ScorePosting's read method. (Note that
> the DECODE_C32 macro from MathUtils advances the pointer supplied to it.)
The code looks like it should give significantly better performance
than the current approach. And yes, a guarantee to be able to read to
the end of single posting should give most of the performance
benefits.
> while (num_prox--) {
> position += DECODE_32(buf);
> *positions++ = position;
> }
I think there are going to be better ways of doing this once we don't
have to check buffer bounds. I mentioned earlier that was looking at
a branchless way to decode VBytes. I don't have working code yet, and
it's not going to be truly branchless, but I'm pretty certain there is
room for big speedup by treating the VBytes in a byte-aligned manner.
The basic algorithm is to read several bytes, make a mask byte from
the high bits that are set, and then use the mask as a switch into a
jump table to handle the pattern of bytes encoded. It should make it
as fast as the S9/S16 code described in the PForDelta paper, and I
think we can make it even faster with a little MMX/SSE optimization.
Thus the loop above would be replaced with something like:
vbyte_read_block(&buf, num_prox, &positions);
Perhaps you could convert the loop into an inline function of this
signature to make a switch like this easier in the future?
> (Later, we'll go further and minimize processor decision branching using
> PForDelta.)
What is your anticipated path for doing this? Would you subclass
ScorePost as VByteScorePost and PForDeltaScorePost? And what would
the mechanism be to ensure that the correct one is used?
Nathan Kurz
[EMAIL PROTECTED]