[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637119#action_12637119 ]

Michael McCandless commented on LUCENE-1410:
--------------------------------------------


(Attached autogen.tgz).

I started peeking at the ASM generated by different tweaks to the
bit decoder methods.

It's rather depressing: small things, like using relative rather than
absolute getInt from ByteBuffer, declaring things final, or reading
into an array first instead of calling getInt per int, make sizable
differences in the resulting asm and in decode speed.
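
For reference, here is the relative-vs-absolute getInt distinction as
a minimal illustration (helper names are mine, not from the attached
sources):

{code}
import java.nio.ByteBuffer;

class GetIntStyles {
  // Relative get: reads at the buffer's current position and advances
  // it by 4 bytes, so sequential reads need no index bookkeeping.
  static int readRelative(ByteBuffer in) {
    return in.getInt();
  }

  // Absolute get: reads at an explicit byte index and leaves the
  // position untouched; the caller must track the index itself.
  static int readAbsolute(ByteBuffer in, int byteIndex) {
    return in.getInt(byteIndex);
  }
}
{code}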

To try out these various changes to the Java sources, I created a
Python script to generate the 31 different decode methods.  I then
made a new standalone perf test that reads in a prx file as a series
of packed N-bit ints, and reports the best of 5 runs.

These results are NOT testing the full pfor -- just different possible
methods for doing the N-bit int unpacking part of pfor.  Paul, I
haven't integrated these into the pfor code, but I think we probably
eventually should.

Finally, I switched the autogen to C++/gcc to see how much faster
simple C code is.

In both the Java and C++ tests, by far the fastest way was to mmap the
file and read ints from it sequentially (relative gets) through a
ByteBuffer, so all three variants take that approach.

(To run these, extract autogen.tgz, then open *.py and see the comment
at the top; you'll have to edit the sources to fix the hardwired path
to the prx file.)
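
Concretely, the fast path looks something like this sketch (class and
variable names are mine, and the decode call is a stand-in; the real
harness is in autogen.tgz):

{code}
import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

class MmapHarness {
  public static void main(String[] args) throws Exception {
    RandomAccessFile raf = new RandomAccessFile(args[0], "r");  // the prx file
    FileChannel ch = raf.getChannel();
    // mmap the whole file and view it as ints; relative get() then
    // walks it sequentially
    IntBuffer ints = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
    int[] out = new int[128];
    long best = Long.MAX_VALUE;
    for (int run = 0; run < 5; run++) {       // report best of 5 runs
      ints.rewind();
      long t0 = System.nanoTime();
      while (ints.remaining() >= 16) {
        decode4(ints, out);                   // stand-in for a generated decoder
      }
      best = Math.min(best, System.nanoTime() - t0);
    }
    System.out.println("best run: " + (best / 1000000.0) + " msec");
    raf.close();
  }

  // stub so the sketch compiles; the real decode methods are generated
  static void decode4(IntBuffer in, int[] out) {
    in.get(out, 0, 16);
  }
}
{code}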

Java 1 code calls getInt() one int at a time, like this:

{code}
  static void decode4(final ByteBuffer in, final int[] out) {
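    // 15 == (1 << 4) - 1; each 32-bit int packs eight 4-bit values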
    final int i1 = in.getInt();
    out[0] = i1 & 15;
    out[1] = (i1 >>> 4) & 15;
    out[2] = (i1 >>> 8) & 15;
    out[3] = (i1 >>> 12) & 15;
    out[4] = (i1 >>> 16) & 15;
    out[5] = (i1 >>> 20) & 15;
    out[6] = (i1 >>> 24) & 15;
    out[7] = (i1 >>> 28) & 15;
    final int i2 = in.getInt();
    out[8] = i2 & 15;
    out[9] = (i2 >>> 4) & 15;
    out[10] = (i2 >>> 8) & 15;
    ...
  }
{code}

Java 2 code gets all N ints up front, like this:

{code}
  static void decode4(IntBuffer in, int[] inbuf, int[] out) {
    in.get(inbuf, 0, 16);
    out[0] = inbuf[0] & 15;
    out[1] = (inbuf[0] >>> 4) & 15;
    out[2] = (inbuf[0] >>> 8) & 15;
    out[3] = (inbuf[0] >>> 12) & 15;
    out[4] = (inbuf[0] >>> 16) & 15;
    out[5] = (inbuf[0] >>> 20) & 15;
    out[6] = (inbuf[0] >>> 24) & 15;
    out[7] = (inbuf[0] >>> 28) & 15;
    out[8] = inbuf[1] & 15;
    out[9] = (inbuf[1] >>> 4) & 15;
    out[10] = (inbuf[1] >>> 8) & 15;
    ...
  }
{code}
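
The generated methods are just per-width unrollings of a generic N-bit
unpack loop, roughly like this sketch (my reconstruction for
illustration, not code from autogen.tgz):

{code}
  // Unpack "count" values of numBits bits each (1 <= numBits <= 31),
  // packed low-bits-first into 32-bit ints.
  static void decodeGeneric(IntBuffer in, int[] out, int count, int numBits) {
    final int mask = (1 << numBits) - 1;
    int word = 0;
    int bitsLeft = 0;  // unconsumed bits remaining in word
    for (int i = 0; i < count; i++) {
      if (bitsLeft >= numBits) {
        out[i] = word & mask;
        word >>>= numBits;
        bitsLeft -= numBits;
      } else {
        // the value straddles a word boundary (or word is exhausted)
        final int low = word;  // the bitsLeft low-order bits left over
        word = in.get();
        out[i] = (low | (word << bitsLeft)) & mask;
        word >>>= (numBits - bitsLeft);
        bitsLeft = 32 - (numBits - bitsLeft);
      }
    }
  }
{code}

Unrolling this per bit width turns all the shift amounts and masks
into compile-time constants, which is what the Python script automates.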

C++ code is analogous to Java 1 (data is mmap'd):

{code}
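// "data" is a file-scope int* cursor into the mmap'd prx file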
static bool decode4(int* out) {
  int i1 = *(data++);
  *(out++) = i1 & 15;
  *(out++) = (i1 >> 4) & 15;
  *(out++) = (i1 >> 8) & 15;
  *(out++) = (i1 >> 12) & 15;
  *(out++) = (i1 >> 16) & 15;
  *(out++) = (i1 >> 20) & 15;
  *(out++) = (i1 >> 24) & 15;
  *(out++) = (i1 >> 28) & 15;
  int i2 = *(data++);
  *(out++) = i2 & 15;
  ...
}
{code}

Here's the performance for each bit size (the best Java result in each
row is bolded):

||bits||Java 1 (kints/msec)||Java 2 (kints/msec)||C++ (kints/msec)||C++ advantage||
|1|*916.6*|648.5|1445.3|1.6x|
|2|*793.8*|593.4|1118.3|1.4x|
|3|*616.7*|541.9|1068.8|1.7x|
|4|*656.6*|512.1|1161.6|1.8x|
|5|*499.8*|469.0|897.3|1.8x|
|6|410.6|*444.9*|899.5|2.0x|
|7|367.4|*409.0*|801.7|2.0x|
|8|*414.0*|386.7|816.6|2.0x|
|9|306.3|*366.9*|710.8|1.9x|
|10|278.8|*341.9*|665.8|1.9x|
|11|258.1|*307.8*|623.6|2.0x|
|12|245.9|*311.7*|592.7|1.9x|
|13|223.9|*285.0*|574.5|2.0x|
|14|204.2|*271.8*|538.1|2.0x|
|15|190.3|*260.0*|522.6|2.0x|
|16|229.9|*250.4*|519.7|2.1x|
|17|190.8|*223.7*|488.3|2.2x|
|18|171.6|*198.1*|461.2|2.3x|
|19|158.3|*180.5*|436.2|2.4x|
|20|163.1|*207.4*|416.4|2.0x|
|21|156.3|*166.3*|403.0|2.4x|
|22|147.0|*163.5*|387.8|2.4x|
|23|141.6|*174.1*|357.5|2.1x|
|24|141.9|*162.0*|362.6|2.2x|
|25|133.2|*177.6*|335.3|1.9x|
|26|125.8|*153.5*|334.7|2.2x|
|27|121.6|*139.6*|314.0|2.2x|
|28|122.7|*130.1*|316.7|2.4x|
|29|107.3|*123.9*|296.7|2.4x|
|30|111.0|*127.6*|300.7|2.4x|
|31|*108.1*|94.0|290.5|2.7x|

The C++ code is between 1.4x and 2.7x faster than the best Java run.
Reading one int at a time is faster when the number of bits is small
(<=5), but reading all N ints up front is generally faster for larger
bit sizes.
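
Once these are folded into the pfor code, per-frame dispatch would
presumably go through a table indexed by bit width; a hypothetical
sketch (all names are mine):

{code}
import java.nio.IntBuffer;

@FunctionalInterface
interface IntDecoder {
  void decode(IntBuffer in, int[] inbuf, int[] out);
}

class DecoderTable {
  // One slot per bit width 1..31, to be filled with the generated
  // methods, e.g. BY_NUM_BITS[4] = SomeDecoders::decode4 (hypothetical;
  // the generated sources live in autogen.tgz).
  static final IntDecoder[] BY_NUM_BITS = new IntDecoder[32];

  static void decodeFrame(IntBuffer in, int[] inbuf, int[] out, int numBits) {
    BY_NUM_BITS[numBits].decode(in, inbuf, out);
  }
}
{code}

Given the crossover in the table above, such a table could even mix
the Java 1-style methods for small widths with the Java 2-style
methods for larger ones.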


> PFOR implementation
> -------------------
>
>                 Key: LUCENE-1410
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1410
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Other
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: LUCENE-1410b.patch, TestPFor2.java, TestPFor2.java, TestPFor2.java
>
>   Original Estimate: 21840h
>  Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.
