[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637119#action_12637119 ]
Michael McCandless commented on LUCENE-1410:
--------------------------------------------

(Attached autogen.tgz.)

I started peeking at the ASM generated by different tweaks to the bit decoder methods. It's rather depressing: small things, like using relative rather than absolute getInt from ByteBuffer, declaring things final, or reading into an array first instead of calling getInt per int, make sizable differences in the resulting asm and decode speed.

To try out these various changes to the Java sources, I created a Python script to generate the 31 different decode methods. I then made a new standalone perf test that reads in a prx file as a series of packed N-bit ints, and reports the best of 5 runs. These results are NOT testing the full pfor -- just the different possible methods for doing the N-bit int unpacking part of pfor. Paul, I haven't integrated these into the pfor code, but I think we probably should eventually.

Finally, I switched the autogen to C++/gcc to see how much faster simple C code is.

In both the Java and C tests, by far the fastest approach was to mmap the file and read ints from it sequentially (relative gets) using ByteBuffer, so all 3 tests take that approach. (To run these, extract autogen.tgz, then open *.py and see the comment at the top; you'll have to edit the sources to fix the hardwired path to the prx file.)

Java 1 code calls getInt() one at a time, like this:

{code}
static void decode4(final ByteBuffer in, final int[] out) {
  final int i1 = in.getInt();
  out[0] = i1 & 15;
  out[1] = (i1 >>> 4) & 15;
  out[2] = (i1 >>> 8) & 15;
  out[3] = (i1 >>> 12) & 15;
  out[4] = (i1 >>> 16) & 15;
  out[5] = (i1 >>> 20) & 15;
  out[6] = (i1 >>> 24) & 15;
  out[7] = (i1 >>> 28) & 15;
  final int i2 = in.getInt();
  out[8] = i2 & 15;
  out[9] = (i2 >>> 4) & 15;
  out[10] = (i2 >>> 8) & 15;
  ...
}
{code}

Java 2 code gets all N ints up front, like this:

{code}
static void decode4(IntBuffer in, int[] inbuf, int[] out) {
  in.get(inbuf, 0, 16);
  out[0] = inbuf[0] & 15;
  out[1] = (inbuf[0] >>> 4) & 15;
  out[2] = (inbuf[0] >>> 8) & 15;
  out[3] = (inbuf[0] >>> 12) & 15;
  out[4] = (inbuf[0] >>> 16) & 15;
  out[5] = (inbuf[0] >>> 20) & 15;
  out[6] = (inbuf[0] >>> 24) & 15;
  out[7] = (inbuf[0] >>> 28) & 15;
  out[8] = inbuf[1] & 15;
  out[9] = (inbuf[1] >>> 4) & 15;
  out[10] = (inbuf[1] >>> 8) & 15;
  ...
}
{code}

C++ code is analogous to Java 1 (the data is mmap'd):

{code}
static bool decode4(int* out) {
  int i1 = *(data++);
  *(out++) = i1 & 15;
  *(out++) = (i1 >> 4) & 15;
  *(out++) = (i1 >> 8) & 15;
  *(out++) = (i1 >> 12) & 15;
  *(out++) = (i1 >> 16) & 15;
  *(out++) = (i1 >> 20) & 15;
  *(out++) = (i1 >> 24) & 15;
  *(out++) = (i1 >> 28) & 15;
  int i2 = *(data++);
  *(out++) = i2 & 15;
  ...
}
{code}

Here's the performance for each bit size:

||bits||Java 1 (kints/msec)||Java 2 (kints/msec)||C++ (kints/msec)||C advantage||
|1|*916.6*|648.5|1445.3|1.6x|
|2|*793.8*|593.4|1118.3|1.4x|
|3|*616.7*|541.9|1068.8|1.7x|
|4|*656.6*|512.1|1161.6|1.8x|
|5|*499.8*|469.0|897.3|1.8x|
|6|410.6|*444.9*|899.5|2.0x|
|7|367.4|*409.0*|801.7|2.0x|
|8|*414.0*|386.7|816.6|2.0x|
|9|306.3|*366.9*|710.8|1.9x|
|10|278.8|*341.9*|665.8|1.9x|
|11|258.1|*307.8*|623.6|2.0x|
|12|245.9|*311.7*|592.7|1.9x|
|13|223.9|*285.0*|574.5|2.0x|
|14|204.2|*271.8*|538.1|2.0x|
|15|190.3|*260.0*|522.6|2.0x|
|16|229.9|*250.4*|519.7|2.1x|
|17|190.8|*223.7*|488.3|2.2x|
|18|171.6|*198.1*|461.2|2.3x|
|19|158.3|*180.5*|436.2|2.4x|
|20|163.1|*207.4*|416.4|2.0x|
|21|156.3|*166.3*|403.0|2.4x|
|22|147.0|*163.5*|387.8|2.4x|
|23|141.6|*174.1*|357.5|2.1x|
|24|141.9|*162.0*|362.6|2.2x|
|25|133.2|*177.6*|335.3|1.9x|
|26|125.8|*153.5*|334.7|2.2x|
|27|121.6|*139.6*|314.0|2.2x|
|28|122.7|*130.1*|316.7|2.4x|
|29|107.3|*123.9*|296.7|2.4x|
|30|111.0|*127.6*|300.7|2.4x|
|31|*108.1*|94.0|290.5|2.7x|

C code is between 1.4x and 2.7x faster than the best Java run.
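For contrast with the specialized, unrolled decodeN methods above, here is a sketch of what a single generic loop-based unpacker looks like. This is illustrative only -- it is not code from the patch or the attached autogen.tgz, and the class and method names are made up. It handles every bit width with one method, at the cost of the per-value shifting, masking, and boundary branching that the 31 generated methods avoid:

```java
// Illustrative sketch only (not from the patch): a generic unpacker for any
// bit width b in 1..31. Values are packed LSB-first into ints, matching the
// layout the decode4 examples above assume.
public class GenericUnpack {

    /** Unpack count b-bit values from packed[] into out[]. */
    public static void decodeGeneric(int[] packed, int b, int[] out, int count) {
        final int mask = (1 << b) - 1;
        int word = 0;    // current index into packed[]
        int bitPos = 0;  // bit offset within packed[word]
        for (int i = 0; i < count; i++) {
            int v = (packed[word] >>> bitPos) & mask;
            int avail = 32 - bitPos;
            if (avail < b) {
                // value straddles a word boundary: take its high bits
                // from the low end of the next word
                v |= (packed[word + 1] << avail) & mask;
            }
            out[i] = v;
            bitPos += b;
            if (bitPos >= 32) {
                bitPos -= 32;
                word++;
            }
        }
    }

    /** Pack values (each < 2^b) LSB-first; used here only to check round-trips. */
    public static int[] encodeGeneric(int[] values, int b) {
        int[] packed = new int[(values.length * b + 31) / 32];
        int word = 0, bitPos = 0;
        for (int v : values) {
            packed[word] |= v << bitPos;
            int avail = 32 - bitPos;
            if (avail < b) {
                packed[word + 1] |= v >>> avail;
            }
            bitPos += b;
            if (bitPos >= 32) {
                bitPos -= 32;
                word++;
            }
        }
        return packed;
    }

    public static void main(String[] args) {
        int[] vals = {5, 0, 7, 3, 6, 1, 2, 4, 7, 5, 0, 3};
        int[] out = new int[vals.length];
        decodeGeneric(encodeGeneric(vals, 3), 3, out, vals.length);
        System.out.println(java.util.Arrays.equals(vals, out)); // true
    }
}
```

The two inner branches run once per decoded value here, whereas a generated decodeN method bakes the shift amounts and word boundaries in as constants, which is where its speed comes from.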
Reading one int at a time is faster when the number of bits is small (<=5), but reading all N ints up front is generally faster for the larger bit sizes.

> PFOR implementation
> -------------------
>
>                 Key: LUCENE-1410
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1410
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Other
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: LUCENE-1410b.patch, TestPFor2.java, TestPFor2.java, TestPFor2.java
>
>   Original Estimate: 21840h
>  Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.