Re: New portion of improvements for Dual-Pivot Quicksort

Vladimir Iaroslavski Tue, 22 Jun 2010 06:24:44 -0700

Both computers have Intel processors:

Pentium, Intel, 4 CPU, 3.2 GHz, 2 Gb RAM
Pentium, Intel, 4 CPU, 2.8 GHz, 2 Gb RAM


Osvaldo Pinali Doederlein wrote:

 Hi Vladimir,
Hello,

I tried with the latest JDK, build 98 and see different behaviour
on two computers: 7570 / 8318 and 8560 / 8590, but sorting method
works slower with a[less++] instruction on both computers:
Is the first computer (with larger performance diff) an AMD by any chance? It's odd that just the a[less++] would make the code so much slower. Perhaps the worse compiled code has additional costs in some CPUs, e.g. spoiling branch prediction for the bounds checking guards.
Anyway the change is risk-free (including performance) and the advantage is important at least in some CPUs, so go ahead (FWIW wrt my opinion...). I just wish that C1 would be fixed to not need this kind of hacking - as I categorize this as a hack, not even as a "normal" low-level Java code optimization - because there are certainly millions of uses of the idiom "array[index++/--]" in JavaSE APIs and elsewhere. But I'm not familiar with the C1 source so I don't know if this is some low-hanging fruit that could be addressed quickly (any HotSpot engineers there to comment?).
A+
Osvaldo
                 first               second
          a[less] = ak; less++; / a[less++] = ak;

      random: 16371 / 16696        14018 / 14326
   ascendant:  2706 /  2762         2884 /  2897
  descendant:  2994 /  3108         3170 /  3258
 organ pipes:  7073 /  7296         6939 /  7090
  stagger(2):  7765 /  8069         7531 /  7731
 period(1,2):   653 /   743          719 /   753
random(1..4):  2152 /  2234         1567 /  1591

If I change Test class and populate the array with descendant
values, I still see difference in times: 4793 / 5718
see updated attached Test class.

Also I'm attaching the latest version of DualPivotQuicksort
with minor format changes. If you don't have comments, I'll
ask to integrate the code at the end of this week.

Thank you,
Vladimir

Osvaldo Doederlein wrote:
Hi Vladimir,
2010/6/19 Vladimir Iaroslavski <iaroslav...@mail.ru>
    Hello Osvaldo,

    I've prepared simple test which scans an array and does assignments
    for each element,
    see attached Test class:

    a[k] = a[less];
    a[less++] = 0; // or a[less] = 0; less++;

    The result of running "java -client Test" is:

    a[less], less++;   Time: 6998
    a[less++];         Time: 8416

    It is much more than 1%. Is it a bug in the JVM? Note that under the server VM
The amount of diff surely depends on the benchmark; your bench should "zoom" the problem by not having much other work polluting the measurement. But in my own test with b98 (32-bit), Q6600 CPU, I've got 5065/5079, so the diff is < 1%. The excessive disadvantage you report may be something bad in your older b84.
Anyway I investigated the JIT-compiled code, details in the end (I've split the benchmark in two classes just to make the comparison simpler). The problem with "a[less++]" is that "less++" will first increment "less", _then_ it will use the old value (not-incremented) to index "a". C1 generates code that is equivalent to:
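For readers without the attached Test class, the following is a minimal sketch of a benchmark of this kind. It is not the attachment itself: the class name IncrementBench, the array size, the round counts and the assumption that "k" and "less" start at the same index are all hypothetical. Only the two loop bodies and the sort1/sort2 naming follow the thread.

```java
// Hypothetical reconstruction of a micro-benchmark in the spirit of the
// attached Test class; sizes and round counts are illustrative only.
final class IncrementBench {

    // Variant with the postfix increment inside the index expression.
    static void sort1(int[] a, int left, int right) {
        int less = left; // assumption: "less" starts at the same index as "k"
        for (int k = left; k < right; k++) {
            a[k] = a[less];
            a[less++] = 0;
        }
    }

    // Variant with the store and the increment as separate statements.
    static void sort2(int[] a, int left, int right) {
        int less = left;
        for (int k = left; k < right; k++) {
            a[k] = a[less];
            a[less] = 0;
            less++;
        }
    }

    // Runs one variant "rounds" times and returns the elapsed milliseconds.
    static long timeMillis(boolean postfix, int[] a, int rounds) {
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            if (postfix) {
                sort1(a, 0, a.length);
            } else {
                sort2(a, 0, a.length);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int[] a = new int[1_000_000];
        // Warm-up so that both methods are JIT-compiled before timing.
        timeMillis(true, a, 50);
        timeMillis(false, a, 50);
        System.out.println("a[less], less++;   Time: " + timeMillis(false, a, 500));
        System.out.println("a[less++];         Time: " + timeMillis(true, a, 500));
    }
}
```

Running it with "java -client IncrementBench" and then with "java -server IncrementBench" should exercise C1 and C2 respectively on a JDK that still ships both VMs; the absolute numbers will depend on the machine and the build.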
int less_incremented = less + 1;
a[less] = 0;
less = less_incremented;
...which is a 1-to-1 translation of the IR coming off the bytecode. C1 is not smart enough to see that the increment can be reordered after the indexing, maybe because there's a data dependency as the indexing uses "less"; but due to the semantics of postfix "++" this dependency is for the before-increment value, so the reordering would be safe. Maybe that's just some simple missing heuristics that could be easily added?
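The claim that the reordering is safe can be checked with a small sketch. The class and method names below are hypothetical, and the stored value k * 2 is chosen only so that the resulting arrays are not trivially equal; the three loop bodies correspond to the postfix form, the literal expansion described above, and the store-then-increment form.

```java
import java.util.Arrays;

// Hypothetical demo: three ways of writing "store at less, then advance less".
final class PostfixIncrementDemo {

    // Postfix increment inside the index expression.
    static int[] fillPostfix(int n) {
        int[] a = new int[n];
        int less = 0;
        for (int k = 0; k < n; k++) {
            a[less++] = k * 2;
        }
        return a;
    }

    // Literal expansion: the incremented value is computed first and held
    // in a temporary, while the store still uses the old index.
    static int[] fillExpanded(int n) {
        int[] a = new int[n];
        int less = 0;
        for (int k = 0; k < n; k++) {
            int lessIncremented = less + 1;
            a[less] = k * 2;
            less = lessIncremented;
        }
        return a;
    }

    // Store first, then increment in place; no temporary is needed.
    static int[] fillSeparate(int n) {
        int[] a = new int[n];
        int less = 0;
        for (int k = 0; k < n; k++) {
            a[less] = k * 2;
            less++;
        }
        return a;
    }

    public static void main(String[] args) {
        int[] p = fillPostfix(8);
        System.out.println(Arrays.equals(p, fillExpanded(8)));
        System.out.println(Arrays.equals(p, fillSeparate(8)));
    }
}
```

All three forms write the same values to the same indices, so the difference is only in how many temporaries a compiler keeps live across the store.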
    there is no difference between "a[less++]" and "a[less], less++".
C2 generates completely different code, with 16X unrolling - this is the inner loop:
  0x026a6e40: pxor   %xmm0,%xmm0        ;*aload_0
                                        ; - Test1::so...@9 (line 23)
  0x026a6e44: movq   %xmm0,0xc(%ecx,%esi,4)
  0x026a6e4a: movq   %xmm0,0x14(%ecx,%esi,4)
  0x026a6e50: movq   %xmm0,0x1c(%ecx,%esi,4)
  0x026a6e56: movq   %xmm0,0x24(%ecx,%esi,4)
  0x026a6e5c: movq   %xmm0,0x2c(%ecx,%esi,4)
  0x026a6e62: movq   %xmm0,0x34(%ecx,%esi,4)
  0x026a6e68: movq   %xmm0,0x3c(%ecx,%esi,4)
  0x026a6e6e: movq   %xmm0,0x44(%ecx,%esi,4)  ;*iastore
                                        ; - Test1::so...@21 (line 24)
  0x026a6e74: add    $0x10,%esi         ;*iinc
                                        ; - Test1::so...@22 (line 22)
  0x026a6e77: cmp    %ebp,%esi
  0x026a6e79: jl     0x026a6e44
There is some extra slow-path code to fill the remaining 1...15 elements if the loop length is not a multiple of 16, and that's all. C2 detects the redundancy between the "k" and "less" vars, and kills the also-redundant "a[k] = a[less]" assignment so the net result is a simple zero-fill of the array. Maybe a different benchmark without these redundancies would make it easier to see that C2 doesn't have a problem with the postfix "++", but if it had, I think it wouldn't reach the excellent result above.
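A short sketch of why the loop collapses to a zero-fill, assuming (as the redundancy remark suggests, but the attached Test class is not shown here) that "k" and "less" start at the same index. The names ZeroFillDemo, scan and zeroFill are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo of the redundancy that C2 is described as exploiting.
final class ZeroFillDemo {

    // Assumed shape of the benchmark loop.
    static void scan(int[] a, int left, int right) {
        int less = left; // assumption: same starting index as k
        for (int k = left; k < right; k++) {
            a[k] = a[less]; // k == less on every iteration: a self-assignment
            a[less++] = 0;  // so only this store has a visible effect
        }
    }

    // What the loop reduces to once the redundancy is removed.
    static void zeroFill(int[] a, int left, int right) {
        Arrays.fill(a, left, right, 0);
    }

    public static void main(String[] args) {
        int[] x = {9, 8, 7, 6, 5};
        int[] y = x.clone();
        scan(x, 1, 4);
        zeroFill(y, 1, 4);
        System.out.println(Arrays.toString(x)); // [9, 0, 0, 0, 5]
        System.out.println(Arrays.equals(x, y)); // true
    }
}
```

A benchmark in which "less" advances only on some iterations would break this equality and keep the "a[k] = a[less]" load and store alive, which is the kind of variation suggested above.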
A+
Osvaldo


    I'm using JDK 7 on Windows XP:

    java version "1.7.0-ea"
    Java(TM) SE Runtime Environment (build 1.7.0-ea-b84)
    Java HotSpot(TM) Client VM (build 17.0-b09, mixed mode, sharing)

    Thanks,
    Vladimir



This is the C1 code for sort2()'s loop:
  0x0251c1dc: cmp    0x8(%ecx),%esi     ; implicit exception: dispatches to 0x0251c21e
  ;;   30 branch [AE] [RangeCheckStub: 0x454c640] [bci:13]
  0x0251c1df: jae    0x0251c24a
  0x0251c1e5: mov    0xc(%ecx,%esi,4),%ebx  ;*iaload: %ebx = a[less];
                                        ; - Test2::so...@13 (line 23)
  0x0251c1e9: cmp    0x8(%ecx),%edi
  ;;   36 branch [AE] [RangeCheckStub: 0x454c7e0] [bci:14]
  0x0251c1ec: jae    0x0251c263
  0x0251c1f2: mov    %ebx,0xc(%ecx,%edi,4)  ;*iastore: a[k] = %ebx;
                                        ; - Test2::so...@14 (line 23)

(sort1/sort2 start to differ here)

  0x0251c1f6: cmp    0x8(%ecx),%esi
  ;;   42 branch [AE] [RangeCheckStub: 0x454c980] [bci:18]
  0x0251c1f9: jae    0x0251c27c
  0x0251c1ff: movl   $0x0,0xc(%ecx,%esi,4)  ;*iastore: a[less] = 0;
                                        ; - Test2::so...@18 (line 24)
  0x0251c207: inc    %esi               ; ++less;
  0x0251c208: inc    %edi               ; OopMap{ecx=Oop off=73}
                                        ;*goto: for k++
                                        ; - Test2::so...@25 (line 22)
  0x0251c209: test   %eax,0x1a0100      ;*goto
                                        ; - Test2::so...@25 (line 22)
                                        ;   {poll}
  ;;  block B1 [4, 6]

  0x0251c20f: cmp    %edx,%edi
  ;;   22 branch [LT] [B2]
  0x0251c211: jl     0x0251c1dc         ;*if_icmpge: for k < right
                                        ; - Test2::so...@6 (line 22)
The code looks OK; C1 doesn't do much optimization - no unrolling or bounds check elimination - but it's otherwise just about the code you would expect for a simple JITting.

Now, the C1 code for sort1()'s loop:
  0x024bc21c: cmp    0x8(%ecx),%esi     ; implicit exception: dispatches to 0x024bc262
  ;;   30 branch [AE] [RangeCheckStub: 0x44ee3b0] [bci:13]
  0x024bc21f: jae    0x024bc28e
  0x024bc225: mov    0xc(%ecx,%esi,4),%ebx  ;*iaload: %ebx = a[less];
                                        ; - Test1::so...@13 (line 23)
  0x024bc229: cmp    0x8(%ecx),%edi
  ;;   36 branch [AE] [RangeCheckStub: 0x44ee550] [bci:14]
  0x024bc22c: jae    0x024bc2a7
  0x024bc232: mov    %ebx,0xc(%ecx,%edi,4)  ;*iastore: a[k] = %ebx;
                                        ; - Test1::so...@14 (line 23)

(sort1/sort2 start to differ here)
  0x024bc236: mov    %esi,%ebx          ; Crap! C1 temps 'less' into %ebx
  0x024bc238: inc    %ebx               ; ++less; (for the temp "less from future")
  0x024bc239: cmp    0x8(%ecx),%esi     ; %esi is still the "old less"....
  ;;   46 branch [AE] [RangeCheckStub: 0x44ee7b8] [bci:21]
  0x024bc23c: jae    0x024bc2c0
  0x024bc242: movl   $0x0,0xc(%ecx,%esi,4)  ;*iastore: a[less++] = 0;
                                        ; - Test1::so...@21 (line 24)
  0x024bc24a: inc    %edi               ; OopMap{ecx=Oop off=75}
                                        ;*goto: for k++
                                        ; - Test1::so...@25 (line 22)
  0x024bc24b: test   %eax,0x1a0100      ;   {poll}
  0x024bc251: mov    %ebx,%esi          ;*goto
                                        ; - Test1::so...@25 (line 22): for...
  ;;  block B1 [4, 6]

  0x024bc253: cmp    %edx,%edi
  ;;   22 branch [LT] [B2]
  0x024bc255: jl     0x024bc21c         ;*if_icmpge
                                        ; - Test1::so...@6 (line 22): for...
