Re: New portion of improvements for Dual-Pivot Quicksort

Osvaldo Doederlein Tue, 06 Jul 2010 18:43:25 -0700

The ironic thing is that these operators were originally introduced in the C
language, at least according to well-known folklore, as an optimization -
taking advantage of inc/dec instructions from the PDP arch, in the old times
when (I suppose) compilers wouldn't do even basic strength-reduction
optimizations like x + 1 => inc x. But this works well for pre-inc/dec, but
post-increment and post-decrement have a more complex semantics that
requires inducing an extra temporary if the compiler is not smart enough.


My personal coding style, since my days of C, enforces pre- instead of post-
operators; I will always write ++i, never i++, when both choices have the
same effect. This is exactly because it's the post- operators that have
confusing and error-prone semantics, as soon as the inc/dec operation is
used as RHS of a bigger expression (or used a parameter etc.), and not as an
isolated statement. I will use post- operators only when I really need its
semantics to make my code a little extra terse. And now we discover another
gotcha of post- operators: potentially less efficient code generation. It's
worth noticing this, because I the post- operators seem to be the preferred
style for most people, they can even be found in many Java style
guidelines... I wonder why.

Perhaps it's just the name of the C++ language that has taught a generation
to like this wrong style? In that case, the blame goes to Stroustrup, he
should have named it "++C" instead. That would get extra nerd points too ;-)

A+
Osvaldo

2010/7/6 David Holmes <david.hol...@oracle.com>

> I asked someone to take a look at this and it seems that the problem is
> that a[less++] requires introduction of a temporary to store the
> pre-incremented index value which in turn increases register pressure and
> can lead to a register spill. This seems to not only be architecture
> dependent but even variable on the same architecture (as observed below).
>
> As Osvaldo noted C1 does not attempt to decide whether the increment can be
> reordered with the access and so allow removal of the temporary. You'd need
> the compiler folk (C1/C2) to comment on whether or not it would be
> reasonable/possible/practical for C1 to do that - so I've cc'ed them (though
> I'm not a member of their list so it may be a little while before it gets
> pushed through to them.)
>
> David Holmes
>
> Vladimir Iaroslavski said the following on 06/22/10 23:24:
>
>  both computers with Intel processor:
>>
>> Pentium, Intel, 4 CPU, 3.2 GHz, 2 Gb RAM
>> Pentium, Intel, 4 CPU, 2.8 GHz, 2 Gb RAM
>>
>> Osvaldo Pinali Doederlein wrote:
>>
>>>  Hi Vladimir,
>>>
>>>  Hello,
>>>>
>>>> I tried with the latest JDK, build 98 and see different behaviour
>>>> on two computers: 7570 / 8318 and 8560 / 8590, but sorting method
>>>> works slower with a[less++] instruction on both computers:
>>>>
>>>
>>> Is the first computer (with larger performance diff) an AMD by any
>>> chance? It's odd that just the a[less++] would make the code so much slower.
>>> Perhaps the worst compiled code has additional costs in some CPUs, e.g.
>>> spoiling branch prediction for the bounds checking guards.
>>>
>>> Anyway the change is risk-free (including performance) and the advantage
>>> is important at least in some CPUs, so go ahead (FWIW wrt my opinion...). I
>>> just wish that C1 would be fixed to not need this kind of hacking - as I
>>> categorize this as a hack, not even as a "normal" low-level Java code
>>> optimization - because there are certainly millions of uses of the idiom
>>> "array[index++/--]" in JavaSE APIs and elsewhere. But I'm not familiar with
>>> the C1 source so I don't know if this is some low hanging fruit that could
>>> be addressed quickly (any HotSpot engineers there to comment?).
>>>
>>> A+
>>> Osvaldo
>>>
>>>                  first               second
>>>>          a[less] = ak; less++; / (a[less++] = ak;
>>>>
>>>>      random: 16371 / 16696        14018 / 14326
>>>>   ascendant:  2706 /  2762         2884 /  2897
>>>>  descendant:  2994 /  3108         3170 /  3258
>>>>  organ pipes:  7073 /  7296         6939 /  7090
>>>>  stagger(2):  7765 /  8069         7531 /  7731
>>>>  period(1,2):   653 /   743          719 /   753
>>>> random(1..4):  2152 /  2234         1567 /  1591
>>>>
>>>> If I change Test class and populate the array with descendant
>>>> values, I still see difference in times: 4793 / 5718
>>>> see updated attached Test class.
>>>>
>>>> Also I'm attaching the latest version of DualPivotQuicksort
>>>> with minor format changes. If you don't have comments, I'll
>>>> ask to to integrate the code at the end of this week.
>>>>
>>>> Thank you,
>>>> Vladimir
>>>>
>>>> Osvaldo Doederlein wrote:
>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> 2010/6/19 Vladimir Iaroslavski <iaroslav...@mail.ru <mailto:
>>>>> iaroslav...@mail.ru>>
>>>>>
>>>>>    Hello Osvaldo,
>>>>>
>>>>>    I've prepared simple test which scans an array and does assignments
>>>>>    for each element,
>>>>>    see attached Test class:
>>>>>
>>>>>    a[k] = a[less];
>>>>>    a[less++] = 0; // or a[less] = 0; less++;
>>>>>
>>>>>    The result of running "java -client Test" is:
>>>>>
>>>>>    a[less], less++;   Time: 6998
>>>>>    a[less++];         Time: 8416
>>>>>
>>>>>    It is much more than 1%. Is it bug in JVM? Note that under server VM
>>>>>
>>>>> The amount of diff surely depends on the benchmark; your bench should
>>>>> "zoom" the problem by not having much other work polluting the 
>>>>> measurement.
>>>>> But in my own test with b98 (32-bit), Q6600 CPU, I've got 5065/5079, so 
>>>>> the
>>>>> diff is < 1%. The excessive disadvantage you report may be something bad 
>>>>> in
>>>>> your older b84.
>>>>>
>>>>> Anyway I investigated the JIT-compiled code, details in the end (I've
>>>>> split the benchmark in two classes just to make the comparison simpler). 
>>>>> The
>>>>> problem with "a[less++]" is that "less++" will first increment "less",
>>>>> _then_ it will use the old value (not-incremented) to index "a". C1
>>>>> generates code that is equivalent to:
>>>>>
>>>>> int less_incremented = less + 1;
>>>>> a[less] = 0;
>>>>> less = less_incremented;
>>>>>
>>>>> ...which is a 1-to-1 translation of the IR coming off the bytecode. C1
>>>>> is not smart enough to see that the increment can be reordered  after the
>>>>> indexing, maybe because there's a data dependency as the indexing uses
>>>>> "less"; but due to the semantics of postfix "++" this dependency is for 
>>>>> the
>>>>> before-increment value, so the reordering would be safe. Maybe that's just
>>>>> some simple missing heuristics that could be easily added?
>>>>>
>>>>>    there is no difference between "a[less++]" and "a[less], less++".
>>>>>
>>>>> C2 generates completely different code,with 16X unrolling - this is the
>>>>> inner loop:
>>>>>
>>>>>  0x026a6e40: pxor   %xmm0,%xmm0        ;*aload_0
>>>>>                                        ; - Test1::so...@9 (line 23)
>>>>>  0x026a6e44: movq   %xmm0,0xc(%ecx,%esi,4)
>>>>>  0x026a6e4a: movq   %xmm0,0x14(%ecx,%esi,4)
>>>>>  0x026a6e50: movq   %xmm0,0x1c(%ecx,%esi,4)
>>>>>  0x026a6e56: movq   %xmm0,0x24(%ecx,%esi,4)
>>>>>  0x026a6e5c: movq   %xmm0,0x2c(%ecx,%esi,4)
>>>>>  0x026a6e62: movq   %xmm0,0x34(%ecx,%esi,4)
>>>>>  0x026a6e68: movq   %xmm0,0x3c(%ecx,%esi,4)
>>>>>  0x026a6e6e: movq   %xmm0,0x44(%ecx,%esi,4)  ;*iastore
>>>>>                                        ; - Test1::so...@21 (line 24)
>>>>>  0x026a6e74: add    $0x10,%esi         ;*iinc
>>>>>                                        ; - Test1::so...@22 (line 22)
>>>>>  0x026a6e77: cmp    %ebp,%esi
>>>>>  0x026a6e79: jl     0x026a6e44
>>>>>
>>>>> There is some extra slow-path code to fill the remaining 1...15
>>>>> elements if the loop length is not multiple of 16, and that's all. C2
>>>>> detects the redundancy between the "k" and "less" vars, and kills the
>>>>> also-redundant "a[k] = a[less]" assignment so the net result is a simple
>>>>> zero-fill of the array. Maybe a different benchmark without these
>>>>> redundancies would make easier to see that C2 doesn't have a problem with
>>>>> the postfix "++", but if it had, I think it wouldn't reach the excellent
>>>>> result above.
>>>>>
>>>>> A+
>>>>> Osvaldo
>>>>>
>>>>>
>>>>>    I'm using JDK 7 on Windows XP:
>>>>>
>>>>>    java version "1.7.0-ea"
>>>>>    Java(TM) SE Runtime Environment (build 1.7.0-ea-b84)
>>>>>    Java HotSpot(TM) Client VM (build 17.0-b09, mixed mode, sharing)
>>>>>
>>>>>    Thanks,
>>>>>    Vladimir
>>>>>
>>>>>
>>>>>
>>>>> This is the C1 code for sort2()'s loop:
>>>>>
>>>>>  0x0251c1dc: cmp    0x8(%ecx),%esi     ; implicit exception: dispatches
>>>>> to 0x0251c21e
>>>>>  ;;   30 branch [AE] [RangeCheckStub: 0x454c640] [bci:13]
>>>>>  0x0251c1df: jae    0x0251c24a
>>>>>  0x0251c1e5: mov    0xc(%ecx,%esi,4),%ebx  ;*iaload: %ebx = a[less];
>>>>>                                        ; - Test2::so...@13 (line 23)
>>>>>  0x0251c1e9: cmp    0x8(%ecx),%edi
>>>>>  ;;   36 branch [AE] [RangeCheckStub: 0x454c7e0] [bci:14]
>>>>>  0x0251c1ec: jae    0x0251c263
>>>>>  0x0251c1f2: mov    %ebx,0xc(%ecx,%edi,4)  ;*iastore: a[k] = %ebx;
>>>>>                                        ; - Test2::so...@14 (line 23)
>>>>>
>>>>> (sort1/sort2 start to differ here)
>>>>>
>>>>>  0x0251c1f6: cmp    0x8(%ecx),%esi
>>>>>  ;;   42 branch [AE] [RangeCheckStub: 0x454c980] [bci:18]
>>>>>  0x0251c1f9: jae    0x0251c27c
>>>>>  0x0251c1ff: movl   $0x0,0xc(%ecx,%esi,4)  ;*iastore: a[less] = 0;
>>>>>                                        ; - Test2::so...@18 (line 24)
>>>>>  0x0251c207: inc    %esi               ; ++less;
>>>>>  0x0251c208: inc    %edi               ; OopMap{ecx=Oop off=73}
>>>>>                                        ;*goto: for k++
>>>>>                                        ; - Test2::so...@25 (line 22)
>>>>>  0x0251c209: test   %eax,0x1a0100      ;*goto
>>>>>                                        ; - Test2::so...@25 (line 22)
>>>>>                                        ;   {poll}
>>>>>  ;;  block B1 [4, 6]
>>>>>
>>>>>  0x0251c20f: cmp    %edx,%edi
>>>>>  ;;   22 branch [LT] [B2]
>>>>>  0x0251c211: jl     0x0251c1dc         ;*if_icmpge: for k < right
>>>>>                                        ; - Test2::so...@6 (line 22)
>>>>>
>>>>> The code looks OK; C1 doesn't do much optimization - no unrolling,
>>>>> bounds check elimination - but it's otherwise just about the code you 
>>>>> would
>>>>> expect for a simple JITting.
>>>>>                   Now, C1 code for sort1()'s loop:
>>>>>
>>>>>  0x024bc21c: cmp    0x8(%ecx),%esi     ; implicit exception: dispatches
>>>>> to 0x024bc262
>>>>>  ;;   30 branch [AE] [RangeCheckStub: 0x44ee3b0] [bci:13]
>>>>>  0x024bc21f: jae    0x024bc28e
>>>>>  0x024bc225: mov    0xc(%ecx,%esi,4),%ebx  ;*iaload: %ebx = a[less];
>>>>>                                        ; - Test1::so...@13 (line 23)
>>>>>  0x024bc229: cmp    0x8(%ecx),%edi
>>>>>  ;;   36 branch [AE] [RangeCheckStub: 0x44ee550] [bci:14]
>>>>>  0x024bc22c: jae    0x024bc2a7
>>>>>  0x024bc232: mov    %ebx,0xc(%ecx,%edi,4)  ;*iastore: a[k] = %ebx;
>>>>>                                        ; - Test1::so...@14 (line 23)
>>>>>
>>>>> (sort1/sort2 start to differ here)
>>>>>
>>>>>  0x024bc236: mov    %esi,%ebx          ; Crap! C1 temps 'less' into
>>>>> %ebx
>>>>>  0x024bc238: inc    %ebx               ; ++less; (for the temp "less
>>>>> from future")
>>>>>
>>>>>  0x024bc239: cmp    0x8(%ecx),%esi     ; %esi is still the "old
>>>>> less"....
>>>>>  ;;   46 branch [AE] [RangeCheckStub: 0x44ee7b8] [bci:21]
>>>>>  0x024bc23c: jae    0x024bc2c0
>>>>>  0x024bc242: movl   $0x0,0xc(%ecx,%esi,4)  ;*iastore: a[less++] = 0;
>>>>>                                        ; - Test1::so...@21 (line 24)
>>>>>  0x024bc24a: inc    %edi               ; OopMap{ecx=Oop off=75}
>>>>>                                        ;*goto: for k++
>>>>>                                        ; - Test1::so...@25 (line 22)
>>>>>  0x024bc24b: test   %eax,0x1a0100      ;   {poll}
>>>>>  0x024bc251: mov    %ebx,%esi          ;*goto
>>>>>                                        ; - Test1::so...@25 (line 22):
>>>>> for...
>>>>>  ;;  block B1 [4, 6]
>>>>>
>>>>>  0x024bc253: cmp    %edx,%edi
>>>>>  ;;   22 branch [LT] [B2]
>>>>>  0x024bc255: jl     0x024bc21c         ;*if_icmpge
>>>>>                                        ; - Test1::so...@6 (line 22):
>>>>> for...
>>>>>
>>>>

Re: New portion of improvements for Dual-Pivot Quicksort

Reply via email to