The ironic thing is that these operators were originally introduced in the C language, at least according to well-known folklore, as an optimization - taking advantage of inc/dec instructions from the PDP arch, in the old times when (I suppose) compilers wouldn't do even basic strength-reduction optimizations like x + 1 => inc x. But this works well for pre-inc/dec, but post-increment and post-decrement have a more complex semantics that requires inducing an extra temporary if the compiler is not smart enough.
My personal coding style, since my days of C, enforces pre- instead of post- operators; I will always write ++i, never i++, when both choices have the same effect. This is exactly because it's the post- operators that have confusing and error-prone semantics, as soon as the inc/dec operation is used as RHS of a bigger expression (or used a parameter etc.), and not as an isolated statement. I will use post- operators only when I really need its semantics to make my code a little extra terse. And now we discover another gotcha of post- operators: potentially less efficient code generation. It's worth noticing this, because I the post- operators seem to be the preferred style for most people, they can even be found in many Java style guidelines... I wonder why. Perhaps it's just the name of the C++ language that has taught a generation to like this wrong style? In that case, the blame goes to Stroustrup, he should have named it "++C" instead. That would get extra nerd points too ;-) A+ Osvaldo 2010/7/6 David Holmes <david.hol...@oracle.com> > I asked someone to take a look at this and it seems that the problem is > that a[less++] requires introduction of a temporary to store the > pre-incremented index value which in turn increases register pressure and > can lead to a register spill. This seems to not only be architecture > dependent but even variable on the same architecture (as observed below). > > As Osvaldo noted C1 does not attempt to decide whether the increment can be > reordered with the access and so allow removal of the temporary. You'd need > the compiler folk (C1/C2) to comment on whether or not it would be > reasonable/possible/practical for C1 to do that - so I've cc'ed them (though > I'm not a member of their list so it may be a little while before it gets > pushed through to them.) > > David Holmes > > Vladimir Iaroslavski said the following on 06/22/10 23:24: > > both computers with Intel processor: >> >> Pentium, Intel, 4 CPU, 3.2 GHz, 2 Gb RAM >> Pentium, Intel, 4 CPU, 2.8 GHz, 2 Gb RAM >> >> Osvaldo Pinali Doederlein wrote: >> >>> Hi Vladimir, >>> >>> Hello, >>>> >>>> I tried with the latest JDK, build 98 and see different behaviour >>>> on two computers: 7570 / 8318 and 8560 / 8590, but sorting method >>>> works slower with a[less++] instruction on both computers: >>>> >>> >>> Is the first computer (with larger performance diff) an AMD by any >>> chance? It's odd that just the a[less++] would make the code so much slower. >>> Perhaps the worst compiled code has additional costs in some CPUs, e.g. >>> spoiling branch prediction for the bounds checking guards. >>> >>> Anyway the change is risk-free (including performance) and the advantage >>> is important at least in some CPUs, so go ahead (FWIW wrt my opinion...). I >>> just wish that C1 would be fixed to not need this kind of hacking - as I >>> categorize this as a hack, not even as a "normal" low-level Java code >>> optimization - because there are certainly millions of uses of the idiom >>> "array[index++/--]" in JavaSE APIs and elsewhere. But I'm not familiar with >>> the C1 source so I don't know if this is some low hanging fruit that could >>> be addressed quickly (any HotSpot engineers there to comment?). >>> >>> A+ >>> Osvaldo >>> >>> first second >>>> a[less] = ak; less++; / (a[less++] = ak; >>>> >>>> random: 16371 / 16696 14018 / 14326 >>>> ascendant: 2706 / 2762 2884 / 2897 >>>> descendant: 2994 / 3108 3170 / 3258 >>>> organ pipes: 7073 / 7296 6939 / 7090 >>>> stagger(2): 7765 / 8069 7531 / 7731 >>>> period(1,2): 653 / 743 719 / 753 >>>> random(1..4): 2152 / 2234 1567 / 1591 >>>> >>>> If I change Test class and populate the array with descendant >>>> values, I still see difference in times: 4793 / 5718 >>>> see updated attached Test class. >>>> >>>> Also I'm attaching the latest version of DualPivotQuicksort >>>> with minor format changes. If you don't have comments, I'll >>>> ask to to integrate the code at the end of this week. >>>> >>>> Thank you, >>>> Vladimir >>>> >>>> Osvaldo Doederlein wrote: >>>> >>>>> Hi Vladimir, >>>>> >>>>> 2010/6/19 Vladimir Iaroslavski <iaroslav...@mail.ru <mailto: >>>>> iaroslav...@mail.ru>> >>>>> >>>>> Hello Osvaldo, >>>>> >>>>> I've prepared simple test which scans an array and does assignments >>>>> for each element, >>>>> see attached Test class: >>>>> >>>>> a[k] = a[less]; >>>>> a[less++] = 0; // or a[less] = 0; less++; >>>>> >>>>> The result of running "java -client Test" is: >>>>> >>>>> a[less], less++; Time: 6998 >>>>> a[less++]; Time: 8416 >>>>> >>>>> It is much more than 1%. Is it bug in JVM? Note that under server VM >>>>> >>>>> The amount of diff surely depends on the benchmark; your bench should >>>>> "zoom" the problem by not having much other work polluting the >>>>> measurement. >>>>> But in my own test with b98 (32-bit), Q6600 CPU, I've got 5065/5079, so >>>>> the >>>>> diff is < 1%. The excessive disadvantage you report may be something bad >>>>> in >>>>> your older b84. >>>>> >>>>> Anyway I investigated the JIT-compiled code, details in the end (I've >>>>> split the benchmark in two classes just to make the comparison simpler). >>>>> The >>>>> problem with "a[less++]" is that "less++" will first increment "less", >>>>> _then_ it will use the old value (not-incremented) to index "a". C1 >>>>> generates code that is equivalent to: >>>>> >>>>> int less_incremented = less + 1; >>>>> a[less] = 0; >>>>> less = less_incremented; >>>>> >>>>> ...which is a 1-to-1 translation of the IR coming off the bytecode. C1 >>>>> is not smart enough to see that the increment can be reordered after the >>>>> indexing, maybe because there's a data dependency as the indexing uses >>>>> "less"; but due to the semantics of postfix "++" this dependency is for >>>>> the >>>>> before-increment value, so the reordering would be safe. Maybe that's just >>>>> some simple missing heuristics that could be easily added? >>>>> >>>>> there is no difference between "a[less++]" and "a[less], less++". >>>>> >>>>> C2 generates completely different code,with 16X unrolling - this is the >>>>> inner loop: >>>>> >>>>> 0x026a6e40: pxor %xmm0,%xmm0 ;*aload_0 >>>>> ; - Test1::so...@9 (line 23) >>>>> 0x026a6e44: movq %xmm0,0xc(%ecx,%esi,4) >>>>> 0x026a6e4a: movq %xmm0,0x14(%ecx,%esi,4) >>>>> 0x026a6e50: movq %xmm0,0x1c(%ecx,%esi,4) >>>>> 0x026a6e56: movq %xmm0,0x24(%ecx,%esi,4) >>>>> 0x026a6e5c: movq %xmm0,0x2c(%ecx,%esi,4) >>>>> 0x026a6e62: movq %xmm0,0x34(%ecx,%esi,4) >>>>> 0x026a6e68: movq %xmm0,0x3c(%ecx,%esi,4) >>>>> 0x026a6e6e: movq %xmm0,0x44(%ecx,%esi,4) ;*iastore >>>>> ; - Test1::so...@21 (line 24) >>>>> 0x026a6e74: add $0x10,%esi ;*iinc >>>>> ; - Test1::so...@22 (line 22) >>>>> 0x026a6e77: cmp %ebp,%esi >>>>> 0x026a6e79: jl 0x026a6e44 >>>>> >>>>> There is some extra slow-path code to fill the remaining 1...15 >>>>> elements if the loop length is not multiple of 16, and that's all. C2 >>>>> detects the redundancy between the "k" and "less" vars, and kills the >>>>> also-redundant "a[k] = a[less]" assignment so the net result is a simple >>>>> zero-fill of the array. Maybe a different benchmark without these >>>>> redundancies would make easier to see that C2 doesn't have a problem with >>>>> the postfix "++", but if it had, I think it wouldn't reach the excellent >>>>> result above. >>>>> >>>>> A+ >>>>> Osvaldo >>>>> >>>>> >>>>> I'm using JDK 7 on Windows XP: >>>>> >>>>> java version "1.7.0-ea" >>>>> Java(TM) SE Runtime Environment (build 1.7.0-ea-b84) >>>>> Java HotSpot(TM) Client VM (build 17.0-b09, mixed mode, sharing) >>>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>> >>>>> >>>>> This is the C1 code for sort2()'s loop: >>>>> >>>>> 0x0251c1dc: cmp 0x8(%ecx),%esi ; implicit exception: dispatches >>>>> to 0x0251c21e >>>>> ;; 30 branch [AE] [RangeCheckStub: 0x454c640] [bci:13] >>>>> 0x0251c1df: jae 0x0251c24a >>>>> 0x0251c1e5: mov 0xc(%ecx,%esi,4),%ebx ;*iaload: %ebx = a[less]; >>>>> ; - Test2::so...@13 (line 23) >>>>> 0x0251c1e9: cmp 0x8(%ecx),%edi >>>>> ;; 36 branch [AE] [RangeCheckStub: 0x454c7e0] [bci:14] >>>>> 0x0251c1ec: jae 0x0251c263 >>>>> 0x0251c1f2: mov %ebx,0xc(%ecx,%edi,4) ;*iastore: a[k] = %ebx; >>>>> ; - Test2::so...@14 (line 23) >>>>> >>>>> (sort1/sort2 start to differ here) >>>>> >>>>> 0x0251c1f6: cmp 0x8(%ecx),%esi >>>>> ;; 42 branch [AE] [RangeCheckStub: 0x454c980] [bci:18] >>>>> 0x0251c1f9: jae 0x0251c27c >>>>> 0x0251c1ff: movl $0x0,0xc(%ecx,%esi,4) ;*iastore: a[less] = 0; >>>>> ; - Test2::so...@18 (line 24) >>>>> 0x0251c207: inc %esi ; ++less; >>>>> 0x0251c208: inc %edi ; OopMap{ecx=Oop off=73} >>>>> ;*goto: for k++ >>>>> ; - Test2::so...@25 (line 22) >>>>> 0x0251c209: test %eax,0x1a0100 ;*goto >>>>> ; - Test2::so...@25 (line 22) >>>>> ; {poll} >>>>> ;; block B1 [4, 6] >>>>> >>>>> 0x0251c20f: cmp %edx,%edi >>>>> ;; 22 branch [LT] [B2] >>>>> 0x0251c211: jl 0x0251c1dc ;*if_icmpge: for k < right >>>>> ; - Test2::so...@6 (line 22) >>>>> >>>>> The code looks OK; C1 doesn't do much optimization - no unrolling, >>>>> bounds check elimination - but it's otherwise just about the code you >>>>> would >>>>> expect for a simple JITting. >>>>> Now, C1 code for sort1()'s loop: >>>>> >>>>> 0x024bc21c: cmp 0x8(%ecx),%esi ; implicit exception: dispatches >>>>> to 0x024bc262 >>>>> ;; 30 branch [AE] [RangeCheckStub: 0x44ee3b0] [bci:13] >>>>> 0x024bc21f: jae 0x024bc28e >>>>> 0x024bc225: mov 0xc(%ecx,%esi,4),%ebx ;*iaload: %ebx = a[less]; >>>>> ; - Test1::so...@13 (line 23) >>>>> 0x024bc229: cmp 0x8(%ecx),%edi >>>>> ;; 36 branch [AE] [RangeCheckStub: 0x44ee550] [bci:14] >>>>> 0x024bc22c: jae 0x024bc2a7 >>>>> 0x024bc232: mov %ebx,0xc(%ecx,%edi,4) ;*iastore: a[k] = %ebx; >>>>> ; - Test1::so...@14 (line 23) >>>>> >>>>> (sort1/sort2 start to differ here) >>>>> >>>>> 0x024bc236: mov %esi,%ebx ; Crap! C1 temps 'less' into >>>>> %ebx >>>>> 0x024bc238: inc %ebx ; ++less; (for the temp "less >>>>> from future") >>>>> >>>>> 0x024bc239: cmp 0x8(%ecx),%esi ; %esi is still the "old >>>>> less".... >>>>> ;; 46 branch [AE] [RangeCheckStub: 0x44ee7b8] [bci:21] >>>>> 0x024bc23c: jae 0x024bc2c0 >>>>> 0x024bc242: movl $0x0,0xc(%ecx,%esi,4) ;*iastore: a[less++] = 0; >>>>> ; - Test1::so...@21 (line 24) >>>>> 0x024bc24a: inc %edi ; OopMap{ecx=Oop off=75} >>>>> ;*goto: for k++ >>>>> ; - Test1::so...@25 (line 22) >>>>> 0x024bc24b: test %eax,0x1a0100 ; {poll} >>>>> 0x024bc251: mov %ebx,%esi ;*goto >>>>> ; - Test1::so...@25 (line 22): >>>>> for... >>>>> ;; block B1 [4, 6] >>>>> >>>>> 0x024bc253: cmp %edx,%edi >>>>> ;; 22 branch [LT] [B2] >>>>> 0x024bc255: jl 0x024bc21c ;*if_icmpge >>>>> ; - Test1::so...@6 (line 22): >>>>> for... >>>>> >>>>