Re: RFR: 8319423: Improve Year.isLeap by checking divisibility by 16 [v2]

Cassio Neri Sat, 04 Nov 2023 14:03:04 -0700

On Fri, 3 Nov 2023 23:22:27 GMT, Claes Redestad <redes...@openjdk.org> wrote:


>> https://github.com/cassioneri/eaf suggest this code for leap year 
>> calculation:
>> 
>>     public static boolean isLeap(long year) {
>>         int d = year % 100 != 0 ? 4 : 16;
>>         return (year & (d - 1)) == 0;
>>     }
>> 
>> .. with a claim this would compile down to branchless, easily pipelined code.
>> 
>> This doesn't currently happen with C2. In the meantime I think we can 
>> improve the current code in `Year.isLeap` and `IsoChronology.isLeapYear` by 
>> leveraging the fact that the `% 100` check is only needed if `(year & 15) != 
>> 0`:
>> 
>> 
>>     public static boolean isLeap(long year) {
>>         return (year & 15) == 0 ? (year & 3) == 0 : (year & 3) == 0 && year 
>> % 100 != 0;
>>     }
>> 
>> 
>> Mac M1:
>> 
>> Name                           Cnt  Base   Error   Test   Error   Unit  
>> Change
>> LeapYearBench.isLeapYear        15 0,743 ± 0,009  0,994 ± 0,005 ops/us   
>> 1,34x (p = 0,000*)
>> LeapYearBench.isLeapYearChrono  15 0,748 ± 0,006  0,991 ± 0,003 ops/us   
>> 1,32x (p = 0,000*)
>> LeapYearBench.isLeapYearNS      15 0,558 ± 0,026  0,552 ± 0,033 ops/us   
>> 0,99x (p = 0,602 )
>>   * = significant
>> 
>> 
>> Linux x64:
>> 
>> Name                           Cnt  Base   Error   Test   Error   Unit  
>> Change
>> LeapYearBench.isLeapYear        15 0.534 ± 0.001  0.765 ± 0.004 ops/us   
>> 1.43x (p = 0.000*)
>> LeapYearBench.isLeapYearChrono  15 0.535 ± 0.000  0.753 ± 0.040 ops/us   
>> 1.41x (p = 0.000*)
>> LeapYearBench.isLeapYearNS      15 0.352 ± 0.000  0.351 ± 0.001 ops/us   
>> 1.00x (p = 0.000*)
>>   * = significant
>> 
>> 
>> 30% higher throughput on M1, 40% on x64. `isLeapYearNS` runs a variant of 
>> the code from https://github.com/cassioneri/eaf ported to java - perhaps the 
>> JIT can be improved to do whatever clang/gcc does here and achieve an even 
>> better speed-up.
>> 
>> Testing: so far only java/time/tck/java/time locally, will run a few tiers 
>> before filing an enhancement and opening the PR for review.
>
> Claes Redestad has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Apply similar optimization to GregorianCalendar, sun.util.calendar

Thanks for your interest in my work. I'd love to assist porting our algorithms 
to Java. Notice, however, that I'm not a Java programmer and I don't have the 
complete picture of what goes on in the JVM. What follows is based on my 
experience with C/C++ compilers but, I reckon, most of it might apply to Java 
as well.

When determining if `year` is leap or not it's very reasonable to believe that 
checking divisibility by 4 first is the best strategy. As I told in [my 
talk](https://www.youtube.com/watch?v=0s9F4QWAl-E), virtually every 
implementation that I found made that assumption. However, this is not the case 
thanks to modern branch predictors, at least for x86_64 CPUs. My experiments 
showed that checking divisibility by 100 first is the best way:

    if (year % 100 != 0)
      return year % 4 == 0;
    return year % 400 == 0;

Maths results show that, in calculations of leap year, it's correct to replace 
`year % 400 == 0` by the cheaper expression `y % 16 == 0`. Except for [compiler 
bugs](https://bugs.llvm.org/show_bug.cgi?id=50334), this should always be done. 
Hence, the implementation that @cl4es  mentioned:

    public static boolean isLeap(long year) {
        int d = year % 100 != 0 ? 4 : 16;
        return (year & (d - 1)) == 0;
    }

In my talk I also said that I did other optimisations around `year % 100 == 0` 
but that was another story. Let me tell this story.

Similar mathematical arguments as above show that it's also correct to replace 
`year % 100 == 0` with `year % 25 == 0` and [the latter requires one fewer 
assembly instruction](https://godbolt.org/z/zKza1Yc83). (I've also discussed 
this topic in a 
[PR](https://github.com/nakedible/datealgo-rs/discussions/21#discussioncomment-6907992)
 for the Rust implementation.) However, contrary to the case of 400-and-16, 
it's not always profitable to replace `year % 100 == 0` with `year % 25 == 0`. 
It depends on whether the code executed by the CPU contains branches or not. 
(Despite the usage of the ternary operator `?`, some CPUs support conditional 
moves which allow the elimination of some branches.)

If there's no branch, then this is probably be the best option:

    public static boolean is_leap_v0(long year) {
      int d = year % 25 != 0 ? 4 : 16;
      return (year & (d - 1)) == 0;
    }


If there's a branch, then I'd expect this to perform better:

    public static boolean is_leap_v0(long year) {
      int d = year % 100 != 0 ? 4 : 16;
      return (year & (d - 1)) == 0;
    }

The reason is hinted in my talk: If there's a branch, the branch predictor can 
do a better job when execution is split into paths with (1/100, 99/100) = (1%, 
99%) probability distribution than when the distribution is (1/25, 24/25) = 
(4%, 96%).

[Looking at byte-code](https://godbolt.org/z/MEq9c5hef) for the 4 different 
implementations floated in this discussion, I see some `goto`s but I can't tell 
if, in assembly, they translate to branches or conditional moves. With more 
knowledgeable eyes, the [C versions](https://godbolt.org/z/xTGKh5x4h) of the 
same implementations suggest me that the branchless `is_leap_v0` is the best. 
But, I ask you, to not trust in my eyes or my intuition and, instead, measure 
the performance of the several alternatives. This old [SO 
post](https://stackoverflow.com/a/60646967/1137388) might also be helpful.

I can also shed some light on the magical numbers that appear in the code that 
check divisibility by 100 and 25 by pointing to my series of article  on this 
topic:

* Quick Modular Calculations (Part 1) 
https://accu.org/journals/overload/27/154/overload154.pdf#page=13
* Quick Modular Calculations (Part 2) 
https://accu.org/journals/overload/28/155/overload155.pdf#page=16
* Quick Modular Calculations (Part 3) 
https://accu.org/journals/overload/28/156/overload156.pdf#page=12

I hope this helps.
Cassio.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16491#issuecomment-1793544215

Re: RFR: 8319423: Improve Year.isLeap by checking divisibility by 16 [v2]

Reply via email to