Re: RFR: 8207851: Implement JEP 352
Hi Andrew, On 05/24/2019 07:06 AM, Andrew Dinn wrote: Ping! Any takers for a review? I found some trailing space in v5 and it seems they are in v6 as well. You might want to check the followings: src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp +// data cache line writeback +StubRoutines::_data_cache_writeback = generate_data_cache_writeback(); +StubRoutines::_data_cache_writeback_sync = generate_data_cache_writeback_sync(); + <-- here src/hotspot/cpu/aarch64/vm_version_aarch64.cpp +if (_dcpop || UsePOCForPOP) { + _data_cache_line_flush_size = dcache_line; +} + } + <-- here src/hotspot/cpu/x86/macroAssembler_x86.cpp +// no need for fence when using CLFLUSH +clflush(line); + } <-- here src/hotspot/cpu/x86/stubGenerator_x86_64.cpp in generate_data_cache_writeback(): +__ ret(0); + +return start;<-- here in generate_data_cache_writeback_sync(): +__ jcc(Assembler::notEqual, skip); <-- here and +// data cache line writeback +StubRoutines::_data_cache_writeback = generate_data_cache_writeback(); +StubRoutines::_data_cache_writeback_sync = generate_data_cache_writeback_sync(); +<-- here (like in aarch64 file) src/java.base/share/classes/java/nio/MappedByteBuffer.java +private final boolean isSync; +<-- here and +private boolean isSync() { +return isSync; +} +<-- here src/java.base/share/classes/jdk/internal/misc/Unsafe.java + *that must be guaranteed written back to memory. + *<-- here and here: + * @since 13 + */ +<-- here +checkSize(length); +} +<-- here + * calling method {@link writeback #writeback} if it is disabled). + * <-- here +throw new RuntimeException("writebackMemory not enabled!"); +} +} +<-- here +// following memory writes +@HotSpotIntrinsicCandidate +private native void writebackPostSync0(); +<-- here src/java.base/share/classes/sun/nio/ch/FileChannelImpl.java +incrementStats(); +} +<-- here totalCapacity -= cap; } } } +<-- here +super(address, size, cap, fd); +incrementStats(); +} +<-- here +totalCapacity -= cap; +} +} +} +<-- here src/java.base/unix/native/libnio/ch/FileChannelImpl.c +// should never be called with map_sync and prot == PRIVATE +assert((prot != sun_nio_ch_FileChannelImpl_MAP_PV) ||! map_sync); +<-- here +#ifndef MAP_SHARED_VALIDATE +#define MAP_SHARED_VALIDATE 0x03 +#endif +<-- here test/jdk/java/nio/MappedByteBuffer/PmemTest.java +import com.sun.nio.file.ExtendedMapMode; +<-- here +public static final int NUM_KBS = 16; +<-- here Best regards, Gustavo
Re: RFR: 8213754: PPC64: Add Intrinsics for isDigit/isLowerCase/isUpperCase/isWhitespace
Hi Martin, On 12/12/2018 05:10 AM, Doerr, Martin wrote: thanks for your improvements. You can also remove the trailing \t from the opto assembly. Note that there's no need to re-push newer version to jdk/submit when only PPC files were changed. jdk/submit doesn't look at them. Got it. Pushed to jdk/jdk: http://hg.openjdk.java.net/jdk/jdk/rev/7384e00d5860 Thank you. Best regards, Gustavo Best regards, Martin -Original Message----- From: Gustavo Romero Sent: Mittwoch, 12. Dezember 2018 03:08 To: Michihiro Horie Cc: core-libs-dev@openjdk.java.net; hotspot-compiler-...@openjdk.java.net; Doerr, Martin ; Roger Riggs ; Vladimir Kozlov ; Simonis, Volker Subject: Re: RFR: 8213754: PPC64: Add Intrinsics for isDigit/isLowerCase/isUpperCase/isWhitespace Hi Michi, On 12/11/2018 11:12 PM, Michihiro Horie wrote: Thank you for finding the issue on Power8. You do not need a check with has_darn in the ppc.ad. It is better to add a check in vm_versoin_ppc. I agree. I uploaded webrev.08 based on your webrev.07. (Thanks for the enhancement of opto assembly and removing trailing spaces!) http://cr.openjdk.java.net/~mhorie/8213754/webrev.08/ <http://cr.openjdk.java.net/%7Emhorie/8213754/webrev.08/> Thanks for the updated webrev. Looks good! I've just pushed webrev.08 to jdk/submit expecting no failures as .07 passed fine. Once I get the jdk/submit results tomorrow I'll push. Best regards, Gustavo
Re: RFR: 8213754: PPC64: Add Intrinsics for isDigit/isLowerCase/isUpperCase/isWhitespace
Hi Michi, On 12/11/2018 11:12 PM, Michihiro Horie wrote: Thank you for finding the issue on Power8. You do not need a check with has_darn in the ppc.ad. It is better to add a check in vm_versoin_ppc. I agree. I uploaded webrev.08 based on your webrev.07. (Thanks for the enhancement of opto assembly and removing trailing spaces!) http://cr.openjdk.java.net/~mhorie/8213754/webrev.08/ <http://cr.openjdk.java.net/%7Emhorie/8213754/webrev.08/> Thanks for the updated webrev. Looks good! I've just pushed webrev.08 to jdk/submit expecting no failures as .07 passed fine. Once I get the jdk/submit results tomorrow I'll push. Best regards, Gustavo Best regards, -- Michihiro, IBM Research - Tokyo Inactive hide details for "Gustavo Romero" ---2018/12/12 07:57:21---From: "Gustavo Romero" To: "Do"Gustavo Romero" ---2018/12/12 07:57:21---From: "Gustavo Romero" To: "Doerr, Martin" , Michihiro Horie/Japan/IBM@IBMJP From: "Gustavo Romero" To: "Doerr, Martin" , Michihiro Horie/Japan/IBM@IBMJP Cc: "core-libs-dev@openjdk.java.net" , "hotspot-compiler-...@openjdk.java.net" , "Roger Riggs" , "Vladimir Kozlov" , "Simonis, Volker" Date: 2018/12/12 07:57 Subject: Re: RFR: 8213754: PPC64: Add Intrinsics for isDigit/isLowerCase/isUpperCase/isWhitespace -- Hi Michi and Martin, On 12/11/2018 01:30 PM, Doerr, Martin wrote: > thanks for updating. I think it can get shipped when Gustavo is fine with it and jdk/submit tests have passed. webrev.06 removed has_darn from match_rule_supported and the JVM now crashes with SIGILL (cmprb) on CPUs <= POWER8. I think that what we want is both UseCharacterCompareIntrinsics and has_darn in the predicate, so: diff -r 01b519fcb8a8 -r f3bd7a422a6c src/hotspot/cpu/ppc/ppc.ad --- a/src/hotspot/cpu/ppc/ppc.ad Tue Dec 11 11:01:02 2018 -0500 +++ b/src/hotspot/cpu/ppc/ppc.ad Tue Dec 11 12:29:41 2018 -0500 @@ -2257,6 +2257,11 @@ return SuperwordUseVSX; case Op_PopCountVI: return (SuperwordUseVSX && UsePopCountInstruction); + case Op_Digit: + case Op_LowerCase: + case Op_UpperCase: + case Op_Whitespace: + return (UseCharacterCompareIntrinsics && VM_Version::has_darn()); } Martin, is what you meant on your last comment about it? I tested the change on POWER9 and it looks good (from webrev.06 and also with webrev.06 + the fix above)! I think that the Opto assembly could be improved a little bit around 'done:' label, like: diff -r 89aab62d10cd src/hotspot/cpu/ppc/ppc.ad --- a/src/hotspot/cpu/ppc/ppc.ad Tue Dec 11 12:29:41 2018 -0500 +++ b/src/hotspot/cpu/ppc/ppc.ad Tue Dec 11 15:55:34 2018 -0500 @@ -12440,7 +12440,8 @@ "LIS $src2, (signed short)0xAAB5\n\t" "ORI $src2, $src2, 0xBABA\n\t" "INSRDI $src2, $src2, 32, 0\n\t" - "CMPEQB $crx, 1, $src1, $src2\n\t" + "CMPEQB $crx, 1, $src1, $src2\n" + "done:\n\t" "SETB $dst, $crx" %} size(48); @@ -12484,7 +12485,8 @@ "BGT $crx, done\n\t" "LIS $src2, (signed short)0xD6C0\n\t" "ORI $src2, $src2, 0xDED8\n\t" - "CMPRB $crx, 1, $src1, $src2\n\t" + "CMPRB $crx, 1, $src1, $src2\n" + "done:\n\t" "SETB $dst, $crx" %} so the output would be: 024 B2: # B5 B3 <- B1 Freq: 1 024 LI R14, 0x5A41 CMPRB CCR6, 0, R3, R14 BGT CCR6, done LIS R14, (signed short)0xD6C0 ORI R14, R14, 0xDED8 CMPRB CCR6, 1, R3, R14 done: SETB R15, CCR6 040 CMPWI CCR6, R15, #0 044 Beq CCR6, B5 P=0.00 C=5377.00 ins
Re: RFR: 8213754: PPC64: Add Intrinsics for isDigit/isLowerCase/isUpperCase/isWhitespace
Hi Michi and Martin, On 12/11/2018 01:30 PM, Doerr, Martin wrote: thanks for updating. I think it can get shipped when Gustavo is fine with it and jdk/submit tests have passed. webrev.06 removed has_darn from match_rule_supported and the JVM now crashes with SIGILL (cmprb) on CPUs <= POWER8. I think that what we want is both UseCharacterCompareIntrinsics and has_darn in the predicate, so: diff -r 01b519fcb8a8 -r f3bd7a422a6c src/hotspot/cpu/ppc/ppc.ad --- a/src/hotspot/cpu/ppc/ppc.adTue Dec 11 11:01:02 2018 -0500 +++ b/src/hotspot/cpu/ppc/ppc.adTue Dec 11 12:29:41 2018 -0500 @@ -2257,6 +2257,11 @@ return SuperwordUseVSX; case Op_PopCountVI: return (SuperwordUseVSX && UsePopCountInstruction); + case Op_Digit: + case Op_LowerCase: + case Op_UpperCase: + case Op_Whitespace: +return (UseCharacterCompareIntrinsics && VM_Version::has_darn()); } Martin, is what you meant on your last comment about it? I tested the change on POWER9 and it looks good (from webrev.06 and also with webrev.06 + the fix above)! I think that the Opto assembly could be improved a little bit around 'done:' label, like: diff -r 89aab62d10cd src/hotspot/cpu/ppc/ppc.ad --- a/src/hotspot/cpu/ppc/ppc.adTue Dec 11 12:29:41 2018 -0500 +++ b/src/hotspot/cpu/ppc/ppc.adTue Dec 11 15:55:34 2018 -0500 @@ -12440,7 +12440,8 @@ "LIS $src2, (signed short)0xAAB5\n\t" "ORI $src2, $src2, 0xBABA\n\t" "INSRDI $src2, $src2, 32, 0\n\t" -"CMPEQB $crx, 1, $src1, $src2\n\t" +"CMPEQB $crx, 1, $src1, $src2\n" +"done:\n\t" "SETB$dst, $crx" %} size(48); @@ -12484,7 +12485,8 @@ "BGT $crx, done\n\t" "LIS $src2, (signed short)0xD6C0\n\t" "ORI $src2, $src2, 0xDED8\n\t" -"CMPRB $crx, 1, $src1, $src2\n\t" +"CMPRB $crx, 1, $src1, $src2\n" +"done:\n\t" "SETB$dst, $crx" %} so the output would be: 024 B2: # B5 B3 <- B1 Freq: 1 024 LI R14, 0x5A41 CMPRB CCR6, 0, R3, R14 BGT CCR6, done LIS R14, (signed short)0xD6C0 ORI R14, R14, 0xDED8 CMPRB CCR6, 1, R3, R14 done: SETBR15, CCR6 040 CMPWI CCR6, R15, #0 044 Beq CCR6, B5 P=0.00 C=5377.00 instead of: 024 B2: # B5 B3 <- B1 Freq: 1 024 LI R14, 0x5A41 CMPRB CCR6, 0, R3, R14 BGT CCR6, done LIS R14, (signed short)0xD6C0 ORI R14, R14, 0xDED8 CMPRB CCR6, 1, R3, R14 SETBR15, CCR6 040 CMPWI CCR6, R15, #0 044 Beq CCR6, B5 P=0.00 C=5379.00 If nobody opposes to that tiny enhancement I would like to keep it. ... and a nit: several trailing spaces here and there :) I generated a webrev with has_darn in match_rule_supported, with the change to the Opto asm, and with the trailing spaces removed. I uploaded it to: http://cr.openjdk.java.net/~gromero/8213754/webrev.07/ I pushed that same version to jdk/submit [1] not expecting any failure and once I get the fine results and if you, Martin are ok with the version in webrev.07 I'll push it to jdk/jdk. I'll update this thread a soon as I get the results back from jdk/submit repo. I think we have time until Thurday 16:00 UTC, even taking into account the 3+ different timezones working on that change :) Thank you and best regards, Gustavo [1] http://mail.openjdk.java.net/pipermail/jdk-submit-changes/2018-December/004418.html Note that changes pushed before Thursday 16:00 UTC will go into jdk12. Best regards, Martin *From:*Michihiro Horie *Sent:* Dienstag, 11. Dezember 2018 14:08 *To:* Doerr, Martin *Cc:* core-libs-dev@openjdk.java.net; Gustavo Romero ; hotspot-compiler-...@openjdk.java.net; Roger Riggs ; Vladimir Kozlov ; Simonis, Volker *Subject:* RE: RFR: 8213754: PPC64: Add Intrinsics for isDigit/isLowerCase/isUpperCase/isWhitespace Hi Martin, Thank you so much for your review and pointing out the issue on flag enablement on PPC64. Latest webrev enables the flag on PPC64 using has_darn in vm_version_ppc and it is used in match_rule_supported in the ad file. http://cr.openjdk.java.net/~mhorie/8213754/webrev.06/ <http://cr.openjdk.java.net/%7Emhorie/8213754/webrev.06/> Best regards, -- Michihiro, IBM Research - Tokyo Inactive hide details for "Doerr, Martin" ---2018/12/11 20:31:40---Hi Michihiro, the shared code looks good to me, now."Doerr, Martin" ---2018/12/11 20:31:40---Hi Michihiro, the shared code looks good to me, now. From: "Doerr, Martin" mailto:martin.do...@sap.com>> To: Vladimir Kozlov mailto:vladimir.koz...@oracle.com>>
Re: RFR: 8213754: PPC64: Add Intrinsics for isDigit/isLowerCase/isUpperCase/isWhitespace
Hi Martin, On 12/11/2018 01:30 PM, Doerr, Martin wrote: thanks for updating. I think it can get shipped when Gustavo is fine with it and jdk/submit tests have passed. Yep, I'm working on that right now. I'll update soon. So I have to push to jdk/submit, wait the test results and if it's fine I'll push to jdk/jdk after, right? Note that changes pushed before Thursday 16:00 UTC will go into jdk12. Got it. Thank you. Best regards, Gustavo Best regards, Martin *From:*Michihiro Horie *Sent:* Dienstag, 11. Dezember 2018 14:08 *To:* Doerr, Martin *Cc:* core-libs-dev@openjdk.java.net; Gustavo Romero ; hotspot-compiler-...@openjdk.java.net; Roger Riggs ; Vladimir Kozlov ; Simonis, Volker *Subject:* RE: RFR: 8213754: PPC64: Add Intrinsics for isDigit/isLowerCase/isUpperCase/isWhitespace Hi Martin, Thank you so much for your review and pointing out the issue on flag enablement on PPC64. Latest webrev enables the flag on PPC64 using has_darn in vm_version_ppc and it is used in match_rule_supported in the ad file. http://cr.openjdk.java.net/~mhorie/8213754/webrev.06/ <http://cr.openjdk.java.net/%7Emhorie/8213754/webrev.06/> Best regards, -- Michihiro, IBM Research - Tokyo Inactive hide details for "Doerr, Martin" ---2018/12/11 20:31:40---Hi Michihiro, the shared code looks good to me, now."Doerr, Martin" ---2018/12/11 20:31:40---Hi Michihiro, the shared code looks good to me, now. From: "Doerr, Martin" mailto:martin.do...@sap.com>> To: Vladimir Kozlov mailto:vladimir.koz...@oracle.com>>, Michihiro Horie mailto:ho...@jp.ibm.com>>, Gustavo Romero mailto:grom...@linux.vnet.ibm.com>> Cc: "core-libs-dev@openjdk.java.net <mailto:core-libs-dev@openjdk.java.net>" mailto:core-libs-dev@openjdk.java.net>>, "hotspot-compiler-...@openjdk.java.net <mailto:hotspot-compiler-...@openjdk.java.net>" mailto:hotspot-compiler-...@openjdk.java.net>>, Roger Riggs mailto:roger.ri...@oracle.com>>, "Simonis, Volker" mailto:volker.simo...@sap.com>> Date: 2018/12/11 20:31 Subject: RE: RFR: 8213754: PPC64: Add Intrinsics for isDigit/isLowerCase/isUpperCase/isWhitespace -- Hi Michihiro, the shared code looks good to me, now. Looks like the PPC64 is not consistent any more. Where do you enable UseCharacterCompareIntrinsics on PPC64? Why aren't you using it for match_rule_supported in the ad file? I think has_darn could be used to enable it in vm_version_ppc. Best regards, Martin -Original Message- From: Vladimir Kozlov mailto:vladimir.koz...@oracle.com>> Sent: Montag, 10. Dezember 2018 20:33 To: Michihiro Horie mailto:ho...@jp.ibm.com>>; Gustavo Romero mailto:grom...@linux.vnet.ibm.com>> Cc: core-libs-dev@openjdk.java.net <mailto:core-libs-dev@openjdk.java.net>; hotspot-compiler-...@openjdk.java.net <mailto:hotspot-compiler-...@openjdk.java.net>; Doerr, Martin mailto:martin.do...@sap.com>>; Roger Riggs mailto:roger.ri...@oracle.com>>; Simonis, Volker mailto:volker.simo...@sap.com>> Subject: Re: RFR: 8213754: PPC64: Add Intrinsics for isDigit/isLowerCase/isUpperCase/isWhitespace On 12/9/18 5:28 PM, Michihiro Horie wrote: Hi Vladimir, Thanks a lot for your review. I also fixed the copyright in intrinsicnode.hpp. >Did you look on code generated by C2 with your latest changes? Yes,I usually check generated code with Oprofile and they were as expected like: : 90 0.0905 : 3fff64c27654: rlwinm r12,r9,24,8,31 12 0.0121 : 3fff64c27658: li r11,14640 15 0.0151 : 3fff64c2765c: cmprb cr5,0,r31,r11 5527 5.5587 : 3fff64c27660: setb r11,cr5 36010 36.2164 : 3fff64c27664: stb r11,16(r15) Good. : I found a CSR FAQ that mentions adding a diagnostic flag do not need CSR process. This time I do not need CSR. https://wiki.openjdk.java.net/display/csr/CSR+FAQs Thank is correct. Hi Gustavo, Could I ask you to sponsor the latest
Re: RFR: 8210416: [linux] Poor StrictMath performance due to non-optimized compilation
Hi Goetz and Severin, On 09/14/2018 07:12 AM, Lindenmaier, Goetz wrote: Hi, We are only talking about reducing O3 to O2 for the compilation step of fdlibm, right? Yes. As I know, this is being replaced by Java coding, anyways. Therefore I don't have a strong opinion for any of the choices. I totally agree with Goetz considering OpenJDK 12 and 11u. I was actually more concerned about a possible future backport of that change, for instance: it gets into 8u, but Java implementations never get. Maybe I forgot to mention my concern about a 8u backport earlier. Severin, anyway, as Goetz is also fine with either option, I'm fine with what you and Andrew decide as the best option. Thanks bearing with the discussion on ppc64. Best regards, Gustavo Best regards, Goetz. -Original Message- From: core-libs-dev On Behalf Of Gustavo Romero Sent: Freitag, 14. September 2018 01:30 To: Severin Gehwolf ; David Holmes ; Erik Joelsson ; build-dev ; core-libs-dev Subject: Re: RFR: 8210416: [linux] Poor StrictMath performance due to non- optimized compilation Hello Severin, On 09/12/2018 04:48 AM, Severin Gehwolf wrote: Hi David, On Wed, 2018-09-12 at 13:57 +1000, David Holmes wrote: Hi Severin, Sorry I'm a bit confused now. Originally for ppc/s390x/aarch64 the optimization level for fdlibm was HIGH. But now it will be LOW due to: 45 ifneq ($(FDLIBM_CFLAGS), ) 46 BUILD_LIBFDLIBM_OPTIMIZATION := LOW as those platforms will use gcc/clang with support for -ffp-contract and so FDLIBM_CFLAGS will not be empty. ?? Correct. That was done based on Andrew Haley's comment here: http://mail.openjdk.java.net/pipermail/build-dev/2018- September/023160.html I've done some measurements with -O2 and -O3. -O2 is sufficient for a 2.75 speedup for sin/cos on ppc64le as compared to the -O0 baseline. On the other hand the speedup of -O3 as compared to -O2 is only 1.05 for sin/cos on ppc64le. Since the difference between them were not huge, I've used -O2, a.k.a LOW. See also: http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK- 8210416/microbenchmarks_results/ To me it seems -O2 is sufficiently good. Note that HIGH == -O3 and LOW == -O2 on gcc arches. Does that clear things up? FWIW, I'm happy to change it back to HIGH if there are concerns. See also Gustavo's reply here for some more data points: http://mail.openjdk.java.net/pipermail/build-dev/2018- September/023193.html I'm wondering why the possible issue of using -O3 on fdlibm would not affect other parts of the JVM code (like the Opto code in libjvm.so) that afaics are also built using -O3. Also, the gap of 1.05 for sin/cos, for instance, still sounds like a regression to me. Maybe a better approach to this would be to figure out a way to test -O3 and -O2 with the real world case in which -O3 can have a bad impact. I'm not sure if SPECjbb would qualify for that. Best regards, Gustavo Thanks, Severin David On 12/09/2018 2:14 AM, Severin Gehwolf wrote: Hi Erik, Thanks for the review! On Tue, 2018-09-11 at 08:57 -0700, Erik Joelsson wrote: Hello Severin, Even if using the macro, I still think you need to add a condition on the compiler types where the switch can be reasonably expected to exist. How about this? http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK- 8210416/webrev.05/ Thanks, Severin On 2018-09-11 05:02, Severin Gehwolf wrote: On Mon, 2018-09-10 at 09:29 -0700, Erik Joelsson wrote: I see. I was not aware of that issue, but we clearly need to file a bug for it and fix it. In this case I think it's fine to us the macro however. OK. Update webrev, which now uses FLAGS_COMPILER_CHECK_ARGUMENTS. http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK- 8210416/webrev.04/ Micro-benchmark results from running [1] for x86_64 and ppc64le are here (-O2 is sufficient it seems): http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK- 8210416/microbenchmarks_results/ More thoughts? Thanks, Severin [1] https://github.com/gromero/strictmath/
Re: RFR: 8210416: [linux] Poor StrictMath performance due to non-optimized compilation
Hello Severin, On 09/12/2018 04:48 AM, Severin Gehwolf wrote: Hi David, On Wed, 2018-09-12 at 13:57 +1000, David Holmes wrote: Hi Severin, Sorry I'm a bit confused now. Originally for ppc/s390x/aarch64 the optimization level for fdlibm was HIGH. But now it will be LOW due to: 45 ifneq ($(FDLIBM_CFLAGS), ) 46 BUILD_LIBFDLIBM_OPTIMIZATION := LOW as those platforms will use gcc/clang with support for -ffp-contract and so FDLIBM_CFLAGS will not be empty. ?? Correct. That was done based on Andrew Haley's comment here: http://mail.openjdk.java.net/pipermail/build-dev/2018-September/023160.html I've done some measurements with -O2 and -O3. -O2 is sufficient for a 2.75 speedup for sin/cos on ppc64le as compared to the -O0 baseline. On the other hand the speedup of -O3 as compared to -O2 is only 1.05 for sin/cos on ppc64le. Since the difference between them were not huge, I've used -O2, a.k.a LOW. See also: http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8210416/microbenchmarks_results/ To me it seems -O2 is sufficiently good. Note that HIGH == -O3 and LOW == -O2 on gcc arches. Does that clear things up? FWIW, I'm happy to change it back to HIGH if there are concerns. See also Gustavo's reply here for some more data points: http://mail.openjdk.java.net/pipermail/build-dev/2018-September/023193.html I'm wondering why the possible issue of using -O3 on fdlibm would not affect other parts of the JVM code (like the Opto code in libjvm.so) that afaics are also built using -O3. Also, the gap of 1.05 for sin/cos, for instance, still sounds like a regression to me. Maybe a better approach to this would be to figure out a way to test -O3 and -O2 with the real world case in which -O3 can have a bad impact. I'm not sure if SPECjbb would qualify for that. Best regards, Gustavo Thanks, Severin David On 12/09/2018 2:14 AM, Severin Gehwolf wrote: Hi Erik, Thanks for the review! On Tue, 2018-09-11 at 08:57 -0700, Erik Joelsson wrote: Hello Severin, Even if using the macro, I still think you need to add a condition on the compiler types where the switch can be reasonably expected to exist. How about this? http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8210416/webrev.05/ Thanks, Severin On 2018-09-11 05:02, Severin Gehwolf wrote: On Mon, 2018-09-10 at 09:29 -0700, Erik Joelsson wrote: I see. I was not aware of that issue, but we clearly need to file a bug for it and fix it. In this case I think it's fine to us the macro however. OK. Update webrev, which now uses FLAGS_COMPILER_CHECK_ARGUMENTS. http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8210416/webrev.04/ Micro-benchmark results from running [1] for x86_64 and ppc64le are here (-O2 is sufficient it seems): http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8210416/microbenchmarks_results/ More thoughts? Thanks, Severin [1] https://github.com/gromero/strictmath/
Re: RFR: 8210416: [linux] Poor StrictMath performance due to non-optimized compilation
Hi Andrew, On 09/12/2018 12:51 AM, Gustavo Romero wrote: Maybe Andrew can enlight us. I was not sure if 'enlight us' was the correct form here, so I did some search and I found with horror today that 'enlighten us' can also be considered an insult (!?). That's really not the case here. So if it's really possible to interpret it in a non-positive / non-cordial way I sincerely apologize for the wrong use. I was really only asking for an advice. :) Regards, Gustavo
Re: RFR: 8210416: [linux] Poor StrictMath performance due to non-optimized compilation
Hi Severin, On 09/11/2018 09:02 AM, Severin Gehwolf wrote: Micro-benchmark results from running [1] for x86_64 and ppc64le are here (-O2 is sufficient it seems): http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8210416/microbenchmarks_results/ More thoughts? Thanks a lot for checking it on PPC64. I did additional tests on a POWER8 using the same microbench. I share the results below. The numbers are the arithmetic mean of 10 runs. Unit is second. 1 O3 | O2 2 MathStrictMath | MathStrictMath 3 | 4 acos:3.0728 3.0728 | acos:3.07325 3.07375 5 asin:3.0933 3.0916 | asin:3.09337 3.0915 6 atan2: 10.9743 10.9759 | atan2: 11.2994 11.2977 7 atan:7.653 7.6602 | atan:7.48463 7.48225 8 cbrt:13.444 13.4433 | cbrt:13.4535 13.4527 9 cos: 30.1561 33.9689 | cos: 30.157 35.1162 10 cosh:3.0911 3.1074 | cosh:3.09975 3.09088 11 exp: 1.413 3.0572 | exp: 1.41313 3.09737 12 expm1: 3.0281 3.0349 | expm1: 2.98813 2.98913 13 hypot: 33.7668 33.7471 | hypot: 27.9664 27.9675 14 log10: 4.2213 10.0832 | log10: 4.221 10.1109 15 log1p: 9.2886 9.2888 | log1p: 9.08713 9.08075 16 log: 3.1362 7.6327 | log: 3.136 7.633 17 pow: 13.7787 17.4739 | pow: 13.7829 17.483 18 sin: 30.216 33.9733 | sin: 30.2163 35.1684 19 sinh:3.1478 3.152 | sinh:3.18638 3.18375 20 sqrt:0.665 0.6714 | sqrt:0.665 0.671 21 tan: 31.6409 36.461 | tan: 31.6385 37.7166 22 tanh:3.0753 3.0925 | tanh:3.11088 3.10675 Based upon that, it looks like that O3 is better for most of the functions, specially for sin, cos, and tan. On the other hand, it looks like that O3 hurts hypot. Thus it seems -O2 is not quite the same as -O3, but it's not straightforward to decide which one is better too. Maybe Andrew can enlight us. I also CC:ed ppc-aix-port in case somebody there wishes to comment on that. Thank you. Best regards, Gustavo Thanks, Severin [1] https://github.com/gromero/strictmath/
Re: RFR: 8210416: [linux] Poor StrictMath performance due to non-optimized compilation
Hi Severin, On 09/10/2018 06:27 AM, Severin Gehwolf wrote: On Mon, 2018-09-10 at 10:05 +0100, Andrew Haley wrote: On 09/05/2018 02:12 PM, Severin Gehwolf wrote: Is there a good reason to not use -O3 -ffp-contract=off everywhere? Is there a good reason to use -O3 rather than -O2? Not sure. I was following what JDK-8170153 did, which was using OPTIMIZATION := HIGH corresponding to -O3. cc'ing Gustavo. Gustavo, would you know why HIGH was chosen over, LOW? I don't remember exactly, but at least for ppc64 I discussed that a bit with the toolchain folks (also regarding the precision issue, etc) and they never said anything against using -O3. Unfortunately it was long time ago so I don't remember exactly the numbers on ppc64 for -O2 to check if it was worse and so I selected -O3 instead. -O3 can bloat the code which can increase cache pressure, which is not always noticeable in benchmarks but hurts real-world programs. Unless benchmarks are significantly better at -O3, -O2 is a good default choice. OK, thanks! I'll re-test and change to LOW (-O2) if it gives similar results. That's interesting. Andrew, do you mean bloat in the sense of final code size (for instance, due to unrolling), right? BTW (I just remembered that), on RISC the lack of optimization hurts way more than the lack of optimization on CISC, so I recall that it puzzled me the fact that turning on the optimization on x86_64 did not change much the scenario, contrary to the conspicuous gains on on ppc64 when turning on the optimization. I took me some time so to understand that the optimization flag was the culprit (a much simpler case lucky), because I tried first to profile and optimize the fdlibm code (after extracting it from JVM for detailed analysis) and only after getting to a dead end I turned to look at simpler causes. Are you checking the difference between -O2 and -O3 only on x86_64? Best regards, Gustavo
Re: RFR: 8210416: [linux] Poor StrictMath performance due to non-optimized compilation
Hello, On 09/05/2018 04:15 PM, joe darcy wrote: Hello, On 9/5/2018 6:12 AM, Severin Gehwolf wrote: Hi, Cross-posting this review-thread on core-libs-dev and build-dev as this is a build change, but affects fdlibm which is core-libs. With JDK-8170153 optimization for fdlibm code has been turned on for ppc64 s390 and aarch64. This patch intends to turn it on on all arches on Linux. I've not observed any precision issues. Is there a good reason to not use -O3 -ffp-contract=off everywhere? Bug: https://bugs.openjdk.java.net/browse/JDK-8210416 webrev: http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8210416/webrev.01/ Testing: - java/lang/Math, java/lang/StrictMath tests (all pass). - Currently running through submit repo. A simple micro benchmark from JDK-8170153[1] shows these numbers for StrictMath: Function | before | after -- sin | 0m33.382s | 0m18.731s cos | 0m31.562s | 0m18.796s tan | 0m33.657s | 0m21.093s atan | 0m5.714s | 0m4.902s log | 0m6.212s | 0m4.439s log10 | 0m7.946s | 0m5.543s sqrt | 0m0.481s | 0m0.449s cbrt | 0m5.295s | 0m5.214s tanh | 0m1.404s | 0m1.307s log1p | 0m6.457s | 0m5.131s IEEEremainder | 0m10.629s | 0m6.048s atan2 | 0m8.037s | 0m5.668s hypot | 0m2.171s | 0m2.147s Note that pow (JDK-8134795), hypot (JDK-7130085), cbrt (JDK-8136799), and exp (JDK-8139688), have been ported to Java as of JDK 9. The sqrt method is commonly handled as an intrinsic. Testing that was not geared toward finding precision/rounding issues would be unlikely to find them. I don't see the sources of the microbenchmark in JDK-8170153. The microbench can be found on the first email I sent discussing that issue on Power. Check it at the end, but there is nothing special in there: http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-November/002738.html There is also some discussing on the formal RFR: http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-November/002751.html Severin, the precision issues were quite conspicuous on Power when -O3 was turned on. There are some discussions about it in the threads above IIRC. SAP helped to test the change against additional tests, like TCK. HTH. Best regards, Gustavo Cheers, -Joe
Re: RFC: Add new JCA provider to support hardware RNGs
Hi Bernd, Thanks for your comments. On 06/20/2018 05:05 PM, Bernd Eckenfels wrote: Just a FYI under Linux when you read from urandom the Linux kernel will always XOR with random bytes generated with x64 rdrand instruction (arch_get_random_lomg() - if supported). Since it is a XOR it does not have to trust the quality of this black box hardware implementation. Yes, I know to some extent how the use of rdrand on Intel by Linux generated debates in the past. I would not implement a generator which does not have its own software whitening. (And most likely there is no need for one different than urandom on Linux). If you do implement whitening I would use a DRBG construction and no longer use a (low state) SHA1PRNG. My main concern to use 'darn' on Power was to get some additional speed and throughput on obtaining true random numbers (or on cases where /dev/random - which blocks - was preferred to /dev/urandom). That concern seems valid specially when the code is not JITed yet by the C2 compiler (so for Interpreted and C1 Compiler). Anyway, the idea to have a separated provider and that the use of the intrinsic is by no means enabled by default is to let the user to decide if he/she wants to use that kind of generator. I think OpenSSL disabled by default the use of rdrand similarly but let the user to enable it if desired? The idea of DRBG looks valid. I took SHA1PRNG as a fallback. But if DRBG would be used for whitening that would need a seed from /dev/urandom? (I'm trying to think the impact on performance here). Best regards, Gustavo Gruss Bernd Gruss Bernd -- http://bernd.eckenfels.net From: core-libs-dev on behalf of Gustavo Romero Sent: Wednesday, June 20, 2018 2:59:47 AM To: core-libs-dev; security-dev Cc: vladimir.koz...@oracle.com; Doerr, Martin Subject: RFC: Add new JCA provider to support hardware RNGs Hi, Please, could I get comments on the following change? Since it's related to security, I would be glad if security experts could also comment on that. webrev: http://cr.openjdk.java.net/~gromero/POWER9/darn/v6_rebased/ It introduces a way to get benefits from hardware instructions in userspace that can delivery a random number and so be used to speed up and avoid blocking in some SecureRandom methods, notably generateSeed() and nextBytes(), without loosing the random number quality. Particularly on Power, the new POWER9 CPUs introduced a hardware instruction called 'darn' that can be used for that purpose. The main idea is to add a new JCA provider called "HWTRNG" (if no better name is suggested), adding new helper methods that can then be intrinsified and that will be used by generateSeed() and nextBytes(). In that sense, this change also adds the proper mechanisms in the Interpreter, C1 Compiler, and C2 compiler (for PPC64, but also paving the way for other platforms) to intrinsify these helper methods that will be used in the end by generateSeed() and nextBytes() methods. The added helpers are: byte[] getRandomSequence0(int) byte[] getRandomSequence1(byte[]) long validRandomLong(void) long randomLong(void) The helpers also take care of checking if the returned random number is valid and attempt 10 times (as recommended by ISA) get a valid random number before falling back to a software alternative (HWTRNG is based mostly on JCA provider NativePRNG and so the fall back mechanism will use the exactly same methods found in NativePRNG and hence behave similarly. Nonetheless, in my experience such a hardware failures (which are the main cause of a returned invalid random number) are quite rare: in my tests I was never able to exhaust the HW RNG and force it to generate an invalid random number). The user, besides having to specify explicitly the use of a non-default provider (HWTRNG), will have to unlock the VM experimental options to effectively use the hardware RNG as an unique random source by passing options "-XX:+UnlockExperimentalVMOptions -XX:+UseRANDOMIntrinsics". On Power, for the Interpreter and C1 Compiler, besides avoiding the blocking effect of a low entropy on /dev/random, I was able to get about 126 Mbps (3x higher than the version without 'darn') on generaSeed() and nextBytes(). The maximum theoretical value on a POWER9 machine is usually 128 Mbps. I'll address the details about the file headers (Copyright, dates, authors, versions, etc) after I get some feedback about this change. Thanks in advance. Best regards, Gustavo
RFC: Add new JCA provider to support hardware RNGs
Sorry for resending it. I missed a few MLs. -- Hi, Please, could I get comments on the following change? Since it's related to security, I would be glad if security experts could also comment on that. webrev: http://cr.openjdk.java.net/~gromero/POWER9/darn/v6_rebased/ It introduces a way to get benefits from hardware instructions in userspace that can delivery a random number and so be used to speed up and avoid blocking in some SecureRandom methods, notably generateSeed() and nextBytes(), without loosing the random number quality. Particularly on Power, the new POWER9 CPUs introduced a hardware instruction called 'darn' that can be used for that purpose. The main idea is to add a new JCA provider called "HWTRNG" (if no better name is suggested), adding new helper methods that can then be intrinsified and that will be used by generateSeed() and nextBytes(). In that sense, this change also adds the proper mechanisms in the Interpreter, C1 Compiler, and C2 compiler (for PPC64, but also paving the way for other platforms) to intrinsify these helper methods that will be used in the end by generateSeed() and nextBytes() methods. The added helpers are: byte[] getRandomSequence0(int) byte[] getRandomSequence1(byte[]) long validRandomLong(void) long randomLong(void) The helpers also take care of checking if the returned random number is valid and attempt 10 times (as recommended by ISA) get a valid random number before falling back to a software alternative (HWTRNG is based mostly on JCA provider NativePRNG and so the fall back mechanism will use the exactly same methods found in NativePRNG and hence behave similarly. Nonetheless, in my experience such a hardware failures (which are the main cause of a returned invalid random number) are quite rare: in my tests I was never able to exhaust the HW RNG and force it to generate an invalid random number). The user, besides having to specify explicitly the use of a non-default provider (HWTRNG), will have to unlock the VM experimental options to effectively use the hardware RNG as an unique random source by passing options "-XX:+UnlockExperimentalVMOptions -XX:+UseRANDOMIntrinsics". On Power, for the Interpreter and C1 Compiler, besides avoiding the blocking effect of a low entropy on /dev/random, I was able to get about 126 Mbps (3x higher than the version without 'darn') on generaSeed() and nextBytes(). The maximum theoretical value on a POWER9 machine is usually 128 Mbps. I'll address the details about the file headers (Copyright, dates, authors, versions, etc) after I get some feedback about this change. Thanks in advance. Best regards, Gustavo
RFC: Add new JCA provider to support hardware RNGs
Hi, Please, could I get comments on the following change? Since it's related to security, I would be glad if security experts could also comment on that. webrev: http://cr.openjdk.java.net/~gromero/POWER9/darn/v6_rebased/ It introduces a way to get benefits from hardware instructions in userspace that can delivery a random number and so be used to speed up and avoid blocking in some SecureRandom methods, notably generateSeed() and nextBytes(), without loosing the random number quality. Particularly on Power, the new POWER9 CPUs introduced a hardware instruction called 'darn' that can be used for that purpose. The main idea is to add a new JCA provider called "HWTRNG" (if no better name is suggested), adding new helper methods that can then be intrinsified and that will be used by generateSeed() and nextBytes(). In that sense, this change also adds the proper mechanisms in the Interpreter, C1 Compiler, and C2 compiler (for PPC64, but also paving the way for other platforms) to intrinsify these helper methods that will be used in the end by generateSeed() and nextBytes() methods. The added helpers are: byte[] getRandomSequence0(int) byte[] getRandomSequence1(byte[]) long validRandomLong(void) long randomLong(void) The helpers also take care of checking if the returned random number is valid and attempt 10 times (as recommended by ISA) get a valid random number before falling back to a software alternative (HWTRNG is based mostly on JCA provider NativePRNG and so the fall back mechanism will use the exactly same methods found in NativePRNG and hence behave similarly. Nonetheless, in my experience such a hardware failures (which are the main cause of a returned invalid random number) are quite rare: in my tests I was never able to exhaust the HW RNG and force it to generate an invalid random number). The user, besides having to specify explicitly the use of a non-default provider (HWTRNG), will have to unlock the VM experimental options to effectively use the hardware RNG as an unique random source by passing options "-XX:+UnlockExperimentalVMOptions -XX:+UseRANDOMIntrinsics". On Power, for the Interpreter and C1 Compiler, besides avoiding the blocking effect of a low entropy on /dev/random, I was able to get about 126 Mbps (3x higher than the version without 'darn') on generaSeed() and nextBytes(). The maximum theoretical value on a POWER9 machine is usually 128 Mbps. I'll address the details about the file headers (Copyright, dates, authors, versions, etc) after I get some feedback about this change. Thanks in advance. Best regards, Gustavo
Re: [aarch64-port-dev ] RFR(s) PPC64/s390x/aarch64: Poor StrictMath performance due to non-optimized compilation
Hi Erik, Volker, Andrew On 29-11-2016 16:15, Andrew Haley wrote: > On 29/11/16 18:06, Volker Simonis wrote: >> @Andrew: are you fine with Gustavos latest version of the change? > > Sure. The StrictMath versions all seem to give the same results. > > Andrew. > Thanks for reviewing the change! I changed the "FC Extension Request" status to "reviewed". Regards, Gustavo
Re: [aarch64-port-dev ] RFR(s) PPC64/s390x/aarch64: Poor StrictMath performance due to non-optimized compilation
Hi Andrew, On 29-11-2016 13:35, Andrew Haley wrote: > On 29/11/16 09:41, Volker Simonis wrote: >> Thanks Gustavo, >> >> the change looks good. >> >> So now we're just waiting for another review from somebody of the aarch64 >> folks. >> Once we have that and the fc-request is approved I'll push the changes. > > One thing I don't understand: > > cos 0.17098435541865692 1m7.433s 0.1709843554185943 0m56.678s > sin 1.7136493465700289 1m10.654s 1.7136493465700542 0m57.114s > > Do you know what causes the lower digits to be different? Is > it that Math and StrictMath use different algorithms, not just > different optimization levels? I don't know exactly what's the root cause for that difference (in the result). The difference is not present on x64, however on PPC64 even with -O0 (as it is by now) that difference exists. Math methods are intrisified, but StricMath are not. But I understand that Math and StrictMath share the fdlibm code since I already changed some code in fdlibm that reflected both on Math and StrictMath, so it's not clear to me where the Math relaxation occurs on PPC64 (given that such a relaxation is allowed [1]). For sure others much more experienced than I can comment about difference. Regards, Gustavo [1] https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html
RFR(s) PPC64/s390x/aarch64: Poor StrictMath performance due to non-optimized compilation
Hi all, I'm re-sending due to JDK title update to include s390x and aarch64 archs. Could the following webrev be reviewed, please? webrev 1/2: http://cr.openjdk.java.net/~gromero/8170153/v2/ webrev 2/2: http://cr.openjdk.java.net/~gromero/8170153/v2/jdk/ bug:https://bugs.openjdk.java.net/browse/JDK-8170153 Thank you. Regards, Gustavo
Re: RFR(s) 8170153: PPC64: Poor StrictMath performance due to non-optimized compilation
Hi Volker, Sorry for not replying earlier, it was day-off on Friday here... On 25-11-2016 11:32, Volker Simonis wrote: > Hi Gustavo, > > we've realized that we have exactly the same problem on Linux/s390 so > I hope you don't mind that I've updated the bug and the webrev to also > include the fix for Linux/s390: > > http://cr.openjdk.java.net/~simonis/webrevs/2016/8170153.top/ > http://cr.openjdk.java.net/~simonis/webrevs/2016/8170153.jdk/ > https://bugs.openjdk.java.net/browse/JDK-8170153 > > The top-level change stays the same (I've only added the current > reviewers) and for the jdk change I've just added Linux/s390 as > another platform which can compile fdlibm with HIGH optimization. Actually, it's really cool to know that an analysis on PPC64 contributed also to the s390 arch! :) Thanks for providing the updated webrevs. Regards, Gustavo
Re: RFR(s) 8170153: PPC64: Poor StrictMath performance due to non-optimized compilation
Hi Erik, On 23-11-2016 12:29, Erik Joelsson wrote: > Build changes look ok. > > In CoreLibraries.gmk, I think it would have been ok to keep the conditional > checking (OPENJDK_TARGET_CPU_ARCH, ppc), but this certainly works too. Thanks a lot for reviewing the change. Regards, Gustavo
Re: RFR(s) 8170153: PPC64: Poor StrictMath performance due to non-optimized compilation
Hi Martin, On 23-11-2016 12:38, Doerr, Martin wrote: > Hi Gustavo, > > thanks for providing the webrevs. > > I have ran the StrictMath jck tests which fail when building with -O3 and > without -ffp-contract=off: > FailedTests: > api/java_lang/StrictMath/desc.html#acos > javasoft.sqe.tests.api.java.lang.StrictMath.acos_test > api/java_lang/StrictMath/desc.html#asin > javasoft.sqe.tests.api.java.lang.StrictMath.asin_test > api/java_lang/StrictMath/desc.html#atan > javasoft.sqe.tests.api.java.lang.StrictMath.atan_test > api/java_lang/StrictMath/desc.html#atan2 > javasoft.sqe.tests.api.java.lang.StrictMath.atan2_test > api/java_lang/StrictMath/desc.html#cos > javasoft.sqe.tests.api.java.lang.StrictMath.cos_test > api/java_lang/StrictMath/desc.html#exp > javasoft.sqe.tests.api.java.lang.StrictMath.exp_test > api/java_lang/StrictMath/desc.html#log > javasoft.sqe.tests.api.java.lang.StrictMath.log_test > api/java_lang/StrictMath/desc.html#sin > javasoft.sqe.tests.api.java.lang.StrictMath.sin_test > api/java_lang/StrictMath/desc.html#tan > javasoft.sqe.tests.api.java.lang.StrictMath.tan_test > api/java_lang/StrictMath/index.html#expm1 > javasoft.sqe.tests.api.java.lang.StrictMath.expm1Tests -TestCaseID ALL > api/java_lang/StrictMath/index.html#log10 > javasoft.sqe.tests.api.java.lang.StrictMath.log10Tests -TestCaseID ALL > api/java_lang/StrictMath/index.html#log1p > javasoft.sqe.tests.api.java.lang.StrictMath.log1pTests -TestCaseID ALL > > All of them have passed when building with -O3 and -ffp-contract=off (on > linuxppc64le). Thank you very much for running the additional StrictMath jck tests against the change! Best regards, Gustavo
Re: RFR(s) 8170153: PPC64: Poor StrictMath performance due to non-optimized compilation
Hi Volker, On 23-11-2016 12:05, Volker Simonis wrote: > thanks a lot for tracking this down! Happy to contribute :) > The change looks good and I a can sponsor it once you get another > review from the build group and the FC Extension Request was approved. Thanks a lot for sponsoring it! > In general I'd advise to sign the OCTLA [1] to get access to the Java > SE TCK [2] as this contains quite a lot of additional conformance > tests which can be quite valuable for changes like this. > > Regards, > Volker > > [1] http://openjdk.java.net/legal/octla-java-se-8.pdf > [2] http://openjdk.java.net/groups/conformance/JckAccess/ Right. I'll check the documentation and find a way to get access to the TCK. Best regards, Gustavo
RFR(s) 8170153: PPC64: Poor StrictMath performance due to non-optimized compilation
Hi, Could the following change be reviewed, please? webrev 1/2: http://cr.openjdk.java.net/~gromero/8170153/ webrev 2/2: http://cr.openjdk.java.net/~gromero/8170153/jdk/ bug:https://bugs.openjdk.java.net/browse/JDK-8170153 It enables fdlibm optimization on Linux PPC64 LE & BE and hence speeds up the StrictMath methods (in some cases up to 3x) on that platform. On PPC64 fdlibm optimization can be done without precision issues if floating-point expression contraction is disable, i.e. if the compiler does not use floating-point multiply-add (FMA). For further details please refer to gcc bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78386 No regression was observed on Math and StrictMath tests: Passed: java/lang/Math/AbsPositiveZero.java Passed: java/lang/Math/Atan2Tests.java Passed: java/lang/Math/CeilAndFloorTests.java Passed: java/lang/Math/CubeRootTests.java Passed: java/lang/Math/DivModTests.java Passed: java/lang/Math/ExactArithTests.java Passed: java/lang/Math/Expm1Tests.java Passed: java/lang/Math/FusedMultiplyAddTests.java Passed: java/lang/Math/HyperbolicTests.java Passed: java/lang/Math/HypotTests.java Passed: java/lang/Math/IeeeRecommendedTests.java Passed: java/lang/Math/Log10Tests.java Passed: java/lang/Math/Log1pTests.java Passed: java/lang/Math/MinMax.java Passed: java/lang/Math/MultiplicationTests.java Passed: java/lang/Math/PowTests.java Passed: java/lang/Math/Rint.java Passed: java/lang/Math/RoundTests.java Passed: java/lang/Math/SinCosCornerCasesTests.java Passed: java/lang/Math/TanTests.java Passed: java/lang/Math/WorstCaseTests.java Test results: passed: 21 Passed: java/lang/StrictMath/CubeRootTests.java Passed: java/lang/StrictMath/ExactArithTests.java Passed: java/lang/StrictMath/Expm1Tests.java Passed: java/lang/StrictMath/HyperbolicTests.java Passed: java/lang/StrictMath/HypotTests.java Passed: java/lang/StrictMath/Log10Tests.java Passed: java/lang/StrictMath/Log1pTests.java Passed: java/lang/StrictMath/PowTests.java Test results: passed: 8 and also on the following hotspot tests: Passed: compiler/intrinsics/mathexact/sanity/AddExactIntTest.java Passed: compiler/intrinsics/mathexact/sanity/AddExactLongTest.java Passed: compiler/intrinsics/mathexact/sanity/DecrementExactIntTest.java Passed: compiler/intrinsics/mathexact/sanity/DecrementExactLongTest.java Passed: compiler/intrinsics/mathexact/sanity/IncrementExactIntTest.java Passed: compiler/intrinsics/mathexact/sanity/IncrementExactLongTest.java Passed: compiler/intrinsics/mathexact/sanity/MultiplyExactIntTest.java Passed: compiler/intrinsics/mathexact/sanity/MultiplyExactLongTest.java Passed: compiler/intrinsics/mathexact/sanity/NegateExactIntTest.java Passed: compiler/intrinsics/mathexact/sanity/NegateExactLongTest.java Passed: compiler/intrinsics/mathexact/sanity/SubtractExactIntTest.java Passed: compiler/intrinsics/mathexact/sanity/SubtractExactLongTest.java Passed: compiler/intrinsics/mathexact/AddExactICondTest.java Passed: compiler/intrinsics/mathexact/AddExactIConstantTest.java Passed: compiler/intrinsics/mathexact/AddExactILoadTest.java Passed: compiler/intrinsics/mathexact/AddExactILoopDependentTest.java Passed: compiler/intrinsics/mathexact/AddExactINonConstantTest.java Passed: compiler/intrinsics/mathexact/AddExactIRepeatTest.java Passed: compiler/intrinsics/mathexact/AddExactLConstantTest.java Passed: compiler/intrinsics/mathexact/AddExactLNonConstantTest.java Passed: compiler/intrinsics/mathexact/CompareTest.java Passed: compiler/intrinsics/mathexact/DecExactITest.java Passed: compiler/intrinsics/mathexact/DecExactLTest.java Passed: compiler/intrinsics/mathexact/GVNTest.java Passed: compiler/intrinsics/mathexact/IncExactITest.java Passed: compiler/intrinsics/mathexact/IncExactLTest.java Passed: compiler/intrinsics/mathexact/MulExactICondTest.java Passed: compiler/intrinsics/mathexact/MulExactIConstantTest.java Passed: compiler/intrinsics/mathexact/MulExactILoadTest.java Passed: compiler/intrinsics/mathexact/MulExactILoopDependentTest.java Passed: compiler/intrinsics/mathexact/MulExactINonConstantTest.java Passed: compiler/intrinsics/mathexact/MulExactIRepeatTest.java Passed: compiler/intrinsics/mathexact/MulExactLConstantTest.java Passed: compiler/intrinsics/mathexact/MulExactLNonConstantTest.java Passed: compiler/intrinsics/mathexact/NegExactIConstantTest.java Passed: compiler/intrinsics/mathexact/NegExactILoadTest.java Passed: compiler/intrinsics/mathexact/NegExactILoopDependentTest.java Passed: compiler/intrinsics/mathexact/NegExactINonConstantTest.java Passed: compiler/intrinsics/mathexact/NegExactLConstantTest.java Passed: compiler/intrinsics/mathexact/NegExactLNonConstantTest.java Passed: compiler/intrinsics/mathexact/NestedMathExactTest.java Passed: compiler/intrinsics/mathexact/SplitThruPhiTest.java Passed: compiler/intrinsics/mathexact/SubExactICondTest.java Passed: compiler/intrinsics/mathexact/SubExactIConstantTest.java Passed: compiler/intrinsics/mathexact/SubExactILoadTest.java Passed: c
Re: PPC64: Poor StrictMath performance due to non-optimized compilation
Hi Derek, On 17-11-2016 20:47, White, Derek wrote: > Hi Joe, > > Although neither a floating point expert (as I think I've proven to you over > the years), or a gcc expert, I checked with our in-house gcc expert and got > this following answer: > > "Yes using -fno-strict-aliasing fixes the issues. Also there are many > forks of fdlibm which has this fixed including the code inside glibc. " I've tried on PPC64 -O3 and -fno-strict-aliasing but it didn't work. Disabling the FMA fixed the issue although. Do you know if the gap between Math and StrictMath is also huge on aarch64? Thank you. Regards, Gustavo:
Re: PPC64: Poor StrictMath performance due to non-optimized compilation
Hi Chris, On 17-11-2016 19:48, Chris Plummer wrote: >> The fdlibm code relies on aliasing a two-element array of int with a double >> to do bit-level reads and writes of floating-point values. As I understand >> it, the C spec allows compilers to assume >> values of different types don't overlap in memory. The compilation >> environment has to be configured in such a way that the C compiler disables >> code generation and optimization techniques that would >> run afoul of these fdlibm coding practices. > This is the strict aliasing issue right? It's a long standing problem with > fdlibm that kept getting worse as gcc got smarter. IIRC, compiling with > -fno-strict-aliasing fixes it, but it's been more > than 12 years since I last dealt with fdlibm and compiler aliasing issues. I've tested with -O3 and -fno-strict-aliasing as you suggested but it did not fix the fp precision issue on PPC64. After finding that -fno-expensive-optimizations solved the problem, we narrowed down the problem to the FMA: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78386 Thank you. Regards, Gustavo
Re: PPC64: Poor StrictMath performance due to non-optimized compilation
Hi Joe, On 17-11-2016 19:33, joe darcy wrote: Currently, optimization for building fdlibm is disabled, except for the "solaris" OS target [1]. >>> The reason for that is because historically the Solaris compilers have had >>> sufficient discipline and control regarding floating-point semantics and >>> compiler optimizations to still implement the >>> Java-mandated results when optimization was enabled. The gcc family of >>> compilers, for example, has lacked such discipline. >> oh, I see. Thanks for clarifying that. I was exactly wondering why fdlibm >> optimization is off even for x86_x64 as it, AFAICS regarding gcc 5 only, does >> not affect the precision, even if setting -O3 does not improve the >> performance >> as much as on PPC64. > > The fdlibm code relies on aliasing a two-element array of int with a double > to do bit-level reads and writes of floating-point values. As I understand > it, the C spec allows compilers to assume values > of different types don't overlap in memory. The compilation environment has > to be configured in such a way that the C compiler disables code generation > and optimization techniques that would run afoul > of these fdlibm coding practices. On discussing with the Power toolchain folks we narrowed down the issue on PPC64 to the FMA. -fno-strict-aliasing has no effect and when used with an aggressive optimization does not solve the issue on precision. Thus -ffp-contract=off is the best options we have by now to optimize the fdlibm on PPC64. >>> Methods in the Math class, such as pow, are often intrinsified and use a >>> different algorithm so a straight performance comparison may not be as fair >>> or meaningful in those cases. >> I agree. It's just that the issue on StrictMath methods was first noted due >> to >> that huge gap (Math vs StrictMath) on PPC64, which is not prominent on x64. > > Depending on how Math.{sin, cos} is implemented on PPC64, compiling the > fdlibm sin/cos with more aggressive optimizations should not be expected to > close the performance gap. In particular, if > Math.{sin, cos} is an intrinsic on PPC64 (I haven't checked the sources) that > used platform-specific feature (say fused multiply add instructions) then > just compiling fdlibm more aggressively wouldn't > necessarily make up that gap. In our case (PPC64) it does close the gap. Non-optimized code will suffer a lot, for instance, from load-hit-store issues. Contrary to what happens on PPC64, the gap on x64 seems to be quite small as you said. > > To allow cross-platform and cross-release reproducibility, StrictMath is > specified to use the particular fdlibm algorithms, which precludes using > better algorithms developed more recently. If we were > to start with a clean slate today, to get such reproducibility we would > specify correctly-rounded behavior of all those methods, but such an approach > was much less tractable technical 20+ years ago > without benefit of the research that was been done in the interim, such as > the work of Prof. Muller and associates: > https://lipforge.ens-lyon.fr/projects/crlibm/. > >> >> >>> Accumulating the the results of the functions and comparisons the sums is >>> not a sufficiently robust way of checking to see if the optimized versions >>> are indeed equivalent to the non-optimized ones. >>> The specification of StrictMath requires a particular result for each set >>> of floating-point arguments and sums get round-away low-order bits that >>> differ. >> That's really good point, thanks for letting me know about that. I'll >> re-test my >> change under that perspective. >> >> >>> Running the JDK math library regression tests and corresponding JCK tests >>> is recommended for work in this area. >> Got it. By "the JDK math library regression tests" you mean exactly which >> test >> suite? the jtreg tests? > > Specifically, the regression tests under test/java/lang/Math and > test/java/lang/StrictMath in the jdk repository. There are some other math > library tests in the hotspot repo, but I don't know where > they are offhand. > > A note on methodologies, when I've been writing test for my port I've tried > to include test cases that exercise all the branches point in the code. Due > to the large input space (~2^64 for a > single-argument method), random sampling alone is an inefficient way to try > to find differences in behavior. >> For testing against JCK/TCK I'll need some help on that. >> > > I believe the JCK/TCK does have additional testcases relevant here. > > HTH; thanks, > > -Joe > Thank you very much for the valuable comments. I'll send a webrev accordingly for review. I filed a bug: https://bugs.openjdk.java.net/browse/JDK-8170153 Best regards, Gustavo
Re: PPC64: Poor StrictMath performance due to non-optimized compilation
Hi Joe, Thanks a lot for your valuable comments. On 17-11-2016 15:35, joe darcy wrote: >> Currently, optimization for building fdlibm is disabled, except for the >> "solaris" OS target [1]. > > The reason for that is because historically the Solaris compilers have had > sufficient discipline and control regarding floating-point semantics and > compiler optimizations to still implement the > Java-mandated results when optimization was enabled. The gcc family of > compilers, for example, has lacked such discipline. oh, I see. Thanks for clarifying that. I was exactly wondering why fdlibm optimization is off even for x86_x64 as it, AFAICS regarding gcc 5 only, does not affect the precision, even if setting -O3 does not improve the performance as much as on PPC64. >> As a consequence on PPC64 (Linux) StrictMath methods like, but not limited >> to, >> sin(), cos(), and tan() perform verify poor in comparison to the same methods >> in Math class [2]: > > If you are doing your work against JDK 9, note that the pow, hypot, and cbrt > fdlibm methods required by StrictMath have been ported to Java (JDK-8134780: > Port fdlibm to Java). I have intentions to > port the remaining methods to Java, but it is unclear whether or not this > will occur for JDK 9. Yes, I'm doing my work against 9. So is there any problem if I proceed with my change? I understand that there is no conflict as JDK-8134780 progresses and replaces the StrictMath methods by their counterparts in Java. Please, advice. Is it intended to downport JDK-8134780 to 8? > Methods in the Math class, such as pow, are often intrinsified and use a > different algorithm so a straight performance comparison may not be as fair > or meaningful in those cases. I agree. It's just that the issue on StrictMath methods was first noted due to that huge gap (Math vs StrictMath) on PPC64, which is not prominent on x64. > Accumulating the the results of the functions and comparisons the sums is not > a sufficiently robust way of checking to see if the optimized versions are > indeed equivalent to the non-optimized ones. > The specification of StrictMath requires a particular result for each set of > floating-point arguments and sums get round-away low-order bits that differ. That's really good point, thanks for letting me know about that. I'll re-test my change under that perspective. > Running the JDK math library regression tests and corresponding JCK tests is > recommended for work in this area. Got it. By "the JDK math library regression tests" you mean exactly which test suite? the jtreg tests? For testing against JCK/TCK I'll need some help on that. Thank you very much. Regards, Gustavo
Re: PPC64: Poor StrictMath performance due to non-optimized compilation
Hi Erik, On 17-11-2016 07:17, Erik Joelsson wrote: > Overall this looks reasonable to me. However, if we want to introduce a new > possible tuple for specifying compilation flags to SetupNativeCompilation, we > (the build team) would prefer if we used > OPENJDK_TARGET_CPU instead of OPENJDK_TARGET_CPU_ARCH. Got it. Thanks a lot for that info. I'll take that into account. Regards, Gustavo
Re: PPC64: Poor StrictMath performance due to non-optimized compilation
Hi David, On 17-11-2016 00:31, David Holmes wrote: > Adding in build-dev as they need to scrutinize all build changes. Thanks a lot. Regards, Gustavo
PPC64: Poor StrictMath performance due to non-optimized compilation
Hi, Currently, optimization for building fdlibm is disabled, except for the "solaris" OS target [1]. As a consequence on PPC64 (Linux) StrictMath methods like, but not limited to, sin(), cos(), and tan() perform verify poor in comparison to the same methods in Math class [2]: Math StrictMath = == sin 0m29.984s 1m41.184s cos 0m30.031s 1m41.200s tan 0m31.772s 1m46.976s asin 0m4.577s0m4.543s acos 0m4.539s0m4.525s atan 0m12.929s 0m12.896s exp0m1.071s0m4.570s log0m3.272s 0m14.239s log10 0m4.362s 0m20.236s sqrt 0m0.913s0m0.981s cbrt 0m10.786s 0m10.808s sinh 0m4.438s0m4.433s cosh 0m4.496s0m4.478s tanh 0m3.360s0m3.353s expm1 0m4.076s0m4.094s log1p 0m13.518s 0m13.527s IEEEremainder 0m38.803s 0m38.909s atan2 0m20.100s 0m20.057s pow0m14.096s 0m19.938s hypot 0m5.136s0m5.122s Switching on the O3 optimization can damage precision of those methods, nonetheless it's possible to avoid that side effect and yet get huge benefits of the -O3 optimization on PPC64 if -fno-expensive-optimizations is passed in addition to the -O3 optimization flag. In that sense the following change is proposed to resolve the issue: diff -r 81eb4bd34611 make/lib/CoreLibraries.gmk --- a/make/lib/CoreLibraries.gmkWed Nov 09 13:37:19 2016 +0100 +++ b/make/lib/CoreLibraries.gmkWed Nov 16 19:11:11 2016 -0500 @@ -33,10 +33,16 @@ # libfdlibm is statically linked with libjava below and not delivered into the # product on its own. -BUILD_LIBFDLIBM_OPTIMIZATION := HIGH +BUILD_LIBFDLIBM_OPTIMIZATION := NONE -ifneq ($(OPENJDK_TARGET_OS), solaris) - BUILD_LIBFDLIBM_OPTIMIZATION := NONE +ifeq ($(OPENJDK_TARGET_OS), solaris) + BUILD_LIBFDLIBM_OPTIMIZATION := HIGH +endif + +ifeq ($(OPENJDK_TARGET_OS), linux) + ifeq ($(OPENJDK_TARGET_CPU_ARCH), ppc) +BUILD_LIBFDLIBM_OPTIMIZATION := HIGH + endif endif LIBFDLIBM_SRC := $(JDK_TOPDIR)/src/java.base/share/native/libfdlibm @@ -51,6 +57,7 @@ CFLAGS := $(CFLAGS_JDKLIB) $(LIBFDLIBM_CFLAGS), \ CFLAGS_windows_debug := -DLOGGING, \ CFLAGS_aix := -qfloat=nomaf, \ + CFLAGS_linux_ppc := -fno-expensive-optimizations, \ DISABLED_WARNINGS_gcc := sign-compare, \ DISABLED_WARNINGS_microsoft := 4146 4244 4018, \ ARFLAGS := $(ARFLAGS), \ diff -r 2a1f97c0ad3d make/common/NativeCompilation.gmk --- a/make/common/NativeCompilation.gmk Wed Nov 09 15:32:39 2016 +0100 +++ b/make/common/NativeCompilation.gmk Wed Nov 16 19:08:06 2016 -0500 @@ -569,16 +569,19 @@ $1_ALL_OBJS := $$(sort $$($1_EXPECTED_OBJS) $$($1_EXTRA_OBJECT_FILES)) # Pickup extra OPENJDK_TARGET_OS_TYPE and/or OPENJDK_TARGET_OS dependent variables for CFLAGS. - $1_EXTRA_CFLAGS:=$$($1_CFLAGS_$(OPENJDK_TARGET_OS_TYPE)) $$($1_CFLAGS_$(OPENJDK_TARGET_OS)) + $1_EXTRA_CFLAGS:=$$($1_CFLAGS_$(OPENJDK_TARGET_OS_TYPE)) $$($1_CFLAGS_$(OPENJDK_TARGET_OS)) \ + $$($1_CFLAGS_$(OPENJDK_TARGET_OS)_$(OPENJDK_TARGET_CPU_ARCH)) ifneq ($(DEBUG_LEVEL),release) # Pickup extra debug dependent variables for CFLAGS $1_EXTRA_CFLAGS+=$$($1_CFLAGS_debug) $1_EXTRA_CFLAGS+=$$($1_CFLAGS_$(OPENJDK_TARGET_OS_TYPE)_debug) $1_EXTRA_CFLAGS+=$$($1_CFLAGS_$(OPENJDK_TARGET_OS)_debug) + $1_EXTRA_CFLAGS+=$$($1_CFLAGS_$(OPENJDK_TARGET_OS)_$(OPENJDK_TARGET_CPU_ARCH)_debug) else $1_EXTRA_CFLAGS+=$$($1_CFLAGS_release) $1_EXTRA_CFLAGS+=$$($1_CFLAGS_$(OPENJDK_TARGET_OS_TYPE)_release) $1_EXTRA_CFLAGS+=$$($1_CFLAGS_$(OPENJDK_TARGET_OS)_release) + $1_EXTRA_CFLAGS+=$$($1_CFLAGS_$(OPENJDK_TARGET_OS)_$(OPENJDK_TARGET_CPU_ARCH)_release) endif # Pickup extra OPENJDK_TARGET_OS_TYPE and/or OPENJDK_TARGET_OS dependent variables for CXXFLAGS. After enabling the optimization it's possible to again up to 3x on performance regarding the aforementioned methods without losing precision: StrictMath, original StrictMath, optimized sin1.7136493465700542 1m41.184s 1.7136493465700542 0m33.895s cos0.1709843554185943 1m41.200s 0.1709843554185943 0m33.884s tan -5.5500322522995315E7 1m46.976s -5.5500322522995315E7 0m36.461s asin NaN 0m4.543s NaN 0m3.175s acos NaN 0m4.525s NaN 0m3.211s atan 1.5707961389886132E8 0m12.896s1.5707961389886132E8 0m7.100s exp Infinity 0m4.570sInfinity 0m3.187s log 1.7420680845245087E9 0m14.239s1.7420680845245087E9 0m7.170s log10 7.565705562087342E8 0m20.236s 7.565705562087342E8 0m9.610