Please have a quick look at 8274730: AArch64: AES/GCM acceleration is broken by the fix for JDK-8273297
I had to touch common code in C2 to unbreak AArch64, so I'd like another reviewer, please. It's pretty simple, thanks. https://github.com/openjdk/jdk/pull/5819 -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8273297: AES/GCM non-AVX512+VAES CPUs suffer after 8267125 [v5]
On Wed, 22 Sep 2021 22:48:32 GMT, Smita Kamath wrote: >> Performance dropped up to 10% for 1k data after 8267125 for CPUs that do not >> support the new intrinsic. Tests run were crypto.full.AESGCMBench and >> crypto.full.AESGCMByteBuffer from the jmh micro benchmarks. >> >> The problem is each instance of GHASH allocates 96 extra longs for the >> AVX512+VAES intrinsic regardless if the intrinsic is used. This extra table >> space should be allocated differently so that non-supporting CPUs do not >> suffer this penalty. This issue also affects non-Intel CPUs too. > > Smita Kamath has updated the pull request incrementally with one additional > commit since the last revision: > > Added htbl_entries constant to other architectures Marked as reviewed by aph (Reviewer). - PR: https://git.openjdk.java.net/jdk/pull/5402
Re: Intrinsic methods and time to safepoint
On 9/17/21 10:13 AM, Roland Westrelin wrote: > >> OK. I guess the problem is that the call to the stub doesn't have an oop >> map. > > Assuming you can get the oop map in stub, deoptimization could still be > delayed until the stub is exited. Is that considered a problem? > > Actually I'm not even sure the loop synthesized from IR nodes is > feasible: what would the JVM state at the safepoint be given that > there's no actual java code for it? What would you deoptimize to if you > were to deoptimize? Ah. Perhaps you have a good point. :-) -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Intrinsic methods and time to safepoint
On 9/16/21 1:28 PM, Roland Westrelin wrote: > >> I believe we should have a policy to cover how long an intrinsic can >> delay without responding to a safepoint, and that it should be in the >> millisecond range. It would make almost no difference to the >> performance of encryption if chunks handles by a fast intrinsic were, >> say, about a megabyte. The difference in performance is so small as to >> be immeasurable, and the improvement in the performance of other threads >> is vast. > > I agree with you (seems like a no brainer) but I have a couple comments > about implementation details. > > Those intrinsics usually call some stub. It's not possible AFAICT, to > have the safepoint in the stub itself. So we need some loop that > repeatedly calls the stub. That loop can either be added 1) by the JIT > as IR when the intrinsic is expanded 2) in java code, that is java > library code needs to be refactored. > > 2) would seem much easier to implement and would work for both c1 and c2 > (if some of these intrinsics end up implemented by c1). OK. I guess the problem is that the call to the stub doesn't have an oop map. The tricky cases seem to be in the crypto code, which is already rather fiddly. It's usually simple enough for intrinsics to return the amount of work left to do, I guess. > Also a note of caution about loop strip mining: it doesn't have a model > for what the loop body costs. So it blindly assumes all loop bodies can > be run for 1000 iterations (by default) between safepoints. Unless I'm > missing something, with a stub running for 1ms, delays between safepoint > could still be 1s. It'd be interesting to do the experiment. I might try that, just for grins. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8273297: AES/GCM non-AVX512+VAES CPUs suffer after 8267125
On Tue, 7 Sep 2021 22:31:30 GMT, Smita Kamath wrote: > Performance dropped up to 10% for 1k data after 8267125 for CPUs that do not > support the new intrinsic. Tests run were crypto.full.AESGCMBench and > crypto.full.AESGCMByteBuffer from the jmh micro benchmarks. > > The problem is each instance of GHASH allocates 96 extra longs for the > AVX512+VAES intrinsic regardless if the intrinsic is used. This extra table > space should be allocated differently so that non-supporting CPUs do not > suffer this penalty. This issue also affects non-Intel CPUs too. It seems to me there's a serious problem here. When you execute the galoisCounterMode_AESCrypt() intrinsic, I don't think there's a limit on the number of blocks to be encrypted. With the older intrinsic things are not so very bad because the incoming data is split into 6 segments. But if we use this intrinsic, there is no safepoint check in the inner loop, which can lead to a long time to safepoint, and this causes stalls on the other threads. If you split the incoming data into blocks of about a megabyte you'd lose no measurable performance but you'd dramatically improve the performance of everything else, especially with a concurrent GC. - PR: https://git.openjdk.java.net/jdk/pull/5402
Re: RFR: 8273297: AES/GCM non-AVX512+VAES CPUs suffer after 8267125
On Mon, 13 Sep 2021 12:48:14 GMT, Andrew Haley wrote: >> Performance dropped up to 10% for 1k data after 8267125 for CPUs that do not >> support the new intrinsic. Tests run were crypto.full.AESGCMBench and >> crypto.full.AESGCMByteBuffer from the jmh micro benchmarks. >> >> The problem is each instance of GHASH allocates 96 extra longs for the >> AVX512+VAES intrinsic regardless if the intrinsic is used. This extra table >> space should be allocated differently so that non-supporting CPUs do not >> suffer this penalty. This issue also affects non-Intel CPUs too. > > src/hotspot/share/opto/library_call.cpp line 6796: > >> 6794: >> 6795: Node* avx512_subkeyHtbl = new_array(klass_node, intcon(96), 0); >> 6796: if (avx512_subkeyHtbl == NULL) return false; > > This looks very Intel-specific, but it's in generic code. Please make this > constant 96 a symbol and push it into a header file in the x86 back end. Likewise, the name prefix "avx512_" isn't appropriate for code that will certainly be used by other targets. - PR: https://git.openjdk.java.net/jdk/pull/5402
Re: RFR: 8273297: AES/GCM non-AVX512+VAES CPUs suffer after 8267125
On Tue, 7 Sep 2021 22:31:30 GMT, Smita Kamath wrote: > Performance dropped up to 10% for 1k data after 8267125 for CPUs that do not > support the new intrinsic. Tests run were crypto.full.AESGCMBench and > crypto.full.AESGCMByteBuffer from the jmh micro benchmarks. > > The problem is each instance of GHASH allocates 96 extra longs for the > AVX512+VAES intrinsic regardless if the intrinsic is used. This extra table > space should be allocated differently so that non-supporting CPUs do not > suffer this penalty. This issue also affects non-Intel CPUs too. src/hotspot/share/opto/library_call.cpp line 6796: > 6794: > 6795: Node* avx512_subkeyHtbl = new_array(klass_node, intcon(96), 0); > 6796: if (avx512_subkeyHtbl == NULL) return false; This looks very Intel-specific, but it's in generic code. Please make this constant 96 a symbol and push it into a header file in the x86 back end. - PR: https://git.openjdk.java.net/jdk/pull/5402
Re: RFR: 8267125: AES Galois CounterMode (GCM) interleaved implementation using AVX512 + VAES instructions [v8]
On Mon, 9 Aug 2021 15:49:07 GMT, Smita Kamath wrote: >> I would like to submit AES-GCM optimization for x86_64 architectures >> supporting AVX3+VAES (Evex encoded AES). This optimization interleaves AES >> and GHASH operations. >> Performance gain of ~1.5x - 2x for message sizes 8k and above. > > Smita Kamath has updated the pull request incrementally with one additional > commit since the last revision: > > rewiew update src/hotspot/cpu/x86/macroAssembler_x86_aes.cpp line 1682: > 1680: vpshufb(AAD_HASHx, AAD_HASHx, xmm24, Assembler::AVX_128bit); > 1681: > 1682: // Compute #rounds for AES based on the length of the key array Forget that, it's done everywhere. Deleted. - PR: https://git.openjdk.java.net/jdk/pull/4019
Re: RFR: 8267125: AES Galois CounterMode (GCM) interleaved implementation using AVX512 + VAES instructions [v8]
On Mon, 9 Aug 2021 15:49:07 GMT, Smita Kamath wrote: >> I would like to submit AES-GCM optimization for x86_64 architectures >> supporting AVX3+VAES (Evex encoded AES). This optimization interleaves AES >> and GHASH operations. >> Performance gain of ~1.5x - 2x for message sizes 8k and above. > > Smita Kamath has updated the pull request incrementally with one additional > commit since the last revision: > > rewiew update src/hotspot/cpu/x86/macroAssembler_x86_aes.cpp line 1682: > 1680: vpshufb(AAD_HASHx, AAD_HASHx, xmm24, Assembler::AVX_128bit); > 1681: > 1682: // Compute #rounds for AES based on the length of the key array This is a bit of a hack. Wouldn't it make more sense to pass in the array oop, then derive both the length and the address of the base of the key array from the oop, rather than using a negative offset from the base address? - PR: https://git.openjdk.java.net/jdk/pull/4019
Re: RFR: 8267125: AES Galois CounterMode (GCM) interleaved implementation using AVX512 + VAES instructions [v8]
On Mon, 9 Aug 2021 15:49:07 GMT, Smita Kamath wrote: >> I would like to submit AES-GCM optimization for x86_64 architectures >> supporting AVX3+VAES (Evex encoded AES). This optimization interleaves AES >> and GHASH operations. >> Performance gain of ~1.5x - 2x for message sizes 8k and above. > > Smita Kamath has updated the pull request incrementally with one additional > commit since the last revision: > > rewiew update src/java.base/share/classes/com/sun/crypto/provider/GaloisCounterMode.java line 694: > 692: > 693: /** > 694: * ByteBuffer wrapper for intrinsic implGCMCrypt. It will > operation Suggestion: * ByteBuffer wrapper for intrinsic implGCMCrypt. It will operate - PR: https://git.openjdk.java.net/jdk/pull/4019
Re: JEP 411, removal of finalizers, a path forward.
On 8/2/21 8:49 AM, Peter Firmstone wrote: > If I fix that bug, will JEP 411 be cancelled? No. The problem wasn't that we couldn't fix the [Speculative Execution Vulnerabilities], more that any fix would be so invasive and pervasive that it would severely hamper the whole platform. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: JEP 411 - Secure Java Distribution
On 6/1/21 10:06 AM, Peter Firmstone wrote: > If a vendor were to continue supporting SecurityManager and was > backporting from OpenJDK, if it passes the JCK with SecurityManager > disabled, that's still acceptable right? Look at the licence agreement in conjunction with the JCK users' guide. See the definition of “Compatible Licensee Implementation”. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]
On Tue, 23 Mar 2021 16:20:47 GMT, Ludovic Henry wrote: >> Anton Kozlov has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains 115 commits: >> >> - Merge branch 'master' into jdk-macos >> - JDK-8262491: bsd_aarch64 part >> - JDK-8263002: bsd_aarch64 part >> - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos >> - Wider #ifdef block >> - Fix most of issues in java/foreign/ tests >> >>Failures related to va_args are tracked in JDK-8263512. >> - Add Azul copyright >> - Update Oracle copyright years >> - Use Thread::current_or_null_safe in SafeFetch >> - 8262903: [macos_aarch64] Thread::current() called on detached thread >> - ... and 105 more: >> https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269 > > Marked as reviewed by luhenry (Author). > > > > [ Back-porting this patch to JDK 11] depends on the will of openjdk11 > > > > maintainers to accept this (and few other, like jep-388, as we depend > > > > on it) contribution. > > > > > > > > > To the extent that 11u has fixed policies :) we definitely have a policy > > > of accepting patches to keep 11u working on current hardware. So yes. > > > > > > @lewurm That sounds like a green flag for you and jep-388 (with its > > R18_RESERVED functionality) ;) > > Thanks, @theRealAph, and @VladimirKempik . We are on it! It's going to be tricky to do in a really clean way, given some of the weirdnesses of the ABI. However, I think there's probably a need for it - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]
On Mon, 22 Mar 2021 12:50:14 GMT, Anton Kozlov wrote: >> Please review the implementation of JEP 391: macOS/AArch64 Port. >> >> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and >> windows/aarch64. >> >> Major changes are in: >> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks >> JDK-8253817, JDK-8253818) >> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with >> necessary adjustments (JDK-8253819) >> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), >> required on macOS/AArch64 platform. It's implemented with >> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a >> thread, so W^X mode change relates to the java thread state change (for java >> threads). In most cases, JVM executes in write-only mode, except when >> calling a generated stub like SafeFetch, which requires a temporary switch >> to execute-only mode. The same execute-only mode is enabled when a java >> thread executes in java or native states. This approach of managing W^X mode >> turned out to be simple and efficient enough. >> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941) > > Anton Kozlov has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 115 commits: > > - Merge branch 'master' into jdk-macos > - JDK-8262491: bsd_aarch64 part > - JDK-8263002: bsd_aarch64 part > - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos > - Wider #ifdef block > - Fix most of issues in java/foreign/ tests > >Failures related to va_args are tracked in JDK-8263512. > - Add Azul copyright > - Update Oracle copyright years > - Use Thread::current_or_null_safe in SafeFetch > - 8262903: [macos_aarch64] Thread::current() called on detached thread > - ... and 105 more: > https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269 Marked as reviewed by aph (Reviewer). - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]
On Tue, 23 Mar 2021 13:54:24 GMT, Andrew Haley wrote: >> Anton Kozlov has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains 115 commits: >> >> - Merge branch 'master' into jdk-macos >> - JDK-8262491: bsd_aarch64 part >> - JDK-8263002: bsd_aarch64 part >> - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos >> - Wider #ifdef block >> - Fix most of issues in java/foreign/ tests >> >>Failures related to va_args are tracked in JDK-8263512. >> - Add Azul copyright >> - Update Oracle copyright years >> - Use Thread::current_or_null_safe in SafeFetch >> - 8262903: [macos_aarch64] Thread::current() called on detached thread >> - ... and 105 more: >> https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269 > > Marked as reviewed by aph (Reviewer). > [ Back-porting this patch to JDK 11] depends on the will of openjdk11 > maintainers to accept this (and few other, like jep-388, as we depend on it) > contribution. To the extent that 11u has fixed policies :) we definitely have a policy of accepting patches to keep 11u working on current hardware. So yes. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v24]
On Tue, 23 Mar 2021 09:53:54 GMT, Andrew Haley wrote: >>> @theRealAph, could you elaborate on what is need to be done for [#2200 >>> (review)](https://github.com/openjdk/jdk/pull/2200#pullrequestreview-600597066). >> >> I think that what you've got now is fine. > >> _Mailing list message from [Andrew Haley](mailto:a...@redhat.com) on >> [build-dev](mailto:build-...@openjdk.java.net):_ >> >> On 3/15/21 6:56 PM, Anton Kozlov wrote: >> >> > On Wed, 10 Mar 2021 11:21:44 GMT, Andrew Haley wrote: >> > > > We always check for `R18_RESERVED` with `#if(n)def`, is there any >> > > > reason to define the value for the macro? >> > > >> > > >> > > Robustness, clarity, maintainability, convention. Why not? >> > >> > >> > I've tried to implement the suggestion, but it pulled more unnecessary >> > changes. It makes the intended way to check the condition less clear >> > (`#ifdef` and not `#if`). >> >> No, no, no! I am not suggesting you change anything else, just that >> you do not define contentless macros. You might as well define it >> to be something, and true is a reasonable default, that's all. It's >> not terribly important, it's just good practice. > > I'm quite prepared to drop this if it's holding up the port. It's a style > thing, but it's not critical. So, where are we up to now? Are we done yet? - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v24]
On Fri, 12 Mar 2021 16:32:10 GMT, Andrew Haley wrote: >> Anton Kozlov has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains 105 commits: >> >> - Merge commit 'refs/pull/11/head' of https://github.com/AntonKozlov/jdk >> into jdk-macos >> - workaround JDK-8262895 by disabling subtest >> - Fix typo >> - Rename threadWXSetters.hpp -> threadWXSetters.inline.hpp >> - JDK-8259937: bsd_aarch64 part >> - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos >> - Fix after JDK-8259539, partially revert preconditions >> - JDK-8260471: bsd_aarch64 part >> - JDK-8259539: bsd_aarch64 part >> - JDK-8257828: bsd_aarch64 part >> - ... and 95 more: >> https://git.openjdk.java.net/jdk/compare/a6e34b3d...a72f6834 > >> @theRealAph, could you elaborate on what is need to be done for [#2200 >> (review)](https://github.com/openjdk/jdk/pull/2200#pullrequestreview-600597066). > > I think that what you've got now is fine. > _Mailing list message from [Andrew Haley](mailto:a...@redhat.com) on > [build-dev](mailto:build-...@openjdk.java.net):_ > > On 3/15/21 6:56 PM, Anton Kozlov wrote: > > > On Wed, 10 Mar 2021 11:21:44 GMT, Andrew Haley wrote: > > > > We always check for `R18_RESERVED` with `#if(n)def`, is there any > > > > reason to define the value for the macro? > > > > > > > > > Robustness, clarity, maintainability, convention. Why not? > > > > > > I've tried to implement the suggestion, but it pulled more unnecessary > > changes. It makes the intended way to check the condition less clear > > (`#ifdef` and not `#if`). > > No, no, no! I am not suggesting you change anything else, just that > you do not define contentless macros. You might as well define it > to be something, and true is a reasonable default, that's all. It's > not terribly important, it's just good practice. I'm quite prepared to drop this if it's holding up the port. It's a style thing, but it's not critical. 
- PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v21]
On Tue, 9 Mar 2021 18:01:11 GMT, Anton Kozlov wrote: >> src/hotspot/cpu/aarch64/globalDefinitions_aarch64.hpp line 62: >> >>> 60: >>> 61: #if defined(__APPLE__) || defined(_WIN64) >>> 62: #define R18_RESERVED >> >> #define R18_RESERVED true``` > > We always check for `R18_RESERVED` with `#if(n)def`, is there any reason to > define the value for the macro? Robustness, clarity, maintainability, convention. Why not? - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v24]
On Tue, 9 Mar 2021 16:12:36 GMT, Anton Kozlov wrote: >> Please review the implementation of JEP 391: macOS/AArch64 Port. >> >> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and >> windows/aarch64. >> >> Major changes are in: >> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks >> JDK-8253817, JDK-8253818) >> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with >> necessary adjustments (JDK-8253819) >> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), >> required on macOS/AArch64 platform. It's implemented with >> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a >> thread, so W^X mode change relates to the java thread state change (for java >> threads). In most cases, JVM executes in write-only mode, except when >> calling a generated stub like SafeFetch, which requires a temporary switch >> to execute-only mode. The same execute-only mode is enabled when a java >> thread executes in java or native states. This approach of managing W^X mode >> turned out to be simple and efficient enough. >> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941) > > Anton Kozlov has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 105 commits: > > - Merge commit 'refs/pull/11/head' of https://github.com/AntonKozlov/jdk > into jdk-macos > - workaround JDK-8262895 by disabling subtest > - Fix typo > - Rename threadWXSetters.hpp -> threadWXSetters.inline.hpp > - JDK-8259937: bsd_aarch64 part > - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos > - Fix after JDK-8259539, partially revert preconditions > - JDK-8260471: bsd_aarch64 part > - JDK-8259539: bsd_aarch64 part > - JDK-8257828: bsd_aarch64 part > - ... 
and 95 more: > https://git.openjdk.java.net/jdk/compare/a6e34b3d...a72f6834 > @theRealAph, could you elaborate on what is need to be done for [#2200 > (review)](https://github.com/openjdk/jdk/pull/2200#pullrequestreview-600597066). I think that what you've got now is fine. - Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v9]
On Wed, 3 Mar 2021 17:39:28 GMT, Gerard Ziemski wrote: > A list of the bugs that our internal testing revealed so far: Are any of these blockers for integration? Some of them are to do with things like features that aren't yet supported, and we can't fix what we can't see. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v10]
On Fri, 12 Feb 2021 11:42:59 GMT, Vladimir Kempik wrote: >> src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp line 194: >> >>> 192: // may get turned off by -fomit-frame-pointer. >>> 193: frame os::get_sender_for_C_frame(frame* fr) { >>> 194: return frame(fr->link(), fr->link(), fr->sender_pc()); >> >> Why is it >> >> return frame(fr->link(), fr->link(), fr->sender_pc()); >> >> and not >> >> return frame(fr->sender_sp(), fr->link(), fr->sender_pc()); >> >> like in the bsd-x86 counter part? > > bsd_aarcb64 was based on linux_aarch64, with addition of bsd-specific things > from bsd_x86 > You think the bsd-x86 way is better here ? There's no point copying x86. We don't have any way to know what the sender's SP was in a C frame without using unwind info. I think this is just used when trying to print the stack in a crash dump. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v9]
On Fri, 5 Feb 2021 17:20:55 GMT, Anton Kozlov wrote: >> src/hotspot/cpu/aarch64/vm_version_aarch64.hpp line 93: >> >>> 91: CPU_MARVELL = 'V', >>> 92: CPU_INTEL = 'i', >>> 93: CPU_APPLE = 'a', >> >> The `ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture >> profile` has 8538 pages, can we be more specific and point to the particular >> section of the document where the CPU codes are defined? > > They are defined in 13.2.95. MIDR_EL1, Main ID Register. Apple's code is not > there, but "Arm can assign codes that are not published in this manual. All > values not assigned by Arm are reserved and must not be used.". I assume the > value was obtained by digging around > https://github.com/apple/darwin-xnu/blob/main/osfmk/arm/cpuid.h#L62 Anton, this paragraph looks like an excellent comment. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v10]
On Thu, 4 Feb 2021 23:01:52 GMT, Gerard Ziemski wrote: >> Anton Kozlov has updated the pull request incrementally with six additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/jdk/jdk-macos' into jdk-macos >> - Add comments to WX transitions >> >>+ minor change of placements >> - Use macro conditionals instead of empty functions >> - Add W^X to tests >> - Do not require known W^X state >> - Revert w^x in gtests > > src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp line 652: > >> 650: >> 651: void os::setup_fpu() { >> 652: } > > Is there really nothing to do here, or does it still need to be implemented? A > clarification comment here would help. There is really nothing to do here. > src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp line 198: > >> 196: >> 197: NOINLINE frame os::current_frame() { >> 198: intptr_t *fp = *(intptr_t **)__builtin_frame_address(0); > > In the bsd_x86 counter part we initialize `fp` to `_get_previous_fp()` - do > we need to implement it on aarch64 as well (and using address 0 is just a > temp workaround) or is it doing the right thing here? `__builtin_frame_address(0)` looks right to me. > src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp line 291: > >> 289: bool is_unsafe_arraycopy = (thread->doing_unsafe_access() && >> UnsafeCopyMemory::contains_pc(pc)); >> 290: if ((nm != NULL && nm->has_unsafe_access()) || >> is_unsafe_arraycopy) { >> 291: address next_pc = pc + NativeCall::instruction_size; > > Replace > > address next_pc = pc + NativeCall::instruction_size; > > with > > address next_pc = Assembler::locate_next_instruction(pc); > > there is at least one other place that needs it. Why is this change needed? AFAICS `locate_next_instruction()` is an x86 thing for variable-length instructions, and no other port uses it. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v21]
On Fri, 26 Feb 2021 19:17:12 GMT, Anton Kozlov wrote: >> Please review the implementation of JEP 391: macOS/AArch64 Port. >> >> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and >> windows/aarch64. >> >> Major changes are in: >> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks >> JDK-8253817, JDK-8253818) >> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with >> necessary adjustments (JDK-8253819) >> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), >> required on macOS/AArch64 platform. It's implemented with >> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a >> thread, so W^X mode change relates to the java thread state change (for java >> threads). In most cases, JVM executes in write-only mode, except when >> calling a generated stub like SafeFetch, which requires a temporary switch >> to execute-only mode. The same execute-only mode is enabled when a java >> thread executes in java or native states. This approach of managing W^X mode >> turned out to be simple and efficient enough. >> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941) > > Anton Kozlov has updated the pull request incrementally with two additional > commits since the last revision: > > - Merge remote-tracking branch 'origin/jdk/jdk-macos' into jdk-macos > - Minor fixes Thanks. With this, I think we're done. - Changes requested by aph (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v21]
On Fri, 26 Feb 2021 19:17:12 GMT, Anton Kozlov wrote: >> Please review the implementation of JEP 391: macOS/AArch64 Port. >> >> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and >> windows/aarch64. >> >> Major changes are in: >> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks >> JDK-8253817, JDK-8253818) >> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with >> necessary adjustments (JDK-8253819) >> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), >> required on macOS/AArch64 platform. It's implemented with >> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a >> thread, so W^X mode change relates to the java thread state change (for java >> threads). In most cases, JVM executes in write-only mode, except when >> calling a generated stub like SafeFetch, which requires a temporary switch >> to execute-only mode. The same execute-only mode is enabled when a java >> thread executes in java or native states. This approach of managing W^X mode >> turned out to be simple and efficient enough. >> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941) > > Anton Kozlov has updated the pull request incrementally with two additional > commits since the last revision: > > - Merge remote-tracking branch 'origin/jdk/jdk-macos' into jdk-macos > - Minor fixes src/hotspot/cpu/aarch64/globalDefinitions_aarch64.hpp line 62: > 60: > 61: #if defined(__APPLE__) || defined(_WIN64) > 62: #define R18_RESERVED Suggestion: `#define R18_RESERVED true` - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v18]
On Mon, 15 Feb 2021 17:45:32 GMT, Anton Kozlov wrote: >> I'm not sure it can wait. This change turns already-messy code into >> something significantly messier, to the extent that it's not really good >> enough to go into mainline. > > The latest merge with JDK-8261071 should resolve the issue. Please take a > look. Looks much better, thanks. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v8]
On Tue, 16 Feb 2021 06:24:05 GMT, Vladimir Kempik wrote: >> This is when passing a float, yes? In the case where we have more float >> arguments than n_float_register_parameters_c. >> I don't understand why you think it's acceptable to bail in this case. Can >> you explain, please? > > it's for everything that uses less than 8 bytes on a stack( ints ( 4), > shorts(2), bytes(1), floats(4)). > currently native wrapper generation does not support such cases at all, it > needs refactoring before this can be implemented. > So when a method has more argument than can be placed in registers, we may > have issues. > > So we just bailing out to interpreter in case when a smaller (<=4 b) type is > going to be passed thru the stack. > > There was attempt to implement handling such cases but currently it requires > some hacks (like using some vectors for non-specific task) - > https://github.com/openjdk/aarch64-port/pull/3 OK. I checked and the Panama preview doesn't support direct native calls for stack arguments, so we're good for now. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v8]
On Mon, 15 Feb 2021 18:00:50 GMT, Vladimir Kempik wrote: >> src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 839: >> >>> 837: // The code unable to handle this, bailout. >>> 838: return -1; >>> 839: #endif >> >> This looks like a bug to me. The caller doesn't necessarily check the return >> value. See CallRuntimeNode::calling_convention. > > Hello, we have updated PR, now this bailout is used only by the code which > can handle it (native wrapper generator), for the rest it will cause > guarantee failed if this bailout is triggered This is when passing a float, yes? In the case where we have more float arguments than n_float_register_parameters_c. I don't understand why you think it's acceptable to bail in this case. Can you explain, please? - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v10]
On Thu, 4 Feb 2021 23:05:56 GMT, Gerard Ziemski wrote: >> Anton Kozlov has updated the pull request incrementally with six additional >> commits since the last revision: >> >> - Merge remote-tracking branch 'origin/jdk/jdk-macos' into jdk-macos >> - Add comments to WX transitions >> >>+ minor change of placements >> - Use macro conditionals instead of empty functions >> - Add W^X to tests >> - Do not require known W^X state >> - Revert w^x in gtests > > I reviewed bsd_aarch64.cpp, digging a bit deeper, and left some comments. > > Umm, so how does patching work? We don't even know if other threads are > > executing the code we need to patch. > > I thought Java can handle that scenario as usual (as on non-W^X systems) just fine, > so we just trust that the JVM did everything right and that it's safe to rewrite > code at a specific moment. Got it, OK. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v9]
On Thu, 4 Feb 2021 09:49:17 GMT, Vladimir Kempik wrote: > > You read my mind, Andrew. Unless, of course, it's optimized to leverage the > > fact that it's thread-specific.. > > it's thread-specific > > https://developer.apple.com/documentation/apple_silicon/porting_just-in-time_compilers_to_apple_silicon > > > Because pthread_jit_write_protect_np changes only the current thread’s > > permissions, avoid accessing the same memory region from multiple threads. > > Giving multiple threads access to the same memory region opens up a > > potential attack vector, in which one thread has write access and another > > has executable access to the same region. Umm, so how does patching work? We don't even know if other threads are executing the code we need to patch. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v9]
On Tue, 2 Feb 2021 18:03:50 GMT, Gerard Ziemski wrote: >> Anton Kozlov has updated the pull request incrementally with one additional >> commit since the last revision: >> >> support macos_aarch64 in hsdis > src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5271: > >> 5269: // >> 5270: void MacroAssembler::get_thread(Register dst) { >> 5271: RegSet saved_regs = RegSet::range(r0, r1) + >> BSD_ONLY(RegSet::range(r2, r17)) + lr - dst; > > The comment needs to be updated, since on BSD we also seem to clobber r2,r17 ? Surely this should be `saved_regs = RegSet::range(r0, r1) BSD_ONLY(+ RegSet::range(r2, r17)) + lr - dst;` Shouldn't it? - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v9]
On Tue, 2 Feb 2021 21:49:36 GMT, Daniel D. Daugherty wrote: >> Anton Kozlov has updated the pull request incrementally with one additional >> commit since the last revision: >> >> support macos_aarch64 in hsdis > > src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 323: > >> 321: str(zr, Address(rthread, JavaThread::last_Java_pc_offset())); >> 322: >> 323: str(zr, Address(rthread, JavaFrameAnchor::saved_fp_address_offset())); > > I don't think this switch from `JavaThread::saved_fp_address_offset()` > to `JavaFrameAnchor::saved_fp_address_offset()` is correct since > `rthread` is still used and is a JavaThread*. The new code will give you: > > `rthread` + offset of the `saved_fp_address` field in a JavaFrameAnchor > > The old code gave you: > > `rthread` + offset of the `saved_fp_address` field in the JavaFrameAnchor > field in the JavaThread > > Those are not the same things. I agree, I don't understand why this change was made. > src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.hpp line 45: > >> 43: // Atomically copy 64 bits of data >> 44: static void atomic_copy64(const volatile void *src, volatile void >> *dst) { >> 45: *(jlong *) dst = *(const jlong *) src; > > Is this construct actually atomic on aarch64? Yes. > src/hotspot/os_cpu/windows_aarch64/os_windows_aarch64.hpp line 37: > >> 35: >> 36: private: >> 37: > > 'private' usually has one space in front of it. > Also, why the blank line after it? It reads better with the blank line, and it's not in violation of HotSpot conventions. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v8]
On Sun, 31 Jan 2021 20:14:01 GMT, Anton Kozlov wrote: >> Please review the implementation of JEP 391: macOS/AArch64 Port. >> >> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and >> windows/aarch64. >> >> Major changes are in: >> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks >> JDK-8253817, JDK-8253818) >> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with >> necessary adjustments (JDK-8253819) >> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), >> required on macOS/AArch64 platform. It's implemented with >> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a >> thread, so W^X mode change relates to the java thread state change (for java >> threads). In most cases, JVM executes in write-only mode, except when >> calling a generated stub like SafeFetch, which requires a temporary switch >> to execute-only mode. The same execute-only mode is enabled when a java >> thread executes in java or native states. This approach of managing W^X mode >> turned out to be simple and efficient enough. >> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941) > > Anton Kozlov has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 62 commits: > > - Merge branch 'master' into jdk-macos > - Update copyright year for BsdAARCH64ThreadContext.java > - Fix inclusing of StubRoutines header > - Redo buildsys fix > - Revert harfbuzz changes, disable warnings for it > - Little adjustement of SlowSignatureHandler > - Partially bring previous commit > - Revert "Address feedback for signature generators" > >This reverts commit 50b55f6684cd21f8b532fa979b7b6fbb4613266d. > - Refactor CDS disabling > - Redo builsys support for aarch64-darwin > - ... and 52 more: > https://git.openjdk.java.net/jdk/compare/8a9004da...b421e0b4 src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 84: > 82: // on stack. 
Natural alignment for types are still in place, > 83: // for example double/long should be 8 bytes alligned > 84: This comment is a bit confusing because it's no longer #ifdef APPLE. Better move it up to Line 41. src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 352: > 350: > 351: #ifdef __APPLE__ > 352: virtual void pass_byte() Please remove ```#ifdef __APPLE__``` around this region. src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 839: > 837: // The code unable to handle this, bailout. > 838: return -1; > 839: #endif This looks like a bug to me. The caller doesn't necessarily check the return value. See CallRuntimeNode::calling_convention. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v3]
On 1/26/21 6:42 PM, erik.joels...@oracle.com wrote: > My understanding is that Apple chose to not provide JNF for aarch64, so > if you want to build OpenJDK, you first need to build JNF yourself (it's > available in github). Phil is working on removing this dependency > completely, which will solve this issue [1]. > > In the meantime, I don't think we should rely on finding JNF in > unsupported locations inside Xcode. Could we wait with integrating this > port until it can be built without such hacks? That sounds right to me. In the meantime, there is some cleanup work to be done in mainline on slow_signature_handler, which will potentially make the AArch64 back end merge much simpler. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v2]
On Sun, 24 Jan 2021 15:50:01 GMT, Anton Kozlov wrote: >> src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 86: >> >>> 84: >>> 85: switch (_num_int_args) { >>> 86: case 0: >> >> I don't think you need such a large switch statement. I think this can be >> expressed as >> if (num_int_args <= 6) { >> ldr(as_Register(num_int_args + r1.encoding()), src); >> ... etc. > > I like the suggestion. In its defense, the new functions were made this way > intentionally, to match the existing `pass_int`, `pass_long`,.. I take your > comment as a blessing to fix all of them. But I feel that refactoring of > existing code should go in a separate commit. Especially after `pass_int` > used `ldr/str` instead of `ldrw/strw`, which looks wrong. > https://github.com/openjdk/jdk/pull/2200/files#diff-1ff58ce70aeea7e9842d34e8d8fd9c94dd91182999d455618b2a171efd8f742cL87-R122 This is new code, and it should not repeat the mistakes of the past. There's no need to refactor the similar existing code as part of this patch. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v2]
On Sun, 24 Jan 2021 16:10:44 GMT, Anton Kozlov wrote: >> src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 394: >> >>> 392: >>> 393: class SlowSignatureHandler >>> 394: : public NativeSignatureIterator { >> >> SlowSignatureHandler is turning into a maintenance nightmare. This isn't the >> fault of this commit so much as some nasty cut-and-paste programming in the >> code you're editing. It might well be better to rewrite this whole thing to >> use a table-driven approach, with a table that contains the increment and >> the size of each native type: then we'd have only a single routine which >> could pass data of any type, and each OS needs only supply a table of >> constants. > > Would you like me to do something about it now? The problem is that the > functions of SlowSignatureHandler are subtly different, so it will be > multiple tables, not sure how many. Such change is another candidate for a > separate code enhancement, which I would like not to mix into the JEP > implementation (it's already not a small change :)). But if you think it > would be a good thing, please let me know. In another case, I'll add this to > a queue of follow-up enhancements. I'm not sure it can wait. This change turns already-messy code into something significantly messier, to the extent that it's not really good enough to go into mainline. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v2]
On Sun, 24 Jan 2021 16:29:31 GMT, Vladimir Kempik wrote: >> src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5272: >> >>> 5270: void MacroAssembler::get_thread(Register dst) { >>> 5271: RegSet saved_regs = RegSet::range(r0, r1) + >>> BSD_ONLY(RegSet::range(r2, r17)) + lr - dst; >>> 5272: push(saved_regs, sp); >> >> This isn't very nice. > > Hello > Why is it not nice ? > linux_aarch64 uses some linux specific tls function > _ZN10JavaThread25aarch64_get_thread_helperEv from > hotspot/os_cpu/linux_aarch64/threadLS_linux_aarch64.s > which clobbers only r0 and r1 > macos_aarch64 has no such tls code and uses generic C-call to > Thread::current(); > Hence we are saving possibly clobbered regs here. Surely if you did as Linux does you wouldn't need to clobber all those registers. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port
On Fri, 22 Jan 2021 18:49:42 GMT, Anton Kozlov wrote: > Please review the implementation of JEP 391: macOS/AArch64 Port. > > It's heavily based on existing ports to linux/aarch64, macos/x86_64, and > windows/aarch64. > > Major changes are in: > * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks > JDK-8253817, JDK-8253818) > * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with necessary > adjustments (JDK-8253819) > * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), > required on macOS/AArch64 platform. It's implemented with > pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a > thread, so W^X mode change relates to the java thread state change (for java > threads). In most cases, JVM executes in write-only mode, except when calling > a generated stub like SafeFetch, which requires a temporary switch to > execute-only mode. The same execute-only mode is enabled when a java thread > executes in java or native states. This approach of managing W^X mode turned > out to be simple and efficient enough. > * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941) src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 86: > 84: > 85: switch (_num_int_args) { > 86: case 0: I don't think you need such a large switch statement. I think this can be expressed as if (num_int_args <= 6) { ldr(as_Register(num_int_args + 1), src); ... etc. src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 43: > 41: //describe amount of space in bytes occupied by type on native stack > 42: #ifdef __APPLE__ > 43: const int nativeByteSpace= sizeof(jbyte); I'm not sure these names should be in the global name space, and they're not very descriptive. Something like NativeStack::byteSpace would be better. Then you can use them with a `using namespace NativeStack` declaration. 
src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 433: > 431: store_and_inc(_to, from_obj, nativeShortSpace); > 432: > 433: _num_int_args++; Please lift this increment out of the conditional. src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 394: > 392: > 393: class SlowSignatureHandler > 394: : public NativeSignatureIterator { SlowSignatureHandler is turning into a maintenance nightmare. This isn't the fault of this commit so much as some nasty cut-and-paste programming in the code you're editing. It might well be better to rewrite this whole thing to use a table-driven approach, with a table that contains the increment and the size of each native type: then we'd have only a single routine which could pass data of any type, and each OS need only supply a table of constants. src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5272: > 5270: void MacroAssembler::get_thread(Register dst) { > 5271: RegSet saved_regs = RegSet::range(r0, r1) + > BSD_ONLY(RegSet::range(r2, r17)) + lr - dst; > 5272: push(saved_regs, sp); This isn't very nice. - PR: https://git.openjdk.java.net/jdk/pull/2200
Re: RFR: JDK-8255603: Memory/Performance regression after JDK-8210985
On Thu, 29 Oct 2020 15:11:09 GMT, Christoph Langer wrote: > It seems that there exists a memory/performance regression that was > introduced with JDK-8210985: Update the default SSL session cache size to > 20480. > > The idea of limiting the maximum SSL session cache size is by itself good. > Unfortunately, as per the current implementation of > sun.security.util.MemoryCache, it also modifies the initial size of the > LinkedHashMap that is backing the cache, initializing it at more than the > maximum size. > > I suggest restoring the original behavior that initializes with an > initialCapacity of 1 in most cases. That was the behavior before JDK-8210985, > when the property javax.net.ssl.sessionCacheSize was not set at all. > In case it was set, the initial size would have been as it is now, > (javax.net.ssl.sessionCacheSize / 0.75f) + 1, which still seems strange. Marked as reviewed by aph (Reviewer). src/java.base/share/classes/sun/security/util/Cache.java line 268: > 266: this.queue = null; > 267: > 268: cacheMap = new LinkedHashMap<>(1, LOAD_FACTOR, true); This looks right. The idea of scaling the initial cache size to the maximum size seems obviously to be wrong. - PR: https://git.openjdk.java.net/jdk/pull/937
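For readers following along: the construct under discussion is LinkedHashMap's access-order mode, where overriding removeEldestEntry yields LRU eviction, and the initial capacity is independent of the maximum size. Here is a minimal sketch of the idea (the class name and structure are mine, not the actual sun.security.util.MemoryCache code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: starts tiny regardless of maxSize, growing on
// demand, so a JVM holding many mostly-empty caches pays nothing up front.
public class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final float LOAD_FACTOR = 0.75f;
    private final int maxSize;

    public LruCacheSketch(int maxSize) {
        // initialCapacity = 1 (not scaled to maxSize); accessOrder = true
        // makes iteration order least-recently-accessed first.
        super(1, LOAD_FACTOR, true);
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put/putAll; evict the LRU entry once over budget.
        return size() > maxSize;
    }
}
```

The map simply rehashes as entries arrive, which is why initializing with capacity 1 costs almost nothing even when the cache later fills to its maximum.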
Re: RFR: 8252204: AArch64: Implement SHA3 accelerator/intrinsic [v5]
On Fri, 9 Oct 2020 12:18:58 GMT, Andrew Haley wrote: >> Fei Yang has updated the pull request with a new target base due to a merge >> or a rebase. The pull request now contains >> six commits: >> - Merge master >> - Add sha3 instructions to cpu/aarch64/aarch64-asmtest.py and regenerate >> the test in assembler_aarch64.cpp:asm_check >> - Rebase >> - Merge master >> - Fix trailing whitespace issue >> - 8252204: AArch64: Implement SHA3 accelerator/intrinsic >>Contributed-by: ard.biesheu...@linaro.org, dong...@huawei.com > > Marked as reviewed by aph (Reviewer). I see Linux x64 failed, but I don't seem to be able to withdraw my patch approval. Please consider it withdrawn anyway. - PR: https://git.openjdk.java.net/jdk/pull/207
Re: RFR: 8252204: AArch64: Implement SHA3 accelerator/intrinsic [v5]
On Fri, 9 Oct 2020 11:53:31 GMT, Fei Yang wrote: >> Contributed-by: ard.biesheu...@linaro.org, dong...@huawei.com >> >> This added an intrinsic for SHA3 using aarch64 v8.2 SHA3 Crypto Extensions. >> Reference implementation for core SHA-3 transform using ARMv8.2 Crypto >> Extensions: >> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/arm64/crypto/sha3-ce-core.S?h=v5.4.52 >> >> A trivial adaptation in SHA3.implCompress is needed for the purpose of adding >> the intrinsic. >> For SHA3, we need to pass one extra parameter "digestLength" to the stub for >> the calculation of block size. >> "digestLength" is also used in the EOR loop before keccak to >> differentiate different SHA3 variants. >> >> We added jtreg tests for SHA3 and used the QEMU system emulator, which supports >> SHA3 instructions, to test the functionality. >> The patch passed jtreg tier1-3 tests with the QEMU system emulator. >> Also verified with jtreg tier1-3 tests without SHA3 instructions on >> aarch64-linux-gnu and x86_64-linux-gnu, to make >> sure that there's no regression. >> We used one existing JMH test for performance testing: >> test/micro/org/openjdk/bench/java/security/MessageDigests.java >> We measured the performance benefit with an aarch64 cycle-accurate simulator. >> The patch delivers a 20% - 40% performance improvement depending on the specific SHA3 >> digest length and the size of the message. >> >> For now, this feature will not be enabled automatically for aarch64. We can >> auto-enable this when it is fully tested on >> real hardware. But for the above testing purposes, this is auto-enabled >> when the corresponding hardware feature is >> detected. > > Fei Yang has updated the pull request with a new target base due to a merge > or a rebase. 
The pull request now contains > six commits: > - Merge master > - Add sha3 instructions to cpu/aarch64/aarch64-asmtest.py and regenerate the > test in assembler_aarch64.cpp:asm_check > - Rebase > - Merge master > - Fix trailing whitespace issue > - 8252204: AArch64: Implement SHA3 accelerator/intrinsic >Contributed-by: ard.biesheu...@linaro.org, dong...@huawei.com Marked as reviewed by aph (Reviewer). - PR: https://git.openjdk.java.net/jdk/pull/207
Re: RFR: 8252204: AArch64: Implement SHA3 accelerator/intrinsic
On Thu, 17 Sep 2020 06:03:57 GMT, Ard Biesheuvel wrote: >> @ardbiesheuvel : Ard, could you please ack this patch? Thanks. > > Acked-by: Ard Biesheuvel > If this feature is not auto-enabled when the SHA3 hardware feature is there, > we will have one failure for the following > test: > test/hotspot/jtreg/compiler/intrinsics/sha/cli/TestUseSHA3IntrinsicsOptionOnSupportedCPU.java > 15 #-testresult- > 16 > description=file:/home/yangfei/github/jdk/test/hotspot/jtreg/compiler/intrinsics/sha/cli/TestUseSHA3IntrinsicsOptionOnSupportedCPU.java > 17 elapsed=31546 0:00:31.546 18 end=Mon Sep 21 10:27:58 CST 2020 > 19 environment=regtest > 20 execStatus=Failed. Execution failed: `main' threw exception: > java.lang.AssertionError: Option 'UseSHA3Intrinsics' is > expected to have 'true' value Option 'UseSHA3Intrinsics' should be enabled by > default > Any suggestions for this? I don't understand your question. There should be two acceptable results, either "Pass" or "Not supported". What else is possible? - PR: https://git.openjdk.java.net/jdk/pull/207
Re: RFR: 8252204: AArch64: Implement SHA3 accelerator/intrinsic
On 17/09/2020 05:26, Fei Yang wrote: > For now, this feature will not be enabled automatically for aarch64. We can > auto-enable this when it is fully tested on > real hardware. But for the above testing purposes, this is auto-enabled when > the corresponding hardware feature is > detected. Given that there's no real hardware, it's extra-important to add the new instructions to cpu/aarch64/aarch64-asmtest.py and regenerate the test in assembler_aarch64.cpp:asm_check. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Recent Java Sandbox escapes
On 19/08/2020 22:46, Alkanor Oumbratok wrote: > > I may be wrong on both points, and I would be really grateful if > someone could explain why these 2 CVE have been rated this high > whereas at first glance there isn't any really exploitable related > scenario. The first rule of the OpenJDK Vulnerability Group is: You do not talk about the OpenJDK Vulnerability Group. :-) In particular, we do not reveal exploits. Sorry. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR[15] JDK-8243114: Implement montgomery{Multiply, Square}intrinsics on Windows
On 18/06/2020 19:33, Martin Balao wrote: > * sharedRuntime_x86_64.cpp > * L3685 >* Do we still need 'long long' type for 'i' and 'cnt' local variables? No, but this is 64-bit-only code. And len is a long, so let's keep it. > * L3724 >* The last argument of 'sub' has type 'int', while in the not-Windows > variant is a long. Can we align this? We should do that, yes. Better it be long everywhere. > * L3729 >* Is it possible to directly store in a[i]? (instead of going through > 'tmp') I'm not sure this would add anything. In fact, I think this change would make it harder to read. > * I guess the compiler will easily optimize this, but we may still > get rid of the 2nd line > * I've seen in L3753 you directly store > > Note: it's a bit unfortunate that we don't have x86-64 inline assembly > in CL to maintain the same logic, as there is nothing OS-specific here. I think the problem is that MS never had a nice way to do inline assembly in 32-bit C, so they just gave up altogether when they went 64-bit. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR[15] JDK-8243114: Implement montgomery{Multiply, Square}intrinsics on Windows
On 05/06/2020 21:46, Simon Tooke wrote: > As per your and Andrew Haley's comments, I have updated the webrev: > > - used NOINLINE > > - used julong > > - deleted the block of unused code. > > Please let me know what you think. > > updated webrev: > http://cr.openjdk.java.net/~stooke/webrevs/jdk-8243114-jdk/01/01/ That looks clean, especially since it adds support for Windows with essentially no impact on the Linux code. I'm not sure I can Review it, though, because I'm one of the co-authors. It'd be nice if someone else familiar with the arithmetic (someone on security-dev?) could take a quick look. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR[15] JDK-8243114: Implement montgomery{Multiply, Square}intrinsics on Windows
On 05/06/2020 05:35, David Holmes wrote: > If these arrays are extracted from Java arrays Yes, they are. > then using a j* type would be appropriate, but then I would expect > jlong, unless the algorithm explicitly requires unsigned types? It does. All of the code handling bignums uses unsigned words. > If so julong would be acceptable. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR[15] JDK-8243114: Implement montgomery{Multiply, Square}intrinsics on Windows
On 04/06/2020 14:35, Simon Tooke wrote: > Yes, this hurt to type. A previous review suggested using julong, is > that acceptable to you? > > (an aside: I'm now wondering if there is other code in the JDK that > assumes long is 64bits - which is not true on all platforms...) That was just me, I think. I knew that this code would take some fixing up on Windows so I postponed portability issues until then. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: RFR[15] JDK-8243114: Implement montgomery{Multiply, Square}intrinsics on Windows
On 21/05/2020 18:24, Simon Tooke wrote: > This change implements the given intrinsics on Windows. Thank you. Does julong work on Windows, rather than unsigned long long? If not, perhaps a local typedef might help. All those "unsigned long long"s look a bit clumsy. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Problems with AES-GCM native acceleration
On 12/10/18 1:47 PM, Gidon Gershinsky wrote: > I've run my decryption benchmarks on Java with this patch, with excellent > results. > The benchmarks reach the top speed right away, no long warm-up anymore. > Also, there is no need to split the operation into multiple updates, the > doFinal works just fine. > > Moreover, the maximal decryption throughput is actually higher than in > Java11 after warm-up. > On one thread, I get 930MB/sec instead of 850MB/s. > On 8 threads, 350x8 instead of 230x8. > > This capability will be important in the Spark/Parquet workloads. The downloadable binary build is at https://aph.fedorapeople.org/jdk-12-internal+0-adhoc.aph.jdk-jdk.tar.gz -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Problems with AES-GCM native acceleration
On 11/21/18 5:37 PM, Andrew Haley wrote: > On 11/15/18 10:42 AM, Gidon Gershinsky wrote: >> Having the decryption optimized in the HotSpot engine would be ideal. > > I agree with you. I did a few experiments, and it can take a very > long time for C2 compilation to kick in, especially because GCM is > glacially slow until the intrinsics are used. > > I think this would be a generally useful enhancement to HotSpot, > and I'm kicking around an experimental patch which adds the > intrinsics to c1 and the interpreter. There's a proof-of-concept patch at http://cr.openjdk.java.net/~aph/gctr/ It's all rather hacky but it works. The patch is rather more complicated than I would have liked. We could simplify it somewhat by getting rid of the C1 intrinsic, and instead making C1 call the interpreter implementation. There's also a jmh benchmark in that directory. Test results for 1Mb look like this:

Interp:
Benchmark              Mode  Cnt     Score    Error  Units
AESGCMUpdateAAD2.test  avgt    5  1426.275 ±  8.778  us/op

C1:
Benchmark              Mode  Cnt     Score    Error  Units
AESGCMUpdateAAD2.test  avgt    5  1359.367 ±  8.196  us/op

C2:
Benchmark              Mode  Cnt     Score    Error  Units
AESGCMUpdateAAD2.test  avgt    5  1333.863 ± 18.385  us/op

-- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Problems with AES-GCM native acceleration
On 11/15/18 10:42 AM, Gidon Gershinsky wrote: > Having the decryption optimized in the HotSpot engine would be ideal. I agree with you. I did a few experiments, and it can take a very long time for C2 compilation to kick in, especially because GCM is glacially slow until the intrinsics are used. I think this would be a generally useful enhancement to HotSpot, and I'm kicking around an experimental patch which adds the intrinsics to c1 and the interpreter. -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: contribute to the OpenJDK security group
On 24/01/18 10:39, Tomas Gustavsson wrote: > Imho the P11 layer always needs attention. To work properly we're > relying on some patches, where parts was recently merged into OpenJDK. > We just started testing the Amazon CloudHSM, and that requires changes > to SunPKCS11 as well to work. Not always bad in SunPKCS11 as some P11 > libraries out there do strange non-conforming stuff, but there's room > for more flexibility nevertheless. Martin Balao has been heavily reworking this layer because it leaks native memory. I'll let him fill you in on the details. -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Do we need an unsigned multiplyHigh?
On 27/09/17 11:28, Peter Lawrey wrote: > If you need multiplyHigh for 128-bit then you need uint256. At some point > you have to decide whether you need that many bits as a supported > operation. When Java was created, a 64-bit long as the widest type made > sense; however, CPUs such as x64 now support 128, 256 and 512 bits natively, > and having the JVM doing its best to work this out is not as clean as > defining it explicitly. I guess cleanliness is in the eye of the beholder. IMO multiplyHigh is as clean as we need, and I'd rather see more complexity there than in the type system. It'd be nice to be able to return more than one scalar value from a method, for sure. -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Do we need an unsigned multiplyHigh?
On 26/09/17 15:53, Peter Lawrey wrote: > None, except you end up jumping through hoops to implement 128 bit > arithmetic efficiently or cleanly. At some point language support for such > a basic operation is the simplest and clearest solution. There's nothing inefficient about this approach. I don't quite see how 128-bit types help with cleanliness, because then you'd need a multiplyHigh for 128-bit types, surely? You need that for the type system to be complete. -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Do we need an unsigned multiplyHigh?
On 26/09/17 11:20, Peter Lawrey wrote: > We have arrays already but we don't have primitive types of more than > 64-bit. If we had uint128 for example we wouldn't need this method. Perhaps not, but why is needing this method a problem? -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Do we need an unsigned multiplyHigh?
On 26/09/17 08:25, Peter Lawrey wrote: > I am looking forward to intrinsic support for 128 bit math using `Long2` > and XMM (or even YMM, ZMM) instructions. > This is the best way forward, I hope. > > Personally I would like to see a long long type, or even uint128, uint256, > uint512 style notation. > > Another option might be something like long<128> or an annotation like > @uint128 long or even @decimal128 double but who knows. Do you actually need any of that? I think vector types make more sense. Java already has a great many scalar types. -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: Do we need an unsigned multiplyHigh?
On 25/09/17 18:21, Adam Petcher wrote: > I agree that an unsigned multiplyHigh would be useful for crypto > purposes, and we should consider adding it. Of course, I would much > rather have multiply operations that return both 64-bit parts of the > result, but that is going to be hard to do well without value types. So > it would be nice to have something like this in the meantime. I take your point, but it won't be excruciatingly difficult for the C2 compiler to turn the multiply operations into a single one, if the CPU can do that. From what I've seen recently, though, on non-x86 it's common for the two halves of the result to be calculated by separate instructions. > If we are going to add this operation, it should probably be added > along with an intrinsic. I think the Java code can simply factor out > the else branch from the existing multiplyHigh code. This way, > unsignedMultiplyHigh will be at least as fast as multiplyHigh, > whether the intrinsic implementation is available or not. Sure. I can do that. > If possible, the implementation of this operation should not branch on > either operand. This would make it more widely useful for constant-time > crypto implementations. Though this property would need to go into the > spec in order for constant-time crypto code to use this method, and I > don't know how reasonable it is to put something like this in the spec. OK. I can do it so that there are no branches in the Java. The Java code for signed multiplyHigh has some data-dependent branches in an attempt to speed it up, though. I don't know how effective they are, and I could have a look at taking them out. > Side note: at the moment, I am using signed arithmetic in prototypes for > Poly1305, X25519, and EdDSA, partially due to lack of support for > unsigned operations like this one. 
> I don't think having unsignedMultiplyHigh would, on its own, convince > me to use an unsigned representation, but the forces are different for > each algorithm/implementation. Sure. I don't think it really matters from a performance point of view which you use, given intrinsics for both. -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
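The branch-free variant discussed above can be sketched as a small standalone method. This is an illustrative sketch, not the JDK's code; later JDKs added Math.unsignedMultiplyHigh as a standard library method.

```java
class UnsignedMulHigh {
    // Branch-free unsigned high-word multiply built on the signed
    // Math.multiplyHigh.  (a >> 63) is all-ones exactly when a's top
    // bit is set, so each masked add applies the unsigned correction
    // term without a data-dependent branch.
    static long unsignedMultiplyHigh(long a, long b) {
        long result = Math.multiplyHigh(a, b);
        result += (a >> 63) & b;
        result += (b >> 63) & a;
        return result;
    }
}
```

The identity is that the unsigned operand equals the signed one plus 2^64 when the top bit is set, so the unsigned high word is the signed high word plus b (if a's top bit is set) plus a (if b's top bit is set), modulo 2^64.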
Do we need an unsigned multiplyHigh?
We now have a multiplyHigh intrinsic, but it is signed. Unsigned multiplyHigh is in general a more useful primitive for crypto than signed, and I wonder if we need an intrinsic for that as well. I've looked at cooking up an unsigned multiplyHigh in Java, and I think the fastest way is this:

    private static final long unsignedMultiplyHigh(long a, long b) {
        long result = Math.multiplyHigh(a, b);
        if (a < 0) result += b;
        if (b < 0) result += a;
        // Can also be written as:
        //   result += (a >> 63) & b;
        //   result += (b >> 63) & a;
        return result;
    }

It's still about 50% slower than the signed multiplyHigh, though. Thoughts? -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: [Semi-]off-line encryption
On 29/09/16 17:49, Michael StJohns wrote: > On 9/29/2016 11:28 AM, Andrew Haley wrote: >> GCM allows most of the work in an encryption to be done offline (and >> ahead of time) by other processors, reducing latency and increasing >> throughput. It'd be lovely if we could do this in Java, but I can't >> really see a way to fit this in to the platform security framework. >> We don't want to do this eagerly, because we don't know that more data >> will be encrypted and we don't want to speculate. >> >> However, if we had a hint that (say) a large stream would need to >> encrypt a megabyte of data at some time in the future we could >> precompute a megabyte of keystream. Has anyone considered this? > > Um. No. You can make this work with CTR, but you can't with GCM. > With CTR, you just encrypt a stream of zeroes to get an encryption > stream and then XOR the encryption stream later with your actual plain > text. GCM (and CCM) tend to compute the integrity tag in parallel with > calculating the encryption stream. You'd have to still process all of > the plain text (or cipher text) to get the integrity tag. The keystream doesn't depend on the plaintext or the ciphertext. Of course the auth tag can't be calculated before we have the ciphertext but the keystream can be, and the keystream is the expensive part of the calculation. The auth tag doesn't require anything more than a Galois field multiplication, and that's really fast with current hardware. We could arrange it so that there is always some keystream precalculated, and then the encryption latency would be no more than a couple of XORs and a GF multiplication. Maybe there's no point if AES-NI can generate the keystream at 20ish cycles per block and the GF multiplication can be done in parallel, but not every target has AES-NI or some equivalent. And I'm not sure that every target which does have appropriate instructions can issue the AES and the GF multiplication instructions at the same time. 
Having said all of that, it may be that the overheads of inter-processor communication would make this inefficient. Andrew.
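The keystream argument above can be demonstrated with the standard JCE API for the CTR core: encrypting zeros yields the raw keystream, and the later "encryption" is only an XOR. This is a hypothetical sketch of the idea, not anything in the JDK's GCM implementation; the class and method names are illustrative.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class PrecomputedCtr {
    // In CTR mode, encrypting a block of zeros yields the raw
    // keystream: it depends only on the key and the counter, not on
    // the plaintext, so it can be produced ahead of time on another
    // processor.
    static byte[] precomputeKeystream(SecretKeySpec key, IvParameterSpec iv, int len)
            throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, iv);
        return c.doFinal(new byte[len]);   // E(counter_i) XOR 0 == keystream
    }

    // Later, encryption is nothing but an XOR against the
    // precomputed keystream.
    static byte[] encrypt(byte[] plaintext, byte[] keystream) {
        byte[] out = new byte[plaintext.length];
        for (int i = 0; i < plaintext.length; i++)
            out[i] = (byte) (plaintext[i] ^ keystream[i]);
        return out;
    }
}
```

For GCM one would still have to run GHASH over the ciphertext to produce the auth tag, which is exactly the point made above: only the keystream half of the work can be moved off-line.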
[Semi-]off-line encryption
GCM allows most of the work in an encryption to be done offline (and ahead of time) by other processors, reducing latency and increasing throughput. It'd be lovely if we could do this in Java, but I can't really see a way to fit this in to the platform security framework. We don't want to do this eagerly, because we don't know that more data will be encrypted and we don't want to speculate. However, if we had a hint that (say) a large stream would need to encrypt a megabyte of data at some time in the future we could precompute a megabyte of keystream. Has anyone considered this? Andrew.
Re: Issues with ALPN implementation in JDK 9
On 15/06/16 09:11, Simone Bordet wrote: > Sure, but then every ALPN implementer out there will have their own, > they will need to be kept up to date when a new TLS protocol or a new > cipher is introduced in the JDK, etc. etc. OK, I get that. The problem I have with reading posts about JDK9 problems is that I can't tell the *severity* of the problems. I don't know if something is a total blocker or an inconvenience. When someone uses an internal sun.* class they may be doing something which must be done in order to get access which would be impossible otherwise. Andrew.
Re: Issues with ALPN implementation in JDK 9
On 14/06/16 14:57, Simone Bordet wrote: > * Lack of facilities to convert TLS protocol bytes to protocol strings. > This class already exists in the JDK, sun.security.ssl.ProtocolVersion; it > would just need to be exposed in javax.net.ssl. > > * Lack of facilities to convert TLS cipher bytes to cipher name strings. > As above, sun.security.ssl.CipherSuite exists, needs to be exposed publicly. > > Note that for the 2 bullets above, a recent message from Mark Reinhold > to jdk9-dev confirmed that JDK 9 is *not yet* feature complete, so I > hope they can be considered for inclusion. It's very late, and they certainly won't be considered unless someone proposes it. But can't we simply clone this class anyway? Incidentally, I can't tell which (if any) of the issues you describe are specific to JDK 9 (i.e. they would not be problems in JDK 8.) Andrew.
Re: GCM performance and Unsafe byte array accesses
On 09/01/2015 04:42 PM, Anthony Scarpino wrote: > Does your alignment changes affect x86 only or should this help all > architectures? It does the best thing it can on every architecture. On those which support unaligned accesses, it uses them; if not, it doesn't. But there is a very cool optimization on machines without unaligned memory accesses: if the profile data says that accesses at a particular call site are always aligned, C2 generates optimistic code to do the aligned fetch, plus a very simple check. > In general I don't see a disadvantage and that it could be expanded > to other places in crypto too. Yes, lots of other places. This one is my poster child because the effect is so dramatic. > But I have think about the effects on sparc, so that would need to > be tested. Right now the sparc intrinsic does alignment checking > and realigning, so it would be interesting to see if ByteArrays > performed better than the intrinsic alignment. I assume you don't > have the hardware to test sparc, right? I don't. My guess is, though, that it'll not lose on SPARC, and will probably win. If it does lose that's something to look at. Andrew.
RFR: 8134869: AARCH64: GHASH intrinsic is not optimal
I've been looking at the intrinsic we have for GHASH. While it is decent as it goes, its performance is considerably worse than some other implementations of GHASH on the same processor. Thanks are due to Alexander Alexeev who did a fine job implementing the x86 algorithm on AArch64, but the result is not optimal. On AArch64 we have the advantage of a bit-reversal instruction which x86 parts don't have, and this makes it possible to write a fully little-endian implementation of GHASH which is far more idiomatic on AArch64 than the big-endian implementation the x86 version uses. This gets us an overall performance improvement of AES/GCM of 10-20%. I've also taken the opportunity to add a lot of comments. The algorithms used are (fairly) obscure and most open source software implementations don't really explain what they're doing. In particular, the bizarre representation of polynomials in GF(2) (where byte ordering is little endian but bit ordering is big endian) is very confusing and surely deserves a comment or two. http://cr.openjdk.java.net/~aph/8134869-ghash-1/ One other remark: the AES/GCM implementation has a lot of overhead. Some profile data (on x86) looks like this:

    samples  cum. samples  %        cum. %   image name  symbol name
    479605   479605        36.8408  36.8408  31156.jo    aescrypt_encryptBlock
    301014   780619        23.1224  59.9632  31156.jo    ghash_processBlocks
    196563   977182        15.0990  75.0621  31156.jo    int com.sun.crypto.provider.GCTR.doFinal(byte[], int, int, byte[], int)
    50061    1027243        3.8454  78.9076  31156.jo    void TestAESEncode.run()
    48159    1075402        3.6993  82.6069  31156.jo    void TestAESDecode.run()
    18506    1093908        1.4215  84.0284  libjvm.so   TypeArrayKlass::allocate_common(int, bool, Thread*)

GCTR.doFinal() doesn't need to do anything except increment a counter and call aescrypt_encryptBlock, but it still takes 15% of the total runtime. Intrinsifying GCTR.update() would solve this problem. Andrew.
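As a reference point for the bit-ordering discussion: the multiplication GHASH performs in GF(2^128) can be written out bit by bit. The following is a deliberately slow, illustrative transcription of the textbook algorithm from NIST SP 800-38D, not the intrinsic and not constant-time. Note that "bit 0" of a block is the most significant bit of byte 0, which is exactly the oddity described above.

```java
class GhashRef {
    // z = x * y in GF(2^128) with the GCM reduction polynomial
    // x^128 + x^7 + x^2 + x + 1.  Bit i of a block is bit (7 - i%8)
    // of byte i/8, per the GCM spec's reversed bit ordering.
    static byte[] multiply(byte[] x, byte[] y) {
        byte[] z = new byte[16];
        byte[] v = x.clone();
        for (int i = 0; i < 128; i++) {
            if ((y[i / 8] & (0x80 >>> (i % 8))) != 0) {
                for (int j = 0; j < 16; j++) z[j] ^= v[j]; // z ^= v
            }
            boolean carry = (v[15] & 1) != 0;        // is bit 127 set?
            for (int j = 15; j > 0; j--) {           // v >>= 1 (toward bit 127)
                v[j] = (byte) (((v[j] & 0xff) >>> 1) | (v[j - 1] << 7));
            }
            v[0] = (byte) ((v[0] & 0xff) >>> 1);
            if (carry) v[0] ^= (byte) 0xe1;          // reduce modulo the GCM polynomial
        }
        return z;
    }
}
```

In this representation the field's multiplicative identity is the block 0x80 00 ... 00, i.e. only "bit 0" set, which is what makes the byte-little-endian, bit-big-endian convention so confusing.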
GCM performance and Unsafe byte array accesses
I've been looking at the performance of AES/GCM. The profile is quite surprising:

    samples  cum. samples  %        cum. %   symbol name
    476009   476009        36.7033  36.7033  aescrypt_encryptBlock
    297239   773248        22.9190  59.6224  ghash_processBlocks
    195334   968582        15.0615  74.6839  int com.sun.crypto.provider.GCTR.doFinal(byte[], int, int, byte[], int)

I would have expected aescrypt_encryptBlock and ghash_processBlocks to be very high, but that GCTR.doFinal is so high is rather surprising: all it has to do is increment a counter, call aescrypt_encryptBlock, and xor the result with the plaintext. The problem seems to be due to byte accesses in GCTR.doFinal() and GaloisCounterMode.update(). Earlier this year I wrote a bunch of new Unsafe.get/put-XX-Unaligned methods, and these are ideal for bulk accesses to byte arrays. So, as an experiment I wrote some methods to do array accesses and used them to speed up GCM, with this result:

    492274   492274        40.8856  40.8856  13256.jo  aescrypt_encryptBlock
    298185   790459        24.7656  65.6512  13256.jo  ghash_processBlocks
    86325    876784         7.1697  72.8209  13256.jo  int com.sun.crypto.provider.GCTR.update(byte[], int, int, byte[], int)

GCTR.update() is twice as fast as it was, and overall the performance of AES/GCM is 10% better.
The changes to the GCM code are quite minor:

diff -r 6940407d544a src/java.base/share/classes/com/sun/crypto/provider/GCTR.java
--- a/src/java.base/share/classes/com/sun/crypto/provider/GCTR.java	Thu Aug 20 07:36:37 2015 -0700
+++ b/src/java.base/share/classes/com/sun/crypto/provider/GCTR.java	Thu Aug 27 16:17:25 2015 +0100
@@ -94,11 +97,12 @@
         int numOfCompleteBlocks = inLen / AES_BLOCK_SIZE;
         for (int i = 0; i < numOfCompleteBlocks; i++) {
             aes.encryptBlock(counter, 0, encryptedCntr, 0);
-            for (int n = 0; n < AES_BLOCK_SIZE; n++) {
-                int index = (i * AES_BLOCK_SIZE + n);
-                out[outOfs + index] =
-                    (byte) ((in[inOfs + index] ^ encryptedCntr[n]));
-            }
+            int index = i * AES_BLOCK_SIZE;
+            ByteArrays.putLongs(out, outOfs + index,
+                (ByteArrays.getLong(in, inOfs + index + 0) ^
+                 ByteArrays.getLong(encryptedCntr, 0)),
+                (ByteArrays.getLong(in, inOfs + index + 8) ^
+                 ByteArrays.getLong(encryptedCntr, 8)));
             GaloisCounterMode.increment32(counter);
         }
         return inLen;

diff -r 6940407d544a src/java.base/share/classes/com/sun/crypto/provider/GaloisCounterMode.java
--- a/src/java.base/share/classes/com/sun/crypto/provider/GaloisCounterMode.java	Thu Aug 20 07:36:37 2015 -0700
+++ b/src/java.base/share/classes/com/sun/crypto/provider/GaloisCounterMode.java	Thu Aug 27 16:17:25 2015 +0100
@@ -82,11 +82,8 @@
             // should never happen
             throw new ProviderException("Illegal counter block length");
         }
-        // start from last byte and only go over 4 bytes, i.e. total 32 bits
-        int n = value.length - 1;
-        while ((n >= value.length - 4) && (++value[n] == 0)) {
-            n--;
-        }
+        int counter = ByteArrays.getInt(value, value.length - 4, true);
+        ByteArrays.putInt(value, value.length - 4, counter + 1, true);
     }

     // ivLen in bits

I've attached the full diff. So, here's my question: there are many places over the crypto code base where we could take advantage of this. Do you think it makes sense to make changes like this? I can't see any major disadvantages, and it's a considerable performance improvement. Andrew.
diff -r 6940407d544a src/java.base/share/classes/com/sun/crypto/provider/ByteArrays.java
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src/java.base/share/classes/com/sun/crypto/provider/ByteArrays.java	Thu Aug 27 16:47:27 2015 +0100
@@ -0,0 +1,110 @@
+/*
+ * Copyright (c) 2015, Oracle and/or its affiliates. All rights reserved.
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This code is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 only, as
+ * published by the Free Software Foundation.  Oracle designates this
+ * particular file as subject to the "Classpath" exception as provided
+ * by Oracle in the LICENSE file that accompanied this code.
+ *
+ * This code is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ * version 2 for more details (a copy is included in the LICENSE file that
+ * accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License version
+ * 2 along with this
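On later JDKs (9+) the same bulk accesses are available without Unsafe through VarHandle byte-array views. Below is a sketch of what the operations of the ByteArrays helper described in this thread look like in that style; the class name and method shapes are illustrative, not the attached code.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

class ByteArrayViews {
    // Byte-array view VarHandles read and write wider types at
    // arbitrary (possibly unaligned) byte offsets; HotSpot compiles
    // them to single memory accesses where the hardware allows it.
    static final VarHandle LONGS =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);
    static final VarHandle INTS_BE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);

    static long getLong(byte[] a, int off)         { return (long) LONGS.get(a, off); }
    static void putLong(byte[] a, int off, long v) { LONGS.set(a, off, v); }

    // The GCM counter bump from the patch above, expressed as one
    // big-endian 32-bit read-modify-write of the last four bytes.
    static void increment32(byte[] counter) {
        int off = counter.length - 4;
        INTS_BE.set(counter, off, (int) INTS_BE.get(counter, off) + 1);
    }
}
```

The big-endian int view makes the counter increment a single add with free carry propagation across the low four bytes, replacing the original per-byte loop.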
Re: RFR: 8130150: RSA Acceleration
Sorry for the mistake. New webrevs: http://cr.openjdk.java.net/~aph/8130150-hs-1/ http://cr.openjdk.java.net/~aph/8130150-jdk-1/ Andrew.
Re: RFR: 8130150: RSA Acceleration
On 06/30/2015 06:49 PM, Andrew Haley wrote: > New webrevs: > > http://cr.openjdk.java.net/~aph/8130150-jdk > http://cr.openjdk.java.net/~aph/8130150-hs/ I made an error when preparing these webrevs. Please ignore. Andrew.
RFR: 8130150: RSA Acceleration
On 06/29/2015 10:32 AM, Andrew Haley wrote: > On 29/06/15 09:37, Vladimir Kozlov wrote: >> Hi, Andrew >> >> Did you file RFE for this change? 8046943 is JEP. > > No; I will do so. Done. >> typo? "less" -> "more". >> >> + * number of ints in the number is less than this value we do not >> + * use the intrinsic. >> + */ >> +private static final int MONTGOMERY_INTRINSIC_THRESHOLD = 512; >> >> trailing spaces: >> src/java.base/share/classes/java/math/BigInteger.java:273: Trailing >> whitespace >> src/java.base/share/classes/java/math/BigInteger.java:2770: Trailing >> whitespace >> >> I ran changes through JPRT and linux/solaris passed - thanks. >> Next step - Windows: In the end I had to punt on Windows. It's ifdef'd out for someone who works on Windows to fix. >> I am fine with JDK changes. >> >> Would be nice to have a test for this change. Do existing tests >> cover this code? > > They do. They don't. The RSA tests don't run for long enough to test the intrinsics. I wrote a new test for Montgomery multiplication. >> I agree that we should limit size when to invoke multiplyToLen >> intrinsic too. File bug I will assign it. > > OK. JDK-8130154 New webrevs: http://cr.openjdk.java.net/~aph/8130150-jdk http://cr.openjdk.java.net/~aph/8130150-hs/ Andrew.
Re: RFR: 8046943: RSA Acceleration
On 06/29/2015 02:19 PM, David M. Lloyd wrote: > Out of curiosity, instead of e.g.: > >unsigned long *scratch = (unsigned long *)alloca(total_allocation); > > Could you not just use e.g.: > >unsigned long scratch[longwords * 4]; > > and avoid alloca altogether? No. Variable-length arrays are now in Standard C but not C++. https://groups.google.com/forum/#!topic/comp.std.c++/K_4lgA1JYeg I imagine that C++ will eventually support them, but not today. Andrew.
Re: [9] RFR(M): 8076112: Add @HotSpotIntrinsicCandidate annotation to indicate methods for which Java Runtime has intrinsics
On 06/29/2015 01:38 PM, Doug Simon wrote: > It seems just plain wrong for an intrinsic to not implement the same > semantics as the intrinsified method. I would expect an intrinsic to > perform all necessary runtime checks and only have the compiler omit > them if it can prove they are unnecessary. If all intrinsics obeyed > this contract, then there’s no need for the > @HotSpotIntrinsicCandidate annotation from a semantics > perspective. And in terms of the keeping HotSpot in sync with the > JDK, the responsibility should fall entirely on HotSpot to check > that its intrinsics correspond to existing methods. Don't you think you're being rather idealistic? The other side of the argument is that it's much easier to write and maintain the arg-checking code if it's written in Java, and it also has the advantage that it benefits from profile data to guide the JIT. Andrew.
Re: [9] RFR(M): 8076112: Add @HotSpotIntrinsicCandidate annotation to indicate methods for which Java Runtime has intrinsics
Hi, On 29/06/15 10:41, Zoltán Majó wrote: > > On 06/27/2015 10:05 AM, Andrew Haley wrote: >> On 25/06/15 12:49, Zoltán Majó wrote: >>> Problem: There is need to indicate Java methods that are potentially >>> intrinsified by JVM. >> It's a great idea but is it a good name? HotSpot is not the only Java >> VM. Do we expect people to come along and add >> J9IntrinsicCandidate, CACAOIntrinsicCandidate, and so on? > > thank you for bringing up this issue. > > The name HotSpotIntrinsicCandidate resulted from a private discussion > with Joe Darcy, Brian Goetz, and John Rose. The reason this name was > picked is to make clear that a marked method's interaction with the VM > (specifically with the HotSpot VM implementation) needs special attention. OK, cool. So has any thought been given to the other VMs? Do you expect that, say, J9 will use the HotSpotIntrinsicCandidate annotation, or do you expect we will have similar annotations for each VM which uses OpenJDK libraries? Or is the need for this annotation totally HotSpot-specific? Andrew.
Re: RFR: 8046943: RSA Acceleration
On 29/06/15 09:37, Vladimir Kozlov wrote: > Hi, Andrew > > Did you file RFE for this change? 8046943 is JEP. No; I will do so. > typo? "less" -> "more". > > + * number of ints in the number is less than this value we do not > + * use the intrinsic. > + */ > +private static final int MONTGOMERY_INTRINSIC_THRESHOLD = 512; > > trailing spaces: > src/java.base/share/classes/java/math/BigInteger.java:273: Trailing whitespace > src/java.base/share/classes/java/math/BigInteger.java:2770: Trailing > whitespace > > I ran changes through JPRT and linux/solaris passed - thanks. > Next step - Windows: > > C:\jprt\T\P1\s\hotspot\src\cpu\x86\vm\sharedRuntime_x86_64.cpp(26) : fatal > error C1083: Cannot open include file: > 'alloca.h': No such file or directory Hmm, okay. This is going to be fun. :-) AFAIK OpenJDK builds with Visual Studio. The VS equivalent of alloca() is called _alloca() and its header file is <malloc.h>. I'm going to try to do this untested. I think that autoconf will #include malloc.h on Windows automagically, so all that I have to do is create a #define for alloca() on Windows. > I am fine with JDK changes. > > Would be nice to have a test for this change. Do existing tests > cover this code? They do. jdk/test/com/oracle/security/ucrypto/TestRSA looks like a pretty thorough test. If you make any mistake in the arithmetic RSA decryption simply will not work: the result is corrupt. The risks, then, are mistakes such as accidental side-effects. I don't know any way to test for that. The other possible thing I could test for is unusual key sizes. I'll have a look. > I agree that we should limit size when to invoke multiplyToLen > intrinsic too. File bug I will assign it. OK. Andrew.
Re: [9] RFR(M): 8076112: Add @HotSpotIntrinsicCandidate annotation to indicate methods for which Java Runtime has intrinsics
On 25/06/15 12:49, Zoltán Majó wrote: > Problem: There is need to indicate Java methods that are potentially > intrinsified by JVM. It's a great idea but is it a good name? HotSpot is not the only Java VM. Do we expect people to come along and add J9IntrinsicCandidate, CACAOIntrinsicCandidate, and so on? Andrew.
Re: RFR: 8046943: RSA Acceleration
On 06/19/2015 09:34 AM, Andrew Haley wrote: > On 18/06/15 20:28, Vladimir Kozlov wrote: > >> Yes, it is a lot of handwriting but we need it to work on all OSs. > > Sure, I get that. I knew there would be a few goes around with this, > but it's worth the pain for the performance improvement. I made some changes, as requested. Everything is now private static final. The libcall now only calls the runtime code: all allocation is done in Java code. I tested on Solaris using Solaris Studio 12.3 tools, and it's fine. There's one thing I'm not sure about. I no longer allocate scratch memory on the heap. That was only needed for extremely large integers, larger than anyone needs for crypto. Now, if the size of an integer exceeds 16384 bits I do not use the intrinsic, and this allows it to use stack-allocated memory for its scratch space. The main thing I was worried about is the time spent in Montgomery multiplication. The runtime of the algorithm is O(N^2); if you don't limit the size, the time is unbounded, with no safepoint delay. This would mean that anyone who passed an absurdly large integer to BigInteger.modPow() would see the virtual machine apparently lock up and garbage collection would not run. I note that the multiplyToLen() intrinsic has the same problem. http://cr.openjdk.java.net/~aph/8046943-hs-3/ http://cr.openjdk.java.net/~aph/8046943-jdk-3/ Andrew.
Building SA
Now that tools.jar is gone from build images, what is the right way to build the seviceability agent? Thanks, Andrew.
Re: RFR: 8046943: RSA Acceleration
On 18/06/15 20:28, Vladimir Kozlov wrote: > We have few new rules regarding intrinsics. > You need to add private static java method which does range checks > because their are not executed in intrinsic code - see squareToLen() > implementation, for example. Okay. > Note, we will rewrite multiplyToLen() too soon. Right. > Also method which will be intrinsified should be private static too and > you can move allocation from it (like we did for squareToLen()) to avoid > allocation in intrinsic code. I like that. It's surely a lot less complicated than what I've got right now. > Your Hotspot changes are hard to accept. We have to compile for solaris > with Sun compilers which does not work with such changes: > > "hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3525: Warning: > parameter in inline asm statement unused: %3. > "hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3525: Warning: > parameter in inline asm statement unused: %6. > "hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3701: Error: The > function "__builtin_expect" must have a prototype. > "hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp", line 3707: Error: The > function "__builtin_alloca" must have a prototype. Sure. I didn't realize that you were compiling that code with non-GCC. __builtin_alloca() can just be replaced by alloca() on non-GCC. > May be you can convert the code to stub and add new MacroAssembler > functions which you can use in sharedRuntime_x86_64.cpp. I don't think so. It's fast because it is truly inline assembler, inserted into the C code. If I had some way to access an x86 Solaris machine I'd test it there. But those warnings about unused parameters are just warnings. It may be that the code would work on a Sun compiler. Can you please try to replace __builtin_alloca() with alloca() and then tell me if the code works? > Yes, it is a lot of handwriting but we need it to work on all OSs. Sure, I get that.
I knew there would be a few goes around with this, but it's worth the pain for the performance improvement. Andrew.
Re: RFR: 8046943: RSA Acceleration
On 06/18/2015 06:00 PM, Anthony Scarpino wrote: > Question, on the hotspot side you said in a previous post this was > C2-only. Was there a reason you don't have it for all? None. It's up for negotiation. What do you want? C1, interp? Andrew.
Re: RFR: 8046943: RSA Acceleration
On 06/15/2015 05:58 PM, Andrew Haley wrote: >>> 3. I fused squaring and multiplication into a single >>> >> montgomeryMultiply method. ... >> > >> > I don't agree with fusing them together. I think there should be two >> > separate intrinsics. For one, SPARC has montsqr and montmul >> > instructions. Additionally if someone wants to call montgomerySquare, >> > they should be able to call it directly with its needed number of >> > arguments and not pass 'a' twice to satisfy an internal if(). > OK, fair enough. I'll think a little more about the best way to do > this. Done thusly. The only thing I had any doubt about was whether to use a single flag for squaring and multiplication. This patch uses separate flags. http://cr.openjdk.java.net/~aph/8046943-hs-2/ http://cr.openjdk.java.net/~aph/8046943-jdk-2/ Andrew.
Re: RFR: 8046943: RSA Acceleration
On 06/15/2015 05:38 PM, Anthony Scarpino wrote: > On 06/12/2015 04:06 AM, Andrew Haley wrote: >> http://cr.openjdk.java.net/~aph/8046943-hs-1/ >> http://cr.openjdk.java.net/~aph/8046943-jdk-1/ > > Please don't use the JEP 246 in the comments when you push. There are a > number of changesets related to 246 and I'd rather not have one > associated with it. We can link a separate bug id to the JEP. Right. >> 3. I fused squaring and multiplication into a single >> montgomeryMultiply method. This has two advantages. Firstly, we only >> need a single intrinsic, and secondly the decision whether to use >> multiply or squaring can be made in the runtime library. If the >> target does not support the montgomeryMultiply intrinsic there is no >> slowdown when using C2 because it removes the if (a == b) test in >> >> if (a == b) { >> product = squareToLen(a, len, product); >> } else { >> product = multiplyToLen(a, len, b, len, product); >> } > > I don't agree with fusing them together. I think there should be two > separate intrinsics. For one, SPARC has montsqr and montmul > instructions. Additionally if someone wants to call montgomerySquare, > they should be able to call it directly with its needed number of > arguments and not pass 'a' twice to satisfy an internal if(). OK, fair enough. I'll think a little more about the best way to do this. Out of curiosity I just had a look at the SPARC instruction specifications, and happily (it certainly surprised me!) they are almost exactly the same as what I've done, so using those instructions should be a trivial change after this patch. The SPARC instruction seems to be limited to 32 words (2048 bits) but I guess you'd just use the software for larger sizes. Andrew.
Re: RFR: 8046943: RSA Acceleration
On 06/12/2015 12:06 PM, Andrew Haley wrote: Sorry, I forgot the column labels. They are:

                       sign       verify      sign/s   verify/s
    > rsa  512 bits  0.000134s  0.000009s    7444.2  111853.3
    > rsa 1024 bits  0.000674s  0.000029s    1483.9   34456.9
    > rsa 2048 bits  0.003945s  0.000100s     253.5    9994.7
    > rsa 4096 bits  0.027015s  0.000373s      37.0    2681.6
    >
    > After:
    >
    > rsa  512 bits  0.000059s  0.000004s   17022.3  224141.1
    > rsa 1024 bits  0.000258s  0.000013s    3871.5   78851.0
    > rsa 2048 bits  0.001506s  0.000042s     664.1   23844.3
    > rsa 4096 bits  0.010214s  0.000153s      97.9    6516.0

Andrew.
RFR: 8046943: RSA Acceleration
http://cr.openjdk.java.net/~aph/8046943-hs-1/ http://cr.openjdk.java.net/~aph/8046943-jdk-1/

Before:

    rsa  512 bits  0.000134s  0.000009s    7444.2  111853.3
    rsa 1024 bits  0.000674s  0.000029s    1483.9   34456.9
    rsa 2048 bits  0.003945s  0.000100s     253.5    9994.7
    rsa 4096 bits  0.027015s  0.000373s      37.0    2681.6

After:

    rsa  512 bits  0.000059s  0.000004s   17022.3  224141.1
    rsa 1024 bits  0.000258s  0.000013s    3871.5   78851.0
    rsa 2048 bits  0.001506s  0.000042s     664.1   23844.3
    rsa 4096 bits  0.010214s  0.000153s      97.9    6516.0

There are some issues we need to discuss. The runtime code is in sharedRuntime_x86_64.cpp even though it's mostly written in C. My thinking here is that porters will want to tweak the code for their back ends, or maybe write it in assembler. It's little-endian and would need some reworking for big-endian machines. But maybe there should be a common version of the code in share/ ? Should it be in optoRuntime instead? It could be called from C1 or even the interpreter, but it's C2-only right now. I've done nothing about 32-bit targets, but I think they would benefit. I had to make some small changes to the Java library.

1. Montgomery multiplication works on whole words. Given that this is a 64-bit implementation I had to force the size of the arrays in oddModPow to be a multiple of 64 bits, i.e. the length of the arrays must be even. Given that RSA and Diffie-Hellman keys are usually a multiple of 64 bits in length I don't think this will be a real performance issue in practice.

2. I needed a 64-bit inverse rather than a 32-bit inverse. This is a single computation, done once for each modular exponentiation, so it makes an immeasurably small difference to the total runtime.

3. I fused squaring and multiplication into a single montgomeryMultiply method. This has two advantages. Firstly, we only need a single intrinsic, and secondly the decision whether to use multiply or squaring can be made in the runtime library.
If the target does not support the montgomeryMultiply intrinsic there is no slowdown when using C2 because it removes the if (a == b) test in if (a == b) { product = squareToLen(a, len, product); } else { product = multiplyToLen(a, len, b, len, product); } Andrew.
RSA intrinsics [Was: RFR(L): 8069539: RSA acceleration]
I'm sorry this is a rather long email, and I pray for your patience. I'm getting close to something I can put forward for review. The performance is encouraging. [ Some background: The kernel of RSA and Diffie-Hellman key exchange is Montgomery multiplication. Optimizing RSA basically comes down to optimizing Montgomery multiplication. The core of OpenJDK's RSA is BigInteger.oddModPow(). ] My Montgomery multiply routine (mostly portable C, with a small assembly-language insert) executes a 1024-bit multiply/reduce in about 2000ns; the hand-coded assembly language equivalent in OpenSSL is faster (as you'd expect) but not very much faster: about 1700ns. In other words, compiled C is only about 20% slower. Firstly, this is pretty remarkable performance by GCC (Yay! Go GCC!) and it shows I'm on the right track. It also shows that there isn't a huge amount to be gained by hand-coding Montgomery multiplication, but anybody who fancies their hand can try to improve on GCC. This is also very nice because porting it to non-x86 hardware is fairly straightforward -- certainly far easier than writing a large assembly- language routine. Here are some numbers for comparison. Today's hs-comp:

                     sign       verify      sign/s   verify/s
    rsa  512 bits  0.000133s  0.000009s    7508.5  112819.1
    rsa 1024 bits  0.000667s  0.000028s    1498.6   35497.2
    rsa 2048 bits  0.003867s  0.000097s     258.6   10342.7
    rsa 4096 bits  0.026383s  0.000357s      37.9    2799.8

After my patch:

                     sign       verify      sign/s   verify/s
    rsa  512 bits  0.000071s  0.000005s   14127.3  204112.4
    rsa 1024 bits  0.000292s  0.000013s    3424.5   74204.1
    rsa 2048 bits  0.001628s  0.000045s     614.4   22399.7
    rsa 4096 bits  0.010966s  0.000163s      91.2    6117.8

So, it's about twice as fast as what we have at present. [ Note that this comparison includes the latest "8081778: Use Intel x64 CPU instructions for RSA acceleration" patch.
] However, even after my patch OpenJDK is still somewhat slower than OpenSSL, which is:

                     sign       verify      sign/s   verify/s
    rsa  512 bits  0.000048s  0.000004s   20687.1  257399.4
    rsa 1024 bits  0.000189s  0.000011s    5288.3   91711.5
    rsa 2048 bits  0.001174s  0.000037s     851.7   27367.2
    rsa 4096 bits  0.008475s  0.000137s     118.0    7305.4

[ I am assuming that OpenSSL represents something like the "speed of light" for RSA on x86: this is carefully hand-coded assembly language and C, hand tuned. Getting anywhere near OpenSSL is a major win. ] Here's my problem: Some of this slowdown is due to the overhead of using the JCE, but not very much. Quite a lot of it, however, is due to the fact that the scratch memory used in oddModPow() is a big-endian array of jints. I have to convert the big-endian jints into native jlongs to do the multiply on little-endian x86-64. If I do the word reversal during the multiply (i.e. keep all data in memory in little-endian form, swap words when reading and writing to memory) the performance is horrible: a 1024-bit multiply takes 3000ns, 50% longer. This perhaps isn't very surprising: if you do the word-reversal before the multiplication you have O(N) swaps, if you do it during the multiplication you have O(N^2). I have found that the best thing to do is to word reverse all the data in memory into temporary little-endian arrays and do the work on them. It's much faster, but still really is very annoying: for 1024-bit RSA the word reversal is 14% of the total runtime. It would be nice to keep all of the data in an array of jlongs for the duration of oddModPow(). Here's one idea: I could write a version of oddModPow() which is guaranteed never to use the Java version of the Montgomery multiply. This would use a JNI method which calls the native Montgomery multiply routine, guaranteeing that that we always use that native routine, even from the interpreter and C1. Then I could keep all the internal state in native word order, and all this word-swapping would just go away.
This would have the additional benefit that it would be faster when using the interpreter and C1.

So, we'd have two versions of oddModPow() in BigInteger, and we'd switch between them depending on whether the platform had support for a native Montgomery multiplication. The downside of having two versions of oddModPow() would, of course, be some code duplication.

Or am I just making too much fuss about this? Maybe I should be happy with what I've got.

Thank you for reading this far,
Andrew.
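[The up-front word-reversal pass discussed above might look something like the following. This is my own minimal sketch, not code from the patch; `reverse_words` and its exact layout are assumptions for illustration.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not the actual HotSpot code: convert a
 * big-endian array of 32-bit words (most significant word first, as
 * in BigInteger's magnitude) into a little-endian array of native
 * 64-bit words. len is the number of 32-bit words, assumed even here
 * for simplicity. This is the O(N) up-front pass; swapping inside the
 * multiply loops instead would cost O(N^2) swaps. */
static void reverse_words(const uint32_t *be, uint64_t *le, int len) {
  for (int i = 0; i < len / 2; i++) {
    uint32_t lo = be[len - 1 - 2 * i];  /* less significant half */
    uint32_t hi = be[len - 2 - 2 * i];  /* more significant half */
    le[i] = ((uint64_t)hi << 32) | lo;
  }
}
```

After the multiply, the same transformation in reverse writes the little-endian jlong result back into the big-endian jint scratch array.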
Re: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration]
On 05/28/2015 08:51 AM, Andrew Haley wrote:
> On 27/05/15 21:35, Anthony Scarpino wrote:
>
>> How much of the montgomery multiply and sqaure are you planning to
>> intrinsify? Are you doing the whole thing or just portions of the
>> operations similar to multiplyToLen and squareToLen?
>
> I'm doing the whole montgomery multiply and square operation in a
> routine which interleaves multiplication and reduction so that the
> intermediate product is never written into memory.

For a bit more information, here's the algorithm in C:

Andrew.

// Multiply (unsigned) Long A by Long B, accumulating the double-
// length result into the accumulator formed of T0, T1, and T2.
#define MACC(A, B, T0, T1, T2)                                    \
do {                                                              \
  unsigned long hi, lo;                                           \
  __asm__ ("mul %5; add %%rax, %2; adc %%rdx, %3; adc $0, %4"     \
           : "=&d"(hi), "=a"(lo), "+r"(T0), "+r"(T1), "+g"(T2)    \
           : "r"(A), "a"(B));                                     \
} while(0)

// Fast Montgomery multiplication.  The derivation of the algorithm is
// in A Cryptographic Library for the Motorola DSP56000,
// Dusse and Kaliski, Proc. EUROCRYPT 90, pp. 230-237.
static void __attribute__((noinline))
montgomery_multiply(unsigned long a[], unsigned long b[],
                    unsigned long n[], unsigned long m[],
                    unsigned long inv, int len) {
  unsigned long t0 = 0, t1 = 0, t2 = 0;  // Triple-precision accumulator
  int i;

  assert(inv * n[0] == -1UL);

  for (i = 0; i < len; i++) {
    int j;
    for (j = 0; j < i; j++) {
      MACC(a[j], b[i-j], t0, t1, t2);
      MACC(m[j], n[i-j], t0, t1, t2);
    }
    MACC(a[i], b[0], t0, t1, t2);
    m[i] = t0 * inv;
    MACC(m[i], n[0], t0, t1, t2);
    assert(t0 == 0);
    t0 = t1; t1 = t2; t2 = 0;
  }

  for (i = len; i < 2*len; i++) {
    int j;
    for (j = i-len+1; j < len; j++) {
      MACC(a[j], b[i-j], t0, t1, t2);
      MACC(m[j], n[i-j], t0, t1, t2);
    }
    m[i-len] = t0;
    t0 = t1; t1 = t2; t2 = 0;
  }
  m[len] = t0;

  // cmp and sub are multi-word compare and subtract helpers (not
  // shown here); this final step reduces m into the range [0, n).
  while (cmp(m, n, len+1) > 0)
    sub(m, n, len+1);
}
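[For readers without x86 inline assembly to hand, the MACC step above can be sketched portably. This is my own illustration, not part of the posted code: it uses GCC/Clang's unsigned __int128 to form the 128-bit product and propagate the carries into the triple-precision accumulator, mirroring the mul/add/adc/adc sequence.]

```c
#include <assert.h>
#include <stdint.h>

/* Portable sketch (an assumption, not the posted code) of the MACC
 * accumulation step: add the 128-bit product a*b into the
 * triple-precision accumulator (t0, t1, t2). Requires a compiler with
 * unsigned __int128, e.g. GCC or Clang on a 64-bit target. */
static inline void macc(uint64_t a, uint64_t b,
                        uint64_t *t0, uint64_t *t1, uint64_t *t2) {
  unsigned __int128 p = (unsigned __int128)a * b;          /* mul  */
  unsigned __int128 s = (unsigned __int128)*t0 + (uint64_t)p;  /* add rax */
  *t0 = (uint64_t)s;
  s = (unsigned __int128)*t1 + (uint64_t)(p >> 64)
                             + (uint64_t)(s >> 64);        /* adc rdx */
  *t1 = (uint64_t)s;
  *t2 += (uint64_t)(s >> 64);                              /* adc $0  */
}
```

Whether a compiler turns this into code as tight as the asm insert is another question; the asm version keeps the accumulator pinned in registers across both inner-loop MACCs.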
Re: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration]
On 27/05/15 21:35, Anthony Scarpino wrote:
> How much of the montgomery multiply and sqaure are you planning to
> intrinsify? Are you doing the whole thing or just portions of the
> operations similar to multiplyToLen and squareToLen?

I'm doing the whole montgomery multiply and square operation in a routine which interleaves multiplication and reduction so that the intermediate product is never written into memory.

Andrew.
RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration]
[I'm sorry, I didn't send this to the correct list. I forgot that there was a separate security list.]

An update: I'm still working on this. Following last week's revelations [1] it seems to me that a faster implementation of (integer) D-H is even more important. I've spent a couple of days tracking down an extremely odd feature (bug?) in MutableBigInteger which was breaking everything, but I'm past that now.

I'm trying to produce an intrinsic implementation of the core modular exponentiation which is as fast as any state-of-the-art implementation while disrupting the common code as little as possible; this is not easy. I hope to have something which is faster on all processors, not just those for which we have hand-coded assembly-language implementations.

I don't think that my work should be any impediment to Sadya's patch for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/ being committed. It'll still be useful.

Andrew.

[1] Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
    https://weakdh.org/imperfect-forward-secrecy.pdf