Please have a quick look at 8274730: AArch64: AES/GCM acceleration is broken by the fix for JDK-8273297

2021-10-05 Thread Andrew Haley
I had to touch common code in C2 to unbreak AArch64, so I'd like
another reviewer, please. It's pretty simple, thanks.

https://github.com/openjdk/jdk/pull/5819

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: RFR: 8273297: AES/GCM non-AVX512+VAES CPUs suffer after 8267125 [v5]

2021-09-24 Thread Andrew Haley
On Wed, 22 Sep 2021 22:48:32 GMT, Smita Kamath  wrote:

>> Performance dropped up to 10% for 1k data after 8267125 for CPUs that do not 
>> support the new intrinsic. Tests run were crypto.full.AESGCMBench and 
>> crypto.full.AESGCMByteBuffer from the jmh micro benchmarks.
>> 
>> The problem is each instance of GHASH allocates 96 extra longs for the 
>> AVX512+VAES intrinsic regardless if the intrinsic is used. This extra table 
>> space should be allocated differently so that non-supporting CPUs do not 
>> suffer this penalty. This issue also affects non-Intel CPUs too.
>
> Smita Kamath has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Added htbl_entries constant to other architectures

Marked as reviewed by aph (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/5402


Re: Intrinsic methods and time to safepoint

2021-09-17 Thread Andrew Haley
On 9/17/21 10:13 AM, Roland Westrelin wrote:
> 
>> OK. I guess the problem is that the call to the stub doesn't have an oop
>> map.
> 
> Assuming you can get the oop map in stub, deoptimization could still be
> delayed until the stub is exited. Is that considered a problem?
> 
> Actually I'm not even sure the loop synthesized from IR nodes is
> feasible: what would the JVM state at the safepoint be given that
> there's no actual java code for it? What would you deoptimize to if you
> were to deoptimize?

Ah. Perhaps you have a good point.  :-)

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Intrinsic methods and time to safepoint

2021-09-17 Thread Andrew Haley
On 9/16/21 1:28 PM, Roland Westrelin wrote:
> 
>> I believe we should have a policy to cover how long an intrinsic can
>> delay without responding to a safepoint, and that it should be in the
>> millisecond range. It would make almost no difference to the
>> performance of encryption if chunks handles by a fast intrinsic were,
>> say, about a megabyte. The difference in performance is so small as to
>> be immeasurable, and the improvement in the performance of other threads
>> is vast.
> 
> I agree with you (seems like a no brainer) but I have a couple comments
> about implementation details.
> 
> Those intrinsics usually call some stub. It's not possible AFAICT, to
> have the safepoint in the stub itself. So we need some loop that
> repeatedly calls the stub. That loop can either be added 1) by the JIT
> as IR when the intrinsic is expanded 2) in java code, that is java
> library code needs to be refactored.
> 
> 2) would seem much easier to implement and would work for both c1 and c2
> (if some of these intrinsics end up implemented by c1).

OK. I guess the problem is that the call to the stub doesn't have an oop
map.

The tricky cases seem to be in the crypto code, which is already rather
fiddly. It's usually simple enough for intrinsics to return the amount of
work left to do, I guess.

> Also a note of caution about loop strip mining: it doesn't have a model
> for what the loop body costs. So it blindly assumes all loop bodies can
> be run for 1000 iterations (by default) between safepoints. Unless I'm
> missing something, with a stub running for 1ms, delays between safepoint
> could still be 1s.

It'd be interesting to do the experiment. I might try that, just for grins.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: RFR: 8273297: AES/GCM non-AVX512+VAES CPUs suffer after 8267125

2021-09-14 Thread Andrew Haley
On Tue, 7 Sep 2021 22:31:30 GMT, Smita Kamath  wrote:

> Performance dropped up to 10% for 1k data after 8267125 for CPUs that do not 
> support the new intrinsic. Tests run were crypto.full.AESGCMBench and 
> crypto.full.AESGCMByteBuffer from the jmh micro benchmarks.
> 
> The problem is each instance of GHASH allocates 96 extra longs for the 
> AVX512+VAES intrinsic regardless if the intrinsic is used. This extra table 
> space should be allocated differently so that non-supporting CPUs do not 
> suffer this penalty. This issue also affects non-Intel CPUs too.

It seems to me there's a serious problem here. When you execute the 
galoisCounterMode_AESCrypt() intrinsic, I don't think there's a limit on the 
number of blocks to be encrypted. With the older intrinsic things are not so 
very bad because the incoming data is split into 6 segments. But if we use this 
intrinsic, there is no safepoint check in the inner loop, which can lead to a 
long time to safepoint, and this causes stalls on the other threads.
If you split the incoming data into blocks of about a megabyte you'd lose no 
measurable performance but you'd dramatically improve the performance of 
everything else, especially with a concurrent GC.

-

PR: https://git.openjdk.java.net/jdk/pull/5402


Re: RFR: 8273297: AES/GCM non-AVX512+VAES CPUs suffer after 8267125

2021-09-13 Thread Andrew Haley
On Mon, 13 Sep 2021 12:48:14 GMT, Andrew Haley  wrote:

>> Performance dropped up to 10% for 1k data after 8267125 for CPUs that do not 
>> support the new intrinsic. Tests run were crypto.full.AESGCMBench and 
>> crypto.full.AESGCMByteBuffer from the jmh micro benchmarks.
>> 
>> The problem is each instance of GHASH allocates 96 extra longs for the 
>> AVX512+VAES intrinsic regardless if the intrinsic is used. This extra table 
>> space should be allocated differently so that non-supporting CPUs do not 
>> suffer this penalty. This issue also affects non-Intel CPUs too.
>
> src/hotspot/share/opto/library_call.cpp line 6796:
> 
>> 6794: 
>> 6795:   Node* avx512_subkeyHtbl = new_array(klass_node, intcon(96), 0);
>> 6796:   if (avx512_subkeyHtbl == NULL) return false;
> 
> This looks very Intel-specific, but it's in generic code. Please make this 
> constant 96 a symbol and push it into a header file in the x86 back end.

Likewise, the name prefix "avx512_" isn't appropriate for code that will 
certainly be used by other targets.

-

PR: https://git.openjdk.java.net/jdk/pull/5402


Re: RFR: 8273297: AES/GCM non-AVX512+VAES CPUs suffer after 8267125

2021-09-13 Thread Andrew Haley
On Tue, 7 Sep 2021 22:31:30 GMT, Smita Kamath  wrote:

> Performance dropped up to 10% for 1k data after 8267125 for CPUs that do not 
> support the new intrinsic. Tests run were crypto.full.AESGCMBench and 
> crypto.full.AESGCMByteBuffer from the jmh micro benchmarks.
> 
> The problem is each instance of GHASH allocates 96 extra longs for the 
> AVX512+VAES intrinsic regardless if the intrinsic is used. This extra table 
> space should be allocated differently so that non-supporting CPUs do not 
> suffer this penalty. This issue also affects non-Intel CPUs too.

src/hotspot/share/opto/library_call.cpp line 6796:

> 6794: 
> 6795:   Node* avx512_subkeyHtbl = new_array(klass_node, intcon(96), 0);
> 6796:   if (avx512_subkeyHtbl == NULL) return false;

This looks very Intel-specific, but it's in generic code. Please make this 
constant 96 a symbol and push it into a header file in the x86 back end.

-

PR: https://git.openjdk.java.net/jdk/pull/5402


Re: RFR: 8267125: AES Galois CounterMode (GCM) interleaved implementation using AVX512 + VAES instructions [v8]

2021-08-11 Thread Andrew Haley
On Mon, 9 Aug 2021 15:49:07 GMT, Smita Kamath  wrote:

>> I would like to submit AES-GCM optimization for x86_64 architectures 
>> supporting AVX3+VAES (Evex encoded AES). This optimization interleaves AES 
>> and GHASH operations.
>> Performance gain of ~1.5x - 2x for message sizes 8k and above.
>
> Smita Kamath has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   rewiew update

src/hotspot/cpu/x86/macroAssembler_x86_aes.cpp line 1682:

> 1680: vpshufb(AAD_HASHx, AAD_HASHx, xmm24, Assembler::AVX_128bit);
> 1681: 
> 1682: // Compute #rounds for AES based on the length of the key array

Forget that, it's done everywhere. Deleted.

-

PR: https://git.openjdk.java.net/jdk/pull/4019


Re: RFR: 8267125: AES Galois CounterMode (GCM) interleaved implementation using AVX512 + VAES instructions [v8]

2021-08-11 Thread Andrew Haley
On Mon, 9 Aug 2021 15:49:07 GMT, Smita Kamath  wrote:

>> I would like to submit AES-GCM optimization for x86_64 architectures 
>> supporting AVX3+VAES (Evex encoded AES). This optimization interleaves AES 
>> and GHASH operations.
>> Performance gain of ~1.5x - 2x for message sizes 8k and above.
>
> Smita Kamath has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   rewiew update

src/hotspot/cpu/x86/macroAssembler_x86_aes.cpp line 1682:

> 1680: vpshufb(AAD_HASHx, AAD_HASHx, xmm24, Assembler::AVX_128bit);
> 1681: 
> 1682: // Compute #rounds for AES based on the length of the key array

This is a bit of a hack. Wouldn't it make more sense to pass in the array oop, 
then derive both the length and the address of the base of the key array from 
the oop, rather than using a negative offset from the base address?

-

PR: https://git.openjdk.java.net/jdk/pull/4019


Re: RFR: 8267125: AES Galois CounterMode (GCM) interleaved implementation using AVX512 + VAES instructions [v8]

2021-08-10 Thread Andrew Haley
On Mon, 9 Aug 2021 15:49:07 GMT, Smita Kamath  wrote:

>> I would like to submit AES-GCM optimization for x86_64 architectures 
>> supporting AVX3+VAES (Evex encoded AES). This optimization interleaves AES 
>> and GHASH operations.
>> Performance gain of ~1.5x - 2x for message sizes 8k and above.
>
> Smita Kamath has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   rewiew update

src/java.base/share/classes/com/sun/crypto/provider/GaloisCounterMode.java line 
694:

> 692: 
> 693: /**
> 694:  *  ByteBuffer wrapper for intrinsic implGCMCrypt.  It will 
> operation

Suggestion:

 *  ByteBuffer wrapper for intrinsic implGCMCrypt.  It will operate

-

PR: https://git.openjdk.java.net/jdk/pull/4019


Re: JEP 411, removal of finalizers, a path forward.

2021-08-02 Thread Andrew Haley
On 8/2/21 8:49 AM, Peter Firmstone wrote:
> If I fix that bug, will JEP 411 be cancelled?

No. The problem wasn't that we couldn't fix the [Speculative Execution
Vulnerabilities], more that any fix would be so invasive and pervasive
that it would severely hamper the whole platform.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: JEP 411 - Secure Java Distribution

2021-06-02 Thread Andrew Haley
On 6/1/21 10:06 AM, Peter Firmstone wrote:
> If a vendor were to continue supporting SecurityManager and was 
> backporting from OpenJDK, if it passes the JCK with SecurityManager 
> disabled, that's still acceptable right?

Look at the licence agreement in conjunction with the JCK users' guide.
See the definition of “Compatible Licensee Implementation”.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]

2021-03-23 Thread Andrew Haley
On Tue, 23 Mar 2021 16:20:47 GMT, Ludovic Henry  wrote:

>> Anton Kozlov has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 115 commits:
>> 
>>  - Merge branch 'master' into jdk-macos
>>  - JDK-8262491: bsd_aarch64 part
>>  - JDK-8263002: bsd_aarch64 part
>>  - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos
>>  - Wider #ifdef block
>>  - Fix most of issues in java/foreign/ tests
>>
>>Failures related to va_args are tracked in JDK-8263512.
>>  - Add Azul copyright
>>  - Update Oracle copyright years
>>  - Use Thread::current_or_null_safe in SafeFetch
>>  - 8262903: [macos_aarch64] Thread::current() called on detached thread
>>  - ... and 105 more: 
>> https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269
>
> Marked as reviewed by luhenry (Author).

> > > > [ Back-porting this patch to JDK 11] depends on the will of openjdk11 
> > > > maintainers to accept this (and few other, like jep-388, as we depend 
> > > > on it) contribution.
> > > 
> > > 
> > > To the extent that 11u has fixed policies :) we definitely have a policy 
> > > of accepting patches to keep 11u working on current hardware. So yes.
> > 
> > 
> > @lewurm That sounds like a green flag for you and jep-388 (with its 
> > R18_RESERVED functionality) ;)
> 
> Thanks, @theRealAph, and @VladimirKempik . We are on it!

It's going to be tricky to do in a really clean way, given some of the 
weirdnesses of the ABI. However, I think there's probably a need for it

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]

2021-03-23 Thread Andrew Haley
On Mon, 22 Mar 2021 12:50:14 GMT, Anton Kozlov  wrote:

>> Please review the implementation of JEP 391: macOS/AArch64 Port.
>> 
>> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and 
>> windows/aarch64. 
>> 
>> Major changes are in:
>> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks 
>> JDK-8253817, JDK-8253818)
>> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with 
>> necessary adjustments (JDK-8253819)
>> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), 
>> required on macOS/AArch64 platform. It's implemented with 
>> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a 
>> thread, so W^X mode change relates to the java thread state change (for java 
>> threads). In most cases, JVM executes in write-only mode, except when 
>> calling a generated stub like SafeFetch, which requires a temporary switch 
>> to execute-only mode. The same execute-only mode is enabled when a java 
>> thread executes in java or native states. This approach of managing W^X mode 
>> turned out to be simple and efficient enough.
>> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941)
>
> Anton Kozlov has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 115 commits:
> 
>  - Merge branch 'master' into jdk-macos
>  - JDK-8262491: bsd_aarch64 part
>  - JDK-8263002: bsd_aarch64 part
>  - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos
>  - Wider #ifdef block
>  - Fix most of issues in java/foreign/ tests
>
>Failures related to va_args are tracked in JDK-8263512.
>  - Add Azul copyright
>  - Update Oracle copyright years
>  - Use Thread::current_or_null_safe in SafeFetch
>  - 8262903: [macos_aarch64] Thread::current() called on detached thread
>  - ... and 105 more: 
> https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269

Marked as reviewed by aph (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v29]

2021-03-23 Thread Andrew Haley
On Tue, 23 Mar 2021 13:54:24 GMT, Andrew Haley  wrote:

>> Anton Kozlov has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 115 commits:
>> 
>>  - Merge branch 'master' into jdk-macos
>>  - JDK-8262491: bsd_aarch64 part
>>  - JDK-8263002: bsd_aarch64 part
>>  - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos
>>  - Wider #ifdef block
>>  - Fix most of issues in java/foreign/ tests
>>
>>Failures related to va_args are tracked in JDK-8263512.
>>  - Add Azul copyright
>>  - Update Oracle copyright years
>>  - Use Thread::current_or_null_safe in SafeFetch
>>  - 8262903: [macos_aarch64] Thread::current() called on detached thread
>>  - ... and 105 more: 
>> https://git.openjdk.java.net/jdk/compare/a9d2267f...5add9269
>
> Marked as reviewed by aph (Reviewer).

> [ Back-porting this patch to JDK 11] depends on the will of openjdk11 
> maintainers to accept this (and few other, like jep-388, as we depend on it) 
> contribution.

To the extent that 11u has fixed policies :)  we definitely have a policy of 
accepting patches to keep 11u working on current hardware. So yes.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v24]

2021-03-23 Thread Andrew Haley
On Tue, 23 Mar 2021 09:53:54 GMT, Andrew Haley  wrote:

>>> @theRealAph, could you elaborate on what is need to be done for [#2200 
>>> (review)](https://github.com/openjdk/jdk/pull/2200#pullrequestreview-600597066).
>> 
>> I think that what you've got now is fine.
>
>> _Mailing list message from [Andrew Haley](mailto:a...@redhat.com) on 
>> [build-dev](mailto:build-...@openjdk.java.net):_
>> 
>> On 3/15/21 6:56 PM, Anton Kozlov wrote:
>> 
>> > On Wed, 10 Mar 2021 11:21:44 GMT, Andrew Haley  wrote:
>> > > > We always check for `R18_RESERVED` with `#if(n)def`, is there any 
>> > > > reason to define the value for the macro?
>> > > 
>> > > 
>> > > Robustness, clarity, maintainability, convention. Why not?
>> > 
>> > 
>> > I've tried to implement the suggestion, but it pulled more unnecessary 
>> > changes. It makes the intended way to check the condition less clear 
>> > (`#ifdef` and not `#if`).
>> 
>> No, no, no! I am not suggesting you change anything else, just that
>> you do not define contentless macros. You might as well define it
>> to be something, and true is a reasonable default, that's all. It's
>> not terribly important, it's just good practice.
> 
> I'm quite prepared to drop this if it's holding up the port. It's a style 
> thing, but it's not critical.

So, where are we up to now? Are we done yet?

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v24]

2021-03-23 Thread Andrew Haley
On Fri, 12 Mar 2021 16:32:10 GMT, Andrew Haley  wrote:

>> Anton Kozlov has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 105 commits:
>> 
>>  - Merge commit 'refs/pull/11/head' of https://github.com/AntonKozlov/jdk 
>> into jdk-macos
>>  - workaround JDK-8262895 by disabling subtest
>>  - Fix typo
>>  - Rename threadWXSetters.hpp -> threadWXSetters.inline.hpp
>>  - JDK-8259937: bsd_aarch64 part
>>  - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos
>>  - Fix after JDK-8259539, partially revert preconditions
>>  - JDK-8260471: bsd_aarch64 part
>>  - JDK-8259539: bsd_aarch64 part
>>  - JDK-8257828: bsd_aarch64 part
>>  - ... and 95 more: 
>> https://git.openjdk.java.net/jdk/compare/a6e34b3d...a72f6834
>
>> @theRealAph, could you elaborate on what is need to be done for [#2200 
>> (review)](https://github.com/openjdk/jdk/pull/2200#pullrequestreview-600597066).
> 
> I think that what you've got now is fine.

> _Mailing list message from [Andrew Haley](mailto:a...@redhat.com) on 
> [build-dev](mailto:build-...@openjdk.java.net):_
> 
> On 3/15/21 6:56 PM, Anton Kozlov wrote:
> 
> > On Wed, 10 Mar 2021 11:21:44 GMT, Andrew Haley  wrote:
> > > > We always check for `R18_RESERVED` with `#if(n)def`, is there any 
> > > > reason to define the value for the macro?
> > > 
> > > 
> > > Robustness, clarity, maintainability, convention. Why not?
> > 
> > 
> > I've tried to implement the suggestion, but it pulled more unnecessary 
> > changes. It makes the intended way to check the condition less clear 
> > (`#ifdef` and not `#if`).
> 
> No, no, no! I am not suggesting you change anything else, just that
> you do not define contentless macros. You might as well define it
> to be something, and true is a reasonable default, that's all. It's
> not terribly important, it's just good practice.

I'm quite prepared to drop this if it's holding up the port. It's a style 
thing, but it's not critical.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v21]

2021-03-12 Thread Andrew Haley
On Tue, 9 Mar 2021 18:01:11 GMT, Anton Kozlov  wrote:

>> src/hotspot/cpu/aarch64/globalDefinitions_aarch64.hpp line 62:
>> 
>>> 60: 
>>> 61: #if defined(__APPLE__) || defined(_WIN64)
>>> 62: #define R18_RESERVED
>> 
>> #define R18_RESERVED true```
>
> We always check for `R18_RESERVED` with `#if(n)def`, is there any reason to 
> define the value for the macro?

Robustness, clarity, maintainability, convention. Why not?

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v24]

2021-03-12 Thread Andrew Haley
On Tue, 9 Mar 2021 16:12:36 GMT, Anton Kozlov  wrote:

>> Please review the implementation of JEP 391: macOS/AArch64 Port.
>> 
>> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and 
>> windows/aarch64. 
>> 
>> Major changes are in:
>> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks 
>> JDK-8253817, JDK-8253818)
>> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with 
>> necessary adjustments (JDK-8253819)
>> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), 
>> required on macOS/AArch64 platform. It's implemented with 
>> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a 
>> thread, so W^X mode change relates to the java thread state change (for java 
>> threads). In most cases, JVM executes in write-only mode, except when 
>> calling a generated stub like SafeFetch, which requires a temporary switch 
>> to execute-only mode. The same execute-only mode is enabled when a java 
>> thread executes in java or native states. This approach of managing W^X mode 
>> turned out to be simple and efficient enough.
>> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941)
>
> Anton Kozlov has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 105 commits:
> 
>  - Merge commit 'refs/pull/11/head' of https://github.com/AntonKozlov/jdk 
> into jdk-macos
>  - workaround JDK-8262895 by disabling subtest
>  - Fix typo
>  - Rename threadWXSetters.hpp -> threadWXSetters.inline.hpp
>  - JDK-8259937: bsd_aarch64 part
>  - Merge remote-tracking branch 'upstream/jdk/master' into jdk-macos
>  - Fix after JDK-8259539, partially revert preconditions
>  - JDK-8260471: bsd_aarch64 part
>  - JDK-8259539: bsd_aarch64 part
>  - JDK-8257828: bsd_aarch64 part
>  - ... and 95 more: 
> https://git.openjdk.java.net/jdk/compare/a6e34b3d...a72f6834

> @theRealAph, could you elaborate on what is need to be done for [#2200 
> (review)](https://github.com/openjdk/jdk/pull/2200#pullrequestreview-600597066).

I think that what you've got now is fine.

-

Marked as reviewed by aph (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v10]

2021-03-01 Thread Andrew Haley
On Fri, 12 Feb 2021 11:42:59 GMT, Vladimir Kempik  wrote:

>> src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp line 194:
>> 
>>> 192: // may get turned off by -fomit-frame-pointer.
>>> 193: frame os::get_sender_for_C_frame(frame* fr) {
>>> 194:   return frame(fr->link(), fr->link(), fr->sender_pc());
>> 
>> Why is it
>> 
>> return frame(fr->link(), fr->link(), fr->sender_pc());
>> 
>> and not 
>> 
>> return frame(fr->sender_sp(), fr->link(), fr->sender_pc());
>> 
>> like in the bsd-x86 counter part?
>
> bsd_aarcb64 was based on linux_aarch64, with addition of bsd-specific things 
> from bsd_x86
> You think  the bsd-x86 way is better here ?

There's no point copying x86. We don't have any way to know what the sender's 
SP was in a C frame without using unwind info. I think this is just used when 
trying to print the stack in a crash dump.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v9]

2021-03-01 Thread Andrew Haley
On Fri, 5 Feb 2021 17:20:55 GMT, Anton Kozlov  wrote:

>> src/hotspot/cpu/aarch64/vm_version_aarch64.hpp line 93:
>> 
>>> 91: CPU_MARVELL   = 'V',
>>> 92: CPU_INTEL = 'i',
>>> 93: CPU_APPLE = 'a',
>> 
>> The `ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture 
>> profile` has 8538 pages, can we be more specific and point to the particular 
>> section of the document where the CPU codes are defined?
>
> They are defined in 13.2.95. MIDR_EL1, Main ID Register. Apple's code is not 
> there, but "Arm can assign codes that are not published in this manual. All 
> values not assigned by Arm are reserved and must not be used.". I assume the 
> value was obtained by digging around 
> https://github.com/apple/darwin-xnu/blob/main/osfmk/arm/cpuid.h#L62

Anton, this paragraph looks like an excellent comment.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v10]

2021-03-01 Thread Andrew Haley
On Thu, 4 Feb 2021 23:01:52 GMT, Gerard Ziemski  wrote:

>> Anton Kozlov has updated the pull request incrementally with six additional 
>> commits since the last revision:
>> 
>>  - Merge remote-tracking branch 'origin/jdk/jdk-macos' into jdk-macos
>>  - Add comments to WX transitions
>>
>>+ minor change of placements
>>  - Use macro conditionals instead of empty functions
>>  - Add W^X to tests
>>  - Do not require known W^X state
>>  - Revert w^x in gtests
>
> src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp line 652:
> 
>> 650: 
>> 651: void os::setup_fpu() {
>> 652: }
> 
> Is there really nothing to do here, or does still need to be implemented? A 
> clarification comment here would help/.

There is really nothing to do here.

> src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp line 198:
> 
>> 196: 
>> 197: NOINLINE frame os::current_frame() {
>> 198:   intptr_t *fp = *(intptr_t **)__builtin_frame_address(0);
> 
> In the bsd_x86 counter part we initialize `fp` to `_get_previous_fp()` - do 
> we need to implement it on aarch64 as well (and using address 0 is just a 
> temp workaround) or is it doing the right thing here?

(0)``` looks right to me.

> src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp line 291:
> 
>> 289: bool is_unsafe_arraycopy = (thread->doing_unsafe_access() && 
>> UnsafeCopyMemory::contains_pc(pc));
>> 290: if ((nm != NULL && nm->has_unsafe_access()) || 
>> is_unsafe_arraycopy) {
>> 291:   address next_pc = pc + NativeCall::instruction_size;
> 
> Replace
> 
> address next_pc = pc + NativeCall::instruction_size;
> 
> with
> 
> address next_pc = Assembler::locate_next_instruction(pc);
> 
> there is at least one other place that needs it.

Why is this change needed? AFAICS ```locate_next_instruction()``` is an x86 
thing for variable-length instructions, and no other port uses it.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v21]

2021-03-01 Thread Andrew Haley
On Fri, 26 Feb 2021 19:17:12 GMT, Anton Kozlov  wrote:

>> Please review the implementation of JEP 391: macOS/AArch64 Port.
>> 
>> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and 
>> windows/aarch64. 
>> 
>> Major changes are in:
>> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks 
>> JDK-8253817, JDK-8253818)
>> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with 
>> necessary adjustments (JDK-8253819)
>> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), 
>> required on macOS/AArch64 platform. It's implemented with 
>> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a 
>> thread, so W^X mode change relates to the java thread state change (for java 
>> threads). In most cases, JVM executes in write-only mode, except when 
>> calling a generated stub like SafeFetch, which requires a temporary switch 
>> to execute-only mode. The same execute-only mode is enabled when a java 
>> thread executes in java or native states. This approach of managing W^X mode 
>> turned out to be simple and efficient enough.
>> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941)
>
> Anton Kozlov has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/jdk/jdk-macos' into jdk-macos
>  - Minor fixes

Thanks. With this, I think we're done.

-

Changes requested by aph (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v21]

2021-03-01 Thread Andrew Haley
On Fri, 26 Feb 2021 19:17:12 GMT, Anton Kozlov  wrote:

>> Please review the implementation of JEP 391: macOS/AArch64 Port.
>> 
>> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and 
>> windows/aarch64. 
>> 
>> Major changes are in:
>> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks 
>> JDK-8253817, JDK-8253818)
>> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with 
>> necessary adjustments (JDK-8253819)
>> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), 
>> required on macOS/AArch64 platform. It's implemented with 
>> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a 
>> thread, so W^X mode change relates to the java thread state change (for java 
>> threads). In most cases, JVM executes in write-only mode, except when 
>> calling a generated stub like SafeFetch, which requires a temporary switch 
>> to execute-only mode. The same execute-only mode is enabled when a java 
>> thread executes in java or native states. This approach of managing W^X mode 
>> turned out to be simple and efficient enough.
>> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941)
>
> Anton Kozlov has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/jdk/jdk-macos' into jdk-macos
>  - Minor fixes

src/hotspot/cpu/aarch64/globalDefinitions_aarch64.hpp line 62:

> 60: 
> 61: #if defined(__APPLE__) || defined(_WIN64)
> 62: #define R18_RESERVED

#define R18_RESERVED true```

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v18]

2021-02-17 Thread Andrew Haley
On Mon, 15 Feb 2021 17:45:32 GMT, Anton Kozlov  wrote:

>> I'm not sure it can wait. This change turns already-messy code into 
>> something significantly messier, to the extent that it's not really good 
>> enough to go into mainline.
>
> The latest merge with JDK-8261071 should resolve the issue. Please take a 
> look.

Looks much better, thanks.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v8]

2021-02-17 Thread Andrew Haley
On Tue, 16 Feb 2021 06:24:05 GMT, Vladimir Kempik  wrote:

>> This is when passing a float, yes? In the case where we have more float 
>> arguments than n_float_register_parameters_c.
>> I don't understand why you think it's acceptable to bail in this case. Can 
>> you explain, please?
>
> it's for everything that uses less than 8 bytes on a stack( ints ( 4), 
> shorts(2), bytes(1), floats(4)).
> currently native wrapper generation does not support such cases at all, it 
> needs refactoring before this can be implemented.
> So when a method has more argument than can be placed in registers, we may 
> have issues. 
> 
> So we just bailing out to interpreter in case when a smaller (<=4 b) type is 
> going to be passed thru the stack.
> 
> There was attempt to implement handling such cases but currently it requires 
> some hacks (like using some vectors for non-specific task) - 
> https://github.com/openjdk/aarch64-port/pull/3

OK. I checked and the Panama preview doesn't support direct native calls for 
stack arguments, so we're good for now.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v8]

2021-02-15 Thread Andrew Haley
On Mon, 15 Feb 2021 18:00:50 GMT, Vladimir Kempik  wrote:

>> src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 839:
>> 
>>> 837:   // The code unable to handle this, bailout.
>>> 838:   return -1;
>>> 839: #endif
>> 
>> This looks like a bug to me. The caller doesn't necessarily check the return 
>> value. See CallRuntimeNode::calling_convention.
>
> Hello, we have updated PR, now this bailout is used only by the code which 
> can handle it (native wrapper generator), for the rest it will cause 
> guarantee failed if this bailout is triggered

This is when passing a float, yes? In the case where we have more float 
arguments than n_float_register_parameters_c.
I don't understand why you think it's acceptable to bail in this case. Can you 
explain, please?

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v9]

2021-02-04 Thread Andrew Haley
On Thu, 4 Feb 2021 09:49:17 GMT, Vladimir Kempik  wrote:

> > You read my mind, Andrew. Unless, of course, it's optimized to leverage the 
> > fact that it's thread-specific..
> 
> it's thread-specific
> 
> https://developer.apple.com/documentation/apple_silicon/porting_just-in-time_compilers_to_apple_silicon
> 
> > Because pthread_jit_write_protect_np changes only the current thread’s 
> > permissions, avoid accessing the same memory region from multiple threads. 
> > Giving multiple threads access to the same memory region opens up a 
> > potential attack vector, in which one thread has write access and another 
> > has executable access to the same region.

Umm, so how does patching work? We don't even know if other threads are 
executing the code we need to patch.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v9]

2021-02-03 Thread Andrew Haley
On Tue, 2 Feb 2021 18:03:50 GMT, Gerard Ziemski  wrote:

>> Anton Kozlov has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   support macos_aarch64 in hsdis
>
> src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5271:
> 
>> 5269: //
>> 5270: void MacroAssembler::get_thread(Register dst) {
>> 5271:   RegSet saved_regs = RegSet::range(r0, r1) + 
>> BSD_ONLY(RegSet::range(r2, r17)) + lr - dst;
> 
> The comment needs to be updated, since on BSD we also seem to clobber r2,r17 ?

Surely this should be 

 saved_regs = RegSet::range(r0, r1) BSD_ONLY(+ RegSet::range(r2, r17)) + lr - 
dst;```

Shouldn't it?

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v9]

2021-02-03 Thread Andrew Haley
On Tue, 2 Feb 2021 21:49:36 GMT, Daniel D. Daugherty  wrote:

>> Anton Kozlov has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   support macos_aarch64 in hsdis
>
> src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 323:
> 
>> 321:   str(zr, Address(rthread, JavaThread::last_Java_pc_offset()));
>> 322: 
>> 323:   str(zr, Address(rthread, JavaFrameAnchor::saved_fp_address_offset()));
> 
> I don't think this switch from `JavaThread::saved_fp_address_offset()`
> to `JavaFrameAnchor::saved_fp_address_offset()` is correct since
> `rthread` is still used and is a JavaThread*. The new code will give you:
> 
> `rthread` + offset of the `saved_fp_address` field in a JavaFrameAnchor
> 
> The old code gave you:
> 
> `rthread` + offset of the `saved_fp_address` field in the JavaFrameAnchor 
> field in the JavaThread
> 
> Those are not the same things.

I agree, I don't understand why this change was made.

> src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.hpp line 45:
> 
>> 43:   // Atomically copy 64 bits of data
>> 44:   static void atomic_copy64(const volatile void *src, volatile void 
>> *dst) {
>> 45: *(jlong *) dst = *(const jlong *) src;
> 
> Is this construct actually atomic on aarch64?

Yes.

> src/hotspot/os_cpu/windows_aarch64/os_windows_aarch64.hpp line 37:
> 
>> 35: 
>> 36: private:
>> 37: 
> 
> 'private' usually has one space in front of it.
> Also, why the blank line after it?

It reads better with the blank line, and it's not in violation of HotSpot 
conventions.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v8]

2021-02-01 Thread Andrew Haley
On Sun, 31 Jan 2021 20:14:01 GMT, Anton Kozlov  wrote:

>> Please review the implementation of JEP 391: macOS/AArch64 Port.
>> 
>> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and 
>> windows/aarch64. 
>> 
>> Major changes are in:
>> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks 
>> JDK-8253817, JDK-8253818)
>> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with 
>> necessary adjustments (JDK-8253819)
>> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), 
>> required on macOS/AArch64 platform. It's implemented with 
>> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a 
>> thread, so W^X mode change relates to the java thread state change (for java 
>> threads). In most cases, JVM executes in write-only mode, except when 
>> calling a generated stub like SafeFetch, which requires a temporary switch 
>> to execute-only mode. The same execute-only mode is enabled when a java 
>> thread executes in java or native states. This approach of managing W^X mode 
>> turned out to be simple and efficient enough.
>> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941)
>
> Anton Kozlov has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 62 commits:
> 
>  - Merge branch 'master' into jdk-macos
>  - Update copyright year for BsdAARCH64ThreadContext.java
>  - Fix inclusing of StubRoutines header
>  - Redo buildsys fix
>  - Revert harfbuzz changes, disable warnings for it
>  - Little adjustement of SlowSignatureHandler
>  - Partially bring previous commit
>  - Revert "Address feedback for signature generators"
>
>This reverts commit 50b55f6684cd21f8b532fa979b7b6fbb4613266d.
>  - Refactor CDS disabling
>  - Redo builsys support for aarch64-darwin
>  - ... and 52 more: 
> https://git.openjdk.java.net/jdk/compare/8a9004da...b421e0b4

src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 84:

> 82: // on stack. Natural alignment for types are still in place,
> 83: // for example double/long should be 8 bytes alligned
> 84: 

This comment is a bit confusing because it's no longer #ifdef APPLE. Better 
move it up to Line 41.

src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 352:

> 350: 
> 351: #ifdef __APPLE__
> 352:   virtual void pass_byte()

Please remove ```#ifdef __APPLE__``` around this region.

src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 839:

> 837:   // The code unable to handle this, bailout.
> 838:   return -1;
> 839: #endif

This looks like a bug to me. The caller doesn't necessarily check the return 
value. See CallRuntimeNode::calling_convention.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v3]

2021-01-27 Thread Andrew Haley
On 1/26/21 6:42 PM, erik.joels...@oracle.com wrote:
> My understanding is that Apple chose to not provide JNF for aarch64, so
> if you want to build OpenJDK, you first need to build JNF yourself (it's
> available in github). Phil is working on removing this dependency
> completely, which will solve this issue [1].
>
> In the meantime, I don't think we should rely on finding JNF in
> unsupported locations inside Xcode. Could we wait with integrating this
> port until it can be built without such hacks?

That sounds right to me.

In the meantime, there is some cleanup work to be done in mainline
on slow_signature_handler, which will potentially make the AArch64
back end merge much simpler.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v2]

2021-01-25 Thread Andrew Haley
On Sun, 24 Jan 2021 15:50:01 GMT, Anton Kozlov  wrote:

>> src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 86:
>> 
>>> 84: 
>>> 85:   switch (_num_int_args) {
>>> 86:   case 0:
>> 
>> I don't think you need such a large switch statement. I think this can be 
>> expressed as
>> if (num_int_args <= 6) {
>> ldr(as_Register(num_int_args + r1.encoding()), src);
>> ... etc.
>
> I like the suggestion. For the defense, new functions were made this way 
> intentionally, to match existing `pass_int`, `pass_long`,.. I take your 
> comment as a blessing to fix all of them. But I feel that refactoring of 
> existing code should go in a separate commit. Especially after `pass_int` 
> used `ldr/str` instead of `ldrw/strw`, which looks wrong. 
> https://github.com/openjdk/jdk/pull/2200/files#diff-1ff58ce70aeea7e9842d34e8d8fd9c94dd91182999d455618b2a171efd8f742cL87-R122

This is new code, and it should not repeat the mistakes of the past. There's no 
need to refactor the similar existing code as part of this patch.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v2]

2021-01-25 Thread Andrew Haley
On Sun, 24 Jan 2021 16:10:44 GMT, Anton Kozlov  wrote:

>> src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 394:
>> 
>>> 392: 
>>> 393: class SlowSignatureHandler
>>> 394:   : public NativeSignatureIterator {
>> 
>> SlowSignatureHandler is turning into a maintenance nightmare. This isn't the 
>> fault of this commit so much as some nasty cut-and-paste programming in the 
>> code you're editing. It might well be better to rewrite this whole thing to 
>> use a table-driven approach, with a table that contains the increment and 
>> the size of each native type: then we'd have only a single routine which 
>> could pass data of any type, and each OS needs only supply a table of 
>> constants.
>
> Would you like me to do something about it now? The problem is that the 
> functions of SlowSignatureHandler are subtly different, so it will be 
> multiple tables, not sure how many. Such change is another candidate for a 
> separate code enhancement, which I would like not to mix into the JEP 
> implementation (it's already not a small change :)). But if you think it 
> would be a good thing, please let me know. In another case, I'll add this to 
> a queue of follow-up enhancements.

I'm not sure it can wait. This change turns already-messy code into something 
significantly messier, to the extent that it's not really good enough to go 
into mainline.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port [v2]

2021-01-25 Thread Andrew Haley
On Sun, 24 Jan 2021 16:29:31 GMT, Vladimir Kempik  wrote:

>> src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5272:
>> 
>>> 5270: void MacroAssembler::get_thread(Register dst) {
>>> 5271:   RegSet saved_regs = RegSet::range(r0, r1) + 
>>> BSD_ONLY(RegSet::range(r2, r17)) + lr - dst;
>>> 5272:   push(saved_regs, sp);
>> 
>> This isn't very nice.
>
> Hello
> Why is it not nice ?
> linux_aarch64 uses some linux specific tls function 
> _ZN10JavaThread25aarch64_get_thread_helperEv from 
> hotspot/os_cpu/linux_aarch64/threadLS_linux_aarch64.s
> which clobbers only r0 and r1
> macos_aarch64 has no such tls code and uses generic C-call to 
> Thread::current();
> Hence we  are saving possibly clobbered regs here.

Surely if you did as Linux does you wouldn't need to clobber all those 
registers.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: 8253795: Implementation of JEP 391: macOS/AArch64 Port

2021-01-23 Thread Andrew Haley
On Fri, 22 Jan 2021 18:49:42 GMT, Anton Kozlov  wrote:

> Please review the implementation of JEP 391: macOS/AArch64 Port.
> 
> It's heavily based on existing ports to linux/aarch64, macos/x86_64, and 
> windows/aarch64. 
> 
> Major changes are in:
> * src/hotspot/cpu/aarch64: support of the new calling convention (subtasks 
> JDK-8253817, JDK-8253818)
> * src/hotspot/os_cpu/bsd_aarch64: copy of os_cpu/linux_aarch64 with necessary 
> adjustments (JDK-8253819)
> * src/hotspot/share, test/hotspot/gtest: support of write-xor-execute (W^X), 
> required on macOS/AArch64 platform. It's implemented with 
> pthread_jit_write_protect_np provided by Apple. The W^X mode is local to a 
> thread, so W^X mode change relates to the java thread state change (for java 
> threads). In most cases, JVM executes in write-only mode, except when calling 
> a generated stub like SafeFetch, which requires a temporary switch to 
> execute-only mode. The same execute-only mode is enabled when a java thread 
> executes in java or native states. This approach of managing W^X mode turned 
> out to be simple and efficient enough.
> * src/jdk.hotspot.agent: serviceability agent implementation (JDK-8254941)

src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 86:

> 84: 
> 85:   switch (_num_int_args) {
> 86:   case 0:

I don't think you need such a large switch statement. I think this can be 
expressed as
if (num_int_args <= 6) {
ldr(as_Register(num_int_args + 1), src);
... etc.

src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 43:

> 41: //describe amount of space in bytes occupied by type on native stack
> 42: #ifdef __APPLE__
> 43: const int nativeByteSpace= sizeof(jbyte);

I'm not sure these names should be in the global name space, and they're not 
very descriptive. Something like NativeStack::byteSpace would be better. Then 
you can use them with a `using namespace NativeStack` declaration.

src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 433:

> 431:   store_and_inc(_to, from_obj, nativeShortSpace);
> 432: 
> 433:   _num_int_args++;

Please lift this increment out of the conditional.

src/hotspot/cpu/aarch64/interpreterRT_aarch64.cpp line 394:

> 392: 
> 393: class SlowSignatureHandler
> 394:   : public NativeSignatureIterator {

SlowSignatureHandler is turning into a maintenance nightmare. This isn't the 
fault of this commit so much as some nasty cut-and=paste programming in the 
code you're editing. It might well be better to rewrite this whole thing to use 
a table-driven approach, with a table that contains the increment and the size 
of each native type: then we'd have only a single routine which could pass data 
of any type, and each OS needs only supply a table of constants.

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5272:

> 5270: void MacroAssembler::get_thread(Register dst) {
> 5271:   RegSet saved_regs = RegSet::range(r0, r1) + 
> BSD_ONLY(RegSet::range(r2, r17)) + lr - dst;
> 5272:   push(saved_regs, sp);

This isn't very nice.

-

PR: https://git.openjdk.java.net/jdk/pull/2200


Re: RFR: JDK-8255603: Memory/Performance regression after JDK-8210985

2020-10-30 Thread Andrew Haley
On Thu, 29 Oct 2020 15:11:09 GMT, Christoph Langer  wrote:

> It seems that there exists a memory/performance regression that was 
> introduced with JDK-8210985: Update the default SSL session cache size to 
> 20480.
> 
> The idea to limit the maixmum SSL session cache size by itself is good. 
> Unfortunately, as per the current implementation of 
> sun.security.util.MemoryCache, it also modifies the initial size of the 
> LinkedHashMap that is backing the cache to initialize with more than the 
> maximum size.
> 
> I suggest to restore the original behavior that initializes with an 
> initialCapacity of 1 in most cases. That was true when before JDK-8210985, 
> the property javax.net.ssl.sessionCacheSize was not set at all.
> In case it was set, the initial size would have been like now, 
> (javax.net.ssl.sessionCacheSize / 0.75f) + 1, which still seems strange.

Marked as reviewed by aph (Reviewer).

src/java.base/share/classes/sun/security/util/Cache.java line 268:

> 266: this.queue = null;
> 267: 
> 268: cacheMap = new LinkedHashMap<>(1, LOAD_FACTOR, true);

This looks right. The idea of scaling the initial cache size to the maximum 
size seems obviously to be wrong.

-

PR: https://git.openjdk.java.net/jdk/pull/937


Re: RFR: 8252204: AArch64: Implement SHA3 accelerator/intrinsic [v5]

2020-10-09 Thread Andrew Haley
On Fri, 9 Oct 2020 12:18:58 GMT, Andrew Haley  wrote:

>> Fei Yang has updated the pull request with a new target base due to a merge 
>> or a rebase. The pull request now contains
>> six commits:
>>  - Merge master
>>  - Add sha3 instructions to cpu/aarch64/aarch64-asmtest.py and regenerate 
>> the test in assembler_aarch64.cpp:asm_check
>>  - Rebase
>>  - Merge master
>>  - Fix trailing whitespace issue
>>  - 8252204: AArch64: Implement SHA3 accelerator/intrinsic
>>Contributed-by: ard.biesheu...@linaro.org, dong...@huawei.com
>
> Marked as reviewed by aph (Reviewer).

I see Linux x64 failed. However, I don't seem to be able to withdraw my patch 
approval.
However, please consider it withdrawn.

-

PR: https://git.openjdk.java.net/jdk/pull/207


Re: RFR: 8252204: AArch64: Implement SHA3 accelerator/intrinsic [v5]

2020-10-09 Thread Andrew Haley
On Fri, 9 Oct 2020 11:53:31 GMT, Fei Yang  wrote:

>> Contributed-by: ard.biesheu...@linaro.org, dong...@huawei.com
>> 
>> This added an intrinsic for SHA3 using aarch64 v8.2 SHA3 Crypto Extensions.
>> Reference implementation for core SHA-3 transform using ARMv8.2 Crypto 
>> Extensions:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/arm64/crypto/sha3-ce-core.S?h=v5.4.52
>> 
>> Trivial adaptation in SHA3. implCompress is needed for the purpose of adding 
>> the intrinsic.
>> For SHA3, we need to pass one extra parameter "digestLength" to the stub for 
>> the calculation of block size.
>> "digestLength" is also used in for the EOR loop before keccak to 
>> differentiate different SHA3 variants.
>> 
>> We added jtreg tests for SHA3 and used QEMU system emulator which supports 
>> SHA3 instructions to test the functionality.
>> Patch passed jtreg tier1-3 tests with QEMU system emulator.
>> Also verified with jtreg tier1-3 tests without SHA3 instructions on 
>> aarch64-linux-gnu and x86_64-linux-gnu, to make
>> sure that there's no regression.
>> We used one existing JMH test for performance test: 
>> test/micro/org/openjdk/bench/java/security/MessageDigests.java
>> We measured the performance benefit with an aarch64 cycle-accurate simulator.
>> Patch delivers 20% - 40% performance improvement depending on specific SHA3 
>> digest length and size of the message.
>> 
>> For now, this feature will not be enabled automatically for aarch64. We can 
>> auto-enable this when it is fully tested on
>> real hardware.  But for the above testing purposes, this is auto-enabled 
>> when the corresponding hardware feature is
>> detected.
>
> Fei Yang has updated the pull request with a new target base due to a merge 
> or a rebase. The pull request now contains
> six commits:
>  - Merge master
>  - Add sha3 instructions to cpu/aarch64/aarch64-asmtest.py and regenerate the 
> test in assembler_aarch64.cpp:asm_check
>  - Rebase
>  - Merge master
>  - Fix trailing whitespace issue
>  - 8252204: AArch64: Implement SHA3 accelerator/intrinsic
>Contributed-by: ard.biesheu...@linaro.org, dong...@huawei.com

Marked as reviewed by aph (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/207


Re: RFR: 8252204: AArch64: Implement SHA3 accelerator/intrinsic

2020-09-29 Thread Andrew Haley
On Thu, 17 Sep 2020 06:03:57 GMT, Ard Biesheuvel 
 wrote:

>> @ardbiesheuvel : Ard, could you please ack this patch? Thanks.
>
> Acked-by: Ard Biesheuvel 

> If this feature is not auto-enabled when the SHA3 hardware feature is there, 
> we will have one failure for the following
> test: 
> test/hotspot/jtreg/compiler/intrinsics/sha/cli/TestUseSHA3IntrinsicsOptionOnSupportedCPU.java
> 15 #-testresult-
> 16
> description=file:/home/yangfei/github/jdk/test/hotspot/jtreg/compiler/intrinsics/sha/cli/TestUseSHA3IntrinsicsOptionOnSupportedCPU.java
> 17 elapsed=31546 0:00:31.546 18 end=Mon Sep 21 10:27:58 CST 2020
> 19 environment=regtest
> 20 execStatus=Failed. Execution failed: `main' threw exception: 
> java.lang.AssertionError: Option 'UseSHA3Intrinsics' is
> expected to have 'true' value Option 'UseSHA3Intrinsics' should be enabled by 
> default
> Any suggestions for this?

I don't understand your question. There should be two acceptable results, 
either "Pass" or "Not supported". What else
is possible?

-

PR: https://git.openjdk.java.net/jdk/pull/207


Re: RFR: 8252204: AArch64: Implement SHA3 accelerator/intrinsic

2020-09-18 Thread Andrew Haley
On 17/09/2020 05:26, Fei Yang wrote:
> For now, this feature will not be enabled automatically for aarch64. We can 
> auto-enable this when it is fully tested on
> real hardware.  But for the above testing purposes, this is auto-enabled when 
> the corresponding hardware feature is
> detected.

Given that there's no real hardware, it's extra-important to add the
new instructions to cpu/aarch64/aarch64-asmtest.py and regenerate the
test in assembler_aarch64.cc:asm_check.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: Recent Java Sandbox escapes

2020-09-09 Thread Andrew Haley
On 19/08/2020 22:46, Alkanor Oumbratok wrote:
>
> I may be wrong on both points, and I would be really grateful if
> someone could explain why these 2 CVE have been rated this high
> whereas at first glance there isn't any really exploitable related
> scenario.

The first rule of the OpenJDK Vulerability Group is: You do not talk
about the OpenJDK Vulerability Group.  :-)

In particular, we do not reveal exploits. Sorry.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: RFR[15] JDK-8243114: Implement montgomery{Multiply, Square}intrinsics on Windows

2020-06-22 Thread Andrew Haley
On 18/06/2020 19:33, Martin Balao wrote:
>  * sharedRuntime_x86_64.cpp
>   * L3685
>* Do we still need 'long long' type for 'i' and 'cnt' local variables?

No, but this is 64-bit-only code. And len is a long, so let's keep it.

>   * L3724
>* The last argument of 'sub' has type 'int', while in the not-Windows
> variant is a long. Can we align this?

We should do that, yes. Better it be long everywhere.

>   * L3729
>* Is it possible to directly store in a[i]? (instead of going through
> 'tmp')

I'm not sure this would add anything. In fact, I think this change would
make it harder to read.

> * I guess the compiler will easily optimize this, but we may still
> get rid of the 2nd line
> * I've seen in L3753 you directly store
>
> Note: it's a bit unfortunate that we don't have x86-64 inline assembly
> in CL to maintain the same logic, as there is nothing OS-specific here.

I think the problem is that MS never had a nice way to do inline
assembly in 32-bit C, so they just gave up altogether when they went
64-bit.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: RFR[15] JDK-8243114: Implement montgomery{Multiply, Square}intrinsics on Windows

2020-06-08 Thread Andrew Haley
On 05/06/2020 21:46, Simon Tooke wrote:
> As per your and Andrew Haley's comments, I have updated the webrev:
> 
> - used NOINLINE
> 
> - used julong
> 
> - deleted the block of unused code.
> 
> Please let me know what you think.
> 
> updated webrev: 
> http://cr.openjdk.java.net/~stooke/webrevs/jdk-8243114-jdk/01/01/

That looks clean, especially since it adds support for Windows with
essentially no impact on the Linux code. I'm not sure I can Review
it, though, because I'm one of the co-authors.

It'd be nice if someone else familiar with the arithmetic (someone on
security-dev?) could take a quick look.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: RFR[15] JDK-8243114: Implement montgomery{Multiply, Square}intrinsics on Windows

2020-06-05 Thread Andrew Haley
On 05/06/2020 05:35, David Holmes wrote:
> If these arrays are extracted from Java arrays

Yes, they are.

> then using a j* type would be appropriate, but then I would expect
> jlong, unless the algorithm explicitly requires unsigned types?

It does. All of the code handling bignums uses unsigned words.

> If so julong would be acceptable.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: RFR[15] JDK-8243114: Implement montgomery{Multiply, Square}intrinsics on Windows

2020-06-04 Thread Andrew Haley
On 04/06/2020 14:35, Simon Tooke wrote:
> Yes, this hurt to type.  A previous review suggested using julong, is 
> that acceptable to you?
> 
> (an aside: I'm now wondering if there is other code in the JDK that 
> assumes long is 64bits - which is not true on all platforms...)

That was just me, I think. I knew that this code would take some
fixing up on Windows so I postponed portability issues until then.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: RFR[15] JDK-8243114: Implement montgomery{Multiply, Square}intrinsics on Windows

2020-05-27 Thread Andrew Haley
On 21/05/2020 18:24, Simon Tooke wrote:
> This change implements the given intrinsics on Windows.

Thank you.

Does julong work on Windows, rather than unsigned long long?  If not,
perhaps a local typedef might help. All those "unsigned log long"s
look a bit clumsy.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



Re: Problems with AES-GCM native acceleration

2018-12-10 Thread Andrew Haley
On 12/10/18 1:47 PM, Gidon Gershinsky wrote:
> I've run my decryption benchmarks on Java with this patch, with excellent
> results.
> The benchmarks reach the top speed right away, no long warm-up anymore.
> Also, there is no need to split the operation into multiple updates, the
> doFinal works just fine.
> 
> Moreover, the maximal decryption throughput is actually higher than in
> Java11 after warm-up.
> On one thread, I get 930MB/sec instead of 850MB/s.
> On 8 threads, 350x8 instead of 230x8.
> 
> This capability will be important in the Spark/Parquet workloads.

The downloadable binary build is at
https://aph.fedorapeople.org/jdk-12-internal+0-adhoc.aph.jdk-jdk.tar.gz

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Problems with AES-GCM native acceleration

2018-11-30 Thread Andrew Haley
On 11/21/18 5:37 PM, Andrew Haley wrote:
> On 11/15/18 10:42 AM, Gidon Gershinsky wrote:
>> Having the decryption optimized in the HotSpot engine would be ideal.
> 
> I agree with you. I did a few experiments, and it can take a very
> long time for C2 compilation to kick in, especially because GCM is
> glacially slow until the intrinsics are used.
> 
> I think this would be a generally useful enhancement to HotSpot,
> and I'm kicking around an experimental patch which adds the
> intrinsics to c1 and the interpreter.

There's a proof-of-concept patch at http://cr.openjdk.java.net/~aph/gctr/
It's all rather hacky but it works.

The patch is rather more complicated than I would have liked. We
could simplify it somewhat by getting rid of the C1 intrinsic, and
instead making C1 call the interpreter implementation.

There also a jmh benchmark in that directory. Test results for 1Mb
look like this:

Interp:

Benchmark  Mode  Cnt Score   Error  Units
AESGCMUpdateAAD2.test  avgt5  1426.275 ± 8.778  us/op

C1:

Benchmark  Mode  Cnt Score   Error  Units
AESGCMUpdateAAD2.test  avgt5  1359.367 ± 8.196  us/op

C2:

Benchmark  Mode  Cnt ScoreError  Units
AESGCMUpdateAAD2.test  avgt5  1333.863 ± 18.385  us/op

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: contribute to the OpenJDK security group

2018-01-24 Thread Andrew Haley
On 24/01/18 10:39, Tomas Gustavsson wrote:
> Imho the P11 layer always needs attention. To work properly we're
> relying on some patches, where parts was recently merged into OpenJDK.
> We just started testing the Amazon CloudHSM, and that requires changes
> to SunPKCS11 as well to work. Not always bad in SunPKCS11 as some P11
> libraries out there do strange non-conforming stuff, but there's room
> for more flexibility nevertheless.

Martin Balao has been heavily reworking this layer because it leaks
native memory.  I'll let him fill you in on the details.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Do we need an unsigned multiplyHigh?

2017-09-27 Thread Andrew Haley
On 27/09/17 11:28, Peter Lawrey wrote:
> If you need multiplyHigh for 128-bit then you need uint256. At some point
> you have to decide whether you need that many bits as a supported
> operation.  When Java was created a 64-bit long as the widest type made
> sense, however CPUs such as x64 now support 128, 256 and 512 bit natively
> and having the JVM dong its best to work this out is not as clean as
> defining it explicitly.

I guess cleanliness is in the eye of the beholder.  IMO multiplyHigh is
as clean as we need, and I'd rather see more complexity there than in
the type system.  It'd be nice to be able to return more than one scalar
value from a method, for sure.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Do we need an unsigned multiplyHigh?

2017-09-26 Thread Andrew Haley
On 26/09/17 15:53, Peter Lawrey wrote:
> None, except you end up jumping through hoops to implement 128 bit
> arithmetic efficiently or cleanly. At some point language support for such
> a basic operation is the simplest and clearest solution.

There's nothing inefficient about this approach.  I don't quite see
how 128-bit types help with cleanliness, because then you'd need a
multiplyHigh for 128-bit types, surely?  You need that for the type
system to be complete.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Do we need an unsigned multiplyHigh?

2017-09-26 Thread Andrew Haley
On 26/09/17 11:20, Peter Lawrey wrote:
> We have arrays already but we don't have primitive types of more than
> 64-bit. If we had uint128 for example we wouldn't need this method.

Perhaps not, but why is needing this method a problem?

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Do we need an unsigned multiplyHigh?

2017-09-26 Thread Andrew Haley
On 26/09/17 08:25, Peter Lawrey wrote:
> I am looking forward to intrinsic support for 128 bit math using ?Long2?
> and XMM (or even YMM, ZMM) instructions.
> This is the best way forward, I hope.
> 
> Personally I would like to see a long long type, or even uint128, uint256,
> uint512 style notation.
> 
> Another option might be something like long<128> or an annotation like
> @uint128 long or even @decimal128 double but who knows.

Do you actually need any of that?  I think vector types make more sense.
Java already has a great many scalar types.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Do we need an unsigned multiplyHigh?

2017-09-25 Thread Andrew Haley
On 25/09/17 18:21, Adam Petcher wrote:
> I agree that an unsigned multiplyHigh would be useful for crypto 
> purposes, and we should consider adding it. Of course, I would much 
> rather have multiply operations that return both 64-bit parts of the 
> result, but that is going to be hard to do well without value types. So 
> it would be nice to have something like this in the meantime.

I take your point, but it won't be excruciatingly difficult for the C2
compiler to turn the multiply operations into a single one, if the CPU
can do that.  From what I've seen recently, though, on non-x86 it's
common for the two halves of the result to be calculated by separate
instructions.

> If we are going to add this operation, it should probably be added
> along with an intrinsic. I think the Java code can simply factor out
> the else branch from the existing multiplyHigh code. This way,
> unsignedMultiplyHigh will be at least as fast as multiplyHigh,
> whether the intrinsic implementation is available or not.

Sure.  I can do that.

> If possible, the implementation of this operation should not branch on 
> either operand. This would make it more widely useful for constant-time 
> crypto implementations. Though this property would need to go into the 
> spec in order for constant-time crypto code to use this method, and I 
> don't know how reasonable it is to put something like this in the spec.

OK.  I can do it so that there are no branches in the Java.  The Java
code for signed multiplyHigh has some data-dependent branches in an
attempt to speed it up, though.  I don't know how effective they are,
and I could have a look at taking them out.

> Side note: at the moment, I am using signed arithmetic in prototypes for 
> Poly1305, X25519, and EdDSA, partially due to lack of support for 
> unsigned operations like this one. I don't think having 
> unsignedMultiplyHigh would, on its own, convince me to use an unsigned 
> representation, but the forces are different for each 
> algorithm/implementation.

Sure.  I don't think it really matters from a performance point of
view which you use, given intrinsics for both.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Do we need an unsigned multiplyHigh?

2017-09-25 Thread Andrew Haley
We now have a multiplyHigh intrinsic, but it is signed.  Unsigned
multiplyHigh is in general a more useful primitive for crypto than
signed, and I wonder if we need an intrinsic for that as well.  I've
looked at cooking up an unsigned multiplyHigh in Java, and I think the
fastest way is this:

private static final long unsignedMultiplyHigh(long a, long b) {
long result = Math.multiplyHigh(a, b);
if (a < 0)  result += b;
if (b < 0)  result += a;
// Can also be written as:
// result += (a >> 63) & b;
// result += (b >> 63) & a;
return result;
}

It's still about 50% slower than the signed multiplyHigh, though.
Thoughts?

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: [Semi-]off-line encryption

2016-09-29 Thread Andrew Haley
On 29/09/16 17:49, Michael StJohns wrote:
> On 9/29/2016 11:28 AM, Andrew Haley wrote:
>> GCM allows most of the work in an encryption to be done offline (and
>> ahead of time) by other processors, reducing latency and increasing
>> throughput.  It'd be lovely if we could do this in Java, but I can't
>> really see a way to fit this in to the platform security framework.
>> We don't want to do this eagerly, because we don't know that more data
>> will be encrypted and we don't want to speculate.
>>
>> However, if we had a hint that (say) a large stream would need to
>> encrypt a megabyte of data at some time in the future we could
>> precompute a megabyte of keystream.  Has anyone considered this?
> 
> Um.  No.   You can make this work with CTR, but you can't with GCM.  
> With CTR, you just encrypt a stream of zeroes to get an encryption 
> stream and then XOR the encryption stream later with your actual plain 
> text.  GCM (and CCM) tend to compute the integrity tag in parallel with 
> calculating the encryption stream. You'd have to still process all of 
> the plain text (or cipher text) to get the integrity tag.

The keystream doesn't depend on the plaintext or the ciphertext.  Of
course the auth tag can't be calculated before we have the ciphertext
but the keystream can be, and the keystream is the expensive part of
the calculation.  The auth tag doesn't require anything more than a
Galois field multiplication, and that's really fast with current
hardware.  We could arrange it so that there is always some keystream
precalculated, and then the encryption latency would be no more than a
couple of XORs and a GF multiplication.

Maybe there's no point if AES-NI can generate the keystream at 20ish
cycles per block and the GF multiplication can be done in parallel,
but not every target has AES-NI or some equivalent.  And I'm not sure
that every target which does have appropriate instructions can issue
the AES and the GF multiplication instructions at the same time.

Having said all of that, it may be that the overheads of inter-
processor communication would make this inefficient.

Andrew.


[Semi-]off-line encryption

2016-09-29 Thread Andrew Haley

GCM allows most of the work in an encryption to be done offline (and
ahead of time) by other processors, reducing latency and increasing
throughput.  It'd be lovely if we could do this in Java, but I can't
really see a way to fit this in to the platform security framework.
We don't want to do this eagerly, because we don't know that more data
will be encrypted and we don't want to speculate.

However, if we had a hint that (say) a large stream would need to
encrypt a megabyte of data at some time in the future we could
precompute a megabyte of keystream.  Has anyone considered this?

Andrew.


Re: Issues with ALPN implementation in JDK 9

2016-06-15 Thread Andrew Haley
On 15/06/16 09:11, Simone Bordet wrote:
> Sure, but then every ALPN implementer out there will have their own,
> they will need to be kept up to date when a new TLS protocol or a new
> cipher is introduced in the JDK, etc. etc.

OK, I get that.

The problem I have with reading posts about JDK9 problems is that I
can't tell the *severity* of the problems.  I don't know if something
is a total blocker or an inconvenience.  When someone uses an internal
sun.* class they may be doing something which must be done in order to
get access which would be impossible otherwise.

Andrew.



Re: Issues with ALPN implementation in JDK 9

2016-06-15 Thread Andrew Haley
On 14/06/16 14:57, Simone Bordet wrote:
> * Lack of facilities to convert TLS protocol bytes to protocol strings.
> This class already exist in JDK, sun.security.ssl.ProtocolVersion, it
> would just need to be exposed in javax.net.ssl.
> 
> * Lack of facilities to convert TLS cipher bytes to cipher name strings.
> As above, sun.security.ssl.CipherSuite exists, needs to be exposed publicly.
> 
> Note that for the 2 bullets above, a recent message from Mark Reinhold
> to jdk9-dev confirmed that JDK 9 is *not yet* feature complete, so I
> hope they can be considered for inclusion.

It's very late, and they certainly won't be considered unless
someone proposes it.  But can't we simply clone this class anyway?

Incidentally, I can't tell which (if any) of the issues you describe
are specific to JDK 9 (i.e. they would not be problems in JDK 8.)

Andrew.



RFR: 8134869: AARCH64: GHASH intrinsic is not optimal

2015-09-01 Thread Andrew Haley
I've been looking at the intrinsic we have for GHASH.  While it is
decent as it goes, its performance is considerably worse than some
other implementations of GHASH on the same processor.

Thanks are due to Alexander Alexeev who did a fine job implementing
the x86 algorithm on AArch64, but the result is not optimal.  on
AArch64 we have the advantage of a bit-reversal instruction which x86
parts don't have, and this makes it possible to write a fully
little-endian implementation of GHASH which is far more idiomatic on
AArch64 than the big-endian implementation the x86 version uses.  This
gets us an overall performance improvement of AES/GCM of 10-20%.

I've also taken the opportunity to add a lot of comments.  The
algorithms used are (fairly) obscure and most open source software
implementations don't really explain what they're doing.  In
particular, the bizarre representation of polynomials in GF(2) (where
byte ordering is little endian but bit ordering is big endian) is very
confusing and surely deserves a comment or two.

http://cr.openjdk.java.net/~aph/8134869-ghash-1/

One other remark: the AES/GCM implementation has a lot of overhead.
Some profile data (on x86) looks like this:

samples  cum. samples  %cum. % image name   symbol name
479605   47960536.8408  36.840831156.jo 
aescrypt_encryptBlock
301014   78061923.1224  59.963231156.jo 
ghash_processBlocks
196563   97718215.0990  75.062131156.jo int 
com.sun.crypto.provider.GCTR.doFinal(byte[], int, int, byte[], int)
5006110272433.8454  78.907631156.jo void 
TestAESEncode.run()
4815910754023.6993  82.606931156.jo void 
TestAESDecode.run()
1850610939081.4215  84.0284libjvm.so
TypeArrayKlass::allocate_common(int, bool, Thread*)

GCTR.doFinal() doesn't need do anything except increment a counter
and call aescrypt_encryptBlock, but it still takes 15% of the total
runtime.  Intrinsifying GCTR.update() would solve this problem.

Andrew.


Re: RFR: 8130150: RSA Acceleration

2015-07-01 Thread Andrew Haley
On 06/30/2015 06:49 PM, Andrew Haley wrote:
 New webrevs:
 
 http://cr.openjdk.java.net/~aph/8130150-jdk
 http://cr.openjdk.java.net/~aph/8130150-hs/

I made an error when preparing these webrevs.  Please ignore.

Andrew.



Re: RFR: 8130150: RSA Acceleration

2015-07-01 Thread Andrew Haley
Sorry for the mistake.  New webrevs:

http://cr.openjdk.java.net/~aph/8130150-hs-1/
http://cr.openjdk.java.net/~aph/8130150-jdk-1/

Andrew.



RFR: 8130150: RSA Acceleration

2015-06-30 Thread Andrew Haley
On 06/29/2015 10:32 AM, Andrew Haley wrote:
 On 29/06/15 09:37, Vladimir Kozlov wrote:
 Hi, Andrew

 Did you file RFE for this change?  8046943 is JEP.
 
 No; I will do so.

Done.

 typo? less - more.

 + * number of ints in the number is less than this value we do not
 + * use the intrinsic.
 + */
 +private static final int MONTGOMERY_INTRINSIC_THRESHOLD = 512;

 trailing spaces:
 src/java.base/share/classes/java/math/BigInteger.java:273: Trailing 
 whitespace
 src/java.base/share/classes/java/math/BigInteger.java:2770: Trailing 
 whitespace

 I ran changes through JPRT and linux/solaris passed - thanks.
 Next step - Windows:

In the end I had to punt on Windows.  It's ifdef'd out for someone who
works on Windows to fix.

 I am fine with JDK changes.

 Would be nice to have a test for this change. Do existing tests
 cover this code?
 
 They do.

They don't.  The RSA tests don't run for long enough to test the
intrinsics.  I wrote a new test for Montgomery multiplication.

 I agree that we should limit size when to invoke multiplyToLen
 intrinsic too. File bug I will assign it.
 
 OK.

JDK-8130154

New webrevs:

http://cr.openjdk.java.net/~aph/8130150-jdk
http://cr.openjdk.java.net/~aph/8130150-hs/

Andrew.




Re: RFR: 8046943: RSA Acceleration

2015-06-29 Thread Andrew Haley
On 06/29/2015 02:19 PM, David M. Lloyd wrote:
 Out of curiosity, instead of e.g.:
 
unsigned long *scratch = (unsigned long *)alloca(total_allocation);
 
 Could you not just use e.g.:
 
unsigned long scratch[longwords * 4];
 
 and avoid alloca altogether?

No.  Variable-length are now in Standard C but not C++.

https://groups.google.com/forum/#!topic/comp.std.c++/K_4lgA1JYeg

I imagine that C++ will eventually support them, but not today.

Andrew.



Re: RFR: 8046943: RSA Acceleration

2015-06-29 Thread Andrew Haley
On 06/29/2015 02:19 PM, David M. Lloyd wrote:
 Out of curiosity, instead of e.g.:
 
unsigned long *scratch = (unsigned long *)alloca(total_allocation);
 
 Could you not just use e.g.:
 
unsigned long scratch[longwords * 4];
 
 and avoid alloca altogether?

No.  Variable-length arrays are now in Standard C but not C++.

https://groups.google.com/forum/#!topic/comp.std.c++/K_4lgA1JYeg

I imagine that C++ will eventually support them, but not today.

Andrew.



Re: [9] RFR(M): 8076112: Add @HotSpotIntrinsicCandidate annotation to indicate methods for which Java Runtime has intrinsics

2015-06-29 Thread Andrew Haley
On 06/29/2015 01:38 PM, Doug Simon wrote:

 I seems just plain wrong for an intrinsic to not implement the same
 semantics as the intrinsified method. I would expect an intrinsic to
 perform all necessary runtime checks and only have the compiler omit
 them if it can prove they are unnecessary. If all intrinsics obeyed
 this contract, then there’s no need for the
 @HotSpotIntrinsicCandidate annotation from a semantics
 perspective. And in terms of the keeping HotSpot in sync with the
 JDK, the responsibility should fall entirely on HotSpot to check
 that its intrinsics correspond to existing methods.

Don't you think you're being rather idealistic?  The other side of the
argument is that it's much easier to write and maintain the
arg-checking code if it's written in Java, and it also has the
advantage that it benefits from profile data to guide the JIT.

Andrew.


Re: [9] RFR(M): 8076112: Add @HotSpotIntrinsicCandidate annotation to indicate methods for which Java Runtime has intrinsics

2015-06-29 Thread Andrew Haley
Hi,

On 29/06/15 10:41, Zoltán Majó wrote:
 
 On 06/27/2015 10:05 AM, Andrew Haley wrote:
 On 25/06/15 12:49, Zoltán Majó wrote:
 Problem: There is need to indicate Java methods that are potentially
 intrinsified by JVM.
 It's a great idea but is it a good name?  HotSpot is not the only Java
 VM.  Do we expect people from to come along and add
 J9IntrinsicCandidate, CACAOIntrinsicCandidate, and so on?
 
 thank you bringing up this issue.
 
 The name HotSpotIntrinsicCandidate resulted from a private discussion 
 with Joe Darcy, Brian Goetz, and John Rose. The reason this name was 
 picked is to make clear that a marked method's interaction with the VM 
 (specifically with the HotSpot VM implementation) needs special attention.

OK, cool.  So has any thought been given to the other VMs?  Do you
expect that, say, J9 will use the HotSpotIntrinsicCandidate
annotattion, or do you expect we will have similar annotations for
each VM which uses OpenJDK libraries?  Or is the need for this
annotation totally HotSpot-specific?

Andrew.


Re: RFR: 8046943: RSA Acceleration

2015-06-29 Thread Andrew Haley
On 29/06/15 09:37, Vladimir Kozlov wrote:
 Hi, Andrew
 
 Did you file RFE for this change?  8046943 is JEP.

No; I will do so.

 typo? less - more.
 
 + * number of ints in the number is less than this value we do not
 + * use the intrinsic.
 + */
 +private static final int MONTGOMERY_INTRINSIC_THRESHOLD = 512;
 
 trailing spaces:
 src/java.base/share/classes/java/math/BigInteger.java:273: Trailing whitespace
 src/java.base/share/classes/java/math/BigInteger.java:2770: Trailing 
 whitespace
 
 I ran changes through JPRT and linux/solaris passed - thanks.
 Next step - Windows:
 
 C:\jprt\T\P1\s\hotspot\src\cpu\x86\vm\sharedRuntime_x86_64.cpp(26) : fatal 
 error C1083: Cannot open include file: 
 'alloca.h': No such file or directory

Hmm, okay.  This is going to be fun.  :-)

AFAIK OpenJDK builds with Visual Studio.  The VS equivalent of
alloca() is called _alloca() and its header file is malloc.h.  I'm
going to try to do this untested.  I think that autoconf will #include
malloc.h on Windows automagically, so all that I have to do is create
a #define for alloca() on Windows.

 I am fine with JDK changes.
 
 Would be nice to have a test for this change. Do existing tests
 cover this code?

They do.  jdk/test/com/oracle/security/ucrypto/TestRSA looks like a
pretty thorough test.  If you make any mistake in the arithmetic RSA
decryption simply will not work: the result is corrupt.  The risks,
then, are mistakes such as accidental side-effects.  I don't know any
way to test for that.

The other possible think I could test for is unusual key sizes.  I'll
have a look.

 I agree that we should limit size when to invoke multiplyToLen
 intrinsic too. File bug I will assign it.

OK.

Andrew.


Re: [9] RFR(M): 8076112: Add @HotSpotIntrinsicCandidate annotation to indicate methods for which Java Runtime has intrinsics

2015-06-27 Thread Andrew Haley
On 25/06/15 12:49, Zoltán Majó wrote:
 Problem: There is need to indicate Java methods that are potentially 
 intrinsified by JVM.

It's a great idea but is it a good name?  HotSpot is not the only Java
VM.  Do we expect people from to come along and add
J9IntrinsicCandidate, CACAOIntrinsicCandidate, and so on?

Andrew.


Re: RFR: 8046943: RSA Acceleration

2015-06-26 Thread Andrew Haley
On 06/19/2015 09:34 AM, Andrew Haley wrote:
 On 18/06/15 20:28, Vladimir Kozlov wrote:
 
 Yes, it is a lot of handwriting but we need it to work on all OSs.
 
 Sure, I get that.  I knew there would be a few goes around with this,
 but it's worth the pain for the performance improvement.

I made some changes, as requested.

Everything is now private static final.

The libcall now only calls the runtime code: all allocation is done
in Java code.

I tested on Solaris using Solaris Studio 12.3 tools, and it's fine.

There's one thing I'm not sure about.  I now longer allocate scratch
memory on the heap.  That was only needed for extremely large
integers, larger than anyone needs for crypto.  Now, if the size of an
integer exceeds 16384 bits I do not use the intrinsic, and this allows
it to use stack-allocated memory for its scratch space.

The main thing I was worried about is that the time spent in
Montgomery multiplication.  The runtime of the algorithm is O(N^2); if
you don't limit the size, the time is unbounded, with no safepoint
delay.  This would mean that anyone who passed an absurdly large
integer to BigInteger.modPow() would see the virtual machine
apparently lock up and garbage collection would not run.  I note that
the multiplyToLen() intrinsic has the same problem.

http://cr.openjdk.java.net/~aph/8046943-hs-3/
http://cr.openjdk.java.net/~aph/8046943-jdk-3/

Andrew.


Building SA

2015-06-24 Thread Andrew Haley
Now that tools.jar is gone from build images, what is the right way
to build the seviceability agent?

Thanks,
Andrew.


Re: RFR: 8046943: RSA Acceleration

2015-06-19 Thread Andrew Haley
On 18/06/15 20:28, Vladimir Kozlov wrote:

 We have few new rules regarding intrinsics.
 You need to add private static java method which does range checks 
 because their are not executed in intrinsic code - see squareToLen() 
 implementation, for example.

Okay.

 Note, we will rewrite multiplyToLen() too soon.

Right.

 Also method which will be intrinsified should be private static too and 
 you can move allocation from it (like we did for squareToLen()) to avoid 
 allocation in intrinsic code.

I like that.  It's surely a lot less complicated than what I've got
right now.

 Your Hotspot changes are hard to accept. We have to compile for solaris 
 with Sun compilers which does not work with such changes:
 
 hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp, line 3525: Warning: 
 parameter in inline asm statement unused: %3.
 hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp, line 3525: Warning: 
 parameter in inline asm statement unused: %6.
 hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp, line 3701: Error: The 
 function __builtin_expect must have a prototype.
 hotspot/src/cpu/x86/vm/sharedRuntime_x86_64.cpp, line 3707: Error: The 
 function __builtin_alloca must have a prototype.

Sure.  I didn't realize that you were compiling that code with
non-GCC.  __builtin_alloca() can just be replaced by alloca() on
on-GCC.

 May be you can convert the code to stub and add new MacroAssembler 
 functions which you can use in sharedRuntime_x86_64.cpp.

I don't think so.  It's fast because it is truly inline assembler,
inserted into the C code.  If I had some way to access an x86 Solaris
machine I'd test it there.

But those warnings about unused parameters are just warnings.  It may
be that the code would work on a Sun compiler.  Can you please try to
replace __builtin_alloca() with alloca() and then tell me if the code
works?

 Yes, it is a lot of handwriting but we need it to work on all OSs.

Sure, I get that.  I knew there would be a few goes around with this,
but it's worth the pain for the performance improvement.

Andrew.


Re: RFR: 8046943: RSA Acceleration

2015-06-18 Thread Andrew Haley
On 06/18/2015 06:00 PM, Anthony Scarpino wrote:
 Question, on the hotspot side you said in a previous post this was 
 C2-only.  Was there a reason you don't have it for all?

None.  It's up for negotiation.  What do you want?  C1, interp?

Andrew.



Re: RFR: 8046943: RSA Acceleration

2015-06-16 Thread Andrew Haley
On 06/15/2015 05:58 PM, Andrew Haley wrote:
 3.  I fused squaring and multiplication into a single
  montgomeryMultiply method.  ...
  
  I don't agree with fusing them together.  I think there should two 
  separate intrinsics.  For one, SPARC has a montsqr and montmul 
  instructions.  Additionally if someone wants to call montgomerySquare, 
  they should be able to call it directly with it's needed number of 
  arguments and not pass 'a' twice to satisfy an internal if().

 OK, fair enough.  I'll think a little more about the best way to do
 this.

Done thusly.  The only thing I had any doubt about was whether to use a
single flag for squaring and multiplication.  This patch uses separate
flags.

http://cr.openjdk.java.net/~aph/8046943-hs-2/
http://cr.openjdk.java.net/~aph/8046943-jdk-2/

Andrew.


Re: RFR: 8046943: RSA Acceleration

2015-06-15 Thread Andrew Haley
On 06/15/2015 05:38 PM, Anthony Scarpino wrote:
 On 06/12/2015 04:06 AM, Andrew Haley wrote:
 http://cr.openjdk.java.net/~aph/8046943-hs-1/
 http://cr.openjdk.java.net/~aph/8046943-jdk-1/
 
 Please don't use the JEP 246 in the comments when you push. There are a 
 number of changesets related to 246 and I'd rather not have one 
 associated with it.  We can link the a separate bug id to the JEP.

Right.

 3.  I fused squaring and multiplication into a single
 montgomeryMultiply method.  This has two advantages.  Firstly, we only
 need a single intrinsic, and secondly the decision whether to use
 multiply or squaring can be made in the runtime library.  If the
 target does not support the montgomeryMultiply intrinsic there is no
 slowdown when using C2 because it removes the if (a == b) test in

  if (a == b) {
  product = squareToLen(a, len, product);
  } else {
  product = multiplyToLen(a, len, b, len, product);
  }
 
 I don't agree with fusing them together.  I think there should two 
 separate intrinsics.  For one, SPARC has a montsqr and montmul 
 instructions.  Additionally if someone wants to call montgomerySquare, 
 they should be able to call it directly with it's needed number of 
 arguments and not pass 'a' twice to satisfy an internal if().

OK, fair enough.  I'll think a little more about the best way to do
this.

Out of curiosity I just had a look at the SPARC instruction
specifications, and happily (it certainly surprised me!)  they are
almost exactly the same as what I've done, so using those instructions
should be a trivial change after this patch.  The SPARC instruction
seems to be limited to 32 words (2048 bits) but I guess you'd just use
the software for larger sizes.

Andrew.


Re: RFR: 8046943: RSA Acceleration

2015-06-12 Thread Andrew Haley
On 06/12/2015 12:06 PM, Andrew Haley wrote:

Sorry, I forgot the column labels.  The are:

signverifysign/s verify/s

 rsa  512 bits 0.000134s 0.09s   7444.2 111853.3
 rsa 1024 bits 0.000674s 0.29s   1483.9  34456.9
 rsa 2048 bits 0.003945s 0.000100s253.5   9994.7
 rsa 4096 bits 0.027015s 0.000373s 37.0   2681.6
 
 After:
 
 rsa  512 bits 0.59s 0.04s  17022.3 224141.1
 rsa 1024 bits 0.000258s 0.13s   3871.5  78851.0
 rsa 2048 bits 0.001506s 0.42s664.1  23844.3
 rsa 4096 bits 0.010214s 0.000153s 97.9   6516.0

Andrew.




RFR: 8046943: RSA Acceleration

2015-06-12 Thread Andrew Haley
http://cr.openjdk.java.net/~aph/8046943-hs-1/
http://cr.openjdk.java.net/~aph/8046943-jdk-1/

Before:

rsa  512 bits 0.000134s 0.09s   7444.2 111853.3
rsa 1024 bits 0.000674s 0.29s   1483.9  34456.9
rsa 2048 bits 0.003945s 0.000100s253.5   9994.7
rsa 4096 bits 0.027015s 0.000373s 37.0   2681.6

After:

rsa  512 bits 0.59s 0.04s  17022.3 224141.1
rsa 1024 bits 0.000258s 0.13s   3871.5  78851.0
rsa 2048 bits 0.001506s 0.42s664.1  23844.3
rsa 4096 bits 0.010214s 0.000153s 97.9   6516.0

There are some issues we need to discuss.

The runtime code is in sharedRuntime_x86_64.cpp even though it's
mostly written in C.  My thinking here is that porters will want to
tweak the code for their back ends, or maybe write it in assembler.
It's little-endian and would need some reworking for big-endian
machines.  But maybe there should be a common version of the code in
share/ ?

Should it be in optoRuntime instead?  It could be called from C1 or
even the interpreter, but it's C2-only right now.

I've done nothing about 32-bit targets, but I think they would
benefit.

I had to make some small changes to the Java library.

1.  Montgomery multiplication works on whole words.  Given that this
is a 64-bit implementation I had to force the size of the arrays in
oddModPow to be a multiple of 64 bits, i.e. the length of the arrays
must be even.  Given that RSA and Diffie-Hellman keys are usually a
multiple of 64 bits in length I don't think this will be a real
performance issue in practice.

2.  I needed a 64-bit inverse rather than a 32-bit inverse.  This is a
single computation, done once for each modular exponentiation, so it
makes an immeasurably small difference to the total runtime.

3.  I fused squaring and multiplication into a single
montgomeryMultiply method.  This has two advantages.  Firstly, we only
need a single intrinsic, and secondly the decision whether to use
multiply or squaring can be made in the runtime library.  If the
target does not support the montgomeryMultiply intrinsic there is no
slowdown when using C2 because it removes the if (a == b) test in

if (a == b) {
product = squareToLen(a, len, product);
} else {
product = multiplyToLen(a, len, b, len, product);
}

Andrew.


RSA intrinsics [Was: RFR(L): 8069539: RSA acceleration]

2015-06-04 Thread Andrew Haley
I'm sorry this is a rather long email, and I pray for your patience.

I'm getting close to something I can put forward for review.  The
performance is encouraging.

[ Some background: The kernel of RSA and Diffie-Hellman key exchange
is Montgomery multiplication.  Optimizing RSA basically comes down to
optimizing Montgomery multiplication.  The core of OpenJDK's RSA is
BigInteger.oddModPow().  ]

My Montgomery multiply routine (mostly portable C, with a small
assembly-language insert) executes a 1024-bit multiply/reduce in about
2000ns; the hand-coded assembly language equivalent in OpenSSL is
faster (as you'd expect) but not very much faster: about 1700ns.  In
other words, compiled C is only about 20% slower.

Firstly, this is pretty remarkable performance by GCC (Yay!  Go GCC!)
and it shows I'm on the right track.  It also shows that there isn't a
huge amount to be gained by hand-coding Montgomery multiplication, but
anybody who fancies their hand can try to improve on GCC.  This is
also very nice because porting it to non-x86 hardware is fairly
straightforward -- certainly far easier than writing a large assembly-
language routine.

Here are some numbers for comparison.

Today's hs-comp:

  signverifysign/s verify/s
rsa  512 bits 0.000133s 0.09s   7508.5 112819.1
rsa 1024 bits 0.000667s 0.28s   1498.6  35497.2
rsa 2048 bits 0.003867s 0.97s258.6  10342.7
rsa 4096 bits 0.026383s 0.000357s 37.9   2799.8

After my patch:

  signverifysign/s verify/s
rsa  512 bits 0.71s 0.05s  14127.3 204112.4
rsa 1024 bits 0.000292s 0.13s   3424.5  74204.1
rsa 2048 bits 0.001628s 0.45s614.4  22399.7
rsa 4096 bits 0.010966s 0.000163s 91.2   6117.8

So, it's about twice as fast we have at present.

[ Note that this comparison includes the latest 8081778: Use Intel
x64 CPU instructions for RSA acceleration patch. ]

However, even after my patch OpenJDK is still somewhat slower than
OpenSSL, which is:

  signverifysign/s verify/s
rsa  512 bits 0.48s 0.04s  20687.1 257399.4
rsa 1024 bits 0.000189s 0.11s   5288.3  91711.5
rsa 2048 bits 0.001174s 0.37s851.7  27367.2
rsa 4096 bits 0.008475s 0.000137s118.0   7305.4

[ I am assuming that OpenSSL represents something like the speed of
light for RSA on x86: this is carefully hand-coded assembly language
and C, hand tuned.  Getting anywhere near OpenSSL is a major win. ]

Here's my problem:

Some of this slowdown is due to the overhead of using the JCE, but not
very much.  Quite a lot of it, however, is due to the fact that the
scratch memory used in oddModPow() is a big-endian array of jints.  I
have to convert the big-endian jints into native jlongs to do the
multiply on little-endian x86-64.

If I do the word reversal during the multiply (i.e. keep all data in
memory in little-endian form, swap words when reading and writing to
memory) the performance is horrible: a 1024-bit multiply takes 3000ns,
50% longer.  This perhaps isn't very surprising: if you do the
word-reversal before the multiplication you have O(N) swaps, if you do
it during the multiplication you have O(N^2).

I have found that the best thing to do is to word reverse all the data
in memory into temporary little-endian arrays and do the work on them.
It's much faster, but still really is very annoying: for 1024-bit RSA
the word reversal is 14% of the total runtime.

It would be nice to keep all of the data in an array of jlongs for the
duration of oddModPow().  Here's one idea: I could write a version of
oddModPow() which is guaranteed never to use the Java version of the
Montgomery multiply.  This would use a JNI method which calls the
native Montgomery multiply routine, guaranteeing that that we always
use that native routine, even from the interpreter and C1.  Then I
could keep all the internal state in native word order, and all this
word-swapping would just go away.  This would have the additional
benefit that it would be faster when using the interpreter and C1.
So, we'd have two versions of oddModPow() in BigInteger, and switch
between them depending on whether the platform had support for a
native Montgomery multiplication.

The downside of having two versions of oddModPow() would, of course,
be some code duplication.

Or am I just making too much fuss about this?  Maybe I should be happy
with what I've got.

Thank you for reading this far,

Andrew.


Re: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration]

2015-05-28 Thread Andrew Haley
On 27/05/15 21:35, Anthony Scarpino wrote:

 How much of the montgomery multiply and sqaure are you planning to
 intrinsify?  Are you doing the whole thing or just portions of the
 operations similar to multiplyToLen and squareToLen?

I'm doing the whole montgomery multiply and square operation in a
routine which interleaves multiplication and reduction so that the
intermediate product is never written into memory.

Andrew.


Re: RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration]

2015-05-28 Thread Andrew Haley
On 05/28/2015 08:51 AM, Andrew Haley wrote:
 On 27/05/15 21:35, Anthony Scarpino wrote:
 
 How much of the montgomery multiply and sqaure are you planning to
 intrinsify?  Are you doing the whole thing or just portions of the
 operations similar to multiplyToLen and squareToLen?
 
 I'm doing the whole montgomery multiply and square operation in a
 routine which interleaves multiplication and reduction so that the
 intermediate product is never written into memory.

For a bit more information, here's the algorithm in C:

Andrew.


// Multiply (unsigned) Long A by Long B, accumulating the double-
// length result into the accumulator formed of T0, T1, and T2.
#define MACC(A, B, T0, T1, T2)  \
do {\
  unsigned long hi, lo; \
  __asm__ (mul %5; add %%rax, %2; adc %%rdx, %3; adc $0, %4   \
   : =d(hi), =a(lo), +r(T0), +r(T1), +g(T2)  \
   : r(A), a(B));   \
 } while(0)

// Fast Montgomery multiplication.  The derivation of the algorithm is
// in  A Cryptographic Library for the Motorola DSP56000,
// Dusse and Kaliski, Proc. EUROCRYPT 90, pp. 230-237.

static void __attribute__((noinline))
montgomery_multiply(unsigned long a[], unsigned long b[], unsigned long n[],
unsigned long m[], unsigned long inv, int len) {
  unsigned long t0 = 0, t1 = 0, t2 = 0; // Triple-precision accumulator
  int i;

  assert(inv * n[0] == -1UL);

  for (i = 0; i  len; i++) {
int j;
for (j = 0; j  i; j++) {
  MACC(a[j], b[i-j], t0, t1, t2);
  MACC(m[j], n[i-j], t0, t1, t2);
}
MACC(a[i], b[0], t0, t1, t2);
m[i] = t0 * inv;
MACC(m[i], n[0], t0, t1, t2);

assert(t0 == 0);
t0 = t1; t1 = t2; t2 = 0;
  }

  for (i = len; i  2*len; i++) {
int j;
for (j = i-len+1; j  len; j++) {
  MACC(a[j], b[i-j], t0, t1, t2);
  MACC(m[j], n[i-j], t0, t1, t2);
}
m[i-len] = t0;
t0 = t1; t1 = t2; t2 = 0;
  }

  m[len] = t0;
  while(cmp(m, n, len+1)  0)
sub(m, n, len+1);
}



RSA and Diffie-Hellman performance [Was: RFR(L): 8069539: RSA acceleration]

2015-05-27 Thread Andrew Haley
[I'm sorry, I didn't send this to the correct list.  I forgot that
there was a separate security list.]

An update:

I'm still working on this.  Following last week's revelations [1] it
seems to me that a faster implementation of (integer) D-H is even more
important.

I've spent a couple of days tracking down an extremely odd feature
(bug?) in MutableBigInteger which was breaking everything, but I'm
past that now.  I'm trying to produce an intrinsic implementation of
the core modular exponentiation which is as fast as any state-of-the-
art implementation while disrupting the common code as little as
possible; this is not easy.

I hope to have something which is faster on all processors, not just
those for which we have hand-coded assembly-language implementations.

I don't think that my work should be any impediment to Sadya's patch
for squareToLen at http://cr.openjdk.java.net/~kvn/8069539/webrev.01/
being committed.  It'll still be useful.

Andrew.


[1]  Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
https://weakdh.org/imperfect-forward-secrecy.pdf