Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

Claes Redestad Mon, 16 Jan 2023 15:32:37 -0800

On Mon, 14 Nov 2022 18:28:53 GMT, Vladimir Ivanov <vliva...@openjdk.org> wrote:


>>> Also, I'd like to note that C2 auto-vectorization support is not too far 
>>> away from being able to optimize hash code computations. At some point, I 
>>> was able to achieve some promising results with modest tweaking of 
>>> SuperWord pass: https://github.com/iwanowww/jdk/blob/superword/notes.txt 
>>> http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/
>> 
>> Intriguing. How far off is this - and do you think it'll be able to match 
>> the efficiency we see here with a memoized coefficient table etc?
>> 
>> If we turn this intrinsic into a stub we might also be able to reuse the 
>> optimization in other places, including from within the VM (calculating 
>> String hashCodes happen in a couple of places, including String 
>> deduplication). So I think there are still a few compelling reasons to go 
>> the manual route and continue on this path.
>
>> How far off is this ...?
> 
> Back then it looked way too constrained (tight constraints on code shapes). 
> But I considered it as a generally applicable optimization. 
> 
>>  ... do you think it'll be able to match the efficiency we see here with a 
>> memoized coefficient table etc?
> 
> Yes, it is able to build the constant table at runtime when folding 
> multiplications of constant coefficients produced during loop unrolling and 
> then packing scalars into a constant vector.
> 
> Moreover, briefly looking at the code shape, the vectorizer would produce a 
> more optimal loop shape (pre-loop would align vector accesses and would use 
> 512-bit vectors when available; vector post-loop could help as well).

I've opted to include the changes spurred by @iwanowww's comments since it led 
to a number of revisions to the intrinsified method API, and it would be 
strange to introduce an intrinsified method just to change the API drastically 
in a follow-up. Basically `ArraysSupport.vectorizedHashCode` has been changed 
to take an offset + length, an initial value and the logical basic type of the 
array elements. Which means any necessary scaling of index and length needs to 
be taken care of before calling the intrinsic. This makes the implementation 
more flexible at no measurable performance cost. Overall the refactoring might 
have reduced complexity a bit. 

Reviewers might observe that nothing is currently passing anything but `0` and 
`length` to `vectorizedHashCode` outside of the simple sanity test I've added, 
but I've verified this feature can be used to some effect elsewhere in the JDK, 
e.g: 
https://github.com/openjdk/jdk/compare/pr/10847...cl4es:jdk:zipcoder-hashcode?expand=1
 (which improves speed of opening `ZipFile` by a small percentage in 
microbenchmarks).

-------------

PR: https://git.openjdk.org/jdk/pull/10847

Re: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

Reply via email to