Small correction in the updated part:
http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/
Naoto
On 7/20/20 2:39 PM, naoto.s...@oracle.com wrote:
Hi Joe,
Thank you for your comments.
On 7/20/20 11:20 AM, Joe Wang wrote:
Hi Naoto,
StringUTF16: line 384 - 388 seem unnecessary since you'd only get
there if 389:isHighSurrogate is not true.
Good point.
But more importantly, StringUTF16
has existing method "codePointAt" you may want to consider instead of
adding a new method.
If we call codePointAt/Before, it would call an extra getChar(). Since
we know one codepoint as an input, I would avoid the extra calls.
Comparing to the base benchmark:
StringCompareToIgnoreCase.lower 8.5%
StringCompareToIgnoreCase.supLower 139%
StringCompareToIgnoreCase.supUpperLower -21.8%
StringCompareToIgnoreCase.upperLower avgt -5.9%
"lower" was 8.5% slower, if such test exists in the specJVM, it would
be considered a regression. I would suggest you run the specJVM. I
agree with you on surrogate check being a requirement, thus supLower
being 139% slower is ok since it won't otherwise be correct anyways.
Yes, it would be a regression if SPECjvm produces 8+% degradation, but
the test suite is aimed at the entire application performance. But for
this one, it is a micro benchmark for relatively rarely issued methods
(I would think normal cases fall into Latin1 equivalents), I would
consider it is acceptable.
But after introducing additional operations supUpperLower and
upperLower ran faster? That may indicate irregularity in the tests.
Maybe we should consider running tests with short, long and very long
sample strings to see if we can reduce the noise level and also see
how it fares for a longer string. I assume the machine you're running
the test on was isolated.
This result pretty much depends on the data it is testing for. As I
wrote in the previous email, (sup)UpperLower tests use the data that are
almost identical, but one last character is case insensitively equal. So
in these cases, the new short cut works really well and not call
toLower/UpperCase() at all for most of the characters. Thus the new
results are faster. Again the test result is very dependent on the input
data, Unless the result showed 100% slower or something (except supLower
case), I think it is OK.
Anyways, here is the updated webrev addressing your first suggestion:
http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/
Naoto
Regards,
Joe
On 7/19/2020 11:05 AM, naoto.s...@oracle.com wrote:
Hi Mark,
Thank you for your comments.
On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up to
the point where you hit a high surrogate, then drop into the slower
codepoint path. That saves time for the high-runner cases.
On the other hand, if the times are good enough, you might not need
the complication.
The implementation is dealing with bare byte arrays. Only methods
that it uses from Character class are toLowerCase(int) and
toUpperCase(int) (sans surrogate check, which is needed at each
iteration anyways), and their "char" equivalents are merely casting
(char) to the int result. So it might not be so beneficial to
differentiate char and int paths.
Having said that, I found that there was an unnecessary surrogate
check (always checks high *and* low surrogate on each iteration), so
I revised the fix (added line 380-383 in StringUTF16.java):
http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/
Naoto
Mark
//////
On Fri, Jul 17, 2020 at 4:39 PM <naoto.s...@oracle.com
<mailto:naoto.s...@oracle.com>> wrote:
Hi,
Based on the suggestions, I modified the fix as follows:
https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
Changes from the initial revision are:
- Shared the implementation between compareToCI() and
regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and
after
the change:
before:
Benchmark Mode Cnt Score
Error Units
StringCompareToIgnoreCase.lower avgt 25 53.764 ?
2.811 ns/op
StringCompareToIgnoreCase.supLower avgt 25 24.211 ?
1.135 ns/op
StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ?
1.344 ns/op
StringCompareToIgnoreCase.upperLower avgt 25 18.859 ?
1.499 ns/op
after:
Benchmark Mode Cnt Score
Error Units
StringCompareToIgnoreCase.lower avgt 25 58.354 ?
4.603 ns/op
StringCompareToIgnoreCase.supLower avgt 25 57.975 ?
5.672 ns/op
StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ?
0.965 ns/op
StringCompareToIgnoreCase.upperLower avgt 25 17.744 ?
0.272 ns/op
Here, "sup" means all supplementary characters, BMP otherwise.
"lower"
means each character requires upper->lower casemap. "upperLower"
means
all characters are the same, except the last character which
requires
casemap.
I think the result is reasonable, considering surrogates check
are now
mandatory. I have tried Roger's suggestion to use
Arrays.mismatch() but
it did not seem to benefit here. In fact, the performance degraded
partly because I implemented the short cut, and possibly for the
overhead of extra checks.
Naoto
On 7/15/20 9:00 AM, naoto.s...@oracle.com
<mailto:naoto.s...@oracle.com> wrote:
> Hello,
>
> Please review the fix to the following issues:
>
> https://bugs.openjdk.java.net/browse/JDK-8248655
> https://bugs.openjdk.java.net/browse/JDK-8248434
>
> The proposed changeset and its CSR are located at:
>
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
> https://bugs.openjdk.java.net/browse/JDK-8248664
>
> A bug was filed against SimpleDateFormat (8248434) where
> case-insensitive date format/parse failed in some of the new
locales in
> JDK15. The root cause was that case-insensitive
String.regionMatches()
> method did not work with supplementary characters. The
problem is
that
> the method's spec does not expect case mappings of supplementary
> characters, possibly because it was overlooked in the first
place, JSR
> 204 - "Unicode Supplementary Character support". Similar
behavior is
> observed in other two case-insensitive methods, i.e.,
> compareToIgnoreCase() and equalsIgnoreCase().
>
> The fix is straightforward to compare strings by code point
basis,
> instead of code unit (16bit "char") basis. Technically this
change will
> introduce a backward incompatibility, but I believe it is an
> incompatibility to wrong behavior, not true to the meaning of
those
> methods' expectations.
>
> Naoto