Re: RFR: 8363972: Loose matching of dash/minusSign in number parsing [v6]

Naoto Sato Fri, 01 Aug 2025 16:26:23 -0700

On Fri, 1 Aug 2025 22:19:22 GMT, Justin Lu <[email protected]> wrote:

>> Naoto Sato has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 16 commits:
>> 
>>  - Merge branch 'master' into JDK-8363972-Loose-matching-dash
>>  - Spec update
>>  - Supplementary/CanonEq tests
>>  - flipped again, which was correct
>>  - flipped the size check
>>  - Address review comments
>>  - Merge branch 'master' into JDK-8363972-Loose-matching-dash
>>  - tidying up
>>  - test location
>>  - spec update
>>  - ... and 6 more: https://git.openjdk.org/jdk/compare/8e921aee...3682484d
>
> src/java.base/share/classes/java/text/DecimalFormat.java line 3543:
> 
>> 3541:             var lmsp = Pattern.compile("[" + lms + "]", 
>> Pattern.CANON_EQ);
>> 3542:             var a = lmsp.matcher(affix).replaceAll("-");
>> 3543:             var t = lmsp.matcher(text.substring(position, 
>> Math.min(tlen, position + alen)))
> 
> I don't think this works. If you see my other comment regarding the test case 
> and swap it such that `lenientMinusSign` is "-\u00E4" and the pattern (affix) 
> includes "a\u0308" and we parse some text such as "\u00E41.5".
> 
> Then we have the following,
> 
> 
> var lmsp = Pattern.compile("[" + lms + "]", Pattern.CANON_EQ); // Returns 
> pattern [-ä]
> var a = lmsp.matcher(affix).replaceAll("-"); // Gets correctly normalized to 
> "-"
> // Incorrectly matches against "ä1" and normalized to "-1" since the 
> substring returned is indexed from 0 to 2.
> var t = lmsp.matcher(text.substring(position, Math.min(tlen, position + 
> alen)))
> 
> 
> That is, I think we need to substring after we have performed the 
> normalization. Something such as,
> 
> 
> var lmsp = Pattern.compile("[" + lms + "]", Pattern.CANON_EQ);
> var a = lmsp.matcher(affix).replaceAll("-");
> var t = lmsp.matcher(text).replaceAll("-");
> return t.startsWith(a, position);
> 
> 
> However, we will still run into issues later in DecimalFormat, as the 
> position is incremented by the original prefix length.
> 
> 
>         } else if (gotNegative) {
>             position += negativePrefix.length();
> 
> 
> So for "ä1.5" we would start parsing at position = 2, erroneously returning 
> "0.5". So further thought may be needed.


The current implementations assume the prefix/suffix lengths in a lot of places 
(no wonder), should be re-examined.

> test/jdk/java/text/Format/NumberFormat/LenientMinusSignTest.java line 128:
> 
>> 126:             .set(dfs, "-\u00E5");
>> 127:         var df = new DecimalFormat("#.#;a\u0308#.#", dfs);
>> 128:         assertEquals(df.parse("a\u03081.5"), -1.5);
> 
> This one confuses me, should it not be parsing "\u00E51.5" such that the test 
> can check if canonical equivalence occurs when matching the text "\u00E5" to 
> the affix "a\u0308"? Otherwise, we are just parsing the exact pattern we set? 
> Also I think it should be "\u00E4", not "\u00E5".

That's correct. I think the simple case is not that simple. Need to rethink.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26580#discussion_r2249005477
PR Review Comment: https://git.openjdk.org/jdk/pull/26580#discussion_r2249005036

Re: RFR: 8363972: Loose matching of dash/minusSign in number parsing [v6]

Reply via email to