alhudz opened a new pull request, #1734:
URL: https://github.com/apache/commons-lang/pull/1734
Repro: `splitByCharacterType("A" + boldA)`, where `boldA` is U+1D400
MATHEMATICAL BOLD CAPITAL A.
Expected: `["A𝐀"]`, one token, since `A` and the bold `A` are both
upper-case letters.
Actual: `["A", "𝐀"]`.
Cause: the private worker shared by `splitByCharacterType` and
`splitByCharacterTypeCamelCase` iterates one `char` at a time and calls
`Character.getType(char)`, so each half of a surrogate pair reads as
`SURROGATE` rather than the real category of the code point. A supplementary
character is therefore split from neighbours of the same type, and in the
`camelCase` path `pos - 1` lands inside the pair.
`splitByCharacterType("5" + boldFive)`, where `boldFive` is U+1D7D3
MATHEMATICAL BOLD DIGIT FIVE, splits two decimal digits the same way.
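
The mismatch can be seen in isolation with a small sketch. `SurrogateTypeDemo` and its helper names are hypothetical and are not part of the patch; only the `java.lang.Character` and `String` calls are real API.

```java
final class SurrogateTypeDemo {
    // U+1D400 MATHEMATICAL BOLD CAPITAL A; takes two UTF-16 units (a surrogate pair).
    static final String BOLD_A = new String(Character.toChars(0x1D400));

    // Category of the single UTF-16 unit at index, as a char-based loop sees it.
    static int typeByChar(String s, int index) {
        return Character.getType(s.charAt(index));
    }

    // Category of the whole code point that starts at index.
    static int typeByCodePoint(String s, int index) {
        return Character.getType(s.codePointAt(index));
    }

    public static void main(String[] args) {
        String s = "A" + BOLD_A; // three chars, two code points
        // Index 1 is the high surrogate of the bold A.
        System.out.println(typeByChar(s, 1) == Character.SURROGATE);
        System.out.println(typeByCodePoint(s, 1) == Character.UPPERCASE_LETTER);
    }
}
```

Because `"A"` is `UPPERCASE_LETTER` and the next `char` is `SURROGATE`, a loop that compares per-`char` categories sees a type change at index 1 and starts a new token there.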
Fix: iterate by code point with `Character.codePointAt`/`Character.charCount`
and classify the whole code point; the `camelCase` boundary backs up by
`Character.charCount(Character.codePointBefore(...))`. Behavior for BMP-only
input is unchanged.
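
A minimal sketch of the code-point loop described in the fix, for illustration only. It is not the patch itself: `CodePointSplitSketch` and `split` are hypothetical names, it uses the `String` overloads `codePointAt`/`codePointBefore`, and it returns a `List<String>` rather than the `String[]` (or `null`) that the real method returns.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointSplitSketch {
    // Splits str into runs of code points that share a general category.
    // With camelCase set, an upper-case letter directly before a lower-case
    // run is moved into the lower-case token.
    static List<String> split(String str, boolean camelCase) {
        List<String> tokens = new ArrayList<>();
        if (str == null || str.isEmpty()) {
            return tokens;
        }
        int tokenStart = 0;
        int first = str.codePointAt(0);
        int currentType = Character.getType(first);
        int pos = Character.charCount(first);
        while (pos < str.length()) {
            int cp = str.codePointAt(pos);
            int type = Character.getType(cp); // category of the whole code point
            if (type != currentType) {
                if (camelCase && type == Character.LOWERCASE_LETTER
                        && currentType == Character.UPPERCASE_LETTER) {
                    // Back up one full code point, never into a surrogate pair.
                    int newTokenStart = pos - Character.charCount(str.codePointBefore(pos));
                    if (newTokenStart != tokenStart) {
                        tokens.add(str.substring(tokenStart, newTokenStart));
                        tokenStart = newTokenStart;
                    }
                } else {
                    tokens.add(str.substring(tokenStart, pos));
                    tokenStart = pos;
                }
                currentType = type;
            }
            pos += Character.charCount(cp); // advance by 1 or 2 chars
        }
        tokens.add(str.substring(tokenStart));
        return tokens;
    }
}
```

For BMP-only input every `Character.charCount` call returns 1, so the loop takes the same steps as a per-`char` loop, which is consistent with the claim that BMP behavior does not change.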