Re: RFR: 8301991: Convert l10n properties resource bundles to UTF-8 native [v6]

Eirik Bjørsnøs Thu, 10 Apr 2025 01:12:29 -0700

On Thu, 10 Apr 2025 07:32:18 GMT, Magnus Ihse Bursie wrote:


>> You don't have to do that, I'm working on an omnibus UTF-8 fixing PR right 
>> now, where I will include a fix for this as well.
>
> If anything, I might be a bit worried that there are more incorrect 
> conversions stemming from this PR, that my automated tools and manual 
> scanning has not revealed.

Some observations: 

1: This PR seems to have been abandoned, so perhaps this discussion belongs in 
#15694?

2: The `å` (Unicode 'Latin small letter a with ring above' U+00E5) was 
correctly encoded as `0xE5` in ISO-8859-1 before this change.

3: The conversion changed this `0xE5` to the three-byte sequence `ef bf bd`, 
which is the UTF-8 encoding of U+FFFD, the replacement character.

4: This is as if the file had been incorrectly decoded using UTF-8, then 
encoded using UTF-8:


```java
// 'å' is the single byte 0xE5 in ISO-8859-1
byte[] origBytes = "å".getBytes(StandardCharsets.ISO_8859_1);
// A lone 0xE5 is malformed UTF-8, so the decoder substitutes U+FFFD
String decoded = new String(origBytes, StandardCharsets.UTF_8);
byte[] encoded = decoded.getBytes(StandardCharsets.UTF_8);
String hex = HexFormat.of().formatHex(encoded);
assertEquals("efbfbd", hex);
```
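
For contrast, here is a minimal sketch of what the conversion presumably should have done: decode with the charset the file was actually written in (ISO-8859-1), then encode as UTF-8. The class and method names are made up for illustration and are not taken from the PR or its conversion script.

```java
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Hypothetical helper, for illustration only.
class CorrectConversion {
    // Decode the bytes as ISO-8859-1, re-encode as UTF-8, and return the hex form.
    static String latin1ToUtf8Hex(byte[] latin1Bytes) {
        String decoded = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return HexFormat.of().formatHex(decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] origBytes = { (byte) 0xE5 }; // 'å' in ISO-8859-1
        System.out.println(latin1ToUtf8Hex(origBytes)); // prints c3a5
    }
}
```

With the right source charset, `å` ends up as the two-byte sequence `c3 a5` rather than `ef bf bd`.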

Like @magicus, I'm worried that similar incorrect decoding could have been 
introduced by the same script in other files.
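
One way to look for other affected files might be to scan for the byte sequence `ef bf bd` directly. The following is only a rough sketch: the class name, the `.properties` filter and the default root directory are my own assumptions, and a hit is a candidate for manual review rather than proof of a bad conversion.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical scanner, for illustration only.
class ReplacementCharScanner {
    // UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER
    private static final byte[] MARKER = { (byte) 0xEF, (byte) 0xBF, (byte) 0xBD };

    // Returns the byte offsets at which the marker sequence starts.
    static List<Integer> findReplacementChars(byte[] content) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + MARKER.length <= content.length; i++) {
            if (content[i] == MARKER[0]
                    && content[i + 1] == MARKER[1]
                    && content[i + 2] == MARKER[2]) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        // Root directory to scan; defaults to the current directory.
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> files = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".properties"))
                    .toList();
            for (Path file : files) {
                List<Integer> hits = findReplacementChars(Files.readAllBytes(file));
                if (!hits.isEmpty()) {
                    System.out.println(file + ": U+FFFD at byte offsets " + hits);
                }
            }
        }
    }
}
```

This only catches characters that were already replaced by U+FFFD; it would not detect other kinds of mis-conversion.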

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12726#discussion_r2036767319
