On Wed, 6 Aug 2025 10:52:00 GMT, Volkan Yazici <[email protected]> wrote:
>> I would assume your "double char" actually means the "surrogate pair"?
>>
>> I believe for the first pass of scanning you might want to skip the
>> "surrogate", as a single dangling surrogate char should trigger a
>> "malformed" error, instead of "unmappable", if the charset is implemented to
>> handle supplementary character.
>>
>> for (char c = 0xFF; c < 0xFFFF; c++) {
>>     if (Character.isSurrogate(c))
>>         continue;
>>     if (!encoder.canEncode(c))
>>         return new char[]{c};
>> }
>>
>> And for the second pass for the "surrogates", I think we can just pick any
>> non-bmp panel, which should always be translated into a surrogate pair and
>> check if the charset can map/encode it, if not, it's our candidate.
>>
>> for (int i = 0x10000; i < 0x1FFFF; i++) {
>>     char[] cc = Character.toChars(i);
>>     if (!encoder.canEncode(new String(cc)))
>>         return cc;
>> }
>
>> for (char c = 0xFF; c < 0xFFFF; c++)
>
> Doesn't this exclude `0xFFFF`, which is a valid (single-`char`,
> non-surrogate) BMP character?
>
>> ... we can just pick any non-bmp panel ...
>> ```
>> for (int i = 0x10000; i < 0x1FFFF; i++) { ...
>> ```
>
> Doesn't the non-BMP range rather end with 0x10FFFF?

(1) We might want to include 0xFFFF in the first pass.
(2) We just need to pick any unmappable non-BMP character, not scan the whole
supplementary range. I would assume it is pretty safe that, for a specific
charset, we will find a character it does not encode somewhere in the first
non-BMP plane (0x10000 to 0x1FFFF).
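
A rough sketch of both passes with those two points folded in. The class and
method names (UnmappableFinder, findUnmappable) are made up for illustration
and are not taken from the PR. The first loop counts with an int because a
char counter with "c <= 0xFFFF" would wrap around to 0 and never terminate:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, for illustration only.
final class UnmappableFinder {

    // Returns an unmappable character as a char[] (a single char for a BMP
    // hit, a surrogate pair for a supplementary hit), or null if neither
    // pass finds one. Assumes the charset supports encoding.
    static char[] findUnmappable(Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();

        // First pass: BMP from 0xFF through 0xFFFF inclusive, skipping
        // surrogates, since a dangling surrogate should be reported as
        // "malformed" rather than "unmappable".
        for (int i = 0xFF; i <= 0xFFFF; i++) {
            char c = (char) i;
            if (Character.isSurrogate(c))
                continue;
            if (!encoder.canEncode(c))
                return new char[]{c};
        }

        // Second pass: the first supplementary plane only. Every code point
        // here is represented as a surrogate pair.
        for (int i = 0x10000; i <= 0x1FFFF; i++) {
            char[] cc = Character.toChars(i);
            if (!encoder.canEncode(new String(cc)))
                return cc;
        }
        return null;
    }
}
```

With this, ISO-8859-1 should yield U+0100 from the first pass, while UTF-8
should fall through both passes and return null.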
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26635#discussion_r2257902674