On 09/29/2011 02:16 PM, Ulf Zibis wrote:
Please use spaces with ternary operators: Lines 155, 216
For brevity you could use sr instead of srcRemaining, consistent with sa, sp, sl.
420 // returns -1 if there is malformed byte(s) and the
better:
420 // returns -1 if there is/are malformed byte(s) and the
466 sp -=3;
There should be a space: sp -= 3;
Webrev has been updated accordingly.
280 if (Character.isSurrogate(c))
281 return malformedForLength(src, sp, dst, dp, 3);
Shouldn't we return cr.length() = 1 to allow the remaining 2 bytes to be
interpreted again?
Actually, I don't know the answer. My reading of D93a/D93b suggests that
we might interpret it as a whole: the bytes are in the well-formed byte
pattern range listed in Table 3.7 and are "ill-formed" only because they
encode a surrogate value rather than a scalar value, so I would interpret
the whole 3 bytes as a maximal subpart. Given that D93a/b is "best
practices for Using U+FFFD", either way should be fine. We do have Unicode
experts on the list, so maybe they can share their opinion on what the
"desired"/recommended behavior is in this case, from the Standard's point
of view?
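
To make the two options concrete, here is a minimal sketch, not the JDK's
decoder. The class SurrogateMalformedSketch and its decode method are
hypothetical names made up for this illustration, and the decoder handles
only ASCII and 3-byte sequences. It shows how the malformed length reported
for an encoded surrogate (3 versus 1) changes the number of U+FFFD
replacements.

```java
final class SurrogateMalformedSketch {
    // Hypothetical toy decoder: handles ASCII and 3-byte sequences only.
    // surrogateMalformedLength is the number of bytes consumed (1 or 3)
    // when a 3-byte sequence decodes to a surrogate code unit.
    static String decode(byte[] src, int surrogateMalformedLength) {
        StringBuilder dst = new StringBuilder();
        int sp = 0;
        while (sp < src.length) {
            int b0 = src[sp];
            if (b0 >= 0) {                          // ASCII
                dst.append((char) b0);
                sp++;
            } else if ((b0 & 0xF0) == 0xE0
                    && sp + 2 < src.length
                    && (src[sp + 1] & 0xC0) == 0x80
                    && (src[sp + 2] & 0xC0) == 0x80) {
                char c = (char) (((b0 & 0x0F) << 12)
                        | ((src[sp + 1] & 0x3F) << 6)
                        | (src[sp + 2] & 0x3F));
                if (c < 0x800) {                    // overlong form: skip one byte
                    dst.append('\uFFFD');
                    sp++;
                } else if (Character.isSurrogate(c)) {
                    dst.append('\uFFFD');           // one replacement per malformed unit
                    sp += surrogateMalformedLength;
                } else {
                    dst.append(c);
                    sp += 3;
                }
            } else {                                // anything else: malformed, length 1
                dst.append('\uFFFD');
                sp++;
            }
        }
        return dst.toString();
    }

    public static void main(String[] args) {
        // 'a', then ED A0 80 (the 3-byte form of U+D800), then 'b'
        byte[] in = { 'a', (byte) 0xED, (byte) 0xA0, (byte) 0x80, 'b' };
        System.out.println(decode(in, 3).length()); // prints 3: a, U+FFFD, b
        System.out.println(decode(in, 1).length()); // prints 5: a, 3 x U+FFFD, b
    }
}
```

With a malformed length of 3 the whole sequence becomes a single U+FFFD.
With a length of 1, the two trailing bytes are seen again as stray
continuation bytes, so each of them gets its own replacement.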
On 29.09.2011 05:27, Xueming Shen wrote:
Hi,
On 9/28/2011 3:44 PM, Ulf Zibis wrote:
5. IMHO charset CESU-8 should be hosted in extended-charsets,
otherwise it should be added to java.nio.StandardCharsets
We have lots of charsets provided via the "standard charset provider"
(in rt.jar) but not listed in StandardCharsets.
Yes, but the reason to add CESU-8 to StandardCharsets was the
supposed demand to treat all Unicode charsets equivalently.
Otherwise there is no obstacle to host CESU-8 in extended-charsets.
IMHO, CESU-8 addresses corner case compatibility issues, but not
"standard" requirements.
Putting CESU-8 into the "standard charset provider" (which is only an
implementation detail) does not mean it is a "standard" requirement; it
just means it is bundled into rt.jar. The reason I put it there is to keep
it together with UTF-8, on the assumption that you might need it around
when using the updated UTF-8, which no longer handles those 3/6-byte
surrogates.
-Sherman
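
For readers unfamiliar with the "3/6-byte surrogates" mentioned above, the
following is a minimal sketch of the difference between the two encodings
for a supplementary character. The class Cesu8Sketch and its methods are
hypothetical helpers written for this illustration, not the charset
implementation under review. The sketch encodes each UTF-16 code unit
separately, which is what gives CESU-8 its 6-byte form for a surrogate
pair.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class Cesu8Sketch {
    // Hypothetical helper: encodes every UTF-16 code unit on its own, so a
    // surrogate pair becomes two 3-byte sequences (6 bytes in total).
    // No validation is done; unpaired surrogates are encoded as they are.
    static byte[] encodeCesu8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.write(c);
            } else if (c < 0x800) {
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10400)); // U+10400
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // F0 90 90 80
        System.out.println(hex(encodeCesu8(s)));                     // ED A0 81 ED B0 80
    }
}
```

UTF-8 encodes U+10400 as one 4-byte sequence, while the CESU-8 form is the
pair of 3-byte sequences that a strict UTF-8 decoder reports as malformed.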