Re: UnicodeDecoder U+FFFE handling

li . jiang Tue, 01 Jan 2019 22:08:08 -0800

This request sounds reasonable as of Unicode 7: U+FFFE in the middle of a stream should not be considered malformed.

See the Unicode FAQ on private-use characters and noncharacters: http://www.unicode.org/faq/private_use.html


Q: Are noncharacters invalid in Unicode strings and UTFs?

A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF.


Q: So how should libraries and tools handle noncharacters?

A: Library APIs, components, and tool applications (such as low-level text editors) which handle all Unicode strings should also handle noncharacters. Often this means simple pass-through, the same way such an API or tool would handle a reserved unassigned code point.
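
As an illustration of "simple pass-through", here is a minimal sketch that round-trips a string containing U+FFFE through UTF-8 using only standard JDK calls. The class and method names (Utf8PassThrough, encode, roundTrip) are made up for this example and are not part of the JDK.

```java
import java.nio.charset.StandardCharsets;

public final class Utf8PassThrough {
    private Utf8PassThrough() {}

    /** Encodes the string as UTF-8 bytes. */
    public static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Encodes to UTF-8 and decodes again; a pass-through codec returns the input unchanged. */
    public static String roundTrip(String s) {
        return new String(encode(s), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "a\uFFFEb"; // U+FFFE in the middle of the text
        StringBuilder hex = new StringBuilder();
        for (byte b : encode(text)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());         // 61 EF BF BE 62
        System.out.println(roundTrip(text).equals(text));  // true
    }
}
```

The noncharacter is encoded as the ordinary three-byte sequence EF BF BE and survives decoding, which is the behaviour the FAQ describes.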


Thanks
Leo

On 12/24/18 3:06 AM, Clément MATHIEU wrote:

Hi,

I am wondering whether UnicodeDecoder's handling of U+FFFE is compliant with
the current Unicode specification. The suspicious code is:

        if (c == REVERSED_MARK) {
            // A reversed BOM cannot occur within middle of stream
            return CoderResult.malformedForLength(2);
        }
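
To observe this from the outside, a small probe along the following lines might help. It is only a sketch: the class and method names (ReversedMarkProbe, probe) are invented for illustration, it assumes the UTF-16BE charset goes through the decoder code quoted above, and the outcome for U+FFFE depends on the JDK version in use, so no result is claimed for that case.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class ReversedMarkProbe {
    private ReversedMarkProbe() {}

    /** Decodes UTF-16BE bytes with error reporting enabled and describes the outcome. */
    public static String probe(byte[] utf16be) {
        CharsetDecoder decoder = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(utf16be);
        // UTF-16 never produces more chars than input bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(utf16be.length);
        CoderResult result = decoder.decode(in, out, true);
        if (result.isError()) {
            return "error: " + result + " at byte offset " + in.position();
        }
        out.flip();
        return "ok: " + out;
    }

    public static void main(String[] args) {
        // Plain text decodes normally.
        System.out.println(probe(new byte[] {0x00, 0x41, 0x00, 0x42}));
        // U+FFFE between 'A' and 'B': reported as malformed or passed through,
        // depending on the decoder implementation.
        System.out.println(probe(new byte[] {0x00, 0x41, (byte) 0xFF, (byte) 0xFE, 0x00, 0x42}));
    }
}
```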

Up to Unicode 6.3, the Unicode specification said that U+FFFE is a
noncharacter and that noncharacters "should never be interchanged".
Returning CR_MALFORMED on U+FFFE appears to be correct for Java 8
(Unicode 6.2).

However, Unicode 7 changed that and now says:

       Applications are free to use any of these noncharacter code
       points internally. They have no standard interpretation when
       exchanged outside the context of internal use. However, they are
       not illegal in interchange, nor does their presence cause Unicode
       text to be ill-formed. [...] They are not prohibited from
       occurring in valid Unicode strings which happen to be
       interchanged. [...]. If a noncharacter is received in open
       interchange, an application is not required to interpret it in
       any way. It is good practice, however, to recognize it as a
       noncharacter and to take appropriate action, such as replacing it
       with U+FFFD replacement character, to indicate the problem in the
       text. It is not recommended to simply delete noncharacter code
       points from such text, because of the potential security issues
       caused by deleting uninterpreted characters.

See:
  - http://www.unicode.org/versions/corrigendum9.html
  - https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf (23.7)
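
The "good practice" in the quoted passage (recognize a noncharacter and replace it with U+FFFD rather than deleting it) could be sketched at the application level roughly as follows. The names Noncharacters, isNoncharacter and replaceNoncharacters are hypothetical helpers written for this example, not JDK APIs; the set of 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes) is taken from the Unicode Standard.

```java
public final class Noncharacters {
    private Noncharacters() {}

    /** Returns true for the 66 Unicode noncharacters. */
    public static boolean isNoncharacter(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return false;
        }
        if (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) {
            return true; // contiguous block of 32 noncharacters in the BMP
        }
        // U+nFFFE and U+nFFFF for every plane n in 0..16
        return (codePoint & 0xFFFE) == 0xFFFE;
    }

    /** Replaces each noncharacter with U+FFFD instead of silently dropping it. */
    public static String replaceNoncharacters(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp ->
                sb.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(isNoncharacter(0xFFFE)); // true
        System.out.println(isNoncharacter(0xFEFF)); // false: U+FEFF is the byte order mark
        System.out.println(replaceNoncharacters("a\uFFFEb").equals("a\uFFFDb")); // true
    }
}
```

Whether such replacement belongs in a decoder or in the application receiving the text is a separate design question; the sketch only shows that the check itself is cheap and does not require treating the input as malformed.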

Do you think that returning CR_MALFORMED is still OK?

Regards,
Clément MATHIEU
