This request sounds reasonable as of Unicode 7: U+FFFE in the middle of
a stream should not be treated as malformed.
See the FAQ about private-use characters and noncharacters [1]:
[1] http://www.unicode.org/faq/private_use.html
Q: Are noncharacters invalid in Unicode strings and UTFs?
A: Absolutely not. Noncharacters do not cause a Unicode string to be
ill-formed in any UTF.
Q: So how should libraries and tools handle noncharacters?
A: Library APIs, components, and tool applications (such as low-level
text editors) which handle all Unicode strings should also handle
noncharacters. Often this means simple pass-through, the same way such
an API or tool would handle a reserved unassigned code point.
Thanks
Leo
On 12/24/18 3:06 AM, Clément MATHIEU wrote:
Hi,
I am wondering whether UnicodeDecoder's handling of U+FFFE is compliant
with the current Unicode specification. The suspicious code is:
    if (c == REVERSED_MARK) {
        // A reversed BOM cannot occur within middle of stream
        return CoderResult.malformedForLength(2);
    }
Up to Unicode 6.3, the Unicode specification said that U+FFFE is a
noncharacter and that noncharacters "should never be interchanged".
Returning CR_MALFORMED on U+FFFE therefore appears to be correct for
Java 8 (Unicode 6.2).
However, Unicode 7 changed that and now says:
Applications are free to use any of these noncharacter code
points internally. They have no standard interpretation when
exchanged outside the context of internal use. However, they are
not illegal in interchange, nor does their presence cause Unicode
text to be ill-formed. [...] They are not prohibited from
occurring in valid Unicode strings which happen to be
interchanged. [...]. If a noncharacter is received in open
interchange, an application is not required to interpret it in
any way. It is good practice, however, to recognize it as a
noncharacter and to take appropriate action, such as replacing it
with U+FFFD replacement character, to indicate the problem in
the text. It is not recommended to simply delete noncharacter
code points from such text, because of
the potential security issues caused by deleting uninterpreted
characters.
See:
- http://www.unicode.org/versions/corrigendum9.html
- https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf (23.7)
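To make the quoted recommendation concrete, here is a minimal sketch of
"recognize a noncharacter and replace it with U+FFFD instead of deleting
it". It is not the JDK's code: the class name NoncharacterSketch and both
method names are hypothetical, and it assumes the 66 noncharacters are
U+FDD0..U+FDEF plus the last two code points of every plane.

```java
final class NoncharacterSketch {

    // Hypothetical helper: true for U+FDD0..U+FDEF and for any code point
    // whose low 16 bits are FFFE or FFFF (U+FFFE, U+FFFF, U+1FFFE, ...).
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > Character.MAX_CODE_POINT) {
            return false; // not a Unicode code point at all
        }
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Hypothetical helper: replaces each noncharacter with U+FFFD rather
    // than deleting it; all other code points are passed through unchanged.
    static String replaceNoncharacters(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(
            cp -> out.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNoncharacters("A\uFFFEB");
        // Print the code points in hex so the result is visible on any console.
        cleaned.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Whether a decoder should substitute like this or simply pass the code
point through is an application choice; the point is only that neither
option requires reporting the input as malformed.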
Do you think that returning CR_MALFORMED is still OK?
Regards,
Clément MATHIEU