Hi,

I am wondering whether UnicodeDecoder's handling of U+FFFE is compliant with the current Unicode specification. The suspicious code is:

    if (c == REVERSED_MARK) {
        // A reversed BOM cannot occur within middle of stream
        return CoderResult.malformedForLength(2);
    }

Up to Unicode 6.3, the Unicode specification said that U+FFFE is a noncharacter and that noncharacters "should never be interchanged". Returning CR_MALFORMED on U+FFFE therefore appears to be correct for Java 8 (Unicode 6.2).

However, Unicode 7 changed that and now says:

    Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use. However, they are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed. [...] They are not prohibited from occurring in valid Unicode strings which happen to be interchanged. [...] If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD replacement character, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters.

See:
- http://www.unicode.org/versions/corrigendum9.html
- https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf (23.7)

Do you think that returning CR_MALFORMED is still OK?

Regards,
Clément MATHIEU
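
P.S. In case it helps, here is a minimal sketch for observing what the decoder reports for U+FFFE in the middle of a stream. The class name FffeProbe and the method probe() are hypothetical names chosen for illustration, and picking UTF-16BE (so that the byte order is fixed from the start) is my assumption about the simplest way to reach the check quoted above. I have deliberately not written down an expected output, since the result may differ between JDK releases.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical probe class, not part of the JDK.
class FffeProbe {

    // Decodes 'a', U+FFFE, 'b' as UTF-16BE and returns the CoderResult
    // of the decode call, with malformed input reported rather than replaced.
    static CoderResult probe() {
        byte[] bytes = {
            0x00, 0x61,                 // 'a'
            (byte) 0xFF, (byte) 0xFE,   // U+FFFE, a noncharacter
            0x00, 0x62                  // 'b'
        };
        CharsetDecoder decoder = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(8);
        // endOfInput = true: the whole input is in this one buffer.
        return decoder.decode(ByteBuffer.wrap(bytes), out, true);
    }

    public static void main(String[] args) {
        // Prints the CoderResult, i.e. whether U+FFFE was reported as malformed.
        System.out.println(probe());
    }
}
```

If the decoder treats U+FFFE as malformed, probe() should return a malformed result of length 2; if it accepts it, the whole input is consumed and the result is an underflow.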