I tend to agree with your analysis that emitting U+FFFD when there is no content between escapes in "shifting" encodings like ISO-2022-JP is unnecessary, and for consistency between implementations should not be recommended.
Can you file this at http://www.unicode.org/reporting.html so that the committee can look at your proposal with an eye to changing http://www.unicode.org/reports/tr36/?

Mark

On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode <unicode@unicode.org> wrote:

> We're about to remove the U+FFFD generation for the case where there
> is no content between two ISO-2022-JP escape sequences from the WHATWG
> Encoding Standard.
>
> Is there anything wrong with my analysis that U+FFFD generation in
> that case is not a useful security measure when unnecessary
> transitions between the ASCII and Roman states do not generate U+FFFD?
>
> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen <hsivo...@hsivonen.fi>
> wrote:
> >
> > Context: https://github.com/whatwg/encoding/issues/115
> >
> > Unicode Security Considerations say:
> > "3.6.2 Some Output For All Input
> >
> > Character encoding conversion must also not simply skip an illegal
> > input byte sequence. Instead, it must stop with an error or substitute
> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
> > or an escape sequence in the output. (See also Section 3.5 Deletion of
> > Code Points.) It is important to do this not only for byte sequences
> > that encode characters, but also for unrecognized or "empty"
> > state-change sequences. For example:
> > [...]
> > ISO-2022 shift sequences without text characters before the next shift
> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
> > require at least one character in a text segment between shift
> > sequences. Security software written to the formal specification may
> > not detect malicious text (for example, "delete" with a
> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
> >
> > The WHATWG Encoding Standard bakes in this requirement by means of
> > the "ISO-2022-JP output flag"
> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) in its
> > ISO-2022-JP decoder algorithm
> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
> >
> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
> > WHATWG spec.
> >
> > After Gecko switched to encoding_rs from an implementation that didn't
> > implement this U+FFFD generation behavior (uconv), a bug has been
> > logged in the context of decoding Japanese email in Thunderbird:
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
> >
> > Ken Lunde also recalls seeing such email:
> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
> >
> > The root problem seems to be that the requirement gives ISO-2022-JP
> > the unusual and surprising property that concatenating two ISO-2022-JP
> > outputs from a conforming encoder can result in a byte sequence that
> > is non-conforming as input to an ISO-2022-JP decoder.
> >
> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
> > sequence is immediately followed by another ISO-2022-JP escape
> > sequence. Chrome and Safari do, but their implementations of
> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's
> > decoder implementations generally are informed by the Encoding
> > Standard (though the ISO-2022-JP decoder specifically might not be
> > yet), and I suspect that Safari's implementation (ICU) is either
> > informed by Unicode Security Considerations or vice versa.
> >
> > The example given as rationale in Unicode Security Considerations,
> > obfuscating the ASCII string "delete", could be accomplished by
> > alternating between the ASCII and Roman states so that every other
> > character is in the ASCII state and the rest are in the Roman state.
> >
> > Is the requirement to generate U+FFFD when there is no content between
> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
> > transitions or useless transitions between ASCII and Roman are not
> > also required to generate U+FFFD? Would it even be feasible (in terms
> > of interop with legacy encoders) to make useless transitions between
> > ASCII and Roman generate U+FFFD?
> >
> > --
> > Henri Sivonen
> > hsivo...@hsivonen.fi
> > https://hsivonen.fi/
> >
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
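
For readers who want to see the two cases from the quoted analysis side by side, here is a minimal Python sketch. It is not taken from the Encoding Standard, encoding_rs, or ICU. The helper names `escapes_without_content` and `alternate_ascii_roman` are hypothetical, and the scanner only looks for the five escape sequences that the WHATWG ISO-2022-JP decoder recognizes. The byte strings `FIRST` and `SECOND` are assumed to be what a conforming encoder emits for the single characters あ and い.

```python
from typing import List

# The escape sequences recognized by the WHATWG ISO-2022-JP decoder.
ESCAPES = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "Roman",
    b"\x1b(I": "Katakana",
    b"\x1b$@": "JIS X 0208 (lead byte state)",
    b"\x1b$B": "JIS X 0208 (lead byte state)",
}


def escapes_without_content(data: bytes) -> List[int]:
    """Return offsets of escapes that directly follow another escape."""
    offsets = []
    previous_was_escape = False
    i = 0
    while i < len(data):
        if data[i:i + 3] in ESCAPES:
            if previous_was_escape:
                offsets.append(i)
            previous_was_escape = True
            i += 3
        else:
            previous_was_escape = False
            i += 1
    return offsets


def alternate_ascii_roman(text: str) -> bytes:
    """Hypothetical obfuscator: switch state before every ASCII character."""
    out = bytearray()
    for index, ch in enumerate(text):
        out += b"\x1b(B" if index % 2 == 0 else b"\x1b(J"
        out += ch.encode("ascii")
    out += b"\x1b(B"  # end in the ASCII state
    return bytes(out)


# Assumed encoder output for "あ" and "い": switch to JIS X 0208, two bytes,
# then switch back to ASCII.
FIRST = b'\x1b$B$"\x1b(B'
SECOND = b"\x1b$B$$\x1b(B"

# Each output alone has content between its escapes; the concatenation does not.
print(escapes_without_content(FIRST))           # []
print(escapes_without_content(FIRST + SECOND))  # [8]

# "delete" split across ASCII and Roman never has two adjacent escapes.
print(escapes_without_content(alternate_ascii_roman("delete")))  # []
```

Under these assumptions, the check that would trigger U+FFFD fires on the harmless concatenation of two encoder outputs but not on the obfuscated "delete", which is the asymmetry the question above is about.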