Sorry about the delay. There is now https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf
On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️ <m...@macchiato.com> wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no
> content between escapes in "shifting" encodings like ISO-2022-JP is
> unnecessary, and for consistency between implementations should not be
> recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the
> committee can look at your proposal with an eye to changing
> http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode
> <unicode@unicode.org> wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there
>> is no content between two ISO-2022-JP escape sequences from the WHATWG
>> Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in
>> that case is not a useful security measure when unnecessary
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen <hsivo...@hsivonen.fi> wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal
>> > input byte sequence. Instead, it must stop with an error or substitute
>> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
>> > or an escape sequence in the output. (See also Section 3.5 Deletion of
>> > Code Points.) It is important to do this not only for byte sequences
>> > that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next shift
>> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
>> > require at least one character in a text segment between shift
>> > sequences. Security software written to the formal specification may
>> > not detect malicious text (for example, "delete" with a
>> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement in by means of the
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) in its
>> > ISO-2022-JP decoder algorithm
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
>> > WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that didn't
>> > implement this U+FFFD generation behavior (uconv), a bug has been
>> > logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP
>> > the unusual and surprising property that concatenating two ISO-2022-JP
>> > outputs from a conforming encoder can result in a byte sequence that
>> > is non-conforming as input to an ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
>> > sequence is immediately followed by another ISO-2022-JP escape
>> > sequence. Chrome and Safari do, but their implementations of
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's
>> > decoder implementations generally are informed by the Encoding
>> > Standard (though the ISO-2022-JP decoder specifically might not be
>> > yet), and I suspect that Safari's implementation (ICU) is either
>> > informed by Unicode Security Considerations or vice versa.
>> >
>> > The example given as rationale in Unicode Security Considerations,
>> > obfuscating the ASCII string "delete", could be accomplished by
>> > alternating between the ASCII and Roman states so that every other
>> > character is in the ASCII state and the rest in the Roman state.
>> >
>> > Is the requirement to generate U+FFFD when there is no content between
>> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
>> > transitions or useless transitions between ASCII and Roman are not
>> > also required to generate U+FFFD? Would it even be feasible (in terms
>> > of interop with legacy encoders) to make useless transitions between
>> > ASCII and Roman generate U+FFFD?
>> >
>> > --
>> > Henri Sivonen
>> > hsivo...@hsivonen.fi
>> > https://hsivonen.fi/
>>
>> --
>> Henri Sivonen
>> hsivo...@hsivonen.fi
>> https://hsivonen.fi/
>>

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
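
P.S. For readers who want to see the concatenation property and the "delete" example from the quoted thread concretely, here is a minimal sketch in Python. It is not the WHATWG decoder algorithm and not what encoding_rs does; `empty_segments` is a hypothetical helper that only reports where one escape sequence directly follows another. The escape set it recognizes is an assumption (ESC ( B, ESC ( J, ESC ( I, ESC $ @, ESC $ B), and it relies on Python's built-in `iso2022_jp` codec returning to the ASCII state at the end of each encoded string.

```python
ESC = 0x1B

# Two-byte tails of the escape sequences this sketch recognizes
# (an assumed set, chosen for illustration only).
ESCAPES = {
    b"(B": "ASCII",
    b"(J": "Roman",
    b"(I": "Katakana",
    b"$@": "JIS X 0208",
    b"$B": "JIS X 0208",
}


def empty_segments(data: bytes) -> list[int]:
    """Return byte offsets of escapes that directly follow another escape."""
    hits = []
    i = 0
    prev_was_escape = False  # an escape at the very start is not reported
    while i < len(data):
        if data[i] == ESC and data[i + 1 : i + 3] in ESCAPES:
            if prev_was_escape:
                hits.append(i)
            prev_was_escape = True
            i += 3
        else:
            prev_was_escape = False
            i += 1
    return hits


# Each encoder output on its own has content between its escapes ...
one = "あ".encode("iso2022_jp")
print(empty_segments(one))

# ... but concatenating two outputs puts the trailing return-to-ASCII
# escape right next to the leading shift-to-JIS-X-0208 escape.
print(empty_segments(one + one))

# The TR36 example: "delete" with an empty double-byte segment in the middle.
print(empty_segments(b"del\x1b$B\x1b(Bete"))

# The alternative obfuscation: alternate ASCII and Roman per character.
# No escape is adjacent to another, so nothing is reported.
print(empty_segments(b"d\x1b(Je\x1b(Bl\x1b(Je\x1b(Bt\x1b(Je\x1b(B"))
```

The last two calls show the asymmetry the thread is about: a check for adjacent escapes catches the first obfuscation of "delete" but not the second.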