RE: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences
IMO, encodings, particularly stateful ones such as this, may have multiple ways to output the same, or similar, sequences. That means that pretty much any time an encoding transforms data, any previous security or other validation-style checks are no longer valid, and any security/validation must be checked again.

I've seen numerous mistakes due to people expecting encodings to play nicely, particularly if there are different endpoints that may use different implementations with slightly different behaviors.

-Shawn
Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences
In terms of deployed ISO-2022-JP encoders which don't follow WHATWG behaviour, here's Python's (apparently contributed to Python by one Hye-Shik Chang):

>>> "a¥bc~¥d".encode("iso-2022-jp")
b'a\x1b(J\\\x1b(Bbc~\x1b(J\\\x1b(Bd'

This is, so far as I can tell, valid per the RFC (and of course ECMA-35 itself), but not per the WHATWG, whose output would be (to use another bytestring literal) b'a\x1b(J\\bc\x1b(B~\x1b(J\\d\x1b(B'.

The difference is that Python's encoder appears to use a preference order of codesets, with ASCII before JIS-Roman, while the WHATWG logic is to encode the next character in the current codeset if possible, and to switch to another only if it is not.

-- Har
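The "stay in the current codeset" logic described in this message can be sketched roughly as follows. This is a toy illustration restricted to the ASCII and JIS-Roman states; the names `encode_sticky`, `ascii_byte` and `roman_byte` are invented for this example, and it is not the full WHATWG encoder (no JIS X 0208, no special handling of control characters).

```python
ESC_ASCII = b"\x1b(B"  # escape sequence that designates ASCII
ESC_ROMAN = b"\x1b(J"  # escape sequence that designates JIS X 0201 Roman


def ascii_byte(ch):
    """Return the ASCII byte for ch, or None if ch is not ASCII."""
    cp = ord(ch)
    return cp if cp < 0x80 else None


def roman_byte(ch):
    """Return the JIS-Roman byte for ch, or None if it has none."""
    if ch == "\u00a5":  # YEN SIGN takes the place of 0x5C
        return 0x5C
    if ch == "\u203e":  # OVERLINE takes the place of 0x7E
        return 0x7E
    cp = ord(ch)
    if cp < 0x80 and cp not in (0x5C, 0x7E):
        return cp
    return None


def encode_sticky(text):
    """Encode text, switching codeset only when the current one cannot
    represent the next character (hypothetical helper for illustration)."""
    out = bytearray()
    state = "ascii"
    for ch in text:
        a, r = ascii_byte(ch), roman_byte(ch)
        if a is None and r is None:
            raise ValueError(f"outside the two codesets of this sketch: {ch!r}")
        if state == "ascii" and a is None:
            out += ESC_ROMAN
            state = "roman"
        elif state == "roman" and r is None:
            out += ESC_ASCII
            state = "ascii"
        out.append(a if state == "ascii" else r)
    if state != "ascii":
        out += ESC_ASCII  # return to ASCII at the end of the stream
    return bytes(out)


sticky = encode_sticky("a¥bc~¥d")
print(sticky)  # b'a\x1b(J\\bc\x1b(B~\x1b(J\\d\x1b(B'
# Python's own codec, for comparison with the REPL session quoted above.
print("a¥bc~¥d".encode("iso-2022-jp"))
```

Both byte strings represent the same text; they differ only in where the escape sequences fall, which is exactly the kind of variation a byte-level check has to be prepared for.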
Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences
Sorry about the delay. There is now https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
RE: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences
IMO, trying to do security checks on an encoded string that will be decoded later is pretty much guaranteed to miss cases. This is particularly true of ISO-2022-JP, which has a plethora of variations in how different software/libraries/OSes decode it and treat the invalid/edge cases.

I typically encourage security checks on encodings to be done after the translation to Unicode has been done, but that only works if the Unicode stream itself is what is being checked. E.g., a firewall may not decode the data the same way as its end recipient does.

Which I guess is the point of the encoding project, but... nobody can guarantee that an endpoint conforms to any "standard", so from a security perspective the recommended guidance is pretty much moot; secure applications have to consider non-conforming behavior of endpoints as well.

Providing a "best practice" or suggestions in a standard is nice, but in practice systems are going to have differing interpretations and behaviors. Applications can't "depend" on any consistency. Even if all the standards documents agreed, there'd still be legacy implementations that people didn't update for whatever reason, and other implementations would miss some of the subtleties (or less subtle differences) of the standards.

IMO, all of the "state shifting" encodings should be treated with care by software. There are a lot of ways to encode the same or similar strings, and you never know what kind of validation happened "on the other end". It's pretty much a given that ISO-2022-JP, particularly its edge cases, is going to be interpreted differently by different applications.

-Shawn
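To make the point about checking encoded bytes concrete, here is a small sketch. It assumes Python's built-in `iso-2022-jp` codec as the "end recipient"; other decoders may treat the same bytes differently, which is rather the point. The helper name `obfuscate` is made up for this example.

```python
ESC_ASCII = b"\x1b(B"  # designate ASCII
ESC_ROMAN = b"\x1b(J"  # designate JIS X 0201 Roman


def obfuscate(word):
    """Put an escape sequence before every byte, alternating between the
    ASCII and Roman states, so the word never appears contiguously."""
    out = bytearray()
    for i, byte in enumerate(word):
        out += ESC_ROMAN if i % 2 else ESC_ASCII
        out.append(byte)
    out += ESC_ASCII  # finish in the ASCII state
    return bytes(out)


payload = obfuscate(b"delete")

# A check on the encoded bytes does not see the word...
print(b"delete" in payload)  # False

# ...but the text is there once the bytes have been decoded.
decoded = payload.decode("iso-2022-jp")
print("delete" in decoded)  # True
```

Note that no escape sequence here is directly followed by another one, so a rule about empty segments would not catch this payload either.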
Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences
I tend to agree with your analysis that emitting U+FFFD when there is no content between escapes in "shifting" encodings like ISO-2022-JP is unnecessary, and for consistency between implementations should not be recommended.

Can you file this at http://www.unicode.org/reporting.html so that the committee can look at your proposal with an eye to changing http://www.unicode.org/reports/tr36/?

Mark
Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences
We're about to remove the U+FFFD generation for the case where there is no content between two ISO-2022-JP escape sequences from the WHATWG Encoding Standard.

Is there anything wrong with my analysis that U+FFFD generation in that case is not a useful security measure when unnecessary transitions between the ASCII and Roman states do not generate U+FFFD?

On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen wrote:
>
> Context: https://github.com/whatwg/encoding/issues/115
>
> Unicode Security Considerations say:
> "3.6.2 Some Output For All Input
>
> Character encoding conversion must also not simply skip an illegal
> input byte sequence. Instead, it must stop with an error or substitute
> a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
> or an escape sequence in the output. (See also Section 3.5 Deletion of
> Code Points.) It is important to do this not only for byte sequences
> that encode characters, but also for unrecognized or "empty"
> state-change sequences. For example:
> [...]
> ISO-2022 shift sequences without text characters before the next shift
> sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
> require at least one character in a text segment between shift
> sequences. Security software written to the formal specification may
> not detect malicious text (for example, "delete" with a
> shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
> (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>
> The WHATWG Encoding Standard bakes in this requirement by means of the
> "ISO-2022-JP output flag"
> (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) in its
> ISO-2022-JP decoder algorithm
> (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>
> encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
> WHATWG spec.
>
> After Gecko switched to encoding_rs from an implementation that didn't
> implement this U+FFFD generation behavior (uconv), a bug has been
> logged in the context of decoding Japanese email in Thunderbird:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>
> Ken Lunde also recalls seeing such email:
> https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>
> The root problem seems to be that the requirement gives ISO-2022-JP
> the unusual and surprising property that concatenating two ISO-2022-JP
> outputs from a conforming encoder can result in a byte sequence that
> is non-conforming as input to an ISO-2022-JP decoder.
>
> Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
> sequence is immediately followed by another ISO-2022-JP escape
> sequence. Chrome and Safari do, but their implementations of
> ISO-2022-JP aren't independent of each other. Moreover, Chrome's
> decoder implementations generally are informed by the Encoding
> Standard (though the ISO-2022-JP decoder specifically might not be
> yet), and I suspect that Safari's implementation (ICU) is either
> informed by Unicode Security Considerations or vice versa.
>
> The example given as rationale in Unicode Security Considerations,
> obfuscating the ASCII string "delete", could be accomplished by
> alternating between the ASCII and Roman states so that every other
> character is in the ASCII state and the rest in the Roman state.
>
> Is the requirement to generate U+FFFD when there is no content between
> ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
> transitions or useless transitions between ASCII and Roman are not
> also required to generate U+FFFD? Would it even be feasible (in terms
> of interop with legacy encoders) to make useless transitions between
> ASCII and Roman generate U+FFFD?
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
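The concatenation property mentioned in this message can be illustrated with a short sketch. It is a simplified stand-in, not the WHATWG decoder or its output flag: the function `find_empty_segments` and the escape-sequence pattern are assumptions made for this example, and Python's built-in codec is used only as a convenient encoder.

```python
import re

# Escape sequences commonly used in ISO-2022-JP streams
# (an assumption of this sketch, not a complete ISO-2022 parser).
ESCAPE = re.compile(rb"\x1b(?:\([BJI]|\$[@B])")


def find_empty_segments(data):
    """Return the offsets of escape sequences that directly follow
    another escape sequence, i.e. with no content in between."""
    offsets = []
    previous_end = None
    for match in ESCAPE.finditer(data):
        if match.start() == previous_end:
            offsets.append(match.start())
        previous_end = match.end()
    return offsets


# Two separately encoded strings: each output ends by returning to ASCII,
# and the second one begins by switching to JIS X 0208.
first = "日本".encode("iso-2022-jp")
second = "語".encode("iso-2022-jp")
joined = first + second

print(find_empty_segments(first))   # []
print(find_empty_segments(second))  # []
print(find_empty_segments(joined))  # one offset, at the seam
```

Each part on its own has content between its escape sequences; only the seam created by concatenation places one escape sequence directly after another, which is the case a decoder applying the U+FFFD rule would reject.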