RE: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Shawn Steele via Unicode

IMO, encodings, particularly ones depending on state such as this, may have
multiple ways to output the same, or similar, sequences.  This means that
pretty much any time an encoding transforms data, any previous security or other
validation-style checks are no longer valid, and any security/validation must be
checked again.  I've seen numerous mistakes due to people expecting
encodings to play nicely, particularly if there are different endpoints that
may use different implementations with slightly different behaviors.

-Shawn

-Original Message-
From: Unicode  On Behalf Of Henri Sivonen via 
Unicode
Sent: Sunday, August 16, 2020 11:39 PM
To: Mark Davis ☕️ 
Cc: Unicode Public 
Subject: Re: Generating U+FFFD when there's no content between ISO-2022-JP 
escape sequences

Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️  wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no 
> content between escapes in "shifting" encodings like ISO-2022-JP is 
> unnecessary, and for consistency between implementations should not be 
> recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the 
> committee can look at your proposal with an eye to changing 
> http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode 
>  wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there 
>> is no content between two ISO-2022-JP escape sequences from the 
>> WHATWG Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in 
>> that case is not a useful security measure when unnecessary 
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal 
>> > input byte sequence. Instead, it must stop with an error or 
>> > substitute a replacement character (such as U+FFFD ( � )
>> > REPLACEMENT CHARACTER) or an escape sequence in the output. (See 
>> > also Section 3.5 Deletion of Code Points.) It is important to do 
>> > this not only for byte sequences that encode characters, but also for 
>> > unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next 
>> > shift sequence. The formal syntaxes for HZ and most CJK ISO-2022 
>> > variants require at least one character in a text segment between 
>> > shift sequences. Security software written to the formal 
>> > specification may not detect malicious text  (for example, "delete" 
>> > with a shift-to-double-byte then an immediate shift-to-ASCII in the 
>> > middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of 
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into 
>> > its ISO-2022-JP decoder algorithm 
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements 
>> > the WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that 
>> > didn't implement this U+FFFD generation behavior (uconv), a bug has 
>> > been logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP 
>> > the unusual and surprising property that concatenating two 
>> > ISO-2022-JP outputs from a conforming encoder can result in a byte 
>> > sequence that is non-conforming as input to an ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP 
>> > escape sequence is immediately followed by another ISO-2022-JP 
>> > escape sequence. Chrome and Safari do, but their implementations of 
>> > [...]

Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Harriet Riddle via Unicode

In terms of deployed ISO-2022-JP encoders which don't follow WHATWG behaviour, 
here's Python's (apparently contributed to Python by one Hye-Shik Chang):

>>> "a¥bc~¥d".encode("iso-2022-jp")
b'a\x1b(J\\\x1b(Bbc~\x1b(J\\\x1b(Bd'

This is, so far as I can tell, valid per the RFC (and of course ECMA-35 itself),
but not per the WHATWG, whose output would be (to use another bytestring
literal) b'a\x1b(J\\bc\x1b(B~\x1b(J\\d\x1b(B'. The difference is that
Python's encoder appears to be using a preference order of codesets, with ASCII
being before JIS-Roman, while the WHATWG logic is to encode the next character
in the current codeset if possible, and switch to another if it is not.
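
To illustrate the difference, here is a minimal Python sketch of the two
strategies as described above. It is a toy encoder covering only the ASCII and
JIS X 0201 Roman states, not Python's codec or the WHATWG algorithm; the names
to_byte, encode and the sticky parameter are made up for this example.

```python
ESC_ASCII = b"\x1b(B"  # escape sequence selecting ASCII
ESC_ROMAN = b"\x1b(J"  # escape sequence selecting JIS X 0201 Roman


def to_byte(ch, state):
    """Return the byte that encodes ch in the given state, or None."""
    if state == "ascii":
        return ord(ch) if ord(ch) < 0x80 else None
    # JIS X 0201 Roman matches ASCII except 0x5C (YEN SIGN) and 0x7E (OVERLINE).
    if ch == "\u00a5":
        return 0x5C
    if ch == "\u203e":
        return 0x7E
    if ord(ch) < 0x80 and ch not in "\\~":
        return ord(ch)
    return None


def encode(text, sticky):
    """sticky=False always prefers ASCII; sticky=True tries the current state first."""
    out = bytearray()
    state = "ascii"
    for ch in text:
        order = ["ascii", "roman"]
        if sticky and state == "roman":
            order.reverse()  # current state gets the first try
        target = next((s for s in order if to_byte(ch, s) is not None), None)
        if target is None:
            raise ValueError(f"not encodable in this sketch: {ch!r}")
        if target != state:
            out += ESC_ROMAN if target == "roman" else ESC_ASCII
            state = target
        out.append(to_byte(ch, target))
    if state != "ascii":
        out += ESC_ASCII  # finish in the ASCII state
    return bytes(out)


text = "a\u00a5bc~\u00a5d"
print(encode(text, sticky=False))  # preference-order strategy
print(encode(text, sticky=True))   # stay-in-current-state strategy
```

Under these assumptions the first call reproduces the byte string Python
emitted above and the second reproduces the WHATWG-style byte string.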

-- Har


From: Unicode  on behalf of Henri Sivonen via 
Unicode 
Sent: 17 August 2020 08:38
To: Mark Davis ☕️ 
Cc: Unicode Public 
Subject: Re: Generating U+FFFD when there's no content between ISO-2022-JP 
escape sequences

Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️  wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no 
> content between escapes in "shifting" encodings like ISO-2022-JP is 
> unnecessary, and for consistency between implementations should not be 
> recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the 
> committee can look at your proposal with an eye to changing 
> http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode 
>  wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there
>> is no content between two ISO-2022-JP escape sequences from the WHATWG
>> Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in
>> that case is not a useful security measure when unnecessary
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal
>> > input byte sequence. Instead, it must stop with an error or substitute
>> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
>> > or an escape sequence in the output. (See also Section 3.5 Deletion of
>> > Code Points.) It is important to do this not only for byte sequences
>> > that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next shift
>> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
>> > require at least one character in a text segment between shift
>> > sequences. Security software written to the formal specification may
>> > not detect malicious text  (for example, "delete" with a
>> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
>> > ISO-2022-JP decoder algorithm
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
>> > WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that didn't
>> > implement this U+FFFD generation behavior (uconv), a bug has been
>> > logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP
>> > the unusual and surprising property that concatenating two ISO-2022-JP
>> > outputs from a conforming encoder can result in a byte sequence that
>> > is non-conforming as input to an ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
>> > [...]

Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-16 Thread Henri Sivonen via Unicode

Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️  wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no 
> content between escapes in "shifting" encodings like ISO-2022-JP is 
> unnecessary, and for consistency between implementations should not be 
> recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the 
> committee can look at your proposal with an eye to changing 
> http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode 
>  wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there
>> is no content between two ISO-2022-JP escape sequences from the WHATWG
>> Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in
>> that case is not a useful security measure when unnecessary
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal
>> > input byte sequence. Instead, it must stop with an error or substitute
>> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
>> > or an escape sequence in the output. (See also Section 3.5 Deletion of
>> > Code Points.) It is important to do this not only for byte sequences
>> > that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next shift
>> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
>> > require at least one character in a text segment between shift
>> > sequences. Security software written to the formal specification may
>> > not detect malicious text  (for example, "delete" with a
>> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
>> > ISO-2022-JP decoder algorithm
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
>> > WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that didn't
>> > implement this U+FFFD generation behavior (uconv), a bug has been
>> > logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP
>> > the unusual and surprising property that concatenating two ISO-2022-JP
>> > outputs from a conforming encoder can result in a byte sequence that
>> > is non-conforming as input to an ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
>> > sequence is immediately followed by another ISO-2022-JP escape
>> > sequence. Chrome and Safari do, but their implementations of
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's
>> > decoder implementations generally are informed by the Encoding
>> > Standard (though the ISO-2022-JP decoder specifically might not be
>> > yet), and I suspect that Safari's implementation (ICU) is either
>> > informed by Unicode Security Considerations or vice versa.
>> >
>> > The example given as rationale in Unicode Security Considerations,
>> > obfuscating the ASCII string "delete", could be accomplished by
>> > alternating between the ASCII and Roman states so that every other
>> > character is in the ASCII state and the rest in the Roman state.
>> >
>> > Is the requirement to generate U+FFFD when there is no content between
>> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
>> > transitions or useless transitions between ASCII and Roman are not
>> > also required to generate U+FFFD? Would it even be feasible (in terms
>> > of interop with legacy encoders) to make useless transitions between
>> > ASCII and Roman generate U+FFFD?
>> >
>> > --
>> > Henri Sivonen
>> > hsivo...@hsivonen.fi
>> > https://hsivonen.fi/
>>
>>
>>
>> --
>> Henri Sivonen
>> hsivo...@hsivonen.fi
>> https://hsivonen.fi/
>>


-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/

RE: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2018-12-10 Thread Shawn Steele via Unicode

IMO, trying to do security checks on an encoded string that will be decoded 
later is pretty much guaranteed to miss cases.  Particularly with ISO-2022-JP, 
which has a plethora of variations in how different software/libraries/OS's 
decode it and treat the invalid/edge cases.

I typically encourage security checks on encodings to be done after the
translation to Unicode has been done, but that only works if that Unicode
stream itself is what is being checked.  E.g., a firewall may not decode it the
same way as the end recipient of the data.  Which I guess is the point of the
encoding project, but... nobody can guarantee that an endpoint conforms to
any "standard", so from a security perspective the recommended guidance is
pretty much moot: secure applications have to consider non-conforming behavior
of endpoints as well.
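
As a small illustration of checking after decoding rather than before, the
sketch below hides the ASCII string "delete" by alternating between the ASCII
and JIS X 0201 Roman states. It uses Python's built-in iso2022_jp codec as one
example decoder; other decoders may behave differently, which is the concern
raised above. The function names and the FORBIDDEN constant are invented for
this example.

```python
# ESC ( J selects JIS X 0201 Roman and ESC ( B selects ASCII, so the letters
# of "delete" are never adjacent in the encoded byte stream.
raw = b"d\x1b(Je\x1b(Bl\x1b(Je\x1b(Bt\x1b(Je\x1b(B"

FORBIDDEN = "delete"


def byte_level_check(data):
    """Naive filter on the encoded bytes; True means 'looks safe'."""
    return FORBIDDEN.encode("ascii") not in data


def decoded_check(data):
    """Decode first, then validate the resulting Unicode text."""
    text = data.decode("iso2022_jp")
    return FORBIDDEN not in text


print(byte_level_check(raw))  # the byte-level filter does not see the word
print(decoded_check(raw))     # the check on decoded text does
```

The decoded check is only meaningful if the checker and the final consumer
decode the bytes the same way.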

Providing a "best practice" or suggestions in a standard is nice, but in 
practice systems are going to have differing interpretations and behaviors. 
Applications can't "depend" on any consistency.  Even if all the standard 
documents agreed, there'd still be legacy implementations that people didn't 
update for whatever reason and other implementations would miss some of the 
subtleties (or less subtle differences) of the standards. 

IMO, all of the "state shifting" encodings should be treated with care by
software.  There are a lot of ways to encode the same or similar strings,
and you never know what kind of validation happened "on the
other end".  It's pretty much a given that ISO-2022-JP, particularly its edge
cases, is going to be interpreted differently by different applications.

-Shawn

Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2018-12-10 Thread Mark Davis ☕️ via Unicode

I tend to agree with your analysis that emitting U+FFFD when there is no
content between escapes in "shifting" encodings like ISO-2022-JP is
unnecessary, and for consistency between implementations should not be
recommended.

Can you file this at http://www.unicode.org/reporting.html so that the
committee can look at your proposal with an eye to changing
http://www.unicode.org/reports/tr36/?

Mark


On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode <
unicode@unicode.org> wrote:

> We're about to remove the U+FFFD generation for the case where there
> is no content between two ISO-2022-JP escape sequences from the WHATWG
> Encoding Standard.
>
> Is there anything wrong with my analysis that U+FFFD generation in
> that case is not a useful security measure when unnecessary
> transitions between the ASCII and Roman states do not generate U+FFFD?
>
> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen 
> wrote:
> >
> > Context: https://github.com/whatwg/encoding/issues/115
> >
> > Unicode Security Considerations say:
> > "3.6.2 Some Output For All Input
> >
> > Character encoding conversion must also not simply skip an illegal
> > input byte sequence. Instead, it must stop with an error or substitute
> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
> > or an escape sequence in the output. (See also Section 3.5 Deletion of
> > Code Points.) It is important to do this not only for byte sequences
> > that encode characters, but also for unrecognized or "empty"
> > state-change sequences. For example:
> > [...]
> > ISO-2022 shift sequences without text characters before the next shift
> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
> > require at least one character in a text segment between shift
> > sequences. Security software written to the formal specification may
> > not detect malicious text  (for example, "delete" with a
> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
> >
> > The WHATWG Encoding Standard bakes this requirement by the means of
> > "ISO-2022-JP output flag"
> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
> > ISO-2022-JP decoder algorithm
> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
> >
> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
> > WHATWG spec.
> >
> > After Gecko switched to encoding_rs from an implementation that didn't
> > implement this U+FFFD generation behavior (uconv), a bug has been
> > logged in the context of decoding Japanese email in Thunderbird:
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
> >
> > Ken Lunde also recalls seeing such email:
> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
> >
> > The root problem seems to be that the requirement gives ISO-2022-JP
> > the unusual and surprising property that concatenating two ISO-2022-JP
> > outputs from a conforming encoder can result in a byte sequence that
> > is non-conforming as input to an ISO-2022-JP decoder.
> >
> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
> > sequence is immediately followed by another ISO-2022-JP escape
> > sequence. Chrome and Safari do, but their implementations of
> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's
> > decoder implementations generally are informed by the Encoding
> > Standard (though the ISO-2022-JP decoder specifically might not be
> > yet), and I suspect that Safari's implementation (ICU) is either
> > informed by Unicode Security Considerations or vice versa.
> >
> > The example given as rationale in Unicode Security Considerations,
> > obfuscating the ASCII string "delete", could be accomplished by
> > alternating between the ASCII and Roman states so that every other
> > character is in the ASCII state and the rest in the Roman state.
> >
> > Is the requirement to generate U+FFFD when there is no content between
> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
> > transitions or useless transitions between ASCII and Roman are not
> > also required to generate U+FFFD? Would it even be feasible (in terms
> > of interop with legacy encoders) to make useless transitions between
> > ASCII and Roman generate U+FFFD?
> >
> > --
> > Henri Sivonen
> > hsivo...@hsivonen.fi
> > https://hsivonen.fi/
>
>
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
>
>

Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2018-12-10 Thread Henri Sivonen via Unicode

We're about to remove the U+FFFD generation for the case where there
is no content between two ISO-2022-JP escape sequences from the WHATWG
Encoding Standard.

Is there anything wrong with my analysis that U+FFFD generation in
that case is not a useful security measure when unnecessary
transitions between the ASCII and Roman states do not generate U+FFFD?
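
The analysis can be sketched with a toy decoder. This is a simplified model
that handles only the ASCII and JIS X 0201 Roman escape sequences; it is not the
WHATWG decoder algorithm, and the names decode, flag_empty_segments and
escape_pending are invented for this example.

```python
REPLACEMENT = "\ufffd"
ESCAPES = {b"\x1b(B": "ascii", b"\x1b(J": "roman"}


def decode(data, flag_empty_segments):
    """Toy decoder; optionally emits U+FFFD for an escape that directly follows another."""
    out = []
    state = "ascii"
    escape_pending = False  # True from an escape until a character is output
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in ESCAPES:
            if escape_pending and flag_empty_segments:
                out.append(REPLACEMENT)  # no content since the previous escape
            state = ESCAPES[seq]
            escape_pending = True
            i += 3
            continue
        byte = data[i]
        if state == "roman" and byte == 0x5C:
            out.append("\u00a5")  # YEN SIGN
        elif state == "roman" and byte == 0x7E:
            out.append("\u203e")  # OVERLINE
        elif byte < 0x80 and byte != 0x1B:
            out.append(chr(byte))
        else:
            out.append(REPLACEMENT)  # unsupported in this sketch
        escape_pending = False
        i += 1
    return "".join(out)


one = b"\x1b(J\\\x1b(B"  # a yen sign, ending back in the ASCII state
alternating = b"d\x1b(Je\x1b(Bl\x1b(Je\x1b(Bt\x1b(Je\x1b(B"

print(decode(one + one, flag_empty_segments=True))       # concatenation is flagged
print(decode(one + one, flag_empty_segments=False))      # concatenation is accepted
print(decode(alternating, flag_empty_segments=True))     # "delete", never flagged
```

In this model the concatenation of two individually valid outputs picks up a
U+FFFD, while the alternating ASCII/Roman obfuscation of "delete" passes
through untouched with the flag on or off.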

On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>
> Context: https://github.com/whatwg/encoding/issues/115
>
> Unicode Security Considerations say:
> "3.6.2 Some Output For All Input
>
> Character encoding conversion must also not simply skip an illegal
> input byte sequence. Instead, it must stop with an error or substitute
> a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
> or an escape sequence in the output. (See also Section 3.5 Deletion of
> Code Points.) It is important to do this not only for byte sequences
> that encode characters, but also for unrecognized or "empty"
> state-change sequences. For example:
> [...]
> ISO-2022 shift sequences without text characters before the next shift
> sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
> require at least one character in a text segment between shift
> sequences. Security software written to the formal specification may
> not detect malicious text  (for example, "delete" with a
> shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
> (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>
> The WHATWG Encoding Standard bakes this requirement by the means of
> "ISO-2022-JP output flag"
> (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
> ISO-2022-JP decoder algorithm
> (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>
> encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
> WHATWG spec.
>
> After Gecko switched to encoding_rs from an implementation that didn't
> implement this U+FFFD generation behavior (uconv), a bug has been
> logged in the context of decoding Japanese email in Thunderbird:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>
> Ken Lunde also recalls seeing such email:
> https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>
> The root problem seems to be that the requirement gives ISO-2022-JP
> the unusual and surprising property that concatenating two ISO-2022-JP
> outputs from a conforming encoder can result in a byte sequence that
> is non-conforming as input to an ISO-2022-JP decoder.
>
> Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
> sequence is immediately followed by another ISO-2022-JP escape
> sequence. Chrome and Safari do, but their implementations of
> ISO-2022-JP aren't independent of each other. Moreover, Chrome's
> decoder implementations generally are informed by the Encoding
> Standard (though the ISO-2022-JP decoder specifically might not be
> yet), and I suspect that Safari's implementation (ICU) is either
> informed by Unicode Security Considerations or vice versa.
>
> The example given as rationale in Unicode Security Considerations,
> obfuscating the ASCII string "delete", could be accomplished by
> alternating between the ASCII and Roman states so that every other
> character is in the ASCII state and the rest in the Roman state.
>
> Is the requirement to generate U+FFFD when there is no content between
> ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
> transitions or useless transitions between ASCII and Roman are not
> also required to generate U+FFFD? Would it even be feasible (in terms
> of interop with legacy encoders) to make useless transitions between
> ASCII and Roman generate U+FFFD?
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/



-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
