> On Mon, Nov 15, 2021 at 06:15:40PM -0500, David Malcolm wrote: > > > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote: > > > > Ping, can we conclude on the name? IMHO, -Wbidirectional is just fine, > > > > but changing the name is a trivial operation. > > > > > > Here's a patch with a better name (suggested by Jonathan W.). Otherwise > > > no > > > changes. > > > > Thanks for implementing this. > > > > > > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? > > > > > > -- >8 -- > > > From a link below: > > > "An issue was discovered in the Bidirectional Algorithm in the Unicode > > > Specification through 14.0. It permits the visual reordering of > > > characters via control sequences, which can be used to craft source code > > > that renders different logic than the logical ordering of tokens > > > ingested by compilers and interpreters. Adversaries can leverage this to > > > encode source code for compilers accepting Unicode such that targeted > > > vulnerabilities are introduced invisibly to human reviewers." > > > > > > More info: > > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574 > > > https://trojansource.codes/ > > > > > > This is not a compiler bug. However, to mitigate the problem, this patch > > > implements -Wbidi-chars=[none|unpaired|any] to warn about possibly > > > misleading Unicode bidirectional characters the preprocessor may > > > encounter.
[...snip...] > > > > Terminology nit: > > The patch is referring to "bidirectional characters", but I think the > > term "bidirectional control characters" would be better. > > Adjusted. Thanks. I wonder if the warning should be -Wbidi-control-chars, but I don't care enough to insist on it being changed. > > > For example, a passage of text containing both numbers and characters > > in a right-to-left script could be considered "bidirectional", since > > the numbers are written from left-to-right. > > > > Specifically, the patch looks for these specific characters: > > * U+202A LEFT-TO-RIGHT EMBEDDING > > * U+202B RIGHT-TO-LEFT EMBEDDING > > * U+202C POP DIRECTIONAL FORMATTING > > * U+202D LEFT-TO-RIGHT OVERRIDE > > * U+202E RIGHT-TO-LEFT OVERRIDE > > * U+2066 LEFT-TO-RIGHT ISOLATE > > * U+2067 RIGHT-TO-LEFT ISOLATE > > * U+2068 FIRST STRONG ISOLATE > > * U+2069 POP DIRECTIONAL ISOLATE > > > > However, the following characters could also be considered as > > "bidirectional control characters": > > * U+200E LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E) > > * U+200F RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F) > > but aren't checked for in the patch. Should they be? I can imagine > > ways in which they could be abused, so I think so. > > I'd only intended to check the bidi chars described in the original > trojan source pdf, but I added checking for U+200E/U+200F too, since > it was easy enough. AFAIK they aren't popped by a PDF/PDI like the > rest, so don't need to go on the vec, and so we only warn with =any. > Tests: Wbidi-chars-16.c + Wbidi-chars-17.c Thanks. I took a look through the revised patch and I think you updated things correctly. [...snip...] > > > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c > > > b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c > > > new file mode 100644 > > > index 00000000000..9fd4bc535ca > > > --- /dev/null > > > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c > > > @@ -0,0 +1,166 @@ > > > +/* PR preprocessor/103026 */ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */ > > > +/* Test all bidi chars in various contexts (identifiers, comments, > > > + string literals, character constants), both UCN and UTF-8. The bidi > > > + chars here are properly terminated, except for the character > > > constants. */ > > > + > > > +/* a b c LRE 1 2 3 PDF x y z */ > > > +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */ > > > +/* a b c RLE 1 2 3 PDF x y z */ > > > +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */ > > > +/* a b c LRO 1 2 3 PDF x y z */ > > > +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */ > > > +/* a b c RLO 1 2 3 PDF x y z */ > > > +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */ > > > +/* a b c LRI 1 2 3 PDI x y z */ > > > +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */ > > > +/* a b c RLI 1 2 3 PDI x y */ > > > +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */ > > > +/* a b c FSI 1 2 3 PDI x y z */ > > > +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */ > > > > AIUI the Unicode bidirectionality algorithm works at the line level, > > and so each line in a block comment should be checked individually for > > unclossed bidi control chars, rather than a block comment as a whole. > > Hence I think the test case needs to have block comment test coverage > > for: > > - single line blocks > > - first line of a multiline block comment > > - middle line of a multiline block comment > > - final line of a multiline block comment > > but I think the patch as it stands is only checking for the first of > > these four cases. > > The patch handles all of them, because of: > 1534 if (warn_bidi_p) > 1535 maybe_warn_bidi_on_close (pfile, cur); > in _cpp_skip_block_comment, but I was lacking some more testing, so I've > added some testing, and included a new test: Wbidi-chars-15.c. All of the cases in Wbidi-chars-15.c only test for unparired chars in a middle line of a multiline block comment; I don't think the patch has any explicit coverage for unpaired control chars happening in the first line and last lines of *multiline* block comments. So it would be good if Wbidi-chars-15.c could gain some coverage for that (don't have to handle all the different chars). [...snip...] > > > diff --git a/libcpp/lex.c b/libcpp/lex.c > > > index fa2253d41c3..3fb518e202b 100644 > > > --- a/libcpp/lex.c > > > +++ b/libcpp/lex.c > > > @@ -1164,6 +1164,300 @@ _cpp_process_line_notes (cpp_reader *pfile, int > > > in_comment) > > > } > > > } > > > > > > +namespace bidi { > > > + enum class kind { > > > + NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI > > > + }; > > > + > > > + /* All the UTF-8 encodings of bidi characters start with E2. */ > > > + constexpr uchar utf8_start = 0xe2; > > > > Is there a difference between "constexpr" vs "const" here? (sorry for > > my ignorance) > > I just wanted to make sure that utf8_start will be usable in contexts where > an integral constant expression is required. 'const' objects need not be > initialized with compile-time constants, but 'constexpr' objects do. Thanks. > > > + > > > + /* A vector holding currently open bidi contexts. We use a char for > > > + each context, its LSB is 1 if it represents a PDF context, 0 if it > > > + represents a PDI context. The next bit is 1 if this context was > > > open > > > + by a bidi character written as a UCN, and 0 when it was UTF-8. */ > > > + semi_embedded_vec <unsigned char, 16> vec; > > > + > > > + /* Close the whole comment/identifier/string literal/character constant > > > + context. */ > > > + void on_close () > > > + { > > > + vec.truncate (0); > > > + } > > > + > > > + /* Pop the last element in the vector. */ > > > + void pop () > > > + { > > > + unsigned int len = vec.count (); > > > + gcc_checking_assert (len > 0); > > > + vec.truncate (len - 1); > > > + } > > > + > > > + /* Return the context of the Ith element. */ > > > + kind ctx_at (unsigned int i) > > > + { > > > + return (vec[i] & 1) ? kind::PDF : kind::PDI; > > > + } > > > + > > > + /* Return which context is currently opened. */ > > > + kind current_ctx () > > > + { > > > + unsigned int len = vec.count (); > > > + if (len == 0) > > > + return kind::NONE; > > > + return ctx_at (len - 1); > > > + } > > > + > > > + /* Return true if the current context comes from a UCN origin, that is, > > > + the bidi char which started this bidi context was written as a UCN. > > > */ > > > + bool current_ctx_ucn_p () > > > + { > > > + unsigned int len = vec.count (); > > > + gcc_checking_assert (len > 0); > > > + return (vec[len - 1] >> 1) & 1; > > > + } > > > + > > > + /* We've read a bidi char, update the current vector as necessary. */ > > > + void on_char (kind k, bool ucn_p) > > > + { > > > + switch (k) > > > + { > > > + case kind::LRE: > > > + case kind::RLE: > > > + case kind::LRO: > > > + case kind::RLO: > > > + vec.push (ucn_p ? 3u : 1u); > > > + break; > > > + case kind::LRI: > > > + case kind::RLI: > > > + case kind::FSI: > > > + vec.push (ucn_p ? 2u : 0u); > > > + break; > > > > I don't like the hand-coded bit fields here, where bit 1 and bit 2 in > > the above have special meaning, but aren't clearly labelled as such. > > > > Please can you at least use some kind of constant/define to make clear > > the meaning of the bits. Even clearer would be bitfields; is there a > > performance reason for not using them? (though this code is only > > called on bidi control chars, which presumably is a rare occurrence). > > My patch here: > > "[PATCH 2/2] Capture locations of bidi chars and underline ranges" > > https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583160.html > > did some refactoring of this patch, replacing hand-coded bit > > manipulation with bitfields in a struct (as well as then using that as > > a good place to stach location_t values, and then using these > > locations). > > Would it be helpful if I split that part of my patch out? > > I think they are just fine here, given they are used only in bidi:: and > not outside of it. And I could just use a simple unsigned char in > semi_embedded_vec instead of inventing a new struct. > > Your diagnostic patch changes it because you need to remember a location > too, so we're changing it anyway, so I left it be so that you have fewer > conflicts. Fair enough; I'll post an updated version of my followup once yours goes in. > > [...snip...] > > > > > +/* Parse a sequence of 3 bytes starting with P and return its bidi code. > > > */ > > > + > > > +static bidi::kind > > > +get_bidi_utf8 (const unsigned char *const p) > > > +{ > > > + gcc_checking_assert (p[0] == bidi::utf8_start); > > > + > > > + if (p[1] == 0x80) > > > + switch (p[2]) > > > > get_bidi_utf8 accesss up to 2 bytes beyond "p"... > > > > [...snip...] > > > > ...and is called in various places such as... > > > > > @@ -1218,6 +1519,13 @@ _cpp_skip_block_comment (cpp_reader *pfile) > > > > > > cur = buffer->cur; > > > } > > > + /* If this is a beginning of a UTF-8 encoding, it might be > > > + a bidirectional character. */ > > > + else if (__builtin_expect (c == bidi::utf8_start, 0) && > > > warn_bidi_p) > > > + { > > > + bidi::kind kind = get_bidi_utf8 (cur - 1); > > > + maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false); > > > + } > > > > Are we guaranteed to have a '\n' at the end of the buffer? (even for > > the final line of the file) That would ensure that we don't read past > > the end of the buffer. > > We've discussed this in our internal thread; I think I said that you > will always get a '\n' so this is not going to read past the end of > the buffer. To be sure... > > > Can we have testcases involving malformed UTF-8, in which, say: > > - the final byte of the input file is 0xe2 > > - the final two bytes of the input file are 0xe2 0x80 > > for each of block comment, C++-style comment, string-literal, > > identifier, etc? > > (or is that overkill?) > > ...I'd crafted a malformed text file using hexedit but couldn't get > it to crash. I'd rather not include it in the testsuite though. Fair enough. > > [...snip...] > > > > > @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar > > > *base, bool starts_ucn, > > > { > > > /* Slower version for identifiers containing UCNs > > > or extended chars (including $). */ > > > - do { > > > - while (ISIDNUM (*pfile->buffer->cur)) > > > - { > > > - NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); > > > - pfile->buffer->cur++; > > > - } > > > - } while (forms_identifier_p (pfile, false, nst)); > > > + do > > > + { > > > + while (ISIDNUM (*pfile->buffer->cur)) > > > + { > > > + NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur); > > > + pfile->buffer->cur++; > > > + } > > > + } > > > + while (forms_identifier_p (pfile, false, nst)); > > > > Is the above purely a whitespace change? > > Yes. If I'm reading things correctly, these lines in the existing code were correctly indented, so is there a purpose to this change? If not, please can you remove this change from the patch (to minimize the change to the history). [...snip...] > +/* We're closing a bidi context, that is, we've encountered a newline, > + are closing a C-style comment, or are at the end of a string literal, > + character constant, or identifier. Warn if this context was not > + properly terminated by a PDI or PDF. P points to the last character > + in this context. */ > + > +static void > +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p) > +{ > + if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired > + && bidi::vec.count () > 0) > + { > + const location_t loc > + = linemap_position_for_column (pfile->line_table, > + CPP_BUF_COLUMN (pfile->buffer, p)); > + cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0, > + "unpaired UTF-8 bidirectional character " > + "detected"); > + } Sorry, I missed this one in my initial review, should be "control character" here. [...snip...] OK for trunk with the above nits fixed. Thanks Dave