> On Mon, Nov 15, 2021 at 06:15:40PM -0500, David Malcolm wrote:
> > > On Mon, Nov 08, 2021 at 04:33:43PM -0500, Marek Polacek wrote:
> > > > Ping, can we conclude on the name?   IMHO, -Wbidirectional is just fine,
> > > > but changing the name is a trivial operation. 
> > > 
> > > Here's a patch with a better name (suggested by Jonathan W.).  Otherwise 
> > > no
> > > changes.
> > 
> > Thanks for implementing this.
> > 
> > > 
> > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > > 
> > > -- >8 --
> > > From a link below:
> > > "An issue was discovered in the Bidirectional Algorithm in the Unicode
> > > Specification through 14.0. It permits the visual reordering of
> > > characters via control sequences, which can be used to craft source code
> > > that renders different logic than the logical ordering of tokens
> > > ingested by compilers and interpreters. Adversaries can leverage this to
> > > encode source code for compilers accepting Unicode such that targeted
> > > vulnerabilities are introduced invisibly to human reviewers."
> > > 
> > > More info:
> > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> > > https://trojansource.codes/
> > > 
> > > This is not a compiler bug.  However, to mitigate the problem, this patch
> > > implements -Wbidi-chars=[none|unpaired|any] to warn about possibly
> > > misleading Unicode bidirectional characters the preprocessor may 
> > > encounter.

[...snip...]

> > 
> > Terminology nit:
> > The patch is referring to "bidirectional characters", but I think the
> > term "bidirectional control characters" would be better.
> 
> Adjusted.

Thanks.

I wonder if the warning should be -Wbidi-control-chars, but I don't
care enough to insist on it being changed.

>  
> > For example, a passage of text containing both numbers and characters
> > in a right-to-left script could be considered "bidirectional", since
> > the numbers are written from left-to-right.
> > 
> > Specifically, the patch looks for these specific characters:
> >   * U+202A LEFT-TO-RIGHT EMBEDDING
> >   * U+202B RIGHT-TO-LEFT EMBEDDING
> >   * U+202C POP DIRECTIONAL FORMATTING
> >   * U+202D LEFT-TO-RIGHT OVERRIDE
> >   * U+202E RIGHT-TO-LEFT OVERRIDE
> >   * U+2066 LEFT-TO-RIGHT ISOLATE
> >   * U+2067 RIGHT-TO-LEFT ISOLATE
> >   * U+2068 FIRST STRONG ISOLATE
> >   * U+2069 POP DIRECTIONAL ISOLATE
> > 
> > However, the following characters could also be considered as
> > "bidirectional control characters":
> >   * U+200E ‎LEFT-TO-RIGHT MARK (UTF-8: E2 80 8E)
> >   * U+200F ‎RIGHT-TO-LEFT MARK (UTF-8: E2 80 8F)
> > but aren't checked for in the patch.  Should they be?  I can imagine
> > ways in which they could be abused, so I think so.
> 
> I'd only intended to check the bidi chars described in the original
> trojan source pdf, but I added checking for U+200E/U+200F too, since
> it was easy enough.  AFAIK they aren't popped by a PDF/PDI like the
> rest, so don't need to go on the vec, and so we only warn with =any.
> Tests: Wbidi-chars-16.c + Wbidi-chars-17.c

Thanks.  I took a look through the revised patch and I think you
updated things correctly.

[...snip...]

> > > diff --git a/gcc/testsuite/c-c++-common/Wbidi-chars-4.c 
> > > b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > new file mode 100644
> > > index 00000000000..9fd4bc535ca
> > > --- /dev/null
> > > +++ b/gcc/testsuite/c-c++-common/Wbidi-chars-4.c
> > > @@ -0,0 +1,166 @@
> > > +/* PR preprocessor/103026 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-Wbidi-chars=any -Wno-multichar -Wno-overflow" } */
> > > +/* Test all bidi chars in various contexts (identifiers, comments,
> > > +   string literals, character constants), both UCN and UTF-8.  The bidi
> > > +   chars here are properly terminated, except for the character 
> > > constants.  */
> > > +
> > > +/* a b c LRE‪ 1 2 3 PDF‬ x y z */
> > > +/* { dg-warning "U\\+202A" "" { target *-*-* } .-1 } */
> > > +/* a b c RLE‫ 1 2 3 PDF‬ x y z */
> > > +/* { dg-warning "U\\+202B" "" { target *-*-* } .-1 } */
> > > +/* a b c LRO‭ 1 2 3 PDF‬ x y z */
> > > +/* { dg-warning "U\\+202D" "" { target *-*-* } .-1 } */
> > > +/* a b c RLO‮ 1 2 3 PDF‬ x y z */
> > > +/* { dg-warning "U\\+202E" "" { target *-*-* } .-1 } */
> > > +/* a b c LRI⁦ 1 2 3 PDI⁩ x y z */
> > > +/* { dg-warning "U\\+2066" "" { target *-*-* } .-1 } */
> > > +/* a b c RLI⁧ 1 2 3 PDI⁩ x y */
> > > +/* { dg-warning "U\\+2067" "" { target *-*-* } .-1 } */
> > > +/* a b c FSI⁨ 1 2 3 PDI⁩ x y z */
> > > +/* { dg-warning "U\\+2068" "" { target *-*-* } .-1 } */
> > 
> > AIUI the Unicode bidirectionality algorithm works at the line level,
> > and so each line in a block comment should be checked individually for
> > unclossed bidi control chars, rather than a block comment as a whole. 
> > Hence I think the test case needs to have block comment test coverage
> > for:
> > - single line blocks
> > - first line of a multiline block comment
> > - middle line of a multiline block comment
> > - final line of a multiline block comment
> > but I think the patch as it stands is only checking for the first of
> > these four cases.
> 
> The patch handles all of them, because of:
> 1534           if (warn_bidi_p)
> 1535             maybe_warn_bidi_on_close (pfile, cur);
> in _cpp_skip_block_comment, but I was lacking some more testing, so I've
> added some testing, and included a new test: Wbidi-chars-15.c.

All of the cases in Wbidi-chars-15.c only test for unparired chars in a
middle line of a multiline block comment; I don't think the patch has
any explicit coverage for unpaired control chars happening in the first
line and last lines of *multiline* block comments.  So it would be good
if Wbidi-chars-15.c could gain some coverage for that (don't have to
handle all the different chars).

[...snip...]

> > > diff --git a/libcpp/lex.c b/libcpp/lex.c
> > > index fa2253d41c3..3fb518e202b 100644
> > > --- a/libcpp/lex.c
> > > +++ b/libcpp/lex.c
> > > @@ -1164,6 +1164,300 @@ _cpp_process_line_notes (cpp_reader *pfile, int 
> > > in_comment)
> > >      }
> > >  }
> > >  
> > > +namespace bidi {
> > > +  enum class kind {
> > > +    NONE, LRE, RLE, LRO, RLO, LRI, RLI, FSI, PDF, PDI
> > > +  };
> > > +
> > > +  /* All the UTF-8 encodings of bidi characters start with E2.  */
> > > +  constexpr uchar utf8_start = 0xe2;
> > 
> > Is there a difference between "constexpr" vs "const" here?  (sorry for
> > my ignorance)
> 
> I just wanted to make sure that utf8_start will be usable in contexts where
> an integral constant expression is required.  'const' objects need not be
> initialized with compile-time constants, but 'constexpr' objects do.

Thanks.

> > > +
> > > +  /* A vector holding currently open bidi contexts.  We use a char for
> > > +     each context, its LSB is 1 if it represents a PDF context, 0 if it
> > > +     represents a PDI context.  The next bit is 1 if this context was 
> > > open
> > > +     by a bidi character written as a UCN, and 0 when it was UTF-8.  */
> > > +  semi_embedded_vec <unsigned char, 16> vec;
> > > +
> > > +  /* Close the whole comment/identifier/string literal/character constant
> > > +     context.  */
> > > +  void on_close ()
> > > +  {
> > > +    vec.truncate (0);
> > > +  }
> > > +
> > > +  /* Pop the last element in the vector.  */
> > > +  void pop ()
> > > +  {
> > > +    unsigned int len = vec.count ();
> > > +    gcc_checking_assert (len > 0);
> > > +    vec.truncate (len - 1);
> > > +  }
> > > +
> > > +  /* Return the context of the Ith element.  */
> > > +  kind ctx_at (unsigned int i)
> > > +  {
> > > +    return (vec[i] & 1) ? kind::PDF : kind::PDI;
> > > +  }
> > > +
> > > +  /* Return which context is currently opened.  */
> > > +  kind current_ctx ()
> > > +  {
> > > +    unsigned int len = vec.count ();
> > > +    if (len == 0)
> > > +      return kind::NONE;
> > > +    return ctx_at (len - 1);
> > > +  }
> > > +
> > > +  /* Return true if the current context comes from a UCN origin, that is,
> > > +     the bidi char which started this bidi context was written as a UCN. 
> > >  */
> > > +  bool current_ctx_ucn_p ()
> > > +  {
> > > +    unsigned int len = vec.count ();
> > > +    gcc_checking_assert (len > 0);
> > > +    return (vec[len - 1] >> 1) & 1;
> > > +  }
> > > +
> > > +  /* We've read a bidi char, update the current vector as necessary.  */
> > > +  void on_char (kind k, bool ucn_p)
> > > +  {
> > > +    switch (k)
> > > +      {
> > > +      case kind::LRE:
> > > +      case kind::RLE:
> > > +      case kind::LRO:
> > > +      case kind::RLO:
> > > + vec.push (ucn_p ? 3u : 1u);
> > > + break;
> > > +      case kind::LRI:
> > > +      case kind::RLI:
> > > +      case kind::FSI:
> > > + vec.push (ucn_p ? 2u : 0u);
> > > + break;
> > 
> > I don't like the hand-coded bit fields here, where bit 1 and bit 2 in
> > the above have special meaning, but aren't clearly labelled as such.
> > 
> > Please can you at least use some kind of constant/define to make clear
> > the meaning of the bits.  Even clearer would be bitfields; is there a
> > performance reason for not using them?  (though this code is only
> > called on bidi control chars, which presumably is a rare occurrence). 
> > My patch here:
> >   "[PATCH 2/2] Capture locations of bidi chars and underline ranges"
> >     https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583160.html
> > did some refactoring of this patch, replacing hand-coded bit
> > manipulation with bitfields in a struct (as well as then using that as
> > a good place to stach location_t values, and then using these
> > locations).
> > Would it be helpful if I split that part of my patch out?
>  
> I think they are just fine here, given they are used only in bidi:: and
> not outside of it.  And I could just use a simple unsigned char in
> semi_embedded_vec instead of inventing a new struct.
> 
> Your diagnostic patch changes it because you need to remember a location
> too, so we're changing it anyway, so I left it be so that you have fewer
> conflicts.

Fair enough; I'll post an updated version of my followup once yours
goes in.

> > [...snip...]
> > 
> > > +/* Parse a sequence of 3 bytes starting with P and return its bidi code. 
> > >  */
> > > +
> > > +static bidi::kind
> > > +get_bidi_utf8 (const unsigned char *const p)
> > > +{
> > > +  gcc_checking_assert (p[0] == bidi::utf8_start);
> > > +
> > > +  if (p[1] == 0x80)
> > > +    switch (p[2])
> > 
> > get_bidi_utf8 accesss up to 2 bytes beyond "p"...
> > 
> > [...snip...]
> > 
> > ...and is called in various places such as...
> > 
> > > @@ -1218,6 +1519,13 @@ _cpp_skip_block_comment (cpp_reader *pfile)
> > >  
> > >     cur = buffer->cur;
> > >   }
> > > +      /* If this is a beginning of a UTF-8 encoding, it might be
> > > +  a bidirectional character.  */
> > > +      else if (__builtin_expect (c == bidi::utf8_start, 0) && 
> > > warn_bidi_p)
> > > + {
> > > +   bidi::kind kind = get_bidi_utf8 (cur - 1);
> > > +   maybe_warn_bidi_on_char (pfile, cur, kind, /*ucn_p=*/false);
> > > + }
> > 
> > Are we guaranteed to have a '\n' at the end of the buffer?  (even for
> > the final line of the file)  That would ensure that we don't read past
> > the end of the buffer.
> 
> We've discussed this in our internal thread; I think I said that you
> will always get a '\n' so this is not going to read past the end of
> the buffer.  To be sure...
>  
> > Can we have testcases involving malformed UTF-8, in which, say:
> > - the final byte of the input file is 0xe2
> > - the final two bytes of the input file are 0xe2 0x80
> > for each of block comment, C++-style comment, string-literal,
> > identifier, etc?
> > (or is that overkill?)
> 
> ...I'd crafted a malformed text file using hexedit but couldn't get
> it to crash.  I'd rather not include it in the testsuite though.

Fair enough.

> > [...snip...]
> > 
> > > @@ -1505,13 +1855,17 @@ lex_identifier (cpp_reader *pfile, const uchar 
> > > *base, bool starts_ucn,
> > >      {
> > >        /* Slower version for identifiers containing UCNs
> > >    or extended chars (including $).  */
> > > -      do {
> > > - while (ISIDNUM (*pfile->buffer->cur))
> > > -   {
> > > -     NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> > > -     pfile->buffer->cur++;
> > > -   }
> > > -      } while (forms_identifier_p (pfile, false, nst));
> > > +      do
> > > + {
> > > +   while (ISIDNUM (*pfile->buffer->cur))
> > > +     {
> > > +       NORMALIZE_STATE_UPDATE_IDNUM (nst, *pfile->buffer->cur);
> > > +       pfile->buffer->cur++;
> > > +     }
> > > + }
> > > +      while (forms_identifier_p (pfile, false, nst));
> > 
> > Is the above purely a whitespace change?
> 
> Yes.

If I'm reading things correctly, these lines in the existing code were
correctly indented, so is there a purpose to this change?  If not,
please can you remove this change from the patch (to minimize the
change to the history).

[...snip...]

> +/* We're closing a bidi context, that is, we've encountered a newline,
> +   are closing a C-style comment, or are at the end of a string literal,
> +   character constant, or identifier.  Warn if this context was not
> +   properly terminated by a PDI or PDF.  P points to the last character
> +   in this context.  */
> +
> +static void
> +maybe_warn_bidi_on_close (cpp_reader *pfile, const uchar *p)
> +{
> +  if (CPP_OPTION (pfile, cpp_warn_bidirectional) == bidirectional_unpaired
> +      && bidi::vec.count () > 0)
> +    {
> +      const location_t loc
> +     = linemap_position_for_column (pfile->line_table,
> +                                    CPP_BUF_COLUMN (pfile->buffer, p));
> +      cpp_warning_with_line (pfile, CPP_W_BIDIRECTIONAL, loc, 0,
> +                          "unpaired UTF-8 bidirectional character "
> +                          "detected");
> +    }

Sorry, I missed this one in my initial review, should be "control
character" here.

[...snip...]

OK for trunk with the above nits fixed.

Thanks
Dave


Reply via email to