Re: Patch for PR analyzer/98797

David Malcolm via Gcc-patches Fri, 12 Feb 2021 16:17:08 -0800

On Wed, 2021-02-10 at 19:42 +0000, brian.sobulefsky wrote:

Hi Brian


Thanks for the patch.

The patch is large enough to count as legally significant; I've sent
you information on copyright assignment separately; the patch can't be
committed until that paperwork is in place.

In the meantime, I've added some review notes inline below throughout:
 
>     Address the bug found in test file 
> gcc/testsuite/gcc.dg/analyzer/casts-1.c, as
>     recorded by the XFAIL, and some simpler and more complex versions found 
> during
>     the investigation of it. PR analyzer/98797 was opened for these bugs.
>     
>     The simplest manifestation of this bug can be replicated with:
>     
>     char arr[] = {'A', 'B', 'C', 'D'};
>     char *pt = ar;
>     __analyzer_eval(*(pt + 0) == 'A');
>     __analyzer_eval(*(pt + 1) == 'B');
>     //etc
>     
>     The result of the eval call is "UNKNOWN" because the analyzer is unable to
>     determine the value at the pointer. Note that array access (such as 
> arr[0]) is
>     correctly handled. The code responsible for this is in file
>     region-model-manager.cc, function 
> region_model_manager::maybe_fold_sub_svalue.
>     The relevant section is commented /* Handle getting individual chars from
>     STRING_CST */. This section only had a case for an element_region. A case
>     needed to be added for an offset_region.
>     
>     Additionally, when the offset was 0, such as in *pt or *(pt + 0), the 
> function
>     region_model_manager::get_offset_region was failing to make the needed 
> offset
>     region at all. This was due to the test for a constant 0 pointer that was 
> then
>     returning get_cast_region. A special case is needed for when PARENT is of 
> type
>     array_type whose type matches TYPE. In this case, get_offset_region is 
> allowed
>     to continue to normal conclusion.
>     
>     The original bug noted in gcc/testsuite/gcc.dg/analyzer/casts-1.c was for 
> the
>     case:
>     
>     struct s1
>     {
>       char a;
>       char b;
>       char c;
>       char d;
>     };
>     
>     struct s2
>     {
>       char arr[4];
>     };
>     
>     struct s2 x = {{'A', 'B', 'C', 'D'}};
>     struct s1 *p = (struct s1 *)&x;
>     __analyzer_eval (p->a == 'A');
>     //etc
>     
>     This requires a case added to region_model_manager::maybe_fold_sub_svalue 
> in
>     the individual characters from string constant section for a field region.
>     
>     Finally, the prior only works for the case where struct s2 was a single 
> field
>     struct. The more general case is:
>     
>     struct s2
>     {
>       char arr1[2];
>       char arr2[2];
>     };
>     
>     struct s2 x = {{'A', 'B'}, {'C', 'D'}};

This is s3 in the testcase, rather than s2; looks like this commit
message should be updated accordingly to match your change to the
testcase.

BTW, did this case arise in practice?  The existing cases are rather
artificial; IIRC I created them whilst experimenting with various casts
whilst prototypeing the code, in the hope of generating coverage, but
without any specific real-world examples in mind.  (kind of "kicking
the tires", if you will).  Am I right in thinking that this new one
arose in a similar way?


>     Here the array will never be found in the get_any_binding routines. A new
>     routine is added to class binding_cluster that checks if the region being
>     searched is a field_region of a cast_region, and if so, tries to find a 
> field
>     of the original region that contains the region under examination. This 
> new
>     function is named binding_cluster::get_parent_cast_binding. It is called 
> from
>     get_binding_recursive.
>     
>     gcc/analyzer/ChangeLog:
>             PR analyzer/98797
>             * region-model-manager.cc 
> region_model_manager::get_offset_region: Add
>             check for a PARENT array whose type matches TYPE, and have this 
> skip
>             returning get_cast_region and rather conclude the function 
> normally.
>             * region-model-manager.cc 
> region_model_manager::maybe_fold_sub_svalue
>             Update the get character from string_cst section to include cases 
> for
>             an offset_region and a field_region.
>             * store.cc binding_cluster::get_binding_recursive: Add case for 
> call
>             to new function get_parent_cast_binding.
>             * store.cc binding_cluster::get_parent_cast_binding: New function.
>             * store.h class binding_cluster: Add declaration for new member
>             function get_parent_class_binding and macros for bit to byte
>             conversion.

Formatting nit:  the items in the ChangeLog within each file should be
enclosed by parentheses.  We have a git commit hook that verifies the
format.  In theory there's a way to run it ahead of time, but I don't
know of the top of my head where it is.


>     gcc/testsuite/ChangeLog:
>             PR analyzer/98797
>             * gcc.dg/analyzer/casts-1.c: Update file to no longer expect 
> failures
>             and add test cases for additional bugs solved.
> 
> diff --git a/gcc/analyzer/region-model-manager.cc 
> b/gcc/analyzer/region-model-manager.cc
> index dfd2413e914..1fd6b94f20a 100644
> --- a/gcc/analyzer/region-model-manager.cc
> +++ b/gcc/analyzer/region-model-manager.cc
> @@ -602,16 +602,46 @@ region_model_manager::maybe_fold_sub_svalue (tree type,
>    /* Handle getting individual chars from a STRING_CST.  */
>    if (tree cst = parent_svalue->maybe_get_constant ())
>      if (TREE_CODE (cst) == STRING_CST)
> -      if (const element_region *element_reg
> +      {
> +     if (const element_region *element_reg
>           = subregion->dyn_cast_element_region ())
> -     {
> -       const svalue *idx_sval = element_reg->get_index ();
> -       if (tree cst_idx = idx_sval->maybe_get_constant ())
> -         if (const svalue *char_sval
> -             = maybe_get_char_from_string_cst (cst, cst_idx))
> -           return get_or_create_cast (type, char_sval);
> -     }
> -
> +       {
> +         const svalue *idx_sval = element_reg->get_index ();
> +         if (tree cst_idx = idx_sval->maybe_get_constant ())
> +           if (const svalue *char_sval
> +               = maybe_get_char_from_string_cst (cst, cst_idx))
> +             return get_or_create_cast (type, char_sval);
> +       }
> +     else if (const offset_region *offset_reg
> +           = subregion->dyn_cast_offset_region ())
> +       {
> +         const svalue *offset_sval = offset_reg->get_byte_offset();
> +         if (tree cst_offset = offset_sval->maybe_get_constant ())
> +           if (const svalue *char_sval
> +               = maybe_get_char_from_string_cst (cst, cst_offset))
> +             return get_or_create_cast (type, char_sval);
> +       }
> +     else if (const field_region *field_reg
> +           = subregion->dyn_cast_field_region ())
> +       {
> +         region_offset field_offset = field_reg->get_offset ();
> +         if (!field_offset.symbolic_p ())
> +           {
> +             bit_offset_t field_bit_offset = field_offset.get_bit_offset ();
> +             HOST_WIDE_INT field_offset_bits
> +                 = field_bit_offset.get_val ()[0];
> +             if(!(field_offset_bits & BYTE_BOUNDARY_MASK))
> +               {
> +                 tree cst_offset
> +                    = build_int_cst (integer_type_node,
> +                                     field_offset_bits >> BYTE_BIT_CONVERT);

The patch adds logic in two places to mask a HOST_WIDE_INT against
BYTE_BOUNDARY_MASK and if zero, shift by BYTE_BIT_CONVERT, and then to
build_int_cst from the result.

The mask and shift feel like premature optimization to me.

Please can you instead introduce a helper subroutine to do this,
something like:

/* If offset_bits is a multiple of BITS_PER_UNIT, return an INTEGER_CST
   for the offset expressed in bytes.
   Otherwise, return NULL_TREE.  */

tree
maybe_convert_bit_to_byte_offset (HOST_WIDE_INT offset_bits)
{
  if (offset_bits % BITS_PER_UNIT)
    return NULL_TREE;
  return build_int_cst (offset_bits / BITS_PER_UNIT);
}

> +                 if (const svalue *char_sval
> +                     = maybe_get_char_from_string_cst (cst, cst_offset))
> +                   return  get_or_create_cast (type, char_sval);

Doesn't this need to also check that the size of the field is
sizeof(char)?

Consider:

void test_4 ()
{
  const char *s = "ABCD";
  __analyzer_eval (*(short *)s == 'A');
}

With your patch this evaluates as TRUE, as it's erroneously slicing off
the 2nd byte of "AB".  It ought to be FALSE (and endianness concerns
would thwart TRUE also).

> +               }
> +           }
> +       }
> +      }
>    /* SUB(INIT(r)).FIELD -> INIT(r.FIELD)
>       i.e.
>       Subvalue(InitialValue(R1), FieldRegion(R2, F))
> @@ -867,8 +897,19 @@ region_model_manager::get_offset_region (const region 
> *parent,
>  {
>    /* If BYTE_OFFSET is zero, return PARENT.  */
>    if (tree cst_offset = byte_offset->maybe_get_constant ())
> -    if (zerop (cst_offset))
> -      return get_cast_region (parent, type);
> +    {
> +      /* Special case: PARENT type is array_type whose type matches TYPE */
> +      if (parent->get_type ()
> +         &&  TREE_CODE(parent->get_type ()) == ARRAY_TYPE
> +         &&  TREE_TYPE(parent->get_type ()) == type)
> +     {
> +       tree offset_int
> +           = build_int_cst (integer_type_node, cst_offset->int_cst.val[0]);

It took me a while to get the logic here.  Am I right in thinking that
this is to avoid returning
  (TYPE)PARENT
for the 0 offset case when it's a matching array, since:
  PARENT[0]
is better for that case?

If so, I think the comments could be updated.

I'm not quite seeing the purpose of constructing offset_int from
cst_offset here.  Is it to get a value of integer_type_node, or is it
redundant?

> +       byte_offset = get_or_create_constant_svalue (offset_int);
> +     }
> +      else if (zerop (cst_offset))
> +     return get_cast_region (parent, type);
> +    }
>  
>    /* Fold OFFSET_REGION(OFFSET_REGION(REG, X), Y)
>       to   OFFSET_REGION(REG, (X + Y)).  */
> diff --git a/gcc/analyzer/store.cc b/gcc/analyzer/store.cc
> index abdb336da91..3afba55ce90 100644
> --- a/gcc/analyzer/store.cc
> +++ b/gcc/analyzer/store.cc
> @@ -996,14 +996,129 @@ binding_cluster::get_binding_recursive (store_manager 
> *mgr,
>      return sval;
>    if (reg != m_base_region)
>      if (const region *parent_reg = reg->get_parent_region ())
> -      if (const svalue *parent_sval
> -       = get_binding_recursive (mgr, parent_reg, kind))
> +      {
> +     if (const svalue *parent_sval
> +         = get_binding_recursive (mgr, parent_reg, kind))
> +       {
> +         /* Extract child svalue from parent svalue.  */
> +         region_model_manager *rmm_mgr = mgr->get_svalue_manager ();
> +         return rmm_mgr->get_or_create_sub_svalue (reg->get_type (),
> +                                                   parent_sval, reg);
> +       }
> +     else if (const svalue *parent_cast_field_sval
> +         = get_parent_cast_binding (mgr, reg, kind))
> +       {
> +         /* PARENT_REG is a cast_region and we found a covering binding
> +            in the original_region */
> +         return parent_cast_field_sval;
> +       }
> +      }
> +  return NULL;
> +}
> +
> +/* Get a binding for access to a field of a cast_region.

Is it always a cast_region?  "reg" is a "const region *".

Is it possible to express in C-like syntax what this function does, in
the most general case.  Is it something like:

  ((TYPE)EXPR).FIELD

?

> +   Note the original motivation for this was access of the form:
> +   struct s1 x = {{'A', 'B'}; {'C', 'D'}}; struct s2 p = (struct s2) &x;
> +   __analyzer_eval (p->a == 'A');

I think the "s2" needs to be "s2 *" here.

> +   et al. for the other fields. get_binding_recursive fails because it is
> +   unidirectional (from the field_region p->a up the chain of parents). The
> +   routine can probably be expanded, and even further broken out, as other 
> cases
> +   are discovered and understood. */
> +const svalue *
> +binding_cluster::get_parent_cast_binding (store_manager *mgr,
> +                                       const region *reg,
> +                                       enum binding_kind kind) const
> +{
> +
> +  /* If we are sure this will never be called incorrectly, we can remove */
> +  if (!(mgr && reg))
> +    return NULL;

I'm fairly sure that mgr will always be non-NULL.  I'd prefer to remove
the checks; as far as I can tell, reg is always non-NULL.

> +  const region *parent_reg = reg->get_parent_region ();
> +
> +  /* If it is impossible for a region's parent to be NULL, remove
guard */
> +  if(!parent_reg)
> +    return NULL;

Some regions do have NULL parents e.g. symbolic regions.  I'm not sure
whether or not it can be called for them, so I'd keep this defensive
check.

> +  const cast_region *parent_cast = parent_reg->dyn_cast_cast_region ();
> +  const field_region *field_reg = reg->dyn_cast_field_region ();
> +
> +  /* The first two guards are just safety, this is the real condition */
> +  if (!(parent_cast && field_reg))
> +    return NULL;
> +
> +  const region *original_reg = parent_cast->get_original_region ();
> +  region_offset reg_offset = field_reg->get_offset ();
> +  bit_size_t reg_bit_size;
> +  tree original_tree = original_reg->get_type ();
> +  tree original_field;
> +
> +  /* Note assignment to ORIGINAL_FIELD in compound test to
> +     save some indentation space (80 comes quickly) */

I find it's often easiest to rewrite a series of nested conditionals:

  if (foo)
    if (bar)
      if (baz)
        {
          /* etc, where did all the columns go???  */
        }

into:

  if (!foo)
    return;
  if (!bar)
    return;
  if (!baz)
    return;
  /* Now we still have plenty of room.  */
  
assuming that the suite being guarded is all of the rest of the
function.


> +  if (original_tree && TREE_CODE(original_tree) == RECORD_TYPE
> +      && (original_field = TYPE_FIELDS(original_tree))
> +      && !reg_offset.symbolic_p() && field_reg->get_bit_size(&reg_bit_size))

so the above would probably be best rewritten.
That said, for future reference, with the compound conditional, you
correctly put the "&&" at the start of the line, but every such clause
needs to be at the start of a new line, so the above should read:

     if (original_tree
         && TREE_CODE (original_tree) == RECORD_TYPE
         && (original_field = TYPE_FIELDS (original_tree))
         && !reg_offset.symbolic_p ()
         && field_reg->get_bit_size (&reg_bit_size))

to avoid conditions hiding from a casual reader.

...but inverting the conditions and bailing is probably better for the
column-width thing as noted above.

> +    {
> +
> +      bit_offset_t reg_bit_offset = reg_offset.get_bit_offset ();
> +      HOST_WIDE_INT start_offset = reg_bit_offset.get_val ()[0];
> +      HOST_WIDE_INT end_offset = start_offset + reg_bit_size.get_val ()[0] - 
> 1;
> +      tree covering_field = NULL_TREE;
> +
> +      /* get_field_at_bit_offset in region.cc has a test for RECORD_TYPE, 
> maybe
> +      this should be here too? The test is probably moot, since we have to

It would need the test (which is to handle methods and nested types in
C++ classes), but I believe this code can simply reuse 
get_field_at_bit_offset, rather than duplicating the logic.

> +      fully cover and survive the get_binding_recursive below. Also
> +      consdider making the function in region.cc usable outside the
file? */
> +      for (covering_field = original_field;
> +         (original_field = TREE_CHAIN (original_field));)
> +     {
> +       if (original_field->field_decl.bit_offset->int_cst.val[0]
> +           <= start_offset)
> +         covering_field = original_field;
> +       else
> +         break; /* Is the list guaranteed to be in order? */
> +     }


> +
> +      if (covering_field && end_offset
> +       <= covering_field->field_decl.bit_offset->int_cst.val[0]
> +       + covering_field->decl_common.size->int_cst.val[0] - 1)

Linebreak before &&; probably needs parentheses, so it ought to be
formatted something like this:

        if (covering_field
            && (end_offset
                <= (covering_field->field_decl.bit_offset->int_cst.val[0]
                    + covering_field->decl_common.size->int_cst.val[0] - 1)))

But the direct access of tree fields seems incorrect to me.  Fixing
that would mean:

        if (covering_field
            && (end_offset
                <= (DECL_FIELD_BIT_OFFSET (covering_field)->int_cst.val[0]
                    + DECL_SIZE (covering_field)->int_cst.val[0] - 1)))

but the usage of int_cst.val[0] seem incorrect to me (what if it's a
really large value for which the low entry is 0?); does
int_bit_position do what you need?

>       {
> -       /* Extract child svalue from parent svalue.  */
> +       /* We found a field that entirely covers REG. The following code
> +       should probably be generalized to more cases, as of now it will
> +       basically handle the case where a char array has been initialized
> +       into the original struct type. Further work might be to handle when
> +       a field to a struct is itself a struct, which is likely recursive.*/
> +
>         region_model_manager *rmm_mgr = mgr->get_svalue_manager ();
> -       return rmm_mgr->get_or_create_sub_svalue (reg->get_type (),
> -                                                 parent_sval, reg);
> +       const region *covering_field_reg = rmm_mgr
> +           ->get_field_region (original_reg, covering_field);
> +
> +       if (const svalue *parent_sval = get_binding_recursive (mgr,
> +                                             covering_field_reg, kind))

Probably best to put the newline before the "=" in this; see examples
elsewhere in the analyzer sources.

> +         {
> +
> +           HOST_WIDE_INT covering_field_offset = covering_field
> +               ->field_decl.bit_offset->int_cst.val[0];

I don't think the code should be accessing int_cst.val[0].  Does
int_bit_position do what you need?

> +           if (TREE_CODE(covering_field_reg->get_type()) == ARRAY_TYPE
> +               && !((start_offset - covering_field_offset)
> +                     & BYTE_BOUNDARY_MASK))
> +             {
> +               const svalue *arr_index = rmm_mgr
> +                   ->get_or_create_constant_svalue (
> +                       build_int_cst (integer_type_node,
> +                         (start_offset - covering_field_offset)
> +                          >> BYTE_BIT_CONVERT));

See notes above about introducing a subroutine for part of this.

> +               const region *elmt_reg = rmm_mgr
> +                   ->get_element_region (covering_field_reg,
> +                       TREE_TYPE (covering_field_reg->get_type ()),
> +                         arr_index);

Probably best to have the newline before the "=" here.

> +               return rmm_mgr->get_or_create_sub_svalue (
> +                   elmt_reg->get_type (), parent_sval, elmt_reg);
> +             }
> +         }
>       }
> +    }
>    return NULL;
>  }
>  
> diff --git a/gcc/analyzer/store.h b/gcc/analyzer/store.h
> index 2bcef6c398a..9ddc57fea01 100644
> --- a/gcc/analyzer/store.h
> +++ b/gcc/analyzer/store.h
> @@ -416,6 +416,9 @@ public:
>    const svalue *get_binding_recursive (store_manager *mgr,
>                                       const region *reg,
>                                       enum binding_kind kind) const;
> +  const svalue *get_parent_cast_binding (store_manager *mgr,
> +                                       const region *reg,
> +                                       enum binding_kind kind)
const;
>    const svalue *get_any_binding (store_manager *mgr,
>                                 const region *reg) const;
>    const svalue *maybe_get_compound_binding (store_manager *mgr,
> @@ -640,4 +643,7 @@ private:
>  
>  } // namespace ana
>  
> +#define BYTE_BIT_CONVERT 3
> +#define BYTE_BOUNDARY_MASK ((1 << BYTE_BIT_CONVERT) - 1)

As noted above, please lose these in favor of a subroutine.

>  #endif /* GCC_ANALYZER_STORE_H */
> diff --git a/gcc/testsuite/gcc.dg/analyzer/casts-1.c
b/gcc/testsuite/gcc.dg/analyzer/casts-1.c
> index 15cd85f77cf..9340615b20f 100644
> --- a/gcc/testsuite/gcc.dg/analyzer/casts-1.c
> +++ b/gcc/testsuite/gcc.dg/analyzer/casts-1.c
> @@ -13,6 +13,21 @@ struct s2
>    char arr[4];
>  };
>  
> +struct s3
> +{
> +  char arr1[2];
> +  char arr2[2];
> +};
> +
> +void test_0 ()
> +{
> +  char ar[] = {'A', 'B', 'C', 'D'};
> +  char *pt = ar;
> +  __analyzer_eval (*(pt + 0) == 'A'); /* { dg-warning "TRUE" } */
> +  __analyzer_eval (*(pt + 1) == 'B'); /* { dg-warning "TRUE" } */
> +  __analyzer_eval (*(pt + 2) == 'C'); /* { dg-warning "TRUE" } */
> +  __analyzer_eval (*(pt + 3) == 'D'); /* { dg-warning "TRUE" } */
> +}

Probably best to append new tests, rather than to insert them, to avoid
changing the line numbers of existing tests (which can help when
comparing DejaGnu results).

>  void test_1 ()
>  {
>    struct s1 x = {'A', 'B', 'C', 'D'};
> @@ -38,12 +53,22 @@ void test_2 ()
>    __analyzer_eval (x.arr[2] == 'C'); /* { dg-warning "TRUE" } */
>    __analyzer_eval (x.arr[3] == 'D'); /* { dg-warning "TRUE" } */
>    struct s1 *p = (struct s1 *)&x;
> -  __analyzer_eval (p->a == 'A'); /* { dg-warning "TRUE" "true" {
xfail *-*-* } } */
> -  /* { dg-bogus "UNKNOWN" "unknown" { xfail *-*-* } .-1 } */
> -  __analyzer_eval (p->b == 'B'); /* { dg-warning "TRUE" "true" {
xfail *-*-* } } */
> -  /* { dg-bogus "UNKNOWN" "unknown" { xfail *-*-* } .-1 } */
> -  __analyzer_eval (p->c == 'C'); /* { dg-warning "TRUE" "true" {
xfail *-*-* } } */
> -  /* { dg-bogus "UNKNOWN" "unknown" { xfail *-*-* } .-1 } */
> -  __analyzer_eval (p->d == 'D'); /* { dg-warning "TRUE" "true" {
xfail *-*-* } } */
> -  /* { dg-bogus "UNKNOWN" "unknown" { xfail *-*-* } .-1 } */
> +  __analyzer_eval (p->a == 'A'); /* { dg-warning "TRUE" "true" } */
> +  __analyzer_eval (p->b == 'B'); /* { dg-warning "TRUE" "true" } */
> +  __analyzer_eval (p->c == 'C'); /* { dg-warning "TRUE" "true" } */
> +  __analyzer_eval (p->d == 'D'); /* { dg-warning "TRUE" "true" } */
> +}

It's great to see xfails become passes.

> +void test_3 ()
> +{
> +  struct s3 x = {{'A', 'B'}, {'C', 'D'}};
> +  __analyzer_eval (x.arr1[0] == 'A'); /* { dg-warning "TRUE" } */
> +  __analyzer_eval (x.arr1[1] == 'B'); /* { dg-warning "TRUE" } */
> +  __analyzer_eval (x.arr2[0] == 'C'); /* { dg-warning "TRUE" } */
> +  __analyzer_eval (x.arr2[1] == 'D'); /* { dg-warning "TRUE" } */
> +  struct s1 *p = (struct s1 *)&x;
> +  __analyzer_eval (p->a == 'A'); /* { dg-warning "TRUE" "true" } */
> +  __analyzer_eval (p->b == 'B'); /* { dg-warning "TRUE" "true" } */
> +  __analyzer_eval (p->c == 'C'); /* { dg-warning "TRUE" "true" } */
> +  __analyzer_eval (p->d == 'D'); /* { dg-warning "TRUE" "true" } */
>  }

Thanks again for the patch; hope the above makes sense and is
constructive.

Dave

Re: Patch for PR analyzer/98797

Reply via email to