On 12/15/21 12:23 PM, Andrew MacLeod wrote:
On 12/14/21 18:55, Martin Sebor wrote:

Thanks a lot for the feedback!  I'll need some time to fully
digest it.  Just a few clarifying comments to explain what
I'm after.

Andrew, to improve the context of the late warnings I'm trying
to see how to get the execution path(s) leading from function
entry up to a statement.  For example, for the code below I'd
like to "collect" and show the three conditionals in the context
of the warning:

extern char a[9];

void f (int m, int n, void *s)
{
  if (m < 3) m = 3;
  if (n < 4) n = 4;
  if (s != 0)
    {
      char *d = a + 3;
      __builtin_memcpy (d, s, m + n);
    }
}

At a minimum, I'd like to print a note after the warning:

  warning: ‘__builtin_memcpy’ writing between 7 and 2147483647 bytes into a region of size 6 overflows the destination [-Wstringop-overflow=]

  note: when 'm >= 3 && n >= 4 && s != 0'

(The final version would point to each conditional in the source
like the static analyzer does.)


Finding the exact location of the conditional may not be possible.. the condition could be the aggregation of multiple statements.  The ranges of m and n are set via 2 statements, the condition and the assignment, and then collected by a 3rd, the PHI node.

The "note:" itself could simply be created from the ranges..  the range at the memcpy of m,  n and s are:

m_3  : int [3, +INF]
n_4  : int [4, +INF]
s_10(D) :     void * [1B, +INF]

which could be easily translated to 'm >= 3 && n >= 4 && s != 0', assuming you can figure out the original source name.. In this case happens to be the VAR on the ssa-name, but you can't always count on that.  I'd also point out there is no statement which has m >=3 or n>= 4, but rather a series of statements that produce that result.

Understood.




For conditions that involve ranges used in the statements (i.e.,
the first two conditions in the source above), I wonder if rather
than traversing the CFG in a separate step, I might be able to
use Ranger to collect the conditions at the time it populates its
cache (i.e., when I call it to get range info for each statement).
I imagine I would need to derive a new class from gimple_ranger
and override some virtual member functions.  (And maybe also do
the same for ranger_cache?)

Does this sound like something I should be able to do within
the framework?  If yes, do you have any tips or suggestions for
where/how to start?

That is not really going to work. The cache is agnostic as to why there are ranges or what caused them.  It is primarily a propagation engine which utilizes the edge ranges provided by GORI and range-ops.  Ranger itself is the controller for this property propagation.

GORI does not treat any statement as special.  It is an equation solver which uses range-ops to algebraically solve for an unknown value when the other values are known.  The "if" condition at the exit of a basic block is considered no different than an assignment statement where the LHS is either [0,0] or [1,1] depending on the edge you are looking at.    So to use your example above for just m:

     <bb 2> :
     if (m_6(D) <= 2)
       goto <bb 3>; [INV]       //  [1,1] = m_6 <= 2
     else
       goto <bb 4>; [INV]      //    [0,0] = m_6 <= 2

2->3  (T) m_6(D) :      int [-INF, 2]
2->4  (F) m_6(D) :      int [3, +INF]

=========== BB 3 ============
     <bb 3> :

=========== BB 4 ============
Imports: n_8(D)
Exports: n_8(D)
n_8(D)  int VARYING
     <bb 4> :
    # m_3 = PHI <m_6(D)(2), 3(3)>    //  m_3 = union (m_6 (2->4), [3,3]) = union ([3, +INF],[3,3]) = [3, +INF]

m_3 : int [3, +INF]


We generate 3 ranges for m..  m_6 on the 1)edge 2->3, 2)edge 2->4 and 3)combine them with the constant 3 to produce m_3 at the PHI.

This all happens at a fairly low level, and you wont be able to overload anything to be able to log it in a sensible fashion. And I'm not sure how you would go about logging the arbitrarily long series of actions that result in the range you are looking at, and then utilize it.  GORI could tell you that m_6 and m_3 are used in the range calculation, but then the IL tells you that too.

Yes, it is in the IL, and I need to get it from there.  Ranger
traverses the IL to build up the cache and I thought if I could
hook into it I could, in addition, collect the GIMPLE_CONDs that
Ranger relies on to determine the bounds.  (For each SSA_NAME
I'm thinking I'd collect an array of either the GIMPLE_CONDs,
or perhaps more usefully the edges between the CONDs and
the branches that contribute to the lower (and maybe also upper)
bounds.

The only other way to do it that I can think of is to recreate
the same process Ranger uses, except outside of it (including
all the computation and range propagation).

Thanks
Martin

PS I'm assuming -O0 for the above test case where the m + n
expression is a sum of two PHIs.  With -O1 and higher some of
the integer conditionals end up transformed into MAX_EXPRs so
it will likely need to be handled differently or I may not be
able to capture all the conditions reliably.  I don't know how
much that might compromise the result.

Within ranger is is irrelevant whether its from a MAX or a condition.

   <bb 3> [local count: 574129753]:
   _5 = MAX_EXPR <n_6(D), 4>;
   _7 = MAX_EXPR <m_4(D), 3>;
   _1 = _5 + _7;

_7 generates a range of [3, +INF] based on range-ops knowledge of how MAX works.

In both cases, for m_3 above and _7, the range of [3, +INF] is generated at the definition point... One is a PHI which includes some CFG propagation to retrieve the argument values and perform the calculation, and  _7 which is much simpler.. its just the global characteristic of the statement calculated by range-ops where the LHS is the unknown in the equation.  Ultimately, both ranges are created at the definition point of the ssa_name  _7 or m_3.

Understood.


The question is what precisely do you want to do with the IL  and what do you want to point at?  Optimization makes mapping things back to the original IL problematic to be sure, and as you can see with the _7, even the symbolic name can become lost or changed.

Right.  To print the conditional expression I expand each SSA_NAME
to its definition (variable, function call, or expressions like *p).
There's some loss of accuracy in the process so the result is not
a perfect reflection of the original source code.  Some things are
uglier than others.  But printing the expanded expressions is just
one possibility and it may not be best.  Just pointing to
the GIMPLE_CONDs and indicating which branch, when taken, leads
to the offending statement might be a simpler alternative.


GORI can be utilized to tell you whether a range is generated on an edge for any specific name, but it depends on what you want to do with it. There is currently no facility to ask which particular statement caused the range to be created, mostly because there hasn't been a need, and also because it is usually the result of a string of operations.   Given:

a = m + 10
if (m > 10 && m < 100)
     foo (a)

we see:

     <bb 2> :
     a_5 = m_4(D) + 10;
     m.0_1 = (unsigned int) m_4(D);
     _2 = m.0_1 + 4294967285;
     if (_2 <= 88)
       goto <bb 3>; [INV]
     else
       goto <bb 4>; [INV]

a_5 : int [-2147483638, +INF]
2->3  (T) m.0_1 :       unsigned int [11, 99]
2->3  (T) m_4(D) :      int [11, 99]
2->3  (T) a_5 :         int [21, 109]
2->4  (F) m.0_1 :       unsigned int [0, 10][100, +INF]
2->4  (F) m_4(D) :      int [-INF, 10][100, +INF]
2->4  (F) a_5 :         int [-2147483638, 20][110, +INF]

=========== BB 3 ============
a_5     int [21, 109]

We produce a range of [21, 109] for a_5,  but which statement would we point to as the "defining" statement for that range?  (_2 <= 88) isn't very useful.  a_5 is never used in a condition, but based on the reduced range of m_4 we can recalculate a_5 on that path to produce that value... but there is no good single thing to point to.

Hmm, yes, this could make the result even more detached from
the source.


The condition can be arbitrarily complex as well:

if we change the condition to:

   if ((m > 10 && m < 100) || m == 200)
     foo (a);

     a_8 = m_7(D) + 10;
     m.0_1 = (unsigned int) m_7(D);
     _2 = m.0_1 + 4294967285;
     _3 = _2 <= 88;
     _4 = m_7(D) == 200;
     _5 = _3 | _4;
     if (_5 != 0)
       goto <bb 3>; [INV]
     else
       goto <bb 4>; [INV]

a_8 : int [-2147483638, +INF]
2->3  (T) m.0_1 :       unsigned int [11, 99][200, 200]
2->3  (T) m_7(D) :      int [11, 99][200, 200]
2->3  (T) a_8 :         int [21, 109][210, 210]
2->4  (F) m.0_1 :       unsigned int [0, 10][100, 199][201, +INF]
2->4  (F) m_7(D) :      int [-INF, 10][100, 199][201, +INF]
2->4  (F) a_8 :         int [-2147483638, 20][110, 209][211, +INF]

=========== BB 3 ============
a_8     int [21, 109][210, 210]

Now the condition that causes this range is a string of expressions that all combine to produce the range, again, none of which actually utilize a_8.   GORI uses range-ops to solve the equation from the bottom of the block upwards until it has a result for m_7,  and then recalculates a_8 based on that.

We could produce a list of the ssa_names which are evaluated on each edge to produce a range.  In the latter case, the range of a_8 is produced by evaluating _5, _3, _4, m_7, _2, m.0_1, m_7, and finally a_8.


So I think the long answer to your question is no, ranger probably isn't going to provide a quick pointer to the information you are looking for.    However, its not clear to me precisely what you are looking for. GORI can provide lots of information about whether ranges are created or not from each basic block, and there is a fair amount of dependency information available about which names go into a calucaltion.  Its also good at seeing thru different combinations of expressions. The engine could also be repurposed, but to delve into the details of each low-level range operation in such a way to produce useful high-level information isn't clear to me how that would work.

You could reverse engineer an expression from the resulting range, and then point to the source statements from the aggregate of the line number info over all ssa_names definitions used in the calculation by GORI across the various blocks in the dominators, but I can see how that could quickly become verbose. Probably not that useful unless is was limited and some mechanism for combining source lines from different blocks in a meaningful way.

I'm not familiar with how much useful info the line-info has, nor how much "noise" this would produce.  I would expect this to rapidly degenerate and become confusing with missing information as the control flow or optimizations increase.  Perhaps it could be limited to just simple cases, if they could be easily identified.


These are valid concerns that might in the end spell doom for
the whole idea.  I'm at a very early stage to have a sense of
the extent of some of these problems.

I do have a couple of proof-of-concept implementations of
the general idea (sans ranges) so I have some sense of what
to expect.  Both are far from ideal but in light of the issues
we have discussed recently having something seems better than
nothing at all.

The ideal goal I think is to produce output similar to what our
static analyzer puts out.  For instance, instead of printing just:

warning: ‘v’ may be used uninitialized in this function [-Wmaybe-uninitialized]

I'd like to print this for the test:

/src/gcc/master/gcc/testsuite/gcc.dg/uninit-pred-2_b.c:26:5: warning: use of uninitialized value ‘v’ [CWE-457] [-Wanalyzer-use-of-uninitialized-value] 26 | blah(v); /* { dg-warning "uninitialized" "real uninitialized var warning" } */
      |     ^~~~~~~
  ‘foo’: events 1-5
    |
    |   13 |   if (n)
    |      |      ^
    |      |      |
    |      |      (1) following ‘false’ branch (when ‘n == 0’)...
    |......
    |   19 |   if (m)
    |      |      ~
    |      |      |
    |      |      (2) ...to here
    |......
    |   25 |   if (!flag)
    |      |      ~
    |      |      |
    |      |      (3) following ‘true’ branch (when ‘flag == 0’)...
| 26 | blah(v); /* { dg-warning "uninitialized" "real uninitialized var warning" } */
    |      |     ~~~~~~~
    |      |     |
    |      |     (4) ...to here
    |      |     (5) use of uninitialized value ‘v’ here
    |

I can come close to this for simple code that doesn't involve
ranges (like the offsets or the size in a memcpy() call).  My
biggest challenge at the moment is getting the conditions that
ranges are derived from in code with nontrivial logic.  To get
at those it seems like I would have to redo all the same work
that Ranger does.

Thanks again!

Martin

Reply via email to