Re: getting branch conditions using ranger

Martin Sebor via Gcc Wed, 15 Dec 2021 15:12:26 -0800

On 12/15/21 12:23 PM, Andrew MacLeod wrote:

On 12/14/21 18:55, Martin Sebor wrote:


Thanks a lot for the feedback!  I'll need some time to fully
digest it.  Just a few clarifying comments to explain what
I'm after.

Andrew, to improve the context of the late warnings I'm trying
to see how to get the execution path(s) leading from function
entry up to a statement.  For example, for the code below I'd
like to "collect" and show the three conditionals in the context
of the warning:

extern char a[9];

void f (int m, int n, void *s)
{
  if (m < 3) m = 3;
  if (n < 4) n = 4;
  if (s != 0)
    {
      char *d = a + 3;
      __builtin_memcpy (d, s, m + n);
    }
}

At a minimum, I'd like to print a note after the warning:
warning: ‘__builtin_memcpy’ writing between 7 and 2147483647 bytesinto a region of size 6 overflows the destination [-Wstringop-overflow=]
  note: when 'm >= 3 && n >= 4 && s != 0'

(The final version would point to each conditional in the source
like the static analyzer does.)
Finding the exact location of the conditional may not be possible.. thecondition could be the aggregation of multiple statements. The rangesof m and n are set via 2 statements, the condition and the assignment,and then collected by a 3rd, the PHI node.
The "note:" itself could simply be created from the ranges.. the rangeat the memcpy of m, n and s are:
m_3  : int [3, +INF]
n_4  : int [4, +INF]
s_10(D) :     void * [1B, +INF]
which could be easily translated to 'm >= 3 && n >= 4 && s != 0',assuming you can figure out the original source name.. In this casehappens to be the VAR on the ssa-name, but you can't always count onthat. I'd also point out there is no statement which has m >=3 or n>=4, but rather a series of statements that produce that result.


Understood.

For conditions that involve ranges used in the statements (i.e.,
the first two conditions in the source above), I wonder if rather
than traversing the CFG in a separate step, I might be able to
use Ranger to collect the conditions at the time it populates its
cache (i.e., when I call it to get range info for each statement).
I imagine I would need to derive a new class from gimple_ranger
and override some virtual member functions.  (And maybe also do
the same for ranger_cache?)

Does this sound like something I should be able to do within
the framework?  If yes, do you have any tips or suggestions for
where/how to start?
That is not really going to work. The cache is agnostic as to why thereare ranges or what caused them. It is primarily a propagation enginewhich utilizes the edge ranges provided by GORI and range-ops. Rangeritself is the controller for this property propagation.
GORI does not treat any statement as special. It is an equation solverwhich uses range-ops to algebraically solve for an unknown value whenthe other values are known. The "if" condition at the exit of a basicblock is considered no different than an assignment statement where theLHS is either [0,0] or [1,1] depending on the edge you are looking at. So to use your example above for just m:
     <bb 2> :
     if (m_6(D) <= 2)
       goto <bb 3>; [INV]       //  [1,1] = m_6 <= 2
     else
       goto <bb 4>; [INV]      //    [0,0] = m_6 <= 2

2->3  (T) m_6(D) :      int [-INF, 2]
2->4  (F) m_6(D) :      int [3, +INF]

=========== BB 3 ============
     <bb 3> :

=========== BB 4 ============
Imports: n_8(D)
Exports: n_8(D)
n_8(D)  int VARYING
     <bb 4> :
# m_3 = PHI <m_6(D)(2), 3(3)> // m_3 = union (m_6 (2->4),[3,3]) = union ([3, +INF],[3,3]) = [3, +INF]
m_3 : int [3, +INF]
We generate 3 ranges for m.. m_6 on the 1)edge 2->3, 2)edge 2->4 and3)combine them with the constant 3 to produce m_3 at the PHI.
This all happens at a fairly low level, and you wont be able to overloadanything to be able to log it in a sensible fashion. And I'm not surehow you would go about logging the arbitrarily long series of actionsthat result in the range you are looking at, and then utilize it. GORIcould tell you that m_6 and m_3 are used in the range calculation, butthen the IL tells you that too.


Yes, it is in the IL, and I need to get it from there.  Ranger
traverses the IL to build up the cache and I thought if I could
hook into it I could, in addition, collect the GIMPLE_CONDs that
Ranger relies on to determine the bounds.  (For each SSA_NAME
I'm thinking I'd collect an array of either the GIMPLE_CONDs,
or perhaps more usefully the edges between the CONDs and
the branches that contribute to the lower (and maybe also upper)
bounds.

The only other way to do it that I can think of is to recreate
the same process Ranger uses, except outside of it (including
all the computation and range propagation).

Thanks
Martin

PS I'm assuming -O0 for the above test case where the m + n
expression is a sum of two PHIs.  With -O1 and higher some of
the integer conditionals end up transformed into MAX_EXPRs so
it will likely need to be handled differently or I may not be
able to capture all the conditions reliably.  I don't know how
much that might compromise the result.
Within ranger is is irrelevant whether its from a MAX or a condition.

   <bb 3> [local count: 574129753]:
   _5 = MAX_EXPR <n_6(D), 4>;
   _7 = MAX_EXPR <m_4(D), 3>;
   _1 = _5 + _7;
_7 generates a range of [3, +INF] based on range-ops knowledge of howMAX works.
In both cases, for m_3 above and _7, the range of [3, +INF] is generatedat the definition point... One is a PHI which includes some CFGpropagation to retrieve the argument values and perform the calculation,and _7 which is much simpler.. its just the global characteristic ofthe statement calculated by range-ops where the LHS is the unknown inthe equation. Ultimately, both ranges are created at the definitionpoint of the ssa_name _7 or m_3.


Understood.

The question is what precisely do you want to do with the IL and whatdo you want to point at? Optimization makes mapping things back to theoriginal IL problematic to be sure, and as you can see with the _7, eventhe symbolic name can become lost or changed.


Right.  To print the conditional expression I expand each SSA_NAME
to its definition (variable, function call, or expressions like *p).
There's some loss of accuracy in the process so the result is not
a perfect reflection of the original source code.  Some things are
uglier than others.  But printing the expanded expressions is just
one possibility and it may not be best.  Just pointing to
the GIMPLE_CONDs and indicating which branch, when taken, leads
to the offending statement might be a simpler alternative.

GORI can be utilized to tell you whether a range is generated on an edgefor any specific name, but it depends on what you want to do with it.There is currently no facility to ask which particular statement causedthe range to be created, mostly because there hasn't been a need, andalso because it is usually the result of a string of operations. Given:
a = m + 10
if (m > 10 && m < 100)
     foo (a)

we see:

     <bb 2> :
     a_5 = m_4(D) + 10;
     m.0_1 = (unsigned int) m_4(D);
     _2 = m.0_1 + 4294967285;
     if (_2 <= 88)
       goto <bb 3>; [INV]
     else
       goto <bb 4>; [INV]

a_5 : int [-2147483638, +INF]
2->3  (T) m.0_1 :       unsigned int [11, 99]
2->3  (T) m_4(D) :      int [11, 99]
2->3  (T) a_5 :         int [21, 109]
2->4  (F) m.0_1 :       unsigned int [0, 10][100, +INF]
2->4  (F) m_4(D) :      int [-INF, 10][100, +INF]
2->4  (F) a_5 :         int [-2147483638, 20][110, +INF]

=========== BB 3 ============
a_5     int [21, 109]
We produce a range of [21, 109] for a_5, but which statement would wepoint to as the "defining" statement for that range? (_2 <= 88) isn'tvery useful. a_5 is never used in a condition, but based on the reducedrange of m_4 we can recalculate a_5 on that path to produce thatvalue... but there is no good single thing to point to.


Hmm, yes, this could make the result even more detached from
the source.

The condition can be arbitrarily complex as well:

if we change the condition to:

   if ((m > 10 && m < 100) || m == 200)
     foo (a);

     a_8 = m_7(D) + 10;
     m.0_1 = (unsigned int) m_7(D);
     _2 = m.0_1 + 4294967285;
     _3 = _2 <= 88;
     _4 = m_7(D) == 200;
     _5 = _3 | _4;
     if (_5 != 0)
       goto <bb 3>; [INV]
     else
       goto <bb 4>; [INV]

a_8 : int [-2147483638, +INF]
2->3  (T) m.0_1 :       unsigned int [11, 99][200, 200]
2->3  (T) m_7(D) :      int [11, 99][200, 200]
2->3  (T) a_8 :         int [21, 109][210, 210]
2->4  (F) m.0_1 :       unsigned int [0, 10][100, 199][201, +INF]
2->4  (F) m_7(D) :      int [-INF, 10][100, 199][201, +INF]
2->4  (F) a_8 :         int [-2147483638, 20][110, 209][211, +INF]

=========== BB 3 ============
a_8     int [21, 109][210, 210]
Now the condition that causes this range is a string of expressions thatall combine to produce the range, again, none of which actually utilizea_8. GORI uses range-ops to solve the equation from the bottom of theblock upwards until it has a result for m_7, and then recalculates a_8based on that.
We could produce a list of the ssa_names which are evaluated on eachedge to produce a range. In the latter case, the range of a_8 isproduced by evaluating _5, _3, _4, m_7, _2, m.0_1, m_7, and finally a_8.
So I think the long answer to your question is no, ranger probably isn'tgoing to provide a quick pointer to the information you are looking for. However, its not clear to me precisely what you are looking for.GORI can provide lots of information about whether ranges are created ornot from each basic block, and there is a fair amount of dependencyinformation available about which names go into a calucaltion. Its alsogood at seeing thru different combinations of expressions. The enginecould also be repurposed, but to delve into the details of eachlow-level range operation in such a way to produce useful high-levelinformation isn't clear to me how that would work.
You could reverse engineer an expression from the resulting range, andthen point to the source statements from the aggregate of the linenumber info over all ssa_names definitions used in the calculation byGORI across the various blocks in the dominators, but I can see how thatcould quickly become verbose. Probably not that useful unless is waslimited and some mechanism for combining source lines from differentblocks in a meaningful way.
I'm not familiar with how much useful info the line-info has, nor howmuch "noise" this would produce. I would expect this to rapidlydegenerate and become confusing with missing information as the controlflow or optimizations increase. Perhaps it could be limited to justsimple cases, if they could be easily identified.


These are valid concerns that might in the end spell doom for
the whole idea.  I'm at a very early stage to have a sense of
the extent of some of these problems.

I do have a couple of proof-of-concept implementations of
the general idea (sans ranges) so I have some sense of what
to expect.  Both are far from ideal but in light of the issues
we have discussed recently having something seems better than
nothing at all.

The ideal goal I think is to produce output similar to what our
static analyzer puts out.  For instance, instead of printing just:

warning: ‘v’ may be used uninitialized in this function[-Wmaybe-uninitialized]


I'd like to print this for the test:

/src/gcc/master/gcc/testsuite/gcc.dg/uninit-pred-2_b.c:26:5: warning:use of uninitialized value ‘v’ [CWE-457][-Wanalyzer-use-of-uninitialized-value]26 | blah(v); /* { dg-warning "uninitialized" "realuninitialized var warning" } */

      |     ^~~~~~~
  ‘foo’: events 1-5
    |
    |   13 |   if (n)
    |      |      ^
    |      |      |
    |      |      (1) following ‘false’ branch (when ‘n == 0’)...
    |......
    |   19 |   if (m)
    |      |      ~
    |      |      |
    |      |      (2) ...to here
    |......
    |   25 |   if (!flag)
    |      |      ~
    |      |      |
    |      |      (3) following ‘true’ branch (when ‘flag == 0’)...

| 26 | blah(v); /* { dg-warning "uninitialized" "realuninitialized var warning" } */

    |      |     ~~~~~~~
    |      |     |
    |      |     (4) ...to here
    |      |     (5) use of uninitialized value ‘v’ here
    |

I can come close to this for simple code that doesn't involve
ranges (like the offsets or the size in a memcpy() call).  My
biggest challenge at the moment is getting the conditions that
ranges are derived from in code with nontrivial logic.  To get
at those it seems like I would have to redo all the same work
that Ranger does.

Thanks again!

Martin

Re: getting branch conditions using ranger

Reply via email to