On 12/14/21 18:55, Martin Sebor wrote:
Andrew, to improve the context of the late warnings I'm trying
to see how to get the execution path(s) leading from function
entry up to a statement. For example, for the code below I'd
like to "collect" and show the three conditionals in the context
of the warning:
extern char a[9];
void f (int m, int n, void *s)
{
if (m < 3) m = 3;
if (n < 4) n = 4;
if (s != 0)
{
char *d = a + 3;
__builtin_memcpy (d, s, m + n);
}
}
At a minimum, I'd like to print a note after the warning:
warning: ‘__builtin_memcpy’ writing between 7 and 2147483647 bytes
into a region of size 6 overflows the destination [-Wstringop-overflow=]
note: when 'm >= 3 && n >= 4 && s != 0'
(The final version would point to each conditional in the source
like the static analyzer does.)
Finding the exact location of the conditional may not be possible.. the
condition could be the aggregation of multiple statements. The ranges
of m and n are set via 2 statements, the condition and the assignment,
and then collected by a 3rd, the PHI node.
The "note:" itself could simply be created from the ranges.. the range
at the memcpy of m, n and s are:
m_3 : int [3, +INF]
n_4 : int [4, +INF]
s_10(D) : void * [1B, +INF]
which could be easily translated to 'm >= 3 && n >= 4 && s != 0',
assuming you can figure out the original source name.. In this case
happens to be the VAR on the ssa-name, but you can't always count on
that. I'd also point out there is no statement which has m >=3 or n>=
4, but rather a series of statements that produce that result.
For conditions that involve ranges used in the statements (i.e.,
the first two conditions in the source above), I wonder if rather
than traversing the CFG in a separate step, I might be able to
use Ranger to collect the conditions at the time it populates its
cache (i.e., when I call it to get range info for each statement).
I imagine I would need to derive a new class from gimple_ranger
and override some virtual member functions. (And maybe also do
the same for ranger_cache?)
Does this sound like something I should be able to do within
the framework? If yes, do you have any tips or suggestions for
where/how to start?
That is not really going to work. The cache is agnostic as to why there
are ranges or what caused them. It is primarily a propagation engine
which utilizes the edge ranges provided by GORI and range-ops. Ranger
itself is the controller for this property propagation.
GORI does not treat any statement as special. It is an equation solver
which uses range-ops to algebraically solve for an unknown value when
the other values are known. The "if" condition at the exit of a basic
block is considered no different than an assignment statement where the
LHS is either [0,0] or [1,1] depending on the edge you are looking at.
So to use your example above for just m:
<bb 2> :
if (m_6(D) <= 2)
goto <bb 3>; [INV] // [1,1] = m_6 <= 2
else
goto <bb 4>; [INV] // [0,0] = m_6 <= 2
2->3 (T) m_6(D) : int [-INF, 2]
2->4 (F) m_6(D) : int [3, +INF]
=========== BB 3 ============
<bb 3> :
=========== BB 4 ============
Imports: n_8(D)
Exports: n_8(D)
n_8(D) int VARYING
<bb 4> :
# m_3 = PHI <m_6(D)(2), 3(3)> // m_3 = union (m_6 (2->4),
[3,3]) = union ([3, +INF],[3,3]) = [3, +INF]
m_3 : int [3, +INF]
We generate 3 ranges for m.. m_6 on the 1)edge 2->3, 2)edge 2->4 and
3)combine them with the constant 3 to produce m_3 at the PHI.
This all happens at a fairly low level, and you wont be able to overload
anything to be able to log it in a sensible fashion. And I'm not sure
how you would go about logging the arbitrarily long series of actions
that result in the range you are looking at, and then utilize it. GORI
could tell you that m_6 and m_3 are used in the range calculation, but
then the IL tells you that too.
Thanks
Martin
PS I'm assuming -O0 for the above test case where the m + n
expression is a sum of two PHIs. With -O1 and higher some of
the integer conditionals end up transformed into MAX_EXPRs so
it will likely need to be handled differently or I may not be
able to capture all the conditions reliably. I don't know how
much that might compromise the result.
Within ranger is is irrelevant whether its from a MAX or a condition.
<bb 3> [local count: 574129753]:
_5 = MAX_EXPR <n_6(D), 4>;
_7 = MAX_EXPR <m_4(D), 3>;
_1 = _5 + _7;
_7 generates a range of [3, +INF] based on range-ops knowledge of how
MAX works.
In both cases, for m_3 above and _7, the range of [3, +INF] is generated
at the definition point... One is a PHI which includes some CFG
propagation to retrieve the argument values and perform the calculation,
and _7 which is much simpler.. its just the global characteristic of
the statement calculated by range-ops where the LHS is the unknown in
the equation. Ultimately, both ranges are created at the definition
point of the ssa_name _7 or m_3.
The question is what precisely do you want to do with the IL and what
do you want to point at? Optimization makes mapping things back to the
original IL problematic to be sure, and as you can see with the _7, even
the symbolic name can become lost or changed.
GORI can be utilized to tell you whether a range is generated on an edge
for any specific name, but it depends on what you want to do with it.
There is currently no facility to ask which particular statement caused
the range to be created, mostly because there hasn't been a need, and
also because it is usually the result of a string of operations. Given:
a = m + 10
if (m > 10 && m < 100)
foo (a)
we see:
<bb 2> :
a_5 = m_4(D) + 10;
m.0_1 = (unsigned int) m_4(D);
_2 = m.0_1 + 4294967285;
if (_2 <= 88)
goto <bb 3>; [INV]
else
goto <bb 4>; [INV]
a_5 : int [-2147483638, +INF]
2->3 (T) m.0_1 : unsigned int [11, 99]
2->3 (T) m_4(D) : int [11, 99]
2->3 (T) a_5 : int [21, 109]
2->4 (F) m.0_1 : unsigned int [0, 10][100, +INF]
2->4 (F) m_4(D) : int [-INF, 10][100, +INF]
2->4 (F) a_5 : int [-2147483638, 20][110, +INF]
=========== BB 3 ============
a_5 int [21, 109]
We produce a range of [21, 109] for a_5, but which statement would we
point to as the "defining" statement for that range? (_2 <= 88) isn't
very useful. a_5 is never used in a condition, but based on the reduced
range of m_4 we can recalculate a_5 on that path to produce that
value... but there is no good single thing to point to.
The condition can be arbitrarily complex as well:
if we change the condition to:
if ((m > 10 && m < 100) || m == 200)
foo (a);
a_8 = m_7(D) + 10;
m.0_1 = (unsigned int) m_7(D);
_2 = m.0_1 + 4294967285;
_3 = _2 <= 88;
_4 = m_7(D) == 200;
_5 = _3 | _4;
if (_5 != 0)
goto <bb 3>; [INV]
else
goto <bb 4>; [INV]
a_8 : int [-2147483638, +INF]
2->3 (T) m.0_1 : unsigned int [11, 99][200, 200]
2->3 (T) m_7(D) : int [11, 99][200, 200]
2->3 (T) a_8 : int [21, 109][210, 210]
2->4 (F) m.0_1 : unsigned int [0, 10][100, 199][201, +INF]
2->4 (F) m_7(D) : int [-INF, 10][100, 199][201, +INF]
2->4 (F) a_8 : int [-2147483638, 20][110, 209][211, +INF]
=========== BB 3 ============
a_8 int [21, 109][210, 210]
Now the condition that causes this range is a string of expressions that
all combine to produce the range, again, none of which actually utilize
a_8. GORI uses range-ops to solve the equation from the bottom of the
block upwards until it has a result for m_7, and then recalculates a_8
based on that.
We could produce a list of the ssa_names which are evaluated on each
edge to produce a range. In the latter case, the range of a_8 is
produced by evaluating _5, _3, _4, m_7, _2, m.0_1, m_7, and finally a_8.
So I think the long answer to your question is no, ranger probably isn't
going to provide a quick pointer to the information you are looking for.
However, its not clear to me precisely what you are looking for.
GORI can provide lots of information about whether ranges are created or
not from each basic block, and there is a fair amount of dependency
information available about which names go into a calucaltion. Its also
good at seeing thru different combinations of expressions. The engine
could also be repurposed, but to delve into the details of each
low-level range operation in such a way to produce useful high-level
information isn't clear to me how that would work.
You could reverse engineer an expression from the resulting range, and
then point to the source statements from the aggregate of the line
number info over all ssa_names definitions used in the calculation by
GORI across the various blocks in the dominators, but I can see how that
could quickly become verbose. Probably not that useful unless is was
limited and some mechanism for combining source lines from different
blocks in a meaningful way.
I'm not familiar with how much useful info the line-info has, nor how
much "noise" this would produce. I would expect this to rapidly
degenerate and become confusing with missing information as the control
flow or optimizations increase. Perhaps it could be limited to just
simple cases, if they could be easily identified.
Andrew