[PATCH] D103605: [analyzer] Introduce a new interface for tracking

Artem Dergachev via Phabricator via cfe-commits Mon, 07 Jun 2021 20:01:36 -0700

NoQ added a comment.

In D103605#2802240 <https://reviews.llvm.org/D103605#2802240>, @vsavchenko 
wrote:

> In D103605#2801681 <https://reviews.llvm.org/D103605#2801681>, @NoQ wrote:
>
>> 
>
> OK, let's make a pause right here.  We again start to go into other topics.

These are not other topics. We're discussing the overall direction into which 
this patchset is a large step. I absolutely welcome this step and am fascinated 
to see how it turns out but I want us to agree on the overall direction first.

> You come from a pure perspective that `track` explains how the value got here 
> and does only that.  In this picture of the world, `track` has one starting 
> point and one ending point.  If it ends to early (due to the lack of domain 
> knowledge), the checker can pick up and restart the tracking if needed.  If 
> it ends for real, this is the earliest where we got this value from and it 
> can tell us about the warning and situation we are in.  The problem is that 
> tracking is fundamentally **dirty** the way it is.  It is not linear, and 
> when we track one simple value, it actually spawns a gigantic tree of traces. 
>  Thus when we get to the point where it all ended, it actually ended 100 
> times.  It ended in multiple different visitors that were tracking all kinds 
> of things not limited to the value itself.  `trackExpressionValue` is hard 
> also because of this reason.  Inline defensive checks don't belong there and 
> clutter it.
>
> I DON'T KNOW how to simplify this ATM.  Probably we can decouple ONE VALUE 
> TRACKING from the rest of it, but it's not what I'm trying to do.  I want to 
> refactor the existing solution (with all of its inherent design flaws) into 
> something more extensible.  And I think that again ATM, with the current 
> state of things, the concept of "tracking stopped here" is doomed.

The reason why I come from this perspective is because that's how I think it is 
//supposed// to work and I'm not aware of a single good reason why it can't. I 
think the reason why it's this way is that the original authors seem to have 
bombarded the code with extra clauses until it starts demonstrating the more or 
less expected behavior, without trying to solve the problem on paper first. 
Like with visitor fixpoint iteration, it sounds like an unnecessary piece of 
complexity that we will be ultimately able to eliminate.

Except, well, the distant constraints example a-la @RedDocMD's problem but for 
raw pointers. This is indeed an exceptional case which may cause two tracking 
efforts to run in parallel. This can be probably generalized to any 
constraint-like traits: if we want to explain a constraint, we'll have to be 
prepared that the constraint isn't part of the value track that we already 
have. This is the first time the above assumption was challenged and I 
definitely need time to process this but I still think there shouldn't be any 
other situations like this and there should be something special about this 
situation that will allow us to incorporate it into the linear-track solution.

"At the moment such refactoring is difficult to achieve" is a perfectly 
satisfying answer to me with respect to the current state of things. But I want 
us to agree as early as possible about whether we want to achieve this. Simply 
so that as we make every step, we keep this in mind and evaluate every step as 
taking us either closer to this goal or farther from this goal and try to avoid 
the latter. And I don't think that anything prevents us from discussing this 
goal right now; the problem we're trying to solve isn't really ever going to 
change and we already have a fairly large amount of data.

> With the new solution, if you want the most simple ways of addressing it, you 
> create a handler:
>
>   class FooHandler : public ExpressionHandler {
>   public:
>     using ExpressionHandler::ExpressionHandler;
>   
>     Tracker::Result handle(const Expr *E, const ExplodedNode *InputNode,
>                            const ExplodedNode *ExprNode,
>                            TrackingOptions Opts) {
>       if (auto *Arg = getFooArgument(E)) {
>         return getParentTracker().track(E, ExprNode, Opts);
>       }
>       return {};
>   };
>
> Essentially these handlers is my way of first introducing new logic for these 
> two checkers, and second to extract similar logic that exists there from all 
> other checkers.

Ok, do I understand correctly that:

- Functionality-wise, this is entirely equivalent to my NoteTag solution with 
"`isTracked()`";
- Boilerplate-wise it makes the `isTracked()` check implicit ("we were invoked 
at all therefore the value of the expression is tracked by... somebody") at the 
cost of asking the user to define a class;
- Convenience-wise, it potentially requires the developer to reverse-engineer 
the symbolic execution step when the logic gets more complicated;
- Flexibility-wise... I guess you can make custom trackers now, which will 
track the value in a different manner, and communicate with the tracker in 
order to produce different notes?

> Then, let's say, I decided that I don't like the note "d is assigned here".  
> It doesn't seem very specific and adding more notes on top will overwhelm our 
> poor user even more.  So, I decided to make notes: "d is foo'ed from e", "g 
> is initialized to foo from e", and "p's foo is used as the 1st argument".
> Right now, there is literally no way to do this except for adding this 
> directly to already bloated `FindLastStoreBRVisitor`.  Why do you think it 
> already produces notes about "garbage values" and "null/nil pointers"?
>
> So, the checker implements a `StoreHandler` where they handle all the cases.  
> It is important to have it exactly like this because `FindLastStoreBRVisitor` 
> has non-trivial logic of catching "stores" and we don't want this logic to 
> get copied into different checkers.
>
> And if we are talking where it can go further, this particular approach, is 
> defining more "events" in the tracking process and decoupling "event 
> identification" from the process of generating notes for such events.

Yeah, the store handler part seems perfectly legit because *I* don't see any 
better approaches here :) It's indeed a big problem that we can't customize 
existing notes and this sounds like the right data structure to solve the 
problem.

>> Interesting, can you give an example (say, in case of @RedDocMD's problem)  
>> where `isTracked()` would give an incorrect result but "I am tracking this 
>> value" would give a correct result? I guess such situations will arise more 
>> often as we add more tracking, but still.
>
> I'm not super sure, but I think smth like this will suffice:
>
>   // a and b are completely unrelated smart pointers
>   raw = a.get();
>   if (!b.get() && !a.get()) {
>     use(*raw);
>   }
>
> We can attempt to track `b.get()` because of the conditions.  Checker 
> wouldn't get that we are actually interested in conditions and will initiate 
> a full-blown back-trace for `b` even though we don't need it.
> (...)
> And yes, checkers should be able to say "I track this value". It is the 
> operation that makes sense only if it is helpful for the checker, and the 
> checker should be able to customize it however the checker pleases.

Ok, I don't understand! If we have anyway tracked the value back to `b.get()`, 
why not track further through `b`? What makes `b.get()` the exact place at 
which we want to stop tracking the value? If we wanted to know why we choose 
the true path at the branch, shouldn't we track further through `b` as well in 
order to answer that question? If we didn't want to know why we choose the true 
path, why did we track the value back to `b.get()` in the first place?

I guess one possible difference between `a` and `b` here would be in messaging, 
i.e. how we refer to that value. For example, we could refer to `a` as "the 
pointer value" (implying that this is the null-dereferenced pointer value) and 
refer to `b` as "a pointer value" (implying that it's some other value, maybe 
the one that participates in condition later). But even then, this distinction 
is much more coarse-grained than "I track the value" vs. "Someone else tracks 
the value". It's more like "Buggy value" vs. "Condition value" which is a 
distinction we already "almost" make with `TrackingKind`. Even if we wanted to 
identify our values as "pointer #1", "pointer #2", etc., we could totally do 
with a sufficiently advanced definition of `isTracked()`.

> Can we move inline defensive checks out of the BugReportVisitors.cpp with 
> `isTracked()` only?

My original thinking was along the lines of, eg.,:

  void ExprEngine::processBranch(...) {
    Bldr.addTransition(..., getNoteTag([] (BugReport &BR) {
      if (BR.isTracked(Condition) && !N->getLocationContext().inTopFrame())
        BR.markInvalid();
    }));
  }

And this is how, basically, the entirety of `trackExpressionValue()` could be 
decomposed into individual stateless note tags, with all the necessary state 
stuffed into the bug report. Including the tracking process itself: if note 
tags are sufficient to help with tracking, they're sufficient to do the 
tracking from the start.

This code could also be moved into a checker's `checkBranchCondition()` so that 
to make it truly checker-specific. That would come at the cost of creating a 
lot of redundant exploded nodes just for the sake of attaching the tag that 
will probably never be used so this is rather bad, wouldn't recommend.

I still think the solution that relies on well-defined end of tracking is 
significantly more elegant. Which, again, could be implemented with note tags: 
we could have a note tag that'd call a callback stuffed directly into the bug 
report.

So I think it can work both ways given that the tags solution would rely on 
having a sufficiently flexible mutable state as part of the `BugReport` object.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103605/new/

https://reviews.llvm.org/D103605

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D103605: [analyzer] Introduce a new interface for tracking

Reply via email to