[PATCH] D36737: [analyzer] Store design discussions in docs/analyzer for future use.

Artem Dergachev via Phabricator via cfe-commits Tue, 15 Aug 2017 06:49:04 -0700

NoQ created this revision.

Because i tend to raise important design questions in weird places like tiny 
phabricator patches, we had a thought that it's better to archive them in 
`docs/analyzer` for future reference. We should be able to pick them up 
whenever time comes to solve these problems.


I initialize a new folder in `docs/analyzer` for this sort of stuff with the 
discussion in https://reviews.llvm.org/D35216; we also have a few more 
discussions that might look good in such archive.


https://reviews.llvm.org/D36737

Files:
  docs/analyzer/DesignDiscussions/InitializerLists.txt

Index: docs/analyzer/DesignDiscussions/InitializerLists.txt
===================================================================
--- /dev/null
+++ docs/analyzer/DesignDiscussions/InitializerLists.txt
@@ -0,0 +1,325 @@
+This discussion took place in https://reviews.llvm.org/D35216
+"Escape symbols when creating std::initializer_list".
+
+It touches problems of modelling C++ standard library constructs in general,
+including modelling implementation-defined fields within C++ standard library
+objects, in particular constructing objects into pointers held by such fields,
+and separation of responsibilities between analyzer's core and checkers.
+
+================================================================================
+
+== Artem: ==
+
+I've seen a few false positives that appear because we construct C++11
+std::initializer_list objects with brace initializers, and such construction is
+not properly modeled. For instance, if a new object is constructed on the heap
+only to be put into a brace-initialized STL container, the object is reported to
+be leaked.
+
+Approach (0): This can be trivially fixed by this patch, which causes pointers
+passed into initializer list expressions to immediately escape.
+
+This fix is overly conservative though. So i did a bit of investigation as to
+how model std::initializer_list better.
+
+According to the standard, std::initializer_list<T> is an object that has
+methods begin(), end(), and size(), where begin() returns a pointer to continous
+array of size() objects of type T, and end() is equal to begin() plus size().
+The standard does hint that it should be possible to implement
+std::initializer_list<T> as a pair of pointers, or as a pointer and a size
+integer, however specific fields that the object would contain are an
+implementation detail.
+
+Ideally, we should be able to model the initializer list's methods precisely.
+Or, at least, it should be possible to explain to the analyzer that the list
+somehow "takes hold" of the values put into it. Initializer lists can also be
+copied, which is a separate story that i'm not trying to address here.
+
+The obvious approach to modeling std::initializer_list in a checker would be to
+construct a SymbolMetadata for the memory region of the initializer list object,
+which would be of type T* and represent begin(), so we'd trivially model begin()
+as a function that returns this symbol. The array pointed to by that symbol
+would be bindLoc()ed to contain the list's contents (probably as a CompoundVal
+to produce less bindings in the store). Extent of this array would represent
+size() and would be equal to the length of the list as written.
+
+So this sounds good, however apparently it does nothing to address our false
+positives: when the list escapes, our RegionStoreManager is not magically
+guessing that the metadata symbol attached to it, together with its contents,
+should also escape. In fact, it's impossible to trigger a pointer escape from
+within the checker.
+
+Approach (1): If only we enabled ProgramState::bindLoc(..., notifyChanges=true)
+to cause pointer escapes (not only region changes) (which sounds like the right
+thing to do anyway) such checker would be able to solve the false positives by
+triggering escapes when binding list elements to the list. However, it'd be as
+conservative as the current patch's solution. Ideally, we do not want escapes to
+happen so early. Instead, we'd prefer them to be delayed until the list itself
+escapes.
+
+So i believe that escaping metadata symbols whenever their base regions escape
+would be the right thing to do. Currently we didn't think about that because we
+had neither pointer-type metadatas nor non-pointer escapes.
+
+Approach (2): We could teach the Store to scan itself for bindings to
+metadata-symbolic-based regions during scanReachableSymbols() whenever a region
+turns out to be reachable. This requires no work on checker side, but it sounds
+performance-heavy.
+
+Approach (3): We could let checkers maintain the set of active metadata symbols
+in the program state (ideally somewhere in the Store, which sounds weird but
+causes the smallest amount of layering violations), so that the core knew what
+to escape. This puts a stress on the checkers, but with a smart data map it
+wouldn't be a problem.
+
+Approach (4): We could allow checkers to trigger pointer escapes in arbitrary
+moments. If we allow doing this within checkPointerEscape callback itself, we
+would be able to express facts like "when this region escapes, that metadata
+symbol attached to it should also escape". This sounds like an ultimate freedom,
+with maximum stress on the checkers - still not too much stress when we have
+smart data maps.
+
+I'm personally liking the approach (2) - it should be possible to avoid
+performance overhead, and clarity seems nice.
+
+== Gabor: ==
+
+At this point, I am a bit wondering about two questions.
+
+  - When should something belong to a checker and when should something belong
+to the engine? Sometimes we model library aspects in the engine and model
+language constructs in checkers.
+  - What is the checker programming model that we are aiming for? Maximum
+freedom or more easy checker development?
+
+I think if we aim for maximum freedom, we do not need to worry about the
+potential stress on checkers, and we can introduce abstractions to mitigate that
+later on.
+If we want to simplify the API, then maybe it makes more sense to move language
+construct modeling to the engine when the checker API is not sufficient instead
+of complicating the API.
+
+Right now I have no preference or objections between the alternatives but there
+are some random thoughts:
+
+  - Maybe it would be great to have a guideline how to evolve the analyzer and
+follow it, so it can help us to decide in similar situations
+  - I do care about performance in this case. The reason is that we have a
+limited performance budget. And I think we should not expect most of the checker
+writers to add modeling of language constructs. So, in my opinion, it is ok to
+have less nice/more verbose API for language modeling if we can have better
+performance this way, since it only needs to be done once, and is done by the
+framework developers.
+
+== Artem: ==
+
+These are some great questions, i guess it'd be better to discuss them more
+openly. As a quick dump of my current mood:
+
+  - To me it seems obvious that we need to aim for a checker API that is both
+simple and powerful. This can probably by keeping the API as powerful as
+necessary while providing a layer of simple ready-made solutions on top of it.
+Probably a few reusable components for assembling checkers. And this layer
+should ideally be pleasant enough to work with, so that people would prefer to
+extend it when something is lacking, instead of falling back to the complex
+omnipotent API. I'm thinking of AST matchers vs. AST visitors as a roughly
+similar situation: matchers are not omnipotent, but they're so nice.
+
+  - Separation between core and checkers is usually quite strange. Once we have
+shared state traits, i generally wouldn't mind having region store or range
+constraint manager as checkers (though it's probably not worth it to transform
+them - just a mood). The main thing to avoid here would be the situation when
+the checker overwrites stuff written by the core because it thinks it has a
+better idea what's going on, so the core should provide a good default behavior.
+
+  - Yeah, i totally care about performance as well, and if i try to implement
+approach, i'd make sure it's good.
+
+== Artem: ==
+
+> Approach (2): We could teach the Store to scan itself for bindings to
+> metadata-symbolic-based regions during scanReachableSymbols() whenever
+> a region turns out to be reachable. This requires no work on checker side,
+> but it sounds performance-heavy.
+
+Nope, this approach is wrong. Metadata symbols may become out-of-date: when the
+object changes, metadata symbols attached to it aren't changing (because symbols
+simply don't change). The same metadata may have different symbols to denote its
+value in different moments of time, but at most one of them represents the
+actual metadata value. So we'd be escaping more stuff than necessary.
+
+If only we had "ghost fields"
+(http://lists.llvm.org/pipermail/cfe-dev/2016-May/049000.html), it would have
+been much easier, because the ghost field would only contain the actual
+metadata, and the Store would always know about it. This example adds to my
+belief that ghost fields are exactly what we need for most C++ checkers.
+
+== Devin: ==
+
+In this case, I would be fine with some sort of AbstractStorageMemoryRegion that
+meant "here is a memory region and somewhere reachable from here exists another
+region of type T". Or even multiple regions with different identifiers. This
+wouldn't specify how the memory is reachable, but it would allow for transfer
+functions to get at those regions and it would allow for invalidation.
+
+For std::initializer_list this reachable region would the region for the backing
+array and the transfer functions for begin() and end() yield the beginning and
+end element regions for it.
+
+In my view this differs from ghost variables in that (1) this storage does
+actually exist (it is just a library implementation detail where that storage
+lives) and (2) it is perfectly valid for a pointer into that storage to be
+returned and for another part of the program to read or write from that storage.
+(Well, in this case just read since it is allowed to be read-only memory).
+
+What I'm not OK with is modeling abstract analysis state (for example, the count
+of a NSMutableArray or the typestate of a file handle) as a value stored in some
+ginned up region in the store. This takes an easy problem that the analyzer does
+well at (modeling typestate) and turns it into a hard one that the analyzer is
+bad at (reasoning about the contents of the heap).
+
+I think the key criterion here is: "is the region accessible from outside the
+library". That is, does the library expose the region as a pointer that can be
+read to or written from in the client program? If so, then it makes sense for
+this to be in the store: we are modeling reachable storage as storage. But if
+we're just modeling arbitrary analysis facts that need to be invalidated when a
+pointer escapes then we shouldn't try to gin up storage for them just to get
+invalidation for free.
+
+== Artem: ==
+
+
+> In this case, I would be fine with some sort of AbstractStorageMemoryRegion
+> that meant "here is a memory region and somewhere reachable from here exists
+> another region of type T". Or even multiple regions with different
+> identifiers. This wouldn't specify how the memory is reachable, but it would
+> allow for transfer functions to get at those regions and it would allow for
+> invalidation.
+
+Yeah, this is what we can easily implement now as a
+symbolic-region-based-on-a-metadata-symbol (though we can make a new region
+class for that if we eg. want it typed). The problem is that the relation
+between such storage region and its parent object region is essentially
+immaterial, similarly to the relation between SymbolRegionValue and its parent
+region. Region contents are mutable: today the abstract storage is reachable
+from its parent object, tomorrow it's not, and maybe something else becomes
+reachable, something that isn't even abstract. So the parent region for the
+abstract storage is most of the time at best a "nice to know" thing - we cannot
+rely on it to do any actual work. We'd anyway need to rely on the checker to do
+the job.
+
+> For std::initializer_list this reachable region would the region for the
+> backing array and the transfer functions for begin() and end() yield the
+> beginning and end element regions for it.
+
+So maybe in fact for std::initializer_list it may work fine because you cannot
+change the data after the object is constructed - so this region's contents are
+essentially immutable. For the future, i feel as if it is a dead end.
+
+I'd like to consider another funny example. Suppose we're trying to model
+std::unique_ptr. Consider:
+
+void bar(const std::unique_ptr<int> &x);
+
+void foo(std::unique_ptr<int> &x) {
+  int *a = x.get();   // (a, 0, direct): &AbstractStorageRegion
+  *a = 1;             // (AbstractStorageRegion, 0, direct): 1 S32b
+  int *b = new int;
+  *b = 2;             // (SymRegion{conj_$0<int *>}, 0 ,direct): 2 S32b
+  x.reset(b);         // Checker map: x -> SymRegion{conj_$0<int *>}
+  bar(x);             // 'a' doesn't escape (the pointer was unique), 'b' does.
+  clang_analyzer_eval(*a == 1); // Making this true is up to the checker.
+  clang_analyzer_eval(*b == 2); // Making this unknown is up to the checker.
+}
+
+The checker doesn't totally need to ensure that *a == 1 passes - even though the
+pointer was unique, it could theoretically have .get()-ed above and the code
+could of course break the uniqueness invariant (though we'd probably want it).
+The checker can say that "even if *a did escape, it was not because it was
+stuffed directly into bar()".
+
+The checker's direct responsibility, however, is to solve the *b == 2 thing
+(which is in fact the problem we're dealing with in this patch - escaping the
+storage region of the object).
+
+So we're talking about one more operation over the program state (scanning
+reachable symbols and regions) that cannot work without checker support.
+
+We can probably add a new callback "checkReachableSymbols" to solve this. This
+is in fact also related to the dead symbols problem (we're scanning for live
+symbols in the store and in the checkers separately, but we need to do so
+simultaneously with a single worklist). Hmm, in fact this sounds like a good
+idea; we can replace checkLiveSymbols with checkReachableSymbols.
+
+Or we could just have ghost member variables, and no checker support required at
+all. For ghost member variables, the relation with their parent region (which
+would be their superregion) is actually useful, the mutability of their contents
+is expressed naturally, and the store automagically sees reachable symbols, live
+symbols, escapes, invalidations, whatever.
+
+> In my view this differs from ghost variables in that (1) this storage does
+> actually exist (it is just a library implementation detail where that storage
+> lives) and (2) it is perfectly valid for a pointer into that storage to be
+> returned and for another part of the program to read or write from that
+> storage. (Well, in this case just read since it is allowed to be read-only
+> memory).
+
+> What I'm not OK with is modeling abstract analysis state (for example, the
+> count of a NSMutableArray or the typestate of a file handle) as a value stored
+> in some ginned up region in the store.This takes an easy problem that the
+> analyzer does well at (modeling typestate) and turns it into a hard one that
+> the analyzer is bad at (reasoning about the contents of the heap).
+
+Yeah, i tend to agree on that. For simple typestates, this is probably an
+overkill, so let's definitely put aside the idea of "ghost symbolic regions"
+that i had earlier.
+
+But, to summarize a bit, in our current case, however, the typestate we're
+looking for is the contents of the heap. And when we try to model such
+typestates (complex in this specific manner, i.e. heap-like) in any checker, we
+have a choice between re-doing this modeling in every such checker (which is
+something analyzer is indeed good at, but at a price of making checkers heavy)
+or instead relying on the Store to do exactly what it's designed to do.
+
+> I think the key criterion here is: "is the region accessible from outside
+> the library". That is, does the library expose the region as a pointer that
+> can be read to or written from in the client program? If so, then it makes
+> sense for this to be in the store: we are modeling reachable storage as
+> storage. But if we're just modeling arbitrary analysis facts that need to be
+> invalidated when a pointer escapes then we shouldn't try to gin up storage
+> for them just to get invalidation for free.
+
+As a metaphor, i'd probably compare it to body farms - the difference between
+ghost member variables and metadata symbols seems to me like the difference
+between body farms and evalCall. Both are nice to have, and body farms are very
+pleasant to work with, even if not omnipotent. I think it's fine for a
+FunctionDecl's body in a body farm to have a local variable, even if such
+variable doesn't actually exist, even if it cannot be seen from outside the
+function call. I'm not seeing immediate practical difference between "it does
+actually exist" and "it doesn't actually exist, just a handy abstraction".
+Similarly, i think it's fine if we have a CXXRecordDecl with
+implementation-defined contents, and try to farm up a member variable as a handy
+abstraction (we don't even need to know its name or offset, only that it's there
+somewhere).
+
+== Artem: ==
+
+We've discussed it in person with Devin, and he provided more points to think
+about:
+
+  - If the initializer list consists of non-POD data, constructors of list's
+objects need to take the sub-region of the list's region as this-region In the
+current (v2) version of this patch, these objects are constructed elsewhere and
+then trivial-copied into the list's metadata pointer region, which may be
+incorrect. This is our overall problem with C++ constructors, which manifests in
+this case as well. Additionally, objects would need to be constructed in the
+analyzer's core, which would not be able to predict that it needs to take a
+checker-specific region as this-region, which makes it harder, though it might
+be mitigated by sharing the checker state traits.
+
+  - Because "ghost variables" are not material to the user, we need to somehow
+make super sure that they don't make it into the diagnostic messages.
+
+So, because this needs further digging into overall C++ support and rises too
+many questions, i'm delaying a better approach to this problem and will fall
+back to the original trivial patch.

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D36737: [analyzer] Store design discussions in docs/analyzer for future use.

Reply via email to