On May 26, 2009, at 11:43 AM, David Mandelin wrote:

Ted Kremenek wrote:
C++ support in Clang is rapidly progressing,
Cool. Is there a page with notes on the design? I'm curious what approach you are using. Elsa's GLR design seems like a good approach, but it doesn't cover all the latest complicated template features, and the latest problems I saw seemed difficult to solve in that design. (It was long enough ago that I don't remember the exact problem.)

Hi David,

Clang uses a recursive descent parser design, and this applies to the C++ portion as well. We have found that the design works really well, and leads to a clean implementation that is fairly easy to understand and extend.

As you pointed out, the Clang documentation on the parser is lacking (and should be improved). The design itself is simple. The parser handles the grammar of the language, and calls back into an abstract interface (called 'Actions') that is responsible for building up the ASTs, performing the type checking, etc. (the implementation of this is called 'Sema'). The abstract interface allows the parsing logic to be (relatively) simple, and allows one to swap in a different implementation of the abstract interface if one didn't want to do full type-checking, etc.

but because the Clang static analyzer performs static analysis at the source level, simply having C++ parsing support does not imply immediate support in the analyzer. Bringing that feature up will likely require active participation from the open source community.
What about running over a language-independent IR instead? That's the approach we've used in Treehydra and it seems like it would be even better because I think you have much cleaner IRs in LLVM.

We definitely thought about this, and we made a deliberate choice to analyze source code instead of LLVM IR.

Analyzing at the IR level certainly has its benefits, as complicated language features are lowered to primitive operations, allowing one to focus on analyzing those primitives. This is both a significant blessing and an onerous curse, and it all comes down to the tradeoffs employed.

Our choice of analyzing source code came down to several key factors (in no particular order):

a) Understanding high-level interfaces.

Many complicated language features reduce to a large number of LLVM IR instructions, but ultimately we are interested in the macroscopic actions (e.g., the invocation of a method, which could span many LLVM IR instructions). Many bugs have to do with reasoning about interfaces rather than the specific low-level semantics (which are also important, but can be approximated), and doing this at the source level can be much easier.

Further, source often captures the programmer's intent in a far more recognizable way than a lowered representation. Many bugs are somewhere between the poles of syntax and semantics. Sometimes a potential bug isn't really ever a bug if it occurs within a macro, or the code may have been written in a specific way indicating that the user didn't care about the "bug". For example, consider a dead store:

  err = foo();

versus

  if (err = foo())

Suppose 'err' is never read after the assignment. According to the semantics of C, the variable 'err' isn't actually read in either case, but the first case is more likely to be a programming mistake than the second (the second can also be an error if they meant to do 'err == foo()', but that conceptually is a different kind of bug). Certainly distinguishing between these cases can be done at the LLVM IR level, but it is a little more tricky to do. There are also cases such as 'i++' and 'i = i + 1' that are potentially indistinguishable at the LLVM IR level, but could be relevant when determining the chance that a real bug occurred.

In other words, precisely analyzing semantics isn't always enough, and understanding the intent of the programmer, which often boils down to looking at syntax, is often very useful when determining whether or not a real error is present.

b) Language types

Often high-level language types are essentially completely erased at the LLVM IR level, being lowered to structs, etc. The high-level type system is especially useful when one is analyzing a language with a rich OO-type system such as Objective-C and C++. This is useful both for reasoning about high-level interfaces (my previous point) and thinking about virtual function calls, etc.

c) Great diagnostics

Clang's preprocessor and parser are integrated, meaning the ASTs have full information regarding macros, pragmas, the #include stack, and so on. Clang also has full source range information, with locations for individual '{' tokens, etc. This allows the analyzer to report excellent diagnostics with full column and line information, source ranges, etc. Such rich location information also allows us to potentially tie into code refactoring operations that could be used to either fix bugs or to transform the code in some other useful way. While it is possible to tie much of the LLVM IR back to the original source, this isn't always trivial as the lowering could be architecture dependent. Moreover, because some language-level features (such as an Objective-C method invocation) lower to many LLVM IR instructions, performing the back mapping in many cases can be non-trivial and error prone.

d) Sometimes lying gets you closer to the truth

Precisely handling various operations such as sign-extension, bit masking, etc., when reasoning about symbolic values can be challenging. Instead of being perfect, I think it is easier to approximate the truth when analyzing source code than when analyzing LLVM IR (since operations can be broken up over many instructions). At a high-level representation, it is often easier to understand what is important and what is not when it comes to precisely analyzing a fragment of code. Sometimes not handling certain details just doesn't really matter, and in certain cases where clang's analyzer currently doesn't handle something well we can often recover path-sensitivity by making up new symbolic values, etc., when the result of an operation is "too complicated" to reason about. I think this kind of cheating is often easier to do at a high-level than when using a lowered representation, but opinions may differ.


Of course analyzing source code can be hard. One has to reason about arbitrary casts, short-circuit operations, etc., all of which are simplified when lowered to the LLVM IR level. However, I argue that once the core logic to handle such things is implemented, the hard work in implementing the analyzer lies elsewhere (e.g., reasoning about symbolic values and abstracted program memory, etc.).

The clang analyzer currently does mostly local analysis, essentially operating under the conservative approximation that the implementations of called functions/methods are unavailable for analysis. The plan is to add more global analysis over time, hopefully over the next year (time permitting).
We generally do unsound analysis instead (assuming the callees do nothing, or do a little bit we can guess at, like writing to reference-typed arguments) to cut down on false positives. Maybe the best possible tool has a dial to tune the level of conservatism. I have no idea what the best default for general-purpose checking is, though.

Ah. By conservative I meant a combination of unsound and sound approximations designed to reduce the number of false positives and have a high signal-to-noise ratio from the analyzer. In other words, I prefer to accept false negatives in exchange for fewer false positives in order to extract the most useful results.
_______________________________________________
dev-static-analysis mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-static-analysis