steakhal added a comment. In D144269#4237539 <https://reviews.llvm.org/D144269#4237539>, @dkrupp wrote:
> This is a totally rewritten version of the patch which solely relies on the > existing "interestingness" utility to track back the taint propagation. (And > does not introduce a new FlowID in the ProgramState as requested in the > reviews.) > > -The new version also places a Note, when the taintedness is propagated to an > argument or to a return value. So it should be easier for the user to follow > how the taint information is spreading. > -"The taint originated here" is printed correctly at the taint source > function, which introduces taintedness. (Main goal of this patch.) > > Implementation: > -The createTaintPreTag() function places a NoteTag at the taint propagation > function calls, if taintedness is propagated. Then at report creation, the > tainted arguments are marked interesting if propagated taintedness is > relevant for the bug report. > > - The isTainted() function is extended to return the actually tainted > SymbolRef. This is important to be able to consistently mark relevant symbol > interesting which carries the taintedness in a complex expression. So this is how you circumvent introducing "transitive interestingness", because now you know which symbol to track. > -createTaintPostTag(..) function places a NoteTag to the taint generating > function calls to mark them interesting if they are relevant for a > taintedness report. So if they propagated taintedness to interesting > symbol(s). > > The tests are passing and the reports on the open source projects are much > better understandable than before (twin, tmux, curl): > > https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=curl_curl-7_66_0_dkrupp_taint_origin_fix_new&run=tmux_2.6_dkrupp_taint_origin_fix_new&run=twin_v0.8.1_dkrupp_taint_origin_fix_new&is-unique=on&diff-type=New&checker-msg=%2auntrusted%2a&checker-msg=Out%20of%20bound%20memory%20access%20%28index%20is%20tainted%29 I've looked at the results you attached and they look good to me. Do you have an example where two tainted values are contributing to the same bug? Something like: n <- read(); m <- read(); malloc(n+m) Will both of these values tracked back? What do the notes look like? --- All in all, I'm pleased to see the improvements. It looks much better now IMO. Using two NoteTags cleans up the implementation quite a bit, kudos. I don't think there are major problems with this implementation, so I decided to spew your code with my nitpicks :D Remember to clang-format your code. See `clang/tools/clang-format/clang-format-diff.py`. And there are a few overloads of `getNoteTag()`, and there could be a better fit for usecases; you should decide. I also find it difficult to track how a variable got tainted across assignments and computations like in this <https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?run=curl_curl-7_66_0_dkrupp_taint_origin_fix_new&run=tmux_2.6_dkrupp_taint_origin_fix_new&run=twin_v0.8.1_dkrupp_taint_origin_fix_new&is-unique=on&diff-type=New&checker-msg=%2auntrusted%2a&checker-msg=Out%20of%20bound%20memory%20access%20%28index%20is%20tainted%29&report-id=1549451&report-hash=c509c5c3af850f823db37c471550c248&report-filepath=%2afake_ntlm.c> case. This observation is completely orthogonal to your patch, I'm just noting it. it was bad previously as well. Maybe we could have a visitor for explaining the taint tracking across assignments and computations to complement the NoteTag. I hope I didn't miss much this time :D ================ Comment at: clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp:238-244 + if (kind == OOB_Tainted) + BT.reset( + new BugType(this, "Out-of-bound access", categories::TaintedData)); + else + BT.reset( + new BugType(this, "Out-of-bound access", categories::LogicError)); + } ---------------- ================ Comment at: clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp:45 -void DivZeroChecker::reportBug( - const char *Msg, ProgramStateRef StateZero, CheckerContext &C, - std::unique_ptr<BugReporterVisitor> Visitor) const { +void DivZeroChecker::reportBug(const char *Msg, std::string Category, + ProgramStateRef StateZero, CheckerContext &C, ---------------- It feels odd to have both `const char*` and a `std::string` on the same line. Should we update `const char*` to a more sophisticated type? I'm thinking of `StringRef`. It seems like that type should be used for the `Category` as well since the `PathSensitiveBugReport` constructor takes that, so we don't need to have an owning type here. ================ Comment at: clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp:91-96 + SymbolRef TSR = isTainted(C.getState(), *DV); + if ((stateNotZero && stateZero && TSR)) { + reportBug("Division by a tainted value, possibly zero", + categories::TaintedData, stateZero, C, TSR); return; } ---------------- ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:138-139 /// Also considers stdin as a taint source. std::optional<SVal> getTaintedPointeeOrPointer(const CheckerContext &C, + const ProgramStateRef State, SVal Arg) { ---------------- I'm not sure about passing both the `CheckerContext` and the `State`. The `CheckerContext` already encapsulates a `State`, which opens up the possibilities for misuse. For example, the `getPointeeOf()` is called in this function, and that will eventually call `C.getState()` under the hood. So, for me it feels like a bad API design. What we could do instead, is to pass an `ASTContext` and a `State`; resolving this discrepancy. Could you please check if this is a real concern? ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:158 +/// when the return value, or the outgoing parameters are tainted. +const NoteTag *createTaintPreTag(CheckerContext &C, + std::vector<SymbolRef> TaintedSymbols, ---------------- This function should be an implementation detail, as such I wonder if we should make it `static`. How about naming this function differently? I'm thinking of `taintOriginTrackerTag`, IDK. ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:162 + const CallEvent& Call) { + const LocationContext* LC = Call.getCalleeStackFrame(0); + ---------------- If the `Call` parameter is only used for acquiring the `LocationContext`, wouldn't it be more descriptive to directly pass the `LocationContext` to the function instead? I'm also puzzled that we use `getCalleeStackFrame` here. I rarely ever see this function, so I'm a bit worried if this pick was intentional. That we pass the `0` as the `BlockCount` argument only reinforces this instinct. ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:165 + const NoteTag *InjectionTag = C.getNoteTag( + [TaintedSymbols, TaintedArgs,LC](PathSensitiveBugReport &BR) -> std::string { + SmallString<256> Msg; ---------------- It should consume the `TaintedSymbols` and `TaintedArgs` variables, as such you should `std::move` out from the original parameters like this: ``` [TaintedSymbols = std::move(TaintedSymbols), TaintedArgs = std::move(TaintedArgs)](){...} ``` ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:169 + // We give diagnostics only for taint related reports + if (!BR.isInteresting(LC) || + BR.getBugType().getCategory() != categories::TaintedData) { ---------------- What does the first half of this condition guard against? Do you have a test for it to demonstrate? ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:173-176 + if (TaintedSymbols.empty()){ + Out << "Taint originated here"; + return std::string(Out.str()); + } ---------------- ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:177-183 + int i = 0; + for (SymbolRef SR : TaintedSymbols) { + LLVM_DEBUG(llvm::dbgs() << "Taint Propagated from argument" + << TaintedArgs.at(i) << "\n"); + BR.markInteresting(SR); + i++; + } ---------------- ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:192 +/// when the return value, or the outgoing parameters are tainted. +const NoteTag *createTaintPostTag(CheckerContext &C, + std::vector<SymbolRef> TaintedSymbols, ---------------- How about calling this `taintPropagationExplainerTag`? ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:197 + const LocationContext* LC = Call.getCalleeStackFrame(0); + + const NoteTag *InjectionTag = C.getNoteTag( ---------------- Please assert that the size of `TaintedSymbols` must be the same as `TaintedArgs`. Same in the other function. ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:198 + + const NoteTag *InjectionTag = C.getNoteTag( + [TaintedSymbols, TaintedArgs, LC](PathSensitiveBugReport &BR) -> std::string { ---------------- ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:214 + << TaintedArgs.at(i) << "\n"); + Out << "Taint propagated to argument " << TaintedArgs.at(i); + } else { ---------------- So, if `TaintedSymbols.size() > 1`, then the note message will look weird. Could you please have a test for this? ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:871 + }else{ + LLVM_DEBUG(llvm::dbgs() << "Strange. Return value not tainted.\n"); + LLVM_DEBUG(Call.getReturnValue().dumpToStream(llvm::dbgs())); ---------------- I cannot see a test against the "Strange" string. Is this dead code? Same for the other block. ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:919-922 + //Add taintedness to stdin parameters + if (isStdin(C.getSVal(E),C.getASTContext())){ + State = addTaint(State, C.getSVal(E)); + } ---------------- I just want to get it confirmed that this hunk is unrelated to your change per-say. Is it? BTW I don't mind this change being part of this patch, rather the opposite. Finally, we will have it. ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:942 bool IsMatching = PropSrcArgs.isEmpty(); - ForEachCallArg( - [this, &C, &IsMatching, &State](ArgIdxTy I, const Expr *E, SVal) { - IsMatching = IsMatching || (PropSrcArgs.contains(I) && - isTaintedOrPointsToTainted(E, State, C)); - }); + const NoteTag *InjectionTag = nullptr; + ---------------- I prefer to move declarations close to their uses. ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:1003 State = State->set<TaintArgsOnPostVisit>(C.getStackFrame(), Result); - C.addTransition(State); + InjectionTag = createTaintPreTag(C,TaintedSymbols,TaintedIndexes,Call); + C.addTransition(State, InjectionTag); ---------------- ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:1025-1028 + SymbolRef TSR = isTainted(C.getState(),*TaintedSVal); + if (TSR) + report->markInteresting(TSR); + ---------------- ================ Comment at: clang/lib/StaticAnalyzer/Checkers/Taint.cpp:169-173 + SymbolRef TSR; + if (TSR = isTainted(State, ER->getSuperRegion(), K)) + return TSR; + else if (TSR = isTainted(State, ER->getIndex(), K)) + return TSR; ---------------- I'd probably swap the two branches, so that the index would be tracked if both the region and the index are tainted. I also wonder if this edge-case could be tested at all. ================ Comment at: clang/lib/StaticAnalyzer/Checkers/Taint.cpp:228-230 + if (SymbolRef TSR = + isTainted(State, SRV->getRegion(), Kind)) + return TSR; ---------------- What does `TSR` abbreviate? I would find `TaintedSym` more descriptive. ================ Comment at: clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp:220-228 + if (Kind == VLA_Tainted) + BT.reset(new BugType(this, + "Dangerous variable-length array (VLA) declaration", + categories::TaintedData)); + else + BT.reset(new BugType(this, + "Dangerous variable-length array (VLA) declaration", ---------------- Why don't we use a distinct BugType for this? ================ Comment at: clang/test/Analysis/taint-diagnostic-visitor.c:46 // expected-note@-1 {{Taint originated here}} + // expected-note@-2 {{Taint propagated to argument 1}} int vla[x]; // expected-warning {{Declared variable-length array (VLA) has tainted size}} ---------------- I'd suggest to count from 1 instead of 0 when referring to Nth arguments. You can also use the `llvm::getOrdinalSuffix(N)` to get nicer messages. We already count from 1 in the std library checker. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D144269/new/ https://reviews.llvm.org/D144269 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits