[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-26 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp closed this revision.
dkrupp added a comment.

Committed in 343bdb10940cb2387c0b9bd3caccee7bb56c937b 
.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-24 Thread Balázs Benics via Phabricator via cfe-commits
steakhal accepted this revision.
steakhal added a comment.
This revision is now accepted and ready to land.

LGTM.
About formatting the tests:
Personally, I would have preferred to "clean as you code", but I can see your 
point. Leave it as-is.
Land it, please.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-24 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp added a comment.

@steakhal  thanks for the review. I fixed all outstanding remarks.
I left the test taint-diagnostic-visitor.c formatting as is to remain 
consistent with the rest of the file. I think we should keep it as is, or 
reformat the whole file.




Comment at: clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp:108-109
+  if ((stateNotZero && stateZero)) {
+std::vector taintedSyms = getTaintedSymbols(C.getState(), *DV);
+if (!taintedSyms.empty()) {
+  reportTaintBug("Division by a tainted value, possibly zero", stateZero, 
C,

steakhal wrote:
> dkrupp wrote:
> > steakhal wrote:
> > > 
> > We cannot get rid off the getTaintedSymbols() call, as we need to pass all 
> > tainted symbols to reportTaintBug if we want to track back multiple 
> > variables. taintedSyms is a parameter of reportTaintBug(...)
> Yes, makes sense. mb.
> One more thing: if `reportTaintBug()` takes the `taintedSyms` vector 
> "by-value", you should express your intent by `std::move()`-ing your 
> collection expressing that it's meant to be consumed instead of taking a copy.
> Otherwise, you could express this intent if the `reportTaintBug()` take a 
> //view// type for the collection, such as `llvm::ArrayRef` - which 
> would neither copy nor move from the callsite's vector, being more performant 
> and expressive.
> 
> I get that this vector is small and bugreport construction will dominate the 
> runtime anyway so I'm not complaining about this specific case, I'm just 
> noting this for the next time. So, here I'm not expecting any actions.
Fixed as suggested. thanks.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:947-948
+  if (!TaintedArgSyms.empty()) {
+TaintedSymbols.insert(TaintedSymbols.begin(), TaintedArgSyms.begin(),
+  TaintedArgSyms.end());
+TaintedIndexes.push_back(I);

steakhal wrote:
> steakhal wrote:
> > 
> I observed you didn't take any action about this suggestion.
> It leaves me wonder if this suggestion - in general - makes sense or if there 
> are other reasons what I cannot foresee.
> I've seen you using the fully spelled-out version in total 8 times.
> Shouldn't we prefer the shorter, more expressive version instead?
Sorry I overlooked this comment. I like this shorter version. It is so much 
consize!  Changed at all places. Thanks for the suggestion.



Comment at: clang/test/Analysis/taint-diagnostic-visitor.c:53-67
+//Tests if the originated note is correctly placed even if the path is
+//propagating through variables and expressions
+char* taintDiagnosticPropagation(){
+  char *pathbuf;
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+ // expected-note@-1 {{Taint propagated to the 
return value}}
+  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}

steakhal wrote:
> steakhal wrote:
> > I know this is subjective, but I'd suggest to reformat the tests to match 
> > LLVM style guidelines, unless the formatting is important for the test.
> > Consistency helps the reader and reviewer, as code and tests are read many 
> > more times than written.
> > 
> > This applies to the rest of the touched tests.
> Originally I meant this to the rest of the test cases you change or add part 
> of this patch. I hope it clarifies.
I made some formatting changes you suggested, but 
I would like to leave the //expected-note tags as they are now, because then it 
remains consistent with the rest of the test cases.

Would it be okay like this, or should I reformat the whole file (untouched 
parts too)?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-24 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp updated this revision to Diff 516389.
dkrupp marked an inline comment as done.
dkrupp added a comment.

-using llvm::ArrayRef in the reportTaintBug(..) function in the 
DivZero Checker


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

Files:
  clang/include/clang/StaticAnalyzer/Checkers/Taint.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h
  clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
  clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/Taint.cpp
  clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
  clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp
  clang/test/Analysis/taint-diagnostic-visitor.c
  clang/test/Analysis/taint-tester.c

Index: clang/test/Analysis/taint-tester.c
===
--- clang/test/Analysis/taint-tester.c
+++ clang/test/Analysis/taint-tester.c
@@ -122,7 +122,7 @@
   fscanf(pp, "%d", &ii);
   int jj = ii;// expected-warning + {{tainted}}
 
-  fscanf(p, "%d", &ii);
+  fscanf(p, "%d", &ii);// expected-warning + {{tainted}}
   int jj2 = ii;// expected-warning + {{tainted}}
 
   ii = 3;
Index: clang/test/Analysis/taint-diagnostic-visitor.c
===
--- clang/test/Analysis/taint-diagnostic-visitor.c
+++ clang/test/Analysis/taint-diagnostic-visitor.c
@@ -2,13 +2,24 @@
 
 // This file is for testing enhanced diagnostics produced by the GenericTaintChecker
 
+typedef __typeof(sizeof(int)) size_t;
+struct _IO_FILE;
+typedef struct _IO_FILE FILE;
+
 int scanf(const char *restrict format, ...);
 int system(const char *command);
+char* getenv( const char* env_var );
+size_t strlen( const char* str );
+void *malloc(size_t size );
+void free( void *ptr );
+char *fgets(char *str, int n, FILE *stream);
+FILE *stdin;
 
 void taintDiagnostic(void)
 {
   char buf[128];
   scanf("%s", buf); // expected-note {{Taint originated here}}
+// expected-note@-1 {{Taint propagated to the 2nd argument}}
   system(buf); // expected-warning {{Untrusted data is passed to a system call}} // expected-note {{Untrusted data is passed to a system call (CERT/STR02-C. Sanitize data passed to complex subsystems)}}
 }
 
@@ -16,6 +27,7 @@
   int index;
   int Array[] = {1, 2, 3, 4, 5};
   scanf("%d", &index); // expected-note {{Taint originated here}}
+   // expected-note@-1 {{Taint propagated to the 2nd argument}}
   return Array[index]; // expected-warning {{Out of bound memory access (index is tainted)}}
// expected-note@-1 {{Out of bound memory access (index is tainted)}}
 }
@@ -23,6 +35,7 @@
 int taintDiagnosticDivZero(int operand) {
   scanf("%d", &operand); // expected-note {{Value assigned to 'operand'}}
  // expected-note@-1 {{Taint originated here}}
+ // expected-note@-2 {{Taint propagated to the 2nd argument}}
   return 10 / operand; // expected-warning {{Division by a tainted value, possibly zero}}
// expected-note@-1 {{Division by a tainted value, possibly zero}}
 }
@@ -31,6 +44,72 @@
   int x;
   scanf("%d", &x); // expected-note {{Value assigned to 'x'}}
// expected-note@-1 {{Taint originated here}}
+   // expected-note@-2 {{Taint propagated to the 2nd argument}}
   int vla[x]; // expected-warning {{Declared variable-length array (VLA) has tainted size}}
   // expected-note@-1 {{Declared variable-length array (VLA) has tainted size}}
 }
+
+
+// Tests if the originated note is correctly placed even if the path is
+// propagating through variables and expressions
+char *taintDiagnosticPropagation(){
+  char *pathbuf;
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+ // expected-note@-1 {{Taint propagated to the return value}}
+  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+	   // expected-note@-1 {{Taking true branch}}
+pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
+// expected-note@-1{{Untrusted data is used to specify the buffer size}}
+// expected-note@-2 {{Taint propagated to the return value}}
+return pathbuf;
+  }
+  return 0;
+}
+
+// Taint origin should be marked correctly even if there are multiple taint
+// sources in the function
+char *taintDiagnosticPropagation2(){
+  char *pathbuf;
+  char *user_env2=getenv("USER_ENV_VAR2");//unrelated taint source
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+ // expected-note@-1 {{Taint propagated to the return value}}
+  char *user_env=get

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-24 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp updated this revision to Diff 516380.
dkrupp marked 10 inline comments as done.
dkrupp added a comment.

-append_range(..) used instead of std::vector.insert(...) to improve readability
-minor updates based on @steakhal comments


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

Files:
  clang/include/clang/StaticAnalyzer/Checkers/Taint.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h
  clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
  clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/Taint.cpp
  clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
  clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp
  clang/test/Analysis/taint-diagnostic-visitor.c
  clang/test/Analysis/taint-tester.c

Index: clang/test/Analysis/taint-tester.c
===
--- clang/test/Analysis/taint-tester.c
+++ clang/test/Analysis/taint-tester.c
@@ -122,7 +122,7 @@
   fscanf(pp, "%d", &ii);
   int jj = ii;// expected-warning + {{tainted}}
 
-  fscanf(p, "%d", &ii);
+  fscanf(p, "%d", &ii);// expected-warning + {{tainted}}
   int jj2 = ii;// expected-warning + {{tainted}}
 
   ii = 3;
Index: clang/test/Analysis/taint-diagnostic-visitor.c
===
--- clang/test/Analysis/taint-diagnostic-visitor.c
+++ clang/test/Analysis/taint-diagnostic-visitor.c
@@ -2,13 +2,24 @@
 
 // This file is for testing enhanced diagnostics produced by the GenericTaintChecker
 
+typedef __typeof(sizeof(int)) size_t;
+struct _IO_FILE;
+typedef struct _IO_FILE FILE;
+
 int scanf(const char *restrict format, ...);
 int system(const char *command);
+char* getenv( const char* env_var );
+size_t strlen( const char* str );
+void *malloc(size_t size );
+void free( void *ptr );
+char *fgets(char *str, int n, FILE *stream);
+FILE *stdin;
 
 void taintDiagnostic(void)
 {
   char buf[128];
   scanf("%s", buf); // expected-note {{Taint originated here}}
+// expected-note@-1 {{Taint propagated to the 2nd argument}}
   system(buf); // expected-warning {{Untrusted data is passed to a system call}} // expected-note {{Untrusted data is passed to a system call (CERT/STR02-C. Sanitize data passed to complex subsystems)}}
 }
 
@@ -16,6 +27,7 @@
   int index;
   int Array[] = {1, 2, 3, 4, 5};
   scanf("%d", &index); // expected-note {{Taint originated here}}
+   // expected-note@-1 {{Taint propagated to the 2nd argument}}
   return Array[index]; // expected-warning {{Out of bound memory access (index is tainted)}}
// expected-note@-1 {{Out of bound memory access (index is tainted)}}
 }
@@ -23,6 +35,7 @@
 int taintDiagnosticDivZero(int operand) {
   scanf("%d", &operand); // expected-note {{Value assigned to 'operand'}}
  // expected-note@-1 {{Taint originated here}}
+ // expected-note@-2 {{Taint propagated to the 2nd argument}}
   return 10 / operand; // expected-warning {{Division by a tainted value, possibly zero}}
// expected-note@-1 {{Division by a tainted value, possibly zero}}
 }
@@ -31,6 +44,72 @@
   int x;
   scanf("%d", &x); // expected-note {{Value assigned to 'x'}}
// expected-note@-1 {{Taint originated here}}
+   // expected-note@-2 {{Taint propagated to the 2nd argument}}
   int vla[x]; // expected-warning {{Declared variable-length array (VLA) has tainted size}}
   // expected-note@-1 {{Declared variable-length array (VLA) has tainted size}}
 }
+
+
+// Tests if the originated note is correctly placed even if the path is
+// propagating through variables and expressions
+char *taintDiagnosticPropagation(){
+  char *pathbuf;
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+ // expected-note@-1 {{Taint propagated to the return value}}
+  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+	   // expected-note@-1 {{Taking true branch}}
+pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
+// expected-note@-1{{Untrusted data is used to specify the buffer size}}
+// expected-note@-2 {{Taint propagated to the return value}}
+return pathbuf;
+  }
+  return 0;
+}
+
+// Taint origin should be marked correctly even if there are multiple taint
+// sources in the function
+char *taintDiagnosticPropagation2(){
+  char *pathbuf;
+  char *user_env2=getenv("USER_ENV_VAR2");//unrelated taint source
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+ // expected-note@-1 {{Taint propagated

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-24 Thread Balázs Benics via Phabricator via cfe-commits
steakhal added a comment.

To conclude the review, please respond to the "Not Done" inline comments, and 
mark them "Done" if you think they are resolved.
Thank you for your patience.




Comment at: clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp:108-109
+  if ((stateNotZero && stateZero)) {
+std::vector taintedSyms = getTaintedSymbols(C.getState(), *DV);
+if (!taintedSyms.empty()) {
+  reportTaintBug("Division by a tainted value, possibly zero", stateZero, 
C,

dkrupp wrote:
> steakhal wrote:
> > 
> We cannot get rid off the getTaintedSymbols() call, as we need to pass all 
> tainted symbols to reportTaintBug if we want to track back multiple 
> variables. taintedSyms is a parameter of reportTaintBug(...)
Yes, makes sense. mb.
One more thing: if `reportTaintBug()` takes the `taintedSyms` vector 
"by-value", you should express your intent by `std::move()`-ing your collection 
expressing that it's meant to be consumed instead of taking a copy.
Otherwise, you could express this intent if the `reportTaintBug()` take a 
//view// type for the collection, such as `llvm::ArrayRef` - which 
would neither copy nor move from the callsite's vector, being more performant 
and expressive.

I get that this vector is small and bugreport construction will dominate the 
runtime anyway so I'm not complaining about this specific case, I'm just noting 
this for the next time. So, here I'm not expecting any actions.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:875-879
+  std::vector TaintedSyms = getTaintedSymbols(State, *V);
+  if (!TaintedSyms.empty()) {
+TaintedSymbols.push_back(TaintedSyms[0]);
+TaintedIndexes.push_back(ArgNum);
+  }

dkrupp wrote:
> steakhal wrote:
> > In these cases, the code would acquire all the tainted subsymbols, which 
> > then we throw away and keep only the first one.
> > This is why I suggested the approach I did I'm my last review.
> I think this suggested solution would not be correct here, as ArgSym might 
> not be the actual _tainted_ symbol (inside a more complex expression).
> 
> So I would prefer to leave it like this for correctness.
Okay, I also checked out the code and verified this. Indeed we would have 
failing tests with my recommendation.
I still think it's suboptimal. This is somewhat related to the tainted API. It 
shouldn't make sense to have tainted regions in the first place, which bites 
here again. Let's keep it as-is, no actions are required.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:947-948
+  if (!TaintedArgSyms.empty()) {
+TaintedSymbols.insert(TaintedSymbols.begin(), TaintedArgSyms.begin(),
+  TaintedArgSyms.end());
+TaintedIndexes.push_back(I);

steakhal wrote:
> 
I observed you didn't take any action about this suggestion.
It leaves me wonder if this suggestion - in general - makes sense or if there 
are other reasons what I cannot foresee.
I've seen you using the fully spelled-out version in total 8 times.
Shouldn't we prefer the shorter, more expressive version instead?



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:920
   /// Check for taint filters.
   ForEachCallArg([this, &C, &State](ArgIdxTy I, const Expr *E, SVal S) {
 if (FilterArgs.contains(I)) {

This unused variable generates a compiler warning.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:995
+  const NoteTag *InjectionTag = taintOriginTrackerTag(
+  C, TaintedSymbols, TaintedIndexes, Call.getCalleeStackFrame(0));
+  C.addTransition(State, InjectionTag);





Comment at: clang/lib/StaticAnalyzer/Checkers/Taint.cpp:275
+TaintedSymbols.push_back(*SI);
+if (returnFirstOnly && !TaintedSymbols.empty())
+  return TaintedSymbols; // return early if needed

The second part of the conjunction should be tautologically true.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-22 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp added a comment.

@steakhal your comments are fixed. Thanks for the review.




Comment at: clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp:108-109
+  if ((stateNotZero && stateZero)) {
+std::vector taintedSyms = getTaintedSymbols(C.getState(), *DV);
+if (!taintedSyms.empty()) {
+  reportTaintBug("Division by a tainted value, possibly zero", stateZero, 
C,

steakhal wrote:
> 
We cannot get rid off the getTaintedSymbols() call, as we need to pass all 
tainted symbols to reportTaintBug if we want to track back multiple variables. 
taintedSyms is a parameter of reportTaintBug(...)



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:863-868
+  std::vector TaintedSyms =
+  getTaintedSymbols(State, Call.getReturnValue());
+  if (!TaintedSyms.empty()) {
+TaintedSymbols.push_back(TaintedSyms[0]);
+TaintedIndexes.push_back(ArgNum);
+  }

steakhal wrote:
> 
I think this suggested solution would not be correct here, as ArgSym might not 
be the actual _tainted_ symbol (inside a more complex expression).

So I would prefer to leave it like this for correctness.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:875-879
+  std::vector TaintedSyms = getTaintedSymbols(State, *V);
+  if (!TaintedSyms.empty()) {
+TaintedSymbols.push_back(TaintedSyms[0]);
+TaintedIndexes.push_back(ArgNum);
+  }

steakhal wrote:
> In these cases, the code would acquire all the tainted subsymbols, which then 
> we throw away and keep only the first one.
> This is why I suggested the approach I did I'm my last review.
I think this suggested solution would not be correct here, as ArgSym might not 
be the actual _tainted_ symbol (inside a more complex expression).

So I would prefer to leave it like this for correctness.



Comment at: clang/lib/StaticAnalyzer/Checkers/Taint.cpp:294
+  std::vector TaintedCasts =
+  getTaintedSymbols(State, SC->getOperand(), Kind);
+  TaintedSymbols.insert(TaintedSymbols.begin(), TaintedCasts.begin(),

steakhal wrote:
> If `returnFirstOnly` is `true`, this `getTaintedSymbols()` call would still 
> eagerly (needlessly) collect all of the symbols.
> I'd recommend propagating the `returnFirstOnly` parameter to the recursive 
> calls to avoid this problem.
> I also encourage you to make use of the `llvm::append_range()` whenever makes 
> sense.
You are perfectly right. I overlooked these calls and because of the the 
default parameter did got get a warning.  now fixed.



Comment at: clang/lib/StaticAnalyzer/Checkers/Taint.cpp:147-150
 bool taint::isTainted(ProgramStateRef State, const Stmt *S,
   const LocationContext *LCtx, TaintTagType Kind) {
-  SVal val = State->getSVal(S, LCtx);
-  return isTainted(State, val, Kind);
+  return !getTaintedSymbols(State, S, LCtx, Kind).empty();
 }

dkrupp wrote:
> steakhal wrote:
> > TBH I'm not sure if I like that now we allocate unbounded amount of times 
> > (because of `getTaintedSymbols()` is recursive and returns by value), where 
> > we previously did not.
> > 
> > What we could possibly do is to compute the elements of this sequence 
> > lazily.
> > I'm thinking of the `llvm::mapped_iterator`, but I'm not sure if it's 
> > possible to have something like that as a return type as it might encode 
> > the map function in the type or something like that.
> > Anyway, I'm just saying that it would be nice to not do more than it's 
> > necessary, and especially not allocate a lot of short-lived objects there.
> > 
> > Do you think there is a way to have the cake and eat it too?
> > 
> > ---
> > 
> > I did some investigation, and one could get pretty far in the 
> > implementation, and maybe even complete it but it would be really 
> > complicated as of now. Maybe we could revisit this subject when we have 
> > coroutines.
> > 
> > So, I would suggest to have two sets of APIs:
> >  - the usual `isTainted(.) -> bool`
> >  - and a `getTaintedSymbols(.) -> vector`
> > The important point would be that the `isTainted()` version would not 
> > eagerly collect all tainted sub-syms but return on finding the first one.
> > While, the `getTaintedSymbols()` would collect eagerly all of them, as its 
> > name suggests.
> > 
> > Imagine if `getTaintedSymbolsImpl()` had an extra flag like `bool 
> > returnAfterFirstMatch`. This way `isTainted()` can call it like that. While 
> > in the other case, the parameter would be `false`, and eagerly collect all 
> > symbols.
> > 
> > This is probably the best of both worlds, as it prevents `isTainted` from 
> > doing extra work and if we need to iterate over the tainted symbols, we 
> > always iterate over all of them, so doing it lazily wouldn't gain us much 
> > in that case anyway.
> > As a bonus, the user-facing API would be s

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-22 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp updated this revision to Diff 516077.
dkrupp marked 13 inline comments as done.
dkrupp added a comment.

-getTaintedSymbols(.) -> getTaintedSymbolsImpl() proxy function introduced for 
interface safety
-Other minor fixes based on comments from @steakhal


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

Files:
  clang/include/clang/StaticAnalyzer/Checkers/Taint.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h
  clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
  clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/Taint.cpp
  clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
  clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp
  clang/test/Analysis/taint-diagnostic-visitor.c
  clang/test/Analysis/taint-tester.c

Index: clang/test/Analysis/taint-tester.c
===
--- clang/test/Analysis/taint-tester.c
+++ clang/test/Analysis/taint-tester.c
@@ -122,7 +122,7 @@
   fscanf(pp, "%d", &ii);
   int jj = ii;// expected-warning + {{tainted}}
 
-  fscanf(p, "%d", &ii);
+  fscanf(p, "%d", &ii);// expected-warning + {{tainted}}
   int jj2 = ii;// expected-warning + {{tainted}}
 
   ii = 3;
Index: clang/test/Analysis/taint-diagnostic-visitor.c
===
--- clang/test/Analysis/taint-diagnostic-visitor.c
+++ clang/test/Analysis/taint-diagnostic-visitor.c
@@ -2,13 +2,24 @@
 
 // This file is for testing enhanced diagnostics produced by the GenericTaintChecker
 
+typedef __typeof(sizeof(int)) size_t;
+struct _IO_FILE;
+typedef struct _IO_FILE FILE;
+
 int scanf(const char *restrict format, ...);
 int system(const char *command);
+char* getenv( const char* env_var );
+size_t strlen( const char* str );
+void *malloc(size_t size );
+void free( void *ptr );
+char *fgets(char *str, int n, FILE *stream);
+FILE *stdin;
 
 void taintDiagnostic(void)
 {
   char buf[128];
   scanf("%s", buf); // expected-note {{Taint originated here}}
+// expected-note@-1 {{Taint propagated to the 2nd argument}}
   system(buf); // expected-warning {{Untrusted data is passed to a system call}} // expected-note {{Untrusted data is passed to a system call (CERT/STR02-C. Sanitize data passed to complex subsystems)}}
 }
 
@@ -16,6 +27,7 @@
   int index;
   int Array[] = {1, 2, 3, 4, 5};
   scanf("%d", &index); // expected-note {{Taint originated here}}
+   // expected-note@-1 {{Taint propagated to the 2nd argument}}
   return Array[index]; // expected-warning {{Out of bound memory access (index is tainted)}}
// expected-note@-1 {{Out of bound memory access (index is tainted)}}
 }
@@ -23,6 +35,7 @@
 int taintDiagnosticDivZero(int operand) {
   scanf("%d", &operand); // expected-note {{Value assigned to 'operand'}}
  // expected-note@-1 {{Taint originated here}}
+ // expected-note@-2 {{Taint propagated to the 2nd argument}}
   return 10 / operand; // expected-warning {{Division by a tainted value, possibly zero}}
// expected-note@-1 {{Division by a tainted value, possibly zero}}
 }
@@ -31,6 +44,71 @@
   int x;
   scanf("%d", &x); // expected-note {{Value assigned to 'x'}}
// expected-note@-1 {{Taint originated here}}
+   // expected-note@-2 {{Taint propagated to the 2nd argument}}
   int vla[x]; // expected-warning {{Declared variable-length array (VLA) has tainted size}}
   // expected-note@-1 {{Declared variable-length array (VLA) has tainted size}}
 }
+
+
+//Tests if the originated note is correctly placed even if the path is
+//propagating through variables and expressions
+char* taintDiagnosticPropagation(){
+  char *pathbuf;
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+ // expected-note@-1 {{Taint propagated to the return value}}
+  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+	   // expected-note@-1 {{Taking true branch}}
+pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
+// expected-note@-1{{Untrusted data is used to specify the buffer size}}
+// expected-note@-2 {{Taint propagated to the return value}}
+return pathbuf;
+  }
+  return 0;
+}
+
+//Taint origin should be marked correctly even if there are multiple taint
+//sources in the function
+char* taintDiagnosticPropagation2(){
+  char *pathbuf;
+  char *user_env2=getenv("USER_ENV_VAR2");//unrelated taint source
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+ // expected-note@

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-21 Thread Balázs Benics via Phabricator via cfe-commits
steakhal requested changes to this revision.
steakhal added a comment.
This revision now requires changes to proceed.

I didn't go through the whole revision this time, but I think the next steps 
are already clear for the next round.
My impression was that I might not expressed my intent and expectations about 
the directions of the next step.
I hope I managed this time. Let me know if you have questions.




Comment at: clang/include/clang/StaticAnalyzer/Checkers/Taint.h:82-103
+/// Returns the tainted Symbols for a given Statement and state.
+std::vector getTaintedSymbols(ProgramStateRef State, const Stmt *S,
+ const LocationContext *LCtx,
+ TaintTagType Kind = TaintTagGeneric,
+ bool returnFirstOnly = false);
+
+/// Returns the tainted Symbols for a given SVal and state.

The overloads having the extra `ReturnFirstOnly` parameter shouldn't be visible 
here in the header.
That is an implementation detail that no users should know about.
Note that having a single default argument overload potentially doubles the 
variations the user might need to keep in mind when choosing the right one. So 
there is value in simplicity.



Comment at: clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp:248-254
+  std::vector TaintedSyms =
+  getTaintedSymbols(errorState, TaintedSVal);
+  // Mark all tainted symbols interesting
+  // to track back the propagation of taintedness.
+  for (auto Sym : TaintedSyms) {
+BR->markInteresting(Sym);
+  }





Comment at: clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp:62
+CheckerContext &C,
+std::vector TaintedSyms) const {
+  if (ExplodedNode *N = C.generateErrorNode(StateZero)) {





Comment at: clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp:108-109
+  if ((stateNotZero && stateZero)) {
+std::vector taintedSyms = getTaintedSymbols(C.getState(), *DV);
+if (!taintedSyms.empty()) {
+  reportTaintBug("Division by a tainted value, possibly zero", stateZero, 
C,





Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:863-868
+  std::vector TaintedSyms =
+  getTaintedSymbols(State, Call.getReturnValue());
+  if (!TaintedSyms.empty()) {
+TaintedSymbols.push_back(TaintedSyms[0]);
+TaintedIndexes.push_back(ArgNum);
+  }





Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:875-879
+  std::vector TaintedSyms = getTaintedSymbols(State, *V);
+  if (!TaintedSyms.empty()) {
+TaintedSymbols.push_back(TaintedSyms[0]);
+TaintedIndexes.push_back(ArgNum);
+  }

In these cases, the code would acquire all the tainted subsymbols, which then 
we throw away and keep only the first one.
This is why I suggested the approach I did I'm my last review.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:936-940
+IsMatching =
+IsMatching || (PropSrcArgs.contains(I) &&
+   isTaintedOrPointsToTainted(State, C.getSVal(E)));
+std::optional TaintedSVal =
+getTaintedPointeeOrPointer(State, C.getSVal(E));

Here `getTaintedPointeeOrPointer` would be called two times, unnecessarily.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:947-948
+  if (!TaintedArgSyms.empty()) {
+TaintedSymbols.insert(TaintedSymbols.begin(), TaintedArgSyms.begin(),
+  TaintedArgSyms.end());
+TaintedIndexes.push_back(I);





Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:1009-1010
   assert(E);
-  std::optional TaintedSVal{getTaintedPointeeOrPointer(C, C.getSVal(E))};
+  std::optional TaintedSVal{
+  getTaintedPointeeOrPointer(C.getState(), C.getSVal(E))};
 





Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:1019-1023
+std::vector TaintedSyms =
+getTaintedSymbols(C.getState(), *TaintedSVal);
+for (auto TaintedSym : TaintedSyms) {
+  report->markInteresting(TaintedSym);
+}





Comment at: clang/lib/StaticAnalyzer/Checkers/Taint.cpp:149
   const LocationContext *LCtx, TaintTagType Kind) {
-  SVal val = State->getSVal(S, LCtx);
-  return isTainted(State, val, Kind);
+  return !getTaintedSymbols(State, S, LCtx, Kind, true).empty();
 }

We usually pass booleans by "name".



Comment at: clang/lib/StaticAnalyzer/Checkers/Taint.cpp:294
+  std::vector TaintedCasts =
+  getTaintedSymbols(State, SC->getOperand(), Kind);
+  

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-21 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp marked an inline comment as done.
dkrupp added a comment.

@steakhal is there anything else to do before we merge this? Thanks.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-19 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp marked an inline comment as done.
dkrupp added a comment.

@steakhal  thanks for your review. All your remarks have been fixed.




Comment at: clang/lib/StaticAnalyzer/Checkers/Taint.cpp:147-150
 bool taint::isTainted(ProgramStateRef State, const Stmt *S,
   const LocationContext *LCtx, TaintTagType Kind) {
-  SVal val = State->getSVal(S, LCtx);
-  return isTainted(State, val, Kind);
+  return !getTaintedSymbols(State, S, LCtx, Kind).empty();
 }

steakhal wrote:
> TBH I'm not sure if I like that now we allocate unbounded amount of times 
> (because of `getTaintedSymbols()` is recursive and returns by value), where 
> we previously did not.
> 
> What we could possibly do is to compute the elements of this sequence lazily.
> I'm thinking of the `llvm::mapped_iterator`, but I'm not sure if it's 
> possible to have something like that as a return type as it might encode the 
> map function in the type or something like that.
> Anyway, I'm just saying that it would be nice to not do more than it's 
> necessary, and especially not allocate a lot of short-lived objects there.
> 
> Do you think there is a way to have the cake and eat it too?
> 
> ---
> 
> I did some investigation, and one could get pretty far in the implementation, 
> and maybe even complete it but it would be really complicated as of now. 
> Maybe we could revisit this subject when we have coroutines.
> 
> So, I would suggest to have two sets of APIs:
>  - the usual `isTainted(.) -> bool`
>  - and a `getTaintedSymbols(.) -> vector`
> The important point would be that the `isTainted()` version would not eagerly 
> collect all tainted sub-syms but return on finding the first one.
> While, the `getTaintedSymbols()` would collect eagerly all of them, as its 
> name suggests.
> 
> Imagine if `getTaintedSymbolsImpl()` had an extra flag like `bool 
> returnAfterFirstMatch`. This way `isTainted()` can call it like that. While 
> in the other case, the parameter would be `false`, and eagerly collect all 
> symbols.
> 
> This is probably the best of both worlds, as it prevents `isTainted` from 
> doing extra work and if we need to iterate over the tainted symbols, we 
> always iterate over all of them, so doing it lazily wouldn't gain us much in 
> that case anyway.
> As a bonus, the user-facing API would be self-descriptive.
> 
> WDYT?
Good idea. I implemented the early return option in getTaintedSymbols(). This 
is used now by the isTainted() function.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-19 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp updated this revision to Diff 514973.
dkrupp marked an inline comment as done.
dkrupp added a comment.

- Implemented early return in getTaintedSymbols() when it is called by 
isTainted() for efficiency
- Fixed test incompatibility on Windows


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

Files:
  clang/include/clang/StaticAnalyzer/Checkers/Taint.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h
  clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
  clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/Taint.cpp
  clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
  clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp
  clang/test/Analysis/taint-diagnostic-visitor.c
  clang/test/Analysis/taint-tester.c

Index: clang/test/Analysis/taint-tester.c
===
--- clang/test/Analysis/taint-tester.c
+++ clang/test/Analysis/taint-tester.c
@@ -122,7 +122,7 @@
   fscanf(pp, "%d", &ii);
   int jj = ii;// expected-warning + {{tainted}}
 
-  fscanf(p, "%d", &ii);
+  fscanf(p, "%d", &ii);// expected-warning + {{tainted}}
   int jj2 = ii;// expected-warning + {{tainted}}
 
   ii = 3;
Index: clang/test/Analysis/taint-diagnostic-visitor.c
===
--- clang/test/Analysis/taint-diagnostic-visitor.c
+++ clang/test/Analysis/taint-diagnostic-visitor.c
@@ -2,13 +2,24 @@
 
 // This file is for testing enhanced diagnostics produced by the GenericTaintChecker
 
+typedef __typeof(sizeof(int)) size_t;
+struct _IO_FILE;
+typedef struct _IO_FILE FILE;
+
 int scanf(const char *restrict format, ...);
 int system(const char *command);
+char* getenv( const char* env_var );
+size_t strlen( const char* str );
+void *malloc(size_t size );
+void free( void *ptr );
+char *fgets(char *str, int n, FILE *stream);
+FILE *stdin;
 
 void taintDiagnostic(void)
 {
   char buf[128];
   scanf("%s", buf); // expected-note {{Taint originated here}}
+// expected-note@-1 {{Taint propagated to the 2nd argument}}
   system(buf); // expected-warning {{Untrusted data is passed to a system call}} // expected-note {{Untrusted data is passed to a system call (CERT/STR02-C. Sanitize data passed to complex subsystems)}}
 }
 
@@ -16,6 +27,7 @@
   int index;
   int Array[] = {1, 2, 3, 4, 5};
   scanf("%d", &index); // expected-note {{Taint originated here}}
+   // expected-note@-1 {{Taint propagated to the 2nd argument}}
   return Array[index]; // expected-warning {{Out of bound memory access (index is tainted)}}
// expected-note@-1 {{Out of bound memory access (index is tainted)}}
 }
@@ -23,6 +35,7 @@
 int taintDiagnosticDivZero(int operand) {
   scanf("%d", &operand); // expected-note {{Value assigned to 'operand'}}
  // expected-note@-1 {{Taint originated here}}
+ // expected-note@-2 {{Taint propagated to the 2nd argument}}
   return 10 / operand; // expected-warning {{Division by a tainted value, possibly zero}}
// expected-note@-1 {{Division by a tainted value, possibly zero}}
 }
@@ -31,6 +44,71 @@
   int x;
   scanf("%d", &x); // expected-note {{Value assigned to 'x'}}
// expected-note@-1 {{Taint originated here}}
+   // expected-note@-2 {{Taint propagated to the 2nd argument}}
   int vla[x]; // expected-warning {{Declared variable-length array (VLA) has tainted size}}
   // expected-note@-1 {{Declared variable-length array (VLA) has tainted size}}
 }
+
+
+//Tests if the originated note is correctly placed even if the path is
+//propagating through variables and expressions
+char* taintDiagnosticPropagation(){
+  char *pathbuf;
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+ // expected-note@-1 {{Taint propagated to the return value}}
+  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+	   // expected-note@-1 {{Taking true branch}}
+pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
+// expected-note@-1{{Untrusted data is used to specify the buffer size}}
+// expected-note@-2 {{Taint propagated to the return value}}
+return pathbuf;
+  }
+  return 0;
+}
+
+//Taint origin should be marked correctly even if there are multiple taint
+//sources in the function
+char* taintDiagnosticPropagation2(){
+  char *pathbuf;
+  char *user_env2=getenv("USER_ENV_VAR2");//unrelated taint source
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+ // expected-note@-1 {{Taint 

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-19 Thread Balázs Benics via Phabricator via cfe-commits
steakhal requested changes to this revision.
steakhal added a comment.
This revision now requires changes to proceed.

Nice improvement!
I only have minor nitpicks and some recommendations for the taint API.




Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:205
+BR.markInteresting(CallLocation);
+if (TaintedArgs.at(Idx) != ReturnValueIndex) {
+  LLVM_DEBUG(llvm::dbgs() << "Taint Propagated to argument "

We usually use the `operator[]` on vector instead of the `at()`. When in doubt, 
we assert that the index is in `idx < size()`.
Apply this for all `.at()` uses.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:209-214
+  if (nofTaintedArgs == 0)
+Out << "Taint propagated to argument "
+<< TaintedArgs.at(Sym.index()) + 1;
+  else
+Out << ", " << TaintedArgs.at(Sym.index()) + 1;
+  nofTaintedArgs++;

dkrupp wrote:
> steakhal wrote:
> > I'd recommend using `llvm::interleaveComma()` in such cases.
> > You can probably get rid of `nofTaintedArgs` as well - by using this 
> > function.
> I chose another solution. I hope that is ok too.
Looks good. I'm not expecting many cases propagating to multiple arguments 
anyway.



Comment at: clang/lib/StaticAnalyzer/Checkers/Taint.cpp:147-150
 bool taint::isTainted(ProgramStateRef State, const Stmt *S,
   const LocationContext *LCtx, TaintTagType Kind) {
-  SVal val = State->getSVal(S, LCtx);
-  return isTainted(State, val, Kind);
+  return !getTaintedSymbols(State, S, LCtx, Kind).empty();
 }

TBH I'm not sure if I like that now we allocate unbounded amount of times 
(because of `getTaintedSymbols()` is recursive and returns by value), where we 
previously did not.

What we could possibly do is to compute the elements of this sequence lazily.
I'm thinking of the `llvm::mapped_iterator`, but I'm not sure if it's possible 
to have something like that as a return type as it might encode the map 
function in the type or something like that.
Anyway, I'm just saying that it would be nice to not do more than it's 
necessary, and especially not allocate a lot of short-lived objects there.

Do you think there is a way to have the cake and eat it too?

---

I did some investigation, and one could get pretty far in the implementation, 
and maybe even complete it but it would be really complicated as of now. Maybe 
we could revisit this subject when we have coroutines.

So, I would suggest to have two sets of APIs:
 - the usual `isTainted(.) -> bool`
 - and a `getTaintedSymbols(.) -> vector`
The important point would be that the `isTainted()` version would not eagerly 
collect all tainted sub-syms but return on finding the first one.
While, the `getTaintedSymbols()` would collect eagerly all of them, as its name 
suggests.

Imagine if `getTaintedSymbolsImpl()` had an extra flag like `bool 
returnAfterFirstMatch`. This way `isTainted()` can call it like that. While in 
the other case, the parameter would be `false`, and eagerly collect all symbols.

This is probably the best of both worlds, as it prevents `isTainted` from doing 
extra work and if we need to iterate over the tainted symbols, we always 
iterate over all of them, so doing it lazily wouldn't gain us much in that case 
anyway.
As a bonus, the user-facing API would be self-descriptive.

WDYT?



Comment at: clang/lib/StaticAnalyzer/Checkers/Taint.cpp:265-268
+  std::vector TaintedRegions =
+  getTaintedSymbols(State, SRV->getRegion(), Kind);
+  TaintedSymbols.insert(TaintedSymbols.begin(), TaintedRegions.begin(),
+TaintedRegions.end());

For such constructs, I would prefer this.



Comment at: clang/test/Analysis/taint-diagnostic-visitor.c:12-13
+char* getenv( const char* env_var );
+size_t strlen( const char* str );
+void *malloc(size_t size );
+void free( void *ptr );

The [[ 
https://buildkite.com/llvm-project/premerge-checks/builds/146760#01877fd9-2f3a-4f8d-9ffa-6e017eb883c5
 | premerge bots ]] are complaining about these two lines on Windows:
```
error: 'warning' diagnostics seen but not expected: 
  File 
C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c
 Line 12: incompatible redeclaration of library function 'strlen'
  File 
C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c
 Line 13: incompatible redeclaration of library function 'malloc'
error: 'note' diagnostics seen but not expected: 
  File 
C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c
 Line 12: 'strlen' is a builtin with type 'unsigned long long (const char *)'
  File 
C:\ws\w8\llvm-project\premerge-checks\clang\test\Analysis\taint-diagnostic-visitor.c
 Line 13: 'malloc' is a builtin with type 'void

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-15 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp added a comment.

You can find the improved reports on tmux, postgres, twin, openssl here:

here 



CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-14 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp added a comment.

All remarks from @steakhal has been fixed. Thanks for the review.
This new version now can handle the tracking back of multiple symbols!




Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:129-130
 /// Given a pointer/reference argument, return the value it refers to.
-std::optional getPointeeOf(const CheckerContext &C, SVal Arg) {
+std::optional getPointeeOf(ASTContext &ASTCtx,
+ const ProgramStateRef State, SVal Arg) {
   if (auto LValue = Arg.getAs())

steakhal wrote:
> BTW I don't know but `State->getStateManager().getContext()` can give you an 
> `ASTContext`. And we tend to not put `const` to variable declarations. See [[ 
> https://releases.llvm.org/4.0.0/tools/clang/tools/extra/docs/clang-tidy/checks/readability-avoid-const-params-in-decls.html
>  | readability-avoid-const-params-in-decls ]]
> 
> In other places we tend to refer to `ASTContext` by the `ACtx` I think.
> We also prefer const refs over mutable refs. Is the mutable ref justified for 
> this case?
Thanks for the suggestion. I took out ASTContext from the signature.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:209-214
+  if (nofTaintedArgs == 0)
+Out << "Taint propagated to argument "
+<< TaintedArgs.at(Sym.index()) + 1;
+  else
+Out << ", " << TaintedArgs.at(Sym.index()) + 1;
+  nofTaintedArgs++;

steakhal wrote:
> I'd recommend using `llvm::interleaveComma()` in such cases.
> You can probably get rid of `nofTaintedArgs` as well - by using this function.
I chose another solution. I hope that is ok too.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:213
+  else
+Out << ", " << TaintedArgs.at(Sym.index()) + 1;
+  nofTaintedArgs++;

steakhal wrote:
> I believe this branch is uncovered by tests.
Now it is covered. See multipleTaintedSArgs(..) test in 
taint-diagnostic-visitor.c



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:221
+}
+return std::string(Out.str());
+  });

steakhal wrote:
> I think since you explicitly specify the return type of the lambda, you could 
> omit the spelling of `std::string` here.
not sure. Got a "cannot convert raw_svector_ostream::str() from llvm:StringRef" 
error.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:875-878
+  // FIXME: The call argument may be a complex expression
+  // referring to multiple tainted variables.
+  // Now we generate notes and track back only one of them.
+  SymbolRef TaintedSym = isTainted(State, *V);

steakhal wrote:
> You could iterate over the symbol dependencies of the SymExpr (of the `*V` 
> SVal).
> 
> ```lang=c++
> SymbolRef PointeeAsSym = V->getAsSymbol();
> // eee, can it be null? Sure it can. See isTainted(Region),... for those 
> cases we would need to descend and check their symbol dependencies.
> for (SymbolRef SubSym : llvm::make_range(PointeeAsSym->symbol_begin(), 
> PointeeAsSym->symbol_end())) {
>   // TODO: check each if it's also tainted, and update the `TaintedSymbols` 
> accordingly, IDK.
> }
> ```
> Something like this should work for most cases (except when `*V` refers to a 
> tainted region instead of a symbol), I think.
I implememented a new function getTaintedSymbols(..) in Taint.cpp which returns 
all tainted symbols for a complex expr, SVal etc. With this addition, now we 
can track back multiple tainted symbols reaching a sink.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-14 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp updated this revision to Diff 513556.
dkrupp marked 11 inline comments as done.
dkrupp edited the summary of this revision.
dkrupp added a comment.

-All remarks from @steakhal was fixed. Thanks for the review!
-Now we can generate diagnostics for all tainted values when they reach a sink.

Se for example the following test case:

  void multipleTaintedArgs(void) {
int x,y;
scanf("%d %d", &x, &y); // expected-note {{Taint originated here}}
// expected-note@-1 {{Taint propagated to the 2nd 
argument, 3rd argument}}
int* ptr = (int*) malloc(x + y); // expected-warning {{Untrusted data is 
used to specify the buffer size}}
 // expected-note@-1{{Untrusted data is 
used to specify the buffer size}}
free (ptr);
  }


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

Files:
  clang/include/clang/StaticAnalyzer/Checkers/Taint.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h
  clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
  clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/Taint.cpp
  clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
  clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp
  clang/test/Analysis/taint-diagnostic-visitor.c
  clang/test/Analysis/taint-tester.c

Index: clang/test/Analysis/taint-tester.c
===
--- clang/test/Analysis/taint-tester.c
+++ clang/test/Analysis/taint-tester.c
@@ -122,7 +122,7 @@
   fscanf(pp, "%d", &ii);
   int jj = ii;// expected-warning + {{tainted}}
 
-  fscanf(p, "%d", &ii);
+  fscanf(p, "%d", &ii);// expected-warning + {{tainted}}
   int jj2 = ii;// expected-warning + {{tainted}}
 
   ii = 3;
Index: clang/test/Analysis/taint-diagnostic-visitor.c
===
--- clang/test/Analysis/taint-diagnostic-visitor.c
+++ clang/test/Analysis/taint-diagnostic-visitor.c
@@ -2,13 +2,24 @@
 
 // This file is for testing enhanced diagnostics produced by the GenericTaintChecker
 
+typedef unsigned long size_t;
+struct _IO_FILE;
+typedef struct _IO_FILE FILE;
+
 int scanf(const char *restrict format, ...);
 int system(const char *command);
+char* getenv( const char* env_var );
+size_t strlen( const char* str );
+void *malloc(size_t size );
+void free( void *ptr );
+char *fgets(char *str, int n, FILE *stream);
+FILE *stdin;
 
 void taintDiagnostic(void)
 {
   char buf[128];
   scanf("%s", buf); // expected-note {{Taint originated here}}
+// expected-note@-1 {{Taint propagated to the 2nd argument}}
   system(buf); // expected-warning {{Untrusted data is passed to a system call}} // expected-note {{Untrusted data is passed to a system call (CERT/STR02-C. Sanitize data passed to complex subsystems)}}
 }
 
@@ -16,6 +27,7 @@
   int index;
   int Array[] = {1, 2, 3, 4, 5};
   scanf("%d", &index); // expected-note {{Taint originated here}}
+   // expected-note@-1 {{Taint propagated to the 2nd argument}}
   return Array[index]; // expected-warning {{Out of bound memory access (index is tainted)}}
// expected-note@-1 {{Out of bound memory access (index is tainted)}}
 }
@@ -23,6 +35,7 @@
 int taintDiagnosticDivZero(int operand) {
   scanf("%d", &operand); // expected-note {{Value assigned to 'operand'}}
  // expected-note@-1 {{Taint originated here}}
+ // expected-note@-2 {{Taint propagated to the 2nd argument}}
   return 10 / operand; // expected-warning {{Division by a tainted value, possibly zero}}
// expected-note@-1 {{Division by a tainted value, possibly zero}}
 }
@@ -31,6 +44,71 @@
   int x;
   scanf("%d", &x); // expected-note {{Value assigned to 'x'}}
// expected-note@-1 {{Taint originated here}}
+   // expected-note@-2 {{Taint propagated to the 2nd argument}}
   int vla[x]; // expected-warning {{Declared variable-length array (VLA) has tainted size}}
   // expected-note@-1 {{Declared variable-length array (VLA) has tainted size}}
 }
+
+
+//Tests if the originated note is correctly placed even if the path is
+//propagating through variables and expressions
+char* taintDiagnosticPropagation(){
+  char *pathbuf;
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+ // expected-note@-1 {{Taint propagated to the return value}}
+  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+	   // expected-note@-1 {{Taking true branch}}
+pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
+// expected-note@-1{{Untrusted data is used t

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-05 Thread Balázs Benics via Phabricator via cfe-commits
steakhal added a comment.

I think as we converge, now it could be time to update the summary of the patch 
to reflect the current implementation. (e.g. flowids etc.)


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-05 Thread Balázs Benics via Phabricator via cfe-commits
steakhal added a comment.

Looks even better. Only minor concerns remained, mostly about style and 
suggestions of llvm utilities.




Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:129-130
 /// Given a pointer/reference argument, return the value it refers to.
-std::optional getPointeeOf(const CheckerContext &C, SVal Arg) {
+std::optional getPointeeOf(ASTContext &ASTCtx,
+ const ProgramStateRef State, SVal Arg) {
   if (auto LValue = Arg.getAs())

BTW I don't know but `State->getStateManager().getContext()` can give you an 
`ASTContext`. And we tend to not put `const` to variable declarations. See [[ 
https://releases.llvm.org/4.0.0/tools/clang/tools/extra/docs/clang-tidy/checks/readability-avoid-const-params-in-decls.html
 | readability-avoid-const-params-in-decls ]]

In other places we tend to refer to `ASTContext` by the `ACtx` I think.
We also prefer const refs over mutable refs. Is the mutable ref justified for 
this case?



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:159
+/// when the return value, or the outgoing parameters are tainted.
+const NoteTag *taintOriginTrackerTag(CheckerContext &C,
+ std::vector TaintedSymbols,

My bad. In LLVM style we use `UpperCamelCase` for variable names.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:173-175
+if (TaintedSymbols.empty()) {
+  return "Taint originated here";
+}

Generally, in LLVM style we don't put braces to single block statements unless 
it would hurt readability, which I don't think applies here.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:176-180
+for (auto Sym : llvm::enumerate(TaintedSymbols)) {
+  LLVM_DEBUG(llvm::dbgs() << "Taint Propagated from argument "
+  << TaintedArgs.at(Sym.index()) + 1 << "\n");
+  BR.markInteresting(Sym.value());
+}

I was also bad with this recommendation.
I think we can now use structured bindings to get the index and value right 
there, like:
`for (auto [Idx, Sym] : llvm::enumerate(TaintedSymbols))`
[[ 
https://github.com/llvm/llvm-project/blob/main/llvm/docs/ProgrammersManual.rst#enumerate
 | See ]]



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:203
+int nofTaintedArgs = 0;
+for (auto Sym : llvm::enumerate(TaintedSymbols)) {
+  if (BR.isInteresting(Sym.value())) {

Same here about structured bindings.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:209-214
+  if (nofTaintedArgs == 0)
+Out << "Taint propagated to argument "
+<< TaintedArgs.at(Sym.index()) + 1;
+  else
+Out << ", " << TaintedArgs.at(Sym.index()) + 1;
+  nofTaintedArgs++;

I'd recommend using `llvm::interleaveComma()` in such cases.
You can probably get rid of `nofTaintedArgs` as well - by using this function.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:211
+Out << "Taint propagated to argument "
+<< TaintedArgs.at(Sym.index()) + 1;
+  else

For clang diagnostics we usually use ordinary suffixes like `{st,nd,rd,th}`. It 
would be nice to align with the rest of the clang diagnostics on this.
It would require a bit of work on the wording though, I admit.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:213
+  else
+Out << ", " << TaintedArgs.at(Sym.index()) + 1;
+  nofTaintedArgs++;

I believe this branch is uncovered by tests.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:221
+}
+return std::string(Out.str());
+  });

I think since you explicitly specify the return type of the lambda, you could 
omit the spelling of `std::string` here.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:863-864
   State = addTaint(State, Call.getReturnValue());
+  SymbolRef TaintedSym = isTainted(State, Call.getReturnValue());
+  if (TaintedSym) {
+TaintedSymbols.push_back(TaintedSym);

We tend to fuse such declarations:
I've seen other cases like this elsewhere. Please check.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:875-878
+  // FIXME: The call argument may be a complex expression
+  // referring to multiple tainted variables.
+  // Now we generate notes and track back only one of them.
+  SymbolRef TaintedSym = isTainted(State, *V);

You could iterate over the symbol dependencies of the SymExpr (of the `*V` 
SVal).

```lang=c++
Symb

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-05 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp added a comment.

All comments addressed. Thanks for the review @steakhal .




Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:162
+  const CallEvent& Call) {
+  const LocationContext* LC = Call.getCalleeStackFrame(0);
+

steakhal wrote:
> If the `Call` parameter is only used for acquiring the `LocationContext`, 
> wouldn't it be more descriptive to directly pass the `LocationContext` to the 
> function instead?
> I'm also puzzled that we use `getCalleeStackFrame` here. I rarely ever see 
> this function, so I'm a bit worried if this pick was intentional. That we 
> pass the `0` as the `BlockCount` argument only reinforces this instinct.
The call.getCalleeStackFrame(0) gets the location context of the actual call 
that we are analyzing (in the pre or postcall), and that's what we need to mark 
interesting. It is intentionally used like this. I changed the parameter to 
locationcontext as use suggested.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:169
+// We give diagnostics only for taint related reports
+if (!BR.isInteresting(LC) ||
+BR.getBugType().getCategory() != categories::TaintedData) {

steakhal wrote:
> What does the first half of this condition guard against?
> Do you have a test for it to demonstrate?
To only follow taint propagation to function calls which actually result in 
tainted variables used in the report and not every function which returns a 
tainted variable. 

char* taintDiagnosticPropagation2() is such a test which is failing without 
this due to giving extra unrelated propagation notes.





Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:214
+  << TaintedArgs.at(i) << "\n");
+  Out << "Taint propagated to argument " << TaintedArgs.at(i);
+} else {

steakhal wrote:
> So, if `TaintedSymbols.size() > 1`, then the note message will look weird.
> Could you please have a test for this?
Test added multipleTaintedArgs (). I could not provoke the multi-argument 
message as we only track-back one tainted symbol now.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:871
+  }else{
+ LLVM_DEBUG(llvm::dbgs() << "Strange. Return value not tainted.\n");
+ LLVM_DEBUG(Call.getReturnValue().dumpToStream(llvm::dbgs()));

steakhal wrote:
> I cannot see a test against the "Strange" string. Is this dead code?
> Same for the other block.
It was a debugging code, which I removed. I noticed that in some cases (e.g. if 
the argument pointer is pointing to an unknown area) we don't get back a 
tainted symbol even though we call the addtaint on the arg/return value.



Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:919-922
+//Add taintedness to stdin parameters
+if (isStdin(C.getSVal(E),C.getASTContext())){
+  State = addTaint(State, C.getSVal(E));
+}

steakhal wrote:
> I just want to get it confirmed that this hunk is unrelated to your change 
> per-say. Is it?
> BTW I don't mind this change being part of this patch, rather the opposite. 
> Finally, we will have it.
It is related in the sense that in isTainted() function call did not return a 
valid SymbolRef for stdin if we did not make the stdin tainted when we first 
see it. Caused  testcase to fail as it was before. Now it is handled similarly 
to other tainted symbols.



Comment at: clang/lib/StaticAnalyzer/Checkers/Taint.cpp:228-230
+  if (SymbolRef TSR =
+  isTainted(State, SRV->getRegion(), Kind))
+return TSR;

steakhal wrote:
> What does `TSR` abbreviate? I would find `TaintedSym` more descriptive.
TSR = Tainted Symbol Ref

but I changed it as you suggested.



Comment at: clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp:220-228
+if (Kind == VLA_Tainted)
+  BT.reset(new BugType(this,
+   "Dangerous variable-length array (VLA) declaration",
+   categories::TaintedData));
+else
+  BT.reset(new BugType(this,
+   "Dangerous variable-length array (VLA) declaration",

steakhal wrote:
> Why don't we use a distinct BugType for this?
You mean a new bug type instances? Would there be an advantage for that? Seemed 
to be simpler this way. To distinguish identify the tainted reports with the 
bug category.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-05 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp updated this revision to Diff 511078.
dkrupp marked 21 inline comments as done.
dkrupp added a comment.

@steakhal thanks for your review. I tried to address all your concerns.
I added an extra test case too (multipleTaintSources(..)) which highlights the 
limitation of the current patch: If multiple tainted "variables" reach a sink, 
we only generate diagnostics for one of them. The main reason is that the 
isTainted() function returns a single tainted Symbolref instead of a vector of 
Symbolrefs if there are multiple instances. 
I highlighted this in the test and the implementation.

I think this could be still an acceptable limitation for now, because as the 
user sanitizes one of the tainted variables, he will get a new diagnostics for 
the remaining one(s).

So I suggest to address this limitation in  follow-up patche(s).
The implementation as is already greatly improves the understandability of the 
reports.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

Files:
  clang/include/clang/StaticAnalyzer/Checkers/Taint.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h
  clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
  clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/Taint.cpp
  clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
  clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp
  clang/test/Analysis/taint-diagnostic-visitor.c
  clang/test/Analysis/taint-tester.c

Index: clang/test/Analysis/taint-tester.c
===
--- clang/test/Analysis/taint-tester.c
+++ clang/test/Analysis/taint-tester.c
@@ -122,7 +122,7 @@
   fscanf(pp, "%d", &ii);
   int jj = ii;// expected-warning + {{tainted}}
 
-  fscanf(p, "%d", &ii);
+  fscanf(p, "%d", &ii);// expected-warning + {{tainted}}
   int jj2 = ii;// expected-warning + {{tainted}}
 
   ii = 3;
Index: clang/test/Analysis/taint-diagnostic-visitor.c
===
--- clang/test/Analysis/taint-diagnostic-visitor.c
+++ clang/test/Analysis/taint-diagnostic-visitor.c
@@ -2,13 +2,24 @@
 
 // This file is for testing enhanced diagnostics produced by the GenericTaintChecker
 
+typedef unsigned long size_t;
+struct _IO_FILE;
+typedef struct _IO_FILE FILE;
+
 int scanf(const char *restrict format, ...);
 int system(const char *command);
+char* getenv( const char* env_var );
+size_t strlen( const char* str );
+void *malloc(size_t size );
+void free( void *ptr );
+char *fgets(char *str, int n, FILE *stream);
+FILE *stdin;
 
 void taintDiagnostic(void)
 {
   char buf[128];
   scanf("%s", buf); // expected-note {{Taint originated here}}
+// expected-note@-1 {{Taint propagated to argument 2}}
   system(buf); // expected-warning {{Untrusted data is passed to a system call}} // expected-note {{Untrusted data is passed to a system call (CERT/STR02-C. Sanitize data passed to complex subsystems)}}
 }
 
@@ -16,6 +27,7 @@
   int index;
   int Array[] = {1, 2, 3, 4, 5};
   scanf("%d", &index); // expected-note {{Taint originated here}}
+   // expected-note@-1 {{Taint propagated to argument 2}}
   return Array[index]; // expected-warning {{Out of bound memory access (index is tainted)}}
// expected-note@-1 {{Out of bound memory access (index is tainted)}}
 }
@@ -23,6 +35,7 @@
 int taintDiagnosticDivZero(int operand) {
   scanf("%d", &operand); // expected-note {{Value assigned to 'operand'}}
  // expected-note@-1 {{Taint originated here}}
+ // expected-note@-2 {{Taint propagated to argument 2}}
   return 10 / operand; // expected-warning {{Division by a tainted value, possibly zero}}
// expected-note@-1 {{Division by a tainted value, possibly zero}}
 }
@@ -31,6 +44,71 @@
   int x;
   scanf("%d", &x); // expected-note {{Value assigned to 'x'}}
// expected-note@-1 {{Taint originated here}}
+   // expected-note@-2 {{Taint propagated to argument 2}}
   int vla[x]; // expected-warning {{Declared variable-length array (VLA) has tainted size}}
   // expected-note@-1 {{Declared variable-length array (VLA) has tainted size}}
 }
+
+
+//Tests if the originated note is correctly placed even if the path is
+//propagating through variables and expressions
+char* taintDiagnosticPropagation(){
+  char *pathbuf;
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+ // expected-note@-1 {{Taint propagated to the return value}}
+  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+	   // expected-note@-1 {{Taking true branch}}
+pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buf

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-04-03 Thread Balázs Benics via Phabricator via cfe-commits
steakhal added a comment.

In D144269#4237539 , @dkrupp wrote:

> This is a totally rewritten version of the patch which solely relies on the 
> existing "interestingness" utility to track back the taint propagation.  (And 
>  does not introduce a new FlowID in the ProgramState as requested in the 
> reviews.)
>
> -The new version also places a Note, when the taintedness is propagated to an 
> argument or to a return value. So it should be easier for the user to follow 
> how the taint information is spreading. 
> -"The taint originated here" is printed correctly at the taint source 
> function, which introduces taintedness. (Main goal of this patch.)
>
> Implementation:
> -The createTaintPreTag() function places a NoteTag at the taint propagation 
> function calls, if taintedness is propagated. Then at report creation, the 
> tainted arguments are marked interesting if propagated taintedness is 
> relevant for the bug report.
>
> - The isTainted() function is extended to return the actually tainted 
> SymbolRef. This is important to be able to consistently mark relevant symbol 
> interesting which carries the taintedness in a complex expression.

So this is how you circumvent introducing "transitive interestingness", because 
now you know which symbol to track.

> -createTaintPostTag(..) function places a NoteTag to the taint generating 
> function calls to mark them interesting if they are relevant for a 
> taintedness report. So if they propagated taintedness to interesting 
> symbol(s).
>
> The tests are passing and the reports on the open source projects are much 
> better understandable than before (twin, tmux, curl):
>
> https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=curl_curl-7_66_0_dkrupp_taint_origin_fix_new&run=tmux_2.6_dkrupp_taint_origin_fix_new&run=twin_v0.8.1_dkrupp_taint_origin_fix_new&is-unique=on&diff-type=New&checker-msg=%2auntrusted%2a&checker-msg=Out%20of%20bound%20memory%20access%20%28index%20is%20tainted%29

I've looked at the results you attached and they look good to me.

Do you have an example where two tainted values are contributing to the same 
bug? Something like:

  n <- read();
  m <- read();
  malloc(n+m)

Will both of these values tracked back? What do the notes look like?

---

All in all, I'm pleased to see the improvements. It looks much better now IMO.
Using two NoteTags cleans up the implementation quite a bit, kudos.
I don't think there are major problems with this implementation, so I decided 
to spew your code with my nitpicks :D

Remember to clang-format your code. See 
`clang/tools/clang-format/clang-format-diff.py`.
And there are a few overloads of `getNoteTag()`, and there could be a better 
fit for usecases; you should decide.

I also find it difficult to track how a variable got tainted across assignments 
and computations like in this 

 case.
This observation is completely orthogonal to your patch, I'm just noting it. it 
was bad previously as well.
Maybe we could have a visitor for explaining the taint tracking across 
assignments and computations to complement the NoteTag.

I hope I didn't miss much this time :D




Comment at: clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp:238-244
+if (kind == OOB_Tainted)
+  BT.reset(
+  new BugType(this, "Out-of-bound access", categories::TaintedData));
+else
+  BT.reset(
+  new BugType(this, "Out-of-bound access", categories::LogicError));
+  }





Comment at: clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp:45
 
-void DivZeroChecker::reportBug(
-const char *Msg, ProgramStateRef StateZero, CheckerContext &C,
-std::unique_ptr Visitor) const {
+void DivZeroChecker::reportBug(const char *Msg, std::string Category,
+   ProgramStateRef StateZero, CheckerContext &C,

It feels odd to have both `const char*` and a `std::string` on the same line.
Should we update `const char*` to a more sophisticated type?

I'm thinking of `StringRef`. It seems like that type should be used for the 
`Category` as well since the `PathSensitiveBugReport` constructor takes that, 
so we don't need to have an owning type here.



Comment at: clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp:91-96
+  SymbolRef TSR = isTainted(C.getState(), *DV);
+  if ((stateNotZero && stateZero && TSR)) {
+reportBug("Division by a tainted value, possibly zero",
+  categories::TaintedData, stateZe

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-03-31 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp updated this revision to Diff 510108.
dkrupp added a comment.

This is a totally rewritten version of the patch which solely relies on the 
existing "interestingness" utility to track back the taint propagation.  (And  
does not introduce a new FlowID in the ProgramState as requested in the 
reviews.)

-The new version also places a Note, when the taintedness is propagated to an 
argument or to a return value. So it should be easier for the user to follow 
how the taint information is spreading. 
-"The taint originated here" is printed correctly at the taint source function, 
which introduces taintedness. (Main goal of this patch.)

Implementation:
-The createTaintPreTag() function places a NoteTag at the taint propagation 
function calls, if taintedness is propagated. Then at report creation, the 
tainted arguments are marked interesting if propagated taintedness is relevant 
for the bug report.

- The isTainted() function is extended to return the actually tainted 
SymbolRef. This is important to be able to consistently mark relevant symbol 
interesting which carries the taintedness in a complex expression.

-createTaintPostTag(..) function places a NoteTag to the taint generating 
function calls to mark them interesting if they are relevant for a taintedness 
report. So if they propagated taintedness to interesting symbol(s).

The tests are passing and the reports on the open source projects are much 
better understandable than before (twin, tmux, curl):

https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=curl_curl-7_66_0_dkrupp_taint_origin_fix_new&run=tmux_2.6_dkrupp_taint_origin_fix_new&run=twin_v0.8.1_dkrupp_taint_origin_fix_new&is-unique=on&diff-type=New&checker-msg=%2auntrusted%2a&checker-msg=Out%20of%20bound%20memory%20access%20%28index%20is%20tainted%29


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

Files:
  clang/include/clang/StaticAnalyzer/Checkers/Taint.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h
  clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
  clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/Taint.cpp
  clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
  clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp
  clang/test/Analysis/taint-diagnostic-visitor.c
  clang/test/Analysis/taint-tester.c

Index: clang/test/Analysis/taint-tester.c
===
--- clang/test/Analysis/taint-tester.c
+++ clang/test/Analysis/taint-tester.c
@@ -122,7 +122,7 @@
   fscanf(pp, "%d", &ii);
   int jj = ii;// expected-warning + {{tainted}}
 
-  fscanf(p, "%d", &ii);
+  fscanf(p, "%d", &ii);// expected-warning + {{tainted}}
   int jj2 = ii;// expected-warning + {{tainted}}
 
   ii = 3;
Index: clang/test/Analysis/taint-diagnostic-visitor.c
===
--- clang/test/Analysis/taint-diagnostic-visitor.c
+++ clang/test/Analysis/taint-diagnostic-visitor.c
@@ -2,13 +2,23 @@
 
 // This file is for testing enhanced diagnostics produced by the GenericTaintChecker
 
+typedef unsigned long size_t;
+struct _IO_FILE;
+typedef struct _IO_FILE FILE;
+
 int scanf(const char *restrict format, ...);
 int system(const char *command);
+char* getenv( const char* env_var );
+size_t strlen( const char* str );
+void *malloc(size_t size );
+char *fgets(char *str, int n, FILE *stream);
+FILE *stdin;
 
 void taintDiagnostic(void)
 {
   char buf[128];
   scanf("%s", buf); // expected-note {{Taint originated here}}
+// expected-note@-1 {{Taint propagated to argument 1}}
   system(buf); // expected-warning {{Untrusted data is passed to a system call}} // expected-note {{Untrusted data is passed to a system call (CERT/STR02-C. Sanitize data passed to complex subsystems)}}
 }
 
@@ -16,6 +26,7 @@
   int index;
   int Array[] = {1, 2, 3, 4, 5};
   scanf("%d", &index); // expected-note {{Taint originated here}}
+   // expected-note@-1 {{Taint propagated to argument 1}}
   return Array[index]; // expected-warning {{Out of bound memory access (index is tainted)}}
// expected-note@-1 {{Out of bound memory access (index is tainted)}}
 }
@@ -23,6 +34,7 @@
 int taintDiagnosticDivZero(int operand) {
   scanf("%d", &operand); // expected-note {{Value assigned to 'operand'}}
  // expected-note@-1 {{Taint originated here}}
+ // expected-note@-2 {{Taint propagated to argument 1}}
   return 10 / operand; // expected-warning {{Division by a tainted value, possibly zero}}
// expected-note@-1 {{Division by a tainted value, possibly zero}}
 }
@@ -31,6 +43,50 @@
   int x;
   scanf("%d", &x); // expected-note {{Value assigned to 'x'}}
// expected-note@-1 {{Taint originated her

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-02-27 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp planned changes to this revision.
dkrupp added a comment.

@steakhal , @NoQ  thanks for the reviews. I will try to implement an 
alternative solution based on your suggestions.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-02-27 Thread Balázs Benics via Phabricator via cfe-commits
steakhal added a comment.

If we worry about having taint-related reports without a Note message 
explaining where the taint was introduced, we could just assert that in a 
`BugReportVisitor` at the `finalizeVisitor()` callback. I think such an 
assertion would make a lot of sense.
To achieve this, we could take the `R.getNotes()` and check if any of them 
refers to a specific one produced by the `NoteTag` callback for taint sources, 
let's say `TaintSourceTag` for that `PathDiagnosticNotePiece`.

  void MyVisitor::finalizeVisitor(BugReporterContext &, const ExplodedNode *, 
PathSensitiveBugReport &R) {
assert(llvm::any_of(R.getNotes(),
[](const auto &Piece) { return Piece->getTag() == 
TaintSourceTag; }) &&
   "Each taint report should have at least one taint-source");
  }

With this assertion, we would gain confidence that the taint reports are 
complete, or at least they all have at least one taint source.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-02-27 Thread Balázs Benics via Phabricator via cfe-commits
steakhal added a comment.

First and foremost, I want to deeply apologize about my rushed response. When I 
say that the `Taint originated here` note remained, I **wrongly** draw 
conclusions. Won't happen again.

__Taint-flow IDs__:
I would challenge that we need a flow ID (number) because for me the tainted 
symbol already has a unique ID which should be suitable for this purpose.
What I can see is that it would be useful to know directly what was the first 
tainted symbols of any given tainted symbol. (`SymbolData -> SymbolData` 
mapping)
This is pretty much analog with what you implemented by piggybacking on the 
taint kind. Although, I'd argue that having an explicit mapping would be 
cleaner, but I can see that it's more on personal preference.
In addition to that, I think there is value in minimizing the number of places 
where we introduce such static counters, so I would adwise against that unless 
we have a good reason for doing so.

I'm thinking that although it's handy to have the originally tainted symbol 
directly at the error-node at hand, I'm still not sure if that couldn't be 
calculated and tracked back to the place where we introduced taint.

  int n;
  scanf("%d", &n); // binds $1, not important
  int v = mytaintsource(n); // taint originated here, returns $2
  int z = taintprop(v); // taint propagated here, returns $3
  int x = 42 / z; // 42 div $3

__Soundness of the back-propagation__:
The back-propagation is always in-sync **given that** the state-transition of 
introducing taint **also attaches** the `NoteTag` explaining what it did and 
why.
This basically means that after we call `addTaint()` we must also add a 
`NoteTag` when we call `addTransition()`. Under these circumstances I find it 
easier to argue that the back-propagation is consistent (sound). If a specific 
checker (other than the `GenericTaintChecker`) models taint, it should stick to 
the rules described and emit the right `NoteTag`. That way even downstream 
checkers would play nicely with the taint-tracking and benefit from it.

__Interestingness__:
The concerns seem valid about that the set of interesting symbols could be 
larger than the actually required (and desired) set. However, I wouldn't 
worried about this much unless we have an example on which we can continue 
discussing that this is a real concern.
What I can see is that as-of-now we use the interestingness notion for this, 
and deviating or introducing something else would introduce complexity. So, I'm 
not strictly against having something else, but I would lean towards using 
interestingness here as well, unless we have clear-cut examples demonstrating 
the need for something more sophisticated.

Finally, I'd like to thank you for investing your time into this subject. We 
really should have done it much earlier. Without these similar improvements 
like this one, it's just half way done. Thank you.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-02-24 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp added a comment.

> TaintBugReport is brilliant and we already have a precedent for subclassing 
> BugReport in another checker. However I'm somewhat worried that once we start 
> doing more of this, we'll eventually end up with multiple inheritance 
> situations when the report needs multiple kinds of information. So at a 
> glance my approach with a "generic data map" in bugreport objects looks a bit 
> more future-proof to me. Also a bit easier to set up, no need to deal with 
> custom RTTI.

Adding a data map (like a string->sval map) to the `PathSensitiveBugreport` 
instead of relying on dynamic casting sounds an easy addition. I will update 
the patch with this. Or you specifically mean this kind of datamap ?  `typedef 
llvm::ImmutableMap   GenericDataMap;` (`ProgramState.h:74`) I 
guess it should not be immutable…

> I guess my main point is, there shouldn't be a need to assist tracking by 
> adding extra information to the program state. Information in the state 
> should ideally be "material" to program execution, "tangible", it has to 
> describe something that's actually stored somewhere in memory (either by 
> directly defining it, or by constraining it). In particular, if two nodes 
> result in indistinguishable future behavior of the program, we're supposed to 
> merge them; but any "immaterial" bits of information in the state will 
> prevent that from happening.

@NoQ aha! Now I see where  you are coming from! If an SVal is tainted on both 
analysis branches, but their taint flow value is different (meaning that they 
carry taint values from different taint sources), then they cannot be merged 
which causes inefficiency.
I understand the generic principle, but I wonder how frequent would that be in 
practice. I would think not too much, because taint sources are uncommon. 
Especially having multiple taint sources in the same source file/Translation 
Unit (only that creates different taint flow ids).

> In our case it should be enough to have the lambda for propagation method ask 
> "Hey, is this freshly produced propagation target value relevant to this 
> specific report?" and if yes, mark the corresponding propagation source value 
> as relevant to the report as well; also emit a note and "consume" the mark on 
> the target value. Such chain of local decisions can easily replace the global 
> taint flow identifier, and it's more flexible because this way the flow 
> doesn't need to be "linear", it may branch in various ways and that's ok.

Taint propagation is not only handled in the `GenericTaintChecker:892`, where 
we calculate that the taintedness should propagate from function argument `x` 
to `y` or return value, but also it spreads in peculiar ways within 
expressions, from subregion to parent region etc. handled in the `addtaint(..)` 
and `addPartialTaint(..)` functions in `Taint.cpp`. What your proposed solution 
would essentially mean that we would need to implement the taint propagation in 
backward direction too. I think this design would be fragile and difficult to 
maintain (especially if taint propagation would change in the future).

I definitely don’t have the full picture here, so if you think that the sval 
backtracking is the better way, because of the potential performance penalty of 
the taintflow solution with the merges, I will go down that road and start to 
work on an alternative patch.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-02-23 Thread Artem Dergachev via Phabricator via cfe-commits
NoQ added a comment.

Yeah looks like I replied without properly reading the patch.

`TaintBugReport` is brilliant and we already have a precedent for subclassing 
`BugReport` in another checker. However I'm somewhat worried that once we start 
doing more of this, we'll eventually end up with multiple inheritance 
situations when the report needs multiple kinds of information. So at a glance 
my approach with a "generic data map" in bugreport objects looks a bit more 
future-proof to me. Also a bit easier to set up, no need to deal with custom 
RTTI.

So I think interestingness is just an example of such "generic data" attached 
to bug report. Interestingness is also somewhat confusing, because indeed, 
there are existing interesting rules, and I don't think anybody remembers what 
they are or what was even the purpose of having interestingness in the first 
place. Interestingness is currently used for tracking symbols with 
`trackExpressionValue()`, and we have those tracking kinds added by @Szelethus 
to make tracking behave slightly differently. So, yeah, I think interestingness 
shouldn't be used; it's already in use. I think it should be generalized upon 
i.e. just let checkers track whatever/however they want.

I guess my main point is, there shouldn't be a need to assist tracking by 
adding extra information to the program state. Information in the state should 
ideally be "material" to program execution, "tangible", it has to describe 
something that's actually stored somewhere in memory (either by directly 
defining it, or by constraining it). In particular, if two nodes result in 
indistinguishable future behavior of the program, we're supposed to merge them; 
but any "immaterial" bits of information in the state will prevent that from 
happening.

In our case it should be enough to have the lambda for propagation method ask 
"Hey, is this freshly produced propagation target value relevant to this 
specific report?" and if yes, mark the corresponding propagation source value 
as relevant to the report as well; also emit a note and "consume" the mark on 
the target value. Such chain of local decisions can easily replace the global 
taint flow identifier, and it's more flexible because this way the flow doesn't 
need to be "linear", it may branch in various ways and that's ok.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-02-23 Thread Kristóf Umann via Phabricator via cfe-commits
Szelethus added a comment.

In D144269#4143066 , @NoQ wrote:

> The challenging part with note tags is how do you figure out whether your bug 
> report is taint-related. The traditional solution is to check the `BugType` 
> but in this case an indeterminate amount of checkers may emit taint-related 
> reports.

Yeah, this is why we created a new type. Not sure what is the better 
infrastructure design, whether to create a subtype of `BugType` or `BugReport`, 
but it fundamentally achieves the same thing.

In D144269#4146809 , @dkrupp wrote:

> My fear with the interestingness is that it may propagating backwards 
> according to different "rules" than whot the taintedness is popagating in the 
> foward direction even with the "extensions" pointed out by steakhal.
> So the two types of propagation may be or can get out of sync.
>
> So if the above is not a concern and you think implementing this with 
> interestingness is more elegant, idiomatic and causes less maintenance 
> burden, I am happy to create an alternative patch with that solution.

@dkrupp and I discussed in detail whether to use FlowID's (what is currently 
implemented in the patch) or something similar, or reuse interestingness. 
Here's why we decided against reusing interestiness as is.

Interestingness, as it stands now, mostly expresses data-dependency, and is 
propageted with using the analyzers usualy somewhat conservative approach. 
While the length of a string is strictly speaking data dependent on the actual 
string, I don't think analyzer currently understand that. We approach taint 
very differently, and propagete it in some sense more liberally.

As I best recall, however, interestingness may be propagated through other 
means as well. If we reused interestingness, I fear that the interestiness set 
could actually be greater than the actual //interesting tainted// set, causing 
more notes to be emitted than needed.

For these reasons, which I admit are a result of some speculation, we concluded 
that interstingness as it is and taint are two different properties that are 
best separated.

In D144269#4143066 , @NoQ wrote:

> I think now's a good time to include a "generic data map"-like data structure 
> in `PathSensitiveBugReport` objects, so that checkers could put some data 
> there during `emitReport()`, which can be picked up by note tags and 
> potentially mutated in the process.

Maybe a new interestingness kind (D65723 )? 
Not sure how this design aged, but we don't really need to store an ID for 
this, so a simple interestingness flag (just not the default `Thorough` 
interestiness) is good enough.

In D144269#4146809 , @dkrupp wrote:

> The tricky part was is to how to show only that single "Taint originated 
> here" note tag at the taint source only which is relevant to the report. This 
> is done by remembering the unique flowid in the
> NoteTag in the forward analysis direction (see GenericTaintChecker:cpp:859) 
> `InjectionTag = createTaintOriginTag(C, TaintFlowId);` and then filtering out 
> the irrelevant 
> NoteTags when the the report is generated (when the lamdba callback is 
> called). See that flowid which reaches the sink is backpropagated in the 
> PathSensitiveBugreport (see GenericTaintCHekcer.cpp:167).
>
> FlowIds are unique and increased at every taint source 
> (GenericTaintChecker:869) and it is stored as an additional simple `int` in 
> the program state along with the already existing (Taint.cpp:22) TaintTagType.

If you propagate this property during analysis, those IDs may be needed, but a 
simple flag should suffice when BugReporter does it.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-02-23 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp added a comment.

@steakhal, @NoQ

thanks for your reviews.

Please note that I am not extending `TaintBugVisitor`. On the contrary I 
removed it.
Instead I use NoteTag to generate the "Taint Originated here" text (see 
GenericTaintChecker.cpp:156).

I can also add additional NoteTags for generating propagation messages "Taint 
was propagated here" easily.

> The challenging part with note tags is how do you figure out whether your bug 
> report is taint-related.

I solved this by checking if the report is an instance of `TaintBugReport` a 
new BugReport type, which should be used by all taint related reports 
(ArrayBoundCheckerV2 checker and divisionbyzero checker 
was changed to use this new report type for taint related reports).

The tricky part was is to how to show only that single "Taint originated here" 
note tag at the taint source only which is relevant to the report. This is done 
by remembering the unique flowid in the
NoteTag in the forward analysis direction (see GenericTaintChecker:cpp:859) 
`InjectionTag = createTaintOriginTag(C, TaintFlowId);` and then filtering out 
the irrelevant 
NoteTags when the the report is generated (when the lamdba callback is called). 
See that flowid which reaches the sink is backpropagated in the 
PathSensitiveBugreport (see GenericTaintCHekcer.cpp:167).

FlowIds are unique and increased at every taint source 
(GenericTaintChecker:869) and it is stored as an additional simple `int` in the 
program state along with the already existing (Taint.cpp:22) TaintTagType.

My fear with the interestingness is that it may propagating backwards according 
to different "rules" than whot the taintedness is popagating in the foward 
direction even with the "extensions" pointed out by steakhal.
So the two types of propagation may be or can get out of sync.

So if the above is not a concern and you think implementing this with 
interestingness is more elegant, idiomatic and causes less maintenance burden, 
I am happy to create an alternative patch with that solution.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-02-21 Thread Artem Dergachev via Phabricator via cfe-commits
NoQ added a comment.

I completely agree with @steakhal, these should be note tags:

- The "visitor way" is to reverse-engineer the exploded graph after the fact.
- The "slightly more sophisticated visitor way" is have checker callbacks leave 
extra hints in the graph to assist reverse engineering, which is what you 
appear to be trying to do.
- The "note tag" way is to simply capture that information from inside checker 
callbacks in the form of lambda captures. It eliminates the need to think about 
how to store the information in the state (it's stored in the program point 
instead), or how to structure it.

I also completely agree with @steakhal that the intermediate notes are 
valuable. In the motivating example, ideally both `strtol` and `getenv` need a 
note ("taint propagated here" and "taint originated here" respectively).

The challenging part with note tags is how do you figure out whether your bug 
report is taint-related. The traditional solution is to check the `BugType` but 
in this case an indeterminate amount of checkers may emit taint-related 
reports. I think now's a good time to include a "generic data map"-like data 
structure in `PathSensitiveBugReport` objects, so that checkers could put some 
data there during `emitReport()`, which can be picked up by note tags and 
potentially mutated in the process. For example, you can introduce a set of 
tracked tainted symbols there, which will be pre-populated by the checker with 
the final tainted symbol, then every time a note tag discovers that a symbol in 
the set becomes a target of taint propagation, it removes the symbol from the 
set and replaces it with the symbols from which its taint originated, so that 
later note tags would react on these new symbols instead.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-02-21 Thread Balázs Benics via Phabricator via cfe-commits
steakhal added a comment.

I haven't checked the implementation, but fundamentally patching the 
TaintBugVisitor is not how we should improve the diagnostic for taint issues.
I saw that this patch is not about NoteTags, so I didn't go any further that 
point.

What we should do instead, to add a fancy NoteTags to each of the Post 
transitions to propagate interestingness to the taint sources.
Where each NoteTag does:

- checks if any of the taint destinations are actually 'interesting', if none 
then just return an empty note.
- take the taint source arguments and mark their pre-call values as interesting
- construct a descriptive message explaining what happened:
  - If the transition had no taint sources, then it must be a "taint source"
  - If we had tainted sources, tell the user that X', Y', and Z' arguments were 
tainted, hence we propagated taint
  - take all the "interesting" taint destinations and tell the user that X, Y 
and Z arguments become tainted due to the propagation rule.

I'm attaching my proposed version for improving the diagnostics where I 
demonstrate all what I said. F26595921: proposed.patch 

Note that my patch is really crude, and I just finished hacking it to get all 
tests pass in a couple hours.

Let me know if it would be a good way to refine your patch or I should review 
your current implementation.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-02-20 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp updated this revision to Diff 498795.
dkrupp added a comment.

Added documentation to the newly introduced types: TaintData, TaintBugReport.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144269/new/

https://reviews.llvm.org/D144269

Files:
  clang/include/clang/StaticAnalyzer/Checkers/Taint.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporter.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/TaintBugType.h
  clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
  clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/Taint.cpp
  clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
  clang/lib/StaticAnalyzer/Core/BugReporter.cpp
  clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp
  
clang/test/Analysis/diagnostics/Inputs/expected-sarif/sarif-diagnostics-taint-test.c.sarif
  
clang/test/Analysis/diagnostics/Inputs/expected-sarif/sarif-multi-diagnostic-test.c.sarif
  clang/test/Analysis/diagnostics/sarif-diagnostics-taint-test.c
  clang/test/Analysis/taint-checker-callback-order-has-definition.c
  clang/test/Analysis/taint-checker-callback-order-without-definition.c
  clang/test/Analysis/taint-diagnostic-visitor.c
  clang/test/Analysis/taint-dumps.c

Index: clang/test/Analysis/taint-dumps.c
===
--- clang/test/Analysis/taint-dumps.c
+++ clang/test/Analysis/taint-dumps.c
@@ -6,7 +6,7 @@
 int getchar(void);
 
 // CHECK: Tainted symbols:
-// CHECK-NEXT: conj_$2{{.*}} : 0
+// CHECK-NEXT: conj_$2{{.*}} : Type:0 Flow:0
 int test_taint_dumps(void) {
   int x = getchar();
   clang_analyzer_printState();
Index: clang/test/Analysis/taint-diagnostic-visitor.c
===
--- clang/test/Analysis/taint-diagnostic-visitor.c
+++ clang/test/Analysis/taint-diagnostic-visitor.c
@@ -2,8 +2,17 @@
 
 // This file is for testing enhanced diagnostics produced by the GenericTaintChecker
 
+typedef unsigned long size_t;
+struct _IO_FILE;
+typedef struct _IO_FILE FILE;
+
 int scanf(const char *restrict format, ...);
 int system(const char *command);
+char* getenv( const char* env_var );
+size_t strlen( const char* str );
+void *malloc(size_t size );
+char *fgets(char *str, int n, FILE *stream);
+FILE *stdin;
 
 void taintDiagnostic(void)
 {
@@ -34,3 +43,41 @@
   int vla[x]; // expected-warning {{Declared variable-length array (VLA) has tainted size}}
   // expected-note@-1 {{Declared variable-length array (VLA) has tainted size}}
 }
+
+//Tests if the originated note is correctly placed even if the path is
+//propagating through variables and expressions
+char* taintDiagnosticPropagation(){
+  char *pathbuf;
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+	   // expected-note@-1 {{Taking true branch}}
+pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
+// expected-note@-1{{Untrusted data is used to specify the buffer size}}
+return pathbuf;
+  }
+  return 0;
+}
+
+
+//Taint origin should be marked correctly even if there are multiple taint
+//sources in the function
+char* taintDiagnosticPropagation2(){
+  char *pathbuf;
+  char *user_env2=getenv("USER_ENV_VAR2");//unrelated taint source
+  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+  char *user_env=getenv("USER_ENV_VAR");//unrelated taint source
+  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+	   // expected-note@-1 {{Taking true branch}}
+pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
+// expected-note@-1{{Untrusted data is used to specify the buffer size}}
+return pathbuf;
+  }
+  return 0;
+}
+
+void testReadStdIn(){
+  char buf[1024];
+  fgets(buf, sizeof(buf), stdin);// expected-note {{Taint originated here}}
+  system(buf);// expected-warning {{Untrusted data is passed to a system call}} // expected-note {{Untrusted data is passed to a system call (CERT/STR02-C. Sanitize data passed to complex subsystems)}}
+
+}
Index: clang/test/Analysis/taint-checker-callback-order-without-definition.c
===
--- clang/test/Analysis/taint-checker-callback-order-without-definition.c
+++ clang/test/Analysis/taint-checker-callback-order-without-definition.c
@@ -13,19 +13,22 @@
 
 void top(const char *fname, char *buf) {
   FILE *fp = fopen(fname, "r"); // Introduce taint.
-  // CHECK:  PreCall prepares tainting arg index: -1
-  // CHECK-NEXT: PostCall actually wants to 

[PATCH] D144269: [Analyzer] Show "taint originated here" note of alpha.security.taint.TaintPropagation checker at the correct place

2023-02-17 Thread Daniel Krupp via Phabricator via cfe-commits
dkrupp created this revision.
dkrupp added a reviewer: Szelethus.
dkrupp added a project: clang.
Herald added subscribers: steakhal, manas, ASDenysPetrov, martong, gamesh411, 
donat.nagy, mikhail.ramalho, a.sidorin, JDevlieghere, rnkovacs, szepet, 
baloghadamsoftware, xazax.hun.
Herald added a reviewer: NoQ.
Herald added a project: All.
dkrupp requested review of this revision.
Herald added a subscriber: cfe-commits.

This patch improves the diagnostics of the 
alpha.security.taint.TaintPropagation and taint related checkers by showing the 
"Taint originated here" note at the correct place, where the attacker may 
inject it. This greatly improves the understandability of the taint reports.

Taint Analysis: The attacker injects the malicious data at the taint source 
(e.g. getenv() call) which is then propagated and used at taint sink (e.g. 
exec() call) causing a security vulnerability (e.g. shell injection 
vulnerability), without data sanitation.

The goal of the checker is to discover and show to the user these potential 
taint source, sink pairs and the propagation call chain.

In the baseline the taint source was pointing to an invalid location, typically 
somewhere between the real taint source and sink.

After the fix, the "Taint originated here" tag is correctly shown at the taint 
source. This is the function call where the attacker can inject a malicious 
data (e.g. reading from environment variable, reading from file, reading from 
standard input etc.).

Before the patch the clang static analyzer puts the taint origin note wrongly 
to the `strtol(..)` call.

  c
  int main(){
char *pathbuf;
char *user_data=getenv("USER_INPUT"); 
char *end;  
long size=strtol(user_data, &end, 10); // note: Taint originated here. 
if (size > 0){
  pathbuf=(char*) malloc(size+1);//note: Untrusted data is used to specify 
the buffer size ...
  // ... 
  free(pathbuf);
}
return 0;
  }

After the fix, the taint origin point is correctly annotated at `getenv()` 
where the attacker really injects the value.

  c
  int main(){
char *pathbuf;
char *user_data=getenv("USER_INPUT");  // note: Taint originated here. 
char *end;
long size=strtol(user_data, &end, 10); 
if (size > 0){
  pathbuf=(char*) malloc(size+1);//note: Untrusted data is used to specify 
the buffer size ...
  // ... 
  free(pathbuf);
}
return 0;
  }

The BugVisitor placing the note was wrongly going back only until introduction 
of the tainted SVal in the sink.

This patch creates a new uniquely identified taint flow for each taint source 
(e.g.getenv()) it traverses and places a NoteTag ("Taint originated here") with 
the new id. Then, when the bug report is generated, the taint flow id is 
propagated back (in the new TainBugReport) along the bug path and the correct 
"Taint originated here." annotation is generated (matching the flow id).

You can find the new improved reports

here 


And the old reports (look out for "Taint originated here" notes. They are at 
the wrong place, close to the end of the reports)

here 



Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D144269

Files:
  clang/include/clang/StaticAnalyzer/Checkers/Taint.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/BugReporter.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/CommonBugCategories.h
  clang/include/clang/StaticAnalyzer/Core/BugReporter/TaintBugType.h
  clang/lib/StaticAnalyzer/Checkers/ArrayBoundCheckerV2.cpp
  clang/lib/StaticAnalyzer/Checkers/DivZeroChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
  clang/lib/StaticAnalyzer/Checkers/Taint.cpp
  clang/lib/StaticAnalyzer/Checkers/VLASizeChecker.cpp
  clang/lib/StaticAnalyzer/Core/BugReporter.cpp
  clang/lib/StaticAnalyzer/Core/CommonBugCategories.cpp
  
clang/test/Analysis/diagnostics/Inputs/expected-sarif/sarif-diagnostics-taint-test.c.sarif
  
clang/test/Analysis/diagnostics/Inputs/expected-sarif/sarif-multi-diagnostic-test.c.sarif
  clang/test/Analysis/diagnostics/sarif-diagnostics-taint-test.c
  clang/test/Analysis/taint-checker-callback-order-has-definition.c
  clang/test/Analysis/taint-checker-callback-order-without-definition.c
  clang/test/Analysis/taint-diagnostic-visitor.c
  clang/test/Analysis/taint-dumps.c

Index: clang/test/Analysis/taint-dumps.c
=