steakhal added a comment. Sorry for the wall-of-nits. Overall, looks great.
================ Comment at: clang/docs/analyzer/checkers.rst:2358 Default sources defined by ``GenericTaintChecker``: -``fdopen``, ``fopen``, ``freopen``, ``getch``, ``getchar``, ``getchar_unlocked``, ``gets``, ``scanf``, ``socket``, ``wgetch`` + ``_IO_getc``, ``fdopen``, ``fopen``, ``freopen``, ``get_current_dir_name``, ``getch``, ``getchar``, ``getchar_unlocked``, ``getcw``, ``getcwd``, ``getgroups``, ``gethostname``, ``getlogin``, ``getlogin_r``, ``getnameinfo``, ``getopt``, ``getopt_long``, ``getopt_only``, ``gets``, ``getseuserbyname``, ``readlink``, ``scanf``, ``scanf_s``, ``socket``, ``wgetch`` ---------------- typo/dup? I cannot recognize the `getcw()` call. Could you please refer to the specification or an instance where it was defined? ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:546-548 {{"gets"}, TR::Source({{0}, ReturnValueIndex})}, {{"scanf"}, TR::Source({{}, 1})}, + {{"scanf_s"}, TR::Source({{}, {1}})}, ---------------- If we handle `gets`, `scanf`, we should also model the `*_s` versions as well. ```lang=C char *gets_s(char *s, rsize_t n); int scanf_s(const char *restrict format, ...); int fscanf_s(FILE *restrict stream, const char *restrict format, ...); int sscanf_s(const char *restrict s, const char *restrict format, ...); int vscanf_s(const char *restrict format, va_list arg); int vfscanf_s(FILE *restrict stream, const char *restrict format, va_list arg); int vsscanf_s(const char *restrict s, const char *restrict format, va_list arg); ``` ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:550-552 + {{"getopt"}, TR::Source({{ReturnValueIndex}})}, + {{"getopt_long"}, TR::Source({{ReturnValueIndex}})}, + {{"getopt_long_only"}, TR::Source({{ReturnValueIndex}})}, ---------------- IMO these functions are highly domain-specific. On errors, they return specific/well-defined values e.g. `-1`, `'?'` or `':'`. That being said, the analyzer does not have this knowledge, thus it will model these as `conjured` symbols. If these values were modeled as tainted, we would likely introduce the number of false-positives regarding our limited capabilities of modeling the function accurately. tldr; I'm against these three rules; or alternatively prove that my concerns are not issues on real code bases. ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:556 + {{"getwd"}, TR::Source({{0, ReturnValueIndex}})}, + {{"readlink"}, TR::Source({{1, ReturnValueIndex}})}, + {{"get_current_dir_name"}, TR::Source({{ReturnValueIndex}})}, ---------------- We should check `readlinkat` as well. ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:559 + {{"gethostname"}, TR::Source({{0}})}, + {{"getnameinfo"}, TR::Source({{2, 4}})}, + {{"getseuserbyname"}, TR::Source({{1, 2}})}, ---------------- In what cases can this function introduce taint? ================ Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:561 + {{"getseuserbyname"}, TR::Source({{1, 2}})}, + {{"getgroups"}, TR::Source({{1}})}, + {{"getlogin"}, TR::Source({{ReturnValueIndex}})}, ---------------- >On success, `getgroups()` returns the number of supplementary group IDs. On >error, -1 is returned, and `errno` is set appropriately. According to this, the return value index should be also tainted. ================ Comment at: clang/test/Analysis/taint-generic.c:385 + struct option long_opts[] = {{0, 0, 0, 0}}; + int opt = getopt_long_only(argc, argv, "a:b:02", long_opts, &option_index); + return 1 / opt; // expected-warning {{Division by a tainted value, possibly zero}} ---------------- Well, this can never return zero. What we should do is to do a state-split for the failure case; since the application should definitely handle a failure in this part; thus the split would be justified. ================ Comment at: clang/test/Analysis/taint-generic.c:392 +int underscore_IO_getc_is_source(_IO_FILE *fp) { + char c = _IO_getc(fp); + return 1 / c; // expected-warning {{Division by a tainted value, possibly zero}} ---------------- Sometimes we taint the `fd` and propagate based on that, and othertimes, we simply just return taint. However, I think it still looks like a better tradeoff this way. I just wanted to highlight this. Maybe a comment on the CallDescription would be beneficial in describing this discrepancy. ================ Comment at: clang/test/Analysis/taint-generic.c:413 + system(buf); // expected-warning {{Untrusted data is passed to a system call}} + return 1 / s; // expected-warning {{Division by a tainted value, possibly zero}} +} ---------------- Well, it can never be zero AFAIK. It might be `-1` on error, or the number of bytes stored into `buf`. So, I would rather modify the example to `1 / (s + 1)`. ================ Comment at: clang/test/Analysis/taint-generic.c:417 +char *get_current_dir_name(void); +int get_current_dir_name_is_source() { + char *d = get_current_dir_name(); ---------------- Please avoid spelling the given function in the name of the test directly in a verbatim manner. If we would use the `CDF_MaybeBuiltin` matching mode, the `get_current_dir_name_is_source` would match for the `CallDescription {"get_current_dir_name"}` due to the way we fuzzy match for builtins. Prefixing with `Test` like `testGet_current_dir_name` would be fine though, only the underscores are handled differently. This applies to the rest of the test cases as well. Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D120236/new/ https://reviews.llvm.org/D120236 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits