aaron.ballman accepted this revision.
aaron.ballman added a comment.
This revision is now accepted and ready to land.

Aside from a minor nit, I think this LGTM. Thank you for working on this!



================
Comment at: clang/lib/Sema/SemaChecking.cpp:476
+    analyze_printf::PrintfSpecifier Specifier;
+    if (Specifier.fixType(T, S.getLangOpts(), S.Context, 
/*IsObjCLiteral=*/false)) {
+      // We were able to guess how to format this.
----------------
Note for future reference: I don't think `fixType()` does anything special to 
help us out with arrays.


================
Comment at: clang/lib/Sema/SemaChecking.cpp:507
+                       unsigned Depth) {
+    // FIXME: Decide what to do if RD is a union. At least we should probably
+    // turn off printing `const char*` members with `%s`, because that is very
----------------
My suggestion is to print the value of all fields and treat any string-like 
types as `%p` rather than `%s`. I'm fine with this being done later.


================
Comment at: clang/lib/Sema/SemaChecking.cpp:542
+      auto *FD = IFD ? IFD->getAnonField() : dyn_cast<FieldDecl>(D);
+      if (!FD || (FD->isUnnamedBitfield() || FD->isAnonymousStructOrUnion()))
+        continue;
----------------



================
Comment at: clang/lib/Sema/SemaChecking.cpp:433
+
+  llvm::StringRef getFormatSpecifier(QualType T) {
+    if (auto *BT = T->getAs<BuiltinType>()) {
----------------
yihanaa wrote:
> yihanaa wrote:
> > erichkeane wrote:
> > > rsmith wrote:
> > > > yihanaa wrote:
> > > > > rsmith wrote:
> > > > > > rsmith wrote:
> > > > > > > aaron.ballman wrote:
> > > > > > > > yihanaa wrote:
> > > > > > > > > I think this is better maintained in 
> > > > > > > > > "clang/AST/FormatString.h". For example 
> > > > > > > > > analyze_printf::PrintfSpecifier can get format specifier, 
> > > > > > > > > what do you all think about?
> > > > > > > > +1 to this suggestion -- my hope is that we could generalize it 
> > > > > > > > more then as I notice there are missing specifiers for things 
> > > > > > > > like intmax_t, size_t, ptrdiff_t, _Decimal types, etc. Plus, 
> > > > > > > > that will hopefully make it easier for __builtin_dump_struct to 
> > > > > > > > benefit when new format specifiers are added, such as ones for 
> > > > > > > > printing a _BitInt.
> > > > > > > I am somewhat uncertain: every one of these is making arbitrary 
> > > > > > > choices about how to format the value, so it's not clear to me 
> > > > > > > that this is general logic rather than something specific to 
> > > > > > > `__builtin_dump_struct`. For example, using `%f` rather than one 
> > > > > > > of the myriad other ways of formatting `double` is somewhat 
> > > > > > > arbitrary. Using `%s` for any `const char*` is *extremely* 
> > > > > > > arbitrary and will be wrong and will cause crashes in some cases, 
> > > > > > > but it may be the pragmatically correct choice for a dumping 
> > > > > > > utility. A general-purpose mechanism would use `%p` for all kinds 
> > > > > > > of pointer.
> > > > > > > 
> > > > > > > We could perhaps factor out the formatting for cases where there 
> > > > > > > is a clear canonical default formatting, such as for integer 
> > > > > > > types and probably `%p` for pointers, then call that from here 
> > > > > > > with some special-casing, but without a second consumer for that 
> > > > > > > functionality it's really not clear to me what form it should 
> > > > > > > take.
> > > > > > I went ahead and did this, mostly to match concurrent changes to 
> > > > > > the old implementation. There are a few cases where our existing 
> > > > > > "guess a format specifier" logic does the wrong thing for dumping 
> > > > > > purposes, which I've explicitly handled -- things like wanting to 
> > > > > > dump a `char` / `signed char` / `unsigned char` member as a number 
> > > > > > rather than as a (potentially non-printable or whitespace) 
> > > > > > character.
> > > > >  When I was patching that old implementation, I found that for 
> > > > > uint8_t, int8_t, Clang's existing "guess a format specifier" logic 
> > > > > would treat the value as an integer, but for unsigned char, signed 
> > > > > char, char types, it would Treat it as a character, please look at 
> > > > > this example ( https://godbolt.org/z/ooqn4468T ), I guess this 
> > > > > existing logic may have made some special judgment.
> > > > Yeah. I think in the case where we see some random call to `printf`, 
> > > > `%c` is probably the right thing to guess here, but it doesn't seem 
> > > > appropriate to me to use this in a dumping routine. If we could dump as 
> > > > `'x'` for printable characters and as `'\xAB'` for everything else, 
> > > > that'd be great, but `printf` can't do that itself and I'm not sure we 
> > > > want to be injecting calls to `isprint` or whatever to make that work. 
> > > > Dumping as an integer always seems like it's probably the least-bad 
> > > > option.
> > > > 
> > > > Somewhat related: I wonder if we should use `"\"%s\""` instead of 
> > > > simply `"%s"` when dumping a `const char*`. That's not ideal but 
> > > > probably clearer than the current dump output.
> > > I see value to having strings with SOME level of delimiter, if at least 
> > > to handle cases when the string itself has a newline in it.
> > This looks good, but the string may need to be escaped (if there are some 
> > special characters in the original string)
> Maybe users should wrap the printf function, and then pass the wrapper 
> function as a parameter, so that they can do some custom things
> Somewhat related: I wonder if we should use "\"%s\"" instead of simply "%s" 
> when dumping a const char*. That's not ideal but probably clearer than the 
> current dump output.

I'd be in favor of that. (I sort of wonder whether we do want to inject calls 
to `isprint()` though -- it seems like escaping nonprintable characters would 
be the kindest option for both characters and strings.)


================
Comment at: clang/lib/Sema/SemaChecking.cpp:573
+          // We don't know how to print this field. Print out its address
+          // with a format specifier that a smart tool will be able to
+          // recognize and treat specially.
----------------
rsmith wrote:
> aaron.ballman wrote:
> > rsmith wrote:
> > > yihanaa wrote:
> > > > Can we generate a warning diagnostic message when we see an unsupported 
> > > > type (perhaps temporarily not supported), what do you all think about?
> > > We could, but I'm leaning towards thinking that we shouldn't warn by 
> > > default at least. That would get in the way of the C++ use case where the 
> > > type actually is supported by the printing function; I'm not sure that we 
> > > can easily tell those cases apart. And it's probably not all that useful 
> > > to someone who just wants whatever information we can give them while 
> > > debugging. I'd be OK with adding a warning that's off by default, though. 
> > > What would be the intended use case here? Would an off-by-default warning 
> > > satisfy it?
> > So long as we don't trigger UB, I think it's fine to not diagnose for 
> > unsupported types. But the UB cases do strike me as something we might want 
> > to warn about if we think they can be hit (specifically, these kinds of UB: 
> > https://eel.is/c++draft/cstdarg.syn#1).
> > 
> > As an example, do we properly handle dumping this case:
> > ```
> > struct S {
> >   int &i;
> >   S(int &ref) : i(ref) {}
> > };
> > ```
> > where the vararg argument is of reference type? I suppose another 
> > interesting question is:
> > ```
> > struct S {
> >   int i;
> > };
> > 
> > void func() {
> >   S s;
> >   [=]() {
> >     __builtin_dump_struct(&s, ::printf);
> >   };
> > }
> > ```
> > (is S.i "an entity resulting from a lambda capture"?)
> The only things we pass to the formatting function are:
> * The small number of builtin types for which we have known formatting 
> specifiers.
> * Pointers to `void`, using `%p`.
> * Pointers to `[const] char`, using `%s`.
> * Pointers to user-defined types, using `*%p`.
> All of these can be passed through a `...` without problems.
> 
> We can potentially introduce the following sources of UB:
> * Indirection through a struct pointer that doesn't point to a struct object.
> * Read of an inactive union member.
> * Read of an uninitialized member.
> * Printing something other than a pointer to a nul-terminated byte sequence 
> with `%s`.
> I'm not sure how much we should be worrying about those. It'd be nice to 
> avoid them, though I don't think we can fix the fourth issue without giving 
> up the `%s` formatting entirely.
> 
> Reference types aren't on our list of types with known formatting specifiers, 
> so in your first example we'd call `printf("%s%s = *%p", "int &", "i", 
> &p->i)`, producing output like `int & i = *0x1234abcd`. We could special-case 
> that to also dereference the reference and include the value (`int & i = 
> *0x1234abcd (5)`, but for now I'm treating it like any other type for which 
> we don't have a special formatting rule. I've added test coverage for this 
> case.
> 
> In your second example, we rewrite into `::printf("%s%s = %s", "int", "i", 
> (&s)->i)`, which is fine. The `parmN` restriction is about calls to 
> `va_start`:
> ```
> void f(int a, ...) {
>   [&] {
>     va_list va;
>     va_start(va, a); // error, `a` is captured here
>     // ...
>   };
> }
> ```
Thank you for the explanation, it sounds like we're doing about as good as we 
can get here in terms of safety.

FWIW, I like the idea of outputting the value for a reference type, but am fine 
punting on that for now (it could get awkward to format when the referenced 
type is a structure, for example).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124221/new/

https://reviews.llvm.org/D124221

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to