[PATCH] D88299: [clang-format] Add MacroUnexpander.

Manuel Klimek via Phabricator via cfe-commits Fri, 26 Nov 2021 10:26:37 -0800

klimek added a comment.

Noticed I should have waiting with the renaming of the file until the review is 
done :( Sorry for the extra confusion.




================
Comment at: clang/lib/Format/MacroUnexpander.cpp:77
+  // stream, we need to continue the unexpansion until we find the right token
+  // if it is part of the unexpanded token stream.
+  // Note that hidden tokens can be part of the unexpanded stream in nested
----------------
sammccall wrote:
> I can't work out what "it" refers to in this sentence.
> 
> (and "spelled" token stream?)
Changed to "given token" - it refers to the token.

It's reconstructed, not spelled, like the example below explains: we do have 
hidden tokens that we want to find when we're in the middle of a recursive 
expansion. I wouldn't call the hidden tokens here "spelled" - would you?


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:82
+  // And the call: BRACED(BRACED(x))
+  // The outer macro call will be BRACED({a}), and the hidden tokens '{' and
+  // '}' can be found in the unexpanded macro stream of that level.
----------------
sammccall wrote:
> It would be helpful to complete the example by spelling out which token 
> you're adding, which the correct parent is, and which tokens you need to 
> "expand over" to make it available.
> 
> I *think* the answer to the first two is - when you're adding the `a` then 
> its proper parent is the inner `(` in `BRACED(BRACED(`... but I don't know 
> the last part.
Found out that I was missing a unit test. Added a unit test, and now explained 
the unit test here in the comment. PTAL.


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:84
+  // '}' can be found in the unexpanded macro stream of that level.
+  if (!Unexpanded.empty() && Token->MacroCtx &&
+      (Token->MacroCtx->Role != MR_Hidden ||
----------------
sammccall wrote:
> is it possible that you may need to unexpand more than the innermost macro?
> 
> e.g. BRACED(ID(ID(BRACED(ID(ID(a)))))) expands to {{a}} and the parents of 
> `a` and inner `{` each come from a couple of macro-levels up.
> 
> (I don't totally understand the logic here, so the answer's probably "no"...)
A token on a higher level must always also be there on a lower level.
Calls in this example are:
BRACED({a})
ID({a})
ID({a})
BRACED(a)
ID(a)
ID(a)
When the next token comes in, we thus always find it in the higher level (open) 
macro call. When we reconstruct for that token, we then reconstruct multiple 
macro calls at once, if it is the first token in multiple macro calls.


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:173
+  assert(Token->MacroCtx);
+  // A single token can be the only result of a macro call:
+  // Given: ID(x, y) ;
----------------
sammccall wrote:
> This raises another point: a macro can have an empty body.
> AFAICT this fundamentally isn't supported here, as we're driven by the 
> expanded token stream.
> I guess it makes sense to handle this elsewhere in clang-format (or even not 
> handle it) but it should be documented somewhere.
I think the right point to document and handle it is at the next layer, where 
we integrate all of this into clang-format.


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:238
+  // generated by the call.
+  for (size_t I = Unexpanded.size(); I < Token->MacroCtx->ExpandedFrom.size();
+       ++I) {
----------------
sammccall wrote:
> nit: index arithmetic obscures what's going on a bit.
> You could write
> 
> ```
> ArrayRef<FormatToken *> StartedMacros = 
> makeArrayRef(Token->MacroCtx->ExpandedFrom).drop_front(Unexpanded.size());
> for (FormatToken *ID : llvm::reverse(StartedMacros)) {
> ...
> }
> ```
> but up to you
> 
> It's not obvious to me *why* we're iterating in reverse order BTW: i.e. why 
> the order of the `Unexpanded` stack is opposite the order of the 
> `ExpandedFrom` stack.
> So maybe just a comment to reinforce that, like (// ExpandedFrom is 
> outside-in order, we want inside-out)
Oooh, this is nice. s/drop_front/drop_back/. Added comment.


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:282
+         (Unexpanded.size() >= Token->MacroCtx->EndOfExpansion));
+  if (!Token->MacroCtx)
+    return;
----------------
sammccall wrote:
> nit: Token->MacroCtx is checked at the callsite, and startUnexpansion asserts 
> it as a precondition - do that here too for symmetry?
Thanks for spotting, this was from a time where the code looked differently 
(the assert above also didn't make sense any more in that form).


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:310
+    // Expand the closing parenthesis, if it exists, including an optional
+    // trailing comment.
+    for (auto T = Unexpanded.back().I; T != Unexpanded.back().End; ++T) {
----------------
sammccall wrote:
> I can't work out if this is supposed to say comma or comment :-)
> 
> If comma - is that a thing?
> If comment - why would a comment be considered part of the unexpansion of a 
> macro invocation, rather than a sibling to the macro? How do we know what 
> trails is a comment - should we have an assertion?
Added words.

Re: the trailing comment, that's an idiosyncrasy of who clang-format handles 
trailing comments - it parses them with the token preceding them. We could 
probably spend more complexity on the call side, trying to break them out, but 
I'm not sure that's better, given that this point already needs to be fairly 
reliable against all user code.


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:348
+    if (Token->is(tok::comma)) {
+      TokenToParentInOutput
+          [MacroCallStructure.back().Line->Tokens.back()->Tok] = Token;
----------------
sammccall wrote:
> This line is both subtle and cryptic :-).
> ```
> // New unwrapped-lines from unexpansion are normally "chained" - treated as
> // children of the last token of the previous line. 
> // Redirect them to be treated as children of the comma instead.
> // (only affects lines added before we push more tokens into 
> MacroStructure.Line)
> ```
This is partially captured in the comment above. Added a shorter description 
here, let me know if that's not enough.


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:366
+  }
+  if (!Token->MacroCtx && Token->is(tok::l_paren)) {
+    MacroCallStructure.push_back(
----------------
sammccall wrote:
> This feels like a stupid question, but how do we know that this 
> non-macro-parenthesis has anything to do with macros?
> (Fairly sure the answer is "processNextUnexpanded() is only called when the 
> token is macro-related", it'd be nice to have this precondition spelled out 
> somewhere)
Added a comment above the restructured !Token->MacroCtx check.


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:367
+  if (!Token->MacroCtx && Token->is(tok::l_paren)) {
+    MacroCallStructure.push_back(
+        {currentLine(), parentLine().Tokens.back()->Tok, Token});
----------------
sammccall wrote:
> please assign to the struct fields or use /*Foo=*/ comments
> 
> All the data flow around this struct is local but the order of fields isn't 
> :-)
Added a constructor.


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:369
+        {currentLine(), parentLine().Tokens.back()->Tok, Token});
+    TokenToParentInOutput[MacroCallStructure.back().RedirectParentFrom] = 
Token;
+    pushToken(Token);
----------------
sammccall wrote:
> nit: = MacroCallStructure.back().RedirectParentTo
> This line + line 361 are the keys to understanding what the 
> RedirectParentFrom/To do, so it'd be helpful for them to explicitly use those 
> variables.
> 
> (or use the source expressions for both... but in that case the fields may 
> need docs)
1. Line 361 was not actually necessary; I first thought that was a bug, but 
then dug in, and noticed that we weren't able to hit that code path (anymore, 
after restructuring)

2. Completely renamed the members and documented them, added some more asserts 
to make things hopefully clearer - thanks for pointing it out, those were 
clearly underdocumented; I hope it's now more clear.


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:406
+  for (auto *E = Output.Tokens.front()->Children.end(); I != E; ++I) {
+    if ((*I)->Tokens.empty())
+      continue;
----------------
sammccall wrote:
> hmm, when is this possible? are these literal blank lines? where do they come 
> from?
Looks like you're right; this looks like it was left over from a previous 
different structure, I'll make sure to fuzz it thoroughly.


================
Comment at: clang/lib/Format/MacroUnexpander.cpp:409
+
+    // Every subsequent line will become a child of the last token in the
+    // previous line, which is the token prior to the first token in the line.
----------------
sammccall wrote:
> Isn't this just the end of line from the previous iteration of this loop? Why 
> the global map populated by pushToken?
> (I feel like i'm missing something obvious here)
Nice catch, the complexity here was also from a previous iteration where the 
structure of the algorithm was (even) more complex, where we needed to also do 
stitching for non-top-level lines. Now able to completely get rid of 
PreviousToken and TokenToPrevious \o/


================
Comment at: clang/lib/Format/Macros.h:30
 ///
 /// When formatting, clang-format formats the expanded unwrapped lines first,
 /// determining the token types. Next, it formats the spelled unwrapped lines,
----------------
sammccall wrote:
> would s/formats/annotates/ be inaccurate?
> 
> This is just my poor understanding of the code, but it wasn't obvious to me 
> that annotation is closely associated with formatting and not with parsing 
> UnwrappedLines.
Said "annotates and formats" now - yes, the fact that annotating and formatting 
is so coupled is a admittedly weird choice of the initial design.


================
Comment at: clang/lib/Format/Macros.h:135
+/// Converts a stream of unwrapped lines of expanded tokens into unwrapped 
lines
+/// containing the tokens of the macro call in a way that the resulting
+/// unwrapped lines of the macro call best resemble the split of unwrapped 
lines
----------------
sammccall wrote:
> I know it's the common case, but I think saying "the macro call" here is 
> misleading, because it quickly becomes apparent reading the code that the 
> scope *isn't* one macro call, and (at least for me) it's easy to get hung up 
> on not understanding what the scope is. (AIUI the scope is actually one UL of 
> *output*... so the use of plural there is also confusing).
> 
> I think a  escription could be something like:
> 
> Converts a sequence of UnwrappedLines containing expanded macros into a 
> single UnwrappedLine containing the macro calls.
> This UnwrappedLine may be broken into child lines, in a way that best conveys 
> the structure of the expanded code.
> ...
> In the simplest case, a spelled UnwrappedLine contains one macro, and after 
> expanding it we have one expanded UnwrappedLine.
> In general, macro expansions can span UnwrappedLines, and multiple macros can 
> contribute tokens to the same line.
> We keep consuming expanded lines until:
>  - all expansions that started have finished (we're not chopping any macros 
> in half)
>  - *and* we've reached the end of a *spelled* unwrapped line
> A single UnwrappedLine represents this chunk of code.
> 
> After this point, the state of the spelled/expanded stream is "in sync" (both 
> at the start of an UnwrappedLine, with no macros open), so the Unexpander can 
> be thrown away and parsing can continue.
> 
> (and then launch into an example)
> 
Thanks, that's a really good write-up!


================
Comment at: clang/lib/Format/Macros.h:192
+  /// token.
+  UnwrappedLine getResult();
+
----------------
sammccall wrote:
> nit: giving `getResult()` a side-effect but also making it idempotent is a 
> bit clever/confusing.
> 
> Either exposing `void finalize();` + `UnwrappedLine get() const`, or 
> `UnwrappedLine takeResult() &&`, seems a little more obvious.
Done.


================
Comment at: clang/lib/Format/Macros.h:218
+  enum UnexpanderState {
+    Start,      // No macro expansion was found in the input yet.
+    InProgress, // During a macro unexpansion.
----------------
sammccall wrote:
> you have this "no expansions are open, but we already didn't find any" state.
> The effect of this is that finished() returns false so the caller keeps 
> looping.
> 
> But a correct caller will never rely on this:
>  - the first line a caller feeds must have macro tokens in it, or our output 
> will be garbage AFAICT
>  - calling getResult() without feeding in any lines is definitely not correct
> 
> It seems we could rather assert on these two conditions, and eliminate the 
> Start/InProgress distinction.
> That way incorrect usage is an assertion crash, rather than turning into an 
> infinite loop.
Changed to asserts.


================
Comment at: clang/lib/Format/Macros.h:278
+  // once we're past the comma in the unexpansion.
+  llvm::DenseMap<FormatToken *, FormatToken *> TokenToParentInOutput;
+
----------------
sammccall wrote:
> ParentInExpandedToParentInUnexpanded?
> 
> (current name implies that it maps a token to *its* parent. It also uses the 
> input/output names, rather than expanded/unexpanded - it would be nice to be 
> consistent)
Called it SpelledParentToReconstructedParent.


================
Comment at: clang/lib/Format/Macros.h:286
+    // Our current position in the unexpansion.
+    std::list<UnwrappedLineNode>::iterator I;
+    // The end of the unexpanded token sequence.
----------------
sammccall wrote:
> consider calling these NextSpelled and EndSpelled to be really explicit?
> Since the type doesn't really clarify which sequence is being referred to.
Called them SpelledI and SpelledE in llvm tradition.


================
Comment at: clang/lib/Format/Macros.h:314
+  // \- )
+  llvm::SmallVector<MacroCallState, 4> MacroCallStructure;
+
----------------
sammccall wrote:
> nit: if you're going to specify SmallVector sizes, I don't understand why 
> you'd size Unexpanded vs MacroCallStructure differently - they're going to be 
> the same size, right?
> 
> These days you can leave off the size entirely though, and have it pick a 
> value
I did not know that, that's awesome!


================
Comment at: clang/lib/Format/Macros.h:142
+/// When getting the formatted lines of the expansion via the \c addLine 
method:
+/// -> class A {
+/// -> public:
----------------
sammccall wrote:
> klimek wrote:
> > klimek wrote:
> > > sammccall wrote:
> > > > sammccall wrote:
> > > > > I'm a bit confused by these arrows. It doesn't seem that they each 
> > > > > point to an unwrappedline passed to addLine?
> > > > This example didn't really help me understand the interface of this 
> > > > class, it seems to be a special case:
> > > >  - the input is a single block construct (rather than e.g. a whole 
> > > > program), but it's not clear why that's the case.
> > > >  - the output (unexpanded form) consists of exactly a macro call with 
> > > > no leading/trailing tokens, which isn't true in general
> > > > 
> > > > If the idea is to provide as input the minimal range of lines that span 
> > > > the macro, we should say so somewhere. But I would like to understand 
> > > > why we're not simply processing the whole file.
> > > Why not? That is the intention? Note that high-level we do not pass class 
> > > definitions in one unwrapped line; if they ever go into a single line 
> > > it's done in the end by a special merging pass (yes, that's a bit 
> > > confusing).
> > Re: input is a single construct - that is not the case (see other comment)
> > Re: leading / trailing tokens: I didn't want to make it too complex in the 
> > example.
> > Re: input is a single construct - that is not the case (see other comment)
> 
> A class is definitely a single construct :-) It sounds like that's not 
> significant to the MacroUnexpander though, so it feels like a red herring to 
> me.
> 
> > Re: leading / trailing tokens: I didn't want to make it too complex in the 
> > example.
> That seems fine, I think the complexities of the general case need to be 
> mentioned somewhere because the API reflects them. But you're right, the 
> primary example should be simple.
> I think a tricky example (maybe the `#define M ; x++` one?) could be given on 
> one of addLine/finish/getResult maybe.
> 
#define M ; x++ seems to be similarly tricky to #define CLASSA(x) class A x
We get multiple calls to addLine, in between which finished() is false.


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:207
+  Unexp.addLine(line(E.consume(lex("};"))));
+  EXPECT_TRUE(Unexp.finished());
+  Matcher U(Call);
----------------
sammccall wrote:
> klimek wrote:
> > sammccall wrote:
> > > you have lots of assertions that this is true, but none that it is false
> > Yeah, that's a white-box assertion - finished() is false the vast majority 
> > of the time, so testing the true cases is the important part - happy to add 
> > tests if you think it's worth it.
> Yes - a basic test that it's not always true would be useful I think (maybe 
> the `#define M ;x++` case would be useful for showing the expected loop and 
> finished() values)
We got a couple of tests like that, added checks for negative finished in 
between.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88299/new/

https://reviews.llvm.org/D88299

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D88299: [clang-format] Add MacroUnexpander.

Reply via email to