[PATCH] D88299: [clang-format] Add MacroUnexpander.

Manuel Klimek via Phabricator via cfe-commits Wed, 28 Oct 2020 05:38:25 -0700

klimek added inline comments.


================
Comment at: clang/lib/Format/FormatToken.h:449
 
+  /// When macro expansion introduces parents, those are marked as
+  /// \c MacroParent, so formatting knows their children need to be formatted.
----------------
sammccall wrote:
> I can't really understand from the comment when this is supposed to be set, 
> and there are no tests of it.
> 
> (The comment is vague: is a "parent" the inverse of FormatToken::Children 
> here? Is this scenario when the parents in question are new, or their 
> children are new, or both? What part of the code is "formatting", and why 
> would it otherwise skip the children?)
Rewrote comment.


================
Comment at: clang/lib/Format/Macros.h:134
 
+/// Matches formatted lines that were created by macro expansion to a format
+/// for the original macro call.
----------------
sammccall wrote:
> "matches formatted lines" probably describes the hard technical problem it 
> has to solve, but not so much what it does for the caller:  what the 
> transformation is between its inputs and its outputs.
> 
> Is it something like:
> 
> ```
> Rewrites expanded code (containing tokens expanded from macros) into spelled 
> code (containing tokens for the macro call itself). Token types are 
> preserved, so macro arguments in the output have semantically-correct types 
> from their expansion. This is the point of expansion/unexpansion: to allow 
> this information to be used in formatting.
> ```
> 
> [Is it just tokentypes? I guess it's also Role and MustBreakBefore and some 
> other stuff like that?]
It's not the token info, this we'd trivially have by using the original token 
sequence which is still lying around (we re-use the same tokens).

Reworded.


================
Comment at: clang/lib/Format/Macros.h:142
+/// When getting the formatted lines of the expansion via the \c addLine 
method:
+/// -> class A {
+/// -> public:
----------------
sammccall wrote:
> sammccall wrote:
> > I'm a bit confused by these arrows. It doesn't seem that they each point to 
> > an unwrappedline passed to addLine?
> This example didn't really help me understand the interface of this class, it 
> seems to be a special case:
>  - the input is a single block construct (rather than e.g. a whole program), 
> but it's not clear why that's the case.
>  - the output (unexpanded form) consists of exactly a macro call with no 
> leading/trailing tokens, which isn't true in general
> 
> If the idea is to provide as input the minimal range of lines that span the 
> macro, we should say so somewhere. But I would like to understand why we're 
> not simply processing the whole file.
Why not? That is the intention? Note that high-level we do not pass class 
definitions in one unwrapped line; if they ever go into a single line it's done 
in the end by a special merging pass (yes, that's a bit confusing).


================
Comment at: clang/lib/Format/Macros.h:142
+/// When getting the formatted lines of the expansion via the \c addLine 
method:
+/// -> class A {
+/// -> public:
----------------
klimek wrote:
> sammccall wrote:
> > sammccall wrote:
> > > I'm a bit confused by these arrows. It doesn't seem that they each point 
> > > to an unwrappedline passed to addLine?
> > This example didn't really help me understand the interface of this class, 
> > it seems to be a special case:
> >  - the input is a single block construct (rather than e.g. a whole 
> > program), but it's not clear why that's the case.
> >  - the output (unexpanded form) consists of exactly a macro call with no 
> > leading/trailing tokens, which isn't true in general
> > 
> > If the idea is to provide as input the minimal range of lines that span the 
> > macro, we should say so somewhere. But I would like to understand why we're 
> > not simply processing the whole file.
> Why not? That is the intention? Note that high-level we do not pass class 
> definitions in one unwrapped line; if they ever go into a single line it's 
> done in the end by a special merging pass (yes, that's a bit confusing).
Re: input is a single construct - that is not the case (see other comment)
Re: leading / trailing tokens: I didn't want to make it too complex in the 
example.


================
Comment at: clang/lib/Format/Macros.h:147
+///
+/// Creates the unwrapped lines containing the macro call tokens so that
+/// the macro call tokens fit the semantic structure of the expanded formatted
----------------
sammccall wrote:
> this says "creates the unwrapped lines" but getResult() returns only one.
> Does the plural here refer to the tree? Maybe just use singular or "the 
> unwrappedlinetree"?
Fixed comment.


================
Comment at: clang/lib/Format/Macros.h:154
+/// -> })
+class MacroUnexpander {
+public:
----------------
sammccall wrote:
> I get the symmetry between the expander/unexpander classes, but the name is 
> making it harder for me to follow the code.
>  - the extra compound+negation in the name makes it hard/slow to understand, 
> as I'm always thinking first about expansion
>  - the fact that expansion/unexpansion are not actually opposites completely 
> breaks my intuition. It also creates two different meanings of "unexpanded" 
> that led me down the garden path a bit (e.g. in the test).
> 
> Would you consider `MacroCollapser`? It's artificial, but expand/collapse are 
> well-known antonyms without being the same word.
> 
> (Incidentally, I just realized this is why I find "UnwrappedLine" so slippery 
> - the ambiguity between "this isn't wrapped" and "this was unwrapped")
Happy to rename, but first wanted to check in - I use "unexpanded token stream" 
quite often to refer to the macro call. Perhaps we should also find different 
wording for that then?

Perhaps we should call this MacroLineMatcher btw? This is not creating anything 
new - the resulting token sequence is the "unexpanded token sequence" with the 
exact same tokens, the special thing is that they're matched into unwrapped 
lines.


================
Comment at: clang/lib/Format/Macros.h:164
+  /// For the given \p Line, match all occurences of tokens expanded from a
+  /// macro and replace them with the original macro call in \c getResult().
+  void addLine(const UnwrappedLine &Line);
----------------
sammccall wrote:
> I find this hard to follow.
> - again, the "match" part is an implementation detail that sounds interesting 
> but isn't :-)
> - "from a macro" sounds like one in particular, but is actually every macro
> - the bit about "getResult" is a separate point that feels shoehorned in
> 
> What about:
> "Replaces tokens that were expanded from macros with the original macro 
> calls. The result is stored and available in getResult()"
I think the match part is important, as it's matching unwrapped lines, which is 
the heart of the algorithm.


================
Comment at: clang/lib/Format/Macros.h:175
+  /// Generally, this line tries to have the same structure as the expanded,
+  /// formatted unwrapped lines handed in via \c addLine():
+  /// If a token in a macro argument is a child of a token in the expansion,
----------------
sammccall wrote:
> how can this be the case if the input can have multiple lines and the output 
> only one?
> 
> Is the return value a synthetic parent of the translated lines?
> Or is there a hidden requirement on the caller here that we don't keep 
> feeding in lines unless we're continuing a macro call and therefore know 
> we'll end up with one line?
> 
> This stuff could be clarified in docs but again I have to ask, can't we 
> sidestep the whole issue by processing the whole file and returning all the 
> lines?
> 
> 
> (This is somewhat answered in the implementation, though that doesn't seem 
> like the right place)
Reworded; the reason why we have the single-line anyway is that:
- a macro call is something we generally want to have in one unwrapped line
- the tokens (or other macro calls) that go into the same instance of 
MacroUnexpander only consist of tokens that do not have an unwrapped line break 
and macro calls
Thus, we want the output to be in a single unwrapped line, as we're otherwise 
majorly confusing ~everything else in the formatter.


================
Comment at: clang/lib/Format/Macros.h:237
+  // lines as children.
+  // In the end, we stich the lines together so that each subsequent line
+  // is a child of the last token of the previous line. This is necessary
----------------
sammccall wrote:
> The explanation here seems to be proving the converse: if we *didn't* use 
> this representation, then the code wouldn't work. However what I'm missing is 
> an explanation of why it **is** correct/sensible.
> 
> After staring at the tests, I'm still not sure, since the tests seem to 
> postprocess the "natural" output the same way before asserting equality.
> 
> My tentative conclusion is it would be clearest to move this "in the end" 
> step to the *caller* of getResult(), as it seems to have more to do with 
> formatting than unexpansion. But I haven't looked in detail at that caller, 
> maybe I'm wrong...
This is basically what I wrote before - in the end, that the expanded code 
creates multiple unwrapped lines is the weird thing, as the input is 
fundamentally a single unwrapped line (the macro call plus a bit of stuff 
around it). Thus, it's quite natural for the unexpander to return a single 
unwrapped line that represents the original structure. Not sure how to best put 
this in words.


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:1
+#include "../../lib/Format/Macros.h"
+#include "../../lib/Format/UnwrappedLineParser.h"
----------------
sammccall wrote:
> sammccall wrote:
> > All of these tests use both the expander and unexpander (so need consider 
> > behavior/bugs of both). 
> > Is it possible to test the unexpander in isolation?
> > 
> > (Context: I'm trying to use the tests to understand what the class does, 
> > but the inputs aren't visible)
> Having read all the tests here, they seem to follow exactly the same schema:
> 
> 1. some macro definitions
> 2. some sequence of expand() calls, simulating expanding some macro-heavy code
> 3. verify the expanded token sequence, rearranging it into expected 
> UnwrappedLine structure
> 4. call unexpand() and check that the result has the expected UnwrappedLine 
> structure
> 
> You do this using various fairly-general helpers and DSLs, but don't really 
> combine them in different ways... the tests are somewhat readable and error 
> messages are OK, but if these are important tests, it might be worth looking 
> at a data-driven test. e.g.
> 
> ```
> Case NestedChildBlocks;
> NestedChildBlocks.Macros = {"ID(x)=x", "CALL(x)=f([] { x })"};
> NestedChildBlocks.Original = "ID(CALL(CALL(return a * b;)))";
> NestedChildBlocks.Expanded = R"cpp(
> {
> f([] {
>   f([] {
>     return a * b;
>   })
> })
> }
> )cpp"; // indentation shows structure
> NestedChildBlocks.Unexpanded = R"cpp(
> ID(
>   {
>   CALL(CALL(return a * b;))
>   }
> )
> )cpp";
> NestedChildBlocks.verify();
> ```
> 
> Definitely involves a bit of reinventing wheels though.
Not sure I understand what you mean - everything that's interesting about the 
inputs is spelled out in the test - namely, what the structure of unwrapped 
lines going into the unexpander is.


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:1
+#include "../../lib/Format/Macros.h"
+#include "../../lib/Format/UnwrappedLineParser.h"
----------------
klimek wrote:
> sammccall wrote:
> > sammccall wrote:
> > > All of these tests use both the expander and unexpander (so need consider 
> > > behavior/bugs of both). 
> > > Is it possible to test the unexpander in isolation?
> > > 
> > > (Context: I'm trying to use the tests to understand what the class does, 
> > > but the inputs aren't visible)
> > Having read all the tests here, they seem to follow exactly the same schema:
> > 
> > 1. some macro definitions
> > 2. some sequence of expand() calls, simulating expanding some macro-heavy 
> > code
> > 3. verify the expanded token sequence, rearranging it into expected 
> > UnwrappedLine structure
> > 4. call unexpand() and check that the result has the expected UnwrappedLine 
> > structure
> > 
> > You do this using various fairly-general helpers and DSLs, but don't really 
> > combine them in different ways... the tests are somewhat readable and error 
> > messages are OK, but if these are important tests, it might be worth 
> > looking at a data-driven test. e.g.
> > 
> > ```
> > Case NestedChildBlocks;
> > NestedChildBlocks.Macros = {"ID(x)=x", "CALL(x)=f([] { x })"};
> > NestedChildBlocks.Original = "ID(CALL(CALL(return a * b;)))";
> > NestedChildBlocks.Expanded = R"cpp(
> > {
> > f([] {
> >   f([] {
> >     return a * b;
> >   })
> > })
> > }
> > )cpp"; // indentation shows structure
> > NestedChildBlocks.Unexpanded = R"cpp(
> > ID(
> >   {
> >   CALL(CALL(return a * b;))
> >   }
> > )
> > )cpp";
> > NestedChildBlocks.verify();
> > ```
> > 
> > Definitely involves a bit of reinventing wheels though.
> Not sure I understand what you mean - everything that's interesting about the 
> inputs is spelled out in the test - namely, what the structure of unwrapped 
> lines going into the unexpander is.
It seems like the main thing this does is replacing the structure how we build 
unwrapped lines with a DSL that gets parsed into unwrapped lines by 
indentation? I personally find that significantly less readable unless we'd 
create really good error messages if the indentation doesn't line up. In your 
example "Unexpanded" I have problems understanding exactly what goes into what 
unwrapped line intuitively - for example, {} can be one unwrapped line or 2 
different ones. How do we decide when an unwrapped line is finished?


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:28
+    auto *ID = Lex.id(Name);
+    auto UnexpandedLine = std::make_unique<UnwrappedLine>();
+    UnexpandedLine->Tokens.push_back(ID);
----------------
sammccall wrote:
> this name (and all the other "Unexpanded*" in this class) is misleading, 
> because there's a process called "unexpanding" but these aren't the output of 
> it.
> 
> I'd suggest "spelled", though I do think renaming the unexpander would also 
> be worthwhile.
This comment made it clear why Unwrapper is a really bad name, because it's 
also not about collapsing.


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:96
+
+  Chunk consume(const TokenList &Tokens) {
+    TokenList Result;
----------------
sammccall wrote:
> you always consume(lex(...)) - taking a StringRef directly instead would make 
> it clearer that the identity of the temporary tokens doesn't matter
Yeah, I thought that I want Matcher and the test to be more decoupled, but 
passing in the Lexer is not a big thing, so changed.


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:141
+
+  UnwrappedLine line(llvm::ArrayRef<Chunk> Chunks) {
+    UnwrappedLine Result;
----------------
sammccall wrote:
> FWIW, this seems confusing to me - line() has overloads that are simple, and 
> then this one that does the same nontrivial stitching that the production 
> code does.
> The fact that the a/b lines are not sibling sub-lines in `line({tokens("a"), 
> tokens("b")})`but they *are* sibling tokens in  `line(lex("a b"))` is hard to 
> keep track of in the tests.
> 
> If the stitching really is necessary, I think it's important for the expected 
> output to also be shown in its stitched form.
Would renaming tokens() to chunk() help with that? The idea is that in the test 
I mostly need to create a single line from chunks of tokens. Also happy to 
rename this function if it helps more?


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:155
+
+  Chunk tokens(llvm::StringRef Text) { return Chunk(lex(Text)); }
+
----------------
sammccall wrote:
> why is this "tokens" and not "chunk"?
My thought was that Chunk is just a type for a chunk of a line, and I can 
create one from a set of tokens (via tokens()) or from a set of child unwrapped 
lines (via children()).


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:207
+  Unexp.addLine(line(E.consume(lex("};"))));
+  EXPECT_TRUE(Unexp.finished());
+  Matcher U(Call);
----------------
sammccall wrote:
> you have lots of assertions that this is true, but none that it is false
Yeah, that's a white-box assertion - finished() is false the vast majority of 
the time, so testing the true cases is the important part - happy to add tests 
if you think it's worth it.


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:242
+  Expansion Exp1(Lex, *Macros);
+  TokenList Call1 = Exp1.expand("ID", {"a *b"});
+  // 2. expand ID({ a *b; })
----------------
sammccall wrote:
> the high-level point of this test seems to be that by looking the 
> post-expansion context, we can determine that b is a declared pointer so we 
> don't put a space before it.
> 
> And everywhere these tokens are mentioned/verified here, the spacing is 
> correct, as if it were propagated... but of course the spacing is actually 
> ignored everywhere and underneath it's just sequences of tokens.
> This makes it hard to track how well this it testing what it really wants to 
> test.
> 
> Is it possible to mark the asterisk with the correct tokentype at the point 
> where formatting would occur, and then verify that the unexpanded form (i.e. 
> `Chunk1.Tokens[1]`) still has the tokentype set?
> (It's probably possible to prove this by inspection by understanding what U1 
> contains, how Matcher works etc, but I think it'd still be illustrative)
I don't care about anything about the tokens than that they have pointer 
identity and that the same tokens end up in the right unwrapped lines in the 
result (thus match also checks token identity).


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:245
+  TokenList Arg;
+  Arg.push_back(Lex.id("{"));
+  Arg.append(Exp1.getTokens().begin(), Exp1.getTokens().end());
----------------
sammccall wrote:
> what does id() mean if not identifier?
Sigh, yeah, good point - initially this was used for identifiers, but it really 
means "lex exactly one token". Do you have a good name that is short (it's used 
really often) and descriptive?


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:406
+  Matcher E(Exp.getTokens());
+  Unexp.addLine(line(E.consume(lex("x;"))));
+  Unexp.addLine(line(E.consume(lex("x y"))));
----------------
sammccall wrote:
> because this example isn't realistic, it'd be nice to have the comment here 
> showing what formatting you're simulating
As before, formatting is (to me) not relevant here - what's relevant is that we 
don't crash and the resulting unwrapped lines contain the right tokens.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88299/new/

https://reviews.llvm.org/D88299

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D88299: [clang-format] Add MacroUnexpander.

Reply via email to