[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-23 Thread Aaron Ballman via Phabricator via cfe-commits
aaron.ballman added inline comments.



Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:119
+++End;
+  return llvm::makeArrayRef([Start], [End]);
+}

hokein wrote:
> aaron.ballman wrote:
> > This is causing an assertion with debug builds on Windows because 
> > `Actions[End]` is out of bounds (so the MSVC STL debug assertions catch the 
> > issue) for the test cases in this patch.
>  this should be `llvm::makeArrayRef([Start], End - Start)`. Fixed in 
> 302ca279cb83043ef7d60115eb5ba58f12064a4a.
I can confirm that the issue is now fixed for me, thank you!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-23 Thread Haojian Wu via Phabricator via cfe-commits
hokein added inline comments.



Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:119
+++End;
+  return llvm::makeArrayRef([Start], [End]);
+}

aaron.ballman wrote:
> This is causing an assertion with debug builds on Windows because 
> `Actions[End]` is out of bounds (so the MSVC STL debug assertions catch the 
> issue) for the test cases in this patch.
 this should be `llvm::makeArrayRef([Start], End - Start)`. Fixed in 
302ca279cb83043ef7d60115eb5ba58f12064a4a.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-23 Thread Aaron Ballman via Phabricator via cfe-commits
aaron.ballman added a comment.

In D118196#3341159 , @hokein wrote:

> In D118196#3341110 , @aaron.ballman 
> wrote:
>
>> Hi, this commit is causing runtime failures on Windows in debug builds. Can 
>> you please correct or revert? Thanks!
>
> sorry for the trouble. I'm mostly running out of time today, I will revert 
> this patch, and fix it tomorrow.

Thanks!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-23 Thread Haojian Wu via Phabricator via cfe-commits
hokein added a comment.

In D118196#3341110 , @aaron.ballman 
wrote:

> Hi, this commit is causing runtime failures on Windows in debug builds. Can 
> you please correct or revert? Thanks!

sorry for the trouble. I'm mostly running out of time today, I will revert this 
patch, and fix it tomorrow.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-23 Thread Aaron Ballman via Phabricator via cfe-commits
aaron.ballman added a comment.

Hi, this commit is causing runtime failures on Windows in debug builds. Can you 
please correct or revert? Thanks!




Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:119
+++End;
+  return llvm::makeArrayRef([Start], [End]);
+}

This is causing an assertion with debug builds on Windows because 
`Actions[End]` is out of bounds (so the MSVC STL debug assertions catch the 
issue) for the test cases in this patch.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-23 Thread Haojian Wu via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rGa2fab82f33bb: [pseudo] Implement LRTable. (authored by 
hokein).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

Files:
  clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
  clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
  clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp
  clang/lib/Tooling/Syntax/Pseudo/GrammarBNF.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp
  clang/test/Syntax/check-cxx-bnf.test
  clang/test/Syntax/lr-build-basic.test
  clang/test/Syntax/lr-build-conflicts.test
  clang/tools/clang-pseudo/ClangPseudo.cpp
  clang/unittests/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
  clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp

Index: clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp
===
--- /dev/null
+++ clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp
@@ -0,0 +1,56 @@
+//===--- LRTableTest.cpp - ---*- C++-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+
+#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
+#include "clang/Basic/TokenKinds.h"
+#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
+#include "gmock/gmock.h"
+#include "gtest/gtest.h"
+#include 
+
+namespace clang {
+namespace syntax {
+namespace pseudo {
+namespace {
+
+using testing::IsEmpty;
+using testing::UnorderedElementsAre;
+using Action = LRTable::Action;
+
+TEST(LRTable, Builder) {
+  GrammarTable GTable;
+
+  //   eof   semi  ...
+  // +---++---+---
+  // |state0 || s0,r0 |...
+  // |state1 | acc|   |...
+  // |state2 ||  r1   |...
+  // +---++---+---
+  std::vector Entries = {
+  {/* State */ 0, tokenSymbol(tok::semi), Action::shift(0)},
+  {/* State */ 0, tokenSymbol(tok::semi), Action::reduce(0)},
+  {/* State */ 1, tokenSymbol(tok::eof), Action::accept(2)},
+  {/* State */ 2, tokenSymbol(tok::semi), Action::reduce(1)}};
+  GrammarTable GT;
+  LRTable T = LRTable::buildForTests(GT, Entries);
+  EXPECT_THAT(T.find(0, tokenSymbol(tok::eof)), IsEmpty());
+  EXPECT_THAT(T.find(0, tokenSymbol(tok::semi)),
+  UnorderedElementsAre(Action::shift(0), Action::reduce(0)));
+  EXPECT_THAT(T.find(1, tokenSymbol(tok::eof)),
+  UnorderedElementsAre(Action::accept(2)));
+  EXPECT_THAT(T.find(1, tokenSymbol(tok::semi)), IsEmpty());
+  EXPECT_THAT(T.find(2, tokenSymbol(tok::semi)),
+  UnorderedElementsAre(Action::reduce(1)));
+  // Verify the behaivor for other non-available-actions terminals.
+  EXPECT_THAT(T.find(2, tokenSymbol(tok::kw_int)), IsEmpty());
+}
+
+} // namespace
+} // namespace pseudo
+} // namespace syntax
+} // namespace clang
Index: clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
===
--- clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
+++ /dev/null
@@ -1,84 +0,0 @@
-//===--- LRGraphTest.cpp - LRGraph tests -*- C++-*-===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===--===//
-
-#include "clang/Tooling/Syntax/Pseudo/LRGraph.h"
-#include "gmock/gmock.h"
-#include "gtest/gtest.h"
-#include 
-
-namespace clang {
-namespace syntax {
-namespace pseudo {
-namespace {
-
-TEST(LRGraph, Build) {
-  struct TestCase {
-llvm::StringRef BNF;
-llvm::StringRef ExpectedStates;
-  };
-
-  TestCase Cases[] = {{
-  R"bnf(
-_ := expr
-expr := IDENTIFIER
-  )bnf",
-  R"(States:
-State 0
-_ :=  • expr
-expr :=  • IDENTIFIER
-State 1
-_ := expr • 
-State 2
-expr := IDENTIFIER • 
-0 ->[expr] 1
-0 ->[IDENTIFIER] 2
-)"},
-  {// A grammar with a S/R conflict in SLR table:
-   // (id-id)-id, or id-(id-id).
-   R"bnf(
-_ := expr
-expr := expr - expr  # S/R conflict at state 4 on '-' token
-expr := IDENTIFIER
-  )bnf",
-   R"(States:
-State 0
-_ :=  • expr
-expr :=  • expr - expr
-expr :=  • IDENTIFIER
-State 1
-_ := expr • 
-expr := expr • - expr
-State 2
-expr := IDENTIFIER • 

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-23 Thread Haojian Wu via Phabricator via cfe-commits
hokein updated this revision to Diff 410725.
hokein marked 3 inline comments as done.
hokein added a comment.

address comments


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

Files:
  clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
  clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
  clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp
  clang/lib/Tooling/Syntax/Pseudo/GrammarBNF.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp
  clang/test/Syntax/check-cxx-bnf.test
  clang/test/Syntax/lr-build-basic.test
  clang/test/Syntax/lr-build-conflicts.test
  clang/tools/clang-pseudo/ClangPseudo.cpp
  clang/unittests/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
  clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp

Index: clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp
===
--- /dev/null
+++ clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp
@@ -0,0 +1,56 @@
+//===--- LRTableTest.cpp - ---*- C++-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+
+#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
+#include "clang/Basic/TokenKinds.h"
+#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
+#include "gmock/gmock.h"
+#include "gtest/gtest.h"
+#include 
+
+namespace clang {
+namespace syntax {
+namespace pseudo {
+namespace {
+
+using testing::IsEmpty;
+using testing::UnorderedElementsAre;
+using Action = LRTable::Action;
+
+TEST(LRTable, Builder) {
+  GrammarTable GTable;
+
+  //   eof   semi  ...
+  // +---++---+---
+  // |state0 || s0,r0 |...
+  // |state1 | acc|   |...
+  // |state2 ||  r1   |...
+  // +---++---+---
+  std::vector Entries = {
+  {/* State */ 0, tokenSymbol(tok::semi), Action::shift(0)},
+  {/* State */ 0, tokenSymbol(tok::semi), Action::reduce(0)},
+  {/* State */ 1, tokenSymbol(tok::eof), Action::accept(2)},
+  {/* State */ 2, tokenSymbol(tok::semi), Action::reduce(1)}};
+  GrammarTable GT;
+  LRTable T = LRTable::buildForTests(GT, Entries);
+  EXPECT_THAT(T.find(0, tokenSymbol(tok::eof)), IsEmpty());
+  EXPECT_THAT(T.find(0, tokenSymbol(tok::semi)),
+  UnorderedElementsAre(Action::shift(0), Action::reduce(0)));
+  EXPECT_THAT(T.find(1, tokenSymbol(tok::eof)),
+  UnorderedElementsAre(Action::accept(2)));
+  EXPECT_THAT(T.find(1, tokenSymbol(tok::semi)), IsEmpty());
+  EXPECT_THAT(T.find(2, tokenSymbol(tok::semi)),
+  UnorderedElementsAre(Action::reduce(1)));
+  // Verify the behaivor for other non-available-actions terminals.
+  EXPECT_THAT(T.find(2, tokenSymbol(tok::kw_int)), IsEmpty());
+}
+
+} // namespace
+} // namespace pseudo
+} // namespace syntax
+} // namespace clang
Index: clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
===
--- clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
+++ /dev/null
@@ -1,84 +0,0 @@
-//===--- LRGraphTest.cpp - LRGraph tests -*- C++-*-===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===--===//
-
-#include "clang/Tooling/Syntax/Pseudo/LRGraph.h"
-#include "gmock/gmock.h"
-#include "gtest/gtest.h"
-#include 
-
-namespace clang {
-namespace syntax {
-namespace pseudo {
-namespace {
-
-TEST(LRGraph, Build) {
-  struct TestCase {
-llvm::StringRef BNF;
-llvm::StringRef ExpectedStates;
-  };
-
-  TestCase Cases[] = {{
-  R"bnf(
-_ := expr
-expr := IDENTIFIER
-  )bnf",
-  R"(States:
-State 0
-_ :=  • expr
-expr :=  • IDENTIFIER
-State 1
-_ := expr • 
-State 2
-expr := IDENTIFIER • 
-0 ->[expr] 1
-0 ->[IDENTIFIER] 2
-)"},
-  {// A grammar with a S/R conflict in SLR table:
-   // (id-id)-id, or id-(id-id).
-   R"bnf(
-_ := expr
-expr := expr - expr  # S/R conflict at state 4 on '-' token
-expr := IDENTIFIER
-  )bnf",
-   R"(States:
-State 0
-_ :=  • expr
-expr :=  • expr - expr
-expr :=  • IDENTIFIER
-State 1
-_ := expr • 
-expr := expr • - expr
-State 2
-expr := IDENTIFIER • 
-State 3
-expr :=  • expr - expr
-expr := expr - • expr
-expr :=  • 

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-21 Thread Sam McCall via Phabricator via cfe-commits
sammccall accepted this revision.
sammccall added a comment.
This revision is now accepted and ready to land.

Thanks, I really like the way the tests look now!




Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:78
+  Shift,
+  // Reduce by a rule, the value is a ruleID.
+  Reduce,

// Pops 



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:80
+  Reduce,
+  // Signals that we have parsd the input successfully.
+  Accept,

parsd -> parsed



Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:28
+  case LRTable::Action::GoTo:
+return OS << llvm::formatv("goTo state {0}", A.getGoToState());
+  case LRTable::Action::Accept:

nit: goTo -> go to, like dumpForTests?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-18 Thread Haojian Wu via Phabricator via cfe-commits
hokein added inline comments.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:177
 };
+void initTerminals(std::vector );
 

sammccall wrote:
> why is this exposed/required rather than being initialized by the 
> GrammarTable constructor?
> Since this is essentially static (must always correspond to tok::TokenKind) 
> it seems that GrammarTable could just have an ArrayRef and it could be 
> initialized by a lazy-init singleton:
> 
> ```
> // in Grammar.cpp
> static ArrayRef getTerminalNames() {
>   static std::vector *TerminalNames = []{
> 
>   };
>   return *TerminalNames;
> }
> ```
> 
> (I know eventually we'd like GrammarTable to be generated and have very 
> minimal dynamic init, but there are lots of other details needed e.g. we 
> can't statically initialize `vector` either, so we should cross that 
> bridge when we come to it)
fair enough, let's hide it in a GrammarTable contructor.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:68
+  // Terminal actions, corresponding to entries of ACTION table.
+  Error = 0,
+  // Shift to state n: move forward with the lookahead, and push state n

sammccall wrote:
> Error seems like an action we'd dynamically handle, rather than an 
> unreachable sentinel.
> 
> I'd prefer to call this sentinel and llvm_unreachable() on it in the relevant 
> places. Even if we do plan dynamic error actions, we have enough spare bits 
> to keep these two cases separate.
changed to Sentinel, and remove Error action for now.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:80
+  // Nonterminal actions, corresponding to entry of GOTO table.
+  // Go to state n: pust state n onto the state stack.
+  // Similar to Shift, but it is a nonterminal forward transition.

sammccall wrote:
> sammccall wrote:
> > again blank line between "nonterminal actions" and "go to state n"
> pust -> push?
> 
> I thought goto replaces the top of the stack rather than being pushed onto it.
> 
> e.g.
> stack is { stmt := . expr ; }, lookahead is IDENT, action is shift "expr := 
> IDENT ."
> stack is { stmt := . expr ; | expr := IDENT . }, lookahead is semi, action is 
> reduce
> stack is { stmt := . expr ; }, reduced symbol is expr, goto is state "stmt := 
> expr . ;"
> stack is { stmt := expr . ;}, lookahead is semi...
> 
> Line 3=>4 doesn't seem like a push
>  
> 
yeah, this is mostly right -- stack in step 2 has two frames, rather than a 
combined one.

The truth is:

1. stack is [{ _ := . stmt | stmt := . expr ; }], lookahead is IDENT, action is 
shift "expr := IDENT ." (push a new state to the stack)
2. stack is [{ _ := . stmt | stmt := . expr ; }, { expr := IDENT .  }], 
lookahead is semi, action is reduce
3. stack is [{ _ := . stmt | stmt := . expr ; }], reduced symbol is expr, goto 
is state "stmt := expr . ;"
4. stack is [{ _ := . stmt | stmt := . expr ;}, { stmt := expr . ;}], lookahead 
is semi, action is shift "stmt := expr ; ."
5. stack is  [{ _ := . stmt | stmt := . expr ;}, { stmt := expr . ;}, { stmt := 
expr ; .}], lookahead is eof, action is reduce "stmt := expr ;"
6. stack is [{ _ := . stmt  | stmt := . expr ;}], reduced symbol is stmt, goto 
state is "_ := stmt ."
7. stack is [{ _ := . stmt | stmt := . expr ;}, { _ := stmt .}], lookahead is 
eof, action is accept, the parsing is done

Step 3 => 4 is a push. 



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:106
+bool operator==(const Action ) const { return value() == L.value(); }
+uint16_t value() const { return K << ValueBits | Value; };
+

sammccall wrote:
> maybe value()=>opaque() or asInteger()?
> 
> Partly to give a hint a bit more at the purpose, partly to avoid confusion of 
> `Value != value()`
renamed to `opaque`.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:136
+
+  class Builder {
+  public:

sammccall wrote:
> This is quite a lot of surface (together with the DenseMapInfo stuff) to 
> expose.
> Currently it looks like it needs to be in the header so that the LRTableTest 
> can construct a table directly rather than going through LRGraph.
> 
> I think you could expose a narrower interface:
> ```
> struct Entry { ;... };
> // Specify exact table contents for testing.
> static LRTable buildForTests(ArrayRef);
> ```
> 
> Then the builder/densemapinfo can be moved to the cpp file, and both 
> `buildForTests` and `buildSLR` can use the builder internally. WDYT?
Yeah, I exposed the Builder mainly for testing purposes. The new interface 
looks good to me. 



Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:97
+  auto Result = find(State, Nonterminal);
+  assert(Result.size() == 1 && Result.front().kind() == Action::GoTo);
+  return Result.front().getGoToState();

sammccall wrote:
> (moving the comment thread to 

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-18 Thread Haojian Wu via Phabricator via cfe-commits
hokein updated this revision to Diff 409893.
hokein added a comment.

update


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

Files:
  clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
  clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
  clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp
  clang/lib/Tooling/Syntax/Pseudo/GrammarBNF.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp
  clang/test/Syntax/check-cxx-bnf.test
  clang/test/Syntax/lr-build-basic.test
  clang/test/Syntax/lr-build-conflicts.test
  clang/tools/clang-pseudo/ClangPseudo.cpp
  clang/unittests/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
  clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp

Index: clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp
===
--- /dev/null
+++ clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp
@@ -0,0 +1,56 @@
+//===--- LRTableTest.cpp - ---*- C++-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+
+#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
+#include "clang/Basic/TokenKinds.h"
+#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
+#include "gmock/gmock.h"
+#include "gtest/gtest.h"
+#include 
+
+namespace clang {
+namespace syntax {
+namespace pseudo {
+namespace {
+
+using testing::IsEmpty;
+using testing::UnorderedElementsAre;
+using Action = LRTable::Action;
+
+TEST(LRTable, Builder) {
+  GrammarTable GTable;
+
+  //   eof   semi  ...
+  // +---++---+---
+  // |state0 || s0,r0 |...
+  // |state1 | acc|   |...
+  // |state2 ||  r1   |...
+  // +---++---+---
+  std::vector Entries = {
+  {/* State */ 0, tokenSymbol(tok::semi), Action::shift(0)},
+  {/* State */ 0, tokenSymbol(tok::semi), Action::reduce(0)},
+  {/* State */ 1, tokenSymbol(tok::eof), Action::accept(2)},
+  {/* State */ 2, tokenSymbol(tok::semi), Action::reduce(1)}};
+  GrammarTable GT;
+  LRTable T = LRTable::buildForTests(GT, Entries);
+  EXPECT_THAT(T.find(0, tokenSymbol(tok::eof)), IsEmpty());
+  EXPECT_THAT(T.find(0, tokenSymbol(tok::semi)),
+  UnorderedElementsAre(Action::shift(0), Action::reduce(0)));
+  EXPECT_THAT(T.find(1, tokenSymbol(tok::eof)),
+  UnorderedElementsAre(Action::accept(2)));
+  EXPECT_THAT(T.find(1, tokenSymbol(tok::semi)), IsEmpty());
+  EXPECT_THAT(T.find(2, tokenSymbol(tok::semi)),
+  UnorderedElementsAre(Action::reduce(1)));
+  // Verify the behaivor for other non-available-actions terminals.
+  EXPECT_THAT(T.find(2, tokenSymbol(tok::kw_int)), IsEmpty());
+}
+
+} // namespace
+} // namespace pseudo
+} // namespace syntax
+} // namespace clang
Index: clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
===
--- clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
+++ /dev/null
@@ -1,84 +0,0 @@
-//===--- LRGraphTest.cpp - LRGraph tests -*- C++-*-===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===--===//
-
-#include "clang/Tooling/Syntax/Pseudo/LRGraph.h"
-#include "gmock/gmock.h"
-#include "gtest/gtest.h"
-#include 
-
-namespace clang {
-namespace syntax {
-namespace pseudo {
-namespace {
-
-TEST(LRGraph, Build) {
-  struct TestCase {
-llvm::StringRef BNF;
-llvm::StringRef ExpectedStates;
-  };
-
-  TestCase Cases[] = {{
-  R"bnf(
-_ := expr
-expr := IDENTIFIER
-  )bnf",
-  R"(States:
-State 0
-_ :=  • expr
-expr :=  • IDENTIFIER
-State 1
-_ := expr • 
-State 2
-expr := IDENTIFIER • 
-0 ->[expr] 1
-0 ->[IDENTIFIER] 2
-)"},
-  {// A grammar with a S/R conflict in SLR table:
-   // (id-id)-id, or id-(id-id).
-   R"bnf(
-_ := expr
-expr := expr - expr  # S/R conflict at state 4 on '-' token
-expr := IDENTIFIER
-  )bnf",
-   R"(States:
-State 0
-_ :=  • expr
-expr :=  • expr - expr
-expr :=  • IDENTIFIER
-State 1
-_ := expr • 
-expr := expr • - expr
-State 2
-expr := IDENTIFIER • 
-State 3
-expr :=  • expr - expr
-expr := expr - • expr
-expr :=  • IDENTIFIER
-State 4
-expr := expr - expr • 
- 

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-18 Thread Haojian Wu via Phabricator via cfe-commits
hokein updated this revision to Diff 409892.
hokein marked 14 inline comments as done.
hokein added a comment.

- address comments
- more api polishments
- LRGraph unittest => lit tests
- hide LRTable::Builder, move to LRTableBuild.cpp


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

Files:
  clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
  clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
  clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp
  clang/lib/Tooling/Syntax/Pseudo/GrammarBNF.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp
  clang/test/Syntax/check-cxx-bnf.test
  clang/test/Syntax/lr-build-basic.test
  clang/test/Syntax/lr-build-conflicts.test
  clang/tools/clang-pseudo/ClangPseudo.cpp
  clang/unittests/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
  clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp

Index: clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp
===
--- /dev/null
+++ clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp
@@ -0,0 +1,56 @@
+//===--- LRTableTest.cpp - ---*- C++-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+
+#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
+#include "clang/Basic/TokenKinds.h"
+#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
+#include "gmock/gmock.h"
+#include "gtest/gtest.h"
+#include 
+
+namespace clang {
+namespace syntax {
+namespace pseudo {
+namespace {
+
+using testing::IsEmpty;
+using testing::UnorderedElementsAre;
+using Action = LRTable::Action;
+
+TEST(LRTable, Builder) {
+  GrammarTable GTable;
+
+  //   eof   semi  ...
+  // +---++---+---
+  // |state0 || s0,r0 |...
+  // |state1 | acc|   |...
+  // |state2 ||  r1   |...
+  // +---++---+---
+  std::vector Entries = {
+  {/* State */ 0, tokenSymbol(tok::semi), Action::shift(0)},
+  {/* State */ 0, tokenSymbol(tok::semi), Action::reduce(0)},
+  {/* State */ 1, tokenSymbol(tok::eof), Action::accept(2)},
+  {/* State */ 2, tokenSymbol(tok::semi), Action::reduce(1)}};
+  GrammarTable GT;
+  LRTable T = LRTable::buildForTests(GT, Entries);
+  EXPECT_THAT(T.find(0, tokenSymbol(tok::eof)), IsEmpty());
+  EXPECT_THAT(T.find(0, tokenSymbol(tok::semi)),
+  UnorderedElementsAre(Action::shift(0), Action::reduce(0)));
+  EXPECT_THAT(T.find(1, tokenSymbol(tok::eof)),
+  UnorderedElementsAre(Action::accept(2)));
+  EXPECT_THAT(T.find(1, tokenSymbol(tok::semi)), IsEmpty());
+  EXPECT_THAT(T.find(2, tokenSymbol(tok::semi)),
+  UnorderedElementsAre(Action::reduce(1)));
+  // Verify the behaivor for other non-available-actions terminals.
+  EXPECT_THAT(T.find(2, tokenSymbol(tok::kw_int)), IsEmpty());
+}
+
+} // namespace
+} // namespace pseudo
+} // namespace syntax
+} // namespace clang
Index: clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
===
--- clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
+++ /dev/null
@@ -1,84 +0,0 @@
-//===--- LRGraphTest.cpp - LRGraph tests -*- C++-*-===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===--===//
-
-#include "clang/Tooling/Syntax/Pseudo/LRGraph.h"
-#include "gmock/gmock.h"
-#include "gtest/gtest.h"
-#include 
-
-namespace clang {
-namespace syntax {
-namespace pseudo {
-namespace {
-
-TEST(LRGraph, Build) {
-  struct TestCase {
-llvm::StringRef BNF;
-llvm::StringRef ExpectedStates;
-  };
-
-  TestCase Cases[] = {{
-  R"bnf(
-_ := expr
-expr := IDENTIFIER
-  )bnf",
-  R"(States:
-State 0
-_ :=  • expr
-expr :=  • IDENTIFIER
-State 1
-_ := expr • 
-State 2
-expr := IDENTIFIER • 
-0 ->[expr] 1
-0 ->[IDENTIFIER] 2
-)"},
-  {// A grammar with a S/R conflict in SLR table:
-   // (id-id)-id, or id-(id-id).
-   R"bnf(
-_ := expr
-expr := expr - expr  # S/R conflict at state 4 on '-' token
-expr := IDENTIFIER
-  )bnf",
-   R"(States:
-State 0
-_ :=  • expr
-expr :=  • expr - expr
-expr :=  • IDENTIFIER
-State 1
-_ := expr • 
-expr := expr • - expr
-State 2
-

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-11 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a comment.

This is really nice, mostly nits.




Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:177
 };
+void initTerminals(std::vector );
 

why is this exposed/required rather than being initialized by the GrammarTable 
constructor?
Since this is essentially static (must always correspond to tok::TokenKind) it 
seems that GrammarTable could just have an ArrayRef and it could be initialized 
by a lazy-init singleton:

```
// in Grammar.cpp
static ArrayRef getTerminalNames() {
  static std::vector *TerminalNames = []{

  };
  return *TerminalNames;
}
```

(I know eventually we'd like GrammarTable to be generated and have very minimal 
dynamic init, but there are lots of other details needed e.g. we can't 
statically initialize `vector` either, so we should cross that bridge 
when we come to it)



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:28
+//  LRTable is *performance-critial* as it is consulted frequently during a
+//  parse. In general, LRTable is very sprase (most of the entries are empty).
+//  For example, for the C++ language, the SLR table has ~1500 states and 650

space -> sparse



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:58
+// Unlike A typical LR parsing table allows at most one available action per
+// entry, conflicted actions are allowed in LRTable.
+class LRTable {

Maybe mention relation to GLR: GLR can execute these in parallel?



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:67
+enum Kind : uint8_t {
+  // Terminal actions, corresponding to entries of ACTION table.
+  Error = 0,

nit: I think you want this to be a comment on the section rather than on Error, 
leave a blank line?



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:68
+  // Terminal actions, corresponding to entries of ACTION table.
+  Error = 0,
+  // Shift to state n: move forward with the lookahead, and push state n

Error seems like an action we'd dynamically handle, rather than an unreachable 
sentinel.

I'd prefer to call this sentinel and llvm_unreachable() on it in the relevant 
places. Even if we do plan dynamic error actions, we have enough spare bits to 
keep these two cases separate.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:80
+  // Nonterminal actions, corresponding to entry of GOTO table.
+  // Go to state n: pust state n onto the state stack.
+  // Similar to Shift, but it is a nonterminal forward transition.

again blank line between "nonterminal actions" and "go to state n"



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:80
+  // Nonterminal actions, corresponding to entry of GOTO table.
+  // Go to state n: pust state n onto the state stack.
+  // Similar to Shift, but it is a nonterminal forward transition.

sammccall wrote:
> again blank line between "nonterminal actions" and "go to state n"
pust -> push?

I thought goto replaces the top of the stack rather than being pushed onto it.

e.g.
stack is { stmt := . expr ; }, lookahead is IDENT, action is shift "expr := 
IDENT ."
stack is { stmt := . expr ; | expr := IDENT . }, lookahead is semi, action is 
reduce
stack is { stmt := . expr ; }, reduced symbol is expr, goto is state "stmt := 
expr . ;"
stack is { stmt := expr . ;}, lookahead is semi...

Line 3=>4 doesn't seem like a push
 




Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:106
+bool operator==(const Action ) const { return value() == L.value(); }
+uint16_t value() const { return K << ValueBits | Value; };
+

maybe value()=>opaque() or asInteger()?

Partly to give a hint a bit more at the purpose, partly to avoid confusion of 
`Value != value()`



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:114
+// Either StateID or RuleID, depending on the Kind.
+uint16_t Value : ValueBits;
+  };

static assert RuleBits <= ValueBits, and I think we want a StateBits above plus 
some assertions on the number of states when building



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:114
+// Either StateID or RuleID, depending on the Kind.
+uint16_t Value : ValueBits;
+  };

sammccall wrote:
> static assert RuleBits <= ValueBits, and I think we want a StateBits above 
> plus some assertions on the number of states when building
FWIW I've mostly seen just `unsigned` as the field type for bitfields. It seems 
a little confusing to explicitly specify the size both as 16 and 13 bits.

Up to you, but if you change this one also consider the Kind enum.



Comment at: 

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-11 Thread Haojian Wu via Phabricator via cfe-commits
hokein added inline comments.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:84
+
+Kind K = Kind::Error;
+// Value

sammccall wrote:
> This action struct can be squeezed to 16 bits.
> 
> Kind only needs 3 bits (I suspect it can be 2: Error and Accept aren't 
> heavily used).
> RuleID is 12 bits, StateID would fit within 11 and 12 should be safe.
> 
> Combined with the optimization suggested below, this should get the whole 
> table down to  335KB.
squeezed to 16 bits (3 bits for Kind, and 13 bits for the Value).



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:105
+auto Goto = find({From, NonTerminal});
+assert(Goto.size() == 1 && Goto.front().isGoTo());
+return Goto.front().goTo();

sammccall wrote:
> Why is this assertion valid? Is it a precondition of the function that some 
> item in From can advance over NonTerminal?
This is guaranteed by a DFA (a node can not have two out edges with a same 
label), the same reason why we don't have shift/shift conflicts. 



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:121
+private:
+  struct IndexKey {
+StateID From;

sammccall wrote:
> I think there's an even more compact representation:
> Basically sort the index keys as symbol-major, but then don't store the 
> actual symbol there but rather store the range for the symbol in an external 
> array:
> 
> ```
> // index is lookahead tok::TokenKind. Values are indices into Index: the 
> range for tok is StateIdx[TokenIdx[tok]...TokenIdx[tok+1]]
> vector TokenIdx;
> 
> // index is nonterminal SymbolID. Values are  indices into Index: the range 
> for sym is StateIdx[NontermIdx[sym]...NontermIdx[sym+1]].
> vector NontermIdx;
> 
> // parallel to actions, value is the from state. Only subranges are sorted.
> vector StateIdx;
> 
> vector Actions;
> ```
> 
> This should be 4 * (392 + 245) + (2 + 4) * 83k ~= 500k. And the one extra 
> indirection should save 5-10 steps of binary search.
> 
> This is equivalent to 2 * `vector` + `vector Action>>` but keeping the searched index compact is probably a good thing.
This looks like a nice improvement, though the implementation is a little 
tricky -- we reduced the binary-search space from 2 dimension (State, Symbol) 
to 1 dimension Symbol.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-11 Thread Haojian Wu via Phabricator via cfe-commits
hokein updated this revision to Diff 407862.
hokein marked 10 inline comments as done.
hokein added a comment.
Herald added a subscriber: mgrang.

- rebase, rescope the patch to LRTable
- refine the Action class interfaces, and nameing;
- use a more compact Index, reduce LRTable from 660KB => 335KB;
- address review comments;


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

Files:
  clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
  clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
  clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp
  clang/lib/Tooling/Syntax/Pseudo/GrammarBNF.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp
  clang/unittests/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
  clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp

Index: clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp
===
--- /dev/null
+++ clang/unittests/Tooling/Syntax/Pseudo/LRTableTest.cpp
@@ -0,0 +1,57 @@
+//===--- LRTableTest.cpp - ---*- C++-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+
+#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
+#include "clang/Basic/TokenKinds.h"
+#include "clang/Tooling/Syntax/Pseudo/Grammar.h"
+#include "gmock/gmock.h"
+#include "gtest/gtest.h"
+#include 
+
+namespace clang {
+namespace syntax {
+namespace pseudo {
+namespace {
+
+using testing::IsEmpty;
+using testing::UnorderedElementsAre;
+using Action = LRTable::Action;
+
+TEST(LRTable, Builder) {
+  LRTable::Builder Build;
+  GrammarTable GTable;
+  initTerminals(GTable.Terminals);
+
+  //   eof   semi  ...
+  // +---++---+---
+  // |state0 || s0,r0 |...
+  // |state1 | acc|   |...
+  // |state2 ||  r1   |...
+  // +---++---+---
+  Build.insert(/* State */ 0, tokenSymbol(tok::semi), Action::shift(0));
+  Build.insert(/* State */ 0, tokenSymbol(tok::semi), Action::reduce(0));
+  Build.insert(/* State */ 1, tokenSymbol(tok::eof), Action::accept());
+  Build.insert(/* State */ 2, tokenSymbol(tok::semi), Action::reduce(1));
+
+  LRTable T = std::move(Build).build(GTable);
+  EXPECT_THAT(T.find(0, tokenSymbol(tok::eof)), IsEmpty());
+  EXPECT_THAT(T.find(0, tokenSymbol(tok::semi)),
+  UnorderedElementsAre(Action::shift(0), Action::reduce(0)));
+  EXPECT_THAT(T.find(1, tokenSymbol(tok::eof)),
+  UnorderedElementsAre(Action::accept()));
+  EXPECT_THAT(T.find(1, tokenSymbol(tok::semi)), IsEmpty());
+  EXPECT_THAT(T.find(2, tokenSymbol(tok::semi)),
+  UnorderedElementsAre(Action::reduce(1)));
+  // Verify the behaivor for other non-available-actions terminals.
+  EXPECT_THAT(T.find(2, tokenSymbol(tok::kw_int)), IsEmpty());
+}
+
+} // namespace
+} // namespace pseudo
+} // namespace syntax
+} // namespace clang
Index: clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
===
--- clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
+++ clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp
@@ -7,6 +7,7 @@
 //===--===//
 
 #include "clang/Tooling/Syntax/Pseudo/LRGraph.h"
+#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
 #include "gmock/gmock.h"
 #include "gtest/gtest.h"
 #include 
@@ -20,14 +21,16 @@
   struct TestCase {
 llvm::StringRef BNF;
 llvm::StringRef ExpectedStates;
+llvm::StringRef ExpectedTables;
   };
 
-  TestCase Cases[] = {{
-  R"bnf(
+  TestCase Cases[] = {
+  {
+  R"bnf(
 _ := expr
 expr := IDENTIFIER
   )bnf",
-  R"(States:
+  R"(States:
 State 0
 _ :=  • expr
 expr :=  • IDENTIFIER
@@ -37,15 +40,25 @@
 expr := IDENTIFIER • 
 0 ->[expr] 1
 0 ->[IDENTIFIER] 2
-)"},
-  {// A grammar with a S/R conflict in SLR table:
-   // (id-id)-id, or id-(id-id).
-   R"bnf(
+)",
+  R"(LRTable:
+State 0
+'IDENTIFIER': shift state 2
+'expr': go to state 1
+State 1
+'EOF': accept
+State 2
+'EOF': reduce by rule 1 'expr := IDENTIFIER'
+)",
+  },
+  {// A grammar with a S/R conflict in SLR table:
+   // (id-id)-id, or id-(id-id).
+   R"bnf(
 _ := expr
 expr := expr - expr  # S/R conflict at state 4 on '-' token
 expr := IDENTIFIER
-  )bnf",
-   R"(States:
+ )bnf",
+   R"(States:
 State 0
 _ :=  • expr
 expr :=  

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-07 Thread Haojian Wu via Phabricator via cfe-commits
hokein marked 11 inline comments as done.
hokein added a comment.

comments around LRAutomaton should be addressed in the separate patch, 
https://reviews.llvm.org/D119172.




Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:32
+public:
+  Item(RuleID ID, uint8_t RuleLength, uint8_t DotPos)
+  : RID(ID), RuleLength(RuleLength), DotPos(DotPos) {}

sammccall wrote:
> accepting the grammar as a parameter rather than the rule length would let us 
> ensure the length is correct locally rather than relying on the caller.
> Essentially all the callers need to do a lookup anyway.
right, but we still need this one, for constructing an empty/Tombstone items 
for DenseMapInfo.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:66
+
+using ItemSet = std::set;
+// A state of the LR automaton is an item set, which contains all possible 
rules

sammccall wrote:
> why is std::set used here? (generally a data structure to avoid)
the main reason to we want a deterministic order for computing the hash value. 
Changed to a sorted vector.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:79
+
+// A deterministic finite state automaton for LR parsing.
+//

sammccall wrote:
> Hmm, isn't the point of GLR that it's a nondeterministic automaton?
> 
> We've done the LR-style NFA->DFA conversion, but it's not complete (when we 
> hit a conflict we continue rather than giving up)
yes and no. There are a few conceptually difference here.

GLR is indeed a nondeterministic parser, but I think the term 
deterministic/nondeterministic in parser is different than the ones used in 
automaton context. 
A deterministic parser (typical LR) has a property that there will always be 
only one possible action to choose from, thus it is fast (linear), and it 
produces only one syntax tree;
A deterministic automaton is described 
[here](https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton).

> We've done the LR-style NFA->DFA conversion, but it's not complete (when we 
> hit a conflict we continue rather than giving up)

Not exactly, there is no NFA->DFA conversion here. What this patch does:
1. we constructs a deterministic LR(0) automaton (DFA) directly, this is what 
LRAutomaton represents;
2. based on the DFA, we construct an LR table (called Action and GoTo Table in 
standard LR). In particular, we use the FollowSet to determine reduce actions, 
it is called SLR(1) table which is powerful the the LR(0) table.

A conflict in the LRTable doesn't imply an incomplete DFA. For an arbitrary 
grammar (no matter it is ambiguous or not), we can always construct a DFA. For 
example, based on the same LR(0) automaton, we could build a SLR(1) table 
(without conflicts) or LR(0) table (with conflicts). 
The real problem lies in the interpretation of states of the automaton, 
considering a node in the automaton with state 0
```
E := T. + E
E := T.
```
and the node as a "+" outer edge, if the next symbol is `+`, we have two 
options "reduce E := T" or shift `+`. SLR can tell reduction is not available 
`+` is not in the FOLLOW(E) (thus no conflict) while LR(0) will accept both 
(thus conflicts). 

I added some bits in the file comment of the LRGraph.h, hope it can clarify 
these concepts.



Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:109
+  auto Hash = hash_code(S);
+  auto It = StatesIndex.find(Hash);
+  if (It != StatesIndex.end())

sammccall wrote:
> using hashcode instead of the key itself should be explained.
> (If it's just an issue with hash of std::set, I think using std::vector fixes 
> it)
using the ItemSet as key seems heavy (for cxx.grammar, 65KB for 
`llvm::DenseMap` vs 32KB for `llvm::DenseMap`) and 
unnecessary (we don't need to access the Itemset).

Added comments.



Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:144
+   "augmented symbol rule must be _ := start_symbol.");
+auto StartState = closure({{{RID, Rule.size(), 0}}}, G);
+auto Result = Builder.insert(std::move(StartState));

sammccall wrote:
> nit: too many braces - please spell out some of the types
> (and Auto should probably be State here?)
> too many braces - please spell out some of the types
this is one of cons of using inlay-hints :(


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-04 Thread Haojian Wu via Phabricator via cfe-commits
hokein marked 7 inline comments as done.
hokein added inline comments.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:130
+
+  // Returns the SymbolID of the augmented symbol '_'
+  SymbolID augmentedSymbol() const { return 0; }

sammccall wrote:
> this "augmented symbol" name doesn't seem widely used. (search for `lr 
> "augmented symbol"` produces few results, mostly unrelated). It's also a 
> confusing name, because it's IIUC it's not the symbol that's augmented, but 
> rather the grammar has been augmented with the symbol...
> 
> I'd suggest just calling it the start symbol, unless the distinction is 
> critical (I don't yet understand why it's important - I added it in the 
> prototype just to avoid having the language-specific name "translation-unit" 
> hardcoded).
> 
> Otherwise it's hard to suggest a good name without understanding what it's 
> for, but it should be descriptive...
yeah, it is about augmented grammar. Strictly speaking, it is a start symbol of 
augmented symbol. Maybe we should not speak augmented stuff in the code, it is 
not a well-known term, and causes confusion. rename it to start symbol instead.

It looks like using augmented grammar is a "standard" LR technique,  it 
introduces a converged start state and accepting state in the LR graph, which 
makes it easier for the implementation. And it is a good fit for support 
multiple start symbols.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:133
+
+  // Computes the FIRST(X) for all non-terminal symbol X.
+  std::vector> firstSets() const;

sammccall wrote:
> sammccall wrote:
> > In about the same number of words you could explain what this actually is 
> > :-)
> > 
> > // Computes the set of terminals that could appear at the start of each 
> > non-terminal.
> is there some reason these functions need to be part of the grammar class?
> 
> They seem like part of the LR building to me, it could be local to 
> LRBuilder.cpp, or exposed from LRTable.h or so for testing
as discussed, made them as free functions in Grammar.h. Separate the first and 
follow set changes to https://reviews.llvm.org/D118990, comments around them 
should be addressed there.



Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:88
+for (const auto  : T->Rules) {
+  // Rule 2: for a rule X := ...Yz, we add all symbols from FIRST(z) to
+  // FOLLOW(Y).

sammccall wrote:
> nit: `... Y Z`, with consistent spaces/capitalization
> Unless case is meant to imply something here?
Y was for nonterminal while z was for nonterminal or terminals. They are not 
super important, changed all to Uppercase.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-03 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a comment.

OK, I think I understand it now! Sorry for so many comments, happy to chat 
offline about which ones to address/where to start.




Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:39
+public:
+  struct Action {
+enum class Kind : uint8_t {

API polish: this class is a mix of accessors and direct member state - it can 
be hard to know how to best interact with such a thing.

I'd suggest:
 - public factories `static Action Action::reduce(RuleID)` etc, private data & 
constructor
 - expose `Kind kind()`, but not the individual is() methods: they don't buy 
much and this pattern encourages switch where appropriate
 - make `Kind` enum rather than `enum class` - `Action::Shift` is sufficiently 
scoped already
 - shift() & goTo() -> `targetState()` which asserts either condition. Apart 
from anything else, this makes naming easier: "shift" is a verb so `shift()` is 
a confusing name.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:39
+public:
+  struct Action {
+enum class Kind : uint8_t {

sammccall wrote:
> API polish: this class is a mix of accessors and direct member state - it can 
> be hard to know how to best interact with such a thing.
> 
> I'd suggest:
>  - public factories `static Action Action::reduce(RuleID)` etc, private data 
> & constructor
>  - expose `Kind kind()`, but not the individual is() methods: they don't buy 
> much and this pattern encourages switch where appropriate
>  - make `Kind` enum rather than `enum class` - `Action::Shift` is 
> sufficiently scoped already
>  - shift() & goTo() -> `targetState()` which asserts either condition. Apart 
> from anything else, this makes naming easier: "shift" is a verb so `shift()` 
> is a confusing name.
mention that this combines the Action and Go-to tables from LR literature?



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:41
+enum class Kind : uint8_t {
+  // Action table
+  Error = 0,

I think it's worth documenting what each of these actions means.

Referring to enum values as "action table" and "goto table" seems confusing - 
they're part of the same table! The main reference point when reading the code 
needs to be the code, rather than the textbook.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:67
+
+bool operator==(const Action ) const {
+  if (K != L.K)

if you make the data members private the union could just be a uint16 and save 
yourself some trouble here :-) Up to you



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:84
+
+Kind K = Kind::Error;
+// Value

This action struct can be squeezed to 16 bits.

Kind only needs 3 bits (I suspect it can be 2: Error and Accept aren't heavily 
used).
RuleID is 12 bits, StateID would fit within 11 and 12 should be safe.

Combined with the optimization suggested below, this should get the whole table 
down to  335KB.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:102
+  // Returns the transisted state for the state From on a nonterminal.
+  StateID goTo(StateID From, SymbolID NonTerminal) const {
+assert(pseudo::isNonterminal(NonTerminal));

nit: please be consistent with `Nonterminal` (and "nonterminal") vs 
`NonTerminal` (and "non-terminal"). I think `Nonterminal` is used elsewhere in 
the code.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:105
+auto Goto = find({From, NonTerminal});
+assert(Goto.size() == 1 && Goto.front().isGoTo());
+return Goto.front().goTo();

Why is this assertion valid? Is it a precondition of the function that some 
item in From can advance over NonTerminal?



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:121
+private:
+  struct IndexKey {
+StateID From;

I think there's an even more compact representation:
Basically sort the index keys as symbol-major, but then don't store the actual 
symbol there but rather store the range for the symbol in an external array:

```
// index is lookahead tok::TokenKind. Values are indices into Index: the range 
for tok is StateIdx[TokenIdx[tok]...TokenIdx[tok+1]]
vector TokenIdx;

// index is nonterminal SymbolID. Values are  indices into Index: the range for 
sym is StateIdx[NontermIdx[sym]...NontermIdx[sym+1]].
vector NontermIdx;

// parallel to actions, value is the from state. Only subranges are sorted.
vector StateIdx;

vector Actions;
```

This should be 4 * (392 + 245) + (2 + 4) * 83k ~= 500k. And the one extra 
indirection should save 5-10 steps of binary search.

This is equivalent to 2 * `vector` + `vector>` 
but keeping the searched index compact is probably a good thing.



Comment at: 

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-03 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a comment.

OK, a bunch of random notes but I'm getting a higher level picture.

I'm not sure about the split between LRAutomaton and LRTable. I suppose doing 
it in two stages simplifies the implementation, but I don't think the first 
stage particularly needs to be public.
Also I think the names are confusing: Automaton vs Table isn't a clear contrast 
as one describes behavior and one a data structure. And the "LRTable" seems 
more automaton-like to me!

I think my suggestion would be:

- call the final structure LRAutomaton
- call the LRAutomaton::build()::Builder structure `LRGraph` or something
- I'm not sure the LRAutomaton structure needs to exist exactly, seems like the 
current LRTable::build() could operate on `LRGraph` directly

but I'm going to ponder this more.




Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:32
+public:
+  Item(RuleID ID, uint8_t RuleLength, uint8_t DotPos)
+  : RID(ID), RuleLength(RuleLength), DotPos(DotPos) {}

accepting the grammar as a parameter rather than the rule length would let us 
ensure the length is correct locally rather than relying on the caller.
Essentially all the callers need to do a lookup anyway.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:45
+  }
+  const pseudo::Rule (const Grammar ) const { return G.lookupRule(RID); 
}
+  RuleID ruleID() const { return RID; }

nit: redundant pseudo::qualifier



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:47
+  RuleID ruleID() const { return RID; }
+  int dotPos() const { return DotPos; }
+  std::string dump(const Grammar ) const;

nit: unsigned.

just dot()?
and similarly ruleID()->rule()?
Not sure these would really be unclear



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:79
+
+// A deterministic finite state automaton for LR parsing.
+//

Hmm, isn't the point of GLR that it's a nondeterministic automaton?

We've done the LR-style NFA->DFA conversion, but it's not complete (when we hit 
a conflict we continue rather than giving up)



Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:144
+   "augmented symbol rule must be _ := start_symbol.");
+auto StartState = closure({{{RID, Rule.size(), 0}}}, G);
+auto Result = Builder.insert(std::move(StartState));

nit: too many braces - please spell out some of the types
(and Auto should probably be State here?)



Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:215
+for (const Item  : Automaton.states()[SID]) {
+  // If "_ := start_symbol ." in the State, set Action[State, eof] to
+  // accept.

These comments are echoing the code - saying what, not why.

"If we've just parsed the start symbol, we can accept the input".



Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:220
+Builder.addAction(SID, tokenSymbol(tok::eof),
+  LRTable::Action{LRTable::Action::Kind::Accept, {}});
+continue;

Seems like we should maybe have the accepted symbol as a payload on the Accept 
action?

Actually why do we need an Accept action, as opposed to just a reduce action 
and recognize it because it's the symbol we want and the stack is empty?



Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:224
+  if (!I.hasNext()) {
+// If a state i has an item "A := x.", set Action[i, a] to reduce
+// "A := x" for all a in FOLLOW(A).

"If we've reached the end of a rule A := ..., then we can reduce if the next 
token is in the follow set of A"


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-02-02 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a comment.

A few initial comments just on the "easy parts" - changes to grammar and the 
first/follow set computation.

I need to study the rest. Generally I have a feeling some of the impl is very 
inefficient but efficiency may not matter much in this context, and it's 
unclear how much better it can be made without adding lots of complexity. Going 
to look more closely :-)




Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:128
+  // Returns all symbols.
+  std::vector symbols() const;
+

This function seems like a bit of an attractive nuisance: it could be a cheap 
accessor, but isn't. There are just two callers - both are calling it in a loop 
and both seem dubious to be iterating over all symbols at all. I wonder if we 
can avoid it entirely.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:130
+
+  // Returns the SymbolID of the augmented symbol '_'
+  SymbolID augmentedSymbol() const { return 0; }

this "augmented symbol" name doesn't seem widely used. (search for `lr 
"augmented symbol"` produces few results, mostly unrelated). It's also a 
confusing name, because it's IIUC it's not the symbol that's augmented, but 
rather the grammar has been augmented with the symbol...

I'd suggest just calling it the start symbol, unless the distinction is 
critical (I don't yet understand why it's important - I added it in the 
prototype just to avoid having the language-specific name "translation-unit" 
hardcoded).

Otherwise it's hard to suggest a good name without understanding what it's for, 
but it should be descriptive...



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:133
+
+  // Computes the FIRST(X) for all non-terminal symbol X.
+  std::vector> firstSets() const;

In about the same number of words you could explain what this actually is :-)

// Computes the set of terminals that could appear at the start of each 
non-terminal.



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:133
+
+  // Computes the FIRST(X) for all non-terminal symbol X.
+  std::vector> firstSets() const;

sammccall wrote:
> In about the same number of words you could explain what this actually is :-)
> 
> // Computes the set of terminals that could appear at the start of each 
> non-terminal.
is there some reason these functions need to be part of the grammar class?

They seem like part of the LR building to me, it could be local to 
LRBuilder.cpp, or exposed from LRTable.h or so for testing



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:66
+
+using ItemSet = std::set;
+// A state of the LR automaton is an item set, which contains all possible 
rules

why is std::set used here? (generally a data structure to avoid)



Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:75
+// precompute all the transitions between these states.
+using State = ItemSet;
+llvm::hash_code hash_code(const State );

I have a feeling we've discussed this before but...

I find this alias/abstraction unhelpful when reading the code, because 
operations on states are e.g. `state.insert` or `state.contains`, which don't 
match the abstraction.

I'd probably prefer just using `ItemSet`, but alternatively `struct State { 
ItemSet Items; }` seems like it would work well: `State.Items.contains(...)` 
reads very naturally.



Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:39
 
+std::vector> Grammar::firstSets() const {
+  std::vector> FirstSets(T->Nonterminals.size());

this function could use some comments.

```
A rule S := T ... implies elements in first(S):
 - if T is a terminal, first(S) contains T
 - if T is a nonterminal, first(S) contains first(T)
Since first(T) may not have been fully computed yet,
first(S) itself may end up being incomplete.
We iterate until a fixed point.
(This isn't particularly efficient, but table building performance isn't 
critical).
```



Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:51
+
+  bool Changed = false;
+  do {

`bool Changed = true; while(Changed) {`? or maybe `for (bool Changed = true; 
Changed;) {`?

(do-while is rare enough that it always takes me a bit to grasp the control 
flow)




Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:56
+  assert(R.size() > 0);
+  Changed |= ExpandFirstSet(R.target(), R.seq().front());
+}

comment that we only need to consider the first element because symbols are 
non-nullable



Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:88
+for (const auto  : T->Rules) {
+  // Rule 2: for a rule X := ...Yz, we add all symbols from FIRST(z) to
+  // FOLLOW(Y).

nit: `... Y Z`, with consistent 

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-01-31 Thread Haojian Wu via Phabricator via cfe-commits
hokein updated this revision to Diff 404442.
hokein added a comment.

Polish the LRTable implementation, using two flat vectors

memory-size (bytes): 1380908 => 664600


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196

Files:
  clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
  clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h
  clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
  clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
  clang/tools/clang-pseudo/ClangPseudo.cpp
  clang/unittests/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/unittests/Tooling/Syntax/Pseudo/GrammarTests.cpp
  clang/unittests/Tooling/Syntax/Pseudo/LRBuilderTests.cpp

Index: clang/unittests/Tooling/Syntax/Pseudo/LRBuilderTests.cpp
===
--- /dev/null
+++ clang/unittests/Tooling/Syntax/Pseudo/LRBuilderTests.cpp
@@ -0,0 +1,112 @@
+//===--- LRBuilderTests.cpp --*- C++-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+
+#include "clang/Tooling/Syntax/Pseudo/LRAutomaton.h"
+#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
+#include "gmock/gmock.h"
+#include "gtest/gtest.h"
+#include 
+
+namespace clang {
+namespace syntax {
+namespace pseudo {
+namespace {
+
+TEST(TableTest, BuildLRTable) {
+  struct TestCase {
+llvm::StringRef BNF;
+llvm::StringRef ExpectedStates;
+llvm::StringRef ExpectedTable;
+  };
+
+  TestCase Cases[] = {
+  {
+  R"bnf(
+_ := expr
+expr := IDENTIFIER
+  )bnf",
+  R"(States:
+State 0
+_ :=  • expr
+expr :=  • IDENTIFIER
+State 1
+expr := IDENTIFIER • 
+State 2
+_ := expr • 
+)",
+  R"(LRTable:
+State 0
+'IDENTIFIER': shift, and go to state 1
+'expr': go to state 2
+State 1
+'EOF': reduce by rule 1 'expr := IDENTIFIER'
+State 2
+'EOF': accept
+)",
+  },
+  {
+  // A grammar with a S/R conflict in SLR table: (id-id)-id, or
+  // id-(id-id).
+  R"bnf(
+_ := expr
+expr := expr - expr  # S/R conflict at state 4 on '-' token
+expr := IDENTIFIER
+  )bnf",
+  R"(States:
+State 0
+_ :=  • expr
+expr :=  • IDENTIFIER
+expr :=  • expr - expr
+State 1
+expr := IDENTIFIER • 
+State 2
+_ := expr • 
+expr := expr • - expr
+State 3
+expr :=  • IDENTIFIER
+expr :=  • expr - expr
+expr := expr - • expr
+State 4
+expr := expr • - expr
+expr := expr - expr • 
+)",
+  R"(LRTable:
+State 0
+'IDENTIFIER': shift, and go to state 1
+'expr': go to state 2
+State 1
+'EOF': reduce by rule 1 'expr := IDENTIFIER'
+'-': reduce by rule 1 'expr := IDENTIFIER'
+State 2
+'EOF': accept
+'-': shift, and go to state 3
+State 3
+'IDENTIFIER': shift, and go to state 1
+'expr': go to state 4
+State 4
+'EOF': reduce by rule 2 'expr := expr - expr'
+'-': shift, and go to state 3
+'-': reduce by rule 2 'expr := expr - expr'
+)",
+  },
+  };
+  for (const auto  : Cases) {
+std::vector Diags;
+auto G = Grammar::parseBNF(C.BNF, Diags);
+ASSERT_THAT(Diags, testing::IsEmpty());
+auto Automaton = LRAutomaton::build(*G);
+auto Table = LRTable::build(*G);
+EXPECT_EQ(Automaton.dumpForTests(*G), C.ExpectedStates);
+EXPECT_EQ(Table.dumpForTests(*G), C.ExpectedTable);
+  }
+}
+
+} // namespace
+} // namespace pseudo
+} // namespace syntax
+} // namespace clang
Index: clang/unittests/Tooling/Syntax/Pseudo/GrammarTests.cpp
===
--- clang/unittests/Tooling/Syntax/Pseudo/GrammarTests.cpp
+++ clang/unittests/Tooling/Syntax/Pseudo/GrammarTests.cpp
@@ -1,4 +1,4 @@
-//===--- GrammarTests.cpp - grammar tests  *- C++-*-===//
+//===--- GrammarTests.cpp - grammar tests  ---*- C++-*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -19,6 +19,7 @@
 using testing::AllOf;
 using testing::ElementsAre;
 using testing::IsEmpty;
+using testing::Pair;
 using testing::UnorderedElementsAre;
 
 MATCHER_P(TargetID, SID, "") { return arg.target() == SID; }
@@ -34,7 +35,7 @@
 G = Grammar::parseBNF(BNF, Diags);
   }
 
-  SymbolID lookup(llvm::StringRef Name) const {
+  SymbolID id(llvm::StringRef Name) const {
 for (unsigned I = 0; I < NumTerminals; ++I)
   if (G->table().Terminals[I] == Name)
 return 

[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

2022-01-25 Thread Haojian Wu via Phabricator via cfe-commits
hokein created this revision.
hokein added a reviewer: sammccall.
Herald added a subscriber: mgorny.
hokein requested review of this revision.
Herald added a project: clang.

This patch introduces a dense implementation of the LR parsing table, which is
used by LR parsers. We implement a SLR(1) parsing table from an LR(0) automaton.

Statistics of the LR parsing table on the C++ spec grammar:

  number of states: 1449
  number of actions: 83069
  number of index entries: 72612
  size of the table (bytes): 1380908


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D118196

Files:
  clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h
  clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h
  clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h
  clang/lib/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp
  clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp
  clang/unittests/Tooling/Syntax/Pseudo/CMakeLists.txt
  clang/unittests/Tooling/Syntax/Pseudo/GrammarTests.cpp
  clang/unittests/Tooling/Syntax/Pseudo/LRBuilderTests.cpp

Index: clang/unittests/Tooling/Syntax/Pseudo/LRBuilderTests.cpp
===
--- /dev/null
+++ clang/unittests/Tooling/Syntax/Pseudo/LRBuilderTests.cpp
@@ -0,0 +1,112 @@
+//===--- LRBuilderTests.cpp --*- C++-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+
+#include "clang/Tooling/Syntax/Pseudo/LRAutomaton.h"
+#include "clang/Tooling/Syntax/Pseudo/LRTable.h"
+#include "gmock/gmock.h"
+#include "gtest/gtest.h"
+#include 
+
+namespace clang {
+namespace syntax {
+namespace pseudo {
+namespace {
+
+TEST(TableTest, BuildLRTable) {
+  struct TestCase {
+llvm::StringRef BNF;
+llvm::StringRef ExpectedStates;
+llvm::StringRef ExpectedTable;
+  };
+
+  TestCase Cases[] = {
+  {
+  R"bnf(
+_ := expr
+expr := IDENTIFIER
+  )bnf",
+  R"(States:
+State 0
+_ :=  • expr
+expr :=  • IDENTIFIER
+State 1
+expr := IDENTIFIER • 
+State 2
+_ := expr • 
+)",
+  R"(LRTable:
+State 0
+'IDENTIFIER': shift, and go to state 1
+'expr': go to state 2
+State 1
+'EOF': reduce by rule 1 'expr := IDENTIFIER'
+State 2
+'EOF': accept
+)",
+  },
+  {
+  // A grammar with a S/R conflict in SLR table: (id-id)-id, or
+  // id-(id-id).
+  R"bnf(
+_ := expr
+expr := expr - expr  # S/R conflict at state 4 on '-' token
+expr := IDENTIFIER
+  )bnf",
+  R"(States:
+State 0
+_ :=  • expr
+expr :=  • IDENTIFIER
+expr :=  • expr - expr
+State 1
+expr := IDENTIFIER • 
+State 2
+_ := expr • 
+expr := expr • - expr
+State 3
+expr :=  • IDENTIFIER
+expr :=  • expr - expr
+expr := expr - • expr
+State 4
+expr := expr • - expr
+expr := expr - expr • 
+)",
+  R"(LRTable:
+State 0
+'IDENTIFIER': shift, and go to state 1
+'expr': go to state 2
+State 1
+'EOF': reduce by rule 1 'expr := IDENTIFIER'
+'-': reduce by rule 1 'expr := IDENTIFIER'
+State 2
+'EOF': accept
+'-': shift, and go to state 3
+State 3
+'IDENTIFIER': shift, and go to state 1
+'expr': go to state 4
+State 4
+'EOF': reduce by rule 2 'expr := expr - expr'
+'-': shift, and go to state 3
+'-': reduce by rule 2 'expr := expr - expr'
+)",
+  },
+  };
+  for (const auto  : Cases) {
+std::vector Diags;
+auto G = Grammar::parseBNF(C.BNF, Diags);
+ASSERT_THAT(Diags, testing::IsEmpty());
+auto Automaton = LRAutomaton::build(*G);
+auto Table = LRTable::build(*G);
+EXPECT_EQ(Automaton.dumpForTests(*G), C.ExpectedStates);
+EXPECT_EQ(Table.dumpForTests(*G), C.ExpectedTable);
+  }
+}
+
+} // namespace
+} // namespace pseudo
+} // namespace syntax
+} // namespace clang
Index: clang/unittests/Tooling/Syntax/Pseudo/GrammarTests.cpp
===
--- clang/unittests/Tooling/Syntax/Pseudo/GrammarTests.cpp
+++ clang/unittests/Tooling/Syntax/Pseudo/GrammarTests.cpp
@@ -1,4 +1,4 @@
-//===--- GrammarTests.cpp - grammar tests  *- C++-*-===//
+//===--- GrammarTests.cpp - grammar tests  ---*- C++-*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -19,6 +19,7 @@
 using testing::AllOf;
 using testing::ElementsAre;
 using testing::IsEmpty;
+using testing::Pair;
 using testing::UnorderedElementsAre;
 
 MATCHER_P(TargetID, SID, "") { return arg.target() == SID; }
@@ -34,7 +35,7 @@
 G =