Comparing keyword handling in Babel and Core parsers

Julian Hyde Wed, 10 Jul 2024 11:03:05 -0700

I am working on https://issues.apache.org/jira/browse/CALCITE-5541
(upgrading JavaCC) and the Babel parser is having problems deducing
whether a keyword is reserved. Investigating this, I took a look at
the generated code, and found something interesting.


Here are the NonReservedKeyWord and NonReservedKeyWord0of3 methods in
Babel 
(babel/build/javacc/javaCCMain/org/apache/calcite/sql/parser/babel/SqlBabelParserImpl.java):

  final public String NonReservedKeyWord() throws ParseException {
    if (jj_2_1116(2)) {
      NonReservedKeyWord0of3();
    } else if (jj_2_1117(2)) {
      NonReservedKeyWord1of3();
    } else if (jj_2_1118(2)) {
      NonReservedKeyWord2of3();
    } else {
      jj_consume_token(-1);
      throw new ParseException();
    }
  {if ("" != null) return unquotedIdentifier();}
    throw new Error("Missing return statement in function");
  }

  /** @see #NonReservedKeyWord */
  final public void NonReservedKeyWord0of3() throws ParseException {
    if (jj_2_1119(2)) {
      jj_consume_token(A);
    } else if (jj_2_1120(2)) {
      jj_consume_token(ACTION);
    } else if (jj_2_1121(2)) {
      jj_consume_token(ADMIN);
    ...

And here are the same methods in Core
(core/build/javacc/javaCCMain/org/apache/calcite/sql/parser/impl/SqlParserImpl.java):

  final public String NonReservedKeyWord() throws ParseException {
    switch ((jj_ntk==-1)?jj_ntk_f():jj_ntk) {
    case A:
    case ACTION:
    case ADMIN:
    case APPLY:
    ...
    case YEARS:{
      NonReservedKeyWord0of3();
      break;
      }
    case ABSENT:
    ...
    case ZONE:{
      NonReservedKeyWord1of3();
      break;
      }
    ...
    default:
      jj_la1[436] = jj_gen;
      jj_consume_token(-1);
      throw new ParseException();
    }
  {if ("" != null) return unquotedIdentifier();}
    throw new Error("Missing return statement in function");
  }

  /** @see #NonReservedKeyWord */
  final public void NonReservedKeyWord0of3() throws ParseException {
    switch ((jj_ntk==-1)?jj_ntk_f():jj_ntk) {
    case A:{
      jj_consume_token(A);
      break;
      }
    case ACTION:{
      jj_consume_token(ACTION);
      break;
      }
    case ADMIN:{
      jj_consume_token(ADMIN);
      break;
      }
    ...

Both of the above are generated using JavaCC 7.0.13. Other parsers,
such as Server, look similar to Core. Under JavaCC 4.0, all parsers
generate a 'switch'.

In all parsers we split the non-reserved keywords into 3 rules (0of3,
1of3, 2of3) due to the size restrictions noted in
https://issues.apache.org/jira/browse/CALCITE-2405.

I was puzzled why one is generating a 'switch' and the other is
generating chained 'if'...'else-if's. At first I thought it was that
Babel had more keywords, but some experiments eliminated that
possibility. I also disproved the hypothesis that it is because Babel
allows extra characters in identifiers (see
https://issues.apache.org/jira/browse/CALCITE-5668). My current
hypothesis is that Babel needs to use lookahead in order to determine
whether a non-reserved keyword can be converted to an identifier.
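
To make the difference between the two shapes concrete, here is a
minimal, self-contained sketch. It is not Calcite or JavaCC code: the
class DispatchSketch, the token constants, and the lookahead() helper
are hypothetical stand-ins, and lookahead() is far simpler than a real
jj_2_N(2) call, which scans ahead in the token stream. The sketch only
counts how many speculative checks each shape performs before choosing
an alternative.

```java
public class DispatchSketch {
  // Hypothetical token kinds; the real constants live in the generated parser.
  static final int A = 0, ACTION = 1, ADMIN = 2, APPLY = 3, YEARS = 4;

  /** Counts speculative checks, standing in for calls to jj_2_N(2). */
  static int lookaheadCalls = 0;

  /** Simplified stand-in for one jj_2_N(2) call: does this alternative match? */
  static boolean lookahead(int nextKind, int candidate) {
    lookaheadCalls++;
    return nextKind == candidate;
  }

  /** Chained shape (as in the Babel listing): test alternatives in order. */
  static int chained(int nextKind) {
    if (lookahead(nextKind, A)) {
      return A;
    } else if (lookahead(nextKind, ACTION)) {
      return ACTION;
    } else if (lookahead(nextKind, ADMIN)) {
      return ADMIN;
    } else if (lookahead(nextKind, APPLY)) {
      return APPLY;
    } else if (lookahead(nextKind, YEARS)) {
      return YEARS;
    }
    throw new IllegalArgumentException("unexpected token kind " + nextKind);
  }

  /** Switch shape (as in the Core listing): one jump on the token kind. */
  static int switched(int nextKind) {
    switch (nextKind) {
    case A:
    case ACTION:
    case ADMIN:
    case APPLY:
    case YEARS:
      return nextKind;
    default:
      throw new IllegalArgumentException("unexpected token kind " + nextKind);
    }
  }

  public static void main(String[] args) {
    lookaheadCalls = 0;
    chained(YEARS);
    System.out.println("chained:  " + lookaheadCalls + " lookahead calls");

    lookaheadCalls = 0;
    switched(YEARS);
    System.out.println("switched: " + lookaheadCalls + " lookahead calls");
  }
}
```

With five alternatives, the chained shape makes five checks to reach
the last keyword and the switch shape makes none. In the chained
shape, the number of checks grows with the position of the matching
keyword in the rule.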

But whatever the reason, something seems to be very different about
the Babel grammar. Given how frequently identifiers occur when parsing
SQL, I would not be surprised if the Babel parser is significantly
slower than the regular parser under JavaCC 7.0.13.

In my opinion, that is not a bug that should prevent us from upgrading
JavaCC, especially given that JavaCC 4.0 has a performance bug that
affects all of our parser variants.

Julian
