Thank you for the responses. The Babel parser does not perform significantly better than the standard parser. It still trips up on a lot of statements. For example, out of the 9428 "reference" SQL statements from the BIRD-SQL text to SQL benchmark<https://bird-bench.github.io/> data, the default parser fails on 3379 statements, whereas the Babel parser fails on 3318. However, for the default parser, 1192 of those failures are SqlParseException, whereas the Babel parser produces 1199 SqlParseException failures. Overall the Babel parser does better, but it actually produces more SqlParseException failures.
Examples of failures include statements such as this: SELECT T1.Value FROM Indicators AS T1 INNER JOIN Country AS T2 ON T1.CountryCode = T2.CountryCode WHERE T2.LongName = 'Commonwealth of Australia' AND T1.IndicatorName = 'Deposit interest rate (%)' AND T1.Year = 1979 which produces the failure: org.apache.calcite.sql.parser.SqlParseException: Encountered ". Value" at line 1, column 10. It seems to my non-expert eye that it should be possible to determine that many identifiers are not function names (or other reserved words) from the context. Another big source of parse errors I have hit is CREATE TABLE statements, where table or column names must be quoted to avoid confusion with reserved words. This (perhaps naively) looks like another case where the context should make it obvious. It isn't immediately obvious to me that CALCITE-4653<https://issues.apache.org/jira/browse/CALCITE-4653> describes the same problem. If it does, I am happy to use that to associate any enhancement work. Other than that, it's not very helpful since I don't see any activity around it since it was raised 3.5 years ago. Once again, I am happy to have a go at trying to improve the code, but I really need some advice and pointers on how (and where in the codebase) to proceed. Can anyone help? ________________________________ From: Stamatis Zampetakis <[email protected]> Sent: 22 January 2025 07:43 To: [email protected] <[email protected]> Subject: Re: Parsing SQL where a column (or table) name matches a keyword Hey Mark, Most SQL parsers require the reserved words to be quoted for using them as identifiers. This helps to avoid ambiguities and generally leads to simpler and more efficient parsers. I am not sure if it's possible to avoid quotes without sacrificing something else. The standard parser also tries to remain close to the SQL standard specification and I guess that quoting rules and reserved keywords are coming from there. Most of the time we don't want to deviate a lot from the standard. Apart from the standard parser, Calcite provides the Babel parser that is more lenient and has a much smaller set of reserved keywords [1]. The Babel parser may be a better fit for your use-case. Best, Stamatis [1] https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fcalcite%2Fblob%2F3fce658c05aa88b44af8cbcfd0809a420896d62b%2Fbabel%2Fsrc%2Fmain%2Fcodegen%2Fconfig.fmpp%23L61&data=05%7C02%7C%7C57e2ffa1963a4cb01f7c08dd3ab8bf4d%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638731287342398299%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=exHkvgbRiBVi%2FbyeHC6f7tr5%2Bycj5Ht1wqpmPYSM30E%3D&reserved=0<https://github.com/apache/calcite/blob/3fce658c05aa88b44af8cbcfd0809a420896d62b/babel/src/main/codegen/config.fmpp#L61> On Tue, Jan 21, 2025 at 7:11 PM Mihai Budiu <[email protected]> wrote: > > There is at least one open issue related to this topic: > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FCALCITE-4653&data=05%7C02%7C%7C57e2ffa1963a4cb01f7c08dd3ab8bf4d%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638731287342418000%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=BZ79SNbp3Pbk3FsobBZUQA111Z3FCLp2skCi7OOzqF4%3D&reserved=0<https://issues.apache.org/jira/browse/CALCITE-4653> > > Mihai > > ________________________________ > From: Mark Lewis <[email protected]> > Sent: Tuesday, January 21, 2025 2:32 AM > To: [email protected] <[email protected]> > Subject: Parsing SQL where a column (or table) name matches a keyword > > While trying to conversions from SQL using large volumes of existing (sample > / benchmark) SQL expressions, I have experienced significant issues due to > column (or table) names that happen to match reserved keywords. For example: > > SELECT COUNTRY FROM COUNTRY_LANGUAGE WHERE LANGUAGE = 'ENGLISH' > > Calcite appears to require quoting of any name that matches a keyword — in > this case LANGUAGE — whereas generally this isn't required since (I guess) > the meaning can be inferred from the context. > > I would be happy to try to contribute an enhancement to address this, but I > don't really know the bet place to start in the codebase; even where best to > put unit tests to drive the implementation. I am completely new to the > Calcite codebase. Can anyone provide me with some pointers on where to look > and advice on the most suitable implementation approach? >
