Thank you for the responses. The Babel parser does not perform significantly 
better than the standard parser. It still trips up on a lot of statements. For 
example, out of the 9428 "reference" SQL statements from the BIRD-SQL text to 
SQL benchmark<https://bird-bench.github.io/> data, the default parser fails on 
3379 statements, whereas the Babel parser fails on 3318. However, for the 
default parser, 1192 of those failures are SqlParseException, whereas the Babel 
parser produces 1199 SqlParseException failures. Overall the Babel parser does 
better, but it actually produces more SqlParseException failures.

Examples of failures include statements such as this:

SELECT T1.Value FROM Indicators AS T1 INNER JOIN Country AS T2 ON 
T1.CountryCode = T2.CountryCode WHERE T2.LongName = 'Commonwealth of Australia' 
AND T1.IndicatorName = 'Deposit interest rate (%)' AND T1.Year = 1979

which produces the failure:

org.apache.calcite.sql.parser.SqlParseException: Encountered ". Value" at line 
1, column 10.

It seems to my non-expert eye that it should be possible to determine that many 
identifiers are not function names (or other reserved words) from the context.

Another big source of parse errors I have hit is CREATE TABLE statements, where 
table or column names must be quoted to avoid confusion with reserved words. 
This (perhaps naively) looks like another case where the context should make it 
obvious.

It isn't immediately obvious to me that 
CALCITE-4653<https://issues.apache.org/jira/browse/CALCITE-4653> describes the 
same problem. If it does, I am happy to use that to associate any enhancement 
work. Other than that, it's not very helpful since I don't see any activity 
around it since it was raised 3.5 years ago.

Once again, I am happy to have a go at trying to improve the code, but I really 
need some advice and pointers on how (and where in the codebase) to proceed. 
Can anyone help?

________________________________
From: Stamatis Zampetakis <[email protected]>
Sent: 22 January 2025 07:43
To: [email protected] <[email protected]>
Subject: Re: Parsing SQL where a column (or table) name matches a keyword

Hey Mark,

Most SQL parsers require the reserved words to be quoted for using
them as identifiers. This helps to avoid ambiguities and generally
leads to simpler and more efficient parsers. I am not sure if it's
possible to avoid quotes without sacrificing something else.

The standard parser also tries to remain close to the SQL standard
specification and I guess that quoting rules and reserved keywords are
coming from there. Most of the time we don't want to deviate a lot
from the standard.

Apart from the standard parser, Calcite provides the Babel parser that
is more lenient and has a much smaller set of reserved keywords [1].
The Babel parser may be a better fit for your use-case.

Best,
Stamatis

[1] 
https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fcalcite%2Fblob%2F3fce658c05aa88b44af8cbcfd0809a420896d62b%2Fbabel%2Fsrc%2Fmain%2Fcodegen%2Fconfig.fmpp%23L61&data=05%7C02%7C%7C57e2ffa1963a4cb01f7c08dd3ab8bf4d%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638731287342398299%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=exHkvgbRiBVi%2FbyeHC6f7tr5%2Bycj5Ht1wqpmPYSM30E%3D&reserved=0<https://github.com/apache/calcite/blob/3fce658c05aa88b44af8cbcfd0809a420896d62b/babel/src/main/codegen/config.fmpp#L61>

On Tue, Jan 21, 2025 at 7:11 PM Mihai Budiu <[email protected]> wrote:
>
> There is at least one open issue related to this topic: 
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FCALCITE-4653&data=05%7C02%7C%7C57e2ffa1963a4cb01f7c08dd3ab8bf4d%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638731287342418000%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=BZ79SNbp3Pbk3FsobBZUQA111Z3FCLp2skCi7OOzqF4%3D&reserved=0<https://issues.apache.org/jira/browse/CALCITE-4653>
>
> Mihai
>
> ________________________________
> From: Mark Lewis <[email protected]>
> Sent: Tuesday, January 21, 2025 2:32 AM
> To: [email protected] <[email protected]>
> Subject: Parsing SQL where a column (or table) name matches a keyword
>
> While trying to conversions from SQL using large volumes of existing (sample 
> / benchmark) SQL expressions, I have experienced significant issues due to 
> column (or table) names that happen to match reserved keywords. For example:
>
> SELECT COUNTRY FROM COUNTRY_LANGUAGE WHERE LANGUAGE = 'ENGLISH'
>
> Calcite appears to require quoting of any name that matches a keyword — in 
> this case LANGUAGE — whereas generally this isn't required since (I guess) 
> the meaning can be inferred from the context.
>
> I would be happy to try to contribute an enhancement to address this, but I 
> don't really know the bet place to start in the codebase; even where best to 
> put unit tests to drive the implementation. I am completely new to the 
> Calcite codebase. Can anyone provide me with some pointers on where to look 
> and advice on the most suitable implementation approach?
>

Reply via email to