The difficulty will be in defining the "core".

If you want 100% fidelity, the goals are clear. If you want 99% fidelity, you 
have to define what falls in the remaining 1%. That's why I asked Yanjing Wang 
to set up a goal that needs to be achieved.

Even if you leave out all functions, just the type-system mismatches and the 
behavior of casts make this a difficult job. Today the static and dynamic 
behavior of casts is to a large degree built into Calcite; to make it 
flexible, it would have to move somehow into the type system, so that each 
dialect can customize the way casts work. We have seen this, for example, in 
the handling of casts to/from BINARY: 
https://issues.apache.org/jira/browse/CALCITE-6210
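
For a sense of what moving casts into the type system might look like, here is 
a minimal sketch (a hypothetical interface, not an existing Calcite API; 
RelDataType is Calcite's org.apache.calcite.rel.type.RelDataType):

  enum CastFailureBehavior { ERROR, RETURN_NULL, TRUNCATE, SATURATE }

  interface CastPolicy {
    // Whether CAST(source AS target) is legal at all in this dialect.
    boolean canCast(RelDataType source, RelDataType target);

    // What happens at runtime on overflow or malformed input; dialects
    // variously raise an error, return NULL, truncate, or saturate.
    CastFailureBehavior failureBehavior(RelDataType source, RelDataType target);
  }

If each dialect supplied its own CastPolicy, cases like the BINARY casts in 
CALCITE-6210 would become one implementation among many instead of built-in 
behavior.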

I suspect the "high severity" thing happens often because this is the JIRA 
default, and people don't bother to change it.

Mihai

________________________________
From: Julian Hyde <[email protected]>
Sent: Monday, December 2, 2024 11:40 AM
To: [email protected] <[email protected]>
Subject: Re: [DISCUSS] Calcite as SQL translator, and dialect testing

I agree with Mihai that it’s not practical to translate dialect-to-dialect with 
100% fidelity. You get defeated by the edge cases: values of DECIMAL that are 
valid on one DB but not another, or what happens when the ‘offset’ argument of 
the SUBSTRING function is negative.
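
To illustrate how much work even one such edge case is, here is a minimal 
sketch of rewriting MySQL's negative-offset SUBSTRING for a Postgres-style 
target (a hypothetical string-level helper, not Calcite code):

  // MySQL: SUBSTRING('foobar', -3) = 'bar' (negative offsets count from
  // the end); Postgres: SUBSTRING('foobar' FROM -3) = 'foobar'.
  static String rewriteMySqlSubstring(String s, int off) {
    if (off > 0) {
      return "SUBSTRING(" + s + " FROM " + off + ")";
    }
    // MySQL counts a negative offset from the end and returns '' for
    // offset 0; both fall out of a LENGTH()-based start position.
    return "SUBSTRING(" + s + " FROM LENGTH(" + s + ") + " + (off + 1) + ")";
  }

And even that rewrite is not exact: when -off exceeds the string length, MySQL 
returns '' while the rewritten expression yields the whole string. That is the 
kind of residue you are left with at the edges.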

But the core of the language is a lot more valuable than the edges. We can do a 
good job at the core.

Let’s use our tests to define the core, keep it working, and continually extend 
it.

Tests are crucial, and should be the heart of this initiative. If people want 
Calcite to be a better SQL translator, they should invest in improving the 
tests. (Not just a stream of band-aid PRs.)

You may notice that I get exasperated when someone logs a bug at the boundaries 
and claims that it is high severity. For example, a bug that wants us to handle 
dates earlier than 1 BCE. I know why this happens: your customer has logged a 
bug because their compliance suite has a date that is 2 BCE. These cases exist 
because of QA engineers testing up to the boundaries, and logging bugs that are 
handled by developers of another product and *their* QA engineers. But let’s 
keep some perspective: these cases are filed, and fixed, only to make QA 
engineers happy. Let’s focus on the core.

>> My claim was that it's impossible to even implement a translation between 
>> any two dialects such that the resulting program provides the exact same 
>> results as the source program.

Not impossible. Google acquired a company, CompilerWorks, that models the type 
system and built-in functions of various SQL dialects and maps between them. It 
is a lot of work, and requires a lot of testing. We cannot get to that place 
with the limited resources of an open source project, but we can asymptotically 
approach it if we are clever.

Julian



> On Dec 2, 2024, at 10:05 AM, Mihai Budiu <[email protected]> wrote:
>
> My claim was that it's impossible to even implement a translation between any 
> two dialects such that the resulting program provides the exact same results 
> as the source program.
>
> Clearly, I don't know all SQL dialects, but I think this is true for all 
> major dialects.
>
> Mihai
>
> ________________________________
> From: Yanjing Wang <[email protected]>
> Sent: Sunday, December 1, 2024 7:10 PM
> To: [email protected] <[email protected]>
> Subject: Re: [DISCUSS] Calcite as SQL translator, and dialect testing
>
> Thank you, Mihai, for your detailed insights.
>
> Implementing translations for all dialects indeed requires a significant
> amount of work and can be quite challenging to complete. Would it be
> possible for us to create an example that demonstrates how to implement
> translation between two specific dialects, plus materialized view
> substitution, from scratch? This example could be enhanced over time with
> contributions from various companies and would stay friendly to newcomers.
>
>
> Thank you, Walaa, for your recommendation. I understand that Coral uses
> ANTLR4 for parsing and translating from ASTNode to Calcite SqlNode. Do you
> find this approach easier than defining a Calcite FTL file for unsupported
> syntax?
>
Walaa Eldin Moustafa <[email protected]> wrote on Thu, Nov 28, 2024 at 01:27:
>
>> You may consider looking into Coral [1], which is based on Calcite but
>> extends it with translation requirements in mind, treating them as
>> first-class problems.
>>
>> [1] https://github.com/linkedin/coral
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Wed, Nov 27, 2024 at 9:20 AM Mihai Budiu <[email protected]> wrote:
>>
>>> I think that the task of translation between SQL dialects is hopeless.
>>> Consider just the type DECIMAL:
>>>
>>> * Each dialect supports a different range of DECIMAL types, specified by
>>>   scale and precision.
>>> * Some dialects support negative scales, and some support scales greater
>>>   than the precision.
>>> * Some dialects, like Postgres, have infinite values in DECIMAL. This
>>>   means that DECIMAL arithmetic can never cause a runtime error in
>>>   Postgres, unlike in all other dialects.
>>> * The rounding rules for converting between DECIMAL and other types
>>>   differ between dialects.
>>> * The typing rules for DECIMAL arithmetic differ between dialects: what
>>>   is the type of a/b when a and b are DECIMALs? (See the sketch below.)
>>> * The implicit typing of literals differs between dialects: is 1.0e0 a
>>>   DECIMAL or some floating-point value?
>>>
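>>> To make that concrete, here is a minimal sketch of the per-dialect knobs a
>>> translator would have to model just for DECIMAL (a hypothetical interface,
>>> not an existing Calcite API; RelDataType is Calcite's type descriptor):
>>>
>>>   interface DecimalPolicy {
>>>     int maxPrecision();                    // e.g. 38 in many dialects
>>>     boolean allowsNegativeScale();
>>>     boolean allowsScaleGreaterThanPrecision();
>>>     boolean hasInfinity();                 // Postgres-style Infinity
>>>     // Result type of DECIMAL(p1, s1) / DECIMAL(p2, s2); each dialect
>>>     // has its own precision/scale derivation formula.
>>>     RelDataType divideType(RelDataType a, RelDataType b);
>>>   }
>>>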
>>> The conclusion is that a SQL dialect can express many programs that
>>> cannot be expressed at all in other SQL dialects. What is a translator
>>> supposed to do with such programs?
>>>
>>> These are problems for just one type; when you consider the full set of
>>> supported types, the problem is exponentially larger.
>>>
>>> Just look at the differences between functions in dialects:
>>> https://calcite.apache.org/docs/reference.html
>>> Calcite has 4 versions of CONCAT_WS, with subtle semantic differences.
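>>>
>>> For instance, in MySQL, CONCAT_WS(',', 'a', NULL, 'b') returns 'a,b'
>>> (NULL arguments are skipped) but CONCAT_WS(NULL, 'a', 'b') returns NULL.
>>> A dialect that instead treats a NULL separator as an empty string would
>>> return 'ab' for the second call; even this one function needs
>>> per-dialect handling.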
>>>
>>> Are you aware of any tool, anywhere, that translates between 2 fixed SQL
>>> dialects while preserving the program's exact behavior in all cases?
>>> Building a tool for all possible dialects is again exponentially harder.
>>>
>>> Since we have established that an exact semantics-preserving translation
>>> of SQL programs between dialects is impossible, you need to set up a goal
>>> for the translation effort. What do you really hope to achieve? Then we
>>> can discuss how it can be done.
>>>
>>> Mihai
>>>
>>> ________________________________
>>> From: Yanjing Wang <[email protected]>
>>> Sent: Wednesday, November 27, 2024 1:02 AM
>>> To: [email protected] <[email protected]>
>>> Subject: Re: [DISCUSS] Calcite as SQL translator, and dialect testing
>>>
>>> Hello, this discussion has been ongoing for a week. Let's move it
>>> forward. Does anyone else have any suggestions?
>>>
>>> Yanjing Wang <[email protected]> wrote on Tue, Nov 19, 2024 at 14:47:
>>>
>>>> 1. RelToSqlConverterTest
>>>> The class name implies that it tests conversion from RelNode to SQL, but
>>>> now its RelNode is paired with target SQL for different dialects. That
>>>> makes it difficult for me to understand a test case such as:
>>>>
>>>> @Test void testNullCollation() {
>>>>   final String query = "select * from \"product\" order by \"brand_name\"";
>>>>   final String expected = "SELECT *\n"
>>>>       + "FROM \"foodmart\".\"product\"\n"
>>>>       + "ORDER BY \"brand_name\"";
>>>>   final String sparkExpected = "SELECT *\n"
>>>>       + "FROM `foodmart`.`product`\n"
>>>>       + "ORDER BY `brand_name` NULLS LAST";
>>>>   sql(query)
>>>>       .withPresto().ok(expected)
>>>>       .withSpark().ok(sparkExpected);
>>>> }
>>>>
>>>>
>>>> Why does the Spark SQL have 'NULLS LAST' at the end? That information is
>>>> missing if we don't add a source rel or a source dialect.
>>>>
>>>> 2. Dialect-to-dialect translation
>>>> I think it's necessary. Dialect translation and materialized view
>>>> substitution are common in the big data domain, and it would be
>>>> beneficial to make Calcite more user-friendly for these scenarios.
>>>> Could we create end-to-end test cases that start with the source SQL of
>>>> one dialect and end with the target SQL of another (or the same)
>>>> dialect? We could also include user-defined materialized views in the
>>>> process and compare results.
>>>>
>>>> Julian Hyde <[email protected]> wrote on Tue, Nov 19, 2024 at 07:21:
>>>>
>>>>> A recent case, https://issues.apache.org/jira/browse/CALCITE-6693, "Add
>>>>> Source SQL Dialect to RelToSqlConverterTest", implies that people are
>>>>> using Calcite to translate SQL from one dialect to another. The test
>>>>> wanted to test translating a SQL string from Presto to Redshift. I
>>>>> pushed back on that case (and its PR) because that test is for
>>>>> translating RelNode to a SQL dialect, not about handling source
>>>>> dialects.
>>>>>
>>>>> Dialect-to-dialect translation is undoubtedly something that people do
>>>>> with Calcite. I think we should recognize that fact, and document how
>>>>> someone can use Calcite as a translator. When we have documented it, we
>>>>> can also add some tests.
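>>>>>
>>>>> As a starting point for that documentation: at the SqlNode level the
>>>>> round trip is roughly the following (a minimal sketch; a faithful
>>>>> translator should go through validation and RelNode, and the
>>>>> conformance/dialect choices here are just illustrative):
>>>>>
>>>>>   SqlParser parser = SqlParser.create(sql,
>>>>>       SqlParser.config().withConformance(SqlConformanceEnum.MYSQL_5));
>>>>>   SqlNode ast = parser.parseQuery();       // throws SqlParseException
>>>>>   String translated =
>>>>>       ast.toSqlString(PostgresqlSqlDialect.DEFAULT).getSql();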
>>>>>
>>>>> I am also worried about our dialect tests in general. The surface area
>>>>> to be tested is huge, and the tests are added haphazardly, so while
>>>>> many cases are tested there is a much greater set of cases that are
>>>>> not. Consider, for example, how testSelectQueryWithGroupByEmpty [1]
>>>>> tests against MySQL, Presto and StarRocks but not against BigQuery,
>>>>> Snowflake or Postgres. If we want our SQL dialect support to be high
>>>>> quality, we have to find a way to improve the coverage of our tests. I
>>>>> logged https://issues.apache.org/jira/browse/CALCITE-5529 with some
>>>>> ideas, but I need help implementing it.
>>>>>
>>>>> Julian
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/calcite/blob/f2ec11fe7e23ecf2db903bc02c40609242993aad/core/src/test/java/org/apache/calcite/rel/rel2sql/RelToSqlConverterTest.java#L577
