In an open source project the core is "de facto": people decide what is important to them, add tests, and fix the code to match the tests.
The core is well-defined: it is the test suite. If the core doesn't match your
needs, contributions are welcome.

Julian

> On Dec 2, 2024, at 11:52 AM, Mihai Budiu <[email protected]> wrote:
>
> The difficulty will be in defining the "core".
>
> If you want 100% fidelity the goals are clear. If you want 99% fidelity you
> have to define what is in the 1% left over. That's why I asked Yanjing Wang
> to set up a goal that needs to be achieved.
>
> Even if you leave out all functions, just the type system mismatches and the
> behavior of casts make this a difficult job. Today the static and dynamic
> behavior of casts seems to be to a large degree built into Calcite, but to
> make that flexible it would have to move somehow into the type system, so
> that each dialect can customize the way casts work. We have seen this, for
> example, in the handling of casts to/from BINARY:
> https://issues.apache.org/jira/browse/CALCITE-6210
>
> I suspect the "high severity" thing happens often because this is the JIRA
> default, and people don't bother to change it.
>
> Mihai
>
> ________________________________
> From: Julian Hyde <[email protected]>
> Sent: Monday, December 2, 2024 11:40 AM
> To: [email protected] <[email protected]>
> Subject: Re: [DISCUSS] Calcite as SQL translator, and dialect testing
>
> I agree with Mihai that it's not practical to translate dialect-to-dialect
> with 100% fidelity. You get defeated by the edge cases — values of DECIMAL
> that are valid on one DB but not another, or what happens when the 'offset'
> argument of the SUBSTRING function is negative.
>
> But the core of the language is a lot more valuable than the edges. We can
> do a good job at the core.
>
> Let's use our tests to define the core, keep it working, and continually
> extend it.
>
> Tests are crucial, and should be the heart of this initiative. If people
> want Calcite to be a better SQL translator, they should invest in improving
> the tests. (Not just a stream of band-aid PRs.)
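To make the SUBSTRING edge case concrete: dialects disagree on what a negative offset means. The sketch below contrasts two common interpretations in plain Java. The method names are made up for illustration, and the semantics are approximations of the MySQL and Postgres behaviors, not authoritative definitions of either engine.

```java
public class SubstringDialects {
  // MySQL-style: a negative position counts backwards from the end of the
  // string; position 0, or a position past either end, yields "".
  static String mysqlStyle(String s, int pos, int len) {
    int start = pos > 0 ? pos - 1 : s.length() + pos;
    if (pos == 0 || start < 0 || start >= s.length() || len <= 0) {
      return "";
    }
    return s.substring(start, Math.min(start + len, s.length()));
  }

  // Postgres-style: the requested range [from, from+count-1] is intersected
  // with the valid positions [1, length]; a negative start is not an error.
  static String postgresStyle(String s, int from, int count) {
    int lo = Math.max(from, 1);
    int hi = Math.min(from + count - 1, s.length());
    return hi < lo ? "" : s.substring(lo - 1, hi);
  }

  public static void main(String[] args) {
    // The same call, three characters starting at offset -2, disagrees:
    System.out.println(mysqlStyle("hello", -2, 3));     // "lo"
    System.out.println(postgresStyle("hello", -2, 3));  // ""
  }
}
```

A translator that ignores this difference silently changes query results, which is exactly the kind of edge the thread argues cannot all be covered.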
>
> You may notice that I get exasperated when someone logs a bug at the
> boundaries and claims that it is high severity. For example, a bug that
> wants us to handle dates earlier than 1 BCE. I know why this happens — your
> customer has logged a bug because their compliance suite has a date that is
> 2 BCE. These cases exist because of QA engineers testing up to the
> boundaries — and logging bugs that are handled by developers of another
> product and *their* QA engineers. But let's keep some perspective — these
> cases are filed, and fixed, only to make QA engineers happy. Let's focus on
> the core.
>
>>> My claim was that it's impossible to even implement a translation between
>>> any two dialects such that the resulting program provides the exact same
>>> results as the source program.
>
> Not impossible. Google acquired a company, CompilerWorks, that models the
> type system and built-in functions of various SQL dialects and maps between
> them. It is a lot of work, and requires a lot of testing. We cannot get to
> that place with the limited resources of an open source project, but we can
> asymptotically approach it if we are clever.
>
> Julian
>
>
>> On Dec 2, 2024, at 10:05 AM, Mihai Budiu <[email protected]> wrote:
>>
>> My claim was that it's impossible to even implement a translation between
>> any two dialects such that the resulting program provides the exact same
>> results as the source program.
>>
>> Clearly, I don't know all SQL dialects, but I think this is true for all
>> major dialects.
>>
>> Mihai
>>
>> ________________________________
>> From: Yanjing Wang <[email protected]>
>> Sent: Sunday, December 1, 2024 7:10 PM
>> To: [email protected] <[email protected]>
>> Subject: Re: [DISCUSS] Calcite as SQL translator, and dialect testing
>>
>> Thank you, Mihai, for your detailed insights.
>>
>> Implementing translations for all dialects indeed requires a significant
>> amount of work and can be quite challenging to complete.
>> Would it be possible for us to create an example that demonstrates how to
>> implement translation between two specific dialects, and materialized view
>> substitution, from scratch? This example could potentially be enhanced with
>> contributions from various companies and be friendly to newbies.
>>
>> Thank you, Walaa, for your recommendation. I understand that Coral uses
>> ANTLR4 for parsing and translating from ASTNode to Calcite SqlNode. Do you
>> find this approach easier compared to defining a Calcite FTL file for
>> unsupported syntax?
>>
>> On Thu, Nov 28, 2024 at 01:27, Walaa Eldin Moustafa
>> <[email protected]> wrote:
>>
>>> You may consider looking into Coral [1], which is based on Calcite but
>>> extends it with the translation requirements in mind, and as first-class
>>> problems.
>>>
>>> [1] https://github.com/linkedin/coral
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Wed, Nov 27, 2024 at 9:20 AM Mihai Budiu <[email protected]> wrote:
>>>
>>>> I think that the task of translation between SQL dialects is hopeless.
>>>> Consider just the type DECIMAL:
>>>>
>>>> * Each dialect supports a different range of DECIMAL types, specified by
>>>>   scale and precision.
>>>> * Some dialects support negative scales, and some support scales greater
>>>>   than precision.
>>>> * Some dialects, like Postgres, have infinite values in DECIMAL. This
>>>>   means that DECIMAL arithmetic can never cause a runtime error in
>>>>   Postgres, unlike all other dialects.
>>>> * The rounding rules for converting between DECIMAL and other types
>>>>   differ between dialects.
>>>> * The typing rules for DECIMAL arithmetic differ between dialects: what
>>>>   is the type of a/b when a and b are DECIMALs?
>>>> * The implicit typing of literals differs between dialects: is 1.0e0 a
>>>>   DECIMAL or some floating point value?
>>>>
>>>> The conclusion is that a SQL dialect can express many programs that
>>>> cannot be expressed at all in other SQL dialects.
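The a/b typing bullet above can be felt even in plain Java: java.math.BigDecimal refuses to divide without an explicit scale policy, which is precisely the decision each SQL dialect bakes into its DECIMAL typing rules, each differently. A minimal sketch; the scale values here are arbitrary illustrations, not any vendor's actual rule:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalDivide {
  // The caller must pick a result scale and rounding mode; SQL dialects
  // hard-wire different choices like these into their DECIMAL division type.
  static BigDecimal divide(BigDecimal a, BigDecimal b, int resultScale) {
    return a.divide(b, resultScale, RoundingMode.HALF_UP);
  }

  public static void main(String[] args) {
    BigDecimal one = new BigDecimal("1");
    BigDecimal three = new BigDecimal("3");
    // 1/3 has no exact decimal expansion, so the chosen scale decides the value:
    System.out.println(divide(one, three, 6)); // 0.333333
    System.out.println(divide(one, three, 2)); // 0.33
    // With no policy at all, Java throws, just as some dialects reject the
    // expression while others silently pick a scale:
    try {
      one.divide(three);
    } catch (ArithmeticException e) {
      System.out.println("exact division impossible: " + e.getMessage());
    }
  }
}
```

Two engines that pick different result scales for the same query return different rows, with no error on either side.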
>>>> What is a translator supposed to do with such programs?
>>>>
>>>> These are problems for just one type; when you consider the full set of
>>>> types supported, the problem is exponentially larger.
>>>>
>>>> Just look at the differences between functions in dialects:
>>>> https://calcite.apache.org/docs/reference.html
>>>> Calcite has 4 versions of CONCAT_WS, with subtle semantic differences.
>>>>
>>>> Are you aware of any tool, anywhere, that translates between 2 fixed SQL
>>>> dialects while preserving the program's exact behavior in all cases?
>>>> Building a tool for all possible dialects is again exponentially harder.
>>>>
>>>> Since we have established that exact semantics-preserving transformation
>>>> of SQL programs between dialects is impossible, you need to set up a goal
>>>> for the translation effort. What do you really hope to achieve? Then we
>>>> can discuss how it can be done.
>>>>
>>>> Mihai
>>>>
>>>> ________________________________
>>>> From: Yanjing Wang <[email protected]>
>>>> Sent: Wednesday, November 27, 2024 1:02 AM
>>>> To: [email protected] <[email protected]>
>>>> Subject: Re: [DISCUSS] Calcite as SQL translator, and dialect testing
>>>>
>>>> Hello, this discussion has been ongoing for a week. Let's move it
>>>> forward. Does anyone else have any suggestions?
>>>>
>>>> On Tue, Nov 19, 2024 at 14:47, Yanjing Wang <[email protected]>
>>>> wrote:
>>>>
>>>>> 1. RelToSqlConverterTest
>>>>> The class name implies that it tests conversion from RelNode to SQL,
>>>>> but now its RelNode comes from different dialects with a target SQL.
>>>>> It is difficult for me to understand the test case:
>>>>>
>>>>> @Test void testNullCollation() {
>>>>>   final String query = "select * from \"product\" order by \"brand_name\"";
>>>>>   final String expected = "SELECT *\n"
>>>>>       + "FROM \"foodmart\".\"product\"\n"
>>>>>       + "ORDER BY \"brand_name\"";
>>>>>   final String sparkExpected = "SELECT *\n"
>>>>>       + "FROM `foodmart`.`product`\n"
>>>>>       + "ORDER BY `brand_name` NULLS LAST";
>>>>>   sql(query)
>>>>>       .withPresto().ok(expected)
>>>>>       .withSpark().ok(sparkExpected);
>>>>> }
>>>>>
>>>>> Why does the Spark SQL have 'NULLS LAST' at the end? That information is
>>>>> missing if we don't add a source rel or a source dialect.
>>>>>
>>>>> 2. Dialect-to-dialect translation
>>>>> I think it's necessary. Dialect translation and materialized view
>>>>> substitution are common in the big data domain; it would be beneficial
>>>>> to make Calcite more user-friendly for these scenarios.
>>>>> Could we create end-to-end test cases that start with the source SQL of
>>>>> one dialect and end with the target SQL of another (or the same)
>>>>> dialect? We could also include user-defined materialized views in the
>>>>> process and perform result comparison.
>>>>>
>>>>> On Tue, Nov 19, 2024 at 07:21, Julian Hyde <[email protected]> wrote:
>>>>>
>>>>>> A recent case, https://issues.apache.org/jira/browse/CALCITE-6693,
>>>>>> "Add Source SQL Dialect to RelToSqlConverterTest", implies that people
>>>>>> are using Calcite to translate SQL from one dialect to another. The
>>>>>> test wanted to test translating a SQL string from Presto to Redshift.
>>>>>> I pushed back on that case (and its PR) because that test is for
>>>>>> translating RelNode to a SQL dialect, not about handling source
>>>>>> dialects.
>>>>>>
>>>>>> Dialect-to-dialect translation is undoubtedly something that people do
>>>>>> with Calcite. I think we should recognize that fact, and document how
>>>>>> someone can use Calcite as a translator.
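On the 'NULLS LAST' question raised earlier in the thread: for an ascending sort, engines in the Presto family put NULLs last by default, while Spark puts them first, so a faithful translation to Spark must spell the collation out. The toy unparser below illustrates the shape of that decision; the per-dialect defaults are assumptions for illustration and should be checked against each engine's documentation, and this is not Calcite's actual SqlDialect logic.

```java
public class OrderByNulls {
  enum NullDefault { NULLS_FIRST, NULLS_LAST }

  // Emits an ascending ORDER BY key that preserves "nulls last" semantics:
  // the explicit clause is only needed when the target dialect's default
  // says otherwise.
  static String orderKeyNullsLast(String column, NullDefault dialectDefault) {
    return dialectDefault == NullDefault.NULLS_LAST
        ? column
        : column + " NULLS LAST";
  }

  public static void main(String[] args) {
    // Presto-like dialect: an ascending sort already puts NULLs last.
    System.out.println(orderKeyNullsLast("\"brand_name\"", NullDefault.NULLS_LAST));
    // Spark-like dialect: NULLs come first by default, so be explicit.
    System.out.println(orderKeyNullsLast("`brand_name`", NullDefault.NULLS_FIRST));
  }
}
```

This matches the two expected strings in testNullCollation: the Presto output omits the clause, the Spark output carries `NULLS LAST`.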
>>>>>> When we have documented it, we can also add some tests.
>>>>>>
>>>>>> I am also worried about our dialect tests in general. The surface area
>>>>>> to be tested is huge, and the tests are added haphazardly, so while
>>>>>> many cases are tested there is a much greater set of cases that are
>>>>>> not tested. Consider, for example, how testSelectQueryWithGroupByEmpty
>>>>>> [1] tests against MySQL, Presto and StarRocks but not against
>>>>>> BigQuery, Snowflake or Postgres. If we want our SQL dialect support to
>>>>>> be high quality, we have to find a way to improve the coverage of our
>>>>>> tests. I logged https://issues.apache.org/jira/browse/CALCITE-5529
>>>>>> with some ideas, but I need help implementing it.
>>>>>>
>>>>>> Julian
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/calcite/blob/f2ec11fe7e23ecf2db903bc02c40609242993aad/core/src/test/java/org/apache/calcite/rel/rel2sql/RelToSqlConverterTest.java#L577
