[
https://issues.apache.org/jira/browse/KYLIN-5742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795195#comment-17795195
]
zhong.zhu commented on KYLIN-5742:
----------------------------------
h1. Root Cause
1. The GROUP BY a, b GROUPING SETS (...) syntax is not supported by Calcite's
default parser and is rejected as illegal, unlike in HIVE and SPARK.
2. Kylin returns results that are inconsistent with SPARK because Calcite's
processing logic for GROUPING SETS does not match Spark SQL's semantics, and
the optimization rule `AggregateMultipleExpandRule` introduced by Kylin also
affects the returned results (see the sketch below).
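For illustration only: assuming Spark's documented GROUP BY semantics, where grouping columns listed before GROUPING SETS are added to every grouping set, sql2 from the issue description is expected to behave roughly like the rewrite below. This is a sketch of the semantics, not the plan either engine actually produces.
{code:sql}
-- Sketch: under Spark semantics, the four plain GROUP BY columns of sql2
-- are added to each grouping set, so all three sets collapse to the same
-- full column list (i.e. duplicate grouping sets).
select C_NAME,C_CITY,C_NATION,C_REGION,count(*)
FROM SSB.LINEORDER as LINEORDER
INNER JOIN SSB.CUSTOMER as CUSTOMER
ON LINEORDER.LO_CUSTKEY = CUSTOMER.C_CUSTKEY
where C_NATION = 'CHINA' and C_CITY = 'CHINA 0'
group by
GROUPING SETS (
  (C_NAME,C_CITY,C_NATION,C_REGION),  -- ()                  + GROUP BY columns
  (C_NAME,C_CITY,C_NATION,C_REGION),  -- (C_NAME,C_CITY)     + GROUP BY columns
  (C_NAME,C_CITY,C_NATION,C_REGION))  -- (C_NATION,C_REGION) + GROUP BY columns
order by C_NAME;
{code}
Whether such duplicate grouping sets are kept (one output row per set per group) or collapsed into a single set is exactly the kind of divergence the issue title describes, and it is where `AggregateMultipleExpandRule` comes into play.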
h1. Dev Design
Borrowing from HIVE's syntax definition of GROUPING SETS, modify Calcite's
grammar file and align it with Spark's syntax definition, mainly around the
context definitions of GROUP BY and GROUPING SETS.
h2. Extending the default SqlParserImpl syntax file (e.g. ExtendedSqlParserImpl)
The grammar file extends the GroupBy SqlNode by following Spark's grammar
definitions, and adds two new variables that hold the GROUP BY elements and
the GROUPING SETS expressions obtained from parsing, so that the planning
phase can generate the Aggregate RelNode according to each scenario.
All SQL parsers used by Kylin need to be replaced with the
ExtendedSqlParserImpl implementation so that the extended GROUP BY /
GROUPING SETS syntax is supported everywhere.
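For reference, the additional form the extended parser has to accept is the HIVE/SPARK-style clause in which GROUPING SETS follows the GROUP BY columns without a separating comma (sql3 in the issue description). A minimal sketch against the SSB.CUSTOMER table used in the issue:
{code:sql}
-- Target syntax (HIVE/SPARK style, no comma before GROUPING SETS);
-- rejected by Calcite's stock grammar, accepted by the extended one.
select C_NATION, C_REGION, count(*)
from SSB.CUSTOMER
group by C_NATION, C_REGION
GROUPING SETS ((C_NATION), (C_NATION, C_REGION));
{code}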
h2. Extending GROUP BY-related implementation logic to align SPARK/HIVE
semantics
That is, support the set semantics of GROUPING SETS, where the query is
evaluated once per grouping set.
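As a reference point for those semantics, a GROUPING SETS query can be read as the UNION ALL of one GROUP BY query per grouping set, with columns absent from a set padded with NULL. A simplified sketch over SSB.CUSTOMER (the NULL cast width is chosen only for illustration):
{code:sql}
-- GROUPING SETS form
select C_NATION, C_REGION, count(*)
from SSB.CUSTOMER
group by GROUPING SETS ((C_NATION), (C_NATION, C_REGION));

-- Roughly equivalent expansion: one GROUP BY per grouping set,
-- combined with UNION ALL, missing columns returned as NULL.
select C_NATION, cast(null as varchar(255)) as C_REGION, count(*)
from SSB.CUSTOMER
group by C_NATION
union all
select C_NATION, C_REGION, count(*)
from SSB.CUSTOMER
group by C_NATION, C_REGION;
{code}
Because the expansion uses UNION ALL rather than UNION, duplicate grouping sets yield duplicate groups; keeping that behavior aligned with SPARK is the goal of this change.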
h1. Major changes
1. Calcite syntax definition file: support the GROUP BY ... GROUPING SETS
syntax used by HIVE & SPARK.
2. Considering compatibility with the Calcite community version, cherry-pick
the related code in order to fix the incorrect GROUPING SETS query results in
Kylin 4.6.X.
> When the "Group by" group has duplicate values, the result of Grouping Set
> query is inconsistent with that in SparkSQL
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: KYLIN-5742
> URL: https://issues.apache.org/jira/browse/KYLIN-5742
> Project: Kylin
> Issue Type: Bug
> Affects Versions: 5.0-beta
> Reporter: zhong.zhu
> Assignee: zhong.zhu
> Priority: Major
> Fix For: 5.0.0
>
> Attachments: image-2023-12-11-14-54-38-652.png,
> image-2023-12-11-14-55-46-222.png, image-2023-12-11-14-57-32-037.png,
> image-2023-12-11-14-57-56-771.png
>
>
> {code:sql}
> -- sql1
> select C_NAME,C_CITY,C_NATION,C_REGION,count(*)
> FROM SSB.LINEORDER as LINEORDER
> INNER JOIN SSB.CUSTOMER as CUSTOMER
> ON LINEORDER.LO_CUSTKEY = CUSTOMER.C_CUSTKEY
> where C_NATION = 'CHINA' and C_CITY = 'CHINA 0'
> group by
> GROUPING SETS ((),(C_NAME,C_CITY),(C_NATION,C_REGION))
> order by C_NAME;
> -- sql2
> select C_NAME,C_CITY,C_NATION,C_REGION,count(*)
> FROM SSB.LINEORDER as LINEORDER
> INNER JOIN SSB.CUSTOMER as CUSTOMER
> ON LINEORDER.LO_CUSTKEY = CUSTOMER.C_CUSTKEY
> where C_NATION = 'CHINA' and C_CITY = 'CHINA 0'
> group by
> C_NAME,C_CITY,C_NATION,C_REGION,
> GROUPING SETS ((),(C_NAME,C_CITY),(C_NATION,C_REGION))
> order by C_NAME;
> -- sql3
> select C_NAME,C_CITY,C_NATION,C_REGION,count(*)
> FROM SSB.LINEORDER as LINEORDER
> INNER JOIN SSB.CUSTOMER as CUSTOMER
> ON LINEORDER.LO_CUSTKEY = CUSTOMER.C_CUSTKEY
> where C_NATION = 'CHINA' and C_CITY = 'CHINA 0'
> group by
> C_NAME,C_CITY,C_NATION,C_REGION
> GROUPING SETS ((),(C_NAME,C_CITY),(C_NATION,C_REGION))
> order by C_NAME
> {code}
> In spark-sql, the query results of sql1 and sql3 are consistent, as follows:
> !image-2023-12-11-14-54-38-652.png!
> In spark-sql, the query result of sql2 is as follows:
> !image-2023-12-11-14-55-46-222.png!
> In KYLIN, the query result of sql1 is as follows, consistent with the
> spark-sql sql1 result:
> !image-2023-12-11-14-57-32-037.png!
> The query result of sql2 is as follows, which is inconsistent with the
> spark-sql sql2 result:
> !image-2023-12-11-14-57-56-771.png!
> The syntax of sql3 is not supported in KYLIN.
> Hive does not support a comma before GROUPING SETS, that is, sql2 is not
> supported; the query results of sql1 and sql3 in Hive are consistent with
> spark-sql.