[jira] [Commented] (SPARK-40741) Spark's bin/beeline handles "distribute by ... sort by" statements poorly; the output is wrong

2022-10-15 Thread Yuming Wang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-40741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618044#comment-17618044 ]

Yuming Wang commented on SPARK-40741:
-

[~lkqqingcao] How to reproduce this issue?

> Spark's bin/beeline handles "distribute by ... sort by" statements poorly; the output is wrong
> --
>
> Key: SPARK-40741
> URL: https://issues.apache.org/jira/browse/SPARK-40741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: spark 2.4.5
> hive 3.0
>Reporter: kaiqingli
>Priority: Major
>
> The same SQL statement containing 'distribute by ... sort by ...' produces
> partially inconsistent results between spark-beeline
> ($SPARK_HOME/bin/beeline) and hive-beeline: the hive-beeline result is as
> expected, while the spark-beeline result is wrong. SQL statement:
> select count(1) all_cnt,
>        sum(if(cmc_cellvoltage != new_cell_voltage, 1, 0)) ne_cnt
> from (
>   select vin,
>          samplingtimesec,
>          cmc_cellvoltage,
>          concat('[', concat_ws(',', collect_list(cell_voltage)), ']') new_cell_voltage
>   from (
>     select vin, samplingtimesec, cmc_cellvoltage, cell_index, cell_voltage
>     from (
>       select vin, samplingtimesec, cmc_cellvoltage, cell_index, cell_voltage
>       from (
>         select vin,
>                samplingtimesec,
>                cmc_cellvoltage, -- format: [1,2,3,4,...,111,112]
>                row_number() over (partition by vin, samplingtimesec order by samplingtimesec) r
>         from table_name
>         where dt = '20221007'
>           and samplingtimesec <= 166507920
>       ) tmp
>       lateral view posexplode(split(replace(replace(cmc_cellvoltage, '[', ''), ']', ''), ',')) v0 as cell_index, cell_voltage
>       where r = 1
>     ) tmp
>     distribute by vin, samplingtimesec sort by cell_index
>   ) tmp
>   group by vin, samplingtimesec, cmc_cellvoltage
> ) tmp;
> hive-beeline result: 5682904, 0
> spark-beeline result: 5682904, 5613492
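
A minimal, self-contained sketch of the reported pattern, using hypothetical
inline data (the reporter's table is not available), runnable in spark-sql or
beeline. With this little data the mismatch may not surface, since it likely
needs enough rows to force a real shuffle. For background, Spark documents
collect_list as non-deterministic: element order depends on the row order
reaching the aggregate, which is not guaranteed after a shuffle, and SORT BY
only orders rows within each partition.

with src as (
  -- hypothetical stand-in for table_name: array data stored as a bracketed string
  select vin, samplingtimesec, cmc_cellvoltage
  from values (1, 100, '[1,2,3,4,5]'),
              (2, 100, '[9,8,7,6,5]') as t(vin, samplingtimesec, cmc_cellvoltage)
)
select count(1) all_cnt,
       sum(if(cmc_cellvoltage != new_cell_voltage, 1, 0)) ne_cnt
from (
  select vin, samplingtimesec, cmc_cellvoltage,
         concat('[', concat_ws(',', collect_list(cell_voltage)), ']') new_cell_voltage
  from (
    -- explode the bracketed string into (index, value) pairs, then try to
    -- restore per-group element order with distribute by / sort by
    select vin, samplingtimesec, cmc_cellvoltage, cell_index, cell_voltage
    from src
    lateral view posexplode(split(replace(replace(cmc_cellvoltage, '[', ''), ']', ''), ',')) v0 as cell_index, cell_voltage
    distribute by vin, samplingtimesec sort by cell_index
  ) tmp
  group by vin, samplingtimesec, cmc_cellvoltage
) tmp;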






[jira] [Commented] (SPARK-40741) Spark's bin/beeline handles "distribute by ... sort by" statements poorly; the output is wrong

2022-10-11 Thread Hyukjin Kwon (Jira)


[ https://issues.apache.org/jira/browse/SPARK-40741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616166#comment-17616166 ]

Hyukjin Kwon commented on SPARK-40741:
--

[~lkqqingcao] it would be great to file the issue in English, because many
maintainers don't speak other languages.

> Spark's bin/beeline handles "distribute by ... sort by" statements poorly; the output is wrong
> --
>
> Key: SPARK-40741
> URL: https://issues.apache.org/jira/browse/SPARK-40741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: spark 3.1
> hive 3.0
>Reporter: kaiqingli
>Priority: Major
>
> When a SQL statement uses distribute by ... sort by ..., the result produced
> through spark/bin/beeline is wrong, while hive/beeline produces the correct
> result. The specific scenario: first split array data with posexplode, then
> sort by the split index, then collect_list; the result is inconsistent with
> the original array. The SQL is as follows:
> select id,
>        samplingtimesec,
>        array_data = new_array_data flag,
>        array_data,
>        new_array_data
> from (
>   select id,
>          samplingtimesec,
>          array_data,
>          concat('[', concat_ws(',', collect_list(cell_voltage)), ']') new_array_data
>   from (
>     select id, samplingtimesec, array_data, cell_index, cell_voltage
>     from (
>       select id,
>              samplingtimesec,
>              array_data, -- format: [1,2,3,4,5]
>              row_number() over (partition by id, samplingtimesec order by samplingtimesec) r -- deduplicate
>       from table
>       where dt = '20221007'
>         and samplingtimesec <= 166507920
>     ) tmp
>     lateral view posexplode(split(replace(replace(array_data, '[', ''), ']', ''), ',')) v0 as cell_index, cell_voltage
>     where r = 1
>     distribute by id, samplingtimesec sort by cell_index
>   ) tmp
>   group by id, samplingtimesec, array_data
> ) tmp
> where array_data != new_array_data;
> For the SQL above, hive/beeline returns 0 rows;
> spark/beeline returns a non-zero number of rows.
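
One possible workaround sketch, not taken from the report: make the element
order part of the aggregate itself instead of relying on SORT BY ordering
surviving the GROUP BY shuffle. The inline row and the exploded CTE are
hypothetical stand-ins for the posexplode subquery above; struct, array_sort,
and transform are available in Spark 2.4+.

with exploded as (
  -- hypothetical stand-in for the posexplode subquery in the report
  select id, samplingtimesec, array_data, cell_index, cell_voltage
  from values (1, 100, '[3,1,2]') as t(id, samplingtimesec, array_data)
  lateral view posexplode(split(replace(replace(array_data, '[', ''), ']', ''), ',')) v0 as cell_index, cell_voltage
)
select id,
       samplingtimesec,
       array_data,
       -- sort by cell_index inside the aggregate, then project the values back
       -- out, so element order no longer depends on the row order that happens
       -- to reach collect_list
       concat('[',
              concat_ws(',',
                        transform(array_sort(collect_list(struct(cell_index, cell_voltage))),
                                  x -> x.cell_voltage)),
              ']') new_array_data
from exploded
group by id, samplingtimesec, array_data;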


