[jira] [Commented] (SPARK-40741) spark项目bin/beeline对于distribute by sort by语句支持不好,输出结果错误
[ https://issues.apache.org/jira/browse/SPARK-40741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618044#comment-17618044 ] Yuming Wang commented on SPARK-40741: - [~lkqqingcao] How to reproduce this issue? > spark项目bin/beeline对于distribute by sort by语句支持不好,输出结果错误 > -- > > Key: SPARK-40741 > URL: https://issues.apache.org/jira/browse/SPARK-40741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: spark 2.4.5 > hive 3.0 >Reporter: kaiqingli >Priority: Major > > The results are inconsistent in part with the same sql statement that > contains 'distribute by sort by ' between > spark-beeline($SPARK_HOME/bin/beeline) and hive-beeline, the result of > hive-beeline is expected and the result of spark-beeline is error.sql > statement: > select count(1) all_cnt, > sum(if(cmc_cellvoltage != new_cell_voltage, 1, 0)) ne_cnt > from ( > select vin, > samplingtimesec, > cmc_cellvoltage, > concat('[', concat_ws(',', collect_list(cell_voltage)), ']') new_cell_voltage > from ( > select vin, samplingtimesec, cmc_cellvoltage, cell_index, cell_voltage > from ( > select vin, samplingtimesec, cmc_cellvoltage, cell_index, cell_voltage > from ( > select vin, > samplingtimesec, > cmc_cellvoltage,–[1,2,3,4...,111,112] > row_number() over (partition by vin,samplingtimesec order by samplingtimesec) > r > from table_name > WHERE dt = '20221007' > and samplingtimesec <= 166507920 > ) tmp > lateral view posexplode(split(replace(replace(cmc_cellvoltage, '[', ''), ']', > ''), ',')) v0 as cell_index, cell_voltage > where r = 1 > ) tmp > distribute by vin > , samplingtimesec sort by cell_index > ) tmp > group by vin, samplingtimesec, cmc_cellvoltage > ) tmp; > hive-beeline result: 5682904 , 0 > spark-beeline result: 5682904 , 5613492 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40741) spark项目bin/beeline对于distribute by sort by语句支持不好,输出结果错误
[ https://issues.apache.org/jira/browse/SPARK-40741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616166#comment-17616166 ] Hyukjin Kwon commented on SPARK-40741: -- [~lkqqingcao] would be great to file an issue in English because there are many maintainers who don't speak other languages. > spark项目bin/beeline对于distribute by sort by语句支持不好,输出结果错误 > -- > > Key: SPARK-40741 > URL: https://issues.apache.org/jira/browse/SPARK-40741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 > Environment: spark 3.1 > hive 3.0 >Reporter: kaiqingli >Priority: Major > > sql中使用distribute by ... sort by > ...时,通过spark/bin/beeline执行的结果错误,使用hive/beeline输出结果正确,具体场景为,先基于posexplode拆分array数据,然后基于拆分的下标进行sort > by,之后再collect list,结果与原始的array结果不一致,sql如下: > select id, > samplingtimesec, > array_data = new_array_data flag, > array_data, > new_array_data > from ( > select id, > samplingtimesec, > array_data, > concat('[', concat_ws(',', collect_list(cell_voltage)), ']') new_array_data > from ( > select id, samplingtimesec, array_data, cell_index, cell_voltage > from ( > select id, > samplingtimesec, > array_data,--格式[1,2,3,4,5] > row_number() over (partition by id,samplingtimesec order by samplingtimesec) > r --去重 > from table > WHERE dt = '20221007' > and samplingtimesec <= 166507920 > ) tmp > lateral view posexplode(split(replace(replace(array_data, '[', ''), ']', ''), > ',')) v0 as cell_index, cell_voltage > where r = 1 > distribute by id > , samplingtimesec sort by cell_index > ) tmp > group by id, samplingtimesec, array_data > ) tmp > where array_data != new_array_data; > 以上sql,对于hive/beeline输出结果为0条; > 对于spark/beeline输出结果不为0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org