[ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24607.
----------------------------------
    Resolution: Incomplete

> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: SPARK-24607
>                 URL: https://issues.apache.org/jira/browse/SPARK-24607
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0, 2.3.1
>            Reporter: zenglinxi
>            Priority: Major
>              Labels: bulk-closed
>
> Noticed that the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build a cube with
> Hive SQL that includes distribute by rand(). Data inconsistency may occur
> during fault-tolerance operations. Since Spark has a similar fault-tolerance
> mechanism, I think this is also a serious hidden problem in Spark SQL.
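
The report does not spell out the failure path, so the following is only a toy model of one way a retry could change the row count. It is not Spark's or Hive's shuffle code. It assumes two reducers, a single map task whose output is lost after reducer 0 has already read it, and a re-run that draws fresh random partition keys. All names (partition, run_with_retry, random_keys) are hypothetical.

```python
import random


def partition(rows, num_reducers, key_fn):
    """Split rows into per-reducer buckets using key_fn(row) -> int."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[key_fn(row) % num_reducers].append(row)
    return buckets


def run_with_retry(rows, random_keys):
    """Model a map task that is re-run after reducer 0 consumed its output.

    Assumption: reducer 0 keeps what it read from the first attempt, while
    reducer 1 reads from the second attempt.
    """
    def attempt(seed):
        rng = random.Random(seed)  # a new attempt draws new random numbers
        if random_keys:
            key_fn = lambda row: rng.randrange(2)  # like "distribute by rand()"
        else:
            key_fn = lambda row: row  # deterministic key derived from the row
        return partition(rows, 2, key_fn)

    first = attempt(seed=1)
    reducer0 = first[0]        # read before the map output was lost
    second = attempt(seed=2)   # map task re-executed after the failure
    reducer1 = second[1]       # read from the re-executed attempt
    return reducer0 + reducer1


rows = list(range(1000))
stable = run_with_retry(rows, random_keys=False)
unstable = run_with_retry(rows, random_keys=True)

lost = set(rows) - set(unstable)
duplicated = {r for r in unstable if unstable.count(r) > 1}
print("deterministic key, rows after retry:", len(stable))
print("random key, rows after retry:", len(unstable))
print("lost:", len(lost), "duplicated:", len(duplicated))
```

With a deterministic key, both attempts send each row to the same reducer, so the retry is harmless. With a random key, a row that moved from reducer 1 to reducer 0 between attempts is never read, and a row that moved the other way is read twice, which is the kind of discrepancy a count(*) would expose.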