[ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zenglinxi updated SPARK-24607:
------------------------------
    Description:
Noticed that the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by a Kylin user building a cube with Hive SQL that includes distribute by rand(). It may happen during fault-tolerance operations. I think it is also a serious hidden problem in Spark SQL.

  was:
Noticed that the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by a Kylin user building a cube with Hive SQL that includes distribute by rand(). I think it is also a serious hidden problem in Spark SQL.

> Distribute by rand() can lead to data inconsistency
> ---------------------------------------------------
>
>                 Key: SPARK-24607
>                 URL: https://issues.apache.org/jira/browse/SPARK-24607
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0, 2.3.1
>            Reporter: zenglinxi
>            Priority: Major
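
The following is a possible mechanism behind the mismatch, offered as an assumption rather than a confirmed diagnosis. With distribute by rand(), a row's target partition depends on a fresh random draw. If a map task is rerun after a failure, the same row can land in a different partition than it did the first time. A reducer that already fetched output from the first attempt and a reducer that fetches from the retry may then see the same row twice, or miss it entirely. The toy simulation below is plain Python, not Spark code. The function names, the two-partition setup, and the scripted draws are all hypothetical and chosen only so the effect is easy to follow by hand.

```python
def map_task(rows, draws, num_partitions):
    """Route each row to a partition picked by a random draw in [0, 1),
    loosely mimicking `distribute by rand()` (illustrative only)."""
    buckets = [[] for _ in range(num_partitions)]
    for row, u in zip(rows, draws):
        buckets[int(u * num_partitions)].append(row)
    return buckets


def rows_seen_by_reducers(rows, first_draws, retry_draws, num_partitions=2):
    """Assumed failure scenario: reducer 0 fetches its bucket from the first
    map attempt, the map output is then lost, the map task is rerun with new
    draws, and the remaining reducers fetch from the rerun."""
    first = map_task(rows, first_draws, num_partitions)
    retry = map_task(rows, retry_draws, num_partitions)
    seen = list(first[0])
    for bucket in retry[1:]:
        seen.extend(bucket)
    return seen


rows = ["a", "b", "c", "d"]
first_draws = [0.1, 0.7, 0.3, 0.9]  # scripted stand-ins for rand()
retry_draws = [0.8, 0.6, 0.4, 0.6]  # different draws on the rerun

# Rerun with identical draws: every row is seen exactly once.
no_retry_change = rows_seen_by_reducers(rows, first_draws, first_draws)
# Rerun with different draws: row "a" moves to partition 1 and is counted twice.
after_retry = rows_seen_by_reducers(rows, first_draws, retry_draws)

print(len(no_retry_change), sorted(no_retry_change))  # 4 ['a', 'b', 'c', 'd']
print(len(after_retry), sorted(after_retry))          # 5 ['a', 'a', 'b', 'c', 'd']
```

In this sketch the count only changes when the rerun routes rows differently, which is why a deterministic distribution key would not show the same effect under the assumed scenario.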