[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086770#comment-15086770 ]
Apache Spark commented on SPARK-12662: -------------------------------------- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/10626 > Add a local sort operator to DataFrame used by randomSplit > ---------------------------------------------------------- > > Key: SPARK-12662 > URL: https://issues.apache.org/jira/browse/SPARK-12662 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL > Reporter: Yin Huai > Assignee: Sameer Agarwal > > With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following > code will provide overlapped rows for two DFs returned by the randomSplit. > {code} > sqlContext.sql("drop table if exists test") > val x = sc.parallelize(1 to 210) > case class R(ID : Int) > sqlContext.createDataFrame(x.map > {R(_)}).write.format("json").saveAsTable("bugsc1597") > var df = sql("select distinct ID from test") > var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L) > a.registerTempTable("a") > b.registerTempTable("b") > val intersectDF = a.intersect(b) > intersectDF.show > {code} > The reason is that {{sql("select distinct ID from test")} does not guarantee > the ordering rows in a partition. It will be good to add a local sort > operator to make row ordering within a partition deterministic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org