[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin resolved SPARK-12662. --------------------------------- Resolution: Fixed Fix Version/s: 2.0.0 1.6.1 > Add a local sort operator to DataFrame used by randomSplit > ---------------------------------------------------------- > > Key: SPARK-12662 > URL: https://issues.apache.org/jira/browse/SPARK-12662 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL > Reporter: Yin Huai > Assignee: Sameer Agarwal > Fix For: 1.6.1, 2.0.0 > > > With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following > code will provide overlapped rows for two DFs returned by the randomSplit. > {code} > sqlContext.sql("drop table if exists test") > val x = sc.parallelize(1 to 210) > case class R(ID : Int) > sqlContext.createDataFrame(x.map > {R(_)}).write.format("json").saveAsTable("bugsc1597") > var df = sql("select distinct ID from test") > var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L) > a.registerTempTable("a") > b.registerTempTable("b") > val intersectDF = a.intersect(b) > intersectDF.show > {code} > The reason is that {{sql("select distinct ID from test")} does not guarantee > the ordering rows in a partition. It will be good to add a local sort > operator to make row ordering within a partition deterministic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org