[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yin Huai updated SPARK-12662: ----------------------------- Assignee: Sameer Agarwal > Add document to randomSplit to explain the sampling depends on the ordering > of the rows in a partition > ------------------------------------------------------------------------------------------------------ > > Key: SPARK-12662 > URL: https://issues.apache.org/jira/browse/SPARK-12662 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL > Reporter: Yin Huai > Assignee: Sameer Agarwal > > With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following > code will provide overlapped rows for two DFs returned by the randomSplit. > {code} > sqlContext.sql("drop table if exists test") > val x = sc.parallelize(1 to 210) > case class R(ID : Int) > sqlContext.createDataFrame(x.map > {R(_)}).write.format("json").saveAsTable("bugsc1597") > var df = sql("select distinct ID from test") > var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L) > a.registerTempTable("a") > b.registerTempTable("b") > val intersectDF = a.intersect(b) > intersectDF.show > {code} > The reason is that {{sql("select distinct ID from test")} does not guarantee > the ordering rows in a partition. It will be good to add more document to the > api doc to explain it. To make intersectDF contain 0 row, the df needs to > have fixed row ordering within a partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org