Yin Huai created SPARK-12662:
--------------------------------

             Summary: Add document to randomSplit to explain the sampling 
depends on the ordering of the rows in a partition
                 Key: SPARK-12662
                 URL: https://issues.apache.org/jira/browse/SPARK-12662
             Project: Spark
          Issue Type: Bug
          Components: Documentation, SQL
            Reporter: Yin Huai


With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following code 
will provide overlapped rows for two DFs returned by the randomSplit. 

{code}
sqlContext.sql("drop table if exists test")
val x = sc.parallelize(1 to 210)
case class R(ID : Int)
sqlContext.createDataFrame(x.map 
{R(_)}).write.format("json").saveAsTable("bugsc1597")
var df = sql("select distinct ID from test")
var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
a.registerTempTable("a")
b.registerTempTable("b")
val intersectDF = a.intersect(b)
intersectDF.show
{code}

The reason is that {{sql("select distinct ID from test")} does not guarantee 
the ordering rows in a partition. It will be good to add more document to the 
api doc to explain it. To make intersectDF contain 0 row, the df needs to have 
fixed row ordering within a partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to