[jira] [Commented] (SPARK-12662) Add a local sort operator to DataFrame used by randomSplit

2016-01-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086056#comment-15086056
 ] 

Davies Liu commented on SPARK-12662:


Another way to make the DataFrame deterministic is materialize it, via cache.

> Add a local sort operator to DataFrame used by randomSplit
> --
>
> Key: SPARK-12662
> URL: https://issues.apache.org/jira/browse/SPARK-12662
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Yin Huai
>Assignee: Sameer Agarwal
>
> With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following 
> code will provide overlapped rows for two DFs returned by the randomSplit. 
> {code}
> sqlContext.sql("drop table if exists test")
> val x = sc.parallelize(1 to 210)
> case class R(ID : Int)
> sqlContext.createDataFrame(x.map 
> {R(_)}).write.format("json").saveAsTable("bugsc1597")
> var df = sql("select distinct ID from test")
> var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
> a.registerTempTable("a")
> b.registerTempTable("b")
> val intersectDF = a.intersect(b)
> intersectDF.show
> {code}
> The reason is that {{sql("select distinct ID from test")} does not guarantee 
> the ordering rows in a partition. It will be good to add a local sort 
> operator to make row ordering within a partition deterministic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12662) Add a local sort operator to DataFrame used by randomSplit

2016-01-06 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085764#comment-15085764
 ] 

Yin Huai commented on SPARK-12662:
--

btw, with local sort operator, we can make row ordering in a partition 
deterministic. However, if the partition that a row belongs to is not 
deterministic (e.g. the DF shuffles data randomly), the returned DFs of 
randomSplit still have this issue. Let's also add doc to the randomSplit to let 
users know that random shuffle should not be used for the input DF of 
randomSplit.

> Add a local sort operator to DataFrame used by randomSplit
> --
>
> Key: SPARK-12662
> URL: https://issues.apache.org/jira/browse/SPARK-12662
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Yin Huai
>Assignee: Sameer Agarwal
>
> With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following 
> code will provide overlapped rows for two DFs returned by the randomSplit. 
> {code}
> sqlContext.sql("drop table if exists test")
> val x = sc.parallelize(1 to 210)
> case class R(ID : Int)
> sqlContext.createDataFrame(x.map 
> {R(_)}).write.format("json").saveAsTable("bugsc1597")
> var df = sql("select distinct ID from test")
> var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
> a.registerTempTable("a")
> b.registerTempTable("b")
> val intersectDF = a.intersect(b)
> intersectDF.show
> {code}
> The reason is that {{sql("select distinct ID from test")} does not guarantee 
> the ordering rows in a partition. It will be good to add a local sort 
> operator to make row ordering within a partition deterministic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12662) Add a local sort operator to DataFrame used by randomSplit

2016-01-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086770#comment-15086770
 ] 

Apache Spark commented on SPARK-12662:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/10626

> Add a local sort operator to DataFrame used by randomSplit
> --
>
> Key: SPARK-12662
> URL: https://issues.apache.org/jira/browse/SPARK-12662
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Yin Huai
>Assignee: Sameer Agarwal
>
> With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following 
> code will provide overlapped rows for two DFs returned by the randomSplit. 
> {code}
> sqlContext.sql("drop table if exists test")
> val x = sc.parallelize(1 to 210)
> case class R(ID : Int)
> sqlContext.createDataFrame(x.map 
> {R(_)}).write.format("json").saveAsTable("bugsc1597")
> var df = sql("select distinct ID from test")
> var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
> a.registerTempTable("a")
> b.registerTempTable("b")
> val intersectDF = a.intersect(b)
> intersectDF.show
> {code}
> The reason is that {{sql("select distinct ID from test")} does not guarantee 
> the ordering rows in a partition. It will be good to add a local sort 
> operator to make row ordering within a partition deterministic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org