[jira] [Commented] (SPARK-12662) Add a local sort operator to DataFrame used by randomSplit
[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086056#comment-15086056 ] Davies Liu commented on SPARK-12662: Another way to make the DataFrame deterministic is materialize it, via cache. > Add a local sort operator to DataFrame used by randomSplit > -- > > Key: SPARK-12662 > URL: https://issues.apache.org/jira/browse/SPARK-12662 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Reporter: Yin Huai >Assignee: Sameer Agarwal > > With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following > code will provide overlapped rows for two DFs returned by the randomSplit. > {code} > sqlContext.sql("drop table if exists test") > val x = sc.parallelize(1 to 210) > case class R(ID : Int) > sqlContext.createDataFrame(x.map > {R(_)}).write.format("json").saveAsTable("bugsc1597") > var df = sql("select distinct ID from test") > var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L) > a.registerTempTable("a") > b.registerTempTable("b") > val intersectDF = a.intersect(b) > intersectDF.show > {code} > The reason is that {{sql("select distinct ID from test")} does not guarantee > the ordering rows in a partition. It will be good to add a local sort > operator to make row ordering within a partition deterministic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12662) Add a local sort operator to DataFrame used by randomSplit
[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085764#comment-15085764 ] Yin Huai commented on SPARK-12662: -- btw, with local sort operator, we can make row ordering in a partition deterministic. However, if the partition that a row belongs to is not deterministic (e.g. the DF shuffles data randomly), the returned DFs of randomSplit still have this issue. Let's also add doc to the randomSplit to let users know that random shuffle should not be used for the input DF of randomSplit. > Add a local sort operator to DataFrame used by randomSplit > -- > > Key: SPARK-12662 > URL: https://issues.apache.org/jira/browse/SPARK-12662 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Reporter: Yin Huai >Assignee: Sameer Agarwal > > With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following > code will provide overlapped rows for two DFs returned by the randomSplit. > {code} > sqlContext.sql("drop table if exists test") > val x = sc.parallelize(1 to 210) > case class R(ID : Int) > sqlContext.createDataFrame(x.map > {R(_)}).write.format("json").saveAsTable("bugsc1597") > var df = sql("select distinct ID from test") > var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L) > a.registerTempTable("a") > b.registerTempTable("b") > val intersectDF = a.intersect(b) > intersectDF.show > {code} > The reason is that {{sql("select distinct ID from test")} does not guarantee > the ordering rows in a partition. It will be good to add a local sort > operator to make row ordering within a partition deterministic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12662) Add a local sort operator to DataFrame used by randomSplit
[ https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086770#comment-15086770 ] Apache Spark commented on SPARK-12662: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/10626 > Add a local sort operator to DataFrame used by randomSplit > -- > > Key: SPARK-12662 > URL: https://issues.apache.org/jira/browse/SPARK-12662 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Reporter: Yin Huai >Assignee: Sameer Agarwal > > With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following > code will provide overlapped rows for two DFs returned by the randomSplit. > {code} > sqlContext.sql("drop table if exists test") > val x = sc.parallelize(1 to 210) > case class R(ID : Int) > sqlContext.createDataFrame(x.map > {R(_)}).write.format("json").saveAsTable("bugsc1597") > var df = sql("select distinct ID from test") > var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L) > a.registerTempTable("a") > b.registerTempTable("b") > val intersectDF = a.intersect(b) > intersectDF.show > {code} > The reason is that {{sql("select distinct ID from test")} does not guarantee > the ordering rows in a partition. It will be good to add a local sort > operator to make row ordering within a partition deterministic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org