[jira] [Commented] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016021#comment-16016021 ]

Mitesh commented on SPARK-17867:

Ah I see, thanks [~viirya]. The repartitionByColumns is just a shortcut method I created. But I do have some aliasing code changes compared to 2.1; I will try to remove those and see if that is what's breaking it.

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Liang-Chi Hsieh
> Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
> We find and get the first resolved attribute from output with the given
> column name in Dataset.dropDuplicates. When we have more than one column
> with the same name, the other columns are put into aggregation columns
> instead of grouping columns. We should fix this.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
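The behaviour described in the quoted issue can be sketched with a DataFrame whose two columns share a name. This is an illustration only, not code from the issue or the pull request: the object name `SameNameDropDuplicates`, the helper `sameNameFrame`, and the sample rows are assumptions made for this sketch, and it assumes a Spark 2.x or later build on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical reproduction sketch for the same-column-name case.
object SameNameDropDuplicates {

  // Builds a DataFrame whose two columns deliberately share the name "a".
  def sameNameFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1, 1), (1, 2), (1, 2)).toDF("a", "a")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-17867-sketch")
      .getOrCreate()

    // dropDuplicates() with no arguments uses every column name as a key,
    // so the name "a" is looked up twice here.
    val deduped = sameNameFrame(spark).dropDuplicates()

    // Print the plans to see which attributes ended up as grouping keys.
    deduped.explain(true)
    println(s"rows after dropDuplicates: ${deduped.count()}")

    spark.stop()
  }
}
```

If every column named `a` is used as a grouping key, the rows (1, 1) and (1, 2) should both survive. If only the first `a` were resolved, the second `a` would instead be collapsed through a `first` aggregate, which is the situation the issue describes. The `explain(true)` output shows which of the two happens on a given build.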
[jira] [Commented] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015852#comment-16015852 ]

Liang-Chi Hsieh commented on SPARK-17867:

The above example code doesn't compile against the current codebase. There is no repartitionByColumns, only repartition.

{code}
val df = Seq((1, 2, 3, "hi"), (1, 2, 4, "hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartition($"userid")
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartition($"userid")
  .sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")
{code}

The optimized plan looks like:

{code}
Sort [userid#9 ASC NULLS FIRST], false
+- RepartitionByExpression [userid#9], 5
   +- Filter (isnotnull(del#12) && NOT (del#12 = hi))
      +- Aggregate [eventid#10], [first(userid#9, false) AS userid#9, eventid#10, first(vk#11, false) AS vk#11, first(del#12, false) AS del#12]
         +- Sort [userid#9 ASC NULLS FIRST, eventid#10 ASC NULLS FIRST, vk#11 DESC NULLS LAST], false
            +- RepartitionByExpression [userid#9], 5
               +- LocalRelation [userid#9, eventid#10, vk#11, del#12]
{code}

The spark plan looks like:

{code}
Sort [userid#9 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(userid#9, 5)
   +- Filter (isnotnull(del#12) && NOT (del#12 = hi))
      +- SortAggregate(key=[eventid#10], functions=[first(userid#9, false), first(vk#11, false), first(del#12, false)], output=[userid#9, eventid#10, vk#11, del#12])
         +- SortAggregate(key=[eventid#10], functions=[partial_first(userid#9, false), partial_first(vk#11, false), partial_first(del#12, false)], output=[eventid#10, first#35, valueSet#36, first#37, valueSet#38, first#39, valueSet#40])
            +- Sort [userid#9 ASC NULLS FIRST, eventid#10 ASC NULLS FIRST, vk#11 DESC NULLS LAST], false, 0
               +- Exchange hashpartitioning(userid#9, 5)
                  +- LocalTableScan [userid#9, eventid#10, vk#11, del#12]
{code}

It looks like the "del <> 'hi'" filter isn't pushed down?
[jira] [Commented] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015772#comment-16015772 ]

Mitesh commented on SPARK-17867:

I'm seeing a regression from this change: the last filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan]

{code:scala}
import org.apache.spark.sql.execution.{FilterExec, LocalTableScanExec}
import org.apache.spark.sql.functions.{asc, desc}

val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))  // custom shortcut method, not a Spark API
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}
[jira] [Commented] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564617#comment-15564617 ]

Apache Spark commented on SPARK-17867:

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15427