[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772 ]

Mitesh edited comment on SPARK-17867 at 5/18/17 1:48 PM:

I'm seeing a regression from this change: the last {{del <> 'hi'}} filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan]

{code:scala}
val df = Seq((1, 2, 3, "hi"), (1, 2, 4, "hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// The filter should not be pushed down to the local table scan.
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Liang-Chi Hsieh
> Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
> Dataset.dropDuplicates resolves each given column name to the first matching attribute in the Dataset's output. When more than one column shares that name, the remaining columns are put into the aggregation columns instead of the grouping columns. We should fix this.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
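Why the pushdown in the comment above is a correctness bug, not just a plan cosmetic: deduplicate-then-filter and filter-then-deduplicate are not interchangeable when the filter touches a non-grouping column. A minimal sketch in plain Scala, with no Spark dependency; `dedupByKey` is a hypothetical stand-in for the dropDuplicates aggregation (keep the first row seen per key), not a Spark API:

```scala
// Simulation of why pushing a filter below a dropDuplicates-style
// aggregation is unsafe when the filter uses a non-key column.
object FilterPushdownDemo {
  // Keep the first row for each key, as dropDuplicates effectively does.
  def dedupByKey[A, K](rows: Seq[A])(key: A => K): Seq[A] =
    rows.groupBy(key).values.map(_.head).toSeq

  // Same key, different values in the filtered column.
  val rows = Seq((1, "hi"), (1, "bye"))

  // Correct plan: deduplicate first, then apply the filter.
  // dedup keeps (1, "hi"); the filter then removes it.
  val correctOrder: Seq[(Int, String)] =
    dedupByKey(rows)(_._1).filter(_._2 != "hi")

  // Pushed-down plan: filter first, then deduplicate.
  // The filter removes (1, "hi"), so (1, "bye") survives dedup.
  val pushedDown: Seq[(Int, String)] =
    dedupByKey(rows.filter(_._2 != "hi"))(_._1)
}
```

The two orderings produce different results (an empty result versus one row), which is why the optimizer must not push a predicate on a non-grouping column past the aggregation.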
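The issue description boils down to a first-match versus all-matches resolution problem. A sketch of that distinction in plain Scala; `Attr` is a hypothetical model of a resolved attribute (name plus unique expression id), not the actual Catalyst code:

```scala
// Model of the dropDuplicates column-resolution bug: resolving the given
// column names against a Dataset output that contains duplicate names.
object DropDuplicatesResolution {
  final case class Attr(name: String, id: Int)

  // Dataset output with two columns that share the name "vk".
  val output = Seq(Attr("userid", 1), Attr("vk", 2), Attr("vk", 3))

  // Buggy resolution: `find` keeps only the first attribute with a
  // matching name, so Attr("vk", 3) ends up in the aggregation columns.
  def groupColsBuggy(names: Seq[String]): Seq[Attr] =
    names.flatMap(n => output.find(_.name == n))

  // Fixed resolution: `filter` keeps every attribute with a matching
  // name, so both "vk" attributes become grouping columns.
  def groupColsFixed(names: Seq[String]): Seq[Attr] =
    names.flatMap(n => output.filter(_.name == n))
}
```

With `dropDuplicates("vk")` on such a Dataset, the buggy version groups by only one of the two "vk" attributes, which is the behavior the issue reports.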