[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name

2017-05-18 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015772#comment-16015772
 ] 

Mitesh edited comment on SPARK-17867 at 5/18/17 1:48 PM:
-

I'm seeing a regression from this change: the last {{del <> 'hi'}} filter gets 
pushed down past the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:scala}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}



> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same 
> column name
> 
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> In Dataset.dropDuplicates we find the first resolved attribute in the output 
> matching each given column name. When more than one column has the same name, 
> the other columns are put into the aggregation columns instead of the grouping 
> columns. We should fix this.
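The failure mode described above can be sketched without Spark. The following is a hypothetical, minimal model of the name-resolution step (the {{Attr}} type and value names are illustrative, not Spark internals): resolving each requested column name with a first-match lookup silently drops the other same-named columns from the grouping keys, whereas a filter-based lookup keeps all of them.

```scala
// Minimal model of resolving grouping columns by name (illustrative only).
case class Attr(name: String, id: Int)

// A Dataset output with two distinct columns that share the name "a".
val output    = Seq(Attr("a", 1), Attr("a", 2), Attr("b", 3))
val requested = Seq("a")

// Pre-fix behavior: take only the FIRST attribute matching each name,
// so Attr("a", 2) falls out of the grouping columns.
val groupingBuggy = requested.flatMap(n => output.find(_.name == n))

// Fixed behavior: keep EVERY attribute whose name matches.
val groupingFixed = requested.flatMap(n => output.filter(_.name == n))
```

Under this model, {{groupingBuggy}} contains one attribute while {{groupingFixed}} contains both same-named attributes, matching the fix described in this issue.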



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


