[jira] [Commented] (SPARK-13496) Optimizing count distinct changes the resulting column name

Xiao Li (JIRA) Sun, 28 Feb 2016 17:32:48 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-13496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171289#comment-15171289
 ]


Xiao Li commented on SPARK-13496:
---------------------------------

If you use the latest Spark 2.0 upstream, you will not hit this problem. 

{code}
Seq((1, "s")).toDF("a", "b").agg(countDistinct("a")).printSchema()

root
 |-- count(DISTINCT a): long (nullable = false)
{code}

> Optimizing count distinct changes the resulting column name
> -----------------------------------------------------------
>
>                 Key: SPARK-13496
>                 URL: https://issues.apache.org/jira/browse/SPARK-13496
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Ryan Blue
>
> SPARK-9241 updated the optimizer to rewrite count distinct. That change uses 
> a count that is no longer distinct because duplicates are eliminated further 
> down in the plan. This caused the name of the column to change:
> {code:title=Spark 1.5.2}
> scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a"))
> res0: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT a): bigint]
> == Physical Plan ==
> TungstenAggregate(key=[], 
> functions=[(count(a#7),mode=Complete,isDistinct=true)], 
> output=[COUNT(DISTINCT a)#9L])
>  TungstenAggregate(key=[a#7], functions=[], output=[a#7])
>   TungstenExchange SinglePartition
>    TungstenAggregate(key=[a#7], functions=[], output=[a#7])
>     LocalTableScan [a#7], [[1]]
> {code}
> {code:title=Spark 1.6.0}
> scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a"))
> res0: org.apache.spark.sql.DataFrame = [count(a): bigint]
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else 
> null),mode=Final,isDistinct=false)], output=[count(a)#31L])
> +- TungstenExchange SinglePartition, None
>    +- TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else 
> null),mode=Partial,isDistinct=false)], output=[count#39L])
>       +- TungstenAggregate(key=[a#36,gid#35], functions=[], 
> output=[a#36,gid#35])
>          +- TungstenExchange hashpartitioning(a#36,gid#35,500), None
>             +- TungstenAggregate(key=[a#36,gid#35], functions=[], 
> output=[a#36,gid#35])
>                +- Expand [List(a#29, 1)], [a#36,gid#35]
>                   +- LocalTableScan [a#29], [[1]]
> {code}
> This has broken jobs that used the generated name. For example, 
> {{withColumnRenamed("COUNT(DISTINCT a)", "c")}}.
> I think that the previous generated name is correct, even though the plan has 
> changed.
> [~marmbrus], you may want to take a look. It looks like you reviewed 
> SPARK-9241 and have some context here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13496) Optimizing count distinct changes the resulting column name

Reply via email to