[jira] [Commented] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators

2019-01-27 Thread Xiao Li (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753748#comment-16753748 ]

Xiao Li commented on SPARK-26741:
---------------------------------

Could we re-enable the test by making the test case valid? That is, we would just 
need to remove the aggregate function from the order by clause?
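
For example, something like this (a sketch; the {{max_id}} alias is mine), ordering 
by the aggregate's output column instead of repeating the aggregate function in the 
order by clause:
{code:scala}
// Possible rewrite of the test query (sketch): alias the aggregate and sort on
// the alias, so the Sort only references an attribute that the Aggregate below
// it already produces.
spark.sql(
  """select max(id) as max_id
    |from range(10)
    |group by id
    |having count(1) >= 1
    |order by max_id""".stripMargin).show()
{code}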

> Analyzer incorrectly resolves aggregate function outside of Aggregate 
> operators
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-26741
>                 URL: https://issues.apache.org/jira/browse/SPARK-26741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Kris Mok
>            Priority: Major
>
> The analyzer can sometimes hit issues when resolving functions, e.g.
> {code:sql}
> select max(id)
> from range(10)
> group by id
> having count(1) >= 1
> order by max(id)
> {code}
> The analyzed plan of this query is:
> {code:none}
> == Analyzed Logical Plan ==
> max(id): bigint
> Project [max(id)#91L]
> +- Sort [max(id#88L) ASC NULLS FIRST], true
>    +- Project [max(id)#91L, id#88L]
>       +- Filter (count(1)#93L >= cast(1 as bigint))
>          +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS count(1)#93L, id#88L]
>             +- Range (0, 10, step=1, splits=None)
> {code}
> Note how an aggregate function appears outside of any {{Aggregate}} operator in 
> the fully analyzed plan:
> {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid.
> Trying to run this query will lead to weird issues in codegen, but the root 
> cause is in the analyzer:
> {code:none}
> java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[1, bigint, false])
>   at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
>   at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
>   at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
>   at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:237)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44)
>   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194)
>   at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.<init>(GenerateOrdering.scala:195)
>   at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.<init>(GenerateOrdering.scala:192)
>   at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2470)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2684)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:299)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:712)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:721)
> {code}
> 
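
For reference, a minimal reproduction of the failure quoted above (a sketch, 
assuming the usual {{spark}} SparkSession in the spark-shell):
{code:scala}
// Running the ticket's query end-to-end: the analyzed plan places max(id#...)
// inside the Sort, and collecting the result fails during ordering codegen
// with "Cannot generate code for expression: max(...)".
val df = spark.sql(
  """select max(id)
    |from range(10)
    |group by id
    |having count(1) >= 1
    |order by max(id)""".stripMargin)
df.explain(true) // shows the invalid Sort [max(id#...) ASC NULLS FIRST]
df.show()        // throws java.lang.UnsupportedOperationException
{code}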

[jira] [Commented] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators

2019-01-26 Thread Kris Mok (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753310#comment-16753310 ]

Kris Mok commented on SPARK-26741:
----------------------------------

Note that there's a similar issue with non-aggregate functions. Here's an 
example:
{code:none}
spark.sql("create table foo (id int, blob binary)")
val df = spark.sql("select length(blob) from foo where id = 1 order by length(blob) limit 10")
df.explain(true)
{code}
{code:none}
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Sort ['length('blob) ASC NULLS FIRST], true
      +- 'Project [unresolvedalias('length('blob), None)]
         +- 'Filter ('id = 1)
            +- 'UnresolvedRelation `foo`

== Analyzed Logical Plan ==
length(blob): int
GlobalLimit 10
+- LocalLimit 10
   +- Project [length(blob)#25]
      +- Sort [length(blob#24) ASC NULLS FIRST], true
         +- Project [length(blob#24) AS length(blob)#25, blob#24]
            +- Filter (id#23 = 1)
               +- SubqueryAlias `default`.`foo`
                  +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Project [length(blob)#25]
      +- Sort [length(blob#24) ASC NULLS FIRST], true
         +- Project [length(blob#24) AS length(blob)#25, blob#24]
            +- Filter (isnotnull(id#23) && (id#23 = 1))
               +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]

== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[length(blob#24) ASC NULLS FIRST], output=[length(blob)#25])
+- *(1) Project [length(blob#24) AS length(blob)#25, blob#24]
   +- *(1) Filter (isnotnull(id#23) && (id#23 = 1))
      +- Scan hive default.foo [blob#24, id#23], HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]
{code}

Note how the {{Sort}} operator evaluates {{length()}} again, even though there is 
one in the projection right below it. The root cause of this problem in the 
Analyzer is the same as for the main example in this ticket, although this 
example is not as harmful (at least it still runs...).
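
For comparison, here's a sketched variant with an explicit alias (the 
{{length_blob}} name is mine); ordering by the alias should let the {{Sort}} 
reference the attribute produced by the {{Project}} instead of evaluating 
{{length()}} a second time:
{code:scala}
// Sketch: alias the projected expression and order by the alias, so the Sort
// can reference length_blob#... directly rather than recomputing
// length(blob#...) above the Project.
val df2 = spark.sql(
  """select length(blob) as length_blob
    |from foo
    |where id = 1
    |order by length_blob
    |limit 10""".stripMargin)
df2.explain(true) // expected: Sort [length_blob#... ASC NULLS FIRST] over the Project
{code}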
