[ https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905069#comment-14905069 ]

Wenchen Fan commented on SPARK-10741:
-------------------------------------

This bug is caused by a conflict between two tricky parts of our Analyzer. Let 
me explain it in a little more detail.

We have a special rule for Sort on Aggregate in 
https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L563-L604
In this rule, we put the sort ordering expressions into the Aggregate and call 
the Analyzer to resolve this Aggregate again (which means we go through all 
rules).
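To make the shape of that rewrite concrete, here is a rough, self-contained sketch in plain Scala. This is not real Catalyst code: `Agg` and `pushOrdering` are made-up stand-ins for the actual Aggregate operator and rule. The idea is that an ORDER BY expression missing from the aggregate list gets pushed into the Aggregate under an alias, the Sort then references the alias, and a final Project would drop it again.

```scala
// Toy stand-in for an Aggregate operator: grouping keys plus output expressions.
case class Agg(groupBy: Seq[String], output: Seq[String])

object SortOnAggSketch {
  // Hypothetical helper mimicking the rule: if the ordering expression is not
  // already in the aggregate's output, append it under an alias and have the
  // Sort reference the alias instead.
  def pushOrdering(agg: Agg, order: String): (Agg, String) =
    if (agg.output.contains(order)) (agg, order)
    else (agg.copy(output = agg.output :+ s"$order AS aggOrder"), "aggOrder")

  def main(args: Array[String]): Unit = {
    // SELECT c1, avg(c2) AS c_avg FROM test GROUP BY c1 ORDER BY avg(c2)
    val agg = Agg(Seq("c1"), Seq("c1", "avg(c2) AS c_avg"))
    val (rewritten, sortKey) = pushOrdering(agg, "avg(c2)")
    println(rewritten.output) // List(c1, avg(c2) AS c_avg, avg(c2) AS aggOrder)
    println(sortKey)          // aggOrder
  }
}
```

The crucial detail for this bug is not the rewrite itself but that the rewritten Aggregate is then re-analyzed from scratch, running every rule again.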

We also have a special rule for parquet in 
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L580-L612
In this rule, we convert Hive's MetastoreRelation into a parquet 
LogicalRelation, which means we replace the leaf node and change the output 
attribute ids. At the end of this rule, we go through the whole tree to replace 
the old AttributeReferences of the MetastoreRelation with the new ones of the 
LogicalRelation.

Then these two rules conflict. At the point when we resolve Sort on Aggregate, 
we only have the MetastoreRelation; but when we resolve the sort ordering 
expressions against the Aggregate, we go through all rules, so these ordering 
expressions end up referencing parquet's LogicalRelation, whose output 
attribute ids are different from the old MetastoreRelation's. Finally, oops: 
our ordering expressions are referencing attributes that no longer exist.

One solution is: do not go through all rules when resolving Sort on Aggregate 
(so the parquet relation conversion won't happen).
Another is: keep the attribute ids when converting the MetastoreRelation to a 
LogicalRelation.

Personally I prefer the second one. What do you think?
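The second option can be illustrated with a toy model in plain Scala. Again, this is not Catalyst: `Attr` is a made-up stand-in for AttributeReference, and its `id` field stands in for the exprId that Catalyst uses to match attributes (matching is by id, not by name). If the converted relation reuses the old relation's attribute ids, expressions that were resolved against the old relation still bind.

```scala
// Stand-in for AttributeReference: case-class equality is by (name, id),
// mirroring how Catalyst binds attributes by exprId rather than by name alone.
case class Attr(name: String, id: Long)

object AttributeIdSketch {
  def main(args: Array[String]): Unit = {
    // Output of the original MetastoreRelation (ids are illustrative).
    val metastoreOutput = Seq(Attr("c1", 17), Attr("c2", 16))

    // An ordering expression resolved early, against the MetastoreRelation.
    val orderingRef = metastoreOutput.find(_.name == "c2").get // Attr("c2", 16)

    // Conversion that mints fresh ids (today's behavior): the old reference
    // no longer appears in the new output -> "resolved attribute missing".
    val freshOutput = Seq(Attr("c1", 19), Attr("c2", 18))
    assert(!freshOutput.contains(orderingRef))

    // Conversion that reuses the old ids (the proposed fix): new relation
    // objects, same ids, so the early-resolved reference still binds.
    val keptOutput = metastoreOutput.map(a => Attr(a.name, a.id))
    assert(keptOutput.contains(orderingRef))
    println("ok")
  }
}
```

This is exactly the shape of the reported failure: `c2#16` was resolved against the MetastoreRelation, while the converted parquet relation exposes `c2#18`.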



> Hive Query Having/OrderBy against Parquet table is not working 
> ---------------------------------------------------------------
>
>                 Key: SPARK-10741
>                 URL: https://issues.apache.org/jira/browse/SPARK-10741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Ian
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
>     val ddl =
>       """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET"""
>     val failedHaving =
>       """ SELECT c1, avg ( c2 ) as c_avg
>         | FROM test
>         | GROUP BY c1
>         | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
>     TestHive.sql(ddl)
>     TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>       at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>       at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>       at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
>     val ddl =
>       """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET"""
>     val failedOrderBy =
>       """ SELECT c1, avg ( c2 ) c_avg
>         | FROM test
>         | GROUP BY c1
>         | ORDER BY avg ( c2 )""".stripMargin
>     TestHive.sql(ddl)
>     TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>       at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
