[ https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905069#comment-14905069 ]
Wenchen Fan commented on SPARK-10741:
-------------------------------------

This bug is caused by a conflict between two tricky parts of our Analyzer. Let me explain it in a little more detail.

We have a special rule for Sort on Aggregate in https://github.com/apache/spark/blob/v1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L563-L604. In this rule, we put the sort ordering expressions into the Aggregate and call the Analyzer to resolve this Aggregate again (which means we go through all rules).

We also have a special rule for parquet in https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L580-L612. In this rule, we convert Hive's MetastoreRelation to a LogicalRelation of parquet, which means we replace the leaf node and change the output attribute ids. At the end of this rule, we go through the whole tree to replace the old AttributeReferences of the MetastoreRelation with the new ones of the LogicalRelation.

These two rules conflict. At the point when we resolve Sort on Aggregate, we only have the MetastoreRelation; but when we resolve the sort ordering expressions with the Aggregate, we go through all rules, so these ordering expressions end up referencing parquet's LogicalRelation, whose output attribute ids are different from the old MetastoreRelation's. Finally, oops: our ordering expressions are referencing something that doesn't exist.

One solution is: do not go through all rules when resolving Sort on Aggregate (so the parquet relation conversion won't happen there). Another is: keep the attribute ids when converting MetastoreRelation to LogicalRelation. Personally I prefer the second one; what do you think?
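To make the mismatch concrete, here is a minimal sketch (simplified, not the actual analyzer code) showing that Catalyst matches attributes by exprId rather than by name, so a re-created attribute with the same name still fails to resolve:
{code}
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.IntegerType

// c2 as exposed by the MetastoreRelation
val metastoreC2 = AttributeReference("c2", IntegerType)()

// c2 as re-created by the parquet LogicalRelation: same name, fresh exprId
val parquetC2 = AttributeReference("c2", IntegerType)()

// The sort ordering expressions were bound to metastoreC2, so after the
// conversion they no longer match anything in the plan.
assert(metastoreC2.exprId != parquetC2.exprId)
{code}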
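And a rough sketch of what the second option would mean at the attribute level. The alignExprIds helper below is hypothetical (the real change would live in the conversion in HiveMetastoreCatalog), and it matches columns by name only, ignoring case sensitivity:
{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}

// Hypothetical helper: rebuild the converted relation's output so that each
// new attribute keeps the exprId of the matching metastore attribute.
def alignExprIds(
    metastoreOutput: Seq[Attribute],
    parquetOutput: Seq[AttributeReference]): Seq[AttributeReference] = {
  parquetOutput.map { attr =>
    metastoreOutput.find(_.name == attr.name) match {
      case Some(old) =>
        AttributeReference(attr.name, attr.dataType, attr.nullable)(exprId = old.exprId)
      case None => attr
    }
  }
}
{code}
With the exprIds preserved, ordering expressions resolved against the MetastoreRelation would still reference valid attributes after the conversion, and the tree-wide attribute rewrite at the end of the rule would become a no-op for these columns.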
> Hive Query Having/OrderBy against Parquet table is not working
> ---------------------------------------------------------------
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Ian
>
> Failed Query with Having Clause
> {code}
> def testParquetHaving() {
>   val ddl =
>     """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET"""
>   val failedHaving =
>     """ SELECT c1, avg ( c2 ) as c_avg
>       | FROM test
>       | GROUP BY c1
>       | HAVING ( avg ( c2 ) > 5) ORDER BY c1""".stripMargin
>   TestHive.sql(ddl)
>   TestHive.sql(failedHaving).collect
> }
>
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as bigint)) > cast(5 as double)) as boolean) AS havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
> def testParquetOrderBy() {
>   val ddl =
>     """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET"""
>   val failedOrderBy =
>     """ SELECT c1, avg ( c2 ) c_avg
>       | FROM test
>       | GROUP BY c1
>       | ORDER BY avg ( c2 )""".stripMargin
>   TestHive.sql(ddl)
>   TestHive.sql(failedOrderBy).collect
> }
>
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}