[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250611#comment-14250611 ]
David Ross commented on SPARK-4296:
-----------------------------------

I can still reproduce this issue. The test case above does appear to be fixed, but using other (non-aggregate) functions in both the SELECT and GROUP BY clauses can still fail. For example:

{code}
CREATE TABLE test_spark_4296(s STRING);
SELECT UPPER(s) FROM test_spark_4296 GROUP BY UPPER(s);
{code}

That works, but this query doesn't:

{code}
SELECT REGEXP_EXTRACT(s, ".*", 1) FROM test_spark_4296 GROUP BY REGEXP_EXTRACT(s, ".*", 1);
{code}

The error is similar to the one above:

{code}
14/12/17 21:39:22 INFO thriftserver.SparkExecuteStatementOperation: Running query 'SELECT REGEXP_EXTRACT(s, ".*", 1) FROM test_spark_4296 GROUP BY REGEXP_EXTRACT(s, ".*", 1)'
14/12/17 21:39:22 INFO storage.BlockManagerInfo: Removed broadcast_7_piece0 on slave0:50816 in memory (size: 5.2 KB, free: 534.4 MB)
14/12/17 21:39:22 INFO storage.BlockManagerInfo: Removed broadcast_7_piece0 on slave1:45411 in memory (size: 5.2 KB, free: 534.4 MB)
14/12/17 21:39:22 INFO storage.BlockManagerInfo: Removed broadcast_7_piece0 on slave2:59650 in memory (size: 5.2 KB, free: 534.4 MB)
14/12/17 21:39:22 INFO storage.BlockManager: Removing broadcast 7
14/12/17 21:39:22 INFO storage.BlockManager: Removing block broadcast_7_piece0
14/12/17 21:39:22 INFO storage.MemoryStore: Block broadcast_7_piece0 of size 5308 dropped from memory (free 276233416)
14/12/17 21:39:22 INFO storage.BlockManagerInfo: Removed broadcast_7_piece0 on master:34621 in memory (size: 5.2 KB, free: 265.0 MB)
14/12/17 21:39:22 INFO storage.BlockManagerMaster: Updated info of block broadcast_7_piece0
14/12/17 21:39:22 INFO storage.BlockManager: Removing block broadcast_7
14/12/17 21:39:22 INFO storage.MemoryStore: Block broadcast_7 of size 9344 dropped from memory (free 276242760)
14/12/17 21:39:22 INFO spark.ContextCleaner: Cleaned broadcast 7
14/12/17 21:39:22 INFO parse.ParseDriver: Parsing command: SELECT REGEXP_EXTRACT(s, ".*", 1) FROM test_spark_4296 GROUP BY REGEXP_EXTRACT(s, ".*", 1)
14/12/17 21:39:22 INFO parse.ParseDriver: Parse Completed
14/12/17 21:39:22 INFO spark.ContextCleaner: Cleaned shuffle 1
14/12/17 21:39:22 INFO storage.BlockManager: Removing broadcast 6
14/12/17 21:39:22 INFO storage.BlockManager: Removing block broadcast_6_piece0
14/12/17 21:39:22 INFO storage.MemoryStore: Block broadcast_6_piece0 of size 47235 dropped from memory (free 276289995)
14/12/17 21:39:22 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on master:34621 in memory (size: 46.1 KB, free: 265.0 MB)
14/12/17 21:39:22 INFO storage.BlockManagerMaster: Updated info of block broadcast_6_piece0
14/12/17 21:39:22 INFO storage.BlockManager: Removing block broadcast_6
14/12/17 21:39:22 INFO storage.MemoryStore: Block broadcast_6 of size 523775 dropped from memory (free 276813770)
14/12/17 21:39:22 INFO spark.ContextCleaner: Cleaned broadcast 6
14/12/17 21:39:22 INFO storage.BlockManager: Removing broadcast 5
14/12/17 21:39:22 INFO storage.BlockManager: Removing block broadcast_5_piece0
14/12/17 21:39:22 INFO storage.MemoryStore: Block broadcast_5_piece0 of size 7179 dropped from memory (free 276820949)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on master:34621 in memory (size: 7.0 KB, free: 265.0 MB)
14/12/17 21:39:23 INFO storage.BlockManagerMaster: Updated info of block broadcast_5_piece0
14/12/17 21:39:23 INFO storage.BlockManager: Removing block broadcast_5
14/12/17 21:39:23 INFO storage.MemoryStore: Block broadcast_5 of size 12784 dropped from memory (free 276833733)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on slave0:50816 in memory (size: 7.0 KB, free: 534.4 MB)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on slave1:45411 in memory (size: 7.0 KB, free: 534.4 MB)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_5_piece0 on slave2:59650 in memory (size: 7.0 KB, free: 534.4 MB)
14/12/17 21:39:23 INFO spark.ContextCleaner: Cleaned broadcast 5
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on slave1:45411 in memory (size: 7.9 KB, free: 534.4 MB)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on slave2:59650 in memory (size: 7.9 KB, free: 534.4 MB)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on slave0:50816 in memory (size: 7.9 KB, free: 534.4 MB)
14/12/17 21:39:23 ERROR thriftserver.SparkExecuteStatementOperation: Error executing query:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFRegExpExtract(s#609,.*,1) AS _c0#608, tree:
Aggregate [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFRegExpExtract(s#609,.*,1)], [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFRegExpExtract(s#609,.*,1) AS _c0#608]
 MetastoreRelation as_adventure, test_spark_4296, None
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:127)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:126)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:109)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:109)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:107)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
	at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
	at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
	at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
	at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
	at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
	at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
	at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
	at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
	at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
	at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
	at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
	at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:181)
	at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
	at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:212)
	at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
	at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
	at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
	at com.sun.proxy.$Proxy18.executeStatement(Unknown Source)
	at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:220)
	at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
14/12/17 21:39:23 WARN thrift.ThriftCLIService: Error executing statement:
org.apache.hive.service.cli.HiveSQLException: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFRegExpExtract(s#609,.*,1) AS _c0#608, tree:
Aggregate [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFRegExpExtract(s#609,.*,1)], [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFRegExpExtract(s#609,.*,1) AS _c0#608]
 MetastoreRelation as_adventure, test_spark_4296, None
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
	at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
	at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:212)
	at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
	at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
	at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
	at com.sun.proxy.$Proxy18.executeStatement(Unknown Source)
	at org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:220)
	at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
14/12/17 21:39:23 INFO storage.BlockManager: Removing broadcast 4
14/12/17 21:39:23 INFO storage.BlockManager: Removing block broadcast_4_piece0
14/12/17 21:39:23 INFO storage.MemoryStore: Block broadcast_4_piece0 of size 8076 dropped from memory (free 276841809)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on master:34621 in memory (size: 7.9 KB, free: 265.0 MB)
14/12/17 21:39:23 INFO storage.BlockManagerMaster: Updated info of block broadcast_4_piece0
14/12/17 21:39:23 INFO storage.BlockManager: Removing block broadcast_4
14/12/17 21:39:23 INFO storage.MemoryStore: Block broadcast_4 of size 14488 dropped from memory (free 276856297)
14/12/17 21:39:23 INFO spark.ContextCleaner: Cleaned broadcast 4
14/12/17 21:39:23 INFO spark.ContextCleaner: Cleaned shuffle 0
14/12/17 21:39:23 INFO storage.BlockManager: Removing broadcast 3
14/12/17 21:39:23 INFO storage.BlockManager: Removing block broadcast_3_piece0
14/12/17 21:39:23 INFO storage.MemoryStore: Block broadcast_3_piece0 of size 47609 dropped from memory (free 276903906)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on master:34621 in memory (size: 46.5 KB, free: 265.1 MB)
14/12/17 21:39:23 INFO storage.BlockManagerMaster: Updated info of block broadcast_3_piece0
14/12/17 21:39:23 INFO storage.BlockManager: Removing block broadcast_3
14/12/17 21:39:23 INFO storage.MemoryStore: Block broadcast_3 of size 524287 dropped from memory (free 277428193)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on slave0:50816 in memory (size: 46.5 KB, free: 534.5 MB)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on slave2:59650 in memory (size: 46.5 KB, free: 534.5 MB)
14/12/17 21:39:23 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on slave1:45411 in memory (size: 46.5 KB, free: 534.5 MB)
14/12/17 21:39:23 INFO spark.ContextCleaner: Cleaned broadcast 3
{code}

> Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4296
>                 URL: https://issues.apache.org/jira/browse/SPARK-4296
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Shixiong Zhu
>
> When the input data has a complex structure, using the same expression in the GROUP BY clause and the SELECT clause will throw "Expression not in GROUP BY".
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Birthday(date: String)
> case class Person(name: String, birthday: Birthday)
> val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
> people.registerTempTable("people")
> val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
> year.collect
> {code}
> Here is the plan of year:
> {code:java}
> SchemaRDD[3] at RDD at SchemaRDD.scala:105
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
> Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
>  Subquery people
>   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
> {code}
> The bug is in the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`.
> Maybe Spark SQL needs a mechanism to compare Alias expressions and non-Alias expressions.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
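The comparison mechanism the reporter asks for can be sketched in plain Scala. This is a hypothetical illustration, not Spark's actual Catalyst classes: the toy `Expr`/`Attr`/`Upper`/`Alias` types below stand in for Catalyst expression nodes. The idea is to strip `Alias` wrappers from both trees before testing structural equality, so `Upper(birthday#1.date AS date#9)` compares equal to `Upper(birthday#1.date)`:

```scala
// Toy expression tree (hypothetical; not Spark's real Catalyst classes).
sealed trait Expr
case class Attr(name: String) extends Expr
case class Upper(child: Expr) extends Expr
case class Alias(child: Expr, name: String) extends Expr

object SemanticCompare {
  // Recursively remove Alias wrappers so that aliased and unaliased forms
  // of the same expression become structurally identical.
  def stripAliases(e: Expr): Expr = e match {
    case Alias(child, _) => stripAliases(child)
    case Upper(child)    => Upper(stripAliases(child))
    case leaf            => leaf
  }

  // Alias-insensitive equality: compare the stripped trees.
  def semanticEquals(a: Expr, b: Expr): Boolean =
    stripAliases(a) == stripAliases(b)
}
```

With these toy types, plain case-class `==` on `Upper(Alias(Attr("birthday.date"), "date#9"))` and `Upper(Attr("birthday.date"))` is `false` (the `Alias` node makes the trees differ), while `semanticEquals` on the same pair is `true` — which is exactly the distinction the GROUP BY check above fails to make.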