[ https://issues.apache.org/jira/browse/SPARK-11770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010157#comment-15010157 ]
Xiao Li edited comment on SPARK-11770 at 11/18/15 3:47 AM:
-----------------------------------------------------------

Hi [~simeons], I am unable to reproduce your issue. Could you try it using my script?

{code}
case class Individual2(F1: Integer, F2: Integer, F3: Integer, F4: Integer)

def main(args: Array[String]) {
  val sc = new SparkContext("local", "join DFs")
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  val rdd1 = sc.parallelize(Seq(
    Individual2(1, 2, 3, 4),
    Individual2(5, 6, 7, 8)))
  val df1 = hiveContext.createDataFrame(rdd1)
  df1.registerTempTable("foo")
  val df2 = hiveContext.sql(
    "SELECT F1+1 as c, SUM(F2) as F2 FROM foo GROUP BY F1 HAVING SUM(F2) > 5")
  df2.explain(true)
  df2.show()
}
{code}

Thanks,
Xiao
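One more case that may be worth trying, since the reported query builds the aliased grouping column from a double-valued expression rather than an integer one: the sketch below only reuses the {{hiveContext}} and the {{foo}} temp table from the script above ({{df3}} is an illustrative name), so it is a suggested variation, not a confirmed reproduction.

{code}
// Sketch only: same data as the script above, but the aliased grouping
// expression is double-valued (1.0*F1), mirroring the shape of the reported query.
val df3 = hiveContext.sql(
  "SELECT 1.0*F1 as c, SUM(F2) as F2 FROM foo GROUP BY F1 HAVING SUM(F2) > 5")
df3.explain(true)
df3.show()
{code}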
> Spark SQL field resolution error in GROUP BY HAVING clause
> ----------------------------------------------------------
>
>                 Key: SPARK-11770
>                 URL: https://issues.apache.org/jira/browse/SPARK-11770
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Simeon Simeonov
>              Labels: SQL
>
> A query fails to resolve columns from the source data when an alias is added to the SELECT clause. I have not been able to isolate a reproducible standalone test. Below is a series of {{spark-shell}} operations that shows the problem step by step. Spark SQL execution happens via {{HiveContext}}.
>
> I believe the root cause of the problem is that when (and only when) there are aliased expression columns in the SELECT clause, Spark SQL "sees" columns from the SELECT clause while evaluating a HAVING clause. According to the SQL standard, that should not happen.
>
> The table in question is simple:
> {code}
> scala> ctx.table("hevents_test").printSchema
> 15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:19:19 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> root
>  |-- vertical: string (nullable = true)
>  |-- did: string (nullable = true)
>  |-- surl: string (nullable = true)
>  |-- creative_id: long (nullable = true)
>  |-- keyword_text: string (nullable = true)
>  |-- errors: integer (nullable = true)
>  |-- views: integer (nullable = true)
>  |-- clicks: integer (nullable = true)
>  |-- actions: long (nullable = true)
> {code}
> A basic aggregation with a SELECT expression works without a problem:
> {code}
> scala> ctx.sql("""
>      | select 1.0*creative_id, sum(views) as views
>      | from hevents_test
>      | group by creative_id
>      | having sum(views) > 500
>      | """)
> 15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:25:53 INFO ParseDriver: Parse Completed
> 15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:25:53 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
> {code}
> However, if the expression is aliased, the analyzer gets confused about {{views}}:
> {code}
> scala> ctx.sql("""
>      | select 1.0*creative_id as cid, sum(views) as views
>      | from hevents_test
>      | group by creative_id
>      | having sum(views) > 500
>      | """)
> 15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:26:59 INFO ParseDriver: Parse Completed
> 15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:26:59 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L missing from vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5 in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:65)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:69)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:71)
> at $iwC$$iwC$$iwC.<init>(<console>:73)
> at $iwC$$iwC.<init>(<console>:75)
> at $iwC.<init>(<console>:77)
> at <init>(<console>:79)
> at .<init>(<console>:83)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> If the {{views}} column in the HAVING clause is explicitly disambiguated, the problem goes away:
> {code}
> scala> ctx.sql("""
>      | select 1.0*creative_id as cid, sum(views) as views
>      | from hevents_test
>      | group by creative_id
>      | having sum(hevents_test.views) > 500
>      | """)
> 15/11/16 22:29:17 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(hevents_test.views) > 500
> 15/11/16 22:29:17 INFO ParseDriver: Parse Completed
> 15/11/16 22:29:17 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:29:17 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=hevents_test
> res23: org.apache.spark.sql.DataFrame = [cid: double, views: bigint]
> {code}
> That disambiguation should not be necessary.
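A possible interim workaround, sketched here on the assumption that {{ctx}} is the same {{HiveContext}} and {{hevents_test}} the same table as in the session above, is to express the aggregation through the DataFrame API so that the filter runs against the aggregated column directly and no HAVING-clause name resolution is involved ({{result}} is an illustrative name):

{code}
// Sketch of a DataFrame API equivalent of the failing query:
// aggregate first, then filter (the HAVING equivalent), then add the
// aliased expression column.
import org.apache.spark.sql.functions._

val result = ctx.table("hevents_test")
  .groupBy("creative_id")
  .agg(sum("views").as("views"))
  .where(col("views") > 500)
  .withColumn("cid", lit(1.0) * col("creative_id"))
  .select("cid", "views")

result.show()
{code}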