[ https://issues.apache.org/jira/browse/SPARK-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262016#comment-15262016 ]
Murtaza Kanchwala commented on SPARK-9338:
------------------------------------------

Please update the JIRA ticket for the same.

> Aliases from SELECT not available in GROUP BY
> ---------------------------------------------
>
>                 Key: SPARK-9338
>                 URL: https://issues.apache.org/jira/browse/SPARK-9338
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 1.4.1
>         Environment: Observed on Mac OS X and Ubuntu 14.04
>            Reporter: James Aley
>
> It feels like this should really be a known issue, but I've not been able to
> find any mailing list threads or JIRA tickets for exactly this. There are a few
> closed/resolved tickets about specific types of exceptions, but I couldn't
> find this exact problem, so apologies if this is a dupe!
>
> Spark SQL doesn't appear to support referencing aliases from a SELECT in the
> GROUP BY part of the query. This is confusing our analysts, as it works in
> most other tools they use. Here's an example to reproduce:
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
>
> val schema =
>   StructType(
>     StructField("x", IntegerType, nullable = false) ::
>     StructField("y",
>       StructType(StructField("a", DoubleType, nullable = false) :: Nil),
>       nullable = false) :: Nil)
>
> val rdd = sc.parallelize(
>   Row(1, Row(1.0)) :: Row(2, Row(1.34)) :: Row(3, Row(2.3)) ::
>   Row(4, Row(2.5)) :: Nil)
>
> val df = sqlContext.createDataFrame(rdd, schema)
>
> // DataFrame content looks like this:
> // x  y
> // 1  {a: 1.0}
> // 2  {a: 1.34}
> // 3  {a: 2.3}
> // 4  {a: 2.5}
>
> df.registerTempTable("test_data")
> sqlContext.udf.register("roundToInt", (x: Double) => x.toInt)
>
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP BY grp").show()
> // => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given input columns x, y
>
> sqlContext.sql("SELECT y.a as grp, SUM(x) as s FROM test_data GROUP BY grp").show()
> // => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given input columns x, y
>
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(y.a) as s FROM test_data GROUP BY roundToInt(y.a)").show()
> // =>
> // +---+----+
> // |grp|   s|
> // +---+----+
> // |  1|2.34|
> // |  2| 4.8|
> // +---+----+
> {code}
> As you can see, this is particularly inconvenient when using UDFs on nested
> fields, as it means repeating potentially complex expressions. It's very
> common for us to want to make a date-type conversion (from epoch milliseconds
> or similar) on some nested field, then reference the result in multiple places
> in the query. With this issue, that makes for quite verbose queries.
>
> Might it also mean that we're mapping these functions over the data twice? I
> can't quite tell from the explain output whether that's been optimised out or
> not, but here it is for somebody who understands :-)
> {code}
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP BY roundToInt(y.a)").explain()
> // == Physical Plan ==
> // Aggregate false, [PartialGroup#126], [PartialGroup#126 AS grp#116,CombineSum(PartialSum#125L) AS s#117L]
> //  Exchange (HashPartitioning 200)
> //   Aggregate true, [scalaUDF(y#7.a)], [scalaUDF(y#7.a) AS PartialGroup#126,SUM(CAST(x#6, LongType)) AS PartialSum#125L]
> //    PhysicalRDD [x#6,y#7], MapPartitionsRDD[10] at createDataFrame at <console>:31
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
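In the meantime, one way to avoid repeating the expression is to materialize it as a real column before grouping, so the alias resolves against an actual column name rather than a SELECT alias. A minimal sketch, not from the ticket itself: it assumes the same `df`, `sqlContext`, and schema as in the reproduction above, and applies the UDF once via `withColumn` (Spark 1.4.x DataFrame API).

```scala
// Workaround sketch (assumes the df/sqlContext built in the reproduction above):
// compute the derived value once as a real column, then group on that column.
import org.apache.spark.sql.functions.udf

val roundToInt = udf((x: Double) => x.toInt)

// "grp" becomes a genuine column of the DataFrame, not just a SELECT alias.
val withGrp = df.withColumn("grp", roundToInt(df("y.a")))
withGrp.registerTempTable("test_data_grp")

// GROUP BY grp now resolves, and the UDF expression is written only once.
sqlContext.sql("SELECT grp, SUM(x) AS s FROM test_data_grp GROUP BY grp").show()
```

A pure-SQL alternative with the same effect is to push the aliased expression into a subquery, e.g. `SELECT grp, SUM(x) AS s FROM (SELECT roundToInt(y.a) AS grp, x FROM test_data) t GROUP BY grp`, which also keeps the expression in one place.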