James Aley created SPARK-9338:
---------------------------------

             Summary: Aliases from SELECT not available in GROUP BY
                 Key: SPARK-9338
                 URL: https://issues.apache.org/jira/browse/SPARK-9338
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.1
         Environment: Observed on Mac OS X and Ubuntu 14.04
            Reporter: James Aley


It feels like this should be a known issue, but I haven't been able to find any 
mailing list threads or JIRA tickets for exactly this problem. There are a few 
closed/resolved tickets about specific exceptions, but nothing matching this 
exactly, so apologies if this is a dupe!

Spark SQL doesn't appear to support referencing aliases from a SELECT in the 
GROUP BY part of the query. This is confusing our analysts, as it works in most 
other tools they use. Here's an example to reproduce:

{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val schema =
  StructType(
    StructField("x", IntegerType, nullable=false) ::
    StructField("y",
      StructType(StructField("a", DoubleType, nullable=false) :: Nil),
      nullable=false) :: Nil)

val rdd = sc.parallelize(
  Row(1, Row(1.0)) :: Row(2, Row(1.34)) :: Row(3, Row(2.3)) :: Row(4, Row(2.5)) :: Nil)

val df = sqlContext.createDataFrame(rdd, schema)

// DataFrame content looks like this:
// x   y
// 1   {a: 1.0}
// 2   {a: 1.34}
// 3   {a: 2.3}
// 4   {a: 2.5}


df.registerTempTable("test_data")
sqlContext.udf.register("roundToInt", (x: Double) => x.toInt)


sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP 
BY grp").show()
// => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given input 
columns x, y

sqlContext.sql("SELECT y.a as grp, SUM(x) as s FROM test_data GROUP BY 
grp").show()
// => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given input 
columns x, y;

sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(y.a) as s FROM test_data 
GROUP BY roundToInt(y.a)").show()
// =>
// +---+----+
// |grp|   s|
// +---+----+
// |  1|2.34|
// |  2| 4.8|
// +---+----+
{code}

As you can see, this is particularly inconvenient when using UDFs on nested 
fields, as it means repeating potentially complex expressions. It's very common 
for us to convert some nested field (from epoch milliseconds to a date, or 
similar) and then reference the result in multiple places in the query, so this 
issue makes for quite verbose queries. Wrapping the projection in a subquery 
avoids the repetition, as sketched below.
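
A subquery does seem to work around it, since the alias becomes an ordinary 
column of the inner query's output (just a sketch against the same test_data 
table as above):

{code}
// Workaround sketch: alias in an inner query, then group on the alias in
// the outer query, where "grp" is a real column of the subquery's output.
sqlContext.sql("""
  SELECT grp, SUM(x) AS s
  FROM (SELECT roundToInt(y.a) AS grp, x FROM test_data) t
  GROUP BY grp""").show()
// Expected: same result as repeating roundToInt(y.a) in the GROUP BY.
{code}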

Might it also mean that we're mapping these functions over the data twice? I 
can't quite tell from the explain output whether that's been optimised out or 
not, but here it is for somebody who understands it better :-)

{code}
sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP 
BY roundToInt(y.a)").explain()
// == Physical Plan ==
// Aggregate false, [PartialGroup#126], [PartialGroup#126 AS 
grp#116,CombineSum(PartialSum#125L) AS s#117L]
// Exchange (HashPartitioning 200)
// Aggregate true, [scalaUDF(y#7.a)], [scalaUDF(y#7.a) AS 
PartialGroup#126,SUM(CAST(x#6, LongType)) AS PartialSum#125L]
// PhysicalRDD [x#6,y#7], MapPartitionsRDD[10] at createDataFrame at 
<console>:31
{code}
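
If it helps narrow that down, one rough way to check whether the UDF runs twice 
is to count invocations with an accumulator (just a sketch; accumulator updates 
can be over-counted if tasks are retried, so treat the number as indicative 
only):

{code}
// Rough empirical check: count UDF invocations with an accumulator.
val calls = sc.accumulator(0)
sqlContext.udf.register("countedRound", (x: Double) => { calls += 1; x.toInt })

sqlContext.sql("SELECT countedRound(y.a) as grp, SUM(x) as s FROM test_data GROUP BY countedRound(y.a)").show()

// With 4 input rows: a value of 4 suggests one evaluation per row, while 8
// suggests the expression is evaluated separately for SELECT and GROUP BY.
println(s"UDF invocations: ${calls.value}")
{code}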


