Matthew Livesey created SPARK-15251:
---------------------------------------
             Summary: Cannot apply PythonUDF to aggregated column
                 Key: SPARK-15251
                 URL: https://issues.apache.org/jira/browse/SPARK-15251
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.6.1
            Reporter: Matthew Livesey

In Scala it is possible to define a UDF and apply it to an aggregated value in an expression, for example:

{code}
def timesTwo(x: Int): Int = x * 2
sqlContext.udf.register("timesTwo", timesTwo _)

case class Data(x: Int, y: String)
val data = List(Data(1, "a"), Data(2, "b"))
val rdd = sc.parallelize(data)
val df = rdd.toDF
df.registerTempTable("my_data")

sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show()
+---+
|  t|
+---+
|  6|
+---+
{code}

Performing the same computation in PySpark:

{code}
def timesTwo(x):
    return x * 2

sqlContext.udf.register("timesTwo", timesTwo)

data = [(1, 'a'), (2, 'b')]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, ["x", "y"])
df.registerTempTable("my_data")

sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show()
{code}

gives the following:

{code}
AnalysisException: u"expression 'pythonUDF' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;"
{code}

Using a lambda rather than a named function gives the same error.
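A possible workaround (a sketch, not from the original report): materialize the aggregate first, either in a subquery or via the DataFrame API, so the Python UDF is applied to an already-computed column rather than inside the aggregate expression. The aliases "s", "agg", and "timesTwoUdf" below are illustrative names, and this assumes the same sqlContext, timesTwo, and df setup as above:

{code}
# Subquery form: Sum(x) is computed inside the inner query, so the
# Python UDF only sees an ordinary column of the subquery result.
sqlContext.sql(
    "SELECT timesTwo(s) t FROM (SELECT Sum(x) s FROM my_data) agg").show()

# DataFrame API form: aggregate first, then select through the UDF.
from pyspark.sql.functions import udf
from pyspark.sql.functions import sum as sum_
from pyspark.sql.types import IntegerType

timesTwoUdf = udf(timesTwo, IntegerType())
df.agg(sum_("x").alias("s")).select(timesTwoUdf("s").alias("t")).show()
{code}

Both forms should print t = 6 for the sample data, since the UDF no longer appears inside the aggregation that triggers the analysis error.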