[ https://issues.apache.org/jira/browse/SPARK-15251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-15251. ---------------------------------- Resolution: Cannot Reproduce {code} >>> def timesTwo(x): ... return x * 2 ... >>> sqlContext.udf.register("timesTwo", timesTwo) data = [(1, 'a'), (2, 'b')] rdd = sc.parallelize(data) df = sqlContext.createDataFrame(rdd, ["x", "y"]) df.registerTempTable("my_data") sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show() >>> data = [(1, 'a'), (2, 'b')] >>> rdd = sc.parallelize(data) >>> df = sqlContext.createDataFrame(rdd, ["x", "y"]) >>> >>> df.registerTempTable("my_data") >>> sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show() +---+ | t| +---+ | 6| +---+ {code} It seems it is fixed somewhere and I can't reproduce this in the current master. It'd be great if this is backported if anyone can point out the PR > Cannot apply PythonUDF to aggregated column > ------------------------------------------- > > Key: SPARK-15251 > URL: https://issues.apache.org/jira/browse/SPARK-15251 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 1.6.1 > Reporter: Matthew Livesey > > In scala it is possible to define a UDF an apply it to an aggregated value in > an expression, for example: > {code} > def timesTwo(x: Int): Int = x * 2 > sqlContext.udf.register("timesTwo", timesTwo _) > sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show() > case class Data(x: Int, y: String) > val data = List(Data(1, "a"), Data(2, "b")) > val rdd = sc.parallelize(data) > val df = rdd.toDF > df.registerTempTable("my_data") > sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show() > +---+ > | t| > +---+ > | 6| > +---+ > {code} > Performing the same computation in pyspark: > {code} > def timesTwo(x): > return x * 2 > sqlContext.udf.register("timesTwo", timesTwo) > data = [(1, 'a'), (2, 'b')] > rdd = sc.parallelize(data) > df = sqlContext.createDataFrame(rdd, ["x", "y"]) > df.registerTempTable("my_data") > sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show() > {code} > Gives the following: > {code} > AnalysisException: u"expression 'pythonUDF' is neither present in the group > by, nor is it an aggregate function. Add to group by or wrap in first() (or > first_value) if you don't care which value you get.;" > {code} > Using a lambda rather than a named function gives the same error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org