Matthew Livesey created SPARK-15251:
---------------------------------------

             Summary: Cannot apply PythonUDF to aggregated column
                 Key: SPARK-15251
                 URL: https://issues.apache.org/jira/browse/SPARK-15251
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.6.1
            Reporter: Matthew Livesey


In Scala it is possible to define a UDF and apply it to an aggregated value in 
an expression, for example:

{code}
// Define and register a UDF.
def timesTwo(x: Int): Int = x * 2
sqlContext.udf.register("timesTwo", timesTwo _)

// Build a DataFrame and register it as a temp table.
case class Data(x: Int, y: String)
val data = List(Data(1, "a"), Data(2, "b"))
val rdd = sc.parallelize(data)
val df = rdd.toDF
df.registerTempTable("my_data")

// Applying the UDF to an aggregated column works:
sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show()

+---+
|  t|
+---+
|  6|
+---+
{code}

Performing the same computation in PySpark:

{code}
# Define and register the same UDF in Python.
def timesTwo(x):
    return x * 2

sqlContext.udf.register("timesTwo", timesTwo)

# Build an equivalent DataFrame and register it as a temp table.
data = [(1, 'a'), (2, 'b')]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, ["x", "y"])
df.registerTempTable("my_data")

# Applying the UDF to an aggregated column fails:
sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show()
{code}

This fails with the following error:

{code}
AnalysisException: u"expression 'pythonUDF' is neither present in the group by, 
nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;"
{code}

Using a lambda rather than a named function gives the same error.
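
A possible workaround (a sketch only; not verified against 1.6.1) is to perform 
the aggregation first, in a subquery or a separate DataFrame step, so the 
Python UDF is applied to an already-computed column rather than to the 
aggregate expression itself. The "agg" subquery alias and the "times_two" 
name below are illustrative, not part of the original report:

{code}
# Option 1: wrap the aggregation in a subquery, then apply the UDF on top.
sqlContext.sql(
    "SELECT timesTwo(s) t FROM (SELECT Sum(x) s FROM my_data) agg"
).show()

# Option 2: aggregate with the DataFrame API, then select through the UDF.
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

times_two = F.udf(timesTwo, IntegerType())
df.agg(F.sum("x").alias("s")).select(times_two(F.col("s")).alias("t")).show()
{code}

Both variants move the Sum into its own step, so the Python UDF only ever sees 
a plain column; if the analyzer check quoted above only rejects UDFs that wrap 
aggregate expressions directly, this should sidestep it.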