RE: Aggregate UDF (UDAF) in Python

Mendelson, Assaf Mon, 17 Oct 2016 10:52:00 -0700

A possible (bad) workaround would be to use the collect_list function. This 
will give you all the values in an array (list) and you can then create a UDF 
to do the aggregation yourself. This would be very slow and cost a lot of 
memory but it would work if your cluster can handle it.
This is the only workaround I can think of, otherwise you  will need to write 
the UDAF in java/scala and wrap it for python use. If you need an example on 
how to do so I can provide one.
Assaf.

From: Tobi Bosede [mailto:ani.to...@gmail.com]
Sent: Sunday, October 16, 2016 7:49 PM
To: Holden Karau
Cc: user
Subject: Re: Aggregate UDF (UDAF) in Python

OK, I misread the year on the dev list. Can you comment on work arounds? (I.e. 
question about if scala/java are the only option.)

On Sun, Oct 16, 2016 at 12:09 PM, Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
The comment on the developer list is from earlier this week. I'm not sure why 
UDAF support hasn't made the hop to Python - while I work a fair amount on 
PySpark it's mostly in core & ML and not a lot with SQL so there could be good 
reasons I'm just not familiar with. We can try pinging Davies or Michael on the 
JIRA to see what their thoughts are.

On Sunday, October 16, 2016, Tobi Bosede 
<ani.to...@gmail.com<mailto:ani.to...@gmail.com>> wrote:
Thanks for the info Holden.

So it seems both the jira and the comment on the developer list are over a year 
old. More surprising, the jira has no assignee. Any particular reason for the 
lack of activity in this area?

Is writing scala/java the only work around for this? I hear a lot of people say 
python is the gateway language to scala. It is because of issues like this that 
people use scala for Spark rather than python or eventually abandon python for 
scala. It just takes too long for features to get ported over from scala/java.

On Sun, Oct 16, 2016 at 8:42 AM, Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
I don't believe UDAFs are available in PySpark as this came up on the developer 
list while I was asking for what features people were missing in PySpark - see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-td19422.html
 . The JIRA for tacking this issue is at 
https://issues.apache.org/jira/browse/SPARK-10915

On Sat, Oct 15, 2016 at 7:20 PM, Tobi Bosede 
<ani.to...@gmail.com<mailto:ani.to...@gmail.com>> wrote:
Hello,

I am trying to use a UDF that calculates inter-quartile (IQR) range for pivot() 
and SQL in pyspark and got the error that my function wasn't an aggregate 
function in both scenarios. Does anyone know if UDAF functionality is available 
in python? If not, what can I do as a work around?

Thanks,
Tobi

--
Cell : 425-233-8271<tel:425-233-8271>
Twitter: https://twitter.com/holdenkarau

--
Cell : 425-233-8271<tel:425-233-8271>
Twitter: https://twitter.com/holdenkarau

RE: Aggregate UDF (UDAF) in Python

Reply via email to