Re: Aggregate UDF (UDAF) in Python

Tobi Bosede Mon, 17 Oct 2016 12:15:58 -0700

Thanks, Assaf. Yes, please provide an example of how to wrap the code for
Python. I am leaning towards Scala.


On Mon, Oct 17, 2016 at 1:50 PM, Mendelson, Assaf <assaf.mendel...@rsa.com>
wrote:

> A possible (bad) workaround would be to use the collect_list function.
> This will give you all the values in an array (list) and you can then
> create a UDF to do the aggregation yourself. This would be very slow and
> cost a lot of memory but it would work if your cluster can handle it.
>
> This is the only workaround I can think of; otherwise you will need to
> write the UDAF in Java/Scala and wrap it for Python use. If you need an
> example of how to do so, I can provide one.
>
> Assaf.
>
>
>
> *From:* Tobi Bosede [mailto:ani.to...@gmail.com]
> *Sent:* Sunday, October 16, 2016 7:49 PM
> *To:* Holden Karau
> *Cc:* user
> *Subject:* Re: Aggregate UDF (UDAF) in Python
>
>
>
> OK, I misread the year on the dev list. Can you comment on workarounds?
> (I.e. the question about whether Scala/Java are the only option.)
>
>
>
> On Sun, Oct 16, 2016 at 12:09 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
> The comment on the developer list is from earlier this week. I'm not sure
> why UDAF support hasn't made the hop to Python - while I work a fair amount
> on PySpark it's mostly in core & ML and not a lot with SQL so there could
> be good reasons I'm just not familiar with. We can try pinging Davies or
> Michael on the JIRA to see what their thoughts are.
>
>
> On Sunday, October 16, 2016, Tobi Bosede <ani.to...@gmail.com> wrote:
>
> Thanks for the info Holden.
>
>
>
> So it seems both the jira and the comment on the developer list are over a
> year old. More surprising, the jira has no assignee. Any particular reason
> for the lack of activity in this area?
>
>
>
> Is writing Scala/Java the only workaround for this? I hear a lot of
> people say Python is the gateway language to Scala. It is because of issues
> like this that people use Scala for Spark rather than Python, or eventually
> abandon Python for Scala. It just takes too long for features to get ported
> over from Scala/Java.
>
>
>
>
>
> On Sun, Oct 16, 2016 at 8:42 AM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
> I don't believe UDAFs are available in PySpark, as this came up on the
> developer list while I was asking what features people were missing in
> PySpark; see
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-td19422.html
> The JIRA for tracking this issue is at
> https://issues.apache.org/jira/browse/SPARK-10915
>
>
>
> On Sat, Oct 15, 2016 at 7:20 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
>
> Hello,
>
>
>
> I am trying to use a UDF that calculates the inter-quartile range (IQR)
> with pivot() and SQL in PySpark, and got the error that my function wasn't
> an aggregate function in both scenarios. Does anyone know if UDAF
> functionality is available in Python? If not, what can I do as a
> workaround?
>
>
>
> Thanks,
>
> Tobi
>
>
>
>
>
> --
>
> Cell : 425-233-8271
>
> Twitter: https://twitter.com/holdenkarau
>
>
>
>
>
>
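For reference, the collect_list workaround Assaf describes above could look
roughly like the sketch below. This is only an illustration, not code from
anyone in the thread: the helper names iqr and iqr_per_group are made up, the
quartiles use simple linear interpolation (other IQR definitions exist), and
it assumes a Spark version where collect_list is usable from the DataFrame
API.

```python
def iqr(values):
    """Inter-quartile range of a list of numbers (linear interpolation).

    Plain Python, so it can be wrapped as an ordinary (non-aggregate) UDF.
    """
    data = sorted(v for v in values if v is not None)
    if not data:
        return None

    def quantile(q):
        # Position of the q-th quantile in the sorted data.
        pos = q * (len(data) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        return data[lo] + (data[hi] - data[lo]) * (pos - lo)

    return float(quantile(0.75) - quantile(0.25))


def iqr_per_group(df, group_col, value_col):
    """Hypothetical helper: gather each group's values into an array with
    collect_list, then aggregate that array with a regular UDF."""
    # Imported here so the pure-Python iqr() above works without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    iqr_udf = F.udf(iqr, DoubleType())
    return (df.groupBy(group_col)
              .agg(F.collect_list(value_col).alias("values"))
              .select(group_col, iqr_udf("values").alias("iqr")))
```

With a DataFrame df that has, say, columns "key" and "value" (placeholder
names), iqr_per_group(df, "key", "value") returns one IQR per key. As Assaf
warns, every group's values are materialized in a single array, so this is
slow and memory-hungry for large groups.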
