Thanks, Assaf. Yes, please provide an example of how to wrap the Scala code for Python. I am leaning towards Scala.
On Mon, Oct 17, 2016 at 1:50 PM, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:
> A possible (bad) workaround would be to use the collect_list function.
> This will give you all the values in an array (list), and you can then
> create a UDF to do the aggregation yourself. This would be very slow and
> cost a lot of memory, but it would work if your cluster can handle it.
>
> This is the only workaround I can think of; otherwise you will need to
> write the UDAF in Java/Scala and wrap it for Python use. If you need an
> example of how to do so, I can provide one.
>
> Assaf.
>
> *From:* Tobi Bosede [mailto:ani.to...@gmail.com]
> *Sent:* Sunday, October 16, 2016 7:49 PM
> *To:* Holden Karau
> *Cc:* user
> *Subject:* Re: Aggregate UDF (UDAF) in Python
>
> OK, I misread the year on the dev list. Can you comment on workarounds?
> (I.e., the question of whether Scala/Java are the only option.)
>
> On Sun, Oct 16, 2016 at 12:09 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> The comment on the developer list is from earlier this week. I'm not sure
> why UDAF support hasn't made the hop to Python. While I work a fair amount
> on PySpark, it's mostly in core & ML and not a lot with SQL, so there could
> be good reasons I'm just not familiar with. We can try pinging Davies or
> Michael on the JIRA to see what their thoughts are.
>
> On Sunday, October 16, 2016, Tobi Bosede <ani.to...@gmail.com> wrote:
>
> Thanks for the info, Holden.
>
> So it seems both the JIRA and the comment on the developer list are over a
> year old. More surprising, the JIRA has no assignee. Any particular reason
> for the lack of activity in this area?
>
> Is writing Scala/Java the only workaround for this? I hear a lot of
> people say Python is the gateway language to Scala. It is because of issues
> like this that people use Scala for Spark rather than Python, or eventually
> abandon Python for Scala. It just takes too long for features to get ported
> over from Scala/Java.
>
> On Sun, Oct 16, 2016 at 8:42 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> I don't believe UDAFs are available in PySpark, as this came up on the
> developer list while I was asking what features people were missing in
> PySpark; see
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-td19422.html .
> The JIRA for tracking this issue is at
> https://issues.apache.org/jira/browse/SPARK-10915
>
> On Sat, Oct 15, 2016 at 7:20 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
>
> Hello,
>
> I am trying to use a UDF that calculates the interquartile range (IQR) for
> pivot() and SQL in PySpark, and got the error that my function wasn't an
> aggregate function in both scenarios. Does anyone know if UDAF
> functionality is available in Python? If not, what can I do as a
> workaround?
>
> Thanks,
> Tobi
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
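
For reference, the collect_list workaround Assaf describes in the quoted message could look roughly like the sketch below. It is only an illustration, not code from anyone in this thread: the names `iqr` and `with_group_iqr` are hypothetical, the quartiles use linear interpolation between sorted values (one of several common definitions, assumed here), and the DataFrame part assumes PySpark is installed. As noted in the thread, every group's values are gathered into a single list, so this can be slow and memory-hungry.

```python
def iqr(values):
    """Interquartile range of a list of numbers (plain Python, no Spark).

    Assumption: quartiles are linearly interpolated between sorted values.
    """
    data = sorted(v for v in values if v is not None)
    if not data:
        return None

    def quantile(q):
        pos = q * (len(data) - 1)        # fractional index into sorted data
        lo = int(pos)                    # index at or below pos
        hi = min(lo + 1, len(data) - 1)  # next index, clamped to the end
        return data[lo] + (data[hi] - data[lo]) * (pos - lo)

    return float(quantile(0.75) - quantile(0.25))


def with_group_iqr(df, group_col, value_col):
    """Hypothetical helper: one row per group with an 'iqr' column.

    PySpark is imported lazily so iqr() can be tried without Spark.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    iqr_udf = F.udf(iqr, DoubleType())
    # Step 1: gather each group's values into one array column.
    grouped = df.groupBy(group_col).agg(
        F.collect_list(value_col).alias("values"))
    # Step 2: run the ordinary (non-aggregate) Python UDF on that array.
    return grouped.withColumn("iqr", iqr_udf(F.col("values"))).drop("values")
```

Given a DataFrame `df` with columns such as `key` and `value` (hypothetical names), `with_group_iqr(df, "key", "value")` would return one row per key. Doing the aggregation in two steps keeps the Python UDF a regular row-level function, which sidesteps the "not an aggregate function" error from the original question.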