Re: Arrow type issue with Pandas UDF

2018-07-24 Thread Gourav Sengupta
super duper :)

On Tue, Jul 24, 2018 at 7:11 PM, Patrick McCarthy <pmccar...@dstillery.com.invalid> wrote:
> Thanks Bryan. I think it was ultimately groupings that were too large -
> after setting spark.sql.shuffle.partitions to a much higher number I was
> able to get the UDF to execute.
>
> On

Re: Arrow type issue with Pandas UDF

2018-07-24 Thread Patrick McCarthy
Thanks Bryan. I think it was ultimately groupings that were too large - after setting spark.sql.shuffle.partitions to a much higher number I was able to get the UDF to execute.

On Fri, Jul 20, 2018 at 12:45 AM, Bryan Cutler wrote:
> Hi Patrick,
>
> It looks like it's failing in Scala before it
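
For anyone landing on this thread later, the setting mentioned above can be changed on a running session. A minimal sketch, assuming `spark` is an existing SparkSession; the helper name `set_shuffle_partitions` is hypothetical, and the value 2000 in the usage note is only illustrative, since the thread does not say which number was used:

```python
def set_shuffle_partitions(spark, num_partitions):
    """Set spark.sql.shuffle.partitions on an existing session.

    `spark` is assumed to be a pyspark.sql.SparkSession (anything exposing
    the same `conf.set` / `conf.get` interface will work).
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # This setting controls how many partitions Spark SQL uses when it
    # shuffles data for joins and aggregations, including grouped operations.
    # The value is stored as a string.
    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
    # Read the value back so the caller can confirm what is in effect.
    return spark.conf.get("spark.sql.shuffle.partitions")
```

With a real session this would be called as `set_shuffle_partitions(spark, 2000)` before running the grouped map UDF. The default for this setting is 200, so a larger value spreads the shuffled data over more, smaller partitions.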

Re: Arrow type issue with Pandas UDF

2018-07-19 Thread Bryan Cutler
Hi Patrick,

It looks like it's failing in Scala before it even gets to Python to execute your udf, which is why it doesn't seem to matter what's in your udf. Since you are doing a grouped map udf maybe your group sizes are too big or skewed? Could you try to reduce the size of your groups by

Arrow type issue with Pandas UDF

2018-07-19 Thread Patrick McCarthy
PySpark 2.3.1 on YARN, Python 3.6, PyArrow 0.8. I'm trying to run a pandas UDF, but I seem to get nonsensical exceptions in the last stage of the job regardless of my output type. The problem I'm trying to solve: I have a column of scalar values, and each value on the same row has a sorted