Re: pySpark - pandas UDF and binaryType

2019-05-04 Thread Gourav Sengupta
Just try using an apply on a series for a custom function or on any other library. Advertising and actual delivery are two different skills altogether. Not everyone wants to add one to a column using a pandas UDF, as one of their links shows :) Most of the actual use cases are more

Re: pySpark - pandas UDF and binaryType

2019-05-04 Thread Nicolas Paris
Hi Gourav,

> And also be aware that pandas UDF does not always lead to better performance
> and sometimes even massively slow performance.

This information is not widely known, so it is good to know. In which circumstances is it worse than a regular UDF?

> With Grouped Map don't you run into the

Re: pySpark - pandas UDF and binaryType

2019-05-03 Thread Gourav Sengupta
And also be aware that pandas UDF does not always lead to better performance and sometimes even massively slow performance. With Grouped Map don't you run into the risk of random memory errors as well?

On Thu, May 2, 2019 at 9:32 PM Bryan Cutler wrote:
> Hi,
>
> BinaryType support was not added

Re: pySpark - pandas UDF and binaryType

2019-05-02 Thread Bryan Cutler
Hi,

BinaryType support was not added until Spark 2.4.0, see https://issues.apache.org/jira/browse/SPARK-23555. Also, pyarrow 0.10.0 or greater is required, as you saw in the docs.

Bryan

On Thu, May 2, 2019 at 4:26 AM Nicolas Paris wrote:
> Hi all
>
> I am using pySpark 2.3.0 and pyArrow 0.10.0
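
The version requirements Bryan describes can be checked before a job is submitted. What follows is only a minimal sketch, assuming version strings in the usual dotted form. The helpers `parse_version` and `supports_binary_pandas_udf` are hypothetical names, not part of Spark or pyarrow. In a real session the strings would typically come from `pyspark.__version__` and `pyarrow.__version__`.

```python
import re

# Minimums taken from the message above: BinaryType in pandas UDFs needs
# Spark 2.4.0 (SPARK-23555) and pyarrow 0.10.0 or greater.
MIN_SPARK = (2, 4, 0)
MIN_PYARROW = (0, 10, 0)


def parse_version(version):
    """Hypothetical helper: '2.4.0.dev0' -> (2, 4, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first non-numeric component such as 'dev0'.
            break
        parts.append(int(match.group()))
    # Pad to three components so '2.4' compares equal to '2.4.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def supports_binary_pandas_udf(spark_version, pyarrow_version):
    """Hypothetical check: True when both versions meet the minimums above."""
    return (parse_version(spark_version) >= MIN_SPARK
            and parse_version(pyarrow_version) >= MIN_PYARROW)


# The setup from the original question (Spark 2.3.0, pyarrow 0.10.0).
print(supports_binary_pandas_udf("2.3.0", "0.10.0"))  # False
# The same pyarrow with Spark upgraded to 2.4.0.
print(supports_binary_pandas_udf("2.4.0", "0.10.0"))  # True
```

Comparing integer tuples rather than raw strings avoids the trap where "0.9.0" sorts after "0.10.0" lexicographically.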

pySpark - pandas UDF and binaryType

2019-05-02 Thread Nicolas Paris
Hi all

I am using pySpark 2.3.0 and pyArrow 0.10.0

I want to apply a pandas-udf on a dataframe, but I get the error below:

> Invalid returnType with grouped map Pandas UDFs:
> StructType(List(StructField(filename,StringType,true),StructField(contents,BinaryType,true)))
> is not supported