Hi, BinaryType support was not added until Spark 2.4.0; see https://issues.apache.org/jira/browse/SPARK-23555. Also, pyarrow 0.10.0 or greater is required, as you saw in the docs.
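If it helps, here is a minimal sketch of checking both version requirements before defining the UDF. The helper names (`parse_version`, `binary_pandas_udf_supported`) are hypothetical and not part of Spark or pyarrow; the minimum versions are the two mentioned above, and the parsing assumes plain dotted version strings.

```python
import re

# Minimum versions from the note above. The helper names below are
# hypothetical, made up for this sketch; they are not Spark or pyarrow APIs.
MIN_SPARK = (2, 4, 0)
MIN_PYARROW = (0, 10, 0)


def parse_version(version):
    """Return the leading numeric components of a dotted version as a 3-tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first non-numeric component, e.g. "dev0".
            break
        parts.append(int(match.group()))
    # Pad short versions such as "0.10" so tuples compare consistently.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def binary_pandas_udf_supported(spark_version, pyarrow_version):
    """True when both versions meet the minimums for BinaryType with Arrow."""
    return (parse_version(spark_version) >= MIN_SPARK
            and parse_version(pyarrow_version) >= MIN_PYARROW)


# The setup from the question: Spark is too old, pyarrow is fine.
print(binary_pandas_udf_supported("2.3.0", "0.10.0"))
# After upgrading Spark, both requirements are met.
print(binary_pandas_udf_supported("2.4.0", "0.10.0"))
```

In a real session you would pass `pyspark.__version__` and `pyarrow.__version__` instead of the literal strings.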
Bryan

On Thu, May 2, 2019 at 4:26 AM Nicolas Paris <nicolas.pa...@riseup.net> wrote:

> Hi all
>
> I am using pySpark 2.3.0 and pyArrow 0.10.0
>
> I want to apply a pandas-udf on a dataframe with <String, binaryType>
> I get the error below:
>
> > Invalid returnType with grouped map Pandas UDFs:
> > StructType(List(StructField(filename,StringType,true),StructField(contents,BinaryType,true)))
> > is not supported
>
> Am I missing something?
> The doc
> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#supported-sql-types
> says pyArrow 0.10 is the minimum to handle binaryType
>
> Here is the code:
>
> > from pyspark.sql.functions import pandas_udf, PandasUDFType
> >
> > df = sql("select filename, contents from test_binary")
> >
> > @pandas_udf("filename String, contents binary", PandasUDFType.GROUPED_MAP)
> > def transform_binary(pdf):
> >     contents = pdf.contents
> >     return pdf.assign(contents=contents)
> >
> > df.groupby("filename").apply(transform_binary).count()
>
> Thanks
> --
> nicolas