Hi Leon,
Please refer to this link:
https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
I have found GROUPED_MAP to be a bit tricky. Please see this statement
on that page: "All data for a group is loaded into memory before the function
is applied. This can lead to out of memory
I think you want your Pandas UDF to be PandasUDFType.GROUPED_AGG? Is your
> result the same?
>
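For reference, here is a minimal sketch of how the two UDF types differ, assuming the Spark 2.4-style PandasUDFType API. The column names (id, v) and the helper names (mean_of, subtract_mean, apply_both) are hypothetical and are not taken from the original code, so treat this as an illustration rather than a drop-in fix.

```python
import pandas as pd


def mean_of(v):
    """GROUPED_AGG contract: one pandas Series per group in, one scalar out."""
    return float(v.mean())


def subtract_mean(pdf):
    """GROUPED_MAP contract: the whole group arrives as a single pandas
    DataFrame (held in memory at once) and a DataFrame is returned."""
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def apply_both(df):
    """Apply both UDF styles to a Spark DataFrame.

    Assumes df has the hypothetical columns id (long) and v (double).
    """
    # Imported here so the plain functions above can be tested without Spark.
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    mean_udf = pandas_udf(mean_of, "double", PandasUDFType.GROUPED_AGG)
    demean_udf = pandas_udf(
        subtract_mean, "id long, v double", PandasUDFType.GROUPED_MAP
    )

    # GROUPED_AGG is used inside agg() and yields one row per group.
    aggregated = df.groupby("id").agg(mean_udf(df["v"]).alias("mean_v"))
    # GROUPED_MAP is used with apply() and here keeps one row per input row.
    mapped = df.groupby("id").apply(demean_udf)
    return aggregated, mapped
```

If the function only needs to reduce each group to a single value, the GROUPED_AGG form is usually the closer fit. If it has to return rows, GROUPED_MAP is the one to use, and the memory caveat quoted above applies to it.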
> From: Lian Jiang
> Date: Sunday, April 5, 2020 at 3:28 AM
> To: user
> Subject: pandas_udf is very slow
>
> Hi,
>
> I am using pandas udf in pyspark 2.4.3 on EMR 5.21.
Hi,
I am using pandas UDFs in PySpark 2.4.3 on EMR 5.21.0. Pandas UDFs are favored
over non-pandas UDFs per
https://www.twosigma.com/wp-content/uploads/Jin_-_Improving_Python__Spark_Performance_-_Spark_Summit_West.pdf.
My data has about 250M records, and the pandas UDF code looks like this:
def