Re: pandas_udf is very slow

2020-04-06 Thread Gourav Sengupta
Hi Leon, please refer to this link: https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html. I have found using grouped map (GROUPED_MAP) to be a bit tricky; please refer to the statement: "All data for a group is loaded into memory before the function is applied. This can lead to out of memory
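
To make the quoted statement concrete, here is a minimal pandas-only sketch of the two function shapes discussed in this thread. It is not the original poster's code (that code is not shown here). The column names `id` and `v`, the toy data, and the helper names `center` and `mean_v` are all assumptions made for illustration. The point is only that a grouped-map function is handed every row of one group as a single pandas DataFrame, which is why a very large group can exhaust memory.

```python
import pandas as pd

# Toy stand-in for the Spark data; "id" and "v" are hypothetical column names.
df = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 3.0, 2.0, 4.0, 6.0]})


def center(pdf):
    # Grouped-map shape: takes ALL rows of one group as a pandas DataFrame
    # and returns a DataFrame. The whole group is materialized at once.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def mean_v(v):
    # Grouped-aggregate shape: takes one column of a group as a pandas Series
    # and returns a single scalar.
    return float(v.mean())


# Local emulation with plain pandas, one group at a time.
centered = pd.concat([center(group) for _, group in df.groupby("id")])
means = df.groupby("id")["v"].agg(mean_v)

# In PySpark 2.4 the same functions would be wrapped roughly like this
# (sdf is a hypothetical Spark DataFrame with the same two columns):
#   from pyspark.sql.functions import pandas_udf, PandasUDFType
#   center_udf = pandas_udf(center, "id long, v double", PandasUDFType.GROUPED_MAP)
#   sdf.groupby("id").apply(center_udf)
#   mean_udf = pandas_udf(mean_v, "double", PandasUDFType.GROUPED_AGG)
#   sdf.groupby("id").agg(mean_udf(sdf["v"]))
```

Whether either type is appropriate for the 250M-record job depends on how the rows are distributed across groups, which the thread does not state.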

Re: pandas_udf is very slow

2020-04-05 Thread Lian Jiang
think you want your Pandas UDF to be PandasUDFType.GROUPED_AGG? Is your
> result the same?
>
> From: Lian Jiang
> Date: Sunday, April 5, 2020 at 3:28 AM
> To: user
> Subject: pandas_udf is very slow
>
> Hi,
>
> I am using pandas udf in pyspark 2.4.3 on EMR 5.21.

Re: pandas_udf is very slow

2020-04-05 Thread Silvio Fiorito
Subject: pandas_udf is very slow

Hi, I am using pandas udf in pyspark 2.4.3 on EMR 5.21.0. pandas udf is favored over non pandas udf per https://www.twosigma.com/wp-content/uploads/Jin_-_Improving_Python__Spark_Performance_-_Spark_Summit_West.pdf. My data has about 250M records and the pandas udf

pandas_udf is very slow

2020-04-05 Thread Lian Jiang
Hi, I am using pandas udf in pyspark 2.4.3 on EMR 5.21.0. pandas udf is favored over non pandas udf per https://www.twosigma.com/wp-content/uploads/Jin_-_Improving_Python__Spark_Performance_-_Spark_Summit_West.pdf. My data has about 250M records and the pandas udf code is like: def