Hi, I used these settings but did not see obvious improvement (190 minutes reduced to 170 minutes):
spark.sql.execution.arrow.pyspark.enabled: true
spark.sql.execution.arrow.pyspark.fallback.enabled: true

This job heavily uses pandas UDFs and runs on a 30-node xlarge EMR cluster. Any idea why the performance improvement is so small after enabling Arrow? Could anything else be missing? Thanks.

On Sun, Oct 4, 2020 at 10:36 AM Lian Jiang <jiangok2...@gmail.com> wrote:

> Please ignore this question.
> https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow
> shows that a pandas UDF should avoid JVM<->Python SerDe by maintaining one
> data copy in memory. spark.sql.execution.arrow.enabled is false by default.
> I think I missed enabling spark.sql.execution.arrow.enabled. Thanks.
> Regards.
>
> On Sun, Oct 4, 2020 at 10:22 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> Hi,
>>
>> I am using a pyspark Grouped Map pandas UDF (
>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).
>> Functionality-wise it works great. However, SerDe causes a lot of
>> performance hits. To optimize this UDF, can I do either of the below:
>>
>> 1. Use a Java UDF to completely replace the Python Grouped Map pandas
>> UDF.
>> 2. Have the Python Grouped Map pandas UDF call a Java function
>> internally.
>>
>> Which way is more promising, and how? Thanks for any pointers.
>>
>> Thanks
>> Lian
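For anyone reading along: a grouped-map pandas UDF receives each group as a pandas DataFrame and must return a pandas DataFrame, so the per-group function can be written and tested with plain pandas before wiring it into Spark. A minimal sketch below; the column names (`id`, `v`) and the mean-subtraction logic are illustrative assumptions, not from this thread:

```python
import pandas as pd

# Per-group function: takes one group's rows as a pandas DataFrame and
# returns a pandas DataFrame matching the declared output schema.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# In Spark this would be applied per group, e.g.:
#   df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double")
# With spark.sql.execution.arrow.pyspark.enabled=true, each group crosses
# the JVM<->Python boundary as an Arrow record batch instead of pickled rows.

# The same logic, exercised standalone with plain pandas:
pdf = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 3.0, 5.0, 9.0]})
out = pdf.groupby("id", group_keys=False).apply(subtract_mean)
# group 1 has mean 2.0 -> v becomes [-1.0, 1.0];
# group 2 has mean 7.0 -> v becomes [-2.0, 2.0]
```

Note that Arrow only removes the row-by-row serialization cost; if the dominant time is inside the per-group Python computation itself (or in the shuffle that groups the data), enabling Arrow alone will show only a modest improvement like the one reported above.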