Re: use java in Grouped Map pandas udf to avoid serDe

2020-10-06 Thread Evgeniy Ignatiev
Note: forwarding to the list; I incorrectly hit "Reply" first instead of "Reply List". Hello, does your code run without enabling fallback mode? Arrow vectorization might simply not be getting applied - check whether you still observe "javaToPython" stages in the Spark UI. Also, check that the data is not skewed (partitions are too
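[For reference, a minimal sketch of starting a session with the fallback switched off, so that any case where Arrow cannot be used surfaces as an error rather than a silent reversion; the config keys are the standard Spark 3.x Arrow settings and the app name is arbitrary:]

    from pyspark.sql import SparkSession

    # With the fallback disabled, conversions that cannot use Arrow
    # (unsupported data types, missing pyarrow, etc.) fail with an error
    # instead of silently reverting to the non-Arrow SerDe path.
    spark = (
        SparkSession.builder
        .appName("arrow-no-fallback")
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .config("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
        .getOrCreate()
    )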

Re: use java in Grouped Map pandas udf to avoid serDe

2020-10-06 Thread Lian Jiang
Hi, I used these settings but did not see an obvious improvement (190 minutes reduced to 170 minutes): spark.sql.execution.arrow.pyspark.enabled: True, spark.sql.execution.arrow.pyspark.fallback.enabled: True. This job heavily uses pandas UDFs and runs on an EMR cluster of 30 xlarge nodes. Any idea
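[One way to check whether Arrow is actually in effect is to inspect the physical plan; a sketch using a toy scalar pandas UDF in the Spark 3.x type-hint style, not the original job's code. Pandas UDFs normally show up as ArrowEvalPython (or FlatMapGroupsInPandas for grouped map) nodes, while a BatchEvalPython node points to the plain pickled-UDF path:]

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = (
        SparkSession.builder
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .getOrCreate()
    )

    df = spark.range(100)

    @pandas_udf("double")
    def plus_one(x: pd.Series) -> pd.Series:
        # Operates on a whole Arrow batch at once rather than row by row.
        return (x + 1).astype("float64")

    # An ArrowEvalPython node in the plan indicates the Arrow path;
    # BatchEvalPython would indicate the ordinary pickled Python UDF path.
    df.select(plus_one("id")).explain()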

Re: use java in Grouped Map pandas udf to avoid serDe

2020-10-04 Thread Lian Jiang
Please ignore this question. https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow shows that pandas UDFs should already avoid JVM<->Python SerDe by maintaining a single copy of the data in memory. spark.sql.execution.arrow.enabled is false by default; I think I missed
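[For completeness, a sketch of setting that legacy key at runtime on an existing session; Spark 3.x prefers the spark.sql.execution.arrow.pyspark.* names but still accepts the older key:]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Legacy (pre-3.0) name of the switch; newer releases use
    # spark.sql.execution.arrow.pyspark.enabled instead.
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")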

use java in Grouped Map pandas udf to avoid serDe

2020-10-04 Thread Lian Jiang
Hi, I am using a PySpark Grouped Map pandas UDF (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html). Functionality-wise it works great. However, SerDe causes a lot of performance hits. To optimize this UDF, can I do either of the below: 1. use a Java UDF to completely replace the Python
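[For context, a minimal self-contained sketch of the Grouped Map pattern being asked about; the data and the subtract-mean logic are illustrative only, and this uses the Spark 3.x applyInPandas form, equivalent to the PandasUDFType.GROUPED_MAP decorator shown in the linked docs:]

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("grouped-map-sketch").getOrCreate()

    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ("id", "v"),
    )

    def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
        # Each group arrives as a pandas DataFrame; Arrow moves the data
        # between the JVM and Python in columnar batches rather than
        # pickling it row by row.
        return pdf.assign(v=pdf.v - pdf.v.mean())

    df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double").show()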