Hi,

I tried these settings but did not see an obvious improvement (runtime went
from 190 minutes down to 170 minutes):

    spark.sql.execution.arrow.pyspark.enabled: True
    spark.sql.execution.arrow.pyspark.fallback.enabled: True

The job heavily uses pandas UDFs and runs on a 30-node xlarge EMR cluster.
Any idea why the performance improvement is so small after enabling Arrow?
Is anything else missing? Thanks.
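For context, here is a minimal sketch of the grouped map contract this job
relies on, illustrated with plain pandas so it runs locally (the column and
function names are illustrative, not from the actual job):

```python
import pandas as pd

# The function a grouped map pandas UDF applies: it receives one pandas
# DataFrame per group and must return a pandas DataFrame.
def subtract_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# Locally, pandas groupby + apply mimics what Spark does per group.
# In PySpark the same function would be passed to
# df.groupby("id").applyInPandas(subtract_group_mean, schema="id long, v double")
df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 3.0, 5.0, 9.0]})
out = df.groupby("id", group_keys=False).apply(subtract_group_mean)
```

Note that Arrow only removes the row-by-row pickling cost; the per-group data
still crosses the JVM/Python boundary once in each direction, so a grouped map
UDF over many large groups can stay expensive even with Arrow enabled.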


On Sun, Oct 4, 2020 at 10:36 AM Lian Jiang <jiangok2...@gmail.com> wrote:

> Please ignore this question.
> https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow
> shows that a pandas UDF avoids JVM<->Python SerDe by keeping a single copy
> of the data in memory. spark.sql.execution.arrow.enabled is false by
> default; I think I missed enabling it. Thanks.
> Regards.
>
> On Sun, Oct 4, 2020 at 10:22 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> Hi,
>>
>> I am using a PySpark grouped map pandas UDF (
>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).
>> Functionally it works great; however, SerDe causes a significant
>> performance hit. To optimize this UDF, can I do either of the following:
>>
>> 1. Use a Java UDF to completely replace the Python grouped map pandas
>> UDF.
>> 2. Have the Python grouped map pandas UDF call a Java function internally.
>>
>> Which approach is more promising, and how? Thanks for any pointers.
>>
>> Thanks
>> Lian
>>
>>
>>
>>
>


