[
https://issues.apache.org/jira/browse/SPARK-53903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sumit Kumar updated SPARK-53903:
--------------------------------
Attachment: testcase.ipynb
> Performance degradation for PySpark APIs at scale compared to Scala APIs
> -------------------------------------------------------------------------
>
> Key: SPARK-53903
> URL: https://issues.apache.org/jira/browse/SPARK-53903
> Project: Spark
> Issue Type: Brainstorming
> Components: PySpark
> Affects Versions: 3.5.1
> Reporter: Sumit Kumar
> Priority: Major
> Attachments: testcase.ipynb
>
>
> Customers love PySpark and the flexibility of using several Python libraries
> as part of our workflows. I have a scenario where a specific use case involves
> multiple tables with around 10k columns, some of which are array-typed
> columns that, when exploded, yield ~1k columns each.
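> For illustration, here is a minimal sketch of the workload shape (the table
> and column names are hypothetical; the attached testcase.ipynb is the actual
> reproduction):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
>
> # Hypothetical wide table: ~10k scalar columns plus array-of-struct columns
> # whose elements expand into ~1k columns each when exploded.
> df = spark.table("wide_table")
>
> # Exploding one array column and flattening the resulting struct multiplies
> # the column count of the DataFrame.
> exploded = (df
>             .withColumn("item", F.explode("array_col"))
>             .select("*", "item.*")
>             .drop("item"))
>
> # Every intermediate DataFrame created along the way is a Python wrapper
> # around a JVM-side object that the driver keeps referenced via py4j.
> {code}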
> *Issues that we are facing:*
> * Frequent driver OOMs, depending on the use case: how many columns are
> involved in the logic and how many array-type columns are exploded. GC runs
> frequently, slowing down the workflows.
> * We tried the equivalent Scala APIs, and both performance and latency were
> significantly better (no OOMs and much less GC overhead).
> *Here is what we understand so far from thread and memory dumps:*
> * The driver ends up holding an open reference for every PySpark object
> created in the Python VM, because the PySpark APIs communicate with the JVM
> through the py4j bridge. This garbage keeps accumulating on the driver,
> ultimately leading to OOMs.
> * Even if we delete the Python references in PySpark code (for example, "del
> df_dummy1") and run "gc.collect()" explicitly, we are not able to ease the
> memory pressure. Neither Python GC nor a GC triggered from Python over the
> py4j bridge seems to reclaim this memory on the driver (see the sketch after
> this list).
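> For reference, a sketch of what we tried, plus private-API variants one could
> also attempt (the DataFrame names are hypothetical, and _gateway, _jdf, and
> _jvm are PySpark/py4j internals rather than supported APIs, so this is a
> best-effort sketch, not a recommended pattern):
> {code:python}
> import gc
>
> # What we tried: drop the Python-side reference and force a Python GC pass,
> # expecting py4j to detach the corresponding JVM-side object.
> del df_dummy1
> gc.collect()
>
> # Variant: explicitly detach the JVM-side object before deleting the wrapper.
> spark.sparkContext._gateway.detach(df_dummy2._jdf)
> del df_dummy2
>
> # Variant: ask the JVM itself to run a GC cycle over the py4j bridge.
> spark.sparkContext._jvm.System.gc()
> {code}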
> This is not a typical workload, but we have multiple such use cases, and we
> are debating whether it is worth rewriting the existing workflows in Scala
> (our existing DEs are more comfortable with PySpark, and there is also a
> migration cost that we would have to convince our management to approve).
> *ASK:*
> This Jira includes a sample notebook that reproduces what our use cases see.
> We are seeking community feedback on this kind of use case, and on ideas to
> improve the situation other than migrating to the Scala APIs. Are there any
> PySpark improvements that could help?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]