[ 
https://issues.apache.org/jira/browse/SPARK-28843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28843:
------------------------------
    Docs Text: PySpark workers now set the environment variable OMP_NUM_THREADS 
(if not already set) to the number of cores used by an executor 
(spark.executor.cores). Previously, when the variable was unset, OpenMP defaulted 
to the total number of VM cores. This avoids excessively large OpenMP thread 
pools when using, for example, numpy.
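
A minimal sketch of the behavior described above, assuming it is applied where 
the Python worker's environment is assembled; the function and parameter names 
below are illustrative, not the actual Spark source.

    import os

    def build_worker_env(executor_cores, env=None):
        # Copy the parent environment, then cap OpenMP's thread pool at the
        # executor's core count unless the user already set it explicitly.
        env = dict(os.environ if env is None else env)
        if "OMP_NUM_THREADS" not in env and executor_cores:
            env["OMP_NUM_THREADS"] = str(executor_cores)
        return env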

> Set OMP_NUM_THREADS to executor cores to reduce Python memory consumption
> -------------------------------------------------------------------------
>
>                 Key: SPARK-28843
>                 URL: https://issues.apache.org/jira/browse/SPARK-28843
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.3.3, 3.0.0, 2.4.3
>            Reporter: Ryan Blue
>            Priority: Major
>              Labels: release-notes
>
> While testing hardware with more cores, we found that the amount of memory 
> required by PySpark applications increased, and we tracked the problem to 
> importing numpy. The numpy issue is 
> [https://github.com/numpy/numpy/issues/10455].
> NumPy uses OpenMP, which starts a thread pool with one thread per core on the 
> machine (and does not respect cgroups). When we set this lower, we see a 
> significant reduction in memory consumption.
> This parallelism setting should be set to the number of cores allocated to 
> the executor, not the number of cores available on the machine.
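> A hedged sketch of the user-level mitigation this implies (example values, 
> not the patch itself): pin OMP_NUM_THREADS to the executor's core count so 
> numpy's OpenMP pool matches the allocated cores. The spark.executorEnv.* 
> setting is standard Spark configuration; the core count of 2 is only an 
> example.
>     # spark-submit example (values are illustrative):
>     #   --conf spark.executor.cores=2 \
>     #   --conf spark.executorEnv.OMP_NUM_THREADS=2
>     import os
>     # Must run before numpy is first imported in the Python worker.
>     os.environ.setdefault("OMP_NUM_THREADS", "2")
>     import numpy as np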



