srowen commented on a change in pull request #25545: [SPARK-28843][PYTHON] Set OMP_NUM_THREADS to executor cores for python
URL: https://github.com/apache/spark/pull/25545#discussion_r318687919
########## File path: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ##########

@@ -106,6 +106,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
     val startTime = System.currentTimeMillis
     val env = SparkEnv.get
     val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",")
+    // if OMP_NUM_THREADS is not explicitly set, override it with the number of cores
+    if (conf.getOption("spark.executorEnv.OMP_NUM_THREADS").isEmpty) {
+      // SPARK-28843: limit the OpenMP thread pool to the number of cores assigned to this executor
+      // this avoids high memory consumption with pandas/numpy because of a large OpenMP thread pool
+      // see https://github.com/numpy/numpy/issues/10455

Review comment:
   I think it's pretty straightforward. This env variable controls how many threads OpenMP uses, and of course it shouldn't be more than the number of cores the executor is allowed to use. However, when it is unset, OpenMP's default will sometimes use more threads than the allowed number of cores, so this change sets it to the number of allowed cores if it is not already set.

   I agree the issue is broader than numpy. However, the change to PySpark would mostly improve the situation specifically for numpy users (and, by extension, pandas). I don't think it matters so much; we could remove the commentary and point to the JIRA or something.
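[Editor's note: the quoted diff is truncated before the assignment itself. Below is a minimal, self-contained Scala sketch of the guarded override the comment describes. The `envVars` map, the `spark.executor.cores` fallback key, and the default of "1" are illustrative assumptions, not necessarily the exact merged code.]

// Sketch of the SPARK-28843 behavior: only set OMP_NUM_THREADS when the user
// has not configured it explicitly, capping OpenMP's thread pool at the
// executor's core count rather than the machine's physical core count.
object OmpNumThreadsSketch {
  def applyOverride(
      conf: Map[String, String],
      envVars: scala.collection.mutable.Map[String, String]): Unit = {
    // Respect an explicit user setting; only fill in a default when absent.
    if (!conf.contains("spark.executorEnv.OMP_NUM_THREADS")) {
      // "spark.executor.cores" defaulting to "1" is an assumption here.
      envVars("OMP_NUM_THREADS") = conf.getOrElse("spark.executor.cores", "1")
    }
  }

  def main(args: Array[String]): Unit = {
    val env = scala.collection.mutable.Map[String, String]()
    applyOverride(Map("spark.executor.cores" -> "4"), env)
    println(env("OMP_NUM_THREADS")) // prints: 4
  }
}

The point of the guard is that a user-supplied spark.executorEnv.OMP_NUM_THREADS always wins; the override only changes the default behavior that numpy/pandas users otherwise hit (see https://github.com/numpy/numpy/issues/10455).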