I've noticed a couple of oddities with the pyspark.daemon processes that are causing memory problems in some of our heavier Spark jobs, especially when several run at the same time...
It seems there is typically a 1-to-1 ratio of pyspark.daemon processes to cores per executor during aggregations. We leave spark.python.worker.memory at its default of 512MB, beyond which the remainder of an aggregation is supposed to spill to disk. However:

*1)* I'm not entirely sure what cases would result in pyspark daemons not respecting the Python worker memory limit. I've seen some grow to as much as 2GB each (well over the 512MB limit), which is when we run into serious memory problems for jobs using many cores per executor. To be clear, they ARE spilling to disk as well, but somehow blowing past the memory limit at the same time.

*2)* The second scenario relates specifically to joining RDDs. Say there are 4 cores per executor, and therefore 4 pyspark daemons during most aggregations. When a join occurs, it appears to spawn 4 additional pyspark daemons rather than reusing the ones already present from the preceding aggregation stage. Combined with the case where the Python worker memory limit is not strictly respected, this can consume far more memory per node. The fact that Python worker memory is allocated *outside* of the executor memory is what poses the biggest challenge for preventing memory exhaustion on a node.

Is there something obvious, or an environment variable I may have missed, that could help with one or both of the above memory concerns? Alternatively, any suggestions would be greatly appreciated! :)

Thanks,
Mark

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/pyspark-daemon-issues-tp10533.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
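For reference, the knobs we've been experimenting with look roughly like this. Note that spark.python.worker.memory is only a spill threshold, not a hard cap, and the Python workers live outside the executor JVM heap, so on YARN the extra headroom has to be accounted for via the memory overhead setting. This is just a sketch: the values are illustrative, my_job.py is a placeholder, and spark.yarn.executor.memoryOverhead assumes a YARN deployment.

```shell
# Illustrative spark-submit invocation; tune the values for your cluster.
spark-submit \
  --conf spark.python.worker.memory=512m \
  --conf spark.python.worker.reuse=true \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  my_job.py
```

spark.python.worker.reuse (default true) is meant to keep Python workers alive across tasks, and spark.yarn.executor.memoryOverhead (in MB) is the off-heap allowance YARN grants each executor container on top of the JVM heap, which is where the pyspark daemon memory ultimately has to fit.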