[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294570#comment-14294570 ]
Sven Krasser commented on SPARK-5395:
-------------------------------------

Some new findings: I can now trigger the problem using just the {{coalesce}} call. My job now looks like this:

{code}
sc.newAPIHadoopFile().map().map().coalesce().count()
{code}

In the 64 executor case, this occurs when processing 1TB in 1500 files. If I go down to 2 executors, processing 200GB in 305 files makes the worker count go up to 9 (and higher as I add more files). With less data, things appear normal. That raises the question of what {{coalesce()}} is doing that causes new workers to spawn. (Illustrative sketches are appended after the quoted issue below.)

> Large number of Python workers causing resource depletion
> ---------------------------------------------------------
>
> Key: SPARK-5395
> URL: https://issues.apache.org/jira/browse/SPARK-5395
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.2.0
> Environment: AWS ElasticMapReduce
> Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container).
> In this instance, 97 pyspark.daemon processes had accumulated by the time the container was killed.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_000030. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_000030] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1421692415636_0052_01_000030 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon
> [...]
> {noformat}
> The configuration uses 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailing list discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html
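
A minimal, self-contained PySpark sketch of a job with the shape described in the comment above. The input path, Hadoop new-API format classes, map lambdas, and the coalesce target are illustrative placeholders, not the values from the actual job:

{code}
from pyspark import SparkContext

sc = SparkContext(appName="SPARK-5395-coalesce-repro")

# Placeholder path and Hadoop new-API format classes; the real job reads ~1TB across ~1500 files.
rdd = sc.newAPIHadoopFile(
    "hdfs:///path/to/input",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")

# Two map stages stand in for the actual per-record processing.
parsed = rdd.map(lambda kv: kv[1]).map(lambda line: line.split("\t"))

# Coalescing to fewer partitions is the step that appears to trigger the extra
# pyspark.daemon processes; 64 is an arbitrary placeholder target.
print(parsed.coalesce(64).count())
{code}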
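
On the memory figures in the quoted description (64 containers with 2 cores each, roughly 8G executor heap plus 6G of overhead inside a ~14.5 GB YARN limit), a hedged sketch of how such an allocation could be expressed via SparkConf. The actual submit-time settings of the original run are not part of this ticket, so the values below are illustrative only:

{code}
from pyspark import SparkConf, SparkContext

# Illustrative values mirroring the description, not the reporter's actual configuration:
# 64 containers with 2 cores each, ~8G executor heap, ~6G (6144 MB) YARN overhead per container.
conf = (SparkConf()
        .setAppName("SPARK-5395-memory-setup")
        .set("spark.executor.instances", "64")
        .set("spark.executor.cores", "2")
        .set("spark.executor.memory", "8g")
        .set("spark.yarn.executor.memoryOverhead", "6144"))

sc = SparkContext(conf=conf)
{code}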