[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292769#comment-14292769 ]

Mark Khaitman commented on SPARK-5395:
--------------------------------------

This may prove to be useful...

I'm watching a currently running spark-submitted job while monitoring the 
pyspark.daemon processes. The framework is permitted to use only 8 cores on 
each node, with the default Python worker memory of 512 MB per node (this is 
separate from the executor memory, which is set higher than that).
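For reference, the relevant settings look roughly like this. This is a 
hypothetical reconstruction (the executor memory value is a placeholder, not 
our actual setting); spark.python.worker.memory defaults to 512m:

{noformat}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.cores", "8")           # 8 cores per node for this framework
        .set("spark.python.worker.memory", "512m")  # default Python worker memory
        .set("spark.executor.memory", "8g"))        # placeholder; set higher than the worker memory
sc = SparkContext(conf=conf)
{noformat}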

Ignoring the exact RDD actions for a moment: as the job transitioned from 
Stage 1 to Stage 2, it spawned 8-10 additional pyspark.daemon processes, 
making the box use more cores than it was even allowed to. A few seconds 
later, the original 8 processes entered a sleeping state while still holding 
onto the physical memory they had consumed in Stage 1. As soon as Stage 2 
finished, practically all of the pyspark.daemon processes vanished and the 
memory was freed. I was keeping an eye on 2 random nodes and the exact same 
thing occurred on both. It was also the only job executing at the time, so 
there was really no other interference/contention for resources.

I will try to provide a bit more detail on the exact transformations/actions 
occurring between the 2 stages; even without inspecting the spark-submitted 
code directly, I know that at the very least a partitionBy and a cogroup are 
occurring (the sketch below illustrates that pattern).
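
For illustration, the shape of the job between the two stages is roughly as 
follows. This is a hypothetical sketch (the RDD names and data are made up); 
only the partitionBy/cogroup pattern is taken from what I can see of the job:

{noformat}
from pyspark import SparkContext

sc = SparkContext(appName="daemon-growth-sketch")  # hypothetical app name

# Two keyed RDDs; the real job builds these from its own inputs.
left = sc.parallelize([(i % 100, i) for i in range(1000000)])
right = sc.parallelize([(i % 100, -i) for i in range(1000000)])

# partitionBy forces a shuffle, which is where Stage 1 ends.
left_parted = left.partitionBy(64)

# cogroup introduces another shuffle boundary; Stage 2 is where the extra
# pyspark.daemon processes appeared on each node.
grouped = left_parted.cogroup(right)

grouped.count()  # any action that materializes the cogroup
{noformat}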

> Large number of Python workers causing resource depletion
> ---------------------------------------------------------
>
>                 Key: SPARK-5395
>                 URL: https://issues.apache.org/jira/browse/SPARK-5395
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: AWS ElasticMapReduce
>            Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, eventually 
> causing YARN to kill containers for exceeding their memory allocation (in the 
> case below that is about 8G for executors plus 6G of overhead per container).
> In this instance, 97 pyspark.daemon processes had accumulated by the time the 
> container was killed.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler 
> (Logging.scala:logInfo(59)) - Container marked as failed: 
> container_1421692415636_0052_01_000030. Exit status: 143. Diagnostics: 
> Container [pid=35211,containerID=container_1421692415636_0052_01_000030] is 
> running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB 
> physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1421692415636_0052_01_000030 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
> VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m 
> pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m 
> pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m 
> pyspark.daemon
>       [...]
> {noformat}
> The configuration uses 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailinglist discussion: 
> https://www.mail-archive.com/user@spark.apache.org/msg20102.html


