[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292668#comment-14292668 ]

Mark Khaitman edited comment on SPARK-5395 at 1/26/15 11:57 PM:
----------------------------------------------------------------

[~skrasser], I've actually only managed to reproduce this with production 
data as well (so far). I'll try to write a simple version tomorrow, but it 
seems to be a mix of two things: Python worker processes not being killed 
after they're no longer running (causing a build-up), and individual Python 
workers exceeding the allocated memory limit.

I think it *may* be related to a couple of specific actions such as 
groupByKey/cogroup, though I still need to run some tests to confirm what's 
causing it.
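
Roughly the shape of job I have in mind for the simple version (placeholder 
data and deliberate key skew, not our production code):

{code}
# Hypothetical repro sketch only, not the production job: a skewed
# groupByKey/cogroup that forces large per-key buffers in the Python workers.
from pyspark import SparkContext

sc = SparkContext(appName="SPARK-5395-repro-sketch")

# Synthetic skewed data: few keys, many values per key.
left = sc.parallelize(range(1000000)).map(lambda i: (i % 50, i))
right = sc.parallelize(range(1000000)).map(lambda i: (i % 50, -i))

grouped = left.groupByKey()       # materializes each key's value list
cogrouped = left.cogroup(right)   # same per key, for both inputs

print(grouped.mapValues(lambda vs: sum(1 for _ in vs)).take(5))
print(cogrouped.mapValues(lambda gr: (sum(1 for _ in gr[0]),
                                      sum(1 for _ in gr[1]))).take(5))

sc.stop()
{code}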

I should also add that we haven't changed spark.python.worker.reuse, so in 
our case it should be using its default of true.
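
If the reuse behaviour turns out to matter, it can be pinned explicitly for a 
comparison run; a sketch with an illustrative setting only:

{code}
from pyspark import SparkConf, SparkContext

# Illustrative only: turn worker reuse off for one run and compare
# pyspark.daemon counts (spark.python.worker.reuse defaults to true in 1.2).
conf = (SparkConf()
        .setAppName("worker-reuse-comparison")
        .set("spark.python.worker.reuse", "false"))
sc = SparkContext(conf=conf)
{code}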


> Large number of Python workers causing resource depletion
> ---------------------------------------------------------
>
>                 Key: SPARK-5395
>                 URL: https://issues.apache.org/jira/browse/SPARK-5395
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: AWS ElasticMapReduce
>            Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, 
> eventually causing YARN to kill containers for exceeding their memory 
> allocation (in the case below that is about 8G for executors plus 6G for 
> overhead per container).
> In this instance, 97 pyspark.daemon processes had accumulated by the time 
> the container was killed.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_000030. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_000030] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1421692415636_0052_01_000030 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon
>       [...]
> {noformat}
> The configuration uses 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailing list discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html
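> For reference, a rough sketch of the submit-time sizing these numbers 
> describe (values are illustrative approximations, not the exact job 
> configuration):
> {code}
> from pyspark import SparkConf, SparkContext
>
> conf = (SparkConf()
>         .set("spark.executor.instances", "64")               # 64 containers
>         .set("spark.executor.cores", "2")                    # 2 cores each
>         .set("spark.executor.memory", "8g")                  # executor heap
>         .set("spark.yarn.executor.memoryOverhead", "6144"))  # ~6G overhead (MB)
> sc = SparkContext(conf=conf)
> {code}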


