[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294570#comment-14294570 ]
Sven Krasser commented on SPARK-5395:
-------------------------------------

Some new findings: I can now trigger the problem using just the {{coalesce}} call. My job now looks like this:

{code}
sc.newAPIHadoopFile().map().map().coalesce().count()
{code}

In the 64 executor case, this occurs when processing 1TB in 1500 files. If I go down to 2 executors, processing 200GB in 305 files makes the worker count go up to 9 (and higher as I add more files). With less data, things appear normal. That raises the question of what {{coalesce()}} is doing that causes new workers to spawn. (Illustrative sketches are appended after the quoted issue below.)

> Large number of Python workers causing resource depletion
> ---------------------------------------------------------
>
> Key: SPARK-5395
> URL: https://issues.apache.org/jira/browse/SPARK-5395
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.2.0
> Environment: AWS ElasticMapReduce
> Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container).
> In this instance, 97 pyspark.daemon processes had accumulated by the time the container was killed.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_000030. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_000030] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1421692415636_0052_01_000030 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon
> [...]
> {noformat}
> The configuration uses 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailing list discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html
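
A minimal, self-contained PySpark sketch of a job with the shape described in the comment above. The input path, Hadoop new-API format classes, map lambdas, and the coalesce target are illustrative placeholders, not the values from the actual job:

{code}
from pyspark import SparkContext

sc = SparkContext(appName="SPARK-5395-coalesce-repro")

# Placeholder path and Hadoop new-API format classes; the real job reads ~1TB across ~1500 files.
rdd = sc.newAPIHadoopFile(
    "hdfs:///path/to/input",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")

# Two map stages stand in for the actual per-record processing.
parsed = rdd.map(lambda kv: kv[1]).map(lambda line: line.split("\t"))

# Coalescing to fewer partitions is the step that appears to trigger the extra
# pyspark.daemon processes; 64 is an arbitrary placeholder target.
print(parsed.coalesce(64).count())
{code}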
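
On the memory figures in the quoted description (64 containers with 2 cores each, roughly 8G executor heap plus 6G of overhead inside a ~14.5 GB YARN limit), a hedged sketch of how such an allocation could be expressed via SparkConf. The actual submit-time settings of the original run are not part of this ticket, so the values below are illustrative only:

{code}
from pyspark import SparkConf, SparkContext

# Illustrative values mirroring the description, not the reporter's actual configuration:
# 64 containers with 2 cores each, ~8G executor heap, ~6G (6144 MB) YARN overhead per container.
conf = (SparkConf()
        .setAppName("SPARK-5395-memory-setup")
        .set("spark.executor.instances", "64")
        .set("spark.executor.cores", "2")
        .set("spark.executor.memory", "8g")
        .set("spark.yarn.executor.memoryOverhead", "6144"))

sc = SparkContext(conf=conf)
{code}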