[ 
https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256468#comment-14256468
 ] 

Patrick Wendell commented on SPARK-4906:
----------------------------------------

Hey [~mingyu.z...@gmail.com] - could you say a bit more about how a workload 
can generate this many failed tasks in the "live set" of running stages? 
If each error message is about 10 KB and together they take up 500 MB, that 
means you have roughly 50,000 failed tasks in the live set. I've never seen 
this before, because a stage typically fails once a few of its tasks have 
failed, so this definitely seems like an extreme case.

Running a job with hundreds of thousands of tasks might require a good-sized 
heap at the driver for other reasons as well. How big a heap are you using?

We might be able to limit the number of distinct string objects that are 
allocated when a large number of tasks refer to an identical stack trace.
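
As a rough illustration of that idea (not Spark code), the sketch below keeps 
one shared String instance per distinct error message. The class name 
ErrorMessagePool and its methods are hypothetical and do not exist in Spark. 
It assumes the messages of the failed tasks are exactly equal, and it only 
helps in that case; if each trace embeds a task-specific detail, nothing is 
shared.

```java
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper, not part of Spark: maps equal strings to a single
 * shared instance so that many tasks can reference one copy of a stack trace.
 */
public final class ErrorMessagePool {
    // Key and value are the same object: the first instance seen for that content.
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the first instance seen that equals the given message; null passes through. */
    public String canonicalize(String message) {
        if (message == null) {
            return null;
        }
        // putIfAbsent returns the previously stored instance, or null if this is the first.
        String existing = pool.putIfAbsent(message, message);
        return existing != null ? existing : message;
    }

    /** Number of distinct messages currently held. */
    public int uniqueCount() {
        return pool.size();
    }
}
```

With 50,000 failed tasks sharing one 10 KB trace, the retained copies would 
drop from about 500 MB to about 10 KB plus one reference per task. The pool 
as written never shrinks, so a real version would need to drop entries when 
the stages that reference them are cleaned up.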

> Spark master OOMs with exception stack trace stored in JobProgressListener
> --------------------------------------------------------------------------
>
>                 Key: SPARK-4906
>                 URL: https://issues.apache.org/jira/browse/SPARK-4906
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.1.1
>            Reporter: Mingyu Kim
>
> Spark master was OOMing with a lot of stack traces retained in 
> JobProgressListener. The object dependency goes like the following.
> JobProgressListener.stageIdToData => StageUIData.taskData => 
> TaskUIData.errorMessage
> Each error message is ~10 KB since it contains the entire stack trace. As we 
> have a lot of tasks, when all of the tasks across multiple stages go bad, 
> these error messages accounted for 0.5 GB of heap at some point.
> Please correct me if I'm wrong, but it looks like all the task info for 
> running applications is kept in memory, which means it's almost always bound 
> to OOM for long-running applications. Would it make sense to fix this, for 
> example, by spilling some UI states to disk?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
