[ https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277849#comment-14277849 ]

Mingyu Kim commented on SPARK-4906:
-----------------------------------

"typically once a few tasks have failed the stage will fail": Is that actually 
true? Doesn't a stage with N tasks run until all N tasks run (and potentially 
fail)? We have datasets that are thousands of partitions each, so that can 
totally add up to 50k failed tasks over time.

We can certainly allocate more memory, but the concern is that for long-running 
SparkContexts, this memory requirement grows linearly over time, assuming 
failures happen at a fixed frequency. So we will run into OOMs no matter how 
large the heap is, as long as enough failures accumulate over time.
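
As a rough back-of-the-envelope check (a sketch using the ~10kb-per-message and 
50k-failed-tasks figures from this ticket, not new measurements; runnable as a 
Scala script):

    // Rough heap estimate for retained error messages, using this ticket's figures.
    val bytesPerErrorMessage = 10L * 1024   // ~10kb per retained stack trace
    val failedTasks          = 50000L       // failed tasks accumulated over time
    val heapUsed             = bytesPerErrorMessage * failedTasks
    println(f"${heapUsed / math.pow(1024, 3)}%.2f GB retained")  // prints 0.48 GB, i.e. ~0.5GB

And since nothing evicts the per-task entries, that total only grows as more 
failures arrive.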

> Spark master OOMs with exception stack trace stored in JobProgressListener
> --------------------------------------------------------------------------
>
>                 Key: SPARK-4906
>                 URL: https://issues.apache.org/jira/browse/SPARK-4906
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 1.1.1
>            Reporter: Mingyu Kim
>
> Spark master was OOMing with a lot of stack traces retained in 
> JobProgressListener. The object dependency chain (sketched after this 
> description) is:
> JobProgressListener.stageIdToData => StageUIData.taskData => 
> TaskUIData.errorMessage
> Each error message is ~10kb since it contains the entire stack trace. Since 
> we have a lot of tasks, when tasks across multiple stages failed, these error 
> messages accounted for 0.5GB of heap at one point.
> Please correct me if I'm wrong, but it looks like all the task info for 
> running applications is kept in memory, which means long-running applications 
> are almost always bound to OOM eventually. Would it make sense to fix this, 
> for example, by spilling some UI state to disk?
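
For reference, a minimal sketch of the retention path named above. The field 
shapes only approximate the Spark 1.1 org.apache.spark.ui.jobs classes; this is 
not the exact source:

    import scala.collection.mutable

    // ~10kb of stack trace is retained here for every failed task.
    class TaskUIData(var errorMessage: Option[String] = None)

    class StageUIData {
      // One entry per task in the stage; entries are never spilled to disk.
      val taskData = new mutable.HashMap[Long, TaskUIData]()
    }

    class JobProgressListener {
      // Only whole stages are evicted (spark.ui.retainedStages, default 1000),
      // so a few large, failure-heavy stages can pin gigabytes of error strings.
      val stageIdToData = new mutable.HashMap[Int, StageUIData]()
    }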


