[ 
https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Liu updated SPARK-8790:
-------------------------------
    Description: 
We run SparkSQL 1.2.1 on YARN.

One SQL query consists of 100 tasks. Most of them finish in < 10s, but one 
task runs for 16m.

The web UI shows that the executor spent 15m in GC before it ran out of 
memory (OOM).

The log shows that the executor first tries to connect to the master to 
report a broadcast value, but the network is unavailable, so the executor 
loses its heartbeat to the master. 
The master then requires the executor to re-register. While the executor is 
running reportAllBlocks against the master, the network is still unstable 
and the calls sometimes time out.

Finally, the executor fails with OOM.

Please take a look.

Attached is the detailed log.
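
The sequence in the description (heartbeat lost, master asks for 
re-registration, executor reports all blocks, some reports time out) can be 
sketched as a toy model. This is not Spark code: the classes, method names, 
and the fail_once hook below are hypothetical and only illustrate the order 
of events. The sketch makes no claim about why memory grows during the 
retries.

```python
class Master:
    """Toy stand-in for the block manager master (hypothetical, not Spark's API)."""

    def __init__(self):
        self.registered = set()
        self.blocks = {}
        # Calls listed here raise TimeoutError once, to mimic an unstable network.
        self.fail_once = set()

    def _maybe_timeout(self, call):
        if call in self.fail_once:
            self.fail_once.discard(call)
            raise TimeoutError(call)

    def heartbeat(self, executor_id):
        self._maybe_timeout(("heartbeat", executor_id))
        # False means "unknown executor, please re-register".
        return executor_id in self.registered

    def register(self, executor_id):
        self._maybe_timeout(("register", executor_id))
        self.registered.add(executor_id)
        self.blocks.setdefault(executor_id, set())

    def report_block(self, executor_id, block_id):
        self._maybe_timeout(("report", executor_id, block_id))
        self.blocks[executor_id].add(block_id)

    def expire(self, executor_id):
        # The master forgets an executor whose heartbeats stopped arriving.
        self.registered.discard(executor_id)
        self.blocks.pop(executor_id, None)


class Executor:
    """Toy executor that heartbeats and re-registers on request."""

    def __init__(self, executor_id, block_ids, master):
        self.id = executor_id
        self.block_ids = list(block_ids)
        self.master = master
        self.log = []

    def reregister(self):
        self.master.register(self.id)
        # Report every locally held block; any single report may time out.
        for block_id in self.block_ids:
            try:
                self.master.report_block(self.id, block_id)
            except TimeoutError:
                self.log.append("report timed out: " + block_id)

    def tick(self):
        try:
            known = self.master.heartbeat(self.id)
        except TimeoutError:
            self.log.append("heartbeat timed out")
            return
        if not known:
            self.log.append("re-registering")
            self.reregister()


master = Master()
executor = Executor("exec-1", ["broadcast_0", "rdd_1_0"], master)
master.register("exec-1")

master.fail_once.add(("heartbeat", "exec-1"))
executor.tick()           # network down: the heartbeat times out
master.expire("exec-1")   # the master gives up on the silent executor

master.fail_once.add(("report", "exec-1", "rdd_1_0"))
executor.tick()           # master asks to re-register; one block report times out
```

After the second tick, executor.log holds the three events in order and the 
master knows only about the block whose report got through.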


  was:
We run SparkSQL 1.2.1 on YARN.

One SQL query consists of 100 tasks. Most of them finish in < 10s, but one 
task runs for 16m.

The web UI shows that the executor spent 15m in GC before it ran out of 
memory (OOM).

The log shows that the executor first tries to connect to the master to 
report a broadcast value, but the network is unavailable, so the executor 
loses its heartbeat to the master. 
The master then requires the executor to re-register. While the executor is 
running reportAllBlocks against the master, the network is still unstable, 
so the calls sometimes time out.

Finally, the executor fails with OOM.

Please take a look.

Attached is the detailed log.



> BlockManager.reregister cause OOM
> ---------------------------------
>
>                 Key: SPARK-8790
>                 URL: https://issues.apache.org/jira/browse/SPARK-8790
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Patrick Liu
>         Attachments: driver.log, executor.log, webui-executor.png, 
> webui-slow-task.png



