[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-8790:
------------------------------
    Component/s: Block Manager

> BlockManager.reregister cause OOM
> ---------------------------------
>
>                 Key: SPARK-8790
>                 URL: https://issues.apache.org/jira/browse/SPARK-8790
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Spark Core
>            Reporter: Patrick Liu
>         Attachments: driver.log, executor.log, webui-executor.png, webui-slow-task.png
>
>
> We run SparkSQL 1.2.1 on YARN.
> A SQL query consists of 100 tasks. Most of them finish in < 10s, but one runs for 16m.
> The web UI shows that the executor had been running GC for 15m before the OOM.
> The log shows that the executor first tried to connect to the master to report a broadcast value, but the network was unavailable, so the executor lost its heartbeat to the master.
> The master then required the executor to re-register. While the executor was running reportAllBlocks against the master, the network was still unstable and some calls timed out.
> Finally, the executor hit an OOM.
> Please take a look.
> Attached is the detailed log.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)