[ https://issues.apache.org/jira/browse/SPARK-7054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick Wendell resolved SPARK-7054.
------------------------------------
    Resolution: Invalid

Hey there,

Please send this to the Spark users list to get feedback and help to further
isolate the issue. As it stands now it's underspecified for a JIRA.

> Spark jobs hang for ~15 mins when a node goes down
> --------------------------------------------------
>
>                 Key: SPARK-7054
>                 URL: https://issues.apache.org/jira/browse/SPARK-7054
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.1
>         Environment: CentOS 6, Java 8
>            Reporter: Abhishek Choudhary
>            Priority: Blocker
>
> In a four-node cluster (on VMs) with 2 Namenodes, 2 Datanodes and 10
> executors (Yarn 2.4), Spark jobs are running in yarn-client mode. When a
> running VM is shut down, the Spark job hangs for ~15 mins.
> After ~45-50 seconds the driver got information about the lost block managers.
> From logs:
> 2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN
> org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager
> BlockManagerId(9, ACUME-DN2, 40898) with no recent heart beats: 59674ms
> exceeds 45000ms
> 2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN
> org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager
> BlockManagerId(5, ACUME-DN2, 37947) with no recent heart beats: 60044ms
> exceeds 45000ms
> 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN
> org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager
> BlockManagerId(3, ACUME-DN2, 49808) with no recent heart beats: 54637ms
> exceeds 45000ms
> 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN
> org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager
> BlockManagerId(1, ACUME-DN2, 44090) with no recent heart beats: 59049ms
> exceeds 45000ms
> 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN
> org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager
> BlockManagerId(7, ACUME-DN2, 47267) with no recent heart beats: 56879ms
> exceeds 45000ms
> After ~15 mins the Spark driver got the executor lost event and rescheduled
> the failed tasks.
> From logs:
> 2015-04-22 10:05:04,965 [sparkDriver-akka.actor.default-dispatcher-19] ERROR
> org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Lost executor
> 1 on ACUME-DN2: remote Akka client disassociated
> For these 15 mins, all jobs were stuck waiting on executors running on the
> shut-down VM.
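
For anyone taking this to the users list, the "~15 mins" figure can be checked directly against the two log timestamps quoted in the report. The snippet below is only a rough sketch: the helper name `parse_log_ts` and the format string are assumptions made for illustration, and the two timestamps are copied from the quoted log lines.

```python
from datetime import datetime

# Assumed layout of the timestamps in the quoted driver log lines.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_ts(ts: str) -> datetime:
    """Parse a timestamp such as '2015-04-22 09:50:30,000' (hypothetical helper)."""
    return datetime.strptime(ts, LOG_TS_FORMAT)


# Both values are copied from the log excerpts in the report.
block_managers_removed = parse_log_ts("2015-04-22 09:50:30,000")
executor_lost = parse_log_ts("2015-04-22 10:05:04,965")

# Time between the BlockManager removal and the "Lost executor" event.
gap = executor_lost - block_managers_removed
minutes, seconds = divmod(gap.total_seconds(), 60)
print(f"{int(minutes)} min {seconds:.3f} s")  # 14 min 34.965 s
```

This gap is measured from the BlockManager removal, which itself was logged roughly a minute after the last heartbeats (54637ms to 60044ms), so the stall counted from the VM shutdown would be somewhat longer.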