liupengcheng created SPARK-26689:
------------------------------------

             Summary: Bad disk causing broadcast failure
                 Key: SPARK-26689
                 URL: https://issues.apache.org/jira/browse/SPARK-26689
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0, 2.1.0
         Environment: Spark on Yarn
Multiple Disk
            Reporter: liupengcheng

We encountered an application failure in our production cluster that was caused by a bad disk. The job aborted with the following error:

{code:java}
Job aborted due to stage failure: Task serialization failed: java.io.IOException: Failed to create local dir in /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
scala.Option.foreach(Option.scala:236)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code}

We have multiple disks on our cluster nodes, but the application still fails. I think this is because Spark does not currently handle bad disks in `DiskBlockManager`. In a multi-disk environment, we could handle a bad disk to avoid failing the whole application.
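
To illustrate the kind of handling meant here, the following is a minimal sketch only, not Spark's actual code and not a proposed patch. The class `FallbackDirResolver` and its method `getFile` are hypothetical names introduced for this example. It assumes a block file is mapped to a local dir by hashing its name, and that a bad disk shows up as a failure to create the sub-directory; in that case it simply tries the next configured local dir instead of throwing right away.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Spark): resolves a block file, skipping unusable local dirs. */
final class FallbackDirResolver {
    private final List<File> localDirs;
    private final int subDirsPerLocalDir;

    FallbackDirResolver(List<File> localDirs, int subDirsPerLocalDir) {
        if (localDirs.isEmpty() || subDirsPerLocalDir <= 0) {
            throw new IllegalArgumentException("need at least one local dir and one sub-dir");
        }
        this.localDirs = new ArrayList<>(localDirs);
        this.subDirsPerLocalDir = subDirsPerLocalDir;
    }

    /** Returns the file location for a block name, falling back to other local dirs on failure. */
    File getFile(String filename) throws IOException {
        // Mask the sign bit so the hash is never negative.
        int hash = filename.hashCode() & Integer.MAX_VALUE;
        int n = localDirs.size();
        int firstDir = hash % n;
        String subDirName = String.format("%02x", (hash / n) % subDirsPerLocalDir);

        // Probe the preferred dir first, then the remaining dirs in a fixed order.
        for (int attempt = 0; attempt < n; attempt++) {
            File root = localDirs.get((firstDir + attempt) % n);
            File subDir = new File(root, subDirName);
            // mkdirs() returns false if the dir already exists, so also accept an existing dir.
            if (subDir.mkdirs() || subDir.isDirectory()) {
                return new File(subDir, filename);
            }
        }
        // Only fail once every configured local dir has been tried.
        throw new IOException("Failed to create local dir in any of " + localDirs);
    }
}
```

One caveat with this approach: the probe order is deterministic, so a later lookup for the same name walks the same sequence of dirs, but if a disk fails or recovers between the write and the lookup, the resolved location can change. A real fix would need to account for that, for example by remembering where each file was actually placed.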