[ https://issues.apache.org/jira/browse/SPARK-26689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750669#comment-16750669 ]
liupengcheng commented on SPARK-26689:
--------------------------------------

[~tgraves] We use YARN as the resource manager, and we run Spark applications on YARN with Spark version 2.1.0. That information is already provided in the Environment field. Is there any other information you would like me to provide? BTW, I don't think this exception is related to the resource manager.

> Bad disk causing broadcast failure
> ----------------------------------
>
>                 Key: SPARK-26689
>                 URL: https://issues.apache.org/jira/browse/SPARK-26689
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.4.0
>         Environment: Spark on Yarn
>                      Multiple disks
>            Reporter: liupengcheng
>            Priority: Major
>
> We encountered an application failure in our production cluster that was caused by a bad disk: a single bad disk was enough to fail the whole application.
> {code:java}
> Job aborted due to stage failure: Task serialization failed: java.io.IOException: Failed to create local dir in /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
> org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
> org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
> scala.Option.foreach(Option.scala:236)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> We have multiple disks on our cluster nodes; however, the application still fails.
> I think this is because Spark does not currently handle bad disks in `DiskBlockManager`. Actually, in a multiple-disk environment we could handle a bad disk gracefully and avoid the application failure.
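For illustration only, a rough sketch of the kind of fallback the description suggests (this is not Spark's actual `DiskBlockManager` code; the class and parameter names `FallbackDiskBlockManager`, `localDirs` and `subDirsPerLocalDir` are made up for the sketch): when the sub-directory chosen by the block's hash cannot be created, try the remaining local dirs instead of failing the task.

{code:scala}
import java.io.{File, IOException}

// Hypothetical sketch -- not Spark's real implementation. It shows a block-to-file
// lookup that skips a local dir whose sub-directory cannot be created (e.g. because
// the underlying disk has gone bad) and falls back to the remaining local dirs.
class FallbackDiskBlockManager(localDirs: Array[File], subDirsPerLocalDir: Int = 64) {

  def getFile(blockName: String): File = {
    // Non-negative hash of the block name; & Int.MaxValue avoids the Int.MinValue corner case.
    val hash = blockName.hashCode & Int.MaxValue
    val subDirId = hash % subDirsPerLocalDir
    val dirCount = localDirs.length

    (0 until dirCount).iterator
      // Start with the local dir the hash maps to, then try the others in order.
      .map(i => localDirs(((hash / subDirsPerLocalDir) + i) % dirCount))
      .map(dir => new File(dir, "%02x".format(subDirId)))
      // Keep the first sub-directory that already exists or can be created;
      // a failed mkdirs() (likely a bad disk) just moves on to the next dir.
      .find(subDir => subDir.isDirectory || subDir.mkdirs())
      .map(subDir => new File(subDir, blockName))
      .getOrElse(throw new IOException(
        s"Failed to create local dir for block $blockName on any of $dirCount local dirs"))
  }
}
{code}

A real fix inside Spark would also need to remember which dirs are bad so that later lookups of the same block map to the same file; this sketch does not attempt that.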