[ https://issues.apache.org/jira/browse/SPARK-25422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625983#comment-16625983 ]
Imran Rashid commented on SPARK-25422:
--------------------------------------

Since it might be useful in the future, let me add a few more details about what was going wrong, and some lessons learned for debugging things like this.

I noticed that all the failures happened when a broadcast block was fetched from another executor, not from the driver. I made some hacks to the internals of Spark to force that scenario. The failure still didn't happen regularly just by reproducing that sequence of events, but it did become far more common, so I could reproduce it locally (though it was still infrequent enough that I had to run tests for ~30 minutes to hit it). I kept tweaking the logging to get more information, but some changes seemed to make the failures even less likely (which might have been my imagination). Eventually I got the failures to be relatively common, roughly every 10 iterations, and I observed that:

(a) the broadcast block was received correctly on the first executor, but in the failing cases, by the time that executor re-sent the block, the block was already garbage on that same executor; and

(b) the bytes it was sending always looked like {{0000000000000000000000000000000000000000000000000070112801000000001004000000000000000000000000000020eb24010000000000000100000000000000000000000000000000000000000000000000000000000000000000000000f0f41d01...}}

It turns out that {{buffer.dispose()}} was always called before the block got sent, but the buffer doesn't actually get cleaned immediately. That byte sequence may be "breadcrumbs" left by the cleaner, perhaps as a way to tell that the buffer had been cleaned, instead of it just being zeroed out -- though googling didn't turn up a reference to anything like it.

Encryption was not directly involved; I suspect it just made the error more likely, since everything is slower with encryption.
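The failure mode above can be sketched in isolation. This is a hypothetical, self-contained simulation (class and method names are illustrative, not Spark's actual API, and CRC32 stands in for whatever checksum function TorrentBroadcast really uses): a piece is checksummed when it is created, the cleaner then stomps on the memory after dispose(), and the re-computed checksum no longer matches -- the "corrupt remote block ... X != Y" symptom in the stack trace.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class BroadcastChecksumDemo {

    // Checksum a block the way a broadcast implementation might:
    // CRC32 over the buffer's remaining bytes. (Illustrative only --
    // Spark's real checksum function may differ.)
    public static long checksum(ByteBuffer block) {
        CRC32 crc = new CRC32();
        crc.update(block.duplicate()); // duplicate() so the caller's position is untouched
        return crc.getValue();
    }

    // Simulate the cleaner reclaiming a disposed buffer: the memory is
    // overwritten while a stale reference still points at it.
    public static void simulateDispose(ByteBuffer block) {
        for (int i = 0; i < block.limit(); i++) {
            block.put(i, (byte) 0);
        }
    }

    public static void main(String[] args) {
        ByteBuffer piece = ByteBuffer.allocate(64);
        for (int i = 0; i < 64; i++) piece.put(i, (byte) i);

        long expected = checksum(piece); // recorded when the piece is created
        simulateDispose(piece);          // dispose() races with the in-flight send
        long received = checksum(piece); // what the fetching executor ends up seeing

        if (expected != received) {
            System.out.println("corrupt remote block: " + expected + " != " + received);
        }
    }
}
```

The point of the sketch is the ordering: the checksum recorded at creation time is correct, so the receiver's mismatch proves the bytes changed on the *sending* executor after dispose(), exactly as observed.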
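One defensive pattern against this class of bug (a sketch of the general idea only -- not necessarily how the actual fix for this ticket works, and the names here are made up) is to copy the block's bytes onto the heap before dispose() can possibly run, so an in-flight send never reads memory the cleaner has reclaimed:

```java
import java.nio.ByteBuffer;

public class SafeSendDemo {

    // Copy the buffer's remaining bytes onto the heap. The copy stays
    // valid even after the original (possibly memory-mapped) buffer is
    // disposed and its memory reused.
    public static byte[] copyForSend(ByteBuffer block) {
        ByteBuffer dup = block.duplicate(); // don't disturb the caller's position
        byte[] out = new byte[dup.remaining()];
        dup.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer block = ByteBuffer.wrap(new byte[]{10, 20, 30});
        byte[] toSend = copyForSend(block);
        // Even if the original buffer is scribbled over after dispose()...
        for (int i = 0; i < block.limit(); i++) block.put(i, (byte) 0);
        // ...the heap copy still holds the original bytes and can be
        // sent (and checksummed) safely.
        System.out.println(toSend[0] + "," + toSend[1] + "," + toSend[2]);
    }
}
```

The trade-off is an extra copy per send, which is why an implementation might prefer to fix the lifecycle (not disposing until all sends complete) instead.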
And in fact encryption led me astray for a while: other changes that have gone in since make this error impossible with encryption enabled, because there are no mapped buffers with encryption.

> flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-25422
> URL: https://issues.apache.org/jira/browse/SPARK-25422
> Project: Spark
> Issue Type: Test
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Wenchen Fan
> Priority: Major
>
> stacktrace
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, localhost, executor 1): java.io.IOException: org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
> 	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320)
> 	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
> 	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> 	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> 	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> 	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:121)
> 	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
> 	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167)
> 	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
> 	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
> 	at scala.collection.immutable.List.foreach(List.scala:392)
> 	at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151)
> 	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231)
> 	at scala.Option.getOrElse(Option.scala:121)
> 	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
> 	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313)
> 	... 13 more
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)