[ https://issues.apache.org/jira/browse/SPARK-25422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625983#comment-16625983 ]

Imran Rashid commented on SPARK-25422:
--------------------------------------

As it might be useful in the future, lemme add a few more details about what 
was going wrong and some lessons learned for debugging things like this.

I noticed that all the failures happened when a broadcast block was fetched from 
another executor, not from the driver.  I hacked Spark's internals to force that 
scenario, but reproducing that sequence of events alone still didn't trigger the 
failure regularly.  It did become far more common, though, so I could reproduce 
locally (still infrequent enough that I had to run tests for ~30 minutes to hit 
it).
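For context, the failing test boils down to roughly the following (a simplified 
sketch, not the exact DistributedSuite code; the local-cluster sizing and config 
here are just for illustration):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Simplified sketch of the flaky scenario: a replicated on-disk cache on a
// local-cluster with IO encryption on, so cached blocks (and broadcast pieces)
// can end up being fetched executor-to-executor rather than from the driver.
val conf = new SparkConf().set("spark.io.encryption.enabled", "true")
val sc = new SparkContext("local-cluster[2,1,1024]", "test", conf)
val data = sc.parallelize(1 to 1000, 10).persist(StorageLevel.DISK_ONLY_2)
assert(data.count() == 1000)
assert(data.count() == 1000)  // second pass reads back the replicated blocks
sc.stop()
{code}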

I kept tweaking the logging to get more info, but it seemed like some changes 
made the failures even less likely (might have been my imagination).  Eventually 
I got the failures to be relatively common, every 10 iterations or so, and I 
observed that (a) the broadcast block was received correctly on the first 
executor, but in the failing runs, by the time that executor re-sent the block, 
the data was already garbage on that same executor, and (b) the bytes it sent 
always looked like 
{{0000000000000000000000000000000000000000000000000070112801000000001004000000000000000000000000000020eb24010000000000000100000000000000000000000000000000000000000000000000000000000000000000000000f0f41d01...}}
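For reference, the "corrupt remote block ... X != Y" message in the stack trace 
below comes from a per-piece checksum comparison on the fetched bytes.  Roughly 
(a paraphrased sketch, not the actual TorrentBroadcast source; I believe the 
checksum is Adler32-based), the shape of the check is:

{code:scala}
import java.nio.ByteBuffer
import java.util.zip.Adler32

// Paraphrased sketch of the per-piece check: the checksum recorded when the
// broadcast piece was written is compared against one computed over the bytes
// that were just fetched.  Garbage like the "breadcrumb" bytes above can't
// match, which surfaces as "corrupt remote block broadcast_0_piece0 ... X != Y".
def calcChecksum(block: ByteBuffer): Int = {
  val adler = new Adler32()
  adler.update(block.duplicate())  // duplicate() so position/limit stay untouched
  adler.getValue.toInt
}

def verifyPiece(pieceId: String, fetched: ByteBuffer, expected: Int): Unit = {
  val actual = calcChecksum(fetched)
  if (actual != expected) {
    throw new java.io.IOException(
      s"corrupt remote block $pieceId: $actual != $expected")
  }
}
{code}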

It turns out that {{buffer.dispose()}} was always called before the block got 
sent, but the buffer doesn't actually get cleaned immediately.  That byte 
sequence may be "breadcrumbs" left by the cleaner, perhaps as a way to tell that 
the buffer had been cleaned rather than just zeroing it -- though googling 
didn't turn up a reference to anything.
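In other words, the sending side was still holding a reference to a 
memory-mapped buffer whose mapping another code path had already disposed, so 
what went over the wire was whatever the cleaner left behind.  Just to 
illustrate the hazard (this is not necessarily the fix that went in), the 
defensive pattern is to detach the bytes from the mapping before handing them to 
anything that might outlive a dispose():

{code:scala}
import java.nio.{ByteBuffer, MappedByteBuffer}

// Illustration of the defensive pattern (not necessarily the actual fix): copy
// mapped bytes onto the heap before they are handed to the network layer, so a
// later dispose()/unmap of the file can't turn the in-flight bytes into garbage.
def detachFromMapping(mapped: MappedByteBuffer): ByteBuffer = {
  val copy = ByteBuffer.allocate(mapped.remaining())
  copy.put(mapped.duplicate())  // duplicate() leaves the original position/limit alone
  copy.flip()
  copy
}
{code}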

Encryption was not directly involved; I guess it just made the error more 
likely because everything is slower with encryption.  In fact it led me astray 
for a while, since other changes that have gone in since then make this error 
impossible with encryption, as the encrypted path no longer uses mapped buffers.


> flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25422
>                 URL: https://issues.apache.org/jira/browse/SPARK-25422
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> stacktrace
> {code}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, localhost, executor 1): java.io.IOException: org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
>       at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320)
>       at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>       at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>       at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>       at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>       at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)
>       at org.apache.spark.scheduler.Task.run(Task.scala:121)
>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
>       at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167)
>       at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>       at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>       at scala.collection.immutable.List.foreach(List.scala:392)
>       at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151)
>       at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231)
>       at scala.Option.getOrElse(Option.scala:121)
>       at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
>       at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313)
>       ... 13 more
> {code}


