There are a few interacting issues here - and unfortunately I dont recall all of it (since this was fixed a few months back). >From memory though :
a) With shuffle consolidation, data sent to remote node incorrectly includes data from partially constructed blocks - not just the request blocks. Actually, with shuffle consolidation (with and without failures) quite a few things broke. b) There might have been a few other bugs in DiskBlockObjectWriter too. c) We also suspected buffers overlapping when using cached kryo serializer (though never proved this, just disabled caching across board for now : and always create new instance). The way we debug'ed it is by introducing an Input/Output stream which introduced checksum into the data stream and validating that at each side for compression, serialization, etc. Apologies for being non specific ... I really dont have the details right now, and our internal branch is in flux due to merge effort to port our local changes to master. Hopefully we will be able to submit PR's as soon as this is done and testcases are added to validate. Regards, Mridul On Tue, Jun 24, 2014 at 10:21 AM, Reynold Xin <r...@databricks.com> wrote: > Mridul, > > Can you comment a little bit more on this issue? We are running into the > same stack trace but not sure whether it is just different Spark versions > on each cluster (doesn't seem likely) or a bug in Spark. > > Thanks. > > > > On Sat, May 17, 2014 at 4:41 AM, Mridul Muralidharan <mri...@gmail.com> > wrote: > >> I suspect this is an issue we have fixed internally here as part of a >> larger change - the issue we fixed was not a config issue but bugs in >> spark. >> >> Unfortunately we plan to contribute this as part of 1.1 >> >> Regards, >> Mridul >> On 17-May-2014 4:09 pm, "sam (JIRA)" <j...@apache.org> wrote: >> >> > sam created SPARK-1867: >> > -------------------------- >> > >> > Summary: Spark Documentation Error causes >> > java.lang.IllegalStateException: unread block data >> > Key: SPARK-1867 >> > URL: https://issues.apache.org/jira/browse/SPARK-1867 >> > Project: Spark >> > Issue Type: Bug >> > Reporter: sam >> > >> > >> > I've employed two System Administrators on a contract basis (for quite a >> > bit of money), and both contractors have independently hit the following >> > exception. What we are doing is: >> > >> > 1. Installing Spark 0.9.1 according to the documentation on the website, >> > along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. >> > 2. Building a fat jar with a Spark app with sbt then trying to run it on >> > the cluster >> > >> > I've also included code snippets, and sbt deps at the bottom. >> > >> > When I've Googled this, there seems to be two somewhat vague responses: >> > a) Mismatching spark versions on nodes/user code >> > b) Need to add more jars to the SparkConf >> > >> > Now I know that (b) is not the problem having successfully run the same >> > code on other clusters while only including one jar (it's a fat jar). >> > >> > But I have no idea how to check for (a) - it appears Spark doesn't have >> > any version checks or anything - it would be nice if it checked versions >> > and threw a "mismatching version exception: you have user code using >> > version X and node Y has version Z". >> > >> > I would be very grateful for advice on this. >> > >> > The exception: >> > >> > Exception in thread "main" org.apache.spark.SparkException: Job aborted: >> > Task 0.0:1 failed 32 times (most recent failure: Exception failure: >> > java.lang.IllegalStateException: unread block data) >> > at >> > >> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) >> > at >> > >> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) >> > at >> > >> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) >> > at >> > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) >> > at org.apache.spark.scheduler.DAGScheduler.org >> > $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) >> > at >> > >> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) >> > at >> > >> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) >> > at scala.Option.foreach(Option.scala:236) >> > at >> > >> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) >> > at >> > >> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) >> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) >> > at akka.actor.ActorCell.invoke(ActorCell.scala:456) >> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) >> > at akka.dispatch.Mailbox.run(Mailbox.scala:219) >> > at >> > >> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) >> > at >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> > at >> > >> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >> > at >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> > at >> > >> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >> > 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to >> > java.lang.IllegalStateException: unread block data [duplicate 59] >> > >> > My code snippet: >> > >> > val conf = new SparkConf() >> > .setMaster(clusterMaster) >> > .setAppName(appName) >> > .setSparkHome(sparkHome) >> > .setJars(SparkContext.jarOfClass(this.getClass)) >> > >> > println("count = " + new >> SparkContext(conf).textFile(someHdfsPath).count()) >> > >> > My SBT dependencies: >> > >> > // relevant >> > "org.apache.spark" % "spark-core_2.10" % "0.9.1", >> > "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0", >> > >> > // standard, probably unrelated >> > "com.github.seratch" %% "awscala" % "[0.2,)", >> > "org.scalacheck" %% "scalacheck" % "1.10.1" % "test", >> > "org.specs2" %% "specs2" % "1.14" % "test", >> > "org.scala-lang" % "scala-reflect" % "2.10.3", >> > "org.scalaz" %% "scalaz-core" % "7.0.5", >> > "net.minidev" % "json-smart" % "1.2" >> > >> > >> > >> > -- >> > This message was sent by Atlassian JIRA >> > (v6.2#6252) >> > >>