Caused by: java.io.IOException: Failed to connect to ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
The message above indicates that an executor used to exist at that address and that, by the time another executor tried to read from it, it was gone. You can also confirm whether this is the case by looking at the Spark application UI - you may find dead executors there.

On Sun, May 8, 2016 at 6:02 PM, Brandon White <bwwintheho...@gmail.com> wrote:

> I'm not quite sure how this is a memory problem. There are no OOM
> exceptions, and the job only breaks when actions are run in parallel,
> submitted to the scheduler by different threads.
>
> The issue is that the doGetRemote function does not retry when it is
> denied access to a cache block.
>
> On May 8, 2016 5:55 PM, "Ashish Dubey" <ashish....@gmail.com> wrote:
>
> Brandon,
>
> How much memory are you giving your executors? Did you check whether there
> were dead executors in your application logs? Most likely your executors
> require more memory.
>
> Ashish
>
> On Sun, May 8, 2016 at 1:01 PM, Brandon White <bwwintheho...@gmail.com>
> wrote:
>
>> Hello all,
>>
>> I am running a Spark application which schedules multiple Spark jobs.
>> Something like:
>>
>> val df = sqlContext.read.parquet("/path/to/file")
>>
>> filterExpressions.par.foreach { expression =>
>>   df.filter(expression).count()
>> }
>>
>> When the block manager fails to fetch a block, it throws an exception
>> which eventually kills the application: http://pastebin.com/2ggwv68P
>>
>> This code works when I run it on one thread with:
>>
>> filterExpressions.foreach { expression =>
>>   df.filter(expression).count()
>> }
>>
>> But I really need the parallel execution of the jobs. Is there any way
>> around this? It seems like a bug in the BlockManager's doGetRemote function.
>> I have tried the HTTP block manager as well.
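Since doGetRemote itself does not retry, one workaround is to retry the whole action from the driver side. A minimal sketch of such a retry wrapper (withRetry is an illustrative helper, not part of any Spark API, and the attempt count is an assumption; this papers over transient fetch failures rather than fixing the underlying race):

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RetryUtil {
  // Run an action, retrying up to `attempts` times on any exception.
  // Intended as a driver-side workaround for transient block-fetch
  // failures when multiple jobs are submitted in parallel.
  @tailrec
  def withRetry[T](attempts: Int)(action: => T): T =
    Try(action) match {
      case Success(result)            => result
      case Failure(_) if attempts > 1 => withRetry(attempts - 1)(action)
      case Failure(e)                 => throw e
    }
}
```

In the parallel loop from the original message this would be used roughly as: filterExpressions.par.foreach { expression => RetryUtil.withRetry(3)(df.filter(expression).count()) }. Raising spark.shuffle.io.maxRetries may also help for shuffle-fetch variants of the same failure.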