Caused by: java.io.IOException: Failed to connect to ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)


The message above indicates that there used to be an executor at that address,
and by the time another executor tried to read from it, it no longer existed.
You can confirm whether this is the case by looking at the Spark application
UI - you may find dead executors there.
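
If memory pressure is killing executors, the usual first step is to raise the
executor memory (and, on YARN, the off-heap overhead) at submit time. A
minimal sketch - the application class, jar name, and values here are
placeholders to tune for your cluster, not recommendations:

    # Hypothetical values and names -- adjust for your workload.
    spark-submit \
      --executor-memory 8g \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      --class com.example.MyApp \
      myapp.jar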

On Sun, May 8, 2016 at 6:02 PM, Brandon White <bwwintheho...@gmail.com>
wrote:

> I'm not quite sure how this is a memory problem. There are no OOM
> exceptions, and the job only breaks when actions are run in parallel,
> submitted to the scheduler by different threads.
>
> The issue is that the doGetRemote function does not retry when it is
> denied access to a cache block.
> On May 8, 2016 5:55 PM, "Ashish Dubey" <ashish....@gmail.com> wrote:
>
> Brandon,
>
> how much memory are you giving to your executors - did you check if there
> were dead executors in your application logs? Most likely you require
> more memory for your executors.
>
> Ashish
>
> On Sun, May 8, 2016 at 1:01 PM, Brandon White <bwwintheho...@gmail.com>
> wrote:
>
>> Hello all,
>>
>> I am running a Spark application which schedules multiple Spark jobs.
>> Something like:
>>
>> val df  = sqlContext.read.parquet("/path/to/file")
>>
>> filterExpressions.par.foreach { expression =>
>>   df.filter(expression).count()
>> }
>>
>> When the block manager fails to fetch a block, it throws an exception
>> which eventually kills the application: http://pastebin.com/2ggwv68P
>>
>> This code works when I run it on one thread with:
>>
>> filterExpressions.foreach { expression =>
>>   df.filter(expression).count()
>> }
>>
>> But I really need the parallel execution of the jobs. Is there any way
>> around this? It seems like a bug in the BlockManager's doGetRemote
>> function. I have tried the HTTP block manager as well.
>>
>
>
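
One common workaround for this driver-side pattern is to cap concurrency with
an explicit thread pool instead of `.par`, which runs on the shared default
ForkJoinPool. A minimal sketch of that technique - `runCount` below is a
hypothetical stand-in for the real `df.filter(expression).count()` action, so
the example runs without a Spark cluster:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for df.filter(expression).count(), so this
// sketch is self-contained; swap in the real Spark action.
def runCount(expression: String): Long = expression.length.toLong

val filterExpressions = Seq("a > 1", "b < 2", "c == 3")

// Bound how many jobs are submitted concurrently with an explicit
// fixed-size pool rather than .par's default ForkJoinPool.
val pool = Executors.newFixedThreadPool(2)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// One Future per filter; Future.sequence preserves input order.
val counts = Await.result(
  Future.sequence(filterExpressions.map(e => Future(runCount(e)))),
  1.minute)

pool.shutdown()
println(counts)
```

Bounding the pool does not by itself fix the missing retry in doGetRemote,
but it limits how many concurrent fetches can hit a lost executor at once.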
