Networking issues with Spark on EC2

2015-09-24 Thread SURAJ SHETH
Hi,

I am using Spark 1.2 and facing network-related issues while performing
simple computations.

This is a custom cluster set up on EC2 machines with a Spark prebuilt
binary from the Apache site. The problem occurs only when workers are on
other machines (networking involved). Running the master and the slave on
a single node works correctly.

The error log from a slave node is attached below. The job reads a text
file from the local FS (copied to each node) and counts it. The first 30
tasks complete within 5 seconds. Then it takes several minutes to complete
another 10 tasks, and the job eventually dies.

Sometimes, one of the workers completes all the tasks assigned to it.
Different workers behave differently at different times (non-deterministic).

Is it related to something specific to EC2?



15/09/24 13:04:40 INFO Executor: Running task 117.0 in stage 0.0 (TID 117)

15/09/24 13:04:41 INFO TorrentBroadcast: Started reading broadcast variable
1

15/09/24 13:04:41 INFO SendingConnection: Initiating connection to
[master_ip:56305]

15/09/24 13:04:41 INFO SendingConnection: Connected to
[master_ip/master_ip_address:56305], 1 messages pending

15/09/24 13:05:41 INFO TorrentBroadcast: Started reading broadcast variable
1

15/09/24 13:05:41 ERROR Executor: Exception in task 77.0 in stage 0.0 (TID
77)

java.io.IOException: sendMessageReliably failed because ack was not
received within 60 sec

at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)

at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)

at scala.Option.foreach(Option.scala:236)

at
org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)

at
io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)

at
io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)

at
io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)

at java.lang.Thread.run(Thread.java:745)

15/09/24 13:05:41 INFO CoarseGrainedExecutorBackend: Got assigned task 122

15/09/24 13:05:41 INFO Executor: Running task 3.1 in stage 0.0 (TID 122)

15/09/24 13:06:41 ERROR Executor: Exception in task 113.0 in stage 0.0 (TID
113)

java.io.IOException: sendMessageReliably failed because ack was not
received within 60 sec

at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)

at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)

at scala.Option.foreach(Option.scala:236)

at
org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)

at
io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)

at
io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)

at
io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)

at java.lang.Thread.run(Thread.java:745)

15/09/24 13:06:41 INFO TorrentBroadcast: Started reading broadcast variable
1

15/09/24 13:06:41 INFO SendingConnection: Initiating connection to
[master_ip/master_ip_address:44427]

15/09/24 13:06:41 INFO SendingConnection: Connected to
[master_ip/master_ip_address:44427], 1 messages pending

15/09/24 13:07:41 ERROR Executor: Exception in task 37.0 in stage 0.0 (TID
37)

java.io.IOException: sendMessageReliably failed because ack was not
received within 60 sec

at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)

at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)

at scala.Option.foreach(Option.scala:236)

at
org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)

at
io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)

at
io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)

at
io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)

at java.lang.Thread.run(Thread.java:745)





I checked the network speed between the master and the slave: it can scp
large files at 60 MB/s.

Any leads on how this can be fixed?



Thanks and Regards,

Suraj Sheth


Re: Networking issues with Spark on EC2

2015-09-24 Thread Ankur Srivastava
Hi Suraj,

Spark uses many ports to communicate between nodes. Your security group is
probably too restrictive and does not allow the instances to communicate
on all ports. The easiest way to resolve this is to add a rule allowing
all inbound traffic on all ports (0-65535) from instances in the same
security group, like this:

All TCP
TCP
0 - 65535
your security group
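
For reference, a rule like that can also be added with the AWS CLI; the
security group ID below is a placeholder for your actual group:

```shell
# Allow all inbound TCP (ports 0-65535) from instances in the same
# security group. sg-0123456789abcdef0 is a placeholder - substitute
# the ID of the group your Spark instances belong to.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 0-65535 \
  --source-group sg-0123456789abcdef0
```

Using the group itself as the source keeps the cluster open internally
without exposing the ports to the internet.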

Hope this helps!!

Thanks
Ankur


Re: Networking issues with Spark on EC2

2015-09-25 Thread SURAJ SHETH
Hi Ankur,
Thanks for the reply.
This is already done.
If I wait long enough (10 minutes), a few tasks succeed even on the slave
nodes. Sometimes a fraction of the tasks (20%) completes on all the
machines in the first 5 seconds, and then everything slows down
drastically.
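
As a diagnostic (not a fix), the 60-second window in the "ack was not
received within 60 sec" trace corresponds to Spark 1.x's
`spark.core.connection.ack.wait.timeout`. Raising it at submit time can
help tell slow connections apart from fully blocked ones; the values and
jar name below are illustrative:

```shell
# Raise the ack timeout of the Spark 1.x NIO connection manager from
# the default 60s to 300s. your_app.jar is a placeholder.
spark-submit \
  --conf spark.core.connection.ack.wait.timeout=300 \
  your_app.jar
```

If tasks then succeed after long waits instead of failing, the network
path works but is being throttled or intermittently dropped.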

Thanks and Regards,
Suraj Sheth


Re: Networking issues with Spark on EC2

2015-09-25 Thread Natu Lauchande
Hi,

Are you using EMR?

Natu


Re: Networking issues with Spark on EC2

2015-09-25 Thread SURAJ SHETH
Hi,
Nope. I was trying to use EC2 (due to a few constraints), and that is
where I faced the problem.
With EMR, it works flawlessly.
But I would like to go back to EC2 if I can fix this issue.
Has anybody set up a Spark cluster using plain EC2 machines? What steps
did you follow?
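
For what it's worth, Spark 1.x distributions ship a spark-ec2 launch
script (in the ec2/ directory) that creates the security groups and
cluster for you. A minimal invocation looks roughly like this; the key
pair name, identity file, and cluster name are placeholders:

```shell
# Run from the ec2/ directory of a Spark 1.x distribution.
# my-keypair, my-key.pem, and my-cluster are placeholders.
./spark-ec2 \
  --key-pair=my-keypair \
  --identity-file=my-key.pem \
  --slaves=2 \
  launch my-cluster
```

Even if you manage instances yourself, the security groups this script
creates are a useful reference for which ports Spark needs open.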

Thanks and Regards,
Suraj Sheth
