Hi Suraj,

Spark uses many ports to communicate between nodes. Your security group is
probably too restrictive and does not allow the instances to talk to each
other on all of those ports. The easiest way to resolve this is to add an
inbound rule allowing all TCP traffic on all ports (0-65535) from instances
in the same security group, like this:

Type:       All TCP
Protocol:   TCP
Port Range: 0 - 65535
Source:     your security group
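The same rule can be added from the command line with the AWS CLI. A minimal
sketch, assuming the AWS CLI is configured with the right credentials and
region; the security group ID `sg-0123456789abcdef0` is a placeholder you
would replace with your cluster's actual group:

```shell
# Allow all TCP traffic (ports 0-65535) from instances in the same
# security group, matching the console rule described above.
# sg-0123456789abcdef0 is a hypothetical ID -- substitute your own.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 0-65535 \
  --source-group sg-0123456789abcdef0
```

Using the group itself as the source means any instance in that group can
reach any other instance in it, without opening the ports to the internet.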

Hope this helps!!

Thanks
Ankur

On Thu, Sep 24, 2015 at 7:09 AM SURAJ SHETH <shet...@gmail.com> wrote:

> Hi,
>
> I am using Spark 1.2 and facing network related issues while performing
> simple computations.
>
> This is a custom cluster set up on EC2 machines using the Spark prebuilt
> binary from the Apache site. The problem appears only when workers run on
> other machines (i.e., when networking is involved). Running the master and
> the slave on a single node works correctly.
>
> The error log from the slave node is attached below. The job reads a text
> file from the local FS (copied to each node) and counts it. The first 30
> tasks complete within 5 seconds. Then it takes several minutes to complete
> another 10 tasks, and eventually the job dies.
>
> Sometimes, one of the workers completes all the tasks assigned to it.
> Different workers show different behavior at different times
> (non-deterministic).
>
> Is it related to something specific to EC2?
>
>
>
> 15/09/24 13:04:40 INFO Executor: Running task 117.0 in stage 0.0 (TID 117)
>
> 15/09/24 13:04:41 INFO TorrentBroadcast: Started reading broadcast
> variable 1
>
> 15/09/24 13:04:41 INFO SendingConnection: Initiating connection to
> [master_ip:56305]
>
> 15/09/24 13:04:41 INFO SendingConnection: Connected to
> [master_ip/master_ip_address:56305], 1 messages pending
>
> 15/09/24 13:05:41 INFO TorrentBroadcast: Started reading broadcast
> variable 1
>
> 15/09/24 13:05:41 ERROR Executor: Exception in task 77.0 in stage 0.0 (TID
> 77)
>
> java.io.IOException: sendMessageReliably failed because ack was not
> received within 60 sec
>
>         at
> org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>
>         at
> org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>
>         at scala.Option.foreach(Option.scala:236)
>
>         at
> org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>
>         at
> io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>
>         at
> io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>
>         at
> io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>
>         at java.lang.Thread.run(Thread.java:745)
>
> 15/09/24 13:05:41 INFO CoarseGrainedExecutorBackend: Got assigned task 122
>
> 15/09/24 13:05:41 INFO Executor: Running task 3.1 in stage 0.0 (TID 122)
>
> 15/09/24 13:06:41 ERROR Executor: Exception in task 113.0 in stage 0.0
> (TID 113)
>
> java.io.IOException: sendMessageReliably failed because ack was not
> received within 60 sec
>
>         at
> org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>
>         at
> org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>
>         at scala.Option.foreach(Option.scala:236)
>
>         at
> org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>
>         at
> io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>
>         at
> io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>
>         at
> io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>
>         at java.lang.Thread.run(Thread.java:745)
>
> 15/09/24 13:06:41 INFO TorrentBroadcast: Started reading broadcast
> variable 1
>
> 15/09/24 13:06:41 INFO SendingConnection: Initiating connection to
> [master_ip/master_ip_address:44427]
>
> 15/09/24 13:06:41 INFO SendingConnection: Connected to
> [master_ip/master_ip_address:44427], 1 messages pending
>
> 15/09/24 13:07:41 ERROR Executor: Exception in task 37.0 in stage 0.0 (TID
> 37)
>
> java.io.IOException: sendMessageReliably failed because ack was not
> received within 60 sec
>
>         at
> org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>
>         at
> org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>
>         at scala.Option.foreach(Option.scala:236)
>
>         at
> org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>
>         at
> io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>
>         at
> io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>
>         at
> io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>
>         at java.lang.Thread.run(Thread.java:745)
>
>
>
>
>
> I checked the network speed between the master and the slave, and scp can
> transfer large files between them at 60 MB/s.
>
> Any leads on how this can be fixed?
>
>
>
> Thanks and Regards,
>
> Suraj Sheth
>
>
>