Hi, no. I was trying to use EC2 (due to a few constraints), which is where I faced the problem. With EMR, it works flawlessly, but I would like to go back to EC2 if I can fix this issue. Has anybody set up a Spark cluster using plain EC2 machines? What steps did you follow?
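Before rebuilding the cluster, it may help to verify that each slave can actually reach the master on the ephemeral ports shown in the executor log below (e.g. 56305, 44427). A minimal probe, where the host and port are placeholders to substitute with your own node addresses:

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake;
        # a blocked security-group rule typically surfaces as a timeout here.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `can_connect("master_ip", 56305)` from a slave while a job is stuck should quickly show whether the connection is refused, times out, or goes through.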
Thanks and Regards,
Suraj Sheth

On Sat, Sep 26, 2015 at 10:36 AM, Natu Lauchande <nlaucha...@gmail.com> wrote:
> Hi,
>
> Are you using EMR?
>
> Natu
>
> On Sat, Sep 26, 2015 at 6:55 AM, SURAJ SHETH <shet...@gmail.com> wrote:
>> Hi Ankur,
>> Thanks for the reply. This is already done.
>> If I wait for a long time (10 minutes), a few tasks do succeed even on the slave nodes. Sometimes a fraction of the tasks (20%) complete on all the machines within the first 5 seconds, and then it slows down drastically.
>>
>> Thanks and Regards,
>> Suraj Sheth
>>
>> On Fri, Sep 25, 2015 at 2:10 AM, Ankur Srivastava <ankur.srivast...@gmail.com> wrote:
>>> Hi Suraj,
>>>
>>> Spark uses a lot of ports to communicate between nodes. Your security group is probably restrictive and does not allow the instances to communicate on all ports. The easiest way to resolve this is to add a rule that allows all inbound traffic on all ports (0-65535) from instances in the same security group, like this:
>>>
>>> Type: All TCP, Protocol: TCP, Port Range: 0 - 65535, Source: your security group
>>>
>>> Hope this helps!!
>>>
>>> Thanks
>>> Ankur
>>>
>>> On Thu, Sep 24, 2015 at 7:09 AM SURAJ SHETH <shet...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I am using Spark 1.2 and facing network-related issues while performing simple computations.
>>>>
>>>> This is a custom cluster set up using EC2 machines and the Spark prebuilt binary from the Apache site. The problem occurs only when we have workers on other machines (networking involved); having the master and the slave on a single node works correctly.
>>>>
>>>> The error log from a slave node is attached below. The job reads a textFile from the local FS (copied to each node) and counts it. The first 30 tasks complete within 5 seconds. Then it takes several minutes to complete another 10 tasks, and eventually the job dies.
>>>>
>>>> Sometimes, one of the workers completes all the tasks assigned to it.
>>>> Different workers show different behavior at different times (non-deterministic).
>>>>
>>>> Is it related to something specific to EC2?
>>>>
>>>> 15/09/24 13:04:40 INFO Executor: Running task 117.0 in stage 0.0 (TID 117)
>>>> 15/09/24 13:04:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>>>> 15/09/24 13:04:41 INFO SendingConnection: Initiating connection to [master_ip:56305]
>>>> 15/09/24 13:04:41 INFO SendingConnection: Connected to [master_ip/master_ip_address:56305], 1 messages pending
>>>> 15/09/24 13:05:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>>>> 15/09/24 13:05:41 ERROR Executor: Exception in task 77.0 in stage 0.0 (TID 77)
>>>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>>>     at scala.Option.foreach(Option.scala:236)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>>>     at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>> 15/09/24 13:05:41 INFO CoarseGrainedExecutorBackend: Got assigned task 122
>>>> 15/09/24 13:05:41 INFO Executor: Running task 3.1 in stage 0.0 (TID 122)
>>>> 15/09/24 13:06:41 ERROR Executor: Exception in task 113.0 in stage 0.0 (TID 113)
>>>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>>>     at scala.Option.foreach(Option.scala:236)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>>>     at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>> 15/09/24 13:06:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>>>> 15/09/24 13:06:41 INFO SendingConnection: Initiating connection to [master_ip/master_ip_address:44427]
>>>> 15/09/24 13:06:41 INFO SendingConnection: Connected to [master_ip/master_ip_address:44427], 1 messages pending
>>>> 15/09/24 13:07:41 ERROR Executor: Exception in task 37.0 in stage 0.0 (TID 37)
>>>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>>>     at scala.Option.foreach(Option.scala:236)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>>>     at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>
>>>> I checked the network speed between the master and the slave: it can scp large files at 60 MB/s.
>>>>
>>>> Any leads on how this can be fixed?
>>>>
>>>> Thanks and Regards,
>>>> Suraj Sheth
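The repeated "ack was not received within 60 sec" in the trace matches the default of `spark.core.connection.ack.wait.timeout`, the ack timeout used by the NIO ConnectionManager in Spark 1.x. As a diagnostic sketch (not a fix; the value 300 below is an arbitrary example), raising it in conf/spark-defaults.conf can tell you whether the acks are merely slow or never arriving at all. If the security group is blocking ports, the real fix is the inbound rule quoted above.

```properties
# conf/spark-defaults.conf -- diagnostic only: a higher ack timeout
# distinguishes slow acks from acks that are blocked entirely.
spark.core.connection.ack.wait.timeout  300
```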