Hi, no. I was trying to use EC2 (due to a few constraints), which is where I faced the problem. With EMR, it works flawlessly, but I would like to go back to EC2 if I can fix this issue. Has anybody set up a Spark cluster using plain EC2 machines? What steps did you follow?
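Before rebuilding the cluster, it may help to verify that each slave can actually reach the master on the ephemeral ports shown in the executor log below (e.g. 56305, 44427). A minimal probe, where the host and port are placeholders to substitute with your own node addresses:

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake;
        # a blocked security-group rule typically surfaces as a timeout here.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `can_connect("master_ip", 56305)` from a slave while a job is stuck should quickly show whether the connection is refused, times out, or goes through.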
Thanks and Regards,
Suraj Sheth

On Sat, Sep 26, 2015 at 10:36 AM, Natu Lauchande <nlaucha...@gmail.com> wrote:
> Hi,
>
> Are you using EMR?
>
> Natu
>
> On Sat, Sep 26, 2015 at 6:55 AM, SURAJ SHETH <shet...@gmail.com> wrote:
>> Hi Ankur,
>> Thanks for the reply. This is already done.
>> If I wait for a long time (10 minutes), a few tasks do succeed even on the slave nodes. Sometimes a fraction of the tasks (20%) complete on all the machines within the first 5 seconds, and then it slows down drastically.
>>
>> Thanks and Regards,
>> Suraj Sheth
>>
>> On Fri, Sep 25, 2015 at 2:10 AM, Ankur Srivastava <ankur.srivast...@gmail.com> wrote:
>>> Hi Suraj,
>>>
>>> Spark uses a lot of ports to communicate between nodes. Your security group is probably restrictive and does not allow the instances to communicate on all ports. The easiest way to resolve this is to add a rule that allows all inbound traffic on all ports (0-65535) from instances in the same security group, like this:
>>>
>>> Type: All TCP, Protocol: TCP, Port Range: 0 - 65535, Source: your security group
>>>
>>> Hope this helps!!
>>>
>>> Thanks
>>> Ankur
>>>
>>> On Thu, Sep 24, 2015 at 7:09 AM SURAJ SHETH <shet...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I am using Spark 1.2 and facing network-related issues while performing simple computations.
>>>>
>>>> This is a custom cluster set up using EC2 machines and the Spark prebuilt binary from the Apache site. The problem occurs only when we have workers on other machines (networking involved); having the master and the slave on a single node works correctly.
>>>>
>>>> The error log from a slave node is attached below. The job reads a textFile from the local FS (copied to each node) and counts it. The first 30 tasks complete within 5 seconds. Then it takes several minutes to complete another 10 tasks, and eventually the job dies.
>>>>
>>>> Sometimes, one of the workers completes all the tasks assigned to it.
>>>> Different workers show different behavior at different times (non-deterministic).
>>>>
>>>> Is it related to something specific to EC2?
>>>>
>>>> 15/09/24 13:04:40 INFO Executor: Running task 117.0 in stage 0.0 (TID 117)
>>>> 15/09/24 13:04:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>>>> 15/09/24 13:04:41 INFO SendingConnection: Initiating connection to [master_ip:56305]
>>>> 15/09/24 13:04:41 INFO SendingConnection: Connected to [master_ip/master_ip_address:56305], 1 messages pending
>>>> 15/09/24 13:05:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>>>> 15/09/24 13:05:41 ERROR Executor: Exception in task 77.0 in stage 0.0 (TID 77)
>>>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>>>     at scala.Option.foreach(Option.scala:236)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>>>     at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>> 15/09/24 13:05:41 INFO CoarseGrainedExecutorBackend: Got assigned task 122
>>>> 15/09/24 13:05:41 INFO Executor: Running task 3.1 in stage 0.0 (TID 122)
>>>> 15/09/24 13:06:41 ERROR Executor: Exception in task 113.0 in stage 0.0 (TID 113)
>>>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>>>     at scala.Option.foreach(Option.scala:236)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>>>     at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>> 15/09/24 13:06:41 INFO TorrentBroadcast: Started reading broadcast variable 1
>>>> 15/09/24 13:06:41 INFO SendingConnection: Initiating connection to [master_ip/master_ip_address:44427]
>>>> 15/09/24 13:06:41 INFO SendingConnection: Connected to [master_ip/master_ip_address:44427], 1 messages pending
>>>> 15/09/24 13:07:41 ERROR Executor: Exception in task 37.0 in stage 0.0 (TID 37)
>>>> java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>>>     at scala.Option.foreach(Option.scala:236)
>>>>     at org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>>>     at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>>>     at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>
>>>> I checked the network speed between the master and the slave: it can scp large files at 60 MB/s.
>>>>
>>>> Any leads on how this can be fixed?
>>>>
>>>> Thanks and Regards,
>>>> Suraj Sheth
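The repeated "ack was not received within 60 sec" in the trace matches the default of `spark.core.connection.ack.wait.timeout`, the ack timeout used by the NIO ConnectionManager in Spark 1.x. As a diagnostic sketch (not a fix; the value 300 below is an arbitrary example), raising it in conf/spark-defaults.conf can tell you whether the acks are merely slow or never arriving at all. If the security group is blocking ports, the real fix is the inbound rule quoted above.

```properties
# conf/spark-defaults.conf -- diagnostic only: a higher ack timeout
# distinguishes slow acks from acks that are blocked entirely.
spark.core.connection.ack.wait.timeout  300
```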