I have a workaround for the issue.

As you can see from the log, there are about 15 seconds between worker start
and shutdown.

The workaround might be to sleep 30 seconds, check whether the worker is
running, and if it is not, run start-slave.sh again.

Part of the EMR Spark bootstrap Python script:

import subprocess
import time

spark_master = "spark://...:7077"
...
# ask the worker web UI (port 8081) for its HTTP status code only
curl_worker_cmd = "curl -o /dev/null --silent --head --write-out %{http_code} localhost:8081"
while True:
    subprocess.call(["/home/hadoop/spark/sbin/start-slave.sh", spark_master])
    time.sleep(30)
    # worker is considered up once its web UI answers 200
    if subprocess.Popen(curl_worker_cmd.split(" "),
                        stdout=subprocess.PIPE).communicate()[0].decode("ascii") == "200":
        break
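
If you'd rather have the bootstrap fail loudly than spin forever (e.g. if the
master itself never comes up), a rough sketch along the same lines with a
retry cap could look like this. max_attempts and wait_sec are arbitrary
numbers I picked, not Spark settings:

import subprocess
import time

spark_master = "spark://...:7077"  # master URL elided, same as above

def start_worker(max_attempts=10, wait_sec=30):
    # same curl probe as above, as an argv list so no shell is needed
    curl_cmd = ["curl", "-o", "/dev/null", "--silent", "--head",
                "--write-out", "%{http_code}", "localhost:8081"]
    for _ in range(max_attempts):
        subprocess.call(["/home/hadoop/spark/sbin/start-slave.sh", spark_master])
        time.sleep(wait_sec)
        out = subprocess.Popen(curl_cmd, stdout=subprocess.PIPE).communicate()[0]
        if out.decode("ascii") == "200":
            return
    raise RuntimeError("worker did not come up after %d attempts" % max_attempts)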

On Thu, Aug 27, 2015 at 3:07 PM, Alexander Pivovarov <apivova...@gmail.com>
wrote:

> I see the following error from time to time when trying to start slaves on
> Spark 1.4.0
>
>
> [hadoop@ip-10-0-27-240 apps]$ pwd
> /mnt/var/log/apps
>
> [hadoop@ip-10-0-27-240 apps]$ cat
> spark-hadoop-org.apache.spark.deploy.worker.Worker-1-ip-10-0-27-240.ec2.internal.out
> Spark Command: /usr/java/latest/bin/java -cp
> /home/hadoop/spark/conf/:/home/hadoop/conf/:/home/hadoop/spark/classpath/distsupplied/*:/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar:/usr/share/aws/emr/auxlib/*:/home/hadoop/.versions/spark-1.4.0.b/sbin/../conf/:/home/hadoop/.versions/spark-1.4.0.b/lib/spark-assembly-1.4.0-hadoop2.4.0.jar:/home/hadoop/.versions/spark-1.4.0.b/lib/datanucleus-core-3.2.10.jar:/home/hadoop/.versions/spark-1.4.0.b/lib/datanucleus-rdbms-3.2.9.jar:/home/hadoop/.versions/spark-1.4.0.b/lib/datanucleus-api-jdo-3.2.6.jar:/home/hadoop/conf/:/home/hadoop/conf/
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70
> -XX:MaxHeapFreeRatio=70 -Xms2048m -Xmx2048m -XX:MaxPermSize=128m
> org.apache.spark.deploy.worker.Worker --webui-port 8081
> spark://ip-10-0-27-185.ec2.internal:7077
> ========================================
> 15/08/27 21:10:25 INFO Worker: Registered signal handlers for [TERM, HUP,
> INT]
> 15/08/27 21:10:26 INFO SecurityManager: Changing view acls to: hadoop
> 15/08/27 21:10:26 INFO SecurityManager: Changing modify acls to: hadoop
> 15/08/27 21:10:26 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(hadoop); users
> with modify permissions: Set(hadoop)
> 15/08/27 21:10:26 INFO Slf4jLogger: Slf4jLogger started
> 15/08/27 21:10:26 INFO Remoting: Starting remoting
> Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at akka.remote.Remoting.start(Remoting.scala:180)
> at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
> at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
> at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
> at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
> at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
> at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
> at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
> at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)
> at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
> at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
> at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1991)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1982)
> at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
> at org.apache.spark.deploy.worker.Worker$.startSystemAndActor(Worker.scala:553)
> at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:533)
> at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
> 15/08/27 21:10:39 INFO Utils: Shutdown hook called
> Heap
>  par new generation   total 613440K, used 338393K [0x0000000778000000,
> 0x00000007a1990000, 0x00000007a1990000)
>   eden space 545344K,  62% used [0x0000000778000000, 0x000000078ca765b0,
> 0x0000000799490000)
>   from space 68096K,   0% used [0x0000000799490000, 0x0000000799490000,
> 0x000000079d710000)
>   to   space 68096K,   0% used [0x000000079d710000, 0x000000079d710000,
> 0x00000007a1990000)
>  concurrent mark-sweep generation total 1415616K, used 0K
> [0x00000007a1990000, 0x00000007f8000000, 0x00000007f8000000)
>  concurrent-mark-sweep perm gen total 21248K, used 19285K
> [0x00000007f8000000, 0x00000007f94c0000, 0x0000000800000000)
>
