Re: Job aborted: Spark cluster looks down
It seems your workers are disassociating. Can you try setting STANDALONE_SPARK_MASTER_HOST=`hostname -f` in spark-env.sh? I think the issue is in the way the workers and the master resolve IPs: the master has a fully qualified DNS name and the slaves don't. These are all guesses, but can you try to resolve all the hostnames on each of the machines, and telnet to the host:port pairs that appear in the logs (telnet node1 7077, telnet node1.cluster.local 7077, and so on)? If any of them fail, fix them with an /etc/hosts mapping. Also note that some hostnames like node1.cluster.local may (and should) resolve to an internal IP. A further issue could be that some ports are open on all interfaces (like the web UI) while others bind only to a specific IP, so make sure you definitely fix the exact IP and port pairs mentioned in the logs.
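A rough sketch of those checks (hostnames and ports are taken from the logs in this thread, so adjust as needed; run it on every machine; it uses bash's /dev/tcp):

    # spark-env.sh on the master, so workers see a consistent master address:
    STANDALONE_SPARK_MASTER_HOST=`hostname -f`

    # Can every cluster name be resolved locally?
    for h in node1 node1.cluster.local node2 node2.cluster.local \
             node3 node3.cluster.local; do
      getent hosts "$h" || echo "UNRESOLVED: $h  (add an /etc/hosts entry)"
    done

    # Can the master port and web UI actually be reached from here?
    for t in node1:7077 node1.cluster.local:7077 node1:8080; do
      h=${t%:*}; p=${t#*:}
      (exec 3<>"/dev/tcp/$h/$p") 2>/dev/null && echo "OK   $t" || echo "FAIL $t"
    done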
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Thu, Mar 6, 2014 at 11:38 PM, Christian wrote:

> Dear Mayur,
>
> The webUI shows the worker nodes correctly, and the master URL shown there
> is spark://node1.cluster.local:7077 (please see the attached screenshot).
>
> I can launch the shell without problems and run some simple code in there,
> but when I launch the job it doesn't work:
>
> MASTER=spark://node1.cluster.local:7077 \
> SPARK_CLASSPATH=target/scala-2.10/condel-calc-assembly-1.0.jar \
> $SPARK_HOME/bin/spark-class org.upf.bg.condel.calc.CondelCalc \
> ~/notebooks/condel/metrics-condel.json
>
> I have attached the logs for the master and the slaves just after
> executing the job.
>
> My conf/spark-env.sh is:
>
> SPARK_MASTER_IP=node1.cluster.local
> SPARK_WORKER_CORES=20
> SPARK_WORKER_MEMORY=12g
> SPARK_WORKER_DIR=/scratch/cperez/spark
> STANDALONE_SPARK_MASTER_HOST=node1.cluster.local
>
> The conf/slaves is:
>
> node2.cluster.local
> node3.cluster.local
>
> The cluster is in a local network without internet access, with its own
> dedicated DNS server. I don't know whether having "node1" as the hostname
> and "node1.cluster.local" as the DNS name makes a difference. I have tried
> both spark://node1:7077 and spark://node1.cluster.local:7077, and neither
> works.
>
> The job code that initializes the Spark context is:
>
> val sparkConf = new SparkConf()
>   .setMaster(sys.env.getOrElse("MASTER", "local"))
>   .setAppName("CondelCalc")
>   .setSparkHome(sys.env("SPARK_HOME"))
>   .setJars(SparkContext.jarOfObject(this))
>   .set("spark.executor.memory", "256m")
>
> val spark = new SparkContext(sparkConf)
>
> I don't understand why the shell works but the job doesn't.
Re: Job aborted: Spark cluster looks down
Can you see the Spark web UI? Is it running? (It would run on masterurl:8080.) If so, what is the master URL shown there?

MASTER=spark://<master-host>:<port> ./bin/spark-shell

should work.
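For example, against the cluster in this thread (default ports assumed; the grep is just a quick way to pull the spark:// URL out of the UI page):

    # Confirm the master web UI answers, and note the spark:// URL it displays:
    curl -s http://node1:8080/ | grep -o 'spark://[^<" ]*'

    # Then point a shell at exactly that URL:
    MASTER=spark://node1.cluster.local:7077 ./bin/spark-shell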
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Thu, Mar 6, 2014 at 2:22 PM, Christian wrote:

> Hello, has anyone run into this problem before? I am sorry to insist, but
> I cannot figure out what is happening. Should I ask on the dev mailing
> list? Many thanks in advance.
Re: Job aborted: Spark cluster looks down
Hello, has anyone run into this problem before? I am sorry to insist, but I cannot figure out what is happening. Should I ask on the dev mailing list? Many thanks in advance.

On 05/03/2014 23:57, "Christian" wrote:
Job aborted: Spark cluster looks down
I have deployed a Spark cluster in standalone mode with 3 machines:

node1/192.168.1.2 -> master
node2/192.168.1.3 -> worker, 20 cores, 12g
node3/192.168.1.4 -> worker, 20 cores, 12g

The web interface shows the workers correctly.

When I launch the Scala job (which only requires 256m of memory), these are the logs:

14/03/05 23:24:06 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 55 tasks
14/03/05 23:24:21 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/03/05 23:24:23 INFO client.AppClient$ClientActor: Connecting to master spark://node1:7077...
14/03/05 23:24:36 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/03/05 23:24:43 INFO client.AppClient$ClientActor: Connecting to master spark://node1:7077...
14/03/05 23:24:51 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/03/05 23:25:03 ERROR client.AppClient$ClientActor: All masters are unresponsive! Giving up.
14/03/05 23:25:03 ERROR cluster.SparkDeploySchedulerBackend: Spark cluster looks dead, giving up.
14/03/05 23:25:03 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 0.0 from pool
14/03/05 23:25:03 INFO scheduler.DAGScheduler: Failed to run saveAsNewAPIHadoopFile at CondelCalc.scala:146
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Spark cluster looks down
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
        ...

The logs generated by the master and the 2 workers are attached, but I found something weird in the master logs:

14/03/05 23:37:43 INFO master.Master: Registering worker node1:57297 with 20 cores, 12.0 GB RAM
14/03/05 23:37:43 INFO master.Master: Registering worker node1:34188 with 20 cores, 12.0 GB RAM

It reports the two workers as node1:57297 and node1:34188 instead of node3 and node2, respectively.

$ cat /etc/hosts
...
192.168.1.2 node1
192.168.1.3 node2
192.168.1.4 node3
...

$ nslookup node2
Server:   192.168.1.1
Address:  192.168.1.1#53

Name:     node2.cluster.local
Address:  192.168.1.3

$ nslookup node3
Server:   192.168.1.1
Address:  192.168.1.1#53

Name:     node3.cluster.local
Address:  192.168.1.4

$ ssh node1 "ps aux | grep spark"
cperez   17023  1.4  0.1 4691944 154532 pts/3  Sl  23:37  0:15 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip node1 --port 7077 --webui-port 8080

$ ssh node2 "ps aux | grep spark"
cperez   17511  2.7  0.1 4625248 156304 ?      Sl  23:37  0:07 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://node1:7077

$ ssh node2 "netstat -lptun | grep 17511"
tcp    0    0 :::8081                 :::*    LISTEN    17511/java
tcp    0    0 :::192.168.1.3:34188    :::*    LISTEN    17511/java

$ ssh node3 "ps aux | grep spark"
cperez    7543  1.9  0.1 4625248 158600 ?      Sl  23:37  0:09 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://node1:7077

$ ssh node3 "netstat -lptun | grep 7543"
tcp    0    0 :::8081                 :::*    LISTEN    7543/java
tcp    0    0 :::192.168.1.4:57297    :::*    LISTEN    7543/java

I am completely blocked on this; any help would be very welcome. Many thanks in advance.

Christian

Attachments:
spark-cperez-org.apache.spark.deploy.master.Master-1-node1.out
spark-cperez-org.apache.spark.deploy.worker.Worker-1-node2.out
spark-cperez-org.apache.spark.deploy.worker.Worker-1-node3.out
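For anyone reproducing these checks, the per-node comparison can be scripted in one pass; a minimal sketch, assuming passwordless ssh to the three nodes as in the commands above:

    #!/usr/bin/env bash
    # Compare what each machine believes its own name is with what the
    # resolver returns for it. A mismatch here would be consistent with the
    # master registering both workers under "node1".
    for node in node1 node2 node3; do
      echo "== $node =="
      ssh "$node" 'echo "hostname:     $(hostname)";
                   echo "hostname -f:  $(hostname -f)";
                   echo "getent hosts: $(getent hosts "$(hostname)")"'
    done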