Re: Job aborted: Spark cluster looks down

2014-03-07 Thread Mayur Rustagi
It seems your workers are disassociating.
Can you try setting
STANDALONE_SPARK_MASTER_HOST=`hostname -f` in spark-env.sh?
I think the issue is in how the workers resolve the master's IP and how the
master resolves theirs: the master has a full DNS name and the slaves don't.
These are all guesses, but can you try to resolve all the hostnames on each
of the machines and telnet to the host:port pairs that appear in the logs?
telnet node1
telnet node1.cluster.local  etc.
If any of them fail, fix them with an /etc/hosts mapping. Also note that
some hostnames like node1.cluster.local may (and should) resolve to an
internal IP. A further issue could be that some ports (like the web UI) are
open on all IPs while others listen only on a specific IP, so make sure you
definitely verify the IP and port pairs mentioned in the logs.
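The resolve-and-telnet check above can be scripted so it is easy to run on every node. A rough sketch (the hostnames and the 7077 master port are the ones from this thread; adjust for your cluster):

```python
import socket

def check_endpoint(host, port, timeout=3.0):
    """Resolve a hostname and attempt a TCP connection, like a scripted telnet."""
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror as exc:
        return (host, None, "DNS resolution failed: %s" % exc)
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return (host, ip, "port %d reachable" % port)
    except OSError as exc:
        return (host, ip, "port %d not reachable: %s" % (port, exc))

if __name__ == "__main__":
    # Hostnames and port taken from this thread; run this on every machine.
    for host in ("node1", "node1.cluster.local"):
        print(check_endpoint(host, 7077))
```

Any endpoint that reports a failure here is a candidate for an /etc/hosts entry or a firewall fix.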


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Thu, Mar 6, 2014 at 11:38 PM, Christian  wrote:

> Dear Mayur,
>
> The webUI shows the worker nodes correctly and the master URL shown there
> is spark://node1.cluster.local:7077 (please see the attached screenshot).
>
> I can launch the shell without problems and run some simple code in there.
> But when I launch the job it doesn't work:
>
> MASTER=spark://node1.cluster.local:7077
> SPARK_CLASSPATH=target/scala-2.10/condel-calc-assembly-1.0.jar
> $SPARK_HOME/bin/spark-class org.upf.bg.condel.calc.CondelCalc
> ~/notebooks/condel/metrics-condel.json
>
> I have attached the logs for the master and the slaves just after
> executing the job.
>
> My conf/spark-env.sh is:
>
> SPARK_MASTER_IP=node1.cluster.local
> SPARK_WORKER_CORES=20
> SPARK_WORKER_MEMORY=12g
> SPARK_WORKER_DIR=/scratch/cperez/spark
> STANDALONE_SPARK_MASTER_HOST=node1.cluster.local
>
> The conf/slaves is:
>
> node2.cluster.local
> node3.cluster.local
>
> The cluster is on a local network without internet access, with its own
> dedicated DNS server. I don't know whether having "node1" as the hostname
> and "node1.cluster.local" as the DNS name makes a difference. I have tried
> both spark://node1:7077 and spark://node1.cluster.local:7077, and neither
> works.
>
> The Job code that initializes the Spark context is:
>
> val sparkConf = new SparkConf()
>   .setMaster(sys.env.getOrElse("MASTER", "local"))
>   .setAppName("CondelCalc")
>   .setSparkHome(sys.env("SPARK_HOME"))
>   .setJars(SparkContext.jarOfObject(this))
>   .set("spark.executor.memory", "256m")
>
> val spark = new SparkContext(sparkConf)
>
> I don't understand why the shell works but the job doesn't.
>
>
> On Thu, Mar 6, 2014 at 11:54 PM, Mayur Rustagi wrote:
>
>> Can you see the Spark web UI? Is it running? (It would run on
>> masterurl:8080.)
>> If so, what is the master URL shown there?
>> MASTER=spark://<master-host>:<port> ./bin/spark-shell
>> should work.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>>  @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Thu, Mar 6, 2014 at 2:22 PM, Christian  wrote:
>>
>>> Hello, has anyone run into this problem before? I am sorry to insist, but
>>> I cannot figure out what is happening. Should I ask on the dev mailing
>>> list? Many thanks in advance.
>>> On 05/03/2014 23:57, "Christian"  wrote:
>>>
>>> I have deployed a Spark cluster in standalone mode with 3 machines:
>>>>
>>>> node1/192.168.1.2 -> master
>>>> node2/192.168.1.3 -> worker 20 cores 12g
>>>> node3/192.168.1.4 -> worker 20 cores 12g
>>>>
>>>> The web interface shows the workers correctly.
>>>>
>>>> When I launch the scala job (which only requires 256m of memory) these
>>>> are the logs:
>>>>
>>>> 14/03/05 23:24:06 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0
>>>> with 55 tasks
>>>> 14/03/05 23:24:21 WARN scheduler.TaskSchedulerImpl: Initial job has not
>>>> accepted any resources; check your cluster UI to ensure that workers are
>>>> registered and have sufficient memory
>>>> 14/03/05 23:24:23 INFO client.AppClient$ClientActor: Connecting to
>>>> master spark://node1:7077...
>>>> 14/03/05 23:24:36 WARN scheduler.TaskSchedulerImpl: Initial job has not
>>>> accepted any resources; check your cluster UI to ensure that workers are
>>>> registered and have sufficient memory
>>>> 14/03/05 23:24:43 INFO client.AppClient$ClientActor: Connecting to
>>>> master spark://node1:7077...
>>>> 14/03/

Re: Job aborted: Spark cluster looks down

2014-03-06 Thread Mayur Rustagi
Can you see the Spark web UI? Is it running? (It would run on
masterurl:8080.)
If so, what is the master URL shown there?
MASTER=spark://<master-host>:<port> ./bin/spark-shell
should work.
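As an aside, the expected spark://host:port form of the master URL can be sanity-checked with a small helper before launching anything (this is just a sketch, not Spark's own parser):

```python
from urllib.parse import urlparse

def parse_spark_master(url):
    """Split a standalone master URL of the form spark://host:port."""
    parsed = urlparse(url)
    if parsed.scheme != "spark" or not parsed.hostname or not parsed.port:
        raise ValueError("expected spark://<host>:<port>, got %r" % url)
    return parsed.hostname, parsed.port

# The URL from the web UI in this thread:
print(parse_spark_master("spark://node1.cluster.local:7077"))
# -> ('node1.cluster.local', 7077)
```

The host returned here should match, character for character, the master URL shown at the top of the web UI.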

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Thu, Mar 6, 2014 at 2:22 PM, Christian  wrote:

> Hello, has anyone run into this problem before? I am sorry to insist, but I
> cannot figure out what is happening. Should I ask on the dev mailing list?
> Many thanks in advance.
> On 05/03/2014 23:57, "Christian"  wrote:
>
> I have deployed a Spark cluster in standalone mode with 3 machines:
>>
>> node1/192.168.1.2 -> master
>> node2/192.168.1.3 -> worker 20 cores 12g
>> node3/192.168.1.4 -> worker 20 cores 12g
>>
>> The web interface shows the workers correctly.
>>
>> When I launch the scala job (which only requires 256m of memory) these
>> are the logs:
>>
>> 14/03/05 23:24:06 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0
>> with 55 tasks
>> 14/03/05 23:24:21 WARN scheduler.TaskSchedulerImpl: Initial job has not
>> accepted any resources; check your cluster UI to ensure that workers are
>> registered and have sufficient memory
>> 14/03/05 23:24:23 INFO client.AppClient$ClientActor: Connecting to master
>> spark://node1:7077...
>> 14/03/05 23:24:36 WARN scheduler.TaskSchedulerImpl: Initial job has not
>> accepted any resources; check your cluster UI to ensure that workers are
>> registered and have sufficient memory
>> 14/03/05 23:24:43 INFO client.AppClient$ClientActor: Connecting to master
>> spark://node1:7077...
>> 14/03/05 23:24:51 WARN scheduler.TaskSchedulerImpl: Initial job has not
>> accepted any resources; check your cluster UI to ensure that workers are
>> registered and have sufficient memory
>> 14/03/05 23:25:03 ERROR client.AppClient$ClientActor: All masters are
>> unresponsive! Giving up.
>> 14/03/05 23:25:03 ERROR cluster.SparkDeploySchedulerBackend: Spark
>> cluster looks dead, giving up.
>> 14/03/05 23:25:03 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 0.0
>> from pool
>> 14/03/05 23:25:03 INFO scheduler.DAGScheduler: Failed to run
>> saveAsNewAPIHadoopFile at CondelCalc.scala:146
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted:
>> Spark cluster looks down
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>> ...
>>
>> The generated logs by the master and the 2 workers are attached, but I
>> found something weird in the master logs:
>>
>> 14/03/05 23:37:43 INFO master.Master: Registering worker *node1:57297* with
>> 20 cores, 12.0 GB RAM
>> 14/03/05 23:37:43 INFO master.Master: Registering worker *node1:34188* with
>> 20 cores, 12.0 GB RAM
>>
>> It reports that the two workers are node1:57297 and node1:34188 instead
>> of node3 and node2 respectively.
>>
>> $ cat /etc/hosts
>> ...
>> 192.168.1.2 node1
>> 192.168.1.3 node2
>> 192.168.1.4 node3
>> ...
>>
>> $ nslookup node2
>> Server: 192.168.1.1
>> Address:192.168.1.1#53
>>
>> Name:   node2.cluster.local
>> Address: 192.168.1.3
>>
>> $ nslookup node3
>> Server: 192.168.1.1
>> Address:192.168.1.1#53
>>
>> Name:   node3.cluster.local
>> Address: 192.168.1.4
>>
>> $ ssh node1 "ps aux | grep spark"
>> cperez   17023  1.4  0.1 4691944 154532 pts/3  Sl   23:37   0:15
>> /data/users/cperez/opt/jdk/bin/java -cp
>> :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop
>> -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m
>> org.apache.spark.deploy.master.Master --ip node1 --port 7077 --webui-port
>> 8080
>>
>> $ ssh node2 "ps aux | grep spark"
>> cperez   17511  2.7  0.1 4625248 156304 ?  Sl   23:37   0:07
>> /data/users/cperez/opt/jdk/bin/java -cp
>> :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop
>> -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m
>> org.apache.spark.deploy.worker.Worker spark://node1:7077
>>

Re: Job aborted: Spark cluster looks down

2014-03-06 Thread Christian
Hello, has anyone run into this problem before? I am sorry to insist, but I
cannot figure out what is happening. Should I ask on the dev mailing list?
Many thanks in advance.
On 05/03/2014 23:57, "Christian"  wrote:

> I have deployed a Spark cluster in standalone mode with 3 machines:
>
> node1/192.168.1.2 -> master
> node2/192.168.1.3 -> worker 20 cores 12g
> node3/192.168.1.4 -> worker 20 cores 12g
>
> The web interface shows the workers correctly.
>
> When I launch the scala job (which only requires 256m of memory) these are
> the logs:
>
> 14/03/05 23:24:06 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0
> with 55 tasks
> 14/03/05 23:24:21 WARN scheduler.TaskSchedulerImpl: Initial job has not
> accepted any resources; check your cluster UI to ensure that workers are
> registered and have sufficient memory
> 14/03/05 23:24:23 INFO client.AppClient$ClientActor: Connecting to master
> spark://node1:7077...
> 14/03/05 23:24:36 WARN scheduler.TaskSchedulerImpl: Initial job has not
> accepted any resources; check your cluster UI to ensure that workers are
> registered and have sufficient memory
> 14/03/05 23:24:43 INFO client.AppClient$ClientActor: Connecting to master
> spark://node1:7077...
> 14/03/05 23:24:51 WARN scheduler.TaskSchedulerImpl: Initial job has not
> accepted any resources; check your cluster UI to ensure that workers are
> registered and have sufficient memory
> 14/03/05 23:25:03 ERROR client.AppClient$ClientActor: All masters are
> unresponsive! Giving up.
> 14/03/05 23:25:03 ERROR cluster.SparkDeploySchedulerBackend: Spark cluster
> looks dead, giving up.
> 14/03/05 23:25:03 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 0.0
> from pool
> 14/03/05 23:25:03 INFO scheduler.DAGScheduler: Failed to run
> saveAsNewAPIHadoopFile at CondelCalc.scala:146
> Exception in thread "main" org.apache.spark.SparkException: Job aborted:
> Spark cluster looks down
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> ...
>
> The generated logs by the master and the 2 workers are attached, but I
> found something weird in the master logs:
>
> 14/03/05 23:37:43 INFO master.Master: Registering worker *node1:57297* with
> 20 cores, 12.0 GB RAM
> 14/03/05 23:37:43 INFO master.Master: Registering worker *node1:34188* with
> 20 cores, 12.0 GB RAM
>
> It reports that the two workers are node1:57297 and node1:34188 instead of
> node3 and node2 respectively.
>
> $ cat /etc/hosts
> ...
> 192.168.1.2 node1
> 192.168.1.3 node2
> 192.168.1.4 node3
> ...
>
> $ nslookup node2
> Server: 192.168.1.1
> Address:192.168.1.1#53
>
> Name:   node2.cluster.local
> Address: 192.168.1.3
>
> $ nslookup node3
> Server: 192.168.1.1
> Address:192.168.1.1#53
>
> Name:   node3.cluster.local
> Address: 192.168.1.4
>
> $ ssh node1 "ps aux | grep spark"
> cperez   17023  1.4  0.1 4691944 154532 pts/3  Sl   23:37   0:15
> /data/users/cperez/opt/jdk/bin/java -cp
> :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop
> -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m
> org.apache.spark.deploy.master.Master --ip node1 --port 7077 --webui-port
> 8080
>
> $ ssh node2 "ps aux | grep spark"
> cperez   17511  2.7  0.1 4625248 156304 ?  Sl   23:37   0:07
> /data/users/cperez/opt/jdk/bin/java -cp
> :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop
> -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m
> org.apache.spark.deploy.worker.Worker spark://node1:7077
>
> $ ssh node2 "netstat -lptun | grep 17511"
> tcp  0  0 :::8081               :::*  LISTEN  17511/java
> tcp  0  0 :::192.168.1.3:34188  :::*  LISTEN  17511/java
>
> $ ssh node3 "ps aux | grep spark"
> cperez7543  1.9  0.1 4625248 158600 ?  Sl   23:37   0:09
> /data/users/cperez/opt/jdk/bin/java -cp
> :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop
> -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m

Job aborted: Spark cluster looks down

2014-03-05 Thread Christian
I have deployed a Spark cluster in standalone mode with 3 machines:

node1/192.168.1.2 -> master
node2/192.168.1.3 -> worker 20 cores 12g
node3/192.168.1.4 -> worker 20 cores 12g

The web interface shows the workers correctly.

When I launch the Scala job (which only requires 256 MB of memory), these are
the logs:

14/03/05 23:24:06 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0
with 55 tasks
14/03/05 23:24:21 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient memory
14/03/05 23:24:23 INFO client.AppClient$ClientActor: Connecting to master
spark://node1:7077...
14/03/05 23:24:36 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient memory
14/03/05 23:24:43 INFO client.AppClient$ClientActor: Connecting to master
spark://node1:7077...
14/03/05 23:24:51 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient memory
14/03/05 23:25:03 ERROR client.AppClient$ClientActor: All masters are
unresponsive! Giving up.
14/03/05 23:25:03 ERROR cluster.SparkDeploySchedulerBackend: Spark cluster
looks dead, giving up.
14/03/05 23:25:03 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 0.0 from
pool
14/03/05 23:25:03 INFO scheduler.DAGScheduler: Failed to run
saveAsNewAPIHadoopFile at CondelCalc.scala:146
Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Spark cluster looks down
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
...

The generated logs by the master and the 2 workers are attached, but I
found something weird in the master logs:

14/03/05 23:37:43 INFO master.Master: Registering worker *node1:57297* with
20 cores, 12.0 GB RAM
14/03/05 23:37:43 INFO master.Master: Registering worker *node1:34188* with
20 cores, 12.0 GB RAM

It reports that the two workers are node1:57297 and node1:34188 instead of
node3 and node2 respectively.

$ cat /etc/hosts
...
192.168.1.2 node1
192.168.1.3 node2
192.168.1.4 node3
...

$ nslookup node2
Server: 192.168.1.1
Address:192.168.1.1#53

Name:   node2.cluster.local
Address: 192.168.1.3

$ nslookup node3
Server: 192.168.1.1
Address:192.168.1.1#53

Name:   node3.cluster.local
Address: 192.168.1.4
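The nslookup output above covers only the forward direction (name to IP). Given that the master registered both workers as node1, it may be worth checking the reverse direction as well; a quick sketch using the worker IPs from above:

```python
import socket

# Worker IPs from the /etc/hosts listing in this thread. If reverse lookup
# maps them to an unexpected name (or fails outright), the master may end
# up logging the workers under the wrong hostname.
for ip in ("192.168.1.3", "192.168.1.4"):
    try:
        name, aliases, addresses = socket.gethostbyaddr(ip)
        print(ip, "->", name, aliases)
    except OSError as exc:
        print(ip, "-> reverse lookup failed:", exc)
```

If either IP reverse-resolves to node1 (or not at all), that would be consistent with the odd registration lines in the master log.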

$ ssh node1 "ps aux | grep spark"
cperez   17023  1.4  0.1 4691944 154532 pts/3  Sl   23:37   0:15
/data/users/cperez/opt/jdk/bin/java -cp
:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop
-Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m
org.apache.spark.deploy.master.Master --ip node1 --port 7077 --webui-port
8080

$ ssh node2 "ps aux | grep spark"
cperez   17511  2.7  0.1 4625248 156304 ?  Sl   23:37   0:07
/data/users/cperez/opt/jdk/bin/java -cp
:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop
-Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m
org.apache.spark.deploy.worker.Worker spark://node1:7077

$ ssh node2 "netstat -lptun | grep 17511"
tcp  0  0 :::8081               :::*  LISTEN  17511/java
tcp  0  0 :::192.168.1.3:34188  :::*  LISTEN  17511/java

$ ssh node3 "ps aux | grep spark"
cperez7543  1.9  0.1 4625248 158600 ?  Sl   23:37   0:09
/data/users/cperez/opt/jdk/bin/java -cp
:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop
-Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m
org.apache.spark.deploy.worker.Worker spark://node1:7077

$ ssh node3 "netstat -lptun | grep 7543"
tcp  0  0 :::8081               :::*  LISTEN  7543/java
tcp  0  0 :::192.168.1.4:57297  :::*  LISTEN  7543/java
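In the netstat output above, each worker's web UI (8081) listens on the wildcard address while its RPC port is bound to one specific IP. The distinction is just the bind address, as this small sketch illustrates (the addresses here are local placeholders, not Spark's actual ports):

```python
import socket

# A wildcard bind (like the web UI on :::8081) accepts connections arriving
# on any interface; a specific bind (like the worker RPC port on
# 192.168.1.4) only accepts traffic addressed to that one IP.
wildcard = socket.socket()
wildcard.bind(("0.0.0.0", 0))      # any interface, ephemeral port
specific = socket.socket()
specific.bind(("127.0.0.1", 0))    # one interface only
wc_addr = wildcard.getsockname()
sp_addr = specific.getsockname()
print("wildcard bound to", wc_addr)
print("specific bound to", sp_addr)
wildcard.close()
specific.close()
```

This is why a port can be reachable when you telnet to one of a machine's addresses but refused on another; Mayur's advice to test the exact IP:port pairs from the logs catches exactly this case.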

I am completely stuck on this; any help would be greatly appreciated. Many
thanks in advance.
Christian


spark-cperez-org.apache.spark.deploy.master.Master-1-node1.out
Description: Binary data


spark-cperez-org.apache.spark.deploy.worker.Worker-1-node2.out
Description: Binary data


spark-cperez-org.apache.spark.deploy.worker.Worker-1-node3.out
Description: Binary data