Re: [Spark on mesos] Spark framework not re-registered and lost after mesos master restarted

2017-03-31 Thread Yu Wei
Got that.


Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux


From: Timothy Chen <tnac...@gmail.com>
Sent: Friday, March 31, 2017 11:33:42 AM
To: Yu Wei
Cc: dev; us...@spark.apache.org
Subject: Re: [Spark on mesos] Spark framework not re-registered and lost after 
mesos master restarted

Hi Yu,

As mentioned earlier, the Spark framework currently will not
re-register because failover_timeout is not set, and there is no
configuration option for it yet.
It's only enabled in MesosClusterScheduler, since that is meant to be
an HA framework.

We should add that configuration for users who want their Spark
frameworks to be able to fail over in the event of a master failover,
network disconnect, etc.
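
For reference, a Mesos framework opts in to failover by setting
failover_timeout (and re-using its framework id on reconnect) in the
FrameworkInfo it registers with. A minimal sketch against the raw Mesos
API (illustrative only, not the current Spark code; the one-week timeout
is just an example value):

    import org.apache.mesos.{MesosSchedulerDriver, Scheduler}
    import org.apache.mesos.Protos.FrameworkInfo

    // Assumes `scheduler` is the framework's callback implementation.
    def buildDriver(scheduler: Scheduler, masterUrl: String): MesosSchedulerDriver = {
      val frameworkInfo = FrameworkInfo.newBuilder()
        .setUser("")                          // let Mesos pick the current user
        .setName("Spark")
        .setFailoverTimeout(7 * 24 * 3600)    // seconds the master keeps the framework after a disconnect
        .build()
      new MesosSchedulerDriver(scheduler, frameworkInfo, masterUrl)
    }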

Tim

On Thu, Mar 30, 2017 at 8:25 PM, Yu Wei <yu20...@hotmail.com> wrote:
> Hi Tim,
>
> I tested the scenario again with the settings below:
>
> [dcos@agent spark-2.0.2-bin-hadoop2.7]$ cat conf/spark-defaults.conf
> spark.deploy.recoveryMode  ZOOKEEPER
> spark.deploy.zookeeper.url 192.168.111.53:2181
> spark.deploy.zookeeper.dir /spark
> spark.executor.memory 512M
> spark.mesos.principal agent-dev-1
>
>
> However, the case still failed: after the master restarted, the Spark framework
> did not re-register.
> From the Spark framework log, it seems the method below in
> MesosClusterScheduler was not called.
> override def reregistered(driver: SchedulerDriver, masterInfo: MasterInfo):
> Unit
>
> Did I miss something? Any advice?
>
>
> Thanks,
>
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
>
>
>
> ________
> From: Timothy Chen <tnac...@gmail.com>
> Sent: Friday, March 31, 2017 5:13 AM
> To: Yu Wei
> Cc: us...@spark.apache.org; dev
> Subject: Re: [Spark on mesos] Spark framework not re-registered and lost
> after mesos master restarted
>
> I think failover isn't enabled for the regular Spark job framework, since we
> assume jobs are more ephemeral.
>
> It could be a good setting to add to the Spark framework to enable failover.
>
> Tim
>
> On Mar 30, 2017, at 10:18 AM, Yu Wei <yu20...@hotmail.com> wrote:
>
> Hi guys,
>
> I encountered a problem with Spark on Mesos.
>
> I set up a Mesos cluster and launched the Spark framework on Mesos successfully.
>
> Then the Mesos master was killed and started again.
>
> However, the Spark framework did not re-register the way a Mesos agent
> does. I also couldn't find any error logs.
>
> MesosClusterDispatcher is still running.
>
>
> I suspect this is a Spark framework issue.
>
> What's your opinion?
>
>
>
> Thanks,
>
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux


Re: [Spark on mesos] Spark framework not re-registered and lost after mesos master restarted

2017-03-30 Thread Yu Wei
Hi Tim,

I tested the scenario again with the settings below:

[dcos@agent spark-2.0.2-bin-hadoop2.7]$ cat conf/spark-defaults.conf
spark.deploy.recoveryMode  ZOOKEEPER
spark.deploy.zookeeper.url 192.168.111.53:2181
spark.deploy.zookeeper.dir /spark
spark.executor.memory 512M
spark.mesos.principal agent-dev-1


However, the case still failed: after the master restarted, the Spark framework
did not re-register.
From the Spark framework log, it seems the method below in MesosClusterScheduler
was not called.
override def reregistered(driver: SchedulerDriver, masterInfo: MasterInfo): Unit
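
As an illustration, one way to confirm whether that callback fires is a bare-bones
Scheduler whose reregistered method just logs. This is a sketch against the raw
Mesos API, not the actual MesosClusterScheduler code; all other callbacks are
no-op stubs:

    import org.apache.mesos.{Scheduler, SchedulerDriver}
    import org.apache.mesos.Protos._
    import java.util.{List => JList}
    import scala.collection.JavaConverters._

    // Minimal Scheduler that only logs registration events and declines all offers.
    class LoggingScheduler extends Scheduler {
      override def registered(d: SchedulerDriver, id: FrameworkID, m: MasterInfo): Unit =
        println(s"Registered as framework ${id.getValue}")
      override def reregistered(d: SchedulerDriver, m: MasterInfo): Unit =
        // Called when an already-registered framework reconnects to a restarted/new master.
        println(s"Re-registered with master ${m.getHostname}:${m.getPort}")
      override def resourceOffers(d: SchedulerDriver, offers: JList[Offer]): Unit =
        offers.asScala.foreach(o => d.declineOffer(o.getId))
      override def offerRescinded(d: SchedulerDriver, id: OfferID): Unit = ()
      override def statusUpdate(d: SchedulerDriver, status: TaskStatus): Unit = ()
      override def frameworkMessage(d: SchedulerDriver, e: ExecutorID, s: SlaveID, data: Array[Byte]): Unit = ()
      override def disconnected(d: SchedulerDriver): Unit = println("Disconnected from master")
      override def slaveLost(d: SchedulerDriver, id: SlaveID): Unit = ()
      override def executorLost(d: SchedulerDriver, e: ExecutorID, s: SlaveID, status: Int): Unit = ()
      override def error(d: SchedulerDriver, message: String): Unit = println(s"Error: $message")
    }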

Did I miss something? Any advice?



Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux



From: Timothy Chen <tnac...@gmail.com>
Sent: Friday, March 31, 2017 5:13 AM
To: Yu Wei
Cc: us...@spark.apache.org; dev
Subject: Re: [Spark on mesos] Spark framework not re-registered and lost after 
mesos master restarted

I think failover isn't enabled for the regular Spark job framework, since we
assume jobs are more ephemeral.

It could be a good setting to add to the Spark framework to enable failover.

Tim

On Mar 30, 2017, at 10:18 AM, Yu Wei 
<yu20...@hotmail.com<mailto:yu20...@hotmail.com>> wrote:


Hi guys,

I encountered a problem with Spark on Mesos.

I set up a Mesos cluster and launched the Spark framework on Mesos successfully.

Then the Mesos master was killed and started again.

However, the Spark framework did not re-register the way a Mesos agent does. I
also couldn't find any error logs.

MesosClusterDispatcher is still running.


I suspect this is a Spark framework issue.

What's your opinion?



Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux


[Spark on mesos] Spark framework not re-registered and lost after mesos master restarted

2017-03-30 Thread Yu Wei
Hi guys,

I encountered a problem with Spark on Mesos.

I set up a Mesos cluster and launched the Spark framework on Mesos successfully.

Then the Mesos master was killed and started again.

However, the Spark framework did not re-register the way a Mesos agent does. I
also couldn't find any error logs.

MesosClusterDispatcher is still running.


I suspect this is a Spark framework issue.

What's your opinion?



Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux


Spark on mesos, it seemed spark dispatcher didn't abort when authorization failed

2016-12-26 Thread Yu Wei
Hi Guys,


When running some test cases with Spark on Mesos, it seemed that the Spark
dispatcher didn't abort when authorization failed.

The dispatcher detected the error but did not handle it properly.

The detailed log is below:

16/12/26 16:02:08 INFO Utils: Successfully started service on port 8081.
16/12/26 16:02:08 INFO MesosClusterUI: Bound MesosClusterUI to 0.0.0.0, and 
started at http://192.168.111.192:8081
I1226 16:02:08.861893 11966 sched.cpp:232] Version: 1.2.0
I1226 16:02:08.868672 11964 sched.cpp:336] New master detected at 
master@192.168.111.191:5050
I1226 16:02:08.870041 11964 sched.cpp:402] Authenticating with master 
master@192.168.111.191:5050
I1226 16:02:08.870066 11964 sched.cpp:409] Using default CRAM-MD5 authenticatee
I1226 16:02:08.870635 11959 authenticatee.cpp:97] Initializing client SASL
I1226 16:02:08.871201 11959 authenticatee.cpp:121] Creating new client SASL 
connection
I1226 16:02:08.971091 11964 authenticatee.cpp:213] Received SASL authentication 
mechanisms: CRAM-MD5
I1226 16:02:08.971156 11964 authenticatee.cpp:239] Attempting to authenticate 
with mechanism 'CRAM-MD5'
I1226 16:02:08.972964 11962 authenticatee.cpp:259] Received SASL authentication 
step
I1226 16:02:08.974642 11957 authenticatee.cpp:299] Authentication success
I1226 16:02:08.975075 11964 sched.cpp:508] Successfully authenticated with 
master master@192.168.111.191:5050
I1226 16:02:08.977557 11960 sched.cpp:1177] Got error 'Not authorized to use 
role 'spark''
I1226 16:02:08.977583 11960 sched.cpp:2042] Asked to abort the driver
16/12/26 16:02:08 ERROR MesosClusterScheduler: Error received: Not authorized 
to use role 'spark'
I1226 16:02:08.978495 11960 sched.cpp:1223] Aborting framework
16/12/26 16:02:08 INFO MesosClusterScheduler: driver.run() returned with code 
DRIVER_ABORTED
16/12/26 16:02:08 INFO Utils: Successfully started service on port 7077.
16/12/26 16:02:08 INFO MesosRestServer: Started REST server for submitting 
applications on port 7077


It seems this is a bug.
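
For context, the framework-side hook where a fatal error like this surfaces is the
Scheduler.error callback. One way a dispatcher could fail fast there is sketched
below; this is illustrative only, not a description of what MesosClusterScheduler
actually does:

    import org.apache.mesos.SchedulerDriver

    object DispatcherErrorHandling {
      // Illustrative only: on an unrecoverable error such as "Not authorized to use
      // role 'spark'", stop the driver and exit rather than leaving the REST server up.
      def onSchedulerError(driver: SchedulerDriver, message: String): Unit = {
        System.err.println(s"Mesos scheduler error: $message")
        driver.stop() // tell Mesos the framework is done
        sys.exit(1)   // fail the dispatcher process fast so the problem is visible
      }
    }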


Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux


driver in queued state and not started

2016-12-05 Thread Yu Wei
Hi Guys,


I tried to run Spark on a Mesos cluster.

However, when I submitted jobs via spark-submit, the driver stayed in the
"Queued" state and never started.


What should I check?



Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux


subscribe

2016-11-14 Thread Yu Wei


Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux


Failed to run spark jobs on mesos due to "hadoop" not found.

2016-11-10 Thread Yu Wei
Hi Guys,

I failed to launch Spark jobs on Mesos. The job was submitted to the cluster
successfully.

But the job then failed to run.

I1110 18:25:11.095507   301 fetcher.cpp:498] Fetcher Info: 
{"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/192.168.111.74:9090\/bigdata\/package\/spark-examples_2.11-2.0.1.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/agent\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/frameworks\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-0002\/executors\/driver-20161110182510-0001\/runs\/b561328e-9110-4583-b740-98f9653e7fc2","user":"root"}
I1110 18:25:11.099799   301 fetcher.cpp:409] Fetching URI 
'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
I1110 18:25:11.099820   301 fetcher.cpp:250] Fetching directly into the sandbox 
directory
I1110 18:25:11.099862   301 fetcher.cpp:187] Fetching URI 
'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
E1110 18:25:11.101842   301 shell.hpp:106] Command 'hadoop version 2>&1' 
failed; this is the output:
sh: hadoop: command not found
Failed to fetch 
'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar': 
Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the 
command was either not found or exited with a non-zero exit status: 127
Failed to synchronize with agent (it's probably exited


Actually, I did install Hadoop on each agent node.


Any advice?
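
From the log, the Mesos fetcher shells out to 'hadoop version', so the hadoop
client must be resolvable on the PATH of the process running the Mesos agent,
not just installed somewhere on the node. A rough check, with the service name
and paths assumed for illustration (mirroring the prompt style used above):

    [dcos@agent ~]$ which hadoop
    [dcos@agent ~]$ sudo systemctl show mesos-slave -p Environment

If the agent's environment doesn't include Hadoop's bin directory, adding it
(or pointing the agent at the installation via its --hadoop_home flag, if your
Mesos version supports it) and restarting the agent should let the fetch succeed.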


Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux