Re: [Spark on mesos] Spark framework not re-registered and lost after mesos master restarted
Got that. Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux

________________________________
From: Timothy Chen <tnac...@gmail.com>
Sent: Friday, March 31, 2017 11:33:42 AM
To: Yu Wei
Cc: dev; us...@spark.apache.org
Subject: Re: [Spark on mesos] Spark framework not re-registered and lost after mesos master restarted

Hi Yu,

As mentioned earlier, the Spark framework currently will not re-register, as failover_timeout is not set and there is no configuration available for it yet. It's only enabled in MesosClusterScheduler, since that is meant to be an HA framework.

We should add that configuration for users who want their Spark frameworks to be able to fail over in case of master failover, network disconnect, etc.

Tim

On Thu, Mar 30, 2017 at 8:25 PM, Yu Wei <yu20...@hotmail.com> wrote:
> Hi Tim,
>
> I tested the scenario again with the settings below:
>
> [dcos@agent spark-2.0.2-bin-hadoop2.7]$ cat conf/spark-defaults.conf
> spark.deploy.recoveryMode ZOOKEEPER
> spark.deploy.zookeeper.url 192.168.111.53:2181
> spark.deploy.zookeeper.dir /spark
> spark.executor.memory 512M
> spark.mesos.principal agent-dev-1
>
> However, the case still failed. After the master restarted, the Spark framework
> did not re-register. From the Spark framework log, it seemed that the method
> below in MesosClusterScheduler was not called:
>
> override def reregistered(driver: SchedulerDriver, masterInfo: MasterInfo): Unit
>
> Did I miss something? Any advice?
>
> Thanks,
>
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
>
> ________________________________
> From: Timothy Chen <tnac...@gmail.com>
> Sent: Friday, March 31, 2017 5:13 AM
> To: Yu Wei
> Cc: us...@spark.apache.org; dev
> Subject: Re: [Spark on mesos] Spark framework not re-registered and lost
> after mesos master restarted
>
> I think failover isn't enabled on the regular Spark job framework, since we
> assume jobs are more ephemeral.
>
> It could be a good setting to add to the Spark framework to enable failover.
>
> Tim
>
> On Mar 30, 2017, at 10:18 AM, Yu Wei <yu20...@hotmail.com> wrote:
>
> Hi guys,
>
> I encountered a problem with Spark on Mesos. I set up a Mesos cluster and
> launched the Spark framework on Mesos successfully. Then the Mesos master
> was killed and started again.
>
> However, the Spark framework couldn't re-register the way the Mesos agent
> did. I also couldn't find any error logs. And MesosClusterDispatcher is
> still running there.
>
> I suspect this is a Spark framework issue. What's your opinion?
>
> Thanks,
>
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
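Whether a framework survived a master restart can be checked from the outside. A minimal sketch, assuming the master host from the thread: query the Mesos master's `/master/state` endpoint and look for the Spark framework among the registered frameworks. The example below parses a canned sample response so it is self-contained; on a real cluster, the variable would be replaced with the `curl` shown in the comment.

```shell
# Sketch: list registered frameworks from the master's state endpoint.
# On a live cluster: state=$(curl -s http://192.168.111.191:5050/master/state)
# The sample JSON below is a made-up, trimmed-down response for illustration.
state='{"frameworks":[{"name":"Spark Cluster","active":true}]}'
echo "$state" | python3 -c '
import json, sys
for f in json.load(sys.stdin)["frameworks"]:
    print(f["name"], "active" if f.get("active") else "inactive")
'
```

If the framework is missing from the list after the master comes back, it never re-registered, which matches the behavior Tim describes when failover_timeout is unset.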
Re: [Spark on mesos] Spark framework not re-registered and lost after mesos master restarted
Hi Tim,

I tested the scenario again with the settings below:

[dcos@agent spark-2.0.2-bin-hadoop2.7]$ cat conf/spark-defaults.conf
spark.deploy.recoveryMode ZOOKEEPER
spark.deploy.zookeeper.url 192.168.111.53:2181
spark.deploy.zookeeper.dir /spark
spark.executor.memory 512M
spark.mesos.principal agent-dev-1

However, the case still failed. After the master restarted, the Spark framework did not re-register. From the Spark framework log, it seemed that the method below in MesosClusterScheduler was not called:

override def reregistered(driver: SchedulerDriver, masterInfo: MasterInfo): Unit

Did I miss something? Any advice?

Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux

________________________________
From: Timothy Chen <tnac...@gmail.com>
Sent: Friday, March 31, 2017 5:13 AM
To: Yu Wei
Cc: us...@spark.apache.org; dev
Subject: Re: [Spark on mesos] Spark framework not re-registered and lost after mesos master restarted

I think failover isn't enabled on the regular Spark job framework, since we assume jobs are more ephemeral.

It could be a good setting to add to the Spark framework to enable failover.

Tim

On Mar 30, 2017, at 10:18 AM, Yu Wei <yu20...@hotmail.com> wrote:

Hi guys,

I encountered a problem with Spark on Mesos. I set up a Mesos cluster and launched the Spark framework on Mesos successfully. Then the Mesos master was killed and started again.

However, the Spark framework couldn't re-register the way the Mesos agent did. I also couldn't find any error logs. And MesosClusterDispatcher is still running there.

I suspect this is a Spark framework issue. What's your opinion?

Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux
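One detail worth noting here: the `spark.deploy.*` recovery keys configure the dispatcher's own recovery state, so they only take effect if they are in a properties file that the dispatcher process itself reads at startup, not merely in the file used by `spark-submit`. A minimal sketch (file path, master host, and launch flags are assumptions, not from the thread):

```shell
# Sketch: write the recovery keys from the thread into a properties file the
# dispatcher would be started with, then count them as a sanity check.
conf=/tmp/spark-defaults.conf
cat > "$conf" <<'EOF'
spark.deploy.recoveryMode   ZOOKEEPER
spark.deploy.zookeeper.url  192.168.111.53:2181
spark.deploy.zookeeper.dir  /spark
EOF
grep -c '^spark\.deploy\.' "$conf"
# The dispatcher would then be launched along the lines of:
#   ./sbin/start-mesos-dispatcher.sh --master mesos://192.168.111.191:5050 \
#       --properties-file "$conf"
```

Even then, per Tim's reply, only MesosClusterScheduler (the dispatcher) has failover enabled; the per-job Spark framework still will not re-register.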
[Spark on mesos] Spark framework not re-registered and lost after mesos master restarted
Hi guys,

I encountered a problem with Spark on Mesos. I set up a Mesos cluster and launched the Spark framework on Mesos successfully. Then the Mesos master was killed and started again.

However, the Spark framework couldn't re-register the way the Mesos agent did. I also couldn't find any error logs. And MesosClusterDispatcher is still running there.

I suspect this is a Spark framework issue. What's your opinion?

Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux
Spark on mesos, it seemed spark dispatcher didn't abort when authorization failed
Hi Guys,

When running some cases with Spark on Mesos, it seemed that the Spark dispatcher didn't abort when authorization failed. The dispatcher detected the error but did not handle it properly. The detailed log is below:

16/12/26 16:02:08 INFO Utils: Successfully started service on port 8081.
16/12/26 16:02:08 INFO MesosClusterUI: Bound MesosClusterUI to 0.0.0.0, and started at http://192.168.111.192:8081
I1226 16:02:08.861893 11966 sched.cpp:232] Version: 1.2.0
I1226 16:02:08.868672 11964 sched.cpp:336] New master detected at master@192.168.111.191:5050
I1226 16:02:08.870041 11964 sched.cpp:402] Authenticating with master master@192.168.111.191:5050
I1226 16:02:08.870066 11964 sched.cpp:409] Using default CRAM-MD5 authenticatee
I1226 16:02:08.870635 11959 authenticatee.cpp:97] Initializing client SASL
I1226 16:02:08.871201 11959 authenticatee.cpp:121] Creating new client SASL connection
I1226 16:02:08.971091 11964 authenticatee.cpp:213] Received SASL authentication mechanisms: CRAM-MD5
I1226 16:02:08.971156 11964 authenticatee.cpp:239] Attempting to authenticate with mechanism 'CRAM-MD5'
I1226 16:02:08.972964 11962 authenticatee.cpp:259] Received SASL authentication step
I1226 16:02:08.974642 11957 authenticatee.cpp:299] Authentication success
I1226 16:02:08.975075 11964 sched.cpp:508] Successfully authenticated with master master@192.168.111.191:5050
I1226 16:02:08.977557 11960 sched.cpp:1177] Got error 'Not authorized to use role 'spark''
I1226 16:02:08.977583 11960 sched.cpp:2042] Asked to abort the driver
16/12/26 16:02:08 ERROR MesosClusterScheduler: Error received: Not authorized to use role 'spark'
I1226 16:02:08.978495 11960 sched.cpp:1223] Aborting framework
16/12/26 16:02:08 INFO MesosClusterScheduler: driver.run() returned with code DRIVER_ABORTED
16/12/26 16:02:08 INFO Utils: Successfully started service on port 7077.
16/12/26 16:02:08 INFO MesosRestServer: Started REST server for submitting applications on port 7077

It seems this is a bug.

Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux
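The underlying error, "Not authorized to use role 'spark'", means the master's authorization rules do not allow this principal to register with that role. A hedged sketch of a Mesos ACL entry that would permit it (the principal name is a placeholder; consult the Mesos authorization docs for the exact semantics on your version):

```shell
# Sketch: an ACL file allowing a principal to register frameworks with role
# 'spark'. "spark-principal" is a hypothetical name, not from the thread.
cat > /tmp/acls.json <<'EOF'
{
  "register_frameworks": [
    {
      "principals": { "values": ["spark-principal"] },
      "roles": { "values": ["spark"] }
    }
  ]
}
EOF
python3 -c '
import json
acl = json.load(open("/tmp/acls.json"))["register_frameworks"][0]
print("role:", acl["roles"]["values"][0])
'
# The master would then be restarted with something like:
#   mesos-master --acls=file:///tmp/acls.json ...
```

That fixes the authorization failure itself; the dispatcher continuing to start its REST server after DRIVER_ABORTED, as the log shows, still looks like a separate Spark-side bug.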
driver in queued state and not started
Hi Guys,

I tried to run Spark on a Mesos cluster. However, when I submitted jobs via spark-submit, the driver stayed in "QUEUED" state and was not started.

What should I check?

Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux
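A place to start looking, as a sketch: the dispatcher's REST server (port 7077 in these threads) exposes a submission-status endpoint, and a driver commonly stays QUEUED while no Mesos offer satisfies its cpu/memory requirements. The example parses a canned response so it is self-contained; on a live cluster you would use the `curl` in the comment, with your own driver ID.

```shell
# On a live cluster (host and driver ID are placeholders):
#   curl http://192.168.111.192:7077/v1/submissions/status/driver-20161110182510-0001
# Canned sample response, trimmed for illustration:
response='{"action":"SubmissionStatusResponse","driverState":"QUEUED","success":true}'
echo "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["driverState"])'
```

If the state stays QUEUED, comparing the driver's requested resources against the offers shown in the Mesos master UI is a reasonable next step.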
subscribe
Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux
Failed to run spark jobs on mesos due to "hadoop" not found.
Hi Guys,

I failed to launch Spark jobs on Mesos. Actually, I submitted the job to the cluster successfully, but the job failed to run.

I1110 18:25:11.095507 301 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/192.168.111.74:9090\/bigdata\/package\/spark-examples_2.11-2.0.1.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/agent\/slaves\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/frameworks\/1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-0002\/executors\/driver-20161110182510-0001\/runs\/b561328e-9110-4583-b740-98f9653e7fc2","user":"root"}
I1110 18:25:11.099799 301 fetcher.cpp:409] Fetching URI 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
I1110 18:25:11.099820 301 fetcher.cpp:250] Fetching directly into the sandbox directory
I1110 18:25:11.099862 301 fetcher.cpp:187] Fetching URI 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
E1110 18:25:11.101842 301 shell.hpp:106] Command 'hadoop version 2>&1' failed; this is the output: sh: hadoop: command not found
Failed to fetch 'hdfs://192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar': Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was either not found or exited with a non-zero exit status: 127
Failed to synchronize with agent (it's probably exited

Actually, I installed Hadoop on each agent node. Any advice?

Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux
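The log already pins this down: the Mesos fetcher shells out to `hadoop version 2>&1`, and exit status 127 means no `hadoop` binary is on the PATH of the agent process, even if Hadoop is installed elsewhere on the node. A small sketch reproducing that failure mode in isolation (the install path in the comments is an assumption):

```shell
# Reproduce the fetcher's failure mode: with an empty PATH, `hadoop` cannot
# be resolved, just as in the agent's environment in the log above.
( PATH=""; command -v hadoop >/dev/null 2>&1 || echo "hadoop: command not found" )

# A typical fix is to expose Hadoop to the mesos-agent process environment,
# e.g. (hypothetical install location):
#   export HADOOP_HOME=/opt/hadoop
#   export PATH=$HADOOP_HOME/bin:$PATH
```

After changing the agent's environment, the agent process has to be restarted for the fetcher to pick it up.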