Here's the Chronos container log:

{"log":"I0812 08:15:42.256145   112 sched.cpp:981] Scheduler::statusUpdate took 
2.572805ms\n","stream":"stderr","time":"2016-08-12T00:15:42.256201718Z"}
{"log":"Exception in thread \"Thread-1433105\" 
java.lang.IllegalArgumentException: no such vertex in 
graph\n","stream":"stderr","time":"2016-08-12T00:15:43.902249654Z"}
{"log":"\u0009at 
org.jgrapht.graph.AbstractGraph.assertVertexExist(AbstractGraph.java:158)\n","stream":"stderr","time":"2016-08-12T00:15:43.902297744Z"}
{"log":"\u0009at 
org.jgrapht.graph.AbstractBaseGraph$DirectedSpecifics.getEdgeContainer(AbstractBaseGraph.java:927)\n","stream":"stderr","time":"2016-08-12T00:15:43.902310237Z"}
{"log":"\u0009at 
org.jgrapht.graph.AbstractBaseGraph$DirectedSpecifics.edgesOf(AbstractBaseGraph.java:851)\n","stream":"stderr","time":"2016-08-12T00:15:43.902324329Z"}
{"log":"\u0009at 
org.jgrapht.graph.AbstractBaseGraph.edgesOf(AbstractBaseGraph.java:395)\n","stream":"stderr","time":"2016-08-12T00:15:43.902333866Z"}
{"log":"\u0009at 
org.apache.mesos.chronos.scheduler.graph.JobGraph.getChildren(JobGraph.scala:175)\n","stream":"stderr","time":"2016-08-12T00:15:43.902343167Z"}
{"log":"\u0009at 
org.apache.mesos.chronos.scheduler.graph.JobGraph.getExecutableChildren(JobGraph.scala:148)\n","stream":"stderr","time":"2016-08-12T00:15:43.902357005Z"}
{"log":"\u0009at 
org.apache.mesos.chronos.scheduler.jobs.JobScheduler.processDependencies(JobScheduler.scala:347)\n","stream":"stderr","time":"2016-08-12T00:15:43.902369764Z"}
{"log":"\u0009at 
org.apache.mesos.chronos.scheduler.jobs.JobScheduler.handleFinishedTask(JobScheduler.scala:272)\n","stream":"stderr","time":"2016-08-12T00:15:43.902381643Z"}
{"log":"\u0009at 
org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:226)\n","stream":"stderr","time":"2016-08-12T00:15:43.902393721Z"}
{"log":"\u0009at sun.reflect.GeneratedMethodAccessor110.invoke(Unknown 
Source)\n","stream":"stderr","time":"2016-08-12T00:15:43.90240862Z"}
{"log":"\u0009at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n","stream":"stderr","time":"2016-08-12T00:15:43.902417552Z"}
{"log":"\u0009at 
java.lang.reflect.Method.invoke(Method.java:606)\n","stream":"stderr","time":"2016-08-12T00:15:43.902427895Z"}
{"log":"\u0009at 
com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37)\n","stream":"stderr","time":"2016-08-12T00:15:43.902447417Z"}
{"log":"\u0009at com.sun.proxy.$Proxy31.statusUpdate(Unknown 
Source)\n","stream":"stderr","time":"2016-08-12T00:15:43.902458275Z"}
{"log":"I0812 08:15:43.902712    96 sched.cpp:1937] Asked to abort the 
driver\n","stream":"stderr","time":"2016-08-12T00:15:43.902765363Z"}


JNIScheduler::statusUpdate (a C++ function in org_apache_mesos_MesosSchedulerDriver.cpp) invokes statusUpdate (a Scala function in MesosJobFramework.scala), which queries and replaces a job.
At the same time, another thread deleted that job, so statusUpdate threw an exception, which was caught by JNIScheduler::statusUpdate, which then invoked driver->abort().
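
For illustration, here is a minimal Scala sketch (not Chronos code; the job names, the loop count, and the timing are made up) that reproduces the same kind of race directly against JGraphT, the library the stack trace shows JobGraph delegating to: one thread calls edgesOf on a vertex while another thread removes that vertex, and when the reader loses the race it hits the same "no such vertex in graph" IllegalArgumentException. Whether it fires on a given run depends on thread scheduling.

import org.jgrapht.graph.{DefaultDirectedGraph, DefaultEdge}

object JobGraphRaceSketch {
  def main(args: Array[String]): Unit = {
    // A tiny dependency graph standing in for Chronos' JobGraph.
    val graph = new DefaultDirectedGraph[String, DefaultEdge](classOf[DefaultEdge])
    graph.addVertex("parentJob")
    graph.addVertex("childJob")
    graph.addEdge("parentJob", "childJob")

    // "statusUpdate" side: repeatedly walks the edges of a job, as getChildren does.
    val reader = new Thread(new Runnable {
      def run(): Unit =
        try {
          var i = 0
          while (i < 1000000) { graph.edgesOf("childJob"); i += 1 }
        } catch {
          case e: IllegalArgumentException =>
            println(s"lost the race: ${e.getMessage}") // same message as in the log above
        }
    })

    // "delete job" side: removes the vertex concurrently, with no shared lock.
    val deleter = new Thread(new Runnable {
      def run(): Unit = { Thread.sleep(1); graph.removeVertex("childJob") }
    })

    reader.start(); deleter.start()
    reader.join(); deleter.join()
  }
}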

In summary, this is a race-condition bug in Chronos.

Regards,
Zhichang Yu


________________________________
From: tommy xiao <xia...@gmail.com>
Sent: August 14, 2016 7:43
To: user
Subject: Re: Re: Deactivating framework unexpectedly

hi Yu,

please enable debug mode to see more detailed logs with GLOG_v=3

2016-08-12 14:27 GMT+08:00 志昌 余 <yuzhichang_...@hotmail.com>:

Hi Anindya,

    The problem occurred again. The following is the scheduler driver log on the Chronos side:


I0812 08:15:43.902712    96 sched.cpp:1937] Asked to abort the driver
I0812 08:15:43.902763    96 sched.cpp:981] Scheduler::statusUpdate took 1.436378441secs
I0812 08:15:43.902788    96 sched.cpp:988] Not sending status update acknowledgment message because the driver is not running!
I0812 08:15:43.902866    96 sched.cpp:919] Ignoring task status update message because the driver is not running!

    However, from the earlier log I don't see any clue as to why the scheduler driver was aborted.



    Thanks,

Zhichang Yu



________________________________
From: 志昌 余 <yuzhichang_...@hotmail.com>
Sent: August 9, 2016 18:03:31
To: user@mesos.apache.org
Subject: Re: Deactivating framework unexpectedly


Hi Anindya,

    Thanks for the info. I'll enable the scheduler driver log to see what happens.

Regards,

Zhichang Yu

________________________________
From: anindya_si...@apple.com <anindya_si...@apple.com> on behalf of Anindya Sinha <anindya_si...@apple.com>
Sent: August 8, 2016 23:50:10
To: user@mesos.apache.org
Subject: Re: Deactivating framework unexpectedly

Looks like your framework (Chronos) is sending a DeactivateFrameworkMessage to the master. The scheduler driver also sends a DeactivateFramework message if it is aborted (https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L1224).

Also, the master can deactivate your framework if it disconnects or fails over. Please check the master logs or see whether your framework received a FrameworkErrorMessage.
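
If it helps for the next occurrence, here is a minimal sketch (assuming the Mesos 0.28 Java bindings that Chronos uses; the class name LoggingScheduler and the log wording are made up) of the Scheduler callbacks where these events surface, so they can be logged explicitly: a FrameworkErrorMessage arrives through error(), a master-side disconnect through disconnected(), and an uncaught exception in statusUpdate makes the JNI layer abort the driver.

import org.apache.mesos.{Scheduler, SchedulerDriver}
import org.apache.mesos.Protos._
import java.util.{List => JList}

class LoggingScheduler extends Scheduler {
  private def log(msg: String): Unit = println(s"[scheduler] $msg")

  override def registered(d: SchedulerDriver, id: FrameworkID, m: MasterInfo): Unit =
    log(s"registered as ${id.getValue}")
  override def reregistered(d: SchedulerDriver, m: MasterInfo): Unit = log("re-registered")
  override def resourceOffers(d: SchedulerDriver, offers: JList[Offer]): Unit = ()
  override def offerRescinded(d: SchedulerDriver, id: OfferID): Unit = ()
  override def statusUpdate(d: SchedulerDriver, status: TaskStatus): Unit =
    // Anything thrown out of here reaches the JNI wrapper, which aborts the driver,
    // so keep this callback defensive.
    try log(s"status ${status.getState} for ${status.getTaskId.getValue}")
    catch { case e: Exception => log(s"statusUpdate failed: ${e.getMessage}") }
  override def frameworkMessage(d: SchedulerDriver, e: ExecutorID, s: SlaveID, data: Array[Byte]): Unit = ()
  override def disconnected(d: SchedulerDriver): Unit = log("disconnected from master")
  override def slaveLost(d: SchedulerDriver, id: SlaveID): Unit = ()
  override def executorLost(d: SchedulerDriver, e: ExecutorID, s: SlaveID, status: Int): Unit = ()
  // This is where a FrameworkErrorMessage (or any fatal driver error) shows up.
  override def error(d: SchedulerDriver, message: String): Unit = log(s"driver error: $message")
}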

Thanks
Anindya

On Aug 8, 2016, at 3:35 AM, 志昌 余 <yuzhichang_...@hotmail.com> wrote:

Hi,
    I recently faced a weird problem. I'm running Mesos + Chronos. Chronos often (once every several days) stops scheduling tasks because Mesos deactivated the framework.
The following is the log of the Mesos master leader:


# grep -iP "activat|disconnected" /var/log/mesos/mesos-master.INFO
I0806 13:40:33.143658    30 master.cpp:2551] Deactivating framework 90a6a7dc-7256-4e55-bd7e-573233c5df74-0000 (chronos-2.5.0-SNAPSHOT) at scheduler-86a64d22-5201-4bb0-8a2c-70d3e97afae6@10.8.139.246:34544
I0806 13:40:33.143908    23 hierarchical.cpp:375] Deactivated framework 90a6a7dc-7256-4e55-bd7e-573233c5df74-0000

The workaround is to manually restart the Chronos leader.


My env:
There are 3 physical machines, each running a containerized Mesos master and Chronos. When the issue occurred, the Mesos leader and the Chronos leader were both running on the same machine.

Software Version:
mesos-master:0.28.0-2.0.16.ubuntu1404

chronos:2.5.0-ce4469d.ubuntu1404-mesos-0.28.0-2.0.16.ubuntu1404

    Can anyone give insight into this problem?
    Thanks,
Zhichang Yu




--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com
