Hi List,

We are running Chronos on Mesos 0.19.0 and found an interesting problem: if
we try to launch about 1k tasks in a single resourceOffers() call, it may
crash, and no tasks are started by Mesos at all.

So we ran the following test.

We changed the Chronos resourceOffers() callback to do the following:

1. print log
2. decline the first offer in the batch of offers
3. sleep 30 seconds
4. decline all the offers received
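
To make the four steps concrete, here is a minimal sketch of the modified callback in Python. It is not the actual Chronos code: StubDriver is a hypothetical stand-in that only records declineOffer() calls, offers are represented by plain ID strings, and the sleep and log parameters are added only so the sketch can be run without waiting 30 seconds.

```python
import time


class StubDriver:
    """Hypothetical stand-in for the scheduler driver; it only records calls."""

    def __init__(self):
        self.declined = []

    def declineOffer(self, offer_id):
        self.declined.append(offer_id)


def resource_offers(driver, offers, pause_seconds=30, sleep=time.sleep, log=print):
    """Sketch of the test callback; offers is a list of offer ID strings."""
    # 1. print log
    log("resourceOffers() received %d offers" % len(offers))
    # 2. decline the first offer in the batch
    if offers:
        driver.declineOffer(offers[0])
    # 3. sleep (30 seconds in our test)
    sleep(pause_seconds)
    # 4. decline all the offers received, including the first one again
    for offer_id in offers:
        driver.declineOffer(offer_id)
```

With three offers this issues four declineOffer() calls in total, and the first offer is declined twice.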

We also added a log statement in src/master/master.cpp that prints a message
whenever the master receives a LaunchTasksMessage; see the log below.

-----------8<-----------------------
I0203 18:32:33.169342  7680 master.cpp:2939] Sending 3 offers to framework 
20150203-174243-2487817994-5050-10996-0000
I0203 18:32:39.523227  7670 http.cpp:452] HTTP request for '/master/state.json'
I0203 18:32:49.601284  7674 http.cpp:452] HTTP request for '/master/state.json'
I0203 18:32:59.677875  7677 http.cpp:452] HTTP request for '/master/state.json'
I0203 18:33:03.390188  7676 master.cpp:1754] Received launchTasks message for 
offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework 
20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.390949  7676 master.cpp:1895] Processing reply for offers: [ 
20150203-183014-2487817994-5050-7668-0 ] on slave 
20150203-183014-2487817994-5050-7668-2 at slave(1)@10.23.73.140:5051 
(xulijian-mesos-online016-cqdx.qiyi.virtual) for framework 
20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.391469  7676 master.cpp:1754] Received launchTasks message for 
offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework 
20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.391791  7670 hierarchical_allocator_process.hpp:589] Framework 
20150203-174243-2487817994-5050-10996-0000 filtered slave 
20150203-183014-2487817994-5050-7668-2 for 5secs
W0203 18:33:03.392019  7676 master.cpp:1871] Failed to validate offer 
20150203-183014-2487817994-5050-7668-0: Offer 
20150203-183014-2487817994-5050-7668-0 is no longer valid
I0203 18:33:03.393173  7676 master.cpp:1754] Received launchTasks message for 
offer [ 20150203-183014-2487817994-5050-7668-1 ] of framework 
20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.393601  7676 master.cpp:1895] Processing reply for offers: [ 
20150203-183014-2487817994-5050-7668-1 ] on slave 
20150203-183014-2487817994-5050-7668-1 at slave(1)@10.23.73.141:5051 
(xulijian-mesos-online017-cqdx.qiyi.virtual) for framework 
20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.394057  7676 master.cpp:1754] Received launchTasks message for 
offer [ 20150203-183014-2487817994-5050-7668-2 ] of framework 
20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.394379  7679 hierarchical_allocator_process.hpp:589] Framework 
20150203-174243-2487817994-5050-10996-0000 filtered slave 
20150203-183014-2487817994-5050-7668-1 for 5secs
I0203 18:33:03.394664  7676 master.cpp:1895] Processing reply for offers: [ 
20150203-183014-2487817994-5050-7668-2 ] on slave 
20150203-183014-2487817994-5050-7668-0 at slave(1)@10.23.73.148:5051 
(xulijian-mesos-online015-cqdx.qiyi.virtual) for framework 
20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.395504  7676 hierarchical_allocator_process.hpp:589] Framework 
20150203-174243-2487817994-5050-10996-0000 filtered slave 
20150203-183014-2487817994-5050-7668-0 for 5secs
---------------8<-------------------

As we can see, mesos-master sent the offers to Chronos at 18:32:33, but received
all 4 decline messages (LaunchTasksMessage) at 18:33:03. We are very curious:
why wasn't the first decline sent before the 30-second sleep?

From the log, we see that offer 0 is no longer valid because we had already sent
a decline for it.
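
Our reading of this can be sketched with a toy model in Python. This is only an assumption about the bookkeeping, inferred from the log messages, and not the actual master code: ToyMaster and handle_reply are hypothetical names, and the returned strings merely imitate the log lines.

```python
class ToyMaster:
    """Toy model of offer bookkeeping; not the actual Mesos master code."""

    def __init__(self, offer_ids):
        # Offers that were sent to the framework and not yet replied to.
        self.outstanding = set(offer_ids)

    def handle_reply(self, offer_id):
        # A reply for an offer that was already consumed fails validation.
        if offer_id not in self.outstanding:
            return "Offer %s is no longer valid" % offer_id
        # The first reply (launch or decline) consumes the offer.
        self.outstanding.remove(offer_id)
        return "Processing reply for offers: [ %s ]" % offer_id
```

Under this model, replying to offer 0 twice makes the second reply fail, while the single replies for offers 1 and 2 are processed normally.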

Does that mean we (the framework scheduler) have to reply to all offers received
before we can launch any tasks?

--
Thanks,
Chengwei
