Hi List,

We are running Chronos on Mesos 0.19.0 and found an interesting problem: if we try to launch about 1k tasks in a single resourceOffers() call, it may crash and no tasks are started by Mesos at all.
So we ran the following test. We changed the Chronos resourceOffers() callback to:

1. print a log message
2. decline the first offer in the bunch of offers
3. sleep 30 seconds
4. decline all the offers received

We also added a log statement in src/master/master.cpp that prints whenever a LaunchTasksMessage is received; see the log below.

-----------8<-----------------------
I0203 18:32:33.169342 7680 master.cpp:2939] Sending 3 offers to framework 20150203-174243-2487817994-5050-10996-0000
I0203 18:32:39.523227 7670 http.cpp:452] HTTP request for '/master/state.json'
I0203 18:32:49.601284 7674 http.cpp:452] HTTP request for '/master/state.json'
I0203 18:32:59.677875 7677 http.cpp:452] HTTP request for '/master/state.json'
I0203 18:33:03.390188 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework 20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.390949 7676 master.cpp:1895] Processing reply for offers: [ 20150203-183014-2487817994-5050-7668-0 ] on slave 20150203-183014-2487817994-5050-7668-2 at slave(1)@10.23.73.140:5051 (xulijian-mesos-online016-cqdx.qiyi.virtual) for framework 20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.391469 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework 20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.391791 7670 hierarchical_allocator_process.hpp:589] Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave 20150203-183014-2487817994-5050-7668-2 for 5secs
W0203 18:33:03.392019 7676 master.cpp:1871] Failed to validate offer 20150203-183014-2487817994-5050-7668-0: Offer 20150203-183014-2487817994-5050-7668-0 is no longer valid
I0203 18:33:03.393173 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-1 ] of framework 20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.393601 7676 master.cpp:1895] Processing reply for offers: [ 20150203-183014-2487817994-5050-7668-1 ] on slave 20150203-183014-2487817994-5050-7668-1 at slave(1)@10.23.73.141:5051 (xulijian-mesos-online017-cqdx.qiyi.virtual) for framework 20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.394057 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-2 ] of framework 20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.394379 7679 hierarchical_allocator_process.hpp:589] Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave 20150203-183014-2487817994-5050-7668-1 for 5secs
I0203 18:33:03.394664 7676 master.cpp:1895] Processing reply for offers: [ 20150203-183014-2487817994-5050-7668-2 ] on slave 20150203-183014-2487817994-5050-7668-0 at slave(1)@10.23.73.148:5051 (xulijian-mesos-online015-cqdx.qiyi.virtual) for framework 20150203-174243-2487817994-5050-10996-0000
I0203 18:33:03.395504 7676 hierarchical_allocator_process.hpp:589] Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave 20150203-183014-2487817994-5050-7668-0 for 5secs
---------------8<-------------------

As we can see, mesos-master sent the offers to Chronos at 18:32:33, but received all 4 decline messages (LaunchTasksMessage) at 18:33:03. We are very curious why the first decline was not sent before the 30-second sleep.

From the log, we see that offer 0 is no longer valid because we had already sent a decline for it before.

Does that mean we (the framework scheduler) have to reply to all offers received before we can launch any task?

--
Thanks,
Chengwei
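
P.S. For reference, the four test steps above can be sketched roughly as follows. This is only an illustration, not the actual Chronos code (Chronos is written in Scala); Python is used to keep it self-contained. RecordingDriver, decline_offer and resource_offers are hypothetical stand-ins for the real scheduler driver and callback, and the stub only records which offers were declined and in what order.

```python
import time


class RecordingDriver:
    """Hypothetical stand-in for the scheduler driver; it only records calls."""

    def __init__(self):
        # Offer ids in the order decline_offer() was called.
        self.declined = []

    def decline_offer(self, offer_id):
        self.declined.append(offer_id)


def resource_offers(driver, offer_ids, pause_seconds=30, sleep=time.sleep):
    """Test version of the callback, following steps 1-4 of the message."""
    print("received %d offers" % len(offer_ids))  # 1. print log
    if offer_ids:
        driver.decline_offer(offer_ids[0])        # 2. decline the first offer
    sleep(pause_seconds)                          # 3. sleep 30 seconds
    for offer_id in offer_ids:                    # 4. decline all offers received,
        driver.decline_offer(offer_id)            #    including the first one again
```

With 3 offers this issues 4 declines (offer 0 twice, then offers 1 and 2), which matches the 4 launchTasks messages in the master log, the second of which fails validation for offer 0.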
