Vinod Kone created MESOS-3907: --------------------------------- Summary: PersistentVolumeTest.SlaveRecovery test is flaky Key: MESOS-3907 URL: https://issues.apache.org/jira/browse/MESOS-3907 Project: Mesos Issue Type: Bug Environment: ASF CI Reporter: Vinod Kone
Looks like the executor didn't re-register in time after slave restart. The Clock::settle() should've guaranteed that the executor sent re-register and slave processed it, so it's not clear why this race happened. {code} I1112 17:49:00.663300 28118 slave.cpp:392] Slave resources: cpus(*):2; mem(*):1024; disk(role1):1024; ports(*):[31000-32000] I1112 17:49:00.663388 28118 slave.cpp:400] Slave attributes: [ ] I1112 17:49:00.663406 28118 slave.cpp:405] Slave hostname: f4f9d084d7ba I1112 17:49:00.663419 28118 slave.cpp:410] Slave checkpoint: true I1112 17:49:00.669028 28115 state.cpp:54] Recovering state from '/tmp/PersistentVolumeTest_SlaveRecovery_LLnwwj/meta' I1112 17:49:00.675704 28121 fetcher.cpp:79] Clearing fetcher cache I1112 17:49:00.675799 28121 slave.cpp:4318] Recovering framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.675891 28121 slave.cpp:5108] Recovering executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.677217 28119 status_update_manager.cpp:202] Recovering status update manager I1112 17:49:00.677250 28119 status_update_manager.cpp:210] Recovering executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.677326 28119 status_update_manager.cpp:499] Creating StatusUpdate stream for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.677896 28119 status_update_manager.cpp:802] Replaying status update stream for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 I1112 17:49:00.678973 28121 slave.cpp:682] Successfully attached file '/tmp/PersistentVolumeTest_SlaveRecovery_LLnwwj/slaves/5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0/frameworks/5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000/executors/2b35f3fc-373c-4b5f-bbe6-dc3279117145/runs/b9dbc84e-57fa-49d5-840f-8f6d9f0233fc' I1112 17:49:00.679440 28121 containerizer.cpp:385] Recovering containerizer I1112 17:49:00.679527 28121 containerizer.cpp:440] Recovering container 'b9dbc84e-57fa-49d5-840f-8f6d9f0233fc' for executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.685341 28115 slave.cpp:4170] Sending reconnect request to executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 at executor(1)@172.17.6.239:46865 I1112 17:49:00.690004 1529 exec.cpp:256] Received reconnect request from slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 I1112 17:49:00.703845 28122 hierarchical.cpp:977] No resources available to allocate! I1112 17:49:00.703917 28122 hierarchical.cpp:1070] No inverse offers to send out! I1112 17:49:00.704025 28122 hierarchical.cpp:876] Performed allocation for 1 slaves in 271350ns I1112 17:49:00.703974 28116 slave.cpp:2702] Cleaning up un-reregistered executors I1112 17:49:00.704107 28116 slave.cpp:2720] Killing un-reregistered executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 at executor(1)@172.17.6.239:46865 I1112 17:49:00.704262 28116 slave.cpp:4230] Finished recovery I1112 17:49:00.704397 28123 containerizer.cpp:1075] Destroying container 'b9dbc84e-57fa-49d5-840f-8f6d9f0233fc' I1112 17:49:00.706696 28116 slave.cpp:4387] Querying resource estimator for oversubscribable resources I1112 17:49:00.707142 28127 slave.cpp:4401] Received oversubscribable resources from the resource estimator W1112 17:49:00.707540 28127 slave.cpp:2546] Shutting down executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 because the slave is not in recovery mode I1112 17:49:00.708133 28116 slave.cpp:729] New master detected at master@172.17.6.239:32961 I1112 17:49:00.708228 28116 slave.cpp:792] Authenticating with master master@172.17.6.239:32961 I1112 17:49:00.708253 28116 slave.cpp:797] Using default CRAM-MD5 authenticatee I1112 17:49:00.708421 28116 slave.cpp:765] Detecting new master I1112 17:49:00.708533 28116 status_update_manager.cpp:176] Pausing sending status updates I1112 17:49:00.708673 28116 authenticatee.cpp:123] Creating new client SASL connection I1112 17:49:00.709913 28116 master.cpp:5150] Authenticating slave(197)@172.17.6.239:32961 I1112 17:49:00.710218 28116 authenticator.cpp:415] Starting authentication session for crammd5_authenticatee(467)@172.17.6.239:32961 I1112 17:49:00.710438 1516 exec.cpp:383] Executor asked to shutdown I1112 17:49:00.710582 1516 exec.cpp:398] Executor::shutdown took 35481ns I1112 17:49:00.710598 28116 authenticator.cpp:100] Creating new server SASL connection I1112 17:49:00.710705 1516 exec.cpp:82] Scheduling shutdown of the executor Shutting down Sending SIGTERM to process tree at pid 1531 I1112 17:49:00.710830 28116 authenticatee.cpp:214] Received SASL authentication mechanisms: CRAM-MD5 I1112 17:49:00.710866 28116 authenticatee.cpp:240] Attempting to authenticate with mechanism 'CRAM-MD5' I1112 17:49:00.712005 28116 authenticator.cpp:205] Received SASL authentication start I1112 17:49:00.712069 28116 authenticator.cpp:327] Authentication requires more steps I1112 17:49:00.712214 28124 authenticatee.cpp:260] Received SASL authentication step I1112 17:49:00.712357 28116 authenticator.cpp:233] Received SASL authentication step I1112 17:49:00.712389 28116 auxprop.cpp:109] Request to lookup properties for user: 'test-principal' realm: 'f4f9d084d7ba' server FQDN: 'f4f9d084d7ba' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: false I1112 17:49:00.712412 28116 auxprop.cpp:181] Looking up auxiliary property '*userPassword' I1112 17:49:00.712467 28116 auxprop.cpp:181] Looking up auxiliary property '*cmusaslsecretCRAM-MD5' I1112 17:49:00.712507 28116 auxprop.cpp:109] Request to lookup properties for user: 'test-principal' realm: 'f4f9d084d7ba' server FQDN: 'f4f9d084d7ba' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: true I1112 17:49:00.712524 28116 auxprop.cpp:131] Skipping auxiliary property '*userPassword' since SASL_AUXPROP_AUTHZID == true I1112 17:49:00.712534 28116 auxprop.cpp:131] Skipping auxiliary property '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true I1112 17:49:00.712555 28116 authenticator.cpp:319] Authentication success I1112 17:49:00.712708 28128 authenticatee.cpp:300] Authentication success I1112 17:49:00.721004 28122 master.cpp:5180] Successfully authenticated principal 'test-principal' at slave(197)@172.17.6.239:32961 I1112 17:49:00.721352 28117 authenticator.cpp:433] Authentication session cleanup for crammd5_authenticatee(467)@172.17.6.239:32961 Killing the following process trees: [ --- 1531 sh -c while true; do test -d path1; done ] I1112 17:49:00.721968 28128 slave.cpp:860] Successfully authenticated with master master@172.17.6.239:32961 I1112 17:49:00.723456 28130 master.cpp:4019] Re-registering slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(196)@172.17.6.239:32961 (f4f9d084d7ba) I1112 17:49:00.724056 28126 hierarchical.cpp:417] Slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 reactivated I1112 17:49:00.723914 28130 master.cpp:4207] Sending updated checkpointed resources disk(role1)[id1:path1]:64 to slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961 (f4f9d084d7ba) I1112 17:49:00.724306 28128 slave.cpp:1254] Will retry registration in 17.384714ms if necessary I1112 17:49:00.726639 28128 slave.cpp:1004] Re-registered with master master@172.17.6.239:32961 I1112 17:49:00.726775 28122 master.cpp:3479] Telling slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961 (f4f9d084d7ba) to kill task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 (default) at scheduler-7193507a-d32c-48fe-bfc2-1f594ff8a63a@172.17.6.239:32961 I1112 17:49:00.726794 28127 status_update_manager.cpp:183] Resuming sending status updates I1112 17:49:00.726913 28128 slave.cpp:1040] Forwarding total oversubscribed resources I1112 17:49:00.727146 28126 master.cpp:4269] Received update of slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961 (f4f9d084d7ba) with total oversubscribed resources I1112 17:49:00.727447 28126 hierarchical.cpp:400] Slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 (f4f9d084d7ba) updated with oversubscribed resources (total: cpus(*):2; mem(*):1024; disk(role1):960; ports(*):[31000-32000]; disk(role1)[id1:path1]:64, allocated: cpus(*):1; mem(*):128; disk(role1):32; disk(role1)[id1:path1]:64) I1112 17:49:00.728248 28126 hierarchical.cpp:1162] Filtered offer with cpus(*):1; mem(*):896; ports(*):[31000-32000]; disk(role1):928 on slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 for framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.728301 28126 hierarchical.cpp:977] No resources available to allocate! I1112 17:49:00.728350 28126 hierarchical.cpp:1070] No inverse offers to send out! I1112 17:49:00.728374 28126 hierarchical.cpp:892] Performed allocation for slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 in 865336ns I1112 17:49:00.728529 28128 slave.cpp:2178] Updating framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 pid to scheduler-7193507a-d32c-48fe-bfc2-1f594ff8a63a@172.17.6.239:32961 I1112 17:49:00.728587 28128 slave.cpp:2194] Checkpointing framework pid 'scheduler-7193507a-d32c-48fe-bfc2-1f594ff8a63a@172.17.6.239:32961' to '/tmp/PersistentVolumeTest_SlaveRecovery_LLnwwj/meta/slaves/5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0/frameworks/5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000/framework.pid' I1112 17:49:00.729202 28119 status_update_manager.cpp:183] Resuming sending status updates I1112 17:49:00.729709 28128 slave.cpp:2274] Updated checkpointed resources from disk(role1)[id1:path1]:64 to disk(role1)[id1:path1]:64 I1112 17:49:00.729821 28128 slave.cpp:1821] Asked to kill task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 W1112 17:49:00.729884 28128 slave.cpp:1948] Ignoring kill task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 because the executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 at executor(1)@172.17.6.239:46865 is terminating/terminated I1112 17:49:00.804975 28117 containerizer.cpp:1258] Executor for container 'b9dbc84e-57fa-49d5-840f-8f6d9f0233fc' has exited I1112 17:49:00.807514 28117 slave.cpp:3553] Executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 terminated with signal Killed I1112 17:49:00.807701 28117 slave.cpp:2762] Handling status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 from @0.0.0.0:0 I1112 17:49:00.807884 28117 slave.cpp:5298] Terminating task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 W1112 17:49:00.808377 28117 containerizer.cpp:967] Ignoring update for unknown container: b9dbc84e-57fa-49d5-840f-8f6d9f0233fc I1112 17:49:00.808811 28117 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.808878 28117 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.814139 28117 status_update_manager.cpp:376] Forwarding update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 to the slave I1112 17:49:00.814621 28122 slave.cpp:3087] Forwarding the update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 to master@172.17.6.239:32961 I1112 17:49:00.814888 28122 slave.cpp:3005] Status update manager successfully handled status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.815172 28117 master.cpp:4414] Status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 from slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961 (f4f9d084d7ba) I1112 17:49:00.815218 28117 master.cpp:4462] Forwarding status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.815402 28117 master.cpp:6066] Updating the state of task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 (latest state: TASK_LOST, status update state: TASK_LOST) I1112 17:49:00.815636 28122 sched.cpp:921] Scheduler::statusUpdate took 135083ns I1112 17:49:00.816069 28117 master.cpp:3571] Processing ACKNOWLEDGE call 0c911b27-4aa4-4435-965d-1b4b5033a40c for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 (default) at scheduler-7193507a-d32c-48fe-bfc2-1f594ff8a63a@172.17.6.239:32961 on slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 I1112 17:49:00.816118 28122 hierarchical.cpp:744] Recovered cpus(*):1; mem(*):128; disk(role1):32; disk(role1)[id1:path1]:64 (total: cpus(*):2; mem(*):1024; disk(role1):960; ports(*):[31000-32000]; disk(role1)[id1:path1]:64, allocated: ) on slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 from framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.816128 28117 master.cpp:6132] Removing task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 with resources cpus(*):1; mem(*):128; disk(role1):32; disk(role1)[id1:path1]:64 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 on slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961 (f4f9d084d7ba) I1112 17:49:00.816721 28127 status_update_manager.cpp:394] Received status update acknowledgement (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 I1112 17:49:00.817196 28127 status_update_manager.cpp:826] Checkpointing ACK for status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework 5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 ../../src/tests/persistent_volume_tests.cpp:709: Failure Value of: status2.get().state() Actual: TASK_LOST Expected: TASK_KILLED {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)