Re: Review Request 14292: Added authentication support to scheduler and master.
--- This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14292/#review26502
---

src/master/master.hpp
https://reviews.apache.org/r/14292/#comment51677

    const?

src/master/master.cpp
https://reviews.apache.org/r/14292/#comment51679

    The dispatch in authenticate would need to pass the pid here, since
    'from' could have changed!

src/master/master.cpp
https://reviews.apache.org/r/14292/#comment51678

    s/Cancel it/Try to cancel it/

    Can you add a comment about what happens when the authenticating
    future is ready? (E.g. if the Future was READY at this point, the
    pending _authenticate could result in a successful authentication
    for a previous Authenticatee, and a new Authenticatee will be
    created on the client side to retry?)

src/master/master.cpp
https://reviews.apache.org/r/14292/#comment51680

    Should we be retrying in the Master? It seems like the retry logic
    should rest solely in the client (if the client is down, it looks
    like this will retry forever). Be sure to update the retry comments
    if you think we should remove this bit.

src/sched/sched.cpp
https://reviews.apache.org/r/14292/#comment51683

    Can you add a comment describing what we discussed for when the
    future was already ready at this point (say, if two
    newMasterDetected calls occurred in quick succession)?

        // Authentication is in progress, try to cancel it.
        authenticating.get().discard();

        // If 'authenticating' was already ready, this means there is a
        // pending dispatch to _authenticate. That call will consider
        // authentication successful even though we may not be
        // authenticated (say, if two masters were elected in quick
        // succession). In this case, the driver will proceed with
        // registration but will receive a framework error and will
        // exit as a result. This is a sufficiently rare race condition
        // that it's not worth complicating the code here to handle it.
        return;

src/sched/sched.cpp
https://reviews.apache.org/r/14292/#comment51684

    Do we want to add a shared initialize(...) function?
    It looks like the only difference between the two constructors is
    the credential option passed to the SchedulerProcess?

- Ben Mahler

On Sept. 30, 2013, 2:46 a.m., Vinod Kone wrote:

--- This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14292/
---

(Updated Sept. 30, 2013, 2:46 a.m.)

Review request for mesos, Benjamin Hindman and Ben Mahler.

Bugs: MESOS-704
    https://issues.apache.org/jira/browse/MESOS-704

Repository: mesos-git

Description
-------

Added authentication support for scheduler driver and master.

Diffs
-----

  include/mesos/mesos.proto 957576bbc1c73513a9591194d017f76fe562a616
  include/mesos/scheduler.hpp cf3ecdaaf40fd878a80fe0b6f7e61a0997329cbd
  src/Makefile.am ee336130ad93d8b524c841f75be36f00d4a2b147
  src/common/type_utils.hpp 674a8820c339c6446dfa7d57477ab4512e79
  src/java/jni/construct.cpp b01bd7ae2eda2dc5e0dcd68848c65bd9f9ea81f0
  src/java/jni/org_apache_mesos_MesosSchedulerDriver.cpp 6d2a03b6a88e71ac4e2e2d1ee8e15925e393ef3d
  src/java/src/org/apache/mesos/MesosSchedulerDriver.java 7ef1fe7755286bf92b94d7ece4f72d54e5b57a84
  src/master/flags.hpp d59e67d5b2799d6d7a37e9cfe7246ae7372091ac
  src/master/master.hpp bd5cb1ff05967518d7dc7f3a9dc0d32c22eb9643
  src/master/master.cpp a49b17ef43fca5b385a89731ca8776a26b61399a
  src/python/native/mesos_scheduler_driver_impl.cpp f25d41d38caf2701813dbec0d342a3b327e9dedf
  src/sasl/authenticator.hpp 2f78cf0fdd97f0ddc3a6ebd162e6559497d708e4
  src/sched/sched.cpp c399f2481259683a8e178abb3478307042292f23
  src/tests/allocator_tests.cpp c57da6eb3c431b47468b6a6941c3de06af9209e5
  src/tests/allocator_zookeeper_tests.cpp 6e3214c15c14dc8ba82082738c172c8833cd0887
  src/tests/authentication_tests.cpp PRE-CREATION
  src/tests/exception_tests.cpp 3fc1ac32d553644080a88f04f22077691ae1820b
  src/tests/fault_tolerance_tests.cpp 10e52c401476eb8416361de49b8e4061bb7ac4f3
  src/tests/gc_tests.cpp e404de3bfacfcac5515995f1b45c3d39181e138f
  src/tests/isolator_tests.cpp cd3b360b379ef10e38a2a98a2eebe69d90fc
  src/tests/master_detector_tests.cpp 2d140ba1a364a7af4d643951d6016ac17dd10526
  src/tests/master_tests.cpp 52f09d4f1ddeabcc1a797a13fae9641b72425dd5
  src/tests/mesos.hpp 8fbd56c8dd438a08673b54630bfe25d60ad5ee0e
  src/tests/mesos.cpp 776cb0f13d10b4ae437fe9a3c97dc8b3481290af
  src/tests/resource_offers_tests.cpp 3888e461de5c8fa807cff2fd2bd7ca12c704823a
  src/tests/slave_recovery_tests.cpp 48b2e6380a9ae688291992f3bf25c3cc473bc808
  src/tests/status_update_manager_tests.cpp cf420e4764356402f05b27c3b8e8802c21a58f8e

Diff:
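The race discussed in comment 51683 can be sketched with a toy model of the Future discard semantics the reviewers are relying on. MiniFuture and onNewMasterDetected below are illustrative stand-ins, not the real libprocess API: the point is that discard() only takes effect while the future is still pending, so an authentication that became ready just before the new master was detected slips through and the pending continuation still runs.

```cpp
#include <string>

enum class State { PENDING, READY, DISCARDED };

// Stand-in for a libprocess Future<bool> (assumption, not the real type).
struct MiniFuture {
  State state = State::PENDING;

  // Marks the future ready; a no-op once discarded.
  void set() {
    if (state == State::PENDING) state = State::READY;
  }

  // Tries to cancel: only a pending future can be discarded.
  void discard() {
    if (state == State::PENDING) state = State::DISCARDED;
  }
};

// Models what the driver does when a new master is detected while an
// authentication is in flight.
std::string onNewMasterDetected(MiniFuture& authenticating) {
  // Authentication is in progress, try to cancel it.
  authenticating.discard();

  if (authenticating.state == State::READY) {
    // Too late: the pending _authenticate dispatch will still run and
    // report success against the old master -- the rare race the review
    // comment asks to document. The driver later gets a framework error.
    return "stale-success";
  }
  return "cancelled";
}
```

This is why the suggested comment says the race is tolerated rather than handled: distinguishing the two outcomes at discard time would complicate the code for a rarely-hit path.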
Review Request 14414: Set resource requirements in new ExecutorInfos from TaskInfo messages
--- This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14414/
---

Review request for mesos, Benjamin Hindman, Ben Mahler, Ian Downes, Jie Yu, David Mackey, Vinod Kone, and Jiang Yan Xu.

Repository: mesos-git

Description
-------

slave: Copy resource requirements from the first TaskInfo message to the ExecutorInfo before an executor is launched. Otherwise, a null value is passed to launchExecutor for the resources field. It's necessary for some resource subsystems to initialize executors with resource requirements upfront.

Diffs
-----

  src/slave/slave.cpp 0ad4576

Diff: https://reviews.apache.org/r/14414/diff/

Testing
-------

Can't tell for sure. With or without the patch, `make -j check` fails at the same place on a Mesos dev box.

[--] Global test environment tear-down
[==] 263 tests from 47 test cases ran. (146351 ms total)
[ PASSED ] 259 tests.
[ FAILED ] 4 tests, listed below:
[ FAILED ] CgroupsIsolatorTest.ROOT_CGROUPS_BalloonFramework
[ FAILED ] SASL.success
[ FAILED ] SASL.failed1
[ FAILED ] SASL.failed2

4 FAILED TESTS
make[3]: *** [check-local] Error 1
make[3]: Leaving directory `/home/czhang/mesos-apache/build/src'
make[2]: *** [check-am] Error 2
make[2]: Leaving directory `/home/czhang/mesos-apache/build/src'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/home/czhang/mesos-apache/build/src'
make: *** [check-recursive] Error 1
Connection to smfd-aki-27-sr1.devel.twitter.com closed.

Thanks,

Chi Zhang
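The patch's idea can be sketched as follows. The types here are simplified stand-ins for the real mesos.proto messages (Resources is modeled as a plain map), but the shape of the fix matches the description: before launching an executor, fill in its resources from the first TaskInfo so launchExecutor never sees an empty resources field.

```cpp
#include <map>
#include <string>

// Simplified resource vector, e.g. {"cpus": 1.0, "mem": 128.0}.
using Resources = std::map<std::string, double>;

struct TaskInfo     { Resources resources; };
struct ExecutorInfo { Resources resources; };

// Mirrors the slave-side fix: copy resource requirements from the first
// task only when the framework left the executor's resources unset.
void setExecutorResources(ExecutorInfo& executor, const TaskInfo& firstTask) {
  if (executor.resources.empty()) {
    executor.resources = firstTask.resources;
  }
}
```

Copying only when unset preserves any resources the framework explicitly declared on the ExecutorInfo.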
[jira] [Created] (MESOS-711) Master::reconcile incorrectly recovers resources from reconciled tasks.
Benjamin Mahler created MESOS-711:
---------------------------------
             Summary: Master::reconcile incorrectly recovers resources from reconciled tasks.
                 Key: MESOS-711
                 URL: https://issues.apache.org/jira/browse/MESOS-711
             Project: Mesos
          Issue Type: Bug
            Reporter: Benjamin Mahler
            Assignee: Benjamin Mahler
            Priority: Critical

The following sequence of events will over-subscribe a slave in the allocator:

-- The slave re-registers with the same master due to a slave restart. Tasks were running on the slave, but are lost in the process of the slave restarting.
-- As a result, the slave includes no task / executor information in its re-registration message.
-- The slave is added back to the allocator with its full resources, in Master::reregisterSlave():

  // If this is a disconnected slave, add it back to the allocator.
  if (slave->disconnected) {
    slave->disconnected = false; // Reset the flag.

    hashmap<FrameworkID, Resources> resources;
    foreach (const ExecutorInfo& executorInfo, executorInfos) {
      resources[executorInfo.framework_id()] += executorInfo.resources();
    }
    foreach (const Task& task, tasks) {
      // Ignore tasks that have reached terminal state.
      if (!protobuf::isTerminalState(task.state())) {
        resources[task.framework_id()] += task.resources();
      }
    }

    allocator->slaveAdded(slaveId, slaveInfo, resources);
  }

-- Now reconciliation occurs, and the master sends TASK_LOST messages for each lost task through Master::statusUpdate, which results in a call to Allocator::resourcesRecovered!
-- Reconciliation also calls Allocator::resourcesRecovered for the unknown executors.
-- These two bugs result in the allocator offering more resources than the slave contains.

We can either change the re-registration code, or change the reconciliation code. The easiest fix here is to add the slave back taking into account the used resources from the slave *and the master's* information.

-- This message was sent by Atlassian JIRA (v6.1#6144)
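One possible shape of the suggested fix: when re-adding a disconnected slave, merge the used resources reported by the slave's re-registration with the master's own task bookkeeping before calling slaveAdded. This is a hedged sketch, not the actual patch; the types are simplified (cpus only, per-framework doubles) and taking the per-framework maximum is just one merge heuristic, chosen because the two views may describe the same tasks and summing them would double-count.

```cpp
#include <map>
#include <string>

using FrameworkID = std::string;

// Merges two views of per-framework used resources (cpus only, for
// brevity) and returns the total the allocator should consider used.
double totalUsed(const std::map<FrameworkID, double>& fromSlave,
                 const std::map<FrameworkID, double>& fromMaster) {
  std::map<FrameworkID, double> used = fromSlave;

  for (const auto& entry : fromMaster) {
    // Per framework, take the maximum of the two views rather than the
    // sum, since they may be reporting the same tasks.
    double& u = used[entry.first];
    if (entry.second > u) {
      u = entry.second;
    }
  }

  double total = 0;
  for (const auto& entry : used) {
    total += entry.second;
  }
  return total;
}
```

With this accounting, a slave that restarted and lost its tasks is still re-added with the master's view of its used resources, so reconciliation's resourcesRecovered calls cannot push the allocator past the slave's total.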
[jira] [Created] (MESOS-712) invalid zhandle state
David Robinson created MESOS-712:
---------------------------------
             Summary: invalid zhandle state
                 Key: MESOS-712
                 URL: https://issues.apache.org/jira/browse/MESOS-712
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 0.14.0
            Reporter: David Robinson

{noformat:title=log snippet}
2013-09-29 08:58:30,445:45279(0x7f9024e3f940):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 16533ms
2013-09-29 08:58:30,445:45279(0x7f9024e3f940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [192.168.0.1:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 13199ms)
I0929 08:58:17.544836 45283 cgroups.cpp:1193] Trying to freeze cgroup /cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
2013-09-29 08:58:30,474:45279(0x7f9024e3f940):ZOO_DEBUG@handle_error@1141: Calling a watcher for a ZOO_SESSION_EVENT and the state=CONNECTING_STATE
2013-09-29 08:58:30,475:45279(0x7f9024e3f940):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 16564ms
2013-09-29 08:58:30,475:45279(0x7f901940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I0929 08:58:30.445508 45282 detector.cpp:251] Trying to create path '/home/mesos/prod/master' in ZooKeeper
2013-09-29 08:58:30,483:45279(0x7f9024e3f940):ZOO_INFO@check_events@1585: initiated connection to server [192.168.0.2:2181]
2013-09-29 08:58:30,488:45279(0x7f9031267940):ZOO_DEBUG@zoo_awexists@2587: Sending request xid=0x5244d598 for path [/home/mesos/prod/master] to 192.168.0.2:2181
2013-09-29 08:58:30,488:45279(0x7f9024e3f940):ZOO_ERROR@handle_socket_error_msg@1621: Socket [192.168.0.2:2181] zk retcode=-112, errno=116(Stale NFS file handle): sessionId=0x340523200364932 has expired.
2013-09-29 08:58:30,489:45279(0x7f9024e3f940):ZOO_DEBUG@handle_error@1138: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_EXPIRED_SESSION_STATE
2013-09-29 08:58:30,489:45279(0x7f9024e3f940):ZOO_DEBUG@do_io@317: IO thread terminated
2013-09-29 08:58:30,489:45279(0x7f901940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
2013-09-29 08:58:30,489:45279(0x7f901940):ZOO_DEBUG@process_completions@1784: Calling COMPLETION_STAT for xid=0x5244d598 rc=-112
I0929 08:58:30.475751 45283 cgroups.cpp:1232] Successfully froze cgroup /cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738 after 1 attempts
F0929 08:58:30.492090 45282 detector.cpp:266] Failed to create '/home/mesos/prod/master' in ZooKeeper: invalid zhandle state
*** Check failure stack trace: ***
I0929 08:58:30.492761 45292 cgroups.cpp:1208] Trying to thaw cgroup /cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:31.144810 45291 cgroups_isolator.cpp:937] Executor thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 201205082337-03- terminated with status 9
I0929 08:58:32.791193 45292 cgroups.cpp:1318] Successfully thawed /cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:33.675348 45298 cgroups_isolator.cpp:1275] Successfully destroyed cgroup mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:33.676269 45300 slave.cpp:2158] Executor 'thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f' of framework 201205082337-03- has terminated with signal Killed
I0929 08:58:33.678154 45300 slave.cpp:1778] Handling status update TASK_FAILED (UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for task 1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 201205082337-03- from @0.0.0.0:0
I0929 08:58:33.679175 45288 cgroups_isolator.cpp:700] Asked to update resources for an unknown/killed executor
I0929 08:58:33.679201 45300 status_update_manager.cpp:300] Received status update TASK_FAILED (UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for task 1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 201205082337-03-
I0929 08:58:33.680452 45300 status_update_manager.hpp:337] Checkpointing UPDATE for status update TASK_FAILED (UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for task 1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 201205082337-03-
@ 0x7f9035fb562d
[jira] [Updated] (MESOS-713) Support for adding subsystems to existing cgroup hierarchies.
[ https://issues.apache.org/jira/browse/MESOS-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-713:
----------------------------------
    Labels: starter_project  (was: )

Support for adding subsystems to existing cgroup hierarchies.
-------------------------------------------------------------

                 Key: MESOS-713
                 URL: https://issues.apache.org/jira/browse/MESOS-713
             Project: Mesos
          Issue Type: Improvement
          Components: isolation
            Reporter: Benjamin Mahler
            Priority: Minor
              Labels: starter_project

Currently, if a slave is restarted with additional subsystems, it will refuse to proceed if those subsystems are not attached to the existing hierarchy. It's possible to add subsystems to existing hierarchies via re-mounting:

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-Attaching_Subsystems_to_and_Detaching_Them_From_an_Existing_Hierarchy.html

We can add support for this by calling mount with the MS_REMOUNT option.
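The proposed improvement could look roughly like this. The function names, hierarchy path, and subsystem strings below are illustrative assumptions (the real slave would derive them from its cgroups flags); the mount(2) call itself requires root, so unprivileged callers should expect it to fail.

```cpp
#include <string>
#include <sys/mount.h>

// Builds the comma-separated subsystem list passed as mount(2) data,
// e.g. "cpu,cpuacct" + "memory" -> "cpu,cpuacct,memory".
std::string remountOptions(const std::string& existing,
                           const std::string& extra) {
  return existing.empty() ? extra : existing + "," + extra;
}

// Attaches an extra subsystem to an already-mounted cgroup hierarchy by
// remounting it with MS_REMOUNT. Returns 0 on success, -1 on error
// (e.g. EPERM when not run as root).
int remountWithSubsystem(const std::string& hierarchy,
                         const std::string& existing,
                         const std::string& extra) {
  const std::string opts = remountOptions(existing, extra);
  return ::mount("cgroup", hierarchy.c_str(), "cgroup",
                 MS_REMOUNT, opts.c_str());
}
```

The key design point is that the data string must list the already-attached subsystems plus the new one; remounting with only the new subsystem would request detaching the others, which the kernel rejects while tasks are attached.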
[jira] [Updated] (MESOS-711) Master::reconcile incorrectly recovers resources from reconciled tasks.
[ https://issues.apache.org/jira/browse/MESOS-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-711:
----------------------------------
    Affects Version/s: 0.14.0

Master::reconcile incorrectly recovers resources from reconciled tasks.
-----------------------------------------------------------------------

                 Key: MESOS-711
                 URL: https://issues.apache.org/jira/browse/MESOS-711
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 0.14.0
            Reporter: Benjamin Mahler
            Assignee: Benjamin Mahler
            Priority: Critical

The following sequence of events will over-subscribe a slave in the allocator:

-- The slave re-registers with the same master due to a slave restart. Tasks were running on the slave, but are lost in the process of the slave restarting.
-- As a result, the slave includes no task / executor information in its re-registration message.
-- The slave is added back to the allocator with its full resources, in Master::reregisterSlave():

  // If this is a disconnected slave, add it back to the allocator.
  if (slave->disconnected) {
    slave->disconnected = false; // Reset the flag.

    hashmap<FrameworkID, Resources> resources;
    foreach (const ExecutorInfo& executorInfo, executorInfos) {
      resources[executorInfo.framework_id()] += executorInfo.resources();
    }
    foreach (const Task& task, tasks) {
      // Ignore tasks that have reached terminal state.
      if (!protobuf::isTerminalState(task.state())) {
        resources[task.framework_id()] += task.resources();
      }
    }

    allocator->slaveAdded(slaveId, slaveInfo, resources);
  }

-- Now reconciliation occurs, and the master sends TASK_LOST messages for each lost task through Master::statusUpdate, which results in a call to Allocator::resourcesRecovered!
-- Reconciliation also calls Allocator::resourcesRecovered for the unknown executors.
-- These two bugs result in the allocator offering more resources than the slave contains.

We can either change the re-registration code, or change the reconciliation code. The easiest fix here is to add the slave back taking into account the used resources from the slave *and the master's* information.
[jira] [Created] (MESOS-714) Slave should check if the (re-)registered message is from the expected master
Vinod Kone created MESOS-714:
---------------------------------
             Summary: Slave should check if the (re-)registered message is from the expected master
                 Key: MESOS-714
                 URL: https://issues.apache.org/jira/browse/MESOS-714
             Project: Mesos
          Issue Type: Bug
            Reporter: Vinod Kone
            Assignee: Vinod Kone
             Fix For: 0.15.0

The following sequence of events happened in production at Twitter.

-- Slave registered with master A.
-- A sent an ACK for the registration but died immediately (user restart).
-- Slave detected a new master B and sent a re-register request.
-- Slave received the ACK from A now. The bug here is that the slave accepted this ACK even though it was not from master B.
-- Master B ignored the re-register request because it didn't know it was the master yet!
-- Slave never retried its registration because it thinks it's registered with B.

At this point the slave thinks it is registered, but the master (B) has no idea of it!

Fix: Slaves should check that (re-)registered messages are from the expected master pid.
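The proposed fix can be sketched as follows. This is a minimal model, not the actual patch: the UPID is represented as a plain string and the Slave struct is a stand-in for the real slave process. The idea is simply that the slave records which master it sent its (re-)registration to and drops any ACK from a different sender, so a stale ACK from a dead master A cannot stop the retry loop against the new master B.

```cpp
#include <string>

struct Slave {
  std::string expectedMaster;  // pid of the master we (re-)registered with
  bool registered = false;

  // Handles a (re-)registered ACK. Returns true only when the ACK comes
  // from the expected master; stale ACKs (e.g. from dead master A after
  // detecting B) are ignored so registration keeps being retried.
  bool onRegistered(const std::string& from) {
    if (from != expectedMaster) {
      return false;  // Ignore: not from the master we expect.
    }
    registered = true;
    return true;
  }
};
```

In the scenario above, the slave would have ignored A's late ACK, kept retrying against B, and eventually registered once B learned it was the master.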