Re: Review Request 14292: Added authentication support to scheduler and master.

2013-09-30 Thread Ben Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14292/#review26502
---



src/master/master.hpp
https://reviews.apache.org/r/14292/#comment51677

const ?



src/master/master.cpp
https://reviews.apache.org/r/14292/#comment51679

The dispatch in authenticate would need to pass the pid here, since 'from' 
could have changed!
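
A hypothetical sketch of the idea (names assumed, not the actual patch): 
binding the pid at dispatch time means _authenticate operates on the pid 
that initiated authentication, not on whatever 'from' happens to hold when 
the continuation runs.

    // Capture the authenticating pid explicitly in the dispatch.
    dispatch(self(), &Master::_authenticate, pid);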



src/master/master.cpp
https://reviews.apache.org/r/14292/#comment51678

s/Cancel it/Try to cancel it/

Can you add a comment about what happens when the authenticating future is 
ready? (E.g. if the Future was ready at this point, the pending _authenticate 
could result in a successful authentication for a previous Authenticatee, and 
a new Authenticatee would be created on the client side to retry?)



src/master/master.cpp
https://reviews.apache.org/r/14292/#comment51680

Should we be retrying in the Master? Seems like the retry logic should rest 
solely in the client (if the client is down, it looks like this will retry 
forever).

Be sure to update the retry comments if you think we should remove this 
bit.
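
A hypothetical sketch of the client-only alternative (method name assumed): 
the scheduler driver re-triggers authentication itself after a timeout, so 
the master never keeps retrying against a client that has gone away.

    // Retry authentication from the client side after a fixed backoff.
    delay(Seconds(5), self(), &SchedulerProcess::authenticate);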



src/sched/sched.cpp
https://reviews.apache.org/r/14292/#comment51683

Can you add a comment describing what we discussed for when the future was 
already ready at this point (say, if two newMasterDetected calls occurred in 
quick succession)?

// Authentication is in progress, try to cancel it.
authenticating.get().discard();

// If authenticating was already ready, this means there is a pending
// dispatch to _authenticate. This call will consider authentication
// successful even though we may not be authenticated (say, if two masters
// were elected in quick succession). In this case, the driver will proceed
// with registration but will receive a framework error and will exit as a
// result. This is a sufficiently rare race condition so it's not worth
// complicating the code here to handle it.

return;



src/sched/sched.cpp
https://reviews.apache.org/r/14292/#comment51684

Do we want to add a shared initialize(...) function? It looks like the only 
difference between the two constructors is the credential option passed to the 
SchedulerProcess?
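
A hypothetical sketch of the refactor (signature assumed): both constructors 
delegate to a single initialize() that takes the credential as an Option, so 
the only difference between them is the option's value.

    void MesosSchedulerDriver::initialize(
        Scheduler* scheduler,
        const FrameworkInfo& framework,
        const string& master,
        const Option<Credential>& credential)
    {
      // ... setup common to both constructors, then create the process
      // with or without a credential.
      process = new SchedulerProcess(
          this, scheduler, framework, master, credential);
    }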


- Ben Mahler


On Sept. 30, 2013, 2:46 a.m., Vinod Kone wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/14292/
 ---
 
 (Updated Sept. 30, 2013, 2:46 a.m.)
 
 
 Review request for mesos, Benjamin Hindman and Ben Mahler.
 
 
 Bugs: MESOS-704
 https://issues.apache.org/jira/browse/MESOS-704
 
 
 Repository: mesos-git
 
 
 Description
 ---
 
 Added authentication support for scheduler driver and master.
 
 
 Diffs
 -
 
   include/mesos/mesos.proto 957576bbc1c73513a9591194d017f76fe562a616 
   include/mesos/scheduler.hpp cf3ecdaaf40fd878a80fe0b6f7e61a0997329cbd 
   src/Makefile.am ee336130ad93d8b524c841f75be36f00d4a2b147 
   src/common/type_utils.hpp 674a8820c339c6446dfa7d57477ab4512e79 
   src/java/jni/construct.cpp b01bd7ae2eda2dc5e0dcd68848c65bd9f9ea81f0 
   src/java/jni/org_apache_mesos_MesosSchedulerDriver.cpp 
 6d2a03b6a88e71ac4e2e2d1ee8e15925e393ef3d 
   src/java/src/org/apache/mesos/MesosSchedulerDriver.java 
 7ef1fe7755286bf92b94d7ece4f72d54e5b57a84 
   src/master/flags.hpp d59e67d5b2799d6d7a37e9cfe7246ae7372091ac 
   src/master/master.hpp bd5cb1ff05967518d7dc7f3a9dc0d32c22eb9643 
   src/master/master.cpp a49b17ef43fca5b385a89731ca8776a26b61399a 
   src/python/native/mesos_scheduler_driver_impl.cpp 
 f25d41d38caf2701813dbec0d342a3b327e9dedf 
   src/sasl/authenticator.hpp 2f78cf0fdd97f0ddc3a6ebd162e6559497d708e4 
   src/sched/sched.cpp c399f2481259683a8e178abb3478307042292f23 
   src/tests/allocator_tests.cpp c57da6eb3c431b47468b6a6941c3de06af9209e5 
   src/tests/allocator_zookeeper_tests.cpp 
 6e3214c15c14dc8ba82082738c172c8833cd0887 
   src/tests/authentication_tests.cpp PRE-CREATION 
   src/tests/exception_tests.cpp 3fc1ac32d553644080a88f04f22077691ae1820b 
   src/tests/fault_tolerance_tests.cpp 
 10e52c401476eb8416361de49b8e4061bb7ac4f3 
   src/tests/gc_tests.cpp e404de3bfacfcac5515995f1b45c3d39181e138f 
   src/tests/isolator_tests.cpp cd3b360b379ef10e38a2a98a2eebe69d90fc 
   src/tests/master_detector_tests.cpp 
 2d140ba1a364a7af4d643951d6016ac17dd10526 
   src/tests/master_tests.cpp 52f09d4f1ddeabcc1a797a13fae9641b72425dd5 
   src/tests/mesos.hpp 8fbd56c8dd438a08673b54630bfe25d60ad5ee0e 
   src/tests/mesos.cpp 776cb0f13d10b4ae437fe9a3c97dc8b3481290af 
   src/tests/resource_offers_tests.cpp 
 3888e461de5c8fa807cff2fd2bd7ca12c704823a 
   src/tests/slave_recovery_tests.cpp 48b2e6380a9ae688291992f3bf25c3cc473bc808 
   src/tests/status_update_manager_tests.cpp 
 cf420e4764356402f05b27c3b8e8802c21a58f8e 
 
 Diff: https://reviews.apache.org/r/14292/diff/

Review Request 14414: Set resource requirements in new ExecutorInfos from TaskInfo messages

2013-09-30 Thread Chi Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14414/
---

Review request for mesos, Benjamin Hindman, Ben Mahler, Ian Downes, Jie Yu, 
David Mackey, Vinod Kone, and Jiang Yan Xu.


Repository: mesos-git


Description
---

slave: Copy resource requirements from the first TaskInfo message to the 
ExecutorInfo before an executor is launched.
 
Otherwise, a null value is passed to launchExecutor for the resources 
field. Some resource subsystems need to initialize executors with their 
resource requirements upfront.
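
A minimal sketch of the described change (not the exact patch; accessors 
assumed from the generated protobuf API): if the ExecutorInfo carries no 
resources, copy the resource requirements from the first TaskInfo before 
the executor is launched.

    // Propagate task resources to the executor when none were specified.
    if (executorInfo.resources().size() == 0) {
      executorInfo.mutable_resources()->CopyFrom(task.resources());
    }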


Diffs
-

  src/slave/slave.cpp 0ad4576 

Diff: https://reviews.apache.org/r/14414/diff/


Testing
---

Can't tell for sure. With or without the patch, `make -j check` fails at the 
same place on a Mesos dev box.

[--] Global test environment tear-down 
[==] 263 tests from 47 test cases ran. (146351 ms total)   
[  PASSED  ] 259 tests.   
[  FAILED  ] 4 tests, listed below:   
[  FAILED  ] CgroupsIsolatorTest.ROOT_CGROUPS_BalloonFramework
[  FAILED  ] SASL.success 
[  FAILED  ] SASL.failed1  
[  FAILED  ] SASL.failed2  
   
 4 FAILED TESTS  
make[3]: *** [check-local] Error 1   
make[3]: Leaving directory `/home/czhang/mesos-apache/build/src' 
make[2]: *** [check-am] Error 2   
make[2]: Leaving directory `/home/czhang/mesos-apache/build/src' 
make[1]: *** [check] Error 2 
make[1]: Leaving directory `/home/czhang/mesos-apache/build/src' 
make: *** [check-recursive] Error 1  
Connection to smfd-aki-27-sr1.devel.twitter.com closed.


Thanks,

Chi Zhang



[jira] [Created] (MESOS-711) Master::reconcile incorrectly recovers resources from reconciled tasks.

2013-09-30 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-711:
-

 Summary: Master::reconcile incorrectly recovers resources from 
reconciled tasks.
 Key: MESOS-711
 URL: https://issues.apache.org/jira/browse/MESOS-711
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
Priority: Critical


The following sequence of events will over-subscribe a slave in the allocator:

-- Slave re-registers with the same master due to a slave restart. Tasks were 
running on the slave, but are lost in the process of the slave restarting.

-- As a result, the slave includes no task / executor information in its 
re-registration message.

-- The slave is added back to the allocator with its full resources, in 
Master::reregisterSlave():

  // If this is a disconnected slave, add it back to the allocator.
  if (slave->disconnected) {
    slave->disconnected = false; // Reset the flag.

    hashmap<FrameworkID, Resources> resources;
    foreach (const ExecutorInfo& executorInfo, executorInfos) {
      resources[executorInfo.framework_id()] += executorInfo.resources();
    }
    foreach (const Task& task, tasks) {
      // Ignore tasks that have reached terminal state.
      if (!protobuf::isTerminalState(task.state())) {
        resources[task.framework_id()] += task.resources();
      }
    }
    allocator->slaveAdded(slaveId, slaveInfo, resources);
  }

-- Now reconciliation occurs, and the master sends TASK_LOST messages for each 
of these tasks through Master::statusUpdate, which results in a call to 
Allocator::resourcesRecovered!

-- Reconciliation also calls Allocator::resourcesRecovered for the unknown 
executors.

-- These two bugs result in the allocator offering more resources than the 
slave contains.

We can either change the re-registration code, or change the reconciliation 
code. The easiest fix here is to add the slave back taking into account the 
used resources from both the slave's *and the master's* information.
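
A hypothetical sketch of the suggested fix (names follow the snippet above; 
'slave->tasks' as the master's record is an assumption): when re-adding a 
disconnected slave, account for used resources from both the re-registration 
message and the tasks the master still knows about, so that the subsequent 
TASK_LOST reconciliation does not recover resources that were never added.

    hashmap<FrameworkID, Resources> used;
    hashset<TaskID> reported;
    foreach (const Task& task, tasks) {        // From the slave's message.
      reported.insert(task.task_id());
      if (!protobuf::isTerminalState(task.state())) {
        used[task.framework_id()] += task.resources();
      }
    }
    foreachvalue (Task* task, slave->tasks) {  // From the master's state.
      if (!reported.contains(task->task_id()) &&
          !protobuf::isTerminalState(task->state())) {
        used[task->framework_id()] += task->resources();
      }
    }
    allocator->slaveAdded(slaveId, slaveInfo, used);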



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (MESOS-712) invalid zhandle state

2013-09-30 Thread David Robinson (JIRA)
David Robinson created MESOS-712:


 Summary: invalid zhandle state
 Key: MESOS-712
 URL: https://issues.apache.org/jira/browse/MESOS-712
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: David Robinson


{noformat:title=log snippet}
2013-09-29 08:58:30,445:45279(0x7f9024e3f940):ZOO_WARN@zookeeper_interest@1461: 
Exceeded deadline by 16533ms
2013-09-29 
08:58:30,445:45279(0x7f9024e3f940):ZOO_ERROR@handle_socket_error_msg@1528: 
Socket [192.168.0.1:2181] zk retcode=-7, errno=110(Connection timed out): 
connection timed out (exceeded timeout by 13199ms)
I0929 08:58:17.544836 45283 cgroups.cpp:1193] Trying to freeze cgroup 
/cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
2013-09-29 08:58:30,474:45279(0x7f9024e3f940):ZOO_DEBUG@handle_error@1141: 
Calling a watcher for a ZOO_SESSION_EVENT and the state=CONNECTING_STATE
2013-09-29 08:58:30,475:45279(0x7f9024e3f940):ZOO_WARN@zookeeper_interest@1461: 
Exceeded deadline by 16564ms
2013-09-29 
08:58:30,475:45279(0x7f901940):ZOO_DEBUG@process_completions@1765: Calling 
a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I0929 08:58:30.445508 45282 detector.cpp:251] Trying to create path 
'/home/mesos/prod/master' in ZooKeeper
2013-09-29 08:58:30,483:45279(0x7f9024e3f940):ZOO_INFO@check_events@1585: 
initiated connection to server [192.168.0.2:2181]
2013-09-29 08:58:30,488:45279(0x7f9031267940):ZOO_DEBUG@zoo_awexists@2587: 
Sending request xid=0x5244d598 for path [/home/mesos/prod/master] to 
192.168.0.2:2181
2013-09-29 
08:58:30,488:45279(0x7f9024e3f940):ZOO_ERROR@handle_socket_error_msg@1621: 
Socket [192.168.0.2:2181] zk retcode=-112, errno=116(Stale NFS file handle): 
sessionId=0x340523200364932 has expired.
2013-09-29 08:58:30,489:45279(0x7f9024e3f940):ZOO_DEBUG@handle_error@1138: 
Calling a watcher for a ZOO_SESSION_EVENT and the 
state=ZOO_EXPIRED_SESSION_STATE
2013-09-29 08:58:30,489:45279(0x7f9024e3f940):ZOO_DEBUG@do_io@317: IO thread 
terminated
2013-09-29 
08:58:30,489:45279(0x7f901940):ZOO_DEBUG@process_completions@1765: Calling 
a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
2013-09-29 
08:58:30,489:45279(0x7f901940):ZOO_DEBUG@process_completions@1784: Calling 
COMPLETION_STAT for xid=0x5244d598 rc=-112
I0929 08:58:30.475751 45283 cgroups.cpp:1232] Successfully froze cgroup 
/cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
 after 1 attempts
F0929 08:58:30.492090 45282 detector.cpp:266] Failed to create 
'/home/mesos/prod/master' in ZooKeeper: invalid zhandle state
*** Check failure stack trace: ***
I0929 08:58:30.492761 45292 cgroups.cpp:1208] Trying to thaw cgroup 
/cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:31.144810 45291 cgroups_isolator.cpp:937] Executor 
thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of 
framework 201205082337-03- terminated with status 9
I0929 08:58:32.791193 45292 cgroups.cpp:1318] Successfully thawed 
/cgroup/mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:33.675348 45298 cgroups_isolator.cpp:1275] Successfully destroyed 
cgroup 
mesos/framework_201205082337-03-_executor_thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f_tag_8edc5ce9-20bc-4b09-bc92-d9bab7769738
I0929 08:58:33.676269 45300 slave.cpp:2158] Executor 
'thermos-1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f' of 
framework 201205082337-03- has terminated with signal Killed
I0929 08:58:33.678154 45300 slave.cpp:1778] Handling status update TASK_FAILED 
(UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for task 
1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 
201205082337-03- from @0.0.0.0:0
I0929 08:58:33.679175 45288 cgroups_isolator.cpp:700] Asked to update resources 
for an unknown/killed executor
I0929 08:58:33.679201 45300 status_update_manager.cpp:300] Received status 
update TASK_FAILED (UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for task 
1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of framework 
201205082337-03- 
I0929 08:58:33.680452 45300 status_update_manager.hpp:337] Checkpointing UPDATE 
for status update TASK_FAILED (UUID: 4d90de5a-cdad-4bb8-ab93-7c4f185a0d24) for 
task 1380442146400-test_master-0-f947deee-f813-47fa-8bd3-d0f06ece941f of 
framework 201205082337-03- 
@ 0x7f9035fb562d  

[jira] [Updated] (MESOS-713) Support for adding subsystems to existing cgroup hierarchies.

2013-09-30 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-713:
--

Labels: starter_project  (was: )

 Support for adding subsystems to existing cgroup hierarchies.
 -

 Key: MESOS-713
 URL: https://issues.apache.org/jira/browse/MESOS-713
 Project: Mesos
  Issue Type: Improvement
  Components: isolation
Reporter: Benjamin Mahler
Priority: Minor
  Labels: starter_project

 Currently if a slave is restarted with additional subsystems, it will refuse 
 to proceed if those subsystems are not attached to the existing hierarchy.
 It's possible to add subsystems to existing hierarchies via re-mounting:
 https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-Attaching_Subsystems_to_and_Detaching_Them_From_an_Existing_Hierarchy.html
 We can add support for this by calling mount with the MS_REMOUNT option.
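
 A hypothetical sketch of the remount (paths and subsystem names are 
 illustrative): re-mounting the hierarchy with an updated subsystem list asks 
 the kernel to attach the new subsystem in place.

     #include <sys/mount.h>
     #include <cstdio>

     // Attach 'memory' to a hierarchy at /cgroup that currently has only
     // 'cpu' attached; the desired subsystem list goes in the data argument.
     if (mount("cgroup", "/cgroup", "cgroup", MS_REMOUNT, "cpu,memory") != 0) {
       perror("mount");
     }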



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MESOS-711) Master::reconcile incorrectly recovers resources from reconciled tasks.

2013-09-30 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-711:
--

Affects Version/s: 0.14.0

 Master::reconcile incorrectly recovers resources from reconciled tasks.
 ---

 Key: MESOS-711
 URL: https://issues.apache.org/jira/browse/MESOS-711
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
Priority: Critical

 The following sequence of events will over-subscribe a slave in the allocator:
 -- Slave re-registers with the same master due to a slave restart. Tasks 
 were running on the slave, but are lost in the process of the slave 
 restarting.
 -- As a result, the slave includes no task / executor information in its 
 re-registration message.
 -- The slave is added back to the allocator with its full resources, in 
 Master::reregisterSlave():
   // If this is a disconnected slave, add it back to the allocator.
   if (slave->disconnected) {
     slave->disconnected = false; // Reset the flag.
     hashmap<FrameworkID, Resources> resources;
     foreach (const ExecutorInfo& executorInfo, executorInfos) {
       resources[executorInfo.framework_id()] += executorInfo.resources();
     }
     foreach (const Task& task, tasks) {
       // Ignore tasks that have reached terminal state.
       if (!protobuf::isTerminalState(task.state())) {
         resources[task.framework_id()] += task.resources();
       }
     }
     allocator->slaveAdded(slaveId, slaveInfo, resources);
   }
 -- Now reconciliation occurs, and the master sends TASK_LOST messages for 
 each of these tasks through Master::statusUpdate, which results in a call to 
 Allocator::resourcesRecovered!
 -- Reconciliation also calls Allocator::resourcesRecovered for the unknown 
 executors.
 -- These two bugs result in the allocator offering more resources than the 
 slave contains.
 We can either change the re-registration code, or change the reconciliation 
 code. The easiest fix here is to add the slave back taking into account the 
 used resources from both the slave's *and the master's* information.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (MESOS-714) Slave should check if the (re-)registered message is from the expected master

2013-09-30 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-714:


 Summary: Slave should check if the (re-)registered message is from the 
expected master
 Key: MESOS-714
 URL: https://issues.apache.org/jira/browse/MESOS-714
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone
Assignee: Vinod Kone
 Fix For: 0.15.0


The following sequence of events happened in production at Twitter.

-- Slave registered with master A
-- A sent an ACK for registration but died immediately (user restart)
-- Slave detected a new master B and sent a re-register request
-- Slave received the ACK from A now.
-- The bug here is that the slave accepted this ACK even though it was not 
from master B.
-- Master B ignored the re-register request because it didn't know it was the 
master yet!
-- Slave never re-tried its registration because it thinks it's registered 
with B.

At this point the slave thinks it is registered, but the master (B) has no 
idea of it!

Fix: Slaves should check that (re-)registered messages are from the expected 
master pid.
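
A hypothetical sketch of the check (names assumed; the slave tracks the pid 
of the currently detected master): ignore (re-)registered messages whose 
sender does not match the expected master.

    void Slave::registered(const UPID& from, const SlaveID& slaveId)
    {
      if (from != master) {
        LOG(WARNING) << "Ignoring registration message from " << from
                     << " because it is not the expected master " << master;
        return;
      }
      // ... proceed with the normal registration handling.
    }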



--
This message was sent by Atlassian JIRA
(v6.1#6144)