Review Request 33257: Fixed recovering tasks only by the initiated containerizer.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33257/ ---

Review request for mesos, Benjamin Hindman, Bernd Mathiske, Ian Downes, Jie Yu, and Till Toenshoff.

Bugs: MESOS-2601
    https://issues.apache.org/jira/browse/MESOS-2601

Repository: mesos

Description
---

Fixed recovering tasks only by the initiated containerizer.

Currently both the Mesos and Docker containerizers recover tasks that weren't started by themselves. The proposed fix is to record the intended containerizer in the checkpointed ExecutorInfo, and reuse that information on recovery so each containerizer knows whether it should recover a task or not. We are free to modify the ExecutorInfo since it's not being used to relaunch any task.

Diffs
---

  src/slave/containerizer/docker.cpp f9fb07806e3b7d7d2afc1be3b8756eac23b32dcd
  src/slave/containerizer/mesos/containerizer.cpp e4136095fca55637864f495098189ab3ad8d8fe7
  src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667
  src/tests/containerizer_tests.cpp 5991aa628083dac7c5e8bf7ba297f4f9edeec05f
  src/tests/docker_containerizer_tests.cpp c772d4c836de18b0e87636cb42200356d24ec73d

Diff: https://reviews.apache.org/r/33257/diff/

Testing
---

make check

Thanks,

Timothy Chen
Re: Review Request 33257: Fixed recovering tasks only by the initiated containerizer.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33257/ ---

(Updated April 16, 2015, 7:08 a.m.)

Review request for mesos, Benjamin Hindman, Bernd Mathiske, Ian Downes, Jie Yu, and Till Toenshoff.

Bugs: MESOS-2601
    https://issues.apache.org/jira/browse/MESOS-2601

Repository: mesos

Description
---

Fixed recovering tasks only by the initiated containerizer.

Currently both the Mesos and Docker containerizers recover tasks that weren't started by themselves. The proposed fix is to record the intended containerizer in the checkpointed ExecutorInfo, and reuse that information on recovery so each containerizer knows whether it should recover a task or not. We are free to modify the ExecutorInfo since it's not being used to relaunch any task.

Diffs (updated)
---

  src/slave/containerizer/docker.cpp f9fb07806e3b7d7d2afc1be3b8756eac23b32dcd
  src/slave/containerizer/mesos/containerizer.cpp e4136095fca55637864f495098189ab3ad8d8fe7
  src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667
  src/tests/containerizer_tests.cpp 5991aa628083dac7c5e8bf7ba297f4f9edeec05f
  src/tests/docker_containerizer_tests.cpp c772d4c836de18b0e87636cb42200356d24ec73d

Diff: https://reviews.apache.org/r/33257/diff/

Testing
---

make check

Thanks,

Timothy Chen
Re: Review Request 33257: Fixed recovering tasks only by the initiated containerizer.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33257/ ---

(Updated April 16, 2015, 7:10 a.m.)

Review request for mesos, Benjamin Hindman, Bernd Mathiske, Ian Downes, Jie Yu, and Till Toenshoff.

Bugs: MESOS-2601
    https://issues.apache.org/jira/browse/MESOS-2601

Repository: mesos

Description (updated)
---

Fixed recovering tasks only by the initiated containerizer.

Currently both the Mesos and Docker containerizers recover tasks that weren't started by themselves. The proposed fix is to record the intended containerizer in the checkpointed ExecutorInfo, and reuse that information on recovery so each containerizer knows whether it should recover a task or not. We are free to modify the ExecutorInfo since it's not being used to relaunch any task.

The external containerizer doesn't need to change since it only recovers containers that are returned by the containers script.

Diffs
---

  src/slave/containerizer/docker.cpp f9fb07806e3b7d7d2afc1be3b8756eac23b32dcd
  src/slave/containerizer/mesos/containerizer.cpp e4136095fca55637864f495098189ab3ad8d8fe7
  src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667
  src/tests/containerizer_tests.cpp 5991aa628083dac7c5e8bf7ba297f4f9edeec05f
  src/tests/docker_containerizer_tests.cpp c772d4c836de18b0e87636cb42200356d24ec73d

Diff: https://reviews.apache.org/r/33257/diff/

Testing
---

make check

Thanks,

Timothy Chen
Re: Review Request 33257: Fixed recovering tasks only by the initiated containerizer.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33257/#review80308 ---

src/slave/containerizer/docker.cpp
https://reviews.apache.org/r/33257/#comment130134

    Cosmetic comment: `executor.id` is enclosed in single quotes but `framework.id` is not. This is in keeping with most of this file, so it *is* somewhat consistent. But the `launch` member function does enclose `executorInfo.framework_id()` in single quotes, so there is precedent for doing so within this file.

- Erik Weathers
Review Request 33263: Extended SlaveTest.ShutdownUnregisteredExecutor test with a reason check.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33263/ ---

Review request for mesos, Alexander Rukletsov, Bernd Mathiske, and Till Toenshoff.

Bugs: MESOS-2625
    https://issues.apache.org/jira/browse/MESOS-2625

Repository: mesos

Description
---

Extended SlaveTest.ShutdownUnregisteredExecutor test with a reason check. Check that the reason is REASON_COMMAND_EXECUTOR_FAILED. According to the Slave::sendExecutorTerminatedStatusUpdate member function, this reason is expected instead of the more general REASON_EXECUTOR_TERMINATED because the command executor is used in this test.

Diffs
---

  src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188

Diff: https://reviews.apache.org/r/33263/diff/

Testing
---

make check

Thanks,

Andrey Dyatlov
Re: Review Request 32058: Added protobuf-JSON validation
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32058/#review80333 ---

3rdparty/libprocess/3rdparty/stout/include/stout/protobuf.hpp
https://reviews.apache.org/r/32058/#comment130182

    Just as Vinod commented on the JIRA, we might want to do that differently - e.g. by adding a factory.

- Till Toenshoff

On March 16, 2015, 8:02 p.m., Akanksha Agrawal wrote:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32058/ ---

(Updated March 16, 2015, 8:02 p.m.)

Review request for mesos and Till Toenshoff.

Bugs: MESOS-1194
    https://issues.apache.org/jira/browse/MESOS-1194

Repository: mesos

Description
---

Added protobuf-JSON validation

Diffs
---

  3rdparty/libprocess/3rdparty/stout/include/stout/protobuf.hpp 2c020d98b94995bce862ac3b9c173dd64ed1198d

Diff: https://reviews.apache.org/r/32058/diff/

Testing
---

Thanks,

Akanksha Agrawal
Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33249/ ---

(Updated April 16, 2015, 3:16 p.m.)

Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.

Bugs: MESOS-2020
    https://issues.apache.org/jira/browse/MESOS-2020

Repository: mesos

Description
---

When Mesos is unable to launch the containerizer, the scheduler should get a TASK_FAILED with a status message that includes the error the containerizer encountered when trying to launch.

Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED

Fixes MESOS-2020

Diffs
---

  include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537
  src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667
  src/tests/containerizer.cpp 26b87ac6b16dfeaf84888e80296ef540697bd775
  src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188

Diff: https://reviews.apache.org/r/33249/diff/

Testing
---

I added a test case to slave_tests.cpp. I also tried this with Aurora: I supplied a bogus Docker image URL and saw the docker pull failure stderr message in Aurora's web UI.

Thanks,

Jay Buffington
Re: Review Request 33263: Extended SlaveTest.ShutdownUnregisteredExecutor test with a reason check.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33263/#review80335 ---

Patch looks great!

Reviews applied: [33263]

All tests passed.

- Mesos ReviewBot
Re: Review Request 33154: Added reason metrics for slave removals.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33154/#review80339 ---

src/master/master.cpp
https://reviews.apache.org/r/33154/#comment130185

    Is this optional because there are other removals that we don't have metrics for - the ones in exited() and reregisteredSlave()? Can we have metrics for them too while we are at it?

src/master/master.cpp
https://reviews.apache.org/r/33154/#comment130184

    Would be awesome for debugging if the log line also printed the reason.

- Vinod Kone

On April 14, 2015, 1:46 a.m., Ben Mahler wrote:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33154/ ---

(Updated April 14, 2015, 1:46 a.m.)

Review request for mesos and Vinod Kone.

Bugs: MESOS-2485
    https://issues.apache.org/jira/browse/MESOS-2485

Repository: mesos

Description
---

See [MESOS-2485](https://issues.apache.org/jira/browse/MESOS-2485).

Diffs
---

  include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537
  src/master/master.hpp 6141917644b84edfed9836fa0a005d55a36880e3
  src/master/master.cpp 44b0a0147f5354824d86332a67b30018634c9a36
  src/master/metrics.hpp 52a83289cfe7e6b6fd8d5bff0774ebe5ce51d0ed
  src/master/metrics.cpp 14486bf7130250f561c9fb7a43c95f3fc1e76a4b

Diff: https://reviews.apache.org/r/33154/diff/

Testing
---

make check

Thanks,

Ben Mahler
Re: Review Request 33153: Moved a partition test into partition_tests.cpp.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33153/#review80338 ---

Ship it!

- Vinod Kone

On April 14, 2015, 1:46 a.m., Ben Mahler wrote:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33153/ ---

(Updated April 14, 2015, 1:46 a.m.)

Review request for mesos and Vinod Kone.

Repository: mesos

Description
---

See summary.

Diffs
---

  src/tests/fault_tolerance_tests.cpp a637c32f004638a110390b22cf5b626e904097cf
  src/tests/partition_tests.cpp 1018e479a77a6b533f2dd392fedbdccb80e3ac1f

Diff: https://reviews.apache.org/r/33153/diff/

Testing
---

make check

Thanks,

Ben Mahler
Re: Review Request 33152: Moved the slave shutdown test into slave_tests.cpp.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33152/#review80337 ---

Ship it!

- Vinod Kone

On April 14, 2015, 1:46 a.m., Ben Mahler wrote:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33152/ ---

(Updated April 14, 2015, 1:46 a.m.)

Review request for mesos and Vinod Kone.

Repository: mesos

Description
---

See summary.

Diffs
---

  src/tests/fault_tolerance_tests.cpp a637c32f004638a110390b22cf5b626e904097cf
  src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188

Diff: https://reviews.apache.org/r/33152/diff/

Testing
---

make check

Thanks,

Ben Mahler
Re: Review Request 33257: Fixed recovering tasks only by the initiated containerizer.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33257/#review80321 ---

src/slave/slave.cpp
https://reviews.apache.org/r/33257/#comment130173

    Kill this line.

src/slave/slave.cpp
https://reviews.apache.org/r/33257/#comment130165

    Remove this line.

src/tests/containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130167

    Could you add a comment and describe what this test does?

src/tests/containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130176

    Move this down to where you need it.

src/tests/containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130175

    Move this down to where you actually need it.

src/tests/containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130168

    I am not sure this comment adds any value - we may just as well remove it, no?

src/tests/containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130174

    Kill these.

src/tests/docker_containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130166

    Could you add a comment and describe what this test does?

src/tests/docker_containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130178

    Move this down to where you need it.

src/tests/docker_containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130177

    Move this down to where you need it.

src/tests/docker_containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130179

    Move this down to where you need it.

src/tests/docker_containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130180

    Move this down to where you need it.

- Till Toenshoff
Re: Review Request 33263: Extended SlaveTest.ShutdownUnregisteredExecutor test with a reason check.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33263/#review80332 ---

Ship it!

- Bernd Mathiske
Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33249/#review80347 ---

src/slave/slave.cpp
https://reviews.apache.org/r/33249/#comment130197

    We're already sending back a status update when the registration times out, so if we send another one here the scheduler will actually get two TASK_FAILED statuses. I think we should either populate the reason in the final status update sent on timeout (noting that the containerizer launch failed), or make sure we send just one here. The nice thing about handling it in the timeout is that there are fewer places in the slave where we send status updates, but with the caveat that you wait for the timeout to occur, which is something I never really liked. If we can keep the code clean and make sure just one status update is propagated back, I'd rather see it happen here.

- Timothy Chen
Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33249/#review80350 ---

src/tests/containerizer.cpp
https://reviews.apache.org/r/33249/#comment130199

    How about setting a flag in the TestContainerizer constructor that tells it to send a failure on launch? Checking a value like this is too implicit, and someone else might not know the magic meaning on the other side (I don't see any comment in the test to mark that this value has a special meaning).

- Timothy Chen
Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33249/#review80352 ---

src/slave/slave.cpp
https://reviews.apache.org/r/33249/#comment130205

    Btw, the framework and executor might not be there anymore if they were terminated and cleaned up by the time the launch callback runs; see the checks we do below.

- Timothy Chen
Review Request 33266: Check patch command in configure.ac
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33266/ ---

Review request for mesos and Till Toenshoff.

Bugs: MESOS-2624
    https://issues.apache.org/jira/browse/MESOS-2624

Repository: mesos

Description
---

Check patch command in configure.ac

Diffs
---

  configure.ac 7f9e52916b9d78f2bbff9d6ed9871444a0fda629

Diff: https://reviews.apache.org/r/33266/diff/

Testing
---

Thanks,

haosdent huang
Re: Review Request 33154: Added reason metrics for slave removals.
On April 16, 2015, 4 p.m., Vinod Kone wrote:

    src/master/master.cpp, line 4465
    https://reviews.apache.org/r/33154/diff/1/?file=926687#file926687line4465

    Is this optional because there are other removals that we don't have metrics for - the ones in exited() and reregisteredSlave()? Can we have metrics for them too while we are at it?

I can add one for when a new slave registers and replaces the old slave. For exited(), I didn't add a metric since we are removing checkpointing. It doesn't look like the flag is even available anymore? Looks like some further cleanup is required (remove the field in the message, and the comment / logic in `Master::exited`).

- Ben

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33154/#review80339 ---
Re: Review Request 33154: Added reason metrics for slave removals.
On April 16, 2015, 4 p.m., Vinod Kone wrote:

    src/master/master.cpp, line 4465
    https://reviews.apache.org/r/33154/diff/1/?file=926687#file926687line4465

    Is this optional because there are other removals that we don't have metrics for - the ones in exited() and reregisteredSlave()? Can we have metrics for them too while we are at it?

Ben Mahler wrote:

    I can add one for when a new slave registers and replaces the old slave. For exited(), I didn't add a metric since we are removing checkpointing. It doesn't look like the flag is even available anymore? Looks like some further cleanup is required (remove the field in the message, and the comment / logic in `Master::exited`).

Sorry, s/removing checkpointing/removing the ability to disable checkpointing support/

- Ben

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33154/#review80339 ---
Re: Review Request 31665: Added 'hashmap<SlaveID, Resources> (used|offered)ResourcesBySlaveId' to master::Framework.
On April 15, 2015, 7:20 p.m., Ben Mahler wrote:

    src/master/master.hpp, lines 1138-1143
    https://reviews.apache.org/r/31665/diff/5/?file=927382#file927382line1138

    Thanks for this NOTE! We might want to take this opportunity to articulate why we can safely keep the totals. Or, should we strip non-scalars and store a 'totalUsedScalarResources'?

Michael Park wrote:

    We can't do the former since we can't safely keep the totals. We might want to do something like the latter, but I think we may break some people that way, and therefore it should be done as a subsequent task. I've created [MESOS-2623](https://issues.apache.org/jira/browse/MESOS-2623) to track this.

"We can't do the former since we can't safely keep the totals." Ah ok, great - can we articulate why, for posterity? It's not immediately obvious to me. As in, is there a bug due to this? Or is it just that the non-scalar information is not correct, and there's not yet a bug since we only expose scalars? Something else?

On April 15, 2015, 7:20 p.m., Ben Mahler wrote:

    src/master/master.hpp, lines 1145-1147
    https://reviews.apache.org/r/31665/diff/5/?file=927382#file927382line1145

    How about s/usedResources/totalUsedResources/ and s/usedResourcesBySlaveId/usedResources/? This might also make the total referred to in your note a bit clearer. Ditto for offered.

Michael Park wrote:

    Yeah, I considered `totalUsedResources` and `usedResources` and I liked those. I went with `usedResources` and `usedResourcesBySlaveId` because I thought exposing `totalUsedResources` via `used_resources`, then having to expose `usedResources` via some other key like `used_resources_by_slave_id`, would create confusion later on. This way `usedResources` maps to `used_resources` and `usedResourcesBySlaveId` can map to `used_resources_by_slave_id`. Unless we decide that we're never going to expose `usedResourcesBySlaveId`, in which case we'll have `totalUsedResources` mapped to `used_resources` and that'll be it. What do you think?

That makes sense, although, since we both had the same initial intuition, others may as well. I think it's ok not to let the JSON format affect our variable naming, since we want to make sure that the code is as readable / understandable as possible. I like the clarified total and used naming scheme you considered initially because it allows us to make it clear that used resources have to be namespaced by slaves to be correctly represented, whereas the total only makes sense for scalars (we can have a note about that). Someone is more likely to ponder this, especially if it's called `totalUsedScalarResources`. Feel free to punt on stripping non-scalars if you want to tackle it later!

- Ben

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31665/#review80252 ---

On April 15, 2015, 10:45 p.m., Michael Park wrote:

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31665/ ---

(Updated April 15, 2015, 10:45 p.m.)

Review request for mesos, Alexander Rukletsov and Ben Mahler.

Bugs: MESOS-2373
    https://issues.apache.org/jira/browse/MESOS-2373

Repository: mesos

Description
---

`master::Framework` holds 2 member variables of type `Resources`: `usedResources` and `offeredResources`. Both of these are aggregates of resources from multiple slaves. We add the `hashmap<SlaveID, Resources>` versions since we can lose non-scalar resources by summing them up across multiple slaves. For further details refer to [MESOS-2373](https://issues.apache.org/jira/browse/MESOS-2373).

In [r31666](https://reviews.apache.org/r/31666/), we report `usedResourcesBySlaveId` rather than `usedResources` when adding the framework to the allocator. We don't actually use `offeredResourcesBySlaveId` currently, but it should be kept for symmetry. There will be a ticket to expose `usedResourcesBySlaveId` as well as `offeredResourcesBySlaveId` via the HTTP endpoint.

Diffs
---

  src/master/master.hpp 6141917644b84edfed9836fa0a005d55a36880e3

Diff: https://reviews.apache.org/r/31665/diff/

Testing
---

make check

Thanks,

Michael Park
Re: Review Request 31665: Added 'hashmap<SlaveID, Resources> (used|offered)ResourcesBySlaveId' to master::Framework.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31665/#review80370 ---

src/master/master.hpp
https://reviews.apache.org/r/31665/#comment130237

    For example, could we document in a NOTE on these total variables that subtraction is safe even though non-scalars lose information? Specifically, is my understanding correct here:

    (1) For overlapping set items / ranges across slaves, these will get added N times but only be represented 1 time in the Resources.

    (2) So, when an initial subtraction occurs (N-1), the resource is no longer represented. We don't expose non-scalars for now, so this is safe for now.

    (3) When any further subtractions occur (N-(1+M)), the Resources simply ignores the subtraction since there's nothing to remove, so this is safe for now.

- Ben Mahler
Re: [jira] [Commented] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
__gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.651608064+00:00
I0416 12:21:01.651600 368406528 process.cpp:2091] Resuming help@127.0.0.1:52945 at 2015-04-16 19:21:01.651613952+00:00
I0416 12:21:01.651583 367869952 process.cpp:2091] Resuming authorizer(1)@127.0.0.1:52945 at 2015-04-16 19:21:01.651602944+00:00
I0416 12:21:01.651657 369479680 process.cpp:2091] Resuming __latch__(1)@127.0.0.1:52945 at 2015-04-16 19:21:01.651691008+00:00
I0416 12:21:01.651705 366260224 process.cpp:2081] Spawned process __latch__(1)@127.0.0.1:52945
I0416 12:21:01.651888 366260224 recover.cpp:145] Broadcasting recover request to all replicas
I0416 12:21:01.651996 368406528 process.cpp:2091] Resuming (1)@127.0.0.1:52945 at 2015-04-16 19:21:01.652011008+00:00
I0416 12:21:01.652019 36776 process.cpp:2091] Resuming metrics@127.0.0.1:52945 at 2015-04-16 19:21:01.652039168+00:00
I0416 12:21:01.652148 366796800 process.cpp:2091] Resuming __gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.652165888+00:00
I0416 12:21:01.652154 367869952 process.cpp:2091] Resuming (3)@127.0.0.1:52945 at 2015-04-16 19:21:01.652169984+00:00
I0416 12:21:01.652217 368406528 process.cpp:2081] Spawned process (3)@127.0.0.1:52945
I0416 12:21:01.652369 366260224 process.cpp:2091] Resuming (3)@127.0.0.1:52945 at 2015-04-16 19:21:01.652391936+00:00
I0416 12:21:01.652400 368943104 process.cpp:2091] Resuming log-recover-protocol(1)@127.0.0.1:52945 at 2015-04-16 19:21:01.652436992+00:00
I0416 12:21:01.652465 368943104 recover.cpp:154] Broadcast request completed
I0416 12:21:01.652479 368406528 process.cpp:2091] Resuming log-replica(1)@127.0.0.1:52945 at 2015-04-16 19:21:01.652500992+00:00
I0416 12:21:01.652529 368406528 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0416 12:21:01.652642 366260224 process.cpp:2091] Resuming (3)@127.0.0.1:52945 at 2015-04-16 19:21:01.652664064+00:00
I0416 12:21:01.652740 366260224 process.cpp:2198] Cleaning up (3)@127.0.0.1:52945
I0416 12:21:01.652799 368943104 recover.cpp:195] Received a recover response from a replica in EMPTY status
I0416 12:21:01.652814 368406528 process.cpp:2091] Resuming __gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.652847104+00:00
I0416 12:21:01.652892 368406528 process.cpp:2091] Resuming __latch__(1)@127.0.0.1:52945 at 2015-04-16 19:21:01.652925952+00:00
I0416 12:21:01.652953 368406528 process.cpp:2198] Cleaning up __latch__(1)@127.0.0.1:52945
I0416 12:21:01.653031 366796800 process.cpp:2091] Resuming __gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.653053952+00:00
I0416 12:21:01.653064 367869952 process.cpp:2091] Resuming log-recover(1)@127.0.0.1:52945 at 2015-04-16 19:21:01.653084160+00:00
I0416 12:21:01.653147 367869952 recover.cpp:566] Updating replica status to STARTING
I0416 12:21:01.653162 368943104 process.cpp:2198] Cleaning up log-recover-protocol(1)@127.0.0.1:52945
I0416 12:21:01.653259 366796800 process.cpp:2091] Resuming __gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.653281024+00:00
I0416 12:21:01.653301 368406528 process.cpp:2091] Resuming log-replica(1)@127.0.0.1:52945 at 2015-04-16 19:21:01.653321984+00:00
I0416 12:21:01.653651 368406528 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 295us
I0416 12:21:01.653693 368406528 replica.cpp:323] Persisted replica status to STARTING
I0416 12:21:01.653681 365723648 process.cpp:2091] Resuming standalone-master-detector(1)@127.0.0.1:52945 at 2015-04-16 19:21:01.653703936+00:00
I0416 12:21:01.653806 365723648 process.cpp:2091] Resuming log-recover(1)@127.0.0.1:52945 at 2015-04-16 19:21:01.653827072+00:00
I0416 12:21:01.653833 366260224 process.cpp:2091] Resuming master@127.0.0.1:52945 at 2015-04-16 19:21:01.653851136+00:00
I0416 12:21:01.653863 2105078528 process.cpp:2081] Spawned process master@127.0.0.1:52945
I0416 12:21:01.653874 366260224 master.cpp:361] Master 20150416-122101-16777343-52945-98969 (localhost) started on 127.0.0.1:52945
I0416 12:21:01.653918 365723648 recover.cpp:475] Replica is in STARTING status
I0416 12:21:01.654013 365723648 process.cpp:2081] Spawned process log-recover-protocol(2)@127.0.0.1:52945
I0416 12:21:01.654022 36776 process.cpp:2091] Resuming __gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.654042880+00:00
I0416 12:21:01.654026 368406528 process.cpp:2091] Resuming log-recover-protocol(2)@127.0.0.1:52945 at 2015-04-16 19:21:01.654049024+00:00
I0416 12:21:01.654101 368406528 recover.cpp:131] Starting to wait for enough quorum of replicas before running recovery protocol, expected quroum size: 1
I0416 12:21:01.654170 367869952 process.cpp:2091] Resuming (1)@127.0.0.1:52945 at 2015-04-16 19:21:01.654191872+00:00
I0416 12:21:01.653894 366260224 master.cpp:363] Flags at startup: --acls=permissive: false register_frameworks { principals { type: SOME values: test-principal } roles { type: SOME values: * } } run_tasks { principals { type: SOME
Re: Review Request 31665: Added 'hashmap<SlaveID, Resources> (used|offered)ResourcesBySlaveId' to master::Framework.
On April 15, 2015, 7:20 p.m., Ben Mahler wrote: src/master/master.hpp, lines 1145-1147 https://reviews.apache.org/r/31665/diff/5/?file=927382#file927382line1145 How about s/usedResources/totalUsedResources/ and s/usedResourcesBySlaveId/usedResources/ ? This might also make the total referred to in your note a bit clearer. Ditto for offered. Michael Park wrote: Yeah, I considered `totalUsedResources` and `usedResources` and I liked those. I went with `usedResources` and `usedResourcesBySlaveId` because I thought exposing `totalUsedResources` via `used_resources`, then having to expose `usedResources` via some other key like `used_resources_by_slave_id`, would create confusion later on. This way `usedResources` maps to `used_resources` and `usedResourcesBySlaveId` can map to `used_resources_by_slave_id`. Unless we decide that we're never going to expose `usedResourcesBySlaveId`, in which case we'll have `totalUsedResources` mapped to `used_resources` and that'll be it. What do you think? Ben Mahler wrote: That makes sense, although, since we both had the same initial intuition, others may as well. I think it's ok not to allow the JSON format to affect our variable naming, since we want to make sure that the code is as readable / understandable as possible. I like the clarified total and used naming scheme you considered initially because it allows us to make it clear that used resources have to be namespaced by slaves to be correctly represented, whereas the total only makes sense for scalars (we can have a note about that). Someone is more likely to ponder this, especially if it's called `totalUsedScalarResources`. Feel free to punt on stripping non-scalars if you want to tackle it later! I think it's ok not to allow the JSON format to affect our variable naming, since we want to make sure that the code is as readable / understandable as possible. I agree. To be specific, my worry is that it'll be confusing for `usedResources` not to be exposed as `used_resources`. 
If you think that wouldn't be a big deal, I'm happy to change the naming scheme. Perhaps we can simply have a `NOTE` to help avoid the confusion. - Michael --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31665/#review80252 --- On April 15, 2015, 10:45 p.m., Michael Park wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31665/ --- (Updated April 15, 2015, 10:45 p.m.) Review request for mesos, Alexander Rukletsov and Ben Mahler. Bugs: MESOS-2373 https://issues.apache.org/jira/browse/MESOS-2373 Repository: mesos Description --- `master::Framework` holds 2 member variables of type `Resources`: `usedResources` and `offeredResources`. Both of these are aggregates of resources from multiple slaves. We add the `hashmap<SlaveID, Resources>` versions since we can lose non-scalar resources by summing them up across multiple slaves. For further details refer to [MESOS-2373](https://issues.apache.org/jira/browse/MESOS-2373). In [r31666](https://reviews.apache.org/r/31666/), we report `usedResourcesBySlaveId` rather than `usedResources` when adding the framework to the allocator. We don't actually use `offeredResourcesBySlaveId` currently, but it should be kept for symmetry. There will be a ticket to expose `usedResourcesBySlaveId` as well as `offeredResourcesBySlaveId` via the HTTP endpoint. Diffs - src/master/master.hpp 6141917644b84edfed9836fa0a005d55a36880e3 Diff: https://reviews.apache.org/r/31665/diff/ Testing --- make check Thanks, Michael Park
Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure
On April 16, 2015, 5:39 p.m., Timothy Chen wrote: src/slave/slave.cpp, line 3088 https://reviews.apache.org/r/33249/diff/1/?file=931231#file931231line3088 Btw the framework and executor might not be there anymore if they are terminated and cleaned up between launch and the callback; see the checks we do below. Ah, great point. Thanks! I'll fix. - Jay --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33249/#review80352 --- On April 16, 2015, 3:16 p.m., Jay Buffington wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33249/ --- (Updated April 16, 2015, 3:16 p.m.) Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone. Bugs: MESOS-2020 https://issues.apache.org/jira/browse/MESOS-2020 Repository: mesos Description --- When Mesos is unable to launch the containerizer, the scheduler should get a TASK_FAILED with a status message that includes the error the containerizer encountered when trying to launch. Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED Fixes MESOS-2020 Diffs - include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 src/tests/containerizer.cpp 26b87ac6b16dfeaf84888e80296ef540697bd775 src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 Diff: https://reviews.apache.org/r/33249/diff/ Testing --- I added a test case to slave_tests.cpp. I also tried this with Aurora, supplied a bogus docker image URL, and saw the docker pull failure stderr message in Aurora's web UI. Thanks, Jay Buffington
Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure
On April 16, 2015, 5:32 p.m., Timothy Chen wrote: src/slave/slave.cpp, line 3085 https://reviews.apache.org/r/33249/diff/1/?file=931231#file931231line3085 We're already sending back a status update when the registration times out, and if we send another one here the scheduler will actually get two TASK_FAILED statuses instead. I think we should either populate the reason when we send back the final status update, indicating that the containerizer launch failed, or make sure we only send one here. The nice thing about having it handled in the timeout is that there are fewer places in the slave where we do status updates, but with the caveat that you wait for the timeout to occur, which is something I never really liked. I think if we can keep the code clean and make sure exactly one status update is propagated back, I'd rather see it happen here. Sending a terminal update (TASK_FAILED) removes the task from 'executor->queuedTasks', so the scheduler won't get two status updates. See https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2112 I admit this is super confusing; in fact, when I ran the code the first time I was expecting to see two status updates. I pinged Vinod about it and he was confused and it took us a while to work through what was going on. I am concerned that we are changing state for the callbacks that clean things up, so I'm open to moving it. When you say timeout, are you referring to the Slave::sendExecutorTerminatedStatusUpdate method? - Jay --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33249/#review80347 --- On April 16, 2015, 3:16 p.m., Jay Buffington wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33249/ --- (Updated April 16, 2015, 3:16 p.m.) Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone. 
Bugs: MESOS-2020 https://issues.apache.org/jira/browse/MESOS-2020 Repository: mesos Description --- When Mesos is unable to launch the containerizer, the scheduler should get a TASK_FAILED with a status message that includes the error the containerizer encountered when trying to launch. Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED Fixes MESOS-2020 Diffs - include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 src/tests/containerizer.cpp 26b87ac6b16dfeaf84888e80296ef540697bd775 src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 Diff: https://reviews.apache.org/r/33249/diff/ Testing --- I added a test case to slave_tests.cpp. I also tried this with Aurora, supplied a bogus docker image URL, and saw the docker pull failure stderr message in Aurora's web UI. Thanks, Jay Buffington
Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure
On April 16, 2015, 5:32 p.m., Timothy Chen wrote: src/slave/slave.cpp, line 3085 https://reviews.apache.org/r/33249/diff/1/?file=931231#file931231line3085 We're already sending back a status update when the registration times out, and if we send another one here the scheduler will actually get two TASK_FAILED statuses instead. I think we should either populate the reason when we send back the final status update, indicating that the containerizer launch failed, or make sure we only send one here. The nice thing about having it handled in the timeout is that there are fewer places in the slave where we do status updates, but with the caveat that you wait for the timeout to occur, which is something I never really liked. I think if we can keep the code clean and make sure exactly one status update is propagated back, I'd rather see it happen here. Jay Buffington wrote: Sending a terminal update (TASK_FAILED) removes the task from 'executor->queuedTasks', so the scheduler won't get two status updates. See https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2112 I admit this is super confusing; in fact, when I ran the code the first time I was expecting to see two status updates. I pinged Vinod about it and he was confused and it took us a while to work through what was going on. I am concerned that we are changing state for the callbacks that clean things up, so I'm open to moving it. When you say timeout, are you referring to the Slave::sendExecutorTerminatedStatusUpdate method? Ah, I forgot about it too. I think comments will be great so we avoid confusion! - Timothy --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33249/#review80347 --- On April 16, 2015, 3:16 p.m., Jay Buffington wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33249/ --- (Updated April 16, 2015, 3:16 p.m.) Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone. 
Bugs: MESOS-2020 https://issues.apache.org/jira/browse/MESOS-2020 Repository: mesos Description --- When Mesos is unable to launch the containerizer, the scheduler should get a TASK_FAILED with a status message that includes the error the containerizer encountered when trying to launch. Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED Fixes MESOS-2020 Diffs - include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 src/tests/containerizer.cpp 26b87ac6b16dfeaf84888e80296ef540697bd775 src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 Diff: https://reviews.apache.org/r/33249/diff/ Testing --- I added a test case to slave_tests.cpp. I also tried this with Aurora, supplied a bogus docker image URL, and saw the docker pull failure stderr message in Aurora's web UI. Thanks, Jay Buffington
Re: Review Request 33155: Added commented-out tests for slave removal metrics.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33155/#review80397 --- src/tests/master_tests.cpp https://reviews.apache.org/r/33155/#comment130286 I'm a bit confused about how the slave's lifetime affects metrics collection on the master endpoint. - Vinod Kone On April 14, 2015, 1:46 a.m., Ben Mahler wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33155/ --- (Updated April 14, 2015, 1:46 a.m.) Review request for mesos and Vinod Kone. Bugs: MESOS-2485 https://issues.apache.org/jira/browse/MESOS-2485 Repository: mesos Description --- I updated the existing tests to validate the metrics. However, doing so blocks the tests because the slave metrics never get satisfied. Added a TODO. Diffs - src/tests/master_tests.cpp 32b1e9bb58d8046e5363fafe2ab8f662b6c9a666 src/tests/partition_tests.cpp 1018e479a77a6b533f2dd392fedbdccb80e3ac1f src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 Diff: https://reviews.apache.org/r/33155/diff/ Testing --- make check Thanks, Ben Mahler
Re: Review Request 33266: Check patch command in configure.ac
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33266/#review80356 --- Patch looks great! Reviews applied: [33266] All tests passed. - Mesos ReviewBot On April 16, 2015, 5:30 p.m., haosdent huang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33266/ --- (Updated April 16, 2015, 5:30 p.m.) Review request for mesos and Till Toenshoff. Bugs: MESOS-2624 https://issues.apache.org/jira/browse/MESOS-2624 Repository: mesos Description --- Check patch command in configure.ac Diffs - configure.ac 7f9e52916b9d78f2bbff9d6ed9871444a0fda629 Diff: https://reviews.apache.org/r/33266/diff/ Testing --- Thanks, haosdent huang
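For reference, a configure.ac check for the patch command might look like the sketch below. This is illustrative only: the variable name and error message are assumptions, and the actual diff under review may differ.

```m4
# Sketch of a possible check: locate 'patch' and abort configure early
# if it is missing, rather than failing later during the build.
AC_PATH_PROG([PATCH], [patch], [no])
AS_IF([test "x$PATCH" = "xno"],
      [AC_MSG_ERROR([patch command not found; please install 'patch'])])
```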