Review Request 33257: Fixed recover tasks only by the initiated containerizer.

2015-04-16 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33257/
---

Review request for mesos, Benjamin Hindman, Bernd Mathiske, Ian Downes, Jie Yu, 
and Till Toenshoff.


Bugs: MESOS-2601
https://issues.apache.org/jira/browse/MESOS-2601


Repository: mesos


Description
---

Fixed recover tasks only by the initiated containerizer.
Currently both the Mesos and Docker containerizers recover tasks that weren't 
started by themselves.
The proposed fix is to record the intended containerizer in the checkpointed 
executorInfo, and reuse that information on recover to know if the 
containerizer should recover or not. We are free to modify the executorInfo 
since it's not being used to relaunch any task.


Diffs
-

  src/slave/containerizer/docker.cpp f9fb07806e3b7d7d2afc1be3b8756eac23b32dcd 
  src/slave/containerizer/mesos/containerizer.cpp 
e4136095fca55637864f495098189ab3ad8d8fe7 
  src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
  src/tests/containerizer_tests.cpp 5991aa628083dac7c5e8bf7ba297f4f9edeec05f 
  src/tests/docker_containerizer_tests.cpp 
c772d4c836de18b0e87636cb42200356d24ec73d 

Diff: https://reviews.apache.org/r/33257/diff/


Testing
---

make check


Thanks,

Timothy Chen



Re: Review Request 33257: Fixed recover tasks only by the initiated containerizer.

2015-04-16 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33257/
---

(Updated April 16, 2015, 7:08 a.m.)


Review request for mesos, Benjamin Hindman, Bernd Mathiske, Ian Downes, Jie Yu, 
and Till Toenshoff.


Bugs: MESOS-2601
https://issues.apache.org/jira/browse/MESOS-2601


Repository: mesos


Description
---

Fixed recover tasks only by the initiated containerizer.
Currently both the Mesos and Docker containerizers recover tasks that weren't 
started by themselves.
The proposed fix is to record the intended containerizer in the checkpointed 
executorInfo, and reuse that information on recover to know if the 
containerizer should recover or not. We are free to modify the executorInfo 
since it's not being used to relaunch any task.


Diffs (updated)
-

  src/slave/containerizer/docker.cpp f9fb07806e3b7d7d2afc1be3b8756eac23b32dcd 
  src/slave/containerizer/mesos/containerizer.cpp 
e4136095fca55637864f495098189ab3ad8d8fe7 
  src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
  src/tests/containerizer_tests.cpp 5991aa628083dac7c5e8bf7ba297f4f9edeec05f 
  src/tests/docker_containerizer_tests.cpp 
c772d4c836de18b0e87636cb42200356d24ec73d 

Diff: https://reviews.apache.org/r/33257/diff/


Testing
---

make check


Thanks,

Timothy Chen



Re: Review Request 33257: Fixed recover tasks only by the initiated containerizer.

2015-04-16 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33257/
---

(Updated April 16, 2015, 7:10 a.m.)


Review request for mesos, Benjamin Hindman, Bernd Mathiske, Ian Downes, Jie Yu, 
and Till Toenshoff.


Bugs: MESOS-2601
https://issues.apache.org/jira/browse/MESOS-2601


Repository: mesos


Description (updated)
---

Fixed recover tasks only by the initiated containerizer.
Currently both the Mesos and Docker containerizers recover tasks that weren't 
started by themselves.
The proposed fix is to record the intended containerizer in the checkpointed 
executorInfo, and reuse that information on recover to know if the 
containerizer should recover or not. We are free to modify the executorInfo 
since it's not being used to relaunch any task.
The external containerizer doesn't need to change since it is only recovering 
containers that are returned by the containers script.
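
To make the idea concrete, here is a minimal self-contained sketch of the
approach (not the actual patch; the type and field names are simplified
stand-ins for Mesos' ExecutorInfo and Containerizer):

    // Each containerizer records its name in the checkpointed executor info
    // at launch time, and on recovery skips executors it did not start.
    #include <iostream>
    #include <string>
    #include <vector>

    struct ExecutorInfo {
      std::string id;
      std::string containerizer;  // recorded at launch, read back on recover
    };

    struct Containerizer {
      std::string name;

      void recover(const std::vector<ExecutorInfo>& checkpointed) {
        for (const ExecutorInfo& executor : checkpointed) {
          if (executor.containerizer != name) {
            continue;  // launched by the other containerizer; leave it alone
          }
          std::cout << name << " recovering executor " << executor.id << "\n";
        }
      }
    };

    int main() {
      const std::vector<ExecutorInfo> checkpointed = {
          {"web-1", "docker"}, {"batch-7", "mesos"}};

      Containerizer{"docker"}.recover(checkpointed);  // recovers web-1 only
      Containerizer{"mesos"}.recover(checkpointed);   // recovers batch-7 only
    }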


Diffs
-

  src/slave/containerizer/docker.cpp f9fb07806e3b7d7d2afc1be3b8756eac23b32dcd 
  src/slave/containerizer/mesos/containerizer.cpp 
e4136095fca55637864f495098189ab3ad8d8fe7 
  src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
  src/tests/containerizer_tests.cpp 5991aa628083dac7c5e8bf7ba297f4f9edeec05f 
  src/tests/docker_containerizer_tests.cpp 
c772d4c836de18b0e87636cb42200356d24ec73d 

Diff: https://reviews.apache.org/r/33257/diff/


Testing
---

make check


Thanks,

Timothy Chen



Re: Review Request 33257: Fixed recover tasks only by the initiated containerizer.

2015-04-16 Thread Erik Weathers

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33257/#review80308
---



src/slave/containerizer/docker.cpp
https://reviews.apache.org/r/33257/#comment130134

Cosmetic comment: `executor.id` is enclosed in single-quotes but 
`framework.id` is not.  This is in keeping with most of this file, so it *is* 
somewhat consistent.  But the `launch` member function does enclose 
`executorInfo.framework_id()` in single-quotes, so there is precedent for doing 
so within this file.
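
For illustration, a hypothetical log line in the quoted style (glog's
LOG(INFO); this is not a line from the actual diff):

    // Both IDs enclosed in single-quotes, matching the precedent in launch():
    LOG(INFO) << "Recovering executor '" << executor.id
              << "' of framework '" << framework.id << "'";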


- Erik Weathers


On April 16, 2015, 7:10 a.m., Timothy Chen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33257/
 ---
 
 (Updated April 16, 2015, 7:10 a.m.)
 
 
 Review request for mesos, Benjamin Hindman, Bernd Mathiske, Ian Downes, Jie 
 Yu, and Till Toenshoff.
 
 
 Bugs: MESOS-2601
 https://issues.apache.org/jira/browse/MESOS-2601
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Fixed recover tasks only by the initiated containerizer.
 Currently both the Mesos and Docker containerizers recover tasks that weren't 
 started by themselves.
 The proposed fix is to record the intended containerizer in the checkpointed 
 executorInfo, and reuse that information on recover to know if the 
 containerizer should recover or not. We are free to modify the executorInfo 
 since it's not being used to relaunch any task.
 The external containerizer doesn't need to change since it is only recovering 
 containers that are returned by the containers script.
 
 
 Diffs
 -
 
   src/slave/containerizer/docker.cpp f9fb07806e3b7d7d2afc1be3b8756eac23b32dcd 
   src/slave/containerizer/mesos/containerizer.cpp 
 e4136095fca55637864f495098189ab3ad8d8fe7 
   src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
   src/tests/containerizer_tests.cpp 5991aa628083dac7c5e8bf7ba297f4f9edeec05f 
   src/tests/docker_containerizer_tests.cpp 
 c772d4c836de18b0e87636cb42200356d24ec73d 
 
 Diff: https://reviews.apache.org/r/33257/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Timothy Chen
 




Review Request 33263: Extended SlaveTest.ShutdownUnregisteredExecutor test with a reason check.

2015-04-16 Thread Andrey Dyatlov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33263/
---

Review request for mesos, Alexander Rukletsov, Bernd Mathiske, and Till 
Toenshoff.


Bugs: MESOS-2625
https://issues.apache.org/jira/browse/MESOS-2625


Repository: mesos


Description
---

Extended SlaveTest.ShutdownUnregisteredExecutor test with a reason check. Check 
that the reason is REASON_COMMAND_EXECUTOR_FAILED. According to the 
Slave::sendExecutorTerminatedStatusUpdate member function, this reason is 
expected instead of the more general REASON_EXECUTOR_TERMINATED because the 
command executor is used in this test.
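
The added check boils down to an assertion of roughly this shape (a sketch
using the AWAIT_READY/EXPECT_EQ idiom from the Mesos test helpers; the
surrounding fixture and expectations are omitted):

    // 'status' is a Future<TaskStatus> captured from the scheduler's
    // statusUpdate callback; verify the state and the specific reason.
    AWAIT_READY(status);
    EXPECT_EQ(TASK_FAILED, status.get().state());
    EXPECT_EQ(TaskStatus::REASON_COMMAND_EXECUTOR_FAILED,
              status.get().reason());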


Diffs
-

  src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 

Diff: https://reviews.apache.org/r/33263/diff/


Testing
---

make check


Thanks,

Andrey Dyatlov



Re: Review Request 32058: Added protobuf-JSON validation

2015-04-16 Thread Till Toenshoff

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32058/#review80333
---



3rdparty/libprocess/3rdparty/stout/include/stout/protobuf.hpp
https://reviews.apache.org/r/32058/#comment130182

Just as Vinod commented on the JIRA, we might want to do that differently, 
e.g. by adding a factory.


- Till Toenshoff


On March 16, 2015, 8:02 p.m., Akanksha Agrawal wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/32058/
 ---
 
 (Updated March 16, 2015, 8:02 p.m.)
 
 
 Review request for mesos and Till Toenshoff.
 
 
 Bugs: MESOS-1194
 https://issues.apache.org/jira/browse/MESOS-1194
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Added protobuf-JSON validation
 
 
 Diffs
 -
 
   3rdparty/libprocess/3rdparty/stout/include/stout/protobuf.hpp 
 2c020d98b94995bce862ac3b9c173dd64ed1198d 
 
 Diff: https://reviews.apache.org/r/32058/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Akanksha Agrawal
 




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-16 Thread Jay Buffington

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/
---

(Updated April 16, 2015, 3:16 p.m.)


Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.


Bugs: MESOS-2020
https://issues.apache.org/jira/browse/MESOS-2020


Repository: mesos


Description
---

When Mesos is unable to launch the containerizer, the scheduler should
get a TASK_FAILED with a status message that includes the error the
containerizer encountered when trying to launch.

Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED

Fixes MESOS-2020
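
The intent, in a simplified self-contained form (the real patch threads this
through the slave's status-update machinery; the names below are illustrative
only):

    // On containerizer launch failure, surface the containerizer's error
    // string to the scheduler as the message of a terminal TASK_FAILED
    // update, tagged with the new reason.
    #include <iostream>
    #include <string>

    enum class Reason { CONTAINERIZER_LAUNCH_FAILED, EXECUTOR_TERMINATED };

    struct TaskStatus {
      std::string state;
      Reason reason;
      std::string message;  // carries the containerizer's error verbatim
    };

    TaskStatus onLaunchFailure(const std::string& error) {
      return {"TASK_FAILED", Reason::CONTAINERIZER_LAUNCH_FAILED, error};
    }

    int main() {
      const TaskStatus status =
          onLaunchFailure("Failed to 'docker pull': image not found");
      std::cout << status.state << ": " << status.message << "\n";
    }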


Diffs
-

  include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
  src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
  src/tests/containerizer.cpp 26b87ac6b16dfeaf84888e80296ef540697bd775 
  src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 

Diff: https://reviews.apache.org/r/33249/diff/


Testing
---

I added a test case to slave_tests.cpp.  I also tried this with Aurora: 
supplied a bogus Docker image URL and saw the docker pull failure stderr 
message in Aurora's web UI.


Thanks,

Jay Buffington



Re: Review Request 33263: Extended SlaveTest.ShutdownUnregisteredExecutor test with a reason check.

2015-04-16 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33263/#review80335
---


Patch looks great!

Reviews applied: [33263]

All tests passed.

- Mesos ReviewBot


On April 16, 2015, 2:31 p.m., Andrey Dyatlov wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33263/
 ---
 
 (Updated April 16, 2015, 2:31 p.m.)
 
 
 Review request for mesos, Alexander Rukletsov, Bernd Mathiske, and Till 
 Toenshoff.
 
 
 Bugs: MESOS-2625
 https://issues.apache.org/jira/browse/MESOS-2625
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Extended SlaveTest.ShutdownUnregisteredExecutor test with a reason check. 
 Check that the reason is REASON_COMMAND_EXECUTOR_FAILED. According to the 
 Slave::sendExecutorTerminatedStatusUpdate member function, this reason is 
 expected instead of the more general REASON_EXECUTOR_TERMINATED because the 
 command executor is used in this test.
 
 
 Diffs
 -
 
   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
 
 Diff: https://reviews.apache.org/r/33263/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Andrey Dyatlov
 




Re: Review Request 33154: Added reason metrics for slave removals.

2015-04-16 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33154/#review80339
---



src/master/master.cpp
https://reviews.apache.org/r/33154/#comment130185

Is this optional because there are other removals that we don't have 
metrics for? The ones in exited() and reregisteredSlave()? Can we have metrics 
for them too while we are at it?



src/master/master.cpp
https://reviews.apache.org/r/33154/#comment130184

Would be awesome for debugging if the log line also prints the reason.


- Vinod Kone


On April 14, 2015, 1:46 a.m., Ben Mahler wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33154/
 ---
 
 (Updated April 14, 2015, 1:46 a.m.)
 
 
 Review request for mesos and Vinod Kone.
 
 
 Bugs: MESOS-2485
 https://issues.apache.org/jira/browse/MESOS-2485
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See [MESOS-2485](https://issues.apache.org/jira/browse/MESOS-2485).
 
 
 Diffs
 -
 
   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
   src/master/master.hpp 6141917644b84edfed9836fa0a005d55a36880e3 
   src/master/master.cpp 44b0a0147f5354824d86332a67b30018634c9a36 
   src/master/metrics.hpp 52a83289cfe7e6b6fd8d5bff0774ebe5ce51d0ed 
   src/master/metrics.cpp 14486bf7130250f561c9fb7a43c95f3fc1e76a4b 
 
 Diff: https://reviews.apache.org/r/33154/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Ben Mahler
 




Re: Review Request 33153: Moved a partition test into partition_tests.cpp.

2015-04-16 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33153/#review80338
---

Ship it!


Ship It!

- Vinod Kone


On April 14, 2015, 1:46 a.m., Ben Mahler wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33153/
 ---
 
 (Updated April 14, 2015, 1:46 a.m.)
 
 
 Review request for mesos and Vinod Kone.
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary.
 
 
 Diffs
 -
 
   src/tests/fault_tolerance_tests.cpp 
 a637c32f004638a110390b22cf5b626e904097cf 
   src/tests/partition_tests.cpp 1018e479a77a6b533f2dd392fedbdccb80e3ac1f 
 
 Diff: https://reviews.apache.org/r/33153/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Ben Mahler
 




Re: Review Request 33152: Moved the slave shutdown test into slave_tests.cpp.

2015-04-16 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33152/#review80337
---

Ship it!


Ship It!

- Vinod Kone


On April 14, 2015, 1:46 a.m., Ben Mahler wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33152/
 ---
 
 (Updated April 14, 2015, 1:46 a.m.)
 
 
 Review request for mesos and Vinod Kone.
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary.
 
 
 Diffs
 -
 
   src/tests/fault_tolerance_tests.cpp 
 a637c32f004638a110390b22cf5b626e904097cf 
   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
 
 Diff: https://reviews.apache.org/r/33152/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Ben Mahler
 




Re: Review Request 33257: Fixed recover tasks only by the initiated containerizer.

2015-04-16 Thread Till Toenshoff

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33257/#review80321
---



src/slave/slave.cpp
https://reviews.apache.org/r/33257/#comment130173

Kill this line.



src/slave/slave.cpp
https://reviews.apache.org/r/33257/#comment130165

Remove this line.



src/tests/containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130167

Could you add a comment and describe what this test does?



src/tests/containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130176

Move this down to when you need it.



src/tests/containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130175

Move this down to when you actually need it.



src/tests/containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130168

I am not sure this comment adds any value - we may just as well remove it, 
no?



src/tests/containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130174

Kill these.



src/tests/docker_containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130166

Could you add a comment and describe what this test does?



src/tests/docker_containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130178

Move this down to where you need it.



src/tests/docker_containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130177

Move this down to where you need it.



src/tests/docker_containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130179

Move this down to where you need it.



src/tests/docker_containerizer_tests.cpp
https://reviews.apache.org/r/33257/#comment130180

Move this down to where you need it.


- Till Toenshoff


On April 16, 2015, 7:10 a.m., Timothy Chen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33257/
 ---
 
 (Updated April 16, 2015, 7:10 a.m.)
 
 
 Review request for mesos, Benjamin Hindman, Bernd Mathiske, Ian Downes, Jie 
 Yu, and Till Toenshoff.
 
 
 Bugs: MESOS-2601
 https://issues.apache.org/jira/browse/MESOS-2601
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Fixed recover tasks only by the initiated containerizer.
 Currently both the Mesos and Docker containerizers recover tasks that weren't 
 started by themselves.
 The proposed fix is to record the intended containerizer in the checkpointed 
 executorInfo, and reuse that information on recover to know if the 
 containerizer should recover or not. We are free to modify the executorInfo 
 since it's not being used to relaunch any task.
 The external containerizer doesn't need to change since it is only recovering 
 containers that are returned by the containers script.
 
 
 Diffs
 -
 
   src/slave/containerizer/docker.cpp f9fb07806e3b7d7d2afc1be3b8756eac23b32dcd 
   src/slave/containerizer/mesos/containerizer.cpp 
 e4136095fca55637864f495098189ab3ad8d8fe7 
   src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
   src/tests/containerizer_tests.cpp 5991aa628083dac7c5e8bf7ba297f4f9edeec05f 
   src/tests/docker_containerizer_tests.cpp 
 c772d4c836de18b0e87636cb42200356d24ec73d 
 
 Diff: https://reviews.apache.org/r/33257/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Timothy Chen
 




Re: Review Request 33263: Extended SlaveTest.ShutdownUnregisteredExecutor test with a reason check.

2015-04-16 Thread Bernd Mathiske

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33263/#review80332
---

Ship it!


- Bernd Mathiske


On April 16, 2015, 7:31 a.m., Andrey Dyatlov wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33263/
 ---
 
 (Updated April 16, 2015, 7:31 a.m.)
 
 
 Review request for mesos, Alexander Rukletsov, Bernd Mathiske, and Till 
 Toenshoff.
 
 
 Bugs: MESOS-2625
 https://issues.apache.org/jira/browse/MESOS-2625
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Extended SlaveTest.ShutdownUnregisteredExecutor test with a reason check. 
 Check that the reason is REASON_COMMAND_EXECUTOR_FAILED. According to the 
 Slave::sendExecutorTerminatedStatusUpdate member function, this reason is 
 expected instead of the more general REASON_EXECUTOR_TERMINATED because the 
 command executor is used in this test.
 
 
 Diffs
 -
 
   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
 
 Diff: https://reviews.apache.org/r/33263/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Andrey Dyatlov
 




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-16 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review80347
---



src/slave/slave.cpp
https://reviews.apache.org/r/33249/#comment130197

We're already sending back a status update when the registration times out, 
and if we send another one here the scheduler will actually get two TASK_FAILED 
statuses instead.

I think we should either populate the reason (that the containerizer launch 
failed) when we send back the final status update, or make sure we send just 
one here.

The nice thing about having it handled in the timeout is that there are fewer 
places in the slave where we do status updates, but with the caveat that you 
wait for the timeout to occur, which is something I never really liked.

I think if we can make the code clean and make sure there is just one status 
update propagated back, I'd rather see it happen here.


- Timothy Chen


On April 16, 2015, 3:16 p.m., Jay Buffington wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33249/
 ---
 
 (Updated April 16, 2015, 3:16 p.m.)
 
 
 Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
 
 
 Bugs: MESOS-2020
 https://issues.apache.org/jira/browse/MESOS-2020
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When Mesos is unable to launch the containerizer, the scheduler should
 get a TASK_FAILED with a status message that includes the error the
 containerizer encountered when trying to launch.
 
 Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
 
 Fixes MESOS-2020
 
 
 Diffs
 -
 
   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
   src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
   src/tests/containerizer.cpp 26b87ac6b16dfeaf84888e80296ef540697bd775 
   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
 
 Diff: https://reviews.apache.org/r/33249/diff/
 
 
 Testing
 ---
 
 I added a test case to slave_tests.cpp.  I also tried this with Aurora: 
 supplied a bogus Docker image URL and saw the docker pull failure stderr 
 message in Aurora's web UI.
 
 
 Thanks,
 
 Jay Buffington
 




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-16 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review80350
---



src/tests/containerizer.cpp
https://reviews.apache.org/r/33249/#comment130199

How about setting a flag in the TestContainerizer constructor that tells it 
to send a failure on launch? Checking a value like this is too implicit, and 
someone else might not know the magic meaning on the other side (I don't see 
any comment in the test to mark that this has special meaning).
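
Something along these lines (a hypothetical sketch of the suggestion, not
the real TestContainerizer API):

    #include <iostream>
    #include <string>

    class TestContainerizer {
    public:
      // An explicit flag documents the intent at the call site, unlike a
      // magic value buried in the launch arguments.
      explicit TestContainerizer(bool failLaunches = false)
        : failLaunches(failLaunches) {}

      // Returns an error message when simulating failure, empty on success.
      std::string launch(const std::string& taskId) const {
        if (failLaunches) {
          return "Simulated launch failure for task " + taskId;
        }
        return "";
      }

    private:
      const bool failLaunches;
    };

    int main() {
      const TestContainerizer containerizer(true);  // fail all launches
      std::cout << containerizer.launch("task-1") << "\n";
    }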


- Timothy Chen


On April 16, 2015, 3:16 p.m., Jay Buffington wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33249/
 ---
 
 (Updated April 16, 2015, 3:16 p.m.)
 
 
 Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
 
 
 Bugs: MESOS-2020
 https://issues.apache.org/jira/browse/MESOS-2020
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When Mesos is unable to launch the containerizer, the scheduler should
 get a TASK_FAILED with a status message that includes the error the
 containerizer encountered when trying to launch.
 
 Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
 
 Fixes MESOS-2020
 
 
 Diffs
 -
 
   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
   src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
   src/tests/containerizer.cpp 26b87ac6b16dfeaf84888e80296ef540697bd775 
   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
 
 Diff: https://reviews.apache.org/r/33249/diff/
 
 
 Testing
 ---
 
 I added a test case to slave_tests.cpp.  I also tried this with Aurora: 
 supplied a bogus Docker image URL and saw the docker pull failure stderr 
 message in Aurora's web UI.
 
 
 Thanks,
 
 Jay Buffington
 




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-16 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review80352
---



src/slave/slave.cpp
https://reviews.apache.org/r/33249/#comment130205

Btw the framework and executor might not be there anymore if they are 
terminated and cleaned up before the launch callback runs; see the checks we 
do below.


- Timothy Chen


On April 16, 2015, 3:16 p.m., Jay Buffington wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33249/
 ---
 
 (Updated April 16, 2015, 3:16 p.m.)
 
 
 Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
 
 
 Bugs: MESOS-2020
 https://issues.apache.org/jira/browse/MESOS-2020
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When Mesos is unable to launch the containerizer, the scheduler should
 get a TASK_FAILED with a status message that includes the error the
 containerizer encountered when trying to launch.
 
 Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
 
 Fixes MESOS-2020
 
 
 Diffs
 -
 
   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
   src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
   src/tests/containerizer.cpp 26b87ac6b16dfeaf84888e80296ef540697bd775 
   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
 
 Diff: https://reviews.apache.org/r/33249/diff/
 
 
 Testing
 ---
 
 I added a test case to slave_tests.cpp.  I also tried this with Aurora: 
 supplied a bogus Docker image URL and saw the docker pull failure stderr 
 message in Aurora's web UI.
 
 
 Thanks,
 
 Jay Buffington
 




Review Request 33266: Check patch command in configure.ac

2015-04-16 Thread haosdent huang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33266/
---

Review request for mesos and Till Toenshoff.


Bugs: MESOS-2624
https://issues.apache.org/jira/browse/MESOS-2624


Repository: mesos


Description
---

Check patch command in configure.ac


Diffs
-

  configure.ac 7f9e52916b9d78f2bbff9d6ed9871444a0fda629 

Diff: https://reviews.apache.org/r/33266/diff/


Testing
---


Thanks,

haosdent huang



Re: Review Request 33154: Added reason metrics for slave removals.

2015-04-16 Thread Ben Mahler


 On April 16, 2015, 4 p.m., Vinod Kone wrote:
  src/master/master.cpp, line 4465
  https://reviews.apache.org/r/33154/diff/1/?file=926687#file926687line4465
 
  Is this optional because there are other removals that we don't have 
  metrics for? The ones in exited() and reregisteredSlave()? Can we have 
  metrics for them too while we are at it?

I can add one for when a new slave registers and replaces the old slave.

For exited(), I didn't add a metric since we are removing checkpointing. It 
doesn't look like the flag is even available anymore? Looks like some further 
cleanup is required (remove the field in the message, and the comment / logic 
in `Master::exited`).


- Ben


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33154/#review80339
---


On April 14, 2015, 1:46 a.m., Ben Mahler wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33154/
 ---
 
 (Updated April 14, 2015, 1:46 a.m.)
 
 
 Review request for mesos and Vinod Kone.
 
 
 Bugs: MESOS-2485
 https://issues.apache.org/jira/browse/MESOS-2485
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See [MESOS-2485](https://issues.apache.org/jira/browse/MESOS-2485).
 
 
 Diffs
 -
 
   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
   src/master/master.hpp 6141917644b84edfed9836fa0a005d55a36880e3 
   src/master/master.cpp 44b0a0147f5354824d86332a67b30018634c9a36 
   src/master/metrics.hpp 52a83289cfe7e6b6fd8d5bff0774ebe5ce51d0ed 
   src/master/metrics.cpp 14486bf7130250f561c9fb7a43c95f3fc1e76a4b 
 
 Diff: https://reviews.apache.org/r/33154/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Ben Mahler
 




Re: Review Request 33154: Added reason metrics for slave removals.

2015-04-16 Thread Ben Mahler


 On April 16, 2015, 4 p.m., Vinod Kone wrote:
  src/master/master.cpp, line 4465
  https://reviews.apache.org/r/33154/diff/1/?file=926687#file926687line4465
 
  Is this optional because there are other removals that we don't have 
  metrics for? The ones in exited() and reregisteredSlave()? Can we have 
  metrics for them too while we are at it?
 
 Ben Mahler wrote:
 I can add one for when a new slave registers and replaces the old 
 slave.
 
 For exited(), I didn't add a metric since we are removing checkpointing. 
 It doesn't look like the flag is even available anymore? Looks like some 
 further cleanup is required (remove the field in the message, and the comment 
 / logic in `Master::exited`).

Sorry, s/removing checkpointing/removing the ability to disable checkpointing 
support/


- Ben


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33154/#review80339
---


On April 14, 2015, 1:46 a.m., Ben Mahler wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33154/
 ---
 
 (Updated April 14, 2015, 1:46 a.m.)
 
 
 Review request for mesos and Vinod Kone.
 
 
 Bugs: MESOS-2485
 https://issues.apache.org/jira/browse/MESOS-2485
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See [MESOS-2485](https://issues.apache.org/jira/browse/MESOS-2485).
 
 
 Diffs
 -
 
   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
   src/master/master.hpp 6141917644b84edfed9836fa0a005d55a36880e3 
   src/master/master.cpp 44b0a0147f5354824d86332a67b30018634c9a36 
   src/master/metrics.hpp 52a83289cfe7e6b6fd8d5bff0774ebe5ce51d0ed 
   src/master/metrics.cpp 14486bf7130250f561c9fb7a43c95f3fc1e76a4b 
 
 Diff: https://reviews.apache.org/r/33154/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Ben Mahler
 




Re: Review Request 31665: Added 'hashmap&lt;SlaveID, Resources&gt; (used|offered)ResourcesBySlaveId' to master::Framework.

2015-04-16 Thread Ben Mahler


 On April 15, 2015, 7:20 p.m., Ben Mahler wrote:
  src/master/master.hpp, lines 1138-1143
  https://reviews.apache.org/r/31665/diff/5/?file=927382#file927382line1138
 
  Thanks for this NOTE! We might want to take this opportunity to 
  articulate why we can safely keep the totals. Or, should we strip 
  non-scalars and store a 'totalUsedScalarResources'?
 
 Michael Park wrote:
 We can't do the former since we can't safely keep the totals. We might 
 want to do something like the latter but I think we may break some people 
 that way, and therefore should be done as a subsequent task. I've created 
 [MESOS-2623](https://issues.apache.org/jira/browse/MESOS-2623) to track this.

 We can't do the former since we can't safely keep the totals.

Ah ok great, can we articulate why for posterity? It's not immediately obvious 
to me. As in, is there a bug due to this? Or is it just that the non-scalar 
information is not correct, and there's not yet a bug due to this since we only 
expose scalars? Something else?


 On April 15, 2015, 7:20 p.m., Ben Mahler wrote:
  src/master/master.hpp, lines 1145-1147
  https://reviews.apache.org/r/31665/diff/5/?file=927382#file927382line1145
 
  How about s/usedResources/totalUsedResources/ and 
  s/usedResourcesBySlaveId/usedResources/ ?
  
  This might also make the total referred to in your note a bit 
  clearer. Ditto for offered.
 
 Michael Park wrote:
 Yeah, I considered `totalUsedResources` and `usedResources` and I liked 
 those. I went with `usedResources` and `usedResourcesBySlaveId` because I 
 thought exposing `totalUsedResources` via `used_resources`, then having to 
 expose `usedResources` via some other key like `used_resources_by_slave_id` 
 would create confusion later on. This way `usedResources` maps to 
 `used_resources` and `usedResourcesBySlaveId` can map to 
 `used_resources_by_slave_id`. Unless we decide that we're never going to 
 expose `usedResourcesBySlaveId`, in which case we'll have 
 `totalUsedResources` mapped to `used_resources` and that'll be it.
 
 What do you think?

That makes sense, although, since we both had the same initial intuition, 
others may as well. I think it's ok not to allow the JSON format to affect our 
variable naming, since we want to make sure that the code is as readable / 
understandable as possible.

I like the clarified total and used naming scheme you considered initially 
because it allows us to make it clear that used resources have to be 
namespaced by slaves to be correctly represented, whereas the total only makes 
sense for scalars (we can have a note about that). Someone is more likely to 
ponder this, especially if it's called `'totalUsedScalarResources'`. Feel free 
to punt on stripping non-scalars if you want to tackle it later!


- Ben


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31665/#review80252
---


On April 15, 2015, 10:45 p.m., Michael Park wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/31665/
 ---
 
 (Updated April 15, 2015, 10:45 p.m.)
 
 
 Review request for mesos, Alexander Rukletsov and Ben Mahler.
 
 
 Bugs: MESOS-2373
 https://issues.apache.org/jira/browse/MESOS-2373
 
 
 Repository: mesos
 
 
 Description
 ---
 
 `master::Framework` holds 2 member variables of type `Resources`: 
 `usedResources` and `offeredResources`. Both of these are aggregates of 
 resources from multiple slaves. We add the `hashmap<SlaveID, Resources>` 
 versions since we can lose non-scalar resources by summing them up across 
 multiple slaves. For further details refer to 
 [MESOS-2373](https://issues.apache.org/jira/browse/MESOS-2373).
 
 In [r31666](https://reviews.apache.org/r/31666/), we report 
 `usedResourcesBySlaveId` rather than `usedResources` when adding the 
 framework to the allocator.
 
 We don't actually use `offeredResourcesBySlaveId` currently, but it should be 
 kept for symmetry. There will be a ticket to expose `usedResourcesBySlaveId` 
 as well as `offeredResourcesBySlaveId` via the HTTP endpoint.
 
 
 Diffs
 -
 
   src/master/master.hpp 6141917644b84edfed9836fa0a005d55a36880e3 
 
 Diff: https://reviews.apache.org/r/31665/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Michael Park
 




Re: Review Request 31665: Added 'hashmap&lt;SlaveID, Resources&gt; (used|offered)ResourcesBySlaveId' to master::Framework.

2015-04-16 Thread Ben Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31665/#review80370
---



src/master/master.hpp
https://reviews.apache.org/r/31665/#comment130237

For example, could we document in a NOTE on these total variables that 
subtraction is safe even though non-scalars lose information?

Specifically, is my understanding correct here: 

(1) For overlapping set items / ranges across slaves, these will get added 
N times but only represented 1 time in the Resources. 
(2) So, when an initial subtraction occurs (N-1), the resource is no longer 
represented. We don't expose non-scalars for now, so this is safe for now.
(3) When any further subtractions occur (N-(1+M)), the Resources simply 
ignores the subtraction since there's nothing to remove, so this is safe for 
now.
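
A toy illustration of (1)-(3), with std::set standing in for the non-scalar
parts of Resources (an assumption made purely for illustration; the real
Resources type differs):

    #include <iostream>
    #include <set>

    int main() {
      // The same port numbers in use on two different slaves: genuinely
      // distinct resources, but identical as set items.
      const std::set<int> slaveA = {31000, 31001};
      const std::set<int> slaveB = {31000, 31001};

      std::set<int> total;  // cross-slave "total" built by set union
      total.insert(slaveA.begin(), slaveA.end());
      total.insert(slaveB.begin(), slaveB.end());
      std::cout << total.size() << "\n";  // 2, not 4: added N times but
                                          // represented once -- point (1)

      // Subtracting slave A's ports empties the total even though slave B's
      // tasks still hold those ports -- point (2); any further subtraction
      // is a no-op on the already-empty set -- point (3).
      for (int port : slaveA) { total.erase(port); }
      std::cout << total.size() << "\n";  // 0
    }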


- Ben Mahler


On April 15, 2015, 10:45 p.m., Michael Park wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/31665/
 ---
 
 (Updated April 15, 2015, 10:45 p.m.)
 
 
 Review request for mesos, Alexander Rukletsov and Ben Mahler.
 
 
 Bugs: MESOS-2373
 https://issues.apache.org/jira/browse/MESOS-2373
 
 
 Repository: mesos
 
 
 Description
 ---
 
 `master::Framework` holds 2 member variables of type `Resources`: 
 `usedResources` and `offeredResources`. Both of these are aggregates of 
 resources from multiple slaves. We add the `hashmap<SlaveID, Resources>` 
 versions since we can lose non-scalar resources by summing them up across 
 multiple slaves. For further details refer to 
 [MESOS-2373](https://issues.apache.org/jira/browse/MESOS-2373).
 
 In [r31666](https://reviews.apache.org/r/31666/), we report 
 `usedResourcesBySlaveId` rather than `usedResources` when adding the 
 framework to the allocator.
 
 We don't actually use `offeredResourcesBySlaveId` currently, but it should be 
 kept for symmetry. There will be a ticket to expose `usedResourcesBySlaveId` 
 as well as `offeredResourcesBySlaveId` via the HTTP endpoint.
 
 
 Diffs
 -
 
   src/master/master.hpp 6141917644b84edfed9836fa0a005d55a36880e3 
 
 Diff: https://reviews.apache.org/r/31665/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Michael Park
 




Re: [jira] [Commented] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread

2015-04-16 Thread Benjamin Mahler
 __
 gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.651608064+00:00
 I0416 12:21:01.651600 368406528 process.cpp:2091] Resuming
 help@127.0.0.1:52945 at 2015-04-16 19:21:01.651613952+00:00
 I0416 12:21:01.651583 367869952 process.cpp:2091] Resuming authorizer(1)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.651602944+00:00
 I0416 12:21:01.651657 369479680 process.cpp:2091] Resuming __latch__(1)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.651691008+00:00
 I0416 12:21:01.651705 366260224 process.cpp:2081] Spawned process
 __latch__(1)@127.0.0.1:52945
 I0416 12:21:01.651888 366260224 recover.cpp:145] Broadcasting recover
 request to all replicas
 I0416 12:21:01.651996 368406528 process.cpp:2091] Resuming (1)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.652011008+00:00
 I0416 12:21:01.652019 36776 process.cpp:2091] Resuming
 metrics@127.0.0.1:52945 at 2015-04-16 19:21:01.652039168+00:00
 I0416 12:21:01.652148 366796800 process.cpp:2091] Resuming __
 gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.652165888+00:00
 I0416 12:21:01.652154 367869952 process.cpp:2091] Resuming (3)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.652169984+00:00
 I0416 12:21:01.652217 368406528 process.cpp:2081] Spawned process (3)@
 127.0.0.1:52945
 I0416 12:21:01.652369 366260224 process.cpp:2091] Resuming (3)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.652391936+00:00
 I0416 12:21:01.652400 368943104 process.cpp:2091] Resuming
 log-recover-protocol(1)@127.0.0.1:52945 at 2015-04-16
 19:21:01.652436992+00:00
 I0416 12:21:01.652465 368943104 recover.cpp:154] Broadcast request
 completed
 I0416 12:21:01.652479 368406528 process.cpp:2091] Resuming log-replica(1)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.652500992+00:00
 I0416 12:21:01.652529 368406528 replica.cpp:641] Replica in EMPTY status
 received a broadcasted recover request
 I0416 12:21:01.652642 366260224 process.cpp:2091] Resuming (3)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.652664064+00:00
 I0416 12:21:01.652740 366260224 process.cpp:2198] Cleaning up (3)@
 127.0.0.1:52945
 I0416 12:21:01.652799 368943104 recover.cpp:195] Received a recover
 response from a replica in EMPTY status
 I0416 12:21:01.652814 368406528 process.cpp:2091] Resuming __
 gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.652847104+00:00
 I0416 12:21:01.652892 368406528 process.cpp:2091] Resuming __latch__(1)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.652925952+00:00
 I0416 12:21:01.652953 368406528 process.cpp:2198] Cleaning up __latch__(1)@
 127.0.0.1:52945
 I0416 12:21:01.653031 366796800 process.cpp:2091] Resuming __
 gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.653053952+00:00
 I0416 12:21:01.653064 367869952 process.cpp:2091] Resuming log-recover(1)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.653084160+00:00
 I0416 12:21:01.653147 367869952 recover.cpp:566] Updating replica status
 to STARTING
 I0416 12:21:01.653162 368943104 process.cpp:2198] Cleaning up
 log-recover-protocol(1)@127.0.0.1:52945
 I0416 12:21:01.653259 366796800 process.cpp:2091] Resuming __
 gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.653281024+00:00
 I0416 12:21:01.653301 368406528 process.cpp:2091] Resuming log-replica(1)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.653321984+00:00
 I0416 12:21:01.653651 368406528 leveldb.cpp:306] Persisting metadata (8
 bytes) to leveldb took 295us
 I0416 12:21:01.653693 368406528 replica.cpp:323] Persisted replica status
 to STARTING
 I0416 12:21:01.653681 365723648 process.cpp:2091] Resuming
 standalone-master-detector(1)@127.0.0.1:52945 at 2015-04-16
 19:21:01.653703936+00:00
 I0416 12:21:01.653806 365723648 process.cpp:2091] Resuming log-recover(1)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.653827072+00:00
 I0416 12:21:01.653833 366260224 process.cpp:2091] Resuming
 master@127.0.0.1:52945 at 2015-04-16 19:21:01.653851136+00:00
 I0416 12:21:01.653863 2105078528 process.cpp:2081] Spawned process
 master@127.0.0.1:52945
 I0416 12:21:01.653874 366260224 master.cpp:361] Master
 20150416-122101-16777343-52945-98969 (localhost) started on
 127.0.0.1:52945
 I0416 12:21:01.653918 365723648 recover.cpp:475] Replica is in STARTING
 status
 I0416 12:21:01.654013 365723648 process.cpp:2081] Spawned process
 log-recover-protocol(2)@127.0.0.1:52945
 I0416 12:21:01.654022 36776 process.cpp:2091] Resuming __
 gc__@127.0.0.1:52945 at 2015-04-16 19:21:01.654042880+00:00
 I0416 12:21:01.654026 368406528 process.cpp:2091] Resuming
 log-recover-protocol(2)@127.0.0.1:52945 at 2015-04-16
 19:21:01.654049024+00:00
 I0416 12:21:01.654101 368406528 recover.cpp:131] Starting to wait for
 enough quorum of replicas before running recovery protocol, expected quroum
 size: 1
 I0416 12:21:01.654170 367869952 process.cpp:2091] Resuming (1)@
 127.0.0.1:52945 at 2015-04-16 19:21:01.654191872+00:00
 I0416 12:21:01.653894 366260224 master.cpp:363] Flags at startup:
 --acls=permissive: false
 register_frameworks {
   principals {
 type: SOME
 values: test-principal
   }
   roles {
 type: SOME
 values: *
   }
 }
 run_tasks {
   principals {
 type: SOME

Re: Review Request 31665: Added 'hashmap&lt;SlaveID, Resources&gt; (used|offered)ResourcesBySlaveId' to master::Framework.

2015-04-16 Thread Michael Park


 On April 15, 2015, 7:20 p.m., Ben Mahler wrote:
  src/master/master.hpp, lines 1145-1147
  https://reviews.apache.org/r/31665/diff/5/?file=927382#file927382line1145
 
  How about s/usedResources/totalUsedResources/ and 
  s/usedResourcesBySlaveId/usedResources/ ?
  
  This might also make the total referred to in your note a bit 
  clearer. Ditto for offered.
 
 Michael Park wrote:
 Yeah, I considered `totalUsedResources` and `usedResources` and I liked 
 those. I went with `usedResources` and `usedResourcesBySlaveId` because I 
 thought exposing `totalUsedResources` via `used_resources`, then having to 
 expose `usedResources` via some other key like `used_resources_by_slave_id` 
 would create confusion later on. This way `usedResources` maps to 
 `used_resources` and `usedResourcesBySlaveId` can map to 
 `used_resources_by_slave_id`. Unless we decide that we're never going to 
 expose `usedResourcesBySlaveId`, in which case we'll have 
 `totalUsedResources` mapped to `used_resources` and that'll be it.
 
 What do you think?
 
 Ben Mahler wrote:
 That makes sense, although, since we both had the same initial intuition, 
 others may as well. I think it's ok not to allow the JSON format to affect 
 our variable naming, since we want to make sure that the code is as readable 
 / understandable as possible.
 
 I like the clarified total and used naming scheme you considered 
 initially because it allows us to make it clear that used resources have to 
 be namespaced by slaves to be correctly represented, whereas the total only 
 makes sense for scalars (we can have a note about that). Someone is more 
 likely to ponder this, especially if it's called 
 `'totalUsedScalarResources'`. Feel free to punt on stripping non-scalars if 
 you want to tackle it later!

 I think it's ok not to allow the JSON format to affect our variable naming, 
 since we want to make sure that the code is as readable / understandable as 
 possible.

I agree. To be specific, my worry is that it'll be confusing for 
`usedResources` not to be exposed as `used_resources`.
If you think that wouldn't be a big deal, I'm happy to change the naming 
scheme. Perhaps we can simply have a `NOTE` to help avoid the confusion.


- Michael


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31665/#review80252
---


On April 15, 2015, 10:45 p.m., Michael Park wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/31665/
 ---
 
 (Updated April 15, 2015, 10:45 p.m.)
 
 
 Review request for mesos, Alexander Rukletsov and Ben Mahler.
 
 
 Bugs: MESOS-2373
 https://issues.apache.org/jira/browse/MESOS-2373
 
 
 Repository: mesos
 
 
 Description
 ---
 
 `master::Framework` holds 2 member variables of type `Resources`: 
 `usedResources` and `offeredResources`. Both of these are aggregates of 
 resources from multiple slaves. We add the `hashmap<SlaveID, Resources>` 
 versions since we can lose non-scalar resources by summing them up across 
 multiple slaves. For further details refer to 
 [MESOS-2373](https://issues.apache.org/jira/browse/MESOS-2373).
 
 In [r31666](https://reviews.apache.org/r/31666/), we report 
 `usedResourcesBySlaveId` rather than `usedResources` when adding the 
 framework to the allocator.
 
 We don't actually use `offeredResourcesBySlaveId` currently, but it should be 
 kept for symmetry. There will be a ticket to expose `usedResourcesBySlaveId` 
 as well as `offeredResourcesBySlaveId` via the HTTP endpoint.
 
 
 Diffs
 -
 
   src/master/master.hpp 6141917644b84edfed9836fa0a005d55a36880e3 
 
 Diff: https://reviews.apache.org/r/31665/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Michael Park
 




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-16 Thread Jay Buffington


 On April 16, 2015, 5:39 p.m., Timothy Chen wrote:
  src/slave/slave.cpp, line 3088
  https://reviews.apache.org/r/33249/diff/1/?file=931231#file931231line3088
 
  Btw the framework and executor might not be there anymore if they are 
  terminated and cleaned up before the launch callback runs; see the checks 
  we do below.

Ah, great point.  Thanks!  I'll fix.


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review80352
---


On April 16, 2015, 3:16 p.m., Jay Buffington wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33249/
 ---
 
 (Updated April 16, 2015, 3:16 p.m.)
 
 
 Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
 
 
 Bugs: MESOS-2020
 https://issues.apache.org/jira/browse/MESOS-2020
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When Mesos is unable to launch the containerizer, the scheduler should
 get a TASK_FAILED with a status message that includes the error the
 containerizer encountered when trying to launch.
 
 Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
 
 Fixes MESOS-2020
 
 
 Diffs
 -
 
   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
   src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
   src/tests/containerizer.cpp 26b87ac6b16dfeaf84888e80296ef540697bd775 
   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
 
 Diff: https://reviews.apache.org/r/33249/diff/
 
 
 Testing
 ---
 
 I added a test case to slave_tests.cpp.  I also tried this with Aurora: 
 supplied a bogus Docker image URL and saw the docker pull failure stderr 
 message in Aurora's web UI.
 
 
 Thanks,
 
 Jay Buffington
 




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-16 Thread Jay Buffington


 On April 16, 2015, 5:32 p.m., Timothy Chen wrote:
  src/slave/slave.cpp, line 3085
  https://reviews.apache.org/r/33249/diff/1/?file=931231#file931231line3085
 
  We're already sending back a status update when the registration times 
  out, and if we send another one here the scheduler will actually get 
  two TASK_FAILED statuses instead.
  
  I think we should either populate the reason (that the containerizer 
  launch failed) when we send back the final status update, or make sure we 
  send just one here.
  
  The nice thing about having it handled in the timeout is that there are 
  fewer places in the slave where we do status updates, but with the caveat 
  that you wait for the timeout to occur, which is something I never really 
  liked.
  
  I think if we can make the code clean and make sure there is just one 
  status update propagated back, I'd rather see it happen here.

Sending a terminal update (TASK_FAILED) removes the task from 
`executor->queuedTasks`, so the scheduler won't get two status updates.  See 
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2112

I admit this is super confusing; in fact, when I ran the code the first time I 
was expecting to see two status updates. I pinged Vinod about it and he was 
confused too, and it took us a while to work through what was going on.

I am concerned that we are changing state in the callbacks that clean things 
up, so I'm open to moving it.  When you say timeout, are you referring to the 
Slave::sendExecutorTerminatedStatusUpdate method?


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review80347
---


On April 16, 2015, 3:16 p.m., Jay Buffington wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33249/
 ---
 
 (Updated April 16, 2015, 3:16 p.m.)
 
 
 Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
 
 
 Bugs: MESOS-2020
 https://issues.apache.org/jira/browse/MESOS-2020
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When Mesos is unable to launch the containerizer, the scheduler should
 get a TASK_FAILED with a status message that includes the error the
 containerizer encountered when trying to launch.
 
 Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
 
 Fixes MESOS-2020
 
 
 Diffs
 -
 
   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
   src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
   src/tests/containerizer.cpp 26b87ac6b16dfeaf84888e80296ef540697bd775 
   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
 
 Diff: https://reviews.apache.org/r/33249/diff/
 
 
 Testing
 ---
 
 I added a test case to slave_tests.cpp.  I also tried this with Aurora: 
 supplied a bogus Docker image URL and saw the docker pull failure stderr 
 message in Aurora's web UI.
 
 
 Thanks,
 
 Jay Buffington
 




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-16 Thread Timothy Chen


 On April 16, 2015, 5:32 p.m., Timothy Chen wrote:
  src/slave/slave.cpp, line 3085
  https://reviews.apache.org/r/33249/diff/1/?file=931231#file931231line3085
 
  We're already sending back a status update when the registration times 
  out, and if we send another one here the scheduler will actually get 
  two TASK_FAILED statuses instead.
  
  I think we should either populate the reason (that the containerizer 
  launch failed) when we send back the final status update, or make sure we 
  send just one here.
  
  The nice thing about having it handled in the timeout is that there are 
  fewer places in the slave where we do status updates, but with the caveat 
  that you wait for the timeout to occur, which is something I never really 
  liked.
  
  I think if we can make the code clean and make sure there is just one 
  status update propagated back, I'd rather see it happen here.
 
 Jay Buffington wrote:
 Sending a terminal update (TASK_FAILED) removes the task from 
 `executor->queuedTasks`, so the scheduler won't get two status updates.  See 
 https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2112
 
 I admit this is super confusing; in fact, when I ran the code the first 
 time I was expecting to see two status updates. I pinged Vinod about it and 
 he was confused too, and it took us a while to work through what was going on.
 
 I am concerned that we are changing state in the callbacks that clean 
 things up, so I'm open to moving it.  When you say timeout, are you 
 referring to the Slave::sendExecutorTerminatedStatusUpdate method?

Ah, I forgot about it too. I think comments will be great so we avoid confusion!


- Timothy


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review80347
---


On April 16, 2015, 3:16 p.m., Jay Buffington wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33249/
 ---
 
 (Updated April 16, 2015, 3:16 p.m.)
 
 
 Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
 
 
 Bugs: MESOS-2020
 https://issues.apache.org/jira/browse/MESOS-2020
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When Mesos is unable to launch the containerizer, the scheduler should
 get a TASK_FAILED with a status message that includes the error the
 containerizer encountered when trying to launch.
 
 Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
 
 Fixes MESOS-2020
 
 
 Diffs
 -
 
   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
   src/slave/slave.cpp a0595f93ce4720f5b9926326d01210460ccb0667 
   src/tests/containerizer.cpp 26b87ac6b16dfeaf84888e80296ef540697bd775 
   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
 
 Diff: https://reviews.apache.org/r/33249/diff/
 
 
 Testing
 ---
 
 I added a test case to slave_tests.cpp.  I also tried this with Aurora: 
 supplied a bogus Docker image URL and saw the docker pull failure stderr 
 message in Aurora's web UI.
 
 
 Thanks,
 
 Jay Buffington
 




Re: Review Request 33155: Added commented-out tests for slave removal metrics.

2015-04-16 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33155/#review80397
---



src/tests/master_tests.cpp
https://reviews.apache.org/r/33155/#comment130286

I'm a bit confused about how the slave's lifetime affects the metrics 
collection on the master endpoint.


- Vinod Kone


On April 14, 2015, 1:46 a.m., Ben Mahler wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33155/
 ---
 
 (Updated April 14, 2015, 1:46 a.m.)
 
 
 Review request for mesos and Vinod Kone.
 
 
 Bugs: MESOS-2485
 https://issues.apache.org/jira/browse/MESOS-2485
 
 
 Repository: mesos
 
 
 Description
 ---
 
 I updated the existing tests to validate the metrics. However, this blocks 
 the tests because the slave metrics never get satisfied. Added a TODO.
 
 
 Diffs
 -
 
   src/tests/master_tests.cpp 32b1e9bb58d8046e5363fafe2ab8f662b6c9a666 
   src/tests/partition_tests.cpp 1018e479a77a6b533f2dd392fedbdccb80e3ac1f 
   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
 
 Diff: https://reviews.apache.org/r/33155/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Ben Mahler
 




Re: Review Request 33266: Check patch command in configure.ac

2015-04-16 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33266/#review80356
---


Patch looks great!

Reviews applied: [33266]

All tests passed.

- Mesos ReviewBot


On April 16, 2015, 5:30 p.m., haosdent huang wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33266/
 ---
 
 (Updated April 16, 2015, 5:30 p.m.)
 
 
 Review request for mesos and Till Toenshoff.
 
 
 Bugs: MESOS-2624
 https://issues.apache.org/jira/browse/MESOS-2624
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Check patch command in configure.ac
 
 
 Diffs
 -
 
   configure.ac 7f9e52916b9d78f2bbff9d6ed9871444a0fda629 
 
 Diff: https://reviews.apache.org/r/33266/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 haosdent huang