Re: Review Request 66644: Remove unknown unreachable tasks when agent reregisters.

2018-05-01 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66644/
---

(Updated May 2, 2018, 5:18 a.m.)


Review request for mesos and Jiang Yan Xu.


Bugs: 8750
https://issues.apache.org/jira/browse/8750


Repository: mesos


Description (updated)
---

A RunTask messsage could get dropped for an agent while its
disconnected from the master and when such an agent goes unreachable
then this dropped task message gets added to the unreachable tasks.
When the agent reregisters, the master sends status updates for the
tasks that the agent reported when re-registering and these tasks are
also removed from the unreachableTasks on the framework but since the
agent doesn't know about the dropped task so it doesn't get removed
from the unreachableTasks leading to a check failure when
this inconsistency is detected during framework removal.


Diffs (updated)
-

  src/master/master.hpp 0d9620dd0c232dc1df83477e838eeb7313bf8828 
  src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a 
  src/tests/partition_tests.cpp 9138e5c745cf354a3573e1ab0b251d46702833cc 


Diff: https://reviews.apache.org/r/66644/diff/5/

Changes: https://reviews.apache.org/r/66644/diff/4-5/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 66644: Remove unknown unreachable tasks when agent reregisters.

2018-05-01 Thread Megha Sharma


> On May 1, 2018, 8:50 p.m., Jiang Yan Xu wrote:
> > src/tests/partition_tests.cpp
> > Lines 171 (patched)
> > <https://reviews.apache.org/r/66644/diff/3-4/?file=2012149#file2012149line171>
> >
> > This seems to be exceeded the 80 character limit.

For some reason the pre-commit hook didn't complain for this.


> On May 1, 2018, 8:50 p.m., Jiang Yan Xu wrote:
> > src/tests/partition_tests.cpp
> > Lines 47 (patched)
> > <https://reviews.apache.org/r/66644/diff/4/?file=2012299#file2012299line47>
> >
> > It doesn't look like this is used?

Aah yes that's right.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66644/#review202215
---


On April 25, 2018, 3:46 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66644/
> ---
> 
> (Updated April 25, 2018, 3:46 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: 8750
> https://issues.apache.org/jira/browse/8750
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> A RunTask messsage could get dropped for an agent while it's
> disconnected from the master and when such an agent goes unreachable
> then this dropped task gets added to the unreachable tasks.
> When the agent reregisters, tasks reported by it are removed from the 
> unreachableTasks bookkeeping on the master but since the
> agent doesn't know about the dropped task so it doesn't get removed
> from the unreachableTasks leading to a master check failure when
> this inconsistency is detected during framework removal.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 0d9620dd0c232dc1df83477e838eeb7313bf8828 
>   src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a 
>   src/tests/partition_tests.cpp 9138e5c745cf354a3573e1ab0b251d46702833cc 
> 
> 
> Diff: https://reviews.apache.org/r/66644/diff/4/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 66644: Remove unknown unreachable tasks when agent reregisters.

2018-04-27 Thread Megha Sharma


> On April 25, 2018, 10:39 p.m., Jiang Yan Xu wrote:
> >

Yan, Gaston, Please take another look. I have addressed your comments. Thanks!


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66644/#review201963
---


On April 25, 2018, 3:46 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66644/
> ---
> 
> (Updated April 25, 2018, 3:46 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: 8750
> https://issues.apache.org/jira/browse/8750
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> A RunTask messsage could get dropped for an agent while it's
> disconnected from the master and when such an agent goes unreachable
> then this dropped task gets added to the unreachable tasks.
> When the agent reregisters, tasks reported by it are removed from the 
> unreachableTasks bookkeeping on the master but since the
> agent doesn't know about the dropped task so it doesn't get removed
> from the unreachableTasks leading to a master check failure when
> this inconsistency is detected during framework removal.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 0d9620dd0c232dc1df83477e838eeb7313bf8828 
>   src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a 
>   src/tests/partition_tests.cpp 9138e5c745cf354a3573e1ab0b251d46702833cc 
> 
> 
> Diff: https://reviews.apache.org/r/66644/diff/4/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 66769: Fixed flaky ReconciliationTest.ReconcileStatusUpdateTaskState.

2018-04-25 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66769/#review201944
---


Ship it!




Ship It!

- Megha Sharma


On April 23, 2018, 10:23 p.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66769/
> ---
> 
> (Updated April 23, 2018, 10:23 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik, Alexander Rukletsov, and Megha 
> Sharma.
> 
> 
> Bugs: MESOS-8618
> https://issues.apache.org/jira/browse/MESOS-8618
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> To simulate a master failover we need to use `replicated_log` as the
> registry otherwise the master loses persisted info about the agents.
> 
> 
> Diffs
> -
> 
>   src/tests/reconciliation_tests.cpp b06244b5f3efad3a88e367cd8e26cebd3d9f2e43 
> 
> 
> Diff: https://reviews.apache.org/r/66769/diff/1/
> 
> 
> Testing
> ---
> 
> Make check.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>



Review Request 66644: Remove unknown unreachable tasks when agent re-registers.

2018-04-23 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66644/
---

Review request for mesos and Jiang Yan Xu.


Bugs: 8750
https://issues.apache.org/jira/browse/8750


Repository: mesos


Description
---

A RunTask messsage could get dropped for an agent while it's
disconnected from the master and when such an agent goes unreachable
then this dropped task message gets added to the unreachable tasks.
When the agent re-registers, the master sends status updates for the
tasks that the agent reported when re-registering and these tasks are
also removed from the unreachableTasks on the framework but since the
agent doesn't know about the dropped task so it doesn't get removed
from the unreachableTasks leading to a check failure when
this inconsistency is detected during framework removal.


Diffs
-

  src/master/master.hpp 0d9620dd0c232dc1df83477e838eeb7313bf8828 
  src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a 
  src/tests/partition_tests.cpp 9138e5c745cf354a3573e1ab0b251d46702833cc 


Diff: https://reviews.apache.org/r/66644/diff/1/


Testing
---


Thanks,

Megha Sharma



Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-12 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/
---

(Updated Dec. 12, 2017, 12:54 p.m.)


Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.


Changes
---

tested with make check


Bugs: MESOS-6406
https://issues.apache.org/jira/browse/MESOS-6406


Repository: mesos


Description
---

Master will send task status updates to frameworks when an agent
which has been previously removed by the master for being unreachable
or is unknown to the master due to the garbage collection of
the unreachable and gone agents in the registry re-registers.


Diffs (updated)
-

  src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
  src/tests/master_allocator_tests.cpp 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
  src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
  src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
  src/tests/persistent_volume_tests.cpp 
4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
  src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
  src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 


Diff: https://reviews.apache.org/r/64098/diff/13/

Changes: https://reviews.apache.org/r/64098/diff/12-13/


Testing
---

with make check


Thanks,

Megha Sharma



Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-08 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/
---

(Updated Dec. 8, 2017, 6:42 p.m.)


Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.


Bugs: MESOS-6406
https://issues.apache.org/jira/browse/MESOS-6406


Repository: mesos


Description
---

Master will send task status updates to frameworks when an agent
which has been previously removed by the master for being unreachable
or is unknown to the master due to the garbage collection of
the unreachable and gone agents in the registry re-registers.


Diffs (updated)
-

  src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
  src/tests/master_allocator_tests.cpp 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
  src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
  src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
  src/tests/persistent_volume_tests.cpp 
4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
  src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
  src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 


Diff: https://reviews.apache.org/r/64098/diff/12/

Changes: https://reviews.apache.org/r/64098/diff/11-12/


Testing
---

with make check


Thanks,

Megha Sharma



Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-04 Thread Megha Sharma


> On Dec. 2, 2017, 12:41 a.m., Megha Sharma wrote:
> > src/tests/upgrade_tests.cpp
> > Line 138 (original), 138 (patched)
> > <https://reviews.apache.org/r/64098/diff/8/?file=1906538#file1906538line138>
> >
> > Sorry my bad looks like the fix for this test is still needed. I am not 
> > clear though what do you mean by ignoring the update? As I understand 
> > setting the expectation is unavoidable coz if we don't then the test fails 
> > claiming that previously set expectation here is being called more than 
> > once.
> 
> Megha Sharma wrote:
> ^^Ilya
> 
> Ilya Pronin wrote:
> Yeah, we'll most likely need to set an expectation, but we can make it 
> just `WillRepeatedly(Return())` and omit the future, because we don't really 
> care about that status update in those tests.

This expectation is not needed anymore in the test after using the registry.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review192607
---


On Dec. 2, 2017, 12:29 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Dec. 2, 2017, 12:29 a.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> which has been previously removed by the master for being unreachable
> or is unknown to the master due to the garbage collection of
> the unreachable and gone agents in the registry re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
>   src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
>   src/tests/persistent_volume_tests.cpp 
> 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
>   src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/11/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-04 Thread Megha Sharma


> On Dec. 4, 2017, 6:43 p.m., Jiang Yan Xu wrote:
> > src/tests/master_allocator_tests.cpp
> > Lines 1452-1454 (original), 1452-1459 (patched)
> > <https://reviews.apache.org/r/64098/diff/8/?file=1906533#file1906533line1452>
> >
> > So these tests had to be changed becauase the master is not using the 
> > `replicated_log` registry flag so it cannot remember the agent...
> > 
> > I think the right change is to use `replicated_log` for the tests that 
> > don't make sense to change for this patch.
> > 
> > Note that the use of replicated log registrar breaks on Windows (see 
> > MESOS-5932) so we need to make sure such tests are disabled on Windows.
> > 
> > The way to do it is to check whether the file is in this list: 
> > https://github.com/apache/mesos/blob/42952093c081e494d36721bc6c56ab73069a82e4/src/tests/CMakeLists.txt#L152
> > 
> > If it is, then it is already not built on Windows.
> > 
> > If not, we can use `TEST_F_TEMP_DISABLED_ON_WINDOWS` as part of the 
> > test name to disable it.
> > 
> > Please review all the tests in this RR for this.
> 
> Megha Sharma wrote:
> Thank you Yan, Good find. Working on fixing the tests.

Updated the RR, verified that all tests but one (`partition_tests.cpp`) are 
present in the section `if (NOT WIN32)` of the mentioned file. I have added 
`TEST_F_TEMP_DISABLED_ON_WINDOWS` to individual tests in `partition_tests.cpp` 
which have been updated to use replicated_logs in this RR.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review192733
---


On Dec. 2, 2017, 12:29 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Dec. 2, 2017, 12:29 a.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> which has been previously removed by the master for being unreachable
> or is unknown to the master due to the garbage collection of
> the unreachable and gone agents in the registry re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
>   src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
>   src/tests/persistent_volume_tests.cpp 
> 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
>   src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/11/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-01 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review192607
---




src/tests/upgrade_tests.cpp
Line 138 (original), 138 (patched)
<https://reviews.apache.org/r/64098/#comment270859>

Sorry my bad looks like the fix for this test is still needed. I am not 
clear though what do you mean by ignoring the update? As I understand setting 
the expectation is unavoidable coz if we don't then the test fails claiming 
that previously set expectation here is being called more than once.


- Megha Sharma


On Dec. 2, 2017, 12:29 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Dec. 2, 2017, 12:29 a.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> which has been previously removed by the master for being unreachable
> or is unknown to the master due to the garbage collection of
> the unreachable and gone agents in the registry re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
>   src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
>   src/tests/persistent_volume_tests.cpp 
> 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
>   src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/8/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-01 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/
---

(Updated Dec. 2, 2017, 12:29 a.m.)


Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.


Bugs: MESOS-6406
https://issues.apache.org/jira/browse/MESOS-6406


Repository: mesos


Description
---

Master will send task status updates to frameworks when an agent
which has been previously removed by the master for being unreachable
or is unknown to the master due to the garbage collection of
the unreachable and gone agents in the registry re-registers.


Diffs (updated)
-

  src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
  src/tests/master_allocator_tests.cpp 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
  src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
  src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
  src/tests/persistent_volume_tests.cpp 
4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
  src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
  src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 


Diff: https://reviews.apache.org/r/64098/diff/8/

Changes: https://reviews.apache.org/r/64098/diff/7-8/


Testing
---

with make check


Thanks,

Megha Sharma



Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-01 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/
---

(Updated Dec. 2, 2017, 12:12 a.m.)


Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.


Bugs: MESOS-6406
https://issues.apache.org/jira/browse/MESOS-6406


Repository: mesos


Description (updated)
---

Master will send task status updates to frameworks when an agent
which has been previously removed by the master for being unreachable
or is unknown to the master due to the garbage collection of
the unreachable and gone agents in the registry re-registers.


Diffs (updated)
-

  src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
  src/tests/master_allocator_tests.cpp 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
  src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
  src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
  src/tests/persistent_volume_tests.cpp 
4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
  src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
  src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 


Diff: https://reviews.apache.org/r/64098/diff/7/

Changes: https://reviews.apache.org/r/64098/diff/6-7/


Testing
---

with make check


Thanks,

Megha Sharma



Re: Review Request 64250: Added new reasons task status update.

2017-12-01 Thread Megha Sharma


> On Dec. 1, 2017, 9:55 p.m., Ilya Pronin wrote:
> > docs/task-state-reasons.md
> > Lines 474-477 (patched)
> > <https://reviews.apache.org/r/64250/diff/2/?file=1906002#file1906002line474>
> >
> > I don't quite follow this note. A modified copy of which update? Should 
> > we just say that these updates reflect the states of the tasks reported by 
> > the agent upon its re-registration?
> 
> Megha Sharma wrote:
> Here, we are saying that master takes the actual state reported by the 
> agent and enriches it with reason and message before sending to the 
> framework. But I am open to changing it if you or Yan feel it doesn't convery 
> the idea.
> 
> Jiang Yan Xu wrote:
> I was suggesting it because this status is generated by the master the 
> same way they are generated upon reconciliation.
> 
> For `REASON_RECONCILIATION` in the same doc there is this note:
> 
> ```
> Note: Status updates with this reason are not the original ones, but 
> rather a modified copy that is re-sent from the master. In particular, the 
> original data and message fields are erased and the original reason field is 
> overwritten by REASON_RECONCILIATION .
> ```
> 
> It is the same for `REASON_SLAVE_REREGISTERED`. 
> 
> 
> Megha I just noticed that you didn't mention the erasure of `data` and 
> `message` fields, but perhaps this is not worth repeating. I was originally 
> suggesting refering to `REASON_RECONCILIATION` for details: something like: 
> `Status updates with this reason are a modified copies re-sent by the master. 
> See comments for REASON_RECONCILIATION.`
> 
> Would this be clearer?
> 
> "reflect the states of the tasks reported by the agent upon its 
> re-registration" this sentence is good too.
> 
> Ilya Pronin wrote:
> Oh, now I got where this came from :) Yeah, it would be clearer if there 
> was a mention that new states come from the agent re-registration message and 
> a reference to `REASON_RECONCILIATION` description. Or it should be enough to 
> jsut say that these are produced by the master and reflect the states 
> reported by the agent during re-registration.

PTAL, updated the description based on the discussion.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64250/#review192567
---


On Dec. 1, 2017, 9:06 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64250/
> ---
> 
> (Updated Dec. 1, 2017, 9:06 p.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new reasons `REASON_SLAVE_REREGISTERED` and
> `REASON_AGENT_REREGISTERED` for v0 and v1 task status update.
> The new reason will be used when master starts to send status
> update during the re-registration of an agent which was
> previously removed by the master because of being unreachable
> or is unknown to the master due to the garbage collection of
> the unreachable and gone agents in the registry.
> 
> 
> Diffs
> -
> 
>   docs/task-state-reasons.md 07f7927e28fe5624a6c9b74de731483aaa729ca3 
>   include/mesos/mesos.proto 11089b7d6f48f001ae64e8b0c67a2732684aa2fa 
>   include/mesos/v1/mesos.proto c496da927903f472c78d455b7fdf886d8fb692c4 
> 
> 
> Diff: https://reviews.apache.org/r/64250/diff/5/
> 
> 
> Testing
> ---
> 
> code changes verified with make check and the documents changes with markdown 
> viewer
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-01 Thread Megha Sharma


> On Dec. 1, 2017, 9:29 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Line 6808 (original), 6808 (patched)
> > <https://reviews.apache.org/r/64098/diff/5/?file=1905844#file1905844line6808>
> >
> > When considering the comment by Ilya in MESOS-6406 (i.e., what if 
> > agents GCed from the unreachable or gone list reregster?), seems like we 
> > can do this:
> > 
> > 1. Move down the line `slaves.recovered.erase(slaveInfo.id());` to 
> > after we process `recoveredTasks`.
> > 2. Instead of checking `slaves.unreachable.contains(slaveInfo.id()` we 
> > could check `!slaves.recovered.contains(slaveInfo.id()`
> > 3. Now we are sending status updates for two cases: reregistering 
> > unreachable agents or unknown agents (which could have been marked either 
> > unreachable or gone but we can't distiguish)
> > - We can distinguish unreachable and unknown in the task status 
> > message.
> > - We can probably log a warning about tasks from unknown agents.

Sounds good, thanks for fomalizing.


> On Dec. 1, 2017, 9:29 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 6818 (patched)
> > <https://reviews.apache.org/r/64098/diff/5/?file=1905844#file1905844line6818>
> >
> > s/REASON_AGENT_REREGISTERED/REASON_SLAVE_REREGISTERED/

Thanks!


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review192557
---


On Nov. 28, 2017, 12:55 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Nov. 28, 2017, 12:55 a.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> which has been previously removed by the master for being unreachable
> re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
>   src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
>   src/tests/persistent_volume_tests.cpp 
> 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
>   src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/6/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 64250: Added new reasons task status update.

2017-12-01 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64250/
---

(Updated Dec. 1, 2017, 9:06 p.m.)


Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.


Changes
---

done testing


Repository: mesos


Description
---

Added new reasons `REASON_SLAVE_REREGISTERED` and
`REASON_AGENT_REREGISTERED` for v0 and v1 task status update.
The new reason will be used when master starts to send status
update during the re-registeration of an agent which was
previosuly removed by the master because of being unreachable.


Diffs
-

  docs/task-state-reasons.md 07f7927e28fe5624a6c9b74de731483aaa729ca3 
  include/mesos/mesos.proto 11089b7d6f48f001ae64e8b0c67a2732684aa2fa 
  include/mesos/v1/mesos.proto c496da927903f472c78d455b7fdf886d8fb692c4 


Diff: https://reviews.apache.org/r/64250/diff/2/


Testing (updated)
---

with make check


Thanks,

Megha Sharma



Review Request 64250: Added new reasons task status update.

2017-12-01 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64250/
---

Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.


Repository: mesos


Description
---

Added new reasons `REASON_SLAVE_REREGISTERED` and
`REASON_AGENT_REREGISTERED` for v0 and v1 task status update.
The new reason will be used when master starts to send status
update during the re-registeration of an agent which was
previosuly removed by the master because of being unreachable.


Diffs
-

  docs/task-state-reasons.md 07f7927e28fe5624a6c9b74de731483aaa729ca3 
  include/mesos/mesos.proto 11089b7d6f48f001ae64e8b0c67a2732684aa2fa 
  include/mesos/v1/mesos.proto c496da927903f472c78d455b7fdf886d8fb692c4 


Diff: https://reviews.apache.org/r/64250/diff/1/


Testing
---


Thanks,

Megha Sharma



Re: Review Request 64098: Send status updates when agent re-registers.

2017-11-30 Thread Megha Sharma


> On Nov. 28, 2017, 7:22 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 6788 (patched)
> > <https://reviews.apache.org/r/64098/diff/3/?file=1902267#file1902267line6788>
> >
> > I think this type of status updates could benefit from a distinct 
> > reason for to make it more strongly typed.
> > 
> > Could you add
> > 
> > ```
> > v1/mesos.proto: REASON_AGENT_REREGISTERED
> > mesos.proto: REASON_SLAVE_REREGISTERED
> > ```
> > 
> > and use it?
> > 
> > Note that the tag numbers in the proto enum need to match.
> > 
> > Also note that there are also documentation related changes for these 
> > fields: 
> > https://github.com/apache/mesos/blob/master/docs/task-state-reasons.md
> > 
> > It's fine to make it a separate patch preceding this one.
> 
> Ilya Pronin wrote:
> Won't the new reason break any frameworks?
> 
> Jiang Yan Xu wrote:
> How will it break? We add new reasons all the time. This is indeed a new 
> sitaution that Mesos master sends status updates, how do you think the 
> framework will assert the reason otherwise?
> 
> Ilya Pronin wrote:
> Sure, the new reason reflects what's going on much better. I was thinking 
> that a framework that uses the old libmesos will get a "default" status 
> update reason, which is `REASON_COMMAND_EXECUTOR_FAILED`, and may act 
> incorrectly. But it seems that Aurora will be fine with such change.

Yeah it seems that the frameworks should be fine with the addition of another 
reason, I guess that's how we have been adding new reasons all along. The proto 
default behavior to interpret the unknown enum value as first one defined, is 
of course a problem but I believe the frameworks must already have some guards 
in place to avoid that.


- Megha


-------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review192041
---


On Nov. 28, 2017, 12:55 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Nov. 28, 2017, 12:55 a.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 9c450b9f592d9e09a468f537d9b500e97acc636b 
>   src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
>   src/tests/persistent_volume_tests.cpp 
> acfeac16884b00581a3523607ff26f44f6dca53a 
>   src/tests/slave_recovery_tests.cpp c864aa92d9ff128a89dbc25653385de25653f56a 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/4/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 64098: Send status updates when agent re-registers.

2017-11-30 Thread Megha Sharma


> On Nov. 28, 2017, 7:01 p.m., Ilya Pronin wrote:
> > src/master/master.cpp
> > Lines 6789 (patched)
> > <https://reviews.apache.org/r/64098/diff/3/?file=1902267#file1902267line6789>
> >
> > I think this is not specific to unreachable agents. Can be an agent 
> > that was recovered after failover.

Ilya, I agree the reason needs to be changed based on whether or not the agent 
was unreachable. Also, Yan and I dicussed more about the agent re-registeration 
scenarios in which the master should do a status update. If the master 
undergoes a failover then the current approach will make the master do status 
updates for all tasks on re-registering agents which will make the make the 
critical path of agent re-registeration slower. One good alternative was to do 
status updates for only unreachable agents. Since master already sent a 
TASK_LOST/TASK_UNREACHABLE for these so there should definitely be a followup. 
Most of the frameworks today already do frequent reconciliations upon 
re-registering with master so doing explicit status updates for re-registering 
agents due to failover seemed a bit unnecessary. How do you feel about the 
changed approach?


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review191979
-------


On Nov. 28, 2017, 12:55 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Nov. 28, 2017, 12:55 a.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 9c450b9f592d9e09a468f537d9b500e97acc636b 
>   src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
>   src/tests/persistent_volume_tests.cpp 
> acfeac16884b00581a3523607ff26f44f6dca53a 
>   src/tests/slave_recovery_tests.cpp c864aa92d9ff128a89dbc25653385de25653f56a 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/4/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-28 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 29, 2017, 12:59 a.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Changes
---

Tested with make check.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including
the ones that are still registered which was problematic because
the offer from this agent could still go to the same framework which
could then launch new tasks. The agent would then receive tasks
of the same framework and ignore them because it thinks the
framework is shutting down. The framework is not shutting down of
course, so from the master and the scheduler's perspective the task
is pending in STAGING forever until the next agent reregistration,
which could happen much later. This commit fixes the problem by
not shutting down the non-partition aware frameworks on such an
agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/26/

Changes: https://reviews.apache.org/r/61473/diff/25-26/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-28 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 28, 2017, 5:28 p.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including
the ones that are still registered which was problematic because
the offer from this agent could still go to the same framework which
could then launch new tasks. The agent would then receive tasks
of the same framework and ignore them because it thinks the
framework is shutting down. The framework is not shutting down of
course, so from the master and the scheduler's perspective the task
is pending in STAGING forever until the next agent reregistration,
which could happen much later. This commit fixes the problem by
not shutting down the non-partition aware frameworks on such an
agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/23/

Changes: https://reviews.apache.org/r/61473/diff/22-23/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-27 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 28, 2017, 12:56 a.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including
the ones that are still registered which was problematic because
the offer from this agent could still go to the same framework which
could then launch new tasks. The agent would then receive tasks
of the same framework and ignore them because it thinks the
framework is shutting down. The framework is not shutting down of
course, so from the master and the scheduler's perspective the task
is pending in STAGING forever until the next agent reregistration,
which could happen much later. This commit fixes the problem by
not shutting down the non-partition aware frameworks on such an
agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/22/

Changes: https://reviews.apache.org/r/61473/diff/21-22/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 64098: Send status updates when agent re-registers.

2017-11-27 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/
---

(Updated Nov. 28, 2017, 12:29 a.m.)


Review request for mesos, James Peach and Jiang Yan Xu.


Repository: mesos


Description
---

Master will send task status updates to frameworks when an agent
re-registers.


Diffs
-

  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/master_allocator_tests.cpp 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
  src/tests/master_tests.cpp 9c450b9f592d9e09a468f537d9b500e97acc636b 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
  src/tests/persistent_volume_tests.cpp 
acfeac16884b00581a3523607ff26f44f6dca53a 
  src/tests/slave_recovery_tests.cpp c864aa92d9ff128a89dbc25653385de25653f56a 
  src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 


Diff: https://reviews.apache.org/r/64098/diff/2/


Testing (updated)
---

with make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-27 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 27, 2017, 10:08 p.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including
the ones that are still registered which was problematic because
the offer from this agent could still go to the same framework which
could then launch new tasks. The agent would then receive tasks
of the same framework and ignore them because it thinks the
framework is shutting down. The framework is not shutting down of
course, so from the master and the scheduler's perspective the task
is pending in STAGING forever until the next agent reregistration,
which could happen much later. This commit fixes the problem by
not shutting down the non-partition aware frameworks on such an
agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/20/

Changes: https://reviews.apache.org/r/61473/diff/19-20/


Testing
---

make check


Thanks,

Megha Sharma



Review Request 64098: Send status updates when agent re-registers.

2017-11-27 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/
---

Review request for mesos, James Peach and Jiang Yan Xu.


Repository: mesos


Description
---

Master will send task status updates to frameworks when an agent
re-registers.


Diffs
-

  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/64098/diff/1/


Testing
---


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-27 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 27, 2017, 10:05 p.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Summary (updated)
-

Do not kill non partition aware tasks.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description (updated)
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including
the ones that are still registered which was problematic because
the offer from this agent could still go to the same framework which
could then launch new tasks. The agent would then receive tasks
of the same framework and ignore them because it thinks the
framework is shutting down. The framework is not shutting down of
course, so from the master and the scheduler's perspective the task
is pending in STAGING forever until the next agent reregistration,
which could happen much later. This commit fixes the problem by
not shutting down the non-partition aware frameworks on such an
agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/19/

Changes: https://reviews.apache.org/r/61473/diff/18-19/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Send status updates when agent re-registers.

2017-11-27 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 27, 2017, 10:02 p.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Changes
---

Split into multiple commits.


Summary (updated)
-

Send status updates when agent re-registers.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description (updated)
---

Master will send task status updates to frameworks when an agent
re-registers.


Diffs (updated)
-

  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/18/

Changes: https://reviews.apache.org/r/61473/diff/17-18/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-17 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 17, 2017, 10:35 p.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Changes
---

More test fixes.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler's perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/17/

Changes: https://reviews.apache.org/r/61473/diff/16-17/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-17 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 17, 2017, 9:27 p.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Changes
---

Fix for partition aware tests


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler's perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/16/

Changes: https://reviews.apache.org/r/61473/diff/15-16/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-17 Thread Megha Sharma


> On Sept. 29, 2017, 6:19 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 7169-7173 (patched)
> > <https://reviews.apache.org/r/61473/diff/7/?file=1819742#file1819742line7218>
> >
> > You have created this `newUpdate` but are not immediately using it, and 
> > I have to pay attention to notice that it is `update` instead of 
> > `newUpdate` that is passed to `updateTask()`.
> > 
> > So perhaps defer this change to after `updateTask` and `removeTask` are 
> > called? At that point because the only remaining use of the `update` is to 
> > send it out, you don't even need to make a copy any more. Just change the 
> > field.

Not applicable anymore.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review186615
-------


On Nov. 17, 2017, 7:26 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Nov. 17, 2017, 7:26 p.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler's perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
>   include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
>   src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
>   src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
>   src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
>   src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/15/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-17 Thread Megha Sharma


> On Sept. 29, 2017, 6:19 p.m., Jiang Yan Xu wrote:
> > src/master/master.hpp
> > Lines 2475-2477 (patched)
> > <https://reviews.apache.org/r/61473/diff/7/?file=1819741#file1819741line2478>
> >
> > Because this method takes a pointer, this mutation could affect future 
> > uses of it. Even though right now nothing that cares about the state 
> > follows the call of `addUnreachableTask`, it may still be good to not 
> > propagate the change.
> > 
> > How about doing it in `addUnreachableTask`?
> > 
> > ```
> > void addUnreachableTask(const Task& _task)
> > {
> >   Task* task = new Task(_task);
> > 
> >   // We have to use TASK_LOST for non-partition-aware frameworks
> >   // for backwards compatibility.
> >   if (!capabilities.partitionAware) {
> > task->set_state(TASK_LOST);
> >   }
> > 
> >   unreachableTasks.set(task.task_id(), process::Owned(task));
> > }
> > ```

I think this comment is not applicable anymore since we are not changing the 
state but just passing a boolean around to move the tasks in the right data 
structure.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review186615
---


On Nov. 17, 2017, 7:26 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Nov. 17, 2017, 7:26 p.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler's perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
>   include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
>   src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
>   src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
>   src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
>   src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/15/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-17 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 17, 2017, 7:26 p.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler's perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/15/

Changes: https://reviews.apache.org/r/61473/diff/14-15/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-17 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 17, 2017, 7:11 p.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler's perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/14/

Changes: https://reviews.apache.org/r/61473/diff/13-14/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-16 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 16, 2017, 8:04 p.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler's perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/13/

Changes: https://reviews.apache.org/r/61473/diff/12-13/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-16 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 16, 2017, 7:43 p.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/12/

Changes: https://reviews.apache.org/r/61473/diff/11-12/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-16 Thread Megha Sharma


> On Sept. 28, 2017, 9:32 p.m., James Peach wrote:
> > src/master/master.cpp
> > Lines 9476 (patched)
> > <https://reviews.apache.org/r/61473/diff/7/?file=1819742#file1819742line9525>
> >
> > This doesn't seem right to me. Even if the framework is not 
> > partition-aware, the master is still tracking these tasks as unreachable. I 
> > can't see a reason why we would want to fudge the metrics in this case.

We needed to manipulate the metrics for backwards compatibility to continue 
showing only tasks in state  TASK_UNREACHABLE as unreachable since the 
unreachable tasks datastructutre will now hold both tasks in both TASK_LOST and 
TASK_UNREACHABLE.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review186580
-------


On Nov. 16, 2017, 7:20 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Nov. 16, 2017, 7:20 a.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
>   include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
>   src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
>   src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
>   src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
>   src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/11/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-16 Thread Megha Sharma


> On Sept. 29, 2017, 6:19 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 7171-7172 (patched)
> > <https://reviews.apache.org/r/61473/diff/7/?file=1819742#file1819742line7220>
> >
> > This could be shortened to one line.
> > 
> > ```
> > update.mutable_status()->set_state(TASK_LOST);
> > ```

I think this one is not applicable any more.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review186615
-------


On Nov. 16, 2017, 7:20 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Nov. 16, 2017, 7:20 a.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
>   include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
>   src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
>   src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
>   src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
>   src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/11/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-15 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 16, 2017, 7:20 a.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/11/

Changes: https://reviews.apache.org/r/61473/diff/10-11/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-15 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 16, 2017, 5:43 a.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/10/

Changes: https://reviews.apache.org/r/61473/diff/9-10/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-15 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Nov. 16, 2017, 5:24 a.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
  include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
  src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
  src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
  src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
  src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 


Diff: https://reviews.apache.org/r/61473/diff/9/

Changes: https://reviews.apache.org/r/61473/diff/8-9/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-15 Thread Megha Sharma


> On Oct. 26, 2017, 8:23 a.m., Jiang Yan Xu wrote:
> > src/master/master.hpp
> > Lines 2591 (patched)
> > <https://reviews.apache.org/r/61473/diff/8/?file=1856822#file1856822line2591>
> >
> > You made a redundant copy here but I understand this line may go away 
> > anyways. :)

That's right!


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review188966
-------


On Oct. 16, 2017, 8:59 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Oct. 16, 2017, 8:59 a.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 42139bec519d36316e324ef921157c49cdf2d043 
>   src/master/master.hpp 0ddc98259f64b3921d08f5f4ec81543bb0826378 
>   src/master/master.cpp 3603878f02ae3dba82811a4a5770dd21ec790ef6 
>   src/tests/partition_tests.cpp 0597bd2afaa60121245e0d43b81ac223257e377a 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/8/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-10-16 Thread Megha Sharma


> On Sept. 30, 2017, 12:37 a.m., Jiang Yan Xu wrote:
> > As I commented on the JIRA, we should probably bring the work for 
> > MESOS-6406 into this JIRA because this JIRA wouldn't be complete without 
> > it. It certainly should be a different patch though.
> > 
> > For this patch we should also update the comments above `PARTITION_AWARE` 
> > API to reflect the change.
> > https://github.com/apache/mesos/blob/6eefc685ccf304d0fb8ed4ff9bc314197d77f078/include/mesos/mesos.proto#L336
> > 
> > Also please search the docs for necessary changes.

Agreed, will do MESOS-6406 as a separate patch.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review186737
-------


On Oct. 16, 2017, 8:59 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Oct. 16, 2017, 8:59 a.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 42139bec519d36316e324ef921157c49cdf2d043 
>   src/master/master.hpp 0ddc98259f64b3921d08f5f4ec81543bb0826378 
>   src/master/master.cpp 3603878f02ae3dba82811a4a5770dd21ec790ef6 
>   src/tests/partition_tests.cpp 0597bd2afaa60121245e0d43b81ac223257e377a 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/8/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-10-16 Thread Megha Sharma


> On Sept. 29, 2017, 6:19 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 7188-7190 (original), 7144-7146 (patched)
> > <https://reviews.apache.org/r/61473/diff/7/?file=1819742#file1819742line7188>
> >
> > Our handling of `TASK_UNREACHABLE` vs. `TASK_LOST` here is a little 
> > different than elsewhere so I think this warrants a bit of explanation.
> > 
> > e.g., 
> > ```
> > // Transition tasks to TASK_UNREACHABLE and remove (archive) them.
> > // We convert the task state to TASK_LOST if is the framework is not 
> > partition aware.
> > // However we only do the conversion right before the status update is 
> > sent out or the
> > // task is archived because the processing prior to then requires tasks 
> > to be of the 
> > // correct state TASK_UNREACHABLE.
> > ```
> > 
> > Does this sound right?

+1


> On Sept. 29, 2017, 6:19 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 8989-8990 (original), 8945-8946 (patched)
> > <https://reviews.apache.org/r/61473/diff/7/?file=1819742#file1819742line8994>
> >
> > This is going to send `TASK_UNREACHABLE` to the operator API 
> > subscribers even for NPA framework tasks. 
> > 
> > We should probably be consistent and send `TASK_LOST`.

Right, missed it. So, one way to solve it is to let the state be TASK_LOST for 
NPA and change it to TASK_UNREACHABLE just before calling removeTask() so the 
task goes to unreachable tasks datastructure.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review186615
---


On Oct. 16, 2017, 8:59 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Oct. 16, 2017, 8:59 a.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 42139bec519d36316e324ef921157c49cdf2d043 
>   src/master/master.hpp 0ddc98259f64b3921d08f5f4ec81543bb0826378 
>   src/master/master.cpp 3603878f02ae3dba82811a4a5770dd21ec790ef6 
>   src/tests/partition_tests.cpp 0597bd2afaa60121245e0d43b81ac223257e377a 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/8/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-10-16 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Oct. 16, 2017, 8:59 a.m.)


Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  src/master/http.cpp 42139bec519d36316e324ef921157c49cdf2d043 
  src/master/master.hpp 0ddc98259f64b3921d08f5f4ec81543bb0826378 
  src/master/master.cpp 3603878f02ae3dba82811a4a5770dd21ec790ef6 
  src/tests/partition_tests.cpp 0597bd2afaa60121245e0d43b81ac223257e377a 


Diff: https://reviews.apache.org/r/61473/diff/8/

Changes: https://reviews.apache.org/r/61473/diff/7-8/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-10-16 Thread Megha Sharma


> On Sept. 28, 2017, 9:32 p.m., James Peach wrote:
> > src/master/master.cpp
> > Line 6552 (original)
> > <https://reviews.apache.org/r/61473/diff/7/?file=1819742#file1819742line6552>
> >
> > `protobuf::isTerminalState` still defines `TASK_LOST` as a terminal 
> > state. However we are now refining it as non-terminal, so that seems 
> > inconsistent. I didn't see any discussion about why frameworks that are not 
> > partition-aware would be prepared to handle a transition from `TASK_LOST` 
> > to `TASK_RUNNING`.

The thing is for mesos versions <= 1.2, with strict registry disabled the 
agents were allowed to re-register with the master after having being removed 
due to consecutive healthcheck failures and as a result the frameworks could 
still receive TASK_RUNNING after TASK_LOST (due to agent removed). So, we are 
really changing the semantics of TASK_LOST.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review186580
-------


On Sept. 24, 2017, 10:46 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Sept. 24, 2017, 10:46 p.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 28d0393fb5962df4d731521265efd81a54e1e655 
>   src/master/master.hpp 05f88111afb4fa0e2baf57106e1479914c16a113 
>   src/master/master.cpp 6d84a26bff970b842b58dfb69dbf232ba5c16a20 
>   src/tests/partition_tests.cpp 0886f4890ac3fec6f38146946892769a99c3e68f 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/7/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-09-11 Thread Megha Sharma


> On Aug. 30, 2017, 1:05 a.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 8948-8950 (patched)
> > <https://reviews.apache.org/r/61473/diff/6/?file=1806155#file1806155line8990>
> >
> > If the framework is not partition aware, the `update` will already have 
> > a `TASK_LOST` state right?

Good point! Fixed in the new update.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review184107
-------


On Sept. 11, 2017, 9:23 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Sept. 11, 2017, 9:23 p.m.)
> 
> 
> Review request for mesos, Vinod Kone and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 28d0393fb5962df4d731521265efd81a54e1e655 
>   src/master/master.hpp 05f88111afb4fa0e2baf57106e1479914c16a113 
>   src/master/master.cpp 6d84a26bff970b842b58dfb69dbf232ba5c16a20 
>   src/tests/partition_tests.cpp 0886f4890ac3fec6f38146946892769a99c3e68f 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/7/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-09-11 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Sept. 11, 2017, 9:23 p.m.)


Review request for mesos, Vinod Kone and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  src/master/http.cpp 28d0393fb5962df4d731521265efd81a54e1e655 
  src/master/master.hpp 05f88111afb4fa0e2baf57106e1479914c16a113 
  src/master/master.cpp 6d84a26bff970b842b58dfb69dbf232ba5c16a20 
  src/tests/partition_tests.cpp 0886f4890ac3fec6f38146946892769a99c3e68f 


Diff: https://reviews.apache.org/r/61473/diff/7/

Changes: https://reviews.apache.org/r/61473/diff/6-7/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-08-28 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Aug. 28, 2017, 3:31 p.m.)


Review request for mesos, Vinod Kone and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  src/master/http.cpp 959091c8ec03b6ac7bcb5d21b04d2f7d5aff7d54 
  src/master/master.hpp b802fd153a10f6012cea381f153c28cc78cae995 
  src/master/master.cpp 7f38a5e21884546d4b4c866ca5918db779af8f99 
  src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 


Diff: https://reviews.apache.org/r/61473/diff/4/

Changes: https://reviews.apache.org/r/61473/diff/3-4/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-08-28 Thread Megha Sharma


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/http.cpp
> > Lines 342 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793873#file1793873line342>
> >
> > One empty line above.

Fixed


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/http.cpp
> > Lines 3203-3206 (original), 3215-3227 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793873#file1793873line3215>
> >
> > Should we only loop through the once? And in order to try not to 
> > duplicate similar lines, consider the following?
> > 
> > ```
> > foreachvalue (const Owned& _task, framework->unreachableTasks) {
> >   Owned task = _task;
> > 
> >   if (framework->capabilities.partitionAware) {
> > task = ...
> >   }
> >   
> >   frameworkTaskSummaries[frameworkId].count(*task.get());
> >   slaveTaskSummaries[task->slave_id()].count(*task.get());
> > }
> > ```

Fixed


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/http.cpp
> > Lines 4017-4027 (original), 4038-4063 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793873#file1793873line4038>
> >
> > Similar to above, don't duplicate lines.

Fixed


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/http.cpp
> > Lines 4167-4175 (original), 4203-4221 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793873#file1793873line4203>
> >
> > Ditto.

Fixed


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/master.hpp
> > Lines 2403-2408 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793874#file1793874line2406>
> >
> > Can we hold off creating a helper for this? IMO this helper is doing 
> > too little and not so much of an "abstraction", i.e., what the method does 
> > it not perfectly captured by the name of the mehod. Inlining 1) making a 
> > copy & 2) setting the state is not too much redudancy than calling this 
> > method.

Code changed, N/A anymore


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/master.hpp
> > Lines 2769-2771 (original), 2773-2775 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793874#file1793874line2776>
> >
> > Fix the comment.

Fixed


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 6398-6414 (original), 6389-6405 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793875#file1793875line6398>
> >
> > Fix the comments.

Fixed


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 6422 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793875#file1793875line6431>
> >
> > `erase` can handle the `!contains` case.

Fixed


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 6477-6480 (original), 6464-6467 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793875#file1793875line6478>
> >
> > Fix the comment.

Fixed


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Line 7137 (original), 7112 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793875#file1793875line7138>
> >
> > If our code stops making such distinction, I don't think the comment 
> > should continue to make such distinction.

Fixed


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 7132-7137 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793875#file1793875line7161>
> >
> > Do this only when we actually send an update i.e., 
> > `framework->connected()`?

Code changed, not applicable any more.


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 8928-8931 (original), 8907-8910 (patched)
> > <https://reviews.apache.org/r/61473/diff/3/?file=1793875#file1793875line8936>
> >
> > So for this update event we are going to send TASK_UNREACHABLE instead 
> > TASK_LOST, which is unintended?

Code changed, not applicable any more.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review182605
---


On Aug. 10, 2017, 4:07 p.m., Megha Sharma wrote:
> 
> ---

Re: Review Request 61473: Do not kill non partition aware tasks.

2017-08-25 Thread Megha Sharma


> On Aug. 10, 2017, 9 p.m., Jiang Yan Xu wrote:
> > Some of the comments below were made before I started to feel that we are 
> > probably doing too many conversions to justify storing these tasks in 
> > TASK_UNREACHABLE. Perhaps we can just store them in 
> > `Framework.unreachableTasks` but in TASK_LOST state?
> > 
> > It's possible that we can add another map `BoundedHashMap > process::Owned> unreachableNonPartitionAwareTasks;` for these tasks 
> > but it's clunky in the sense that you have to clarify that 
> > `unreachableTasks` is only for partition aware tasks but in fact all of 
> > these tasks belong to the same framework which is either parition aware or 
> > not, however with the possibility of changing capability... so it's 
> > probably easier to describe things if we just put all of them in 
> > `unreachableTasks` and simply say that "if the framework is not 
> > partition-aware, the tasks stored in `unreachableTasks` may be in 
> > `TASK_LOST`".
> > 
> > If we do that, then some of the comments below don't apply any more but I 
> > am keep them just for posterity (some styling issues etc).

Dicussed in person, +1 for keeping the non partition aware but unreachable 
tasks in Framework.unreachableTasks in state TASK_LOST.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review182605
---


On Aug. 10, 2017, 4:07 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Aug. 10, 2017, 4:07 p.m.)
> 
> 
> Review request for mesos, Vinod Kone and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 959091c8ec03b6ac7bcb5d21b04d2f7d5aff7d54 
>   src/master/master.hpp b802fd153a10f6012cea381f153c28cc78cae995 
>   src/master/master.cpp 7f38a5e21884546d4b4c866ca5918db779af8f99 
>   src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/3/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-08-08 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Aug. 9, 2017, 4:55 a.m.)


Review request for mesos and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  src/master/http.cpp 959091c8ec03b6ac7bcb5d21b04d2f7d5aff7d54 
  src/master/master.hpp b802fd153a10f6012cea381f153c28cc78cae995 
  src/master/master.cpp 7f38a5e21884546d4b4c866ca5918db779af8f99 
  src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 


Diff: https://reviews.apache.org/r/61473/diff/3/

Changes: https://reviews.apache.org/r/61473/diff/2-3/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-08-08 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Aug. 9, 2017, 4:48 a.m.)


Review request for mesos and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  src/master/http.cpp 959091c8ec03b6ac7bcb5d21b04d2f7d5aff7d54 
  src/master/master.hpp b802fd153a10f6012cea381f153c28cc78cae995 
  src/master/master.cpp 7f38a5e21884546d4b4c866ca5918db779af8f99 
  src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 


Diff: https://reviews.apache.org/r/61473/diff/2/

Changes: https://reviews.apache.org/r/61473/diff/1-2/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 61473: Do not kill non partition aware tasks.

2017-08-07 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review182310
---



We talked about 2 approaches and approach 2 seemed like a cleaner way to 
address the issue.

Approach 1: ShutdownFrameworkMessage

1. Upon agent re-registration, master will add tasks even for non-PA frameworks 
on this agent. This is needed by the master to do correct resource accounting 
and not offer resources already in use on this agent. We need to mutate the 
TaskState on the Task before adding them to the master's data structures since 
the TaskState might be non-terminal when the agent sends these tasks with 
ReregisterSlaveMessage. And the master has already sent TASK_LOST for these 
tasks to the frameworks so we need to set the TaskState to TASK_LOST so that 
any future reconciliations with the framework doesn't have this task 
transitioning from TASK_LOST to TASK_RUNNNG/TASK_STARTING. This is to avoid 
unnecessary confusion about task state as observed by the framework but indeed 
this could have happened with non-strict registry as well where the framework 
can actually receive a non terminal task state update after receiving a 
TASK_LOST for the same task in the past.

2. When the agent re-registers, the master will continue to send a 
ShutdownFrameworkMessage to the agent to kill the tasks pertaining to non-PA 
frameworks on the agent as it does today. An additional optional field will be 
added to the ShutdownFrameworkMessage to indicate whether or not the shutdown 
was initiated internally.

3. During framework shutdown the state of the framework is set to 
Framework::TERMINATING which prevents it from launching new tasks. Here, since 
the framework is not really terminating so in order to allow it to launch new 
tasks, the agent will not set the state to terminating if the 
ShutdownFrameworkMessage is generated internally.

4. The framework shutdown today doesn't generate any status updates which needs 
to change. The status updates will be sent if the framework shutdown is 
triggered internally, this is needed to remove the tasks of non-PA frameworks 
that got added when the agent re-registered (1).

Approach 2: Do not shutdown non-PA framework when agent re-registers and let 
the frameworks make the decision on what needs to be done when they receive 
non-terminal status updates for tasks for which they have already received a 
TASK_LOST. This hopefully won't break any frameworks since it could have 
happened in the past with non-strict registry as well and frameworks should be 
resilient enough to handle this scenario.

Let me know if I have missed anything

- Megha Sharma


On Aug. 7, 2017, 6:23 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Aug. 7, 2017, 6:23 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 959091c8ec03b6ac7bcb5d21b04d2f7d5aff7d54 
>   src/master/master.hpp b802fd153a10f6012cea381f153c28cc78cae995 
>   src/master/master.cpp 7f38a5e21884546d4b4c866ca5918db779af8f99 
>   src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/1/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 58898: Do not kill non partition aware tasks.

2017-08-07 Thread Megha Sharma


> On May 10, 2017, 11:24 p.m., Neil Conway wrote:
> > My apologies for the delay in reviewing this.
> > 
> > High-level comments:
> > 
> > (a) Can we improve the description of the problem in the commit summary? It 
> > took me quite a while to figure out what is actually going on here. My 
> > understanding is:
> > 
> > (1) Agent re-registers
> > (2) Master sends `ShutdownFrameworkMessage` for non-PA frameworks on the 
> > agent
> > (3) Master offers agent resources to framework
> > (4) Framework launches new task on agent _before the agent has finished 
> > shutting down the framework_
> > (5) Agent ignores launch task because it believes the framework is still 
> > terminating.
> > 
> > This is a race condition, because if the agent finished shutting down the 
> > framework (in #4) before the launch task was received, there would not be a 
> > problem. Is my understanding correct?
> > 
> > (2) I'd prefer a new unit test that constructs exactly this scenario, 
> > rather than changing existing unit tests.
> > 
> > (3) The new behavior is that the framework will receive `TASK_KILLED` for 
> > any non-PA tasks running on a partitioned agent that re-registers. This 
> > does not seem ideal, because `TASK_KILLED` _normally_ corresponds to a 
> > framework-initiated `killTask` operation.
> > 
> > (4) Thinking out loud, what if a custom executor ignores the `killTask` 
> > request? When shutting down a framework, it seems we forcibly destroy the 
> > container (via `containerizer->destroy()`), if the executor doesn't exit 
> > promptly upon receiving the framework shutdown request. AFAIK we don't have 
> > similar logic for `killTask` requests.
> > 
> > 3 + 4 suggests that maybe we want a different way to terminate the task on 
> > the agent -- let's discuss.
> 
> Megha Sharma wrote:
> Summarizing the 2 approaches we talked about.
> 
> Approach 1: ShutdownFrameworkMessage
> 
> 1. Upon agent re-registration, master will add tasks even for non-PA 
> frameworks on this agent. This is needed by the master to do correct resource 
> accounting and not offer resources already in use on this agent. We need to 
> mutate the TaskState on the Task before adding them to the master's data 
> structures since the TaskState might be non-terminal when the agent sends 
> these tasks with ReregisterSlaveMessage. And the master has already sent 
> TASK_LOST for these tasks to the frameworks so we need to set the TaskState 
> to TASK_LOST so that any future reconciliations with the framework doesn't 
> have this task transitioning from TASK_LOST to TASK_RUNNNG/TASK_STARTING. 
> This is to avoid unnecessary confusion about task state as observed by the 
> framework but indeed this could have happened with non-strict registry as 
> well where the framework can actually receive a non terminal task state 
> update after receiving a TASK_LOST for the same task in the past.
> 
> 2. When the agent re-registers, the master will continue to send a 
> ShutdownFrameworkMessage to the agent to kill the tasks pertaining to non-PA 
> frameworks on the agent as it does today. An additional optional field will 
> be added to the ShutdownFrameworkMessage to indicate whether or not the 
> shutdown was initiated internally.
> 
> 3. During framework shutdown the state of the framework is set to 
> Framework::TERMINATING which prevents it from launching new tasks. Here, 
> since the framework is not really terminating so in order to allow it to 
> launch new tasks, the agent will not set the state to terminating if the 
> ShutdownFrameworkMessage is generated internally.
> 
> 4. The framework shutdown today doesn't generate any status updates which 
> needs to change. The status updates will be sent if the framework shutdown is 
> triggered internally, this is needed to remove the tasks of non-PA frameworks 
> that got added when the agent re-registered (1).
> 
> Approach 2: Do not shutdown non-PA framework when agent re-registers and 
> let the frameworks make the decision on what needs to be done when they 
> receive non-terminal status updates for tasks for which they have already 
> received a TASK_LOST. This hopefully won't break any frameworks since it 
> could have happened in the past with non-strict registry as well and 
> frameworks should be resilient enough to handle this scenario.
> 
> Let me know if I have missed anything. Personally, I am in favor of 
> approach 1 as it seems less catastrophic for the frameworks. What do you 
> think?
> 
> Megha Sharma wrote:
>

Re: Review Request 61473: Do not kill non partition aware tasks.

2017-08-07 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

(Updated Aug. 7, 2017, 6:23 p.m.)


Review request for mesos and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs
-

  src/master/http.cpp 959091c8ec03b6ac7bcb5d21b04d2f7d5aff7d54 
  src/master/master.hpp b802fd153a10f6012cea381f153c28cc78cae995 
  src/master/master.cpp 7f38a5e21884546d4b4c866ca5918db779af8f99 
  src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 


Diff: https://reviews.apache.org/r/61473/diff/1/


Testing (updated)
---

make check


Thanks,

Megha Sharma



Review Request 61473: Do not kill non partition aware tasks.

2017-08-07 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/
---

Review request for mesos and Jiang Yan Xu.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs
-

  src/master/http.cpp 959091c8ec03b6ac7bcb5d21b04d2f7d5aff7d54 
  src/master/master.hpp b802fd153a10f6012cea381f153c28cc78cae995 
  src/master/master.cpp 7f38a5e21884546d4b4c866ca5918db779af8f99 
  src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 


Diff: https://reviews.apache.org/r/61473/diff/1/


Testing
---


Thanks,

Megha Sharma



Re: Review Request 58898: Do not kill non partition aware tasks.

2017-08-05 Thread Megha Sharma


> On May 10, 2017, 11:24 p.m., Neil Conway wrote:
> > src/master/master.cpp
> > Lines 6034 (patched)
> > <https://reviews.apache.org/r/58898/diff/2/?file=1704920#file1704920line6034>
> >
> > `tasksToKill`

Code changed, not relevant anymore


> On May 10, 2017, 11:24 p.m., Neil Conway wrote:
> > src/master/master.cpp
> > Lines 6057 (patched)
> > <https://reviews.apache.org/r/58898/diff/2/?file=1704920#file1704920line6057>
> >
> > "Keep"

Code changed, not relevant anymore


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58898/#review174194
-------


On Aug. 6, 2017, 4:35 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58898/
> ---
> 
> (Updated Aug. 6, 2017, 4:35 a.m.)
> 
> 
> Review request for mesos, Neil Conway and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including the
> ones that are still registered which was problematic because the offer
> from this agent could still go to the same framework which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler’s perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the non-partition
> aware frameworks on such an agent.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 959091c8ec03b6ac7bcb5d21b04d2f7d5aff7d54 
>   src/master/master.hpp b802fd153a10f6012cea381f153c28cc78cae995 
>   src/master/master.cpp 7f38a5e21884546d4b4c866ca5918db779af8f99 
>   src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 
> 
> 
> Diff: https://reviews.apache.org/r/58898/diff/3/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 58898: Do not kill non partition aware tasks.

2017-08-05 Thread Megha Sharma


> On May 10, 2017, 11:24 p.m., Neil Conway wrote:
> > My apologies for the delay in reviewing this.
> > 
> > High-level comments:
> > 
> > (a) Can we improve the description of the problem in the commit summary? It 
> > took me quite a while to figure out what is actually going on here. My 
> > understanding is:
> > 
> > (1) Agent re-registers
> > (2) Master sends `ShutdownFrameworkMessage` for non-PA frameworks on the 
> > agent
> > (3) Master offers agent resources to framework
> > (4) Framework launches new task on agent _before the agent has finished 
> > shutting down the framework_
> > (5) Agent ignores launch task because it believes the framework is still 
> > terminating.
> > 
> > This is a race condition, because if the agent finished shutting down the 
> > framework (in #4) before the launch task was received, there would not be a 
> > problem. Is my understanding correct?
> > 
> > (2) I'd prefer a new unit test that constructs exactly this scenario, 
> > rather than changing existing unit tests.
> > 
> > (3) The new behavior is that the framework will receive `TASK_KILLED` for 
> > any non-PA tasks running on a partitioned agent that re-registers. This 
> > does not seem ideal, because `TASK_KILLED` _normally_ corresponds to a 
> > framework-initiated `killTask` operation.
> > 
> > (4) Thinking out loud, what if a custom executor ignores the `killTask` 
> > request? When shutting down a framework, it seems we forcibly destroy the 
> > container (via `containerizer->destroy()`), if the executor doesn't exit 
> > promptly upon receiving the framework shutdown request. AFAIK we don't have 
> > similar logic for `killTask` requests.
> > 
> > 3 + 4 suggests that maybe we want a different way to terminate the task on 
> > the agent -- let's discuss.
> 
> Megha Sharma wrote:
> Summarizing the 2 approaches we talked about.
> 
> Approach 1: ShutdownFrameworkMessage
> 
> 1. Upon agent re-registration, master will add tasks even for non-PA 
> frameworks on this agent. This is needed by the master to do correct resource 
> accounting and not offer resources already in use on this agent. We need to 
> mutate the TaskState on the Task before adding them to the master's data 
> structures since the TaskState might be non-terminal when the agent sends 
> these tasks with ReregisterSlaveMessage. And the master has already sent 
> TASK_LOST for these tasks to the frameworks so we need to set the TaskState 
> to TASK_LOST so that any future reconciliations with the framework doesn't 
> have this task transitioning from TASK_LOST to TASK_RUNNNG/TASK_STARTING. 
> This is to avoid unnecessary confusion about task state as observed by the 
> framework but indeed this could have happened with non-strict registry as 
> well where the framework can actually receive a non terminal task state 
> update after receiving a TASK_LOST for the same task in the past.
> 
> 2. When the agent re-registers, the master will continue to send a 
> ShutdownFrameworkMessage to the agent to kill the tasks pertaining to non-PA 
> frameworks on the agent as it does today. An additional optional field will 
> be added to the ShutdownFrameworkMessage to indicate whether or not the 
> shutdown was initiated internally.
> 
> 3. During framework shutdown the state of the framework is set to 
> Framework::TERMINATING which prevents it from launching new tasks. Here, 
> since the framework is not really terminating so in order to allow it to 
> launch new tasks, the agent will not set the state to terminating if the 
> ShutdownFrameworkMessage is generated internally.
> 
> 4. The framework shutdown today doesn't generate any status updates which 
> needs to change. The status updates will be sent if the framework shutdown is 
> triggered internally, this is needed to remove the tasks of non-PA frameworks 
> that got added when the agent re-registered (1).
> 
> Approach 2: Do not shutdown non-PA framework when agent re-registers and 
> let the frameworks make the decision on what needs to be done when they 
> receive non-terminal status updates for tasks for which they have already 
> received a TASK_LOST. This hopefully won't break any frameworks since it 
> could have happened in the past with non-strict registry as well and 
> frameworks should be resilient enough to handle this scenario.
> 
> Let me know if I have missed anything. Personally, I am in favor of 
> approach 1 as it seems less catastrophic for the frameworks. What do you 
> think?
> 
> Megha Sharma wrote:
>

Re: Review Request 58898: Do not kill non partition aware tasks.

2017-08-05 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58898/
---

(Updated Aug. 6, 2017, 4:35 a.m.)


Review request for mesos, Neil Conway and Jiang Yan Xu.


Summary (updated)
-

Do not kill non partition aware tasks.


Bugs: MESOS-7215
https://issues.apache.org/jira/browse/MESOS-7215


Repository: mesos


Description (updated)
---

Master will not kill the tasks for non-Partition aware frameworks
when an unreachable agent re-registers with the master.
Master used to send a ShutdownFrameworkMessages to the agent
to kill the tasks from non partition aware frameworks including the
ones that are still registered which was problematic because the offer
from this agent could still go to the same framework which could then
launch new tasks. The agent would then receive tasks of the same
framework and ignore them because it thinks the framework is shutting
down. The framework is not shutting down of course, so from the master
and the scheduler’s perspective the task is pending in STAGING forever
until the next agent reregistration, which could happen much later.
This commit fixes the problem by not shutting down the non-partition
aware frameworks on such an agent.


Diffs (updated)
-

  src/master/http.cpp 959091c8ec03b6ac7bcb5d21b04d2f7d5aff7d54 
  src/master/master.hpp b802fd153a10f6012cea381f153c28cc78cae995 
  src/master/master.cpp 7f38a5e21884546d4b4c866ca5918db779af8f99 
  src/tests/partition_tests.cpp 62a84f797201ccd18b71490949e3130d2b9c3668 


Diff: https://reviews.apache.org/r/58898/diff/3/

Changes: https://reviews.apache.org/r/58898/diff/2-3/


Testing
---

make check


Thanks,

Megha Sharma



Re: Review Request 58898: Send task kill for non-Partition Aware frameworks.

2017-08-04 Thread Megha Sharma


> On May 10, 2017, 11:24 p.m., Neil Conway wrote:
> > My apologies for the delay in reviewing this.
> > 
> > High-level comments:
> > 
> > (a) Can we improve the description of the problem in the commit summary? It 
> > took me quite a while to figure out what is actually going on here. My 
> > understanding is:
> > 
> > (1) Agent re-registers
> > (2) Master sends `ShutdownFrameworkMessage` for non-PA frameworks on the 
> > agent
> > (3) Master offers agent resources to framework
> > (4) Framework launches new task on agent _before the agent has finished 
> > shutting down the framework_
> > (5) Agent ignores launch task because it believes the framework is still 
> > terminating.
> > 
> > This is a race condition, because if the agent finished shutting down the 
> > framework (in #4) before the launch task was received, there would not be a 
> > problem. Is my understanding correct?
> > 
> > (2) I'd prefer a new unit test that constructs exactly this scenario, 
> > rather than changing existing unit tests.
> > 
> > (3) The new behavior is that the framework will receive `TASK_KILLED` for 
> > any non-PA tasks running on a partitioned agent that re-registers. This 
> > does not seem ideal, because `TASK_KILLED` _normally_ corresponds to a 
> > framework-initiated `killTask` operation.
> > 
> > (4) Thinking out loud, what if a custom executor ignores the `killTask` 
> > request? When shutting down a framework, it seems we forcibly destroy the 
> > container (via `containerizer->destroy()`), if the executor doesn't exit 
> > promptly upon receiving the framework shutdown request. AFAIK we don't have 
> > similar logic for `killTask` requests.
> > 
> > 3 + 4 suggests that maybe we want a different way to terminate the task on 
> > the agent -- let's discuss.
> 
> Megha Sharma wrote:
> Summarizing the 2 approaches we talked about.
> 
> Approach 1: ShutdownFrameworkMessage
> 
> 1. Upon agent re-registration, master will add tasks even for non-PA 
> frameworks on this agent. This is needed by the master to do correct resource 
> accounting and not offer resources already in use on this agent. We need to 
> mutate the TaskState on the Task before adding them to the master's data 
> structures since the TaskState might be non-terminal when the agent sends 
> these tasks with ReregisterSlaveMessage. And the master has already sent 
> TASK_LOST for these tasks to the frameworks so we need to set the TaskState 
> to TASK_LOST so that any future reconciliations with the framework doesn't 
> have this task transitioning from TASK_LOST to TASK_RUNNNG/TASK_STARTING. 
> This is to avoid unnecessary confusion about task state as observed by the 
> framework but indeed this could have happened with non-strict registry as 
> well where the framework can actually receive a non terminal task state 
> update after receiving a TASK_LOST for the same task in the past.
> 
> 2. When the agent re-registers, the master will continue to send a 
> ShutdownFrameworkMessage to the agent to kill the tasks pertaining to non-PA 
> frameworks on the agent as it does today. An additional optional field will 
> be added to the ShutdownFrameworkMessage to indicate whether or not the 
> shutdown was initiated internally.
> 
> 3. During framework shutdown the state of the framework is set to 
> Framework::TERMINATING which prevents it from launching new tasks. Here, 
> since the framework is not really terminating so in order to allow it to 
> launch new tasks, the agent will not set the state to terminating if the 
> ShutdownFrameworkMessage is generated internally.
> 
> 4. The framework shutdown today doesn't generate any status updates which 
> needs to change. The status updates will be sent if the framework shutdown is 
> triggered internally, this is needed to remove the tasks of non-PA frameworks 
> that got added when the agent re-registered (1).
> 
> Approach 2: Do not shutdown non-PA framework when agent re-registers and 
> let the frameworks make the decision on what needs to be done when they 
> receive non-terminal status updates for tasks for which they have already 
> received a TASK_LOST. This hopefully won't break any frameworks since it 
> could have happened in the past with non-strict registry as well and 
> frameworks should be resilient enough to handle this scenario.
> 
> Let me know if I have missed anything. Personally, I am in favor of 
> approach 1 as it seems less catastrophic for the frameworks. What do you 
> think?

The proposed solution to fix the ra

Re: Review Request 60925: Updated CHANGELOG for changed semantic of recovery.

2017-07-17 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60925/
---

(Updated July 17, 2017, 5:35 p.m.)


Review request for mesos and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Updated CHANGELOG for changed semantic of recovery.


Diffs
-

  CHANGELOG 978747fc11c90a0d73d6815823474f974bc0c82e 
  docs/agent-recovery.md 1b3f2d2d2e630478a8d577eaa6474e3aedc3cb89 
  docs/upgrades.md dda55f90e2716e805fb145a369fe248cc178f5ac 


Diff: https://reviews.apache.org/r/60925/diff/2/


Testing
---

Testing done, renders correctly.


Thanks,

Megha Sharma



Re: Review Request 60925: Updated CHANGELOG for changed semantic of recovery.

2017-07-17 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60925/
---

(Updated July 18, 2017, 12:02 a.m.)


Review request for mesos and Jiang Yan Xu.


Bugs: 6223
https://issues.apache.org/jira/browse/6223


Repository: mesos


Description
---

Updated CHANGELOG for changed semantic of recovery.


Diffs
-

  CHANGELOG 978747fc11c90a0d73d6815823474f974bc0c82e 
  docs/agent-recovery.md 1b3f2d2d2e630478a8d577eaa6474e3aedc3cb89 
  docs/upgrades.md dda55f90e2716e805fb145a369fe248cc178f5ac 


Diff: https://reviews.apache.org/r/60925/diff/2/


Testing (updated)
---

Testing done, renders correctly.


Thanks,

Megha Sharma



Re: Review Request 60925: Updated CHANGELOG for changed semantic of recovery.

2017-07-17 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60925/
---

(Updated July 18, 2017, 12:02 a.m.)


Review request for mesos and Jiang Yan Xu.


Bugs: 6223
https://issues.apache.org/jira/browse/6223


Repository: mesos


Description
---

Updated CHANGELOG for changed semantic of recovery.


Diffs (updated)
-

  CHANGELOG 978747fc11c90a0d73d6815823474f974bc0c82e 
  docs/agent-recovery.md 1b3f2d2d2e630478a8d577eaa6474e3aedc3cb89 
  docs/upgrades.md dda55f90e2716e805fb145a369fe248cc178f5ac 


Diff: https://reviews.apache.org/r/60925/diff/2/

Changes: https://reviews.apache.org/r/60925/diff/1-2/


Testing
---


Thanks,

Megha Sharma



Re: Review Request 60925: Updated CHANGELOG for changed semantic of recovery.

2017-07-17 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60925/
---

(Updated July 17, 2017, 10:10 p.m.)


Review request for mesos and Jiang Yan Xu.


Bugs: 6223
https://issues.apache.org/jira/browse/6223


Repository: mesos


Description
---

Updated CHANGELOG for changed semantic of recovery.


Diffs
-

  CHANGELOG 978747fc11c90a0d73d6815823474f974bc0c82e 
  docs/agent-recovery.md 1b3f2d2d2e630478a8d577eaa6474e3aedc3cb89 
  docs/upgrades.md dda55f90e2716e805fb145a369fe248cc178f5ac 


Diff: https://reviews.apache.org/r/60925/diff/1/


Testing
---


Thanks,

Megha Sharma



Review Request 60925: Updated CHANGELOG for changed semantic of recovery.

2017-07-17 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60925/
---

Review request for mesos.


Repository: mesos


Description
---

Updated CHANGELOG for changed semantic of recovery.


Diffs
-

  CHANGELOG 978747fc11c90a0d73d6815823474f974bc0c82e 
  docs/agent-recovery.md 1b3f2d2d2e630478a8d577eaa6474e3aedc3cb89 
  docs/upgrades.md dda55f90e2716e805fb145a369fe248cc178f5ac 


Diff: https://reviews.apache.org/r/60925/diff/1/


Testing
---


Thanks,

Megha Sharma



Re: Review Request 56895: Added tests to ensure slave recovery post reboot.

2017-07-12 Thread Megha Sharma


> On July 12, 2017, 5:47 p.m., Jiang Yan Xu wrote:
> > LGTM. Just a few minor style issues.

Thanks, addressed the comments.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review180319
---


On July 13, 2017, 4:22 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated July 13, 2017, 4:22 a.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added tests to verify that the state is recovered post reboot and
> agentId is retained given the recovery finishes without errors and
> if the recovery fails due to slaveInfo mismatch then agent no
> state is recoverd making the agent register as a new agent.
> 
> 
> Diffs
> -
> 
>   src/tests/slave_recovery_tests.cpp f6eafcbe3d89c7a69c03db0fd7bc10ae73d06584 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/17/
> 
> 
> Testing
> ---
> 
> make check done together with 60104 and 60103
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Added tests to ensure slave recovery post reboot.

2017-07-12 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated July 13, 2017, 4:22 a.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Added tests to verify that the state is recovered post reboot and
agentId is retained given the recovery finishes without errors and
if the recovery fails due to slaveInfo mismatch then agent no
state is recoverd making the agent register as a new agent.


Diffs (updated)
-

  src/tests/slave_recovery_tests.cpp f6eafcbe3d89c7a69c03db0fd7bc10ae73d06584 


Diff: https://reviews.apache.org/r/56895/diff/17/

Changes: https://reviews.apache.org/r/56895/diff/16-17/


Testing
---

make check done together with 60104 and 60103


Thanks,

Megha Sharma



Re: Review Request 56895: Added tests to ensure slave recovery post reboot.

2017-07-11 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated July 12, 2017, 12:13 a.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Added tests to verify that the state is recovered post reboot and
agentId is retained given the recovery finishes without errors and
if the recovery fails due to slaveInfo mismatch then agent no
state is recoverd making the agent register as a new agent.


Diffs (updated)
-

  src/tests/slave_recovery_tests.cpp f6eafcbe3d89c7a69c03db0fd7bc10ae73d06584 


Diff: https://reviews.apache.org/r/56895/diff/16/

Changes: https://reviews.apache.org/r/56895/diff/15-16/


Testing
---

make check done together with 60104 and 60103


Thanks,

Megha Sharma



Re: Review Request 56895: Added tests to ensure slave recovery post reboot.

2017-07-11 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated July 11, 2017, 10:32 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Added tests to verify that the state is recovered post reboot and
agentId is retained given the recovery finishes without errors and
if the recovery fails due to slaveInfo mismatch then agent no
state is recoverd making the agent register as a new agent.


Diffs (updated)
-

  src/tests/slave_recovery_tests.cpp f6eafcbe3d89c7a69c03db0fd7bc10ae73d06584 


Diff: https://reviews.apache.org/r/56895/diff/15/

Changes: https://reviews.apache.org/r/56895/diff/14-15/


Testing
---

make check done together with 60104 and 60103


Thanks,

Megha Sharma



Re: Review Request 56895: Added tests to ensure slave recovery post reboot.

2017-07-11 Thread Megha Sharma


> On July 11, 2017, 5:53 p.m., Jiang Yan Xu wrote:
> > src/tests/slave_recovery_tests.cpp
> > Lines 2762 (patched)
> > <https://reviews.apache.org/r/56895/diff/13-14/?file=1767635#file1767635line2772>
> >
> > This is unintentional?

My bad, will fix.


> On July 11, 2017, 5:53 p.m., Jiang Yan Xu wrote:
> > src/tests/slave_recovery_tests.cpp
> > Line 2652 (original)
> > <https://reviews.apache.org/r/56895/diff/14/?file=1773824#file1773824line2784>
> >
> > In the case of real reboot with slave info mismatch I think we should 
> > receive a task lost for this so why not set the expetation explicitly 
> > instead of removing it?

Dicussed in person.


> On July 11, 2017, 5:53 p.m., Jiang Yan Xu wrote:
> > src/tests/slave_recovery_tests.cpp
> > Line 2655 (original), 2786 (patched)
> > <https://reviews.apache.org/r/56895/diff/14/?file=1773824#file1773824line2787>
> >
> > Are there no `master/slave_removals` here? Seems like MESOS-5396 
> > doens't apply here.

Discussed in person, removed.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/#review180202
---


On July 10, 2017, 11:15 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56895/
> ---
> 
> (Updated July 10, 2017, 11:15 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added tests to verify that the state is recovered post reboot and
> agentId is retained given the recovery finishes without errors and
> if the recovery fails due to slaveInfo mismatch then agent no
> state is recoverd making the agent register as a new agent.
> 
> 
> Diffs
> -
> 
>   src/tests/slave_recovery_tests.cpp f6eafcbe3d89c7a69c03db0fd7bc10ae73d06584 
> 
> 
> Diff: https://reviews.apache.org/r/56895/diff/14/
> 
> 
> Testing
> ---
> 
> make check done together with 60104 and 60103
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Added tests to ensure slave recovery post reboot.

2017-07-10 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated July 10, 2017, 11:15 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Added tests to verify that the state is recovered post reboot and
agentId is retained given the recovery finishes without errors and
if the recovery fails due to slaveInfo mismatch then agent no
state is recoverd making the agent register as a new agent.


Diffs (updated)
-

  src/tests/slave_recovery_tests.cpp f6eafcbe3d89c7a69c03db0fd7bc10ae73d06584 


Diff: https://reviews.apache.org/r/56895/diff/14/

Changes: https://reviews.apache.org/r/56895/diff/13-14/


Testing
---

make check done together with 60104 and 60103


Thanks,

Megha Sharma



Re: Review Request 56895: Added tests to ensure slave recovery post reboot.

2017-06-30 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated June 30, 2017, 3:18 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Added tests to verify that the state is recovered post reboot and
agentId is retained given the recovery finishes without errors and
if the recovery fails due to slaveInfo mismatch then agent no
state is recoverd making the agent register as a new agent.


Diffs (updated)
-

  src/tests/slave_recovery_tests.cpp f6eafcbe3d89c7a69c03db0fd7bc10ae73d06584 


Diff: https://reviews.apache.org/r/56895/diff/13/

Changes: https://reviews.apache.org/r/56895/diff/12-13/


Testing
---

make check done together with 60104 and 60103


Thanks,

Megha Sharma



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-29 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/
---

(Updated June 29, 2017, 11:42 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Prior to Mesos 1.4 we bypass the state recovery and
start as a new agent upon reboot. Starting in Mesos 1.4
we'll attempt to recover the slave state even after reboot
except for when there is a SlaveInfo mismatch.
Here, we cleanup the slave state for a rebooted agent if
there's been a SlaveInfo mismatch during recovery to
ensure that no other state is recovered and the
agent enventually registers as a new agent.


Diffs (updated)
-

  src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 


Diff: https://reviews.apache.org/r/60105/diff/10/

Changes: https://reviews.apache.org/r/60105/diff/9-10/


Testing
---

make check done together with 60104 and 56895


Thanks,

Megha Sharma



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-29 Thread Megha Sharma


> On June 24, 2017, 6:19 a.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp
> > Lines 6014 (patched)
> > <https://reviews.apache.org/r/60105/diff/8/?file=1761637#file1761637line6022>
> >
> > s/INFO/WARNING/.

Changed to warning. I was debating earlier between warning and info and felt 
info was more apt since mesos is already doing the remediation here.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/#review178830
---


On June 29, 2017, 11:35 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60105/
> ---
> 
> (Updated June 29, 2017, 11:35 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Prior to Mesos 1.4 we bypass the state recovery and
> start as a new agent upon reboot. Starting in Mesos 1.4
> we'll attempt to recover the slave state even after reboot
> except for when there is a SlaveInfo mismatch.
> Here, we cleanup the slave state for a rebooted agent if
> there's been a SlaveInfo mismatch during recovery to
> ensure that no other state is recovered and the
> agent enventually registers as a new agent.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 
> 
> 
> Diff: https://reviews.apache.org/r/60105/diff/9/
> 
> 
> Testing
> ---
> 
> make check done together with 60104 and 56895
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-29 Thread Megha Sharma


> On June 24, 2017, 6:19 a.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp
> > Lines 6011-6020 (patched)
> > <https://reviews.apache.org/r/60105/diff/8/?file=1761637#file1761637line6019>
> >
> > Could you use the code in my comment:
> > 
> > ```
> >   if (!state->rebooted) {
> > return Failure(message);
> >   }
> > 
> >   LOG(WARNING) << "Falling back to recover as a new agent due to error: 
> > "
> ><< message;
> > 
> >   // Clean up the slave state to avoid any state recovery for the
> >   // old agent.
> >   slaveState = None();
> > ```
> > 
> > I'll comment on the specific lines for the reasoning.

Aah my bad, will get rid of the else.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/#review178830
---


On June 29, 2017, 11:35 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60105/
> ---
> 
> (Updated June 29, 2017, 11:35 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Prior to Mesos 1.4 we bypass the state recovery and
> start as a new agent upon reboot. Starting in Mesos 1.4
> we'll attempt to recover the slave state even after reboot
> except for when there is a SlaveInfo mismatch.
> Here, we cleanup the slave state for a rebooted agent if
> there's been a SlaveInfo mismatch during recovery to
> ensure that no other state is recovered and the
> agent enventually registers as a new agent.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 
> 
> 
> Diff: https://reviews.apache.org/r/60105/diff/9/
> 
> 
> Testing
> ---
> 
> make check done together with 60104 and 56895
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-29 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/
---

(Updated June 29, 2017, 11:35 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Prior to Mesos 1.4 we bypass the state recovery and
start as a new agent upon reboot. Starting in Mesos 1.4
we'll attempt to recover the slave state even after reboot
except for when there is a SlaveInfo mismatch.
Here, we cleanup the slave state for a rebooted agent if
there's been a SlaveInfo mismatch during recovery to
ensure that no other state is recovered and the
agent enventually registers as a new agent.


Diffs (updated)
-

  src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 


Diff: https://reviews.apache.org/r/60105/diff/9/

Changes: https://reviews.apache.org/r/60105/diff/8-9/


Testing
---

make check done together with 60104 and 56895


Thanks,

Megha Sharma



Re: Review Request 56895: Added tests to ensure slave recovery post reboot.

2017-06-23 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated June 23, 2017, 9:45 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Added tests to verify that the state is recovered post reboot and
agentId is retained given the recovery finishes without errors and
if the recovery fails due to slaveInfo mismatch then agent no
state is recoverd making the agent register as a new agent.


Diffs (updated)
-

  src/tests/slave_recovery_tests.cpp 90c172b6f7ca0082c3dfa2082050d6c18e4e636c 


Diff: https://reviews.apache.org/r/56895/diff/12/

Changes: https://reviews.apache.org/r/56895/diff/11-12/


Testing
---

make check done together with 60104 and 60103


Thanks,

Megha Sharma



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-23 Thread Megha Sharma


> On June 23, 2017, 6:21 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp
> > Line 6003 (original), 6007 (patched)
> > <https://reviews.apache.org/r/60105/diff/6/?file=1759987#file1759987line6010>
> >
> > Empty line above.

This is not possible I guess, git pre-commit hook complains: Redundant blank 
line at the start of a code block should be deleted.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/#review178750
---


On June 23, 2017, 9:30 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60105/
> ---
> 
> (Updated June 23, 2017, 9:30 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Prior to Mesos 1.4 we bypass the state recovery and
> start as a new agent upon reboot. Starting in Mesos 1.4
> we'll attempt to recover the slave state even after reboot
> except for when there is a SlaveInfo mismatch.
> Here, we cleanup the slave state for a rebooted agent if
> there's been a SlaveInfo mismatch during recovery to
> ensure that no other state is recovered and the
> agent enventually registers as a new agent.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 
> 
> 
> Diff: https://reviews.apache.org/r/60105/diff/8/
> 
> 
> Testing
> ---
> 
> make check done together with 60104 and 56895
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-23 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/
---

(Updated June 23, 2017, 9:30 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Prior to Mesos 1.4 we bypass the state recovery and
start as a new agent upon reboot. Starting in Mesos 1.4
we'll attempt to recover the slave state even after reboot
except for when there is a SlaveInfo mismatch.
Here, we cleanup the slave state for a rebooted agent if
there's been a SlaveInfo mismatch during recovery to
ensure that no other state is recovered and the
agent enventually registers as a new agent.


Diffs (updated)
-

  src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 


Diff: https://reviews.apache.org/r/60105/diff/8/

Changes: https://reviews.apache.org/r/60105/diff/7-8/


Testing
---

make check done together with 60104 and 56895


Thanks,

Megha Sharma



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-23 Thread Megha Sharma


> On June 23, 2017, 6:21 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp
> > Lines 5994-5999 (patched)
> > <https://reviews.apache.org/r/60105/diff/6/?file=1759987#file1759987line5997>
> >
> > I tweaked it a little bit:
> > 
> > ```
> >   // Fail the recovery unless the agent is recovering for the first
> >   // time after host reboot.
> >   //
> >   // Prior to Mesos 1.4 we directly bypass the state recovery and
> >   // start as a new agent upon reboot (introduced in MESOS-844).
> >   // This unncessarily discards the existing agent ID (MESOS-6223).
> >   // Starting in Mesos 1.4 we'll attempt to recover the slave state
> >   // even after reboot but in case of slave info mismatch we'll fall
> >   // back to recovering as a new agent (existing behavior). This
> >   // prevents the agent from flapping if the slave info (resources,
> >   // attributes, etc.) change is due to host maintainance associated
> >   // with the reboot.
> > ```
> > 
> > What do you think? Feel free to improve on it.

+1, good and concise explanation about the changed behavior.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/#review178750
---


On June 23, 2017, 5:19 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60105/
> ---
> 
> (Updated June 23, 2017, 5:19 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Prior to Mesos 1.4 we bypass the state recovery and
> start as a new agent upon reboot. Starting in Mesos 1.4
> we'll attempt to recover the slave state even after reboot
> except for when there is a SlaveInfo mismatch.
> Here, we cleanup the slave state for a rebooted agent if
> there's been a SlaveInfo mismatch during recovery to
> ensure that no other state is recovered and the
> agent enventually registers as a new agent.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 
> 
> 
> Diff: https://reviews.apache.org/r/60105/diff/7/
> 
> 
> Testing
> ---
> 
> make check done together with 60104 and 56895
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 56895: Added tests to ensure slave recovery post reboot.

2017-06-23 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated June 23, 2017, 6:01 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Added tests to verify that the state is recovered post reboot and
agentId is retained given the recovery finishes without errors and
if the recovery fails due to slaveInfo mismatch then agent no
state is recoverd making the agent register as a new agent.


Diffs (updated)
-

  src/tests/slave_recovery_tests.cpp 90c172b6f7ca0082c3dfa2082050d6c18e4e636c 


Diff: https://reviews.apache.org/r/56895/diff/11/

Changes: https://reviews.apache.org/r/56895/diff/10-11/


Testing
---

make check done together with 60104 and 60103


Thanks,

Megha Sharma



Re: Review Request 60104: Added rebooted flag to State.

2017-06-23 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60104/
---

(Updated June 23, 2017, 5:20 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Added rebooted flag to State to remember whether or not
the agent has rebooted.


Diffs (updated)
-

  src/slave/state.hpp 537358cb54393dd02b655050c26504808a1505ad 
  src/slave/state.cpp 5dd8b1cc29cf1809e948015e146691596a8202b8 


Diff: https://reviews.apache.org/r/60104/diff/7/

Changes: https://reviews.apache.org/r/60104/diff/6-7/


Testing
---

make check done together with 60105 and 56895


Thanks,

Megha Sharma



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-23 Thread Megha Sharma


> On June 21, 2017, 4:39 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp
> > Lines 5798-5867 (original), 5798-5867 (patched)
> > <https://reviews.apache.org/r/60105/diff/3/?file=1752079#file1752079line5798>
> >
> > So all of this work is only useful for recovering frameworks, it looks 
> > like we don't need to do them before we decide we actually are going to 
> > recover the frameworks (i.e., not continue as a new agent)?
> > 
> > This reasoning certainly also applies to the current HEAD but since we 
> > are donig refactor here, we should probably address this (in a separate 
> > preceding review perhaps)
> 
> Megha Sharma wrote:
> I strongly agree we should defer doing this work until we start to 
> recover the frameworks. I was thinking of doing this may be as another review 
> for slave recovery improvements after this change is in.

Agreed to do in a separate refactoring/improvements patch may be after this one 
lands.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/#review178388
-------


On June 23, 2017, 5:19 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60105/
> ---
> 
> (Updated June 23, 2017, 5:19 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Prior to Mesos 1.4 we bypass the state recovery and
> start as a new agent upon reboot. Starting in Mesos 1.4
> we'll attempt to recover the slave state even after reboot
> except for when there is a SlaveInfo mismatch.
> Here, we cleanup the slave state for a rebooted agent if
> there's been a SlaveInfo mismatch during recovery to
> ensure that no other state is recovered and the
> agent enventually registers as a new agent.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 
> 
> 
> Diff: https://reviews.apache.org/r/60105/diff/7/
> 
> 
> Testing
> ---
> 
> make check done together with 60104 and 56895
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-23 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/
---

(Updated June 23, 2017, 5:19 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Prior to Mesos 1.4 we bypass the state recovery and
start as a new agent upon reboot. Starting in Mesos 1.4
we'll attempt to recover the slave state even after reboot
except for when there is a SlaveInfo mismatch.
Here, we cleanup the slave state for a rebooted agent if
there's been a SlaveInfo mismatch during recovery to
ensure that no other state is recovered and the
agent enventually registers as a new agent.


Diffs (updated)
-

  src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 


Diff: https://reviews.apache.org/r/60105/diff/7/

Changes: https://reviews.apache.org/r/60105/diff/6-7/


Testing
---

make check done together with 60104 and 56895


Thanks,

Megha Sharma



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-22 Thread Megha Sharma


> On June 21, 2017, 4:39 p.m., Jiang Yan Xu wrote:
> > PTAL at these comments but I realize for this review it's probably easier 
> > if we sync on the high-level strategy first.

Rearranged the code as per our discussion.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/#review178388
---


On June 22, 2017, 11:38 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60105/
> ---
> 
> (Updated June 22, 2017, 11:38 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Prior to Mesos 1.4 we bypass the state recovery and
> start as a new agent upon reboot. Starting in Mesos 1.4
> we'll attempt to recover the slave state even after reboot
> except for when there is a SlaveInfo mismatch.
> Here, we cleanup the slave state for a rebooted agent if
> there's been a SlaveInfo mismatch during recovery to
> ensure that no other state is recovered and the
> agent enventually registers as a new agent.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 
> 
> 
> Diff: https://reviews.apache.org/r/60105/diff/6/
> 
> 
> Testing
> ---
> 
> make check done together with 60104 and 56895
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-22 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/
---

(Updated June 22, 2017, 11:38 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Prior to Mesos 1.4 we bypass the state recovery and
start as a new agent upon reboot. Starting in Mesos 1.4
we'll attempt to recover the slave state even after reboot
except for when there is a SlaveInfo mismatch.
Here, we cleanup the slave state for a rebooted agent if
there's been a SlaveInfo mismatch during recovery to
ensure that no other state is recovered and the
agent enventually registers as a new agent.


Diffs (updated)
-

  src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 


Diff: https://reviews.apache.org/r/60105/diff/6/

Changes: https://reviews.apache.org/r/60105/diff/5-6/


Testing
---

make check done together with 60104 and 56895


Thanks,

Megha Sharma



Re: Review Request 60104: Added rebooted flag to State.

2017-06-22 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60104/
---

(Updated June 22, 2017, 9:17 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Added rebooted flag to State to remember whether or not
the agent has rebooted.


Diffs
-

  src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 
  src/slave/state.hpp 537358cb54393dd02b655050c26504808a1505ad 
  src/slave/state.cpp 5dd8b1cc29cf1809e948015e146691596a8202b8 


Diff: https://reviews.apache.org/r/60104/diff/6/


Testing (updated)
---

make check done together with 60105 and 56895


Thanks,

Megha Sharma



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-22 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/
---

(Updated June 22, 2017, 9:16 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Prior to Mesos 1.4 we bypass the state recovery and
start as a new agent upon reboot. Starting in Mesos 1.4
we'll attempt to recover the slave state even after reboot
except for when there is a SlaveInfo mismatch.
Here, we cleanup the slave state for a rebooted agent if
there's been a SlaveInfo mismatch during recovery to
ensure that no other state is recovered and the
agent enventually registers as a new agent.


Diffs
-

  src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 


Diff: https://reviews.apache.org/r/60105/diff/5/


Testing (updated)
---

make check done together with 60104 and 56895


Thanks,

Megha Sharma



Re: Review Request 56895: Added tests to ensure slave recovery post reboot.

2017-06-22 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56895/
---

(Updated June 22, 2017, 9:15 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description (updated)
---

Added tests to verify that the state is recovered post reboot and
agentId is retained given the recovery finishes without errors and
if the recovery fails due to slaveInfo mismatch then agent no
state is recoverd making the agent register as a new agent.


Diffs (updated)
-

  src/tests/slave_recovery_tests.cpp 90c172b6f7ca0082c3dfa2082050d6c18e4e636c 


Diff: https://reviews.apache.org/r/56895/diff/10/

Changes: https://reviews.apache.org/r/56895/diff/9-10/


Testing (updated)
---

make check done together with 60104 and 60103


Thanks,

Megha Sharma



Re: Review Request 60104: Added rebooted flag to State.

2017-06-21 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60104/#review178601
---




src/slave/state.cpp
Line 96 (original), 95 (patched)
<https://reviews.apache.org/r/60104/#comment252712>

If this line is moved to 60105 then this patch will be independently 
mergeable.


- Megha Sharma


On June 21, 2017, 11:32 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60104/
> ---
> 
> (Updated June 21, 2017, 11:32 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added rebooted flag to State to remember whether or not
> the agent has rebooted.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 
>   src/slave/state.hpp 537358cb54393dd02b655050c26504808a1505ad 
>   src/slave/state.cpp 5dd8b1cc29cf1809e948015e146691596a8202b8 
> 
> 
> Diff: https://reviews.apache.org/r/60104/diff/6/
> 
> 
> Testing
> ---
> 
> Latest make check results pending.
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 60104: Added rebooted flag to State.

2017-06-21 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60104/
---

(Updated June 21, 2017, 11:32 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Summary (updated)
-

Added rebooted flag to State.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description (updated)
---

Added rebooted flag to State to remember whether or not
the agent has rebooted.


Diffs (updated)
-

  src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 
  src/slave/state.hpp 537358cb54393dd02b655050c26504808a1505ad 
  src/slave/state.cpp 5dd8b1cc29cf1809e948015e146691596a8202b8 


Diff: https://reviews.apache.org/r/60104/diff/6/

Changes: https://reviews.apache.org/r/60104/diff/5-6/


Testing (updated)
---

Latest make check results pending.


Thanks,

Megha Sharma



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-21 Thread Megha Sharma


> On June 21, 2017, 4:39 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp
> > Lines 5798-5867 (original), 5798-5867 (patched)
> > <https://reviews.apache.org/r/60105/diff/3/?file=1752079#file1752079line5798>
> >
> > So all of this work is only useful for recovering frameworks, it looks 
> > like we don't need to do them before we decide we actually are going to 
> > recover the frameworks (i.e., not continue as a new agent)?
> > 
> > This reasoning certainly also applies to the current HEAD but since we 
> > are donig refactor here, we should probably address this (in a separate 
> > preceding review perhaps)

I strongly agree we should defer doing this work until we start to recover the 
frameworks. I was thinking of doing this may be as another review for slave 
recovery improvements after this change is in.


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/#review178388
-------


On June 21, 2017, 11:22 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60105/
> ---
> 
> (Updated June 21, 2017, 11:22 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Prior to Mesos 1.4 we bypass the state recovery and
> start as a new agent upon reboot. Starting in Mesos 1.4
> we'll attempt to recover the slave state even after reboot
> except for when there is a SlaveInfo mismatch.
> Here, we cleanup the slave state for a rebooted agent if
> there's been a SlaveInfo mismatch during recovery to
> ensure that no other state is recovered and the
> agent enventually registers as a new agent.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 
> 
> 
> Diff: https://reviews.apache.org/r/60105/diff/5/
> 
> 
> Testing
> ---
> 
> Latest make check results pending.
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 60105: Clean rebooted slave's state if slaveInfo mismatches.

2017-06-21 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/
---

(Updated June 21, 2017, 11:22 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Summary (updated)
-

Clean rebooted slave's state if slaveInfo mismatches.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Prior to Mesos 1.4 we bypass the state recovery and
start as a new agent upon reboot. Starting in Mesos 1.4
we'll attempt to recover the slave state even after reboot
except for when there is a SlaveInfo mismatch.
Here, we cleanup the slave state for a rebooted agent if
there's been a SlaveInfo mismatch during recovery to
ensure that no other state is recovered and the
agent enventually registers as a new agent.


Diffs (updated)
-

  src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 


Diff: https://reviews.apache.org/r/60105/diff/5/

Changes: https://reviews.apache.org/r/60105/diff/4-5/


Testing (updated)
---

Latest make check results pending.


Thanks,

Megha Sharma



Re: Review Request 60105: Added helper method recoverSlaveState.

2017-06-21 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/
---

(Updated June 21, 2017, 11:11 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description (updated)
---

Prior to Mesos 1.4 we bypass the state recovery and
start as a new agent upon reboot. Starting in Mesos 1.4
we'll attempt to recover the slave state even after reboot
except for when there is a SlaveInfo mismatch.
Here, we cleanup the slave state for a rebooted agent if
there's been a SlaveInfo mismatch during recovery to
ensure that no other state is recovered and the
agent enventually registers as a new agent.


Diffs (updated)
-

  src/slave/slave.cpp f808458849bb9667a91abe18868751d377d36e0c 


Diff: https://reviews.apache.org/r/60105/diff/4/

Changes: https://reviews.apache.org/r/60105/diff/3-4/


Testing
---


Thanks,

Megha Sharma



Re: Review Request 60105: Added helper method recoverSlaveState.

2017-06-21 Thread Megha Sharma


> On June 21, 2017, 4:39 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp
> > Lines 5945-5946 (patched)
> > <https://reviews.apache.org/r/60105/diff/3/?file=1752079#file1752079line5945>
> >
> > This comment seems to be misplaced, why does the code here cares? It 
> > just processes the `slaveState` which could be optional regardless of what 
> > we have done for the reboot case.

Some of the comments I added were for my own undrestanding (like this one) 
which I forgot to clean up before pushing


> On June 21, 2017, 4:39 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp
> > Line 5950 (original), 5969-5974 (patched)
> > <https://reviews.apache.org/r/60105/diff/3/?file=1752079#file1752079line5969>
> >
> > As suggested by related comments, we are not removing the symlink here 
> > so we shouldn't need to comment about it here. 
> > 
> > "This is being done ...": here the comment describes the low-level 
> > mechanics but perhaps we need a high level comment about "this" instead, 
> > which is that:
> > 
> > Prior to Mesos 1.4 we bypass the state recovery and start as a new 
> > agent upon reboot (introduced in MESOS-844); starting in Mesos 1.4 we'll 
> > attempt to recover the slave state even after reboot (so we don't discard 
> > the agent ID unnecessarily) but in case of SlaveInfo mismatch we'll still 
> > continue as a new agent so we don't introduce a regression for the reboot 
> > scenario.
> > 
> > We can go on to explain the rationale: agent resource and attribute 
> > changes are more often associated with host reboots and we don't want to 
> > agent to flap in such cases.
> > 
> > But it's fine if we focus on improving the comments after we first get 
> > the code rigth.

Makes sense, updated the comment.


> On June 21, 2017, 4:39 p.m., Jiang Yan Xu wrote:
> > src/slave/slave.cpp
> > Lines 5987-5989 (patched)
> > <https://reviews.apache.org/r/60105/diff/3/?file=1752079#file1752079line5987>
> >
> > You don't need to do this here. If you continue as a new agent, it's 
> > going to update the latest symlink when it's registered:
> > 
> > 
> > https://github.com/apache/mesos/blob/3bea3fcb4eef00ce469ba4f27fcfd2d3eec05aea/src/slave/paths.cpp#L621-L622

Good Point, removed


- Megha


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/#review178388
---


On June 15, 2017, 7:31 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60105/
> ---
> 
> (Updated June 15, 2017, 7:31 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Here, a helper method recoverSlaveState is added to
> determine if the slave state is to be recovered or not.
> During agent recovery, if the SlaveInfo compatibility
> check fails and the agent is rebooted then to be
> backwards compatible with MESOS <= 1.3, since a
> a rebooted agent didn't recover state, we clear
> slave state and slaveId so the agent registers with the
> master as a new agent.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
> 
> 
> Diff: https://reviews.apache.org/r/60105/diff/3/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 60104: Added rebooted flag to RecoveryInfo and State.

2017-06-20 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60104/
---

(Updated June 20, 2017, 5:09 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Added rebooted flag to State and Slave.RecoveryInfo to remember
if the agent has rebooted.


Diffs (updated)
-

  src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
  src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
  src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
  src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 


Diff: https://reviews.apache.org/r/60104/diff/5/

Changes: https://reviews.apache.org/r/60104/diff/4-5/


Testing
---


Thanks,

Megha Sharma



Re: Review Request 60104: Added rebooted flag to RecoveryInfo and State.

2017-06-20 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60104/#review178394
---




src/slave/state.cpp
Line 82 (original), 82 (patched)
<https://reviews.apache.org/r/60104/#comment252329>

Fixing the whitespace


- Megha Sharma


On June 20, 2017, 3:55 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60104/
> ---
> 
> (Updated June 20, 2017, 3:55 p.m.)
> 
> 
> Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6223
> https://issues.apache.org/jira/browse/MESOS-6223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added rebooted flag to State and Slave.RecoveryInfo to remember
> if the agent has rebooted.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
>   src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
>   src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
>   src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 
> 
> 
> Diff: https://reviews.apache.org/r/60104/diff/4/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>



Re: Review Request 60104: Added rebooted flag to RecoveryInfo and State.

2017-06-20 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60104/
---

(Updated June 20, 2017, 3:55 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Added rebooted flag to State and Slave.RecoveryInfo to remember
if the agent has rebooted.


Diffs (updated)
-

  src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
  src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 
  src/slave/state.hpp a497ce1f58fb8dc7718ee5bb10bc62dd7479efa5 
  src/slave/state.cpp 18b790d2cc4f537cc9b0c3cca59b9cbaac0eda10 


Diff: https://reviews.apache.org/r/60104/diff/4/

Changes: https://reviews.apache.org/r/60104/diff/3-4/


Testing
---


Thanks,

Megha Sharma



Re: Review Request 60103: Changed variable name _ack to statusUpdateAck.

2017-06-20 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60103/
---

(Updated June 20, 2017, 3:41 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Changed variable name _ack to statusUpdateAck which
correctly identifies with the purpose of this variable.


Diffs (updated)
-

  src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 


Diff: https://reviews.apache.org/r/60103/diff/4/

Changes: https://reviews.apache.org/r/60103/diff/3-4/


Testing
---


Thanks,

Megha Sharma



Re: Review Request 60103: Changed variable name _ack to statusUpdateAck.

2017-06-15 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60103/
---

(Updated June 15, 2017, 7:30 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Changed variable name _ack to statusUpdateAck which
correctly identifies with the purpose of this variable.


Diffs (updated)
-

  src/tests/slave_recovery_tests.cpp 38502584186793686f78ff5f4e03f36a3bf7ad1c 


Diff: https://reviews.apache.org/r/60103/diff/3/

Changes: https://reviews.apache.org/r/60103/diff/2-3/


Testing
---


Thanks,

Megha Sharma



Re: Review Request 60105: Added helper method recoverSlaveState.

2017-06-15 Thread Megha Sharma

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60105/
---

(Updated June 15, 2017, 7:31 p.m.)


Review request for mesos, Neil Conway, Vinod Kone, and Jiang Yan Xu.


Bugs: MESOS-6223
https://issues.apache.org/jira/browse/MESOS-6223


Repository: mesos


Description
---

Here, a helper method recoverSlaveState is added to
determine if the slave state is to be recovered or not.
During agent recovery, if the SlaveInfo compatibility
check fails and the agent is rebooted then to be
backwards compatible with MESOS <= 1.3, since a
a rebooted agent didn't recover state, we clear
slave state and slaveId so the agent registers with the
master as a new agent.


Diffs (updated)
-

  src/slave/slave.hpp 7ffaed14035a05259ec72c70532ee4f0affa1f5d 
  src/slave/slave.cpp 7d147ac6609933ac884bfc29032dba572a0952c6 


Diff: https://reviews.apache.org/r/60105/diff/3/

Changes: https://reviews.apache.org/r/60105/diff/2-3/


Testing
---


Thanks,

Megha Sharma



  1   2   >