Re: Review Request 59584: Introduced executor reconnect retries on the agent.

2017-05-26 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59584/#review176195
---


Ship it!




Ship It!

- Vinod Kone


On May 26, 2017, 12:56 a.m., Benjamin Mahler wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59584/
> ---
> 
> (Updated May 26, 2017, 12:56 a.m.)
> 
> 
> Review request for mesos, Anand Mazumdar, Greg Mann, and Vinod Kone.
> 
> 
> Bugs: MESOS-5332, MESOS-7057 and MESOS-7569
> https://issues.apache.org/jira/browse/MESOS-5332
> https://issues.apache.org/jira/browse/MESOS-7057
> https://issues.apache.org/jira/browse/MESOS-7569
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> PID-based v0 executors using Mesos libraries >= 1.1.2 always re-link
> with the agent upon receiving the reconnect message. This avoids the
> executor replying on a half-open TCP connection to the old agent
> (possible if netfilter is dropping packets, see: MESOS-7057).
> However, PID-based executors using Mesos libraries < 1.1.2 do not
> re-link and are therefore prone to replying on a half-open connection
> after the agent restarts. If we only send a single reconnect message,
> these "old" executors will reply on their half-open connection,
> receive a RST, and think the agent just died. To ensure these "old"
> executors can reconnect in the presence of netfilter dropping packets,
> we introduced optional retries of the reconnect message. This results
> in "old" executors correctly establishing a link when processing the
> second reconnect message.
> 
> Generally, users should not enable this flag unless they are affected
> by this issue.
> 
> 
> Diffs
> -
> 
>   src/slave/flags.hpp b66995630f89dfb95a6d0cf66efc5d7590e90cbc 
>   src/slave/flags.cpp 0c8276e425a6a7d22ee68edc6cc25b331635ec44 
>   src/slave/slave.cpp 15e4d68714556ca30a766acd3b9729367df680c3 
> 
> 
> Diff: https://reviews.apache.org/r/59584/diff/2/
> 
> 
> Testing
> ---
> 
> Added tests in follow up reviews.
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>



Re: Review Request 59584: Introduced executor reconnect retries on the agent.

2017-05-25 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59584/#review176167
---




src/slave/flags.cpp
Lines 355 (patched)


s/sent to the executor/sent to the executor during recovery/



src/slave/flags.cpp
Lines 356 (patched)


s/MESOS-5322/MESOS-5332/



src/slave/flags.cpp
Lines 365-366 (patched)


Maybe something like:

these "old" executors will reply on their half-open connection and receive 
a RST; without any retries, they will fail to reconnect and be killed by the 
agent once the executor re-registration timeout elapses.



src/slave/slave.cpp
Lines 5964-5965 (patched)


Ditto, as above.



src/slave/slave.cpp
Lines 5967 (patched)


s/an optional/optional/



src/slave/slave.cpp
Lines 5972-5973 (patched)


Is this TODO necessary, since this entire block only executes when 
`executor->pid.isSome() && executor->pid.get()`?



src/slave/slave.cpp
Lines 5975-5979 (patched)


Why const ref for the IDs but not for the retry interval?


- Greg Mann


On May 26, 2017, 12:56 a.m., Benjamin Mahler wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59584/
> ---
> 
> (Updated May 26, 2017, 12:56 a.m.)
> 
> 
> Review request for mesos, Anand Mazumdar, Greg Mann, and Vinod Kone.
> 
> 
> Bugs: MESOS-5332, MESOS-7057 and MESOS-7569
> https://issues.apache.org/jira/browse/MESOS-5332
> https://issues.apache.org/jira/browse/MESOS-7057
> https://issues.apache.org/jira/browse/MESOS-7569
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> PID-based v0 executors using Mesos libraries >= 1.1.2 always re-link
> with the agent upon receiving the reconnect message. This avoids the
> executor replying on a half-open TCP connection to the old agent
> (possible if netfilter is dropping packets, see: MESOS-7057).
> However, PID-based executors using Mesos libraries < 1.1.2 do not
> re-link and are therefore prone to replying on a half-open connection
> after the agent restarts. If we only send a single reconnect message,
> these "old" executors will reply on their half-open connection,
> receive a RST, and think the agent just died. To ensure these "old"
> executors can reconnect in the presence of netfilter dropping packets,
> we introduced optional retries of the reconnect message. This results
> in "old" executors correctly establishing a link when processing the
> second reconnect message.
> 
> Generally, users should not enable this flag unless they are affected
> by this issue.
> 
> 
> Diffs
> -
> 
>   src/slave/flags.hpp b66995630f89dfb95a6d0cf66efc5d7590e90cbc 
>   src/slave/flags.cpp 0c8276e425a6a7d22ee68edc6cc25b331635ec44 
>   src/slave/slave.cpp 15e4d68714556ca30a766acd3b9729367df680c3 
> 
> 
> Diff: https://reviews.apache.org/r/59584/diff/1/
> 
> 
> Testing
> ---
> 
> Added tests in follow up reviews.
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>



Re: Review Request 59584: Introduced executor reconnect retries on the agent.

2017-05-25 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59584/#review176168
---




src/slave/slave.cpp
Lines 5956 (patched)


Maybe just link to the flag help here instead of duplicating?



src/slave/slave.cpp
Lines 6006 (patched)


Can you log this as well like we do for the first message?

```
LOG(INFO) << "Re-sending reconnect request to executor " << *executor;
```


- Vinod Kone


On May 26, 2017, 12:56 a.m., Benjamin Mahler wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59584/
> ---
> 
> (Updated May 26, 2017, 12:56 a.m.)
> 
> 
> Review request for mesos, Anand Mazumdar, Greg Mann, and Vinod Kone.
> 
> 
> Bugs: MESOS-5332, MESOS-7057 and MESOS-7569
> https://issues.apache.org/jira/browse/MESOS-5332
> https://issues.apache.org/jira/browse/MESOS-7057
> https://issues.apache.org/jira/browse/MESOS-7569
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> PID-based v0 executors using Mesos libraries >= 1.1.2 always re-link
> with the agent upon receiving the reconnect message. This avoids the
> executor replying on a half-open TCP connection to the old agent
> (possible if netfilter is dropping packets, see: MESOS-7057).
> However, PID-based executors using Mesos libraries < 1.1.2 do not
> re-link and are therefore prone to replying on a half-open connection
> after the agent restarts. If we only send a single reconnect message,
> these "old" executors will reply on their half-open connection,
> receive a RST, and think the agent just died. To ensure these "old"
> executors can reconnect in the presence of netfilter dropping packets,
> we introduced optional retries of the reconnect message. This results
> in "old" executors correctly establishing a link when processing the
> second reconnect message.
> 
> Generally, users should not enable this flag unless they are affected
> by this issue.
> 
> 
> Diffs
> -
> 
>   src/slave/flags.hpp b66995630f89dfb95a6d0cf66efc5d7590e90cbc 
>   src/slave/flags.cpp 0c8276e425a6a7d22ee68edc6cc25b331635ec44 
>   src/slave/slave.cpp 15e4d68714556ca30a766acd3b9729367df680c3 
> 
> 
> Diff: https://reviews.apache.org/r/59584/diff/1/
> 
> 
> Testing
> ---
> 
> Added tests in follow up reviews.
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>



Re: Review Request 59584: Introduced executor reconnect retries on the agent.

2017-05-25 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59584/
---

(Updated May 26, 2017, 12:56 a.m.)


Review request for mesos, Anand Mazumdar, Greg Mann, and Vinod Kone.


Changes
---

Added MESOS-7569.


Bugs: MESOS-5332, MESOS-7057 and MESOS-7569
https://issues.apache.org/jira/browse/MESOS-5332
https://issues.apache.org/jira/browse/MESOS-7057
https://issues.apache.org/jira/browse/MESOS-7569


Repository: mesos


Description
---

PID-based v0 executors using Mesos libraries >= 1.1.2 always re-link
with the agent upon receiving the reconnect message. This avoids the
executor replying on a half-open TCP connection to the old agent
(possible if netfilter is dropping packets, see: MESOS-7057).
However, PID-based executors using Mesos libraries < 1.1.2 do not
re-link and are therefore prone to replying on a half-open connection
after the agent restarts. If we only send a single reconnect message,
these "old" executors will reply on their half-open connection,
receive a RST, and think the agent just died. To ensure these "old"
executors can reconnect in the presence of netfilter dropping packets,
we introduced optional retries of the reconnect message. This results
in "old" executors correctly establishing a link when processing the
second reconnect message.

Generally, users should not enable this flag unless they are affected
by this issue.


Diffs
-

  src/slave/flags.hpp b66995630f89dfb95a6d0cf66efc5d7590e90cbc 
  src/slave/flags.cpp 0c8276e425a6a7d22ee68edc6cc25b331635ec44 
  src/slave/slave.cpp 15e4d68714556ca30a766acd3b9729367df680c3 


Diff: https://reviews.apache.org/r/59584/diff/1/


Testing
---

Added tests in follow up reviews.


Thanks,

Benjamin Mahler



Review Request 59584: Introduced executor reconnect retries on the agent.

2017-05-25 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59584/
---

Review request for mesos, Anand Mazumdar, Greg Mann, and Vinod Kone.


Bugs: MESOS-5332 and MESOS-7057
https://issues.apache.org/jira/browse/MESOS-5332
https://issues.apache.org/jira/browse/MESOS-7057


Repository: mesos


Description
---

PID-based v0 executors using Mesos libraries >= 1.1.2 always re-link
with the agent upon receiving the reconnect message. This avoids the
executor replying on a half-open TCP connection to the old agent
(possible if netfilter is dropping packets, see: MESOS-7057).
However, PID-based executors using Mesos libraries < 1.1.2 do not
re-link and are therefore prone to replying on a half-open connection
after the agent restarts. If we only send a single reconnect message,
these "old" executors will reply on their half-open connection,
receive a RST, and think the agent just died. To ensure these "old"
executors can reconnect in the presence of netfilter dropping packets,
we introduced optional retries of the reconnect message. This results
in "old" executors correctly establishing a link when processing the
second reconnect message.

Generally, users should not enable this flag unless they are affected
by this issue.


Diffs
-

  src/slave/flags.hpp b66995630f89dfb95a6d0cf66efc5d7590e90cbc 
  src/slave/flags.cpp 0c8276e425a6a7d22ee68edc6cc25b331635ec44 
  src/slave/slave.cpp 15e4d68714556ca30a766acd3b9729367df680c3 


Diff: https://reviews.apache.org/r/59584/diff/1/


Testing
---

Added tests in follow up reviews.


Thanks,

Benjamin Mahler