> On Feb. 17, 2015, 5:58 a.m., Ben Mahler wrote:
> > 3rdparty/libprocess/src/process.cpp, line 2118
> > <https://reviews.apache.org/r/30587/diff/1/?file=846771#file846771line2118>
> >
> > Hm.. it looks like `make check` with `GLOG_v=1` generates a ton of
> > noise from these messages, is this expected?
> >
> > If this is to be expected, do we want to bump up the level?
The metrics messages seem to come from overall libprocess shutdown, after the
tests have completed. It looks like we shut down the metrics process __before__
we stop sending messages to it (probably not overly harmful, but not strictly
correct). That some code is still running after all the tests have "completed"
is worrying to me... (Relevant log chunks below.)
The waiter messages are generated in some of the libprocess tests, but they
don't show up in the test that actually causes the code to fire (again
worrying...). A WaitWaiter process is spawned when you do a process::wait()
with a duration. That WaitWaiter schedules its own timeout via delay(). Since
the wait completes before the timeout, wait() returns and the WaitWaiter
destructs without cleaning up the delay it set. The delay then fires later,
once the clock has been advanced far enough by other tests.
I believe both of these are programmer errors: timers are not being cleaned up
properly. We could fix the individual cases (sample patch for the WaitWaiter
below):
```
diff --git a/3rdparty/libprocess/src/process.cpp b/3rdparty/libprocess/src/process.cpp
index d7fcba90bbe990236a45a76b8f39285a5097861c..f60f21c7b197b30f812fb923616be8693ee5365c 100644
--- a/3rdparty/libprocess/src/process.cpp
+++ b/3rdparty/libprocess/src/process.cpp
@@ -2915,7 +2915,7 @@ public:
   {
     VLOG(3) << "Running waiter process for " << pid;
     link(pid);
-    delay(duration, self(), &WaitWaiter::timeout);
+    timer = delay(duration, self(), &WaitWaiter::timeout);
   }

 private:
@@ -2923,6 +2923,7 @@ private:
   {
     VLOG(3) << "Waiter process waited for " << pid;
     *waited = true;
+    Clock::cancel(timer.get());
     terminate(self());
   }

@@ -2937,6 +2938,7 @@ private:
   const UPID pid;
   const Duration duration;
   bool* const waited;
+  Option<Timer> timer;
 };
```
What I think would be better, though, is to cancel all of a process's
outstanding timers when the Process / ProcessBase is destroyed / de-registered.
This prevents future programmer mistakes, and it eliminates a possible cause of
subtle bugs inside Mesos processes: spawn a process, set a delayed message for
it, kill that process, then start a new one with the same name. The new process
will receive the message once the timeout is reached. Reusing process names
happens fairly frequently in the tests.
The relevant log bits (generated with GLOG_v=3):
**Metrics log chunk:**
```
[==========] 1 test from 1 test case ran. (103 ms total)
[ PASSED ] 1 test.
YOU HAVE 3 DISABLED TESTS
I0227 12:29:24.962259 6723 clock.cpp:295] Clock paused at 0xeb0400
I0227 12:29:24.963186 6732 clock.cpp:135] Handling timers up to 2015-02-27
20:29:30.962255872+00:00
I0227 12:29:24.963199 6732 clock.cpp:142] Have timeout(s) at 2015-02-27
20:29:30.959864832+00:00
I0227 12:29:24.963220 6732 clock.cpp:180] Clock has settled
I0227 12:29:24.963242 6725 process.cpp:2161] Resuming
reaper(1)@127.0.0.1:53591 at 2015-02-27 20:29:30.962255872+00:00
I0227 12:29:24.963263 6725 clock.cpp:243] Created a timer for
reaper(1)@127.0.0.1:53591 in 100ms in the future (2015-02-27
20:29:31.062255872+00:00)
I0227 12:29:24.972337 6723 clock.cpp:429] Clock is settled
I0227 12:29:24.972363 6723 process.cpp:2466] Donating thread to
[email protected]:53591 while waiting
I0227 12:29:24.972375 6723 process.cpp:2161] Resuming [email protected]:53591
at 2015-02-27 20:29:30.962255872+00:00
I0227 12:29:24.972388 6723 process.cpp:2268] Cleaning up [email protected]:53591
I0227 12:29:24.972437 6723 process.cpp:2466] Donating thread to
__limiter__(1)@127.0.0.1:53591 while waiting
I0227 12:29:24.972446 6723 process.cpp:2161] Resuming
__limiter__(1)@127.0.0.1:53591 at 2015-02-27 20:29:30.962255872+00:00
I0227 12:29:24.972457 6723 process.cpp:2268] Cleaning up
__limiter__(1)@127.0.0.1:53591
I0227 12:29:24.972487 6726 process.cpp:2161] Resuming
[email protected]:53591 at 2015-02-27 20:29:30.962255872+00:00
I0227 12:29:24.972506 6726 process.cpp:2268] Cleaning up
[email protected]:53591
I0227 12:29:24.972527 6723 process.cpp:2466] Donating thread to
[email protected]:53591 while waiting
I0227 12:29:24.972535 6723 process.cpp:2161] Resuming [email protected]:53591 at
2015-02-27 20:29:30.962255872+00:00
I0227 12:29:24.972545 6723 process.cpp:2268] Cleaning up [email protected]:53591
I0227 12:29:24.972571 6731 process.cpp:2161] Resuming [email protected]:53591
at 2015-02-27 20:29:30.962255872+00:00
I0227 12:29:24.972590 6731 process.cpp:2268] Cleaning up
[email protected]:53591
I0227 12:29:24.972611 6723 process.cpp:2466] Donating thread to
[email protected]:53591 while waiting
I0227 12:29:24.972620 6723 process.cpp:2161] Resuming [email protected]:53591
at 2015-02-27 20:29:30.962255872+00:00
I0227 12:29:24.972630 6723 process.cpp:2268] Cleaning up
[email protected]:53591
I0227 12:29:24.972651 6726 process.cpp:2161] Resuming [email protected]:53591
at 2015-02-27 20:29:30.962255872+00:00
I0227 12:29:24.972662 6726 process.cpp:2268] Cleaning up
[email protected]:53591
I0227 12:29:24.972677 6729 process.cpp:2161] Resuming
reaper(1)@127.0.0.1:53591 at 2015-02-27 20:29:30.962255872+00:00
I0227 12:29:24.972688 6729 process.cpp:2268] Cleaning up
reaper(1)@127.0.0.1:53591
I0227 12:29:24.972705 6723 process.cpp:2466] Donating thread to
[email protected]:53591 while waiting
I0227 12:29:24.972714 6723 process.cpp:2161] Resuming [email protected]:53591
at 2015-02-27 20:29:30.962255872+00:00
I0227 12:29:24.972730 6723 process.cpp:2118] Dropped / Lost event for PID:
[email protected]:53591
I0227 12:29:24.972743 6723 process.cpp:2118] Dropped / Lost event for PID:
[email protected]:53591
I0227 12:29:24.972755 6723 process.cpp:2118] Dropped / Lost event for PID:
[email protected]:53591
I0227 12:29:24.972767 6723 process.cpp:2118] Dropped / Lost event for PID:
[email protected]:53591
I0227 12:29:24.972779 6723 process.cpp:2118] Dropped / Lost event for PID:
[email protected]:53591
I0227 12:29:24.972790 6723 process.cpp:2118] Dropped / Lost event for PID:
[email protected]:53591
I0227 12:29:24.972800 6723 process.cpp:2268] Cleaning up [email protected]:53591
make[5]: Leaving directory '/home/cody/projects/mesos/build/3rdparty/libprocess'
```
**`__waiter__` log chunks:**
```
[----------] 7 tests from HTTP
[ RUN ] HTTP.auth
I0227 12:40:42.212824 8256 process.cpp:2161] Resuming [email protected]:44132 at
2015-02-27 20:40:42.212829184+00:00
I0227 12:40:42.212836 8257 process.cpp:2161] Resuming (1)@127.0.0.1:44132 at
2015-02-27 20:40:42.212840192+00:00
I0227 12:40:42.212839 8253 process.cpp:2151] Spawned process
(1)@127.0.0.1:44132
I0227 12:40:42.212970 8253 process.cpp:2151] Spawned process
__latch__(1)@127.0.0.1:44132
I0227 12:40:42.212975 8261 process.cpp:2161] Resuming [email protected]:44132
at 2015-02-27 20:40:42.212979200+00:00
I0227 12:40:42.212995 8253 process.cpp:2151] Spawned process
__waiter__(1)@127.0.0.1:44132
I0227 12:40:42.212976 8260 process.cpp:2161] Resuming
__latch__(1)@127.0.0.1:44132 at 2015-02-27 20:40:42.212996096+00:00
I0227 12:40:42.212996 8256 process.cpp:2161] Resuming
__waiter__(1)@127.0.0.1:44132 at 2015-02-27 20:40:42.212999168+00:00
I0227 12:40:42.213027 8256 process.cpp:2916] Running waiter process for
__latch__(1)@127.0.0.1:44132
I0227 12:40:42.213052 8256 clock.cpp:243] Created a timer for
__waiter__(1)@127.0.0.1:44132 in 15secs in the future (2015-02-27
20:40:57.213051136+00:00)
I0227 12:40:42.213198 8254 process.cpp:2161] Resuming (1)@127.0.0.1:44132 at
2015-02-27 20:40:42.213201920+00:00
I0227 12:40:42.213217 8254 process.cpp:2771] Handling HTTP event for process
'(1)' with path: '/(1)/auth'
I0227 12:40:42.213254 8254 process.cpp:2151] Spawned process
__http__(1)@127.0.0.1:44132
I0227 12:40:42.213258 8258 process.cpp:2161] Resuming [email protected]:44132
at 2015-02-27 20:40:42.213270784+00:00
I0227 12:40:42.213290 8259 process.cpp:2161] Resuming
__http__(1)@127.0.0.1:44132 at 2015-02-27 20:40:42.213294080+00:00
I0227 12:40:42.213408 8257 process.cpp:2161] Resuming
__http__(1)@127.0.0.1:44132 at 2015-02-27 20:40:42.213412096+00:00
I0227 12:40:42.213420 8257 process.cpp:2268] Cleaning up
__http__(1)@127.0.0.1:44132
I0227 12:40:42.213443 8255 process.cpp:2161] Resuming [email protected]:44132
at 2015-02-27 20:40:42.213448960+00:00
I0227 12:40:42.213508 8257 process.cpp:2161] Resuming
__latch__(1)@127.0.0.1:44132 at 2015-02-27 20:40:42.213510912+00:00
I0227 12:40:42.213518 8257 process.cpp:2268] Cleaning up
__latch__(1)@127.0.0.1:44132
I0227 12:40:42.213531 8258 process.cpp:2161] Resuming
__waiter__(1)@127.0.0.1:44132 at 2015-02-27 20:40:42.213534976+00:00
I0227 12:40:42.213533 8261 process.cpp:2161] Resuming [email protected]:44132
at 2015-02-27 20:40:42.213537024+00:00
I0227 12:40:42.213546 8258 process.cpp:2924] Waiter process waited for
__latch__(1)@127.0.0.1:44132
I0227 12:40:42.213553 8258 process.cpp:2268] Cleaning up
__waiter__(1)@127.0.0.1:44132
...
[ OK ] HTTP.auth (2 ms)
...
[ RUN ] Metrics.SnapshotTimeout
...
I0227 12:40:42.484825 8253 clock.cpp:429] Clock is settled
I0227 12:40:42.484838 8253 clock.cpp:345] Clock advanced (1secs) to 0x2599420
I0227 12:40:42.484966 8262 clock.cpp:135] Handling timers up to 2015-02-27
20:40:57.446742048+00:00
I0227 12:40:42.484983 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.213051136+00:00
I0227 12:40:42.484994 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.213692928+00:00
I0227 12:40:42.485003 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.214238976+00:00
I0227 12:40:42.485012 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.214799872+00:00
I0227 12:40:42.485019 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.214999808+00:00
I0227 12:40:42.485028 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.215301888+00:00
I0227 12:40:42.485035 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.215611904+00:00
I0227 12:40:42.485046 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.216196096+00:00
I0227 12:40:42.485055 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.216718080+00:00
I0227 12:40:42.485062 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.217353984+00:00
I0227 12:40:42.485070 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.217850880+00:00
I0227 12:40:42.485077 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.219537920+00:00
I0227 12:40:42.485088 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.221196800+00:00
I0227 12:40:42.485096 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.228478976+00:00
I0227 12:40:42.485103 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.334412064+00:00
I0227 12:40:42.485110 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.334772000+00:00
I0227 12:40:42.485118 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.334937120+00:00
I0227 12:40:42.485126 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.335040032+00:00
I0227 12:40:42.485133 8262 clock.cpp:142] Have timeout(s) at 2015-02-27
20:40:57.335128864+00:00
I0227 12:40:42.485159 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(1)@127.0.0.1:44132
I0227 12:40:42.485170 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(2)@127.0.0.1:44132
I0227 12:40:42.485179 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(3)@127.0.0.1:44132
I0227 12:40:42.485188 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(4)@127.0.0.1:44132
I0227 12:40:42.485198 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(5)@127.0.0.1:44132
I0227 12:40:42.485205 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(6)@127.0.0.1:44132
I0227 12:40:42.485214 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(7)@127.0.0.1:44132
I0227 12:40:42.485224 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(8)@127.0.0.1:44132
I0227 12:40:42.485231 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(9)@127.0.0.1:44132
I0227 12:40:42.485241 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(10)@127.0.0.1:44132
I0227 12:40:42.485250 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(11)@127.0.0.1:44132
I0227 12:40:42.485260 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(12)@127.0.0.1:44132
I0227 12:40:42.485268 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(13)@127.0.0.1:44132
I0227 12:40:42.485278 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(14)@127.0.0.1:44132
I0227 12:40:42.485286 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(15)@127.0.0.1:44132
I0227 12:40:42.485297 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(16)@127.0.0.1:44132
I0227 12:40:42.485306 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(17)@127.0.0.1:44132
I0227 12:40:42.485314 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(18)@127.0.0.1:44132
I0227 12:40:42.485324 8262 process.cpp:2118] Dropped / Lost event for PID:
__waiter__(19)@127.0.0.1:44132
I0227 12:40:42.485333 8262 clock.cpp:180] Clock has settled
```
- Cody
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30587/#review72676
-----------------------------------------------------------
On Feb. 4, 2015, 12:06 a.m., Cody Maloney wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30587/
> -----------------------------------------------------------
>
> (Updated Feb. 4, 2015, 12:06 a.m.)
>
>
> Review request for mesos and Benjamin Hindman.
>
>
> Bugs: MESOS-1806
> https://issues.apache.org/jira/browse/MESOS-1806
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Log when we drop a message since we can't find the local recipient.
> Include details of what was lost in a send failure.
>
>
> Diffs
> -----
>
> 3rdparty/libprocess/src/process.cpp
> 67b6b3b9c13d95fa1a24b48a12c5c831c7f249bf
>
> Diff: https://reviews.apache.org/r/30587/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Cody Maloney
>
>