[jira] [Commented] (MESOS-3509) SlaveTest.TerminatingSlaveDoesNotReregister is flaky

2016-01-14 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15099128#comment-15099128
 ] 

Zhitao Li commented on MESOS-3509:
--

I'll post some inconclusive findings, since my last change seems to have turned this 
test from "flaky" to "failing".

In short, the clock/timer implementation under {{libevent}} seems to allow a timer to 
be created *out of order* w.r.t. {{Clock::advance()}} in certain cases. As a 
result, a timer created after the clock-advancing code ran was not affected by the 
advance, and instead waited for a real wall-clock time of 120s 
({{slave::REGISTER_RETRY_INTERVAL_MAX * 2}}), which is longer than the default 
15s timeout of {{AWAIT_READY}}.

Here is a log snippet from running the test with {{--libevent}} and {{GLOG_v=3}}:
{panel}
I0114 23:25:16.880738 129300 authenticator.cpp:317] Authentication success
I0114 23:25:16.880803 129295 process.cpp:2502] Resuming master@127.0.0.1:31370 
at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.880844 129301 process.cpp:2502] Resuming 
crammd5_authenticator(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.880861 129295 master.cpp:5475] Successfully authenticated 
principal 'test-principal' at slave(1)@127.0.0.1:31370
I0114 23:25:16.880903 129301 authenticator.cpp:431] Authentication session 
cleanup for crammd5_authenticatee(3)@127.0.0.1:31370
I0114 23:25:16.880873 129298 process.cpp:2502] Resuming 
crammd5_authenticatee(3)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.880975 129301 process.cpp:2800] Donating thread to 
crammd5_authenticator_session(3)@127.0.0.1:31370 while waiting
I0114 23:25:16.880991 129298 authenticatee.cpp:298] Authentication success
I0114 23:25:16.881000 129301 process.cpp:2502] Resuming 
crammd5_authenticator_session(3)@127.0.0.1:31370 at 2016-01-14 
23:25:16.878512896+00:00
I0114 23:25:16.881027 129301 process.cpp:2607] Cleaning up 
crammd5_authenticator_session(3)@127.0.0.1:31370
I0114 23:25:16.881069 129298 process.cpp:2502] Resuming 
slave(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.881184 129300 process.cpp:2502] Resuming 
crammd5_authenticatee(3)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.881225 129300 process.cpp:2607] Cleaning up 
crammd5_authenticatee(3)@127.0.0.1:31370
I0114 23:25:16.881301 129298 slave.cpp:860] Successfully authenticated with 
master master@127.0.0.1:31370
I0114 23:25:16.881631 129296 process.cpp:2502] Resuming master@127.0.0.1:31370 
at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.881629 129298 slave.cpp:1254] Will retry registration in 
2.172151ms if necessary
I0114 23:25:16.881724 129298 clock.cpp:279] Created a timer for 
slave(1)@127.0.0.1:31370 in 2.172151ms in the future (2016-01-14 
23:25:16.880685047+00:00)
I0114 23:25:16.881906 129296 master.cpp:4314] Re-registering slave 
a9a5fba6-3191-424d-a1cf-5d12f35ada17-S0 at slave(1)@127.0.0.1:31370 (localhost)
I0114 23:25:16.882228 129294 process.cpp:2502] Resuming 
slave(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.882613 129295 process.cpp:2502] Resuming 
slave(1)@127.0.0.1:31370 at 2016-01-14 23:25:16.878512896+00:00
I0114 23:25:16.882622 129296 master.cpp:4502] Sending updated checkpointed 
resources  to slave a9a5fba6-3191-424d-a1cf-5d12f35ada17-S0 at 
slave(1)@127.0.0.1:31370 (localhost)
I0114 23:25:16.882712 129295 pid.cpp:93] Attempting to parse 
'scheduler-72698b83-ea69-4f94-ac79-1fe005ba5ea9@127.0.0.1:31370' into a PID
W0114 23:25:16.882750 129295 slave.cpp:2162] Dropping updateFramework message 
for a9a5fba6-3191-424d-a1cf-5d12f35ada17- because the slave is in 
DISCONNECTED state
I0114 23:25:16.882935 129295 slave.cpp:2277] Updated checkpointed resources 
from  to 
I0114 23:25:16.883074 129302 clock.cpp:152] Handling timers up to 2016-01-14 
23:25:16.878512896+00:00
I0114 23:25:16.883116 129302 clock.cpp:197] Clock has settled
I0114 23:25:16.888927 129289 clock.cpp:465] Clock is settled
I0114 23:25:16.889067 129289 clock.cpp:381] Clock advanced (2mins) to 0x20681f0
I0114 23:25:16.889143 129302 clock.cpp:152] Handling timers up to 2016-01-14 
23:27:16.878512896+00:00
I0114 23:25:16.889176 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:16.880685047+00:00
I0114 23:25:16.889196 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:16.882896819+00:00
I0114 23:25:16.889209 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:16.965011968+00:00
I0114 23:25:16.889220 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:17.800440064+00:00
I0114 23:25:16.889231 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:18.572319400+00:00
I0114 23:25:16.889242 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:21.561927936+00:00
I0114 23:25:16.889253 129302 clock.cpp:159] Have timeout(s) at 2016-01-14 
23:25:21.847002880+00:00
I0114 23:25:16.889263 129302 clock.cpp:159] Have timeout(s) at 2016-01-

[jira] [Commented] (MESOS-3509) SlaveTest.TerminatingSlaveDoesNotReregister is flaky

2016-01-07 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088428#comment-15088428
 ] 

Adam B commented on MESOS-3509:
---

The test was failing consistently on CentOS 7 after the MESOS-313 changes, so I 
disabled it for now.
commit e1f99010e13b3d0e918d875f24f63075855e2cc7
Author: Adam B 
Date:   Thu Jan 7 15:56:05 2016 -0800

Disabled flaky/failing TerminatingSlaveDoesNotReregister test.

> SlaveTest.TerminatingSlaveDoesNotReregister is flaky
> 
>
> Key: MESOS-3509
> URL: https://issues.apache.org/jira/browse/MESOS-3509
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Jie Yu
>Assignee: Liqiang Lin
>
> Observed on Apache CI
> {noformat}
> [ RUN  ] SlaveTest.TerminatingSlaveDoesNotReregister
> Using temporary directory 
> '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_PiT5jn'
> I0924 04:54:06.645386 29977 leveldb.cpp:176] Opened db in 112.347533ms
> I0924 04:54:06.665618 29977 leveldb.cpp:183] Compacted db in 20.158157ms
> I0924 04:54:06.665704 29977 leveldb.cpp:198] Created db iterator in 24721ns
> I0924 04:54:06.665727 29977 leveldb.cpp:204] Seeked to beginning of db in 
> 3125ns
> I0924 04:54:06.665741 29977 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 256ns
> I0924 04:54:06.665799 29977 replica.cpp:744] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0924 04:54:06.666378 3 recover.cpp:449] Starting replica recovery
> I0924 04:54:06.666831 3 recover.cpp:475] Replica is in EMPTY status
> I0924 04:54:06.667945 30008 replica.cpp:641] Replica in EMPTY status received 
> a broadcasted recover request
> I0924 04:54:06.668473 30005 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I0924 04:54:06.669003 30007 recover.cpp:566] Updating replica status to 
> STARTING
> I0924 04:54:06.669154 3 master.cpp:370] Master 
> bd006350-cdbf-414f-9eef-25f03ccdc5fb (7c9c99874a2d) started on 
> 172.17.1.154:55676
> I0924 04:54:06.669347 3 master.cpp:372] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" 
> --authorizers="local" 
> --credentials="/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_PiT5jn/credentials"
>  --framework_sorter="drf" --help="false" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_slave_ping_timeouts="5" --quiet="false" 
> --recovery_slave_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" 
> --registry_strict="true" --root_submissions="true" 
> --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
> --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.25.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_PiT5jn/master" 
> --zk_session_timeout="10secs"
> I0924 04:54:06.669595 3 master.cpp:417] Master only allowing 
> authenticated frameworks to register
> I0924 04:54:06.669605 3 master.cpp:422] Master only allowing 
> authenticated slaves to register
> I0924 04:54:06.669613 3 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_PiT5jn/credentials'
> I0924 04:54:06.669874 3 master.cpp:461] Using default 'crammd5' 
> authenticator
> I0924 04:54:06.670027 3 master.cpp:498] Authorization enabled
> I0924 04:54:06.670264 30005 hierarchical.hpp:468] Initialized hierarchical 
> allocator process
> I0924 04:54:06.670320 30005 whitelist_watcher.cpp:79] No whitelist given
> I0924 04:54:06.671406 2 master.cpp:1597] The newly elected leader is 
> master@172.17.1.154:55676 with id bd006350-cdbf-414f-9eef-25f03ccdc5fb
> I0924 04:54:06.671438 2 master.cpp:1610] Elected as the leading master!
> I0924 04:54:06.671461 2 master.cpp:1370] Recovering from registrar
> I0924 04:54:06.671612 30011 registrar.cpp:309] Recovering registrar
> I0924 04:54:06.704048 30002 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 34.707125ms
> I0924 04:54:06.704123 30002 replica.cpp:323] Persisted replica status to 
> STARTING
> I0924 04:54:06.704504 29997 recover.cpp:475] Replica is in STARTING status
> I0924 04:54:06.705658 30002 replica.cpp:641] Replica in STARTING status 
> received a broadcasted recover request
> I0924 04:54:06.706466 30007 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I0924 04:54:06.707064 30004 recover.cpp:566] Updating replica status to VOTING
> I0924 04:54:06.737488 30004 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 30.261425ms
> I0924 04:54:06.737560 30004 replica.