[ https://issues.apache.org/jira/browse/MESOS-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217787#comment-15217787 ]

Ian Babrou commented on MESOS-4645:
-----------------------------------

I've also had all my tasks killed on what was supposed to be a seamless 0.26.0 -> 
0.27.2 upgrade:

{noformat}
Mar 30 10:11:06 myhost mesos-slave[17550]: I0330 10:11:06.451997 17561 
slave.cpp:709] Slave asked to shut down by master@1.2.11.16:5050 because 'Slave 
attempted to re-register after removal'
{noformat}

Marathon failure reason:

{noformat}
Slave myhost removed: health check timed out
{noformat}
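
For reference, a quick way to see how small the removal window actually is: a 
minimal sketch (assuming the Mesos 0.27.x /master/state.json endpoint and the 
flag names shown in the logs below) that reads the master's effective flags 
and computes the window:

{noformat}
# Illustrative diagnostic, not part of this report.
import json
import re
import urllib.request

MASTER = "http://1.2.11.16:5050"  # master address from the log above

with urllib.request.urlopen(MASTER + "/master/state.json") as resp:
    flags = json.load(resp)["flags"]

# Flag values are strings like "15secs"; pull out the leading number.
ping_secs = float(re.match(r"[\d.]+", flags["slave_ping_timeout"]).group())
max_pings = int(flags["max_slave_ping_timeouts"])

# With the flags in this report: 15 * 5 = 75 seconds.
print("agent removed after ~%gs unresponsive" % (ping_secs * max_pings))
{noformat}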

> Mesos agent shutdown on healthcheck timeout rather than lost and recovered
> -------------------------------------------------------------------------
>
>                 Key: MESOS-4645
>                 URL: https://issues.apache.org/jira/browse/MESOS-4645
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.27.1
>            Reporter: Cody Maloney
>              Labels: mesosphere
>
> I expected slaves to have to be gone for the full re-registration timeout 
> before they'd be lost to the cluster, not merely fail 5 healthchecks (failing 
> the healthchecks indicates there is a network partition, not that the agent 
> is gone for good and will never come back).
> Is there some flag I'm missing here which I should be setting?
> From my perspective, frameworks should not get offers for resources on 
> agents which haven't been contacted recently (the framework wouldn't be able 
> to launch anything on such an agent). Once the re-registration period times 
> out, the slave would be assumed completely lost and its tasks assumed 
> terminated / able to be re-launched if desired. If an agent recovers between 
> the healthcheck timeout and the re-registration timeout, it should be able 
> to re-join the cluster with its running tasks kept running.
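> As a minimal sketch of the arithmetic in play here, using the flag values 
> from the master logs below (deriving the health-check window as ping timeout 
> times max ping timeouts is my reading of the documented ping behavior):
> {noformat}
> # Illustrative only -- computes the two windows from this report's flags.
> from datetime import timedelta
>
> slave_ping_timeout = timedelta(seconds=15)        # --slave_ping_timeout="15secs"
> max_slave_ping_timeouts = 5                       # --max_slave_ping_timeouts="5"
> slave_reregister_timeout = timedelta(minutes=10)  # --slave_reregister_timeout="10mins"
>
> # Window after which the master declares "health check timed out" and
> # removes the agent (what actually happens below):
> print(slave_ping_timeout * max_slave_ping_timeouts)  # 0:01:15
>
> # Window I expected to apply before the agent is permanently lost:
> print(slave_reregister_timeout)                      # 0:10:00
> {noformat}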
> Note: some log lines have their start or tail truncated; the critical stuff 
> should all be there.
> Master flags
> {noformat}
> Feb 11 00:22:19 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:22:19.690507  1362 master.cpp:369] Flags at startup: 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="false" 
> --authenticators="crammd5" --authorizers="local" --cluster="cody-cm52sd-2" 
> --framework_sorter="drf" --help="false" --hostname_lookup="false" 
> --initialize_driver_logging="true" 
> --ip_discovery_command="/opt/mesosphere/bin/detect_ip" 
> --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" 
> --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" 
> --quiet="false" --quorum="1" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="5secs" --registry_strict="false" 
> --roles="slave_public" --root_submissions="true" 
> --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
> --user_sorter="drf" --version="false" 
> --webui_dir="/opt/mesosphere/packages/mesos--4dd59ec6bde2052f6f2a0a0da415b6c92c3c418a/share/mesos/webui"
>  --weights="slave_public=1" --work_dir="/var/lib/mesos/master" 
> --zk="zk://127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
> {noformat}
> Slave flags
> {noformat}
> Feb 11 00:34:13 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3914]: 
> I0211 00:34:13.334395  3914 slave.cpp:192] Flags at startup: 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="docker,mesos" --default_role="*" 
> --disk_watch_interval="1mins" --docker="docker" 
> --docker_auth_server="auth.docker.io" --docker_auth_server_port="443" 
> --docker_kill_orphans="true" 
> --docker_local_archives_dir="/tmp/mesos/images/docker" 
> --docker_puller="local" --docker_puller_timeout="60" 
> --docker_registry="registry-1.docker.io" --docker_registry_port="443" 
> --docker_remove_delay="1hrs" --docker_socket="/var/run/docker.sock" 
> --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" 
> --enforce_container_disk_quota="false" 
> --executor_environment_variables="{"LD_LIBRARY_PATH":"\/opt\/mesosphere\/lib","PATH":"\/usr\/bin:\/bin","SASL_PATH":"\/opt\/mesosphere\/lib\/sasl2","SHELL":"\/usr\/bin\/bash"}"
>  --executor_registration_timeout="5mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="2days" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" --hostname_lookup="false" 
> --image_provisioner_backend="copy" --initialize_driver_logging="true" 
> --ip_discovery_command="/opt/mesosphere/bin/detect_ip" 
> --isolation="cgroups/cpu,cgroups/mem" 
> --launcher_dir="/opt/mesosphere/packages/mesos--4dd59ec6bde2052f6f2a0a0da415b6c92c3c418a/libexec/mesos"
>  --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" 
> --master="zk://leader.mesos:2181/mesos" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="1secs" 
> --resources="ports:[1025-2180,2182-3887,3889-5049,5052-8079,8082-8180,8182-32000]"
>  --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" 
> --slave_subsystems="cpu,memory" --strict="true" --switch_user="true" 
> --systemd_runtime_directory="/run/systemd/system" --version="false" 
> --work_dir="/var/lib/mesos/slave"
> {noformat}
> h2. Restarting the slave
> {noformat}
> Feb 11 00:32:44 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3261]: 
> W0211 00:32:40.981289  3261 logging.cpp:81] RAW: Received signal SIGTERM from 
> process 1 of user 0; exiting
> Feb 11 00:32:44 ip-10-0-0-52.us-west-2.compute.internal systemd[1]: Stopping 
> Mesos Slave...
> Feb 11 00:32:44 ip-10-0-0-52.us-west-2.compute.internal systemd[1]: Stopped 
> Mesos Slave.
> Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal systemd[1]: Starting 
> Mesos Slave...
> Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal ping[3534]: PING 
> leader.mesos (10.0.4.187) 56(84) bytes of data.
> Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal ping[3534]: 64 bytes 
> from ip-10-0-4-187.us-west-2.compute.internal (10.0.4.187): icmp_seq=1 ttl=64 
> time=0.314 ms
> Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal ping[3534]: --- 
> leader.mesos ping statistics ---
> Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal ping[3534]: 1 packets 
> transmitted, 1 received, 0% packet loss, time 0ms
> Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal ping[3534]: rtt 
> min/avg/max/mdev = 0.314/0.314/0.314/0.000 ms
> Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal systemd[1]: Started 
> Mesos Slave.
> Feb 11 00:34:02 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: 
> I0211 00:34:02.256242  3536 logging.cpp:172] INFO level logging started!
> {noformat}
> h2. The slave detects the new master, gets shut down for re-registering after 
> removal
> {noformat}
> Feb 11 00:34:04 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: 
> I0211 00:34:04.705356  3546 slave.cpp:729] New master detected at 
> master@10.0.4.187:5050
> Feb 11 00:34:04 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: 
> I0211 00:34:04.705366  3539 status_update_manager.cpp:176] Pausing sending 
> status updates
> Feb 11 00:34:04 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: 
> I0211 00:34:04.705550  3546 slave.cpp:754] No credentials provided. 
> Attempting to register without authentication
> Feb 11 00:34:04 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: 
> I0211 00:34:04.705597  3546 slave.cpp:765] Detecting new master
> Feb 11 00:34:05 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: 
> I0211 00:34:05.624832  3544 slave.cpp:643] Slave asked to shut down by 
> master@10.0.4.187:5050 because 'Slave attempted to re-register after removal'
> Feb 11 00:34:05 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: 
> I0211 00:34:05.624908  3544 slave.cpp:2009] Asked to shut down framework 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 by master@10.0.4.187:5050
> Feb 11 00:34:05 ip-10-0-0-52.us-west-2.compute.internal mesos-slave[3536]: 
> I0211 00:34:05.624939  3544 slave.cpp:2034] Shutting down framework 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000
> {noformat}
> h2. Master initially registering the slave
> {noformat}
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.968310  1373 master.cpp:3859] Registering slave at 
> slave(1)@10.0.0.52:5051 (10.0.0.52) with id 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0
> {noformat}
> {noformat}
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.976769  1374 log.cpp:704] Attempting to truncate the log to 3
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.976820  1370 coordinator.cpp:350] Coordinator attempting to 
> write TRUNCATE action at position 4
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.977002  1369 replica.cpp:540] Replica received write request 
> for position 4 from (13)@10.0.4.187:5050
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.977157  1374 master.cpp:3927] Registered slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 
> (10.0.0.52) with ports(*):[1025-
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.977207  1368 hierarchical.cpp:344] Added slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 (10.0.0.52) with ports(*):[1025-2180, 
> 2182-3887, 3889-5049,
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.977552  1368 master.cpp:4979] Sending 1 offers to framework 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 (marathon) at 
> scheduler-8174298d-3ef3-4683-9
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.978520  1369 leveldb.cpp:343] Persisting action (16 bytes) to 
> leveldb took 1.485099ms
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.978559  1369 replica.cpp:715] Persisted action at 4
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.978710  1369 replica.cpp:694] Replica received learned notice 
> for position 4 from @0.0.0.0:0
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.979212  1372 master.cpp:4269] Received update of slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 
> (10.0.0.52) with total o
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.979322  1372 hierarchical.cpp:400] Slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 (10.0.0.52) updated with 
> oversubscribed resources  (total: ports(
> Feb 11 00:23:01 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:23:01.980257  1369 leveldb.cpp:343] Persisting action (18 bytes) to 
> leveldb took 1.514614ms
> {noformat}
> h2. Lose the slave
> {noformat}
> Feb 11 00:32:12 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:12.578547  1368 master.cpp:1083] Slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 
> (10.0.0.52) disconnected
> Feb 11 00:32:12 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:12.578627  1368 master.cpp:2531] Disconnecting slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52)
> Feb 11 00:32:12 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:12.578673  1368 master.cpp:2550] Deactivating slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52)
> Feb 11 00:32:12 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:12.578764  1374 hierarchical.cpp:429] Slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 deactivated
> {noformat}
> h2. Slave came back (earlier restart, only gone for seconds)
> {noformat}
> Feb 11 00:32:15 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:15.965806  1370 master.cpp:4019] Re-registering slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52)
> Feb 11 00:32:15 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:15.966354  1373 hierarchical.cpp:417] Slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 reactivated
> Feb 11 00:32:15 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:15.966419  1370 master.cpp:4207] Sending updated checkpointed 
> resources  to slave 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at 
> slave(1)@10.0.0.52:5051 
> Feb 11 00:32:15 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:15.967167  1371 master.cpp:4269] Received update of slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 
> (10.0.0.52) with total o
> Feb 11 00:32:15 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:15.967296  1371 hierarchical.cpp:400] Slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 (10.0.0.52) updated with 
> oversubscribed resources  (total: ports(
> {noformat}
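> A compact way to read the transitions above (my illustrative model, not 
> Mesos source): disconnection only deactivates the agent, which is 
> recoverable, while removal is terminal for the SlaveID:
> {noformat}
> # Illustrative state machine for the transitions visible in these logs.
> TRANSITIONS = {
>     ("connected", "socket closed"):  "deactivated",  # 00:32:12 and 00:32:44
>     ("deactivated", "re-register"):  "connected",    # 00:32:15 -- recoverable
>     ("deactivated", "ping timeout"): "removed",      # 00:33:47 -- terminal
> }
> {noformat}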
> h2. The shutdown of the slave
> {noformat}
> Feb 11 00:32:44 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:44.142541  1371 http.cpp:334] HTTP GET for /master/state-summary 
> from 10.0.4.187:44274 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) 
> AppleWebKit/5
> Feb 11 00:32:44 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:44.150949  1368 master.cpp:1083] Slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 
> (10.0.0.52) disconnected
> Feb 11 00:32:44 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:44.151002  1368 master.cpp:2531] Disconnecting slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52)
> Feb 11 00:32:44 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:44.151048  1368 master.cpp:2550] Deactivating slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 (10.0.0.52)
> Feb 11 00:32:44 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:32:44.151113  1368 hierarchical.cpp:429] Slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 deactivated
> {noformat}
> h2. Slave lost (The critical part). Slave should be lost at healthcheck 
> timeout, not shut down.
> {noformat}
> Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:33:47.009037  1372 master.cpp:236] Shutting down slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 due to health check timeout
> Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> W0211 00:33:47.009124  1372 master.cpp:4581] Shutting down slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 
> (10.0.0.52) with message 'hea
> Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:33:47.009181  1372 master.cpp:5846] Removing slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 
> (10.0.0.52): health check timed ou
> Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:33:47.009297  1372 master.cpp:6066] Updating the state of task 
> test-app-2.4057f89f-d056-11e5-8aeb-0242d6f35f4b of framework 
> 0c9ebb3f-23f8-4fce-b276-9ebc
> Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:33:47.009353  1369 hierarchical.cpp:373] Removed slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0
> {noformat}
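> A simplified model (illustrative names, not Mesos source) of the sequence 
> these log lines show, and why the agent's later re-registration attempt is 
> refused:
> {noformat}
> # Illustrative pseudocode mirroring master.cpp:236/4581/5846/6066 above.
> def on_health_check_timeout(master, slave):
>     slave.shutdown("health check timed out")   # agent will kill its tasks
>     master.registry.remove(slave)              # removal is persisted
>     for task in slave.tasks:
>         master.update_task(task, "TASK_LOST")  # frameworks see TASK_LOST
>     # Because removal is persisted, a later re-registration with the old
>     # SlaveID is refused ("Slave attempted to re-register after removal");
>     # the agent only gets back in with a fresh id, as at the end of this
>     # report.
> {noformat}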
> h2. Tasks marked as slave-lost
> {noformat}
> 2] Removing task test-app.4076cb59-d056-11e5-8aeb-0242d6f35f4b with resources 
> cpus(*):0.1; mem(*):16; ports(*):[2791-2791] of framework 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 on slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)
> 6] Updating the state of task test-app.40756bc5-d056-11e5-8aeb-0242d6f35f4b 
> of framework 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 (latest state: 
> TASK_LOST, status update state: TASK_LOST)
> 2] Removing task test-app.40756bc5-d056-11e5-8aeb-0242d6f35f4b with resources 
> cpus(*):0.1; mem(*):16; ports(*):[6724-6724] of framework 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 on slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)
> 6] Updating the state of task test-app-2.40765628-d056-11e5-8aeb-0242d6f35f4b 
> of framework 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-0000 (latest state: 
> TASK_LOST, status update state: TASK_LOST)
> {noformat}
> h2. Slave gone for good
> {noformat}
> Feb 11 00:33:47 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:33:47.021023  1374 master.cpp:5965] Removed slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 (10.0.0.52): health check timed out
> {noformat}
> h2. Master refuses to accept slave
> {noformat}
> Feb 11 00:34:05 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> W0211 00:34:05.614985  1368 master.cpp:3997] Slave 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S0 at slave(1)@10.0.0.52:5051 
> (10.0.0.52) attempted to re-register after 
> {noformat}
> h2. Slave comes up with new id, registers properly 
> {noformat}
> Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:34:13.757870  1368 master.cpp:3859] Registering slave at 
> slave(1)@10.0.0.52:5051 (10.0.0.52) with id 
> 0c9ebb3f-23f8-4fce-b276-9ebca1ede0b1-S1
> Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:34:13.758057  1372 registrar.cpp:441] Applied 1 operations in 
> 23020ns; attempting to update the 'registry'
> Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:34:13.758257  1368 log.cpp:685] Attempting to append 367 bytes to 
> the log
> Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:34:13.758316  1368 coordinator.cpp:350] Coordinator attempting to 
> write APPEND action at position 7
> Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:34:13.758450  1368 replica.cpp:540] Replica received write request 
> for position 7 from (75)@10.0.4.187:5050
> Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:34:13.759891  1368 leveldb.cpp:343] Persisting action (386 bytes) to 
> leveldb took 1.411937ms
> Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:34:13.759927  1368 replica.cpp:715] Persisted action at 7
> Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:34:13.760097  1368 replica.cpp:694] Replica received learned notice 
> for position 7 from @0.0.0.0:0
> Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:34:13.763203  1368 leveldb.cpp:343] Persisting action (388 bytes) to 
> leveldb took 3.072892ms
> Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:34:13.763236  1368 replica.cpp:715] Persisted action at 7
> Feb 11 00:34:13 ip-10-0-4-187.us-west-2.compute.internal mesos-master[1362]: 
> I0211 00:34:13.763250  1368 replica.cpp:700] Replica learned APPEND action at 
> position 7
> {noformat}


