[ https://issues.apache.org/jira/browse/MESOS-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378086#comment-15378086 ]

Sharma Podila commented on MESOS-4795:
--------------------------------------

We had a similar issue happen again. This time we were able to debug it 
further. 
Some of the agents repeatedly failed to restart, this time due to a different 
cause. The cause wasn't obvious from the .INFO/.WARN/.ERROR files; it was 
shown only on stdout when we started the mesos-slave process manually. It 
would be helpful to have that message show up in the INFO/WARN/ERROR files as 
well. We didn't expect to have to redirect stdout when three logfiles are 
already being created.
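
For what it's worth, my guess (only a guess, I haven't traced the Mesos source 
for this) is that glog only captures messages that go through its LOG() macros, 
while a fatal startup error written directly to stdout/stderr bypasses the 
.INFO/.WARN/.ERROR files entirely. A minimal standalone sketch of that 
difference (not Mesos code):

{code}
// Hypothetical glog example: the LOG() lines land in the <program>.INFO /
// .WARNING / .ERROR files under FLAGS_log_dir, while the direct std::cerr
// write only appears on the terminal -- matching what we saw when starting
// mesos-slave by hand.
#include <cstdlib>
#include <iostream>
#include <glog/logging.h>

int main(int argc, char** argv) {
  FLAGS_log_dir = "/tmp";                // where glog creates its log files
  google::InitGoogleLogging(argv[0]);

  LOG(INFO) << "This line shows up in the .INFO file.";
  LOG(ERROR) << "This line shows up in the .ERROR (and .WARNING/.INFO) files.";

  // A startup error written straight to the console bypasses glog entirely:
  std::cerr << "Resource mismatch detected; refusing to start." << std::endl;
  return EXIT_FAILURE;
}
{code}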

The problem was that some of the resources had changed since the previous 
start, so the new process was exiting with an error that was shown only on 
stdout. Specifically, the total size of the "disk" resource changed by ~10 
bytes (out of, for example, 600 GB). We will investigate solutions for that 
separately (the partition was served by ZFS). 

Effectively, this isn't a bug in Mesos. I will leave this open, however, in 
case there's a need to create an enhancement or two:
  - have messages going to stdout also show up in the log files (is this a 
config issue that we overlooked?)
  - have fuzzy logic so the agent doesn't complain about a ~10-byte difference 
in a ~600 GB disk partition (a rough sketch of what I mean follows this list)
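
For the second item, here is a rough sketch of the kind of comparison I have 
in mind (the names and tolerance values are made up; this is not Mesos code): 
treat the checkpointed and detected disk sizes as equivalent when they differ 
by at most a small absolute or relative amount.

{code}
// Hypothetical "fuzzy" disk-size comparison for agent recovery.
#include <cstdint>
#include <iostream>

// Consider two disk sizes equivalent if they differ by at most
// `absoluteSlackBytes`, or by at most `relativeSlack` of the checkpointed
// size, whichever limit is more permissive.
bool diskSizesCompatible(uint64_t checkpointedBytes,
                         uint64_t detectedBytes,
                         uint64_t absoluteSlackBytes = 1024,   // 1 KiB
                         double relativeSlack = 0.0001) {      // 0.01%
  const uint64_t diff = checkpointedBytes > detectedBytes
      ? checkpointedBytes - detectedBytes
      : detectedBytes - checkpointedBytes;

  const double relativeLimit =
      relativeSlack * static_cast<double>(checkpointedBytes);
  return diff <= absoluteSlackBytes ||
         static_cast<double>(diff) <= relativeLimit;
}

int main() {
  const uint64_t checkpointed = 586104ULL * 1024 * 1024;  // ~600 GB, as in our case
  const uint64_t detected = checkpointed - 10;            // ~10 bytes smaller after restart

  // With strict equality the agent refuses to recover; with slack it proceeds.
  std::cout << (checkpointed == detected ? "strict: equal\n" : "strict: mismatch\n");
  std::cout << (diskSizesCompatible(checkpointed, detected)
                    ? "fuzzy: compatible\n" : "fuzzy: incompatible\n");
  return 0;
}
{code}

An absolute slack alone would cover our case (a ~10-byte drift); the relative 
slack is just to make the idea scale with partition size.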

If these are already addressed elsewhere, I am OK with closing this.


> mesos agent not recovering after ZK init failure
> ------------------------------------------------
>
>                 Key: MESOS-4795
>                 URL: https://issues.apache.org/jira/browse/MESOS-4795
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.24.1
>            Reporter: Sharma Podila
>
> Here's the sequence of events that happened:
> - Agent running fine with 0.24.1
> - Transient ZK issues; slave flapping with zookeeper_init failure
> - ZK issue resolved
> - Most agents stop flapping and function correctly
> - Some agents continue flapping, but exit silently after printing the 
> detector.cpp:481 log line
> - The agents that continued to flap were repaired by manually removing the 
> contents of mesos-slave's working dir
> Here are the contents of the various log files on the agent:
> The .INFO logfile for one of the restarts, before the mesos-slave process 
> exited with no other error messages:
> {code}
> Log file created at: 2016/02/09 02:12:48
> Running on machine: titusagent-main-i-7697a9c5
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix
> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 
> 1)@10.138.146.230:7101
> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: 
> --appc_store_dir="/tmp/mesos/store/appc" 
> --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" <snip>"
> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: 
> ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
> I0209 02:12:48.516139 97299 group.cpp:331] Group process 
> (group(1)@10.138.146.230:7101) connected to ZooKeeper
> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue 
> size (joins, cancels, datas) = (0, 0, 0)
> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path 
> '/titus/main/mesos' in ZooKeeper
> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: 
> (id='209')
> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get 
> '/titus/main/mesos/json.info_0000000209' in ZooKeeper
> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from 
> '/mnt/data/mesos/meta'
> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file 
> '/mnt/data/mesos/meta/resources/resources.info'
> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master 
> (UPID=master@10.230.95.110:7103) is detected
> {code}
> The .FATAL log file when the original transient ZK error occurred:
> {code}
> Log file created at: 2016/02/05 17:21:37
> Running on machine: titusagent-main-i-7697a9c5
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> {code}
> The .ERROR log file:
> {code}
> Log file created at: 2016/02/05 17:21:37
> Running on machine: titusagent-main-i-7697a9c5
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> {code}
> The .WARNING file had the same content. 


