Re: mesos agent not recovering after ZK init failure

Vinod Kone Fri, 15 Jul 2016 11:46:48 -0700

On Fri, Jul 15, 2016 at 11:31 AM, Sharma Podila <spod...@netflix.com> wrote:


> We had this issue happen again and were able to debug further. The cause
> for agent not being able to restart is that one of the resources (disk)
> changed its total size since the last restart. However, this error does not
> show up in INFO/WARN/ERROR files. We saw it in stdout only when manually
> restarting the agent. It would be good to have all messages going to
> stdout/stderr show up in the logs. Is there a config setting for it that I
> missed?
>

When the master/agent exits due to an unrecoverable error, it uses a stout
library function `EXIT`, which only prints to stderr. Agreed that this is
not great UX; mind filing a ticket? Note that even if we fix this in Mesos,
we can't easily fix this behavior in the third-party libraries that we use
(e.g., ZooKeeper). The way we dealt with this in production at my
previous company was to redirect stdout/stderr to a
mesos-{master,agent}.log. You can disable "--log_dir" to avoid double
logging.
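
For illustration, here is a minimal sketch of that redirection as a small
Python launcher. The helper name `run_with_log` and the log path in the usage
note are hypothetical, not something taken from our actual setup; a shell
redirect in an init script would do the same job.

```python
import subprocess
import sys


def run_with_log(argv, log_path):
    """Run argv with stdout and stderr both appended to log_path.

    Returns the child's exit code, so a supervisor can still react to it.
    """
    with open(log_path, "ab") as log:
        # stderr=STDOUT merges both streams into the same file, so EXIT
        # messages that only go to stderr end up next to everything else.
        result = subprocess.run(argv, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

A wrapper could then end with something like
`sys.exit(run_with_log(["mesos-agent"] + sys.argv[1:], "/var/log/mesos/mesos-agent.log"))`,
where the log location is only an example.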



> The disk size total is changing sometimes on our agents. It is off by a
> few bytes (seeing ~10 bytes difference out of, say, 600 GB). We use ZFS on
> our agents to manage the disk partition. From my colleague, Andrew (copied
> here):
>
>> The current Mesos approach (i.e., `statvfs()` for total blocks and assume
>> that never changes) won’t work reliably on ZFS
>
>
As Jie alluded to, one strategy is to have a startup wrapper script that
calculates the resources and calls the `mesos-agent` binary with the
`--resources` flag set. This is what we used to do in production.
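
A rough sketch of such a wrapper, in Python, might look like the following.
This is not the script we used: the function names, the 1 GB rounding
granularity, and the work directory in the usage note are all assumptions made
for illustration. Rounding the total down is just one way to keep a few bytes
of drift from changing the advertised value, and a total sitting right on a
rounding boundary could still flip, so pinning a fixed number may be safer.

```python
import os
import sys

MB = 1024 * 1024


def round_down_mb(total_bytes, granularity_mb=1024):
    """Convert bytes to MB, rounded down to a multiple of granularity_mb."""
    total_mb = total_bytes // MB
    return (total_mb // granularity_mb) * granularity_mb


def stable_disk_mb(path, granularity_mb=1024):
    """Total size of the filesystem holding path, in rounded-down MB."""
    st = os.statvfs(path)
    return round_down_mb(st.f_blocks * st.f_frsize, granularity_mb)


def agent_argv(disk_mb, extra_args):
    """Build the mesos-agent command line with an explicit disk resource."""
    return ["mesos-agent", "--resources=disk:%d" % disk_mb] + list(extra_args)


def main():
    # Hypothetical work directory; use whatever the agent actually runs with.
    argv = agent_argv(stable_disk_mb("/var/lib/mesos"), sys.argv[1:])
    os.execvp(argv[0], argv)  # replace this process with the agent
```

Calling `main()` at the bottom of the script turns it into the actual
launcher; the remaining agent flags are passed through unchanged.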
