See https://bugs.launchpad.net/juju-core/+bug/1514874.

When starting, the machine agent starts up several workers in a runner
(more or less the same in 1.25 and 2.0).  Then it waits for the runner
to stop.  If one of the workers has a "fatal" error then the runner
stops and the machine agent handles the error.  If the error is
worker.ErrTerminateAgent then the machine agent uninstalls (cleans up)
itself.  This amounts to deleting its data dir and uninstalling its
jujud and mongo init services.  However, all other files are left in
place* and the machine/instance is not removed from Juju (agent stays
"lost").

Notably, the unit agent handles fatal errors a bit differently, with
some retry logic (e.g. for API connections) and never cleans up after
itself.

The problem here is that a fatal error *may* be recoverable with
manual intervention or with retries.  Cleaning up like we do makes it
harder to manually recover.  Such errors are in theory extremely
uncommon, so we don't worry about doing anything other than stopping
and "uninstalling".  However, as seen with the above-referenced bug
report,
bugs can lead to fatal errors happening with enough frequency that the
agent's clean-up becomes a pain point.

The history of this behavior isn't particularly clear so I'd first
like to hear what the original rationale was for cleaning up the agent
when "terminating" it.  Then I'd like to know if perhaps that
rationale has changed.  Finally, I'd like us to consider alternatives
that allow for better recoverability.

Regarding that third part, we have a number of options.  I introduced
several in the above-referenced bug report and will expand on them
here:

1. do nothing
    This would be easy :) but does not help with the pain point.
2. be smarter about retrying (e.g. retry connecting to API) when
running into fatal errors
    This would probably be good to do but the effort might not pay for itself.
3. do not clean up (data dir, init service, or either)
    Leaving the init service installed really isn't an option because we
    don't want the init service to try restarting the agent over and over.
    Leaving the data dir in place isn't good because it will conflict with
    any new agent dir the controller tries to put on the instance in the
    future.
4. add a DO_NOT_UNINSTALL or DO_NOT_CLEAN_UP flag (e.g. in the
machine/model config or as a sentinel file on instances) and do not
clean up if it is set to true (default?)
    This would provide a reasonable quick fix for the above bug, even if
    temporary and even if it defaults to false.
5. disable (instead of uninstall) the init services and move the data
dir to some obvious but out-of-the-way place (e.g.
/home/ubuntu/agent-data-dir-XXMODELUUIDXX-machine-X) instead of
deleting it
    This is a reasonable longer-term solution, as the concerns described
    for 3 are addressed and manual intervention becomes more feasible.
6. same as 5 but also provide an "enable-juju" (or "revive-juju",
"repair-juju", etc.) command *on the machine* that would re-enable the
init services and restore (or rebuild) the agent's data dir
    This would make it even easier to manually recover.
7. first try to automatically recover (from machine or controller)
    This would require a sizable effort for a problem that shouldn't
normally happen.
8. do what the unit agent does
    I haven't looked closely enough to see if this is a good fit.

I'd consider 4 with a default of false to be an acceptable quick fix.
Additionally, I'll advocate for 6 (or 5) as the most appropriate
solution in support of manual recovery from "fatal" errors.

-eric


* Could this lead to collisions if the instance is re-purposed as a
different machine?  I suppose it could also expose sensitive data when
likewise re-purposed, since it won't necessarily be in the same model
or controller.  However, given the need for admin access that probably
isn't a likely problem.

-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com