On 06/08/2012 11:25 AM, Eric Sammer wrote:
On Thu, Jun 7, 2012 at 7:12 PM, Juhani Connolly <[email protected]> wrote:
Thanks for your input on this.
I'm more than happy for misbehaving things to just fail, though I hope
that we can leave proper logs so people can know something is wrong.
Completely agree. +1.
Do you feel the auto-restarting behaviour should still remain a part of
the lifecycle? I'm not entirely sure it is something the supervisor
should be doing, and the majority of components that fail tend to just
keep failing.
I do like the idea of Erlang's always-restart policy.
The rationale is that if something is failing due to an outside resource
(let's say the file channel failing due to a lack of disk space), correcting
the external problem should effect recovery of Flume. In other words, if a
disk fills, I want to delete some files and let Flume get back to work
without needing to figure out whether it's still within its retry period. I
say let it go forever. Maybe we don't log at some point, but I'm inclined to
just let it go. If users don't want the logs, they can configure log4j
conservatively and use metrics to track the number of component failures /
restarts (which should be exposed, IMO).
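To make that concrete, a rough sketch of an "always restart" supervision
loop might look like the following; the names (AlwaysRestartSupervisor,
RestartableComponent) are hypothetical, not Flume's actual supervisor code:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical "always restart" supervisor: it never gives up on a
    // component, it just counts restarts so the number can be exposed as
    // a metric instead of flooding the logs.
    public class AlwaysRestartSupervisor implements Runnable {

      public interface RestartableComponent {
        void start() throws Exception;
        boolean isHealthy();
        void stop();
      }

      private final RestartableComponent component;
      private final long checkIntervalMillis;
      private final AtomicLong restartCount = new AtomicLong();

      public AlwaysRestartSupervisor(RestartableComponent component,
          long checkIntervalMillis) {
        this.component = component;
        this.checkIntervalMillis = checkIntervalMillis;
      }

      public long getRestartCount() {
        return restartCount.get(); // exposed as a metric, not a log line
      }

      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            component.start();
            while (component.isHealthy()) {
              Thread.sleep(checkIntervalMillis);
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // orderly shutdown requested
          } catch (Exception e) {
            // Deliberately swallowed: if the failure comes from an external
            // resource (a full disk, say), fixing that is enough to recover.
          }
          component.stop();
          if (!Thread.currentThread().isInterrupted()) {
            restartCount.incrementAndGet();
          }
        }
      }
    }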
Ok, I'm not familiar with Erlang and its framework, but sounds good to me.
With these simplicity needs in mind though, the starting point that Brock
has attached to the JIRA looks simple and does most of what we need.
I've been working on something bigger than that, with more settings for
policy (configurable limits on the number of start/stop attempts, timeouts,
and forced termination). This probably isn't necessary though if we decide
on some "correct" behavior for components and enforce it in code reviews.
My general feeling is that the framework (Flume proper) should be dumb:
restart forever or never, configurable by the user by way of a policy
(which we don't expose today, admittedly). Sources and sinks should not
enforce lifecycle decisions; a source should do what it's told, as should a
sink. Weird magic numbers don't buy you much (e.g. "restart three times and
then give up").
The one thing I didn't implement from Erlang in the supervisor code was
restart policies that affect other components owned by the same supervisor.
Ex: Erlang's supervisors permit one to say "if you restart component X for
some reason, also restart Y, Z, and A." This is nice as it allows one to
discard state associated with a previous incarnation of the lifecycle. I
didn't have a strong need for it, though. In fact, my gut tells me the
*best* thing to do is to go one step further than discarding the component
instances: use the last configuration event and the builders to
re-instantiate the components, preventing state leaks after an exception
that may have left a component in a bad state. That's more complicated,
though.
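A sketch of that "re-instantiate from the last configuration event" idea,
with made-up names standing in for the real builders and configuration
types:

    import java.util.Map;

    // Hypothetical sketch: rather than restarting the same object after a
    // failure, discard it and build a fresh instance from the most recent
    // configuration, so state left behind by an exception cannot leak into
    // the next incarnation.
    public class RebuildingSupervisor {

      // Minimal stand-ins, not the real Flume interfaces.
      public interface Component {
        void start();
        void stop();
      }

      public interface ComponentBuilder {
        Component build(Map<String, String> lastKnownConfig);
      }

      private final ComponentBuilder builder;
      private volatile Map<String, String> lastKnownConfig;
      private Component current;

      public RebuildingSupervisor(ComponentBuilder builder,
          Map<String, String> initialConfig) {
        this.builder = builder;
        this.lastKnownConfig = initialConfig;
      }

      // Remember the most recent configuration event.
      public void onConfigurationEvent(Map<String, String> newConfig) {
        this.lastKnownConfig = newConfig;
      }

      // Called when the current component fails.
      public synchronized void restartFromScratch() {
        if (current != null) {
          try {
            current.stop();
          } catch (RuntimeException ignored) {
            // Best effort: this instance is being thrown away anyway.
          }
        }
        current = builder.build(lastKnownConfig); // brand new instance
        current.start();
      }
    }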
Sorry, these are just ideas that have been rolling around that I never
really got to put out there on the lists. I'm brain-vomiting at you. :)
These ideas sound cool, but not exactly easy to get right straight away.
Hopefully I can start with something simpler and we can iterate from
there without adding more complexity than is absolutely necessary. I
plan to put in a rudimentary policy with hooks to keep it informed,
and we can isolate more complex logic in there at a later date if necessary.
On 06/07/2012 03:32 PM, Eric Sammer wrote:
I can try and answer any lingering questions about the existing lifecycle
management code. I knew there were outstanding issues in it (which was the
impetus for that JIRA to move to Guava's service model) but I just never
was able to put in the time. I'm in favor of moving to the Guava
implementation.
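For reference, a component written against Guava's service model might
look roughly like this; the class is hypothetical, and the exact start/stop
method names on the Service interface have changed across Guava releases,
so treat the details as illustrative:

    import com.google.common.util.concurrent.AbstractIdleService;

    // Sketch of a source/sink/channel on Guava's Service model: the
    // component only says how to start up and shut down, and Guava runs
    // the state machine (NEW -> STARTING -> RUNNING -> STOPPING ->
    // TERMINATED, or FAILED).
    public class ExampleChannelService extends AbstractIdleService {

      @Override
      protected void startUp() throws Exception {
        // open files, recover checkpoints, etc.
      }

      @Override
      protected void shutDown() throws Exception {
        // flush and close files
      }
    }

Callers then drive the component through the Service interface and can
watch for the FAILED state, rather than each component implementing its
own lifecycle state machine.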
More generally, I strongly believe in well-defined semantics and simple
contracts. In other words, we should not get into the business of
attempting to deal with Byzantine failure. The idea is that the shutdown
handler (in response to SIGINT) should request an orderly shutdown. If the
system is in good working order, it should do so. If, for instance,
something PermGen OOMs or there's incorrect behavior in a LifecycleAware
component (i.e. a component that does not respect the contract), we should
explicitly *not* try to handle that, and a forced kill is proper. My vote
is just to avoid insanely complex logic to deal with incorrectly
implemented components; that always leads to insanely complicated code that
doesn't always work even when things are correctly implemented. Not to
mention, process stop events suffer from the halting problem[1] anyway...
[1] http://bit.ly/Kz5GMJ
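As a sketch of the "orderly shutdown on SIGINT, forced kill otherwise"
behavior, a JVM shutdown hook with a timeout is probably enough. This uses
Guava's ServiceManager, which comes from a Guava release newer than this
thread, and the surrounding names are made up:

    import com.google.common.util.concurrent.Service;
    import com.google.common.util.concurrent.ServiceManager;

    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Sketch: SIGINT/SIGTERM run the JVM shutdown hook, which requests an
    // orderly stop; if a component violates the lifecycle contract and
    // never finishes stopping, we give up after a timeout and let the
    // process die rather than trying to handle the Byzantine case.
    public class OrderlyShutdown {

      public static void install(List<Service> services) {
        final ServiceManager manager = new ServiceManager(services);
        manager.startAsync().awaitHealthy();

        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
          @Override
          public void run() {
            try {
              manager.stopAsync().awaitStopped(30, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
              // A component did not respect the contract; don't fight it,
              // the process is exiting anyway.
            }
          }
        }, "flume-shutdown-hook"));
      }
    }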
Thanks for taking this on, guys. It's not sexy work, but it's super
important.
On Wed, Jun 6, 2012 at 8:01 PM, Juhani Connolly <[email protected]> wrote:
The biggest barrier to this right now is the restarting behavior our
current lifecycle model has, which is not part of the Guava lifecycle. It
means that if we're to restart services, we need to store everything needed
to build a new service when the old one dies, and start that. In essence
we're going to need an outer layer to watch the inner (Guava) layer, which
sort of defeats the purpose.
I'm trying to figure out if there's a way to get around this, or if
switching from a restarting model to Guava's
starting/running/stopping/terminated model is possible (this would probably
require some components to take better care of themselves, as once they
fail they wouldn't auto-restart).
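For illustration, the "outer layer" could be little more than a listener
that rebuilds and starts a fresh service whenever the current one fails.
This sketch uses Guava's Service.Listener and startAsync(), which come
from Guava releases newer than this thread, and the watcher class itself
is hypothetical:

    import com.google.common.base.Supplier;
    import com.google.common.util.concurrent.MoreExecutors;
    import com.google.common.util.concurrent.Service;

    // Sketch of the outer layer: it holds a factory able to build a new
    // copy of the inner Guava service and, whenever the current instance
    // fails, builds and starts a fresh one. This is exactly the
    // bookkeeping the Guava lifecycle does not do for us.
    public class RestartingWatcher {

      private final Supplier<Service> factory;
      private Service current;

      public RestartingWatcher(Supplier<Service> factory) {
        this.factory = factory;
      }

      public synchronized void start() {
        current = factory.get(); // everything needed to rebuild lives here
        current.addListener(new Service.Listener() {
          @Override
          public void failed(Service.State from, Throwable failure) {
            start(); // discard the dead instance, spin up a fresh one
          }
        }, MoreExecutors.directExecutor());
        current.startAsync();
      }
    }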
On 06/06/2012 05:48 PM, Hari Shreedharan wrote:
Juhani,
It would be interesting to see how much of an effort it would be to
replace the current system with Guava. It would be nice to see an
initial proof of concept; maybe you can post it on the dev list. I think
there are others who would also have ideas and feel the need to update
the lifecycle system.
Thanks
Hari