On Thu, Jun 7, 2012 at 7:12 PM, Juhani Connolly <
[email protected]> wrote:

> Thanks for your input on this.
>
> I'm more than happy for misbehaving things to just fail, though I hope
> that we can leave proper logs so people can know something is wrong.
>

Completely agree. +1.


> Do you feel the auto restarting behaviour should still remain a part of
> the lifecycle? I'm not entirely sure if it is something the supervisor
> should be doing, and the majority of components that fail tend to just keep
> failing.
>

I do like the idea of erlang's always-restart policy. The rationale is that
if something is failing due to an outside resource (let's say the file
channel due to a lack of disk space), correcting the external resource
should effect recovery of flume. In other words, if a disk fills, I want to
delete some files and let flume get back to work without needing to figure
out if it's still within its retry period. I say let it go forever. Maybe we
don't log at some point, but I'm inclined to just let it go. If users don't
want the logs, they can configure log4j conservatively and use metrics to
track the number of component failures / restarts (which should be exposed,
IMO).
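
Roughly what I have in mind, as a sketch (this isn't the existing supervisor
code and the class is made up, though LifecycleAware / LifecycleState are the
real interfaces):

import java.util.concurrent.atomic.AtomicLong;

import org.apache.flume.lifecycle.LifecycleAware;
import org.apache.flume.lifecycle.LifecycleState;

public class AlwaysRestartSupervisor {
  private final AtomicLong restartCount = new AtomicLong();

  // Sketch only: restart a failed component forever and count restarts so the
  // behavior shows up in metrics rather than as log spam.
  public void check(LifecycleAware component) {
    if (component.getLifecycleState() == LifecycleState.ERROR) {
      restartCount.incrementAndGet(); // exposed as a metric
      component.stop();               // best effort; it's already in ERROR
      component.start();              // keep trying; fixing the external
                                      // resource (e.g. freeing disk space)
                                      // is enough to recover
    }
  }

  public long getRestartCount() {
    return restartCount.get();
  }
}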


> With these simplicity needs in mind though, the starting point that Brock
> has attached to the JIRA looks simple and performs most of what we need.
> I've been working on something bigger than that, but with more settings for
> policy (a configurable policy restricting the number of attempts to
> start/stop, timeouts, and forced termination). This probably isn't
> necessary though if we decide on some "correct" behavior for components and
> enforce it in code reviews.


My general feeling is that the framework (flume proper) should be dumb;
restart forever or never, configurable by the user by way of a policy
(which we don't expose today, admittedly). Sources and sinks should not
enforce lifecycle decisions; a source should do what it's told, as should a
sink. Weird magic numbers don't buy you much (e.g. "restart three times and
then give up").

The one thing I didn't implement from erlang in the supervisor code was
restart policies that impact other components owned by the same supervisor.
Ex: erlang's supervisors permit one to say "if you restart component X for
some reason, also restart Y, Z, and A." This is nice as it allows one to
discard state associated with a previous incarnation of the lifecycle. I
didn't have a strong need for it, though. In fact, my gut tells me the
*best* thing to do is to go one step further than discarding the instances
of the components: use the last configuration event and the builders to
re-instantiate the components, preventing state leaks after an exception
that may have left a component in a bad state. That's more complicated though.
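
Roughly, as a sketch (ComponentBuilder is made up for illustration; Context
just stands in for whatever the last configuration event carried):

import org.apache.flume.Context;
import org.apache.flume.lifecycle.LifecycleAware;

interface ComponentBuilder {
  LifecycleAware build(Context configuration);
}

public class RebuildingSupervisor {
  private final ComponentBuilder builder;
  private final Context lastConfiguration; // last configuration event we saw

  public RebuildingSupervisor(ComponentBuilder builder, Context lastConfiguration) {
    this.builder = builder;
    this.lastConfiguration = lastConfiguration;
  }

  // Throw away the failed instance entirely and build a fresh one from the
  // last known configuration, so nothing from the bad incarnation survives.
  public LifecycleAware restart(LifecycleAware failed) {
    failed.stop(); // best effort; the old instance is discarded either way
    LifecycleAware fresh = builder.build(lastConfiguration);
    fresh.start();
    return fresh;
  }
}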

Sorry, these are just ideas that have been rolling around that I never got
to really put out there on the lists. I'm brain-vomiting at you. :)


>
> On 06/07/2012 03:32 PM, Eric Sammer wrote:
>
>> I can try and answer any lingering questions about the existing lifecycle
>> management code. I knew there were outstanding issues in it (which was the
>> impetus for that JIRA to move to Guava's service model) but I just never
>> was able to put in the time. I'm in favor of moving to the Guava
>> implementation.
>>
>> More generally, I strongly believe in well defined semantics and simple
>> contracts. In other words, we should not get into the business of
>> attempting to deal with byzantine failure. The idea is that the shutdown
>> handler (in response to SIGINT) should request an orderly shutdown. If the
>> system is in good working order, it should do so. If, for instance,
>> something PermGen OOMs or there's incorrect behavior in a LifecycleAware
>> component (i.e. a component that does not respect the contract), we should
>> explicitly *not* try and handle that and a forced kill is proper. My vote
>> is just to avoid insanely complex logic to deal with incorrectly
>> implemented components; that always leads to insanely complicated code
>> that
>> doesn't always work when things are correctly implemented. Not to mention,
>> process stop events suffer from the halting problem[1] anyway...
>>
>> [1] http://bit.ly/Kz5GMJ
>>
>> Thanks for taking this on guys. It's not sexy work, but it's super
>> important.
>>
>> On Wed, Jun 6, 2012 at 8:01 PM, Juhani Connolly <
>> [email protected] <[email protected]>> wrote:
>>
>>> The biggest barrier to this right now is with the restarting behavior our
>>> current lifecycle model has, which is not part of the Guava lifecycle. It
>>> means if we're to restart services we need to store everything needed to
>>> build a new service when the old one dies, and start that. In essence
>>> we're going to need an outer layer to watch the inner (Guava) layer,
>>> which sort of defeats the purpose.
>>>
>>> I'm trying to figure out if there's a way to get around this, or if
>>> switching from a restarting model to Guava's
>>> starting/running/stopping/terminated model is possible (this would
>>> probably require some components to take better care of themselves, as
>>> once they fail they wouldn't auto-restart).
>>>
>>>
>>> On 06/06/2012 05:48 PM, Hari Shreedharan wrote:
>>>
>>>  Juhani,
>>>>
>>>> It would be interesting to see how much of an effort it would be to
>>>> replace the current system with Guava. It would be nice to see an
>>>> initial proof of concept; maybe you can post it on the dev list. I think
>>>> there would be others who also have ideas and feel the need to update
>>>> the Lifecycle system.
>>>>
>>>> Thanks
>>>> Hari
>>>>
>>>>
>>>>
>>
>


-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com
