Re: distinguishing failure types during upgrade

Bill Farner Wed, 01 Nov 2017 09:25:34 -0700

>
> How does rollback work in that case


Rollback behavior is unchanged when update pulses are enabled.

> disable auto-rollback


That's also a feasible option.
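
To make the pulse-gated option concrete, here is a minimal sketch of the kind of in-house service that could sit beside the scheduler. It is not Aurora code. The three callables (send_pulse, alerts_firing, update_active) are hypothetical hooks you would supply yourself: send_pulse is assumed to wrap a pulseJobUpdate() call for the update's key, and alerts_firing is assumed to query your own monitoring system.

```python
import time
from typing import Callable


def gate_update(
    send_pulse: Callable[[], None],
    alerts_firing: Callable[[], bool],
    update_active: Callable[[], bool],
    pulse_interval_secs: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Pulse an in-flight update only while the service looks healthy.

    All three callables are hypothetical hooks, not part of any Aurora client.
    Returns the number of pulses sent.
    """
    pulses = 0
    while update_active():
        if not alerts_firing():
            # Assumed to wrap the scheduler's pulseJobUpdate() RPC.
            send_pulse()
            pulses += 1
        # When alerts are firing we simply withhold the pulse; once the
        # blockIfNoPulsesAfterMs window elapses the update stops progressing.
        sleep(pulse_interval_secs)
    return pulses
```

The interval passed in should be comfortably shorter than the job's blockIfNoPulsesAfterMs value, so that a healthy update is never blocked by timing alone.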

On Wed, Nov 1, 2017 at 9:15 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Signal =
> - exit status from service
> - reason code from Mesos, if the task was killed by Mesos, e.g. a revocable
> core revoked during oversubscription
>
> Yes, I am aware of co-ordinated updates which allow this logic to be
> placed outside Aurora. How does rollback work in that case? Perhaps I
> should just disable auto-rollback in that case and put the rollback logic
> also into this external system.
>
> On Wed, Nov 1, 2017 at 8:39 AM, Bill Farner <wfar...@apache.org> wrote:
>
>>> Can Aurora distinguish between failures caused by the upgrade itself or
>>> other transient systemic issues
>>
>>
>> There isn't any signal I know of that would allow Aurora to independently
>> determine the cause of task failures in a generic way.
>>
>> Two options come to mind:
>> 1. Human intervention - aurora update pause from the CLI
>> 2. Configure jobs to use JobUpdateSettings.blockIfNoPulsesAfterMs
>> <https://github.com/apache/aurora/blob/d106b4ecc9537b8e844c4edc2210b9fe1853ccc4/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L708-L714>,
>> and set up an in-house service to invoke pulseJobUpdate()
>> <https://github.com/apache/aurora/blob/d106b4ecc9537b8e844c4edc2210b9fe1853ccc4/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L1134-L1139>.
>> This opts the job update into requiring periodic positive acknowledgement
>> from an external system that it is safe to proceed.  You could use this,
>> for example, to automatically gate an update while a service has alerts
>> firing.
>>
>>
>>
>> On Tue, Oct 31, 2017 at 1:14 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> Folks,
>>> Sometimes in our cluster upgrades start failing due to transient outages
>>> of dependencies or reasons unrelated to the new code being pushed out.
>>> Aurora hits its failure threshold and starts automatic rollback which may
>>> make a bad condition worse (e.g. if the outage was related to load, rollback
>>> will increase load). Can Aurora distinguish between failures caused by the
>>> upgrade itself or other transient systemic issues (using e.g. reason code)?
>>> If not, does this make sense as a new feature?
>>>
>>> Mohit.
>>>
>>>
>>
>
