Re: [Launchpad-dev] The future of downtime for rollouts?

Robert Collins Tue, 14 Sep 2010 16:06:49 -0700

On Wed, Sep 15, 2010 at 12:46 AM, Curtis Hovey
<[email protected]> wrote:
> On Tue, 2010-09-14 at 11:41 +0100, Tom Haddon wrote:
>> Might be more reliable but less accurate :) We estimate the downtime
>> based on how long the last update took on staging, and then multiplying
>> by a factor that seems to have accurately reflected the difference in
>> time between staging and production (with a little padding). We could
>> only commit to 90 mins if we refused to rollout any DB updates that took
>> longer than a certain period of time on staging.
>
> Staging restore times trend up, so we are always talking about
> increasing time for a rollout. We will continue to do schema development
> after the featureflag is complete. What we cannot see is the staging
> restore time versus the real time--maybe that is pointless because there
> are other rollout incidents that increased the rollout.


Staging restore times as a whole are a poor surrogate as already discussed.

The point I am making is that unless we decide *how much downtime we
will tolerate*, we'll always have reasons to do more.

So I'm proposing:
 - a 90m budget.
 - if we can't do it in that timeframe, we don't do it.
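
As a rough sketch of how that rule could be checked, assuming the
estimate Tom describes (last staging time, multiplied by a
staging-to-production factor, plus padding). The function names and the
numbers in the usage lines are hypothetical, for illustration only; they
are not our real figures or tooling.

```python
DOWNTIME_BUDGET_MINUTES = 90  # the budget proposed above


def estimate_production_minutes(staging_minutes, scale_factor, padding_minutes):
    """Estimate production downtime from the last staging update.

    scale_factor and padding_minutes are assumptions standing in for
    whatever factor and padding are actually used.
    """
    return staging_minutes * scale_factor + padding_minutes


def fits_budget(staging_minutes, scale_factor, padding_minutes,
                budget_minutes=DOWNTIME_BUDGET_MINUTES):
    """Return True if the estimated downtime is within the budget."""
    estimate = estimate_production_minutes(
        staging_minutes, scale_factor, padding_minutes)
    return estimate <= budget_minutes


# Hypothetical inputs: a 30 minute staging run fits, a 45 minute one does not.
print(fits_budget(30, 2.0, 10))
print(fits_budget(45, 2.0, 10))
```

Anything that fails the check would not be rolled out as-is; it would
have to be made faster or split up first.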

We *will have to* innovate and address various issues to stick to
this, but 90m of time is actually a lot of lost time to the many
thousands of users we have in every timezone. We could spend a week
with the whole team working on something to make the upgrade faster,
and still be spending less time than our users are losing while we're
down.

-Rob

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp
