On Wed, Sep 15, 2010 at 12:46 AM, Curtis Hovey <[email protected]> wrote:
> On Tue, 2010-09-14 at 11:41 +0100, Tom Haddon wrote:
>> Might be more reliable but less accurate :) We estimate the downtime
>> based on how long the last update took on staging, and then multiplying
>> by a factor that seems to have accurately reflected the difference in
>> time between staging and production (with a little padding). We could
>> only commit to 90 mins if we refused to rollout any DB updates that took
>> longer than a certain period of time on staging.
>
> Staging restore times trend up, so we are always talking about
> increasing time for a rollout. We will continue to do schema development
> after the featureflag is complete. What we cannot see is the staging
> restore time versus the real time--maybe that is pointless because there
> are other rollout incidents that increased the rollout.

Staging restore times as a whole are a poor surrogate, as already
discussed.

The point I am making is that unless we decide *how much downtime we
will tolerate*, we'll always have reasons to do more. So I'm proposing:
 - a 90m budget.
 - if we can't do it in that timeframe, we don't do it.

We *will have to* innovate and address various issues to stick to this,
but 90m of time is actually a lot of lost time to the many thousands of
users we have in every timezone. We could spend a week with the whole
team working on something to make the upgrade faster, and still be
spending less time than our users are losing when we're down.

-Rob

