On Thu, Mar 31, 2011 at 5:31 AM, Julian Edwards <[email protected]> wrote: > On Tuesday 29 March 2011 05:11:53 Robert Collins wrote: >> What-do-you-all-think? > > I think that simplification is good. > > Can you summarise why we need downtime for DB change deployments? Then > perhaps we can work a plan to minimise or eliminate those altogether.
https://dev.launchpad.net/LEP/ReliableDBDeploys + https://dev.launchpad.net/Database/LivePatching both have reasonable technical information, but there is indeed no simple statement of limitations. In short 'concurrency hard'. Slightly less shortly: slony 1.x (which we run) takes out an exclusive lock on all relations in the replication set when performing DDL (create table/alter table/ drop table - ditto for indices and views etc etc). Postgresql also takes out exclusive locks on relations to do most DDL. One notable exception to this is 'create index concurrently' which can add an index live without ever taking an exclusive lock. exclusive locks on single objects can be hard to acquire (because we're a busy system - 70 requests a second from the lpnet cluster alone - there may be no point where no transaction has a lock e.g. product for several minutes (because of overlapping requests having row level locks (which can happen when related tables are updated)). exclusive locks on multiple objects are even harder to acquire, *and* can also dead lock (if a client has a row lock in table A and goes for one in table B, and we've got an exclusive lock on B already - boom, deadlock. One option would be a different replication strategy - e.g. slony2 takes out narrower locks for executing DDL, but will still take out multiple exclusive locks because it locks all relations directly affected by DDL (and sometimes gets it wrong and takes too few :(). However we'd still need to deal with postgresql itself taking exclusive locks out; ditto for londista and postgresql 9's built in replication. What I think we need to do is design for *very fast* downtime: quiesce the system in a few seconds, apply the DDL (and structure it to be a few seconds long (*)), then bring the system back to activity in a few more seconds. This implies a few things: - we cannot affort to stop or upgrade during this process - cronscripts doing multi-minute things need to be interruptible (or use the LP API rather than db connections) - some web and api requests will be disrupted, so this process will be a little disruptive (we can't be casual about it) (*) short DDL implies: - when adding a column it must be NULLable, must not have a default - uniqueness on very big existing tables needs to be done via a unique index (with CONCURRENTLY) - we'll learn more as we go. .. HTH Rob _______________________________________________ Mailing list: https://launchpad.net/~launchpad-dev Post to : [email protected] Unsubscribe : https://launchpad.net/~launchpad-dev More help : https://help.launchpad.net/ListHelp

