[HACKERS] pg_upgrade project: high-level design proposal of in-place upgrade facility

Serguei A. Mokhov Thu, 30 Sep 2004 12:49:47 -0700

Hello dear all,

[Please CC your replies to me as I am on the digest mode]


Here's finally a very high-level design proposal of the pg_upgrade feature
I was handwaiving a couple of weeks ago. Since, I am almost done with the
moving, I can allocate some time for this for 8.1/8.2.

If this topic is of interest to you, please read on until the very end
before flaming or bashing the ideas out.  I had designed that thing and
kept updating (the design) more or less regularly, and also reflected some
issues from the nearby threads [1] and [2].

This design is very high-level at the moment and is not very detailed.  I
will need to figure out more stuff as I go and design some aspects in
finer detail.  I started to poke around asking for initdb-forcing code
paths in [3], but got no response so far.  But I guess if the general idea
or, rather, ideas accepted I will insist on more information more
aggressively :) if I can't figure something out for myself.

[1] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00000.php
[2] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00382.php
[3] http://archives.postgresql.org/pgsql-hackers/2004-08/msg01594.php

Comments are very welcome, especially _*CONSTRUCTIVE*_...

Thank you, and now sit back and read...

CONTENTS:
=========

1. The Need
1. Utilities and User's View of the pg_upgrade Feature
2. Storage Management
   - Storage Managers and the smgr API
3. Source Code Maintenance Aspects
2. The Upgrade Sequence
4. Proposed Implementation Plan
   - initdb() API
   - upgrade API


1. The Need
-----------

It's been a problem for PG for quite awhile now to have a less painful
upgrade procedure with every new revision of PostgreSQL, so the
dump/restore sequence is required.  That can take a while for a production
DB, while keeping it offline.  The new replication-related solutions, such
as Slony I, pg_pool, and others can remedy the problem somewhat, but
require to roughly double the storage requirements of a given database
while replicating from the older server to a newer one.

The proposed implementation of an in-server pg_upgrade facility attempts
to address both issues at the same time -- a possibility to keep the
server running and upgrading lazily w/o doubling the storage requirements
(there will be some extra disk space taken, but far from doubling the
size).  The in-process upgrade will not take much of down time and won't
require that much memory/disk/network resources as replication solutions
do.


Prerequisites
-------------

Ideally, the (maybe not so anymore) ambitious goal is to simply be able to
"drop in" the new binaries of the new server and kick off on the older
version of data files. I think is this feasible now a lot more than before
since we have those things available, which should ease up the
implementation:

  - bgwriter
  - pg_autovacuum (the one to be integrated into the backend in 8.1)
  - smgr API for pluggable storage managers
  - initdb in C
  - ...

initdb in C, bgwriter and pg_autovacuum, and pluggable storage manager
have made the possibility of creation of the Upgrade Subsystem for
PostgreSQL to be something more reasonable, complete, feasible, and sane
to a point.


Utilities and the User's (DBA) View of the Feature
--------------------------------------------------

Two instances exist:

   pg_upgrade (in C)

       A standalone utility to upgrade the binary on-disk format from one
       version to another when the database is offline.
       We should always have this as an option.
       pg_upgrade will accept sub/super set of pg_dump(all)/pg_restore
       options that do not require a connection. I haven't
       thought through this in detail yet.

   pg_autoupgrade

       a postgres subprocess, modeled after bgwriter and pg_autovacuum
       daemons.  This will work when the database system is running
       on old data directory, and lazily converting relations to the new
       format.

pg_autoupgrade daemon can be triggered by the following events in addition
to the lazy upgrade process:

   "SQL" level: UPGRADE <ALL | relation_name [, relation_name]> [NOW | time]

While the database won't be offline running over older database files,
SELECT/read-only queries would be allowed using older storage managers*.
Any write operation on old data will act using write-invalidate approach
that will force the upgrade the affected relations to the new format to be
scheduled after the relation-in-progress.

(* See the "Storage Management" section.)

Availability of the relations while upgrade is in progress is likely to be
the same as in VACUUM FULL for that relation, i.e. the entire relation is
locked until the upgrade is complete.  Maybe we could optimize that by
locking only particular pages of relations, I have to figure that out.

The upgrade of indices can be done using REINDEX, which seems far less
complicated than trying to convert its on-disk representation.  This has
to be done after the relation is converted.  Alternatively, the index
upgrade can simply be done by "CREATE INDEX" after the upgrade of
relations.

The relations to be upgraded are ordered according to some priority, e.g.
system relations being first, then user-owned relations.  System relations
upgrade is forced upon the postmaster startup, and then user relations are
processed lazily.

So, in a sense, pg_autoupgrade will act like a proxy choosing appropriate
storage manager (like a driver) between the new server and the old data
file upgrading them on-demand.  For that purpose we might need to add a
pg_upgradeproxy to intercept backend requests and use appropriate storage
manager.  There will be one proxy process per backend.


Storage Management
==================

Somebody has made a possibility to plug a different storage manager in
postgres and we even had two of them at some point . for the magnetic disk
and the main memory.  The main memory one is gone, but the smgr API is
still there.  Some were dubious why we would ever need another third-party
storage manager, but here I propose to "plug in" storage managers from the
older Postgres versions itself!  Here is where the pluggable storage
manager API would be handy once fully resurrected.  Instead of trying to
plug some third party storage managers it will primarily be used by the
storage managers of different versions of Postgres.

We can take the storage manager code from the past maintenance releases,
namely 6.5.3, 7.0.3, 7.1.3, 7.2.5, 7.3.7, 7.4.5, and 8.0, and arrange them
in appropriate fashion and have them implement the API properly.  Anyone
can contribute a storage manager as they see fit, there's no need to get
them all at once.  As a trial implementation I will try to do the last
three or four maybe.


Where to put relations being upgraded?
--------------------------------------

At the beginning of the upgrade process if pg detects the old version of
data files, it moves them under $PGDATA/<ver>, and keeps the old relations
there until upgraded.  The relations to be upgraded will be kept in the
pg_upgrade_catalog.  Once all relations upgraded, the <ver> directory is
removed and the auto and proxy processes are shut down.  The contents of
the pg_upgrade_catalog emptied.  The only issue remains is how to deal
with tablespaces (or LOCATION in 7.* releases) elsewhere .- this can
probably be addressed in the similar fashion, but having a
/my/tablespace/<ver> directory.

Source Code Maintenance
=======================

Now, after the above some of you may get scared on the amount of similar
code to possibly maintain in all those storage managers, but in reality
they would require as much maintenance as the corresponding releases do
get back-patched in that code area, and some are not being maintained for
quite some time already.  Plus, I should be around to maintain it, should
this become realized.

Release-time Maintenance
------------------------

For maintenance of pg_upgrade itself, one will have to fork out a new
storage manager from the previous stable release and "register" it within
the system.  Alternatively, the new storage manager can be forked when the
new release cycle begins.  Additionally, a pg_upgrade version has to be
added implementing the API steps outlined in the pg_upgrade API section.


Implementation Steps
====================

To materialize the above idea, I'd proceed as follows:

*) Provide the initdb() API (quick)

*) Resurrect the pluggable storage manager API to be usable for the
   purpose.

*) Document it

*) Implement pg_upgrade API for 8.0 and 7.4.5.

*) Extract 8.0 and 7.4.5 storage managers and have them implement the API
   as a proof of concept.  Massage the API as needed.

*) Document the process of adding new storage managers and pg_upgrade
   drivers.

*) Extract other versions storage managers.


pg_upgrade sequence
-------------------

pg_upgrade API for the steps below to update for the next release.

What to do with WAL?? Maybe upgrade can simply be done using WAL replay
with old WAL manager? Not, fully, because not everything is in WAL, but
some WAL recovery maybe needed in case the server was not shutdown cleanly
before the upgrade.

pg_upgrade will proceed as follows:

- move PGDATA to PGDATA/<major pg version>
- move tablespaces likewise
- optional recovery from WAL in case old server was not shutdown properly
-? Shall I upgrade PITR logs of 8.x??? So one can recover to a
  point-in-time in the upgraded database?
- CLUSTER all old data
- ANALYZE all old data
- initdb() new system catalogs
- Merge in modifications from old system catalogs
- upgrade schemas/users
  -- variations
- upgrade user relations

Upgrade API:
------------

First draft, to be refined multiple times, but to convey the ideas behind:

moveData()
  movePGData()
  moveTablespaces() 8.0+
  moveDbLocation() < 8.0

preliminaryRecovery()
 - WAL??
 - PITR 8.0+??

preliminaryCleanup()
  CLUSTER -- recover some dead space
  ANALYZE -- gives us stats

upgradeSystemInfo()
  initdb()
  mergeOldCatalogs()
  mergeOldTemplates()

upgradeUsers()

upgradeSchemas()
  - > 7.2, else NULL

upgradeUserRelations()
  upgradeIndices()
    DROP/CREATE

upgradeInit()
{

}

The main body in pseudocode:

upgradeLoop()
{
    moveData();
    preliminaryRecovery();
    preliminaryCleanup();
    upgradeSystemInfo();
    upgradeUsers();
    upgradeSchemas();
    upgradeUserRelations();
}

Something along these lines the API would be:

typedef struct t_upgrade
{
        bool            (*moveData) (void);
        bool            (*preliminaryRecovery) (void);          /* may be NULL */
        bool            (*preliminaryCleanup) (void);           /* may be NULL */
        bool            (*upgradeSystemInfo) (void);            /* may be NULL */
        bool            (*upgradeUsers) (void);         /* may be NULL */
        bool            (*upgradeSchemas) (void);               /* may be NULL */
        bool            (*upgradeUserRelations) (void);         /* may be NULL */
} t_upgrade;


The above sequence is executed by either pg_upgrade utility uninterrupted
or by the pg_autoupgrade daemon. In the former the upgrade priority is
simply by OID, in the latter also, but can be overridden by the user using
the UPGRADE command to schedule relations upgrade, write operation can
also change such schedule, with user's selected choice to be first.  The
more write requests a relation receives while in the upgrade queue, its
priority increases; thus, the relation with most hits is on top. In case
of tie, OID is the decision mark.

Some issues to look into:

- catalog merger
- a crash in the middle of upgrade
- PITR logs for 8.x+
- ...

Flames and Handwaiving
----------------------

Okay, flame is on, but before you flame, mind you, this is a very initial
version of the design.  Some of the ideas may seem far fetched, the
contents may seem messy, but I believe it's now more doable than ever and
I am willing to put effort in it for the next release or two and then
maintain it afterwards.  It's not going to be done in one shot maybe, but
incrementally, using input, feedback, and hints from you, guys.

Thank you for reading till this far :-) I.d like to hear from you if any
of this made sense to you.

Truly yours,

-- 
Serguei A. Mokhov            |  /~\    The ASCII
Computer Science Department  |  \ / Ribbon Campaign
Concordia University         |   X    Against HTML
Montreal, Quebec, Canada     |  / \      Email!

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faqs/FAQ.html

[HACKERS] pg_upgrade project: high-level design proposal of in-place upgrade facility

Reply via email to