Hello dear all, [Please CC your replies to me as I am on the digest mode]
Here's finally a very high-level design proposal of the pg_upgrade feature I was handwaiving a couple of weeks ago. Since, I am almost done with the moving, I can allocate some time for this for 8.1/8.2. If this topic is of interest to you, please read on until the very end before flaming or bashing the ideas out. I had designed that thing and kept updating (the design) more or less regularly, and also reflected some issues from the nearby threads [1] and [2]. This design is very high-level at the moment and is not very detailed. I will need to figure out more stuff as I go and design some aspects in finer detail. I started to poke around asking for initdb-forcing code paths in [3], but got no response so far. But I guess if the general idea or, rather, ideas accepted I will insist on more information more aggressively :) if I can't figure something out for myself. [1] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00000.php [2] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00382.php [3] http://archives.postgresql.org/pgsql-hackers/2004-08/msg01594.php Comments are very welcome, especially _*CONSTRUCTIVE*_... Thank you, and now sit back and read... CONTENTS: ========= 1. The Need 1. Utilities and User's View of the pg_upgrade Feature 2. Storage Management - Storage Managers and the smgr API 3. Source Code Maintenance Aspects 2. The Upgrade Sequence 4. Proposed Implementation Plan - initdb() API - upgrade API 1. The Need ----------- It's been a problem for PG for quite awhile now to have a less painful upgrade procedure with every new revision of PostgreSQL, so the dump/restore sequence is required. That can take a while for a production DB, while keeping it offline. The new replication-related solutions, such as Slony I, pg_pool, and others can remedy the problem somewhat, but require to roughly double the storage requirements of a given database while replicating from the older server to a newer one. The proposed implementation of an in-server pg_upgrade facility attempts to address both issues at the same time -- a possibility to keep the server running and upgrading lazily w/o doubling the storage requirements (there will be some extra disk space taken, but far from doubling the size). The in-process upgrade will not take much of down time and won't require that much memory/disk/network resources as replication solutions do. Prerequisites ------------- Ideally, the (maybe not so anymore) ambitious goal is to simply be able to "drop in" the new binaries of the new server and kick off on the older version of data files. I think is this feasible now a lot more than before since we have those things available, which should ease up the implementation: - bgwriter - pg_autovacuum (the one to be integrated into the backend in 8.1) - smgr API for pluggable storage managers - initdb in C - ... initdb in C, bgwriter and pg_autovacuum, and pluggable storage manager have made the possibility of creation of the Upgrade Subsystem for PostgreSQL to be something more reasonable, complete, feasible, and sane to a point. Utilities and the User's (DBA) View of the Feature -------------------------------------------------- Two instances exist: pg_upgrade (in C) A standalone utility to upgrade the binary on-disk format from one version to another when the database is offline. We should always have this as an option. pg_upgrade will accept sub/super set of pg_dump(all)/pg_restore options that do not require a connection. I haven't thought through this in detail yet. pg_autoupgrade a postgres subprocess, modeled after bgwriter and pg_autovacuum daemons. This will work when the database system is running on old data directory, and lazily converting relations to the new format. pg_autoupgrade daemon can be triggered by the following events in addition to the lazy upgrade process: "SQL" level: UPGRADE <ALL | relation_name [, relation_name]> [NOW | time] While the database won't be offline running over older database files, SELECT/read-only queries would be allowed using older storage managers*. Any write operation on old data will act using write-invalidate approach that will force the upgrade the affected relations to the new format to be scheduled after the relation-in-progress. (* See the "Storage Management" section.) Availability of the relations while upgrade is in progress is likely to be the same as in VACUUM FULL for that relation, i.e. the entire relation is locked until the upgrade is complete. Maybe we could optimize that by locking only particular pages of relations, I have to figure that out. The upgrade of indices can be done using REINDEX, which seems far less complicated than trying to convert its on-disk representation. This has to be done after the relation is converted. Alternatively, the index upgrade can simply be done by "CREATE INDEX" after the upgrade of relations. The relations to be upgraded are ordered according to some priority, e.g. system relations being first, then user-owned relations. System relations upgrade is forced upon the postmaster startup, and then user relations are processed lazily. So, in a sense, pg_autoupgrade will act like a proxy choosing appropriate storage manager (like a driver) between the new server and the old data file upgrading them on-demand. For that purpose we might need to add a pg_upgradeproxy to intercept backend requests and use appropriate storage manager. There will be one proxy process per backend. Storage Management ================== Somebody has made a possibility to plug a different storage manager in postgres and we even had two of them at some point . for the magnetic disk and the main memory. The main memory one is gone, but the smgr API is still there. Some were dubious why we would ever need another third-party storage manager, but here I propose to "plug in" storage managers from the older Postgres versions itself! Here is where the pluggable storage manager API would be handy once fully resurrected. Instead of trying to plug some third party storage managers it will primarily be used by the storage managers of different versions of Postgres. We can take the storage manager code from the past maintenance releases, namely 6.5.3, 7.0.3, 7.1.3, 7.2.5, 7.3.7, 7.4.5, and 8.0, and arrange them in appropriate fashion and have them implement the API properly. Anyone can contribute a storage manager as they see fit, there's no need to get them all at once. As a trial implementation I will try to do the last three or four maybe. Where to put relations being upgraded? -------------------------------------- At the beginning of the upgrade process if pg detects the old version of data files, it moves them under $PGDATA/<ver>, and keeps the old relations there until upgraded. The relations to be upgraded will be kept in the pg_upgrade_catalog. Once all relations upgraded, the <ver> directory is removed and the auto and proxy processes are shut down. The contents of the pg_upgrade_catalog emptied. The only issue remains is how to deal with tablespaces (or LOCATION in 7.* releases) elsewhere .- this can probably be addressed in the similar fashion, but having a /my/tablespace/<ver> directory. Source Code Maintenance ======================= Now, after the above some of you may get scared on the amount of similar code to possibly maintain in all those storage managers, but in reality they would require as much maintenance as the corresponding releases do get back-patched in that code area, and some are not being maintained for quite some time already. Plus, I should be around to maintain it, should this become realized. Release-time Maintenance ------------------------ For maintenance of pg_upgrade itself, one will have to fork out a new storage manager from the previous stable release and "register" it within the system. Alternatively, the new storage manager can be forked when the new release cycle begins. Additionally, a pg_upgrade version has to be added implementing the API steps outlined in the pg_upgrade API section. Implementation Steps ==================== To materialize the above idea, I'd proceed as follows: *) Provide the initdb() API (quick) *) Resurrect the pluggable storage manager API to be usable for the purpose. *) Document it *) Implement pg_upgrade API for 8.0 and 7.4.5. *) Extract 8.0 and 7.4.5 storage managers and have them implement the API as a proof of concept. Massage the API as needed. *) Document the process of adding new storage managers and pg_upgrade drivers. *) Extract other versions storage managers. pg_upgrade sequence ------------------- pg_upgrade API for the steps below to update for the next release. What to do with WAL?? Maybe upgrade can simply be done using WAL replay with old WAL manager? Not, fully, because not everything is in WAL, but some WAL recovery maybe needed in case the server was not shutdown cleanly before the upgrade. pg_upgrade will proceed as follows: - move PGDATA to PGDATA/<major pg version> - move tablespaces likewise - optional recovery from WAL in case old server was not shutdown properly -? Shall I upgrade PITR logs of 8.x??? So one can recover to a point-in-time in the upgraded database? - CLUSTER all old data - ANALYZE all old data - initdb() new system catalogs - Merge in modifications from old system catalogs - upgrade schemas/users -- variations - upgrade user relations Upgrade API: ------------ First draft, to be refined multiple times, but to convey the ideas behind: moveData() movePGData() moveTablespaces() 8.0+ moveDbLocation() < 8.0 preliminaryRecovery() - WAL?? - PITR 8.0+?? preliminaryCleanup() CLUSTER -- recover some dead space ANALYZE -- gives us stats upgradeSystemInfo() initdb() mergeOldCatalogs() mergeOldTemplates() upgradeUsers() upgradeSchemas() - > 7.2, else NULL upgradeUserRelations() upgradeIndices() DROP/CREATE upgradeInit() { } The main body in pseudocode: upgradeLoop() { moveData(); preliminaryRecovery(); preliminaryCleanup(); upgradeSystemInfo(); upgradeUsers(); upgradeSchemas(); upgradeUserRelations(); } Something along these lines the API would be: typedef struct t_upgrade { bool (*moveData) (void); bool (*preliminaryRecovery) (void); /* may be NULL */ bool (*preliminaryCleanup) (void); /* may be NULL */ bool (*upgradeSystemInfo) (void); /* may be NULL */ bool (*upgradeUsers) (void); /* may be NULL */ bool (*upgradeSchemas) (void); /* may be NULL */ bool (*upgradeUserRelations) (void); /* may be NULL */ } t_upgrade; The above sequence is executed by either pg_upgrade utility uninterrupted or by the pg_autoupgrade daemon. In the former the upgrade priority is simply by OID, in the latter also, but can be overridden by the user using the UPGRADE command to schedule relations upgrade, write operation can also change such schedule, with user's selected choice to be first. The more write requests a relation receives while in the upgrade queue, its priority increases; thus, the relation with most hits is on top. In case of tie, OID is the decision mark. Some issues to look into: - catalog merger - a crash in the middle of upgrade - PITR logs for 8.x+ - ... Flames and Handwaiving ---------------------- Okay, flame is on, but before you flame, mind you, this is a very initial version of the design. Some of the ideas may seem far fetched, the contents may seem messy, but I believe it's now more doable than ever and I am willing to put effort in it for the next release or two and then maintain it afterwards. It's not going to be done in one shot maybe, but incrementally, using input, feedback, and hints from you, guys. Thank you for reading till this far :-) I.d like to hear from you if any of this made sense to you. Truly yours, -- Serguei A. Mokhov | /~\ The ASCII Computer Science Department | \ / Ribbon Campaign Concordia University | X Against HTML Montreal, Quebec, Canada | / \ Email! ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html