On 15 September 2012 01:39, Andres Freund <and...@2ndquadrant.com> wrote:
> (0008-Introduce-wal-decoding-via-catalog-timetravel.patch)
This patch is the 8th of 8 in a patch series that covers different aspects of the bi-directional replication feature planned for PostgreSQL 9.3. For those who are unfamiliar with the BDR project, a useful summary is available in an e-mail sent to this list by Simon Riggs back in April [1].

I should also point out that Andres has made available a design document that discusses aspects of this patch in particular, in another thread [2]. That document, "High level design for logical replication in Postgres", emphasizes source data generation in particular: generalising how PostgreSQL's WAL stream is generated so that the changes it describes are represented logically, without pointers to physical heap tuples and so forth (data generation as a concept is described in detail in an earlier design document [3]).

This particular patch can be thought of as a response to the earlier discussion [4] surrounding how to solve the problem of keeping system catalogs consistent during WAL replay on followers: "catalog time travel" is now used, rather than maintaining a synchronized catalog at the decoding end. Andres' git tree ("xlog-decoding-rebasing-cf2" branch) [5] provides additional useful comments in commit messages (he rebases things such that each commit represents a distinct piece of functionality/patch for review).

This patch is not strictly speaking an atomic unit; it is necessary to apply all 8 patches in order to get the code to compile. However, it is approximately an atomic unit, one that represents a subdivision of the entire BDR patch that is manageable and logical to write a discrete review for. This is idiomatic use of git-format-patch, but it is unusual enough within our community for me to feel the need to note these facts.

I briefly discussed this patch with Andres off-list. His feeling is that the review process ought to focus on the design of WAL decoding, including how it fits within the larger set of replication features proposed for 9.3. There are a number of known omissions in this patch. Andres has listed some of these above, and edge-cases and so on are noted next to XXX and FIXME comments in the patch itself. I am inclined to agree with Andres' view that we should attempt to solidify community support for this prototype patch's design, or some variant, before fixing the edge-cases and working towards committable code. I will try my best to proceed on that basis.

What follows is an initial overview of the patch (or at least my understanding of the patch, which you may need to correct), and some points of concern.

> * applycache module which reassembles transactions from a stream of
> interspersed changes

This is what the design doc [2] refers to as "4.5. TX reassembly". This functionality is concentrated in applycache.c. As [2] notes, the reassembly component "was previously coined ApplyCache because it was proposed to run on replication consumers just before applying changes. This is not the case anymore. It is still called that way in the source of the patch recently submitted."

The purpose of ApplyCache/transaction reassembly is to reassemble interlaced records, and organise them by XID, so that the consumer client code sees only streams (well, lists) of records split by XID.

I meant to avoid talking about anything other than the bigger picture for now, but I must ask: why the frequent use of malloc(), particularly within applycache.c? The obvious answer is that it's rough code and that this will change, but that still doesn't comport with my idea of how even rough Postgres code should look, so I have to wonder if there's a better reason.
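For the record, what I'd have expected, even in prototype code, is the usual pattern of a dedicated memory context owned by the cache. Something like the sketch below - the helper function and the "context" field are my own invention, not names taken from the patch:

    /*
     * Hypothetical: allocate changes from a context owned by the cache,
     * rather than with malloc().  (Requires utils/memutils.h.)
     */
    static ApplyCacheChange *
    ApplyCacheGetChange(ApplyCache *cache)
    {
        /* create the cache's own context lazily, the first time through */
        if (cache->context == NULL)
            cache->context = AllocSetContextCreate(TopMemoryContext,
                                                   "ApplyCacheContext",
                                                   ALLOCSET_DEFAULT_MINSIZE,
                                                   ALLOCSET_DEFAULT_INITSIZE,
                                                   ALLOCSET_DEFAULT_MAXSIZE);

        return (ApplyCacheChange *)
            MemoryContextAllocZero(cache->context, sizeof(ApplyCacheChange));
    }

That would also buy you cheap cleanup of an entire transaction's worth of changes, by resetting or deleting the context, rather than freeing each change individually.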
applycache.c has an acute paucity of comments, which makes it really hard to review well. [2] doesn't have all that much to say about it either. I'm not going to comment much on it here, except to say that I think the file should be renamed to reassembly.c or something like that, to reflect its well-specified purpose, and not how it might be used. Any cache really belongs in src/backend/utils/cache/ anyway. ApplyCache is presumably where you're going to want to spill transaction streams to disk, eventually. That seems like a prerequisite to commit.

By the way, I see that you're doing this here:

+ /* most callers don't need snapshot.h */
+ typedef struct SnapshotData *Snapshot;

Tom, Alvaro and I had a discussion about whether or not this was an acceptable way to reduce build dependencies back in July [8] - I lost that one. You're relying on a GNU extension/C11 feature here (that is, typedef redefinition). If you find it intensely irritating that you cannot do this while following the standard to the letter, you are not alone.

> * snapbuilder which builds catalog snapshots so that tuples from wal can be
> understood

This component analyses xlog and builds a special kind of Snapshot. It has been compared to the KnownAssignedXids machinery for Hot Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is meant by this). Since decoding only has to occur within a single backend, I guess it's sufficient that it's all in local memory (in contrast to the KnownAssignedXids array, which is in shared memory).

The design document [2] really just explains the problem (which is the need for catalog metadata at a point in time to make sense of heap tuples), without describing the solution that this patch offers in any degree of detail. Rather, [2] says "How we build snapshots is somewhat intricate and complicated and seems to be out of scope for this document", which is unsatisfactory. I look forward to reading the promised document that describes this mechanism in more detail. As things stand, it's really hard to guess what you might have meant to do, and why you might have done it, much less to verify the code's correctness.

This functionality is concentrated in snapbuild.c. A comment in decode.c notes:

+ * Its possible that the separation between decode.c and snapbuild.c is a
+ * bit too strict, in the end they just about have the same struct.

I prefer the current separation. I think it's reasonable that decode.c is sort of minimal glue code. [2] talks about this under "4.6. Snapshot building".

> * wal decoding into an applycache

This functionality is concentrated in decode.c (not applycache.c - decoding just calls those functions). Decoding means breaking up individual XLogRecord structs and storing them in an applycache (applycache does this, storing them as ApplyCacheChange records), while building a snapshot (which is needed in advance of adding tuples from records). It can be thought of as the small piece of glue between applycache and snapbuild that is called by XLogReader (DecodeRecordIntoApplyCache() is the only public function, which will be called by many xlogreader_state.finished_record-hooked plugin functions in practice, including this example one). An example of what belongs in decode.c is the way it ignores physical XLogRecords, because they are not of interest.
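Just to spell out what I mean by minimal glue: the rough shape I'd expect decode.c to keep is a thin dispatch on the resource manager id, with everything that doesn't map to a logical change simply falling through. This is a sketch of mine, not code from the patch, and the function name and signature are invented:

    static void
    DecodeRecord(ReaderApplyState *state, XLogRecord *record)
    {
        switch (record->xl_rmid)
        {
            case RM_HEAP_ID:
                if ((record->xl_info & XLOG_HEAP_OPMASK) == XLOG_HEAP_INSERT)
                    DecodeInsert(state, record);
                /* ...updates and deletes handled the same way... */
                break;

            case RM_XACT_ID:
                /* commit/abort records drive snapbuild and the apply callbacks */
                break;

            case RM_XLOG_ID:
            default:
                /* purely physical records (checkpoints and the like): ignore */
                break;
        }
    }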
By the way, why not just use struct assignment here?

+ memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));

> * decode_xlog(lsn, lsn) debugging function

You consider this to be a throw-away function that won't ever be committed. However, I strongly feel that you should move it into /contrib, so that it can serve as a sort of reference implementation for authors of decoder client code, in the same spirit as numerous existing contrib modules (think contrib/spi). I think that such a module could even be useful to people who are just a bit intellectually curious about how WAL works, which is something I'd like to encourage. Besides, simply having this code in a module will more explicitly demarcate client code (just look at logicalfuncs.c - it is technically client code, but that's too subtle right now).

I don't like this code in decode_xlog():

+ apply_state = (ReaderApplyState*)xlogreader_state->private_data;

Why is it necessary to cast here? In other words, why is private_data a void pointer at all? Are we really well served by presuming absolutely nothing about XLogReader's state? Wouldn't an "abstract base class" pointer be a more appropriate type for private_data? I don't think it's virtuous to remove type safety any more than is strictly necessary. Note that I'm not asserting that you shouldn't do this - I'm merely asking the question. When developing a user-facing API, it is particularly crucial to make interfaces easy to use correctly and hard to use incorrectly.

> The applycache provides 3 major callbacks:
> * apply_begin
> * apply_change
> * apply_commit

These are callbacks intended to be used by third-party modules, perhaps including a full multi-master replication implementation (though this patch isn't directly concerned with that), or even a speculative future version of a logical replication system like Slony. [2] refers to these under "4.7. Output Plugin". These are the typedefs for the hook types involved (incidentally, I don't like the names of these types - I think you should lose the "CB"):

+/* XXX: were currently passing the originating subtxn. Not sure thats necessary */
+typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache, ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change);
+typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn);
+typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn);

So we register these callbacks like this in the patch:

+	/*
+	 * allocate an ApplyCache that will apply data using lowlevel calls
+	 * without type conversion et al. This requires binary compatibility
+	 * between both systems.
+	 * XXX: This would be the place too hook different apply methods, like
+	 * producing sql and applying it.
+	 */
+	apply_cache = ApplyCacheAllocate();
+	apply_cache->begin = decode_begin_txn;
+	apply_cache->apply_change = decode_change;
+	apply_cache->commit = decode_commit_txn;

The decode_xlog(lsn, lsn) debugging function that Andres has played with [6] (and that this patch makes available, for now) is where this code comes from.

Whenever ApplyCache calls an "apply_change" callback for a single change (that is, an INSERT, UPDATE or DELETE), it locally overrides the normal SnapshotNow semantics used for catalog access with a previously built snapshot. Behaviour should now be consistent with that of a SnapshotNow snapshot acquired when the tuple change was originally written to WAL.
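To make that SnapshotNow override concrete, this is roughly the kind of thing I'd want a contrib example's apply_change callback to demonstrate. Again, only a sketch based on my reading of the patch; the "action" field and its constant are guesses at names, not necessarily what the patch uses:

    static void
    example_change(ApplyCache *cache, ApplyCacheTXN *txn,
                   ApplyCacheTXN *subtxn, ApplyCacheChange *change)
    {
        /*
         * Catalog and syscache lookups made from here see the catalogs as
         * they were when the change was originally written to WAL ("time
         * travel"), which is what makes decoding the tuple possible.
         */
        switch (change->action)         /* field name is a guess */
        {
            case APPLY_CACHE_CHANGE_INSERT:
                elog(WARNING, "INSERT in relfilenode %u/%u/%u",
                     change->relnode.spcNode,
                     change->relnode.dbNode,
                     change->relnode.relNode);
                break;
            default:
                break;
        }
    }

Even a crude example along those lines, shipped in /contrib, would go a long way towards demarcating the client code.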
Having commented specifically on modules that Andres highlighted, I'd like to highlight one myself: tqual.c. This module has had significant new functionality added, so it would be an omission not to remark on it in this opening e-mail, having mentioned all other modules with significant new pieces of functionality. The file has had new utility functions added that pertain to snapshot visibility during decoding - "time travel", in other words - and I want to draw attention to this. The new code is located within the new function HeapTupleSatisfiesMVCCDuringDecoding(), which is analogous to what is done for "dirty snapshots" (dirty snapshots are used by ri_triggers.c, for example, where even uncommitted tuples should be visible). Both of these functions are generally accessed through function pointers. Anyway, here's the code:

+	/*
+	 * FIXME: The not yet existing decoding infrastructure will need to force
+	 * the xmin to stay lower than what they are currently decoding.
+	 */
+	bool fixme_xmin_horizon = false;

I'm sort of wondering what this is going to look like in the finished patch. This FIXME is rather hard for me to take at face value. It seems to me that the need to coordinate decoding with the xmin horizon itself represents a not insignificant engineering challenge. So, with all due respect, it would be nice if I wasn't asked to make that leap of faith. The xmin horizon prepared transaction hack needs to go.

Within tqual.h, shouldn't you have something like this, but for time travel snapshots during decoding?

#define InitDirtySnapshot(snapshotdata) \
	((snapshotdata).satisfies = HeapTupleSatisfiesDirty)

Alternatively, adding a third variable alongside these two might be appropriate:

static SnapshotData CurrentSnapshotData = {HeapTupleSatisfiesMVCC};
static SnapshotData SecondarySnapshotData = {HeapTupleSatisfiesMVCC};

In any case, assigning this hook in snapbuild.c looks like a modularity violation to me. See also my observations on initialising ReaderApplyState below.

My general feeling is that the code is very under-commented, and in need of a polish, though I'm sure that you are perfectly well aware of that.

The basic way all of the components that I have described separately fit together is as follows (if others want to follow this, refer to decode_xlog()):

1. Start with some client code "output plugin" (for now, a throw-away debugging function, decode_xlog()).
        |
       \ /
2. Client allocates an XLogReaderState. (This module is a black box to me, though it's well encapsulated, so that shouldn't matter much. Heikki is reviewing this [7]. Like I said, this isn't quite an atomic unit I'm reviewing.)
        |
       \ /
3. Plugin registers various callbacks (within logicalfuncs.c). These callbacks, while appearing in this patch, are mostly no-ops, and are somewhat specific to XLogReader's concerns. I mostly defer to Heikki here.
        |
       \ /
4. Plugin allocates an "ApplyCache". Plugin assigns some more callbacks to the ApplyCache. This time, they're the aforementioned 3 apply cache functions.
        |
       \ /
5. Plugin assigns this new ApplyCache to a variable within the private state of the XLogReader (this private state is a subset of its main state, and is opaque to XLogReader).
        |
       \ /
6. Finally, plugin calls XLogReader(main_state).
        |
       \ /
7. At some point during its magic, XLogReader calls the hook registered in step 3, finished_record. This is all it does directly with the plugin, which it makes minimal assumptions about.
        |
       \ /
8. finished_record (which is logically a part of the "plugin") knows what type the opaque private_data actually is. It casts it to an apply_state, and calls the decoder (supplying the apply_state as an argument to DecodeRecordIntoApplyCache()).
        |
       \ /
9. During the first call (within the first record within a call to decode_xlog()), we allocate a snapshot reader.
        |
       \ /
10. Builds snapshot callback. This scribbles on our snapshot state, which essentially encapsulates a snapshot. The state (and snapshot) changes continually, once per call.
        |
       \ /
11. Looks at XLogRecordBuffer (an XLogReader struct). Looks at an XLogRecord. Decodes based on record type. Let's assume it's an XLOG_HEAP_INSERT.
        |
       \ /
12. DecodeInsert() called. This in turn calls DecodeXLogTuple(). We store the tuple metadata in our ApplyCache (some ilists, somewhere, each corresponding to an XID). We don't store the relation oid, because we don't know it yet (only the relfilenode is known from WAL).
        |
       \ /
13. We're back in XLogReader(). It calls the only callback of interest to us covered in step 3 (and not of interest to XLogReader()/Heikki) - decode_change(). It does this through the apply_cache.apply_change hook. This happens because we encounter another record, this time a commit record (in the same codepath as discussed in step 12).
        |
       \ /
14. decode_change() is the actual function that raises the interesting WARNINGs within Andres' earlier example [6], showing actual integer/varchar Datum values for tuples previously inserted. It resolves the table oid based on the relfilenode (albeit unsatisfactorily), and, using a StringInfo, tupledescs, the syscache and the typcache, builds the WARNING string.

So my general opinion of how all this fits together is that it isn't quite right. Problems include:

* Why does the ReaderApplyState get magically initialised in two stages? apply_cache is initialised in decode_xlog() (or whatever plugin), while the snapstate is allocated within DecodeRecordIntoApplyCache() (with a dependency on apply_cache). Shouldn't this all be happening within a single function? As you yourself have pointed out, not everyone needs to know about these snapshots.

* Maybe I've missed something, but I think you need a more satisfactory example plugin. What happens in step 14 is plainly unacceptable. You haven't adequately communicated to me how this is going to be used in logical replication. Maybe I just haven't got that far yet. I'm not impressed by the InvalidateSystemCaches() calls here and elsewhere.

* Please break out the client code as a contrib module. That separation would increase the readability of the patch.

That's all I have for now...

[1] http://archives.postgresql.org/message-id/ca+u5nmlk-wt806zab7sj2x5x4pqc3we-hfctonaktqsagbq...@mail.gmail.com
[2] http://archives.postgresql.org/message-id/201209221900.53190.and...@2ndquadrant.com
[3] http://archives.postgresql.org/message-id/201206131327.24092.and...@2ndquadrant.com
[4] http://archives.postgresql.org/message-id/201206211341.25322.and...@2ndquadrant.com
[5] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf2
[6] http://archives.postgresql.org/message-id/201209150233.25616.and...@2ndquadrant.com
[7] http://archives.postgresql.org/message-id/5056dfab.3050...@vmware.com
[8] http://archives.postgresql.org/message-id/CAEYLb_Uvbi9mns-uJWUW4QtHqnC27SEyyNmj1HKFY=5x5ww...@mail.gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers