Actually CC'd Petr this time. On Mon, 7 Dec 2020 at 11:05, Craig Ringer <craig.rin...@enterprisedb.com> wrote:
> Reply follows inline. I addressed your last point first, so it's out of > order. > > On Fri, 4 Dec 2020 at 15:33, Andrey Borodin <x4...@yandex-team.ru> wrote > > > If OLAP cannot consume data fast enough - we are out of space due to > repl slot. > > There is a much simpler solution to this than logical PITR. > > What we should be doing is teaching xlogreader how to invoke the > restore_command to fetch archived WALs for decoding. > > Replication slots already have a WAL retention limit, but right now when > that limit is reached the slot is invalidated and becomes useless, it's > effectively dropped. Instead, if WAL archiving is enabled, we should leave > the slot as valid. If a consumer of the slot needs WAL that no longer > exists in pg_wal, we should have the walsender invoke the restore_command > to read the missing WAL segment, decode it, and remove it again. > > This would not be a technically difficult patch, and it's IMO one of the > more important ones for improving logical replication. > > > I was discussing problems of CDC with scientific community and they > asked this simple question: "So you have efficient WAL archive on a very > cheap storage, why don't you have a logical archive too?" > > I've done work in this area, as has Petr (CC'd). > > In short, logical archiving and PITR is very much desirable, but we're not > nearly ready for it yet and we're missing a lot of the foundations needed > to make it really useful. > > IMO the strongest pre-requisite is that we need integrated DDL capture and > replication in Pg. While this could be implemented in the > publisher/subscriber logic for logical replication, it would make much more > sense (IMO) to make it easier to feed DDL events into any logical > replication output plugin. > > pglogical3 (the closed one) has quite comprehensive DDL replication > support. Doing it is not simple though - there are plenty of complexities: > > * Reliably identifying the target objects and mapping them to replication > set memberships for DML-replication > * Capturing, replicating and managing the search_path and other DDL > execution context (DateStyle and much more) reliably > > - Each statement type needs specific logic to indicate whether it > needs DDL replication (and often filter functions since we have lots of > sub-types where some need replication and some don't) > - Handling DDL affecting global objects in pg_global correctly, like > those affecting roles, grants, database security labels etc. There's no one > right answer for this, it depends on the deployment and requires the user > to cooperate. > - Correct handling of transactions that mix DDL and DML (mostly only > an issue for multimaster). > - Identifying statements that target a mix of replicated and > non-replicated objects and handling them appropriately, including for > CASCADEs > - Gracefully handling DDL statements that mix TEMPORARY and persistent > targets. We can do this ok for DROPs but it still requires care. Anything > else gets messier. > - Lack of hooks into table rewrite operations and the extremely clumsy > and inefficient way logical decoding currently exposes decoding of the > temp-table data during decoding of rewrites means handling table-rewriting > DDL is difficult and impractical to do correctly. In pglogical we punt on > it entirely and refuse to permit DDL that would rewrite a table except > where we can prove it's reliant only on immutable inputs so we can discard > the upstream rewrite and rely on statement replication. > - As a consequence of the above, reliably determining whether a given > statement will cause a table rewrite. > - Handling re-entrant ProcessUtility_hook calls for ALTER TABLE etc. > - Handling TRUNCATE's pseudo-DDL pseudo-DML halfway state, doing > something sensible for truncate cascade. > - Probably more I've forgotten > > > If we don't handle these, then any logical change-log archives will become > largely useless as soon as there's any schema change. > > So we kind of have to solve DDL replication first IMO. > > Some consideration is also required for metadata management. Right now > relation and type metadata has session-lifetime, but you'd want to be able > to discard old logical change-stream archives and have the later ones still > be usable. So we'd need to define some kind of restartpoint where we repeat > the metadata, or we'd have to support externalizing the metadata so it can > be retained when the main change archives get aged out. > > We'd also need to separate the existing apply worker into a "receiver" and > "apply/writer" part, so the wire-protocol handling isn't tightly coupled > with the actual change apply code, in order to make it possible to actually > consume those archives and apply them to the database. In pglogical3 we did > that by splitting them into two processes, connected by a shm_mq. > Originally the process split was optional and you could run a combined > receiver/writer process without the shm_mq if you wanted, but we quickly > found it difficult to reliably handle locking issues etc that way so the > writers all moved out-of-process. > > That was done mainly to make it possible to support parallelism in logical > decoding apply. But we also have the intention of supporting an alternative > reader process that can ingest "logical archives" and send them to the > writer to apply them, as if they'd been received from the on-wire stream. > That's not implemented at this time though. It'd be useful for a number of > things: > > * PITR-style logical replay and recovery > * Ability to pre-decode a txn once on the upstream then send the buffered > protocol-stream to multiple subscribers, saving on logical decoding and > reorder buffering overheads and write-multiplication costs > * ability to ingest change-streams generated by non-postgres sources so we > could support streaming foreign-data ingestion, streaming OLAP and data > warehousing, etc > > To make logical PITR more useful we'd also want to be a bit more tolerant > of schema divergence, though that's not overly hard to do: > > - fill defaults for downstream columns if no value is present for the > column in the upstream row and the downstream column is nullable or has a > default (I think built-in logical rep does this one already) > - ignore values for columns in upstream data if the downstream table > lacks the column and the upstream value is null > - optionally allow apply to be configured to ignore non-null data in > upstream columns that're missing on downstream > - optionally allow apply to be configured to drop rows on the floor if > the downstream table is missing > - policies for handling data conflicts like duplicate PKs > > and we'd probably want ways to filter the apply data-stream to apply > changes for only a subset of tables, rows, etc at least in a later version. > > None of this is insurmountable. Most or all of the DDL replication support > and divergence-tolerance stuff is already done in production deployments > using pglogical3 and bdr3. > > While I can't share the code, I am happy to share the experience I have > gained from my part in working on these things. As you've probably recently > seen with the wiki article I wrote on physical/logical failover interop. > > You're free to take information like this and use it in wiki articles too. > > Right now I won't be able to launch into writing big patches for these > things, but I'll do my best to share what I can and review things. > > > This seems like a wild idea. But really, we have a super expensive NVMe > drives for OLTP workload. And use this devices to store buffer for data to > be dumped into MapReduce\YT analytical system. > > It's not a wild idea at all, as noted above. > > In pglogical3 we already support streaming decoded WAL data to alternative > writer downstreams including RabbitMQ and Kafka via writer plugins. >