Reply follows inline. I addressed your last point first, so it's out of
order.

On Fri, 4 Dec 2020 at 15:33, Andrey Borodin <x4...@yandex-team.ru> wrote

> If OLAP cannot consume data fast enough - we are out of space due to repl
slot.

There is a much simpler solution to this than logical PITR.

What we should be doing is teaching xlogreader how to invoke the
restore_command to fetch archived WALs for decoding.

Replication slots already have a WAL retention limit, but right now when
that limit is reached the slot is invalidated and becomes useless, it's
effectively dropped. Instead, if WAL archiving is enabled, we should leave
the slot as valid. If a consumer of the slot needs WAL that no longer
exists in pg_wal, we should have the walsender invoke the restore_command
to read the missing WAL segment, decode it, and remove it again.

This would not be a technically difficult patch, and it's IMO one of the
more important ones for improving logical replication.

> I was discussing problems of CDC with scientific community and they asked
this simple question: "So you have efficient WAL archive on a very cheap
storage, why don't you have a logical archive too?"

I've done work in this area, as has Petr (CC'd).

In short, logical archiving and PITR is very much desirable, but we're not
nearly ready for it yet and we're missing a lot of the foundations needed
to make it really useful.

IMO the strongest pre-requisite is that we need integrated DDL capture and
replication in Pg. While this could be implemented in the
publisher/subscriber logic for logical replication, it would make much more
sense (IMO) to make it easier to feed DDL events into any logical
replication output plugin.

pglogical3 (the closed one) has quite comprehensive DDL replication
support. Doing it is not simple though - there are plenty of complexities:

* Reliably identifying the target objects and mapping them to replication
set memberships for DML-replication
* Capturing, replicating and managing the search_path and other DDL
execution context (DateStyle and much more) reliably

   - Each statement type needs specific logic to indicate whether it needs
   DDL replication (and often filter functions since we have lots of sub-types
   where some need replication and some don't)
   - Handling DDL affecting global objects in pg_global correctly, like
   those affecting roles, grants, database security labels etc. There's no one
   right answer for this, it depends on the deployment and requires the user
   to cooperate.
   - Correct handling of transactions that mix DDL and DML (mostly only an
   issue for multimaster).
   - Identifying statements that target a mix of replicated and
   non-replicated objects and handling them appropriately, including for
   CASCADEs
   - Gracefully handling DDL statements that mix TEMPORARY and persistent
   targets. We can do this ok for DROPs but it still requires care. Anything
   else gets messier.
   - Lack of hooks into table rewrite operations and the extremely clumsy
   and inefficient way logical decoding currently exposes decoding of the
   temp-table data during decoding of rewrites means handling table-rewriting
   DDL is difficult and impractical to do correctly. In pglogical we punt on
   it entirely and refuse to permit DDL that would rewrite a table except
   where we can prove it's reliant only on immutable inputs so we can discard
   the upstream rewrite and rely on statement replication.
   - As a consequence of the above, reliably determining whether a given
   statement will cause a table rewrite.
   - Handling re-entrant ProcessUtility_hook calls for ALTER TABLE etc.
   - Handling TRUNCATE's pseudo-DDL pseudo-DML halfway state, doing
   something sensible for truncate cascade.
   - Probably more I've forgotten


If we don't handle these, then any logical change-log archives will become
largely useless as soon as there's any schema change.

So we kind of have to solve DDL replication first IMO.

Some consideration is also required for metadata management. Right now
relation and type metadata has session-lifetime, but you'd want to be able
to discard old logical change-stream archives and have the later ones still
be usable. So we'd need to define some kind of restartpoint where we repeat
the metadata, or we'd have to support externalizing the metadata so it can
be retained when the main change archives get aged out.

We'd also need to separate the existing apply worker into a "receiver" and
"apply/writer" part, so the wire-protocol handling isn't tightly coupled
with the actual change apply code, in order to make it possible to actually
consume those archives and apply them to the database. In pglogical3 we did
that by splitting them into two processes, connected by a shm_mq.
Originally the process split was optional and you could run a combined
receiver/writer process without the shm_mq if you wanted, but we quickly
found it difficult to reliably handle locking issues etc that way so the
writers all moved out-of-process.

That was done mainly to make it possible to support parallelism in logical
decoding apply. But we also have the intention of supporting an alternative
reader process that can ingest "logical archives" and send them to the
writer to apply them, as if they'd been received from the on-wire stream.
That's not implemented at this time though. It'd be useful for a number of
things:

* PITR-style logical replay and recovery
* Ability to pre-decode a txn once on the upstream then send the buffered
protocol-stream to multiple subscribers, saving on logical decoding and
reorder buffering overheads and write-multiplication costs
* ability to ingest change-streams generated by non-postgres sources so we
could support streaming foreign-data ingestion, streaming OLAP and data
warehousing, etc

To make logical PITR more useful we'd also want to be a bit more tolerant
of schema divergence, though that's not overly hard to do:

   - fill defaults for downstream columns if no value is present for the
   column in the upstream row and the downstream column is nullable or has a
   default (I think built-in logical rep does this one already)
   - ignore values for columns in upstream data if the downstream table
   lacks the column and the upstream value is null
   - optionally allow apply to be configured to ignore non-null data in
   upstream columns that're missing on downstream
   - optionally allow apply to be configured to drop rows on the floor if
   the downstream table is missing
   - policies for handling data conflicts like duplicate PKs

and we'd probably want ways to filter the apply data-stream to apply
changes for only a subset of tables, rows, etc at least in a later version.

None of this is insurmountable. Most or all of the DDL replication support
and divergence-tolerance stuff is already done in production deployments
using pglogical3 and bdr3.

While I can't share the code, I am happy to share the experience I have
gained from my part in working on these things. As you've probably recently
seen with the wiki article I wrote on physical/logical failover interop.

You're free to take information like this and use it in wiki articles too.

Right now I won't be able to launch into writing big patches for these
things, but I'll do my best to share what I can and review things.

> This seems like a wild idea. But really, we have a super expensive NVMe
drives for OLTP workload. And use this devices to store buffer for data to
be dumped into MapReduce\YT analytical system.

It's not a wild idea at all, as noted above.

In pglogical3 we already support streaming decoded WAL data to alternative
writer downstreams including RabbitMQ and Kafka via writer plugins.

Reply via email to