On 07/17/2018 08:10 PM, Robert Haas wrote:
On Mon, Jul 16, 2018 at 3:25 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
Oh, right, I forgot the patch also adds the leader into the group, for
some reason (I agree it's unclear why that would be necessary, as you
pointed out later).

But all this is happening while holding the partition lock (in exclusive
mode). And the decoding backends do synchronize on the lock correctly
(although, man, the rechecks are confusing ...).

But now I see ProcKill accesses decodeGroupLeader in multiple places,
and only the first one is protected by the lock, for some reason
(interestingly enough, the one in the lockGroupLeader block). Is that what
you mean?

I haven't traced out the control flow completely, but it sure looks to
me like there are places where decodeGroupLeader is checked without
holding any LWLock at all.  Also, it looks to me like in some places
(like where we're trying to find a PGPROC by XID) we use ProcArrayLock,
and in others -- I guess where we're checking the decodeGroupBlah
stuff -- we are using the lock manager locks.  I don't know how safe
that is, and there are not a lot of comments justifying it.  I also
wonder why we're using the lock manager locks to protect the
decodeGroup stuff rather than backendLock.

FWIW I suspect the ProcKill part is borked due to an incorrectly resolved
merge conflict or something, per my initial response from today.
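
For illustration, here is a minimal sketch of the kind of consistent rule I'd expect, mirroring what the existing lock-group code does. decodeGroupLeader is a field the patch adds to PGPROC; the helper below is entirely made up, not patch code:

#include "postgres.h"

#include "storage/lock.h"		/* LockHashPartitionLockByProc() */
#include "storage/lwlock.h"
#include "storage/proc.h"

/*
 * Hypothetical helper, not actual patch code.  The caller is assumed to
 * have looked up the PGPROC already (the patch does that by XID, which
 * requires ProcArrayLock); the group fields themselves are then only ever
 * touched while holding a lock-manager partition lock, with no unlocked
 * reads anywhere else.
 */
static PGPROC *
DecodeGroupLeaderOf(PGPROC *proc)
{
	LWLock	   *partition_lock;
	PGPROC	   *leader;

	/*
	 * Core's lock groups derive the partition lock from the leader's
	 * PGPROC; whichever rule the patch picks, it needs to be applied (and
	 * documented) consistently at every access.
	 */
	partition_lock = LockHashPartitionLockByProc(proc);

	LWLockAcquire(partition_lock, LW_SHARED);
	leader = proc->decodeGroupLeader;	/* field added by the patch */
	LWLockRelease(partition_lock);

	return leader;
}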

Yeah, I wasn't seeing the code the way I thought you were describing it
in that response, but I'm dumb this week, so maybe I just
misunderstood.

I think that's probably not going to work out, but of course it's up
to you how you want to spend your time!

Well, yeah. I'm sure I could think of more fun things to do, but OTOH I
also have patches that require the capability to decode in-progress
transactions.

It's not a matter of fun; it's a matter of whether it can be made to
work.  Don't get me wrong -- I want the ability to decode in-progress
transactions.  I complained about that aspect of the design to Andres
when I was reviewing and committing logical slots & logical decoding,
and I complained about it probably more than I complained about any
other aspect of it, largely because it instantaneously generates a
large lag when a bulk load commits.  But not liking something about
the way things are is not the same as knowing how to make them better.
I believe there is a way to make it work because I believe there's a
way to make anything work.  But I suspect that it's at least one order
of magnitude more complex than this patch currently is, and likely an
altogether different design.


Sure, it may turn out not to work - but how do you know until you try?

We have a well-known theater play here, in which one of the actors blows tobacco smoke into a sink to see whether gold can be created that way. Which is foolish, of course, but his reasoning is "Someone had to try it, to be sure!" So we're in the blowing-tobacco-smoke phase, kinda ;-)

Also, you often discover solutions while investigating approaches that initially seem unworkable. Or entirely new approaches. It has certainly happened to me, many times.

There's a great book/movie, "Touching the Void" [1], about a climber who falls into a deep crevasse. Unable to climb up, he decides to crawl down, which is obviously foolish, but he happens to find a way out.

I suppose we're kinda doing the same thing here - crawling down a crevasse (while still smoking and blowing the tobacco smoke into a sink, which we happened to find in the crevasse or something).

Anyway, I have no clear idea what changes to the original design of logical decoding would have made implementing this easier now. The decoding is in general quite constrained by how our transam and WAL machinery works. I suppose Andres thought about this aspect, and I guess he concluded that (a) it's not needed for v1, and (b) adding it later will require about the same effort. So in the "better" case we'd have ended up waiting for logical decoding much longer; in the worse case we wouldn't have it at all.

But the way I understand it, it pretty much *is* a list of waiters,
along with a couple of flags to allow the processes to notify the other
side about lock/unlock/abort. It does resemble the lock groups, but I
don't think it has the same goals.

So the parts that aren't relevant shouldn't be copied over.


I'm not sure which parts aren't relevant, but in general I agree that stuff that is not necessary should not be copied over.

That having been said, I still don't see how that's really going to
work.  Just to take one example, suppose that the leader is trying to
ERROR out, and the decoding workers are blocked waiting for a lock
held by the leader.  The system has no way of detecting this deadlock
and resolving it automatically, which certainly seems unacceptable.
The only way that's going to work is if the leader waits for the
worker by trying to acquire a lock held by the worker.  Then the
deadlock detector would know to abort some transaction.  But that
doesn't really work either - the deadlock was created by the
foreground process trying to abort, and if the deadlock detector
chooses that process as its victim, what then?  We're already trying
to abort, and the abort code isn't supposed to throw further errors,
or fail in any way, lest we break all kinds of other things.  Not to
mention the fact that running the deadlock detector in the abort path
isn't really safe to begin with, again because we can't throw errors
when we're already in an abort path.
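
To make that scenario concrete, here is a purely hypothetical sketch (none of these names exist in core or, to my knowledge, in the patch) of a leader waiting for its decoding workers in the abort path, and of why the resulting deadlock is invisible:

#include "postgres.h"

/* Hypothetical shared flag; in reality this would live in shared memory. */
static volatile bool decode_workers_attached = false;

/*
 * Hypothetical: called from the leader's abort path to wait until all
 * decoding workers have detached from the aborting transaction.
 */
static void
WaitForDecodeWorkersDuringAbort(void)
{
	while (decode_workers_attached)
	{
		/*
		 * A worker may itself be blocked on a heavyweight lock held by this
		 * leader.  This sleep is not a lock wait, so the deadlock detector
		 * never sees the cycle; and we cannot ERROR out to break it either,
		 * because we are already inside abort processing.
		 */
		pg_usleep(1000L);		/* 1 ms */
	}
}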

Fair point, not sure. I'll leave this up to Nikhil.

That's fine, but please understand that I think there's a basic design
flaw here that just can't be overcome with any amount of hacking on the
details.  I think we need a much higher-level consideration of the
problem and probably a lot of new infrastructure to
support it.  One idea might be to initially support decoding of
in-progress transactions only if they don't modify any catalog state.

The problem is that you don't know whether a transaction does DDL somewhere later, in a part you might not have decoded yet (or that perhaps runs concurrently with the decoding). So I don't see how you could easily exclude such transactions from decoding ...
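
For example, all the decoder could check is what the reorder buffer has accumulated so far, so something like this (hypothetical helper; ReorderBufferXidHasCatalogChanges() is the existing function) can only ever give a provisional answer:

#include "postgres.h"

#include "replication/reorderbuffer.h"

/*
 * Hypothetical helper: does this in-progress transaction look safe to
 * stream under a "no catalog changes" rule?
 */
static bool
ApparentlySafeToStream(ReorderBuffer *rb, TransactionId xid)
{
	/*
	 * This only reflects the WAL decoded so far.  The transaction may still
	 * execute DDL in a part of the WAL we haven't read yet, so a "true"
	 * returned here can be invalidated later.
	 */
	return !ReorderBufferXidHasCatalogChanges(rb, xid);
}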

That would leave out a bunch of cases we'd probably like to support,
such as CREATE TABLE + COPY in the same transaction, but it would
likely dodge a lot of really hard problems, too, and we could improve
things later.  One approach to the problem of catalog changes would be
to prevent catalog tuples from being removed even after transaction
abort until such time as there's no decoding in progress that might
care about them.  That is not by itself sufficient because a
transaction can abort after inserting a heap tuple but before
inserting an index tuple, and we can't look at the catalog when it's in an
inconsistent state like that and expect reasonable results.  But it
helps: for example, if you are decoding a transaction that has
inserted a WAL record with a cmin or cmax value of 4, and you know
that none of the catalog records created by that transaction can have
been pruned, then it should be safe to use a snapshot with CID 3 or
smaller to decode the catalogs.  So consider a case like:

BEGIN;
CREATE TABLE blah ...  -- command ID 0
COPY blah FROM '/tmp/blah' ... -- command ID 1

Once we see the COPY show up in the WAL, it should be safe to decode
the CREATE TABLE command and figure out what a snapshot with command
ID 0 can see (again, assuming we've suppressed pruning in the catalogs
in a sufficiently-well-considered way).  Then, as long as the COPY
command doesn't do any DML via a trigger or a datatype input function
(!) or whatever, we should be able to use that snapshot to decode the
data inserted by COPY.
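
If I read this right, the safety rule boils down to something like the sketch below (made-up names; it assumes pruning of the transaction's catalog tuples really is suppressed):

#include "postgres.h"

/*
 * Sketch of the rule described above: decoding a WAL record that carries
 * command id 'wal_record_cid' proves the live transaction finished every
 * command with a smaller id, so a catalog snapshot that only needs to see
 * the effects of commands up to 'last_cid_needed' is safe to use -- even if
 * the transaction has aborted in the meantime, provided its catalog tuples
 * have not been pruned.
 */
static inline bool
CatalogSnapshotIsSafe(CommandId last_cid_needed, CommandId wal_record_cid)
{
	return last_cid_needed < wal_record_cid;
}

In the example, the COPY's changes carry command id 1, so a snapshot that only needs to see the CREATE TABLE (command id 0) qualifies; with a record at cmin/cmax 4, anything up to command id 3 does.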

One obvious issue with this is that it does not actually help with reducing the replication lag, which is pretty much the main goal of this whole effort. If the COPY is a big data load, waiting until it completes gains us almost nothing.

I'm not quite sure what happens if the COPY
does do some DML or similar -- we might have to stop decoding until the
following command begins in the live transaction, or something like
that.  Or maybe we don't have to do that.  I'm not
totally sure how the command counter is managed for catalog snapshots.
However it works in detail, we will get into trouble if we ever use a
catalog snapshot that can see a change that the live transaction is
still in the midst of making.  Even with pruning prevented, we can
only count on the catalogs to be in a consistent state once the live
transaction has finished the command -- otherwise, for example, it
might have increased pg_class.relnatts but not yet added the
pg_attribute entry at the time it aborts, or something like that.  I'm
blathering a little bit but hopefully you get the point: I think the
way forward is for somebody to think carefully through how and under
what circumstances using a catalog snapshot can be made safe even if
an abort has occurred afterwards -- not trying to postpone the abort,
which I think is never going to be right.


But isn't this (delaying the catalog cleanup etc.) pretty much the original approach, implemented by the original patch, which you also claimed to be unworkable, IIRC? And how does it address the problems with broken HOT chains, for example? Those issues were pretty much the reason why we started looking at alternative approaches, like delaying the abort ...

I wonder if disabling HOT on catalogs with wal_level=logical would be an option here. I'm not sure how important HOT on catalogs is in practice (it surely does not help with the typical catalog bloat issue, which is temporary tables, because that's mostly insert+delete). I suppose we could disable it only when there's a replication slot indicating support for decoding of in-progress transactions, so that you'd still get HOT with plain logical decoding.
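
A minimal sketch of where such a rule could hook in; IsCatalogRelation() and XLogLogicalInfoActive() exist in core, but the helper itself and the per-slot refinement are hypothetical:

#include "postgres.h"

#include "access/xlog.h"		/* XLogLogicalInfoActive() */
#include "catalog/catalog.h"	/* IsCatalogRelation() */
#include "utils/rel.h"

/*
 * Hypothetical helper: heap_update() could consult something like this
 * before deciding on a HOT update (it tracks that decision in its local
 * "use_hot_update" flag).
 */
static bool
HotUpdateAllowed(Relation relation)
{
	/*
	 * Keep full update chains on catalogs when wal_level=logical, so that
	 * decoding of in-progress transactions can still follow the old tuple
	 * versions after an abort.  A refinement would be to check for a
	 * replication slot that actually requested in-progress decoding,
	 * rather than wal_level alone.
	 */
	if (IsCatalogRelation(relation) && XLogLogicalInfoActive())
		return false;

	return true;
}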

I'm sure there will be other obstacles, not just the HOT chain stuff, but it would get us one step closer to a solution.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
