Amit Kapila <amit.kapil...@gmail.com> writes: > On Mon, Sep 18, 2017 at 7:46 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >> The subscriber log includes >> 2017-09-18 08:43:08.240 UTC [15672] WARNING: out of background worker slots >> Maybe that's harmless, but I'm suspicious that it's a smoking gun.
> I have noticed while fixing the issue you are talking that this path > is also susceptible to such problems. In > WaitForReplicationWorkerAttach(), it relies on > GetBackgroundWorkerPid() to know the status of the worker which won't > give the correct status in case of fork failure. The patch I have > posted has a fix for the issue, Your GetBackgroundWorkerPid fix looks good as far as it goes, but I feel that WaitForReplicationWorkerAttach's problems go way deeper than that --- in fact, that function should not exist at all IMO. The way that launcher.c is set up, it's relying on either the calling process or the called process to free the logicalrep slot when done. The called process might never exist at all, so the second half of that is obviously unreliable; but WaitForReplicationWorkerAttach can hardly be relied on either, seeing it's got a big fat exit-on- SIGINT in it. I thought about putting a PG_TRY, or more likely PG_ENSURE_ERROR_CLEANUP, into it, but that doesn't fix the basic problem: you can't assume that the worker isn't ever going to start, just because you got a signal that means you shouldn't wait anymore. Now, this rickety business might be fine if it were only about the states of the caller and called processes, but we've got long-lived shared data involved (ie the worker slots); failing to clean it up is not an acceptable outcome. So, frankly, I think we would be best off losing the "logical rep worker slot" business altogether, and making do with just bgworker slots. The alternative is getting the postmaster involved in cleaning up lrep slots as well as bgworker slots, and I'm going to resist any such idea strongly. That way madness lies, or at least an unreliable postmaster --- if we need lrep slots, what other forty-seven data structures are we going to need the postmaster to clean up in the future? I haven't studied the code enough to know why it grew lrep worker slots in the first place. Maybe someone could explain? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers