On Wed, Dec 18, 2013 at 12:21:08PM -0500, Robert Haas wrote:
> On Tue, Dec 10, 2013 at 6:26 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > The larger point is that such a shutdown process has never in the
> > history of Postgres been successful at removing shared-memory (or
> > semaphore) resources.  I do not feel a need to put a higher
> > recoverability standard onto the DSM code than what we've had for
> > fifteen years with SysV shmem.
> >
> > But, by the same token, it shouldn't be any *less* recoverable.  In
> > this context that means that starting a new postmaster should recover
> > the resources, at least 90% of the time (100% if we still have the old
> > postmaster lockfile).  I think the idea of keeping enough info in the
> > SysV segment to permit recovery of DSM resources is a good one.  Then,
> > any case where the existing logic is able to free the old SysV segment
> > is guaranteed to also work for DSM.
>
> So I'm taking a look at this.  There doesn't appear to be anything
> intrinsically intractable here, but it seems important to order the
> operations so as to minimize the chances that things fail in awkward
> places.  The point where we remove the old shared memory segment from
> the previous postmaster invocation is here:
>
>         /*
>          * The segment appears to be from a dead Postgres process, or from
>          * a previous cycle of life in this same process.  Zap it, if
>          * possible.  This probably shouldn't fail, but if it does, assume
>          * the segment belongs to someone else after all, and continue
>          * quietly.
>          */
>         shmdt(memAddress);
>         if (shmctl(shmid, IPC_RMID, NULL) < 0)
>             continue;
>
> My first thought was to remember the control segment ID from the header
> just before the shmdt() there, and then later, when the DSM module is
> starting, do cleanup.  But that doesn't seem like a good idea, because
> then if startup fails after we remove the old shared memory segment and
> before we get to the DSM initialization code, we'll have lost the
> information on what control segment needs to be cleaned up.  A
> subsequent startup attempt won't see the old shm again, because it's
> already gone.  I'm fairly sure that this would be a net reduction in
> reliability vs. the status quo.
>
> So now what I'm thinking is that we ought to actually perform the DSM
> cleanup before detaching the old segment and trying to remove it.  That
> shouldn't fail, but if it does, or if we get killed before completing
> it, the next run will hopefully find the same old shm and finish the
> cleanup.  But that kind of flies in the face of the comment above: if we
> perform DSM cleanup and then discover that the segment wasn't ours after
> all, that means we just stomped all over somebody else's stuff.  That's
> not too good.  But trying to remove the segment first and then perform
> the cleanup creates a window where, if we get killed, the next restart
> will have lost information about how to finish cleaning up.  So it seems
> that some kind of logic tweak is needed here, but I'm not sure exactly
> what.  As I see it, the options are:
>
> 1. Make failure to remove the shared memory segment we thought was ours
> an error.  This will definitely show up any problems, but only after
> we've burned down some other process's dynamic shared memory segments.
> The most likely outcome is creating failure-to-start problems that don't
> exist today.
>
> 2. Make it a warning.  We'll still burn down somebody else's DSMs, but
> at least we'll still start ourselves.  Sadly, "WARNING: You have just
> done a bad thing.  It's too late to fix it.  Sorry!" is not very
> appealing.

It has long been the responsibility of PGSharedMemoryCreate() to determine
that a segment is unimportant before calling IPC_RMID.  The success or
failure of IPC_RMID is an unreliable guide to the correctness of that
determination.  IPC_RMID will succeed on an important segment owned by the
same UID, and it will fail if some other process removed the segment after
our shmat().  Such a failure does not impugn our having requested DSM
cleanup on the basis of the PGShmemHeader we did read, so an apologetic
WARNING would be wrong.
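To make the first half of that claim concrete, here is a minimal
standalone program (an illustration only, not server code): IPC_RMID
reports success even on a segment that is attached and in active use,
because SysV merely marks the segment for destruction at last detach.

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int
    main(void)
    {
        int         shmid;
        void       *addr;

        shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        if (shmid < 0)
        {
            perror("shmget");
            return 1;
        }
        addr = shmat(shmid, NULL, 0);
        if (addr == (void *) -1)
        {
            perror("shmat");
            return 1;
        }

        /*
         * The segment is attached and could hold important data, yet
         * IPC_RMID succeeds: it only schedules destruction for when the
         * last process detaches.  Success therefore does not validate a
         * decision that the segment was unimportant.
         */
        if (shmctl(shmid, IPC_RMID, NULL) == 0)
            printf("IPC_RMID succeeded on an attached, in-use segment\n");
        else
            perror("shmctl(IPC_RMID)");

        /*
         * Conversely, had another process removed the segment between our
         * shmat() and this point, shmctl() would fail, telling us nothing
         * about whether our earlier read of the header was legitimate.
         */
        shmdt(addr);
        return 0;
    }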
> 3. Remove the old shared memory segment first, then perform the cleanup
> immediately afterwards.  If we get killed before completing the
> cleanup, we'll leak the un-cleaned-up stuff.  Decide that's OK and move
> on.

The period in which process exit would leak segments is narrow enough that
this seems fine.  I see no net advantage over performing the cleanup before
shmdt(), though.

> 4. Adopt some sort of belt-and-suspenders approach, keeping the state
> file we have now and backstopping it with this mechanism, so that we
> really only need this to work when $PGDATA has been blown away and
> recreated.  This seems pretty inelegant, and I'm not sure who it'll
> benefit other than those (few?) people who kill -9 the postmaster (or
> it segfaults or otherwise dies without running the code to remove
> shared memory segments) and then remove $PGDATA and then re-initdb and
> then expect cleanup to find the old DSMs.

Independent of the choice we make here, I would keep the state file.  POSIX
permits shm_open() segments to survive reboots, so the state file would
come into play after crashes on such systems.  Also, folks who use "ipcrm"
after a "kill -9" of the postmaster would benefit from the state file.
Nobody should do that, but hey, catering to things nobody should do is what
brought us here.

An option I had envisioned was to make PGSharedMemoryCreate() create a
state file based on the old sysv segment.  Then the later
dsm_postmaster_startup() would proceed normally, and a restart after a
failure between shmdt() and dsm_postmaster_startup() would also do the
cleanup.  A wrinkle comes from the fact that an existing state file could
reference a different control segment; this would happen if we picked a
different sysv shm key compared to the last postmaster start.  In that
case, we should presumably clean both control segments.  Performing the
cleanup per #1/#2 achieves that.
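For concreteness, here is roughly how I pictured that state-file variant.
This is an untested sketch: dsm_read_state_file(), dsm_write_state_file(),
dsm_cleanup_using_control_segment(), and the dsm_control field in
PGShmemHeader are names I am assuming for illustration, not existing
interfaces.

    /*
     * Sketch: called from PGSharedMemoryCreate() after reading the old
     * segment's header and before shmdt().  All dsm_* helpers and
     * hdr->dsm_control are hypothetical.
     */
    static void
    remember_old_dsm_control(PGShmemHeader *hdr)
    {
        dsm_handle  old_control = hdr->dsm_control;
        dsm_handle  stale_control;

        if (old_control == 0)
            return;             /* old cluster had no DSM control segment */

        /*
         * The wrinkle: an existing state file may name a different control
         * segment, left over from a postmaster start that used another
         * sysv shm key.  Clean that one now, before overwriting the file
         * loses its handle.
         */
        stale_control = dsm_read_state_file();
        if (stale_control != 0 && stale_control != old_control)
            dsm_cleanup_using_control_segment(stale_control);

        /* Record the handle so dsm_postmaster_startup() cleans it up. */
        dsm_write_state_file(old_control);
    }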
> 5. Give up on this approach.  We could keep what we have now, or make
> the DSM control segment land at a well-known address as we do for the
> main segment.

How would having the DSM control segment at a well-known address affect the
problem at hand?  Did you mean a well-known dsm_handle?

> What do people prefer?

I recommend performing cleanup on the control segment named in
PGShmemHeader just before shmdt() in PGSharedMemoryCreate().  No new ERROR
or WARNING sites are necessary.  Have dsm_postmaster_startup() continue to
perform a cleanup on the control segment named in the state file.
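In code terms, the spot Robert quoted would gain one step before shmdt().
Sketch only; hdr->dsm_control and dsm_cleanup_using_control_segment() are
assumed names for a header field carrying the control segment handle and a
cleanup entry point:

        /*
         * The segment appears to be from a dead Postgres process, or from
         * a previous cycle of life in this same process.  Clean up any
         * dynamic shared memory its header advertises, then zap it.  If we
         * die mid-cleanup, the sysv segment survives, so the next startup
         * attempt finds the same header and finishes the job.
         */
        if (hdr->dsm_control != 0)
            dsm_cleanup_using_control_segment(hdr->dsm_control);
        shmdt(memAddress);
        if (shmctl(shmid, IPC_RMID, NULL) < 0)
            continue;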
> The other thing I'm a bit squidgy about is injecting more code that
> runs before we get the new shared memory segment up and running.
> During that time, we don't have effective mutual exclusion, so it's
> possible that someone else could be trying to do cleanup at the same
> time we are.  That should be OK as far as the DSM code itself is
> concerned, but there may be other downsides to lengthening the period
> of time that's required to get mutual exclusion in place that I should
> be worrying about.

I see no general need to avoid small increases to that time period.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com