On Mon, Mar 12, 2012 at 7:58 PM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote: > On 09.03.2012 13:40, Fujii Masao wrote: >> >> I found the bug which causes walsender to enter into busy loop >> when replication connection is terminated. Walsender consumes >> lots of CPU resource (%sys), and this situation lasts until it has >> detected the termination of replication connection and exited. >> >> The cause of this bug is that the walsender loop doesn't call >> ResetLatch at all in the above case. Since the latch remains set, >> the walsender loop cannot sleep on the latch, i.e., WaitLatch >> always returns immediately. >> >> We can fix this bug by adding ResetLatch into the top of the >> walsender loop. Patch attached. >> >> This bug exists in 9.1 but not in 9.2dev. In 9.2dev, this bug has >> already been fixed by the commit >> (cff75130b5f63e45423c2ed90d6f2e84c21ef840). This commit >> refactors and refines the walsender loop logic in addition to >> adding ResetLatch. So I'm tempted to backport this commit >> (except the deletion of wal_sender_delay) to 9.1 rather than >> applying the attached patch. OTOH, attached patch is quite simple, >> and its impact on 9.1 would be very small, so it's easy to backport that. >> Thought? > > > This patch makes the code that follows bogus: > >> /* >> * If we don't have any pending data in the output buffer, >> try to send >> * some more. >> */ >> if (!pq_is_send_pending()) >> { >> XLogSend(output_message, &caughtup); >> >> /* >> * Even if we wrote all the WAL that was available >> when we started >> * sending, more might have arrived while we were >> sending this >> * batch. We had the latch set while sending, so we >> have not >> * received any signals from that time. Let's arm >> the latch again, >> * and after that check that we're still >> up-to-date. >> */ >> if (caughtup && !pq_is_send_pending()) >> { >> ResetLatch(&MyWalSnd->latch); >> >> XLogSend(output_message, &caughtup); >> } >> } > > > The comment is no longer valid, and the calls to ResetLatch and XLogSend are > no longer necessary, once you have the ResetLatch() call at the top of the > loop.
Right. > I also think we should backport commit > cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of > wal_sender_delay). Agreed. The attached patch is the same as cff75130b5f63e45423c2ed90d6f2e84c21ef840, except for the removal of wal_sender_delay. Could you apply this? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers