Re: [HACKERS] Buildfarm failure from overly noisy warning message
Jeff Janes writes:
> On Tue, Jul 28, 2015 at 2:38 PM, Tom Lane wrote:
>> Rather than remove the "sending signal" elog entirely, I reduced it to
>> DEBUG1; that will avoid log chatter for normal cases but the info can
>> still be obtained at need.

> This part doesn't seem right to me.  Now the autovac worker logs that it
> was cancelled, but we have no idea why it was cancelled, i.e. which lock
> request caused it to be cancelled and which query caused the lock request.

The argument for changing it was basically that there's not any very good
reason to care, and if this happens a lot you're just bloating the log.
While I am not going to defend that proposition to the death, I've
certainly heard plenty of complaints that the postmaster log is too chatty
about routine occurrences.  Why is this information important enough to
log by default?

> Although looking at the code from that patch, it is not clear to me why
> all the string building needs to be done under the ProcArrayLock.  The
> string depends only on the local state of the lock being blocked, not on
> the proc doing the blocking.

Actually, I'm not sure I see the point of taking the ProcArrayLock at all
in this stanza.  If the state of the autovac process can change under us
then the lines just in front of the LWLockAcquire are already broken, no?
Worst case, the blocker might not be there at all anymore.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
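Jeff's observation about the string building can be shown in miniature. The sketch below is pure illustration, not PostgreSQL code: the function and variable names are invented, and a pthread mutex stands in for the ProcArrayLock. The point is that text depending only on local state can be formatted before the lock is taken, so the critical section covers just the shared-state read.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Invented stand-ins for the shared lock and the shared state. */
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_blocker_pid = 30123;

/*
 * Build the log text from local state first; hold the lock only for
 * the read of the shared state.  Returns the blocker's PID.
 */
static int
describe_blocker(const char *relname, char *buf, size_t buflen)
{
	int			pid;

	/* Depends only on local state: no lock needed. */
	snprintf(buf, buflen, "autovacuum of \"%s\" is blocking us", relname);

	/* Critical section is now as short as possible. */
	pthread_mutex_lock(&shared_lock);
	pid = shared_blocker_pid;
	pthread_mutex_unlock(&shared_lock);

	return pid;
}
```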
Re: [HACKERS] Buildfarm failure from overly noisy warning message
On Tue, Jul 28, 2015 at 2:38 PM, Tom Lane wrote:
> Kevin Grittner writes:
>> Tom Lane wrote:
>>> Kevin Grittner writes:
>>>> I think a LOG entry when an autovacuum process is actually canceled
>>>> has value just in case it is happening on a particular table so
>>>> frequently that the table starts to bloat.  I see no reason to log
>>>> anything if there is an intention to cancel an autovacuum process
>>>> but it actually completes before we can do so.

>>> Hm.  By that logic, I'm not sure if we need anything to be logged here,
>>> because the autovacuum process will log something about having received
>>> a query cancel signal.

>> That seems sufficient to me for normal cases.

> Rather than remove the "sending signal" elog entirely, I reduced it to
> DEBUG1; that will avoid log chatter for normal cases but the info can
> still be obtained at need.

This part doesn't seem right to me.  Now the autovac worker logs that it
was cancelled, but we have no idea why it was cancelled, i.e. which lock
request caused it to be cancelled and which query caused the lock request.

(It looks like the build-farm failures were fixed by the other part of
the change, which I have no objection to.)

This effectively undoes commit d7318d43d891bd63e82dcfc27.  I thought that
that commit was a huge improvement and would prefer it not be undone.

commit d7318d43d891bd63e82dcfc27948113ed7b1db80
Author: Robert Haas
Date:   Thu Jul 26 09:18:32 2012 -0400

    Log a better message when canceling autovacuum.

    The old message was at DEBUG2, so typically it didn't show up in the
    log at all.  As a result, in most cases where autovacuum was canceled,
    the only information that was logged was the table being vacuumed,
    with no indication as to what problem caused the cancel.  Crank up the
    level to LOG and add some more details to assist with debugging.

Although looking at the code from that patch, it is not clear to me why
all the string building needs to be done under the ProcArrayLock.  The
string depends only on the local state of the lock being blocked, not on
the proc doing the blocking.

Cheers,

Jeff
Re: [HACKERS] Buildfarm failure from overly noisy warning message
Kevin Grittner writes:
> Tom Lane wrote:
>> Kevin Grittner writes:
>>> I think a LOG entry when an autovacuum process is actually canceled
>>> has value just in case it is happening on a particular table so
>>> frequently that the table starts to bloat.  I see no reason to log
>>> anything if there is an intention to cancel an autovacuum process
>>> but it actually completes before we can do so.

>> Hm.  By that logic, I'm not sure if we need anything to be logged here,
>> because the autovacuum process will log something about having received
>> a query cancel signal.

> That seems sufficient to me for normal cases.

Rather than remove the "sending signal" elog entirely, I reduced it to
DEBUG1; that will avoid log chatter for normal cases but the info can
still be obtained at need.

>> If we're in the business of minimizing log chatter, I'd suggest that
>> we remove the entirely-routine "sending cancel" log message, and only
>> log something in the uncommon case where the kill() fails (but, per
>> original point, reduce that entry to LOG or so; or else print something
>> only for not-ESRCH cases).

> +1 for only printing for the non-ESRCH cases.

Left that one as a WARNING, but it doesn't print for ESRCH.

			regards, tom lane
Re: [HACKERS] Buildfarm failure from overly noisy warning message
Tom Lane wrote:
> Kevin Grittner writes:
>> I think a LOG entry when an autovacuum process is actually canceled
>> has value just in case it is happening on a particular table so
>> frequently that the table starts to bloat.  I see no reason to log
>> anything if there is an intention to cancel an autovacuum process
>> but it actually completes before we can do so.

> Hm.  By that logic, I'm not sure if we need anything to be logged here,
> because the autovacuum process will log something about having received
> a query cancel signal.

That seems sufficient to me for normal cases.

> If we're in the business of minimizing log chatter, I'd suggest that
> we remove the entirely-routine "sending cancel" log message, and only
> log something in the uncommon case where the kill() fails (but, per
> original point, reduce that entry to LOG or so; or else print something
> only for not-ESRCH cases).

+1 for only printing for the non-ESRCH cases.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Buildfarm failure from overly noisy warning message
Kevin Grittner writes:
> Tom Lane wrote:
>> What's evidently happened here is that our session tried to boot an
>> autovacuum process off a table lock, only that process was gone by the
>> time we issued the kill() call.

> I think a LOG entry when an autovacuum process is actually canceled
> has value just in case it is happening on a particular table so
> frequently that the table starts to bloat.  I see no reason to log
> anything if there is an intention to cancel an autovacuum process
> but it actually completes before we can do so.  IMV the best
> solution would be to proceed straight to the kill() and only log
> something if it was found by kill().

Hm.  By that logic, I'm not sure if we need anything to be logged here,
because the autovacuum process will log something about having received
a query cancel signal.

Also, if we do it like that there's a race condition: it's entirely
possible (maybe even likely, on single-CPU machines) that the autovacuum
process would issue its log entry before the sender is able to log its
request.  That would probably confuse people no end.

If we're in the business of minimizing log chatter, I'd suggest that
we remove the entirely-routine "sending cancel" log message, and only
log something in the uncommon case where the kill() fails (but, per
original point, reduce that entry to LOG or so; or else print something
only for not-ESRCH cases).

			regards, tom lane
Re: [HACKERS] Buildfarm failure from overly noisy warning message
Tom Lane wrote:

> + WARNING:  could not send signal to process 30123: No such process

> What's evidently happened here is that our session tried to boot an
> autovacuum process off a table lock, only that process was gone by the
> time we issued the kill() call.

> I'm inclined to reduce the WARNING to LOG, and/or skip it altogether
> if the error is ESRCH.

> One could also argue that both of these ereports ought to be downgraded
> to DEBUG1 or less, since this mechanism is pretty well shaken out by
> now.  Thoughts?

I think a LOG entry when an autovacuum process is actually canceled
has value just in case it is happening on a particular table so
frequently that the table starts to bloat.  I see no reason to log
anything if there is an intention to cancel an autovacuum process
but it actually completes before we can do so.  IMV the best
solution would be to proceed straight to the kill() and only log
something if it was found by kill().

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
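Kevin's suggestion — skip the pre-announcement, go straight to kill(), and let its result decide what gets logged — would look roughly like the sketch below. This is illustrative plain C, not a patch: the function name is invented and fprintf stands in for ereport.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Illustration of "log only if it was found by kill()": attempt the
 * cancel first, and emit the log entry only on successful delivery.
 * Returns 1 if the signal was delivered, 0 if the process was gone.
 */
static int
cancel_if_present(pid_t pid)
{
	if (kill(pid, SIGINT) == 0)
	{
		fprintf(stderr, "sent cancel to blocking autovacuum PID %d\n",
				(int) pid);
		return 1;
	}

	/* ESRCH: it completed before we could cancel it -- log nothing. */
	if (errno != ESRCH)
		fprintf(stderr, "could not send signal to process %d\n", (int) pid);
	return 0;
}
```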
Re: [HACKERS] Buildfarm failure from overly noisy warning message
Andrew Dunstan writes:
> On 07/26/2015 11:00 AM, Andres Freund wrote:
>> On 2015-07-26 10:56:05 -0400, Tom Lane wrote:
>>> I'm inclined to reduce the WARNING to LOG

>> Hm, that doesn't seem like a very nice solution, given that LOG is even
>> more likely to end up in the server log.

>>> and/or skip it altogether if the error is ESRCH.

>> Combined with a comment why we're ignoring that case that'd work for me.

> +1 for this.  if the process is gone we shouldn't really have a problem,
> should we?

The only real reason for printing anything here is the possibility that
the shared lock table is sufficiently corrupted that we think there's an
autovacuum process blocking us when there isn't.  However, I don't see
any particular reason to think that this is more probable than any other
bad consequence of corrupted shmem, so I don't feel a need to be in the
user's face with a WARNING.  The evidence so far is that the race
condition is real, but I don't recall anyone ever reporting this in a
context where we'd suspect actual shmem corruption.

I do, however, think it might be reasonable to LOG the failure given
that we LOG'd the action.  Is nobody else on board with downgrading both
ereports to DEBUG?

			regards, tom lane
Re: [HACKERS] Buildfarm failure from overly noisy warning message
On 07/26/2015 11:00 AM, Andres Freund wrote:
> Hi,
>
> On 2015-07-26 10:56:05 -0400, Tom Lane wrote:
>> CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
>> + WARNING:  could not send signal to process 30123: No such process
>>
>> What's evidently happened here is that our session tried to boot an
>> autovacuum process off a table lock, only that process was gone by the
>> time we issued the kill() call.  No problem really ... but aside from
>> causing buildfarm noise, I could see somebody getting really panicky
>> if this message appeared on a production server.
>
> Oops.
>
>> I'm inclined to reduce the WARNING to LOG
>
> Hm, that doesn't seem like a very nice solution, given that LOG is even
> more likely to end up in the server log.
>
>> and/or skip it altogether if the error is ESRCH.
>
> Combined with a comment why we're ignoring that case that'd work for me.

+1 for this.  if the process is gone we shouldn't really have a problem,
should we?

cheers

andrew
Re: [HACKERS] Buildfarm failure from overly noisy warning message
Hi,

On 2015-07-26 10:56:05 -0400, Tom Lane wrote:
> CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
> + WARNING:  could not send signal to process 30123: No such process

> What's evidently happened here is that our session tried to boot an
> autovacuum process off a table lock, only that process was gone by the
> time we issued the kill() call.  No problem really ... but aside from
> causing buildfarm noise, I could see somebody getting really panicky
> if this message appeared on a production server.

Oops.

> I'm inclined to reduce the WARNING to LOG

Hm, that doesn't seem like a very nice solution, given that LOG is even
more likely to end up in the server log.

> and/or skip it altogether if the error is ESRCH.

Combined with a comment why we're ignoring that case that'd work for me.

- Andres
[HACKERS] Buildfarm failure from overly noisy warning message
chipmunk failed last night,
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2015-07-26%2007%3A36%3A32
like so:

== pgsql.build/src/test/regress/regression.diffs ===
*** /home/pgbfarm/buildroot/REL9_3_STABLE/pgsql.build/src/test/regress/expected/create_index.out	Sun Jul 26 10:37:41 2015
--- /home/pgbfarm/buildroot/REL9_3_STABLE/pgsql.build/src/test/regress/results/create_index.out	Sun Jul 26 10:51:48 2015
***************
*** 14,19 ****
--- 14,20 ----
  CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
  CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
  CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
+ WARNING:  could not send signal to process 30123: No such process
  CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
  CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
  CREATE INDEX rix ON road USING btree (name text_ops);
==

What's evidently happened here is that our session tried to boot an
autovacuum process off a table lock, only that process was gone by the
time we issued the kill() call.  No problem really ... but aside from
causing buildfarm noise, I could see somebody getting really panicky
if this message appeared on a production server.

I'm inclined to reduce the WARNING to LOG, and/or skip it altogether
if the error is ESRCH.  The relevant code in ProcSleep() is:

    ereport(LOG,
            (errmsg("sending cancel to blocking autovacuum PID %d",
                    pid),
             errdetail_log("%s", logbuf.data)));

    if (kill(pid, SIGINT) < 0)
    {
        /* Just a warning to allow multiple callers */
        ereport(WARNING,
                (errmsg("could not send signal to process %d: %m",
                        pid)));
    }

so logging failures at LOG level seems pretty reasonable.  One could
also argue that both of these ereports ought to be downgraded to DEBUG1
or less, since this mechanism is pretty well shaken out by now.

Thoughts?

			regards, tom lane
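The race described above is easy to reproduce outside the server: once the target process has exited and been reaped, kill() fails with ESRCH. The sketch below illustrates the suppression being proposed, in self-contained plain C rather than the actual ProcSleep() code; the function name and fprintf stand-ins are invented for the example.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Sketch of the proposed behavior: send the cancel signal, but stay
 * quiet when kill() fails with ESRCH, since that just means the
 * target finished before we could signal it.  Returns 1 if the signal
 * was delivered, 0 if the target was already gone, -1 on other errors.
 */
static int
send_cancel(pid_t pid)
{
	if (kill(pid, SIGINT) < 0)
	{
		/* Harmless race: the process has already exited. */
		if (errno == ESRCH)
			return 0;

		fprintf(stderr, "could not send signal to process %d\n", (int) pid);
		return -1;
	}
	return 1;
}
```

A child that has exited and been waited for exercises the ESRCH path without ever touching a live process.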