Hi Tom, thanks for your detailed responses. I really appreciate it. My comments below inline ...
> > I'm having a strange problem with postgres & autovacuum > > Everything is working fine until I start getting the following errors ... > > and postgres stops working shortly after (it stops accepting connections) > > > 2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08 > > CET|1/44303|0|| LOG: automatic vacuum of table > > "mrs.pg_catalog.pg_statistic": index scans: 1 > > pages: 0 removed, 189 remain > > tuples: 132 removed, 4587 remain > > system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec > > 2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 > > CET||0|| WARNING: worker took too long to start; cancelled > > 2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 > > CET||0|| WARNING: worker took too long to start; cancelled > > 2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 > > CET||0|| WARNING: worker took too long to start; cancelled > > Hm. The code comment above that warning says > > * The only problems that may cause this code to > * fire are errors in the earlier sections of AutoVacWorkerMain, > * before the worker removes the WorkerInfo from the > * startingWorker pointer. > > but it's hard to see what problem there could lead to an issue. (In > particular, I discount the idea that AutovacuumLock could be stuck, > because we had to acquire it in order to issue this message.) > > But it strikes me that the code comment is wrong in one significant way: > if the postmaster were failing to heed SIGUSR1 at all, you could reach > the timeout here, because the fork-failed signal wouldn't get sent. > > Given that you say it also stops accepting connections, I'm thinking > this is a postmaster problem not an autovacuum problem. But you've > not provided any information about that end of it. Exactly what happens > when you try to make a connection? Are there any entries at all in the > postmaster log? What about the kernel log? Are you sure that new > connections stop working *after* this happens, and not at exactly the > same time? > When I try to make a connection from the command line I get nothing. I mean, if I execute ... *# psql -U postgres -l* I have to Ctr-C or Ctr-Z (followed by kill %1) because I can wait forever. It's like no more connections were accepted. Obviously our applications stop working immediately. Trying to stop postgres with *# /etc/init.d/postgres stop* *Stopping postgresql service: * I have to strop it with Ctr-C as it does not wotk (I waited up to 3 min. with no luck). As a consequence I had to kill the postmaster process manually. Kernel log says nothing :( Not sure what started failing first: the connections or the autovacuum. I just went to the log and saw that sequence of events. Searching in google I found a few references saying autovacuum and that error log "WARNING: worker took too long to start; cancelled" were releated. > > > OS: Centos 5.5 > > Kernel: 2.6.18-194.26.1.el5 > > Postgres version: 8.4.5 (installation out-of-the-box using yum) > > Given that this is a Linux system, I think that an OOM kill on the > postmaster is a not-insignificant possibility. Or at least I would > think that if there weren't a PostmasterIsAlive check in the autovac > launcher loop. It's real hard to see how you could get more than > one of these messages if the postmaster were gone entirely. > I just noticed the error log repeats once every minute but the content is the same ... First time the log appeared *2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled* * 2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled * It repeats once every minute until we fix it * 2010-11-14 09:44:01.889 CET|1208|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled * *2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled* All 1,000+ lines are alike except for the date (every new line adds 1 more minute) However the log is the same * 2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled * According to our postgresql.conf log format ... *log_line_prefix = '%m|%l|%d|%r|%p|%i|%c|%s|%v|%x|%u| ' * * # %u = user name* * # %d = database name* * # %r = remote host and port* * # %h = remote host* * # %p = process ID* * # %t = timestamp without milliseconds* * # %m = timestamp with milliseconds* * # %i = command tag* * # %c = session ID* * # %l = session line number* * # %s = session start timestamp* * # %v = virtual transaction ID* * # %x = transaction ID (0 if none) * * # %q = stop here in non-session* * # processes* Splitting the log ... * (%m = timestamp with milliseconds) = 2010-11-14 09:45:03.105 CET (%l = session line number) = 209 (%d = database name) = (%r = remote host and port) = (%p = process ID) = 3300 (%i = command tag) = (%c = session ID) = 4cdc2ae6.ce4 (%s = session start timestamp) = 2010-11-11 18:41:58 CET (%v = virtual transaction ID) = (%x = transaction ID (0 if none)) = 0 %u = user name) = Message = WARNING: worker took too long to start; cancelled * I see the following fields are the same in the 1,200+ lines of the log * (%d = database name) = (%r = remote host and port) = * * (%c = session ID) (%s = session start timestamp) (%v = virtual transaction ID) (%x = transaction ID (0 if none)) %u = user name) * * Message * Does this info help? What does "empty database" mean? could be the autovacuum process? I noticed the "%s" (* 2010-11-11 18:41:58 CET) *matches the time we started postgres. Here you have the log *2010-11-11 18:41:58.840 CET|1|||3268||4cdc2ae6.cc4|2010-11-11 18:41:58 CET||0|| LOG: database system was shut down at 2010-11-11 18:38:55 CET* *2010-11-11 18:41:58.902 CET|1|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| LOG: autovacuum launcher started* *2010-11-11 18:41:58.902 CET|1|||3234||4cdc2ae4.ca2|2010-11-11 18:41:56 CET||0|| LOG: database system is ready to accept connections* > Could you try strace'ing the postmaster process to see what it's doing > when this is happening? > I definitely will. More information in case that helps. Strangely, we had the same problem in a temporary database we used last week while migrating some data. This time there was not autovacuum entry in the log ... *2010-11-10 20:26:06.441 CET|1|cache|10.19.0.48(20840)|11239|SELECT|4cdaf1cc.2be7|2010-11-10 20:26:04 CET|114/0|0|postgres| LOG: duration: 1770.500 ms statement: select id_type,id,id_language_fk,data from cache where id_type=5 and id = '60820269' and id_language_fk = 2* *2010-11-10 20:26:07.100 CET|1|xxx|10.19.0.43(14080)|9423|SELECT|4cdaef43.24cf|2010-11-10 20:15:15 CET|80/0|0|postgres| LOG: duration: 1178.423 ms statement: SELECT "reviews".id FROM "reviews" WHERE ("reviews"."booking_id" = E'10101690551') LIMIT 1* *2010-11-10 20:27:26.905 CET|3|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled* *2010-11-10 20:28:27.422 CET|4|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled* *2010-11-10 20:28:41.641 CET|1|mrs|10.19.0.51(20106)|11272|SELECT|4cdaf1d0.2c08|2010-11-10 20:26:08 CET|6/701|0|postgres| LOG: duration: 1878.724 ms execute <unnamed>: select tp.*,t.id_token,uo.* from tokenprovider as tp inner join token as t on (tp.id_tokenprovider = t.id_tokenprovider_fk) inner join tkuo as tu on (t.id_token = tu.id_token_fk) inner join useoccupation as uo on (uo.id_useoccupation = tu.id_useoccupation_fk and uo.id_tokenprovider_fk = tp.id_tokenprovider) where tp.id_tokenprovider in (10203746,10369604,10718768,10001514,10170665,10091612,10829359,10797243,50013934,12536008,90529952,10064394,10769848,50013983,10012115,23951091,25688092,60712161,10455933,10766831,62781261,10560488,10808512,80751062,38735328,50013938,10481737,50013710,10916713,10556915,10999695,75417634,22313851,55471672,88737419,46563417,69789399,20418193,37363697,55232743,10328868,10171688,77959221,10380595,28839589,10439708,10165273,10872146,50356324,15842616,10597119,10943311,43894782,61323020) order by tp.id_tokenprovider,t.id_token,uo.sort* *2010-11-10 20:29:28.349 CET|5|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled* *2010-11-10 20:30:01.940 CET|1|xmas|10.19.0.43(14078)|9422|idle|4cdaef43.24ce|2010-11-10 20:15:15 CET|70/0|0|postgres| LOG: unexpected EOF on client connection* *2010-11-10 20:30:01.942 CET|1|xxxx|10.19.0.43(2900)|5793|idle|4cdaebb6.16a1|2010-11-10 20:00:06 CET|60/0|0|postgres| LOG: unexpected EOF on client connection* *2010-11-10 20:30:01.948 CET|1|mrs|10.19.0.43(14081)|9424|idle|4cdaef43.24d0|2010-11-10 20:15:15 CET|89/0|0|postgres| LOG: unexpected EOF on client connection* *2010-11-10 20:30:01.950 CET|3|xxxx|10.19.0.43(14429)|9840|idle|4cdaefc8.2670|2010-11-10 20:17:28 CET|72/0|0|postgres| LOG: unexpected EOF on client connection* *2010-11-10 20:30:01.951 CET|1|xxxx|10.19.0.43(60284)|10509|idle|4cdaf0ba.290d|2010-11-10 20:21:30 CET|83/0|0|postgres| LOG: unexpected EOF on client connection* *2010-11-10 20:30:01.961 CET|1|xxxx|10.19.0.43(49864)|10986|idle|4cdaf19f.2aea|2010-11-10 20:25:19 CET|57/0|0|postgres| LOG: unexpected EOF on client connection* *2010-11-10 20:30:01.969 CET|1|xxxx|10.19.0.43(50039)|11282|idle|4cdaf1d9.2c12|2010-11-10 20:26:17 CET|65/0|0|postgres| LOG: unexpected EOF on client connection* *2010-11-10 20:30:28.875 CET|6|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled* *2010-11-10 20:31:29.210 CET|7|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled* *2010-11-10 20:32:29.642 CET|8|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled* *2010-11-10 20:33:30.271 CET|9|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled* *2010-11-10 20:34:30.604 CET|10|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled* *2010-11-10 20:35:31.106 CET|11|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled* Well, thanks for any information. I can provide more logs/information if necessary. Pablo > > regards, tom lane >