[ADMIN] Unfamiliar recovery message after server crash

2004-04-20 Thread Arthur Ward
For unknown reasons, our PG server died overnight (PG 7.4.2, RedHat 8,
ext3 in ordered data mode, and battery-backed RAID5). When it came back up,
I saw an unfamiliar line in the recovery output from Postgres:

Apr 20 11:28:14 postgres: [6203] LOG:  database system was interrupted at
2004-04-20 05:50:59 CDT
Apr 20 11:28:14 postgres: [6203] LOG:  checkpoint record is at 7C/CF55FB6C
Apr 20 11:28:14 postgres: [6203] LOG:  redo record is at 7C/CF55FB6C; undo
record is at 0/0; shutdown FALSE
Apr 20 11:28:14 postgres: [6203] LOG:  next transaction ID: 9374951; next
OID: 158516389
Apr 20 11:28:14 postgres: [6203] LOG:  database system was not properly
shut down; automatic recovery in progress
Apr 20 11:28:14 postgres: [6203] LOG:  redo starts at 7C/CF55FBAC
Apr 20 11:28:17 postgres: [6203] LOG:  unexpected pageaddr 7C/C7A0A000 in
log file 124, segment 207, offset 10526720
Apr 20 11:28:17 postgres: [6203] LOG:  redo done at 7C/CFA09F7C
Apr 20 11:28:20 postgres: [6203] LOG:  database system is ready

What's the significance of the unexpected pageaddr item?




Re: [ADMIN] syslog enabled causes random hangs?

2003-08-14 Thread Arthur Ward
 A checkpoint would also have reason to wait for a page-level lock, if
 the stuck backend was holding one.  I am wondering though if the stuck
 condition consistently happens while trying to fire a trigger?  That
 would be very interesting ... not sure what it'd mean though ...

Hmm. I'm really at a loss as to how I could test this theory. I don't have
a truly reproducible test case for any of these syslog problems, just one
set of code with really bad luck one afternoon. Race conditions (or things
that smell a lot like them) stink.

 It looks to me like your plpython code is all dead in the water, seeing
 that your Python installation is refusing creation of rexec.  (AFAIK the
 only workaround is to downgrade Python to a version that allows rexec.)
 If you're using it all over the place, how come you haven't noticed
 that??

I did notice, and it was an oversight. I had just rebuilt Python 2.2.3
(for unrelated reasons) and forgot to comment out the poison line in
rexec.py where it raises the "stop using RExec" exception. It behaves
properly once that line is commented out ("properly" meaning it works the
same as with earlier versions of Python; it appears all that changed was
the insertion of the exception at the beginning of the RExec constructor).
I tried to get the deadlock behavior again after fixing rexec.py, but my
luck wasn't bad enough for another three runs, so I posted the case I had
traces for.

An idea just popped into my head, though. Perhaps I can create procedures
in plpgsql and plpython which do nothing but spew notices (which would in
turn be sent to syslog), and run one or two copies to see if they'll die
on their own given sufficient time. That seems worthwhile, especially if I
can get a deadlock in plpgsql, since it will take the blame away from both
triggers and plpython. Does this sound like a reasonable experiment? I may
try setting this up on my home machine tonight to run for a few days.
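
Roughly what I have in mind is the sketch below -- untested, with made-up
function names. 7.3 has no dollar quoting, so the bodies use doubled
single quotes, and I'm assuming plpy.notice() is the right call for
emitting a NOTICE from plpython:

CREATE FUNCTION spew_notices_plpgsql(integer) RETURNS integer AS '
DECLARE
    i integer;
BEGIN
    FOR i IN 1 .. $1 LOOP
        -- with syslog logging enabled, each NOTICE should pass through syslog
        RAISE NOTICE ''plpgsql notice %'', i;
    END LOOP;
    RETURN $1;
END;
' LANGUAGE plpgsql;

CREATE FUNCTION spew_notices_plpython(integer) RETURNS integer AS '
# arguments show up in the args list; assuming plpy.notice() exists here
for i in xrange(args[0]):
    plpy.notice("plpython notice %d" % i)
return args[0]
' LANGUAGE plpython;

-- then run a copy or two of each in separate sessions and wait:
SELECT spew_notices_plpgsql(100000);
SELECT spew_notices_plpython(100000);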






Re: [ADMIN] syslog enabled causes random hangs?

2003-08-14 Thread Arthur Ward
 It looks to me like the guy doing VACUUM is simply waiting for the other
 guy to release a page-level lock.  The other guy is running a deferred
 trigger and so I'd expect him to be holding one or two page-level locks,
 on the page or pages containing the tuple or tuples passed to the
 trigger.  Nothing evidently wrong there.

If I remember what I was working on the other day when this whole thing
started, I think it was a single backend and a checkpoint that collided.
I'll trace that combination, assuming it happens again.

 The real question is why does vsyslog() have anything to block on, when
 it's running in an unthreaded process?  Seeing that you are using
 plpython, I wonder if Python is confusing matters somehow.

Oof. I'm using plpython all over the place; I don't think this problem
has happened anywhere that could easily work without it. :-/






Re: [ADMIN] syslog enabled causes random hangs?

2003-08-14 Thread Arthur Ward
I'm back with more on the funky glibc-syslog-Postgres deadlocking behavior:

 It's really too bad that your gdb backtrace didn't show anything past
 the write_syslog level (which is an elog subroutine).  If we could see
 whether the elog had been issued from a signal handler, and if so what
 it had interrupted, we'd have an idea whether this is a known syslog
 deficiency or not.  Do you want to rebuild with debug symbols and try
 the backtrace again?

A minor aha here: whoever wrote the original Gentoo build for Postgres
left out --enable-debug, even though Gentoo has a system-wide "no, I don't
want symbols" flag (which I did not set).

I was testing a different piece of my app at home, and triggered the
deadlock by accident. Being frustrated, I took the time to rebuild PG with
symbols, and also rebuilt Python (for other reasons), which accounts for
the RExec exception in the traceback here. The exception normally comes
back to the client with no problems. I had forgotten to comment out the
poison line in Python 2.2.3's RExec module so my plpython code would
continue working.

This is PostgreSQL 7.3.4 running against glibc 2.3.2. It just occurred to
me that I didn't trace the postmaster or the other always-on processes,
just the two dead backends. In a couple of attempts to recreate the
problem, I had no failures. :-( The original problem was on a different
machine running a different part of my project, and seemed more
reproducible (or I could have just had a bad day), so I can go back to
that at some point if tracing the other processes is a good idea.

postgres 27259  3.2  6.4 47488 20600 pts/0   S    03:30   0:05 postgres:
postgres Trucking [local] VACUUM
(gdb) bt
#0  0x4dbf81a7 in semop () from /lib/libc.so.6
#1  0x080f2dd7 in PGSemaphoreLock (sema=0x1, interruptOK=1 '\001') at
pg_sema.c:434
#2  0x08106dd3 in ProcWaitForSignal () at proc.c:895
#3  0x08101878 in LockBufferForCleanup (buffer=1021) at bufmgr.c:1946
#4  0x080c380b in lazy_vacuum_heap (onerel=0x420a20f8,
vacrelstats=0x8288f88) at vacuumlazy.c:474
#5  0x080c3476 in lazy_scan_heap (onerel=0x420a20f8,
vacrelstats=0x8288f88, Irel=0x8297818, nindexes=3) at vacuumlazy.c:426
#6  0x080c31e2 in lazy_vacuum_rel (onerel=0x21, vacstmt=0x1) at
vacuumlazy.c:158
#7  0x080c02ee in vacuum_rel (relid=1021, vacstmt=0x8287138,
expected_relkind=114 'r') at vacuum.c:829
#8  0x080bfd32 in vacuum (vacstmt=0x8287138) at vacuum.c:290
#9  0x0810a3a1 in pg_exec_query_string (query_string=0x8287138,
dest=Remote, parse_context=0x827bb30) at postgres.c:789
#10 0x0810afdf in PostgresMain (argc=4, argv=0xbfff7cc8,
username=0x822e1d9 "postgres") at postgres.c:2013
#11 0x080f5175 in DoBackend (port=0x822e0a8) at postmaster.c:2310
#12 0x080f4d2f in BackendStartup (port=0x822e0a8) at postmaster.c:1932
#13 0x080f3f95 in ServerLoop () at postmaster.c:1009
#14 0x080f3989 in PostmasterMain (argc=1, argv=0x82162e0) at postmaster.c:788
#15 0x080d684c in main (argc=1, argv=0xbfff85c4) at main.c:210
#16 0x4db327a7 in __libc_start_main () from /lib/libc.so.6

postgres 27235  5.7  3.0 41900 9784 pts/0   S    03:29   0:15 postgres:
award Trucking [local] INSERT
(gdb) bt
#0  0x4db45ed6 in sigsuspend () from /lib/libc.so.6
#1  0x4de40218 in __pthread_wait_for_restart_signal () from
/lib/libpthread.so.0
#2  0x4de419a0 in __pthread_alt_lock () from /lib/libpthread.so.0
#3  0x4de3ec17 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x4dbf334c in vsyslog () from /lib/libc.so.6
#5  0x4dbf2ebf in syslog () from /lib/libc.so.6
#6  0x0814a49b in write_syslog (level=4,
line=0xbfff70fc "ERROR:  plpython: Unable to create rexec.RExec
instance\nexceptions.RuntimeError: This code is not secure in Python
2.2 and 2.3") at elog.c:728
#7  0x08149bf9 in elog (lev=-1073779172, fmt=0x421a44fa "plpython:
%s\n%s") at elog.c:383
#8  0x4214b496 in PLy_elog (elevel=20, fmt=0x421a4a80 "Unable to create
rexec.RExec instance") at plpython.c:2811
#9  0x421490ff in PLy_procedure_compile (proc=0x83abdb8,
src=0x8359610 "\n# TD[\"args\"][0] - audit_table (with schema) \n#
TD[\"relid\"] = OID of the table being triggered on\n\nif not
SD.has_key(\"plans\"):\n\tSD[\"plans\"] = {}\n\nif
SD[\"plans\"].has_key(TD[\"relid\"]):\n\tcachedPlans = SD"...)
at plpython.c:1203
#10 0x42148f32 in PLy_procedure_create (fcinfo=0xbfff78c0, is_trigger=1
'\001', procTup=0x420b1910,
key=0xbfff7730 "5210458_trigger") at plpython.c:1177
#11 0x42148ba8 in PLy_procedure_get (fcinfo=0xbfff78c0, is_trigger=1
'\001') at plpython.c:1043
#12 0x4214758e in plpython_call_handler (fcinfo=0xbfff78c0) at plpython.c:423
#13 0x0814b4cc in fmgr_security_definer (fcinfo=0xbfff78c0) at fmgr.c:681
#14 0x080bbb8f in ExecCallTriggerFunc (trigdata=0xbfff79e8,
finfo=0x82aff98, per_tuple_context=0x827c9d0) at trigger.c:1121
#15 0x080bc39a in DeferredTriggerExecute (event=0x82c2ba8, itemno=-4,
rel=0x4f815b, trigdesc=0x828bff8, finfo=0x82afcf0,
per_tuple_context=0x827c9d0) at trigger.c:1656
#16 0x080bc685 in deferredTriggerInvokeEvents (immediate_only=1 

[ADMIN] syslog enabled causes random hangs?

2003-07-29 Thread Arthur Ward
I'm encountering strange hangs in postgresql backends at random moments.
They seem to be associated with attempts to issue log entries via syslog.
I have run backtraces on the hung backends a few times, and they routinely
trace into system libraries where it looks like a stuck syslog call. So
far, I haven't had this problem with any other apps, so I'm thinking it's
a condition being aggravated by Postgres.

The machines I'm seeing this on are RedHat 8 and Gentoo Linux machines
(x86). We're running compiled-from-source 7.3.3 on both machines (by hand
on the RH8 machine, and built by Gentoo's package manager, because it's a
build-from-source distribution). On my Gentoo test boxes, it happens
occasionally using syslog-ng 1.6.0_rc3 and glibc 2.3.2-r1. On the RedHat 8
machine, it's sysklogd-1.4.1-10 and glibc-devel-2.3.2-4.80.6. I switched
to not using syslog on the RH8 machine because it was hanging backends
quite regularly, and that's rather bad in production. :)
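
For reference, the syslog-related knobs here are the 7.3-era settings in
postgresql.conf; mine look roughly like this (the facility and ident
values are just my local choices):

syslog = 2                  # 0 = stderr only, 1 = both, 2 = syslog only
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'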

Has anyone else had problems using the syslog option?

I need to rebuild PostgreSQL on my test machine so I'll have unstripped
binaries, but once that's done, I'll bring on the stack traces when I get
another hang.





Re: [ADMIN] syslog enabled causes random hangs?

2003-07-29 Thread Arthur Ward
 Arthur Ward [EMAIL PROTECTED] writes:
 I'm encountering strange hangs in postgresql backends at random
 moments. They seem to be associated with attempts to issue log entries
 via syslog. I have run backtraces on the hung backends a few times,
 and they routinely trace into system libraries where it looks like a
 stuck syslog call. So far, I haven't had this problem with any other
 apps, so I'm thinking it's a condition being aggravated by Postgres.

 How verbose are your Postgres logging settings?

Defaults (single-user test machine -- not much activity). With what I was
doing today, it seemed to happen most often with the occurrence of a
checkpoint, which would normally log a message about recycling the
transaction log files.
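
If the checkpoint correlation is real, I should be able to poke at it by
forcing checkpoints by hand from psql while something else is logging --
just an idea, nothing proven yet:

CHECKPOINT;   -- superuser-only; forces an immediate checkpoint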

 On most platforms I've dealt with, syslog drops messages on the floor if
 it starts to get saturated.  It may be that the Linux implementation has
 worse behavior than that under heavy load :-(.  In any case I'd suggest
 filing a bug against syslog.  There's very little we can do about it if
 the syslog library call hangs up.

After rebuilding to get a non-stripped copy of PostgreSQL, I did a little
searching on my backtraces. Some of the first hits are pointing at a
deadlock condition in glibc, although I'm not sure how or why this is
affecting PostgreSQL running independently of my app and its signal
handler. Maybe some more digging will enlighten me...

 Personally I find it more reliable to pipe the postmaster's stderr to
 some sort of log-rotation program than to depend on syslog.  It seems
 that the Apache folks have found the same, since they include a suitable
 log-rotation filter in their distribution ...

Since deadlocks in the system libraries sound a bit scary to me, I'm
definitely convinced to change my development machine now... As I
originally wrote, I'd already changed the production machine.
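
For anyone heading the same direction, the stderr arrangement Tom
describes would look something like this (paths and the one-day rotation
interval are just examples; rotatelogs ships with Apache):

postmaster -D /usr/local/pgsql/data 2>&1 \
    | rotatelogs /var/log/pgsql/postgresql.%Y-%m-%d 86400 &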


FWIW, since I already went to the trouble (and for the sake of people
searching the archives in the future), here's what I was seeing in the
backtraces after rebuilding this afternoon:

This process:
19763 pts/2   S      0:00 postgres: checkpoint subprocess
Was stuck here:
(gdb) bt
#0  0x402cf077 in semop () from /lib/libc.so.6
(gdb)

And the other:
19720 pts/2   S      0:04 postgres: award Trucking [local] UPDATE
(gdb) bt
#0  0x4021cec6 in sigsuspend () from /lib/libc.so.6
#1  0x424b6218 in __pthread_wait_for_restart_signal ()
   from /lib/libpthread.so.0
#2  0x424b79a0 in __pthread_alt_lock () from /lib/libpthread.so.0
#3  0x424b4c17 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x402ca21c in vsyslog () from /lib/libc.so.6
#5  0x402c9d8f in syslog () from /lib/libc.so.6
#6  0x08150a57 in write_syslog ()
(gdb)



