We have now experienced index corruption on two separate but identical slony clusters. In each case the slony subscriber failed after attempting to insert a duplicate record. In each case reindexing the sl_log_1 table on the provider fixed the problem.
The latest occurrence was on our production cluster yesterday. This has only happened since we performed kernel upgrades and we are uncertain whether this represents a kernel bug, or a postgres bug exposed by different timings in the new kernel.

Our systems are:

Sun v40z
4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
kernel boot option: elevator=deadline
16 Gigs of RAM
postgresql-8.0.3-1PGDG
Bonded e1000/tg3 NICs with 8192 MTU.
Slony 1.1.0

NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options
rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU.

All postgres data and logs are stored on the netapp.

In the latest episode, the index corruption was coincident with a slony-induced vacuum. I don't know if this was the case with our test system failures.

What can we do to help identify the cause of this? I believe we will be able to reproduce this on a test system if there is some useful investigation we can perform.

__
Marc