unsubscribe
On Thu, Mar 28, 2013 at 10:30 PM, Sergey Konoplev <gray...@gmail.com> wrote: > Hi all, > > A couple of days ago I found the replica stopped after the PANIC message: > > PANIC: WAL contains references to invalid pages > > When I tried to restart it I got this FATAL: > > FATAL: could not access status of transaction 280557568 > > Below is the description of the server and information from PostgreSQL > and system logs. After googling the problem I have found nothing like > this. > > Any thoughts of what it could be and how to prevent it in the future? > > Hardware: > > IBM System x3650 M4, 148GB RAM, NAS > > Software: > > PostgreSQL 9.2.3, yum.postgresql.org > CentOS 6.3, kernel 2.6.32-279.22.1.el6.x86_64 > > Configuration: > > listen_addresses = '*' > max_connections = 550 > shared_buffers = 35GB > work_mem = 256MB > maintenance_work_mem = 1GB > bgwriter_delay = 10ms > bgwriter_lru_multiplier = 10.0 > effective_io_concurrency = 32 > wal_level = hot_standby > synchronous_commit = off > checkpoint_segments = 1024 > checkpoint_timeout = 1h > checkpoint_completion_target = 0.9 > checkpoint_warning = 5min > max_wal_senders = 3 > wal_keep_segments = 2048 > hot_standby = on > max_standby_streaming_delay = 5min > hot_standby_feedback = on > effective_cache_size = 133GB > log_directory = '/var/log/pgsql' > log_filename = 'postgresql-%Y-%m-%d.log' > log_checkpoints = on > log_line_prefix = '%t %p %u@%d from %h [vxid:%v txid:%x] [%i] ' > log_lock_waits = on > log_statement = 'ddl' > log_timezone = 'W-SU' > track_activity_query_size = 4096 > autovacuum_max_workers = 5 > autovacuum_naptime = 5s > autovacuum_vacuum_scale_factor = 0.05 > autovacuum_analyze_scale_factor = 0.05 > autovacuum_vacuum_cost_delay = 5ms > datestyle = 'iso, dmy' > timezone = 'W-SU' > lc_messages = 'en_US.UTF-8' > lc_monetary = 'ru_RU.UTF-8' > lc_numeric = 'ru_RU.UTF-8' > lc_time = 'ru_RU.UTF-8' > default_text_search_config = 'pg_catalog.russian' > > System: > > # Controls the maximum shared segment size, in bytes > kernel.shmmax = 53287555072 > > # Controls the maximum number of shared memory segments, in pages > kernel.shmall = 13009657 > > # Maximum number of file-handles > fs.file-max = 65535 > > # pdflush tuning to prevent lag spikes > vm.dirty_ratio = 10 > vm.dirty_background_ratio = 1 > vm.dirty_expire_centisecs = 499 > > # Prevent the scheduler breakdown > kernel.sched_migration_cost = 5000000 > > # Turned off to provide more CPU to PostgreSQL > kernel.sched_autogroup_enabled = 0 > > # Setup hugepages > vm.hugetlb_shm_group = 26 > vm.hugepages_treat_as_movable = 0 > vm.nr_overcommit_hugepages = 512 > > # The Huge Page Size is 2048kB, so for 35GB shared buffers the number is > 17920 > vm.nr_hugepages = 17920 > > # Turn off the NUMA local pages reclaim as it leads to wrong caching > strategy for databases > vm.zone_reclaim_mode = 0 > > Environment: > > HUGETLB_SHM=yes > LD_PRELOAD='/usr/lib64/libhugetlbfs.so' > export HUGETLB_SHM LD_PRELOAD > > When it is stopped: > > 2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG: > restartpoint complete: wrote 1685004 buffers (36.7%); 0 transaction > log file(s) added, 0 removed, 555 recycled; write=3237.402 s, > sync=0.071 s, total=3237.507 s; sync files=2673, longest=0.008 s, > average=0.000 s > 2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG: recovery > restart point at 2538/6E154AC0 > 2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] DETAIL: last > completed transaction was at log time 2013-03-26 11:50:31.613948+04 > 2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG: > restartpoint starting: xlog > 2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] WARNING: > page 451 of relation base/16436/2686702648 is uninitialized > 2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] CONTEXT: > xlog redo vacuum: rel 1663/16436/2686702648; blk 2485, > lastBlockVacuumed 0 > 2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] PANIC: WAL > contains references to invalid pages > 2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] CONTEXT: > xlog redo vacuum: rel 1663/16436/2686702648; blk 2485, > lastBlockVacuumed 0 > 2013-03-26 11:51:16 MSK 3770 @ from [vxid: txid:0] [] LOG: startup > process (PID 3773) was terminated by signal 6: Aborted > 2013-03-26 11:51:16 MSK 3770 @ from [vxid: txid:0] [] LOG: > terminating any other active server processes > > From /var/log/messages: > > Mar 26 10:50:52 tms2 kernel: : postmaster: page allocation failure. > order:8, mode:0xd0 > Mar 26 10:50:52 tms2 kernel: : Pid: 3774, comm: postmaster Not tainted > 2.6.32-279.22.1.el6.x86_64 #1 > Mar 26 10:50:52 tms2 kernel: : Call Trace: > Mar 26 10:50:52 tms2 kernel: : [<ffffffff8112343f>] ? > __alloc_pages_nodemask+0x77f/0x940 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115d3e2>] ? > kmem_getpages+0x62/0x170 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115dffa>] ? > fallback_alloc+0x1ba/0x270 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115da4f>] ? > cache_grow+0x2cf/0x320 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115dd79>] ? > ____cache_alloc_node+0x99/0x160 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff813fd455>] ? > dma_pin_iovec_pages+0xb5/0x230 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115eb49>] ? __kmalloc+0x189/0x220 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff813fd455>] ? > dma_pin_iovec_pages+0xb5/0x230 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff8141a47c>] ? > lock_sock_nested+0xac/0xc0 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff8146edaa>] ? > tcp_recvmsg+0x4ca/0xe80 > Mar 26 10:50:52 tms2 kernel: : [<ffffffffa029a67e>] ? > xfs_vm_write_end+0x2e/0x60 [xfs] > Mar 26 10:50:52 tms2 kernel: : [<ffffffffa0292f39>] ? > xfs_trans_unlocked_item+0x39/0x60 [xfs] > Mar 26 10:50:52 tms2 kernel: : [<ffffffff8148f90a>] ? > inet_recvmsg+0x5a/0x90 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff81418b93>] ? > sock_recvmsg+0x133/0x160 > Mar 26 10:50:52 tms2 kernel: : [<ffffffffa029f542>] ? > xfs_rw_iunlock+0x32/0x40 [xfs] > Mar 26 10:50:52 tms2 kernel: : [<ffffffff81090be0>] ? > autoremove_wake_function+0x0/0x40 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff81173fff>] ? > __dentry_open+0x23f/0x360 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff811764ca>] ? > do_sync_write+0xfa/0x140 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff811875d0>] ? > do_filp_open+0x780/0xd60 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff81418d0e>] ? > sys_recvfrom+0xee/0x180 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff81183105>] ? putname+0x35/0x50 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff81176842>] ? vfs_write+0x132/0x1a0 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff810d3f47>] ? > audit_syscall_entry+0x1d7/0x200 > Mar 26 10:50:52 tms2 kernel: : [<ffffffff8100b072>] ? > system_call_fastpath+0x16/0x1b > > After restart: > > 2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] LOG: database > system was interrupted while in recovery at log time 2013-03-26 > 10:08:37 MSK > 2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] HINT: If this > has occurred more than once some data might be corrupted and you might > need to choose an earlier recovery target. > 2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] LOG: entering > standby mode > 2013-03-27 10:43:01 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: redo > starts at 2538/6E154AC0 > 2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file > "pg_subtrans/10B7" doesn't exist, reading as zeroes > 2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: > xlog redo xid assignment xtop 280482186: subxacts: 280483328 280483333 > 280483343 280483354 280483356 280483368 280483377 280483382 280483392 > 280483404 280483416 280483429 280483440 280483451 280483460 280483472 > 280483487 280483500 280483516 280483530 280483541 280483555 280483565 > 280483574 280483585 280483595 280483604 280483607 280483617 280483626 > 280483636 280483646 280483656 280483665 280483677 280483688 280483699 > 280483709 280483719 280483730 280483739 280483749 280483759 280483761 > 280483771 280483782 280483799 280483800 280483811 280483821 280483824 > 280483836 280483847 280483859 280483871 280483874 280483883 280483897 > 280483906 280483915 280483925 280483937 280483948 280483958 > 2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file > "pg_subtrans/10B7" doesn't exist, reading as zeroes > 2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: > xlog redo xid assignment xtop 280482270: subxacts: 280485056 280485070 > 280485083 280485086 280485098 280485113 280485132 280485144 280485156 > 280485167 280485178 280485188 280485201 280485217 280485234 280485249 > 280485267 280485293 280485309 280485327 280485333 280485345 280485353 > 280485373 280485388 280485405 280485420 280485434 280485457 280485476 > 280485482 280485507 280485516 280485531 280485537 280485550 280485565 > 280485568 280485585 280485587 280485601 280485613 280485634 280485639 > 280485656 280485669 280485684 280485690 280485693 280485712 280485730 > 280485754 280485757 280485779 280485801 280485808 280485811 280485830 > 280485856 280485880 280485900 280485920 280485941 280485946 > > [ skipped several more messages of this kind] > > 2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file > "pg_subtrans/10B8" doesn't exist, reading as zeroes > 2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: > xlog redo xid assignment xtop 280549936: subxacts: 280555514 280555515 > 280555516 280555517 280555518 280555519 280555520 280555521 280555522 > 280555523 280555524 280555525 280555526 280555527 280555528 280555529 > 280555530 280555531 280555532 280555533 280555534 280555535 280555536 > 280555537 280555538 280555539 280555540 280555541 280555542 280555543 > 280555544 280555545 280555546 280555547 280555548 280555549 280555550 > 280555551 280555552 280555553 280555554 280555555 280555556 280555557 > 280555558 280555559 280555560 280555561 280555562 280555563 280555564 > 280555565 280555566 280555567 280555568 280555569 280555570 280555571 > 280555572 280555573 280555574 280555575 280555576 280555577 > 2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] FATAL: > could not access status of transaction 280557568 > 2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] DETAIL: > Could not read from file "pg_subtrans/10B8" at offset 253952: Success. > 2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT: > xlog redo xid assignment xtop 280555981: subxacts: 280557520 280557521 > 280557522 280557523 280557524 280557525 280557526 280557527 280557528 > 280557529 280557530 280557531 280557532 280557533 280557534 280557535 > 280557536 280557537 280557538 280557539 280557540 280557541 280557542 > 280557543 280557544 280557545 280557546 280557547 280557548 280557549 > 280557550 280557551 280557552 280557553 280557554 280557555 280557556 > 280557557 280557558 280557559 280557560 280557561 280557562 280557563 > 280557564 280557565 280557566 280557567 280557568 280557569 280557570 > 280557571 280557572 280557573 280557574 280557575 280557576 280557577 > 280557578 280557579 280557580 280557581 280557582 280557583 > 2013-03-27 10:43:03 MSK 1532 @ from [vxid: txid:0] [] LOG: startup > process (PID 1535) exited with exit code 1 > 2013-03-27 10:43:03 MSK 1532 @ from [vxid: txid:0] [] LOG: > terminating any other active server processes > > Thank you in advance. > > -- > Kind regards, > Sergey Konoplev > Database and Software Consultant > > Profile: http://www.linkedin.com/in/grayhemp > Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979 > Skype: gray-hemp > Jabber: gray...@gmail.com > > > -- > Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-bugs >