On Sun, Oct 15, 2017 at 02:44:58PM +0200, Tomas Vondra wrote:
> Thanks, but I'm not sure that'll help, at this point. We already know
> what happened (corrupted memory), we don't know "how". And core files
> are mostly just "snapshots" so are not very useful in answering that :-(

Is there anything I should be saving for these or hints how else to debug?  I
just got a segfault while looking for logs from an segfault from 2hr
earlier...but it was a duplicate from the first stacktrace...

< 2017-10-16 23:21:23.172 -04  >LOG:  server process (PID 31543) was terminated 
by signal 11: Segmentation fault
< 2017-10-16 23:21:23.172 -04  >DETAIL:  Failed process was running: 
autovacuum: BRIN summarize public.gtt 162175

Oct 16 23:21:22 database kernel: postmaster[31543] general protection ip:4bd467 
sp:7ffe08a94890 error:0 in postgres[400000+692000]
...
Oct 17 01:58:36 database kernel: postmaster[8646]: segfault at 8 ip 
000000000084a177 sp 00007ffe08a94a88 error 4 in postgres[400000+692000]
Oct 17 01:58:38 database abrt[9192]: Saved core dump of pid 8646 
(/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-17-01:58:38-8646 
(7692288 bytes)
Oct 17 01:58:38 database abrtd: Directory 'ccpp-2017-10-17-01:58:38-8646' 
creation detected

Oct 17 01:58:38 database abrt[9192]: Saved core dump of pid 8646 
(/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-17-01:58:38-8646 
(7692288 bytes)
Oct 17 01:58:38 database abrtd: Directory 'ccpp-2017-10-17-01:58:38-8646' 
creation detected
Oct 17 01:59:18 database abrtd: Generating core_backtrace
Oct 17 01:59:18 database abrtd: Duplicate: core backtrace
Oct 17 01:59:18 database abrtd: DUP_OF_DIR: 
/var/spool/abrt/ccpp-2017-10-14-18:05:35-26500
Oct 17 01:59:18 database abrtd: Deleting problem directory 
ccpp-2017-10-17-01:58:38-8646 (dup of ccpp-2017-10-14-18:05:35-26500)

postgres=# SELECT * FROM postgres_log_2017_10_16_2300 WHERE pid=31543;
-[ RECORD 1 
]----------+-------------------------------------------------------------------------------------------------
log_time               | 2017-10-16 23:21:22.021-04
pid                    | 31543
session_id             | 59e57677.7b37
session_line           | 2
session_start_time     | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54376
transaction_id         | 0
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | automatic analyze of table 
"gtt.public.cdrs_eric_egsnpdprecord_2017_10_15"
-[ RECORD 2 
]----------+-------------------------------------------------------------------------------------------------
log_time               | 2017-10-16 23:21:22.021-04
pid                    | 31543
session_id             | 59e57677.7b37
session_line           | 3
session_start_time     | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54377
transaction_id         | 0
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | processing work entry for relation 
"gtt.public.cdrs_eric_msc_voice_2017_10_16_startofcharge_idx"
-[ RECORD 3 
]----------+-------------------------------------------------------------------------------------------------
log_time               | 2017-10-16 23:21:21.003-04
pid                    | 31543
session_id             | 59e57677.7b37
session_line           | 1
session_start_time     | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54374
transaction_id         | 0
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | automatic analyze of table 
"gtt.public.cdrs_eric_ggsnpdprecord_2017_10_15"

Note, the most recent LOCK TABLE I could find beofre 23:21:22 was actually
~10min earlier at 2017-10-16 23:12:16.519-04.

Core was generated by `postgres: autovacuum worker process   gtt             '.
Program terminated with signal 11, Segmentation fault.
#0  index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
178             LockRelId       relid = relation->rd_lockInfo.lockRelId;
Missing separate debuginfos, use: debuginfo-install 
audit-libs-2.4.5-6.el6.x86_64 cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 
keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 
libcom_err-1.41.12-23.el6.x86_64 libgcc-4.4.7-18.el6.x86_64 
libicu-4.2.1-14.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 
libstdc++-4.4.7-18.el6.x86_64 nspr-4.13.1-1.el6.x86_64 
nss-3.28.4-1.el6_9.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 
nss-util-3.28.4-1.el6_9.x86_64 openldap-2.4.40-16.el6.x86_64 
openssl-1.0.1e-57.el6.x86_64 pam-1.1.1-24.el6.x86_64 zlib-1.2.3-29.el6.x86_64

(gdb) p relation
$1 = (struct RelationData *) 0x324647603246466
(gdb) p relation->rd_lockInfo
Cannot access memory at address 0x3246476032464aa

(gdb) bt
#0  index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
#1  0x00000000004bc3c2 in systable_endscan (sysscan=0x145b3c8) at genam.c:504
#2  0x000000000080f266 in SearchCatCache (cache=0x1427168, v1=<value optimized 
out>, v2=<value optimized out>, v3=<value optimized out>, v4=<value optimized 
out>) at catcache.c:1190
#3  0x000000000081cd7b in get_rel_name (relid=<value optimized out>) at 
lsyscache.c:1730
#4  0x00000000006a5299 in perform_work_item (workitem=0x7fd7c7f94860) at 
autovacuum.c:2609
#5  0x00000000006a6fd9 in do_autovacuum () at autovacuum.c:2533
#6  0x00000000006a7dc8 in AutoVacWorkerMain (argc=<value optimized out>, 
argv=<value optimized out>) at autovacuum.c:1694
#7  0x00000000006a7e96 in StartAutoVacWorker () at autovacuum.c:1499
#8  0x00000000006b595a in StartAutovacuumWorker (postgres_signal_arg=<value 
optimized out>) at postmaster.c:5414
#9  sigusr1_handler (postgres_signal_arg=<value optimized out>) at 
postmaster.c:5111
#10 <signal handler called>
#11 0x0000003bd74e1603 in __select_nocancel () at 
../sysdeps/unix/syscall-template.S:82
#12 0x00000000006b6c96 in ServerLoop (argc=<value optimized out>, argv=<value 
optimized out>) at postmaster.c:1717
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at 
postmaster.c:1361
#14 0x0000000000638450 in main (argc=3, argv=0x13cd3a0) at main.c:228

#0  index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
        relid = {relId = 21345368, dbId = 0}
#1  0x00000000004bc3c2 in systable_endscan (sysscan=0x145b3c8) at genam.c:504
No locals.
#2  0x000000000080f266 in SearchCatCache (cache=0x1427168, v1=<value optimized 
out>, v2=<value optimized out>, v3=<value optimized out>, v4=<value optimized 
out>) at catcache.c:1190
        cur_skey = {{sk_flags = 0, sk_attno = 1, sk_strategy = 3, sk_subtype = 
0, sk_collation = 0, sk_func = {fn_addr = 0x7a8d50 <oideq>, fn_oid = 184, 
fn_nargs = 2, fn_strict = 1 '\001',
              fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra = 0x0, 
fn_mcxt = 0x13cf3d8, fn_expr = 0x0}, sk_argument = 618293329}, {sk_flags = 0, 
sk_attno = 0, sk_strategy = 0, sk_subtype = 0,
            sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, 
fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 
0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0},
          {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, 
sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 
0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000',
              fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}, 
{sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0, 
sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0,
              fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', 
fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}}
        hashValue = 884019
        hashIndex = 51
        iter = {cur = <value optimized out>, end = 0x14276a8}
        bucket = 0x14276a8
        ct = <value optimized out>
        relation = 0x7fd6b8788e58
        scandesc = 0x145b3c8
        ntp = <value optimized out>
#3  0x000000000081cd7b in get_rel_name (relid=<value optimized out>) at 
lsyscache.c:1730
        tp = <value optimized out>
#4  0x00000000006a5299 in perform_work_item (workitem=0x7fd7c7f94860) at 
autovacuum.c:2609
        cur_datname = 0x0
        cur_nspname = 0x0
        cur_relname = 0x0
        __func__ = "perform_work_item"
#5  0x00000000006a6fd9 in do_autovacuum () at autovacuum.c:2533
        workitem = 0x7fd7c7f94860
        classRel = 0x7fd6b8788e58
        tuple = <value optimized out>
        relScan = <value optimized out>
        dbForm = <value optimized out>
        table_oids = <value optimized out>
        orphan_oids = 0x0
        ctl = {num_partitions = 0, ssize = 0, dsize = 0, max_dsize = 0, ffactor 
= 0, keysize = 4, entrysize = 80, hash = 0, match = 0, keycopy = 0, alloc = 0, 
hcxt = 0x0, hctl = 0x0}
        table_toast_map = 0x14d5228
        cell = 0x0
        shared = 0x1499eb8
        dbentry = 0x149a140
        bstrategy = 0x156ef38
        key = {sk_flags = 0, sk_attno = 16, sk_strategy = 3, sk_subtype = 0, 
sk_collation = 100, sk_func = {fn_addr = 0x750430 <chareq>, fn_oid = 61, 
fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000',
            fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x1400838, fn_expr = 
0x0}, sk_argument = 116}
        pg_class_desc = 0x14c3748
        effective_multixact_freeze_max_age = 400000000
        did_vacuum = <value optimized out>
        found_concurrent_worker = <value optimized out>
        i = <value optimized out>
        __func__ = "do_autovacuum"
#6  0x00000000006a7dc8 in AutoVacWorkerMain (argc=<value optimized out>, 
argv=<value optimized out>) at autovacuum.c:1694
        dbname = 
"gtt\000\000\000\000\000\325I\001\327;\000\000\000\067{\000\000\000\000\000\000E\352\200\327;\000\000\000\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\374\024\316",
 '\000' <repeats 12 times>
        local_sigjmp_buf = {{__jmpbuf = {0, -6625136415178345871, 20928464, 
1508210280, 31539, 1508209622, 6625596434918027889, -6625136601089035663}, 
__mask_was_saved = 1, __saved_mask = {__val = {
                18446744066192964103, 13184024, 1, 13182664, 0, 31539, 
1508209622, 0, 257010229664, 1, 0, 140729043734800, 257014425560, 4, 0, 4}}}}
        dbid = 16400
        __func__ = "AutoVacWorkerMain"
#7  0x00000000006a7e96 in StartAutoVacWorker () at autovacuum.c:1499
        worker_pid = 0
        __func__ = "StartAutoVacWorker"
#8  0x00000000006b595a in StartAutovacuumWorker (postgres_signal_arg=<value 
optimized out>) at postmaster.c:5414
        bn = 0x13f57d0
#9  sigusr1_handler (postgres_signal_arg=<value optimized out>) at 
postmaster.c:5111
        save_errno = 4
        __func__ = "sigusr1_handler"
#10 <signal handler called>
No symbol table info available.
#11 0x0000003bd74e1603 in __select_nocancel () at 
../sysdeps/unix/syscall-template.S:82
---Type <return> to continue, or q <return> to quit---
No locals.
#12 0x00000000006b6c96 in ServerLoop (argc=<value optimized out>, argv=<value 
optimized out>) at postmaster.c:1717
        timeout = {tv_sec = 59, tv_usec = 981940}
        rmask = {fds_bits = {120, 0 <repeats 15 times>}}
        selres = <value optimized out>
        now = <value optimized out>
        readmask = {fds_bits = {120, 0 <repeats 15 times>}}
        nSockets = 7
        last_lockfile_recheck_time = 1508210280
        last_touch_time = 1508209622
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at 
postmaster.c:1361
        opt = <value optimized out>
        status = <value optimized out>
        userDoption = <value optimized out>
        listen_addr_saved = <value optimized out>
        i = <value optimized out>
        output_config_variable = <value optimized out>
        __func__ = "PostmasterMain"
#14 0x0000000000638450 in main (argc=3, argv=0x13cd3a0) at main.c:228


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to