(Sorry if duplicates show up this is the third time ive posted this in the past 10 hours, Im assuming it got dropped because of the attachments)
Problem: Apparently random segfaults apparently query agnostic, seem to be more frequent when a pg_dump is running The most frequent query it segfaults with is: select w.worker_id, w.worker_id as printerid, w.worker, w.alias, coalesce(w.alias, w.worker) as name, w.active, w.last_active, w.last_deactive, round(extract(epoch from now()) - extract(epoch from w.last_deactive)) as time_off from workers as w left join worker_vis as wv on wv.worker_id = w.worker_id and wv.defunct = 0 and ( ((wv.auth_id = ?) and (wv.auth_class = data_class('user_id'))) or ((wv.auth_id = ?) and (wv.auth_class = data_class('clinic_id')))) where wv.worker_vis_id is not null and w.defunct = 0 order by coalesce(w.alias, w.worker); (sample arguments, they do not seem to make a difference) 167214, 13 167340, 16 But have seen other simpler queries such as (sorry no backtraces for these... yet) insert into audit_transaction (action, open_user_id, page_load_id, user_id) values (?, ?, ?, ?); ARGS=suds, 509057, 15148365, 167217 If I select * from workers; and select * from workers_active; everything looks good, no segfaults. I can also take a pg_dump if virtually nothing is using the database. Otherwise pg_dump dies randomly with an errors like: Dumping the contents of table "clients_audit" failed: PQgetCopyData() failed. Dumping the contents of table "file_data" failed: PQgetCopyData() failed. Dumping the contents of table "workers_audit" failed: PQgetCopyData() failed. (note these are probably 3 of the larger tables in the database, maybe it has something to do with statement_timeout, i do have it set quite low (3 min) ?) If I manually prepare and execute the above queries they work fine. prepare worker (bigint, bigint) as select w.worker_id, w.worker_id as printerid, w.worker, w.alias, coalesce(w.alias, w.worker) as name, w.active, w.last_active, w.last_deactive, round(extract(epoch from now()) - extract(epoch from w.last_deactive)) as time_off from workers as w left join worker_vis as wv on wv.worker_id = w.worker_id and wv.defunct = 0 and ( ((wv.auth_id = ?) and (wv.auth_class = data_class('user_id'))) or ((wv.auth_id = ?) and (wv.auth_class = data_class('clinic_id')))) where wv.worker_vis_id is not null and w.defunct = 0 order by coalesce(w.alias, w.worker); execute worker (167214, 13); Core dumps and binaries available if needed. I have about 6 core dumps across 2 different servers. Which leads me to believe its either postgres bug or a corrupt database (the other server is a pitr slave for the master which i switched over to see if i could reproduce the segfault on the master) Nothing in dmesg, nothing in mcelog. Raid controller is a 3ware 9550SX-12 with 6 70GB WD Raptor hard drives in a raid 10, smart tests pass fine... everything looks good. Work load is a web application where each page beings a transaction; creates a temp table, does a few selects, inserts and updates and the commits. Postgresql 8.2.5 was working without any problems just prior to the upgrade. \d workers Table "public.workers" Column | Type | Modifiers ---------------+--------------------------+----------------------------------------------------------------- worker_id | bigint | not null default nextval('workers_worker_id_seq'::regclass) date_created | timestamp with time zone | not null default now() clientid | bigint | not null worker | text | not null alias | text | job_type | smallint | not null active | smallint | not null default 0 last_active | timestamp with time zone | last_deactive | timestamp with time zone | defunct | smallint | not null default 0 audit_class | integer | audit_date | timestamp with time zone | not null default now() audit_desc | text | audit_id | bigint | audit_seq | integer | not null default nextval('audit_basic_audit_seq_seq'::regclass) audit_table | character varying(64) | audit_tid | bigint | audit_type | character varying(32) | audit_orig_id | integer | path | text | Indexes: "workers_pkey" PRIMARY KEY, btree (worker_id) CLUSTER Triggers: workers_audit_aud BEFORE INSERT OR DELETE OR UPDATE ON workers FOR EACH ROW EXECUTE PROCEDURE audit_table_go() \d worker_vis Table "public.worker_vis" Column | Type | Modifiers ---------------+--------------------------+----------------------------------------------------------------- worker_vis_id | bigint | not null default nextval('worker_vis_ruleid_seq'::regclass) date_created | timestamp with time zone | not null default now() worker_id | bigint | auth_id | bigint | auth_class | smallint | defunct | smallint | not null default 0 audit_class | integer | audit_date | timestamp with time zone | not null default now() audit_desc | text | audit_id | bigint | audit_seq | integer | not null default nextval('audit_basic_audit_seq_seq'::regclass) audit_table | character varying(64) | audit_tid | bigint | audit_type | character varying(32) | audit_orig_id | integer | Indexes: "worker_vis_pkey" PRIMARY KEY, btree (worker_vis_id) "worker_vis_vis_idx" btree (worker_id, defunct, auth_id, auth_class) CLUSTER Triggers: worker_vis_audit_aud BEFORE INSERT OR DELETE OR UPDATE ON worker_vis FOR EACH ROW EXECUTE PROCEDURE audit_table_go() Version Information: Following REL8.3_STABLE 2 versions tried both crash with a similar backtrace (both died in CopySnapshot): (currently running) Author: teodor <teodor> Date: Sun Mar 9 10:42:48 2008 +0000 Revert changes of CompareTSQ: it affects existing btree indexes. (was running but also crashed) Author: tgl <tgl> Date: Sun Mar 2 00:10:28 2008 +0000 Venezuela Time now means UTC-4:30, not UTC-4:00. Adjust our treatment of "VET" accordingly. Per bug #3997 from Aaron Mizrachi. Memory: 8G ECC Swap: 16GB (unused even at segfault time aka it does not seem to be running out of memmory) OS: Linux 2.6.24.3 select version(); PostgreSQL 8.3.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21) built with: --disable-nls --with-openssl --without-python --without-tcl --without-krb5 --with-perl --enable-debug --enable-integer-datetimes --with-ossp-uuid postgresql.conf: listen_addresses = '*' port = 5432 max_connections = 500 superuser_reserved_connections = 3 ssl = on shared_buffers = 1000MB work_mem = 32MB maintenance_work_mem = 256MB max_stack_depth = 7MB max_fsm_pages = 304800 max_fsm_relations = 3000 vacuum_cost_delay = 30 bgwriter_lru_maxpages = 500 wal_sync_method = fdatasync wal_buffers = 256kB checkpoint_segments = 8 checkpoint_timeout = 10min archive_command = 'rsync -az -e "ssh" %p [EMAIL PROTECTED]:standby/cur/wal/%f </dev/null' random_page_cost = 2.6 effective_cache_size = 6GB default_statistics_target = 101 constraint_exclusion = on log_destination = 'syslog' client_min_messages = warning log_min_messages = warning silent_mode = on log_line_prefix = '[%m] %h:%d ' log_lock_waits = on log_temp_files = 20MB autovacuum_naptime = 3min autovacuum_vacuum_threshold = 200 autovacuum_analyze_threshold = 200 statement_timeout = '3min' datestyle = 'iso, mdy' lc_messages = 'C' lc_monetary = 'C' lc_numeric = 'C' lc_time = 'C' backslash_quote = off escape_string_warning = on sql_inheritance = off standard_conforming_strings = on cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 15 model : 5 model name : AMD Opteron(tm) Processor 246 stepping : 10 cpu MHz : 2000.000 cache size : 1024 KB fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow rep_good bogomips : 3991.79 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp processor : 1 vendor_id : AuthenticAMD cpu family : 15 model : 5 model name : AMD Opteron(tm) Processor 246 stepping : 10 cpu MHz : 2000.000 cache size : 1024 KB fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow rep_good bogomips : 3991.79 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp Core was generated by `postgres: docsysweb CaduRx 192.168.134.1(49502) BIND '. Program terminated with signal 11, Segmentation fault. #0 0x00002b6ef7457f74 in memcpy () from /lib/libc.so.6 (gdb) bt #0 0x00002b6ef7457f74 in memcpy () from /lib/libc.so.6 #1 0x000000000065fc88 in CopySnapshot (snapshot=0xa4e6e0) at tqual.c:1311 #2 0x000000000051cf34 in fmgr_sql (fcinfo=0x7fffb3f51160) at functions.c:299 #3 0x00000000005164cd in ExecMakeFunctionResult (fcache=0x10b4710, econtext=0x10b4ba0, isNull=0x7fffb3f515af "", isDone=0x0) at execQual.c:1351 #4 0x00000000005191ca in ExecEvalExprSwitchContext (expression=0x2b6f2c5dc078, econtext=0x10, isNull=0x24c8500 <Address 0x24c8500 out of bounds>, isDone=0x24c84c0) at execQual.c:3726 #5 0x00000000005690bd in evaluate_expr (expr=<value optimized out>, result_type=23, result_typmod=-1) at clauses.c:3273 #6 0x0000000000569d21 in simplify_function (funcid=2214548, result_type=23, result_typmod=-1, args=0xf8db40, allow_inline=1 '\001', context=0x7fffb3f51700) at clauses.c:2887 #7 0x000000000056a53a in eval_const_expressions_mutator (node=0xf24d90, context=0x7fffb3f51700) at clauses.c:1795 #8 0x000000000056b185 in estimate_expression_value (root=<value optimized out>, node=0x2b6f2c5dc078) at clauses.c:1719 #9 0x00000000005f8b86 in get_restriction_variable (root=0xf22d80, args=<value optimized out>, varRelid=0, vardata=0x7fffb3f51810, other=0x7fffb3f51848, varonleft=0x7fffb3f5185f "\001\236") at selfuncs.c:3570 #10 0x00000000005fa562 in eqsel (fcinfo=<value optimized out>) at selfuncs.c:169 #11 0x000000000064540c in OidFunctionCall4 (functionId=<value optimized out>, arg1=15871360, arg2=532, arg3=15879312, arg4=0) at fmgr.c:1615 #12 0x000000000056d0c2 in restriction_selectivity (root=0xf22d80, operator=532, args=0xf24c90, varRelid=0) at plancat.c:805 #13 0x000000000054e797 in clause_selectivity (root=0xf22d80, clause=0xf24de0, varRelid=0, jointype=JOIN_INNER) at clausesel.c:639 #14 0x000000000054e0df in clauselist_selectivity (root=0xf22d80, clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at clausesel.c:123 #15 0x000000000054e448 in clause_selectivity (root=0xf22d80, clause=0xf8bb20, varRelid=0, jointype=JOIN_INNER) at clausesel.c:576 #16 0x000000000054e72a in clause_selectivity (root=0xf22d80, clause=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at clausesel.c:597 #17 0x000000000054e0df in clauselist_selectivity (root=0xf22d80, clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at clausesel.c:123 #18 0x000000000054f99f in set_baserel_size_estimates (root=0x2b6f2c5dc078, rel=0xf26748) at costsize.c:2262 #19 0x000000000054d67c in set_rel_pathlist (root=0xf22d80, rel=0xf26748, rti=2, rte=0xec7c30) at allpaths.c:215 #20 0x000000000054df02 in make_one_rel (root=0xf22d80, joinlist=0xf26668) at allpaths.c:150 #21 0x000000000055f710 in query_planner (root=0xf22d80, tlist=<value optimized out>, tuple_fraction=0, limit_tuples=-1, cheapest_path=0x7fffb3f52220, sorted_path=0x7fffb3f52218, num_groups=0x7fffb3f52228) at planmain.c:249 #22 0x0000000000560041 in grouping_planner (root=0xf22d80, tuple_fraction=<value optimized out>) at planner.c:897 #23 0x00000000005610b8 in subquery_planner (glob=0xee11a0, parse=0xee1230, level=0, tuple_fraction=0, subroot=0x7fffb3f523e8) at planner.c:431 #24 0x00000000005614b1 in standard_planner (parse=0xee1230, cursorOptions=0, boundParams=0x0) at planner.c:158 #25 0x00000000005a1c61 in pg_plan_query (querytree=0xee1230, cursorOptions=0, boundParams=0x0) at postgres.c:681 #26 0x00000000005a1d13 in pg_plan_queries (querytrees=<value optimized out>, cursorOptions=0, boundParams=0x0, needSnapshot=0 '\0') at postgres.c:752 #27 0x0000000000633f93 in do_planning (querytrees=0xf22d50, cursorOptions=0) at plancache.c:560 #28 0x000000000063436b in RevalidateCachedPlan (plansource=0xcd02f0, useResOwner=0 '\0') at plancache.c:484 #29 0x00000000005a3b05 in PostgresMain (argc=4, argv=<value optimized out>, username=0x8f3020 "docsysweb") at postgres.c:1605 #30 0x0000000000579d8f in ServerLoop () at postmaster.c:3207 #31 0x000000000057a85c in PostmasterMain (argc=3, argv=0x8ee2d0) at postmaster.c:1029 #32 0x00000000005345ce in main (argc=3, argv=<value optimized out>) at main.c:188 gdb) frame 28 #28 0x000000000063436b in RevalidateCachedPlan (plansource=0xcd02f0, useResOwner=0 '\0') at plancache.c:484 484 slist = do_planning(slist, plansource->cursor_options); (gdb) print plansource->query_string $1 = 0xf08ea0 "select w.worker_id, w.worker_id as printerid, w.worker, w.alias, coalesce(w.alias, w.worker) as name, w.active, w.last_active, w.last_deactive, round(extract(epoch from now()) - extract(epoch from w.l"... -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs