On Fri, 2004-09-10 at 16:24, Tom Lane wrote:
> Kjetil Torgrim Homme <[EMAIL PROTECTED]> writes:
> > how can att[i]->attlen possibly change in the interim?  but
> > data_length looks corrupted, too.
>
> Unless you compiled with no optimization at all (-O0), the compiler
> would likely fold the identical memcpy() calls in the different
> if-branches together.  So I wouldn't put too much stock in the
> reported line number.
>
> It does seem striking that a 0x2f got dumped into the high byte of the
> length word in both cases.  Have you checked to see what the
> page-on-disk looks like?  I'd be interested to know if the offset of the
> damaged byte within the page is again 0x0fff.
>

Hi Tom,

Kjetil will answer you about this. In the meantime, we got new core
dumps when taking a backup of the same database.

Some more info I got from the department in charge of this database:

-----------------------------------------------------------

We make a backup of our production server every 15 minutes. Recently,
we've seen behaviour like this:

[12/09/2004-05:46:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
[12/09/2004-05:48:03] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no
[12/09/2004-06:01:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
pg_dump: ERROR: MemoryContextAlloc: invalid request size 1577058307
pg_dump: lost synchronization with server, resetting connection
pg_dump: SQL command to dump the contents of table "paid_quota_history" failed: PQendcopy() failed.
pg_dump: Error message from server:
pg_dump: The command was: COPY public.paid_quota_history (job_id, transaction_type, person_id, tstamp, update_by, update_program, pageunits_free, pageunits_paid, pageunits_total) TO stdout;
pg_dumpall: pg_dump failed on cerebrum_prod, exiting
[12/09/2004-06:02:16] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no

Every subsequent backup fails with the same message, and then
suddenly:

[12/09/2004-08:46:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
[12/09/2004-08:48:34] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no

To me this looks like a cache somewhere that, upon read, contained some
incorrect data. This cache was somehow flushed two hours later, and
fresh data was read from disk.

Could this be a postgres problem, or is it hardware/kernel related?
Upgrading from 7.3.5 to 7.3.7 and then to 7.4.5 did not help. We have
now moved the database between 3 different Dell2650 servers, and
replaced the memory chips on one system once. Lately, one or more
postgres processes have received signal 11 at least once a day.

The problems started about a week ago, after stable production for
about 9 months.

The backup failures above were accompanied by 4 core dumps. Backtrace
follows:

#0  0xb734d07c in memcpy () from /lib/tls/libc.so.6
#1  0x08174880 in set_var_from_num (num=0xb7021d24, dest=0x87b432fe) at numeric.c:2673
#2  0x08171927 in numeric_out (fcinfo=0xbfffc2d0) at numeric.c:373
#3  0x081aa81d in FunctionCall3 (flinfo=0x82cc4e8, arg1=3221209808, arg2=3221209808, arg3=3221209808) at fmgr.c:1016
#4  0x080c78fb in CopyTo (rel=0xb6800bd0, attnumlist=0x82cb4a0, binary=0 '\0', oids=0 '\0', delim=0x82232a8 "\t", null_print=0x81fc95d "\\N") at copy.c:1096
#5  0x080c7021 in DoCopy (stmt=0x2f000004) at copy.c:920
#6  0x081507c5 in PortalRunUtility (portal=0x82bdfd8, query=0x82ba220, dest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:772
#7  0x08150a3e in PortalRunMulti (portal=0x82bdfd8, dest=0x82ba1d8, altdest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:836
#8  0x0815033c in PortalRun (portal=0x82bdfd8, count=2147483647, dest=0x82ba1d8, altdest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:494
#9  0x0814d5f8 in exec_simple_query (
    query_string=0x82b9bc0 "COPY public.change_log (tstamp, change_id, subject_entity, change_type_id, dest_entity, change_params, change_by, change_program, description) TO stdout;") at postgres.c:873
#10 0x0814f660 in PostgresMain (argc=4, argv=0x82701b8, username=0x8270188 "postgres") at postgres.c:2868
#11 0x0812f5ab in BackendFork (port=0x827d0a0) at postmaster.c:2564
#12 0x0812f09e in BackendStartup (port=0x827d0a0) at postmaster.c:2207
#13 0x0812d95f in ServerLoop () at postmaster.c:1119
#14 0x0812d305 in PostmasterMain (argc=3, argv=0x826e1c0) at postmaster.c:897
#15 0x08104f10 in main (argc=3, argv=0xbfffd6c4) at main.c:214

We are currently in the process of moving the production server to an
IBM box, which should eliminate any Dell2650-specific causes.
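
Since the quoted question is about a stray byte landing in the high
byte of a 32-bit word, it may help to look at the two suspicious values
from the report in hex. The following is only a quick sketch for
inspecting them; the helper names high_byte and low_24_bits are made up
for illustration, and the two constants are copied from the pg_dump
error message and from frame #5 of the backtrace. It says nothing about
where the corruption comes from.

```python
def high_byte(value: int) -> int:
    """Return the most significant byte of a 32-bit word."""
    return (value >> 24) & 0xFF


def low_24_bits(value: int) -> int:
    """Return the lower three bytes of a 32-bit word."""
    return value & 0x00FFFFFF


# Values copied from the report; the labels are ours.
suspects = {
    "invalid request size (pg_dump error)": 1577058307,
    "stmt argument of DoCopy (frame #5)": 0x2F000004,
}

for label, value in suspects.items():
    print(
        f"{label}: 0x{value:08x}, "
        f"high byte 0x{high_byte(value):02x}, "
        f"low 24 bits 0x{low_24_bits(value):06x}"
    )
```

Both values turn out to be a small number (3 and 4) with a non-zero
high byte: 0x5e for the request size and 0x2f for the stmt argument.
Whether the stmt value reflects real damage or just how gdb
reconstructed the frame in an optimized build cannot be told from the
backtrace alone.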

-----------------------------------------------------------

-- 
Rafael Martinez, <[EMAIL PROTECTED]>
Center for Information Technology Services
University of Oslo, Norway