Re: [BUGS] PosgreSQL is crashing with a signal 11 - Bug?

Rafael Martinez Guerrero Mon, 13 Sep 2004 04:33:42 -0700

On Fri, 2004-09-10 at 16:24, Tom Lane wrote:
> Kjetil Torgrim Homme <[EMAIL PROTECTED]> writes:
> > how can att[i]->attlen possibly change in the interim?  but
> > data_length looks corrupted, too.
> 
> Unless you compiled with no optimization at all (-O0), the compiler
> would likely fold the identical memcpy() calls in the different
> if-branches together.  So I wouldn't put too much stock in the
> reported line number.
> 
> It does seem striking that a 0x2f got dumped into the high byte of the
> length word in both cases.  Have you checked to see what the
> page-on-disk looks like?  I'd be interested to know if the offset of the
> damaged byte within the page is again 0x0fff.
>


Hi Tom,

Kjetil will answer you about this. 
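
For reference while the on-disk page is being checked, the in-page offset Tom asks about is just the file offset of the damaged byte modulo the block size. A minimal sketch in Python, assuming the default 8192-byte block size (BLCKSZ); the function name and the sample file offset are made up for illustration and are not taken from our files:

```python
BLCKSZ = 8192  # assumption: server built with the default PostgreSQL block size


def locate_in_page(file_offset, block_size=BLCKSZ):
    """Return (block number, offset inside that block) for a byte offset
    within a relation file."""
    return divmod(file_offset, block_size)


# Hypothetical file offset, chosen only to show the arithmetic.
block, in_page = locate_in_page(5 * BLCKSZ + 0x0FFF)
print(f"block {block}, in-page offset 0x{in_page:04x}")
```

An in-page offset of 0x0fff is the last byte of the first 4096-byte half of an 8192-byte block.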

In the meantime we got new core dumps while taking a backup of the same
database.

Some more info I got from the department in charge of this database:
-----------------------------------------------------------
We make a backup of our production server every 15 minutes.  Recently,
we've seen behaviour like this:

  [12/09/2004-05:46:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
  [12/09/2004-05:48:03] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no
  [12/09/2004-06:01:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
  pg_dump: ERROR:  MemoryContextAlloc: invalid request size 1577058307
  pg_dump: lost synchronization with server, resetting connection
  pg_dump: SQL command to dump the contents of table "paid_quota_history" failed: PQendcopy() failed.
  pg_dump: Error message from server: pg_dump: The command was: COPY public.paid_quota_history (job_id, transaction_type, person_id, tstamp, update_by, update_program, pageunits_free, pageunits_paid, pageunits_total) TO stdout;
  pg_dumpall: pg_dump failed on cerebrum_prod, exiting
  [12/09/2004-06:02:16] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no

Every subsequent backup fails with the same message, and then
suddenly one succeeds:

  [12/09/2004-08:46:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
  [12/09/2004-08:48:34] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no

To me this looks like a cache somewhere held incorrect data when it
was read.  That cache was somehow flushed two hours later, and fresh
data was then read from disk.

Could this be a postgres problem, or is it hardware/kernel related?
Upgrading from 7.3.5 to 7.3.7 and then to 7.4.5 did not help.  We have
now moved the database between 3 different Dell2650 servers, and
replaced the memory chips on one system once.  Lately one or more
postgres processes have received signal 11 at least once a day.  The
problems started about a week ago, after about 9 months of stable
production.

The backup failures above were accompanied by 4 core dumps.  A
backtrace follows:

  #0  0xb734d07c in memcpy () from /lib/tls/libc.so.6
  #1  0x08174880 in set_var_from_num (num=0xb7021d24, dest=0x87b432fe) at numeric.c:2673
  #2  0x08171927 in numeric_out (fcinfo=0xbfffc2d0) at numeric.c:373
  #3  0x081aa81d in FunctionCall3 (flinfo=0x82cc4e8, arg1=3221209808, arg2=3221209808, arg3=3221209808) at fmgr.c:1016
  #4  0x080c78fb in CopyTo (rel=0xb6800bd0, attnumlist=0x82cb4a0, binary=0 '\0', oids=0 '\0', delim=0x82232a8 "\t", null_print=0x81fc95d "\\N") at copy.c:1096
  #5  0x080c7021 in DoCopy (stmt=0x2f000004) at copy.c:920
  #6  0x081507c5 in PortalRunUtility (portal=0x82bdfd8, query=0x82ba220, dest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:772
  #7  0x08150a3e in PortalRunMulti (portal=0x82bdfd8, dest=0x82ba1d8, altdest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:836
  #8  0x0815033c in PortalRun (portal=0x82bdfd8, count=2147483647, dest=0x82ba1d8, altdest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:494
  #9  0x0814d5f8 in exec_simple_query (query_string=0x82b9bc0 "COPY public.change_log (tstamp, change_id, subject_entity, change_type_id, dest_entity, change_params, change_by, change_program, description) TO stdout;") at postgres.c:873
  #10 0x0814f660 in PostgresMain (argc=4, argv=0x82701b8, username=0x8270188 "postgres") at postgres.c:2868
  #11 0x0812f5ab in BackendFork (port=0x827d0a0) at postmaster.c:2564
  #12 0x0812f09e in BackendStartup (port=0x827d0a0) at postmaster.c:2207
  #13 0x0812d95f in ServerLoop () at postmaster.c:1119
  #14 0x0812d305 in PostmasterMain (argc=3, argv=0x826e1c0) at postmaster.c:897
  #15 0x08104f10 in main (argc=3, argv=0xbfffd6c4) at main.c:214

We are currently in the process of moving the production server to an
IBM box, which should eliminate any Dell2650 specific causes.
-----------------------------------------------------------
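
As a side note to the report above, the numbers in it are easier to compare with Tom's remark about the high byte of the length word when written in hex. A small Python sketch (the helper name is hypothetical) that splits a 32-bit value into its high byte and its low 24 bits, applied to the invalid request size from the pg_dump error and to the stmt argument shown for DoCopy in the backtrace:

```python
def split_length_word(value):
    """Split a 32-bit word into (high byte, low 24 bits)."""
    word = value & 0xFFFFFFFF
    return word >> 24, word & 0x00FFFFFF


request_size = 1577058307  # "MemoryContextAlloc: invalid request size"
stmt_pointer = 0x2F000004  # DoCopy stmt argument as printed by gdb

for name, value in (("request size", request_size), ("stmt", stmt_pointer)):
    high, low = split_length_word(value)
    print(f"{name}: 0x{value:08x} high=0x{high:02x} low={low}")
```

The request size comes out as 0x5e000003, a small value (3) with 0x5e in the high byte, and the stmt argument has 0x2f in the high byte. This only restates the reported values; whether the two are related is not established, and gdb arguments in an optimized build may not be reliable.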

-- 
 Rafael Martinez, <[EMAIL PROTECTED]>
 Center for Information Technology Services
 University of Oslo, Norway
