Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-02-02 Thread Duncan Rance
I recently raised BUG #6425: Bus error in slot_deform_tuple. During the last 
reproduction of the problem I saw this:

Client 2 aborted in state 0: ERROR:  invalid memory alloc request size 
18446744073709551613

So like Tom said, these two issues could well be related. I just wanted to 
mention it here in this thread, FYI.

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs


Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-02-01 Thread Bridget Frey
So here's a better stack trace for the segfault issue.  Just to
summarize, since this is a long thread, we're seeing two issues: 1) alloc
errors that do not crash the DB (although we modified postgres to panic
when this happens in our test environment, and posted a stack earlier), and
2) a postgres segfault that happens once every couple of days on our slaves.
We're still not sure if these are the same issue or not.  This stack is not
perfect because it still has some things optimized out (this came from our
production database), but it's much more detailed than the last one we
posted for a segfault.  Hope this helps get closer to an answer on
this.  I'd also be interested in knowing whether the postgres experts think
these two symptoms are likely related or totally separate issues.

Thanks!
-B

#0  0x00455dc1 in slot_deform_tuple (slot=0x53cfc20, natts=70) at heaptuple.c:1090
1090        off = att_align_pointer(off, thisatt->attalign, -1,
(gdb) bt
#0  0x00455dc1 in slot_deform_tuple (slot=0x53cfc20, natts=70) at heaptuple.c:1090
#1  0x00455fbd in slot_getallattrs (slot=0x53cfc20) at heaptuple.c:1253
#2  0x00458ac7 in printtup (slot=0x53cfc20, self=0x534f1e0) at printtup.c:300
#3  0x0055bd69 in ExecutePlan (queryDesc=0x5515978, direction=<value optimized out>, count=0) at execMain.c:1464
#4  standard_ExecutorRun (queryDesc=0x5515978, direction=<value optimized out>, count=0) at execMain.c:313
#5  0x00623594 in PortalRunSelect (portal=0x5394f10, forward=<value optimized out>, count=0, dest=0x534f1e0) at pquery.c:943
#6  0x00624ae0 in PortalRun (portal=0x5394f10, count=9223372036854775807, isTopLevel=1 '\001', dest=0x534f1e0, altdest=0x534f1e0, completionTag=0x7fff014e0640 ) at pquery.c:787
#7  0x006220f2 in exec_execute_message (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:1963
#8  PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:3983
#9  0x005e6ba4 in ServerLoop () at postmaster.c:3601
#10 0x005e791c in PostmasterMain (argc=5, argv=0x524cab0) at postmaster.c:1116
#11 0x0058b9ae in main (argc=5, argv=<value optimized out>) at main.c:199


Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-02-01 Thread Robert Haas
On Tue, Jan 31, 2012 at 4:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 31, 2012 at 12:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 BTW, after a bit more reflection it occurs to me that it's not so much
 that the data is necessarily *bad*, as that it seemingly doesn't match
 the tuple descriptor that the backend's trying to interpret it with.

 Hmm.  Could this be caused by the recovery process failing to obtain a
 sufficiently strong lock on a buffer before replaying some WAL record?

 Well, I was kinda speculating that inadequate locking could result in
 use of a stale (or too-new?) tuple descriptor, and that would be as good
 a candidate as any if the basic theory were right.  But Bridget says
 they are not doing any DDL, so it's hard to see how there'd be any tuple
 descriptor mismatch at all.  Still baffled ...

No, I wasn't thinking about a tuple descriptor mismatch.  I was
imagining that the page contents themselves might be in flux while
we're trying to read from it.  Off the top of my head I don't see how
that can happen, but it would be awfully interesting to be able to see
which WAL record last touched the relevant heap page, and how long
before the error that happened.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-02-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 No, I wasn't thinking about a tuple descriptor mismatch.  I was
 imagining that the page contents themselves might be in flux while
 we're trying to read from it.

Oh, gotcha.  Yes, that's a horribly plausible idea.  All it'd take is
one WAL replay routine that hasn't been upgraded to acquire sufficient
buffer locks.  Pre-hot-standby, there was no reason for them to be
careful about locking.

On the other hand, if that were the cause, you'd expect the symptoms
to be a bit more variable...

regards, tom lane



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-02-01 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 No, I wasn't thinking about a tuple descriptor mismatch.  I was
 imagining that the page contents themselves might be in flux while
 we're trying to read from it.

 It would be nice to get a dump of what PostgreSQL thought the entire
 block looked like at the time the crash happened.  That information is
 presumably already in the core dump, but I'm not sure if there's a
 nice way to extract it using gdb.

It probably would be possible to get the page out of the dump, but
I'd be really surprised if that proved much.  By the time the
crash-dump-making code gets around to examining the shared memory, the
other process that's hypothetically changing the page will have done its
work and moved on.  A crash in process X doesn't freeze execution in
process Y, at least not in any Unixoid system I've ever heard of.

regards, tom lane



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-02-01 Thread Alvaro Herrera

Excerpts from Tom Lane's message of Wed Feb 01 18:06:27 -0300 2012:
 Robert Haas robertmh...@gmail.com writes:
  No, I wasn't thinking about a tuple descriptor mismatch. I was
  imagining that the page contents themselves might be in flux while
  we're trying to read from it.
 
  It would be nice to get a dump of what PostgreSQL thought the entire
  block looked like at the time the crash happened.  That information is
  presumably already in the core dump, but I'm not sure if there's a
  nice way to extract it using gdb.
 
 It probably would be possible to get the page out of the dump, but
 I'd be really surprised if that proved much.  By the time the
 crash-dump-making code gets around to examining the shared memory, the
 other process that's hypothetically changing the page will have done its
 work and moved on.  A crash in process X doesn't freeze execution in
 process Y, at least not in any Unixoid system I've ever heard of.

Maybe you can do something like send SIGSTOP to every other backend,
then attach to them and find which one was touching the same buffer,
then peek at what it was doing.
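
A minimal sketch of the "SIGSTOP every other backend" step, in Python.
The function names are hypothetical, and the list of backend PIDs is
assumed to come from somewhere else (it is not derived here).  The `kill`
parameter exists only so the logic can be exercised without touching
real processes.

```python
import os
import signal


def freeze_backends(pids, crashed_pid, kill=os.kill):
    """Send SIGSTOP to every listed backend except the crashed one.

    Returns the PIDs that were actually stopped, so they can be
    inspected with a debugger and resumed afterwards.
    """
    stopped = []
    for pid in pids:
        if pid == crashed_pid:
            continue  # nothing to freeze: this one is already dead or dumping core
        try:
            kill(pid, signal.SIGSTOP)
        except ProcessLookupError:
            continue  # backend exited before we could stop it
        stopped.append(pid)
    return stopped


def resume_backends(pids, kill=os.kill):
    """Send SIGCONT to the given PIDs; returns the PIDs that were still alive."""
    resumed = []
    for pid in pids:
        try:
            kill(pid, signal.SIGCONT)
        except ProcessLookupError:
            continue  # backend went away while stopped
        resumed.append(pid)
    return resumed
```

Each stopped PID could then be attached to with `gdb -p <pid>`.  There
is still a window between the crash and the SIGSTOP, so this narrows
the race rather than closing it.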

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-31 Thread Bridget Frey
We have no DDL whatsoever in the code.  We do update rows in the
logins table frequently, but we basically have a policy of only doing
DDL changes during scheduled upgrades when we bring the site down.  We
have been discussing this issue a lot and we really haven't come up
with anything that would be considered unusual here.  The tables
experiencing issues have maybe 1M to 200M rows, we do updates and
selects frequently, they have standard btree primary key indexes, and
the failing query always seems to be a select for a single row based
on a primary key lookup.  All of these code paths worked flawlessly
prior to our 9.1 upgrade (we had been using skytools replication).
And we see no problems on the master despite similar workloads there.
It is definitely puzzling, and we are not too sure what to look into
next...

Sent from my iPhone

On Jan 30, 2012, at 9:06 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 I wrote:
 Hm.  The stack trace is definitive that it's finding the bad data in a
 tuple that it's trying to print to the client, not in an index.

 BTW, after a bit more reflection it occurs to me that it's not so much
 that the data is necessarily *bad*, as that it seemingly doesn't match
 the tuple descriptor that the backend's trying to interpret it with.
 (In particular, the reported symptom would be consistent with finding
 a small integer constant at a place where the descriptor expects to find
 a variable-length field.)  So that opens up a different line of thought
 about how those could get out of sync on a standby.  Are you in the
 habit of issuing ALTER TABLE commands to add/delete/change columns on
 these tables?  In fact, is there any DDL whatsoever going on around the
 time these failures happen?
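
One hypothetical way such a mismatch could produce the reported number,
sketched in Python.  This is a simplified model, not PostgreSQL's code:
it assumes a little-endian build, a 4-byte length header whose upper 30
bits hold the total length, a trailing NUL byte added by the
text-to-C-string conversion, and a 64-bit unsigned allocation size.
The constant and function names are made up for illustration.

```python
VARHDRSZ = 4              # assumed size of the 4-byte length header
MAX_ALLOC = 0x3FFFFFFF    # assumed upper bound on a single allocation (1 GB - 1)


def alloc_request_for(header_word: int) -> int:
    """Size a text-to-C-string conversion would request for this header word."""
    total_len = (header_word >> 2) & 0x3FFFFFFF  # total length incl. header
    payload_len = total_len - VARHDRSZ           # goes negative if total_len < 4
    request = payload_len + 1                    # room for the trailing NUL
    return request % (1 << 64)                   # conversion to unsigned 64-bit


def is_valid_request(size: int) -> bool:
    """True if the request is within the assumed allocation limit."""
    return size <= MAX_ALLOC


# A well-formed 5-byte string: total length 9, stored shifted left by 2.
print(alloc_request_for(9 << 2))   # 6

# Four zero bytes where a length header was expected.
bad = alloc_request_for(0)
print(bad)                         # 18446744073709551613
print(is_valid_request(bad))       # False
```

Under these assumptions, a zeroed length word yields exactly the size in
the error message, which fits the text_to_cstring frame in the
backtrace posted elsewhere in this thread.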

regards, tom lane



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-31 Thread Alvaro Herrera

Excerpts from Bridget Frey's message of Mon Jan 30 18:59:08 -0300 2012:

 Anyway, here goes...

Maybe a bt full could give more insight into what's going on ...

 #0  0x003a83e30265 in raise () from /lib64/libc.so.6
 #1  0x003a83e31d10 in abort () from /lib64/libc.so.6
 #2  0x007cb84e in errfinish (dummy=0) at elog.c:523
 #3  0x007cd951 in elog_finish (elevel=22, fmt=0x95cdf0 "invalid memory alloc request size %lu") at elog.c:1202
 #4  0x007f115c in MemoryContextAlloc (context=0x17b581d0, size=18446744073709551613) at mcxt.c:516
 #5  0x00771a46 in text_to_cstring (t=0x17b23608) at varlena.c:139
 #6  0x00770747 in varcharout (fcinfo=0x7fffd44854e0) at varchar.c:515

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-31 Thread Robert Haas
On Tue, Jan 31, 2012 at 12:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 Hm.  The stack trace is definitive that it's finding the bad data in a
 tuple that it's trying to print to the client, not in an index.

 BTW, after a bit more reflection it occurs to me that it's not so much
 that the data is necessarily *bad*, as that it seemingly doesn't match
 the tuple descriptor that the backend's trying to interpret it with.

Hmm.  Could this be caused by the recovery process failing to obtain a
sufficiently strong lock on a buffer before replaying some WAL record?
 For example, getting only an exclusive content lock where a cleanup
lock is needed could presumably cause something like this to happen -
it would explain the transient nature of the errors as well as the
fact that they only seem to occur during Hot Standby operation.  On
the other hand, it's a little hard to believe we would have missed
something that obvious; there aren't that many things that need a
cleanup lock on a heap page.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-31 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 31, 2012 at 12:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 BTW, after a bit more reflection it occurs to me that it's not so much
 that the data is necessarily *bad*, as that it seemingly doesn't match
 the tuple descriptor that the backend's trying to interpret it with.

 Hmm.  Could this be caused by the recovery process failing to obtain a
 sufficiently strong lock on a buffer before replaying some WAL record?

Well, I was kinda speculating that inadequate locking could result in
use of a stale (or too-new?) tuple descriptor, and that would be as good
a candidate as any if the basic theory were right.  But Bridget says
they are not doing any DDL, so it's hard to see how there'd be any tuple
descriptor mismatch at all.  Still baffled ...

regards, tom lane



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-30 Thread Bridget Frey
Hi Tom,
Thanks for the reply, we appreciate your time on this.  The alloc error
queries all seem to be selects from a btree primary index.   I gave an
example in my initial post from the logins table.  Usually for us it
is logins but sometimes we have seen it on a few other tables, and
it's always a btree primary key index, very simple type of select.
The queries have been showing up in the logs which is how we know, but
we could also confirm in the core dump.  If the problem is data
corruption, it is transient.  We replay the same queries and get no
errors.  We also have jobs that run that basically do the same series
of selects every day or hour etc. but it is totally random which ones
cause an error.  I.e., if it is corruption, it somehow magically fixes
itself.  Also, we still have not seen any issues on the master; this
seems to be a problem only on hot standby slaves (we have three
slaves).  The OP, incidentally, reported the same thing - issue is
only on hot standby slaves, it is transient, and it happens on a
select from a btree primary index.

This also does not seem to be load related.  It often happens under
periods of light load for us.

Please let us know if you have any other thoughts on what we should look at...

Sent from my iPhone

On Jan 30, 2012, at 7:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Bridget Frey bridget.f...@redfin.com writes:
 The second error is an invalid memory alloc error that we're getting ~2
 dozen times per day in production.  The bt for this alloc error is below.

 This trace is consistent with the idea that we're getting a corrupt
 tuple out of a table, although it doesn't entirely preclude the
 possibility that the corrupt value is manufactured inside the backend.
 To get much further you're going to need to look at the specific query
 being executed each time this happens, and see if you can detect any
 pattern.  Now that you've got debug symbols straightened out, the
 gdb command p debug_query_string should accomplish this.  (If that
 does not produce anything that looks like one of your application's
 SQL commands, we'll need to try harder to extract the info.)  You could
 probably hack the elog(PANIC) to log that before dying, too, if you'd
 rather not manually gdb each core dump.
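
To avoid running gdb by hand on each core dump, the same command can be
run in batch mode from a script.  A minimal Python sketch follows; the
function names and the example paths are hypothetical.  Only standard
gdb options (-q, -batch, -ex) are used, and the executable must match
the core file and carry debug symbols for the print to work.

```python
import subprocess


def gdb_query_command(executable, core_file):
    """Build a gdb batch command line that prints debug_query_string."""
    return [
        "gdb", "-q", "-batch",
        "-ex", "p debug_query_string",
        executable, core_file,
    ]


def query_from_core(executable, core_file):
    """Run gdb on one core file and return whatever it printed."""
    result = subprocess.run(
        gdb_query_command(executable, core_file),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero on a mismatched core
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(query_from_core("/path/to/postgres", "/path/to/core"))
```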

regards, tom lane



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-30 Thread Tom Lane
Bridget Frey bridget.f...@redfin.com writes:
 Thanks for the reply, we appreciate your time on this.  The alloc error
 queries all seem to be selects from a btree primary index.   I gave an
 example in my initial post from the logins table.  Usually for us it
 is logins but sometimes we have seen it on a few other tables, and
 it's always a btree primary key index, very simple type of select.

Hm.  The stack trace is definitive that it's finding the bad data in a
tuple that it's trying to print to the client, not in an index.
That tuple might've been straight from disk, or it could have been
constructed inside the backend ... but if it's a simple SELECT FROM
single-table WHERE index-condition then the tuple should be raw data
found in a shared buffer.

 The queries have been showing up in the logs which is how we know, but
 we could also confirm in the core dump.  If the problem is data
 corruption, it is transient.  We replay the same queries and get no
 errors.

The idea that comes to mind is that somehow btree index updates are
reaching the standby in advance of the heap updates they refer to.
But how could that be?  And even more to the point, if we did follow
a bogus TID pointer from an index, how come it's failing there?  You'd
expect it to usually notice such a problem much earlier, while examining
the heap tuple header.  (Invalid xmin complaints are the typical symptom
from that, since the xmin is one of the first fields we look at that
can be sanity-checked to any extent.)

Still baffled here.

regards, tom lane



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-28 Thread Michael Brauwerman
I work with Bridget at Redfin.

We have a core dump from a once-in-5-days (multi-million queries) hot
standby segfault in pg 9.1.2.  It might or might not be the same root issue
as the alloc errors.  If I should file a new bug report, let me know.

The postgres executable that crashed did not have debugging symbols
installed, and we were unable to debug (gdb) the core file using a debug
build of postgres. (Symbols didn't match.) Running gdb against a non-debug
postgres executable gave us this stack trace:


[root@query-7 core]# gdb -q -c  /postgres/core/query-9.core.19678
/usr/pgsql-9.1/bin/postgres-non-debug
Reading symbols from /usr/pgsql-9.1/bin/postgres-non-debug...(no debugging
symbols found)...done.

warning: core file may not match specified executable file.
[New Thread 19678]

warning: no loadable sections found in added symbol-file system-supplied
DSO at 0x7fffdcd58000
Core was generated by `postgres: datamover stingray_prod 10.11.0.134(54140)
SELEC'.
Program terminated with signal 11, Segmentation fault.
#0  0x0045694c in nocachegetattr ()



(gdb) bt
#0  0x0045694c in nocachegetattr ()
#1  0x006f93c9 in ?? ()
#2  0x006fa231 in tuplesort_puttupleslot ()
#3  0x00573ad1 in ExecSort ()
#4  0x0055cdda in ExecProcNode ()
#5  0x0055bcd1 in standard_ExecutorRun ()
#6  0x00623594 in ?? ()
#7  0x00624ae0 in PortalRun ()
#8  0x006220f2 in PostgresMain ()
#9  0x005e6ba4 in ?? ()
#10 0x005e791c in PostmasterMain ()
#11 0x0058b9ae in main ()



We have the (5GB) core file, and are happy to do any more forensics anyone
can advise.

Please instruct.

I hope this helps point to a root cause and resolution

Thank you,

Mike Brauwerman
Data Team, Redfin

On Fri, Jan 27, 2012 at 10:53 AM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Jan 27, 2012 at 1:31 PM, Bridget Frey bridget.f...@redfin.com
 wrote:
  Thanks for the info - that's very helpful.  We had also noted that the
 alloc
  seems to be -3 bytes.  We have run pg_check and it found no instances of
  corruption. We've also replayed queries that have failed, and have never
  been able to get the same query to fail twice.  In the case you
  investigated, was there permanent page corruption - e.g. you could run
 the
  same query over and over and get the same result?

 Yes.  I observed that the infomask bits on several tuples had somehow
 been overwritten by nonsense.  I am not sure whether there were other
 kinds of corruption as well - I suspect probably so - but that's the
 only one I saw with my own eyes, courtesy of pg_filedump.

  It really does seem like this is an issue either in Hot Standby or very
  closely related to that feature, where there is temporary corruption of a
  btree index that then disappears.  Our master is not experiencing any
 malloc
  issues, while the 3 slaves get about a dozen per day, despite similar
  workloads.  We haven't had a slave segfault since we set it up to
 produce a
  core dump, but we're expecting to have that within the next few days
  (assuming we'll continue to get a segfault every 3-4 days).  We're also
  planning to set up one slave that will panic when it gets a malloc
 issue, as
  you (and other posters on 6400) had suggested.
 
  Thanks again for the help, and we'll keep you posted as we learn more...

 The case I investigated involved corruption on the master, and I think
 it predated Hot Standby.  However, the symptom is generic enough that
 it seems quite possible that there's more than one way for it to
 happen.  :-(

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company





-- 
Mike Brauwerman
Data Team, Redfin


Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-28 Thread Peter Geoghegan
On 28 January 2012 21:34, Michael Brauwerman
michael.brauwer...@redfin.com wrote:
 We have the (5GB) core file, and are happy to do any more forensics anyone
 can advise.

Ideally, you'd be able to install debug information packages, which
should give a more detailed and useful stack trace, as described here:

http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-28 Thread Michael Brauwerman
We did try that with a postgres 9.1.2, compiled from source with debug
flags, but we got 0x10 bad address in gdb. (Obviously we did it wrong
somehow)

We will keep trying to get a good set of symbols set up.
On Jan 28, 2012 2:34 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:

 On 28 January 2012 21:34, Michael Brauwerman
 michael.brauwer...@redfin.com wrote:
  We have the (5GB) core file, and are happy to do any more forensics
 anyone
  can advise.

 Ideally, you'd be able to install debug information packages, which
 should give a more detailed and useful stack trace, as described here:


 http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

 --
 Peter Geoghegan   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-27 Thread Robert Haas
On Mon, Jan 23, 2012 at 3:22 PM, Bridget Frey bridget.f...@redfin.com wrote:
 Hello,
 We upgraded to postgres 9.1.2 two weeks ago, and we are also experiencing an
 issue that seems very similar to the one reported as bug 6200.  We see
 approximately 2 dozen alloc errors per day across 3 slaves, and we are
 getting one segfault approximately every 3 days.  We did not experience this
 issue before our upgrade (we were on version 8.4, and used skytools for
 replication).

 We are attempting to get a core dump on segfault (our last attempt did not
 work due to a config issue for the core dump).  We're also attempting to
 repro the alloc errors on a test setup, but it seems like we may need quite
 a bit of load to trigger the issue.  We're not certain that the alloc issues
 and the segfaults are the same issue - but it seems that it may be since
 the OP for bug 6200 sees the same behavior.  We have seen no issues on the
 master, all alloc errors and segfaults have been on the slaves.

 We've seen the alloc errors on a few different tables, but most frequently
 on logins.  Rows are added to the logins table one-by-one, and updates
 generally happen one row at a time.  The table is pretty basic, it looks
 like this...

 CREATE TABLE logins
 (
   login_id bigserial NOT NULL,
   snip - a bunch of columns
   CONSTRAINT logins_pkey PRIMARY KEY (login_id ),
   snip - some other constraints...
 )
 WITH (
   FILLFACTOR=80,
   OIDS=FALSE
 );

 The queries that trigger the alloc error on this table look like this (we
 use hibernate hence the funny underscoring...)
 select login0_.login_id as login1_468_0_, l...  from logins login0_ where
 login0_.login_id=$1

 The alloc error in the logs looks like this:
 -01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR:
 invalid memory alloc request size 18446744073709551613

 The alloc error is nearly always for size 18446744073709551613 - though we
 have seen one time where it was a different amount...

Hmm, that number in hex works out to 0xfffffffffffffffd, which makes
it sound an awful lot like the system (for some unknown reason)
attempted to allocate -3 bytes of memory.  I've seen something like
this once before on a customer system running a modified version of
PostgreSQL.  In that case, the problem turned out to be page
corruption.  Circumstances didn't permit determination of the root
cause of the page corruption, however, nor was I able to figure out
exactly how the corruption I saw resulted in an allocation request
like this.  It would be nice to figure out where in the code this is
happening and put in a higher-level guard so that we get a better
error message.
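
A minimal sketch of that arithmetic, in Python rather than anything
taken from the PostgreSQL source.  The helper name is made up.  It
reinterprets the reported unsigned 64-bit size as a signed
two's-complement value.

```python
def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed two's complement."""
    value &= 0xFFFFFFFFFFFFFFFF          # keep only the low 64 bits
    if value >= (1 << 63):               # top bit set means negative
        return value - (1 << 64)
    return value


reported = 18446744073709551613          # size from the error message
print(hex(reported))                     # 0xfffffffffffffffd
print(as_signed64(reported))             # -3
```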

You might want to compile a modified PostgreSQL executable that puts an
extremely long sleep (like a year) just before this error is reported.
 Then, when the system hangs at that point, you can attach a debugger
and pull a stack backtrace.  Or you could insert an abort() at that
point in the code and get a backtrace from the core dump.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-27 Thread Robert Haas
On Fri, Jan 27, 2012 at 1:31 PM, Bridget Frey bridget.f...@redfin.com wrote:
 Thanks for the info - that's very helpful.  We had also noted that the alloc
 seems to be -3 bytes.  We have run pg_check and it found no instances of
 corruption. We've also replayed queries that have failed, and have never
 been able to get the same query to fail twice.  In the case you
 investigated, was there permanent page corruption - e.g. you could run the
 same query over and over and get the same result?

Yes.  I observed that the infomask bits on several tuples had somehow
been overwritten by nonsense.  I am not sure whether there were other
kinds of corruption as well - I suspect probably so - but that's the
only one I saw with my own eyes, courtesy of pg_filedump.

 It really does seem like this is an issue either in Hot Standby or very
 closely related to that feature, where there is temporary corruption of a
 btree index that then disappears.  Our master is not experiencing any malloc
 issues, while the 3 slaves get about a dozen per day, despite similar
 workloads.  We haven't had a slave segfault since we set it up to produce a
 core dump, but we're expecting to have that within the next few days
 (assuming we'll continue to get a segfault every 3-4 days).  We're also
 planning to set up one slave that will panic when it gets a malloc issue, as
 you (and other posters on 6400) had suggested.

 Thanks again for the help, and we'll keep you posted as we learn more...

The case I investigated involved corruption on the master, and I think
it predated Hot Standby.  However, the symptom is generic enough that
it seems quite possible that there's more than one way for it to
happen.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-27 Thread Bridget Frey
Thanks for the info - that's very helpful.  We had also noted that the
alloc seems to be -3 bytes.  We have run pg_check and it found no instances
of corruption. We've also replayed queries that have failed, and have never
been able to get the same query to fail twice.  In the case you
investigated, was there permanent page corruption - e.g. you could run the
same query over and over and get the same result?

It really does seem like this is an issue either in Hot Standby or very
closely related to that feature, where there is temporary corruption of a
btree index that then disappears.  Our master is not experiencing any
malloc issues, while the 3 slaves get about a dozen per day, despite
similar workloads.  We haven't had a slave segfault since we set it up to
produce a core dump, but we're expecting to have that within the next few
days (assuming we'll continue to get a segfault every 3-4 days).  We're
also planning to set up one slave that will panic when it gets a malloc
issue, as you (and other posters on 6400) had suggested.

Thanks again for the help, and we'll keep you posted as we learn more...
-B

On Fri, Jan 27, 2012 at 6:31 AM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Jan 23, 2012 at 3:22 PM, Bridget Frey bridget.f...@redfin.com
 wrote:
  Hello,
  We upgraded to postgres 9.1.2 two weeks ago, and we are also
 experiencing an
  issue that seems very similar to the one reported as bug 6200.  We see
  approximately 2 dozen alloc errors per day across 3 slaves, and we are
  getting one segfault approximately every 3 days.  We did not experience
 this
  issue before our upgrade (we were on version 8.4, and used skytools for
  replication).
 
  We are attempting to get a core dump on segfault (our last attempt did
 not
  work due to a config issue for the core dump).  We're also attempting to
  repro the alloc errors on a test setup, but it seems like we may need
 quite
  a bit of load to trigger the issue.  We're not certain that the alloc
 issues
  and the segfaults are the same issue - but it seems that it may be since
  the OP for bug 6200 sees the same behavior.  We have seen no issues on
 the
  master, all alloc errors and segfaults have been on the slaves.
 
  We've seen the alloc errors on a few different tables, but most
 frequently
  on logins.  Rows are added to the logins table one-by-one, and updates
  generally happen one row at a time.  The table is pretty basic, it looks
  like this...
 
  CREATE TABLE logins
  (
    login_id bigserial NOT NULL,
    snip - a bunch of columns
    CONSTRAINT logins_pkey PRIMARY KEY (login_id ),
    snip - some other constraints...
  )
  WITH (
    FILLFACTOR=80,
    OIDS=FALSE
  );
 
  The queries that trigger the alloc error on this table look like this (we
  use hibernate hence the funny underscoring...)
  select login0_.login_id as login1_468_0_, l...  from logins login0_ where
  login0_.login_id=$1
 
  The alloc error in the logs looks like this:
  -01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934]
 ERROR:
  invalid memory alloc request size 18446744073709551613
 
  The alloc error is nearly always for size 18446744073709551613 - though
 we
  have seen one time where it was a different amount...

 Hmm, that number in hex works out to 0xfffffffffffffffd, which makes
 it sound an awful lot like the system (for some unknown reason)
 attempted to allocate -3 bytes of memory.  I've seen something like
 this once before on a customer system running a modified version of
 PostgreSQL.  In that case, the problem turned out to be page
 corruption.  Circumstances didn't permit determination of the root
 cause of the page corruption, however, nor was I able to figure out
 exactly how the corruption I saw resulted in an allocation request
 like this.  It would be nice to figure out where in the code this is
 happening and put in a higher-level guard so that we get a better
 error message.
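
 As a quick check of that arithmetic, here is a minimal C sketch. The helper names are made up for illustration and nothing in it is PostgreSQL code; it only assumes that a signed length of -3 ends up reinterpreted as an unsigned 64-bit size somewhere before the allocator sees it.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: reinterpret a signed length as an unsigned 64-bit
 * size, which is what happens when a negative length is passed to an
 * allocator whose size parameter is unsigned. */
static uint64_t as_alloc_size(int64_t signed_len)
{
    return (uint64_t) signed_len;
}

/* Hypothetical helper: format the reinterpreted size in decimal and hex.
 * Returns the number of characters snprintf wanted to write. */
static int describe_size(int64_t signed_len, char *buf, size_t buflen)
{
    uint64_t sz = as_alloc_size(signed_len);

    return snprintf(buf, buflen, "%" PRIu64 " = 0x%" PRIx64, sz, sz);
}
```

 Calling describe_size(-3, buf, sizeof buf) with a 64-byte buffer fills it with the decimal value from the log line next to its hex form, which is consistent with the "-3 bytes" reading.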

 You might want to compile a modified PostgreSQL executable that puts an
 extremely long sleep (like a year) just before this error is reported.
  Then, when the system hangs at that point, you can attach a debugger
 and pull a stack backtrace.  Or you could insert an abort() at that
 point in the code and get a backtrace from the core dump.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company




-- 
Bridget Frey  Director, Data & Analytics Engineering | Redfin

bridget.f...@redfin.com | tel: 206.576.5894


Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2012-01-23 Thread Bridget Frey
Hello,
We upgraded to postgres 9.1.2 two weeks ago, and we are also experiencing
an issue that seems very similar to the one reported as bug 6200.  We see
approximately 2 dozen alloc errors per day across 3 slaves, and we are
getting one segfault approximately every 3 days.  We did not experience
this issue before our upgrade (we were on version 8.4, and used skytools
for replication).

We are attempting to get a core dump on segfault (our last attempt did not
work due to a config issue for the core dump).  We're also attempting to
repro the alloc errors on a test setup, but it seems like we may need quite
a bit of load to trigger the issue.  We're not certain that the alloc
issues and the segfaults are the same issue - but it seems that it may be
since the OP for bug 6200 sees the same behavior.  We have seen no issues
on the master, all alloc errors and segfaults have been on the slaves.

We've seen the alloc errors on a few different tables, but most frequently
on logins.  Rows are added to the logins table one-by-one, and updates
generally happen one row at a time.  The table is pretty basic, it looks
like this...

CREATE TABLE logins
(
  login_id bigserial NOT NULL,
  snip - a bunch of columns
  CONSTRAINT logins_pkey PRIMARY KEY (login_id ),
  snip - some other constraints...
)
WITH (
  FILLFACTOR=80,
  OIDS=FALSE
);

The queries that trigger the alloc error on this table look like this (we
use hibernate hence the funny underscoring...)
select login0_.login_id as login1_468_0_, l...  from logins login0_ where
login0_.login_id=$1

The alloc error in the logs looks like this:
-01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR:
invalid memory alloc request size 18446744073709551613

The alloc error is nearly always for size 18446744073709551613 - though we
have seen one time where it was a different amount...

We have been in touch with the OP for bug 6200, who said he may have time
to help us out a bit on debugging this.  It seems like what is being
suggested is getting a build of postgres that will capture a stack trace
for each alloc issue and/or simply dump core when that happens.  As this is
a production system we would prefer the former.  As I mentioned above we're
also trying to get a core dump for the segfault.

We are treating this as extremely high priority as it is currently causing
2 dozen failures for users of our site per day, as well as a few min of
downtime for the segfault every 3 days.  I realize there may be little that
the postgres experts can do until we provide more information - but since
our use case is really not very complicated here (basic use of HS), and
another site is also experiencing it, I figured it would be worth posting
about what we're seeing.

Thanks,
-Bridget Frey
Redfin


Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2011-09-09 Thread Tom Lane
Daniel Farina dan...@heroku.com writes:
 A huge thanks to Conrad Irwin of Rapportive for furnishing virtually all the
 details of this bug report.

This isn't really enough information to reproduce the problem ...

 The occurrence rate is somewhere in the one per tens-of-millions of
 queries.

... and that statement is going to discourage anyone from even trying,
since with such a low occurrence rate it's going to be impossible to be
sure whether the setup to reproduce the problem is correct.  So if you'd
like this to be fixed, you're either going to need to show us exactly
how to reproduce it, or investigate it yourself.

The way that I'd personally proceed to investigate it would probably be
to change the "invalid memory alloc request size" errors (in
src/backend/utils/mmgr/mcxt.c; there are about four occurrences) from
ERROR to PANIC so that they'll provoke a core dump, and then use gdb
to get a stack trace, which would provide at least a little more
information about what happened.  However, if you are only able to
reproduce it in a production server, you might not like that approach.
Perhaps you can set up an extra standby that's only there for testing,
so you don't mind if it crashes?
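
To illustrate the difference between the two behaviours, here is a self-contained C sketch. It is not the code in mcxt.c: every sketch_* name is hypothetical. The only things taken from PostgreSQL are the 1 GB - 1 limit (MaxAllocSize) and the wording of the message. The ERROR-like path reports and gives up on the request, while the PANIC-like path calls abort(), which raises SIGABRT and leaves a core file if the core size limit allows one.

```c
#include <stdio.h>
#include <stdlib.h>

/* Same value as PostgreSQL's MaxAllocSize (1 GB - 1); the rest of this
 * file is a hypothetical stand-in, not the real allocator. */
#define SKETCH_MAX_ALLOC_SIZE ((size_t) 0x3fffffff)

typedef enum { SKETCH_ERROR, SKETCH_PANIC } sketch_level;

/* A request is valid only if it does not exceed the limit.  A "negative"
 * size such as (size_t) -3 is a huge unsigned value and fails this test. */
static int sketch_size_is_valid(size_t size)
{
    return size <= SKETCH_MAX_ALLOC_SIZE;
}

/* Allocate memory, reporting invalid sizes at the chosen level. */
static void *sketch_alloc(size_t size, sketch_level level)
{
    if (!sketch_size_is_valid(size))
    {
        fprintf(stderr, "invalid memory alloc request size %zu\n", size);
        if (level == SKETCH_PANIC)
            abort();        /* SIGABRT: core dump if the core limit permits */
        return NULL;        /* ERROR-like: fail this request, keep running */
    }
    return malloc(size);
}
```

With the core file from the PANIC-like path, gdb can then be pointed at the executable and the core to print the stack with "bt".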

regards, tom lane



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2011-09-09 Thread Heikki Linnakangas

On 09.09.2011 18:02, Tom Lane wrote:

The way that I'd personally proceed to investigate it would probably be
to change the "invalid memory alloc request size" errors (in
src/backend/utils/mmgr/mcxt.c; there are about four occurrences) from
ERROR to PANIC so that they'll provoke a core dump, and then use gdb
to get a stack trace, which would provide at least a little more
information about what happened.  However, if you are only able to
reproduce it in a production server, you might not like that approach.
Perhaps you can set up an extra standby that's only there for testing,
so you don't mind if it crashes?


If that's not possible or doesn't reproduce the issue, there are also
functions in glibc to produce a backtrace without aborting the program:
https://www.gnu.org/s/libc/manual/html_node/Backtraces.html.


I think you could also fork() + abort() to generate a core dump, not 
just a backtrace.
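
A minimal sketch of both ideas, assuming Linux with glibc (backtrace() and backtrace_symbols_fd() come from <execinfo.h> and are not portable). The function names log_backtrace and dump_core_in_child are made up for this example, and where such calls would go in the server is left open.

```c
#include <execinfo.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write the current call stack to stderr without stopping the process.
 * Returns the number of frames captured. */
static int log_backtrace(void)
{
    void *frames[64];
    int   nframes = backtrace(frames, 64);

    /* Writes straight to the descriptor, so it does not call malloc. */
    backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
    return nframes;
}

/* Fork a child that aborts at once, so the child (a copy of this process
 * at the time of the fork) is the one that dumps core while the parent
 * carries on.  Returns the signal that ended the child, or -1 on failure. */
static int dump_core_in_child(void)
{
    int   status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        abort();            /* child: raises SIGABRT */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Function names in the backtrace output are only as good as the symbols available to the dynamic linker, so linking with -rdynamic helps. Whether a core file actually appears after dump_core_in_child() still depends on the core size limit (ulimit -c) and the system's core pattern.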


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT

2011-09-09 Thread Simon Riggs
On Thu, Sep 8, 2011 at 11:33 PM, Daniel Farina dan...@heroku.com wrote:

  ERROR: invalid memory alloc request size 18446744073709551613

 At least once, a hot standby was promoted to a primary and the errors seem
 to discontinue, but then reappear on a newly-provisioned standby.

So the query that fails uses a btree index on a hot standby. I don't
fully accept it as an HS bug, but let's assume that it is and analyse
what could cause it.

The MO is certain user queries, only observed in HS. So certain
queries might be related to the way we use indexes or not.

There is a single and small difference between how a btree index
operates in HS and normal operation, which relates to whether we
kill tuples in the index. That's simple code and there are no obvious
bugs there, nor anything that specifically allocates memory even. So
the only bug that springs to mind is something related to how we
navigate hot chains with/without killed tuples. i.e. the bug is not
actually HS related, but is only observed under conditions typical in
HS.

HS touches almost nothing else in user space, apart from snapshots. So
there could be a bug there also, maybe in CopySnapshot().

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
