Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
I recently raised BUG #6425: Bus error in slot_deform_tuple. During the last reproduction of the problem I saw this:

Client 2 aborted in state 0: ERROR:  invalid memory alloc request size 18446744073709551613

So like Tom said, these two issues could well be related. I just wanted to mention it here in this thread, FYI.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
So here's a better stack trace for the segfault issue. (Again, just to summarize, since this is a long thread, we're seeing two issues: 1) alloc errors that do not crash the DB (although we modified postgres to panic when this happens in our test environment, and posted a stack earlier), and 2) a postgres segfault that happens once every couple of days on our slaves. We're still not sure if these are the same issue or not.)

This stack is not perfect because it still has some things optimized out (this came from our production database), but it's much more detailed than the last one we posted for a segfault... hope this helps get closer to an answer on this. I'd also be interested in knowing whether the postgres experts think these two symptoms are likely related, or totally separate issues...

Thanks!
-B

#0  0x00455dc1 in slot_deform_tuple (slot=0x53cfc20, natts=70) at heaptuple.c:1090
1090            off = att_align_pointer(off, thisatt->attalign, -1,
(gdb) bt
#0  0x00455dc1 in slot_deform_tuple (slot=0x53cfc20, natts=70) at heaptuple.c:1090
#1  0x00455fbd in slot_getallattrs (slot=0x53cfc20) at heaptuple.c:1253
#2  0x00458ac7 in printtup (slot=0x53cfc20, self=0x534f1e0) at printtup.c:300
#3  0x0055bd69 in ExecutePlan (queryDesc=0x5515978, direction=<value optimized out>, count=0) at execMain.c:1464
#4  standard_ExecutorRun (queryDesc=0x5515978, direction=<value optimized out>, count=0) at execMain.c:313
#5  0x00623594 in PortalRunSelect (portal=0x5394f10, forward=<value optimized out>, count=0, dest=0x534f1e0) at pquery.c:943
#6  0x00624ae0 in PortalRun (portal=0x5394f10, count=9223372036854775807, isTopLevel=1 '\001', dest=0x534f1e0, altdest=0x534f1e0, completionTag=0x7fff014e0640 "") at pquery.c:787
#7  0x006220f2 in exec_execute_message (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:1963
#8  PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:3983
#9  0x005e6ba4 in ServerLoop () at postmaster.c:3601
#10 0x005e791c in PostmasterMain (argc=5, argv=0x524cab0) at postmaster.c:1116
#11 0x0058b9ae in main (argc=5, argv=<value optimized out>) at main.c:199
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
On Tue, Jan 31, 2012 at 4:25 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Tue, Jan 31, 2012 at 12:05 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> BTW, after a bit more reflection it occurs to me that it's not so much
>>> that the data is necessarily *bad*, as that it seemingly doesn't match
>>> the tuple descriptor that the backend's trying to interpret it with.
>>
>> Hmm.  Could this be caused by the recovery process failing to obtain a
>> sufficiently strong lock on a buffer before replaying some WAL record?
>
> Well, I was kinda speculating that inadequate locking could result in
> use of a stale (or too-new?) tuple descriptor, and that would be as good
> a candidate as any if the basic theory were right.  But Bridget says
> they are not doing any DDL, so it's hard to see how there'd be any tuple
> descriptor mismatch at all.  Still baffled ...

No, I wasn't thinking about a tuple descriptor mismatch. I was imagining that the page contents themselves might be in flux while we're trying to read from it. Off the top of my head I don't see how that can happen, but it would be awfully interesting to be able to see which WAL record last touched the relevant heap page, and how long before the error that happened.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
Robert Haas <robertmh...@gmail.com> writes:
> No, I wasn't thinking about a tuple descriptor mismatch.  I was
> imagining that the page contents themselves might be in flux while
> we're trying to read from it.

Oh, gotcha. Yes, that's a horribly plausible idea. All it'd take is one WAL replay routine that hasn't been upgraded to acquire sufficient buffer locks. Pre-hot-standby, there was no reason for them to be careful about locking. On the other hand, if that were the cause, you'd expect the symptoms to be a bit more variable...

			regards, tom lane
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
Robert Haas <robertmh...@gmail.com> writes:
> No, I wasn't thinking about a tuple descriptor mismatch.  I was
> imagining that the page contents themselves might be in flux while
> we're trying to read from it.
>
> It would be nice to get a dump of what PostgreSQL thought the entire
> block looked like at the time the crash happened.  That information is
> presumably already in the core dump, but I'm not sure if there's a nice
> way to extract it using gdb.

It probably would be possible to get the page out of the dump, but I'd be really surprised if that proved much. By the time the crash-dump-making code gets around to examining the shared memory, the other process that's hypothetically changing the page will have done its work and moved on. A crash in process X doesn't freeze execution in process Y, at least not in any Unixoid system I've ever heard of.

			regards, tom lane
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
Excerpts from Tom Lane's message of mié feb 01 18:06:27 -0300 2012:
> Robert Haas <robertmh...@gmail.com> writes:
>> No, I wasn't thinking about a tuple descriptor mismatch.  I was
>> imagining that the page contents themselves might be in flux while
>> we're trying to read from it.
>>
>> It would be nice to get a dump of what PostgreSQL thought the entire
>> block looked like at the time the crash happened.  That information is
>> presumably already in the core dump, but I'm not sure if there's a nice
>> way to extract it using gdb.
>
> It probably would be possible to get the page out of the dump, but I'd
> be really surprised if that proved much.  By the time the
> crash-dump-making code gets around to examining the shared memory, the
> other process that's hypothetically changing the page will have done its
> work and moved on.  A crash in process X doesn't freeze execution in
> process Y, at least not in any Unixoid system I've ever heard of.

Maybe you can do something like send SIGSTOP to every other backend, then attach to them and find which one was touching the same buffer, then peek at what it was doing.

--
Álvaro Herrera <alvhe...@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
We have no DDL whatsoever in the code. We do update rows in the logins table frequently, but we basically have a policy of only doing DDL changes during scheduled upgrades when we bring the site down.

We have been discussing this issue a lot and we really haven't come up with anything that would be considered unusual here. The tables experiencing issues have maybe 1M to 200M rows, we do updates and selects frequently, they have standard btree primary key indexes, and the failing query always seems to be a select for a single row based on a primary key lookup. All of these code paths worked flawlessly prior to our 9.1 upgrade (we had been using skytools replication). And we see no problems on the master despite similar workloads there.

It is definitely puzzling, and we are not too sure what to look into next...

Sent from my iPhone

On Jan 30, 2012, at 9:06 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> I wrote:
>> Hm.  The stack trace is definitive that it's finding the bad data in a
>> tuple that it's trying to print to the client, not in an index.
>
> BTW, after a bit more reflection it occurs to me that it's not so much
> that the data is necessarily *bad*, as that it seemingly doesn't match
> the tuple descriptor that the backend's trying to interpret it with.
> (In particular, the reported symptom would be consistent with finding a
> small integer constant at a place where the descriptor expects to find
> a variable-length field.)  So that opens up a different line of thought
> about how those could get out of sync on a standby.
>
> Are you in the habit of issuing ALTER TABLE commands to
> add/delete/change columns on these tables?  In fact, is there any DDL
> whatsoever going on around the time these failures happen?
>
> 			regards, tom lane
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
Excerpts from Bridget Frey's message of lun ene 30 18:59:08 -0300 2012:
> Anyway, here goes...

Maybe a bt full could give more insight into what's going on ...

> #0  0x003a83e30265 in raise () from /lib64/libc.so.6
> #1  0x003a83e31d10 in abort () from /lib64/libc.so.6
> #2  0x007cb84e in errfinish (dummy=0) at elog.c:523
> #3  0x007cd951 in elog_finish (elevel=22, fmt=0x95cdf0 "invalid memory alloc request size %lu") at elog.c:1202
> #4  0x007f115c in MemoryContextAlloc (context=0x17b581d0, size=18446744073709551613) at mcxt.c:516
> #5  0x00771a46 in text_to_cstring (t=0x17b23608) at varlena.c:139
> #6  0x00770747 in varcharout (fcinfo=0x7fffd44854e0) at varchar.c:515

--
Álvaro Herrera <alvhe...@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
On Tue, Jan 31, 2012 at 12:05 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> Hm.  The stack trace is definitive that it's finding the bad data in a
>> tuple that it's trying to print to the client, not in an index.
>
> BTW, after a bit more reflection it occurs to me that it's not so much
> that the data is necessarily *bad*, as that it seemingly doesn't match
> the tuple descriptor that the backend's trying to interpret it with.

Hmm. Could this be caused by the recovery process failing to obtain a sufficiently strong lock on a buffer before replaying some WAL record? For example, getting only an exclusive content lock where a cleanup lock is needed could presumably cause something like this to happen - it would explain the transient nature of the errors as well as the fact that they only seem to occur during Hot Standby operation. On the other hand, it's a little hard to believe we would have missed something that obvious; there aren't that many things that need a cleanup lock on a heap page.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
Robert Haas <robertmh...@gmail.com> writes:
> On Tue, Jan 31, 2012 at 12:05 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> BTW, after a bit more reflection it occurs to me that it's not so much
>> that the data is necessarily *bad*, as that it seemingly doesn't match
>> the tuple descriptor that the backend's trying to interpret it with.
>
> Hmm.  Could this be caused by the recovery process failing to obtain a
> sufficiently strong lock on a buffer before replaying some WAL record?

Well, I was kinda speculating that inadequate locking could result in use of a stale (or too-new?) tuple descriptor, and that would be as good a candidate as any if the basic theory were right. But Bridget says they are not doing any DDL, so it's hard to see how there'd be any tuple descriptor mismatch at all. Still baffled ...

			regards, tom lane
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
Hi Tom,

Thanks for the reply, we appreciate your time on this. The alloc error queries all seem to be selects from a btree primary index. I gave an example in my initial post from the logins table. Usually for us it is logins, but sometimes we have seen it on a few other tables, and it's always a btree primary key index, very simple type of select. The queries have been showing up in the logs, which is how we know, but we could also confirm in the core dump.

If the problem is data corruption, it is transient. We replay the same queries and get no errors. We also have jobs that run that basically do the same series of selects every day or hour etc., but it is totally random which ones cause an error. I.e., if it is corruption, it somehow magically fixes itself. Also, we still have not seen any issues on the master; this seems to be a problem only on hot standby slaves (we have three slaves). The OP, incidentally, reported the same thing - the issue is only on hot standby slaves, it is transient, and it happens on a select from a btree primary index. This also does not seem to be load related. It often happens under periods of light load for us.

Please let us know if you have any other thoughts on what we should look at...

Sent from my iPhone

On Jan 30, 2012, at 7:01 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Bridget Frey <bridget.f...@redfin.com> writes:
>> The second error is an invalid memory alloc error that we're getting ~2
>> dozen times per day in production.  The bt for this alloc error is
>> below.
>
> This trace is consistent with the idea that we're getting a corrupt
> tuple out of a table, although it doesn't entirely preclude the
> possibility that the corrupt value is manufactured inside the backend.
> To get much further you're going to need to look at the specific query
> being executed each time this happens, and see if you can detect any
> pattern.  Now that you've got debug symbols straightened out, the gdb
> command "p debug_query_string" should accomplish this.  (If that does
> not produce anything that looks like one of your application's SQL
> commands, we'll need to try harder to extract the info.)  You could
> probably hack the elog(PANIC) to log that before dying, too, if you'd
> rather not manually gdb each core dump.
>
> 			regards, tom lane
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
Bridget Frey <bridget.f...@redfin.com> writes:
> Thanks for the reply, we appreciate your time on this.  The alloc error
> queries all seem to be selects from a btree primary index.  I gave an
> example in my initial post from the logins table.  Usually for us it is
> logins, but sometimes we have seen it on a few other tables, and it's
> always a btree primary key index, very simple type of select.

Hm. The stack trace is definitive that it's finding the bad data in a tuple that it's trying to print to the client, not in an index. That tuple might've been straight from disk, or it could have been constructed inside the backend ... but if it's a simple "SELECT FROM single-table WHERE index-condition" then the tuple should be raw data found in a shared buffer.

> The queries have been showing up in the logs, which is how we know, but
> we could also confirm in the core dump.  If the problem is data
> corruption, it is transient.  We replay the same queries and get no
> errors.

The idea that comes to mind is that somehow btree index updates are reaching the standby in advance of the heap updates they refer to. But how could that be? And even more to the point, if we did follow a bogus TID pointer from an index, how come it's failing there? You'd expect it to usually notice such a problem much earlier, while examining the heap tuple header. ("Invalid xmin" complaints are the typical symptom from that, since the xmin is one of the first fields we look at that can be sanity-checked to any extent.)

Still baffled here.

			regards, tom lane
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
I work with Bridget at Redfin.

We have a core dump from a once-in-5-days (multi-million queries) hot standby segfault in pg 9.1.2. (It might or might not be the same root issue as the alloc errors. If I should file a new bug report, let me know.)

The postgres executable that crashed did not have debugging symbols installed, and we were unable to debug (gdb) the core file using a debug build of postgres. (Symbols didn't match.) Running gdb against a non-debug postgres executable gave us this stack trace:

[root@query-7 core]# gdb -q -c /postgres/core/query-9.core.19678 /usr/pgsql-9.1/bin/postgres-non-debug
Reading symbols from /usr/pgsql-9.1/bin/postgres-non-debug...(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New Thread 19678]
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffdcd58000
Core was generated by `postgres: datamover stingray_prod 10.11.0.134(54140) SELEC'.
Program terminated with signal 11, Segmentation fault.
#0  0x0045694c in nocachegetattr ()
(gdb) bt
#0  0x0045694c in nocachegetattr ()
#1  0x006f93c9 in ?? ()
#2  0x006fa231 in tuplesort_puttupleslot ()
#3  0x00573ad1 in ExecSort ()
#4  0x0055cdda in ExecProcNode ()
#5  0x0055bcd1 in standard_ExecutorRun ()
#6  0x00623594 in ?? ()
#7  0x00624ae0 in PortalRun ()
#8  0x006220f2 in PostgresMain ()
#9  0x005e6ba4 in ?? ()
#10 0x005e791c in PostmasterMain ()
#11 0x0058b9ae in main ()

We have the (5GB) core file, and are happy to do any more forensics anyone can advise. Please instruct. I hope this helps point to a root cause and resolution.

Thank you,
Mike Brauwerman
Data Team, Redfin

On Fri, Jan 27, 2012 at 10:53 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Fri, Jan 27, 2012 at 1:31 PM, Bridget Frey <bridget.f...@redfin.com> wrote:
>> Thanks for the info - that's very helpful.  We had also noted that the
>> alloc seems to be -3 bytes.  We have run pg_check and it found no
>> instances of corruption.  We've also replayed queries that have failed,
>> and have never been able to get the same query to fail twice.  In the
>> case you investigated, was there permanent page corruption - e.g. you
>> could run the same query over and over and get the same result?
>
> Yes.  I observed that the infomask bits on several tuples had somehow
> been overwritten by nonsense.  I am not sure whether there were other
> kinds of corruption as well - I suspect probably so - but that's the
> only one I saw with my own eyes, courtesy of pg_filedump.
>
>> It really does seem like this is an issue either in Hot Standby or very
>> closely related to that feature, where there is temporary corruption of
>> a btree index that then disappears.  Our master is not experiencing any
>> malloc issues, while the 3 slaves get about a dozen per day, despite
>> similar workloads.  We haven't had a slave segfault since we set it up
>> to produce a core dump, but we're expecting to have that within the
>> next few days (assuming we'll continue to get a segfault every 3-4
>> days).  We're also planning to set up one slave that will panic when it
>> gets a malloc issue, as you (and other posters on 6400) had suggested.
>> Thanks again for the help, and we'll keep you posted as we learn
>> more...
>
> The case I investigated involved corruption on the master, and I think
> it predated Hot Standby.  However, the symptom is generic enough that
> it seems quite possible that there's more than one way for it to
> happen.  :-(
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
Mike Brauwerman
Data Team, Redfin
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
On 28 January 2012 21:34, Michael Brauwerman <michael.brauwer...@redfin.com> wrote:
> We have the (5GB) core file, and are happy to do any more forensics
> anyone can advise.

Ideally, you'd be able to install debug information packages, which should give a more detailed and useful stack trace, as described here:

http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
We did try that with a postgres 9.1.2 compiled from source with debug flags, but we got "0x10 bad address" in gdb. (Obviously we did it wrong somehow.) We will keep trying to get a good set of symbols set up.

On Jan 28, 2012 2:34 PM, "Peter Geoghegan" <pe...@2ndquadrant.com> wrote:

> On 28 January 2012 21:34, Michael Brauwerman <michael.brauwer...@redfin.com> wrote:
>> We have the (5GB) core file, and are happy to do any more forensics
>> anyone can advise.
>
> Ideally, you'd be able to install debug information packages, which
> should give a more detailed and useful stack trace, as described here:
>
> http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
>
> --
> Peter Geoghegan
> http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training and Services
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
On Mon, Jan 23, 2012 at 3:22 PM, Bridget Frey <bridget.f...@redfin.com> wrote:
> Hello,
> We upgraded to postgres 9.1.2 two weeks ago, and we are also
> experiencing an issue that seems very similar to the one reported as bug
> 6200.  We see approximately 2 dozen alloc errors per day across 3
> slaves, and we are getting one segfault approximately every 3 days.  We
> did not experience this issue before our upgrade (we were on version
> 8.4, and used skytools for replication).  We are attempting to get a
> core dump on segfault (our last attempt did not work due to a config
> issue for the core dump).  We're also attempting to repro the alloc
> errors on a test setup, but it seems like we may need quite a bit of
> load to trigger the issue.
>
> We're not certain that the alloc issues and the segfaults are the same
> issue - but it seems that they may be, since the OP for bug 6200 sees
> the same behavior.  We have seen no issues on the master; all alloc
> errors and segfaults have been on the slaves.
>
> We've seen the alloc errors on a few different tables, but most
> frequently on logins.  Rows are added to the logins table one-by-one,
> and updates generally happen one row at a time.  The table is pretty
> basic, it looks like this...
>
> CREATE TABLE logins
> (
>   login_id bigserial NOT NULL,
>   <snip - a bunch of columns>
>   CONSTRAINT logins_pkey PRIMARY KEY (login_id),
>   <snip - some other constraints...>
> )
> WITH (
>   FILLFACTOR=80,
>   OIDS=FALSE
> );
>
> The queries that trigger the alloc error on this table look like this
> (we use hibernate, hence the funny underscoring...)
>
> select login0_.login_id as login1_468_0_, l... from logins login0_ where
> login0_.login_id=$1
>
> The alloc error in the logs looks like this:
>
> -01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR:  invalid memory alloc request size 18446744073709551613
>
> The alloc error is nearly always for size 18446744073709551613 - though
> we have seen one time where it was a different amount...

Hmm, that number in hex works out to 0xfffffffffffffffd, which makes it sound an awful lot like the system (for some unknown reason) attempted to allocate -3 bytes of memory.

I've seen something like this once before on a customer system running a modified version of PostgreSQL. In that case, the problem turned out to be page corruption. Circumstances didn't permit determination of the root cause of the page corruption, however, nor was I able to figure out exactly how the corruption I saw resulted in an allocation request like this.

It would be nice to figure out where in the code this is happening and put in a higher-level guard so that we get a better error message. You may want to compile a modified PostgreSQL executable that puts an extremely long sleep (like a year) just before this error is reported. Then, when the system hangs at that point, you can attach a debugger and pull a stack backtrace. Or you could insert an abort() at that point in the code and get a backtrace from the core dump.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
On Fri, Jan 27, 2012 at 1:31 PM, Bridget Frey <bridget.f...@redfin.com> wrote:
> Thanks for the info - that's very helpful.  We had also noted that the
> alloc seems to be -3 bytes.  We have run pg_check and it found no
> instances of corruption.  We've also replayed queries that have failed,
> and have never been able to get the same query to fail twice.  In the
> case you investigated, was there permanent page corruption - e.g. you
> could run the same query over and over and get the same result?

Yes. I observed that the infomask bits on several tuples had somehow been overwritten by nonsense. I am not sure whether there were other kinds of corruption as well - I suspect probably so - but that's the only one I saw with my own eyes, courtesy of pg_filedump.

> It really does seem like this is an issue either in Hot Standby or very
> closely related to that feature, where there is temporary corruption of
> a btree index that then disappears.  Our master is not experiencing any
> malloc issues, while the 3 slaves get about a dozen per day, despite
> similar workloads.  We haven't had a slave segfault since we set it up
> to produce a core dump, but we're expecting to have that within the next
> few days (assuming we'll continue to get a segfault every 3-4 days).
> We're also planning to set up one slave that will panic when it gets a
> malloc issue, as you (and other posters on 6400) had suggested.  Thanks
> again for the help, and we'll keep you posted as we learn more...

The case I investigated involved corruption on the master, and I think it predated Hot Standby. However, the symptom is generic enough that it seems quite possible that there's more than one way for it to happen. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
Thanks for the info - that's very helpful. We had also noted that the alloc seems to be -3 bytes. We have run pg_check and it found no instances of corruption. We've also replayed queries that have failed, and have never been able to get the same query to fail twice. In the case you investigated, was there permanent page corruption - e.g. you could run the same query over and over and get the same result?

It really does seem like this is an issue either in Hot Standby or very closely related to that feature, where there is temporary corruption of a btree index that then disappears. Our master is not experiencing any malloc issues, while the 3 slaves get about a dozen per day, despite similar workloads. We haven't had a slave segfault since we set it up to produce a core dump, but we're expecting to have that within the next few days (assuming we'll continue to get a segfault every 3-4 days). We're also planning to set up one slave that will panic when it gets a malloc issue, as you (and other posters on 6400) had suggested.

Thanks again for the help, and we'll keep you posted as we learn more...
-B

On Fri, Jan 27, 2012 at 6:31 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Mon, Jan 23, 2012 at 3:22 PM, Bridget Frey <bridget.f...@redfin.com> wrote:
>> Hello,
>> We upgraded to postgres 9.1.2 two weeks ago, and we are also
>> experiencing an issue that seems very similar to the one reported as
>> bug 6200.  We see approximately 2 dozen alloc errors per day across 3
>> slaves, and we are getting one segfault approximately every 3 days.  We
>> did not experience this issue before our upgrade (we were on version
>> 8.4, and used skytools for replication).  We are attempting to get a
>> core dump on segfault (our last attempt did not work due to a config
>> issue for the core dump).  We're also attempting to repro the alloc
>> errors on a test setup, but it seems like we may need quite a bit of
>> load to trigger the issue.
>>
>> We're not certain that the alloc issues and the segfaults are the same
>> issue - but it seems that they may be, since the OP for bug 6200 sees
>> the same behavior.  We have seen no issues on the master; all alloc
>> errors and segfaults have been on the slaves.
>>
>> We've seen the alloc errors on a few different tables, but most
>> frequently on logins.  Rows are added to the logins table one-by-one,
>> and updates generally happen one row at a time.  The table is pretty
>> basic, it looks like this...
>>
>> CREATE TABLE logins
>> (
>>   login_id bigserial NOT NULL,
>>   <snip - a bunch of columns>
>>   CONSTRAINT logins_pkey PRIMARY KEY (login_id),
>>   <snip - some other constraints...>
>> )
>> WITH (
>>   FILLFACTOR=80,
>>   OIDS=FALSE
>> );
>>
>> The queries that trigger the alloc error on this table look like this
>> (we use hibernate, hence the funny underscoring...)
>>
>> select login0_.login_id as login1_468_0_, l... from logins login0_
>> where login0_.login_id=$1
>>
>> The alloc error in the logs looks like this:
>>
>> -01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR:  invalid memory alloc request size 18446744073709551613
>>
>> The alloc error is nearly always for size 18446744073709551613 - though
>> we have seen one time where it was a different amount...
>
> Hmm, that number in hex works out to 0xfffffffffffffffd, which makes it
> sound an awful lot like the system (for some unknown reason) attempted
> to allocate -3 bytes of memory.
>
> I've seen something like this once before on a customer system running a
> modified version of PostgreSQL.  In that case, the problem turned out to
> be page corruption.  Circumstances didn't permit determination of the
> root cause of the page corruption, however, nor was I able to figure out
> exactly how the corruption I saw resulted in an allocation request like
> this.
>
> It would be nice to figure out where in the code this is happening and
> put in a higher-level guard so that we get a better error message.  You
> may want to compile a modified PostgreSQL executable that puts an
> extremely long sleep (like a year) just before this error is reported.
> Then, when the system hangs at that point, you can attach a debugger and
> pull a stack backtrace.  Or you could insert an abort() at that point in
> the code and get a backtrace from the core dump.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
Bridget Frey
Director, Data Analytics Engineering | Redfin
bridget.f...@redfin.com | tel: 206.576.5894
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
Hello,

We upgraded to PostgreSQL 9.1.2 two weeks ago, and we are also experiencing an issue that seems very similar to the one reported as bug 6200. We see approximately two dozen alloc errors per day across 3 slaves, and we are getting one segfault approximately every 3 days. We did not experience this issue before our upgrade (we were on version 8.4 and used Skytools for replication). We are attempting to get a core dump on segfault (our last attempt did not work due to a config issue for the core dump). We're also attempting to repro the alloc errors on a test setup, but it seems like we may need quite a bit of load to trigger the issue.

We're not certain that the alloc issues and the segfaults are the same issue, but it seems they may be, since the OP for bug 6200 sees the same behavior. We have seen no issues on the master; all alloc errors and segfaults have been on the slaves.

We've seen the alloc errors on a few different tables, but most frequently on logins. Rows are added to the logins table one by one, and updates generally happen one row at a time. The table is pretty basic; it looks like this:

CREATE TABLE logins (
    login_id bigserial NOT NULL,
    -- snip: a bunch of columns
    CONSTRAINT logins_pkey PRIMARY KEY (login_id),
    -- snip: some other constraints
) WITH (FILLFACTOR=80, OIDS=FALSE);

The queries that trigger the alloc error on this table look like this (we use Hibernate, hence the funny underscoring):

select login0_.login_id as login1_468_0_, l... from logins login0_ where login0_.login_id=$1

The alloc error in the logs looks like this:

-01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR: invalid memory alloc request size 18446744073709551613

The alloc error is nearly always for size 18446744073709551613, though on one occasion it was a different amount.

We have been in touch with the OP for bug 6200, who said he may have time to help us out a bit on debugging this.
It seems like what is being suggested is getting a build of postgres that will capture a stack trace for each alloc issue and/or simply dump core when that happens. As this is a production system, we would prefer the former. As I mentioned above, we're also trying to get a core dump for the segfault. We are treating this as extremely high priority, as it is currently causing two dozen failures per day for users of our site, as well as a few minutes of downtime for the segfault every 3 days. I realize there may be little that the postgres experts can do until we provide more information - but since our use case is really not very complicated here (basic use of HS), and another site is also experiencing it, I figured it would be worth posting about what we're seeing. Thanks, -Bridget Frey Redfin
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
Daniel Farina dan...@heroku.com writes: A huge thanks to Conrad Irwin of Rapportive for furnishing virtually all the details of this bug report.

This isn't really enough information to reproduce the problem ...

The occurrence rate is somewhere in the one per tens-of-millions of queries.

... and that statement is going to discourage anyone from even trying, since with such a low occurrence rate it's going to be impossible to be sure whether the setup to reproduce the problem is correct. So if you'd like this to be fixed, you're either going to need to show us exactly how to reproduce it, or investigate it yourself.

The way that I'd personally proceed to investigate it would probably be to change the "invalid memory alloc request size" errors (in src/backend/utils/mmgr/mcxt.c; there are about four occurrences) from ERROR to PANIC so that they'll provoke a core dump, and then use gdb to get a stack trace, which would provide at least a little more information about what happened. However, if you are only able to reproduce it on a production server, you might not like that approach. Perhaps you can set up an extra standby that's only there for testing, so you don't mind if it crashes?

regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
On 09.09.2011 18:02, Tom Lane wrote: The way that I'd personally proceed to investigate it would probably be to change the "invalid memory alloc request size" errors (in src/backend/utils/mmgr/mcxt.c; there are about four occurrences) from ERROR to PANIC so that they'll provoke a core dump, and then use gdb to get a stack trace, which would provide at least a little more information about what happened. However, if you are only able to reproduce it on a production server, you might not like that approach. Perhaps you can set up an extra standby that's only there for testing, so you don't mind if it crashes?

If that's not possible or doesn't reproduce the issue, there are also functions in glibc to produce a backtrace without aborting the program: https://www.gnu.org/s/libc/manual/html_node/Backtraces.html. I think you could also fork() + abort() to generate a core dump, not just a backtrace. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] BUG #6200: standby bad memory allocations on SELECT
On Thu, Sep 8, 2011 at 11:33 PM, Daniel Farina dan...@heroku.com wrote: ERROR: invalid memory alloc request size 18446744073709551613 At least once, a hot standby was promoted to a primary and the errors seem to discontinue, but then reappear on a newly-provisioned standby.

So the failing query uses a btree index on a hot standby. I don't fully accept it as an HS bug, but let's assume that it is and analyse what could cause it. The MO is certain user queries, only observed in HS. So certain queries might be related to the way we use indexes or not. There is a single and small difference between how a btree index operates in HS and normal operation, which relates to whether we kill tuples in the index. That's simple code, and there are no obvious bugs there, nor anything that specifically allocates memory even. So the only bug that springs to mind is something related to how we navigate hot chains with/without killed tuples. I.e. the bug is not actually HS related, but is only observed under conditions typical in HS. HS touches almost nothing else in user space, apart from snapshots. So there could be a bug there also, maybe in CopySnapshot(). -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs