Re: [GENERAL] Memory Errors
Okay, we're finally getting the last bits of corruption fixed, and I finally remembered to ask my boss about the kill script. The only details I have are these:

1) The script does nothing if there are fewer than 1000 locks on tables in the database.
2) If there are 1000 or more locks, it grabs the processes in pg_stat_activity that are in a waiting state.
3) For each of those processes, it does a system "kill $pid" call.

The kill is not pg_terminate_backend or pg_cancel_backend, and it's also not a kill -9. Just a normal kill.

As far as the postgres and EC2 instances go, we're not really sure if anyone shut down, created, or migrated them in a weird way, but Kevin (my boss) said that it wouldn't surprise him. All I can say is that where we were getting 1 new row of corruption every day while the kill script was running, we haven't gotten any new corruption since we stopped it.

As far as the table with memory errors goes, we had asked them to rebuild the table, and they came back saying that they no longer need that table. So they're just going to drop it.

We'll try to keep digging, but I'm not sure we'll get much more info than that. We're quite busy and my ability to remember things is ... questionable.

-Sam

On Thu, Sep 9, 2010 at 8:14 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
> On Wed, Sep 8, 2010 at 6:55 PM, Sam Nelson <s...@consistentstate.com> wrote:
>> Even if the corruption wasn't a result of that, we weren't too excited
>> about the process being there to begin with. We thought there had to be
>> a better solution than just killing the processes. So we had a
>> discussion about the intent of that script and my boss came up with
>> something that solved the same problem without killing queries, then had
>> them stop that daemon, and we have been working with that database to
>> make sure it doesn't go screwy again. No new corruption has shown up
>> since stopping that daemon.
>>
>> That memory allocation issue looked drastically different from the toast
>> value errors, though, so it seemed like a separate problem. But now it's
>> looking like more corruption.
>>
>> ---
>> We're requesting that they do a few things (this is their production
>> database, so we usually don't alter any data unless they ask us to),
>> including deleting those rows. My memory is insufficient, so there's a
>> good chance that I'll forget to post back to the mailing list with the
>> results, but I'll try to remember to do so.
>>
>> Thank you for the help - I'm sure I'll be back soon with many more
>> questions.
>
> Any information on repeatable data corruption, whether it is ec2
> improperly flushing data on instance resets, postgres misbehaving under
> atypical conditions, or bad interactions between ec2 and postgres, is
> highly valuable. The only cases of 'understandable' data corruption are
> hardware failures, sync issues (either fsync off, or fsync not honored
> by hardware), torn pages on non-journaling file systems, etc. Naturally
> people are going to be skeptical of ec2 since you are so abstracted from
> the hardware. Maybe all your problems stem from a single explainable
> incident -- but we definitely want to get to the bottom of this...please
> keep us updated!
>
> merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
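For concreteness, the decision rule Sam describes (do nothing under 1000 locks; otherwise signal every waiting backend with a plain kill) could be sketched as below. This is a hypothetical reconstruction, not the client's actual script -- the constant and function names are assumptions, and the real daemon apparently pulled pids from pg_stat_activity and ran `kill $pid` from the shell:

```python
# Hypothetical reconstruction of the kill script's decision logic; in the
# real daemon the inputs came from pg_locks / pg_stat_activity via psql
# and the pids were signalled with a plain (SIGTERM) kill every ten minutes.
LOCK_THRESHOLD = 1000  # assumed constant, from the description in the thread

def pids_to_signal(lock_count, waiting_pids):
    """Return the backend pids the daemon would kill on this cycle."""
    if lock_count < LOCK_THRESHOLD:
        return []                  # step 1: fewer than 1000 locks, do nothing
    return list(waiting_pids)      # steps 2-3: kill every waiting backend

print(pids_to_signal(999, [4242, 4243]))   # []
print(pids_to_signal(1200, [4242, 4243]))  # [4242, 4243]
```

Note that on 8.3 the relevant pg_stat_activity columns would be `procpid` and the boolean `waiting`.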
Re: [GENERAL] Memory Errors
On Tue, Sep 21, 2010 at 12:57 PM, Sam Nelson <s...@consistentstate.com> wrote:
> On Thu, Sep 9, 2010 at 8:14 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
>> Naturally people are going to be skeptical of ec2 since you are so
>> abstracted from the hardware. Maybe all your problems stem from a single
>> explainable incident -- but we definitely want to get to the bottom of
>> this...please keep us updated!
>
> As far as the postgres and EC2 instances go, we're not really sure if
> anyone shut down, created, or migrated them in a weird way, but Kevin (my
> boss) said that it wouldn't surprise him.

(please try to avoid top-posting -- it destroys the context of the conversation)

The shutdown/migration point is key, along with fsync settings and a description of whatever durability guarantees ec2 gives on the storage you are using. It's the difference between this being a non-event and something much more interesting. The correct way, btw, to kill backends is with pg_ctl, but what you did is not related to data corruption.

merlin
Re: [GENERAL] Memory Errors
Sam Nelson <s...@consistentstate.com> writes:
> Okay, we're finally getting the last bits of corruption fixed, and I
> finally remembered to ask my boss about the kill script. The only details
> I have are these:
> 1) The script does nothing if there are fewer than 1000 locks on tables
>    in the database.
> 2) If there are 1000 or more locks, it will grab the processes in
>    pg_stat_activity that are in a waiting state.
> 3) For each of the previous processes, it will do a system "kill $pid"
>    call.
> The kill is not pg_terminate_backend or pg_cancel_backend, and it's also
> not a kill -9. Just a normal kill.

SIGTERM, then. Since (according to the other thread) this was 8.3.11, that should in theory be safe; but it's not something I'd consider tremendously well tested before 8.4.x. I'd still lean to the theory of data lost during an EC2 instance shutdown.

regards, tom lane
Re: [GENERAL] Memory Errors
On Wed, Sep 8, 2010 at 6:55 PM, Sam Nelson <s...@consistentstate.com> wrote:
> Even if the corruption wasn't a result of that, we weren't too excited
> about the process being there to begin with. We thought there had to be a
> better solution than just killing the processes. So we had a discussion
> about the intent of that script and my boss came up with something that
> solved the same problem without killing queries, then had them stop that
> daemon, and we have been working with that database to make sure it
> doesn't go screwy again. No new corruption has shown up since stopping
> that daemon.
>
> That memory allocation issue looked drastically different from the toast
> value errors, though, so it seemed like a separate problem. But now it's
> looking like more corruption.
>
> ---
> We're requesting that they do a few things (this is their production
> database, so we usually don't alter any data unless they ask us to),
> including deleting those rows. My memory is insufficient, so there's a
> good chance that I'll forget to post back to the mailing list with the
> results, but I'll try to remember to do so.
>
> Thank you for the help - I'm sure I'll be back soon with many more
> questions.

Any information on repeatable data corruption, whether it is ec2 improperly flushing data on instance resets, postgres misbehaving under atypical conditions, or bad interactions between ec2 and postgres, is highly valuable. The only cases of 'understandable' data corruption are hardware failures, sync issues (either fsync off, or fsync not honored by hardware), torn pages on non-journaling file systems, etc.

Naturally people are going to be skeptical of ec2 since you are so abstracted from the hardware. Maybe all your problems stem from a single explainable incident -- but we definitely want to get to the bottom of this...please keep us updated!

merlin
Re: [GENERAL] Memory Errors
On Wed, Sep 8, 2010 at 12:56 PM, Sam Nelson <s...@consistentstate.com> wrote:
> Hey, a client of ours has been having some data corruption in their
> database. We got the data corruption fixed and we believe we've
> discovered the cause (they had a script killing any waiting queries if
> the locks on their database hit 1000), but they're still getting errors
> from one table:

Not sure that's really the underlying problem. Depending on how they killed the processes there's a slight chance of corruption, but more likely they've got bad hardware. Can they take their machine down for testing? memtest86+ is a good tool to get an idea whether you've got a good cpu/mobo/ram combo or not. The last bit you included definitely looks like something's corrupted in the database.
Re: [GENERAL] Memory Errors
Sam Nelson <s...@consistentstate.com> writes:
> pg_dump: Error message from server: ERROR: invalid memory alloc request
> size 18446744073709551613
> pg_dump: The command was: COPY public.foo (columns) TO stdout;
>
> That seems like an incredibly large memory allocation request - it
> shouldn't be possible for the table to really be that large, should it?
> Any idea what may be wrong if it's actually trying to allocate that much
> memory for a copy command?

What that looks like is data corruption; specifically, a bogus length word for a variable-length field.

regards, tom lane
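Tom's diagnosis can be checked with a little arithmetic: 18446744073709551613 is exactly 2^64 - 3, i.e. a small negative value (-3) read back as an unsigned 64-bit size -- the signature of a trashed varlena length word rather than a genuinely huge allocation. A quick sketch:

```python
# The "impossible" alloc request is a negative length word reinterpreted
# as an unsigned 64-bit size.
import struct

bogus = 18446744073709551613
assert bogus == 2**64 - 3

# Reinterpret the unsigned 64-bit value as signed:
signed, = struct.unpack("<q", struct.pack("<Q", bogus))
print(signed)  # -3
```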
Re: [GENERAL] Memory Errors
It figures I'd have an idea right after posting to the mailing list.

Yeah, running COPY foo TO stdout; gets me a list of data before erroring out, so I did a copy (select * from foo order by id asc) to stdout; to see if I could make some kind of guess as to whether this was related to a single row or something else. I got the id of the last row the copy command was able to grab normally and tried to figure out the next id. The following started to make me think along the lines of some kind of bad corruption (even before getting responses that agreed with that):

Assuming that the last id copied was 1500:

1) select * from foo where id = (select min(id) from foo where id > 1500);
   Results in 0 rows.
2) select min(id) from foo where id > 1500;
   Results in, for example, 20.
3) select max(id) from foo where id > 1500;
   Results in, for example, 9 (a much lower number than returned by min).
4) select id from foo where id > 1500 order by id asc limit 10;
   Results in (for example):
   20, 202000, 210273, 220980, 15005, 15102, 15104, 15110, 15111, 15113

So ... yes, it seems that those four id's are somehow part of the problem.

They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes either), so memtest isn't available, but no new corruption has cropped up since they stopped killing the waiting queries (I just double checked - they were getting corrupted rows constantly, and we haven't gotten one since that script stopped killing queries).

We're going to have them attempt to delete the rows with those id's (even though the rows don't exist), and if that fails, we're going to copy (select * from foo where id not in (list)) to file;, drop table foo;, create table foo;, and copy foo from file;. I'll try to remember to write back with whether or not any of those things worked.

On Wed, Sep 8, 2010 at 1:30 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Sam Nelson <s...@consistentstate.com> writes:
>> pg_dump: Error message from server: ERROR: invalid memory alloc request
>> size 18446744073709551613
>> pg_dump: The command was: COPY public.foo (columns) TO stdout;
>> That seems like an incredibly large memory allocation request - it
>> shouldn't be possible for the table to really be that large, should it?
>> Any idea what may be wrong if it's actually trying to allocate that much
>> memory for a copy command?
>
> What that looks like is data corruption; specifically, a bogus length
> word for a variable-length field.
>
> regards, tom lane
Re: [GENERAL] Memory Errors
On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson <s...@consistentstate.com> wrote:
> It figures I'd have an idea right after posting to the mailing list.
>
> Yeah, running COPY foo TO stdout; gets me a list of data before erroring
> out, so I did a copy (select * from foo order by id asc) to stdout; to
> see if I could make some kind of guess as to whether this was related to
> a single row or something else. I got the id of the last row the copy
> command was able to grab normally and tried to figure out the next id.
> The following started to make me think along the lines of some kind of
> bad corruption (even before getting responses that agreed with that):
>
> Assuming that the last id copied was 1500:
> 1) select * from foo where id = (select min(id) from foo where id > 1500);
>    Results in 0 rows.
> 2) select min(id) from foo where id > 1500;
>    Results in, for example, 20.
> 3) select max(id) from foo where id > 1500;
>    Results in, for example, 9 (a much lower number than returned by min).
> 4) select id from foo where id > 1500 order by id asc limit 10;
>    Results in (for example):
>    20, 202000, 210273, 220980, 15005, 15102, 15104, 15110, 15111, 15113
>
> So ... yes, it seems that those four id's are somehow part of the
> problem.
>
> They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
> either), so memtest isn't available, but no new corruption has cropped up
> since they stopped killing the waiting queries (I just double checked -
> they were getting corrupted rows constantly, and we haven't gotten one
> since that script stopped killing queries).

That's actually a startling indictment of ec2 -- how were you killing your queries exactly? You say this is repeatable? What's your setting of full_page_writes?

One way to identify and potentially nuke bad records of this kind is to do something like:

select length(field1) from foo order by 1 desc limit 5;

where field1 is the first varlen field (text, bytea, etc) in left-to-right order. Look for bogusly high values and move on to the next field if you don't find any. Once you hit a bad value, try deleting the record by its key. Once you've found/deleted them all, immediately pull off a dump, then rebuild the table. As always, take a filesystem dump before doing this type of surgery...

merlin
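Merlin's scan runs column by column in SQL; applied client-side to whatever rows a COPY can still pull out, the same idea looks roughly like this (a sketch only -- the sanity cutoff and the names are made up for illustration, not anything from the thread):

```python
# Flag keys whose variable-length value is implausibly long; a corrupted
# length word usually shows up as an absurd length, not a merely large one.
SANE_MAX = 1 << 20  # assumed cutoff: nothing in this table should top 1 MB

def suspect_keys(rows, sane_max=SANE_MAX):
    """rows: (key, value) pairs for one text/bytea column; return the keys
    of rows whose value exceeds the sanity cutoff."""
    return [key for key, val in rows if val is not None and len(val) > sane_max]

sample = [(1, "ok"), (2, "x" * (SANE_MAX + 1)), (3, None)]
print(suspect_keys(sample))  # [2]
```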
Re: [GENERAL] Memory Errors
Merlin Moncure <mmonc...@gmail.com> writes:
> On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson <s...@consistentstate.com> wrote:
>> So ... yes, it seems that those four id's are somehow part of the
>> problem.
>> They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
>> either), so memtest isn't available, but no new corruption has cropped
>> up since they stopped killing the waiting queries (I just double checked
>> - they were getting corrupted rows constantly, and we haven't gotten one
>> since that script stopped killing queries).
>
> That's actually a startling indictment of ec2 -- how were you killing
> your queries exactly? You say this is repeatable? What's your setting of
> full_page_writes?

I think we'd established that they were doing kill -9 on backend processes :-(. However, PG has a lot of track record that says that backend crashes don't result in corrupt data. What seems more likely to me is that the corruption is the result of some shortcut taken while shutting down or migrating the ec2 instance, so that some writes that Postgres thought got to disk didn't really.

regards, tom lane
Re: [GENERAL] Memory Errors
My (our) complaints about EC2 aren't particularly extensive, but the last time I posted to the mailing list saying they were using EC2, the first reply was someone saying that the corruption was the fault of EC2. Not that we don't have complaints at all (there are some aspects that are very frustrating), but I was just trying to stave off anyone who was going to reply saying "Tell them to stop using EC2."

---
More detail about the script that kills queries:

Honestly, we (or, at least, I) haven't discovered which type of kill they were doing, but it does seem to be the culprit in some way. I don't talk to the customers (that's my boss's job), so I didn't get to ask specifics. If my boss did ask specifics, he didn't tell me.

The previous issue involved toast corruption showing up very regularly (e.g. once a day, in some cases), the end result being that we had to delete the corrupted rows. Coming back the next day to see the same corruption on different rows was not very encouraging. We found out after that that they had a script running as a daemon that would, every ten minutes (I believe), check the number of locks on the table and kill all waiting queries if there were >= 1000 locks.

Even if the corruption wasn't a result of that, we weren't too excited about the process being there to begin with. We thought there had to be a better solution than just killing the processes. So we had a discussion about the intent of that script and my boss came up with something that solved the same problem without killing queries, then had them stop that daemon, and we have been working with that database to make sure it doesn't go screwy again. No new corruption has shown up since stopping that daemon.

That memory allocation issue looked drastically different from the toast value errors, though, so it seemed like a separate problem. But now it's looking like more corruption.

---
We're requesting that they do a few things (this is their production database, so we usually don't alter any data unless they ask us to), including deleting those rows. My memory is insufficient, so there's a good chance that I'll forget to post back to the mailing list with the results, but I'll try to remember to do so.

Thank you for the help - I'm sure I'll be back soon with many more questions.

-Sam

On Wed, Sep 8, 2010 at 2:58 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmonc...@gmail.com> writes:
>> That's actually a startling indictment of ec2 -- how were you killing
>> your queries exactly? You say this is repeatable? What's your setting of
>> full_page_writes?
>
> I think we'd established that they were doing kill -9 on backend
> processes :-(. However, PG has a lot of track record that says that
> backend crashes don't result in corrupt data. What seems more likely to
> me is that the corruption is the result of some shortcut taken while
> shutting down or migrating the ec2 instance, so that some writes that
> Postgres thought got to disk didn't really.
>
> regards, tom lane
Re: [GENERAL] Memory Errors OS X
Jeffrey Melloy <[EMAIL PROTECTED]> writes:
> I attempted to install 8.0 RC 2 alongside 7.4.5 on my OS X box, but
> initdb failed with an error about not enough shared memory.

Don't forget that both shmmax and shmall may need attention ... and, just to confuse matters, they are measured in different units.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [GENERAL] Memory Errors OS X
What version of OS X? Apparently some of the earlier versions did not permit changing this parameter without recompiling the kernel. It seems to have been changed in the more recent versions, though:

http://www.opendarwin.org/pipermail/hackers/2002-August/003583.html
http://borkware.com/rants/openacs/
http://www.ssec.wisc.edu/mug/users_guide/SharedMemory.html

A note from that last URL is that shmall*4096 = shmmax. And yes, trying to manually run an rc script is a bad idea.

On Dec 22, 2004, at 3:44 PM, Jeffrey Melloy wrote:
> I attempted to install 8.0 RC 2 alongside 7.4.5 on my OS X box, but
> initdb failed with an error about not enough shared memory. Remembering
> that this was a problem for starting two postmasters at the same time on
> OS X, I increased the shmmax value to 500 megabytes (I had seen
> something say raising it to half the available ram would be fine), but
> when I rebooted my machine neither 8.0 nor 7.4.5 would start. So I
> lowered it to 256 megabytes, thinking there might be an upper limit on
> that kind of stuff. When I rebooted my machine, 7.4.5 starts fine, but
> 8.0 still will not start alongside it. I don't particularly need both
> postmasters running at the same time, but I would like to figure out the
> solution to this problem.
>
> (By the way, in the course of this I attempted to manually run /etc/rc
> ... there were humorous results and my computer didn't really like it:
> http://www.visualdistortion.org/misc/dont_do_this.png)
>
> Jeffrey Melloy
> [EMAIL PROTECTED]

---
Frank D. Engel, Jr. <[EMAIL PROTECTED]>
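Given the shmall*4096 = shmmax relation from that last link (shmall is counted in 4 kB pages, shmmax in bytes), the matching shmall for a 256 MB shmmax works out as below. This is just the arithmetic, not OS X-specific code:

```python
# shmmax is measured in bytes; shmall in 4096-byte pages on this platform.
PAGE_SIZE = 4096

shmmax = 256 * 1024 * 1024   # 268435456 bytes (the 256 MB setting above)
shmall = shmmax // PAGE_SIZE

print(shmmax, shmall)  # 268435456 65536
```

These are exactly the values reported to work later in the thread.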
Re: [GENERAL] Memory Errors OS X
Tom Lane wrote:
> Jeffrey Melloy <[EMAIL PROTECTED]> writes:
>> I attempted to install 8.0 RC 2 alongside 7.4.5 on my OS X box, but
>> initdb failed with an error about not enough shared memory.
>
> Don't forget that both shmmax and shmall may need attention ... and,
> just to confuse matters, they are measured in different units.

I didn't realize that they were in different units. Setting shmmax to 268435456 and shmall to 65536 works fine.

Thanks,
Jeff