Re: [GENERAL] Memory Errors

2010-09-21 Thread Sam Nelson
Okay, we're finally getting the last bits of corruption fixed, and I finally
remembered to ask my boss about the kill script.

The only details I have are these:

1) The script does nothing if there are fewer than 1000 locks on tables in
the database

2) If there are 1000 or more locks, it will grab the processes in
pg_stat_activity that are in a waiting state

3) for each of the previous processes, it will do a system kill $pid call

The kill is not pg_terminate_backend or pg_cancel_backend, and it's also not
a kill -9.  Just a normal kill.
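
We don't have the actual script itself, but a rough sketch of the queries it presumably ran (assuming the 8.3 catalogs, where pg_stat_activity exposes procpid and a boolean waiting column) would be something like:

    -- step 1: threshold check, do nothing below 1000 table locks
    SELECT count(*) FROM pg_locks WHERE locktype = 'relation';

    -- step 2: backends whose current query is waiting on a lock
    SELECT procpid FROM pg_stat_activity WHERE waiting;

    -- step 3: for each procpid returned, the script ran a plain "kill <pid>" at the OS level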

As far as the postgres and EC2 instances go, we're not really sure if anyone
shut down, created, or migrated them in a weird way, but Kevin (my boss)
said that it wouldn't surprise him.

All I can say is that whereas we were getting one new row of corruption every
day while the kill script was running, we haven't seen any new corruption
since we stopped it.

As far as the table with memory errors goes, we had asked them to rebuild
the table, and they came back saying that they no longer need that table.
So they're just going to drop it.

We'll try to keep digging, but I'm not sure we'll get much more info than
that.  We're quite busy and my ability to remember things is ...
questionable.

-Sam

On Thu, Sep 9, 2010 at 8:14 AM, Merlin Moncure mmonc...@gmail.com wrote:

 On Wed, Sep 8, 2010 at 6:55 PM, Sam Nelson s...@consistentstate.com wrote:
  Even if the corruption wasn't a result of that, we weren't too excited about
  the process being there to begin with.  We thought there had to be a better
  solution than just killing the processes.  So we had a discussion about the
  intent of that script and my boss dealt with something that solved the same
  problem without killing queries, then had them stop that daemon and we have
  been working with that database to make sure it doesn't go screwy again.  No
  new corruption has shown up since stopping that daemon.
  That memory allocation issue looked drastically different from the toast
  value errors, though, so it seemed like a separate problem.  But now it's
  looking like more corruption.
  ---
  We're requesting that they do a few things (this is their production
  database, so we usually don't alter any data unless they ask us to),
  including deleting those rows.  My memory is insufficient, so there's a good
  chance that I'll forget to post back to the mailing list with the results,
  but I'll try to remember to do so.
  Thank you for the help - I'm sure I'll be back soon with many more
  questions.

 Any information on repeatable data corruption, whether it is ec2
 improperly flushing data on instance resets, postgres misbehaving
 under atypical conditions, or bad interactions between ec2 and
 postgres is highly valuable.  The only cases of 'understandable' data
 corruption are hardware failures, sync issues (either fsync off, or
 fsync not honored by hardware), torn pages on non journaling file
 systems, etc.

 Naturally people are going to be skeptical of ec2 since you are so
 abstracted from the hardware.  Maybe all your problems stem from a
 single explainable incident -- but we definitely want to get to the
 bottom of this...please keep us updated!

 merlin




Re: [GENERAL] Memory Errors

2010-09-21 Thread Merlin Moncure
On Tue, Sep 21, 2010 at 12:57 PM, Sam Nelson s...@consistentstate.com wrote:
 On Thu, Sep 9, 2010 at 8:14 AM, Merlin Moncure mmonc...@gmail.com wrote:
 Naturally people are going to be skeptical of ec2 since you are so
 abstracted from the hardware.  Maybe all your problems stem from a
 single explainable incident -- but we definitely want to get to the
 bottom of this...please keep us updated!

 As far as the postgres and EC2 instances go, we're not really sure if anyone
 shut down, created, or migrated them in a weird way, but Kevin (my boss)
 said that it wouldn't surprise him.

please try to avoid top-posting -- it destroys the context of the conversation

The shutdown/migration point is key, along with fsync settings and a
description of whatever durability guarantees ec2 gives on the storage
you are using.  It's the difference between this being a non-event and
something much more interesting.  The correct way btw to kill backends
is with pg_ctl, but what you did is not related to data corruption.
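
For reference, those settings can be checked directly on the server (nothing here is EC2-specific):

    SHOW fsync;               -- should be on
    SHOW full_page_writes;    -- should be on unless the storage guarantees atomic page writes
    SHOW synchronous_commit;  -- off only risks losing recent commits, not corruption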

merlin



Re: [GENERAL] Memory Errors

2010-09-21 Thread Tom Lane
Sam Nelson s...@consistentstate.com writes:
 Okay, we're finally getting the last bits of corruption fixed, and I finally
 remembered to ask my boss about the kill script.

 The only details I have are these:

 1) The script does nothing if there are fewer than 1000 locks on tables in
 the database

 2) If there are 1000 or more locks, it will grab the processes in
 pg_stat_activity that are in a waiting state

 3) for each of the previous processes, it will do a system kill $pid call

 The kill is not pg_terminate_backend or pg_cancel_backend, and it's also not
 a kill -9.  Just a normal kill.

SIGTERM then.  Since (according to the other thread) this was 8.3.11,
that should in theory be safe; but it's not something I'd consider
tremendously well tested before 8.4.x.
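
For reference, the in-server equivalent of that signal is pg_cancel_backend()
(SIGINT, available in 8.3); pg_terminate_backend() (SIGTERM) only arrived in
8.4.  A sketch against the 8.3 catalogs:

    -- cancel the current query of every waiting backend
    SELECT pg_cancel_backend(procpid) FROM pg_stat_activity WHERE waiting;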

I'd still lean to the theory of data lost during an EC2 instance
shutdown.

regards, tom lane



Re: [GENERAL] Memory Errors

2010-09-09 Thread Merlin Moncure
On Wed, Sep 8, 2010 at 6:55 PM, Sam Nelson s...@consistentstate.com wrote:
 Even if the corruption wasn't a result of that, we weren't too excited about
 the process being there to begin with.  We thought there had to be a better
 solution than just killing the processes.  So we had a discussion about the
 intent of that script and my boss dealt with something that solved the same
 problem without killing queries, then had them stop that daemon and we have
 been working with that database to make sure it doesn't go screwy again.  No
 new corruption has shown up since stopping that daemon.
 That memory allocation issue looked drastically different from the toast
 value errors, though, so it seemed like a separate problem.  But now it's
 looking like more corruption.
 ---
 We're requesting that they do a few things (this is their production
 database, so we usually don't alter any data unless they ask us to),
 including deleting those rows.  My memory is insufficient, so there's a good
 chance that I'll forget to post back to the mailing list with the results,
 but I'll try to remember to do so.
 Thank you for the help - I'm sure I'll be back soon with many more
 questions.

Any information on repeatable data corruption, whether it is ec2
improperly flushing data on instance resets, postgres misbehaving
under atypical conditions, or bad interactions between ec2 and
postgres is highly valuable.  The only cases of 'understandable' data
corruption are hardware failures, sync issues (either fsync off, or
fsync not honored by hardware), torn pages on non journaling file
systems, etc.

Naturally people are going to be skeptical of ec2 since you are so
abstracted from the hardware.  Maybe all your problems stem from a
single explainable incident -- but we definitely want to get to the
bottom of this...please keep us updated!

merlin



Re: [GENERAL] Memory Errors

2010-09-08 Thread Scott Marlowe
On Wed, Sep 8, 2010 at 12:56 PM, Sam Nelson s...@consistentstate.com wrote:
 Hey, a client of ours has been having some data corruption in their
 database.  We got the data corruption fixed and we believe we've discovered
 the cause (they had a script killing any waiting queries if the locks on
 their database hit 1000), but they're still getting errors from one table:

Not sure that's really the underlying problem. Depending on how they
killed the processes there's a slight chance for corruption, but more
likely they've got bad hardware.  Can they take their machine down for
testing?  memtest86+ is a good tool to get an idea of whether you've got a
good CPU/mobo/RAM combo or not.

The last bit you included definitely looks like something's corrupted
in the database.



Re: [GENERAL] Memory Errors

2010-09-08 Thread Tom Lane
Sam Nelson s...@consistentstate.com writes:
 pg_dump: Error message from server: ERROR:  invalid memory alloc request
 size 18446744073709551613
 pg_dump: The command was: COPY public.foo (columns) TO stdout;

 That seems like an incredibly large memory allocation request - it shouldn't
 be possible for the table to really be that large, should it?  Any idea what
 may be wrong if it's actually trying to allocate that much memory for a copy
 command?

What that looks like is data corruption; specifically, a bogus length
word for a variable-length field.
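
For what it's worth, the requested size is just a small negative number that
has been reinterpreted as an unsigned 64-bit value, which is exactly what a
trashed length word produces:

    -- 2^64 = 18446744073709551616, so the request is (unsigned) -3
    SELECT 18446744073709551613::numeric - 18446744073709551616::numeric;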

regards, tom lane



Re: [GENERAL] Memory Errors

2010-09-08 Thread Sam Nelson
It figures I'd have an idea right after posting to the mailing list.

Yeah, running COPY foo TO stdout; gets me a list of data before erroring
out, so I did a copy (select * from foo order by id asc) to stdout; to see
if I could make some kind of guess as to whether this was related to a
single row or something else.

I got the id of the last row the copy to command was able to grab normally
and tried to figure out the next id.  The following started to make me think
along the lines of some kinda bad corruption (even before getting responses
that agreed with that):

Assuming that the last id copied was 1500:

1) select * from foo where id = (select min(id) from foo where id > 1500);
Results in 0 rows

2) select min(id) from foo where id > 1500;
Results in, for example, 20

3) select max(id) from foo where id > 1500;
Results in, for example, 9 (a much lower number than returned by min)

4) select id from foo where id > 1500 order by id asc limit 10;
Results in (for example):

20
202000
210273
220980
15005
15102
15104
15110
15111
15113

So ... yes, it seems that those four id's are somehow part of the problem.

They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
either), so memtest isn't available, but no new corruption has cropped up
since they stopped killing the waiting queries (I just double checked - they
were getting corrupted rows constantly, and we haven't gotten one since that
script stopped killing queries).

We're going to have them attempt to delete the rows with those id's (even
though the rows don't exist) and if that fails, we're going to copy (select
* from foo where id not in (list)) to file;, drop table foo;, create table
foo;, and copy foo from file.  I'll try to remember to write back with
whether or not any of those things worked.
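
A sketch of that plan, using the example ids above (the file path is
illustrative, and the CREATE's column list is omitted since it depends on the
real table definition):

    -- server-side dump of everything except the suspect rows
    COPY (SELECT * FROM foo WHERE id NOT IN (20, 202000, 210273, 220980))
      TO '/tmp/foo_good.copy';
    -- after verifying the file:
    --   DROP TABLE foo;
    --   CREATE TABLE foo ( ... );            -- same definition as before
    --   COPY foo FROM '/tmp/foo_good.copy';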

On Wed, Sep 8, 2010 at 1:30 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Sam Nelson s...@consistentstate.com writes:
  pg_dump: Error message from server: ERROR:  invalid memory alloc request size 18446744073709551613
  pg_dump: The command was: COPY public.foo (columns) TO stdout;

  That seems like an incredibly large memory allocation request - it shouldn't
  be possible for the table to really be that large, should it?  Any idea what
  may be wrong if it's actually trying to allocate that much memory for a copy
  command?

 What that looks like is data corruption; specifically, a bogus length
 word for a variable-length field.

regards, tom lane



Re: [GENERAL] Memory Errors

2010-09-08 Thread Merlin Moncure
On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson s...@consistentstate.com wrote:
 It figures I'd have an idea right after posting to the mailing list.
 Yeah, running COPY foo TO stdout; gets me a list of data before erroring
 out, so I did a copy (select * from foo order by id asc) to stdout; to see
 if I could make some kind of guess as to whether this was related to a
 single row or something else.
 I got the id of the last row the copy to command was able to grab normally
 and tried to figure out the next id.  The following started to make me think
 along the lines of some kinda bad corruption (even before getting responses
 that agreed with that):
 Assuming that the last id copied was 1500:
 1) select * from foo where id = (select min(id) from foo where id > 1500);
 Results in 0 rows
 2) select min(id) from foo where id > 1500;
 Results in, for example, 20
 3) select max(id) from foo where id > 1500;
 Results in, for example, 9 (a much lower number than returned by min)
 4) select id from foo where id > 1500 order by id asc limit 10;
 Results in (for example):
 20
 202000
 210273
 220980
 15005
 15102
 15104
 15110
 15111
 15113
 So ... yes, it seems that those four id's are somehow part of the problem.
 They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
 either), so memtest isn't available, but no new corruption has cropped up
 since they stopped killing the waiting queries (I just double checked - they
 were getting corrupted rows constantly, and we haven't gotten one since that
 script stopped killing queries).

That's actually a startling indictment of ec2 -- how were you killing
your queries exactly?  You say this is repeatable?  What's your
setting of full_page_writes?

one way to identify and potentially nuke bad records of this kind is
to do something like:

select length(field1) from foo order by 1 desc limit 5;

where field1 is the first varlen field (text, bytea, etc) in left-to-right
order.  look for bogusly high values and move on to the next field if you
don't find any.  once you hit a bad value, try deleting the record by its key.

once you've found/deleted them all,  immediately pull off a dump, then
rebuild the table.  as always, take a filesystem dump before doing
this type of surgery...
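
A slightly fuller sketch of that procedure, with hypothetical names
foo/id/field1 (note that length() has to detoast the value, so a corrupt row
may abort the query with the same alloc error rather than just showing a huge
number):

    -- top candidate lengths for the first variable-length column
    SELECT id, length(field1) AS len FROM foo ORDER BY len DESC LIMIT 5;
    -- if a bogus length (or an error) pinpoints a row, delete it by key:
    -- DELETE FROM foo WHERE id = <bad id>;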

merlin



Re: [GENERAL] Memory Errors

2010-09-08 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson s...@consistentstate.com wrote:
 So ... yes, it seems that those four id's are somehow part of the problem.
 They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
 either), so memtest isn't available, but no new corruption has cropped up
 since they stopped killing the waiting queries (I just double checked - they
 were getting corrupted rows constantly, and we haven't gotten one since that
 script stopped killing queries).

 That's actually a startling indictment of ec2 -- how were you killing
 your queries exactly?  You say this is repeatable?  What's your
 setting of full_page_writes?

I think we'd established that they were doing kill -9 on backend
processes :-(.  However, PG has a lot of track record that says that
backend crashes don't result in corrupt data.  What seems more likely
to me is that the corruption is the result of some shortcut taken while
shutting down or migrating the ec2 instance, so that some writes that
Postgres thought got to disk didn't really.

regards, tom lane



Re: [GENERAL] Memory Errors

2010-09-08 Thread Sam Nelson
My (our) complaints about EC2 aren't particularly extensive, but last time I
posted to the mailing list saying they were using EC2, the first reply was
someone saying that the corruption was the fault of EC2.

Not that we don't have complaints at all (there are some aspects that are
very frustrating), but I was just trying to stave off anyone who was going
to reply saying "Tell them to stop using EC2."

 -- More detail about the script that kills queries:

Honestly, we (or, at least, I) haven't discovered which type of kill they
were doing, but it does seem to be the culprit in some way.  I don't talk to
the customers (that's my boss's job), so I didn't get to ask specifics.  If
my boss did ask specifics, he didn't tell me.

The previous issue involved toast corruption showing up very regularly (e.g.
once a day, in some cases), the end result being that we had to delete the
corrupted rows.  Coming back the next day to see the same corruption on
different rows was not very encouraging.

We found out after that that they had a script running as a daemon that
would, every ten minutes (I believe), check the number of locks on the table
and kill all waiting queries if there were >= 1000 locks.

Even if the corruption wasn't a result of that, we weren't too excited about
the process being there to begin with.  We thought there had to be a better
solution than just killing the processes.  So we had a discussion about the
intent of that script and my boss dealt with something that solved the same
problem without killing queries, then had them stop that daemon and we have
been working with that database to make sure it doesn't go screwy again.  No
new corruption has shown up since stopping that daemon.

That memory allocation issue looked drastically different from the toast
value errors, though, so it seemed like a separate problem.  But now it's
looking like more corruption.

---

We're requesting that they do a few things (this is their production
database, so we usually don't alter any data unless they ask us to),
including deleting those rows.  My memory is insufficient, so there's a good
chance that I'll forget to post back to the mailing list with the results,
but I'll try to remember to do so.

Thank you for the help - I'm sure I'll be back soon with many more
questions.

-Sam

On Wed, Sep 8, 2010 at 2:58 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Merlin Moncure mmonc...@gmail.com writes:
  On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson s...@consistentstate.com wrote:
   So ... yes, it seems that those four id's are somehow part of the problem.
   They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
   either), so memtest isn't available, but no new corruption has cropped up
   since they stopped killing the waiting queries (I just double checked - they
   were getting corrupted rows constantly, and we haven't gotten one since that
   script stopped killing queries).

  That's actually a startling indictment of ec2 -- how were you killing
  your queries exactly?  You say this is repeatable?  What's your
  setting of full_page_writes?

 I think we'd established that they were doing kill -9 on backend
 processes :-(.  However, PG has a lot of track record that says that
 backend crashes don't result in corrupt data.  What seems more likely
 to me is that the corruption is the result of some shortcut taken while
 shutting down or migrating the ec2 instance, so that some writes that
 Postgres thought got to disk didn't really.

regards, tom lane



Re: [GENERAL] Memory Errors OS X

2004-12-22 Thread Tom Lane
Jeffrey Melloy [EMAIL PROTECTED] writes:
 I attempted to install 8.0 RC 2 alongside 7.4.5 on my OS X box, but 
 initdb failed with an error about not enough shared memory.

Don't forget that both shmmax and shmall may need attention ... and,
just to confuse matters, they are measured in different units.

regards, tom lane



Re: [GENERAL] Memory Errors OS X

2004-12-22 Thread Frank D. Engel, Jr.
What version of OS X?

Apparently some of the earlier versions did not permit changing this parameter without recompiling the kernel.  It seems to have been changed in the more recent versions, though:

http://www.opendarwin.org/pipermail/hackers/2002-August/003583.html
http://borkware.com/rants/openacs/
http://www.ssec.wisc.edu/mug/users_guide/SharedMemory.html

A note from that last URL is that shmall is measured in 4 kB pages, so shmall*4096 = shmmax.

And yes, trying to manually run an rc script is a bad idea.
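
For later readers: on OS X releases that read /etc/sysctl.conf at boot (roughly
10.3.9 and later) the SysV knobs can be set there instead of editing /etc/rc; a
sketch with illustrative values that keep the shmall*4096 = shmmax relation (on
some releases all five values apparently must be set together or none take effect):

    kern.sysv.shmmax=268435456
    kern.sysv.shmmin=1
    kern.sysv.shmmni=32
    kern.sysv.shmseg=8
    kern.sysv.shmall=65536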

On Dec 22, 2004, at 3:44 PM, Jeffrey Melloy wrote:

I attempted to install 8.0 RC 2 alongside 7.4.5 on my OS X box, but initdb failed with an error about not enough shared memory.

Remembering that this was a problem for starting two postmasters at the same time on OS X, I increased the shmmax value to 500 megabytes (I had seen something say raising it to half the available ram would be fine), but when I rebooted my machine neither 8.0 or 7.4.5 would start.

So I lowered it to 256 megabytes, thinking there might be an upper limit on that kind of stuff.  When I rebooted my machine, 7.4.5 starts fine, but 8.0 still will not start alongside it.

I don't particularly need both postmasters running at the same time, but I would like to figure out the solution to this problem.

(By the way, in the course of this I attempted to manually run /etc/rc ... there were humorous results and my computer didn't really like it:  http://www.visualdistortion.org/misc/dont_do_this.png)

Jeffrey Melloy
[EMAIL PROTECTED]



---
Frank D. Engel, Jr.  [EMAIL PROTECTED]

$ ln -s /usr/share/kjvbible /usr/manual
$ true | cat /usr/manual | grep "John 3:16"
John 3:16 For God so loved the world, that he gave his only begotten Son, that whosoever believeth in him should not perish, but have everlasting life.
$ 




Re: [GENERAL] Memory Errors OS X

2004-12-22 Thread Jeffrey Melloy
Tom Lane wrote:
Jeffrey Melloy [EMAIL PROTECTED] writes:
I attempted to install 8.0 RC 2 alongside 7.4.5 on my OS X box, but
initdb failed with an error about not enough shared memory.

Don't forget that both shmmax and shmall may need attention ... and,
just to confuse matters, they are measured in different units.
			regards, tom lane

I didn't realize that they were different units.  Setting shmmax to 
268435456 and shmall to 65536 works fine.

Thanks,
Jeff