Re: [PERFORM] Table locking problems?

2005-08-09 Thread Joshua D. Drake


Could these locking issues be related to the other changes I made?  I'm 
really scared that this is related to choosing XFS, but I sure  hope 
not.   How should I go about troubleshooting the problem  queries?  
They don't seem to be specific to a single table or single  database.


My experience is that when this type of thing happens it is typically 
specific queries that cause the problem. If you turn on statement 
logging you can get the exact queries and debug from there.


Here are some things to look for:

Is it a large table (and thus large indexes) that it is updating?
Is the query using indexes?
Is the query modifying ALOT of rows?

Of course there is also the RTFM of are you analyzing and vacuuming?

Sincerely,

Joshua D. Drake




I'm running 8.0.1 on kernel 2.6.12-3 on 64-bit Opterons if that  matters..


-Dan

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match



--
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Table locking problems?

2005-08-09 Thread Michael Fuhr
On Tue, Aug 09, 2005 at 12:04:11PM -0600, Dan Harris wrote:
 These changes have definitely improved performance, but I am now  
 finding some trouble with UPDATE or DELETE queries hanging and  
 never releasing their locks.  As this happens, other statements queue  
 up behind it.

Have you examined pg_locks to see if the UPDATE or DELETE is blocked
because of a lock another session holds?

Are you using foreign keys?  When updating referencing rows, released
versions of PostgreSQL acquire a lock on the referenced row that can
hurt concurrency or cause deadlock (this will be improved in 8.1).

-- 
Michael Fuhr

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Table locking problems?

2005-08-09 Thread Steve Poe
Dan,

Do you mean you did RAID 1 + 0 (RAID 10) or RAID 0 + 1? Just a
clarification, since RAID 0 is still a single-point of failure even if
RAID1 is on top of RAID0. 

How many users are connected when your update / delete queries are
hanging? Have you done an analyze verbose on those queries?

Have you made changes to the postgresql.conf? kernel.vm settings? IO
scheduler?

If you're not doing so already, you may consider running sar (iostat) to
monitor when the hanging occurs if their is a memory / IO bottleneck
somewhere.

Good luck.

Steve Poe


On Tue, 2005-08-09 at 12:04 -0600, Dan Harris wrote:
 I thought I would send this to pg-performance since so many people  
 helped me with my speed issues recently.  I was definitely IO- 
 bottlenecked.
 
 Since then, I have installed 2 RAID arrays with 7 15k drives in them  
 in RAID 0+1 as well as add a new controller card with 512MB of cache  
 on it.  I also created this new partition on the RAID as XFS instead  
 of ext3.
 
 These changes have definitely improved performance, but I am now  
 finding some trouble with UPDATE or DELETE queries hanging and  
 never releasing their locks.  As this happens, other statements queue  
 up behind it.  It seems to occur at times of very high loads on the  
 box.  Is my only option to kill the query ( which usually takes down  
 the whole postmaster with it! ouch ).
 
 Could these locking issues be related to the other changes I made?  
 I'm really scared that this is related to choosing XFS, but I sure  
 hope not.   How should I go about troubleshooting the problem  
 queries?  They don't seem to be specific to a single table or single  
 database.
 
 I'm running 8.0.1 on kernel 2.6.12-3 on 64-bit Opterons if that  
 matters..
 
 
 -Dan
 
 ---(end of broadcast)---
 TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Table locking problems?

2005-08-09 Thread Tom Lane
Joshua D. Drake [EMAIL PROTECTED] writes:
 My experience is that when this type of thing happens it is typically 
 specific queries that cause the problem. If you turn on statement 
 logging you can get the exact queries and debug from there.

 Here are some things to look for:

 Is it a large table (and thus large indexes) that it is updating?
 Is the query using indexes?
 Is the query modifying ALOT of rows?

Another thing to look at is foreign keys.  Dan could be running into
problems with an update on one side of an FK being blocked by locks
on the associated rows on the other side.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Table locking problems?

2005-08-09 Thread Dan Harris


On Aug 9, 2005, at 1:08 PM, Tom Lane wrote:


Joshua D. Drake [EMAIL PROTECTED] writes:


My experience is that when this type of thing happens it is typically
specific queries that cause the problem. If you turn on statement
logging you can get the exact queries and debug from there.





Here are some things to look for:





Is it a large table (and thus large indexes) that it is updating?
Is the query using indexes?
Is the query modifying ALOT of rows?



Another thing to look at is foreign keys.  Dan could be running into
problems with an update on one side of an FK being blocked by locks
on the associated rows on the other side.

regards, tom lane



Tom, Steve, Josh:

Thank you for your ideas.  The updates are only on a single table, no  
joins.  I had stats collection turned off.  I have turned that on  
again so that I can try and catch one while the problem is  
occurring.  The last table it did this on was about 3 million  
records.  4 single-column indexes on it.


The problem I had with statement logging is that if the query never  
finishes, it doesn't get logged as far as I can tell.  So everything  
that did get logged was normal and would run with no isses in psql by  
copy and pasting it.  The rows updated will certainly vary by query.   
I really need to catch it in the act with stats collection on so I  
can get the query from pg_stat_activity.  Once I get it, I will play  
with explains and see if I can reproduce it outside the wild.


Thanks again for your help.

-Dan


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Table locking problems?

2005-08-09 Thread Dan Harris


On Aug 10, 2005, at 12:49 AM, Steve Poe wrote:


Dan,

Do you mean you did RAID 1 + 0 (RAID 10) or RAID 0 + 1? Just a
clarification, since RAID 0 is still a single-point of failure even if
RAID1 is on top of RAID0.


Well, you tell me if I stated incorrectly.  There are two raid  
enclosures with 7 drives in each.  Each is on its own bus on a dual- 
channel controller.  Each box has a stripe across its drives and the  
enclosures are mirrors of each other.  I understand the controller  
could be a single point of failure, but I'm not sure I understand  
your concern about the RAID structure itself.




How many users are connected when your update / delete queries are
hanging? Have you done an analyze verbose on those queries?


Most of the traffic is from programs we run to do analysis of the  
data and managing changes.  At the time I noticed it this morning,  
there were 10 connections open to the database.  That rarely goes  
above 20 concurrent.  As I said in my other response, I believe that  
the log will only contain the query at the point the query finishes,  
so if it never finishes...




Have you made changes to the postgresql.conf? kernel.vm settings? IO
scheduler?


I set shmmax appropriately for my shared_buffers setting, but that's  
the only kernel tweak.




If you're not doing so already, you may consider running sar  
(iostat) to

monitor when the hanging occurs if their is a memory / IO bottleneck
somewhere.



I will try that.  Thanks



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Table locking problems?

2005-08-09 Thread John A Meinel

Dan Harris wrote:


On Aug 10, 2005, at 12:49 AM, Steve Poe wrote:


Dan,

Do you mean you did RAID 1 + 0 (RAID 10) or RAID 0 + 1? Just a
clarification, since RAID 0 is still a single-point of failure even if
RAID1 is on top of RAID0.



Well, you tell me if I stated incorrectly.  There are two raid  
enclosures with 7 drives in each.  Each is on its own bus on a dual- 
channel controller.  Each box has a stripe across its drives and the  
enclosures are mirrors of each other.  I understand the controller  
could be a single point of failure, but I'm not sure I understand  your 
concern about the RAID structure itself.


In this configuration, if you have a drive fail on both controllers, the 
entire RAID dies. Lets label them A1-7, B1-7, because you stripe within 
a set, if a single one of A dies, and a single one of B dies, you have 
lost your entire mirror.


The correct way of doing it, is to have A1 be a mirror of B1, and then 
stripe above that. Since you are using 2 7-disk enclosures, I'm not sure 
how you can do it well, since it is not an even number of disks. Though 
if you are using software RAID, there should be no problem.


The difference is that in this scenario, *all* of the A drives can die, 
and you haven't lost any data. The only thing you can't lose is a 
matched pair (eg losing both A1 and B1 will cause complete data loss)


I believe the correct notation for this last form is RAID 1 + 0 (RAID10) 
since you have a set of RAID1 drives, with a RAID0 on-top of them.






How many users are connected when your update / delete queries are
hanging? Have you done an analyze verbose on those queries?



Most of the traffic is from programs we run to do analysis of the  data 
and managing changes.  At the time I noticed it this morning,  there 
were 10 connections open to the database.  That rarely goes  above 20 
concurrent.  As I said in my other response, I believe that  the log 
will only contain the query at the point the query finishes,  so if it 
never finishes...




Have you made changes to the postgresql.conf? kernel.vm settings? IO
scheduler?



I set shmmax appropriately for my shared_buffers setting, but that's  
the only kernel tweak.




If you're not doing so already, you may consider running sar  (iostat) to
monitor when the hanging occurs if their is a memory / IO bottleneck
somewhere.



I will try that.  Thanks



When you discover that an update is hanging, can you get into the 
database, and see what locks currently exist? (SELECT * FROM pg_locks)


That might help you figure out what is being locked and possibly 
preventing your updates.


It is also possible that your UPDATE query is trying to do something 
funny (someone just recently was talking about an UPDATE that wanted to 
do a hash join against 12M rows). Which probably meant that it had to 
spill to disk, where a merge join would have worked better.


John
=:-


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Table locking problems?

2005-08-09 Thread Dan Harris


On Aug 9, 2005, at 3:51 PM, John A Meinel wrote:


Dan Harris wrote:


On Aug 10, 2005, at 12:49 AM, Steve Poe wrote:


Dan,

Do you mean you did RAID 1 + 0 (RAID 10) or RAID 0 + 1? Just a
clarification, since RAID 0 is still a single-point of failure  
even if

RAID1 is on top of RAID0.

Well, you tell me if I stated incorrectly.  There are two raid   
enclosures with 7 drives in each.  Each is on its own bus on a  
dual- channel controller.  Each box has a stripe across its drives  
and the  enclosures are mirrors of each other.  I understand the  
controller  could be a single point of failure, but I'm not sure I  
understand  your concern about the RAID structure itself.




In this configuration, if you have a drive fail on both  
controllers, the entire RAID dies. Lets label them A1-7, B1-7,  
because you stripe within a set, if a single one of A dies, and a  
single one of B dies, you have lost your entire mirror.


The correct way of doing it, is to have A1 be a mirror of B1, and  
then stripe above that. Since you are using 2 7-disk enclosures,  
I'm not sure how you can do it well, since it is not an even number  
of disks. Though if you are using software RAID, there should be no  
problem.


The difference is that in this scenario, *all* of the A drives can  
die, and you haven't lost any data. The only thing you can't lose  
is a matched pair (eg losing both A1 and B1 will cause complete  
data loss)


I believe the correct notation for this last form is RAID 1 + 0  
(RAID10) since you have a set of RAID1 drives, with a RAID0 on-top  
of them.




I have read up on the difference now. I don't understand why it's a  
single point of failure.  Technically any array could be a single  
point depending on your level of abstraction.   In retrospect, I  
probably should have gone 8 drives in each and used RAID 10 instead  
for the better fault-tolerance,  but it's online now and will require  
some planning to see if I want to reconfigure that in the future.  I  
wish HP's engineer would have promoted that method instead of 0+1..


-Dan


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Table locking problems?

2005-08-09 Thread John A Meinel

Dan Harris wrote:


On Aug 9, 2005, at 3:51 PM, John A Meinel wrote:


Dan Harris wrote:


On Aug 10, 2005, at 12:49 AM, Steve Poe wrote:


Dan,

Do you mean you did RAID 1 + 0 (RAID 10) or RAID 0 + 1? Just a
clarification, since RAID 0 is still a single-point of failure  even if
RAID1 is on top of RAID0.

Well, you tell me if I stated incorrectly.  There are two raid   
enclosures with 7 drives in each.  Each is on its own bus on a  dual- 
channel controller.  Each box has a stripe across its drives  and 
the  enclosures are mirrors of each other.  I understand the  
controller  could be a single point of failure, but I'm not sure I  
understand  your concern about the RAID structure itself.




In this configuration, if you have a drive fail on both  controllers, 
the entire RAID dies. Lets label them A1-7, B1-7,  because you stripe 
within a set, if a single one of A dies, and a  single one of B dies, 
you have lost your entire mirror.


The correct way of doing it, is to have A1 be a mirror of B1, and  
then stripe above that. Since you are using 2 7-disk enclosures,  I'm 
not sure how you can do it well, since it is not an even number  of 
disks. Though if you are using software RAID, there should be no  
problem.


The difference is that in this scenario, *all* of the A drives can  
die, and you haven't lost any data. The only thing you can't lose  is 
a matched pair (eg losing both A1 and B1 will cause complete  data loss)


I believe the correct notation for this last form is RAID 1 + 0  
(RAID10) since you have a set of RAID1 drives, with a RAID0 on-top  of 
them.




I have read up on the difference now. I don't understand why it's a  
single point of failure.  Technically any array could be a single  
point depending on your level of abstraction.   In retrospect, I  
probably should have gone 8 drives in each and used RAID 10 instead  for 
the better fault-tolerance,  but it's online now and will require  some 
planning to see if I want to reconfigure that in the future.  I  wish 
HP's engineer would have promoted that method instead of 0+1..


I wouldn't say that it is a single point of failure, but I *can* say 
that it is much more likely to fail. (2 drives rather than on average n 
drives)


If your devices will hold 8 drives, you could simply do 1 8-drive, and 
one 6-drive. And then do RAID1 with pairs, and RAID0 across the 
resultant 7 RAID1 sets.


I'm really surprised that someone promoted RAID 0+1 over RAID10. I think 
I've heard that there is a possible slight performance improvement, but 
really the failure mode makes it a poor tradeoff.


John
=:-



-Dan



signature.asc
Description: OpenPGP digital signature