Re: [PERFORM] Stalls on PGSemaphoreLock
Hi all - I am a little delayed in reporting back on this issue, but it was indeed the hugepage defrag setting that was the cause of my issue. One item we noticed while testing, which I wanted to report back to the forum, is that these settings:

    cat /sys/kernel/mm/transparent_hugepage/defrag
    always [never]
    cat /sys/kernel/mm/transparent_hugepage/enabled
    always [never]

were not sticky on reboot for my version of CentOS, which probably explains why I thought this was disabled already only to have it crop back up. Anyway, I wanted to report back these findings to close the loop on this and to thank the community again for their support.

Best,
Matt

From: Pavy Philippe [philippe.p...@worldline.com]
Sent: Tuesday, March 25, 2014 4:10 PM
To: Matthew Spilich; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Stalls on PGSemaphoreLock

Here, we had transparent hugepages always active:

    cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
    [always] never

We changed to:

    cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
    always [never]

For the semaphores, our initial configuration was:

    cat /proc/sys/kernel/sem
    250 32000 32 128

And we changed to:

    cat /proc/sys/kernel/sem
    5010 641280 5010 128

-----Original Message-----
From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Matthew Spilich
Sent: Tuesday, March 25, 2014 7:38 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Stalls on PGSemaphoreLock

...
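Since the THP settings were not sticky across reboots, one option is to re-apply them at boot, e.g. from /etc/rc.local on CentOS 6. The sketch below is illustrative, not Matt's actual fix; it assumes root and the /sys/kernel/mm/transparent_hugepage path (older RHEL kernels use redhat_transparent_hugepage instead, as in Pavy's output):

```shell
# Sketch: re-apply THP settings at boot (e.g. from /etc/rc.local), since
# writes to /sys do not survive a reboot. Assumes root; adjust the path
# to redhat_transparent_hugepage on older RHEL/CentOS kernels.
thp=/sys/kernel/mm/transparent_hugepage
for f in enabled defrag; do
    if [ -w "$thp/$f" ]; then
        echo never > "$thp/$f"
    fi
done

# Helper to read back which value the kernel reports as active: it brackets
# the current choice, e.g. "always [never]" means THP is disabled.
thp_active() {
    # intentional word-splitting of $1 puts each token on its own line
    printf '%s\n' $1 | sed -n 's/^\[\(.*\)\]$/\1/p'
}
```

An alternative on these kernels is the `transparent_hugepage=never` boot parameter, which avoids the rc.local step entirely.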
[PERFORM] Stalls on PGSemaphoreLock
Hi everyone! I've been working on a puzzling issue for a few days and am hoping that someone has seen something similar or can help. There have been some odd behaviors on one of my production-facing postgres servers.

Version info from postgres: PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit

The symptom: The database machine (running postgres 9.1.9 on CentOS 6.4) is running at low utilization most of the time, but once every day or two it will appear to slow down to the point where queries back up and clients are unable to connect. Once this event occurs, there are lots of concurrent queries and I see slow queries appear in the logs, but there doesn't appear to be anything abnormal that I have been able to see that causes this behavior. The event lasts just long enough for monitoring to alarm. We will respond to the alerts and take a look, but within a minute or three at most, load returns to normal levels and all running queries complete in expected times. At the time of the event, we see a spike in system CPU and load average, but we do not see a corresponding spike in disk reads or writes that would indicate IO load. Initial troubleshooting of the active processes showed a flurry of activity in ps waiting on semtimedop.

Our efforts internally to diagnose this problem are to sample pg_locks and pg_stat_activity every 5s, plus running a script that looks for at least one postgres process waiting on a semaphore; if it finds one, it gets a stack trace of every running postgres process with GDB. It also uses strace on 5 processes to find out which semaphore they're waiting on.
What we were catching in the following stack traces seems to be representative of where things are waiting when we see an event; lots of backends will appear to be in this state. Here are two representative examples:

--- 47245 ---
#0  0x0037392eb197 in semop () from /lib64/libc.so.6
#1  0x005e0c87 in PGSemaphoreLock ()
#2  0x0061e3af in LWLockAcquire ()
#3  0x0060aa0f in ReadBuffer_common ()
#4  0x0060b2e4 in ReadBufferExtended ()
#5  0x0047708d in _bt_relandgetbuf ()
#6  0x0047aac4 in _bt_search ()
#7  0x0047af8d in _bt_first ()
#8  0x00479704 in btgetbitmap ()
#9  0x006e7e00 in FunctionCall2Coll ()
#10 0x00473120 in index_getbitmap ()
#11 0x005726b8 in MultiExecBitmapIndexScan ()
#12 0x0057214d in BitmapHeapNext ()
#13 0x0056b18e in ExecScan ()
#14 0x00563ed8 in ExecProcNode ()
#15 0x00562d72 in standard_ExecutorRun ()
#16 0x0062ce67 in PortalRunSelect ()
#17 0x0062e128 in PortalRun ()
#18 0x0062bb66 in PostgresMain ()
#19 0x005ecd01 in ServerLoop ()
#20 0x005ef401 in PostmasterMain ()
#21 0x00590ff8 in main ()

--- 47257 ---
#0  0x0037392eb197 in semop () from /lib64/libc.so.6
#1  0x005e0c87 in PGSemaphoreLock ()
#2  0x0061e3af in LWLockAcquire ()
#3  0x0060aa0f in ReadBuffer_common ()
#4  0x0060b2e4 in ReadBufferExtended ()
#5  0x0047708d in _bt_relandgetbuf ()
#6  0x0047aac4 in _bt_search ()
#7  0x0047af8d in _bt_first ()
#8  0x004797d1 in btgettuple ()
#9  0x006e7e00 in FunctionCall2Coll ()
#10 0x0047339d in index_getnext ()
#11 0x00575ed6 in IndexNext ()
#12 0x0056b18e in ExecScan ()
#13 0x00563ee8 in ExecProcNode ()
#14 0x00562d72 in standard_ExecutorRun ()
#15 0x0062ce67 in PortalRunSelect ()
#16 0x0062e128 in PortalRun ()
#17 0x0062bb66 in PostgresMain ()
#18 0x005ecd01 in ServerLoop ()
#19 0x005ef401 in PostmasterMain ()
#20 0x00590ff8 in main ()

Has anyone on the forum seen something similar?
Any suggestions on what to look at next? If it is helpful to describe the server hardware: it has two E5-2670 CPUs and 256 GB of RAM, and the database is hosted on 1.6 TB of RAID 10 local storage (15K 300 GB drives). The workload is predominantly reads, and the queries are mostly fairly simple selects from a single large table, generally specifying the primary key in the where clause along with a few other filters.

Thanks,
Matt
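The watcher described in the post (sample pg_locks and pg_stat_activity every 5 s; when a backend waits on a semaphore, stack-trace every backend) could be sketched roughly as below. The command flags, file names, and detection heuristic are illustrative assumptions, not Matt's actual script:

```shell
#!/bin/sh
# Rough sketch of the sampling watcher described above. Assumes root (for
# gdb) and a local psql; details are illustrative, not from the post.

# True if any postgres backend in a ps listing is sleeping in semtimedop.
sem_waiters() {
    printf '%s\n' "$1" | awk '/postgres/ && /semtimedop/ { found = 1 } END { exit !found }'
}

watch_semaphores() {
    while sleep 5; do
        if sem_waiters "$(ps -eo pid,wchan:20,comm)"; then
            date >> semwait.log
            psql -Xqc 'SELECT * FROM pg_stat_activity' >> semwait.log
            psql -Xqc 'SELECT * FROM pg_locks' >> semwait.log
            # Stack-trace every backend (expensive; only during an event).
            for pid in $(pgrep -x postgres); do
                gdb -batch -ex bt -p "$pid" >> semwait.log 2>&1
            done
        fi
    done
}
# Start manually while investigating: watch_semaphores
```

Keying off the wchan column keeps the idle path cheap; the expensive gdb pass only fires once a backend is actually stuck in semtimedop.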
Re: [PERFORM] Stalls on PGSemaphoreLock
Thanks all:

Ray: Thanks, we started to look at the hardware/firmware, but didn't get to that level of detail or run sar. I will probably collect more detail in this area if I continue to see issues.

Pavy - I hope that you are right that the hugepage setting is the issue. I was under the impression that I had it disabled already, because this has been a known issue for us in the past, but it turns out this was not the case for the server in question. I have disabled it at this time, but it will take a few days of running without issue before I am comfortable declaring that this is the solution. Can you elaborate on the change you mention to upgrade the semaphore configuration? I think this is not something I have looked at before.

Ashutosh - Thanks for the reply; I started to do that at first. I turned on log_statement=all for a few hours and it generated a few GB of log file, and I didn't want to leave it running in that state for too long because the issue happens every few days, and not on any regular schedule, so I reverted that after collecting a few GB of detail in the pg log. The sampling I'm doing now every few seconds is, I think, giving me a decent picture of what is going on when the incident occurs, and is a level of data collection that I am more comfortable will not impact operations. I am also logging at level 'mod' and all statements with duration over 500 ms. I don't see that large write operations are a contributing factor leading up to these incidents. I'm hoping that disabling the hugepage setting will be the solution to this. I'll check back in a day or two with feedback.

Thanks,
Matt

From: Pavy Philippe [philippe.p...@worldline.com]
Sent: Tuesday, March 25, 2014 1:45 PM
To: Ray Stell; Matthew Spilich
Cc: pgsql-performance@postgresql.org
Subject: RE : [PERFORM] Stalls on PGSemaphoreLock

Hello. Recently I had a similar problem. The first symptom was a freeze of connections and 100% system CPU for between 2 and 10 minutes, 1 or 2 times per day.
Connections were impossible and queries slow. An strace on one backend showed a very long system call in semop(). We have a node with 48 cores and 128 GB of memory. We disabled hugepages and upgraded the semaphore configuration, and since that time we no longer have any freezes on our instance. Can you check the hugepage and semaphore configuration on your node? I am interested in this case, so do not hesitate to get back to me. Thanks. Excuse me for my bad English!

From: pgsql-performance-ow...@postgresql.org [pgsql-performance-ow...@postgresql.org] on behalf of Ray Stell [ste...@vt.edu]
Sent: Tuesday, March 25, 2014 6:17 PM
To: Matthew Spilich
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Stalls on PGSemaphoreLock

On Mar 25, 2014, at 8:46 AM, Matthew Spilich wrote:

    The symptom: The database machine (running postgres 9.1.9 on CentOS 6.4) is running at low utilization most of the time, but once every day or two it will appear to slow down to the point where queries back up and clients are unable to connect. Once this event occurs, there are lots of concurrent queries and I see slow queries appear in the logs, but there doesn't appear to be anything abnormal that I have been able to see that causes this behavior. ... Has anyone on the forum seen something similar? Any suggestions on what to look at next? If it is helpful to describe the server hardware: it has two E5-2670 CPUs and 256 GB of RAM, and the database is hosted on 1.6 TB of RAID 10 local storage (15K 300 GB drives).

I could be way off here, but years ago I experienced something like this (in oracle land), and after some stressful chasing, a marginal failure of the raid controller revealed itself. Same kind of event: steady traffic, and then some I/O would not complete and normal ops would stack up. Anyway, what you report reminded me of that event. The E5 is a few years old; I wonder if the raid controller firmware needs a patch?
I suppose a marginal power supply might cause a similar hang. Anyway, marginal failures are very painful. Have you checked sar or OS logging at event time?
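For reference on the semaphore change Pavy mentions earlier in the thread: the four fields of kernel.sem are SEMMSL (max semaphores per set), SEMMNS (system-wide max semaphores), SEMOPM (max operations per semop() call), and SEMMNI (max number of sets), with SEMMNS normally equal to SEMMSL × SEMMNI. The sketch below shows how his values would be persisted; they are his settings for his 48-core node, not a general recommendation:

```shell
# kernel.sem fields, in order: SEMMSL SEMMNS SEMOPM SEMMNI.
# Pavy's values would be persisted in /etc/sysctl.conf as:
#   kernel.sem = 5010 641280 5010 128
# and applied with: sysctl -p
# (runtime view: cat /proc/sys/kernel/sem)

# Sanity check: SEMMNS should cover SEMMSL semaphores in each of SEMMNI sets.
semmsl=5010 semmns=641280 semmni=128
if [ $((semmsl * semmni)) -eq "$semmns" ]; then
    echo "kernel.sem values are internally consistent"
fi
```

Raising SEMMSL/SEMMNS matters here because each PostgreSQL backend holds a System V semaphore, so large max_connections values can exhaust the defaults.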