Re: [PERFORM] Queries intermittently slow
On 1/7/16 1:47 PM, Andres Freund wrote: On 2016-01-07 13:34:51 -0500, Tom Lane wrote: It's fairly well established that the implementation of transparent huge pages in Linux kernels from the 2.6-or-so era sucks, and you're best off turning it off if you care about consistency of performance. I think the feature wasn't introduced in original 2.6 kernels (3.2 or so?), but red hat had backported it to their 2.6.32 kernel. I am not sure whether modern kernels have improved this area. I think the problem has largely been solved around 3.16. Around 4.1 I could still reproduce problems, but the regressions were only in the sub 10% range in my test workload. BTW, looks like Scott blogged about this along with some nice graphs: https://sdwr98.wordpress.com/2016/01/07/transparent-huge-pages-or-why-dbas-are-great/ -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
On 2016-01-07 13:34:51 -0500, Tom Lane wrote: > It's fairly well established that the implementation of transparent > huge pages in Linux kernels from the 2.6-or-so era sucks, and you're > best off turning it off if you care about consistency of performance. I think the feature wasn't introduced in original 2.6 kernels (3.2 or so?), but red hat had backported it to their 2.6.32 kernel. > I am not sure whether modern kernels have improved this area. I think the problem has largely been solved around 3.16. Around 4.1 I could still reproduce problems, but the regressions were only in the sub 10% range in my test workload. Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
Rick Otten writes: > On Thu, Jan 7, 2016 at 12:51 PM, Scott Rankin wrote: >> Winner! Both of those settings were set to always, and as soon as I >> turned them off, the query times smoothed right out. > Is this generally true for all PostgreSQL systems on Linux, or only for > specific use cases? It's fairly well established that the implementation of transparent huge pages in Linux kernels from the 2.6-or-so era sucks, and you're best off turning it off if you care about consistency of performance. I am not sure whether modern kernels have improved this area. I think you can get an idea of how big a problem you have by noting the accumulated runtime of the khugepaged daemon. (BTW, it would likely be a good thing to collect some current wisdom in this area and add it to section 17.4 of our docs.) regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
On 1/7/16, 10:19 AM, "Tom Lane" wrote: >Scott Rankin writes: >> Any other ideas? > >If there's no lock waits, and the tables are static, it's difficult to >arrive at any other conclusion but that the process is simply losing the >CPU for long intervals. What else is going on on this box? Have you >made any attempt to correlate the slow queries with spikes in load >average, swap activity, etc? > >regards, tom lane This box is a dedicated postgresql box, the load is running around 2-3 (on a 24-core box). No swapping. I’ll keep digging! Thanks, Scott This email message contains information that Motus, LLC considers confidential and/or proprietary, or may later designate as confidential and proprietary. It is intended only for use of the individual or entity named above and should not be forwarded to any other persons or entities without the express consent of Motus, LLC, nor should it be used for any purpose other than in the course of any potential or actual business relationship with Motus, LLC. If the reader of this message is not the intended recipient, or the employee or agent responsible to deliver it to the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this communication in error, please notify sender immediately and destroy the original message. Internal Revenue Service regulations require that certain types of written advice include a disclaimer. To the extent the preceding message contains advice relating to a Federal tax issue, unless expressly stated otherwise the advice is not intended or written to be used, and it cannot be used by the recipient or any other taxpayer, for the purpose of avoiding Federal tax penalties, and was not written to support the promotion or marketing of any transaction or matter discussed herein. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
Scott Rankin writes: > Any other ideas? If there's no lock waits, and the tables are static, it's difficult to arrive at any other conclusion but that the process is simply losing the CPU for long intervals. What else is going on on this box? Have you made any attempt to correlate the slow queries with spikes in load average, swap activity, etc? regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
On 1/6/16, 11:14 PM, "Jim Nasby" wrote: >On 1/6/16 12:01 PM, Scott Rankin wrote: >> I guess we’re back to lock contention? > >Is there by chance an anti-wraparound vacuum happening on that table? > >Actually, for that matter... if autovacuum is hitting that table it's >locking could be causing problems, and it won't release it's locks until >the deadlock detector kicks in. >-- >Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX >Experts in Analytics, Data Architecture and PostgreSQL >Data in Trouble? Get it in Treble! http://BlueTreble.com Hi Jim, Tom, Thanks both of you for your help, by the way. I lowered the deadlock_timeout this morning to 100ms and ran for a little while, but I did not see any lock waits even when the statements ran slowly. Jim, to your point, I enabled auto vacuum logging (log_autovacuum_min_duration=0) but no vacuums are taking place even when I see slow queries. Any other ideas? Thanks, Scott This email message contains information that Motus, LLC considers confidential and/or proprietary, or may later designate as confidential and proprietary. It is intended only for use of the individual or entity named above and should not be forwarded to any other persons or entities without the express consent of Motus, LLC, nor should it be used for any purpose other than in the course of any potential or actual business relationship with Motus, LLC. If the reader of this message is not the intended recipient, or the employee or agent responsible to deliver it to the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this communication in error, please notify sender immediately and destroy the original message. Internal Revenue Service regulations require that certain types of written advice include a disclaimer. To the extent the preceding message contains advice relating to a Federal tax issue, unless expressly stated otherwise the advice is not intended or written to be used, and it cannot be used by the recipient or any other taxpayer, for the purpose of avoiding Federal tax penalties, and was not written to support the promotion or marketing of any transaction or matter discussed herein. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
On 1/6/16 12:01 PM, Scott Rankin wrote: I guess we’re back to lock contention? Is there by chance an anti-wraparound vacuum happening on that table? Actually, for that matter... if autovacuum is hitting that table it's locking could be causing problems, and it won't release it's locks until the deadlock detector kicks in. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
> >So what seems more likely after more thought is that the pages are staying >in our shared buffer arena just fine, but the kernel is randomly choosing >to swap out parts of the arena, and the delays correspond to swap-in >waits. (There would still have to be a mighty crummy disk subsystem >underlying things for swap-in to take so long, but this is a more >plausible theory for exactly what's invoking the disk read.) > >Postgres can't directly see when this is happening, but you could try >watching "iostat 1" and noticing whether swap-in events seem to be >correlated with the slow queries. > >If this is the problem, then the answer is to reduce the pressure on >system memory so that swap-outs are less likely. You might find that >a smaller shared_buffer arena is a good thing (so that all of it stays >"hot" and unswappable from the kernel's perspective). Or reduce the >number of active backend processes. > >regards, tom lane One change to my setup that my server guy pointed out - this database server is *not* a cloud server, but in fact a physical Dell box with SSDs. So unless something is majorly wrong with our disk setup, disk IO should not be a factor. I also checked and the box is not swapping at all. I guess we’re back to lock contention? Thanks, Scott This email message contains information that Motus, LLC considers confidential and/or proprietary, or may later designate as confidential and proprietary. It is intended only for use of the individual or entity named above and should not be forwarded to any other persons or entities without the express consent of Motus, LLC, nor should it be used for any purpose other than in the course of any potential or actual business relationship with Motus, LLC. If the reader of this message is not the intended recipient, or the employee or agent responsible to deliver it to the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this communication in error, please notify sender immediately and destroy the original message. Internal Revenue Service regulations require that certain types of written advice include a disclaimer. To the extent the preceding message contains advice relating to a Federal tax issue, unless expressly stated otherwise the advice is not intended or written to be used, and it cannot be used by the recipient or any other taxpayer, for the purpose of avoiding Federal tax penalties, and was not written to support the promotion or marketing of any transaction or matter discussed herein. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
I wrote: > A possible theory is that the slow cases represent times when the desired > page is not in cache, but you'd have to have a seriously overloaded disk > subsystem for a disk fetch to take hundreds of ms. Unless maybe this is > running on some cloud service with totally unspecified I/O bandwidth? BTW, a glaring flaw in that theory is that if this query is touching only about four pages worth of data, and you are running it ten times a second, how in heck would that data ever fall out of shared buffer cache at all? Your working set across your whole DB would have to be enormously more than your 8GB shared_buffers setting for that to possibly happen. So what seems more likely after more thought is that the pages are staying in our shared buffer arena just fine, but the kernel is randomly choosing to swap out parts of the arena, and the delays correspond to swap-in waits. (There would still have to be a mighty crummy disk subsystem underlying things for swap-in to take so long, but this is a more plausible theory for exactly what's invoking the disk read.) Postgres can't directly see when this is happening, but you could try watching "iostat 1" and noticing whether swap-in events seem to be correlated with the slow queries. If this is the problem, then the answer is to reduce the pressure on system memory so that swap-outs are less likely. You might find that a smaller shared_buffer arena is a good thing (so that all of it stays "hot" and unswappable from the kernel's perspective). Or reduce the number of active backend processes. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
> >Hm, well, given that you are able to capture instances of the behavior >in EXPLAIN ANALYZE, I'd suggest trying EXPLAIN (ANALYZE,BUFFERS). >That will tell you the number of pages it found in shared buffers vs. >having to read them. Now, a "read" just means we had to ask the kernel, >not necessarily that the page came all the way from disk; if it's in >the kernel's disk cache that won't be very much slower than a shared- >buffers hit. Still, if the slowdowns are reliably seen only when a read >occurred, I'd say that's good evidence. Well, that was a good thought - but: Nested Loop (cost=0.00..4.20 rows=1 width=185) (actual time=567.113..567.113 rows=0 loops=1) Join Filter: (be.branding_id = b.branding_id) Rows Removed by Join Filter: 1 Buffers: shared hit=4 -> Nested Loop (cost=0.00..3.19 rows=1 width=189) (actual time=567.081..567.092 rows=1 loops=1) Join Filter: (s.setting_id = be.setting_id) Rows Removed by Join Filter: 41 Buffers: shared hit=3 -> Seq Scan on branding_extension be (cost=0.00..1.00 rows=1 width=8) (actual time=0.011..0.012 rows=1 loops=1) Buffers: shared hit=1 -> Seq Scan on setting s (cost=0.00..2.04 rows=42 width=185) (actual time=567.041..567.049 rows=42 loops=1) Buffers: shared hit=2 -> Seq Scan on branding b (cost=0.00..1.01 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1) Filter: ((name)::text = 'crs'::text) Rows Removed by Filter: 1 Buffers: shared hit=1 Total runtime: 567.185 ms I ran it several times and there were no reads, just shared hits, even on slow executions. It looks like it didn’t go out to the kernel for the data. I guess next steps would be to adjust the deadlock_timeout? If I’m reading the docs correctly, this doesn’t change how Postgres determines a deadlock, it just makes it check for them sooner? Thanks, Scott This email message contains information that Motus, LLC considers confidential and/or proprietary, or may later designate as confidential and proprietary. It is intended only for use of the individual or entity named above and should not be forwarded to any other persons or entities without the express consent of Motus, LLC, nor should it be used for any purpose other than in the course of any potential or actual business relationship with Motus, LLC. If the reader of this message is not the intended recipient, or the employee or agent responsible to deliver it to the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this communication in error, please notify sender immediately and destroy the original message. Internal Revenue Service regulations require that certain types of written advice include a disclaimer. To the extent the preceding message contains advice relating to a Federal tax issue, unless expressly stated otherwise the advice is not intended or written to be used, and it cannot be used by the recipient or any other taxpayer, for the purpose of avoiding Federal tax penalties, and was not written to support the promotion or marketing of any transaction or matter discussed herein. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
Scott Rankin writes: > On 1/6/16, 10:38 AM, "Tom Lane" wrote: >> A possible theory is that the slow cases represent times when the desired >> page is not in cache, but you'd have to have a seriously overloaded disk >> subsystem for a disk fetch to take hundreds of ms. Unless maybe this is >> running on some cloud service with totally unspecified I/O bandwidth? > This intrigues me. We are running on a, shall we say, less than name-brand > cloud provider at the moment (transitioning to AWS later this month). Is > there a reasonably straightforward way of confirming this hypothesis? We have > had many performance issues with this vendor in the past, so I wouldnât be > surprised. Hm, well, given that you are able to capture instances of the behavior in EXPLAIN ANALYZE, I'd suggest trying EXPLAIN (ANALYZE,BUFFERS). That will tell you the number of pages it found in shared buffers vs. having to read them. Now, a "read" just means we had to ask the kernel, not necessarily that the page came all the way from disk; if it's in the kernel's disk cache that won't be very much slower than a shared- buffers hit. Still, if the slowdowns are reliably seen only when a read occurred, I'd say that's good evidence. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
On 1/6/16, 10:38 AM, "Tom Lane" wrote: > >A possible theory is that the slow cases represent times when the desired >page is not in cache, but you'd have to have a seriously overloaded disk >subsystem for a disk fetch to take hundreds of ms. Unless maybe this is >running on some cloud service with totally unspecified I/O bandwidth? This intrigues me. We are running on a, shall we say, less than name-brand cloud provider at the moment (transitioning to AWS later this month). Is there a reasonably straightforward way of confirming this hypothesis? We have had many performance issues with this vendor in the past, so I wouldn’t be surprised. > >> I will look at changing the deadlock_timeout, but that might have to wait >> for the weekend since this is a production system. > >You needn't restart the server for that, just edit postgresql.conf and >SIGHUP the postmaster. Yep, we just try to limit any database changes until off-hours unless it’s an emergency. > >regards, tom lane Thanks, Scott This email message contains information that Motus, LLC considers confidential and/or proprietary, or may later designate as confidential and proprietary. It is intended only for use of the individual or entity named above and should not be forwarded to any other persons or entities without the express consent of Motus, LLC, nor should it be used for any purpose other than in the course of any potential or actual business relationship with Motus, LLC. If the reader of this message is not the intended recipient, or the employee or agent responsible to deliver it to the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this communication in error, please notify sender immediately and destroy the original message. Internal Revenue Service regulations require that certain types of written advice include a disclaimer. To the extent the preceding message contains advice relating to a Federal tax issue, unless expressly stated otherwise the advice is not intended or written to be used, and it cannot be used by the recipient or any other taxpayer, for the purpose of avoiding Federal tax penalties, and was not written to support the promotion or marketing of any transaction or matter discussed herein. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
Scott Rankin writes: > Thanks for the update. The query in question is a pretty simple one - it > joins 3 tables, all of which are static - they donât have any writes being > done against them. They have very few rows, and the query plan for them > indicates that they are all sequential scans. When doing an EXPLAIN ANALYZE, > the delay is not consistently on one table, it can vary between the three > tables involved in the query. No writes at all? That's pretty odd then. All the likely explanations involve autovacuum or other forms of deferred maintenance. A possible theory is that the slow cases represent times when the desired page is not in cache, but you'd have to have a seriously overloaded disk subsystem for a disk fetch to take hundreds of ms. Unless maybe this is running on some cloud service with totally unspecified I/O bandwidth? > I will look at changing the deadlock_timeout, but that might have to wait for > the weekend since this is a production system. You needn't restart the server for that, just edit postgresql.conf and SIGHUP the postmaster. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Queries intermittently slow
On 1/6/16, 10:19 AM, "Tom Lane" wrote: >Scott Rankin writes: >> I’m trying to track down why some queries on my database system are >> intermittently much slower than usual. I have some queries that run, on >> average, 2-3ms, and they run at a rate of about 10-20 queries/second. >> However, every 3-5 seconds, one of the queries will be 500-100ms. This is >> making the average query time turn out to be closer to 20ms, with a very >> large standard deviation. > >> This happens to a number of otherwise very fast queries, and I’m trying to >> trace the reason. I’ve turned on lock logging and checkpoint logging, and >> this behavior happens whether or not a checkpoint is occurring. There are >> no lock waits happening in the system either. > >I doubt you've proved that --- log_lock_waits will only report on waits >longer than deadlock_timeout, which you don't appear to have changed from >its default of 1 sec. If you're trying to capture events that last a few >hundred msec, you're going to need to reduce deadlock_timeout to maybe >100ms. > >It would help to know more about what the queries are, too. The cause >might be something like GIN index pending-list cleanup but we can't tell >on the basis of this much info. > >regards, tom lane Hi Tom, Thanks for the update. The query in question is a pretty simple one - it joins 3 tables, all of which are static - they don’t have any writes being done against them. They have very few rows, and the query plan for them indicates that they are all sequential scans. When doing an EXPLAIN ANALYZE, the delay is not consistently on one table, it can vary between the three tables involved in the query. Here is the query plan for a fast run: "Nested Loop (cost=0.00..4.20 rows=1 width=185) (actual time=0.069..0.069 rows=0 loops=1)" " Join Filter: (be.branding_id = b.branding_id)" " Rows Removed by Join Filter: 1" " -> Nested Loop (cost=0.00..3.19 rows=1 width=189) (actual time=0.040..0.057 rows=1 loops=1)" "Join Filter: (s.setting_id = be.setting_id)" "Rows Removed by Join Filter: 41" "-> Seq Scan on branding_extension be (cost=0.00..1.00 rows=1 width=8) (actual time=0.008..0.008 rows=1 loops=1)" "-> Seq Scan on setting s (cost=0.00..2.04 rows=42 width=185) (actual time=0.004..0.018 rows=42 loops=1)" " -> Seq Scan on branding b (cost=0.00..1.01 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=1)" "Filter: ((name)::text = 'crs'::text)" "Rows Removed by Filter: 1" "Total runtime: 0.150 ms" And for a slow one: "Nested Loop (cost=0.00..4.20 rows=1 width=185) (actual time=383.862..383.862 rows=0 loops=1)" " Join Filter: (be.branding_id = b.branding_id)" " Rows Removed by Join Filter: 1" " -> Nested Loop (cost=0.00..3.19 rows=1 width=189) (actual time=383.815..383.849 rows=1 loops=1)" "Join Filter: (s.setting_id = be.setting_id)" "Rows Removed by Join Filter: 41" "-> Seq Scan on branding_extension be (cost=0.00..1.00 rows=1 width=8) (actual time=383.776..383.776 rows=1 loops=1)" "-> Seq Scan on setting s (cost=0.00..2.04 rows=42 width=185) (actual time=0.005..0.037 rows=42 loops=1)" " -> Seq Scan on branding b (cost=0.00..1.01 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)" "Filter: ((name)::text = 'crs'::text)" "Rows Removed by Filter: 1" "Total runtime: 383.920 ms" I will look at changing the deadlock_timeout, but that might have to wait for the weekend since this is a production system. Thanks, Scott This email message contains information that Motus, LLC considers confidential and/or proprietary, or may later designate as confidential and proprietary. It is intended only for use of the individual or entity named above and should not be forwarded to any other persons or entities without the express consent of Motus, LLC, nor should it be used for any purpose other than in the course of any potential or actual business relationship with Motus, LLC. If the reader of this message is not the intended recipient, or the employee or agent responsible to deliver it to the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this communication in error, please notify sender immediately and destroy the original message. Internal Revenue Service regulations require that certain types of written advice include a disclaimer. To the extent the preceding message contains advice relating to a Federal tax issue, unless expressly stated otherwise the advice is not intended or written to be used, and it cannot be used by the recipient or any other taxpayer, for the purpose of avoiding Federal tax penalties, and was not written to support the promotion or marketing of any transaction or matter discussed herein. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make
Re: [PERFORM] Queries intermittently slow
Scott Rankin writes: > Iâm trying to track down why some queries on my database system are > intermittently much slower than usual. I have some queries that run, on > average, 2-3ms, and they run at a rate of about 10-20 queries/second. > However, every 3-5 seconds, one of the queries will be 500-100ms. This is > making the average query time turn out to be closer to 20ms, with a very > large standard deviation. > This happens to a number of otherwise very fast queries, and Iâm trying to > trace the reason. Iâve turned on lock logging and checkpoint logging, and > this behavior happens whether or not a checkpoint is occurring. There are no > lock waits happening in the system either. I doubt you've proved that --- log_lock_waits will only report on waits longer than deadlock_timeout, which you don't appear to have changed from its default of 1 sec. If you're trying to capture events that last a few hundred msec, you're going to need to reduce deadlock_timeout to maybe 100ms. It would help to know more about what the queries are, too. The cause might be something like GIN index pending-list cleanup but we can't tell on the basis of this much info. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance