Re: [GENERAL] Problem after VACUUM ANALYZE
Hi all,

dear Richard, your mail about my configuration parameters was the right hint, but I am still struggling with the problem. I would appreciate it if you or somebody else could help me further.

After some investigation I have some new results on my problem. The following query is not working as it should, and it is my most important query:

# select * from SpielTipp natural join Tippspiel
# where tippspieltag_id = 1254056;

- SpielTipp has about 80,000,000 records
- Tippspiel has about 10,000,000 records
- random_page_cost = 3

Both tables have indexes, of course, so no seq scan should be needed, but the planner uses a sequential scan:

# explain analyze
# select * from SpielTipp natural join Tippspiel
# where tippspieltag_id = 817372;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Hash Join  (cost=333.93..1647646.04 rows=1383 width=28) (actual time=104193.150..104193.150 rows=0 loops=1)
   Hash Cond: ("outer".tippspiel_id = "inner".tippspiel_id)
   ->  Seq Scan on spieltipp  (cost=0.00..1253846.52 rows=78690352 width=16) (actual time=10.355..69195.235 rows=78690348 loops=1)
   ->  Hash  (cost=333.44..333.44 rows=198 width=16) (actual time=44.821..44.821 rows=9 loops=1)
         ->  Index Scan using tippspiel_tippspieltag_id_key on tippspiel  (cost=0.00..333.44 rows=198 width=16) (actual time=44.798..44.809 rows=9 loops=1)
               Index Cond: (tippspieltag_id = 817372)
 Total runtime: 104193.209 ms
(7 rows)

Just to see how wrong the plan is, I disabled seq scans:

# set enable_seqscan to off;
# explain analyze
# select * from SpielTipp natural join Tippspiel
# where tippspieltag_id = 817372;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..1682807.57 rows=1383 width=28) (actual time=0.186..0.186 rows=0 loops=1)
   ->  Index Scan using tippspiel_tippspieltag_id_key on tippspiel  (cost=0.00..333.44 rows=198 width=16) (actual time=0.072..0.082 rows=9 loops=1)
         Index Cond: (tippspieltag_id = 817372)
   ->  Index Scan using ix_spieltipp_tippspiel_id on spieltipp  (cost=0.00..8458.83 rows=3081 width=16) (actual time=0.010..0.010 rows=0 loops=9)
         Index Cond: (spieltipp.tippspiel_id = "outer".tippspiel_id)
 Total runtime: 0.232 ms
(6 rows)

Now I tried lowering random_page_cost:

# set enable_seqscan to on;
# set random_page_cost to 1.5;
# explain analyze
# select * from SpielTipp natural join Tippspiel
# where tippspieltag_id = 817372;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..731643.62 rows=1383 width=28) (actual time=0.089..0.089 rows=0 loops=1)
   ->  Index Scan using tippspiel_tippspieltag_id_key on tippspiel  (cost=0.00..146.17 rows=198 width=16) (actual time=0.017..0.024 rows=9 loops=1)
         Index Cond: (tippspieltag_id = 817372)
   ->  Index Scan using ix_spieltipp_tippspiel_id on spieltipp  (cost=0.00..3655.92 rows=3081 width=16) (actual time=0.005..0.005 rows=0 loops=9)
         Index Cond: (spieltipp.tippspiel_id = "outer".tippspiel_id)
 Total runtime: 0.135 ms
(6 rows)

This looks fine at first glance, but the total estimated cost of 731643 is still far too high, right?

So here is what happened on my production server with random_page_cost = 3: the estimated costs of a sequential scan and an index scan are not very different, so after a fresh ANALYZE the planner sometimes uses the sequential scan and sometimes not, because the statistics vary across ANALYZE runs. That is why the plan changed when the nightly ANALYZE ran on the database.

As I have 200-240 connections at peak time, my conclusion after reading your advice and the annotated postgresql.conf is:

max_connections = 250
shared_buffers = 200000          # 1.6 GB = 20% of avail. RAM
work_mem = 20000
maintenance_work_mem = 160000
effective_cache_size = 600000    # 4.8 GB = 60% of avail. RAM
random_page_cost = 2
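Before making the lower random_page_cost global, it could also be tried per transaction with SET LOCAL, which resets automatically at COMMIT or ROLLBACK; a minimal sketch against the query above:

# begin;
# -- applies only inside this transaction; the server-wide setting is untouched
# set local random_page_cost to 2;
# select * from SpielTipp natural join Tippspiel
# where tippspieltag_id = 817372;
# commit;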
Are those settings reasonable for my box? My box is:
- dedicated
- AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
- 3ware RAID 1 controller with two rather cheap SATA disks
- 8 GB RAM

kind regards
Janning

On Tuesday, 8 April 2008 17:40, Richard Huxton wrote:
> > Here are some of our configuration parameters. We never really tweaked it
> > as it ran fine. We just raised some parameters.
Re: [GENERAL] Problem after VACUUM ANALYZE
<[EMAIL PROTECTED]> writes:
> Could it be that something like this is happening:
> - In the early morning a new DB connection is opened.
> - While running VACUUM ANALYZE the planner uses different plans because
> some index could not be used or the statistics are, right at that moment,
> not present because they are being updated... So the query gets a wrong
> plan. It uses a seq scan instead of an index scan.

This isn't supposed to happen. The old statistics stay visible until the new ones replace them; nothing is locked or unavailable while ANALYZE is running.

However, one of the things that can happen is that the statistics for one table get updated, and until the statistics for another table are updated the planner gets skewed results: it has the new statistics for the first table but old statistics for the second.

Normally the solution to this is to run ANALYZE more frequently so things don't change too drastically from one set of statistics to the next. In some situations this isn't good enough, for example if you're truncating a bunch of tables together. You can also run ANALYZE (but not VACUUM ANALYZE) inside a transaction, so all the stats go into effect together when you commit. This does have downsides, however; I think it will block a VACUUM ANALYZE which tries to update the same statistics, for example.

Another thing to be aware of is that *empty* tables cause Postgres to fall back to a heuristic of assuming 1,000 records. That can cause strange things to happen if you truncate just some tables and run ANALYZE on them.

-- 
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
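For example, a transactional ANALYZE over the two tables from this thread might look like the following (a minimal sketch; any set of related tables works the same way):

# begin;
# -- both ANALYZEs become visible together at COMMIT, so the planner
# -- never sees new statistics for one table combined with old ones
# -- for the other
# analyze spieltipp;
# analyze tippspiel;
# commit;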
Re: [GENERAL] Problem after VACUUM ANALYZE
On Wednesday, 9 April 2008 10:11, David Wilson wrote:
> On Wed, Apr 9, 2008 at 3:29 AM, <[EMAIL PROTECTED]> wrote:
> > But if i do "VACUUM ANALYZE" without concurrent queries, everything runs
> > fine afterwards.
> >
> > If i run "VACUUM ANALYZE" with a few concurrent queries, it slows down
> > to a crawl.
> >
> > Could it be that something like this is happening:
> > - In the early morning a new DB connection is opened.
> > - While running VACUUM ANALYZE the planner uses different plans because
> > some index could not be used or the statistics are, right at that
> > moment, not present because they are being updated... So the query gets
> > a wrong plan. It uses a seq scan instead of an index scan.
> > - This wrongly planned statement is prepared, so even after VACUUM
> > ANALYZE is done, the statement still uses the wrong plan.
> > - Rising load triggers many concurrent queries with wrong plans, so
> > everything slows down.
>
> I've run into a very similar problem. I have some long-running
> processes that generate a large amount of data and then query it;
> they must periodically drop their connections and reconnect to
> ensure that query plans get re-planned according to updated
> statistics. This is especially true when a new DB is first being
> brought up with an initial set of data (~1 week of compute time, ~30gb
> of data and ~120m rows) with processes that live through the entire
> process.

This is not the case at my place. We are not producing lots of data, just a little data with small INSERTs all the time. I don't think the statistics are really changing very often, as the tables are rather fixed and just some data is added.

> My solution, as mentioned above, is to rebuild the connection
> approximately every hour on my long-running processes. This is a tricky
> question, because you want the benefit of not re-planning queries in
> 99.9% of cases. I'm not really opposed to the current system that
> requires the reconnections, given that the overhead involved is
> completely negligible in the grand scheme of things. There are downsides
> to not replanning queries, and if you can't live with them then you
> should either force re-planning at intervals or avoid prepared
> statements.

In release 8.3 (http://www.postgresql.org/docs/8.3/interactive/release-8-3.html):

"Automatically re-plan cached queries when table definitions change or statistics are updated"

So no reconnection is necessary in 8.3 to replan prepared statements. Maybe that helps you.

kind regards,
janning
Re: [GENERAL] Problem after VACUUM ANALYZE
On Wed, Apr 9, 2008 at 1:41 PM, David Wilson <[EMAIL PROTECTED]> wrote:
> I've run into a very similar problem. I have some long-running
> processes that generate a large amount of data and then query it;
> they must periodically drop their connections and reconnect to
> ensure that query plans get re-planned according to updated
> statistics. This is especially true when a new DB is first being
> brought up with an initial set of data (~1 week of compute time, ~30gb
> of data and ~120m rows) with processes that live through the entire
> process.

I believe plan invalidation in 8.3 should address this, shouldn't it?

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Re: [GENERAL] Problem after VACUUM ANALYZE
On Wed, Apr 9, 2008 at 3:29 AM, <[EMAIL PROTECTED]> wrote:
> But if i do "VACUUM ANALYZE" without concurrent queries, everything runs
> fine afterwards.
>
> If i run "VACUUM ANALYZE" with a few concurrent queries, it slows down to
> a crawl.
>
> Could it be that something like this is happening:
> - In the early morning a new DB connection is opened.
> - While running VACUUM ANALYZE the planner uses different plans because
> some index could not be used or the statistics are, right at that moment,
> not present because they are being updated... So the query gets a wrong
> plan. It uses a seq scan instead of an index scan.
> - This wrongly planned statement is prepared, so even after VACUUM ANALYZE
> is done, the statement still uses the wrong plan.
> - Rising load triggers many concurrent queries with wrong plans, so
> everything slows down.

I've run into a very similar problem. I have some long-running processes that generate a large amount of data and then query it; they must periodically drop their connections and reconnect to ensure that query plans get re-planned according to updated statistics. This is especially true when a new DB is first being brought up with an initial set of data (~1 week of compute time, ~30gb of data and ~120m rows) with processes that live through the entire process.

My solution, as mentioned above, is to rebuild the connection approximately every hour on my long-running processes. This is a tricky question, because you want the benefit of not re-planning queries in 99.9% of cases. I'm not really opposed to the current system that requires the reconnections, given that the overhead involved is completely negligible in the grand scheme of things. There are downsides to not replanning queries, and if you can't live with them then you should either force re-planning at intervals or avoid prepared statements.

-- 
- David T. Wilson
Princeton Satellite Systems
[EMAIL PROTECTED]
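Forcing a re-plan at intervals does not strictly require a reconnect: a session can also discard and re-create its prepared statements. A minimal sketch (the statement name and query are illustrative, using the tables discussed in this thread):

# -- drop the stale plan and prepare again, so the query is
# -- planned fresh against the current statistics (before 8.3,
# -- the plan is kept for the life of the prepared statement)
# deallocate find_tipps;
# prepare find_tipps(int) as
#   select * from spieltipp natural join tippspiel
#   where tippspieltag_id = $1;
# execute find_tipps(817372);

Note that a statement prepared with a parameter is planned without knowing the actual value, which can itself push the planner toward a sequential scan regardless of how fresh the statistics are.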
Re: [GENERAL] Problem after VACUUM ANALYZE
On Tuesday, 8 April 2008 18:38, Scott Marlowe wrote:
> It sounds to me like two possible problems, maybe combined.
>
> One possibility is that you have a data distribution that results in
> statistics being gathered that don't really represent your data. Try
> increasing the stats target for that column (or the whole db if you'd
> rather) and re-running analyze.
>
> The other possibility is that you've got some index bloat happening
> and you might benefit from reindexing the problematic table.

But if I do "VACUUM ANALYZE" without concurrent queries, everything runs fine afterwards.

If I run "VACUUM ANALYZE" with a few concurrent queries, it slows down to a crawl.

Could it be that something like this is happening:
- In the early morning a new DB connection is opened.
- While running VACUUM ANALYZE the planner uses different plans because some index could not be used or the statistics are, right at that moment, not present because they are being updated... So the query gets a wrong plan. It uses a seq scan instead of an index scan.
- This wrongly planned statement is prepared, so even after VACUUM ANALYZE is done, the statement still uses the wrong plan.
- Rising load triggers many concurrent queries with wrong plans, so everything slows down.

kind regards,
janning
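One way to test that hypothesis would be to inspect the plan a session has actually cached: EXPLAIN EXECUTE shows the plan stored with a prepared statement rather than a freshly planned one. A minimal sketch (the statement name is illustrative):

# prepare tipp_query(int) as
#   select * from spieltipp natural join tippspiel
#   where tippspieltag_id = $1;
#
# -- shows the plan saved with the prepared statement; if it was built
# -- from mid-ANALYZE statistics it would keep showing the seq scan
# -- even after the nightly VACUUM ANALYZE has finished
# explain execute tipp_query(817372);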
Re: [GENERAL] Problem after VACUUM ANALYZE
It sounds to me like two possible problems, maybe combined.

One possibility is that you have a data distribution that results in statistics being gathered that don't really represent your data. Try increasing the stats target for that column (or the whole db if you'd rather) and re-running analyze.

The other possibility is that you've got some index bloat happening and you might benefit from reindexing the problematic table.
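For example, using the table and index names mentioned elsewhere in the thread (the target of 100 is only a starting point; the 8.1 default is 10, and default_statistics_target in postgresql.conf raises it database-wide):

# -- gather a larger statistics sample for the join column
# alter table spieltipp alter column tippspiel_id set statistics 100;
# analyze spieltipp;
#
# -- rebuild a possibly bloated index (locks out writes to the table,
# -- so best done in a quiet period)
# reindex index ix_spieltipp_tippspiel_id;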
Re: [GENERAL] Problem after VACUUM ANALYZE
On Tuesday 08 April 2008, [EMAIL PROTECTED] wrote:
> The problem is that we have peak times where everything is running fine.
> It has something to do with the vacuum process running. To simplify my
> problem:
>
> - I run vacuum analyze concurrently with some few user queries: it slows
> down to a crawl after vacuum is finished.
>
> - if i run it while no user is connected, everything runs fine
> afterwards.
>
> It has something to do with the vacuum analyze process.

It's probably running the machine into swap. Check swap usage and disk I/O when it happens.

-- 
Alan
Re: [GENERAL] Problem after VACUUM ANALYZE
Hi Richard,

thanks for your immediate response. I will answer your questions below:

On Tuesday, 8 April 2008 17:40, Richard Huxton wrote:
> [EMAIL PROTECTED] wrote:
> > We looked in our cpu monitoring and saw that we have huge IOwait while
> > VACUUM is running, not unusual though. But just after VACUUM was
> > finished, the user CPU load rose to 200% (dual cpu). At this time in
> > the early morning there were only very few users using our app. We
> > analyzed the situation further and saw that most queries had a very
> > long duration (minutes instead of milliseconds). BTW: the vacuum
> > process was not running at this moment. It was logged before as done.
>
> But WHAT was using all of your cpu? Was it PostgreSQL, and if so just
> one backend? If something else was using all your cpu, then it might
> just be that the PostgreSQL server wasn't getting a chance to run your
> queries.

The CPU was used by postgresql. As I saw many SELECTs when looking at "ps ax", I guess that many backends were using all of the CPU.

> > * Maybe the VACUUM ANALYZE process has not enough memory and therefore
> > could not ANALYZE the tables correctly. It then writes wrong statistics
> > to the database, which results in wrong execution plans using
> > sequential scans instead of index scans. This only happens if the
> > vacuum analyze process runs concurrently with user requests. If it runs
> > on its own, the vacuum process has enough memory and writes correct
> > statistics to the database.
>
> Doesn't sound likely to me. You've got 8GB of RAM, and if you were going
> into swap you'd have noticed the disk I/O.

That makes sense. We had hardly any IOwait at that moment, but 200% user CPU. But maybe vacuum can't use the RAM because of a misconfiguration.

> > Here are some of our configuration parameters. We never really tweaked
> > it as it ran fine. We just raised some parameters. The following list
> > should show all parameters changed from the default:
> >
> > max_connections = 300
> > shared_buffers = 30000
> > work_mem = 10240
>
> OK, so that's 30,000 * 8KB = 240MB of shared_buffers.
> You have 10MB of work_mem, and if all 300 connections were using that
> much you'd have committed 3GB of your RAM for that. Of course they'll
> want more than just that.
>
> Do you really have 300 concurrent connections?

At peak time, yes.

> > maintenance_work_mem = 163840
>
> 160MB for vacuums - should be OK given how much memory you have and the
> fact that it's quiet when you vacuum.
>
> > vacuum_mem = 32000
>
> This is what maintenance_work_mem used to be called. You can delete this
> entry.
>
> > max_fsm_pages = 500000
>
> You can track at most 500,000 pages with free space on them. In 8.2+
> versions VACUUM VERBOSE will show you how many are currently being used.
> Not sure about 8.1.
>
> > bgwriter_lru_percent = 10.0
> > bgwriter_lru_maxpages = 100
> > bgwriter_all_percent = 5
> > bgwriter_all_maxpages = 200
> > wal_buffers = 16
> >
> > checkpoint_segments = 10
>
> If you have bursts of write activity you might want to increase this.
>
> > checkpoint_warning = 3600
> >
> > effective_cache_size = 180000
>
> That's 180,000 * 8KB = 180 * 8MB = 1.4GB.
> If that's really all you're using as cache, I'd reduce the number of
> concurrent connections. Check free/top and see how much RAM is really
> being used as disk cache.
>
> > random_page_cost = 3
>
> Might be too high - you don't mention what disks you have.
3ware RAID-1 controller with plain (cheap) SATA disks.

> The crucial thing is to find out exactly what is happening when things
> get very slow. Check vmstat and top, look in the pg_locks system table
> and if needs be we can see what strace says a particular backend is
> doing.

The problem is that we have peak times where everything is running fine. It has something to do with the vacuum process running. To simplify my problem:

- I run vacuum analyze concurrently with some few user queries: it slows down to a crawl after vacuum is finished.

- if I run it while no user is connected, everything runs fine afterwards.

It has something to do with the vacuum analyze process.

kind regards,
janning
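The memory math above can also be double-checked from inside the server; a minimal sketch, assuming a default build where shared_buffers and effective_cache_size are counted in 8 kB pages:

# -- convert page-based settings to approximate megabytes
# select name, setting, setting::bigint * 8 / 1024 as approx_mb
# from pg_settings
# where name in ('shared_buffers', 'effective_cache_size');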
Re: [GENERAL] Problem after VACUUM ANALYZE
[EMAIL PROTECTED] wrote:
> We looked in our cpu monitoring and saw that we have huge IOwait while
> VACUUM is running, not unusual though. But just after VACUUM was finished,
> the user CPU load rose to 200% (dual cpu). At this time in the early
> morning there were only very few users using our app. We analyzed the
> situation further and saw that most queries had a very long duration
> (minutes instead of milliseconds). BTW: the vacuum process was not running
> at this moment. It was logged before as done.

But WHAT was using all of your cpu? Was it PostgreSQL, and if so just one backend? If something else was using all your cpu, then it might just be that the PostgreSQL server wasn't getting a chance to run your queries.

> * Maybe the VACUUM ANALYZE process has not enough memory and therefore
> could not ANALYZE the tables correctly. It then writes wrong statistics
> to the database, which results in wrong execution plans using sequential
> scans instead of index scans. This only happens if the vacuum analyze
> process runs concurrently with user requests. If it runs on its own, the
> vacuum process has enough memory and writes correct statistics to the
> database.

Doesn't sound likely to me. You've got 8GB of RAM, and if you were going into swap you'd have noticed the disk I/O.

> Here are some of our configuration parameters. We never really tweaked it
> as it ran fine. We just raised some parameters. The following list should
> show all parameters changed from the default:
>
> max_connections = 300
> shared_buffers = 30000
> work_mem = 10240

OK, so that's 30,000 * 8KB = 240MB of shared_buffers.

You have 10MB of work_mem, and if all 300 connections were using that much you'd have committed 3GB of your RAM for that. Of course they'll want more than just that.

Do you really have 300 concurrent connections?

> maintenance_work_mem = 163840

160MB for vacuums - should be OK given how much memory you have and the fact that it's quiet when you vacuum.

> vacuum_mem = 32000

This is what maintenance_work_mem used to be called. You can delete this entry.

> max_fsm_pages = 500000

You can track at most 500,000 pages with free space on them. In 8.2+ versions VACUUM VERBOSE will show you how many are currently being used. Not sure about 8.1.

> bgwriter_lru_percent = 10.0
> bgwriter_lru_maxpages = 100
> bgwriter_all_percent = 5
> bgwriter_all_maxpages = 200
> wal_buffers = 16
>
> checkpoint_segments = 10

If you have bursts of write activity you might want to increase this.

> checkpoint_warning = 3600
>
> effective_cache_size = 180000

That's 180,000 * 8KB = 180 * 8MB = 1.4GB.

If that's really all you're using as cache, I'd reduce the number of concurrent connections. Check free/top and see how much RAM is really being used as disk cache.

> random_page_cost = 3

Might be too high - you don't mention what disks you have.

> stats_command_string = off

If you turn this one on, you'll be able to see the queries each backend is executing as they happen. Might be useful, but does have some cost.

The crucial thing is to find out exactly what is happening when things get very slow. Check vmstat and top, look in the pg_locks system table, and if needs be we can see what strace says a particular backend is doing.

-- 
Richard Huxton
Archonet Ltd
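A starting point for the pg_locks check might look like this (a sketch for the 8.1-era catalogs, where pg_stat_activity exposes the backend PID as procpid; current_query is only populated when stats_command_string = on):

# -- list backends that are waiting on a lock, with the query they run
# select l.pid, l.relation::regclass as relation, l.mode, a.current_query
# from pg_locks l
# join pg_stat_activity a on a.procpid = l.pid
# where not l.granted;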