Sorry about the long wait between reply. On Sat, May 7, 2011 at 2:36 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
> Resource usually means there's too much I/O so the query is slow, but > when you try it later the drives are idle and query runs much faster. > Run some monitoring, e.g. even a simple 'iostat -x' or 'dstat' output > might be sufficient. Once the slow query happens, try to correlate it to > the CPU / disk activity. > I am unsure if disk IO could be a part of the problem. iostat shows a max of ~2000kb/second write. This query gets during a batch of insert/update process (every 5 minutes or so, we batch process a list of 20 or so xmls, i.e., "leads"). During the batch processing, iostat shows pretty low usage. avg-cpu: %user %nice %system %iowait %steal %idle 4.38 0.00 0.12 0.62 0.00 94.88 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.99 0.00 5.94 0.00 55.45 9.33 0.10 16.17 6.17 3.66 sdb 0.00 172.28 0.00 11.88 0.00 1473.27 124.00 0.03 2.42 1.08 1.29 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 8.35 0.00 0.37 0.37 0.00 90.90 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 621.00 0.00 23.00 0.00 5152.00 224.00 0.08 3.52 1.22 2.80 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 9.62 0.00 0.12 1.88 0.00 88.38 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb 0.00 73.00 0.00 9.00 0.00 656.00 72.89 0.01 1.33 1.00 0.90 sdc 0.00 200.00 20.00 129.00 304.00 2632.00 19.70 1.22 8.16 0.68 10.20 sdd 0.00 0.00 9.00 0.00 120.00 0.00 13.33 0.04 4.67 4.56 4.10 The 'atop' command shows low disk activity ~2%, but I'm not sure how accurate atop is. If I were to run a similar query from the command line while the batch job is running, I still get a 50ms execution time. CPU usage does goes up a lot during this time. There are 8 cores and 1 of the cores shoots up to ~97%. The process updates XML records in the database and performs dozens of xpath based queries to pull data out of the xml into relational tables. These extractions are performed with pgplsql triggers and Postgres xpath functions. Hence the spike in CPU usage during that time. > > The unexpected plan change is usually caused by autovacuum/autoanalyze > collecting skewed data for some reason, resulting in bad plan choice. > Then the autovacuum runs again and you get different (much better) plan. > This can be detected using the auto_explain contrib module, as someone > already recommended. > We are using Postgres 8.3 and I don't believe the autoexplain contrib module is available for 8.3. > > > The query is a bit complex, as it is generated by code, but I tried to > > format it for easier reading. I've also replaced actual data with fake > > data to protected personal information. > > I generally do recommend using explain.depesz.com to post explain plans, > especially in case of complex queries. I've posted your query and this > is the result > > http://explain.depesz.com/s/gJO > > Not sure if it's relevant to your issue (probably not), but the bitmap > index scans are significantly overestimated. Not sure if the overlap > operator affects the estimate accuracy ... > > BTW what postgresql version is this? How large the database is, how much > RAM is available? What is the size of shared_buffers? > I am loving explain.depesz.com! We are using Postgres 8.3.This system has 16GB of RAM. The database is 114GB. shared_buffers is set to 2000MB. A possibly related note... the 'effective_cache_size' setting. The previous admin set it to 8000MB. The free command shows 13.8GB cache, 397MB buffer. I wonder if changing the effective_cache_size can improve query planning.