Re: [GENERAL] Cannot reproduce why a query is slow

John Cheng Wed, 27 Jul 2011 12:57:34 -0700

Sorry about the long wait between reply.

On Sat, May 7, 2011 at 2:36 PM, Tomas Vondra <[email protected]> wrote:


> Resource usually means there's too much I/O so the query is slow, but
> when you try it later the drives are idle and query runs much faster.
> Run some monitoring, e.g. even a simple 'iostat -x' or 'dstat' output
> might be sufficient. Once the slow query happens, try to correlate it to
> the CPU / disk activity.
>

I am unsure if disk IO could be a part of the problem. iostat shows a max of
~2000kb/second write. This query gets during a batch of insert/update
process (every 5 minutes or so, we batch process a list of 20 or so xmls,
i.e., "leads"). During the batch processing, iostat shows pretty low usage.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.38    0.00    0.12    0.62    0.00   94.88

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz
avgqu-sz   await  svctm  %util
sda               0.00     0.99  0.00  5.94     0.00    55.45     9.33
0.10   16.17   6.17   3.66
sdb               0.00   172.28  0.00 11.88     0.00  1473.27   124.00
0.03    2.42   1.08   1.29
sdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00
0.00    0.00   0.00   0.00
sdd               0.00     0.00  0.00  0.00     0.00     0.00     0.00
0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.35    0.00    0.37    0.37    0.00   90.90

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz
avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00
0.00    0.00   0.00   0.00
sdb               0.00   621.00  0.00 23.00     0.00  5152.00   224.00     0.08
   3.52   1.22   2.80
sdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00
0.00    0.00   0.00   0.00
sdd               0.00     0.00  0.00  0.00     0.00     0.00     0.00
0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.62    0.00    0.12    1.88    0.00   88.38

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz
avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00
0.00    0.00   0.00   0.00
sdb               0.00    73.00  0.00  9.00     0.00   656.00    72.89
0.01    1.33   1.00   0.90
sdc               0.00   200.00 20.00 129.00   304.00  2632.00    19.70
1.22    8.16   0.68  10.20
sdd               0.00     0.00  9.00  0.00   120.00     0.00    13.33
0.04    4.67   4.56   4.10

The 'atop' command shows low disk activity ~2%, but I'm not sure how
accurate atop is. If I were to run a similar query from the command line
while the batch job is running, I still get a 50ms execution time.

CPU usage does goes up a lot during this time. There are 8 cores and 1 of
the cores shoots up to ~97%. The process updates XML records in the database
and performs dozens of xpath based queries to pull data out of the xml into
relational tables. These extractions are performed with pgplsql triggers and
Postgres xpath functions. Hence the spike in CPU usage during that time.


>
> The unexpected plan change is usually caused by autovacuum/autoanalyze
> collecting skewed data for some reason, resulting in bad plan choice.
> Then the autovacuum runs again and you get different (much better) plan.
> This can be detected using the auto_explain contrib module, as someone
> already recommended.
>

We are using Postgres 8.3 and I don't believe the autoexplain contrib module
is available for 8.3.


>
> > The query is a bit complex, as it is generated by code, but I tried to
> > format it for easier reading. I've also replaced actual data with fake
> > data to protected personal information.
>
> I generally do recommend using explain.depesz.com to post explain plans,
> especially in case of complex queries. I've posted your query and this
> is the result
>
>  http://explain.depesz.com/s/gJO
>
> Not sure if it's relevant to your issue (probably not), but the bitmap
> index scans are significantly overestimated. Not sure if the overlap
> operator affects the estimate accuracy ...
>
> BTW what postgresql version is this? How large the database is, how much
> RAM is available? What is the size of shared_buffers?
>

I am loving explain.depesz.com!

We are using Postgres 8.3.This system has 16GB of RAM. The database is
114GB. shared_buffers is set to 2000MB. A possibly related note... the
'effective_cache_size' setting. The previous admin set it to 8000MB. The
free command shows 13.8GB cache, 397MB buffer. I wonder if changing the
effective_cache_size can improve query planning.

Re: [GENERAL] Cannot reproduce why a query is slow

Reply via email to