On May 24, 2015 6:42 AM, "Nils Goroll" <sl...@schokola.de> wrote:
>
> Hi Jeff and all,
>
> On 23/05/15 22:13, Jeff Janes wrote:
> > Are you sure it is the read IO that causes the problem?
>
> Yes. Trouble is here that we are talking about a 361 GB table
>
>                         List of relations
>  Schema |         Name         |  Type | Owner    |  Size  | Description
> --------+----------------------+-------+----------+--------+-------------
>  public | *redacted*_y2015m04  | table | postgres | 361 GB |
>
> and while we have
>
> shared_buffers = 325GB
> huge_pages = on
As mentioned, that is a very large setting for shared_buffers.

> this is not the only table of this size (total db size is 1.8 TB) and more
> current data got written to *redacted*_y2015m05 (the manually-partitioned table
> for May), so most of the m04 data would have been evicted from the cache when
> this issue first surfaced.
>
> There is one application pushing data (bulk inserts) and we have transaction
> rates for this app in a log. The moment the vacuum started, these rates dropped.
> Unfortunately I cannot present helpful log excerpts here, as the autovacuum has
> not finished yet (because the admin killed the db), so we have zero logging
> about past autovac events.

Could you run an experiment in which you do a large sequential read on the
database files and measure the impact on the queries that way? Something like:

tar -cf - data_dir | wc -c

Or better, use some fancier version that throttles to the read rate observed
below.

> At the moment, the application is shut down and the machine is only running
> the vacs:
>
> query_start | 2015-05-22 19:33:52.44334+02
> waiting     | f
> query       | autovacuum: VACUUM public.*redacted*_y2015m04 (to prevent wraparound)
>
> query_start | 2015-05-22 19:34:02.46004+02
> waiting     | f
> query       | autovacuum: VACUUM ANALYZE public.*redacted*_y2015m05 (to prevent wraparound)
>
> so we know that any io must be caused by the vacs:
>
> shell# uptime
>  13:33:33 up 1 day, 18:01, 2 users, load average: 5.75, 12.71, 8.43

What OS is this? This load average looks very high. Does the OS charge
processes that are blocked on IO against the load average?

> shell# zpool iostat
>                    capacity     operations    bandwidth
> pool            alloc   free   read  write   read  write
> --------------- -----  -----  -----  -----  -----  -----
> tank1            358G  6.90T    872     55  15.1M  3.08M

I'm not familiar with zpool, but this shows a lot of writing going on. If the
table was already frozen and just needed to be observed as being all frozen,
then vacuum should not be dirtying one block for every five blocks it reads.
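The throttled sequential-read experiment suggested above could be sketched
roughly like this (a minimal Python sketch, not a tested tool; the default
15.1 MB/s target rate is taken from the zpool iostat output, and the data
directory path is whatever your installation uses):

```python
import os
import time

def throttled_scan(root, rate_mb_s=15.1, chunk=1 << 20):
    """Sequentially read every file under root, sleeping as needed so the
    aggregate read rate stays near rate_mb_s. Returns total bytes read."""
    budget = rate_mb_s * (1 << 20)  # target bytes per second
    total = 0
    start = time.monotonic()
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), "rb") as f:
                while True:
                    data = f.read(chunk)
                    if not data:
                        break
                    total += len(data)
                    # If we are ahead of the target rate, sleep until the
                    # elapsed time catches up with the bytes read.
                    ahead = total / budget - (time.monotonic() - start)
                    if ahead > 0:
                        time.sleep(ahead)
    return total
```

Running this against the data directory at the iostat-observed rate, while
watching the application's transaction-rate log, would show whether pure
sequential read IO at that rate is enough to reproduce the slowdown.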
I would not be surprised if it were the reading, not the writing, that is
causing the performance problem.

Cheers,

Jeff