On Tue, Oct 29, 2019 at 4:06 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Mon, Oct 28, 2019 at 2:13 PM Dilip Kumar <dilipbal...@gmail.com> wrote:
> >
> > On Thu, Oct 24, 2019 at 4:33 PM Dilip Kumar <dilipbal...@gmail.com> wrote:
> > >
> > > On Thu, Oct 24, 2019 at 4:21 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > > >
> > > > On Thu, Oct 24, 2019 at 11:51 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
> > > > >
> > > > > On Fri, Oct 18, 2019 at 12:18 PM Dilip Kumar <dilipbal...@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Oct 18, 2019 at 11:25 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > > > > > >
> > > > > > > I am thinking if we can write the patch for both the approaches (a.
> > > > > > > compute shared costs and try to delay based on that, b. try to divide
> > > > > > > the I/O cost among workers as described in the email above[1]) and do
> > > > > > > some tests to see the behavior of throttling, that might help us in
> > > > > > > deciding what is the best strategy to solve this problem, if any.
> > > > > > > What do you think?
> > > > > >
> > > > > > I agree with this idea. I can come up with a POC patch for approach
> > > > > > (b). Meanwhile, if someone is interested to quickly hack with the
> > > > > > approach (a) then we can do some testing and compare. Sawada-san,
> > > > > > by any chance will you be interested to write POC with approach (a)?
> > > > > > Otherwise, I will try to write it after finishing the first one
> > > > > > (approach b).
> > > > >
> > > > > I have come up with the POC for approach (a).
> > > >
> > > > Can we compute the overall throttling (sleep time) in the operation
> > > > separately for heap and index, then divide the index's sleep_time with
> > > > a number of workers and add it to heap's sleep time? Then, it will be
> > > > a bit easier to compare the data between parallel and non-parallel
> > > > case.
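(For readers following the thread: approach (b) discussed above boils down to giving each worker an equal share of the I/O budget. A minimal sketch of that idea follows; the function and parameter names are illustrative, not from any of the attached patches.)

```c
/*
 * Approach (b), sketched: divide the I/O budget (vacuum_cost_limit)
 * evenly among N workers so that the aggregate throttling roughly
 * matches a serial vacuum. Illustrative names only; not patch code.
 */
static int
worker_cost_limit(int vacuum_cost_limit, int nworkers)
{
	int		limit;

	/* Leader running alone (no workers) keeps the full budget. */
	if (nworkers <= 0)
		return vacuum_cost_limit;

	/*
	 * Each worker gets an equal share, but at least 1 so its cost
	 * balance can still reach the limit and trigger a sleep.
	 */
	limit = vacuum_cost_limit / nworkers;
	return (limit > 0) ? limit : 1;
}
```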
> >
> > I have come up with a patch to compute the total delay during the
> > vacuum. So the idea of computing the total cost delay is
> >
> > Total cost delay = Total delay of heap scan + Total delay of
> > index/worker; Patch is attached for the same.
> >
> > I have prepared this patch on the latest patch of the parallel
> > vacuum[1]. I have also rebased the patch for the approach [b] for
> > dividing the vacuum cost limit and done some testing for computing the
> > I/O throttling. Attached patches 0001-POC-compute-total-cost-delay
> > and 0002-POC-divide-vacuum-cost-limit can be applied on top of
> > v31-0005-Add-paralell-P-option-to-vacuumdb-command.patch. I haven't
> > rebased on top of v31-0006, because v31-0006 implements the I/O
> > throttling with one approach and 0002-POC-divide-vacuum-cost-limit is
> > doing the same with another approach. But
> > 0001-POC-compute-total-cost-delay can be applied on top of v31-0006 as
> > well (just a 1-2 line conflict).
> >
> > Testing: I have performed 2 tests, one with same-size indexes and the
> > second with different-size indexes, and measured the total I/O delay
> > with the attached patch.
> >
> > Setup:
> > VacuumCostDelay=10ms
> > VacuumCostLimit=2000
> >
> > Test1 (Same size index):
> > create table test(a int, b varchar, c varchar);
> > create index idx1 on test(a);
> > create index idx2 on test(b);
> > create index idx3 on test(c);
> > insert into test select i, repeat('a',30)||i, repeat('a',20)||i from
> > generate_series(1,500000) as i;
> > delete from test where a < 200000;
> >
> >               Vacuum (Head)   Parallel Vacuum   Vacuum Cost Divide Patch
> > Total Delay   1784 (ms)       1398 (ms)         1938 (ms)
> >
> > Test2 (Variable size dead tuples in index):
> > create table test(a int, b varchar, c varchar);
> > create index idx1 on test(a);
> > create index idx2 on test(b) where a > 100000;
> > create index idx3 on test(c) where a > 150000;
> >
> > insert into test select i, repeat('a',30)||i, repeat('a',20)||i from
> > generate_series(1,500000) as i;
> > delete from test where a < 200000;
> >
> >               Vacuum (Head)   Parallel Vacuum   Vacuum Cost Divide Patch
> > Total Delay   1438 (ms)       1029 (ms)         1529 (ms)
> >
> > Conclusion:
> > 1. The tests show that the total I/O delay is significantly less with
> > the parallel vacuum.
> > 2. With the vacuum cost divide patch the problem is solved, but the
> > delay is a bit more compared to the non-parallel version. The reason
> > could be the problem discussed at [2], but it needs further
> > investigation.
> >
> > Next, I will test with the v31-0006 (shared vacuum cost) patch. I
> > will also try to test different types of indexes.
>
> Thank you for testing!
>
> I realized that the v31-0006 patch doesn't work correctly, so I've
> attached an updated version that also incorporates some comments I got
> so far. Sorry for the inconvenience. I'll apply your 0001 patch and
> also test the total delay time.
>
FWIW I'd like to share the results of the total delay time evaluation of
approach (a) (shared cost balance). I used the same workloads that Dilip
shared and set vacuum_cost_delay to 10. The results of the two test cases
are here:

* Test1
normal    : 12656 ms (hit 50594, miss  5700, dirty  7258, total 63552)
2 workers : 17149 ms (hit 47673, miss  8647, dirty  9157, total 65477)
1 worker  : 19498 ms (hit 45954, miss 10340, dirty 10517, total 66811)

* Test2
normal    : 1530 ms (hit 30645, miss 2, dirty 3, total 30650)
2 workers : 1538 ms (hit 30645, miss 2, dirty 3, total 30650)
1 worker  : 1538 ms (hit 30645, miss 2, dirty 3, total 30650)

'hit', 'miss' and 'dirty' are the total numbers of buffer hits, buffer
misses and dirty-buffer flushes, respectively. 'total' is the sum of
these three values.

In this evaluation I expected the parallel vacuum cases to be delayed
about as long as a normal vacuum, because the total number of pages to
vacuum is the same and each worker decides whether to sleep based on the
shared cost balance value. The Test1 results above show a big difference
in total delay time among these cases (the normal vacuum case is
shortest), but the cause is that the parallel vacuum had to flush more
dirty pages. Indeed, after increasing shared_buffers I got the expected
results:

* Test1 (after increasing shared_buffers)
normal    : 2807 ms (hit 56295, miss 2, dirty 3, total 56300)
2 workers : 2840 ms (hit 56295, miss 2, dirty 3, total 56300)
1 worker  : 2841 ms (hit 56295, miss 2, dirty 3, total 56300)

I updated the patch that computes the total cost delay, shared by
Dilip[1], so that it also collects the number of buffer hits and so on,
and have attached it. It can be applied on top of my latest patch set[1].
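For readers following along, the shared-cost-balance idea being evaluated here can be sketched roughly as follows. This is a simplified illustration, not the patch itself: plain C11 atomics stand in for PostgreSQL's pg_atomic_uint32 primitives, and the names and the reset-on-sleep policy are assumptions for the sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Approach (a), sketched: every participant charges its page costs to a
 * single shared balance; the caller that pushes the balance past the
 * limit resets it and pays the vacuum_cost_delay sleep. Illustrative
 * only -- the real patch uses pg_atomic_uint32 and vacuum_delay_point().
 */
static _Atomic unsigned int shared_cost_balance;

/* Charge one page's cost; returns true when the caller should sleep. */
static bool
charge_page_cost(unsigned int page_cost, unsigned int cost_limit)
{
	unsigned int newbal =
		atomic_fetch_add(&shared_cost_balance, page_cost) + page_cost;

	if (newbal >= cost_limit)
	{
		/* This caller pays the delay and resets the shared balance. */
		atomic_store(&shared_cost_balance, 0);
		return true;
	}
	return false;
}
```

Because all workers feed one balance, the number of sleeps is driven by the total I/O of the operation rather than by the number of workers, which is why the total delay is expected to match the serial case.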
[1] https://www.postgresql.org/message-id/CAFiTN-thU-z8f04jO7xGMu5yUUpTpsBTvBrFW6EhRf-jGvEz%3Dg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAD21AoAqT17QwKJ_sWOqRxNvg66wMw1oZZzf9Rt-E-zD%2BXOh_Q%40mail.gmail.com

Regards,

--
Masahiko Sawada
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a9d9f31887..5ed92ac8d7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -137,6 +137,7 @@
 #define PARALLEL_VACUUM_KEY_SHARED			1
 #define PARALLEL_VACUUM_KEY_DEAD_TUPLES		2
 #define PARALLEL_VACUUM_KEY_QUERY_TEXT		3
+#define PARALLEL_VACUUM_KEY_COST_DELAY		4
 
 /*
  * PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION disables the leader's
@@ -257,6 +258,21 @@ typedef struct LVSharedIndStats
 #define GetIndexBulkDeleteResult(s) \
 	((IndexBulkDeleteResult *)((char *)(s) + sizeof(LVSharedIndStats)))
 
+typedef struct LVDelayStats
+{
+	double		time;
+	int			hit;
+	int			miss;
+	int			dirty;
+} LVDelayStats;
+
+typedef struct LVCostDelay
+{
+	pg_atomic_uint32	nslot;
+	LVDelayStats		stats[FLEXIBLE_ARRAY_MEMBER];
+} LVCostDelay;
+#define SizeOfLVCostDelay (offsetof(LVCostDelay, stats) + sizeof(LVDelayStats))
+
 /* Struct for parallel lazy vacuum */
 typedef struct LVParallelState
 {
@@ -265,6 +281,8 @@ typedef struct LVParallelState
 	/* Shared information among parallel vacuum workers */
 	LVShared   *lvshared;
 
+	/* Shared cost delay. */
+	LVCostDelay *lvcostdelay;
 
 	/*
 	 * Always true except for a debugging case where
 	 * PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION are defined.
@@ -757,6 +775,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		parallel_workers = compute_parallel_workers(Irel, nindexes,
 													params->nworkers);
 
+	VacuumCostTotalDelay = 0;
 	if (parallel_workers > 0)
 	{
 		/*
@@ -1733,6 +1752,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 					vacrelstats->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
+
+	elog(WARNING, "stats delay %lf, hit %d, miss %d, dirty %d, total %d",
+		 VacuumCostTotalDelay, _nhit, _nmiss, _ndirty,
+		 _nhit + _nmiss + _ndirty);
 }
 
@@ -1967,6 +1990,8 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 							int nindexes, IndexBulkDeleteResult **stats,
 							LVParallelState *lps)
 {
+	int			i;
+
 	Assert(!IsParallelWorker());
 	Assert(ParallelVacuumIsActive(lps));
 	Assert(nindexes > 0);
@@ -2011,6 +2036,18 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 	/* Continue to use the shared balance value */
 	VacuumCostBalance = pg_atomic_read_u32(&(lps->lvshared->cost_balance));
 
+	/*
+	 * Collect all the delay from workers and add to total delay. The delay
+	 * from the leader is already included in VacuumCostTotalDelay.
+	 */
+	for (i = 0; i < lps->pcxt->nworkers_launched; i++)
+	{
+		VacuumCostTotalDelay += lps->lvcostdelay->stats[i].time;
+		_nhit += lps->lvcostdelay->stats[i].hit;
+		_nmiss += lps->lvcostdelay->stats[i].miss;
+		_ndirty += lps->lvcostdelay->stats[i].dirty;
+	}
+
 	/*
 	 * We need to reinitialize the parallel context as no more index vacuuming and
 	 * index cleanup will be performed after that.
@@ -2968,10 +3005,12 @@ begin_parallel_vacuum(LVRelStats *vacrelstats, Oid relid, BlockNumber nblocks,
 	ParallelContext *pcxt;
 	LVShared   *shared;
 	LVDeadTuples *dead_tuples;
+	LVCostDelay *costdelay;
 	long		maxtuples;
 	char	   *sharedquery;
 	Size		est_shared;
 	Size		est_deadtuples;
+	Size		est_costdelay;
 	int			querylen;
 	int			i;
 
@@ -3043,6 +3082,14 @@ begin_parallel_vacuum(LVRelStats *vacrelstats, Oid relid, BlockNumber nblocks,
 	sharedquery[querylen] = '\0';
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_QUERY_TEXT, sharedquery);
 
+	/* Vacuum cost balance. */
+	est_costdelay = MAXALIGN(add_size(SizeOfLVCostDelay,
+									  mul_size(sizeof(LVDelayStats), nrequested)));
+	costdelay = (LVCostDelay *) shm_toc_allocate(pcxt->toc, est_costdelay);
+	pg_atomic_init_u32(&(costdelay->nslot), 0);
+	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_COST_DELAY, costdelay);
+	lps->lvcostdelay = costdelay;
+
 	return lps;
 }
 
@@ -3171,8 +3218,10 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	Relation   *indrels;
 	LVShared   *lvshared;
 	LVDeadTuples *dead_tuples;
+	LVCostDelay *costdelay;
 	int			nindexes;
 	char	   *sharedquery;
+	int			slot;
 	IndexBulkDeleteResult **stats;
 
 	lvshared = (LVShared *) shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_SHARED,
@@ -3207,6 +3256,11 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 											  PARALLEL_VACUUM_KEY_DEAD_TUPLES,
 											  false);
 
+	costdelay = (LVCostDelay *) shm_toc_lookup(toc,
+											   PARALLEL_VACUUM_KEY_COST_DELAY,
+											   false);
+	slot = pg_atomic_fetch_add_u32(&(costdelay->nslot), 1);
+
 	/* Set cost-based vacuum delay */
 	VacuumCostActive = (VacuumCostDelay > 0);
 	VacuumCostBalance = 0;
@@ -3214,6 +3268,7 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	VacuumPageMiss = 0;
 	VacuumPageDirty = 0;
 	VacuumSharedCostBalance = &(lvshared->cost_balance);
+	_nhit = _nmiss = _ndirty = 0;
 
 	stats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
 
@@ -3225,6 +3280,11 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	vacuum_or_cleanup_indexes_worker(indrels, nindexes, stats, lvshared,
 									 dead_tuples);
 
+	/* Update the total delay in the shared location. */
+	costdelay->stats[slot].time = VacuumCostTotalDelay;
+	costdelay->stats[slot].hit = _nhit;
+	costdelay->stats[slot].miss = _nmiss;
+	costdelay->stats[slot].dirty = _ndirty;
 	vac_close_indexes(nindexes, indrels, RowExclusiveLock);
 	table_close(onerel, ShareUpdateExclusiveLock);
 }
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 1b9ea9b672..7fcd2d082f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -412,6 +412,7 @@ vacuum(List *relations, VacuumParams *params,
 	VacuumPageHit = 0;
 	VacuumPageMiss = 0;
 	VacuumPageDirty = 0;
+	_nhit = _nmiss = _ndirty = 0;
 	VacuumSharedCostBalance = NULL;
 
 	/*
@@ -2050,6 +2051,8 @@ vacuum_delay_point(void)
 		/* update balance values for workers */
 		AutoVacuumUpdateDelay();
 
+		VacuumCostTotalDelay += msec;
+
 		/* Might have gotten an interrupt while sleeping */
 		CHECK_FOR_INTERRUPTS();
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 483f705305..56e3631c6e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -770,7 +770,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			VacuumPageHit++;
 
 			if (VacuumCostActive)
+			{
 				VacuumCostBalance += VacuumCostPageHit;
+				_nhit++;
+			}
 
 			TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
 											  smgr->smgr_rnode.node.spcNode,
@@ -959,7 +962,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		VacuumPageMiss++;
 		if (VacuumCostActive)
+		{
 			VacuumCostBalance += VacuumCostPageMiss;
+			_nmiss++;
+		}
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
 										  smgr->smgr_rnode.node.spcNode,
@@ -1500,7 +1506,10 @@ MarkBufferDirty(Buffer buffer)
 		VacuumPageDirty++;
 		pgBufferUsage.shared_blks_dirtied++;
 		if (VacuumCostActive)
+		{
 			VacuumCostBalance += VacuumCostPageDirty;
+			_ndirty++;
+		}
 	}
 }
 
@@ -3556,7 +3565,10 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 			VacuumPageDirty++;
 			pgBufferUsage.shared_blks_dirtied++;
 			if (VacuumCostActive)
+			{
 				VacuumCostBalance += VacuumCostPageDirty;
+				_ndirty++;
+			}
 		}
 	}
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256..de214f347b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -139,11 +139,15 @@ int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
 int			VacuumCostLimit = 200;
 double		VacuumCostDelay = 0;
-
+double		VacuumCostTotalDelay = 0;
 int			VacuumPageHit = 0;
 int			VacuumPageMiss = 0;
 int			VacuumPageDirty = 0;
 
+int			_nhit = 0;
+int			_nmiss = 0;
+int			_ndirty = 0;
+
 int			VacuumCostBalance = 0;	/* working state for vacuum */
 bool		VacuumCostActive = false;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03fbc7..8d95b6ef4f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -251,6 +251,10 @@ extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
 extern int	VacuumCostLimit;
 extern double VacuumCostDelay;
+extern double VacuumCostTotalDelay;
+extern int	_nhit;
+extern int	_nmiss;
+extern int	_ndirty;
 
 extern int	VacuumPageHit;
 extern int	VacuumPageMiss;
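As an aside, the per-worker stats handoff in heap_parallel_vacuum_main above (the pg_atomic_fetch_add_u32 on nslot) can be illustrated standalone. The sketch below uses plain C11 atomics and hypothetical names in place of the PostgreSQL primitives and shm_toc machinery; it only demonstrates the slot-claim pattern, not the patch itself.

```c
#include <stdatomic.h>

#define MAX_WORKERS 8			/* illustrative fixed bound */

/* Per-worker delay statistics, mirroring the patch's LVDelayStats. */
typedef struct WorkerDelayStats
{
	double	time;				/* accumulated cost delay (ms) */
	int		hit;				/* buffer hits */
	int		miss;				/* buffer misses */
	int		dirty;				/* buffers dirtied */
} WorkerDelayStats;

/* Shared area: an atomic counter hands out one private slot per worker. */
typedef struct SharedDelayArea
{
	_Atomic unsigned int nslot;	/* next free slot index */
	WorkerDelayStats stats[MAX_WORKERS];
} SharedDelayArea;

/*
 * Each worker calls this once at startup; the atomic fetch-add
 * guarantees every worker receives a distinct index, so the stats
 * writes at worker exit can never collide.
 */
static unsigned int
claim_stats_slot(SharedDelayArea *area)
{
	return atomic_fetch_add(&area->nslot, 1);
}
```

The leader can then simply sum stats[0..nworkers_launched-1] after the workers finish, which is exactly what the loop in lazy_parallel_vacuum_or_cleanup_indexes does.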