On Tue, Oct 29, 2019 at 4:06 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Mon, Oct 28, 2019 at 2:13 PM Dilip Kumar <dilipbal...@gmail.com> wrote:
> >
> > On Thu, Oct 24, 2019 at 4:33 PM Dilip Kumar <dilipbal...@gmail.com> wrote:
> > >
> > > On Thu, Oct 24, 2019 at 4:21 PM Amit Kapila <amit.kapil...@gmail.com> 
> > > wrote:
> > > >
> > > > On Thu, Oct 24, 2019 at 11:51 AM Dilip Kumar <dilipbal...@gmail.com> 
> > > > wrote:
> > > > >
> > > > > On Fri, Oct 18, 2019 at 12:18 PM Dilip Kumar <dilipbal...@gmail.com> 
> > > > > wrote:
> > > > > >
> > > > > > On Fri, Oct 18, 2019 at 11:25 AM Amit Kapila 
> > > > > > <amit.kapil...@gmail.com> wrote:
> > > > > > >
> > > > > > > I am thinking if we can write the patch for both the approaches 
> > > > > > > (a.
> > > > > > > compute shared costs and try to delay based on that, b. try to 
> > > > > > > divide
> > > > > > > the I/O cost among workers as described in the email above[1]) 
> > > > > > > and do
> > > > > > > some tests to see the behavior of throttling, that might help us 
> > > > > > > in
> > > > > > > deciding what is the best strategy to solve this problem, if any.
> > > > > > > What do you think?
> > > > > >
> > > > > > I agree with this idea.  I can come up with a POC patch for approach
> > > > > > (b).  Meanwhile, if someone is interested to quickly hack with the
> > > > > > approach (a) then we can do some testing and compare.  Sawada-san,
> > > > > > by any chance will you be interested to write POC with approach (a)?
> > > > > > Otherwise, I will try to write it after finishing the first one
> > > > > > (approach b).
> > > > > >
> > > > > I have come up with the POC for approach (a).
> >
> > > > Can we compute the overall throttling (sleep time) in the operation
> > > > separately for heap and index, then divide the index's sleep_time by
> > > > the number of workers and add it to the heap's sleep time?  Then, it
> > > > will be a bit easier to compare the data between the parallel and
> > > > non-parallel cases.
> > I have come up with a patch to compute the total delay during the
> > vacuum.  The idea for computing the total cost delay is:
> >
> > Total cost delay = Total delay of heap scan + Total delay of
> > index/worker;  Patch is attached for the same.
> >
> > I have prepared this patch on top of the latest parallel vacuum
> > patch[1].  I have also rebased the patch for approach (b), which
> > divides the vacuum cost limit, and done some testing to measure the
> > I/O throttling.  The attached patches 0001-POC-compute-total-cost-delay
> > and 0002-POC-divide-vacuum-cost-limit can be applied on top of
> > v31-0005-Add-paralell-P-option-to-vacuumdb-command.patch.  I haven't
> > rebased on top of v31-0006, because v31-0006 implements the I/O
> > throttling with one approach and 0002-POC-divide-vacuum-cost-limit
> > does the same with another approach.  But
> > 0001-POC-compute-total-cost-delay can be applied on top of v31-0006 as
> > well (with just a 1-2 line conflict).
> >
> > Testing:  I have performed 2 tests, one with same-size indexes and the
> > second with different-size indexes, and measured the total I/O delay
> > with the attached patch.
> >
> > Setup:
> > VacuumCostDelay=10ms
> > VacuumCostLimit=2000
> >
> > Test1 (Same size index):
> > create table test(a int, b varchar, c varchar);
> > create index idx1 on test(a);
> > create index idx2 on test(b);
> > create index idx3 on test(c);
> > insert into test select i, repeat('a',30)||i, repeat('a',20)||i from
> > generate_series(1,500000) as i;
> > delete from test where a < 200000;
> >
> >                 Vacuum (Head)    Parallel Vacuum    Vacuum Cost Divide Patch
> > Total Delay     1784 ms          1398 ms            1938 ms
> >
> >
> > Test2 (indexes with different numbers of dead tuples)
> > create table test(a int, b varchar, c varchar);
> > create index idx1 on test(a);
> > create index idx2 on test(b) where a > 100000;
> > create index idx3 on test(c) where a > 150000;
> >
> > insert into test select i, repeat('a',30)||i, repeat('a',20)||i from
> > generate_series(1,500000) as i;
> > delete from test where a < 200000;
> >
> >                 Vacuum (Head)    Parallel Vacuum    Vacuum Cost Divide Patch
> > Total Delay     1438 ms          1029 ms            1529 ms
> >
> >
> > Conclusion:
> > 1. The tests show that the total I/O delay is significantly less with
> > the parallel vacuum.
> > 2. With the vacuum cost divide approach the problem is solved, but the
> > delay is a bit more than in the non-parallel version.  The reason could
> > be the problem discussed at [2], but it needs further investigation.
> >
> > Next, I will test with the v31-0006 (shared vacuum cost) patch.  I
> > will also try to test different types of indexes.
> >
>
> Thank you for testing!
>
> I realized that the v31-0006 patch doesn't work correctly, so I've
> attached an updated version that also incorporates some comments I got
> so far. Sorry for the inconvenience. I'll apply your 0001 patch and
> also test the total delay time.
>

FWIW I'd like to share the results of the total delay time evaluation
of approach (a) (shared cost balance). I used the same workloads that
Dilip shared and set vacuum_cost_delay to 10. The results of the two
test cases are:
* Test1
normal      : 12656 ms (hit 50594, miss 5700, dirty 7258, total 63552)
2 workers : 17149 ms (hit 47673, miss 8647, dirty 9157, total 65477)
1 worker   : 19498 ms (hit 45954, miss 10340, dirty 10517, total 66811)

* Test2
normal      : 1530 ms (hit 30645, miss 2, dirty 3, total 30650)
2 workers : 1538 ms (hit 30645, miss 2, dirty 3, total 30650)
1 worker   : 1538 ms (hit 30645, miss 2, dirty 3, total 30650)

'hit', 'miss' and 'dirty' are the total numbers of buffer hits, buffer
misses and dirtied buffers, respectively. 'total' is the sum of these
three values.
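For example, in the Test1 normal case the total is
50594 + 5700 + 7258 = 63552 page accesses.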

In this evaluation I expect the parallel vacuum cases to be delayed as
much as the normal vacuum case, because the total number of pages to
vacuum is the same and each worker decides when to sleep based on the
shared cost balance value. According to the above Test1 results, there
is a big difference in the total delay time among these cases (the
normal vacuum case is the shortest), but the cause is that the parallel
vacuum had to flush more dirty pages. After increasing shared_buffers I
got the expected results:

* Test1 (after increasing shared_buffers)
normal      : 2807 ms (hit 56295, miss 2, dirty 3, total 56300)
2 workers : 2840 ms (hit 56295, miss 2, dirty 3, total 56300)
1 worker   : 2841 ms (hit 56295, miss 2, dirty 3, total 56300)
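
To make the mechanism being measured explicit, the delay decision with a
shared cost balance works roughly as in the sketch below. This is only an
illustrative sketch, not the code in the patch set:
shared_vacuum_delay_point() is a made-up helper, VacuumCostTotalDelay
comes from the attached patch, and the pg_atomic_*() / pg_usleep() calls
are existing primitives.

#include "postgres.h"
#include "miscadmin.h"
#include "port/atomics.h"

/*
 * Illustrative sketch of the shared-cost-balance throttling of approach
 * (a).  Each backend adds its locally accumulated cost to the shared
 * balance; the backend that observes the balance crossing the limit
 * sleeps and then resets the shared balance so the others start
 * accumulating again.
 */
static void
shared_vacuum_delay_point(pg_atomic_uint32 *shared_cost_balance)
{
    uint32      shared_balance;

    if (!VacuumCostActive || VacuumCostBalance == 0)
        return;

    /* Push the locally accumulated cost into the shared balance. */
    shared_balance = pg_atomic_add_fetch_u32(shared_cost_balance,
                                             VacuumCostBalance);
    VacuumCostBalance = 0;

    if (shared_balance >= (uint32) VacuumCostLimit)
    {
        /* Same proportional sleep as vacuum_delay_point(), capped at 4x. */
        double      msec = VacuumCostDelay * shared_balance / VacuumCostLimit;

        if (msec > VacuumCostDelay * 4)
            msec = VacuumCostDelay * 4;

        pg_usleep((long) (msec * 1000));
        VacuumCostTotalDelay += msec;   /* accumulated for the WARNING output */

        /* Let all backends start accumulating from zero again. */
        pg_atomic_write_u32(shared_cost_balance, 0);

        CHECK_FOR_INTERRUPTS();
    }
}

The point is that the sleeps are driven by one balance shared by the
leader and all workers, so the total sleep time should not depend on the
number of workers as long as the amount of I/O is the same.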

I updated the patch that computes the total cost delay, shared by
Dilip[1], so that it also collects the numbers of buffer hits, misses
and dirtied buffers, and have attached it. It can be applied on top of
my latest patch set[2].
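
For reference, with the attached patch applied these totals are printed
at the end of the heap scan via elog(WARNING), so a run such as
VACUUM (PARALLEL 2) test; (using the PARALLEL option from the patch set)
ends with a line like

  WARNING:  stats delay 17149.000000, hit 47673, miss 8647, dirty 9157, total 65477

where the numbers are taken from the Test1 two-worker case above and are
only meant to illustrate the format.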

[1] 
https://www.postgresql.org/message-id/CAFiTN-thU-z8f04jO7xGMu5yUUpTpsBTvBrFW6EhRf-jGvEz%3Dg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAD21AoAqT17QwKJ_sWOqRxNvg66wMw1oZZzf9Rt-E-zD%2BXOh_Q%40mail.gmail.com

Regards,

--
Masahiko Sawada
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a9d9f31887..5ed92ac8d7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -137,6 +137,7 @@
 #define PARALLEL_VACUUM_KEY_SHARED			1
 #define PARALLEL_VACUUM_KEY_DEAD_TUPLES		2
 #define PARALLEL_VACUUM_KEY_QUERY_TEXT		3
+#define PARALLEL_VACUUM_KEY_COST_DELAY		4
 
 /*
  * PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION disables the leader's
@@ -257,6 +258,21 @@ typedef struct LVSharedIndStats
 #define GetIndexBulkDeleteResult(s) \
 	((IndexBulkDeleteResult *)((char *)(s) + sizeof(LVSharedIndStats)))
 
+typedef struct LVDelayStats
+{
+	double	time;
+	int		hit;
+	int		miss;
+	int		dirty;
+} LVDelayStats;
+
+typedef struct LVCostDelay
+{
+	pg_atomic_uint32	nslot;
+	LVDelayStats 		stats[FLEXIBLE_ARRAY_MEMBER];
+} LVCostDelay;
+#define SizeOfLVCostDelay (offsetof(LVCostDelay, stats) + sizeof(LVDelayStats))
+
 /* Struct for parallel lazy vacuum */
 typedef struct LVParallelState
 {
@@ -265,6 +281,8 @@ typedef struct LVParallelState
 	/* Shared information among parallel vacuum workers */
 	LVShared		*lvshared;
 
+	/* Shared cost delay. */
+	LVCostDelay		*lvcostdelay;
 	/*
 	 * Always true except for a debugging case where
 	 * PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION are defined.
@@ -757,6 +775,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		parallel_workers = compute_parallel_workers(Irel, nindexes,
 													params->nworkers);
 
+	VacuumCostTotalDelay = 0;
 	if (parallel_workers > 0)
 	{
 		/*
@@ -1733,6 +1752,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 					vacrelstats->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
+
+	elog(WARNING, "stats delay %lf, hit %d, miss %d, dirty %d, total %d",
+		 VacuumCostTotalDelay, _nhit, _nmiss, _ndirty,
+		_nhit + _nmiss + _ndirty);
 }
 
 
@@ -1967,6 +1990,8 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 										int nindexes, IndexBulkDeleteResult **stats,
 										LVParallelState *lps)
 {
+	int		i;
+
 	Assert(!IsParallelWorker());
 	Assert(ParallelVacuumIsActive(lps));
 	Assert(nindexes > 0);
@@ -2011,6 +2036,18 @@ lazy_parallel_vacuum_or_cleanup_indexes(LVRelStats *vacrelstats, Relation *Irel,
 	/* Continue to use the shared balance value */
 	VacuumCostBalance = pg_atomic_read_u32(&(lps->lvshared->cost_balance));
 
+	/*
+	 * Collect the delay from each worker and add it to the total delay.  The
+	 * leader's delay is already included in VacuumCostTotalDelay.
+	 */
+	for (i = 0; i < lps->pcxt->nworkers_launched; i++)
+	{
+		VacuumCostTotalDelay += lps->lvcostdelay->stats[i].time;
+		_nhit += lps->lvcostdelay->stats[i].hit;
+		_nmiss += lps->lvcostdelay->stats[i].miss;
+		_ndirty += lps->lvcostdelay->stats[i].dirty;
+	}
+
 	/*
 	 * We need to reinitialize the parallel context as no more index vacuuming and
 	 * index cleanup will be performed after that.
@@ -2968,10 +3005,12 @@ begin_parallel_vacuum(LVRelStats *vacrelstats, Oid relid, BlockNumber nblocks,
 	ParallelContext *pcxt;
 	LVShared		*shared;
 	LVDeadTuples	*dead_tuples;
+	LVCostDelay		*costdelay;
 	long	maxtuples;
 	char	*sharedquery;
 	Size	est_shared;
 	Size	est_deadtuples;
+	Size	est_costdelay;
 	int		querylen;
 	int		i;
 
@@ -3043,6 +3082,14 @@ begin_parallel_vacuum(LVRelStats *vacrelstats, Oid relid, BlockNumber nblocks,
 	sharedquery[querylen] = '\0';
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_QUERY_TEXT, sharedquery);
 
+	/* Shared vacuum cost delay statistics. */
+	est_costdelay = MAXALIGN(add_size(SizeOfLVCostDelay,
+								   mul_size(sizeof(LVDelayStats), nrequested)));
+	costdelay = (LVCostDelay *) shm_toc_allocate(pcxt->toc, est_costdelay);
+	pg_atomic_init_u32(&(costdelay->nslot), 0);
+	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_COST_DELAY, costdelay);
+	lps->lvcostdelay = costdelay;
+
 	return lps;
 }
 
@@ -3171,8 +3218,10 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	Relation	*indrels;
 	LVShared	*lvshared;
 	LVDeadTuples	*dead_tuples;
+	LVCostDelay		*costdelay;
 	int			nindexes;
 	char		*sharedquery;
+	int			slot;
 	IndexBulkDeleteResult **stats;
 
 	lvshared = (LVShared *) shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_SHARED,
@@ -3207,6 +3256,11 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 												  PARALLEL_VACUUM_KEY_DEAD_TUPLES,
 												  false);
 
+	costdelay = (LVCostDelay *) shm_toc_lookup(toc,
+												   PARALLEL_VACUUM_KEY_COST_DELAY,
+												   false);
+	slot = pg_atomic_fetch_add_u32(&(costdelay->nslot), 1);
+
 	/* Set cost-based vacuum delay */
 	VacuumCostActive = (VacuumCostDelay > 0);
 	VacuumCostBalance = 0;
@@ -3214,6 +3268,7 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	VacuumPageMiss = 0;
 	VacuumPageDirty = 0;
 	VacuumSharedCostBalance = &(lvshared->cost_balance);
+	_nhit = _nmiss = _ndirty = 0;
 
 	stats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
@@ -3225,6 +3280,11 @@ heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	vacuum_or_cleanup_indexes_worker(indrels, nindexes, stats, lvshared,
 									 dead_tuples);
 
+	/* Update the total delay and buffer usage counts in the shared location. */
+	costdelay->stats[slot].time = VacuumCostTotalDelay;
+	costdelay->stats[slot].hit = _nhit;
+	costdelay->stats[slot].miss = _nmiss;
+	costdelay->stats[slot].dirty = _ndirty;
 	vac_close_indexes(nindexes, indrels, RowExclusiveLock);
 	table_close(onerel, ShareUpdateExclusiveLock);
 }
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 1b9ea9b672..7fcd2d082f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -412,6 +412,7 @@ vacuum(List *relations, VacuumParams *params,
 		VacuumPageHit = 0;
 		VacuumPageMiss = 0;
 		VacuumPageDirty = 0;
+		_nhit = _nmiss = _ndirty = 0;
 		VacuumSharedCostBalance = NULL;
 
 		/*
@@ -2050,6 +2051,8 @@ vacuum_delay_point(void)
 			/* update balance values for workers */
 			AutoVacuumUpdateDelay();
 
+			VacuumCostTotalDelay += msec;
+
 			/* Might have gotten an interrupt while sleeping */
 			CHECK_FOR_INTERRUPTS();
 		}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 483f705305..56e3631c6e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -770,7 +770,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			VacuumPageHit++;
 
 			if (VacuumCostActive)
+			{
 				VacuumCostBalance += VacuumCostPageHit;
+				_nhit++;
+			}
 
 			TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
 											  smgr->smgr_rnode.node.spcNode,
@@ -959,7 +962,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	VacuumPageMiss++;
 	if (VacuumCostActive)
+	{
 		VacuumCostBalance += VacuumCostPageMiss;
+		_nmiss++;
+	}
 
 	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
 									  smgr->smgr_rnode.node.spcNode,
@@ -1500,7 +1506,10 @@ MarkBufferDirty(Buffer buffer)
 		VacuumPageDirty++;
 		pgBufferUsage.shared_blks_dirtied++;
 		if (VacuumCostActive)
+		{
 			VacuumCostBalance += VacuumCostPageDirty;
+			_ndirty++;
+		}
 	}
 }
 
@@ -3556,7 +3565,10 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 			VacuumPageDirty++;
 			pgBufferUsage.shared_blks_dirtied++;
 			if (VacuumCostActive)
+			{
 				VacuumCostBalance += VacuumCostPageDirty;
+				_ndirty++;
+			}
 		}
 	}
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256..de214f347b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -139,11 +139,15 @@ int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
 int			VacuumCostLimit = 200;
 double		VacuumCostDelay = 0;
-
+double		VacuumCostTotalDelay = 0;
 int			VacuumPageHit = 0;
 int			VacuumPageMiss = 0;
 int			VacuumPageDirty = 0;
 
+int	_nhit = 0;
+int _nmiss = 0;
+int _ndirty = 0;
+
 int			VacuumCostBalance = 0;	/* working state for vacuum */
 bool		VacuumCostActive = false;
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03fbc7..8d95b6ef4f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -251,6 +251,10 @@ extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
 extern int	VacuumCostLimit;
 extern double VacuumCostDelay;
+extern double VacuumCostTotalDelay;
+extern int	_nhit;
+extern int _nmiss;
+extern int _ndirty;
 
 extern int	VacuumPageHit;
 extern int	VacuumPageMiss;
