Test case:

drop table if exists test;
create table test(a int, b varchar, c varchar);

create index idx1 on test(a);
create index idx2 on test(b);
create index idx3 on test(c);

insert into test select i, repeat('a',30)||i, repeat('a',20)||i from generate_series(1,200000) as i;
delete from test where a < 100000;
\timing on
--vacuum test;
vacuum (PARALLEL 3) test;

VacuumCostDelay=10ms
VacuumCostLimit=2000
Total Vacuum Cost of the operation: ~24030   Heap Scan Cost: ~1724  Index Scan Cost: ~22350

I added some tracepoints to the code to observe how I/O throttling is done.
On master, where the complete vacuum is done by a single process, we can observe that the delay
happens every time the balance reaches ~2000.
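
As a rough illustration of the throttling rule seen in the traces (a simplified model, not the actual PostgreSQL code), the delay values in the log lines are consistent with sleeping for VacuumCostDelay scaled by balance/limit once the balance reaches the limit. The function and parameter names below are hypothetical.

```python
def vacuum_delay_ms(balance, cost_limit=2000, cost_delay_ms=10.0):
    """Model of a delay point: return the sleep time in ms, or 0.0 if no sleep.

    Assumption: the sleep is proportional to how far the accumulated
    balance has gone relative to the limit, as the traced values suggest.
    """
    if balance < cost_limit:
        # Below the limit: keep accumulating cost, no delay yet.
        return 0.0
    return cost_delay_ms * balance / cost_limit


# A balance of 2001 against a limit of 2000 gives a sleep just over 10 ms,
# while the leftover heap-scan balance of 1724 does not trigger a sleep.
print(vacuum_delay_ms(2001))
print(vacuum_delay_ms(1724))
```

After each sleep the balance is assumed to be reset to zero, which is why the traced balances stay close to the limit.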

=============================================================================

Test target : Parallel vacuum patch (v30)
Phase1: Heap vacuuming
Leader:  Vacuum delay pid=12695 balance=2001 limit=2000 delay=10.005000 ms
Leader:  Vacuum Index start vacuum_cost_balance=1724

Phase2: Index vacuuming

Delay point ->  Each process hits its delay point at almost the same time, when its own cost balance is ~2000.
So basically we get the first delay at a combined cost of ~6000, whereas VacuumCostLimit is set to 2000.  I think we can
observe the same thing for all the subsequent delay points.

Leader:   Vacuum delay pid=12695 balance=2005 limit=2000 delay=10.025000 ms
Worker1:  Vacuum delay pid=13733 balance=2003 limit=2000 delay=10.015000 ms
Worker2:  Vacuum delay pid=13732 balance=2003 limit=2000 delay=10.015000 ms

Worker1:  Vacuum delay pid=13733 balance=2007 limit=2000 delay=10.035000 ms
Worker2:  Vacuum delay pid=13732 balance=2007 limit=2000 delay=10.035000 ms

Worker1:  Vacuum delay pid=13733 balance=2000 limit=2000 delay=10.000000 ms
Worker2:  Vacuum delay pid=13732 balance=2001 limit=2000 delay=10.005000 ms

Worker1:  Vacuum delay pid=13733 balance=2005 limit=2000 delay=10.025000 ms
Worker2:  Vacuum delay pid=13732 balance=2001 limit=2000 delay=10.005000 ms

Worker2:  Vacuum delay pid=13732 balance=2000 limit=2000 delay=10.000000 ms
Leader:   Vacuum delay pid=12695 balance=2000 limit=2000 delay=10.000000 ms
Leader:   Vacuum delay pid=12695 balance=2000 limit=2000 delay=10.000000 ms


Observation:  With the parallel vacuum we have noticed a few problems
1. Every process is assigned the full limit, so the first delay point occurs when the I/O cost is = VacuumCostLimit * number of processes (leader included).
2. The remaining balance from the heap scan phase is not handed over to the workers; it is used only by the leader, so
I/O usage can increase significantly if the workers are performing most of the work.
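
The overshoot in point 1 can be sketched with a toy model. It assumes all processes accrue cost at the same rate and start from a zero balance (so it ignores the 1724 carried over by the leader); the function name and the lockstep behaviour are assumptions for illustration only.

```python
def combined_cost_at_first_delay(n_procs, per_proc_limit, cost_per_step=1):
    """Advance all processes in lockstep and return the total cost
    accrued by the time the first process reaches its own limit."""
    balances = [0] * n_procs
    while all(b < per_proc_limit for b in balances):
        # Every process does one unit of I/O work per step (assumption).
        balances = [b + cost_per_step for b in balances]
    return sum(balances)


# Leader + 2 workers, each holding the full limit of 2000.
print(combined_cost_at_first_delay(3, 2000))  # 6000
# Single-process vacuum, as on master.
print(combined_cost_at_first_delay(1, 2000))  # 2000
```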


================================================================================

Test target: Parallel vacuum patch + POC-v1-0001-divide-vacuum-cost-limit
Phase1: Heap vacuuming
Leader:  Vacuum delay pid=48698 balance=2001 limit=2000 delay=10.005000 ms
Leader:  Vacuum Index start vacuum_cost_balance=1724

Phase2: Index vacuuming (The remaining cost balance from the heap scan is equally divided among the workers)
worker1:  pid = 49182 worker start balance=574
worker2:  pid = 49181 worker start balance=574
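
The per-process numbers in this trace (start balance 574, limit 666) match an integer division of the leftover heap-scan balance and of VacuumCostLimit by the three participating processes. A minimal sketch of that arithmetic, assuming plain truncating division (the helper name is hypothetical):

```python
def split_evenly(total, n_procs):
    """Share of `total` given to each process, truncated to an integer."""
    return total // n_procs


n_procs = 3  # leader + 2 workers

per_proc_limit = split_evenly(2000, n_procs)    # VacuumCostLimit share
start_balance = split_evenly(1724, n_procs)     # leftover heap-scan balance share

print(per_proc_limit)  # 666
print(start_balance)   # 574
```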


worker1:  Vacuum delay pid=49182 balance=668 limit=666 delay=10.030030 ms
leader:   Vacuum delay pid=48698 balance=672 limit=666 delay=10.090090 ms
worker2:  Vacuum delay pid=49181 balance=669 limit=666 delay=10.045045 ms

delaypoint ->  Each process delays at almost the same time and the combined cost is ~2000, which is equal to VacuumCostLimit
worker1:  Vacuum delay pid=49182 balance=669 limit=666 delay=10.045045 ms
leader:   Vacuum delay pid=48698 balance=666 limit=666 delay=10.000000 ms
worker2:  Vacuum delay pid=49181 balance=666 limit=666 delay=10.000000 ms
delaypoint ->
worker1:  Vacuum delay pid=49182 balance=667 limit=666 delay=10.015015 ms
leader:   Vacuum delay pid=48698 balance=672 limit=666 delay=10.090090 ms
worker2:  Vacuum delay pid=49181 balance=668 limit=666 delay=10.030030 ms
delaypoint ->
worker1:  Vacuum delay pid=49182 balance=672 limit=666 delay=10.090090 ms
leader:   Vacuum delay pid=48698 balance=666 limit=666 delay=10.000000 ms
worker2:  Vacuum delay pid=49181 balance=674 limit=666 delay=10.120120 ms
delaypoint ->
worker1:  Vacuum delay pid=49182 balance=672 limit=666 delay=10.090090 ms
worker2:  Vacuum delay pid=49181 balance=671 limit=666 delay=10.075075 ms
leader:   Vacuum delay pid=48698 balance=672 limit=666 delay=10.090090 ms
delaypoint->
worker1:  Vacuum delay pid=49182 balance=672 limit=666 delay=10.090090 ms
worker2:  Vacuum delay pid=49181 balance=672 limit=666 delay=10.090090 ms
delaypoint->
worker1:  Vacuum delay pid=49182 balance=672 limit=666 delay=10.090090 ms
worker2:  Vacuum delay pid=49181 balance=672 limit=666 delay=10.090090 ms
leader:   Vacuum delay pid=48698 balance=666 limit=666 delay=10.080000 ms
delaypoint->
worker1:  Vacuum delay pid=49182 balance=672 limit=666 delay=10.090090 ms
worker2:  Vacuum delay pid=49181 balance=666 limit=666 delay=10.000000 ms
worker1:  Vacuum delay pid=49182 balance=667 limit=666 delay=10.015015 ms

worker2:  Vacuum delay pid=49181 balance=672 limit=666 delay=10.090090 ms
worker1:  Vacuum delay pid=49182 balance=671 limit=666 delay=10.075075 ms
leader:   Vacuum delay pid=48698 balance=672 limit=666 delay=10.060070 ms

leader:   Vacuum delay pid=48698 balance=671 limit=666 delay=10.080000 ms
worker2:  Vacuum delay pid=49181 balance=671 limit=666 delay=10.075075 ms
worker1:  Vacuum delay pid=49182 balance=672 limit=666 delay=10.090090 ms

....
At the end of the index vacuuming we combine the remaining cost balance from all the workers
and add it to the leader's balance for the next phase of heap vacuuming.
       :  Post parallel balance=894

Observation:
1. The main problem we observed with parallel vacuum, that the I/O cost overshoots the limit before any delay happens, is solved here by dividing the I/O limit.
2. If some worker has very little work to do and is not doing much I/O, the other workers are still left with a small I/O limit; this can cause more frequent delays than required.
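
Point 2 can be illustrated with the same kind of toy model, this time letting each process accrue cost at its own rate. The rates below are made-up inputs chosen only to show the effect, and the function name is hypothetical.

```python
def cost_when_first_sleeps(rates, per_proc_limit):
    """Return the combined cost of all processes at the moment the first
    one reaches its (divided) limit. `rates` is the cost each process
    accrues per step."""
    if max(rates) <= 0:
        raise ValueError("at least one process must do some I/O")
    balances = [0] * len(rates)
    while all(b < per_proc_limit for b in balances):
        balances = [b + r for b, r in zip(balances, rates)]
    return sum(balances)


# All three processes equally busy: they sleep near the global limit of 2000.
print(cost_when_first_sleeps([1, 1, 1], 666))  # 1998
# One worker idle: the busy ones already sleep at a combined cost of 1332.
print(cost_when_first_sleeps([1, 1, 0], 666))  # 1332
```

In the second case the busy processes are throttled well before the combined cost reaches VacuumCostLimit, which is the extra delay described in point 2.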