On Tue, Jul 31, 2018 at 8:01 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Jul 30, 2018 at 1:22 AM, Jamison, Kirk <k.jami...@jp.fujitsu.com> 
> wrote:
>> 1. Because the multiple scans of the whole shared buffer per concurrent 
>> truncate/drop table were the cause of the time-consuming behavior, DURING the 
>> failover process while WAL is being applied, we temporarily delay the scanning 
>> and invalidating of shared buffers. At the same time, we remember the 
>> relations/relfilenodes (of dropped/truncated tables) by adding them to a 
>> hash table called the "skip list".
>> 2. After WAL is applied, the checkpointer (or bg writer) scans the shared 
>> buffers only ONCE, compares the pages against the skip list, and invalidates 
>> the relevant pages. After the relevant pages are deleted from shared memory, 
>> they will not be written back to disk.
>>
>> Assuming the theory works, this design will only affect the behavior of the 
>> checkpointer (or maybe the bg writer) during the recovery/failover process. Any 
>> feedback, thoughts?
>
> How would this work if a relfilenode number that belonged to an old
> relation got recycled for a new relation?
>
> I think something like this could be made to work -- both on the
> master and the standby, and not just while waiting for a failover --
> if we did something like this:
>
> (1) Limit the number of deferred drops to a reasonably small number
> (one cache line?  1kB?).
> (2) The background writer regularly does scans to clean out everything
> from the deferred-drop list.
> (3) If a foreground process finds the deferred-drop list is full when
> it needs to add to it, it forces a clean-out of the list contents +
> whatever new stuff it has in a single pass.
> (4) If we are about to generate a relfilenode that's in the list, we
> either force a clean-out or generate a different relfilenode instead.
> (5) Every buffer cleaning operation checks the deferred-drop list and
> invalidates without write-out if the buffer is found.
>
> It's not clear to me whether it would be worth the overhead of doing
> something like this.  Making relation drops faster at the cost of
> making buffer cleaning slower could be a loser.
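
Just to make sure I'm reading the skip list / deferred-drop list idea
the same way you are, here's a rough standalone sketch of points (1)
through (5) in C.  All the names are made up for illustration, none
of this is actual PostgreSQL code, and locking is ignored entirely:

/* Sketch only: hypothetical names, not actual PostgreSQL code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t RelFileNodeId;        /* stand-in for a relfilenode number */

#define DEFERRED_DROP_MAX 16           /* (1) small and fixed, ~a cache line */

static RelFileNodeId deferred_drops[DEFERRED_DROP_MAX]; /* imagine shmem */
static int n_deferred;

/* One pass over the buffer pool, invalidating (without write-out) every
 * buffer whose relfilenode is in the list; the scan itself is elided.
 * (2) the bgwriter would call this regularly; (3) a foreground process
 * calls it when the list is full. */
static void
clean_out_deferred_drops(void)
{
    /* ... scan buffer pool, invalidate matching pages ... */
    n_deferred = 0;
}

/* (5) every buffer-cleaning operation checks this before writing a page. */
static bool
relfilenode_is_deferred_dropped(RelFileNodeId rnode)
{
    for (int i = 0; i < n_deferred; i++)
        if (deferred_drops[i] == rnode)
            return true;               /* invalidate instead of writing out */
    return false;
}

/* DROP/TRUNCATE calls this instead of scanning the whole buffer pool. */
static void
defer_drop(RelFileNodeId rnode)
{
    if (n_deferred == DEFERRED_DROP_MAX)
        clean_out_deferred_drops();    /* (3) the dropper pays when full */
    deferred_drops[n_deferred++] = rnode;
}

/* (4) relfilenode allocation would refuse (or clean out) a listed value. */
static bool
relfilenode_is_usable(RelFileNodeId candidate)
{
    return !relfilenode_is_deferred_dropped(candidate);
}

int
main(void)
{
    defer_drop(1234);
    printf("1234 usable? %d\n", relfilenode_is_usable(1234));    /* 0 */
    clean_out_deferred_drops();
    printf("1234 usable? %d\n", relfilenode_is_usable(1234));    /* 1 */
    return 0;
}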

Perhaps you could make it a bit more incremental, avoiding the full
clean-out scans (your points 2 and 3) while still allowing the
background writer to bear the brunt in the best case?  Something like
this:

The zombie relfilenode tracker could be fixed-size and self-cleaning:
entries could be dropped when the CLOCK hand completes a full rotation
since the entry was created, and if the tracker is full when you're trying
to insert a new entry you can scan the buffer pool until you've caused
one entry (the oldest, by clock-hand-at-time-of-insertion) to be
removed to make space (so the worst case is a full pass, but hopefully
we can drop one entry before then).  The main argument I can think of
against that idea is that it is tied to the current buffer eviction
strategy, using its buffer index-based clock hand as a horizon.
Extracting a useful horizon for algorithms like ARC or CAR would
probably need more thought, because their 'hands' jump around
following next pointers.
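
In pseudo-C, the lifecycle I'm imagining looks something like this --
again purely a sketch with invented names, assuming a plain
buffer-index-based CLOCK hand, and ignoring locking and how the hand
is actually advanced during normal eviction:

/* Sketch only: a fixed-size, self-cleaning zombie relfilenode tracker
 * keyed off a simple buffer-index-based CLOCK hand; hypothetical names. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBUFFERS           65536      /* imaginary buffer pool size */
#define ZOMBIE_TRACKER_MAX 16

typedef uint32_t RelFileNodeId;

typedef struct
{
    RelFileNodeId rnode;
    uint64_t      hand_at_insert;     /* CLOCK hand position when inserted */
} ZombieEntry;

static ZombieEntry zombies[ZOMBIE_TRACKER_MAX];
static int         n_zombies;
static uint64_t    clock_hand;        /* monotonic; buffer = hand % NBUFFERS */

/* Once the hand has completed a full rotation since insertion, every
 * buffer has been visited (and any matching page invalidated), so the
 * entry can simply be forgotten. */
static bool
entry_expired(const ZombieEntry *e)
{
    return clock_hand - e->hand_at_insert >= NBUFFERS;
}

/* Called opportunistically whenever the hand advances. */
static void
drop_expired_entries(void)
{
    int j = 0;
    for (int i = 0; i < n_zombies; i++)
        if (!entry_expired(&zombies[i]))
            zombies[j++] = zombies[i];
    n_zombies = j;
}

/* If the tracker is full, keep advancing the hand (visiting and
 * invalidating buffers along the way, elided here) until the oldest
 * entry expires; worst case one full pass over the pool. */
static void
remember_zombie(RelFileNodeId rnode)
{
    while (n_zombies == ZOMBIE_TRACKER_MAX)
    {
        clock_hand++;                 /* ... visit buffer clock_hand % NBUFFERS ... */
        drop_expired_entries();
    }
    zombies[n_zombies].rnode = rnode;
    zombies[n_zombies].hand_at_insert = clock_hand;
    n_zombies++;
}

int
main(void)
{
    for (RelFileNodeId r = 0; r < ZOMBIE_TRACKER_MAX + 1; r++)
        remember_zombie(r);           /* last insert forces the hand to sweep */
    printf("tracked after overflow: %d\n", n_zombies);
    clock_hand += NBUFFERS;           /* pretend a full rotation happened */
    drop_expired_entries();
    printf("tracked after full rotation: %d\n", n_zombies);      /* 0 */
    return 0;
}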

Anyway, it's a lot of complexity, it falls back to worst cases like
today's, and it can also transfer work to innocent foreground
processes.  I see why Andres says we should just get a better data
structure so we can make the guy doing the dropping pay for it up
front, but more efficiently.  I suspect we may also want an ordered
data structure for write clustering purposes one day.

-- 
Thomas Munro
http://www.enterprisedb.com
