On Tue, Jul 31, 2018 at 8:01 AM, Robert Haas <[email protected]> wrote:
> On Mon, Jul 30, 2018 at 1:22 AM, Jamison, Kirk <[email protected]> wrote:
>> 1. Because the multiple scans of the whole shared buffer pool per
>> concurrent truncate/drop table were the cause of the time-consuming
>> behavior, DURING the failover process while WAL is being applied, we
>> temporarily delay the scanning and invalidating of shared buffers. At
>> the same time, we remember the relations/relfilenodes (of
>> dropped/truncated tables) by adding them to a hash table called the
>> "skip list".
>> 2. After WAL is applied, the checkpointer (or bgwriter) scans the
>> shared buffers only ONCE, compares the pages against the skip list,
>> and invalidates the relevant pages. After the relevant pages are
>> deleted from shared memory, they will not be written back to disk.
>>
>> Assuming the theory works, this design will only affect the behavior
>> of the checkpointer (or maybe the bgwriter) during the recovery
>> process / failover. Any feedback, thoughts?
>
> How would this work if a relfilenode number that belonged to an old
> relation got recycled for a new relation?
>
> I think something like this could be made to work -- both on the
> master and the standby, and not just while waiting for a failover --
> if we did something like this:
>
> (1) Limit the number of deferred drops to a reasonably small number
> (one cache line? 1kB?).
> (2) Background writer regularly does scans to clean out everything
> from the deferred-drop list.
> (3) If a foreground process finds the deferred-drop list is full when
> it needs to add to it, it forces a clean-out of the list contents +
> whatever new stuff it has in a single pass.
> (4) If we are about to generate a relfilenode that's in the list, we
> either force a clean-out or generate a different relfilenode instead.
> (5) Every buffer cleaning operation checks the deferred-drop list and
> invalidates without write-out if the buffer is found.
>
> It's not clear to me whether it would be worth the overhead of doing
> something like this. Making relation drops faster at the cost of
> making buffer cleaning slower could be a loser.
Perhaps you could make it a bit more incremental, avoiding the full
clean-out scans (your points 2 and 3) while still allowing the background
writer to bear the brunt in the best case? Something like this: the
zombie relfilenode tracker could be fixed-size and self-cleaning.
Entries could be dropped when the CLOCK hand completes a full rotation
since the entry was created, and if the tracker is full when you're
trying to insert a new entry, you can scan the buffer pool until you've
caused one entry (the oldest, by clock-hand-at-time-of-insertion) to be
removed to make space (so the worst case is a full pass, but hopefully
we can drop one entry before then).

The main argument I can think of against that idea is that it is tied to
the current buffer eviction strategy, using its buffer index-based clock
hand as a horizon. Extracting a useful horizon for algorithms like ARC
or CAR would probably need more thought, because their 'hands' jump
around following next pointers.

Anyway, it's a lot of complexity, it falls back to worst cases like
today's, and it can also transfer work to innocent foreground processes.
I see why Andres says we should just get a better data structure so we
can make the guy doing the dropping pay for it up front, but more
efficiently. I suspect we may also want an ordered data structure for
write clustering purposes one day.

--
Thomas Munro
http://www.enterprisedb.com
