At Fri, 25 Feb 2022 10:20:25 +0900 (JST), Kyotaro Horiguchi <horikyota....@gmail.com> wrote in > (I added Yura, as the author of a related patch) > > At Thu, 24 Feb 2022 12:58:23 +0000, Simon Riggs > <simon.ri...@enterprisedb.com> wrote in > > Thinking about poor performance in the case where the data fits in > > RAM, but the working set is too big for shared_buffers, I notice a > > couple of things that seem bad in BufMgr, but don't understand why > > they are like that. > > > > 1. If we need to allocate a buffer to a new block we do this in one > > step, while holding both partition locks for the old and the new tag. > > Surely it would cause less contention to make the old block/tag > > invalid (after flushing), drop the old partition lock and then switch > > to the new one? i.e. just hold one mapping partition lock at a time. > > Is there a specific reason we do it this way? > > I'm not sure but I guess the developer wanted to make the operation > atomic. > > Yura Sokolov is proposing a patch to separte the two partition locks. > > https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru > > And it seems to me viable for me and a benchmarking in the thread > showed a good result. I'd appreciate your input on that thread. > > > 2. Possibly connected to the above, we issue BufTableInsert() BEFORE > > we issue BufTableDelete(). That means we need extra entries in the > > buffer mapping hash table to allow us to hold both the old and the new > > at the same time, for a short period. The way dynahash.c works, we try > > Yes. > > > to allocate an entry from the freelist and if that doesn't work, we > > begin searching ALL the freelists for free entries to steal. So if we > > get enough people trying to do virtual I/O at the same time, then we > > will hit a "freelist storm" where everybody is chasing the last few > > entries. It would make more sense if we could do BufTableDelete() > > To reduce that overhead, Yura proposed a surgically modification on > dynahash, but I didn't like that and the latest patch doesn't contain > that part. > > > first, then hold onto the buffer mapping entry rather than add it to > > the freelist, so we can use it again when we do BufTableInsert() very > > shortly afterwards. That way we wouldn't need to search the freelist > > at all. What is the benefit or reason of doing the Delete after the > > Insert? > > Hmm. something like hash_swap_key() or hash_reinsert_entry()? That > sounds reasonable. (Yura's proposal was taking out an entry from hash > then attach it with a new key again.) > > > Put that another way, it looks like BufTable functions are used in a > > way that mismatches against the way dynahash is designed. > > > > Thoughts? > > On the first part, I think Yura's patch works. On the second point, > Yura already showed it gives a certain amount of gain if we do that.
On second thought, even if we have a new dynahash API to atomically replace hash key, we need to hold two partition locks to do that. That is contradicting our objective. regards. -- Kyotaro Horiguchi NTT Open Source Software Center