On Fri, Aug 20, 2021 at 10:50 AM alvhe...@alvh.no-ip.org <alvhe...@alvh.no-ip.org> wrote:
> 1. We use a hash table in shared memory. That's great. The part that's
> not so great is that in both places where we read items from it, we
> have to iterate in some way. This seems a bit silly. An array would
> serve us better, if only we could expand it as needed. However, in
> shared memory we can't do that. (I think the list of elements we
> need to memoize is arbitrarily long, if enough processes can be writing
> WAL at the same time.)
We can't expand the hash table either. It has an initial and maximum size of 16 elements, which means it's basically an expensive array, and which also means that it imposes a new limit of 16 * wal_segment_size on the size of WAL records. If you exceed that limit, I think things just go boom... which I think is not acceptable. I think we can have records in the multi-GB range if wal_level=logical and someone chooses a stupid replica identity setting.

It's actually not clear to me why we need to track multiple entries anyway. The scenario postulated by Horiguchi-san in https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota....@gmail.com seems to require that the write position be multiple segments ahead of the flush position, but that seems impossible with the present code, because XLogWrite() calls issue_xlog_fsync() at once if the segment is filled. So I think, at least with the present code, any record that isn't completely flushed to disk has to be at least partially in the current segment. And there can be only one record that starts in some earlier segment and ends in this one.

I will be the first to admit that the forced end-of-segment syncs suck. They often stall every backend in the entire system at the same time. Everyone fills up the xlog segment really fast and then stalls HARD while waiting for that sync to happen. So it's arguably better not to do more things that depend on that being how it works, but I think needing a variable-size amount of shared memory is even worse.

If we're going to track multiple entries here, we need some rule that bounds how many of them we might need to track. If the number of entries is defined by the number of segment boundaries that a particular record crosses, it's effectively unbounded, because right now WAL records can be pretty much arbitrarily big.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
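To make the single-entry argument above concrete, a fixed-size piece of bookkeeping along these lines would be enough if that reasoning holds. This is a rough sketch only; the struct and field names are hypothetical and are not what the patch actually uses:

    #include "postgres.h"
    #include "access/xlogdefs.h"

    /*
     * Hypothetical sketch: if at most one in-progress record can start in an
     * earlier segment and end in the current one, a single fixed-size entry
     * in shared memory suffices -- no hash table, no variable-size array.
     */
    typedef struct CrossSegmentRecord
    {
        XLogRecPtr  start_lsn;  /* where the cross-segment record begins */
        XLogRecPtr  end_lsn;    /* where it ends, in the current segment */
        bool        valid;      /* false if no record currently crosses a boundary */
    } CrossSegmentRecord;

The point of the sketch is just that, if the flush position can never fall a whole segment behind an in-progress record's start, the bookkeeping can stay constant-size.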