On Wed, Oct 11, 2023 at 8:43 PM Andres Freund <and...@anarazel.de> wrote:
>
> Robert, Melanie and I spent an evening discussing this topic around
> pgconf.nyc. Here are, mildly revised, notes from that:
Thanks for taking notes!

> The main thing we are worried about is repeated freezing / unfreezing of
> pages within a relatively short time period.
>
> - Computing an average "modification distance" as I (Andres) proposed for
>   each page is complicated / "fuzzy"
>
>   The main problem is that it's not clear how to come up with a good number
>   for workloads that have many more inserts into new pages than modifications
>   of existing pages.
>
>   It's also hard to use an average for this kind of thing, e.g. in cases
>   where new pages are frequently updated, but also some old data is updated,
>   it's easy for the updates to the old data to completely skew the average,
>   even though that shouldn't prevent us from freezing.
>
> - We also discussed an idea by Robert to track the number of times we need to
>   dirty a page when unfreezing and to compare that to the number of pages
>   dirtied overall (IIRC), but I don't think we really came to a conclusion
>   around that - and I didn't write down anything, so this is purely from
>   memory.

I was under the impression that we decided we still had to consider the
number of clean pages dirtied as well as the number of pages unfrozen.
The number of pages frozen and unfrozen over a time period gives us some
idea of whether we are freezing the wrong pages -- but it doesn't tell us
whether we are freezing the right pages.

A riff on an earlier example by Robert:

While vacuuming a relation, we freeze 100 pages. During the same time
period, we modify 1,000,000 previously clean pages. Of those 1,000,000
pages modified, 90 were frozen. So we unfroze 90% of the pages frozen
during this time. Does this mean we should back off of trying to freeze
any pages in the relation?

> A rough sketch of a freezing heuristic: ...
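The arithmetic in the example above can be sketched like this (all names and numbers are mine, purely illustrative -- this is not PostgreSQL code):

```python
# Numbers from the example above: 100 pages frozen, 1,000,000 previously
# clean pages modified in the same period, 90 of the modified pages were frozen.
pages_frozen = 100
pages_modified = 1_000_000
frozen_pages_unfrozen = 90

# "Error rate" considering only frozen pages: 90 of 100 frozen pages were
# unfrozen, which looks terrible in isolation.
unfreeze_error_rate = frozen_pages_unfrozen / pages_frozen
print(f"unfreeze error rate: {unfreeze_error_rate:.0%}")

# But frozen pages are a tiny fraction of all pages being dirtied, so the
# freezing decisions barely affected the overall write workload.
share_of_dirtying = frozen_pages_unfrozen / pages_modified
print(f"share of all page dirtying due to unfreezing: {share_of_dirtying:.4%}")
```

The point of the example: the 90% figure alone suggests backing off freezing, while the second ratio suggests freezing is nearly free relative to the workload's overall dirtying.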
> - Attributing "unfreezes" to specific vacuums would be powerful:
>
>   - "Number of pages frozen during vacuum" and "Number of pages unfrozen that
>     were frozen during the same vacuum" provides numerator / denominator for
>     an "error rate"
>
>   - We can perform this attribution by comparing the page LSN with recorded
>     start/end LSNs of recent vacuums

While implementing a rough sketch of this, I realized I had a question.

Vacuum 1 starts at LSN 10 and ends at LSN 200. It froze 100 pages.
Vacuum 2 then starts at LSN 600.

5 frozen pages with page LSN > 10 and < 200 were updated. We count those
in vacuum 1's stats. 3 frozen pages with page LSN > 200 and < 600 were
updated. Do we count those somewhere?

> - This approach could provide "goals" for opportunistic freezing in a
>   somewhat understandable way. E.g. aiming to rarely unfreeze data that has
>   been frozen within 1h/1d/...

Similar to the above question, if we are tracking pages frozen and
unfrozen during a time period, and there are many vacuums in quick
succession, we might care whether a page was frozen by one vacuum and then
unfrozen during a subsequent vacuum if not too much time has passed.

- Melanie
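For concreteness, the LSN-based attribution I was sketching looks roughly like this (names and structure are hypothetical, not PostgreSQL's; it also shows the "gap" case my question is about):

```python
# Hypothetical sketch: attribute an "unfreeze" to the vacuum whose recorded
# [start_lsn, end_lsn] range contains the frozen page's LSN.
from dataclasses import dataclass

@dataclass
class VacuumStats:
    start_lsn: int
    end_lsn: int
    pages_frozen: int
    pages_unfrozen: int = 0

def attribute_unfreeze(page_lsn, recent_vacuums):
    """Credit an unfreeze to the matching vacuum. Returns None for pages
    whose LSN falls between two vacuums -- the unresolved case above."""
    for vac in recent_vacuums:
        if vac.start_lsn <= page_lsn <= vac.end_lsn:
            vac.pages_unfrozen += 1
            return vac
    return None

vacuum1 = VacuumStats(start_lsn=10, end_lsn=200, pages_frozen=100)
vacuums = [vacuum1]

# 5 pages frozen by vacuum 1 (page LSN between 10 and 200) are updated:
for lsn in (20, 50, 120, 150, 199):
    attribute_unfreeze(lsn, vacuums)
print(vacuum1.pages_unfrozen)   # 5 unfreezes counted against vacuum 1

# 3 pages with LSN between vacuum 1's end (200) and vacuum 2's start (600)
# match no recorded vacuum -- where should these be counted?
unattributed = [lsn for lsn in (250, 400, 550)
                if attribute_unfreeze(lsn, vacuums) is None]
print(len(unattributed))        # 3 unfreezes with no vacuum to charge
```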