Hi,
Robert, Melanie and I spent an evening discussing this topic around
pgconf.nyc. Here are, mildly revised, notes from that:

First, a few random points that didn't fit with the sketch of an
approach below:

- Are unlogged tables a problem for using LSN-based heuristics for
  freezing?

  We concluded no, not a problem, because aggressively freezing does not
  increase overhead meaningfully, as we would already dirty both the
  heap and VM page to set the all-visible flag.

- "Unfreezing" pages that were frozen hours / days ago isn't too bad and
  can be desirable. The main thing we are worried about is repeated
  freezing / unfreezing of pages within a relatively short time period.

- Computing an average "modification distance" for each page, as I
  (Andres) proposed, is complicated / "fuzzy".

  The main problem is that it's not clear how to come up with a good
  number for workloads that have many more inserts into new pages than
  modifications of existing pages.

  It's also hard to use an average for this kind of thing. E.g. in cases
  where new pages are frequently updated, but some old data is also
  updated, it's easy for the updates to the old data to completely skew
  the average, even though that shouldn't prevent us from freezing.

- We also discussed an idea by Robert to track the number of times we
  need to dirty a page when unfreezing, and to compare that to the
  number of pages dirtied overall (IIRC). But I don't think we really
  came to a conclusion around that - and I didn't write anything down,
  so this is purely from memory.

A rough sketch of a freezing heuristic:

- We concluded that to intelligently control opportunistic freezing we
  need statistics about the number of freezes and unfreezes.

- We should track page freezes / unfreezes in shared memory stats on a
  per-relation basis.

- To use such statistics to control heuristics, we need to turn them
  into rates. For that we need to keep snapshots of the absolute values
  at certain times (when vacuuming), allowing us to compute a rate.
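To make the last point concrete, here is a minimal sketch (in Python,
purely illustrative - all names are invented, this is not PostgreSQL
code) of turning snapshotted absolute counters into rates:

```python
# Hypothetical sketch: per-relation freeze/unfreeze counters are
# cumulative, so a rate requires two snapshots taken at different times
# (e.g. at the start of each vacuum).
from dataclasses import dataclass

@dataclass
class StatsSnapshot:
    taken_at: float      # wall clock time of the snapshot, in seconds
    page_freezes: int    # cumulative page freezes for the relation
    page_unfreezes: int  # cumulative page unfreezes for the relation

def freeze_rates(older, newer):
    """Return (freezes/sec, unfreezes/sec) between two snapshots,
    or None if the snapshots are too close together to form a rate."""
    elapsed = newer.taken_at - older.taken_at
    if elapsed <= 0:
        return None
    return ((newer.page_freezes - older.page_freezes) / elapsed,
            (newer.page_unfreezes - older.page_unfreezes) / elapsed)

# Example: two snapshots taken 100 seconds apart
s0 = StatsSnapshot(taken_at=0.0, page_freezes=1000, page_unfreezes=50)
s1 = StatsSnapshot(taken_at=100.0, page_freezes=1500, page_unfreezes=70)
print(freeze_rates(s0, s1))  # (5.0, 0.2)
```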
- If we snapshot some stats, we need to limit the amount of data that
  occupies:

  - evict based on wall clock time (we don't care about unfreezing pages
    frozen a month ago)

  - "thin out" data when exceeding the limited amount of stats per
    relation, using random sampling or such

  - we need a smarter approach than just keeping the N last vacuums, as
    there are situations where a table is (auto-)vacuumed at a high
    frequency

  - only looking at recent-ish table stats is fine, because we
    a) don't want to look at too-old data, as we need to deal with
       changing workloads
    b) know that if there aren't recent vacuums, falsely freezing is of
       bounded cost

  - shared memory stats being lost on crash-restart/failover might be a
    problem

    - we certainly don't want to immediately store these stats in a
      table, due to the xid consumption that'd imply

- Attributing "unfreezes" to specific vacuums would be powerful:

  - "Number of pages frozen during vacuum" and "Number of pages unfrozen
    that were frozen during the same vacuum" provide the numerator /
    denominator for an "error rate".

  - We can perform this attribution by comparing the page LSN with the
    recorded start/end LSNs of recent vacuums.

  - If the freezing error rate of recent vacuums is low, freeze more
    aggressively. This is important to deal with insert-mostly
    workloads.

  - If old data is "unfrozen", that's fine; we can ignore such unfreezes
    when controlling "freezing aggressiveness".

    - Ignoring unfreezing of old pages is important, e.g. to deal with
      workloads that delete old data.

  - This approach could provide "goals" for opportunistic freezing in a
    somewhat understandable way. E.g. aiming to rarely unfreeze data
    that has been frozen within 1h/1d/...

Around this point my laptop unfortunately ran out of battery. Possibly
the attendees of this mini summit also ran out of steam (and tea).

We had a few "disagreements" or "unresolved issues":

- How aggressive should we be when we have no stats?
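As a sketch of the attribution idea (again purely illustrative Python
with invented names, not a proposed implementation): an unfreeze is
attributed to a vacuum when the LSN the page carried while frozen falls
inside that vacuum's recorded start/end LSN window; unfreezes matching
no recent vacuum are old data and get ignored, per the above.

```python
# Hypothetical sketch: derive a per-vacuum freezing "error rate" by
# attributing unfreezes to the vacuum whose LSN window froze the page.
from dataclasses import dataclass

@dataclass
class VacuumRecord:
    start_lsn: int           # LSN recorded when this vacuum started
    end_lsn: int             # LSN recorded when this vacuum ended
    pages_frozen: int        # pages frozen by this vacuum
    pages_unfrozen: int = 0  # of those, pages later unfrozen

def record_unfreeze(recent_vacuums, frozen_page_lsn):
    """Attribute an unfreeze to the vacuum whose LSN window covers the
    page's LSN; unfreezes of old pages match no window and are ignored."""
    for v in recent_vacuums:
        if v.start_lsn <= frozen_page_lsn < v.end_lsn:
            v.pages_unfrozen += 1
            return

def error_rate(v):
    """Fraction of this vacuum's frozen pages that were later unfrozen."""
    return v.pages_unfrozen / v.pages_frozen if v.pages_frozen else 0.0

# Example: one recent vacuum froze 200 pages across LSNs [1000, 2000)
recent = [VacuumRecord(start_lsn=1000, end_lsn=2000, pages_frozen=200)]
record_unfreeze(recent, frozen_page_lsn=1500)  # attributed to the vacuum
record_unfreeze(recent, frozen_page_lsn=500)   # old page: ignored
print(error_rate(recent[0]))  # 0.005
```

A low error rate here would justify freezing more aggressively on the
next vacuum; a high one would justify backing off.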
- Should the freezing heuristic take into account whether freezing would
  require an FPI? Or whether the page was not in s_b, or ...

I likely mangled this substantially, both when taking notes during the
lively discussion and when revising them to make them a bit more
readable.

Greetings,

Andres Freund