On Sat, Jan 30, 2021 at 9:11 AM Michail Nikolaev <michail.nikol...@gmail.com> wrote: > > Yeah, it would help a lot. But those bits are precious. So it makes > > sense to think about what to do with both of them in index AMs at the > > same time. Otherwise we risk missing some important opportunity. > > Hm. I was trying to "expand the scope" as you said and got an idea... Why we > even should do any conflict resolution for hint bits? Or share precious LP > bits?
What does it mean today when an LP_DEAD bit is set on a standby? I don't think that it means nothing at all -- at least not if you think about it in the most general terms. In a way it actually means something like "recently dead" even today, at least in one narrow specific sense: recovery may end, and then we actually *will* do *something* with the LP_DEAD bit, without directly concerning ourselves with when or how each LP_DEAD bit was set. During recovery, we will probably always have to consider the possibility that LP_DEAD bits that get set on the primary may be received by a replica through some implementation detail (e.g. LP_DEAD bits are set in FPIs we replay, maybe even some other thing that neither of us have thought of). We can't really mask LP_DEAD bits from the primary in recovery anyway, because of stuff like page-level checksums. I suspect that it just isn't useful to fight against that. These preexisting assumptions are baked into everything already. Why should we assume that *anybody* understands all of the ways in which that is true? Even today, "LP_DEAD actually means a limited kind of 'recently dead' during recovery + hot standby" is something that is true (as I just went into), but at the same time also has a fuzzy definition. My gut instinct is to be conservative about that. As I suggested earlier, you could invent some distinct kind of "recently dead" that achieves your goals in a way that is 100% orthogonal. The two unused LP dead bits (unused in indexes, though not tables) are only precious if we assume that somebody will eventually use them for something -- if nobody ever uses them then they're 100% useless. The number of possible projects that might end up using the two bits for something is not that high -- certainly not more than 3 or 4. Besides, it is always a good idea to keep the on-disk format as simple and explicit as possible -- it should be easy to analyze forensically in the event of bugs or some other kind of corruption, especially by using tools like amcheck. As I said in my earlier email, we can even play tricks during page deletion by treating certain kinds of "recently dead" bits as approximate things. As a further example, we can "rely" on the "dead-to-all but only on standby" bits when recovery ends, during a subsequent write transactions. We can do so by simply using them in _bt_deadblocks() as if they were LP_DEAD (we'll recheck heap tuples in heapam.c instead). > The only way standby could get an "invalid" hint bit - is an FPI from the > primary. We could just use the current *btree_mask* infrastructure and clear > all "probably invalid" bits from primary in case of standby while applying > FPI (in `XLogReadBufferForRedoExtended`)! I don't like that idea. Apart from what I said already, you're assuming that setting LP_DEAD bits in indexes on the primary won't eventually have some value on the standby after it is promoted and can accept writes -- they really do have meaning and value on standbys. Plus you'll introduce new overhead for this process during replay, which creates significant overhead -- often most leaf pages have some LP_DEAD bits set during recovery. > For binary compatibility, we could use one of `btpo_flags` bits to mark the > page as "primary bits masked". The same way would work for hash\gist too. I agree that there are plenty of btpo_flags bits. However, I have my doubts about this. Why shouldn't this break page-level checksums (or wal_log_hints) in some way? What about pg_rewind, some eventual implementation of incremental backups, etc? I suspect that it will be necessary to invent some entirely new concept that is like a hint bit, but works on standbys (without breaking anything else). > No conflicts, only LP_DEAD bit is used by standby, `btpo_flags` has many free > bits for now, easy to implement, page content of primary\standby already > differs in this bits... > Looks like an easy and effective solution for me. Note that the BTP_HAS_GARBAGE flag (which goes in btpo_flags) was deprecated in commit cf2acaf4. It was never a reliable indicator of whether or not some LP_DEAD bits are set in the page. And I think that it never could be made to work like that. > >> Also, btw, do you know any reason to keep minRecoveryPoint at a low value? > > Not offhand. > > If so, looks like it is not a bad idea to move minRecoveryPoint forward from > time to time (for more aggressive standby index hint bits). For each > `xl_running_xacts` (about each 15s), for example. It's necessary for recoverypoints (i.e. the standby equivalent of a checkpoint) to do that in order to ensure that we won't break checksums. This whole area is scary to me: https://postgr.es/m/CABOikdPOewjNL=05k5cbnmxnntxnqjhtx2f--4p4ruorcju...@mail.gmail.com Since, as I said, it's already true that LP_DEAD bits on standbys are some particular kind of "recently dead bit", even today, any design that uses LP_DEAD bits in some novel new way (also on the standby) is very hard to test. It might easily have very subtle bugs -- obviously a recently dead bit relates to a tuple pointing to a logical row that is bound to become dead soon enough. The difference between dead and recently dead is already blurred, and I would rather not go there. If you invent some entirely new category of standby-only hint bit at a level below the access method code, then you can use it inside access method code such as nbtree. Maybe you don't have to play games with minRecoveryPoint in code like the "if (RecoveryInProgress())" path from the XLogNeedsFlush() function. Maybe you can do some kind of rudimentary "masking" for the in recovery case at the point that we *write out* a buffer (*not* at the point hint bits are initially set) -- maybe this could happen near to or inside FlushBuffer(), and maybe only when checksums are enabled? I'm unsure. > > BTW, what happens when the page splits on the primary, btw? Does your > > patch "move over" the LP_DEAD bits to each half of the split? > > That part is not changed in any way. Maybe it's okay to assume that it's no loss to throw away hint bits set on the standby, because in practice deletion on the primary will usually do the right thing anyway. But you never said that. I think that you should take an explicit position on this question -- make it a formal part of your overall design. -- Peter Geoghegan