Last week, I posted a couple of possible designs for making the visibility map crash-safe, which did not elicit much comment. Since this is an important prerequisite to index-only scans, I'm trying again.
http://archives.postgresql.org/pgsql-hackers/2010-11/msg01474.php http://archives.postgresql.org/pgsql-hackers/2010-11/msg01529.php Here's a third possible design. Instead of representing each page with a single bit in the visibility map, use two bits. One bit indicates whether all tuples on the page are all-visible (same as the current bit) - call this VM_ALL_VISIBLE. The other bit is only used during VACUUM and indicates whether VACUUM is trying to set the all-visible bit - call this VM_BECOMING_ALL_VISIBLE. We observe the rule that any operation that clears PD_ALL_VISIBLE must clear both the VM_ALL_VISIBLE and VM_BECOMING_ALL_VISIBLE bits for that page in the visibility map. VACUUM precedes as follows: 1. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits are set, take the exclusive content lock for long enough to clear them. 2. Scan the heap. When a page is observed to be all-visible, set VM_BECOMING_ALL_VISIBLE and PD_ALL_VISIBLE. 3. Loop over shared buffers and write out every page to the OS which belongs to the target relation, was marked all-visible in step 2, and is still dirty. Note that this may require a WAL flush. 4. fsync() the heap. 5. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits are set, take the exclusive content lock, clear each such bit, set the corresponding VM_ALL_VISIBLE bits and XLOG the page. One might actually want to do steps 2-5 incrementally, in 1GB chunks, so that you don't fsync() too much of the relation all at once. If you tilt your head just right, the recurring problem in all of this is that the heap page and the visibility map page can go to disk in either order, and we have no way of knowing which one. A more radical solution to this problem (and, thus, a fourth possible design) would be to add a field to the buffer descriptor allowing one page to "wire" another page into shared buffers. If the value is >0, it's the number of a buffer it's currently wiring. If the value is <0, it's the number of other buffers that have wired this buffer. A buffer both wire another buffer and itself be wired at the same time. If the value is =0, everything's normal. To make this work, though, you'd have to complicate the checkpoint logic pretty considerably - make sure all the unwired buffers are written and fsync'd first, thus unwiring the remaining ones to be written and fsync'd in a second pass; and there are also possible problems with very small relations, where the number of wired buffers might grow to an uncomfortably high percentage of the total. Waving my hands at all this complexity, you could then make the heap pages wire the visibility map pages. I can't say I'm totally in love with any of these designs. Anyone else have any ideas, or any opinions about which one is best? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers