Hi,
On 2022-11-22 23:43:29 -0600, Justin Pryzby wrote:
> I think there may be a problem/deficiency with hint bits:
>
> |postgres=# DROP TABLE u2; CREATE TABLE u2 AS SELECT
> generate_series(1,999999)a; SELECT pg_stat_reset_shared('io'); explain
> (analyze,buffers) SELECT * FROM u2;
> |...
> | Seq Scan on u2 (cost=0.00..15708.75 rows=1128375 width=4) (actual
> time=0.111..458.239 rows=999999 loops=1)
> | Buffers: shared hit=2048 read=2377 dirtied=2377 written=2345
>
> |postgres=# SELECT COUNT(1), relname, COUNT(1) FILTER(WHERE isdirty) FROM
> pg_buffercache b LEFT JOIN pg_class c ON
> pg_relation_filenode(c.oid)=b.relfilenode GROUP BY 2 ORDER BY 1 DESC LIMIT 11;
> | count | relname | count
> |-------+---------------------------------+-------
> | 13619 | | 0
> | 2080 | u2 | 2080
> | 104 | pg_attribute | 4
> | 71 | pg_statistic | 1
> | 51 | pg_class | 1
>
> It says that SELECT caused 2377 buffers to be dirtied, of which 2080 are
> associated with the new table in pg_buffercache.
Note that there's 2048 dirty buffers for u2 in shared_buffers before the
SELECT, despite the relation being 4425 blocks long, due to the CTAS using
BAS_BULKWRITE.
> |postgres=# SELECT * FROM pg_stat_io WHERE
> backend_type!~'autovac|archiver|logger|standalone|startup|^wal|background
> worker' or true ORDER BY 2;
> | backend_type | io_context | io_object | read | written |
> extended | op_bytes | evicted | reused | files_synced | stats_reset
> |...
> | client backend | bulkread | relation | 2377 | 2345 |
> | 8192 | 0 | 2345 | | 2022-11-22 22:32:33.044552-06
>
> I think it's a known behavior that hint bits do not use the strategy
> ring buffer. For BAS_BULKREAD, ring_size = 256kB (32, 8kB pages), but
> there's 2080 dirty pages in the buffercache (~16MB).
I don't think there's any "circumvention" of the ringbuffer here. There's 2048
buffers for u2 in s_b before, all dirty, there's 2080 after, also all
dirty. So the ringbuffer restricted the increase in shared buffers used for u2
to 2080-2048=32 additional buffers.
The reason hint bits don't prevent pages from being written out here is that a
BAS_BULKREAD strategy doesn't cause all buffer writes to be rejected, it just
causes buffer writes to be rejected when the page LSN would require a WAL
flush. And that's not typically the case when you just set a hint bit, unless
you use wal_log_hint_bits = true.
If I turn on wal_log_hints=true and add a CHECKPOINT after the CTAS I see 0
reuses (and 4425 dirty buffers), which is what I'd expect.
> But the IO view says that 2345 of the pages were "reused", which seems
> misleading to me. Maybe that just follows from the behavior and the view is
> fine. If the view is fine, maybe this case should still be specifically
> mentioned in the docs.
I think that's just confusing due to the reset. 2048 + 2345 = 4393, but we
only have 2080 buffers for u2 in s_b.
Greetings,
Andres Freund