Hi Thomas / hackers,
>> The append-only bottleneck appears to be limited by syscalls/s due to small
>> block size even with everything in FS cache (but not in shared buffers,
>> please compare with TEST1 as there was no such bottleneck at all):
>>
>> 29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
>> ---copy_user_enhanced_fast_string
>> |--17.98%--copyin
>> [..]
>> | __pwrite_nocancel
>> | FileWrite
>> | mdwrite
>> | FlushBuffer
>> | ReadBuffer_common
>> | ReadBufferWithoutRelcache
>> | XLogReadBufferExtended
>> | XLogReadBufferForRedoExtended
>> | --17.57%--btree_xlog_insert
>
> To move these writes out of recovery's way, we should probably just
> run the bgwriter process during crash recovery. I'm going to look
> into that.
Sounds awesome. Also, as this thread is starting to drift away from the SLRU
fsync topic - maybe we should change the subject? Anyway, to add some data on
the separate bgwriter idea: on 14master (already with lseek() caching, SLRU
fsyncs out of the way, better sorting), I've measured the same configuration as
last time, still with the same append-only WAL workload on NVMe, and compared
various shared_buffers settings (plus the buffer descriptors sizing from
pg_shmem_allocations, which as you stated may be wrongly reported(?) - I'm
including it only for reference, just in case; see the query sketch after the
table):
shared_buffers=128MB  buffers_desc=1024kB  96.778, 0.438  [a]
shared_buffers=256MB  buffers_desc=2048kB  62.755, 0.577  [a]
shared_buffers=512MB  buffers_desc=4096kB  33.167, 0.62   [a]
shared_buffers=1GB    buffers_desc=8192kB  27.303, 0.929
shared_buffers=4GB    buffers_desc=32MB    27.185, 1.166
shared_buffers=8GB    buffers_desc=64MB    27.649, 1.088
shared_buffers=16GB   buffers_desc=128MB   27.584, 1.201  [b]
shared_buffers=32GB   buffers_desc=256MB   32.314, 1.171  [b]
shared_buffers=48GB   buffers_desc=384MB   31.95,  1.217
shared_buffers=64GB   buffers_desc=512MB   31.276, 1.349
shared_buffers=72GB   buffers_desc=576MB   31.925, 1.284
shared_buffers=80GB   buffers_desc=640MB   31.809, 1.413
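(For reference, the buffers_desc sizes above were read from
pg_shmem_allocations roughly like this - just a sketch, adjust connection
options to taste:)

    # sketch: where the buffers_desc sizes above come from (view exists since v13)
    psql -c "SELECT name, size, allocated_size FROM pg_shmem_allocations WHERE name = 'Buffer Descriptors';"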
The amount of WAL to be replayed was ~2.8GB. To me it looks like:
a) flushing dirty buffers from StartupXLog might be a real problem, but please
read on.
b) there is some impact, possibly from the L2/L3 cache hypothesis you mentioned
(?), but it's not that big and it doesn't degrade linearly as one would expect
(??). The L1d/L1i/L2/L3 cache sizes on this machine are 32K/32K/256K/46080K.
Maybe the DB size is too small.
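(Those cache sizes are simply what the OS reports, e.g.:)

    # where the L1d/L1i/L2/L3 sizes above come from
    lscpu | grep -i cache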
I've double-checked that in condition [a] (shared_buffers=128MB) there were a
lot of FlushBuffer() invocations per second (perf stat -e
probe_postgres:FlushBuffer -I 1000), e.g.:
# time counts unit events
1.000485217 79,494 probe_postgres:FlushBuffer
2.000861366 75,305 probe_postgres:FlushBuffer
3.001263163 79,662 probe_postgres:FlushBuffer
4.001650795 80,317 probe_postgres:FlushBuffer
5.002033273 79,951 probe_postgres:FlushBuffer
6.002418646 79,993 probe_postgres:FlushBuffer
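(For completeness, that event is just a uprobe on the postgres binary; roughly
the following, with the binary path and pid being placeholders:)

    # create the uprobe (binary needs symbols); the path is just an example
    perf probe -x /usr/pgsql-14/bin/postgres FlushBuffer
    # count hits per second; substitute the pid of the recovering startup process
    perf stat -e probe_postgres:FlushBuffer -I 1000 -p <startup_pid>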
At 1GB shared_buffers the FlushBuffer() rate sits nearly at zero all the time.
So there is something like a 3x speed-up possible if StartupXLog didn't have to
care about flushing dirty buffers in the critical path (the bgwriter would do
it instead, as you say?), at least in append-only scenarios. But... I've
checked some real systems (even older PostgreSQL versions, not doing that much
replication) and indeed FlushBuffer() is happening there at e.g. 8-12k/s with
pretty huge shared buffers (>100GB, 0.5TB of RAM), but those are *system-wide*
numbers - the work is really performed by the bgwriter, not by the
startup/recovery process as in this redo-bench case, because there the DB is
open for reads and replicating. So it appears this isn't an optimization for
the hot standby case, but for the DB-closed startup
recovery/restart/disaster/PITR scenario.
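(If somebody wants to repeat that system-wide check and see who actually
performs the flushing, a rough way - reusing the uprobe from above - is:)

    # sample the FlushBuffer probe system-wide for 10s, then break hits down by process
    perf record -e probe_postgres:FlushBuffer -a -- sleep 10
    perf report --sort comm --stdio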
As for the 24GB shared_buffers scenario, where dirty buffers are not a problem
at all, the top profile (output trimmed) again looks as expected:
13.41% postgres postgres [.] hash_search_with_hash_value
|--8.31%--BufTableLookup <- ReadBuffer_common <-
ReadBufferWithoutRelcache
--5.11%--smgropen
|--2.77%--XLogReadBufferExtended
--2.34%--ReadBufferWithoutRelcache
7.88% postgres postgres [.] MarkBufferDirty
I've tried to get the cache-miss ratio via PMCs, but apparently on EC2 (even on
bigger instances) they report as not-supported or zeros. Interestingly, though,
the workload has an IPC of 1.40 (instruction bound), which is strange to me as
I would expect BufTableLookup() to be heavily memory bound (?). Maybe I'll try
on some different hardware one day.
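(What I mean by that check is roughly the following; whether the cache events
are exposed at all depends on the instance/virtualization, and the pid is again
the startup process:)

    # generic hardware counters for IPC and cache misses; on this EC2 instance
    # the cache events came back as zeros / <not supported>
    perf stat -e cycles,instructions,cache-references,cache-misses -p <startup_pid> -- sleep 10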
>> __pread_nocancel
>> --11.44%--FileRead
>> mdread
>> ReadBuffer_common
>> ReadBufferWithoutRelcache
>> XLogReadBufferExtended
>> XLogReadBufferForRedoExtended
>
> For these reads, the solution should be WAL prefetching,(..) But... when
> combined with Andres's work-in-progress AIO stuff (..)
Yes, I've heard a thing or two about those :) I hope I'll be able to deliver
some measurements of those two together (AIO + WAL prefetch) sooner or later.
-Jakub Wartak.