Re: Improve heapgetpage() performance, overhead from serializable

2023-08-31 Thread Muhammad Malik
Hi,

Is there a plan to merge this patch in PG16?

Thanks,
Muhammad


From: Andres Freund 
Sent: Saturday, July 15, 2023 6:56 PM
To: pgsql-hack...@postgresql.org 
Cc: Thomas Munro 
Subject: Improve heapgetpage() performance, overhead from serializable

Hi,

Several loops which are important for query performance, like heapgetpage()'s
loop over all tuples, have to call functions like
HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
iteration.

When serializable is not in use, all those functions do is return. But because
they live in a different translation unit, the compiler can't inline the check
for whether serializability is needed (without LTO, at least).  It's not just
the function call overhead that's noticeable; registers also have to be spilled
to the stack / reloaded from memory, etc.

Benchmark setup: a freshly loaded pgbench scale 100 database, turbo mode
disabled, postgres pinned to one core, parallel workers disabled to reduce
noise.  All times are the average of 15 executions with pgbench, in
milliseconds, in a newly started but prewarmed postgres.

SELECT * FROM pgbench_accounts OFFSET 1000;
HEAD:
397.977

removing the HeapCheckForSerializableConflictOut() from heapgetpage()
(incorrect!), to establish the baseline of what serializable costs:
336.695

pulling out CheckForSerializableConflictOutNeeded() from
HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
HeapCheckForSerializableConflictOut() in the loop:
339.742

moving the loop into a static inline function, marked as pg_always_inline,
called with static arguments for always_visible, check_serializable:
326.546

marking the always_visible, !check_serializable case likely():
322.249

removing TestForOldSnapshot() calls, which we pretty much already decided on:
312.987
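
To make the shape of those changes concrete, here is a minimal sketch in that
spirit (not the actual patch; it assumes the usual heapam.c context and
includes, and names and details are simplified):

/*
 * The per-tuple loop is moved into an always-inlined helper that is called
 * with constant values for always_visible / check_serializable, so the
 * compiler can emit a specialized copy of the loop for each combination.
 */
static pg_attribute_always_inline int
heapgetpage_collect_sketch(HeapScanDesc scan, Snapshot snapshot,
                           Page page, Buffer buffer, BlockNumber block,
                           int lines, bool always_visible,
                           bool check_serializable)
{
    int         ntup = 0;

    for (OffsetNumber lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
    {
        ItemId      lpp = PageGetItemId(page, lineoff);
        HeapTupleData loctup;
        bool        valid;

        if (!ItemIdIsNormal(lpp))
            continue;

        loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
        loctup.t_len = ItemIdGetLength(lpp);
        loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
        ItemPointerSet(&loctup.t_self, block, lineoff);

        /* constant-folded away in the always_visible specializations */
        valid = always_visible ||
            HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);

        /* only the check_serializable specializations pay this cost */
        if (check_serializable)
            HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
                                                &loctup, buffer, snapshot);

        if (valid)
            scan->rs_vistuples[ntup++] = lineoff;
    }

    return ntup;
}

heapgetpage() would then call CheckForSerializableConflictOutNeeded() once per
page and dispatch to one of the constant-argument invocations, with the
always_visible && !check_serializable combination wrapped in likely().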


FWIW, there's more we can do: with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this is, imo, worth considering on its own.




Now, this just affects the sequential scan case. heap_hot_search_buffer()
shares many of the same pathologies.  I find it a bit harder to improve,
because the compiler's code generation seems to switch between good / bad with
changes that seem unrelated...


I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
far?
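
As a sketch of what that could look like (mirroring the all-visible short
circuit heapgetpage() already does; whether the same reasoning is safe inside
heap_hot_search_buffer() is exactly the open question):

/* Sketch only, not heap_hot_search_buffer() as it stands today. */
bool        all_visible;

all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;

if (all_visible)
    valid = true;       /* skip the per-tuple visibility call entirely */
else
    valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);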


Greetings,

Andres Freund


Regarding t_cid in Neon heap WAL records

2024-07-24 Thread Muhammad Malik
Neon added a t_cid field to heap WAL records 
https://github.com/yibit/neon-postgresql/blob/main/docs/core_changes.md#add-t_cid-to-heap-wal-records.

However, when replaying the delete log record, it discards the combo flag and 
stores the raw cmax on the old tuple 
(https://github.com/neondatabase/neon/blob/main/pgxn/neon_rmgr/neon_rmgr.c#L376). 
This makes the tuple header differ from what is in the buffer cache if the 
deleted tuple was using a combocid. Similarly, no t_cid was added for the old 
tuple in xl_neon_heap_update, and redo_neon_heap_update uses the t_cid of the 
new tuple to set cmax on the old tuple.

Why is this not a problem for visibility checks performed on the tuple after it 
is read back from storage, given that the correct cmin value of the old tuple 
can no longer be determined?
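
For context, this is roughly how combo cid resolution works on the primary (a
simplified sketch of the logic in access/heap/combocid.c, not Neon code;
GetRealCmin() stands for the backend-local lookup of the stored (cmin, cmax)
pair):

static inline CommandId
sketch_get_cmin(HeapTupleHeader tup)
{
    CommandId   cid = HeapTupleHeaderGetRawCommandId(tup);

    if (tup->t_infomask & HEAP_COMBOCID)
        return GetRealCmin(cid);    /* combo cid -> (cmin, cmax) mapping */
    else
        return cid;                 /* raw command id is cmin itself */
}

/* If redo clears HEAP_COMBOCID and stores the raw cmax instead, the cmin
 * half of the pair is no longer recoverable from the tuple header. */
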
Also, why does xl_neon_heap_update need the t_cid of the new tuple when it is 
already present in xl_neon_heap_header? It seems the same t_cid is sent twice 
with the update WAL record.
Thanks,
Muhammad


Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-05-03 Thread Muhammad Malik
Hi,

Could you please share repro steps for running these benchmarks? I am doing 
performance testing in this area and want to use the same benchmarks.

Thanks,
Muhammad

From: Andres Freund 
Sent: Friday, October 28, 2022 7:54 PM
To: pgsql-hack...@postgresql.org ; Thomas Munro 
; Melanie Plageman 
Cc: Yura Sokolov ; Robert Haas 
Subject: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.

In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.

The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock.  We acquire a victim buffer; if that
buffer is dirty, we potentially flush the WAL and then write the buffer out.
Then we zero out the buffer contents and call smgrextend().

Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:

1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
   smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS


1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.
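
In rough pseudocode, the narrowed critical section would look like this (a
sketch; acquire_clean_victim_buffer(), insert_into_buffer_mapping() and
mark_io_in_progress() are hypothetical stand-ins for bufmgr.c internals):

Buffer      victim;
BlockNumber new_block;

victim = acquire_clean_victim_buffer();       /* pinned, clean, not mapped */

LockRelationForExtension(relation, ExclusiveLock);

new_block = smgrnblocks(smgr, MAIN_FORKNUM);  /* 1) find the extension point */
insert_into_buffer_mapping(victim, relation, MAIN_FORKNUM, new_block); /* 2) */
mark_io_in_progress(victim);                  /* 3) */

UnlockRelationForExtension(relation, ExclusiveLock);

/* Zeroing the page and smgrextend() now happen without the extension lock;
 * cleaning / writing out the victim buffer already happened before taking it. */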


The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation lock and that relation
extension happens entirely within ReadBuffer* - there's no way to use a
narrower scope for the lock.


My fix for that is to add a dedicated function for extending relations that can
acquire the extension lock if necessary (callers can tell it to skip that,
e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.
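
For illustration, caller-side usage of such a dedicated function can look like
this; the names follow the ExtendBufferedRel() API that eventually landed in
bufmgr.h, so take the details as approximate rather than as a quote of this
patch version:

Buffer      buf;

/* old style: the extension is hidden inside ReadBuffer() */
buf = ReadBuffer(rel, P_NEW);

/* new style: the extension is explicit; the function takes the relation
 * extension lock itself unless told to skip it (e.g. for an init fork,
 * via a flag such as EB_SKIP_EXTENSION_LOCK) */
buf = ExtendBufferedRel(BMR_REL(rel), MAIN_FORKNUM, NULL, EB_LOCK_FIRST);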


To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffers are pinned, cleaned, removed from the buffer mapping
table and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().

This is similar to Yura's patch at [0], but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty of gains, with no issues around the number of buffer mapping entries
that can exist, etc.



Other interesting bits I found:

a) For workloads that [mostly] fit into s_b, the smgrwrite() that BufferAlloc()
   does nearly doubles the amount of writes: first the kernel ends up writing
   out all the zeroed-out pages after a while, then again when we write out the
   actual buffer contents.

   The best fix for that seems to be to optionally use posix_fallocate() to
   reserve space, without dirtying pages in the kernel page cache. However, it
   looks like that's only beneficial when extending by multiple pages at once,
   because it ends up causing one filesystem-journal entry for each extension
   on at least some filesystems.

   I added 'smgrzeroextend()' that can extend by multiple blocks, without the
   caller providing a buffer to write out. When extending by 8 or more blocks,
   posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
   used to extend the file.
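
   A simplified sketch of that strategy (not the actual smgrzeroextend()/md.c
   code; error handling and the handling of more than PG_IOV_MAX blocks are
   elided):

#include "postgres.h"

#include <fcntl.h>                  /* posix_fallocate() */

#include "common/file_utils.h"      /* pg_pwritev_with_retry() */
#include "port/pg_iovec.h"          /* struct iovec, PG_IOV_MAX */

static void
zero_extend_sketch(int fd, off_t offset, int nblocks)
{
    if (nblocks >= 8)
    {
        /* reserve space without dirtying pages in the kernel page cache */
        int         ret = posix_fallocate(fd, offset, (off_t) nblocks * BLCKSZ);

        if (ret != 0)
        {
            errno = ret;
            elog(ERROR, "could not reserve space: %m");
        }
    }
    else
    {
        /* for small extensions, write out real zeroes instead */
        static char zerobuf[BLCKSZ];        /* static, hence zero-filled */
        struct iovec iov[PG_IOV_MAX];

        for (int i = 0; i < nblocks; i++)
        {
            iov[i].iov_base = zerobuf;
            iov[i].iov_len = BLCKSZ;
        }

        if (pg_pwritev_with_retry(fd, iov, nblocks, offset) < 0)
            elog(ERROR, "could not extend file: %m");
    }
}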


b) I found that it is quite beneficial to bulk-extend the relation with
   smgrextend() even without concurrency. The reason for that is primarily the
   aforementioned dirty buffers that our current extension method causes.

   One bit that stumped me for quite a while is knowing how much to extend the
   relation by. RelationGetBufferForTuple() drives the decision of whether /
   how much to bulk-extend purely on contention on the extension lock, which
   obviously does not work for non-concurrent workloads.

   After quite a while I figured out that we actually have good information on
   how much to extend by, at least for COPY /
   heap_multi_insert(). heap_multi_insert() can compute how much space is
   needed to store all tuples, and pass that on to
   RelationGetBufferForTuple().

   For that to be accurate we need to recompute that number whenever we use an
   already partially filled page. That's not great, but doesn't appear to be a
   measurable overhead.
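
   A sketch of that computation (a hypothetical helper, simplified relative to
   what heap_multi_insert() would actually do):

/* Sum the page-space cost of the tuples not yet inserted, so the caller
 * can tell RelationGetBufferForTuple() how far to bulk-extend. */
static Size
remaining_space_needed(HeapTuple *heaptuples, int done, int ntuples)
{
    Size        needed = 0;

    for (int i = done; i < ntuples; i++)
        needed += MAXALIGN(heaptuples[i]->t_len) + sizeof(ItemIdData);

    return needed;
}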


c) Contention on the FSM and the pages returned by it is a serious bottleneck

Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-05-03 Thread Muhammad Malik

> I use a script like:

> c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'

> >[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 10)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
> >[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*10)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);

When I ran this script it did not insert anything into the copytest_0 table. It 
only generated a single copytest_data_text.copy file of size 9.236 MB.
Please help me understand how this is 'pgbench running COPY into a single 
table'. Also, what are the 'seconds' and 'tbl-MBs' metrics that were reported?

Thank you,
Muhammad