Re: Flushing large data immediately in pqcomm
On Mon, 8 Apr 2024 at 09:27, Jelte Fennema-Nio wrote:
> On Sun, 7 Apr 2024 at 11:34, David Rowley wrote:
> > That seems to require modifying the following function signatures:
> > secure_write(), be_tls_write(), be_gssapi_write(). That's not an area
> > I'm familiar with, however.
>
> Attached is a new patchset where 0003 does exactly that. The only
> place where we need to cast to non-const is for GSS, but that seems
> fine (commit message has more details).

+1. Looks OK to me. The GSS pointer *ptr is already cast to char *
where it is needed, so the code is already correct.

best regards,
Ranier Vilela
Re: Flushing large data immediately in pqcomm
On Sun, 7 Apr 2024 at 11:34, David Rowley wrote:
> That seems to require modifying the following function signatures:
> secure_write(), be_tls_write(), be_gssapi_write(). That's not an area
> I'm familiar with, however.

Attached is a new patchset where 0003 does exactly that. The only
place where we need to cast to non-const is for GSS, but that seems
fine (commit message has more details).

I also added patch 0002, which is a small addition to the function
comment of internal_flush_buffer that seemed useful to me to
differentiate it from internal_flush (feel free to ignore/rewrite).

v9-0001-Make-a-few-variables-size_t-in-pqcomm.c.patch
Description: Binary data

v9-0002-Expand-comment-of-internal_flush_buffer.patch
Description: Binary data

v9-0003-Make-backend-libpq-write-functions-take-const-poi.patch
Description: Binary data
Re: Flushing large data immediately in pqcomm
On Mon, 8 Apr 2024 at 07:42, Jelte Fennema-Nio wrote:
> On Sun, 7 Apr 2024 at 14:41, David Rowley wrote:
> > Looking at the code in socket_putmessage_noblock(), I don't understand
> > why it's ok for PqSendBufferSize to be int but "required" must be
> > size_t. There's a line that does "PqSendBufferSize = required;". It
> > kinda looks like they both should be size_t. Am I missing something
> > that you've thought about?
>
> You and Ranier are totally right (I missed this assignment). Attached
> is v8.

+1. LGTM.

best regards,
Ranier Vilela
Re: Flushing large data immediately in pqcomm
On Sun, 7 Apr 2024 at 14:41, David Rowley wrote:
> Looking at the code in socket_putmessage_noblock(), I don't understand
> why it's ok for PqSendBufferSize to be int but "required" must be
> size_t. There's a line that does "PqSendBufferSize = required;". It
> kinda looks like they both should be size_t. Am I missing something
> that you've thought about?

You and Ranier are totally right (I missed this assignment). Attached
is v8.

v8-0001-Make-a-few-variables-size_t-in-pqcomm.c.patch
Description: Binary data
Re: Flushing large data immediately in pqcomm
On Sat, 6 Apr 2024 at 04:34, David Rowley wrote:
> Does anyone else want to try the attached script on the v5 patch to
> see if their numbers are better?

I'm seeing the below results with your script on my machine (). I ran
it several times, results were almost similar each time.

master:
Run 100 100 500: 1.627905512
Run 1024 10240 20: 1.603231684
Run 1024 1048576 2000: 2.962812352
Run 1048576 1048576 1000: 2.940766748

v5:
Run 100 100 500: 1.611508155
Run 1024 10240 20: 1.603505596
Run 1024 1048576 2000: 2.727241937
Run 1048576 1048576 1000: 2.721268988
Re: Flushing large data immediately in pqcomm
On Sat, 6 Apr 2024 at 22:39, Andres Freund wrote:
> Hi,
>
> On 2024-04-07 00:45:31 +0200, Jelte Fennema-Nio wrote:
> > On Sat, 6 Apr 2024 at 22:21, Andres Freund wrote:
> > > The small regression for small results is still kinda visible, I
> > > haven't yet tested the patch downthread.
> >
> > Thanks a lot for the faster test script, I'm also impatient. I still
> > saw the small regression with David's patch. Here's a v6 where I
> > think it is now gone. I added inline to internal_put_bytes too. I
> > think that helped especially because for two calls to
> > internal_put_bytes len is a constant (1 and 4) that is smaller than
> > PqSendBufferSize. So for those calls the compiler can now statically
> > eliminate the new codepath because "len >= PqSendBufferSize" is known
> > to be false at compile time.
>
> Nice.
>
> > Also I incorporated all of Ranier's comments.
>
> Changing the global vars to size_t seems mildly bogus to me. All it's
> achieving is to use slightly more memory. It also just seems unrelated
> to the change.

I don't agree with this thought. Yes, size_t uses 4 more bytes of
memory than int on 64-bit platforms. But mixing up int and size_t is a
sure way to write non-portable code, and the compilers will start
showing messages such as "signed/unsigned mismatch". The global vars
PqSendPointer and PqSendStart were changed in the v5 patch, so for the
sake of style and consistency, I understand that it is better not to
mix the types. The compiler will promote PqSendBufferSize to size_t in
all comparisons. And finally, the correct type to deal with char *
variables is size_t.

Best regards,
Ranier Vilela

> Greetings,
>
> Andres Freund
Re: Flushing large data immediately in pqcomm
On Sun, 7 Apr 2024 at 22:05, Jelte Fennema-Nio wrote:
> On Sun, 7 Apr 2024 at 03:39, Andres Freund wrote:
> > Changing the global vars to size_t seems mildly bogus to me. All it's
> > achieving is to use slightly more memory. It also just seems unrelated
> > to the change.
>
> I took a closer look at this. I agree that changing PqSendBufferSize
> to size_t is unnecessary: given the locations where it is used I see no
> risk of overflow anywhere. Changing the type of PqSendPointer and
> PqSendStart is needed though, because (as described by Heikki and David
> upthread) the argument type of internal_flush_buffer is size_t*. So if
> you actually pass int* there, and the sizes are not the same, then you
> will start writing out of bounds. And because internal_flush_buffer is
> introduced in this patch, it is related to this change.
>
> This is what David just committed too.
>
> However, the "required" var actually should be of size_t to avoid
> overflow if len is larger than int even without this change. So
> attached is a tiny patch that does that.

Looking at the code in socket_putmessage_noblock(), I don't understand
why it's ok for PqSendBufferSize to be int but "required" must be
size_t. There's a line that does "PqSendBufferSize = required;". It
kinda looks like they both should be size_t. Am I missing something
that you've thought about?

David
Re: Flushing large data immediately in pqcomm
On Sun, 7 Apr 2024 at 03:39, Andres Freund wrote:
> Changing the global vars to size_t seems mildly bogus to me. All it's
> achieving is to use slightly more memory. It also just seems unrelated
> to the change.

I took a closer look at this. I agree that changing PqSendBufferSize
to size_t is unnecessary: given the locations where it is used I see no
risk of overflow anywhere. Changing the type of PqSendPointer and
PqSendStart is needed though, because (as described by Heikki and David
upthread) the argument type of internal_flush_buffer is size_t*. So if
you actually pass int* there, and the sizes are not the same, then you
will start writing out of bounds. And because internal_flush_buffer is
introduced in this patch, it is related to this change.

This is what David just committed too.

However, the "required" var actually should be of size_t to avoid
overflow if len is larger than int even without this change. So
attached is a tiny patch that does that.

v7-0001-Avoid-possible-overflow-in-socket_putmessage_nonb.patch
Description: Binary data
Re: Flushing large data immediately in pqcomm
On Sun, 7 Apr 2024 at 08:21, Andres Freund wrote:
> I added WITH BINARY, SET STORAGE EXTERNAL and tested both unix socket
> and localhost. I also reduced row counts and iteration counts, because
> I am impatient, and I don't think it matters much here. Attached the
> modified version.

Thanks for the script. I'm able to reproduce the speedup with your
script.

I looked over the patch again and ended up making internal_flush an
inline function rather than a macro. I compared the assembly produced
from each and it's the same with the exception of the label names
being different. I've now pushed the patch.

One thing that does not seem ideal is having to cast away the
const-ness of the buffer in internal_flush_buffer(). Previously this
wasn't an issue as we always copied the buffer and passed that to
secure_write(). I wonder if it's worth seeing if we can keep this
buffer constant all the way to the socket write. That seems to require
modifying the following function signatures: secure_write(),
be_tls_write(), be_gssapi_write(). That's not an area I'm familiar
with, however.

David
Re: Flushing large data immediately in pqcomm
Hi,

On 2024-04-07 00:45:31 +0200, Jelte Fennema-Nio wrote:
> On Sat, 6 Apr 2024 at 22:21, Andres Freund wrote:
> > The small regression for small results is still kinda visible, I
> > haven't yet tested the patch downthread.
>
> Thanks a lot for the faster test script, I'm also impatient. I still
> saw the small regression with David's patch. Here's a v6 where I
> think it is now gone. I added inline to internal_put_bytes too. I
> think that helped especially because for two calls to
> internal_put_bytes len is a constant (1 and 4) that is smaller than
> PqSendBufferSize. So for those calls the compiler can now statically
> eliminate the new codepath because "len >= PqSendBufferSize" is known
> to be false at compile time.

Nice.

> Also I incorporated all of Ranier's comments.

Changing the global vars to size_t seems mildly bogus to me. All it's
achieving is to use slightly more memory. It also just seems unrelated
to the change.

Greetings,

Andres Freund
Re: Flushing large data immediately in pqcomm
On Sat, 6 Apr 2024 at 22:21, Andres Freund wrote:
> The small regression for small results is still kinda visible, I
> haven't yet tested the patch downthread.

Thanks a lot for the faster test script, I'm also impatient. I still
saw the small regression with David's patch. Here's a v6 where I think
it is now gone. I added inline to internal_put_bytes too. I think that
helped especially because for two calls to internal_put_bytes len is a
constant (1 and 4) that is smaller than PqSendBufferSize. So for those
calls the compiler can now statically eliminate the new codepath
because "len >= PqSendBufferSize" is known to be false at compile time.

Also I incorporated all of Ranier's comments.

v6-0001-Faster-internal_putbytes.patch
Description: Binary data
Re: Flushing large data immediately in pqcomm
Hi,

On Fri, 5 Apr 2024 at 03:28, Melih Mutlu wrote:
> Right. It was a mistake, forgot to remove that. Fixed it in v5.

If you don't mind, I have some suggestions for patch v5.

1. Shouldn't PqSendBufferSize be of type size_t? There are several
comparisons with other size_t variables.

static size_t PqSendBufferSize;	/* Size send buffer */

I think this would prevent possible overflows.

2. If PqSendBufferSize is changed to size_t, in the function
socket_putmessage_noblock, the variable named *required* should be
changed to size_t as well.

static void
socket_putmessage_noblock(char msgtype, const char *s, size_t len)
{
	int			res PG_USED_FOR_ASSERTS_ONLY;
	size_t		required;

3. In the internal_putbytes function, the *amount* variable could have
its scope safely reduced.

		else
		{
			size_t		amount;

			amount = PqSendBufferSize - PqSendPointer;

4. In the function internal_flush_buffer, the variables named *bufptr*
and *bufend* could be of const char * type, like:

static int
internal_flush_buffer(const char *s, size_t *start, size_t *end)
{
	static int	last_reported_send_errno = 0;

	const char *bufptr = s + *start;
	const char *bufend = s + *end;

best regards,
Ranier Vilela
Re: Flushing large data immediately in pqcomm
Hi,

On 2024-04-06 14:34:17 +1300, David Rowley wrote:
> I don't see any issues with v5, so based on the performance numbers
> shown on this thread for the latest patch, it would make sense to push
> it. The problem is, I just can't recreate the performance numbers.
>
> I've tried both on my AMD 3990x machine and an Apple M2 with a script
> similar to the test.sh from above. I mostly just stripped out the
> buffer size stuff and adjusted the timing code to something that would
> work with mac.

I think there are a few issues with the test script leading to not
seeing a gain:

1) I think using the textual protocol, with the text datatype, will
make it harder to spot differences. That's a lot of overhead.

2) Afaict the test is connecting over the unix socket, I think we
expect bigger wins for tcp.

3) Particularly the larger string is bottlenecked due to pglz
compression in toast.

Where I had noticed the overhead of the current approach badly, was
streaming out basebackups. Which is all binary, of course.

I added WITH BINARY, SET STORAGE EXTERNAL and tested both unix socket
and localhost. I also reduced row counts and iteration counts, because
I am impatient, and I don't think it matters much here. Attached the
modified version.

On a dual Xeon Gold 5215, turbo boost disabled, server pinned to one
core, script pinned to another:

unix:

master:
Run 100 100 100: 0.058482377
Run 1024 10240 10: 0.120909810
Run 1024 1048576 2000: 0.153027916
Run 1048576 1048576 1000: 0.154953512

v5:
Run 100 100 100: 0.058760126
Run 1024 10240 10: 0.118831396
Run 1024 1048576 2000: 0.124282503
Run 1048576 1048576 1000: 0.123894962

localhost:

master:
Run 100 100 100: 0.067088000
Run 1024 10240 10: 0.170894273
Run 1024 1048576 2000: 0.230346632
Run 1048576 1048576 1000: 0.230336078

v5:
Run 100 100 100: 0.067144036
Run 1024 10240 10: 0.167950948
Run 1024 1048576 2000: 0.135167027
Run 1048576 1048576 1000: 0.135347867

The perf difference for 1MB via TCP is really impressive.
The small regression for small results is still kinda visible, I
haven't yet tested the patch downthread.

Greetings,

Andres Freund

#!/bin/bash

set -e

dbname=postgres
port=5440
host=/tmp
host=localhost

test_cases=(
    "100 100 100"          # only 100 bytes
    "1024 10240 10"        # 1Kb and 10Kb
    "1024 1048576 2000"    # 1Kb and 1Mb
    "1048576 1048576 1000" # all 1Mb
)

insert_rows(){
    psql -d $dbname -p $port -h $host -c "
    DO \$\$
    DECLARE
        counter INT;
    BEGIN
        FOR counter IN 1..$3 LOOP
            IF counter % 2 = 1 THEN
                INSERT INTO test_table VALUES (repeat('a', $1)::text);
            ELSE
                INSERT INTO test_table VALUES (repeat('b', $2)::text);
            END IF;
        END LOOP;
    END \$\$;
    " > /dev/null
}

psql -d $dbname -p $port -c "CREATE EXTENSION IF NOT EXISTS pg_prewarm;" > /dev/null

for case in "${test_cases[@]}"
do
    psql -d $dbname -p $port -h $host -c "DROP TABLE IF EXISTS test_table;" > /dev/null
    psql -d $dbname -p $port -h $host -c "CREATE UNLOGGED TABLE test_table(data text not null);" > /dev/null
    psql -d $dbname -p $port -h $host -c "ALTER TABLE test_table ALTER data SET STORAGE EXTERNAL;" > /dev/null
    insert_rows $case
    psql -d $dbname -p $port -h $host -c "select pg_prewarm('test_table');" > /dev/null
    echo -n "Run $case: "
    elapsed_time=0
    for a in {1..5}
    do
        start_time=$(perl -MTime::HiRes=time -e 'printf "%.9f\n", time')
        psql -d $dbname -p $port -h $host -c "COPY test_table TO STDOUT WITH BINARY;" > /dev/null
        end_time=$(perl -MTime::HiRes=time -e 'printf "%.9f\n", time')
        elapsed_time=$(perl -e "printf('%.9f', ($end_time - $start_time) + $elapsed_time)")
    done
    avg_elapsed_time_in_ms=$(perl -e "printf('%.9f', ($elapsed_time / 30))")
    echo $avg_elapsed_time_in_ms
done
Re: Flushing large data immediately in pqcomm
On Sat, 6 Apr 2024 at 23:17, Jelte Fennema-Nio wrote:
> Weird that on your machines you don't see a difference. Are you sure
> you didn't make a silly mistake, like not restarting postgres or
> something?

I'm sure. I spent quite a long time between the AMD and an Apple M2
trying. I did see the same regression as you on the smaller numbers.

I experimented with the attached which macro'ifies internal_flush()
and pg_noinlines internal_flush_buffer. Can you try that to see if it
gets rid of the regression on the first two tests?

David

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 6497100a1a..824b2f11a3 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -120,8 +120,8 @@ static List *sock_paths = NIL;
 static char *PqSendBuffer;
 static int	PqSendBufferSize;	/* Size send buffer */
-static int	PqSendPointer;		/* Next index to store a byte in PqSendBuffer */
-static int	PqSendStart;		/* Next index to send a byte in PqSendBuffer */
+static size_t PqSendPointer;	/* Next index to store a byte in PqSendBuffer */
+static size_t PqSendStart;		/* Next index to send a byte in PqSendBuffer */
 
 static char PqRecvBuffer[PQ_RECV_BUFFER_SIZE];
 static int	PqRecvPointer;		/* Next index to read a byte from PqRecvBuffer */
@@ -133,6 +133,7 @@ static int	PqRecvLength;	/* End of data available in PqRecvBuffer */
 static bool PqCommBusy;			/* busy sending data to the client */
 static bool PqCommReadingMsg;	/* in the middle of reading a message */
 
+#define internal_flush() internal_flush_buffer(PqSendBuffer, &PqSendStart, &PqSendPointer)
 
 /* Internal functions */
 static void socket_comm_reset(void);
@@ -144,7 +145,8 @@ static bool socket_is_send_pending(void);
 static int	socket_putmessage(char msgtype, const char *s, size_t len);
 static void socket_putmessage_noblock(char msgtype, const char *s, size_t len);
 static int	internal_putbytes(const char *s, size_t len);
-static int	internal_flush(void);
+static pg_noinline int internal_flush_buffer(const char *s, size_t *start,
+											 size_t *end);
 
 static int	Lock_AF_UNIX(const char *unixSocketDir, const char *unixSocketPath);
 static int	Setup_AF_UNIX(const char *sock_path);
@@ -1282,14 +1284,32 @@ internal_putbytes(const char *s, size_t len)
 			if (internal_flush())
 				return EOF;
 		}
-		amount = PqSendBufferSize - PqSendPointer;
-		if (amount > len)
-			amount = len;
-		memcpy(PqSendBuffer + PqSendPointer, s, amount);
-		PqSendPointer += amount;
-		s += amount;
-		len -= amount;
+
+		/*
+		 * If the buffer is empty and data length is larger than the buffer
+		 * size, send it without buffering. Otherwise, put as much data as
+		 * possible into the buffer.
+		 */
+		if (len >= PqSendBufferSize && PqSendStart == PqSendPointer)
+		{
+			size_t		start = 0;
+
+			socket_set_nonblocking(false);
+			if (internal_flush_buffer(s, &start, &len))
+				return EOF;
+		}
+		else
+		{
+			amount = PqSendBufferSize - PqSendPointer;
+			if (amount > len)
+				amount = len;
+			memcpy(PqSendBuffer + PqSendPointer, s, amount);
+			PqSendPointer += amount;
+			s += amount;
+			len -= amount;
+		}
 	}
+
 	return 0;
 }
 
@@ -1315,19 +1335,19 @@ socket_flush(void)
 }
 
 /*
- * internal_flush - flush pending output
+ * internal_flush_buffer - flush the given buffer content
 *
 * Returns 0 if OK (meaning everything was sent, or operation would block
 * and the socket is in non-blocking mode), or EOF if trouble.
 *
 */
-static int
-internal_flush(void)
+static pg_noinline int
+internal_flush_buffer(const char *s, size_t *start, size_t *end)
 {
 	static int	last_reported_send_errno = 0;
 
-	char	   *bufptr = PqSendBuffer + PqSendStart;
-	char	   *bufend = PqSendBuffer + PqSendPointer;
+	char	   *bufptr = (char*) s + *start;
+	char	   *bufend = (char*) s + *end;
 
 	while (bufptr < bufend)
 	{
@@ -1373,7 +1393,7 @@ internal_flush(void)
 		 * flag that'll cause the next CHECK_FOR_INTERRUPTS to terminate
 		 * the connection.
 		 */
-		PqSendStart =
Re: Flushing large data immediately in pqcomm
On Sat, 6 Apr 2024 at 03:34, David Rowley wrote:
> Does anyone else want to try the attached script on the v5 patch to
> see if their numbers are better?

On my machine (i9-10900X, in Ubuntu 22.04 on WSL on Windows) v5
consistently beats master by ~0.25 seconds:

master:
Run 100 100 500: 1.948975205
Run 1024 10240 20: 3.039986587
Run 1024 1048576 2000: 2.444176276
Run 1048576 1048576 1000: 2.475328596

v5:
Run 100 100 500: 1.997170909
Run 1024 10240 20: 3.057802598
Run 1024 1048576 2000: 2.199449857
Run 1048576 1048576 1000: 2.210328762

The first two runs are pretty much equal, and I ran your script a few
more times and this seems like just random variance (sometimes v5 wins
those, sometimes master does, always quite close to each other). But
the last two runs v5 consistently wins.

Weird that on your machines you don't see a difference. Are you sure
you didn't make a silly mistake, like not restarting postgres or
something?
Re: Flushing large data immediately in pqcomm
On Fri, 5 Apr 2024 at 03:28, Melih Mutlu wrote:
> On Thu, 4 Apr 2024 at 16:34, Jelte Fennema-Nio wrote:
> > On Thu, 4 Apr 2024 at 13:08, Melih Mutlu wrote:
> > > I changed internal_flush() to an inline function, results look
> > > better this way.
> >
> > It seems you also change internal_flush_buffer to be inline (but only
> > in the actual function definition, not declaration at the top). I
> > don't think inlining internal_flush_buffer should be necessary to
> > avoid the perf regressions, i.e. internal_flush is adding extra
> > indirection compared to master and is only a single line, so that one
> > makes sense to inline.
>
> Right. It was a mistake, forgot to remove that. Fixed it in v5.

I don't see any issues with v5, so based on the performance numbers
shown on this thread for the latest patch, it would make sense to push
it. The problem is, I just can't recreate the performance numbers.

I've tried both on my AMD 3990x machine and an Apple M2 with a script
similar to the test.sh from above. I mostly just stripped out the
buffer size stuff and adjusted the timing code to something that would
work with mac. The script runs each copy 30 times and takes the
average time, reported here in seconds.

With AMD 3990x:

master
Run 100 100 500: 1.032264113 sec
Run 1024 10240 20: 1.016229105 sec
Run 1024 1048576 2000: 1.242267116 sec
Run 1048576 1048576 1000: 1.245425089 sec

v5
Run 100 100 500: 1.068543053 sec
Run 1024 10240 20: 1.026298571 sec
Run 1024 1048576 2000: 1.231169669 sec
Run 1048576 1048576 1000: 1.236355567 sec

With the M2 mini:

master
Run 100 100 500: 1.167851249 sec
Run 1024 10240 20: 1.962466987 sec
Run 1024 1048576 2000: 2.052836275 sec
Run 1048576 1048576 1000: 2.057908066 sec

v5
Run 100 100 500: 1.149636571 sec
Run 1024 10240 20: 2.158487741 sec
Run 1024 1048576 2000: 2.046627068 sec
Run 1048576 1048576 1000: 2.039329068 sec

From looking at the perf reports, the top function is:

57.62% postgres [.] CopyAttributeOutText

I messed around with trying to speed up the string escaping in that
function with the attached hacky patch and got the following on the
AMD 3990x machine:

CopyAttributeOutText_speedup.patch.txt
Run 100 100 500: 0.821673910
Run 1024 10240 20: 0.546632147
Run 1024 1048576 2000: 0.848492694
Run 1048576 1048576 1000: 0.840870293

I don't think we could actually do this unless we modified the output
function API to have it somehow output the number of bytes. The patch
may look beyond the NUL byte with pg_lfind8, which I don't think is
safe.

Does anyone else want to try the attached script on the v5 patch to
see if their numbers are better?

David

#!/bin/bash

dbname=postgres
port=5432

test_cases=(
    "100 100 500"          # only 100 bytes
    "1024 10240 20"        # 1Kb and 10Kb
    "1024 1048576 2000"    # 1Kb and 1Mb
    "1048576 1048576 1000" # all 1Mb
)

insert_rows(){
    psql -d $dbname -p $port -c "
    DO \$\$
    DECLARE
        counter INT;
    BEGIN
        FOR counter IN 1..$3 LOOP
            IF counter % 2 = 1 THEN
                INSERT INTO test_table VALUES (repeat('a', $1)::text);
            ELSE
                INSERT INTO test_table VALUES (repeat('b', $2)::text);
            END IF;
        END LOOP;
    END \$\$;
    " > /dev/null
}

psql -d $dbname -p $port -c "CREATE EXTENSION IF NOT EXISTS pg_prewarm;" > /dev/null

for case in "${test_cases[@]}"
do
    psql -d $dbname -p $port -c "DROP TABLE IF EXISTS test_table;" > /dev/null
    psql -d $dbname -p $port -c "CREATE TABLE test_table(data text not null);" > /dev/null
    insert_rows $case
    psql -d $dbname -p $port -c "select pg_prewarm('test_table');" > /dev/null
    echo -n "Run $case: "
    elapsed_time=0
    for a in {1..30}
    do
        start_time=$(perl -MTime::HiRes=time -e 'printf "%.9f\n", time')
        psql -d $dbname -p $port -c "COPY test_table TO STDOUT;" > /dev/null
        end_time=$(perl -MTime::HiRes=time -e 'printf "%.9f\n", time')
        elapsed_time=$(perl -e "printf('%.9f', ($end_time - $start_time) + $elapsed_time)")
    done
    avg_elapsed_time_in_ms=$(perl -e "printf('%.9f', ($elapsed_time / 30))")
    echo $avg_elapsed_time_in_ms
done

diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index ae8b2e36d7..fcb200503f 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -29,6 +29,7 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "port/pg_lfind.h"
 #include "storage/fd.h"
 #include "tcop/tcopprot.h"
 #include "utils/lsyscache.h"
@@ -1068,9 +1069,18 @@ CopyAttributeOutText(CopyToState cstate, const char *string)
 	else
 	{
 		start = ptr;
-		while ((c = *ptr) != '\0')
+
+		while (true)
 		{
-
Re: Flushing large data immediately in pqcomm
On Thu, 4 Apr 2024 at 16:34, Jelte Fennema-Nio wrote:
> On Thu, 4 Apr 2024 at 13:08, Melih Mutlu wrote:
> > I changed internal_flush() to an inline function, results look better
> > this way.
>
> It seems you also change internal_flush_buffer to be inline (but only
> in the actual function definition, not declaration at the top). I
> don't think inlining internal_flush_buffer should be necessary to
> avoid the perf regressions, i.e. internal_flush is adding extra
> indirection compared to master and is only a single line, so that one
> makes sense to inline.

Right. It was a mistake, forgot to remove that. Fixed it in v5.

> Other than that the code looks good to me.
>
> The new results look great.
>
> One thing that is quite interesting about these results is that
> increasing the buffer size results in even better performance (by
> quite a bit). I don't think we can easily choose a perfect number, as
> it seems to be a trade-off between memory usage and perf. But allowing
> people to configure it through a GUC like in your second patchset
> would be quite useful I think, especially because larger buffers could
> be configured for connections that would benefit most from it (e.g.
> replication connections or big COPYs).
>
> I think your add-pq_send_buffer_size-GUC.patch is essentially what we
> would need there but it would need some extra changes to actually be
> merge-able:
> 1. needs docs
> 2. rename PQ_SEND_BUFFER_SIZE (at least make it not UPPER_CASE, but
> maybe also remove the PQ_ prefix)
> 3. It's marked as PGC_USERSET, but there's no logic to grow/shrink it
> after initial allocation

I agree that the GUC patch requires more work to be in good shape. I
created that for testing purposes. But if we decide to make the buffer
size customizable, then I'll start polishing up that patch and address
your suggestions.

One case that could benefit from increased COPY performance is table
sync of logical replication.
It might make sense to let users configure the buffer size to speed up
table sync. I'm not sure what kind of problems this GUC would bring,
though.

Thanks,
-- 
Melih Mutlu
Microsoft

v5-0001-Flush-large-data-immediately-in-pqcomm.patch
Description: Binary data
Re: Flushing large data immediately in pqcomm
On Thu, 4 Apr 2024 at 13:08, Melih Mutlu wrote:
> I changed internal_flush() to an inline function, results look better
> this way.

It seems you also changed internal_flush_buffer to be inline (but only
in the actual function definition, not the declaration at the top). I
don't think inlining internal_flush_buffer should be necessary to
avoid the perf regressions, i.e. internal_flush is adding extra
indirection compared to master and is only a single line, so that one
makes sense to inline.

Other than that the code looks good to me.

The new results look great.

One thing that is quite interesting about these results is that
increasing the buffer size results in even better performance (by
quite a bit). I don't think we can easily choose a perfect number, as
it seems to be a trade-off between memory usage and perf. But allowing
people to configure it through a GUC like in your second patchset
would be quite useful I think, especially because larger buffers could
be configured for connections that would benefit most from it (e.g.
replication connections or big COPYs).

I think your add-pq_send_buffer_size-GUC.patch is essentially what we
would need there, but it would need some extra changes to actually be
merge-able:
1. needs docs
2. rename PQ_SEND_BUFFER_SIZE (at least make it not UPPER_CASE, but
maybe also remove the PQ_ prefix)
3. It's marked as PGC_USERSET, but there's no logic to grow/shrink it
after initial allocation
Re: Flushing large data immediately in pqcomm
Hi,

On Thu, 28 Mar 2024 at 22:44, Melih Mutlu wrote:
> On Wed, Mar 27, 2024 at 14:39, David Rowley wrote:
> > On Fri, 22 Mar 2024 at 12:46, Melih Mutlu wrote:
> > can you confirm if the test was done in debug with casserts on? If
> > so, it would be much better to have asserts off and have
> > -Dbuildtype=release.
>
> Yes, previous numbers were with --buildtype debug -Dcassert=true. I
> can share new numbers with release build and asserts off soon.

While testing the patch without --buildtype debug -Dcassert=true, I
felt like there was still a slight regression. I changed
internal_flush() to an inline function, results look better this way.

1- row size = 100 bytes, # of rows = 100

           | 1400 bytes |   2KB |   4KB |   8KB |  16KB |  32KB
-----------+------------+-------+-------+-------+-------+-------
 HEAD      |        861 |   765 |   612 |   521 |   477 |   480
 patch     |        869 |   766 |   612 |   519 |   482 |   467
 no buffer |      13978 | 13746 | 13909 | 13956 | 13920 | 13895

2- row size = half of the rows are 1KB and the rest is 10KB, # of rows = 100

           | 1400 bytes |   2KB |   4KB |   8KB |  16KB |  32KB
-----------+------------+-------+-------+-------+-------+-------
 HEAD      |      30195 | 26455 | 17338 | 14562 | 12844 | 11652
 patch     |      14744 | 15830 | 15697 | 14273 | 12794 | 11652
 no buffer |      24054 | 23992 | 24162 | 23951 | 23901 | 23925

3- row size = half of the rows are 1KB and the rest is 1MB, # of rows = 1000

           | 1400 bytes |  2KB |  4KB |  8KB | 16KB | 32KB
-----------+------------+------+------+------+------+------
 HEAD      |       3546 | 3029 | 2373 | 2032 | 1873 | 1806
 patch     |       1715 | 1723 | 1724 | 1731 | 1729 | 1709
 no buffer |       1749 | 1748 | 1742 | 1744 | 1757 | 1744

4- row size = all rows are 1MB, # of rows = 1000

           | 1400 bytes |  2KB |  4KB |  8KB | 16KB | 32KB
-----------+------------+------+------+------+------+------
 HEAD      |       7089 | 5987 | 4697 | 4048 | 3737 | 3523
 patch     |       3438 | 3411 | 3400 | 3416 | 3399 | 3429
 no buffer |       3432 | 3432 | 3416 | 3424 | 3378 | 3429

Thanks,
-- 
Melih Mutlu
Microsoft

v4-0001-Flush-large-data-immediately-in-pqcomm.patch
Description: Binary data
Re: Flushing large data immediately in pqcomm
On Wed, Mar 27, 2024 at 18:54, Robert Haas wrote:
> On Wed, Mar 27, 2024 at 7:39 AM David Rowley wrote:
> > Robert, I understand you'd like a bit more from this patch. I'm
> > wondering if you are planning on blocking another committer from
> > going ahead with this? Or if you have a reason why the current state
> > of the patch is not a meaningful enough improvement that would
> > justify possibly not getting any improvements in this area for PG17?
>
> So, I think that the first version of the patch, when it got a big
> chunk of data, would just flush whatever was already in the buffer and
> then send the rest without copying.

Correct.

> The current version, as I understand it, only does that if the buffer
> is empty; otherwise, it copies as much data as it can into the
> partially-filled buffer.

Yes, currently it should fill and flush the buffer first, if it's not
already empty. Only then does it send the rest without copying.

Thanks,
Melih
Re: Flushing large data immediately in pqcomm
On Wed, Mar 27, 2024 at 14:39, David Rowley wrote:
> On Fri, 22 Mar 2024 at 12:46, Melih Mutlu wrote:
> > I did all of the above changes and it seems like those resolved the
> > regression issue.
>
> Thanks for adjusting the patch. The numbers do look better, but on
> looking at your test.sh script from [1], I see:
>
> meson setup --buildtype debug -Dcassert=true
> --prefix="$DESTDIR/usr/local/pgsql" $DESTDIR && \
>
> can you confirm if the test was done in debug with casserts on? If
> so, it would be much better to have asserts off and have
> -Dbuildtype=release.

Yes, previous numbers were with --buildtype debug -Dcassert=true. I
can share new numbers with release build and asserts off soon.

Thanks,
Melih
Re: Flushing large data immediately in pqcomm
On Wed, Mar 27, 2024 at 7:39 AM David Rowley wrote:
> Robert, I understand you'd like a bit more from this patch. I'm
> wondering if you're planning on blocking another committer from going
> ahead with this? Or if you have a reason why the current state of the
> patch is not a meaningful enough improvement that would justify
> possibly not getting any improvements in this area for PG17?

So, I think that the first version of the patch, when it got a big chunk
of data, would just flush whatever was already in the buffer and then
send the rest without copying. The current version, as I understand it,
only does that if the buffer is empty; otherwise, it copies as much data
as it can into the partially-filled buffer. I think that change
addresses most of my concern about the approach; the old way could, I
believe, lead to an increased total number of flushes with the right
usage pattern, but I don't believe that's possible with the revised
approach. I do kind of wonder whether there is some more fine-tuning of
the approach that would improve things further, but I realize that we
have very limited time to figure this out, and there's no sense letting
the perfect be the enemy of the good.

So in short... no, I don't have big concerns at this point. Melih's
latest benchmarks look fairly promising to me, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: Flushing large data immediately in pqcomm
On Fri, 22 Mar 2024 at 12:46, Melih Mutlu wrote:
> I did all of the above changes and it seems like those resolved the
> regression issue.

Thanks for adjusting the patch. The numbers do look better, but on
looking at your test.sh script from [1], I see:

meson setup --buildtype debug -Dcassert=true
--prefix="$DESTDIR/usr/local/pgsql" $DESTDIR && \

can you confirm if the test was done in debug with casserts on? If so,
it would be much better to have asserts off and have
-Dbuildtype=release.

I'm planning to run some benchmarks tomorrow. My thoughts are that the
patch allows the memcpy() to be skipped without adding any additional
buffer flushes and demonstrates a good performance increase in various
scenarios from doing so. I think that is a satisfactory goal. If I
don't see any issues from reviewing and benchmarking tomorrow, I'd like
to commit this.

Robert, I understand you'd like a bit more from this patch. I'm
wondering if you're planning on blocking another committer from going
ahead with this? Or if you have a reason why the current state of the
patch is not a meaningful enough improvement that would justify
possibly not getting any improvements in this area for PG17?

David

[1] https://www.postgresql.org/message-id/CAGPVpCSX8bTF61ZL9jOgh1AaY3bgsWnQ6J7WmJK4TV0f2LPnJQ%40mail.gmail.com
Re: Flushing large data immediately in pqcomm
Hi,

PSA v3.

Jelte Fennema-Nio wrote on Thu, 21 Mar 2024 at 12:58:
> On Thu, 21 Mar 2024 at 01:24, Melih Mutlu wrote:
> > What if I do a simple comparison like PqSendStart == PqSendPointer
> > instead of calling pq_is_send_pending()
>
> Yeah, that sounds worth trying out. So the new suggestions to fix the
> perf issues on small message sizes would be:
>
> 1. add "inline" to internal_flush function
> 2. replace pq_is_send_pending() with PqSendStart == PqSendPointer
> 3. (optional) swap the order of PqSendStart == PqSendPointer and len
> >= PqSendBufferSize

I did all of the above changes and it seems like those resolved the
regression issue. Since the previous results were with unix sockets, I
share here the results of v3 when using unix sockets for comparison.
Sharing only the case where all messages are 100 bytes, since this was
when the regression was most visible.

row size = 100 bytes, # of rows = 100

┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │  2KB │  4KB │  8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │       1106 │ 1006 │  947 │  920 │  899 │  888 │
│ patch     │       1094 │  997 │  943 │  913 │  894 │  881 │
│ no buffer │       6389 │ 6195 │ 6214 │ 6271 │ 6325 │ 6211 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

David Rowley wrote on Thu, 21 Mar 2024 at 00:57:
> On Fri, 15 Mar 2024 at 01:46, Heikki Linnakangas wrote:
> > - the "(int *) &len)" cast is not ok, and will break visibly on
> > big-endian systems where sizeof(int) != sizeof(size_t).
>
> I think fixing this requires adjusting the signature of
> internal_flush_buffer() to use size_t instead of int. That also
> means that PqSendStart and PqSendPointer must also become size_t, or
> internal_flush() must add local size_t variables to pass to
> internal_flush_buffer and assign these back again to the global after
> the call. Upgrading the globals might be the cleaner option.
>
> David

This is done too.

I actually tried to test it over a real network for a while.
However, I couldn't get reliable-enough numbers with both HEAD and the
patch due to network related issues. I've decided to go with Jelte's
suggestion [1] which is decreasing the MTU of the loopback interface to
1500 and using localhost. Here are the results:

1- row size = 100 bytes, # of rows = 100

┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │   2KB │   4KB │   8KB │  16KB │  32KB │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │       1351 │  1233 │  1074 │   988 │   944 │   916 │
│ patch     │       1369 │  1232 │  1073 │   981 │   928 │   907 │
│ no buffer │      14949 │ 14533 │ 14791 │ 14864 │ 14612 │ 14751 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

2- row size = half of the rows are 1KB and rest is 10KB, # of rows = 100

┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │   2KB │   4KB │   8KB │  16KB │  32KB │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │      37212 │ 31372 │ 25520 │ 21980 │ 20311 │ 18864 │
│ patch     │      23006 │ 23127 │ 23147 │     9 │ 20367 │ 19155 │
│ no buffer │      30725 │ 31090 │ 30917 │ 30796 │ 30984 │ 30813 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

3- row size = half of the rows are 1KB and rest is 1MB, # of rows = 1000

┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │  2KB │  4KB │  8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │       4296 │ 3713 │ 3040 │ 2711 │ 2528 │ 2449 │
│ patch     │       2401 │ 2411 │ 2404 │ 2374 │ 2395 │ 2408 │
│ no buffer │       2399 │ 2403 │ 2408 │ 2389 │ 2402 │ 2403 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

4- row size = all rows are 1MB, # of rows = 1000

┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │  2KB │  4KB │  8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │       8335 │ 7370 │ 6017 │ 5368 │ 5009 │ 4843 │
│ patch     │       4711 │ 4722 │ 4708 │ 4693 │ 4724 │ 4717 │
│ no buffer │       4704 │ 4712 │ 4746 │ 4728 │ 4709 │ 4
Re: Flushing large data immediately in pqcomm
Heikki Linnakangas wrote on Thu, 14 Mar 2024 at 15:46:
> On 14/03/2024 13:22, Melih Mutlu wrote:
> > @@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
> >             if (internal_flush())
> >                 return EOF;
> >         }
> > -       amount = PqSendBufferSize - PqSendPointer;
> > -       if (amount > len)
> > -           amount = len;
> > -       memcpy(PqSendBuffer + PqSendPointer, s, amount);
> > -       PqSendPointer += amount;
> > -       s += amount;
> > -       len -= amount;
> > +
> > +       /*
> > +        * If the buffer is empty and data length is larger than the buffer
> > +        * size, send it without buffering. Otherwise, put as much data as
> > +        * possible into the buffer.
> > +        */
> > +       if (!pq_is_send_pending() && len >= PqSendBufferSize)
> > +       {
> > +           int     start = 0;
> > +
> > +           socket_set_nonblocking(false);
> > +           if (internal_flush_buffer(s, &start, (int *) &len))
> > +               return EOF;
> > +       }
> > +       else
> > +       {
> > +           amount = PqSendBufferSize - PqSendPointer;
> > +           if (amount > len)
> > +               amount = len;
> > +           memcpy(PqSendBuffer + PqSendPointer, s, amount);
> > +           PqSendPointer += amount;
> > +           s += amount;
> > +           len -= amount;
> > +       }
> >     }
> > +
> >     return 0;
> > }
>
> Two small bugs:
>
> - the "(int *) &len)" cast is not ok, and will break visibly on
> big-endian systems where sizeof(int) != sizeof(size_t).
>
> - If internal_flush_buffer() cannot write all the data in one call, it
> updates 'start' for how much it wrote, and leaves 'end' unchanged. You
> throw the updated 'start' value away, and will send the same data again
> on next iteration.

There are two possible options for internal_flush_buffer() in the
internal_putbytes() case:

1- Write all the data and return 0. We don't need start or end of the
data in this case.
2- Cannot write all and return EOF. In this case internal_putbytes()
also returns EOF immediately and does not really retry. There will be no
next iteration. If it was non-blocking, then we may need to keep the new
value.
But I think we do not need the updated start value in both cases here.
What do you think?

Thanks,
-- 
Melih Mutlu
Microsoft
Re: Flushing large data immediately in pqcomm
On Thu, 21 Mar 2024 at 22:44, Jelte Fennema-Nio wrote:
> On Thu, 21 Mar 2024 at 01:45, David Rowley wrote:
> > As I understand the code, there's no problem calling
> > internal_flush_buffer() when the buffer is empty and I suspect that if
> > we're sending a few buffers with "len > PqSendBufferSize" that it's
> > just so unlikely that the buffer is empty that we should just do the
> > function call and let internal_flush_buffer() handle doing nothing if
> > the buffer really is empty. I think the chances of
> > internal_flush_buffer() having to do exactly nothing here is less than
> > 1 in 8192, so I just don't think the check is worthwhile.
>
> I think you're missing the exact case that we're trying to improve
> here: Calls to internal_putbytes with a very large len, e.g. 1MB.
> With the new code the buffer will be empty ~50% of the time (not less
> than 1 in 8192) with such large buffers, because the flow that will
> happen:

It was the code I misread. I understand what the aim is. I failed to
notice the while loop in internal_putbytes(). So what I mentioned about
trying to fill the buffer before flushing already happens. I now agree
that the PqSendStart == PqSendPointer test is worthwhile.

I'd say since the reported regression was with 100 byte rows that
testing "len >= PqSendBufferSize" before PqSendStart == PqSendPointer
makes sense.

David
Re: Flushing large data immediately in pqcomm
On Thu, 21 Mar 2024 at 01:24, Melih Mutlu wrote:
> What if I do a simple comparison like PqSendStart == PqSendPointer
> instead of calling pq_is_send_pending()

Yeah, that sounds worth trying out. So the new suggestions to fix the
perf issues on small message sizes would be:

1. add "inline" to internal_flush function
2. replace pq_is_send_pending() with PqSendStart == PqSendPointer
3. (optional) swap the order of PqSendStart == PqSendPointer and len
>= PqSendBufferSize
Re: Flushing large data immediately in pqcomm
On Thu, 21 Mar 2024 at 01:45, David Rowley wrote:
> As I understand the code, there's no problem calling
> internal_flush_buffer() when the buffer is empty and I suspect that if
> we're sending a few buffers with "len > PqSendBufferSize" that it's
> just so unlikely that the buffer is empty that we should just do the
> function call and let internal_flush_buffer() handle doing nothing if
> the buffer really is empty. I think the chances of
> internal_flush_buffer() having to do exactly nothing here is less than
> 1 in 8192, so I just don't think the check is worthwhile.

I think you're missing the exact case that we're trying to improve here:
calls to internal_putbytes with a very large len, e.g. 1MB. With the new
code the buffer will be empty ~50% of the time (not less than 1 in 8192)
with such large buffers, because this is the flow that will happen:

1. We check len > PqSendBufferSize. There are some bytes in the buffer,
   e.g. the 5 bytes of the msgtype. So we fill up the buffer, but have
   many bytes left in len.
2. We loop again, because len is not 0.
3. We flush the buffer (at the top of the loop) because the buffer is
   full.
4. We check len > PqSendBufferSize. Now the buffer is empty, so we call
   internal_flush_buffer directly.

As you can see we check len > PqSendBufferSize twice (in step 1 and step
4), and 1 out of 2 times it returns 0.

To be clear, the code is done this way so our behaviour would only ever
be better than the status-quo, and cause no regressions. For instance,
flushing the 5 byte header separately and then flushing the full input
buffer might result in more IP packets being sent in total in some cases
due to our TCP_NODELAY.
Re: Flushing large data immediately in pqcomm
On Thu, 21 Mar 2024 at 13:24, Melih Mutlu wrote: > What if I do a simple comparison like PqSendStart == PqSendPointer instead of > calling pq_is_send_pending() as Heikki suggested, then this check should not > hurt that much. Right? Does that make sense? As I understand the code, there's no problem calling internal_flush_buffer() when the buffer is empty and I suspect that if we're sending a few buffers with "len > PqSendBufferSize" that it's just so unlikely that the buffer is empty that we should just do the function call and let internal_flush_buffer() handle doing nothing if the buffer really is empty. I think the chances of internal_flush_buffer() having to do exactly nothing here is less than 1 in 8192, so I just don't think the check is worthwhile. The reason I don't think the odds are exactly 1 in 8192 is because if we're sending a large number of bytes then it will be common that the buffer will contain exactly 5 bytes due to the previous flush and command prefix just having been added. It's worth testing both, however. I might be wrong. Performance is hard to predict. It would be good to see your test.sh script run with and without the PqSendStart == PqSendPointer condition. David
Re: Flushing large data immediately in pqcomm
David Rowley , 21 Mar 2024 Per, 00:54 tarihinde şunu yazdı: > On Fri, 15 Mar 2024 at 02:03, Jelte Fennema-Nio > wrote: > > > > On Thu, 14 Mar 2024 at 13:12, Robert Haas wrote: > > > > > > On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu > wrote: > > > > 1- Even though I expect both the patch and HEAD behave similarly in > case of small data (case 1: 100 bytes), the patch runs slightly slower than > HEAD. > > > > > > I wonder why this happens. It seems like maybe something that could be > fixed. > > > > some wild guesses: > > 1. maybe it's the extra call overhead of the new internal_flush > > implementation. What happens if you make that an inline function? > > 2. maybe swap these conditions around (the call seems heavier than a > > simple comparison): !pq_is_send_pending() && len >= PqSendBufferSize > > I agree these are both worth trying. For #2, I wonder if the > pq_is_send_pending() call is even worth checking at all. It seems to > me that the internal_flush_buffer() code will just do nothing if > nothing is pending. Also, isn't there almost always going to be > something pending when the "len >= PqSendBufferSize" condition is met? > We've just added the msgtype and number of bytes to the buffer which > is 5 bytes. If the previous message was also more than > PqSendBufferSize, then the buffer is likely to have 5 bytes due to the > previous flush, otherwise isn't it a 1 in 8192 chance that the buffer > is empty? > > If that fails to resolve the regression, maybe it's worth memcpy()ing > enough bytes out of the message to fill the buffer then flush it and > check if we still have > PqSendBufferSize remaining and skip the > memcpy() for the rest. That way there are no small flushes of just 5 > bytes and only ever the possibility of reducing the flushes as no > pattern should cause the number of flushes to increase. > In len > PqSendBufferSize cases, the buffer should be filled as much as possible if we're sure that it will be flushed at some point. 
Otherwise we might end up with small flushes. The cases where we're sure that the buffer will be flushed is when the buffer is not empty. If it's empty, there is no need to fill it unnecessarily as it might cause an additional flush. AFAIU from what you said, we shouldn't be worried about such a case since it's unlikely to have the buffer empty due to the first 5 bytes. I guess the only case where the buffer can be empty is when the buffer has PqSendBufferSize-5 bytes from previous messages and adding 5 bytes of the current message will flush the buffer. I'm not sure if removing the check may cause any regression in any case, but it's just there to be safe. What if I do a simple comparison like PqSendStart == PqSendPointer instead of calling pq_is_send_pending() as Heikki suggested, then this check should not hurt that much. Right? Does that make sense? -- Melih Mutlu Microsoft
Re: Flushing large data immediately in pqcomm
On Fri, 15 Mar 2024 at 01:46, Heikki Linnakangas wrote: > - the "(int *) &len)" cast is not ok, and will break visibly on > big-endian systems where sizeof(int) != sizeof(size_t). I think fixing this requires adjusting the signature of internal_flush_buffer() to use size_t instead of int. That also means that PqSendStart and PqSendPointer must also become size_t, or internal_flush() must add local size_t variables to pass to internal_flush_buffer and assign these back again to the global after the call. Upgrading the globals might be the cleaner option. David
Re: Flushing large data immediately in pqcomm
On Fri, 15 Mar 2024 at 02:03, Jelte Fennema-Nio wrote:
> On Thu, 14 Mar 2024 at 13:12, Robert Haas wrote:
> > On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu wrote:
> > > 1- Even though I expect both the patch and HEAD behave similarly in
> > > case of small data (case 1: 100 bytes), the patch runs slightly
> > > slower than HEAD.
> >
> > I wonder why this happens. It seems like maybe something that could be
> > fixed.
>
> some wild guesses:
> 1. maybe it's the extra call overhead of the new internal_flush
> implementation. What happens if you make that an inline function?
> 2. maybe swap these conditions around (the call seems heavier than a
> simple comparison): !pq_is_send_pending() && len >= PqSendBufferSize

I agree these are both worth trying. For #2, I wonder if the
pq_is_send_pending() call is even worth checking at all. It seems to me
that the internal_flush_buffer() code will just do nothing if nothing is
pending. Also, isn't there almost always going to be something pending
when the "len >= PqSendBufferSize" condition is met? We've just added
the msgtype and number of bytes to the buffer which is 5 bytes. If the
previous message was also more than PqSendBufferSize, then the buffer is
likely to have 5 bytes due to the previous flush, otherwise isn't it a 1
in 8192 chance that the buffer is empty?

If that fails to resolve the regression, maybe it's worth memcpy()ing
enough bytes out of the message to fill the buffer then flush it and
check if we still have > PqSendBufferSize remaining and skip the
memcpy() for the rest. That way there are no small flushes of just 5
bytes and only ever the possibility of reducing the flushes as no
pattern should cause the number of flushes to increase.

David
Re: Flushing large data immediately in pqcomm
On Thu, 14 Mar 2024 at 13:12, Robert Haas wrote: > > On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu wrote: > > 1- Even though I expect both the patch and HEAD behave similarly in case of > > small data (case 1: 100 bytes), the patch runs slightly slower than HEAD. > > I wonder why this happens. It seems like maybe something that could be fixed. some wild guesses: 1. maybe it's the extra call overhead of the new internal_flush implementation. What happens if you make that an inline function? 2. maybe swap these conditions around (the call seems heavier than a simple comparison): !pq_is_send_pending() && len >= PqSendBufferSize BTW, the improvements for the larger rows are awesome!
Re: Flushing large data immediately in pqcomm
On 14/03/2024 13:22, Melih Mutlu wrote:
> @@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
>             if (internal_flush())
>                 return EOF;
>         }
> -       amount = PqSendBufferSize - PqSendPointer;
> -       if (amount > len)
> -           amount = len;
> -       memcpy(PqSendBuffer + PqSendPointer, s, amount);
> -       PqSendPointer += amount;
> -       s += amount;
> -       len -= amount;
> +
> +       /*
> +        * If the buffer is empty and data length is larger than the buffer
> +        * size, send it without buffering. Otherwise, put as much data as
> +        * possible into the buffer.
> +        */
> +       if (!pq_is_send_pending() && len >= PqSendBufferSize)
> +       {
> +           int     start = 0;
> +
> +           socket_set_nonblocking(false);
> +           if (internal_flush_buffer(s, &start, (int *) &len))
> +               return EOF;
> +       }
> +       else
> +       {
> +           amount = PqSendBufferSize - PqSendPointer;
> +           if (amount > len)
> +               amount = len;
> +           memcpy(PqSendBuffer + PqSendPointer, s, amount);
> +           PqSendPointer += amount;
> +           s += amount;
> +           len -= amount;
> +       }
>     }
> +
>     return 0;
> }

Two small bugs:

- the "(int *) &len)" cast is not ok, and will break visibly on
big-endian systems where sizeof(int) != sizeof(size_t).

- If internal_flush_buffer() cannot write all the data in one call, it
updates 'start' for how much it wrote, and leaves 'end' unchanged. You
throw the updated 'start' value away, and will send the same data again
on next iteration.

Not a correctness issue, but instead of pq_is_send_pending(), I think it
would be better to check "PqSendStart == PqSendPointer" directly, or
call socket_is_send_pending() directly here. pq_is_send_pending() does
the same, but it's at a higher level of abstraction.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Re: Flushing large data immediately in pqcomm
On Thu, 14 Mar 2024 at 12:22, Melih Mutlu wrote:
> I did some experiments with this patch, after previous discussions

One thing I noticed is that the buffer sizes don't seem to matter much
in your experiments, even though Andres's expectation was that 1400
would be better. I think I know the reason for that: afaict from your
test.sh script you connect psql over localhost or maybe even unix socket
to postgres. Neither of those would have an MTU of 1500. You'd probably
want to do those tests over an actual network or at least change the MTU
of the loopback interface. e.g. my "lo" interface mtu is 65536 by
default:

❯ ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
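For reference, on Linux the loopback MTU can be lowered with iproute2 so that localhost benchmarks see Ethernet-sized packets. This is a sketch of the setup: it requires root, and 65536 is the usual Linux loopback default to restore afterwards.

```shell
# Drop the loopback MTU to a typical Ethernet frame size for the benchmark
sudo ip link set dev lo mtu 1500

# ... run the COPY benchmark against localhost here ...

# Restore the default loopback MTU afterwards
sudo ip link set dev lo mtu 65536
```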
Re: Flushing large data immediately in pqcomm
On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu wrote: > 1- Even though I expect both the patch and HEAD behave similarly in case of > small data (case 1: 100 bytes), the patch runs slightly slower than HEAD. I wonder why this happens. It seems like maybe something that could be fixed. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Flushing large data immediately in pqcomm
Hi hackers,

I did some experiments with this patch, after previous discussions. This
probably does not answer all questions, but I would be happy to do more
if needed.

First, I updated the patch according to what was suggested here [1]. PSA
v2. I tweaked the master branch a bit to not allow any buffering. I
compared HEAD, this patch and no buffering at all. I also added a simple
GUC to control PqSendBufferSize; this change only allows modifying the
buffer size and should not have any impact on performance.

I again ran the COPY TO STDOUT command and timed it. AFAIU COPY sends
data row by row, and I tried running the command under different
scenarios with different # of rows and row sizes. You can find the test
script attached (see test.sh). All timings are in ms.

1- row size = 100 bytes, # of rows = 100

┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │  2KB │  4KB │  8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │       1036 │  998 │  940 │  910 │  894 │  874 │
│ patch     │       1107 │ 1032 │  980 │  957 │  917 │  909 │
│ no buffer │       6230 │ 6125 │ 6282 │ 6279 │ 6255 │ 6221 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

2- row size = half of the rows are 1KB and rest is 10KB, # of rows = 100

┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │   2KB │   4KB │   8KB │  16KB │  32KB │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │      25197 │ 23414 │ 20612 │ 19206 │ 18334 │ 18033 │
│ patch     │      19843 │ 19889 │ 19798 │ 19129 │ 18578 │ 18260 │
│ no buffer │      23752 │ 23565 │ 23602 │ 23622 │ 23541 │ 23599 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

3- row size = half of the rows are 1KB and rest is 1MB, # of rows = 1000

┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │  2KB │  4KB │  8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │       3137 │ 2937 │ 2687 │ 2551 │ 2456 │ 2465 │
│ patch     │       2399 │ 2390 │ 2402 │ 2415 │ 2417 │ 2422 │
│ no buffer │       2417 │ 2414 │ 2429 │ 2418 │ 2435 │ 2404 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

4- row size = all rows are 1MB, # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │  2KB │  4KB │  8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │       6113 │ 5764 │ 5281 │ 5009 │ 4885 │ 4872 │
│ patch     │       4759 │ 4754 │ 4754 │ 4758 │ 4782 │ 4805 │
│ no buffer │       4756 │ 4774 │ 4793 │ 4766 │ 4770 │ 4774 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

Some quick observations:

1- Even though I expect both the patch and HEAD behave similarly in case
of small data (case 1: 100 bytes), the patch runs slightly slower than
HEAD.
2- In cases where the data does not fit into the buffer, the patch
starts performing better than HEAD. For example, in case 2, the patch
seems faster until the buffer size exceeds the data length. When the
buffer size is set to something larger than 10KB (16KB/32KB in this
case), there is again a slight performance loss with the patch as in
case 1.
3- With large row sizes (i.e. sizes that do not fit into the buffer) not
buffering at all starts performing better than HEAD. Similarly the patch
performs better too as it stops buffering if data does not fit into the
buffer.

[1] https://www.postgresql.org/message-id/CAGECzQTYUhnC1bO%3DzNiSpUgCs%3DhCYxVHvLD2doXNx3My6ZAC2w%40mail.gmail.com

Thanks,
-- 
Melih Mutlu
Microsoft

From 7f1b72a0948156f8e35ce3b07b5e763a5578d641 Mon Sep 17 00:00:00 2001
From: Melih Mutlu
Date: Mon, 26 Feb 2024 14:41:35 +0300
Subject: [PATCH] Added pq_send_buffer_size GUC

---
 src/backend/libpq/pqcomm.c          |  2 +-
 src/backend/utils/misc/guc_tables.c | 11 +++
 src/include/libpq/libpq.h           |  1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index c606bf3447..92708e46e6 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -116,7 +116,7 @@ static List *sock_paths = NIL;
  * enlarged by pq_putmessage_noblock() if the message doesn't fit otherwise.
  */
-#define PQ_SEND_BUFFER_SIZE 8192
+int PQ_SEND_BUFFER_SIZE = 8192;
 #define PQ_RECV_BUFFER_SIZE 8192
 static ch
Re: Flushing large data immediately in pqcomm
Hi,

On 2024-02-01 15:02:57 -0500, Robert Haas wrote:
> On Thu, Feb 1, 2024 at 10:52 AM Robert Haas wrote:
> There was probably a better way to phrase this email ... the sentiment
> is sincere, but there was almost certainly a way of writing it that
> didn't sound like I'm super-annoyed.

NP - I could have phrased mine better as well...

> > On Wed, Jan 31, 2024 at 10:24 PM Andres Freund wrote:
> > > While not perfect - e.g. because networks might use jumbo packets /
> > > large MTUs and we don't know how many outstanding bytes there are
> > > locally, I think a decent heuristic could be to always try to send
> > > at least one packet worth of data at once (something like ~1400
> > > bytes), even if that requires copying some of the input data. It
> > > might not be sent on its own, but it should make it reasonably
> > > unlikely to end up with tiny tiny packets.
> >
> > I think that COULD be a decent heuristic but I think it should be
> > TESTED, including against the ~3 or so other heuristics proposed on
> > this thread, before we make a decision.
> >
> > I literally mentioned the Ethernet frame size as one of the things
> > that we should test whether it's relevant in the exact email to which
> > you're replying, and you replied by proposing that as a heuristic, but
> > also criticizing me for wanting more research before we settle on
> > something.

I mentioned the frame size thing because afaict nobody in the thread had
mentioned our use of TCP_NODELAY (which basically forces the kernel to
send out data immediately instead of waiting for further data to be
sent). Without that it'd be a lot less problematic to occasionally send
data in small increments in between larger sends. Nor would packet sizes
be as relevant.

> > Are we just supposed to assume that your heuristic is better than the
> > others proposed here without testing anything, or, like, what?
I don't > > think this needs to be a completely exhaustive or exhausting process, but > > I think trying a few different things out and seeing what happens is > > smart. I wasn't trying to say that my heuristic necessarily is better. What I was trying to get at - and expressed badly - was that I doubt that testing can get us all that far here. It's not too hard to test the effects of our buffering with regards to syscall overhead, but once you actually take network effects into account it gets quite hard. Bandwidth, latency, the specific network hardware and operating systems involved all play a significant role. Given how, uh, naive our current approach is, I think analyzing the situation from first principles and then doing some basic validation of the results makes more sense. Separately, I think we shouldn't aim for perfect here. It's obviously extremely inefficient to send a larger amount of data by memcpy()ing and send()ing it in 8kB chunks. As mentioned by several folks upthread, we can improve upon that without having worse behaviour than today. Medium-long term I suspect we're going to want to use asynchronous network interfaces, in combination with zero-copy sending, which requires larger changes. Not that relevant for things like query results, quite relevant for base backups etc. It's perhaps also worth mentioning that the small send buffer isn't great for SSL performance, the encryption overhead increases when sending in small chunks. I hacked up Melih's patch to send the pending data together with the first bit of the large "to be sent" data and also added a patch to increased SINK_BUFFER_LENGTH by 16x. With a 12GB database I tested the time for pg_basebackup -c fast -Ft --compress=none -Xnone -D - -d "$conn" > /dev/null time via test unix tcp tcp+ssl master 6.305s 9.436s 15.596s master-larger-buffer 6.535s 9.453s 15.208s patch 5.900s 7.465s 13.634s patch-larger-buffer5.233s 5.439s 11.730s The increase when using tcp is pretty darn impressive. 
If I had remembered in time to disable manifest checksums, the win would
have been even bigger. The bottleneck for SSL is that it still ends up
with ~16kB sends, not sure why.

Greetings,

Andres Freund
Re: Flushing large data immediately in pqcomm
On Thu, Feb 1, 2024 at 10:52 AM Robert Haas wrote: > On Wed, Jan 31, 2024 at 10:24 PM Andres Freund wrote: > > While not perfect - e.g. because networks might use jumbo packets / large > > MTUs > > and we don't know how many outstanding bytes there are locally, I think a > > decent heuristic could be to always try to send at least one packet worth of > > data at once (something like ~1400 bytes), even if that requires copying > > some > > of the input data. It might not be sent on its own, but it should make it > > reasonably unlikely to end up with tiny tiny packets. > > I think that COULD be a decent heuristic but I think it should be > TESTED, including against the ~3 or so other heuristics proposed on > this thread, before we make a decision. > > I literally mentioned the Ethernet frame size as one of the things > that we should test whether it's relevant in the exact email to which > you're replying, and you replied by proposing that as a heuristic, but > also criticizing me for wanting more research before we settle on > something. Are we just supposed to assume that your heuristic is > better than the others proposed here without testing anything, or, > like, what? I don't think this needs to be a completely exhaustive or > exhausting process, but I think trying a few different things out and > seeing what happens is smart. There was probably a better way to phrase this email ... the sentiment is sincere, but there was almost certainly a way of writing it that didn't sound like I'm super-annoyed. Apologies for that. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Flushing large data immediately in pqcomm
On Wed, Jan 31, 2024 at 10:24 PM Andres Freund wrote: > While not perfect - e.g. because networks might use jumbo packets / large MTUs > and we don't know how many outstanding bytes there are locally, I think a > decent heuristic could be to always try to send at least one packet worth of > data at once (something like ~1400 bytes), even if that requires copying some > of the input data. It might not be sent on its own, but it should make it > reasonably unlikely to end up with tiny tiny packets. I think that COULD be a decent heuristic but I think it should be TESTED, including against the ~3 or so other heuristics proposed on this thread, before we make a decision. I literally mentioned the Ethernet frame size as one of the things that we should test whether it's relevant in the exact email to which you're replying, and you replied by proposing that as a heuristic, but also criticizing me for wanting more research before we settle on something. Are we just supposed to assume that your heuristic is better than the others proposed here without testing anything, or, like, what? I don't think this needs to be a completely exhaustive or exhausting process, but I think trying a few different things out and seeing what happens is smart. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Flushing large data immediately in pqcomm
Hi, On 2024-01-31 14:57:35 -0500, Robert Haas wrote: > > You're right and I'm open to doing more legwork. I'd also appreciate any > > suggestion about how to test this properly and/or useful scenarios to > > test. That would be really helpful. > > I think experimenting to see whether the long-short-long-short > behavior that Heikki postulated emerges in practice would be a really > good start. > > Another experiment that I think would be interesting is: suppose you > create a patch that sends EVERY message without buffering and compare > that to master. My naive expectation would be that this will lose if > you pump short messages through that connection and win if you pump > long messages through that connection. Is that true? If yes, at what > point do we break even on performance? Does it depend on whether the > connection is local or over a network? Does it depend on whether it's > with or without SSL? Does it depend on Linux vs. Windows vs. > whateverBSD? What happens if you twiddle the 8kB buffer size up or, > say, down to just below the Ethernet frame size? I feel like you're putting up a too high bar for something that can be a pretty clear improvement on its own, without a downside. The current behaviour is pretty absurd, doing all this research across all platforms isn't going to disprove that - and it's a lot of work. ISTM we can analyze this without taking concrete hardware into account easily enough. One thing that I haven't seen mentioned here that's relevant around using small buffers: Postgres uses TCP_NODELAY and has to do so. That means doing tiny sends can hurt substantially > I think that what we really want to understand here is under what > circumstances the extra layer of buffering is a win vs. being a loss. It's quite easy to see that doing no buffering isn't viable - we end up with tiny tiny TCP packets, one for each send(). And then there's the syscall overhead. Here's a quickly thrown together benchmark using netperf. 
First with -D, which instructs it to use TCP_NODELAY, as we do.

10gbit network, remote host:

$ (fields="request_size,throughput"; echo "$fields";for i in $(seq 0 16); do s=$((2**$i));netperf -P0 -t TCP_STREAM -l1 -H alap5-10gbe -- -r $s,$s -D 1 -o "$fields";done)|column -t -s,

request_size  throughput
1             22.73
2             45.77
4             108.64
8             225.78
16            560.32
32            1035.61
64            2177.91
128           3604.71
256           5878.93
512           9334.70
1024          9031.13
2048          9405.35
4096          9334.60
8192          9275.33
16384         9406.29
32768         9385.52
65536         9399.40

localhost:

request_size  throughput
1             2.76
2             5.10
4             9.89
8             20.51
16            43.42
32            87.13
64            173.72
128           343.70
256           647.89
512           1328.79
1024          2550.14
2048          4998.06
4096          9482.06
8192          17130.76
16384         29048.02
32768         42106.33
65536         48579.95

I'm slightly baffled by the poor performance of localhost with tiny packet sizes. Ah, I see - it's the NODELAY. Without that:

localhost:

request_size  throughput
1             32.02
2             60.58
4             114.32
8             262.71
16            558.42
32            1053.66
64            2099.39
128           3815.60
256           6566.19
512           11751.79
1024          18976.11
2048          27222.99
4096          33838.07
8192          38219.60
16384         39146.37
32768         44784.98
65536         44214.70

NODELAY triggers many more context switches, because there's immediately data available for the receiving side. Whereas with real network the interrupts get coalesced. I think that's pretty clear evidence that we need buffering. But I think we can probably be smarter than we are right now, and than what's been proposed in the patch. Because of TCP_NODELAY we shouldn't send a tiny buffer on its own, it may trigger sending a small TCP packet, which is quite inefficient. While not perfect - e.g. because networks might use jumbo packets / large MTUs and we don't know how many outstanding bytes there are locally, I think a decent heuristic could be to always try to send at least one packet worth of data at once (something like ~1400 bytes), even if that requires copying some of the input data. It might not be sent on its own, but it should make it reasonably unlikely to end up with tiny tiny packets. Greetings, Andres Freund
Re: Flushing large data immediately in pqcomm
On Wed, Jan 31, 2024 at 2:23 PM Melih Mutlu wrote: >> That seems like it might be a useful refinement of Melih Mutlu's >> original proposal, but consider a message stream that consists of >> messages exactly 8kB in size. If that message stream begins when the >> buffer is empty, all messages are sent directly. If it begins when >> there are any number of bytes in the buffer, we buffer every message >> forever. That's kind of an odd artifact, but maybe it's fine in >> practice. I say again that it's good to test out a bunch of scenarios >> and see what shakes out. > > Isn't this already the case? Imagine sending exactly 8kB messages, the first > pq_putmessage() call will buffer 8kB. Any call after this point simply sends > a 8kB message already buffered from the previous call and buffers a new 8kB > message. Only difference here is we keep the message in the buffer for a > while instead of sending it directly. In theory, the proposed idea should not > bring any difference in the number of flushes and the size of data we send in > each time, but can remove unnecessary copies to the buffer in this case. I > guess the behaviour is also the same with or without the patch in case the > buffer has already some bytes. Yes, it's never worse than today in terms of number of buffer flushes, but it doesn't feel like great behavior, either. Users tend not to like it when the behavior of an algorithm depends heavily on incidental factors that shouldn't really be relevant, like whether the buffer starts with 1 byte in it or 0 at the beginning of a long sequence of messages. They see the performance varying "for no reason" and they dislike it. They don't say "even the bad performance is no worse than earlier versions so it's fine." > You're right and I'm open to doing more legwork. I'd also appreciate any > suggestion about how to test this properly and/or useful scenarios to test. > That would be really helpful. 
I think experimenting to see whether the long-short-long-short behavior that Heikki postulated emerges in practice would be a really good start. Another experiment that I think would be interesting is: suppose you create a patch that sends EVERY message without buffering and compare that to master. My naive expectation would be that this will lose if you pump short messages through that connection and win if you pump long messages through that connection. Is that true? If yes, at what point do we break even on performance? Does it depend on whether the connection is local or over a network? Does it depend on whether it's with or without SSL? Does it depend on Linux vs. Windows vs. whateverBSD? What happens if you twiddle the 8kB buffer size up or, say, down to just below the Ethernet frame size? I think that what we really want to understand here is under what circumstances the extra layer of buffering is a win vs. being a loss. If all the stuff I just mentioned doesn't really matter and the answer is, say, that an 8kB buffer is great and the breakpoint where extra buffering makes sense is also 8kB, and that's consistent regardless of other variables, then your algorithm or Jelte's variant or something of that nature is probably just right. But if it turns out, say, that the extra buffering is only a win for sub-1kB messages, that would be rather nice to know before we finalize the approach. Also, if it turns out that the answer differs dramatically based on whether you're using a UNIX socket or TCP, that would also be nice to know before finalizing an algorithm. > I understand that I should provide more/better analysis around this change to > prove that it doesn't hurt (hopefully) but improves some cases even though > not all the cases. That may even help us to find a better approach than > what's already proposed. Just to clarify, I don't think anyone here suggests > that the bar should be at "if it can't lose relative to today, it's good > enough". 
IMHO "a change that improves some cases, but regresses nowhere" does > not translate to that. Well, I thought those were fairly similar sentiments, so maybe I'm not quite understanding the statement in the way it was meant. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Flushing large data immediately in pqcomm
Robert Haas , 31 Oca 2024 Çar, 20:23 tarihinde şunu yazdı: > On Tue, Jan 30, 2024 at 6:39 PM Jelte Fennema-Nio > wrote: > > I agree that it's hard to prove that such heuristics will always be > > better in practice than the status quo. But I feel like we shouldn't > > let perfect be the enemy of good here. > > Sure, I agree. > > > I one approach that is a clear > > improvement over the status quo is: > > 1. If the buffer is empty AND the data we are trying to send is larger > > than the buffer size, then don't use the buffer. > > 2. If not, fill up the buffer first (just like we do now) then send > > that. And if the left over data is then still larger than the buffer, > > then now the buffer is empty so 1. applies. > > That seems like it might be a useful refinement of Melih Mutlu's > original proposal, but consider a message stream that consists of > messages exactly 8kB in size. If that message stream begins when the > buffer is empty, all messages are sent directly. If it begins when > there are any number of bytes in the buffer, we buffer every message > forever. That's kind of an odd artifact, but maybe it's fine in > practice. I say again that it's good to test out a bunch of scenarios > and see what shakes out. > Isn't this already the case? Imagine sending exactly 8kB messages, the first pq_putmessage() call will buffer 8kB. Any call after this point simply sends a 8kB message already buffered from the previous call and buffers a new 8kB message. Only difference here is we keep the message in the buffer for a while instead of sending it directly. In theory, the proposed idea should not bring any difference in the number of flushes and the size of data we send in each time, but can remove unnecessary copies to the buffer in this case. I guess the behaviour is also the same with or without the patch in case the buffer has already some bytes. 
Robert Haas , 31 Oca 2024 Çar, 21:28 tarihinde şunu yazdı: > Personally, I don't think it's likely that anything will get committed > here without someone doing more legwork than I've seen on the thread > so far. I don't have any plan to pick up this patch anyway, but if I > were thinking about it, I would abandon the idea unless I were > prepared to go test a bunch of stuff myself. I agree with the core > idea of this work, but not with the idea that the bar is as low as "if > it can't lose relative to today, it's good enough." > You're right and I'm open to doing more legwork. I'd also appreciate any suggestion about how to test this properly and/or useful scenarios to test. That would be really helpful. I understand that I should provide more/better analysis around this change to prove that it doesn't hurt (hopefully) but improves some cases even though not all the cases. That may even help us to find a better approach than what's already proposed. Just to clarify, I don't think anyone here suggests that the bar should be at "if it can't lose relative to today, it's good enough". IMHO "a change that improves some cases, but regresses nowhere" does not translate to that. Thanks, -- Melih Mutlu Microsoft
Re: Flushing large data immediately in pqcomm
On Wed, Jan 31, 2024 at 12:49 PM Jelte Fennema-Nio wrote: > Testing a bunch of scenarios to find a good one sounds like a good > idea, which can probably give us a more optimal heuristic. But it also > sounds like a lot of work, and probably results in a lot of > discussion. That extra effort might mean that we're not going to > commit any change for PG17 (or even at all). If so, then I'd rather > have a modest improvement from my refinement of Melih's proposal, than > none at all. Personally, I don't think it's likely that anything will get committed here without someone doing more legwork than I've seen on the thread so far. I don't have any plan to pick up this patch anyway, but if I were thinking about it, I would abandon the idea unless I were prepared to go test a bunch of stuff myself. I agree with the core idea of this work, but not with the idea that the bar is as low as "if it can't lose relative to today, it's good enough." Of course, another committer may see it differently. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Flushing large data immediately in pqcomm
On Wed, 31 Jan 2024 at 18:23, Robert Haas wrote: > That's kind of an odd artifact, but maybe it's fine in > practice. I agree it's an odd artifact, but it's not a regression over the status quo. Achieving that was the intent of my suggestion: A change that improves some cases, but regresses nowhere. > I say again that it's good to test out a bunch of scenarios > and see what shakes out. Testing a bunch of scenarios to find a good one sounds like a good idea, which can probably give us a more optimal heuristic. But it also sounds like a lot of work, and probably results in a lot of discussion. That extra effort might mean that we're not going to commit any change for PG17 (or even at all). If so, then I'd rather have a modest improvement from my refinement of Melih's proposal, than none at all.
Re: Flushing large data immediately in pqcomm
On Tue, Jan 30, 2024 at 6:39 PM Jelte Fennema-Nio wrote: > I agree that it's hard to prove that such heuristics will always be > better in practice than the status quo. But I feel like we shouldn't > let perfect be the enemy of good here. Sure, I agree. > I think one approach that is a clear > improvement over the status quo is: > 1. If the buffer is empty AND the data we are trying to send is larger > than the buffer size, then don't use the buffer. > 2. If not, fill up the buffer first (just like we do now) then send > that. And if the left over data is then still larger than the buffer, > then now the buffer is empty so 1. applies. That seems like it might be a useful refinement of Melih Mutlu's original proposal, but consider a message stream that consists of messages exactly 8kB in size. If that message stream begins when the buffer is empty, all messages are sent directly. If it begins when there are any number of bytes in the buffer, we buffer every message forever. That's kind of an odd artifact, but maybe it's fine in practice. I say again that it's good to test out a bunch of scenarios and see what shakes out. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Flushing large data immediately in pqcomm
On Tue, 30 Jan 2024 at 19:48, Robert Haas wrote: > > On Tue, Jan 30, 2024 at 12:58 PM Melih Mutlu wrote: > > Sounds like it's difficult to come up with a heuristic that would work well > > enough for most cases. > > One thing with sending data instead of copying it if the buffer is empty is > > that initially the buffer is empty. I believe it will stay empty forever if > > we do not copy anything when the buffer is empty. We can maybe simply set > > the threshold to the buffer size/2 (4kB) and hope that will work better. Or > > copy the data only if it fits into the remaining space in the buffer. What > > do you think? > > > > An additional note while I mentioned pq_putmessage_noblock(), I've been > > testing sending input data immediately in pq_putmessage_noblock() without > > blocking and copy the data into PqSendBuffer only if the socket would block > > and cannot send it. Unfortunately, I don't have strong numbers to > > demonstrate any improvement in perf or timing yet. But I still like to know > > what would you think about it? > > I think this is an area where it's very difficult to foresee on > theoretical grounds what will be right in practice I agree that it's hard to prove that such heuristics will always be better in practice than the status quo. But I feel like we shouldn't let perfect be the enemy of good here. I think one approach that is a clear improvement over the status quo is: 1. If the buffer is empty AND the data we are trying to send is larger than the buffer size, then don't use the buffer. 2. If not, fill up the buffer first (just like we do now) then send that. And if the left over data is then still larger than the buffer, then now the buffer is empty so 1. applies.
Re: Flushing large data immediately in pqcomm
On Tue, Jan 30, 2024 at 12:58 PM Melih Mutlu wrote: > Sounds like it's difficult to come up with a heuristic that would work well > enough for most cases. > One thing with sending data instead of copying it if the buffer is empty is > that initially the buffer is empty. I believe it will stay empty forever if > we do not copy anything when the buffer is empty. We can maybe simply set the > threshold to the buffer size/2 (4kB) and hope that will work better. Or copy > the data only if it fits into the remaining space in the buffer. What do you > think? > > An additional note while I mentioned pq_putmessage_noblock(), I've been > testing sending input data immediately in pq_putmessage_noblock() without > blocking and copy the data into PqSendBuffer only if the socket would block > and cannot send it. Unfortunately, I don't have strong numbers to demonstrate > any improvement in perf or timing yet. But I still like to know what would > you think about it? I think this is an area where it's very difficult to foresee on theoretical grounds what will be right in practice. The problem is that the best algorithm probably depends on what usage patterns are common in practice. I think one common usage pattern will be a bunch of roughly equal-sized messages in a row, like CopyData or DataRow messages -- but those messages won't have a consistent width. It would probably be worth testing what behavior you see in such cases -- start with say a stream of 100 byte messages and then gradually increase and see how the behavior evolves. But you can also have other patterns, with messages of different sizes interleaved. In the case of FE-to-BE traffic, the extended query protocol might be a good example of that: the Parse message could be quite long, or not, but the Bind Describe Execute Sync messages that follow are probably all short. That case doesn't arise in this direction, but I can't think exactly of what cases that do. 
It seems like someone would need to play around and try some different cases and maybe log the sizes of the secure_write() calls with various algorithms, and then try to figure out what's best. For example, if the alternating short-write, long-write behavior that Heikki mentioned is happening, and I do think that particular thing is a very real risk, then you haven't got it figured out yet... -- Robert Haas EDB: http://www.enterprisedb.com
Re: Flushing large data immediately in pqcomm
Hi Robert, Robert Haas , 29 Oca 2024 Pzt, 20:48 tarihinde şunu yazdı: > > If there's already some data in PqSendBuffer, I wonder if it would be > > better to fill it up with data, flush it, and then send the rest of the > > data directly. Instead of flushing the partial data first. I'm afraid > > that you'll make a tiny call to secure_write(), followed by a large one, > > then a tiny one again, and so forth. Especially when socket_putmessage > > itself writes the msgtype and len, which are tiny, before the payload. > > > > Perhaps we should invent a new pq_putmessage() function that would take > > an input buffer with 5 bytes of space reserved before the payload. > > pq_putmessage() could then fill in the msgtype and len bytes in the > > input buffer and send that directly. (Not wedded to that particular API, > > but something that would have the same effect) > > I share the concern; I'm not sure about the best solution. I wonder if > it would be useful to have pq_putmessagev() in the style of writev() > et al. Or maybe what we need is secure_writev(). > I thought about using writev() for not only pq_putmessage() but pq_putmessage_noblock() too. Currently, pq_putmessage_noblock() repallocs PqSendBuffer and copies input buffer, which can easily be larger than 8kB, into PqSendBuffer. I also discussed it with Thomas off-list. The thing is that I believe we would need secure_writev() with SSL/GSS cases handled properly. I'm just not sure if the effort would be worthwhile considering what we gain from it. > I also wonder if the threshold for sending data directly should be > smaller than the buffer size, and/or whether it should depend on the > buffer being empty. You might be right. I'm not sure what the ideal threshold would be. > If we have an 8kB buffer that currently has > nothing in it, and somebody writes 2kB, I suspect it might be wrong to > copy that into the buffer. If the same buffer had 5kB used and 3kB > free, copying sounds a lot more likely to work out. 
The goal here is > probably to condense sequences of short messages into a single > transmission while sending long messages individually. I'm just not > quite sure what heuristic would do that most effectively. > Sounds like it's difficult to come up with a heuristic that would work well enough for most cases. One thing with sending data instead of copying it if the buffer is empty is that initially the buffer is empty. I believe it will stay empty forever if we do not copy anything when the buffer is empty. We can maybe simply set the threshold to the buffer size/2 (4kB) and hope that will work better. Or copy the data only if it fits into the remaining space in the buffer. What do you think? An additional note while I mentioned pq_putmessage_noblock(), I've been testing sending input data immediately in pq_putmessage_noblock() without blocking and copy the data into PqSendBuffer only if the socket would block and cannot send it. Unfortunately, I don't have strong numbers to demonstrate any improvement in perf or timing yet. But I still like to know what would you think about it? Thanks, -- Melih Mutlu Microsoft
Re: Flushing large data immediately in pqcomm
Hi Heikki, Heikki Linnakangas , 29 Oca 2024 Pzt, 19:12 tarihinde şunu yazdı: > > Proposed change modifies socket_putmessage to send any data larger than > > 8K immediately without copying it into the send buffer. Assuming that > > the send buffer would be flushed anyway due to reaching its limit, the > > patch just gets rid of the copy part which seems unnecessary and sends > > data without waiting. > > If there's already some data in PqSendBuffer, I wonder if it would be > better to fill it up with data, flush it, and then send the rest of the > data directly. Instead of flushing the partial data first. I'm afraid > that you'll make a tiny call to secure_write(), followed by a large one, > then a tiny one again, and so forth. Especially when socket_putmessage > itself writes the msgtype and len, which are tiny, before the payload. > I agree that I could do better there without flushing twice for both PqSendBuffer and input data. PqSendBuffer always has some data, even if it's tiny, since msgtype and len are added. > Perhaps we should invent a new pq_putmessage() function that would take > an input buffer with 5 bytes of space reserved before the payload. > pq_putmessage() could then fill in the msgtype and len bytes in the > input buffer and send that directly. (Not wedded to that particular API, > but something that would have the same effect) > I thought about doing this. The reason why I didn't was because I think that such a change would require adjusting all input buffers wherever pq_putmessage is called, and I did not want to touch that many different places. These places where we need pq_putmessage might not be that many though, I'm not sure. > > > This change affects places where pq_putmessage is used such as > > pg_basebackup, COPY TO, walsender etc. > > > > I did some experiments to see how the patch performs. > > Firstly, I loaded ~5GB data into a table [1], then ran "COPY test TO > > STDOUT". Here are perf results of both the patch and HEAD > ... 
> > The patch brings a ~5% gain in socket_putmessage. > > > > [1] > > CREATE TABLE test(id int, name text, time TIMESTAMP); > > INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100) > > AS name, NOW() AS time FROM generate_series(1, 1) AS i; > > I'm surprised by these results, because each row in that table is < 600 > bytes. PqSendBufferSize is 8kB, so the optimization shouldn't kick in in > that test. Am I missing something? > You're absolutely right. I made a silly mistake there. I also think that the way I did perf analysis does not make much sense, even if one row of the table is greater than 8kB. Here are some quick timing results after being sure that it triggers this patch's optimization. I need to think more on how to profile this with perf. I hope to share proper results soon. I just added a bit more zeros [1] and ran [2] (hopefully measured the correct thing)

HEAD:
real	2m48,938s
user	0m9,226s
sys	1m35,342s

Patch:
real	2m40,690s
user	0m8,492s
sys	1m31,001s

[1] INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 1) AS name, NOW() AS time FROM generate_series(1, 100) AS i; [2] rm /tmp/dummy && echo 3 | sudo tee /proc/sys/vm/drop_caches && time psql -d postgres -c "COPY test TO STDOUT;" > /tmp/dummy Thanks, -- Melih Mutlu Microsoft
Re: Flushing large data immediately in pqcomm
On Mon, Jan 29, 2024 at 11:12 AM Heikki Linnakangas wrote: > Agreed, that's silly. +1. > If there's already some data in PqSendBuffer, I wonder if it would be > better to fill it up with data, flush it, and then send the rest of the > data directly. Instead of flushing the partial data first. I'm afraid > that you'll make a tiny call to secure_write(), followed by a large one, > then a tiny one again, and so forth. Especially when socket_putmessage > itself writes the msgtype and len, which are tiny, before the payload. > > Perhaps we should invent a new pq_putmessage() function that would take > an input buffer with 5 bytes of space reserved before the payload. > pq_putmessage() could then fill in the msgtype and len bytes in the > input buffer and send that directly. (Not wedded to that particular API, > but something that would have the same effect) I share the concern; I'm not sure about the best solution. I wonder if it would be useful to have pq_putmessagev() in the style of writev() et al. Or maybe what we need is secure_writev(). I also wonder if the threshold for sending data directly should be smaller than the buffer size, and/or whether it should depend on the buffer being empty. If we have an 8kB buffer that currently has nothing in it, and somebody writes 2kB, I suspect it might be wrong to copy that into the buffer. If the same buffer had 5kB used and 3kB free, copying sounds a lot more likely to work out. The goal here is probably to condense sequences of short messages into a single transmission while sending long messages individually. I'm just not quite sure what heuristic would do that most effectively. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Flushing large data immediately in pqcomm
On 20/11/2023 14:21, Melih Mutlu wrote: Hi hackers I've been looking into ways to reduce the overhead we're having in pqcomm and I'd like to propose a small patch to modify how socket_putmessage works. Currently socket_putmessage copies any input data into the pqcomm send buffer (PqSendBuffer) and the size of this buffer is 8K. When the send buffer gets full, it's flushed and we continue to copy more data into the send buffer until we have no data left to be sent. Since the send buffer is flushed whenever it's full, I think we are safe to say that if the size of input data is larger than the buffer size, which is 8K, then the buffer will be flushed at least once (or more, depends on the input size) to store all the input data. Agreed, that's silly. Proposed change modifies socket_putmessage to send any data larger than 8K immediately without copying it into the send buffer. Assuming that the send buffer would be flushed anyway due to reaching its limit, the patch just gets rid of the copy part which seems unnecessary and sends data without waiting. If there's already some data in PqSendBuffer, I wonder if it would be better to fill it up with data, flush it, and then send the rest of the data directly. Instead of flushing the partial data first. I'm afraid that you'll make a tiny call to secure_write(), followed by a large one, then a tiny one again, and so forth. Especially when socket_putmessage itself writes the msgtype and len, which are tiny, before the payload. Perhaps we should invent a new pq_putmessage() function that would take an input buffer with 5 bytes of space reserved before the payload. pq_putmessage() could then fill in the msgtype and len bytes in the input buffer and send that directly. (Not wedded to that particular API, but something that would have the same effect) This change affects places where pq_putmessage is used such as pg_basebackup, COPY TO, walsender etc. I did some experiments to see how the patch performs. 
Firstly, I loaded ~5GB data into a table [1], then ran "COPY test TO STDOUT". Here are perf results of both the patch and HEAD > ... The patch brings a ~5% gain in socket_putmessage. [1] CREATE TABLE test(id int, name text, time TIMESTAMP); INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100) AS name, NOW() AS time FROM generate_series(1, 1) AS i; I'm surprised by these results, because each row in that table is < 600 bytes. PqSendBufferSize is 8kB, so the optimization shouldn't kick in in that test. Am I missing something? -- Heikki Linnakangas Neon (https://neon.tech)