Re: Segfault in jit tuple deforming on arm64 due to LLVM issue
I created a commitfest entry[1] to have the CI test the patch. There was a
failure in headerscheck and cpluspluscheck when the include of
SectionMemoryManager.h is checked[2]:

In file included from /usr/include/llvm/ADT/SmallVector.h:18,
                 from /tmp/cirrus-ci-build/src/include/jit/SectionMemoryManager.h:23,
                 from /tmp/headerscheck.4b1i5C/test.c:2:
/usr/include/llvm/Support/type_traits.h:17:10: fatal error: type_traits: No such file or directory
   17 | #include <type_traits>

Since SmallVector.h includes type_traits, this file can't be compiled with
a C compiler, so I've just excluded it from headerscheck.

Loosely related to headerscheck, running it locally was failing as it
couldn't find the file. This is because headerscheck expects LLVM include
files to be in /usr/include and doesn't rely on llvm-config. I created a
second patch to use the LLVM_CPPFLAGS as extra flags when testing the
src/include/jit/* files.

Lastly, I've used www.github.com instead of github.com links to stop
spamming the llvm-project's PR with a reference to the commit every time it
is pushed somewhere (which seems to be the unofficial hack[3]).

[1] https://commitfest.postgresql.org/49/5220/
[2] https://cirrus-ci.com/task/4646639124611072?logs=headers_headerscheck#L42-L46
[3] https://github.com/orgs/community/discussions/23123#discussioncomment-3239240

v6-0002-Add-LLVM_CPPFLAGS-in-headerscheck-to-llvm-jit-fil.patch
Description: Binary data

v6-0001-Backport-of-LLVM-code-to-fix-ARM-relocation-bug.patch
Description: Binary data
Re: Set query_id for query contained in utility statement
On Tue, Aug 27, 2024 at 11:14 AM jian he wrote:
> also it's ok to use passed (ParseState *pstate) for
> {
> estate = CreateExecutorState();
> estate->es_param_list_info = params;
> paramLI = EvaluateParams(pstate, entry, execstmt->params, estate);
> }
> ?
> I really don't know.
>
> some of the change is refactoring, maybe you can put it into a separate patch.

Thanks for the review. I think the parser state is mostly used for the
error callbacks and parser_errposition, but I'm not 100% sure. Either way,
you're right and it probably shouldn't be in the patch. I've modified the
patch to restrict the changes to only add the necessary query jumble and
post parse hook calls.

v4-0001-Set-query_id-for-queries-contained-in-utility-sta.patch
Description: Binary data
Re: Segfault in jit tuple deforming on arm64 due to LLVM issue
On Wed, Aug 28, 2024 at 12:24 AM Thomas Munro wrote:
> 2. I tested against LLVM 10-18, and found that 10 and 11 lack some
> needed symbols. So I just hid this code from them. Even though our
> stable branches support those and even older versions, I am not sure
> if it's worth trying to do something about that for EOL'd distros that
> no one has ever complained about. I am willing to try harder if
> someone thinks that's important...

I would also assume that people using arm64 are more likely to use recent
versions than not.

I've done some additional tests on different LLVM versions with both the
unpatched version (to make sure the crash was triggered) and the patched
version. I'm attaching the test scripts I've used as reference. They
target a kubernetes pod since it was the easiest way for me to get a test
ubuntu Jammy:
- setup_pod.sh: Install the necessary packages, get multiple LLVM
  versions, fetch and compile master and patched versions of postgres on
  the different LLVM versions
- run_test.sh: go through all LLVM versions for both unpatched and patched
  postgres to run test_script.sh
- test_script.sh: ran inside the pod to set up the db with the necessary
  tables and check if the crash happens

This generated the following output:

Test unpatched version on LLVM 19, : Crash triggered
Test unpatched version on LLVM 18, libLLVM-18.so.18.1: Crash triggered
Test unpatched version on LLVM 17, libLLVM-17.so.1: Crash triggered
Test unpatched version on LLVM 16, libLLVM-16.so.1: Crash triggered
Test unpatched version on LLVM 15, libLLVM-15.so.1: Crash triggered
Test unpatched version on LLVM 14, libLLVM-14.so.1: Crash triggered
Test unpatched version on LLVM 13, libLLVM-13.so.1: Crash triggered
Test patched version on LLVM 19, : Query ran successfully
Test patched version on LLVM 18, libLLVM-18.so.18.1: Query ran successfully
Test patched version on LLVM 17, libLLVM-17.so.1: Query ran successfully
Test patched version on LLVM 16, libLLVM-16.so.1: Query ran successfully
Test patched version on LLVM 15, libLLVM-15.so.1: Query ran successfully
Test patched version on LLVM 14, libLLVM-14.so.1: Query ran successfully
Test patched version on LLVM 13, libLLVM-13.so.1: Query ran successfully

I print the libLLVM library linked to llvmjit.so in the output to double
check whether I'm testing on the correct version. The LLVM 19 package only
provides static libraries (probably because it's still a release
candidate?), so it shows as empty in the output. There was no LLVM 12
available when using the llvm.sh script, so I couldn't test it. As for the
results, all pre-patch versions crashed as expected while the patched
versions were able to run the query successfully.

> Next, I think we should wait to see if the LLVM project commits that
> PR, this so that we can sync with their 19.x stable branch, instead of
> using code from a PR. Our next minor release is in November, so we
> have some time. If they don't commit it, we can consider it anyway: I
> mean, it's crashing all over the place in production, and we see that
> other projects are shipping this code already.

The PR[1] just received an approval and it sounds like they are ok to
eventually merge it.
[1] https://github.com/llvm/llvm-project/pull/71968

#!/bin/bash
# Don't want coredumps
ulimit -c 0

# Stop running PG
/var/lib/postgresql/.local/bin/pg_ctl -D /var/lib/postgresql/db_data stop &>> /dev/null
# Clean db_data
rm -rf /var/lib/postgresql/db_data
# Create new dbdata
/var/lib/postgresql/.local/bin/initdb -D /var/lib/postgresql/db_data -Apeer > /dev/null
echo "port = 5432
shared_buffers='4GB'
" > /var/lib/postgresql/db_data/postgresql.conf

# Start and setup test partitioned table
/var/lib/postgresql/.local/bin/pg_ctl -D /var/lib/postgresql/db_data start -l /tmp/pg.log > /dev/null
/var/lib/postgresql/.local/bin/pgbench -i --partitions=128 2> /dev/null

# Run the query to trigger the issue
/var/lib/postgresql/.local/bin/psql options=-cjit_above_cost=0 -c 'SELECT count(bid) from pgbench_accounts;' &>> /dev/null
exit_code=$?
if (( $exit_code == 2 )); then
    echo "Crash triggered"
elif (( $exit_code == 0 )); then
    echo "Query ran successfully"
else
    echo "Other issue while running the query"
fi

#!/bin/bash
set -eu

TEST_POD=$(kubectl get pod -l app="test_pg" -nsre -o=jsonpath='{.items[*].metadata.name}')
LLVM_VERSIONS=(19 18 17 16 15 14 13)

get_llvm_version() {
    kubectl exec "$pod" -- bash -c "ldd /var/lib/postgresql/.local/lib/llvmjit.so | grep libLLVM | awk '{print \$1}'"
}

run_test_on_pod() {
    local pod="$1"
    kubectl cp test_script.sh "$pod":/var/lib/postgresql/
    for version in ${LLVM_VERSIONS[@]}; do
        build_dir="~/postgres_build/build_${version}"
        kubectl exec "$pod" -- bash -c "su postgres -c \"
            cd $build_dir;
            make -j19 install &>> /dev/null
        \" "
        echo -n "Test unpatched version on LLVM $version, $(get_llvm_version): "
        kubectl exec "$pod" -- bash -c "su postgres -c \"
            cd;
            bash test_script.sh
        \" "
    done
    for version in
Re: Segfault in jit tuple deforming on arm64 due to LLVM issue
On Tue, Aug 27, 2024 at 12:01 PM Thomas Munro wrote:
> Thanks! And that's great news. Do you want to report this experience
> to the PR, in support of committing it? That'd make it seem easier to
> consider shipping a back-ported copy...

Yes, I will do that.
Re: Segfault in jit tuple deforming on arm64 due to LLVM issue
On Tue, Aug 27, 2024 at 1:33 AM Thomas Munro wrote:
> I am sure this requires changes for various LLVM versions. I tested
> it with LLVM 14 on a Mac where I've never managed to reproduce the
> original complaint, but ... this might be exacerbated by ASLR,
> and macOS only has a small ASLR slide window (16M or 256M apparently,
> according to me in another thread), so I'd probably have to interpose
> my own mmap() to choose some more interesting addresses, or run some
> other OS, but that's quite enough rabbit holes for one morning.

I've tested the patch. I had to make sure the issue was triggered on
master first. The issue didn't happen with 4GB shared_buffers and 64
partitions; however, increasing to 6GB and 128 partitions triggered it.
The architecture check in the patch was incorrect (__arch64__ instead of
__aarch64__, glad to see I'm not the only one confused by aarch64 and
arm64 :)) but once fixed, it worked and avoided the segfault.

I've run some additional tests with different parameters:
- I've tried disabling randomize_va_space; the issue still happened even
  with ASLR disabled.
- I've tested different PG versions. With 14 and 15, 4GB and 64 partitions
  were enough. Starting with PG 16, I had to increase shared_buffers to
  6GB and partitions to 128. I've been able to trigger the issue on all
  versions from 14 to master (which was expected, but I wanted
  confirmation).
- I haven't been able to reproduce this on macOS either. I've tried
  removing the MemGroup.Near hint so mmap addresses would be more random
  and played with different shared_buffers and partition values, without
  success.

I've modified the patch with 3 changes:
- meson.build was using the SectionMemoryManager.cpp file name; I've
  replaced it with SafeSectionMemoryManager.cpp.
- Use __aarch64__ instead of __arch64__.
- Moved the architecture switch to llvm_create_object_layer and go through
  the normal LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager
  on non-arm64 architectures. There's no need to use the custom memory
  manager on non-arm64, so it looked better to avoid it entirely when
  there's no need for the reserve allocation.

v3-0001-XXX-LLVM-ARM-relocation-bug-mitigation.patch
Description: Binary data
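For the archives, the third change boils down to something like this (a
sketch: the function shape matches llvm_create_object_layer() in llvmjit.c,
and the "Safe" wrapper name is the one introduced by the attached patch,
shown here purely for illustration):

/*
 * Sketch of the architecture switch described above. Only arm64 needs
 * the backported memory manager with its reserve allocation.
 */
static LLVMOrcObjectLayerRef
llvm_create_object_layer(void *Ctx, LLVMOrcExecutionSessionRef ES,
                         const char *Triple)
{
#if defined(__aarch64__)
    /* custom manager reserves one block, keeping sections within range */
    return LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(ES);
#else
    /* stock memory manager is fine when ADRP range is not a concern */
    return LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager(ES);
#endif
}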
Re: Segfault in jit tuple deforming on arm64 due to LLVM issue
On Mon, Aug 26, 2024 at 4:33 AM Thomas Munro wrote:
> IIUC this one is a random and rare crash depending on malloc() and
> perhaps also the working size of your virtual memory dart board.
> (Annoyingly, I had tried to reproduce this quite a few times on small ARM
> systems when earlier reports came in, d'oh!).

allocateMappedMemory, used when creating sections, will eventually call
mmap[1], not malloc. So the amount of shared memory configured may be a
factor in triggering the issue. My first attempts to reproduce the issue
from scratch weren't successful either. However, trying again with
different values of shared_buffers, I've managed to trigger the issue
somewhat reliably.

On a clean Ubuntu Jammy, I've compiled the current PostgreSQL
REL_14_STABLE (6bc2bfc3) with the following options:

CLANG=clang-14 ../configure --enable-cassert --enable-debug --prefix ~/.local/ --with-llvm

Set "shared_buffers = '4GB'" in the configuration. More may be needed, but
4GB was enough for me. Create a table with multiple partitions with
pgbench; the goal is to have a jit module big enough to trigger the issue:

pgbench -i --partitions=64

Then run the following query with jit forcefully enabled:

psql options=-cjit_above_cost=0 -c 'SELECT count(bid) from pgbench_accounts;'

If the issue was successfully triggered, it should segfault or be stuck in
an infinite loop.

> Ultimately, if it doesn't work, and doesn't get fixed, it's hard for
> us to do much about it. But hmm, this is probably madness... I wonder
> if it would be feasible to detect address span overflow ourselves at a
> useful time, as a kind of band-aid defence...

There's a possible alternative, but it's definitely in the same category
as the hot-patching idea. llvmjit uses
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager to create
the ObjectLinkingLayer, and it will be created with the default
SectionMemoryManager[2]. It should be possible to provide a modified
SectionMemoryManager with a change to allocate sections in a single block,
and it could be restricted to the arm64 architecture. A part of me tells
me this is probably a bad idea, but on the other hand, LLVM provides this
way to plug a custom allocator and it would fix the issue...

[1] https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/Support/Unix/Memory.inc#L115-L117
[2] https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/Orc/OrcV2CBindings.cpp#L967-L973
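To illustrate the "single memory block" idea with a self-contained sketch
(my own toy code, not LLVM's): if every section is carved out of one
reserved mapping, the kernel never gets a chance to place them far apart.

/*
 * Toy version of the single-block allocation pattern: reserve one
 * contiguous mapping up front and slice code/rodata out of it, so no
 * two sections can drift more than the mapping size apart.
 */
#include <stdio.h>
#include <sys/mman.h>

int
main(void)
{
    size_t code_sz = 4096, rodata_sz = 4096;
    char *base = mmap(NULL, code_sz + rodata_sz,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (base == MAP_FAILED)
        return 1;
    /* both sections share the reservation: gap is bounded by design */
    char *code = base;
    char *rodata = base + code_sz;

    printf("code-to-rodata distance: %td bytes\n", rodata - code);
    /* a real allocator would then mprotect() the code part PROT_EXEC */
    munmap(base, code_sz + rodata_sz);
    return 0;
}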
Re: Set query_id for query contained in utility statement
On Mon, Aug 26, 2024 at 5:26 AM jian he wrote:
>        queryid        |                        left                         | plans | calls | rows
> ----------------------+-----------------------------------------------------+-------+-------+------
>   2800308901962295548 | EXPLAIN (verbose) create table test_t as select 1; |     2 |     2 |    0
>   2093602470903273926 | EXPLAIN (verbose) create table test_t as select $1 |     0 |     2 |    0
>
> "EXPLAIN (verbose) create table test_t as select 1;" called twice,
> is that what we expect?

pg_stat_statements reports nested queries and toplevel queries separately;
you can check with the toplevel column. 2800308901962295548 is the nested
ctas part while 2093602470903273926 is the top explain utility statement
(which explains why there are 0 plans, since it's a utility statement).
Since the explain ctas query was called twice, it will be reported as 2
toplevel queries and 2 nested queries.

> should first row, the normalized query becomes
> EXPLAIN (verbose) create table test_t as select $1;

Good point. The issue in this case was that the nested query was stored by
pg_stat_statements during the ExecutorEnd hook. This hook doesn't have the
jstate and parse state at that point, so pg_stat_statements can't
normalize the query.

I've modified the patch to fix this. Instead of just jumbling the query in
ExplainQuery, I've moved the jumbling to ExplainOneUtility, which already
has specific code to handle ctas and dcs. Calling the post parse hook
there allows pg_stat_statements to store the normalized version of the
query for this queryid and nesting level.

v3-0001-Set-query_id-for-queries-contained-in-utility-sta.patch
Description: Binary data
Re: Segfault in jit tuple deforming on arm64 due to LLVM issue
On Thu, Aug 22, 2024 at 12:33 PM Thomas Munro wrote:
> I fear that back-porting, for the LLVM project, would mean "we fix it
> in main/20.x, and also back-port it to 19.x". Do distros back-port
> further?

That's also my fear. I'm not familiar with distros' back-port policies,
but eyeballing the ubuntu package changelog[1], it seems to be mostly
build fixes.

Given that there's no visible way to fix the relocation issue, I wonder if
jit shouldn't be disabled for arm64 until either the RuntimeDyld fix is
merged or the switch to JITLink is done. Disabling jit tuple deforming may
be enough, but I'm not confident the issue won't happen in a different
part.

[1] https://launchpad.net/ubuntu/+source/llvm-toolchain-16/+changelog
Segfault in jit tuple deforming on arm64 due to LLVM issue
When a new section is allocated, LLVM will request a new memory block from
the memory allocator[4]. The MemGroup.Near is passed as the start hint of
mmap, but that's only a hint and the kernel doesn't provide any guarantee
that the newly allocated block will be near. With the impacted query,
there are more than 10GB of gap between the .text section and the .rodata
section, making it impossible for the code in the .text section to
correctly fetch data from the .rodata section, as the address in ADRP is
limited to a +/-4GB range. The ABI mentions this constraint: the GOT
section should be within 4GB of the text section[5]. In this case there's
no GOT section, as the offsets are stored in the .rodata section, but the
constraint is going to be similar.

This is a known LLVM issue[6] that impacted Impala, Numba and Julia.
There's an open PR[7] to fix the issue by allocating all sections as a
single memory block, avoiding the gaps between sections. There's also a
related discussion on the llvm-rtdyld discourse[8].

A possible mitigation is to switch from RuntimeDyld to JITLink, but this
requires at least LLVM 15 as LLVM 14 doesn't have any significant
relocation support for aarch64[9]. I did test using JITLink on my impacted
db and it seems to fix the issue. JITLink has no exposed C interface
though, so it requires additional wrapping.

I don't necessarily have a good answer for this issue. I've tried to tweak
relocation settings and the jit code to avoid relocation, without much
success. Ideally, the LLVM fix will be merged and backported, but the PR
has been open for some time now. I've seen multiple segfault reports that
look similar to this issue (for example [10] and [11]), but I don't think
they were linked to the LLVM bug, so I figured I would at least share my
findings.

[1] https://github.com/postgres/postgres/blob/REL_14_STABLE/src/backend/jit/llvm/llvmjit_deform.c#L364-L382
[2] https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp#L501-L513
[3] https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/SectionMemoryManager.cpp#L41C32-L41C47
[4] https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/SectionMemoryManager.cpp#L94-L110
[5] https://github.com/ARM-software/abi-aa/blob/main/sysvabi64/sysvabi64.rst#7code-models
[6] https://github.com/llvm/llvm-project/issues/71963
[7] https://github.com/llvm/llvm-project/pull/71968
[8] https://discourse.llvm.org/t/llvm-rtdyld-aarch64-abi-relocation-restrictions/74616
[9] https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/JITLink/ELF_aarch64.cpp#L75-L84
[10] https://www.postgresql.org/message-id/flat/CABa%2BnRvwZy_5t1QF9NJNGwAf03tv_PO_Sg1FsN1%2B-3Odb1XgBA%40mail.gmail.com
[11] https://www.postgresql.org/message-id/flat/CADAf1kavcN-kY%3DvEm3MYxhUa%2BrtGFs7tym5d7Ee6Ni2cwwxGqQ%40mail.gmail.com

Regards,
Anthonin Bonnefoy
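To make the ADRP constraint concrete, here is a small self-contained
sketch of the range check (my own illustration of the rule in [5], not
LLVM code): the 21-bit signed page immediate covers [-2^20, 2^20 - 1] 4KB
pages, i.e. a window of roughly +/-4GB around the relocation site.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* can an ADRP at 'place' address the page containing 'target'? */
static bool
adrp_in_range(uint64_t place, uint64_t target)
{
    int64_t pages = (int64_t) (target >> 12) - (int64_t) (place >> 12);

    return pages >= -(INT64_C(1) << 20) && pages <= (INT64_C(1) << 20) - 1;
}

int
main(void)
{
    uint64_t text = UINT64_C(0x0000ffff90000000);        /* .text page   */
    uint64_t near = text + (UINT64_C(1) << 30);          /* +1GB: fine   */
    uint64_t far = text + (UINT64_C(10) << 30);          /* +10GB: broken */

    printf("+1GB reachable: %d\n", adrp_in_range(text, near));
    printf("+10GB reachable: %d\n", adrp_in_range(text, far));
    return 0;
}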
Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bind
On Wed, Aug 21, 2024 at 12:00 AM Jelte Fennema-Nio wrote:
> @Anthonin are you planning to update the patch accordingly?

Here's the patch with \bindx renamed to \bind_named. I've also made a
small change to Michael's refactoring in 0002 by initialising success to
false in ExecQueryAndProcessResults. There was a compiler warning about
success possibly being used uninitialized[1].

[1] https://cirrus-ci.com/task/6207675187331072

v7-0001-psql-Add-support-for-prepared-stmt-with-extended-.patch
Description: Binary data

v7-0002-psql-Refactor-status-of-extended-protocol-command.patch
Description: Binary data
Re: Set query_id for query contained in utility statement
I've realised my initial approach was wrong: calling the post parse hook
for all nested queries in analyze.c prevents extensions like pgss from
correctly tracking the query's nesting level.

I've changed the approach to replicate what's done in ExplainQuery for
both CreateTableAs and DeclareCursor: jumble the query contained in the
utility statement and call the post parse hook before it is planned or
executed. Additionally, explain's nested query can itself be a
CreateTableAs or DeclareCursor, which also needs to be jumbled.

The issue is visible when explaining a CreateTableAs or DeclareCursor
query: the queryId is missing despite verbose being requested.

EXPLAIN (verbose) create table test_t as select 1;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=4)
   Output: 1

Post patch, the query id is correctly displayed.

EXPLAIN (verbose) create table test_t as select 1;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=4)
   Output: 1
 Query Identifier: 2800308901962295548

Regards,
Anthonin Bonnefoy

v2-0001-Set-query_id-for-queries-contained-in-utility-sta.patch
Description: Binary data
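For reference, the pattern replicated from ExplainQuery boils down to the
following (a sketch with PG 17-era signatures from queryjumble.h; see the
attached patch for the real code):

/* jumble the contained query, then let extensions see it with the
 * right nesting level, as ExplainQuery does for its own query */
JumbleState *jstate = NULL;

if (IsQueryIdEnabled())
    jstate = JumbleQuery(query);

if (post_parse_analyze_hook)
    (*post_parse_analyze_hook) (pstate, query, jstate);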
Set query_id for query contained in utility statement
Hi all,

Some utility statements like Explain, CreateTableAs and DeclareCursor
contain a query which will be planned and executed. During post parse,
only the top utility statement is jumbled, leaving the contained query
without a query_id set. Explain jumbles its contained query in
ExplainQuery, but CreateTableAs and DeclareCursor were left with an unset
query_id. This leads to extensions relying on query_id, like
pg_stat_statements, not being able to track those nested queries, as the
query_id was 0.

This patch fixes this by recursively jumbling queries contained in those
utility statements during post parse, setting the query_id for those
contained queries and removing the need for ExplainQuery to do it for
Explain statements.

Regards,
Anthonin Bonnefoy

v1-0001-Set-query_id-for-queries-contained-in-utility-sta.patch
Description: Binary data
Re: Use pgBufferUsage for block reporting in analyze
On Wed, Jul 31, 2024 at 9:36 PM Masahiko Sawada wrote:
> Meanwhile, I think we can push 0001 and 0002 patches since they are in
> good shape. I've updated commit messages to them and slightly changed
> 0002 patch to write "finished analyzing of table \"%s.%s.%s\" instead
> of "analyze of table \"%s.%s.%s\".

Wouldn't it make sense to do the same for autoanalyze and write "finished
automatic analyze of table \"%s.%s.%s\"\n" instead of "automatic analyze
of table \"%s.%s.%s\"\n"?

> Also, regarding 0003 patch, what is the main reason why we want to add
> WAL usage to analyze reports? I think that analyze normally does not
> write WAL records much so I'm not sure it's going to provide a good
> insight for users.

There was no strong reason except for consistency with the VACUUM VERBOSE
output. But as you said, it's not really providing valuable information,
so it's probably better to keep the noise down and drop it.

Regards,
Anthonin Bonnefoy
Re: Use pgBufferUsage for block reporting in analyze
On Tue, Jul 30, 2024 at 9:21 AM Anthonin Bonnefoy wrote:
> A possible change would be to pass an inh flag when an acquirefunc is
> called from acquire_inherited_sample_rows. The acquirefunc could then
> use an alternative log format to append to logbuf. This way, we could
> have a more compact format for partitioned tables.

I've just tested this; the result isn't great as it creates an
inconsistent output:

INFO:  analyze of table "postgres.public.test_partition"
"test_partition_1": scanned 5 of 5 pages, containing 999 live tuples and 0 dead tuples; 999 rows in sample, 999 estimated total rows
"test_partition_2": scanned 5 of 5 pages, containing 1000 live tuples and 0 dead tuples; 1000 rows in sample, 1000 estimated total rows
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
...
INFO:  analyze of table "postgres.public.test_partition_1"
pages: 5 of 5 scanned
tuples: 999 live tuples, 0 are dead; 999 tuples in sample, 999 estimated total tuples
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s

Maybe the best approach is to always use the compact form?

INFO:  analyze of table "postgres.public.test_partition"
"test_partition_1": scanned 5 of 5 pages, containing 999 live tuples and 0 dead tuples; 999 tuples in sample, 999 estimated total tuples
"test_partition_2": scanned 5 of 5 pages, containing 1000 live tuples and 0 dead tuples; 1000 tuples in sample, 1000 estimated total tuples
avg read rate: 1.953 MB/s, avg write rate: 0.000 MB/s
...
INFO:  analyze of table "postgres.public.test_partition_1"
"test_partition_1": scanned 5 of 5 pages, containing 999 live tuples and 0 dead tuples; 999 tuples in sample, 999 estimated total tuples
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s

I've updated the patchset with those changes. 0004 introduces the
StringInfo logbuf so we can output the stats as a single log line, both
during autoanalyze and ANALYZE VERBOSE, while using the compact form.

Regards,
Anthonin

v6-0004-Pass-StringInfo-logbuf-to-AcquireSampleFunc.patch
Description: Binary data

v6-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data

v6-0002-Output-analyze-details-on-analyze-verbose.patch
Description: Binary data

v6-0003-Output-wal-usage-of-analyze-in-verbose-output-and.patch
Description: Binary data
Re: Use pgBufferUsage for block reporting in analyze
Hi,

On Tue, Jul 30, 2024 at 1:13 AM Masahiko Sawada wrote:
> I have one comment on 0001 patch:
> The comment should also be updated or removed.

Ok, I've removed the comment.

> However, as the above comment says, delay_in_ms can have a duration up
> to 25 days. I guess it would not be a problem for analyze cases but
> could be in vacuum cases as vacuum could sometimes be running for a
> very long time. I've seen vacuums running even for almost 1 month. So
> I think we should keep this part.

Good point, I've reverted to using TimestampDifference for vacuum.

> /* measure elapsed time iff autovacuum logging requires it */
> -if (AmAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
> +if (instrument)
>
> The comment should also be updated.

Updated.

> Could you split the 0002 patch into two patches? One is to have
> ANALYZE command (with VERBOSE option) write the buffer usage, and
> second one is to add WAL usage to both ANALYZE command and
> autoanalyze. I think adding WAL usage to ANALYZE could be
> controversial as it should not be WAL-intensive operation, so I'd like
> to keep them separate.

I've split the patch: 0002 makes the verbose output match the autoanalyze
logs with buffer/io/system usage, while 0003 adds the WAL usage output.

> It seems to me that the current style is more concise and readable (3
> rows per table vs. 1 row per table). We might need to consider a
> better output format for partitioned tables as the number of
> partitions could be high. I don't have a good idea now, though.

A possible change would be to pass an inh flag when an acquirefunc is
called from acquire_inherited_sample_rows. The acquirefunc could then use
an alternative log format to append to logbuf. This way, we could have a
more compact format for partitioned tables.

Regards,
Anthonin

v5-0003-Output-wal-usage-of-analyze-in-verbose-output-and.patch
Description: Binary data

v5-0002-Output-analyze-details-on-analyze-verbose.patch
Description: Binary data

v5-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data

v5-0004-Pass-StringInfo-logbuf-to-AcquireSampleFunc.patch
Description: Binary data
Re: Use pgBufferUsage for block reporting in analyze
On Sat, Jul 27, 2024 at 12:35 AM Masahiko Sawada wrote:
> An alternative idea would
> be to pass StringInfo to AcquireSampleRowsFunc() so that callback can
> write its messages there. This is somewhat similar to what we do in
> the EXPLAIN command (cf, ExplainPropertyText() etc). It could be too
> much but I think it could be better than writing logs in the single
> format.

I've tested this approach; it definitely looks better. I've added a logbuf
StringInfo to AcquireSampleRowsFunc which will receive the logs. elevel
was removed as it is not used anymore. Since everything is in the same log
line, I've removed the relation name from the acquire sample functions.
For partitioned tables, I've also added the partition table currently
being sampled. The output will look like:

INFO:  analyze of table "postgres.public.test_partition"
Sampling rows from child "public.test_partition_1"
pages: 5 of 5 scanned
tuples: 999 live tuples, 0 are dead; 999 tuples in sample, 999 estimated total tuples
Sampling rows from child "public.test_partition_2"
pages: 5 of 5 scanned
tuples: 1000 live tuples, 0 are dead; 1000 tuples in sample, 1000 estimated total tuples
avg read rate: 2.604 MB/s, avg write rate: 0.000 MB/s
...

For a file_fdw, the output will be:

INFO:  analyze of table "postgres.public.pglog"
tuples: 60043 tuples; 30000 tuples in sample
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
...

Regards,
Anthonin

v4-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data

v4-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data

v4-0003-Pass-StringInfo-logbuf-to-AcquireSampleFunc.patch
Description: Binary data
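For reference, the callback shape after this change looks roughly like
this (the logbuf parameter is the patch's addition; the remaining
parameters follow the existing AcquireSampleRowsFunc in fdwapi.h,
illustrative only):

/* sampling callbacks append their per-table messages to logbuf
 * instead of logging directly at a caller-provided elevel */
typedef int (*AcquireSampleRowsFunc) (Relation relation,
                                      StringInfo logbuf,
                                      HeapTuple *rows,
                                      int targrows,
                                      double *totalrows,
                                      double *totaldeadrows);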
Add ALL_CANDIDATES option to EXPLAIN
Hi,

I have a prototype for an ALL_CANDIDATES option for EXPLAIN. The goal of
this option is to print all plan candidates instead of only the cheapest
plan. It outputs the plans from the most expensive at the top to the
cheapest. Here's an example:

explain (all_candidates) select * from pgbench_accounts where aid=1;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Plan 1
   ->  Gather  (cost=1000.00..3375.39 rows=1 width=97)
         Workers Planned: 1
         ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..2375.29 rows=1 width=97)
               Filter: (aid = 1)
 Plan 2
   ->  Seq Scan on pgbench_accounts  (cost=0.00..2890.00 rows=1 width=97)
         Filter: (aid = 1)
 Plan 3
   ->  Bitmap Heap Scan on pgbench_accounts  (cost=4.30..8.31 rows=1 width=97)
         Recheck Cond: (aid = 1)
         ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4.30 rows=1 width=0)
               Index Cond: (aid = 1)
 Plan 4
   ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.29..8.31 rows=1 width=97)
         Index Cond: (aid = 1)

This can provide very useful insight into the planner's decisions, like
whether it tried to use a specific index and how much cost difference
there is with the top plan. Additionally, it makes it possible to spot
discrepancies in generated plans, like incorrect row estimation[1].

The plan list is generated from the upper_rel's pathlist. However, due to
how planning mutates the PlannerGlobal state, we can't directly iterate
over the path list generated by the subquery_planner and create a planned
statement for each path. Instead, we need to plan from scratch for every
path in the pathlist to generate the list of PlannedStmts.

The patch is split in two, mostly to ease the review:
001: Propagate PlannerInfo root to add_path. This is needed to prevent
add_path from discarding paths when all_candidates is enabled, which is
stored in PlannerGlobal.
002: Add the planner_all_candidates function and print the candidate list
in explain.

[1] https://www.postgresql.org/message-id/flat/CAO6_Xqr9+51NxgO=XospEkUeAg-p=ejawmtpdczwjrggkj5...@mail.gmail.com

Regards,
Anthonin

v1-0002-Add-ALL_CANDIDATES-explain-option.patch
Description: Binary data

v1-0001-Propagate-root-PlannerInfo-to-add_path.patch
Description: Binary data
Re: query_id, pg_stat_activity, extended query protocol
On Thu, Jul 18, 2024 at 10:56 AM Michael Paquier wrote:
> Please attach things to your emails, if your repository disappears for
> a reason or another we would lose knowledge in the archives of the
> community lists.

Noted, and thanks for the reminder. I'm still learning about mailing list
etiquette.

> I have not looked at that entirely in details, and I'd need to check
> if it is possible to use what's here for more predictible tests:
> https://www.postgresql.org/message-id/ZqCMCS4HUshUYjGc%40paquier.xyz

For the tests, there are limited possibilities to check whether a query_id
has been set correctly:
- Checking pg_stat_activity is not possible in the regress tests, as you
  need a second session to check the reported query_id.
- pg_stat_statements can be used indirectly, but you're limited to how
  pgss uses query_id. For example, it doesn't rely on queryId in
  ExecutorRun.

A possible solution I've been thinking of is to use a test module. The
module will assert on whether the queryId is set or not in the parse, plan
and executor hooks. It will also check that the queryId reported in pgstat
matches the queryId at the root level. This allows us to check that the
queryId is correctly set with the extended protocol. I've also found some
queries which will trigger a failure (ctas and cursor usage), though this
is probably a different issue from the extended protocol issue.

Regards,
Anthonin

0001-Create-module-for-query_id-test.patch
Description: Binary data
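As a sketch of what one of the module's hooks could look like
(hypothetical code illustrating the idea, not the attached patch):

/* fail the test if planning starts without a queryId while
 * query_id computation is enabled */
static planner_hook_type prev_planner_hook = NULL;

static PlannedStmt *
check_queryid_planner(Query *parse, const char *query_string,
                      int cursorOptions, ParamListInfo boundParams)
{
    Assert(!IsQueryIdEnabled() || parse->queryId != UINT64CONST(0));

    if (prev_planner_hook)
        return prev_planner_hook(parse, query_string, cursorOptions,
                                 boundParams);
    return standard_planner(parse, query_string, cursorOptions,
                            boundParams);
}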
Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bind
On Wed, Jul 24, 2024 at 05:33:07PM +0200, Peter Eisentraut wrote:
> This commit message confused me, because I don't think this is what the
> \bindx command actually does. AFAICT, it only binds, it does not execute.
> At least that is what the documentation in the content of the patch appears
> to indicate.

Unless I misunderstand the remark, \bindx will call PQsendQueryPrepared,
which will bind then execute the query, similar to what \bind is doing
(except \bind also parses the query).

> I'm not sure \bindx is such a great name. The "x" stands for "I ran out of
> ideas". ;-)

That's definitely what happened :). \bind would have been a better fit but
it was already used.

On Thu, Jul 25, 2024 at 4:19 AM Michael Paquier wrote:
> Not sure that I like much the additional option embedded in the
> existing command; I'd rather keep a separate command for each libpq
> call, that seems cleaner. So I would be OK with your suggested
> \bind_named. Fine by me to be outvoted, of course.

+1 for keeping this as a separate command and using \bind_named. \bind has
a different behaviour as it also parses the query, so keeping them as
separate commands would probably avoid some confusion.
Re: Use pgBufferUsage for block reporting in analyze
On Mon, Jul 22, 2024 at 10:59 PM Masahiko Sawada wrote:
> The first line would vary depending on whether an autovacuum worker or
> not. And the above suggestion includes a change of term "row" to
> "tuple" for better consistency with VACUUM VERBOSE outputs. I think it
> would be great if autoanalyze also writes logs in the same format.
> IIUC with the patch, autoanalyze logs don't include the page and tuple
> statistics.

One issue is that the number of scanned pages, live tuples and dead tuples
is only available in acquire_sample_rows, which is where the log
containing those stats is emitted. I've tried to implement the following
in 0003:
- Sampling functions now accept an AcquireSampleStats struct to store page
  and tuple stats
- The log is removed from the sampling functions
- do_analyze_rel now outputs scanned and tuple statistics when relevant.
  Sampling from an fdw doesn't provide those statistics, so they are not
  displayed in those cases.

This ensures that analyze logs are only emitted in do_analyze_rel,
allowing the same output for both autoanalyze and ANALYZE VERBOSE. With
those changes, we have the following output for analyze verbose of a
table:

analyze (verbose) pgbench_accounts ;
INFO:  analyzing "public.pgbench_accounts"
INFO:  analyze of table "postgres.public.pgbench_accounts"
pages: 1640 of 1640 scanned
tuples: 100000 live tuples, 0 are dead; 30000 tuples in sample, 100000 estimated total tuples
avg read rate: 174.395 MB/s, avg write rate: 0.000 MB/s
buffer usage: 285 hits, 1384 reads, 0 dirtied
WAL usage: 14 records, 0 full page images, 1343 bytes
system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.06 s

For a file_fdw, the output will look like:

analyze (verbose) pglog;
INFO:  analyzing "public.pglog"
INFO:  analyze of table "postgres.public.pglog"
tuples: 30000 tuples in sample, 60042 estimated total tuples
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 182 hits, 0 reads, 0 dirtied
WAL usage: 118 records, 0 full page images, 13086 bytes
system usage: CPU: user: 0.40 s, system: 0.00 s, elapsed: 0.41 s

I've also slightly modified 0002 to display "automatic analyze" when we're
inside an autovacuum worker, similar to what's done with the vacuum
output.

Regards,
Anthonin

v3-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data

v3-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data

v3-0003-Output-pages-and-tuples-stats-in-ANALYZE-VERBOSE.patch
Description: Binary data
Re: Possible incorrect row estimation for Gather paths
On Wed, Jul 17, 2024 at 3:59 AM Richard Guo wrote:
> I can reproduce this problem with the query below.
>
> explain (costs on) select * from tenk1 order by twenty;
>                                    QUERY PLAN
> ---------------------------------------------------------------------------------
>  Gather Merge  (cost=772.11..830.93 rows=5882 width=244)
>    Workers Planned: 1
>    ->  Sort  (cost=772.10..786.80 rows=5882 width=244)
>          Sort Key: twenty
>          ->  Parallel Seq Scan on tenk1  (cost=0.00..403.82 rows=5882 width=244)
> (5 rows)

I was looking for a test to add to the regress checks that wouldn't rely
on explain costs, since they are disabled there. However, I've realised I
could do something similar to what's done in stats_ext with the
check_estimated_rows function. I've added a get_estimated_rows test
function that extracts the estimated rows from the top node and uses it to
check the gather nodes' estimates. get_estimated_rows uses a plain
EXPLAIN, compared to check_estimated_rows which relies on EXPLAIN ANALYZE.

> I wonder if the changes in create_ordered_paths should also be reduced
> to 'total_groups = gather_rows_estimate(path);'.

It should already be the case with patch v2. It does create rounding
errors that are visible in the tests, but they shouldn't exceed +1 or -1.

> I think perhaps it's better to declare gather_rows_estimate() in
> cost.h rather than optimizer.h.
> (BTW, I wonder if compute_gather_rows() would be a better name?)

Good point, I've moved the declaration and renamed it.

> I noticed another issue in generate_useful_gather_paths() -- *rowsp
> would have a random value if override_rows is true and we use
> incremental sort for gather merge. I think we should fix this too.

That seems to be the case. I've tried to find a query that could trigger
this codepath, without success: all the grouping and distinct paths I've
tried were fully sorted and didn't trigger an incremental sort. I will
need a bit more time to check this. In the meantime, I've updated the
patches with the review comments.

Regards,
Anthonin

v3-0001-Remove-unnecessary-assignment-of-path-rows-in-gat.patch
Description: Binary data

v3-0002-Fix-row-estimation-in-gather-paths.patch
Description: Binary data
Correctly propagate queryId for utility stmt in function
Hi,

For utility statements defined within a function, the queryTree is copied
to a plannedStmt as utility commands don't require planning. However, the
queryId is not propagated to the plannedStmt. This leads to plugins
relying on queryId, like pg_stat_statements, not being able to track
utility statements within function calls.

This patch fixes the issue by correctly propagating queryId from the
cached queryTree to the plannedStmt.

Regards,
Anthonin

From b70163d1bd1279ca3f9b29166431cc64d25ca586 Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy
Date: Thu, 18 Jul 2024 11:50:37 +0200
Subject: Correctly propagate queryId for utility stmt in function

For utility statements defined within a function, the queryTree is copied
to a plannedStmt as utility commands don't require planning. However, the
queryId is not propagated to the plannedStmt. This leads to plugins
relying on queryId like pg_stat_statements to not be able to track utility
statements within function calls.

This patch fixes the issue by correctly propagating queryId from the
cached queryTree to the plannedStmt.
---
 .../expected/level_tracking.out               | 28 +++++++++++++++++++
 .../pg_stat_statements/sql/level_tracking.sql | 12 ++++++++
 src/backend/executor/functions.c              |  1 +
 3 files changed, 41 insertions(+)

diff --git a/contrib/pg_stat_statements/expected/level_tracking.out b/contrib/pg_stat_statements/expected/level_tracking.out
index fe01dd79af..f7b2ffceb3 100644
--- a/contrib/pg_stat_statements/expected/level_tracking.out
+++ b/contrib/pg_stat_statements/expected/level_tracking.out
@@ -67,6 +67,34 @@ SELECT toplevel, calls, query FROM pg_stat_statements
  t        |     1 | SET pg_stat_statements.track = 'all'
 (7 rows)
 
+-- Procedure with multiple utility statements - all level tracking.
+CREATE OR REPLACE PROCEDURE proc_with_utility_stmt()
+LANGUAGE SQL
+AS $$
+SHOW pg_stat_statements.track;
+SHOW pg_stat_statements.track_utility;
+$$;
+CALL proc_with_utility_stmt();
+SELECT toplevel, calls, query FROM pg_stat_statements
+  WHERE query LIKE '%SHOW%' ORDER BY query COLLATE "C", toplevel;
+ toplevel | calls |                        query
+----------+-------+------------------------------------------------------
+ t        |     1 | CREATE OR REPLACE PROCEDURE proc_with_utility_stmt()+
+          |       | LANGUAGE SQL                                        +
+          |       | AS $$                                               +
+          |       | SHOW pg_stat_statements.track;                      +
+          |       | SHOW pg_stat_statements.track_utility;              +
+          |       | $$
+ f        |     1 | SHOW pg_stat_statements.track
+ f        |     1 | SHOW pg_stat_statements.track_utility
+(3 rows)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
 -- DO block - top-level tracking without utility.
 SET pg_stat_statements.track = 'top';
 SET pg_stat_statements.track_utility = FALSE;
diff --git a/contrib/pg_stat_statements/sql/level_tracking.sql b/contrib/pg_stat_statements/sql/level_tracking.sql
index aa37408d52..115ff392a0 100644
--- a/contrib/pg_stat_statements/sql/level_tracking.sql
+++ b/contrib/pg_stat_statements/sql/level_tracking.sql
@@ -33,6 +33,18 @@ END; $$;
 SELECT toplevel, calls, query FROM pg_stat_statements
   ORDER BY query COLLATE "C", toplevel;
 
+-- Procedure with multiple utility statements - all level tracking.
+CREATE OR REPLACE PROCEDURE proc_with_utility_stmt()
+LANGUAGE SQL
+AS $$
+SHOW pg_stat_statements.track;
+SHOW pg_stat_statements.track_utility;
+$$;
+CALL proc_with_utility_stmt();
+SELECT toplevel, calls, query FROM pg_stat_statements
+  WHERE query LIKE '%SHOW%' ORDER BY query COLLATE "C", toplevel;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
 -- DO block - top-level tracking without utility.
 SET pg_stat_statements.track = 'top';
 SET pg_stat_statements.track_utility = FALSE;
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 539cd0a999..692854e2b3 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -492,6 +492,7 @@ init_execution_state(List *queryTree_list,
 			stmt->utilityStmt = queryTree->utilityStmt;
 			stmt->stmt_location = queryTree->stmt_location;
 			stmt->stmt_len = queryTree->stmt_len;
+			stmt->queryId = queryTree->queryId;
 		}
 		else
 			stmt = pg_plan_query(queryTree,
-- 
2.39.3 (Apple Git-146)
Re: query_id, pg_stat_activity, extended query protocol
Hi,

Wouldn't it be enough to call pgstat_report_query_id in ExecutorRun and
ProcessUtility? With those changes[1], both normal statements and utility
statements called through the extended protocol will correctly report the
query_id.

-- Test utility statement with extended protocol
show all \bind \g
-- Check reported query_id
select query, query_id from pg_stat_activity where application_name ='psql' and pid!=pg_backend_pid();
  query   |      query_id
----------+----------------------
 show all | -866221123969716490

[1] https://github.com/bonnefoa/postgres/commit/bf4b332d7b481549c6d9cfa70db51e39a305b9b2

Regards,
Anthonin
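For clarity, the ExecutorRun part of [1] amounts to something like this (a
sketch of the idea, not the exact commit):

/* report the planned statement's queryId when execution starts, so
 * the extended protocol's Execute phase refreshes pg_stat_activity */
void
ExecutorRun(QueryDesc *queryDesc, ScanDirection direction,
            uint64 count, bool execute_once)
{
    /* queryDesc->plannedstmt carries the queryId computed at parse time */
    pgstat_report_query_id(queryDesc->plannedstmt->queryId, false);

    if (ExecutorRun_hook)
        (*ExecutorRun_hook) (queryDesc, direction, count, execute_once);
    else
        standard_ExecutorRun(queryDesc, direction, count, execute_once);
}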
Re: Possible incorrect row estimation for Gather paths
Hi Rafia,

Thanks for reviewing.

On Wed, Jul 10, 2024 at 4:54 PM Rafia Sabih wrote:
> But I don't quite understood the purpose of this,
> + total_groups = input_rel->rows;
> +
> + /*
> + * If the number of rows is unknown, fallback to gather rows
> + * estimation
> + */
> + if (total_groups == 0)
> + total_groups = gather_rows_estimate(input_path);
>
> why not just use total_groups = gather_rows_estimate(input_path), what
> is the importance of having total_groups = input_rel->rows?

The initial goal was to use the source tuples if available and avoid
possible rounding errors. Though I realise that the difference would be
minimal: for example, 200K tuples and 2 workers would yield
int(int(200000 / 2.4) * 2.4) = 199999. That is probably not worth the
additional complexity, so I've updated the patch to just use
gather_rows_estimate.

> With respect to the change introduced by the patch in the regression
> test, I wonder if we should test it on the tables of a larger scale
> and check for performance issues.
> Imagine the case when Parallel Seq Scan on extremely_skewed s
> (cost=0.00..167.01 rows=1 width=4) returns 1000 rows instead of 1 ...
> I wonder which plan would perform better then or will there be a
> totally different plan.

For the extremely_skewed test, having the parallel seq scan return more
rows won't impact the Gather, since the Hash Join will reduce the number
of rows to 1. I've found an example where we can see the plan change with
the default settings:

CREATE TABLE simple (id SERIAL PRIMARY KEY, other bigint);
insert into simple select generate_series(1,500000), ceil(random()*100);
analyze simple;
EXPLAIN SELECT * FROM simple where other < 10 order by id;

Unpatched:
 Gather Merge  (cost=9377.85..12498.60 rows=27137 width=12)
   Workers Planned: 1
   ->  Sort  (cost=8377.84..8445.68 rows=27137 width=12)
         Sort Key: id
         ->  Parallel Seq Scan on simple  (cost=0.00..6379.47 rows=27137 width=12)
               Filter: (other < 10)

Patched:
 Sort  (cost=12381.73..12492.77 rows=44417 width=12)
   Sort Key: id
   ->  Seq Scan on simple  (cost=0.00..8953.00 rows=44417 width=12)
         Filter: (other < 10)

Looking at the candidates, the previous Gather Merge now has an estimated
row count of 44418. The 1-row difference compared to the other Gather
Merge plan is due to rounding (26128 * 1.7 = 44417.6).

Plan 3
  ->  Gather Merge  (cost=9296.40..14358.75 rows=44418 width=12)
        Workers Planned: 1
        ->  Sort  (cost=8296.39..8361.71 rows=26128 width=12)
              Sort Key: id
              ->  Parallel Seq Scan on simple  (cost=0.00..6379.47 rows=26128 width=12)
                    Filter: (other < 10)
Plan 4
  ->  Gather Merge  (cost=9296.40..14358.63 rows=44417 width=12)
        Workers Planned: 1
        ->  Sort  (cost=8296.39..8361.71 rows=26128 width=12)
              Sort Key: id
              ->  Parallel Seq Scan on simple  (cost=0.00..6379.47 rows=26128 width=12)
                    Filter: (other < 10)
Plan 5
  ->  Sort  (cost=12381.73..12492.77 rows=44417 width=12)
        Sort Key: id
        ->  Seq Scan on simple  (cost=0.00..8953.00 rows=44417 width=12)
              Filter: (other < 10)

The Sort plan is slightly slower than the Gather Merge plan: 100ms average
against 83ms, but the Gather Merge comes at the additional cost of
creating and using a parallel worker. The unpatched row estimation makes
the parallel plan look cheaper, and running a parallel query for a 17ms
improvement doesn't seem like a good trade.

I've also realised from the comments in optimizer.h that nodes/pathnodes.h
should not be included there and fixed it.
Regards,
Anthonin

v2-0001-Remove-unnecessary-assignment-of-path-rows-in-gat.patch
Description: Binary data

v2-0002-Fix-row-estimation-in-gather-paths.patch
Description: Binary data
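To make the rounding arithmetic above easy to replay, here is a
self-contained sketch mirroring get_parallel_divisor() in costsize.c
(leader participation assumed):

#include <stdio.h>

/* the leader contributes a decreasing fraction of a worker's
 * throughput as more workers are planned */
static double
parallel_divisor(int workers)
{
    double divisor = workers;
    double leader_contribution = 1.0 - 0.3 * workers;

    if (leader_contribution > 0.0)
        divisor += leader_contribution;
    return divisor;
}

int
main(void)
{
    /* 1 worker -> 1.7, 2 -> 2.4, 3 -> 3.1 */
    for (int w = 1; w <= 3; w++)
        printf("%d worker(s): divisor %.1f, 200000 rows -> %d per worker\n",
               w, parallel_divisor(w), (int) (200000 / parallel_divisor(w)));
    return 0;
}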
Re: Use pgBufferUsage for block reporting in analyze
Hi,

Thanks for the review. I've updated the patches with the suggestions:
- moved the renaming of misses to reads to the first patch
- added intermediate variables for total blks usage

I've also done some additional tests using the provided
vacuum_analyze_buffer_usage.sql script. It relies on pg_stat_statements to
check the results (only pgss gives information on dirtied buffers). It
gives the following output:

psql:vacuum_analyze_buffer_usage.sql:21: INFO:  vacuuming "postgres.pg_temp_7.vacuum_blks_stat_test"
...
buffer usage: 105 hits, 3 reads, 6 dirtied
...
        query       | sum_hit | sum_read | sum_dirtied
--------------------+---------+----------+-------------
 VACUUM (VERBOSE... |     105 |        3 |           6

For vacuum, we have the same results with SKIP_DATABASE_STATS. Without
this setting, we would have block usage generated by
vac_update_datfrozenxid outside of vacuum_rel, and therefore not tracked
by the verbose output.

For the second test, the second patch is needed to have ANALYZE (VERBOSE)
output the block usage. It will output the following:

psql:vacuum_analyze_buffer_usage.sql:29: INFO:  analyzing "pg_temp_7.vacuum_blks_stat_test"
...
buffer usage: 84 hits, 33 reads, 2 dirtied
...
        query        | sum_hit | sum_read | sum_dirtied
---------------------+---------+----------+-------------
 ANALYZE (VERBOSE... |      91 |       38 |           2

There are additional buffer hits/reads reported by pgss; those are from
analyze_rel opening the relations in try_relation_open, and they are not
tracked by the ANALYZE VERBOSE output.

v3-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data

v3-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data

vacuum_analyze_buffer_usage.sql
Description: Binary data
Re: Fix possible overflow of pg_stat DSA's refcnt
On Wed, Jun 26, 2024 at 7:40 AM Michael Paquier wrote:
> Very good catch! It looks like you have seen that in the field, then.
> Sad face.

Yeah, this happened last week on one of our replicas (version 15.5) that
had 134 days of uptime. We are doing a lot of parallel queries on this
cluster, so the combination of high uptime plus parallel worker creation
eventually triggered the issue.

> Will fix and backpatch, thanks for the report!

Thanks for handling this and for the quick answer!

Regards,
Anthonin
Fix possible overflow of pg_stat DSA's refcnt
Hi,

During backend initialisation, the pgStat DSA is attached using
dsa_attach_in_place with a NULL segment. The NULL segment means that
there's no callback to release the DSA when the process exits.
pgstat_detach_shmem only calls dsa_detach which, as mentioned in the
function's comment, doesn't include releasing and doesn't decrement the
reference count of the pgStat DSA. Thus, every time a backend is created,
the pgStat DSA's refcnt is incremented, but it is never decremented when
the backend shuts down. It will eventually overflow and reach 0,
triggering the "could not attach to dynamic shared area" error on all
newly created backends. When this state is reached, the only way to
recover is to restart the db to reset the counter.

The issue can be made visible by calling dsa_dump in pgstat_detach_shmem
and checking that refcnt's value continuously increases as new backends
are created. It is also possible to reach the state where all connections
are refused by editing the refcnt manually with lldb/gdb (the alternative,
creating enough backends to reach 0, exists but can take some time).
Setting it to -10 and then opening 10 connections will eventually generate
the "could not attach" error.

This patch fixes the issue by releasing the pgStat DSA with
dsa_release_in_place during pgStat shutdown to correctly decrement the
refcnt.

Regards,
Anthonin

v1-0001-Fix-possible-overflow-of-pg_stat-DSA-s-refcnt.patch
Description: Binary data
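As a toy, self-contained illustration of the failure mode (my own sketch,
unrelated to the actual dsa.c code): a counter that is only ever
incremented must eventually wrap around and pass through 0 again.

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint32_t refcnt = 1;        /* 1 = area exists, nobody attached */
    uint64_t attaches = 0;

    do
    {
        refcnt++;               /* attach without a matching release */
        attaches++;
    } while (refcnt != 0);      /* 0 means "area is gone" */

    printf("refcnt wrapped to 0 after %llu attaches\n",
           (unsigned long long) attaches);
    return 0;
}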
Re: Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel
Hi,

I sent a similar patch for this in
https://www.postgresql.org/message-id/flat/cao6_xqr__kttclkftqs0qscm-j7_xbrg3ge2rwhucxqjmjh...@mail.gmail.com

Regards,
Anthonin

On Thu, Jun 6, 2024 at 11:10 AM Dilip Kumar wrote:
> As part of commit 5cd72cc0c5017a9d4de8b5d465a75946da5abd1d, the
> dependency on global counters such as VacuumPage(Hit/Miss/Dirty) was
> removed from the vacuum. However, do_analyze_rel() was still using
> these counters, necessitating the tracking of global counters
> alongside BufferUsage counters.
>
> The attached patch addresses the issue by eliminating the need to
> track VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel(), making
> the global counters obsolete. This simplifies the code and improves
> consistency.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
Re: Use pgBufferUsage for block reporting in analyze
Thanks for having a look.

On Fri, May 10, 2024 at 12:40 PM Michael Paquier wrote:
> This needs some runtime check to make sure that the calculations
> are consistent before and after the fact (cannot do that now).

Yeah, testing this is also a bit painful as the buffer usage of analyze is
only displayed in the logs during autoanalyze. While looking at this, I've
thought of additional changes that could make testing easier and improve
consistency with VACUUM VERBOSE:
- Have ANALYZE VERBOSE output the buffer usage stats
- Add WAL usage to ANALYZE VERBOSE

The analyze verbose output would look like:

postgres=# analyze (verbose) pgbench_accounts ;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 1640 of 1640 pages, containing 100000 live rows and 0 dead rows; 30000 rows in sample, 100000 estimated total rows
INFO:  analyze of table "postgres.public.pgbench_accounts"
avg read rate: 124.120 MB/s, avg write rate: 0.110 MB/s
buffer usage: 533 hits, 1128 reads, 1 dirtied
WAL usage: 12 records, 1 full page images, 5729 bytes
system usage: CPU: user: 0.06 s, system: 0.00 s, elapsed: 0.07 s

> Perhaps this should say "read" rather than "miss" in the logs as the
> two read variables for the shared and local blocks are used? For
> consistency, at least.

Sounds good.

> That's not material for v17, only for v18.

Definitely. I've split the patch in two parts:
1: Removal of the vacuum-specific variables; this is the same as the
initial patch.
2: Add buffer and WAL usage to the analyze verbose output + rename miss to
reads.

Regards,
Anthonin

v2-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data

v2-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data
Possible incorrect row estimation for Gather paths
Hi,

While experimenting with an explain option to display all plan candidates
(very rough prototype here [1]), I've noticed some discrepancies in some
generated plans.

EXPLAIN (ALL_CANDIDATES) SELECT * FROM pgbench_accounts order by aid;
 Plan 1
   ->  Gather Merge  (cost=11108.32..22505.38 rows=100000 width=97)
         Workers Planned: 1
         ->  Sort  (cost=10108.31..10255.37 rows=58824 width=97)
               Sort Key: aid
               ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..2228.24 rows=58824 width=97)
 Plan 2
   ->  Gather Merge  (cost=11108.32..17873.08 rows=58824 width=97)
         Workers Planned: 1
         ->  Sort  (cost=10108.31..10255.37 rows=58824 width=97)
               Sort Key: aid
               ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..2228.24 rows=58824 width=97)

The 2 plans are similar except that one Gather Merge has a lower estimate
of 58824 rows. The first plan is created with generate_useful_gather_paths
with override_rows=false, then create_gather_merge_path, and will use
rel->rows as the row count (so the 100K rows of pgbench_accounts).

#0: create_gather_merge_path(...) at pathnode.c:1885:30
#1: generate_useful_gather_paths(... override_rows=false) at allpaths.c:3286:11
#2: apply_scanjoin_target_to_paths(...) at planner.c:7744:3
#3: grouping_planner(...) at planner.c:1638:3

The second plan is created through create_ordered_paths, then
create_gather_merge_path, and the number of rows is estimated as
(worker_rows * parallel_workers). Since we only have 1 parallel worker,
this yields 58824 rows.

#0: create_gather_merge_path(...) at pathnode.c:1885:30
#1: create_ordered_paths(...) at planner.c:5287:5
#2: grouping_planner(...) at planner.c:1717:17

The 58K row estimate looks incorrect. A worker's row count is estimated as
total_rows/parallel_divisor. The parallel_divisor includes the possible
leader participation and will be 1.7 for 1 worker, hence the 58K rows
(100K/1.7=58K). However, the gather merge will only do 58K*1, dropping the
leader participation component.

I have a tentative patch split in two changes:
1: A related refactoring to remove an unnecessary and confusing assignment
of rows in create_gather_merge_path. This value is never used and is
immediately overwritten in cost_gather_merge.
2: Change the row estimation of gather paths to use
worker_rows*parallel_divisor to get a more accurate estimation.
Additionally, when creating a gather merge path in create_ordered_paths,
we try to use the source rel's rows when available.

The patch triggered a small change in the hash_join regression test. Pre
patch, we had the following candidates:
Plan 4
  ->  Aggregate  (cost=511.01..511.02 rows=1 width=8)
        ->  Gather  (cost=167.02..511.00 rows=2 width=0)
              Workers Planned: 1
              ->  Parallel Hash Join  (cost=167.02..510.80 rows=1 width=0)
                    Hash Cond: (r.id = s.id)
                    ->  Parallel Seq Scan on simple r  (cost=0.00..299.65 rows=11765 width=4)
                    ->  Parallel Hash  (cost=167.01..167.01 rows=1 width=4)
                          ->  Parallel Seq Scan on extremely_skewed s  (cost=0.00..167.01 rows=1 width=4)
Plan 5
  ->  Finalize Aggregate  (cost=510.92..510.93 rows=1 width=8)
        ->  Gather  (cost=510.80..510.91 rows=1 width=8)
              Workers Planned: 1
              ->  Partial Aggregate  (cost=510.80..510.81 rows=1 width=8)
                    ->  Parallel Hash Join  (cost=167.02..510.80 rows=1 width=0)
                          Hash Cond: (r.id = s.id)
                          ->  Parallel Seq Scan on simple r  (cost=0.00..299.65 rows=11765 width=4)
                          ->  Parallel Hash  (cost=167.01..167.01 rows=1 width=4)
                                ->  Parallel Seq Scan on extremely_skewed s  (cost=0.00..167.01 rows=1 width=4)

Post patch, the plan candidates became:

Plan 4
  ->  Finalize Aggregate  (cost=511.02..511.03 rows=1 width=8)
        ->  Gather  (cost=510.80..511.01 rows=2 width=8)
              Workers Planned: 1
              ->  Partial Aggregate  (cost=510.80..510.81 rows=1 width=8)
                    ->  Parallel Hash Join  (cost=167.02..510.80 rows=1 width=0)
                          Hash Cond: (r.id = s.id)
                          ->  Parallel Seq Scan on simple r  (cost=0.00..299.65 rows=11765 width=4)
                          ->  Parallel Hash  (cost=167.01..167.01 rows=1 width=4)
                                ->  Parallel Seq Scan on extremely_skewed s  (cost=0.00..167.01 rows=1 width=4)
Plan 5
  ->  Aggregate  (cost=511.01..511.02 rows=1 width=8)
        ->  Gather  (cost=167.02..511.00 rows=2 width=0)
              Workers Planned: 1
              ->  Parallel Hash Join  (cost=167.02..510.80 rows=1 width=0)
                    Hash Cond: (r.id = s.id)
                    ->  Parallel Seq Scan on simple r  (cost=0.00..299.65 rows=11765 width=4)
                    ->  Parallel Hash  (cost=167.01..167.01 rows=1 width=4)
                          ->  Parallel Seq Scan on extremely_skewed s  (cost=0.00..167.01 rows=1 width=4)

The FinalizeAggreg
Use pgBufferUsage for block reporting in analyze
Hi,

Analyze logs within autovacuum use the specific variables
VacuumPage{Hit,Miss,Dirty} to track buffer usage counts. However,
pgBufferUsage already provides block usage tracking and handles more cases
(temporary tables, parallel workers...).

Those variables were only used in two places: block usage reporting in
verbose vacuum and analyze. 5cd72cc0c5017a9d4de8b5d465a75946da5abd1d
removed their usage in the vacuum command as part of a bugfix.

This patch replaces those vacuum-specific variables with pgBufferUsage in
analyze. This makes VacuumPage{Hit,Miss,Dirty} unused and removable, so
this commit removes both their calls in bufmgr and their declarations.

Regards,
Anthonin

v1-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data
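The measurement pattern pgBufferUsage enables is a plain snapshot-and-diff;
here is a simplified self-contained sketch (the struct is reduced to three
of instrument.h's fields, and the numbers are made up):

#include <stdio.h>

typedef struct BufferUsage
{
    long shared_blks_hit;
    long shared_blks_read;
    long shared_blks_dirtied;
} BufferUsage;

static BufferUsage pgBufferUsage;   /* global counters, bumped by bufmgr */

int
main(void)
{
    BufferUsage start = pgBufferUsage;      /* snapshot before the work */

    /* ... analyze does its sampling here; bufmgr bumps the globals ... */
    pgBufferUsage.shared_blks_hit += 533;
    pgBufferUsage.shared_blks_read += 1128;
    pgBufferUsage.shared_blks_dirtied += 1;

    /* report the delta, as do_analyze_rel() would */
    printf("buffer usage: %ld hits, %ld reads, %ld dirtied\n",
           pgBufferUsage.shared_blks_hit - start.shared_blks_hit,
           pgBufferUsage.shared_blks_read - start.shared_blks_read,
           pgBufferUsage.shared_blks_dirtied - start.shared_blks_dirtied);
    return 0;
}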
Re: Fix parallel vacuum buffer usage reporting
On Wed, May 1, 2024 at 5:37 AM Masahiko Sawada wrote:
> Thank you for further testing! I've pushed the patch.

Thanks! Here is the rebased version of the follow-up patch removing the VacuumPage variables. Though I'm not sure if I should create a dedicated mail thread, since the bug was fixed and the follow-up is more of a refactoring. What do you think?

v1-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch Description: Binary data
Re: Fix parallel vacuum buffer usage reporting
I've done some additional tests to validate the reported numbers. Using pg_statio, it's possible to get the minimum number of block hits (full script attached).

-- Save block hits before vacuum
SELECT pg_stat_force_next_flush();
SELECT heap_blks_hit, idx_blks_hit FROM pg_statio_all_tables WHERE relname='vestat' \gset
vacuum (verbose, index_cleanup on) vestat;
-- Check the difference
SELECT pg_stat_force_next_flush();
SELECT heap_blks_hit - :heap_blks_hit AS delta_heap_hit,
       idx_blks_hit - :idx_blks_hit AS delta_idx_hit,
       heap_blks_hit - :heap_blks_hit + idx_blks_hit - :idx_blks_hit AS sum
FROM pg_statio_all_tables WHERE relname='vestat';

Output:
...
buffer usage: 14676 hits, 0 misses, 667 dirtied
buffer usage (new): 16081 hits, 0 misses, 667 dirtied
...
-[ RECORD 1 ]--+------
delta_heap_hit | 9747
delta_idx_hit  | 6325
sum            | 16072

From pg_statio, we had 16072 block hits for the relation plus its indexes. Pre-patch, we were under-reporting with 14676. Post-patch, we have 16081. The 9 additional block hits come from vacuum accessing catalog tables like pg_class or pg_class_oid_index.

vacuum_block_with_pgstatio.sql Description: Binary data
Re: Fix parallel vacuum buffer usage reporting
On Wed, Apr 24, 2024 at 4:01 PM Alena Rybakina wrote:
> I tested the main postgres branch with and without your fix using a script
> that was written by me. It consists of five scenarios and I made a
> comparison in the logs between the original version of the master branch
> and the master branch with your patch:

Hi! Thanks for the tests.

> I have attached a test file (vacuum_check_logs.sql)

Looking at the script, you won't trigger the problem. The reporting issue will only happen if there's a parallel index vacuum, and that requires at least two indexes [0]. You will need to create an additional index.

> The same script was run, but using vacuum verbose analyze, and I saw the
> difference again in the fifth step:
> with your patch: buffer usage: 32312 hits, 607 misses, 1566 dirtied
> master: buffer usage: 32346 hits, 573 misses, 1360 dirtied

Isn't there a chance for the checkpointer to run during this time? That could make the conditions between the two runs slightly different and explain the change in the buffer report.

[0] https://github.com/postgres/postgres/blob/8a1b31e6e59631807a08a4e9465134c343bbdf5e/src/backend/access/heap/vacuumlazy.c#L2826-L2831
Re: Fix parallel vacuum buffer usage reporting
Thanks for the review!

> I think that if the analyze command doesn't have the same issue, we
> don't need to change it.

Good point. I had wrongly assumed that analyze was also impacted, but there's no parallel analyze, so the block count is correct.

> (a) make lazy vacuum use BufferUsage instead of
> VacuumPage{Hit,Miss,Dirty}. (backpatched down to pg13).
> (b) make analyze use BufferUsage and remove VacuumPage{Hit,Miss,Dirty}
> variables for consistency and simplicity (only for HEAD, if we agree).

I've isolated the fix in the first patch. I've kept the analyze change and the removal of the VacuumPage* variables split for clarity's sake.

> BTW I realized that VACUUM VERBOSE running on a temp table always
> shows the number of dirtied buffers being 0, which seems to be a bug.
> The patch (a) will resolve it as well.

Indeed, it is visible with the following test:

SET temp_buffers='1024';
CREATE TEMPORARY TABLE vacuum_fix (aid int, bid int);
INSERT INTO vacuum_fix SELECT *, * FROM generate_series(1, 100);
VACUUM vacuum_fix;
UPDATE vacuum_fix SET bid = bid+1;
VACUUM (VERBOSE, INDEX_CLEANUP ON) vacuum_fix;

Pre-patch:
avg read rate: 254.751 MB/s, avg write rate: 0.029 MB/s
buffer usage: 8853 hits, 8856 misses, 1 dirtied
WAL usage: 1 records, 1 full page images, 3049 bytes

The dirtied page is from pg_class (so not a local buffer).

With the fix:
avg read rate: 250.230 MB/s, avg write rate: 250.230 MB/s
buffer usage: 8853 hits, 8856 misses, 8856 dirtied
WAL usage: 1 records, 1 full page images, 3049 bytes

v4-0002-Use-pgBufferUsage-for-analyze-block-reporting.patch Description: Binary data
v4-0001-Fix-parallel-and-temp-vacuum-buffer-usage-reporti.patch Description: Binary data
v4-0003-Remove-unused-variables-for-vacuum-buffer-usage.patch Description: Binary data
Re: Fix parallel vacuum buffer usage reporting
On Sat, Apr 20, 2024 at 2:00 PM Alena Rybakina wrote:
> Hi, thank you for your work with this subject.
>
> While I was reviewing your code, I noticed that your patch conflicts with
> another patch [0] that has been committed.
>
> [0] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

I've rebased the patch and also split the changes:
1: Use pgBufferUsage in the vacuum and analyze block reporting
2: Stop tracking buffer usage in the now unused VacuumPage{Hit,Miss,Dirty} variables
3: Remove the declarations of VacuumPage{Hit,Miss,Dirty}

v3-0003-Remove-declaration-of-Vacuum-block-usage-tracking.patch Description: Binary data
v3-0001-Fix-parallel-vacuum-buffer-usage-reporting.patch Description: Binary data
v3-0002-Remove-unused-tracking-of-vacuum-buffer-usage.patch Description: Binary data
Re: Fix parallel vacuum buffer usage reporting
Hi,

Thanks for the review.

On Thu, Mar 28, 2024 at 4:07 AM Masahiko Sawada wrote:
> As for the proposed patch, the following part should handle the temp
> tables too:

True, I had missed the local blocks. I've updated the patch:
- read_rate and write_rate now include local block usage
- I've added a dedicated output line for local blocks instead of combining them with the shared-buffer numbers

Regards,
Anthonin

v2-0001-Fix-parallel-vacuum-buffer-usage-reporting.patch Description: Binary data
Re: POC: Extension for adding distributed tracing - pg_tracing
Hi all! Thanks for the reviews and comments. > - pg_tracing.c should include postgres.h as the first thing Will do. > afaict none of the use of volatile is required, spinlocks have been barriers > for a long time now Got it, I will remove them. I've been mimicking what was done in pg_stat_statements and all spinlocks are done on volatile variables [1]. > - I don't think it's a good idea to do memory allocations in the middle of a > PG_CATCH. If the error was due to out-of-memory, you'll throw another error. Good point. I was wondering what were the risks of generating spans for errors. I will try to find a better way to handle this. To give a quick update: following Heikki's suggestion, I'm currently working on getting pg_tracing as a separate extension on github. I will send an update when it's ready. [1]: https://github.com/postgres/postgres/blob/master/contrib/pg_stat_statements/pg_stat_statements.c#L2000 On Fri, Feb 9, 2024 at 7:50 PM Andres Freund wrote: > > Hi, > > On 2024-02-09 19:46:58 +0300, Nikita Malakhov wrote: > > I agree with Heikki on most topics and especially the one he recommended > > to publish your extension on GitHub, this tool is very promising for highly > > loaded > > systems, you will get a lot of feedback very soon. > > > > I'm curious about SpinLock - it is recommended for very short operations, > > taking up to several instructions, and docs say that for longer ones it > > will be > > too expensive, and recommends using LWLock. Why have you chosen SpinLock? > > Does it have some benefits here? > > Indeed - e.g. end_tracing() looks to hold the spinlock for far too long for > spinlocks to be appropriate. Use an lwlock. > > Random stuff I noticed while skimming: > - pg_tracing.c should include postgres.h as the first thing > > - afaict none of the use of volatile is required, spinlocks have been barriers > for a long time now > > - you acquire the spinlock for single increments of n_writers, perhaps that > could become an atomic, to reduce contention? > > - I don't think it's a good idea to do memory allocations in the middle of a > PG_CATCH. If the error was due to out-of-memory, you'll throw another error. > > > Greetings, > > Andres Freund
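Following Andres' suggestion above, here is a minimal sketch of what moving n_writers to an atomic could look like (the shared-state struct and function names are hypothetical; pg_atomic_uint32 and the fetch-add/sub functions are the real port/atomics.h API):

#include "postgres.h"
#include "port/atomics.h"

typedef struct pgTracingSharedState
{
	pg_atomic_uint32 n_writers;	/* concurrent writers to the span buffer */
	/* ... other fields, protected by an LWLock instead of a spinlock ... */
} pgTracingSharedState;

/* previously: SpinLockAcquire(&s->mutex); s->n_writers++; SpinLockRelease(&s->mutex); */
static void
enter_span_writer(pgTracingSharedState *state)
{
	pg_atomic_fetch_add_u32(&state->n_writers, 1);
}

static void
exit_span_writer(pgTracingSharedState *state)
{
	pg_atomic_fetch_sub_u32(&state->n_writers, 1);
}

The counter would be initialized once at shmem creation with pg_atomic_init_u32(&state->n_writers, 0).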
Fix expecteddir argument in pg_regress
Hi all!

pg_regress accepts the expecteddir argument. However, it is never used: outputdir is used instead to build the expected-file paths. This patch fixes this and uses the expecteddir argument as intended.

Regards,
Anthonin

v01-0001-pg_regress-Use-expecteddir-for-the-expected-file.patch Description: Binary data
Fix parallel vacuum buffer usage reporting
Hi,

With a db set up with pgbench, we add an additional index:
CREATE INDEX ON pgbench_accounts(abalance);

Then trigger several updates and a vacuum to reach a stable amount of dirtied pages:
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid=1;
CHECKPOINT;
VACUUM (VERBOSE, INDEX_CLEANUP ON) pgbench_accounts;

The vacuum will report the following:
INFO:  vacuuming "postgres.public.pgbench_accounts"
INFO:  launched 1 parallel vacuum worker for index vacuuming (planned: 1)
INFO:  finished vacuuming "postgres.public.pgbench_accounts": index scans: 1
...
buffer usage: 122 hits, 165 misses, 4 dirtied

4 pages were reported dirtied. However, we have 5 dirtied blocks at the end of the vacuum when looking at pg_buffercache:

SELECT c.relname, b.relfilenode
FROM pg_buffercache b
LEFT JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE isdirty=true;

            relname            | relfilenode
-------------------------------+-------------
 pg_class                      |        1259
 pgbench_accounts              |       16400
 pgbench_accounts              |       16400
 pgbench_accounts_pkey         |       16408
 pgbench_accounts_abalance_idx |       16480

The missing dirty block comes from the parallel worker vacuuming the abalance index. Running vacuum with parallel disabled will give the correct result.

Vacuum uses dedicated VacuumPage{Hit,Miss,Dirty} globals to track buffer usage. However, those values are not collected at the end of parallel vacuum workers, leading to an incorrect buffer count. Those vacuum-specific globals are redundant with the existing pgBufferUsage and are only used in the verbose output. This patch removes them and replaces them with pgBufferUsage, which is already correctly collected at the end of parallel workers, fixing the buffer count (a sketch of that mechanism follows below).

Regards,
Anthonin

v01-0001-Fix-parallel-vacuum-buffer-usage-reporting.patch Description: Binary data
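For context, here is a minimal sketch of why pgBufferUsage is already correct for parallel workers. It mirrors the InstrStartParallelQuery/InstrEndParallelQuery/InstrAccumParallelQuery pattern used for parallel operations (those functions are the existing instrument.h API; the shared-memory array names are illustrative):

#include "postgres.h"
#include "access/parallel.h"
#include "executor/instrument.h"

/* Worker side: publish this worker's buffer/WAL usage delta to shared memory. */
static void
worker_publish_usage(BufferUsage *buffer_usage, WalUsage *wal_usage)
{
	InstrStartParallelQuery();
	/* ... the worker's index-vacuuming work would happen here ... */
	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
						  &wal_usage[ParallelWorkerNumber]);
}

/* Leader side, after waiting for the workers: fold each delta back in. */
static void
leader_accumulate_usage(int nworkers, BufferUsage *buffer_usage, WalUsage *wal_usage)
{
	for (int i = 0; i < nworkers; i++)
		InstrAccumParallelQuery(&buffer_usage[i], &wal_usage[i]);
}

The VacuumPage{Hit,Miss,Dirty} globals sit outside this mechanism, which is why the worker's counts never make it back to the leader's verbose output.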
Re: POC: Extension for adding distributed tracing - pg_tracing
Hi!

On Fri, Jan 26, 2024 at 1:43 PM Nikita Malakhov wrote:
> It's a good idea to split a big patch into several smaller ones.
> But you've already implemented these features - you could provide them
> as sequential small patches (i.e. v13-0002-guc-context-propagation.patch and
> so on)

Keeping a consistent set of patches is going to be hard, as any change in the first patch would require rebases that are likely to conflict. I haven't discarded the features, but keeping them separate could help with the discussion.

On Fri, Feb 2, 2024 at 1:41 PM Heikki Linnakangas wrote:
>
> Hi Anthonin,
>
> I'm only now catching up on this thread. Very exciting feature!

Thanks! And thanks for the comments.

> - You're relying on various existing hooks, but it might be better to
> have the 'begin_span'/'end_span' calls directly in PostgreSQL source
> code in the right places. There are more places where spans might be
> nice, like when performing a large sort in tuplesort.c to name one
> example. Or even across every I/O; not sure how much overhead the spans
> incur and how much we'd be willing to accept. 'begin_span'/'end_span'
> could be new hooks that normally do nothing, but your extension would
> implement them.

Having information that would only be available to tracing doesn't sound great. It should probably be added to the core instrumentation, which could then be leveraged by tracing and other systems. One of the features I initially implemented was to track parse start and end. However, since only the post-parse hook is available, I could only infer when the parse started. Changing the core to track parse start would make this much easier, and it could be used by EXPLAIN and pg_stat_statements. In the end, I've tried to match what's done in pg_stat_statements since the core idea is the same, except that pg_tracing outputs spans instead of statistics.

> - How to pass the tracing context from the client? You went with
> SQLCommenter and others proposed a GUC. A protocol extension would make
> sense too. I can see merits to all of those. It probably would be good
> to support multiple mechanisms, but some might need core changes. It
> might be good to implement the mechanism for accepting trace context in
> core. Without the extension, we could still include the trace ID in the
> logs, for example.

I went with SQLCommenter for the initial implementation since it has the benefit of already being implemented and working out of the box. Jelte has an ongoing patch to add the possibility to modify GUC values through a protocol message [1]. I've tested it and it worked well without adding too much complexity to pg_tracing. I imagine a dedicated traceparent GUC variable could also be used by the core to propagate the trace ID into the logs and other places.

> - Include EXPLAIN plans in the traces.

This was already implemented in the first versions of the patch (or rather, the planstate was translated to spans). I've removed it in the latest patch as it was contributing 50% of the code and also required a core change, namely a change in instrument.c to track the first start of a node. Though if you were thinking of the raw EXPLAIN ANALYZE text dumped as span metadata, that would be doable.

> # Comments on the implementation
>
> There was discussion already on push vs pull model. Currently, you
> collect the traces in memory / files, and require a client to poll them.
> A push model is more common in applications that support tracing. If
> Postgres could connect directly to the OTEL collector, you'd need one
> fewer running component. You could keep the traces in backend-private
> memory, no need to deal with shared memory and spilling to files.

PostgreSQL metrics are currently exposed through tables like pg_stat_*, and users likely have collectors pulling those metrics (like the OTEL PostgreSQL Receiver [2]). Having this collector also pull traces would keep the logic in a single place and would avoid a situation where both pull and push are used.

> 2. Start a new pgsql-hackers thread on in-core changes that would
> benefit the extension. Write patches for them. I'm thinking of the
> things I listed in the Core changes section above, but maybe there are
> others.

I already have an ongoing patch (more support for extended protocol in psql [3], needed for tests) and plan to do others (the parse changes I've mentioned). Though I also don't want to spread myself too thin.

[1]: https://commitfest.postgresql.org/46/4736/
[2]: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/postgresqlreceiver/README.md
[3]: https://commitfest.postgresql.org/46/4650/

On Fri, Feb 2, 2024 at 5:42 PM Jan Katins wrote:
>
> Hi!
>
> On Fri, 2 Feb 2024 at 13:41, Heikki Linnakangas wrote:
>>
>> PS. Does any other DBMS implement this? Any lessons to be learned from them?
>
> Mysql 8.3 has an open telemetry component, so very fresh:
>
> https://dev.mysql.com/doc/refman/8.3/en/telemetry-trace-install.html
> https:
Re: A performance issue with Memoize
Hi,

I've seen a similar issue with the following query (tested on the current head):

EXPLAIN ANALYZE SELECT * FROM tenk1 t1
LEFT JOIN LATERAL (SELECT t1.two, tenk2.hundred, tenk2.two FROM tenk2) t2
ON t1.hundred = t2.hundred WHERE t1.hundred < 5;

                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=8.46..1495.10 rows=50000 width=256) (actual time=0.860..111.013 rows=50000 loops=1)
   ->  Bitmap Heap Scan on tenk1 t1  (cost=8.16..376.77 rows=500 width=244) (actual time=0.798..1.418 rows=500 loops=1)
         Recheck Cond: (hundred < 5)
         Heap Blocks: exact=263
         ->  Bitmap Index Scan on tenk1_hundred  (cost=0.00..8.04 rows=500 width=0) (actual time=0.230..0.230 rows=500 loops=1)
               Index Cond: (hundred < 5)
   ->  Memoize  (cost=0.30..4.89 rows=100 width=12) (actual time=0.009..0.180 rows=100 loops=500)
         Cache Key: t1.hundred
         Cache Mode: logical
         Hits: 0  Misses: 500  Evictions: 499  Overflows: 0  Memory Usage: 5kB
         ->  Index Scan using tenk2_hundred on tenk2  (cost=0.29..4.88 rows=100 width=12) (actual time=0.007..0.124 rows=100 loops=500)
               Index Cond: (hundred = t1.hundred)
 Planning Time: 0.661 ms
 Execution Time: 113.076 ms
(14 rows)

The Memoize node's cache key only uses t1.hundred, while the nested loop has two changed parameters: the lateral var t1.two and t1.hundred. This leads to a chgParam that always differs from the cache key set, so the cache is purged on each rescan (the relevant rescan logic is sketched after the quoted thread below).

Regards,
Anthonin

On Sat, Jan 27, 2024 at 5:00 AM Alexander Lakhin wrote:
>
> Hello,
>
> 27.01.2024 00:09, Tom Lane wrote:
> > David Rowley writes:
> >> On Sat, 27 Jan 2024 at 09:41, Tom Lane wrote:
> >>> drongo and fairywren are consistently failing the test case added
> >>> by this commit. I'm not quite sure why the behavior of Memoize
> >>> would be platform-specific when we're dealing with integers,
> >>> but ...
> >> Maybe snprintf(buf, "%.*f", 0, 5.0 / 2.0); results in "3" on those
> >> rather than "2"?
> >> Looking at the code in fmtfloat(), we fallback on the built-in snprintf.
> > Maybe ... I don't have a better theory.
>
> FWIW, I've found where this behaviour is documented:
> https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/sprintf-sprintf-l-swprintf-swprintf-l-swprintf-l?view=msvc-170
>
> (I've remembered a case with test/sql/partition_prune from 2020, where
> sprintf on Windows worked the other way.)
>
>
> Best regards,
> Alexander
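For reference, here is a sketch of the rescan behaviour involved. It paraphrases the logic around ExecReScanMemoize() in nodeMemoize.c as I understand it (treat it as an illustration rather than an exact excerpt):

/*
 * On rescan, Memoize keeps its cache only when every changed parameter
 * is part of the cache key.  Here chgParam contains both t1.hundred (a
 * cache key) and the lateral var t1.two (not a cache key), so the set
 * difference is never empty and the whole cache is purged on every
 * rescan, giving the 500 misses / 499 evictions above.
 */
static void
memoize_rescan_sketch(MemoizeState *node)
{
	if (bms_nonempty_difference(node->ss.ps.chgParam, node->keyparamids))
		cache_purge_all(node);	/* drop every cached entry */

	/* ... normal rescan processing continues ... */
}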
Re: Add \syncpipeline command to pgbench
That looks like a bug in how open pipelines are not caught at the end of script processing. \startpipeline seems to have similar related issues.

$ cat only_startpipeline.sql
\startpipeline
SELECT 1;

With only 1 transaction, pgbench will consider this a success despite not sending anything, since the pipeline was never flushed:

pgbench -t1 -Mextended -f only_startpipeline.sql
[...]
number of transactions per client: 1
number of transactions actually processed: 1/1

With 2 transactions, the error will happen when \startpipeline is called a second time:

pgbench -t2 -Mextended -f only_startpipeline.sql
[...]
pgbench: error: client 0 aborted in command 0 (startpipeline) of script 0; already in pipeline mode
number of transactions per client: 2
number of transactions actually processed: 1/2

I've split the changes into two patches.
0001 introduces a new error when the end of a pgbench script is reached while there's still an ongoing pipeline (a sketch of the check is at the end of this mail).
0002 adds the \syncpipeline command (the original patch with an additional test case).

Regards,
Anthonin

On Mon, Jan 22, 2024 at 7:16 AM Michael Paquier wrote:
>
> On Fri, Jan 19, 2024 at 08:55:31AM +0100, Anthonin Bonnefoy wrote:
> > I've updated the doc to group the commands. It does look better and
> > more consistent with similar command groups like \if.
>
> I was playing with a few meta command scenarios while looking at this
> patch, and this sequence generates an error that should never happen:
> $ cat /tmp/test.sql
> \startpipeline
> \syncpipeline
> $ pgbench -n -f /tmp/test.sql -M extended
> [...]
> pgbench: error: unexpected transaction status 1
> pgbench: error: client 0 aborted while receiving the transaction status
>
> It looks to me that we need to be much smarter than that for the error
> handling we'd need when a sync request is optionally sent when a
> transaction stops at the end of pgbench. Could you look at it?
> --
> Michael

v3-0001-Abort-pgbench-if-script-end-is-reached-with-an-op.patch Description: Binary data
v3-0002-pgbench-Add-syncpipeline-to-pgbench.patch Description: Binary data
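For reference, here is a sketch of the kind of end-of-script check patch 0001 adds (the client-state names are illustrative of pgbench's internals; PQpipelineStatus is the real libpq call for detecting an open pipeline):

/* At the end of a script iteration, fail if a pipeline was left open. */
if (PQpipelineStatus(st->con) != PQ_PIPELINE_OFF)
{
	pg_log_error("client %d aborted: end of script reached with an open pipeline",
				 st->id);
	st->state = CSTATE_ABORTED;
}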
Re: Add \syncpipeline command to pgbench
On Fri, Jan 19, 2024 at 5:08 AM Michael Paquier wrote:
> The logic looks sound, but I have a comment about the docs: could it be
> better to group \syncpipeline with \startpipeline and \endpipeline?
> \syncpipeline requires a pipeline to work.

I've updated the doc to group the commands. It does look better and more consistent with similar command groups like \if.

Regards,
Anthonin

v2-0001-pgbench-Add-syncpipeline-to-pgbench.patch Description: Binary data
Add \syncpipeline command to pgbench
Hi, PQsendSyncMessage() was added in https://commitfest.postgresql.org/46/4262/. It allows users to add a Sync message without flushing the buffer. As a follow-up, this change adds an additional meta-command to pgbench, \syncpipeline, which will call PQsendSyncMessage(). This will make it possible to benchmark impact and improvements of using PQsendSyncMessage through pgbench. Regards, Anthonin v1-0001-pgbench-Add-syncpipeline-to-pgbench.patch Description: Binary data
Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx
> Hmm. So it does not lead to any user-visible changes, right?

From what I can tell, there's no change in the behaviour. All paths would eventually go through HandleSlashCmds's cleaning logic. This is also mentioned in ignore_slash_options's comment:

 * Read and discard "normal" slash command options.
 *
 * This should be used for inactive-branch processing of any slash command
 * that eats one or more OT_NORMAL, OT_SQLID, or OT_SQLIDHACK parameters.
 * We don't need to worry about exactly how many it would eat, since the
 * cleanup logic in HandleSlashCmds would silently discard any extras anyway.

> Could you split the bits for \bind into a separate patch, please?
> This requires a separate evaluation, especially if this had better be
> backpatched.

Done. Patch 1 adds ignore_slash_options to \bind. Patch 2 adds the new \bindx, \close and \parse commands.

v4-0002-psql-Add-support-for-prepared-stmt-with-extended-.patch Description: Binary data
v4-0001-psql-Add-ignore_slash_options-in-bind-s-inactive-.patch Description: Binary data
Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx
> I do realize the same is true for plain \bind, but it seems
> like a bug there too.

The unscanned \bind parameters are discarded later in the HandleSlashCmds functions, so adding ignore_slash_options() for inactive branches scans and discards them earlier. I will add it to match what's done in the other commands, but I don't think it's testable, as the behaviour is the same unless I'm missing something. I did add \bind, \bindx, \close and \parse to the inactive-branch tests to complete the list.

> One more usability thing. I think \parse and \close should not require
> a \g to send the message. You can do that by returning PSQL_CMD_SEND
> instead of PSQL_CMD_SKIP_LINE

Changed.

> I think the examples for \bindx and \close
> should use \parse instead of PREPARE

Done. I had to rely on manual PREPARE for my first tests and it leaked into the docs.

v3-0001-psql-Add-support-for-prepared-stmt-with-extended-.patch Description: Binary data
Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx
Hi, Thanks for the review and comments. > One thing that I think should be added for completeness though is the > ability to deallocate the prepared statement using > PQsendClosePrepared. Other than that the changes look great. Good point, I've added the \close command. > Also a tiny nitpick: stmt_name should be replaced with STMT_NAME in > this line of the help message. Fixed On Sat, Jan 13, 2024 at 3:37 PM Jelte Fennema-Nio wrote: > > On Thu, 2 Nov 2023 at 10:52, Anthonin Bonnefoy > wrote: > > The main goal is to provide more ways to test extended protocol in > > regression tests > > similarly to what \bind is doing. > > I think this is a great addition. One thing that I think should be > added for completeness though is the ability to deallocate the > prepared statement using PQsendClosePrepared. Other than that the > changes look great. > > Also a tiny nitpick: stmt_name should be replaced with STMT_NAME in > this line of the help message. > > > + HELP0(" \\bindx stmt_name [PARAM]...\n" v2-0001-psql-Add-support-for-prepared-stmt-with-extended-.patch Description: Binary data
Re: POC: Extension for adding distributed tracing - pg_tracing
Hi, > This approach avoids the need to rewrite SQL or give special meaning to SQL > comments. SQLCommenter already has a good amount of support and is referenced in opentelemetry https://github.com/open-telemetry/opentelemetry-sqlcommenter So the goal was more to leverage the existing trace propagation systems as much as possible. > I used GUCs to set the `opentelemtery_trace_id` and `opentelemetry_span_id`. > These can be sent as protocol-level messages by the client driver if the > client driver has native trace integration enabled, in which case the user > doesn't need to see or care. And for other drivers, the application can use > `SET` or `SET LOCAL` to assign them. Emitting `SET` and `SET LOCAL` for every traced statement sounds more disruptive and expensive than relying on SQLCommenter. When not using `SET LOCAL`, the variables also need to be cleared. This will also introduce a lot of headaches as the `SET` themselves could be traced and generate spans when tracing utility statements is already tricky enough. > But IIRC the current BDR/PGD folks are now using an OpenTelemetry > sidecar process instead. I think they send it UDP packets to emit > metrics and events. I would be curious to hear more about this. I didn't want to be tied to a specific protocol (I've only seen gRPC and http support on latest opentelemetry collectors) and I fear that relying on Postgres sending traces would lower the chance of having this supported on managed Postgres solutions (like RDS and CloudSQL). > By queries you mean particular queries, not transactions? And since > they have an assigned ID it means that the query is already executing > and we want to enable the tracking in another session, right? I think that was the idea. The query identifier generated for a specific statement is stable so we could have a GUC variable with a list of query id and only queries matching the provided query ids would be sampled. This would make it easier to generate traces for a specific query within a session. On Tue, Jan 2, 2024 at 1:14 PM Aleksander Alekseev wrote: > > Hi, > > > Overall solution looks good for me except SQL Commenter - query > > instrumentation > > with SQL comments is often not possible on production systems. Instead > > the very often requested feature is to enable tracing for a given single > > query ID, > > or several (very limited number of) queries by IDs. It could be done by > > adding > > SQL function to add and remove query ID into a list (even array would do) > > stored in top tracing context. > > Not 100% sure if I follow. > > By queries you mean particular queries, not transactions? And since > they have an assigned ID it means that the query is already executing > and we want to enable the tracking in another session, right? If this > is the case I would argue that enabling/disabling tracing for an > _already_ running query (or transaction) would be way too complicated. > > I wouldn't mind enabling/disabling tracing for the current session > and/or a given session ID. In the latter case it can have an effect > only for the new transactions. This however could be implemented > separately from the proposed patchset. I suggest keeping the scope > small. > > -- > Best regards, > Aleksander Alekseev
Re: Possible segfault when sending notification within a ProcessUtility hook
On Tue, Dec 5, 2023 at 9:03 PM Tom Lane wrote:
> Why should we regard that as anything other than a bug in the
> ProcessUtility hook? A failed transaction should not send any
> notifies.

Fair point. That was also my initial assumption, but I thought that the transaction state was not available from a hook, as I had missed IsAbortedTransactionBlockState. I will rely on IsAbortedTransactionBlockState to avoid this case, thanks for the input.

Regards,
Anthonin.
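To make the fix concrete, here is a minimal sketch of a guarded hook (the hook and channel names are hypothetical; IsAbortedTransactionBlockState and Async_Notify are the real APIs, and the hook signature assumes a recent PostgreSQL version):

#include "postgres.h"
#include "fmgr.h"
#include "access/xact.h"
#include "commands/async.h"
#include "tcop/utility.h"

PG_MODULE_MAGIC;

static ProcessUtility_hook_type prev_ProcessUtility = NULL;

static void
notify_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
					  bool readOnlyTree, ProcessUtilityContext context,
					  ParamListInfo params, QueryEnvironment *queryEnv,
					  DestReceiver *dest, QueryCompletion *qc)
{
	/* A failed transaction block must not queue notifications */
	if (!IsAbortedTransactionBlockState())
		Async_Notify("utility_audit", queryString);

	if (prev_ProcessUtility)
		prev_ProcessUtility(pstmt, queryString, readOnlyTree, context,
							params, queryEnv, dest, qc);
	else
		standard_ProcessUtility(pstmt, queryString, readOnlyTree, context,
								params, queryEnv, dest, qc);
}

void
_PG_init(void)
{
	prev_ProcessUtility = ProcessUtility_hook;
	ProcessUtility_hook = notify_ProcessUtility;
}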
Possible segfault when sending notification within a ProcessUtility hook
Hi,

I've encountered the following segfault:

#0: 0x000104e821a8 postgres`list_head(l=0x7f7f7f7f7f7f7f7f) at pg_list.h:130:17
#1: 0x000104e81c9c postgres`PreCommit_Notify at async.c:932:16
#2: 0x000104dd02f8 postgres`CommitTransaction at xact.c:2236:2
#3: 0x000104dcfc24 postgres`CommitTransactionCommand at xact.c:3061:4
#4: 0x00010528a880 postgres`finish_xact_command at postgres.c:2777:3
#5: 0x0001052883ac postgres`exec_simple_query(query_string="notify test;") at postgres.c:1298:4

This happens when a transaction block fails and a ProcessUtility hook sends a notification during the rollback command.

When a transaction block fails, it enters the TBLOCK_ABORT state, waiting for a rollback. Calling rollback switches to the TBLOCK_ABORT_END state and will only go through CleanupTransaction. If a hook sends a notification during the rollback command, a notification will be queued, but its content will be wiped when the TopTransactionContext is destroyed. Trying to send a notification immediately after will segfault in PreCommit_Notify, as pendingNotifies->events will be invalid.

There's a test_notify_rollback test module attached to the patch that reproduces the issue.

Moving the notification cleanup from AbortTransaction to CleanupTransaction fixes the issue, as it clears pendingActions in the same function that destroys the TopTransactionContext.

Regards,
Anthonin

v1-0001-Fix-segfault-when-notifying-in-a-ProcessUtility-h.patch Description: Binary data
Re: Add PQsendSyncMessage() to libpq
Hi,

I've played a bit with the patch on my side. One thing that would be great would be to make this available in pgbench through a \syncpipeline meta-command. That would make it easier for users to test whether there's a positive impact with their queries or not. I've written a patch to add it to pgbench (I don't want to mess with the thread's attachments, so here's a GH link: https://github.com/bonnefoa/postgres/commit/047b5b05169e36361fe29fef9f430da045ef012d).

Here are some quick results:

echo "\set aid1 random(1, 10 * :scale)
\set aid2 random(1, 10 * :scale)
\startpipeline
select 1;
select * from pgbench_accounts where aid=:aid1;
select 2;
\syncpipeline
select 1;
select * from pgbench_accounts where aid=:aid2;
select 2;
\endpipeline" > /tmp/pipeline_without_flush.sql

pgbench -T30 -Mextended -f /tmp/pipeline_without_flush.sql -h127.0.0.1
latency average = 0.383 ms
initial connection time = 2.810 ms
tps = 2607.587877 (without initial connection time)

echo "\set aid1 random(1, 10 * :scale)
\set aid2 random(1, 10 * :scale)
\startpipeline
select 1;
select * from pgbench_accounts where aid=:aid1;
select 2;
\endpipeline
\startpipeline
select 1;
select * from pgbench_accounts where aid=:aid2;
select 2;
\endpipeline" > /tmp/pipeline_with_flush.sql

pgbench -T30 -Mextended -f /tmp/pipeline_with_flush.sql -h127.0.0.1
latency average = 0.437 ms
initial connection time = 2.602 ms
tps = 2290.527462 (without initial connection time)

I took some perfs, and the main change is that the server spends less time in ReadCommand, which makes sense since the commands are sent in a single TCP frame with the \syncpipeline version (a minimal libpq sketch of the difference follows at the end of this mail).

Regards,
Anthonin

On Sun, Nov 12, 2023 at 2:37 PM Anton Kirilov wrote:
>
> Hello,
>
> Thanks for the feedback!
>
> On 07/11/2023 09:23, Jelte Fennema-Nio wrote:
> > But I think it's looking at the situation from the wrong direction.
> [...] we should look at it as an addition to our current list of PQsend
> functions for a new packet type. And none of those PQsend functions ever
> needed a flag. Which makes sense, because they are the lowest level
> building blocks that make sense from a user perspective: They send a
> message type over the socket and don't do anything else.
>
> Yes, I think that this is quite close to my thinking when I created the
> original version of the patch. Also, the protocol specification states
> that the Sync message lacks parameters.
>
> Since there haven't been any comments from the other people who have
> chimed in on this e-mail thread, I will assume that there is consensus
> (we are doing a U-turn with the implementation approach after all), so
> here is the updated version of the patch.
>
> Best wishes,
> Anton Kirilov
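For anyone wanting to reproduce the difference outside pgbench, here is a minimal libpq sketch of the two behaviours (the connection string and queries are placeholders; PQsendSyncMessage is the function added by this patch, so this assumes a libpq built with it):

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");	/* placeholder conninfo */
	PGresult   *res;
	int			syncs_seen = 0;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		exit(1);
	}

	PQenterPipelineMode(conn);

	/* First batch: the \syncpipeline equivalent, queues a Sync without flushing */
	PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
	PQsendSyncMessage(conn);

	/*
	 * Second batch: the \endpipeline equivalent, queues a Sync and flushes,
	 * so both batches go out in a single write.
	 */
	PQsendQueryParams(conn, "SELECT 2", 0, NULL, NULL, NULL, NULL, 0);
	PQpipelineSync(conn);

	/* Consume results; each batch is terminated by a PGRES_PIPELINE_SYNC */
	while (syncs_seen < 2)
	{
		if ((res = PQgetResult(conn)) == NULL)
			continue;			/* end of one query's results */
		if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
			syncs_seen++;
		PQclear(res);
	}

	PQexitPipelineMode(conn);
	PQfinish(conn);
	return 0;
}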
[PATCH] Add additional extended protocol commands to psql: \parse and \bindx
Hi all!

Currently, only unnamed prepared statements are supported by psql with the \bind command, and it's not possible to create or use named prepared statements through the extended protocol. This patch introduces 2 additional commands: \parse and \bindx.

\parse issues a Parse message to create a named prepared statement through the extended protocol.
\bindx binds and executes a named prepared statement through the extended protocol.

The main goal is to provide more ways to test the extended protocol in regression tests, similarly to what \bind is doing.

Regards,
Anthonin

0001-psql-Add-support-for-prepared-stmt-with-extended-pro.patch Description: Binary data
Re: POC: Extension for adding distributed tracing - pg_tracing
> What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT
> and INSTR_TIME_GET_MILLISEC macros for timing calculations?

If you're talking about the two instances where I'm modifying the instr_time's ticks directly, it's because I can't use the macros there (a short sketch of both cases follows at the end of this mail). The first case is for the parse span: I only have the start timestamp from GetCurrentStatementStartTimestamp and don't have access to the start instr_time, so I need to build the duration from 2 timestamps. The second case is when building node spans from the planstate: I directly have the duration from Instrumentation. I guess one fix would be to use an int64 for the span duration to directly store nanoseconds instead of an instr_time, but I do use the instrumentation macros outside of those two cases to get the duration of other spans.

> Also, have you thought about a way to trace existing (running) queries
> without directly instrumenting them?

That's a good point. I was focusing on leaving the sampling decision to the caller through the sampled flag and only recently added the pg_tracing_sample_rate parameter to give more control. It should be straightforward to add an option to create standalone traces based on the sample rate alone. This way, setting the sample rate to 1 would force the queries running in the session to be traced.

On Fri, Jul 28, 2023 at 3:02 PM Nikita Malakhov wrote:
> Hi!
>
> What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT
> and INSTR_TIME_GET_MILLISEC
> macros for timing calculations?
>
> Also, have you thought about a way to trace existing (running) queries
> without directly instrumenting them? It would
> be much more usable for maintenance and support personnel, because in
> production environments you rarely could
> change query text directly. For the current state the most simple solution
> is switch tracing on and off by calling SQL
> function, and possibly switch tracing for prepared statement the same way,
> but not for any random query.
>
> I'll check the patch for the race conditions.
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/
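For reference, here is a small sketch of the two cases mentioned above (INSTR_TIME_* are the real macros from portability/instr_time.h; the wrapper functions are illustrative):

#include "postgres.h"
#include "portability/instr_time.h"
#include "utils/timestamp.h"

/* Case 1: both endpoints are instr_time values, so the macros apply. */
static double
elapsed_ms_with_macros(void)
{
	instr_time	start,
				duration;

	INSTR_TIME_SET_CURRENT(start);
	/* ... work to measure ... */
	INSTR_TIME_SET_CURRENT(duration);
	INSTR_TIME_SUBTRACT(duration, start);	/* duration -= start */
	return INSTR_TIME_GET_MILLISEC(duration);
}

/*
 * Case 2: only timestamps are available, e.g. the statement start from
 * GetCurrentStatementStartTimestamp().  The instr_time macros don't apply,
 * so the duration has to be built from the timestamp difference directly
 * (TimestampTz is in microseconds).
 */
static int64
elapsed_us_from_timestamps(TimestampTz start_ts, TimestampTz end_ts)
{
	return (int64) (end_ts - start_ts);
}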
Re: POC: Extension for adding distributed tracing - pg_tracing
Hi,

> I'd keep Autotools build up to date

Definitely. The patch is still in a rough state and updating Autotools fell through the cracks.

> I'm currently playing with the patch and
> reviewing sources, if you need any cooperation - please let us know.

The goal for me was to get validation on the design and to check if there was anything that would be a blocker before committing more time to the project. There are already several things I was planning on improving:
- The parse span is something I've added recently. I've realised that I could rely on the stmtStartTimestamp to get the time spent in parsing in the post-parse hook, so the code around this is still rough.
- There are probably race conditions on the reads and writes of the query file that need to be fixed.
- I'm using the same Span structure for everything, while executor spans only need a small subset, which is a bit wasteful. I've tried to use two different shared buffers (base spans and spans with counters) but it was making things more confusing.
- Finish commenting the code and start writing the doc.

So any comments, missing features and reviews on the current state of the project are welcome.

Regards,
Anthonin

On Fri, Jul 28, 2023 at 9:35 AM Nikita Malakhov wrote:
> Hi,
>
> I'd keep Autotools build up to date, because Meson is very limited in
> terms of not very
> up-to-date OS versions. Anyway, it's OK now. I'm currently playing with
> the patch and
> reviewing sources, if you need any cooperation - please let us know.
> I'm +1 for committing
> this patch after review.
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/
Re: POC: Extension for adding distributed tracing - pg_tracing
> Agree, something goes wrong when using Autotools (but not Meson) on
> both Linux and MacOS. I didn't investigate the issue though.

I was only using Meson and forgot to keep the Automake files up to date when I split pg_tracing.c into multiple files (span.c, explain.c...).

On Fri, Jul 28, 2023 at 8:10 AM Nikita Malakhov wrote:
> Hi,
>
> I've fixed the Autotools build, please check the patch below (v2).
>
> On Thu, Jul 27, 2023 at 6:39 PM Aleksander Alekseev <
> aleksan...@timescale.com> wrote:
>> Hi,
>>
>> > Also FYI, there are build warnings because functions
>> > const char * get_span_name(const Span * span, const char *qbuffer)
>> > and
>> > const char * get_operation_name(const Span * span, const char *qbuffer)
>> > do not have default inside switch and no return outside of switch.
>>
>> You are right, there are a few warnings:
>>
>> ```
>> [1566/1887] Compiling C object contrib/pg_tracing/pg_tracing.so.p/span.c.o
>> ../contrib/pg_tracing/span.c: In function ‘get_span_name’:
>> ../contrib/pg_tracing/span.c:210:1: warning: control reaches end of
>> non-void function [-Wreturn-type]
>> 210 | }
>> | ^
>> ../contrib/pg_tracing/span.c: In function ‘get_operation_name’:
>> ../contrib/pg_tracing/span.c:249:1: warning: control reaches end of
>> non-void function [-Wreturn-type]
>> 249 | }
>> | ^
>> ```
>>
>> Here is the patch v2 with a quick fix.
>>
>> > but got errors calling make check and cannot install the extension
>>
>> Agree, something goes wrong when using Autotools (but not Meson) on
>> both Linux and MacOS. I didn't investigate the issue though.
>>
>> --
>> Best regards,
>> Aleksander Alekseev
>
> --
> Regards,
>
> --
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/
Re: POC: Extension for adding distributed tracing - pg_tracing
I've initially thought of sending the spans from PostgreSQL, since this is the usual behavior of tracing libraries. However, this created a lot of potential issues:
- Protocol support and differences between trace collectors. OpenTelemetry seems to use gRPC, others are using HTTP, and those would require additional libraries (plus gRPC support in C doesn't look good), and any change in the protobuf definition would require updating the extension.
- Do we send the spans within the query hooks? This means that we could block the process if the trace collector is slow to answer or we can't connect. Sending spans from a background process sounded rather complex and resource heavy.

Moving to a pull model fixed those issues and felt more natural, as this is the way PostgreSQL exposes its metrics.

On Wed, Jul 26, 2023 at 4:11 PM Aleksander Alekseev <aleksan...@timescale.com> wrote:
> Nikita,
>
> > This patch looks very interesting, I'm working on the same subject too.
> But I've used
> > another approach - I'm using C wrapper for C++ API library from
> OpenTelemetry, and
> > handle span storage and output to this library. There are some nuances
> though, but it
> > is possible. Have you tried to use OpenTelemetry APIs instead of
> implementing all
> > functionality around spans?
>
> I don't think that PostgreSQL accepts such kind of C++ code, not to
> mention the fact that the PostgreSQL license is not necessarily
> compatible with Apache 2.0 (I'm not a lawyer; this is not a legal
> advice). Such a design decision will probably require using separate
> compile flags since the user doesn't necessarily have a corresponding
> dependency installed. Similarly to how we do with LLVM, OpenSSL, etc.
>
> So -1 to the OpenTelemetry C++ library and +1 to the properly licensed
> C implementation without 3rd party dependencies from me. Especially
> considering the fact that the implementation seems to be rather
> simple.
>
> --
> Best regards,
> Aleksander Alekseev
Re: Flush SLRU counters in checkpointer process
I think I've managed to reproduce the issue. The test I've added to check the SLRU flush was the one failing in the regression suite.

SELECT SUM(flushes) > :slru_flushes_before FROM pg_stat_slru;
 ?column?
----------
 t

The origin seems to be a race condition on have_slrustats (https://github.com/postgres/postgres/blob/c8e1ba736b2b9e8c98d37a5b77c4ed31baf94147/src/backend/utils/activity/pgstat_slru.c#L161-L162).

I will try to get a new patch with improved test stability.

On Mon, Jul 3, 2023 at 3:18 PM Daniel Gustafsson wrote:
> > On 3 Mar 2023, at 09:06, Anthonin Bonnefoy <anthonin.bonne...@datadoghq.com> wrote:
> >
> > Here's the patch rebased with Andres' suggestions.
> > Happy to update it if there's any additional change required.
>
> This patch crashes 031_recovery_conflict with a SIGInvalid on Windows, can you
> please investigate and see what might be going on there? The test passed about
> 4 days ago on Windows so unless it's the CI being flaky it should be due to a
> recent change.
>
> If you don't have access to a Windows environment you can run your own
> instrumented builds in your Github account with the CI files in the postgres
> repo.
>
> --
> Daniel Gustafsson
[PATCH] Add statement_timeout in pg_stat_activity
Hello hackers,

This patch adds the backend's statement_timeout value to pg_stat_activity. This would provide some insight into clients that are disabling a default statement timeout or overriding it through a pgbouncer, messing with other sessions. pg_stat_activity seemed like the best place to have this information.

Regards,
Anthonin

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d5a45f996d..2109595c40 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -861,6 +861,15 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
+
+
+      statement_timeout integer
+
+
+      The current backend's statement_timeout.
+
+
+
       wait_event_type text
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 574cbc2e44..7c24ef92bf 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -860,6 +860,7 @@ CREATE VIEW pg_stat_activity AS
             S.xact_start,
             S.query_start,
             S.state_change,
+            S.statement_timeout,
             S.wait_event_type,
             S.wait_event,
             S.state,
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index f0f2e07655..0a2ba26070 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1054,6 +1054,16 @@ assign_application_name(const char *newval, void *extra)
 	pgstat_report_appname(newval);
 }
 
+/*
+ * GUC assign_hook for statement_timeout
+ */
+void
+assign_statement_timeout(int newval, void *extra)
+{
+	/* Update the pg_stat_activity view */
+	pgstat_report_statement_timeout(newval);
+}
+
 /*
  * GUC check_hook for cluster_name
  */
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 608d01ea0d..5d94cb0ba6 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -336,6 +336,7 @@ pgstat_bestart(void)
 	lbeentry.st_activity_start_timestamp = 0;
 	lbeentry.st_state_start_timestamp = 0;
 	lbeentry.st_xact_start_timestamp = 0;
+	lbeentry.st_statement_timeout = 0;
 	lbeentry.st_databaseid = MyDatabaseId;
 
 	/* We have userid for client-backends, wal-sender and bgworker processes */
@@ -446,9 +447,11 @@ pgstat_bestart(void)
 
 	PGSTAT_END_WRITE_ACTIVITY(vbeentry);
 
-	/* Update app name to current GUC setting */
-	if (application_name)
+	/* Update app name and statement timeout to current GUC setting */
+	if (application_name) {
 		pgstat_report_appname(application_name);
+		pgstat_report_statement_timeout(StatementTimeout);
+	}
 }
 
 /*
@@ -692,6 +695,33 @@ pgstat_report_appname(const char *appname)
 	PGSTAT_END_WRITE_ACTIVITY(beentry);
 }
 
+/* --
+ * pgstat_report_statement_timeout() -
+ *
+ * Called to update statement_timeout.
+ * --
+ */
+void
+pgstat_report_statement_timeout(int statement_timeout)
+{
+	volatile PgBackendStatus *beentry = MyBEEntry;
+
+	if (!beentry)
+		return;
+
+	/*
+	 * Update my status entry, following the protocol of bumping
+	 * st_changecount before and after. We use a volatile pointer here to
+	 * ensure the compiler doesn't try to get cute.
+	 */
+	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+
+	beentry->st_statement_timeout = statement_timeout;
+
+	PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
+
+
 /*
  * Report current transaction start timestamp as the specified value.
  * Zero means there is no active transaction.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index eec9f3cf9b..de72c2c1d0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -303,7 +303,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_activity(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_ACTIVITY_COLS 30
+#define PG_STAT_GET_ACTIVITY_COLS 31
 	int			num_backends = pgstat_fetch_stat_numbackends();
 	int			curr_backend;
 	int			pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
@@ -612,6 +612,12 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 				nulls[29] = true;
 			else
 				values[29] = UInt64GetDatum(beentry->st_query_id);
+			/* Only report statement_timeout for client backend */
+			if (beentry->st_backendType == B_BACKEND) {
+				values[30] = Int32GetDatum(beentry->st_statement_timeout);
+			} else {
+				nulls[30] = true;
+			}
 		}
 		else
 		{
@@ -640,6 +646,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			nulls[27] = true;
 			nulls[28] = true;
 			nulls[29] = true;
+			nulls[30] = true;
 		}
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8062589efd..5dd33f583f 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2478,7 +2478,7 @@ struct config_int ConfigureNamesInt[] =
Re: Flush SLRU counters in checkpointer process
Here's the patch rebased with Andres' suggestions. Happy to update it if there's any additional change required.

On Wed, Mar 1, 2023 at 8:46 PM Gregory Stark (as CFM) wrote:
> On Thu, 12 Jan 2023 at 03:46, Anthonin Bonnefoy wrote:
> >
> > That would make sense. I've created a new patch with everything moved in pgstat_report_checkpointer().
> > I did split the checkpointer flush in a pgstat_flush_checkpointer() function as it seemed more readable. Thoughts?
>
> This patch appears to need a rebase. Is there really any feedback
> needed or is it ready for committer once it's rebased?
>
> --
> Gregory Stark
> As Commitfest Manager

flush-slru-counters-v3.patch Description: Binary data
Re: Flush SLRU counters in checkpointer process
On Wed, Jan 11, 2023 at 5:33 PM Andres Freund wrote:
> Hi,
>
> On 2023-01-11 10:29:06 +0100, Anthonin Bonnefoy wrote:
> > Currently, the Checkpointer process only reports SLRU statistics at server
> > shutdown, leading to delayed statistics for SLRU flushes. This patch adds a
> > flush of SLRU stats to the end of checkpoints.
>
> Hm. I wonder if we should do this even earlier, by the
> pgstat_report_checkpointer() calls in CheckpointWriteDelay().
>
> I'm inclined to move the pgstat_report_wal() and pgstat_report_slru() calls
> into pgstat_report_checkpointer() to avoid needing to care about all the
> individual places.

That would make sense. I've created a new patch with everything moved into pgstat_report_checkpointer() (a sketch of the resulting shape follows at the end of this mail). I did split the checkpointer flush into a pgstat_flush_checkpointer() function as it seemed more readable. Thoughts?

> > @@ -505,6 +505,7 @@ CheckpointerMain(void)
> >              /* Report pending statistics to the cumulative stats system */
> >              pgstat_report_checkpointer();
> >              pgstat_report_wal(true);
> > +            pgstat_report_slru(true);
>
> Why do we need a force parameter if all callers use it?

Good point. I had written the same signature as pgstat_report_wal, but there's no need for the nowait parameter.

flush-slru-counters-v2.patch Description: Binary data
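To make the shape of the change concrete, here is a sketch of the consolidated reporting (pgstat_flush_checkpointer is the helper split out in the attached patch; treat the body as illustrative rather than the exact patch contents):

void
pgstat_report_checkpointer(void)
{
	/* flush the checkpointer's own pending counters */
	pgstat_flush_checkpointer();

	/*
	 * Piggyback the WAL and SLRU stats flushes on every checkpointer
	 * report, so the CheckpointWriteDelay() calls and the end of the
	 * checkpoint are both covered without touching each call site.
	 */
	pgstat_report_wal(true);
	pgstat_report_slru();
}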
Flush SLRU counters in checkpointer process
Hello hackers,

Currently, the Checkpointer process only reports SLRU statistics at server shutdown, leading to delayed statistics for SLRU flushes. This patch adds a flush of SLRU stats to the end of checkpoints.

Best regards,
Anthonin

flush-slru-counters.patch Description: Binary data