Re: Segfault in jit tuple deforming on arm64 due to LLVM issue

2024-08-30 Thread Anthonin Bonnefoy
I created a commitfest entry[1] to have the CI test the patch. There
was a failure in headerscheck and cpluspluscheck when the
SectionMemoryManager.h include is checked[2]:

In file included from /usr/include/llvm/ADT/SmallVector.h:18,
from /tmp/cirrus-ci-build/src/include/jit/SectionMemoryManager.h:23,
from /tmp/headerscheck.4b1i5C/test.c:2:
/usr/include/llvm/Support/type_traits.h:17:10: fatal error:
type_traits: No such file or directory
   17 | #include <type_traits>

Since SmallVector.h includes <type_traits>, this file can't be
compiled with a C compiler, so I've excluded it from headerscheck.

Loosely related to headerscheck, running it locally was failing as it
couldn't find the LLVM include files. This is because headerscheck
expects the LLVM include files to be in /usr/include and doesn't rely
on llvm-config. I created a second patch to use the LLVM_CPPFLAGS as
extra flags when testing the src/include/jit/* files.

Lastly, I've used a www.github.com link instead of github.com to stop
spamming the llvm-project's PR with a reference to the commit every
time it is pushed somewhere (which seems to be the unofficial hack[3]).

[1] https://commitfest.postgresql.org/49/5220/
[2] 
https://cirrus-ci.com/task/4646639124611072?logs=headers_headerscheck#L42-L46
[3] 
https://github.com/orgs/community/discussions/23123#discussioncomment-3239240


v6-0002-Add-LLVM_CPPFLAGS-in-headerscheck-to-llvm-jit-fil.patch
Description: Binary data


v6-0001-Backport-of-LLVM-code-to-fix-ARM-relocation-bug.patch
Description: Binary data


Re: Set query_id for query contained in utility statement

2024-08-30 Thread Anthonin Bonnefoy
On Tue, Aug 27, 2024 at 11:14 AM jian he  wrote:
> also it's ok to use passed (ParseState *pstate) for
> {
> estate = CreateExecutorState();
> estate->es_param_list_info = params;
> paramLI = EvaluateParams(pstate, entry, execstmt->params, estate);
> }
> ?
> I really don't know.
>
> some of the change is refactoring, maybe you can put it into a separate patch.

Thanks for the review. I think the parser state is mostly used for the
error callbacks and parser_errposition, but I'm not 100% sure. Either
way, you're right: it probably shouldn't be in the patch. I've
modified the patch to restrict the changes to adding only the
necessary query jumble and post parse hook calls.


v4-0001-Set-query_id-for-queries-contained-in-utility-sta.patch
Description: Binary data


Re: Segfault in jit tuple deforming on arm64 due to LLVM issue

2024-08-29 Thread Anthonin Bonnefoy
On Wed, Aug 28, 2024 at 12:24 AM Thomas Munro  wrote:
> 2.  I tested against LLVM 10-18, and found that 10 and 11 lack some
> needed symbols.  So I just hid this code from them.  Even though our
> stable branches support those and even older versions, I am not sure
> if it's worth trying to do something about that for EOL'd distros that
> no one has ever complained about.  I am willing to try harder if
> someone thinks that's important...

I would also assume that people using arm64 are more likely to use
recent versions than not.

I've done some additional tests on different LLVM versions with both
the unpatched version (to make sure the crash was triggered) and the
patched version. I'm attaching the test scripts I've used as reference.
They target a Kubernetes pod since it was the easiest way for me to
get a test Ubuntu Jammy environment:
- setup_pod.sh: installs the necessary packages, gets multiple LLVM
versions, and fetches and compiles master and the patched version of
postgres against the different LLVM versions
- run_test.sh: goes through all LLVM versions for both unpatched and
patched postgres to run test_script.sh
- test_script.sh: run inside the pod to set up the db with the
necessary tables and check whether the crash happens

This generated the following output:
Test unpatched version on LLVM 19, : Crash triggered
Test unpatched version on LLVM 18, libLLVM-18.so.18.1: Crash triggered
Test unpatched version on LLVM 17, libLLVM-17.so.1: Crash triggered
Test unpatched version on LLVM 16, libLLVM-16.so.1: Crash triggered
Test unpatched version on LLVM 15, libLLVM-15.so.1: Crash triggered
Test unpatched version on LLVM 14, libLLVM-14.so.1: Crash triggered
Test unpatched version on LLVM 13, libLLVM-13.so.1: Crash triggered

Test patched version on LLVM 19, : Query ran successfully
Test patched version on LLVM 18, libLLVM-18.so.18.1: Query ran successfully
Test patched version on LLVM 17, libLLVM-17.so.1: Query ran successfully
Test patched version on LLVM 16, libLLVM-16.so.1: Query ran successfully
Test patched version on LLVM 15, libLLVM-15.so.1: Query ran successfully
Test patched version on LLVM 14, libLLVM-14.so.1: Query ran successfully
Test patched version on LLVM 13, libLLVM-13.so.1: Query ran successfully

I print the libLLVM library linked to llvmjit.so in the output to
double-check that I'm testing the correct version. The LLVM 19 package
only provides static libraries (probably because it's still a release
candidate?) so it shows as empty in the output. There was no LLVM 12
available when using the llvm.sh script so I couldn't test it. As for
the results, all unpatched builds crashed as expected while the
patched versions were able to run the query successfully.

> Next, I think we should wait to see if the LLVM project commits that
> PR, this so that we can sync with their 19.x stable branch, instead of
> using code from a PR.  Our next minor release is in November, so we
> have some time.  If they don't commit it, we can consider it anyway: I
> mean, it's crashing all over the place in production, and we see that
> other projects are shipping this code already.

The PR[1] just received an approval and it sounds like they are OK to
eventually merge it.

[1] https://github.com/llvm/llvm-project/pull/71968
#!/bin/bash

# Don't want coredumps
ulimit -c 0

# Stop running PG
/var/lib/postgresql/.local/bin/pg_ctl -D /var/lib/postgresql/db_data stop &>> /dev/null

# Clean db_data
rm -rf /var/lib/postgresql/db_data

# Create new dbdata
/var/lib/postgresql/.local/bin/initdb -D /var/lib/postgresql/db_data -Apeer > /dev/null
echo "port = 5432
shared_buffers='4GB'
" > /var/lib/postgresql/db_data/postgresql.conf

# Start and setup test partitioned table
/var/lib/postgresql/.local/bin/pg_ctl -D /var/lib/postgresql/db_data start -l /tmp/pg.log > /dev/null
/var/lib/postgresql/.local/bin/pgbench -i --partitions=128 2> /dev/null

# Run the query to trigger the issue
/var/lib/postgresql/.local/bin/psql options=-cjit_above_cost=0 -c 'SELECT count(bid) from pgbench_accounts;' &>> /dev/null
exit_code=$?

if (( $exit_code == 2 )); then
echo "Crash triggered"
elif (( $exit_code == 0 )); then
echo "Query ran successfully"
else
echo "Other issue while running the query"
fi
#!/bin/bash
set -eu

TEST_POD=$(kubectl get pod -l app="test_pg" -nsre -o=jsonpath='{.items[*].metadata.name}')
LLVM_VERSIONS=(19 18 17 16 15 14 13)

get_llvm_version()
{
kubectl exec "$pod" -- bash -c "ldd /var/lib/postgresql/.local/lib/llvmjit.so | grep libLLVM | awk '{print \$1}'"
}

run_test_on_pod()
{
local pod="$1"

kubectl cp test_script.sh "$pod":/var/lib/postgresql/

for version in ${LLVM_VERSIONS[@]}; do
build_dir="~/postgres_build/build_${version}"
kubectl exec "$pod" -- bash -c "su postgres -c \" cd $build_dir; make -j19 install &>> /dev/null \" "

echo -n "Test unpatched version on LLVM $version, $(get_llvm_version): "
kubectl exec "$pod" -- bash -c "su postgres -c \" cd; bash test_script.sh \" "
done

for version in 

Re: Segfault in jit tuple deforming on arm64 due to LLVM issue

2024-08-27 Thread Anthonin Bonnefoy
On Tue, Aug 27, 2024 at 12:01 PM Thomas Munro  wrote:
>
> Thanks!  And that's great news.  Do you want to report this experience
> to the PR, in support of committing it?  That'd make it seem easier to
> consider shipping a back-ported copy...

Yes, I will do that.




Re: Segfault in jit tuple deforming on arm64 due to LLVM issue

2024-08-27 Thread Anthonin Bonnefoy
On Tue, Aug 27, 2024 at 1:33 AM Thomas Munro  wrote:
> I am sure this requires changes for various LLVM versions.  I tested
> it with LLVM 14 on a Mac where I've never managed to reproduce the
> original complaint, but ... hmm, this might be exacerbated by ASLR,
> and macOS only has a small ASLR slide window (16M or 256M apparently,
> according to me in another thread), so I'd probably have to interpose
> my own mmap() to choose some more interesting addresses, or run some
> other OS, but that's quite enough rabbit holes for one morning.

I've tested the patch. I had to make sure the issue was triggered on
master first: it didn't happen with 4GB shared_buffers and 64
partitions, but increasing to 6GB and 128 partitions triggered it.

The architecture check in the patch was incorrect (__arch64__ instead
of __aarch64__; glad to see I'm not the only one confused by aarch64
and arm64 :)) but once fixed, it worked and avoided the segfault.

I've run some additional tests with different parameters:
- I've tried disabling randomize_va_space; the issue still happened
even with ASLR disabled.
- I've tested different PG versions. With 14 and 15, 4GB and 64
partitions were enough. Starting with PG 16, I had to increase
shared_buffers to 6GB and partitions to 128. I've been able to trigger
the issue on all versions from 14 to master (which was expected, but I
wanted confirmation).
- I haven't been able to reproduce this on macOS either. I've tried
removing the MemGroup.Near hint so mmap addresses would be more random
and played with different shared_buffers and partition values, without
success.

I've modified the patch with 3 changes:
- meson.build was using the SectionMemoryManager.cpp file name; I've
replaced it with SafeSectionMemoryManager.cpp.
- Use __aarch64__ instead of __arch64__.
- Moved the architecture switch to llvm_create_object_layer, going
through the normal
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager on
non-arm64 architectures. The custom memory manager is only needed on
arm64, so it seemed better to avoid it entirely when the reserve
allocation isn't needed.
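
For illustration, the resulting switch could look roughly like this (a
sketch; the Safe* wrapper name is an assumption mirroring the renamed
file):

static LLVMOrcObjectLayerRef
llvm_create_object_layer(void *Ctx, LLVMOrcExecutionSessionRef ES,
                         const char *Triple)
{
#if defined(__aarch64__)
    /*
     * arm64: use the backported memory manager that reserves a single
     * address range for all sections, keeping them within ADRP reach.
     * The wrapper name below is an assumption based on the renamed
     * SafeSectionMemoryManager.cpp.
     */
    return LLVMOrcCreateRTDyldObjectLinkingLayerWithSafeSectionMemoryManager(ES);
#else
    /* Other architectures don't need the reserve allocation. */
    return LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager(ES);
#endif
}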


v3-0001-XXX-LLVM-ARM-relocation-bug-mitigation.patch
Description: Binary data


Re: Segfault in jit tuple deforming on arm64 due to LLVM issue

2024-08-26 Thread Anthonin Bonnefoy
On Mon, Aug 26, 2024 at 4:33 AM Thomas Munro  wrote:
> IIUC this one is a random and rare crash depending on malloc() and
> perhaps also the working size of your virtual memory dart board.
> (Annoyingly, I had tried to reproduce this quite a few times on small ARM
> systems when earlier reports came in, d'oh!).

allocateMappedMemory, used when creating sections, will eventually
call mmap[1], not malloc. So the amount of shared memory configured
may be a factor in triggering the issue.

My first attempts to reproduce the issue from scratch weren't
successful either. However, trying again with different values of
shared_buffers, I've managed to trigger the issue somewhat reliably.

On a clean Ubuntu jammy, I've compiled the current PostgreSQL
REL_14_STABLE (6bc2bfc3) with the following options:
CLANG=clang-14 ../configure --enable-cassert --enable-debug --prefix
~/.local/ --with-llvm

Set "shared_buffers = '4GB'" in the configuration. More may be needed
but 4GB was enough for me.

Create a table with multiple partitions with pgbench. The goal is to
have a jit module big enough to trigger the issue.
pgbench -i --partitions=64

Then run the following query with jit forcefully enabled:
psql options=-cjit_above_cost=0 -c 'SELECT count(bid) from pgbench_accounts;'

If the issue was successfully triggered, it should segfault or be
stuck in an infinite loop.

> Ultimately, if it doesn't work, and doesn't get fixed, it's hard for
> us to do much about it.  But hmm, this is probably madness... I wonder
> if it would be feasible to detect address span overflow ourselves at a
> useful time, as a kind of band-aid defence...

There's a possible alternative, but it's definitely in the same
category as the hot-patching idea. llvmjit uses
LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager to
create the ObjectLinkingLayer, which will be created with the default
SectionMemoryManager[2]. It should be possible to provide a modified
SectionMemoryManager that allocates sections in a single block, and it
could be restricted to the arm64 architecture. A part of me says this
is probably a bad idea, but on the other hand, LLVM provides this way
to plug in a custom allocator and it would fix the issue...

[1] 
https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/Support/Unix/Memory.inc#L115-L117
[2] 
https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/Orc/OrcV2CBindings.cpp#L967-L973




Re: Set query_id for query contained in utility statement

2024-08-26 Thread Anthonin Bonnefoy
On Mon, Aug 26, 2024 at 5:26 AM jian he  wrote:
>queryid| left
>   | plans | calls | rows
> --+--+---+---+--
>   2800308901962295548 | EXPLAIN (verbose) create table test_t as
> select 1;   | 2 | 2 |0
>   2093602470903273926 | EXPLAIN (verbose) create table test_t as
> select $1   | 0 | 2 |0
>
> "EXPLAIN (verbose) create table test_t as select 1;" called twice,
> is that what we expect?

pg_stat_statements reports nested queries and toplevel queries
separately; you can check with the toplevel column.
2800308901962295548 is the nested CTAS part while 2093602470903273926
is the top EXPLAIN utility statement (which explains why there are 0
plans, since it's a utility statement). Since the EXPLAIN CTAS query
was called twice, it is reported as 2 toplevel queries and 2 nested
queries.

> should first row, the normalized query becomes
> EXPLAIN (verbose) create table test_t as select $1;

Good point. The issue in this case was that the nested query was
stored by pg_stat_statements during the ExecutorEnd hook. That hook
doesn't have the jstate and parse state at that point, so
pg_stat_statements can't normalize the query.

I've modified the patch to fix this. Instead of just jumbling the
query in ExplainQuery, I've moved the jumbling to ExplainOneUtility,
which already has specific code to handle CTAS and DECLARE CURSOR
statements. Calling the post parse hook there allows
pg_stat_statements to store the normalized version of the query for
this queryid and nesting level.
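
A rough sketch of what that can look like in ExplainOneUtility
(hedged: hook wiring and ParseState handling may differ from the
actual patch):

    /* inside ExplainOneUtility(), which receives the utility Node */
    if (IsA(utilityStmt, CreateTableAsStmt))
    {
        CreateTableAsStmt *ctas = (CreateTableAsStmt *) utilityStmt;
        Query      *query = castNode(Query, ctas->query);

        if (query->queryId == UINT64CONST(0))
        {
            /* JumbleQuery() computes and sets query->queryId */
            JumbleState *jstate = JumbleQuery(query);

            /* pstate availability here is an assumption */
            if (post_parse_analyze_hook)
                (*post_parse_analyze_hook)(pstate, query, jstate);
        }
        /* ... then explain the contained query as before ... */
    }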


v3-0001-Set-query_id-for-queries-contained-in-utility-sta.patch
Description: Binary data


Re: Segfault in jit tuple deforming on arm64 due to LLVM issue

2024-08-23 Thread Anthonin Bonnefoy
On Thu, Aug 22, 2024 at 12:33 PM Thomas Munro  wrote:
> I fear that back-porting, for the LLVM project, would mean "we fix it
> in main/20.x, and also back-port it to 19.x".  Do distros back-port
> further?

That's also my fear. I'm not familiar with distros' back-port
policies, but eyeballing the Ubuntu package changelog[1], it seems to
be mostly build fixes.

Given that there's no visible way to fix the relocation issue, I
wonder if jit shouldn't be disabled for arm64 until either the
RuntimeDyld fix is merged or the switch to JITLink is done. Disabling
jit tuple deforming may be enough, but I'm not confident the issue
won't happen in a different part of the generated code.

[1] https://launchpad.net/ubuntu/+source/llvm-toolchain-16/+changelog




Segfault in jit tuple deforming on arm64 due to LLVM issue

2024-08-22 Thread Anthonin Bonnefoy
When a new section is allocated, LLVM requests a new memory block
from the memory allocator[4]. The MemGroup.Near address is passed as
the start hint to mmap, but it is only a hint: the kernel doesn't
provide any guarantee that the newly allocated block will be near it.
With the impacted query, there is more than 10GB of gap between the
.text section and the .rodata section, making it impossible for the
code in the .text section to correctly fetch data from the .rodata
section, as the address in ADRP is limited to a +/-4GB range.
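
To illustrate the hint semantics (plain C, independent of LLVM):
without MAP_FIXED, the kernel is free to place the mapping anywhere.

#include <stdio.h>
#include <sys/mman.h>

int
main(void)
{
    void       *hint = (void *) 0x700000000000;
    void       *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* p is often nowhere near the hint; it's only a suggestion */
    printf("hint=%p got=%p\n", hint, p);
    return 0;
}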

The ABI mentions that the GOT section should be within 4GB of the
text section[5]. In this case there's no GOT section, as the offsets
are stored in the .rodata section, but the constraint is similar. This
is a known LLVM issue[6] that has impacted Impala, Numba and Julia.
There's an open PR[7] to fix the issue by allocating all sections as a
single memory block, avoiding the gaps between sections. There's also
a related discussion on the llvm-rtdyld discourse[8].

A possible mitigation is to switch from RuntimeDyld to JITLink, but
this requires at least LLVM 15, as LLVM 14 doesn't have any
significant relocation support for aarch64[9]. I did test using
JITLink on my impacted db and it seems to fix the issue. JITLink has
no exposed C interface, though, so it requires additional wrapping.

I don't necessarily have a good answer for this issue. I've tried to
tweak relocation settings and the jit code to avoid relocation,
without much success. Ideally, the fix will be merged and backported
in LLVM, but the PR has been open for some time now. I've seen
multiple segfault reports that look similar to this issue (for
example, [10] and [11]) but I don't think they were linked to the LLVM
bug, so I figured I would at least share my findings.

[1] 
https://github.com/postgres/postgres/blob/REL_14_STABLE/src/backend/jit/llvm/llvmjit_deform.c#L364-L382
[2] 
https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp#L501-L513
[3] 
https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/SectionMemoryManager.cpp#L41C32-L41C47
[4] 
https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/SectionMemoryManager.cpp#L94-L110
[5] 
https://github.com/ARM-software/abi-aa/blob/main/sysvabi64/sysvabi64.rst#7code-models
[6] https://github.com/llvm/llvm-project/issues/71963
[7] https://github.com/llvm/llvm-project/pull/71968
[8] 
https://discourse.llvm.org/t/llvm-rtdyld-aarch64-abi-relocation-restrictions/74616
[9] 
https://github.com/llvm/llvm-project/blob/release/14.x/llvm/lib/ExecutionEngine/JITLink/ELF_aarch64.cpp#L75-L84
[10] 
https://www.postgresql.org/message-id/flat/CABa%2BnRvwZy_5t1QF9NJNGwAf03tv_PO_Sg1FsN1%2B-3Odb1XgBA%40mail.gmail.com
[11] 
https://www.postgresql.org/message-id/flat/CADAf1kavcN-kY%3DvEm3MYxhUa%2BrtGFs7tym5d7Ee6Ni2cwwxGqQ%40mail.gmail.com

Regards,
Anthonin Bonnefoy




Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bind

2024-08-21 Thread Anthonin Bonnefoy
On Wed, Aug 21, 2024 at 12:00 AM Jelte Fennema-Nio  wrote:
> @Anthonin are you planning to update the patch accordingly?

Here's the patch with \bindx renamed to \bind_named.

I've also made a small change to Michael's refactoring in 0002 by
initialising success to false in ExecQueryAndProcessResults. There was
a compiler warning about success possibly being used uninitialized[1].

[1] https://cirrus-ci.com/task/6207675187331072


v7-0001-psql-Add-support-for-prepared-stmt-with-extended-.patch
Description: Binary data


v7-0002-psql-Refactor-status-of-extended-protocol-command.patch
Description: Binary data


Re: Set query_id for query contained in utility statement

2024-08-05 Thread Anthonin Bonnefoy
I've realised my initial approach was wrong: calling the post parse
hook for all nested queries in analyze.c prevents extensions like pgss
from correctly tracking the query's nesting level.

I've changed the approach to replicate what's done in ExplainQuery for
both CreateTableAs and DeclareCursor: jumble the query contained in
the utility statement and call the post parse hook before it is
planned or executed. Additionally, explain's nested query can itself
be a CreateTableAs or DeclareCursor, which also needs to be jumbled.
The issue is visible when explaining a CreateTableAs or DeclareCursor
query: the queryId is missing despite the verbose option.

EXPLAIN (verbose) create table test_t as select 1;
QUERY PLAN
--
 Result  (cost=0.00..0.01 rows=1 width=4)
   Output: 1

Post patch, the query id is correctly displayed.

EXPLAIN (verbose) create table test_t as select 1;
QUERY PLAN
--
 Result  (cost=0.00..0.01 rows=1 width=4)
   Output: 1
 Query Identifier: 2800308901962295548

Regards,
Anthonin Bonnefoy


v2-0001-Set-query_id-for-queries-contained-in-utility-sta.patch
Description: Binary data


Set query_id for query contained in utility statement

2024-08-02 Thread Anthonin Bonnefoy
Hi all,

Some utility statements like Explain, CreateTableAs and DeclareCursor
contain a query which will be planned and executed. During post parse,
only the top utility statement is jumbled, leaving the contained query
without a query_id set. Explain does jumble the contained query in
ExplainQuery, but CreateTableAs and DeclareCursor were left with an
unset query_id.

This leads to extensions relying on query_id, like pg_stat_statements,
not being able to track those nested queries, as the query_id was 0.

This patch fixes this by recursively jumbling queries contained in
those utility statements during post_parse, setting the query_id for
those contained queries and removing the need for ExplainQuery to do
it for Explain statements.
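
A hedged sketch of that recursion (the helper name is hypothetical and
the actual patch may wire it differently):

static void
jumble_contained_query(Query *query)
{
    Node       *stmt = query->utilityStmt;
    Query      *contained = NULL;

    if (stmt == NULL)
        return;
    if (IsA(stmt, ExplainStmt))
        contained = castNode(Query, ((ExplainStmt *) stmt)->query);
    else if (IsA(stmt, CreateTableAsStmt))
        contained = castNode(Query, ((CreateTableAsStmt *) stmt)->query);
    else if (IsA(stmt, DeclareCursorStmt))
        contained = castNode(Query, ((DeclareCursorStmt *) stmt)->query);

    if (contained != NULL)
    {
        /* JumbleQuery() computes and sets contained->queryId */
        (void) JumbleQuery(contained);
        /* an EXPLAIN can wrap a CTAS or DECLARE CURSOR: recurse */
        jumble_contained_query(contained);
    }
}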

Regards,
Anthonin Bonnefoy


v1-0001-Set-query_id-for-queries-contained-in-utility-sta.patch
Description: Binary data


Re: Use pgBufferUsage for block reporting in analyze

2024-07-31 Thread Anthonin Bonnefoy
On Wed, Jul 31, 2024 at 9:36 PM Masahiko Sawada  wrote:
> Meanwhile, I think we can push 0001 and 0002 patches since they are in
> good shape. I've updated commit messages to them and slightly changed
> 0002 patch to write "finished analyzing of table \"%s.%s.%s\" instead
> of  "analyze of table \"%s.%s.%s\".

Wouldn't it make sense to do the same for autoanalyze and write
"finished automatic analyze of table \"%s.%s.%s\"\n" instead of
"automatic analyze of table \"%s.%s.%s\"\n"?

> Also, regarding 0003 patch, what is the main reason why we want to add
> WAL usage to analyze reports? I think that analyze normally does not
> write WAL records much so I'm not sure it's going to provide a good
> insight for users.

There was no strong reason except for consistency with VACUUM VERBOSE
output. But as you said, it's not really providing valuable
information so it's probably better to keep the noise down and drop
it.

Regards,
Anthonin Bonnefoy




Re: Use pgBufferUsage for block reporting in analyze

2024-07-31 Thread Anthonin Bonnefoy
On Tue, Jul 30, 2024 at 9:21 AM Anthonin Bonnefoy
 wrote:
> A possible change would be to pass an inh flag when an acquirefunc is
> called from acquire_inherited_sample_rows. The acquirefunc could then
> use an alternative log format to append to logbuf. This way, we could
> have a more compact format for partitioned tables.

I've just tested this; the result isn't great, as it creates
inconsistent output:

INFO:  analyze of table "postgres.public.test_partition"
"test_partition_1": scanned 5 of 5 pages, containing 999 live tuples
and 0 dead tuples; 999 rows in sample, 999 estimated total rows
"test_partition_2": scanned 5 of 5 pages, containing 1000 live tuples
and 0 dead tuples; 1000 rows in sample, 1000 estimated total rows
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
...
INFO:  analyze of table "postgres.public.test_partition_1"
pages: 5 of 5 scanned
tuples: 999 live tuples, 0 are dead; 999 tuples in sample, 999
estimated total tuples
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s

Maybe the best approach is to always use the compact form?

INFO:  analyze of table "postgres.public.test_partition"
"test_partition_1": scanned 5 of 5 pages, containing 999 live tuples
and 0 dead tuples; 999 tuples in sample, 999 estimated total tuples
"test_partition_2": scanned 5 of 5 pages, containing 1000 live tuples
and 0 dead tuples; 1000 tuples in sample, 1000 estimated total tuples
avg read rate: 1.953 MB/s, avg write rate: 0.000 MB/s
...
INFO:  analyze of table "postgres.public.test_partition_1"
"test_partition_1": scanned 5 of 5 pages, containing 999 live tuples
and 0 dead tuples; 999 tuples in sample, 999 estimated total tuples
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s

I've updated the patchset with those changes. 0004 introduces the
StringInfo logbuf so we can output the details as a single log entry,
both during autoanalyze and ANALYZE VERBOSE, while using the compact
form.

Regards,
Anthonin


v6-0004-Pass-StringInfo-logbuf-to-AcquireSampleFunc.patch
Description: Binary data


v6-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data


v6-0002-Output-analyze-details-on-analyze-verbose.patch
Description: Binary data


v6-0003-Output-wal-usage-of-analyze-in-verbose-output-and.patch
Description: Binary data


Re: Use pgBufferUsage for block reporting in analyze

2024-07-30 Thread Anthonin Bonnefoy
Hi,

On Tue, Jul 30, 2024 at 1:13 AM Masahiko Sawada  wrote:
> I have one comment on 0001 patch:
> The comment should also be updated or removed.

Ok, I've removed the comment.

> However, as the above comment says, delay_in_ms can have a duration up
> to 25 days. I guess it would not be a problem for analyze cases but
> could be in vacuum cases as vacuum could sometimes be running for a
> very long time. I've seen vacuums running even for almost 1 month. So
> I think we should keep this part.

Good point, I've reverted to using TimestampDifference for vacuum.

>  /* measure elapsed time iff autovacuum logging requires it */
> -if (AmAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
> +if (instrument)
>
> The comment should also be updated.

Updated.

> Could you split the 0002 patch into two patches? One is to have
> ANALYZE command (with VERBOSE option) write the buffer usage, and
> second one is to add WAL usage to both ANALYZE command and
> autoanalyze. I think adding WAL usage to ANALYZE could be
> controversial as it should not be WAL-intensive operation, so I'd like
> to keep them separate.

I've split the patch: 0002 makes the verbose output the same as the
autoanalyze logs, with buffer/io/system usage, while 0003 adds the WAL
usage output.

> It seems to me that the current style is more concise and readable (3
> rows per table vs. 1 row per table). We might need to consider a
> better output format for partitioned tables as the number of
> partitions could be high. I don't have a good idea now, though.

A possible change would be to pass an inh flag when an acquirefunc is
called from acquire_inherited_sample_rows. The acquirefunc could then
use an alternative log format to append to logbuf. This way, we could
have a more compact format for partitioned tables.

Regards,
Anthonin


v5-0003-Output-wal-usage-of-analyze-in-verbose-output-and.patch
Description: Binary data


v5-0002-Output-analyze-details-on-analyze-verbose.patch
Description: Binary data


v5-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data


v5-0004-Pass-StringInfo-logbuf-to-AcquireSampleFunc.patch
Description: Binary data


Re: Use pgBufferUsage for block reporting in analyze

2024-07-29 Thread Anthonin Bonnefoy
On Sat, Jul 27, 2024 at 12:35 AM Masahiko Sawada  wrote:
> An alternative idea would
> be to pass StringInfo to AcquireSampleRowsFunc() so that callback can
> write its messages there. This is somewhat similar to what we do in
> the EXPLAIN command (cf, ExplainPropertyText() etc). It could be too
> much but I think it could be better than writing logs in the single
> format.

I've tested this approach; it definitely looks better. I've added a
logbuf StringInfo to AcquireSampleRowsFunc which will receive the
logs. elevel was removed as it is not used anymore. Since everything
is in the same log line, I've removed the relation name from the
acquire sample functions.

For partitioned tables, I've also added the name of the partition
being sampled. The output will look like:

INFO:  analyze of table "postgres.public.test_partition"
Sampling rows from child "public.test_partition_1"
pages: 5 of 5 scanned
tuples: 999 live tuples, 0 are dead; 999 tuples in sample, 999
estimated total tuples
Sampling rows from child "public.test_partition_2"
pages: 5 of 5 scanned
tuples: 1000 live tuples, 0 are dead; 1000 tuples in sample, 1000
estimated total tuples
avg read rate: 2.604 MB/s, avg write rate: 0.000 MB/s
...

For a file_fdw, the output will be:

INFO:  analyze of table "postgres.public.pglog"
tuples: 60043 tuples; 30000 tuples in sample
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
...

Regards,
Anthonin


v4-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data


v4-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data


v4-0003-Pass-StringInfo-logbuf-to-AcquireSampleFunc.patch
Description: Binary data


Add ALL_CANDIDATES option to EXPLAIN

2024-07-26 Thread Anthonin Bonnefoy
Hi,

I have a prototype for an ALL_CANDIDATES option for EXPLAIN. The goal
of this option is to print all plan candidates instead of only the
cheapest plan. It outputs the plans from the most expensive at the
top down to the cheapest. Here's an example:

explain (all_candidates) select * from pgbench_accounts where aid=1;
 QUERY PLAN
-
 Plan 1
   ->  Gather  (cost=1000.00..3375.39 rows=1 width=97)
 Workers Planned: 1
 ->  Parallel Seq Scan on pgbench_accounts
(cost=0.00..2375.29 rows=1 width=97)
   Filter: (aid = 1)
 Plan 2
   ->  Seq Scan on pgbench_accounts  (cost=0.00..2890.00 rows=1 width=97)
 Filter: (aid = 1)
 Plan 3
   ->  Bitmap Heap Scan on pgbench_accounts  (cost=4.30..8.31 rows=1 width=97)
 Recheck Cond: (aid = 1)
 ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.30 rows=1 width=0)
   Index Cond: (aid = 1)
 Plan 4
   ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts
(cost=0.29..8.31 rows=1 width=97)
 Index Cond: (aid = 1)

This can provide very useful insight into the planner's decisions,
like whether it tried to use a specific index and how much cost
difference there is with the top plan. Additionally, it makes it
possible to spot discrepancies in generated plans, like incorrect row
estimation [1].

The plan list is generated from the upper_rel's pathlist. However, due
to how planning mutates the PlannerGlobal state, we can't directly
iterate over the path list generated by subquery_planner and create a
planned statement for each path. Instead, we need to plan from scratch
for every path in the pathlist to generate the list of PlannedStmts.

The patch is split in two, mostly to ease the review:
001: Propagate the PlannerInfo root to add_path. This is needed to
prevent add_path from discarding paths when all_candidates is enabled;
the flag is stored in PlannerGlobal.
002: Add the planner_all_candidates function and print the candidate
list in EXPLAIN.

[1] 
https://www.postgresql.org/message-id/flat/CAO6_Xqr9+51NxgO=XospEkUeAg-p=ejawmtpdczwjrggkj5...@mail.gmail.com

Regards,
Anthonin


v1-0002-Add-ALL_CANDIDATES-explain-option.patch
Description: Binary data


v1-0001-Propagate-root-PlannerInfo-to-add_path.patch
Description: Binary data


Re: query_id, pg_stat_activity, extended query protocol

2024-07-26 Thread Anthonin Bonnefoy
On Thu, Jul 18, 2024 at 10:56 AM Michael Paquier  wrote:
> Please attach things to your emails, if your repository disappears for
> a reason or another we would lose knowledge in the archives of the
> community lists.

Noted, and thanks for the reminder; I'm still learning about mailing
list etiquette.

> I have not looked at that entirely in details, and I'd need to check
> if it is possible to use what's here for more predictible tests:
> https://www.postgresql.org/message-id/ZqCMCS4HUshUYjGc%40paquier.xyz

For the tests, there are limited possibilities to check whether a
query_id has been set correctly.
- Checking pg_stat_activity is not possible in the regress tests as
you need a second session to check the reported query_id.
- pg_stat_statements can be used indirectly but you're limited to how
pgss uses query_id. For example, it doesn't rely on queryId in
ExecutorRun.

A possible solution I've been thinking of is to use a test module.
The module will assert on whether the queryId is set in the parse,
plan and executor hooks. It will also check that the queryId reported
in pgstat matches the queryId at the root level.

This allows us to check that the queryId is correctly set with the
extended protocol. I've also found some queries which trigger a
failure (CTAS and cursor usage), though this is probably a separate
issue from the extended protocol one.
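
For illustration, the planner-hook part of such a module could look
like this (a sketch under those assumptions, not the attached patch):

static planner_hook_type prev_planner_hook = NULL;

static PlannedStmt *
queryid_check_planner(Query *parse, const char *query_string,
                      int cursorOptions, ParamListInfo boundParams)
{
    /*
     * With compute_query_id enabled, any query reaching the planner
     * should already have a queryId set during parse analysis.
     */
    Assert(parse->queryId != UINT64CONST(0));

    if (prev_planner_hook)
        return prev_planner_hook(parse, query_string, cursorOptions,
                                 boundParams);
    return standard_planner(parse, query_string, cursorOptions,
                            boundParams);
}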

Regards,
Anthonin


0001-Create-module-for-query_id-test.patch
Description: Binary data


Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bind

2024-07-24 Thread Anthonin Bonnefoy
On Wed, Jul 24, 2024 at 05:33:07PM +0200, Peter Eisentraut wrote:
> This commit message confused me, because I don't think this is what the
> \bindx command actually does.  AFAICT, it only binds, it does not execute.
> At least that is what the documentation in the content of the patch appears
> to indicate.

Unless I misunderstand the remark, \bindx will call
PQsendQueryPrepared, which will bind and then execute the query,
similar to what \bind does (except \bind also parses the query).

> I'm not sure \bindx is such a great name.  The "x" stands for "I ran out of
> ideas". ;-)

That's definitely what happened :). \bind would have been a better fit
but it was already used.

On Thu, Jul 25, 2024 at 4:19 AM Michael Paquier  wrote:
> Not sure that I like much the additional option embedded in the
> existing command; I'd rather keep a separate command for each libpq
> call, that seems cleaner.  So I would be OK with your suggested
> \bind_named.  Fine by me to be outvoted, of course.

+1 for keeping this as a separate command and using \bind_named.
\bind has a different behaviour, as it also parses the query, so
keeping them as separate commands would probably avoid some confusion.




Re: Use pgBufferUsage for block reporting in analyze

2024-07-24 Thread Anthonin Bonnefoy
On Mon, Jul 22, 2024 at 10:59 PM Masahiko Sawada  wrote:
> The first line would vary depending on whether an autovacuum worker or
> not. And the above suggestion includes a change of term "row" to
> "tuple" for better consistency with VACUUM VERBOSE outputs. I think it
> would be great if autoanalyze also writes logs in the same format.
> IIUC with the patch, autoanalyze logs don't include the page and tuple
> statistics.

One issue is that the number of scanned pages, live tuples and dead
tuples is only available in acquire_sample_rows, which is where the
log containing those stats is emitted. I've tried to implement the
following in 0003 (see the struct sketch below):
- Sampling functions now accept an AcquireSampleStats struct to store
page and tuple stats
- The log is removed from the sampling functions
- do_analyze_rel now outputs scanned-page and tuple statistics when
relevant. Sampling from an FDW doesn't provide those statistics, so
they are not displayed in that case.
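
For reference, the struct could have roughly this shape (a sketch;
field names are assumptions, not the actual patch):

typedef struct AcquireSampleStats
{
    BlockNumber scanned_pages;  /* pages sampled */
    BlockNumber total_pages;    /* total pages in the relation */
    double      live_tuples;    /* estimated live tuples */
    double      dead_tuples;    /* estimated dead tuples */
} AcquireSampleStats;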

This ensures that analyze logs are only emitted in do_analyze_rel,
allowing the same output to be displayed for both autoanalyze and
ANALYZE VERBOSE. With those changes, we have the following output for
an analyze verbose of a table:

analyze (verbose) pgbench_accounts ;
INFO:  analyzing "public.pgbench_accounts"
INFO:  analyze of table "postgres.public.pgbench_accounts"
pages: 1640 of 1640 scanned
tuples: 100000 live tuples, 0 are dead; 30000 tuples in samples,
100000 estimated total tuples
avg read rate: 174.395 MB/s, avg write rate: 0.000 MB/s
buffer usage: 285 hits, 1384 reads, 0 dirtied
WAL usage: 14 records, 0 full page images, 1343 bytes
system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.06 s

For a file_fdw, the output will look like:

analyze (verbose) pglog;
INFO:  analyzing "public.pglog"
INFO:  analyze of table "postgres.public.pglog"
tuples: 30000 tuples in samples, 60042 estimated total tuples
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 182 hits, 0 reads, 0 dirtied
WAL usage: 118 records, 0 full page images, 13086 bytes
system usage: CPU: user: 0.40 s, system: 0.00 s, elapsed: 0.41 s

I've also slightly modified 0002 to display "automatic analyze" when
we're inside an autovacuum worker, similar to what's done with vacuum
output.

Regards,
Anthonin


v3-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data


v3-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data


v3-0003-Output-pages-and-tuples-stats-in-ANALYZE-VERBOSE.patch
Description: Binary data


Re: Possible incorrect row estimation for Gather paths

2024-07-22 Thread Anthonin Bonnefoy
On Wed, Jul 17, 2024 at 3:59 AM Richard Guo  wrote:
>
> I can reproduce this problem with the query below.
>
> explain (costs on) select * from tenk1 order by twenty;
>QUERY PLAN
> -
>  Gather Merge  (cost=772.11..830.93 rows=5882 width=244)
>Workers Planned: 1
>->  Sort  (cost=772.10..786.80 rows=5882 width=244)
>  Sort Key: twenty
>  ->  Parallel Seq Scan on tenk1  (cost=0.00..403.82 rows=5882 
> width=244)
> (5 rows)
I was looking for a test to add to the regress checks that wouldn't
rely on explain costs, since they are disabled there. However, I've
realised I could do something similar to what's done in stats_ext with
the check_estimated_rows function. I've added a get_estimated_rows
test function that extracts the estimated rows from the top node and
uses it to check the gather nodes' estimates. get_estimated_rows uses
a plain explain, whereas check_estimated_rows relies on explain
analyze.

> I wonder if the changes in create_ordered_paths should also be reduced
> to 'total_groups = gather_rows_estimate(path);'.

It should already be the case with the v2 patch. It does create
rounding errors that are visible in the tests, but they shouldn't
exceed +1 or -1.

> I think perhaps it's better to declare gather_rows_estimate() in
> cost.h rather than optimizer.h.
> (BTW, I wonder if compute_gather_rows() would be a better name?)

Good point, I've moved the declaration and renamed it.

> I noticed another issue in generate_useful_gather_paths() -- *rowsp
> would have a random value if override_rows is true and we use
> incremental sort for gather merge.  I think we should fix this too.

That seems to be the case. I've tried to find a query that could
trigger this codepath without success. All grouping and distinct paths
I've tried where fully sorted and didn't trigger an incremental sort.
I will need a bit more time to check this.

In the meantime, I've updated the patches with the review comments.

Regards,
Anthonin


v3-0001-Remove-unnecessary-assignment-of-path-rows-in-gat.patch
Description: Binary data


v3-0002-Fix-row-estimation-in-gather-paths.patch
Description: Binary data


Correctly propagate queryId for utility stmt in function

2024-07-18 Thread Anthonin Bonnefoy
Hi,

For utility statements defined within a function, the queryTree is
copied to a plannedStmt, as utility commands don't require planning.
However, the queryId is not propagated to the plannedStmt. This leads
to plugins relying on queryId, like pg_stat_statements, not being able
to track utility statements within function calls.

This patch fixes the issue by correctly propagating the queryId from
the cached queryTree to the plannedStmt.

Regards,
Anthonin
From b70163d1bd1279ca3f9b29166431cc64d25ca586 Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy 
Date: Thu, 18 Jul 2024 11:50:37 +0200
Subject: Correctly propagate queryId for utility stmt in function

For utility statements defined within a function, the queryTree is
copied to a plannedStmt as utility commands don't require planning.
However, the queryId is not propagated to the plannedStmt. This leads to
plugins relying on queryId like pg_stat_statements to not be able to
track utility statements within function calls.

This patch fixes the issue by correctly propagating queryId from the
cached queryTree to the plannedStmt.
---
 .../expected/level_tracking.out   | 28 +++
 .../pg_stat_statements/sql/level_tracking.sql | 12 
 src/backend/executor/functions.c  |  1 +
 3 files changed, 41 insertions(+)

diff --git a/contrib/pg_stat_statements/expected/level_tracking.out b/contrib/pg_stat_statements/expected/level_tracking.out
index fe01dd79af..f7b2ffceb3 100644
--- a/contrib/pg_stat_statements/expected/level_tracking.out
+++ b/contrib/pg_stat_statements/expected/level_tracking.out
@@ -67,6 +67,34 @@ SELECT toplevel, calls, query FROM pg_stat_statements
  t| 1 | SET pg_stat_statements.track = 'all'
 (7 rows)
 
+-- Procedure with multiple utility statements - all level tracking.
+CREATE OR REPLACE PROCEDURE proc_with_utility_stmt()
+LANGUAGE SQL
+AS $$
+SHOW pg_stat_statements.track;
+SHOW pg_stat_statements.track_utility;
+$$;
+CALL proc_with_utility_stmt();
+SELECT toplevel, calls, query FROM pg_stat_statements
+  WHERE query LIKE '%SHOW%' ORDER BY query COLLATE "C", toplevel;
+ toplevel | calls |query 
+--+---+--
+ t| 1 | CREATE OR REPLACE PROCEDURE proc_with_utility_stmt()+
+  |   | LANGUAGE SQL+
+  |   | AS $$   +
+  |   | SHOW pg_stat_statements.track;  +
+  |   | SHOW pg_stat_statements.track_utility;  +
+  |   | $$
+ f| 1 | SHOW pg_stat_statements.track
+ f| 1 | SHOW pg_stat_statements.track_utility
+(3 rows)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
 -- DO block - top-level tracking without utility.
 SET pg_stat_statements.track = 'top';
 SET pg_stat_statements.track_utility = FALSE;
diff --git a/contrib/pg_stat_statements/sql/level_tracking.sql b/contrib/pg_stat_statements/sql/level_tracking.sql
index aa37408d52..115ff392a0 100644
--- a/contrib/pg_stat_statements/sql/level_tracking.sql
+++ b/contrib/pg_stat_statements/sql/level_tracking.sql
@@ -33,6 +33,18 @@ END; $$;
 SELECT toplevel, calls, query FROM pg_stat_statements
   ORDER BY query COLLATE "C", toplevel;
 
+-- Procedure with multiple utility statements - all level tracking.
+CREATE OR REPLACE PROCEDURE proc_with_utility_stmt()
+LANGUAGE SQL
+AS $$
+SHOW pg_stat_statements.track;
+SHOW pg_stat_statements.track_utility;
+$$;
+CALL proc_with_utility_stmt();
+SELECT toplevel, calls, query FROM pg_stat_statements
+  WHERE query LIKE '%SHOW%' ORDER BY query COLLATE "C", toplevel;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
 -- DO block - top-level tracking without utility.
 SET pg_stat_statements.track = 'top';
 SET pg_stat_statements.track_utility = FALSE;
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 539cd0a999..692854e2b3 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -492,6 +492,7 @@ init_execution_state(List *queryTree_list,
 stmt->utilityStmt = queryTree->utilityStmt;
 stmt->stmt_location = queryTree->stmt_location;
 stmt->stmt_len = queryTree->stmt_len;
+stmt->queryId = queryTree->queryId;
 			}
 			else
 stmt = pg_plan_query(queryTree,
-- 
2.39.3 (Apple Git-146)



Re: query_id, pg_stat_activity, extended query protocol

2024-07-17 Thread Anthonin Bonnefoy
Hi,

Wouldn't it be enough to call pgstat_report_query_id in ExecutorRun
and ProcessUtility? With those changes [1], both normal statements and
utility statements called through the extended protocol will correctly
report the query_id.
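
Roughly, the change in [1] amounts to something like this (a sketch,
not the commit verbatim):

void
ExecutorRun(QueryDesc *queryDesc, ScanDirection direction,
            uint64 count, bool execute_once)
{
    /*
     * Re-report the query_id: with the extended protocol, the value
     * reported at parse time can be reset before Execute runs.
     */
    pgstat_report_query_id(queryDesc->plannedstmt->queryId, false);

    /* ... existing ExecutorRun body ... */
}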

-- Test utility statement with extended protocol
show all \bind \g

-- Check reported query_id
select query, query_id from pg_stat_activity where
application_name ='psql' and pid!=pg_backend_pid();
   query   |  query_id
---+-
 show all  | -866221123969716490

[1] 
https://github.com/bonnefoa/postgres/commit/bf4b332d7b481549c6d9cfa70db51e39a305b9b2

Regards,
Anthonin




Re: Possible incorrect row estimation for Gather paths

2024-07-16 Thread Anthonin Bonnefoy
Hi Rafia,

Thanks for reviewing.

On Wed, Jul 10, 2024 at 4:54 PM Rafia Sabih  wrote:
> But I don't quite understood the purpose of this,
> + total_groups = input_rel->rows;
> +
> + /*
> + * If the number of rows is unknown, fallback to gather rows
> + * estimation
> + */
> + if (total_groups == 0)
> + total_groups = gather_rows_estimate(input_path);
>
> why not just use total_groups = gather_rows_estimate(input_path), what
> is the importance of having total_groups = input_rel->rows?

The initial goal was to use the source tuples if available and avoid
possible rounding errors, though I realise that the difference would
be minimal. For example, 200K tuples and 2 workers (parallel divisor
2.4) would yield int(int(200000 / 2.4) * 2.4) = 199999. That is
probably not worth the additional complexity, so I've updated the
patch to just use gather_rows_estimate.

> With respect to the change introduced by the patch in the regression
> test, I wonder if we should test it on the tables of a larger scale
> and check for performance issues.
> Imagine the case when Parallel Seq Scan on extremely_skewed s
> (cost=0.00..167.01 rows=1 width=4) returns 1000 rows instead of 1 ...
> I wonder which plan would perform better then or will there be a
> totally different plan.
For the extremely_skewed test, having the parallel seq scan return
more rows won't impact the Gather, since the Hash Join will reduce the
number of rows to 1. I've found an example where we can see the plan
change with the default settings:

CREATE TABLE simple (id SERIAL PRIMARY KEY, other bigint);
insert into simple select generate_series(1,500000), ceil(random()*100);
analyze simple;
EXPLAIN SELECT * FROM simple where other < 10 order by id;

Unpatched:
 Gather Merge  (cost=9377.85..12498.60 rows=27137 width=12)
   Workers Planned: 1
   ->  Sort  (cost=8377.84..8445.68 rows=27137 width=12)
 Sort Key: id
 ->  Parallel Seq Scan on simple  (cost=0.00..6379.47
rows=27137 width=12)
   Filter: (other < 10)

Patched:
 Sort  (cost=12381.73..12492.77 rows=44417 width=12)
   Sort Key: id
   ->  Seq Scan on simple  (cost=0.00..8953.00 rows=44417 width=12)
 Filter: (other < 10)

Looking at the candidates, the previous Gather Merge now has an
estimated row count of 44418. The off-by-one difference compared to
the other Gather Merge plan is due to rounding (26128 * 1.7 = 44417.6).

 Plan 3
   ->  Gather Merge  (cost=9296.40..14358.75 rows=44418 width=12)
 Workers Planned: 1
 ->  Sort  (cost=8296.39..8361.71 rows=26128 width=12)
   Sort Key: id
   ->  Parallel Seq Scan on simple  (cost=0.00..6379.47
rows=26128 width=12)
 Filter: (other < 10)
 Plan 4
   ->  Gather Merge  (cost=9296.40..14358.63 rows=44417 width=12)
 Workers Planned: 1
 ->  Sort  (cost=8296.39..8361.71 rows=26128 width=12)
   Sort Key: id
   ->  Parallel Seq Scan on simple  (cost=0.00..6379.47
rows=26128 width=12)
 Filter: (other < 10)
 Plan 5
   ->  Sort  (cost=12381.73..12492.77 rows=44417 width=12)
 Sort Key: id
 ->  Seq Scan on simple  (cost=0.00..8953.00 rows=44417 width=12)
   Filter: (other < 10)

The Sort plan is slightly slower than the Gather Merge plan: 100ms
average against 83ms, but the Gather Merge comes at the additional
cost of creating and using a parallel worker. The unpatched row
estimation makes the parallel plan look cheaper, and running a
parallel query for a 17ms improvement doesn't seem like a good trade.

I've also realised from the comments in optimizer.h that
nodes/pathnodes.h should not be included there and fixed it.

Regards,
Anthonin


v2-0001-Remove-unnecessary-assignment-of-path-rows-in-gat.patch
Description: Binary data


v2-0002-Fix-row-estimation-in-gather-paths.patch
Description: Binary data


Re: Use pgBufferUsage for block reporting in analyze

2024-07-08 Thread Anthonin Bonnefoy
Hi,

Thanks for the review. I've updated the patches with the suggestions:
- moved the renaming of misses to reads to the first patch
- added intermediate variables for total blks usage

I've also done some additional tests using the provided
vacuum_analyze_buffer_usage.sql script. It relies on
pg_stat_statements to check the results (only pgss gives information
on dirtied buffers). It gives the following output:

 psql:vacuum_analyze_buffer_usage.sql:21: INFO:  vacuuming
"postgres.pg_temp_7.vacuum_blks_stat_test"
 ...
 buffer usage: 105 hits, 3 reads, 6 dirtied
 ...
  query  | sum_hit | sum_read | sum_dirtied
 +-+--+-
  VACUUM (VERBOSE... | 105 |3 |   6

For vacuum, we have the same results with SKIP_DATABASE_STATS. Without
this setting, we would have block usage generated by
vac_update_datfrozenxid outside of vacuum_rel and therefore not
tracked by the verbose output. For the second test, the second patch
is needed to have ANALYZE (VERBOSE) output the block usage. It will
output the following:

 psql:vacuum_analyze_buffer_usage.sql:29: INFO:  analyzing
"pg_temp_7.vacuum_blks_stat_test"
 ...
 buffer usage: 84 hits, 33 reads, 2 dirtied
 ...
 query| sum_hit | sum_read | sum_dirtied
 -+-+--+-
  ANALYZE (VERBOSE... |  91 |   38 |   2

There are additional buffer hits/reads reported by pgss; those come
from analyze_rel opening the relations in try_relation_open and are
not tracked by the ANALYZE VERBOSE output.


v3-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data


v3-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data


vacuum_analyze_buffer_usage.sql
Description: Binary data


Re: Fix possible overflow of pg_stat DSA's refcnt

2024-06-25 Thread Anthonin Bonnefoy
On Wed, Jun 26, 2024 at 7:40 AM Michael Paquier  wrote:
>
> Very good catch!  It looks like you have seen that in the field, then.
> Sad face.

Yeah, this happened last week on one of our replicas (version 15.5)
that had 134 days of uptime. We run a lot of parallel queries on this
cluster, so the combination of high uptime plus parallel worker
creation eventually triggered the issue.

> Will fix and backpatch, thanks for the report!

Thanks for handling this and for the quick answer!

Regards,
Anthonin




Fix possible overflow of pg_stat DSA's refcnt

2024-06-25 Thread Anthonin Bonnefoy
Hi,

During backend initialisation, the pgStat DSA is attached using
dsa_attach_in_place with a NULL segment. The NULL segment means that
there's no callback to release the DSA when the process exits.
pgstat_detach_shmem only calls dsa_detach which, as mentioned in the
function's comment, doesn't release the area and doesn't decrement the
reference count of the pgStat DSA.

Thus, every time a backend is created, the pgStat DSA's refcnt is
incremented but never decremented when the backend shuts down. It will
eventually overflow and wrap around to 0, triggering the "could not
attach to dynamic shared area" error on all newly created backends.
When this state is reached, the only way to recover is to restart the
db to reset the counter.

The issue can be made visible by calling dsa_dump in
pgstat_detach_shmem and checking that refcnt's value continuously
increases as new backends are created. It is also possible to reach
the state where all connections are refused by editing the refcnt
manually with lldb/gdb (the alternative, creating enough backends to
reach 0, exists but can take some time). Setting it to -10 and then
opening 10 connections will eventually generate the "could not attach"
error.

This patch fixes the issue by releasing the pgStat DSA with
dsa_release_in_place during pgStat shutdown, to correctly decrement
the refcnt.
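
A minimal sketch of the idea, assuming the release belongs in the
pgstat shutdown callback (exact call site and field names may differ
from the patch):

static void
pgstat_shutdown_hook(int code, Datum arg)
{
    /* ... existing shutdown work ... */

    dsa_detach(pgStatLocal.dsa);

    /*
     * dsa_detach() alone doesn't drop the refcnt taken by
     * dsa_attach_in_place(), so release the area explicitly.
     */
    dsa_release_in_place(pgStatLocal.shmem->raw_dsa_area);
}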

Regards,
Anthonin


v1-0001-Fix-possible-overflow-of-pg_stat-DSA-s-refcnt.patch
Description: Binary data


Re: Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel

2024-06-06 Thread Anthonin Bonnefoy
Hi,

I sent a similar patch for this in
https://www.postgresql.org/message-id/flat/cao6_xqr__kttclkftqs0qscm-j7_xbrg3ge2rwhucxqjmjh...@mail.gmail.com

Regards,
Anthonin

On Thu, Jun 6, 2024 at 11:10 AM Dilip Kumar  wrote:

> As part of commit 5cd72cc0c5017a9d4de8b5d465a75946da5abd1d, the
> dependency on global counters such as VacuumPage(Hit/Miss/Dirty) was
> removed from the vacuum. However, do_analyze_rel() was still using
> these counters, necessitating the tracking of global counters
> alongside BufferUsage counters.
>
> The attached patch addresses the issue by eliminating the need to
> track VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel(), making
> the global counters obsolete. This simplifies the code and improves
> consistency.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: Use pgBufferUsage for block reporting in analyze

2024-05-28 Thread Anthonin Bonnefoy
Thanks for having a look.

On Fri, May 10, 2024 at 12:40 PM Michael Paquier 
wrote:

> This needs some runtime check to make sure that the calculations
> are consistent before and after the fact (cannot do that now).
>
Yeah, testing this is also a bit painful, as the buffer usage of
analyze is only displayed in the logs during autoanalyze. While
looking at this, I've thought of additional changes that could make
testing easier and improve consistency with VACUUM VERBOSE:
- Have ANALYZE VERBOSE output the buffer usage stats
- Add WAL usage to ANALYZE VERBOSE

analyze verbose output would look like:
postgres=# analyze (verbose) pgbench_accounts ;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 1640 of 1640 pages, containing 10
live rows and 0 dead rows; 3 rows in sample, 10 estimated total rows
INFO:  analyze of table "postgres.public.pgbench_accounts"
avg read rate: 124.120 MB/s, avg write rate: 0.110 MB/s
buffer usage: 533 hits, 1128 reads, 1 dirtied
WAL usage: 12 records, 1 full page images, 5729 bytes
system usage: CPU: user: 0.06 s, system: 0.00 s, elapsed: 0.07 s

Perhaps this should say "read" rather than "miss" in the logs as the
> two read variables for the shared and local blocks are used?  For
> consistency, at least.
>
Sounds good.

That's not material for v17, only for v18.
>
 Definitely

I've split the patch in two parts:
1: Removal of the vacuum-specific variables; this is the same as the
initial patch.
2: Add buffer and WAL usage to the analyze verbose output and rename
miss to reads.

Regards,
Anthonin


v2-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data


v2-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data


Possible incorrect row estimation for Gather paths

2024-05-24 Thread Anthonin Bonnefoy
Hi,

While experimenting with an explain option to display all plan
candidates (very rough prototype here [1]), I've noticed some
discrepancies in the generated plans.

EXPLAIN (ALL_CANDIDATES) SELECT * FROM pgbench_accounts order by aid;
 Plan 1
->  Gather Merge  (cost=11108.32..22505.38 rows=100000 width=97)
 Workers Planned: 1
 ->  Sort  (cost=10108.31..10255.37 rows=58824 width=97)
   Sort Key: aid
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..2228.24
rows=58824 width=97)
 Plan 2
->  Gather Merge  (cost=11108.32..17873.08 rows=58824 width=97)
 Workers Planned: 1
 ->  Sort  (cost=10108.31..10255.37 rows=58824 width=97)
   Sort Key: aid
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..2228.24
rows=58824 width=97)

The 2 plans are similar, except that one Gather Merge has a lower
estimate of 58K rows.

The first plan is created with generate_useful_gather_paths with
override_rows=false then create_gather_merge_path and will use rel->rows as
the row count (so the 100K rows of pgbench_accounts).
#0: create_gather_merge_path(...) at pathnode.c:1885:30
#1: generate_useful_gather_paths(... override_rows=false) at
allpaths.c:3286:11
#2: apply_scanjoin_target_to_paths(...) at planner.c:7744:3
#3: grouping_planner(...) at planner.c:1638:3

The second plan is created through create_ordered_paths then
create_gather_merge_path, and the number of rows is estimated as
(worker_rows * parallel_workers). Since we only have 1 parallel
worker, this yields 58K rows.
#0: create_gather_merge_path(...) at pathnode.c:1885:30
#1: create_ordered_paths(...) at planner.c:5287:5
#2: grouping_planner(...) at planner.c:1717:17

The 58K row estimate looks possibly incorrect. A worker row count is
estimated using total_rows/parallel_divisor. The parallel_divisor will
include the possible leader participation and will be 1.7 for 1 worker thus
the 58K rows (100K/1.7=58K)
However, the gather merge will only do 58K*1, dropping the leader
participation component.
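
For reference, the divisor computation (get_parallel_divisor in
costsize.c) is approximately the following; with 1 worker and leader
participation it gives 1 + (1.0 - 0.3) = 1.7:

static double
get_parallel_divisor(Path *path)
{
    double      parallel_divisor = path->parallel_workers;

    if (parallel_leader_participation)
    {
        double      leader_contribution;

        /* the leader contributes less as worker count grows */
        leader_contribution = 1.0 - (0.3 * path->parallel_workers);
        if (leader_contribution > 0)
            parallel_divisor += leader_contribution;
    }

    return parallel_divisor;
}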

I have a tentative patch split in two changes:
1: A related refactoring to remove an unnecessary and confusing
assignment of rows in create_gather_merge_path. This value is never
used and is immediately overwritten in cost_gather_merge.
2: Change the row estimation of gather paths to use
worker_rows*parallel_divisor to get a more accurate estimation.
Additionally, when creating a gather merge path in
create_ordered_paths, we try to use the source rel's rows when
available.

The patch triggered a small change in the hash_join regression test.
Pre-patch, we had the following candidates.
Plan 4
->  Aggregate  (cost=511.01..511.02 rows=1 width=8)
 ->  Gather  (cost=167.02..511.00 rows=2 width=0)
   Workers Planned: 1
   ->  Parallel Hash Join  (cost=167.02..510.80 rows=1 width=0)
 Hash Cond: (r.id = s.id)
 ->  Parallel Seq Scan on simple r  (cost=0.00..299.65
rows=11765 width=4)
 ->  Parallel Hash  (cost=167.01..167.01 rows=1 width=4)
   ->  Parallel Seq Scan on extremely_skewed s
 (cost=0.00..167.01 rows=1 width=4)
Plan 5
->  Finalize Aggregate  (cost=510.92..510.93 rows=1 width=8)
 ->  Gather  (cost=510.80..510.91 rows=1 width=8)
   Workers Planned: 1
   ->  Partial Aggregate  (cost=510.80..510.81 rows=1 width=8)
 ->  Parallel Hash Join  (cost=167.02..510.80 rows=1
width=0)
   Hash Cond: (r.id = s.id)
   ->  Parallel Seq Scan on simple r
 (cost=0.00..299.65 rows=11765 width=4)
   ->  Parallel Hash  (cost=167.01..167.01 rows=1
width=4)
 ->  Parallel Seq Scan on extremely_skewed s
 (cost=0.00..167.01 rows=1 width=4)

Post-patch, the plan candidates became:
Plan 4
->  Finalize Aggregate  (cost=511.02..511.03 rows=1 width=8)
 ->  Gather  (cost=510.80..511.01 rows=2 width=8)
   Workers Planned: 1
   ->  Partial Aggregate  (cost=510.80..510.81 rows=1 width=8)
 ->  Parallel Hash Join  (cost=167.02..510.80 rows=1
width=0)
   Hash Cond: (r.id = s.id)
   ->  Parallel Seq Scan on simple r
 (cost=0.00..299.65 rows=11765 width=4)
   ->  Parallel Hash  (cost=167.01..167.01 rows=1
width=4)
 ->  Parallel Seq Scan on extremely_skewed s
 (cost=0.00..167.01 rows=1 width=4)
Plan 5
->  Aggregate  (cost=511.01..511.02 rows=1 width=8)
 ->  Gather  (cost=167.02..511.00 rows=2 width=0)
   Workers Planned: 1
   ->  Parallel Hash Join  (cost=167.02..510.80 rows=1 width=0)
 Hash Cond: (r.id = s.id)
 ->  Parallel Seq Scan on simple r  (cost=0.00..299.65
rows=11765 width=4)
 ->  Parallel Hash  (cost=167.01..167.01 rows=1 width=4)
   ->  Parallel Seq Scan on extremely_skewed s
 (cost=0.00..167.01 rows=1 width=4)

The FinalizeAggreg

Use pgBufferUsage for block reporting in analyze

2024-05-10 Thread Anthonin Bonnefoy
Hi,

Analyze logging within autovacuum uses the dedicated variables
VacuumPage{Hit,Miss,Dirty} to track buffer usage counts. However,
pgBufferUsage already provides block usage tracking and handles more cases
(temporary tables, parallel workers...).

Those variables were only used in two places: block usage reporting in
verbose vacuum and analyze. Commit 5cd72cc0c5017a9d4de8b5d465a75946da5abd1d
removed their usage in the vacuum command as part of a bugfix.

This patch replaces the vacuum-specific variables with pgBufferUsage
in analyze. This makes VacuumPage{Hit,Miss,Dirty} unused and removable,
so the patch also removes their updates in bufmgr and their declarations.

Regards,
Anthonin


v1-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-05-02 Thread Anthonin Bonnefoy
On Wed, May 1, 2024 at 5:37 AM Masahiko Sawada 
wrote:

> Thank you for further testing! I've pushed the patch.

Thanks!

Here is the rebased version for the follow-up patch removing VacuumPage
variables. Though I'm not sure if I should create a dedicated mail thread
since the bug was fixed and the follow-up is more of a refactoring. What
do you think?


v1-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-04-29 Thread Anthonin Bonnefoy
I've done some additional tests to validate the reported numbers. Using
pg_statio, it's possible to get the minimum number of block hits (full
script attached).

-- Save block hits before vacuum
SELECT pg_stat_force_next_flush();
SELECT heap_blks_hit, idx_blks_hit FROM pg_statio_all_tables where
relname='vestat' \gset
vacuum (verbose, index_cleanup on) vestat;
-- Check the difference
SELECT pg_stat_force_next_flush();
SELECT heap_blks_hit - :heap_blks_hit as delta_heap_hit,
   idx_blks_hit - :idx_blks_hit as delta_idx_hit,
   heap_blks_hit - :heap_blks_hit + idx_blks_hit - :idx_blks_hit as sum
FROM pg_statio_all_tables where relname='vestat';

Output:
...
buffer usage: 14676 hits, 0 misses, 667 dirtied
buffer usage (new): 16081 hits, 0 misses, 667 dirtied
...
-[ RECORD 1 ]--+------
delta_heap_hit | 9747
delta_idx_hit  | 6325
sum            | 16072

From pg_statio, we had 16072 block hits for the relation plus indexes.
Pre-patch, we are under-reporting with 14676.
Post-patch, we have 16081. The 9 additional block hits come from vacuum
accessing catalog tables like pg_class or pg_class_oid_index.


vacuum_block_with_pgstatio.sql
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-04-25 Thread Anthonin Bonnefoy
On Wed, Apr 24, 2024 at 4:01 PM Alena Rybakina 
wrote:

> I tested the main postgres branch with and without your fix using a script
> that was written by me. It consists of five scenarios and I made a
> comparison in the logs between the original version of the master branch
> and the master branch with your patch:
>
 Hi! Thanks for the tests.

I have attached a test file (vacuum_check_logs.sql)
>
Looking at the script, you won't trigger the problem. The reporting issue
will only happen with a parallel index vacuum, and that only happens when
there are at least two indexes [0]. You will need to create an additional
index.
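
For reference, the gate is roughly the following (paraphrased from the
vacuumlazy.c code at [0]; names simplified):

/* Paraphrase of the condition gating parallel vacuum in vacuumlazy.c:
 * it only kicks in when index vacuuming is enabled and the table has at
 * least two indexes, since a single worker handles a given index. */
static bool
parallel_vacuum_possible(int nworkers_requested, int nindexes,
						 bool do_index_vacuuming)
{
	return nworkers_requested >= 0 && nindexes > 1 && do_index_vacuuming;
}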

The same script was run, but using vacuum verbose analyze, and I saw the
> difference again in the fifth step:
> with your patch: buffer usage: 32312 hits, 607 misses, 1566 dirtied
> master: buffer usage: 32346 hits, 573 misses, 1360 dirtied
>
Isn't there a chance for the checkpointer to run during this time? That
could make the conditions between the two runs slightly different and
explain the change in buffer report.

[0]
https://github.com/postgres/postgres/blob/8a1b31e6e59631807a08a4e9465134c343bbdf5e/src/backend/access/heap/vacuumlazy.c#L2826-L2831


Re: Fix parallel vacuum buffer usage reporting

2024-04-24 Thread Anthonin Bonnefoy
Thanks for the review!


> I think that if the anayze command doesn't have the same issue, we
> don't need to change it.

Good point. I'd wrongly assumed that analyze was also impacted, but there's
no parallel analyze, so the block count is correct.


> (a) make lazy vacuum use BufferUsage instead of
> VacuumPage{Hit,Miss,Dirty}. (backpatched down to pg13).
> (b) make analyze use BufferUsage and remove VacuumPage{Hit,Miss,Dirty}
> variables for consistency and simplicity (only for HEAD, if we agree).
>
I've isolated the fix in the first patch. I've kept the analyze change and
the removal of the VacuumPage* variables split for clarity's sake.

BTW I realized that VACUUM VERBOSE running on a temp table always
> shows the number of dirtied buffers being 0, which seems to be a bug.
> The patch (a) will resolve it as well.
>
Indeed, it's visible with the following test:

SET temp_buffers='1024';
CREATE TEMPORARY TABLE vacuum_fix (aid int, bid int);
INSERT INTO vacuum_fix SELECT *, * FROM generate_series(1, 100);
VACUUM vacuum_fix;
UPDATE vacuum_fix SET bid = bid+1;
VACUUM (VERBOSE, INDEX_CLEANUP ON) vacuum_fix;

Pre-patch:
  avg read rate: 254.751 MB/s, avg write rate: 0.029 MB/s
  buffer usage: 8853 hits, 8856 misses, 1 dirtied
  WAL usage: 1 records, 1 full page images, 3049 bytes
The dirtied page is from pg_class (so not a local buffer).

With the fix:
  avg read rate: 250.230 MB/s, avg write rate: 250.230 MB/s
  buffer usage: 8853 hits, 8856 misses, 8856 dirtied
  WAL usage: 1 records, 1 full page images, 3049 bytes


v4-0002-Use-pgBufferUsage-for-analyze-block-reporting.patch
Description: Binary data


v4-0001-Fix-parallel-and-temp-vacuum-buffer-usage-reporti.patch
Description: Binary data


v4-0003-Remove-unused-variables-for-vacuum-buffer-usage.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-04-22 Thread Anthonin Bonnefoy
On Sat, Apr 20, 2024 at 2:00 PM Alena Rybakina 
wrote:

> Hi, thank you for your work with this subject.
>
> While I was reviewing your code, I noticed that your patch conflicts with
> another patch [0] that been committed.
>
> [0]
> https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
>

I've rebased the patch and also split the changes:
1: Use pgBufferUsage in Vacuum and Analyze block reporting
2: Stop tracking buffer usage in the now-unused VacuumPage{Hit,Miss,Dirty}
variables
3: Remove the declarations of VacuumPage{Hit,Miss,Dirty}


v3-0003-Remove-declaration-of-Vacuum-block-usage-tracking.patch
Description: Binary data


v3-0001-Fix-parallel-vacuum-buffer-usage-reporting.patch
Description: Binary data


v3-0002-Remove-unused-tracking-of-vacuum-buffer-usage.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-03-28 Thread Anthonin Bonnefoy
Hi,

Thanks for the review.

On Thu, Mar 28, 2024 at 4:07 AM Masahiko Sawada 
wrote:

> As for the proposed patch, the following part should handle the temp
> tables too:
>

True, I've missed the local blocks. I've updated the patch:
- read_rate and write_rate now include local block usage
- I've added a specific output for reporting local blocks instead of
combining them in the same output.

Regards,
Anthonin


v2-0001-Fix-parallel-vacuum-buffer-usage-reporting.patch
Description: Binary data


Re: POC: Extension for adding distributed tracing - pg_tracing

2024-03-12 Thread Anthonin Bonnefoy
Hi all!

Thanks for the reviews and comments.

> - pg_tracing.c should include postgres.h as the first thing
Will do.

> afaict none of the use of volatile is required, spinlocks have been barriers
> for a long time now
Got it, I will remove them. I've been mimicking what was done in
pg_stat_statements, where all spinlock accesses are done on volatile
variables [1].
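
For the n_writers increments Andres mentions below, the atomics API would
avoid the spinlock entirely; a minimal sketch (hypothetical struct and
field names):

#include "postgres.h"
#include "port/atomics.h"

/* Hypothetical shared state: the spinlock-protected counter becomes an
 * atomic. It must be initialized once with pg_atomic_init_u32(). */
typedef struct SharedTracingState
{
	pg_atomic_uint32 n_writers;		/* backends currently writing spans */
} SharedTracingState;

static void
enter_write_path(SharedTracingState *state)
{
	pg_atomic_fetch_add_u32(&state->n_writers, 1);
}

static void
leave_write_path(SharedTracingState *state)
{
	pg_atomic_fetch_sub_u32(&state->n_writers, 1);
}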

> - I don't think it's a good idea to do memory allocations in the middle of a
> PG_CATCH. If the error was due to out-of-memory, you'll throw another error.
Good point. I was wondering what the risks of generating spans for errors
were. I will try to find a better way to handle this.
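
One possible shape, sketched (hypothetical types; the point is that the
PG_CATCH block only touches memory reserved before PG_TRY, since a palloc
there could itself throw if the original error was out-of-memory):

#include "postgres.h"

/* Hypothetical span type, reserved before entering the traced section. */
typedef struct ErrorSpan
{
	bool		ended_in_error;
} ErrorSpan;

static void
run_traced(void (*run) (void), ErrorSpan *reserved_span)
{
	PG_TRY();
	{
		run();
	}
	PG_CATCH();
	{
		/* No allocations here: only fill in pre-reserved memory. */
		reserved_span->ended_in_error = true;
		PG_RE_THROW();
	}
	PG_END_TRY();
}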

To give a quick update: following Heikki's suggestion, I'm currently
working on publishing pg_tracing as a separate extension on GitHub. I
will send an update when it's ready.

[1]: 
https://github.com/postgres/postgres/blob/master/contrib/pg_stat_statements/pg_stat_statements.c#L2000

On Fri, Feb 9, 2024 at 7:50 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-02-09 19:46:58 +0300, Nikita Malakhov wrote:
> > I agree with Heikki on most topics and especially the one he recommended
> > to publish your extension on GitHub, this tool is very promising for highly
> > loaded
> > systems, you will get a lot of feedback very soon.
> >
> > I'm curious about SpinLock - it is recommended for very short operations,
> > taking up to several instructions, and docs say that for longer ones it
> > will be
> > too expensive, and recommends using LWLock. Why have you chosen SpinLock?
> > Does it have some benefits here?
>
> Indeed - e.g. end_tracing() looks to hold the spinlock for far too long for
> spinlocks to be appropriate. Use an lwlock.
>
> Random stuff I noticed while skimming:
> - pg_tracing.c should include postgres.h as the first thing
>
> - afaict none of the use of volatile is required, spinlocks have been barriers
>   for a long time now
>
> - you acquire the spinlock for single increments of n_writers, perhaps that
>   could become an atomic, to reduce contention?
>
> - I don't think it's a good idea to do memory allocations in the middle of a
>   PG_CATCH. If the error was due to out-of-memory, you'll throw another error.
>
>
> Greetings,
>
> Andres Freund




Fix expecteddir argument in pg_regress

2024-03-11 Thread Anthonin Bonnefoy
Hi all!

pg_regress accepts the expecteddir argument. However, it is never used:
outputdir is used instead to build the expected file paths.

This patch fixes this and makes pg_regress use the expecteddir argument as
expected.

Regards,
Anthonin


v01-0001-pg_regress-Use-expecteddir-for-the-expected-file.patch
Description: Binary data


Fix parallel vacuum buffer usage reporting

2024-02-09 Thread Anthonin Bonnefoy
Hi,

With a database set up with pgbench, we add an additional index:
CREATE INDEX ON pgbench_accounts(abalance)

And trigger several updates and vacuum to reach a stable amount of
dirtied pages:
UPDATE pgbench_accounts set abalance = abalance + 1 WHERE aid=1; CHECKPOINT;
VACUUM (VERBOSE, INDEX_CLEANUP ON) pgbench_accounts

The vacuum will report the following:
INFO:  vacuuming "postgres.public.pgbench_accounts"
INFO:  launched 1 parallel vacuum worker for index vacuuming (planned: 1)
INFO:  finished vacuuming "postgres.public.pgbench_accounts": index scans: 1
...
buffer usage: 122 hits, 165 misses, 4 dirtied

4 pages were reported dirtied. However, we have 5 dirtied blocks at
the end of the vacuum when looking at pg_buffercache:

SELECT c.relname, b.relfilenode
FROM pg_buffercache b
LEFT JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE isdirty = true;
            relname            | relfilenode
-------------------------------+-------------
 pg_class                      |        1259
 pgbench_accounts              |       16400
 pgbench_accounts              |       16400
 pgbench_accounts_pkey         |       16408
 pgbench_accounts_abalance_idx |       16480

The missing dirty block comes from the parallel worker vacuuming the
abalance index. Running vacuum with parallel disabled will give the
correct result.

Vacuum uses the dedicated VacuumPage{Hit,Miss,Dirty} globals to track
buffer usage. However, those values are not collected at the end of
parallel vacuum workers, leading to an incorrect buffer count.

Those vacuum-specific globals are redundant with the existing
pgBufferUsage and are only used in the verbose output. This patch removes
them and replaces them with pgBufferUsage, which is already correctly
collected at the end of parallel workers, fixing the buffer count.
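
For context, here is a simplified sketch of how parallel workers already
ship pgBufferUsage back to the leader (function names from instrument.h;
the surrounding structure paraphrases vacuumparallel.c):

#include "postgres.h"
#include "access/parallel.h"
#include "executor/instrument.h"

/* In the worker, at the end of its share of the index vacuuming:
 * publish this worker's buffer and WAL usage into shared arrays. */
static void
worker_publish_usage(BufferUsage *buffer_usage, WalUsage *wal_usage)
{
	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
						  &wal_usage[ParallelWorkerNumber]);
}

/* In the leader, after waiting for the workers: fold each worker's
 * counters back into pgBufferUsage/pgWalUsage. */
static void
leader_accumulate_usage(BufferUsage *buffer_usage, WalUsage *wal_usage,
						int nworkers)
{
	for (int i = 0; i < nworkers; i++)
		InstrAccumParallelQuery(&buffer_usage[i], &wal_usage[i]);
}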

Regards,
Anthonin


v01-0001-Fix-parallel-vacuum-buffer-usage-reporting.patch
Description: Binary data


Re: POC: Extension for adding distributed tracing - pg_tracing

2024-02-06 Thread Anthonin Bonnefoy
Hi!

On Fri, Jan 26, 2024 at 1:43 PM Nikita Malakhov  wrote:
> It's a good idea to split a big patch into several smaller ones.
> But you've already implemented these features - you could provide them
> as sequential small patches (i.e. v13-0002-guc-context-propagation.patch and 
> so on)

Keeping a consistent set of patches is going to be hard, as any change
in the first patch would require rebases that are likely to conflict. I
haven't discarded the features, but keeping them separated could help
with the discussion.

On Fri, Feb 2, 2024 at 1:41 PM Heikki Linnakangas  wrote:
>
> Hi Anthonin,
>
> I'm only now catching up on this thread. Very exciting feature!
Thanks! And thanks for the comments.

> - You're relying on various existing hooks, but it might be better to
> have the 'begin_span'/'end_span' calls directly in PostgreSQL source
> code in the right places. There are more places where spans might be
> nice, like when performing a large sort in tuplesort.c to name one
> example. Or even across every I/O; not sure how much overhead the spans
> incur and how much we'd be willing to accept. 'begin_span'/'end_span'
> could be new hooks that normally do nothing, but your extension would
> implement them.

Having information that would only be available to tracing doesn't
sound great. It probably should be added to the core instrumentation
which could be then leveraged by tracing and other systems.
One of the features I initially implemented was to track parse start
and end. However, since only the post-parse hook is available, I could
only infer when parsing started. Changing the core to track the parse
start would make this easier, and it could be used by EXPLAIN and
pg_stat_statements.
In the end, I've tried to match what's done in pg_stat_statements
since the core idea is the same, except pg_tracing outputs spans
instead of statistics.

> - How to pass the tracing context from the client? You went with
> SQLCommenter and others proposed a GUC. A protocol extension would make
> sense too. I can see merits to all of those. It probably would be good
> to support multiple mechanisms, but some might need core changes. It
> might be good to implement the mechanism for accepting trace context in
> core. Without the extension, we could still include the trace ID in the
> logs, for example.

I went with SQLCommenter for the initial implementation since it has
the benefit of already being implemented and working out of the box.
Jelte has an ongoing patch to add the possibility to modify GUC values
through a protocol message [1]. I've tested it and it worked well
without adding too much complexity to pg_tracing. I imagine a
dedicated traceparent GUC variable could also be used in the core to
propagate the trace context to the logs and other places.

> - Include EXPLAIN plans in the traces.
This was already implemented in the first versions of the patch (or
rather, the planstate was translated to spans). I've removed it in the
latest patch as it was contributing 50% of the code and also required a
core change, namely a change in instrument.c to track the first start of a
node. Though if you were thinking of the raw EXPLAIN ANALYZE text dumped
as span metadata, that could be doable.

> # Comments on the implementation
>
> There was discussion already on push vs pull model. Currently, you
> collect the traces in memory / files, and require a client to poll them.
> A push model is more common in applications that support tracing. If
> Postgres could connect directly to the OTEL collector, you'd need one
> fewer running component. You could keep the traces in backend-private
> memory, no need to deal with shared memory and spilling to files.

PostgreSQL metrics are currently exposed through views like pg_stat_*,
and users likely have collectors pulling those metrics (like the OTEL
PostgreSQL Receiver [2]). Having this collector also pull traces
would keep the logic in a single place and would avoid a situation
where both pull and push are used.

> 2. Start a new pgsql-hackers thread on in-core changes that would
> benefit the extension. Write patches for them. I'm thinking of the
> things I listed in the Core changes section above, but maybe there are
> others.
I already have an ongoing patch (more support for the extended protocol in
psql [3], needed for tests) and plan to do others (the parse changes
I've mentioned), though I also don't want to spread myself too thin.


[1]: https://commitfest.postgresql.org/46/4736/
[2]: 
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/postgresqlreceiver/README.md
[3]: https://commitfest.postgresql.org/46/4650/

On Fri, Feb 2, 2024 at 5:42 PM Jan Katins  wrote:
>
> Hi!
>
> On Fri, 2 Feb 2024 at 13:41, Heikki Linnakangas  wrote:
>>
>> PS. Does any other DBMS implement this? Any lessons to be learned from them?
>
>
> Mysql 8.3 has open telemetrie component, so very fresh:
>
> https://dev.mysql.com/doc/refman/8.3/en/telemetry-trace-install.html
> https:

Re: A performance issue with Memoize

2024-01-31 Thread Anthonin Bonnefoy
Hi,

I've seen a similar issue with the following query (tested on the current head):

EXPLAIN ANALYZE SELECT * FROM tenk1 t1
LEFT JOIN LATERAL (SELECT t1.two, tenk2.hundred, tenk2.two FROM tenk2) t2
ON t1.hundred = t2.hundred WHERE t1.hundred < 5;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=8.46..1495.10 rows=5 width=256)
(actual time=0.860..111.013 rows=5 loops=1)
   ->  Bitmap Heap Scan on tenk1 t1  (cost=8.16..376.77 rows=500
width=244) (actual time=0.798..1.418 rows=500 loops=1)
 Recheck Cond: (hundred < 5)
 Heap Blocks: exact=263
 ->  Bitmap Index Scan on tenk1_hundred  (cost=0.00..8.04
rows=500 width=0) (actual time=0.230..0.230 rows=500 loops=1)
   Index Cond: (hundred < 5)
   ->  Memoize  (cost=0.30..4.89 rows=100 width=12) (actual
time=0.009..0.180 rows=100 loops=500)
 Cache Key: t1.hundred
 Cache Mode: logical
 Hits: 0  Misses: 500  Evictions: 499  Overflows: 0  Memory Usage: 5kB
 ->  Index Scan using tenk2_hundred on tenk2  (cost=0.29..4.88
rows=100 width=12) (actual time=0.007..0.124 rows=100 loops=500)
   Index Cond: (hundred = t1.hundred)
 Planning Time: 0.661 ms
 Execution Time: 113.076 ms
(14 rows)

The Memoize node's cache key only uses t1.hundred, while the nested loop
has two changed parameters: the lateral var t1.two and t1.hundred. This
leads to a chgParam that is always different, so the cache is purged
on each rescan.

Regards,
Anthonin

On Sat, Jan 27, 2024 at 5:00 AM Alexander Lakhin  wrote:
>
> Hello,
>
> 27.01.2024 00:09, Tom Lane wrote:
> > David Rowley  writes:
> >> On Sat, 27 Jan 2024 at 09:41, Tom Lane  wrote:
> >>> drongo and fairywren are consistently failing the test case added
> >>> by this commit.  I'm not quite sure why the behavior of Memoize
> >>> would be platform-specific when we're dealing with integers,
> >>> but ...
> >> Maybe snprintf(buf, "%.*f", 0, 5.0 / 2.0); results in "3" on those
> >> rather than "2"?
> >> Looking at the code in fmtfloat(), we fallback on the built-in snprintf.
> > Maybe ... I don't have a better theory.
>
> FWIW, I've found where this behaviour is documented:
> https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/sprintf-sprintf-l-swprintf-swprintf-l-swprintf-l?view=msvc-170
>
> (I've remembered a case with test/sql/partition_prune from 2020, where
> sprintf on Windows worked the other way.)
>
>
> Best regards,
> Alexander
>
>




Re: Add \syncpipeline command to pgbench

2024-01-22 Thread Anthonin Bonnefoy
That looks like a bug: open pipelines are not caught at the end of
script processing. \startpipeline seems to have similar issues.

$ cat only_startpipeline.sql
\startpipeline
SELECT 1;

With only 1 transaction, pgbench will consider this a success despite
not sending anything since the pipeline was not flushed:
pgbench -t1 -Mextended -f only_startpipeline.sql
[...]
number of transactions per client: 1
number of transactions actually processed: 1/1

With 2 transactions, the error will happen when \startpipeline is
called a second time:
pgbench -t2 -Mextended -f only_startpipeline.sql
[...]
pgbench: error: client 0 aborted in command 0 (startpipeline) of
script 0; already in pipeline mode
number of transactions per client: 2
number of transactions actually processed: 1/2

I've split the changes into two patches.
0001 introduces a new error when the end of a pgbench script is
reached while there's still an ongoing pipeline.
0002 adds the \syncpipeline command (original patch with an additional
test case).

Regards,
Anthonin

On Mon, Jan 22, 2024 at 7:16 AM Michael Paquier  wrote:
>
> On Fri, Jan 19, 2024 at 08:55:31AM +0100, Anthonin Bonnefoy wrote:
> > I've updated the doc to group the commands. It does look better and
> > more consistent with similar command groups like \if.
>
> I was playing with a few meta command scenarios while looking at this
> patch, and this sequence generates an error that should never happen:
> $ cat /tmp/test.sql
> \startpipeline
> \syncpipeline
> $ pgbench -n -f /tmp/test.sql -M extended
> [...]
> pgbench: error: unexpected transaction status 1
> pgbench: error: client 0 aborted while receiving the transaction status
>
> It looks to me that we need to be much smarter than that for the error
> handling we'd need when a sync request is optionally sent when a
> transaction stops at the end of pgbench.  Could you look at it?
> --
> Michael


v3-0001-Abort-pgbench-if-script-end-is-reached-with-an-op.patch
Description: Binary data


v3-0002-pgbench-Add-syncpipeline-to-pgbench.patch
Description: Binary data


Re: Add \syncpipeline command to pgbench

2024-01-18 Thread Anthonin Bonnefoy
On Fri, Jan 19, 2024 at 5:08 AM Michael Paquier  wrote:
> The logic looks sound, but I have a
> comment about the docs: could it be better to group \syncpipeline with
> \startpipeline and \endpipeline?  \syncpipeline requires a pipeline to
> work.

I've updated the doc to group the commands. It does look better and
more consistent with similar command groups like \if.

Regards,
Anthonin


v2-0001-pgbench-Add-syncpipeline-to-pgbench.patch
Description: Binary data


Add \syncpipeline command to pgbench

2024-01-18 Thread Anthonin Bonnefoy
Hi,

PQsendSyncMessage() was added in
https://commitfest.postgresql.org/46/4262/. It allows users to add a
Sync message without flushing the buffer.

As a follow-up, this change adds an additional meta-command to
pgbench, \syncpipeline, which will call PQsendSyncMessage(). This will
make it possible to benchmark impact and improvements of using
PQsendSyncMessage through pgbench.
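
For readers unfamiliar with the call, here is a minimal libpq sketch of
what \syncpipeline corresponds to (error handling elided; assumes a libpq
version that ships PQsendSyncMessage):

#include <libpq-fe.h>

/* Two queries separated by a non-flushing Sync, then a flushing one. */
static void
pipeline_with_sync(PGconn *conn)
{
	PGresult   *res;
	int			syncs_seen = 0;

	PQenterPipelineMode(conn);

	PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
	PQsendSyncMessage(conn);	/* adds a Sync, does not flush the buffer */

	PQsendQueryParams(conn, "SELECT 2", 0, NULL, NULL, NULL, NULL, 0);
	PQpipelineSync(conn);		/* adds a Sync and flushes */

	/* Drain results until both Sync responses have been consumed. */
	while (syncs_seen < 2)
	{
		res = PQgetResult(conn);
		if (res == NULL)
			continue;			/* boundary between queries */
		if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
			syncs_seen++;
		PQclear(res);
	}

	PQexitPipelineMode(conn);
}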

Regards,
Anthonin


v1-0001-pgbench-Add-syncpipeline-to-pgbench.patch
Description: Binary data


Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx

2024-01-18 Thread Anthonin Bonnefoy
> Hmm.  So it does not lead to any user-visible changes, right?

From what I can tell, there's no change in the behaviour. All paths
would eventually go through HandleSlashCmds's cleanup logic. This is
also mentioned in ignore_slash_options's comment.

* Read and discard "normal" slash command options.
*
* This should be used for inactive-branch processing of any slash command
* that eats one or more OT_NORMAL, OT_SQLID, or OT_SQLIDHACK parameters.
* We don't need to worry about exactly how many it would eat, since the
* cleanup logic in HandleSlashCmds would silently discard any extras anyway.

> Could you split the bits for \bind into a separate patch, please?
> This requires a separate evaluation, especially if this had better be
> backpatched.

Done. Patch 1 adds ignore_slash_options to \bind. Patch 2 adds the new
\bindx, \close and \parse commands.


v4-0002-psql-Add-support-for-prepared-stmt-with-extended-.patch
Description: Binary data


v4-0001-psql-Add-ignore_slash_options-in-bind-s-inactive-.patch
Description: Binary data


Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx

2024-01-17 Thread Anthonin Bonnefoy
> I do realize the same is true for plain \bind, but it seems
> like a bug there too.

The unscanned \bind parameters are discarded later in the
HandleSlashCmds function, so adding ignore_slash_options() for
inactive branches scans and discards them earlier. I will add it to
match what's done in the other commands, but I don't think it's
testable, as the behaviour is the same unless I'm missing something.

I did add \bind, \bindx, \close and \parse to the inactive-branch
tests to complete the list.

> One more usability thing. I think \parse and \close should not require
> a \g to send the message. You can do that by returning PSQL_CMD_SEND
> instead of PSQL_CMD_SKIP_LIN

Changed.

> I think the examples for \bindx and \close
> should use \parse instead of PREPARE

Done. I had to rely on manual PREPARE for my first tests and it leaked
into the docs.


v3-0001-psql-Add-support-for-prepared-stmt-with-extended-.patch
Description: Binary data


Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx

2024-01-16 Thread Anthonin Bonnefoy
Hi,

Thanks for the review and comments.

> One thing that I think should be added for completeness though is the
> ability to deallocate the prepared statement using
> PQsendClosePrepared. Other than that the changes look great.
Good point, I've added the \close command.

> Also a tiny nitpick: stmt_name should be replaced with STMT_NAME in
> this line of the help message.
Fixed


On Sat, Jan 13, 2024 at 3:37 PM Jelte Fennema-Nio  wrote:
>
> On Thu, 2 Nov 2023 at 10:52, Anthonin Bonnefoy
>  wrote:
> > The main goal is to provide more ways to test extended protocol in
> > regression tests
> > similarly to what \bind is doing.
>
> I think this is a great addition. One thing that I think should be
> added for completeness though is the ability to deallocate the
> prepared statement using PQsendClosePrepared. Other than that the
> changes look great.
>
> Also a tiny nitpick: stmt_name should be replaced with STMT_NAME in
> this line of the help message.
>
> > +   HELP0("  \\bindx stmt_name [PARAM]...\n"


v2-0001-psql-Add-support-for-prepared-stmt-with-extended-.patch
Description: Binary data


Re: POC: Extension for adding distributed tracing - pg_tracing

2024-01-04 Thread Anthonin Bonnefoy
Hi,

> This approach avoids the need to rewrite SQL or give special meaning to SQL 
> comments.

SQLCommenter already has a good amount of support and is referenced in
OpenTelemetry: https://github.com/open-telemetry/opentelemetry-sqlcommenter
So the goal was to leverage the existing trace propagation systems as
much as possible.

> I used GUCs to set the `opentelemtery_trace_id` and `opentelemetry_span_id`.
> These can be sent as protocol-level messages by the client driver if the
> client driver has native trace integration enabled, in which case the user
> doesn't need to see or care. And for other drivers, the application can use
> `SET` or `SET LOCAL` to assign them.

Emitting `SET` and `SET LOCAL` for every traced statement sounds more
disruptive and expensive than relying on SQLCommenter. When not using
`SET LOCAL`, the variables also need to be cleared.
This would also introduce a lot of headaches, as the `SET` statements
themselves could be traced and generate spans, and tracing utility
statements is already tricky enough.

> But IIRC the current BDR/PGD folks are now using an OpenTelemetry
> sidecar process instead. I think they send it UDP packets to emit
> metrics and events.

I would be curious to hear more about this. I didn't want to be tied
to a specific protocol (I've only seen gRPC and HTTP support on the latest
OpenTelemetry collectors) and I fear that relying on Postgres sending
traces would lower the chance of having this supported on managed
Postgres solutions (like RDS and CloudSQL).

> By queries you mean particular queries, not transactions? And since
> they have an assigned ID it means that the query is already executing
> and we want to enable the tracking in another session, right?

I think that was the idea. The query identifier generated for a
specific statement is stable, so we could have a GUC variable with a
list of query ids, and only queries matching the provided query ids
would be sampled. This would make it easier to generate traces for a
specific query within a session.
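
A sketch of what that check could look like (hypothetical GUC and helper
names; a real implementation would parse the GUC once into a list):

#include "postgres.h"
#include "nodes/pg_list.h"

/* Hypothetical check: sample only when the statement's queryId is in a
 * pre-parsed list built from a pg_tracing.trace_query_ids GUC. */
static bool
query_id_is_traced(uint64 query_id, List *traced_query_ids)
{
	ListCell   *lc;

	foreach(lc, traced_query_ids)
	{
		if (query_id == *(uint64 *) lfirst(lc))
			return true;
	}
	return false;
}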


On Tue, Jan 2, 2024 at 1:14 PM Aleksander Alekseev
 wrote:
>
> Hi,
>
> > Overall solution looks good for me except SQL Commenter - query 
> > instrumentation
> > with SQL comments is often not possible on production systems. Instead
> > the very often requested feature is to enable tracing for a given single 
> > query ID,
> > or several (very limited number of) queries by IDs. It could be done by 
> > adding
> > SQL function to add and remove query ID into a list (even array would do)
> > stored in top tracing context.
>
> Not 100% sure if I follow.
>
> By queries you mean particular queries, not transactions? And since
> they have an assigned ID it means that the query is already executing
> and we want to enable the tracking in another session, right? If this
> is the case I would argue that enabling/disabling tracing for an
> _already_ running query (or transaction) would be way too complicated.
>
> I wouldn't mind enabling/disabling tracing for the current session
> and/or a given session ID. In the latter case it can have an effect
> only for the new transactions. This however could be implemented
> separately from the proposed patchset. I suggest keeping the scope
> small.
>
> --
> Best regards,
> Aleksander Alekseev




Re: Possible segfault when sending notification within a ProcessUtility hook

2023-12-06 Thread Anthonin Bonnefoy
> On Tue, Dec 5, 2023 at 9:03 PM Tom Lane  wrote:
> Why should we regard that as anything other than a bug in the
> ProcessUtility hook?  A failed transaction should not send any
> notifies.

Fair point. That was also my initial assumption, but I thought the
transaction state was not available from a hook, as I had missed
IsAbortedTransactionBlockState.

I will rely on IsAbortedTransactionBlockState to avoid this case,
thanks for the input.
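
For reference, a minimal sketch of the guard (hypothetical helper called
from a ProcessUtility hook; Async_Notify is the internal entry point
behind NOTIFY):

#include "postgres.h"
#include "access/xact.h"
#include "commands/async.h"

/* Skip queuing a notification when the surrounding transaction block
 * has already failed: anything queued would be wiped along with the
 * TopTransactionContext during cleanup. */
static void
notify_if_transaction_ok(const char *channel)
{
	if (IsAbortedTransactionBlockState())
		return;

	Async_Notify(channel, NULL);
}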

Regards,
Anthonin.




Possible segfault when sending notification within a ProcessUtility hook

2023-12-05 Thread Anthonin Bonnefoy
Hi,

I've encountered the following segfault:

#0: 0x000104e821a8 postgres`list_head(l=0x7f7f7f7f7f7f7f7f) at
pg_list.h:130:17
#1: 0x000104e81c9c postgres`PreCommit_Notify at async.c:932:16
#2: 0x000104dd02f8 postgres`CommitTransaction at xact.c:2236:2
#3: 0x000104dcfc24 postgres`CommitTransactionCommand at xact.c:3061:4
#4: 0x00010528a880 postgres`finish_xact_command at postgres.c:2777:3
#5: 0x0001052883ac postgres`exec_simple_query(query_string="notify
test;") at postgres.c:1298:4

This happens when a transaction block fails and a ProcessUtility hook
sends a notification during the rollback command.

When a transaction block fails, it will enter the TBLOCK_ABORT state,
waiting for a rollback. Calling rollback will switch to the
TBLOCK_ABORT_END state and will only go through CleanupTransaction.
If a hook sends a notification during the rollback command, a
notification will be queued but its content will be wiped when the
TopTransactionContext is destroyed.
Trying to send a notification immediately after will segfault in
PreCommit_Notify as pendingNotifies->events will be invalid.

There's a test_notify_rollback test module attached to the patch that reproduces
the issue.

Moving the notification cleanup from AbortTransaction to CleanupTransaction
fixes the issue, as it will clear pendingActions in the same function that
destroys the TopTransactionContext.

Regards,
Anthonin


v1-0001-Fix-segfault-when-notifying-in-a-ProcessUtility-h.patch
Description: Binary data


Re: Add PQsendSyncMessage() to libpq

2023-11-13 Thread Anthonin Bonnefoy
Hi,

I've played a bit with the patch on my side. One thing that would be
great would be to make this available in pgbench through a \syncpipeline
meta-command. That would make it easier for users to test whether there's
a positive impact with their queries or not.

I've written a patch to add it to pgbench (I don't want to mess with the
thread's attachments, so here's a GitHub link:
https://github.com/bonnefoa/postgres/commit/047b5b05169e36361fe29fef9f430da045ef012d).
Here are some quick results:

echo "\set aid1 random(1, 10 * :scale)
\set aid2 random(1, 10 * :scale)
\startpipeline
select 1;
select * from pgbench_accounts where aid=:aid1;
select 2;
\syncpipeline
select 1;
select * from pgbench_accounts where aid=:aid2;
select 2;
\endpipeline" > /tmp/pipeline_without_flush.sql
pgbench -T30 -Mextended -f /tmp/pipeline_without_flush.sql -h127.0.0.1
latency average = 0.383 ms
initial connection time = 2.810 ms
tps = 2607.587877 (without initial connection time)

echo "\set aid1 random(1, 10 * :scale)
\set aid2 random(1, 10 * :scale)
\startpipeline
select 1;
select * from pgbench_accounts where aid=:aid1;
select 2;
\endpipeline
\startpipeline
select 1;
select * from pgbench_accounts where aid=:aid2;
select 2;
\endpipeline" > /tmp/pipeline_with_flush.sql
pgbench -T30 -Mextended -f /tmp/pipeline_with_flush.sql -h127.0.0.1
latency average = 0.437 ms
initial connection time = 2.602 ms
tps = 2290.527462 (without initial connection time)

I took some perf profiles, and the main change is the server spending less
time in ReadCommand, which makes sense since the commands are sent in a
single TCP frame with the \syncpipeline version.

Regards,
Anthonin

On Sun, Nov 12, 2023 at 2:37 PM Anton Kirilov  wrote:
>
> Hello,
>
> Thanks for the feedback!
>
> On 07/11/2023 09:23, Jelte Fennema-Nio wrote:
>  > But I think it's looking at the situation from the wrong direction.
> [...] we should look at it as an addition to our current list of PQsend
> functions for a new packet type. And none of those PQsend functions ever
> needed a flag. Which makes sense, because they are the lowest level
> building blocks that make sense from a user perspective: They send a
> message type over the socket and don't do anything else.
>
> Yes, I think that this is quite close to my thinking when I created the
> original version of the patch. Also, the protocol specification states
> that the Sync message lacks parameters.
>
> Since there haven't been any comments from the other people who have
> chimed in on this e-mail thread, I will assume that there is consensus
> (we are doing a U-turn with the implementation approach after all), so
> here is the updated version of the patch.
>
> Best wishes,
> Anton Kirilov




[PATCH] Add additional extended protocol commands to psql: \parse and \bindx

2023-11-02 Thread Anthonin Bonnefoy
Hi all!

Currently, only unnamed prepared statements are supported by psql, through
the \bind command, and it's not possible to create or use named prepared
statements through the extended protocol.

This patch introduces 2 additional commands: \parse and \bindx.
\parse allows issuing a Parse message to create a named prepared statement
through the extended protocol.
\bindx allows binding and executing a named prepared statement through the
extended protocol.

The main goal is to provide more ways to test the extended protocol in
regression tests, similarly to what \bind does.

Regards,
Anthonin


0001-psql-Add-support-for-prepared-stmt-with-extended-pro.patch
Description: Binary data


Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-28 Thread Anthonin Bonnefoy
> What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT
> and INSTR_TIME_GET_MILLISEC macros for timing calculations?
If you're talking about the two instances where I'm modifying the
instr_time's ticks, it's because I can't use the macros there.
The first case is for the parse span. I only have the start timestamp
from GetCurrentStatementStartTimestamp and don't have access to the start
instr_time, so I need to build the duration from two timestamps.
The second case is when building node spans from the planstate. I directly
have the duration from Instrumentation.

I guess one fix would be to use an int64 for the span duration, directly
storing nanoseconds instead of an instr_time, but I do use the
instrumentation macros outside of those two cases to get the duration of
other spans.
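
A sketch of what that int64 scheme could look like for the two cases
(hypothetical helpers; assumes the nanosecond-based instr_time of recent
PostgreSQL versions):

#include "postgres.h"
#include "portability/instr_time.h"
#include "utils/timestamp.h"

/* Case 1 (parse span): only two TimestampTz values are available;
 * TimestampTz is microseconds since the PostgreSQL epoch. */
static int64
duration_ns_from_timestamps(TimestampTz start, TimestampTz end)
{
	return (int64) (end - start) * 1000;
}

/* Case 2 (planstate): the duration is already an instr_time. */
static int64
duration_ns_from_instr(instr_time duration)
{
	return INSTR_TIME_GET_NANOSEC(duration);
}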

> Also, have you thought about a way to trace existing (running) queries
> without directly instrumenting them?
That's a good point. I was focusing on leaving the sampling decision to the
caller through the sampled flag and only recently added the
pg_tracing_sample_rate parameter to give more control. It should be
straightforward to add an option to create standalone traces based on the
sample rate alone. This way, setting the sample rate to 1 would force the
queries running in the session to be traced.


On Fri, Jul 28, 2023 at 3:02 PM Nikita Malakhov  wrote:

> Hi!
>
> What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT
> and INSTR_TIME_GET_MILLISEC
> macros for timing calculations?
>
> Also, have you thought about a way to trace existing (running) queries
> without directly instrumenting them? It would
> be much more usable for maintenance and support personnel, because in
> production environments you rarely could
> change query text directly. For the current state the most simple solution
> is switch tracing on and off by calling SQL
> function, and possibly switch tracing for prepared statement the same way,
> but not for any random query.
>
> I'll check the patch for the race conditions.
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/
>


Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-28 Thread Anthonin Bonnefoy
Hi,

> I'd keep Autotools build up to date
Definitely. The patch is still in a rough state and updating Autotools fell
through the cracks.

> I'm currently playing with the patch and
> reviewing sources, if you need any cooperation - please let us know.
The goal for me was to get validation on the design and to check if there
was anything that would be a blocker before committing more time to the
project.

There are already several things I was planning on improving:
- The parse span is something I've added recently. I've realised that I
could rely on the stmtStartTimestamp to get the time spent in parsing from
the post-parse hook, so the code around this is still rough.
- There are probably race conditions on the reads and writes of the query
file that need to be fixed
- I'm using the same Span structure for everything while Executor spans
only need a small subset, which is a bit wasteful. I've tried to use two
different shared buffers (base spans and spans with counters) but it was
making things more confusing.
- Finish commenting the code and start writing the docs

So any comments, missing features and reviews on the current state of the
project are welcome.

Regards,
Anthonin

On Fri, Jul 28, 2023 at 9:35 AM Nikita Malakhov  wrote:

> Hi,
>
> I'd keep Autotools build up to date, because Meson is very limited in
> terms of not very
> up-to-date OS versions. Anyway, it's OK now. I'm currently playing with
> the patch and
> reviewing sources, if you need any cooperation - please let us know.
> I'm +1 for committing
> this patch after review.
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/
>


Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-27 Thread Anthonin Bonnefoy
> Agree, something goes wrong when using Autotools (but not Meson) on
> both Linux and MacOS. I didn't investigate the issue though.
I was only using Meson and forgot to keep the Automake files up to date
when I split pg_tracing.c into multiple files (span.c, explain.c...).

On Fri, Jul 28, 2023 at 8:10 AM Nikita Malakhov  wrote:

> Hi,
>
> I've fixed the Autotools build, please check patch below (v2).
>
> On Thu, Jul 27, 2023 at 6:39 PM Aleksander Alekseev <
> aleksan...@timescale.com> wrote:
>
>> Hi,
>>
>> > Also FYI, there are build warnings because functions
>> > const char * get_span_name(const Span * span, const char *qbuffer)
>> > and
>> > const char * get_operation_name(const Span * span, const char *qbuffer)
>> > do not have default inside switch and no return outside of switch.
>>
>> You are right, there are a few warnings:
>>
>> ```
>> [1566/1887] Compiling C object contrib/pg_tracing/pg_tracing.so.p/span.c.o
>> ../contrib/pg_tracing/span.c: In function ‘get_span_name’:
>> ../contrib/pg_tracing/span.c:210:1: warning: control reaches end of
>> non-void function [-Wreturn-type]
>>   210 | }
>>   | ^
>> ../contrib/pg_tracing/span.c: In function ‘get_operation_name’:
>> ../contrib/pg_tracing/span.c:249:1: warning: control reaches end of
>> non-void function [-Wreturn-type]
>>   249 | }
>>   | ^
>> ```
>>
>> Here is the patch v2 with a quick fix.
>>
>> > but got errors calling make check and cannot install the extension
>>
>> Agree, something goes wrong when using Autotools (but not Meson) on
>> both Linux and MacOS. I didn't investigate the issue though.
>>
>> --
>> Best regards,
>> Aleksander Alekseev
>>
>
>
> --
> Regards,
>
> --
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/
>


Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-26 Thread Anthonin Bonnefoy
I initially thought of sending the spans from PostgreSQL, since this is
the usual behavior of tracing libraries.

However, this created a lot of potential issues:

- Protocol support and differences between trace collectors. OpenTelemetry
seems to use gRPC, others use HTTP; those would require additional
libraries (plus gRPC support in C doesn't look good), and any change in the
protobuf definitions would require updating the extension.

- Do we send the spans within the query hooks? This means that we could
block the process if the trace collector is slow to answer or we can't
connect. Sending spans from a background process sounded rather complex and
resource-heavy.

Moving to a pull model fixed those issues and felt more natural, as this is
the way PostgreSQL exposes its metrics.


On Wed, Jul 26, 2023 at 4:11 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Nikita,
>
> > This patch looks very interesting, I'm working on the same subject too.
> But I've used
> > another approach - I'm using C wrapper for C++ API library from
> OpenTelemetry, and
> > handle span storage and output to this library. There are some nuances
> though, but it
> > is possible. Have you tried to use OpenTelemetry APIs instead of
> implementing all
> > functionality around spans?
>
> I don't think that PostgreSQL accepts such kind of C++ code, not to
> mention the fact that the PostgreSQL license is not necessarily
> compatible with Apache 2.0 (I'm not a lawyer; this is not a legal
> advice). Such a design decision will probably require using separate
> compile flags since the user doesn't necessarily have a corresponding
> dependency installed. Similarly to how we do with LLVM, OpenSSL, etc.
>
> So -1 to the OpenTelemetry C++ library and +1 to the properly licensed
> C implementation without 3rd party dependencies from me. Especially
> considering the fact that the implementation seems to be rather
> simple.
>
> --
> Best regards,
> Aleksander Alekseev
>


Re: Flush SLRU counters in checkpointer process

2023-07-18 Thread Anthonin Bonnefoy
I think I've managed to reproduce the issue. The test I had added to check
the SLRU flush was the one failing in the regression suite.

SELECT SUM(flushes) > :slru_flushes_before FROM pg_stat_slru;
 ?column?
--
 t

The origin seems to be a race condition on have_slrustats (
https://github.com/postgres/postgres/blob/c8e1ba736b2b9e8c98d37a5b77c4ed31baf94147/src/backend/utils/activity/pgstat_slru.c#L161-L162
).
I will try to get a new patch with improved test stability.


On Mon, Jul 3, 2023 at 3:18 PM Daniel Gustafsson  wrote:

> > On 3 Mar 2023, at 09:06, Anthonin Bonnefoy <
> anthonin.bonne...@datadoghq.com> wrote:
> >
> > Here's the patch rebased with Andres' suggestions.
> > Happy to update it if there's any additionalj change required.
>
> This patch crashes 031_recovery_conflict with a SIGInvalid on Windows, can
> you
> please investigate and see what might be going on there?  The test passed
> about
> 4 days ago on Windows so unless it's the CI being flaky it should be due
> to a
> recent change.
>
> If you don't have access to a Windows environment you can run your own
> instrumented builds in your Github account with the CI files in the
> postgres
> repo.
>
> --
> Daniel Gustafsson
>
>


[PATCH] Add statement_timeout in pg_stat_activity

2023-04-04 Thread Anthonin Bonnefoy
Hello hackers.

This patch adds the backend's statement_timeout value to pg_stat_activity.

This would provide some insight into clients that are disabling a default
statement timeout or overriding it through a pgbouncer, messing with other
sessions.

pg_stat_activity seemed like the best place to have this information.

Regards,
Anthonin
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d5a45f996d..2109595c40 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -861,6 +861,15 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ 
+  
+   statement_timeout integer
+  
+  
+   The current backend's statement_timeout.
+  
+ 
+
  
   
wait_event_type text
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 574cbc2e44..7c24ef92bf 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -860,6 +860,7 @@ CREATE VIEW pg_stat_activity AS
 S.xact_start,
 S.query_start,
 S.state_change,
+S.statement_timeout,
 S.wait_event_type,
 S.wait_event,
 S.state,
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index f0f2e07655..0a2ba26070 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1054,6 +1054,16 @@ assign_application_name(const char *newval, void *extra)
 	pgstat_report_appname(newval);
 }
 
+/*
+ * GUC assign_hook for statement_timeout
+ */
+void
+assign_statement_timeout(int newval, void *extra)
+{
+	/* Update the pg_stat_activity view */
+	pgstat_report_statement_timeout(newval);
+}
+
 /*
  * GUC check_hook for cluster_name
  */
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 608d01ea0d..5d94cb0ba6 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -336,6 +336,7 @@ pgstat_bestart(void)
 	lbeentry.st_activity_start_timestamp = 0;
 	lbeentry.st_state_start_timestamp = 0;
 	lbeentry.st_xact_start_timestamp = 0;
+	lbeentry.st_statement_timeout = 0;
 	lbeentry.st_databaseid = MyDatabaseId;
 
 	/* We have userid for client-backends, wal-sender and bgworker processes */
@@ -446,9 +447,11 @@ pgstat_bestart(void)
 
 	PGSTAT_END_WRITE_ACTIVITY(vbeentry);
 
-	/* Update app name to current GUC setting */
-	if (application_name)
+	/* Update app name and statement timeout to current GUC setting */
+	if (application_name) {
 		pgstat_report_appname(application_name);
+		pgstat_report_statement_timeout(StatementTimeout);
+	}
 }
 
 /*
@@ -692,6 +695,33 @@ pgstat_report_appname(const char *appname)
 	PGSTAT_END_WRITE_ACTIVITY(beentry);
 }
 
+/* --
+ * pgstat_report_statement_timeout() -
+ *
+ *	Called to update statement_timeout.
+ * --
+ */
+void
+pgstat_report_statement_timeout(int statement_timeout)
+{
+	volatile PgBackendStatus *beentry = MyBEEntry;
+
+	if (!beentry)
+		return;
+
+	/*
+	 * Update my status entry, following the protocol of bumping
+	 * st_changecount before and after.  We use a volatile pointer here to
+	 * ensure the compiler doesn't try to get cute.
+	 */
+	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
+
+	beentry->st_statement_timeout = statement_timeout;
+
+	PGSTAT_END_WRITE_ACTIVITY(beentry);
+}
+
+
 /*
  * Report current transaction start timestamp as the specified value.
  * Zero means there is no active transaction.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index eec9f3cf9b..de72c2c1d0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -303,7 +303,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_activity(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_ACTIVITY_COLS	30
+#define PG_STAT_GET_ACTIVITY_COLS	31
 	int			num_backends = pgstat_fetch_stat_numbackends();
 	int			curr_backend;
 	int			pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
@@ -612,6 +612,12 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 nulls[29] = true;
 			else
 values[29] = UInt64GetDatum(beentry->st_query_id);
+			/* Only report statement_timeout for client backend */
+			if (beentry->st_backendType == B_BACKEND) {
+values[30] = Int32GetDatum(beentry->st_statement_timeout);
+			} else {
+nulls[30] = true;
+			}
 		}
 		else
 		{
@@ -640,6 +646,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			nulls[27] = true;
 			nulls[28] = true;
 			nulls[29] = true;
+			nulls[30] = true;
 		}
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8062589efd..5dd33f583f 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2478,7 +2478,7 @@ struct config_int ConfigureNamesInt[] =
 		

Re: Flush SLRU counters in checkpointer process

2023-03-03 Thread Anthonin Bonnefoy
Here's the patch rebased with Andres' suggestions.
Happy to update it if any additional change is required.


On Wed, Mar 1, 2023 at 8:46 PM Gregory Stark (as CFM) 
wrote:

> On Thu, 12 Jan 2023 at 03:46, Anthonin Bonnefoy
>  wrote:
> >
> >
> > That would make sense. I've created a new patch with everything moved in
> pgstat_report_checkpointer().
> > I did split the checkpointer flush in a pgstat_flush_checkpointer()
> function as it seemed more readable. Thought?
>
> This patch appears to need a rebase. Is there really any feedback
> needed or is it ready for committer once it's rebased?
>
>
>
> --
> Gregory Stark
> As Commitfest Manager
>


flush-slru-counters-v3.patch
Description: Binary data


Re: Flush SLRU counters in checkpointer process

2023-01-12 Thread Anthonin Bonnefoy
On Wed, Jan 11, 2023 at 5:33 PM Andres Freund  wrote:

> Hi,
>
> On 2023-01-11 10:29:06 +0100, Anthonin Bonnefoy wrote:
> > Currently, the Checkpointer process only reports SLRU statistics at
> server
> > shutdown, leading to delayed statistics for SLRU flushes. This patch
> adds a
> > flush of SLRU stats to the end of checkpoints.
>
> Hm. I wonder if we should do this even earlier, by the
> pgstat_report_checkpointer() calls in CheckpointWriteDelay().
>
> I'm inclined to move the pgstat_report_wal() and pgstat_report_slru() calls
> into pgstat_report_checkpointer() to avoid needing to care about all the
> individual places.
>
That would make sense. I've created a new patch with everything moved
into pgstat_report_checkpointer().
I did split the checkpointer flush into a pgstat_flush_checkpointer()
function as it seemed more readable. Thoughts?
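
The resulting shape, sketched (names from the patch; simplified, with
details likely differing from the actual code):

/* Single reporting entry point so individual call sites don't need to
 * remember each statistics kind. */
void
pgstat_report_checkpointer(void)
{
	pgstat_flush_checkpointer();	/* checkpointer counters */
	pgstat_report_wal(true);		/* WAL statistics */
	pgstat_report_slru();			/* SLRU statistics */
}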


> > @@ -505,6 +505,7 @@ CheckpointerMain(void)
> >   /* Report pending statistics to the cumulative stats
> system */
> >   pgstat_report_checkpointer();
> >   pgstat_report_wal(true);
> > + pgstat_report_slru(true);
>
> Why do we need a force parameter if all callers use it?

Good point. I had used the same signature as pgstat_report_wal, but
there's no need for the nowait parameter.


flush-slru-counters-v2.patch
Description: Binary data


Flush SLRU counters in checkpointer process

2023-01-11 Thread Anthonin Bonnefoy
Hello hackers,

Currently, the checkpointer process only reports SLRU statistics at server
shutdown, leading to delayed statistics for SLRU flushes. This patch adds a
flush of the SLRU stats at the end of checkpoints.

Best regards,
Anthonin


flush-slru-counters.patch
Description: Binary data