Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY

2024-06-24 Thread Michael Paquier
On Mon, Jun 17, 2024 at 07:00:51PM +0200, Michail Nikolaev wrote:
> The same issue may happen in case of CREATE/DROP INDEX CONCURRENTLY as well.

While looking at all that, I've also been curious about this specific
point, and it is indeed possible to finish in a state where a
duplicate key would be found in one of the indexes selected by the
executor during an INSERT ON CONFLICT while a concurrent set of CICs
and DICs are run, so you don't really need a REINDEX.  See for example
the attached test.
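
For readers who don't want to dig through the attached TAP test, the two
concurrent workloads it runs boil down to the following SQL (table and index
names as in the test script); one session loops on the upsert while the other
keeps cycling the extra unique index:

-- session 1, in a loop:
INSERT INTO tbl VALUES (13, now())
  ON CONFLICT (i) DO UPDATE SET updated_at = now();

-- session 2, in a loop (serialized via an advisory lock in the test):
DROP INDEX CONCURRENTLY tbl_pkey_2;
CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_2 ON tbl (i);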
--
Michael
diff --git a/src/test/modules/test_misc/t/009_cdic_concurrently_unique_fail.pl b/src/test/modules/test_misc/t/009_cdic_concurrently_unique_fail.pl
new file mode 100644
index 00..bd37c797a3
--- /dev/null
+++ b/src/test/modules/test_misc/t/009_cdic_concurrently_unique_fail.pl
@@ -0,0 +1,46 @@
+
+# Copyright (c) 2024-2024, PostgreSQL Global Development Group
+
+# Test CREATE/DROP INDEX CONCURRENTLY with concurrent INSERT
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->start;
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE UNIQUE INDEX tbl_pkey_2 ON tbl (i)));
+
+$node->pgbench(
+		'--no-vacuum --client=100 -j 2 --transactions=1000',
+		0,
+		[qr{actually processed}],
+		[qr{^$}],
+		'concurrent INSERTs, UPDATES and CIC/DIC',
+		{
+			'01_updates' => q(
+INSERT INTO tbl VALUES(13,now()) ON CONFLICT(i) DO UPDATE SET updated_at = now();
+			),
+			'02_reindex' => q(
+SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+\if :gotlock
+	DROP INDEX CONCURRENTLY tbl_pkey_2;
+	CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_2 ON tbl (i);
+	SELECT pg_advisory_unlock(42);
+\endif
+			),
+		});
+$node->stop;
+done_testing();




Re: Logical Replication of sequences

2024-06-24 Thread vignesh C
On Thu, 20 Jun 2024 at 18:45, Amit Kapila  wrote:
>
> On Wed, Jun 19, 2024 at 8:33 PM vignesh C  wrote:
> >
> > On Tue, 18 Jun 2024 at 16:10, Amit Kapila  wrote:
> > >
> > >
> > > Agreed and I am not sure which is better because there is a value in
> > > keeping the state name the same for both sequences and tables. We
> > > probably need more comments in code and doc updates to make the
> > > behavior clear. We can start with the sequence state as 'init' for
> > > 'needs-to-be-sycned' and 'ready' for 'synced' and can change if others
> > > feel so during the review.
> >
> > Here is a patch which does the sequence synchronization in the
> > following lines from the above discussion:
> >
>
> Thanks for summarizing the points discussed. I would like to confirm
> whether the patch replicates new sequences that are created
> implicitly/explicitly for a publication defined as ALL SEQUENCES.

Currently, for a FOR ALL SEQUENCES publication, both explicitly created
sequences and implicitly created sequences will be synchronized during
the creation of subscriptions (using CREATE SUBSCRIPTION) and when
refreshing publication sequences (using ALTER SUBSCRIPTION ... REFRESH
PUBLICATION SEQUENCES).
Therefore, the explicitly created sequence seq1:
CREATE SEQUENCE seq1;
and the implicitly created sequence seq_test2_c2_seq for seq_test2 table:
CREATE TABLE seq_test2 (c1 int, c2 SERIAL);
will both be synchronized.
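
As a concrete sketch of the flow (using the syntax from this patch set, which
may still change; the connection string is made up):

-- publisher
CREATE PUBLICATION pub_all_seq FOR ALL SEQUENCES;
CREATE SEQUENCE seq1;                         -- explicit
CREATE TABLE seq_test2 (c1 int, c2 SERIAL);   -- implicit seq_test2_c2_seq

-- subscriber: sequences are synchronized when the subscription is created ...
CREATE SUBSCRIPTION sub_all_seq
  CONNECTION 'host=publisher dbname=postgres'
  PUBLICATION pub_all_seq;

-- ... and again when explicitly refreshed
ALTER SUBSCRIPTION sub_all_seq REFRESH PUBLICATION SEQUENCES;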

Regards,
Vignesh




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Alena Rybakina
Hi, Melanie! I'm glad to hear that you have found the root cause of the
problem) Thank you for that!


On 21.06.2024 02:42, Melanie Plageman wrote:

Hi,

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than
GlobalVisState->maybe_needed, it will ERROR out when determining
whether or not to freeze the tuple with "cannot freeze committed
xmax".

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread.

On master, after 1ccc1e05ae removed the retry loop in
lazy_scan_prune() and stopped comparing tuples to OldestXmin, the hang
could no longer happen, but we can still attempt to freeze dead tuples
visibly killed before OldestXmin -- resulting in an ERROR.

Pruning may fail to remove dead tuples with xmax before OldestXmin if
the tuple is not considered removable by GlobalVisState.

For vacuum, the GlobalVisState is initially calculated at the
beginning of vacuuming the relation -- at the same time and with the
same value as VacuumCutoffs->OldestXmin.

A backend's GlobalVisState may be updated again when it is accessed if
a new snapshot is taken or if something caused ComputeXidHorizons() to
be called.

This can happen, for example, at the end of a round of index vacuuming
when GetOldestNonRemovableTransactionId() is called.

Normally this may result in GlobalVisState's horizon moving forward --
potentially allowing more dead tuples to be removed.

However, if a disconnected standby with a running transaction older
than VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin but before
GlobalVisState is updated, the value of GlobalVisState->maybe_needed
could go backwards.

If this happens in the middle of vacuum's first pruning and freezing
pass, it is possible that pruning/freezing could then encounter a
tuple whose xmax is younger than GlobalVisState->maybe_needed and
older than VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum()
would deem the tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it.
But the heap_pre_freeze_checks() would ERROR out with "cannot freeze
committed xmax". This check is to avoid freezing dead tuples.

We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will
simply wait to replay the removal until it would be correct to do so
or recovery conflict handling will cancel the transaction that sees
the tuple as alive and allow replay to continue.


This is an interesting and difficult case) I noticed that when initializing the
cluster, in my opinion, we provide excessive freezing. Initialization takes a
long time, which can lead, for example, to longer test execution. I got rid of
this by adding a check that OldestMxact is not FirstMultiXactId, and it works fine.


if (prstate->cutoffs &&
    TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
    prstate->cutoffs->OldestMxact != FirstMultiXactId &&
    NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
    return HEAPTUPLE_DEAD;

Can I keep it?


Attached is the suggested fix for master plus a repro. I wrote it as a
recovery suite TAP test, but I am _not_ proposing we add it to the
ongoing test suite. It is, amongst other things, definitely prone to
flaking. I also had to use loads of data to force two index vacuuming
passes now that we have TIDStore, so it is a slow test.

If you want to run the repro with meson, you'll have to add
't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it with:

meson test postgresql:recovery / recovery/099_vacuum_hang

If you use autotools, you can run it with:
make check PROVE_TESTS="t/099_vacuum_hang.pl"

The repro forces a round of index vacuuming after the standby
reconnects and before pruning a dead tuple whose xmax is older than
OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.

Then vacuum's first pass will continue with pruning and find our later
inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.

I make sure that the standby reconnects between vacuum_get_cutoffs()
and pruning because I have a cursor on the page keeping VACUUM FREEZE
from getting a cleanup lock.

See the repro for step-by-step explanations of how it works.

I have a modified version of this that repros the infinite loop on
14-16 with substantially less data. See it here [2]. Also, the repro
attached to this mail won't work on 14 and 15 because of changes to
background_psql.

[1]https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee
[2]https://www.postgresql.org/message-id/CAAKRu_Y_NJzF4

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Heikki Linnakangas

On 21/06/2024 03:02, Peter Geoghegan wrote:

On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
 wrote:

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than
GlobalVisState->maybe_needed, it will ERROR out when determining
whether or not to freeze the tuple with "cannot freeze committed
xmax".

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread.


This is a great summary.


+1


We can fix this by always removing tuples considered dead before
VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
has a transaction that sees that tuple as alive, because it will
simply wait to replay the removal until it would be correct to do so
or recovery conflict handling will cancel the transaction that sees
the tuple as alive and allow replay to continue.


I think that this is the right general approach.


+1


The repro forces a round of index vacuuming after the standby
reconnects and before pruning a dead tuple whose xmax is older than
OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.


Right. I saw details exactly consistent with this when I used GDB
against a production instance.

I'm glad that you were able to come up with a repro that involves
exactly the same basic elements, including index page deletion.


Would it be possible to make it robust so that we could always run it 
with "make check"? This seems like an important corner case to 
regression test.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: New function normal_rand_array function to contrib/tablefunc.

2024-06-24 Thread Stepan Neretin
It looks useful; for example, it can be used in sorting tests to make them more
interesting. I just have one question: why are you using SRF_IS_FIRSTCALL() and
not _PG_init?
Best regards, Stepan Neretin.

Re: Proposal: Division operator for (interval / interval => double precision)

2024-06-24 Thread Laurenz Albe
On Sun, 2024-06-23 at 17:57 -0700, Gurjeet Singh wrote:
> Is there a desire to have a division operator / that takes dividend
> and divisor of types interval, and results in a quotient of type
> double precision.
> 
> This would be helpful in calculating how many times the divisor
> interval can fit into the dividend interval.
> 
> To complement this division operator, it would be desirable to also
> have a remainder operator %.
> 
> For example,
> 
> ('365 days'::interval / '5 days'::interval) => 73
> ('365 days'::interval % '5 days'::interval) => 0
> 
> ('365 days'::interval / '3 days'::interval) => 121
> ('365 days'::interval % '3 days'::interval) => 2

I think that is a good idea in principle, but I have one complaint,
and one thing should be discussed.

The complaint is that the result should be double precision or numeric.
I'd want the result of '1 minute' / '8 seconds' to be 7.5.
That would match how the multiplication operator works.

What should be settled is how to handle divisions that are not well defined.
For example, what is '1 year' / '1 day'?
- 365.24217, because that is the number of solar days in a solar year?
- 365, because we don't consider leap years?
- 360, because we use the usual conversion of 1 month -> 30 days?
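
For what it's worth, the quotient can already be approximated today by dividing
the epoch values of the two intervals, which also illustrates why
'1 minute' / '8 seconds' ought to give 7.5.  Just a sketch, and note that
EXTRACT(EPOCH FROM interval) bakes in its own conventions (365.25 days per
year, 30 days per month), so it is only one possible answer to the
'1 year' / '1 day' question:

SELECT EXTRACT(EPOCH FROM interval '1 minute')
     / EXTRACT(EPOCH FROM interval '8 seconds') AS minute_div_8s,   -- 7.5
       EXTRACT(EPOCH FROM interval '1 year')
     / EXTRACT(EPOCH FROM interval '1 day')     AS year_div_day;    -- 365.25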

Yours,
Laurenz Albe




Re: Meson far from ready on Windows

2024-06-24 Thread Dave Page
Hi Tristan,

On Fri, 21 Jun 2024 at 18:16, Tristan Partin  wrote:

>
> Hey Dave,
>
> I'm a maintainer for Meson, and am happy to help you in any way that
> I reasonably can.
>

Thank you!

>
> Let's start with the state of Windows support in Meson. If I were to
> rank Meson support for platforms, I would do something like this:
>
> - Linux
> - BSDs
> - Solaris/IllumOS
> - ...
> - Apple
> - Windows
>
> As you can see Windows is the bottom of the totem pole. We don't have
> Windows people coming along to contribute very often for whatever
> reason. Thus admittedly, Windows support can be very lackluster at
> times.
>
> Meson is not a project which sees a lot of funding. (Do any build
> tools?) The projects that support Meson in any way are Mesa and
> GStreamer, which don't have a lot of incentive to do anything with
> Windows, generally.
>
> I'm not even sure any of the regular contributors to Meson have
> Windows development machines. I surely don't have access to a Windows
> machine.
>

To be very clear, my comments - in particular the subject line of this
thread - are not referring to Meson itself, rather our use of it on
Windows. I've been quite impressed with Meson in general, and am coming to
like it a lot.


>
> All that being said, I would like to help you solve your Windows
> dependencies issue, or at least mitigate them. I think it is probably
> best to just look at one dependency at a time. Here is how lz4 is picked
> up in the Postgres Meson build:
>
> > lz4opt = get_option('lz4')
> > if not lz4opt.disabled()
> >   lz4 = dependency('liblz4', required: lz4opt)
> >
> >   if lz4.found()
> >     cdata.set('USE_LZ4', 1)
> >     cdata.set('HAVE_LIBLZ4', 1)
> >   endif
> >
> > else
> >   lz4 = not_found_dep
> > endif
>
> As you are well aware, dependency() looks largely at pkgconfig and cmake
> to find the dependencies. In your case, that is obviously not working.
>
> I think there are two ways to solve your problem. A first solution would
> look like this:
>
> > lz4opt = get_option('lz4')
> > if not lz4opt.disabled()
> >   lz4 = dependency('liblz4', required: false)
> >   if not lz4.found()
> >     lz4 = cc.find_library('lz4', required: lz4opt, dirs: extra_lib_dirs)
> >   endif
> >
> >   if lz4.found()
> >     cdata.set('USE_LZ4', 1)
> >     cdata.set('HAVE_LIBLZ4', 1)
> >   endif
> > else
> >   lz4 = not_found_dep
> > endif
>

Yes, that's the pattern I think we should generally be using:

- It supports the design goals, allowing for configurations we don't know
about to be communicated through pkgconfig or cmake files.
- It provides a fallback method to detect the dependencies as we do in the
old MSVC build system, which should work with most dependencies built with
their "standard" configuration on Windows.

To address Andres' concerns around mis-detection of dependencies, or other
oddities such as required compiler flags not being included, I would
suggest that a) that's happened very rarely, if ever, in the past, and b)
we can always spit out an obvious warning if we've not been able to use
cmake or pkgconfig for any particular dependencies.


>
> Another solution that could work alongside the previous suggestion is to
> use Meson subprojects[0] for managing Postgres dependencies. I don't
> know if we would want this in the Postgres repo or a patch that
> downstream packagers would need to apply, but essentially, add the wrap
> file:
>
> > [wrap-file]
> > directory = lz4-1.9.4
> > source_url = https://github.com/lz4/lz4/archive/v1.9.4.tar.gz
> > source_filename = lz4-1.9.4.tgz
> > source_hash = 0b0e3aa07c8c063ddf40b082bdf7e37a1562bda40a0ff5272957f3e987e0e54b
> > patch_filename = lz4_1.9.4-2_patch.zip
> > patch_url = https://wrapdb.mesonbuild.com/v2/lz4_1.9.4-2/get_patch
> > patch_hash = 4f33456cce986167d23faf5d28a128e773746c10789950475d2155a7914630fb
> > wrapdb_version = 1.9.4-2
> >
> > [provide]
> > liblz4 = liblz4_dep
>
> into subprojects/lz4.wrap, and Meson should be able to automagically
> pick up the dependency. Do this for all the projects that Postgres
> depends on, and you'll have an entire build managed by Meson. Note that
> Meson subprojects don't have to use Meson themselves. They can also use
> CMake[1] or Autotools[2], but your results may vary.
>

That's certainly interesting functionality. I'm not sure we'd want to try
to use it here, simply because building all of PostgreSQL's dependencies on
Windows requires multiple different toolchains and environments and is not
trivial to setup. That's largely why I started working on
https://github.com/dpage/winpgbuild.

Thanks!


>
> Happy to hear your thoughts. I think if our goal is to enable more
> people to work on Postgres, we should probably add subproject wraps to
> the source tree, but we also need to be forgiving like in the Meson DSL
> snippet above.
>
> Let me know your thoughts!
>
> [0]: https://mesonbuild.com/Wrap-dependency-system-manual.html
> [1]:
> https://github.com/hse-project/hse/blob/6d5207f88044a3bd9b3539260074395317e2

Re: Proposal: Division operator for (interval / interval => double precision)

2024-06-24 Thread Ashutosh Bapat
On Mon, Jun 24, 2024 at 2:04 PM Laurenz Albe 
wrote:

> On Sun, 2024-06-23 at 17:57 -0700, Gurjeet Singh wrote:
> > Is there a desire to have a division operator / that takes dividend
> > and divisor of types interval, and results in a quotient of type
> > double precision.
> >
> > This would be helpful in calculating how many times the divisor
> > interval can fit into the dividend interval.
> >
> > To complement this division operator, it would be desirable to also
> > have a remainder operator %.
> >
> > For example,
> >
> > ('365 days'::interval / '5 days'::interval) => 73
> > ('365 days'::interval % '5 days'::interval) => 0
> >
> > ('365 days'::interval / '3 days'::interval) => 121
> > ('365 days'::interval % '3 days'::interval) => 2
>
> I think that is a good idea in principle, but I have one complaint,
> and one thing should be discussed.
>
> The complaint is that the result should be double precision or numeric.
> I'd want the result of '1 minute' / '8 seconds' to be 7.5.
> That would match how the multiplication operator works.
>
> What should be settled is how to handle divisions that are not well
> defined.
> For example, what is '1 year' / '1 day'?
> - 365.24217, because that is the number of solar days in a solar year?
> - 365, because we don't consider leap years?
> - 360, because we use the usual conversion of 1 month -> 30 days?
>

We will need to go back to first principles, I guess. The result of a division
is a quotient, which is how many times the divisor can be subtracted from the
dividend, and a remainder, which is what remains after that many
subtractions. Since the day-to-hours and month-to-days conversions are not
constants, interval/interval will result in an integer quotient and an
interval remainder. That looks painful.
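
To make that concrete, a rough emulation of "integer quotient plus interval
remainder" with existing functions (epoch-based, so it inherits the conversion
caveats discussed above) could look like this:

SELECT div(e_dividend, e_divisor) AS quotient,                                -- 121
       make_interval(secs => (e_dividend % e_divisor)::float8) AS remainder   -- 48:00:00, i.e. two days' worth of seconds
FROM (SELECT EXTRACT(EPOCH FROM interval '365 days')::numeric AS e_dividend,
             EXTRACT(EPOCH FROM interval '3 days')::numeric   AS e_divisor) AS e;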

-- 
Best Wishes,
Ashutosh Bapat


Re: Meson far from ready on Windows

2024-06-24 Thread Dave Page
Hi

On Sat, 22 Jun 2024 at 17:32, Andres Freund  wrote:

> > I don't think it's unreasonable to not support static linking, but I take
> > your point.
>
> Separately from this thread: ISTM that on windows it'd be quite beneficial
> to
> link a few things statically, given how annoying dealing with dlls can be?
> There's also the perf issue addressed further down.
>

Dealing with DLLs largely just boils down to copying them into the right
place when packaging. The perf issue is a much more compelling reason to
look at static linking imho.


>
>
> > My assumption all along was that Meson would replace autoconf etc. before
> > anything happened with MSVC, precisely because that's the type of
> > environment all the Postgres devs work in primarily. Instead we seem to
> > have taken what I think is a flawed approach of entirely replacing the
> > build system on the platform none of the devs use, whilst leaving the new
> > system as an experimental option on the platforms it will have had orders
> > of magnitude more testing.
>
> The old system was a major bottleneck. For one, there was no way to run all
> tests. And even the tests that one could run, would run serially, leading
> to
> exceedingly long tests times. While that could partially be addressed by
> having both buildsystems in parallel, the old one would frequently break
> in a
> way that one couldn't reproduce on other systems. And resource wise it
> wasn't
> feasible to test both old and new system for cfbot/CI.
>

Hmm, I've found that running the tests under Meson takes notably longer
than the old system - maybe 5 - 10x longer ("meson test" vs. "vcregress
check"). I haven't yet put any effort into figuring out a cause for that.


> FWIW, dynamic linking has a noticeable overhead on other platforms too. A
> non-dependencies-enabled postgres can do about 2x the
> connections-per-second
> than a fully kitted out postgres can (basically due to more memory mapping
> metadata being copied).  But on windows the overhead is larger because so
> much
> more happens for every new connections, including loading all dlls from
> scratch.
>
> I suspect linking a few libraries statically would be quite worth it on
> windows. On other platforms it'd be quite inadvisable to statically link
> libraries, due to security updates, but for stuff like the EDB windows
> installer dynamic linking doesn't really help with that afaict?
>

Correct - we're shipping the dependencies ourselves, so we have to
rewrap/retest anyway.

-- 
Dave Page
pgAdmin: https://www.pgadmin.org
PostgreSQL: https://www.postgresql.org
EDB: https://www.enterprisedb.com


Re: Partial aggregates pushdown

2024-06-24 Thread Jelte Fennema-Nio
On Sun, 23 Jun 2024 at 10:24, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> I attached the POC patch(the second one) which supports avg(int8) whose 
> standard format is _numeric type.

Okay, very interesting. So instead of defining the
serialization/deserialization functions to text/binary, you serialize
the internal type to an existing non-internal type, which then in turn
gets serialized to text. In the specific case of avg(int8) this is
done to an array of numeric (with length 2).

> However, I do not agree that I modify the internal transtype to the native 
> data type. The reasons are the following three.
> 1. Generality
> I believe we should develop a method that can theoretically apply to any 
> aggregate function, even if we cannot implement it immediately. However, I 
> find it exceptionally challenging to demonstrate that any internal transtype 
> can be universally converted to a native data type for aggregate functions 
> that are parallel-safe under the current parallel query feature. 
> Specifically, proving this for user-defined aggregate functions presents a 
> significant difficulty in my view.
> On the other hand, I think that the usage of export and import functions can 
> theoretically apply to any aggregate functions.

The only thing required when doing CREATE TYPE is having an INPUT and
OUTPUT function for the type, which (de)serialize the type to text
format. As far as I can tell by definition that requirement should be
fine for any aggregates that we can do partial aggregation pushdown
for. To clarify I'm not suggesting we should change any of the
internal representation of the type for the current internal
aggregates. I'm suggesting we create new native types (i.e. do CREATE
TYPE) for those internal representations and then use the name of that
type instead of internal.

> 2. Amount of codes.
> It could need more codes.

I think it would be about the same as your proposal. Instead of
serializing to an intermediary existing type you serialize to string
straight away. I think it would probably be slightly less code
actually, since you don't have to add code to handle the new
aggpartialexportfn and aggpartialimportfn columns.

> 3. Concern about performance
> I'm concerned that altering the current internal data types could impact 
> performance.

As explained above in my proposal all the aggregation code would
remain unchanged, only new native types will be added. Thus
performance won't be affected, because all aggregation code will be
the same. The only thing that's changed is that the internal type now
has a name and an INPUT and OUTPUT function.
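
To illustrate the shape of what I mean, here is a runnable (if simplified)
sketch. This is not the patch itself, just a hand-rolled avg(int8) whose
transition state is a plain named type (numeric[] holding {sum, count}) rather
than internal, so the state already has text input/output:

CREATE FUNCTION my_avg_sfunc(numeric[], int8) RETURNS numeric[]
    LANGUAGE sql IMMUTABLE STRICT
    AS $$ SELECT ARRAY[$1[1] + $2, $1[2] + 1] $$;

CREATE FUNCTION my_avg_ffunc(numeric[]) RETURNS numeric
    LANGUAGE sql IMMUTABLE STRICT
    AS $$ SELECT CASE WHEN $1[2] = 0 THEN NULL ELSE $1[1] / $1[2] END $$;

CREATE AGGREGATE my_avg(int8) (
    SFUNC     = my_avg_sfunc,
    STYPE     = numeric[],      -- a real, named type with in/out functions
    INITCOND  = '{0,0}',
    FINALFUNC = my_avg_ffunc
);

SELECT my_avg(g::int8) FROM generate_series(1, 10) AS g;   -- 5.5

The real aggregates would of course keep their C transition functions; the
point is only that a named state type is what lets a partial aggregate value
travel between nodes as text.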

> I know. In my proposal, the standard format is not seriarized data by 
> serialfn, instead, is text or other native data type.
> Just to clarify, I'm writing this to avoid any potential misunderstanding.

Ah alright, that definitely clarifies the proposal. I was looking at
the latest patch file on the thread and that one was still using
serialfn. Your new one indeed doesn't, so this is fine.

> Basically responding to your advice,
> for now, I prepare two POC patches.

Great! I definitely think this makes the review/discussion easier.

> The first supports case a, currently covering only avg(int4) and other 
> aggregate functions that do not require import or export functions, such as 
> min, max, and count.

Not a full review but some initial notes:

1. Why does this patch introduce aggpartialpushdownsafe? I'd have
expected that any aggregate with a non-pseudo/internal type as aggtranstype
would be safe to partially push down.
2. It seems the int4_avg_import function shouldn't be part of this
patch (but maybe of a future one).
3. I think splitting this patch in two pieces would make it even
easier to review: First adding support for the new PARTIAL_AGGREGATE
keyword (adds the new feature). Second, using PARTIAL_AGGREGATE in
postgres_fdw (starts using the new feature). Citus would only need the
first patch not the second one, so I think the PARTIAL_AGGREGATE
feature has merit to be added on its own, even without the
postgres_fdw usage.
4. Related to 3, I think it would be good to have some tests of
PARTIAL_AGGREGATE that don't involve postgres_fdw at all. I also
spotted some comments that mention FDW, even though they apply to
the "pure" PARTIAL_AGGREGATE code.
5. This comment now seems incorrect:
-* Apply the agg's finalfn if one is provided, else return transValue.
+* If the agg's finalfn is provided and PARTIAL_AGGREGATE keyword is
+* not specified, apply the agg's finalfn.
+* If PARTIAL_AGGREGATE keyword is specified and the transValue type
+* is internal, apply the agg's serialfn. In this case the agg's
+* serialfn must not be invalid. Otherwise return transValue.

6. These errors are not on purpose afaict (if they are, a comment in
the test explaining why would be good)

+SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b ORDER BY 1;
+ERROR:  could not connect to server "loopback"
+DETAIL:  inval

Re: Improve EXPLAIN output for multicolumn B-Tree Index

2024-06-24 Thread Matthias van de Meent
On Mon, 24 Jun 2024 at 04:38,  wrote:
>
> In my local PoC patch, I have modified the output as follows, what do you 
> think?
>
> =# EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM test WHERE id1 = 1 AND id2 = 101;
>QUERY PLAN
> -
>  Index Scan using test_idx on ikedamsh.test  (cost=0.42..8.45 rows=1 width=18) (actual time=0.082..0.086 rows=1 loops=1)
>    Output: id1, id2, id3, value
>    Index Cond: ((test.id1 = 1) AND (test.id2 = 101))  -- If it’s efficient, the output won’t change.
>  Planning Time: 5.088 ms
>  Execution Time: 0.162 ms
> (5 rows)
>
> =# EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM test WHERE id1 = 1 AND id3 = 101;
>   QUERY PLAN
> ---
>  Index Scan using test_idx on ikedamsh.test  (cost=0.42..12630.10 rows=1 width=18) (actual time=0.175..279.819 rows=1 loops=1)
>    Output: id1, id2, id3, value
>    Index Cond: (test.id1 = 1)      -- Change the output. Show only the bound quals.
>    Index Filter: (test.id3 = 101)  -- New. Output quals which are not used as the bound quals

I think this is too easy to confuse with the pre-existing 'Filter'
condition, which you'll find on indexes with INCLUDE-d columns or
filters on non-index columns.
Furthermore, I think this is probably not helpful (maybe even harmful)
for index types like GIN and BRIN, where index searchkey order is
mostly irrelevant to the index shape and performance.
Finally, does this change the index AM API? Does this add another
scankey argument to ->amrescan?

>    Rows Removed by Index Filter: 49  -- New. Output when ANALYZE option is specified

Separate from the changes to Index Cond/Index Filter output changes I
think this can be useful output, though I'd probably let the AM
specify what kind of filter data to display.
E.g. BRIN may well want to display how many ranges matched the
predicate, vs how many ranges were unsummarized and thus returned; two
conditions which aren't as easy to differentiate but can be important
debugging query performance.

>  Planning Time: 0.354 ms
>  Execution Time: 279.908 ms
> (7 rows)

Was this a test against the same dataset as the one you'd posted your
measurements of your first patchset with? The execution time seems to
have slown down quite significantly, so if the testset is the same
then this doesn't bode well for your patchset.


Kind regards,

Matthias van de Meent




Re: Meson far from ready on Windows

2024-06-24 Thread Andres Freund
Hi,

On 2024-06-24 09:54:51 +0100, Dave Page wrote:
> > The old system was a major bottleneck. For one, there was no way to run all
> > tests. And even the tests that one could run, would run serially, leading
> > to
> > exceedingly long tests times. While that could partially be addressed by
> > having both buildsystems in parallel, the old one would frequently break
> > in a
> > way that one couldn't reproduce on other systems. And resource wise it
> > wasn't
> > feasible to test both old and new system for cfbot/CI.
> >
> 
> Hmm, I've found that running the tests under Meson takes notably longer
> than the old system - maybe 5 - 10x longer ("meson test" vs. "vcregress
> check"). I haven't yet put any effort into figuring out a cause for that
> yet.

That's because vcregress check only runs a small portion of the tests (just
the main pg_regress checks, no TAP tests, no extension tests). Which is pretty
much my point.

To run a decent, but still far from complete, portion of the tests you needed to do this:
https://github.com/postgres/postgres/blob/REL_15_STABLE/.cirrus.tasks.yml#L402-L440

If you want to run just the main regression tests with meson, you can:
  meson test --suite setup --suite regress

To see the list of all tests
  meson test --list

Greetings,

Andres Freund




Re: strange context message in spi.c?

2024-06-24 Thread Stepan Neretin
Hi! Looks good to me!
Best regards, Stepan Neretin.


Re: Improve EXPLAIN output for multicolumn B-Tree Index

2024-06-24 Thread Jelte Fennema-Nio
+1 for the idea.

On Mon, 24 Jun 2024 at 11:11, Matthias van de Meent
 wrote:
> I think this is too easy to confuse with the pre-existing 'Filter'
> condition, which you'll find on indexes with INCLUDE-d columns or
> filters on non-index columns.

Why not combine them? And both call them Filter? In a sense this
filtering acts very similar to INCLUDE based filtering (for btrees at
least). Although I might be wrong about that, because when I try to
confirm the same perf using the following script I do get quite
different timings (maybe you have an idea what's going on here). But
even if it does mean something slightly different perf wise, I think
using Filter for both is unlikely to confuse anyone. Since, while
allowed, it seems extremely unlikely in practice that someone will use
the same column as part of the indexed columns and as part of the
INCLUDE-d columns (why would you store the same info twice).

CREATE TABLE test (id1 int, id2 int, id3 int, value varchar(32));
INSERT INTO test (SELECT i % 10, i % 1000, i, 'hello' FROM generate_series(1,100) s(i));
vacuum freeze test;
CREATE INDEX test_idx_include ON test(id1, id2) INCLUDE (id3);
ANALYZE test;
EXPLAIN (VERBOSE, ANALYZE, BUFFERS) SELECT id1, id3 FROM test WHERE id1 = 1 AND id3 = 101;
CREATE INDEX test_idx ON test(id1, id2, id3);
ANALYZE test;
EXPLAIN (VERBOSE, ANALYZE, BUFFERS) SELECT id1, id3 FROM test WHERE id1 = 1 AND id3 = 101;

  QUERY PLAN
───
 Index Only Scan using test_idx_include on public.test  (cost=0.42..3557.09 rows=1 width=8) (actual time=0.708..6.639 rows=1 loops=1)
   Output: id1, id3
   Index Cond: (test.id1 = 1)
   Filter: (test.id3 = 101)
   Rows Removed by Filter: 9
   Heap Fetches: 0
   Buffers: shared hit=1 read=386
 Query Identifier: 471139784017641093
 Planning:
   Buffers: shared hit=8 read=1
 Planning Time: 0.091 ms
 Execution Time: 6.656 ms
(12 rows)

Time: 7.139 ms
  QUERY PLAN
─
 Index Only Scan using test_idx on public.test  (cost=0.42..2591.77 rows=1 width=8) (actual time=0.238..2.110 rows=1 loops=1)
   Output: id1, id3
   Index Cond: ((test.id1 = 1) AND (test.id3 = 101))
   Heap Fetches: 0
   Buffers: shared hit=1 read=386
 Query Identifier: 471139784017641093
 Planning:
   Buffers: shared hit=10 read=1
 Planning Time: 0.129 ms
 Execution Time: 2.128 ms
(10 rows)

Time: 2.645 ms




Re: Support "Right Semi Join" plan shapes

2024-06-24 Thread Richard Guo
Thank you for reviewing.

On Mon, Jun 24, 2024 at 1:27 PM Li Japin  wrote:
> +   /*
> +* For now we do not support RIGHT_SEMI join in mergejoin or nestloop
> +* join.
> +*/
> +   if (jointype == JOIN_RIGHT_SEMI)
> +   return;
> +
>
> How about adding some reasons here?

I've included a brief explanation in select_mergejoin_clauses.

> + * this is a right-semi join, or this is a right/right-anti/full join and
> + * there are nonmergejoinable join clauses.  The executor's mergejoin
>
> Maybe we can put the right-semi join together with the right/right-anti/full
> join.  Is there any other significance by putting it separately?

I don't think so.  The logic is different: for right-semi join we will
always set *mergejoin_allowed to false, while for right/right-anti/full
join it is set to false only if there are nonmergejoinable join clauses.

> Maybe the following comments also should be updated. Right?

Correct.  And there are a few more places where we need to mention
JOIN_RIGHT_SEMI, such as in reduce_outer_joins_pass2 and in the comment
for SpecialJoinInfo.


I noticed that this patch changes the plan of a query in join.sql from
a semi join to right semi join, compromising the original purpose of
this query, which was to test the fix for neqjoinsel's behavior for
semijoins (see commit 7ca25b7d).

--
-- semijoin selectivity for <>
--
explain (costs off)
select * from int4_tbl i4, tenk1 a
where exists(select * from tenk1 b
 where a.twothousand = b.twothousand and a.fivethous <> b.fivethous)
  and i4.f1 = a.tenthous;

So I've changed this test case a bit so that it is still testing what it
is supposed to test.

In passing, I've also updated the commit message to clarify that this
patch does not address the support of "Right Semi Join" for merge joins.

Thanks
Richard


v6-0001-Support-Right-Semi-Join-plan-shapes.patch
Description: Binary data


sql/json miscellaneous issue

2024-06-24 Thread jian he
hi.
the following two queries should return the same result?

SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb);
SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);

I've tried a patch to implement it.
(i raised the issue at
https://www.postgresql.org/message-id/CACJufxFWiCnG3Q7f0m_GdrytPbv29A5OWngCDwKVjcftwzHbTA%40mail.gmail.com
i think a new thread would be more appropriate).



current json_value  doc:
"Note that scalar strings returned by json_value always have their
quotes removed, equivalent to specifying OMIT QUOTES in json_query."

i think there are two exceptions: when the returning data types are
jsonb or json.




Injection point locking

2024-06-24 Thread Heikki Linnakangas
InjectionPointRun() acquires InjectionPointLock, looks up the hash 
entry, and releases the lock:



LWLockAcquire(InjectionPointLock, LW_SHARED);
entry_by_name = (InjectionPointEntry *)
hash_search(InjectionPointHash, name,
HASH_FIND, &found);
LWLockRelease(InjectionPointLock);


Later, it reads fields from the entry it looked up:


/* not found in local cache, so load and register */
snprintf(path, MAXPGPATH, "%s/%s%s", pkglib_path,
 entry_by_name->library, DLSUFFIX);


Isn't that a straightforward race condition, if the injection point is 
detached in between?



Another thing:

I wanted to use injection points to inject an error early at backend
startup, to write a test case for the scenarios that Jacob points out at
https://www.postgresql.org/message-id/CAOYmi%2Bnwvu21mJ4DYKUa98HdfM_KZJi7B1MhyXtnsyOO-PB6Ww%40mail.gmail.com. 
But I can't do that, because InjectionPointRun() requires a PGPROC 
entry, because it uses an LWLock. That also makes it impossible to use 
injection points in the postmaster. Any chance we could allow injection 
points to be triggered without a PGPROC entry? Could we use a simple 
spinlock instead? With a fast path for the case that no injection points 
are attached or something?


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Buildfarm animal caiman showing a plperl test issue with newer Perl versions

2024-06-24 Thread Andrew Dunstan



On 2024-06-24 Mo 12:00 AM, Alexander Lakhin wrote:

Hello hackers,

As recent caiman failures ([1], [2], ...) show, plperl.sql is incompatible
with Perl 5.40. (The last successful test runs took place when caiman
had Perl 5.38.2 installed: [3].)

FWIW, I've found an already-existing fix for the issue [4] and a note
describing the change for Perl 5.39.10 [5].

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2024-06-24%2001%3A34%3A23
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2024-06-24%2000%3A15%3A16
[3] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2024-05-02%2021%3A57%3A17
[4] 
https://git.alpinelinux.org/aports/tree/community/postgresql14/fix-test-plperl-5.8-pragma.patch?id=28aeb872811f59a7f646aa29ed7c9dc30e698e65
[5] 
https://metacpan.org/release/PEVANS/perl-5.39.10/changes#Selected-Bug-Fixes





It's a very odd bug. I guess we should just backpatch the removal of 
that redundant version check in plc_perlboot.pl, probably all the way 
down to 9.2 since godwit builds and tests with plperl that far back, and 
some day in the not too distant future it will upgrade to perl 5.40.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Track the amount of time waiting due to cost_delay

2024-06-24 Thread Bertrand Drouvot
Hi,

On Sat, Jun 22, 2024 at 12:48:33PM +, Bertrand Drouvot wrote:
> 1. vacuuming indexes time has been longer on master because with v2, the 
> leader
> has been interrupted 342605 times while waiting, then making v2 "faster".
> 
> 2. the leader being interrupted while waiting is also already happening on 
> master
> due to the pgstat_progress_parallel_incr_param() calls in
> parallel_vacuum_process_one_index() (that have been added in 
> 46ebdfe164). It has been the case "only" 36 times during my test case.
> 
> I think that 2. is less of a concern but I think that 1. is something that 
> needs
> to be addressed because the leader process is not honouring its cost delay 
> wait
> time in a noticeable way (at least during my test case).
> 
> I did not think of a proposal yet, just sharing my investigation as to why
> v2 has been faster than master during the vacuuming indexes phase.

I think that a reasonable approach is to make the reporting from the parallel
workers to the leader less aggressive (i.e., make it occur less frequently).

Please find attached v3, that:

- ensures that there is at least 1 second between 2 reports, per parallel 
worker,
to the leader.

- ensures that the reported delayed time is still correct (keep track of the
delayed time between 2 reports).

- does not add any extra pg_clock_gettime_ns() calls (as compare to v2).

Remarks:

1. Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable. I don't think that the number of parallel workers has
to come into play as:

 1.1) the more parallel workers are used, the smaller the impact of the leader
 on the vacuum index phase duration/workload (because the work is spread
 across more processes).

 1.2) the fewer parallel workers there are, the less the leader will be
 interrupted (fewer parallel workers would report their delayed time).

2. The throttling is not based on the cost limit as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The throttling is not based on the actual cost delay value because the leader
could be interrupted at the beginning, the middle, or any other part of the wait,
and we are more interested in the frequency of the interrupts.

4. A 1 second reporting "throttling" looks like a reasonable threshold as:

 4.1 the idea is to have a significant impact when the leader could have been
interrupted say hundreds/thousands of times per second.

 4.2 it does not make that much sense for any tools to sample
pg_stat_progress_vacuum multiple times per second (so a one second reporting
granularity seems ok).

With this approach in place, v3 attached applied, during my test case:

- the leader has been interrupted about 2500 times (instead of about 345000
times with v2)

- the vacuum index phase duration is very close to the master one (it has been
4 seconds faster (over a 8 minutes 40 seconds duration time), instead of 3
minutes faster with v2).
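
For reference, once the patch is in, sampling the new counter is as simple as
the following (column name taken from the attached commit message; not in any
released version):

SELECT pid, relid::regclass AS relation, phase, time_delayed
FROM pg_stat_progress_vacuum;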

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 99f417c0bcd7c29e126fdccdd6214ea37db67379 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Mon, 24 Jun 2024 08:43:26 +
Subject: [PATCH v3] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.

This uses the new parallel message type for progress reporting added
by f1889729dd.

In case of parallel worker, to avoid the leader to be interrupted too frequently
(while it might be sleeping for cost delay), the report is done only if the last
report has been done more than 1 second ago.

Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.

Indeed when deciding about the throttling:

1. The number of parallel workers should not come into play:

 1.1) the more parallel workers are used, the smaller the impact of the leader
 on the vacuum index phase duration/workload (because the work is spread
 across more processes).

 1.2) the fewer parallel workers there are, the less the leader will be
 interrupted (fewer parallel workers would report their delayed time).

2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle, or any other part of the wait, and we are more
interested in the frequency of the interrupts.

4. A 1 second reporting "throttling" looks like a reasonable threshold as:

 4.1 the idea is to have a significant impact when the leader could have been
interrupted say hundreds/thousands of times per second.

 4.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum
multiple times per second (so

Re: long-standing data loss bug in initial sync of logical replication

2024-06-24 Thread Amit Kapila
On Sun, Nov 19, 2023 at 7:48 AM Andres Freund  wrote:
>
> On 2023-11-19 02:15:33 +0100, Tomas Vondra wrote:
> >
> > If understand correctly, with the current code (which only gets
> > ShareUpdateExclusiveLock), we may end up in a situation like this
> > (sessions A and B):
> >
> >   A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock
> >   A: writes the invalidation message(s) into WAL
> >   B: inserts into table "t"
> >   B: commit
> >   A: commit
>
> I don't think this the problematic sequence - at least it's not what I had
> reproed in
> https://postgr.es/m/20231118025445.crhaeeuvoe2g5dv6%40awork3.anarazel.de
>
> Adding line numbers:
>
> 1) S1: CREATE TABLE d(data text not null);
> 2) S1: INSERT INTO d VALUES('d1');
> 3) S2: BEGIN; INSERT INTO d VALUES('d2');
> 4) S1: ALTER PUBLICATION pb ADD TABLE d;
> 5) S2: COMMIT
> 6) S2: INSERT INTO d VALUES('d3');
> 7) S1: INSERT INTO d VALUES('d4');
> 8) RL: 
>
> The problem with the sequence is that the insert from 3) is decoded *after* 4)
> and that to decode the insert (which happened before the ALTER) the catalog
> snapshot and cache state is from *before* the ALTER TABLE. Because the
> transaction started in 3) doesn't actually modify any catalogs, no
> invalidations are executed after decoding it. The result is that the cache
> looks like it did at 3), not like after 4). Undesirable timetravel...
>
> It's worth noting that here the cache state is briefly correct, after 4), it's
> just that after 5) it stays the old state.
>
> If 4) instead uses a SRE lock, then S1 will be blocked until S2 commits, and
> everything is fine.
>

I agree, your analysis looks right to me.

>
>
> > > I'm not sure there are any cases where using SRE instead of AE would cause
> > > problems for logical decoding, but it seems very hard to prove. I'd be 
> > > very
> > > surprised if just using SRE would not lead to corrupted cache contents in 
> > > some
> > > situations. The cases where a lower lock level is ok are ones where we 
> > > just
> > > don't care that the cache is coherent in that moment.
>
> > Are you saying it might break cases that are not corrupted now? How
> > could obtaining a stronger lock have such effect?
>
> No, I mean that I don't know if using SRE instead of AE would have negative
> consequences for logical decoding. I.e. whether, from a logical decoding POV,
> it'd suffice to increase the lock level to just SRE instead of AE.
>
> Since I don't see how it'd be correct otherwise, it's kind of a moot question.
>

We lost track of this thread and the bug is still open. IIUC, the
conclusion is to use SRE in OpenTableList() to fix the reported issue.
Andres, Tomas, please let me know if my understanding is wrong,
otherwise, let's proceed and fix this issue.

-- 
With Regards,
Amit Kapila.




Re: Improve EXPLAIN output for multicolumn B-Tree Index

2024-06-24 Thread Matthias van de Meent
On Mon, 24 Jun 2024 at 11:58, Jelte Fennema-Nio  wrote:
>
> +1 for the idea.
>
> On Mon, 24 Jun 2024 at 11:11, Matthias van de Meent
>  wrote:
> > I think this is too easy to confuse with the pre-existing 'Filter'
> > condition, which you'll find on indexes with INCLUDE-d columns or
> > filters on non-index columns.
>
> Why not combine them? And both call them Filter? In a sense this
> filtering acts very similar to INCLUDE based filtering (for btrees at
> least).

It does not really behave similar: index scan keys (such as the
id3=101 scankey) don't require visibility checks in the btree code,
while the Filter condition _does_ require a visibility check, and
delegates the check to the table AM if the scan isn't Index-Only, or
if the VM didn't show all-visible during the check.

Furthermore, the index could use the scankey to reduce the number of
keys to scan using "skip scans"; by realising during a forward scan
that if you've reached tuple (1, 2, 3) and are looking for (1, _, 1) you
can skip forward to (1, 3, _), rather than having to go through tuples
(1, 2, 4), (1, 2, 5), ... (1, 2, n). This is not possible for
INCLUDE-d columns, because their datatypes and structure are opaque to
the index AM; the AM cannot assume anything about or do anything with
those values.

> Although I might be wrong about that, because when I try to
> confirm the same perf using the following script I do get quite
> different timings (maybe you have an idea what's going on here). But
> even if it does mean something slightly different perf wise, I think
> using Filter for both is unlikely to confuse anyone.

I don't want A to be the plan, while showing B' to the user, as the
performance picture for the two may be completely different. And, as I
mentioned upthread, the differences between AMs in the (lack of)
meaning in index column order also makes it quite wrong to generally
separate prefixes equalities from the rest of the keys.

> Since, while
> allowed, it seems extremely unlikely in practice that someone will use
> the same column as part of the indexed columns and as part of the
> INCLUDE-d columns (why would you store the same info twice).

Yeah, people don't generally include the same index column more than
once in the same index.

> CREATE INDEX test_idx_include ON test(id1, id2) INCLUDE (id3);
> CREATE INDEX test_idx ON test(id1, id2, id3);
>
>   QUERY PLAN
> ───
>  Index Only Scan using test_idx_include on public.test
[...]
> Time: 7.139 ms
>   QUERY PLAN
> ─
>  Index Only Scan using test_idx on public.test  (cost=0.42..2591.77
[...]
> Time: 2.645 ms

As you can see, there's a huge difference in performance. Putting both
non-bound and "normal" filter clauses in the same Filter clause will
make it more difficult to explain performance issues based on only the
explain output.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: sql/json miscellaneous issue

2024-06-24 Thread Stepan Neretin
On Mon, Jun 24, 2024 at 5:05 PM jian he  wrote:

> hi.
> the following two queries should return the same result?
>
> SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb);
> SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);
>
> I've tried a patch to implement it.
> (i raised the issue at
>
> https://www.postgresql.org/message-id/CACJufxFWiCnG3Q7f0m_GdrytPbv29A5OWngCDwKVjcftwzHbTA%40mail.gmail.com
> i think a new thread would be more appropriate).
>
>
>
> current json_value  doc:
> "Note that scalar strings returned by json_value always have their
> quotes removed, equivalent to specifying OMIT QUOTES in json_query."
>
> i think there are two exceptions: when the returning data types are
> jsonb or json.
>
>
>
Hi!

I also noticed a very strange difference in behavior in these two queries.
It seems to me that although it returns a string by default, for the boolean
operator it should return true or false:
SELECT * FROM JSON_value (jsonb '1', '$ == "1"' returning jsonb);
 json_value


(1 row)

 SELECT * FROM JSON_value (jsonb 'null', '$ == "1"' returning jsonb);
 json_value

 false
(1 row)



Best regards, Stepan Neretin.


Re: sql/json miscellaneous issue

2024-06-24 Thread Amit Langote
Hi,

On Mon, Jun 24, 2024 at 8:02 PM Stepan Neretin  wrote:
> Hi!
>
> I also noticed a very strange difference in behavior in these two queries, it 
> seems to me that although it returns a string by default, for the boolean 
> operator it is necessary to return true or false
> SELECT * FROM JSON_value (jsonb '1', '$ == "1"' returning jsonb);
>  json_value
> 
>
> (1 row)
>
>  SELECT * FROM JSON_value (jsonb 'null', '$ == "1"' returning jsonb);
>  json_value
> 
>  false
> (1 row)

Hmm, that looks sane to me when comparing the above two queries with
their jsonb_path_query() equivalents:

select jsonb_path_query(jsonb '1', '$ == "1"');
 jsonb_path_query
--
 null
(1 row)

select jsonb_path_query(jsonb 'null', '$ == "1"');
 jsonb_path_query
--
 false
(1 row)

-- 
Thanks, Amit Langote




Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)

2024-06-24 Thread Ranier Vilela
On Sun, Jun 23, 2024 at 11:56 PM Richard Guo wrote:

> On Mon, Jun 24, 2024 at 7:51 AM Ranier Vilela  wrote:
> > In src/include/access/xlogbackup.h, the field *name*
> > has one byte extra to store null-termination.
> >
> > But, in the function *do_pg_backup_start*,
> > I think that is a mistake in the line (8736):
> >
> > memcpy(state->name, backupidstr, strlen(backupidstr));
> >
> > memcpy with strlen does not copy the whole string.
> > strlen returns the exact length of the string, without
> > the null-termination.
>
> I noticed that the two callers of do_pg_backup_start both allocate
> BackupState with palloc0.  Can we rely on this to ensure that the
> BackupState.name is initialized with null-termination?
>
I do not think so.
It seems to me the best solution is to use Michael's suggestion, strlcpy +
sizeof.

Currently we have this:
memcpy(state->name, "longlongpathexample1",
strlen("longlongpathexample1"));
printf("%s\n", state->name);
longlongpathexample1

Next random call:
memcpy(state->name, "longpathexample2", strlen("longpathexample2"));
printf("%s\n", state->name);
longpathexample2ple1

It's not a good idea to use memcpy with strlen.

best regards,
Ranier Vilela


Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)

2024-06-24 Thread Ranier Vilela
On Mon, Jun 24, 2024 at 12:27 AM Yugo NAGATA wrote:

> On Sun, 23 Jun 2024 22:34:03 -0300
> Ranier Vilela  wrote:
>
> > On Sun, Jun 23, 2024 at 10:14 PM Ranier Vilela wrote:
> >
> > > On Sun, Jun 23, 2024 at 10:05 PM Ranier Vilela <ranier...@gmail.com> wrote:
> > >
> > >> On Sun, Jun 23, 2024 at 9:54 PM Michael Paquier <mich...@paquier.xyz> wrote:
> > >>
> > >>> On Sun, Jun 23, 2024 at 09:34:45PM -0300, Ranier Vilela wrote:
> > >>> > It's not critical code, so I think it's ok to use strlen, even
> because
> > >>> the
> > >>> > result of strlen will already be available using modern compilers.
> > >>> >
> > >>> > So, I think it's ok to use memcpy with strlen + 1.
> > >>>
> > >>> It seems to me that there is a pretty good argument to just use
> > >>> strlcpy() for the same reason as the one you cite: this is not a
> > >>> performance-critical code, and that's just safer.
> > >>>
> > >> Yeah, I'm fine with strlcpy. I'm not against it.
> > >>
> > > Perhaps, like the v2?
> > >
> > > Either v1 or v2, to me, looks good.
> > >
> > Thinking about, does not make sense the field size MAXPGPATH + 1.
> > all other similar fields are just MAXPGPATH.
> >
> > If we copy MAXPGPATH + 1, it will also be wrong.
> > So it is necessary to adjust logbackup.h as well.
>
> I am not sure whether we need to change the size of the field,
> but if change it, I wonder it is better to modify the following
> message from MAXPGPATH to MAXPGPATH -1.
>
>  errmsg("backup label too long (max %d
> bytes)",
> MAXPGPATH)));
>
Or perhaps, would it be better to show the too-long label?

errmsg("backup label too long (%s)",
backupidstr)));

best regards,
Ranier Vilela

>
> >
> > So, I think that v3 is ok to fix.
> >
> > best regards,
> > Ranier Vilela
> >
> > >
> > > best regards,
> > > Ranier Vilela
> > >
> > >>
>
>
> --
> Yugo NAGATA 
>


Re: Conflict detection and logging in logical replication

2024-06-24 Thread shveta malik
On Mon, Jun 24, 2024 at 7:39 AM Zhijie Hou (Fujitsu)
 wrote:
>
> When testing the patch, I noticed a bug that when reporting the conflict
> after calling ExecInsertIndexTuples(), we might find the tuple that we
> just inserted and report it.(we should only report conflict if there are
> other conflict tuples which are not inserted by us) Here is a new patch
> which fixed this and fixed a compile warning reported by CFbot.
>

Thanks for the patch. Few comments:

1) Few typos:
Commit msg of patch001: iolates--> violates
execIndexing.c:  ingored --> ignored

2) Commit msg of stats patch: "The commit adds columns in view
pg_stat_subscription_workers to shows"
--"pg_stat_subscription_workers" --> "pg_stat_subscription_stats"

3) I feel chapter '31.5. Conflicts' in the docs should also mention
detection or point to the page where it is already mentioned.

thanks
Shveta




Re: sql/json miscellaneous issue

2024-06-24 Thread Amit Langote
Hi,

On Mon, Jun 24, 2024 at 7:04 PM jian he  wrote:
>
> hi.
> the following two queries should return the same result?
>
> SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb);
> SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);

I get this with HEAD:

SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb);
 json_query

 null
(1 row)

Time: 734.587 ms
SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb);
 json_value


(1 row)

Much like:

SELECT JSON_QUERY('{"key": null}', '$.key');
 json_query

 null
(1 row)

Time: 2.975 ms
SELECT JSON_VALUE('{"key": null}', '$.key');
 json_value


(1 row)

Which makes sense to me, because JSON_QUERY() is supposed to return a
JSON null in both cases and JSON_VALUE() is supposed to return a SQL
NULL for a JSON null.

-- 
Thanks, Amit Langote




Re: speed up a logical replica setup

2024-06-24 Thread Amit Kapila
On Sun, Jun 23, 2024 at 11:52 AM Noah Misch  wrote:
>
> > +static void
> > +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> > +{
>
> > + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES",
> > +   ipubname_esc);
>
> This tool's documentation says it "guarantees that no transaction will be
> lost."  I tried to determine whether achieving that will require something
> like the fix from
> https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca7...@enterprisedb.com.
> (Not exactly the fix from that thread, since that thread has not discussed the
> FOR ALL TABLES version of its race condition.)  I don't know.  On the one
> hand, pg_createsubscriber benefits from creating a logical slot after creating
> the publication.  That snapbuild.c process will wait for running XIDs.  On the
> other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and builds
> its relcache entry before assigning an XID, so perhaps the snapbuild.c process
> isn't enough to prevent that thread's race condition.  What do you think?
>

I am not able to imagine how the race condition discussed in the
thread you quoted can impact this patch. The problem discussed is
mainly about the interaction between processing changes in logical
decoding and concurrent DDL (Alter Publication ... Add Table). The
problem happens because we use the old cache state. I am missing your
point about how the race condition mentioned in the thread you quoted
relates to snapbuild.c. Can you please elaborate a bit more?

-- 
With Regards,
Amit Kapila.




Re: Inconsistent Parsing of Offsets with Seconds

2024-06-24 Thread David E. Wheeler
On Jun 22, 2024, at 14:10, David E. Wheeler  wrote:

> I believe the former issue is caused by the latter: The jsonpath 
> implementation uses the formatting strings to parse the timestamps[1], and 
> since there is no formatting to support offsets with seconds, it doesn’t work 
> at all in JSON timestamp parsing.
> 
> [1]: 
> https://github.com/postgres/postgres/blob/70a845c/src/backend/utils/adt/jsonpath_exec.c#L2420-L2442

A side-effect of this implementation of date/time parsing using the to_char 
templates is that only time zone offsets and abbreviations are supported. I 
find the behavior a little surprising TBH:

david=# select to_timestamp('2024-06-03 12:35:00America/New_York', '-MM-DD 
HH24:MI:SSTZ');
ERROR:  invalid value "America/New_York" for "TZ"
DETAIL:  Time zone abbreviation is not recognized.

Unless the SQL standard only supports offsets and abbreviations, I wonder if 
we’d be better off updating the above parsing code to also try the various 
date/time input functions, as well as the custom formats that *are* defined by 
the standard.

Best,






Re: Changing default -march landscape

2024-06-24 Thread Christoph Berg
Hi,

sorry for the delayed reply, I suck at prioritizing things.

Re: Thomas Munro
> OK let me CC Christoph and ask the question this way: hypothetically,
> if RHEL users' PostgreSQL packages became automatically faster than
> Debian users' packages because of default -march choice in the
> toolchain, what would the Debian project think about that, and what
> should we do about it?  The options discussed so far were:
> 
> 1.  Consider a community apt repo that is identical to the one except
> targeting the higher microarch levels, that users can point a machine
> to if they want to.

There are sub-variations of that: Don't use -march in Debian for
strict baseline compatiblity, but use -march=something in
apt.postgresql.org; bump to -march=x86-64-v2 for the server build (but
not libpq and psql) saying that PG servers need better hardware; ...

I'd rather want to avoid adding yet another axis to the matrix we
target on apt.postgresql.org, it's already complex enough. So ideally
there should be only one server-build. Or if we decide it's worth to
have several, extension builds should still be compatible with either.

> 2.  Various ideas for shipping multiple versions of the code at a
> higher granularity than we're doing to day (a callback for a single
> instruction!  the opposite extreme being the whole executable +
> libraries), probably using some of techniques mentioned at
> https://wiki.debian.org/InstructionSelection.

I don't have enough experience with that to say anything about the
trade-offs, or if the online instruction selectors themselves add too
much overhead.

Not to mention that testing things against all instruction variants is
probably next to impossible in practice.

> I would guess that 1 is about 100x easier but I haven't looked into it.

Are there any numbers about what kind of speedup we could expect?

If the online selection isn't feasible/worthwhile, I'd be willing to
bump the baseline for the server packages. There are already packages
doing that, and there's even infrastructure in the "isa-support"
package that lets packages declare a dependency on CPU features.

Or Debian might just bump the baseline. PostgreSQL asking for it might
just be the reason we wanted to hear to make it happen.

Christoph




Re: Meson far from ready on Windows

2024-06-24 Thread Dave Page
On Sat, 22 Jun 2024 at 17:35, Andres Freund  wrote:

> Hi,
>
> On 2024-06-21 15:36:56 +0100, Dave Page wrote:
> > For giggles, I took a crack at doing that, manually creating .pc files
> > for everything I've been working with so far.
>
> Cool!
>
>
> > It seems to work as expected, except that unlike everything else libintl
> > is detected entirely based on whether the header and library can be
> > found. I had to pass extra lib and include dirs:
>
> Yea, right now libintl isn't using dependency detection because I didn't
> see any platform where it's distributed with a .pc file or such. It'd be
> just a line or two to make it use one...
>

I think it should, for consistency if nothing else - especially if we're
adding our own pc/cmake files to prebuilt dependencies.
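
For reference, a minimal hand-written libintl.pc is all the dependency()
lookup would need; the paths and version below are made up for illustration:

# libintl.pc - minimal, hand-written; adjust prefix to the local install
prefix=C:/build64/gettext
libdir=${prefix}/lib
includedir=${prefix}/include

Name: libintl
Description: GNU gettext runtime library
Version: 0.22
Libs: -L${libdir} -lintl
Cflags: -I${includedir}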

-- 
Dave Page
pgAdmin: https://www.pgadmin.org
PostgreSQL: https://www.postgresql.org
EDB: https://www.enterprisedb.com


Re: Changing default -march landscape

2024-06-24 Thread Christoph Berg
Re: To Thomas Munro
> Or Debian might just bump the baseline. PostgreSQL asking for it might
> just be the reason we wanted to hear to make it happen.

Which level would PostgreSQL specifically want? x86-64-v2 or even
x86-64-v3?

Christoph




[PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread Joel Jacobson
Hello hackers,

This patch is based on a suggestion from a separate thread [1]:

On Mon, Jun 24, 2024, at 01:46, Michael Paquier wrote:
> Rather unrelated to this patch, still this patch makes the situation
> more complicated in the docs, but wouldn't it be better to add ACL as
> a term in acronyms.sql, and reuse it here?  It would be a doc-only
> patch that applies on top of the rest (could be on a new thread of its
> own), with some  markups added where needed.

[1] https://postgr.es/m/Zniz1n7qa3_i4iac%40paquier.xyz

/Joel

0001-Add-ACL-Access-Control-List-acronym.patch
Description: Binary data


Re: Conflict detection and logging in logical replication

2024-06-24 Thread Nisha Moond
On Mon, Jun 24, 2024 at 7:39 AM Zhijie Hou (Fujitsu)
 wrote:
>
> When testing the patch, I noticed a bug: when reporting the conflict
> after calling ExecInsertIndexTuples(), we might find the tuple that we
> just inserted and report it. (We should only report a conflict if there are
> other conflicting tuples which were not inserted by us.) Here is a new patch
> which fixes this and a compile warning reported by CFbot.
>
Thank you for the patch!
A review comment: The patch does not detect 'update_differ' conflicts
when the Publisher has a non-partitioned table and the Subscriber has
a partitioned version.

Here’s a simple failing test case:
Pub: create table tab (a int primary key, b int not null, c varchar(5));

Sub: create table tab (a int not null, b int not null, c varchar(5))
partition by range (b);
alter table tab add constraint tab_pk primary key (a, b);
create table tab_1 partition of tab for values from (minvalue) to (100);
create table tab_2 partition of tab for values from (100) to (maxvalue);

With the above setup, in case the Subscriber table has a tuple with
its own origin, the incoming remote update from the Publisher fails to
detect the 'update_differ' conflict.

--
Thanks,
Nisha




Re: Improve EXPLAIN output for multicolumn B-Tree Index

2024-06-24 Thread Jelte Fennema-Nio
On Mon, 24 Jun 2024 at 13:02, Matthias van de Meent
 wrote:
> It does not really behave similar: index scan keys (such as the
> id3=101 scankey) don't require visibility checks in the btree code,
> while the Filter condition _does_ require a visibility check, and
> delegates the check to the table AM if the scan isn't Index-Only, or
> if the VM didn't show all-visible during the check.

Any chance you could point me in the right direction for the
code/docs/comment about this? I'd like to learn a bit more about why
that is the case, because I didn't realize visibility checks worked
differently for index scan keys and Filter keys.

> Furthermore, the index could use the scankey to improve the number of
> keys to scan using "skip scans"; by realising during a forward scan
> that if you've reached tuple (1, 2, 3) and looking for (1, _, 1) you
> can skip forward to (1, 3, _), rather than having to go through tuples
> (1, 2, 4), (1, 2, 5), ... (1, 2, n). This is not possible for
> INCLUDE-d columns, because their datatypes and structure are opaque to
> the index AM; the AM cannot assume anything about or do anything with
> those values.

Does Postgres actually support this currently? I thought skip scans
were not available (yet).

> I don't want A to to be the plan, while showing B' to the user, as the
> performance picture for the two may be completely different. And, as I
> mentioned upthread, the differences between AMs in the (lack of)
> meaning in index column order also makes it quite wrong to generally
> separate prefixes equalities from the rest of the keys.

Yeah, that makes sense. These specific explain lines probably
only/mostly make sense for btree. So yeah we'd want the index AM to be
able to add some stuff to the explain plan.

> As you can see, there's a huge difference in performance. Putting both
> non-bound and "normal" filter clauses in the same Filter clause will
> make it more difficult to explain performance issues based on only the
> explain output.

Fair enough, that's of course the main point of this patch in the
first place: being able to better interpret the explain plan when you
don't have access to the schema. Still I think Filter is the correct
keyword for both, so how about we make it less confusing by making the
current "Filter" more specific by calling it something like "Non-key
Filter" or "INCLUDE Filter" and then call the other something like
"Index Filter" or "Secondary Bound Filter".




Re: Improve EXPLAIN output for multicolumn B-Tree Index

2024-06-24 Thread Ashutosh Bapat
On Mon, Jun 24, 2024 at 8:08 AM  wrote:

> > I am unable to decide whether reporting the bound quals is just enough
> > to decide the efficiency of the index without knowing the difference
> > between index tuple selectivity and heap tuple selectivity. The
> > difference seems to be a better indicator of index efficiency, whereas
> > the bound quals will help debug the inefficiency, if any.
> > Also, do we want to report bound quals even if they are the same as
> > index conditions or just when they are different?
>
> Thank you for your comment. After receiving your comment, I thought it
> would be better to also report information that would make the difference
> in selectivity understandable. One idea I had is to output the number of
> index tuples inefficiently extracted, like “Rows Removed by Filter”. Users
> can check the selectivity and efficiency by looking at the number.
>
> Also, I thought it would be better to change the way bound quals are
> reported to align with the "Filter". I think it would be better to modify
> it so that it does not output when the bound quals are the same as the
> index conditions.
>
> In my local PoC patch, I have modified the output as follows, what do you
> think?
>
> =# EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM test WHERE id1 = 1 AND id2 =
> 101;
>                                                     QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------
>  Index Scan using test_idx on ikedamsh.test  (cost=0.42..8.45 rows=1
> width=18) (actual time=0.082..0.086 rows=1 loops=1)
>Output: id1, id2, id3, value
>Index Cond: ((test.id1 = 1) AND (test.id2 = 101))  -- If it’s
> efficient, the output won’t change.
>  Planning Time: 5.088 ms
>  Execution Time: 0.162 ms
> (5 rows)
>

This looks fine. We may highlight in the documentation that the lack of Index
bound quals in the EXPLAIN output indicates that they are the same as Index
Cond. Another idea is to use both Index Cond and bound quals as the property
name, but that's too long.


>
> =# EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM test WHERE id1 = 1 AND id3 =
> 101;
>                                                    QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------
>  Index Scan using test_idx on ikedamsh.test  (cost=0.42..12630.10 rows=1
> width=18) (actual time=0.175..279.819 rows=1 loops=1)
>Output: id1, id2, id3, value
>Index Cond: (test.id1 = 1) -- Change the output. Show
> only the bound quals.
>Index Filter: (test.id3 = 101)  -- New. Output quals which
> are not used as the bound quals
>Rows Removed by Index Filter: 49-- New. Output when ANALYZE
> option is specified
>  Planning Time: 0.354 ms
>  Execution Time: 279.908 ms
> (7 rows)
>

I don't think we want to split these clauses. Index Cond should indicate
the conditions applied to the index scan. Bound quals should be listed
separately even though they will have an intersection with Index Cond. I am
not sure whether Index Filter is the right name; maybe Index Bound Cond.
But I don't know this area well enough to make a final call.

About Rows Removed by Index Filter: it's good to provide a number when
ANALYZE is specified, but it would also be better to specify what was
estimated. We do that for cost and rows etc., but doing that somewhere in
the plan output may not have a precedent. I think we should try that and
see what others think.
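
For anyone trying to reproduce the plans above, the examples seem to assume a
schema roughly like this (illustrative only; the actual table definition is
not shown in this part of the thread):

CREATE TABLE test (id1 int, id2 int, id3 int, value varchar(32));
CREATE INDEX test_idx ON test (id1, id2, id3);
-- plus enough data that (id1, id2) is selective while id1 alone is not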

-- 
Best Wishes,
Ashutosh Bapat


Re: replace strtok()

2024-06-24 Thread Peter Eisentraut

On 24.06.24 02:34, Michael Paquier wrote:

On Sat, Jun 22, 2024 at 11:48:21AM -0400, Tom Lane wrote:

Peter Eisentraut  writes:

On 18.06.24 13:43, Ranier Vilela wrote:

I found another implementation of strsep, it seems lighter to me.
I will attach it for consideration, however, I have not done any testing.



Yeah, surely there are many possible implementations.  I'm thinking,
since we already took other str*() functions from OpenBSD, it makes
sense to do this here as well, so we have only one source to deal with.


Why not use strpbrk?  That's equally thread-safe, it's been there
since C89, and it doesn't have the problem that you can't find out
which of the delimiter characters was found.


Yeah, strpbrk() has been used in the tree as far as 2003 without any
port/ implementation.


The existing uses of strpbrk() are really just checking whether some 
characters exist in a string, more like an enhanced strchr().  I don't 
see any uses for tokenizing a string like strtok() or strsep() would do. 
 I think that would look quite cumbersome.  So I think a simpler and 
more convenient abstraction like strsep() would still be worthwhile.
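
For reference, a strsep()-style helper is tiny; roughly something like this
(a sketch only, not necessarily the OpenBSD code we would import):

/*
 * Sketch of a strsep()-style tokenizer.  Unlike strtok(), the state lives
 * in the caller's pointer, so it is reentrant and preserves empty fields.
 * Requires <string.h> for strcspn().
 */
char *
pg_strsep(char **stringp, const char *delim)
{
	char	   *start = *stringp;
	char	   *p;

	if (start == NULL)
		return NULL;

	p = start + strcspn(start, delim);	/* find the first delimiter, if any */
	if (*p != '\0')
	{
		*p = '\0';				/* terminate this token */
		*stringp = p + 1;		/* resume after the delimiter */
	}
	else
		*stringp = NULL;		/* last token */

	return start;
}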





Re: Pgoutput not capturing the generated columns

2024-06-24 Thread Shlok Kyal
On Fri, 21 Jun 2024 at 12:51, Peter Smith  wrote:
>
> Hi, Here are some review comments for patch v9-0003
>
> ==
> Commit Message
>
> /fix/fixes/
Fixed

> ==
> 1.
> General. Is tablesync enough?
>
> I don't understand why is the patch only concerned about tablesync?
> Does it make sense to prevent VIRTUAL column replication during
> tablesync, if you aren't also going to prevent VIRTUAL columns from
> normal logical replication (e.g. when copy_data = false)? Or is this
> already handled somewhere?
I checked the behaviour during incremental changes. I saw that during
decoding, 'null' values are present for virtual generated columns. I
made the relevant changes to not support replication of virtual
generated columns.

> ~~~
>
> 2.
> General. Missing test.
>
> Add some test cases to verify behaviour is different for STORED versus
> VIRTUAL generated columns
I have not added the tests as they would give an error in cfbot.
I have added a TODO note for the same. This can be done once the
VIRTUAL generated columns patch is committed.

> ==
> src/sgml/ref/create_subscription.sgml
>
> NITPICK - consider rearranging as shown in my nitpicks diff
> NITPICK - use  sgml markup for STORED
Fixed

> ==
> src/backend/replication/logical/tablesync.c
>
> 3.
> - if ((walrcv_server_version(LogRepWorkerWalRcvConn) >= 12 &&
> - walrcv_server_version(LogRepWorkerWalRcvConn) <= 16) ||
> - !MySubscription->includegencols)
> + if (walrcv_server_version(LogRepWorkerWalRcvConn) < 17)
> + {
> + if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 12)
>   appendStringInfo(&cmd, " AND a.attgenerated = ''");
> + }
> + else if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 17)
> + {
> + if(MySubscription->includegencols)
> + appendStringInfo(&cmd, " AND a.attgenerated != 'v'");
> + else
> + appendStringInfo(&cmd, " AND a.attgenerated = ''");
> + }
>
> IMO this logic is too tricky to remain uncommented -- please add some 
> comments.
> Also, it seems somewhat complex. I think you can achieve the same more simply:
>
> SUGGESTION
>
> if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 12)
> {
>   bool gencols_allowed = walrcv_server_version(LogRepWorkerWalRcvConn) >= 
> 17
> && MySubscription->includegencols;
>   if (gencols_allowed)
>   {
> /* Replication of generated cols is supported, but not VIRTUAL cols. */
> appendStringInfo(&cmd, " AND a.attgenerated != 'v'");
>   }
>   else
>   {
> /* Replication of generated cols is not supported. */
> appendStringInfo(&cmd, " AND a.attgenerated = ''");
>   }
> }
Fixed

> ==
>
> 99.
> Please refer also to my attached nitpick diffs and apply those if you agree.
Applied the changes.

I have attached the updated patch v10 here in [1].
[1]: 
https://www.postgresql.org/message-id/CANhcyEUMCk6cCbw0vVZWo8FRd6ae9CmKG%3DgKP-9Q67jLn8HqtQ%40mail.gmail.com

Thanks and Regards,
Shlok Kyal




RE: Partial aggregates pushdown

2024-06-24 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi. Jelte, hackers.

Thank you for your proposal and comments.

> From: Jelte Fennema-Nio 
> Sent: Monday, June 24, 2024 6:09 PM
> > 1. Generality
> > I believe we should develop a method that can theoretically apply to any
> > aggregate function, even if we cannot implement it immediately. However,
> > I find it exceptionally challenging to demonstrate that any internal
> > transtype can be universally converted to a native data type for aggregate
> > functions that are parallel-safe under the current parallel query feature.
> > Specifically, proving this for user-defined aggregate functions presents a
> > significant difficulty in my view.
> > On the other hand, I think that the usage of export and import functions
> > can theoretically apply to any aggregate functions.
> 
> The only thing required when doing CREATE TYPE is having an INPUT and
> OUTPUT function for the type, which (de)serialize the type to text format.
> As far as I can tell by definition that requirement should be fine for any
> aggregates that we can do partial aggregation pushdown for. To clarify I'm
> not suggesting we should change any of the internal representation of the
> type for the current internal aggregates. I'm suggesting we create new
> native types (i.e. do CREATE TYPE) for those internal representations and
> then use the name of that type instead of internal.
I see. I think I now understand your proposal.
Following it, for avg(int8), I would create a new native type like
state_int8_avg with new typsend/typreceive functions and use them to
transmit the state value, right?

That seems like a more fundamental solution, because it avoids adding the
export/import functions of my proposal, which are new components of an
aggregate function. I had not considered that solution.
I appreciate your proposal.
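
If I sketch it in SQL terms (the names and the C-level functions here are
hypothetical, just to confirm my understanding):

-- shell type first, then the I/O functions, then the full type definition
CREATE TYPE state_int8_avg;

CREATE FUNCTION state_int8_avg_in(cstring) RETURNS state_int8_avg
    AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION state_int8_avg_out(state_int8_avg) RETURNS cstring
    AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;

CREATE TYPE state_int8_avg (
    INPUT = state_int8_avg_in,
    OUTPUT = state_int8_avg_out,
    INTERNALLENGTH = VARIABLE
);

-- avg(int8) would then use state_int8_avg as its transition type
-- instead of internal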

However, I still have the following two questions.

1. Unnecessary components of new native types
According to pg_type.dat, typinput and typoutput are required.
I think that in your proposal they are not actually needed, so they would
be wasted, which does not seem acceptable to me.
How can I resolve that problem?

2. Many new native types
I think that, basically, each aggregate function needs its own new native type.
For example, avg(int8), avg(numeric), and var_pop(int4) have the same
transtype, PolyNumAggState.
You said that it would be enough to add only one native type, like
state_poly_num_agg, to support them, right?

But their combine functions might have quite different expectations about
the data items of PolyNumAggState, such as the range of N (the count) and
the true/false of calcSumX2 (the flag for calculating the sum of squares).
Their final functions have similar expectations.
So, under your proposal, I think each of them needs its own native data type,
like state_int8_avg and state_numeric_avg, for safety.

Also, we would need a native type even for an aggregate function whose
transtype is neither internal nor a pseudo-type.
For avg(int4), the current transtype is _int8.
However, I do need a validation check on the number of array elements
and the positiveness of count (the first element of the array).
So, under your proposal, I would need a new native type like state_int4_avg
there as well.

Consequently, under your proposal, I think each aggregate function would
ultimately need a new native type with typinput and typoutput.
That seems to need the same amount of code and more catalog data,
right?

> > 2. Amount of codes.
> > It could need more codes.
> 
> I think it would be about the same as your proposal. Instead of serializing
> to an intermediary existing type you serialize to string straight away. I
> think it would probably be slightly less code actually, since you don't
> have to add code to handle the new aggpartialexportfn and
> aggpartialimportfn columns.
> 
> > 3. Concern about performance
> > I'm concerned that altering the current internal data types could impact
> performance.
> 
> As explained above in my proposal all the aggregation code would remain
> unchanged, only new native types will be added. Thus performance won't be
> affected, because all aggregation code will be the same. The only thing that's
> changed is that the internal type now has a name and an INPUT and OUTPUT
> function.
I got it. Thank you.

> Not a full review but some initial notes:
Thank you. I don't have time today, so I'll answer after tomorrow.

Best regards, Yuki Fujii
--
Yuki Fujii
Information Technology R&D Center, Mitsubishi Electric Corporation


basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE

2024-06-24 Thread Tomas Vondra
Hi,

While doing some additional testing of (incremental) backups, I ran into
a couple regular failures. After pulling my hair for a couple days, I
realized the issue seems to affect regular backups, and incremental
backups (which I've been trying to test) are likely innocent.

I'm using a simple (and admittedly not very pretty) bash script that
takes and verifies backups, concurrently with this workload:


1) initialize a cluster

2) initialize pgbench in database 'db'

3) run short pgbench on 'db'

4) maybe do vacuum [full] on 'db'

5) drop a database 'db_copy' if it exists

6) create a database 'db_copy' by copying 'db' using one of the
   available strategies (file_copy, wal_log)

7) run short pgbench on 'db_copy'

8) maybe do vacuum [full] on 'db_copy'


And concurrently with this, it takes a basebackup, starts a cluster on
it (on a different port, ofc), and does various checks on that:


a) verify checksums using pg_checksums (cluster has them enabled)

b) run amcheck on tables/indexes on both databases

c) SQL check (we expect all tables to be 'consistent' as if we did a
PITR - in particular sum(balance) is expected to be the same value on
all pgbench tables) on both databases


I believe those are reasonable expectations - that we get a database
with valid checksums, with non-broken tables/indexes, and that the
database looks like a snapshot taken at a single instant.
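
(Step 6 above is essentially the following; the exact options used by the
attached scripts may differ:)

DROP DATABASE IF EXISTS db_copy;
CREATE DATABASE db_copy TEMPLATE db STRATEGY file_copy;  -- or STRATEGY wal_log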

Unfortunately it doesn't take long for the tests to start failing with
various strange symptoms on the db_copy database (I'm yet to see an
issue on the 'db' database):

i) amcheck fails with 'heap tuple lacks matching index tuple'

ERROR:  heap tuple (116195,22) from table "pgbench_accounts" lacks
matching index tuple within index "pgbench_accounts_pkey"
HINT:  Retrying verification using the function
bt_index_parent_check() might provide a more specific error.

I've seen this with other tables/indexes too, e.g. the system catalog
pg_statistic or toast tables, but 'accounts' is most common.

ii) amcheck fails with 'could not open file'

ERROR:  could not open file "base/18121/18137": No such file or
directory
LINE 9:  lateral verify_heapam(relation => c.oid, on_error_stop =>
f...
 ^
ERROR:  could not open file "base/18121/18137": No such file or
directory

iii) failures in the SQL check, with different tables having different
balance sums

SQL check fails (db_copy) (account 156142 branches 136132 tellers
   136132 history -42826)

Sometimes this is preceded by amcheck issue, but not always.

I guess this is not the behavior we expect :-(

I've reproduced all of this on PG16 - I haven't tried with older
releases, but I have no reason to assume pre-16 releases are not affected.

With incremental backups I've observed a couple more symptoms, but those
are most likely just fallout of this - not realizing the initial state
is a bit wrong, and making it worse by applying the increments.

The important observation is that this only happens if a database is
created while the backup is running, and that it only happens with the
FILE_COPY strategy - I've never seen this with WAL_LOG (which is the
default since PG15).

I don't recall any reports of similar issues from pre-15 releases, where
FILE_COPY was the only available option - I'm not sure why that is.
Either it didn't have this issue back then, or maybe people happen to
not create databases concurrently with a backup very often. It's a race
condition / timing issue, essentially.

I have no ambition to investigate this part of the code much deeper, or
invent a fix myself, at least not in foreseeable future. But it seems
like something we probably should fix - subtly broken backups are not a
great thing.

I see there have been a couple of threads proposing various improvements to
FILE_COPY that might make it more efficient/faster, namely using filesystem
cloning [1] or switching pg_upgrade to use it [2]. But having
something that's (maybe) faster but not quite correct does not seem like
a winning strategy to me ...

Alternatively, if we don't have clear desire to fix it, maybe the right
solution would be get rid of it?


regards

[1]
https://www.postgresql.org/message-id/ca+hukglm+t+swbu-chemuxjcogbxshlgzutv5zcwy4qrcce...@mail.gmail.com

[2] https://www.postgresql.org/message-id/Zl9ta3FtgdjizkJ5%40nathan

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

basebackup-test-scripts.tgz
Description: application/compressed-tar


Re: long-standing data loss bug in initial sync of logical replication

2024-06-24 Thread Tomas Vondra
On 6/24/24 12:54, Amit Kapila wrote:
> ...
>>
 I'm not sure there are any cases where using SRE instead of AE would cause
 problems for logical decoding, but it seems very hard to prove. I'd be very
 surprised if just using SRE would not lead to corrupted cache contents in
 some situations. The cases where a lower lock level is ok are ones where we
 just don't care that the cache is coherent in that moment.
>>
>>> Are you saying it might break cases that are not corrupted now? How
>>> could obtaining a stronger lock have such effect?
>>
>> No, I mean that I don't know if using SRE instead of AE would have negative
>> consequences for logical decoding. I.e. whether, from a logical decoding POV,
>> it'd suffice to increase the lock level to just SRE instead of AE.
>>
>> Since I don't see how it'd be correct otherwise, it's kind of a moot 
>> question.
>>
> 
> We lost track of this thread and the bug is still open. IIUC, the
> conclusion is to use SRE in OpenTableList() to fix the reported issue.
> Andres, Tomas, please let me know if my understanding is wrong,
> otherwise, let's proceed and fix this issue.
> 

It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
don't think we 'lost track' of it, but it's true we haven't made much
progress recently.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Melanie Plageman
On Mon, Jun 24, 2024 at 4:10 AM Alena Rybakina  wrote:
>
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
> has a transaction that sees that tuple as alive, because it will
> simply wait to replay the removal until it would be correct to do so
> or recovery conflict handling will cancel the transaction that sees
> the tuple as alive and allow replay to continue.
>
> This is an interesting and difficult case. I noticed that when initializing
> the cluster, in my opinion, we provide excessive freezing. Initialization
> takes a long time, which can lead, for example, to longer test execution. I
> got rid of this by adding a check that OldestMxact is not FirstMultiXactId,
> and it works fine.
>
> if (prstate->cutoffs &&
> TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
> prstate->cutoffs->OldestMxact != FirstMultiXactId &&
> NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
> return HEAPTUPLE_DEAD;
>
> Can I keep it?

This looks like an addition to the new criteria I added to
heap_prune_satisfies_vacuum(). Is that what you are suggesting? If so,
it looks like it would only return HEAPTUPLE_DEAD (and thus only
remove) a subset of the tuples my original criteria would remove. When
vacuum calculates OldestMxact as FirstMultiXactId, it would not remove
those tuples deleted before OldestXmin. It seems like OldestMxact will
equal FirstMultiXactID sometimes right after initdb and after
transaction ID wraparound. I'm not sure I totally understand the
criteria.

One thing I find confusing about this is that this would actually
remove fewer tuples than with my criteria -- which could lead to more
freezing. When vacuum calculates OldestMxact == FirstMultiXactID, we
would not remove tuples deleted before OldestXmin and thus return
HEAPTUPLE_RECENTLY_DEAD for those tuples. Then we would consider
freezing them. So, it seems like we would do more freezing by adding
this criteria.

Could you explain more about how the criteria you are suggesting
works? Are you saying it does less freezing than master or less
freezing than with my patch?

> Attached is the suggested fix for master plus a repro. I wrote it as a
> recovery suite TAP test, but I am _not_ proposing we add it to the
> ongoing test suite. It is, amongst other things, definitely prone to
> flaking. I also had to use loads of data to force two index vacuuming
> passes now that we have TIDStore, so it is a slow test.
-- snip --
> I have a modified version of this that repros the infinite loop on
> 14-16 with substantially less data. See it here [2]. Also, the repro
> attached to this mail won't work on 14 and 15 because of changes to
> background_psql.
>
> I couldn't understand why the replica is necessary here. Now I am digging into
> why I got similar behavior without a replica, when I have only one instance.
> I'm still checking this in my test, but I believe this patch fixes the original
> problem because the symptoms were the same.

Did you get similar behavior on master or on back branches? Was the
behavior you observed the infinite loop or the error during
heap_prepare_freeze_tuple()?

In my examples, the replica is needed because something has to move
the horizon on the primary backwards. When a standby reconnects with
an older oldest running transaction ID than any of the running
transactions on the primary and the vacuuming backend recomputes its
RecentXmin, the horizon may move backwards when compared to the
horizon calculated at the beginning of the vacuum. Vacuum does not
recompute cutoffs->OldestXmin during vacuuming a relation but it may
recompute the values in the GlobalVisState it uses for pruning.

We knew of only one other way that the horizon could move backwards
which Matthias describes here [1]. However, this is thought to be its
own concurrency-related bug in the commit-abort path that should be
fixed -- as opposed to the standby reconnecting with an older oldest
running transaction ID which can be expected.

Do you know if you were seeing the effects of the scenario Matthias describes?

- Melanie

[1] 
https://www.postgresql.org/message-id/CAEze2WjMTh4KS0%3DQEQB-Jq%2BtDLPR%2B0%2BzVBMfVwSPK5A%3DWZa95Q%40mail.gmail.com




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Melanie Plageman
On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas  wrote:
>
> On 21/06/2024 03:02, Peter Geoghegan wrote:
> > On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
> >  wrote:
> >
> >> The repro forces a round of index vacuuming after the standby
> >> reconnects and before pruning a dead tuple whose xmax is older than
> >> OldestXmin.
> >>
> >> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> >> calls GetOldestNonRemovableTransactionId(), thereby updating the
> >> backend's GlobalVisState and moving maybe_needed backwards.
> >
> > Right. I saw details exactly consistent with this when I used GDB
> > against a production instance.
> >
> > I'm glad that you were able to come up with a repro that involves
> > exactly the same basic elements, including index page deletion.
>
> Would it be possible to make it robust so that we could always run it
> with "make check"? This seems like an important corner case to
> regression test.

I'd have to look into how to ensure I can stabilize some of the parts
that seem prone to flaking. I can probably stabilize the vacuum bit
with a query of pg_stat_activity making sure it is waiting to acquire
the cleanup lock.

I don't, however, see a good way around the large amount of data
required to trigger more than one round of index vacuuming. I could
generate the data more efficiently than I am doing here
(generate_series() in the from clause). Perhaps with a copy? I know it
is too slow now to go in an ongoing test, but I don't have an
intuition around how fast it would have to be to be acceptable. Is
there a set of additional tests that are slower that we don't always
run? I didn't follow how the wraparound test ended up, but that seems
like one that would have been slow.

- Melanie




Re: Injection point locking

2024-06-24 Thread Tom Lane
Heikki Linnakangas  writes:
> ... I can't do that, because InjectionPointRun() requires a PGPROC 
> entry, because it uses an LWLock. That also makes it impossible to use 
> injection points in the postmaster. Any chance we could allow injection 
> points to be triggered without a PGPROC entry? Could we use a simple 
> spinlock instead? With a fast path for the case that no injection points 
> are attached or something?

Even taking a spinlock in the postmaster is contrary to project
policy.  Maybe we could look the other way for debug-only code,
but it seems like a pretty horrible precedent.

Given your point that the existing locking is just a fig leaf
anyway, maybe we could simply not have any?  Then it's on the
test author to be sure that the injection point won't be
getting invoked when it's about to be removed.  Or just rip
out the removal capability, and say that once installed an
injection point is there until postmaster shutdown (or till
shared memory reinitialization, anyway).

regards, tom lane




Re: scalability bottlenecks with (many) partitions (and more)

2024-06-24 Thread Robert Haas
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
 wrote:
> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
> LWLock table has 16 partitions by default - it's quite possible that on
> machine with many cores and/or many partitions, we can easily hit this.
> So I bumped this 4x to 64 partitions.

I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.

> What I ended up doing is having a hash table of 16-element arrays. There
> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
> have now. Each OID is mapped to exactly one of these parts as if in a
> hash table, and in each of those 16-element parts we do exactly the same
> thing we do now (linear search, removal, etc.). This works great, the
> locality is great, etc. The one disadvantage is this makes PGPROC
> larger, but I did a lot of benchmarks and I haven't seen any regression
> that I could attribute to this. (More about this later.)

I think this is a reasonable approach. Some comments:

- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.

- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:

+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
% FP_LOCK_GROUPS_PER_BACKEND)
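
If I understand the scheme correctly, the lookup side becomes something like
this (made-up names, using the macro quoted above and PostgreSQL's usual
Oid/uint64 typedefs; not the patch's actual code):

#define FP_LOCK_SLOTS_PER_GROUP 16

static int
fp_find_slot(const Oid *fpRelId, Oid relid)
{
	/* hash the relation OID to one 16-slot group ... */
	int		base = FAST_PATH_LOCK_REL_GROUP(relid) * FP_LOCK_SLOTS_PER_GROUP;

	/* ... and only that group needs the linear search */
	for (int i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
	{
		if (fpRelId[base + i] == relid)
			return base + i;
	}

	return -1;		/* no fast-path slot used for this relation */
}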

When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.

Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE

2024-06-24 Thread Nathan Bossart
On Mon, Jun 24, 2024 at 04:12:38PM +0200, Tomas Vondra wrote:
> The important observation is that this only happens if a database is
> created while the backup is running, and that it only happens with the
> FILE_COPY strategy - I've never seen this with WAL_LOG (which is the
> default since PG15).

My first thought is that this sounds related to the large comment in
CreateDatabaseUsingFileCopy():

/*
 * We force a checkpoint before committing.  This effectively means that
 * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to 
be
 * replayed (at least not in ordinary crash recovery; we still have to
 * make the XLOG entry for the benefit of PITR operations). This avoids
 * two nasty scenarios:
 *
 * #1: When PITR is off, we don't XLOG the contents of newly created
 * indexes; therefore the drop-and-recreate-whole-directory behavior of
 * DBASE_CREATE replay would lose such indexes.
 *
 * #2: Since we have to recopy the source database during DBASE_CREATE
 * replay, we run the risk of copying changes in it that were committed
 * after the original CREATE DATABASE command but before the system 
crash
 * that led to the replay.  This is at least unexpected and at worst 
could
 * lead to inconsistencies, eg duplicate table names.
 *
 * (Both of these were real bugs in releases 8.0 through 8.0.3.)
 *
 * In PITR replay, the first of these isn't an issue, and the second is
 * only a risk if the CREATE DATABASE and subsequent template database
 * change both occur while a base backup is being taken. There doesn't
 * seem to be much we can do about that except document it as a
 * limitation.
 *
 * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
 * strategy that avoids these problems.
 */

> I don't recall any reports of similar issues from pre-15 releases, where
> FILE_COPY was the only available option - I'm not sure why is that.
> Either it didn't have this issue back then, or maybe people happen to
> not create databases concurrently with a backup very often. It's a race
> condition / timing issue, essentially.

If it requires concurrent activity on the template database, I wouldn't be
surprised at all that this is rare.

> I see there have been a couple threads proposing various improvements to
> FILE_COPY, that might make it more efficient/faster, namely using the
> filesystem cloning [1] or switching pg_upgrade to use it [2]. But having
> something that's (maybe) faster but not quite correct does not seem like
> a winning strategy to me ...
> 
> Alternatively, if we don't have clear desire to fix it, maybe the right
> solution would be get rid of it?

It would be unfortunate if we couldn't use this for pg_upgrade, especially
if it is unaffected by these problems.

-- 
nathan




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Tomas Vondra




On 6/24/24 16:53, Melanie Plageman wrote:
> On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas  wrote:
>>
>> On 21/06/2024 03:02, Peter Geoghegan wrote:
>>> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
>>>  wrote:
>>>
 The repro forces a round of index vacuuming after the standby
 reconnects and before pruning a dead tuple whose xmax is older than
 OldestXmin.

 At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
 calls GetOldestNonRemovableTransactionId(), thereby updating the
 backend's GlobalVisState and moving maybe_needed backwards.
>>>
>>> Right. I saw details exactly consistent with this when I used GDB
>>> against a production instance.
>>>
>>> I'm glad that you were able to come up with a repro that involves
>>> exactly the same basic elements, including index page deletion.
>>
>> Would it be possible to make it robust so that we could always run it
>> with "make check"? This seems like an important corner case to
>> regression test.
> 
> I'd have to look into how to ensure I can stabilize some of the parts
> that seem prone to flaking. I can probably stabilize the vacuum bit
> with a query of pg_stat_activity making sure it is waiting to acquire
> the cleanup lock.
> 
> I don't, however, see a good way around the large amount of data
> required to trigger more than one round of index vacuuming. I could
> generate the data more efficiently than I am doing here
> (generate_series() in the from clause). Perhaps with a copy? I know it
> is too slow now to go in an ongoing test, but I don't have an
> intuition around how fast it would have to be to be acceptable. Is
> there a set of additional tests that are slower that we don't always
> run? I didn't follow how the wraparound test ended up, but that seems
> like one that would have been slow.
> 

I think it depends on the impact on the 'make check' duration.
If it could be added to one of the existing test groups, then it depends
on how long the slowest test in that group is. If the new test needs to
be in a separate group, it probably needs to be very fast.

But I was wondering how much time we are talking about, so I tried

creating a table, filling it with 300k rows => 250ms
creating an index => 180ms
delete 90% data => 200ms
vacuum t => 130ms

which with m_w_m=1MB does two rounds of index cleanup. That's ~760ms.
I'm not sure how much more stuff the test needs to do, but this
would be pretty reasonable if we could add it to an existing group.
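
Roughly, in SQL terms (statements and numbers are approximate):

CREATE TABLE t (a int);
INSERT INTO t SELECT i FROM generate_series(1, 300000) s(i);  -- ~250ms
CREATE INDEX ON t (a);                                        -- ~180ms
DELETE FROM t WHERE a % 10 <> 0;                              -- ~90% of rows, ~200ms
SET maintenance_work_mem = '1MB';
VACUUM t;                                                     -- two index passes, ~130ms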


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Robert Haas
On Fri, Jun 21, 2024 at 6:52 PM Alena Rybakina
 wrote:
> It's hard to tell, but I think it might be one of the good places to apply 
> transformation. Let me describe a brief conclusion on the two approaches.

This explanation is somewhat difficult for me to follow. For example:

> In the first approach, we definitely did not process the extra "OR" 
> expressions in the first approach, since they were packaged as an Array. It 
> could lead to the fact that less planning time would be spent on the 
> optimizer.

I don't know what the "first approach" refers to, or what processing
the extra "OR" expressions means, or what it would mean to package OR
expressions as an array. If you made them into an SAOP then you'd have
an array *instead of* OR expressions, not OR expressions "packaged as
an array" but even then, they'd still be processed somewhere, unless
the patch was just wrong.
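
(For anyone following along, the transformation under discussion is, in SQL
terms, roughly:

SELECT * FROM t WHERE a = 1 OR a = 2 OR a = 3;
-- becomes a single ScalarArrayOpExpr:
SELECT * FROM t WHERE a = ANY (ARRAY[1, 2, 3]);

so the array replaces the chain of ORs rather than wrapping it.)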

I think you should try writing this summary again and see if you can
make it a lot clearer and more precise.

I'm suspicious that we should actually be postponing the
transformation even further. If, for example, the transformation is
advantageous for index scans and disadvantageous for bitmap scans, or
the other way around, then this approach can't help much: it either
does the transformation and all scan types are affected, or it doesn't
do it and no scan types are affected. But if you decided for each scan
whether to transform the quals, then you could handle that. Against
that, there might be increased planning cost. But, perhaps that could
be avoided somehow.

> What exactly is the strategy around OR-clauses with type differences?
> If I'm reading the code correctly, the first loop requires an exact
> opno match, which presumably implies that the constant-type elements
> are of the same type. But then why does the second loop need to use
> coerce_to_common_type?
>
> It needs to transform all similar constants to one type, because some
> constants of "OR" expressions can belong to other types, like the numeric
> and int types. Because the array structure demands that all elements
> belong to one type, we applied this procedure.

The alternative that should be considered is not combining things if
the types don't match. If we're going to combine such things, we need
to be absolutely certain that type conversion cannot fail.

> I do not think this is acceptable. We should find a way to get the
> right operator into the ScalarArrayOpExpr without translating the OID
> back into a name and then back into an OID.
>
> I don’t really understand the reason why it’s better not to do this. Can you 
> explain please?

One reason is that it is extra work to convert things to a name and
then back to an OID. It's got to be slower than using the OID you
already have.

The other reason is that it's error-prone. If somehow the second
lookup doesn't produce the same OID as the first lookup, bad things
will happen, possibly including security vulnerabilities. I see you've
taken steps to avoid that, like nailing down the schema, and that's
good, but it's not a good enough reason to do it like this. If we
don't have a function that can construct the node we need with the OID
rather than the name as an argument, we should invent one, not do this
sort of thing.

> Also, why is the array built with eval_const_expressions instead of
> something like makeArrayResult? There should be no need for general
> expression evaluation here if we are just dealing with constants.
>
> I'm not ready to answer this question right now, I need time to study the use 
> of the makeArrayResult function in the optimizer.

OK. An important consideration here is that eval_const_expressions()
is prone to *fail* because it can call user-defined functions. We
really don't want this optimization to cause planner failure (or
queries to error out at any other stage, either). We also don't want
to end up with any security problems, which is another possible danger
when we call a function that can execute arbitrary code. It's better
to keep it simple and only do things that we know are simple and safe,
like assembling a bunch of datums that we already have into an array.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE

2024-06-24 Thread Tomas Vondra
On 6/24/24 17:14, Nathan Bossart wrote:
> On Mon, Jun 24, 2024 at 04:12:38PM +0200, Tomas Vondra wrote:
>> The important observation is that this only happens if a database is
>> created while the backup is running, and that it only happens with the
>> FILE_COPY strategy - I've never seen this with WAL_LOG (which is the
>> default since PG15).
> 
> My first thought is that this sounds related to the large comment in
> CreateDatabaseUsingFileCopy():
> 
>   /*
>* We force a checkpoint before committing.  This effectively means that
>* committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to 
> be
>* replayed (at least not in ordinary crash recovery; we still have to
>* make the XLOG entry for the benefit of PITR operations). This avoids
>* two nasty scenarios:
>*
>* #1: When PITR is off, we don't XLOG the contents of newly created
>* indexes; therefore the drop-and-recreate-whole-directory behavior of
>* DBASE_CREATE replay would lose such indexes.
>*
>* #2: Since we have to recopy the source database during DBASE_CREATE
>* replay, we run the risk of copying changes in it that were committed
>* after the original CREATE DATABASE command but before the system 
> crash
>* that led to the replay.  This is at least unexpected and at worst 
> could
>* lead to inconsistencies, eg duplicate table names.
>*
>* (Both of these were real bugs in releases 8.0 through 8.0.3.)
>*
>* In PITR replay, the first of these isn't an issue, and the second is
>* only a risk if the CREATE DATABASE and subsequent template database
>* change both occur while a base backup is being taken. There doesn't
>* seem to be much we can do about that except document it as a
>* limitation.
>*
>* See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE
>* strategy that avoids these problems.
>*/
> 

Perhaps - the mentioned risks certainly seem like they might be related to
the issues I'm observing.

>> I don't recall any reports of similar issues from pre-15 releases, where
>> FILE_COPY was the only available option - I'm not sure why is that.
>> Either it didn't have this issue back then, or maybe people happen to
>> not create databases concurrently with a backup very often. It's a race
>> condition / timing issue, essentially.
> 
> If it requires concurrent activity on the template database, I wouldn't be
> surprised at all that this is rare.
> 

Right. Although, "concurrent" here means a somewhat different thing.
AFAIK there can't be any changes concurrent with the CREATE DATABASE
directly, because we make sure there are no connections:

createdb: error: database creation failed: ERROR:  source database
"test" is being accessed by other users
DETAIL:  There is 1 other session using the database.

But per the comment, it'd be a problem if there is activity after the
database gets copied, but before the backup completes (which is where
the replay will happen).

>> I see there have been a couple threads proposing various improvements to
>> FILE_COPY, that might make it more efficient/faster, namely using the
>> filesystem cloning [1] or switching pg_upgrade to use it [2]. But having
>> something that's (maybe) faster but not quite correct does not seem like
>> a winning strategy to me ...
>>
>> Alternatively, if we don't have clear desire to fix it, maybe the right
>> solution would be get rid of it?
> 
> It would be unfortunate if we couldn't use this for pg_upgrade, especially
> if it is unaffected by these problems.
> 

Yeah. I wouldn't mind using FILE_COPY in contexts where we know it's
safe, like pg_upgrade. I just don't want to let users unknowingly
step on this.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Robert Haas
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
 wrote:
> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin.

I don't have a great feeling about this fix. It's not that I think
it's wrong. It's just that the underlying problem here is that we have
heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
of deciding what gets pruned, but that doesn't actually work, because
as I pointed out in
http://postgr.es/m/ca+tgmob1btwcp6r5-tovhb5wqhasptsr2tjkcdcutmzauyb...@mail.gmail.com
it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
fix is to consider both variables, which again may be totally correct,
but wouldn't it be a lot better if we didn't have two variables
fighting for control of the same behavior?

(I'm not trying to be a nuisance here -- I think it's great that
you've done the work to pin this down and perhaps there is no better
fix than what you've proposed.)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread Nathan Bossart
On Mon, Jun 24, 2024 at 02:32:27PM +0200, Joel Jacobson wrote:
> This patch is based on a suggestion from a separate thread [1]:
> 
> On Mon, Jun 24, 2024, at 01:46, Michael Paquier wrote:
>> Rather unrelated to this patch, still this patch makes the situation
>> more complicated in the docs, but wouldn't it be better to add ACL as
>> a term in acronyms.sql, and reuse it here?  It would be a doc-only
>> patch that applies on top of the rest (could be on a new thread of its
>> own), with some  markups added where needed.

Sounds reasonable to me.

+  <ulink url="https://en.wikipedia.org/wiki/Access_Control_List">Access
Control List</ulink>, i.e. privileges list

I think we could omit "i.e. privileges list."

-- 
nathan




Re: Improve EXPLAIN output for multicolumn B-Tree Index

2024-06-24 Thread Matthias van de Meent
On Mon, 24 Jun 2024 at 14:42, Jelte Fennema-Nio  wrote:
>
> On Mon, 24 Jun 2024 at 13:02, Matthias van de Meent
>  wrote:
> > It does not really behave similar: index scan keys (such as the
> > id3=101 scankey) don't require visibility checks in the btree code,
> > while the Filter condition _does_ require a visibility check, and
> > delegates the check to the table AM if the scan isn't Index-Only, or
> > if the VM didn't show all-visible during the check.
>
> Any chance you could point me in the right direction for the
> code/docs/comment about this? I'd like to learn a bit more about why
> that is the case, because I didn't realize visibility checks worked
> differently for index scan keys and Filter keys.

This can be derived by combining how Filter works (it only filters the
returned live tuples) and how Index-Only scans work (return the index
tuple, unless !ALL_VISIBLE, in which case the heap tuple is
projected). There have been several threads more or less recently that
also touch this topic and closely related topics, e.g. [0][1].

> > Furthermore, the index could use the scankey to improve the number of
> > keys to scan using "skip scans"; by realising during a forward scan
> > that if you've reached tuple (1, 2, 3) and looking for (1, _, 1) you
> > can skip forward to (1, 3, _), rather than having to go through tuples
> > (1, 2, 4), (1, 2, 5), ... (1, 2, n). This is not possible for
> > INCLUDE-d columns, because their datatypes and structure are opaque to
> > the index AM; the AM cannot assume anything about or do anything with
> > those values.
>
> Does Postgres actually support this currently? I thought skip scans
> were not available (yet).

Peter Geoghegan has been working on it as project after PG17's
IN()-list improvements were committed, and I hear he has the basics
working but the further details need fleshing out.

> > As you can see, there's a huge difference in performance. Putting both
> > non-bound and "normal" filter clauses in the same Filter clause will
> > make it more difficult to explain performance issues based on only the
> > explain output.
>
> Fair enough, that's of course the main point of this patch in the
> first place: being able to better interpret the explain plan when you
> don't have access to the schema. Still I think Filter is the correct
> keyword for both, so how about we make it less confusing by making the
> current "Filter" more specific by calling it something like "Non-key
> Filter" or "INCLUDE Filter" and then call the other something like
> "Index Filter" or "Secondary Bound Filter".

I'm not sure how debuggable explain plans are without access to the
schema, especially when VERBOSE isn't configured, so I would be
hesitant to accept that as an argument here.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU%3D%40yamlcoder.me
[1] 
https://www.postgresql.org/message-id/flat/cf85f46f-b02f-05b2-5248-5000b894ebab%40enterprisedb.com




Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread David G. Johnston
On Mon, Jun 24, 2024 at 8:44 AM Nathan Bossart 
wrote:

> On Mon, Jun 24, 2024 at 02:32:27PM +0200, Joel Jacobson wrote:
> > This patch is based on a suggestion from a separate thread [1]:
> >
> > On Mon, Jun 24, 2024, at 01:46, Michael Paquier wrote:
> >> Rather unrelated to this patch, still this patch makes the situation
> >> more complicated in the docs, but wouldn't it be better to add ACL as
> >> a term in acronyms.sql, and reuse it here?  It would be a doc-only
> >> patch that applies on top of the rest (could be on a new thread of its
> >> own), with some  markups added where needed.
>
> Sounds reasonable to me.
>

+1


> +  <ulink url="https://en.wikipedia.org/wiki/Access_Control_List">Access
> Control List</ulink>, i.e. privileges list
>
> I think we could omit "i.e. privileges list."
>
>
Agreed.  Between the docs and code we say "privileges list" once and that
refers to the dumputils description of the arguments to grant.  As the
acronym page now defines the term using fundamentals, introducing another
term not used elsewhere seems undesirable.

Observations:
We are referencing a disambiguation page.  We never actually spell out ACL
anywhere so we might as well just reference what Wikipedia believes is the
expected spelling.

The page we link to uses "permissions" while we consistently use
"privileges" to describe the contents of the list.  This seems like an
obvious synonym, but as the point of these is to formally define things,
pointing out this equivalence is worth considering.

David J.


Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 11:28 AM Robert Haas  wrote:
> > It needs to transform all similar constants to one type, because some
> > constants in the "OR" expressions can be of other types, like numeric and
> > int. Since the array structure demands that all elements belong to one
> > type, we applied this procedure.
>
> The alternative that should be considered is not combining things if
> the types don't match. If we're going to combine such things, we need
> to be absolutely certain that type conversion cannot fail.

But what about cases like this:

SELECT * FROM mytable WHERE columna = 1_000_000_000 or columna =
5_000_000_000; -- columna is int4

This is using two types, of course. 1_000_000_000 is int4, while
5_000_000_000 is bigint. If the transformation suddenly failed to work
when a constant above INT_MAX was used for the first time, then I'd
say that that's pretty surprising. That's what happens currently if
you write the same query as "WHERE columna =
any('{1_000_000_000,5_000_000_000}')", due to the way the coercion
works. That seems less surprising to me, because the user is required
to construct their own array, and users expect arrays to always have
one element type.

It would probably be okay to make the optimization not combine
things/not apply when the user gratuitously mixes different syntaxes.
For example, if a numeric constant was used, rather than an integer
constant.
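
For reference, the constant typing at work here is just standard literal
resolution (shown only to ground the discussion):

```sql
SELECT pg_typeof(1000000000);   -- integer
SELECT pg_typeof(5000000000);   -- bigint
SELECT pg_typeof(5000000000.0); -- numeric
```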

Maybe it would be practical to do something with the B-Tree operator
class for each of the types involved in the optimization. You could
probably find a way for a SAOP to work against a
"heterogeneously-typed array" while still getting B-Tree index scans
-- provided the types all came from the same operator family. I'm
assuming that the index has an index column whose input opclass was a
member of that same family. That would necessitate changing the
general definition of SAOP, and adding new code to nbtree that worked
with that. But that seems doable.

I was already thinking about doing something like this, to support
index scans for "IS NOT DISTINCT FROM", or on constructs like "columna
= 5 OR columna IS NULL". That is more or less a SAOP with two values,
except that one of the values in the value NULL. I've already
implemented "nbtree SAOPs where one of the elements is a NULL" for
skip scan, which could be generalized to support these other cases.

Admittedly I'm glossing over a lot of important details here. Does it
just work for the default opclass for the type, or can we expect it to
work with a non-default opclass when that's the salient opclass (the
one used by our index)? I don't know what you'd do about stuff like
that.

-- 
Peter Geoghegan




Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 01:52:26PM -0400, David E. Wheeler wrote:
> I realize this probably isn't going to happen for 17, given the freeze,
> but I would very much welcome feedback and pointers to address concerns
> about providing a second directory for extensions and DSOs. Quite a few
> people have talked about the need for this in the Extension Mini
> Summits[1], so I'm sure I could get some collaborators to make
> improvements or look at a different approach.

At first glance, the general idea seems reasonable to me.  I'm wondering
whether there is a requirement for this directory to be prepended or if it
could be appended to the end.  That way, the existing ones would take
priority, which might be desirable from a security standpoint.

-- 
nathan




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 11:44 AM Robert Haas  wrote:
> I don't have a great feeling about this fix. It's not that I think
> it's wrong. It's just that the underlying problem here is that we have
> heap_page_prune_and_freeze() getting both GlobalVisState *vistest and
> struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge
> of deciding what gets pruned, but that doesn't actually work, because
> as I pointed out in
> http://postgr.es/m/ca+tgmob1btwcp6r5-tovhb5wqhasptsr2tjkcdcutmzauyb...@mail.gmail.com
> it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your
> fix is to consider both variables, which again may be totally correct,
> but wouldn't it be a lot better if we didn't have two variables
> fighting for control of the same behavior?

Why would it be better? It's to our advantage to have vistest prune
away extra tuples when possible. Andres placed a lot of emphasis on
that during the snapshot scalability work -- vistest can be updated on
the fly.

The problem here is that OldestXmin is supposed to be more
conservative than vistest, which it almost always is, except in this
one edge case. I don't think that plugging that hole changes the basic
fact that there is one source of truth about what *needs* to be
pruned. There is such a source of truth: OldestXmin.

-- 
Peter Geoghegan




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Robert Haas
On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan  wrote:
> The problem here is that OldestXmin is supposed to be more
> conservative than vistest, which it almost always is, except in this
> one edge case. I don't think that plugging that hole changes the basic
> fact that there is one source of truth about what *needs* to be
> pruned. There is such a source of truth: OldestXmin.

Well, another approach could be to make it so that OldestXmin actually
is always more conservative than vistest rather than almost always.

I agree with you that letting the pruning horizon move forward during
vacuum is desirable. I'm just wondering if having the vacuum code need
to know a second horizon is really the best way to address that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Robert Haas
On Mon, Jun 24, 2024 at 12:09 PM Peter Geoghegan  wrote:
> But what about cases like this:
>
> SELECT * FROM mytable WHERE columna = 1_000_000_000 or columna =
> 5_000_000_000; -- columna is int4
>
> This is using two types, of course. 1_000_000_000 is int4, while
> 5_000_000_000 is bigint. If the transformation suddenly failed to work
> when a constant above INT_MAX was used for the first time, then I'd
> say that that's pretty surprising. That's what happens currently if
> you write the same query as "WHERE columna =
> any('{1_000_000_000,5_000_000_000}')", due to the way the coercion
> works. That seems less surprising to me, because the user is required
> to construct their own array, and users expect arrays to always have
> one element type.

I am not against handling this kind of case if we can do it, but it's
more important that the patch doesn't cause gratuitous failures than
that it handles more cases.

> Maybe it would be practical to do something with the B-Tree operator
> class for each of the types involved in the optimization. You could
> probably find a way for a SAOP to work against a
> "heterogeneously-typed array" while still getting B-Tree index scans
> -- provided the types all came from the same operator family. I'm
> assuming that the index has an index column whose input opclass was a
> member of that same family. That would necessitate changing the
> general definition of SAOP, and adding new code to nbtree that worked
> with that. But that seems doable.

I agree that something based on operator families might be viable. Why
would that require changing the definition of an SAOP?

> Admittedly I'm glossing over a lot of important details here. Does it
> just work for the default opclass for the type, or can we expect it to
> work with a non-default opclass when that's the salient opclass (the
> one used by our index)? I don't know what you'd do about stuff like
> that.

It seems to me that it just depends on the opclasses in the query. If
the user says

WHERE column op1 const1 OR column op2 const2

...then if op1 and op2 are in the same operator family and if we can
convert one of const1 and const2 to the type of the other without risk
of failure, then we can rewrite this as an SAOP with whichever of the
two operators pertains to the target type, e.g.

column1 op1 ANY[const1,converted_const2]

I don't think the default opclass matters here, or the index properties either.
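
To spell that out, here is a hedged sketch of the rewrite being described
(table and column names are made up; this restates the transformation, not
the patch's actual output):

```sql
-- Assume columna is int4. The two equality operators involved (int4 = int4
-- and int4 = int8) belong to the same btree operator family (integer_ops).
SELECT * FROM mytable WHERE columna = 1000000000 OR columna = 5000000000;

-- Converting the int4 constant to the wider type lets both constants share
-- one array, giving an equivalent ScalarArrayOpExpr:
SELECT * FROM mytable WHERE columna = ANY (ARRAY[1000000000::int8, 5000000000]);
```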

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 1:05 PM Robert Haas  wrote:
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan  wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

If we did things like that then it would still be necessary to write a
patch like the one Melanie came up with, on the grounds that we'd
really need to be paranoid about having missed some subtlety. We might
as well just rely on the mechanism directly. I just don't think that
it makes much difference.

-- 
Peter Geoghegan




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 1:29 PM Robert Haas  wrote:
> I am not against handling this kind of case if we can do it, but it's
> more important that the patch doesn't cause gratuitous failures than
> that it handles more cases.

I agree, with the proviso that "avoid gratuitous failures" should
include cases where a query that got the optimization suddenly fails
to get the optimization, due only to some very innocuous looking
change. Such as a change from using a constant 1_000_000_000 to a
constant 5_000_000_000 in the query text. That is a POLA violation.

> > Maybe it would be practical to do something with the B-Tree operator
> > class for each of the types involved in the optimization. You could
> > probably find a way for a SAOP to work against a
> > "heterogeneously-typed array" while still getting B-Tree index scans
> > -- provided the types all came from the same operator family. I'm
> > assuming that the index has an index column whose input opclass was a
> > member of that same family. That would necessitate changing the
> > general definition of SAOP, and adding new code to nbtree that worked
> > with that. But that seems doable.
>
> I agree that something based on operator families might be viable. Why
> would that require changing the definition of an SAOP?

Maybe it doesn't. My point was only that the B-Tree code doesn't
necessarily need to use just one rhs type for the same column input
opclass. The definition of SAOP works (or could work) in basically the
same way, provided the "OR condition" were provably disjunct. We could
for example mix different operators for the same nbtree scan key (with
some work in nbtutils.c), just as we could support "where mycol = 5 OR
mycol IS NULL" without much effort.

BTW, did you know MySQL has long supported the latter? It has a <=>
operator, which is basically a non-standard spelling of IS NOT
DISTINCT FROM. Importantly, it is indexable, whereas right now
Postgres doesn't support indexing IS NOT DISTINCT FROM. If you're
interested in working on this problem within the scope of this patch,
or some follow-up patch, I can take care of the nbtree side of things.
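
For reference, the two constructs being discussed look like this in SQL
(made-up table and column names; today neither form is handled as a single
nbtree scan key):

```sql
-- OR of an equality and an IS NULL test:
SELECT * FROM t WHERE mycol = 5 OR mycol IS NULL;

-- The spelled-out Postgres equivalent of MySQL's <=> operator; as noted
-- above, this is currently not indexable at all:
SELECT * FROM t WHERE mycol IS NOT DISTINCT FROM 5;
```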

> > Admittedly I'm glossing over a lot of important details here. Does it
> > just work for the default opclass for the type, or can we expect it to
> > work with a non-default opclass when that's the salient opclass (the
> > one used by our index)? I don't know what you'd do about stuff like
> > that.
>
> It seems to me that it just depends on the opclasses in the query. If
> the user says
>
> WHERE column op1 const1 OR column op2 const2
>
> ...then if op1 and op2 are in the same operator family and if we can
> convert one of const1 and const2 to the type of the other without risk
> of failure, then we can rewrite this as an SAOP with whichever of the
> two operators pertains to the target type, e.g.
>
> column1 op1 ANY[const1,converted_const2]
>
> I don't think the default opclass matters here, or the index properties 
> either.

Okay, good.

The docs do say "Another requirement for a multiple-data-type family
is that any implicit or binary-coercion casts that are defined between
data types included in the operator family must not change the
associated sort ordering" [1]. There must be precedent for this sort
of thing. Probably for merge joins.

[1] https://www.postgresql.org/docs/devel/btree.html#BTREE-BEHAVIOR
-- 
Peter Geoghegan




Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Robert Haas
On Wed, Apr 3, 2024 at 3:13 AM Alvaro Herrera  wrote:
> I support the idea of there being a second location from where to load
> shared libraries ... but I don't like the idea of making it
> runtime-configurable.  If we want to continue to tighten up what
> superuser can do, then one of the things that has to go away is the
> ability to load shared libraries from arbitrary locations
> (dynamic_library_path).  I think we should instead look at making those
> locations hardcoded at compile time.  The packager can then decide where
> those things go, and the superuser no longer has the ability to load
> arbitrary code from arbitrary locations.

Is "tighten up what the superuser can do" on our list of objectives?
Personally, I think we should be focusing mostly, and maybe entirely,
on letting non-superusers do more things, with appropriate security
controls. The superuser can ultimately do anything, since they can
cause shell commands to be run and load arbitrary code into the
backend and write code in untrusted procedural languages and mutilate
the system catalogs and lots of other terrible things.

Now, I think there are environments where people have used things like
containers to try to lock down the superuser, and while I'm not sure
that can ever be particularly water-tight, if it were the case that
this patch would make it a whole lot easier for a superuser to bypass
the kinds of controls that people are imposing today, that might be an
argument against this patch. But ... off-hand, I'm not seeing such an
exposure.

On the patch itself, I find the documentation for this to be fairly
hard to understand. I think it could benefit from an example. I'm
confused about whether this is intended to let me search for
extensions in /my/temp/root/usr/lib/postgresql/... by setting
extension_directory=/my/temp/dir, or whether it's intended for me to
search both /usr/lib/postgresql as I normally would and also
/some/other/place. If the latter, I wonder why we don't handle shared
libraries by setting dynamic_library_path and then just have an
analogue of that for control files.
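
(For context, the existing mechanism referred to here is the
dynamic_library_path GUC; a minimal sketch, with a made-up extra directory:

```sql
-- dynamic_library_path is a superuser-settable GUC; the listed directories
-- are searched when a shared library is loaded by unqualified name.
ALTER SYSTEM SET dynamic_library_path = '$libdir:/some/other/place/lib';
SELECT pg_reload_conf();
```

The analogous setting for control files suggested above does not exist today;
it is only an idea under discussion.)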

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 1:46 PM Peter Geoghegan  wrote:
> BTW, did you know MySQL has long supported the latter? It has a <=>
> operator, which is basically a non-standard spelling of IS NOT
> DISTINCT FROM. Importantly, it is indexable, whereas right now
> Postgres doesn't support indexing IS NOT DISTINCT FROM. If you're
> interested in working on this problem within the scope of this patch,
> or some follow-up patch, I can take care of the nbtree side of things.

To be clear, I meant that we could easily support "where mycol = 5 OR
mycol IS NULL" and have nbtree handle that efficiently, by making it a
SAOP internally. Separately, we could also make IS NOT DISTINCT FROM
indexable, though that probably wouldn't need any work in nbtree.

-- 
Peter Geoghegan




Apparent bug in WAL summarizer process (hung state)

2024-06-24 Thread Israel Barth Rubio
Hello,

Hope you are doing well.

I've been playing a bit with the incremental backup feature which might come
as part of the 17 release, and I think I hit a possible bug in the WAL
summarizer process.

The issue that I face is that the summarizer process gets into a hung state.
When the issue is triggered, it gets stuck in an infinite loop trying to
process a WAL file that no longer exists.  It apparently comes up only when I
change the `summarize_wal` GUC and reload Postgres while there is some load
in Postgres which makes it recycle WAL files.

I'm running Postgres 17 in a Rocky Linux 9 VM. In order to have fewer WAL
files available in `pg_wal` and make it easier to reproduce the issue, I'm
using a low value for `max_wal_size` ('100MB'). You can find below the steps
that I took to reproduce this problem, assuming this small `max_wal_size` and
`summarize_wal` initially enabled:

```bash
# Assume we initially have max_wal_size = '100MB' and summarize_wal = on

# Create a table of ~ 100MB
psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_1

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental
full_backup_1/backup_manifest

# Disable summarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO off"
psql -c "SELECT pg_reload_conf()"

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"

# Re-enable sumarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO on"
psql -c "SELECT pg_reload_conf()"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_2

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental
full_backup_2/backup_manifest
```

I'm able to reproduce the issue most of the time when running these steps
manually. It's harder to reproduce if I attempt to run those commands as a
bash script.

This is the sample output of a run of those commands:

```console

(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"
SELECT 300
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D full_backup_1
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
331785/331785 kB (100%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ psql -c "DROP TABLE test"
DROP TABLE
(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"
SELECT 300
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental full_backup_1/backup_manifest
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
111263/331720 kB (33%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO off"
ALTER SYSTEM
(barman) [postgres@barmandevhost ~]$ psql -c "SELECT pg_reload_conf()"
 pg_reload_conf
 t
(1 row)
(barman) [postgres@barmandevhost ~]$ psql -c "DROP TABLE test"
DROP TABLE
(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"
SELECT 300
(barman) [postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO on"
ALTER SYSTEM
(barman) [postgres@barmandevhost ~]$ psql -c "SELECT pg_reload_conf()"
 pg_reload_conf
 t
(1 row)
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D full_backup_2
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
331734/331734 kB (100%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental full_backup_2/backup_manifest
WARNING:  still waiting for WAL summarization through 2/C128 after 10 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C128 after 20 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C128 after 30 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C128 after 40 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C128 after 50 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING:  still waiting for WAL summarization through 2/C128 after 60 seconds
DETAIL:  Summarization has reached 2/B3D8 on disk and 

Re: Apparent bug in WAL summarizer process (hung state)

2024-06-24 Thread Israel Barth Rubio
I'm attaching the files which I missed in the original email.

>
19:34:17.437626 epoll_wait(5, [], 1, 8161) = 0 <8.171542>
19:34:25.610176 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.000334>
19:34:25.611012 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = 
-1 ENOENT (No such file or directory) <0.000323>
19:34:25.611657 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL 
SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.000406>
19:34:25.612327 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:25.612"..., 
132) = 132 <0.000203>
19:34:25.612873 epoll_wait(5, [], 1, 1) = 0 <10.010950>
19:34:35.623942 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.12>
19:34:35.624120 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = 
-1 ENOENT (No such file or directory) <0.18>
19:34:35.624191 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL 
SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.07>
19:34:35.624237 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:35.624"..., 
132) = 132 <0.52>
19:34:35.624341 epoll_wait(5, [], 1, 1) = 0 <10.010941>
19:34:45.635408 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.16>
19:34:45.635602 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = 
-1 ENOENT (No such file or directory) <0.62>
19:34:45.635765 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL 
SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.09>
19:34:45.635830 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:45.635"..., 
132) = 132 <0.000495>
19:34:45.636390 epoll_wait(5, [], 1, 1) = 0 <10.008785>
19:34:55.645305 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.16>
19:34:55.645520 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = 
-1 ENOENT (No such file or directory) <0.32>
19:34:55.645645 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL 
SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.27>
19:34:55.646214 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:55.645"..., 
132) = 132 <0.000136>
19:34:55.646445 epoll_wait(5, [], 1, 1) = 0 <10.011731>
19:35:05.658366 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.18>
19:35:05.658591 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = 
-1 ENOENT (No such file or directory) <0.28>
19:35:05.658689 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL 
SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.10>
19:35:05.658750 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:35:05.658"..., 
132) = 132 <0.000521>
19:35:05.659335 epoll_wait(5,  
#0  0x7fb51c50e1da in epoll_wait (epfd=5, events=0x1409fd8, maxevents=1, 
timeout=timeout@entry=1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00922bad in WaitEventSetWaitBlock (nevents=1, 
occurred_events=0x7ffe5492aba0, cur_timeout=, set=0x1409f70) at 
storage/ipc/latch.c:1570
#2  WaitEventSetWait (set=0x1409f70, timeout=1, 
occurred_events=0x7ffe5492aba0, nevents=1, wait_event_info=) at 
storage/ipc/latch.c:1516
#3  0x00920e65 in WaitLatch (latch=, 
wakeEvents=, timeout=, wait_event_info=150994953) 
at storage/ipc/latch.c:538
#4  0x008aa7e4 in WalSummarizerMain (startup_data=, 
startup_data_len=) at postmaster/walsummarizer.c:317
#5  0x008a46a0 in postmaster_child_launch (client_sock=0x0, 
startup_data_len=0, startup_data=0x0, child_type=B_WAL_SUMMARIZER) at 
postmaster/launch_backend.c:265
#6  postmaster_child_launch (client_sock=0x0, startup_data_len=0, 
startup_data=0x0, child_type=B_WAL_SUMMARIZER) at 
postmaster/launch_backend.c:226
#7  StartChildProcess (type=type@entry=B_WAL_SUMMARIZER) at 
postmaster/postmaster.c:3928
#8  0x008a4dcf in MaybeStartWalSummarizer () at 
postmaster/postmaster.c:4075
#9  MaybeStartWalSummarizer () at postmaster/postmaster.c:4070
#10 0x008a7f8f in ServerLoop () at postmaster/postmaster.c:1746
#11 0x0089f5e0 in PostmasterMain (argc=3, argv=0x14097b0) at 
postmaster/postmaster.c:1372
#12 0x005381aa in main (argc=3, argv=0x14097b0) at main/main.c:197
No symbol "full" in current context.
#0  0x7fb51c50e1da in epoll_wait (epfd=5, events=0x1409fd8, maxevents=1, 
timeout=timeout@entry=1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
sc_ret = -4
sc_ret = 
#1  0x00922bad in WaitEventSetWaitBlock (nevents=1, 
occurred_events=0x7ffe5492aba0, cur_timeout=, set=0x1409f70) at 
storage/ipc/latch.c:1570
returned_events = 0
rc = 
cur_event = 
cur_epoll_event = 
returned_events = 
rc = 
cur_event = 
cur_epoll_event = 
__func__ = { }
__errno_location = 
#2  WaitEventSetWait (set=0x1409f70, timeout=1, 
occurred_events=0x7ffe5492aba0, nevents=1, wait_event_info=) at 
storage/ipc/latch.c:1516
rc = 
returned_events = 0
start_time = {ticks = 49310570173263}
cur_time = {ticks = }
cur_timeout = 
#3  0x00920

PostgreSQL does not compile on macOS SDK 15.0

2024-06-24 Thread Stan Hu
Xcode 16 beta was released on 2024-06-10 and ships with macOS SDK 15.0
[1]. It appears PostgreSQL does not compile due to typedef redefiniton
errors in the regex library:

/tmp/Xcode-beta.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang
-Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla
-Werror=unguarded-availability-new -Wendif-labels
-Wmissing-format-attribute -Wcast-function-type -Wformat-security
-fno-strict-aliasing -fwrapv -fexcess-precision=standard
-Wno-unused-command-line-argument -Wno-compound-token-split-by-macro
-Wno-cast-function-type-strict -O2 -I../../../src/include  -isysroot
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk
   -c -o regcomp.o regcomp.c
In file included from regcomp.c:2647:
In file included from ./regc_pg_locale.c:21:
In file included from ../../../src/include/utils/pg_locale.h:16:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h:45:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale/_regex.h:27:
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/_regex.h:107:24:
error: typedef redefinition with different types ('__darwin_off_t'
(aka 'long long') vs 'long')
  107 | typedef __darwin_off_t regoff_t;
  |^
../../../src/include/regex/regex.h:48:14: note: previous definition is here
   48 | typedef long regoff_t;
  |  ^
In file included from regcomp.c:2647:
In file included from ./regc_pg_locale.c:21:
In file included from ../../../src/include/utils/pg_locale.h:16:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h:45:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale/_regex.h:27:
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/_regex.h:114:3:
error: typedef redefinition with different types ('struct regex_t' vs
'struct regex_t')
  114 | } regex_t;
  |   ^
../../../src/include/regex/regex.h:82:3: note: previous definition is here
   82 | } regex_t;
  |   ^
In file included from regcomp.c:2647:
In file included from ./regc_pg_locale.c:21:
In file included from ../../../src/include/utils/pg_locale.h:16:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h:45:
In file included from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale/_regex.h:27:
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/_regex.h:119:3:
error: typedef redefinition with different types ('struct regmatch_t'
vs 'struct regmatch_t')
  119 | } regmatch_t;
  |   ^
../../../src/include/regex/regex.h:89:3: note: previous definition is here
   89 | } regmatch_t;
  |   ^
3 errors generated.
make[3]: *** [regcomp.o] Error 1
make[2]: *** [regex-recursive] Error 2
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2

I've reproduced this issue by:

1. Download the XCode 16 beta 2 ZIP file:
https://developer.apple.com/services-account/download?path=/Developer_Tools/Xcode_16_beta/Xcode_16_beta.xip
2. Extract this to `/tmp`.
3. Then I ran:

export PATH=/tmp/Xcode-beta.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin:$PATH
export SDKROOT=/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk
export XCODE_DIR=/tmp/Xcode-beta.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain
export CC="$XCODE_DIR/usr/bin/clang"
export CXX="$XCODE_DIR/usr/bin/clang++"

./configure CC="$CC" CXX="$CXX"
make

The compilation goes through if I comment out the "#include
" from
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h.
However, even on macOS SDK 14.5 I see that include statement. I'm
still trying to figure out what changed here.

[1] - https://developer.apple.com/macos/




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Robert Haas
On Mon, Jun 24, 2024 at 1:47 PM Peter Geoghegan  wrote:
> I agree, with the proviso that "avoid gratuitous failures" should
> include cases where a query that got the optimization suddenly fails
> to get the optimization, due only to some very innocuous looking
> change. Such as a change from using a constant 1_000_000_000 to a
> constant 5_000_000_000 in the query text. That is a POLA violation.

Nope, I don't agree with that at all. If you imagine that we can
either have the optimization apply to one of those cases but not the
other, or on the other hand we can have some cases that outright fail,
I think it's entirely clear that the former is better.

> Maybe it doesn't. My point was only that the B-Tree code doesn't
> necessarily need to use just one rhs type for the same column input
> opclass. The definition of SAOP works (or could work) in basically the
> same way, provided the "OR condition" were provably disjunct. We could
> for example mix different operators for the same nbtree scan key (with
> some work in nbtutils.c), just as we could support "where mycol = 5 OR
> mycol IS NULL" without much effort.
>
> BTW, did you know MySQL has long supported the latter? It has a <=>
> operator, which is basically a non-standard spelling of IS NOT
> DISTINCT FROM. Importantly, it is indexable, whereas right now
> Postgres doesn't support indexing IS NOT DISTINCT FROM. If you're
> interested in working on this problem within the scope of this patch,
> or some follow-up patch, I can take care of the nbtree side of things.

I was assuming this patch shouldn't be changing the way indexes work
at all, just making use of the facilities that we have today. More
could be done, but that might make it harder to get anything
committed.

Before we get too deep into arguing about hypotheticals, I don't think
there's any problem here that we can't solve with the infrastructure
we already have. For instance, consider this:

robert.haas=# explain select * from foo where a in (1, 1000);
QUERY PLAN
---
 Seq Scan on foo1 foo  (cost=0.00..25.88 rows=13 width=36)
   Filter: (a = ANY ('{1,1000}'::bigint[]))
(2 rows)

I don't know exactly what's happening here, but it seems very similar
to what we need to have happen for this patch to work. pg_typeof(1) is
integer, and pg_typeof(1000) is bigint, and we're able to
figure out that it's OK to put both of those in an array of a single
type and without having any type conversion failures. If you replace
1000 with 2, then the array ends up being of type
integer[] rather than type bigint[], so clearly the system is able to
reason its way through these kinds of scenarios already.

It's even possible, in my mind at least, that the patch is already
doing exactly the right things here. Even if it isn't, the problem
doesn't seem to be fundamental, because if this example can work (and
it does) then what the patch is trying to do should be workable, too.
We just have to make sure we're plugging all the pieces properly
together, and that we have comments that adequately explain what is
happening and test cases that verify it. My feeling is that the patch
doesn't meet that standard today, but I think that just means it needs
some more work. I'm not arguing we have to throw the whole thing out,
or invent a lot of new infrastructure, or anything like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: improve predefined roles documentation

2024-06-24 Thread Robert Haas
On Fri, Jun 21, 2024 at 11:40 AM Nathan Bossart
 wrote:
> Done.

If you look at how the varlistentries begin, there are three separate patterns:

* Some document a single role and start with "Allow doing blah blah blah".

* Some document a couple of roles so there are several paragraphs,
each beginning with "name_of_role

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Proposal: Document ABI Compatibility

2024-06-24 Thread Robert Haas
On Mon, Jun 17, 2024 at 6:38 PM David E. Wheeler  wrote:
> Is it? ISTM that there is the intention not to break things that don’t need 
> to be broken, though that doesn’t rule out interface improvements.

I suppose that it's true that we try to avoid gratuitous breakage, but
I feel like it would be weird to document that.

Sometimes I go to a store and I see a sign that says "shoplifters will
be prosecuted." But I have yet to see a store with a sign that says
"people who appear to be doing absolutely nothing wrong will not be
prosecuted." If I did see such a sign, I would frankly be a little
concerned.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 2:28 PM Robert Haas  wrote:
> On Mon, Jun 24, 2024 at 1:47 PM Peter Geoghegan  wrote:
> > I agree, with the proviso that "avoid gratuitous failures" should
> > include cases where a query that got the optimization suddenly fails
> > to get the optimization, due only to some very innocuous looking
> > change. Such as a change from using a constant 1_000_000_000 to a
> > constant 5_000_000_000 in the query text. That is a POLA violation.
>
> Nope, I don't agree with that at all. If you imagine that we can
> either have the optimization apply to one of those cases on the other,
> or on the other hand we can have some cases that outright fail, I
> think it's entirely clear that the former is better.

I'm just saying that not having the optimization apply to a query very
similar to one where it does apply is a POLA violation. That's another
kind of failure, for all practical purposes. Weird performance cliffs
like that are bad. It's very easy to imagine code that generates a
query text, that at some point randomly and mysteriously gets a
sequential scan. Or a much less efficient index scan.

> I was assuming this patch shouldn't be changing the way indexes work
> at all, just making use of the facilities that we have today. More
> could be done, but that might make it harder to get anything
> committed.

I was just pointing out that there is currently no good way to make
nbtree efficiently execute a qual "WHERE a = 5 OR a IS NULL", which is
almost entirely (though not quite entirely) due to a lack of any way
of expressing that idea through SQL, in a way that'll get pushed down
to the index scan node. You can write "WHERE a = any('{5,NULL}')", of
course, but that doesn't treat NULL as just another array element to
match against via an IS NULL qual (due to NULL semantics).
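
A quick illustration of that NULL-semantics point (standard behavior, shown
only for reference):

```sql
SELECT 5 = ANY ('{5,NULL}'::int[]);          -- true
SELECT NULL::int = ANY ('{5,NULL}'::int[]);  -- null, never true: "=" cannot
                                             -- match a NULL element
```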

Yeah, this is nonessential. But it's quite a nice optimization, and
seems entirely doable within the framework of the patch. It would be a
natural follow-up.

All that I'd need on the nbtree side is to get an input scan key that
directly embodies "WHERE mycol = 5 OR mycol IS NULL". That would
probably just be a scan key with sk_flags "SK_SEARCHARRAY |
SK_SEARCHNULL", that was otherwise identical to the current
SK_SEARCHARRAY scan keys.

Adopting the nbtree array index scan code to work with this would be
straightforward. SK_SEARCHNULL scan keys basically already work like
regular equality scan keys at execution time, so all that this
optimization requires on the nbtree side is teaching
_bt_advance_array_keys to treat NULL as a distinct array condition
(evaluated as IS NULL, not as = NULL).

> It's even possible, in my mind at least, that the patch is already
> doing exactly the right things here. Even if it isn't, the problem
> doesn't seem to be fundamental, because if this example can work (and
> it does) then what the patch is trying to do should be workable, too.
> We just have to make sure we're plugging all the pieces properly
> together, and that we have comments adequately explain what is
> happening and test cases that verify it. My feeling is that the patch
> doesn't meet that standard today, but I think that just means it needs
> some more work. I'm not arguing we have to throw the whole thing out,
> or invent a lot of new infrastructure, or anything like that.

I feel like my point about the potential for POLA violations is pretty
much just common sense. I'm not particular about the exact mechanism
by which we avoid it; only that we avoid it.


--
Peter Geoghegan




Re: strange context message in spi.c?

2024-06-24 Thread Daniel Gustafsson
> On 24 Jun 2024, at 11:14, Stepan Neretin  wrote:
> 
> Hi! Looks good to me! 

Thanks for review.  I have this on my TODO for when the tree branches, it
doesn't seem like anything worth squeezing in before then.

--
Daniel Gustafsson





Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions

2024-06-24 Thread Jeff Davis
On Wed, 2024-06-19 at 08:53 +0530, Ashutosh Sharma wrote:
> For standalone functions, users can easily adjust the search_path
> settings as needed. However, managing this becomes challenging for
> functions created via extensions. Extensions are relocatable, making
> it difficult to determine and apply search_path settings in advance
> within the extension_name--*.sql file when defining functions or
> procedures.

A special search_path variable "$extension_schema" would be a better
solution to this problem. We need something like that anyway, in case
the extension is relocated, so that we don't have to dig through the
catalog and update all the settings.
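
A minimal sketch of what that could look like ("$extension_schema" is a
proposal in this thread, not an existing feature; the function and schema
names are made up):

```sql
CREATE FUNCTION extschema.get_answer() RETURNS int
LANGUAGE sql
SECURITY DEFINER
-- Pin the search path at the function level; "$extension_schema" would
-- resolve to whatever schema the owning extension currently lives in.
SET search_path = pg_catalog, "$extension_schema", pg_temp
AS $$ SELECT 1 $$;
```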

>  Introducing a new control flag option allows users to set
> implicit search_path settings for functions created by the extension,
> if needed. The main aim here is to enhance security for functions
> created by extensions by setting search_path at the function level.
> This ensures precise control over how objects are accessed within
> each
> function, making behavior more predictable and minimizing security
> risks,

That leaves me wondering whether it's being addressed at the right
level.

For instance, did you consider just having a GUC to control the default
search_path for a function? That may be too magical, but if so, why
would an extension-only setting be better?

> especially for SECURITY DEFINER functions associated with
> extensions created by superusers.

I'm not sure that it's specific to SECURITY DEFINER functions.

Regards,
Jeff Davis





Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Melanie Plageman
On Mon, Jun 24, 2024 at 1:05 PM Robert Haas  wrote:
>
> On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan  wrote:
> > The problem here is that OldestXmin is supposed to be more
> > conservative than vistest, which it almost always is, except in this
> > one edge case. I don't think that plugging that hole changes the basic
> > fact that there is one source of truth about what *needs* to be
> > pruned. There is such a source of truth: OldestXmin.
>
> Well, another approach could be to make it so that OldestXmin actually
> is always more conservative than vistest rather than almost always.

For the purposes of pruning, we are effectively always using the more
conservative of the two with this patch.

Are you more concerned about having a single horizon for pruning or
about having a horizon that does not move backwards after being
established at the beginning of vacuuming the relation?

Right now, in master, we do use a single horizon when determining what
is pruned -- that from GlobalVisState. OldestXmin is only used for
freezing and full page visibility determinations. Using a different
horizon for pruning by vacuum than freezing is what is causing the
error on master.

> I agree with you that letting the pruning horizon move forward during
> vacuum is desirable. I'm just wondering if having the vacuum code need
> to know a second horizon is really the best way to address that.

I was thinking about this some more and I realized I don't really get
why we think using GlobalVisState for pruning will let us remove more
tuples in the common case.

I had always thought it was because the vacuuming backend's
GlobalVisState will get updated periodically throughout vacuum and so,
assuming the oldest running transaction changes, our horizon for
vacuum would change. But, in writing this repro, it is actually quite
hard to get GlobalVisState to update. Our backend's RecentXmin needs
to have changed. And there aren't very many places where we take a new
snapshot after starting to vacuum a relation. One of those is at the
end of index vacuuming, but that can only affect the pruning horizon
if we have to do multiple rounds of index vacuuming. Is that really
the case we are thinking of when we say we want the pruning horizon to
move forward during vacuum?

- Melanie




Re: PostgreSQL does not compile on macOS SDK 15.0

2024-06-24 Thread Stan Hu
It appears in macOS SDK 14.5, there were include guards in
$SDK_ROOT/usr/include/xlocale/_regex.h:

#ifndef _XLOCALE__REGEX_H_
#define _XLOCALE__REGEX_H_

#ifndef _REGEX_H_
#include <_regex.h>
#endif // _REGEX_H_
#include <_xlocale.h>

In macOS SDK 15.0, these include guards are gone:

#ifndef _XLOCALE__REGEX_H_
#define _XLOCALE__REGEX_H_

#include <_regex.h>
#include <__xlocale.h>

Since _REGEX_H_ was defined locally in PostgreSQL's version of
src/include/regex/regex.h, these include guards prevented duplicate
definitions from /usr/include/_regex.h (not to be confused with
/usr/include/xlocale/_regex.h).

If I hack the PostgreSQL src/include/regex/regex.h to include the double
underscore include guard of __REGEX_H_, the build succeeds:

```
diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index d08113724f..734172167a 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -1,3 +1,6 @@
+#ifndef __REGEX_H_
+#define __REGEX_H_ /* never again */
+
 #ifndef _REGEX_H_
 #define _REGEX_H_ /* never again */
 /*
@@ -187,3 +190,5 @@ extern bool RE_compile_and_execute(text *text_re, char
*dat, int dat_len,
int nmatch, regmatch_t *pmatch);

 #endif /* _REGEX_H_ */
+
+#endif /* __REGEX_H_ */
```

Any better ideas here?


Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
 wrote:
> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change.

I believe that it's more of an aspirational thing at this point. That
it is currently aspirational (it happens to some extent but isn't ever
particularly useful) shouldn't change the analysis about how to fix
this bug.

> One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

No, that's definitely not what we were thinking of. It's just an
accident that it's almost the only thing that'll do that.

-- 
Peter Geoghegan




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Melanie Plageman
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman
 wrote:
>
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".

One thing I don't understand is why it is okay to freeze the xmax of a
dead tuple just because it is from an aborted update.
heap_prepare_freeze_tuple() is called on HEAPTUPLE_RECENTLY_DEAD
tuples with normal xmaxes (non-multis) so that it can freeze tuples
from aborted updates. The only case in which we freeze dead tuples
with a non-multi xmax is if the xmax is from before OldestXmin and is
also not committed (so from an aborted update). Freezing dead tuples
replaces their xmax with InvalidTransactionId -- which would make them
look alive. So, it makes sense we don't do this for dead tuples in the
common case. But why is it 1) okay and 2) desirable to freeze xmaxes
of tuples from aborted updates? Won't it make them look alive again?

- Melanie




Re: RFC: Additional Directory for Extensions

2024-06-24 Thread David E. Wheeler
On Jun 24, 2024, at 1:53 PM, Robert Haas  wrote:

> Is "tighten up what the superuser can do" on our list of objectives?
> Personally, I think we should be focusing mostly, and maybe entirely,
> on letting non-superusers do more things, with appropriate security
> controls. The superuser can ultimately do anything, since they can
> cause shell commands to be run and load arbitrary code into the
> backend and write code in untrusted procedural languages and mutilate
> the system catalogs and lots of other terrible things.

I guess the question then is what security controls are appropriate for this 
feature, which after all tells the postmaster what directories to read files 
from. It feels a little outside the scope of a regular user to even be aware of 
the file system undergirding the service. But perhaps there’s a non-superuser 
role for whom it is appropriate?

> Now, I think there are environments where people have used things like
> containers to try to lock down the superuser, and while I'm not sure
> that can ever be particularly water-tight, if it were the case that
> this patch would make it a whole lot easier for a superuser to bypass
> the kinds of controls that people are imposing today, that might be an
> argument against this patch. But ... off-hand, I'm not seeing such an
> exposure.

Yeah I’m not even sure I follow. Containers are immutable, other than mutable 
mounted volumes --- which is one use case this patch is attempting to enable.

> On the patch itself, I find the documentation for this to be fairly
> hard to understand. I think it could benefit from an example. I'm
> confused about whether this is intended to let me search for
> extensions in /my/temp/root/usr/lib/postgresql/... by setting
> extension_directory=/my/temp/dir, or whether it's intended for me to
> search both /usr/lib/postgresql as I normally would and also
> /some/other/place.

I sketched them quickly, so I agree they can be better. Reading the code, I now 
see that it appears to be the former case. I’d like to advocate for the latter. 

> If the latter, I wonder why we don't handle shared
> libraries by setting dynamic_library_path and then just have an
> analogue of that for control files.

The challenge is that it applies not just to shared object libraries and 
control files, but also extension SQL files and any other SHAREDIR files an 
extension might include. But also, I think it should support all the pg_config 
installation targets that extensions might use, including:

BINDIR
DOCDIR
HTMLDIR
PKGINCLUDEDIR
LOCALEDIR
MANDIR

I can imagine an extension wanting or needing to use any and all of these.

Best,

David







Unusually long checkpoint time on version 16, and 17beta1 running in a docker container

2024-06-24 Thread Dave Cramer
Greetings,

While testing pgjdbc I noticed the following

pgdb-1  | Will execute command on database postgres:
pgdb-1  | SELECT pg_drop_replication_slot(slot_name) FROM
pg_replication_slots WHERE slot_name = 'replica_one';
pgdb-1  | DROP USER IF EXISTS replica_one;
pgdb-1  | CREATE USER replica_one WITH REPLICATION PASSWORD 'test';
pgdb-1  | SELECT * FROM
pg_create_physical_replication_slot('replica_one');
pgdb-1  |
pgdb-1  | NOTICE:  role "replica_one" does not exist, skipping
pgdb-1  |  pg_drop_replication_slot
pgdb-1  | --
pgdb-1  | (0 rows)
pgdb-1  |
pgdb-1  | DROP ROLE
pgdb-1  | CREATE ROLE
pgdb-1  |   slot_name  | lsn
pgdb-1  | -+-
pgdb-1  |  replica_one |
pgdb-1  | (1 row)
pgdb-1  |
pgdb-1  | waiting for checkpoint
pgdb-1  | 2024-06-24 19:07:18.569 UTC [66] LOG:  checkpoint starting: force
wait
pgdb-1  | 2024-06-24 19:11:48.008 UTC [66] LOG:  checkpoint complete: wrote
6431 buffers (39.3%); 0 WAL file(s) added, 0 removed, 3 recycled;
write=269.438 s, sync=0.001 s, total=269.439 s; sync files=0, longest=0.000
s, average=0.000 s; distance=44140 kB, estimate=44140 kB; lsn=0/4B8,
redo lsn=0/428


Note that it takes about 4 minutes 30 seconds to do the checkpoint. This seems
ridiculously long?

If I add a checkpoint before doing anything there is no delay

 Will execute command on database postgres:
pgdb-1  | checkpoint;
pgdb-1  | SELECT pg_drop_replication_slot(slot_name) FROM
pg_replication_slots WHERE slot_name = 'replica_one';
pgdb-1  | DROP USER IF EXISTS replica_one;
pgdb-1  | CREATE USER replica_one WITH REPLICATION PASSWORD 'test';
pgdb-1  | SELECT * FROM
pg_create_physical_replication_slot('replica_one');
pgdb-1  |
pgdb-1  | 2024-06-24 19:19:57.498 UTC [66] LOG:  checkpoint starting:
immediate force wait
pgdb-1  | 2024-06-24 19:19:57.558 UTC [66] LOG:  checkpoint complete: wrote
6431 buffers (39.3%); 0 WAL file(s) added, 0 removed, 2 recycled;
write=0.060 s, sync=0.001 s, total=0.061 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=29947 kB, estimate=29947 kB; lsn=0/3223BA0, redo
lsn=0/3223B48
===> pgdb-1  | CHECKPOINT
pgdb-1  |  pg_drop_replication_slot
pgdb-1  | --
pgdb-1  | (0 rows)
pgdb-1  |
pgdb-1  | DROP ROLE
pgdb-1  | NOTICE:  role "replica_one" does not exist, skipping
pgdb-1  | CREATE ROLE
pgdb-1  |   slot_name  | lsn
pgdb-1  | -+-
pgdb-1  |  replica_one |
pgdb-1  | (1 row)
pgdb-1  |
pgdb-1  | waiting for checkpoint
pgdb-1  | 2024-06-24 19:19:57.614 UTC [66] LOG:  checkpoint starting: force
wait
pgdb-1  | 2024-06-24 19:19:57.915 UTC [66] LOG:  checkpoint complete: wrote
4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.301
s, sync=0.001 s, total=0.302 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=14193 kB, estimate=28372 kB; lsn=0/480, redo
lsn=0/428

This starts in version 16, versions up to and including 15 do not impose
the wait.


Dave Cramer


Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread Joel Jacobson
On Mon, Jun 24, 2024, at 18:02, David G. Johnston wrote:
> On Mon, Jun 24, 2024 at 8:44 AM Nathan Bossart  
> wrote:
>> I think we could omit "i.e. privileges list."
>> 
>
> Agreed.  Between the docs and code we say "privileges list" once and 
> that refers to the dumputIls description of the arguments to grant.  As 
> the acronym page now defines the term using fundamentals, introducing 
> another term not used elsewhere seems undesirable.

New version attached.

> Observations:
> We are referencing a disambiguation page.  We never actually spell out 
> ACL anywhere so we might as well just reference what Wikipedia believes 
> is the expected spelling.
>
> The page we link to uses "permissions" while we consistently use 
> "privileges" to describe the contents of the list.  This seems like an 
> obvious synonym, but as the point of these is to formally define 
> things, pointing this equivalence is worth considering.

I like this idea. How could this be implemented in the docs? Maybe a 
... for ACL in acronyms.sgml?

/Joel

v2-0001-Add-ACL-Access-Control-List-acronym.patch
Description: Binary data


Re: Proposal: Document ABI Compatibility

2024-06-24 Thread David E. Wheeler
On Jun 19, 2024, at 05:42, Peter Eisentraut  wrote:

>>> https://postgr.es/m/CAH2-Wzm-W6hSn71sUkz0Rem=qdeu7tnfmc7_jg2djrlfef_...@mail.gmail.com
>>> 
>>> Theoretically anybody can do this themselves. In practice they don't.
>>> So something as simple as providing automated reports about ABI
>>> changes might well move the needle here.
>> What would be required to make such a thing? Maybe it’d make a good Summer 
>> of Code project.
> 
> The above thread contains a lengthy discussion about what one could do.

I somehow missed that GSoC 2024 is already going with contributors. Making a 
mental note to add an item for 2025.

D





Re: Proposal: Document ABI Compatibility

2024-06-24 Thread David E . Wheeler
On Jun 24, 2024, at 14:51, Robert Haas  wrote:

> I suppose that it's true that we try to avoid gratuitous breakage, but
> I feel like it would be weird to document that.

I see how that can seem weird to a committer deeply familiar with the 
development process and how things happen. But people outside the -hackers 
bubble have very little idea. It’s fair to say it needn’t be a long statement 
for major versions: a single sentence such as “we try to avoid gratuitous 
breakage” is a perfectly reasonable framing. But I’d say, in the interest of 
completeness, it would be useful to document the policy for major releases *as 
well as* minor releases.

Best,

David





Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread David G. Johnston
On Mon, Jun 24, 2024 at 12:46 PM Joel Jacobson  wrote:

> On Mon, Jun 24, 2024, at 18:02, David G. Johnston wrote:
>
> > The page we link to uses "permissions" while we consistently use
> > "privileges" to describe the contents of the list.  This seems like an
> > obvious synonym, but as the point of these is to formally define
> > things, pointing this equivalence is worth considering.
>
> I like this idea. How could this be implemented in the docs? Maybe a
> ... for ACL in acronyms.sgml?
>
>
Add a second <para> under the one holding the link?

David J.


Re: Proposal: Document ABI Compatibility

2024-06-24 Thread David E. Wheeler
On Jun 19, 2024, at 05:41, Peter Eisentraut  wrote:

> This is probably a bit confusing.  This might as well mean client application 
> code against libpq.  Better something like "server plugin code that uses the 
> PostgreSQL server APIs".

That works.

> But now we're talking about API.  That might be subject of another document 
> or another section in this one, but it seems confusing to mix this with the 
> ABI discussion.

Hrm. They’re super closely-related in my mind, as an extension developer. I 
need to know both! I guess I’m thinking of this policy as what I can expect may 
be changed (and how to adapt to it) and what won’t.

That said, I’m fine to remove the API stuff if there’s consensus objecting to 
it, to be defined in a separate policy (perhaps on the same doc page).

>> PostgreSQL avoids unnecessary API changes in major releases, but usually 
>> ships a few necessary API changes, including deprecation, renaming, and 
>> argument variation.
> 
> Obviously, as a practical matter, there won't be random pointless changes.  
> But I wouldn't go as far as writing anything down about how these APIs are 
> developed.

Fair enough, was trying to give some idea of the sorts of changes. Don’t have 
to include them.

>> In such cases the incompatible changes will be listed in the Release Notes.
> 
> I don't think anyone is signing up to do that.

It needn’t be comprehensive. Just mention that an ABI or API changed in the 
release note item. Unless they almost *all* make such changes.

>> Minor Releases
>> --

> I think one major problem besides actively avoiding or managing such 
> minor-version ABI breaks is some automated detection.  Otherwise, this just 
> means "we try, but who knows”.

I think you *do* try, and the fact that there are so few issues means you 
succeed at that. I’m not advocating for an ABI guarantee here, just a 
description of the policy committees already follow.

Here’s an update based on all the feedback, framing things more from the 
perspective of “do I need to recompile this or change my code”. Many thanks!

``` md
ABI Policy
==

Changes to the PostgreSQL server APIs may require recompilation of server 
plugin code that uses them. This policy describes the core team's approach to 
such changes, and what server API users can expect.

Major Releases
--

Applications that use server APIs must be compiled for each major release 
supported by the application. The inclusion of `PG_MODULE_MAGIC` ensures that 
code compiled for one major version will be rejected by other major versions. 
Developers needing to support multiple versions of PostgreSQL with incompatible 
APIs should use the `PG_VERSION_NUM` constant to adjust code as appropriate. 
For example:

``` c
#if PG_VERSION_NUM >= 160000
#include "varatt.h"
#endif
```

The core team avoids unnecessary breakage, but users of the server APIs should 
expect and be prepared to make adjustments and recompile for every major 
release.

Minor Releases
--

PostgreSQL makes an effort to avoid server API and ABI breaks in minor 
releases. In general, an application compiled against any minor release will 
work with any other minor release, past or future. In the absence of automated 
detection of such changes, this is not a guarantee, but historically such breaking 
changes have been extremely rare.

When a change *is* required, PostgreSQL will choose the least invasive change 
possible, for example by squeezing a new field into padding space or appending 
it to the end of a struct. This sort of change should not impact dependent 
applications unless they use `sizeof(the struct)` on a struct with an appended 
field, or create their own instances of such structs --- patterns best avoided.

In rare cases, however, even such non-invasive changes may be impractical or 
impossible. In such an event, the change will be documented in the Release 
Notes, and details on the issue will also be posted to [TBD; mail list? Blog 
post? News item?].

To minimize issues and catch changes early, the project strongly recommends 
that developers adopt continuous integration testing at least for the latest 
minor release of all major versions of Postgres they support.
```
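
To make the `sizeof`/own-instances caveat in the minor-release section concrete, here is a minimal, self-contained C sketch. `ExampleNode` and `MINOR_RELEASE_N_PLUS_1` are made-up names for illustration, not real PostgreSQL symbols:

``` c
/*
 * ExampleNode is hypothetical -- not a real PostgreSQL struct.  The point:
 * if a later minor release appends added_flag, code that only works with
 * pointers handed to it by the server never notices, but code that was
 * compiled against the old definition and does sizeof(ExampleNode) or
 * allocates its own instances still uses the smaller, old layout.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct ExampleNode
{
    int     id;
    double  cost;
#ifdef MINOR_RELEASE_N_PLUS_1       /* imagine this appears in a minor update */
    int     added_flag;
#endif
} ExampleNode;

int
main(void)
{
    /* Safe pattern: read known fields of instances the server hands you. */
    ExampleNode *node = calloc(1, sizeof(ExampleNode));

    node->id = 1;
    node->cost = 2.5;
    printf("compiled-in sizeof(ExampleNode) = %zu\n", sizeof(ExampleNode));

    /*
     * Risky pattern: if this allocation happens in an extension built before
     * added_flag existed, newer server code touching added_flag on this
     * instance would read or write past the end of the allocation.
     */
    free(node);
    return 0;
}
```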


Best,

David





Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Robert Haas
On Mon, Jun 24, 2024 at 3:37 PM David E. Wheeler  wrote:
> I guess the question then is what security controls are appropriate for this 
> feature, which after all tells the postmaster what directories to read files 
> from. It feels a little outside the scope of a regular user to even be aware 
> of the file system undergirding the service. But perhaps there’s a 
> non-superuser role for whom it is appropriate?

As long as the GUC is superuser-only, I'm not sure what else there is
to do here. The only question is whether there's some reason to
disallow this even from the superuser, but I'm not quite seeing such a
reason.

> > On the patch itself, I find the documentation for this to be fairly
> > hard to understand. I think it could benefit from an example. I'm
> > confused about whether this is intended to let me search for
> > extensions in /my/temp/root/usr/lib/postgresql/... by setting
> > extension_directory=/my/temp/dir, or whether it's intended me to
> > search both /usr/lib/postgresql as I normally would and also
> > /some/other/place.
>
> I sketched them quickly, so agree they can be better. Reading the code, I now 
> see that it appears to be the former case. I’d like to advocate for the 
> latter.

Sounds good.

> > If the latter, I wonder why we don't handle shared
> > libraries by setting dynamic_library_path and then just have an
> > analogue of that for control files.
>
> The challenge is that it applies not just to shared object libraries and 
> control files, but also extension SQL files and any other SHAREDIR files an 
> extension might include. But also, I think it should support all the 
> pg_config installation targets that extensions might use, including:
>
> BINDIR
> DOCDIR
> HTMLDIR
> PKGINCLUDEDIR
> LOCALEDIR
> MANDIR
>
> I can imagine an extension wanting or needing to use any and all of these.

Are these really all relevant to backend code?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Direct SSL connection and ALPN loose ends

2024-06-24 Thread Heikki Linnakangas

On 21/06/2024 02:32, Jacob Champion wrote:

On Thu, Jun 20, 2024 at 4:13 PM Heikki Linnakangas  wrote:

By "negotiation" I mean the server's response to the startup packet.
I.e. "supported"/"not supported"/"error".


Ok, I'm still a little confused, probably a terminology issue. The
server doesn't respond with "supported" or "not supported" to the
startup packet; that happens earlier. I think you mean the SSLRequest /
GSSRequest packet, which is sent *before* the startup packet?


Yes, sorry. (I'm used to referring to those as startup packets too, ha.)


Yeah I'm not sure what the right term would be.


Hmm, right, GSS encryption was introduced in v12, and older versions
respond with an error to a GSSRequest.

We probably could make the same assumption for GSS as we did for TLS in
a49fbaaf, i.e. that an error means that something's wrong with the
server, rather than that it's just very old and doesn't support GSS. But
the case for that is a lot weaker than with TLS. There are still
pre-v12 servers out there in the wild.


Right. Since we default to gssencmode=prefer, if you have Kerberos
creds in your environment, I think this could potentially break
existing software that connects to v11 servers once you upgrade libpq.


When you connect to a V11 server and attempt to perform GSSAPI 
authentication, it will respond with a V3 error that says: "unsupported 
frontend protocol 1234.5680: server supports 2.0 to 3.0". That was a 
surprise to me until I tested it just now. I thought that it would 
respond with a protocol V2 error, but it is not so. The backend sets 
FrontendProtocol to 1234.5680 before sending the error, and because it 
is >= 3, the error is sent with protocol version 3.
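
To illustrate that arithmetic, here is a tiny self-contained sketch; the two
macros are assumed to mirror the definitions in src/include/libpq/pqcomm.h,
and this is not the actual backend code path:

#include <stdio.h>
#include <stdint.h>

/* Assumed to mirror src/include/libpq/pqcomm.h */
#define PG_PROTOCOL(m,n)     (((m) << 16) | (n))
#define PG_PROTOCOL_MAJOR(v) ((v) >> 16)

int
main(void)
{
    uint32_t gss_request = PG_PROTOCOL(1234, 5680); /* GSSENCRequest code */

    /* major is 1234, which is >= 3, so the error goes out in V3 format */
    printf("major = %u -> %s error format\n",
           (unsigned) PG_PROTOCOL_MAJOR(gss_request),
           PG_PROTOCOL_MAJOR(gss_request) >= 3 ? "V3" : "V2");
    return 0;
}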


Given that, I think it is a good thing to fail the connection completely 
on receiving a V2 error.


Attached is a patch to fix the other issue, with falling back from SSL 
to plaintext. And some tests and comment fixes I spotted while at it.


0001: A small comment fix
0002: This is the main patch that fixes the SSL fallback issue

0003: This adds fault injection tests to exercise these early error 
codepaths. It is not ready to be merged, as it contains a hack to skip 
locking. See thread at 
https://www.postgresql.org/message-id/e1ffb822-054e-4006-ac06-50532767f75b%40iki.fi.


0004: More tests, for what happens if the server sends an error after 
responding "yes" to the SSLRequest or GSSRequest, but before performing 
the SSL/GSS handshake.


Attached is also a little stand-alone perl program that listens on a 
socket, and when you connect to it, it immediately sends a V2 or V3 
error, depending on the argument. That's useful for testing. It could be 
used as an alternative strategy to the injection points I used in the 
0003-0004 patches, but for now I just used it for manual testing.


--
Heikki Linnakangas
Neon (https://neon.tech)
From 4b988b1c74066fe8b5f98acb4ecd20150a47bc33 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 24 Jun 2024 20:41:41 +0300
Subject: [PATCH 1/4] Fix outdated comment after removal of direct SSL fallback

The option to fall back from direct SSL to negotiated SSL or a
plaintext connection was removed in commit fb5718f35f.
---
 src/interfaces/libpq/fe-connect.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 071b1b34aa1..e003279fb6c 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3564,8 +3564,8 @@ keep_going:		/* We will come back to here until there is
 if (pollres == PGRES_POLLING_FAILED)
 {
 	/*
-	 * Failed direct ssl connection, possibly try a new
-	 * connection with postgres negotiation
+	 * SSL handshake failed.  We will retry with a plaintext
+	 * connection, if permitted by sslmode.
 	 */
 	CONNECTION_FAILED();
 }
-- 
2.39.2

From 7df581e902b76665de4c751ddbc1309143e6c0b8 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 24 Jun 2024 18:14:52 +0300
Subject: [PATCH 2/4] Fix fallback behavior when server sends an ERROR early at
 startup

With sslmode=prefer, the desired behavior is to completely fail the
connection attempt, *not* fall back to a plaintext connection, if the
server responds to the SSLRequest with an error ('E') response instead
of rejecting SSL with an 'N' response. This was broken in commit
05fd30c0e7.

Discussion: https://www.postgresql.org/message-id/CAOYmi%2Bnwvu21mJ4DYKUa98HdfM_KZJi7B1MhyXtnsyOO-PB6Ww%40mail.gmail.com
---
 src/interfaces/libpq/fe-connect.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index e003279fb6c..360d9a45476 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3521,6 +3521,12 @@ keep_going:		/* We will come back to here until there is
 		 * byte here.
 		 */
 		conn->status = CONNECTION_AWAIT

Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Nikolay Shaplov
Hi!

Let me join the review process.

I am no expert in execution plans, so I would not be much help in doing
even better optimization. But I can read the code, as a person who is not
familiar with this area, and help make it clear even to a person like me.

So, I am reading v25-0001-Transform-OR-clauses-to-ANY-expression.patch that 
have been posted some time ago, and especially transform_or_to_any function.

> @@ -38,7 +45,6 @@
>  int  from_collapse_limit;
>  int  join_collapse_limit;
>
> - 
>  /*
>   * deconstruct_jointree requires multiple passes over the join tree,  
> because we 
>   * need to finish computing JoinDomains before we start  distributing quals.

I do not think that removing the empty line should be part of the patch.

> + /*
> +  * If the const node's (right side of operator 
> expression) type
> +  * don't have “true” array type, then we cannnot do 
> the
> +  * transformation. We simply concatenate the expression 
> node.
> +  */
Guess using unicode double quotes is not the best idea here... 
   
Now to the first part of  `transform_or_to_any`


signature.asc
Description: This is a digitally signed message part.


Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Robert Haas
On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman
 wrote:
> Are you more concerned about having a single horizon for pruning or
> about having a horizon that does not move backwards after being
> established at the beginning of vacuuming the relation?

I'm not sure I understand. The most important thing here is fixing the
bug. But if we have a choice of how to fix the bug, I'd prefer to do
it by having the pruning code test one horizon that is always correct,
rather than (as I think the patch does) having it test against two
horizons as a way of covering possible discrepancies between
those values.

> Right now, in master, we do use a single horizon when determining what
> is pruned -- that from GlobalVisState. OldestXmin is only used for
> freezing and full page visibility determinations. Using a different
> horizon for pruning by vacuum than freezing is what is causing the
> error on master.

Agreed.

> I had always thought it was because the vacuuming backend's
> GlobalVisState will get updated periodically throughout vacuum and so,
> assuming the oldest running transaction changes, our horizon for
> vacuum would change. But, in writing this repro, it is actually quite
> hard to get GlobalVisState to update. Our backend's RecentXmin needs
> to have changed. And there aren't very many places where we take a new
> snapshot after starting to vacuum a relation. One of those is at the
> end of index vacuuming, but that can only affect the pruning horizon
> if we have to do multiple rounds of index vacuuming. Is that really
> the case we are thinking of when we say we want the pruning horizon to
> move forward during vacuum?

I thought the idea was that the GlobalVisTest stuff would force a
recalculation now and then, but maybe it doesn't actually do that?

Suppose process A begins a transaction, acquires an XID, and then goes
idle. Process B now begins a giant vacuum. At some point in the middle
of the vacuum, A ends the transaction. Are you saying that B's
GlobalVisTest never really notices that this has happened?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: RFC: Additional Directory for Extensions

2024-06-24 Thread David E. Wheeler
On Jun 24, 2024, at 4:28 PM, Robert Haas  wrote:

> As long as the GUC is superuser-only, I'm not sure what else there is
> to do here. The only question is whether there's some reason to
> disallow this even from the superuser, but I'm not quite seeing such a
> reason.

I can switch it back from requiring a restart to allowing a superuser to set it.

>> I sketched them quickly, so agree they can be better. Reading the code, I 
>> now see that it appears to be the former case. I’d like to advocate for the 
>> latter.
> 
> Sounds good.

Yeah, though then I have a harder time deciding how it should work. pg_config’s 
paths are absolute. With your first example, we just use them exactly as they 
are, but prefix them with the destination directory. So if it’s set to 
`/my/temp/root/`, then files go into

/my/temp/root/$(pg_config --sharedir)
/my/temp/root/$(pg_config --pkglibdir)
/my/temp/root/$(pg_config --bindir)
# etc.

Which is exactly how RPM and Apt packages are built, but seems like an odd 
configuration for general use.

>> BINDIR
>> DOCDIR
>> HTMLDIR
>> PKGINCLUDEDIR
>> LOCALEDIR
>> MANDIR
>> 
>> I can imagine an extension wanting or needing to use any and all of these.
> 
> Are these really all relevant to backend code?

Oh I think so. Especially BINDIR; lots of extensions ship with binary 
applications. And most ship with docs, too (PGXS puts items listed in DOCS into 
DOCDIR). Some might also produce man pages (for their binaries), HTML docs, and 
other stuff. Maybe an FTS extension would include locale files?

I find it pretty easy to imagine use cases for all of them. So much so that I 
wrote an extension binary distribution RFC[1] and its POC[2] around them.

Best,

David

[1]: https://github.com/orgs/pgxn/discussions/2
[2]: https://justatheory.com/2024/06/trunk-poc/





Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
 wrote:
> One thing I don't understand is why it is okay to freeze the xmax of a
> dead tuple just because it is from an aborted update.

We don't do that with XID-based xmaxs. Though perhaps we should, since
we'll already prune-away the successor tuple, and so might as well go
one tiny step further and clear the xmax for the original tuple via
freezing/setting it InvalidTransactionId. Instead we just leave the
original tuple largely undisturbed, with its original xmax.

We do something like that with Multi-based xmax fields, though not
with the specific goal of cleaning up after aborts in mind (we can
also remove lockers that are no longer running, regardless of where
they are relative to OldestXmin, stuff like that). The actual goal
with that is to enforce MultiXactCutoff, independently of whether or
not their member XIDs are < FreezeLimit yet.

> The only case in which we freeze dead tuples
> with a non-multi xmax is if the xmax is from before OldestXmin and is
> also not committed (so from an aborted update).

Perhaps I misunderstand, but: we simply don't freeze DEAD (not
RECENTLY_DEAD) tuples in the first place, because we don't have to
(pruning removes them instead). It doesn't matter if they're DEAD due
to being from aborted transactions or DEAD due to being
deleted/updated by a transaction that committed (committed and <
OldestXmin).

The freezing related code paths in heapam.c don't particularly care
whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin.
Just as long as it's not fully DEAD (then it should have been pruned).

--
Peter Geoghegan




Re: Partial aggregates pushdown

2024-06-24 Thread Jelte Fennema-Nio
On Mon, 24 Jun 2024 at 15:03, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> I see. I maybe got your proposal.
> Refer to your proposal, for avg(int8),
> I create a new native type like state_int8_avg
> with the new typsend/typreceive functions
> and use them to transmit the state value, right?

Yes, that's roughly what I had in mind indeed.

> That might seem to be a more fundamental solution
> because I can avoid adding export/import functions of my proposal,
> which are the new components of aggregate function.
> I have never considered the solution.
> I appreciate your proposal.

Thank you :)

> However, I still have the following two questions.
>
> 1. Not necessary components of new native types
> Refer to pg_type.dat, typinput and typoutput are required.
> I think that in your proposal they are not necessary,
> so waste. I think that it is not acceptable.
> How can I resolve the problem?

I think requiring typinput/typoutput is a benefit personally, because
that makes it possible to do PARTIAL_AGGREGATE pushdown to a different
PG major version. Also it makes it easier to debug the partial
aggregate values when using psql/pg_regress. So yes, it requires
implementing both binary (send/receive) and text (input/output)
serialization, but it also has some benefits. And in theory you might
be able to skip implementing the binary serialization, and rely purely
on the text serialization to send partial aggregates between servers.

> 2. Many new native types
> I think that, basically, each aggregate function does need a new native type.
> For example,
> avg(int8), avg(numeric), and var_pop(int4) has the same transtype, 
> PolyNumAggState.
> You said that it is enough to add only one native type like state_poly_num_agg
> for supporting them, right?

Yes, correct. That's what I had in mind.

> But the combine functions of them might have quite different expectation
> on the data items of PolyNumAggState like
> the range of N(means count) and the true/false of calcSumX2
> (the flag of calculating sum of squares).
> The final functions of them have the similar expectation.
> So, I think that, responded to your proposal,
> each of them need a native data type
> like state_int8_avg, state_numeric_avg, for safety.
>
> And, we do need a native type for an aggregate function
> whose transtype is not internal and not pseudo.
> For avg(int4), the current transtype is _int8.
> However, I do need a validation check on the number of the array
> And the positiveness of count(the first element of the array).
> Responded to your proposal,
> I do need a new native type like state_int4_avg.

To help avoid confusion let me try to restate what I think you mean
here: You're worried about someone passing in a bogus native type into
the final/combine functions and then getting crashes and/or wrong
results. With internal type people cannot do this because they cannot
manually call the combinefunc/finalfunc because the argument type is
internal. To solve this problem your suggestion is to make the type
specific to the specific aggregate such that send/receive or
input/output can validate the input as reasonable. But this would then
mean that we have many native types (and also many
deserialize/serialize functions).

Assuming that's indeed what you meant, that's an interesting thought,
I didn't consider that much indeed. My thinking was that we only need
to implement send/receive & input/output functions for these types,
and even though their meaning is very different we can serialize them
in the same way.

As you say though, something like that is already true for avg(int4)
today. The way avg(int4) handles this issue is by doing some input
validation for every call to its trans/final/combinefunc (see
int4_avg_accum, int4_avg_combine, and int8_avg). It checks the length
of the array there, but it doesn't check the positiveness of the
count. I think that makes sense. IMHO these functions only need to
protect against crashes (e.g. null pointer dereferences). But I don't
think there is a good reason for them to protect the user against
passing in weird data. These functions aren't really meant to be
called manually in the first place anyway, so if the user does that
and they pass in weird data then I'm fine with them getting a weird
result back, even errors are fine (only crashes are not).

So as long as our input/output & send/receive functions for
state_poly_num_agg handle all the inconsistencies that could cause
crashes later on (which I think is pretty simple to do for
PolyNumAggState), then I don't think we need state_int8_avg,
state_numeric_avg, etc.

> > Not a full review but some initial notes:
> Thank you. I don't have time today, so I'll answer after tomorrow.

Sure, no rush.




Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread Joel Jacobson
On Mon, Jun 24, 2024, at 21:51, David G. Johnston wrote:
> On Mon, Jun 24, 2024 at 12:46 PM Joel Jacobson  wrote:
>> On Mon, Jun 24, 2024, at 18:02, David G. Johnston wrote:
>> 
>> > The page we link to uses "permissions" while we consistently use 
>> > "privileges" to describe the contents of the list.  This seems like an 
>> > obvious synonym, but as the point of these is to formally define 
>> > things, pointing this equivalence is worth considering.
>> 
>> I like this idea. How could this be implemented in the docs? Maybe a 
>> ... for ACL in acronyms.sgml?
>> 
>
> Add a second <para> under the one holding the link?

How about?

+ <para>
+  The linked page uses "permissions" while we consistently use the synonym
+  "privileges", to describe the contents of the list. For avoidance of
+  doubt and clarity, these two terms are equivalent in the
+  PostgreSQL documentation.
+ </para>

/Joel

v3-0001-Add-ACL-Access-Control-List-acronym.patch
Description: Binary data


Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Melanie Plageman
On Mon, Jun 24, 2024 at 4:42 PM Peter Geoghegan  wrote:
>
> On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman
>  wrote:
> > One thing I don't understand is why it is okay to freeze the xmax of a
> > dead tuple just because it is from an aborted update.
>
> We don't do that with XID-based xmaxs. Though perhaps we should, since
> we'll already prune-away the successor tuple, and so might as well go
> one tiny step further and clear the xmax for the original tuple via
> freezing/setting it InvalidTransactionId. Instead we just leave the
> original tuple largely undisturbed, with its original xmax.

I thought that was the case too, but we call
heap_prepare_freeze_tuple() on HEAPTUPLE_RECENTLY_DEAD tuples and then

else if (TransactionIdIsNormal(xid))
{
/* Raw xmax is normal XID */
freeze_xmax = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
}

And then later we

if (freeze_xmax)
frz->xmax = InvalidTransactionId;

and then when we execute freezing the tuple in heap_execute_freeze_tuple()

HeapTupleHeaderSetXmax(tuple, frz->xmax);

Which sets the xmax to InvalidTransactionId. Or am I missing something?

> > The only case in which we freeze dead tuples
> > with a non-multi xmax is if the xmax is from before OldestXmin and is
> > also not committed (so from an aborted update).
>
> Perhaps I misunderstand, but: we simply don't freeze DEAD (not
> RECENTLY_DEAD) tuples in the first place, because we don't have to
> (pruning removes them instead). It doesn't matter if they're DEAD due
> to being from aborted transactions or DEAD due to being
> deleted/updated by a transaction that committed (committed and <
> OldestXmin).

Right, I'm talking about HEAPTUPLE_RECENTLY_DEAD tuples.
HEAPTUPLE_DEAD tuples are pruned away. But we can't replace the xmax
of a tuple that has been deleted or updated by a transaction that
committed with InvalidTransactionId. And it seems like the code does
that? Why even call heap_prepare_freeze_tuple() on
HEAPTUPLE_RECENTLY_DEAD tuples? Is it mainly to handle MultiXact
freezing?

> The freezing related code paths in heapam.c don't particularly care
> whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin.
> Just as long as it's not fully DEAD (then it should have been pruned).

But it just seems like we shouldn't freeze RECENTLY_DEAD either.

- Melanie




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Peter Geoghegan
On Mon, Jun 24, 2024 at 4:36 PM Robert Haas  wrote:
> I'm not sure I understand. The most important thing here is fixing the
> bug. But if we have a choice of how to fix the bug, I'd prefer to do
> it by having the pruning code test one horizon that is always correct,
> rather than (as I think the patch does) having it test against two
> horizons because as a way of covering possible discrepancies between
> those values.

Your characterizing of OldestXmin + vistest as two horizons seems
pretty arbitrary to me. I know what you mean, of course, but it seems
like a distinction without a difference.

> I thought the idea was that the GlobalVisTest stuff would force a
> recalculation now and then, but maybe it doesn't actually do that?

It definitely can do that. Just not in a way that meaningfully
increases the number of heap tuples that we can recognize as DEAD and
remove. At least not currently.

> Suppose process A begins a transaction, acquires an XID, and then goes
> idle. Process B now begins a giant vacuum. At some point in the middle
> of the vacuum, A ends the transaction. Are you saying that B's
> GlobalVisTest never really notices that this has happened?

That's my understanding, yes. That is, vistest is approximately the
same thing as OldestXmin anyway. At least for now.

-- 
Peter Geoghegan




Re: POC, WIP: OR-clause support for indexes

2024-06-24 Thread Nikolay Shaplov
Hi!

Let me join the review process.

I am no expert in execution plans, so I would not be much help in doing
even better optimization. But I can read the code, as a person who is not
familiar with this area, and help make it clear even to a person like me.

So, I am reading v25-0001-Transform-OR-clauses-to-ANY-expression.patch that 
have been posted some time ago, and especially transform_or_to_any function.

> @@ -38,7 +45,6 @@
>  int  from_collapse_limit;
>  int  join_collapse_limit;
>
> - 
>  /*
>   * deconstruct_jointree requires multiple passes over the join tree,  
because we 
>   * need to finish computing JoinDomains before we start  distributing quals.

I do not think that removing the empty line should be part of the patch.

> + /*
> +  * If the const node's (right side of operator 
expression) type
> +  * don't have “true” array type, then we cannnot 
do the
> +  * transformation. We simply concatenate the 
expression node.
> +  */
Guess using unicode double quotes is not the best idea here... 
   
Now to the first part of  `transform_or_to_any`:

> + List   *entries = NIL;
I guess the idea of entries should be explained from the start: what kind of
entries are accumulated there... I see they are added all around the code, but
the purpose of that is not quite clear when you read it.

In the first part of the `transform_or_to_any` function, you constantly repeat
two lines, like a mantra:

> + entries = lappend(entries, rinfo);
> + continue;

"If something is wrong -- do that mantra"

From my perspective, if you have to repeat the same code again and again, then
most probably there are some issues with the architecture of the code. In that
situation you should try to rewrite the code so that the repeated part appears
only once.

In that case I would try to move most of the first loop of
`transform_or_to_any` into a separate function (let's say its name is
prepare_single_or) that does all the checks on whether this OR is good for us:
it returns NULL if the clause does not suit our purposes (and in this case we do
"entries = lappend(entries, rinfo); continue" in the main code, but only once),
or a pointer to some useful data if the OR clause does suit them.

This, I guess, will make that part clearer and easier to read, without
repeating the same "lappend mantra" again and again.

Will continue digging into the code tomorrow.

P.S. Sorry for sending partly finished email. Pressed Ctrl+Enter 
accidentally... With no way to make it back :-((( 

signature.asc
Description: This is a digitally signed message part.


Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-06-24 Thread Melanie Plageman
On Mon, Jun 24, 2024 at 4:51 PM Peter Geoghegan  wrote:
>
> On Mon, Jun 24, 2024 at 4:36 PM Robert Haas  wrote:
> > I thought the idea was that the GlobalVisTest stuff would force a
> > recalculation now and then, but maybe it doesn't actually do that?
>
> It definitely can do that. Just not in a way that meaningfully
> increases the number of heap tuples that we can recognize as DEAD and
> remove. At least not currently.
>
> > Suppose process A begins a transaction, acquires an XID, and then goes
> > idle. Process B now begins a giant vacuum. At some point in the middle
> > of the vacuum, A ends the transaction. Are you saying that B's
> > GlobalVisTest never really notices that this has happened?
>
> That's my understanding, yes. That is, vistest is approximately the
> same thing as OldestXmin anyway. At least for now.

Exactly. Something has to cause this backend to update its view of the
horizon. At the end of index vacuuming,
GetOldestNonRemovableTransactionId() will explicitly
ComputeXidHorizons() which will update our backend's GlobalVisStates.
Otherwise, if our backend's RecentXmin is updated, by taking a new
snapshot, then we may update our GlobalVisStates. See
GlobalVisTestShouldUpdate() for the conditions under which we would
update our GlobalVisStates during the normal visibility checks
happening during pruning.

Vacuum used to open indexes after calculating horizons before starting
its first pass. This led to a recomputation of the horizon. But, in
master, there aren't many obvious places where such a thing would be
happening.

- Melanie




Re: [PATCH] Add ACL (Access Control List) acronym

2024-06-24 Thread David G. Johnston
On Mon, Jun 24, 2024 at 1:49 PM Joel Jacobson  wrote:

> How about?
>
> + 
> +  The linked page uses "permissions" while we consistently use the
> synonym
> +  "privileges", to describe the contents of the list. For avoidance of
> +  doubt and clarity, these two terms are equivalent in the
> +  PostgreSQL documentation.
> + 
>
> /Joel


I really dislike "For avoidance of doubt and clarity" - and in terms of
being equivalent the following seems like a more accurate description of
reality.

The PostgreSQL documentation, and code, refers to the specifications within
the ACL as "privileges".  This has the same meaning as "permissions" on the
linked page.  Generally if we say "permissions" we are referring to
something that is not covered by the ACL.  In routine communication the two
words are often used interchangeably.

David J.


Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Jelte Fennema-Nio
On Thu, 11 Apr 2024 at 19:52, David E. Wheeler  wrote:
> I realize this probably isn’t going to happen for 17, given the freeze, but I 
> would very much welcome feedback and pointers to address concerns about 
> providing a second directory for extensions and DSOs. Quite a few people have 
> talked about the need for this in the Extension Mini Summits[1], so I’m sure 
> I could get some collaborators to make improvements or look at a different 
> approach.

Overall +1 for the idea. We're running into this same limitation (only
a single place to put extension files) at Microsoft at the moment.

+and to the '$libdir' directive when loading modules
+that back functions.

I feel like this is a bit strange. Either its impact is too wide, or
it's not wide enough depending on your intent.

If you want to only change $libdir during CREATE EXTENSION (or ALTER
EXTENSION UPDATE), then why not just change it there. And really you'd
only want to change it when creating an extension from which the
control file is coming from extension_destdir.

However, I can also see a case for really always changing $libdir.
Because some extensions in shared_preload_libraries, might want to
trigger loading other libraries that they ship with dynamically. And
these libraries are probably also in extension_destdir.




Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Jelte Fennema-Nio
On Mon, 24 Jun 2024 at 18:11, Nathan Bossart  wrote:
> At first glance, the general idea seems reasonable to me.  I'm wondering
> whether there is a requirement for this directory to be prepended or if it
> could be appended to the end.  That way, the existing ones would take
> priority, which might be desirable from a security standpoint.

Citus does ship with some override library for pgoutput to make
logical replication/CDC work correctly with sharded tables. Right now
using this override library requires changing dynamic_library_path. It
would be nice if that wasn't necessary. But this is obviously a small
thing. And I definitely agree that there's a security angle to this as
well, but honestly that seems rather small too. If an attacker can put
shared libraries into the extension_destdir, I'm pretty sure you've
lost already, no matter if extension_destdir is prepended or appended
to the existing $libdir.




Re: RFC: Additional Directory for Extensions

2024-06-24 Thread David E. Wheeler
On Jun 24, 2024, at 17:17, Jelte Fennema-Nio  wrote:

> If you want to only change $libdir during CREATE EXTENSION (or ALTER
> EXTENSION UPDATE), then why not just change it there. And really you'd
> only want to change it when creating an extension from which the
> control file is coming from extension_destdir.

IIUC, the postmaster needs to load an extension on first use in every session 
unless it’s in shared_preload_libraries.

> However, I can also see a case for really always changing $libdir.
> Because some extensions in shared_preload_libraries, might want to
> trigger loading other libraries that they ship with dynamically. And
> these libraries are probably also in extension_destdir.

Right, it can be more than just the DSOs for the extension itself.

Best,

David





Re: RFC: Additional Directory for Extensions

2024-06-24 Thread Jelte Fennema-Nio
On Mon, 24 Jun 2024 at 22:42, David E. Wheeler  wrote:
> >> BINDIR
> >> DOCDIR
> >> HTMLDIR
> >> PKGINCLUDEDIR
> >> LOCALEDIR
> >> MANDIR
> >>
> >> I can imagine an extension wanting or needing to use any and all of these.
> >
> > Are these really all relevant to backend code?
>
> Oh I think so. Especially BINDIR; lots of extensions ship with binary 
> applications. And most ship with docs, too (PGXS puts items listed in DOCS 
> into DOCDIR). Some might also produce man pages (for their binaries), HTML 
> docs, and other stuff. Maybe an FTS extension would include locale files?
>
> I find it pretty easy to imagine use cases for all of them. So much so that I 
> wrote an extension binary distribution RFC[1] and its POC[2] around them.

Definitely agreed on BINDIR needing to be supported.

And while lots of extensions ship with docs, I expect this feature to
mostly be used in production environments to make deploying extensions
easier. And I'm not sure that many people care about deploying docs to
production (honestly lots of people would probably want to strip
them).

Still, for the sake of completeness it might make sense to support
this whole list in extension_destdir. (assuming it's easy to do)




[PATCH] Fix type redefinition build errors with macOS SDK 15.0

2024-06-24 Thread Stan Hu
Prior to macOS SDK 15, there were include guards in
$SDK_ROOT/usr/include/xlocale/_regex.h:

  #ifndef _REGEX_H_
  #include <_regex.h>
  #endif // _REGEX_H_
  #include <_xlocale.h>

In macOS SDK 15.0, these include guards are gone:

  #include <_regex.h>
  #include <_xlocale.h>

Because _REGEX_H_ was defined locally in PostgreSQL's version of
src/include/regex/regex.h, these include guards prevented duplicate
definitions from $SDK_ROOT/usr/include/_regex.h (not to be confused
with $SDK_ROOT/usr/include/xlocale/_regex.h).

To fix this build issue, define __REGEX_H_ to prevent macOS from
including the header that contains redefinitions of the local regex
structures.

Discussion: 
https://www.postgresql.org/message-id/CAMBWrQ%3DF9SSPfsFtCv%3DJT51WGK2VcgLA%2BiiJJOmjN0zbbufOEA%40mail.gmail.com
---
 src/include/regex/regex.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index d08113724f..045ac626cc 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -32,6 +32,20 @@
  * src/include/regex/regex.h
  */
 
+#if defined(__darwin__)
+/*
+ * macOS SDK 15.0 removed the _REGEX_H_ include guards in
+ * $SDK_ROOT/usr/include/xlocale/_regex.h, so now
+ * $SDK_ROOT/usr/include/_regex.h is always included. That file defines
+ * the same types as below. To guard against type redefinition errors,
+ * define __REGEX_H_.
+ */
+#ifndef __REGEX_H_
+#define __REGEX_H_
+
+#endif /* __REGEX_H_ */
+#endif
+
 /*
  * Add your own defines, if needed, here.
  */
-- 
2.45.0
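
For anyone unfamiliar with the trick, here is a self-contained illustration of
the guard-predefinition technique used above. DEMO_SYSTEM_H_ and the inline
"header" are made up for the example; the real patch predefines __REGEX_H_ so
the SDK header's conflicting definitions are skipped:

#include <stdio.h>

#define DEMO_SYSTEM_H_          /* pretend the header's guard is already set */

/* --- imagine this block lives in a system header we cannot edit --- */
#ifndef DEMO_SYSTEM_H_
#define DEMO_SYSTEM_H_
typedef struct { int dummy; } demo_regoff_t;   /* would conflict below */
#endif
/* ------------------------------------------------------------------ */

/* our own definition now compiles cleanly, with no redefinition error */
typedef long demo_regoff_t;

int
main(void)
{
    demo_regoff_t off = 42;

    printf("demo_regoff_t is %zu bytes, value %ld\n", sizeof(off), (long) off);
    return 0;
}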




