Re: Optimize duplicate code and fix memory leak in function fetch_remote_table_info()

2024-01-19 Thread Yongtao Huang
Hi,

>  So whatever it leaks will be released at the transaction end.

I see. Thank you very much for your explanation.

Regards,
Yongtao Huang

Tom Lane wrote on Sat, Jan 20, 2024 at 12:34:

> Yongtao Huang  writes:
> > (1)  I think *pfree(pub_names.data)* is necessary.
>
> Really?
>
> It looks to me like copy_table, and thence fetch_remote_table_info,
> is called once within a transaction.  So whatever it leaks will be
> released at transaction end.  This is a good thing, because it's
> messy enough that I seriously doubt that there aren't other leaks
> in it, or that it'd be practical to expect that it can be made
> to never leak anything.
>
> If anything, I'd be inclined to remove the random pfree's that
> are in it now.  It's unlikely that they constitute a net win
> compared to allowing memory context reset to clean things up.
>
> regards, tom lane
>


Re: Change GUC hashtable to use simplehash?

2024-01-19 Thread John Naylor
On Sat, Jan 20, 2024 at 7:13 AM Jeff Davis  wrote:
>
> On Fri, 2024-01-19 at 13:38 -0800, Jeff Davis wrote:
> > One post-commit question on 0aba255440: why do
> > haszero64(pg_bswap64(chunk)) rather than just haszero64(chunk)? How
> > does byteswapping affect whether a zero byte exists or not?
>
> I missed that it was used later when finding the rightmost one
> position.
>
> The placement of the comment was slightly confusing. Is:
>
>   haszero64(pg_bswap64(chunk)) == pg_bswap64(haszero64(chunk))
>
> ? If so, perhaps we can do the byte swapping outside of the loop, which
> might save a few cycles on longer strings and would be more readable.

The above identity is not true for this haszero64 macro. I phrased it
as "The rest of the bytes are indeterminate", but that's not very
clear. The identity could only hold if the macro set bytes for all and only those
bytes where the input had zeros. In the NetBSD strlen source, there is
a comment describing a way to do this:

~(((x & 0x7f7f) + 0x7f7f) | (x | 0x7f7f))

https://github.com/NetBSD/src/blob/trunk/common/lib/libc/arch/x86_64/string/strlen.S

(They don't actually use it of course, since x86_64 is little-endian)
From the commentary there, it sounds like 1 or 2 more instructions.
One unmentioned assumption I had was that the byte swap would be a
single instruction on all platforms where we care about performance
(*). If that's not the case, we could switch to the above macro for
big-endian machines. It'd be less readable since we'd then need an
additional #ifdef for counting leading, rather than trailing zeros
(that would avoid byte-swapping entirely). Either way, I'm afraid
big-endian is stuck doing a bit of extra work somewhere. That work
could be amortized by doing a quick check in the loop and afterwards
completely redoing the zero check (or a bytewise check same as the
unaligned path), but that would penalize short strings.

(*) 32-bit platforms don't take this path, but mamba's build failed
because the previously-misspelled symbol was still in the source file.
We could also #ifdef around the whole aligned-path function, although
it's redundant.

I hope this makes it more clear. Maybe the comment could use some work.
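To make the "indeterminate bytes" point concrete, here is a small standalone sketch (not the committed Postgres code; the macro names and test value are made up) contrasting the classic zero-byte idiom with the NetBSD-style exact form:

#include <stdint.h>
#include <stdio.h>

/* Classic idiom: nonzero iff a zero byte exists; other byte flags are indeterminate. */
#define haszero64_classic(x) \
    (((x) - UINT64_C(0x0101010101010101)) & ~(x) & UINT64_C(0x8080808080808080))

/* NetBSD-style exact form: sets the high bit of all and only the zero bytes. */
#define haszero64_exact(x) \
    (~((((x) & UINT64_C(0x7f7f7f7f7f7f7f7f)) + UINT64_C(0x7f7f7f7f7f7f7f7f)) \
       | ((x) | UINT64_C(0x7f7f7f7f7f7f7f7f))))

int
main(void)
{
    uint64_t    chunk = UINT64_C(0x1122003344550066);   /* contains two zero bytes */

    printf("classic: %016llx\n", (unsigned long long) haszero64_classic(chunk));
    printf("exact:   %016llx\n", (unsigned long long) haszero64_exact(chunk));
    return 0;
}

Only the exact form commutes with a byte swap (swap the input and the flag word swaps with it), which is what the proposed identity would require; the classic form does not, because of borrow propagation.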




Re: Change GUC hashtable to use simplehash?

2024-01-19 Thread John Naylor
On Fri, Jan 19, 2024 at 11:54 PM Heikki Linnakangas  wrote:

> Thanks! I started to look at how to use this, and I have some questions.
> I'd like to replace this murmurhash usage in resowner.c with this:
>
> >   /*
> >    * Most resource kinds store a pointer in 'value', and pointers are unique
> >    * all on their own.  But some resources store plain integers (Files and
> >    * Buffers as of this writing), so we want to incorporate the 'kind' in
> >    * the hash too, otherwise those resources will collide a lot.  But
> >    * because there are only a few resource kinds like that - and only a few
> >    * resource kinds to begin with - we don't need to work too hard to mix
> >    * 'kind' into the hash.  Just add it with hash_combine(), it perturbs the
> >    * result enough for our purposes.
> >    */
> > #if SIZEOF_DATUM == 8
> >   return hash_combine64(murmurhash64((uint64) value), (uint64) kind);
> > #else
> >   return hash_combine(murmurhash32((uint32) value), (uint32) kind);
> > #endif
>
> The straightforward replacement would be:
>
>  fasthash_state hs;
>
>  fasthash_init(&hs, sizeof(Datum), 0);
>  fasthash_accum(&hs, (char *) &kind, sizeof(ResourceOwnerDesc *));
>  fasthash_accum(&hs, (char *) &value, sizeof(Datum));
>  return fasthash_final32(&hs, 0);

That would give the fullest mixing possible, more than currently.

> But I wonder if it would be OK to abuse the 'seed' and 'tweak'
> parameters to the init and final functions instead, like this:
>
>  fasthash_state hs;
>
>  fasthash_init(&hs, sizeof(Datum), (uint64) kind);
>  return fasthash_final32(&hs, (uint64) value);

This would go in the other direction, and sacrifice some quality for
speed. The fasthash finalizer is pretty short -- XMX, where X is
"right shift and XOR" and M is "multiply". In looking at some other
hash functions, it seems that's often done only if the input has
already had some mixing. The Murmur finalizer has the shape XMXMX, and
that seems to be the preferred way to get good mixing on a single
register-sized value. For that reason, hash functions whose main loop
is designed for long inputs will often skip that for small inputs and
just go straight to a Murmur-style finalizer. Fasthash doesn't do
that, so for a small input it ends up doing XMXM then XMX, which is a
little more expensive.
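To put shapes to names, here is a sketch of the two mix styles in isolation (constants recalled from the fasthash and MurmurHash3 reference implementations rather than copied from the Postgres tree, so treat them as illustrative):

#include <stdint.h>

/* fasthash-style mix, shape XMX (shift-xor, multiply, shift-xor) */
static inline uint64_t
mix_xmx(uint64_t h)
{
    h ^= h >> 23;
    h *= UINT64_C(0x2127599bf4325c37);
    h ^= h >> 47;
    return h;
}

/* Murmur-style finalizer, shape XMXMX, the usual choice for a lone register-sized value */
static inline uint64_t
mix_xmxmx(uint64_t h)
{
    h ^= h >> 33;
    h *= UINT64_C(0xff51afd7ed558ccd);
    h ^= h >> 33;
    h *= UINT64_C(0xc4ceb9fe1a85ec53);
    h ^= h >> 33;
    return h;
}

So for a single register-sized input, fasthash ends up doing its XMXM round plus the XMX above, versus one XMXMX pass for a Murmur-style finalizer; that is the extra cost described above.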

> I couldn't find any guidance on what properties the 'seed' and 'tweak'
> have, compared to just accumulating the values with accum. Anyone know?

In Postgres, I only know of one use of a seed parameter, to create two
independent hash functions from hash_bytes_uint32_extended(), in
brin-bloom indexes. I think that's the more typical use for a seed.
Since there was no guidance with the existing hash functions, and it's
a widespread concept, I didn't feel the need to put any here. We could
change that.
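As an illustration of that typical seed usage, here is a self-contained sketch (not the brin-bloom code; the finalizer and the seed constants are arbitrary) of deriving two independent hash functions from one seeded hash:

#include <stdint.h>

/* Generic seeded hash over a 32-bit key: fold the seed in, then finalize. */
static uint64_t
seeded_hash(uint32_t key, uint64_t seed)
{
    uint64_t    h = seed ^ key;

    h ^= h >> 33;
    h *= UINT64_C(0xff51afd7ed558ccd);
    h ^= h >> 33;
    h *= UINT64_C(0xc4ceb9fe1a85ec53);
    h ^= h >> 33;
    return h;
}

/* Two Bloom-filter probe positions for the same key, using two fixed seeds. */
static void
bloom_probes(uint32_t key, uint64_t nbits, uint64_t *p1, uint64_t *p2)
{
    *p1 = seeded_hash(key, UINT64_C(0x9e3779b97f4a7c15)) % nbits;
    *p2 = seeded_hash(key, UINT64_C(0xc6a4a7935bd1e995)) % nbits;
}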

I modeled the finalizer tweak on one of the finalizers in xxHash that
also used it only for the input length. Length is used as a tiebreaker
where otherwise it will often not collide anyway, so it seems that's
how we should think about using it elsewhere. There is a comment above
fasthash_final64 mentioning that the tweak is used for length when
that is not known ahead of time, but it might be good to generalize
that, and maybe put it somewhere more prominent. With that in mind,
I'm not sure "value" is a good fit for the tweak.  "kind" is sort of
in the middle because IIUC it doesn't matter at all for pointer
values, but it's important for other kinds, which would commonly
collide.

If I were to change from murmur64, I'd probably go in between the two
extremes mentioned earlier, and mix "value" normally and pass "kind"
as the seed:

  fasthash_state hs;

  fasthash_init(&hs, sizeof(Datum), kind);
  fasthash_accum(&hs, (char *) &value, sizeof(Datum));
  return fasthash_final32(&hs, 0);




Re: Synchronizing slots from primary to standby

2024-01-19 Thread Dilip Kumar
On Fri, Jan 19, 2024 at 5:24 PM Amit Kapila  wrote:
>
> On Wed, Jan 17, 2024 at 4:00 PM shveta malik  wrote:
> >
>
> I had some off-list discussions with Sawada-San, Hou-San, and Shveta
> on the topic of extending replication commands instead of using the
> current model where we fetch the required slot information via SQL
> using a database connection. I would like to summarize the discussion
> and would like to know the thoughts of others on this topic.
>
> In the current patch, we launch the slotsync worker on physical
> standby which connects to the specified database (currently we let
> users specify the required dbname in primary_conninfo) on the primary.
> It then fetches the required information for failover marked slots
> from the primary and also does some primitive checks on the upstream
> node via SQL (the additional checks are like whether the upstream node
> has a specified physical slot or whether the upstream node is a
> primary node or a standby node). To fetch the required information it
> uses a libpqwalreceiver API which is mostly apt for this purpose as it
> supports SQL execution, but for this patch we don't need a replication
> connection, so we extend the libpqwalreceiver connect API.

What sort of extension have we made to 'libpqwalreceiver'? Is it
that, by default, it supports replication connections, and we have
extended the API so that we can provide an option for whether to
create a replication connection or a normal connection?

> Now, the concerns related to this could be that users would probably
> need to change existing mechanisms/tools to update primary_conninfo
> and one of the alternatives proposed is to have an additional GUC like
> slot_sync_dbname. Users won't be able to drop the database this worker
> is connected to aka whatever is specified in slot_sync_dbname but as
> the user herself sets up the configuration it shouldn't be a big deal.

Yeah, for this purpose users may use template1 or similar, which they
generally don't plan to drop.  So if the user wants to drop that
database, they need to turn off the slot-syncing option first, and then
it can be done?

> Then we also discussed whether extending libpqwalreceiver's connect
> API is a good idea and whether we need to further extend it in the
> future. As far as I can see, slotsync worker's primary requirement is
> to execute SQL queries which the current API is sufficient, and don't
> see something that needs any drastic change in this API. Note that
> tablesync worker that executes SQL also uses these APIs, so we may
> need something in the future for either of those. Then finally we need
> a slotsync worker to also connect to a database to use SQL and fetch
> results.

While looking into the patch v64-0002, I could not pinpoint exactly
what sort of extensions there are in libpqwalreceiver.c; I just saw
one extra API for fetching the dbname from the connection info?

> Now, let us consider if we extend the replication commands like
> READ_REPLICATION_SLOT and or introduce a new set of replication
> commands to fetch the required information then we don't need a DB
> connection with primary or a connection in slotsync worker. As per my
> current understanding, it is quite doable but I think we will slowly
> go in the direction of making replication commands something like SQL
> because today we need to extend it to fetch all slots info that have
> failover marked as true, the existence of a particular replication
> slot, etc. Then tomorrow, if we want to extend this work to have multiple
> slotsync workers, say one worker per db, then we have to extend the
> replication command to fetch per-database failover marked slots. To
> me, it sounds more like we are slowly adding SQL-like features to
> replication commands.
>
> Apart from this when we are reading per-db replication slots without
> connecting to a database, we probably need some additional protection
> mechanism so that the database won't get dropped.

Something like locking the database only while fetching the slots?

> Considering all this it seems that for now probably extending
> replication commands can simplify a few things like mentioned above
> but using SQL's with db-connection is more extendable.

Even I have similar thoughts.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Optimize duplicate code and fix memory leak in function fetch_remote_table_info()

2024-01-19 Thread Tom Lane
Yongtao Huang  writes:
> (1)  I think *pfree(pub_names.data)* is necessary.

Really?

It looks to me like copy_table, and thence fetch_remote_table_info,
is called once within a transaction.  So whatever it leaks will be
released at transaction end.  This is a good thing, because it's
messy enough that I seriously doubt that there aren't other leaks
in it, or that it'd be practical to expect that it can be made
to never leak anything.

If anything, I'd be inclined to remove the random pfree's that
are in it now.  It's unlikely that they constitute a net win
compared to allowing memory context reset to clean things up.

regards, tom lane




Re: Optimize duplicate code and fix memory leak in function fetch_remote_table_info()

2024-01-19 Thread Yongtao Huang
Thanks for your review.

(1)  I think *pfree(pub_names.data)* is necessary.
(2)  I agree with you. Considering that the new function is only called
twice, not encapsulating it into a function is not a huge problem.

Best wishes

Yongtao Huang

Michael Paquier wrote on Sat, Jan 20, 2024 at 11:13:

> On Fri, Jan 19, 2024 at 10:42:46PM +0800, Yongtao Huang wrote:
> > This patch fixes two things in the function fetch_remote_table_info().
> >
> > (1) *pfree(pub_names.data)* to avoid potential memory leaks.
>
> True that this code puts some effort in cleaning up the memory used
> locally.
>
> > (2) 2 pieces of code can do the same work,
> > ```
> > I want to integrate them into one function `make_pubname_list()` to make the
> > code neater.
>
> It does not strike me as a huge problem to let the code be as it is on
> HEAD when building the lists, FWIW, as we are talking about two places
> and there is clarity in keeping the code as it is.
> --
> Michael
>


0001-Fix-potential-memory-leak-in-tablesync.c.patch
Description: Binary data


Re: Prefetch the next tuple's memory during seqscans

2024-01-19 Thread vignesh C
On Mon, 10 Jul 2023 at 15:04, Daniel Gustafsson  wrote:
>
> > On 10 Jul 2023, at 11:32, Daniel Gustafsson  wrote:
> >
> >> On 4 Apr 2023, at 06:50, David Rowley  wrote:
> >
> >> Updated patches attached.
> >
> > This patch is marked Waiting on Author, but from the thread it seems Needs
> > Review is more apt.  I've changed status and also attached a new version of the
> > patch as the posted v1 no longer applied due to changes in formatting for Perl
> > code.
>
> ..and again with both patches attached. Doh.

I'm seeing that there has been no activity in this thread for more
than 6 months, so I'm planning to close this in the current commitfest
unless someone is planning to take it forward.

Regards,
Vignesh




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2024-01-19 Thread vignesh C
On Mon, 20 Feb 2023 at 16:06, David Geier  wrote:
>
> Hi!
>
> On 2/14/23 13:48, David Geier wrote:
> >
> > It still fails.
> >
> > I'll get Cirrus-CI working on my own Github fork so I can make sure it
> > really compiles on all platforms before I submit a new version.
>
> It took some time until Cirrus CI allowed me to run tests against my new
> GitHub account (there's a 3 days freeze to avoid people from getting
> Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch
> which passes builds, rebased on latest master.
>
> I also reviewed the first two patches a while ago in [1]. I hope we can
> progress with them to further reduce the size of this patch set.
>
> Beyond that: I could work on support for more OSs (e.g. starting with
> Windows). Is there appetite for that or do we rather want to instead
> start with a smaller patch?

Are we planning to continue on this and take it further?
I'm seeing that there has been no activity in this thread for nearly 1
year now, so I'm planning to close this in the current commitfest unless
someone is planning to take it forward.

Regards,
Vignesh




Re: MultiXact\SLRU buffers configuration

2024-01-19 Thread vignesh C
On Mon, 9 Jan 2023 at 09:49, Andrey Borodin  wrote:
>
> On Tue, Jan 3, 2023 at 5:02 AM vignesh C  wrote:
> > does not apply on top of HEAD as in [1], please post a rebased patch:
> >
> Thanks! Here's the rebase.

I'm seeing that there has been no activity in this thread for more
than 1 year now, so I'm planning to close this in the current commitfest
unless someone is planning to take it forward.

Regards,
Vignesh




Re: Optimize duplicate code and fix memory leak in function fetch_remote_table_info()

2024-01-19 Thread Michael Paquier
On Fri, Jan 19, 2024 at 10:42:46PM +0800, Yongtao Huang wrote:
> This patch fixes two things in the function fetch_remote_table_info().
> 
> (1) *pfree(pub_names.data)* to avoid potential memory leaks.

True that this code puts some effort in cleaning up the memory used
locally.

> (2) 2 pieces of code can do the same work,
> ```
> I want to integrate them into one function `make_pubname_list()` to make the
> code neater.

It does not strike me as a huge problem to let the code be as it is on
HEAD when building the lists, FWIW, as we are talking about two places
and there is clarity in keeping the code as it is.
--
Michael




Re: fix stats_fetch_consistency value in postgresql.conf.sample

2024-01-19 Thread Michael Paquier
On Sat, Jan 20, 2024 at 07:59:22AM +0530, vignesh C wrote:
> I'm seeing that there has been no activity in this thread for more
> than 8 months, so I'm planning to close this in the current commitfest
> unless someone is planning to take it forward.

Thanks, that seems right to me.

I have been looking again at the patch after seeing your reply (I spent
some time on it but could not decide what to do), and I am not really
excited about the amount of new facilities this requires in the TAP
test (especially the list of hardcoded parameters that may change), the
backend-side changes for the GUC flags, or the requirements to make the
checks flexible enough to work across initdb and platform-dependent
default values.  In short, I'm happy to let 003_check_guc.pl do what
check_guc was able to do (script gone in cf29a11ef646) for the
parameter names.
--
Michael




Re: add function argument names to regex* functions.

2024-01-19 Thread jian he
On Sat, Jan 20, 2024 at 10:55 AM jian he  wrote:
>
>
> another regex* function argument change: from "N" to "occurrence", for example:
> + If occurrence is specified
> + then the occurrence'th match of the pattern
> + is located,
>
> but [2] says
> Speaking of the "occurrence'th
> occurrence" is just silly, not to mention long and easy to misspell."
>

Sorry, the reference link for [2] is
https://www.postgresql.org/message-id/1567465.1627860115%40sss.pgh.pa.us

The [2] in my previous post linked to the whole thread.




Re: add function argument names to regex* functions.

2024-01-19 Thread jian he
On Thu, Jan 18, 2024 at 4:17 PM Peter Eisentraut  wrote:
>
> On 10.01.24 15:18, jian he wrote:
> > I put the changes into the new patch.
>
> Reading back through the discussion, I wasn't quite able to interpret
> the resolution regarding Oracle compatibility.  From the patch, it looks
> like you chose not to adopt the parameter names from Oracle.  Was that
> your intention?
>

Per the commit message:
https://git.postgresql.org/cgit/postgresql.git/commit/?id=6424337073589476303b10f6d7cc74f501b8d9d7
even if the names are all the same, our function is still not the same
as Oracle's.

There is a documentation bug.
In [0], Table 9.25. Regular Expression Functions Equivalencies
regexp_replace function definition: regexp_replace(string, pattern, replacement)

In one of the sections below, regexp_replace is explained as
<
The regexp_replace function provides substitution of new text for
substrings that match POSIX regular expression patterns. It has the
syntax regexp_replace(source, pattern, replacement [, start [, N ]] [,
flags ]). (Notice that N cannot be specified unless start is, but
flags can be given in any case.)
<
So I changed the first argument of regexp_replace to "string";
accordingly, the doc needs to change as well, which I did.

another regex* function argument change: from "N" to "occurrence", for example:
+ If occurrence is specified
+ then the occurrence'th match of the pattern
+ is located,

but [2] says
Speaking of the "occurrence'th
occurrence" is just silly, not to mention long and easy to misspell."

Summary:
Adding named-notation support is my intention.
To make the regex* functions work with named notation, we need to add
proargnames to src/include/catalog/pg_proc.dat.
Adding proargnames also requires changing the docs.
Naming the arguments is a matter of taste now, so I only changed 'N' to
'occurrence'.

[0] https://www.postgresql.org/docs/current/functions-matching.html
[1] 
https://www.postgresql.org/message-id/flat/fc160ee0-c843-b024-29bb-97b5da61971f%40darold.net




Re: Re: Patch: Improve Boolean Predicate JSON Path Docs

2024-01-19 Thread Erik Wienhold
On 2024-01-19 22:15 +0100, Tom Lane wrote:
> "David E. Wheeler"  writes:
> > [ v7-0001-Improve-boolean-predicate-JSON-Path-docs.patch ]
> 
> + \set json '{
>"track": {
>  "segments": [
>{
> 
> I find the textual change rather unwieldy, but the bigger problem is
> that this example doesn't actually work.  If you try to copy-and-paste
> this into psql, you get "unterminated quoted string", because psql
> metacommands can't span line boundaries.

Interesting... copy-pasting the entire \set command works for me with
psql 16.1 in gnome-terminal and tmux.  Typing it out manually gives me
the "unterminated quoted string" error.  Maybe has to do with my stty
settings.

> I experimented with
> 
> SELECT '
>   ... multiline json value ...
> ' AS json
> \gexec
> 
> but that didn't seem to work either.  Anybody have a better idea?

Fine with me (the \gset variant).

-- 
Erik




Re: Amcheck verification of GiST and GIN

2024-01-19 Thread vignesh C
On Mon, 27 Mar 2023 at 03:47, Andrey Borodin  wrote:
>
> On Sun, Mar 19, 2023 at 4:00 PM Andrey Borodin  wrote:
> >
> > Also, there are INCLUDEd attributes. Right now we just put them as-is
> > to the bloom filter. Does this constitute a TOAST bug as in B-tree?
> > If so, I think we should use a version of tuple formatting that omits
> > included attributes...
> > What do you think?
> I've ported the B-tree TOAST test to GiST, and, as expected, it fails.
> Finds non-indexed tuple for a fresh valid index.
> I've implemented normalization, plz see gistFormNormalizedTuple().
> But there are two problems:
> 1. I could not come up with a proper way to pfree() compressed value
> after decompressing. See TODO in gistFormNormalizedTuple().
> 2. In the index tuples seem to be normalized somewhere. They do not
> have to be deformed and normalized. It's not clear to me how this
> happened.

I have changed the status of the commitfest entry to "Waiting on
Author" as there was no follow-up on Alexander's queries. Feel free to
address them and change the commitfest entry accordingly.

Regards,
Vignesh




Re: Inconsistency in reporting checkpointer stats

2024-01-19 Thread vignesh C
On Sun, 19 Feb 2023 at 15:28, Nitin Jadhav
 wrote:
>
> > This doesn't pass the tests, because the regression tests weren't adjusted:
> > https://api.cirrus-ci.com/v1/artifact/task/5937624817336320/testrun/build/testrun/regress/regress/regression.diffs
>
> Thanks for sharing this. I have fixed this in the patch attached.
>
>
> >> IMO, there's no need for 2 separate patches for these changes.
> > I will make it a single patch while sharing the next patch.
>
> Clubbed both patches into one.

The patch does not apply anymore; please post a rebased version of the patch:
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/catalog/system_views.sql.rej
patching file src/backend/utils/activity/pgstat_checkpointer.c
Hunk #1 FAILED at 52.
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/utils/activity/pgstat_checkpointer.c.rej
patching file src/backend/utils/adt/pgstatfuncs.c
Hunk #1 succeeded at 1217 with fuzz 1 (offset 24 lines).
patching file src/include/access/xlog.h
Hunk #1 succeeded at 165 (offset 3 lines).
patching file src/include/catalog/pg_proc.dat
Hunk #1 FAILED at 5680.
1 out of 1 hunk FAILED -- saving rejects to file
src/include/catalog/pg_proc.dat.rej

Regards,
Vignesh




Re: Make documentation builds reproducible

2024-01-19 Thread vignesh C
On Fri, 25 Aug 2023 at 01:23, Tristan Partin  wrote:
>
> On Thu Aug 24, 2023 at 2:30 PM CDT, Tom Lane wrote:
> > "Tristan Partin"  writes:
> > > On Wed Aug 23, 2023 at 2:24 PM CDT, Peter Eisentraut wrote:
> > >> Somewhere at PGCon, I forgot exactly where, maybe in the same meeting
> > >> where we talked about getting rid of distprep, we talked about that the
> > >> documentation builds are not reproducible (in the sense of
> > >> https://reproducible-builds.org/).  This is easily fixable,
> >
> > > Is there anything I am missing? Is Postgres relying on releases older
> > > than snapshot-2018-12-07-01? If so, is it possible to up the minimum
> > > version?
> >
> > AFAICT the "latest stable release" of docbook-xsl is still 1.79.2,
> > which seems to have been released in 2017, so it's unsurprising that
> > it's missing this fix.
> >
> > It's kind of hard to argue that developers (much less distro packagers)
> > should install unsupported snapshot releases in order to build our docs.
> > Having said that, maybe we should check whether this patch is compatible
> > with those snapshot releases, just in case somebody is using one.
>
> I agree with you. Thanks for the pointer.

I'm seeing that there has been no activity in this thread for nearly 5
months, so I'm planning to close this in the current commitfest unless
someone is planning to take it forward.

Regards,
Vignesh




Re: fix stats_fetch_consistency value in postgresql.conf.sample

2024-01-19 Thread vignesh C
On Wed, 10 May 2023 at 06:07, Justin Pryzby  wrote:
>
> On Wed, Mar 29, 2023 at 11:03:59PM -0500, Justin Pryzby wrote:
> > On Wed, Jul 13, 2022 at 04:49:00PM -0700, Andres Freund wrote:
> > > On 2022-07-14 08:46:02 +0900, Michael Paquier wrote:
> > > > On Wed, Jul 13, 2022 at 12:30:00PM -0500, Justin Pryzby wrote:
> > > > > How did you make this list ?  Was it by excluding things that failed 
> > > > > for you ?
> > > > >
> > > > > cfbot is currently failing due to io_concurrency on windows.
> > > > > I think there are more GUC which should be included here.
> > > > >
> > > > > http://cfbot.cputube.org/kyotaro-horiguchi.html
> > > >
> > > > FWIW, I am not really a fan of making this test depend on a hardcoded
> > > > list of GUCs.
> > >
> > > I wonder if we should add flags indicating platform dependency etc to 
> > > guc.c?
> > > That should allow to remove most of them?
> >
> > Michael commented on this, but on another thread, so I'm copying and
> > pasting it here.
> >
> > On Thu, Mar 23, 2023 at 08:59:57PM -0500, Justin Pryzby wrote:
> > > On Fri, Mar 24, 2023 at 10:24:43AM +0900, Michael Paquier wrote:
> > > > >> * Check consistency of GUC defaults between .sample.conf and 
> > > > >> pg_settings.boot_val
> > > > >   - It looks like this was pretty active until last October and might
> > > > > have been ready to apply at least partially? But no further work or
> > > > > review has happened since.
> > > >
> > > > FWIW, I don't find much appealing the addition of two GUC flags for
> > > > only the sole purpose of that,
> > >
> > > The flags seem independently interesting - adding them here follows
> > > a suggestion Andres made in response to your complaint.
> > > 20220713234900.z4rniuaerkq34...@awork3.anarazel.de
> > >
> > > > particularly as we get a stronger
> > > > dependency between GUCs that can be switched dynamically at
> > > > initialization and at compile-time.
> > >
> > > What do you mean by "stronger dependency between GUCs" ?
> >
> > I'm still not clear what that means ?
>
> Michael ?
>
> This fixes an issue with the last version that failed with
> log_autovacuum_min_duration in cirrusci's pg_ci_base.conf.
>
> And now includes both a perl and a sql-based versions of the test - both
> of which rely on the flags.

I'm seeing that there has been no activity in this thread for more
than 8 months, so I'm planning to close this in the current commitfest
unless someone is planning to take it forward.

Regards,
Vignesh




Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).

2024-01-19 Thread vignesh C
On Fri, 22 Sept 2023 at 18:45, Amul Sul  wrote:
>
>
>
> On Wed, Sep 20, 2023 at 8:29 PM Alvaro Herrera  
> wrote:
>>
>> On 2023-Sep-20, Amul Sul wrote:
>>
>> > On the latest master head, I can see a $subject bug that seems to be
>> > related to commit #b0e96f311985:
>> >
>> > Here is the table definition:
>> > create table foo(i int, j int, CONSTRAINT pk PRIMARY KEY(i) DEFERRABLE);
>>
>> Interesting, thanks for the report.  Your attribution to that commit is
>> correct.  The table is dumped like this:
>>
>> CREATE TABLE public.foo (
>> i integer CONSTRAINT pgdump_throwaway_notnull_0 NOT NULL NO INHERIT,
>> j integer
>> );
>> ALTER TABLE ONLY public.foo
>> ADD CONSTRAINT pk PRIMARY KEY (i) DEFERRABLE;
>> ALTER TABLE ONLY public.foo DROP CONSTRAINT pgdump_throwaway_notnull_0;
>>
>> so the problem here is that the deferrable PK is not considered a reason
>> to keep attnotnull set, so we produce a throwaway constraint that we
>> then drop.  This is already bogus, but what is more bogus is the fact
>> that the backend accepts the DROP CONSTRAINT at all.
>>
>> The pg_dump failing should be easy to fix, but fixing the backend to
>> error out sounds more critical.  So, the reason for this behavior is
>> that RelationGetIndexList doesn't want to list an index that isn't
>> marked indimmediate as a primary key.  I can easily hack around that by
>> doing
>>
>> diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
>> index 7234cb3da6..971d9c8738 100644
>> --- a/src/backend/utils/cache/relcache.c
>> +++ b/src/backend/utils/cache/relcache.c
>> @@ -4794,7 +4794,6 @@ RelationGetIndexList(Relation relation)
>>  * check them.
>>  */
>> if (!index->indisunique ||
>> -   !index->indimmediate ||
>> !heap_attisnull(htup, Anum_pg_index_indpred, NULL))
>> continue;
>>
>> @@ -4821,6 +4820,9 @@ RelationGetIndexList(Relation relation)
>>  relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
>> pkeyIndex = index->indexrelid;
>>
>> +   if (!index->indimmediate)
>> +   continue;
>> +
>> if (!index->indisvalid)
>> continue;
>>
>>
>> But of course this is not great, since it impacts unrelated bits of code
>> that are relying on relation->pkindex or RelationGetIndexAttrBitmap
>> having their current behavior with non-immediate index.
>
>
> True, but I am still wondering why relation->rd_pkattr would be skipped for a
> deferrable primary key, which seems a bit incorrect to me since it
> suggests that the relation doesn't have a PK at all.  Anyway, that is unrelated.
>
>>
>> I think a real solution is to stop relying on RelationGetIndexAttrBitmap
>> in ATExecDropNotNull().  (And, again, pg_dump needs some patch as well
>> to avoid printing a throwaway NOT NULL constraint at this point.)
>
>
> I might not have understood this, but I think, if it is OK to skip the throwaway NOT
> NULL for a deferrable PK, then that would be enough for the reported issue
> to be fixed.  I quickly tried with the attached patch which looks sufficient
> to skip that, but, TBH, I haven't thought carefully about this change.

I did not see any test addition for this; can we add one?

Regards,
Vignesh




Re: Collation version tracking for macOS

2024-01-19 Thread vignesh C
On Sat, 21 Jan 2023 at 02:24, Jeff Davis  wrote:
>
> On Thu, 2023-01-19 at 00:11 -0800, Jeff Davis wrote:
> > Attached are a new set of patches, including a major enhancement: the
> > icu_multilib contrib module.
>
> Attached rebased v8.
>
> [ It looks like my email client truncated the last email somehow, in
> case someone was wondering why it just stopped. ]
>
> The big change is the introduction of the icu_multilib contrib module
> which provides a lot of the functionality requested in this thread:
>
>* icu version stability, which allows you to "lock down" ICU to a
> specific major and minor version (or major version only)
>* multi-lib ICU, which (if a GUC is set) will enable the "search by
> collversion" behavior. Some doubts were raised about the wisdom of this
> approach, but it's the only multi-lib solution we have without doing
> some significant catalog work.
>
> I rendered the HTML docs for icu_multilib and attached to this email to
> make it easier to view.
>
> icu_multilib assumes that the various ICU library versions are already
> available in a single location, most likely installed with a package
> manager. That location can be the same as the built-in ICU, or a
> different location. Ideally, packagers would start to offer a few
> "stable" versions of ICU that would be available for a long time, but
> it will take a while for that to happen. So for now, it's up to the
> user to figure out how to get the right versions of ICU on their system
> and keep them there.
>
> Automated tests of icu_multilib are a problem unless the one running
> the tests is willing to compile the right versions of ICU (like I did).
> But I at least have automated tests for the hooks by using the test
> module test_collator_lib_hooks.
>
> The v7 patches in this thread are dependent on the pure refactoring
> patches in this CF entry:
>
>https://commitfest.postgresql.org/41/3935/
>
> https://postgr.es/m/052a5ed874d110be2f3ae28752e363306b10966d.ca...@j-davis.com
>
> The requested functionality _not_ offered by icu_multilib is tying a
> specific collation to a specific ICU version. A few variants were
> proposed, the latest is to tie a collation to the library file itself
> through the provider. That needs to be done with proper catalog support
> in core. But I believe the work I've done here has made a lot of
> progress in that direction, and also shows the versatility of the new
> hook to solve at least some problems.

This thread has been idle for a year now; it has stalled after a lot
of discussion.
@Jeff Davis: Do you want to try to restart the discussion by posting
an updated version and see what happens?

Regards,
Vignesh




Re: Lockless queue of waiters in LWLock

2024-01-19 Thread vignesh C
On Sat, 26 Nov 2022 at 00:24, Pavel Borisov  wrote:
>
> Hi, hackers!
> In the measurements above in the thread, I've been using LIFO wake
> queue in a primary lockless patch (and it was attached as the previous
> versions of the patch) and an "inverted wake queue" (in fact FIFO) as
> the alternative benchmarking option. I think using the latter is
> fairer and more natural, and provided they show no difference in speed,
> I'd make the main patch use it (attached as v6). No other changes
> from v5, though.

There has not been much interest in this, as the thread has been idle
for more than a year now. I'm not sure if we should take it forward or
not. I would prefer to close this in the current commitfest unless
someone wants to take it further.

Regards,
Vignesh




Re: Documentation for building with meson

2024-01-19 Thread vignesh C
On Wed, 29 Mar 2023 at 00:57, samay sharma  wrote:
>
> Hi,
>
> On Wed, Mar 15, 2023 at 4:28 AM Peter Eisentraut 
>  wrote:
>>
>>  > [PATCH v8 1/5] Make minor additions and corrections to meson docs
>>
>> The last hunk revealed that there is some mixing up between meson setup
>> and meson configure.  This goes a bit further.  For example, earlier it
>> says that to get a list of meson setup options, call meson configure
>> --help and look at https://mesonbuild.com/Commands.html#configure, which
>> are both wrong.  Also later throughout the text it uses one or the
>> other.  I think this has the potential to be very confusing, and we
>> should clean this up carefully.
>>
>> The text about additional meson test options maybe should go into the
>> regress.sgml chapter?
>
>
> I tried to make the meson setup and meson configure usage consistent. I've 
> removed the text for the test options.
>>
>>
>>
>>  > [PATCH v8 2/5] Add data layout options sub-section in installation
>>   docs
>>
>> This makes sense.  Please double check your patch for correct title
>> casing, vertical spacing of XML, to keep everything looking consistent.
>
>
> Thanks for noticing. Made it consistent on both sides.
>>
>>
>> This text isn't yours, but since your patch emphasizes it, I wonder if
>> it could use some clarification:
>>
>> + These options affect how PostgreSQL lays out data on disk.
>> + Note that changing these breaks on-disk database compatibility,
>> + meaning you cannot use pg_upgrade to upgrade to
>> + a build with different values of these options.
>>
>> This isn't really correct.  What breaking on-disk compatibility means is
>> that you can't use a server compiled one way with a data directory
>> initialized by binaries compiled another way.  pg_upgrade may well have
>> the ability to upgrade between one or the other; that's up to pg_upgrade
>> to figure out but not an intrinsic property.  (I wonder why pg_upgrade
>> cares about the WAL block size.)
>
>
>  Fixed.
>>
>>
>>
>>  > [PATCH v8 3/5] Remove Anti-Features section from Installation from
>>   source docs
>>
>> Makes sense.  But is "--disable-thread-safety" really a developer
>> feature?  I think not.
>>
>
> Moved to PostgreSQL features. Do you think there's a better place for it?
>
>>
>>
>>  > [PATCH v8 4/5] Re-organize Miscellaneous section
>>
>> This moves the Miscellaneous section after Developer Features.  I think
>> Developer Features should be last.
>>
>> Maybe should remove this section and add the options to the regular
>> PostgreSQL Features section.
>
>
> Yes, that makes sense. Made this change.
>>
>>
>> Also consider the grouping in meson_options.txt, which is slightly
>> different yet.
>
>
> Removed Misc options section from meson_options.txt too.
>>
>>
>>
>>  > [PATCH v8 5/5] Change Short Version for meson installation guide
>>
>> +# create working directory
>> +mkdir postgres
>> +cd postgres
>> +
>> +# fetch source code
>> +git clone https://git.postgresql.org/git/postgresql.git src
>>
>> This comes after the "Getting the Source" section, so at this point they
>> already have the source and don't need to do "git clone" etc. again.
>>
>> +# setup and enter build directory (done only first time)
>> +## Unix based platforms
>> +meson setup build src --prefix=$PWD/install
>> +
>> +## Windows
>> +meson setup build src --prefix=%cd%/install
>>
>> Maybe some people work this way, but to me the directory structures you
>> create here are completely weird.
>
>
> I'd like to discuss what you think is a good directory structure to work 
> with. I've mentioned some of the drawbacks I see with the current structure 
> for the short version. I know this structure can feel different but it 
> feeling weird is not ideal. Do you have a directory structure in mind which 
> is different but doesn't feel odd to you?
>
>>
>>
>> +# Initialize a new database
>> +../install/bin/initdb -D ../data
>> +
>> +# Start database
>> +../install/bin/pg_ctl -D ../data/ -l logfile start
>> +
>> +# Connect to the database
>> +../install/bin/psql -d postgres
>>
>> The terminology here needs to be tightened up.  You are using "database"
>> here to mean three different things.
>
>
> I'll address this together once we are aligned on the overall directory 
> structure etc.
>
>> There are a few reasons why I had done this. Some reasons Andres has 
>> described in his previous email and I'll add a few specific examples on why 
>> having the same section for both might not be a good idea.
>>
>>  * Having readline and zlib as required for building PostgreSQL is now wrong 
>> because they are not required for meson builds. Also, the name of the 
>> configs are different for make and meson and the current section only lists 
>> the make ones.
>>  * There are many references to configure in that section which don't apply 
>> to meson.
>>  * Last I checked Flex and Bison were always required to build via meson but 
>> not for make and the current section doesn't explain those differen

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-19 Thread Jeff Davis
On Wed, 2024-01-10 at 19:59 +0530, Bharath Rupireddy wrote:
> I've addressed the above review comments and attached v19 patch-set.

Regarding:

-   if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
-                &errinfo))
+   if (!WALRead(state, cur_page, targetPagePtr, count, tli,
+                &errinfo))

I'd like to understand the reason it was using XLOG_BLCKSZ before. Was
it a performance optimization? Or was it to zero the remainder of the
caller's buffer (readBuf)? Or something else?

If it was to zero the remainder of the caller's buffer, then we should
explicitly make that the caller's responsibility.

Regards,
Jeff Davis





Re: Change GUC hashtable to use simplehash?

2024-01-19 Thread Jeff Davis
On Fri, 2024-01-19 at 13:38 -0800, Jeff Davis wrote:
> One post-commit question on 0aba255440: why do
> haszero64(pg_bswap64(chunk)) rather than just haszero64(chunk)? How
> does byteswapping affect whether a zero byte exists or not?

I missed that it was used later when finding the rightmost one
position.

The placement of the comment was slightly confusing. Is:

  haszero64(pg_bswap64(chunk)) == pg_bswap64(haszero64(chunk))

? If so, perhaps we can do the byte swapping outside of the loop, which
might save a few cycles on longer strings and would be more readable.

Regards,
Jeff Davis





Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

2024-01-19 Thread Daniel Gustafsson
> On 19 Jan 2024, at 23:09, Kirk Wolak  wrote:

>  From a FUTURE email, I noticed pg_jit_available() and it's set to f??

Right, then this installation does not contain the necessary library to JIT
compile the query.

> Okay, so does this require a special BUILD command?

Yes, it requires that you compile with --with-llvm.  If you don't have llvm in
the PATH you might need to set LLVM_CONFIG to point to your llvm-config binary.
With autotools that would be something like:

./configure  --with-llvm LLVM_CONFIG=/path/to/llvm-config

--
Daniel Gustafsson





Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-01-19 Thread Peter Geoghegan
On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent
 wrote:
> > I'm not going to break out the planner changes, because they're *not*
> > independent in any way.
>
> The planner changes depend on the btree changes, that I agree with.
> However, I don't think that the btree changes depend on the planner
> changes.

Yeah, technically it would be possible to break out the indxpath.c
changes -- that wouldn't leave the tree in a state with visibly-wrong
behavior at any point. But that's far from the only thing that
matters. If that was the standard that we applied, then I might have
to split the patch into 10 or more patches.

What it boils down to is this: it is totally natural for me to think
of the planner changes as integral to the nbtree changes, so I'm going
to include them together in one commit. That's just how the code was
written -- I always thought about it as one single thing. The majority
of the queries that I've used to promote the patch directly rely on
the planner changes in one way or another (even back in v1, when the
planner side of things had lots of kludges).

I don't necessarily always follow that standard. Sometimes it is
useful to further break things up. Like when it happens to make the
high-level division of labor a little bit clearer. The example that
comes to mind is commit dd299df8 and commit fab25024. These nbtree
commits were essentially one piece of work that was broken into two,
but committed within minutes of each other, and left the tree in a
momentary state that was not-very-useful (though still correct). That
made sense to me at the time because the code in question was
mechanically distinct, while at the same time modifying some of the
same nbtree files. Drawing a line under it seemed to make sense.

I will admit that there is some amount of subjective judgement (gasp!)
here. Plus I'll acknowledge that it's *slightly* odd that the most
compelling cases for the SAOP patch almost all involve savings in heap
page accesses, even though it is fundamentally a B-Tree patch. But
that's just how it came out. Slightly odd things happen all the time.

> I would agree with you if this was about new APIs and features, but
> here existing APIs are being repurposed without changing them. A
> maintainer of index AMs would not look at the commit title 'Enhance
> nbtree ScalarArrayOp execution' and think "oh, now I have to make sure
> my existing amsearcharray+amcanorder handling actually supports
> non-prefix arrays keys and return data in index order".

This is getting ridiculous.

It is quite likely that there are exactly zero affected out-of-core
index AMs. I don't count pgroonga as a counterexample (I don't think
that it actually fulfills the definition of a ). Basically,
"amcanorder" index AMs more or less promise to be compatible with
nbtree, down to having the same strategy numbers. So the idea that I'm
going to upset amsearcharray+amcanorder index AM authors is a
completely theoretical problem. The planner code evolved with nbtree,
hand-in-glove.

I'm more than happy to add a reminder to the commit message about
adding a proforma listing to the compatibility section of the Postgres
17 release notes. Can we actually agree on which theoretical index AM
types are affected first, though? Isn't it actually
amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I
have that right?

BTW, how should we phrase this compatibility note, so that it can be
understood? It's rather awkward.

> > As I said to Heikki, thinking about this some more is on my todo list.
> > I mean the way that this worked had substantial revisions on HEAD
> > right before I posted v9. v9 was only to fix the bit rot that that
> > caused.
>
> Then I'll be waiting for that TODO item to be resolved.

My TODO item is to resolve the question of whether and to what extent
these two optimizations should be combined. It's possible that I'll
decide that it isn't worth it, for whatever reason. At this point I'm
still totally neutral.

> > Even single digit
> > thousand of array elements should be considered huge. Plus this code
> > path is only hit with poorly written queries.
>
> Would you accept suggestions?

You mean that you want to have a go at it yourself? Sure, go ahead.

I cannot promise that I'll accept your suggested revisions, but if
they aren't too invasive/complicated compared to what I have now, then
I will just accept them. I maintain that this isn't really necessary,
but it might be the path of least resistance at this point.

> > > Please fix this, as it shows up in profiling of large array merges.
> > > Additionally, as it merges two arrays of unique items into one,
> > > storing only matching entries, I feel that it is quite wasteful to do
> > > this additional allocation here. Why not reuse the original allocation
> > > immediately?
> >
> > Because that's adding more complexity for a code path that will hardly
> > ever be used in practice. For something that happens once, during a
> > preprocess

Re: [DOC] Add detail regarding resource consumption wrt max_connections

2024-01-19 Thread reid . thompson
On Sat, 2024-01-13 at 10:31 -0700, Roberto Mello wrote:
> On Fri, Jan 12, 2024 at 3:15 PM Cary Huang 
> wrote:
> > I think it is good to warn the user about the increased allocation
> > of memory for certain parameters so that they do not abuse it by
> > setting it to a huge number without knowing the consequences.
> > 
> > It is true that max_connections can increase the size of proc array
> > and other resources, which are allocated in the shared buffer,
> > which also means less shared buffer to perform regular data
> > operations. I am sure this is not the only parameter that affects
> > the memory allocation. "max_prepared_xacts" can also affect the
> > shared memory allocation too so the same warning message applies
> > here as well. Maybe there are other parameters with similar
> > effects. 
> > 
> > Instead of stating that higher max_connections results in higher
> > allocation, It may be better to tell the user that if the value
> > needs to be set much higher, consider increasing the
> > "shared_buffers" setting as well.
> > 
> 
> 
> Appreciate the review, Cary.
> 
> My goal was to inform the reader that there are implications to
> setting max_connections higher. I've personally seen a user
> mindlessly set this to 50k connections, unaware it would cause
> unintended consequences. 
> 
> I can add a suggestion for the user to consider increasing
> shared_buffers in accordance with higher max_connections, but it
> would be better if there was a "rule of thumb" guideline to go along.
> I'm open to suggestions.
> 
> I can revise with a similar warning in max_prepared_xacts as well.
> 
> Sincerely,
> 
> Roberto

Can a "close enough" rule of thumb be calculated from:
postgresql.conf -> log_min_messages = debug3

start postgresql with varying max_connections to get the
CreateSharedMemoryAndSemaphores() sizes and generate a rough equation:

postgresql-12-main.log

max_connections=100
75:2024-01-19 17:04:56.544 EST [2762535] DEBUG: invoking IpcMemoryCreate(size=149110784)
0.149110784GB

max_connections=10000
1203:2024-01-19 17:06:13.502 EST [2764895] DEBUG: invoking IpcMemoryCreate(size=644997120)
0.64499712GB

max_connections=20000
5248:2024-01-19 17:24:27.956 EST [2954550] DEBUG: invoking IpcMemoryCreate(size=1145774080)
1.14577408GB

max_connections=50000
2331:2024-01-19 17:07:27.716 EST [2767079] DEBUG: invoking IpcMemoryCreate(size=2591490048)
2.591490048GB
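A rough linear fit over the first and last measurements above (assuming those runs were max_connections of 100 and 50000) gives a ballpark per-connection figure; a minimal sketch:

#include <stdio.h>

/* Data points taken from the IpcMemoryCreate() sizes logged above. */
int
main(void)
{
    double  x1 = 100.0,   y1 = 149110784.0;     /* max_connections = 100 */
    double  x2 = 50000.0, y2 = 2591490048.0;    /* max_connections = 50000 (assumed) */
    double  per_conn = (y2 - y1) / (x2 - x1);
    double  base = y1 - per_conn * x1;

    printf("~%.0f bytes (~%.0f kB) of shared memory per connection\n",
           per_conn, per_conn / 1024.0);
    printf("~%.0f MB of fixed overhead at these settings\n",
           base / (1024.0 * 1024.0));
    return 0;
}

That comes out to roughly 48 kB of shared memory per connection on this particular PG 12 build, which at least puts a number on the consequences the doc patch warns about.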


from lines 184-186

$ rg -B28 -A35 'invoking IpcMemoryCreate'
backend/storage/ipc/ipci.c
158-/*
159- * CreateSharedMemoryAndSemaphores
160- * Creates and initializes shared memory and semaphores.
161- *
162- * This is called by the postmaster or by a standalone backend.
163- * It is also called by a backend forked from the postmaster in the
164- * EXEC_BACKEND case. In the latter case, the shared memory segment
165- * already exists and has been physically attached to, but we have to
166- * initialize pointers in local memory that reference the shared structures,
167- * because we didn't inherit the correct pointer values from the postmaster
168- * as we do in the fork() scenario. The easiest way to do that is to run
169- * through the same code as before. (Note that the called routines mostly
170- * check IsUnderPostmaster, rather than EXEC_BACKEND, to detect this case.
171- * This is a bit code-wasteful and could be cleaned up.)
172- */
173-void
174-CreateSharedMemoryAndSemaphores(void)
175-{
176- PGShmemHeader *shim = NULL;
177-
178- if (!IsUnderPostmaster)
179- {
180- PGShmemHeader *seghdr;
181- Size size;
182- int numSemas;
183-
184- /* Compute the size of the shared-memory block */
185- size = CalculateShmemSize(&numSemas);
186: elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
187-
188- /*
189- * Create the shmem segment
190- */
191- seghdr = PGSharedMemoryCreate(size, &shim);
192-
193- InitShmemAccess(seghdr);
194-
195- /*
196- * Create semaphores
197- */
198- PGReserveSemaphores(numSemas);
199-
200- /*
201- * If spinlocks are disabled, initialize emulation layer (which
202- * depends on semaphores, so the order is important here).
203- */
204-#ifndef HAVE_SPINLOCKS
205- SpinlockSemaInit();
206-#endif
207- }
208- else
209- {
210- /*
211- * We are reattaching to an existing shared memory segment. This
212- * should only be reached in the EXEC_BACKEND case.
213- */
214-#ifndef EXEC_BACKEND
215- elog(PANIC, "should be attached to shared memory already");
216-#endif
217- }
218-
219- /*
220- * Set up shared memory allocation mechanism
221- */




Re: index prefetching

2024-01-19 Thread Tomas Vondra



On 1/19/24 16:19, Konstantin Knizhnik wrote:
> 
> On 18/01/2024 5:57 pm, Tomas Vondra wrote:
>> On 1/16/24 21:10, Konstantin Knizhnik wrote:
>>> ...
>>>
 4. I think that performing prefetch at executor level is really great
> idea and so prefetch can be used by all indexes, including custom
> indexes. But prefetch will be efficient only if index can provide fast
> access to next TID (located at the same page). I am not sure that
> it is
> true for all builtin indexes (GIN, GIST, BRIN,...) and especially for
> custom AM. I wonder if we should extend AM API to make index make a
> decision weather to perform prefetch of TIDs or not.
 I'm not against having a flag to enable/disable prefetching, but the
 question is whether doing prefetching for such indexes can be harmful.
 I'm not sure about that.
>>> I tend to agree with you - it is hard to imagine index implementation
>>> which doesn't win from prefetching heap pages.
>>> May be only the filtering case you have mentioned. But it seems to me
>>> that current B-Tree index scan (not IOS) implementation in Postgres
>>> doesn't try to use index tuple to check extra condition - it will fetch
>>> heap tuple in any case.
>>>
>> That's true, but that's why I started working on this:
>>
>> https://commitfest.postgresql.org/46/4352/
>>
>> I need to think about how to combine that with the prefetching. The good
>> thing is that both changes require fetching TIDs, not slots. I think the
>> condition can be simply added to the prefetch callback.
>>
>>
>> regards
>>
> Looks like I was wrong: even if it is not an index-only scan, if the index
> condition involves only index attributes, then the heap is not accessed
> until we find a tuple satisfying the search condition.
> The inclusive-index case described above
> (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO an
> exotic case. If the keys are actually used in the search, then why not create a
> normal compound index instead?
> 

Not sure I follow ...

Firstly, I'm not convinced the example addressed by that other patch is
that exotic. IMHO it's quite possible it's actually quite common, but
the users do no realize the possible gains.

Also, there are reasons to not want very wide indexes - it has overhead
associated with maintenance, disk space, etc. I think it's perfectly
rational to design indexes in a way that eliminates most heap fetches
necessary to evaluate conditions, but does not guarantee IOS (so the
last heap fetch is still needed).

What do you mean by "create normal compound index"? The patch addresses
a limitation that not every condition can be translated into a proper
scan key. Even if we improve this, there will always be such conditions.
While an IOS can evaluate them on the index tuple, the regular index scan
can't do that (currently).

Can you share an example demonstrating the alternative approach?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

2024-01-19 Thread Kirk Wolak
On Fri, Jan 19, 2024 at 4:20 AM Laurenz Albe 
wrote:

> On Thu, 2024-01-18 at 19:50 -0500, Kirk Wolak wrote:
> >   I did a little more checking and the reason I did not see the link
> MIGHT be because EXPLAIN did not show a JIT attempt.
> > I tried to use settings that FORCE a JIT...  But to no avail.
> >
> >   I am now concerned that the problem is more hidden in my use case.
> Meaning I CANNOT conclude it is fixed.
> > But I know of NO WAY to force a JIT (I lowered costs to 1, etc.  ).
> >
> >   You don't know a way to force at least the JIT analysis to happen?
> (because I already knew if JIT was off, the leak wouldn't happen).
>
> If you set the limits to 0, you can trigger it easily:
>
> SET jit = on;
> SET jit_above_cost = 0;
> SET jit_inline_above_cost = 0;
> SET jit_optimize_above_cost = 0;
>

Okay,
  I did exactly this (just now).  But the EXPLAIN does not show any JIT?

---
 Sort  (cost=5458075.88..5477861.00 rows=7914047 width=12)
   Sort Key: seid
   ->  HashAggregate  (cost=3910139.62..4280784.00 rows=7914047 width=12)
         Group Key: seid, fr_field_name, st_field_name
         Planned Partitions: 128
         ->  Seq Scan on parts  (cost=0.00..1923249.00 rows=29850000 width=12)
               Filter: ((seid <> 497) AND ((partnum)::text >= '1'::text))
(7 rows)

From a FUTURE email, I noticed pg_jit_available()

and it's set to f??

Okay, so does this require a special BUILD command?

postgres=# select pg_jit_available();
 pg_jit_available
--
 f
(1 row)

postgres=# \dconfig *jit*
      List of configuration parameters
        Parameter        |  Value
-------------------------+---------
 jit                     | on
 jit_above_cost          | 100000
 jit_debugging_support   | off
 jit_dump_bitcode        | off
 jit_expressions         | on
 jit_inline_above_cost   | 500000
 jit_optimize_above_cost | 500000
 jit_profiling_support   | off
 jit_provider            | llvmjit
 jit_tuple_deforming     | on
(10 rows)


cleanup patches for dshash

2024-01-19 Thread Nathan Bossart
While working on the dynamic shared memory registry, I noticed a couple of
potential improvements for code that uses dshash tables.

* A couple of dshash_create() callers pass in 0 for the "void *arg"
  parameter, which seemed weird.  I incorrectly assumed this was some sort
  of flags parameter until I actually looked at the function signature.
  IMHO we should specify NULL here if arg is not used.  0001 makes this
  change.  This is admittedly a nitpick.

* There are no dshash compare/hash functions for string keys.  0002
  introduces some that simply forward to strcmp()/string_hash(), and it
  uses them for the DSM registry's dshash table.  This allows us to remove
  the hacky key padding code for lookups on this table.
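
For illustration, a minimal usage sketch with a fixed-size string key
(the entry layout and tranche below are made up for the example; the
actual change in 0002 applies this to the DSM registry table):

``` C
/* Illustrative only: a dshash table keyed by a NUL-terminated name. */
typedef struct NamedEntry
{
	char		name[64];		/* key */
	int			value;
} NamedEntry;

static const dshash_parameters named_params = {
	offsetof(NamedEntry, value),	/* key size: the whole name buffer */
	sizeof(NamedEntry),
	dshash_strcmp,
	dshash_strhash,
	LWTRANCHE_DSM_REGISTRY_HASH		/* placeholder tranche for the example */
};
```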

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 5fe6d05f2cfa1401c2fb967fe4bb52cae3706228 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 19 Jan 2024 15:23:35 -0600
Subject: [PATCH v1 1/2] Use NULL instead of 0 for arg parameter in
 dshash_create().

---
 src/backend/replication/logical/launcher.c | 2 +-
 src/backend/utils/activity/pgstat_shmem.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 122db0bb13..ee66c292bd 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -1013,7 +1013,7 @@ logicalrep_launcher_attach_dshmem(void)
 		last_start_times_dsa = dsa_create(LWTRANCHE_LAUNCHER_DSA);
 		dsa_pin(last_start_times_dsa);
 		dsa_pin_mapping(last_start_times_dsa);
-		last_start_times = dshash_create(last_start_times_dsa, &dsh_params, 0);
+		last_start_times = dshash_create(last_start_times_dsa, &dsh_params, NULL);
 
 		/* Store handles in shared memory for other backends to use. */
 		LogicalRepCtx->last_start_dsa = dsa_get_handle(last_start_times_dsa);
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 3ce48e88a3..71410ddd3f 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -180,7 +180,7 @@ StatsShmemInit(void)
 		 * With the limit in place, create the dshash table. XXX: It'd be nice
 		 * if there were dshash_create_in_place().
 		 */
-		dsh = dshash_create(dsa, &dsh_params, 0);
+		dsh = dshash_create(dsa, &dsh_params, NULL);
 		ctl->hash_handle = dshash_get_hash_table_handle(dsh);
 
 		/* lift limit set above */
-- 
2.25.1

From 6c6760e6553ac86c334e7c35d2ddbac8fdc1cb9b Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 19 Jan 2024 15:38:34 -0600
Subject: [PATCH v1 2/2] Add hash/compare functions for dshash tables with
 string keys.

---
 src/backend/lib/dshash.c   | 23 +++
 src/backend/storage/ipc/dsm_registry.c |  8 +++-
 src/include/lib/dshash.h   |  4 
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b0bc0abda0..e497238e8f 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -583,6 +583,29 @@ dshash_memhash(const void *v, size_t size, void *arg)
 	return tag_hash(v, size);
 }
 
+/*
+ * A compare function that forwards to strcmp.
+ */
+int
+dshash_strcmp(const void *a, const void *b, size_t size, void *arg)
+{
+	Assert(strlen((const char *) a) < size);
+	Assert(strlen((const char *) b) < size);
+
+	return strcmp((const char *) a, (const char *) b);
+}
+
+/*
+ * A hash function that forwards to string_hash.
+ */
+dshash_hash
+dshash_strhash(const void *v, size_t size, void *arg)
+{
+	Assert(strlen((const char *) v) < size);
+
+	return string_hash((const char *) v, size);
+}
+
 /*
  * Sequentially scan through dshash table and return all the elements one by
  * one, return NULL when all elements have been returned.
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
index ac11f51375..9dce41e8ab 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -50,8 +50,8 @@ typedef struct DSMRegistryEntry
 static const dshash_parameters dsh_params = {
 	offsetof(DSMRegistryEntry, handle),
 	sizeof(DSMRegistryEntry),
-	dshash_memcmp,
-	dshash_memhash,
+	dshash_strcmp,
+	dshash_strhash,
 	LWTRANCHE_DSM_REGISTRY_HASH
 };
 
@@ -132,7 +132,6 @@ GetNamedDSMSegment(const char *name, size_t size,
 {
 	DSMRegistryEntry *entry;
 	MemoryContext oldcontext;
-	char		name_padded[offsetof(DSMRegistryEntry, handle)] = {0};
 	void	   *ret;
 
 	Assert(found);
@@ -155,8 +154,7 @@ GetNamedDSMSegment(const char *name, size_t size,
 	/* Connect to the registry. */
 	init_dsm_registry();
 
-	strcpy(name_padded, name);
-	entry = dshash_find_or_insert(dsm_registry_table, name_padded, found);
+	entry = dshash_find_or_insert(dsm_registry_table, name, found);
 	if (!(*found))
 	{
 		/* Initialize the segment. */
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshas

Re: index prefetching

2024-01-19 Thread Melanie Plageman
On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra
 wrote:
>
> On 1/9/24 21:31, Robert Haas wrote:
> > On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
> >  wrote:
> >> Here's a somewhat reworked version of the patch. My initial goal was to
> >> see if it could adopt the StreamingRead API proposed in [1], but that
> >> turned out to be less straight-forward than I hoped, for two reasons:
> >
> > I guess we need Thomas or Andres or maybe Melanie to comment on this.
> >
>
> Yeah. Or maybe Thomas if he has thoughts on how to combine this with the
> streaming I/O stuff.

I've been studying your patch with the intent of finding a way to
change it and or the streaming read API to work together. I've
attached a very rough sketch of how I think it could work.

We fill a queue with blocks from TIDs that we fetched from the index.
The queue is saved in a scan descriptor that is made available to the
streaming read callback. Once the queue is full, we invoke the table
AM specific index_fetch_tuple() function which calls
pg_streaming_read_buffer_get_next(). When the streaming read API
invokes the callback we registered, it simply dequeues a block number
for prefetching. The only change to the streaming read API is that
now, even if the callback returns InvalidBlockNumber, we may not be
finished, so make it resumable.

Structurally, this changes the timing of when the heap blocks are
prefetched. Your code would get a tid from the index and then prefetch
the heap block -- doing this until it filled a queue that had the
actual tids saved in it. With my approach and the streaming read API,
you fetch tids from the index until you've filled up a queue of block
numbers. Then the streaming read API will prefetch those heap blocks.
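
To make that concrete, here is a rough sketch of such a callback. The
queue type, field names, and the exact callback signature are assumptions
for illustration only, not code from the attached patches:

``` C
#define PFBLOCK_QUEUE_SIZE 32

/* Illustrative ring of heap block numbers filled from index TIDs. */
typedef struct IndexScanBlockQueue
{
	BlockNumber blocks[PFBLOCK_QUEUE_SIZE];
	int			head;			/* next slot to dequeue */
	int			tail;			/* next slot to enqueue */
} IndexScanBlockQueue;

/* Streaming read callback: hand back the next queued heap block. */
static BlockNumber
index_scan_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
					 void *per_buffer_data)
{
	IndexScanBlockQueue *queue = (IndexScanBlockQueue *) pgsr_private;

	/* Queue drained for now; the scan may refill it and resume later. */
	if (queue->head == queue->tail)
		return InvalidBlockNumber;

	return queue->blocks[queue->head++ % PFBLOCK_QUEUE_SIZE];
}
```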

I didn't actually implement the block queue -- I just saved a single
block number and pretended it was a block queue. I was imagining we
replace this with something like your IndexPrefetch->blockItems --
which has light deduplication. We'd probably have to flesh it out more
than that.

There are also table AM layering violations in my sketch which would
have to be worked out (not to mention some resource leakage I didn't
bother investigating [which causes it to fail tests]).

0001 is all of Thomas' streaming read API code that isn't yet in
master and 0002 is my rough sketch of index prefetching using the
streaming read API

There are also numerous optimizations that your index prefetching
patch set does that would need to be added in some way. I haven't
thought much about it yet. I wanted to see what you thought of this
approach first. Basically, is it workable?

- Melanie
From 31a0b829b3aca31542dc3236b408f1e86133aea7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 19 Jan 2024 16:10:30 -0500
Subject: [PATCH v1 2/2] use streaming reads in index scan

---
 src/backend/access/heap/heapam_handler.c | 14 +++-
 src/backend/access/index/indexam.c   |  2 +
 src/backend/executor/nodeIndexscan.c | 83 
 src/backend/storage/aio/streaming_read.c | 10 ++-
 src/include/access/relscan.h |  6 ++
 src/include/storage/streaming_read.h |  2 +
 6 files changed, 101 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be7..0ef5f824546 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -127,9 +127,17 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
/* Switch to correct buffer if we don't have it already */
Buffer  prev_buf = hscan->xs_cbuf;
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
-											  ItemPointerGetBlockNumber(tid));
+		if (scan->pgsr)
+		{
+			hscan->xs_cbuf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (!BufferIsValid(hscan->xs_cbuf))
+				return false;
+		}
+		else
+			hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
+												  hscan->xs_base.rel,
+												  ItemPointerGetBlockNumber(tid));
+
 
/*
 * Prune page, but only if we weren't already on this page
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 63dff101e29..c118cc3861f 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -237,6 +237,8 @@ index_beginscan(Relation heapRelation,
 
/* prepare to fetch index matches from table */
 

Re: Change GUC hashtable to use simplehash?

2024-01-19 Thread Jeff Davis
On Fri, 2024-01-19 at 14:27 +0700, John Naylor wrote:
> Pushed that way, thanks!

Thank you.

One post-commit question on 0aba255440: why do
haszero64(pg_bswap64(chunk)) rather than just haszero64(chunk)? How
does byteswapping affect whether a zero byte exists or not?

Regards,
Jeff Davis





Re: Patch: Improve Boolean Predicate JSON Path Docs

2024-01-19 Thread Tom Lane
I wrote:
> I experimented with

> SELECT '
>   ... multiline json value ...
> ' AS json
> \gexec

> but that didn't seem to work either.  Anybody have a better idea?

Oh, never mind, \gset is what I was reaching for.  We can make
it work with that.

regards, tom lane




Re: Patch: Improve Boolean Predicate JSON Path Docs

2024-01-19 Thread Tom Lane
"David E. Wheeler"  writes:
> [ v7-0001-Improve-boolean-predicate-JSON-Path-docs.patch ]

I started to review this, and got bogged down at

@@ -17203,9 +17214,12 @@ array w/o UK? | t
 
   
For example, suppose you have some JSON data from a GPS tracker that you
-   would like to parse, such as:
+   would like to parse, such as this JSON, set up as a
+   psql
+   \set variable for use as 
:'json'
+   in the examples below:
 
-{
+ \set json '{
   "track": {
 "segments": [
   {

I find the textual change rather unwieldy, but the bigger problem is
that this example doesn't actually work.  If you try to copy-and-paste
this into psql, you get "unterminated quoted string", because psql
metacommands can't span line boundaries.

Perhaps we could leave the existing display alone, and then add

To follow the examples below, paste this into psql:

\set json '{ "track": { "segments": [ { "location": [ 47.763, 13.4034 ], 
"start time": "2018-10-14 10:05:14", "HR": 73 }, { "location": [ 47.706, 
13.2635 ], "start time": "2018-10-14 10:39:21", "HR": 135 } ] }}'

This will allow :'json' to be expanded into the
above JSON value, plus suitable quoting.

However, I'm not sure that's a great solution, because it's going to
line-wrap on most displays, making copy-and-paste a bit iffy.

I experimented with

SELECT '
  ... multiline json value ...
' AS json
\gexec

but that didn't seem to work either.  Anybody have a better idea?

regards, tom lane




Re: introduce dynamic shared memory registry

2024-01-19 Thread Nathan Bossart
On Wed, Jan 17, 2024 at 08:00:00AM +0530, Bharath Rupireddy wrote:
> The v8 patches look good to me.

Committed.  Thanks everyone for reviewing!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




GUC-ify walsender MAX_SEND_SIZE constant

2024-01-19 Thread Majid Garoosi
Hello all,

Following the threads here and there, I decided to submit this patch.

Following is the description which is also written in the commit message:
MAX_SEND_SIZE parameter was used in WALSender to limit maximum size of
a WAL data packet sent to a WALReceiver during replication. Although
its default value (128kB) was a reasonable value, it was written in
2010. Since hardware has changed a lot since then, a PostgreSQL
user might need to customize this value.
For example, if a database's disk has a high bandwidth and a high
latency at the same time, it makes more sense to read bigger chunks of
data from disk in each iteration. One example of such disks is a remote
disk. (e.g. an RBD volume)
However, this value does not need to be larger than wal_segment_size,
thus its checker function returns false if a larger value is set for
this.

This is my first patch. So, I hope I haven't done something wrong. :'D

Best regards
Majid
From 218a925d1161878c888bef8adb3f246ec2b44d12 Mon Sep 17 00:00:00 2001
From: Majid Garoosi 
Date: Fri, 19 Jan 2024 22:48:23 +0330
Subject: [PATCH v1] GUC-ify MAX_SEND_SIZE constant in WALSender

MAX_SEND_SIZE parameter was used in WALSender to limit maximum size of
a WAL data packet sent to a WALReceiver during replication. Although
its default value (128kB) was a reasonable value, it was written in
2010. Since hardware has changed a lot since then, a PostgreSQL
user might need to customize this value.
For example, if a database's disk has a high bandwidth and a high
latency at the same time, it makes more sense to read bigger chunks of
data from disk in each iteration. One example of such disks is a remote
disk. (e.g. an RBD volume)
However, this value does not need to be larger than wal_segment_size,
thus its checker function returns false if a larger value is set for
this.
---
 src/backend/replication/walsender.c | 18 +-
 src/backend/utils/init/postinit.c   | 11 +++
 src/backend/utils/misc/guc_tables.c | 13 +
 src/include/replication/walsender.h |  1 +
 src/include/utils/guc_hooks.h   |  1 +
 5 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 087031e9dc..0299859d97 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -96,17 +96,6 @@
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
 
-/*
- * Maximum data payload in a WAL data message.  Must be >= XLOG_BLCKSZ.
- *
- * We don't have a good idea of what a good value would be; there's some
- * overhead per message in both walsender and walreceiver, but on the other
- * hand sending large batches makes walsender less responsive to signals
- * because signals are checked only between messages.  128kB (with
- * default 8k blocks) seems like a reasonable guess for now.
- */
-#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
-
 /* Array of WalSnds in shared memory */
 WalSndCtlData *WalSndCtl = NULL;
 
@@ -124,6 +113,9 @@ int			max_wal_senders = 10;	/* the maximum number of concurrent
 	 * walsenders */
 int			wal_sender_timeout = 60 * 1000; /* maximum time to send one WAL
 			 * data message */
+int wal_sender_max_send_size = XLOG_BLCKSZ * 16; /* Maximum data
+  * payload in a WAL
+  * data message */
 bool		log_replication_commands = false;
 
 /*
@@ -3087,7 +3079,7 @@ XLogSendPhysical(void)
 	 */
 	startptr = sentPtr;
 	endptr = startptr;
-	endptr += MAX_SEND_SIZE;
+	endptr += wal_sender_max_send_size;
 
 	/* if we went beyond SendRqstPtr, back off */
 	if (SendRqstPtr <= endptr)
@@ -3106,7 +3098,7 @@ XLogSendPhysical(void)
 	}
 
 	nbytes = endptr - startptr;
-	Assert(nbytes <= MAX_SEND_SIZE);
+	Assert(nbytes <= wal_sender_max_send_size);
 
 	/*
 	 * OK to read and send the slice.
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1ad3367159..8d61b006c4 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -617,6 +617,17 @@ check_max_wal_senders(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * GUC check_hook for wal_sender_max_send_size
+ */
+bool
+check_wal_sender_max_send_size(int *newval, void **extra, GucSource source)
+{
+	if (*newval > wal_segment_size)
+		return false;
+	return true;
+}
+
 /*
  * Early initialization of a backend (either standalone or under postmaster).
  * This happens even before InitPostgres.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e53ebc6dc2..76f755cbd3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2892,6 +2892,19 @@ struct confi

Improving EXPLAIN's display of SubPlan nodes

2024-01-19 Thread Tom Lane
EXPLAIN has always been really poor at displaying SubPlan nodes
in expressions: you don't get much more than "(SubPlan N)".
This is mostly because every time I thought about it, I despaired
of trying to represent all the information in a SubPlan accurately.
However, a recent discussion[1] made me realize that we could do
a lot better just by displaying the SubLinkType and the testexpr
(if relevant).  So here's a proposed patch.  You can see what
it does by examining the regression test changes.

There's plenty of room to bikeshed about exactly how to display
this stuff, and I'm open to suggestions.

BTW, I was somewhat depressed to discover that we have exactly
zero regression coverage of the ROWCOMPARE_SUBLINK code paths;
not only was EXPLAIN output not covered, but the planner and
executor too.  So I added some simple tests for that.  Otherwise
I think existing coverage is enough for this.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/149c5c2f-4267-44e3-a177-d1fd24c53f6d%40universite-paris-saclay.fr

From 30d4c77641876db2ffd81a188bd2e41856a7a99e Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Fri, 19 Jan 2024 14:16:09 -0500
Subject: [PATCH v1] Improve EXPLAIN's display of SubPlan nodes.

Represent the SubLinkType as best we can, and show the testexpr
where relevant.  To aid in interpreting testexprs, add the output
parameter IDs to the subplan name for subplans as well as initplans.
---
 .../postgres_fdw/expected/postgres_fdw.out|  12 +-
 src/backend/optimizer/plan/subselect.c|  30 ++--
 src/backend/utils/adt/ruleutils.c |  45 +-
 src/test/regress/expected/aggregates.out  |   2 +-
 src/test/regress/expected/insert_conflict.out |   2 +-
 src/test/regress/expected/join.out|  24 ++--
 src/test/regress/expected/memoize.out |   2 +-
 src/test/regress/expected/rowsecurity.out |  56 
 src/test/regress/expected/select_parallel.out |  22 +--
 src/test/regress/expected/subselect.out   | 135 --
 src/test/regress/expected/updatable_views.out |  24 ++--
 src/test/regress/sql/subselect.sql|  14 ++
 12 files changed, 241 insertions(+), 127 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d83f6ae8cb..2b5e56cae9 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3264,14 +3264,14 @@ select sum(c1) filter (where (c1 / c1) * random() <= 1) from ft1 group by c2 ord
 
 explain (verbose, costs off)
 select sum(c2) filter (where c2 in (select c2 from ft1 where c2 < 5)) from ft1;
-QUERY PLAN 

+ QUERY PLAN  
+-
  Aggregate
-   Output: sum(ft1.c2) FILTER (WHERE (hashed SubPlan 1))
+   Output: sum(ft1.c2) FILTER (WHERE (ANY (ft1.c2 = $0) FROM HASHED SubPlan 1 (returns $0)))
->  Foreign Scan on public.ft1
  Output: ft1.c2
  Remote SQL: SELECT c2 FROM "S 1"."T 1"
-   SubPlan 1
+   SubPlan 1 (returns $0)
  ->  Foreign Scan on public.ft1 ft1_1
Output: ft1_1.c2
Remote SQL: SELECT c2 FROM "S 1"."T 1" WHERE ((c2 < 5))
@@ -11899,8 +11899,8 @@ SELECT a FROM base_tbl WHERE a IN (SELECT a FROM foreign_tbl);
 -
  Seq Scan on public.base_tbl
Output: base_tbl.a
-   Filter: (SubPlan 1)
-   SubPlan 1
+   Filter: (ANY (base_tbl.a = $1) FROM SubPlan 1 (returns $1))
+   SubPlan 1 (returns $1)
  ->  Result
Output: base_tbl.a
->  Append
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3115d79ad9..9ceac6234d 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -326,6 +326,7 @@ build_subplan(PlannerInfo *root, Plan *plan, PlannerInfo *subroot,
 	Node	   *result;
 	SubPlan*splan;
 	bool		isInitPlan;
+	StringInfoData splanname;
 	ListCell   *lc;
 
 	/*
@@ -560,22 +561,31 @@ build_subplan(PlannerInfo *root, Plan *plan, PlannerInfo *subroot,
    splan->plan_id);
 
 	/* Label the subplan for EXPLAIN purposes */
-	splan->plan_name = palloc(32 + 12 * list_length(splan->setParam));
-	sprintf(splan->plan_name, "%s %d",
-			isInitPlan ? "InitPlan" : "SubPlan",
-			splan->plan_id);
+	initStringInfo(&splanname);
+	appendStringInfo(&splanname, "%s %d",
+	 isInitPlan ? "InitPlan" : "SubPlan",
+	 splan->plan_id);
 	if (splan->setParam)
 	{
-		char	   *ptr = splan->plan_name + strlen(splan->plan_name);
-
-		ptr += sprintf(ptr, " (returns ");
+		appendStringInfoString(&splanname, " (returns ");
 		foreach(lc, splan->setParam)

Re: PG12 change to DO UPDATE SET column references

2024-01-19 Thread David G. Johnston
On Fri, Jan 19, 2024 at 10:01 AM James Coleman  wrote:

> Making this more confusing is the fact that if I want to do something
> like "SET bar = foo.bar + 1" the table qualification cannot be present
> on the setting column but is required on the reading column.
>
> There isn't anything in the docs that I see about this, and I don't
> see anything scanning the release notes for PG12 either (though I
> could have missed a keyword to search for).
>
>
https://www.postgresql.org/docs/12/sql-insert.html

"When referencing a column with ON CONFLICT DO UPDATE, do not include the
table's name in the specification of a target column. For example, INSERT
INTO table_name ... ON CONFLICT DO UPDATE SET table_name.col = 1 is invalid
(this follows the general behavior for UPDATE)."

The same text exists for v11.

David J.


Re: UUID v7

2024-01-19 Thread Andrey Borodin


> On 19 Jan 2024, at 13:25, Andrey Borodin  wrote:
> 
> Also, I've added some documentation on all functions.

Here's v12. Changes:
1. Documentation improvements
2. Code comments
3. Better commit message and reviews list

Best regards, Andrey Borodin.



v12-0001-Implement-UUID-v7.patch
Description: Binary data


Re: reorganize "Shared Memory and LWLocks" section of docs

2024-01-19 Thread Nathan Bossart
On Wed, Jan 17, 2024 at 06:48:37AM +0530, Bharath Rupireddy wrote:
> LGTM.

Committed.  Thanks for reviewing!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add system identifier to backup manifest

2024-01-19 Thread Amul Sul
On Thu, Jan 18, 2024 at 6:39 AM Sravan Kumar 
wrote:

> I have also done a review of the patch and some testing. The patch looks
> good, and I agree with Robert's comments.
>

Thank you for your review, testing and the offline discussion.

Regards,
Amul


Re: Build versionless .so for Android

2024-01-19 Thread Matthias Kuhn
Thanks,

This patch brought it further and indeed, the resulting libpq.so file is
unversioned when built for Android while it's versioned when built for
linux.

Matthias

On Fri, Jan 19, 2024 at 3:00 PM Peter Eisentraut 
wrote:

> On 19.01.24 11:08, Matthias Kuhn wrote:
> > When trying to build with meson, including the patch which was provided
> > by Andres Freund (thanks),
> > I am currently stuck with the following error:
> >
> > Configuring pg_config_ext.h using configuration
> >
> > ../src/tgresql-16-685bc9fc97.clean/src/include/meson.build:12:0: ERROR:
> > File port/android.h does not exist.
>
> I think there is actually small bug in the meson setup.
>
> In the top-level meson.build, line 166 has
>
>  portname = host_system
>
> but then later around line 190, host_system is changed for some cases
> (including the android case that was added in the proposed patch).  So
> that won't work, and probably also won't work for the existing cases
> there.  The portname assignment needs to be moved later in the file.
> Maybe you can try that on your own.
>
>


Re: Add system identifier to backup manifest

2024-01-19 Thread Amul Sul
On Wed, Jan 17, 2024 at 8:40 PM Robert Haas  wrote:

> On Wed, Jan 17, 2024 at 6:31 AM Amul Sul  wrote:
> > With the attached patch, the backup manifest will have a new key item as
> > "System-Identifier" 64-bit integer whose value is derived from
> pg_control while
> > generating it, and the manifest version bumps to 2.
> >
> > This helps to identify the correct database server and/or backup for the
> > subsequent backup operations.  pg_verifybackup validates the manifest
> system
> > identifier against the backup control file and fails if they don’t match.
> > Similarly, pg_basebackup increment backup will fail if the manifest
> system
> > identifier does not match with the server system identifier.  The
> > pg_combinebackup is already a bit smarter -- checks the system
> identifier from
> > the pg_control of all the backups, with this patch the manifest system
> > identifier also validated.
>
> Thanks for working on this. Without this, I think what happens is that
> you can potentially take an incremental backup from the "wrong"
> server, if the states of the systems are such that all of the other
> sanity checks pass. When you run pg_combinebackup, it'll discover the
> problem and tell you, but you ideally want to discover such errors at
> backup time rather than at restore time. This addresses that. And,
> overall, I think it's a pretty good patch. But I nonetheless have a
> bunch of comments.
>

Thank you for the review.


>
> -  The associated value is always the integer 1.
> +  The associated value is the integer, either 1 or 2.
>
> is an integer. Beginning in PostgreSQL 17,
> it is 2; in older versions, it is 1.
>

Ok,


> + context.identity_cb = manifest_process_identity;
>
> I'm not really on board with calling the system identifier "identity"
> throughout the patch. I think it should just say system_identifier. If
> we were going to abbreviate, I'd prefer something like "sysident" that
> looks like it's derived from "system identifier" rather than
> "identity" which is a different word altogether. But I don't think we
> should abbreviate unless doing so creates *ridiculously* long
> identifier names.
>

Ok, used "system identifier" at all the places.


> +static void
> +manifest_process_identity(JsonManifestParseContext *context,
> +   int manifest_version,
> +   uint64 manifest_system_identifier)
> +{
> + uint64 system_identifier;
> +
> + /* Manifest system identifier available in version 2 or later */
> + if (manifest_version == 1)
> + return;
>
> I think you've got the wrong idea here. I think this function would
> only get called if System-Identifier is present in the manifest, so if
> it's a v1 manifest, this would never get called, so this if-statement
> would not ever do anything useful. I think what you should do is (1)
> if the client supplies a v1 manifest, reject it, because surely that's
> from an older server version that doesn't support incremental backup;
> but do that when the version is parsed rather than here; and (2) also
> detect and reject the case when it's supposedly a v2 manifest but this
> is absent.
>
> (1) should really be done when the version number is parsed, so I
> suspect you may need to add manifest_version_cb.
>
> +static void
> +combinebackup_identity_cb(JsonManifestParseContext *context,
> +   int manifest_version,
> +   uint64 manifest_system_identifier)
> +{
> + parser_context *private_context = context->private_data;
> + uint64 system_identifier = private_context->system_identifier;
> +
> + /* Manifest system identifier available in version 2 or later */
> + if (manifest_version == 1)
> + return;
>
> Very similar to the above case. Just reject a version 1 manifest as
> soon as we see the version number. In this function, set a flag
> indicating we saw the system identifier; if at the end of parsing that
> flag is not set, kaboom.
>

Ok, I added a version_cb callback. Using this pg_combinebackup &
pg_basebackup
will report an error for manifest version 1, whereas pg_verifybackup
doesn't (not needed IIUC).
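
The version check itself ends up looking roughly like this (a minimal
sketch only; the callback name and signature here are illustrative and
the actual callback in the patch may differ):

``` C
/* Reject a v1 manifest as soon as the version number is parsed. */
static void
combinebackup_version_cb(JsonManifestParseContext *context,
						 int manifest_version)
{
	if (manifest_version == 1)
		pg_fatal("backup manifest version 1 does not support incremental backup");
}
```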


>
> - parse_manifest_file(manifest_path, &context.ht, &first_wal_range);
> + parse_manifest_file(manifest_path, &context.ht, &first_wal_range,
> + context.backup_directory);
>
> Don't do this! parse_manifest_file() should just record everything
> found in the manifest in the context object. Syntax validation should
> happen while parsing the manifest (e.g. "CAT/DOG" is not a valid LSN
> and we should reject that at this stage) but semantic validation
> should happen later (e.g. "0/0" can't be a the correct backup end LSN
> but we don't figure that out while parsing, but rather later). I think
> you should actually move validation of the system identifier to the
> point where the directory walk encounters the control file (and update
> the docs and tests to match that decision). Imagine if you wanted to
> validate a tar-format backup; then you wouldn't have random access to
> the directory. You'd see the manifest file first, and then all the
> fil

PG12 change to DO UPDATE SET column references

2024-01-19 Thread James Coleman
Hello,

I realize this is almost ancient history at this point, but I ran into
a surprising behavior change from PG11->12 with ON CONFLICT ... DO
UPDATE SET ...

Suppose I have this table:
create table foo (id int primary key);

On PG11 this works:
postgres=# insert into foo (id) values (1) on conflict (id) do update
set foo.id = 1;
INSERT 0 1

But on PG12+ this is the result:
postgres=# insert into foo (id) values (1) on conflict (id) do update
set foo.id = 1;
ERROR:  column "foo" of relation "foo" does not exist
LINE 1: ...oo (id) values (1) on conflict (id) do update set foo.id = 1...

Making this more confusing is the fact that if I want to do something
like "SET bar = foo.bar + 1" the table qualification cannot be present
on the setting column but is required on the reading column.

There isn't anything in the docs that I see about this, and I don't
see anything scanning the release notes for PG12 either (though I
could have missed a keyword to search for).

Was this intended? Or a side effect? And should we explicitly document
the expectations here?

The error is also pretty confusing: when you miss the required
qualification on the read column the error is more understandable:
ERROR:  column reference "bar" is ambiguous

It seems to me that it'd be desirable to either allow the unnecessary
qualification or give an error that's more easily understood.

Regards,
James Coleman




Re: Change GUC hashtable to use simplehash?

2024-01-19 Thread Heikki Linnakangas

On 19/01/2024 09:27, John Naylor wrote:

Pushed that way, thanks! After fixing another typo in big endian
builds, an s390x member reported green, so I think that aspect is
working now. I'll come back to follow-up topics shortly.


Thanks! I started to look at how to use this, and I have some questions. 
I'd like to replace this murmurhash usage in resowner.c with this:



/*
 * Most resource kinds store a pointer in 'value', and pointers are 
unique
 * all on their own.  But some resources store plain integers (Files and
 * Buffers as of this writing), so we want to incorporate the 'kind' in
 * the hash too, otherwise those resources will collide a lot.  But
 * because there are only a few resource kinds like that - and only a 
few
 * resource kinds to begin with - we don't need to work too hard to mix
 * 'kind' into the hash.  Just add it with hash_combine(), it perturbs 
the
 * result enough for our purposes.
 */
#if SIZEOF_DATUM == 8
return hash_combine64(murmurhash64((uint64) value), (uint64) kind);
#else
return hash_combine(murmurhash32((uint32) value), (uint32) kind);
#endif


The straightforward replacement would be:

fasthash_state hs;

fasthash_init(&hs, sizeof(Datum), 0);
fasthash_accum(&hs, (char *) &kind, sizeof(ResourceOwnerDesc *));
fasthash_accum(&hs, (char *) &value, sizeof(Datum));
return fasthash_final32(&hs, 0);

But I wonder if it would be OK to abuse the 'seed' and 'tweak' 
parameters to the init and final functions instead, like this:


fasthash_state hs;

fasthash_init(&hs, sizeof(Datum), (uint64) kind);
return fasthash_final32(&hs, (uint64) value);

I couldn't find any guidance on what properties the 'seed' and 'tweak' 
have, compared to just accumulating the values with accum. Anyone know?


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: initdb's -c option behaves wrong way?

2024-01-19 Thread Tom Lane
Daniel Gustafsson  writes:
> I'll give some more time for opinions, then I'll go ahead with one of the
> patches with a backpatch to v16.

OK, I take back my previous complaint that just using strncasecmp
would be enough --- I was misremembering how the code worked, and
you're right that it would use the spelling from the command line
rather than that from the file.

However, the v3 patch is flat broken.  You can't assume you have
found a match until you've verified that whitespace and '='
appear next --- otherwise, you'll be fooled by a guc_name that
is a prefix of one that appears in the file.  I think the simplest
change that does it correctly is as attached.

Now, since I was the one who wrote the existing code, I freely
concede that I may have too high an opinion of its readability.
Maybe some larger refactoring is appropriate.  But I didn't
find that what you'd done improved the readability.  I'd still
rather keep the newline-assembly code together as much as possible.
Maybe we should do the search part before we build any of newline?

regards, tom lane

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ac409b0006..6a620c9d5f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -484,6 +484,7 @@ replace_guc_value(char **lines, const char *guc_name, const char *guc_value,
 	for (i = 0; lines[i]; i++)
 	{
 		const char *where;
+		const char *namestart;
 
 		/*
 		 * Look for a line assigning to guc_name.  Typically it will be
@@ -494,15 +495,19 @@ replace_guc_value(char **lines, const char *guc_name, const char *guc_value,
 		where = lines[i];
 		while (*where == '#' || isspace((unsigned char) *where))
 			where++;
-		if (strncmp(where, guc_name, namelen) != 0)
+		if (strncasecmp(where, guc_name, namelen) != 0)
 			continue;
+		namestart = where;
 		where += namelen;
 		while (isspace((unsigned char) *where))
 			where++;
 		if (*where != '=')
 			continue;
 
-		/* found it -- append the original comment if any */
+		/* found it -- let's use the canonical casing shown in the file */
+		memcpy(&newline->data[mark_as_comment ? 1 : 0], namestart, namelen);
+
+		/* now append the original comment if any */
 		where = strrchr(where, '#');
 		if (where)
 		{
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 03376cc0f7..413a5eca67 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -199,4 +199,17 @@ command_fails(
 command_fails([ 'initdb', '--no-sync', '--set', 'foo=bar', "$tempdir/dataX" ],
 	'fails for invalid --set option');
 
+# Make sure multiple invocations of -c parameters are added case insensitive
+command_ok(
+	[
+		'initdb', '-cwork_mem=128', '-cWork_Mem=256', '-cWORK_MEM=512',
+		"$tempdir/dataY"
+	],
+	'multiple -c options with different case');
+
+my $conf = slurp_file("$tempdir/dataY/postgresql.conf");
+ok($conf !~ qr/^WORK_MEM = /m, "WORK_MEM should not be configured");
+ok($conf !~ qr/^Work_Mem = /m, "Work_Mem should not be configured");
+ok($conf =~ qr/^work_mem = 512/m, "work_mem should be in config");
+
 done_testing();


Re: index prefetching

2024-01-19 Thread Konstantin Knizhnik


On 18/01/2024 5:57 pm, Tomas Vondra wrote:

On 1/16/24 21:10, Konstantin Knizhnik wrote:

...


4. I think that performing prefetch at executor level is really great

idea and so prefetch can be used by all indexes, including custom
indexes. But prefetch will be efficient only if index can provide fast
access to next TID (located at the same page). I am not sure that it is
true for all builtin indexes (GIN, GIST, BRIN,...) and especially for
custom AM. I wonder if we should extend AM API to make index make a
decision weather to perform prefetch of TIDs or not.

I'm not against having a flag to enable/disable prefetching, but the
question is whether doing prefetching for such indexes can be harmful.
I'm not sure about that.

I tend to agree with you - it is hard to imagine an index implementation
which doesn't win from prefetching heap pages.
Maybe only the filtering case you have mentioned. But it seems to me
that the current B-Tree index scan (not IOS) implementation in Postgres
doesn't try to use the index tuple to check extra conditions - it will fetch
the heap tuple in any case.


That's true, but that's why I started working on this:

https://commitfest.postgresql.org/46/4352/

I need to think about how to combine that with the prefetching. The good
thing is that both changes require fetching TIDs, not slots. I think the
condition can be simply added to the prefetch callback.


regards

Looks like I was wrong: even if it is not an index-only scan but the index
condition involves only index attributes, the heap is not accessed
until we find a tuple satisfying the search condition.
The inclusive index case described above
(https://commitfest.postgresql.org/46/4352/) is interesting but IMHO an
exotic case. If the keys are actually used in the search, then why not
create a normal compound index instead?




[PATCH] pg_receivewal skip WAL file size checking

2024-01-19 Thread Sergey Sergey
We have a huge number of WAL files at our backup site. A listing of the
directory takes several seconds. During startup pg_receivewal checks the size
of all these files.  It does not check file integrity or gaps between
files. This takes several hours for our setup.
I have added an option that skips this file size check. The default behavior
remains the same.
The patch looks huge due to the large re-indentation of a code block. Actually
it consists of adding the option and one if-branch.
diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c
index d0a4079d50..6b8cf8fd58 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -51,6 +51,7 @@ static bool slot_exists_ok = false;
 static bool do_drop_slot = false;
 static bool do_sync = true;
 static bool synchronous = false;
+static bool skip_filesize_check = false;
 static char *replication_slot = NULL;
 static pg_compress_algorithm compression_algorithm = PG_COMPRESSION_NONE;
 static XLogRecPtr endpos = InvalidXLogRecPtr;
@@ -103,6 +104,7 @@ usage(void)
 	printf(_("\nOptional actions:\n"));
 	printf(_("  --create-slot  create a new replication slot (for the slot's name see --slot)\n"));
 	printf(_("  --drop-slotdrop the replication slot (for the slot's name see --slot)\n"));
+	printf(_("  --skip-filesize-checkDon't check file sizes at directory specified by -D option \n"));
 	printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
 	printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
 }
@@ -291,169 +293,172 @@ FindStreamingStart(uint32 *tli)
 		 */
 		XLogFromFileName(dirent->d_name, &tli, &segno, WalSegSz);
 
-		/*
-		 * Check that the segment has the right size, if it's supposed to be
-		 * completed.  For non-compressed segments just check the on-disk size
-		 * and see if it matches a completed segment.  For gzip-compressed
-		 * segments, look at the last 4 bytes of the compressed file, which is
-		 * where the uncompressed size is located for files with a size lower
-		 * than 4GB, and then compare it to the size of a completed segment.
-		 * The 4 last bytes correspond to the ISIZE member according to
-		 * http://www.zlib.org/rfc-gzip.html.
-		 *
-		 * For LZ4-compressed segments, uncompress the file in a throw-away
-		 * buffer keeping track of the uncompressed size, then compare it to
-		 * the size of a completed segment.  Per its protocol, LZ4 does not
-		 * store the uncompressed size of an object by default.  contentSize
-		 * is one possible way to do that, but we need to rely on a method
-		 * where WAL segments could have been compressed by a different source
-		 * than pg_receivewal, like an archive_command with lz4.
-		 */
-		if (!ispartial && wal_compression_algorithm == PG_COMPRESSION_NONE)
-		{
-			struct stat statbuf;
-			char		fullpath[MAXPGPATH * 2];
+		if(!skip_filesize_check)
+		{	
+			/*
+			 * Check that the segment has the right size, if it's supposed to be
+			 * completed.  For non-compressed segments just check the on-disk size
+			 * and see if it matches a completed segment.  For gzip-compressed
+			 * segments, look at the last 4 bytes of the compressed file, which is
+			 * where the uncompressed size is located for files with a size lower
+			 * than 4GB, and then compare it to the size of a completed segment.
+			 * The 4 last bytes correspond to the ISIZE member according to
+			 * http://www.zlib.org/rfc-gzip.html.
+			 *
+			 * For LZ4-compressed segments, uncompress the file in a throw-away
+			 * buffer keeping track of the uncompressed size, then compare it to
+			 * the size of a completed segment.  Per its protocol, LZ4 does not
+			 * store the uncompressed size of an object by default.  contentSize
+			 * is one possible way to do that, but we need to rely on a method
+			 * where WAL segments could have been compressed by a different source
+			 * than pg_receivewal, like an archive_command with lz4.
+			 */
+			if (!ispartial && wal_compression_algorithm == PG_COMPRESSION_NONE)
+			{
+struct stat statbuf;
+char		fullpath[MAXPGPATH * 2];
 
-			snprintf(fullpath, sizeof(fullpath), "%s/%s", basedir, dirent->d_name);
-			if (stat(fullpath, &statbuf) != 0)
-pg_fatal("could not stat file \"%s\": %m", fullpath);
+snprintf(fullpath, sizeof(fullpath), "%s/%s", basedir, dirent->d_name);
+if (stat(fullpath, &statbuf) != 0)
+	pg_fatal("could not stat file \"%s\": %m", fullpath);
 
-			if (statbuf.st_size != WalSegSz)
-			{
-pg_log_warning("segment file \"%s\" has incorrect size %lld, skipping",
-			   dirent->d_name, (long long int) statbuf.st_size);
-continue;
+if (statbuf.st_size != WalSegSz)
+{
+	pg_log_warning("segment file \"%s\" has incorrect size %lld, skipping",
+   dirent->d_name, (long long int) statbuf.st_size);
+	continue;
+}
 			}
-		}
-		else if (!ispartial && wal_compression_algorithm == PG_COMPRESSION_GZIP)
-		{
-			int			fd;
-			char		buf[4];
-			int			bytes_out;

Re: initdb's -c option behaves wrong way?

2024-01-19 Thread Daniel Gustafsson
> On 18 Jan 2024, at 05:49, Kyotaro Horiguchi  wrote:
> 
> Thank you for upicking this up.
> 
> At Wed, 17 Jan 2024 23:47:41 +0100, Daniel Gustafsson  wrote 
> in 
>>> On 17 Jan 2024, at 21:33, Tom Lane  wrote:
>>> 
>>> I wrote:
 However ... I don't like the patch much.  It seems to have left
 the code in a rather random state.  Why, for example, didn't you
 keep all the code that constructs the "newline" value together?
>>> 
>>> After thinking about it a bit more, I don't see why you didn't just
>>> s/strncmp/strncasecmp/ and call it good.  The messiness seems to be
>>> a result of your choice to replace the GUC's case as shown in the
>>> file with the case used on the command line, which is not better
>>> IMO.  We don't change our mind about the canonical spelling of a
>>> GUC because somebody varied the case in a SET command.
>> 
>> The original patch was preserving the case of the file throwing away the case
>> from the commandline.  The attached is a slimmed down version which only 
>> moves
>> the assembly of newline to the different cases (replace vs.  new) keeping the
>> rest of the code intact.  Keeping it in one place gets sort of messy too 
>> since
>> it needs to use different values for a replacement and a new variable.  Not
>> sure if there is a cleaner approach?
> 
> Just to clarify, the current code breaks out after processing the
> first matching line. I haven't changed that behavior.

Yup.

> The reason I
> moved the rewrite processing code out of the loop was I wanted to
> avoid adding more lines that are executed only once into the
> loop. However, it is in exchange of additional complexity to remember
> the original spelling of the parameter name. Personally, I believe
> separating the search and rewrite processing is better, but I'm not
> particularly attached to the approach, so either way is fine with me.

I'll give some more time for opinions, then I'll go ahead with one of the
patches with a backpatch to v16.

--
Daniel Gustafsson





Optimize duplicate code and fix memory leak in function fetch_remote_table_info()

2024-01-19 Thread Yongtao Huang
This patch fixes two things in the function fetch_remote_table_info().

(1) *pfree(pub_names.data)* to avoid potential memory leaks.
(2) Two pieces of code do the same work:

``` C
foreach(lc, MySubscription->publications)
{
if (foreach_current_index(lc) > 0)
appendStringInfoString(&pub_names, ", ");
appendStringInfoString(&pub_names,
quote_literal_cstr(strVal(lfirst(lc))));
}
```
and
``` C
foreach_node(String, pubstr, MySubscription->publications)
{
char   *pubname = strVal(pubstr);

if (foreach_current_index(pubstr) > 0)
appendStringInfoString(&pub_names, ", ");

appendStringInfoString(&pub_names, quote_literal_cstr(pubname));
}
```
I want to integrate them into one function `make_pubname_list()` to make the
code neater.
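
Roughly along these lines (just a sketch; see the attached patch for the
actual version):

``` C
/* Build a comma-separated list of quoted publication names into pub_names. */
static void
make_pubname_list(List *publications, StringInfo pub_names)
{
	ListCell   *lc;

	foreach(lc, publications)
	{
		if (foreach_current_index(lc) > 0)
			appendStringInfoString(pub_names, ", ");

		appendStringInfoString(pub_names,
							   quote_literal_cstr(strVal(lfirst(lc))));
	}
}
```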

Thanks for your time.

Regards

Yongtao Huang


0001-Optimize-duplicate-code-and-fix-memory-leak-in-table.patch
Description: Binary data


Re: [PATCH] Exponential backoff for auth_delay

2024-01-19 Thread Abhijit Menon-Sen
At 2024-01-19 15:08:36 +0100, mba...@gmx.net wrote:
>
> I wonder whether maybe auth_delay.max_seconds should either be renamed
> to auth_delay.exponential_backoff_max_seconds (but then it is rather
> long) in order to make it clearer it only applies in that context or
> alternatively just apply to auth_delay.milliseconds as well (though
> that would be somewhat weird).

I think it's OK as-is. The description/docs are pretty clear.

> I wonder though whether this might be a useful message to have at some
> more standard level like INFO?

I don't have a strong opinion about this, but I suspect anyone who is
annoyed enough by repeated authentication failures to use auth_delay
will also be happy to have less noise in the logs about it.

> You mean something like "after 5 minutes, reset the delay to 0 again"?
> I agree that this would probably be useful, but would also make the
> change more complex.

Yes, that's the kind of thing I meant.

I agree that it would make this patch more complex, and I don't think
it's necessary to implement. However, since it's a feature that seems to
go hand-in-hand with exponential backoff in general, it _may_ be good to
mention in the docs that the sleep time for a host is reset only by
successful authentication, not by any timeout. Not sure.

> What I had in mind is that admins would lower auth_delay.milliseconds to
> something like 100 or 125 when exponential_backoff is on

Ah, that makes a lot of sense. Thanks for explaining.

Your new v3 patch looks fine to me. I'm marking it as ready for
committer.

-- Abhijit




Re: pgsql: Clean up role created in new subscription test.

2024-01-19 Thread Peter Eisentraut

On 19.01.24 15:26, Daniel Gustafsson wrote:

On 18 Jan 2024, at 01:57, vignesh C  wrote:



There are a lot of failures in CFBot at [1] with:



More details of the same are available at [2].
Do we need to clean up the objects leftover for the reported issues in the test?


Not really, these should not need cleaning up, and it's quite odd that it only
happens on FreeBSD.  I need to investigate further so I'll mark this waiting on
author in the meantime


Most likely because only the FreeBSD job uses 
ENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS.






Re: Assertion failure with epoch when replaying standby records for 2PC

2024-01-19 Thread Alexander Korotkov
On Wed, Jan 17, 2024 at 11:08 PM Alexander Korotkov
 wrote:
> On Wed, Jan 17, 2024 at 7:47 AM Michael Paquier  wrote:
> > rorqual has failed today with a very interesting failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-01-17%2005%3A06%3A31
> >
> > This has caused an assertion failure for a 2PC transaction when
> > replaying one of the tests from the main regression suite:
> > 2024-01-17 05:08:23.143 UTC [3242608] DETAIL:  Last completed transaction 
> > was at log time 2024-01-17 05:08:22.920244+00.
> > TRAP: failed Assert("epoch > 0"), File: 
> > "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610
> > postgres: standby_1: startup recovering 
> > 0001000C(ExceptionalCondition+0x83)[0x55746c7838c1]
> > postgres: standby_1: startup recovering 
> > 0001000C(+0x194f0e)[0x55746c371f0e]
> > postgres: standby_1: startup recovering 
> > 0001000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120]
> > postgres: standby_1: startup recovering 
> > 0001000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357]
> > postgres: standby_1: startup recovering 
> > 0001000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5]
> > postgres: standby_1: startup recovering 
> > 0001000C(standby_redo+0xe1)[0x55746c621490]
> > postgres: standby_1: startup recovering 
> > 0001000C(PerformWalRecovery+0xa5e)[0x55746c392404]
> > postgres: standby_1: startup recovering 
> > 0001000C(StartupXLOG+0x3ac)[0x55746c3862b8]
> > postgres: standby_1: startup recovering 
> > 0001000C(StartupProcessMain+0xd9)[0x55746c5a60f6]
> > postgres: standby_1: startup recovering 
> > 0001000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd]
> > postgres: standby_1: startup recovering 
> > 0001000C(+0x3c4235)[0x55746c5a1235]
> > postgres: standby_1: startup recovering 
> > 0001000C(PostmasterMain+0x1401)[0x55746c5a5a10]
> > postgres: standby_1: startup recovering 
> > 0001000C(main+0x835)[0x55746c4e90ce]
> > /lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca]
> > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785]
> > postgres: standby_1: startup recovering 
> > 0001000C(_start+0x21)[0x55746c2b61d1]
> >
> > This refers to the following in twophase.c with
> > AdjustToFullTransactionId():
> > nextXid = XidFromFullTransactionId(nextFullXid);
> > epoch = EpochFromFullTransactionId(nextFullXid);
> >
> > if (unlikely(xid > nextXid))
> > {
> > /* Wraparound occurred, must be from a prev epoch. */
> > Assert(epoch > 0);
> > epoch--;
> > }
> >
> > This would mean that we've found a way to get a negative epoch, which
> > should not be possible.
> >
> > Alexander, you have added this code in 5a1dfde8334b when switching the
> > 2PC file names to use FullTransactionIds.  Could you check please?
>
> Thank you for reporting!  I'm going to look at this in the next couple of 
> days.

Oh, that is a forgotten piece I've already discovered.
https://www.postgresql.org/message-id/CAPpHfdv%3DVahovNqJHBqr0ejHvx%3DeDuGYySC48Wcvp%2BGDxYLCJg%40mail.gmail.com
I'm going to do some additional checks and push.

--
Regards,
Alexander Korotkov




Re: pgsql: Clean up role created in new subscription test.

2024-01-19 Thread Daniel Gustafsson
> On 18 Jan 2024, at 01:57, vignesh C  wrote:

> There are a lot of failures in CFBot at [1] with:

> More details of the same are available at [2].
> Do we need to clean up the objects leftover for the reported issues in the 
> test?

Not really, these should not need cleaning up, and it's quite odd that it only
happens on FreeBSD.  I need to investigate further so I'll mark this waiting on
author in the meantime.

--
Daniel Gustafsson





Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-19 Thread torikoshia

On 2024-01-19 22:27, Alexander Korotkov wrote:

Hi!

On Fri, Jan 19, 2024 at 2:37 PM torikoshia  
wrote:

Thanks for making the patch!


The patch is pushed!  The proposed changes are incorporated excluding 
this.



> -   /* If SAVE_ERROR_TO is specified, skip rows
> with soft errors */
> +   /* If ON_ERROR is specified with IGNORE, skip
> rows with soft errors */

This is correct now, but considering future works which add other
options like "file 'copy.log'" and
"table 'copy_log'", it may be better not to limit the case to 
'IGNORE'.

How about something like this?

   If ON_ERROR is specified and the value is not STOP, skip rows with
soft errors


I think when we have more options, then we wouldn't just skip rows
with soft errors but rather save them.  So, I left this comment as is
for now.


Agreed.
Thanks for the notification!



--
Regards,
Alexander Korotkov


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: POC: GROUP BY optimization

2024-01-19 Thread Alexander Korotkov
Hi!

I've applied your changes with minor editing, thank you.

On Thu, Jan 18, 2024 at 11:49 AM Andrei Lepikhov
 wrote:
> >> * Part of the work performed in this patch overlaps with that of
> >> preprocess_groupclause.  They are both trying to adjust the ordering of
> >> the GROUP BY keys to match ORDER BY.  I wonder if it would be better to
> >> perform this work only once.
> >
> > Andrei, could you take a look.
> As I see, the PathKeyInfo list often contains duplicated pathkeys,
> coming from either sort_pathkeys or path->pathkeys orderings. So, I can
> propose to check duplicates each time (see step2.txt in attachment).

Actually I asked to recheck if we can cut some part of
preprocess_groupclause() given that we're reordering the pathkeys
later.  It seems that we can remove everything except the work with a
grouping set.  I've done this in the revised patchset.

I'm going to push this if there are no objections.

--
Regards,
Alexander Korotkov


0002-Explore-alternative-orderings-of-group-by-p-20240119.patch
Description: Binary data


0001-Generalize-common-code-of-adding-sort-befor-20240119.patch
Description: Binary data


Re: [PATCH] Exponential backoff for auth_delay

2024-01-19 Thread Michael Banck
On Fri, Jan 19, 2024 at 03:08:36PM +0100, Michael Banck wrote:
> I went through your comments (a lot of which pertained to the original
> larger patch I took code from), attached is a reworked version 2.

Oops, we are supposed to be at version 3, attached.


Michael
From e60cdca73d384fa467f8e6becdd85d9a0c530b60 Mon Sep 17 00:00:00 2001
From: Michael Banck 
Date: Wed, 27 Dec 2023 15:55:39 +0100
Subject: [PATCH v3] Add optional exponential backoff to auth_delay contrib
 module.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This adds two new GUCs for auth_delay, exponential_backoff and max_seconds. The
former controls whether exponential backoff should be used or not, the latter
sets a maximum delay (default is 10s) in case exponential backoff is active.

The exponential backoff is tracked per remote host and doubled for every failed
login attempt (i.e., wrong password, not just missing pg_hba line or database)
and reset to auth_delay.milliseconds after a successful authentication from
that host.

This patch is partly based on a larger (but ultimately rejected) patch by
成之焕.

Authors: Michael Banck, 成之焕
Reviewed-by: Abhijit Menon-Sen
Discussion: https://postgr.es/m/ahwaxacqiwivoehs5yejpqog.1.1668569845751.hmail.zhch...@ceresdata.com
---
 contrib/auth_delay/auth_delay.c  | 197 ++-
 doc/src/sgml/auth-delay.sgml |  48 +++-
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 243 insertions(+), 3 deletions(-)

diff --git a/contrib/auth_delay/auth_delay.c b/contrib/auth_delay/auth_delay.c
index ff0e1fd461..06d2f5f280 100644
--- a/contrib/auth_delay/auth_delay.c
+++ b/contrib/auth_delay/auth_delay.c
@@ -14,24 +14,49 @@
 #include 
 
 #include "libpq/auth.h"
+#include "miscadmin.h"
 #include "port.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/timestamp.h"
 
 PG_MODULE_MAGIC;
 
+#define MAX_CONN_RECORDS 50
+
 /* GUC Variables */
 static int	auth_delay_milliseconds = 0;
+static bool auth_delay_exp_backoff = false;
+static int	auth_delay_max_seconds = 0;
 
 /* Original Hook */
 static ClientAuthentication_hook_type original_client_auth_hook = NULL;
 
+typedef struct AuthConnRecord
+{
+	char		remote_host[NI_MAXHOST];
+	double		sleep_time;		/* in milliseconds */
+} AuthConnRecord;
+
+static shmem_startup_hook_type shmem_startup_next = NULL;
+static shmem_request_hook_type shmem_request_next = NULL;
+static AuthConnRecord *acr_array = NULL;
+
+static AuthConnRecord *find_acr_for_host(char *remote_host);
+static AuthConnRecord *find_free_acr(void);
+static double increase_delay_after_failed_conn_auth(Port *port);
+static void cleanup_conn_record(Port *port);
+
 /*
  * Check authentication
  */
 static void
 auth_delay_checks(Port *port, int status)
 {
+	double		delay = auth_delay_milliseconds;
+
 	/*
 	 * Any other plugins which use ClientAuthentication_hook.
 	 */
@@ -41,10 +66,146 @@ auth_delay_checks(Port *port, int status)
 	/*
 	 * Inject a short delay if authentication failed.
 	 */
-	if (status != STATUS_OK)
+	if (status == STATUS_ERROR)
 	{
-		pg_usleep(1000L * auth_delay_milliseconds);
+		if (auth_delay_exp_backoff)
+		{
+			/*
+			 * Delay by 2^n seconds after each authentication failure from a
+			 * particular host, where n is the number of consecutive
+			 * authentication failures.
+			 */
+			delay = increase_delay_after_failed_conn_auth(port);
+
+			/*
+			 * Clamp delay to a maximum of auth_delay_max_seconds.
+			 */
+			if (auth_delay_max_seconds > 0) {
+delay = Min(delay, 1000L * auth_delay_max_seconds);
+			}
+		}
+
+		if (delay > 0)
+		{
+			elog(DEBUG1, "Authentication delayed for %g seconds due to auth_delay", delay / 1000.0);
+			pg_usleep(1000L * (long) delay);
+		}
 	}
+
+	/*
+	 * Remove host-specific delay if authentication succeeded.
+	 */
+	if (status == STATUS_OK)
+		cleanup_conn_record(port);
+}
+
+static double
+increase_delay_after_failed_conn_auth(Port *port)
+{
+	AuthConnRecord *acr = NULL;
+
+	acr = find_acr_for_host(port->remote_host);
+
+	if (!acr)
+	{
+		acr = find_free_acr();
+
+		if (!acr)
+		{
+			/*
+			 * No free space, MAX_CONN_RECORDS reached. Wait for the
+			 * configured maximum amount.
+			 */
+			return 1000L * auth_delay_max_seconds;
+		}
+		strcpy(acr->remote_host, port->remote_host);
+	}
+	if (acr->sleep_time == 0)
+		acr->sleep_time = (double) auth_delay_milliseconds;
+	else
+		acr->sleep_time *= 2;
+
+	return acr->sleep_time;
+}
+
+static AuthConnRecord *
+find_acr_for_host(char *remote_host)
+{
+	int			i;
+
+	for (i = 0; i < MAX_CONN_RECORDS; i++)
+	{
+		if (strcmp(acr_array[i].remote_host, remote_host) == 0)
+			return &acr_array[i];
+	}
+
+	return NULL;
+}
+
+static AuthConnRecord *
+find_free_acr(void)
+{
+	int			i;
+
+	for (i = 0; i < MAX_CONN_RECORDS; i++)
+	{
+		if (!acr_array[i].remote_host[0])
+			return &acr_array[i];
+	}
+
+	return 0;
+}
+
+static void
+cleanup_conn_record(Port *port)
+

Re: [PATCH] Exponential backoff for auth_delay

2024-01-19 Thread Michael Banck
Hi,

many thanks for the review!

I went through your comments (a lot of which pertained to the original
larger patch I took code from), attached is a reworked version 2.

Other changes:

1. Ignore STATUS_EOF, as this led to auth_delay being applied twice (maybe
due to the GSS/Kerberos auth psql tries by default? Is that legit,
and should this change be reverted?) - i.e., handle STATUS_OK and
STATUS_ERROR explicitly.

2. Guard ShmemInitStruct() with LWLockAcquire(AddinShmemInitLock,
LW_EXCLUSIVE) / LWLockRelease(AddinShmemInitLock), as is done in
pg_prewarm and pg_stat_statements as well.

3. Added an additional paragraph discussing the value of
auth_delay.milliseconds when auth_delay.exponential_backoff is on, see
below.

I wonder whether auth_delay.max_seconds should either be renamed to
auth_delay.exponential_backoff_max_seconds (but then it is rather
long) in order to make it clearer that it only applies in that context, or
alternatively should also apply to auth_delay.milliseconds (though that
would be somewhat weird).
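
For reference, a minimal sketch of how this would be enabled with the GUC
names as they stand in v3 (illustrative only; it assumes auth_delay is
already listed in shared_preload_libraries, and the values are made up):

ALTER SYSTEM SET auth_delay.milliseconds = 125;
ALTER SYSTEM SET auth_delay.exponential_backoff = on;
ALTER SYSTEM SET auth_delay.max_seconds = 10;   -- cap for the doubling delay
SELECT pg_reload_conf();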

Further comments to your comments:

On Tue, Jan 16, 2024 at 12:00:49PM +0530, Abhijit Menon-Sen wrote:
> At 2024-01-04 08:30:36 +0100, mba...@gmx.net wrote:
> >
> > +typedef struct AuthConnRecord
> > +{
> > +   charremote_host[NI_MAXHOST];
> > +   boolused;
> > +   double  sleep_time; /* in milliseconds */
> > +} AuthConnRecord;
> 
> Do we really need a "used" field here? You could avoid it by setting
> remote_host[0] = '\0' in cleanup_conn_record.

Ok, removed.

> >  static void
> >  auth_delay_checks(Port *port, int status)
> >  {
> > +   double  delay;
> 
> I would just initialise this to auth_delay_milliseconds here, instead of
> the harder-to-notice "else" below.
 
Done.

> > @@ -43,8 +69,150 @@ auth_delay_checks(Port *port, int status)
> >  */
> > if (status != STATUS_OK)
> > {
> > -   pg_usleep(1000L * auth_delay_milliseconds);
> > +   if (auth_delay_exp_backoff)
> > +   {
> > +   /*
> > +* Exponential backoff per remote host.
> > +*/
> > +   delay = record_failed_conn_auth(port);
> > +   if (auth_delay_max_seconds > 0)
> > +   delay = Min(delay, 1000L * 
> > auth_delay_max_seconds);
> > +   }
> 
> I would make this comment more specific, maybe something like "Delay by
> 2^n seconds after each authentication failure from a particular host,
> where n is the number of consecutive authentication failures".

Done.
 
> It's slightly discordant to see the new delay being returned by a
> function named "record_" (rather than "calculate_delay" or
> similar). Maybe a name like "backoff_after_failed_auth" would be better?
> Or "increase_delay_on_auth_fail"?

I called it increase_delay_after_failed_conn_auth() now.
 
> > +static double
> > +record_failed_conn_auth(Port *port)
> > +{
> > +   AuthConnRecord *acr = NULL;
> > +   int j = -1;
> > +
> > +   acr = find_conn_record(port->remote_host, &j);
> > +
> > +   if (!acr)
> > +   {
> > +   if (j == -1)
> > +
> > +   /*
> > +* No free space, MAX_CONN_RECORDS reached. Wait as 
> > long as the
> > +* largest delay for any remote host.
> > +*/
> > +   return find_conn_max_delay();
> 
> In this extraordinary situation (where there are lots of hosts with lots
> of authentication failures), why not delay by auth_delay_max_seconds
> straightaway, especially when the default is only 10s? I don't see much
> point in coordinating the delay between fifty known hosts and an unknown
> number of others.

I was a bit worried about legitimate users suffering here if (for some
reason) a lot of different hosts try to guess passwords, but only once
or twice or something. But I have changed it now as you suggested as
that makes it simpler and I guess the problem I mentioned above is
rather contrived.

> > +   elog(DEBUG1, "new connection: %s, index: %d", acr->remote_host, 
> > j);
> 
> I think this should be removed, but if you want to leave it in, the
> message should be more specific about what this is actually about, and
> where the message is from, so as to not confuse debug-log readers.

I left it in but mentioned auth_delay in it now. I wonder though whether
this might be a useful message to have at some more standard level like
INFO?
 
> (The same applies to the other debug messages.)

Those are all gone.
 
> > +static AuthConnRecord *
> > +find_conn_record(char *remote_host, int *free_index)
> > +{
> > +   int i;
> > +
> > +   *free_index = -1;
> > +   for (i = 0; i < MAX_CONN_RECORDS; i++)
> > +   {
> > +   if (!acr_array[i].used)
> > +   {
> > +   if (*free_index == -1)
> > +   /* record unused element */
> > +   *free_inde

Re: generate syscache info automatically

2024-01-19 Thread Peter Eisentraut

On 19.01.24 06:28, John Naylor wrote:

On Wed, Jan 17, 2024 at 7:46 PM Peter Eisentraut  wrote:


I updated the patch to use this style (but I swapped the first two
arguments from my example, so that the thing being created is named first).

I also changed the names of the output files a bit to make them less
confusing.  (I initially had some files named .c.h, which was weird, but
apparently necessary to avoid confusing the build system.  But it's all
clearer now.)

Other than bugs and perhaps style opinions, I think the first patch is
pretty good now.


LGTM. The only style consideration that occurred to me just now was
MAKE_SYSCACHE, since it doesn't match the DECLARE_... macros. It seems
like the same thing from a code generation perspective. The other
macros refer to relations, so there is a difference, but maybe it
doesn't matter. I don't have a strong opinion.


The DECLARE_* macros map to actual BKI commands ("declare toast", 
"declare index").  So I wanted to use a different verb to distinguish 
"generate code for this" from those BKI commands.






Re: Build versionless .so for Android

2024-01-19 Thread Peter Eisentraut

On 19.01.24 11:08, Matthias Kuhn wrote:
When trying to build with meson, including the patch which was provided 
by Andres Freund (thanks),

I am currently stuck with the following error:

Configuring pg_config_ext.h using configuration

../src/tgresql-16-685bc9fc97.clean/src/include/meson.build:12:0: ERROR: 
File port/android.h does not exist.


I think there is actually a small bug in the meson setup.

In the top-level meson.build, line 166 has

portname = host_system

but then later around line 190, host_system is changed for some cases 
(including the android case that was added in the proposed patch).  So 
that won't work, and probably also won't work for the existing cases 
there.  The portname assignment needs to be moved later in the file. 
Maybe you can try that on your own.






Re: BUG: Former primary node might stuck when started as a standby

2024-01-19 Thread Alexander Lakhin

Hi Aleksander,

19.01.2024 14:45, Aleksander Alekseev wrote:



it might not go online, due to the error:
new timeline N forked off current database system timeline M before current 
recovery point X/X
[...]
In this case, node1 wrote to it's WAL record 0/304DC68, but sent to node2
only record 0/304DBF0, then node2, being promoted to primary, forked a next
timeline from it, but when node1 was started as a standby, it first
replayed 0/304DC68 from WAL, and then could not switch to the new timeline
starting from the previous position.

Unless I'm missing something, this is just the right behavior of the system.


Thank you for the answer!


node1 has no way of knowing the history of node1/node2/nodeN
promotion. It sees that it has more data and/or an inconsistent timeline
with another node and refuses to proceed further until a DBA
intervenes.


But node1 knows that it's a standby now and it's expected to get all the
WAL records from the primary, doesn't it?
Maybe it could REDO as few records as possible from its own WAL,
before requesting records from the authoritative source...
Or is that supposed to be more performance-efficient (not on the first
restart, but on later ones)?


  What else can node1 do, drop the data? That's not how
things are done in Postgres :)


In case no other options exist (this behavior is really correct and the
only one possible), maybe the server should just stop?
Can a DBA intervene somehow to make the server proceed without stopping it?


It's been a while since I seriously played with replication, but if
memory serves, a proper way to switch node1 to a replica mode would be
to use pg_rewind on it first.


Perhaps that's true generally, but as we can see, without the extra
records replayed, this scenario works just fine. Moreover, existing tests
rely on it, e.g., 009_twophase.pl or 012_subtransactions.pl (in fact, my
research of the issue was initiated by a test failure).

Best regards,
Alexander




Re: Use of backup_label not noted in log

2024-01-19 Thread David Steele

On 11/20/23 15:08, David Steele wrote:

On 11/20/23 14:27, Andres Freund wrote:

I've wondered whether it's worth also adding an explicit message 
just after

ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.


I think the current message is sufficient, but what do you have in mind?


Well, the consistency message is emitted after every restart. Whereas 
a single
instance only should go through backup recovery once. So it seems 
worthwhile

to differentiate the two in log messages.


Ah, right. That works for me, then.


Any status on this patch? If we do back patch it would be nice to see 
this in the upcoming minor releases. I'm in favor of a back patch, as I 
think this is minimally invasive and would be very useful for debugging 
recovery issues.


I like the phrasing you demonstrated in [1] but it doesn't seem like 
there's a new patch for that, so I have attached one.


Happy to do whatever else I can to get this across the line.

Regards,
-David

---

[1] 
https://www.postgresql.org/message-id/20231120183633.c4lhoq4hld4u56dd%40awork3.anarazel.de

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 1b48d7171a..c0918e6c75 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -603,6 +603,22 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		if (StandbyModeRequested)
 			EnableStandbyMode();
 
+		/*
+		 * Omitting backup_label when creating a new replica, PITR node etc.
+		 * unfortunately is a common cause of corruption. Logging that
+		 * backup_label was used makes it a bit easier to exclude that as the
+		 * cause of observed corruption.
+		 *
+		 * Do so before we try to read the checkpoint record (which can fail),
+		 * as otherwise it can be hard to understand why a checkpoint other
+		 * than ControlFile->checkPoint is used.
+		 */
+		ereport(LOG,
+				(errmsg("starting backup recovery with redo LSN %X/%X, checkpoint LSN %X/%X, on timeline ID %u",
+						LSN_FORMAT_ARGS(RedoStartLSN),
+						LSN_FORMAT_ARGS(CheckPointLoc),
+						CheckPointTLI)));
+
 		/*
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
@@ -742,6 +758,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			EnableStandbyMode();
 		}
 
+		/*
+		 * For the same reason as when starting up with backup_label present,
+		 * emit a log message when we continue initializing from a base
+		 * backup.
+		 */
+		if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
+			ereport(LOG,
+					(errmsg("restarting backup recovery with redo LSN %X/%X",
+							LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
+
 		/* Get the last valid checkpoint record. */
 		CheckPointLoc = ControlFile->checkPoint;
 		CheckPointTLI = ControlFile->checkPointCopy.ThisTimeLineID;
@@ -2155,6 +2181,8 @@ CheckRecoveryConsistency(void)
 	if (!XLogRecPtrIsInvalid(backupEndPoint) &&
 		backupEndPoint <= lastReplayedEndRecPtr)
 	{
+		XLogRecPtr	oldBackupStartPoint = backupStartPoint;
+
 		elog(DEBUG1, "end of backup reached");
 
 		/*
@@ -2165,6 +2193,10 @@ CheckRecoveryConsistency(void)
 		backupStartPoint = InvalidXLogRecPtr;
 		backupEndPoint = InvalidXLogRecPtr;
 		backupEndRequired = false;
+
+		ereport(LOG,
+				(errmsg("completed backup recovery with redo LSN %X/%X",
+						LSN_FORMAT_ARGS(oldBackupStartPoint))));
 	}
 
 	/*


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-19 Thread Alexander Korotkov
Hi!

On Fri, Jan 19, 2024 at 2:37 PM torikoshia  wrote:
> Thanks for making the patch!

The patch is pushed!  The proposed changes are incorporated excluding this.

> > -   /* If SAVE_ERROR_TO is specified, skip rows
> > with soft errors */
> > +   /* If ON_ERROR is specified with IGNORE, skip
> > rows with soft errors */
>
> This is correct now, but considering future works which add other
> options like "file 'copy.log'" and
> "table 'copy_log'", it may be better not to limit the case to 'IGNORE'.
> How about something like this?
>
>If ON_ERROR is specified and the value is not STOP, skip rows with
> soft errors

I think when we have more options, then we wouldn't just skip rows
with soft errors but rather save them.  So, I left this comment as is
for now.
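
For readers following along, a minimal self-contained illustration of the
committed syntax (the table and data below are made up for the example):

CREATE TEMP TABLE copytest (a int);
COPY copytest FROM STDIN (ON_ERROR ignore);
1
not_a_number
3
\.
-- the malformed second line is skipped; rows 1 and 3 are loaded
SELECT * FROM copytest;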

--
Regards,
Alexander Korotkov




Re: System username in pg_stat_activity

2024-01-19 Thread Julien Rouhaud
Hi,

On Thu, Jan 18, 2024 at 11:01 PM Magnus Hagander  wrote:
>
> I did. Here it is, and also including that suggested docs fix as well
> as a rebase on current master.

+	if (MyClientConnectionInfo.authn_id)
+		strlcpy(lbeentry.st_auth_identity, MyClientConnectionInfo.authn_id, NAMEDATALEN);
+	else
+		MemSet(&lbeentry.st_auth_identity, 0, sizeof(lbeentry.st_auth_identity));

Should we use pg_mbcliplen() here?  I don't think there's any strong
guarantee that no multibyte character can be used.  I also agree with
the nearby comment about MemSet being overkill.

+   value as the identity part in , or NULL
I was looking at
https://www.postgresql.org/docs/current/auth-username-maps.html and
noticed that this page is switching between system-user and
system-username.  Should we clean that up while at it?




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-19 Thread torikoshia

On 2024-01-18 23:59, jian he wrote:

Hi.
patch refactored based on "on_error {stop|ignore}"
doc changes:

--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -43,7 +43,7 @@ COPY { table_name [ ( column_name [, ...] ) | * }
 FORCE_NOT_NULL { ( column_name [, ...] ) | * }
 FORCE_NULL { ( column_name [, ...] ) | * }
-SAVE_ERROR_TO 'class="parameter">location'
+ON_ERROR 'class="parameter">error_action'
 ENCODING 'class="parameter">encoding_name'

 
  
@@ -375,20 +375,20 @@ COPY { table_name [ ( 


-SAVE_ERROR_TO
+ON_ERROR
 
  
-  Specifies to save error information to class="parameter">
-  location when there is malformed data in the 
input.

-  Currently, only error (default) and
none
+  Specifies which 
+  error_action to perform when there is malformed
data in the input.
+  Currently, only stop (default) and
ignore
   values are supported.
-  If the error value is specified,
+  If the stop value is specified,
   COPY stops operation at the first error.
-  If the none value is specified,
+  If the ignore value is specified,
   COPY skips malformed data and continues 
copying data.

   The option is allowed only in COPY FROM.
-  The none value is allowed only when
-  not using binary format.
+  Only stop value is allowed only when
+  using binary format.
  


Thanks for making the patch!

Here are some comments:


-  The none value is allowed only when
-  not using binary format.
+  Only stop value is allowed only when
+  using binary format.


The second 'only' may be unnecessary.

-   /* If SAVE_ERROR_TO is specified, skip rows 
with soft errors */
+   /* If ON_ERROR is specified with IGNORE, skip 
rows with soft errors */


This is correct now, but considering future works which add other 
options like "file 'copy.log'" and

"table 'copy_log'", it may be better not to limit the case to 'IGNORE'.
How about something like this?

  If ON_ERROR is specified and the value is not STOP, skip rows with 
soft errors



-COPY x from stdin (format BINARY, save_error_to none);
-COPY x to stdin (save_error_to none);
+COPY x from stdin (format BINARY, ON_ERROR ignore);
+COPY x from stdin (ON_ERROR unsupported);
 COPY x to stdin (format TEXT, force_quote(a));
 COPY x from stdin (format CSV, force_quote(a));


In the existing test for copy2.sql, the COPY options are written in 
lower case (e.g. 'format') and option values (e.g. 'BINARY') are written in 
upper case.

It would be more consistent to align them.


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: index prefetching

2024-01-19 Thread Tomas Vondra



On 1/19/24 09:34, Konstantin Knizhnik wrote:
> 
> On 18/01/2024 6:00 pm, Tomas Vondra wrote:
>> On 1/17/24 09:45, Konstantin Knizhnik wrote:
>>> I have integrated your prefetch patch in Neon and it actually works!
>>> Moreover, I combined it with prefetch of leaf pages for IOS and it also
>>> seems to work.
>>>
>> Cool! And do you think this is the right design/way to do this?
> 
> I like the idea of prefetching TIDs in executor.
> 
> But looking though your patch I have some questions:
> 
> 
> 1. Why it is necessary to allocate and store all_visible flag in data
> buffer. Why caller of  IndexPrefetchNext can not look at prefetch field?
> 
> +        /* store the all_visible flag in the private part of the entry */
> +        entry->data = palloc(sizeof(bool));
> +        *(bool *) entry->data = all_visible;
> 

What do you mean by "prefetch field"? The reason why it's done like this is
to only do the VM check once - without keeping the value, we'd have to
do it in the "next" callback, to determine if we need to prefetch the
heap tuple, and then later in the index-only scan itself. That's a
significant overhead, especially in the case when everything is visible.

> 2. Names of the functions `IndexPrefetchNext` and
> `IndexOnlyPrefetchNext` are IMHO confusing because they look similar and
> one can assume that for one is used for normal index scan and last one -
> for index only scan. But actually `IndexOnlyPrefetchNext` is callback
> and `IndexPrefetchNext` is used in both nodeIndexscan.c and
> nodeIndexonlyscan.c
> 

Yeah, that's a good point. The naming probably needs rethinking.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: psql JSON output format

2024-01-19 Thread Laurenz Albe
On Wed, 2024-01-17 at 14:52 -0500, Robert Haas wrote:
> Let me start by clarifying that I'm OK with sacrificing
> round-trippability here as long as we do it thoughtfully.

Got you.


> I'm not quite sure that addresses all the issues, though. For
> instance, consider that 1.00::numeric and 1.0::numeric are equal but
> distinguishable. If those get rendered into the JSON unquoted as 1.00
> and 1.0, respectively, is that going to round-trip properly? What
> about float8 values where extra_float_digits=3 is needed to properly
> round trip? If we take PostgreSQL's array data types and turn them
> into JSON arrays, what happens with non-default bounds? I know how
> we're going to turn '{1,2}'::int[] into a JSON array, or at least I
> assume I do, but what in the world are we going to do about
> '[-3:-2]={1,2}'?
> 
> As much as I think round-trippability is good, getting it to 100% here
> is probably a good bit of work.

I would go as far as saying that the attempt to preserve all that is
futile, if you are bound to JSON as format.
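
For concreteness, the cases quoted above are plain SQL behavior, independent
of the patch:

SELECT 1.00::numeric = 1.0::numeric AS numerics_equal,   -- t, yet they print as 1.00 and 1.0
       array_lower('[-3:-2]={1,2}'::int[], 1) AS lb,     -- -3, a non-default bound a JSON array cannot express
       ('[-3:-2]={1,2}'::int[])[-3] AS first_elem;       -- 1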

> But I do think it has positive
> value. If we produce output that could be ingested back into PG later
> with the right tool, that leaves the door open for someone to build
> the tool later even if we don't have it today. If we produce output
> that loses information, no tool built later can make up for the loss.

I am all for losing as little information as possible, but I think
that this discussion is going off on a tangent.  After all, we are not
talking about a data export tool here, we are talking about psql.
I don't see anybody complain that float8 values lose precision in
the default aligned format, or that empty strings and NULL values
look the same in aligned format.  Why do the goalposts move for the
JSON output format?  I don't think psql output should be considered
a form of backup.

I'd say that we should strive to preserve whatever information we
easily can, and we shouldn't worry about the rest.

Can we get consensus that SQL NULL columns should be omitted from the
output, and the rest left as it currently is?
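
For illustration, that is the shape json_strip_nulls() already produces on
the server side:

SELECT json_strip_nulls(row_to_json(t))
FROM (VALUES (1, NULL::text)) AS t(a, b);
-- result: {"a":1}; the NULL column "b" is dropped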

Yours,
Laurenz Albe




Re: Synchronizing slots from primary to standby

2024-01-19 Thread Amit Kapila
On Wed, Jan 17, 2024 at 4:00 PM shveta malik  wrote:
>

I had some off-list discussions with Sawada-San, Hou-San, and Shveta
on the topic of extending replication commands instead of using the
current model where we fetch the required slot information via SQL
using a database connection. I would like to summarize the discussion
and would like to know the thoughts of others on this topic.

In the current patch, we launch the slotsync worker on physical
standby which connects to the specified database (currently we let
users specify the required dbname in primary_conninfo) on the primary.
It then fetches the required information for failover marked slots
from the primary and also does some primitive checks on the upstream
node via SQL (the additional checks are like whether the upstream node
has a specified physical slot or whether the upstream node is a
primary node or a standby node). To fetch the required information it
uses a libpqwalreceiver API which is mostly apt for this purpose as it
supports SQL execution but for this patch, we don't need a replication
connection, so we extend the libpqwalreceiver connect API.

Now, the concerns related to this could be that users would probably
need to change existing mechanisms/tools to update primary_conninfo
and one of the alternatives proposed is to have an additional GUC like
slot_sync_dbname. Users won't be able to drop the database this worker
is connected to aka whatever is specified in slot_sync_dbname but as
the user herself sets up the configuration it shouldn't be a big deal.
Then we also discussed whether extending libpqwalreceiver's connect
API is a good idea and whether we need to further extend it in the
future. As far as I can see, slotsync worker's primary requirement is
to execute SQL queries which the current API is sufficient, and don't
see something that needs any drastic change in this API. Note that
tablesync worker that executes SQL also uses these APIs, so we may
need something in the future for either of those. Then finally we need
a slotsync worker to also connect to a database to use SQL and fetch
results.

Now, let us consider if we extend the replication commands like
READ_REPLICATION_SLOT and or introduce a new set of replication
commands to fetch the required information then we don't need a DB
connection with primary or a connection in slotsync worker. As per my
current understanding, it is quite doable but I think we will slowly
go in the direction of making replication commands something like SQL
because today we need to extend it to fetch all slots info that have
failover marked as true, the existence of a particular replication,
etc. Then tomorrow, if we want to extend this work to have multiple
slotsync workers, say workers per db, then we have to extend the
replication command to fetch per-database failover marked slots. To
me, it sounds more like we are slowly adding SQL-like features to
replication commands.

Apart from this when we are reading per-db replication slots without
connecting to a database, we probably need some additional protection
mechanism so that the database won't get dropped.

Considering all this it seems that for now probably extending
replication commands can simplify a few things like mentioned above
but using SQL's with db-connection is more extendable.

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: BUG: Former primary node might stuck when started as a standby

2024-01-19 Thread Aleksander Alekseev
Hi,

> it might not go online, due to the error:
> new timeline N forked off current database system timeline M before current 
> recovery point X/X
> [...]
> In this case, node1 wrote to it's WAL record 0/304DC68, but sent to node2
> only record 0/304DBF0, then node2, being promoted to primary, forked a next
> timeline from it, but when node1 was started as a standby, it first
> replayed 0/304DC68 from WAL, and then could not switch to the new timeline
> starting from the previous position.

Unless I'm missing something, this is just the right behavior of the system.

node1 has no way of knowing the history of node1/node2/nodeN
promotion. It sees that it has more data and/or an inconsistent timeline
with another node and refuses to proceed further until a DBA
intervenes. What else can node1 do, drop the data? That's not how
things are done in Postgres :) What if this is very important data
and node2 was promoted mistakenly, either manually or by a buggy
script?

It's been a while since I seriously played with replication, but if
memory serves, a proper way to switch node1 to a replica mode would be
to use pg_rewind on it first.

-- 
Best regards,
Aleksander Alekseev




Re: System username in pg_stat_activity

2024-01-19 Thread Aleksander Alekseev
Hi,

> > Did you forget to share the new revision (aka v4)? I can only see the
> > "reorder_parallel_worker_bestart.patch" attached.
>
> I did. Here it is, and also including that suggested docs fix as well
> as a rebase on current master.

```
+	lbeentry.st_auth_method = MyClientConnectionInfo.auth_method;
+	if (MyClientConnectionInfo.authn_id)
+		strlcpy(lbeentry.st_auth_identity, MyClientConnectionInfo.authn_id, NAMEDATALEN);
+	else
+		MemSet(&lbeentry.st_auth_identity, 0, sizeof(lbeentry.st_auth_identity));
```

I believe using sizeof(lbeentry.st_auth_identity) instead of
NAMEDATALEN is generally considered a better practice.

Calling MemSet for a CString seems to be an overkill. I suggest
setting lbeentry.st_auth_identity[0] to zero.

Except for these nitpicks v4 LGTM.

-- 
Best regards,
Aleksander Alekseev




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-19 Thread John Naylor
I wrote:

> On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada  wrote:
> > During trying this idea, I realized that there is a visibility problem
> > in the radix tree template
>
> If it's broken even without the embedding I'll look into this (I don't
> know if this configuration has ever been tested). I think a good test
> is putting the shared tid tree in it's own translation unit, to see if
> anything needs to be fixed. I'll go try that.

Here's a quick test that this works. The only thing that really needed
fixing in the template was failure to un-define one symbol. The rest
was just moving some things around.


v51-addendum-Put-shared-radix-tree-for-tidstore-in-its-own-tr.patch.nocfbot
Description: Binary data


Re: Simplify documentation related to Windows builds

2024-01-19 Thread Andrew Dunstan



On 2024-01-19 Fr 04:38, Nazir Bilal Yavuz wrote:



At the end, the main issue that I have with this part of the
documentation is the lack of consistency leading to a confusing user
experience in the builds of Windows.  My recent impressions were that
Andrew has picked up Chocolatey in some (all?) of his buildfarm
animals with Strawberry Perl.  I've had a good experience with it,
FWIW, but Andres has also mentioned me a couple of weeks ago while in
Prague that Strawberry could lead to unpredictible results (sorry I
don't remember all the exact details).

Postgres CI uses Strawberry Perl [1] as well but it is directly
installed from the strawberryperl.com and its version is locked to
'5.26.3.1' for now.



FYI Strawberry was a bit stuck for a while at 5.32, but they are now up 
to 5.38. See 



I agree we shouldn't be recommending any particular perl distro, 
especially not ASPerl which now has annoying license issues.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: UUID v7

2024-01-19 Thread Aleksander Alekseev
Hi,

> No.
>
> Timestamp and TimestampTz are absolutely the same thing. The only
> difference is how they are shown to the user. TimestampTz uses session
> context in order to be displayed in the TZ chosen by the user. Thus
> typically it is somewhat more confusing to the users and thus I asked
> whether there was a good reason to choose TimestampTz over Timestamp.
>
>
> Theoretically, you're right. But look at this example:
>
> SET timezone TO 'Europe/Warsaw';
> SELECT extract(epoch from '2024-01-18 9:27:30'::timestamp), extract(epoch 
> from '2024-01-18 9:27:30'::timestamptz);
>
>  date_part  | date_part
> +
>  1705570050 | 1705566450
> (1 row)
>
> In my opinion, timestamptz gives greater guarantees that the time internally 
> is in UTC and the user gets the time in his/her time zone.

I believe you didn't notice, but this example just proves my point.

In this case you have two timestamps that are different _internally_,
but the way they are _shown_ is the same because the first one is in
UTC and the second one in your local session timezone, Europe/Warsaw.
extract(epoch ...) extract UNIX epoch, i.e. relies on the _internal_
representation. This is why you got different results.

This demonstrates that TimestampTz is a permanent source of confusion
for the users and the reason why personally I would prefer if UUIDv7
always used Timestamp (no Tz). Timestamp can be converted to
TimestampTz by users who need that and have experience using it.
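
A small illustration, reusing the numbers from the example above: a plain
Timestamp can be converted explicitly to whatever zone the user means, and
the result does not depend on the session timezone setting:

SELECT extract(epoch FROM ('2024-01-18 09:27:30'::timestamp AT TIME ZONE 'UTC'))           AS as_utc,
       extract(epoch FROM ('2024-01-18 09:27:30'::timestamp AT TIME ZONE 'Europe/Warsaw')) AS as_warsaw;
-- as_utc = 1705570050, as_warsaw = 1705566450 (the two values from the example above)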

-- 
Best regards,
Aleksander Alekseev




Re: Synchronizing slots from primary to standby

2024-01-19 Thread shveta malik
On Thu, Jan 18, 2024 at 4:49 PM Amit Kapila  wrote:

> 2.
> +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot)
> {
> ...
> + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> + xmin_horizon = GetOldestSafeDecodingTransactionId(true);
> + SpinLockAcquire(&slot->mutex);
> + slot->data.catalog_xmin = xmin_horizon;
> + SpinLockRelease(&slot->mutex);
> ...
> }
>
> Here, why slot->effective_catalog_xmin is not updated? The same is
> required by a later call to ReplicationSlotsComputeRequiredXmin(). I
> see that the prior version v60-0002 has the corresponding change but
> it is missing in the latest version. Any reason?

I think it was a mistake in v61. Added it back in v64..

>
> 3.
> + * Return true either if the slot is marked as RS_PERSISTENT (sync-ready) or
> + * is synced periodically (if it was already sync-ready). Return false
> + * otherwise.
> + */
> +static bool
> +update_and_persist_slot(RemoteSlot *remote_slot)
>
> The second part of the above comment (or is synced periodically (if it
> was already sync-ready)) is not clear to me. Does it intend to
> describe the case when we try to update the already created temp slot
> in the last call. If so, that is not very clear because periodically
> sounds like it can be due to repeated sync for sync-ready slot.

The comment was as per old functionality where this function was doing
persist and save both. In v61 code changed, but comment was not
updated. I have changed it now in v64.

> 4.
> +update_and_persist_slot(RemoteSlot *remote_slot)
> {
> ...
> + (void) local_slot_update(remote_slot);
> ...
> }
>
> Can we write a comment to state the reason why we don't care about the
> return value here?

Since it is the first time 'local_slot_update' is happening on any
slot, the return value must be true i.e. local_slot_update() should
not skip the update. I have thus added an Assert on return value now
(in v64).

thanks
Shveta




Re: remaining sql/json patches

2024-01-19 Thread jian he
Playing with domain types:
in ExecEvalJsonCoercion, it seems the function json_populate_type cannot cope
with domain types.

tests:
drop domain test;
create domain test as int[] check ( array_length(value,1) =2 and
(value[1] = 1 or value[2] = 2));
SELECT * from JSON_QUERY(jsonb'{"rec": "{1,2,3}"}', '$.rec' returning
test omit quotes);
SELECT * from JSON_QUERY(jsonb'{"rec": "{1,11}"}', '$.rec' returning
test keep quotes);
SELECT * from JSON_QUERY(jsonb'{"rec": "{2,11}"}', '$.rec' returning
test omit quotes error on error);
SELECT * from JSON_QUERY(jsonb'{"rec": "{2,2}"}', '$.rec' returning
test keep quotes error on error);

SELECT * from JSON_QUERY(jsonb'{"rec": [1,2,3]}', '$.rec' returning
test omit quotes );
SELECT * from JSON_QUERY(jsonb'{"rec": [1,2,3]}', '$.rec' returning
test omit quotes null on error);
SELECT * from JSON_QUERY(jsonb'{"rec": [1,2,3]}', '$.rec' returning
test null on error);
SELECT * from JSON_QUERY(jsonb'{"rec": [1,11]}', '$.rec' returning
test omit quotes);
SELECT * from JSON_QUERY(jsonb'{"rec": [2,2]}', '$.rec' returning test
omit quotes);

Many domain-related tests seem not right,
like the following, which I think should just return NULL.
+SELECT JSON_QUERY(jsonb '{"a": 1}', '$.b' RETURNING sqljsonb_int_not_null);
+ERROR:  domain sqljsonb_int_not_null does not allow null values

--another example
SELECT JSON_QUERY(jsonb '{"a": 1}', '$.b' RETURNING
sqljsonb_int_not_null null on error);

Maybe in node JsonCoercion we don't need both via_io and
via_populate; we could have one bool to indicate whether to call
InputFunctionCallSafe or json_populate_type in ExecEvalJsonCoercion.




Re: Increasing IndexTupleData.t_info from uint16 to uint32

2024-01-19 Thread Aleksander Alekseev
Hi,

> > The overall trend in machine learning embedding sizes has been growing 
> > rapidly over the last few years from 128 up to 4K dimensions yielding 
> > additional value and quality improvements. It's not clear when this trend 
> > in growth will ease. The leading text embedding models generate now exceeds 
> > the index storage available in IndexTupleData.t_info.
> >
> > The current index tuple size is stored in 13 bits of IndexTupleData.t_info, 
> > which limits the max size of an index tuple to 2^13 = 8192 bytes. Vectors 
> > implemented by pgvector currently use a 32 bit float for elements, which 
> > limits vector size to 2K dimensions, which is no longer state of the art.
> >
> > I've attached a patch that increases  IndexTupleData.t_info from 16bits to 
> > 32bits allowing for significantly larger index tuple sizes. I would guess 
> > this patch is not a complete implementation that allows for migration from 
> > previous versions, but it does compile and initdb succeeds. I'd be happy to 
> > continue work if the core team is receptive to an update in this area, and 
> > I'd appreciate any feedback the community has on the approach.

If I read this correctly, basically the patch adds 16 useless bits for
all applications except for ML ones...

Perhaps implementing an alternative storage specifically for ML using
TAM interface would be a better approach?
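
Just to sketch what that extension point looks like (the handler name below
is a hypothetical placeholder that a C extension would have to provide, not
existing code):

CREATE ACCESS METHOD vector_heap TYPE TABLE HANDLER vector_heap_tableam_handler;
CREATE TABLE embeddings (id bigint, v float4[]) USING vector_heap;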

-- 
Best regards,
Aleksander Alekseev




Re: Synchronizing slots from primary to standby

2024-01-19 Thread shveta malik
On Fri, Jan 19, 2024 at 10:35 AM Masahiko Sawada  wrote:
>
>
> Thank you for updating the patch. I have some comments:
>
> ---
> +	latestWalEnd = GetWalRcvLatestWalEnd();
> +	if (remote_slot->confirmed_lsn > latestWalEnd)
> +	{
> +		elog(ERROR, "exiting from slot synchronization as the received slot sync"
> +			 " LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X",
> +			 LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
> +			 remote_slot->name,
> +			 LSN_FORMAT_ARGS(latestWalEnd));
> +	}
>
> IIUC GetWalRcvLatestWalEnd () returns walrcv->latestWalEnd, which is
> typically the primary server's flush position and doesn't mean the LSN
> where the walreceiver received/flushed up to.

yes. I think it makes more sense to use something which actually tells
flushed-position. I gave it a try by replacing GetWalRcvLatestWalEnd()
with GetWalRcvFlushRecPtr() but I see a problem here. Lets say I have
enabled the slot-sync feature in a running standby, in that case we
are all good (flushedUpto is the same as actual flush-position
indicated by LogstreamResult.Flush). But if I restart standby, then I
observed that the startup process sets flushedUpto to some value 'x'
(see [1]) while when the wal-receiver starts, it sets
'LogstreamResult.Flush' to another value (see [2]) which is always
greater than 'x'. And we do not update flushedUpto with the
'LogstreamResult.Flush' value in walreceiver until we actually do an
operation on primary. Performing a data change on primary sends WALs
to standby which then hits XLogWalRcvFlush() and updates flushedUpto
same as LogstreamResult.Flush. Until then we have a situation where
slots received on standby are ahead of flushedUpto and thus slotsync
worker keeps on erroring out. I have yet to find out why flushedUpto is
set to a lower value than 'LogstreamResult.Flush' at the start of the
standby.  Or maybe I am using the wrong function,
GetWalRcvFlushRecPtr(), and should be using something else instead?

[1]:
Startup process sets 'flushedUpto' here:
ReadPageInternal-->XLogPageRead-->WaitForWALToBecomeAvailable-->RequestXLogStreaming

[2]:
Walreceiver sets 'LogstreamResult.Flush' here but do not update
'flushedUpto' here:
WalReceiverMain():  LogstreamResult.Write = LogstreamResult.Flush =
GetXLogReplayRecPtr(NULL)


> Does it really happen
> that the slot's confirmed_flush_lsn is higher than the primary's flush
> lsn?

It may happen if we have not configured standby_slot_names on primary.
In such a case, slots may get updated w/o confirming that standby has
taken the change and thus slot-sync worker may fetch the slots which
have lsns ahead of the latest WAL position on standby.

thanks
Shveta




Re: Build versionless .so for Android

2024-01-19 Thread Matthias Kuhn
Hi

On Thu, Jan 18, 2024 at 9:27 AM Peter Eisentraut 
wrote:

> On 14.01.24 12:37, Matthias Kuhn wrote:
> > What I try to do is packaging an app with androiddeployqt which fails
> > with an error:
> >
> > The bundled library lib/libpq.so.5 doesn't end with .so. Android only
> > supports versionless libraries ending with the .so suffix.
> >
> > This error was introduced in response to this issue which contains hints
> > about the underlying problem:
> >
> > https://bugreports.qt.io/plugins/servlet/mobile#issue/QTBUG-101346
> > 
>
> I think the scenario there is using dlopen(), possibly via Qt, possibly
> via Java, so it's a very specific scenario, not necessarily indicative
> of a general requirement on Android, and apparently not using build-time
> linking.  It's quite conceivable that this issue would also exist with
> Qt on other platforms.
>
>
I haven't experienced this issue with Qt on any other platform (and I build
and run the same stack on windows, linux, macos and ios).
Also the links posted earlier in this thread hint to the same limitation of
Android, unrelated to Qt.


> As I mentioned before, Meson has Android support.  Some people there
> clearly thought about it.  So I suggest you try building PostgreSQL for
> your Android environment using Meson and see what it produces.(*)  If
> the output files are the same as the autoconf/make build, then I would
> think your scenario is nonstandard.  If the output files are different,
> then we should check that and consider changes to get them to match.
>
> It's of course possible that Meson is wrong, too, but then we need to
> have a broader analysis, because the implicit goal is to keep the two
> build systems for PostgreSQL consistent.
>

When trying to build with meson, including the patch which was provided by
Andres Freund (thanks),
I am currently stuck with the following error:

Configuring pg_config_ext.h using configuration

../src/tgresql-16-685bc9fc97.clean/src/include/meson.build:12:0: ERROR:
File port/android.h does not exist.

Kind regards
Matthias


>
> (*) - This means specifically that the installation trees produced by
> "make install-world-bin" and "meson install" should produce exactly the
> same set of files (same names in the same directories, not necessarily
> the same contents).
>
>


Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

2024-01-19 Thread Daniel Gustafsson
> On 19 Jan 2024, at 11:04, Michael Banck  wrote:
> 
> Hi,
> 
> On Fri, Jan 19, 2024 at 10:48:12AM +0100, Daniel Gustafsson wrote:
>> This does bring up an interesting point, I don't think there is a way
>> for a user to know whether the server is jit enabled or not (apart
>> from explaining a query with costs adjusted but that's not all that
>> userfriendly).  Maybe we need a way to reliably tell if JIT is active
>> or not.
> 
> I thought this is what pg_jit_available() is for?

Ah, it is, I completely forgot we had that one. Thanks!

--
Daniel Gustafsson





Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

2024-01-19 Thread Michael Banck
Hi,

On Fri, Jan 19, 2024 at 10:48:12AM +0100, Daniel Gustafsson wrote:
> This does bring up an interesting point, I don't think there is a way
> for a user to know whether the server is jit enabled or not (apart
> from explaining a query with costs adjusted but that's not all that
> userfriendly).  Maybe we need a way to reliably tell if JIT is active
> or not.

I thought this is what pg_jit_available() is for?

postgres=> SHOW jit;
 jit 
-
 on
(1 Zeile)

postgres=> SELECT pg_jit_available();
 pg_jit_available 
--
 f
(1 Zeile)


Michael




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-19 Thread John Naylor
On Fri, Jan 19, 2024 at 2:26 PM Masahiko Sawada  wrote:
>
> On Thu, Jan 18, 2024 at 1:30 PM John Naylor  wrote:
> > I'm not quite sure what the point of "num_items" is anymore, because
> > it was really tied to the array in VacDeadItems. dead_items->num_items
> > is essential to reading/writing the array correctly. If this number is
> > wrong, the array is corrupt. There is no such requirement for the
> > radix tree. We don't need to know the number of tids to add to it or
> > do a lookup, or anything.
>
> True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking
> we need to have the number of TIDs in a tidstore, especially in the
> tidstore's control object.

Hmm, it would be kind of sad to require explicit locking in tidstore.c
only for maintaining that one number at all times. Aside from the
two ereports after an index scan / second heap pass, the only
non-assert place where it's used is

@@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
  * Do index vacuuming (call each index's ambulkdelete routine), then do
  * related heap vacuuming
  */
- if (dead_items->num_items > 0)
+ if (TidStoreNumTids(dead_items) > 0)
  lazy_vacuum(vacrel);

...and that condition can be checked by doing a single step of
iteration to see if it shows anything. But for the ereport, my idea
for iteration + popcount is probably quite slow.

> IIUC lpdead_items is the total number of LP_DEAD items vacuumed during
> the whole lazy vacuum operation whereas num_items is the number of
> LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle.
> That is, after heap vacuum, the latter counter is reset while the
> former counter is not.
>
> The latter counter is used in lazyvacuum.c as well as the ereport in
> vac_bulkdel_one_index().

Ah, of course.

> Putting a copy of the key in BlocktableEntry's header is an
> interesting idea. But the current debug code in the tidstore also
> makes sure that the tidstore returns TIDs in the correct order during
> an iterate operation. I think it still has a value and you can disable
> it by removing the "#define TIDSTORE_DEBUG" line.

Fair enough. I just thought it'd be less work to leave this out in
case we change how locking is called.

> > This week I tried an idea to use a callback there so that after
> > internal unlocking, the caller received the value (or whatever else
> > needs to happen, such as lookup an offset in the tid bitmap). I've
> > attached a draft for that that passes radix tree tests. It's a bit
> > awkward, but I'm guessing this would more closely match future
> > internal atomic locking. Let me know what you think of the concept,
> > and then do whichever way you think is best. (using v53 as the basis)
>
> Thank you for verifying this idea! Interesting. While it's promising
> in terms of future atomic locking, I'm concerned it might not be easy
> to use if radix tree APIs supports only such callback style.

Yeah, it's quite awkward. It could be helped by only exposing it for
varlen types. For simply returning "present or not" (used a lot in the
regression tests), we could skip the callback if the data is null.
That is all also extra stuff.

> I believe
> the caller would like to pass one more data along with val_data. For

That's trivial, however, if I understand you correctly. With "void *",
a callback can receive anything, including a struct containing
additional pointers to elsewhere.

> example, considering tidstore that has num_tids internally, it wants
> to pass both a pointer to BlocktableEntry and a pointer to TidStore
> itself so that it increments the counter while holding a lock.

Hmm, so a callback to RT_SET also. That's interesting!

Anyway, I agree it needs to be simple, since the first use doesn't
even have multiple writers.

> BTW in radixtree.h pg_attribute_unused() is used for some functions,
> but is it for debugging purposes? I don't see why it's used only for
> some functions.

It was there to silence warnings about unused functions. I only see
one remaining, and it's already behind a debug symbol, so we might not
need this attribute anymore.




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

2024-01-19 Thread Daniel Gustafsson
> On 19 Jan 2024, at 01:50, Kirk Wolak  wrote:

>   I did a little more checking and the reason I did not see the link MIGHT be 
> because EXPLAIN did not show a JIT attempt.
> I tried to use settings that FORCE a JIT...  But to no avail.

Are you sure you are running a JIT enabled server?  Did you compile it yourself
or install a snapshot?

>   You don't know a way to force at least the JIT analysis to happen?  
> (because I already knew if JIT was off, the leak wouldn't happen). 

If you set jit_above_cost=0 then postgres will compile a JIT enabled execution
tree.  This does bring up an interesting point, I don't think there is a way
for a user to know whether the server is jit enabled or not (apart from
explaining a query with costs adjusted but that's not all that userfriendly).
Maybe we need a way to reliably tell if JIT is active or not.

--
Daniel Gustafsson





Re: Simplify documentation related to Windows builds

2024-01-19 Thread Nazir Bilal Yavuz
Hi,

On Sun, 31 Dec 2023 at 09:13, Michael Paquier  wrote:
>
> Hi all,
>
> As promised around thread [1] that has moved the docs related to
> Windows into a new sub-section for Visual, here is a follow-up to
> improve portions of its documentation, for discussion in the next CF.
>
> Some of my notes:
> - How much does command line editing work on Windows?  When it came to
> VS, I always got the impression that this never worked.  Andres in [2]
> mentioned otherwise because meson makes that easier?

I do not know that either.

> - ActiveState perl should not be recommended IMO, as being able to use
> a perl binary requires one to *register* into their service for tokens
> that can be used to kick perl sessions, last time I checked.  Two
> alternatives:
> -- MinGW perl binary.
> -- strawberry perl (?) and Chocolatey.
> -- MSYS

I agree. Also, its installation & use steps are complicated IMO. It is
not like install it, add it to PATH and forget.

> - http://www.mingw.org/ is a dead end.  This could be replaced by
> links to https://www.mingw-w64.org/ instead?

Correct.

> At the end, the main issue that I have with this part of the
> documentation is the lack of consistency leading to a confusing user
> experience in the builds of Windows.  My recent impressions were that
> Andrew has picked up Chocolatey in some (all?) of his buildfarm
> animals with Strawberry Perl.  I've had a good experience with it,
> FWIW, but Andres has also mentioned me a couple of weeks ago while in
> Prague that Strawberry could lead to unpredictible results (sorry I
> don't remember all the exact details).

Postgres CI uses Strawberry Perl [1] as well but it is directly
installed from the strawberryperl.com and its version is locked to
'5.26.3.1' for now.

> Between MSYS2, mingw-w64 and Chocolatey, there are a lot of options
> available to users.  So shouldn't we try to recommend only of them,
> then align the buildfarm and the CI to use one of them?  Supporting
> more than one, at most two, would be OK for me, my end goal would be
> to get rid entirely of the list of build dependencies in this "Visual"
> section, because that's just a duplicate of what meson lists, except
> that meson should do a better job at detecting dependencies than what
> the now-dead MSVC scripts did.  If we support two, the CI and the
> buildfarm should run them.

I agree.

> I am attaching a patch that's an embryon of work (little time for
> hacking as of life these days, still I wanted to get the discussion
> started), but let's discuss which direction we should take moving
> forward for 17~.

The current changes look good.

 Both Bison and
Flex
 are included in the msys tool
suite, available
-from http://www.mingw.org/wiki/MSYS";> as
part of the
-MinGW compiler suite.
+from https://www.msys2.org/";>.

Since we are changing that part, I think we need to change 'You will
need to add the directory containing flex.exe and bison.exe to the
PATH environment variable. In the case of MinGW, the directory is the
\msys\1.0\bin subdirectory of your MinGW installation directory.'
sentence to its msys2 version. My initial testing showed that the
directory is the '\usr\bin' subdirectory of the msys2 installation
directory in my environment.

[1] 
https://github.com/anarazel/pg-vm-images/blob/main/scripts/windows_install_perl.ps1

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

2024-01-19 Thread Laurenz Albe
On Thu, 2024-01-18 at 19:50 -0500, Kirk Wolak wrote:
>   I did a little more checking and the reason I did not see the link MIGHT be 
> because EXPLAIN did not show a JIT attempt.
> I tried to use settings that FORCE a JIT...  But to no avail.
> 
>   I am now concerned that the problem is more hidden in my use case.  Meaning 
> I CANNOT conclude it is fixed.
> But I know of NO WAY to force a JIT (I lowered costs to 1, etc.  ).
> 
>   You don't know a way to force at least the JIT analysis to happen?  
> (because I already knew if JIT was off, the leak wouldn't happen). 

If you set the limits to 0, you can trigger it easily:

SET jit = on;
SET jit_above_cost = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;

EXPLAIN (ANALYZE) SELECT count(*) FROM foo;
                                                              QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Finalize Aggregate  (cost=58889.84..58889.85 rows=1 width=8) (actual time=400.462..418.214 rows=1 loops=1)
   ->  Gather  (cost=58889.62..58889.83 rows=2 width=8) (actual time=400.300..418.190 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=57889.62..57889.64 rows=1 width=8) (actual time=384.876..384.878 rows=1 loops=3)
               ->  Parallel Seq Scan on foo  (cost=0.00..52681.30 rows=2083330 width=0) (actual time=0.028..168.510 rows=167 loops=3)
 Planning Time: 0.133 ms
 JIT:
   Functions: 8
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.038 ms, Inlining 279.779 ms, Optimization 38.395 ms, Emission 73.105 ms, Total 392.316 ms
 Execution Time: 478.257 ms

Yours,
Laurenz Albe


Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-19 Thread Bertrand Drouvot
Hi,

On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
> Hello,
> 
> 15.01.2024 12:00, Alexander Lakhin wrote:
> > If this approach looks promising to you, maybe we could add a submodule to
> > perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
> > 019_replslot_limit) as well.
> > 
> > Personally I think that having such a functionality for using in tests
> > might be useful not only to avoid some "problematic" behaviour but also to
> > test the opposite cases.
> 
> After spending a few days on it, I've discovered two more issues:
> https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com
> 
> (The latter is not related to bgwriter directly, but it was discovered
> thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)
> 
> So it becomes clear that the 035 test is not the only one, which might
> suffer from bgwriter's activity,

Yeah... thanks for sharing!

> and inventing a way to stop bgwriter/
> control it is a different subject, getting out of scope of the current
> issue.

Agree.

> 15.01.2024 11:49, Bertrand Drouvot wrote:
> > We did a few things in this thread, so to sum up what we've discovered:
> > 
> > - a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
> > - we need to launch the vacuum(s) only if we are sure we got a newer XID 
> > horizon
> > ( proposal in in v6 attached)
> > - we need a way to control how frequent xl_running_xacts are emmitted (to 
> > ensure
> > they are not triggered in a middle of an active slot invalidation test).
> > 
> > I'm not sure it's possible to address Tom's concern and keep the test 
> > "predictable".
> > 
> > So, I think I'd vote for Michael's proposal to implement a 
> > superuser-settable
> > developer GUC (as sending a SIGSTOP on the bgwriter (and bypass 
> > $windows_os) would
> > still not address Tom's concern anyway).
> > 
> > Another option would be to "sacrifice" the full predictablity of the test 
> > (in
> > favor of real-world behavior testing)?
> > 
> > [1]: 
> > https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal
> 
> So, now we have the test 035 failing due to nondeterministic vacuum
> activity in the first place, and due to bgwriter's activity in the second.

Yeah, that's also my understanding.

> Maybe it would be a right move to commit the fix, and then think about
> more rare failures.

+1

> Though I have a couple of question regarding the fix left, if you don't
> mind:
> 1) The test has minor defects in the comments, that I noted before [1];
> would you like to fix them in passing?
> 
> > BTW, it looks like the comment:
> > # One way to produce recovery conflict is to create/drop a relation and
> > # launch a vacuum on pg_class with hot_standby_feedback turned off on the 
> > standby.
> > in scenario 3 is a copy-paste error.

Nice catch, thanks! Fixed in v7 attached.

> > Also, there are two "Scenario 4" in this test.

D'oh! Fixed in v7.

> > 
> 
> 2) Shall we align the 035_standby_logical_decoding with
> 031_recovery_conflict in regard to improving stability of vacuum?

Yeah, I think that could make sense.

> I see the following options for this:
> a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
> b) use FREEZE and autovacuum = off in both tests;
> c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
>  autovacuum = off in both.
>

I'd vote for a) as I have the feeling it's "easier" to understand (and I'm not
sure using FREEZE would give full "stabilization predictability", at least for
035_standby_logical_decoding.pl). That said, I did not test what the outcome
would be for 031_recovery_conflict.pl by making use of a).

> I've re-tested the v6 patch today and confirmed that it makes the test
> more stable. I ran (with bgwriter_delay = 1 in temp.config) 20 tests in
> parallel and got failures ('inactiveslot slot invalidation is logged with
> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
> (With unlimited CPU, when average test duration is around 70 seconds.)
> 
> But with v6 applied, 60 iterations succeeded.

Nice! Thanks for the testing!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From b45f0c55d17cddfac7c99d11000b819c8b09fa56 Mon Sep 17 00:00:00 2001
From: bdrouvot 
Date: Tue, 9 Jan 2024 05:08:35 +
Subject: [PATCH v7] Fix 035_standby_logical_decoding.pl race condition

We want to ensure that vacuum was able to remove dead rows (i.e., no other
transactions are holding back the global xmin) before testing for slot
invalidation on the standby.

While at it, also fix some typos and a bad test description.
---
 .../t/035_standby_logical_decoding.pl | 68 ++-
 1 file changed, 35 insertions(+), 33 deletions(-)
 100.0% src/test/recovery/t/

diff --git a

Re: speed up a logical replica setup

2024-01-19 Thread Shubham Khanna
On Tue, Jan 16, 2024 at 11:58 AM Shubham Khanna
 wrote:
>
> On Thu, Dec 21, 2023 at 11:47 AM Amit Kapila  wrote:
> >
> > On Wed, Dec 6, 2023 at 12:53 PM Euler Taveira  wrote:
> > >
> > > On Thu, Nov 9, 2023, at 8:12 PM, Michael Paquier wrote:
> > >
> > > On Thu, Nov 09, 2023 at 03:41:53PM +0100, Peter Eisentraut wrote:
> > > > On 08.11.23 00:12, Michael Paquier wrote:
> > > >> - Should the subdirectory pg_basebackup be renamed into something more
> > > >> generic at this point?  All these things are frontend tools that deal
> > > >> in some way with the replication protocol to do their work.  Say
> > > >> a replication_tools?
> > > >
> > > > Seems like unnecessary churn.  Nobody has complained about any of the 
> > > > other
> > > > tools in there.
> > >
> > > Not sure.  We rename things across releases in the tree from time to
> > > time, and here that's straight-forward.
> > >
> > >
> > > Based on this discussion it seems we have a consensus that this tool 
> > > should be
> > > in the pg_basebackup directory. (If/when we agree with the directory 
> > > renaming,
> > > it could be done in a separate patch.) Besides this move, the v3 provides 
> > > a dry
> > > run mode. It basically executes every routine but skip when should do
> > > modifications. It is an useful option to check if you will be able to run 
> > > it
> > > without having issues with connectivity, permission, and existing objects
> > > (replication slots, publications, subscriptions). Tests were slightly 
> > > improved.
> > > Messages were changed to *not* provide INFO messages by default and 
> > > --verbose
> > > provides INFO messages and --verbose --verbose also provides DEBUG 
> > > messages. I
> > > also refactored the connect_database() function so that the connection 
> > > will
> > > always use the logical replication mode. A bug was fixed in the transient
> > > replication slot name. Ashutosh's review [1] was included. The code was 
> > > also indented.
> > >
> > > There are a few suggestions from Ashutosh [2] that I will reply in another
> > > email.
> > >
> > > I'm still planning to work on the following points:
> > >
> > > 1. improve the cleanup routine to point out leftover objects if there is 
> > > any
> > >connection issue.
> > >
> >
> > I think this is an important part. Shall we try to write the pending
> > objects to be cleaned up to some file? We do something like that during
> > the upgrade.
> >
> > > 2. remove the physical replication slot if the standby is using one
> > >(primary_slot_name).
> > > 3. provide instructions to promote the logical replica into primary, I 
> > > mean,
> > >stop the replication between the nodes and remove the replication setup
> > >(publications, subscriptions, replication slots). Or even include 
> > > another
> > >action to do it. We could add both too.
> > >
> > > Point 1 should be done. Points 2 and 3 aren't essential but will provide 
> > > a nice
> > > UI for users that would like to use it.
> > >
> >
> > Isn't point 2 also essential, because otherwise how would such a slot
> > be advanced or removed?
> >
> > A few other points:
> > ==
> > 1. Previously, I asked whether we need an additional replication slot
> > created to get a consistent LSN, and I see the following comment in
> > the patch:
> >
> > + *
> > + * XXX we should probably use the last created replication slot to get a
> > + * consistent LSN but it should be changed after adding pg_basebackup
> > + * support.
> >
> > Yeah, sure, we may want to do that after backup support and we can
> > keep a comment for the same but I feel as the patch stands today,
> > there is no good reason to keep it. Also, is there a reason that we
> > can't create the slots after backup is complete and before we write
> > recovery parameters?
> >
> > 2.
> > + appendPQExpBuffer(str,
> > +   "CREATE SUBSCRIPTION %s CONNECTION '%s' PUBLICATION %s "
> > +   "WITH (create_slot = false, copy_data = false, enabled = false)",
> > +   dbinfo->subname, dbinfo->pubconninfo, dbinfo->pubname);
> >
> > Shouldn't we enable two_phase by default for newly created
> > subscriptions? Is there a reason for not doing so?
> >
> > 3. How about sync slots on the physical standby if present? Do we want
> > to retain those as it is or do we need to remove those? We are
> > actively working on the patch [1] for the same.
> >
> > 4. Can we see some numbers with various sizes of databases (cluster)
> > to see how it impacts the time for small to large-size databases as
> > compared to the traditional method? This might help us with giving
> > users advice on when to use this tool. We can do this bit later as
> > well when the patch is closer to being ready for commit.
>
> I have done the Performance testing and attached the results to
> compare the 'Execution Time' between 'logical replication' and
> 'pg_subscriber' for 100MB, 1GB and 5GB data:
> | 100MB | 1GB  | 5GB
> Logical rep (2 w) | 1.815s  | 14.895s

Re: pgbnech: allow to cancel queries during benchmark

2024-01-19 Thread Tatsuo Ishii
>> +/* send cancel requests to all connections */
>> +static void
>> +cancel_all()
>> +{
>> +for (int i = 0; i < nclients; i++)
>> +{
>> +char errbuf[1];
>> +if (client_states[i].cancel != NULL)
>> +(void) PQcancel(client_states[i].cancel, errbuf, 
>> sizeof(errbuf));
>> +}
>> +}
>> +
>> 
>> Why is the error message neglected in case of errors from PQcancel? I
>> think it's better to print out the error message in case of error.
> 
> Is the message useful for pgbench users? I saw the error is ignored
> in pg_dump, for example in bin/pg_dump/parallel.c

I think the situation is different from pg_dump.  Unlike pg_dump, if
PQcancel does not work, users can fix the problem by using
pg_terminate_backend or the kill command. In order to make this work, an
appropriate error message is essential.
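
For illustration, a minimal sketch (not the actual patch) of how cancel_all()
could report such failures; it assumes the client_states/nclients variables
from the quoted patch and pgbench's pg_log_error():

/* send cancel requests to all connections, reporting failures */
static void
cancel_all(void)
{
	for (int i = 0; i < nclients; i++)
	{
		char		errbuf[256];

		if (client_states[i].cancel == NULL)
			continue;

		/* PQcancel() returns 0 on failure and fills errbuf */
		if (!PQcancel(client_states[i].cancel, errbuf, sizeof(errbuf)))
			pg_log_error("could not send cancel request to client %d: %s",
						 i, errbuf);
	}
}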

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-01-19 Thread torikoshia

On 2024-01-16 18:41, Melih Mutlu wrote:

Hi,

Thanks for reviewing.

torikoshia  wrote on Wed, 10 Jan 2024 at 09:37:


+ 
+  
+   context_id int4
+  
+  
+   Current context id. Note that the context id is a temporary id and may
+   change in each invocation
+  
+ 
+
+ 
+  
+   path int4[]
+  
+  
+   Path to reach the current context from TopMemoryContext. Context ids in
+   this list represent all parents of the current context. This can be
+   used to build the parent and child relation
+  
+ 
+
+ 
+  
+   total_bytes_including_children int8
+  
+  
+   Total bytes allocated for this memory context including its children
+  
+ 


These columns are currently added to the bottom of the table, but it may
be better to put semantically similar items close together and change
the insertion position with reference to other system views. For example,

- In pg_group and pg_user, 'id' is placed on the line following 'name',
  so 'context_id' could be placed on the line following 'name'
- 'path' is similar to 'parent' and 'level' in that these are
  information about the location of the context, so 'path' could be placed
  next to them.

If we do this, the orders of columns in the system view should be the same,
I think.


I've done what you suggested. Also moved
"total_bytes_including_children" right after "total_bytes".


14dd0f27d has introduced a new macro, foreach_int.
It seems to be able to make the code a bit simpler, and the commit log
says this macro is primarily intended for use in new code. For example:
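
An illustrative sketch (not code from the patch), assuming a list of
integer context ids; foreach_int lets the loop drop the explicit ListCell:

	List	   *path_ids = list_make3_int(1, 2, 3);
	ListCell   *lc;

	/* old idiom: explicit ListCell plus lfirst_int() */
	foreach(lc, path_ids)
		elog(DEBUG1, "context id %d", lfirst_int(lc));

	/* foreach_int declares the int loop variable itself */
	foreach_int(id, path_ids)
		elog(DEBUG1, "context id %d", id);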


Makes sense. Done.


Thanks for updating the patch!

+   Current context id. Note that the context id is a temporary id and may
+   change in each invocation
+  
+ 


It clearly states that the context id is temporary, but I am a little
concerned about users who write queries that refer to this view multiple
times without using a CTE.

If you agree, how about adding some description like the one below, which you
mentioned before?


We still need to use a CTE since ids are not persisted and might change in
each run of pg_backend_memory_contexts. Materializing the result can
prevent any inconsistencies due to id change. It can also be good for
performance reasons.


We already have additional description below the table which explains 
each column of the system view. For example pg_locks:

https://www.postgresql.org/docs/devel/view-pg-locks.html


Also giving an example query something like this might be useful.

  -- show all the parent context names of ExecutorState
  with contexts as (
    select * from pg_backend_memory_contexts
  )
  select name from contexts
  where array[context_id] <@ (select path from contexts where name = 'ExecutorState');



--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: index prefetching

2024-01-19 Thread Konstantin Knizhnik



On 18/01/2024 6:00 pm, Tomas Vondra wrote:

On 1/17/24 09:45, Konstantin Knizhnik wrote:

I have integrated your prefetch patch in Neon and it actually works!
Moreover, I combined it with prefetch of leaf pages for IOS and it also
seems to work.


Cool! And do you think this is the right design/way to do this?


I like the idea of prefetching TIDs in executor.

But looking though your patch I have some questions:


1. Why is it necessary to allocate and store the all_visible flag in the data
buffer? Why can't the caller of IndexPrefetchNext look at the prefetch field?


+        /* store the all_visible flag in the private part of the entry */
+        entry->data = palloc(sizeof(bool));
+        *(bool *) entry->data = all_visible;

2. The names of the functions `IndexPrefetchNext` and
`IndexOnlyPrefetchNext` are IMHO confusing because they look similar and
one can assume that the former is used for normal index scans and the latter
for index-only scans. But actually `IndexOnlyPrefetchNext` is a callback
and `IndexPrefetchNext` is used in both nodeIndexscan.c and
nodeIndexonlyscan.c.







Re: UUID v7

2024-01-19 Thread Andrey Borodin


> On 19 Jan 2024, at 08:24, David G. Johnston  
> wrote:
> 
> 
> You are mixing POSIX and ISO-8601 conventions and, as noted in our appendix, 
> they disagree on the direction that is positive.

Thanks! Now everything seems on its place.

I want to include the following tests in the patch:
-- extract UUID v1, v6 and v7 timestamp
SELECT uuid_extract_time('C232AB00-9414-11EC-B3C8-9F6BDECED846') = 'Tuesday, February 22, 2022 2:22:22.00 PM GMT+05:00';
SELECT uuid_extract_time('1EC9414C-232A-6B00-B3C8-9F6BDECED846') = 'Tuesday, February 22, 2022 2:22:22.00 PM GMT+05:00';
SELECT uuid_extract_time('017F22E2-79B0-7CC3-98C4-DC0C0C07398F') = 'Tuesday, February 22, 2022 2:22:22.00 PM GMT+05:00';

What do you think, will it be stable all across the buildfarm? Or should we
change anything to avoid false positives caused by different timestamp parsing?


> On 19 Jan 2024, at 07:58, Lukas Fittl  wrote:
> 
> Note how calling the uuidv7 function again, after having called it with a
> fixed future timestamp, returns the future timestamp, even though it should
> return the current time.

Thanks for the review.
Well, that was intentional. But now I see it's kind of confusing behaviour.
I've changed it to the more expected version.

Also, I've added some documentation on all functions.


Best regards, Andrey Borodin.


v11-0001-Implement-UUID-v7-as-per-IETF-draft.patch
Description: Binary data


Re: subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo

2024-01-19 Thread Masahiko Sawada
On Thu, Jan 18, 2024 at 6:54 PM Amit Kapila  wrote:
>
> On Thu, Jan 18, 2024 at 11:15 AM Peter Smith  wrote:
> >
> > On Thu, Jan 18, 2024 at 12:55 PM Masahiko Sawada  
> > wrote:
> > >
> > ...
> > >
> > > Although we can improve it to handle this case too, I'm not sure it's
> > > a bug. The doc says[1]:
> > >
> > > Specifies whether the subscription should be automatically disabled if
> > > any errors are detected by subscription workers during data
> > > replication from the publisher.
> > >
> > > When an apply worker is trying to establish a connection, it's not
> > > replicating data from the publisher.
> > >
> > > Regards,
> > >
> > > [1] 
> > > https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR
> > >
> > > --
> > > Masahiko Sawada
> > > Amazon Web Services: https://aws.amazon.com
> >
> > Yeah, I had also seen that wording of those docs. And I agree it
> > leaves open some room for doubts because strictly from that wording it
> > can be interpreted that establishing the connection is not actually
> > "data replication from the publisher" in which case maybe there is no
> > bug.
> >
>
> As far as I remember that was the intention. The idea was that if there is
> any conflict during apply that users need to fix manually, they have
> a way to stop the error from repeating. If we wish to extend the
> purpose of this option for another valid use case and there is a good
> way to achieve the same then we can discuss, but I don't think we need
> to change it in back-branches.

I agree not to change it in back-branches.

What is the use case of extending disable_on_error?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-01-19 Thread Bertrand Drouvot
Hi,

On Fri, Jan 19, 2024 at 11:46:51AM +0530, shveta malik wrote:
> On Fri, Jan 19, 2024 at 11:23 AM Amit Kapila  wrote:
> >
> > > > 5 === (coming from v62-0002)
> > > > +   Assert(tuplestore_tuple_count(res->tuplestore) == 1);
> > > >
> > > > Is it even possible for the related query to not return only one row?
> > > > (I think the "count" ensures it).
> > >
> > > I think you are right. This assertion was added sometime back on the
> > > basis of feedback on hackers. Let me review that again. I can consider
> > > this comment in the next version.
> > >
> >
> > OTOH, can't we keep the assert as it is but remove "= 1" from
> > "count(*) = 1" in the query. There shouldn't be more than one slot
> > with same name on the primary. Or, am I missing something?
> 
> There will be 1 record max, and 0 records if the primary_slot_name is
> invalid. 

I think we'd have exactly one record in all the cases (due to the count):

postgres=# SELECT pg_is_in_recovery(), count(*) FROM pg_replication_slots WHERE 
1 = 2;
 pg_is_in_recovery | count
---+---
 f | 0
(1 row)

postgres=# SELECT pg_is_in_recovery(), count(*) FROM pg_replication_slots WHERE 
1 = 1;
 pg_is_in_recovery | count
---+---
 f | 1
(1 row)

> Keeping 'count(*)=1' gives the benefit that it will straight
> away give us true/false indicating if we are good or not wrt
> primary_slot_name. I feel Assert can be removed and we can simply
> have:
> 
> if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
> elog(ERROR, "failed to fetch primary_slot_name tuple");
> 

I'd also vote for keeping it as it is and remove the Assert.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2024-01-19 Thread Masahiko Sawada
On Thu, Dec 7, 2023 at 10:10 AM Joe Conway  wrote:
>
> On 12/6/23 18:09, Joe Conway wrote:
> > On 12/6/23 14:47, Joe Conway wrote:
> >> On 12/6/23 13:59, Daniel Verite wrote:
> >>> Andrew Dunstan wrote:
> >>>
>  IMNSHO, we should produce either a single JSON
>  document (the ARRAY case) or a series of JSON documents, one per row
>  (the LINES case).
> >>>
> >>> "COPY Operations" in the doc says:
> >>>
> >>> " The backend sends a CopyOutResponse message to the frontend, followed
> >>> by zero or more CopyData messages (always one per row), followed by
> >>> CopyDone".
> >>>
> >>> In the ARRAY case, the first messages with the copyjsontest
> >>> regression test look like this (tshark output):
> >>>
> >>> PostgreSQL
> >>>  Type: CopyOut response
> >>>  Length: 13
> >>>  Format: Text (0)
> >>>  Columns: 3
> >>> Format: Text (0)
> >>> PostgreSQL
> >>>  Type: Copy data
> >>>  Length: 6
> >>>  Copy data: 5b0a
> >>> PostgreSQL
> >>>  Type: Copy data
> >>>  Length: 76
> >>>  Copy data:
> >>> 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…
> >>>
> >>> The first Copy data message with contents "5b0a" does not qualify
> >>> as a row of data with 3 columns as advertised in the CopyOut
> >>> message. Isn't that a problem?
> >>
> >>
> >> Is it a real problem, or just a bit of documentation change that I missed?
> >>
> >> Anything receiving this and looking for a json array should know how to
> >> assemble the data correctly despite the extra CopyData messages.
> >
> > Hmm, maybe the real problem here is that Columns do not equal "3" for
> > the json mode case -- that should really say "1" I think, because the
> > row is not represented as 3 columns but rather 1 json object.
> >
> > Does that sound correct?
> >
> > Assuming yes, there is still maybe an issue that there are two more
> > "rows" than actual output rows (the "[" and the "]"), but maybe those
> > are less likely to cause some hazard?
>
>
> The attached should fix the CopyOut response to say one column. I.e. it
> ought to look something like:
>
> PostgreSQL
>   Type: CopyOut response
>   Length: 13
>   Format: Text (0)
>   Columns: 1
>   Format: Text (0)
> PostgreSQL
>   Type: Copy data
>   Length: 6
>   Copy data: 5b0a
> PostgreSQL
>   Type: Copy data
>   Length: 76
>   Copy data: [...]
>

If I'm not missing anything, copyto_json.007.diff is the latest patch, but it
needs to be rebased onto the current HEAD. Here are random comments:

---
 if (opts_out->json_mode)
+   {
+   if (is_from)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("cannot use JSON mode in COPY FROM")));
+   }
+   else if (opts_out->force_array)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("COPY FORCE_ARRAY requires JSON mode")));

I think that flattening these two conditions makes the code more readable:

if (opts_out->json_mode && is_from)
ereport(ERROR, ...);

if (!opts_out->json_mode && opts_out->force_array)
ereport(ERROR, ...);

Also these checks can be moved close to other checks at the end of
ProcessCopyOptions().
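
A sketch of what the flattened checks could look like, reusing the error
messages from the quoted hunk (the exact placement inside ProcessCopyOptions()
is assumed, not taken from the patch):

	if (opts_out->json_mode && is_from)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("cannot use JSON mode in COPY FROM")));

	if (!opts_out->json_mode && opts_out->force_array)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("COPY FORCE_ARRAY requires JSON mode")));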

---
@@ -3395,6 +3395,10 @@ copy_opt_item:
{
$$ = makeDefElem("format", (Node *) makeString("csv"), @1);
}
+   | JSON
+   {
+   $$ = makeDefElem("format", (Node *) makeString("json"), @1);
+   }
| HEADER_P
{
$$ = makeDefElem("header", (Node *) makeBoolean(true), @1);
@@ -3427,6 +3431,10 @@ copy_opt_item:
{
$$ = makeDefElem("encoding", (Node *) makeString($2), @1);
}
+   | FORCE ARRAY
+   {
+   $$ = makeDefElem("force_array", (Node *)
makeBoolean(true), @1);
+   }
;

I believe we don't need to support new options in old-style syntax.

---
@@ -3469,6 +3477,10 @@ copy_generic_opt_elem:
{
$$ = makeDefElem($1, $2, @1);
}
+   | FORMAT_LA copy_generic_opt_arg
+   {
+   $$ = makeDefElem("format", $2, @1);
+   }
;

I think it's not necessary. The "format" option is already handled in
copy_generic_opt_elem.

---
+/* need delimiter to start next json array element */
+static bool json_row_delim_needed = false;

I think it's cleaner to include json_row_delim_needed in CopyToStateData.
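
To illustrate the suggestion (a standalone sketch; the struct and function
names here are made up and not the actual CopyToStateData), keeping the flag
in the per-COPY state rather than in a file-level static means each COPY TO
tracks its own array state:

#include <stdbool.h>
#include <stdio.h>

/* stand-in for the real per-COPY state; only the relevant field is shown */
typedef struct CopyToJsonSketch
{
	FILE	   *dest;					/* output stream */
	bool		json_row_delim_needed;	/* "," needed before next element? */
} CopyToJsonSketch;

static void
json_emit_row(CopyToJsonSketch *cstate, const char *json_row)
{
	if (cstate->json_row_delim_needed)
		fputs(",\n", cstate->dest);
	fputs(json_row, cstate->dest);
	cstate->json_row_delim_needed = true;	/* later rows need a delimiter */
}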

---
Splitting the patch into two patches, one to add the json format and one to
add the force_array option, would make review easier.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com