Re: Add index scan progress to pg_stat_progress_vacuum

2023-07-10 Thread Masahiko Sawada
On Thu, Jul 6, 2023 at 2:28 PM Masahiko Sawada  wrote:
>
> On Thu, Jul 6, 2023 at 11:15 AM Michael Paquier  wrote:
> >
> > On Thu, Jul 06, 2023 at 11:07:14AM +0900, Masahiko Sawada wrote:
> > > Thank you for updating the patch. It looks good to me too. I've
> > > updated the commit message.
> >
> > Thanks.  I was planning to review this patch today and/or tomorrow now
> > that my stack of things to do is getting slightly lower (registered my
> > name as committer as well a few weeks ago to not forget).
> >
> > One thing I was planning to do is to move the new message processing
> > API for the incremental updates in its own commit for clarity, as
> > that's a separate concept than the actual feature, useful on its own.
>
> +1. I had the same idea. Please find the attached patches.
>
> > Anyway, would you prefer taking care of it?
>
> I can take care of it if you're okay.
>

Pushed.

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Exclusion constraints on partitioned tables

2023-07-10 Thread Paul A Jungwirth
On Mon, Jul 10, 2023 at 8:06 AM Paul A Jungwirth
 wrote:
>
> On Mon, Jul 10, 2023 at 7:05 AM Peter Eisentraut  wrote:
> > I'm not sure what value we would get from testing this with btree_gist,
> > but if we wanted to do that, then adding a new test file to the
> > btree_gist sql/ directory would seem reasonable to me.
> >
> > (I would make the test a little bit bigger than you had shown, like
> > insert a few values.)
> >
> > If you want to do that, please send another patch.  Otherwise, I'm ok to
> > commit this one.
>
> I can get you a patch tonight or tomorrow. I think it's worth it since
> btree_gist uses different strategy numbers than ordinary gist.

Patch attached.

Regards,
Paul


v5-0001-Allow-some-exclusion-constraints-on-partitions.patch
Description: Binary data


Re: doc: improve the restriction description of using indexes on REPLICA IDENTITY FULL table.

2023-07-10 Thread Masahiko Sawada
On Tue, Jul 11, 2023 at 1:05 PM Amit Kapila  wrote:
>
> On Tue, Jul 11, 2023 at 4:54 AM Peter Smith  wrote:
> >
> > On Mon, Jul 10, 2023 at 2:21 PM Amit Kapila  wrote:
> > >
> > > On Mon, Jul 10, 2023 at 7:55 AM Peter Smith  wrote:
> > > >
> > > > On Sat, Jul 8, 2023 at 1:49 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Fri, Jul 7, 2023 at 1:36 PM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > > I prefer the first suggestion. I've attached the updated patch.
> > > > > >
> > > > >
> > > > > This looks mostly good to me but I think it would be better if we can
> > > > > also add the information that the leftmost index column must be a
> > > > > non-expression. So, how about: "Candidate indexes must be btree,
> > > > > non-partial, and the leftmost index column must be a non-expression
> > > > > and reference to a published table column (i.e. cannot consist of only
> > > > > expressions)."?
> > > >
> > > > That part in parentheses ought to say "the index ..." because it is
> > > > referring to the full INDEX, not to the leftmost column. I think this
> > > > was missed when Sawada-san took my previous suggestion [1].
> > > >
> > > > IMO it doesn't sound right to say the "index column must be a
> > > > non-expression". It is already a non-expression because it is a
> > > > column. So I think it would be better to refer to this as an INDEX
> > > > "field" instead of an INDEX column. Note that "field" is the same
> > > > terminology used in the docs for CREATE INDEX [2].
> > > >
> > >
> > > I thought it would be better to be explicit for this case but I am
> > > fine if Sawada-San and you prefer some other way to document it.
> > >
> >
> > I see. How about just moving the parenthesized part to explicitly
> > refer only to the leftmost field?
> >
> > SUGGESTION
> > Candidate indexes must be btree, non-partial, and the leftmost index
> > field must be a column (not an expression) that references a published
> > table column.
> >
>
> Yeah, something like that works for me.

Looks good to me. I've attached the updated patch. In the comment in
RemoteRelContainsLeftMostColumnOnIdx(), I used "remote relation"
instead of "published table" as it's more consistent with surrounding
comments. Also, I've removed the comment starting with "We also skip
indedes..." as the new comment now covers it. Please review it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v4-0001-Doc-clarify-the-conditions-of-usable-indexes-for-.patch
Description: Binary data


Re: Generating code for query jumbling through gen_node_support.pl

2023-07-10 Thread Michael Paquier
On Tue, Jul 11, 2023 at 12:29:29PM +0700, Andrey Lepikhov wrote:
> I vote for only one method based on a query tree structure.

Noted.

> BTW, did you think about different algorithms of queryId generation?

Not really, except if you are referring to the possibility of being
able to handle different portions of the nodes differently depending
on a context given by the callers willing to do a query jumbling
computation.  (For example, choose to *not* silence the Const nodes,
etc.)

> Auto-generated queryId code can open a way for extensions to have
> easy-supporting custom queryIds.

Extensions can already control that to some extent.
--
Michael


signature.asc
Description: PGP signature


unrecognized node type while displaying a Path due to dangling pointer

2023-07-10 Thread Jeevan Chalke
Hello,

While debugging one issue, we have added the below line in
postgresGetForeignPlan() to see the foreignrel details.

@@ -1238,6 +1238,8 @@ postgresGetForeignPlan(PlannerInfo *root,
boolhas_limit = false;
ListCell   *lc;

+   elog(INFO, "foreignrel: %s", nodeToString(foreignrel));
+

And with this change, we ran the postgres_fdw regression (executing
sql/postgres_fdw.sql) suite. We observed the below warnings that seem
strange.

+WARNING:  could not dump unrecognized node type: 0
+WARNING:  could not dump unrecognized node type: 26072088
+WARNING:  could not dump unrecognized node type: 26438448
+WARNING:  could not dump unrecognized node type: 368

Of course, with multiple runs, we see some random node types listed above.
Thanks to my colleague Suraj Kharage for this and for working parallel with
me.

Does anybody have any idea about these?

After debugging one random query from the above-failed case, what we have
observed is (we might be wrong, but worth noting here):

1. This warning ended up while displaying RelOptInfo->pathlist.
2. In create_ordered_paths(), input_rel has two paths, and it loops over
both paths to get the best-sorted path.
3. The first path was unsorted, and thus we add a sort node on top of it and
add that to the ordered_rel.
4. However, the 2nd path was already sorted and is passed as is to add_path().
5. add_path() decides to reject this new path on some metrics. However, in
the end, it pfree()s this passed-in path. That seems wrong, as references to
it are still present elsewhere, for example in the first path's parent
rel's path list.
6. So, while displaying the parent's path, we end up with these warnings.
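To illustrate, here is a simplified sketch of the flow described above (not
the actual create_ordered_paths() code, just the suspected pattern):

/* Sketch: loop over the input rel's existing paths, as described above. */
foreach(lc, input_rel->pathlist)
{
    Path   *input_path = (Path *) lfirst(lc);

    if (pathkeys_contained_in(root->sort_pathkeys, input_path->pathkeys))
    {
        /* Already sorted, so the path is handed to add_path() as is. */
        add_path(ordered_rel, input_path);

        /*
         * If add_path() rejects the path, it also pfree()s it, even though
         * input_rel->pathlist still points at it.  A later
         * nodeToString(input_rel) then walks freed memory and reports
         * "could not dump unrecognized node type".
         */
    }
    else
    {
        /* Unsorted: a Sort path is built on top and added instead. */
    }
}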

I tried to get a fix for this but no luck so far.
One approach was to copy the path before passing it to add_path().
However, there is no easy way to copy a path due to its circular references.

To see whether this warning goes away or not, I commented out the code in add_path()
that does pfree() on the new_path. And with that, I don't see any warnings.
But removing that code doesn't seem to be the correct fix.

Suggestions?

Thanks

-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager, Product Development*



edbpostgres.com


Re: Generating code for query jumbling through gen_node_support.pl

2023-07-10 Thread Andrey Lepikhov

On 11/7/2023 05:35, Michael Paquier wrote:

On Mon, Jan 30, 2023 at 11:48:45AM +0100, Peter Eisentraut wrote:

On 27.01.23 03:59, Michael Paquier wrote:

At the end, that would be unnoticeable for the average user, I guess,
but here are the numbers I get on my laptop :)


Personally, I think we do not want the two jumble methods in parallel.

Maybe there are other opinions.


(Thanks Jonathan for the poke.)

Now that we are in mid-beta for 16, it would be a good time to
conclude on this open item:
"Reconsider a utility_query_id GUC to control if query jumbling of
utilities can go through the past string-only mode and the new mode?"

In Postgres ~15, utility commands used a hash of the query string to
compute their query ID.  The current query jumbling code uses a Query
instead, like any other queries.  I have registered this open item as
a self-reminder, mostly in case there would be an argument to have a
GUC where users could switch from one mode to another.  See here as
well for some computation times for each method (table is in ns, with
millions of iterations):
https://www.postgresql.org/message-id/y9eeyindb1acp...@paquier.xyz

I still don't think that we need both methods based on these numbers,
but there may be more opinions about that?  Are people OK if this open
item is discarded?

I vote for only one method based on a query tree structure.
BTW, did you think about different algorithms of queryId generation? 
Auto-generated queryId code can open a way for extensions to have 
easy-supporting custom queryIds.


--
regards,
Andrey Lepikhov
Postgres Professional





Re: Show various offset arrays for heap WAL records

2023-07-10 Thread Peter Geoghegan
On Mon, Jul 10, 2023 at 12:44 AM Heikki Linnakangas  wrote:
> This is still listed in the July commitfest; is there some work remaining?

I don't think so; not in the scope of the original patch series from
Melanie, at least.

> You moved the comment from rmgrdesc_utils.c into rmgrdesc_utils.h, but I
> don't think that was a good idea. Our usual convention is to have the
> function comment in the .c file, not at the declaration in the header
> file. When I want to know what a function does, I jump to the .c file,
> and might miss the comment in the header entirely.

I think that this was a gray area. It wasn't particularly obvious
where this would go. At least not to me.

> Let's add a src/backend/access/rmgrdesc/README file. We don't currently
> have any explanation anywhere why the rmgr desc functions are in a
> separate directory. The README would be a good place to explain that,
> and to have the formatting guidelines. See attached.

I agree that it's better this way, though.

-- 
Peter Geoghegan




RE: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-07-10 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Önder

Thanks for the discussions. IIUC all have agreed that the patch focuses on
extending support to hash indexes. PSA the patch for that.
The basic workflow is not much changed; some comments were updated.

Regarding the test code, I think it should be combined into
032_subscribe_use_index.pl because they test the same feature. I have just
copied the tests to the latter part of 032.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v3-0001-Allow-the-use-Hash-index-when-REPLICA-IDENTITY-FU.patch
Description:  v3-0001-Allow-the-use-Hash-index-when-REPLICA-IDENTITY-FU.patch


Re: Use COPY for populating all pgbench tables

2023-07-10 Thread Michael Paquier
On Wed, Jun 14, 2023 at 10:58:06AM -0500, Tristan Partin wrote:
> Again, I forget to actually attach. Holy guacamole.

Looks rather OK seen from here, applied 0001 while browsing the
series.  I have a few comments about 0002.

 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+   /* "filler" column defaults to NULL */
+   printfPQExpBuffer(sql,
+ INT64_FORMAT "\t0\t\n",
+ curr + 1);
+}
+
+static void
+initTeller(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+   /* "filler" column defaults to NULL */
+   printfPQExpBuffer(sql,
+ INT64_FORMAT "\t" INT64_FORMAT 
"\t0\t\n",
+ curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+   /* "filler" column defaults to blank padded empty string */
+   printfPQExpBuffer(sql,
+ INT64_FORMAT "\t" INT64_FORMAT 
"\t0\t\n",
+ curr + 1, curr / naccounts + 1);
+}

I was looking at that, and while this strikes me as a duplication for
the second and third ones, I'm OK with the use of a callback to fill
in the data, even if naccounts and ntellers are equivalent to the
"base" number given to initPopulateTable(), because the "filler"
column has different expectations for each table.  These routines
don't need a PGconn argument, by the way.

/* use COPY with FREEZE on v14 and later without partitioning */
if (partitions == 0 && PQserverVersion(con) >= 14)
-   copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
+   copy_statement_fmt = "copy %s from stdin with (freeze on)";
else
-   copy_statement = "copy pgbench_accounts from stdin";
+   copy_statement_fmt = "copy %s from stdin";

This seems a bit incorrect because partitioning only applies to
pgbench_accounts, no?  This change means that the teller and branch
tables would not benefit from FREEZE under a pgbench compiled with
this patch and a backend version of 14 or newer if --partitions is
used.  So, perhaps "partitions" should be an argument of
initPopulateTable() specified for each table? 
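For illustration only (this is not the patch's code, just a hypothetical
helper showing the idea):

static char *
build_copy_statement(PGconn *con, const char *table, bool is_partitioned)
{
    const char *fmt;

    /* use COPY with FREEZE on v14 and later, but only if this table is
     * not partitioned */
    if (!is_partitioned && PQserverVersion(con) >= 140000)
        fmt = "copy %s from stdin with (freeze on)";
    else
        fmt = "copy %s from stdin";

    return psprintf(fmt, table);
}

That way only pgbench_accounts would lose FREEZE when --partitions is used,
and the teller and branch tables would still benefit from it.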
--
Michael


signature.asc
Description: PGP signature


Re: Autogenerate some wait events code and documentation

2023-07-10 Thread Drouvot, Bertrand




On 7/11/23 12:52 AM, Michael Paquier wrote:

On Mon, Jul 10, 2023 at 09:11:36AM +0200, Alvaro Herrera wrote:

I don't like this bit, because it means the .txt file is now ungreppable
as source of the enum name.  Things become mysterious and people have to
track down the event name by reading the the Perl generating script.
It's annoying.  I'd rather have the extra column, even if it means a
little duplicity.


Hmm.  I can see your point that we'd lose the direct relationship
between the enum and string when running a single `git grep` from the
tree, still attempting to do that does not actually lead to much
information gained?  Personally, I usually grep for code when looking
for consistent information across various paths in the tree.  Wait
events are very different: each enum is used in a single place in the
tree making their grep search the equivalent of looking at
wait_event_names.txt anyway?



Before commit fa88928470 one could find the relationship between the enum and 
the name
in wait_event.c (a simple git grep would provide it).

With commit fa88928470 in place, one could find the relationship between the 
enum and the name
in wait_event_names.txt (a simple git grep would provide it).

With the proposal we are discussing here, once the build is done and the
pgstat_wait_event.c file is generated, then we have the same "grep"
capability as pre commit
fa88928470 (except that
"git grep" can't be used and one would need things like
find . -name "*.c" -exec grep -il "WAIT_EVENT_CHECKPOINTER_MAIN" {} \;)

I agree that it is less "obvious" than pre fa88928470, but it is still doable.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: doc: clarify the limitation for logical replication when REPILICA IDENTITY is FULL

2023-07-10 Thread Amit Kapila
On Mon, Jul 10, 2023 at 7:26 PM Sergei Kornilov  wrote:
>
> >> Is this restriction only for the subscriber?
> >>
> >> If we have not changed the replica identity and there is no primary key, 
> >> then we forbid update and delete on the publication side (a fairly common 
> >> usage error at the beginning of using publications).
> >> If we have replica identity FULL (the table has such a column), then on 
> >> the subscription side, update and delete will be performed.
> >
> > In the above sentence, do you mean the publisher side?
>
> Yep, sorry.
>
> > But we will not be able to apply them on a subscription. Right?
> >
> > If your previous sentence talks about the publisher and this sentence
> > about the subscriber then what you are saying is correct. You can see
> > the example in the email [1].
>
> Thank you
>
> >> This is an important difference for real use, when the subscriber is not 
> >> necessarily postgresql - for example, debezium.
> >
> > Can you explain the difference and problem you are seeing? As per my
> > understanding, this is the behavior from the time logical replication
> > has been introduced.
>
> The difference is that if it's a subscriber-only restriction, then it won't 
> automatically apply to anyone with a non-postgresql subscriber.
> But if suddenly this would be a limitation of the publisher - then it will 
> automatically apply to everyone, regardless of which subscriber is used.
> (and it's a completely different problem if the restriction affects the 
> update/delete themselves, not only their replication. Like as default replica 
> identity on table without primary key, not in this case)
>
> So, I suggest to mention subscriber explicitly:
>
> + class of Btree, then UPDATE and 
> DELETE
> -  operations cannot be replicated.
> + operations cannot be applied on subscriber.
>
> Another example of difference:
> Debezium users sometimes ask to set identity to FULL to get access to old 
> values: https://stackoverflow.com/a/59820210/10983392
> However, identity FULL is described in the documentation as: 
> https://www.postgresql.org/docs/current/logical-replication-publication.html
>

After seeing this, I am thinking about whether we should add this restriction
on the Subscription page [1] or the Restrictions page [2] as proposed. Do
others have any preference?

[1] - 
https://www.postgresql.org/docs/devel/logical-replication-subscription.html
[2] - 
https://www.postgresql.org/docs/devel/logical-replication-restrictions.html

-- 
With Regards,
Amit Kapila.




Re: doc: improve the restriction description of using indexes on REPLICA IDENTITY FULL table.

2023-07-10 Thread Amit Kapila
On Tue, Jul 11, 2023 at 4:54 AM Peter Smith  wrote:
>
> On Mon, Jul 10, 2023 at 2:21 PM Amit Kapila  wrote:
> >
> > On Mon, Jul 10, 2023 at 7:55 AM Peter Smith  wrote:
> > >
> > > On Sat, Jul 8, 2023 at 1:49 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Fri, Jul 7, 2023 at 1:36 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > I prefer the first suggestion. I've attached the updated patch.
> > > > >
> > > >
> > > > This looks mostly good to me but I think it would be better if we can
> > > > also add the information that the leftmost index column must be a
> > > > non-expression. So, how about: "Candidate indexes must be btree,
> > > > non-partial, and the leftmost index column must be a non-expression
> > > > and reference to a published table column (i.e. cannot consist of only
> > > > expressions)."?
> > >
> > > That part in parentheses ought to say "the index ..." because it is
> > > referring to the full INDEX, not to the leftmost column. I think this
> > > was missed when Sawada-san took my previous suggestion [1].
> > >
> > > IMO it doesn't sound right to say the "index column must be a
> > > non-expression". It is already a non-expression because it is a
> > > column. So I think it would be better to refer to this as an INDEX
> > > "field" instead of an INDEX column. Note that "field" is the same
> > > terminology used in the docs for CREATE INDEX [2].
> > >
> >
> > I thought it would be better to be explicit for this case but I am
> > fine if Sawada-San and you prefer some other way to document it.
> >
>
> I see. How about just moving the parenthesized part to explicitly
> refer only to the leftmost field?
>
> SUGGESTION
> Candidate indexes must be btree, non-partial, and the leftmost index
> field must be a column (not an expression) that references a published
> table column.
>

Yeah, something like that works for me.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-10 Thread Amit Kapila
On Mon, Jul 10, 2023 at 8:01 PM Melih Mutlu  wrote:
>
> Hayato Kuroda (Fujitsu) , 6 Tem 2023 Per,
> 12:47 tarihinde şunu yazdı:
> >
> > Dear Melih,
> >
> > > Thanks for the 0003 patch. But it did not work for me. Can you create
> > > a subscription successfully with patch 0003 applied?
> > > I get the following error: " ERROR:  table copy could not start
> > > transaction on publisher: another command is already in progress".
> >
> > You got the ERROR when all the patches (0001-0005) were applied, right?
> > I have focused on 0001 and 0002 only, so I missed something.
> > If it was not correct, please attach the logfile and test script what you 
> > did.
>
> Yes, I did get an error with all patches applied. But with only 0001
> and 0002, your version seems like working and mine does not.
> What do you think about combining 0002 and 0003? Or should those stay 
> separate?
>

I am fine either way but I think one minor advantage of keeping 0003
separate is that we can focus on some of the problems specific to that
patch. For example, the following comment in the 0003 patch: "FIXME:
set appropriate application_name...". I have given a suggestion to
address it in [1] and Kuroda-San seems to have addressed the same but
I am not sure if all of us agree with that or if there is any better
way to address it. What do you think?

>
> > * 0003 basically improved performance from first two patches
>
> Agree, 0003 is definitely a good addition which was missing earlier.
>

+1.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JOZHmy2o2F2wTCPKsjpwDiKZPOeTa_jt%3Dwm2JLbf-jsg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Cleaning up array_in()

2023-07-10 Thread jian he
On Sun, Jul 9, 2023 at 3:38 AM Heikki Linnakangas  wrote:
>
> On 08/07/2023 19:08, Tom Lane wrote:
> > I wrote:
> >> So I end up with the attached.  I went ahead and dropped
> >> ArrayGetOffset0() as part of 0001, and I split 0002 into two patches
> >> where the new 0002 avoids re-indenting any existing code in order
> >> to ease review, and then 0003 is just a mechanical application
> >> of pgindent.
> >
> > That got sideswiped by ae6d06f09, so here's a trivial rebase to
> > pacify the cfbot.
> >
> > #text/x-diff; name="v3-0001-Simplify-and-speed-up-ReadArrayStr.patch" 
> > [v3-0001-Simplify-and-speed-up-ReadArrayStr.patch] 
> > /home/tgl/pgsql/v3-0001-Simplify-and-speed-up-ReadArrayStr.patch
> > #text/x-diff; 
> > name="v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch" 
> > [v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch] 
> > /home/tgl/pgsql/v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch
> > #text/x-diff; name="v3-0003-Re-indent-ArrayCount.patch" 
> > [v3-0003-Re-indent-ArrayCount.patch] 
> > /home/tgl/pgsql/v3-0003-Re-indent-ArrayCount.patch
>
> Something's wrong with your attachments.
>
> Hmm, I wonder if ae6d06f09 had a negative performance impact. In an
> unquoted array element, scanner_isspace() function is called for every
> character, so it might be worth inlining.
>
> On the patches: They are a clear improvement, thanks for that. That
> said, I still find the logic very hard to follow, and there are some
> obvious performance optimizations that could be made.
>
> ArrayCount() interprets low-level quoting and escaping, and tracks the
> dimensions at the same time. The state machine is pretty complicated.
> And when you've finally finished reading and grokking that function, you
> see that ReadArrayStr() repeats most of the same logic. Ugh.
>
> I spent some time today refactoring it for readability and speed. I
> introduced a separate helper function to tokenize the input. It deals
> with whitespace, escapes, and backslashes. Then I merged ArrayCount()
> and ReadArrayStr() into one function that parses the elements and
> determines the dimensions in one pass. That speeds up parsing large
> arrays. With the tokenizer function, the logic in ReadArrayStr() is
> still quite readable, even though it's now checking the dimensions at
> the same time.
>
> I also noticed that we used atoi() to parse the integers in the
> dimensions, which doesn't do much error checking. Some funny cases were
> accepted because of that, for example:
>
> postgres=# select '[1+-+-+-+-+-+:2]={foo,bar}'::text[];
> text
> ---
>   {foo,bar}
> (1 row)
>
> I tightened that up in the passing.
>
> Attached are your patches, rebased to fix the conflicts with ae6d06f09
> like you intended. On top of that, my patches. My patches need more
> testing, benchmarking, and review, so if we want to sneak something into
> v16, better go with just your patches. If we're tightening up the
> accepted inputs, maybe fix that atoi() sloppiness, though.
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)

Your idea is very clear!
All the namings are way more descriptive. For ArrayToken, personally I think
something with "state" or "type" would be clearer.

> /*
> * FIXME: Is this still required? I believe all the checks it performs are
> * redundant with other checks in ReadArrayDimension() and ReadArrayStr()
> */
> nitems_according_to_dims = ArrayGetNItemsSafe(ndim, dim, escontext);
> if (nitems_according_to_dims < 0)
> PG_RETURN_NULL();
> if (nitems != nitems_according_to_dims)
> elog(ERROR, "mismatch nitems, %d vs %d", nitems, nitems_according_to_dims);
> if (!ArrayCheckBoundsSafe(ndim, dim, lBound, escontext))
> PG_RETURN_NULL();

--first time run
select '[0:3][0:2]={{1,2,3}, {4,5,6}, {7,8,9},{1,2,3}}'::int[];
INFO:  253 after ReadArrayDimensions dim:  4 3 71803430 21998
103381120 21998 ndim: 2
INFO:  770 after ReadArrayStr: dim:  4 3 71803430 21998 103381120
21998 nitems:12, ndim:2

--second time run.
INFO:  253 after ReadArrayDimensions dim:  4 3 0 0 0 0 ndim: 2
INFO:  770 after ReadArrayStr: dim:  4 3 0 0 0 0 nitems:12, ndim:2

select '{{1,2,3}, {4,5,6}, {7,8,9},{1,2,3}}'::int[]; --every time run,
the result is the same.
INFO:  253 after ReadArrayDimensions dim:  0 0 0 0 0 0 ndim: 0
INFO:  770 after ReadArrayStr: dim:  4 3 -1 -1 -1 -1 nitems:12, ndim:2

I think the reason is that the dim int array isn't explicitly assigned
values when it is initialized.

> /* Now it's safe to compute ub + 1 */
> if (ub + 1 < lBound[ndim])
> ereturn(escontext, false,
> (errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
> errmsg("upper bound cannot be less than lower bound minus one")));

This part seems safe against cases like select
'[-2147483649:-2147483648]={1,2}'::int[];
but I am not sure. If so, then ArrayCheckBoundsSafe is unnecessary.

Another corner case succeeded: select '{1,}'::int[]; fails as it should.
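
As for the atoi() sloppiness mentioned above, a minimal sketch (an assumption
on my side, not the actual patch code) of parsing a dimension bound strictly
with strtol():

/* needs <errno.h>, <limits.h>, <stdlib.h> */
static bool
parse_dim_bound(const char *str, const char **endp, int *result)
{
    char   *end;
    long    val;

    errno = 0;
    val = strtol(str, &end, 10);
    /* reject empty input and out-of-range values; trailing junk that
     * atoi() would ignore is detected by the caller via *endp */
    if (end == str || errno == ERANGE || val < INT_MIN || val > INT_MAX)
        return false;
    *endp = end;            /* caller verifies the next char is ':' or ']' */
    *result = (int) val;
    return true;
}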




Re: Make pgbench exit on SIGINT more reliably

2023-07-10 Thread Michael Paquier
On Tue, Jun 27, 2023 at 09:42:05AM -0500, Tristan Partin wrote:
> I would say there probably isn't much benefit if you think the polling
> for CancelRequested is fine. Looking at the other patch, I think it
> might be simple to add an exit code for SIGINT. But I think it might be
> best to do it after that patch is merged. I added the other patch to the
> commitfest for July. Holding off on this one.

Okay, I am going to jump on the patch to switch to COPY, then.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-10 Thread Peter Smith
On Tue, Jul 11, 2023 at 12:31 AM Melih Mutlu  wrote:
>
> Hi,
>
> Hayato Kuroda (Fujitsu) , 6 Tem 2023 Per,
> 12:47 tarihinde şunu yazdı:
> >
> > Dear Melih,
> >
> > > Thanks for the 0003 patch. But it did not work for me. Can you create
> > > a subscription successfully with patch 0003 applied?
> > > I get the following error: " ERROR:  table copy could not start
> > > transaction on publisher: another command is already in progress".
> >
> > You got the ERROR when all the patches (0001-0005) were applied, right?
> > I have focused on 0001 and 0002 only, so I missed something.
> > If it was not correct, please attach the logfile and test script what you 
> > did.
>
> Yes, I did get an error with all patches applied. But with only 0001
> and 0002, your version seems like working and mine does not.
> What do you think about combining 0002 and 0003? Or should those stay 
> separate?
>

Even if patches 0003 and 0002 are to be combined, I think that should
not happen until after it is confirmed which "reuse" design is best.

e.g. IMO it might be easier to compare the different PoC designs for
patch 0002 if there is no extra logic involved.

PoC design#1 -- each tablesync decides for itself what to do next
after it finishes
PoC design#2 -- reuse tablesync using a "pool" of available workers

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-10 Thread Peter Smith
Here are some review comments for patch v16-3

==
1. Commit Message.

The patch description is missing.

==
2. General.

+LogicalRepSyncTableStart(XLogRecPtr *origin_startpos, int worker_slot)

and

+start_table_sync(XLogRecPtr *origin_startpos,
+ char **myslotname,
+ int worker_slot)

and

@@ -4548,12 +4552,13 @@ run_tablesync_worker(WalRcvStreamOptions *options,
  char *slotname,
  char *originname,
  int originname_size,
- XLogRecPtr *origin_startpos)
+ XLogRecPtr *origin_startpos,
+ int worker_slot)


It seems the worker_slot is being passed all over the place as an
additional function argument so that it can be used to construct an
application_name. Is it possible/better to introduce a new
'MyLogicalRepWorker' field for the 'worker_slot' so it does not have
to be passed like this?
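
For example (just a sketch of that idea; the member name is an assumption):

/* assuming LogicalRepWorker gained an "int worker_slot" member set at launch */
snprintf(application_name, NAMEDATALEN, "pg_%u_sync_%i",
         MySubscription->oid, MyLogicalRepWorker->worker_slot);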

==
src/backend/replication/logical/tablesync.c

3.
+ /*
+ * Disconnect from publisher. Otherwise reused sync workers causes
+ * exceeding max_wal_senders.
+ */
+ if (LogRepWorkerWalRcvConn != NULL)
+ {
+ walrcv_disconnect(LogRepWorkerWalRcvConn);
+ LogRepWorkerWalRcvConn = NULL;
+ }
+

Why is this comment mentioning anything about "reused workers" at all?
The worker process exits in this function, right?

~~~

4. LogicalRepSyncTableStart

  /*
- * Here we use the slot name instead of the subscription name as the
- * application_name, so that it is different from the leader apply worker,
- * so that synchronous replication can distinguish them.
+ * Connect to publisher if not yet. The application_name must be also
+ * different from the leader apply worker because synchronous replication
+ * must distinguish them.
  */

I felt all the details in the 2nd part of this comment belong inside
the condition, not outside.

SUGGESTION
/* Connect to the publisher if haven't done so already. */

~~~

5.
+ if (LogRepWorkerWalRcvConn == NULL)
+ {
+ char application_name[NAMEDATALEN];
+
+ /*
+ * FIXME: set appropriate application_name. Previously, the slot name
+ * was used becasue the lifetime of the tablesync worker was same as
+ * that, but now the tablesync worker handles many slots during the
+ * synchronization so that it is not suitable. So what should be?
+ * Note that if the tablesync worker starts to reuse the replication
+ * slot during synchronization, we should use the slot name as
+ * application_name again.
+ */
+ snprintf(application_name, NAMEDATALEN, "pg_%u_sync_%i",
+ MySubscription->oid, worker_slot);
+ LogRepWorkerWalRcvConn =
+ walrcv_connect(MySubscription->conninfo, true,
+must_use_password,
+application_name, );
+ }

5a.
/becasue/because/

~

5b.
I am not sure about what name this should ideally use, but anyway for
uniqueness doesn't it still need to include the GetSystemIdentifier()
same as function ReplicationSlotNameForTablesync() was doing?

Maybe the same function ReplicationSlotNameForTablesync() can be used
here, but just passing the worker_slot instead of the relid?
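
For example, mirroring what ReplicationSlotNameForTablesync() does for slot
names (the exact format here is only an assumption):

snprintf(application_name, NAMEDATALEN, "pg_%u_sync_%i_" UINT64_FORMAT,
         MySubscription->oid, worker_slot, GetSystemIdentifier());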

==
src/backend/replication/logical/worker.c

6. LogicalRepApplyLoop

  /*
  * Init the ApplyMessageContext which we clean up after each replication
- * protocol message.
+ * protocol message, if needed.
  */
- ApplyMessageContext = AllocSetContextCreate(ApplyContext,
- "ApplyMessageContext",
- ALLOCSET_DEFAULT_SIZES);
+ if (!ApplyMessageContext)
+ ApplyMessageContext = AllocSetContextCreate(ApplyContext,
+ "ApplyMessageContext",
+

Maybe slightly reword the comment.

BEFORE:
Init the ApplyMessageContext which we clean up after each replication
protocol message, if needed.

AFTER:
Init the ApplyMessageContext if needed. This context is cleaned up
after each replication protocol message.

==
src/backend/replication/walsender.c

7.
+ /*
+ * Initialize the flag again because this streaming may be
+ * second time.
+ */
+ streamingDoneSending = streamingDoneReceiving = false;

Isn't a 2nd time here only possible because the "reuse tablesync
worker" might issue START_REPLICATION again to the same WALSender? So,
should this flag reset ONLY be done for logical replication (the 'else'
part), and otherwise be asserted false?

e.g. Would it be better to be like this?

if (cmd->kind == REPLICATION_KIND_PHYSICAL)
{
Assert(!streamingDoneSending && !streamingDoneReceiving)
StartReplication(cmd);
}
else
{
/* Reset flags because reusing tablesync workers can mean this is the
second time here. */
streamingDoneSending = streamingDoneReceiving = false;
StartLogicalReplication(cmd);
}

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Masahiko Sawada
On Tue, Jul 11, 2023 at 1:24 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> > Out of curiosity I've tried and it is reproducible as you have stated : XFS
> > @ 4.18.0-425.10.1.el8_7.x86_64:
> >...
> > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> > the XFS issues sync writes while ext4 does not, xfs looks like constant
> > loop of sync writes (D) by kworker/2:1H-kblockd:
>
> That clearly won't go well.  It's not reproducible on newer systems,
> unfortunately :(. Or well, fortunately maybe.
>
>
> I wonder if a trick to avoid this could be to memorialize the fact that we
> bulk extended before and extend by that much going forward? That'd avoid the
> swapping back and forth.
>
>
> One thing that confuses me is that Sawada-san observed a regression at a
> single client - yet from what I can tell, there's actually not a single
> fallocate() being generated in that case, because the table is so narrow that
> we never end up extending by a sufficient number of blocks in
> RelationAddBlocks() to reach that path. Yet there's a regression at 1 client.
>
> I don't yet have a RHEL8 system at hand, could either of you send the result
> of a
>   strace -c -p $pid_of_backend_doing_copy
>

Here are the results:

* PG16: nclients = 1, execution time = 23.758
% time seconds  usecs/call callserrors syscall
-- --- --- - - 
 53.180.308675   0358898   pwrite64
 33.920.196900   2 81202   pwritev
  7.780.045148   0 81378   lseek
  5.060.029371   2 11141   read
  0.040.000250   291   openat
  0.020.94   189   close
  0.000.00   0 1   munmap
  0.000.00   084   brk
  0.000.00   0 1   sendto
  0.000.00   0 2 1 recvfrom
  0.000.00   0 2   kill
  0.000.00   0 1   futex
  0.000.00   0 1   epoll_wait
-- --- --- - - 
100.000.580438   1532891 1 total

* PG16: nclients = 2, execution time = 18.045
% time seconds  usecs/call callserrors syscall
-- --- --- - - 
 54.190.218721   1187803   pwrite64
 29.240.118002   2 40242   pwritev
  6.230.025128   0 41239   lseek
  5.030.020285   2  9133   read
  2.040.008229   9   855   fallocate
  1.280.005151   0  5598  1000 futex
  1.120.004516   1  3965   kill
  0.780.003141   1  3058 1 epoll_wait
  0.060.000224   2   100   openat
  0.030.000136   198   close
  0.010.50   084   brk
  0.000.13   022   setitimer
  0.000.06   015 1 rt_sigreturn
  0.000.02   2 1   munmap
  0.000.02   2 1   sendto
  0.000.02   0 3 2 recvfrom
-- --- --- - - 
100.000.403608   1292217  1004 total

* PG16: nclients = 4, execution time = 18.807
% time seconds  usecs/call callserrors syscall
-- --- --- - - 
 64.610.240171   2 93868   pwrite64
 15.110.056173   4 11337   pwritev
  7.290.027100   7  3465   fallocate
  4.050.015048   2  5179   read
  3.550.013188   0 14941   lseek
  2.650.009858   1  8675  2527 futex
  1.760.006536   1  4190   kill
  0.880.003268   1  2199   epoll_wait
  0.060.000213   2   101   openat
  0.030.000130   199   close
  0.010.31   118   rt_sigreturn
  0.010.27   117   setitimer
  0.000.00   0 1   munmap
  0.000.00   084   brk
  0.000.00   0 1   sendto
  0.000.00   0 1   recvfrom
-- --- --- - - 
100.000.371743   2144176  2527 total

* PG16: nclients = 8, execution time = 11.982
% time seconds  usecs/call callserrors syscall
-- --- --- - - 
 73.16  

Re: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-07-10 Thread Amit Kapila
On Mon, Jul 10, 2023 at 7:43 PM Önder Kalacı  wrote:
>
>> I also think so. If this is true, how can we think of supporting
>> indexes other than hash like GiST, and SP-GiST as mentioned by you in
>> your latest email? As per my understanding if we don't have PK or
>> replica identity then after the index scan, we do tuples_equal which
>> will fail for GIST or SP-GIST. Am, I missing something?
>
>
> I also don't think we can support anything other than btree, hash and brin as 
> those lack equality operators.
>
> And, for BRIN, Hayato brought up the amgettuple issue, which is fair. So, 
> overall, as far as I can see, we can
> easily add hash indexes but all others are either very hard to add or not 
> possible.
>

Agreed. So, let's update the patch with comments indicating the
challenges of supporting index types other than btree and hash.

-- 
With Regards,
Amit Kapila.




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Masahiko Sawada
On Tue, Jul 11, 2023 at 12:34 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
> > While testing PG16, I observed that in PG16 there is a big performance
> > degradation in concurrent COPY into a single relation with 2 - 16
> > clients in my environment. I've attached a test script that measures
> > the execution time of COPYing 5GB data in total to the single relation
> > while changing the number of concurrent insertions, in PG16 and PG15.
> > Here are the results on my environment (EC2 instance, RHEL 8.6, 128
> > vCPUs, 512GB RAM):
>
> Gah, RHEL with its frankenkernels, the bane of my existence.
>
> FWIW, I had extensively tested this with XFS, just with a newer kernel. Have
> you tested this on RHEL9 as well by any chance?

I've tested this on RHEL9 (m5.24xlarge; 96vCPUs, 384GB RAM), and it
seems to be reproducible on RHEL9 too.

$ uname -rms
Linux 5.14.0-284.11.1.el9_2.x86_64 x86_64
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)

PG15: nclients = 1, execution time = 22.193
PG15: nclients = 2, execution time = 12.430
PG15: nclients = 4, execution time = 8.677
PG15: nclients = 8, execution time = 6.144
PG15: nclients = 16, execution time = 5.938
PG15: nclients = 32, execution time = 5.775
PG15: nclients = 64, execution time = 5.928
PG15: nclients = 128, execution time = 6.346
PG15: nclients = 256, execution time = 6.641

PG16: nclients = 1, execution time = 24.601
PG16: nclients = 2, execution time = 27.845
PG16: nclients = 4, execution time = 40.575
PG16: nclients = 8, execution time = 24.379
PG16: nclients = 16, execution time = 15.835
PG16: nclients = 32, execution time = 8.370
PG16: nclients = 64, execution time = 4.042
PG16: nclients = 128, execution time = 2.956
PG16: nclients = 256, execution time = 2.591

Tests with test.c program:

$ rm -f test.data; time ./test test.data 0
total   20
fallocate   0
filewrite   20

real0m0.709s
user0m0.057s
sys 0m0.649s

$ rm -f test.data; time ./test test.data 1
total   20
fallocate   20
filewrite   0

real0m0.853s
user0m0.058s
sys 0m0.791s

$ rm -f test.data; time ./test test.data 2
total   20
fallocate   10
filewrite   10

real2m10.002s
user0m0.366s
sys 0m6.649s
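
For reference, a rough sketch of the kind of program used above (an
assumption; not necessarily the actual test.c): it extends a file chunk by
chunk, using only pwrite() (mode 0), only posix_fallocate() (mode 1), or
alternating between the two (mode 2).

/*
 * Sketch only: chunk count and size are arbitrary assumptions here.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NCHUNKS 20
#define CHUNKSZ (32 * 1024 * 1024)

int
main(int argc, char **argv)
{
    int     mode,
            fd,
            nfallocate = 0,
            nfilewrite = 0;
    char   *buf;
    off_t   off = 0;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <file> <mode>\n", argv[0]);
        return 1;
    }
    mode = atoi(argv[2]);
    fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0644);
    buf = calloc(1, CHUNKSZ);
    if (fd < 0 || buf == NULL)
    {
        perror("setup");
        return 1;
    }

    for (int i = 0; i < NCHUNKS; i++)
    {
        if (mode == 1 || (mode == 2 && i % 2 == 0))
        {
            posix_fallocate(fd, off, CHUNKSZ);
            nfallocate++;
        }
        else
        {
            pwrite(fd, buf, CHUNKSZ, off);
            nfilewrite++;
        }
        off += CHUNKSZ;
    }
    printf("total\t%d\nfallocate\t%d\nfilewrite\t%d\n",
           NCHUNKS, nfallocate, nfilewrite);
    close(fd);
    return 0;
}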

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




pg16b2: REINDEX segv on null pointer in RemoveFromWaitQueue

2023-07-10 Thread Justin Pryzby
An instance compiled locally, without assertions, failed like this:

< 2023-07-09 22:04:46.470 UTC  >LOG:  process 30002 detected deadlock while 
waiting for ShareLock on transaction 813219577 after 333.228 ms
< 2023-07-09 22:04:46.470 UTC  >DETAIL:  Process holding the lock: 2103. Wait 
queue: 30001.
< 2023-07-09 22:04:46.470 UTC  >CONTEXT:  while checking uniqueness of tuple 
(549,4) in relation "pg_statistic_ext_data"
< 2023-07-09 22:04:46.470 UTC  >STATEMENT:  REINDEX INDEX 
pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.474 UTC  >ERROR:  deadlock detected
< 2023-07-09 22:04:46.474 UTC  >DETAIL:  Process 30001 waits for ShareLock on 
transaction 813219577; blocked by process 2103.
Process 2103 waits for RowExclusiveLock on relation 3429 of database 
16588; blocked by process 30001.
Process 30001: REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
Process 2103: autovacuum: ANALYZE child.ericsson_sgsn_rac_202307
< 2023-07-09 22:04:46.474 UTC  >HINT:  See server log for query details.
< 2023-07-09 22:04:46.474 UTC  >CONTEXT:  while checking uniqueness of tuple 
(549,4) in relation "pg_statistic_ext_data"
< 2023-07-09 22:04:46.474 UTC  >STATEMENT:  REINDEX INDEX 
pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.483 UTC  >LOG:  background worker "parallel worker" (PID 
30002) exited with exit code 1
< 2023-07-09 22:04:46.487 UTC postgres >ERROR:  deadlock detected
< 2023-07-09 22:04:46.487 UTC postgres >DETAIL:  Process 30001 waits for 
ShareLock on transaction 813219577; blocked by process 2103.
Process 2103 waits for RowExclusiveLock on relation 3429 of database 
16588; blocked by process 30001.
< 2023-07-09 22:04:46.487 UTC postgres >HINT:  See server log for query details.
< 2023-07-09 22:04:46.487 UTC postgres >CONTEXT:  while checking uniqueness of 
tuple (549,4) in relation "pg_statistic_ext_data"
parallel worker
< 2023-07-09 22:04:46.487 UTC postgres >STATEMENT:  REINDEX INDEX 
pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.848 UTC  >LOG:  server process (PID 30001) was terminated 
by signal 11: Segmentation fault
< 2023-07-09 22:04:46.848 UTC  >DETAIL:  Failed process was running: REINDEX 
INDEX pg_statistic_ext_data_stxoid_inh_index

=> REINDEX was running, with parallel workers, but deadlocked with
ANALYZE, and then crashed.

It looks like parallel workers are needed to hit this issue.
I'm not sure if the issue is specific to extended stats - probably not.

I reproduced the crash with manual REINDEX+ANALYZE, and with assertions (which
were not hit), and on a more recent commit (1124cb2cf).  The crash is hit about
30% of the time when running a loop around REINDEX and then also running
ANALYZE.

I hope someone has a hunch where to look; so far, I wasn't able to create a
minimal reproducer.  

Core was generated by `postgres: pryzbyj ts [local] REINDEX '.
Program terminated with signal 11, Segmentation fault.
#0  RemoveFromWaitQueue (proc=0x2aaabc1289e0, hashcode=2627626119) at 
../src/backend/storage/lmgr/lock.c:1898
1898        LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
(gdb) bt
#0  RemoveFromWaitQueue (proc=0x2aaabc1289e0, hashcode=2627626119) at 
../src/backend/storage/lmgr/lock.c:1898
#1  0x007ab56b in LockErrorCleanup () at 
../src/backend/storage/lmgr/proc.c:735
#2  0x00548a7e in AbortTransaction () at 
../src/backend/access/transam/xact.c:2735
#3  0x00549405 in AbortCurrentTransaction () at 
../src/backend/access/transam/xact.c:3414
#4  0x007b6414 in PostgresMain (dbname=, 
username=) at ../src/backend/tcop/postgres.c:4352
#5  0x00730e9a in BackendRun (port=, port=) at ../src/backend/postmaster/postmaster.c:4461
#6  BackendStartup (port=0x12a8a50) at 
../src/backend/postmaster/postmaster.c:4189
#7  ServerLoop () at ../src/backend/postmaster/postmaster.c:1779
#8  0x0073207d in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0x127a230) at ../src/backend/postmaster/postmaster.c:1463
#9  0x004b5535 in main (argc=3, argv=0x127a230) at 
../src/backend/main/main.c:198

(gdb) l
1893    RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
1894    {
1895        LOCK       *waitLock = proc->waitLock;
1896        PROCLOCK   *proclock = proc->waitProcLock;
1897        LOCKMODE    lockmode = proc->waitLockMode;
1898        LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
1899
1900        /* Make sure proc is waiting */
1901        Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
1902        Assert(proc->links.next != NULL);

(gdb) p waitLock
$1 = (LOCK *) 0x0

Another variant on this crash:

Jul 11 00:55:19 telsa kernel: postgres[25415]: segfault at f ip 
008a sp 7ffdbc01ea90 error 4 in postgres[40+8df000]

Core was generated by `postgres: parallel worker for PID 27096  waiting '.

(gdb) bt
#0  RemoveFromWaitQueue (proc=0x2aaabc154040, hashcode=2029421528) at 

Re: Add more sanity checks around callers of changeDependencyFor()

2023-07-10 Thread Michael Paquier
On Mon, Jul 10, 2023 at 11:04:48AM -0400, Tom Lane wrote:
> ISTR that we discussed forbidding such changes way back when the
> extension mechanism was invented, and decided against it on the
> grounds that (a) it'd be nanny-ism, (b) we'd have to add checks in an
> awful lot of places and it'd be easy to miss some,

The namespace modifications depending on the object types are quite
centralized lately, FWIW.  And that was the case in 9.3 as well since
we have ExecAlterObjectSchemaStmt().  It would be easy to miss a new
code path if somebody introduces a new object type that needs its own
update path, but based on the last 15 years of experience on the
matter, that would be unlikely?  Adding a note at the top of
ExecAlterObjectSchemaStmt() would make that even harder to miss.

> and (c) forbidding
> superusers from doing anything they want is generally not our style.

Yeah.
--
Michael


signature.asc
Description: PGP signature


Re: Add hint message for check_log_destination()

2023-07-10 Thread Japin Li


On Mon, 10 Jul 2023 at 14:23, Masahiko Sawada  wrote:
> On Mon, Jul 10, 2023 at 2:07 PM Kyotaro Horiguchi
>  wrote:
>>
>> At Mon, 10 Jul 2023 09:04:42 +0800, Japin Li  wrote in
>> >
>> > On Sat, 08 Jul 2023 at 12:48, Michael Paquier  wrote:
>> > > On Fri, Jul 07, 2023 at 07:23:47PM +0800, Japin Li wrote:
>> > >> +  appendStringInfoString(, "\"stderr\"");
>> > >> +#ifdef HAVE_SYSLOG
>> > >> +  appendStringInfoString(, ", \"syslog\"");
>> > >> +#endif
>> > >> +#ifdef WIN32
>> > >> +  appendStringInfoString(, ", \"eventlog\"");
>> > >> +#endif
>> > >> +  appendStringInfoString(, ", \"csvlog\", and 
>> > >> \"jsonlog\"");
>> > >
>> > > Hmm.  Is that OK as a translatable string?
>
> It seems okay to me but needs to be checked.
>
>> >
>> >
>> > Sorry for the late reply!  I'm not sure.  How can I know whether it is 
>> > translatable?
>>
>> At the very least, we can't generate comma-separated lists
>> programmatically because punctuation marks vary across languages.
>>
>> One potential approach could involve defining the message for every
>> potential combination, in full length.
>
> Don't we generate a comma-separated list for an error hint of an enum
> parameter? For example, to generate the following error hint:
>
> =# alter system set client_min_messages = 'aaa';
> ERROR:  invalid value for parameter "client_min_messages": "aaa"
> HINT:  Available values: debug5, debug4, debug3, debug2, debug1, log,
> notice, warning, error.
>
> we use the comma-separated generated by config_enum_get_options() and
> do ereport() like:
>
>   ereport(elevel,
>   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>errmsg("invalid value for parameter \"%s\": \"%s\"",
>   name, value),
>hintmsg ? errhint("%s", _(hintmsg)) : 0));

> IMO log_destination is a string GUC parameter but its value is the
> list of enums. So it makes sense to me to add a hint message like what
> we do for enum parameters in case where the user mistypes a wrong
> value. I'm not sure why the proposed patch needs to quote the usable
> values, though.

I borrowed the description from pg_settings extra_desc.  In my first patch,
I used a hint message like the one for enum parameters; however, since the
value might be a combination of multiple log destinations, I updated the
patch to use extra_desc.

-- 
Regards,
Japin Li.




Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG

2023-07-10 Thread Tristan Partin
Here is an up-to-date patch, given some churn on the master branch.

-- 
Tristan Partin
Neon (https://neon.tech)
From b68cec481768c7c635ec48329b4764eced264572 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Fri, 30 Jun 2023 09:31:04 -0500
Subject: [PATCH v3 1/3] Skip checking for uselocale on Windows

Windows doesn't have uselocale, so skip it.
---
 meson.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index 0c44f19cb9..3356f72bf0 100644
--- a/meson.build
+++ b/meson.build
@@ -2438,7 +2438,7 @@ func_checks = [
   ['strsignal'],
   ['sync_file_range'],
   ['syncfs'],
-  ['uselocale'],
+  ['uselocale', {'skip': host_system == 'windows'}],
   ['wcstombs_l'],
 ]
 
-- 
Tristan Partin
Neon (https://neon.tech)

From 3207694a1d214a7d5b844f3f6dfd8378408172af Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Fri, 30 Jun 2023 11:13:29 -0500
Subject: [PATCH v3 2/3] Add locale_is_c function

In some places throughout the codebase, there are string comparisons for
locales matching C or POSIX. Encapsulate this logic into a single
function and use it.
---
 src/backend/utils/adt/pg_locale.c | 39 ++-
 src/backend/utils/init/postinit.c |  4 +---
 src/backend/utils/mb/mbutils.c|  5 ++--
 src/include/utils/pg_locale.h |  1 +
 4 files changed, 27 insertions(+), 22 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index aa9da99308..047c02dbab 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -191,6 +191,23 @@ wcstombs_l(char *dest, const wchar_t *src, size_t n, locale_t loc)
 }
 #endif
 
+
+/*
+ * Check if a locale is the C locale
+ *
+ * POSIX is an alias for C. Passing ignore_case as true will use
+ * pg_strcasecmp() instead of strcmp().
+ */
+bool
+locale_is_c(const char *locale, bool ignore_case)
+{
+	if (!ignore_case)
+		return strcmp(locale, "C") == 0 || strcmp(locale, "POSIX") == 0;
+
+	return pg_strcasecmp(locale, "C") == 0 || pg_strcasecmp(locale, "POSIX") == 0;
+}
+
+
 /*
  * pg_perm_setlocale
  *
@@ -1276,10 +1293,8 @@ lookup_collation_cache(Oid collation, bool set_flags)
 			datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_collctype);
 			collctype = TextDatumGetCString(datum);
 
-			cache_entry->collate_is_c = ((strcmp(collcollate, "C") == 0) ||
-		 (strcmp(collcollate, "POSIX") == 0));
-			cache_entry->ctype_is_c = ((strcmp(collctype, "C") == 0) ||
-	   (strcmp(collctype, "POSIX") == 0));
+			cache_entry->collate_is_c = locale_is_c(collcollate, false);
+			cache_entry->ctype_is_c = locale_is_c(collctype, false);
 		}
 		else
 		{
@@ -1327,12 +1342,8 @@ lc_collate_is_c(Oid collation)
 		if (!localeptr)
 			elog(ERROR, "invalid LC_COLLATE setting");
 
-		if (strcmp(localeptr, "C") == 0)
-			result = true;
-		else if (strcmp(localeptr, "POSIX") == 0)
-			result = true;
-		else
-			result = false;
+		result = locale_is_c(localeptr, false);
+
 		return (bool) result;
 	}
 
@@ -1380,12 +1391,8 @@ lc_ctype_is_c(Oid collation)
 		if (!localeptr)
 			elog(ERROR, "invalid LC_CTYPE setting");
 
-		if (strcmp(localeptr, "C") == 0)
-			result = true;
-		else if (strcmp(localeptr, "POSIX") == 0)
-			result = true;
-		else
-			result = false;
+		result = locale_is_c(localeptr, false);
+
 		return (bool) result;
 	}
 
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 0f9b92b32e..a92a0c438f 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -419,9 +419,7 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 		   " which is not recognized by setlocale().", ctype),
  errhint("Recreate the database with another locale or install the missing locale.")));
 
-	if (strcmp(ctype, "C") == 0 ||
-		strcmp(ctype, "POSIX") == 0)
-		database_ctype_is_c = true;
+	database_ctype_is_c = locale_is_c(ctype, false);
 
 	if (dbform->datlocprovider == COLLPROVIDER_ICU)
 	{
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 67a1ab2ab2..997156515c 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -39,6 +39,7 @@
 #include "mb/pg_wchar.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/pg_locale.h"
 #include "utils/syscache.h"
 #include "varatt.h"
 
@@ -1237,9 +1238,7 @@ pg_bind_textdomain_codeset(const char *domainname)
 	int			new_msgenc;
 
 #ifndef WIN32
-	const char *ctype = setlocale(LC_CTYPE, NULL);
-
-	if (pg_strcasecmp(ctype, "C") == 0 || pg_strcasecmp(ctype, "POSIX") == 0)
+	if (locale_is_c(locale_ctype, true))
 #endif
 		if (encoding != PG_SQL_ASCII &&
 			raw_pg_bind_textdomain_codeset(domainname, encoding))
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 6447bea8e0..e08d96e099 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool 

Re: WAL Insertion Lock Improvements

2023-07-10 Thread Michael Paquier
On Mon, Jun 05, 2023 at 08:00:00AM +0530, Bharath Rupireddy wrote:
> On Wed, May 31, 2023 at 5:05 PM Michael Paquier  wrote:
>> @Andres: Were there any extra tests you wanted to be run for more
>> input?
> 
> @Andres Freund please let us know your thoughts.

Err, ping.  It seems like this thread is waiting on input from you,
Andres?
--
Michael


signature.asc
Description: PGP signature


Re: Add bump memory context type and use it for tuplesorts

2023-07-10 Thread David Rowley
On Tue, 27 Jun 2023 at 21:19, David Rowley  wrote:
> I've attached the bump allocator patch and also the script I used to
> gather the performance results in the first 2 tabs in the attached
> spreadsheet.

I've attached a v2 patch which changes the BumpContext a little to
remove some of the fields that are not really required.  There was no
need for the "keeper" field as the keeper block always comes at the
end of the BumpContext as these are allocated in a single malloc().
The pointer to the "block" also isn't really needed. This is always
the same as the head element in the blocks dlist.

David
diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl
index 72c7963578..a603f1d549 100644
--- a/src/backend/nodes/gen_node_support.pl
+++ b/src/backend/nodes/gen_node_support.pl
@@ -149,7 +149,7 @@ my @abstract_types = qw(Node);
 # they otherwise don't participate in node support.
 my @extra_tags = qw(
   IntList OidList XidList
-  AllocSetContext GenerationContext SlabContext
+  AllocSetContext GenerationContext SlabContext BumpContext
   TIDBitmap
   WindowObjectData
 );
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index dae3432c98..01a1fb8527 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
alignedalloc.o \
aset.o \
+   bump.o \
dsa.o \
freepage.o \
generation.o \
diff --git a/src/backend/utils/mmgr/bump.c b/src/backend/utils/mmgr/bump.c
new file mode 100644
index 00..015e66709f
--- /dev/null
+++ b/src/backend/utils/mmgr/bump.c
@@ -0,0 +1,747 @@
+/*-
+ *
+ * bump.c
+ *   Bump allocator definitions.
+ *
+ * Bump is a MemoryContext implementation designed for memory usages which
+ * require allocating a large number of chunks, none of which ever need to be
+ * pfree'd or realloc'd.  Chunks allocated by this context have no chunk header
+ * and operations which ordinarily require looking at the chunk header cannot
+ * be performed.  For example, pfree, realloc, GetMemoryChunkSpace and
+ * GetMemoryChunkContext are all not possible with bump allocated chunks.  The
+ * only way to release memory allocated by this context type is to reset or
+ * delete the context.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *   src/backend/utils/mmgr/bump.c
+ *
+ *
+ * Bump is best suited to cases which require a large number of short-lived
+ * chunks where performance matters.  Because bump allocated chunks don't
+ * have a chunk header, it can fit more chunks on each block.  This means we
+ * can do more with less memory and fewer cache lines.  The reason it's best
+ * suited for short-lived usages of memory is that ideally, pointers to bump
+ * allocated chunks won't be visible to a large amount of code.  The more
+ * code that operates on memory allocated by this allocator, the more chances
+ * that some code will try to perform a pfree or one of the other operations
+ * which are made impossible due to the lack of chunk header.  In order to
+ * to detect accidental usage of the various disallowed operations, we do add
+ * a MemoryChunk chunk header in MEMORY_CONTEXT_CHECKING builds and have the
+ * various disallowed functions raise an ERROR.
+ *
+ * Allocations are MAXALIGNed.
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "lib/ilist.h"
+#include "port/pg_bitutils.h"
+#include "utils/memdebug.h"
+#include "utils/memutils.h"
+#include "utils/memutils_memorychunk.h"
+#include "utils/memutils_internal.h"
+
+#define Bump_BLOCKHDRSZ	MAXALIGN(sizeof(BumpBlock))
+
+/* No chunk header unless built with MEMORY_CONTEXT_CHECKING */
+#ifdef MEMORY_CONTEXT_CHECKING
+#define Bump_CHUNKHDRSZ	sizeof(MemoryChunk)
+#else
+#define Bump_CHUNKHDRSZ	0
+#endif
+
+#define Bump_CHUNK_FRACTION	8
+
+/* The keeper block is allocated in the same allocation as the set */
+#define KeeperBlock(set) ((BumpBlock *) ((char *) (set) + sizeof(BumpContext)))
+#define IsKeeperBlock(set, blk) (KeeperBlock(set) == (blk))
+
+typedef struct BumpBlock BumpBlock; /* forward reference */
+
+typedef struct BumpContext
+{
+   MemoryContextData header;   /* Standard memory-context fields */
+
+   /* Bump context parameters */
+   uint32  initBlockSize;  /* initial block size */
+   uint32  maxBlockSize;   /* maximum block size */
+   uint32  nextBlockSize;  /* next block size to allocate */
+   uint32  allocChunkLimit;/* effective chunk size limit */
+
+	dlist_head	blocks;			/* list of blocks with the block currently
+

Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-07-10 Thread Jacob Champion
On Fri, Jul 7, 2023 at 6:01 PM Thomas Munro  wrote:
>
> On Fri, Jul 7, 2023 at 4:57 AM Jacob Champion  wrote:
> > On Wed, Jul 5, 2023 at 3:07 PM Thomas Munro  wrote:
> > > BTW I will happily do the epoll->kqueue port work if necessary.
> >
> > And I will happily take you up on that; thanks!
>
> Some initial hacking, about 2 coffees' worth:
> https://github.com/macdice/postgres/commits/oauth-kqueue
>
> This compiles on FreeBSD and macOS, but I didn't have time to figure
> out all your Python testing magic so I don't know if it works yet and
> it's still red on CI...

This is awesome, thank you!

I need to look into the CI more, but it looks like the client tests
are passing, which is a good sign. (I don't understand why the
server-side tests are failing on FreeBSD, but they shouldn't be using
the libpq code at all, so I think your kqueue implementation is in the
clear. Cirrus doesn't have the logs from the server-side test failures
anywhere -- probably a bug in my Meson patch.)

> one thing I wondered about is the *altsock =
> timerfd part which I couldn't do.

I did that because I'm not entirely sure that libcurl is guaranteed to
have cleared out all its sockets from the mux, and I didn't want to
invite spurious wakeups. I should probably verify whether or not
that's possible. If so, we could just make that code resilient to
early wakeup, so that it matters less, or set up a second kqueue that
only holds the timer if that turns out to be unacceptable?

> The situation on macOS is a little odd: the man page says EVFILT_TIMER
> is not implemented.  But clearly it is, we can read the source code as
> I had to do to find out which unit of time it defaults to[1] (huh,
> Apple's github repo for Darwin appears to have been archived recently
> -- no more source code updates?  that'd be a shame!), and it works
> exactly as expected in simple programs.  So I would just assume it
> works until we see evidence otherwise.  (We already use a couple of
> other things on macOS more or less by accident because configure finds
> them, where they are undocumented or undeclared.)

Huh. Something to keep an eye on... might be a problem with older versions?

Thanks!
--Jacob




Re: Reducing connection overhead in pg_upgrade compat check phase

2023-07-10 Thread Daniel Gustafsson
> On 11 Jul 2023, at 01:09, Nathan Bossart  wrote:
> On Mon, Jul 10, 2023 at 04:43:23PM +0200, Daniel Gustafsson wrote:

>>> +static int n_data_types_usage_checks = 7;
>>> 
>>> Can we determine this programmatically so that folks don't need to remember
>>> to update it?
>> 
>> Fair point, I've added a counter loop to the beginning of the check function 
>> to
>> calculate it.
> 
> + /* Gather number of checks to perform */
> + while (tmp->status != NULL)
> + n_data_types_usage_checks++;
> 
> I think we need to tmp++ somewhere here.

Yuk, yes, will fix when caffeinated. Thanks.

--
Daniel Gustafsson





Re: doc: improve the restriction description of using indexes on REPLICA IDENTITY FULL table.

2023-07-10 Thread Peter Smith
On Mon, Jul 10, 2023 at 2:21 PM Amit Kapila  wrote:
>
> On Mon, Jul 10, 2023 at 7:55 AM Peter Smith  wrote:
> >
> > On Sat, Jul 8, 2023 at 1:49 PM Amit Kapila  wrote:
> > >
> > > On Fri, Jul 7, 2023 at 1:36 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > I prefer the first suggestion. I've attached the updated patch.
> > > >
> > >
> > > This looks mostly good to me but I think it would be better if we can
> > > also add the information that the leftmost index column must be a
> > > non-expression. So, how about: "Candidate indexes must be btree,
> > > non-partial, and the leftmost index column must be a non-expression
> > > and reference to a published table column (i.e. cannot consist of only
> > > expressions)."?
> >
> > That part in parentheses ought to say "the index ..." because it is
> > referring to the full INDEX, not to the leftmost column. I think this
> > was missed when Sawada-san took my previous suggestion [1].
> >
> > IMO it doesn't sound right to say the "index column must be a
> > non-expression". It is already a non-expression because it is a
> > column. So I think it would be better to refer to this as an INDEX
> > "field" instead of an INDEX column. Note that "field" is the same
> > terminology used in the docs for CREATE INDEX [2].
> >
>
> I thought it would be better to be explicit for this case but I am
> fine if Sawada-San and you prefer some other way to document it.
>

I see. How about just moving the parenthesized part to explicitly
refer only to the leftmost field?

SUGGESTION
Candidate indexes must be btree, non-partial, and the leftmost index
field must be a column (not an expression) that references a published
table column.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-07-10 Thread Jacob Champion
On Fri, Jul 7, 2023 at 2:16 PM Andrey Chudnovsky  wrote:
> Please confirm my understanding of the flow is correct:
> 1. Client calls PQconnectStart.
>   - The client doesn't know yet what is the issuer and the scope.

Right. (Strictly speaking it doesn't even know that OAuth will be used
for the connection, yet, though at some point we'll be able to force
the issue with e.g. `require_auth=oauth`. That's not currently
implemented.)

>   - Parameters are strings, so callback is not provided yet.
> 2. Client gets PgConn from PQconnectStart return value and updates
> conn->async_auth to its own callback.

This is where some sort of official authn callback registration (see
above reply to Daniele) would probably come in handy.

> 3. Client polls PQconnectPoll and checks conn->sasl_state until the
> value is SASL_ASYNC

In my head, the client's custom callback would always be invoked
during the call to PQconnectPoll, rather than making the client do
work in between calls. That way, a client can use custom flows even
with a synchronous PQconnectdb().

> 4. Client accesses conn->oauth_issuer and conn->oauth_scope and uses
> those info to trigger the token flow.

Right.

> 5. Expectations on async_auth:
> a. It returns PGRES_POLLING_READING while token acquisition is going on
> b. It returns PGRES_POLLING_OK and sets conn->sasl_state->token
> when token acquisition succeeds.

Yes. Though the token should probably be returned through some
explicit part of the callback, now that you mention it...

> 6. Is the client supposed to do anything with the altsock parameter?

The callback needs to set the altsock up with a select()able
descriptor, which wakes up the client when more work is ready to be
done. Without that, you can't handle multiple connections on a single
thread.
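To make that concrete, here is roughly how I picture a custom flow plugging
in, reusing the names from the current patch (async_auth, altsock,
oauth_issuer, oauth_scope, sasl_state->token).  The callback signature and the
helper functions below are made up for illustration; none of this is a settled
API:

    static PostgresPollingStatusType
    my_async_auth(PGconn *conn, pgsocket *altsock)
    {
        MyTokenFlow *flow = my_flow_for(conn);  /* hypothetical per-connection state */

        if (!flow->started)
        {
            /* kick off token acquisition using the server-provided metadata */
            start_token_request(flow, conn->oauth_issuer, conn->oauth_scope);
            *altsock = flow->wakeup_fd;         /* select()able descriptor */
            flow->started = true;
            return PGRES_POLLING_READING;
        }

        if (!token_request_done(flow))
            return PGRES_POLLING_READING;       /* caller polls again later */

        conn->sasl_state->token = flow->token;  /* hand the token to the exchange */
        return PGRES_POLLING_OK;
    }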

> If yes, it looks workable with a couple of improvements I think would be nice:
> 1. Currently, oauth_exchange function sets conn->async_auth =
> pg_fe_run_oauth_flow and starts Device Code flow automatically when
> receiving challenge and metadata from the server.
> There probably should be a way for the client to prevent default
> Device Code flow from triggering.

Agreed. I'd like the client to be able to override this directly.

> 2. The current signature and expectations from async_auth function
> seems to be tightly coupled with the internal implementation:
> - Pieces of information need to be picked and updated in different
> places in the PgConn structure.
> - Function is expected to return PostgresPollingStatusType which
> is used to communicate internal state to the client.
>Would it make sense to separate the internal callback used to
> communicate with Device Code flow from client facing API?
>I.e. introduce a new client facing structure and enum to facilitate
> callback and its return value.

Yep, exactly right! I just wanted to check that the architecture
*looked* sufficient before pulling it up into an API.

> On a separate note:
> The backend code currently spawns an external command for token validation.
> As we discussed before, an extension hook would be a more efficient
> extensibility option.
> We see clients make 10k+ connections using OAuth tokens per minute to
> our service, and stating external processes would be too much overhead
> here.

+1. I'm curious, though -- what language do you expect to use to write
a production validator hook? Surely not low-level C...?

> > 5) Does this maintenance tradeoff (full control over the client vs. a
> > large amount of RFC-governed code) seem like it could be okay?
>
> It's nice for psql to have Device Code flow. Can be made even more
> convenient with refresh tokens support.
> And for clients on resource constrained devices to be able to
> authenticate with Client Credentials (app secret) without bringing
> more dependencies.
>
> In most other cases, upstream PostgreSQL drivers written in higher
> level languages have libraries / abstractions to implement OAUTH flows
> for the platforms they support.

Yeah, I'm really interested in seeing which existing high-level flows
can be mixed in through a driver. Trying not to get too far ahead of
myself :D

Thanks for the review!

--Jacob




Re: sslinfo extension - add notbefore and notafter timestamps

2023-07-10 Thread Cary Huang
 > Thanks for the new version!  It doesn't fail the ssl tests, but the Kerberos
 > test now fails.  You can see the test reports from the CFBot here:

Yes, the Kerberos tests failed due to the addition of the notbefore and notafter 
values. The values array within the "pg_stat_get_activity" function related to 
"pg_stat_gssapi" was not set correctly. It is now fixed.


 > This runs on submitted patches, you can also run the same CI checks in your 
 > own
 > Github clone using the supplied CI files in the postgres repo.

Thank you for pointing this out. I followed the CI instruction as suggested and 
am able to run the same CI checks to reproduce the test failures.


> There are also some trivial whitespace issues shown with "git diff --check",
> these can of course easily be addressed by a committer in a final-version 
> patch
> but when sending a new version you might as well fix those.

Yes, the white spaces issues should be addressed in the attached patches.


> X509_getm_notBefore() and X509_getm_notAfter() are only available in OpenSSL
> 1.1.1 and onwards, but postgres support 1.0.2 (as of today with 8e278b6576).
> X509_get_notAfter() is available in 1.0.2 but deprecated in 1.1.1 and turned
> into an alias for X509_getm_notAfter() (same with _notBefore of course), and
> since we set 1.0.2 as the API compatibility we should be able to use that
> without warnings instead.

Thank you so much for catching this openssl function compatibility issue. I 
have changed the function calls to:
-  X509_get_notBefore()
-  X509_get_notAfter()

which are compatible with OpenSSL v1.0.2 and also v1.1.1, where they are 
translated to X509_getm_notBefore() and X509_getm_notAfter() respectively.
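For reference, the calls now look like this (sketch only, error handling and
conversion to a timestamp omitted); on 1.1.1 and later the macros simply map
onto the X509_getm_* variants:

    ASN1_TIME  *not_before = X509_get_notBefore(cert);
    ASN1_TIME  *not_after = X509_get_notAfter(cert);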


 > These functions should IMO return timestamp data types to save the user from
 > having to convert them. Same with the additions to pg_stat_get_activity.

Yes, agreed, the attached patches have the output changed to timestamp datatype 
instead of text.


 > You should add tests for the new functions in src/test/ssl/t/003_sslinfo.pl.

Yes, agreed, I added 2 additional tests in src/test/ssl/t/003_sslinfo.pl to 
compare the notbefore and notafter outputs from the sslinfo extension and 
pg_stat_ssl. Both should be equal.


Also added related documentation about the new not before and not after 
timestamps in pg_stat_ssl.

thank you

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca



v4-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


v4-0002-pg-stat-ssl-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


Re: Reducing connection overhead in pg_upgrade compat check phase

2023-07-10 Thread Nathan Bossart
On Mon, Jul 10, 2023 at 04:43:23PM +0200, Daniel Gustafsson wrote:
>> On 8 Jul 2023, at 23:43, Nathan Bossart  wrote:
>> Since "script" will be NULL and "db_used" will be false in the first
>> iteration of the loop, couldn't we move this stuff to before the loop?
> 
> We could.  It's done this way to match how all the other checks are performing
> the inner loop for consistency.  I think being consistent is better than micro
> optimizing in non-hot codepaths even though it adds some redundancy.
>
> [ ... ] 
> 
>> Won't "script" always be initialized here?  If I'm following this code
>> correctly, I think everything except the fclose() can be removed.
> 
> You are right that this check is superfluous.  This is again an artifact of
> modelling the code around how the other checks work for consistency.  At least
> I think that's a good characteristic of the code.

I can't say I agree with this, but I'm not going to hold up the patch over
it.  FWIW I was looking at this more from a code simplification/readability
standpoint.

>> +static int  n_data_types_usage_checks = 7;
>> 
>> Can we determine this programmatically so that folks don't need to remember
>> to update it?
> 
> Fair point, I've added a counter loop to the beginning of the check function 
> to
> calculate it.

+   /* Gather number of checks to perform */
+   while (tmp->status != NULL)
+   n_data_types_usage_checks++;

I think we need to tmp++ somewhere here.
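Something like the following should do it (sketch only; the struct name here is
a stand-in for whatever the patch calls it, and I'm assuming the array is
terminated by an entry whose status is NULL):

    for (const DataTypesUsageChecks *tmp = data_types_usage_checks;
         tmp->status != NULL; tmp++)
        n_data_types_usage_checks++;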

>> In fact, I wonder if we could just add the versions directly to
>> data_types_usage_checks so that we don't need the separate hook functions.
> 
> We could, but it would be sort of contrived I think, since some check <= and
> some check == while some check the catversion as well (and new ones may have
> other variants).  I think this is the least paint-ourselves-in-a-corner
> version; if we feel it's needlessly complicated and no other variants are
> added we can revisit this.

Makes sense.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Implementation of distinct in Window Aggregates

2023-07-10 Thread Andreas Karlsson

On 3/12/23 09:17, Ankit Kumar Pandey wrote:

Attaching an updated patch with a fix for an issue in a window function.

I have also fixed the naming convention of the patch, as the last patch had 
an incompatible name.


Hi,

This patch does not apply to master. Could you rebase it and submit it 
as one patch which applies directly to master? Maybe I am wrong, but the 
latest version looks like it only applies on top of one of your previous 
patches, which makes it hard for the reviewer.


Andreas




Re: Autogenerate some wait events code and documentation

2023-07-10 Thread Michael Paquier
On Mon, Jul 10, 2023 at 09:11:36AM +0200, Alvaro Herrera wrote:
> I don't like this bit, because it means the .txt file is now ungreppable
> as source of the enum name.  Things become mysterious and people have to
> track down the event name by reading the Perl generating script.
> It's annoying.  I'd rather have the extra column, even if it means a
> little duplicity.

Hmm.  I can see your point that we'd lose the direct relationship
between the enum and the string when running a single `git grep` from
the tree; still, attempting to do that does not actually gain much
information.  Personally, I usually grep for code when looking for
consistent information across various paths in the tree.  Wait events
are very different: each enum is used in a single place in the tree,
making a grep for it the equivalent of looking at
wait_event_names.txt anyway.

The quotes in the second column can be removed even with your
argument in place.  That improves the format a bit.
--
Michael




Re: Refactoring backend fork+exec code

2023-07-10 Thread Andres Freund
Hi,

On 2023-06-18 14:22:33 +0300, Heikki Linnakangas wrote:
> I started to look at the code in postmaster.c related to launching child
> processes. I tried to reduce the difference between EXEC_BACKEND and
> !EXEC_BACKEND code paths, and put the code that needs to differ behind a
> better abstraction. I started doing this to help with implementing
> multi-threading, but it doesn't introduce anything thread-related yet and I
> think this improves readability anyway.

Yes please!  This code is absolutely awful.


> From 0cb6f8d665980d30a5d2a29013000744f16bf813 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Sun, 18 Jun 2023 11:00:21 +0300
> Subject: [PATCH 3/9] Refactor CreateSharedMemoryAndSemaphores.
> 
> Moves InitProcess calls a little later in EXEC_BACKEND case.

What's the reason for this part? ISTM that we'd really want to get away from
plastering duplicated InitProcess() etc everywhere.

I think this might be easier to understand if you just did the
CreateSharedMemoryAndSemaphores() -> AttachSharedMemoryAndSemaphores() piece
in this commit, and the rest later.


> +void
> +AttachSharedMemoryAndSemaphores(void)
> +{
> + /* InitProcess must've been called already */

Perhaps worth an assertion to make it easier to see that the order is wrong?


> From 1d89eec53c7fefa7a4a8c011c9f19e3df64dc436 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Mon, 12 Jun 2023 16:33:20 +0300
> Subject: [PATCH 4/9] Use FD_CLOEXEC on ListenSockets
> 
> We went through some effort to close them in the child process. Better to
> not hand them down to the child process in the first place.

I think Thomas has a larger version of this patch:
https://postgr.es/m/CA%2BhUKGKPNFcfBQduqof4-7C%3DavjcSfdkKBGvQoRuAvfocnvY0A%40mail.gmail.com



> From 65384b9a6cfb3b9b589041526216e0f64d64bea5 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Sun, 18 Jun 2023 13:56:44 +0300
> Subject: [PATCH 8/9] Introduce ClientSocket, rename some funcs
> 
> - Move more of the work on a client socket to the child process.
> 
> - Reduce the amount of data that needs to be passed from postmaster to
>   child. (Used to pass a full Port struct, although most of the fields were
>   empty. Now we pass the much slimmer ClientSocket.)

I think there might be extensions accessing Port. Not sure if it's worth
worrying about, but ...


> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -476,8 +476,8 @@ AutoVacLauncherMain(int argc, char *argv[])
>   pqsignal(SIGCHLD, SIG_DFL);
>  
>   /*
> -  * Create a per-backend PGPROC struct in shared memory. We must do
> -  * this before we can use LWLocks.
> +  * Create a per-backend PGPROC struct in shared memory. We must do this
> +  * before we can use LWLocks.
>*/
>   InitProcess();
>

Don't think this was intentional?


> From b33cfeb28a5419045acb659a01410b2b463bea3e Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Sun, 18 Jun 2023 13:59:48 +0300
> Subject: [PATCH 9/9] Refactor postmaster child process launching
> 
> - Move code related to launching backend processes to new source file,
>   process_start.c

I think you might have renamed this to launch_backend.c?


> - Introduce new postmaster_child_launch() function that deals with the
>   differences between EXEC_BACKEND and fork mode.
> 
> - Refactor the mechanism of passing informaton from the parent to
>   child process. Instead of using different command-line arguments
>   when launching the child process in EXEC_BACKEND mode, pass a
>   variable-length blob of data along with all the global
>   variables. The contents of that blob depends on the kind of child
>   process being launched. In !EXEC_BACKEND mode, we use the same blob,
>   but it's simply inherited from the parent to child process.


> +const	PMChildEntry entry_kinds[] = {
> + {"backend", BackendMain, true},
> +
> + {"autovacuum launcher", AutoVacLauncherMain, true},
> + {"autovacuum worker", AutoVacWorkerMain, true},
> + {"bgworker", BackgroundWorkerMain, true},
> + {"syslogger", SysLoggerMain, false},
> +
> + {"startup", StartupProcessMain, true},
> + {"bgwriter", BackgroundWriterMain, true},
> + {"archiver", PgArchiverMain, true},
> + {"checkpointer", CheckpointerMain, true},
> + {"wal_writer", WalWriterMain, true},
> + {"wal_receiver", WalReceiverMain, true},
> +};

I'd assign them with the PostmasterChildType as index, so there's no danger of
getting out of order.

const PMChildEntry entry_kinds[] = {
  [PMC_AV_LAUNCHER] = {"autovacuum launcher", AutoVacLauncherMain, true},
  ...
}

or such should work.


I'd also use designated initializers for the fields, it's otherwise hard to
know what true means etc.
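I.e. something like this (the field names are made up for illustration --
use whatever the struct actually defines):

    const PMChildEntry entry_kinds[] = {
        [PMC_AV_LAUNCHER] = {
            .name = "autovacuum launcher",
            .main_fn = AutoVacLauncherMain,
            .shmem_attach = true,
        },
        /* ... */
    };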

I think it might be good to put more into the array. If we e.g. knew whether a
particular child type is backend-like, an aux process, or the syslogger, we
could avoid the duplicated 

Re: Generating code for query jumbling through gen_node_support.pl

2023-07-10 Thread Michael Paquier
On Mon, Jan 30, 2023 at 11:48:45AM +0100, Peter Eisentraut wrote:
> On 27.01.23 03:59, Michael Paquier wrote:
>> At the end, that would be unnoticeable for the average user, I guess,
>> but here are the numbers I get on my laptop :)
> 
> Personally, I think we do not want the two jumble methods in parallel.
> 
> Maybe there are other opinions.

(Thanks Jonathan for the poke.)

Now that we are in mid-beta for 16, it would be a good time to
conclude on this open item:
"Reconsider a utility_query_id GUC to control if query jumbling of
utilities can go through the past string-only mode and the new mode?"

In Postgres ~15, utility commands used a hash of the query string to
compute their query ID.  The current query jumbling code uses a Query
instead, like any other query.  I have registered this open item as
a self-reminder, mostly in case there would be an argument to have a
GUC where users could switch from one mode to another.  See here as
well for some computation times for each method (table is in ns, with
millions of iterations):
https://www.postgresql.org/message-id/y9eeyindb1acp...@paquier.xyz

I still don't think that we need both methods based on these numbers,
but there may be more opinions about that?  Are people OK if this open
item is discarded?
--
Michael




Re: check_strxfrm_bug()

2023-07-10 Thread Thomas Munro
On Tue, Jul 11, 2023 at 2:28 AM Peter Eisentraut  wrote:
> This looks sensible to me.
>
> If we ever need mbstowcs_l() etc. outside of pg_locale.c, then the
> proper way would be to make a mbstowcs_l.c file in src/port/.
>
> But I like your approach for now because it moves us more firmly into
> the direction of having it contained in pg_locale.c, instead of having
> some knowledge global and some local.

Thanks.  Pushed.

That leaves only one further cleanup opportunity from the discussion in this
thread, for future work: a TRUST_STRXFRM-ectomy.




Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

2023-07-10 Thread Nikolay Samokhvalov
On Mon, Jul 10, 2023 at 2:02 PM Michael Banck  wrote:

> Thanks for that!
>

Thanks for the fast review.


>
> > I agree with Andrey – without it, we don't have any good way to upgrade
> > large clusters in short time. Default rsync mode (without "--size-only")
> > takes a lot of time too, if the load is heavy.
> >
> > With these adjustments, can "rsync --size-only" remain in the docs as the
> > *fast* and safe method to upgrade standbys, or there are still some
> > concerns related to corruption risks?
>
> I hope somebody can answer that definitively, but I read Stephen's mail
> to indicate that this procedure should be safe in principle (if you know
> what you are doing).
>

right, this is my understanding too


> > +++ b/doc/src/sgml/ref/pgupgrade.sgml
> > @@ -380,22 +380,28 @@ NET STOP postgresql-
> >  
> >
> >  
> > - Streaming replication and log-shipping standby servers can
> > + Streaming replication and log-shipping standby servers must
> >   remain running until a later step.
> >  
> > 
> >
> > -   
> > +   
> >
> >  
> > - If you are upgrading standby servers using methods outlined in
> section  > - linkend="pgupgrade-step-replicas"/>, verify that the old standby
> > - servers are caught up by running
> pg_controldata
> > - against the old primary and standby clusters.  Verify that the
> > - Latest checkpoint location values match in all
> clusters.
> > - (There will be a mismatch if old standby servers were shut down
> > - before the old primary or if the old standby servers are still
> running.)
> > + If you are upgrading standby servers using methods outlined in
> > + ,
>
> You dropped the "section" before the xref, I think that should be kept
> around.
>

Seems to be a problem with discussing source code that looks quite different
from the final result.

In the result – the docs – we currently have "section Step 9", which looks
weird. I still think it's good to remove it. We also have "in Step 17
below" (without the extra word "section") in a different place on the same
page.


>
> > +ensure that they were
> running when
> > + you shut down the primaries in the previous step, so all the
> latest changes
>
> You talk of primaries in plural here, that is a bit weird for PostgreSQL
> documentation.
>

The same docs already discuss two primaries ("8. Stop both primaries"), but
I agree it might look confusing if you read only a part of the doc, jumping
into the middle of it, like I do all the time when using the docs in "check
the reference" style.

I agree with this comment, but it tells me we need even more improvements
to this doc, beyond my original goal – e.g., I don't like section 8 saying
"Make sure both database servers"; it should be "both primaries".


>
> > + and the shutdown checkpoint record were received. You can verify
> this by running
> > + pg_controldata against the old primary
> and standby
> > + clusters. The Latest checkpoint location values
> must match in all
> > + nodes. A mismatch might occur if old standby servers were shut
> down before
> > + the old primary. To fix a mismatch, start all old servers and
> return to the
> > + previous step; proceeding with mismatched
> > + Latest checkpoint location may lead to standby
> corruption.
> > +
> > +
> > +
> >   Also, make sure wal_level is not set to
> >   minimal in the
> postgresql.conf file on the
> >   new primary cluster.
> > @@ -497,7 +503,6 @@ pg_upgrade.exe
> >   linkend="warm-standby"/>) standby servers, you can follow these
> steps to
> >   quickly upgrade them.  You will not be running
> pg_upgrade on
> >   the standby servers, but rather rsync
> on the primary.
> > - Do not start any servers yet.
> >  
> >
> >  
> > @@ -508,6 +513,15 @@ pg_upgrade.exe
> >   is running.
> >  
> >
> > +
> > + Before running rsync, to avoid standby corruption, it is absolutely
> > + critical to ensure that both primaries are shut down and standbys
> > + have received the last changes (see  linkend="pgupgrade-step-prepare-standbys"/>).
>
> I think this should be something like "ensure both that the primary is
> shut down and that the standbys have received all the changes".
>

Well, we have two primary servers – old and new. I tried to clarify it in
the new version.


>
> > + Standbys can be running at this point or fully stopped.
>
> "or be fully stopped." I think.
>
> > + If they
> > + are still running, you can stop, upgrade, and start them one by
> one; this
> > + can be useful to keep the cluster open for read-only transactions.
> > +
>
> Maybe this is clear from the context, but "upgrade" in the above should
> maybe more explicitly refer to the rsync method or else people might
> think one can run pg_upgrade on them after all?
>

Maybe. It will 

Re: Allowing parallel-safe initplans

2023-07-10 Thread Tom Lane
I wrote:
> Richard Guo  writes:
>> On Mon, Apr 17, 2023 at 11:04 PM Tom Lane  wrote:
>>> I wondered about that too, but how come neither of us saw non-cosmetic
>>> failures (ie, actual query output changes not just EXPLAIN changes)
>>> when we tried this?

>> Sorry I forgot to mention that I did see query output changes after
>> moving the initPlans to the Gather node.

> Hmm, my memory was just of seeing the EXPLAIN output changes, but
> maybe those got my attention to the extent of missing the others.

I got around to trying this, and you are right, there are some wrong
query answers as well as EXPLAIN output changes.  This mystified me
for awhile, because it sure looks like e89a71fb4 should have made it
work.

>> It seems that in this case the top_plan does not have any extParam, so
>> the Gather node that is added atop the top_plan does not have a chance
>> to get its initParam filled in set_param_references().

Eventually I noticed that all the failing cases were instances of
optimizing MIN()/MAX() aggregates into indexscans, and then I figured
out what the problem is: we substitute Params for the optimized-away
Aggref nodes in setrefs.c, *after* SS_finalize_plan has been run.
That means we fail to account for those Params in extParam/allParam
sets.  We've gotten away with that up to now because such Params
could only appear where Aggrefs can appear, which is only in top-level
(above scans and joins) nodes, which generally don't have any of the
sorts of rescan optimizations that extParam/allParam bits control.
But this patch results in needing to have a correct extParam set for
the node just below Gather, and we don't.  I am not sure whether there
are any reachable bugs without this patch; but there might be, or some
future optimization might introduce one.

It seems like the cleanest fix for this is to replace such optimized
Aggrefs in a separate tree scan before running SS_finalize_plan.
That's fairly annoying from a planner-runtime standpoint, although
we could skip the extra pass in the typical case where no minmax aggs
have been optimized.

I also thought about swapping the order of operations so that we
run SS_finalize_plan after setrefs.c.  That falls down because of
set_param_references itself, which requires those bits to be
calculated already.  But maybe we could integrate that computation
into SS_finalize_plan instead?  There's certainly nothing very
pretty about the way it's done now.

A band-aid fix that seemed to work is to have set_param_references
consult the Gather's own allParam set instead of the extParam set
of its child.  That feels like a kluge though, and it would not
help matters for any future bug involving another usage of those
bitmapsets.

BTW, there is another way in which setrefs.c can inject PARAM_EXEC
Params: it can translate PARAM_MULTIEXPR Params into those.  So
those won't be accounted for either.  I think this is probably
not a problem, especially not after 87f3667ec got us out of the
business of treating those like initPlan outputs.  But it does
seem like "you can't inject PARAM_EXEC Params during setrefs.c"
would not be a workable coding rule; it's too tempting to do
exactly that.

So at this point my inclination is to try to move SS_finalize_plan
to run after setrefs.c, but I've not written any code yet.  I'm
not sure if we'd need to back-patch that, but it at least seems
like important future-proofing.

None of this would lead me to want to move initPlans to
Gather nodes injected by debug_parallel_query, though.
We'd have to kluge something to keep the EXPLAIN output
looking the same, and that seems like a kluge too many.
What I am wondering is if the issue is reachable for
Gather nodes that are built organically by the regular
planner paths.  It seems like that might be the case,
either now or after applying this patch.

regards, tom lane




Re: Refactoring backend fork+exec code

2023-07-10 Thread Tristan Partin
> From 1d89eec53c7fefa7a4a8c011c9f19e3df64dc436 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Mon, 12 Jun 2023 16:33:20 +0300
> Subject: [PATCH 4/9] Use FD_CLOEXEC on ListenSockets

> @@ -831,7 +834,8 @@ StreamConnection(pgsocket server_fd, Port *port)
>  void
>  StreamClose(pgsocket sock)
>  {
> -   closesocket(sock);
> +   if (closesocket(sock) != 0)
> +   elog(LOG, "closesocket failed: %m");
>  }
> 
>  /*

Do you think WARNING would be a more appropriate log level?

> From 2f518be9e96cfed1a1a49b4af8f7cb4a837aa784 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Mon, 12 Jun 2023 18:07:54 +0300
> Subject: [PATCH 5/9] Move "too many clients already" et al. checks from
>  ProcessStartupPacket.

This seems like a change you could push already (assuming another
maintainer agrees with you), which makes reviews for this patchset even
easier.

> From c25b67c045018a2bf05e6ff53819d26e561fc83f Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Sun, 18 Jun 2023 14:11:16 +0300
> Subject: [PATCH 6/9] Pass CAC as argument to backend process.

Could you expand a bit more on why it is better to pass it as a separate
argument? Does it not fit well conceptually in struct Port?

> @@ -4498,15 +4510,19 @@ postmaster_forkexec(int argc, char *argv[])
>   * returns the pid of the fork/exec'd process, or -1 on failure
>   */
>  static pid_t
> -backend_forkexec(Port *port)
> +backend_forkexec(Port *port, CAC_state cac)
>  {
> -   char   *av[4];
> +   char   *av[5];
> int ac = 0;
> +   charcacbuf[10];
> 
> av[ac++] = "postgres";
> av[ac++] = "--forkbackend";
> av[ac++] = NULL;/* filled in by 
> internal_forkexec */
> 
> +   snprintf(cacbuf, sizeof(cacbuf), "%d", (int) cac);
> +   av[ac++] = cacbuf;

Might be worth a sanity check that there wasn't any truncation into
cacbuf, which is an impossibility as the code is written, but still
useful for catching a future developer error.
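For example (sketch only; the error message wording is illustrative):

    if (snprintf(cacbuf, sizeof(cacbuf), "%d", (int) cac) >= (int) sizeof(cacbuf))
        elog(ERROR, "CAC argument would be truncated");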

Is it worth adding a command line option at all instead of having the
naked positional argument? It would help anybody who might read the
command line what the seemingly random integer stands for.

> @@ -4910,7 +4926,10 @@ SubPostmasterMain(int argc, char *argv[])
> /* Run backend or appropriate child */
> if (strcmp(argv[1], "--forkbackend") == 0)
> {
> -   Assert(argc == 3);  /* shouldn't be any more args 
> */
> +   CAC_state   cac;
> +
> +   Assert(argc == 4);
> +   cac = (CAC_state) atoi(argv[3]);

Perhaps an assert or full error checking that atoi succeeds would be
useful for similar reasons to my previous comment.
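For instance, a stricter parse than a bare atoi() could look like this
(sketch only):

    char       *endptr;
    long        val = strtol(argv[3], &endptr, 10);

    if (argv[3][0] == '\0' || *endptr != '\0')
        elog(ERROR, "invalid CAC argument: \"%s\"", argv[3]);
    cac = (CAC_state) val;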

> From 658cba5cdb2e5c45faff84566906d2fcaa8a3674 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Mon, 12 Jun 2023 18:03:03 +0300
> Subject: [PATCH 7/9] Remove ConnCreate and ConnFree, and allocate Port in
>  stack.

Again, seems like another patch that could be pushed separately assuming
others don't have any comments.

> From 65384b9a6cfb3b9b589041526216e0f64d64bea5 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Sun, 18 Jun 2023 13:56:44 +0300
> Subject: [PATCH 8/9] Introduce ClientSocket, rename some funcs

> @@ -1499,7 +1499,7 @@ CloseServerPorts(int status, Datum arg)
> {
> if (ListenSocket[i] != PGINVALID_SOCKET)
> {
> -   StreamClose(ListenSocket[i]);
> +   closesocket(ListenSocket[i]);
> ListenSocket[i] = PGINVALID_SOCKET;
> }
> }

I see you have been adding log messages in the case of closesocket()
failing. Do you think it is worth doing here as well?

One strange part about this patch is that in patch 4, you edit
StreamClose() to emit a log message in the case of closesocket()
failure, but then this patch just completely removes it.

> @@ -4407,11 +4420,11 @@ BackendInitialize(Port *port, CAC_state cac)
>   * Doesn't return at all.
>   */
>  static void
> -BackendRun(Port *port)
> +BackendRun(void)
>  {
> /*
> -* Create a per-backend PGPROC struct in shared memory. We must do
> -* this before we can use LWLocks (in 
> AttachSharedMemoryAndSemaphores).
> +* Create a per-backend PGPROC struct in shared memory. We must do 
> this
> +* before we can use LWLocks (in AttachSharedMemoryAndSemaphores).
>  */
> InitProcess();

This comment reflow probably fits better in the patch that added the
AttachSharedMemoryAndSemaphores function.

> From b33cfeb28a5419045acb659a01410b2b463bea3e Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Sun, 18 Jun 2023 13:59:48 +0300
> Subject: [PATCH 9/9] Refactor postmaster child process launching

> - Move code related to launching backend processes to new source file,
>   process_start.c

Since this seems pretty 

Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

2023-07-10 Thread Michael Banck
Hi,

On Mon, Jul 10, 2023 at 01:36:39PM -0700, Nikolay Samokhvalov wrote:
> On Fri, Jul 7, 2023 at 6:31 AM Stephen Frost  wrote:
> > * Nikolay Samokhvalov (n...@postgres.ai) wrote:
> > > But this can happen with anyone who follows the procedure from the docs
> > as
> > > is and doesn't do any additional steps, because in step 9 "Prepare for
> > > standby server upgrades":
> > >
> > > 1) there is no requirement to follow specific order to shut down the
> > nodes
> > >- "Streaming replication and log-shipping standby servers can remain
> > > running until a later step" should probably be changed to a
> > > requirement-like "keep them running"
> >
> > Agreed that it would be good to clarify that the primary should be shut
> > down first, to make sure everything written by the primary has been
> > replicated to all of the replicas.
> 
> Thanks!
> 
> Here is a patch to fix the existing procedure description.

Thanks for that!

> I agree with Andrey – without it, we don't have any good way to upgrade
> large clusters in short time. Default rsync mode (without "--size-only")
> takes a lot of time too, if the load is heavy.
> 
> With these adjustments, can "rsync --size-only" remain in the docs as the
> *fast* and safe method to upgrade standbys, or there are still some
> concerns related to corruption risks?

I hope somebody can answer that definitively, but I read Stephen's mail
to indicate that this procedure should be safe in principle (if you know
what you are doing).

> From: Nikolay Samokhvalov 
> Date: Mon, 10 Jul 2023 20:07:18 +
> Subject: [PATCH] Improve major upgrade docs

Maybe mention standby here, like "Improve major upgrade documentation
for standby servers".

> +++ b/doc/src/sgml/ref/pgupgrade.sgml
> @@ -380,22 +380,28 @@ NET STOP postgresql-
>  
>  
>  
> - Streaming replication and log-shipping standby servers can
> + Streaming replication and log-shipping standby servers must
>   remain running until a later step.
>  
> 
>  
> -   
> +   
>
>  
> - If you are upgrading standby servers using methods outlined in section 
>  - linkend="pgupgrade-step-replicas"/>, verify that the old standby
> - servers are caught up by running 
> pg_controldata
> - against the old primary and standby clusters.  Verify that the
> - Latest checkpoint location values match in all clusters.
> - (There will be a mismatch if old standby servers were shut down
> - before the old primary or if the old standby servers are still running.)
> + If you are upgrading standby servers using methods outlined in 
> + , 

You dropped the "section" before the xref, I think that should be kept
around.

> +ensure that they were 
> running when 
> + you shut down the primaries in the previous step, so all the latest 
> changes 

You talk of primaries in plural here, that is a bit weird for PostgreSQL
documentation.

> + and the shutdown checkpoint record were received. You can verify this 
> by running 
> + pg_controldata against the old primary and 
> standby 
> + clusters. The Latest checkpoint location values must 
> match in all 
> + nodes. A mismatch might occur if old standby servers were shut down 
> before 
> + the old primary. To fix a mismatch, start all old servers and return to 
> the 
> + previous step; proceeding with mismatched 
> + Latest checkpoint location may lead to standby 
> corruption.
> +
> +
> +
>   Also, make sure wal_level is not set to
>   minimal in the postgresql.conf 
> file on the
>   new primary cluster.
> @@ -497,7 +503,6 @@ pg_upgrade.exe
>   linkend="warm-standby"/>) standby servers, you can follow these steps to
>   quickly upgrade them.  You will not be running 
> pg_upgrade on
>   the standby servers, but rather rsync on the 
> primary.
> - Do not start any servers yet.
>  
>  
>  
> @@ -508,6 +513,15 @@ pg_upgrade.exe
>   is running.
>  
>  
> +
> + Before running rsync, to avoid standby corruption, it is absolutely
> + critical to ensure that both primaries are shut down and standbys 
> + have received the last changes (see  linkend="pgupgrade-step-prepare-standbys"/>). 

I think this should be something like "ensure both that the primary is
shut down and that the standbys have received all the changes".

> + Standbys can be running at this point or fully stopped.

"or be fully stopped." I think.

> + If they 
> + are still running, you can stop, upgrade, and start them one by one; 
> this
> + can be useful to keep the cluster open for read-only transactions.
> +

Maybe this is clear from the context, but "upgrade" in the above should
maybe more explicitly refer to the rsync method or else people might
think one can run pg_upgrade on them after all?


Michael




Re: Preventing non-superusers from altering session authorization

2023-07-10 Thread Nathan Bossart
On Mon, Jul 10, 2023 at 04:46:07PM -0400, Joseph Koshakow wrote:
> Thanks, I think the patch set looks good to go!

Great.  I'm going to wait a few more days in case anyone has additional
feedback, but otherwise I intend to commit this shortly.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Preventing non-superusers from altering session authorization

2023-07-10 Thread Joseph Koshakow
On Mon, Jul 10, 2023 at 4:32 PM Nathan Bossart 
wrote:
> Okay.  Here's a new patch set in which I believe I've addressed all
> feedback.  I didn't keep the GetAuthenticatedUserIsSuperuser() helper
> function around, as I didn't see a strong need for it.

Thanks, I think the patch set looks good to go!

> And I haven't
> touched the "is_superuser" GUC, either.  I figured we can take up any
> changes for it in the other thread.

Yeah, I think that makes sense.

Thanks,
Joe Koshakow


Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

2023-07-10 Thread Nikolay Samokhvalov
On Fri, Jul 7, 2023 at 6:31 AM Stephen Frost  wrote:

> * Nikolay Samokhvalov (n...@postgres.ai) wrote:
> > But this can happen with anyone who follows the procedure from the docs
> as
> > is and doesn't do any additional steps, because in step 9 "Prepare for
> > standby server upgrades":
> >
> > 1) there is no requirement to follow specific order to shut down the
> nodes
> >- "Streaming replication and log-shipping standby servers can remain
> > running until a later step" should probably be changed to a
> > requirement-like "keep them running"
>
> Agreed that it would be good to clarify that the primary should be shut
> down first, to make sure everything written by the primary has been
> replicated to all of the replicas.
>

Thanks!

Here is a patch to fix the existing procedure description.

I agree with Andrey – without it, we don't have any good way to upgrade
large clusters in short time. Default rsync mode (without "--size-only")
takes a lot of time too, if the load is heavy.

With these adjustments, can "rsync --size-only" remain in the docs as the
*fast* and safe method to upgrade standbys, or there are still some
concerns related to corruption risks?


pg-upgrade-docs-clarify-rsync-size-only.patch
Description: Binary data


Re: Preventing non-superusers from altering session authorization

2023-07-10 Thread Nathan Bossart
On Sun, Jul 09, 2023 at 08:54:30PM -0400, Joseph Koshakow wrote:
> I just realized that you moved this comment from
> SetSessionAuthorization. I think we should leave the part about setting
> the GUC variable is_superuser on top of SetSessionAuthorization since
> that's where we actually set the GUC.

Okay.  Here's a new patch set in which I believe I've addressed all
feedback.  I didn't keep the GetAuthenticatedUserIsSuperuser() helper
function around, as I didn't see a strong need for it.  And I haven't
touched the "is_superuser" GUC, either.  I figured we can take up any
changes for it in the other thread.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 691c7a30d86385cca33a7ac43a5d63c4bc39dae1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sun, 9 Jul 2023 11:28:56 -0700
Subject: [PATCH v6 1/3] Rename session_auth_is_superuser to
 current_role_is_superuser.

Suggested-by: Joseph Koshakow
Discussion: https://postgr.es/m/CAAvxfHc-HHzONQ2oXdvhFF9ayRnidPwK%2BfVBhRzaBWYYLVQL-g%40mail.gmail.com
---
 src/backend/access/transam/parallel.c | 2 +-
 src/backend/utils/misc/guc_tables.c   | 9 ++---
 src/include/utils/guc.h   | 2 +-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 2b8bc2f58d..12d97998cc 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -326,7 +326,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 	fps->database_id = MyDatabaseId;
 	fps->authenticated_user_id = GetAuthenticatedUserId();
 	fps->outer_user_id = GetCurrentRoleId();
-	fps->is_superuser = session_auth_is_superuser;
+	fps->is_superuser = current_role_is_superuser;
 	GetUserIdAndSecContext(&fps->current_user_id, &fps->sec_context);
 	GetTempNamespaceState(&fps->temp_namespace_id,
 						  &fps->temp_toast_namespace_id);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c14456060c..93dc2e7680 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -511,7 +511,7 @@ bool		check_function_bodies = true;
  * details.
  */
 bool		default_with_oids = false;
-bool		session_auth_is_superuser;
+bool		current_role_is_superuser;
 
 int			log_min_error_statement = ERROR;
 int			log_min_messages = WARNING;
@@ -1037,13 +1037,16 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
-		/* Not for general use --- used by SET SESSION AUTHORIZATION */
+		/*
+		 * Not for general use --- used by SET SESSION AUTHORIZATION and SET
+		 * ROLE
+		 */
 		{"is_superuser", PGC_INTERNAL, UNGROUPED,
 			gettext_noop("Shows whether the current user is a superuser."),
 			NULL,
 			GUC_REPORT | GUC_NO_SHOW_ALL | GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
 		},
-		&session_auth_is_superuser,
+		&current_role_is_superuser,
 		false,
 		NULL, NULL, NULL
 	},
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index d5253c7ed2..223a19f80d 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -250,7 +250,7 @@ extern PGDLLIMPORT bool log_statement_stats;
 extern PGDLLIMPORT bool log_btree_build_stats;
 
 extern PGDLLIMPORT bool check_function_bodies;
-extern PGDLLIMPORT bool session_auth_is_superuser;
+extern PGDLLIMPORT bool current_role_is_superuser;
 
 extern PGDLLIMPORT bool log_duration;
 extern PGDLLIMPORT int log_parameter_max_length;
-- 
2.25.1

>From a08490dd85f0d41830aa06982a5c5ea588ba79d1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sat, 8 Jul 2023 21:35:03 -0700
Subject: [PATCH v6 2/3] Move session auth privilege check to
 check_session_authorization().

Author: Joseph Koshakow
Discussion: https://postgr.es/m/CAAvxfHc-HHzONQ2oXdvhFF9ayRnidPwK%2BfVBhRzaBWYYLVQL-g%40mail.gmail.com
---
 src/backend/commands/variable.c   | 21 +
 src/backend/utils/init/miscinit.c | 30 --
 src/include/miscadmin.h   |  1 +
 3 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index f0f2e07655..b8a75012a5 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -846,6 +846,27 @@ check_session_authorization(char **newval, void **extra, GucSource source)
 
 	ReleaseSysCache(roleTup);
 
+	/*
+	 * Only superusers may SET SESSION AUTHORIZATION to roles other than
+	 * itself.  Note that in case of multiple SETs in a single session, the
+	 * original authenticated user's superuserness is what matters.
+	 */
+	if (roleid != GetAuthenticatedUserId() &&
+		!GetAuthenticatedUserIsSuperuser())
+	{
+		if (source == PGC_S_TEST)
+		{
+			ereport(NOTICE,
+	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+	 errmsg("permission will be denied to set session authorization \"%s\"",
+			*newval)));
+			return true;
+		}
+		GUC_check_errcode(ERRCODE_INSUFFICIENT_PRIVILEGE);
+		GUC_check_errmsg("permission denied to set session authorization");
+	

Re: add non-option reordering to in-tree getopt_long

2023-07-10 Thread Nathan Bossart
Here's a new version of the patch with the latest feedback addressed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e662a1b6b73c92ff7862444e092406ed82b0c2c7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 7 Jul 2023 20:00:47 -0700
Subject: [PATCH v6 1/1] Teach in-tree getopt_long() to move non-options to the
 end of argv.

Unlike the other implementations of getopt_long() I could find, the
in-tree implementation does not reorder non-options to the end of
argv.  Instead, it returns -1 as soon as the first non-option is
found, even if there are other options listed afterwards.  By
moving non-options to the end of argv, getopt_long() can parse all
specified options and return -1 when only non-options remain.
This quirk is periodically missed by hackers (e.g., 869aa40a27,
ffd398021c, and d9ddc50baf).

This commit introduces the aforementioned non-option reordering
behavior to the in-tree getopt_long() implementation.  The only
systems I'm aware of that use it are Windows and AIX, both of which
have been tested.

Special thanks to Noah Misch for his help validating behavior on
AIX.

Reviewed-by: Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20230609232257.GA121461%40nathanxps13
---
 src/bin/scripts/t/040_createuser.pl | 10 +++---
 src/port/getopt_long.c  | 54 -
 2 files changed, 43 insertions(+), 21 deletions(-)

diff --git a/src/bin/scripts/t/040_createuser.pl b/src/bin/scripts/t/040_createuser.pl
index 9ca282181d..3290e30c88 100644
--- a/src/bin/scripts/t/040_createuser.pl
+++ b/src/bin/scripts/t/040_createuser.pl
@@ -42,9 +42,8 @@ $node->issues_sql_like(
 	'add a role as a member with admin option of the newly created role');
 $node->issues_sql_like(
 	[
-		'createuser', '-m',
-		'regress_user3', '-m',
-		'regress user #4', 'REGRESS_USER5'
+		'createuser', 'REGRESS_USER5', '-m', 'regress_user3',
+		'-m', 'regress user #4'
 	],
 	qr/statement: CREATE ROLE "REGRESS_USER5" NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS ROLE regress_user3,"regress user #4";/,
 	'add a role as a member of the newly created role');
@@ -73,11 +72,14 @@ $node->issues_sql_like(
 	qr/statement: CREATE ROLE regress_user11 NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS IN ROLE regress_user1;/,
 	'--role');
 $node->issues_sql_like(
-	[ 'createuser', '--member-of', 'regress_user1', 'regress_user12' ],
+	[ 'createuser', 'regress_user12', '--member-of', 'regress_user1' ],
 	qr/statement: CREATE ROLE regress_user12 NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS IN ROLE regress_user1;/,
 	'--member-of');
 
 $node->command_fails([ 'createuser', 'regress_user1' ],
 	'fails if role already exists');
+$node->command_fails(
+	[ 'createuser', 'regress_user1', '-m', 'regress_user2', 'regress_user3' ],
+	'fails for too many non-options');
 
 done_testing();
diff --git a/src/port/getopt_long.c b/src/port/getopt_long.c
index c989276988..3ba6094d93 100644
--- a/src/port/getopt_long.c
+++ b/src/port/getopt_long.c
@@ -50,8 +50,9 @@
  * This implementation does not use optreset.  Instead, we guarantee that
  * it can be restarted on a new argv array after a previous call returned -1,
  * if the caller resets optind to 1 before the first call of the new series.
- * (Internally, this means we must be sure to reset "place" to EMSG before
- * returning -1.)
+ *
+ * Note that this routine reorders the pointers in argv (despite the const
+ * qualifier) so that all non-options will be at the end when -1 is returned.
  */
 int
 getopt_long(int argc, char *const argv[],
@@ -60,40 +61,59 @@ getopt_long(int argc, char *const argv[],
 {
 	static char *place = EMSG;	/* option letter processing */
 	char	   *oli;			/* option letter list index */
+	static int	nonopt_start = -1;
+	static bool force_nonopt = false;
 
 	if (!*place)
 	{			/* update scanning pointer */
-		if (optind >= argc)
+		char	  **args = (char **) argv;
+
+retry:
+
+		/*
+		 * If we are out of arguments or only non-options remain, return -1.
+		 */
+		if (optind >= argc || optind == nonopt_start)
 		{
 			place = EMSG;
+			nonopt_start = -1;
+			force_nonopt = false;
 			return -1;
 		}
 
 		place = argv[optind];
 
-		if (place[0] != '-')
+		/*
+		 * An argument is a non-option if it meets any of the following
+		 * criteria: it follows an argument that is equivalent to the string
+		 * "--", it is equivalent to the string "-", or it does not start with
+		 * '-'.  When we encounter a non-option, we move it to the end of argv
+		 * (after shifting all remaining arguments over to make room), and
+		 * then we try again with the next argument.
+		 */
+		if (force_nonopt || strcmp("-", place) == 0 || place[0] != '-')
 		{
-			place = EMSG;
-			return -1;
-		}
+			for (int i = optind; i < argc - 1; i++)
+				args[i] = args[i + 1];
+			args[argc - 1] = place;
 
-		place++;
+			if (nonopt_start == -1)
+nonopt_start = 

Re: PATCH: Using BRIN indexes for sorted output

2023-07-10 Thread Tomas Vondra



On 7/10/23 18:18, Matthias van de Meent wrote:
> On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
>  wrote:
>> On 7/10/23 14:38, Matthias van de Meent wrote:
>>> Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
>>> only need to extract the normal min/max values for ASC/DESC sorts,
>>> which are readily available in the summary. But for multi-dimensional
>>> and distance searches (nearest neighbour) we need to calculate the
>>> distance between the indexed value(s) and the origin value to compare
>>> the summary against, and the order would thus be asc/desc on distance
>>> - a distance which may not be precisely represented by float types -
>>> thus 'relative order' with its own order operation.
>>>
>>
>> Can you give some examples of such data / queries, and how would it
>> leverage the BRIN sort stuff?
> 
> Order by distance would be `ORDER BY box <-> '(1, 2)'::point ASC`, and
> the opclass would then decide that `<->(box, point) ASC` means it has
> to return the closest distance from the point to the summary, for some
> measure of 'distance' (this case L2, <#> other types, etc.). For DESC,
> that would return the distance from `'(1,2)'::point` to the furthest
> edge of the summary away from that point. Etc.
> 

Thanks.

>> For distance searches, I imagine this as data indexed by BRIN inclusion
>> opclass, which creates a bounding box. We could return closest/furthest
>> point on the bounding box (from the point used in the query). Which
>> seems a bit like a R-tree ...
> 
> Kind of; it would allow us to utilize such orderings without the
> expensive 1 tuple = 1 index entry and without scanning the full table
> before getting results. No tree involved, just a sequential scan on
> the index to allow some sketch-based pre-sort on the data. Again, this
> would work similar to how GiST's internal pages work: each downlink in
> GiST contains a summary of the entries on the downlinked page, and
> distance searches use a priority queue where the priority is the
> distance of the opclass-provided distance operator - lower distance
> means higher priority.

Yes, that's roughly how I understood this too - a tradeoff that won't
give the same performance as GiST, but much smaller and cheaper to maintain.

> For BRIN, we'd have to build a priority queue
> for the whole table at once, but presorting table sections is part of
> the design of BRIN sort, right?

Yes, that's kinda the whole point of BRIN sort.

> 
>> But I have no idea what would this do for multi-dimensional searches, or
>> what would those searches do? How would you sort such data other than
>> lexicographically? Which I think is covered by the current BRIN Sort,
>> because the data is either stored as multiple columns, in which case we
>> use the BRIN on the first column. Or it's indexed using BRIN minmax as a
>> tuple of values, but then it's sorted lexicographically.
> 
> Yes, just any BRIN summary that allows distance operators and the like
> should be enough. MINMAX is easy to understand, and box inclusion is
> IMO also fairly easy to understand.
> 

True. If minmax is interpreted as inclusion with a simple 1D points, it
kinda does the same thing. (Of course, minmax work with data types that
don't have distances, but there's similarity.)

>>>> I haven't really thought about geometric types, just about minmax and
>>>> minmax-multi. It's not clear to me what the benefit for these types would be.
>>>> I mean, we can probably sort points lexicographically, but is anyone
>>>> doing that in queries? It seems useless for order by distance.
>>>
>>> Yes, that's why you would sort them by distance, where the distance is
>>> generated by the opclass as min/max distance between the summary and
>>> the distance's origin, and then inserted into the tuplesort.
>>>
>>
>> OK, so the query says "order by distance from point X" and we calculate
>> the min/max distance of values in a given page range.
> 
> Yes, and because it's BRIN that's an approximation, which should
> generally be fine.
> 

Approximation in what sense? My understanding was we'd get a range of
distances that we know covers all rows in that range. So the results
should be accurate, no?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: ResourceOwner refactoring

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-10 12:14:46 -0700, Andres Freund wrote:
> >  /*
> > - * Initially allocated size of a ResourceArray.  Must be power of two since
> > - * we'll use (arraysize - 1) as mask for hashing.
> > + * Size of the fixed-size array to hold most-recently remembered resources.
> >   */
> > -#define RESARRAY_INIT_SIZE 16
> > +#define RESOWNER_ARRAY_SIZE 32
>
> That's 512 bytes, pretty large to be searched sequentially. It's somewhat sad
> that the array needs to include 8 bytes of ResourceOwnerFuncs...

On that note:

It's perhaps worth noting that this change actually *increases* the size of
ResourceOwnerData. Previously:

/* size: 576, cachelines: 9, members: 19 */
/* sum members: 572, holes: 1, sum holes: 4 */

now:
/* size: 696, cachelines: 11, members: 13 */
/* sum members: 693, holes: 1, sum holes: 3 */

It won't make a difference memory-usage wise, given aset.c's rounding
behaviour. But it does mean that more memory needs to be zeroed, as
ResourceOwnerCreate() uses MemoryContextAllocZero.

As there's really no need to initialize the long ->arr, ->locks, it seems
worth just initializing the rest of the fields explicitly instead of using
MemoryContextAllocZero().


With the new representation, is there any point in having ->locks? We still
need ->nlocks to be able to switch to the lossy behaviour, but there doesn't
seem to be much point in having the separate array.

Greetings,

Andres Freund




Re: ResourceOwner refactoring

2023-07-10 Thread Andres Freund
Hi,

On 2023-05-25 20:14:23 +0300, Heikki Linnakangas wrote:
> @@ -2491,9 +2495,6 @@ BufferSync(int flags)
>   int mask = BM_DIRTY;
>   WritebackContext wb_context;
>
> - /* Make sure we can handle the pin inside SyncOneBuffer */
> - ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> -
>   /*
>* Unless this is a shutdown checkpoint or we have been explicitly told,
>* we write only permanent, dirty buffers.  But at shutdown or end of
> @@ -2970,9 +2971,6 @@ BgBufferSync(WritebackContext *wb_context)
>* requirements, or hit the bgwriter_lru_maxpages limit.
>*/
>
> - /* Make sure we can handle the pin inside SyncOneBuffer */
> - ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> -
>   num_to_scan = bufs_to_lap;
>   num_written = 0;
>   reusable_buffers = reusable_buffers_est;
> @@ -3054,8 +3052,6 @@ BgBufferSync(WritebackContext *wb_context)
>   *
>   * (BUF_WRITTEN could be set in error if FlushBuffer finds the buffer clean
>   * after locking it, but we don't care all that much.)
> - *
> - * Note: caller must have done ResourceOwnerEnlargeBuffers.
>   */
>  static int
>  SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext 
> *wb_context)

> @@ -3065,7 +3061,9 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, 
> WritebackContext *wb_context)
>   uint32  buf_state;
>   BufferTag   tag;
>
> + /* Make sure we can handle the pin */
>   ReservePrivateRefCountEntry();
> + ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
>
>   /*
>* Check whether buffer needs writing.
> @@ -4110,9 +4108,6 @@ FlushRelationBuffers(Relation rel)
>   return;
>   }
>
> - /* Make sure we can handle the pin inside the loop */
> - ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> -
>   for (i = 0; i < NBuffers; i++)
>   {
>   uint32  buf_state;
> @@ -4126,7 +4121,9 @@ FlushRelationBuffers(Relation rel)
> 		if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator))
>   continue;
>
> + /* Make sure we can handle the pin */
>   ReservePrivateRefCountEntry();
> + ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
>
>   buf_state = LockBufHdr(bufHdr);
> 		if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator) &&
> @@ -4183,9 +4180,6 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
>   if (use_bsearch)
>   pg_qsort(srels, nrels, sizeof(SMgrSortArray), 
> rlocator_comparator);
>
> - /* Make sure we can handle the pin inside the loop */
> - ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> -
>   for (i = 0; i < NBuffers; i++)
>   {
>   SMgrSortArray *srelent = NULL;

Hm, a tad worried that increasing the number of ResourceOwnerEnlargeBuffers()
that drastically could have a visible overhead.



> +static ResourceOwnerFuncs tupdesc_resowner_funcs =
> +{
> + .name = "tupdesc reference",
> + .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
> + .release_priority = RELEASE_PRIO_TUPDESC_REFS,
> + .ReleaseResource = ResOwnerReleaseTupleDesc,
> + .PrintLeakWarning = ResOwnerPrintTupleDescLeakWarning
> +};

I think these should all be marked const, there's no need to have them be in a
mutable section of the binary. Of course that will require adjusting a bunch
of the signatures too, but that seems fine.

> --- a/src/backend/access/transam/xact.c
> +++ b/src/backend/access/transam/xact.c
> @@ -5171,9 +5171,24 @@ AbortSubTransaction(void)
>   ResourceOwnerRelease(s->curTransactionOwner,
>
> RESOURCE_RELEASE_BEFORE_LOCKS,
>false, false);
> +
>   AtEOSubXact_RelationCache(false, s->subTransactionId,
> 
> s->parent->subTransactionId);
> +
> +
> + /*
> +  * AtEOSubXact_Inval sometimes needs to temporarily
> +  * bump the refcount on the relcache entries that it processes.
> +  * We cannot use the subtransaction's resource owner anymore,
> +  * because we've already started releasing it.  But we can use
> +  * the parent resource owner.
> +  */
> + CurrentResourceOwner = s->parent->curTransactionOwner;
> +
>   AtEOSubXact_Inval(false);
> +
> + CurrentResourceOwner = s->curTransactionOwner;
> +
>   ResourceOwnerRelease(s->curTransactionOwner,
>RESOURCE_RELEASE_LOCKS,
>false, false);

I might be a bit slow this morning, but why did this need to change as part of
this?


> @@ -5182,8 +5221,8 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, 
> uint32 

Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value

2023-07-10 Thread Sergei Kornilov
Hello

My +1 to have such a function in core or in some contrib at least (pg_surgery? 
amcheck?).

In the past, more than once I needed to find a damaged tuple knowing only chunk 
id and toastrelid. This feature would help a lot.
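
For illustration, a sketch of how such a function would be used (the table,
column and chunk ID below are made up, and the exact signature is whatever the
patch ends up with): given a chunk_id reported in a TOAST error, the owning row
could be located with something like

SELECT ctid, id
FROM my_table
WHERE pg_column_toast_chunk_id(big_col) = 16456;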

regards, Sergei




stats test intermittent failure

2023-07-10 Thread Melanie Plageman
Hi,

Jeff pointed out that one of the pg_stat_io tests has failed a few times
over the past months (here on morepork [1] and more recently here on
francolin [2]).

Failing test diff for those who prefer not to scroll:

+++ 
/home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out
   2023-07-07 18:48:25.976313231 +
@@ -1415,7 +1415,7 @@
:io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
  ?column? | ?column?
 --+--
- t| t
+ t| f

My theory about the test failure is that, when there is enough demand
for shared buffers, the flapping test fails because it expects buffer
access strategy *reuses* and concurrent queries already flushed those
buffers before they could be reused. Attached is a patch which I think
will fix the test while keeping some code coverage. If we count
evictions and reuses together, those should have increased.

- Melanie

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2023-06-16%2018%3A30%3A32
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=francolin&dt=2023-07-07%2018%3A43%3A57&stg=recovery-check
From cf41551431751a2b03e0fc08b7296301fba81992 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 10 Jul 2023 13:57:36 -0400
Subject: [PATCH v1] Fix pg_stat_io buffer reuse test instability

The stats regression test attempts to ensure that Buffer Access Strategy
"reuses" are being counted in pg_stat_io by vacuuming a table which is
larger than the size of the strategy ring. However, when shared buffers
are in sufficiently high demand, another backend could evict one of the
blocks in the strategy ring before the first backend has a chance to
reuse the buffer. The backend using the strategy would then evict a
block from another shared buffer and add that buffer to the strategy
ring. This counts as an eviction and not a reuse in pg_stat_io. Count
both evictions and reuses in the test to ensure it does not fail
incorrectly.

Reported-by: Jeff Davis
---
 src/test/regress/expected/stats.out | 30 +++--
 src/test/regress/sql/stats.sql  | 19 +++---
 2 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 8e63340782..2ea6bbca7a 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1385,7 +1385,11 @@ SELECT :io_sum_local_new_tblspc_writes > :io_sum_local_after_writes;
 
 RESET temp_buffers;
 -- Test that reuse of strategy buffers and reads of blocks into these reused
--- buffers while VACUUMing are tracked in pg_stat_io.
+-- buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient
+-- demand for shared buffers from concurrent queries, some blocks may be
+-- evicted from the strategy ring before they can be reused. In such cases,
+-- the backend will evict a block from a shared buffer outside of the
+-- ring and add it to the ring. This is considered an eviction and not a reuse.
 -- Set wal_skip_threshold smaller than the expected size of
 -- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
 -- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
@@ -1393,15 +1397,15 @@ RESET temp_buffers;
 -- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
 -- reads.
 SET wal_skip_threshold = '1 kB';
-SELECT sum(reuses) AS reuses, sum(reads) AS reads
+SELECT sum(reuses) AS reuses, sum(reads) AS reads, sum(evictions) AS evictions
   FROM pg_stat_io WHERE context = 'vacuum' \gset io_sum_vac_strategy_before_
 CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
 INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 4500)i;
 -- Ensure that the next VACUUM will need to perform IO by rewriting the table
 -- first with VACUUM (FULL).
 VACUUM (FULL) test_io_vac_strategy;
--- Use the minimum BUFFER_USAGE_LIMIT to cause reuses with the smallest table
--- possible.
+-- Use the minimum BUFFER_USAGE_LIMIT to cause reuses or evictions with the
+-- smallest table possible.
 VACUUM (PARALLEL 0, BUFFER_USAGE_LIMIT 128) test_io_vac_strategy;
 SELECT pg_stat_force_next_flush();
  pg_stat_force_next_flush 
@@ -1409,13 +1413,19 @@ SELECT pg_stat_force_next_flush();
  
 (1 row)
 
-SELECT sum(reuses) AS reuses, sum(reads) AS reads
+SELECT sum(reuses) AS reuses, sum(reads) AS reads, sum(evictions) AS evictions
   FROM pg_stat_io WHERE context = 'vacuum' \gset io_sum_vac_strategy_after_
-SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads,
-   :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
- ?column? | ?column? 
---+--
- t| t
+SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads;
+ ?column? 
+--
+ t
+(1 row)
+
+SELECT (:io_sum_vac_strategy_after_reuses + 

Re: add non-option reordering to in-tree getopt_long

2023-07-10 Thread Nathan Bossart
On Mon, Jul 10, 2023 at 04:57:11PM +0900, Kyotaro Horiguchi wrote:
> + if (!force_nonopt && place[0] == '-' && place[1])
> + {
> + if (place[1] != '-' || place[2])
> + break;
> +
> + optind++;
> + force_nonopt = true;
> + continue;
> + }
> 
> The first if looks good to me, but the second if is a bit hard to get the 
> meaning of at a glance. "!(place[1] == '-' && place[2] == 0)" is easier to read 
> *to me*. Or I'm fine with the following structure here.

I'd readily admit that it's tough to read these lines.  I briefly tried to
make use of strcmp() to improve readability, but I wasn't having much luck.
I'll give it another try.

>> if (!force_nonopt ... )
>> {
>>   if (place[1] == '-' && place[2] == 0)
>>   {
>>  optind++;
>>  force_nonopt = true;
>>  continue;
>>   }
>>   break;
>> }
> 
> (To be honest, I think the goto looked clearer than for(;;).)

Okay.  I don't mind adding it back if it improves readability.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Should we remove db_user_namespace?

2023-07-10 Thread Nathan Bossart
On Mon, Jul 10, 2023 at 03:43:07PM +0900, Michael Paquier wrote:
> Thanks.  Reading through the patch, this version should be able to
> handle the dump reloads.

Thanks for reviewing.  I'm currently planning to commit this sometime next
week.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: DecodeInterval fixes

2023-07-10 Thread Jacob Champion
On Mon, Jul 10, 2023 at 10:19 AM Reid Thompson  wrote:
> I made a another pass through the separated patches, it looks good to
> me.

LGTM too. (Thanks Tom for the clarification on ECPG.)

Marked RfC.

--Jacob




Re: pg_usleep for multisecond delays

2023-07-10 Thread Nathan Bossart
On Mon, Jul 10, 2023 at 10:12:36AM +0200, Daniel Gustafsson wrote:
> Have you had any chance to revisit this such that there is a new patch to
> review, or should it be returned/moved for now?

Not yet.  I moved it to the next commitfest.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

2023-07-10 Thread Alena Rybakina

Well, I don't have a detailed plan either. In principle it shouldn't be
that hard, I think - examine_variable is loading the statistics, so it
could apply the same null_frac correction, just like nulltestsel would
do a bit later.

The main question is how to pass the information to examine_variable. It
doesn't get the SpecialJoinInfo (which is what nulltestsel used), so
we'd need to invent something new ... add a new argument?


Sorry I didn't answer right away. I was able to adapt the last version of the 
patch [2] to the current idea, but so far I have implemented
it only for the situation where we save the number of zero values in the 
SpecialJoinInfo variable.


I'm starting to look at other functions, such as scalararraysel_containment 
and boolvarsel, and trying to find bad cases for the current problem
that I could fix in a similar way in those functions. But I'm not sure 
about several other functions
(mergejoinscansel, estimate_num_groups, estimate_hash_bucket_stats, 
get_restriction_variable, get_join_variables).


The examine_variable function is also called in them, and even though 
there is no outer join involved there
(hence no SpecialJoinInfo variable), I can't think of similar 
cases where we would have such a problem caused by the same reasons.



The code passes all regression tests and I found no deterioration in 
cardinality prediction for queries from [1], except for one:


EXPLAIN ANALYZE
  SELECT * FROM large FULL JOIN small ON (large.id = small.id)
WHERE (large.a IS NULL);

MASTER:

QUERY PLAN
--------------------------------------------------------------------------------------------
 Merge Full Join  (cost=127921.69..299941.59 rows=56503 width=16) (actual time=795.092..795.094 rows=0 loops=1)
   Merge Cond: (small.id = large.id)
   Filter: (large.a IS NULL)
   Rows Removed by Filter: 100
   ->  Sort  (cost=158.51..164.16 rows=2260 width=8) (actual time=0.038..0.046 rows=100 loops=1)
         Sort Key: small.id
         Sort Method: quicksort  Memory: 29kB
         ->  Seq Scan on small  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.013..0.022 rows=100 loops=1)
   ->  Materialize  (cost=127763.19..132763.44 rows=150 width=8) (actual time=363.016..649.103 rows=100 loops=1)
         ->  Sort  (cost=127763.19..130263.31 rows=150 width=8) (actual time=363.012..481.480 rows=100 loops=1)
               Sort Key: large.id
               Sort Method: external merge  Disk: 17664kB
               ->  Seq Scan on large  (cost=0.00..14425.50 rows=150 width=8) (actual time=0.009..111.166 rows=100 loops=1)
 Planning Time: 0.124 ms
 Execution Time: 797.139 ms
(15 rows)

With patch:

QUERY PLAN
--------------------------------------------------------------------------------------------
 Hash Full Join  (cost=3.25..18179.25 rows=00 width=16) (actual time=261.480..261.482 rows=0 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: (large.a IS NULL)
   Rows Removed by Filter: 100
   ->  Seq Scan on large  (cost=0.00..14425.00 rows=100 width=8) (actual time=0.006..92.827 rows=100 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.032..0.034 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) (actual time=0.008..0.015 rows=100 loops=1)
 Planning Time: 0.151 ms
 Execution Time: 261.529 ms
(10 rows)

[1] 
https://www.mail-archive.com/pgsql-hackers@lists.postgresql.org/msg146044.html


[2] 
https://www.postgresql.org/message-id/148ff8f1-067b-1409-c754-af6117de9b7d%40yandex.ru



Unfortunately, I found that my previous change leads to a big error in 
predicting selectivity.
I will investigate this case in more detail and try to find a solution 
(I wrote the code and query below).


// unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
if (nd1 != nd2)
{
    selec /= Max(nd1, nd2);
    *unmatched_frac = abs(nd1 - nd2) * 1.0 / Max(nd1, nd2);
}
else
{
    selec /= nd2;
    *unmatched_frac = 0.0;
}


postgres=# explain analyze select * from large l1 inner join large l2 on 
l1.a is null where l1.a <100;


MASTER:

QUERY PLAN
--------------------------------------------------------------------------------------------
 Nested Loop  (cost=1000.00..35058.43 rows=100 width=16) (actual time=91.846..93.622 rows=0 loops=1)
   ->  Gather  (cost=1000.00..10633.43 rows=1 width=8) (actual time=91.844..93.619 rows=0 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Seq Scan on large l1  (cost=0.00..9633.33 rows=1 width=8) (actual time=86.153..86.154 rows=0 loops=3)
               Filter: ((a IS NULL) AND (a < 100))
               Rows Removed by Filter: 33
   ->  Seq Scan on large l2  (cost=0.00..14425.00 rows=100 width=8) (never executed)
 Planning Time: 0.151 ms
 Execution Time: 93.689 ms


Re: DecodeInterval fixes

2023-07-10 Thread Reid Thompson
On Sun, 2023-07-09 at 13:24 -0400, Joseph Koshakow wrote:
> 
> I've broken up this patch into three logical commits and attached
> them.
> None of the actual code has changed.
> 
> Thanks,
> Joe Koshakow

I made a another pass through the separated patches, it looks good to
me. 

Thanks,
Reid






Re: UUID v7

2023-07-10 Thread Peter Eisentraut

On 07.07.23 14:06, Andrey M. Borodin wrote:

Also, I think we should discuss UUID v8. UUID version 8 provides an RFC-compatible format for 
experimental or vendor-specific use cases. Revision 1 of IETF draft contained interesting code for 
v8: almost similar to v7, but with fields for "node ID" and "rolling sequence 
number".
I think this is reasonable approach, thus I attach implementation of UUID v8 
per [0].


I suggest we keep this thread to v7, which has pretty straightforward 
semantics for PostgreSQL.  v8 by definition has many possible 
implementations, so you're going to have to make pretty strong arguments 
that yours is the best and only one, if you are going to claim the 
gen_uuid_v8 function name.






Re: BUG #18016: REINDEX TABLE failure

2023-07-10 Thread Gurjeet Singh
On Sun, Jul 9, 2023 at 7:21 AM Richard Veselý  wrote:
>
> ... there's no shortage of people that suffer from sluggish 
> pg_dump/pg_restore cycle and I imagine there are any number of people that 
> would be interested in improving bulk ingestion which is often a bottleneck 
> for analytical workloads as you are well aware. What's the best place to 
> discuss this topic further - pgsql-performance or someplace else?

(moved conversation to -hackers, and moved -bugs to BCC)

> I was dissatisfied with storage layer performance, especially during the 
> initial database population, so I rewrote it for my use case. I'm done with 
> the heap, but for the moment I still rely on PostgreSQL to build indexes,

It sounds like you've developed a method to speed up loading of
tables, and might have ideas/suggestions for speeding up CREATE
INDEX/REINDEX. The -hackers list feels like a place to discuss such
changes.

Best regards,
Gurjeet
http://Gurje.et




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote:
> Out of curiosity I've tried and it is reproducible as you have stated : XFS
> @ 4.18.0-425.10.1.el8_7.x86_64:
>...
> According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output ,
> the XFS issues sync writes while ext4 does not, xfs looks like constant
> loop of sync writes (D) by kworker/2:1H-kblockd:

That clearly won't go well.  It's not reproducible on newer systems,
unfortunately :(. Or well, fortunately maybe.


I wonder if a trick to avoid this could be to memorialize the fact that we
bulk extended before and extend by that much going forward? That'd avoid the
swapping back and forth.


One thing that confuses me is that Sawada-san observed a regression at a
single client - yet from what I can tell, there's actually not a single
fallocate() being generated in that case, because the table is so narrow that
we never end up extending by a sufficient number of blocks in
RelationAddBlocks() to reach that path. Yet there's a regression at 1 client.

I don't yet have a RHEL8 system at hand, could either of you send the result
of a
  strace -c -p $pid_of_backend_doing_copy

Greetings,

Andres Freund




Re: PATCH: Using BRIN indexes for sorted output

2023-07-10 Thread Matthias van de Meent
On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
 wrote:
> On 7/10/23 14:38, Matthias van de Meent wrote:
>> Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
>> only need to extract the normal min/max values for ASC/DESC sorts,
>> which are readily available in the summary. But for multi-dimensional
>> and distance searches (nearest neighbour) we need to calculate the
>> distance between the indexed value(s) and the origin value to compare
>> the summary against, and the order would thus be asc/desc on distance
>> - a distance which may not be precisely represented by float types -
>> thus 'relative order' with its own order operation.
>>
>
> Can you give some examples of such data / queries, and how would it
> leverage the BRIN sort stuff?

Order by distance would be `ORDER BY box <-> '(1, 2)'::point ASC`, and
the opclass would then decide that `<->(box, point) ASC` means it has
to return the closest distance from the point to the summary, for some
measure of 'distance' (this case L2, <#> other types, etc.). For DESC,
that would return the distance from `'(1,2)'::point` to the furthest
edge of the summary away from that point. Etc.
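
To make that concrete, here is a sketch of the kind of schema and query this
targets (table and index names are invented; today such a query needs a full
sort, and the idea is to let the BRIN summaries pre-sort block ranges by their
min/max distance):

CREATE TABLE shapes (id bigint, bbox box);
CREATE INDEX shapes_bbox_brin ON shapes USING brin (bbox box_inclusion_ops);

SELECT id
FROM shapes
ORDER BY bbox <-> '(1,2)'::point ASC
LIMIT 10;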

> For distance searches, I imagine this as data indexed by BRIN inclusion
> opclass, which creates a bounding box. We could return closest/furthest
> point on the bounding box (from the point used in the query). Which
> seems a bit like a R-tree ...

Kind of; it would allow us to utilize such orderings without the
expensive 1 tuple = 1 index entry and without scanning the full table
before getting results. No tree involved, just a sequential scan on
the index to allow some sketch-based pre-sort on the data. Again, this
would work similar to how GiST's internal pages work: each downlink in
GiST contains a summary of the entries on the downlinked page, and
distance searches use a priority queue where the priority is the
distance of the opclass-provided distance operator - lower distance
means higher priority. For BRIN, we'd have to build a priority queue
for the whole table at once, but presorting table sections is part of
the design of BRIN sort, right?

> But I have no idea what would this do for multi-dimensional searches, or
> what would those searches do? How would you sort such data other than
> lexicographically? Which I think is covered by the current BRIN Sort,
> because the data is either stored as multiple columns, in which case we
> use the BRIN on the first column. Or it's indexed using BRIN minmax as a
> tuple of values, but then it's sorted lexicographically.

Yes, just any BRIN summary that allows distance operators and the like
should be enough. MINMAX is easy to understand, and box inclusion is
IMO also fairly easy to understand.

>>> I haven't really thought about geometric types, just about minmax and
>>> minmax-multi. It's not clear to me what the benefit for these types would be.
>>> I mean, we can probably sort points lexicographically, but is anyone
>>> doing that in queries? It seems useless for order by distance.
>>
>> Yes, that's why you would sort them by distance, where the distance is
>> generated by the opclass as min/max distance between the summary and
>> the distance's origin, and then inserted into the tuplesort.
>>
>
> OK, so the query says "order by distance from point X" and we calculate
> the min/max distance of values in a given page range.

Yes, and because it's BRIN that's an approximation, which should
generally be fine.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: gcc -Wclobbered in PostgresMain

2023-07-10 Thread Tom Lane
Sergey Shinderuk  writes:
> On 08.07.2023 18:11, Tom Lane wrote:
>> Having done that, it wouldn't really be necessary to mark these
>> as volatile.  I kept that marking anyway for consistency with
>> send_ready_for_query, but perhaps we shouldn't?

> I don't know. Maybe marking them volatile is more future proof. Not sure.

Yeah, after sleeping on it, it seems best to have a policy that all
variables declared in that place are volatile.  Even if there's no bug
now, not having volatile creates a risk of surprising behavior after
future changes.  Pushed that way.

regards, tom lane




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-03 11:59:38 +0900, Masahiko Sawada wrote:
> On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada  wrote:
> >
> > After further investigation, the performance degradation comes from
> > calling posix_fallocate() (called via FileFallocate()) and pwritev()
> > (called via FileZero) alternatively depending on how many blocks we
> > extend by. And it happens only on the xfs filesystem.
>
> FYI, the attached simple C program proves the fact that calling
> alternatively posix_fallocate() and pwrite() causes slow performance
> on posix_fallocate():
>
> $ gcc -o test test.c
> $ time ./test test.1 1
> total   20
> fallocate   20
> filewrite   0
>
> real0m1.305s
> user0m0.050s
> sys 0m1.255s
>
> $ time ./test test.2 2
> total   20
> fallocate   10
> filewrite   10
>
> real1m29.222s
> user0m0.139s
> sys 0m3.139s

On an xfs filesystem, with a very recent kernel:

time /tmp/msw_test /srv/dev/fio/msw 0
total   20
fallocate   0
filewrite   20

real0m0.456s
user0m0.017s
sys 0m0.439s


time /tmp/msw_test /srv/dev/fio/msw 1
total   20
fallocate   20
filewrite   0

real0m0.141s
user0m0.010s
sys 0m0.131s


time /tmp/msw_test /srv/dev/fio/msw 2
total   20
fallocate   10
filewrite   10

real0m0.297s
user0m0.017s
sys 0m0.280s


So I don't think I can reproduce your problem on that system...

I also tried adding a fdatasync() into the loop, but that just made things
uniformly slow.


I guess I'll try to dig up whether this is a problem in older upstream
kernels, or whether it's been introduced in RHEL.

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote:
> While testing PG16, I observed that in PG16 there is a big performance
> degradation in concurrent COPY into a single relation with 2 - 16
> clients in my environment. I've attached a test script that measures
> the execution time of COPYing 5GB data in total to the single relation
> while changing the number of concurrent insertions, in PG16 and PG15.
> Here are the results on my environment (EC2 instance, RHEL 8.6, 128
> vCPUs, 512GB RAM):

Gah, RHEL with its frankenkernels, the bane of my existence.

FWIW, I had extensively tested this with XFS, just with a newer kernel. Have
you tested this on RHEL9 as well by any chance?

Greetings,

Andres Freund




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Andres Freund
Hi,

On 2023-07-10 15:25:41 +0200, Alvaro Herrera wrote:
> On 2023-Jul-03, Masahiko Sawada wrote:
> 
> > While testing PG16, I observed that in PG16 there is a big performance
> > degradation in concurrent COPY into a single relation with 2 - 16
> > clients in my environment. I've attached a test script that measures
> > the execution time of COPYing 5GB data in total to the single relation
> > while changing the number of concurrent insertions, in PG16 and PG15.
> 
> This item came up in the RMT meeting.  Andres, I think this item belongs
> to you, because of commit 00d1e02be2.

I'll take a look - I wasn't even aware of this thread until now.

Greetings,

Andres Freund




Re: Exclusion constraints on partitioned tables

2023-07-10 Thread Paul A Jungwirth
On Mon, Jul 10, 2023 at 7:05 AM Peter Eisentraut  wrote:
> I'm not sure what value we would get from testing this with btree_gist,
> but if we wanted to do that, then adding a new test file to the
> btree_gist sql/ directory would seem reasonable to me.
>
> (I would make the test a little bit bigger than you had shown, like
> insert a few values.)
>
> If you want to do that, please send another patch.  Otherwise, I'm ok to
> commit this one.

I can get you a patch tonight or tomorrow. I think it's worth it since
btree_gist uses different strategy numbers than ordinary gist.

Thanks!
Paul




Re: Add more sanity checks around callers of changeDependencyFor()

2023-07-10 Thread Tom Lane
Alvaro Herrera  writes:
> On 2023-Jul-10, Tom Lane wrote:
>> The user has altered properties of an extension member
>> object locally within the database, but has not changed the extension's
>> installation script to match.

> If I were developing an extension and decided, down the line, to have
> some objects in another schema, I would certainly increment the
> extension's version number and have a new script to move the object.  I
> would never expect the user to do an ALTER directly (and it makes no
> sense for me as an extension developer to do it manually, either.)

It's certainly poor practice, but I could see doing it early in an
extension's development (while you're still working towards 1.0).

ISTR that we discussed forbidding such changes way back when the
extension mechanism was invented, and decided against it on the
grounds that (a) it'd be nanny-ism, (b) we'd have to add checks in an
awful lot of places and it'd be easy to miss some, and (c) forbidding
superusers from doing anything they want is generally not our style.
We could reconsider that now, but I think we'd probably land on the
same place.

regards, tom lane




Re: Add more sanity checks around callers of changeDependencyFor()

2023-07-10 Thread Alvaro Herrera
On 2023-Jul-10, Tom Lane wrote:

> Michael Paquier  writes:
> > On Thu, Jul 06, 2023 at 10:09:20AM -0700, Andres Freund wrote:
> >> I also don't think pg_dump will dump the changed schema, which means a
> >> dump/restore leads to a different schema - IMO something to avoid.
> 
> > Yes, you're right here.  The function dumped is restored in the same
> > schema as the extension.
> 
> Actually, I think the given example demonstrates pilot error rather
> than a bug.

Well, if this is pilot error, why don't we throw an error ourselves?

> The user has altered properties of an extension member
> object locally within the database, but has not changed the extension's
> installation script to match.

If I were developing an extension and decided, down the line, to have
some objects in another schema, I would certainly increment the
extension's version number and have a new script to move the object.  I
would never expect the user to do an ALTER directly (and it makes no
sense for me as an extension developer to do it manually, either.)
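
As a sketch (extension, function and schema names invented, and assuming the
target schema already exists), the move would be shipped as a version-update
script rather than done ad hoc by the user:

-- myext--1.0--1.1.sql, picked up by ALTER EXTENSION ... UPDATE:
ALTER FUNCTION myext_func(integer) SET SCHEMA myext_private;

-- and the user upgrades with:
ALTER EXTENSION myext UPDATE TO '1.1';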

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
“Cuando no hay humildad las personas se degradan” (A. Christie)




Re: remaining sql/json patches

2023-07-10 Thread Alvaro Herrera
I forgot to add:

* 0001 looks an obvious improvement.  You could just push it now, to
avoid carrying it forward anymore.  I would just put the constructName
ahead of value expr in the argument list, though.

* 0002: I have no idea what this is (though I probably should).  I would
also push it right away -- if anything, so that we figure out sooner
that it was actually needed in the first place.  Or maybe you just need
the right test cases?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Add more sanity checks around callers of changeDependencyFor()

2023-07-10 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Jul 06, 2023 at 10:09:20AM -0700, Andres Freund wrote:
>> I also don't think pg_dump will dump the changed schema, which means a
>> dump/restore leads to a different schema - IMO something to avoid.

> Yes, you're right here.  The function dumped is restored in the same
> schema as the extension.

Actually, I think the given example demonstrates pilot error rather
than a bug.  The user has altered properties of an extension member
object locally within the database, but has not changed the extension's
installation script to match.  The fact that after restore, the object
does again match the script is intended behavior.  We've made some
exceptions to that rule for permissions, but not anything else.
I don't see a reason to consider the objects' schema assignments
differently from other properties for this purpose.

regards, tom lane




Re: remaining sql/json patches

2023-07-10 Thread Alvaro Herrera
On 2023-Jul-10, Amit Langote wrote:

> > I see that you add json_returning_clause_opt, but we already have
> > json_output_clause_opt.  Shouldn't these two be one and the same?
> > I think the new name is more sensible than the old one, since the
> > governing keyword is RETURNING; I suppose naming it "output" comes from
> the fact that the standard calls this <JSON output clause>.
> 
> One difference between the two is that json_output_clause_opt allows
> specifying the FORMAT clause in addition to the RETURNING type name,
> while json_returning_clause_opt only allows specifying the type name.
> 
> I'm inclined to keep only json_returning_clause_opt as you suggest and
> make parse_expr.c output an error if the FORMAT clause is specified in
> JSON() and JSON_SCALAR(), so turning the current syntax error on
> specifying RETURNING ... FORMAT for these functions into a parsing
> error.   Done that way in the attached updated patch and also updated
> the latter patch that adds JSON_EXISTS() and JSON_VALUE() to have
> similar behavior.

Yeah, that's reasonable.

> > I'm not in love with the fact that JSON and JSONB have pretty much
> > parallel type categorizing functionality. It seems entirely artificial.
> > Maybe this didn't matter when these were contained inside each .c file
> > and nobody else had to deal with that, but I think it's not good to make
> > this an exported concept.  Is it possible to do away with that?  I mean,
> > reduce both to a single categorization enum, and a single categorization
> > API.  Here you have to cast the enum value to int in order to make
> > ExecInitExprRec work, and that seems a bit lame; moreso when the
> > "is_jsonb" is determined separately (cf. ExecEvalJsonConstructor)
> 
> OK, I agree that a unified categorizing API might be better.  I'll
> look at making this better.  Btw, does src/include/common/jsonapi.h
> look like an appropriate place for that?

Hmm, that header is frontend-available, and the type-category appears to
be backend-only, so maybe no.  Perhaps jsonfuncs.h is more apropos?
execExpr.c is already dealing with array internals, so having to deal
with json internals doesn't seem completely out of place.


> > In the 2023 standard, JSON_SCALAR is just
> >
> >  ::= JSON_SCALAR   
> >
> > but we seem to have added a  clause to it.  Should
> > we really?
> 
> Hmm, I am not seeing <JSON output clause> in the rule for JSON_SCALAR,

Agh, yeah, I confused myself, sorry.

> Per what I wrote above, the grammar for JSON() and JSON_SCALAR() does
> not allow specifying the FORMAT clause.  Though considering what you
> wrote, the RETURNING clause does appear to be an extension to the
> standard's spec.

Hmm, I see that <JSON output clause> (which is RETURNING plus optional
FORMAT) appears included in JSON_OBJECT, JSON_ARRAY, JSON_QUERY,
JSON_SERIALIZE, JSON_OBJECTAGG, JSON_ARRAYAGG.  It's not necessarily a
bad thing to have it in other places, but we should consider it
carefully.  Do we really want/need it in JSON() and JSON_SCALAR()?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Reducing connection overhead in pg_upgrade compat check phase

2023-07-10 Thread Daniel Gustafsson
> On 8 Jul 2023, at 23:43, Nathan Bossart  wrote:

Thanks for reviewing!

> Since "script" will be NULL and "db_used" will be false in the first
> iteration of the loop, couldn't we move this stuff to before the loop?

We could.  It's done this way to match how all the other checks are performing
the inner loop for consistency.  I think being consistent is better than micro
optimizing in non-hot codepaths even though it adds some redundancy.

> nitpick: I think the current code has two spaces at the beginning of this
> format string.  Did you mean to remove one of them?

Nice catch, I did not. Fixed.

> Won't "script" always be initialized here?  If I'm following this code
> correctly, I think everything except the fclose() can be removed.

You are right that this check is superfluous.  This is again an artifact of
modelling the code around how the other checks work for consistency.  At least
I think that's a good characteristic of the code.

> I think this is unnecessary since we assign "cur_check" at the beginning of
> every loop iteration.  I see two of these.

Right, this is a pointless leftover from a previous version which used a
while() loop, and I had missed removing them.  Fixed.

> +static int   n_data_types_usage_checks = 7;
> 
> Can we determine this programmatically so that folks don't need to remember
> to update it?

Fair point, I've added a counter loop to the beginning of the check function to
calculate it.

> IMHO it's a little strange that this is initialized to all "true", only
> because I think most other Postgres code does the opposite.

Agreed, but it made for a less contrived codepath in knowing when an error has
been seen already, to avoid duplicate error output, so I think it's worth it.

> Do you think we should rename these functions to something like
> "should_check_for_*"?  They don't actually do the check, they just tell you
> whether you should based on the version.

I've been pondering that too, and did a rename now along with moving them all
to a single place as well as changing the comments to make it clearer.

> In fact, I wonder if we could just add the versions directly to
> data_types_usage_checks so that we don't need the separate hook functions.

We could, but it would be sort of contrived I think, since some check <= and
some == while some check the catversion as well (and new ones may have other
variants).  I think this is the least paint-ourselves-in-a-corner version; if we
feel it's needlessly complicated and no other variants are added, we can revisit
this.

--
Daniel Gustafsson



v6-0001-pg_upgrade-run-all-data-type-checks-per-connectio.patch
Description: Binary data


Re: Changing types of block and chunk sizes in memory contexts

2023-07-10 Thread Melih Mutlu
Hi,

Thanks for your comments.

Tom Lane , 28 Haz 2023 Çar, 13:59 tarihinde şunu yazdı:
>
> David Rowley  writes:
> > Perhaps it's ok to leave the context creation functions with Size
> > typed parameters and then just Assert the passed-in sizes are not
> > larger than 1GB within the context creation function.
>
> Yes, I'm strongly opposed to not using Size/size_t in the mmgr APIs.
> If we go that road, we're going to have a problem when someone
> inevitably wants to pass a larger-than-GB value for some context
> type.

I reverted changes in the context creation functions and only changed
the types in structs.
I believe there are already lines to assert whether the sizes are less
than 1GB, so we should be safe there.

Andres Freund , 29 Haz 2023 Per, 02:34 tarihinde şunu yazdı:
> There are a few other fields that we can get rid of.
>
> - Afaics AllocSet->keeper is unnecessary these days, as it is always allocated
>   together with the context itself. Saves 8 bytes.

This seemed like a safe change, so I removed the keeper field in the
AllocSet and Generation contexts. It saves an additional 8 bytes.

Best,
-- 
Melih Mutlu
Microsoft


v2-0001-Change-memory-context-fields-to-uint32.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-10 Thread Melih Mutlu
Hi,

Hayato Kuroda (Fujitsu) , 6 Tem 2023 Per,
12:47 tarihinde şunu yazdı:
>
> Dear Melih,
>
> > Thanks for the 0003 patch. But it did not work for me. Can you create
> > a subscription successfully with patch 0003 applied?
> > I get the following error: " ERROR:  table copy could not start
> > transaction on publisher: another command is already in progress".
>
> You got the ERROR when all the patches (0001-0005) were applied, right?
> I have focused on 0001 and 0002 only, so I missed something.
> If it was not correct, please attach the logfile and test script what you did.

Yes, I did get an error with all patches applied. But with only 0001
and 0002, your version seems to work and mine does not.
What do you think about combining 0002 and 0003? Or should those stay separate?

> Hi, I did a performance testing for v16 patch set.
> Results show that patches significantly improves the performance in most 
> cases.
>
> # Method
>
> Following tests were done 10 times per condition, and compared by median.
> do_one_test.sh was used for the testing.
>
> 1.  Create tables on publisher
> 2.  Insert initial data on publisher
> 3.  Create tables on subscriber
> 4.  Create a replication slot (mysub_slot) on publisher
> 5.  Create a publication on publisher
> 6.  Create tables on subscriber
> --- timer on ---
> 7.  Create subscription with pre-existing replication slot (mysub_slot)
> 8.  Wait until all srsubstate in pg_subscription_rel becomes 'r'
> --- timer off ---
>

Thanks for taking the time to do testing and sharing the results. This
is also how I've been doing the testing since, but the process was
half scripted, half manual work.
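
(For anyone reproducing this: steps 4 and 7 of the quoted procedure boil down
to something like the following, with the connection string, publication and
slot name as placeholders.)

-- Publisher, step 4: create the slot the subscription will reuse.
SELECT pg_create_logical_replication_slot('mysub_slot', 'pgoutput');

-- Subscriber, step 7: attach to the pre-existing slot instead of creating one.
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (slot_name = 'mysub_slot', create_slot = false);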

> According to the measurement, we can say following things:
>
> * In all cases the performance was improved from HEAD.
> * The improvement became more significant as the number of synced tables
> increased.

Yes, I believe it becomes more significant when workers spend less
time with actually copying data but more with other stuff like
launching workers, opening connections etc.

> * 0003 basically improved performance from first two patches

Agree, 0003 is definitely a good addition which was missing earlier.


Thanks,
-- 
Melih Mutlu
Microsoft




Re: check_strxfrm_bug()

2023-07-10 Thread Peter Eisentraut

On 10.07.23 04:51, Thomas Munro wrote:

On Sun, Jul 9, 2023 at 6:35 PM Thomas Munro  wrote:

On Sun, Jul 9, 2023 at 6:20 PM Peter Eisentraut  wrote:

So I don't think this code is correct.  AFAICT, there is nothing right
now that can possibly define HAVE_MBSTOWCS_L on Windows/MSVC.  Was that
the intention?


Yes, that was my intention.  Windows actually doesn't have them.


Thinking about that some more...  Its _XXX implementations don't deal
with UTF-8 the way Unix-based developers would expect, and are
therefore just portability hazards, aren't they?  What would we gain
by restoring the advertisement that they are available?  Perhaps we
should go the other way completely and remove the relevant #defines
from win32_port.h, and fully confine knowledge of them to pg_locale.c?
  It knows how to deal with that.  Here is a patch trying this idea
out, with a slightly longer explanation.


This looks sensible to me.

If we ever need mbstowcs_l() etc. outside of pg_locale.c, then the 
proper way would be to make a mbstowcs_l.c file in src/port/.


But I like your approach for now because it moves us more firmly into 
the direction of having it contained in pg_locale.c, instead of having 
some knowledge global and some local.






Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-10 Thread Melih Mutlu
Hi,

Amit Kapila , 6 Tem 2023 Per, 06:56 tarihinde
şunu yazdı:
>
> On Wed, Jul 5, 2023 at 1:48 AM Melih Mutlu  wrote:
> >
> > Hayato Kuroda (Fujitsu) , 4 Tem 2023 Sal,
> > 08:42 tarihinde şunu yazdı:
> > > > > But in the later patch the tablesync worker tries to reuse the slot 
> > > > > during the
> > > > > synchronization, so in this case the application_name should be same 
> > > > > as
> > > > slotname.
> > > > >
> > > >
> > > > Fair enough. I am slightly afraid that if we can't show the benefits
> > > > with later patches then we may need to drop them but at this stage I
> > > > feel we need to investigate why those are not helping?
> > >
> > > Agreed. Now I'm planning to do performance testing independently. We can 
> > > discuss
> > > based on that or Melih's one.
> >
> > Here I attached  what I use for performance testing of this patch.
> >
> > I only benchmarked the patch set with reusing connections very roughly
> > so far. But seems like it improves quite significantly. For example,
> > it took 611 ms to sync 100 empty tables, it was 1782 ms without
> > reusing connections.
> > First 3 patches from the set actually bring a good amount of
> > improvement, but not sure about the later patches yet.
> >
>
> I suggest then we should focus first on those 3, get them committed
> and then look at the remaining.
>

That sounds good. I'll do my best to address any review/concern from
reviewers now for the first 3 patches and hopefully those can get
committed first. I'll continue working on the remaining patches later.

-- 
Melih Mutlu
Microsoft




Re: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-07-10 Thread Önder Kalacı
Hi,

I also think so. If this is true, how can we think of supporting
> indexes other than hash like GiST, and SP-GiST as mentioned by you in
> your latest email? As per my understanding if we don't have PK or
> replica identity then after the index scan, we do tuples_equal which
> will fail for GIST or SP-GIST. Am, I missing something?
>

I also don't think we can support anything other than btree, hash and brin,
as the others lack equality operators.

And, for BRIN, Hayato brought up the amgettuple issue, which is fair. So,
overall, as far as I can see, we can easily add hash indexes, but all the
others are either very hard or impossible to add.
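
As a small illustration of the subscriber side (names invented): with the
publisher table on REPLICA IDENTITY FULL, the apply worker can already pick a
suitable btree index, and the proposal here is to let a hash index qualify too,
since hash also provides the equality operator needed to re-check candidate
rows:

-- subscriber side
CREATE INDEX orders_id_btree ON orders USING btree (id);  -- usable today
CREATE INDEX orders_id_hash  ON orders USING hash (id);   -- what this thread would allow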

I think if someone one day works on supporting primary keys using other
index types, we can use it here :p

Thanks,
Onder


Re: Exclusion constraints on partitioned tables

2023-07-10 Thread Peter Eisentraut

On 09.07.23 03:21, Paul A Jungwirth wrote:

It seems to me that many of the test cases added in indexing.sql are
redundant with create_table.sql/alter_table.sql (or vice versa).  Is
there a reason for this?


Yes, there is some overlap. I think that's just because there was
overlap before, and I didn't want to delete the old tests completely.
But since indexing.sql has a fuller list of tests and is a superset of
the others, this new patch removes the redundant tests from
{create,alter}_table.sql.


This looks better.


Btw speaking of tests, I want to make sure this new feature will still
work when you're using btree_gist and `EXCLUDE WITH (myint =,
mytsrange &&)` (and not just `(myint4range =, mytsrange &&)`). Some of
my early attempts writing this patch worked w/o btree_gist but not w/
(or vice versa). But as far as I know there's no way to test that in
regress. I wound up writing a private shell script that just does
this:



Is there some place in the repo to include a test like that? It seems
a little funny to put it in the btree_gist suite, but maybe that's the
right answer.


I'm not sure what value we would get from testing this with btree_gist, 
but if we wanted to do that, then adding a new test file to the 
btree_gist sql/ directory would seem reasonable to me.


(I would make the test a little bit bigger than you had shown, like 
insert a few values.)
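
A sketch of what such a test could look like (names invented; it assumes the
patch accepts btree_gist's equality operator for the partition key column,
which is exactly what the test would verify):

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE meetings (
    room    int,
    during  tsrange,
    EXCLUDE USING gist (room WITH =, during WITH &&)
) PARTITION BY LIST (room);

CREATE TABLE meetings_r1 PARTITION OF meetings FOR VALUES IN (1);
CREATE TABLE meetings_r2 PARTITION OF meetings FOR VALUES IN (2);

INSERT INTO meetings VALUES (1, '[2023-01-01 09:00, 2023-01-01 10:00)');
INSERT INTO meetings VALUES (2, '[2023-01-01 09:00, 2023-01-01 10:00)');  -- ok, different room
INSERT INTO meetings VALUES (1, '[2023-01-01 09:30, 2023-01-01 10:30)');  -- must fail: overlap in room 1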


If you want to do that, please send another patch.  Otherwise, I'm ok to 
commit this one.






Re: doc: clarify the limitation for logical replication when REPILICA IDENTITY is FULL

2023-07-10 Thread Sergei Kornilov
>> Is this restriction only for the subscriber?
>>
>> If we have not changed the replica identity and there is no primary key, 
>> then we forbid update and delete on the publication side (a fairly common 
>> usage error at the beginning of using publications).
>> If we have replica identity FULL (the table has such a column), then on the 
>> subscription side, update and delete will be performed.
> 
> In the above sentence, do you mean the publisher side?

Yep, sorry.

> But we will not be able to apply them on a subscription. Right?
> 
> If your previous sentence talks about the publisher and this sentence
> about the subscriber then what you are saying is correct. You can see
> the example in the email [1].

Thank you

>> This is an important difference for real use, when the subscriber is not 
>> necessarily postgresql - for example, debezium.
> 
> Can you explain the difference and problem you are seeing? As per my
> understanding, this is the behavior from the time logical replication
> has been introduced.

The difference is that if it's a subscriber-only restriction, then it won't 
automatically apply to anyone with a non-PostgreSQL subscriber.
But if it were a limitation of the publisher, then it would automatically apply 
to everyone, regardless of which subscriber is used.
(And it's a completely different problem if the restriction affects the 
update/delete themselves, not only their replication - like the default replica 
identity on a table without a primary key, which is not the case here.)

So, I suggest mentioning the subscriber explicitly:

+ class of Btree, then UPDATE and 
DELETE
-  operations cannot be replicated.
+ operations cannot be applied on subscriber.

Another example of difference:
Debezium users sometimes ask to set identity to FULL to get access to old 
values: https://stackoverflow.com/a/59820210/10983392
However, identity FULL is described in the documentation as: 
https://www.postgresql.org/docs/current/logical-replication-publication.html

> If the table does not have any suitable key, then it can be set to replica 
> identity “full”, which means the entire row becomes the key. This, however, 
> is very inefficient and should only be used as a fallback if no other 
> solution is possible.

But it is not mentioned whether this would only be "very inefficient" for the 
subscriber, or whether it would also have a huge impact on the publisher 
(besides writing more WAL).
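
To spell out the publisher/subscriber split with a minimal sketch (table name
invented):

-- Publisher: table is in a publication that publishes updates, has no primary
-- key and default replica identity -> the UPDATE itself is rejected.
UPDATE t SET val = 1 WHERE id = 1;   -- fails: no replica identity

ALTER TABLE t REPLICA IDENTITY FULL;
UPDATE t SET val = 1 WHERE id = 1;   -- now allowed on the publisher

-- The limitation this doc patch talks about is separate: whether a PostgreSQL
-- subscriber can actually apply that UPDATE/DELETE (it needs usable equality
-- operators to match the old row), which is why "applied on the subscriber"
-- is the clearer wording.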

regards, Sergei




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-10 Thread Alvaro Herrera
Hello,

On 2023-Jul-03, Masahiko Sawada wrote:

> While testing PG16, I observed that in PG16 there is a big performance
> degradation in concurrent COPY into a single relation with 2 - 16
> clients in my environment. I've attached a test script that measures
> the execution time of COPYing 5GB data in total to the single relation
> while changing the number of concurrent insertions, in PG16 and PG15.

This item came up in the RMT meeting.  Andres, I think this item belongs
to you, because of commit 00d1e02be2.

The regression seems serious enough at low client counts:

> * PG15 (4b15868b69)
> PG15: nclients = 1, execution time = 14.181
> PG15: nclients = 2, execution time = 9.319
> PG15: nclients = 4, execution time = 5.872
> PG15: nclients = 8, execution time = 3.773
> PG15: nclients = 16, execution time = 3.202

> * PG16 (c24e9ef330)
> PG16: nclients = 1, execution time = 17.112
> PG16: nclients = 2, execution time = 14.084
> PG16: nclients = 4, execution time = 27.997
> PG16: nclients = 8, execution time = 10.554
> PG16: nclients = 16, execution time = 7.074

So the fact that the speed has clearly gone up at larger client counts
is not an excuse for not getting it fixed, XFS-specificity
notwithstanding.

> The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
> extend tables more efficiently". With commit 1cbbee0338 (the previous
> commit of 00d1e02be2), I got a better numbers, it didn't have a better
> scalability, though:
> 
> PG16: nclients = 1, execution time = 17.444
> PG16: nclients = 2, execution time = 10.690
> PG16: nclients = 4, execution time = 7.010
> PG16: nclients = 8, execution time = 4.282
> PG16: nclients = 16, execution time = 3.373

Well, these numbers are better, but they still look worse than PG15.
I suppose there are other commits that share blame.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)




Re: Things I don't like about \du's "Attributes" column

2023-07-10 Thread Pavel Luzanov

On 23.06.2023 03:50, Tom Lane wrote:

* On a purely presentation level, how did we possibly arrive
at the idea that the connection-limit and valid-until properties
deserve their own lines in the Attributes column while the other
properties are comma-separated?  That makes no sense whatsoever,
nor does it look nice in \x display format.


I think this is a reason why the footer property is explicitly disabled in the output.
As part of the reworking, the footer should be enabled, as it works for other 
meta-commands.


Just so we don't forget.

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com





Re: [PATCH] Infinite loop while acquiring new TOAST Oid

2023-07-10 Thread Aleksander Alekseev
Hi Nikita,

> I've prepared a patch with a 64-bit TOAST Value ID. It is a kind of prototype
> and needs some further work, but it is already working and ready to play with.

Unfortunately the patch rotted a bit. Could you please submit an
updated/rebased patch so that it could be reviewed in the scope of
July commitfest?

-- 
Best regards,
Aleksander Alekseev




Re: doc: clarify the limitation for logical replication when REPILICA IDENTITY is FULL

2023-07-10 Thread Amit Kapila
On Mon, Jul 10, 2023 at 4:33 PM Sergei Kornilov  wrote:
>
> Is this restriction only for the subscriber?
>
> If we have not changed the replica identity and there is no primary key, then 
> we forbid update and delete on the publication side (a fairly common usage 
> error at the beginning of using publications).
> If we have replica identity FULL (the table has such a column), then on the 
> subscription side, update and delete will be performed.
>

In the above sentence, do you mean the publisher side?

>
> But we will not be able to apply them on a subscription. Right?
>

If your previous sentence talks about the publisher and this sentence
about the subscriber then what you are saying is correct. You can see
the example in the email [1].

> This is an important difference for real use, when the subscriber is not 
> necessarily postgresql - for example, debezium.
>

Can you explain the difference and the problem you are seeing? As per my
understanding, this has been the behavior since logical replication was
introduced.

[1] - 
https://www.postgresql.org/message-id/TYAPR01MB5866C7B6086EB74918910F74F527A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Slight improvement of worker_spi.c example

2023-07-10 Thread Daniel Gustafsson
> On 10 Jul 2023, at 14:40, Aleksander Alekseev  
> wrote:

>> Have you had a chance to look at these suggestions, and Juliens reply
>> downthread, in order to produce a new version of the patch?
> 
> Thanks for the reminder. No I haven't. Please feel free marking this
> CF entry as RwF if this will be helpful. We may reopen it if and when
> necessary.

Ok, will do.  Please feel free to resubmit to a future CF when there is a new
version for consideration.

--
Daniel Gustafsson





Re: [PATCH] Slight improvement of worker_spi.c example

2023-07-10 Thread Aleksander Alekseev
Hi,

> Have you had a chance to look at these suggestions, and Juliens reply
> downthread, in order to produce a new version of the patch?

Thanks for the reminder. No, I haven't. Please feel free to mark this
CF entry as RwF if that would be helpful. We may reopen it if and when
necessary.

-- 
Best regards,
Aleksander Alekseev




Re: PATCH: Using BRIN indexes for sorted output

2023-07-10 Thread Matthias van de Meent
On Mon, 10 Jul 2023 at 13:43, Tomas Vondra
 wrote:
> On 7/10/23 12:22, Matthias van de Meent wrote:
>> On Fri, 7 Jul 2023 at 20:26, Tomas Vondra  
>> wrote:
>>> However, it's not quite clear to me is what you mean by the weight- and
>>> compare-operators? That is, what are
>>>
>>>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>>>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>>>  - brin_minmax_compare(order, order) -> int
>>>
>>> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?
>>
>> _minorder/_maxorder is for extracting the minimum/maximum relative
>> order of each range, used for ASC/DESC sorting of operator results
>> (e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
>> PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
>> though I did forget to describe the second operator argument for the
>> distance function. We might also want to use only one such "order
>> extraction function" with DESC/ASC indicated by an argument.
>>
>
> I'm still not sure I understand what "minimum/maximum relative order"
> is. Isn't it the same as returning min/max value that can appear in the
> range? Although, that wouldn't work for points/boxes.

Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
only need to extract the normal min/max values for ASC/DESC sorts,
which are readily available in the summary. But for multi-dimensional
and distance searches (nearest neighbour) we need to calculate the
distance between the indexed value(s) and the origin value to compare
the summary against, and the order would thus be asc/desc on distance
- a distance which may not be precisely represented by float types -
thus 'relative order' with its own order operation.
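
To make that a bit more concrete, here is a minimal sketch of the shape I have
in mind for the plain minmax case (the argument layout and the function name
are made up for illustration, not a concrete proposal):

#include "postgres.h"
#include "access/brin_tuple.h"
#include "fmgr.h"

/*
 * Sketch only: return the datum this range should be ordered by for an
 * ASC sort.  For the plain minmax opclass the summary already stores
 * {min, max}, so the "relative order" is simply the stored minimum; a
 * distance-based opclass would instead compute the smallest possible
 * distance between its summary and the ORDER BY origin here.
 */
Datum
brin_minmax_minorder(PG_FUNCTION_ARGS)
{
    BrinValues *column = (BrinValues *) PG_GETARG_POINTER(0);

    /* assume the caller has already dealt with all-NULL / empty ranges */
    PG_RETURN_DATUM(column->bv_values[0]);
}

A distance-based opclass would return something that only its own compare
procedure knows how to order.
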

>>> In principle we just need a procedure that tells us min/max for a given
>>> page range - I guess that's what the minorder/maxorder functions do? But
>>> why would we need the compare one? We're comparing by the known data
>>> type, so we can just delegate the comparison to that, no?
>>
>> Is there a comparison function for any custom orderable type that we
>> can just use? GIST distance ordering uses floats, and I don't quite
>> like that from a user perspective, as it makes ordering operations
>> imprecise. I'd rather allow (but discourage) any type with its own
>> compare function.
>>
>
> I haven't really thought about geometric types, just about minmax and
> minmax-multi. It's not clear to me what the benefit for these types be.
> I mean, we can probably sort points lexicographically, but is anyone
> doing that in queries? It seems useless for order by distance.

Yes, that's why you would sort them by distance, where the distance is
generated by the opclass as min/max distance between the summary and
the distance's origin, and then inserted into the tuplesort.

(previously)
>>> I finally had time to look at this patch again. There's a bit of bitrot,
>>> so here's a rebased version (no other changes).

It seems like you forgot to attach the rebased patch, so unless you're
actively working on updating the patchset right now, could you send
the rebase to make CFBot happy?


Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: ResourceOwner refactoring

2023-07-10 Thread Peter Eisentraut

A few suggestions on the API:

> +static ResourceOwnerFuncs tupdesc_resowner_funcs =

These aren't all "functions", so maybe another word like "info" or 
"description" would be more appropriate?



> +   .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
> +   .release_priority = RELEASE_PRIO_TUPDESC_REFS,

I suggest assigning explicit non-zero numbers to the RESOURCE_RELEASE_* 
constants, as well as changing RELEASE_PRIO_FIRST to 1 and adding an 
assertion that the priority is not zero.  Otherwise, someone might 
forget to assign these fields and would implicitly get phase 0 and 
priority 0, which would probably work in most cases, but wouldn't be the 
explicit intention.


Then again, you are using RELEASE_PRIO_FIRST for the pgcrypto patch.  Is 
it the recommendation that if there are no other requirements, external 
users should use that?  If we are going to open this up to external 
users, we should probably have some documented guidance around this.
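
To be concrete about the explicit-numbering idea above, roughly this (a sketch only; 
the existing constant and field names are from resowner.h and the patch, the check 
function is something I am making up here):

typedef enum ResourceReleasePhase
{
    RESOURCE_RELEASE_BEFORE_LOCKS = 1,  /* explicitly non-zero */
    RESOURCE_RELEASE_LOCKS = 2,
    RESOURCE_RELEASE_AFTER_LOCKS = 3
} ResourceReleasePhase;

#define RELEASE_PRIO_FIRST  1

static inline void
ResourceOwnerCheckKind(const ResourceOwnerFuncs *kind)
{
    /* catch a designated initializer that forgot to set these fields */
    Assert(kind->release_phase != 0);
    Assert(kind->release_priority != 0);
}

Then ResourceOwnerRemember() (or wherever a kind is first seen) could run the 
check in assert-enabled builds.
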



> +   .PrintLeakWarning = ResOwnerPrintTupleDescLeakWarning

I think this can be refactored further.  We really just need a function 
to describe a resource, like maybe


static char *
ResOwnerDescribeTupleDesc(Datum res)
{
    TupleDesc   tupdesc = (TupleDesc) DatumGetPointer(res);

    return psprintf("TupleDesc %p (%u,%d)",
                    tupdesc, tupdesc->tdtypeid, tupdesc->tdtypmod);
}

and then the common code can generate the warnings from that like

    elog(WARNING,
         "%s leak: %s still referenced",
         kind->name, kind->describe(value));

That way, we could more easily develop further options, such as turning 
this warning into an error (useful for TAP tests).


Also, maybe, you could supply a default description that just prints 
"%p", which appears to be applicable in about half the places.


Possible bonus project: I'm not sure these descriptions are so useful 
anyway.  It doesn't help me much in debugging if "File 55 was not 
released" or some such.  If would be cool if ResourceOwnerRemember() in 
some debug mode could remember the source code location, and then print 
that together with the leak warning.


> +#define ResourceOwnerRememberTupleDesc(owner, tupdesc) \
> +   ResourceOwnerRemember(owner, PointerGetDatum(tupdesc), &tupdesc_resowner_funcs)

> +#define ResourceOwnerForgetTupleDesc(owner, tupdesc) \
> +   ResourceOwnerForget(owner, PointerGetDatum(tupdesc), &tupdesc_resowner_funcs)


I would prefer that these kinds of things be made inline functions, so 
that we maintain the type safety of the previous interface.  And it's 
also easier when reading code to see type decorations.


(But I suspect the above bonus project would require these to be macros?)
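
Concretely, for the simple case I am thinking of something like this (a sketch 
only; it assumes the wrappers live in the same file as the static 
tupdesc_resowner_funcs struct):

static inline void
ResourceOwnerRememberTupleDesc(ResourceOwner owner, TupleDesc tupdesc)
{
    ResourceOwnerRemember(owner, PointerGetDatum(tupdesc),
                          &tupdesc_resowner_funcs);
}

static inline void
ResourceOwnerForgetTupleDesc(ResourceOwner owner, TupleDesc tupdesc)
{
    ResourceOwnerForget(owner, PointerGetDatum(tupdesc),
                        &tupdesc_resowner_funcs);
}

That keeps the compile-time check on the TupleDesc argument. And if the 
source-location idea above turns out to need __FILE__/__LINE__, the wrapper 
could still be a one-line macro that only appends those two arguments and 
forwards to the inline function.
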


> +   if (context->resowner)
> +   ResourceOwnerForgetJIT(context->resowner, context);

It seems most ResourceOwnerForget*() calls have this kind of check 
before them.  Could this be moved inside the function?






Re: HOT readme missing documentation on summarizing index handling

2023-07-10 Thread Aleksander Alekseev
Hi,

> > Thanks, pushed after correcting a couple typos.
>
> Thanks!

I noticed that ec99d6e9c87a introduced a slight typo:

s/if there is not room/if there is no room

-- 
Best regards,
Aleksander Alekseev




Re: [PATCH] Allow Postgres to pick an unused port to listen

2023-07-10 Thread Daniel Gustafsson
> On 11 May 2023, at 13:24, Yurii Rashkovskii  wrote:
> 
> On Thu, May 11, 2023 at 10:36 AM Alvaro Herrera wrote:
> On 2023-May-11, Yurii Rashkovskii wrote:
> 
> > ```
> > 127.0.0.1=5432 ::1=54321
> > ```
> > 
> > Basically, a space-delimited set of address/port pairs (delimited by `=` to
> > allow IPv6 addresses to use a colon).
> 
> This seems a bit too creative.  I'd rather have the IPv6 address in
> square brackets, which clues the parser immediately as to the address
> family and use colons to separate the port number.  If we do go with a
> separate file, which to me sounds easier than cramming it into the PID
> file, then one per line is likely better, if only because line-oriented
> Unix text tooling has an easier time that way.
> 
> Just a general caution here that using square brackets to denote IPv6 
> addresses will make it (unnecessarily?) harder to process this with a shell 
> script.

This patch is Waiting on Author in the current commitfest with no new patch
presented following the discussion here.  Is there an update ready or should we
close it in this CF in favour of a future one?

--
Daniel Gustafsson





Re: POC, WIP: OR-clause support for indexes

2023-07-10 Thread Ranier Vilela
On Mon, Jul 10, 2023 at 09:03, Ranier Vilela wrote:

> Hi Alena,
>
> Em seg., 10 de jul. de 2023 às 05:38, Alena Rybakina <
> lena.riback...@yandex.ru> escreveu:
>
>> I agreed with the changes. Thank you for your work.
>>
>> I updated patch and added you to the authors.
>>
>> I specified Ranier Vilela as a reviewer.
>>
> Is a good habit when post a new version of the patch, name it v1, v2,
> v3,etc.
> Makes it easy to follow development and references on the thread.
>
> Regarding the last patch.
> 1. I think that variable const_is_left is not necessary.
> You can stick with:
> + if (IsA(get_leftop(orqual), Const))
> + nconst_expr =get_rightop(orqual);
> + const_expr = get_leftop(orqual) ;
> + else if (IsA(get_rightop(orqual), Const))
> + nconst_expr =get_leftop(orqual);
> + const_expr = get_rightop(orqual) ;
> + else
> + {
> + or_list = lappend(or_list, orqual);
> + continue;
> + }
>
> 2. Test scalar_type != RECORDOID is more cheaper,
> mainly if OidIsValid were a function, we knows that is a macro.
> + if (scalar_type != RECORDOID && OidIsValid(scalar_type))
>
> 3. Sorry about wrong tip about array_type, but if really necessary,
> better use it.
> + newa->element_typeid = scalar_type;
> + newa->array_typeid = array_type;
>
> 4. Is a good habit, call free last, to avoid somebody accidentally using
> it.
> + or_list = lappend(or_list, gentry->expr);
> + list_free(gentry->consts);
> + continue;
>
> 5. list_make1(makeString((char *) "=")
> Is an invariant?
>
Please never mind point 5. It is not an invariant.

regards,
Ranier Vilela


Re: POC, WIP: OR-clause support for indexes

2023-07-10 Thread Ranier Vilela
Hi Alena,

On Mon, Jul 10, 2023 at 05:38, Alena Rybakina <
lena.riback...@yandex.ru> wrote:

> I agreed with the changes. Thank you for your work.
>
> I updated patch and added you to the authors.
>
> I specified Ranier Vilela as a reviewer.
>
It is a good habit, when posting a new version of a patch, to name it v1, v2,
v3, etc.
It makes it easy to follow the development and the references on the thread.

Regarding the last patch.
1. I think that the variable const_is_left is not necessary.
You can stick with:
+ if (IsA(get_leftop(orqual), Const))
+ {
+     nconst_expr = get_rightop(orqual);
+     const_expr = get_leftop(orqual);
+ }
+ else if (IsA(get_rightop(orqual), Const))
+ {
+     nconst_expr = get_leftop(orqual);
+     const_expr = get_rightop(orqual);
+ }
+ else
+ {
+     or_list = lappend(or_list, orqual);
+     continue;
+ }

2. Testing scalar_type != RECORDOID first is cheaper, especially if
OidIsValid were a function; we know that it is a macro.
+ if (scalar_type != RECORDOID && OidIsValid(scalar_type))

3. Sorry about the wrong tip about array_type, but if it is really necessary,
better to use it.
+ newa->element_typeid = scalar_type;
+ newa->array_typeid = array_type;

4. It is a good habit to call free last, to avoid somebody accidentally using it.
+ or_list = lappend(or_list, gentry->expr);
+ list_free(gentry->consts);
+ continue;

5. list_make1(makeString((char *) "=")
Is an invariant?
We can store in a variable and keep out the loop?

Keep up the good work.

best regards,
Ranier Vilela


Re: PATCH: Using BRIN indexes for sorted output

2023-07-10 Thread Tomas Vondra



On 7/10/23 12:22, Matthias van de Meent wrote:
> On Fri, 7 Jul 2023 at 20:26, Tomas Vondra  
> wrote:
>>
>> Hi,
>>
>> I finally had time to look at this patch again. There's a bit of bitrot,
>> so here's a rebased version (no other changes).
> 
> Thanks!
> 
>> On 2/27/23 16:40, Matthias van de Meent wrote:
>>> Some initial comments on 0004:
>>>
 +/*
 + * brin_minmax_ranges
 + *Load the BRIN ranges and sort them.
 + */
 +Datum
 +brin_minmax_ranges(PG_FUNCTION_ARGS)
 +{
>>>
>>> Like in 0001, this seems to focus on only single columns. Can't we put
>>> the scan and sort infrastructure in brin, and put the weight- and
>>> compare-operators in the opclasses? I.e.
>>> brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
>>> brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
>>> brin_minmax_compare(order, order) -> int? I'm thinking of something
>>> similar to GIST's distance operators, which would make implementing
>>> ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
>>> brin infrastructure.
>>>
>>
>> However, it's not quite clear to me is what you mean by the weight- and
>> compare-operators? That is, what are
>>
>>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>>  - brin_minmax_compare(order, order) -> int
>>
>> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?
> 
> _minorder/_maxorder is for extracting the minimum/maximum relative
> order of each range, used for ASC/DESC sorting of operator results
> (e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
> PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
> though I did forget to describe the second operator argument for the
> distance function. We might also want to use only one such "order
> extraction function" with DESC/ASC indicated by an argument.
> 

I'm still not sure I understand what "minimum/maximum relative order"
is. Isn't it the same as returning min/max value that can appear in the
range? Although, that wouldn't work for points/boxes.

>> In principle we just need a procedure that tells us min/max for a given
>> page range - I guess that's what the minorder/maxorder functions do? But
>> why would we need the compare one? We're comparing by the known data
>> type, so we can just delegate the comparison to that, no?
> 
> Is there a comparison function for any custom orderable type that we
> can just use? GIST distance ordering uses floats, and I don't quite
> like that from a user perspective, as it makes ordering operations
> imprecise. I'd rather allow (but discourage) any type with its own
> compare function.
> 

I haven't really thought about geometric types, just about minmax and
minmax-multi. It's not clear to me what the benefit for these types would be.
I mean, we can probably sort points lexicographically, but is anyone
doing that in queries? It seems useless for order by distance.

>> Also, the existence of these opclass procedures should be enough to
>> identify the opclasses which can support this.
> 
> Agreed.
> 
 +/*
 + * XXX Does it make sense (is it possible) to have a sort by more than one
 + * column, using a BRIN index?
 + */
>>>
>>> Yes, even if only one prefix column is included in the BRIN index
>>> (e.g. `company` in `ORDER BY company, date`, the tuplesort with table
>>> tuples can add additional sorting without first reading the whole
>>> table, potentially (likely) reducing the total resource usage of the
>>> query. That utilizes the same idea as incremental sorts, but with the
>>> caveat that the input sort order is approximately likely but not at
>>> all guaranteed. So, even if the range sort is on a single index
>>> column, we can still do the table's tuplesort on all ORDER BY
>>> attributes, as long as a prefix of ORDER BY columns are included in
>>> the BRIN index.
>>>
>>
>> That's not quite what I meant, though. When I mentioned sorting by more
>> than one column, I meant using a multi-column BRIN index on those
>> columns. Something like this:
>>
>>CREATE TABLE t (a int, b int);
>>INSERT INTO t ...
>>CREATE INDEX ON t USING brin (a,b);
>>
>>SELECT * FROM t ORDER BY a, b;
>>
>> Now, what I think you described is using BRIN to sort by "a", and then
>> do incremental sort for "b". What I had in mind is whether it's possible
>> to use BRIN to sort by "b" too.
>>
>> I was suspecting it might be made to work, but now that I think about it
>> again it probably can't - BRIN pretty much sorts the columns separately,
>> it's not like having an ordering by (a,b) - first by "a", then "b".
> 
> I imagine it would indeed be limited to an extremely small subset of
> cases, and probably not worth the effort to implement in an initial
> version.
> 

OK, let's stick to order by a single column.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Standardize type of variable when extending Buffers

2023-07-10 Thread Ranier Vilela
On Mon, Jul 10, 2023 at 03:27, Kyotaro Horiguchi <
horikyota@gmail.com> wrote:

> At Fri, 7 Jul 2023 11:29:16 -0700, Gurjeet Singh  wrote
> in
> > On Fri, Jul 7, 2023 at 6:12 AM Ranier Vilela 
> wrote:
> > >
> > > Hi,
> > >
> > > This has already been discussed in [1].
> > > But I thought it best to start a new thread.
> > >
> > > The commit 31966b1 introduced the infrastructure to extend
> > > buffers.
> > > But the patch mixed types with int and uint32.
> > > The correct type of the variable counter is uint32.
> > >
> > > Fix by standardizing the int type to uint32.
> > >
> > > patch attached.
> >
> > LGTM.
>
> LGTM, too.
>
Thanks Gurjeet and Kyotaro, for taking a look.


> I don't think it will actually come to play, since I believe we won't
> be expanding a relation by 16TB all at once. Nevertheless, I believe
> keeping things tidy is a good habit to stick to.
>
Yeah, mainly because of copy-and-paste.
Also, the compiler has to promote int to uint32 anyway.
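
Just to illustrate the promotion point (this is not the committed code, only a
generic example):

    uint32      extend_by = 64;     /* e.g. a count of blocks to extend by */

    /* here "i" is implicitly converted to uint32 for every comparison */
    for (int i = 0; i < extend_by; i++)
        ;

    /* with a uint32 counter the comparison needs no conversion at all */
    for (uint32 i = 0; i < extend_by; i++)
        ;
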

regards,
Ranier Vilela


RE: Initial Schema Sync for Logical Replication

2023-07-10 Thread Kumar, Sachin


> From: Amit Kapila 
> On Wed, Jul 5, 2023 at 7:45 AM Masahiko Sawada
>  wrote:
> >
> > On Mon, Jun 19, 2023 at 5:29 PM Peter Smith 
> wrote:
> > >
> > > Hi,
> > >
> > > Below are my review comments for the PoC patch 0001.
> > >
> > > In addition,  the patch needed rebasing, and, after I rebased it
> > > locally in my private environment there were still test failures:
> > > a) The 'make check' tests fail but only in a minor way due to
> > > changes colname
> > > b) the subscription TAP test did not work at all for me -- many errors.
> >
> > Thank you for reviewing the patch.
> >
> > While updating the patch, I realized that the current approach won't
> > work well or at least has the problem with partition tables. If a
> > publication has a partitioned table with publish_via_root = false, the
> > subscriber launches tablesync workers for its partitions so that each
> > tablesync worker copies data of each partition. Similarly, if it has a
> > partition table with publish_via_root = true, the subscriber launches
> > a tablesync worker for the parent table. With the current design,
> > since the tablesync worker is responsible for both schema and data
> > synchronization for the target table, it won't be possible to
> > synchronize both the parent table's schema and partitions' schema.
> >
> 
> I think one possibility to make this design work is that when publish_via_root
> is false, then we assume that subscriber already has parent table and then
> the individual tablesync workers can sync the schema of partitions and their
> data.

Since publish_via_partition_root is false by default, users would have to create
the parent table by themselves,
which I think is not a good user experience.

> And when publish_via_root is true, then the table sync worker is
> responsible to sync parent and child tables along with data. Do you think
> such a mechanism can address the partition table related cases?
> 


Re: doc: clarify the limitation for logical replication when REPILICA IDENTITY is FULL

2023-07-10 Thread Sergei Kornilov
Hello

Is this restriction only for the subscriber?

If we have not changed the replica identity and there is no primary key, then 
we forbid update and delete on the publication side (a fairly common usage 
error at the beginning of using publications).
If we have replica identity FULL (the table has such a column), then on the 
subscription side, update and delete will be performed. But we will not be able 
to apply them on a subscription. Right?

This is an important difference for real use, when the subscriber is not 
necessarily postgresql - for example, debezium.

regards, Sergei




Re: optimize several list functions with SIMD intrinsics

2023-07-10 Thread Heikki Linnakangas

On 21/04/2023 23:33, Nathan Bossart wrote:

On Fri, Apr 21, 2023 at 01:50:34PM +0700, John Naylor wrote:

On Wed, Mar 8, 2023 at 7:25 AM Nathan Bossart 
wrote:

was mostly a fun weekend project, and I don't presently have any concrete
examples of workloads where this might help.


It seems like that should be demonstrated before seriously considering
this, like a profile where the relevant list functions show up.


Agreed.


Grepping for "tlist_member" and "list_delete_ptr", I don't see any 
callers in hot codepaths where this could make a noticeable difference. 
So I've marked this as Returned with Feedback in the commitfest.



I noticed that several of the List functions do simple linear searches that
can be optimized with SIMD intrinsics (as was done for XidInMVCCSnapshot in
37a6e5d).  The following table shows the time spent iterating over a list
of n elements (via list_member_int) one billion times on my x86 laptop.


  n   | head (ms) | patched (ms) 
--+---+--

2 |  3884 | 3001
4 |  5506 | 4092
8 |  6209 | 3026
   16 |  8797 | 4458
   32 | 25051 | 7032
   64 | 37611 |12763
  128 | 61886 |22770
  256 |70 |59885
  512 |209612 |   103378
 1024 |407462 |   189484


I'm surprised to see an improvement with n=2 and n=4. AFAICS, the 
vectorization only kicks in when n >= 8.
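
For anyone following along, I assume the patch builds on the same kind of
primitive as 37a6e5d, i.e. pg_lfind32() from port/pg_lfind.h; a trivial usage
sketch:

#include "postgres.h"
#include "port/pg_lfind.h"

/*
 * pg_lfind32() scans blocks of the array with SIMD registers and handles
 * the remainder with a plain loop, so very small arrays get little or no
 * benefit from the vectorized path itself.
 */
static bool
contains_value(uint32 key, uint32 *values, uint32 nvalues)
{
    return pg_lfind32(key, values, nvalues);
}
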


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: PATCH: Using BRIN indexes for sorted output

2023-07-10 Thread Matthias van de Meent
On Fri, 7 Jul 2023 at 20:26, Tomas Vondra  wrote:
>
> Hi,
>
> I finally had time to look at this patch again. There's a bit of bitrot,
> so here's a rebased version (no other changes).

Thanks!

> On 2/27/23 16:40, Matthias van de Meent wrote:
> > Some initial comments on 0004:
> >
> >> +/*
> >> + * brin_minmax_ranges
> >> + *Load the BRIN ranges and sort them.
> >> + */
> >> +Datum
> >> +brin_minmax_ranges(PG_FUNCTION_ARGS)
> >> +{
> >
> > Like in 0001, this seems to focus on only single columns. Can't we put
> > the scan and sort infrastructure in brin, and put the weight- and
> > compare-operators in the opclasses? I.e.
> > brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
> > brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
> > brin_minmax_compare(order, order) -> int? I'm thinking of something
> > similar to GIST's distance operators, which would make implementing
> > ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
> > brin infrastructure.
> >
>
> However, it's not quite clear to me is what you mean by the weight- and
> compare-operators? That is, what are
>
>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>  - brin_minmax_compare(order, order) -> int
>
> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?

_minorder/_maxorder is for extracting the minimum/maximum relative
order of each range, used for ASC/DESC sorting of operator results
(e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
though I did forget to describe the second operator argument for the
distance function. We might also want to use only one such "order
extraction function" with DESC/ASC indicated by an argument.

> In principle we just need a procedure that tells us min/max for a given
> page range - I guess that's what the minorder/maxorder functions do? But
> why would we need the compare one? We're comparing by the known data
> type, so we can just delegate the comparison to that, no?

Is there a comparison function for any custom orderable type that we
can just use? GIST distance ordering uses floats, and I don't quite
like that from a user perspective, as it makes ordering operations
imprecise. I'd rather allow (but discourage) any type with its own
compare function.

> Also, the existence of these opclass procedures should be enough to
> identify the opclasses which can support this.

Agreed.

> >> +/*
> >> + * XXX Does it make sense (is it possible) to have a sort by more than one
> >> + * column, using a BRIN index?
> >> + */
> >
> > Yes, even if only one prefix column is included in the BRIN index
> > (e.g. `company` in `ORDER BY company, date`, the tuplesort with table
> > tuples can add additional sorting without first reading the whole
> > table, potentially (likely) reducing the total resource usage of the
> > query. That utilizes the same idea as incremental sorts, but with the
> > caveat that the input sort order is approximately likely but not at
> > all guaranteed. So, even if the range sort is on a single index
> > column, we can still do the table's tuplesort on all ORDER BY
> > attributes, as long as a prefix of ORDER BY columns are included in
> > the BRIN index.
> >
>
> That's not quite what I meant, though. When I mentioned sorting by more
> than one column, I meant using a multi-column BRIN index on those
> columns. Something like this:
>
>CREATE TABLE t (a int, b int);
>INSERT INTO t ...
>CREATE INDEX ON t USING brin (a,b);
>
>SELECT * FROM t ORDER BY a, b;
>
> Now, what I think you described is using BRIN to sort by "a", and then
> do incremental sort for "b". What I had in mind is whether it's possible
> to use BRIN to sort by "b" too.
>
> I was suspecting it might be made to work, but now that I think about it
> again it probably can't - BRIN pretty much sorts the columns separately,
> it's not like having an ordering by (a,b) - first by "a", then "b".

I imagine it would indeed be limited to an extremely small subset of
cases, and probably not worth the effort to implement in an initial
version.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: doc: clarify the limitation for logical replication when REPILICA IDENTITY is FULL

2023-07-10 Thread Amit Kapila
On Mon, Jul 10, 2023 at 2:33 PM Hayato Kuroda (Fujitsu)
 wrote:
>

 If the published table specifies
+ REPLICA IDENTITY FULL
+ but the table includes an attribute whose datatype is not an operator
+ class of Btree,

Isn't the same true for the hash operator class family as well? Can we
slightly change the line to: "... the table includes an attribute
whose datatype doesn't have an equality operator defined for it ...".
Also, I find the proposed wording a bit odd; can we rework the sentence
to say something like: "The UPDATE and DELETE operations cannot be
replicated for published tables that specify REPLICA IDENTITY FULL
but include an attribute whose datatype doesn't have an equality
operator defined for it on the subscriber."?

-- 
With Regards,
Amit Kapila.




Re: Request for comment on setting binary format output per session

2023-07-10 Thread Daniel Gustafsson
> On 25 Apr 2023, at 16:47, Dave Cramer  wrote:

> Patch attached with comments removed

This patch no longer applies; please submit a rebased version on top of HEAD.

--
Daniel Gustafsson





Re: [Patch] Use *other* indexes on the subscriber when REPLICA IDENTITY is FULL

2023-07-10 Thread Amit Kapila
On Mon, Jun 26, 2023 at 7:14 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Thank you for giving comments! The author's comment is quite helpful for us.
>
> >
> When I last dealt with the same issue, I was examining it from a slightly 
> broader perspective. I think
> my conclusion was that RelationFindReplTupleByIndex() is designed for the 
> constraints of UNIQUE INDEX
> and Primary Key.
> >
>
> I see. IIUC that's why you have added tuples_equal() in the 
> RelationFindReplTupleByIndex().
>
> >
> I think we should also be mindful about tuples_equal() function. When an 
> index returns more than
> one tuple, we rely on tuples_equal() function to make sure non-relevant 
> tuples are skipped.
>
> For btree indexes, it was safe to rely on that function as the columns that 
> are indexed using btree
> always have equality operator. I think we can safely assume the same for hash 
> indexes.
>
> However, say we indexed "point" type using "gist" index. Then, if we let this 
> logic to kick in,
> I think tuples_equal() would fail saying that there is no equality operator 
> exists.
> >
>
> Good point. Actually I have tested for "point" datatype but it have not 
> worked well.
> Now I understand the reason.
> It seemed that when TYPECACHE_EQ_OPR_FINFO is reuqesed to lookup_type_cache(),
> it could return valid value only if the datatype has operator class for Btree 
> or Hash.
> So tuples_equal() might not be able to   use if tuples have point box circle 
> types.
>

I also think so. If this is true, how can we think of supporting
indexes other than hash, such as GiST and SP-GiST, as you mentioned in
your latest email? As per my understanding, if we don't have a PK or
replica identity, then after the index scan we do tuples_equal(), which
will fail for GiST or SP-GiST. Am I missing something?
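
To spell out my understanding with a sketch (illustrative only, not code from
any patch): tuples_equal() can only work when a lookup like the following
finds a valid equality operator, and for types whose only operator classes
are GiST or SP-GiST it won't:

#include "postgres.h"
#include "utils/typcache.h"

/*
 * Sketch: does this type have an equality operator that tuples_equal()
 * could use?  lookup_type_cache() resolves it via the type's default
 * btree (or hash) operator class, so e.g. "point" comes back without one.
 */
static bool
type_has_equality_op(Oid typid)
{
    TypeCacheEntry *typentry;

    typentry = lookup_type_cache(typid, TYPECACHE_EQ_OPR_FINFO);

    return OidIsValid(typentry->eq_opr_finfo.fn_oid);
}
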

-- 
With Regards,
Amit Kapila.



