Re: Making empty Bitmapsets always be NULL

2023-06-23 Thread David Rowley
On Thu, 22 Jun 2023 at 20:59, Yuya Watari  wrote:
> Table 1: Planning time and its speedup of Join Order Benchmark
> (n: the number of partitions of each table)
> (Speedup: higher is better)

>   64 |   115.7%
>  128 |   142.9%
>  256 |   187.7%

Thanks for benchmarking. It certainly looks like a win for larger
sets.  Would you be able to profile the 256 partition case to see
where exactly master is so slow? (I'm surprised this patch improves
performance that much.)

I think it's also important to check we don't slow anything down for
more normal-sized sets.  The vast majority of sets will contain just a
single word, so we should probably focus on making sure we're not
slowing anything down for those.

To get the ball rolling on that I used the attached plan_times.patch
so that the planner logs the number of nanoseconds elapsed in each
call to the planner. Patching with this and then running make
installcheck kicks out about 35k log lines with times on them.

I ran this on a Linux AMD 3990x machine and also an Apple M2 pro
machine. Taking the sum of the nanoseconds and converting into
seconds, I see:

AMD 3990x
master: 1.384267931 seconds
patched: 1.339178764 seconds (3.37% faster)

M2 pro:
master: 0.58293 seconds
patched: 0.581483 seconds (0.25% faster)

So it certainly does not look any slower. Perhaps a little faster with
the zen2 machine.

(The m2 only seems to have microsecond resolution on the timer code
whereas the zen2 has nanosecond. I don't think this matters much as
the planner takes enough microseconds to plan even for simple queries)

I've also attached the v4 patch again as I'll add this patch to the
commitfest and if I don't do that then the CFbot will pick up Ranier's
patch instead of mine.

David
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1e4dd27dba..3c713782f1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -17,6 +17,7 @@
 
 #include <limits.h>
 #include <math.h>
+#include <time.h>
 
 #include "access/genam.h"
 #include "access/htup_details.h"
@@ -274,11 +275,22 @@ planner(Query *parse, const char *query_string, int cursorOptions,
 		  ParamListInfo boundParams)
 {
 	PlannedStmt *result;
+	struct timespec start, end;
+
+	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
 
 	if (planner_hook)
 		result = (*planner_hook) (parse, query_string, cursorOptions,
 								  boundParams);
 	else
 		result = standard_planner(parse, query_string, cursorOptions,
 								  boundParams);
+
+	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
+
+	elog(LOG,
+		 "planning time in %f nanoseconds",
+		 ((double) (end.tv_sec * 1000000000 + end.tv_nsec) -
+		  (double) (start.tv_sec * 1000000000 + start.tv_nsec)));
+
 	return result;
 }
 
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 7ba3cf635b..9cda3b1cc1 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -5,8 +5,16 @@
  *
  * A bitmap set can represent any set of nonnegative integers, although
  * it is mainly intended for sets where the maximum value is not large,
- * say at most a few hundred.  By convention, we always represent the
- * empty set by a NULL pointer.
+ * say at most a few hundred.  By convention, we always represent a set with
+ * the minimum possible number of words, i.e, there are never any trailing
+ * zero words.  Enforcing this requires that an empty set is represented as
+ * NULL.  Because an empty Bitmapset is represented as NULL, a non-NULL
+ * Bitmapset always has at least 1 Bitmapword.  We can exploit this fact to
+ * speedup various loops over the Bitmapset's words array by using "do while"
+ * loops instead of "for" loops.  This means the code does not waste time
+ * checking the loop condition before the first iteration.  For Bitmapsets
+ * containing only a single word (likely the majority of them) this reduces
+ * the loop condition tests by half.
  *
  *
  * Copyright (c) 2003-2023, PostgreSQL Global Development Group
@@ -64,8 +72,6 @@
 #error "invalid BITS_PER_BITMAPWORD"
 #endif
 
-static bool bms_is_empty_internal(const Bitmapset *a);
-
 
 /*
  * bms_copy - make a palloc'd copy of a bitmapset
@@ -85,18 +91,11 @@ bms_copy(const Bitmapset *a)
 }
 
 /*
- * bms_equal - are two bitmapsets equal?
- *
- * This is logical not physical equality; in particular, a NULL pointer will
- * be reported as equal to a palloc'd value containing no members.
+ * bms_equal - are two bitmapsets equal? or both NULL?
  */
 bool
 bms_equal(const Bitmapset *a, const Bitmapset *b)
 {
-   const Bitmapset *shorter;
-   const Bitmapset *longer;
-   int shortlen;
-   int longlen;
int i;
 
/* Handle cases where either input is NULL */
@@ -108,30 +107,19 @@ bms_equal(const Bitmapset *a, const Bitmapset *b)

Re: vac_truncate_clog()'s bogus check leads to bogusness

2023-06-23 Thread Andres Freund
Hi,

On 2023-06-23 18:41:58 -0700, Andres Freund wrote:
> I guess this might be caused by 78db307bb23 adding the check, but using
> GetOldestXmin(NULL, true) to determine lastSaneFrozenXid. That was changed
> soon after, in 87f830e0ce03.

FWIW, the discussion leading up to 87f830e0ce03 is
https://postgr.es/m/4182.1405961004%40sss.pgh.pa.us

Greetings,

Andres Freund




Re: vac_truncate_clog()'s bogus check leads to bogusness

2023-06-23 Thread Andres Freund
Hi,

On 2023-06-21 21:50:39 -0700, Noah Misch wrote:
> On Wed, Jun 21, 2023 at 05:46:37PM -0700, Andres Freund wrote:
> > A related issue is that as far as I can tell the determination of what is
> > bogus is bogus.
> > 
> > The relevant cutoffs are determined vac_update_datfrozenxid() using:
> > 
> > /*
> >  * Identify the latest relfrozenxid and relminmxid values that we could
> >  * validly see during the scan.  These are conservative values, but it's
> >  * not really worth trying to be more exact.
> >  */
> > lastSaneFrozenXid = ReadNextTransactionId();
> > lastSaneMinMulti = ReadNextMultiXactId();
> > 
> > but doing checks based on those is bogus, because:
> > 
> > a) a concurrent create table / truncate / vacuum can update
> >pg_class.relfrozenxid of some relation in the current database to a newer
> >value, after lastSaneFrozenXid already has been determined. If that
> >happens, we skip updating pg_database.datfrozenxid.
> > 
> > b) A concurrent vacuum in another database, ending up in 
> > vac_truncate_clog(),
> >can compute a newer datfrozenxid. In that case the vac_truncate_clog() 
> > with
> >the outdated lastSaneFrozenXid will not truncate the clog (and also 
> > forget
> >to release WrapLimitsVacuumLock currently, as reported upthread) and not
> >call SetTransactionIdLimit(). The latter is particularly bad, because 
> > that
> >means we might not come out of "database is not accepting commands" land.
> 
> > I guess we could just recompute the boundaries before actually believing the
> > catalog values are bogus?
> 
> That's how I'd do it.

I was looking at doing that and got confused by the current code. Am I missing
something, or does vac_truncate_clog() have two pretty much identical attempts
at a safety measure?

void
vac_update_datfrozenxid(void)
...
lastSaneFrozenXid = ReadNextTransactionId();
...
vac_truncate_clog(newFrozenXid, newMinMulti,
  lastSaneFrozenXid, 
lastSaneMinMulti);
}
...
static void
vac_truncate_clog(TransactionId frozenXID,
  MultiXactId minMulti,
  TransactionId lastSaneFrozenXid,
  MultiXactId lastSaneMinMulti)
{
TransactionId nextXID = ReadNextTransactionId();
...
/*
 * If things are working properly, no database should have a
 * datfrozenxid or datminmxid that is "in the future".  However, such
 * cases have been known to arise due to bugs in pg_upgrade.  If we
 * see any entries that are "in the future", chicken out and don't do
 * anything.  This ensures we won't truncate clog before those
 * databases have been scanned and cleaned up.  (We will issue the
 * "already wrapped" warning if appropriate, though.)
 */
if (TransactionIdPrecedes(lastSaneFrozenXid, datfrozenxid) ||
MultiXactIdPrecedes(lastSaneMinMulti, datminmxid))
bogus = true;

if (TransactionIdPrecedes(nextXID, datfrozenxid))
frozenAlreadyWrapped = true;

lastSaneFrozenXid is a slightly older version of ReadNextTransactionId(),
that's the only difference afaict.


I guess this might be caused by 78db307bb23 adding the check, but using
GetOldestXmin(NULL, true) to determine lastSaneFrozenXid. That was changed
soon after, in 87f830e0ce03.


Am I missing something?


Greetings,

Andres Freund




Re: psql: Add role's membership options to the \du+ command

2023-06-23 Thread David G. Johnston
On Fri, Jun 23, 2023 at 5:12 PM Tom Lane  wrote:

> "David G. Johnston"  writes:
> > On Thu, Jun 22, 2023 at 5:08 PM Tom Lane  wrote:
> >> * Personally I could do without the "empty" business, but that seems
> >> unnecessary in the tabular format; an empty column will serve fine.
>
> > I disagree, but not strongly.
>
> > I kinda expected you to be on the side of "why are we discussing a
> > situation that should just be prohibited" though.
>
> I haven't formed an opinion yet on whether it should be prohibited.
> But even if we do that going forward, won't psql need to deal with
> such cases when examining old servers?
>
>
I haven't given enough thought to that.  My first reaction is that using
blank for old servers would be desirable and then, if allowed in v16+
server, "empty" for those.

That said, the entire grantor premise that motivated this doesn't exist on
those servers so maybe \drg just shouldn't work against pre-v16 servers -
and we keep the existing \du query as-is for those as well while removing
the "member of" column when \du is executed against a v16+ server.

David J.


Re: Making empty Bitmapsets always be NULL

2023-06-23 Thread David Rowley
On Sat, 24 Jun 2023 at 07:43, Ranier Vilela  wrote:
> I worked a bit more on the v4 version and made a new v6 version, with some 
> changes.

> I can see some improvement, would you mind testing v6 and reporting back?

Please don't bother. I've already mentioned that I'm not going to
consider any changes here which are unrelated to changing the rule
that Bitmapsets no longer can have trailing zero words. I've already
said in [1] that if you have unrelated changes that you wish to pursue
in regards to Bitmapset, then please do so on another thread.

Also, FWIW, from glancing over it, your v6 patch introduces a bunch of
out-of-bounds memory access bugs and a few things are less efficient
than I'd made them. The number of bytes you're zeroing using memset in
bms_add_member() and bms_add_range() is wrong.  bms_del_member() now
needlessly rechecks if a->words[wordnum] is 0. We already know it is 0
from the above check. You may have misunderstood the point of swapping
for loops for do/while loops? They're meant to save the needless loop
bounds check on the initial loop due to the knowledge that the
Bitmapset contains at least 1 word.
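
To make that point concrete, here is a small self-contained sketch
(simplified stand-in types and names, plus GCC's __builtin_popcountl,
used purely for illustration; this is not the actual bitmapset.c code)
contrasting the two loop shapes. The do/while form is only valid
because, with this patch, a non-NULL Bitmapset always holds at least
one word:

typedef unsigned long bitmapword;

typedef struct Bitmapset
{
	int			nwords;			/* always >= 1 for a non-NULL set */
	bitmapword	words[];		/* variable-length array of nwords words */
} Bitmapset;

/* "for" version: pays a loop-condition test before the first iteration */
int
popcount_for(const Bitmapset *a)
{
	int			count = 0;

	if (a == NULL)				/* the empty set is always NULL now */
		return 0;
	for (int wordnum = 0; wordnum < a->nwords; wordnum++)
		count += __builtin_popcountl(a->words[wordnum]);
	return count;
}

/* "do while" version: skips that test; safe only because nwords >= 1 */
int
popcount_do_while(const Bitmapset *a)
{
	int			count = 0;
	int			wordnum = 0;

	if (a == NULL)
		return 0;
	do
	{
		count += __builtin_popcountl(a->words[wordnum]);
	} while (++wordnum < a->nwords);
	return count;
}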

Additionally, it looks like you've made various places that loop over
the set and check for the "lastnonzero" less efficiently by adding an
additional surplus check.  Depending on the CPU architecture, looping
backwards over arrays can be less efficient due to lack of hardware
prefetching when accessing memory in reverse order. It's not clear to
me why you think looping backwards is faster.  I've no desire to
introduce code that needlessly performs more slowly depending on the
ability of the hardware prefetcher on the CPU architecture PostgreSQL
is running on.

Also, if you're going to post benchmark results, they're not very
meaningful unless you can demonstrate what you actually tested. You've
mentioned nothing here to say what query-b.sql contains.

David

[1] 
https://postgr.es/m/caaphdvo65dxfzcgjz7pvxs75vut+1-wsap_kvefwgsns2y2...@mail.gmail.com




Re: Stampede of the JIT compilers

2023-06-23 Thread David Rowley
On Sat, 24 Jun 2023 at 02:28, James Coleman  wrote:
> There are a couple of issues here. I'm sure it's been discussed
> before, and it's not the point of my thread, but I can't help but note
> that the default value of jit_above_cost of 100000 seems absurdly low.
> On good hardware like we have even well-planned queries with costs
> well above that won't be taking as long as JIT compilation does.

It would be good to know your evidence for thinking it's too low.

The main problem I see with it is that the costing does not account
for how many expressions will be compiled. It's quite different to
compile JIT expressions for a query to a single table with a simple
WHERE clause vs some query with many joins which scans a partitioned
table with 1000 partitions, for example.

> But on the topic of the thread: I'd like to know if anyone has ever
> considered implementing a GUC/feature like
> "max_concurrent_jit_compilations" to cap the number of backends that
> may be compiling a query at any given point so that we avoid an
> optimization from running amok and consuming all of a server's
> resources?

Why does the number of backends matter?  JIT compilation consumes the
same CPU resources that the JIT compilation is meant to save.  If the
JIT compilation in your query happened to be a net win rather than a
net loss in terms of CPU usage, then why would
max_concurrent_jit_compilations be useful? It would just restrict us
on what we could save. This idea just covers up the fact that the JIT
costing is disconnected from reality.  It's a bit like trying to tune
your radio with the volume control.

I think the JIT costs would be better taking into account how useful
each expression will be to JIT compile.  There were some ideas thrown
around in [1].

David

[1] 
https://www.postgresql.org/message-id/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com




Re: [PATCH] pg_regress.c: Fix "make check" on Mac OS X: Pass DYLD_LIBRARY_PATH

2023-06-23 Thread Andres Freund
Hi,

On 2023-06-05 22:33:16 -0400, Tom Lane wrote:
> Julien Rouhaud  writes:
> > On Mon, Jun 05, 2023 at 09:47:30AM -0400, Evan Jones wrote:
> >> This makes "make check" work on Mac OS X. Without this patch, on Mac OS X a
> >> default "./configure; make; make check" fails with errors like:
> >> ...
> >> The reason is that at some point, Mac OS X started removing the
> >> DYLD_LIBRARY_PATH environment variable for "untrusted" executables [1]:
> 
> > Note that this is a known issue
> 
> Yeah.  We have attempted to work around this before, but failed to find
> a solution without more downsides than upsides.  I will be interested
> to look at this patch, but lack time for it right now.  Anybody else?

FWIW, I have a patch, which I posted originally as part of the meson thread,
that makes the meson build work correctly even with SIP enabled. The trick is
basically to change the absolute references to libraries to relative ones.

Except for a small amount of complexity during install, I don't think this has
a whole lot of downsides. Making the install relocatable imo is pretty nice.

I guess I should repost that for 17...

Greetings,

Andres Freund




Re: psql: Add role's membership options to the \du+ command

2023-06-23 Thread Tom Lane
"David G. Johnston"  writes:
> On Thu, Jun 22, 2023 at 5:08 PM Tom Lane  wrote:
>> * Personally I could do without the "empty" business, but that seems
>> unnecessary in the tabular format; an empty column will serve fine.

> I disagree, but not strongly.

> I kinda expected you to be on the side of "why are we discussing a
> situation that should just be prohibited" though.

I haven't formed an opinion yet on whether it should be prohibited.
But even if we do that going forward, won't psql need to deal with
such cases when examining old servers?

regards, tom lane




Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

2023-06-23 Thread Tom Lane
Tomas Vondra  writes:
> The problem is that the selectivity for "IS NULL" is estimated using the
> table-level statistics. But the LEFT JOIN entirely breaks the idea that
> the null_frac has anything to do with NULLs in the join result.

Right.

> I wonder how to improve this, say by adjusting the IS NULL selectivity
> when we know to operate on the outer side of the join. We're able to
> do this for antijoins, so maybe we could do that here, somehow?

This mess is part of the long-term plan around the work I've been doing
on outer-join-aware Vars.  We now have infrastructure that can let
the estimator routines see "oh, this Var isn't directly from a scan
of its table, it's been passed through a potentially-nulling outer
join --- and I can see which one".  I don't have more than vague ideas
about what happens next, but that is clearly an essential step on the
road to doing better.

regards, tom lane




Re: sslinfo extension - add notbefore and notafter timestamps

2023-06-23 Thread Cary Huang
 > Yes, please add it to the July commitfest and feel free to set me as 
 > Reviewer,
 > I intend to take a look at it.

Thank you Daniel, I have added this patch to the July commitfest under the
security category and added you as reviewer.

best regards

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca





Re: psql: Add role's membership options to the \du+ command

2023-06-23 Thread Jonathan S. Katz

On 6/23/23 12:16 PM, Tom Lane wrote:

"David G. Johnston"  writes:

On Thu, Jun 22, 2023 at 5:08 PM Tom Lane  wrote:

* I agree that the "tabular" format looks nicer and has fewer i18n
issues than the other proposals.



As you are on board with a separate command please clarify whether you mean
the tabular format but still with newlines, one row per grantee, or the
table with one row per grantor-grantee pair.


I'd lean towards a straight table with a row per grantee/grantor.
I tend to think that faking table layout with some newlines is
a poor idea.  I'm not dead set on that approach though.


[Personal hat]

Generally, I find the tabular view w/o newlines is easier to read, and 
makes it simpler to join to other data (though that may not be 
applicable here).


Again, I'm not the target user of this feature (until I need to use it), 
so my opinion comes with a few grains of salt.


Thanks,

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Re: psql: Add role's membership options to the \du+ command

2023-06-23 Thread Jonathan S. Katz

On 6/23/23 11:52 AM, David G. Johnston wrote:
On Thu, Jun 22, 2023 at 5:08 PM Tom Lane wrote:


"Jonathan S. Katz" mailto:jk...@postgresql.org>> writes:
 > On 6/15/23 2:47 PM, David G. Johnston wrote:
 >> Robert - can you please comment on what you are willing to
commit in
 >> order to close out your open item here.  My take is that the
design for
 >> this, the tabular form a couple of emails ago (copied here), is
 >> ready-to-commit, just needing the actual (trivial) code changes
to be
 >> made to accomplish it.

 > Can we resolve this before Beta 2?[1] The RMT originally advised
to try
 > to resolve before Beta 1[2], and this seems to be lingering.

At this point I kinda doubt that we can get this done before beta2
either, but I'll put in my two cents anyway:


[RMT Hat]

Well, the probability of completing this before the beta 2 freeze is 
effectively zero now. This is a bit disappointing as there was ample 
time since the first RMT nudge on the issue. But let's move forward and 
resolve it before Beta 3.



* I agree that the "tabular" format looks nicer and has fewer i18n
issues than the other proposals.

As you are on board with a separate command please clarify whether you 
mean the tabular format but still with newlines, one row per grantee, or 
the table with one row per grantor-grantee pair.


I still like using newlines here even in the separate meta-command.


(I'll save for the downthread comment).



* Personally I could do without the "empty" business, but that seems
unnecessary in the tabular format; an empty column will serve fine.


I disagree, but not strongly.

I kinda expected you to be on the side of "why are we discussing a 
situation that should just be prohibited" though.


[Personal hat]

I'm still not a fan of "empty" but perhaps the formatting around the 
"separate command" will help drive a conclusion on this.




* I also agree with Pavel's comment that we'd be better off taking
this out of \du altogether and inventing a separate \d command.
Maybe "\drg" for "display role grants"?

Just to be clear, the open item fix proposal is to remove the presently 
broken (due to it showing duplicates without any context) "member of" 
array in \du and make a simple table report output in \drg instead.


I'm good with \drg as a new meta-command.


[Personal hat]

+1 for a new command. The proposal above seems reasonable.

Thanks,

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH] pg_regress.c: Fix "make check" on Mac OS X: Pass DYLD_LIBRARY_PATH

2023-06-23 Thread Peter Eisentraut

On 22.06.23 21:08, David Zhang wrote:
Currently, there is a description suggesting a workaround by running a 
'make install' command first, but I find it to be somewhat inaccurate. 
It would be better to update the existing description to provide more 
precise instructions on how to overcome this issue. Here are the changes 
I would suggest.


from:
"You can work around that by doing make install before make check. Most 
PostgreSQL developers just turn off SIP, though."


to:
"You can execute sudo make install if you do not specify a prefix during 
the configure step, or make install without sudo if you do specify a 
prefix (assuming proper permissions) before make check. Most PostgreSQL 
developers just turn off SIP, though."


Otherwise, following the current description, if you run `./configure && 
make install` you will get error: "mkdir: /usr/local/pgsql: Permission 
denied"


I think you should interpret "doing make install" as "running make 
install or a similar command as described earlier in this chapter". 
Note also that the installation instructions don't use "sudo" anywhere 
right now, so throwing it in at this point would be weird.



echo "# +++ tap check in src/test/modules/brin +++"
... ...
# +++ tap check in src/test/modules/brin +++
t/01_workitems.pl  Bailout called.  Further testing stopped: 
command "initdb -D 
/Users/david/hg/sandbox/postgres/src/test/modules/brin/tmp_check/t_01_workitems_tango_data/pgdata -A trust -N" died with signal 6

t/01_workitems.pl  Dubious, test returned 255 (wstat 65280, 0xff00)
No subtests run


As I mentioned earlier, you would need to find all uses of system() in 
the PostgreSQL source code and make your adjustments there.  IIRC, the 
TAP tests require pg_ctl, so maybe look there.





Re: Add GUC to tune glibc's malloc implementation.

2023-06-23 Thread Peter Eisentraut

On 22.06.23 15:35, Ronan Dunklau wrote:

The thing is, by default, those parameters are adjusted dynamically by the
glibc itself. It starts with quite small thresholds, and raises them when the
program frees some memory, up to a certain limit. This patch proposes a new
GUC allowing the user to adjust those settings according to their workload.

This can cause problems. Let's take for example a table with 10k rows, and 32
columns (as defined by a bench script David Rowley shared last year when
discussing the GenerationContext for tuplesort), and execute the following
query, with 32MB of work_mem:


I don't follow what you are trying to achieve with this.  The examples 
you show appear to work sensibly in my mind.  Using this setting, you 
can save some of the adjustments that glibc does after the first query. 
But that seems only useful if your session only does one query.  Is that 
what you are doing?
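
For readers unfamiliar with the knobs under discussion: the thresholds
glibc adjusts dynamically can be pinned through mallopt(3), which is
presumably what such a GUC would do under the hood. A minimal
stand-alone sketch (hypothetical function name and placeholder values,
not the proposed patch) might look like this:

#include <malloc.h>				/* glibc-specific: mallopt() and M_* knobs */
#include <stdlib.h>

/*
 * Hypothetical illustration only: once a threshold is set explicitly,
 * glibc stops adjusting it dynamically.  Values are placeholders.
 */
static void
tune_glibc_malloc(long threshold_bytes)
{
	/* keep up to threshold_bytes of free memory before trimming the heap */
	mallopt(M_TRIM_THRESHOLD, (int) threshold_bytes);

	/* serve allocations below threshold_bytes from the heap, not mmap() */
	mallopt(M_MMAP_THRESHOLD, (int) threshold_bytes);
}

int
main(void)
{
	tune_glibc_malloc(32L * 1024 * 1024);	/* e.g. 32MB, matching work_mem */

	void	   *p = malloc(16 * 1024 * 1024);

	free(p);		/* the freed memory is now retained rather than trimmed */
	return 0;
}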






Re: Remove deprecation warnings when compiling PG ~13 with OpenSSL 3.0~

2023-06-23 Thread Peter Eisentraut

On 23.06.23 00:22, Michael Paquier wrote:

Also, note that the documentation claims that the minimum version of
OpenSSL supported is 0.9.8, which is something that commit 9b7cd59 has
done, impacting Postgres 10~.  So your argument looks incorrect to me?


Considering that, yes.




Re: sslinfo extension - add notbefore and notafter timestamps

2023-06-23 Thread Daniel Gustafsson
> On 23 Jun 2023, at 22:10, Cary Huang  wrote:

> would this feature be suitable to be added to commitfest? What do you think?

Yes, please add it to the July commitfest and feel free to set me as Reviewer,
I intend to take a look at it.

--
Daniel Gustafsson





Re: sslinfo extension - add notbefore and notafter timestamps

2023-06-23 Thread Cary Huang

 > Off the cuff that doesn't seem like a bad idea, but I wonder if we should add
 > them to pg_stat_ssl (or both) instead if we deem them valuable?

I think the same information should be available to pg_stat_ssl as well.  
pg_stat_ssl can show the client certificate information for all connecting 
clients, so having it show the not_before and not_after timestamps of every
client is important in my opinion. The attached patch
"v2-0002-pg-stat-ssl-add-notbefore-and-notafter-timestamps.patch" adds this
support.
 
 > Re the patch, it would be nice to move the logic in ssl_client_get_notafter 
 > and
 > the _notbefore counterpart to a static function since they are copies of
 > each other.

Yes agreed. I have made the adjustment attached as 
"v2-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch"

would this feature be suitable to be added to commitfest? What do you think?

thank you

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca



v2-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


v2-0002-pg-stat-ssl-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


Re: Making empty Bitmapsets always be NULL

2023-06-23 Thread Ranier Vilela
On Thu, Jun 22, 2023 at 05:50, Yuya Watari wrote:

> Hello,
>
> On Thu, Jun 22, 2023 at 1:43 PM David Rowley  wrote:
> > > 3. Avoid enlargement when nwords is equal to wordnum.
> > > Can save cycles when in corner cases?
> >
> > No, you're just introducing a bug here.  Arrays in C are zero-based,
> > so "wordnum >=  a->nwords" is exactly the correct way to check if
> > wordnum falls outside the bounds of the existing allocated memory. By
> > changing that to "wordnum > a->nwords" we'll fail to enlarge the words
> > array when it needs to be enlarged by 1 element.
>
> I agree with David. Unfortunately, some of the regression tests failed
> with the v5 patch. These failures are due to the bug introduced by the
> #3 change.
>
Yeah, this is my fault.

Anyway, thanks for the brilliant ideas about optimizing Bitmapset.
I worked a bit more on the v4 version and made a new v6 version, with some
changes.

I made some benchmarks with v4 and v6:
Windows 64 bits
msvc 2019 64 bits

== Query A ==
psql -U postgres -f c:\postgres_bench\tmp\bitmapset\create-tables-a.sql
psql -U postgres -f c:\postgres_bench\tmp\bitmapset\query-a.sql
=

head:
Time: 3489,097 ms (00:03,489)
Time: 3501,780 ms (00:03,502)

patched v4:
Time: 2434,873 ms (00:02,435)
Time: 2310,832 ms (00:02,311)
Time: 2305,445 ms (00:02,305)
Time: 2185,972 ms (00:02,186)
Time: 2177,434 ms (00:02,177)
Time: 2169,883 ms (00:02,170)

patched v6:
Time: 2162,633 ms (00:02,163)
Time: 2159,805 ms (00:02,160)
Time: 2002,771 ms (00:02,003)
Time: 1944,436 ms (00:01,944)
Time: 1906,364 ms (00:01,906)
Time: 1903,897 ms (00:01,904)

== Query B ==
psql -U postgres -f c:\postgres_bench\tmp\bitmapset\create-tables-b.sql
psql -U postgres -f c:\postgres_bench\tmp\bitmapset\query-b.sql

patched v4:
Time: 2684,360 ms (00:02,684)
Time: 2482,571 ms (00:02,483)
Time: 2452,699 ms (00:02,453)
Time: 2465,223 ms (00:02,465)

patched v6:
Time: 1837,775 ms (00:01,838)
Time: 1801,274 ms (00:01,801)
Time: 1800,802 ms (00:01,801)
Time: 1798,786 ms (00:01,799)

I can see some improvement, would you mind testing v6 and reporting back?

regards,
Ranier Vilela


remove_zero_trailing_words_from_bitmapsets_v6.patch
Description: Binary data


Problems with estimating OR conditions, IS NULL on LEFT JOINs

2023-06-23 Thread Tomas Vondra
Hi,

I ran into a pretty terrible case of LEFT JOIN estimation, resulting in
a pretty arbitrary underestimate. The query is pretty trivial, nothing
overly complex, and the more I think about it the more I think this is
a fairly fundamental flaw in how we estimate this type of joins.

Imagine you have two trivial tables:

  CREATE TABLE large (id INT, a INT);
  INSERT INTO large SELECT i, i FROM generate_series(1,1000000) s(i);

  CREATE TABLE small (id INT, b INT);
  INSERT INTO small SELECT i, i FROM generate_series(1,100) s(i);

The business meaning may be that "large" stores orders and "small" is
for events related to a tiny fraction of the large table (e.g. returns).
And let's do a couple simple LEFT JOIN queries, adding conditions to it.

Let's start with no condition at all:

  EXPLAIN ANALYZE
  SELECT * FROM large LEFT JOIN small ON (large.id = small.id)

 QUERY PLAN
  --
   Hash Left Join  (cost=3.25..18179.25 rows=1000000 width=16)
   (actual time=0.069..550.290 rows=1000000 loops=1)
 Hash Cond: (large.id = small.id)
 ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
  (actual time=0.010..174.056 rows=1000000 loops=1)
 ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.052...
   Buckets: 1024  Batches: 1  Memory Usage: 12kB
   ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
   Planning Time: 0.291 ms
   Execution Time: 663.551 ms
  (8 rows)

So great, this estimate is perfect. Now, let's add IS NULL condition on
the small table, to find rows without a match (e.g. orders that were not
returned):

  EXPLAIN ANALYZE
  SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
  WHERE (small.id IS NULL);

 QUERY PLAN
  --
   Hash Anti Join  (cost=3.25..27052.36 rows=999900 width=16)
   (actual time=0.071..544.568 rows=999900 loops=1)
 Hash Cond: (large.id = small.id)
 ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
 (actual time=0.015..166.019 rows=1000000 loops=1)
 ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.051...
   Buckets: 1024  Batches: 1  Memory Usage: 12kB
   ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
   Planning Time: 0.260 ms
   Execution Time: 658.379 ms
  (8 rows)

Also very accurate, great! Now let's do a condition on the large table
instead, filtering some of the rows:

  EXPLAIN ANALYZE
  SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
  WHERE (large.a IN (1000, 2000, 3000, 4000, 5000));

 QUERY PLAN
  --
   Nested Loop Left Join  (cost=0.00..20684.75 rows=5 width=16)
(actual time=0.957..127.376 rows=5 loops=1)
 Join Filter: (large.id = small.id)
 Rows Removed by Join Filter: 500
 ->  Seq Scan on large  (cost=0.00..20675.00 rows=5 width=8)
(actual time=0.878..127.171 rows=5 loops=1)
   Filter: (a = ANY ('{1000,2000,3000,4000,5000}'::integer[]))
   Rows Removed by Filter: 999995
 ->  Materialize  (cost=0.00..2.50 rows=100 width=8) ...
   ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
   Planning Time: 0.223 ms
   Execution Time: 127.407 ms
  (10 rows)

Also great estimate! Surely, if we do both conditions with OR, we'll get
a decent estimate too?

  EXPLAIN ANALYZE
  SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
  WHERE (small.id IS NULL)
 OR (large.a IN (1000, 2000, 3000, 4000, 5000));

 QUERY PLAN
  --
   Hash Left Join  (cost=3.25..18179.88 rows=5 width=16)
   (actual time=0.073..580.827 rows=999900 loops=1)
 Hash Cond: (large.id = small.id)
 Filter: ((small.id IS NULL) OR
  (large.a = ANY ('{1000,2000,3000,4000,5000}'::integer[])))
 Rows Removed by Filter: 100
 ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
 (actual time=0.015..166.809 rows=1000000 loops=1)
 ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.052...
   Buckets: 1024  Batches: 1  Memory Usage: 12kB
   ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
   Planning Time: 0.309 ms
   Execution Time: 694.427 ms
  (10 rows)

Well, bummer! This is pretty surprising, because if we know that clause
A produces estimate 1M and clause B estimates as 5, then it's expected
that (A OR B) should be estimated as something >= max(1M, 5). For users
running this, this has to be really surprising.
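
To spell out that expectation: the textbook way OR selectivities get
combined under an independence assumption, s = s1 + s2 - s1*s2, can
never drop below max(s1, s2) for s1, s2 in [0, 1], so an OR estimate
this low implies one of the per-clause inputs is already wrong. In this
case that input is the IS NULL selectivity, computed from base-table
statistics (small.id has null_frac = 0) rather than from the join
output. A rough stand-alone illustration (illustrative numbers matching
the example above, not actual planner code):

#include <stdio.h>

/* textbook independent-clause combination used for OR lists */
static double
or_selectivity(double s1, double s2)
{
	return s1 + s2 - s1 * s2;
}

int
main(void)
{
	double		ntuples = 1000000.0;

	/* IS NULL selectivity taken from base-table stats: null_frac is 0 */
	double		s_isnull_base = 0.0;
	/* the IN (...) clause matches about 5 of 1M rows */
	double		s_inlist = 5.0 / ntuples;
	/* what a join-aware IS NULL estimate would look like */
	double		s_isnull_join = 999900.0 / ntuples;

	printf("rows (base-stats IS NULL): %.0f\n",
		   ntuples * or_selectivity(s_isnull_base, s_inlist));	/* ~5 */
	printf("rows (join-aware IS NULL): %.0f\n",
		   ntuples * or_selectivity(s_isnull_join, s_inlist));	/* ~999900 */
	return 0;
}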

It's also quite serious, because with underestimates like this the
planner is likely 

Re: Preventing non-superusers from altering session authorization

2023-06-23 Thread Nathan Bossart
On Thu, Jun 22, 2023 at 06:39:45PM -0400, Joseph Koshakow wrote:
> On Wed, Jun 21, 2023 at 11:48 PM Nathan Bossart 
> wrote:
>> I see that RESET SESSION AUTHORIZATION
>> with a concurrently dropped role will FATAL with your patch but succeed
>> without it, which could be part of the reason.
> 
> That might be a good change? If the original authenticated role ID no
> longer exists then we may want to return an error when trying to set
> your session authorization to that role.

I was curious why we don't block DROP ROLE if there are active sessions for
the role or terminate any such sessions as part of the command, and I found
this discussion from 2016:

https://postgr.es/m/flat/56E87CD8.60007%40ohmu.fi

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Assert while autovacuum was executing

2023-06-23 Thread Andres Freund
Hi,

On 2023-06-23 14:04:15 +0530, Amit Kapila wrote:
> OTOH, if the above theory is wrong or people are not convinced, I am
> okay with removing all the changes in commits 72e78d831a and
> 3ba59ccc89.

I am not convinced. And even if I were, coming up with new justifications in a
released version, when the existing testing clearly wasn't enough to find the
current bug, doesn't strike me as wise.

Greetings,

Andres Freund




Re: psql: Add role's membership options to the \du+ command

2023-06-23 Thread Tom Lane
"David G. Johnston"  writes:
> On Thu, Jun 22, 2023 at 5:08 PM Tom Lane  wrote:
>> * I agree that the "tabular" format looks nicer and has fewer i18n
>> issues than the other proposals.

> As you are on board with a separate command please clarify whether you mean
> the tabular format but still with newlines, one row per grantee, or the
> table with one row per grantor-grantee pair.

I'd lean towards a straight table with a row per grantee/grantor.
I tend to think that faking table layout with some newlines is
a poor idea.  I'm not dead set on that approach though.

regards, tom lane




Re: psql: Add role's membership options to the \du+ command

2023-06-23 Thread David G. Johnston
On Thu, Jun 22, 2023 at 5:08 PM Tom Lane  wrote:

> "Jonathan S. Katz"  writes:
> > On 6/15/23 2:47 PM, David G. Johnston wrote:
> >> Robert - can you please comment on what you are willing to commit in
> >> order to close out your open item here.  My take is that the design for
> >> this, the tabular form a couple of emails ago (copied here), is
> >> ready-to-commit, just needing the actual (trivial) code changes to be
> >> made to accomplish it.
>
> > Can we resolve this before Beta 2?[1] The RMT originally advised to try
> > to resolve before Beta 1[2], and this seems to be lingering.
>
> At this point I kinda doubt that we can get this done before beta2
> either, but I'll put in my two cents anyway:
>
> * I agree that the "tabular" format looks nicer and has fewer i18n
> issues than the other proposals.
>

As you are on board with a separate command please clarify whether you mean
the tabular format but still with newlines, one row per grantee, or the
table with one row per grantor-grantee pair.

I still like using newlines here even in the separate meta-command.

>
> * Personally I could do without the "empty" business, but that seems
> unnecessary in the tabular format; an empty column will serve fine.
>

I disagree, but not strongly.

I kinda expected you to be on the side of "why are we discussing a
situation that should just be prohibited" though.


> * I also agree with Pavel's comment that we'd be better off taking
> this out of \du altogether and inventing a separate \d command.
> Maybe "\drg" for "display role grants"?
>

Just to be clear, the open item fix proposal is to remove the presently
broken (due to it showing duplicates without any context) "member of" array
in \du and make a simple table report output in \drg instead.

I'm good with \drg as a new meta-command.


> * Parenthetically, the "Attributes" column of \du is a complete
> disaster
>
>
I hadn't thought about this in detail but did get the same impression.

David J.


Re: Bytea PL/Perl transform

2023-06-23 Thread Dagfinn Ilmari Mannsåker
Andrew Dunstan  writes:

> On 2023-06-22 Th 16:56, Greg Sabino Mullane wrote:
>>
>> * Do all of these transforms need to be their own contrib modules? So
>> much duplicated code across contrib/*_plperl already (and *plpython 
>> too for that matter) ...
>>
>>
>
> Yeah, that's a bit of a mess. Not sure what we can do about it now.

Would it be possible to move the functions and other objects to a new
combined extension, and make the existing ones depend on that?

I see ALTER EXTENSION has both ADD and DROP subcommands which don't
affect the object itself, only the extension membership.  The challenge
would be getting the ordering right between the upgrade/install scripts
dropping the objects from the existing extension and adding them to the
new extension.

- ilmari




Re: Improving btree performance through specializing by key shape, take 2

2023-06-23 Thread Matthias van de Meent
On Fri, 23 Jun 2023 at 11:26, Dilip Kumar  wrote:
>
> On Fri, Jun 23, 2023 at 2:21 AM Matthias van de Meent
>  wrote:
> >
>
> > == Dynamic prefix truncation (0001)
> > The code now tracks how many prefix attributes of the scan key are
> > already considered equal based on earlier binsrch results, and ignores
> > those prefix colums in further binsrch operations (sorted list; if
> > both the high and low value of your range have the same prefix, the
> > middle value will have that prefix, too). This reduces the number of
> > calls into opclass-supplied (dynamic) compare functions, and thus
> > increase performance for multi-key-attribute indexes where shared
> > prefixes are common (e.g. index on (customer, order_id)).
>
> I think the idea looks good to me.
>
> I was looking into the 0001 patches,

Thanks for reviewing.

> and I have one confusion in the
> below hunk in the _bt_moveright function, basically, if the parent
> page's right key is exactly matching the HIGH key of the child key
> then I suppose while doing the "_bt_compare" with the HIGH_KEY we can
> use the optimization right, i.e. column number from where we need to
> start the comparison should be used what is passed by the caller.  But
> in the below hunk, you are always passing that as 'cmpcol' which is 1.
> I think this should be '*comparecol' because '*comparecol' will either
> hold the value passed by the parent if high key data exactly match
> with the parent's right tuple or it will hold 1 in case it doesn't
> match.  Am I missing something?

We can't carry _bt_compare prefix results across pages, because the
key range of a page may shift while we're not holding a lock on that
page. That's also why the code resets the prefix to 1 every time it
accesses a new page ^1: it cannot guarantee correct results otherwise.
See also [0] and [1] for why that is important.

^1: When following downlinks, the code above your quoted code tries to
reuse the _bt_compare result of the parent page in the common case of
a child page's high key that is bytewise equal to the right separator
tuple of the parent page's downlink to this page.  However, if it
detects that this child high key has changed (i.e. not 100% bytewise
equal), we can't reuse that result, and we'll have to re-establish all
prefix info on that page from scratch.
In any case, this only establishes the prefix for the right half of
the page's keyspace, the prefix of the left half of the data still
needs to be established separetely.

I hope this explains the reasons for why we can't reuse comparecol as
_bt_compare argument.
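
For readers new to the idea, here is a small self-contained sketch of
the dynamic prefix trick described above (simplified integer keys and
made-up names, not the nbtree code): within one sorted array, if both
current search bounds are known to agree with the search key on some
leading attributes, then every midpoint between them must agree on
those attributes too, so each comparison can start at the first
undecided attribute. As explained above, this knowledge is only valid
within a single page visit and has to be rebuilt whenever a new page is
accessed.

#define NATTS 2				/* number of key attributes in this toy example */

int
search_with_dynamic_prefix(const int key[NATTS],
						   const int tuples[][NATTS], int ntuples)
{
	int			low = 0;
	int			high = ntuples;
	int			low_prefix = 0;		/* attrs known equal at the low bound */
	int			high_prefix = 0;	/* attrs known equal at the high bound */

	while (low < high)
	{
		int			mid = low + (high - low) / 2;
		/* attrs shared by both bounds are shared by everything in between */
		int			prefix = (low_prefix < high_prefix) ? low_prefix : high_prefix;
		int			attno;
		int			cmp = 0;

		/* attributes 0 .. prefix-1 are already known to be equal; skip them */
		for (attno = prefix; attno < NATTS; attno++)
		{
			cmp = (key[attno] > tuples[mid][attno]) -
				(key[attno] < tuples[mid][attno]);
			if (cmp != 0)
				break;
		}

		if (cmp == 0)
			return mid;			/* every attribute matched */
		if (cmp > 0)
		{
			low = mid + 1;
			low_prefix = attno;	/* attributes 0 .. attno-1 compared equal */
		}
		else
		{
			high = mid;
			high_prefix = attno;
		}
	}
	return -1;					/* not found */
}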

Kind regards,

Matthias van de Meent
Neon, Inc.

[0] 
https://www.postgresql.org/message-id/cah2-wzn_nayk4pr0hrwo0stwhmxjp5qyu+x8vppt030xpqr...@mail.gmail.com




Stampede of the JIT compilers

2023-06-23 Thread James Coleman
Hello,

We recently brought online a new database cluster, and in the course
of ramping up traffic to it we encountered a situation where a misplanned
query (analyzing helped with this, but I think the issue is still
relevant) resulted in that query being compiled with JIT, and soon a
large number of backends were running that same shape of query, all of
them JIT compiling it. Since each JIT compilation took ~2s, this
starved the server of resources.

There are a couple of issues here. I'm sure it's been discussed
before, and it's not the point of my thread, but I can't help but note
that the default value of jit_above_cost of 100000 seems absurdly low.
On good hardware like we have even well-planned queries with costs
well above that won't be taking as long as JIT compilation does.

But on the topic of the thread: I'd like to know if anyone has ever
considered implementing a GUC/feature like
"max_concurrent_jit_compilations" to cap the number of backends that
may be compiling a query at any given point so that we avoid an
optimization from running amok and consuming all of a server's
resources?

Regards,
James Coleman




[PGdocs] fix description for handling of non-ASCII characters

2023-06-23 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

While discussing based on the article[1] with Japanese developers, 
I found inconsistencies between the code and the documentation.

45b1a67a[2] changed the behavior when non-ASCII characters are set as 
application_name,
cluster_name and postgres_fdw.application_name, but it seemed not to be 
documented.
Previously, non-ASCII characters were replaced with question marks '?', but now
they are replaced with a hex escape instead.

What do you think? Is my understanding correct?

Acknowledgement:
Sawada-san and Shinoda-san led the developer's discussion.
Fujii-san confirmed my points. Thank you for all of their work!

[1]: 
https://h50146.www5.hpe.com/products/software/oe/linux/mainstream/support/lcc/pdf/PostgreSQL16Beta1_New_Features_en_20230528_1.pdf
[2]: 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=45b1a67a0fcb3f1588df596431871de4c93cb76f;hp=da5d4ea5aaac4fc02f2e2aec272efe438dd4e171
 
Best Regards,
Hayato Kuroda
FUJITSU LIMITED



doc_fix.patch
Description: doc_fix.patch


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-06-23 Thread Melih Mutlu
Hi Peter,

Thanks for your reviews. I tried to apply most of them. I just have
some comments below for some of them.

Peter Smith wrote on Wed, Jun 14, 2023 at 08:45:
>
> 9. process_syncing_tables_for_sync
>
> @@ -378,7 +387,13 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
>   */
>   replorigin_drop_by_name(originname, true, false);
>
> - finish_sync_worker();
> + /*
> + * Sync worker is cleaned at this point. It's ready to sync next table,
> + * if needed.
> + */
> + SpinLockAcquire(>relmutex);
> + MyLogicalRepWorker->ready_to_reuse = true;
> + SpinLockRelease(>relmutex);
>
> 9a.
> I did not quite follow the logic. It says "Sync worker is cleaned at
> this point", but who is doing that? -- more details are needed. But,
> why not just call clean_sync_worker() right here like it used to call
> finish_sync_worker?

I agree that these explanations at places where the worker decides to
not continue with the current table were confusing. Even the name of
ready_to_reuse was misleading. I renamed it and tried to improve
comments in such places.
Can you please check if those make more sense now?


> ==
> src/backend/replication/logical/worker.c
>
> 10. General -- run_tablesync_worker, TablesyncWorkerMain
>
> IMO these functions would be more appropriately reside in the
> tablesync.c instead of the (common) worker.c. Was there some reason
> why they cannot be put there?

I'm not really against moving those functions to tablesync.c. But
what's not clear to me is worker.c. Is it the places to put common
functions for all log. rep. workers? Then, what about apply worker?
Should we consider a separate file for apply worker too?

Thanks,
-- 
Melih Mutlu
Microsoft




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-06-23 Thread Melih Mutlu
Hi,

Thanks for your reviews.

Hayato Kuroda (Fujitsu) wrote on Tue, Jun 13, 2023 at 13:06:
> 01. general
>
> Why do tablesync workers have to disconnect from the publisher for every
> iteration? I think the connection initiation overhead is not negligible in
> postgres's basic architecture. I have not checked yet, but could we add a
> new replication message like STOP_STREAMING or CLEANUP? Or is the
> engineering for it quite a bit larger than the benefit?

This actually makes sense. I quickly tried to do that without adding any
new replication message. As you would expect, it did not work.
I don't really know what's needed to make a connection last for
more than one iteration. Need to look into this. Happy to hear any
suggestions and thoughts.

> The sync worker sends a signal to its leader per the iteration, but it may be 
> too
> often. Maybe it is added for changing the rstate to READY, however, it is OK 
> to
> change it when the next change have come because 
> should_apply_changes_for_rel()
> returns true even if rel->state == SUBREL_STATE_SYNCDONE. I think the 
> notification
> should be done only at the end of sync workers. How do you think?

I tried to move the logicalrep_worker_wakeup call from
clean_sync_worker (end of an iteration) to finish_sync_worker (end of
sync worker). That made table sync much slower for some reason, so I
reverted the change. Maybe I should look a bit more into why that
happened at some point.

Thanks,
-- 
Melih Mutlu
Microsoft




Re: logical decoding and replication of sequences, take 2

2023-06-23 Thread Ashutosh Bapat
Regarding the patchsets, I think we will need to rearrange the
commits. Right now 0004 has some parts that should have been in 0001.
Also, the logic to assign an XID to a subtransaction would be better as a separate
commit. That piece is independent of logical decoding of sequences.

On Fri, Jun 23, 2023 at 6:48 PM Ashutosh Bapat
 wrote:
>
> On Tue, Jun 13, 2023 at 11:01 PM Tomas Vondra
>  wrote:
> >
> > On 5/18/23 16:23, Ashutosh Bapat wrote:
> > > Hi,
> > > Sorry for jumping late in this thread.
> > >
> > > I started experimenting with the functionality. Maybe something that
> > > was already discussed earlier. Given that the thread is being
> > > discussed for so long and has gone several changes, revalidating the
> > > functionality is useful.
> > >
> > > I considered following aspects:
> > > Changes to the sequence on subscriber
> > > -
> > > 1. Since this is logical decoding, logical replica is writable. So the
> > > logically replicated sequence can be manipulated on the subscriber as
> > > well. This implementation consolidates the changes on subscriber and
> > > publisher rather than replicating the publisher state as is. That's
> > > good. See example command sequence below
> > > a. publisher calls nextval() - this sets the sequence state on
> > > publisher as (1, 32, t) which is replicated to the subscriber.
> > > b. subscriber calls nextval() once - this sets the sequence state on
> > > subscriber as (34, 32, t)
> > > c. subscriber calls nextval() 32 times - on-disk state of sequence
> > > doesn't change on subscriber
> > > d. subscriber calls nextval() 33 times - this sets the sequence state
> > > on subscriber as (99, 0, t)
> > > e. publisher calls nextval() 32 times - this sets the sequence state
> > > on publisher as (33, 0, t)
> > >
> > > The on-disk state on publisher at the end of e. is replicated to the
> > > subscriber but subscriber doesn't apply it. The state there is still
> > > (99, 0, t). I think this is closer to how logical replication of
> > > sequence should look like. This is also good enough as long as we
> > > expect the replication of sequences to be used for failover and
> > > switchover.
> > >
> >
> > I'm really confused - are you describing what the patch is doing, or
> > what you think it should be doing? Because right now there's nothing
> > that'd "consolidate" the changes (in the sense of reconciling write
> > conflicts), and there's absolutely no way to do that.
> >
> > So if the subscriber advances the sequence (which it technically can),
> > the subscriber state will be eventually be discarded and overwritten
> > when the next increment gets decoded from WAL on the publisher.
>
> I described what I observed in my experiments. My observation doesn't
> agree with your description. I will revisit this when I review the
> output plugin changes and the WAL receiver changes.
>
> >
> > Yes, I agree with this. It's probably better to replicate just the next
> > value, without the log_cnt / is_called fields (which are implementation
> > specific).
>
> Ok. I will review the logic once you revise the patches.
>
> >
> > >
> > > 3. Primary key sequences
> > > ---
> > > I am not experimented with this. But I think we will need to add the
> > > sequences associated with the primary keys to the publications
> > > publishing the owner tables. Otherwise, we will have problems with the
> > > failover. And it needs to be done automatically since a. the names of
> > > these sequences are generated automatically b. publications with FOR
> > > ALL TABLES will add tables automatically and start replicating the
> > > changes. Users may not be able to intercept the replication activity
> > > to ensure the associated sequences are also added to the publication.
> > >
> >
> > Right, this idea was mentioned before, and I agree maybe we should
> > consider adding some of those "automatic" sequences automatically.
> >
>
> Are you planning to add this in the same patch set or separately?
>
> I reviewed 0001 and related parts of 0004 and 0008 in detail.
>
> I have only one major change request, about
> typedef struct xl_seq_rec
> {
> RelFileLocator locator;
> + bool created; /* creates a new relfilenode (CREATE/ALTER) */
>
> I am not sure what are the repercussions of adding a member to an existing WAL
> record. I didn't see any code which handles the old WAL format which doesn't
> contain the "created" flag. IIUC, the logical decoding may come across
> a WAL record written in the old format after upgrade and restart. Is
> that not possible?
>
> But I don't think it's necessary. We can add a
> decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
> in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
> as is. Of course we will add non-sequence relfilelocators as well but that
> should be fine. Creating a new relfilelocator shouldn't be a frequent
> operation. If at all we are 

Re: logical decoding and replication of sequences, take 2

2023-06-23 Thread Ashutosh Bapat
On Tue, Jun 13, 2023 at 11:01 PM Tomas Vondra
 wrote:
>
> On 5/18/23 16:23, Ashutosh Bapat wrote:
> > Hi,
> > Sorry for jumping late in this thread.
> >
> > I started experimenting with the functionality. Maybe something that
> > was already discussed earlier. Given that the thread is being
> > discussed for so long and has gone several changes, revalidating the
> > functionality is useful.
> >
> > I considered following aspects:
> > Changes to the sequence on subscriber
> > -
> > 1. Since this is logical decoding, logical replica is writable. So the
> > logically replicated sequence can be manipulated on the subscriber as
> > well. This implementation consolidates the changes on subscriber and
> > publisher rather than replicating the publisher state as is. That's
> > good. See example command sequence below
> > a. publisher calls nextval() - this sets the sequence state on
> > publisher as (1, 32, t) which is replicated to the subscriber.
> > b. subscriber calls nextval() once - this sets the sequence state on
> > subscriber as (34, 32, t)
> > c. subscriber calls nextval() 32 times - on-disk state of sequence
> > doesn't change on subscriber
> > d. subscriber calls nextval() 33 times - this sets the sequence state
> > on subscriber as (99, 0, t)
> > e. publisher calls nextval() 32 times - this sets the sequence state
> > on publisher as (33, 0, t)
> >
> > The on-disk state on publisher at the end of e. is replicated to the
> > subscriber but subscriber doesn't apply it. The state there is still
> > (99, 0, t). I think this is closer to how logical replication of
> > sequence should look like. This is also good enough as long as we
> > expect the replication of sequences to be used for failover and
> > switchover.
> >
>
> I'm really confused - are you describing what the patch is doing, or
> what you think it should be doing? Because right now there's nothing
> that'd "consolidate" the changes (in the sense of reconciling write
> conflicts), and there's absolutely no way to do that.
>
> So if the subscriber advances the sequence (which it technically can),
> the subscriber state will be eventually be discarded and overwritten
> when the next increment gets decoded from WAL on the publisher.

I described what I observed in my experiments. My observation doesn't
agree with your description. I will revisit this when I review the
output plugin changes and the WAL receiver changes.

>
> Yes, I agree with this. It's probably better to replicate just the next
> value, without the log_cnt / is_called fields (which are implementation
> specific).

Ok. I will review the logic once you revise the patches.

>
> >
> > 3. Primary key sequences
> > ---
> > I am not experimented with this. But I think we will need to add the
> > sequences associated with the primary keys to the publications
> > publishing the owner tables. Otherwise, we will have problems with the
> > failover. And it needs to be done automatically since a. the names of
> > these sequences are generated automatically b. publications with FOR
> > ALL TABLES will add tables automatically and start replicating the
> > changes. Users may not be able to intercept the replication activity
> > to ensure the associated sequences are also added to the publication.
> >
>
> Right, this idea was mentioned before, and I agree maybe we should
> consider adding some of those "automatic" sequences automatically.
>

Are you planning to add this in the same patch set or separately?

I reviewed 0001 and related parts of 0004 and 0008 in detail.

I have only one major change request, about
typedef struct xl_seq_rec
{
RelFileLocator locator;
+ bool created; /* creates a new relfilenode (CREATE/ALTER) */

I am not sure what are the repercussions of adding a member to an existing WAL
record. I didn't see any code which handles the old WAL format which doesn't
contain the "created" flag. IIUC, the logical decoding may come across
a WAL record written in the old format after upgrade and restart. Is
that not possible?

But I don't think it's necessary. We can add a
decoding routine for RM_SMGR_ID. The decoding routine will add relfilelocator
in XLOG_SMGR_CREATE record to txn->sequences hash. Rest of the logic will work
as is. Of course we will add non-sequence relfilelocators as well but that
should be fine. Creating a new relfilelocator shouldn't be a frequent
operation. If at all we are worried about that, we can add only the
relfilenodes associated with sequences to the hash table.

If this idea has been discussed earlier, please point me to the relevant
discussion.

Some other minor comments and nitpicks.

stream_stop_cb, stream_abort_cb,
stream_commit_cb, and stream_change_cb
- are required, while stream_message_cb and
+ are required, while stream_message_cb,
+ stream_sequence_cb and

Like the non-streaming counterpart, should we also mention what happens if those
callbacks are not defined? That 

Re: Do we want a hashset type?

2023-06-23 Thread Tomas Vondra



On 6/23/23 13:47, Andrew Dunstan wrote:
> 
> On 2023-06-23 Fr 04:23, Joel Jacobson wrote:
>> On Fri, Jun 23, 2023, at 08:40, jian he wrote:
>>> I played around array_func.c
>>> many of the code can be used for multiset data type.
>>> now I imagine multiset as something like one dimension array. (nested 
>>> is somehow beyond the imagination...).
>> Are you suggesting it might be a better idea to start over completely
>> and work on a new code base that is based on arrayfuncs.c,
>> and aim for MULTISET/SET or anyhashset from start, that would not
>> only support int4/int8/uuid but any type?
>>
> 
> Before we run too far down this rabbit hole, let's discuss the storage
> implications of using multisets. ISTM that for small base datums like
> integers it will be a substantial increase in size, since you'll need an
> additional int for the item count, unless some very clever tricks are played.
> 

I honestly don't quite understand what exactly is meant by the proposal
to "reuse array_func.c for multisets". We're implementing sets, not
multisets (those were mentioned only to illustrate behavior). And the
whole point is that sets are not arrays - no duplicates, ordering does
not matter (so no index).

I mentioned that maybe we can model sets based on arrays (say, gram.y
would do similar stuff for SET[] and ARRAY[], polymorphism), not that we
should store sets as arrays. Would it be possible - maybe, if we extend
arrays to also maintain some hash table. But I'd bet that'll just
make arrays more complex, and will make sets slower.

Or maybe I just don't understand the proposal. Perhaps it'd be best if
jian wrote a patch illustrating the idea, and showing how it performs
compared to the current approach.

As for the storage size, I don't think an extra "count" field would make
any measurable difference. If we're storing a hash table, we're bound to
have a couple percent of wasted space due to load factor (likely between
0.75 and 0.9).

> As for this older discussion referred to upthread, if the SQL Standards
> Committee hasn't acted on it by now it seems reasonable to think they are
> unlikely to.
> 

AFAIK multisets are included in SQL 2023, pretty much matching the draft
we discussed earlier. Yeah, it's unlikely to change in the future.

> Just for reference, here's some description of Oracle's support for
> Multisets from
> :
> 

good to know


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Bytea PL/Perl transform

2023-06-23 Thread Andrew Dunstan


On 2023-06-22 Th 16:56, Greg Sabino Mullane wrote:


* Do all of these transforms need to be their own contrib modules? So 
much duplicated code across contrib/*_plperl already (and *plpython 
too for that matter) ...





Yeah, that's a bit of a mess. Not sure what we can do about it now.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Do we want a hashset type?

2023-06-23 Thread Andrew Dunstan


On 2023-06-23 Fr 04:23, Joel Jacobson wrote:

On Fri, Jun 23, 2023, at 08:40, jian he wrote:

I played around array_func.c
many of the code can be used for multiset data type.
now I imagine multiset as something like one dimension array. (nested
is somehow beyond the imagination...).

Are you suggesting it might be a better idea to start over completely
and work on a new code base that is based on arrayfuncs.c,
and aim for MULTISET/SET or anyhashset from start, that would not
only support int4/int8/uuid but any type?



Before we run too far down this rabbit hole, let's discuss the storage 
implications of using multisets. ISTM that for small base datums like 
integers it will be a substantial increase in size, since you'll need an 
additional int for the item count, unless some very clever tricks are played.


As for this older discussion referred to upthread, if the SQL Standards 
Committee hasn't acted on it by now it seems reasonable to think they are 
unlikely to.


Just for reference, here's some description of Oracle's support for 
Multisets from 
:


Multisets in the standard are supported as nested table types in 
Oracle. The Oracle nested table data type based on a scalar type ST is 
equivalent, in standard terminology, to a multiset of rows having a 
single field of type ST and named column_value. The Oracle nested 
table type based on an object type is equivalent to a multiset of 
structured type in the standard.


Oracle supports the following elements of this feature on nested 
tables using the same syntax as the standard has for multisets:


    The CARDINALITY function

    The SET function

    The MEMBER predicate

    The IS A SET predicate

    The COLLECT aggregate

All other aspects of this feature are supported with non-standard 
syntax, as follows:


    To create an empty multiset, denoted MULTISET[] in the standard, 
use an empty constructor of the nested table type.


    To obtain the sole element of a multiset with one element, denoted 
ELEMENT () in the standard, use a scalar 
subquery to select the single element from the nested table.


    To construct a multiset by enumeration, use the constructor of the 
nested table type.


    To construct a multiset by query, use CAST with a multiset 
argument, casting to the nested table type.


    To unnest a multiset, use the TABLE operator in the FROM clause.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Migration database from mysql to postgress

2023-06-23 Thread Andrew Dunstan


On 2023-06-23 Fr 05:30, Alfredo Alcala wrote:

Hi,

I need to move some databases from a MySQL server to Postgresql.

Can someone tell me the migration procedure, tools, and recommendations?





Please ask questions on the correct forum. For this question I suggest 
the pgsql-general mailing list. pgsql-hackers is for questions about 
postgresql development, not usage.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: eqjoinsel_semi still sucks ...

2023-06-23 Thread Andrey Lepikhov

On 2/5/2012 20:34, Tom Lane wrote:

On reflection I think that the idea of clamping ndistinct beforehand is
just wrong, and what we ought to do instead is apply a multiplier to the
selectivity estimate afterwards.  In the case of a base rel we could
just multiply by the selectivity of its baserestrictinfo list.  For join
rels it's a bit harder to guess how much a given input relation might
have been decimated, but if the join's estimated size is smaller than
the output size of the base rel the correlation var came from, we could
multiply by that ratio (on top of whatever correction came from the base
rel's restriction clauses).
I got stuck in some cases where (due to a tree of filters) the planner 
underestimates the JOIN just because the raw ndistinct feeds a huge number 
into the selectivity estimation formula, even though the estimates for both 
input relations are made correctly and are limited.
I've tried to understand the logic through commits 0d3b231eebf, 
97930cf578e and 7f3eba30c9d, but it is still not clear to me.
So, why is the idea of clamping ndistinct terrible in general? Could you 
explain your reasons a bit more?
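
For illustration, my reading of the correction described in the quoted
paragraph amounts to something like this (a sketch only, not actual
PostgreSQL code):

/*
 * Sketch: instead of clamping ndistinct up front, scale the raw semi-join
 * selectivity by how much the input providing the correlation Var has
 * already been filtered.
 */
static double
adjust_semijoin_selectivity(double raw_sel, double join_rows, double base_rows)
{
	if (base_rows > 0 && join_rows < base_rows)
		raw_sel *= join_rows / base_rows;
	return raw_sel;
}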


--
regards,
Andrey Lepikhov
Postgres Professional





Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges

2023-06-23 Thread Tom Lane
Tommy Pavlicek  writes:
> I've added a single patch here: https://commitfest.postgresql.org/43/4389/

> It wasn't obvious whether I should create a second commitfest entry
> because I've included 2 patches so I've just done 1 to begin with. On
> that note, is it preferred here to split patches of this size into
> separate patches, and if so, additionally, separate threads?

No, our commitfest infrastructure is unable to deal with patches that have
interdependencies unless they're presented in a single email.  So just use
one thread, and be sure to attach all the patches each time.

(BTW, while you seem to have gotten away with it so far, it's usually
advisable to name the patch files like 0001-foo.patch, 0002-bar.patch,
etc, to make sure the cfbot understands what order to apply them in.)

regards, tom lane




Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges

2023-06-23 Thread Tommy Pavlicek
Tom Lane  writes:

> Please add this to the upcoming commitfest [1], to ensure we don't
> lose track of it.

I've added a single patch here: https://commitfest.postgresql.org/43/4389/

It wasn't obvious whether I should create a second commitfest entry
because I've included 2 patches so I've just done 1 to begin with. On
that note, is it preferred here to split patches of this size into
separate patches, and if so, additionally, separate threads?

Tom Lane  writes:

> > Additionally, I wasn't sure whether it was preferred to fail or succeed on
> > ALTERs that have no effect, such as adding hashes on an operator that
> > already allows them or disabling hashes on one that does not. I chose to
> > raise an error when this happens, on the thinking it was more explicit and
> > made the code simpler, even though the end result would be what the user
> > wanted.
>
> You could argue that both ways I guess.  We definitely need to raise error
> if the command tries to change an existing nondefault setting, since that
> might break things as per previous discussion.  But perhaps rejecting
> an attempt to set the existing setting is overly nanny-ish.  Personally
> I think I'd lean to "don't throw an error if we don't have to", but I'm
> not strongly set on that position.
>
> (Don't we have existing precedents that apply here?  I can't offhand
> think of any existing ALTER commands that would reject no-op requests,
> but maybe that's not a direct precedent.)

My initial thinking behind the error for a no-op was largely driven by
the existence of 'DROP.. IF EXISTS'. However, I did some ad hoc
testing on ALTER commands and it does seem that they mostly allow
no-ops. I did find that renaming an object to the same name will fail
due to the object already existing, but that seems to be more of a
coincidence than a design decision to me. Given this, I also lean
towards allowing the no-ops and will change it unless there are
objections.




Re: Assert while autovacuum was executing

2023-06-23 Thread Dilip Kumar
On Fri, Jun 23, 2023 at 2:04 PM Amit Kapila  wrote:
>
> On Thu, Jun 22, 2023 at 9:16 AM Amit Kapila  wrote:
> >
> > On Wed, Jun 21, 2023 at 10:57 AM Andres Freund  wrote:
> > >
> > > As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that 
> > > likely
> > > also means 3ba59ccc89 is not right.
> > >
> >
> > Indeed. I was thinking of a fix but couldn't find one yet. One idea I
> > am considering is to allow catalog table locks after page lock but I
> > think apart from hacky that also won't work because we still need to
> > remove the check added for page locks in the deadlock code path in
> > commit 3ba59ccc89 and may need to do something for group locking.
> >
>
> I have further thought about this part and I think even if we remove
> the changes in commit 72e78d831a (remove the assertion for page locks
> in LockAcquireExtended()) and remove the check added for page locks in
> FindLockCycleRecurseMember() via commit 3ba59ccc89, it is still okay
> to keep the change related to "allow page lock to conflict among
> parallel group members" in LockCheckConflicts(). This is because locks
> on catalog tables don't conflict among group members. So, we shouldn't
> see a deadlock among parallel group members. Let me try to explain
> this thought via an example:
>

IMHO, whatever the case, this check [1] is not wrong at all.  I agree
that we do not have parallel writes in the code yet, so the
check is not strictly necessary as of now.  But in theory the check is
correct, because it says that a parallel leader and worker should
conflict on the relation extension lock and the page lock, and
that is the fact.  It holds true irrespective of whether it is being
exercised currently or not.


[1]
	/*
	 * The relation extension or page lock conflict even between the group
	 * members.
	 */
	if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
		(LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
	{
		PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
					   proclock);
		return true;
	}


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: patch: improve "user mapping not found" error message

2023-06-23 Thread Laurenz Albe
On Fri, 2023-06-23 at 16:45 +0900, Ian Lawrence Barwick wrote:
> Mild corner-case annoyance while doing Random Experimental Things:
> 
>     postgres=# SELECT * FROM parttest;
>     ERROR:  user mapping not found for "postgres"
> 
> Oky, but which server?
> 
>     postgres=# \det
>    List of foreign tables
>  Schema | Table |   Server
>     +---+---
>  public | parttest_10_1 | fdw_node2
>  public | parttest_10_3 | fdw_node3
>  public | parttest_10_5 | fdw_node4
>  public | parttest_10_7 | fdw_node5
>  public | parttest_10_9 | fdw_node6
>     (5 rows)
> 
> (Muffled sound of small patch hatching) aha:
> 
>     postgres=# SELECT * FROM parttest;
>     ERROR:  user mapping not found for user "postgres", server "fdw_node5"

+1

Yours,
Laurenz Albe




Re: Migration database from mysql to postgress

2023-06-23 Thread Thomas Kellerer
Alfredo Alcala schrieb am 23.06.2023 um 11:30:
> I need to move some databases from a MySQL server to Postgresql.
>
> Can someone tell me the migration procedure, tools, and recommendations? 


Despite its name, "ora2pg" can also migrate MySQL to Postgres

https://ora2pg.darold.net/





Migration database from mysql to postgress

2023-06-23 Thread Alfredo Alcala
Hi,

I need to move some databases from a MySQL server to Postgresql.

Can someone tell me the migration procedure, tools, and recommendations?

Thanks


Re: Improving btree performance through specializing by key shape, take 2

2023-06-23 Thread Dilip Kumar
On Fri, Jun 23, 2023 at 2:21 AM Matthias van de Meent
 wrote:
>

> == Dynamic prefix truncation (0001)
> The code now tracks how many prefix attributes of the scan key are
> already considered equal based on earlier binsrch results, and ignores
> those prefix columns in further binsrch operations (sorted list; if
> both the high and low value of your range have the same prefix, the
> middle value will have that prefix, too). This reduces the number of
> calls into opclass-supplied (dynamic) compare functions, and thus
> increases performance for multi-key-attribute indexes where shared
> prefixes are common (e.g. index on (customer, order_id)).
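
To make the idea concrete, here is a standalone toy sketch (not code from
the patch) of a binary search over sorted two-attribute keys that starts
each comparison at the first attribute not yet known to be equal on both
search bounds:

/* Toy illustration of dynamic prefix truncation (not PostgreSQL code). */
#include <stdio.h>

#define NATTS 2
#define NKEYS 8

static long ncmp = 0;

/*
 * Compare tuple a against tuple b starting at attribute "skip" (the first
 * "skip" attributes are already known equal); report via *nequal how many
 * leading attributes are now known equal.
 */
static int
tuple_cmp(const int *a, const int *b, int skip, int *nequal)
{
	*nequal = skip;
	for (int i = skip; i < NATTS; i++)
	{
		ncmp++;
		if (a[i] != b[i])
			return (a[i] < b[i]) ? -1 : 1;
		*nequal = i + 1;
	}
	return 0;
}

int
main(void)
{
	int			keys[NKEYS][NATTS] = {
		{1, 1}, {1, 3}, {1, 5}, {1, 7}, {1, 9}, {2, 1}, {2, 3}, {2, 5}
	};
	int			search[NATTS] = {1, 7};
	int			lo = 0, hi = NKEYS;
	int			lo_pfx = 0, hi_pfx = 0;	/* attrs known equal at each bound */

	while (lo < hi)
	{
		int			mid = lo + (hi - lo) / 2;
		int			skip = (lo_pfx < hi_pfx) ? lo_pfx : hi_pfx;
		int			nequal;
		int			c = tuple_cmp(keys[mid], search, skip, &nequal);

		if (c < 0)
		{
			lo = mid + 1;
			lo_pfx = nequal;
		}
		else
		{
			hi = mid;
			hi_pfx = nequal;
		}
	}
	printf("first match at offset %d, attribute comparisons: %ld\n", lo, ncmp);
	return 0;
}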

I think the idea looks good to me.

I was looking into the 0001 patch, and I have one point of confusion in the
hunk below in the _bt_moveright function.  If the parent
page's right key exactly matches the HIGH key of the child page,
then I suppose that while doing the "_bt_compare" with the HIGH key we can
use the optimization, right?  That is, the column number from which we
start the comparison should be the one passed in by the caller.  But
in the hunk below, you are always passing 'cmpcol', which is 1.
I think this should be '*comparecol', because '*comparecol' will either
hold the value passed by the parent if the high key data exactly matches
the parent's right tuple, or it will hold 1 in case it doesn't
match.  Am I missing something?


@@ -247,13 +256,16 @@ _bt_moveright(Relation rel,
{

+ if (P_IGNORE(opaque) ||
+ _bt_compare(rel, key, page, P_HIKEY, &cmpcol) >= cmpval)
+ {
+ *comparecol = 1;
}

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Support to define custom wait events for extensions

2023-06-23 Thread Masahiro Ikeda

Hi,

I updated the patches to handle the warning mentioned
by PostgreSQL Patch Tester, and removed unnecessary spaces.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From 1bb78fa2cbe6b030cea7a570bec88bd4d68f314a Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Fri, 23 Jun 2023 17:38:38 +0900
Subject: [PATCH 2/2] Add test codes for custom wait events

---
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 .../test_custom_wait_events/.gitignore|   2 +
 .../modules/test_custom_wait_events/Makefile  |  23 +++
 .../test_custom_wait_events/meson.build   |  33 
 .../test_custom_wait_events/t/001_basic.pl|  34 
 .../test_custom_wait_events--1.0.sql  |  14 ++
 .../test_custom_wait_events.c | 182 ++
 .../test_custom_wait_events.control   |   3 +
 9 files changed, 293 insertions(+)
 create mode 100644 src/test/modules/test_custom_wait_events/.gitignore
 create mode 100644 src/test/modules/test_custom_wait_events/Makefile
 create mode 100644 src/test/modules/test_custom_wait_events/meson.build
 create mode 100644 src/test/modules/test_custom_wait_events/t/001_basic.pl
 create mode 100644 src/test/modules/test_custom_wait_events/test_custom_wait_events--1.0.sql
 create mode 100644 src/test/modules/test_custom_wait_events/test_custom_wait_events.c
 create mode 100644 src/test/modules/test_custom_wait_events/test_custom_wait_events.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 6331c976dc..84312b7e57 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -17,6 +17,7 @@ SUBDIRS = \
 		  test_bloomfilter \
 		  test_copy_callbacks \
 		  test_custom_rmgrs \
+		  test_custom_wait_events \
 		  test_ddl_deparse \
 		  test_extensions \
 		  test_ginpostinglist \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 17d369e378..83a1d8fd19 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -14,6 +14,7 @@ subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
+subdir('test_custom_wait_events')
 subdir('test_ddl_deparse')
 subdir('test_extensions')
 subdir('test_ginpostinglist')
diff --git a/src/test/modules/test_custom_wait_events/.gitignore b/src/test/modules/test_custom_wait_events/.gitignore
new file mode 100644
index 00..716e17f5a2
--- /dev/null
+++ b/src/test/modules/test_custom_wait_events/.gitignore
@@ -0,0 +1,2 @@
+# Generated subdirectories
+/tmp_check/
diff --git a/src/test/modules/test_custom_wait_events/Makefile b/src/test/modules/test_custom_wait_events/Makefile
new file mode 100644
index 00..dda365d523
--- /dev/null
+++ b/src/test/modules/test_custom_wait_events/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_custom_wait_events/Makefile
+
+MODULE_big = test_custom_wait_events
+OBJS = \
+	$(WIN32RES) \
+	test_custom_wait_events.o
+PGFILEDESC = "test_custom_wait_events - test custom wait events"
+
+EXTENSION = test_custom_wait_events
+DATA = test_custom_wait_events--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_custom_wait_events
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_custom_wait_events/meson.build b/src/test/modules/test_custom_wait_events/meson.build
new file mode 100644
index 00..8ad073e577
--- /dev/null
+++ b/src/test/modules/test_custom_wait_events/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) PostgreSQL Global Development Group
+
+test_custom_wait_events = files(
+  'test_custom_wait_events.c',
+)
+
+if host_system == 'windows'
+  test_custom_wait_events += rc_lib_gen.process(win32ver_rc, extra_args: [
+'--NAME', 'test_custom_wait_events',
+'--FILEDESC', 'test_custom_wait_events - test custom wait events',])
+endif
+
+test_custom_wait_events = shared_module('test_custom_wait_events',
+  test_custom_wait_events,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_custom_wait_events
+
+test_install_data += files(
+  'test_custom_wait_events.control',
+  'test_custom_wait_events--1.0.sql',
+)
+
+tests += {
+  'name': 'test_custom_wait_events',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+'tests': [
+  't/001_basic.pl',
+],
+  },
+}
diff --git a/src/test/modules/test_custom_wait_events/t/001_basic.pl b/src/test/modules/test_custom_wait_events/t/001_basic.pl
new file mode 100644
index 00..b43c82a713
--- /dev/null
+++ b/src/test/modules/test_custom_wait_events/t/001_basic.pl
@@ -0,0 +1,34 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = 

Re: Do we want a hashset type?

2023-06-23 Thread jian he
On Fri, Jun 23, 2023 at 4:23 PM Joel Jacobson  wrote:

> On Fri, Jun 23, 2023, at 08:40, jian he wrote:
> > I played around array_func.c
> > many of the code can be used for multiset data type.
> > now I imagine multiset as something like one dimension array. (nested
> > is somehow beyond the imagination...).
>
> Are you suggesting it might be a better idea to start over completely
> and work on a new code base that is based on arrayfuncs.c,
> and aim for MULTISET/SET or anyhashset from start, that would not
> only support int4/int8/uuid but any type?
>
> /Joel
>

select prosrc from pg_proc where proname ~*
'(hash.*extended)|(extended.*hash)';
returns around 30 rows,
so it's fairly generic?

I tend to think of set/multiset as a one-dimensional array, so the textual input
should look like a one-dimensional array.
We can use the array_func.c functions to parse and validate the input.
So: different types, one input validation function.

Does this make sense?


Re: Allow pg_archivecleanup to remove backup history files

2023-06-23 Thread torikoshia

On 2023-06-22 16:47, Kyotaro Horiguchi wrote:
Thanks for your review!



0002:

+* Check file name.
+*
+* We skip files which are not WAL file or partial WAL file.

There's no need to spend this many lines describing this, and it's
suggested to avoid restating what the code does. I think a comment
might not be necessary. But if we decide to have one, it could look
like this:


   /* we're only removing specific types of files */


As you mentioned, this comment is a restatement of the code.
Removed the comment.



Other than that, it looks good to me.


0003:
+   
+ Remove backup history files.

I might be overthinking it, but I'd phrase it as, "Remove backup
history files *as well*.".  (The --help message describes the same
thing using "including".)


Agreed.


+ For details about backup history file, please refer to the
.

We usually write this as simple as "See  for details (of the
backup history files)" or "Refer to  for more information
(about the backup history files)." or such like... (I think)


Agreed. I used the former one.


+bool   cleanBackupHistory = false; /* remove files including
+   
 * backup history files */

The indentation appears to be inconsistent.


Modified.




-	 * Truncation is essentially harmless, because we skip names of
-	 * length other than XLOG_FNAME_LEN.  (In principle, one could use
-	 * a 1000-character additional_ext and get trouble.)
+	 * Truncation is essentially harmless, because we check the file
+	 * format including the length immediately after this.
+	 * (In principle, one could use a 1000-character additional_ext
+	 * and get trouble.)
 	 */
 	strlcpy(walfile, xlde->d_name, MAXPGPATH);
 	TrimExtension(walfile, additional_ext);

The revised comment seems to drift from the original point. Except for
a potential exception by a too-long addition_ext, the original comment
asserted that the name truncation was safe since it wouldn't affect
the files we're removing. In other words, it asserted that the
filenames to be removed won't be truncated and any actual truncation
wouldn't lead to accidental deletions.

Hence, I think we should adjust the comment to maintain its original
point, and extend it to include backup history files. A possible
revision could be (very simple):

		 * Truncation is essentially harmless, because we skip names of length
		 * longer than the length of backup history file.  (In principle, one
		 * could use a 1000-character additional_ext and get trouble.)


This is true, but the code below does a stricter check to prevent accidental
deletion than just skipping "names of length longer
than the length of backup history file".

|			if (!IsXLogFileName(walfile) && !IsPartialXLogFileName(walfile) &&
|				!(cleanBackupHistory && IsBackupHistoryFileName(walfile)))
|				continue;

How about something like this?

| Truncation is essentially harmless, because we skip files whose format
| is different from WAL files and backup history files.



Regarding the TAP test, it checks that the --clean-backup-history does
indeed remove backup history files. However, it doesn't check that
this doesn't occur if the option isn't specified. Shouldn't we test
for the latter scenario as well?


Agreed.
Added a backup history file to @walfiles_with_gz.


+sub get_walfiles
+{

+
+create_files(get_walfiles(@walfiles_with_gz));

The new function get_walfiles() just increases the line count without
cutting any lines. The following changes are sufficient and easier to
read (at least for me).


open my $file, '>', "$tempdir/$fn->{name}";



foreach my $fn (map {$_->{name}} @walfiles_with_gz)


Agreed.
Remove get_walfiles() and added some changes as above.

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

From f9487f831436ff097595f891e5f4e796fb9943a4 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Wed, 21 Jun 2023 21:42:08 +0900
Subject: [PATCH v12 1/3] Introduce pg_archivecleanup into getopt_long

This patch is a preliminary step to add an easy-to-understand option
to delete backup history files, but it also adds long options to
the existing options.

Reviewed-by: Michael Paquier, Kyotaro Horiguchi, Fujii Masao
---
 doc/src/sgml/ref/pgarchivecleanup.sgml|  5 -
 src/bin/pg_archivecleanup/pg_archivecleanup.c | 22 +--
 2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/ref/pgarchivecleanup.sgml b/doc/src/sgml/ref/pgarchivecleanup.sgml
index 635e7c7685..09991c2fcd 100644
--- a/doc/src/sgml/ref/pgarchivecleanup.sgml
+++ b/doc/src/sgml/ref/pgarchivecleanup.sgml
@@ -95,6 +95,7 @@ pg_archivecleanup:  removing file 

Re: Assert while autovacuum was executing

2023-06-23 Thread Amit Kapila
On Thu, Jun 22, 2023 at 9:16 AM Amit Kapila  wrote:
>
> On Wed, Jun 21, 2023 at 10:57 AM Andres Freund  wrote:
> >
> > As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that 
> > likely
> > also means 3ba59ccc89 is not right.
> >
>
> Indeed. I was thinking of a fix but couldn't find one yet. One idea I
> am considering is to allow catalog table locks after page lock but I
> think apart from hacky that also won't work because we still need to
> remove the check added for page locks in the deadlock code path in
> commit 3ba59ccc89 and may need to do something for group locking.
>

I have further thought about this part and I think even if we remove
the changes in commit 72e78d831a (remove the assertion for page locks
in LockAcquireExtended()) and remove the check added for page locks in
FindLockCycleRecurseMember() via commit 3ba59ccc89, it is still okay
to keep the change related to "allow page lock to conflict among
parallel group members" in LockCheckConflicts(). This is because locks
on catalog tables don't conflict among group members. So, we shouldn't
see a deadlock among parallel group members. Let me try to explain
this thought via an example:

Begin;
Lock pg_enum in Access Exclusive mode;
gin_clean_pending_list() -- assume this function is executed by both
leader and parallel worker; also this requires a lock on pg_enum as
shown by Andres in email [1]

Say the parallel worker acquires page lock first and it will also get
lock on pg_enum because of group locking, so, the leader backend will
wait for page lock for the parallel worker. Eventually, the parallel
worker will release the page lock and the leader backend can get the
lock. So, we should be still okay with parallelism.

OTOH, if the above theory is wrong or people are not convinced, I am
okay with removing all the changes in commits 72e78d831a and
3ba59ccc89.

[1] - 
https://www.postgresql.org/message-id/20230621052713.wc5377dyslxpckfj%40awork3.anarazel.de

-- 
With Regards,
Amit Kapila.




Re: Do we want a hashset type?

2023-06-23 Thread Joel Jacobson
On Fri, Jun 23, 2023, at 08:40, jian he wrote:
> I played around array_func.c
> many of the code can be used for multiset data type.
> now I imagine multiset as something like one dimension array. (nested 
> is somehow beyond the imagination...).

Are you suggesting it might be a better idea to start over completely
and work on a new code base that is based on arrayfuncs.c,
and aim for MULTISET/SET or anyhashset from start, that would not
only support int4/int8/uuid but any type?

/Joel




Re: Deleting prepared statements from libpq.

2023-06-23 Thread Michael Paquier
On Fri, Jun 23, 2023 at 09:39:00AM +0200, Jelte Fennema wrote:
> To be clear, it didn't actually change the behaviour. I only changed
> the error message, since it said the exact opposite of what it was
> expecting. I split this minor fix into its own commit now to clarify
> that. I think it would even make sense to commit this small patch to
> the PG16 branch, since it's a bugfix in the tests (and possibly even
> back-patch to others if that's not a lot of work).

Yes, agreed that this had better be fixed and backpatched.  Doing so
avoids some backpatching fuzz, and the message is incorrect as it stands.
--
Michael


signature.asc
Description: PGP signature


patch: improve "user mapping not found" error message

2023-06-23 Thread Ian Lawrence Barwick
Hi

Mild corner-case annoyance while doing Random Experimental Things:

postgres=# SELECT * FROM parttest;
ERROR:  user mapping not found for "postgres"

Oky, but which server?

postgres=# \det
   List of foreign tables
 Schema | Table |   Server
+---+---
 public | parttest_10_1 | fdw_node2
 public | parttest_10_3 | fdw_node3
 public | parttest_10_5 | fdw_node4
 public | parttest_10_7 | fdw_node5
 public | parttest_10_9 | fdw_node6
(5 rows)

(Muffled sound of small patch hatching) aha:

postgres=# SELECT * FROM parttest;
ERROR:  user mapping not found for user "postgres", server "fdw_node5"

Regards

Ian Barwick
commit 1e08150c0bba813050b00e4e35cdba1572fd5564
Author: Ian Barwick 
Date:   Fri Jun 23 16:36:41 2023 +0900

Improve "user mapping not found" error message

Display the name of the foreign server for which the user mapping
was not found.

diff --git a/src/backend/foreign/foreign.c b/src/backend/foreign/foreign.c
index ca3ad55b62..b649064141 100644
--- a/src/backend/foreign/foreign.c
+++ b/src/backend/foreign/foreign.c
@@ -217,10 +217,15 @@ GetUserMapping(Oid userid, Oid serverid)
 	}
 
 	if (!HeapTupleIsValid(tp))
+	{
+		ForeignServer *server = GetForeignServer(serverid);
+
 		ereport(ERROR,
 (errcode(ERRCODE_UNDEFINED_OBJECT),
- errmsg("user mapping not found for \"%s\"",
-		MappingUserName(userid))));
+ errmsg("user mapping not found for user \"%s\", server \"%s\"",
+		MappingUserName(userid),
+		server->servername)));
+	}
 
 	um = (UserMapping *) palloc(sizeof(UserMapping));
 	um->umid = ((Form_pg_user_mapping) GETSTRUCT(tp))->oid;


Re: Deleting prepared statements from libpq.

2023-06-23 Thread Jelte Fennema
On Fri, 23 Jun 2023 at 05:59, Michael Paquier  wrote:
> [...]
> res = PQgetResult(conn);
> if (res == NULL)
> -   pg_fatal("expected NULL result");
> +   pg_fatal("expected non-NULL result");
>
> This should check for the NULL-ness of the result returned for
> PQclosePrepared() rather than changing the behavior of the follow-up
> calls?

To be clear, it didn't actually change the behaviour. I only changed
the error message, since it said the exact opposite of what it was
expecting. I split this minor fix into its own commit now to clarify
that. I think it would even make sense to commit this small patch to
the PG16 branch, since it's a bugfix in the tests (and possibly even
back-patch to others if that's not a lot of work). I changed the error
message to be in line with one from earlier in the test.

I addressed all of your other comments.


v6-0002-Support-sending-Close-messages-from-libpq.patch
Description: Binary data


v6-0001-Correct-error-message-in-test_prepared.patch
Description: Binary data


Index range search optimization

2023-06-23 Thread Konstantin Knizhnik

Hi hackers.

_bt_readpage performs a key check for each item on the page while trying to
locate the upper boundary.
While comparisons of simple integer keys are very fast, comparisons of
long strings can be quite expensive.
We can first check the largest key on the page, and if it is not
larger than the upper boundary, skip the checks for all the other elements.


On this quite artificial example such an optimization gives a 3x speedup:

create table t(t text primary key);
insert into t values ('primary key-'||generate_series(1,1000)::text);
select count(*) from t where t between 'primary key-100' and 'primary key-200';

On my notebook, with shared buffers large enough and concurrency disabled,
the difference is 83 vs. 247 msec.

For integer keys the difference is much smaller: 69 vs. 82 msec.

Certainly I realize that this example is quite exotic: most DBAs
prefer integer keys, and such large ranges are quite rare.

But such large range queries are still used.
And I have checked that the proposed patch doesn't slow down exact searches.

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 263f75fce9..0f6a767409 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -888,7 +888,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * Examine the scan keys and eliminate any redundant keys; also mark the
 	 * keys that must be matched to continue the scan.
 	 */
-	_bt_preprocess_keys(scan);
+	_bt_preprocess_keys(scan, dir);
 
 	/*
 	 * Quit now if _bt_preprocess_keys() discovered that the scan keys can
@@ -1531,6 +1531,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	bool		continuescan;
 	int			indnatts;
+	IndexTuple	itup;
+	bool		all_fit = false;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1584,6 +1586,17 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	 */
 	Assert(BTScanPosIsPinned(so->currPos));
 
+	/* Do not perform this optimization for the first index page, to avoid slowing down point queries.
+	 * so->has_matches is set at the end of _bt_readpage, so we do not try to perform this optimization
+	 * at the first invocation of _bt_readpage. It also ensures that we found items matching the low boundary, so only the upper boundary has to be checked.
+	 */
+	if (so->is_range_search	&& so->has_matches) {
+		bool		temp;
+		/* Try first to compare the minimal/maximal key on the page to avoid checks of all other items on the page */
+		itup = (IndexTuple) PageGetItem(page,  PageGetItemId(page, ScanDirectionIsForward(dir) ? maxoff : minoff));
+		all_fit = _bt_checkkeys(scan, itup, indnatts, dir, &temp);
+	}
+
 	if (ScanDirectionIsForward(dir))
 	{
 		/* load items[] in ascending order */
@@ -1594,7 +1607,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		while (offnum <= maxoff)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
-			IndexTuple	itup;
 
 			/*
 			 * If the scan specifies not to return killed tuples, then we
@@ -1608,7 +1620,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			itup = (IndexTuple) PageGetItem(page, iid);
 
-			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+			if (all_fit || _bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 /* tuple passes all scan key conditions */
 if (!BTreeTupleIsPosting(itup))
@@ -1661,9 +1673,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (continuescan && !P_RIGHTMOST(opaque))
 		{
 			ItemId		iid = PageGetItemId(page, P_HIKEY);
-			IndexTuple	itup = (IndexTuple) PageGetItem(page, iid);
 			int			truncatt;
 
+			itup = (IndexTuple) PageGetItem(page, iid);
 			truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
			_bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
 		}
@@ -1686,7 +1698,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		while (offnum >= minoff)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
-			IndexTuple	itup;
 			bool		tuple_alive;
 			bool		passes_quals;
 
@@ -1716,8 +1727,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			itup = (IndexTuple) PageGetItem(page, iid);
 
-			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
-										 &continuescan);
+			passes_quals = all_fit || _bt_checkkeys(scan, itup, indnatts, dir,
+													&continuescan);
 			if (passes_quals && tuple_alive)
 			{
 /* tuple passes all scan key conditions */
@@ -1771,8 +1782,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
 		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
 	}
-
-	return (so->currPos.firstItem <= so->currPos.lastItem);
+	return so->has_matches = so->currPos.firstItem <= so->currPos.lastItem;
 }
 
 /* Save an index item into so->currPos.items[itemIndex] */
diff --git a/src/backend/access/nbtree/nbtutils.c 

Re: Do we want a hashset type?

2023-06-23 Thread jian he
I played around with array_func.c;
much of the code can be used for a multiset data type.
For now I imagine a multiset as something like a one-dimensional array (nested
multisets are somehow beyond the imagination...).

* A standard varlena array has the following internal structure:
 *- standard varlena header word
 *- number of dimensions of the array
 *- offset to stored data, or 0 if no nulls bitmap
 *- element type OID
 *- length of each array axis (C array of int)
 *- lower boundary of each dimension (C array of int)
 *- bitmap showing locations of nulls (OPTIONAL)
 *- whatever is the stored data

For set/multiset, we don't need {ndim, lower bnds}, since we only have one
dimension, and we also don't need subscripts.
So for a set we could have the following:
*  int32 vl_len_; /* varlena header (do not touch directly!) */
*  int32   capacity; /* number of allocated slots */
*  int32 dataoffset; /* offset to data, or 0 if no bitmap */
*  int32 nelements; /* number of items added to the hashset */
*  Oid elemtype; /* element type OID */
 *   - bitmap showing locations of nulls (OPTIONAL)
 *   - bitmap showing whether each slot is empty or not (I am not sure about this part)
 *   - whatever is the stored data
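
Rendered as a C struct, the layout sketched above would look roughly like
this (an illustration of the proposal only; the name Hashset and the
trailing-data layout are assumptions):

typedef struct Hashset
{
	int32		vl_len_;		/* varlena header (do not touch directly!) */
	int32		capacity;		/* number of allocated slots */
	int32		dataoffset;		/* offset to data, or 0 if no nulls bitmap */
	int32		nelements;		/* number of items stored in the set */
	Oid			elemtype;		/* element type OID */
	/* followed by: nulls bitmap (optional), used-slot bitmap, element data */
} Hashset;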

Much of the code in array_func.c can be reused:
array_isspace ==> set_isspace
ArrayMetaState  ==> SetMetastate
ArrayCount ==> SetCount (like ArrayCount, returns the dimension of the
set, which should be zero (empty set) or one)
ArrayParseState ==> SetParseState
ReadArrayStr ==>  ReadSetStr

Attached is a demo that shows using array_func.c to parse a cstring, with an
effect similar to array_in.
For the multiset_in set-type input function: if no duplicate check is required,
then multiset_in would work just like an array, so more code can be copied from
array_func.c;
but if uniqueness is required, then we first need to palloc0(capacity * datum
size per type) and then put each valid value into a specific slot?




On Fri, Jun 23, 2023 at 6:27 AM Tomas Vondra 
wrote:

> On 6/22/23 19:52, Joel Jacobson wrote:
> > On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
> >> This is also what the SQL standard does for multisets - there's SQL:20nn
> >> draft at http://www.wiscorp.com/SQLStandards.html, and the  >> predicate> section (p. 475) explains how this should work with NULL.
> >
> > I've looked again at the paper you mentioned and found something
> intriguing
> > in section 2.6 (b). I'm a bit puzzled about this: why would we want to
> return
> > null when we're certain it's not null but just doesn't have any elements?
> >
> > In the same vein, it says, "If it has more than one element, an
> exception is
> > raised." Makes sense to me, but what about when there are no elements at
> all?
> > Why not raise an exception in that case too?
> >
> > The ELEMENT function is designed to do one simple thing: return the
> element of
> > a multiset if the multiset has only 1 element. This seems very similar
> to how
> > our INTO STRICT operates, right?
> >
>
> I agree this looks a bit weird, but that's what I mentioned - this is an
> initial a proposal, outlining the idea. Inevitably some of the stuff
> will get reworked or just left out of the final version. It's useful
> mostly to explain the motivation / goal.
>
> I believe that's the case here - I don't think the ELEMENT got into the
> standard at all, and the NULL rules for the MEMBER OF clause seem not to
> have these strange bits.
>
> > The SQL:20nn seems to still be in draft form, and I can't help but
> wonder if we
> > should propose a bit of an improvement here:
> >
> > "If it doesn't have exactly one element, an exception is raised."
> >
> > Meaning, it would raise an exception both if there are more elements,
> > or zero elements (no elements).
> >
> > I think this would make the semantics more intuitive and less surprising.
> >
>
> Well, the simple truth is the draft is freely available, but you'd need
> to buy the final version. It doesn't mean it's still being worked on or
> that no SQL standard was released since then. In fact, SQL 2023 was
> released a couple weeks ago [1].
>
> It'd be interesting to know the version that actually got into the SQL
> standard (if at all), but I don't have access to the standard yet.
>
> regards
>
>
> [1] https://www.iso.org/standard/76584.html
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
 I recommend David Deutsch's <>

  Jian

/*
gcc -I/home/jian/postgres/2023_05_25_beta5421/include/server -fPIC -c /home/jian/Desktop/regress_pgsql/set.c
gcc -shared  -o /home/jian/Desktop/regress_pgsql/set.so /home/jian/Desktop/regress_pgsql/set.o

CREATE OR REPLACE FUNCTION set_in_test(cstring,int, int) RETURNS BOOL SET search_path from current
AS '/home/jian/Desktop/regress_pgsql/set', 'set_in_test'
LANGUAGE C IMMUTABLE;
select  set_in_test('{1,2,3,NULL,NULL, NULL}',23,-1);

*/
#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "utils/builtins.h"
// #include "utils/array.h"

Re: memory leak in trigger handling (since PG12)

2023-06-23 Thread Alexander Pyhalov

Tomas Vondra wrote on 2023-06-22 17:16:

On 6/22/23 13:46, Tomas Vondra wrote:

...

I haven't tried the reproducer, but I think I see the issue - we store
the bitmap as part of the event to be executed later, but the bitmap is
in per-tuple context and gets reset. So I guess we need to copy it into
the proper long-lived context (e.g. AfterTriggerEvents).

I'll get that fixed.
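
A minimal sketch of that kind of fix (illustration only; the context
variable and the field name are assumptions here, not necessarily what the
actual patch does):

	/* copy the bitmap out of the per-tuple context before storing it */
	MemoryContext oldcxt;

	oldcxt = MemoryContextSwitchTo(event_cxt);	/* assumed long-lived context */
	new_shared.ats_modifiedcols = bms_copy(modifiedCols);
	MemoryContextSwitchTo(oldcxt);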



Alexander, can you try if this fixes the issue for you?


regards


Hi.
The patch fixes the problem and looks good to me.
--
Best regards,
Alexander Pyhalov,
Postgres Professional