date:20221018

Re: Getting rid of SQLValueFunction

2022-10-18 Thread Michael Paquier

(Adding Tom in CC, in case.)

On Tue, Oct 18, 2022 at 04:35:33PM -0400, Corey Huinker wrote:
> I agree that we shouldn't spend too much effort on a good error message
> here, but perhaps we should have the message mention that it is
> date/time-related? A person git-grepping for this error message will get 4
> hits in .c files (date.c, timestamp.c, varbit.c, varchar.c) so even a
> slight variation in the message could save them some time.

The message is the same between HEAD and the patch and these have been
around for a long time, except that we would see it at parsing time on
HEAD, and at executor time with the patch.  I would not mind changing
if there are better ideas than what's used now, of course ;)

> This is an extreme nitpick, but the patchset seems like it should have been
> 1 file or 3 (remove name functions, remove time functions, remove
> SQLValueFunction infrastructure), but that will only matter in the unlikely
> case that we find a need for SQLValueFunction but we want to leave the
> timestamp function as COERCE_SQL_SYNTAX.

Once the timestamp functions are removed, SQLValueFunction is just
dead code so including its removal in 0002 does not change much in my
opinion.

An other thing I had on my list for this patch was to check its
performance impact.  So I have spent some time having a look at the
perf profiles produced on HEAD and with the patch using queries like
SELECT current_role FROM generate_series(1,N) where N > 10M and I have
not noticed any major differences in runtime or in the profiles, at
the difference that we don't have anymore SQLValueFunction() and its
internal functions called, hence they are missing from the stacks, but
that's the whole point of the patch.

With this in mind, would somebody complain if I commit that?  That's a
nice reduction in code, while completing the work done in 40c24bf:
 25 files changed, 338 insertions(+), 477 deletions(-)

Thanks,
--
Michael

signature.asc
Description: PGP signature

Re: Bug: pg_regress makefile does not always copy refint.so

2022-10-18 Thread Alvaro Herrera

Hi,

On 2022-Oct-18, Donghang Lin wrote:

> > I have a somewhat-related-but-not-really complaint.  I recently had need to
> > have refint.so, autoinc.so and regress.so in the install directory; but it
> > turns out that there's no provision at all to get them installed.

> The current pgsql install directory of regress only contains pg_regress 
> binary,
> Do you suggest we add these files (excluding the scratched files) to the 
> regress install directory?
> autoinc.so  data  expected  Makefile  parallel_schedule  pg_regress  
> pg_regress.c  pg_regress.h  pg_regress_main.c  README  refint.so  regress.c  
> regressplans.sh  regress.so  resultmap  sql

No, I think the .c/.h files are likely included only because the RPM
rule is written somewhat carelessly.  If we add support in our
makefiles, it would have to be something better-considered.

> I think this build bug is orthogonal to the inconvenient 
> installation/packaging.
> It produces inconsistent build result, e.g you have to run `make` twice to 
> ensure newly built refint.so is copied to the build dir.

Yes, I agree that it is orthogonal.  I'm not sure that what you propose
(changing these order-only dependencies into regular dependencies) is
the best possible fix, but I agree we need *some* fix.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/

Re: fix archive module shutdown callback

2022-10-18 Thread Michael Paquier

On Tue, Oct 18, 2022 at 02:38:07PM +0530, Bharath Rupireddy wrote:
> 2) Clarifies when the archive module shutdown callback gets called in
> documentation.

I have looked at that, and it was actually confusing as the callback
would also be called on reload if archive_library changes, but the
update somewhat outlines that this would happen only on postmaster
shutdown.

> 3) Defines a shutdown callback that just emits a log message in
> shell_archive.c and tests it.

The test had a few issues:
- No need to wait for postmaster.pid in the test, as pg_ctl does this
job.
- The reload can be time-sensitive on slow machines, so I have added a
query run to make sure that the reload happens before stopping the
server.
- slurp_file() was feeding on the full log file of standby2, but we
should load it from the log location before stopping the server, even
if log_min_messages was updated only at the end of the test.

And done, after tweaking a few more things.
--
Michael


signature.asc
Description: PGP signature

Re: thinko in basic_archive.c

2022-10-18 Thread Bharath Rupireddy

On Wed, Oct 19, 2022 at 8:58 AM Kyotaro Horiguchi
 wrote:
>
> Unbounded number of sequential crash-restarts itself is a more serious
> problem..

Agree. The basic_archive module currently leaves temp files around
even for normal restarts of the cluster, which is bad IMO.

> An archive module could clean up them at startup or at the first call
> but that might be dangerous (since archive directory is I think
> thought as external resource).

The archive module must be responsible for cleaning up the temp file
that it creates. One way to do it is in the archive module's shutdown
callback, this covers most of the cases, but not all.

> Honestly, I'm not sure about a reasonable scenario where simultaneous
> archivings of a same file is acceptable, though. I feel that we should
> not allow simultaneous archiving of the same segment by some kind of
> interlocking. In other words, we might should serialize duplicate
> archiving of asame file.

In a typical production environment, there's some kind of locking (for
instance lease files) that allow/disallow file
creation/deletion/writes/reads which guarantees that the same file
isn't put into a directory (can be archive location) many times. And
as you rightly said archive_directory is something external to
postgres and we really can't deal with concurrent writers
writing/creating the same files. Even if we somehow try to do it, it
makes things complicated. This is true for any PGDATA directories.
However, the archive module implementers can choose to define such a
locking strategy.

> In another direction, the current code allows duplicate simultaneous
> copying to temporary files with different names then the latest
> renaming wins.  We reach the almost same result (on Linuxen (or
> POSIX?))  by unlinking the existing tempfile first then create a new
> one with the same name then continue. Even if the tempfile were left
> alone after a crash, that file would be unlinked at the next trial of
> archiving. But I'm not sure how this can be done on Windows..  In the
> first place I'm not sure that the latest-unconditionally-wins policy
> is appropriate or not, though.

We really can't just unlink the temp file because it has pid and
timestamp in the filename and it's hard to determine the temp file
that we created earlier.

As far as the basic_archive module is concerned, we ought to keep it
simple. I still think the simplest we can do is to use the
basic_archive's shutdown_cb to delete (in most of the cases, but not
all) the left-over temp file that the module is dealing with
as-of-the-moment and add a note about the users dealing with
concurrent writers to the basic_archive.archive_directory like the
attached v2 patch.

> > A simple server restart while the basic_archive module is copying
> > to/from temp file would leave the file behind, see[1]. I think we can
> > fix this by defining shutdown_cb for the basic_archive module, like
> > the attached patch. While this helps in most of the crashes, but not
> > all. However, this is better than what we have right now.
>
> ShutdownWalRecovery() does the similar thing, but as you say this one
> covers rather narrow cases than that since RestoreArchiveFile()
> finally overwrites the left-alone files at the next call for that
> file.

We're using unique temp file names in the basic_archive module so we
can't overwrite the same upon restart.

> # The patch seems forgetting to clear the tmepfilepath *after* a
> # successful renaming.

It does so at the beginning of basic_archive_file() which is sufficient.

> And I don't see how the callback is called.

call_archive_module_shutdown_callback()->basic_archive_shutdown().

Please see the attached v2 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

v2-0001-Remove-leftover-temporary-files-via-basic_archive.patch
Description: Binary data

Re: create subscription - improved warning message

2022-10-18 Thread Peter Smith

On Wed, Oct 19, 2022 at 2:44 PM shiy.f...@fujitsu.com
 wrote:
>
...

>
> +test_sub=# CREATE SUBSCRIPTION sub1
> +test_sub-# CONNECTION 'host=localhost dbname=test_pub'
> +test_sub-# PUBLICATION pub1
> +test_sub-# WITH (slot_name=NONE, enabled=false, create_slot=false);
> +WARNING:  subscription was created, but is not connected
> +HINT:  To initiate replication, you must manually create the replication 
> slot, enable the subscription, and refresh the subscription.
> +CREATE SUBSCRIPTION
>
> In example 3, there is actually no such warning message when creating
> subscription because "connect=false" is not specified.
>

Oh, thanks for finding and reporting that. Sorry for my cut/paste
errors. Fixed in v7. PSA,

> I have tested the examples in the patch and didn't see any problem other than
> the one above.
>

Thanks for your testing.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v7-0001-create-subscription-pgdocs-for-deferred-slot-crea.patch
Description: Binary data

Re: make_ctags: use -I option to ignore pg_node_attr macro

2022-10-18 Thread Yugo NAGATA

On Wed, 19 Oct 2022 13:25:17 +0900 (JST)
Tatsuo Ishii  wrote:

> > Hi,
> > 
> > On Sat, 15 Oct 2022 10:40:29 +0900 (JST)
> > Tatsuo Ishii  wrote:
> > 
> >> > On Thu, 13 Oct 2022 15:35:09 +0900 (JST)
> >> > Tatsuo Ishii  wrote:
> >> > 
> >> >> > OK, that sounds good then.  I would make a feature request to have a
> >> >> > switch that supresses creation of these links, then.
> >> >> 
> >> >> Ok, I have added "-n" option to make_ctags so that it skips to create
> >> >> the links.
> >> >> 
> >> >> Also I have changed make_etags so that it exec make_ctags, which seems
> >> >> to be the consensus.
> >> > 
> >> > Thank you for following up my patch.
> >> > I fixed the patch to allow use both -e and -n options together.
> >> 
> >> Thanks. I have made mostly cosmetic changes so that it is more
> >> consistent with existing scripts.
> >> 
> >> I would like to push v6 patch if there's no objection.
> > 
> > I am fine with this patch.
> 
> Thanks. the v6 patch pushed to master branch.

Thanks!

> > By the way, in passing, how about adding "tags" and "TAGS" to
> > .gitignore file?
> 
> Sounds like a good idea.

Ok, the patch is attached.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
diff --git a/.gitignore b/.gitignore
index 794e35b73c..ca1413427b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -31,6 +31,8 @@ win32ver.rc
 *.exe
 lib*dll.def
 lib*.pc
+tags
+TAGS
 
 # Local excludes in root directory
 /GNUmakefile

Re: make_ctags: use -I option to ignore pg_node_attr macro

2022-10-18 Thread Tatsuo Ishii

> Hi,
> 
> On Sat, 15 Oct 2022 10:40:29 +0900 (JST)
> Tatsuo Ishii  wrote:
> 
>> > On Thu, 13 Oct 2022 15:35:09 +0900 (JST)
>> > Tatsuo Ishii  wrote:
>> > 
>> >> > OK, that sounds good then.  I would make a feature request to have a
>> >> > switch that supresses creation of these links, then.
>> >> 
>> >> Ok, I have added "-n" option to make_ctags so that it skips to create
>> >> the links.
>> >> 
>> >> Also I have changed make_etags so that it exec make_ctags, which seems
>> >> to be the consensus.
>> > 
>> > Thank you for following up my patch.
>> > I fixed the patch to allow use both -e and -n options together.
>> 
>> Thanks. I have made mostly cosmetic changes so that it is more
>> consistent with existing scripts.
>> 
>> I would like to push v6 patch if there's no objection.
> 
> I am fine with this patch.

Thanks. the v6 patch pushed to master branch.

> By the way, in passing, how about adding "tags" and "TAGS" to
> .gitignore file?

Sounds like a good idea.

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada

On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada  wrote:
>
> On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada  wrote:
> >
> > On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila  wrote:
> > >
> > > On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > >
> > > > > I think because the test case proposed needs all three changes, we can
> > > > > push the change-1 without a test case and then as a second patch have
> > > > > change-2 for HEAD and change-3 for back branches with the test case.
> > > > > Do you have any other ideas to proceed here?
> > > >
> > > > I found another test case that causes the assertion failure at
> > > > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > > > attached the patch for the test case. In this test case, I modified a
> > > > user-catalog table instead of system-catalog table. That way, we don't
> > > > generate invalidation messages while generating NEW_CID records. As a
> > > > result, we mark only the subtransactions as containing catalog change
> > > > and don't make association between top-level and sub transactions. The
> > > > assertion failure happens on all supported branches. If we need to fix
> > > > this (I believe so), Change-2 needs to be backpatched to all supported
> > > > branches.
> > > >
> > > > There are three changes as Amit mentioned, and regarding the test
> > > > case, we have three test cases I've attached: truncate_testcase.patch,
> > > > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > > > between assertion failures and test cases are very complex. I could
> > > > not find any test case to cause only one assertion failure on all
> > > > branches. One idea to proceed is:
> > > >
> > > > Patch-1 includes Change-1 and is applied to all branches.
> > > >
> > > > Patch-2 includes Change-2 and the user_catalog test case, and is
> > > > applied to all branches.
> > > >
> > > > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > > > test case), and is applied to v14 and v15 (also till v11 if we
> > > > prefer).
> > > >
> > > > The patch-1 doesn't include any test case but the user_catalog test
> > > > case can test both Change-1 and Change-2 on all branches.
> > > >
> > >
> > > I was wondering if it makes sense to commit both Change-1 and Change-2
> > > together as one patch? Both assertions are caused by a single test
> > > case and are related to the general problem that the association of
> > > top and sub transaction is only guaranteed to be formed before we
> > > decode transaction changes. Also, it would be good to fix the problem
> > > with a test case that can cause it. What do you think?
> >
> > Yeah, it makes sense to me.
> >
>
> I've attached two patches that need to be back-patched to all branches
> and includes Change-1, Change-2, and a test case for them. FYI this
> patch resolves the assertion failure reported in this thread as well
> as one reported in another thread[2]. So I borrowed some of the
> changes from the patch[2] Osumi-san recently proposed.
>

I've attached patches for Change-3 with a test case. Please review them as well.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From 763cb604fae97131b5be2da36deb10054b6dbf3a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Wed, 19 Oct 2022 12:16:43 +0900
Subject: [PATCH v1] Don't assign subtransactions to top transaction in
 SnapBuildXidSetCatalogChanges().

Previously, when decoding the commit record of the transaction, we
mark both top-level transaction and its subtransactions as containing
catalog changes and assign the subtransactions to the top-level
transaction. However, if the subtransacitons have invalidation
messages, we missed executing them when forgetting the
transactions. In commit 272248a0c1 where we introduced
SnapBuildXidSetCatalogChanges(), the reason why we assigned them is
just to avoid the assertion failure in AssertTXNLsnOrder() as they
have the same LSN. Now that with commit XXX we skip this assertion
check until we reach the LSN at we start decoding the contents of the
transaciton, there is no reason for that anymore.

SnapBuildXidSetCatalogChanges() was introduced in 15 or older but this
bug doesn't exist in branches prior to 14 since we don't add
invalidation messages to subtransactions. We decided to backpatch
through 11 for consistency but don't for 10 since its final release
will come soon.

Reported-by: Osumi Takamichi
Author: Masahiko Sawada
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/TYAPR01MB58660803BCAA7849C8584AA4F57E9%40TYAPR01MB5866.jpnprd01.prod.outlook.com
Backpatch: Backpatch-through: 11
---
 .../expected/catalog_change_snapshot.out  | 45 +++
 .../specs/catalog_change_snapshot.spec|  8 
 src/backend/replication/logical/snapbuild.c   |  7 +--
 3 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expe

RE: create subscription - improved warning message

2022-10-18 Thread shiy.f...@fujitsu.com

On Tue, Oct 18, 2022 5:44 PM Peter Smith  wrote:
> 
> On Mon, Oct 17, 2022 at 7:11 PM shiy.f...@fujitsu.com
>  wrote:
> >
> ...
> >
> > Thanks for your patch. Here are some comments.
> >
> > In Example 2, the returned slot_name should be "myslot".
> >
> > +test_pub=# SELECT * FROM pg_create_logical_replication_slot('myslot',
> 'pgoutput');
> > + slot_name |lsn
> > +---+---
> > + sub1  | 0/19059A0
> > +(1 row)
> >
> 
> Oops. Sorry for my cut/paste error. Fixed in patch v6.
> 
> 
> > Besides, I am thinking is it possible to slightly simplify the example. For
> > example, merge example 1 and 2, keep the steps of example 2 and in the
> step of
> > creating slot, mention what should we do if slot_name is not specified when
> > creating subscription.
> >
> 
> Sure, it might be a bit shorter to combine the examples, but I thought
> it’s just simpler not to do it that way because the combined example
> will then need additional bullets/notes to say – “if there is no
> slot_name do this…” and “if there is a slot_name do that…”. It’s
> really only the activation part which is identical for both.
> 

Thanks for updating the patch.

+test_sub=# CREATE SUBSCRIPTION sub1
+test_sub-# CONNECTION 'host=localhost dbname=test_pub'
+test_sub-# PUBLICATION pub1
+test_sub-# WITH (slot_name=NONE, enabled=false, create_slot=false);
+WARNING:  subscription was created, but is not connected
+HINT:  To initiate replication, you must manually create the replication slot, 
enable the subscription, and refresh the subscription.
+CREATE SUBSCRIPTION

In example 3, there is actually no such warning message when creating
subscription because "connect=false" is not specified.

I have tested the examples in the patch and didn't see any problem other than
the one above.

Regards,
Shi yu

Re: thinko in basic_archive.c

2022-10-18 Thread Kyotaro Horiguchi

 +0530, Bharath Rupireddy  wrote in 
> On Mon, Oct 17, 2022 at 6:45 PM Robert Haas  wrote:
> >
> > On Fri, Oct 14, 2022 at 4:45 AM Bharath Rupireddy
> >  wrote:
> > > What happens to the left-over temp files after a server crash? Will
> > > they be lying around in the archive directory? I understand that we
> > > can't remove such files because we can't distinguish left-over files
> > > from a crash and the temp files that another server is in the process
> > > of copying.
> >
> > Yeah, leaving a potentially unbounded number of files around after
> > system crashes seems pretty unfriendly. I'm not sure how to fix that,
> > exactly.

Unbounded number of sequential crash-restarts itself is a more serious
problem..

An archive module could clean up them at startup or at the first call
but that might be dangerous (since archive directory is I think
thought as external resource).

Honestly, I'm not sure about a reasonable scenario where simultaneous
archivings of a same file is acceptable, though. I feel that we should
not allow simultaneous archiving of the same segment by some kind of
interlocking. In other words, we might should serialize duplicate
archiving of asame file.

In another direction, the current code allows duplicate simultaneous
copying to temporary files with different names then the latest
renaming wins.  We reach the almost same result (on Linuxen (or
POSIX?))  by unlinking the existing tempfile first then create a new
one with the same name then continue. Even if the tempfile were left
alone after a crash, that file would be unlinked at the next trial of
archiving. But I'm not sure how this can be done on Windows..  In the
first place I'm not sure that the latest-unconditionally-wins policy
is appropriate or not, though.

> A simple server restart while the basic_archive module is copying
> to/from temp file would leave the file behind, see[1]. I think we can
> fix this by defining shutdown_cb for the basic_archive module, like
> the attached patch. While this helps in most of the crashes, but not
> all. However, this is better than what we have right now.

ShutdownWalRecovery() does the similar thing, but as you say this one
covers rather narrow cases than that since RestoreArchiveFile()
finally overwrites the left-alone files at the next call for that
file.

# The patch seems forgetting to clear the tmepfilepath *after* a
# successful renaming. And I don't see how the callback is called.

> [1] ubuntu:~/postgres/contrib/basic_archive/archive_directory$ ls
> 00010001
> archtemp.00010002.2493876.1666091933457
> archtemp.00010002.2495316.1666091958680

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Mingw task for Cirrus CI

2022-10-18 Thread Andres Freund

Hi,

On 2022-10-19 00:23:46 +0300, Melih Mutlu wrote:
>  Right, setting CHERE_INVOKING and removing all cd's work and look better.
> Thanks for the suggestion.

Agreed, good idea.

> +
> +  env:
> +CCACHE_DIR: C:/msys64/ccache

It's a bit odd to separate the CCACHE_* variables from each other
(e.g. BUILD_DIR is inbetween them)...

> +BUILD_DIR: "%CIRRUS_WORKING_DIR%/build"
> +PYTHONHOME: C:/msys64/ucrt64

Perhaps add a comment explaining that otherwise plpython tests fail?

> +MSYS: winjitdebug

With this src/tools/ci/cores_backtrace.sh shouldn't need to be modified
anymore afaict?

> +CCACHE_SLOPPINESS: pch_defines,time_macros
> +CCACHE_DEPEND: 1

I experimented a bit and it looks like ccache doesn't yet quite work in CI,
but only because the ccache needs to be larger. Looks like we need about
~400MB.

A fully cached build is ~2min

> +  configure_script: |
> +%BASH_EXE% -lc "meson setup --buildtype debug -Dcassert=true 
> -Db_pch=true -DTAR=%TAR% build"

With these buildflags the tests take about 23min37s. Using -Og I saw 18min26s,
with -O1 18m57s and with -O2 18m38s. There's obviously a fair bit of variance,
but it looks like we should use -Og. I think we considered that making compile
times too bad before, but it seems kinda ok now, with 11min. -O2 is 13min,
without providing further benefits.

I'd replace --buildtype debug with -Ddebug=true -Doptimization=g.

> +  build_script: |
> +%BASH_EXE% -lc "cd %CIRRUS_WORKING_DIR% && ninja -C build"
>
Why do we still need this cd?

> +  upload_caches: ccache
> +
> +  test_world_script: |
> +%BASH_EXE% -lc "meson test --print-errorlogs --num-processes %TEST_JOBS% 
> -C build"

Seems like this could use %MTEST_ARGS%?

Greetings,

Andres Freund

Re: Make finding openssl program a configure or meson option

2022-10-18 Thread Michael Paquier

On Tue, Oct 18, 2022 at 06:46:53PM +0200, Peter Eisentraut wrote:
> On 12.10.22 03:08, Michael Paquier wrote:
>> openssl-env allows the use of the environment variable of the same
>> name.  This reminds me a bit of the recent interferences with GZIP,
>> for example.
> 
> Okay, I see what you meant here now.  openssl-env is the man page describing
> environment variables used by OpenSSL.

Yeah, sorry.  That's what I was referring to.

> I don't see any conflicts with what is being proposed here.

Its meaning is the same in the context of the OpenSSL code.  LibreSSL
has nothing of the kind.

> Added.  New patch attached.

Looks fine as a whole, except for one nit.

src/test/ssl/t/001_ssltests.pl: warn 'couldn\'t run `openssl x509` to get 
client cert serialno';
Perhaps this warning should mentioned $ENV{OPENSSL} instead?
--
Michael


signature.asc
Description: PGP signature

Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada

On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada  wrote:
>
> On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila  wrote:
> >
> > On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  
> > wrote:
> > >
> > > >
> > > > I think because the test case proposed needs all three changes, we can
> > > > push the change-1 without a test case and then as a second patch have
> > > > change-2 for HEAD and change-3 for back branches with the test case.
> > > > Do you have any other ideas to proceed here?
> > >
> > > I found another test case that causes the assertion failure at
> > > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > > attached the patch for the test case. In this test case, I modified a
> > > user-catalog table instead of system-catalog table. That way, we don't
> > > generate invalidation messages while generating NEW_CID records. As a
> > > result, we mark only the subtransactions as containing catalog change
> > > and don't make association between top-level and sub transactions. The
> > > assertion failure happens on all supported branches. If we need to fix
> > > this (I believe so), Change-2 needs to be backpatched to all supported
> > > branches.
> > >
> > > There are three changes as Amit mentioned, and regarding the test
> > > case, we have three test cases I've attached: truncate_testcase.patch,
> > > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > > between assertion failures and test cases are very complex. I could
> > > not find any test case to cause only one assertion failure on all
> > > branches. One idea to proceed is:
> > >
> > > Patch-1 includes Change-1 and is applied to all branches.
> > >
> > > Patch-2 includes Change-2 and the user_catalog test case, and is
> > > applied to all branches.
> > >
> > > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > > test case), and is applied to v14 and v15 (also till v11 if we
> > > prefer).
> > >
> > > The patch-1 doesn't include any test case but the user_catalog test
> > > case can test both Change-1 and Change-2 on all branches.
> > >
> >
> > I was wondering if it makes sense to commit both Change-1 and Change-2
> > together as one patch? Both assertions are caused by a single test
> > case and are related to the general problem that the association of
> > top and sub transaction is only guaranteed to be formed before we
> > decode transaction changes. Also, it would be good to fix the problem
> > with a test case that can cause it. What do you think?
>
> Yeah, it makes sense to me.
>

I've attached two patches that need to be back-patched to all branches
and includes Change-1, Change-2, and a test case for them. FYI this
patch resolves the assertion failure reported in this thread as well
as one reported in another thread[2]. So I borrowed some of the
changes from the patch[2] Osumi-san recently proposed.

Regards,

[1] 
https://www.postgresql.org/message-id/TYCPR01MB83733C6CEAE47D0280814D5AED7A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com
[2] 
https://www.postgresql.org/message-id/TYAPR01MB5866B30A1439043B1FC3F21EF5229%40TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v15_v3-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


HEAD_v3-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data

Re: pub/sub - specifying optional parameters without values.

2022-10-18 Thread Peter Smith

On Tue, Oct 18, 2022 at 7:09 AM Justin Pryzby  wrote:
>
> On Fri, Oct 14, 2022 at 07:54:37PM +1100, Peter Smith wrote:
> > Hi hackers.
> >
> > This post is about parameter default values. Specifically. it's about
> > the CREATE PUBLICATION and CREATE SUBSCRIPTION syntax, although the
> > same issue might apply to other commands I am unaware of...
>
> The same thing seems to be true in various other pages:
> git grep 'WITH.*value' doc
>
> In addition to WITH, it's also true of SET:
>
> git grep -F '[= value' 
> doc/src/sgml/ref/alter_index.sgml doc/src/sgml/ref/alter_table.sgml 
> doc/src/sgml/ref/create_materialized_view.sgml 
> doc/src/sgml/ref/create_publication.sgml 
> doc/src/sgml/ref/create_subscription.sgml
>
> Note that some utility statements (analyze,cluster,vacuum,reindex) which
> have parenthesized syntax with booleans say this:
> | The boolean value can also be omitted, in which case TRUE is assumed.

Thank you for the feedback and for reporting about other places
similar to this. For now, I only intended to fix docs related to
logical replication. Scope creep to other areas maybe can be addressed
by subsequent patches if this one gets accepted.

>
> BTW, in your patch:
> + 
> +  A boolean parameter can omit the value. This is equivalent
> +  to assigning the parameter to true.
> + 
> + 
>
> should say: "The value can be omitted, which is equivalent to specifying
> TRUE".
>

I've changed the text as you suggested, except in a couple of places
where I qualified by saying "For boolean parameters..."; that's
because the value part is not *always* optional. I've also made
similar updates to the ALTER PUBLICATION/SUBSCRIPTION pages, which
were accidentally missed before.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v2-0001-clarify-behavior-of-specifying-a-parameter-with-n.patch
Description: Binary data

RE: Improve errhint for ALTER SUBSCRIPTION ADD/DROP PUBLICATION

2022-10-18 Thread houzj.f...@fujitsu.com

On Tuesday, October 18, 2022 5:50 PM Alvaro Herrera  
wrote:
> 
> On 2022-Oct-18, Japin Li wrote:
> 
> >
> > On Tue, 18 Oct 2022 at 12:00, houzj.f...@fujitsu.com
>  wrote:
> >
> > > Agreed. Here is new version patch which changed the error code and
> > > moved the whole command out of the message according to Álvaro's
> comment.
> >
> > My bad!  The patch looks good to me.
> 
> Thank you, I pushed it to both branches, because I realized we were saying 
> "SET
> PUBLICATION" when we meant "ADD/DROP"; that hint could be quite
> damaging if anybody decides to actually follow it ISTM.

Thanks for pushing!

Best regards,
Hou zj

Re: Fix typo in code comment

2022-10-18 Thread Peter Smith

On Wed, Oct 19, 2022 at 12:29 PM Michael Paquier  wrote:
>
> On Wed, Oct 19, 2022 at 10:09:12AM +1100, Peter Smith wrote:
> > PSA trivial patch to fix a code comment typo seen during a recent review.
>
> Passing by..  And fixed.  Thanks!
> --

Thanks for passing by.

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Fix typo in code comment

2022-10-18 Thread Michael Paquier

On Wed, Oct 19, 2022 at 10:09:12AM +1100, Peter Smith wrote:
> PSA trivial patch to fix a code comment typo seen during a recent review.

Passing by..  And fixed.  Thanks!
--
Michael


signature.asc
Description: PGP signature

Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf

2022-10-18 Thread Michael Paquier

On Tue, Oct 18, 2022 at 09:14:21AM +0200, Drouvot, Bertrand wrote:
> BTW, what about adding a new TAP test (dedicated patch) to test the behavior
> in case of errors during the regexes compilation in pg_ident.conf and
> pg_hba.conf (not done yet)? (we could add it once this  patch series is
> done).

Perhaps, that may become tricky when it comes to -DEXEC_BACKEND (for
cases where no fork() implementation is available, aka Windows).  But
a postmaster restart failure would generate logs that could be picked
for a pattern check?

>> While putting my hands on that, I was also wondering whether we should
>> have the error string generated after compilation within the internal
>> regcomp() routine, but that would require more arguments to
>> pg_regcomp() (as of file name, line number, **err_string), and that
>> looks more invasive than necessary.  Perhaps the follow-up steps will
>> prove me wrong, though :)
> 
> I've had the same thought (and that was what the previous global patch was
> doing). I'm tempted to think that the follow-steps will prove you right ;-)
> (specially if at the end those will be the same error messages for databases
> and roles).

Avoiding three times the same error message seems like a good thing in
the long run, but let's think about this part later as needed.  All
these routines are static to hba.c so even if we finish by not
finishing the whole job for this development cycle we can still be
very flexible.
--
Michael


signature.asc
Description: PGP signature

Re: New "single-call SRF" APIs are very confusingly named

2022-10-18 Thread Andres Freund

Hi,

Thanks for "fixing" this so quickly.

Greetings,

Andres Freund

Fix typo in code comment

2022-10-18 Thread Peter Smith

PSA trivial patch to fix a code comment typo seen during a recent review.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Fix-typo-instead-just.patch
Description: Binary data

Re: Checking for missing heap/index files

2022-10-18 Thread Tom Lane

Robert Haas  writes:
> On Tue, Oct 18, 2022 at 3:59 PM Tom Lane  wrote:
>> Isn't it already the case (or could be made so) that relation file
>> removal happens only in the checkpointer?

> I believe that individual backends directly remove all relation forks
> other than the main fork and all segments other than the first one.

Yeah, obviously some changes would need to be made to that, but ISTM
we could just treat all the forks as we now treat the first one.

> The discussion on various other threads has been in the direction of
> trying to standardize on moving that last case out of the checkpointer
> - i.e. getting rid of what Thomas dubbed "tombstone" files - which is
> pretty much the exact opposite of this proposal.

Yeah, we'd have to give up on that.  If that goes anywhere then
it kills this idea.

> But even apart from
> that, I don't think this would be that easy to implement. If you
> removed a large relation, you'd have to tell the checkpointer to
> remove many files instead of just 1.

The backends just implement this by deleting files until they don't
find the next one in sequence.  I fail to see how it'd be any
harder for the checkpointer to do that.

> And I don't think we really need to do any of that. We could invent a
> new kind of lock tag for  combination. Take a share lock
> to create or remove files. Take an exclusive lock to scan the
> directory. I think that accomplishes the same thing as your proposal,
> but more directly, and with less overhead. It's still substantially
> more than NO overhead, though.

My concern about that is that it implies touching a whole lot of
places, and if you miss even one then you've lost whatever guarantee
you thought you were getting.  More, there's no easy way to find
all the relevant places (some will be in extensions, no doubt).
So I have approximately zero faith that it could be made reliable.
Funneling things through the checkpointer would make that a lot
more centralized.  I concede that cowboy unlink() calls could still
be a problem ... but I doubt there's any solution that's totally
free of that hazard.

regards, tom lane

Re: Checking for missing heap/index files

2022-10-18 Thread Robert Haas

On Tue, Oct 18, 2022 at 3:59 PM Tom Lane  wrote:
> Isn't it already the case (or could be made so) that relation file
> removal happens only in the checkpointer?  I wonder if we could
> get to a situation where we can interlock file removal just by
> commanding the checkpointer to not do it for awhile.  Then combining
> that with caching readdir results (to narrow the window in which we
> have to stop the checkpointer) might yield a solution that has some
> credibility.  This scheme doesn't attempt to prevent file creation
> concurrently with a readdir, but you'd have to make some really
> adverse assumptions to believe that file creation would cause a
> pre-existing entry to get missed (as opposed to getting scanned
> twice).  So it might be an acceptable answer.

I believe that individual backends directly remove all relation forks
other than the main fork and all segments other than the first one.
The discussion on various other threads has been in the direction of
trying to standardize on moving that last case out of the checkpointer
- i.e. getting rid of what Thomas dubbed "tombstone" files - which is
pretty much the exact opposite of this proposal. But even apart from
that, I don't think this would be that easy to implement. If you
removed a large relation, you'd have to tell the checkpointer to
remove many files instead of just 1. That sounds kinda painful: it
would be more IPC, and it would delay file removal just so that we can
tell the checkpointer to delay it some more.

And I don't think we really need to do any of that. We could invent a
new kind of lock tag for  combination. Take a share lock
to create or remove files. Take an exclusive lock to scan the
directory. I think that accomplishes the same thing as your proposal,
but more directly, and with less overhead. It's still substantially
more than NO overhead, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Mingw task for Cirrus CI

2022-10-18 Thread Melih Mutlu

Hi Bilal,

Nazir Bilal Yavuz , 18 Eki 2022 Sal, 15:37 tarihinde
şunu yazdı:

> env:
> ...
> CHERE_INVOKING: 1
> BASH_EXE: C:\msys64\usr\bin\bash.exe
>
> 'CHERE_INVOKING: 1' will cause bash.exe to start from current working
> directory(%CIRRUS_WORKING_DIR%).
>
> In this way, there is no need for
> 'cd %CIRRUS_WORKING_DIR%' when using bash.exe,
> also no need for 'BUILD_DIR' environment variable.
>

 Right, setting CHERE_INVOKING and removing all cd's work and look better.
Thanks for the suggestion.

Here's the updated patch.

Best,
Melih


v8-0001-Added-Windows-with-MinGW-environment-in-Cirrus-CI.patch
Description: Binary data

Re: Atomic rename feature for Windows.

2022-10-18 Thread Thomas Munro

On Wed, Apr 6, 2022 at 10:40 AM Victor Spirin  wrote:
> Updated patch: we use the posix semantic features in Windows build 17763
> and up.
> We found an issue with this feature on Windows Server 2016 without
> updates (Windows 1607 Build 14393)

Hi Victor,

I rebased and simplified this, and added a lot of tests to be able to
understand what it does.  I think all systems that didn't have this
are now EOL and we don't need to support them in PG16, but perhaps our
_WIN32_WINNT is not quite high enough (this requires Win10 RS1, which
itself was EOL'd in 2019); the main question I have now is what
happens when you run this on non-NTFS filesystems, and whether we want
to *require* this to work because the non-POSIX support will probably
finish up untested.  I posted all that over on a new thread where I am
tidying up lots of related stuff, and I didn't want to repost the
proposed testing framework in multiple threads...

https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com

Re: Getting rid of SQLValueFunction

2022-10-18 Thread Corey Huinker

On Fri, Sep 30, 2022 at 2:04 AM Michael Paquier  wrote:

> Hi all,
>
> I have bumped a few days ago on the fact that COERCE_SQL_SYNTAX
> (introduced by 40c24bf) and SQLValueFunction are around to do the
> exact same thing, as known as enforcing single-function calls with
> dedicated SQL keywords.  For example, keywords like SESSION_USER,
> CURRENT_DATE, etc. go through SQLValueFunction and rely on the parser
> to set a state that gets then used in execExprInterp.c.  And it is
> rather easy to implement incorrect SQLValueFunctions, as these rely on
> more hardcoded assumptions in the parser and executor than the
> equivalent FuncCalls (like collation to assign when using a text-like
> SQLValueFunctions).
>
> There are two categories of single-value functions:
> - The ones returning names, where we enforce a C collation in two
> places of the code (current_role, role, current_catalog,
> current_schema, current_database, current_user), even if
> get_typcollation() should do that for name types.
> - The ones working on time, date and timestamps (localtime[stamp],
> current_date, current_time[stamp]), for 9 patterns as these accept an
> optional typmod.
>
> I have dug into the possibility to unify all that with a single
> interface, and finish with the attached patch set which is a reduction
> of code, where all the SQLValueFunctions are replaced by a set of
> FuncCalls:
>  25 files changed, 338 insertions(+), 477 deletions(-)
>
> 0001 is the move done for the name-related functions, cleaning up two
> places in the executor when a C collation is assigned to those
> function expressions.  0002 is the remaining cleanup for the
> time-related ones, moving a set of parser-side checks to the execution
> path within each function, so as all this knowledge is now local to
> each file holding the date and timestamp types.  Most of the gain is
> in 0002, obviously.
>
> The pg_proc entries introduced for the sake of the move use the same
> name as the SQL keywords.  These should perhaps be prefixed with a
> "pg_" at least.  There would be an exception with pg_localtime[stamp],
> though, where we could use a pg_localtime[stamp]_sql for the function
> name for prosrc.  I am open to suggestions for these names.
>
> Thoughts?
> --
> Michael
>

I like this a lot. Deleted code is debugged code.

Patch applies and passes make check-world.

No trace of SQLValueFunction is left in the codebase, at least according to
`git grep -l`.

I have only one non-nitpick question about the code:

+ /*
+ * we're not too tense about good error message here because grammar
+ * shouldn't allow wrong number of modifiers for TIME
+ */
+ if (n != 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid type modifier")));


I agree that we shouldn't spend too much effort on a good error message
here, but perhaps we should have the message mention that it is
date/time-related? A person git-grepping for this error message will get 4
hits in .c files (date.c, timestamp.c, varbit.c, varchar.c) so even a
slight variation in the message could save them some time.

This is an extreme nitpick, but the patchset seems like it should have been
1 file or 3 (remove name functions, remove time functions, remove
SQLValueFunction infrastructure), but that will only matter in the unlikely
case that we find a need for SQLValueFunction but we want to leave the
timestamp function as COERCE_SQL_SYNTAX.

Re: Checking for missing heap/index files

2022-10-18 Thread Tom Lane

Robert Haas  writes:
> On Tue, Oct 18, 2022 at 2:37 PM Stephen Frost  wrote:
>> I don't see it as likely to be acceptable, but arranging to not add or
>> remove files while the scan is happening would presumably eliminate the
>> risk entirely.  We've not seen this issue recur in the expire command
>> since the change to first completely scan the directory and then go and
>> remove the files from it.  Perhaps just not removing files during the
>> scan would be sufficient which might be more reasonable to do.

> Just deciding to cache to the results of readdir() in memory is much
> cheaper insurance. I think I'd probably be willing to foist that
> overhead onto everyone, all the time. As I mentioned before, it could
> still hose someone who is right on the brink of a memory disaster, but
> that's a much narrower blast radius than putting locking around all
> operations that create or remove a file in the same directory as a
> relation file. But it's also not a complete fix, which sucks.

Yeah, that.  I'm not sure if we need to do anything about this, but
if we do, I don't think that's it.  Agreed that the memory-consumption
objection is pretty weak; the one that seems compelling is that by
itself, this does nothing to fix the problem beyond narrowing the
window some.

Isn't it already the case (or could be made so) that relation file
removal happens only in the checkpointer?  I wonder if we could
get to a situation where we can interlock file removal just by
commanding the checkpointer to not do it for awhile.  Then combining
that with caching readdir results (to narrow the window in which we
have to stop the checkpointer) might yield a solution that has some
credibility.  This scheme doesn't attempt to prevent file creation
concurrently with a readdir, but you'd have to make some really
adverse assumptions to believe that file creation would cause a
pre-existing entry to get missed (as opposed to getting scanned
twice).  So it might be an acceptable answer.

regards, tom lane

Re: Bug: pg_regress makefile does not always copy refint.so

2022-10-18 Thread Donghang Lin

Hi Alvaro,

> I have a somewhat-related-but-not-really complaint.  I recently had need to
> have refint.so, autoinc.so and regress.so in the install directory; but it
> turns out that there's no provision at all to get them installed.

True, we also noticed this build bug described above by copying *.so and 
pg_regress binary around.

> I think it would be better to provide a Make rule to allow these files to be
> installed.
I looked at what postgresql-test 
package
 provides today :

$ ls /usr/pgsql-15/lib/test/regress
autoinc.so  data  expected  Makefile  parallel_schedule  pg_regress  
pg_regress.c  pg_regress.h  pg_regress_main.c  README  refint.so  regress.c  
regressplans.sh  regress.so  resultmap  sql
(I’m not sure what this package is supposed to do,  it contains both source 
files and the executables.

The current pgsql install directory of regress only contains pg_regress binary,
Do you suggest we add these files (excluding the scratched files) to the 
regress install directory?
autoinc.so  data  expected  Makefile  parallel_schedule  pg_regress  
pg_regress.c  pg_regress.h  pg_regress_main.c  README  refint.so  regress.c  
regressplans.sh  regress.so  resultmap  sql

>> 1. configure and build postgres
>> 2. edit file src/include/utils/rel.h so that contrib/spi will rebuild
>> 3. cd ${builddir}/src/test/regress
>> 4. make
>> We’ll find refint.so is rebuilt in contrib/spi, but not copied over to 
>> regress folder.
>> While autoinc.so is rebuilt and copied over.

I think this build bug is orthogonal to the inconvenient installation/packaging.
It produces inconsistent build result, e.g you have to run `make` twice to 
ensure newly built refint.so is copied to the build dir.

Regards,
Donghang Lin
(ServiceNow)

From: Alvaro Herrera 
Date: Friday, October 14, 2022 at 8:15 PM
To: Donghang Lin 
Cc: pgsql-hackers@lists.postgresql.org 
Subject: Re: Bug: pg_regress makefile does not always copy refint.so
[External Email]

On 2022-Oct-14, Donghang Lin wrote:

> Hi hackers,
>
> When I was building pg_regress, it didn’t always copy a rebuilt version of 
> refint.so to the folder.
>
> Steps to reproduce:
> OS: centos7
> PSQL version: 14.5
>
> 1. configure and build postgres
> 2. edit file src/include/utils/rel.h so that contrib/spi will rebuild
> 3. cd ${builddir}/src/test/regress
> 4. make
> We’ll find refint.so is rebuilt in contrib/spi, but not copied over to 
> regress folder.
> While autoinc.so is rebuilt and copied over.

I have a somewhat-related-but-not-really complaint.  I recently had need to
have refint.so, autoinc.so and regress.so in the install directory; but it
turns out that there's no provision at all to get them installed.

Packagers have long have had a need for this; for example the postgresql-test
RPM file is built using this icky recipe:

%if %test
# tests. There are many files included here that are unnecessary,
# but include them anyway for completeness.  We replace the original
# Makefiles, however.
%{__mkdir} -p %{buildroot}%{pgbaseinstdir}/lib/test
%{__cp} -a src/test/regress %{buildroot}%{pgbaseinstdir}/lib/test
%{__install} -m 0755 contrib/spi/refint.so 
%{buildroot}%{pgbaseinstdir}/lib/test/regress
%{__install} -m 0755 contrib/spi/autoinc.so 
%{buildroot}%{pgbaseinstdir}/lib/test/regress

I assume that the DEB does something similar, but I didn't look.

I think it would be better to provide a Make rule to allow these files to be
installed.  I'll see about a proposed patch.

--
Álvaro HerreraBreisgau, Deutschland  —  
https://urldefense.com/v3/__https://www.EnterpriseDB.com/__;!!N4vogdjhuJM!Gz471V39hPgZI8Uabm3fUUHoZIuoMKlnz6_W38wwQvH20ZKfYaIWioPYtBJU2U75apWuCP6NmZutZfZ92EDwq5vIcvs$
"Saca el libro que tu religión considere como el indicado para encontrar la
oración que traiga paz a tu alma. Luego rebootea el computador
y ve si funciona" (Carlos Duclós)

Re: Exponentiation confusion

2022-10-18 Thread Robert Haas

On Tue, Oct 18, 2022 at 6:18 AM Dean Rasheed  wrote:
> Overall, I'm quite happy with these results. The question is, should
> this be back-patched?
>
> In the past, I think I've only back-patched numeric bug-fixes where
> the digits output by the old code were incorrect or an error was
> thrown, not changes that resulted in a different number of digits
> being output, changing the precision of already-correct results.
> However, having 10.0^(-18) produce zero seems pretty bad, so my
> inclination is to back-patch, unless anyone objects.

I don't think that back-patching is a very good idea. The bar for
changing query results should be super-high. Applications can depend
on the existing behavior even if it's wrong.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Checking for missing heap/index files

2022-10-18 Thread Robert Haas

On Tue, Oct 18, 2022 at 2:37 PM Stephen Frost  wrote:
> While I don't think it's really something that should be happening, it's
> definitely something that's been seen with some networked filesystems,
> as reported.

Do you have clear and convincing evidence of this happening on
anything other than CIFS?

> I don't see it as likely to be acceptable, but arranging to not add or
> remove files while the scan is happening would presumably eliminate the
> risk entirely.  We've not seen this issue recur in the expire command
> since the change to first completely scan the directory and then go and
> remove the files from it.  Perhaps just not removing files during the
> scan would be sufficient which might be more reasonable to do.

I don't think that's a complete non-starter, but I do think it would
be somewhat expensive in some workloads. I hate to make everyone pay
that much for insurance against a shouldn't-happen case. We could make
it optional, but then we're asking users to decide whether or not they
need insurance. Since we don't even know which filesystems are
potentially affected, how is anyone else supposed to know? Worse
still, if you have a corruption event, you're still not going to know
for sure whether this would have fixed it, so you still don't know
whether you should turn on the feature for next time. And if you do
turn it on and don't get corruption again, you don't know whether you
would have had a problem if you hadn't used the feature. It all just
seems like a lot of guesswork that will end up being frustrating to
both users and developers.

Just deciding to cache to the results of readdir() in memory is much
cheaper insurance. I think I'd probably be willing to foist that
overhead onto everyone, all the time. As I mentioned before, it could
still hose someone who is right on the brink of a memory disaster, but
that's a much narrower blast radius than putting locking around all
operations that create or remove a file in the same directory as a
relation file. But it's also not a complete fix, which sucks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Checking for missing heap/index files

2022-10-18 Thread Stephen Frost

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Oct 18, 2022 at 12:59 PM Tom Lane  wrote:
> > There is no text suggesting that it's okay to miss, or to double-return,
> > an entry that is present throughout the scan.  So I'd interpret the case
> > you're worried about as "forbidden by POSIX".  Of course, it's known that
> > NFS fails to provide POSIX semantics in all cases --- but I don't know
> > if this is one of them.
> 
> Yeah, me neither. One problem I see is that, even if the behavior is
> forbidden by POSIX, if it happens in practice on systems people
> actually use, then it's an issue. We even have documentation saying
> that it's OK to use NFS, and a lot of people do -- which IMHO is
> unfortunate, but it's also not clear what the realistic alternatives
> are. It's pretty hard to tell people in 2022 that they are only
> allowed to use PostgreSQL with local storage.
> 
> But to put my cards on the table, it's not so much that I am worried
> about this problem myself as that I want to know whether we're going
> to do anything about it as a project, and if so, what, because it
> intersects a patch that I'm working on. So if we want to readdir() in
> one fell swoop and cache the results, I'm going to go write a patch
> for that. If we don't, then I'd like to know whether (a) we think that
> would be theoretically acceptable but not justified by the evidence
> presently available or (b) would be unacceptable due to (b1) the
> potential for increased memory usage or (b2) some other reason.

While I don't think it's really something that should be happening, it's
definitely something that's been seen with some networked filesystems,
as reported.  I also strongly suspect that on local filesystems there's
something that prevents this from happening but as mentioned that
doesn't cover all PG use cases.

In pgbackrest, we moved to doing a scan and cache'ing all of the results
in memory to reduce the risk when reading from the PG data dir.  We also
reworked our expire code (which removes an older backup from the backup
repository) to also do a complete scan before removing files.

I don't see it as likely to be acceptable, but arranging to not add or
remove files while the scan is happening would presumably eliminate the
risk entirely.  We've not seen this issue recur in the expire command
since the change to first completely scan the directory and then go and
remove the files from it.  Perhaps just not removing files during the
scan would be sufficient which might be more reasonable to do.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: effective_multixact_freeze_max_age issue

2022-10-18 Thread Peter Geoghegan

On Tue, Oct 18, 2022 at 3:43 AM Anton A. Melnikov  wrote:
> It's interesting that prior to this commit, checks were made for freeze 
> limits, but the error messages were talking about oldestXmin and oldestMxact.

The problem really is that oldestXmin and oldestMxact are held back,
though. While that can ultimately result in older FreezeLimit and
MultiXactCutoff cutoffs in vacuumlazy.c, that's just one problem.
Usually not the worst problem.

The term "removable cutoff" is how VACUUM VERBOSE reports OldestXmin.
I think that it's good to use the same terminology here.

> Could you clarify this moment please? Would be very grateful.

While this WARNING is triggered when a threshold controlled by
autovacuum_freeze_max_age is crossed, it's not just a problem with
freezing. It's convenient to use autovacuum_freeze_max_age to
represent "a wildly excessive number of XIDs for OldestXmin to be held
back by, no matter what". In practice it is usually already a big
problem when OldestXmin is held back by far fewer XIDs than that, but
it's hard to reason about when exactly we need to consider that a
problem. However, we can at least be 100% sure of real problems when
OldestXmin age reaches autovacuum_freeze_max_age. There is no longer
any doubt that we need to warn the user, since antiwraparound
autovacuum cannot work as designed at that point. But the WARNING is
nevertheless not primarily (or not exclusively) about not being able
to freeze. It's also about not being able to remove bloat.

Freezing can be thought of as roughly the opposite process to removing
dead tuples deleted by now committed transactions. OldestXmin is the
cutoff both for removing dead tuples (which we want to get rid of
immediately), and freezing live all-visible tuples (which we want to
keep long term). FreezeLimit is usually 50 million XIDs before
OldestXmin (the freeze_min_age default), but that's just how we
implement lazy freezing, which is just an optimization.

> As variant may be split these checks for the freeze cuttoffs and the oldest 
> xmins for clarity?
> The patch attached tries to do this.

I don't think that this is an improvement. For one thing the
FreezeLimit cutoff will have been held back (relative to nextXID-wise
table age) by more than the freeze_min_age setting for a long time
before this WARNING is hit -- so we're not going to show the WARNING
in most cases where freeze_min_age has been completely ignored (it
must be ignored in extreme cases because FreezeLimit must always be <=
OldestXmin). Also, the proposed new WARNING is only seen when we're
bound to also see the existing OldestXmin WARNING already. Why have 2
WARNINGs about exactly the same problem?

-- 
Peter Geoghegan

Re: Checking for missing heap/index files

2022-10-18 Thread Robert Haas

On Tue, Oct 18, 2022 at 12:59 PM Tom Lane  wrote:
> There is no text suggesting that it's okay to miss, or to double-return,
> an entry that is present throughout the scan.  So I'd interpret the case
> you're worried about as "forbidden by POSIX".  Of course, it's known that
> NFS fails to provide POSIX semantics in all cases --- but I don't know
> if this is one of them.

Yeah, me neither. One problem I see is that, even if the behavior is
forbidden by POSIX, if it happens in practice on systems people
actually use, then it's an issue. We even have documentation saying
that it's OK to use NFS, and a lot of people do -- which IMHO is
unfortunate, but it's also not clear what the realistic alternatives
are. It's pretty hard to tell people in 2022 that they are only
allowed to use PostgreSQL with local storage.

But to put my cards on the table, it's not so much that I am worried
about this problem myself as that I want to know whether we're going
to do anything about it as a project, and if so, what, because it
intersects a patch that I'm working on. So if we want to readdir() in
one fell swoop and cache the results, I'm going to go write a patch
for that. If we don't, then I'd like to know whether (a) we think that
would be theoretically acceptable but not justified by the evidence
presently available or (b) would be unacceptable due to (b1) the
potential for increased memory usage or (b2) some other reason.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [RFC] building postgres with meson - v13

2022-10-18 Thread Justin Pryzby

> From 680ff3f7b4da1dbf21d0c7cd87af9bb5ee8b230c Mon Sep 17 00:00:00 2001
> From: Andres Freund 
> Date: Wed, 21 Sep 2022 20:36:36 -0700
> Subject: [PATCH v17 01/23] meson: ci: wip: move compilerwarnings task to meson
> 
> ---
>  .cirrus.yml| 92 +-
>  src/tools/ci/linux-mingw-w64-64bit.txt | 13 
>  2 files changed, 59 insertions(+), 46 deletions(-)
>  create mode 100644 src/tools/ci/linux-mingw-w64-64bit.txt
> 
> diff --git a/.cirrus.yml b/.cirrus.yml
> index 7b5cb021027..eb33fdc4855 100644
> --- a/.cirrus.yml
> +++ b/.cirrus.yml
> @@ -465,6 +465,10 @@ task:
>ccache_cache:
>  folder: $CCACHE_DIR
>  
> +  ccache_stats_start_script:
> +ccache -s
> +ccache -z

I realized that ccache -z clears out not only the global stats, but the
per-file cache stats (from which the global stats are derived) - which
obviously makes the cache work poorly.

Newer ccache has CCACHE_STATSLOG, and --show-log-stats, which I think
can do what's wanted.  I'll update my ci branch with that.

Re: Checking for missing heap/index files

2022-10-18 Thread Tom Lane

Robert Haas  writes:
> I'd be really interested in knowing whether this happens on a
> mainstream, non-networked filesystem. It's not an irrelevant concern
> even if it happens only on networked filesystems, but a lot more
> people will be at risk if it also happens on ext4 or xfs. It does seem
> a little bit surprising if no filesystem has a way of preventing this.
> I mean, does open() also randomly but with low probability fail to
> find a file that exists, due to a concurrent directory modification on
> some directory in the pathname? I assume that would be unacceptable,
> and the file system has a way of preventing that from happening, then
> it has some way of ensuring a stable read of a directory, at least
> over a short period.

The POSIX spec for readdir(3) has a little bit of info:

The type DIR, which is defined in the  header, represents a
directory stream, which is an ordered sequence of all the directory
entries in a particular directory. Directory entries represent files;
files may be removed from a directory or added to a directory
asynchronously to the operation of readdir().

If a file is removed from or added to the directory after the most
recent call to opendir() or rewinddir(), whether a subsequent call to
readdir() returns an entry for that file is unspecified.

There is no text suggesting that it's okay to miss, or to double-return,
an entry that is present throughout the scan.  So I'd interpret the case
you're worried about as "forbidden by POSIX".  Of course, it's known that
NFS fails to provide POSIX semantics in all cases --- but I don't know
if this is one of them.

regards, tom lane

Re: Make finding openssl program a configure or meson option

2022-10-18 Thread Peter Eisentraut


On 12.10.22 03:08, Michael Paquier wrote:

On Tue, Oct 11, 2022 at 05:06:22PM +0200, Peter Eisentraut wrote:

Various test suites use the "openssl" program as part of their setup. There
isn't a way to override which openssl program is to be used, other than by
fiddling with the path, perhaps.  This has gotten increasingly problematic
with some of the work I have been doing, because different versions of
openssl have different capabilities and do different things by default.
This patch checks for an openssl binary in configure and meson setup, with
appropriate ways to override it.  This is similar to how "lz4" and "zstd"
are handled, for example.  The meson build system actually already did this,
but the result was only used in some places. This is now applied more
uniformly.


openssl-env allows the use of the environment variable of the same
name.  This reminds me a bit of the recent interferences with GZIP,
for example.


Okay, I see what you meant here now.  openssl-env is the man page 
describing environment variables used by OpenSSL.  I don't see any 
conflicts with what is being proposed here.



This patch is missing one addition of set_single_env() in
vcregress.pl, and one update of install-windows.sgml where all the
supported environment variables for commands are listed.


Added.  New patch attached.
From dd542e0be55ec181c606d632b40509e655ce1bad Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 18 Oct 2022 14:39:47 +0200
Subject: [PATCH v2] Make finding openssl program a configure or meson option

Discussion: 
https://www.postgresql.org/message-id/flat/dc638b75-a16a-007d-9e1c-d16ed6cf0ad2%40enterprisedb.com
---
 configure | 55 +++
 configure.ac  |  1 +
 doc/src/sgml/install-windows.sgml |  9 +++
 meson.build   |  1 +
 meson_options.txt |  3 +
 src/Makefile.global.in|  1 +
 src/test/ldap/Makefile|  1 +
 src/test/ldap/meson.build |  5 +-
 src/test/ldap/t/001_auth.pl   |  8 ++-
 .../modules/ssl_passphrase_callback/Makefile  |  4 +-
 .../ssl_passphrase_callback/meson.build   |  2 -
 src/test/ssl/Makefile |  2 +-
 src/test/ssl/meson.build  |  5 +-
 src/test/ssl/sslfiles.mk  | 34 ++--
 src/test/ssl/t/001_ssltests.pl|  2 +-
 src/tools/msvc/vcregress.pl   |  1 +
 16 files changed, 106 insertions(+), 28 deletions(-)

diff --git a/configure b/configure
index e04ee9fb4166..dd0802844a4a 100755
--- a/configure
+++ b/configure
@@ -648,6 +648,7 @@ PG_CRC32C_OBJS
 CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
 LIBOBJS
+OPENSSL
 ZSTD
 LZ4
 UUID_LIBS
@@ -14023,6 +14024,60 @@ done
 
 fi
 
+if test -z "$OPENSSL"; then
+  for ac_prog in openssl
+do
+  # Extract the first word of "$ac_prog", so it can be a program name with 
args.
+set dummy $ac_prog; ac_word=$2
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for $ac_word" >&5
+$as_echo_n "checking for $ac_word... " >&6; }
+if ${ac_cv_path_OPENSSL+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  case $OPENSSL in
+  [\\/]* | ?:[\\/]*)
+  ac_cv_path_OPENSSL="$OPENSSL" # Let the user override the test with a path.
+  ;;
+  *)
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+for ac_exec_ext in '' $ac_executable_extensions; do
+  if as_fn_executable_p "$as_dir/$ac_word$ac_exec_ext"; then
+ac_cv_path_OPENSSL="$as_dir/$ac_word$ac_exec_ext"
+$as_echo "$as_me:${as_lineno-$LINENO}: found $as_dir/$ac_word$ac_exec_ext" 
>&5
+break 2
+  fi
+done
+  done
+IFS=$as_save_IFS
+
+  ;;
+esac
+fi
+OPENSSL=$ac_cv_path_OPENSSL
+if test -n "$OPENSSL"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: $OPENSSL" >&5
+$as_echo "$OPENSSL" >&6; }
+else
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+fi
+
+
+  test -n "$OPENSSL" && break
+done
+
+else
+  # Report the value of OPENSSL in configure's output in all cases.
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for OPENSSL" >&5
+$as_echo_n "checking for OPENSSL... " >&6; }
+  { $as_echo "$as_me:${as_lineno-$LINENO}: result: $OPENSSL" >&5
+$as_echo "$OPENSSL" >&6; }
+fi
+
 if test "$with_ssl" = openssl ; then
   ac_fn_c_check_header_mongrel "$LINENO" "openssl/ssl.h" 
"ac_cv_header_openssl_ssl_h" "$ac_includes_default"
 if test "x$ac_cv_header_openssl_ssl_h" = xyes; then :
diff --git a/configure.ac b/configure.ac
index f146c8301ae1..2b11d5016684 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1539,6 +1539,7 @@ if test "$with_gssapi" = yes ; then
[AC_CHECK_HEADERS(gssapi.h, [], [AC_MSG_ERROR([gssapi.h header file is 
required for GSSAPI])])])
 fi
 
+PGAC_PATH_PROGS(OPENSSL, openssl)
 if test "$with_ssl" = openssl ; then
   AC_CHECK_HEADER(openssl/ssl.h, [

Re: Checking for missing heap/index files

2022-10-18 Thread Robert Haas

On Fri, Jun 17, 2022 at 6:31 PM Stephen Frost  wrote:
>> Hmm, this sounds pretty bad, and I agree that a workaround should be put
>> in place.  But where is pg_basebackup looping around readdir()?  I
>> couldn't find it.  There's a call to readdir() in FindStreamingStart(),
>> but that doesn't seem to match what you describe.
>
> It’s the server side that does it in basebackup.c when it’s building the 
> tarball for the data dir and each table space and sending it to the client. 
> It’s not done by src/bin/pg_basebackup. Sorry for not being clear. 
> Technically this would be beyond just pg_basebackup but would impact, 
> potentially, anything using BASE_BACKUP from the replication protocol (in 
> addition to other backup tools which operate against the data directory with 
> readdir, of course).

Specifically, sendDir() can either recurse back into sendDir(), or it
can call sendFile(). So in theory if your directory contains
some/stupid/long/path/to/a/file, you could have 6 directory scans open
all at the same time, and then a file being read concurrently with
that. That provides a lot of time for things to change concurrently,
especially at the outer levels. Before the scan of the outermost
directory moves to the next file, it will have to completely finish
reading and sending every file in the directory tree that was rooted
at the directory entry we last read.

I think this could be changed pretty easily. We could change sendDir()
to read all of the directory entries into an in-memory buffer and then
close the directory and iterate over the buffer. I see two potential
disadvantages of that approach. Number one, we could encounter a
directory with a vast number of relations. There are probably
subdirectories of PostgreSQL data directories that contain tens of
millions of files, possibly hundreds of millions of files. So,
remembering all that data in memory would potentially take gigabytes
of memory. I'm not sure that's a huge problem, because if you have
hundreds of millions of tables in a single database, you should
probably have enough memory in the system that a few GB of RAM to take
a backup is no big deal. However, people don't always have the memory
that they should have, and many users do in fact run systems at a
level of load that pushes the boundaries of their hardware.
Nonetheless, we could choose to take the position that caching the
list of filenames is worth it to avoid this risk.

The other consideration here is that this is not a complete remedy. It
makes the race condition narrower, I suppose, but it does not remove
it. Ideally, we would like to do better than "our new code will
corrupt your backup less often." However, I don't quite see how to get
there. We either need the OS to deliver us a reliable list of what
files exist - and I don't see how to make it do that if readdir
doesn't - or we need a way to know what files are supposed to exist
without reference to the OS - which would require some way of reading
the list of relfilenodes from a database to which we're not connected.
So maybe corrupting your backup less often is the best we can do. I do
wonder how often this actually happens though, and on which
filesystems. The provided links seem to suggest that this is mostly a
problem with network filesystems, especially CIFS, and maybe also NFS.

I'd be really interested in knowing whether this happens on a
mainstream, non-networked filesystem. It's not an irrelevant concern
even if it happens only on networked filesystems, but a lot more
people will be at risk if it also happens on ext4 or xfs. It does seem
a little bit surprising if no filesystem has a way of preventing this.
I mean, does open() also randomly but with low probability fail to
find a file that exists, due to a concurrent directory modification on
some directory in the pathname? I assume that would be unacceptable,
and the file system has a way of preventing that from happening, then
it has some way of ensuring a stable read of a directory, at least
over a short period.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_upgrade test failure

2022-10-18 Thread Andres Freund

Hi,

On 2022-10-17 23:31:44 -0500, Justin Pryzby wrote:
> On Tue, Oct 18, 2022 at 01:06:15PM +0900, Michael Paquier wrote:
> > On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:
> > > * Server 2019, as used on CI, still uses the traditional NT semantics
> > > (unlink is asynchronous, when all handles closes)
> > > * the fix I proposed has the right effect (I will follow up with tests
> > > to demonstrate)
> > 
> > Wow, nice investigation.  And cirrus does not offer a newer option
> > either..
> 
> Currently Andres builds images based on cirrus's 2019 image, but I think
> we could use any windows docker image.

You unfortunately can't run newer containers than the host OS :(, just user
older ones. And if you use mismatching containers the startup gets slower
because it switches to use full virtualization rather than containers.

I think we need to switch to use full VMs rather than containers. The
performance of the windows containers is just atrocious (build times on a
local VM with the same number of cores is 1/2 of what we see in CI, test times
1/3), they're slow to start due to pulling all files and decompressing them,
and they're fragile. I've asked Bilal (CCed) to work on generating both
containers and images.

Greetings,

Andres Freund

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-10-18 Thread Önder Kalacı

Hi Wang, all


> And I have another confusion about function
> GetCheapestReplicaIdentityFullPath:
> If rel->pathlist is NIL, could we return NULL directly from this function,
> and
> then set idxoid to InvalidOid in function
> FindUsableIndexForReplicaIdentityFull
> in that case?
>
>
We could, but then we need to move some other checks to some other places.
I find the current flow easier to follow, where all happens
via cheapest_total_path, which is a natural field for this purpose.

Do you have a strong opinion on this?


> ===
>
> Here are some comments for test file  032_subscribe_use_index.pl on v18
> patch:
>
> 1.
> ```
> +# Basic test where the subscriber uses index
> +# and only updates 1 row for and deletes
> +# 1 other row
> ```
> There seems to be an extra "for" here.
>

 Fixed


> 2. Typos for subscription name in the error messages.
> tap_sub_rep_full_0 -> tap_sub_rep_full
>
>
Fixed


> 3. Typo in comments
> ```
> +# use the newly created index (provided that it fullfils the
> requirements).
> ```
> fullfils -> fulfils
>
>
Fixed


> 4. Some extra single quotes at the end of the error message ('").
> For example:
> ```
> # wait until the index is used on the subscriber
> $node_subscriber->poll_query_until(
> 'postgres', q{select (idx_scan = 200) from pg_stat_all_indexes
> where indexrelname = 'test_replica_id_full_idx';}
> ) or die "Timed out while waiting for check subscriber tap_sub_rep_full
> updates 200 rows via index'";
> ```
>

All fixed, thanks



>
> 5. The column names in the error message appear to be a typo.
> ```
> +) or die "Timed out while waiting for check subscriber tap_sub_rep_full
> updates two rows via index scan with index on high cardinality column-1";
> ...
> +) or die "Timed out while waiting for check subscriber tap_sub_rep_full
> updates two rows via index scan with index on high cardinality column-3";
> ...
> +) or die "Timed out while waiting for check subscriber tap_sub_rep_full
> updates two rows via index scan with index on high cardinality column-4";
> ```
> It seems that we need to do the following change: 'column-3' -> 'column-1'
> and
> 'column-4' -> 'column-2'.
> Or we could use the column names directly like this: 'column-1' -> 'column
> a',
> 'column_3' -> 'column a' and 'column_4' -> 'column b'.
>

I think the latter is easier to follow, thanks.


>
> 6. DELETE action is missing from the error message.
> ```
> +# 2 rows from first command, another 2 from the second command
> +# overall index_on_child_1_a is used 4 times
> +$node_subscriber->poll_query_until(
> +   'postgres', q{select idx_scan=4 from pg_stat_all_indexes where
> indexrelname = 'index_on_child_1_a';}
> +) or die "Timed out while waiting for check subscriber tap_sub_rep_full
> updates child_1 table'";
> ```
> I think we execute both UPDATE and DELETE for child_1 here. Could we add
> DELETE
> action to this error message?
>
>
makes sense, added


> 7. Table name in the error message.
> ```
> # check if the index is used even when the index has NULL values
> $node_subscriber->poll_query_until(
> 'postgres', q{select idx_scan=2 from pg_stat_all_indexes where
> indexrelname = 'test_replica_id_full_idx';}
> ) or die "Timed out while waiting for check subscriber tap_sub_rep_full
> updates parent table'";
> ```
> It seems to be "test_replica_id_full" here instead of "parent'".
>
fixed as well.


Attached v19.

Thanks,
Onder KALACI


v19_0001_use_index_on_subs_when_pub_rep_ident_full.patch
Description: Binary data

Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-10-18 Thread Robert Haas

On Mon, Oct 17, 2022 at 4:30 PM Andres Freund  wrote:
> On 2022-10-17 13:34:02 -0400, Robert Haas wrote:
> > I don't feel quite as confident that not attempting a cleanup lock on
> > the new bucket's primary page is OK. I think it should be fine. The
> > existing comment even says it should be fine. But, that comment could
> > be wrong, and I'm not sure that I have my head around what all of the
> > possible interactions around that cleanup lock are. So changing it
> > makes me a little nervous.
>
> If it's not OK, then the acquire-cleanuplock-after-reinit would be an
> active bug though, right?

Yes, probably so.

Another approach here would be to have something like _hash_getnewbuf
that does not use RBM_ZERO_AND_LOCK or call _hash_pageinit, and then
call _hash_pageinit here, perhaps just before nopaque =
HashPageGetOpaque(npage), so that it's within the critical section.
But that doesn't feel very consistent with the rest of the code.

Maybe just nuking the IsBufferCleanupOK call is best, I don't know. I
honestly doubt that it matters very much what we pick here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: havingQual vs hasHavingQual buglets

2022-10-18 Thread Tom Lane

Etsuro Fujita  writes:
> On Tue, Oct 18, 2022 at 9:47 AM Richard Guo  wrote:
>> On Tue, Oct 18, 2022 at 5:37 AM Tom Lane  wrote:
>>> I came across a couple of places in the planner that are checking
>>> for nonempty havingQual; but since these bits run after
>>> const-simplification of the HAVING clause, that produces the wrong
>>> answer for a constant-true HAVING clause (which'll be folded to
>>> empty).  Correct code is to check root->hasHavingQual instead.

> The postgres_fdw bits would be my oversight.  :-(

No worries --- I think the one in set_subquery_pathlist is probably
my fault :-(

> +1  HEAD only seems reasonable.

Pushed that way; thanks for looking.

regards, tom lane

Re: Logical replication timeout problem

2022-10-18 Thread Fabrice Chapuis

Hello Amit,

In version 14.4 the timeout problem for logical replication happens again
despite the patch provided for this issue in this version. When bulky
materialized views are reloaded it broke logical replication. It is
possible to solve this problem by using your new "streaming" option.
Have you ever had this issue reported to you?

Regards

Fabrice

2022-10-10 17:19:02 CEST [538424]: [17-1]
user=postgres,db=dbxxxa00,client=[local] CONTEXT:  SQL statement "REFRESH
MATERIALIZED VIEW sxxxa00.table_base"
PL/pgSQL function refresh_materialized_view(text) line 5 at EXECUTE
2022-10-10 17:19:02 CEST [538424]: [18-1]
user=postgres,db=dbxxxa00,client=[local] STATEMENT:  select
refresh_materialized_view('sxxxa00.table_base');
2022-10-10 17:19:02 CEST [538424]: [19-1]
user=postgres,db=dbxxxa00,client=[local] LOG:  duration: 264815.652 ms
statement: select refresh_materialized_view('sxxxa00.table_base');
2022-10-10 17:19:27 CEST [559156]: [1-1] user=,db=,client= LOG:  automatic
vacuum of table "dbxxxa00.sxxxa00.table_base": index scans: 0
pages: 0 removed, 296589 remain, 0 skipped due to pins, 0 skipped
frozen
tuples: 0 removed, 48472622 remain, 0 are dead but not yet
removable, oldest xmin: 1501528
index scan not needed: 0 pages from table (0.00% of total) had 0
dead item identifiers removed
I/O timings: read: 1.494 ms, write: 0.000 ms
avg read rate: 0.028 MB/s, avg write rate: 107.952 MB/s
buffer usage: 593301 hits, 77 misses, 294605 dirtied
WAL usage: 296644 records, 46119 full page images, 173652718 bytes
system usage: CPU: user: 17.26 s, system: 0.29 s, elapsed: 21.32 s
2022-10-10 17:19:28 CEST [559156]: [2-1] user=,db=,client= LOG:  automatic
analyze of table "dbxxxa00.sxxxa00.table_base"
I/O timings: read: 0.043 ms, write: 0.000 ms
avg read rate: 0.026 MB/s, avg write rate: 0.026 MB/s
buffer usage: 30308 hits, 2 misses, 2 dirtied
system usage: CPU: user: 0.54 s, system: 0.00 s, elapsed: 0.59 s
2022-10-10 17:19:34 CEST [3898111]: [6840-1] user=,db=,client= LOG:
checkpoint complete: wrote 1194 buffers (0.0%); 0 WAL file(s) added, 0
removed, 0 recycled; write=269.551 s, sync=0.002 s, total=269.560 s; sync
files=251, longest=0.00
1 s, average=0.001 s; distance=583790 kB, estimate=583790 kB
2022-10-10 17:20:02 CEST [716163]: [2-1] user=,db=,client= ERROR:
terminating logical replication worker due to timeout
2022-10-10 17:20:02 CEST [3897921]: [13-1] user=,db=,client= LOG:
background worker "logical replication worker" (PID 716163) exited with
exit code 1
2022-10-10 17:20:02 CEST [561346]: [1-1] user=,db=,client= LOG:  logical
replication apply worker for subscription "subxxx_sxxxa00" has started

On Fri, Apr 1, 2022 at 6:09 AM Amit Kapila  wrote:

> On Fri, Apr 1, 2022 at 8:28 AM Euler Taveira  wrote:
> >
> > On Thu, Mar 31, 2022, at 11:27 PM, Amit Kapila wrote:
> >
> > This is exactly our initial analysis and we have tried a patch on
> > these lines and it has a noticeable overhead. See [1]. Calling this
> > for each change or each skipped change can bring noticeable overhead
> > that is why we decided to call it after a certain threshold (100) of
> > skipped changes. Now, surely as mentioned in my previous reply we can
> > make it generic such that instead of calling this (update_progress
> > function as in the patch) for skipped cases, we call it always. Will
> > that make it better?
> >
> > That's what I have in mind but using a different approach.
> >
> > > The functions CreateInitDecodingContext and CreateDecodingContext
> receives the
> > > update_progress function as a parameter. These functions are called in
> 2
> > > places: (a) streaming replication protocol (CREATE_REPLICATION_SLOT)
> and (b)
> > > SQL logical decoding functions (pg_logical_*_changes). Case (a) uses
> > > WalSndUpdateProgress as a progress function. Case (b) does not have
> one because
> > > it is not required -- local decoding/communication. There is no custom
> update
> > > progress routine for each output plugin which leads me to the question:
> > > couldn't we encapsulate the update progress call into the callback
> functions?
> > >
> >
> > Sorry, I don't get your point. What exactly do you mean by this?
> > AFAIS, currently we call this output plugin API in pgoutput functions
> > only, do you intend to get it invoked from a different place?
> >
> > It seems I didn't make myself clear. The callbacks I'm referring to the
> > *_cb_wrapper functions. After every ctx->callbacks.foo_cb() call into a
> > *_cb_wrapper() function, we have something like:
> >
> > if (ctx->progress & PGOUTPUT_PROGRESS_FOO)
> > NewUpdateProgress(ctx, false);
> >
> > The NewUpdateProgress function would contain a logic similar to the
> > update_progress() from the proposed patch. (A different function name
> here just
> > to avoid confusion.)
> >
> > The output plugin is responsible to set ctx->progress with the callback
> > variables (for example,

Re: thinko in basic_archive.c

2022-10-18 Thread Bharath Rupireddy

On Mon, Oct 17, 2022 at 6:45 PM Robert Haas  wrote:
>
> On Fri, Oct 14, 2022 at 4:45 AM Bharath Rupireddy
>  wrote:
> > What happens to the left-over temp files after a server crash? Will
> > they be lying around in the archive directory? I understand that we
> > can't remove such files because we can't distinguish left-over files
> > from a crash and the temp files that another server is in the process
> > of copying.
>
> Yeah, leaving a potentially unbounded number of files around after
> system crashes seems pretty unfriendly. I'm not sure how to fix that,
> exactly.

A simple server restart while the basic_archive module is copying
to/from temp file would leave the file behind, see[1]. I think we can
fix this by defining shutdown_cb for the basic_archive module, like
the attached patch. While this helps in most of the crashes, but not
all. However, this is better than what we have right now.

[1] ubuntu:~/postgres/contrib/basic_archive/archive_directory$ ls
00010001
archtemp.00010002.2493876.1666091933457
archtemp.00010002.2495316.1666091958680

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From cc5450714c8f1624546bf575589cae9562c5db8e Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 18 Oct 2022 12:54:59 +
Subject: [PATCH v1] Remove leftover temporary files via basic_archive shutdown
 callback

---
 contrib/basic_archive/basic_archive.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 9f221816bb..91b6af18bf 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -42,10 +42,12 @@ PG_MODULE_MAGIC;
 
 static char *archive_directory = NULL;
 static MemoryContext basic_archive_context;
+static char	tempfilepath[MAXPGPATH + 256];
 
 static bool basic_archive_configured(void);
 static bool basic_archive_file(const char *file, const char *path);
 static void basic_archive_file_internal(const char *file, const char *path);
+static void basic_archive_shutdown(void);
 static bool check_archive_directory(char **newval, void **extra, GucSource source);
 static bool compare_files(const char *file1, const char *file2);
 
@@ -85,6 +87,7 @@ _PG_archive_module_init(ArchiveModuleCallbacks *cb)
 
 	cb->check_configured_cb = basic_archive_configured;
 	cb->archive_file_cb = basic_archive_file;
+	cb->shutdown_cb = basic_archive_shutdown;
 }
 
 /*
@@ -151,6 +154,8 @@ basic_archive_file(const char *file, const char *path)
 	sigjmp_buf	local_sigjmp_buf;
 	MemoryContext oldcontext;
 
+	MemSet(tempfilepath, '\0', sizeof(tempfilepath));
+
 	/*
 	 * We run basic_archive_file_internal() in our own memory context so that
 	 * we can easily reset it during error recovery (thus avoiding memory
@@ -215,7 +220,6 @@ static void
 basic_archive_file_internal(const char *file, const char *path)
 {
 	char		destination[MAXPGPATH];
-	char		temp[MAXPGPATH + 256];
 	struct stat st;
 	struct timeval tv;
 	uint64		epoch;			/* milliseconds */
@@ -268,21 +272,21 @@ basic_archive_file_internal(const char *file, const char *path)
 		pg_add_u64_overflow(epoch, (uint64) (tv.tv_usec / 1000), &epoch))
 		elog(ERROR, "could not generate temporary file name for archiving");
 
-	snprintf(temp, sizeof(temp), "%s/%s.%s.%d." UINT64_FORMAT,
+	snprintf(tempfilepath, sizeof(tempfilepath), "%s/%s.%s.%d." UINT64_FORMAT,
 			 archive_directory, "archtemp", file, MyProcPid, epoch);
 
 	/*
 	 * Copy the file to its temporary destination.  Note that this will fail
 	 * if temp already exists.
 	 */
-	copy_file(unconstify(char *, path), temp);
+	copy_file(unconstify(char *, path), tempfilepath);
 
 	/*
 	 * Sync the temporary file to disk and move it to its final destination.
 	 * Note that this will overwrite any existing file, but this is only
 	 * possible if someone else created the file since the stat() above.
 	 */
-	(void) durable_rename(temp, destination, ERROR);
+	(void) durable_rename(tempfilepath, destination, ERROR);
 
 	ereport(DEBUG1,
 			(errmsg("archived \"%s\" via basic_archive", file)));
@@ -368,3 +372,16 @@ compare_files(const char *file1, const char *file2)
 
 	return ret;
 }
+
+static void
+basic_archive_shutdown(void)
+{
+	if (tempfilepath[0] == '\0')
+		return;
+
+	/* Remove the leftover temporary file. */
+	if (unlink(tempfilepath) < 0)
+		ereport(WARNING,
+(errcode_for_file_access(),
+ errmsg("could not unlink file \"%s\": %m", tempfilepath)));
+}
-- 
2.34.1

Re: CF Bot failure in wait_for_subscription_sync()

2022-10-18 Thread Amit Kapila

On Tue, Oct 18, 2022 at 2:57 PM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, October 18, 2022 2:16 PM Bharath Rupireddy 
>  wrote:
> >
> > Hi,
> >
> > I have seen 2 patches registered in CF failing on Linux - Debian Bullseye in
> > wait_for_subscription_sync(). It seems like the tables aren't being synced. 
> > I
> > have not done any further analysis. I'm not sure if this issue is being 
> > discussed
> > elsewhere.
> >
> > # Postmaster PID for node "twoways" is 50208 Waiting for all subscriptions 
> > in
> > "twoways" to synchronize data
> > [14:12:43.092](198.391s) # poll_query_until timed out executing this query:
> > # SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r',
> > 's'); # expecting this output:
> > # t
> > # last actual query output:
> > # f
> > # with stderr:
> > timed out waiting for subscriber to synchronize data at t/100_bugs.pl line 
> > 147.
> >
> > https://api.cirrus-ci.com/v1/artifact/task/6618623857917952/log/src/test/sub
> > scription/tmp_check/log/regress_log_100_bugs
> > https://cirrus-ci.com/task/6618623857917952
> > https://cirrus-ci.com/task/5764058174455808
>
> Thanks for reporting this. I am not sure about the root cause but just share
> some initial analysis here.
>
> This testcase waits for table sync to finish for both table "t" and table 
> "t2".
> But from the log, I can only see the log[1] related to the table sync of table
> "t". So it seems that the table sync worker for table "t2" was never started
> due to some reason.
>

Yeah, the reason is not clear to me either. Let me state my
understanding of the issue. IIUC, the following test in 100_bugs.pl
has failed:
$node_twoways->safe_psql('d1', 'ALTER PUBLICATION testpub ADD TABLE t2');
$node_twoways->safe_psql('d2',
'ALTER SUBSCRIPTION testsub REFRESH PUBLICATION');
...
...
$node_twoways->wait_for_subscription_sync($node_twoways, 'testsub', 'd2');

After the REFRESH operation, the new table should have been added to
pg_subscription_rel. This is also visible from LOGS because one of
tables has synced and for table another worker is not started. Now,
ideally, after the new entry is created in pg_subscription_rel, the
apply worker should have got an invalidation message and refreshed the
'table_states_not_ready' list which should have let it start the new
table sync worker for table 't2' but that is not happening here.

-- 
With Regards,
Amit Kapila.

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-18 Thread Amit Kapila

On Tue, Oct 18, 2022 at 5:22 PM Dilip Kumar  wrote:
>
> On Thu, Oct 6, 2022 at 1:37 PM Masahiko Sawada  wrote:
> >
>
> > While looking at v35 patch, I realized that there are some cases where
> > the logical replication gets stuck depending on partitioned table
> > structure. For instance, there are following tables, publication, and
> > subscription:
> >
> > * On publisher
> > create table p (c int) partition by list (c);
> > create table c1 partition of p for values in (1);
> > create table c2 (c int);
> > create publication test_pub for table p, c1, c2 with
> > (publish_via_partition_root = 'true');
> >
> > * On subscriber
> > create table p (c int) partition by list (c);
> > create table c1 partition of p for values In (2);
> > create table c2 partition of p for values In (1);
> > create subscription test_sub connection 'port=5551 dbname=postgres'
> > publication test_pub with (streaming = 'parallel', copy_data =
> > 'false');
> >
> > Note that while both the publisher and the subscriber have the same
> > name tables the partition structure is different and rows go to a
> > different table on the subscriber (eg, row c=1 will go to c2 table on
> > the subscriber). If two current transactions are executed as follows,
> > the apply worker (ig, the leader apply worker) waits for a lock on c2
> > held by its parallel apply worker:
> >
> > * TX-1
> > BEGIN;
> > INSERT INTO p SELECT 1 FROM generate_series(1, 1); --- changes are 
> > streamed
> >
> > * TX-2
> > BEGIN;
> > TRUNCATE c2; --- wait for a lock on c2
> >
> > * TX-1
> > INSERT INTO p SELECT 1 FROM generate_series(1, 1);
> > COMMIT;
> >
> > This might not be a common case in practice but it could mean that
> > there is a restriction on how partitioned tables should be structured
> > on the publisher and the subscriber when using streaming = 'parallel'.
> > When this happens, since the logical replication cannot move forward
> > the users need to disable parallel-apply mode or increase
> > logical_decoding_work_mem. We could describe this limitation in the
> > doc but it would be hard for users to detect problematic table
> > structure.
>
> Interesting case.  So I think the root of the problem is the same as
> what we have for a column is marked unique to the subscriber but not
> to the publisher.  In short, two transactions which are independent of
> each other on the publisher are dependent on each other on the
> subscriber side because table definition is different on the
> subscriber.  So can't we handle this case in the same way by marking
> this table unsafe for parallel-apply?
>

Yes, we can do that. I think Hou-San has already dealt that way in his
latest patch [1]. See his response in the email [1]: "Disallow
replicating from or to a partitioned table in parallel streaming
mode".

[1] - 
https://www.postgresql.org/message-id/OS0PR01MB57160760B34E1655718F4D1994249%40OS0PR01MB5716.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.

Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada

On Tue, Oct 18, 2022 at 7:56 PM Amit Kapila  wrote:
>
> On Mon, Oct 17, 2022 at 7:05 AM Masahiko Sawada  wrote:
> >
> > On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila  wrote:
> > >
> > > --- a/src/backend/replication/logical/decode.c
> > > +++ b/src/backend/replication/logical/decode.c
> > > @@ -113,6 +113,15 @@
> > > LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> > > XLogReaderState *recor
> > >   buf.origptr);
> > >   }
> > >
> > > +#ifdef USE_ASSERT_CHECKING
> > > + /*
> > > + * Check the order of transaction LSNs when we reached the start decoding
> > > + * LSN. See the comments in AssertTXNLsnOrder() for details.
> > > + */
> > > + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
> > > + AssertTXNLsnOrder(ctx->reorder);
> > > +#endif
> > > +
> > >   rmgr = GetRmgr(XLogRecGetRmid(record));
> > > >
> > >
> > > I am not able to think how/when this check will be useful. Because we
> > > skipped assert checking only for records that are prior to
> > > start_decoding_at point, I think for those records ordering should
> > > have been checked before the restart. start_decoding_at point will be
> > > either (a) confirmed_flush location, or (b) lsn sent by client, and
> > > any record prior to that must have been processed before restart.
> >
> > Good point. I was considering the case where the client sets far ahead
> > LSN but it's not worth considering this case in this context. I've
> > updated the patch accoringly.
> >
>
> One minor comment:
> Can we slightly change the comment: ". The ordering of the records
> prior to the LSN, we should have been checked before the restart." to
> ". The ordering of the records prior to the start_decoding_at LSN
> should have been checked before the restart."?

Agreed. I'll update the patch accordingly.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada

On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila  wrote:
>
> On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  wrote:
> >
> > >
> > > I think because the test case proposed needs all three changes, we can
> > > push the change-1 without a test case and then as a second patch have
> > > change-2 for HEAD and change-3 for back branches with the test case.
> > > Do you have any other ideas to proceed here?
> >
> > I found another test case that causes the assertion failure at
> > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > attached the patch for the test case. In this test case, I modified a
> > user-catalog table instead of system-catalog table. That way, we don't
> > generate invalidation messages while generating NEW_CID records. As a
> > result, we mark only the subtransactions as containing catalog change
> > and don't make association between top-level and sub transactions. The
> > assertion failure happens on all supported branches. If we need to fix
> > this (I believe so), Change-2 needs to be backpatched to all supported
> > branches.
> >
> > There are three changes as Amit mentioned, and regarding the test
> > case, we have three test cases I've attached: truncate_testcase.patch,
> > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > between assertion failures and test cases are very complex. I could
> > not find any test case to cause only one assertion failure on all
> > branches. One idea to proceed is:
> >
> > Patch-1 includes Change-1 and is applied to all branches.
> >
> > Patch-2 includes Change-2 and the user_catalog test case, and is
> > applied to all branches.
> >
> > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > test case), and is applied to v14 and v15 (also till v11 if we
> > prefer).
> >
> > The patch-1 doesn't include any test case but the user_catalog test
> > case can test both Change-1 and Change-2 on all branches.
> >
>
> I was wondering if it makes sense to commit both Change-1 and Change-2
> together as one patch? Both assertions are caused by a single test
> case and are related to the general problem that the association of
> top and sub transaction is only guaranteed to be formed before we
> decode transaction changes. Also, it would be good to fix the problem
> with a test case that can cause it. What do you think?

Yeah, it makes sense to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Mingw task for Cirrus CI

2022-10-18 Thread Nazir Bilal Yavuz


Hi,

Thanks for the patch!

On 10/18/22 12:49, Melih Mutlu wrote:



You can find the updated patch attached.



Does it makes sense to set these at 'Windows - Server 2019, MinGW64 - 
Meson' task:


env:
    ...
    CHERE_INVOKING: 1
    BASH_EXE: C:\msys64\usr\bin\bash.exe

'CHERE_INVOKING: 1' will cause bash.exe to start from current working 
directory(%CIRRUS_WORKING_DIR%).


In this way, there is no need for
'cd %CIRRUS_WORKING_DIR%' when using bash.exe,
also no need for 'BUILD_DIR' environment variable.

i.e:

configure_script: |
    %BASH_EXE% -lc "meson setup --buildtype debug -Dcassert=true 
-Db_pch=true -DTAR=%TAR% build"


will work.

Regards,
Nazir Bilal Yavuz

Re: PATCH: Using BRIN indexes for sorted output

2022-10-18 Thread Tomas Vondra

On 10/17/22 16:00, Matthias van de Meent wrote:
> On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
>  wrote:
>>
>> On 10/16/22 22:17, Matthias van de Meent wrote:
>>> On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
>>>  wrote:
 Try to formulate the whole algorithm. Maybe I'm missing something.

 The current algorithm is something like this:

 1. request info about ranges from the BRIN opclass
 2. sort them by maxval and minval
>>>
>>> Why sort on maxval and minval? That seems wasteful for effectively all
>>> sorts, where range sort on minval should suffice: If you find a range
>>> that starts at 100 in a list of ranges sorted at minval, you've
>>> processed all values <100. You can't make a similar comparison when
>>> that range is sorted on maxvals.
>>
>> Because that allows to identify overlapping ranges quickly.
>>
>> Imagine you have the ranges sorted by maxval, which allows you to add
>> tuples in small increments. But how do you know there's not a range
>> (possibly with arbitrarily high maxval), that however overlaps with the
>> range we're currently processing?
> 
> Why do we need to identify overlapping ranges specifically? If you
> sort by minval, it becomes obvious that any subsequent range cannot
> contain values < the minval of the next range in the list, allowing
> you to emit any values less than the next, unprocessed, minmax range's
> minval.
> 

D'oh! I think you're right, it should be possible to do this with only
sort by minval. And it might actually be better way to do that.

I think I chose the "maxval" ordering because it seemed reasonable.
Looking at the current range and using the maxval as the threshold
seemed reasonable. But it leads to a bunch of complexity with the
intersecting ranges, and I never reconsidered this choice. Silly me.

 3. NULLS FIRST: read all ranges that might have NULLs => output
 4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
 5. load all tuples from "splill" tuplestore, compare to maxval
>>>
>>> Instead of this, shouldn't an update to tuplesort that allows for
>>> restarting the sort be better than this? Moving tuples that we've
>>> accepted into BRINsort state but not yet returned around seems like a
>>> waste of cycles, and I can't think of a reason why it can't work.
>>>
>>
>> I don't understand what you mean by "update to tuplesort". Can you
>> elaborate?
> 
> Tuplesort currently only allows the following workflow: you to load
> tuples, then call finalize, then extract tuples. There is currently no
> way to add tuples once you've started extracting them.
> 
> For my design to work efficiently or without hacking into the
> internals of tuplesort, we'd need a way to restart or 'un-finalize'
> the tuplesort so that it returns to the 'load tuples' phase. Because
> all data of the previous iteration is already sorted, adding more data
> shouldn't be too expensive.
> 

Not sure. I still think it's better to limit the amount of data we have
in the tuplesort. Even if the tuplesort can efficiently skip the already
sorted part, it'll still occupy disk space, possibly even force the data
to disk etc. (We'll still have to write that into a tuplestore, but that
should be relatively small and short-lived/recycled).

FWIW I wonder if the assumption that tuplesort can quickly skip already
sorted data holds e.g. for tuplesorts much larger than work_mem, but I
haven't checked that.

I'd also like to include some more info in the explain, like how many
times we did a sort, and what was the largest amount of data we sorted.
Although, maybe that could be tracked by tracking the tuplesort size of
the last sort.

Considering the tuplesort does not currently support this, I'll probably
stick to the existing approach with separate tuplestore. There's enough
complexity in the patch already, I think. The only thing we'll need with
the minval ordering is the ability to "peek ahead" to the next minval
(which is going to be the threshold used to route values either to
tuplesort or tuplestore).

>>  The point of spilling them into a tuplestore is to make the sort cheaper
>> by not sorting tuples that can't possibly be produced, because the value
>> exceeds the current maxval. Consider ranges sorted by maxval
>> [...]
>>
>> Or maybe I just don't understand what you mean.
> 
> If we sort the ranges by minval like this:
> 
> 1. [0,1000]
> 2. [0,999]
> 3. [50,998]
> 4. [100,997]
> 5. [100,996]
> 6. [150,995]
> 
> Then we can load and sort the values for range 1 and 2, and emit all
> values up to (not including) 50 - the minval of the next,
> not-yet-loaded range in the ordered list of ranges. Then add the
> values from range 3 to the set of tuples we have yet to output; sort;
> and then emit valus up to 100 (range 4's minval), etc. This reduces
> the amount of tuples in the tuplesort to the minimum amount needed to
> output any specific value.
> 
> If the ranges are sorted and loaded by maxval, like your algorithm expects:

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-18 Thread Dilip Kumar

On Thu, Oct 6, 2022 at 1:37 PM Masahiko Sawada  wrote:
>

> While looking at v35 patch, I realized that there are some cases where
> the logical replication gets stuck depending on partitioned table
> structure. For instance, there are following tables, publication, and
> subscription:
>
> * On publisher
> create table p (c int) partition by list (c);
> create table c1 partition of p for values in (1);
> create table c2 (c int);
> create publication test_pub for table p, c1, c2 with
> (publish_via_partition_root = 'true');
>
> * On subscriber
> create table p (c int) partition by list (c);
> create table c1 partition of p for values In (2);
> create table c2 partition of p for values In (1);
> create subscription test_sub connection 'port=5551 dbname=postgres'
> publication test_pub with (streaming = 'parallel', copy_data =
> 'false');
>
> Note that while both the publisher and the subscriber have the same
> name tables the partition structure is different and rows go to a
> different table on the subscriber (eg, row c=1 will go to c2 table on
> the subscriber). If two current transactions are executed as follows,
> the apply worker (ig, the leader apply worker) waits for a lock on c2
> held by its parallel apply worker:
>
> * TX-1
> BEGIN;
> INSERT INTO p SELECT 1 FROM generate_series(1, 1); --- changes are 
> streamed
>
> * TX-2
> BEGIN;
> TRUNCATE c2; --- wait for a lock on c2
>
> * TX-1
> INSERT INTO p SELECT 1 FROM generate_series(1, 1);
> COMMIT;
>
> This might not be a common case in practice but it could mean that
> there is a restriction on how partitioned tables should be structured
> on the publisher and the subscriber when using streaming = 'parallel'.
> When this happens, since the logical replication cannot move forward
> the users need to disable parallel-apply mode or increase
> logical_decoding_work_mem. We could describe this limitation in the
> doc but it would be hard for users to detect problematic table
> structure.

Interesting case.  So I think the root of the problem is the same as
what we have for a column is marked unique to the subscriber but not
to the publisher.  In short, two transactions which are independent of
each other on the publisher are dependent on each other on the
subscriber side because table definition is different on the
subscriber.  So can't we handle this case in the same way by marking
this table unsafe for parallel-apply?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

PATCH: AM-specific statistics, with an example implementation for BRIN (WIP)

2022-10-18 Thread Tomas Vondra

Hi,

A couple days ago I posted a WIP patch [1] implementing "BRIN Sort",
i.e. a node producing sorted output for BRIN minmax indexes. One of the
challenges I mentioned in that thread is costing - that's actually not
specific to that patch, it's an issue affecting BRIN in general, not
just the proposed node, due to block-range indexes being very different
from regular indexes with explicit tuple pointers.

I mentioned I have some ideas how to improve this, and that I'll start a
separate thread to discuss this. So here we go ...


The traditional estimation problem is roughly this:

Given a condition, how many rows will match it?

That is, given a table with X rows, we need to estimate how many rows
will match a WHERE condition, for example. And once we have the row
estimate, we can estimate the amount of I/O, cost for sorting, etc.

We have built fairly solid capability to calculate these estimates,
using per-column statistics, extended statistics, ... The calculated
estimates are not always perfect, but in general it works well.

This affects all path types etc. mostly equally - yes, some paths are
more sensitive to poor estimates (e.g. runtime may grow exponentially
with increasing rowcount).


BRIN indexes however add another layers to this - once we have estimated
the number of rows, we need to estimate the number of pages ranges this
maps to. You may estimate the WHERE condition to match 1000 rows, but
then you need to decide if that's 1 page range, 1000 page ranges or
possibly even all page ranges for the table.

It all depends on how "correlated" the data is with physical position in
the table. If you have perfectly correlated data, it may be enough to
scan a single page. If it's random, you may need to scan everything.

The existing costing uses the column correlation statistics, but sadly
that's rather insensitive to outlier values. If you have a sequential
table, and then set 1% of data to min/max (making the ranges very wide),
the correlation will remain very close to 1.0, but you'll have to scan
all the ranges (and the costing won't reflect that).

The "BRIN sort" patch needs to estimate a different thing - given a page
range, how many other page ranges overlap with it? This is roughly the
amount of stuff we'll need to scan and sort in order to produce the
first row.

These are all things we can't currently estimate - we have some rough
heuristics, but it's pretty easy to confuse those.

Therefore, I propose to calculate a couple new statistics for BRIN
indexes (assume minmax indexes, unless mentioned otherwise):


1) average number of overlapping ranges
---

Given a range, with how many ranges it overlaps? In a perfectly
sequential table this will be 0, so if you have a value you know it'll
match just one range. In random table, it'll be pretty close to the
number of page ranges.

This can be calculated by simply walking the ranges, sorted by minval
(see brin_minmax_count_overlaps).


2) average number of matching ranges for a value


Given a value, how many ranges it matches? This can be calculated by
matching sampled rows to ranges (brin_minmax_match_tuples_to_ranges).

For minmax indexes this is somewhat complementary to the average number
of overlaps, the relationship is roughly this:

  avg(# of matching ranges) = 1 + avg(number of overlapping ranges)/2

The intuition is that if you assume a range randomly overlapped by other
ranges, you're likely to hit about 1/2 of them.

The reason why we want to calculate both (1) and (2) is that for other
opclasses the relationship is not that simple. For bloom opclasses we
probably can't calculate overlaps at all (or at least not that easily),
so the average number of matches is all we have. For minmax-multi, the
overlaps will probably use only the min/max values, ignoring the "gaps",
but the matches should use the gaps.


3) a bunch of other simple statistics
-

These are number of summarized / not-summarized ranges, all_nulls and
has_nulls ranges, which is useful to estimate IS NULL conditions etc.


The attached patch implements a PoC of this. There's a new GUC
(enable_indexam_stats) that can be used to enable/disable this (both the
ANALYZE and costing part). By default it's "off" so make sure to do

  SET enable_indexam_stats = true;

The statistics is stored in pg_statistics catalog, in a new staindexam
column (with bytea). The opclasses can implement a new support
procedure, similarly to what we do of opclass options. There's a couple
of wrinkles (should be explained in XXX comments), but in principle this
works.

The brin_minmax_stats procedure implements this for minmax opclasses,
calculating the stuff mentioned above. I've been experimenting with
different ways to calculate some of the stuff, and ANALYZE prints info
about the calculated values and timings (this can be disabled by
removing the STATS_CROSS_CHECK define).

Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Amit Kapila

On Mon, Oct 17, 2022 at 7:05 AM Masahiko Sawada  wrote:
>
> On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila  wrote:
> >
> > --- a/src/backend/replication/logical/decode.c
> > +++ b/src/backend/replication/logical/decode.c
> > @@ -113,6 +113,15 @@
> > LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> > XLogReaderState *recor
> >   buf.origptr);
> >   }
> >
> > +#ifdef USE_ASSERT_CHECKING
> > + /*
> > + * Check the order of transaction LSNs when we reached the start decoding
> > + * LSN. See the comments in AssertTXNLsnOrder() for details.
> > + */
> > + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
> > + AssertTXNLsnOrder(ctx->reorder);
> > +#endif
> > +
> >   rmgr = GetRmgr(XLogRecGetRmid(record));
> > >
> >
> > I am not able to think how/when this check will be useful. Because we
> > skipped assert checking only for records that are prior to
> > start_decoding_at point, I think for those records ordering should
> > have been checked before the restart. start_decoding_at point will be
> > either (a) confirmed_flush location, or (b) lsn sent by client, and
> > any record prior to that must have been processed before restart.
>
> Good point. I was considering the case where the client sets far ahead
> LSN but it's not worth considering this case in this context. I've
> updated the patch accoringly.
>

One minor comment:
Can we slightly change the comment: ". The ordering of the records
prior to the LSN, we should have been checked before the restart." to
". The ordering of the records prior to the start_decoding_at LSN
should have been checked before the restart."?

-- 
With Regards,
Amit Kapila.

Re: archive modules

2022-10-18 Thread Bharath Rupireddy

On Mon, Oct 17, 2022 at 11:20 AM Michael Paquier  wrote:
>
> On Mon, Oct 17, 2022 at 01:46:39PM +0900, Kyotaro Horiguchi wrote:
> > As the code written, when archive library is being added while archive
> > command is already set, archiver first emits seemingly positive
> > message "restarting archive process because of..", then errors out
> > after the resatart and keep restarting with complaining for the wrong
> > setting. I think we don't need the first message.
> >
> > The ERROR always turns into FATAL, so FATAL would less confusing here,
> > maybe.
>
> You mean the second message in HandlePgArchInterrupts() when
> archiveLibChanged is false?  An ERROR or a FATAL would not change much
> as there is a proc_exit() anyway down the road.

Yes, ERROR or FATAL it really doesn't matter, the process exits see,
pg_re_throw(), for archiver PG_exception_stack is null.
2022-10-18 09:57:41.869 UTC [2479104] FATAL:  both archive_command and
archive_library specified
2022-10-18 09:57:41.869 UTC [2479104] DETAIL:  Only one of
archive_command, archive_library may be set.

I think Kyotaro-san's concern is to place errmsg("both archive_command
and archive_library specified"), before errmsg("restarting archiver
process because value of \"archive_library\" was changed", something
like the attached v4 patch.

> +   if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
> +   ereport(ERROR,
> +   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("both archive_command and archive_library specified"),
> +errdetail("Only one of archive_command, archive_library may 
> be set.")));
>
> So, backpedalling from upthread where Peter mentions that we should
> complain if both archive_command and archive_library are set (creating
> a parallel with recovery parameters), I'd like to think that pgarch.c
> should have zero knowledge of what an archive_command is and should
> just handle the library part.  This makes the whole reasoning around
> what pgarch.c should be much simpler, aka it just needs to know about
> archive *libraries*, not *commands*.

Are you saying that we make/treat/build shell_archive.c as a separate
shared library/module (instead of just an object file) and load it in
pgarc.c? If yes, this can make pgarch.c simple.

> That's the kind of business that
> check_configured_cb() is designed for, actually, as far as I
> understand, or this callback could just be removed entirely for the
> same effect, as there would be no point in having pgarch.c do its
> thing without archive_library or archive_command where a WARNING is
> issued in the default case (shell_archive with no archive_command).

If it's done as said above, the corresponding check_configured_cb()
can deal with allowing/disallowing/misconfiguring various parameters.

> And, by the way, this patch would prevent the existence of archive
> modules that need to be loaded but *want* an archive_command with
> what they want to achieve.  That does not strike me as a good idea if
> we want to have a maximum of flexibility with this facility.  I think
> that for all that, we should put the responsability of what should be
> set or not set directly to the modules, aka basic_archive could
> complain if archive_command is set, but that does not strike me as a
> mandatory requirement, either.  It is true that archive_library has
> been introduced as a way to avoid using archive_command, but the point
> of creating a stronger dependency between both would be IMO annoying
> in the long-term.

Great thought! If the responsibility of
allowing/disallowing/misconfiguring various parameters is given to
check_configured_cb(), the modules can decide whether to error out or
deal with it or use it.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Amit Kapila

On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  wrote:
>
> >
> > I think because the test case proposed needs all three changes, we can
> > push the change-1 without a test case and then as a second patch have
> > change-2 for HEAD and change-3 for back branches with the test case.
> > Do you have any other ideas to proceed here?
>
> I found another test case that causes the assertion failure at
> "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> attached the patch for the test case. In this test case, I modified a
> user-catalog table instead of system-catalog table. That way, we don't
> generate invalidation messages while generating NEW_CID records. As a
> result, we mark only the subtransactions as containing catalog change
> and don't make association between top-level and sub transactions. The
> assertion failure happens on all supported branches. If we need to fix
> this (I believe so), Change-2 needs to be backpatched to all supported
> branches.
>
> There are three changes as Amit mentioned, and regarding the test
> case, we have three test cases I've attached: truncate_testcase.patch,
> analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> between assertion failures and test cases are very complex. I could
> not find any test case to cause only one assertion failure on all
> branches. One idea to proceed is:
>
> Patch-1 includes Change-1 and is applied to all branches.
>
> Patch-2 includes Change-2 and the user_catalog test case, and is
> applied to all branches.
>
> Patch-3 includes Change-3 and the truncate test case (or the analyze
> test case), and is applied to v14 and v15 (also till v11 if we
> prefer).
>
> The patch-1 doesn't include any test case but the user_catalog test
> case can test both Change-1 and Change-2 on all branches.
>

I was wondering if it makes sense to commit both Change-1 and Change-2
together as one patch? Both assertions are caused by a single test
case and are related to the general problem that the association of
top and sub transaction is only guaranteed to be formed before we
decode transaction changes. Also, it would be good to fix the problem
with a test case that can cause it. What do you think?

-- 
With Regards,
Amit Kapila.

Re: effective_multixact_freeze_max_age issue

2022-10-18 Thread Anton A. Melnikov


Hello!

On 31.08.2022 21:38, Peter Geoghegan wrote:

On Tue, Aug 30, 2022 at 8:56 PM Nathan Bossart  wrote:

LGTM


Pushed, thanks.



In this commit 
https://github.com/postgres/postgres/commit/c3ffa731a5f99c4361203015ce2219d209fea94c
there are checks if oldestXmin and oldestMxact havn't become too far in the 
past.
But the corresponding error messages say also some different things about 
'cutoff for freezing tuples',
ie about checks for another variables: freezeLimit and multiXactCutoff.
See: 
https://github.com/postgres/postgres/commit/c3ffa731a5f99c4361203015ce2219d209fea94c?diff=split#diff-795a3938e3bed9884d426bd010670fe505580732df7d7012fad9edeb9df54badR1075
and
https://github.com/postgres/postgres/commit/c3ffa731a5f99c4361203015ce2219d209fea94c?diff=split#diff-795a3938e3bed9884d426bd010670fe505580732df7d7012fad9edeb9df54badR1080

It's interesting that prior to this commit, checks were made for freeze limits, 
but the error messages were talking about oldestXmin and oldestMxact.

Could you clarify this moment please? Would be very grateful.

As variant may be split these checks for the freeze cuttoffs and the oldest 
xmins for clarity?
The patch attached tries to do this.


--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Companycommit f27ea4e61a7680452b56a7c11b1dcab1c0b81c6b
Author: Anton A. Melnikov 
Date:   Fri Oct 14 11:57:22 2022 +0300

Split 'too far in the past checks' in vacuum_set_xid_limits().

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7ccde07de9..0bbeafba49 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1072,14 +1072,23 @@ vacuum_set_xid_limits(Relation rel,
 		safeOldestMxact = FirstMultiXactId;
 	if (TransactionIdPrecedes(*oldestXmin, safeOldestXmin))
 		ereport(WARNING,
-(errmsg("cutoff for removing and freezing tuples is far in the past"),
+(errmsg("oldest xmin is far in the past"),
  errhint("Close open transactions soon to avoid wraparound problems.\n"
 		 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
 	if (MultiXactIdPrecedes(*oldestMxact, safeOldestMxact))
 		ereport(WARNING,
-(errmsg("cutoff for freezing multixacts is far in the past"),
+(errmsg("oldest multixact is far in the past"),
+ errhint("Close open transactions with multixacts soon to avoid wraparound problems.\n")));
+
+	if (TransactionIdPrecedes(*freezeLimit, safeOldestXmin))
+		ereport(WARNING,
+(errmsg("cutoff for freezing tuples is far in the past"),
  errhint("Close open transactions soon to avoid wraparound problems.\n"
 		 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+	if (MultiXactIdPrecedes(*multiXactCutoff, safeOldestMxact))
+		ereport(WARNING,
+(errmsg("cutoff for freezing multixacts is far in the past"),
+ errhint("Close open transactions with multixacts soon to avoid wraparound problems.\n")));
 
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.

Re: Exponentiation confusion

2022-10-18 Thread Dean Rasheed

[Moving this to -hackers]

> On 13/10/2022 18:20 CEST Adrian Klaver  wrote:
> > select power(10, -18::numeric);
> > power
> > 
> > 0.
> >

On Thu, 13 Oct 2022 at 18:17, Tom Lane  wrote:
>
> An inexact result isn't surprising, but it shouldn't be *that* inexact.
>
> I'm inclined to think that we should push the responsibility for choosing
> its rscale into power_var_int(), because internally that already does
> estimate the result weight, so with a little code re-ordering we won't
> need duplicative estimates.  Don't have time to work on that right now
> though ... Dean, are you interested in fixing this?
>

Here's a patch along those lines, bringing power_var_int() more in
line with neighbouring functions by having it choose its own result
scale.

It was necessary to also move the overflow/underflow tests up, in
order to avoid a potential integer overflow when deciding the rscale.

Looking more closely at the upper limit of the overflow test, it turns
out it was far too large. I'm not sure where the "3 * SHRT_MAX" came
from, but I suspect it was just a thinko on my part, back in
7d9a4737c2. I've replaced that with SHRT_MAX + 1, which kicks in much
sooner without changing the actual maximum result allowed, which is <
10^131072 (the absolute upper limit of the numeric type).

The first half the the underflow test condition "f + 1 < -rscale" goes
away, since this is now being done before rscale is computed, and the
choice of rscale makes that condition false. In fact, the new choice
of rscale now ensures that when sig_digits is computed further down,
it is guaranteed to be strictly greater than 0, rather than merely
being >= 0 as before, which is good.

As expected, various regression test results change, since the number
of significant digits computed is now different, but I think the new
results look a lot better, and more consistent. I regenerated the
numeric_big test results by re-running the bc script and rounding to
the new output precisions, and the results from power_var_int()
exactly match in every case. This already included a number of cases
that used to round to zero, and now produce much more reasonable
results.

The test cases where the result actually does round to zero now output
1000 zeros after the decimal point. That looks a little messy, but I
think it's the right thing to do in fixed-point arithmetic -- it's
consistent with the fractional power case, and with exp(numeric),
reflecting the fact that the result is zero to 1000 decimal places,
whilst not being exactly zero.

Overall, I'm quite happy with these results. The question is, should
this be back-patched?

In the past, I think I've only back-patched numeric bug-fixes where
the digits output by the old code were incorrect or an error was
thrown, not changes that resulted in a different number of digits
being output, changing the precision of already-correct results.
However, having 10.0^(-18) produce zero seems pretty bad, so my
inclination is to back-patch, unless anyone objects.

Regards,
Dean
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index cafe1ac..af0d94b
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -571,8 +571,7 @@ static void log_var(const NumericVar *ba
 	NumericVar *result);
 static void power_var(const NumericVar *base, const NumericVar *exp,
 	  NumericVar *result);
-static void power_var_int(const NumericVar *base, int exp, NumericVar *result,
-		  int rscale);
+static void power_var_int(const NumericVar *base, int exp, NumericVar *result);
 static void power_ten_int(int exp, NumericVar *result);

 static int	cmp_abs(const NumericVar *var1, const NumericVar *var2);
@@ -10335,13 +10334,8 @@ power_var(const NumericVar *base, const
 		{
 			if (expval64 >= PG_INT32_MIN && expval64 <= PG_INT32_MAX)
 			{
-/* Okay, select rscale */
-rscale = NUMERIC_MIN_SIG_DIGITS;
-rscale = Max(rscale, base->dscale);
-rscale = Max(rscale, NUMERIC_MIN_DISPLAY_SCALE);
-rscale = Min(rscale, NUMERIC_MAX_DISPLAY_SCALE);
-
-power_var_int(base, (int) expval64, result, rscale);
+/* Okay, use power_var_int */
+power_var_int(base, (int) expval64, result);
 return;
 			}
 		}
@@ -10475,19 +10469,74 @@ power_var(const NumericVar *base, const
  * power_var_int() -
  *
  *	Raise base to the power of exp, where exp is an integer.
+ *
+ *	Note: this routine chooses dscale of the result.
  */
 static void
-power_var_int(const NumericVar *base, int exp, NumericVar *result, int rscale)
+power_var_int(const NumericVar *base, int exp, NumericVar *result)
 {
 	double		f;
 	int			p;
 	int			i;
+	int			rscale;
 	int			sig_digits;
 	unsigned int mask;
 	bool		neg;
 	NumericVar	base_prod;
 	int			local_rscale;

+	/*
+	 * Choose the result scale.  For this we need an estimate of the decimal
+	 * weight of the result, which we obtain by approximating using double
+	 * precision arithmetic.
+	 *
+

Re: interrupted tap tests leave postgres instances around

2022-10-18 Thread Alvaro Herrera

On 2022-Oct-01, Andres Freund wrote:

> Perhaps the END{} routine should call $node->_update_pid(-1); if $exit_code !=
> 0 and _pid is undefined?

Yeah, that sounds reasonable.

> That does seem to reduce the incidence of "leftover" postgres
> instances. 001_start_stop.pl leaves some behind, but that makes sense, because
> it's bypassing the whole node management. But I still occasionally see some
> remaining processes if I crank up test concurrency.
> 
> Ah! At least part of the problem is that sub stop() does BAIL_OUT, and of
> course it can fail as part of the shutdown.

I made teardown_node pass down fail_ok=>1 to avoid this problem, so we
no longer BAIL_OUT in that case.


> But there's still some that survive, where your perl.trace doesn't contain the
> node getting shut down...

Yeah, something's still unexplained.  I'll get this pushed soon, which
already reduces the number of leftover instances a good amount.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
>From b43b3071abfd7bb83f7ec5942095aa031c3ed038 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 18 Oct 2022 12:11:46 +0200
Subject: [PATCH v2] Better handle interrupting TAP tests

Set up a signal handler for INT/TERM so that we run our END block if we
get them.  In END, if the exit status indicates a problem, call
_update_pid(-1) to improve chances of the stop working in case start()
hasn't returned yet.

Also, change END's teardown_node() so that it passes fail_ok=>1, so that
if a node fails to stop, we still stop the other nodes in the same test.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index d7c13318b0..17856d69c8 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1545,7 +1545,13 @@ END
 
 	foreach my $node (@all_nodes)
 	{
-		$node->teardown_node;
+		# During unclean termination (which could be a signal or some
+		# other failure), we're not sure that the status of our nodes
+		# has been correctly set up already, so try and detect their
+		# status so that we can shut them down properly.
+		$node->_update_pid(-1) if $exit_code != 0;
+
+		$node->teardown_node(fail_ok => 1);
 
 		# skip clean if we are requested to retain the basedir
 		next if defined $ENV{'PG_TEST_NOCLEAN'};
@@ -1564,13 +1570,15 @@ END
 
 Do an immediate stop of the node
 
+Any optional extra parameter is passed to ->stop.
+
 =cut
 
 sub teardown_node
 {
-	my $self = shift;
+	my ($self, %params) = @_;
 
-	$self->stop('immediate');
+	$self->stop('immediate', %params);
 	return;
 }
 
@@ -2922,6 +2930,13 @@ sub corrupt_page_checksum
 	return;
 }
 
+#
+# Signal handlers
+#
+$SIG{TERM} = $SIG{INT} = sub {
+	die "death by signal";
+};
+
 =pod
 
 =back
-- 
2.30.2

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-18 Thread Amit Kapila

On Tue, Oct 18, 2022 at 8:06 AM Peter Smith  wrote:
>
> Hi, here are my review comments for patch v38-0001.
>
> 3.
>
> + /* Ensure we are reading the data into our memory context. */
> + oldctx = MemoryContextSwitchTo(ApplyMessageContext);
>
> Doesn't something need to switch back to this 'oldctx' prior to
> breaking out of the for(;;) loop?
>
> ~~~
>
> 4.
>
> + apply_dispatch(&s);
> +
> + MemoryContextReset(ApplyMessageContext);
>
> Isn't this broken now? Since you've removed the
> MemoryContextSwitchTo(oldctx), so next iteration will switch to
> ApplyMessageContext again which will overwrite and lose knowledge of
> the original 'oldctx' (??)
>
> ~~
>
> 5.
>
> Maybe this is a silly idea, I'm not sure. Because this is an infinite
> loop, then instead of the multiple calls to
> MemoryContextReset(ApplyMessageContext) maybe there can be just a
> single call to it immediately before you switch to that context in the
> first place. The effect will be the same, won't it?
>

I think so but I think it will look a bit odd, especially for the
first time. If the purpose is to just do it once, won't it be better
to do it at the end of for loop?

>
> 9. apply_handle_stream_start
>
> + *
> + * XXX We can avoid sending pairs of the START/STOP messages to the parallel
> + * worker because unlike apply worker it will process only one transaction 
> at a
> + * time. However, it is not clear whether that is worth the effort because it
> + * is sent after logical_decoding_work_mem changes.
>   */
>  static void
>  apply_handle_stream_start(StringInfo s)
>
> As previously mentioned ([1] #13b) it's not obvious to me what that
> last sentence means. e.g. "because it is sent"  - what is "it"?
>

Here, it refers to START/STOP messages, so I think we should say "...
because these messages are sent .." instead of "... because it is sent
...". Does that makes sense to you?

-- 
With Regards,
Amit Kapila.

Re: remove no longer necessary Perl compatibility hack

2022-10-18 Thread Alvaro Herrera

On 2022-Oct-17, Richard Guo wrote:

> On Mon, Oct 17, 2022 at 4:24 PM Alvaro Herrera 
> wrote:
> 
> > While messing about with Cluster.pm I noticed that we don't need the
> > hack to work around lack of parent.pm in very old Perl versions, because
> > we no longer support those versions (per commit 4c1532763a00).  Trivial
> > patch attached.
> 
> +1. Since we've got rid of perl of versions before 5.10.1, I see no
> reason we do not do this.

Thanks for looking!  Pushed now.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Al principio era UNIX, y UNIX habló y dijo: "Hello world\n".
No dijo "Hello New Jersey\n", ni "Hello USA\n".

Re: Improve errhint for ALTER SUBSCRIPTION ADD/DROP PUBLICATION

2022-10-18 Thread Alvaro Herrera

On 2022-Oct-18, Japin Li wrote:

> 
> On Tue, 18 Oct 2022 at 12:00, houzj.f...@fujitsu.com  
> wrote:
> 
> > Agreed. Here is new version patch which changed the error code and
> > moved the whole command out of the message according to Álvaro's comment.
> 
> My bad!  The patch looks good to me.

Thank you, I pushed it to both branches, because I realized we were
saying "SET PUBLICATION" when we meant "ADD/DROP"; that hint could be
quite damaging if anybody decides to actually follow it ISTM.

I noted that no test needed to be changed because of this, which is
perhaps somewhat concerning.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

Re: Mingw task for Cirrus CI

2022-10-18 Thread Melih Mutlu

Hi Andres,

Andres Freund , 11 Eki 2022 Sal, 21:23 tarihinde şunu
yazdı:

> I think it might be easier to just set MSYS=winjitdebug
>
> https://www.msys2.org/wiki/JIT-Debugging/#native-windows-processes-started-from-msys2
> and then rely on the existing windows crash reporting stuff.
>

Done.


> You removed a bunch of newlines here and nearby - I assume that wasn't
> intentional?
>

Yes, that was my mistake. Thanks for pointing it out. Fixed.


> There's no need to use dash anymore, the amount of shell script run is
> minimal.
>

Switched back to bash.


> I think the "cd"s here are superfluous given the ninja -C %BUILD_DIR% etc.
>

You're right. Those were needed when it was building with autoconf. Not
anymore. Removed.

Only remembered that just after sending my email: When using b_pch=true
> (which
> saves a lot of time on mingw) ccache won't cache much (IIRC just the pch
> files
> themselves) unless you do something like
> export CCACHE_SLOPPINESS=pch_defines,time_macros
>

Added  CCACHE_SLOPPINESS and CCACHE_SLOPPINESS variables.

You can find the updated patch attached.

Best,
Melih


v7-0001-Added-Windows-with-MinGW-environment-in-Cirrus-CI.patch
Description: Binary data

Re: create subscription - improved warning message

2022-10-18 Thread Peter Smith

On Mon, Oct 17, 2022 at 7:11 PM shiy.f...@fujitsu.com
 wrote:
>
...
>
> Thanks for your patch. Here are some comments.
>
> In Example 2, the returned slot_name should be "myslot".
>
> +test_pub=# SELECT * FROM pg_create_logical_replication_slot('myslot', 
> 'pgoutput');
> + slot_name |lsn
> +---+---
> + sub1  | 0/19059A0
> +(1 row)
>

Oops. Sorry for my cut/paste error. Fixed in patch v6.

> Besides, I am thinking is it possible to slightly simplify the example. For
> example, merge example 1 and 2, keep the steps of example 2 and in the step of
> creating slot, mention what should we do if slot_name is not specified when
> creating subscription.
>

Sure, it might be a bit shorter to combine the examples, but I thought
it’s just simpler not to do it that way because the combined example
will then need additional bullets/notes to say – “if there is no
slot_name do this…” and “if there is a slot_name do that…”. It’s
really only the activation part which is identical for both.

-
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: create subscription - improved warning message

2022-10-18 Thread Peter Smith

Thanks for the feedback.

On Mon, Oct 17, 2022 at 10:14 PM Amit Kapila  wrote:
>
> On Mon, Oct 17, 2022 at 7:17 AM Peter Smith  wrote:
> >
> >
> > Updated as sugggested.
> >
>
> +   
> +Sometimes, either by choice (e.g. create_slot = 
> false),
> +or by necessity (e.g. connect = false), the remote
> +replication slot is not created automatically during
> +CREATE SUBSCRIPTION. In these cases the user will have
> +to create the slot manually before the subscription can be activated.
> +   
>
> This part looks a bit odd when in the previous section we have
> explained the same thing in different words. I think it may be better
> if we start with something like: "As mentioned in the previous
> section, there are cases where we need to create the slot manually
> before the subscription can be activated.". I think you can even
> combine the next para in the patch with this one.

Modified the text and combined the paragraphs as suggested.

>
> Also, it looks odd that the patch uses examples to demonstrate how to
> manually create a slot, and then we have a separate section whose
> title is Examples. I am not sure what is the best way to arrange docs
> here but maybe we can consider renaming the Examples section to
> something more specific.
>

Renamed the examples sections to make their purpose clearer.

~~~

PSA patch v6 with the above changes + one correction from Shi-san [1]

--
[1] 
https://www.postgresql.org/message-id/OSZPR01MB631051BA9AAA728CAA8CBD88FD299%40OSZPR01MB6310.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia


v6-0001-create-subscriptipon-pgdocs-for-deferred-slot-cre.patch
Description: Binary data

Re: proposal: possibility to read dumped table's name from file

2022-10-18 Thread Julien Rouhaud

On Thu, Oct 13, 2022 at 11:46:34AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2022-10-07 07:26:08 +0200, Pavel Stehule wrote:
> > I am sending version with handy written parser and meson support
> 
> Given this is a new approach it seems inaccurate to have the CF entry marked
> ready-for-committer. I've updated it to needs-review.

I just had a quick look at the rest of the patch.

For the parser, it seems that filter_get_pattern is reimplementing an
identifier parsing function but isn't entirely correct.  It can correctly parse
quoted non-qualified identifiers and non-quoted qualified identifiers, but not
quoted and qualified ones.  For instance:

$ echo 'include table nsp.tbl' | pg_dump --filter - >/dev/null
$echo $?
0

$ echo 'include table "TBL"' | pg_dump --filter - >/dev/null
$echo $?
0

$ echo 'include table "NSP"."TBL"' | pg_dump --filter - >/dev/null
pg_dump: error: invalid format of filter on line 1: unexpected extra data after 
pattern

This should also be covered in the regression tests.

I'm wondering if psql's parse_identifier() could be exported and reused here
rather than creating yet another version.  

Nitpicking: the comments needs some improvements:

+ /*
+  * Simple routines - just don't repeat same code
+  *
+  * Returns true, when filter's file is opened
+  */
+ bool
+ filter_init(FilterStateData *fstate, const char *filename)

also, is there any reason why this function doesn't call exit_nicely in case of
error rather than letting each caller do it without any other cleanup?

+ /*
+  * Release allocated sources for filter
+  */
+ void
+ filter_free_sources(FilterStateData *fstate)

I'm assuming "ressources" not "sources"?

+ /*
+  * log_format_error - Emit error message
+  *
+  * This is mostly a convenience routine to avoid duplicating file closing code
+  * in multiple callsites.
+  */
+ void
+ log_invalid_filter_format(FilterStateData *fstate, char *message)

mismatch between comment and function name (same for filter_read_item)

+ static const char *
+ filter_object_type_name(FilterObjectType fot)

No description.

/*
 * Helper routine to reduce duplicated code
 */
void
log_unsupported_filter_object_type(FilterStateData *fstate,
const 
char *appname,

FilterObjectType fot)

Need more helpful comment.

RE: CF Bot failure in wait_for_subscription_sync()

2022-10-18 Thread houzj.f...@fujitsu.com

On Tuesday, October 18, 2022 2:16 PM Bharath Rupireddy 
 wrote:
> 
> Hi,
> 
> I have seen 2 patches registered in CF failing on Linux - Debian Bullseye in
> wait_for_subscription_sync(). It seems like the tables aren't being synced. I
> have not done any further analysis. I'm not sure if this issue is being 
> discussed
> elsewhere.
> 
> # Postmaster PID for node "twoways" is 50208 Waiting for all subscriptions in
> "twoways" to synchronize data
> [14:12:43.092](198.391s) # poll_query_until timed out executing this query:
> # SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r',
> 's'); # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> timed out waiting for subscriber to synchronize data at t/100_bugs.pl line 
> 147.
> 
> https://api.cirrus-ci.com/v1/artifact/task/6618623857917952/log/src/test/sub
> scription/tmp_check/log/regress_log_100_bugs
> https://cirrus-ci.com/task/6618623857917952
> https://cirrus-ci.com/task/5764058174455808

Thanks for reporting this. I am not sure about the root cause but just share
some initial analysis here.

This testcase waits for table sync to finish for both table "t" and table "t2".
But from the log, I can only see the log[1] related to the table sync of table
"t". So it seems that the table sync worker for table "t2" was never started
due to some reason. I tried it locally but have not reproduced this yet.

[1]---
2022-10-17 10:16:37.216 UTC [48051][logical replication worker] LOG:  logical 
replication table synchronization worker for subscription "testsub", table "t" 
has finished
---

Best regards,
Hou zj

Re: havingQual vs hasHavingQual buglets

2022-10-18 Thread Etsuro Fujita

On Tue, Oct 18, 2022 at 9:47 AM Richard Guo  wrote:
> On Tue, Oct 18, 2022 at 5:37 AM Tom Lane  wrote:
>> I came across a couple of places in the planner that are checking
>> for nonempty havingQual; but since these bits run after
>> const-simplification of the HAVING clause, that produces the wrong
>> answer for a constant-true HAVING clause (which'll be folded to
>> empty).  Correct code is to check root->hasHavingQual instead.

The postgres_fdw bits would be my oversight.  :-(

> +1. root->hasHavingQual is set before we do any expression
> preprocessing. It should be the right one to check with.

+1  HEAD only seems reasonable.

Best regards,
Etsuro Fujita

Re: Unnecessary lateral dependencies implied by PHVs

2022-10-18 Thread Richard Guo

On Tue, Oct 18, 2022 at 9:15 AM Andy Fan  wrote:

> On Mon, Oct 10, 2022 at 10:35 AM Richard Guo 
> wrote:
>
>> ... I'm asking because
>> PHVs may imply lateral dependencies which may make us have to use
>> nestloop join.
>>
>
> I thought lateral join imply nestloop join,  am I missing something?  Here
> is my simple
> testing.
>

I think it's true most of the time.  And that's why we should try to
avoid unnecessary lateral dependencies, as discussed in this thread.

ISTM in your example nestloop is chosen because there is no available
mergejoinable/hashjoinable clause. In your query the lateral subquery
would be pulled up into the parent query and there would be no lateral
dependencies afterwards.

Thanks
Richard

Re: fix archive module shutdown callback

2022-10-18 Thread Bharath Rupireddy

On Mon, Oct 17, 2022 at 11:17 AM Michael Paquier  wrote:
>
> On Mon, Oct 17, 2022 at 02:30:52PM +0900, Kyotaro Horiguchi wrote:
>
> Removing PG_ENSURE_ERROR_CLEANUP() and relying on before_shmem_exit()
> is fine by me, that's what I imply upthread.

Hm. Here's a v2 patch that addresses review comments. In addition to
making it a before_shmem_exit() callback, this patch also does the
following things:
1) Renames call_archive_module_shutdown_callback() to be more
meaningful and generic as before_shmem_exit() callback.
2) Clarifies when the archive module shutdown callback gets called in
documentation.
3) Defines a shutdown callback that just emits a log message in
shell_archive.c and tests it.

Please review it further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From a0d5a3b8cfe1d52900027b97b8c1bf8fd9d04362 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 18 Oct 2022 08:35:23 +
Subject: [PATCH v2] Fixes related to archive module shutdown callback

Presently, the archive module shutdown callback can get called
twice as it is first registered using PG_ENSURE_ERROR_CLEANUP and
then called directly in HandlePgArchInterrupts() before
proc_exit().

This patch fixes it by just registering it as before_shmem_exit().

In addition to the above, this patch also does the following
things:
1) Renames call_archive_module_shutdown_callback() to be more
meaningful and generic as before_shmem_exit() callback.
2) Clarifies when the archive module shutdown callback gets called
in documentation.
3) Defines a shutdown callback that just emits a log message in
shell_archive.c and tests it.

Author: Nathan Bossart
Author: Bharath Rupireddy
Reviewed-by: Michael Paquier, Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/20221015221328.GB1821022%40nathanxps13
Backpatch-through: 15
---
 doc/src/sgml/archive-modules.sgml |  7 ---
 doc/src/sgml/config.sgml  |  5 +++--
 src/backend/postmaster/pgarch.c   | 24 +++
 src/backend/postmaster/shell_archive.c|  9 +
 src/test/recovery/t/020_archive_status.pl | 19 ++
 5 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/archive-modules.sgml b/doc/src/sgml/archive-modules.sgml
index ef02051f7f..677a29182a 100644
--- a/doc/src/sgml/archive-modules.sgml
+++ b/doc/src/sgml/archive-modules.sgml
@@ -121,9 +121,10 @@ typedef bool (*ArchiveFileCB) (const char *file, const char *path);
   
Shutdown Callback

-The shutdown_cb callback is called when the archiver
-process exits (e.g., after an error) or the value of
- changes.  If no
+The shutdown_cb callback is called before the server
+begins to shut down low-level subsystems such as shared memory upon the
+archiver process exit (e.g., after an error or the value of
+ changes).  If no
 shutdown_cb is defined, no special action is taken in
 these situations.
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 66312b53b8..22bbf05fef 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3625,8 +3625,9 @@ include_dir 'conf.d'
 The library to use for archiving completed WAL file segments.  If set to
 an empty string (the default), archiving via shell is enabled, and
  is used.  Otherwise, the specified
-shared library is used for archiving.  For more information, see
- and
+shared library is used for archiving. The WAL archiver process gets
+restarted by the postmaster whenever this parameter changes. For more
+information, see  and
 .


diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 3868cd7bd3..f1d1f808a8 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -144,7 +144,7 @@ static void pgarch_die(int code, Datum arg);
 static void HandlePgArchInterrupts(void);
 static int	ready_file_comparator(Datum a, Datum b, void *arg);
 static void LoadArchiveLibrary(void);
-static void call_archive_module_shutdown_callback(int code, Datum arg);
+static void pgarch_before_shmem_exit(int code, Datum arg);
 
 /* Report shared memory space needed by PgArchShmemInit */
 Size
@@ -232,6 +232,8 @@ PgArchiverMain(void)
 	/* We shouldn't be launched unnecessarily. */
 	Assert(XLogArchivingActive());
 
+	before_shmem_exit(pgarch_before_shmem_exit, 0);
+
 	/* Arrange to clean up at archiver exit */
 	on_shmem_exit(pgarch_die, 0);
 
@@ -252,13 +254,7 @@ PgArchiverMain(void)
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
 
-	PG_ENSURE_ERROR_CLEANUP(call_archive_module_shutdown_callback, 0);
-	{
-		pgarch_MainLoop();
-	}
-	PG_END_ENSURE_ERROR_CLEANUP(call_archive_module_shutdown_callback, 0);
-
-	call_archive_module_shutdown_callback(0, 0);
+	pgarch_MainLoop();
 
 	proc_exit(0);
 }
@@ -803,12 +799,6 @@ HandlePgArchInterr

Re: [PATCH] Fix build with LLVM 15 or above

2022-10-18 Thread Thomas Munro

On Tue, Oct 18, 2022 at 10:01 PM Devrim Gündüz  wrote:
> On Tue, 2022-10-11 at 11:07 +1300, Thomas Munro wrote:
> > OK, I'll wait for the dust to settle on our 15 release and then
> > back-patch this.  Then I'll keep working on the opaque pointer
> > support for master, which LLVM 16 will need (I expect we'll
> > eventually want to back-patch that eventually, but first things
> > first...).
>
> Fedora 37 is out very very soon, and ships with CLANG/LLVM 15. What is
> the timeline for backpatching llvm15 support?

Hi Devrim,

Will do first thing tomorrow.

Re: [PATCH] Fix build with LLVM 15 or above

2022-10-18 Thread Devrim Gündüz

Hi Thomas,

On Tue, 2022-10-11 at 11:07 +1300, Thomas Munro wrote:
> OK, I'll wait for the dust to settle on our 15 release and then
> back-patch this.  Then I'll keep working on the opaque pointer
> support for master, which LLVM 16 will need (I expect we'll
> eventually want to back-patch that eventually, but first things
> first...).

Fedora 37 is out very very soon, and ships with CLANG/LLVM 15. What is
the timeline for backpatching llvm15 support?

Thanks!

Cheers,
-- 
Devrim Gündüz
Open Source Solution Architect, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR


signature.asc
Description: This is a digitally signed message part

Understanding, testing and improving our Windows filesystem code

2022-10-18 Thread Thomas Munro

Hi,

As a card-carrying Unix hacker, I think it'd be great to remove the
differences between platforms if possible using newer Windows
facilities, so everything just works the way we expect.  Two things
that stopped progress on that front in the past were (1) uncertainty
about OS versioning, fixed in v16 with an EOL purge, and (2)
uncertainty about what the new interfaces really do, due to lack of
good ways to test, which I'd like to address here.

My goals in this thread:

 * introduce a pattern/idiom for TAP-testing of low level C code
without a database server
 * demonstrate the behaviour of our filesystem porting code with full coverage
 * fix reported bugs in my recent symlink changes with coverage
 * understand the new "POSIX semantics" changes in Windows 10
 * figure out what our policy should be on "POSIX semantics"

For context, we have a bunch of stuff under src/port to provide
POSIX-like implementations of:

  open()*
  fstat(), stat()*, lstat()*
  link(), unlink()*, rename()*
  symlink(), readlink()
  opendir(), readdir(), closedir()
  pg_pwrite(), pg_pread()

These call equivalent Windows system interfaces so we can mostly just
write code that assumes the whole world is a POSIX.  Some of them also
deal with three special aspects of Windows file handles, which
occasionally cause trouble:

1.  errno == EACCES && GetLastError() == ERROR_SHARING_VIOLATION:
This happens if you try to access a file that has been opened by
without FILE_SHARE_ flags to allow concurrent access.  While our own
open() wrapper uses those flags, other programs might not.  The
wrapper functions marked with an asterix above deal with this
condition by sleeping or retrying for 10 or 30 seconds, in the hope
that the external program goes away.  AFAIK this problem will always
be with us.

2.  errno == EACCES && GetLastNtStatus() == STATUS_DELETE_PENDING:
This happens if you try to access a directory entry that is scheduled
for asynchronous unlink, but is still present until all handles to the
file are closed.  The wrapper functions above deal with this in
various different ways:

 open() without O_CREAT: -> ENOENT, so we can pretend that unlink()
calls are synchronous
 open() with O_CREAT: -> EEXIST, the zombie dirent wins
 stat(), lstat(): -> ENOENT
 unlink(), rename(): retry, same as we do for ERROR_SHARING_VIOLATION,
until timeout or asynchonous unlink completes (this may have been
unintentional due to same errno?)

3.  errno == EACCESS && :  You can't MoveFileEx() on top of
a file that someone has open.

In Windows 10, a new "POSIX semantics" mode was added.  Yippee!
Victor Spirin proposed[1] that we use it several commitfests ago.
Interestingly, on some systems it is already partially activated
without any change on our part.  That is, on some systems, unlink()
works synchronously (when the call returns, the dirent is gone, even
if someone else has the file open, just like Unix).  Sounds great, but
in testing different Windows systems I have access to using the
attached test suite I found three different sets of behaviour:

 A) Using Windows unlink() and MoveFileEx() on Server 2019 (CI) I get
traditional STATUS_DELETE_PENDING problems
 B) Using Windows unlink()/MoveFileEx() on Windows 10 Home (local VM)
I get mostly POSIX behaviour, except problem (3) above, which you can
see in my test suite
 C) Using Windows new SetFileInformationByHandle() calls with explicit
request for POSIX semantics (this syscall is something like fcntl(), a
kitchen sink kernel interface, and is what unlink() and MoveFileEx()
and others things are built on, but if you do it yourself you can
reach more flags) I get full POSIX behaviour according to my test
suite, i.e. agreement with FreeBSD and Linux for the dirent-related
cases I've though about so far, on both of those Windows systems

It sounds like we want option C, as Victor proposed, but I'm not sure
what happens if you try to use it on a non-NTFS filesystem (does it
quietly fall back to non-POSIX semantics, or fail, or do all file
systems now support this?).  I'm also not sure if we really support
running on a non-NTFS filesystem, not being a Windows user myself.

So the questions I have are:

 * any thoughts on this C TAP testing system?
 * has anyone got a non-EOL'd OS version where this test suite fails?
 * has anyone got a relevant filesystem where this fails?  which way
do ReFS and SMB go?  do the new calls in 0010 just fail, and if so
with which code (ie could we add our own fallback path)?
 * which filesystems do we even claim to support?
 * if we switched to explicitly using POSIX-semantics like in the 0010
patch, I assume there would be nothing left in the build farm or CI
that tests the non-POSIX code paths (either in these tests or in the
real server code), and the non-POSIX support would decay into
non-working form pretty quickly
 * if there are any filesystems that don't support POSIX-semantics,
would we want to either (1) get such a thing into the build farm so
it's tested or (2)

Re: make_ctags: use -I option to ignore pg_node_attr macro

2022-10-18 Thread Yugo NAGATA

Hi,

On Sat, 15 Oct 2022 10:40:29 +0900 (JST)
Tatsuo Ishii  wrote:

> > On Thu, 13 Oct 2022 15:35:09 +0900 (JST)
> > Tatsuo Ishii  wrote:
> > 
> >> > OK, that sounds good then.  I would make a feature request to have a
> >> > switch that supresses creation of these links, then.
> >> 
> >> Ok, I have added "-n" option to make_ctags so that it skips to create
> >> the links.
> >> 
> >> Also I have changed make_etags so that it exec make_ctags, which seems
> >> to be the consensus.
> > 
> > Thank you for following up my patch.
> > I fixed the patch to allow use both -e and -n options together.
> 
> Thanks. I have made mostly cosmetic changes so that it is more
> consistent with existing scripts.
> 
> I would like to push v6 patch if there's no objection.

I am fine with this patch.

By the way, in passing, how about adding "tags" and "TAGS" to
.gitignore file?

> 
> Best reagards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp


-- 
Yugo NAGATA

Re: SI-read predicate locks on materialized views

2022-10-18 Thread Yugo NAGATA

Hello Micheal-san,

On Thu, 13 Oct 2022 17:02:06 +0900
Michael Paquier  wrote:

> On Fri, Sep 30, 2022 at 10:12:13AM +0900, Yugo NAGATA wrote:
> > Thank you for comment. Do you think it can be marked as Ready for Commiter?
> 
> Matviews have been discarded from needing predicate locks since
> 3bf3ab8 and their introduction, where there was no concurrent flavor
> of refresh yet.  Shouldn't this patch have at least an isolation test
> to show the difference in terms of read-write conflicts with some
> serializable transactions and REFRESH CONCURRENTLY?

Thank you for your review. I agree that an isolation test is required.
The attached patch contains the test using the scenario as explained in
the previous post.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
>From b60a53a945283de4b068e1bc7aaafec26dd8f4da Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Tue, 18 Oct 2022 17:15:33 +0900
Subject: [PATCH] SI-read predicate locking on materialized views

Matviews have been discarded from needing predicate locks since
3bf3ab8 and their introduction, where there was no concurrent flavor
of refresh yet.
---
 src/backend/storage/lmgr/predicate.c  |  5 +-
 .../isolation/expected/matview-write-skew.out | 77 +++
 src/test/isolation/isolation_schedule |  1 +
 .../isolation/specs/matview-write-skew.spec   | 43 +++
 4 files changed, 123 insertions(+), 3 deletions(-)
 create mode 100644 src/test/isolation/expected/matview-write-skew.out
 create mode 100644 src/test/isolation/specs/matview-write-skew.spec

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index e8120174d6..df1c0d72e9 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -490,14 +490,13 @@ static void ReleasePredicateLocksLocal(void);
 
 /*
  * Does this relation participate in predicate locking? Temporary and system
- * relations are exempt, as are materialized views.
+ * relations are exempt.
  */
 static inline bool
 PredicateLockingNeededForRelation(Relation relation)
 {
 	return !(relation->rd_id < FirstUnpinnedObjectId ||
-			 RelationUsesLocalBuffers(relation) ||
-			 relation->rd_rel->relkind == RELKIND_MATVIEW);
+			 RelationUsesLocalBuffers(relation));
 }
 
 /*
diff --git a/src/test/isolation/expected/matview-write-skew.out b/src/test/isolation/expected/matview-write-skew.out
new file mode 100644
index 00..c1f8b3f5ea
--- /dev/null
+++ b/src/test/isolation/expected/matview-write-skew.out
@@ -0,0 +1,77 @@
+Parsed test spec with 2 sessions
+
+starting permutation: rxwy1 c1 rywx2 c2
+step rxwy1: REFRESH MATERIALIZED VIEW CONCURRENTLY order_summary;
+step c1: COMMIT;
+step rywx2: 
+  DO $$DECLARE today date;
+  BEGIN
+SELECT INTO today max(date) + 1 FROM order_summary;
+INSERT INTO orders VALUES (today, 'banana', 10);
+  END$$;
+
+step c2: COMMIT;
+
+starting permutation: rxwy1 rywx2 c1 c2
+step rxwy1: REFRESH MATERIALIZED VIEW CONCURRENTLY order_summary;
+step rywx2: 
+  DO $$DECLARE today date;
+  BEGIN
+SELECT INTO today max(date) + 1 FROM order_summary;
+INSERT INTO orders VALUES (today, 'banana', 10);
+  END$$;
+
+step c1: COMMIT;
+step c2: COMMIT;
+ERROR:  could not serialize access due to read/write dependencies among transactions
+
+starting permutation: rxwy1 rywx2 c2 c1
+step rxwy1: REFRESH MATERIALIZED VIEW CONCURRENTLY order_summary;
+step rywx2: 
+  DO $$DECLARE today date;
+  BEGIN
+SELECT INTO today max(date) + 1 FROM order_summary;
+INSERT INTO orders VALUES (today, 'banana', 10);
+  END$$;
+
+step c2: COMMIT;
+step c1: COMMIT;
+ERROR:  could not serialize access due to read/write dependencies among transactions
+
+starting permutation: rywx2 rxwy1 c1 c2
+step rywx2: 
+  DO $$DECLARE today date;
+  BEGIN
+SELECT INTO today max(date) + 1 FROM order_summary;
+INSERT INTO orders VALUES (today, 'banana', 10);
+  END$$;
+
+step rxwy1: REFRESH MATERIALIZED VIEW CONCURRENTLY order_summary;
+step c1: COMMIT;
+step c2: COMMIT;
+ERROR:  could not serialize access due to read/write dependencies among transactions
+
+starting permutation: rywx2 rxwy1 c2 c1
+step rywx2: 
+  DO $$DECLARE today date;
+  BEGIN
+SELECT INTO today max(date) + 1 FROM order_summary;
+INSERT INTO orders VALUES (today, 'banana', 10);
+  END$$;
+
+step rxwy1: REFRESH MATERIALIZED VIEW CONCURRENTLY order_summary;
+step c2: COMMIT;
+step c1: COMMIT;
+ERROR:  could not serialize access due to read/write dependencies among transactions
+
+starting permutation: rywx2 c2 rxwy1 c1
+step rywx2: 
+  DO $$DECLARE today date;
+  BEGIN
+SELECT INTO today max(date) + 1 FROM order_summary;
+INSERT INTO orders VALUES (today, 'banana', 10);
+  END$$;
+
+step c2: COMMIT;
+step rxwy1: REFRESH MATERIALIZED VIEW CONCURRENTLY order_summary;
+step c1: COMMIT;
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 5413a59a80..c11dc9a420 100644
--- a/src/test/isolation/isolation_schedule
+++ b

Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada

On Tue, Oct 18, 2022 at 1:07 PM Amit Kapila  wrote:
>
> On Tue, Oct 18, 2022 at 6:29 AM Masahiko Sawada  wrote:
> >
> > On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila  wrote:
> > >
> > >
> > > IIUC, here you are speaking of three different changes. Change-1: Add
> > > a check in AssertTXNLsnOrder() to skip assert checking till we reach
> > > start_decoding_at. Change-2: Set needs_timetravel to true in one of
> > > the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
> > > call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
> > > in PG-14/15 as that won't be required after Change-1.
> >
> > Yes.
> >
> > >
> > > AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
> > > required in HEAD/v15/v14 to fix the problem.
> >
> > IIUC Change-2 is required in v16 and HEAD
> >
>
> Why are you referring v16 and HEAD separately?

Sorry, my wrong, I was confused.

>
> > but not mandatory in v15 and
> > v14. The reason why we need Change-2 is that there is a case where we
> > mark only subtransactions as containing catalog change while not doing
> > that for its top-level transaction. In v15 and v14, since we mark both
> > subtransactions and top-level transaction in
> > SnapBuildXidSetCatalogChanges() as containing catalog changes, we
> > don't get the assertion failure at "Assert(!needs_snapshot ||
> > needs_timetravel)".
> >
> > Regarding Change-3, it's required in v15 and v14 but not in HEAD and
> > v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
> > HEAD, Change-3 cannot be applied to the two branches.
> >
> > > Now, the second and third
> > > changes are not required in branches prior to v14 because we don't
> > > record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
> > > we want, we can even back-patch Change-2 and Change-3 to keep the code
> > > consistent or maybe just Change-3.
> >
> > Right. I don't think it's a good idea to back-patch Change-2 in
> > branches prior to v14 as it's not a relevant issue.
> >
>
> Fair enough but then why to even backpatch it to v15 and v14?

Oops, it's a typo. I wanted to say Change-2 should be back-patched only to HEAD.

>
> > Regarding
> > back-patching Change-3 to branches prior 14, I think it may be okay
> > til v11, but I'd be hesitant for v10 as the final release comes in a
> > month.
> >
>
> So to fix the issue in all branches, what we need to do is to
> backpatch change-1: in all branches till v10, change-2: in HEAD, and
> change-3: in V15 and V14. Additionally, we think, it is okay to
> backpatch change-3 till v11 as it is mainly done to avoid the problem
> fixed by change-1 and it makes code consistent in back branches.

Right.

>
> I think because the test case proposed needs all three changes, we can
> push the change-1 without a test case and then as a second patch have
> change-2 for HEAD and change-3 for back branches with the test case.
> Do you have any other ideas to proceed here?

I found another test case that causes the assertion failure at
"Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
attached the patch for the test case. In this test case, I modified a
user-catalog table instead of system-catalog table. That way, we don't
generate invalidation messages while generating NEW_CID records. As a
result, we mark only the subtransactions as containing catalog change
and don't make association between top-level and sub transactions. The
assertion failure happens on all supported branches. If we need to fix
this (I believe so), Change-2 needs to be backpatched to all supported
branches.

There are three changes as Amit mentioned, and regarding the test
case, we have three test cases I've attached: truncate_testcase.patch,
analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
between assertion failures and test cases are very complex. I could
not find any test case to cause only one assertion failure on all
branches. One idea to proceed is:

Patch-1 includes Change-1 and is applied to all branches.

Patch-2 includes Change-2 and the user_catalog test case, and is
applied to all branches.

Patch-3 includes Change-3 and the truncate test case (or the analyze
test case), and is applied to v14 and v15 (also till v11 if we
prefer).

The patch-1 doesn't include any test case but the user_catalog test
case can test both Change-1 and Change-2 on all branches. In v15 and
v14, the analyze test case causes both the assertions at
"Assert(txn->ninvalidations == 0);" and "Assert(prev_first_lsn <
cur_txn->first_lsn);" whereas the truncate test case causes the
assertion only at "Assert(txn->ninvalidations == 0);". Since the
patch-2 is applied on top of the patch-1, there is no difference in
terms of testing Change-2.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

user_catalog_testcase.patch
Description: Binary data

analyze_testcase.patch
Description: Binary data

truncate_testcase.patch
Description: Binary data

Re: Simplify standby state machine a bit in WaitForWALToBecomeAvailable()

2022-10-18 Thread Amul Sul

On Tue, Oct 18, 2022 at 12:01 PM Bharath Rupireddy
 wrote:
>
> Hi,
>
> In standby mode, the state machine in WaitForWALToBecomeAvailable()
> reads WAL from pg_wal after failing to read from the archive. This is
> currently implemented in XLogFileReadAnyTLI() by calling
> XLogFileRead() with source XLOG_FROM_PG_WAL after it fails with source
> XLOG_FROM_PG_ARCHIVE and the current source isn't changed at all.
> Also, passing the source to XLogFileReadAnyTLI() in
> WaitForWALToBecomeAvailable() isn't straight i.e. it's not necessary
> to pass in XLOG_FROM_ANY at all. These things make the state machine a
> bit complicated and hard to understand.
>
> The attached patch attempts to simplify the code a bit by changing the
> current source to XLOG_FROM_PG_WAL after failing in
> XLOG_FROM_PG_ARCHIVE so that the state machine can move smoothly to
> read from pg_wal. And we can just pass the current source to
> XLogFileReadAnyTLI(). It also enables us to reduce a bit of extra
> XLogFileRead() code in XLogFileReadAnyTLI().
>
> Thoughts?

+1

Regards,
Amul

Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf

2022-10-18 Thread Drouvot, Bertrand


Hi,

On 10/18/22 7:51 AM, Michael Paquier wrote:

On Mon, Oct 17, 2022 at 07:56:02PM +0200, Drouvot, Bertrand wrote:

On 10/14/22 7:30 AM, Michael Paquier wrote:

This approach would not stick with
pg_ident.conf though, as we validate the fields in each line when we
put our hands on ident_user and after the base validation of a line
(number of fields, etc.).


I'm not sure to get the issue here with the proposed approach and
pg_ident.conf.


My point is about parse_ident_line(), where we need to be careful in
the order of the operations.  The macros IDENT_MULTI_VALUE() and
IDENT_FIELD_ABSENT() need to be applied on all the fields first, and
the regex computation needs to be last.  Your patch had a subtile
issue here, as users may get errors on the computed regex before the
ordering of the fields as the computation was used *before* the "Get
the PG rolename token" part of the logic.


Gotcha, thanks! I was wondering if we shouldn't add a comment about that 
and I see that you've added one in v2, thanks!


BTW, what about adding a new TAP test (dedicated patch) to test the 
behavior in case of errors during the regexes compilation in 
pg_ident.conf and pg_hba.conf (not done yet)? (we could add it once this 
 patch series is done).



While putting my hands on that, I was also wondering whether we should
have the error string generated after compilation within the internal
regcomp() routine, but that would require more arguments to
pg_regcomp() (as of file name, line number, **err_string), and that
looks more invasive than necessary.  Perhaps the follow-up steps will
prove me wrong, though :)


I've had the same thought (and that was what the previous global patch 
was doing). I'm tempted to think that the follow-steps will prove you 
right ;-) (specially if at the end those will be the same error messages 
for databases and roles).




A last thing is the introduction of a free() routine for AuthTokens,
to minimize the number of places where we haev pg_regfree().  The gain
is minimal, but that looks more consistent with the execution and
compilation paths.


Agree, that looks better.

I had a look at your v2, did a few tests and it looks good to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

75 matches

Mail list logo