Re: Provide a pg_truncate_freespacemap function

2024-03-06 Thread Ronan Dunklau
On Wednesday, March 6, 2024 at 20:28:44 CET, Stephen Frost wrote:
> I agree that this would generally be a useful thing to have.

Thanks !

> 
> > Does that seem correct ?
> 
> Definitely needs to have a 'REVOKE ALL ON FUNCTION' at the end of the
> upgrade script, similar to what you'll find at the bottom of
> pg_visibility--1.1.sql in the tree today, otherwise anyone could run it.
> 
> Beyond that, I'd suggest a function-level comment above the definition
> of the function itself (which is where we tend to put those- not at the
> point where we declare the function).

Thank you for the review. Here is an updated patch for both of those.
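
For reference, once the attached upgrade script is installed, usage would look
roughly like this (the table name is just a placeholder):

    ALTER EXTENSION pg_freespacemap UPDATE TO '1.3';
    SELECT pg_truncate_freespace_map('some_corrupted_table'::regclass);
    -- a later VACUUM of the table repopulates the free space map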


Best regards,

--
Ronan

>From 812061b47443ce1052f65ba2867223f9db2cfa18 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau 
Date: Fri, 1 Mar 2024 08:57:49 +0100
Subject: [PATCH] Provide a pg_truncate_freespacemap function.

Similar to the pg_truncate_visibilitymap function, this one makes it possible
to avoid downtime when fixing the FSM after a corruption event.
---
 contrib/pg_freespacemap/Makefile  |  5 +-
 .../pg_freespacemap--1.2--1.3.sql | 12 
 contrib/pg_freespacemap/pg_freespacemap.c | 64 +++
 .../pg_freespacemap/pg_freespacemap.control   |  2 +-
 4 files changed, 80 insertions(+), 3 deletions(-)
 create mode 100644 contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql

diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile
index b48e4b255bc..0f4b52f5812 100644
--- a/contrib/pg_freespacemap/Makefile
+++ b/contrib/pg_freespacemap/Makefile
@@ -6,8 +6,9 @@ OBJS = \
 	pg_freespacemap.o
 
 EXTENSION = pg_freespacemap
-DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.1--1.2.sql \
-	pg_freespacemap--1.0--1.1.sql
+DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.2--1.3.sql \
+			pg_freespacemap--1.1--1.2.sql \
+			pg_freespacemap--1.0--1.1.sql
 PGFILEDESC = "pg_freespacemap - monitoring of free space map"
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_freespacemap/pg_freespacemap.conf
diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
new file mode 100644
index 000..cc94eccf2f6
--- /dev/null
+++ b/contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql
@@ -0,0 +1,12 @@
+/* contrib/pg_freespacemap/pg_freespacemap--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_freespacemap UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pg_truncate_freespace_map(regclass)
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_truncate_freespace_map'
+LANGUAGE C STRICT
+PARALLEL UNSAFE;
+
+REVOKE ALL ON FUNCTION pg_truncate_freespace_map(regclass) FROM PUBLIC;
diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c
index b82cab2d97e..fa492c18534 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.c
+++ b/contrib/pg_freespacemap/pg_freespacemap.c
@@ -9,8 +9,12 @@
 #include "postgres.h"
 
 #include "access/relation.h"
+#include "access/xloginsert.h"
+#include "catalog/storage_xlog.h"
 #include "funcapi.h"
 #include "storage/freespace.h"
+#include "storage/smgr.h"
+#include "utils/rel.h"
 
 PG_MODULE_MAGIC;
 
@@ -20,6 +24,8 @@ PG_MODULE_MAGIC;
  */
 PG_FUNCTION_INFO_V1(pg_freespace);
 
+PG_FUNCTION_INFO_V1(pg_truncate_freespace_map);
+
 Datum
 pg_freespace(PG_FUNCTION_ARGS)
 {
@@ -40,3 +46,61 @@ pg_freespace(PG_FUNCTION_ARGS)
 	relation_close(rel, AccessShareLock);
 	PG_RETURN_INT16(freespace);
 }
+
+
+/*
+ * Remove the free space map for a relation.
+ *
+ * This is useful in case of corruption of the FSM, as the
+ * only other options are either triggering a VACUUM FULL or
+ * shutting down the server and removing the FSM on the filesystem.
+ */
+Datum
+pg_truncate_freespace_map(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	ForkNumber	fork;
+	BlockNumber block;
+
+	rel = relation_open(relid, AccessExclusiveLock);
+
+	/* Only some relkinds have a freespace map */
+	if (!RELKIND_HAS_TABLE_AM(rel->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" is of wrong relation kind",
+						RelationGetRelationName(rel)),
+				 errdetail_relkind_not_supported(rel->rd_rel->relkind)));
+
+
+	/* Forcibly reset cached file size */
+	RelationGetSmgr(rel)->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
+
+	/* Just pretend we're going to wipe out the whole rel */
+	block = FreeSpaceMapPrepareTruncateRel(rel, 0);
+
+	if (BlockNumberIsValid(block))
+	{
+		fork = FSM_FORKNUM;
+		smgrtruncate(RelationGetSmgr(rel), &fork, 1, &block);
+	}
+
+	if (RelationNeedsWAL(rel))
+	{
+		xl_smgr_truncate xlrec;
+
+		xlrec.blkno = 0;
+		xlrec.rlocator = rel->rd_locator;
+		xlrec.flags = SMGR_TRUNCATE_FSM;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+		XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+	}
+
+	relation_close(rel, 

Re: Shared detoast Datum proposal

2024-03-06 Thread Nikita Malakhov
Hi!

Tomas, I thought about this issue -
>What if you only need the first slice? In that case decoding everything
>will be a clear regression.
I completely agree with you; I'm currently working on it and will post
a patch a bit later. Another issue we have here: if multiple slices are
requested, then we have to update the cached slice from both sides, the
beginning and the end.

On update, yep, you're right
>Surely the update creates a new TOAST value, with a completely new TOAST
>pointer, new rows in the TOAST table etc. It's not updated in-place. So
>it'd end up with two independent entries in the TOAST cache.

>Or are you interested just in evicting the old value simply to free the
>memory, because we're not going to need that (now invisible) value? That
>seems interesting, but if it's too invasive or complex I'd just leave it
>up to the LRU eviction (at least for v1).
Again, yes, we do not need the old value after it has been updated, and
it is better to take care of it explicitly. It's a simple and non-invasive
addition to your patch.

I could not agree with you about large values, though - it makes sense
to cache large values, because the larger the value is, the more benefit
we get from not detoasting it again and again. One way is to keep them
compressed, but what if we have data with a very low compression ratio?

I also changed the HASHACTION field value from HASH_ENTER to
HASH_ENTER_NULL to deal softly with the case when we do not have enough
memory, and added checks for whether the value was stored or not.
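
For readers following along, the insertion pattern being described looks
roughly like this (a sketch with made-up names, not the actual patch; note
that HASH_ENTER_NULL can only report out-of-memory for hash tables whose
allocator is allowed to fail softly):

    ToastCacheEntry *entry;
    bool        found;

    /* Try to add the detoasted value to the cache; NULL means "no memory" */
    entry = (ToastCacheEntry *) hash_search(toast_cache, &cache_key,
                                            HASH_ENTER_NULL, &found);
    if (entry == NULL)
        return detoasted_value;     /* skip caching, return the value as-is */

    if (!found)
        entry->value = detoasted_value;     /* new entry, remember the value */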

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread Masahiko Sawada
On Thu, Mar 7, 2024 at 4:21 PM John Naylor  wrote:
>
> On Thu, Mar 7, 2024 at 2:14 PM Masahiko Sawada  wrote:
> >
> > On Thu, Mar 7, 2024 at 4:01 PM John Naylor  wrote:
> > >
> > > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor  
> > > > wrote:
> > > > >
> > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor  
> > > > > wrote:
> > >
> > > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > > > > feature [-Werror,-Wtypedef-redefinition]"
> > > > > >
> > > > > > I'll look in the other templates to see what they do.
> > > > >
> > > > > Their "declare" sections have full typedefs. I found it works to leave
> > > > > out the typedef for the "define" section, but I first want to
> > > > > reproduce the build failure.
> > > >
> > > > Right. I've reproduced this build failure on my machine by specifying
> > > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something like the
> > > > below change seems to fix the problem:
> > >
> > > Confirmed, will push shortly.
> >
> > mamba complained with different build errors[1]:
> >
> >  2740 |  fprintf(stderr, "num_keys = %ld\n", tree->ctl->num_keys);
> > ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'int64' {aka 'long long int'} [-Werror=format=]
> >  2752 |   fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]);
> > ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'int64' {aka 'long long int'} [-Werror=format=]
> >  2755 |  fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves);
> >
> > Regards,
> >
> > [1] 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18
>
> Yeah, the attached fixes it for me.

Thanks, LGTM.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Reducing the log spam

2024-03-06 Thread Laurenz Albe
On Wed, 2024-03-06 at 17:33 -0500, Isaac Morland wrote:
> I have two questions about this:
> 
> First, can it be done per role? If I have a particular application which is
> constantly throwing some particular error, I might want to suppress it, but
> not suppress the same error occasionally coming from another application.
> I see ALTER DATABASE name SET configuration_parameter … as being useful here,
> but often multiple applications share a database.
>
> Second, where can this setting be adjusted? Can any session turn off logging
> of arbitrary sets of sqlstates resulting from its queries? It feels to me
> like that might allow security problems to be hidden. Specifically, the first
> thing an SQL injection might do would be to turn off logging of important
> error states, then proceed to try various nefarious things.

I was envisioning the parameter to be like other logging parameters, for
example "log_statement":  only superusers can set the parameter or GRANT
that privilege to others.  Also, a superuser could use ALTER ROLE to set
the parameter for all sessions by that role.
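
For concreteness, the existing mechanism referred to above looks like this,
using log_statement as a stand-in for the new parameter (role names are
placeholders):

    -- a superuser forces the setting for every session of that role
    ALTER ROLE reporting_app SET log_statement = 'all';

    -- since v15 the ability to change it can also be delegated explicitly
    GRANT SET ON PARAMETER log_statement TO log_admin;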

> It seems to me the above questions interact; an answer to the first might be
> "ALTER ROLE role_specification SET configuration_parameter", but I think that
> would allow roles to change their own settings, contrary to the concern
> raised by the second question.

If a superuser sets "log_statement" on a role, that role cannot undo or change
the setting.  That's just how I plan to implement the new parameter.

Yours,
Laurenz Albe




Re: Combine headerscheck and cpluspluscheck scripts

2024-03-06 Thread Michael Paquier
On Thu, Mar 07, 2024 at 01:37:36PM +1300, Thomas Munro wrote:
> +1

Looking at the patch, nice cleanup.
--
Michael


signature.asc
Description: PGP signature


Re: Add new error_action COPY ON_ERROR "log"

2024-03-06 Thread Michael Paquier
On Thu, Mar 07, 2024 at 12:50:33PM +0530, Bharath Rupireddy wrote:
> Nice catch. So, are you suggesting to log one NOTICE message per row
> even if multiple columns in the single row fail to parse or are
> malformed?

One NOTICE per malformed value, I guess, which would be easier to
parse particularly if the values are long (like, JSON-long).
--
Michael


signature.asc
Description: PGP signature


Re: Add new error_action COPY ON_ERROR "log"

2024-03-06 Thread Michael Paquier
On Thu, Mar 07, 2024 at 12:48:12PM +0530, Bharath Rupireddy wrote:
> I'm okay with it. But, help me understand it better. We want the
> 'log_verbosity' clause to have options 'default' and 'verbose', right?
> And, later it can also be extended to contain all the LOG levels like
> 'notice', 'error', 'info' , 'debugX' etc. depending on the need,
> right?

You could, or names that have some status like row_details, etc.

> One more thing: how does it sound to use both verbosity and verbose, as
> in "log_verbosity verbose" like below? Is this okay?

There's some history with this pattern in psql at least with \set
VERBOSITY verbose.  For the patch, I would tend to choose these two,
but that's as far as my opinion goes, and I am OK if other ideas gather
more votes.
--
Michael


signature.asc
Description: PGP signature


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Thu, Mar 7, 2024 at 2:14 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 7, 2024 at 4:01 PM John Naylor  wrote:
> >
> > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor  
> > > wrote:
> > > >
> > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor  
> > > > wrote:
> >
> > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > > > feature [-Werror,-Wtypedef-redefinition]"
> > > > >
> > > > > I'll look in the other templates to see what they do.
> > > >
> > > > Their "declare" sections have full typedefs. I found it works to leave
> > > > out the typedef for the "define" section, but I first want to
> > > > reproduce the build failure.
> > >
> > > Right. I've reproduced this build failure on my machine by specifying
> > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something like the
> > > below change seems to fix the problem:
> >
> > Confirmed, will push shortly.
>
> mamba complained with different build errors[1]:
>
>  2740 |  fprintf(stderr, "num_keys = %ld\n", tree->ctl->num_keys);
> ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'int64' {aka 'long long int'} [-Werror=format=]
>  2752 |   fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]);
> ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'int64' {aka 'long long int'} [-Werror=format=]
>  2755 |  fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves);
>
> Regards,
>
> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18

Yeah, the attached fixes it for me.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 93e6a7d809..b8ad51c14d 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -2737,7 +2737,7 @@ RT_SCOPE void
 RT_STATS(RT_RADIX_TREE * tree)
 {
 	fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
-	fprintf(stderr, "num_keys = %ld\n", tree->ctl->num_keys);
+	fprintf(stderr, "num_keys = %lld\n", (long long) tree->ctl->num_keys);
 
 #ifdef RT_SHMEM
 	fprintf(stderr, "handle = " DSA_POINTER_FORMAT "\n", tree->ctl->handle);
@@ -2749,10 +2749,10 @@ RT_STATS(RT_RADIX_TREE * tree)
 	{
 		RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
 
-		fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]);
+		fprintf(stderr, ", n%d = %lld", size_class.fanout, (long long) tree->ctl->num_nodes[i]);
 	}
 
-	fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves);
+	fprintf(stderr, ", leaves = %lld", (long long) tree->ctl->num_leaves);
 
 	fprintf(stderr, "\n");
 }


Re: Add new error_action COPY ON_ERROR "log"

2024-03-06 Thread Bharath Rupireddy
On Thu, Mar 7, 2024 at 12:37 PM Michael Paquier  wrote:
>
> On Thu, Mar 07, 2024 at 03:52:41PM +0900, Masahiko Sawada wrote:
> > One question I have is; do we want to write multiple NOTICE messages
> > for one row if the row has malformed data on some columns?
>
> Good idea.  We can do that as the field strings are already parsed.

Nice catch. So, are you suggesting to log one NOTICE message per row
even if multiple columns in the single row fail to parse or are
malformed?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-03-06 Thread Bharath Rupireddy
On Thu, Mar 7, 2024 at 9:30 AM Michael Paquier  wrote:
>
> Is a boolean the best interface for the end-user, though?  Maybe
> something like a "mode" value would speak more than a yes/no from the
> start, say a "default" mode to emit only the last LOG and a "verbose"
> for the whole set in the case of ON_ERROR?  That could use an enum
> from the start internally, but that's an implementation detail.

I'm okay with it. But, help me understand it better. We want the
'log_verbosity' clause to have options 'default' and 'verbose', right?
And, later it can also be extended to contain all the LOG levels like
'notice', 'error', 'info' , 'debugX' etc. depending on the need,
right?

One more thing: how does it sound to use both verbosity and verbose, as
in "log_verbosity verbose" like below? Is this okay?

COPY check_ign_err FROM STDIN WITH (on_error ignore, log_verbosity verbose);

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread Masahiko Sawada
On Thu, Mar 7, 2024 at 4:01 PM John Naylor  wrote:
>
> On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada  wrote:
> >
> > On Thu, Mar 7, 2024 at 3:20 PM John Naylor  wrote:
> > >
> > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor  
> > > wrote:
>
> > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > > feature [-Werror,-Wtypedef-redefinition]"
> > > >
> > > > I'll look in the other templates to see what they do.
> > >
> > > Their "declare" sections have full typedefs. I found it works to leave
> > > out the typedef for the "define" section, but I first want to
> > > reproduce the build failure.
> >
> > Right. I've reproduced this build failure on my machine by specifying
> > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something like the
> > below change seems to fix the problem:
>
> Confirmed, will push shortly.

mamba complained with different build errors[1]:

 2740 |  fprintf(stderr, "num_keys = %ld\n", tree->ctl->num_keys);
../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'int64' {aka 'long long int'} [-Werror=format=]
 2752 |   fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]);
../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'int64' {aka 'long long int'} [-Werror=format=]
 2755 |  fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves);

Regards,

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Get rid of the excess semicolon in planner.c

2024-03-06 Thread Richard Guo
On Wed, Mar 6, 2024 at 6:00 AM David Rowley  wrote:

> On Wed, 6 Mar 2024 at 00:43, Richard Guo  wrote:
> >
> > I think this is a typo introduced in 0452b461b.
> >
> >  +root->processed_groupClause = list_copy(parse->groupClause);;
>
> "git grep -E ";;$" -- *.c *.h" tell me it's the only one.
>
> Pushed.


Thanks for checking and pushing.

Thanks
Richard


Re: Regarding the order of the header file includes

2024-03-06 Thread Richard Guo
On Wed, Mar 6, 2024 at 6:25 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Wed, Mar 6, 2024 at 3:02 PM Richard Guo  wrote:
> >
> > I think we generally follow the rule that we include 'postgres.h' or
> > 'postgres_fe.h' first, followed by system header files, and then
> > postgres header files ordered in ASCII value.  I noticed that in some C
> > files we fail to follow this rule strictly.  Attached is a patch to fix
> > this.
> >
> > Back in 2019, we performed the same operation in commits 7e735035f2,
> > dddf4cdc33, and 14aec03502.  It appears that the code has deviated from
> > that point onwards.
> >
> > Please note that this patch only addresses the order of header file
> > includes in backend modules (and might not be thorough).  It is possible
> > that other modules may have a similar issue, but I have not evaluated
> > them yet.
>
> +1. I'm just curious to know if you've leveraged any tool from
> src/tools/pginclude or any script or such.


Thanks for looking.

While rebasing one of my patches I noticed that the header file includes
in relnode.c are not sorted in order.  So I wrote a naive script to see
if any other C files have the same issue.  The script is:

#!/bin/bash

find . -name "*.c" | while read -r file; do
  headers=$(grep -o '#include "[^>]*"' "$file" |
            grep -v "postgres.h" | grep -v "postgres_fe.h" |
            sed 's/\.h"//g')

  sorted_headers=$(echo "$headers" | sort)

  results=$(diff <(echo "$headers") <(echo "$sorted_headers"))

  if [[ $? != 0 ]]; then
    echo "Headers in '$file' are out of order"
    echo $results
    echo
  fi
done

Thanks
Richard


Re: Add new error_action COPY ON_ERROR "log"

2024-03-06 Thread Michael Paquier
On Thu, Mar 07, 2024 at 03:52:41PM +0900, Masahiko Sawada wrote:
> One question I have is; do we want to write multiple NOTICE messages
> for one row if the row has malformed data on some columns?

Good idea.  We can do that as the field strings are already parsed.
--
Michael


signature.asc
Description: PGP signature


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 7, 2024 at 3:20 PM John Naylor  wrote:
> >
> > On Thu, Mar 7, 2024 at 12:59 PM John Naylor  wrote:

> > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > feature [-Werror,-Wtypedef-redefinition]"
> > >
> > > I'll look in the other templates to see what they do.
> >
> > Their "declare" sections have full typedefs. I found it works to leave
> > out the typedef for the "define" section, but I first want to
> > reproduce the build failure.
>
> Right. I've reproduced this build failure on my machine by specifying
> flags "-Wtypedef-redefinition -std=gnu99" to clang. Something like the
> below change seems to fix the problem:

Confirmed, will push shortly.




Re: Add new error_action COPY ON_ERROR "log"

2024-03-06 Thread Masahiko Sawada
On Thu, Mar 7, 2024 at 1:00 PM Michael Paquier  wrote:
>
> On Wed, Mar 06, 2024 at 07:32:28PM +0530, Bharath Rupireddy wrote:
> > Please see the attached v4 patch. If it looks good, I can pull
> > LOG_VERBOSITY changes out into 0001 and with 0002 containing the
> > detailed messages for discarded rows.
>
> The approach looks sensible seen from here.

Looks like a good approach to me.

>
> +LOG_VERBOSITY [ boolean ]
> [...]
> +LOG_VERBOSITY
> +
> + 
> +  Sets the verbosity of logged messages by COPY
> +  command. As an example, see its usage for
> +  COPY FROM command's ON_ERROR
> +  clause with ignore option.
> + 
> +
>
> Is a boolean the best interface for the end-user, though?  Maybe
> something like a "mode" value would speak more than a yes/no from the
> start, say a "default" mode to emit only the last LOG and a "verbose"
> for the whole set in the case of ON_ERROR?  That could use an enum
> from the start internally, but that's an implementation detail.

+1 for making it an enum, so that we will be able to have multiple
levels for example to get actual soft error contents.

One question I have is; do we want to write multiple NOTICE messages
for one row if the row has malformed data on some columns?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread Masahiko Sawada
On Thu, Mar 7, 2024 at 3:27 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 7, 2024 at 3:20 PM John Naylor  wrote:
> >
> >
> > In addition, olingo and grassquit are showing different kinds of
> > "AddressSanitizer: odr-violation" errors, which I'm not sure what to
> > make of -- example:

odr-violation seems to refer to the One Definition Rule (ODR). According
to Wikipedia[1]:

The One Definition Rule (ODR) is an important rule of the C++
programming language that prescribes that classes/structs and
non-inline functions cannot have more than one definition in the
entire program and template and types cannot have more than one
definition by translation unit. It is defined in the ISO C++ Standard
(ISO/IEC 14882) 2003, at section 3.2. Some other programming languages
have similar but differently defined rules towards the same objective.
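
For illustration, the kind of situation the sanitizer flags looks roughly
like this (a contrived example, not necessarily what olingo/grassquit hit):

    /* lookup.c -- linked statically into the main executable */
    const int example_lookup[256] = { 0 /* ... */ };

    /*
     * If the same lookup.c is also linked into a loadable module, then when
     * that module is loaded into the running binary AddressSanitizer
     * registers example_lookup a second time and reports an odr-violation,
     * even though both definitions are identical.
     */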

I don't fully understand this concept yet but are these two different
build failures related?

Regards,

[1] https://en.wikipedia.org/wiki/One_Definition_Rule

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-03-06 Thread Michael Paquier
On Wed, Mar 06, 2024 at 03:34:04PM +0900, Michael Paquier wrote:
> I am not sure that my schedule is on track to allow that for this
> release, unfortunately, especially with all the other items to review
> and discuss to make this thread feature-complete.  There should be
> a bit more than four weeks until the feature freeze (date not set in
> stone, should be around the 8th of April AoE), but I have less than
> the half due to personal issues.  Perhaps if somebody jumps on this
> thread, that will be possible..

While on it, here are some profiles based on HEAD and v17 with the
previous tests (COPY TO /dev/null, COPY FROM data sent to the void).

COPY FROM, text format with 30 attributes and HEAD:
-   66.53%16.33%  postgres  postgres[.] NextCopyFrom
- 50.20% NextCopyFrom
   - 30.83% NextCopyFromRawFields
  + 16.09% CopyReadLine
13.72% CopyReadAttributesText
   + 19.11% InputFunctionCallSafe
+ 16.33% _start
COPY FROM, text format with 30 attributes and v17:
-   66.60%16.10%  postgres  postgres[.] NextCopyFrom
- 50.50% NextCopyFrom
   - 30.44% NextCopyFromRawFields
  + 15.71% CopyReadLine
13.73% CopyReadAttributesText
   + 19.81% InputFunctionCallSafe
+ 16.10% _start

COPY TO, text format with 30 attributes and HEAD:
-   79.55%15.54%  postgres  postgres[.] CopyOneRowTo
- 64.01% CopyOneRowTo
   + 30.01% OutputFunctionCall
   + 11.71% appendBinaryStringInfo
 9.36% CopyAttributeOutText
   + 3.03% CopySendEndOfRow
 1.65% int4out
 1.01% 0x83e46be4
 0.93% 0x83e46be8
 0.93% memcpy@plt
 0.87% pgstat_progress_update_param
 0.78% enlargeStringInfo
 0.67% 0x83e46bb4
 0.66% 0x83e46bcc
 0.57% MemoryContextReset
+ 15.54% _start
COPY TO, text format with 30 attributes and v17:
-   79.35%16.08%  postgres  postgres[.] CopyOneRowTo
- 62.27% CopyOneRowTo
   + 28.92% OutputFunctionCall
   + 10.88% appendBinaryStringInfo
 9.54% CopyAttributeOutText
   + 3.03% CopySendEndOfRow
 1.60% int4out
 0.97% pgstat_progress_update_param
 0.95% 0x8c46cbe8
 0.89% memcpy@plt
 0.87% 0x8c46cbe4
 0.79% enlargeStringInfo
 0.64% 0x8c46cbcc
 0.61% 0x8c46cbb4
 0.58% MemoryContextReset
+ 16.08% _start

So, in short, and that's not really a surprise, there is no effect
once we use the dispatching with the routines only when a format wants
to plug in with the APIs, but a custom format would still have a
penalty of a few percent for both if bottlenecked on CPU.
--
Michael


signature.asc
Description: PGP signature


RE: Synchronizing slots from primary to standby

2024-03-06 Thread Zhijie Hou (Fujitsu)
On Thursday, March 7, 2024 12:46 PM Amit Kapila  wrote:
> 
> On Thu, Mar 7, 2024 at 7:35 AM Peter Smith 
> wrote:
> >
> > Here are some review comments for v107-0001
> >
> > ==
> > src/backend/replication/slot.c
> >
> > 1.
> > +/*
> > + * Struct for the configuration of standby_slot_names.
> > + *
> > + * Note: this must be a flat representation that can be held in a single 
> > chunk
> > + * of guc_malloc'd memory, so that it can be stored as the "extra" data for
> the
> > + * standby_slot_names GUC.
> > + */
> > +typedef struct
> > +{
> > + int slot_num;
> > +
> > + /* slot_names contains nmembers consecutive nul-terminated C strings */
> > + char slot_names[FLEXIBLE_ARRAY_MEMBER];
> > +} StandbySlotConfigData;
> > +
> >
> > 1a.
> > To avoid any ambiguity this 1st field is somehow a slot ID number, I
> > felt a better name would be 'nslotnames' or even just 'n' or 'count',
> >
> 
> We can probably just add a comment above slot_num and that should be
> sufficient but I am fine with 'nslotnames' as well, in anycase let's
> add a comment for the same.

Added.

> 
> >
> > 6b.
> > IMO this function would be tidier written such that the
> > MyReplicationSlot->data.name is passed as a parameter. Then you can
> > name the function more naturally like:
> >
> > IsSlotInStandbySlotNames(const char *slot_name)
> >
> 
> +1. How about naming it as SlotExistsinStandbySlotNames(char
> *slot_name) and pass the slot_name from MyReplicationSlot? Otherwise,
> we need an Assert for MyReplicationSlot in this function.

Changed as suggested.

> 
> Also, can we add a comment like below before the loop:
> + /*
> + * XXX: We are not expecting this list to be long so a linear search
> + * shouldn't hurt but if that turns out not to be true then we can cache
> + * this information for each WalSender as well.
> + */

Added.

Attached is the V108 patch set which addresses the above and Peter's comments.
I also removed the check for "*" in the GUC check hook.

Best Regards,
Hou zj



v108-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
Description:  v108-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch


RE: Synchronizing slots from primary to standby

2024-03-06 Thread Zhijie Hou (Fujitsu)
On Thursday, March 7, 2024 10:05 AM Peter Smith  wrote:
> 
> Here are some review comments for v107-0001

Thanks for the comments.

> 
> ==
> src/backend/replication/slot.c
> 
> 1.
> +/*
> + * Struct for the configuration of standby_slot_names.
> + *
> + * Note: this must be a flat representation that can be held in a
> +single chunk
> + * of guc_malloc'd memory, so that it can be stored as the "extra" data
> +for the
> + * standby_slot_names GUC.
> + */
> +typedef struct
> +{
> + int slot_num;
> +
> + /* slot_names contains nmembers consecutive nul-terminated C strings
> +*/  char slot_names[FLEXIBLE_ARRAY_MEMBER];
> +} StandbySlotConfigData;
> +
> 
> 1a.
> To avoid any ambiguity this 1st field is somehow a slot ID number, I felt a 
> better
> name would be 'nslotnames' or even just 'n' or 'count',

Changed to 'nslotnames'.

> 
> ~
> 
> 1b.
> (fix typo)
> 
> SUGGESTION for the 2nd field comment
> slot_names is a chunk of 'n' X consecutive null-terminated C strings

Changed.

> 
> ~
> 
> 1c.
> A more explanatory name for this typedef maybe is
> 'StandbySlotNamesConfigData' ?

Changed.

> 
> ~~~
> 
> 
> 2.
> +/* This is parsed and cached configuration for standby_slot_names */
> +static StandbySlotConfigData *standby_slot_config;
> 
> 
> 2a.
> /This is parsed and cached configuration for .../This is the parsed and cached
> configuration for .../

Changed.

> 
> ~
> 
> 2b.
> Similar to above -- since this only has name information maybe it is more
> correct to call it 'standby_slot_names_config'?
> 


Changed.

> ~~~
> 
> 3.
> +/*
> + * A helper function to validate slots specified in GUC standby_slot_names.
> + *
> + * The rawname will be parsed, and the parsed result will be saved into
> + * *elemlist.
> + */
> +static bool
> +validate_standby_slots(char *rawname, List **elemlist)
> 
> /and the parsed result/and the result/
> 


Changed.

> ~~~
> 
> 4. check_standby_slot_names
> 
> + /* Need a modifiable copy of string */ rawname = pstrdup(*newval);
> 
> /copy of string/copy of the GUC string/
> 


Changed.

> ~~~
> 
> 5.
> +assign_standby_slot_names(const char *newval, void *extra) {
> + /*
> + * The standby slots may have changed, so we must recompute the oldest
> + * LSN.
> + */
> + ss_oldest_flush_lsn = InvalidXLogRecPtr;
> +
> + standby_slot_config = (StandbySlotConfigData *) extra; }
> 
> To avoid leaking don't we need to somewhere take care to free any memory
> used by a previous value (if any) of this 'standby_slot_config'?
> 

The memory of extra is maintained by the GUC mechanism. It will be
automatically freed when the associated GUC setting is no longer of interest.
See src/backend/utils/misc/README for details.

 
> ~~~
> 
> 6. AcquiredStandbySlot
> 
> +/*
> + * Return true if the currently acquired slot is specified in
> + * standby_slot_names GUC; otherwise, return false.
> + */
> +bool
> +AcquiredStandbySlot(void)
> +{
> + const char *name;
> +
> + /* Return false if there is no value in standby_slot_names */ if
> + (standby_slot_config == NULL) return false;
> +
> + name = standby_slot_config->slot_names; for (int i = 0; i <
> + standby_slot_config->slot_num; i++) { if (strcmp(name,
> + NameStr(MyReplicationSlot->data.name)) == 0) return true;
> +
> + name += strlen(name) + 1;
> + }
> +
> + return false;
> +}
> 
> 6a.
> Just checking "(standby_slot_config == NULL)" doesn't seem enough to me,
> because IIUIC it is possible when 'standby_slot_names' has no value then
> maybe standby_slot_config is not NULL but standby_slot_config->slot_num is
> 0.

The standby_slot_config will always be NULL if there is no value in
standby_slot_names.

While checking, I did find a rare case: if there is only some white space
in standby_slot_names, then slot_num will be 0. I have fixed it so that
standby_slot_config will always be NULL if there is no meaningful value in the GUC.

> 
> ~
> 
> 6b.
> IMO this function would be tidier written such that the
> MyReplicationSlot->data.name is passed as a parameter. Then you can
> name the function more naturally like:
> 
> IsSlotInStandbySlotNames(const char *slot_name)

Changed it to SlotExistsInStandbySlotNames.

> 
> ~
> 
> 6c.
> IMO the body of the function will be tidier if written so there are only 2 
> returns
> instead of 3 like
> 
> SUGGESTION:
> if (...)
> {
>   for (...)
>   {
>  ...
> return true;
>   }
> }
> return false;

I personally prefer the current style.

> 
> ~~~
> 
> 7.
> + /*
> + * Don't need to wait for the standbys to catch up if there is no value
> + in
> + * standby_slot_names.
> + */
> + if (standby_slot_config == NULL)
> + return true;
> 
> (similar to a previous review comment)
> 
> This check doesn't seem enough because IIUIC it is possible when
> 'standby_slot_names' has no value then maybe standby_slot_config is not NULL
> but standby_slot_config->slot_num is 0.

Same as above.

> 
> ~~~
> 
> 8. WaitForStandbyConfirmation
> 
> + /*
> + * Don't need to wait for the standby to catch up if the current
> + acquired
> + * slot is not a 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread Masahiko Sawada
On Thu, Mar 7, 2024 at 3:20 PM John Naylor  wrote:
>
> On Thu, Mar 7, 2024 at 12:59 PM John Naylor  wrote:
> >
> > On Thu, Mar 7, 2024 at 12:55 PM John Naylor  wrote:
> > >
> > > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > > + /* Find the control object in shared memory */
> > > > > + control = handle;
> > > >
> > > > I think it's mostly because of readability; it makes clear that the
> > > > handle should be castable to dsa_pointer and it's a control object. I
> > > > borrowed it from dshash_attach().
> > >
> > > I find that a bit strange, but I went ahead and kept it.
> > >
> > >
> > >
> > > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada  
> > > wrote:
> > >
> > > > The 0001 patch looks good to me. I have some minor comments:
> > >
> > > > +PGFILEDESC = "test_radixtree - test code for 
> > > > src/backend/lib/radixtree.c"
> > > > +
> > > >
> > > > "src/backend/lib/radixtree.c" should be updated to
> > > > "src/include/lib/radixtree.h".
> > >
> > > Done.
> > >
> > > > --- /dev/null
> > > > +++ b/src/test/modules/test_radixtree/README
> > > > @@ -0,0 +1,7 @@
> > > > +test_integerset contains unit tests for testing the integer set 
> > > > implementation
> > > > +in src/backend/lib/integerset.c.
> > > > +
> > > > +The tests verify the correctness of the implementation, but they can 
> > > > also be
> > > > +used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
> > > > +test_integerset.c, the tests will print extra information about 
> > > > execution time
> > > > +and memory usage.
> > > >
> > > > This file is not updated for test_radixtree. I think we can remove it
> > > > as the test cases in test_radixtree are clear.
> > >
> > > Done. I pushed this with a few last-minute cosmetic adjustments. This
> > > has been a very long time coming, but we're finally in the home
> > > stretch!
> > >
> > > Already, I see sifaka doesn't like this, and I'm looking now...
> >
> > It's complaining that these forward declarations...
> >
> > /* generate forward declarations necessary to use the radix tree */
> > #ifdef RT_DECLARE
> >
> > typedef struct RT_RADIX_TREE RT_RADIX_TREE;
> > typedef struct RT_ITER RT_ITER;
> >
> > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > feature [-Werror,-Wtypedef-redefinition]"
> >
> > I'll look in the other templates to see what they do.
>
> Their "declare" sections have full typedefs. I found it works to leave
> out the typedef for the "define" section, but I first want to
> reproduce the build failure.

Right. I've reproduced this build failure on my machine by specifying
flags "-Wtypedef-redefinition -std=gnu99" to clang. Something like the
below change seems to fix the problem:

--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -676,7 +676,7 @@ typedef struct RT_RADIX_TREE_CONTROL
 }  RT_RADIX_TREE_CONTROL;

 /* Entry point for allocating and accessing the tree */
-typedef struct RT_RADIX_TREE
+struct RT_RADIX_TREE
 {
MemoryContext context;

@@ -691,7 +691,7 @@ typedef struct RT_RADIX_TREE
/* leaf_context is used only for single-value leaves */
MemoryContextData *leaf_context;
 #endif
-}  RT_RADIX_TREE;
+};

 /*
  * Iteration support.
@@ -714,7 +714,7 @@ typedef struct RT_NODE_ITER
 }  RT_NODE_ITER;

 /* state for iterating over the whole radix tree */
-typedef struct RT_ITER
+struct RT_ITER
 {
RT_RADIX_TREE *tree;

@@ -728,7 +728,7 @@ typedef struct RT_ITER

/* The key constructed during iteration */
uint64  key;
-}  RT_ITER;
+};


 /* verification (available only in assert-enabled builds) */

>
> In addition, olingo and grassquit are showing different kinds of
> "AddressSanitizer: odr-violation" errors, which I'm not sure what to
> make of -- example:
>
> ==1862767==ERROR: AddressSanitizer: odr-violation (0x7fc257476b60):
>   [1] size=256 'pg_leftmost_one_pos'
> /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34
>   [2] size=256 'pg_leftmost_one_pos'
> /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34
> These globals were registered at these points:
>   [1]:
> #0 0x563564b97bf6 in __asan_register_globals
> (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6)
> (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
> #1 0x563564b98d1d in __asan_register_elf_globals
> (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d)
> (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
> #2 0x7fc265c3fe3d in call_init elf/dl-init.c:74:3
> #3 0x7fc265c3fe3d in call_init elf/dl-init.c:26:1
>
>   [2]:
> #0 0x563564b97bf6 in __asan_register_globals
> (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6)
> (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
> #1 0x563564b98d1d in 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Thu, Mar 7, 2024 at 12:59 PM John Naylor  wrote:
>
> On Thu, Mar 7, 2024 at 12:55 PM John Naylor  wrote:
> >
> > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada  
> > wrote:
> > >
> > > > + /* Find the control object in shared memory */
> > > > + control = handle;
> > >
> > > I think it's mostly because of readability; it makes clear that the
> > > handle should be castable to dsa_pointer and it's a control object. I
> > > borrowed it from dshash_attach().
> >
> > I find that a bit strange, but I went ahead and kept it.
> >
> >
> >
> > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada  
> > wrote:
> >
> > > The 0001 patch looks good to me. I have some minor comments:
> >
> > > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
> > > +
> > >
> > > "src/backend/lib/radixtree.c" should be updated to
> > > "src/include/lib/radixtree.h".
> >
> > Done.
> >
> > > --- /dev/null
> > > +++ b/src/test/modules/test_radixtree/README
> > > @@ -0,0 +1,7 @@
> > > +test_integerset contains unit tests for testing the integer set 
> > > implementation
> > > +in src/backend/lib/integerset.c.
> > > +
> > > +The tests verify the correctness of the implementation, but they can 
> > > also be
> > > +used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
> > > +test_integerset.c, the tests will print extra information about 
> > > execution time
> > > +and memory usage.
> > >
> > > This file is not updated for test_radixtree. I think we can remove it
> > > as the test cases in test_radixtree are clear.
> >
> > Done. I pushed this with a few last-minute cosmetic adjustments. This
> > has been a very long time coming, but we're finally in the home
> > stretch!
> >
> > Already, I see sifaka doesn't like this, and I'm looking now...
>
> It's complaining that these forward declarations...
>
> /* generate forward declarations necessary to use the radix tree */
> #ifdef RT_DECLARE
>
> typedef struct RT_RADIX_TREE RT_RADIX_TREE;
> typedef struct RT_ITER RT_ITER;
>
> ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> feature [-Werror,-Wtypedef-redefinition]"
>
> I'll look in the other templates to see what they do.

Their "declare" sections have full typedefs. I found it works to leave
out the typedef for the "define" section, but I first want to
reproduce the build failure.

In addition, olingo and grassquit are showing different kinds of
"AddressSanitizer: odr-violation" errors, which I'm not sure what to
make of -- example:

==1862767==ERROR: AddressSanitizer: odr-violation (0x7fc257476b60):
  [1] size=256 'pg_leftmost_one_pos'
/home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34
  [2] size=256 'pg_leftmost_one_pos'
/home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34
These globals were registered at these points:
  [1]:
#0 0x563564b97bf6 in __asan_register_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
#1 0x563564b98d1d in __asan_register_elf_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
#2 0x7fc265c3fe3d in call_init elf/dl-init.c:74:3
#3 0x7fc265c3fe3d in call_init elf/dl-init.c:26:1

  [2]:
#0 0x563564b97bf6 in __asan_register_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
#1 0x563564b98d1d in __asan_register_elf_globals
(/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d)
(BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8)
#2 0x7fc2649847f5 in call_init csu/../csu/libc-start.c:145:3
#3 0x7fc2649847f5 in __libc_start_main csu/../csu/libc-start.c:347:5




Invalid query generated by postgres_fdw with UNION ALL and ORDER BY

2024-03-06 Thread Michał Kłeczek
The following query:

SELECT * FROM (
  SELECT 2023 AS year, * FROM remote_table_1
  UNION ALL
  SELECT 2022 AS year, * FROM remote_table_2
)
ORDER BY year DESC;

yields the following remote query:

SELECT [columns] FROM remote_table_1 ORDER BY 2023 DESC

and subsequently fails remote execution.


Not really sure where the problem is - the planner or postgres_fdw.
I guess it is postgres_fdw not filtering out ordering keys.

This filtering would also be pretty useful in the following scenario (which is
also why I went through the UNION ALL route and discovered this issue):

I have a table partitioned by year, partitions are remote tables.
On remote servers I have a GIST index that does not support ordering ([1]) so I 
would like to avoid sending ORDER BY year to remote servers.
Ideally redundant ordering should be filtered out.
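
A rough sketch of the kind of setup described above (all names made up, the
partitions being postgres_fdw foreign tables):

    CREATE TABLE events (year int NOT NULL, payload text) PARTITION BY LIST (year);

    CREATE FOREIGN TABLE events_2023 PARTITION OF events
        FOR VALUES IN (2023) SERVER remote_2023;
    CREATE FOREIGN TABLE events_2022 PARTITION OF events
        FOR VALUES IN (2022) SERVER remote_2022;

    -- ORDER BY year is trivially satisfied inside each partition, so pushing
    -- it down to the remote servers is unnecessary
    SELECT * FROM events ORDER BY year DESC;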

[1] 
https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org


—
Kind regards,
Michal

Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-06 Thread shveta malik
On Thu, Mar 7, 2024 at 11:12 AM Michael Paquier  wrote:
>

> Yeah, it is possible that what you retrieve from
> pgstat_fetch_replslot() does not refer exactly to the slot's content
> under concurrent activity, but you cannot protect a single scan of
> pg_stat_replication_slots as an effect of its design:
> pg_stat_get_replication_slot() has to be called multiple times.  The
> patch at least makes sure that the copy of the slot's stats retrieved
> by pgstat_fetch_entry() is slightly more consistent

Right, I understand that.

> , but you cannot do
> better than that except if the data retrieved from
> pg_replication_slots and its stats are fetched in the same context
> function call, holding the replslot LWLock for the whole scan
> duration.

Yes, agreed.

>
> > Do you feel that the lock in pgstat_fetch_replslot() should be moved
> > to its caller when we are done copying the results of that slot? This
> > will improve the protection slightly.
>
> I don't see what extra protection this would offer, as
> pg_stat_get_replication_slot() is called once for each slot.  Feel
> free to correct me if I'm missing something.

It slightly improves the chances. pgstat_fetch_replslot() is only
called from pg_stat_get_replication_slot(), so I thought it might be
better to acquire the lock before we call pgstat_fetch_replslot() and
release it once we are done copying the results for that particular slot.
But I also understand that it will not prevent someone from dropping
that slot at a later stage when the rest of the calls of
pg_stat_get_replication_slot() are still pending. So I am okay with
the current one.

thanks
Shveta




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Thu, Mar 7, 2024 at 12:55 PM John Naylor  wrote:
>
> On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada  wrote:
> >
> > > + /* Find the control object in shared memory */
> > > + control = handle;
> >
> > I think it's mostly because of readability; it makes clear that the
> > handle should be castable to dsa_pointer and it's a control object. I
> > borrowed it from dshash_attach().
>
> I find that a bit strange, but I went ahead and kept it.
>
>
>
> On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada  wrote:
>
> > The 0001 patch looks good to me. I have some minor comments:
>
> > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
> > +
> >
> > "src/backend/lib/radixtree.c" should be updated to
> > "src/include/lib/radixtree.h".
>
> Done.
>
> > --- /dev/null
> > +++ b/src/test/modules/test_radixtree/README
> > @@ -0,0 +1,7 @@
> > +test_integerset contains unit tests for testing the integer set 
> > implementation
> > +in src/backend/lib/integerset.c.
> > +
> > +The tests verify the correctness of the implementation, but they can also 
> > be
> > +used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
> > +test_integerset.c, the tests will print extra information about execution 
> > time
> > +and memory usage.
> >
> > This file is not updated for test_radixtree. I think we can remove it
> > as the test cases in test_radixtree are clear.
>
> Done. I pushed this with a few last-minute cosmetic adjustments. This
> has been a very long time coming, but we're finally in the home
> stretch!
>
> Already, I see sifaka doesn't like this, and I'm looking now...

It's complaining that these forward declarations...

/* generate forward declarations necessary to use the radix tree */
#ifdef RT_DECLARE

typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;

... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
feature [-Werror,-Wtypedef-redefinition]"

I'll look in the other templates to see what they do.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-06 Thread John Naylor
On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada  wrote:
>
> > + /* Find the control object in shared memory */
> > + control = handle;
>
> I think it's mostly because of readability; it makes clear that the
> handle should be castable to dsa_pointer and it's a control object. I
> borrowed it from dshash_attach().

I find that a bit strange, but I went ahead and kept it.



On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada  wrote:

> The 0001 patch looks good to me. I have some minor comments:

> +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
> +
>
> "src/backend/lib/radixtree.c" should be updated to
> "src/include/lib/radixtree.h".

Done.

> --- /dev/null
> +++ b/src/test/modules/test_radixtree/README
> @@ -0,0 +1,7 @@
> +test_integerset contains unit tests for testing the integer set 
> implementation
> +in src/backend/lib/integerset.c.
> +
> +The tests verify the correctness of the implementation, but they can also be
> +used as a micro-benchmark.  If you set the 'intset_test_stats' flag in
> +test_integerset.c, the tests will print extra information about execution 
> time
> +and memory usage.
>
> This file is not updated for test_radixtree. I think we can remove it
> as the test cases in test_radixtree are clear.

Done. I pushed this with a few last-minute cosmetic adjustments. This
has been a very long time coming, but we're finally in the home
stretch!

Already, I see sifaka doesn't like this, and I'm looking now...




Re: Synchronizing slots from primary to standby

2024-03-06 Thread Amit Kapila
On Thu, Mar 7, 2024 at 8:37 AM shveta malik  wrote:
>

I thought about whether we can make standby_slot_names USERSET
instead of SIGHUP, and it doesn't sound like a good idea as that can
lead to inconsistent standby replicas even after configuring the
correct value of standby_slot_names. One can set a different or ''
(empty) value for a particular session and consume all changes from
the slot without waiting for the standby to acknowledge the change.
Also, it would be difficult for users to ensure whether the standby is
always ahead of subscribers. Does anyone think differently?
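
To illustrate the concern: with USERSET, nothing would stop a session from
simply running the following (standby_slot_names being the GUC proposed in
this patch set) before consuming changes from the slot:

    SET standby_slot_names = '';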

-- 
With Regards,
Amit Kapila.




Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-06 Thread Michael Paquier
On Thu, Mar 07, 2024 at 10:57:28AM +0530, shveta malik wrote:
> --It can happen in a small window in pg_stat_get_replication_slot()
> when we are consuming the return values of pgstat_fetch_replslot
> (using slotent).

Yeah, it is possible that what you retrieve from
pgstat_fetch_replslot() does not refer exactly to the slot's content
under concurrent activity, but you cannot protect a single scan of
pg_stat_replication_slots as an effect of its design:
pg_stat_get_replication_slot() has to be called multiple times.  The
patch at least makes sure that the copy of the slot's stats retrieved
by pgstat_fetch_entry() is slightly more consistent, but you cannot do
better than that except if the data retrieved from
pg_replication_slots and its stats are fetched in the same context
function call, holding the replslot LWLock for the whole scan
duration.

> Do you feel that the lock in pgstat_fetch_replslot() should be moved
> to its caller when we are done copying the results of that slot? This
> will improve the protection slightly.

I don't see what extra protection this would offer, as
pg_stat_get_replication_slot() is called once for each slot.  Feel
free to correct me if I'm missing something.
--
Michael


signature.asc
Description: PGP signature


Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-06 Thread shveta malik
On Wed, Mar 6, 2024 at 2:36 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Wed, Mar 06, 2024 at 10:24:46AM +0530, shveta malik wrote:
> > On Tue, Mar 5, 2024 at 6:52 PM Bertrand Drouvot
> >  wrote:
> > Thanks.  Can we try to get rid of multiple LwLockRelease in
> > pgstat_reset_replslot(). Is this any better?
> >
> > /*
> > -* Nothing to do for physical slots as we collect stats only for 
> > logical
> > -* slots.
> > +* Reset stats if it is a logical slot. Nothing to do for physical 
> > slots
> > +* as we collect stats only for logical slots.
> >  */
> > -   if (SlotIsPhysical(slot))
> > -   {
> > -   LWLockRelease(ReplicationSlotControlLock);
> > -   return;
> > -   }
> > -
> > -   /* reset this one entry */
> > -   pgstat_reset(PGSTAT_KIND_REPLSLOT, InvalidOid,
> > -ReplicationSlotIndex(slot));
> > +   if (SlotIsLogical(slot))
> > +   pgstat_reset(PGSTAT_KIND_REPLSLOT, InvalidOid,
> > +ReplicationSlotIndex(slot));
> >
> > LWLockRelease(ReplicationSlotControlLock);
> >
>
> Yeah, it's easier to read and probably reduce the pgstat_replslot.o object 
> file
> size a bit for non optimized build.
>
> > Something similar in pgstat_fetch_replslot() perhaps?
>
> Yeah, all of the above done in v3 attached.
>

Thanks for the patch.

For the fix in pgstat_fetch_replslot(), even with the lock in the fetch
call, there is a chance that the concerned slot can be dropped and
recreated.

--It can happen in a small window in pg_stat_get_replication_slot()
when we are consuming the return values of pgstat_fetch_replslot
(using slotent).

--Also it can happen at a later stage when we have switched to
fetching the next slot (obtained from 'pg_replication_slots' through
view' pg_stat_replication_slots'), the previous one can be dropped.

Ultimately the results of system view 'pg_replication_slots' can still
give us non-existing or re-created slots. But yes, I do not deny that
it gives us better consistency protection.

Do you feel that the lock in pgstat_fetch_replslot() should be moved
to its caller when we are done copying the results of that slot? This
will improve the protection slightly.

thanks
Shveta




Re: Add new error_action COPY ON_ERROR "log"

2024-03-06 Thread torikoshia

On 2024-03-07 13:00, Michael Paquier wrote:

On Wed, Mar 06, 2024 at 07:32:28PM +0530, Bharath Rupireddy wrote:

Please see the attached v4 patch. If it looks good, I can pull
LOG_VERBOSITY changes out into 0001 and with 0002 containing the
detailed messages for discarded rows.


The approach looks sensible seen from here.

+LOG_VERBOSITY [ boolean ]

[...]
+LOG_VERBOSITY
+
+ 
+  Sets the verbosity of logged messages by COPY
+  command. As an example, see its usage for
+  COPY FROM command's ON_ERROR
+  clause with ignore option.
+ 
+

Is a boolean the best interface for the end-user, though?  Maybe
something like a "mode" value would speak more than a yes/no from the
start, say a "default" mode to emit only the last LOG and a "verbose"
for the whole set in the case of ON_ERROR?  That could use an enum
from the start internally, but that's an implementation detail.

Describing what gets logged in the paragraph of ON_ERROR sounds fine,
especially if in the future more logs are added depending on other
options.  That's an assumption at this stage, of course.

I am adding Alexander Korotkov in CC, as the original committer of
9e2d8701194f, as I assume that he may want to chime in this
discussion.

Torikoshi-san or others, if you have any comments about the interface,
feel free.


Thanks.

Maybe I'm being overly cautious, but I'm a little concerned about whether
the contents of this output can really be called verbose, since it does not
output the actual soft error message that occurred, but only the row and
column where the error occurred.


Since the soft error mechanism can at least output the contents of soft
errors in the server log [1], it might be a good idea to use something
like a 'mode' value instead of a boolean, as Michael-san suggested, so that
the log output contents can be adjusted at multiple levels.


[1] 
https://www.postgresql.org/message-id/20230322175000.qbdctk7bnmifh5an%40awork3.anarazel.de



--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: DOCS: Avoid using abbreviation "aka"

2024-03-06 Thread Michael Paquier
On Thu, Mar 07, 2024 at 10:02:36AM +0530, Amit Kapila wrote:
> Pushed.

Thanks.
--
Michael


signature.asc
Description: PGP signature


Re: remaining sql/json patches

2024-03-06 Thread Himanshu Upadhyaya
On Wed, Mar 6, 2024 at 9:04 PM Tomas Vondra 
wrote:

>
>
> I'm pretty sure this is the correct & expected behavior. The second
> query treats the value as string (because that's what should happen for
> values in double quotes).
>
OK, then why does the below query provide the correct conversion, even if
we enclose the value in double quotes?
‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{
 "id" : "1234567890",
 "FULL_NAME" : "JOHN DOE"}',
'$'
COLUMNS(
 name varchar(20) PATH 'lax $.FULL_NAME',
 id int PATH 'lax $.id'
  )
   )
;
   name   |     id
----------+------------
 JOHN DOE | 1234567890
(1 row)

and for a bigger input (string) it leaves the column empty, as below.
‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{
 "id" : "12345678901",
 "FULL_NAME" : "JOHN DOE"}',
'$'
COLUMNS(
 name varchar(20) PATH 'lax $.FULL_NAME',
 id int PATH 'lax $.id'
  )
   )
;
   name   | id
----------+----
 JOHN DOE |
(1 row)

It seems this has nothing to do with the data being enclosed in double
quotes, but rather with the internal cast to integer: I think in the case of
the bigger input it is not able to cast the value to integer (as defined
under COLUMNS as id int PATH 'lax $.id').


If it is not able to convert the value to an integer because the input is
too big, it should error out with a suitable error message instead of
leaving the column empty.
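
For what it's worth, the difference between the two inputs is simply the
4-byte integer range; a tiny standalone check (my illustration, not part of
the patch set) shows why only the second cast can fail:

    #include <assert.h>
    #include <limits.h>

    int
    main(void)
    {
        /* "1234567890" fits in int4, "12345678901" does not. */
        assert(1234567890LL <= INT_MAX);    /* first example: cast succeeds */
        assert(12345678901LL > INT_MAX);    /* second example: out of range */
        return 0;
    }

If the column's error handling defaults to NULL ON ERROR (my reading of the
behavior under discussion), the failed cast then surfaces as an empty column
rather than an error.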

Thoughts?

-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com


Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-06 Thread Michael Paquier
On Wed, Mar 06, 2024 at 09:05:59AM +, Bertrand Drouvot wrote:
> Yeah, all of the above done in v3 attached.

Interesting, so this relies on the slot index to ensure the uniqueness of
the stat entries.  And if the entry pointing to this ID is updated,
we may refer to incorrect data.

The inconsistency you could get for the fetch and reset cases are
narrow, but at least what you are proposing here would protect the 
index lookup until the entry is copied from shmem, so this offers a
better consistency protection when querying
pg_stat_get_replication_slot() on a fetch, while avoiding a reset of
incorrect data under concurrent activity.

In passing..  pgstat_create_replslot() and pgstat_drop_replslot() rely
on the assumption that the LWLock ReplicationSlotAllocationLock is
taken while calling these routines.  Perhaps that would be worth some
extra Assert(LWLockHeldByMeInMode()) in pgstat_replslot.c for these
two?  Not directly related to this problem.
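
Something as small as the following at the top of both routines would do it
(a sketch; the exclusive lock mode is my assumption about how the callers
acquire ReplicationSlotAllocationLock):

    Assert(LWLockHeldByMeInMode(ReplicationSlotAllocationLock, LW_EXCLUSIVE));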
--
Michael


signature.asc
Description: PGP signature


Re: remaining sql/json patches

2024-03-06 Thread jian he
two cosmetic minor issues.

+/*
+ * JsonCoercion
+ * Information about coercing a SQL/JSON value to the specified
+ * type at runtime
+ *
+ * A node of this type is created if the parser cannot find a cast expression
+ * using coerce_type() or OMIT QUOTES is specified for JSON_QUERY.  If the
+ * latter, 'expr' may contain the cast expression; if not, the quote-stripped
+ * scalar string will be coerced by calling the target type's input function.
+ * See ExecEvalJsonCoercion.
+ */
+typedef struct JsonCoercion
+{
+ NodeTag type;
+
+ Oid targettype;
+ int32 targettypmod;
+ bool omit_quotes; /* OMIT QUOTES specified for JSON_QUERY? */
+ Node   *cast_expr; /* coercion cast expression or NULL */
+ Oid collation;
+} JsonCoercion;

comment:  'expr' may contain the cast expression;
here "expr" should be "cast_expr"?
"a SQL/JSON" should be "an SQL/JSON"?




Re: Synchronizing slots from primary to standby

2024-03-06 Thread Amit Kapila
On Thu, Mar 7, 2024 at 7:35 AM Peter Smith  wrote:
>
> Here are some review comments for v107-0001
>
> ==
> src/backend/replication/slot.c
>
> 1.
> +/*
> + * Struct for the configuration of standby_slot_names.
> + *
> + * Note: this must be a flat representation that can be held in a single 
> chunk
> + * of guc_malloc'd memory, so that it can be stored as the "extra" data for 
> the
> + * standby_slot_names GUC.
> + */
> +typedef struct
> +{
> + int slot_num;
> +
> + /* slot_names contains nmembers consecutive nul-terminated C strings */
> + char slot_names[FLEXIBLE_ARRAY_MEMBER];
> +} StandbySlotConfigData;
> +
>
> 1a.
> To avoid any ambiguity that this 1st field is somehow a slot ID number, I
> felt a better name would be 'nslotnames' or even just 'n' or 'count',
>

We can probably just add a comment above slot_num and that should be
sufficient, but I am fine with 'nslotnames' as well; in any case, let's
add a comment for it.

>
> 6b.
> IMO this function would be tidier written such that the
> MyReplicationSlot->data.name is passed as a parameter. Then you can
> name the function more naturally like:
>
> IsSlotInStandbySlotNames(const char *slot_name)
>

+1. How about naming it as SlotExistsinStandbySlotNames(char
*slot_name) and pass the slot_name from MyReplicationSlot? Otherwise,
we need an Assert for MyReplicationSlot in this function.

Also, can we add a comment like below before the loop:
+ /*
+ * XXX: We are not expecting this list to be long so a linear search
+ * shouldn't hurt but if that turns out not to be true then we can cache
+ * this information for each WalSender as well.
+ */

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-03-06 Thread Euler Taveira
On Wed, Mar 6, 2024, at 8:24 AM, vignesh C wrote:
> Few comments:

Thanks for your review. Some changes are included in v26.

> 1)   Can we use strdup here instead of atoi, as we do similarly in
> pg_dump too? Otherwise we will do a double conversion: convert using
> atoi and then back to a string while forming the connection string:
> +   case 'p':
> +   if ((opt.sub_port = atoi(optarg)) <= 0)
> +   pg_fatal("invalid subscriber
> port number");
> +   break;

I don't have a strong preference but decided to provide a patch for it. See
v26-0003.

> 2) We can have some valid range for this, else we will end up in some
> unexpected values when a higher number is specified:
> +   case 't':
> +   opt.recovery_timeout = atoi(optarg);
> +   break;

I wouldn't like to add an arbitrary value. Suggestions?

> 3) Now that we have addressed most of the items, can we handle this TODO:
> +   /*
> +* TODO use primary_conninfo (if available) from subscriber 
> and
> +* extract publisher connection string. Assume that there are
> +* identical entries for physical and logical
> replication. If there is
> +* not, we would fail anyway.
> +*/
> +   pg_log_error("no publisher connection string specified");
> +   pg_log_error_hint("Try \"%s --help\" for more
> information.", progname);
> +   exit(1);

It is not in my top priority at the moment.

> 4)  By default the log level as info here, I was not sure how to set
> it to debug level to get these error messages:
> +   pg_log_debug("publisher(%d): connection string: %s",
> i, dbinfo[i].pubconninfo);
> +   pg_log_debug("subscriber(%d): connection string: %s",
> i, dbinfo[i].subconninfo);

  -v
  --verbose

    Enables verbose mode. This will cause pg_createsubscriber to output
    progress messages and detailed information about each step to standard
    error. Repeating the option causes additional debug-level messages to
    appear on standard error.

> 5) Currently in non verbose mode there are no messages printed on
> console, we could have a few of them printed irrespective of verbose
> or not like the following:
> a) creating publication
> b) creating replication slot
> c) waiting for the target server to reach the consistent state
> d) If pg_createsubscriber fails after this point, you must recreate
> the physical replica before continuing.
> e) creating subscription

That's the idea. Quiet mode by default.

> 6) The message should be "waiting for the target server to reach the
> consistent state":
> +#define NUM_CONN_ATTEMPTS  5
> +
> +   pg_log_info("waiting the target server to reach the consistent 
> state");
> +
> +   conn = connect_database(conninfo, true);

Fixed.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: speed up a logical replica setup

2024-03-06 Thread Euler Taveira
On Wed, Mar 6, 2024, at 7:02 AM, Hayato Kuroda (Fujitsu) wrote:
> Thanks for updating the patch!

Thanks for the feedback. I'm attaching v26 that addresses most of your comments
and some issues pointed by Vignesh [1].

> >* I also removed the check for standby is running. If the standby was 
> >stopped a
> >  long time ago, it will take some time to reach the start point.
> >* Dry run mode has to start / stop the service to work correctly. Is it an
> >  issue?
> 
> One concern (see below comment) is that -l option would not be passed even if
> the standby has been logging before running pg_createsubscriber. Also, some 
> settings
> passed by pg_ctl start -o  would not be restored.

That's a good point. We should state in the documentation that GUCs specified in
the command-line options are ignored during the execution.

> >However, I decided to include --retain option, I'm thinking about to remove 
> >it.
> >If the logging is enabled, the information during the pg_createsubscriber 
> >will
> >be available. The client log can be redirected to a file for future 
> >inspection.
> 
> Just to confirm - you meant to say like below, right? 
> * the client output would be redirected, and
> * -r option would be removed.

Yes. The logging_collector is usually enabled or the syslog is collecting the
log entries. On reflection, another log directory that stores entries for a
short period of time doesn't seem like a good idea. It divides the information
and it also costs development time. The questions that made me think about it were:
Should I remove the pg_createsubscriber_output.d directory if it runs
successfully? What if there is an old file there? Is it another directory to
exclude while taking a backup? I also don't like the long directory name.

> Here are my initial comments for v25-0001. I read new doc and looks very good.
> I may do reviewing more about v25-0001, but feel free to revise.
> 
> 01. cleanup_objects_atexit
> ```
> PGconn*conn;
> int i;
> ```
> The declaration *conn can be in the for-loop. Also, the declaration of the 
> indicator can be in the bracket.

Changed.

> 02. cleanup_objects_atexit
> ```
> 
> /*
> * If a connection could not be established, inform the user
> * that some objects were left on primary and should be
> * removed before trying again.
> */
> if (dbinfo[i].made_publication)
> {
> pg_log_warning("There might be a publication \"%s\" in database \"%s\" on 
> primary",
>dbinfo[i].pubname, dbinfo[i].dbname);
> pg_log_warning_hint("Consider dropping this publication before trying 
> again.");
> }
> if (dbinfo[i].made_replslot)
> {
> pg_log_warning("There might be a replication slot \"%s\" in database \"%s\" 
> on primary",
>dbinfo[i].subname, dbinfo[i].dbname);
> pg_log_warning_hint("Drop this replication slot soon to avoid retention of 
> WAL files.");
> }
> ```
> 
> Not sure which is better, but we may be able to write the list to a concrete
> file, like pg_upgrade does.
> (I thought it had already been discussed, but could not find it in the
> archive. Sorry if it is a duplicated comment.)

Do you mean the replication slot file? I think the replication slot and the
server (primary) is sufficient for checking and fixing if required.

> 03. main
> ```
> while ((c = getopt_long(argc, argv, "d:D:nP:rS:t:v",
> long_options, &option_index)) != -1)
> ```
> 
> Missing update for __shortopts.

Fixed.

> 04. main
> ```
> case 'D':
> opt.subscriber_dir = pg_strdup(optarg);
> canonicalize_path(opt.subscriber_dir);
> break;
> ...
> case 'P':
> opt.pub_conninfo_str = pg_strdup(optarg);
> break;
> ...
> case 's':
> opt.socket_dir = pg_strdup(optarg);
> break;
> ...
> case 'U':
> opt.sub_username = pg_strdup(optarg);
> break;
> ```
> 
> Should we consider the case these options would be specified twice?
> I.e., should we call pg_free() before the substitution? 

It isn't a concern for the other client tools. I think the reason is that it
doesn't continue to leak memory during the execution. I wouldn't bother with it.

> 05. main
> 
> Missing canonicalize_path() to the socket_dir.

Fixed.

> 06. main
> ```
> /*
> * If socket directory is not provided, use the current directory.
> */
> ```
> 
> One-line comment can be used. Period can be also removed at that time.

Fixed.

> 07. main
> ```
> /*
> *
> * If subscriber username is not provided, check if the environment
> * variable sets it. If not, obtain the operating system name of the user
> * running it.
> */
> ```
> Unnecessary blank.

Fixed.

> 08. main
> ```
> char*errstr = NULL;
> ```
>  
> This declaration can be at else-part.

Fixed.

> 09. main.
> 
> Also, in the first place, do we have to get the username if it is not
> specified? I felt libpq can handle the case if we skip passing the info.

Are you suggesting that the username should be optional?

> 10. main
> ```
> appendPQExpBuffer(sub_conninfo_str, "host=%s port=%u user=%s 
> fallback_application_name=%s",
>   opt.socket_dir, opt.sub_port, opt.sub_username, progname);
> sub_base_conninfo = 

Re: DOCS: Avoid using abbreviation "aka"

2024-03-06 Thread Amit Kapila
On Fri, Mar 1, 2024 at 11:29 AM Michael Paquier  wrote:
>
> On Fri, Mar 01, 2024 at 11:08:21AM +0530, Amit Kapila wrote:
> > I wanted to wait for two or three days to see if any other fixes in
> > docs, typos, or cosmetic stuff are reported in this functionality then
> > I can combine and push them. However, there is no harm in pushing them
> > separately, so if you want to go ahead please feel free to do so.
>
> Nah, feel free to :)
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Add system identifier to backup manifest

2024-03-06 Thread Amul Sul
On Thu, Mar 7, 2024 at 9:37 AM Michael Paquier  wrote:

> On Wed, Mar 06, 2024 at 11:05:36AM -0500, Robert Haas wrote:
> > So with that in mind, here's my proposal. This is an adjustment of
> > Amit's previous refactoring patch. He renamed the existing
> > get_controlfile() to get_dir_controlfile() and made a new
> > get_controlfile() that accepted the path to the control file itself. I
> > chose to instead leave the existing get_controlfile() alone and add a
> > new get_controlfile_by_exact_path(). I think this is better, because
> > most of the existing callers find it more convenient to pass the path
> > to the data directory rather than the path to the controlfile, so the
> > patch is smaller this way, and less prone to cause headaches for
> > people back-patching or maintaining out-of-core code. But it still
> > gives us a way to avoid repeatedly constructing the same pathname.
>
> Yes, that was my primary concern with the previous versions of the
> patch.
>
> > If nobody objects, I plan to commit this version.
>
> You are not changing silently the internals of get_controlfile(), so
> no objections here.  The name of the new routine could be shorter, but
> being short of ideas what you are proposing looks fine by me.
>

Could be get_controlfile_by_path() ?

Regards,
Amul


Re: remaining sql/json patches

2024-03-06 Thread jian he
On Tue, Mar 5, 2024 at 12:38 PM Andy Fan  wrote:
>
>
> In the commit message of 0001, we have:
>
> """
> Both JSON_VALUE() and JSON_QUERY() functions have options for
> handling EMPTY and ERROR conditions, which can be used to specify
> the behavior when no values are matched and when an error occurs
> during evaluation, respectively.
>
> All of these functions only operate on jsonb values. The workaround
> for now is to cast the argument to jsonb.
> """
>
> which makes it unclear to me why we introduce the JSON_VALUE() function; is
> it for handling EMPTY or ERROR conditions? I think the existing cast
> workaround has a similar capability?
>

I guess because it's in the standard,
but I don't see an individual SQL standard identifier for JSON_VALUE in
sql_features.txt; I do see JSON_QUERY.
MySQL also has JSON_VALUE. [1]

EMPTY, ERROR: there is a standard Identifier: T825: SQL/JSON: ON EMPTY
and ON ERROR clauses

[1] 
https://dev.mysql.com/doc/refman/8.0/en/json-search-functions.html#function_json-value




Re: Add system identifier to backup manifest

2024-03-06 Thread Michael Paquier
On Wed, Mar 06, 2024 at 11:05:36AM -0500, Robert Haas wrote:
> So with that in mind, here's my proposal. This is an adjustment of
> Amit's previous refactoring patch. He renamed the existing
> get_controlfile() to get_dir_controlfile() and made a new
> get_controlfile() that accepted the path to the control file itself. I
> chose to instead leave the existing get_controlfile() alone and add a
> new get_controlfile_by_exact_path(). I think this is better, because
> most of the existing callers find it more convenient to pass the path
> to the data directory rather than the path to the controlfile, so the
> patch is smaller this way, and less prone to cause headaches for
> people back-patching or maintaining out-of-core code. But it still
> gives us a way to avoid repeatedly constructing the same pathname.

Yes, that was my primary concern with the previous versions of the
patch.

> If nobody objects, I plan to commit this version.

You are not changing silently the internals of get_controlfile(), so
no objections here.  The name of the new routine could be shorter, but
being short of ideas what you are proposing looks fine by me.
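
For readers who haven't opened the patch, the split being discussed is
presumably along these lines (a sketch; the exact signatures are my
assumption, not quoted from the patch):

    /* Existing entry point: builds "<DataDir>/global/pg_control" internally. */
    ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p);

    /* New entry point: the caller passes the already-constructed path to the
     * control file itself, so it can be reused without rebuilding the path. */
    ControlFileData *get_controlfile_by_exact_path(const char *ControlFilePath,
                                                   bool *crc_ok_p);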
--
Michael


signature.asc
Description: PGP signature


Re: Add new error_action COPY ON_ERROR "log"

2024-03-06 Thread Michael Paquier
On Wed, Mar 06, 2024 at 07:32:28PM +0530, Bharath Rupireddy wrote:
> Please see the attached v4 patch. If it looks good, I can pull
> LOG_VERBOSITY changes out into 0001 and with 0002 containing the
> detailed messages for discarded rows.

The approach looks sensible seen from here.

+LOG_VERBOSITY [ boolean ]
[...]
+LOG_VERBOSITY
+
+ 
+  Sets the verbosity of logged messages by COPY
+  command. As an example, see its usage for
+  COPY FROM command's ON_ERROR
+  clause with ignore option.
+ 
+

Is a boolean the best interface for the end-user, though?  Maybe
something like a "mode" value would speak more than a yes/no from the
start, say a "default" mode to emit only the last LOG and a "verbose"
for the whole set in the case of ON_ERROR?  That could use an enum
from the start internally, but that's an implementation detail.

Describing what gets logged in the paragraph of ON_ERROR sounds fine,
especially if in the future more logs are added depending on other
options.  That's an assumption at this stage, of course.

I am adding Alexander Korotkov in CC, as the original committer of
9e2d8701194f, as I assume that he may want to chime in this
discussion.

Torikoshi-san or others, if you have any comments about the interface,
feel free.
--
Michael


signature.asc
Description: PGP signature


Re: Synchronizing slots from primary to standby

2024-03-06 Thread Masahiko Sawada
On Wed, Mar 6, 2024 at 5:53 PM Amit Kapila  wrote:
>
> On Wed, Mar 6, 2024 at 12:07 PM Masahiko Sawada  wrote:
> >
> > On Fri, Mar 1, 2024 at 3:22 PM Peter Smith  wrote:
> > >
> > > On Fri, Mar 1, 2024 at 5:11 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > ...
> > > > +/*
> > > > + * "*" is not accepted as in that case primary will not be 
> > > > able to know
> > > > + * for which all standbys to wait for. Even if we have 
> > > > physical slots
> > > > + * info, there is no way to confirm whether there is any 
> > > > standby
> > > > + * configured for the known physical slots.
> > > > + */
> > > > +if (strcmp(*newval, "*") == 0)
> > > > +{
> > > > +GUC_check_errdetail("\"*\" is not accepted for
> > > > standby_slot_names");
> > > > +return false;
> > > > +}
> > > >
> > > > Why only '*' is checked aside from validate_standby_slots()? I think
> > > > that the doc doesn't mention anything about '*' and '*' cannot be used
> > > > as a replication slot name. So even if we don't have this check, it
> > > > might be no problem.
> > > >
> > >
> > > Hi, a while ago I asked this same question. See [1 #28] for the response..
> >
> > Thanks. Quoting the response from the email:
> >
> > SplitIdentifierString() does not give error for '*' and '*' can be 
> > considered
> > as valid value which if accepted can mislead user that all the standbys's 
> > slots
> > are now considered, which is not the case here. So we want to explicitly 
> > call
> > out this case i.e. '*' is not accepted as valid value for 
> > standby_slot_names.
> >
> > IIUC we're concerned with a case where the user confused
> > standby_slot_names values with synchronous_standby_names values. Which
> > means we would need to keep that check consistent with the available
> > values of synchronous_standby_names.
> >
>
> Both have different formats to specify. For example, for
> synchronous_standby_names we have the following kind of syntax to
> specify:
> [FIRST] num_sync ( standby_name [, ...] )
> ANY num_sync ( standby_name [, ...] )
> standby_name [, ...]
>
> I don't think we can have a common check for both of them as the
> specifications are different. In fact, I don't think we need a special
> check for '*'.

I think so too.

> The user will anyway get a WARNING at a later point
> that the replication slot with that name doesn't exist.

Right.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-06 Thread Masahiko Sawada
On Tue, Mar 5, 2024 at 3:28 PM Peter Smith  wrote:
>
> Hi, here are some review comments for v7-0002
>
> ==
> Commit Message
>
> 1.
> This commit adds a hash table to binaryheap in order to track of
> positions of each nodes in the binaryheap. That way, by using newly
> added functions such as binaryheap_update_up() etc., both updating a
> key and removing a node can be done in O(1) on an average and O(log n)
> in worst case. This is known as the indexed binary heap. The caller
> can specify to use the indexed binaryheap by passing indexed = true.
>
> ~
>
> /to track of positions of each nodes/to track the position of each node/
>
> ~~~
>
> 2.
> There is no user of it but it will be used by a upcoming patch.
>
> ~
>
> The current code does not use the new indexing logic, but it will be
> used by an upcoming patch.

Fixed.

>
> ==
> src/common/binaryheap.c
>
> 3.
> +/*
> + * Define parameters for hash table code generation. The interface is *also*"
> + * declared in binaryheaph.h (to generate the types, which are externally
> + * visible).
> + */
>
> Typo: *also*"

Fixed.

>
> ~~~
>
> 4.
> +#define SH_PREFIX bh_nodeidx
> +#define SH_ELEMENT_TYPE bh_nodeidx_entry
> +#define SH_KEY_TYPE bh_node_type
> +#define SH_KEY key
> +#define SH_HASH_KEY(tb, key) \
> + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
> +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
> +#define SH_SCOPE extern
> +#ifdef FRONTEND
> +#define SH_RAW_ALLOCATOR pg_malloc0
> +#endif
> +#define SH_DEFINE
> +#include "lib/simplehash.h"
>
> 4a.
> The comment in simplehash.h says
>  *   The following parameters are only relevant when SH_DEFINE is defined:
>  *   - SH_KEY - ...
>  *   - SH_EQUAL(table, a, b) - ...
>  *   - SH_HASH_KEY(table, key) - ...
>  *   - SH_STORE_HASH - ...
>  *   - SH_GET_HASH(tb, a) - ...
>
> So maybe it is nicer to reorder the #defines in that same order?
>
> SUGGESTION:
> +#define SH_PREFIX bh_nodeidx
> +#define SH_ELEMENT_TYPE bh_nodeidx_entry
> +#define SH_KEY_TYPE bh_node_type
> +#define SH_SCOPE extern
> +#ifdef FRONTEND
> +#define SH_RAW_ALLOCATOR pg_malloc0
> +#endif
>
> +#define SH_DEFINE
> +#define SH_KEY key
> +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
> +#define SH_HASH_KEY(tb, key) \
> + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
> +#include "lib/simplehash.h"

I'm really not sure it helps increase readability. For instance, to me it's
more readable if SH_DEFINE and SH_DECLARE come last, right before the
#include, since that makes it more obvious whether we want to declare,
define, or both. Other simplehash.h users do the same.

>
> ~~
>
> 4b.
> The comment in simplehash.h says that "it's preferable, if possible,
> to store the element's hash in the element's data type", so should
> SH_STORE_HASH and SH_GET_HASH also be defined here?

Good catch. I've used these macros.
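
For readers following along, the stored-hash variant could look roughly like
the following; the extra 'hash' and 'status' fields and their exact names are
assumptions on my part, not necessarily what the next patch version will use:

    /* simplehash requires a 'status' member; 'hash' caches the computed
     * hash value once SH_STORE_HASH is defined. */
    typedef struct bh_nodeidx_entry
    {
        bh_node_type key;       /* heap node */
        int          index;     /* position of the node in bh_nodes[] */
        char         status;    /* hash table entry status */
        uint32       hash;      /* cached hash value */
    } bh_nodeidx_entry;

    #define SH_PREFIX bh_nodeidx
    #define SH_ELEMENT_TYPE bh_nodeidx_entry
    #define SH_KEY_TYPE bh_node_type
    #define SH_KEY key
    #define SH_HASH_KEY(tb, key) \
        hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
    #define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
    #define SH_STORE_HASH
    #define SH_GET_HASH(tb, a) a->hash
    #define SH_SCOPE extern
    #ifdef FRONTEND
    #define SH_RAW_ALLOCATOR pg_malloc0
    #endif
    #define SH_DEFINE
    #include "lib/simplehash.h"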

>
> ~~~
>
> 5.
> + *
> + * If 'indexed' is true, we create a hash table to track of each node's
> + * index in the heap, enabling to perform some operations such as removing
> + * the node from the heap.
>   */
>  binaryheap *
> -binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
> +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> + bool indexed, void *arg)
>
> BEFORE
> ... enabling to perform some operations such as removing the node from the 
> heap.
>
> SUGGESTION
> ... to help make operations such as removing nodes more efficient.
>

But these operations literally require the indexed binary heap as we
have an assertion:

void
binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
{
bh_nodeidx_entry *ent;

Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
Assert(heap->bh_indexed);

> ~~~
>
> 6.
> + heap->bh_indexed = indexed;
> + if (heap->bh_indexed)
> + {
> +#ifdef FRONTEND
> + heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
> +#else
> + heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
> + NULL);
> +#endif
> + }
> +
>
> The heap allocation just uses palloc instead of palloc0 so it might be
> better to assign "heap->bh_nodeidx = NULL;" up-front, just so you will
> never get a situation where bh_indexed is false but bh_nodeidx has
> some (garbage) value.

Fixed.

>
> ~~~
>
> 7.
> +/*
> + * Set the given node at the 'index', and updates its position accordingly.
> + *
> + * Return true if the node's index is already tracked.
> + */
> +static bool
> +bh_set_node(binaryheap *heap, bh_node_type node, int index)
>
> 7a.
> I felt the 1st sentence should be more like:
>
> SUGGESTION
> Set the given node at the 'index' and track it if required.

Fixed.

>
> ~
>
> 7b.
> IMO the parameters would be better the other way around (e.g. 'index'
> before the 'node') because that's what the assignments look like:
>
>
> heap->bh_nodes[heap->bh_size] = d;
>
> becomes:
> bh_set_node(heap, heap->bh_size, d);
>

I think it assumes heap->bh_nodes is an array. But if we change it in
the future, it will no 

Re: Improve eviction algorithm in ReorderBuffer

2024-03-06 Thread Masahiko Sawada
On Tue, Mar 5, 2024 at 12:20 PM vignesh C  wrote:
>
> On Wed, 28 Feb 2024 at 11:40, Amit Kapila  wrote:
> >
> > On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada  
> > wrote:
> > >
> >
> > A few comments on 0003:
> > ===
> > 1.
> > +/*
> > + * Threshold of the total number of top-level and sub transactions
> > that controls
> > + * whether we switch the memory track state. While the MAINTAIN_HEAP state 
> > is
> > + * effective when there are many transactions being decoded, in many 
> > systems
> > + * there is generally no need to use it as long as all transactions
> > being decoded
> > + * are top-level transactions. Therefore, we use MaxConnections as
> > the threshold
> > + * so we can prevent switch to the state unless we use subtransactions.
> > + */
> > +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections
> >
> > The comment seems to imply that MAINTAIN_HEAP is useful for large
> > number of transactions but ReorderBufferLargestTXN() switches to this
> > state even when there is one transaction. So, basically we use the
> > binary_heap technique to get the largest even when we have one
> > transaction but we don't maintain that heap unless we have
> > REORDER_BUFFER_MEM_TRACK_THRESHOLD number of transactions are
> > in-progress. This means there is some additional work when (build and
> > reset heap each time when we pick largest xact) we have fewer
> > transactions in the system but that may not be impacting us because of
> > other costs involved like serializing all the changes. I think once we
> > can try to stress test this by setting
> > debug_logical_replication_streaming to 'immediate' to see if the new
> > mechanism has any overhead.
>
> I ran the test with a transaction having many inserts:
>
>       | 5000  | 10000 | 20000 | 100000 | 1000000 | 10000000
> ------|-------|-------|-------|--------|---------|----------
> Head  | 26.31 | 48.84 | 93.65 | 480.05 | 4808.29 | 47020.16
> Patch | 26.35 | 50.8  | 97.99 | 484.8  | 4856.95 | 48108.89
>
> The same test with debug_logical_replication_streaming = 'immediate'
>
>       | 5000  | 10000  | 20000  | 100000  | 1000000  | 10000000
> ------|-------|--------|--------|---------|----------|-----------
> Head  | 59.29 | 115.84 | 227.21 | 1156.08 | 11367.42 | 113986.14
> Patch | 62.45 | 120.48 | 240.56 | 1185.12 | 11855.37 | 119921.81
>
> The execution time is in milliseconds. The column header indicates the
> number of inserts in the transaction.
> In this case I noticed that the test execution with patch was taking
> slightly more time.
>

Thank you for testing! With 10M records, I can see 2% regression in
the 'buffered' case and 5% regression in the 'immediate' case.

I think that in general it makes sense to postpone using a max-heap
until the number of transactions is higher than the threshold. I've
implemented this idea and here are the results on my environment (with
10M records and debug_logical_replication_streaming = 'immediate'):

HEAD:
68937.887 ms
69450.174 ms
68808.248 ms

v7 patch:
71280.783 ms
71673.101 ms
71330.898 ms

v8 patch:
68918.259 ms
68822.330 ms
68972.452 ms

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v8-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v8-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch
Description: Binary data


v8-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-03-06 Thread shveta malik
On Wed, Mar 6, 2024 at 6:54 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, March 6, 2024 9:13 PM Zhijie Hou (Fujitsu) 
>  wrote:
> >
> > On Wednesday, March 6, 2024 11:04 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > On Wednesday, March 6, 2024 9:30 AM Masahiko Sawada
> > >  wrote:
> > >
> > > Hi,
> > >
> > > > On Fri, Mar 1, 2024 at 4:21 PM Zhijie Hou (Fujitsu)
> > > > 
> > > > wrote:
> > > > >
> > > > > On Friday, March 1, 2024 2:11 PM Masahiko Sawada
> > > >  wrote:
> > > > > >
> > > > > >
> > > > > > ---
> > > > > > +void
> > > > > > +assign_standby_slot_names(const char *newval, void *extra) {
> > > > > > +List  *standby_slots;
> > > > > > +MemoryContext oldcxt;
> > > > > > +char  *standby_slot_names_cpy = extra;
> > > > > > +
> > > > > >
> > > > > > Given that the newval and extra have the same data
> > > > > > (standby_slot_names value), why do we not use newval instead? I
> > > > > > think that if we use newval, we don't need to guc_strdup() in
> > > > > > check_standby_slot_names(), we might need to do list_copy_deep()
> > > > > > instead, though. It's not clear to me as there is no comment.
> > > > >
> > > > > I think SplitIdentifierString will modify the passed in string, so
> > > > > we'd better not pass the newval to it, otherwise the stored guc
> > > > > string(standby_slot_names) will be changed. I can see we are doing
> > > > > similar thing in other GUC check/assign function as well.
> > > > > (check_wal_consistency_checking/ assign_wal_consistency_checking,
> > > > > check_createrole_self_grant/ assign_createrole_self_grant ...).
> > > >
> > > > Why does it have to be a List in the first place?
> > >
> > > I thought the List type is convenient to use here, as we have existing
> > > list build function(SplitIdentifierString), and have convenient list
> > > macro to loop the
> > > list(foreach_ptr) which can save some codes.
> > >
> > > > In earlier version patches, we
> > > > used to copy the list and delete the element until it became empty,
> > > > while waiting for physical wal senders. But we now just refer to
> > > > each slot name in the list. The current code assumes that
> > > > stnadby_slot_names_cpy is allocated in GUCMemoryContext but once it
> > > > changes, it will silently get broken. I think we can check and
> > > > assign standby_slot_names in a similar way to
> > > > check/assign_temp_tablespaces and
> > check/assign_synchronous_standby_names.
> > >
> > > Yes, we could do follow it by allocating an array and copy each slot
> > > name into it, but it also requires some codes to build and scan the
> > > array. So, is it possible to expose the GucMemorycontext or have an API 
> > > like
> > guc_copy_list instead ?
> > > If we don't want to touch the guc api, I am ok with using an array as 
> > > well.
> >
> > I rethink about this and realize that it's not good to do the memory 
> > allocation in
> > assign hook function. As the "src/backend/utils/misc/README" said, we'd
> > better do that in check hook function and pass it via extra to assign hook
> > function. And thus array is a good choice in this case rather than a List 
> > which
> > cannot be passed to *extra.
> >
> > Here is the V107 patch set which parse and cache the standby slot names in 
> > an
> > array instead of a List.
>
> The patch needs to be rebased due to recent commit.
>
> Attach the V107_2 path set. There are no code changes in this version.

 The patch needed to be rebased due to a recent commit. Attached
v107_3, there are no code changes in this version.

thanks
Shveta


v107_3-0002-Document-the-steps-to-check-if-the-standby-is.patch
Description: Binary data


v107_3-0001-Allow-logical-walsenders-to-wait-for-the-phys.patch
Description: Binary data


Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-06 Thread Melanie Plageman
On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote:
> On 27/02/2024 21:47, Melanie Plageman wrote:
> > The attached v5 has some simplifications when compared to v4 but takes
> > largely the same approach.
> > 
> > 0001-0004 are refactoring
> 
> I'm looking at just these 0001-0004 patches for now. I like those changes a
> lot for the sake of readablity even without any of the later patches.

Thanks! And thanks so much for the review!

I've done a small performance experiment comparing a branch with all of
the patches applied (your v6 0001-0009) with master. I made an 11 GB
table that has 1,394,328 blocks. For setup, I vacuumed it to update the
VM and made sure it was entirely in shared buffers. All of this was to
make sure all of the blocks would be skipped and we spend the majority
of the time spinning through the lazy_scan_heap() code. Then I ran
vacuum again (the actual test). I saw vacuum go from 13 ms to 10 ms
with the patches applied.

I think I need to do some profiling to see if the difference is actually
due to our code changes, but I thought I would share preliminary
results.

> I made some further changes. I kept them as separate commits for easier
> review, see the commit messages for details. Any thoughts on those changes?

I've given some inline feedback on most of the extra patches you added.
Short answer is they all seem fine to me, except I have reservations
about 0008 because of the number of blkno variables flying around. I
didn't have a chance to rebase these into my existing changes today, so
either I will do it tomorrow or, if you are feeling like you're on a
roll and want to do it, that also works!

> I feel heap_vac_scan_get_next_block() function could use some love. Maybe
> just some rewording of the comments, or maybe some other refactoring; not
> sure. But I'm pretty happy with the function signature and how it's called.

I was wondering if we should remove the "get" and just go with
heap_vac_scan_next_block(). I didn't do that originally because I didn't
want to imply that the next block was literally the sequentially next
block, but I think maybe I was overthinking it.

Another idea is to call it heap_scan_vac_next_block() and then the order
of the words is more like the table AM functions that get the next block
(e.g. heapam_scan_bitmap_next_block()). Though maybe we don't want it to
be too similar to those since this isn't a table AM callback.

As for other refactoring and other rewording of comments and such, I
will take a pass at this tomorrow.

> BTW, do we have tests that would fail if we botched up
> heap_vac_scan_get_next_block() so that it would skip pages incorrectly, for
> example? Not asking you to write them for this patch, but I'm just
> wondering.

So, while developing this, when I messed up and skipped blocks I
shouldn't, vacuum would error out with the "found xmin from before
relfrozenxid" error -- which would cause random tests to fail. I know
that's not a correctly failing test of this code. I think there might be
some tests in the verify_heapam tests that could/do test this kind of
thing but I don't remember them failing for me during development -- so
I didn't spend much time looking at them.

I would also sometimes get freespace or VM tests that would fail because
those blocks that are incorrectly skipped were meant to be reflected in
the FSM or VM in those tests.

All of that is to say, perhaps we should write a more targeted test?

When I was writing the code, I added logging of skipped blocks and then
came up with different scenarios and ran them on master and with the
patch and diffed the logs.
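
For anyone wanting to repeat that comparison, the instrumentation can be as
small as a single elog() in the skipping path -- a hypothetical debugging
aid, not part of the patch set, and the variable names here are assumptions:

    elog(DEBUG2, "relation \"%s\": skipping blocks %u..%u",
         vacrel->relname, blkno, next_unskippable_block - 1);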

> From b4047b941182af0643838fde056c298d5cc3ae32 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Wed, 6 Mar 2024 20:13:42 +0200
> Subject: [PATCH v6 5/9] Remove unused 'skipping_current_range' field
> 
> ---
>  src/backend/access/heap/vacuumlazy.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/src/backend/access/heap/vacuumlazy.c 
> b/src/backend/access/heap/vacuumlazy.c
> index 65d257aab83..51391870bf3 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -217,8 +217,6 @@ typedef struct LVRelState
>   Buffer  vmbuffer;
>   /* Next unskippable block's visibility status */
>   boolnext_unskippable_allvis;
> - /* Whether or not skippable blocks should be skipped */
> - boolskipping_current_range;
>   }   skip;
>  } LVRelState;
>  
> -- 
> 2.39.2
> 

Oops! I thought I removed this. I must have forgotten

> From 27e431e8dc69bbf09d831cb1cf2903d16f177d74 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Wed, 6 Mar 2024 20:58:57 +0200
> Subject: [PATCH v6 6/9] Move vmbuffer back to a local varible in
>  lazy_scan_heap()
> 
> It felt confusing that we passed around the current block, 'blkno', as
> an argument to lazy_scan_new_or_empty() and 

Re: remaining sql/json patches

2024-03-06 Thread jian he
On Wed, Mar 6, 2024 at 9:22 PM jian he  wrote:
>
> Another case I didn't test yet: more keys in a single json, but the
> value is small.

Another case attached. see the attached SQL file's comments.
a single simple jsonb, with 33 keys, each key's value with fixed length: 256.
total table size: SELECT pg_size_pretty(pg_table_size('json33keys'));  --5369 MB
number of rows: 61.

using the previously mentioned method: pg_log_backend_memory_contexts.
all these tests under:
-Dcassert=true \
-Db_coverage=true \
-Dbuildtype=debug \

I hope someone will tell me if the test method is correct or not.


jsonb_33keys.sql
Description: application/sql


Re: "type with xxxx does not exist" when doing ExecMemoize()

2024-03-06 Thread Tender Wang
Andrei Lepikhov wrote on Wed, Mar 6, 2024 at 11:37:

> I think, it is a bug. Should it be fixed (and back-patched) earlier?
>

Agreed.   Need David to review it as he knows this area best.


-- 
Tender Wang
OpenPie:  https://en.openpie.com/


Re: Synchronizing slots from primary to standby

2024-03-06 Thread Peter Smith
Here are some review comments for v107-0001

==
src/backend/replication/slot.c

1.
+/*
+ * Struct for the configuration of standby_slot_names.
+ *
+ * Note: this must be a flat representation that can be held in a single chunk
+ * of guc_malloc'd memory, so that it can be stored as the "extra" data for the
+ * standby_slot_names GUC.
+ */
+typedef struct
+{
+ int slot_num;
+
+ /* slot_names contains nmembers consecutive nul-terminated C strings */
+ char slot_names[FLEXIBLE_ARRAY_MEMBER];
+} StandbySlotConfigData;
+

1a.
To avoid any ambiguity that this 1st field is somehow a slot ID number, I
felt a better name would be 'nslotnames' or even just 'n' or 'count',

~

1b.
(fix typo)

SUGGESTION for the 2nd field comment
slot_names is a chunk of 'n' X consecutive null-terminated C strings

~

1c.
Maybe a more explanatory name for this typedef is 'StandbySlotNamesConfigData'?

~~~


2.
+/* This is parsed and cached configuration for standby_slot_names */
+static StandbySlotConfigData *standby_slot_config;


2a.
/This is parsed and cached configuration for .../This is the parsed
and cached configuration for .../

~

2b.
Similar to above -- since this only has name information maybe it is
more correct to call it 'standby_slot_names_config'?

~~~

3.
+/*
+ * A helper function to validate slots specified in GUC standby_slot_names.
+ *
+ * The rawname will be parsed, and the parsed result will be saved into
+ * *elemlist.
+ */
+static bool
+validate_standby_slots(char *rawname, List **elemlist)

/and the parsed result/and the result/

~~~

4. check_standby_slot_names

+ /* Need a modifiable copy of string */
+ rawname = pstrdup(*newval);

/copy of string/copy of the GUC string/

~~~

5.
+assign_standby_slot_names(const char *newval, void *extra)
+{
+ /*
+ * The standby slots may have changed, so we must recompute the oldest
+ * LSN.
+ */
+ ss_oldest_flush_lsn = InvalidXLogRecPtr;
+
+ standby_slot_config = (StandbySlotConfigData *) extra;
+}

To avoid leaking don't we need to somewhere take care to free any
memory used by a previous value (if any) of this
'standby_slot_config'?

~~~

6. AcquiredStandbySlot

+/*
+ * Return true if the currently acquired slot is specified in
+ * standby_slot_names GUC; otherwise, return false.
+ */
+bool
+AcquiredStandbySlot(void)
+{
+ const char *name;
+
+ /* Return false if there is no value in standby_slot_names */
+ if (standby_slot_config == NULL)
+ return false;
+
+ name = standby_slot_config->slot_names;
+ for (int i = 0; i < standby_slot_config->slot_num; i++)
+ {
+ if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
+ return true;
+
+ name += strlen(name) + 1;
+ }
+
+ return false;
+}

6a.
Just checking "(standby_slot_config == NULL)" doesn't seem enough to
me, because IIUIC it is possible that, when 'standby_slot_names' has no
value, standby_slot_config is not NULL but
standby_slot_config->slot_num is 0.

~

6b.
IMO this function would be tidier written such that the
MyReplicationSlot->data.name is passed as a parameter. Then you can
name the function more naturally like:

IsSlotInStandbySlotNames(const char *slot_name)

~

6c.
IMO the body of the function will be tidier if written so there are
only 2 returns instead of 3 like

SUGGESTION:
if (...)
{
  for (...)
  {
 ...
return true;
  }
}
return false;
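
Putting 6a-6c together, the function could end up looking something like this
(a sketch built only from the quoted patch; the config field names come from
the patch, the rest is an assumption):

    bool
    IsSlotInStandbySlotNames(const char *slot_name)
    {
        if (standby_slot_config != NULL && standby_slot_config->slot_num > 0)
        {
            const char *name = standby_slot_config->slot_names;

            for (int i = 0; i < standby_slot_config->slot_num; i++)
            {
                if (strcmp(name, slot_name) == 0)
                    return true;

                name += strlen(name) + 1;
            }
        }

        return false;
    }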

~~~

7.
+ /*
+ * Don't need to wait for the standbys to catch up if there is no value in
+ * standby_slot_names.
+ */
+ if (standby_slot_config == NULL)
+ return true;

(similar to a previous review comment)

This check doesn't seem enough because IIUIC it is possible when
'standby_slot_names' has no value then maybe standby_slot_config is
not NULL but standby_slot_config->slot_num is 0.

~~~

8. WaitForStandbyConfirmation

+ /*
+ * Don't need to wait for the standby to catch up if the current acquired
+ * slot is not a logical failover slot, or there is no value in
+ * standby_slot_names.
+ */
+ if (!MyReplicationSlot->data.failover || !standby_slot_config)
+ return;

(similar to a previous review comment)

IIUIC it is possible that when 'standby_slot_names' has no value, then
standby_slot_config is not NULL but standby_slot_config->slot_num is
0. So shouldn't that be checked too?

Perhaps it is convenient to encapsulate this check using some macro:
#define StandbySlotNamesHasNoValue() (standby_slot_config == NULL || \
    standby_slot_config->slot_num == 0)

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

2024-03-06 Thread Michael Paquier
On Wed, Mar 06, 2024 at 05:45:56PM +, Bertrand Drouvot wrote:
> Thank you both for the report! I did a few tests manually and can see the issue
> from time to time. When the issue occurs, the logical decoding was able to
> go through the place where LogicalConfirmReceivedLocation() updates the
> slot's catalog_xmin before being killed. In that case I can see that the
> catalog_xmin is updated to the xid horizon.

Two buildfarm machines have complained here, and one of them twice in
a row.  That's quite amazing, because a couple of dozen runs done in a
row on the same host as these animals all pass.  The CI did not
complain either (did 2~3 runs there yesterday).

> Means in a failed test we have something like:
> 
> slot's catalog_xmin: 839 and "The slot conflicted with xid horizon 839."
>
> While when the test is ok you'll see something like:
> 
> slot's catalog_xmin: 841 and "The slot conflicted with xid horizon 842."

Perhaps we should also make the test report the catalog_xmin of the
slot.  That may make debugging a bit easier.

> In the failing test the call to SELECT pg_logical_slot_get_changes() does
> not advance the slot's catalog xmin anymore.

Is that something that we could enforce in the test in a stronger way,
cross-checking the xmin horizon before and after the call?

> To fix this, I think we need a new transaction to decode from the primary
> before executing pg_logical_slot_get_changes(). But this transaction has to
> be replayed on the standby first by the startup process. Which means we need
> to wake up "terminate-process-holding-slot", and that we probably need
> another injection point somewhere in this test.
>
> I'll look at it unless you've another idea?

I am wondering if there is something else lurking here, actually, so
for now I am going to revert the change as it is annoying to get
sporadic failures in the CF bot at this time of the year and there are
a lot of patches under discussion.  Let's give it more time and more
thoughts, without pressure.
--
Michael


signature.asc
Description: PGP signature


Re: Combine headerscheck and cpluspluscheck scripts

2024-03-06 Thread Thomas Munro
+1




Re: Improve eviction algorithm in ReorderBuffer

2024-03-06 Thread Masahiko Sawada
On Tue, Mar 5, 2024 at 11:25 AM Peter Smith  wrote:
>
> Hi, Here are some review comments for v7-0001

Thank you for reviewing the patch.

>
> 1.
> /*
>  * binaryheap_free
>  *
>  * Releases memory used by the given binaryheap.
>  */
> void
> binaryheap_free(binaryheap *heap)
> {
> pfree(heap);
> }
>
>
> Shouldn't the above function (not modified by the patch) also firstly
> free the memory allocated for the heap->bh_nodes?
>
> ~~~
>
> 2.
> +/*
> + * Make sure there is enough space for nodes.
> + */
> +static void
> +bh_enlarge_node_array(binaryheap *heap)
> +{
> + heap->bh_space *= 2;
> + heap->bh_nodes = repalloc(heap->bh_nodes,
> +   sizeof(bh_node_type) * heap->bh_space);
> +}
>
> Strictly speaking, this function doesn't really "Make sure" of
> anything because the caller does the check whether we need more space.
> All that happens here is allocating more space. Maybe this function
> comment should say something like "Double the space allocated for
> nodes."

Agreed with the above two points. I'll fix them in the next version patch.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




improve ssl error code, 2147483650

2024-03-06 Thread David Zhang

Hi Hackers,

When configuring SSL on the Postgres server side with the following 
information:


ssl = on
ssl_ca_file = 'root_ca.crt'
ssl_cert_file = 'server-cn-only.crt'
ssl_key_file = 'server-cn-only.key'

If a user makes a mistake, for example, accidentally using 'root_ca.crl' 
instead of 'root_ca.crt', Postgres will report an error like the one below:
FATAL:  could not load root certificate file "root_ca.crl": SSL error 
code 2147483650


Here, the error code 2147483650 is not very helpful for the user. The 
reason is that Postgres is missing the initial SSL file check before 
passing it to the OpenSSL API. For example:

	if (ssl_ca_file[0])
	{
		STACK_OF(X509_NAME) * root_cert_list;

		if (SSL_CTX_load_verify_locations(context, ssl_ca_file, NULL) != 1 ||
			(root_cert_list = SSL_load_client_CA_file(ssl_ca_file)) == NULL)
		{
			ereport(isServerStart ? FATAL : LOG,
					(errcode(ERRCODE_CONFIG_FILE_ERROR),
					 errmsg("could not load root certificate file \"%s\": %s",
							ssl_ca_file, SSLerrmessage(ERR_get_error()))));
			goto error;
		}

Here the SSL_CTX_load_verify_locations call fails because of a system error
("No such file or directory" in this case), and OpenSSL's
ERR_reason_error_string() returns NULL for system errors, so only the raw
error code can be reported:


const char *ERR_reason_error_string(unsigned long e)
{
    ERR_STRING_DATA d, *p = NULL;
    unsigned long l, r;

    if (!RUN_ONCE(&err_string_init, do_err_strings_init)) {
        return NULL;
    }

    /*
     * ERR_reason_error_string() can't safely return system error strings,
     * since openssl_strerror_r() needs a buffer for thread safety, and we
     * haven't got one that would serve any sensible purpose.
     */
    if (ERR_SYSTEM_ERROR(e))
        return NULL;

It would be better to perform a simple SSL file check before passing the 
SSL file to OpenSSL APIs so that the system error can be captured and a 
meaningful message provided to the end user.


Attached is a simple patch to help address this issue for ssl_ca_file, 
ssl_cert_file, and ssl_crl_file. With this patch, a similar test will 
return something like the message below:
FATAL:  could not access certificate file "root_ca.crl": No such file or 
directory


I believe this can help end users quickly realize the mistake.


Thank you,
David
From bb6264791aab6a3e8150704fc8f1c8774c27018d Mon Sep 17 00:00:00 2001
From: David Zhang 
Date: Wed, 6 Mar 2024 15:51:11 -0800
Subject: [PATCH] improve ssl files error code

---
 src/backend/libpq/be-secure-common.c  | 19 +++
 src/backend/libpq/be-secure-openssl.c | 10 ++
 src/include/libpq/libpq.h |  1 +
 3 files changed, 30 insertions(+)

diff --git a/src/backend/libpq/be-secure-common.c b/src/backend/libpq/be-secure-common.c
index 0582606192..01d567cbfc 100644
--- a/src/backend/libpq/be-secure-common.c
+++ b/src/backend/libpq/be-secure-common.c
@@ -102,7 +102,26 @@ error:
return len;
 }
 
+/*
+ * Check SSL certificate files.
+ */
+bool
+check_ssl_file(const char *ssl_file, bool isServerStart)
+{
+   int loglevel = isServerStart ? FATAL : LOG;
+   struct stat buf;
+
+   if (stat(ssl_file, &buf) != 0)
+   {
+   ereport(loglevel,
+   (errcode_for_file_access(),
+errmsg("could not access certificate file 
\"%s\": %m",
+   ssl_file)));
+   return false;
+   }
 
+   return true;
+}
 /*
  * Check permissions for SSL key files.
  */
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index e12b1cc9e3..c5d58545d9 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -144,6 +144,10 @@ be_tls_init(bool isServerStart)
/*
 * Load and verify server's certificate and private key
 */
+
+   if (!check_ssl_file(ssl_cert_file, isServerStart))
+   goto error;
+
if (SSL_CTX_use_certificate_chain_file(context, ssl_cert_file) != 1)
{
ereport(isServerStart ? FATAL : LOG,
@@ -297,6 +301,9 @@ be_tls_init(bool isServerStart)
{
STACK_OF(X509_NAME) * root_cert_list;
 
+   if (!check_ssl_file(ssl_ca_file, isServerStart))
+   goto error;
+
 	if (SSL_CTX_load_verify_locations(context, ssl_ca_file, NULL) != 1 ||
 		(root_cert_list = SSL_load_client_CA_file(ssl_ca_file)) == NULL)
{
@@ -336,6 +343,9 @@ be_tls_init(bool isServerStart)
{
X509_STORE *cvstore = SSL_CTX_get_cert_store(context);
 
+   if (ssl_crl_file[0] && !check_ssl_file(ssl_crl_file, isServerStart))
+   goto error;
+
if (cvstore)
{
/* Set the flags to check against the complete CRL 

Re: Stack overflow issue

2024-03-06 Thread Tom Lane
Alexander Korotkov  writes:
> Sorry for tediousness, but isn't pre-order a variation of depth-first order
> [1]?

To me, depth-first implies visiting children before parents.
Do I have the terminology wrong?

regards, tom lane




Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-06 Thread Melanie Plageman
On Tue, Feb 27, 2024 at 02:47:03PM -0500, Melanie Plageman wrote:
> On Mon, Jan 29, 2024 at 8:18 PM Melanie Plageman
>  wrote:
> >
> > On Fri, Jan 26, 2024 at 8:28 AM vignesh C  wrote:
> > >
> > > CFBot shows that the patch does not apply anymore as in [1]:
> > > === applying patch
> > > ./v3-0002-Add-lazy_scan_skip-unskippable-state-to-LVRelStat.patch
> > > patching file src/backend/access/heap/vacuumlazy.c
> > > ...
> > > Hunk #10 FAILED at 1042.
> > > Hunk #11 FAILED at 1121.
> > > Hunk #12 FAILED at 1132.
> > > Hunk #13 FAILED at 1161.
> > > Hunk #14 FAILED at 1172.
> > > Hunk #15 FAILED at 1194.
> > > ...
> > > 6 out of 21 hunks FAILED -- saving rejects to file
> > > src/backend/access/heap/vacuumlazy.c.rej
> > >
> > > Please post an updated version for the same.
> > >
> > > [1] - http://cfbot.cputube.org/patch_46_4755.log
> >
> > Fixed in attached rebased v4
> 
> In light of Thomas' update to the streaming read API [1], I have
> rebased and updated this patch set.
> 
> The attached v5 has some simplifications when compared to v4 but takes
> largely the same approach.

Attached is a patch set (v5a) which updates the streaming read user for
vacuum to fix an issue Andrey Borodin pointed out to me off-list.

Note that I started writing this email before seeing Heikki's upthread
review [1], so I will respond to that in a bit. There are no changes in
v5a to any of the prelim refactoring patches which Heikki reviewed in
that email. I only changed the vacuum streaming read users (last two
patches in the set).

Back to this patch set:
Andrey pointed out that it was failing to compile on windows and the
reason is that I had accidentally left an undefined variable "index" in
these places

Assert(index > 0);
...
ereport(DEBUG2,
        (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
                vacrel->relname, (long long) index, vacuumed_pages)));

See https://cirrus-ci.com/task/6312305361682432

I don't understand how this didn't warn me (or fail to compile) for an
assert build on my own workstation. It seems to think "index" is a
function?
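
A plausible explanation (my guess, not verified against that build): with
<strings.h> in scope, the bare identifier resolves to the legacy libc
function, so the expression still compiles:

    #include <strings.h>    /* declares the legacy: char *index(const char *, int) */

    static inline int
    demo(void)
    {
        /* With no local "index" in scope, the identifier names the libc
         * function, so this is compiled as an ordered comparison against a
         * function pointer (gcc/clang typically accept it, at most warning
         * at higher warning levels) instead of failing as undeclared. */
        return index > 0;
    }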

Anyway, thinking about what the correct assertion would be here:

Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
   (rbstate->end_idx == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));

I think I can just replace "index" with "rbstate->end_index". At the end
of reaping, this should have the same value that index would have had.
The issue with this is if pg_streaming_read_buffer_get_next() somehow
never returned a valid buffer (there were no dead items), then rbstate
would potentially be uninitialized. The old assertion (index > 0) would
only have been true if there were some dead items, but there isn't an
explicit assertion in this function that there were some dead items.
Perhaps it is worth adding this? Even if we add this, perhaps it is
unacceptable from a programming standpoint to use rbstate in that scope?

In addition to fixing this slip-up, I have done some performance testing
for streaming read vacuum. Note that these tests are for both vacuum
passes (1 and 2) using streaming read.

Performance results:

The TL;DR of my performance results is that streaming read vacuum is
faster. However there is an issue with the interaction of the streaming
read code and the vacuum buffer access strategy which must be addressed.

Note that "master" in the results below is actually just a commit on my
branch [2] before the one adding the vacuum streaming read users. So it
includes all of my refactoring of the vacuum code from the preliminary
patches.

I tested two vacuum "data states". Both are relatively small tables
because the impact of streaming read can easily be seen even at small
table sizes. DDL for both data states is at the end of the email.

The first data state is a 28 MB table which has never been vacuumed and
has one or two dead tuples on every block. All of the blocks have dead
tuples, so all of the blocks must be vacuumed. We'll call this the
"sequential" data state.

The second data state is a 67 MB table which has been vacuumed and then
a small percentage of the blocks (non-consecutive blocks at irregular
intervals) are updated afterward. Because the visibility map has been
updated and only a few blocks have dead tuples, large ranges of blocks
do not need to be vacuumed. There is at least one run of blocks with
dead tuples larger than 1 block but most of the blocks with dead tuples
are a single block followed by many blocks with no dead tuples. We'll
call this the "few" data state.

I tested these data states with "master" and with streaming read vacuum
with three caching options:

- table data fully in shared buffers (via pg_prewarm)
- table data in the kernel buffer cache but not in shared buffers
- table data completely uncached

I tested the OS cached and uncached caching options with 

Re: Stack overflow issue

2024-03-06 Thread Alexander Korotkov
On Thu, Mar 7, 2024 at 12:52 AM Tom Lane  wrote:

> Alexander Korotkov  writes:
> > The revised set of remaining patches is attached.
> > ...
> > 0003 Avoid recursion in MemoryContext functions
> > I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(),
> > which I think is a bit more intuitive.  Also I fixed
> > MemoryContextMemConsumed(), which was still trying to use the removed
> > argument "print" of MemoryContextStatsInternal() function.
>
> This patch still doesn't compile for me --- MemoryContextMemConsumed
> got modified some more by commit 743112a2e, and needs minor fixes.
>
> I initially didn't like the definition of MemoryContextTraverseNext
> because it requires two copies of the "process node" logic.  However,
> that seems fine for most of the callers, and even where we are
> duplicating logic it's just a line or so, so I guess it's ok.
> However, MemoryContextTraverseNext seems undercommented to me, plus
> the claim that it traverses in depth-first order is just wrong.
>
> I found some bugs in MemoryContextStatsInternal too: the old
> logic assumed that ichild exceeding max_children was the only
> way to get into the summarization logic, but now ichild minus
> max_children could very well be negative.  Fortunately we can
> just reset ichild to zero and not worry about having any
> connection between the first loop and the second.
>
> Here's a v5 of 0003 with those issues and some more-cosmetic ones
> cleaned up.  I didn't look at 0001 or 0002.
>

Tom, thank you for your revision of this patch!

Sorry to be tedious, but isn't pre-order a variation of depth-first order
[1]?

Links.
1. https://en.wikipedia.org/wiki/Tree_traversal#Depth-first_search

--
Regards,
Alexander Korotkov


Re: [PATCH] Exponential backoff for auth_delay

2024-03-06 Thread Jacob Champion
On Wed, Mar 6, 2024 at 2:45 PM Michael Banck  wrote:
> In order to at least make case 2 not worse for exponential backoff, we
> could maybe disable it (and just wait for auth_delay.milliseconds) once
> MAX_CONN_RECORDS is full. In addition, maybe MAX_CONN_RECORDS should be
> some fraction of max_connections, like 25%?

(Our mails crossed; hopefully I've addressed the other points.)

I think solutions for case 1 and case 2 are necessarily at odds under
the current design, if auth_delay relies on slot exhaustion to do its
work effectively. Weakening that on purpose doesn't make much sense to
me; if a DBA is uncomfortable with the DoS implications then I'd argue
they need a different solution. (Which we could theoretically
implement, but it's not my intention to sign you up for that. :D )

--Jacob




Re: Injection points: some tools to wait and wake

2024-03-06 Thread Michael Paquier
On Wed, Mar 06, 2024 at 10:19:41AM +0100, Jelte Fennema-Nio wrote:
> What I mainly meant is that anything in src/test/modules is not even
> close to somewhat useful for other people to use. They are really just
> very specific tests that need to be written in C. Afaict all those
> modules are not even used by tests outside of their own module. But
> these functions are helper functions, to be used by other tests. And
> limiting the users of those helper functions to just be in-core
> Postgres code seems a bit annoying. I feel like these functions are
> more akin to the pg_regress/isolationtester binaries in their usage,
> than akin to other modules in src/test/modules.

Perhaps.  I think that we're still in the discovery phase for this
stuff, and more people should get used to it first (this will take
some time and everybody is busy with their own stuff for the last
commit fest).  At least it does not seem good to rush any decisions at
this stage.

>> FWIW, it would be really annoying to have documentation
>> requirements, actually, because that increases maintenance and I'm not
>> sure it's a good idea to add a module maintenance on top of what could
>> require more facility in the module to implement a test for a bug fix.
> 
> Quite a few contrib modules have very limited documentation. I think
> it would be fine for this as well.

I'd argue that we should try to improve the existing documentation
rather than use that as an argument to add more modules with limited
documentation ;)

> Ugh... Sorry... I didn't realize that it needed a dedicated configure
> flag. When providing that flag it indeed installs the expected files.
> I guess that rules out testing against PGDG packages, because those
> packages almost certainly wouldn't specify this flag.

The CI enables the switch, and I've updated all my buildfarm members
to use it.  In terms of coverage, that's already quite good.
--
Michael




Re: Stack overflow issue

2024-03-06 Thread Tom Lane
Alexander Korotkov  writes:
> The revised set of remaining patches is attached.
> ...
> 0003 Avoid recursion in MemoryContext functions
> I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(),
> which I think is a bit more intuitive.  Also I fixed
> MemoryContextMemConsumed(), which was still trying to use the removed
> argument "print" of MemoryContextStatsInternal() function.

This patch still doesn't compile for me --- MemoryContextMemConsumed
got modified some more by commit 743112a2e, and needs minor fixes.

I initially didn't like the definition of MemoryContextTraverseNext
because it requires two copies of the "process node" logic.  However,
that seems fine for most of the callers, and even where we are
duplicating logic it's just a line or so, so I guess it's ok.
However, MemoryContextTraverseNext seems undercommented to me, plus
the claim that it traverses in depth-first order is just wrong.

I found some bugs in MemoryContextStatsInternal too: the old
logic assumed that ichild exceeding max_children was the only
way to get into the summarization logic, but now ichild minus
max_children could very well be negative.  Fortunately we can
just reset ichild to zero and not worry about having any
connection between the first loop and the second.

Here's a v5 of 0003 with those issues and some more-cosmetic ones
cleaned up.  I didn't look at 0001 or 0002.

regards, tom lane

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 1a615becae..5d426795d9 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -145,9 +145,10 @@ MemoryContext CurTransactionContext = NULL;
 /* This is a transient link to the active portal's memory context: */
 MemoryContext PortalContext = NULL;
 
+static void MemoryContextDeleteOnly(MemoryContext context);
 static void MemoryContextCallResetCallbacks(MemoryContext context);
 static void MemoryContextStatsInternal(MemoryContext context, int level,
-	   bool print, int max_children,
+	   int max_level, int max_children,
 	   MemoryContextCounters *totals,
 	   bool print_to_stderr);
 static void MemoryContextStatsPrint(MemoryContext context, void *passthru,
@@ -219,6 +220,50 @@ GetMemoryChunkHeader(const void *pointer)
 	return header;
 }
 
+/*
+ * MemoryContextTraverseNext
+ *		Helper function to traverse all descendants of a memory context
+ *		without recursion.
+ *
+ * Recursion could lead to out-of-stack errors with deep context hierarchies,
+ * which would be unpleasant in error cleanup code paths.
+ *
+ * To process 'context' and all its descendants, use a loop like this:
+ *
+ * <process 'context'>
+ * for (MemoryContext curr = context->firstchild;
+ *  curr != NULL;
+ *  curr = MemoryContextTraverseNext(curr, context))
+ * {
+ * <process 'curr'>
+ * }
+ *
+ * This visits all the contexts in pre-order, that is a node is visited
+ * before its children.
+ */
+static MemoryContext
+MemoryContextTraverseNext(MemoryContext curr, MemoryContext top)
+{
+	/* After processing a node, traverse to its first child if any */
+	if (curr->firstchild != NULL)
+		return curr->firstchild;
+
+	/*
+	 * After processing a childless node, traverse to its next sibling if
+	 * there is one.  If there isn't, traverse back up to the parent (which
+	 * has already been visited, and now so have all its descendants).  We're
+	 * done if that is "top", otherwise traverse to its next sibling if any,
+	 * otherwise repeat moving up.
+	 */
+	while (curr->nextchild == NULL)
+	{
+		curr = curr->parent;
+		if (curr == top)
+			return NULL;
+	}
+	return curr->nextchild;
+}
+
 /*
  * Support routines to trap use of invalid memory context method IDs
  * (from calling pfree or the like on a bogus pointer).  As a possible
@@ -375,14 +420,13 @@ MemoryContextResetOnly(MemoryContext context)
 void
 MemoryContextResetChildren(MemoryContext context)
 {
-	MemoryContext child;
-
 	Assert(MemoryContextIsValid(context));
 
-	for (child = context->firstchild; child != NULL; child = child->nextchild)
+	for (MemoryContext curr = context->firstchild;
+		 curr != NULL;
+		 curr = MemoryContextTraverseNext(curr, context))
 	{
-		MemoryContextResetChildren(child);
-		MemoryContextResetOnly(child);
+		MemoryContextResetOnly(curr);
 	}
 }
 
@@ -392,21 +436,60 @@ MemoryContextResetChildren(MemoryContext context)
  *		allocated therein.
  *
  * The type-specific delete routine removes all storage for the context,
- * but we have to recurse to handle the children.
- * We must also delink the context from its parent, if it has one.
+ * but we have to deal with descendant nodes here.
  */
 void
 MemoryContextDelete(MemoryContext context)
+{
+	MemoryContext curr;
+
+	Assert(MemoryContextIsValid(context));
+
+	/*
+	 * Delete subcontexts from the bottom up.
+	 *
+	 * Note: Do not use recursion here.  A "stack depth limit exceeded" error
+	 * would be unpleasant if we're already in the process of cleaning up 

Re: [PATCH] Exponential backoff for auth_delay

2024-03-06 Thread Jacob Champion
On Wed, Mar 6, 2024 at 12:34 PM Tomas Vondra
 wrote:
> Doesn't this mean this approach (as implemented) doesn't actually work
> as a protection against this sort of DoS?
>
> If I'm an attacker, and I can just keep opening new connections for each
> auth request, am I even subject to any auth delay?

s/DoS/brute-force/, but yeah, that's basically the question at the
heart of my comment. _If_ the point of auth_delay is to tie up
connection slots in order to degrade service during an attack, then I
think this addition works, in the sense that it makes the self-imposed
DoS more draconian the more failures occur.

But I don't know if that's actually the intended goal. For one, I'm
not convinced that the host tracking part buys you anything more in
practice, under that model. And if people are trying to *avoid* the
DoS somehow, then I just really don't understand the feature.

> ISTM the problem lies in the fact that we apply the delay only *after*
> the failed auth attempt. Which makes sense, because until now we didn't
> have any state with information for new connections. But with the new
> acr_array, we could check that first, and do the delay before trying to
> authenticate, no?

Yeah, I think one key point is to apply the delay to both successful
and failed connections. That probably opens up a bunch more questions,
though? It seems like a big change from the previous behavior. An
attacker can still front-load a bunch of connections in parallel. And
the end state of the working feature is probably still slot exhaustion
during an attack, so...

I looked around a bit at other designs. [1] is HTTP-focused, but it
talks about some design tradeoffs. I wonder if flipping the sense of
the host tracking [2], to keep track of well-behaved clients and let
them through the service degradation that we're applying to the
masses, might be more robust. But I don't know how to let those
clients punch through slot exhaustion without a lot more work.

--Jacob

[1] https://owasp.org/www-community/controls/Blocking_Brute_Force_Attacks
[2] https://owasp.org/www-community/Slow_Down_Online_Guessing_Attacks_with_Device_Cookies




Re: Hooking into ExplainOneQuery() complicated by missing standard_ExplainOneQuery

2024-03-06 Thread Michael Paquier
On Wed, Mar 06, 2024 at 10:32:01AM +0100, Jelte Fennema-Nio wrote:
> This patch would definitely be useful for Citus. We indeed currently
> copy all of that code into our own explain hook. And it seems we
> actually have some bug. Because the es->memory branches were not
> copied (probably because this code didn't exist when we copied it).
> 
> https://github.com/citusdata/citus/blob/d59c93bc504ad32621d66583de6b65f936b0ed13/src/backend/distributed/planner/multi_explain.c#L1248-L1289

That's nice.  You would be able to shave quite a bit of code.  If
there are no objections, I propose to apply the change of this thread
to have this standard explain wrapper at the beginning of next week.
If others have any comments, feel free.
--
Michael




Re: [PATCH] Exponential backoff for auth_delay

2024-03-06 Thread Michael Banck
Hi,

On Wed, Mar 06, 2024 at 09:34:37PM +0100, Tomas Vondra wrote:
> On 3/6/24 19:24, Jacob Champion wrote:
> > On Wed, Mar 6, 2024 at 8:10 AM Nathan Bossart  
> > wrote:
> >> Assuming you are referring to the case where we run out of free slots in
> >> acr_array, I'm not sure I see that as desirable behavior.  Once you run out
> >> of slots, all failed authentication attempts are now subject to the maximum
> >> delay, which is arguably a denial-of-service scenario, albeit not a
> >> particularly worrisome one.
> > 
> > Maybe I've misunderstood the attack vector, but I thought the point of
> > the feature was to deny service when the server is under attack. If we
> > don't deny service, what does the feature do?

I think there are two attack vectors under discussion:

1. Somebody tries to brute-force a password. The original auth_delay
delays auth for a bit every time authentication fails. If you configure
the delay to be very small, maybe it does not bother the attacker too
much. If you configure it to be long enough, legitimate users might be
annoyed when typoing their password. The suggested feature tries to help
here by initially delaying authentication just a bit and then gradually
increasing the delay.

2. Somebody tries to denial-of-service a server by exhausting all
(remaining) available connections with failed authentication requests
that are being delayed. For this attack, the suggested feature is
hurting more than doing good as it potentially keeps a failed
authentication attempt's connection hanging around longer than before
and makes it easier to denial-of-service a server in this way.

In order to at least make case 2 not worse for exponential backoff, we
could maybe disable it (and just wait for auth_delay.milliseconds) once
MAX_CONN_RECORDS is full. In addition, maybe MAX_CONN_RECORDS should be
some fraction of max_connections, like 25%?
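
For illustration, the backoff in case 1 boils down to something like the
following (a rough sketch, not the actual patch; auth_delay_milliseconds
mirrors the existing auth_delay.milliseconds GUC, while
auth_delay_max_milliseconds and the shape of the per-host record in
acr_array are assumptions based on this discussion):

    /*
     * Sketch of the per-host exponential backoff: start at
     * auth_delay.milliseconds, double on every further failure, and cap
     * the result at auth_delay.max_milliseconds.  AuthConnRecord and its
     * last_delay_ms field are hypothetical names for the acr_array entry.
     */
    static int
    next_delay_ms(AuthConnRecord *record)
    {
        int         delay;

        if (record->last_delay_ms <= 0)
            delay = auth_delay_milliseconds;    /* first failure */
        else
            delay = record->last_delay_ms * 2;  /* exponential backoff */

        record->last_delay_ms = Min(delay, auth_delay_max_milliseconds);
        return record->last_delay_ms;
    }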

> > And I may have introduced a red herring in talking about the number of
> > hosts, because an attacker operating from a single host is under no
> > obligation to actually wait for the authentication delay. Once we hit
> > some short timeout, we can safely assume the password is wrong,
> > abandon the request, and open up a new connection.

That is a valid point.

Maybe this could be averted if we either outright deny even a successful
authentication request when the host it originates from has a
max_milliseconds delay on file (i.e. it has been trying to brute-force
the password for a while), or at least delay a successful authentication
request by some amount in that case (assuming the attacker will have
closed the connection beforehand, thinking the password guess was
wrong)?

> > It seems like the thing limiting our attack is the number of
> > connection slots, not MAX_CONN_RECORDS. Am I missing something
> > crucial?
> 
> Doesn't this mean this approach (as implemented) doesn't actually work
> as a protection against this sort of DoS?
> 
> If I'm an attacker, and I can just keep opening new connections for each
> auth request, am I even subject to any auth delay?

Yeah, but see above.

> ISTM the problem lies in the fact that we apply the delay only *after*
> the failed auth attempt. Which makes sense, because until now we didn't
> have any state with information for new connections. But with the new
> acr_array, we could check that first, and do the delay before trying to
> authenticate, no?

I don't think we have a hook for that yet, do we?


Michael




Re: Reducing the log spam

2024-03-06 Thread Isaac Morland
On Tue, 5 Mar 2024 at 07:55, Laurenz Albe  wrote:

> Inspired by feedback to [1], I thought about how to reduce log spam.
>
> My experience from the field is that a lot of log spam looks like
>
>   database/table/... "xy" does not exist
>   duplicate key value violates unique constraint "xy"
>
> So what about a parameter "log_suppress_sqlstates" that suppresses
> logging ERROR and FATAL messages with the enumerated SQL states?
>
> My idea for a default setting would be something like
>
>   log_suppress_sqlstates = '23505,3D000,3F000,42601,42704,42883,42P01'
>
> but that's of course bikeshedding territory.
>

I like the basic idea and the way of specifying states seems likely to
cover a lot of typical use cases. Of course in principle the application
should be fixed, but in practice we can't always control that.

I have two questions about this:

First, can it be done per role? If I have a particular application which is
constantly throwing some particular error, I might want to suppress it, but
not suppress the same error occasionally coming from another application. I
see ALTER DATABASE name SET configuration_parameter … as being useful here,
but often multiple applications share a database.

Second, where can this setting be adjusted? Can any session turn off
logging of arbitrary sets of sqlstates resulting from its queries? It feels
to me like that might allow security problems to be hidden. Specifically,
the first thing an SQL injection might do would be to turn off logging of
important error states, then proceed to try various nefarious things.

It seems to me the above questions interact; an answer to the first might
be "ALTER ROLE role_specification SET configuration_parameter", but I think
that would allow roles to change their own settings, contrary to the
concern raised by the second question.
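
To make the shape of the proposal a bit more concrete, the filter could
boil down to a check like this in the logging path (a rough sketch only;
log_suppress_sqlstates and sqlstate_is_suppressed() are hypothetical
names, not code from any posted patch):

    /*
     * Hypothetical sketch: skip server-log output for ERROR/FATAL
     * messages whose SQLSTATE appears in the proposed
     * log_suppress_sqlstates list.  sqlstate_is_suppressed() would parse
     * and search the comma-separated list of SQLSTATEs.
     */
    static bool
    should_suppress_log(ErrorData *edata)
    {
        if (edata->elevel != ERROR && edata->elevel != FATAL)
            return false;

        return sqlstate_is_suppressed(log_suppress_sqlstates,
                                      unpack_sql_state(edata->sqlerrcode));
    }

How such a setting would be scoped (per cluster, per database, per role,
or per session) is exactly the question raised above, and is independent
of the check itself.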


Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases

2024-03-06 Thread Nathan Bossart
On Tue, Mar 05, 2024 at 11:20:13PM +, Dean Rasheed wrote:
> I'm not sure how useful these changes are, but I don't really object.
> You need to update the synopsis section of the docs though.

Thanks for taking a look.  I updated the synopsis sections in v3.

I also spent some more time on the reindexdb patch (0003).  I previously
had decided to restrict combinations of tables, schemas, and indexes
because I felt it was "ambiguous and inconsistent with vacuumdb," but
looking closer, I think that's the wrong move.  reindexdb already supports
such combinations, which it interprets to mean it should reindex each
listed object.  So, I removed that change in v3.

Even though reindexdb allows combinations of tables, schemas, and indexes,
it doesn't allow combinations of "system catalogs" and other objects, and
it's not clear why.  In v3, I've removed this restriction, which ended up
simplifying the 0003 patch a bit.  Like combinations of tables, schemas,
and indexes, reindexdb will now interpret combinations that include
--system to mean it should reindex each listed object as well as the system
catalogs.

Ideally, we'd allow similar combinations in vacuumdb, but I believe that
would require a much more invasive patch, and I've already spent far more
time on this change than I wanted to.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 84b3f5a8275d53707b15208d761567372b7b20a5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 28 Jun 2023 15:12:18 -0700
Subject: [PATCH v3 1/3] vacuumdb: allow specifying tables or schemas to
 process in all databases

---
 doc/src/sgml/ref/vacuumdb.sgml| 60 ++-
 src/bin/scripts/t/100_vacuumdb.pl | 24 ++---
 src/bin/scripts/vacuumdb.c| 19 +++---
 3 files changed, 52 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 09356ea4fa..66fccb30a2 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -36,7 +36,13 @@ PostgreSQL documentation
 

 
-   dbname
+   
+
+ dbname
+ -a
+ --all
+
+   
   
 
   
@@ -47,40 +53,44 @@ PostgreSQL documentation

 
  
-   
-
- 
-  -n
-  --schema
- 
- schema
-
-   
-
-   
-
- 
-  -N
-  --exclude-schema
- 
- schema
-
-   
+  -n
+  --schema
  
+ schema
 

 
-   dbname
+   
+
+ dbname
+ -a
+ --all
+
+   
   
 
   
vacuumdb
connection-option
option
-   
--a
---all
-   
+
+   
+
+ 
+  -N
+  --exclude-schema
+ 
+ schema
+
+   
+
+   
+
+ dbname
+ -a
+ --all
+
+   
   
  
 
diff --git a/src/bin/scripts/t/100_vacuumdb.pl b/src/bin/scripts/t/100_vacuumdb.pl
index 0601fde205..1d8558c780 100644
--- a/src/bin/scripts/t/100_vacuumdb.pl
+++ b/src/bin/scripts/t/100_vacuumdb.pl
@@ -184,18 +184,18 @@ $node->command_fails_like(
 	[ 'vacuumdb', '-n', 'pg_catalog', '-N', '"Foo"', 'postgres' ],
 	qr/cannot vacuum all tables in schema\(s\) and exclude schema\(s\) at the same time/,
 	'cannot use options -n and -N at the same time');
-$node->command_fails_like(
-	[ 'vacuumdb', '-a', '-N', '"Foo"' ],
-	qr/cannot exclude specific schema\(s\) in all databases/,
-	'cannot use options -a and -N at the same time');
-$node->command_fails_like(
-	[ 'vacuumdb', '-a', '-n', '"Foo"' ],
-	qr/cannot vacuum specific schema\(s\) in all databases/,
-	'cannot use options -a and -n at the same time');
-$node->command_fails_like(
-	[ 'vacuumdb', '-a', '-t', '"Foo".bar' ],
-	qr/cannot vacuum specific table\(s\) in all databases/,
-	'cannot use options -a and -t at the same time');
+$node->issues_sql_like(
+	[ 'vacuumdb', '-a', '-N', 'pg_catalog' ],
+	qr/(?:(?!VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class).)*/,
+	'vacuumdb -a -N');
+$node->issues_sql_like(
+	[ 'vacuumdb', '-a', '-n', 'pg_catalog' ],
+	qr/VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class/,
+	'vacuumdb -a -n');
+$node->issues_sql_like(
+	[ 'vacuumdb', '-a', '-t', 'pg_class' ],
+	qr/VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class/,
+	'vacuumdb -a -t');
 $node->command_fails_like(
 	[ 'vacuumdb', '-a', '-d', 'postgres' ],
 	qr/cannot vacuum all databases and a specific one at the same time/,
diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c
index 291766793e..7138c6e97e 100644
--- a/src/bin/scripts/vacuumdb.c
+++ b/src/bin/scripts/vacuumdb.c
@@ -72,6 +72,7 @@ static void vacuum_one_database(ConnParams *cparams,
 static void vacuum_all_databases(ConnParams *cparams,
  vacuumingOptions *vacopts,
  bool analyze_in_stages,
+ SimpleStringList *objects,
  int concurrentCons,
  const char *progname, bool echo, bool quiet);
 
@@ -378,6 +379,7 @@ main(int argc, char *argv[])
 
 		vacuum_all_databases(&cparams, &vacopts,
 			 analyze_in_stages,

Re: Large files for relations

2024-03-06 Thread Thomas Munro
Rebased.  I had intended to try to get this into v17, but a couple of
unresolved problems came up while rebasing over the new incremental
backup stuff.  You snooze, you lose.  Hopefully we can sort these out
in time for the next commitfest:

* should pg_combinebasebackup read the control file to fetch the segment size?
* hunt for other segment-size related problems that may be lurking in
new incremental backup stuff
* basebackup_incremental.c wants to use memory in proportion to
segment size, which looks like a problem, and I wrote about that in a
new thread[1]

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B2hZ0sBztPW4mkLfng0qfkNtAHFUfxOMLizJ0BPmi5%2Bg%40mail.gmail.com
From 85678257fef94aa3ca3efb39ce55fb66df7c889e Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 26 May 2023 01:41:11 +1200
Subject: [PATCH v3] Allow relation segment size to be set by initdb.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, relation segment size was a rarely modified compile time
option.  Make it an initdb option, so that users with very large tables
can avoid using so many files and file descriptors.

The initdb option --rel-segsize is modeled on the existing --wal-segsize
option.

The data type used to store the size is int64, not BlockNumber, because
it seems reasonable to want to be able to say --rel-segsize=32TB (=
don't use segments at all), but that would overflow uint32.

It should be fairly straightforward to teach pg_upgrade (or some new
dedicated tool) to convert an existing cluster to a new segment size,
but that is not done yet, so for now this is only useful for entirely
new clusters.

The default behavior is unchanged: 1GB segments.  On Windows, we can't
go above 2GB for now (we'd have to make a lot of changes due to
Windows' small off_t).

XXX work remains to be done for incremental backups

Reviewed-by: David Steele 
Reviewed-by: Peter Eisentraut 
Reviewed-by: Stephen Frost 
Reviewed-by: Jim Mlodgenski 
Reviewed-by: Dagfinn Ilmari Mannsåker 
Reviewed-by: Pavel Stehule 
Discussion: https://postgr.es/m/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com
---
 configure   |  91 --
 configure.ac|  55 -
 doc/src/sgml/config.sgml|   7 +-
 doc/src/sgml/ref/initdb.sgml|  24 
 meson.build |  14 ---
 src/backend/access/transam/xlog.c   |  11 +-
 src/backend/backup/basebackup.c |   7 +-
 src/backend/backup/basebackup_incremental.c |  31 +++--
 src/backend/bootstrap/bootstrap.c   |   5 +-
 src/backend/storage/file/buffile.c  |   6 +-
 src/backend/storage/smgr/md.c   | 128 
 src/backend/storage/smgr/smgr.c |  14 +++
 src/backend/utils/misc/guc.c|  16 +++
 src/backend/utils/misc/guc_tables.c |  12 +-
 src/bin/initdb/initdb.c |  47 ++-
 src/bin/pg_checksums/pg_checksums.c |   2 +-
 src/bin/pg_combinebackup/reconstruct.c  |  18 ++-
 src/bin/pg_controldata/pg_controldata.c |   2 +-
 src/bin/pg_resetwal/pg_resetwal.c   |   4 +-
 src/bin/pg_rewind/filemap.c |   4 +-
 src/bin/pg_rewind/pg_rewind.c   |   3 +
 src/bin/pg_rewind/pg_rewind.h   |   1 +
 src/bin/pg_upgrade/relfilenumber.c  |   2 +-
 src/include/catalog/pg_control.h|   2 +-
 src/include/pg_config.h.in  |  13 --
 src/include/storage/smgr.h  |   3 +
 src/include/utils/guc_tables.h  |   1 +
 27 files changed, 249 insertions(+), 274 deletions(-)

diff --git a/configure b/configure
index 36feeafbb23..49a7f0f2c4a 100755
--- a/configure
+++ b/configure
@@ -842,8 +842,6 @@ enable_dtrace
 enable_tap_tests
 enable_injection_points
 with_blocksize
-with_segsize
-with_segsize_blocks
 with_wal_blocksize
 with_llvm
 enable_depend
@@ -1551,9 +1549,6 @@ Optional Packages:
   --with-pgport=PORTNUM   set default port number [5432]
   --with-blocksize=BLOCKSIZE
   set table block size in kB [8]
-  --with-segsize=SEGSIZE  set table segment size in GB [1]
-  --with-segsize-blocks=SEGSIZE_BLOCKS
-  set table segment size in blocks [0]
   --with-wal-blocksize=BLOCKSIZE
   set WAL block size in kB [8]
   --with-llvm build with LLVM based JIT support
@@ -3759,85 +3754,6 @@ cat >>confdefs.h <<_ACEOF
 _ACEOF
 
 
-#
-# Relation segment size
-#
-
-
-
-# Check whether --with-segsize was given.
-if test "${with_segsize+set}" = set; then :
-  withval=$with_segsize;
-  case $withval in
-yes)
-  as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5
-  ;;
-no)
-  as_fn_error $? "argument required for --with-segsize option" "$LINENO" 5
-  ;;
-*)
-  segsize=$withval
-  

Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-03-06 Thread Matthias van de Meent
On Wed, 6 Mar 2024 at 22:46, Matthias van de Meent
 wrote:
>
> On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan  wrote:
> > At one point Heikki suggested that I just get rid of
> > BTScanOpaqueData.arrayKeyData (which has been there for as long as
> > nbtree had native support for SAOPs), and use
> > BTScanOpaqueData.keyData exclusively instead. I've finally got around
> > to doing that now.
>
> I'm not sure if it was worth the reduced churn when the changes for
> the primary optimization are already 150+kB in size; every "small"
> addition increases the context needed to review the patch, and it's
> already quite complex.

To clarify, what I mean here is that merging the changes of both the
SAOPs changes and the removal of arrayKeyData seems to increase the
patch size and increases the maximum complexity of each component
patch's review.
Separate patches may make this more reviewable, or not, but no comment
was given on why it is better to merge the changes into a single
patch.

- Matthias




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-03-06 Thread Matthias van de Meent
On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan  wrote:
>
> On Mon, Mar 4, 2024 at 3:51 PM Matthias van de Meent
>  wrote:
> > > +that searches for multiple values together.  Queries that use certain
> > > +SQL constructs to search for rows matching any 
> > > value
> > > +out of a list (or an array) of multiple scalar values might perform
> > > +multiple primitive index scans (up to one primitive 
> > > scan
> > > +per scalar value) at runtime.  See <xref linkend="functions-comparisons"/>
> > > +for details.
> >
> > I don't think the "see  for details" is
> > correctly worded: The surrounding text implies that it would contain
> > details about in which precise situations multiple primitive index
> > scans would be consumed, while it only contains documentation about
> > IN/NOT IN/ANY/ALL/SOME.
> >
> > Something like the following would fit better IMO:
> >
> > +that searches for multiple values together.  Queries that use certain
> > +SQL constructs to search for rows matching any value
> > +out of a list or array of multiple scalar values (such as those
> > +described in <xref linkend="functions-comparisons"/> might perform multiple primitive
> > +index scans (up to one primitive scan per scalar value) at runtime.
>
> I think that there is supposed to be a closing parenthesis here? So
> "... (such as those described in ") might
> perform...".

Correct.

> FWM, if that's what you meant.

WFM, yes?

> > Then there is a second issue in the paragraph: Inverted indexes such
> > as GIN might well decide to start counting more than one "primitive
> > scan" per scalar value,
[...]
> I've described the issues in this area (in the docs) in a way that is
> most consistent with historical conventions. That seems to have the
> fewest problems, despite everything I've said about it.

Clear enough, thank you for explaining your thoughts on this.

> > > > All that really remains now is to research how we might integrate this
> > > > work with the recently added continuescanPrechecked/haveFirstMatch
> > > > stuff from Alexander Korotkov, if at all.
> > >
> > > The main change in v12 is that I've integrated both the
> > > continuescanPrechecked and the haveFirstMatch optimizations. Both of
> > > these fields are now page-level state, shared between the _bt_readpage
> > > caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they
> > > appear next to the new home for _bt_checkkeys' continuescan field, in
> > > the new page state struct).
> >
> > Cool. I'm planning to review the rest of this patch this
> > week/tomorrow, could you take some time to review some of my btree
> > patches, too?
>
> Okay, I'll take a look again.

Thanks, greatly appreciated.

> At one point Heikki suggested that I just get rid of
> BTScanOpaqueData.arrayKeyData (which has been there for as long as
> nbtree had native support for SAOPs), and use
> BTScanOpaqueData.keyData exclusively instead. I've finally got around
> to doing that now.

I'm not sure if it was worth the reduced churn when the changes for
the primary optimization are already 150+kB in size; every "small"
addition increases the context needed to review the patch, and it's
already quite complex.

> These simplifications were enabled by my new approach within
> _bt_preprocess_keys, described when I posted v12. v13 goes even
> further than v12 did, by demoting _bt_preprocess_array_keys to a
> helper function for _bt_preprocess_keys. That means that we do all of
> our scan key preprocessing at the same time, at the start of _bt_first
> (though only during the first _bt_first, or to be more precise during
> the first per btrescan). If we need fancier
> preprocessing/normalization for arrays, then it ought to be a lot
> easier with this structure.

Agreed.

> Note that we no longer need to have an independent representation of
> so->qual_okay for array keys (the convention of setting
> so->numArrayKeys to -1 for unsatisfiable array keys is no longer
> required). There is no longer any need for a separate pass to carry
> over the contents of BTScanOpaqueData.arrayKeyData to
> BTScanOpaqueData.keyData, which was confusing.

I wasn't very confused by it, but sure.

> Are you still interested in working directly on the preprocessing
> stuff?

If you mean my proposed change to merging two equality SAOPs, then yes.
I'll try to fit it in tomorrow with the rest of the review.

>  I have a feeling that I was slightly too optimistic about how
> likely we were to be able to get away with not having certain kinds of
> array preprocessing, back when I posted v12. It's true that the
> propensity of the patch to not recognize "partial
> redundancies/contradictions" hardly matters with redundant equalities,
> but inequalities are another matter. I'm slightly worried about cases
> like this one:
>
> select * from multi_test where a in (1,99, 182, 183, 184) and a < 183;
>
> Maybe we need to do better with that. What do you think?

Let me come back to that when I'm done reviewing the final 

Re: Make attstattarget nullable

2024-03-06 Thread Tomas Vondra
Hi Peter,

I took a look at this patch today. I think it makes sense to do this,
and I haven't found any major issues with any of the three patches. A
couple minor comments:

0001


1) I think this bit in ALTER STATISTICS docs is wrong:

-  new_target
+  SET STATISTICS { integer | DEFAULT }

because it means we now have list entries for name, ..., new_name,
new_schema, and then suddenly "SET STATISTICS { integer | DEFAULT }".
That's a bit weird.


2) The newtarget handling in AlterStatistics seems rather confusing. Why
does it get set to -1 just to ignore the value later? For a while I was
99% sure ALTER STATISTICS ... SET STATISTICS DEFAULT will set the field
to -1. Maybe ditching the first if block and directly checking
stmt->stxstattarget before setting repl_val/repl_null would be better?


3) I find it a bit tedious that making the stxstattarget field nullable
means we now have to translate NULL to -1 in so many places. I know why
it's that way, but it's ... not very convenient :-(


0002


1) I think InsertPgAttributeTuples comment probably needs to document
what the new tupdesc_extra parameter does.


0003

no comment, seems fine



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-03-06 Thread Tomas Vondra



On 3/5/24 10:03, Nikita Malakhov wrote:
> Hi,
> 
> Tomas, sorry for confusion, in my previous message I meant exactly
> the same approach you've posted above, and came up with almost
> the same implementation.
> 
> Thank you very much for your attention to this thread!
> 
> I've asked Andy about this approach because of the same reasons you
> mentioned - it keeps cache code small, localized and easy to maintain.
> 
> The question that worries me is the memory limit, and I think that cache
> lookup and entry invalidation should be done in toast_tuple_externalize
> code, for the case the value has been detoasted previously is updated
> in the same query, like
> 

I haven't thought very much about when we need to invalidate entries and
where exactly that should happen. My patch is a very rough PoC that I
wrote in ~1h, and it's quite possible it will have to move or should be
refactored in some way.

But I don't quite understand the problem with this query:

> UPDATE test SET value = value || '...';

Surely the update creates a new TOAST value, with a completely new TOAST
pointer, new rows in the TOAST table etc. It's not updated in-place. So
it'd end up with two independent entries in the TOAST cache.

Or are you interested just in evicting the old value simply to free the
memory, because we're not going to need that (now invisible) value? That
seems interesting, but if it's too invasive or complex I'd just leave it
up to the LRU eviction (at least for v1).

I'm not sure why anything should happen in toast_tuple_externalize; that
seems like a very low-level piece of code. But I realized there's also
toast_delete_external, and maybe that's a good place to invalidate the
deleted TOAST values (which would also cover the UPDATE).

> I've added cache entry invalidation and data removal on delete and update
> of the toasted values and am currently experimenting with large values.
> 

I haven't done anything about large values in the PoC patch, but I think
it might be a good idea to either ignore values that are too large (so
that it does not push everything else from the cache) or store them in
compressed form (assuming it's small enough, to at least save the I/O).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-03-06 Thread Tomas Vondra
On 3/6/24 07:09, Nikita Malakhov wrote:
> Hi,
> 
> In addition to the previous message - for the toast_fetch_datum_slice
> the first (seems obvious) way is to detoast the whole value, save it
> to cache and get slices from it on demand. I have another one on my
> mind, but have to play with it first.
> 

What if you only need the first slice? In that case decoding everything
will be a clear regression.

IMHO this needs to work like this:

1) check if the cache already has sufficiently large slice detoasted
(and use the cached value if yes)

2) if there's no slice detoasted, or if it's too small, detoast a
sufficiently large slice and store it in the cache (possibly evicting
the already detoasted slice)
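
A rough sketch of that two-step lookup, with all of the cache helpers
named hypothetically (none of this is from a posted patch;
detoast_attr_slice() is the existing detoasting entry point):

    static struct varlena *
    cached_detoast_slice(struct varlena *attr, int32 sliceoffset,
                         int32 slicelength)
    {
        ToastSliceEntry *entry = toast_slice_cache_lookup(attr);
        int32       needed = sliceoffset + slicelength;

        /* 1) a sufficiently large slice is cached: serve it from there */
        if (entry != NULL && entry->detoasted_len >= needed)
            return copy_slice(entry, sliceoffset, slicelength);

        /*
         * 2) nothing cached, or the cached slice is too small: detoast a
         * sufficiently large slice, store it (possibly evicting the
         * smaller entry), and serve the request from the new entry.
         */
        entry = toast_slice_cache_store(attr,
                                        detoast_attr_slice(attr, 0, needed));
        return copy_slice(entry, sliceoffset, slicelength);
    }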


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Exponential backoff for auth_delay

2024-03-06 Thread Tomas Vondra



On 3/6/24 19:24, Jacob Champion wrote:
> On Wed, Mar 6, 2024 at 8:10 AM Nathan Bossart  
> wrote:
>> Assuming you are referring to the case where we run out of free slots in
>> acr_array, I'm not sure I see that as desirable behavior.  Once you run out
>> of slots, all failed authentication attempts are now subject to the maximum
>> delay, which is arguably a denial-of-service scenario, albeit not a
>> particularly worrisome one.
> 
> Maybe I've misunderstood the attack vector, but I thought the point of
> the feature was to deny service when the server is under attack. If we
> don't deny service, what does the feature do?
> 
> And I may have introduced a red herring in talking about the number of
> hosts, because an attacker operating from a single host is under no
> obligation to actually wait for the authentication delay. Once we hit
> some short timeout, we can safely assume the password is wrong,
> abandon the request, and open up a new connection. It seems like the
> thing limiting our attack is the number of connection slots, not
> MAX_CONN_RECORDS. Am I missing something crucial?
> 

Doesn't this mean this approach (as implemented) doesn't actually work
as a protection against this sort of DoS?

If I'm an attacker, and I can just keep opening new connections for each
auth request, am I even subject to any auth delay?

ISTM the problem lies in the fact that we apply the delay only *after*
the failed auth attempt. Which makes sense, because until now we didn't
have any state with information for new connections. But with the new
acr_array, we could check that first, and do the delay before trying to
authenticate, no?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Reducing the log spam

2024-03-06 Thread Laurenz Albe
On Wed, 2024-03-06 at 10:50 -0500, Greg Sabino Mullane wrote:
> On Tue, Mar 5, 2024 at 7:55 AM Laurenz Albe  wrote:
> > My experience from the field is that a lot of log spam looks like
> > 
> >   database/table/... "xy" does not exist
> >   duplicate key value violates unique constraint "xy"
> 
> Forcibly hiding those at the Postgres level seems a heavy hammer for what is 
> ultimately an application problem.

Yes... or no.  Lots of applications violate constraints routinely.
As long as the error is caught and handled, that's not a problem.

Whoever cares about the log messages can enable them.  My impression
is that most people don't care about them.

But thanks for your opinion.

Yours,
Laurenz Albe




Re: MERGE ... RETURNING

2024-03-06 Thread Merlin Moncure
On Thu, Feb 29, 2024 at 1:49 PM Jeff Davis  wrote:

>
> Can we get some input on whether the current MERGE ... RETURNING patch
> is the right approach from a language standpoint?
>

MERGE_CLAUSE_NUMBER() seems really out of place to me; identifying the
output set by number rather than by some kind of name feels wrong.  I did
not see a lot of support for that position, though.

merlin


Re: Transaction timeout

2024-03-06 Thread Alexander Korotkov
On Wed, Mar 6, 2024 at 10:22 AM Andrey M. Borodin  wrote:
> > On 25 Feb 2024, at 21:50, Alexander Korotkov  wrote:
> >
> > Thank you for the patches.  I've pushed the 0001 patch to avoid
> > further failures on buildfarm.  Let 0004 wait till injection points
> > by Michael are committed.
>
> Thanks!
>
> All prerequisites are committed. I propose something in line with this
> patch.

Thank you.  I took a look at the patch.  Should we also check the
relevant message after the timeout is fired?  We could check it in
psql stderr or log for that.

--
Regards,
Alexander Korotkov




Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-06 Thread Heikki Linnakangas

On 27/02/2024 21:47, Melanie Plageman wrote:

The attached v5 has some simplifications when compared to v4 but takes
largely the same approach.

0001-0004 are refactoring


I'm looking at just these 0001-0004 patches for now. I like those 
changes a lot for the sake of readability even without any of the later 
patches.


I made some further changes. I kept them as separate commits for easier 
review, see the commit messages for details. Any thoughts on those changes?


I feel the heap_vac_scan_get_next_block() function could use some love. 
Maybe just some rewording of the comments, or maybe some other 
refactoring; not sure. But I'm pretty happy with the function signature 
and how it's called.


BTW, do we have tests that would fail if we botched up 
heap_vac_scan_get_next_block() so that it would skip pages incorrectly, 
for example? Not asking you to write them for this patch, but I'm just 
wondering.


--
Heikki Linnakangas
Neon (https://neon.tech)
From 6e0258c3e31e85526475f46b2e14cbbcbb861909 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sat, 30 Dec 2023 16:30:59 -0500
Subject: [PATCH v6 1/9] lazy_scan_skip remove unneeded local var
 nskippable_blocks

nskippable_blocks can be easily derived from next_unskippable_block's
progress when compared to the passed in next_block.
---
 src/backend/access/heap/vacuumlazy.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8b320c3f89a..1dc6cc8e4db 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1103,8 +1103,7 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 			   bool *next_unskippable_allvis, bool *skipping_current_range)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-next_unskippable_block = next_block,
-nskippable_blocks = 0;
+next_unskippable_block = next_block;
 	bool		skipsallvis = false;
 
 	*next_unskippable_allvis = true;
@@ -1161,7 +1160,6 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 
 		vacuum_delay_point();
 		next_unskippable_block++;
-		nskippable_blocks++;
 	}
 
 	/*
@@ -1174,7 +1172,7 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
 	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
 	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
+	if (next_unskippable_block - next_block < SKIP_PAGES_THRESHOLD)
 		*skipping_current_range = false;
 	else
 	{
-- 
2.39.2

From 258bce44e3275bc628bf984892797eecaebf0404 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sat, 30 Dec 2023 16:22:12 -0500
Subject: [PATCH v6 2/9] Add lazy_scan_skip unskippable state to LVRelState

Future commits will remove all skipping logic from lazy_scan_heap() and
confine it to lazy_scan_skip(). To make those commits more clear, first
add a struct to LVRelState containing variables needed to skip
ranges less than SKIP_PAGES_THRESHOLD.

lazy_scan_prune() and lazy_scan_new_or_empty() can now access the
buffer containing the relevant block of the visibility map through the
LVRelState.skip, so it no longer needs to be a separate function
parameter.

While we are at it, add additional information to the lazy_scan_skip()
comment, including descriptions of the role and expectations for its
function parameters.
---
 src/backend/access/heap/vacuumlazy.c | 154 ---
 1 file changed, 90 insertions(+), 64 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1dc6cc8e4db..0ddb986bc03 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -204,6 +204,22 @@ typedef struct LVRelState
 	int64		live_tuples;	/* # live tuples remaining */
 	int64		recently_dead_tuples;	/* # dead, but not yet removable */
 	int64		missed_dead_tuples; /* # removable, but not removed */
+
+	/*
+	 * Parameters maintained by lazy_scan_skip() to manage skipping ranges of
+	 * pages greater than SKIP_PAGES_THRESHOLD.
+	 */
+	struct
+	{
+		/* Next unskippable block */
+		BlockNumber next_unskippable_block;
+		/* Buffer containing next unskippable block's visibility info */
+		Buffer		vmbuffer;
+		/* Next unskippable block's visibility status */
+		bool		next_unskippable_allvis;
+		/* Whether or not skippable blocks should be skipped */
+		bool		skipping_current_range;
+	}			skip;
 } LVRelState;
 
 /* Struct for saving and restoring vacuum error information. */
@@ -214,19 +230,15 @@ typedef struct LVSavedErrInfo
 	VacErrPhase phase;
 } LVSavedErrInfo;
 
-
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-  BlockNumber next_block,
-  bool *next_unskippable_allvis,
-  bool *skipping_current_range);

Re: Provide a pg_truncate_freespacemap function

2024-03-06 Thread Stephen Frost
Greetings,

* Ronan Dunklau (ronan.dunk...@aiven.io) wrote:
> As we are currently experiencing a FSM corruption issue [1], we need to 
> rebuild FSM when we detect it. 

Ideally, we'd figure out a way to pick up on this and address it without
the user needing to intervene, however ...

> I noticed we have something to truncate a visibility map, but nothing for the 
> freespace map, so I propose the attached (liberally copied from the VM 
> counterpart) to allow to truncate a FSM without incurring downtime, as 
> currently our only options are to either VACUUM FULL the table or stop the 
> cluster and remove the FSM manually.

I agree that this would generally be a useful thing to have.

> Does that seem correct ?

Definitely needs to have a 'REVOKE ALL ON FUNCTION' at the end of the
upgrade script, similar to what you'll find at the bottom of
pg_visibility--1.1.sql in the tree today, otherwise anyone could run it.

Beyond that, I'd suggest a function-level comment above the definition
of the function itself (which is where we tend to put those- not at the
point where we declare the function).

Thanks!

Stephen




Re: Potential issue in ecpg-informix decimal converting functions

2024-03-06 Thread a . imamov

Daniel Gustafsson wrote on 2024-03-06 18:03:

On 27 Feb 2024, at 06:08, Michael Paquier  wrote:

On Mon, Feb 26, 2024 at 12:28:51AM +0100, Daniel Gustafsson wrote:
Yeah, I think this is for HEAD only, especially given the lack of
complaints against backbranches.

Daniel, are you planning to look at that?  I haven't done any detailed
lookup, but would be happy to do so if that helps.

I had a look at this today and opted for trimming back the patch a bit.
Reading the informix docs, the functions we are mimicking for
compatibility here do not have an underflow return value, so adding one
doesn't seem right (or helpful).  The attached fixes the return of
overflow and leaves it at that, which makes it possible to backpatch
since it's fixing the code to match the documented behavior.

--
Daniel Gustafsson

I agree with the proposed changes in favor of backward compatibility.
Also, is it a big deal that the PGTYPESnumeric_to_long() function doesn't
exactly match the documentation, compared to PGTYPESnumeric_to_int()?  It
handles the underflow case separately and sets errno to
PGTYPES_NUM_UNDERFLOW additionally.




Using the %m printf format more

2024-03-06 Thread Dagfinn Ilmari Mannsåker
Hi hackers,

I just noticed that commit d93627bc added a bunch of pg_fatal() calls
using %s and strerror(errno), which could be written more concisely as
%m.  I'm assuming this was done because the surrounding code also uses
this pattern, and hadn't been changed to use %m when support for that
was added to snprintf.c to avoid backporting hazards.  However, that
support was in v12, which is now the oldest still-supported back branch,
so we can safely make that change.
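
As a minimal before/after illustration of the conversion (a sketch, not
taken from the attached patch):

    /* before: spell out strerror(errno) by hand */
    pg_fatal("could not open file \"%s\": %s", path, strerror(errno));

    /* after: let the printf machinery expand %m itself */
    pg_fatal("could not open file \"%s\": %m", path);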

The attached patch does so everywhere appropriate. One place where it's
not appropriate is the TAP-emitting functions in pg_regress, since those
call fprintf() and other potentially errno-modifying functions before
evaluating the format string.

- ilmari

>From 17ad27665821decf9116723fe8873bd256e9d75d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Wed, 6 Mar 2024 16:58:33 +
Subject: [PATCH] Use %m printf format instead of strerror(errno) where
 appropriate

Now that all live branches (12-) support %m in printf formats, let's
use it everywhere appropriate.

Particularly, we're _not_ converting the TAP output calls in
pg_regress, since those functions call fprintf() and friends before
evaluating the format string, which can clobber errno.
---
 src/backend/postmaster/postmaster.c   | 21 ++--
 src/backend/postmaster/syslogger.c|  2 +-
 src/backend/utils/misc/guc.c  |  9 +-
 src/bin/initdb/findtimezone.c |  4 +-
 src/bin/pg_ctl/pg_ctl.c   | 74 +++---
 src/bin/pg_dump/compress_gzip.c   |  2 +-
 src/bin/pg_dump/compress_none.c   |  5 +-
 src/bin/pg_upgrade/check.c| 41 +++-
 src/bin/pg_upgrade/controldata.c  |  6 +-
 src/bin/pg_upgrade/exec.c | 14 ++-
 src/bin/pg_upgrade/file.c | 98 +--
 src/bin/pg_upgrade/function.c |  3 +-
 src/bin/pg_upgrade/option.c   | 10 +-
 src/bin/pg_upgrade/parallel.c | 12 +--
 src/bin/pg_upgrade/pg_upgrade.c   |  4 +-
 src/bin/pg_upgrade/relfilenumber.c|  5 +-
 src/bin/pg_upgrade/tablespace.c   |  4 +-
 src/bin/pg_upgrade/version.c  |  9 +-
 src/common/psprintf.c |  4 +-
 src/interfaces/ecpg/preproc/ecpg.c| 12 +--
 src/test/isolation/isolationtester.c  |  2 +-
 .../modules/libpq_pipeline/libpq_pipeline.c   |  4 +-
 src/tools/ifaddrs/test_ifaddrs.c  |  2 +-
 23 files changed, 157 insertions(+), 190 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index f3c09e8dc0..af8a1efe66 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1375,12 +1375,12 @@ PostmasterMain(int argc, char *argv[])
 
 			/* Make PID file world readable */
 			if (chmod(external_pid_file, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH) != 0)
-write_stderr("%s: could not change permissions of external PID file \"%s\": %s\n",
-			 progname, external_pid_file, strerror(errno));
+write_stderr("%s: could not change permissions of external PID file \"%s\": %m\n",
+			 progname, external_pid_file);
 		}
 		else
-			write_stderr("%s: could not write external PID file \"%s\": %s\n",
-		 progname, external_pid_file, strerror(errno));
+			write_stderr("%s: could not write external PID file \"%s\": %m\n",
+		 progname, external_pid_file);
 
 		on_proc_exit(unlink_external_pid_file, 0);
 	}
@@ -1589,8 +1589,8 @@ checkControlFile(void)
 	{
 		write_stderr("%s: could not find the database system\n"
 	 "Expected to find it in the directory \"%s\",\n"
-	 "but could not open file \"%s\": %s\n",
-	 progname, DataDir, path, strerror(errno));
+	 "but could not open file \"%s\": %m\n",
+	 progname, DataDir, path);
 		ExitPostmaster(2);
 	}
 	FreeFile(fp);
@@ -6277,15 +6277,13 @@ read_backend_variables(char *id, Port **port, BackgroundWorker **worker)
 	fp = AllocateFile(id, PG_BINARY_R);
 	if (!fp)
 	{
-		write_stderr("could not open backend variables file \"%s\": %s\n",
-	 id, strerror(errno));
+		write_stderr("could not open backend variables file \"%s\": %m\n", id);
 		exit(1);
 	}
 
 	if (fread(&param, sizeof(param), 1, fp) != 1)
 	{
-		write_stderr("could not read from backend variables file \"%s\": %s\n",
-	 id, strerror(errno));
+		write_stderr("could not read from backend variables file \"%s\": %m\n", id);
 		exit(1);
 	}
 
@@ -6293,8 +6291,7 @@ read_backend_variables(char *id, Port **port, BackgroundWorker **worker)
 	FreeFile(fp);
 	if (unlink(id) != 0)
 	{
-		write_stderr("could not remove file \"%s\": %s\n",
-	 id, strerror(errno));
+		write_stderr("could not remove file \"%s\": %m\n", id);
 		exit(1);
 	}
 #else
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index c2a6a226e7..d9d042f562 100644
--- 

Re: Proposal to add page headers to SLRU pages

2024-03-06 Thread Stephen Frost
Greetings,

* Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> I suppose this is important to do if we ever want to move SLRUs into
> shared buffers.  However, I wonder about the extra time this adds to
> pg_upgrade.  Is this something we should be concerned about?  Is there
> any measurement/estimates to tell us how long this would be?  Right now,
> if you use a cloning strategy for the data files, the upgrade should be
> pretty quick ... but the amount of data in pg_xact and pg_multixact
> could be massive, and the rewrite is likely to take considerable time.

While I definitely agree that there should be some consideration of
this concern, it feels on-par with the visibility-map rewrite which was
done previously.  Larger systems will likely have more to deal with than
smaller systems, but it's still a relatively small portion of the data
overall.

The benefit of this change, beyond just the possibility of moving them
into shared buffers some day in the future, is that this would mean that
SLRUs will have checksums (if the cluster has them enabled).  That
benefit strikes me as well worth the cost of the rewrite taking some
time and the minor loss of space due to the page header.

Would it be useful to consider parallelizing this work?  There's already
parts of pg_upgrade which can be parallelized and so this isn't,
hopefully, a big lift to add, but I'm not sure if there's enough work
being done here CPU-wise, compared to the amount of IO being done, to
have it make sense to run it in parallel.  Might be worth looking into
though, at least, as disks have gotten to be quite fast.

Thanks!

Stephen




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-06 Thread Jelte Fennema-Nio
On Wed, 6 Mar 2024 at 19:22, Alvaro Herrera  wrote:
> Docs: one bogus "that that".

will fix

> Did we consider having PQcancelConn() instead be called
> PQcancelCreate()?

Fine by me

> Also, the comment still says
> "Asynchronously cancel a query on the given connection. This requires
> polling the returned PGcancelConn to actually complete the cancellation
> of the query." but this is no longer a good description of what this
> function does.

will fix

> Why do we return a non-NULL pointer from PQcancelConn in the first three
> cases where we return errors?  (original conn was NULL, original conn is
> PGINVALID_SOCKET, pqCopyPGconn returns failure)  Wouldn't it make more
> sense to free the allocated object and return NULL?  Actually, I wonder
> if there's any reason at all to return a valid pointer in any failure
> cases; I mean, do we really expect that application authors are going to
> read/report the error message from a PGcancelConn that failed to be fully
> created?

I think having a useful error message when possible is quite nice, and
I do think people will read and report this error message.  Many people
will simply pass the result to PQcancelBlocking, whether it's NULL or
not, then check the status and report the error if the status was
CONNECTION_BAD.

> but in any case we
> should set ->cancelRequest in all cases, not only after the first tests
> for errors.

makes sense

> I think the extra PGconn inside pg_cancel_conn is useless; it would be
> simpler to typedef PGcancelConn to PGconn in fe-cancel.c, and remove the
> indirection through the extra struct.

That sounds nice indeed. I'll try it out.

> We could move the definition of struct pg_cancel to fe-cancel.c.  Nobody
> outside that needs to know that definition anyway.

will do




Re: Statistics Import and Export

2024-03-06 Thread Stephen Frost
Greetings,

* Matthias van de Meent (boekewurm+postg...@gmail.com) wrote:
> On Wed, 6 Mar 2024 at 11:33, Stephen Frost  wrote:
> > On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent 
> >  wrote:
> >> Or even just one VALUES for the whole statistics loading?
> > I don’t think we’d want to go beyond one relation at a time as then it can 
> > be parallelized, we won’t be trying to lock a whole bunch of objects at 
> > once, and any failures would only impact that one relation’s stats load.
> 
> That also makes sense.

Great, thanks.

> >> I suspect the main issue with combining this into one statement
> >> (transaction) is that failure to load one column's statistics implies
> >> you'll have to redo all the other statistics (or fail to load the
> >> statistics at all), which may be problematic at the scale of thousands
> >> of relations with tens of columns each.
> >
> >
> > I’m pretty skeptical that “stats fail to load and lead to a failed 
> > transaction” is a likely scenario that we have to spend a lot of effort on.
> 
> Agreed on the "don't have to spend a lot of time on it", but I'm not
> so sure on the "unlikely" part while the autovacuum deamon is
> involved, specifically for non-upgrade pg_restore. I imagine (haven't
> checked) that autoanalyze is disabled during pg_upgrade, but
> pg_restore doesn't do that, while it would have to be able to restore
> statistics of a table if it is included in the dump (and the version
> matches).

Even if autovacuum were running and kicked off an auto-analyze of the
relation at just the time we were trying to load the stats, there should
be appropriate locking to keep them from causing an outright ERROR and
transaction failure; if not, that's a lack of locking and should be
fixed.  With the per-attribute-function-call approach, that could lead
to a situation where some stats come from the auto-analyze and some from
the stats being loaded, but I'm not sure whether that's a big concern.

For users of this, I would think we'd generally encourage them to
disable autovacuum on the tables they're loading as otherwise they'll
end up with the stats going back to whatever an auto-analyze ends up
finding.  That may be fine in some cases, but not in others.
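
For illustration, a minimal sketch of what that per-table precaution
could look like today, using only the existing reloption (the table name
is just an example):

```
-- Keep autovacuum/autoanalyze away from the table while its dumped
-- statistics are being loaded.
ALTER TABLE public.orders SET (autovacuum_enabled = off);
-- ... load the dumped statistics for public.orders here ...
ALTER TABLE public.orders SET (autovacuum_enabled = on);
```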

A couple questions to think about though: Should pg_dump explicitly ask
autovacuum to ignore these tables while we're loading them? 
Should these functions only perform a load when there aren't any
existing stats?  Should the latter be an argument to the functions to
allow the caller to decide?

> > What are the cases where we would be seeing stats reloads failing where it 
> > would make sense to re-try on a subset of columns, or just generally, if we 
> > know that the pg_dump version matches the target server version?
> 
> Last time I checked, pg_restore's default is to load data on a
> row-by-row basis without --single-transaction or --exit-on-error. Of
> course, pg_upgrade uses it's own set of flags, but if a user is
> restoring stats with  pg_restore, I suspect they'd rather have some
> column's stats loaded than no stats at all; so I would assume this
> requires one separate pg_import_pg_statistic()-transaction for every
> column.

Having some discussion around that would be useful.  Is it better to
have a situation where there are stats for some columns but no stats for
other columns?  There would be a good chance that this would lead to a
set of queries that were properly planned out and a set which end up
with unexpected and likely poor query plans due to lack of stats.
Arguably that's better overall, but either way an ANALYZE needs to be
done to address the lack of stats for those columns, and that ANALYZE is
going to blow away whatever stats got loaded previously anyway.  All a
partial stats load buys us is that a subset of queries may get better
plans in the interim, after having expended the cost of loading the
stats individually and dealing with the case of some of them succeeding
and some failing.

Overall, I'd suggest we wait to see what Corey comes up with in terms of
doing the stats load for all attributes in a single function call,
perhaps using the VALUES construct as you suggested up-thread, and then
we can contemplate if that's clean enough to work or if it's so grotty
that the better plan would be to do per-attribute function calls.  If it
ends up being the latter, then we can revisit this discussion and try to
answer some of the questions raised above.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [PATCH] Exponential backoff for auth_delay

2024-03-06 Thread Jacob Champion
On Wed, Mar 6, 2024 at 8:10 AM Nathan Bossart  wrote:
> Assuming you are referring to the case where we run out of free slots in
> acr_array, I'm not sure I see that as desirable behavior.  Once you run out
> of slots, all failed authentication attempts are now subject to the maximum
> delay, which is arguably a denial-of-service scenario, albeit not a
> particularly worrisome one.

Maybe I've misunderstood the attack vector, but I thought the point of
the feature was to deny service when the server is under attack. If we
don't deny service, what does the feature do?

And I may have introduced a red herring in talking about the number of
hosts, because an attacker operating from a single host is under no
obligation to actually wait for the authentication delay. Once we hit
some short timeout, we can safely assume the password is wrong,
abandon the request, and open up a new connection. It seems like the
thing limiting our attack is the number of connection slots, not
MAX_CONN_RECORDS. Am I missing something crucial?

--Jacob




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-06 Thread Alvaro Herrera
Docs: one bogus "that that".

Did we consider having PQcancelConn() instead be called
PQcancelCreate()?  I think this better conveys that what we're doing is
create an object that can be used to do something, and that nothing else
is done with it by default.  Also, the comment still says
"Asynchronously cancel a query on the given connection. This requires
polling the returned PGcancelConn to actually complete the cancellation
of the query." but this is no longer a good description of what this
function does.

Why do we return a non-NULL pointer from PQcancelConn in the first three
cases where we return errors?  (original conn was NULL, original conn is
PGINVALID_SOCKET, pqCopyPGconn returns failure)  Wouldn't it make more
sense to free the allocated object and return NULL?  Actually, I wonder
if there's any reason at all to return a valid pointer in any failure
cases; I mean, do we really expect that application authors are going to
read/report the error message from a PGcancelConn that failed to be fully
created?  Anyway, maybe there are reasons for this; but in any case we
should set ->cancelRequest in all cases, not only after the first tests
for errors.

I think the extra PGconn inside pg_cancel_conn is useless; it would be
simpler to typedef PGcancelConn to PGconn in fe-cancel.c, and remove the
indirection through the extra struct.  You're actually dereferencing the
object in two ways in the new code, both by casting the outer object
straight to PGconn (taking advantage that the struct member is first in
the struct), and by using PGcancelConn->conn.  This seems pointless.  I
mean, if we're going to cast to "PGconn *" in some places anyway, then
we may as well access all members directly.  Perhaps, if you want, you
could add asserts that ->cancelRequest is set true in all the
fe-cancel.c functions.  Anyway, we'd still have compiler support to tell
you that you're passing the wrong struct to the function.  (I didn't
actually try to change the code this way, so I might be wrong.)

We could move the definition of struct pg_cancel to fe-cancel.c.  Nobody
outside that needs to know that definition anyway.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"XML!" Exclaimed C++.  "What are you doing here? You're not a programming
language."
"Tell that to the people who use me," said XML.
https://burningbird.net/the-parable-of-the-languages/




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-06 Thread Jelte Fennema-Nio
On Wed, 6 Mar 2024 at 15:03, Denis Laxalde  wrote:
>
> In patch 0004, I noticed a couple of typos in the documentation; please
> find attached a fixup patch correcting these.

Thanks, applied.

> while not actually listing the "statuses". Should we list them?

I have now listed the relevant statuses and updated the PQcancelStatus
docs to look more like the PQstatus ones. I didn't list any statuses
that a cancel connection can never have (but a normal connection
can).

While going over the list of statuses possible for a cancel connection
I realized that the docs for PQconnectStart were not listing all
relevant statuses, so I fixed that in patch 0001.
From 40d3d9b0f4058bcf3041e63f71ce4c56e43e73f2 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Wed, 6 Mar 2024 18:33:49 +0100
Subject: [PATCH v32 1/3] Add missing connection statuses to docs

The list of connection statuses that PQstatus might return during an
asynchronous connection attempt was incorrect:

1. CONNECTION_SETENV is never returned anymore and is only part of the
   enum for backwards compatibility. So it's removed from the list.
2. CONNECTION_CHECK_STANDBY and CONNECTION_GSS_STARTUP were not listed.

This addresses those problems. CONNECTION_NEEDED and
CONNECTION_CHECK_TARGET are not listed in the docs on purpose, since
these states are internal states that can never be observed by a caller
of PQstatus.
---
 doc/src/sgml/libpq.sgml | 15 ---
 src/interfaces/libpq/libpq-fe.h |  3 ++-
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 1d8998efb2a..a2bbf33d029 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -428,11 +428,11 @@ PostgresPollingStatusType PQconnectPoll(PGconn *conn);
  
 
 
-
- CONNECTION_SETENV
+
+ CONNECTION_GSS_STARTUP
  
   
-   Negotiating environment-driven parameter settings.
+   Negotiating GSS encryption.
   
  
 
@@ -446,6 +446,15 @@ PostgresPollingStatusType PQconnectPoll(PGconn *conn);
  
 
 
+
+ CONNECTION_CHECK_STANDBY
+ 
+  
+   Checking if connection is to a server in standby mode.
+  
+ 
+
+
 
  CONNECTION_CONSUME
  
diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h
index defc415fa3f..1e5e7481a7c 100644
--- a/src/interfaces/libpq/libpq-fe.h
+++ b/src/interfaces/libpq/libpq-fe.h
@@ -77,7 +77,8 @@ typedef enum
 	CONNECTION_CHECK_WRITABLE,	/* Checking if session is read-write. */
 	CONNECTION_CONSUME,			/* Consuming any extra messages. */
 	CONNECTION_GSS_STARTUP,		/* Negotiating GSSAPI. */
-	CONNECTION_CHECK_TARGET,	/* Checking target server properties. */
+	CONNECTION_CHECK_TARGET,	/* Internal state: Checking target server
+ * properties. */
 	CONNECTION_CHECK_STANDBY	/* Checking if server is in standby mode. */
 } ConnStatusType;
 

base-commit: de7c6fe8347ab726c80ebbfcdb57f4b714d5243d
-- 
2.34.1

From 15d91bb8ca87764eee02d785126495df1f6ffd5f Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Thu, 14 Dec 2023 13:39:09 +0100
Subject: [PATCH v32 3/3] Start using new libpq cancel APIs

A previous commit introduced new APIs to libpq for cancelling queries.
This replaces the usage of the old APIs in most of the codebase with
these newer ones. This specifically leaves out changes to psql and
pgbench as those would need a much larger refactor to be able to call
them, due to the new functions not being signal-safe.
---
 contrib/dblink/dblink.c   |  30 +++--
 contrib/postgres_fdw/connection.c | 105 +++---
 .../postgres_fdw/expected/postgres_fdw.out|  15 +++
 contrib/postgres_fdw/sql/postgres_fdw.sql |   7 ++
 src/fe_utils/connect_utils.c  |  11 +-
 src/test/isolation/isolationtester.c  |  29 ++---
 6 files changed, 145 insertions(+), 52 deletions(-)

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 19a362526d2..8b4013c480a 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -1346,22 +1346,32 @@ PG_FUNCTION_INFO_V1(dblink_cancel_query);
 Datum
 dblink_cancel_query(PG_FUNCTION_ARGS)
 {
-	int			res;
 	PGconn	   *conn;
-	PGcancel   *cancel;
-	char		errbuf[256];
+	PGcancelConn *cancelConn;
+	char	   *msg;
 
 	dblink_init();
 	conn = dblink_get_named_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)));
-	cancel = PQgetCancel(conn);
+	cancelConn = PQcancelConn(conn);
 
-	res = PQcancel(cancel, errbuf, 256);
-	PQfreeCancel(cancel);
+	PG_TRY();
+	{
+		if (!PQcancelBlocking(cancelConn))
+		{
+			msg = pchomp(PQcancelErrorMessage(cancelConn));
+		}
+		else
+		{
+			msg = "OK";
+		}
+	}
+	PG_FINALLY();
+	{
+		PQcancelFinish(cancelConn);
+	}
+	PG_END_TRY();
 
-	if (res == 1)
-		PG_RETURN_TEXT_P(cstring_to_text("OK"));
-	else
-		PG_RETURN_TEXT_P(cstring_to_text(errbuf));
+	

Re: Remove unnecessary code from psql's watch command

2024-03-06 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Mar 05, 2024 at 10:05:52PM +0900, Yugo NAGATA wrote:
>> In the current code of do_watch(), sigsetjmp is called if WIN32
>> is defined, but siglongjmp is not called in the signal handler
>> in this condition. On Windows, currently, cancellation is checked
>> only by cancel_pressed, and  calling sigsetjmp in do_watch() is
>> unnecessary. Therefore, we can remove code around sigsetjmp in
>> do_watch(). I've attached the patch for this fix.

> Re-reading the top comment of sigint_interrupt_enabled, it looks like
> you're right here.  As long as we check for cancel_pressed there
> should be no need for any special cancellation handling here.

I don't have Windows here to test on, but does the WIN32 code
path work at all?  It looks to me like cancel_pressed becoming
true doesn't get us to exit the outer loop, only the inner delay
one, meaning that trying to control-C out of a \watch will just
cause it to repeat the command even faster.  That path isn't
setting the "done" variable, and it isn't checking it either,
because all of that only happens in the other #ifdef arm.

regards, tom lane




Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

2024-03-06 Thread Bertrand Drouvot
Hi,

On Wed, Mar 06, 2024 at 05:45:56PM +0530, Bharath Rupireddy wrote:
> On Wed, Mar 6, 2024 at 4:51 PM Michael Paquier  wrote:
> >
> > On Wed, Mar 06, 2024 at 09:17:58AM +, Bertrand Drouvot wrote:
> > > Right, somehow out of context here.
> >
> > We're not yet in the green yet, one of my animals has complained:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi=2024-03-06%2010%3A10%3A03
> >
> > This is telling us that the xmin horizon is unchanged, and the test
> > cannot move on with the injection point wake up that would trigger the
> > following logs:
> > 2024-03-06 20:12:59.039 JST [21143] LOG:  invalidating obsolete replication 
> > slot "injection_activeslot"
> > 2024-03-06 20:12:59.039 JST [21143] DETAIL:  The slot conflicted with xid 
> > horizon 770.
> >
> > Not sure what to think about that yet.
> 
> Windows - Server 2019, VS 2019 - Meson & ninja on my CI setup isn't
> happy about that as well [1]. It looks like the slot's catalog_xmin on
> the standby isn't moving forward.
> 

Thank you both for the report! I did a few tests manually and can see the issue
from time to time. When the issue occurs, the logical decoding was able to
get past the place where LogicalConfirmReceivedLocation() updates the
slot's catalog_xmin before being killed. In that case I can see that the
catalog_xmin is updated to the xid horizon.

That means in a failed test we have something like:

slot's catalog_xmin: 839 and "The slot conflicted with xid horizon 839." 

While when the test is ok you'll see something like:

slot's catalog_xmin: 841 and "The slot conflicted with xid horizon 842."

In the failing test the call to SELECT pg_logical_slot_get_changes() does
not advance the slot's catalog xmin anymore.

To fix this, I think we need a new transaction to decode from the primary before
executing pg_logical_slot_get_changes(). But this transaction has to be replayed
on the standby first by the startup process, which means we need to wake up
"terminate-process-holding-slot" and probably need another injection
point somewhere in this test.  Roughly, the idea is sketched below.
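
A rough sketch of the idea in plain SQL (the helper table is made up,
the slot name is the one from the test, and the real change would of
course live in the TAP test itself):

```
-- On the primary: generate a fresh transaction to decode.
CREATE TABLE decoding_bump (id int);        -- hypothetical helper table
INSERT INTO decoding_bump VALUES (1);

-- Once the standby's startup process has replayed that WAL, on the standby:
SELECT * FROM pg_logical_slot_get_changes('injection_activeslot', NULL, NULL);
-- The slot's catalog_xmin should then advance again.
```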

I'll look at it unless you've another idea?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: logical decoding and replication of sequences, take 2

2024-03-06 Thread Tomas Vondra
Hi,

Let me share a bit of an update regarding this patch and PG17. I have
discussed this patch and how to move it forward with a couple hackers
(both within EDB and outside), and my takeaway is that the patch is not
quite baked yet, not enough to make it into PG17 :-(

There are two main reasons / concerns leading to this conclusion:

* correctness of the decoding part

There are (were) doubts about decoding during startup, before the
snapshot gets consistent, when we can get "temporarily incorrect"
decisions about whether a change is transactional. While the behavior is
ultimately correct (we treat all changes as non-transactional and
discard them), it seems "dirty" and it's unclear to me if it might cause
more serious issues down the line (not necessarily bugs, but perhaps
making it harder to implement future changes).

* handling of sequences in built-in replication

Per the patch, sequences need to be added to the publication explicitly.
But there were suggestions we might (should) add certain sequences
automatically - e.g. sequences backing SERIAL/BIGSERIAL columns, etc.
I’m not sure we really want to do that, and so far I assumed we would
start with the manual approach and move to automatic addition in the
future. But the agreement seems to be it would be a pretty significant
"breaking change", and something we probably don’t want to do.


If anyone has an opinion on either of the two issues (either way), I'd
like to hear it.


Obviously, I'm not particularly happy about this outcome. And I'm also
somewhat cautious because this patch was already committed+reverted in
PG16 cycle, and doing the same thing in PG17 is not on my wish list.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovery modules

2024-03-06 Thread Nathan Bossart
Here is another rebase.  Given the size of this patch set and the lack of
review, I am going to punt this one to v18.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From dd69e5987b477818f60b202af5fb9b2603dc8acb Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 15 Feb 2023 14:28:53 -0800
Subject: [PATCH v18 1/5] introduce routine for checking mutually exclusive
 string GUCs

---
 src/backend/postmaster/pgarch.c |  8 +++-
 src/backend/utils/misc/guc.c| 22 ++
 src/include/utils/guc.h |  3 +++
 3 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f97035ca03..dcb5222bbe 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -815,11 +815,9 @@ LoadArchiveLibrary(void)
 {
 	ArchiveModuleInit archive_init;
 
-	if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("both archive_command and archive_library set"),
- errdetail("Only one of archive_command, archive_library may be set.")));
+	(void) CheckMutuallyExclusiveStringGUCs(XLogArchiveLibrary, "archive_library",
+			XLogArchiveCommand, "archive_command",
+			ERROR);
 
 	/*
 	 * If shell archiving is enabled, use our special initialization function.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index dd5a46469a..b5c8e3702a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2660,6 +2660,28 @@ ReportGUCOption(struct config_generic *record)
 	pfree(val);
 }
 
+/*
+ * If both parameters are set, emits a log message at 'elevel' and returns
+ * false.  Otherwise, returns true.
+ */
+bool
+CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name,
+ const char *p2val, const char *p2name,
+ int elevel)
+{
+	if (p1val[0] != '\0' && p2val[0] != '\0')
+	{
+		ereport(elevel,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cannot set both %s and %s", p1name, p2name),
+ errdetail("Only one of %s or %s may be set.",
+		   p1name, p2name)));
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Convert a value from one of the human-friendly units ("kB", "min" etc.)
  * to the given base unit.  'value' and 'unit' are the input value and unit
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 3712aba09b..3088ada610 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -376,6 +376,9 @@ extern void RestrictSearchPath(void);
 extern void AtEOXact_GUC(bool isCommit, int nestLevel);
 extern void BeginReportingGUCOptions(void);
 extern void ReportChangedGUCOptions(void);
+extern bool CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name,
+			 const char *p2val, const char *p2name,
+			 int elevel);
 extern void ParseLongOption(const char *string, char **name, char **value);
 extern const char *get_config_unit_name(int flags);
 extern bool parse_int(const char *value, int *result, int flags,
-- 
2.25.1

>From 83a71392a83b7d04d50568e7c5466630528503a5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 15 Feb 2023 10:36:00 -0800
Subject: [PATCH v18 2/5] refactor code for restoring via shell

---
 src/backend/Makefile  |   2 +-
 src/backend/access/transam/timeline.c |  12 +-
 src/backend/access/transam/xlog.c |  50 -
 src/backend/access/transam/xlogarchive.c  | 167 ---
 src/backend/access/transam/xlogrecovery.c |   3 +-
 src/backend/meson.build   |   1 +
 src/backend/postmaster/startup.c  |  16 +-
 src/backend/restore/Makefile  |  18 ++
 src/backend/restore/meson.build   |   5 +
 src/backend/restore/shell_restore.c   | 245 ++
 src/include/Makefile  |   2 +-
 src/include/access/xlogarchive.h  |   9 +-
 src/include/meson.build   |   1 +
 src/include/postmaster/startup.h  |   1 +
 src/include/restore/shell_restore.h   |  26 +++
 src/tools/pgindent/typedefs.list  |   1 +
 16 files changed, 407 insertions(+), 152 deletions(-)
 create mode 100644 src/backend/restore/Makefile
 create mode 100644 src/backend/restore/meson.build
 create mode 100644 src/backend/restore/shell_restore.c
 create mode 100644 src/include/restore/shell_restore.h

diff --git a/src/backend/Makefile b/src/backend/Makefile
index d66e2a4b9f..be00835995 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -19,7 +19,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access archive backup bootstrap catalog parser commands executor \
 	foreign lib libpq \
 	main nodes optimizer partitioning port postmaster \
-	regex replication rewrite \
+	regex replication restore rewrite \
 	statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
 	jit
 
diff --git 

Re: MERGE ... RETURNING

2024-03-06 Thread Dean Rasheed
On Wed, 6 Mar 2024 at 08:51, Peter Eisentraut  wrote:
>
> For comparison with standard SQL (see ):
>
> For an INSERT you could write
>
> SELECT whatever FROM NEW TABLE (INSERT statement here)
>
> or for an DELETE
>
> SELECT whatever FROM OLD TABLE (DELETE statement here)
>
> And for an UPDATE could can pick either OLD or NEW.
>

Thanks, that's very interesting. I hadn't seen that syntax before.

Over on [1], I have a patch in the works that extends RETURNING,
allowing it to return OLD.colname, NEW.colname, OLD.*, and NEW.*. It
looks like this new SQL standard syntax could be built on top of that
(perhaps by having the rewriter turn queries of the above form into
CTEs).

However, the RETURNING syntax is more powerful, because it allows OLD
and NEW to be used together in arbitrary expressions, for example:

RETURNING ..., NEW.val - OLD.val AS delta, ...
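
To make that concrete, here is a purely illustrative sketch combining
the pieces under discussion (the OLD/NEW RETURNING syntax from [1] plus
a merge_action()-style function); none of this is settled syntax, and
the table and column names are invented:

```
MERGE INTO accounts a
USING account_updates u ON a.id = u.id
WHEN MATCHED THEN
    UPDATE SET balance = u.balance
WHEN NOT MATCHED THEN
    INSERT (id, balance) VALUES (u.id, u.balance)
RETURNING merge_action(), a.id, NEW.balance - OLD.balance AS delta;
```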

> > The current implementation uses a special function MERGING (a
> > grammatical construct without an OID that parses into a new MergingFunc
> > expr), which takes keywords ACTION or CLAUSE_NUMBER in the argument
> > positions. That's not totally unprecedented in SQL -- the XML and JSON
> > functions are kind of similar. But it's different in the sense that
> > MERGING is also context-sensitive: grammatically, it fits pretty much
> > anywhere a function fits, but then gets rejected at parse analysis time
> > (or perhaps even execution time?) if it's not called from the right
> > place.
>
> An analogy here might be that MATCH_RECOGNIZE (row-pattern recognition)
> has a magic function MATCH_NUMBER() that can be used inside that clause.
>   So a similar zero-argument magic function might make sense.  I don't
> like the MERGING(ACTION) syntax, but something like MERGE_ACTION() might
> make sense.  (This is just in terms of what kind of syntax might be
> palatable.  Depending on where the syntax of the overall clause ends up,
> we might not need it (see above).)
>

It could be that having the ability to return OLD and NEW values, as
in [1], is sufficient for use in MERGE, to identify the action
performed. However, I still think that dedicated functions would be
useful, if we can agree on names/syntax.

I think that I prefer the names MERGE_ACTION() and
MERGE_CLAUSE_NUMBER() from an aesthetic point of view, but it requires
2 new COL_NAME_KEYWORD keywords. Maybe that's OK, I don't know.

Alternatively, we could avoid adding new keywords by going back to
making these regular functions, as they were in an earlier version of
this patch, and then use some special-case code during parse analysis
to turn them into MergeFunc nodes (not quite a complete revert back to
an earlier version of the patch, but not far off).

Regards,
Dean

[1] 
https://www.postgresql.org/message-id/flat/CAEZATCWx0J0-v=Qjc6gXzR=KtsdvAE7Ow=D=mu50agoe+pv...@mail.gmail.com




Re: Switching XLog source from archive to streaming when primary available

2024-03-06 Thread Nathan Bossart
On Wed, Mar 06, 2024 at 10:02:43AM +0530, Bharath Rupireddy wrote:
> On Wed, Mar 6, 2024 at 1:22 AM Nathan Bossart  
> wrote:
>> I was thinking of something more like
>>
>> typedef enum
>> {
>> NO_FORCE_SWITCH_TO_STREAMING,   /* no switch 
>> necessary */
>> FORCE_SWITCH_TO_STREAMING_PENDING,  /* exhausting pg_wal 
>> */
>> FORCE_SWITCH_TO_STREAMING,  /* switch to 
>> streaming now */
>> } WALSourceSwitchState;
>>
>> At least, that illustrates my mental model of the process here.  IMHO
>> that's easier to follow than two similarly-named bool variables.
> 
> I played with that idea and it came out very nice. Please see the
> attached v22 patch. Note that personally I didn't like "FORCE" being
> there in the names, so I've simplified them a bit.

Thanks.  I'd like to spend some time testing this, but from a glance, the
code appears to be in decent shape.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Exponential backoff for auth_delay

2024-03-06 Thread Nathan Bossart
On Tue, Mar 05, 2024 at 05:14:46PM -0800, Jacob Champion wrote:
> On Tue, Mar 5, 2024 at 1:51 PM Nathan Bossart  
> wrote:
>> I don't have a strong opinion about making this configurable, but I do
>> think we should consider making this a hash table so that we can set
>> MAX_CONN_RECORDS much higher.
> 
> I'm curious why? It seems like the higher you make MAX_CONN_RECORDS,
> the easier it is to put off the brute-force protection. (My assumption
> is that anyone mounting a serious attack is not going to be doing this
> from their own computer; they'll be doing it from many devices they
> don't own -- a botnet, or a series of proxies, or something.)

Assuming you are referring to the case where we run out of free slots in
acr_array, I'm not sure I see that as desirable behavior.  Once you run out
of slots, all failed authentication attempts are now subject to the maximum
delay, which is arguably a denial-of-service scenario, albeit not a
particularly worrisome one.

I also think the linear search approach artificially constrains the value of
MAX_CONN_RECORDS, so even if a user wanted to bump it up substantially for
their own build, they'd potentially begin noticing the effects of the O(n)
behavior.  AFAICT this is really the only reason this value is set so low
at the moment, as I don't think the memory usage or code complexity of a
hash table with thousands of slots would be too bad.  In fact, it might
even be simpler to use hash_search() instead of hard-coding linear searches
in multiple places.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add system identifier to backup manifest

2024-03-06 Thread Robert Haas
On Mon, Mar 4, 2024 at 2:47 PM Robert Haas  wrote:
> I don't have an enormously strong opinion on what the right thing to
> do is here either, but I am not convinced that the change proposed by
> Michael is an improvement. After all, that leaves us with the
> situation where we know the path to the control file in three
> different places. First, verify_backup_file() does a strcmp() against
> the string "global/pg_control" to decide whether to call
> verify_backup_file(). Then, verify_system_identifier() uses that
> string to construct a pathname to the file that it will be read. Then,
> get_controlfile() reconstructs the same pathname using it's own logic.
> That's all pretty disagreeable. Hard-coded constants are hard to avoid
> completely, but here it looks an awful lot like we're trying to
> hardcode the same constant into as many different places as we can.
> The now-dropped patch seems like an effort to avoid this, and while
> it's possible that it wasn't the best way to avoid this, I still think
> avoiding it somehow is probably the right idea.

So with that in mind, here's my proposal. This is an adjustment of
Amit's previous refactoring patch. He renamed the existing
get_controlfile() to get_dir_controlfile() and made a new
get_controlfile() that accepted the path to the control file itself. I
chose to instead leave the existing get_controlfile() alone and add a
new get_controlfile_by_exact_path(). I think this is better, because
most of the existing callers find it more convenient to pass the path
to the data directory rather than the path to the controlfile, so the
patch is smaller this way, and less prone to cause headaches for
people back-patching or maintaining out-of-core code. But it still
gives us a way to avoid repeatedly constructing the same pathname.

If nobody objects, I plan to commit this version.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v10-0001-Expose-new-function-get_controlfile_by_exact_pat.patch
Description: Binary data


Re: Reducing the log spam

2024-03-06 Thread Laurenz Albe
On Wed, 2024-03-06 at 17:09 +0300, Aleksander Alekseev wrote:
> I like the idea of suppressing certain log messages in general, but
> the particular user interface doesn't strike me as an especially
> convenient one.
> 
> Firstly I don't think many people remember sqlstates and what 3F000
> stands for. IMO most users don't know such a thing exists. Secondly,
> whether we should list sqlstates to suppress or the opposite - list
> the states that shouldn't be suppressed, is a debatable question. Last
> but not least, it's not quite clear whether PostgreSQL core is the
> right place for implementing this functionality. For instance, one
> could argue that the log message should just contain sqlstate and be
> directed to |grep instead.
> 
> I suspect this could be one of "there is no one size fits all"
> situations. The typical solution in such cases is to form a structure
> containing the log message and its attributes and submit this
> structure to a registered hook of a pluggable logging subsystem. This
> would be the most flexible approach. It will allow not only filtering
> the messages but also using binary logging, JSON logging, logging to
> external systems like Loki instead of a given text file, etc.
> 
> I don't think we currently have this in the core, but maybe I just missed it.

The target would not primarily be installations where people configure
nifty logging software to filter logs (those people know how to deal
with log spam), but installations where people don't even know enough
to configure "shared_buffers".  So I'd like something that is part of
core and reduces spam without the user needing to configure anything.

I am somewhat worried that people will come up with all kinds of
justified but complicated wishes for such a feature:

- an option to choose whether to include or to exclude certain errors
- be able to configure that certain errors be logged on FATAL, but
  not on ERROR
- allow exception names in addition to SQL states
- have wildcards for exception names
- ...

I would like to write a simple patch that covers the basic functionality
I described, provided enough people find it useful.  That does not
exclude the option for future extensions for this feature.

Yours,
Laurenz Albe




Logging which interface was connected to in log_line_prefix

2024-03-06 Thread Greg Sabino Mullane
Someone on -general was asking about this, as they are listening on
multiple IPs and would like to know exactly which one clients were hitting. I
took a quick look and we already have that information, so I grabbed some
stuff from inet_server_addr and added it as a new "%L" escape (for 'local
interface'). Quick patch / POC attached.
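
For example, assuming the escape ends up being %L as in this POC, it
could be used like this (the prefix shown is only illustrative):

```
ALTER SYSTEM SET log_line_prefix = '%m [%p] %L ';
SELECT pg_reload_conf();
-- Subsequent log lines would then include the local address the
-- client connected to.
```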

Cheers,
Greg


add_local_interface_to_log_line_prefix.v1.patch
Description: Binary data


Re: Reducing the log spam

2024-03-06 Thread Greg Sabino Mullane
On Tue, Mar 5, 2024 at 7:55 AM Laurenz Albe 
wrote:

> My experience from the field is that a lot of log spam looks like
>
>   database/table/... "xy" does not exist
>   duplicate key value violates unique constraint "xy"


Forcibly hiding those at the Postgres level seems a heavy hammer for what
is ultimately an application problem.

Tell me about a system that logs different classes of errors to different
log files, and I'm interested again.

Cheers,
Greg


Re: remaining sql/json patches

2024-03-06 Thread Tomas Vondra



On 3/6/24 12:58, Himanshu Upadhyaya wrote:
> On Tue, Mar 5, 2024 at 6:52 AM Amit Langote  wrote:
> 
> Hi,
> 
> I am doing some random testing with the latest patch and found one scenario
> that I wanted to share.
> consider a below case.
> 
> ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{
>  "id" : 12345678901,
>  "FULL_NAME" : "JOHN DOE"}',
> '$'
> COLUMNS(
>  name varchar(20) PATH 'lax $.FULL_NAME',
>  id int PATH 'lax $.id'
>   )
>)
> ;
> ERROR:  22003: integer out of range
> LOCATION:  numeric_int4_opt_error, numeric.c:4385
> ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{
>  "id" : "12345678901",
>  "FULL_NAME" : "JOHN DOE"}',
> '$'
> COLUMNS(
>  name varchar(20) PATH 'lax $.FULL_NAME',
>  id int PATH 'lax $.id'
>   )
>)
> ;
>name   | id
> --+
>  JOHN DOE |
> (1 row)
> 
> The first query throws an error that the integer is "out of range" and is
> quite expected but in the second case(when the value is enclosed with ") it
> is able to process the JSON object but does not return any relevant
> error(in fact processes the JSON but returns it with empty data for "id"
> field). I think second query should fail with a similar error.
> 

I'm pretty sure this is the correct & expected behavior. The second
query treats the value as a string (because that's what should happen
for values in double quotes).
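
A quick way to see the distinction directly (runnable as-is):

```
SELECT jsonb_typeof(jsonb '{"id": 12345678901}'  -> 'id');   -- number
SELECT jsonb_typeof(jsonb '{"id": "12345678901"}' -> 'id');  -- string
```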

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

2024-03-06 Thread Cédric Villemain

Hi Nazir,


thank you for your review. I comment below.


On 05/03/2024 12:07, Nazir Bilal Yavuz wrote:

2. The second one does implement smgrprefetch with range and loops by
default per segment to still have a check for interrupts.

It looks good codewise but RELSEG_SIZE is too big to prefetch. Man
page of posix_fadvise [1] states that: "The amount of data read may be
decreased by the kernel depending on virtual memory load. (A few
megabytes will usually be fully satisfied, and more is rarely
useful.)". It is trying to prefetch 1GB data now. That could explain
your observation about differences between nr_cache numbers.


From an "adminsys" point of view I will find beneficial to get a single 
syscall per file, respecting the logic and behavior of underlying system 
call.


The behavior is 100% OK, and in fact it might be a bad idea to prefetch 
block by block, as the result is just to put more pressure on a system 
that is already under pressure.


Still, there are use cases, and it's nice to be able to do that at the 
per-page level too.


Regarding [1], that is a very old statement about resources, and Linux 
manages part of the problem for us here, I think [2]:


/*
 * Chunk the readahead into 2 megabyte units, so that we don't pin too much
 * memory at once.
 */
void force_page_cache_ra()


Q: posix_fadvise may not work exactly the way you think it does, or does
it ?


In detail, and for the question:

However, if instead you provide a real range, or the magic len=0 to
posix_fadvise, then blocks are "more" loaded according to effective vm
pressure (which is not the case in the previous example).
As a result only a small part of the relation might be loaded, and this
is probably not what end-users expect, despite it probably being a good
choice (you can still free the cache beforehand to help the kernel).


I think it's a matter of documenting the feature well and, if at all 
possible, as usual, not letting users be negatively impacted by default.




An example, below I'm using vm_relation_cachestat() which provides linux
cachestat output, and vm_relation_fadvise() to unload cache, and
pg_prewarm for the demo:

# clear cache: (nr_cache is the number of file system pages in cache,
not postgres blocks)

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-+-+--+--
0 |   32768 |65536 |0
32768 |   32768 |65536 |0
65536 |   32768 |65536 |0
98304 |   32768 |65536 |0
   131072 |1672 | 3344 |0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm

  132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-+-+--+--
0 |   32768 |65536 |  320
32768 |   32768 |65536 |0
65536 |   32768 |65536 |0
98304 |   32768 |65536 |0
   131072 |1672 | 3344 |  320  <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral
pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-+-+--+--
0 |   32768 |65536 |65536
32768 |   32768 |65536 |65536
65536 |   32768 |65536 |65536
98304 |   32768 |65536 |65536
   131072 |1672 | 3344 | 3344

```

The duration of the last example is also really significant: full
relation is 0.3ms and block by block is 1550ms!
You might think it's because of generate_series or whatever, but I have
the exact same behavior with pgfincore.
I can compare loading and unloading duration for similar "async" work;
here each call is from block 0 with a len of 132744 and a range of 1 block
(i.e. posix_fadvise on 8kB at a time).
So they have exactly the same number of operations doing DONTNEED or
WILLNEED, but distinct durations on the first "load":

```

postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
vm_relation_fadvise
-

(1 row)

Time: 25.202 ms
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
-

(1 row)

Time: 1523.636 ms (00:01.524) <- not free !
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
-


Re: Potential issue in ecpg-informix decimal converting functions

2024-03-06 Thread Daniel Gustafsson
> On 27 Feb 2024, at 06:08, Michael Paquier  wrote:
> 
> On Mon, Feb 26, 2024 at 12:28:51AM +0100, Daniel Gustafsson wrote:
>> Yeah, I think this is for HEAD only, especially given the lack of complaints
>> against backbranches.
> 
> Daniel, are you planning to look at that?  I haven't done any detailed
> lookup, but would be happy to do so it that helps.

I had a look at this today and opted for trimming back the patch a bit.
Reading the Informix docs, the functions we are mimicking for compatibility here
do not have an underflow return value, so adding one doesn't seem right (or
helpful).  The attached fixes the overflow return value and leaves it at that,
which makes it possible to backpatch since it fixes the code to match the
documented behavior.

--
Daniel Gustafsson



v3-0001-ecpg-Fix-return-code-for-overflow-in-numeric-conv.patch
Description: Binary data


Re: [PoC] Reducing planning time when tables have many partitions

2024-03-06 Thread Ashutosh Bapat
Hi Yuya

On Wed, Feb 28, 2024 at 4:48 PM Yuya Watari  wrote:

> Hello,
>
> On Tue, Feb 13, 2024 at 6:19 AM Alena Rybakina 
> wrote:
> >
> > Yes, it is working correctly now with the assertion check. I suppose
> > it's better to add this code with an additional comment and a
> > recommendation for other developers
> > to use it for checking in case of manipulations with the list of
> > equivalences.
>
> Thank you for your reply and advice. I have added this assertion so
> that other developers can use it in the future.
>
> I also merged recent changes and attached a new version, v24. Since
> this thread is getting long, I will summarize the patches.
>
>
>
I repeated my experiments in [1]. I ran 2, 3, 4, 5-way self-joins on a
partitioned table with 1000 partitions.

Planning time measurement
---
Without patch with an assert enabled build and enable_partitionwise_join =
false, those joins took 435.31 ms, 1629.16 ms, 4701.59 ms and 11976.69 ms
respectively.
Keeping other things the same, with the patch, they took 247.33 ms, 1318.57
ms, 6960.31 ms and 28463.24 ms respectively.
Those with enable_partitionwise_join = true are 488.73 ms, 2102.12 ms,
6906.02 ms and 21300.77 ms respectively without the patch.
And with the patch, 277.22 ms, 1542.48 ms, 7879.35 ms, and 31826.39 ms.

Without patch without assert enabled build and enable_partitionwise_join =
false, the joins take 298.43 ms, 1179.15 ms, 3518.84 ms and 9149.76 ms
respectively.
Keeping other things the same, with the patch, the joins take 65.70 ms,
131.29 ms, 247.67 ms and 477.74 ms respectively.
Those with enable_partitionwise_join = true are 348.48 ms, 1576.11 ms,
5417.98 and 17433.65 ms respectively without the patch.
And with the patch 95.15 ms, 333.99 ms, 1084.06 ms, and 3609.42 ms.

Memory usage measurement
---
Without patch, with an assert enabled build and enable_partitionwise_join =
false, memory used is 19 MB, 45 MB, 83 MB and 149 MB respectively.
Keeping other things the same, with the patch, memory used is 23 MB, 66 MB,
159 MB and 353 MB respectively.
That with enable_partitionwise_join = true is 40 MB, 151 MB, 464 MB and
1663 MB respectively.
And with the patch it is 44 MB, 172 MB, 540 MB and 1868 MB respectively.

Without patch without assert enabled build and enable_partitionwise_join =
false, memory used is 17 MB, 41 MB, 77 MB, and 140 MB resp.
Keeping other things the same with the patch memory used is 21 MB, 62 MB,
152 MB and 341 MB resp.
That with enable_partitionwise_join = true is 37 MB, 138 MB, 428 MB and
1495 MB resp.
And with the patch it is 42 MB, 160 MB, 496 MB and 1705 MB resp.

Here's a summary of the observations:
1. The patch improves planning time significantly (3X to 20X) and the
improvement increases with the number of tables being joined.
2. In the assert-enabled build the patch slows down planning (in
comparison to HEAD) with a higher number of tables in the join. You may
want to investigate this. But this is still better than my earlier
measurements.
3. The patch increased the memory consumed by the planner. The numbers
have improved since my last measurement, but it would still be good to
investigate what causes this extra memory consumption.
4. Generally, with the assert-enabled build the planner consumes more
memory with or without the patch. From my previous experience this might
be due to Bitmapset objects created within Assert() calls.

Does v24-0002 have any relation/overlap with my patches to reduce memory
consumed by RestrictInfos? Those patches have code to avoid creating
duplicate RestrictInfos (including commuted RestrictInfos) from ECs. [2]

[1]
https://www.postgresql.org/message-id/CAExHW5uVZ3E5RT9cXHaxQ_DEK7tasaMN=d6rphcao5gcxan...@mail.gmail.com
[2]
https://www.postgresql.org/message-id/CAExHW5tEvzM%3D%2BLpN%3DyhU%2BP33D%2B%3D7x6fhzwDwNRM971UJunRTkQ%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat


  1   2   >