Re: proposal: psql: psql variable BACKEND_PID

2023-02-05 Thread Pavel Stehule
On Mon, Feb 6, 2023 at 0:35, Corey Huinker wrote:

> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           tested, passed
> Documentation:            tested, passed
>
> A small but helpful feature.
>
> The new status of this patch is: Ready for Committer
>

Thank you very much

Pavel


Re: proposal: psql: psql variable BACKEND_PID

2023-02-05 Thread Pavel Stehule
On Mon, Feb 6, 2023 at 0:25, Corey Huinker wrote:

>
>>>
>>> Clearly, it is hard to write a regression test for an externally
>>> volatile value. `SELECT sign(:BACKEND_PID)` would technically do the job,
>>> if we're striving for completeness.
>>>
>>
>> I did a simple test: :BACKEND_PID should be equal to pg_backend_pid()
>>
>>
>
> Even better.
>
>
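
FWIW, the check used in the test boils down to something like this (a
sketch; the committed test may differ in details):

  SELECT :BACKEND_PID = pg_backend_pid() AS is_same_pid;
  -- expected: t, without depending on the volatile PID value
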
>>
>>>
>>> Do we want to change %p to pull from this variable and save the
>>> snprintf()? Not a measurable savings, more of a don't-repeat-yourself thing.
>>>
>>
>> I checked the code, and I don't think so. The current state is safer (I
>> think). psql's variables are not protected, so it is safer to read the
>> value for the prompt directly via the libpq API instead of reading the
>> possibly "customized" variable. I see a possible inconsistency, but the
>> same inconsistency can exist for the variables USER and DBNAME too, and
>> I am not able to say which is significantly better. The prompt simply
>> shows the real value, and the related variable is more or less a copy
>> taken at connection time.
>>
>> I am not 100% sure what is better in this area, but the change requires
>> a wider and more general discussion, and I don't think the benefits of a
>> possible change are enough. There are just two possible solutions - we
>> can protect some of psql's variables (which can cause compatibility
>> issues), or we accept that the value in the prompt can be fake. It is
>> better not to touch it :-).
>>
>
> I agree it is out of scope for this patch, but I like the idea of protected
> psql variables, and I doubt users are intentionally re-using these vars to
> any positive effect. The more likely case is that newer psql vars just
> happen to use names chosen by somebody's script in the past.
>

bash variables are not protected either. I know this is a different
context and a different architecture. It could be a very simple patch, but
it needs wider discussion. These variables should probably be immutable -
that is safer, and I do not currently have any useful use case for their
mutability.
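
To illustrate the difference (a sketch, assuming the variable stays
unprotected as it is today):

  \set PROMPT1 '%p=# '
  \set BACKEND_PID 12345
  SELECT :BACKEND_PID = pg_backend_pid();
  -- returns f now, while %p in the prompt still shows the real PID,
  -- because the prompt reads it through the libpq API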

Regards

Pavel



>
>
>>
>> done
>>
>>
>>
> Everything passes. Docs look good. Test looks good.
>


Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl

2023-02-05 Thread Michael Paquier
On Sun, Feb 05, 2023 at 12:56:58AM +0530, Nitin Jadhav wrote:
> Ok. Understood the other problems. I have attached the v2 patch, which
> uses the idea present in Michael's patch. In addition, I have removed
> fetching NO_SHOW_ALL parameters while creating the tab_settings_flags
> table in guc.sql, and adjusted the test which checks for (NO_RESET_ALL
> AND NOT NO_SHOW_ALL), as it could mislead a developer into thinking
> that the tab_settings_flags table has NO_SHOW_ALL parameters, which is
> incorrect.

Okay, the part that adds an initialization check for GUC_NO_SHOW_ALL and
GUC_NOT_IN_SAMPLE looked fine to me, so I have applied it after some more
comment polishing.

+-- NO_RESET_ALL can be specified without NO_SHOW_ALL, like transaction_*.
+-- tab_settings_flags does not contain NO_SHOW_ALL flags. Just checking for
+-- NO_RESET_ALL implies NO_RESET_ALL AND NOT NO_SHOW_ALL.
 SELECT name FROM tab_settings_flags
-  WHERE NOT no_show_all AND no_reset_all
+  WHERE no_reset_all
   ORDER BY 1;

Removing no_show_all entirely is fine by me, but keeping this SQL then
makes little sense, because it would include any GUCs loaded from an
external source that define NO_RESET_ALL.  I think that 0001
should instead be like the attached, backpatched down to 15 (release
week, so it cannot be touched until the next version is stamped),
where we just remove all the checks based on no_show_all.

On top of that, I have noticed an extra combination that would not
make sense and that could be checked with the SQL queries:
GUC_DISALLOW_IN_FILE implies GUC_NOT_IN_SAMPLE.  The opposite may not
be true, though, as some developer GUCs are marked as
GUC_NOT_IN_SAMPLE but are allowed in a file.  The only exception
to that currently is config_file.  It is just a special case whose
value is enforced at startup and can be passed down as an option
switch via the postgres binary, but still it seems like it would be
better to also mark it as GUC_NOT_IN_SAMPLE.  This is done in 0002,
only for HEAD, as that would be a new check.
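
For illustration, a sketch of the kind of query 0002 would add, assuming
tab_settings_flags grows a disallow_in_file column built like the
existing flag columns:

-- GUC_DISALLOW_IN_FILE implies GUC_NOT_IN_SAMPLE, so this should return
-- no rows (once config_file is special-cased).
SELECT name FROM tab_settings_flags
  WHERE disallow_in_file AND NOT not_in_sample
  ORDER BY 1;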

Thoughts?
--
Michael
>From 727361c5fe4c9af16ef66c1090a9e376dfc891d2 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 6 Feb 2023 16:05:11 +0900
Subject: [PATCH v3 1/2] Clean up SQL tests for NO_SHOW_ALL

---
 src/test/regress/expected/guc.out | 28 
 src/test/regress/sql/guc.sql  | 13 -
 2 files changed, 41 deletions(-)

diff --git a/src/test/regress/expected/guc.out b/src/test/regress/expected/guc.out
index 2914b950b4..127c953297 100644
--- a/src/test/regress/expected/guc.out
+++ b/src/test/regress/expected/guc.out
@@ -841,7 +841,6 @@ CREATE TABLE tab_settings_flags AS SELECT name, category,
 'EXPLAIN'  = ANY(flags) AS explain,
 'NO_RESET' = ANY(flags) AS no_reset,
 'NO_RESET_ALL' = ANY(flags) AS no_reset_all,
-'NO_SHOW_ALL'  = ANY(flags) AS no_show_all,
 'NOT_IN_SAMPLE'= ANY(flags) AS not_in_sample,
 'RUNTIME_COMPUTED' = ANY(flags) AS runtime_computed
   FROM pg_show_all_settings() AS psas,
@@ -880,33 +879,6 @@ SELECT name FROM tab_settings_flags
 --
 (0 rows)
 
--- NO_SHOW_ALL implies NO_RESET_ALL, and vice-versa.
-SELECT name FROM tab_settings_flags
-  WHERE no_show_all AND NOT no_reset_all
-  ORDER BY 1;
- name 
---
-(0 rows)
-
--- Exceptions are transaction_*.
-SELECT name FROM tab_settings_flags
-  WHERE NOT no_show_all AND no_reset_all
-  ORDER BY 1;
-  name  
-
- transaction_deferrable
- transaction_isolation
- transaction_read_only
-(3 rows)
-
--- NO_SHOW_ALL implies NOT_IN_SAMPLE.
-SELECT name FROM tab_settings_flags
-  WHERE no_show_all AND NOT not_in_sample
-  ORDER BY 1;
- name 
---
-(0 rows)
-
 -- NO_RESET implies NO_RESET_ALL.
 SELECT name FROM tab_settings_flags
   WHERE no_reset AND NOT no_reset_all
diff --git a/src/test/regress/sql/guc.sql b/src/test/regress/sql/guc.sql
index d40d0c178f..dc79761955 100644
--- a/src/test/regress/sql/guc.sql
+++ b/src/test/regress/sql/guc.sql
@@ -326,7 +326,6 @@ CREATE TABLE tab_settings_flags AS SELECT name, category,
 'EXPLAIN'  = ANY(flags) AS explain,
 'NO_RESET' = ANY(flags) AS no_reset,
 'NO_RESET_ALL' = ANY(flags) AS no_reset_all,
-'NO_SHOW_ALL'  = ANY(flags) AS no_show_all,
 'NOT_IN_SAMPLE'= ANY(flags) AS not_in_sample,
 'RUNTIME_COMPUTED' = ANY(flags) AS runtime_computed
   FROM pg_show_all_settings() AS psas,
@@ -349,18 +348,6 @@ SELECT name FROM tab_settings_flags
 SELECT name FROM tab_settings_flags
   WHERE category = 'Preset Options' AND NOT not_in_sample
   ORDER BY 1;
--- NO_SHOW_ALL implies NO_RESET_ALL, and vice-versa.
-SELECT name FROM tab_settings_flags
-  WHERE no_show_all AND NOT no_reset_all
-  ORDER BY 1;
--- Exceptions are transaction_*.
-SELECT name FROM tab_settings_flags
-  WHERE NOT no_show_all AND no_reset_all
-  ORDER BY 1;
--- NO_SHOW_ALL implies NOT_IN_SAMPLE.
-SELECT name FROM tab_settings_flags
-  WHERE no_show_all AND NOT not_in_sample
-  ORDER BY 

RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-05 Thread Takamichi Osumi (Fujitsu)
On Monday, February 6, 2023 12:03 PM Peter Smith  wrote:
> On Sat, Feb 4, 2023 at 5:04 PM Takamichi Osumi (Fujitsu)
>  wrote:
> >
> ...
> >
> > Kindly have a look at the attached v27.
> >
> 
> Here are some review comments for patch v27-0001.
Thanks for checking !

> ==
> src/test/subscription/t/032_apply_delay.pl
> 
> 1.
> +# Confirm the time-delayed replication has been effective from the server log
> +# message where the apply worker emits for applying delay. Moreover, verify
> +# that the current worker's remaining wait time is sufficiently bigger than the
> +# expected value, in order to check any update of the min_apply_delay.
> +sub check_apply_delay_log
> 
> ~
> 
> "has been effective from the server log" --> "worked, by inspecting the server
> log"
Sounds good to me. Also, this is a part unique to time-delayed logical
replication, so we can word it as we want. Fixed.


> ~~~
> 
> 2.
> +my $delay   = 3;
> 
> Might be better to name this variable as 'min_apply_delay'.
I named this variable by following the test of recovery_min_apply_delay
(src/test/recovery/005_replay_delay.pl). So, this is aligned
with the test and I'd like to keep it as it is.


> ~~~
> 
> 3.
> +# Now wait for replay to complete on publisher. We're done waiting when the
> +# subscriber has applyed up to the publisher LSN.
> +$node_publisher->wait_for_catchup($appname);
> 
> 3a.
> Something seemed wrong with the comment.
> 
> Was it meant to say more like? "The publisher waits for the replication to
> complete".
> 
> Typo: "applyed"
Your wording looks better than mine. Fixed.


> ~
> 
> 3b.
> Instead of doing this wait_for_catchup stuff why don't you just use a
> synchronous pub/sub and then the publication will just block internally like
> you require but without you having to block using test code?
This is the style of 005_replay_delay.pl, so this test is aligned with it,
and I'd like to keep the current way of comparing times as it is.

Even if we could omit wait_for_catchup(), there would be new code for
synchronous replication, and that would make the min_apply_delay tests
differ more from the corresponding recovery test. Note that if we used
the synchronous mode, we would need to turn it off for the last
ALTER SUBSCRIPTION DISABLE test case, which sets min_apply_delay to
1 day 5 min, and execute one more record insert after that. This would
make the tests confusing.
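
For reference, a sketch of that last scenario (the subscription name is
assumed, not taken from the test):

  ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = '1 day 5 min');
  ALTER SUBSCRIPTION tap_sub DISABLE;
  -- a row inserted on the publisher now must not appear on the subscriber
  INSERT INTO tab_int VALUES (0);
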
> ~~~
> 
> 4.
> +# Run a query to make sure that the reload has taken effect.
> +$node_publisher->safe_psql('postgres', q{SELECT 1});
> 
> SUGGESTION (for the comment)
> # Running a dummy query causes the config to be reloaded.
Fixed.


> ~~~
> 
> 5.
> +# Confirm the record is not applied expectedly
> +my $result = $node_subscriber->safe_psql('postgres',
> + "SELECT count(a) FROM tab_int WHERE a = 0;");
> +is($result, qq(0), "check the delayed transaction was not applied");
> 
> "expectedly" ??
> 
> SUGGESTION (for comment)
> # Confirm the record was not applied
Fixed.



Best Regards,
Takamichi Osumi



v28-0001-Time-delayed-logical-replication-subscriber.patch
Description:  v28-0001-Time-delayed-logical-replication-subscriber.patch


Re: Exit walsender before confirming remote flush in logical replication

2023-02-05 Thread Amit Kapila
On Mon, Feb 6, 2023 at 10:33 AM Andres Freund  wrote:
>
> On February 5, 2023 8:29:19 PM PST, Amit Kapila  
> wrote:
> >>
> >> But that seems too narrow a view to me. Imagine you want to decommission
> >> the current primary, and instead start to use the logical standby as the
> >> primary. For that you'd obviously want to replicate the last few
> >> changes. But with the proposed change, that'd be hard to ever achieve.
> >>
> >
> >I think that can still be achieved with the idea being discussed, which
> >is to keep allowing sending the WAL in smart shutdown mode but not
> >in other modes (fast, immediate). I don't know whether it is a good
> >idea or not, but Kuroda-San has produced a POC patch for it. We can
> >instead choose to improve our docs related to shutdown to explain a
> >bit more about the shutdown's interaction with (logical and physical)
> >replication. As of now, it says: (“Smart” mode disallows new
> >connections, then waits for all existing clients to disconnect. If the
> >server is in hot standby, recovery and streaming replication will be
> >terminated once all clients have disconnected.)[2]. Here, it is not
> >clear that shutdown will wait for all the WAL to be sent and flushed.
> >The information for fast and immediate modes is even sparser, which
> >makes it more difficult to understand what kind of behavior is
> >expected in those modes.
> >
> >[1] - https://commitfest.postgresql.org/42/3581/
> >[2] - https://www.postgresql.org/docs/devel/app-pg-ctl.html
> >
>
> Smart shutdown is practically unusable. I don't think it makes sense to tie 
> behavior of walsender to it in any way.
>

So, we have the following options: (a) do nothing for this; (b)
clarify the current behavior in docs. Any suggestions?

-- 
With Regards,
Amit Kapila.




Re: Add progress reporting to pg_verifybackup

2023-02-05 Thread Masahiko Sawada
On Mon, Feb 6, 2023 at 2:45 PM Michael Paquier  wrote:
>
> On Mon, Feb 06, 2023 at 12:27:51PM +0900, Masahiko Sawada wrote:
> > I thought that too, but I thought it's better to ignore it, instead of
> > erroring out. For example, we can specify both --disable and
> > --progress options to pg_checksum commands, but we don't write any
> > progress information in this case.
>
> Well, if you don't feel strongly about that, that's fine by me as
> well, so I have applied your v3 with the tweaks I posted previously,
> without the restriction on --skip-checksums.

Thank you!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-02-05 Thread Tom Lane
Andres Freund  writes:
> On February 5, 2023 9:12:17 PM PST, Tom Lane  wrote:
>> Damir Belyalov  writes:
>>> InputFunctionCallSafe() is good for detecting errors from input functions,
>>> but there are errors from NextCopyFrom() that cannot be detected
>>> with InputFunctionCallSafe(), e.g. "wrong number of columns in row".

>> If you want to deal with those, then there's more work to be done to make
>> those bits non-error-throwing.  But there's a very finite amount of code
>> involved and no obvious reason why it couldn't be done.

> I'm not even sure it makes sense to avoid that kind of error. An
> invalid column count or such is something quite different from failing
> some data type input routine, or failing a constraint.

I think it could be reasonable to put COPY's overall-line-format
requirements on the same level as datatype input format violations.
I agree that trying to trap every kind of error is a bad idea,
for largely the same reason that the soft-input-errors patches
only trap certain kinds of errors: it's too hard to tell whether
an error is an "internal" error that it's scary to continue past.
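
To make the distinction concrete (a hedged sketch, not taken from any
patch in this thread):

CREATE TEMP TABLE t (a int, b int);

COPY t FROM stdin WITH (FORMAT csv);
1,x
\.
-- fails in the int4 input function: a datatype input format violation

COPY t FROM stdin WITH (FORMAT csv);
1
\.
-- fails with "missing data for column": an overall-line-format violation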

regards, tom lane




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-02-05 Thread Andres Freund
Hi, 

On February 5, 2023 9:12:17 PM PST, Tom Lane  wrote:
>Damir Belyalov  writes:
>>> I don't think this is the right approach. Creating a subtransaction for
>>> each row will cause substantial performance issues.
>
>> Subtransactions aren't created for each row. The block of rows in one
>> subtransaction is 1000 (SAFE_BUFFER_SIZE) and can be changed.
>
>I think that at this point, any patch that involves adding subtransactions
>to COPY is dead on arrival; whether it's batched or not is irrelevant.
>(It's not like batching has no downsides.)

Indeed.

>> InputFunctionCallSafe() is good for detecting errors from input functions,
>> but there are errors from NextCopyFrom() that cannot be detected
>> with InputFunctionCallSafe(), e.g. "wrong number of columns in row".
>
>If you want to deal with those, then there's more work to be done to make
>those bits non-error-throwing.  But there's a very finite amount of code
>involved and no obvious reason why it couldn't be done.  The major problem
>here has always been the indefinite amount of code implicated by calling
>datatype input functions, and we have now created a plausible answer to
>that problem.

I'm not even sure it makes sense to avoid that kind of error. An invalid
column count or such is something quite different from failing some data type
input routine, or failing a constraint.



-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Add progress reporting to pg_verifybackup

2023-02-05 Thread Michael Paquier
On Mon, Feb 06, 2023 at 12:27:51PM +0900, Masahiko Sawada wrote:
> I thought that too, but I thought it's better to ignore it, instead of
> erroring out. For example, we can specify both --disable and
> --progress options to pg_checksum commands, but we don't write any
> progress information in this case.

Well, if you don't feel strongly about that, that's fine by me as
well, so I have applied your v3 with the tweaks I posted previously,
without the restriction on --skip-checksums.
--
Michael


signature.asc
Description: PGP signature


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-02-05 Thread Tom Lane
Damir Belyalov  writes:
>> I don't think this is the right approach. Creating a subtransaction for
>> each row will cause substantial performance issues.

> Subtransactions aren't created for each row. The block of rows in one
> subtransaction is 1000 (SAFE_BUFFER_SIZE) and can be changed.

I think that at this point, any patch that involves adding subtransactions
to COPY is dead on arrival; whether it's batched or not is irrelevant.
(It's not like batching has no downsides.)

> InputFunctionCallSafe() is good for detecting errors from input functions,
> but there are errors from NextCopyFrom() that cannot be detected
> with InputFunctionCallSafe(), e.g. "wrong number of columns in row".

If you want to deal with those, then there's more work to be done to make
those bits non-error-throwing.  But there's a very finite amount of code
involved and no obvious reason why it couldn't be done.  The major problem
here has always been the indefinite amount of code implicated by calling
datatype input functions, and we have now created a plausible answer to
that problem.

regards, tom lane




Re: Exit walsender before confirming remote flush in logical replication

2023-02-05 Thread Andres Freund
Hi, 

On February 5, 2023 8:29:19 PM PST, Amit Kapila  wrote:
>On Sat, Feb 4, 2023 at 6:31 PM Andres Freund  wrote:
>>
>> On 2023-02-02 11:21:54 +0530, Amit Kapila wrote:
>> > The main problem we want to solve here is to avoid shutdown failing in
>> > case walreceiver/applyworker is busy waiting for some lock or for some
>> > other reason as shown in the email [1].
>>
>> Isn't handling this part of the job of wal_sender_timeout?
>>
>
>In some cases, it is not clear whether we can handle it by
>wal_sender_timeout. Consider the case of a time-delayed replica, where
>the applyworker will keep sending response/alive messages so that the
>walsender doesn't time out during the delay period. In that case,
>because the walsender won't time out, the shutdown will fail (with the
>failed message) even though it would complete after the walsender is
>able to send all the WAL and shut down. The time-delayed replica patch
>is still under discussion [1]. Also, for large values of
>wal_sender_timeout, it will wait till the walsender times out and can
>return with a failed message.
>
>>
>> I don't at all agree that it's ok to just stop replicating changes
>> because we're blocked on network IO. The patch justifies this with:
>>
> > Currently, at shutdown, walsender processes wait to send all pending data
> > and ensure that all data is flushed on the remote node. This mechanism was
> > added by 985bd7 to support clean switchover, but such a use-case cannot be
> > supported for logical replication. This commit removes the blocking in that
> > case.
>>
>> and at the start of the thread with:
>>
> > In the case of logical replication, however, we cannot support the use-case
> > that switches the roles publisher <-> subscriber. Suppose the same case as
> > above: additional transactions are committed while doing step 2. To catch
> > up with such changes the subscriber must receive the WAL related to those
> > transactions, but that cannot be done because the subscriber cannot request
> > WAL from a specific position. In that case, we must truncate all data in
> > the new subscriber once, and then create a new subscription with copy_data
> > = true.
>>
>> But that seems too narrow a view to me. Imagine you want to decommission
>> the current primary, and instead start to use the logical standby as the
>> primary. For that you'd obviously want to replicate the last few
>> changes. But with the proposed change, that'd be hard to ever achieve.
>>
>
>I think that can still be achieved with the idea being discussed, which
>is to keep allowing sending the WAL in smart shutdown mode but not
>in other modes (fast, immediate). I don't know whether it is a good
>idea or not, but Kuroda-San has produced a POC patch for it. We can
>instead choose to improve our docs related to shutdown to explain a
>bit more about the shutdown's interaction with (logical and physical)
>replication. As of now, it says: (“Smart” mode disallows new
>connections, then waits for all existing clients to disconnect. If the
>server is in hot standby, recovery and streaming replication will be
>terminated once all clients have disconnected.)[2]. Here, it is not
>clear that shutdown will wait for all the WAL to be sent and flushed.
>The information for fast and immediate modes is even sparser, which
>makes it more difficult to understand what kind of behavior is
>expected in those modes.
>
>[1] - https://commitfest.postgresql.org/42/3581/
>[2] - https://www.postgresql.org/docs/devel/app-pg-ctl.html
>

Smart shutdown is practically unusable. I don't think it makes sense to tie 
behavior of walsender to it in any way. 


Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-02-05 Thread Damir Belyalov
Hi, Andres!

Thank you for reviewing.


> I don't think this is the right approach. Creating a subtransaction for
> each row will cause substantial performance issues.
>

Subtransactions aren't created for each row. The block of rows in one
subtransaction is 1000 (SAFE_BUFFER_SIZE) and can be changed. There is also
a limit on the number of bytes, MAX_SAFE_BUFFER_BYTES, in safe_buffer:
 while (sfcstate->saved_tuples < SAFE_BUFFER_SIZE &&
 sfcstate->safeBufferBytes < MAX_SAFE_BUFFER_BYTES)



> We now can call data type input functions without throwing errors, see
> InputFunctionCallSafe(). Use that to avoid throwing an error instead of
> catching it.
>
InputFunctionCallSafe() is good for detecting errors from input functions,
but there are errors from NextCopyFrom() that cannot be detected
with InputFunctionCallSafe(), e.g. "wrong number of columns in row". Do
you propose to process input-function errors separately from other errors?
Now all errors are processed in one "switch" loop in PG_CATCH, so this
change could complicate the code.



Regards,
Damir Belyalov
Postgres Professional


Re: Exit walsender before confirming remote flush in logical replication

2023-02-05 Thread Amit Kapila
On Sat, Feb 4, 2023 at 6:31 PM Andres Freund  wrote:
>
> On 2023-02-02 11:21:54 +0530, Amit Kapila wrote:
> > The main problem we want to solve here is to avoid shutdown failing in
> > case walreceiver/applyworker is busy waiting for some lock or for some
> > other reason as shown in the email [1].
>
> Isn't handling this part of the job of wal_sender_timeout?
>

In some cases, it is not clear whether we can handle it by
wal_sender_timeout. Consider the case of a time-delayed replica, where
the applyworker will keep sending response/alive messages so that the
walsender doesn't time out during the delay period. In that case,
because the walsender won't time out, the shutdown will fail (with the
failed message) even though it would complete after the walsender is
able to send all the WAL and shut down. The time-delayed replica patch
is still under discussion [1]. Also, for large values of
wal_sender_timeout, it will wait till the walsender times out and can
return with a failed message.

>
> I don't at all agree that it's ok to just stop replicating changes
> because we're blocked on network IO. The patch justifies this with:
>
> > Currently, at shutdown, walsender processes wait to send all pending data
> > and ensure that all data is flushed on the remote node. This mechanism was
> > added by 985bd7 to support clean switchover, but such a use-case cannot be
> > supported for logical replication. This commit removes the blocking in that
> > case.
>
> and at the start of the thread with:
>
> > In the case of logical replication, however, we cannot support the use-case
> > that switches the roles publisher <-> subscriber. Suppose the same case as
> > above: additional transactions are committed while doing step 2. To catch
> > up with such changes the subscriber must receive the WAL related to those
> > transactions, but that cannot be done because the subscriber cannot request
> > WAL from a specific position. In that case, we must truncate all data in
> > the new subscriber once, and then create a new subscription with copy_data
> > = true.
>
> But that seems too narrow a view to me. Imagine you want to decommission
> the current primary, and instead start to use the logical standby as the
> primary. For that you'd obviously want to replicate the last few
> changes. But with the proposed change, that'd be hard to ever achieve.
>

I think that can still be achieved with the idea being discussed, which
is to keep allowing sending the WAL in smart shutdown mode but not
in other modes (fast, immediate). I don't know whether it is a good
idea or not, but Kuroda-San has produced a POC patch for it. We can
instead choose to improve our docs related to shutdown to explain a
bit more about the shutdown's interaction with (logical and physical)
replication. As of now, it says: (“Smart” mode disallows new
connections, then waits for all existing clients to disconnect. If the
server is in hot standby, recovery and streaming replication will be
terminated once all clients have disconnected.)[2]. Here, it is not
clear that shutdown will wait for all the WAL to be sent and flushed.
The information for fast and immediate modes is even sparser, which
makes it more difficult to understand what kind of behavior is
expected in those modes.

[1] - https://commitfest.postgresql.org/42/3581/
[2] - https://www.postgresql.org/docs/devel/app-pg-ctl.html

-- 
With Regards,
Amit Kapila.




Re: cutting down the TODO list thread

2023-02-05 Thread John Naylor
On Mon, Jan 30, 2023 at 10:07 PM Bruce Momjian  wrote:
>
> On Mon, Jan 30, 2023 at 01:13:45PM +0700, John Naylor wrote:

> > "It's worth checking if the feature of interest is found in the TODO
list on
> > our wiki: http://wiki.postgresql.org/wiki/TODO. The entries there often
have
> > additional information about the feature and may point to reasons why
it hasn't
> > been implemented yet."
>
> Good.

> > Do I understand right that we could just remove this entire section
> > "What areas need work?"?
>
> Yes, I think so.

> > > I can now see that just removing the [E] label totally is the right
> > > answer.  Yes, there might be an easy item on there, but the fact we have
> > > three labeled and they are not easy makes me think [E] is causing more
> > > problems than it solves.
> >
> > Okay, having heard no objections I'll remove it.

These are all done now.

I'll try to get back to culling the list items at the end of April.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Add progress reporting to pg_verifybackup

2023-02-05 Thread Masahiko Sawada
On Mon, Feb 6, 2023 at 9:35 AM Michael Paquier  wrote:
>
> On Sat, Feb 04, 2023 at 12:32:15PM +0900, Michael Paquier wrote:
> > That seems rather OK seen from here.  I'll see about getting that
> > applied except if there is an objection of any kind.
>
> Okay, I have looked at that again this morning and I've spotted one
> tiny issue: specifying --progress with --skip-checksums does not
> really make sense.

I thought that too, but I thought it's better to ignore it, instead of
erroring out. For example, we can specify both --disable and
--progress options to pg_checksum commands, but we don't write any
progress information in this case.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: First draft of back-branch release notes is done

2023-02-05 Thread Jonathan S. Katz

On 2/5/23 9:39 PM, Tom Lane wrote:



  
   Prevent clobbering of cached parsetrees for utility statements in
   SQL functions (Tom Lane, Daniel Gustafsson)
  

  
   If a SQL-language function executes the same utility command more
   than once within a single calling query, it could crash or report
   strange errors such as unrecognized node type.
  

regards, tom lane


+1. Thanks!

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-05 Thread Peter Smith
On Sat, Feb 4, 2023 at 5:04 PM Takamichi Osumi (Fujitsu)
 wrote:
>
...
>
> Kindly have a look at the attached v27.
>

Here are some review comments for patch v27-0001.

==
src/test/subscription/t/032_apply_delay.pl

1.
+# Confirm the time-delayed replication has been effective from the server log
+# message where the apply worker emits for applying delay. Moreover, verify
+# that the current worker's remaining wait time is sufficiently bigger than the
+# expected value, in order to check any update of the min_apply_delay.
+sub check_apply_delay_log

~

"has been effective from the server log" --> "worked, by inspecting
the server log"

~~~

2.
+my $delay   = 3;

Might be better to name this variable as 'min_apply_delay'.

~~~

3.
+# Now wait for replay to complete on publisher. We're done waiting when the
+# subscriber has applyed up to the publisher LSN.
+$node_publisher->wait_for_catchup($appname);

3a.
Something seemed wrong with the comment.

Was it meant to say more like? "The publisher waits for the
replication to complete".

Typo: "applyed"

~

3b.
Instead of doing this wait_for_catchup stuff why don't you just use a
synchronous pub/sub and then the publication will just block
internally like you require but without you having to block using test
code?
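
(For logical replication, a synchronous setup would mean naming the
subscription in synchronous_standby_names on the publisher - a sketch,
with the subscription/application name assumed:

  ALTER SYSTEM SET synchronous_standby_names = 'tap_sub';
  SELECT pg_reload_conf();

The publisher would then wait at commit for the subscriber's
confirmation.)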

~~~

4.
+# Run a query to make sure that the reload has taken effect.
+$node_publisher->safe_psql('postgres', q{SELECT 1});

SUGGESTION (for the comment)
# Running a dummy query causes the config to be reloaded.

~~~

5.
+# Confirm the record is not applied expectedly
+my $result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(a) FROM tab_int WHERE a = 0;");
+is($result, qq(0), "check the delayed transaction was not applied");

"expectedly" ??

SUGGESTION (for comment)
# Confirm the record was not applied

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: First draft of back-branch release notes is done

2023-02-05 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On 2/5/23 3:01 PM, Tom Lane wrote:
>> Fair.  I was trying to avoid committing to specific consequences.
>> The assertion failure seen in the original report (#17702) wouldn't
>> occur for typical users, but they might see crashes or "unexpected node
>> type" failures.  Maybe we can say that instead.

> I did a quick readthrough of #17702. Your proposal sounds reasonable.

> Based on that explanation and reading #17702, I'm still not sure if this 
> will make the cut in the release announcement itself, but +1 for 
> modifying it in the release notes.

The notes now say

 
  Prevent clobbering of cached parsetrees for utility statements in
  SQL functions (Tom Lane, Daniel Gustafsson)
 

 
  If a SQL-language function executes the same utility command more
  than once within a single calling query, it could crash or report
  strange errors such as unrecognized node type.
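
For concreteness, a minimal sketch of the pattern described (the names
are made up, not from the report):

  CREATE TEMP TABLE t (a int);
  CREATE FUNCTION f() RETURNS void LANGUAGE sql AS $$
    TRUNCATE t;  -- a utility statement inside a SQL-language function
  $$;
  SELECT f() FROM generate_series(1, 2);  -- the same utility command runs
                                          -- twice in one calling query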
 

regards, tom lane




Re: File descriptors in exec'd subprocesses

2023-02-05 Thread Thomas Munro
On Mon, Feb 6, 2023 at 3:29 AM Andres Freund  wrote:
> On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro 
>  wrote:
> >Are there any more descriptors we need to think about?
>
> Postmaster's listen sockets? Saves a bunch of syscalls, at least.

Assuming you mean accepted sockets, yeah, I see how we save two
syscalls there, and since you nerd-sniped me into looking into the
SOCK_CLOEXEC landscape, I like it even more now that I've understood
that accept4() is rubber-stamped for the next revision of POSIX[1] and
is already accepted almost everywhere.  It's not just window dressing,
you need it to write multi-threaded programs that fork/exec without
worrying about the window between fd creation and fcntl(FD_CLOEXEC) in
another thread; hopefully one day we will care about that sort of
thing in some places too!  Here's a separate patch for that.

I *guess* we need HAVE_DECL_ACCEPT4 for the guarded availability
system (cf pwritev) when Apple gets the memo, but see below.  Hard to
say if AIX is still receiving memos (cf recent speculation in the
Register).  All other target OSes seem to have had this stuff for a
while.

Since client connections already do fcntl(FD_CLOEXEC), postgres_fdw
connections didn't have this problem.  It seems reasonable to want to
skip a couple of system calls there too; also, client programs might
also be interested in future-POSIX's atomic race-free close-on-exec
socket fabrication.  So here also is a patch to use SOCK_CLOEXEC on
that end too, if available.

But ... hmph, all we can do here is test for the existence of
SOCK_NONBLOCK and SOCK_CLOEXEC, since there is no new function to test
for.  Maybe we should just assume accept4() also exists if these exist
(it's hard to imagine that Apple or IBM would address atomicity on one
end but not the other of a socket), but predictions are so difficult,
especially about the future!  Anyone want to guess if it's better to
leave the meson/configure probe in for the accept4 end or just roll
with the macros?

> Logging collector pipe write end, in backends?

The write end of the logging pipe is already closed, after dup2'ing it
to STDOUT_FILENO and STDERR_FILENO, so archive commands and suchlike do
receive the handles, but they want them.  It's the intended and
documented behaviour that anything written to them will finish up in
the log.

As for pipe2(O_CLOEXEC), I see the point of it in a multi-threaded
application.  It's not terribly useful for us though, because we
usually want to close only one end, except in the case of the
self-pipe.  But the self-pipe is no longer used on the systems that
have pipe2()-from-the-future.

I haven't tested this under EXEC_BACKEND yet.

[1] https://www.austingroupbugs.net/view.php?id=411
From c9a61ba8eaddb197c31095a163f7efdf2e3ea797 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 5 Feb 2023 12:18:14 +1300
Subject: [PATCH v2 1/3] Don't leak descriptors into subprograms.

Open data and WAL files with O_CLOEXEC (POSIX 2018, SUSv4).  All of our
target systems have it, except Windows.  Windows already has that
behavior, because we wrote our own open() implementation.

Do the same for sockets with FD_CLOEXEC.  (A separate commit will
improve this with SOCK_CLOEXEC on some systems, but we'll need the
FD_CLOEXEC path as a fallback.)

This means that programs executed by COPY and archiving commands etc,
will not be able to mess with those descriptors (of course nothing stops
them from opening files, so this is more about tidyness and preventing
accidental problems than security).

Reviewed-by: Tom Lane 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 9 ++---
 src/backend/libpq/pqcomm.c| 9 +
 src/backend/storage/file/fd.c | 2 --
 src/backend/storage/smgr/md.c | 9 +
 src/include/port/win32_port.h | 7 +++
 5 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fb4c860bde..3b44a0f237 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2937,7 +2937,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 * Try to use existent file (checkpoint maker may have created it already)
 	 */
 	*added = false;
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+			 get_sync_bit(sync_method));
 	if (fd < 0)
 	{
 		if (errno != ENOENT)
@@ -3098,7 +3099,8 @@ XLogFileInit(XLogSegNo logsegno, TimeLineID logtli)
 		return fd;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+			 get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(ERROR,
 (errcode_for_file_access(),
@@ 

Re: pglz compression performance, take two

2023-02-05 Thread Andrey Borodin
On Sun, Feb 5, 2023 at 5:51 PM Tomas Vondra
 wrote:
>
> On 2/5/23 19:36, Andrey Borodin wrote:
> > On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin  
> > wrote:
> >>
> >> Hello! Please find attached v8.
> >
> > I got some interesting feedback from some patch users.
> > There was an oversight that frequently yielded results that are 1, 2 or
> > 3 bytes longer than expected.
> > Looking closer I found that the correctness of the last 3-byte tail is
> > checked in two places. PFA fix for this. Previously compressed data
> > was correct, however in some cases a few bytes longer than the result of
> > the current pglz implementation.
> >
>
> Thanks. What were the consequences of the issue? Lower compression
> ratio, or did we then fail to decompress the data (or would current pglz
> implementation fail to decompress it)?
>
The data was decompressed fine. But extension tests (Citus's columnar
engine) hard-coded a lot of compression ratio stuff.
And there is still one more test where the optimized version produces
output one byte longer. I'm trying to find it, but with no success yet.

There are known and documented cases where the optimized pglz version
would do so: good_match without the division by 10, and memcmp by 4
bytes. But even with these disabled, the 1-byte-longer compression
result persists... The problem is that the length changes after
deleting some data, so the compression of that particular sequence
seems to happen somewhere far away.
It was fun at the beginning - hunting for one byte. But the weekend is
ending, and it seems that byte has slipped away from me again...

Best regards, Andrey Borodin.




Re: pglz compression performance, take two

2023-02-05 Thread Tomas Vondra
On 2/5/23 19:36, Andrey Borodin wrote:
> On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin  wrote:
>>
>> Hello! Please find attached v8.
> 
> I got some interesting feedback from some patch users.
> There was an oversight that frequently yielded results that are 1, 2 or
> 3 bytes longer than expected.
> Looking closer I found that the correctness of the last 3-byte tail is
> checked in two places. PFA fix for this. Previously compressed data
> was correct, however in some cases a few bytes longer than the result of
> the current pglz implementation.
> 

Thanks. What were the consequences of the issue? Lower compression
ratio, or did we then fail to decompress the data (or would current pglz
implementation fail to decompress it)?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: First draft of back-branch release notes is done

2023-02-05 Thread Jonathan S. Katz

On 2/5/23 3:01 PM, Tom Lane wrote:

"Jonathan S. Katz"  writes:

On Feb 4, 2023, at 10:24 AM, Tom Lane  wrote:

“Prevent clobbering of cached parsetrees…Bad things could happen if…”



While I chuckled over the phrasing, I’m left to wonder what the “bad things” 
are, in case I
need to check an older version to see if I’m susceptible to this bug.


Fair.  I was trying to avoid committing to specific consequences.
The assertion failure seen in the original report (#17702) wouldn't
occur for typical users, but they might see crashes or "unexpected node
type" failures.  Maybe we can say that instead.


I did a quick readthrough of #17702. Your proposal sounds reasonable.

Based on that explanation and reading #17702, I'm still not sure if this 
will make the cut in the release announcement itself, but +1 for 
modifying it in the release notes.


Thanks,

Jonathan



OpenPGP_signature
Description: OpenPGP digital signature


Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-05 Thread Tom Lane
Andres Freund  writes:
> I did survey available meson versions, and chose what features to
> use. But not really ninja, since I didn't know about this specific issue
> and other than this the ninja version differences were handled by meson.

> As all the issues are related to more precise dependencies, I somewhat
> wonder if it'd be good enough to use less accurate dependencies with
> 1.8.2. But I don't like it.

Nah, I don't like that either.  I did a crude survey of ninja's version
history by seeing which version is in each recent Fedora release:

f20/ninja-build.spec:Version:1.4.0
f21/ninja-build.spec:Version:1.5.1
f22/ninja-build.spec:Version:1.5.3
f23/ninja-build.spec:Version:1.7.1
f24/ninja-build.spec:Version:1.7.2
f25/ninja-build.spec:Version:1.8.2
f26/ninja-build.spec:Version:1.8.2
f27/ninja-build.spec:Version:1.8.2
f28/ninja-build.spec:Version:1.8.2
f29/ninja-build.spec:Version:1.8.2
f30/ninja-build.spec:Version:1.9.0
f31/ninja-build.spec:Version:1.10.1
f32/ninja-build.spec:Version:1.10.1
f33/ninja-build.spec:Version:1.10.2
f34/ninja-build.spec:Version:1.10.2
f35/ninja-build.spec:Version:1.10.2
f36/ninja-build.spec:Version:1.10.2
f37/ninja-build.spec:Version:1.10.2
rawhide/ninja-build.spec:Version:1.11.1

Remembering that Fedora has a six-month release cycle, this shows that
1.8.2 was around for a while but 1.9.x was a real flash-in-the-pan.
We can probably get away with saying that you need 1.10 or newer.
That's already three-plus years old.

regards, tom lane




Re: Support logical replication of DDLs

2023-02-05 Thread Peter Smith
Here are some comments for patch v63-0002.

This is a WIP because I have not yet looked at the large file - ddl_deparse.c.

==
Commit Message

1.
This patch provides JSON blobs representing DDL commands, which can
later be re-processed into plain strings by well-defined sprintf-like
expansion. These JSON objects are intended to allow for machine-editing of
the commands, by replacing certain nodes within the objects.

~

"This patch provides JSON blobs" --> "This patch constructs JSON blobs"

==
src/backend/commands/ddl_json.c

2. Copyright

+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group

"2022" --> "2023"

~~~

3.
+/*
+ * Conversion specifier which determines how we expand the JSON element into
+ * string.
+ */
+typedef enum
+{
+ SpecTypeName,
+ SpecOperatorName,
+ SpecDottedName,
+ SpecString,
+ SpecNumber,
+ SpecStringLiteral,
+ SpecIdentifier,
+ SpecRole
+} convSpecifier;

~

3a.
SUGGESTION (comment)
Conversion specifier which determines how to expand the JSON element
into a string.

~

3b.
Are these enums in this strange order deliberately? If not, then maybe
alphabetical is better.

~~~

4. Forward declaration

+char *deparse_ddl_json_to_string(char *jsonb);

Why is this forward declared here? Isn't this already declared extern
in ddl_deparse.h?

~~~

5. expand_fmt_recursive

+/*
+ * Recursive helper for deparse_ddl_json_to_string.
+ *
+ * Find the "fmt" element in the given container, and expand it into the
+ * provided StringInfo.
+ */
+static void
+expand_fmt_recursive(JsonbContainer *container, StringInfo buf)

I noticed all the other expand_ functions are passing the
StringInfo buf as the first parameter. For consistency, shouldn’t this
be the same?

~

6.
+ if (*cp != '%')
+ {
+ appendStringInfoCharMacro(buf, *cp);
+ continue;
+ }
+
+
+ ADVANCE_PARSE_POINTER(cp, end_ptr);
+
+ /* Easy case: %% outputs a single % */
+ if (*cp == '%')
+ {
+ appendStringInfoCharMacro(buf, *cp);
+ continue;
+ }

Double blank lines?

~

7.
+ ADVANCE_PARSE_POINTER(cp, end_ptr);
+ for (; cp < end_ptr;)
+ {


Maybe a while loop is more appropriate?

~

8.
+ value = findJsonbValueFromContainer(container, JB_FOBJECT, &key);

Should the code be checking or asserting value is not NULL?

(IIRC I asked this a long time ago - sorry if it was already answered)

~~~

9. expand_jsonval_dottedname

It might be simpler code to use a variable like:
JsonbContainer *data = jsonval->val.binary.data;

Instead of repeating jsonval->val.binary.data many times.

~~~

10. expand_jsonval_typename

It might be simpler code to use a variable like:
JsonbContainer *data = jsonval->val.binary.data;

Instead of repeating jsonval->val.binary.data many times.

~~~

11.
+/*
+ * Expand a JSON value as an operator name.
+ */
+static void
+expand_jsonval_operator(StringInfo buf, JsonbValue *jsonval)

Should this function comment be more like the comment for
expand_jsonval_dottedname by saying there can be an optional
"schemaname"?

~~~

12.
+static bool
+expand_jsonval_string(StringInfo buf, JsonbValue *jsonval)
+{
+ if (jsonval->type == jbvString)
+ {
+ appendBinaryStringInfo(buf, jsonval->val.string.val,
+jsonval->val.string.len);
+ }
+ else if (jsonval->type == jbvBinary)
+ {
+ json_trivalue present;
+
+ present = find_bool_in_jsonbcontainer(jsonval->val.binary.data,
+   "present");
+
+ /*
+ * If "present" is set to false, this element expands to empty;
+ * otherwise (either true or absent), fall through to expand "fmt".
+ */
+ if (present == tv_false)
+ return false;
+
+ expand_fmt_recursive(jsonval->val.binary.data, buf);
+ }
+ else
+ return false;
+
+ return true;
+}

I felt this could be simpler if there were a new 'expanded' variable,
because then you can have just a single return point instead of 3
returns.

If you choose to do this there is a minor tweak to the "fall through" comment.

SUGGESTION
expand_jsonval_string(StringInfo buf, JsonbValue *jsonval)
{
bool expanded = true;

if (jsonval->type == jbvString)
{
appendBinaryStringInfo(buf, jsonval->val.string.val,
   jsonval->val.string.len);
}
else if (jsonval->type == jbvBinary)
{
json_trivalue present;

present = find_bool_in_jsonbcontainer(jsonval->val.binary.data,
  "present");

/*
 * If "present" is set to false, this element expands to empty;
 * otherwise (either true or absent), expand "fmt".
 */
if (present == tv_false)
expanded = false;
else
expand_fmt_recursive(jsonval->val.binary.data, buf);
}

return expanded;
}

~~~

13.
+/*
+ * Expand a JSON value as an integer quantity.
+ */
+static void
+expand_jsonval_number(StringInfo buf, JsonbValue *jsonval)
+{

Should this also be checking/asserting that the type is jbvNumeric?

~~~

14.
+/*
+ * Expand a JSON value as a role name.  If the is_public element is set to
+ * true, PUBLIC is expanded (no quotes); 

Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-05 Thread Andres Freund
Hi,

On 2023-02-01 14:20:19 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-02-01 09:49:00 -0800, Andres Freund wrote:
> >> On 2023-02-01 12:23:27 -0500, Tom Lane wrote:
> >>> And the minimum version appears to be newer than RHEL8's 1.8.2, which
> >>> I find pretty unfortunate.
> 
> > Unfortunately the test script accidentally pulled in ninja from epel,
> > hence not noticing the issue.
> 
> Ah.  For myself, pulling the newer version from epel would not be a big
> problem.  I think what we need to do is figure out what is the minimum
> ninja version we want to support, and then see if we need to make any
> of these changes.  I don't have hard data on which distros have which
> versions of ninja, but surely somebody checked that at some point?

I did survey available meson versions, and chose what features to
use. But not really ninja, since I didn't know about this specific issue
and other than this the ninja version differences were handled by meson.

As all the issues are related to more precise dependencies, I somewhat
wonder if it'd be good enough to use less accurate dependencies with
1.8.2. But I don't like it.

Greetings,

Andres Freund




Re: Weird failure with latches in curculio on v15

2023-02-05 Thread Nathan Bossart
On Sun, Feb 05, 2023 at 04:07:50PM -0800, Andres Freund wrote:
> On 2023-02-05 15:57:47 -0800, Nathan Bossart wrote:
>> I agree that the shell overhead isn't the main performance issue,
>> but it's unclear to me how much of this should be baked into
>> PostgreSQL.
> 
> I don't know fully either. But just reimplementing all of it in
> different modules doesn't seem like a sane approach either. A lot of it
> is policy that we need to solve once, centrally.
> 
>> I mean, we could introduce a GUC that tells us how far ahead to
>> restore and have a background worker (or multiple background workers)
>> asynchronously pull files into a staging directory via the callbacks.
>> Is that the sort of scope you are envisioning?
> 
> Closer, at least.

Got it.  I suspect we'll want to do something similar for archive modules
eventually, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add progress reporting to pg_verifybackup

2023-02-05 Thread Michael Paquier
On Sat, Feb 04, 2023 at 12:32:15PM +0900, Michael Paquier wrote:
> That seems rather OK seen from here.  I'll see about getting that
> applied except if there is an objection of any kind.

Okay, I have looked at that again this morning and I've spotted one
tiny issue: specifying --progress with --skip-checksums does not
really make sense.

Ignoring entries with a bad size would lead to an incorrect progress
report (for example, say an entry in the manifest has a largely
oversized size number), so your approach on this side is correct.  The
application of the ignore list via -i is also correct, as a path
matching should_ignore_relpath() does not add an extra size
to total_size.

I was also wondering for a few minutes while on it whether it would
have been cleaner to move the check for should_ignore_relpath()
directly in verify_file_checksum() and verify_backup_file(), but
nobody has complained about that as being a problem, either.

What do you think about the updated version attached?
--
Michael
From 08a33cc650fb3cd58e742a79b232dbd5d9843008 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 6 Feb 2023 09:34:19 +0900
Subject: [PATCH v4] Add progress reporting to pg_verifybackup

This adds a new option to pg_verifybackup called -P/--progress,
showing every second some information about the state of checksum
verification.

Similarly to what is done for pg_rewind and pg_basebackup, the
information printed in the progress report consists of the current
amount of data computed and the total amount of data to compute.  This
could be extended later on.
---
 src/bin/pg_verifybackup/pg_verifybackup.c | 86 ++-
 src/bin/pg_verifybackup/t/004_options.pl  | 21 --
 doc/src/sgml/ref/pg_verifybackup.sgml | 15 
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 7634dfc285..8d5cb1c72b 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -16,12 +16,14 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "common/hashfn.h"
 #include "common/logging.h"
 #include "fe_utils/simple_list.h"
 #include "getopt_long.h"
 #include "parse_manifest.h"
+#include "pgtime.h"
 
 /*
  * For efficiency, we'd like our hash table containing information about the
@@ -57,6 +59,8 @@ typedef struct manifest_file
 	bool		matched;
 	bool		bad;
 } manifest_file;
+#define should_check_checksums(m) \
+	(((m)->matched) && !((m)->bad) && (((m)->checksum_type) != CHECKSUM_TYPE_NONE))
 
 /*
  * Define a hash table which we can use to store information about the files
@@ -147,10 +151,18 @@ static void report_fatal_error(const char *pg_restrict fmt,...)
 			pg_attribute_printf(1, 2) pg_attribute_noreturn();
 static bool should_ignore_relpath(verifier_context *context, char *relpath);
 
+static void progress_report(bool finished);
 static void usage(void);
 
 static const char *progname;
 
+/* options */
+static bool show_progress = false;
+
+/* Progress indicators */
+static uint64 total_size = 0;
+static uint64 done_size = 0;
+
 /*
  * Main entry point.
  */
@@ -162,6 +174,7 @@ main(int argc, char **argv)
 		{"ignore", required_argument, NULL, 'i'},
 		{"manifest-path", required_argument, NULL, 'm'},
 		{"no-parse-wal", no_argument, NULL, 'n'},
+		{"progress", no_argument, NULL, 'P'},
 		{"quiet", no_argument, NULL, 'q'},
 		{"skip-checksums", no_argument, NULL, 's'},
 		{"wal-directory", required_argument, NULL, 'w'},
@@ -219,7 +232,7 @@ main(int argc, char **argv)
 	simple_string_list_append(_list, "recovery.signal");
 	simple_string_list_append(_list, "standby.signal");
 
-	while ((c = getopt_long(argc, argv, "ei:m:nqsw:", long_options, NULL)) != -1)
+	while ((c = getopt_long(argc, argv, "ei:m:nPqsw:", long_options, NULL)) != -1)
 	{
 		switch (c)
 		{
@@ -241,6 +254,9 @@ main(int argc, char **argv)
 			case 'n':
 no_parse_wal = true;
 break;
+			case 'P':
+show_progress = true;
+break;
 			case 'q':
 quiet = true;
 break;
@@ -277,6 +293,14 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
+	/* Complain if the specified arguments conflict */
+	if (show_progress && quiet)
+		pg_fatal("cannot specify both %s and %s",
+ "-P/--progress", "-q/--quiet");
+	if (show_progress && skip_checksums)
+		pg_fatal("cannot specify both %s and %s",
+ "-P/--progress", "-s/--skip-checksums");
+
 	/* Unless --no-parse-wal was specified, we will need pg_waldump. */
 	if (!no_parse_wal)
 	{
@@ -638,6 +662,10 @@ verify_backup_file(verifier_context *context, char *relpath, char *fullpath)
 		m->bad = true;
 	}
 
+	/* Update statistics for progress report, if necessary */
+	if (show_progress && should_check_checksums(m))
+		total_size += m->size;
+
 	/*
 	 * We don't verify checksums at this stage. We first finish verifying that
 	 * we have the expected set of files with the expected sizes, and only
@@ -675,10 +703,12 @@ 

Re: Weird failure with latches in curculio on v15

2023-02-05 Thread Andres Freund
Hi,

On 2023-02-05 15:57:47 -0800, Nathan Bossart wrote:
> > For the segment files, we'd likely need a parameter to indicate whether
> > the restore is random or not.
> 
> Wouldn't this approach still require each module to handle restoring ahead
> of time?

Yes, to some degree at least. I was just describing a few pretty obvious
improvements.

The core code can make that a lot easier though. A solution for where to
store such files can be provided by core code (presumably a separate
directory). A GUC for aggressiveness can be provided. Etc.


> I agree that the shell overhead isn't the main performance issue,
> but it's unclear to me how much of this should be baked into
> PostgreSQL.

I don't know fully either. But just reimplementing all of it in
different modules doesn't seem like a sane approach either. A lot of it
is policy that we need to solve once, centrally.


> I mean, we could introduce a GUC that tells us how far ahead to
> restore and have a background worker (or multiple background workers)
> asynchronously pull files into a staging directory via the callbacks.
> Is that the sort of scope you are envisioning?

Closer, at least.

Greetings,

Andres Freund




Re: Weird failure with latches in curculio on v15

2023-02-05 Thread Nathan Bossart
On Sun, Feb 05, 2023 at 03:01:57PM -0800, Andres Freund wrote:
> I think at the very least you'd want to have a separate callback for
> restoring segments than for restoring other files. But more likely a
> separate callback for each type of file to be restored.
> 
> For the timeline history case an parameter indicating that we don't want
> to restore the file, just to see if there's a conflict, would make
> sense.

That seems reasonable.

> For the segment files, we'd likely need a parameter to indicate whether
> the restore is random or not.

Wouldn't this approach still require each module to handle restoring ahead
of time?  I agree that the shell overhead isn't the main performance issue,
but it's unclear to me how much of this should be baked into PostgreSQL.  I
mean, we could introduce a GUC that tells us how far ahead to restore and
have a background worker (or multiple background workers) asynchronously
pull files into a staging directory via the callbacks.  Is that the sort of
scope you are envisioning?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2023-02-05 Thread Andres Freund
Hi,

On 2023-02-03 13:27:24 +0300, Damir Belyalov wrote:
> @@ -625,6 +628,173 @@ CopyMultiInsertInfoStore(CopyMultiInsertInfo *miinfo, 
> ResultRelInfo *rri,
>   miinfo->bufferedBytes += tuplen;
>  }
>  
> +/*
> + * Safely reads source data, converts to a tuple and fills tuple buffer.
> + * Skips some data in the case of failed conversion if data source for
> + * a next tuple can be surely read without a danger.
> + */
> +static bool
> +SafeCopying(CopyFromState cstate, ExprContext *econtext, TupleTableSlot 
> *myslot)


> + BeginInternalSubTransaction(NULL);
> + CurrentResourceOwner = sfcstate->oldowner;

I don't think this is the right approach. Creating a subtransaction for
each row will cause substantial performance issues.

We now can call data type input functions without throwing errors, see
InputFunctionCallSafe(). Use that to avoid throwing an error instead of
catching it.
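
For readers following along, a minimal sketch of that soft-error idiom
(not the patch's actual code; in_func, field_string, typioparam, typmod
and the num_errors counter stand in for per-attribute state COPY FROM
already has; ErrorSaveContext is from nodes/miscnodes.h):

	ErrorSaveContext escontext = {T_ErrorSaveContext};
	Datum		value;

	if (!InputFunctionCallSafe(&in_func, field_string,
							   typioparam, typmod,
							   (Node *) &escontext, &value))
	{
		/* input conversion failed softly: skip the bad row, no ERROR */
		num_errors++;
		continue;
	}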

Greetings,

Andres Freund




Re: Weird failure with latches in curculio on v15

2023-02-05 Thread Michael Paquier
On Sun, Feb 05, 2023 at 09:49:57AM +0900, Michael Paquier wrote:
> Yes, at this stage a revert of the refactoring with shell_restore.c is
> the best path forward.

Done that now, as of 2f6e15a.
--
Michael


signature.asc
Description: PGP signature


Re: proposal: psql: psql variable BACKEND_PID

2023-02-05 Thread Corey Huinker
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

A small but helpful feature.

The new status of this patch is: Ready for Committer


Re: Temporary tables versus wraparound... again

2023-02-05 Thread Andres Freund
Hi,

On 2023-02-04 17:12:36 +0100, Greg Stark wrote:
> I think that was spurious. It looked good when we looked at it yesterday.
> The rest that failed seemed unrelated and was also taking on my SSL patch
> too.

I don't think the SSL failures are related to the failure of this
patch. That was in one of the new tests executed as part of the main
regression tests:

https://api.cirrus-ci.com/v1/artifact/task/6418299974582272/testrun/build/testrun/regress/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/temp.out 
/tmp/cirrus-ci-build/build/testrun/regress/regress/results/temp.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/temp.out 2023-02-04 
05:43:14.225905000 +
+++ /tmp/cirrus-ci-build/build/testrun/regress/regress/results/temp.out 
2023-02-04 05:46:57.46825 +
@@ -108,7 +108,7 @@
:old_relfrozenxid <> :new_relfrozenxid  AS frozenxid_advanced;
  pages_analyzed | pages_reset | tuples_analyzed | tuples_reset | 
frozenxid_advanced 
 
+-+-+--+
- t  | t   | t   | t| t
+ t  | t   | t   | t| f
 (1 row)
 
 -- The toast table can't be analyzed so relpages and reltuples can't


Whereas the SSL test once failed in subscription/031_column_list (a test
with some known stability issues) and twice in postgres_fdw.

Unfortunately the postgres_fdw failures are failing to upload:

[17:41:25.601] Failed to upload artifacts: Put 
"https://storage.googleapis.com/cirrus-ci-5309429912436736-3271c9/artifacts/postgresql-cfbot/postgresql/6061134453669888/testrun/build/testrun/runningcheck.log?X-Goog-Algorithm=GOOG4-RSA-SHA256=cirrus-ci%40cirrus-ci-community.iam.gserviceaccount.com%2F20230128%2Fauto%2Fstorage%2Fgoog4_request=20230128T174012Z=600=host%3Bx-goog-content-length-range%3Bx-goog-meta-created_by_task=6f5606e3966d68060a14077deb93ed5bf680c4636e6409e5eba6ca8f1ff9b11302c1b5605089e2cd759fd90d1542a4e2c794fd4c1210f04b056d7e09db54d3e983c34539fb4c24787b659189c27e1b6d0ebc1d1807b38a066c10e62fa57a374c3a7fbc610edddf1dfe900b3c788c8d7d7ded3366449b4520992c5ed7a3136c7103b7a668b591542bba58a32f5a84cb21bbeeafea09dc525d1631a5f413a0f98df43cc90ebf6c4206e6df61606bc634c3a8116c53d7c6dd4bc5b26547cd7d1a1633839ace694b73426267a9f434317350905b905b9c88132be14a7762c2f204b8072a3bd7e4e1d30217d9e60102d525b08e28bcfaabae80fba734a1015d8eb0a7":
 http2: request body larger than specified content length

Hm, I suspect the problem is that we didn't shut down the server due to
the error, so the log file was changing while cirrus was trying to
upload.

Greetings,

Andres Freund




Re: proposal: psql: psql variable BACKEND_PID

2023-02-05 Thread Corey Huinker
>
>
>>
>> Clearly, it is hard to write a regression test for an externally volatile
>> value. `SELECT sign(:BACKEND_PID)` would technically do the job, if we're
>> striving for completeness.
>>
>
> I did simple test - :BACKEND_PID should be equal pg_backend_pid()
>
>

Even better.
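
(For the archives: the whole check fits in one line of psql script -- a
sketch, assuming the variable is assigned at connect time -- and should
print t:

SELECT :BACKEND_PID = pg_backend_pid() AS backend_pid_matches;
)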


>
>>
>> Do we want to change %p to pull from this variable and save the
>> snprintf()? Not a measurable savings, more or a don't-repeat-yourself thing.
>>
>
> I checked the code, and I don't think so. Current state is safer (I
> think). The psql's variables are not protected, and I think, so is safer,
> better to
> read the value for prompt directly by usage of the libpq API instead read
> the possibly "customized" variable. I see possible inconsistency,
> but again, the same inconsistency can be for variables USER and DBNAME
> too, and I am not able to say what is significantly better. Just prompt
> shows
> real value, and the related variable is +/- copy in connection time.
>
> I am not 100% sure in this area what is better, but the change requires
> wider and more general discussion, and I don't think the benefits of
> possible
> change are enough. There are just two possible solutions - we can protect
> some psql's variables (and it can do some compatibility issues), or we
> need to accept, so the value in prompt can be fake. It is better to not
> touch it :-).
>

I agree it is out of scope of this patch, but I like the idea of protected
psql variables, and I doubt users are intentionally re-using these vars to
any positive effect. The more likely case is that newer psql vars just
happen to use the names chosen by somebody's script in the past.


>
> done
>
>
>
Everything passes. Docs look good. Test looks good.


Re: Pluggable toaster

2023-02-05 Thread Andres Freund
Hi,

On 2023-02-06 00:10:50 +0300, Nikita Malakhov wrote:
> The problem, actually, is that the default TOAST is often not good for
> modern loads and amounts of data. Pluggable TOAST is based not only
> on pure enthusiasm, but on demands and tickets from production
> databases.

> The main demand is effective and transparent storage subsystem
> for large values for some problematic types of data, which we already have,
> with proven efficiency.

> So we're really quite surprised that it has got so little feedback. We've
> got some opinions on the approach but there is no general one on the
> approach itself except doubts about whether the TOAST mechanism needs
> revision at all.

The problem for me is that what you've been posting doesn't actually fix
any problem, but instead introduces lots of new code and complexity.


> Currently we're busy revising the whole Pluggable TOAST API to make it
> available as an extension and based on hooks to minimize changes in
> the core. It will be available soon.

I don't think we should accept that either. It still doesn't improve
anything about toast, it just allows you to do such improvements out of
core.

Greetings,

Andres Freund




Re: Weird failure with latches in curculio on v15

2023-02-05 Thread Andres Freund
Hi,

On 2023-02-05 14:19:38 -0800, Nathan Bossart wrote:
> On Sun, Feb 05, 2023 at 09:49:57AM +0900, Michael Paquier wrote:
> > - Should we include archive_cleanup_command into the recovery modules
> > at all?  We've discussed offloading that from the checkpointer, and it
> > makes the failure handling trickier when it comes to unexpected GUC
> > configurations, for one.  The same may actually apply to
> > restore_end_command.  Though it is done in the startup process now,
> > there may be an argument to offload that somewhere else based on the
> > timing of the end-of-recovery checkpoint.  My opinion on this stuff is
> > that only including restore_command in the modules would make most
> > users I know of happy enough as it removes the overhead of the command
> > invocation from the startup process, if able to replay things fast
> > enough so as the restore command is the bottleneck.
> > restore_end_command would be simple enough, but if there is a wish to
> > redesign the startup process to offload it somewhere else, then the
> > recovery module makes backward-compatibility concerns harder to think
> > about in the long-term.
> 
> I agree.  I think we ought to first focus on getting the recovery modules
> interface and restore_command functionality in place before we take on more
> difficult things like archive_cleanup_command.  But I still think the
> archive_cleanup_command/recovery_end_command functionality should
> eventually be added to recovery modules.

I tend not to agree. If you make the API that small, you're IME likely
to end up with something that looks somewhat incoherent once extended.


The more I think about it, the less I am convinced that
one-callback-per-segment, invoked just before needing the file, is the
right approach to address the performance issues of restore_command.

The main performance issue isn't the shell invocation overhead, it's
synchronously needing to restore the archive, before replay can
continue. It's also gonna be slow if a restore module copies the segment
from a remote system - the latency is the problem.

The only way the restore module approach can do better, is to
asynchronously restore ahead of the current segment. But for that the
API really isn't suited well. The signature of the relevant callback is:

> +typedef bool (*RecoveryRestoreCB) (const char *file, const char *path,
> +const char 
> *lastRestartPointFileName);


That's not very suited to restoring "ahead of time". You need to parse
"file" to figure out whether a segment or something else is being restored,
turn "file" back into an LSN, figure out where to store further segments,
somehow hand off to some background worker, etc.
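
For illustration, roughly the per-module boilerplate that implies (a
sketch using existing helpers from access/xlog_internal.h, not code from
the patch):

	TimeLineID	tli;
	XLogSegNo	segno;
	XLogRecPtr	startptr;

	if (IsXLogFileName(file))
	{
		XLogFromFileName(file, &tli, &segno, wal_segment_size);
		/* start LSN of the requested segment */
		XLogSegNoOffsetToRecPtr(segno, 0, wal_segment_size, startptr);
		/* ... decide where to stage segno+1, segno+2, hand off to a worker ... */
	}
	else if (IsTLHistoryFileName(file))
	{
		/* timeline history files need entirely different treatment */
	}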

That doesn't strike me as something we want to happen inside multiple
restore libraries.

I think at the very least you'd want to have a separate callback for
restoring segments than for restoring other files. But more likely a
separate callback for each type of file to be restored.

For the timeline history case a parameter indicating that we don't want
to restore the file, just to see if there's a conflict, would make
sense.

For the segment files, we'd likely need a parameter to indicate whether
the restore is random or not.


Greetings,

Andres Freund




Re: Weird failure with latches in curculio on v15

2023-02-05 Thread Nathan Bossart
On Sun, Feb 05, 2023 at 09:49:57AM +0900, Michael Paquier wrote:
> - Should we include archive_cleanup_command into the recovery modules
> at all?  We've discussed offloading that from the checkpointer, and it
> makes the failure handling trickier when it comes to unexpected GUC
> configurations, for one.  The same may actually apply to
> restore_end_command.  Though it is done in the startup process now,
> there may be an argument to offload that somewhere else based on the
> timing of the end-of-recovery checkpoint.  My opinion on this stuff is
> that only including restore_command in the modules would make most
> users I know of happy enough as it removes the overhead of the command
> invocation from the startup process, if able to replay things fast
> enough so as the restore command is the bottleneck.
> restore_end_command would be simple enough, but if there is a wish to
> redesign the startup process to offload it somewhere else, then the
> recovery module makes backward-compatibility concerns harder to think
> about in the long-term.

I agree.  I think we ought to first focus on getting the recovery modules
interface and restore_command functionality in place before we take on more
difficult things like archive_cleanup_command.  But I still think the
archive_cleanup_command/recovery_end_command functionality should
eventually be added to recovery modules.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Pluggable toaster

2023-02-05 Thread Nikita Malakhov
Hi hackers!

In response to an opinion in the thread on Compression dictionaries for JSONB [1]:

>The approaches are completely different,
>but they seem to be trying to fix the same problem: the fact that the
>default TOAST stuff isn't good enough for JSONB.

The problem, actually, is that the default TOAST is often not good for
modern loads and amounts of data. Pluggable TOAST is based not only
on pure enthusiasm, but on demands and tickets from production
databases. The main demand is an effective and transparent storage
subsystem for large values for some problematic types of data, which we
already have, with proven efficiency.

So we're really quite surprised that it has got so little feedback. We've
got some opinions on the approach but there is no general one on the
approach itself except doubts about whether the TOAST mechanism needs
revision at all.

Currently we're busy revising the whole Pluggable TOAST API to make it
available as an extension and based on hooks, to minimize changes in
the core. It will be available soon.

[1] https://postgr.es/m/20230203095540.zutul5vmsbmantbm@alvherre.pgsql

--
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


Re: Support logical replication of DDLs

2023-02-05 Thread Peter Smith
On Fri, Feb 3, 2023 at 9:21 PM Alvaro Herrera  wrote:
>
> On 2023-Feb-03, Peter Smith wrote:
>
...
> > 3. ExecuteGrantStmt
> >
> > + /* Copy the grantor id needed for DDL deparsing of Grant */
> > + istmt.grantor_uid = grantor;
> > +
> >
> > SUGGESTION (comment)
> > Copy the grantor id to the parsetree, needed for DDL deparsing of Grant
>
> Is istmt really "the parse tree" actually?  As I recall, it's a derived
> struct that's created during execution of the grant/revoke command, so
> modifying the comment like this would be a mistake.
>

I thought this comment was analogous to another one from this same
patch 0001 (see seclabel.c), so the suggested change above was simply
to make the wording consistent.

@@ -134,6 +134,9 @@ ExecSecLabelStmt(SecLabelStmt *stmt)
  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("must specify provider when multiple security label providers
have been loaded")));
  provider = (LabelProvider *) linitial(label_provider_list);
+
+ /* Copy the provider name to the parsetree, needed for DDL deparsing
of SecLabelStmt */
+ stmt->provider = pstrdup(provider->provider_name);

So if the suggestion for the ExecuteGrantStmt comment was a mistake
then perhaps the ExecSecLabelStmt comment is wrong also?

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: First draft of back-branch release notes is done

2023-02-05 Thread Tom Lane
Thomas Munro  writes:
> On Mon, Feb 6, 2023 at 8:57 AM Tom Lane  wrote:
>> Also, aren't shared tuplestores used in more places than just
>> parallel hash join?  I mentioned that as an example, not an
>> exhaustive list of trouble spots.

> Shared file sets (= a directory of temp files with automatic cleanup)
> are used by more things, but shared tuplestores (= a shared file set
> with a tuple-oriented interface on top, to support partial scan) are
> currently only used by PHJ.

Oh, okay.  I'll change it to say "corruption within parallel hash
join", then, and not use the word "tuplestore" at all.

regards, tom lane




Re: First draft of back-branch release notes is done

2023-02-05 Thread Thomas Munro
On Mon, Feb 6, 2023 at 8:57 AM Tom Lane  wrote:
> Alvaro Herrera  writes:
> > On 2023-Feb-03, Tom Lane wrote:
> >>   Fix edge-case data corruption in shared tuplestores (Dmitry Astapov)
>
> > I think this sounds really scary, because people are going to think that
> > their stored data can get corrupted -- they don't necessarily know what
> > a "shared tuplestore" is.  Maybe "Avoid query failures in parallel hash
> > joins" as headline?
>
> Hmmm ... are we sure it *can't* lead to corruption of stored data,
> if this happens during an INSERT or UPDATE plan?  I'll grant that
> such a case seems pretty unlikely though, as the bogus data
> retrieved from the tuplestore would have to not cause a failure
> within the query before it can get written out.

Agreed.  I think you have to be quite unlucky to hit this in the first
place (very large tuples with very particular alignment), and then
you'd be highly likely to fail in some way due to corrupted tuple
size, making permanent corruption extremely unlikely.

> Also, aren't shared tuplestores used in more places than just
> parallel hash join?  I mentioned that as an example, not an
> exhaustive list of trouble spots.

Shared file sets (= a directory of temp files with automatic cleanup)
are used by more things, but shared tuplestores (= a shared file set
with a tuple-oriented interface on top, to support partial scan) are
currently only used by PHJ.




Re: First draft of back-branch release notes is done

2023-02-05 Thread Tom Lane
I wrote:
> Alvaro Herrera  writes:
>> I think this sounds really scary, because people are going to think that
>> their stored data can get corrupted -- they don't necessarily know what
>> a "shared tuplestore" is.  Maybe "Avoid query failures in parallel hash
>> joins" as headline?

Maybe less scary if we make it clear we're talking about a temporary file?

 <para>
  Fix edge-case corruption of temporary data within shared tuplestores
  (Dmitry Astapov)
 </para>

 <para>
  If the final chunk of a large tuple being written out to a temporary
  file was exactly 32760 bytes, it would be corrupted due to a
  fencepost bug.  This is a hazard for parallelized plans that require
  a tuplestore, such as parallel hash join.  The query would typically
  fail later with corrupted-data symptoms.
 </para>

regards, tom lane




Re: First draft of back-branch release notes is done

2023-02-05 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On Feb 4, 2023, at 10:24 AM, Tom Lane  wrote:
>> “Prevent clobbering of cached parsetrees…Bad things could happen if…”

> While I chuckled over the phrasing, I’m left to wonder what the “bad things” 
> are, in case I
> need to check an older version to see if I’m susceptible to this bug.

Fair.  I was trying to avoid committing to specific consequences.
The assertion failure seen in the original report (#17702) wouldn't
occur for typical users, but they might see crashes or "unexpected node
type" failures.  Maybe we can say that instead.

regards, tom lane




Re: First draft of back-branch release notes is done

2023-02-05 Thread Tom Lane
Alvaro Herrera  writes:
> On 2023-Feb-03, Tom Lane wrote:
>>   Fix edge-case data corruption in shared tuplestores (Dmitry Astapov)

> I think this sounds really scary, because people are going to think that
> their stored data can get corrupted -- they don't necessarily know what
> a "shared tuplestore" is.  Maybe "Avoid query failures in parallel hash
> joins" as headline?

Hmmm ... are we sure it *can't* lead to corruption of stored data,
if this happens during an INSERT or UPDATE plan?  I'll grant that
such a case seems pretty unlikely though, as the bogus data
retrieved from the tuplestore would have to not cause a failure
within the query before it can get written out.

Also, aren't shared tuplestores used in more places than just
parallel hash join?  I mentioned that as an example, not an
exhaustive list of trouble spots.

regards, tom lane




Re: pg_stat_statements and "IN" conditions

2023-02-05 Thread Dmitry Dolgov
> On Sun, Feb 05, 2023 at 11:02:32AM -0500, Tom Lane wrote:
> Dmitry Dolgov <9erthali...@gmail.com> writes:
> > I'm thinking about this in the following way: the core jumbling logic is
> > responsible for deriving locations based on the input expressions; in
> > the case of merging we produce less locations; pgss have to represent
> > the result only using locations and has to be able to differentiate
> > simple locations and locations after merging.
>
> Uh ... why?  ISTM you're just going to elide all inside the IN,
> so why do you need more than a start and stop position?

Exactly, start and stop positions. But if there would be no information
that merging was applied, the following queries will look the same after
jumbling, right?

-- input query
SELECT * FROM test_merge WHERE id IN (1, 2);
-- jumbling result, two LocationLen, for values 1 and 2
SELECT * FROM test_merge WHERE id IN ($1, $2);

-- input query
SELECT * FROM test_merge WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
-- jumbling result, two LocationLen after merging, for values 1 and 10
SELECT * FROM test_merge WHERE id IN (...);
-- without remembering about merging the result would be
SELECT * FROM test_merge WHERE id IN ($1, $2);




Re: [PATCH] Compression dictionaries for JSONB

2023-02-05 Thread Andres Freund
Hi,

On 2023-02-05 20:05:51 +0300, Aleksander Alekseev wrote:
> > I don't think we'd want much of the infrastructure introduced in the
> > patch for type agnostic cross-row compression. A dedicated "dictionary"
> > type as a wrapper around other types IMO is the wrong direction. This
> > should be a relation-level optimization option, possibly automatic, not
> > something visible to every user of the table.
>
> So to clarify, are we talking about tuple-level compression? Or
> perhaps page-level compression?

Tuple level.

What I think we should do is basically this:

When we compress datums, we know the table being targeted. If there's a
pg_attribute parameter indicating we should, we can pass a prebuilt
dictionary to the LZ4/zstd [de]compression functions.

It's possible we'd need to use a somewhat extended header for such
compressed datums, to reference the dictionary "id" to be used when
decompressing, if the compression algorithms don't already have that in
one of their headers, but that's entirely doable.
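
As a sketch of that idea (the datum-plumbing names here are hypothetical;
the dictionary API itself is standard libzstd):

#include <zstd.h>

/*
 * Hypothetical: compress one datum with the relation's prebuilt
 * dictionary; dict/dict_len would come from the pg_attribute option,
 * and the compressed datum's header would record the dictionary id.
 */
static size_t
compress_datum_with_dict(void *dst, size_t dst_cap,
						 const void *src, size_t src_len,
						 const void *dict, size_t dict_len)
{
	ZSTD_CCtx  *cctx = ZSTD_createCCtx();
	size_t		ret;

	ret = ZSTD_compress_usingDict(cctx, dst, dst_cap,
								  src, src_len,
								  dict, dict_len,
								  ZSTD_CLEVEL_DEFAULT);
	ZSTD_freeCCtx(cctx);
	return ret;			/* compressed size, or an error per ZSTD_isError() */
}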


A quick demo of the effect size:

# create data to train dictionary with, use subset to increase realism
mkdir /tmp/pg_proc_as_json/;
CREATE EXTENSION adminpack;
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true)
FROM pg_proc pp
LIMIT 2000;


# build dictionary
zstd --train -o /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

# create more data
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true)
FROM pg_proc pp;


# compress without dictionary
lz4 -k -m /tmp/pg_proc_as_json/*.raw
zstd -k /tmp/pg_proc_as_json/*.raw

# measure size
cat /tmp/pg_proc_as_json/*.raw|wc -c; cat /tmp/pg_proc_as_json/*.lz4|wc -c; cat 
/tmp/pg_proc_as_json/*.zst|wc -c


# compress with dictionary
rm -f /tmp/pg_proc_as_json/*.{lz4,zst};
lz4 -k -D /tmp/pg_proc_as_json.dict -m /tmp/pg_proc_as_json/*.raw
zstd -k -D /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

did the same with zstd.

Here's the results:

           lz4      zstd     uncompressed
no dict    1328794  982497   3898498
dict       375070   267194

I'd say the effect of the dictionary is pretty impressive. And remember,
this is with the dictionary having been trained on a subset of the data.


As a comparison, here's all the data compressed compressed at once:

           lz4      zstd
no dict    180231   104913
dict       179814   106444

Unsurprisingly the dictionary doesn't help much, because the compression
algorithm can "natively" see the duplication.


- Andres




Re: pglz compression performance, take two

2023-02-05 Thread Andrey Borodin
On Fri, Jan 6, 2023 at 10:02 PM Andrey Borodin  wrote:
>
> Hello! Please find attached v8.

I got some interesting feedback from some patch users.
There was an oversight that frequently yielded results that are 1, 2 or
3 bytes longer than expected.
Looking closer I found that the correctness of the last 3-byte tail is
checked in two places. PFA a fix for this. Previously compressed data
was correct, however in some cases a few bytes longer than the result of
the current pglz implementation.

Thanks!

Best regards, Andrey Borodin.


v9-0001-Reorganize-pglz-compression-code.patch
Description: Binary data


v9-0002-Fix-oversight-for-compression-of-last-4-bytes.patch
Description: Binary data


add a "status" column to the pg_rules system view

2023-02-05 Thread Albin Hermange
Hi,

I noticed that the pg_rules system view (all PG versions) does not include a
"status" field (like the tgenabled column in pg_trigger).

The official view (from the 15.1 sources) is:

CREATE VIEW pg_rules AS
SELECT
N.nspname AS schemaname,
C.relname AS tablename,
R.rulename AS rulename,
pg_get_ruledef(R.oid) AS definition
FROM (pg_rewrite R JOIN pg_class C ON (C.oid = R.ev_class))
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE R.rulename != '_RETURN';

I propose to add a new field "rule_enabled" to get (easily and officially)
the rule status in all PG versions:

CREATE VIEW pg_rules AS
SELECT
N.nspname AS schemaname,
C.relname AS tablename,
R.rulename AS rulename,
R.ev_enabled as rule_enabled,
pg_get_ruledef(R.oid) AS definition
FROM (pg_rewrite R JOIN pg_class C ON (C.oid = R.ev_class))
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE R.rulename != '_RETURN';
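
For example, with the new column the enabled state becomes directly
queryable (ev_enabled uses the same one-letter encoding as
pg_trigger.tgenabled, e.g. 'O' = enabled in origin mode, 'D' = disabled):

-- list rules that are currently disabled
SELECT schemaname, tablename, rulename
FROM pg_rules
WHERE rule_enabled = 'D';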


What do you think about that?

Thx

Albin


Re: Make EXPLAIN generate a generic plan for a parameterized query

2023-02-05 Thread Laurenz Albe
On Fri, 2023-02-03 at 15:11 -0500, Tom Lane wrote:
> I can think of a couple of possible ways forward:
> 
> * Fix things so that the generic parameters appear to have NULL
> values when inspected during executor startup.  I'm not sure
> how useful that'd be though.  In partition-pruning cases that'd
> lead to EXPLAIN (GENERIC_PLAN) showing the plan with all
> partitions pruned, other than the one for NULL values if there
> is one.  Doesn't sound too helpful.
> 
> * Invent another executor flag that's a "stronger" version of
> EXEC_FLAG_EXPLAIN_ONLY, and pass that when any generic parameters
> exist, and check it in CreatePartitionPruneState to decide whether
> to do startup pruning.  This avoids changing behavior for existing
> cases, but I'm not sure how much it has to recommend it otherwise.
> Skipping the startup prune step seems like it'd produce misleading
> results in another way, ie showing too many partitions not too few.
> 
> This whole business of partition pruning dependent on the
> generic parameters is making me uncomfortable.  It seems like
> we *can't* show a plan that is either representative of real
> execution or comparable to what you get from more-traditional
> EXPLAIN usage.  Maybe we need to step back and think more.

I slept over it, and the second idea now looks like the right
approach to me.  My idea of seeing a generic plan is that plan-time
partition pruning happens, but not execution-time pruning, so that
I get no "subplans removed".
Are there any weird side effects of skipping the startup prune step?

Anyway, attached is v7 that tries to do it that way.  It feels fairly
good to me.  I invented a new executor flag EXEC_FLAG_EXPLAIN_GENERIC.
To avoid having to change all the places that check EXEC_FLAG_EXPLAIN_ONLY
to also check for the new flag, I decided that the new flag can only be
used as "add-on" to EXEC_FLAG_EXPLAIN_ONLY.
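
For illustration, usage then looks like this (a sketch in the spirit of
the patch's regression tests; table names invented):

CREATE TABLE gp_part (id integer) PARTITION BY LIST (id);
CREATE TABLE gp_part1 PARTITION OF gp_part FOR VALUES IN (1);
CREATE TABLE gp_part2 PARTITION OF gp_part FOR VALUES IN (2);

-- plain EXPLAIN would fail because $1 has no value; with the new option
-- we get the generic plan, and no "Subplans Removed" line because
-- executor startup pruning is skipped
EXPLAIN (GENERIC_PLAN, COSTS OFF) SELECT * FROM gp_part WHERE id = $1;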

Yours,
Laurenz Albe
From cd0b5a1a4f301bb7fad9088d5763989f5dde4636 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Sun, 5 Feb 2023 18:11:57 +0100
Subject: [PATCH] Add EXPLAIN option GENERIC_PLAN

This allows EXPLAIN to generate generic plans for parameterized statements
(that have parameter placeholders like $1 in the statement text).
Invent a new executor flag EXEC_FLAG_EXPLAIN_GENERIC that disables runtime
partition pruning for such plans: that would fail if the non-existing
parameters are involved, and we don't want to remove subplans anyway.

Author: Laurenz Albe
Discussion: https://postgr.es/m/0a29b954b10b57f0d135fe12aa0909bd41883eb0.camel%40cybertec.at
---
 doc/src/sgml/ref/explain.sgml | 15 +
 src/backend/commands/explain.c| 11 +++
 src/backend/executor/execMain.c   |  3 ++
 src/backend/executor/execPartition.c  |  9 --
 src/backend/parser/analyze.c  | 29 +
 src/include/commands/explain.h|  1 +
 src/include/executor/executor.h   | 18 +++
 src/test/regress/expected/explain.out | 46 +++
 src/test/regress/sql/explain.sql  | 29 +
 9 files changed, 152 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index d4895b9d7d..a1fdd31bc7 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -40,6 +40,7 @@ EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
     VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
     COSTS [ <replaceable class="parameter">boolean</replaceable> ]
     SETTINGS [ <replaceable class="parameter">boolean</replaceable> ]
+    GENERIC_PLAN [ <replaceable class="parameter">boolean</replaceable> ]
     BUFFERS [ <replaceable class="parameter">boolean</replaceable> ]
     WAL [ <replaceable class="parameter">boolean</replaceable> ]
     TIMING [ <replaceable class="parameter">boolean</replaceable> ]
@@ -167,6 +168,20 @@ ROLLBACK;
      </para>
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>GENERIC_PLAN</literal></term>
+    <listitem>
+     <para>
+      Generate a generic plan for the statement (see <xref linkend="sql-prepare"/>
+      for details about generic plans).  The statement can contain parameter
+      placeholders like <literal>$1</literal> (but then it has to be a statement
+      that supports parameters).  This option cannot be used together with
+      <literal>ANALYZE</literal>, since a statement with unknown parameters
+      cannot be executed.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>BUFFERS</literal></term>
index 35c23bd27d..37ea4e5035 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -190,6 +190,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 			es->wal = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "settings") == 0)
 			es->settings = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "generic_plan") == 0)
+			es->generic = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "timing") == 0)
 		{
 			timing_set = true;
@@ -227,6 +229,13 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 	 parser_errposition(pstate, opt->location)));
 	}
 
+	/* check that GENERIC_PLAN is not used with EXPLAIN ANALYZE */
+	if (es->generic && es->analyze)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("EXPLAIN ANALYZE cannot be used with GENERIC_PLAN")));
+
+	/* check that WAL is used with EXPLAIN ANALYZE */
 	if (es->wal && 

Re: [PATCH] Compression dictionaries for JSONB

2023-02-05 Thread Aleksander Alekseev
Hi,

> I assume that manually specifying dictionary entries is a consequence of
> the prototype state?  I don't think this is something humans are very
> good at, just analyzing the data to see what's useful to dictionarize
> seems more promising.

No, humans are not good at it. The idea was to automate the process
and build the dictionaries automatically e.g. during the VACUUM.

> I don't think we'd want much of the infrastructure introduced in the
> patch for type agnostic cross-row compression. A dedicated "dictionary"
> type as a wrapper around other types IMO is the wrong direction. This
> should be a relation-level optimization option, possibly automatic, not
> something visible to every user of the table.

So to clarify, are we talking about tuple-level compression? Or
perhaps page-level compression?

Implementing page-level compression should be *relatively*
straightforward. As an example this was previously done for InnoDB.
Basically InnoDB compresses the entire page, then rounds the result to
1K, 2K, 4K, 8K, etc and stores the result in a corresponding fork
("fork" in PG terminology), similarly to how a SLAB allocator works.
Additionally a page_id -> fork_id map should be maintained, probably
in yet another fork, similarly to visibility map. A compressed page
can change the fork after being modified since this may change the
size of a compressed page. The buffer manager is unaffected and deals
only with uncompressed pages. (I'm not an expert in InnoDB and this is
my very rough understanding of how its compression works.)

I believe this can be implemented as a TAM. Whether this would be a
"dictionary" compression is debatable but it gives the users similar
benefits, give or take. The advantage is that users shouldn't define
any dictionaries manually, nor should DBMS during VACUUM or somehow
else.

> I also suspect that we'd have to spend a lot of effort to make
> compression/decompression fast if we want to handle dictionaries
> ourselves, rather than using the dictionary support in libraries like
> lz4/zstd.

That's a reasonable concern, can't argue with that.

> I don't think a prototype-y patch not needing a rebase for two months is a
> good measure of complexity :)

It's worth noting that I also invested quite some time into reviewing
type-aware TOASTers :) I just choose to keep my personal opinion about
the complexity of that patch to myself this time since obviously I'm a
bit biased. However if you are curious it's all in the corresponding
thread.

-- 
Best regards,
Aleksander Alekseev




Re: File descriptors in exec'd subprocesses

2023-02-05 Thread Andres Freund
Hi,

On 2023-02-05 11:06:13 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro 
> >  wrote:
> >> Are there any more descriptors we need to think about?
> 
> > Postmaster's listen sockets?
> 
> I wonder whether O_CLOEXEC on that would be inherited by the
> client-communication sockets, though.

I'd be very surprised if it were.



Nope, at least not on linux. Verified by looking at /proc/*/fdinfo/n
after adding SOCK_CLOEXEC to just the socket() call. 'flags' changes
from 02 -> 0202 for the listen socket, but stays at 04002 for the
client socket. If I add SOCK_CLOEXEC to accept() (well, accept4()), it
does change from 04002 to 02004002.
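
For the archives, a sketch of the behaviour verified above (plain Linux
socket API, not PostgreSQL code; accept4() is Linux-specific):

	/*
	 * close-on-exec is a per-descriptor flag; it is not inherited by
	 * sockets returned from accept() on a CLOEXEC listen socket.
	 */
	int		lsock = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, 0);
	/* ... bind()/listen() ... */

	int		csock1 = accept(lsock, NULL, NULL);					/* no CLOEXEC */
	int		csock2 = accept4(lsock, NULL, NULL, SOCK_CLOEXEC);	/* CLOEXEC set */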

Greetings,

Andres Freund




Re: [PATCH] Compression dictionaries for JSONB

2023-02-05 Thread Andres Freund
Hi,

On 2023-02-05 13:41:17 +0300, Aleksander Alekseev wrote:
> > I don't think the approaches in either of these threads is
> > promising. They add a lot of complexity, require implementation effort
> > for each type, manual work by the administrator for column, etc.
> 
> I would like to point out that compression dictionaries don't require
> per-type work.
> 
> Current implementation is artificially limited to JSONB because it's a
> PoC. I was hoping to get more feedback from the community before
> proceeding further. Internally it uses type-agnostic compression and
> doesn't care whether it compresses JSON(B), XML, TEXT, BYTEA or
> arrays. This choice was explicitly done in order to support types
> other than JSONB.

I don't think we'd want much of the infrastructure introduced in the
patch for type agnostic cross-row compression. A dedicated "dictionary"
type as a wrapper around other types IMO is the wrong direction. This
should be a relation-level optimization option, possibly automatic, not
something visible to every user of the table.

I assume that manually specifying dictionary entries is a consequence of
the prototype state?  I don't think this is something humans are very
good at, just analyzing the data to see what's useful to dictionarize
seems more promising.

I also suspect that we'd have to spend a lot of effort to make
compression/decompression fast if we want to handle dictionaries
ourselves, rather than using the dictionary support in libraries like
lz4/zstd.


> > One of the major justifications for work in this area is the cross-row
> > redundancy for types like jsonb. I think there's ways to improve that
> > across types, instead of requiring per-type work.
> 
> To be fair, there are advantages in using type-aware compression. The
> compression algorithm can be more efficient than a general one and in
> theory one can implement lazy decompression, e.g. the one that
> decompresses only the accessed fields of a JSONB document.

> I agree though that particularly for PostgreSQL this is not
> necessarily the right path, especially considering the accompanying
> complexity.

I agree with both those paragraphs.


> above. However having a built-in type-agnostic dictionary compression
> IMO is a too attractive idea to completely ignore it.  Especially
> considering the fact that the implementation was proven to be fairly
> simple and there was even no need to rebase the patch since November
> :)

I don't think a prototype-y patch not needing a rebase for two months is a
good measure of complexity :)

Greetings,

Andres Freund




Re: Weird failure with latches in curculio on v15

2023-02-05 Thread Andres Freund
Hi,

On 2023-02-04 10:03:54 -0800, Nathan Bossart wrote:
> On Sat, Feb 04, 2023 at 03:30:29AM -0800, Andres Freund wrote:
> > That's kind of my problem with these changes. They try to introduce new
> > abstraction layers, but don't provide real abstraction, because they're
> > very tightly bound to the way the functions were called before the
> > refactoring.  And none of these restrictions are actually documented.
> 
> Okay.  Michael, why don't we revert the shell_restore stuff for now?  Once
> the archive modules interface changes and the fix for this
> SIGTERM-during-system() problem are in, I will work through this feedback
> and give recovery modules another try.  I'm still hoping to have recovery
> modules ready in time for the v16 feature freeze.
> 
> My intent was to improve this code by refactoring and reducing code
> duplication, but I seem to have missed the mark.  I am sorry.

FWIW, I think the patches were going roughly the right direction, they
just needs a bit more work.

I don't think we should expose the proc_exit() hack, and its supporting
infrastructure, to the pluggable *_command logic. It's bad enough as-is,
but having to do this stuff within an extension module seems likely to
end badly. There's just way too much action-at-a-distance.

I think Thomas has been hacking on an interruptible system()
replacement. With that, a lot of this ugliness would be resolved.

Greetings,

Andres Freund




Re: File descriptors in exec'd subprocesses

2023-02-05 Thread Tom Lane
Andres Freund  writes:
> On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro 
>  wrote:
>> Are there any more descriptors we need to think about?

> Postmaster's listen sockets?

I wonder whether O_CLOEXEC on that would be inherited by the
client-communication sockets, though.  That's fine ... unless you
are doing EXEC_BACKEND.

regards, tom lane




Re: pg_stat_statements and "IN" conditions

2023-02-05 Thread Tom Lane
Dmitry Dolgov <9erthali...@gmail.com> writes:
> I'm thinking about this in the following way: the core jumbling logic is
> responsible for deriving locations based on the input expressions; in
> the case of merging we produce less locations; pgss have to represent
> the result only using locations and has to be able to differentiate
> simple locations and locations after merging.

Uh ... why?  ISTM you're just going to elide all inside the IN,
so why do you need more than a start and stop position?

regards, tom lane




Re: File descriptors in exec'd subprocesses

2023-02-05 Thread Andres Freund
Hi, 

Unsurprisingly I'm in favor of this. 

On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro  
wrote:
>Are there any more descriptors we need to think about?

Postmaster's listen sockets? Saves a bunch of syscalls, at least. Logging 
collector pipe write end, in backends?

Greetings,

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: run pgindent on a regular basis / scripted manner

2023-02-05 Thread Andrew Dunstan


On 2023-02-04 Sa 09:20, Andrew Dunstan wrote:



On 2023-02-04 Sa 06:34, Andres Freund wrote:


ISTM that we're closer to being able to enforce pgindent than
perltidy. At the same time, I think the issue of C code in HEAD not
being indented is more pressing - IME it's much more common to have to
touch a lot of C code than to have to touch a lot of perl files.  So
perhaps we should just start with being more stringent with C code, and
once we made perltidy less noisy, add that?



Sure, we don't have to tie them together.

I'm currently experimenting with settings on the buildfarm code, 
trying to get it both stable and looking nice. Then I'll try on the 
Postgres core code and post some results.




So here's a diff made from running perltidy v20221112 with the 
additional setting --valign-exclusion-list=", = => || && if unless"


Essentially this abandons those bits of vertical alignment that tend to 
catch us out when additions are made to the code.
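
To illustrate the failure mode (a made-up fragment, not from the
attached patch):

# with vertical alignment, perltidy keeps the '=' signs in one column:
my $node        = PostgreSQL::Test::Cluster->new('primary');
my $port        = $node->port;
# adding this longer declaration forces the two lines above to be
# re-indented as well, so a one-line addition becomes a three-line diff:
my $backup_path = $node->backup_dir;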


I think this will make the code much more maintainable and result in 
much less perltidy churn. It would also mean that it's far more likely 
that what we would naturally code can be undisturbed by perltidy.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com


pgperltidy-no-valign.patch.gz
Description: application/gzip


Re: undersized unions

2023-02-05 Thread Andres Freund
Hi,

On 2023-02-05 10:18:14 +0900, Michael Paquier wrote:
> On Sat, Feb 04, 2023 at 05:07:08AM -0800, Andres Freund wrote:
> > <source>: In function 'assign':
> > <source>:9:6: warning: array subscript 'foo[0]' is partly outside array 
> > bounds of 'unsigned char[4]' [-Warray-bounds=]
> > 9 | p->i = i;
> >   |      ^~
> > <source>:8:22: note: object of size 4 allocated by '__builtin_malloc'
> > 8 | foo *p = (foo *) __builtin_malloc(sizeof(int));
> >   |                  ^
> > Compiler returned: 0
> >
> > I can't really tell if gcc is right or wrong wrong to warn about
> > this. On the one hand it's a union, and we only access the element that
> > is actually backed by memory, on the other hand, the standard does say
> > that the size of a union is the largest element, so we are pointing to
> > something undersized.
>
> Something I have noticed, related to that..  meson reports a set of
> warnings here, not ./configure, still I apply the same set of CFLAGS
> to both.  What's the difference in the meson setup that creates that,
> if I may ask?  Is there a link to the way -Warray-bounds is handled?

It's possibly related to the optimization level used. Need a bit more
information to provide a more educated guess. What warnings, what CFLAGS
etc.

Greetings,

Andres Freund




Re: First draft of back-branch release notes is done

2023-02-05 Thread Alvaro Herrera
On 2023-Feb-03, Tom Lane wrote:

> ... at
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f282b026787da69d88a35404cf62f1cc21cfbb7c
> 
> As usual, please send corrections/comments by Sunday.

  Fix edge-case data corruption in shared tuplestores (Dmitry Astapov)

  If the final chunk of a large tuple being written out to disk was
  exactly 32760 bytes, it would be corrupted due to a fencepost bug.
  This is a hazard for parallelized plans that require a tuplestore,
  such as parallel hash join.  The query would typically fail later
  with corrupted-data symptoms.

I think this sounds really scary, because people are going to think that
their stored data can get corrupted -- they don't necessarily know what
a "shared tuplestore" is.  Maybe "Avoid query failures in parallel hash
joins" as headline?

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"The Gord often wonders why people threaten never to come back after they've
been told never to return" (www.actsofgord.com)




Re: pg_stat_statements and "IN" conditions

2023-02-05 Thread Dmitry Dolgov
> On Sun, Feb 05, 2023 at 10:30:25AM +0900, Michael Paquier wrote:
> On Sat, Feb 04, 2023 at 06:08:41PM +0100, Dmitry Dolgov wrote:
> > Here is the rebased version. To adapt to the latest changes, I've marked
> > ArrayExpr with custom_query_jumble to implement this functionality, but
> > tried to make the actual merge logic relatively independent. Otherwise,
> > everything is the same.
>
> I was quickly looking at this patch, so these are rough impressions.
>
+   bool    merged;    /* whether or not the location was marked as
> +  not contributing to jumble */
>
> This part of the patch is a bit disturbing..  We have node attributes
> to track if portions of a node should be ignored or have a location
> marked, still this "merged" flag is used as an extension to track if a
> location should be done or not.  Is that a concept that had better be
> controlled via a new node attribute?

Good question. I need to think a bit more if it's possible to leverage
node attributes mechanism, but at the moment I'm still inclined to
extend LocationLen. The reason is that it doesn't exactly serve the
tracking purpose, i.e. whether to capture a location (I have to update
the code commentary), it helps differentiate cases when locations A and
D are obtained from merging A B C D instead of just being A and D.

I'm thinking about this in the following way: the core jumbling logic is
responsible for deriving locations based on the input expressions; in
the case of merging we produce less locations; pgss have to represent
the result only using locations and has to be able to differentiate
simple locations and locations after merging.

> +--
> +-- Consts merging
> +--
> +CREATE TABLE test_merge (id int, data int);
> +-- IN queries
> +-- No merging
>
> Would it be better to split this set of tests into a new file?  FWIW,
> I have a patch in baking process that refactors a bit the whole,
> before being able to extend it so as we have more coverage for
> normalized utility queries, as of now the query strings stored by
> pg_stat_statements don't reflect that even if the jumbling computation
> marks the location of the Const nodes included in utility statements
> (partition bounds, queries of COPY, etc.).  I should be able to send
> that tomorrow, and my guess that you could take advantage of that
> even for this thread.

Sure, I'll take a look how I can benefit from your work, thanks.




Re: undersized unions

2023-02-05 Thread Andres Freund
Hi, 

On February 5, 2023 6:16:55 AM GMT+01:00, Tom Lane  wrote:
>Michael Paquier  writes:
>> On Sat, Feb 04, 2023 at 05:07:08AM -0800, Andres Freund wrote:
>>> We actually have a fair amount of code like that, but currently are
>>> escaping most of the warnings, because gcc doesn't know that palloc() is
>>> an allocator. With more optimizations (particularly with LTO), we end up
>>> with more of such warnings.  I'd like to annotate palloc so gcc
>>> understands the size, as that does help to catch bugs when confusing the
>>> type. It also helps static analyzers.
>
>> Ah, that seems like a good idea in the long run.
>
>I'm kind of skeptical about whether we'll be able to get rid of all
>the resulting warnings without extremely invasive (and ugly) changes.

It's not that many sources of warnings, fwiw.

But the concrete reason for posting here was that I'm wondering whether the 
"undersized" allocations could cause problems as-is. 

On the one hand there are compiler optimizations that could end up being a
problem - imagine two branches of an if allocating something containing a union,
one branch assigning to a 32-bit and the other to a 64-bit integer union member. It'd imo be
reasonable for the compiler to move that register->memory move outside of the 
if.

On the other hand, it also just seems risky from a code-writing perspective.
It's not immediately obvious that it'd be unsafe to create an on-stack Numeric by
assigning *ptr. But it is.
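
To make the hazard concrete (a toy sketch, not code from the tree):

	typedef union foo
	{
		int			i;
		long long	big;	/* sizeof(foo) == 8 */
	} foo;

	/* allocation sized for one member, as in the warning above */
	foo *p = (foo *) malloc(sizeof(int));

	p->i = 42;			/* fine at runtime today */

	/*
	 * Unsafe: copying the union reads sizeof(foo) == 8 bytes from a
	 * 4-byte allocation -- the on-stack copy hazard described above.
	 */
	foo local = *p;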

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [PATCH] Compression dictionaries for JSONB

2023-02-05 Thread Aleksander Alekseev
Hi,

> I don't think the approaches in either of these threads is
> promising. They add a lot of complexity, require implementation effort
> for each type, manual work by the administrator for column, etc.

I would like to point out that compression dictionaries don't require
per-type work.

Current implementation is artificially limited to JSONB because it's a
PoC. I was hoping to get more feedback from the community before
proceeding further. Internally it uses type-agnostic compression and
doesn't care whether it compresses JSON(B), XML, TEXT, BYTEA or
arrays. This choice was explicitly done in order to support types
other than JSONB.

> One of the major justifications for work in this area is the cross-row
> redundancy for types like jsonb. I think there's ways to improve that
> across types, instead of requiring per-type work.

To be fair, there are advantages in using type-aware compression. The
compression algorithm can be more efficient than a general one and in
theory one can implement lazy decompression, e.g. the one that
decompresses only the accessed fields of a JSONB document.

I agree though that particularly for PostgreSQL this is not
necessarily the right path, especially considering the accompanying
complexity.

If the user cares about the disk space consumption why storing JSONB
in a relational DBMS in the first place? We already have a great
solution for compacting the data, it was invented in the 70s and is
called normalization.

Since PostgreSQL is not a specified document-oriented DBMS I think we
better focus our (far from being infinite) resources on something more
people would benefit from: AIO/DIO [1] or perhaps getting rid of
freezing [2], to name a few examples.

> [...]
> step back and decide which one we dislike the less, and how to fix that
> one into a shape that we no longer dislike.

For the sake of completeness, doing neither type-aware TOASTing nor
compression dictionaries and leaving this area to the extension
authors (e.g. ZSON) is also a possible choice, for the same reasons
named above. However having a built-in type-agnostic dictionary
compression IMO is a too attractive idea to completely ignore it.
Especially considering the fact that the implementation was proven to
be fairly simple and there was even no need to rebase the patch since
November :)

I know that there were concerns [3] regarding the typmod hack. I don't
like it either and 100% open to suggestions here. This is merely a
current implementation detail used in a PoC, not a fundamental design
decision.

[1]: https://postgr.es/m/20210223100344.llw5an2aklengrmn%40alap3.anarazel.de
[2]: 
https://postgr.es/m/CAJ7c6TOk1mx4KfF0AHkvXi%2BpkdjFqwTwvRE-JmdczZMAYnRQ0w%40mail.gmail.com
[3]: 
https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2023_Developer_Meeting#v16_Patch_Triage

-- 
Best regards,
Aleksander Alekseev




Re: Generating code for query jumbling through gen_node_support.pl

2023-02-05 Thread Michael Paquier
On Tue, Jan 31, 2023 at 03:40:56PM +0900, Michael Paquier wrote:
> With all that in mind, I have spent my day polishing that and doing a
> close lookup, and the patch has been applied.  Thanks a lot!

While working on a different patch, I have noticed a small issue in
the way the jumbling happens for A_Const, where ValUnion was not
getting jumbled correctly.  This caused a few statements that rely on
this node to compute the same query IDs when using different values.
The full contents of pg_stat_statements for a regression database
point to: 
- SET.
- COPY with queries.
- CREATE TABLE with partition bounds and default expressions.

This was causing some confusion in pg_stat_statements where some
utility queries would be incorrectly reported, and at this point the
intention is to keep this area compatible with the string-based method
when it comes to the values.  Like read, write and copy nodes, we need
to compile the query ID based on the type of the value, which cannot
be automated.  Attached is a patch to fix this issue with some
regression tests, that I'd like to get fixed before moving on with
more business in pg_stat_statements (aka properly show Const nodes for
utilities with normalized queries).

Comments or objections are welcome, of course.

(FWIW, I'd like to think that there is an argument to normalize the
A_Const nodes for a portion of the DDL queries, by ignoring their
values in the query jumbling and mark a location, which would be
really useful for some workloads, but that's a separate discussion I
am keeping for later.)
--
Michael
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 3d67787e7a..855da99ec0 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -355,7 +355,7 @@ union ValUnion
 
 typedef struct A_Const
 {
-	pg_node_attr(custom_copy_equal, custom_read_write)
+	pg_node_attr(custom_copy_equal, custom_read_write, custom_query_jumble)
 
 	NodeTag		type;
 	union ValUnion val;
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index 223d1bc826..72ce15e43d 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -49,6 +49,7 @@ static void AppendJumble(JumbleState *jstate,
 		 const unsigned char *item, Size size);
 static void RecordConstLocation(JumbleState *jstate, int location);
 static void _jumbleNode(JumbleState *jstate, Node *node);
+static void _jumbleA_Const(JumbleState *jstate, Node *node);
 static void _jumbleList(JumbleState *jstate, Node *node);
 static void _jumbleRangeTblEntry(JumbleState *jstate, Node *node);
 
@@ -313,6 +314,40 @@ _jumbleList(JumbleState *jstate, Node *node)
 	}
 }
 
+static void
+_jumbleA_Const(JumbleState *jstate, Node *node)
+{
+	A_Const *expr = (A_Const *) node;
+
+	JUMBLE_FIELD(isnull);
+	if (!expr->isnull)
+	{
+		JUMBLE_FIELD(val.node.type);
+		switch (nodeTag(&expr->val))
+		{
+			case T_Integer:
+JUMBLE_FIELD(val.ival.ival);
+break;
+			case T_Float:
+JUMBLE_STRING(val.fval.fval);
+break;
+			case T_Boolean:
+JUMBLE_FIELD(val.boolval.boolval);
+break;
+			case T_String:
+JUMBLE_STRING(val.sval.sval);
+break;
+			case T_BitString:
+JUMBLE_STRING(val.bsval.bsval);
+break;
+			default:
+elog(ERROR, "unrecognized node type: %d",
+	 (int) nodeTag(&expr->val));
+break;
+		}
+	}
+}
+
 static void
 _jumbleRangeTblEntry(JumbleState *jstate, Node *node)
 {
diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index fb9ccd920f..753062a9d7 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -579,6 +579,14 @@ NOTICE:  table "test" does not exist, skipping
 NOTICE:  table "test" does not exist, skipping
 NOTICE:  function plus_one(pg_catalog.int4) does not exist, skipping
 DROP FUNCTION PLUS_TWO(INTEGER);
+-- This SET query uses two different strings, still they count as one entry.
+SET work_mem = '1MB';
+Set work_mem = '1MB';
+SET work_mem = '2MB';
+RESET work_mem;
+SET enable_seqscan = off;
+SET enable_seqscan = on;
+RESET enable_seqscan;
 SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
 query | calls | rows 
 --+---+--
@@ -588,10 +596,16 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
  DROP FUNCTION PLUS_TWO(INTEGER)  | 1 |0
  DROP TABLE IF EXISTS test| 3 |0
  DROP TABLE test  | 1 |0
+ RESET enable_seqscan | 1 |0
+ RESET work_mem 

Re: what's the meaning of key?

2023-02-05 Thread Heikki Linnakangas

On 05/02/2023 06:09, jack...@gmail.com wrote:

I'm doing research on heap_am, and for heap_beginscan func, I find
out that there is an arg called nkeys. I use some sqls as examples like
'select * from t;' and 'select * from t where a  = 1', but it is always 
zero,

can you give me some descriptions for this? what's it used for?


The executor evaluates table scan quals in the SeqScan node itself, in 
ExecScan function. It doesn't use the heap_beginscan scankeys.


There has been some discussion on changing that, as some table access 
methods might be able to filter rows more efficiently using the scan 
keys than the executor node. But that's how it currently works.


I think the heap scankeys are used by catalog accesses, though, so it's 
not completely dead code.
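
For reference, here is roughly what such a catalog access looks like,
with a nonzero nkeys (a sketch of the existing systable_beginscan()
pattern; relid is a placeholder):

	/* look up the pg_class row with oid = relid */
	Relation	rel = table_open(RelationRelationId, AccessShareLock);
	ScanKeyData key[1];
	SysScanDesc scan;
	HeapTuple	tup;

	ScanKeyInit(&key[0],
				Anum_pg_class_oid,
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(relid));

	scan = systable_beginscan(rel, ClassOidIndexId, true, NULL, 1, key);
	tup = systable_getnext(scan);
	/* ... use tup ... */
	systable_endscan(scan);
	table_close(rel, AccessShareLock);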


- Heikki





Re: HOT chain validation in verify_heapam()

2023-02-05 Thread Himanshu Upadhyaya
On Tue, Jan 31, 2023 at 7:20 PM Robert Haas  wrote:

> On Mon, Jan 30, 2023 at 8:24 AM Himanshu Upadhyaya
>  wrote:
> > Before this we stop the node by "$node->stop;" and then only we progress
> to
> > manual corruption. This will abort all running/in-progress transactions.
> > So, if we create an in-progress transaction and comment "$node->stop;"
> > then somehow all the code that we have for manual corruption does not
> work.
> >
> > I think it is required to stop the server and then only proceed for
> manual corruption?
> > If this is the case then please suggest if there is a way to get an
> in-progress transaction
> > that we can use for manual corruption.
>
> How about using a prepared transaction?
>
Thanks, yes it's working fine with Prepared Transaction.
Please find attached the v9 patch incorporating all the review comments.


-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com
From d241937841c5990ff4df00d1abc6bfbb3491ac3e Mon Sep 17 00:00:00 2001
From: Himanshu Upadhyaya 
Date: Sun, 5 Feb 2023 12:50:21 +0530
Subject: [PATCH v9] Implement HOT chain validation in verify_heapam()

Himanshu Upadhyaya, reviewed by Robert Haas, Aleksander Alekseev, Andres Freund
---
 contrib/amcheck/verify_heapam.c   | 297 ++
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 241 +-
 2 files changed, 527 insertions(+), 11 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 4fcfd6df72..01c639b5e1 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -399,9 +399,14 @@ verify_heapam(PG_FUNCTION_ARGS)
 	for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)
 	{
 		OffsetNumber maxoff;
+		OffsetNumber predecessor[MaxOffsetNumber];
+		OffsetNumber successor[MaxOffsetNumber];
+		bool		lp_valid[MaxOffsetNumber];
 
 		CHECK_FOR_INTERRUPTS();
 
+		memset(predecessor, 0, sizeof(OffsetNumber) * MaxOffsetNumber);
+
 		/* Optionally skip over all-frozen or all-visible blocks */
 		if (skip_option != SKIP_PAGES_NONE)
 		{
@@ -433,6 +438,10 @@ verify_heapam(PG_FUNCTION_ARGS)
 		for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
 			 ctx.offnum = OffsetNumberNext(ctx.offnum))
 		{
+			OffsetNumber nextoffnum;
+
+			successor[ctx.offnum] = InvalidOffsetNumber;
+			lp_valid[ctx.offnum] = false;
 			ctx.itemid = PageGetItemId(ctx.page, ctx.offnum);
 
 			/* Skip over unused/dead line pointers */
@@ -469,6 +478,13 @@ verify_heapam(PG_FUNCTION_ARGS)
 	report_corruption(,
 	  psprintf("line pointer redirection to unused item at offset %u",
 			   (unsigned) rdoffnum));
+
+/*
+ * Make entry in successor array, redirected lp will be
+ * validated at the time when we loop over successor array.
+ */
+successor[ctx.offnum] = rdoffnum;
+lp_valid[ctx.offnum] = true;
 continue;
 			}
 
@@ -504,9 +520,283 @@ verify_heapam(PG_FUNCTION_ARGS)
 			/* It should be safe to examine the tuple's header, at least */
 			ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid);
 			ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);
+			lp_valid[ctx.offnum] = true;
 
 			/* Ok, ready to check this next tuple */
 			check_tuple();
+
+			/*
+			 * Add the data to the successor array if next updated tuple is in
+			 * the same page. It will be used later to generate the
+			 * predecessor array.
+			 *
+			 * We need to access the tuple's header to populate the
+			 * predecessor array. However the tuple is not necessarily sanity
+			 * checked yet so delaying construction of predecessor array until
+			 * all tuples are sanity checked.
+			 */
+			nextoffnum = ItemPointerGetOffsetNumber(&(ctx.tuphdr)->t_ctid);
+			if (ItemPointerGetBlockNumber(&(ctx.tuphdr)->t_ctid) == ctx.blkno &&
+nextoffnum != ctx.offnum)
+			{
+successor[ctx.offnum] = nextoffnum;
+			}
+		}
+
+		/*
+		 * Loop over offset and populate predecessor array from all entries
+		 * that are present in successor array.
+		 */
+		ctx.attnum = -1;
+		for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
+			 ctx.offnum = OffsetNumberNext(ctx.offnum))
+		{
+			ItemId		curr_lp;
+			ItemId		next_lp;
+			HeapTupleHeader curr_htup;
+			HeapTupleHeader next_htup;
+			TransactionId curr_xmax;
+			TransactionId next_xmin;
+
+			OffsetNumber nextoffnum = successor[ctx.offnum];
+
+			if (nextoffnum == InvalidOffsetNumber)
+			{
+/*
+ * This is either the last updated tuple in the chain or a
+ * corrupted Tuple/lp or unused/dead line pointer.
+ */
+continue;
+			}
+
+			/*
+			 * Add data to the predecessor array even if the current or
+			 * successor's LP is not valid. We will not process/validate these
+			 * offset entries while looping over the predecessor array but
+			 * having all entries in the predecessor array will help in
+			 * identifying(and validating) the Root of a chain.
+			 */
+			if (!lp_valid[ctx.offnum] || !lp_valid[nextoffnum])
+			{
+