Re: Replace IN VALUES with ANY in WHERE clauses during optimization

2024-11-11 Thread Ivan Kush
I agree, your implementation is better: reliability is better and debugging
is simpler.
I've looked at the code; it looks good to me. Only style nits such as
VTA/VtA, SELECT/select, etc. may need to be corrected.


On 10/4/24 12:15, Alena Rybakina wrote:


It was sent here [0].

[0] 
https://www.postgresql.org/message-id/21d5fca5-0c02-4afd-8c98-d0930b298a8d%40gmail.com



--
Best wishes,
Ivan Kush
Tantor Labs LLC





Re: index prefetching

2024-11-11 Thread Robert Haas
On Sun, Nov 10, 2024 at 5:41 PM Peter Geoghegan  wrote:
> > It seems to me knowing which pages may be pinned is very AM-specific
> > knowledge, and my intention was to let the AM to manage that.
>
> This is useful information, because it helps me to understand how
> you're viewing this.
>
> I totally disagree with this characterization. This is an important
> difference in perspective. IMV index AMs hardly care at all about
> holding onto buffer pins, very much unlike heapam.
>
> I think that holding onto pins and whatnot has almost nothing to do
> with the index AM as such -- it's about protecting against unsafe
> concurrent TID recycling, which is a table AM/heap issue. You can make
> a rather weak argument that the index AM needs it for _bt_killitems,
> but that seems very secondary to me (if you go back long enough there
> are no _bt_killitems, but the pin thing itself still existed).

Much of this discussion is going over my head, but I have a comment on
this part. I suppose that when any code in the system takes a pin on a
buffer page, the initial concern is almost always to keep the page
from disappearing out from under it. There might be a few exceptions,
but hopefully not many. So I suppose what is happening here is that the
index AM pins an index page so that it can read that page -- and then
it defers releasing the pin because of some interlocking concern. So
at any given moment, there's some set of pins (possibly empty) that
the index AM is holding for its own purposes, and some other set of
pins (also possibly empty) that the index AM no longer requires for
its own purposes but which are still required for heap/index
interlocking. The second set of pins could possibly be managed in some
AM-agnostic way. The AM could communicate that after the heap is done
with X set of TIDs, it can unpin Y set of pages. But the first set of
pins are of direct and immediate concern to the AM.

Or at least, so it seems to me. Am I confused?
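
To make that concrete, here's a purely hypothetical sketch (none of these
names exist in the tree; it only illustrates the idea) of the kind of
handoff such AM-agnostic management of the "second set" could use, pairing
the TIDs handed to the heap with the pins kept only for the interlock:

#include "postgres.h"
#include "storage/buf.h"
#include "storage/itemptr.h"

/*
 * Hypothetical: a batch returned by the index AM. "deferred_pins" are
 * buffers the AM no longer needs for its own purposes but that must stay
 * pinned until the heap is done with all of "tids" (the second set above).
 * Pins the AM still needs for itself (the first set) would never appear
 * here.
 */
typedef struct IndexPinHandoff
{
    ItemPointerData *tids;          /* TIDs not yet consumed by the heap */
    int              ntids;
    Buffer          *deferred_pins; /* unpin once all ntids are processed */
    int              npins;
} IndexPinHandoff;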

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Linkify mentions of the primary/subscriber's max_replication_slots

2024-11-11 Thread Tristan Partin

On Mon Nov 11, 2024 at 5:53 AM CST, Amit Kapila wrote:

On Thu, Nov 7, 2024 at 4:40 PM Amit Kapila  wrote:


On Thu, Nov 7, 2024 at 9:36 AM Tristan Partin  wrote:
>
>
> Here is a patch which does so.
>


  Note that this parameter also applies on the subscriber side, but with
- a different meaning.
+ a different meaning. See 
+ in  for more
+ details.
 

   
@@ -5215,7 +5217,9 @@ ANY num_sync ( 
+in  for more
+details.

 and  are incorrect. They
should be instead  and .



I have made the required changes and pushed the patch.


Thanks for your help in getting this over the line, Amit 😄.

--
Tristan Partin
Neon (https://neon.tech)




Re: index prefetching

2024-11-11 Thread Peter Geoghegan
On Mon, Nov 11, 2024 at 2:00 PM Peter Geoghegan  wrote:
> The real sign that what I said is generally true of index AMs is that
> you'll see so few calls to
> LockBufferForCleanup/ConditionalLockBufferForCleanup. Only hash calls
> ConditionalLockBufferForCleanup at all (which I find a bit weird).
> Both GiST and SP-GiST call neither function -- even during VACUUM. So
> GiST and SP-GiST make clear that index AMs (that support only MVCC
> snapshot scans) can easily get by without any use of cleanup locks
> (and with no externally significant use of buffer pins).

Actually, I'm pretty sure that it's wrong for GiST VACUUM to not
acquire a full cleanup lock (which used to be called a super-exclusive
lock in index AM contexts), as I went into some years ago:

https://www.postgresql.org/message-id/flat/CAH2-Wz%3DPqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt%3D5qxWQ%40mail.gmail.com

I plan on playing around with injection points soon. I might try my
hand at proving that GiST VACUUM needs to do more here to avoid
breaking concurrent GiST index-only scans.

Issues such as this are why I place so much emphasis on formalizing
all the rules around TID recycling and dropping pins with index scans.
I think that we're still a bit sloppy about things in this area.

-- 
Peter Geoghegan




Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-11 Thread Masahiko Sawada
On Sat, Nov 9, 2024 at 3:45 AM Tomas Vondra  wrote:
>
>
>
> On 11/8/24 19:25, Masahiko Sawada wrote:
> > Hi,
> >
> > Thank you for investigating this issue.
> >
> > On Thu, Nov 7, 2024 at 10:40 AM Tomas Vondra  wrote:
> >>
> >> Hi,
> >>
> >> I kept investigating this, but I haven't made much progress. I still
> >> don't understand why would it be OK to move any of the LSN fields
> >> backwards - certainly for fields like confirm_flush or restart_lsn.
> >>
> >> I did a simple experiment - added asserts to the couple places in
> >> logical.c updating the LSN fields, checking the value is increased.
> >> But then I simply ran make check-world, instead of the stress test.
> >>
> >> And that actually fails too, 040_standby_failover_slots_sync.pl triggers
> >> this
> >>
> >> {
> >> SpinLockAcquire(&MyReplicationSlot->mutex);
> >> Assert(MyReplicationSlot->data.confirmed_flush <= lsn);
> >> MyReplicationSlot->data.confirmed_flush = lsn;
> >> SpinLockRelease(&MyReplicationSlot->mutex);
> >> }
> >>
> >> So this moves confirm_flush back, albeit only by a tiny amount (I've
> >> seen ~56 byte difference). I don't have an example of this causing an
> >> issue in practice, but I note that CheckPointReplicationSlots does this:
> >>
> >> if (is_shutdown && SlotIsLogical(s))
> >> {
> >> SpinLockAcquire(&s->mutex);
> >>
> >> if (s->data.invalidated == RS_INVAL_NONE &&
> >> s->data.confirmed_flush > s->last_saved_confirmed_flush)
> >> {
> >> s->just_dirtied = true;
> >> s->dirty = true;
> >> }
> >> SpinLockRelease(&s->mutex);
> >> }
> >>
> >> to determine if a slot needs to be flushed to disk during checkpoint. So
> >> I guess it's possible we save a slot to disk at some LSN, then the
> >> confirm_flush moves backward, and we fail to sync the slot to disk.
> >>
> >> But I don't have a reproducer for this ...
> >>
> >>
> >> I also noticed a strange difference between LogicalIncreaseXminForSlot
> >> and LogicalIncreaseRestartDecodingForSlot.
> >>
> >> The structure of LogicalIncreaseXminForSlot looks like this:
> >>
> >> if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin))
> >> {
> >> }
> >> else if (current_lsn <= slot->data.confirmed_flush)
> >> {
> >> ... update candidate fields ...
> >> }
> >> else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
> >> {
> >> ... update candidate fields ...
> >> }
> >>
> >> while LogicalIncreaseRestartDecodingForSlot looks like this:
> >>
> >> if (restart_lsn <= slot->data.restart_lsn)
> >> {
> >> }
> >> else if (current_lsn <= slot->data.confirmed_flush)
> >> {
> >> ... update candidate fields ...
> >> }
> >>
> >> if (slot->candidate_restart_valid == InvalidXLogRecPtr)
> >> {
> >> ... update candidate fields ...
> >> }
> >>
> >> Notice that LogicalIncreaseXminForSlot has the third block guarded by
> >> "else if", while LogicalIncreaseRestartDecodingForSlot has "if". Isn't
> >> that a bit suspicious, considering the functions do the same thing, just
> >> for different fields? I don't know if this is dangerous, the comments
> >> suggest it may just waste extra effort after reconnect.
> >>
> >
> > I also suspected this point. I still need to investigate if this
> > suspicion is related to the issue but I find this code in
> > LogicalIncreaseRestartDecodingForSlot() is dangerous.
> >
> > We update the slot's restart_lsn based on candidate_lsn and
> > candidate_valid upon receiving a feedback message from a subscriber,
> > then clear both fields. Therefore, this code in
> > LogicalIncreaseRestartDecodingForSlot() means that it assigns an
> > arbitrary LSN to candidate_restart_lsn after updating the slot's
> > restart_lsn.
> >
> > I think an LSN older than the slot's restart_lsn can be passed to
> > LogicalIncreaseRestartDecodingForSlot() as restart_lsn, for example
> > after logical decoding restarts. The scenario I shared on another
> > thread was that after updating the slot's restart_lsn (upon feedback
> > from a subscriber) based on both the candidate_restart_lsn and
> > candidate_restart_valid that remained in the slot, we might call
> > LogicalIncreaseRestartDecodingForSlot() when decoding a RUNNING_XACTS
> > record whose LSN is older than the slot's new restart_lsn. In this
> > case, we end up passing an LSN older than the new restart_lsn to
> > LogicalIncreaseRestartDecodingForSlot(), and that LSN is set to
> > candidate_restart_lsn.
>
> Right, I believe that matches my observations. I only see the issues
> after (unexpected) restarts, say due to network issues, but chances are
> regular reconnects have the same problem.
>
> > My hypothesis is that we wanted to prevent such
> > case by the first if block:
> >
> > /* don't overwrite if have a newer restart lsn */
> > if (restart_lsn <= slot->data.restart_lsn)
> > {
> > }
> >
>
> Yeah, that condition / comment seems to say exactly that.

Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE

2024-11-11 Thread Guillaume Lelarge
Hi,

On Wed, Nov 6, 2024 at 17:57, Michael Christofides  wrote:

> [...]
> I agree. In the past 6 versions, 5 new parameters have been added.
> SETTINGS in v12, WAL in v13, GENERIC_PLAN in v16, SERIALIZE in
> v17, and MEMORY in v17. It feels like we should have some easier way
> to get everything. Currently, we need to specify: EXPLAIN (ANALYZE,
> VERBOSE, BUFFERS, SETTINGS, WAL, SERIALIZE, MEMORY).
>
>
Agreed. Having an "EXPLAIN (ALL)" would be a great addition. I could tell a
customer to run "EXPLAIN (ALL)", rather than first asking which PostgreSQL
release is installed on the server and then giving the correct options
for EXPLAIN.


-- 
Guillaume.


Re: Using Expanded Objects other than Arrays from plpgsql

2024-11-11 Thread Michel Pelletier
>
> The second expansion in matrix_out happens outside the function, so inside
> there is only the one expansion for both matrix_nvals and the assignment.
> Thank you!  All my tests continue to pass and the change seems to work well.


Hmm, well, I spoke too soon: I was looking at the wrong function, one
without an assignment. I've been down in gdb a bit too much and got mixed
up. Here's another function that wraps it and does an assignment, and it
looks like there is another expansion happening after the nvals but before
the assignment:

create or replace function test_expand_expand(graph matrix) returns matrix
language plpgsql as
$$
declare
nvals bigint = nvals(graph);
begin
graph = test_expand(graph);
return test_expand(graph);
end;
$$;

postgres=# select test_expand_expand(a) from test_fixture ;
DEBUG:  matrix_nvals
DEBUG:  DatumGetMatrix
DEBUG:  expand_matrix
DEBUG:  new_matrix
DEBUG:  context_callback_matrix_free
DEBUG:  matrix_nvals
DEBUG:  DatumGetMatrix
DEBUG:  expand_matrix
DEBUG:  new_matrix
DEBUG:  context_callback_matrix_free
DEBUG:  matrix_nvals
DEBUG:  DatumGetMatrix
DEBUG:  expand_matrix
DEBUG:  new_matrix
DEBUG:  context_callback_matrix_free
DEBUG:  matrix_out
DEBUG:  DatumGetMatrix
DEBUG:  expand_matrix
DEBUG:  new_matrix
DEBUG:  context_callback_matrix_free

So I'm going back to the assumption that I'm somehow not returning the
right pointer. But digging into the array code, I'm pretty sure I'm
following the same pattern, and tracing my code path shows it calling
EOHPGetRWDatum when I return the object. I can't get it to stop on
DeleteExpandedObject, so I don't think plpgsql is deleting it. I'm going
to keep investigating; sorry for the noise.

-Michel

>


Re: UUID v7

2024-11-11 Thread Masahiko Sawada
On Sat, Nov 9, 2024 at 9:07 AM Sergey Prokhorenko
 wrote:
>
> On Saturday 9 November 2024 at 01:00:15 am GMT+3, Masahiko Sawada 
>  wrote:
>
> > the microsecond part is working also as a counter in a sense. It seems fine
> > to me but I'm slightly concerned that there is no guidance for such an
> > implementation in RFC 9562.
>
> In fact, there is guidance for a similar implementation in RFC 9562:
> https://datatracker.ietf.org/doc/html/rfc9562#name-monotonicity-and-counters
> "Counter Rollover Handling:"
> "Alternatively, implementations MAY increment the timestamp ahead of the 
> actual time and reinitialize the counter."
>

Indeed, thank you.

> But in the near future, this may not be enough for the highest-performance 
> systems.

Yeah, I'm concerned about this. That time might gradually come. That
being said, as long as the rand_a part also works as a counter, it's fine.
Also, 12 bits does not differ much, as Andrey Borodin mentioned. I
think in the first version it's better to start with a simple
implementation rather than over-engineering it.

Regarding the implementation, the v30 patch uses only microsecond
precision time even on platforms where nanosecond precision is
available, such as Linux. I think it's better to store the value of
(sub-millisecond fraction * 4096) into the 12 bits of rand_a space
instead of directly storing microseconds into a 10-bit space. That way,
we can use nanosecond precision timestamps where available. On some
platforms such as macOS, the sub-millisecond timestamp precision is
restricted to microseconds; we can consider that a kind of special
case. If 12 bits of rand_a space is not enough to guarantee
monotonicity in the future, it is also possible to improve it by
putting a (random) counter into rand_b.
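
To make the arithmetic concrete, here's a minimal sketch of that scaling
(just an illustration of the suggestion, not code from the v30 patch; it
assumes a clock_gettime()-style struct timespec as input):

#include <stdint.h>
#include <time.h>

/*
 * Spread the sub-millisecond part of a nanosecond-precision timestamp over
 * the full 12-bit rand_a space: value = sub_ms_fraction * 4096. On
 * platforms that only deliver microsecond precision, tv_nsec simply has
 * trailing zeros and fewer distinct values get produced.
 */
static uint16_t
uuidv7_rand_a_from_timespec(const struct timespec *ts)
{
    uint64_t ns_in_ms = (uint64_t) ts->tv_nsec % 1000000;   /* 0..999999 */

    return (uint16_t) (ns_in_ms * 4096 / 1000000);          /* 0..4095 */
}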

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-11-11 Thread Kirill Reshke
On Mon, 11 Nov 2024 at 16:11, torikoshia  wrote:
>
> On 2024-11-09 21:55, Kirill Reshke wrote:
>
> Thanks for working on this!

Thanks for reviewing the v7 patch series!

> > On Thu, 7 Nov 2024 at 23:00, Fujii Masao 
> > wrote:
> >>
> >>
> >>
> >> On 2024/10/26 6:03, Kirill Reshke wrote:
> >> > when the REJECT LIMIT is set to some non-zero number and the number of
> >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> >> > there WAS errors, and we should not tolerate more than $limit errors .
> >> > I do find this behavior to be consistent.
> >>
> >> +1
> >>
> >>
> >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> >> > replacements, as if REJECT LIMIT is inf.
> >>
> >> +1
> >
> > After thinking for a while, I'm now more opposed to this approach. I
> > think we should count rows with erroneous data as errors only if
> > null substitution for these rows failed, not the total number of rows
> > which were modified.
> > Then, to respect the REJECT LIMIT option, we compare this number with
> > the limit. This is actually a simpler approach, IMHO. What do you think?
>
> IMHO I prefer the previous interpretation.
> I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> is used to specify how many malformed rows are acceptable in the
> "original" data source.

I do like the first interpretation, but I have one issue with it.
According to this interpretation, we will fail the COPY command if the
number of malformed data rows exceeds the limit, not the number of
rejected rows (some fraction of the malformed rows are accepted with
null substitution).
So a more proper name for the limit would be MALFORMED_LIMIT, or
something similar. However, we are unable to change the name since the
REJECT_LIMIT option has already been committed.
I guess I'll just have to put up with this contradiction. I will send
an updated patch shortly...


> >> > But while I was trying to implement that, I realized that I don't
> >> > understand v4 of this patch. My misunderstanding is about
> >> > `t_on_error_null` tests. We are allowed to insert a NULL value for the
> >> > first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why
> >> > do we do that? My thought is we should try to execute
> >> > InputFunctionCallSafe with NULL value  (i mean, here [1]) for the
> >> > column after we failed to insert the input value. And, if this second
> >> > call is successful, we do replacement, otherwise we count the row as
> >> > erroneous.
> >>
> >> Your concern is valid. Allowing NULL to be stored in a column with a
> >> NOT NULL
> >> constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you
> >> suggested,
> >> NULL values set by SET_TO_NULL should probably be re-evaluated.
> >
> > Thank you. I updated the patch with a NULL re-evaluation.
> >
> >
> > PFA v7. I did not yet update the doc for this patch version, waiting
> > for feedback about REJECT LIMIT + SET_TO_NULL behaviour.
> >
>
>
> There were warnings when I applied the patch:
>
> $ git apply
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:170:
> trailing whitespace.
> /*
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:181:
> trailing whitespace.
>
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:189:
> trailing whitespace.
> /* If datatype if okay with NULL,
> replace
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:196:
> trailing whitespace.
>
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:212:
> trailing whitespace.
>  /*
>
> >  @@ -403,12 +403,14 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState
> > *pstate, bool is_from)
> >   parser_errposition(pstate, def->location)));
> ...
> >
> >  -   if (opts_out->reject_limit && !opts_out->on_error)
> >  +   if (opts_out->reject_limit && !(opts_out->on_error ==
> > COPY_ON_ERROR_NULL || opts_out->on_error == COPY_ON_ERROR_IGNORE))
>
> Hmm, is this change necessary?
> Personally, I feel the previous code is easier to read.
>
>
> "REJECT LIMIT" should be "REJECT_LIMIT"?
> > 1037 errhint("Consider specifying the
> > REJECT LIMIT option to skip erroneous rows.")));
>
>
> SET_TO_NULL is one of the values for ON_ERROR, but the patch treats
> SET_TO_NULL as a COPY option:
>
> 221 --- a/src/bin/psql/tab-complete.in.c
> 222 +++ b/src/bin/psql/tab-complete.in.c
> 223 @@ -3235,7 +3235,7 @@ match_previous_words(int pattern_id,
> 224 COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL",
> 225   "HEADER", "QUOTE", "ESCAPE", "FORCE_QUOTE",
> 226   "FORCE_NOT_NULL", "FORCE_NULL", "ENCODING",
> "DEFAULT",
> 227 - "ON_ERROR", "LOG_VERBOSITY");
> 228 + "ON_ERROR", "SET_TO_NULL", "LOG_VERBOSITY");
>
>
> > Best regards,
> > Kiri

Re: Separate memory contexts for relcache and catcache

2024-11-11 Thread Jeff Davis
On Mon, 2024-11-11 at 17:05 +0530, Ashutosh Bapat wrote:
> It will be good
> to move ahead with the ones we all agree for now. Looking at all the
> emails, those will be CatCacheContext,
> RelCacheContext, PlanCacheContext, TypCacheContext.

I'm not sure we have consensus on all of those yet. Andres's concern,
IIUC, is that the additional memory contexts will cause additional
fragmentation.

I believe we have a rough consensus that CatCacheContext and
RelCacheContext are wanted, but we're trying to find ways to mitigate
the fragmentation.

Regards,
Jeff Davis





Re: Using Expanded Objects other than Arrays from plpgsql

2024-11-11 Thread Michel Pelletier
On Fri, Nov 1, 2024 at 5:53 PM Michel Pelletier 
wrote:

> On Fri, Nov 1, 2024 at 3:27 PM Tom Lane  wrote:
>
>> Michel Pelletier  writes:
>>
>> Here is a v1 patch series that does the first part of what we've been
>> talking about, which is to implement the new optimization rule for
>> the case of a single RHS reference to the target variable.  I'm
>> curious to know if it helps you at all by itself.  You'll definitely
>> also need commit 534d0ea6c, so probably applying these on our git
>> master branch would be the place to start.
>>
>
> I'll apply these tonight and get back to you asap.  There are many
> functions in my API that take only one expanded RHS argument, so I'll look
> for some cases where your changes reduce expansions when I run my tests.
>

I tested these patches with my test setup and can confirm there is now one
less expansion in this function:

create or replace function test_expand(graph matrix) returns matrix
language plpgsql as
$$
declare
nvals bigint = nvals(graph);
begin
return graph;
end;
$$;

postgres=# select test_expand(a) from test_fixture ;
DEBUG:  matrix_nvals
DEBUG:  DatumGetMatrix
DEBUG:  expand_matrix
DEBUG:  new_matrix
DEBUG:  context_callback_matrix_free
DEBUG:  matrix_out
DEBUG:  DatumGetMatrix
DEBUG:  expand_matrix
DEBUG:  new_matrix
DEBUG:  context_callback_matrix_free

The second expansion in matrix_out happens outside the function, so inside
there is only the one expansion for both matrix_nvals and the assignment.
Thank you!  All my tests continue to pass and the change seems to work well.

Looking forward to helping test the support function idea, let me know if
there's anything else I can do to validate the idea.

-Michel

>


Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-11 Thread Tomas Vondra
On 11/11/24 21:56, Masahiko Sawada wrote:
> ...
>>
>>> My hypothesis is that we wanted to prevent such
>>> case by the first if block:
>>>
>>> /* don't overwrite if have a newer restart lsn */
>>> if (restart_lsn <= slot->data.restart_lsn)
>>> {
>>> }
>>>
>>
>> Yeah, that condition / comment seems to say exactly that.
>>
>> Do you plan / expect to work on fixing this? It seems you proposed the
>> right fix in that old thread, but it's been inactive since 2023/02 :-(
> 
> I'm happy to work on this fix. At that time, I was unsure if my fix
> was really correct and there was no further discussion.
> 

Thanks. I'm not sure about the correctness either, but I think it's
clear the issue is real, and it's not difficult to reproduce it.

regards

-- 
Tomas Vondra





Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-11 Thread Tomas Vondra


On 11/11/24 15:17, Tomas Vondra wrote:
> On 11/11/24 14:51, Ashutosh Bapat wrote:
>> ...
>>
>> I think the problem is about processing older running transactions
>> record and setting data.restart_lsn based on the candidates those
>> records produce. But what is not clear to me is how come a newer
>> candidate_restart_lsn is available immediately upon WAL sender
>> restart. I.e. in the sequence of events you mentioned in your first
>> email
>> 1.  344.139 LOG:  starting logical decoding for slot "s1"
>>
>> 2. 344.139 DETAIL:  Streaming transactions committing after 1/E97EAC30,
>>reading WAL from 1/E96FB4D0.
>>
>>  3. 344.140 LOG:  logical decoding found consistent point at 1/E96FB4D0
>>
>>  4. 344.140 DETAIL:  Logical decoding will begin using saved snapshot.
>>
>>   5. 344.140 LOG:  LogicalConfirmReceivedLocation 1/E9865398
>>
>> 6.  344.140 LOG:  LogicalConfirmReceivedLocation updating
>>data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
>>candidate_restart_valid 0/0 (from 1/E9865398)
>>candidate_restart_lsn 0/0 (from 1/E979D4C8)
>>
>> how did candidate_restart_lsn = 1/E979D4C8 and candidate_restart_valid
>> = 1/E9865398 were set in ReplicationSlot after WAL sender? It means it
>> must have read and processed running transaction record at 1/E9865398.
>> If that's true, how come it went back to a running transactions WAL
>> record at 1/E979D4C8? It should be reading WAL records sequentially,
>> hence read 1/E979D4C8 first then 1/E9865398.
>>
>>  344.145 LOG:  LogicalIncreaseRestartDecodingForSlot s1
>>candidate_restart_valid_lsn 1/E979D4C8 (0/0)
>>candidate_restart_lsn 1/E96FB4D0 (0/0)
>>
> 
> Those are good questions, but IIUC that's explained by this comment from
> Masahiko-san's analysis [1]:
> 
>   Thinking about the root cause more, it seems to me that the root cause
>   is not the fact that candidate_xxx values are not cleared when being
>   released.
> 
>   In the scenario I reproduced, after restarting the logical decoding,
>   the walsender sets the restart_lsn to a candidate_restart_lsn left in
>   the slot upon receiving the ack from the subscriber. ...
> 
> If this is correct, then what happens is:
> 
>   1) replication is running, at some point we set candidate LSN to B
> 
>   2) something breaks, causing reconnect with restart LSN A (< B)
> 
>   3) we still have the candidate LSN B in memory, and after receiving
>  some confirmation we set it as restart_lsn
> 
>   4) we get to decode the RUNNING_XACTS, which moves restart_lsn back
> 
> 
> If this analysis is correct, I think it's rather suspicious we don't
> reset the candidate fields on restart. Can those "old" values ever be
> valid? But I haven't tried resetting them.
> 

To clarify this a bit, I mean something like in the attached 0003 patch.

The reasoning is that after ReplicationSlotAcquire() we should get the
slot in the same state as if we just read it from disk. Because why not?
Why should the result be different from what we'd get if the primary
restarted right before the reconnect?
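
In essence (just a rough sketch of the idea here, not the actual 0003
patch), acquiring the slot would clear the in-memory candidate fields so
that the slot looks the same as one freshly loaded from disk:

    SpinLockAcquire(&slot->mutex);
    slot->candidate_catalog_xmin = InvalidTransactionId;
    slot->candidate_xmin_lsn = InvalidXLogRecPtr;
    slot->candidate_restart_valid = InvalidXLogRecPtr;
    slot->candidate_restart_lsn = InvalidXLogRecPtr;
    SpinLockRelease(&slot->mutex);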

Parts 0001 and 0002 add a couple asserts to prevent backwards move for
both the restart_lsn and the various candidate LSN fields.

Both the 0003 and 0004 patches (applied separately) seems to fix crashes
in my stress test, and none of the asserts from 0001+0002 seem to fail.
I'm not sure if we need both fixes or just one of them.

But neither of those fixes prevents backwards move for confirmed_flush
LSN, as enforced by asserts in the 0005 patch. I don't know if this
assert is incorrect or not. It seems natural that once we get a
confirmation for some LSN, we can't move before that position, but I'm
not sure about that. Maybe it's too strict.


regards

-- 
Tomas Vondra
From be7ba029036613df80562cc735a3040d6f760a90 Mon Sep 17 00:00:00 2001
From: tomas 
Date: Sat, 9 Nov 2024 12:09:04 +0100
Subject: [PATCH 1/5] asserts on restart_lsn backwards move

---
 src/backend/replication/logical/logical.c | 3 +++
 src/backend/replication/slot.c| 2 ++
 2 files changed, 5 insertions(+)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 2f7b2c85d9b..d76889664a8 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -376,6 +376,7 @@ CreateInitDecodingContext(const char *plugin,
 	else
 	{
 		SpinLockAcquire(&slot->mutex);
+		Assert(slot->data.restart_lsn <= restart_lsn);
 		slot->data.restart_lsn = restart_lsn;
 		SpinLockRelease(&slot->mutex);
 	}
@@ -1746,6 +1747,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		{
 			Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);
 
+			Assert(MyReplicationSlot->data.restart_lsn <= MyReplicationSlot->candidate_restart_lsn);
+
 			MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
 			MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;

Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-11 Thread Masahiko Sawada
On Mon, Nov 11, 2024 at 6:17 AM Tomas Vondra  wrote:
>
> If this analysis is correct, I think it's rather suspicious we don't
> reset the candidate fields on restart.  Can those "old" values ever be
> valid? But I haven't tried resetting them.

I had the same question. IIRC resetting them also fixes the
problem[1]. However, I got a comment from Alvaro[2]:

  Hmm, interesting -- I was studying some other bug recently involving the
  xmin of a slot that had been invalidated and I remember wondering if
  these "candidate" fields were being properly ignored when the slot is
  marked not in use; but I didn't check. Are you sure that resetting them
  when the slot is released is the appropriate thing to do? I mean,
  shouldn't they be kept set while the slot is in use, and only reset if
  we destroy it?

This made me re-investigate the issue, and I now think that we don't
necessarily need to clear these candidate values in memory on
releasing a slot, as long as we're carefully updating restart_lsn.
Which seems a bit efficient for example when restarting from a very
old point. Of course, even if we reset them on releasing a slot, it
would perfectly work since it's the same as restarting logical
decoding with a server restart. I think
LogicalIncreaseRestartDecodingForSlot() should be fixed as it doesn't
seem to be working as expected, but I don't have proof of whether we
should keep or reset them on releasing a slot.

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoBG2OSDOFTtpPtQ7fx5Vt8p3dS5hPAv28CBSC6z2kHx-g%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/20230206152209.yglmntznhcmaueyn%40alvherre.pgsql


-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-11-11 Thread Jacob Champion
On Fri, Nov 8, 2024 at 4:23 PM Jacob Champion
 wrote:
> While I work on breaking pgstat_bestart() apart, here is a v6 which
> pushes down the "coarse" wait events. No changes to 0001 yet.

v7 rewrites 0001 by splitting pgstat_bestart() into three phases.
(0002-3 are unchanged.)

1. pgstat_bestart_initial() reports STATE_STARTING, fills in the early
fields and clears out the rest.
2. pgstat_bestart_security() reports the SSL/GSS status of the
connection. This is only for backends with a valid MyProcPort; they
call it twice.
3. pgstat_bestart_final() fills in the user and database IDs, takes
the entry out of STATE_STARTING, and reports the application_name.

I was wondering if maybe I should fill in application_name before
taking the entry out of STATE_STARTING, in order to establish the rule
that "starting" pgstat entries are always partially filled, and that
DBAs can rely on their full contents once they've proceeded past it.
Thoughts?

I've added machinery to 001_ssltests.pl to make sure we see early
transport security stats prior to user authentication. This overlaps
quite a bit with the new 007_pre_auth.pl, so if we'd rather not have
the latter (as briefly discussed upthread) I can move the rest of it
over.

Thanks,
--Jacob


v7-0001-pgstat-report-in-earlier-with-STATE_STARTING.patch
Description: Binary data


v7-0002-Report-external-auth-calls-as-wait-events.patch
Description: Binary data


v7-0003-squash-Report-external-auth-calls-as-wait-events.patch
Description: Binary data


Re: POC: make mxidoff 64 bits

2024-11-11 Thread Heikki Linnakangas

On 08/11/2024 20:10, Maxim Orlov wrote:
Sorry for the late reply. There was a problem in upgrade with offset
wraparound. Here is a fixed version. A test is also added. I decided to use
my old patch to set up non-standard multixacts for the old cluster, fill it
with data, and do pg_upgrade.


The wraparound logic is still not correct. To test, I created a cluster 
where multixids have wrapped around, so that:


$ ls -l data-old/pg_multixact/offsets/
total 720
-rw--- 1 heikki heikki 212992 Nov 12 01:11 
-rw-r--r-- 1 heikki heikki 262144 Nov 12 00:55 FFFE
-rw--- 1 heikki heikki 262144 Nov 12 00:56 

After running pg_upgrade:

$ ls -l data-new/pg_multixact/offsets/
total 1184
-rw--- 1 heikki heikki 155648 Nov 12 01:12 0001
-rw--- 1 heikki heikki 262144 Nov 12 01:11 1FFFD
-rw--- 1 heikki heikki 262144 Nov 12 01:11 1FFFE
-rw--- 1 heikki heikki 262144 Nov 12 01:11 1
-rw--- 1 heikki heikki 262144 Nov 12 01:11 2
-rw--- 1 heikki heikki 155648 Nov 12 01:11 20001

That's not right. The segments 2 and 20001 were created by the new 
pg_upgrade conversion code from old segment ''. But multixids are 
still 32-bit values, so after segment 1, you should still wrap 
around to . The new segments should be '' and '0001'. The 
segment '0001' is created when postgres is started after upgrade, but 
it's created from scratch and doesn't contain the upgraded values.


When I try to select from a table after upgrade that contains 
post-wraparound multixids:


TRAP: failed Assert("offset != 0"), File: 
"../src/backend/access/transam/multixact.c", Line: 1353, PID: 63386



On a different note, I'm surprised you're rewriting member segments from 
scratch, parsing all the individual member groups and writing them out 
again. There's no change to the members file format, except for the 
numbering of the files, so you could just copy the files under the new 
names without paying attention to the contents. It's not wrong to parse 
them in detail, but I'd assume that it would be simpler not to.


Here is how to test. All the patches are for 14e87ffa5c543b5f3 master 
branch.
1) Get the 14e87ffa5c543b5f3 master branch apply patches 0001-Add- 
initdb-option-to-initialize-cluster-with-non-sta.patch and 0002-TEST- 
lower-SLRU_PAGES_PER_SEGMENT.patch
2) Get the 14e87ffa5c543b5f3 master branch in a separate directory and 
apply v6 patch set.

3) Build two branches.
4) Use ENV oldinstall to run the test: PROVE_TESTS=t/005_mxidoff.pl 
 oldinstall=/home/orlov/proj/pgsql-new 
PG_TEST_NOCLEAN=1 make check -C src/bin/pg_upgrade/


Maybe I'll make a shell script to automate these steps if required.


Yeah, I think we need something to automate this. I did the testing 
manually. I used the attached python script to consume multixids faster, 
but it's still tedious.


I used pg_resetwal to quickly create a cluster that's close to multixid 
wraparound:


initdb -D data
pg_resetwal -D data -m 429491,429490
dd if=/dev/zero of=data/pg_multixact/offsets/FFFE bs=8192 count=32

--
Heikki Linnakangas
Neon (https://neon.tech)
import sys;
import threading;
import psycopg2;

def test_multixact(tblname: str):
    with psycopg2.connect() as conn:
        cur = conn.cursor()
        cur.execute(
            f"""
            DROP TABLE IF EXISTS {tblname};
            CREATE TABLE {tblname}(i int primary key, n_updated int) WITH (autovacuum_enabled=false);
            INSERT INTO {tblname} select g, 0 from generate_series(1, 50) g;
            """
        )

    # Lock entries using parallel connections in a round-robin fashion.
    nclients = 50
    update_every = 97
    connections = []
    for _ in range(nclients):
        # Do not turn on autocommit. We want to hold the key-share locks.
        conn = psycopg2.connect()
        connections.append(conn)

    # On each iteration, we commit the previous transaction on a connection,
    # and issue another select. Each SELECT generates a new multixact that
    # includes the new XID, and the XIDs of all the other parallel transactions.
    # This generates enough traffic on both multixact offsets and members SLRUs
    # to cross page boundaries.
    for i in range(2):
        conn = connections[i % nclients]
        conn.commit()

        # Perform some non-key UPDATEs too, to exercise different multixact
        # member statuses.
        if i % update_every == 0:
            conn.cursor().execute(f"update {tblname} set n_updated = n_updated + 1 where i = {i % 50}")
        else:
            conn.cursor().execute(f"select * from {tblname} for key share")

#nthreads=10
#
#threads = []
#for threadno in range(nthreads):
#    tblname = f"tbl{threadno}"
#    t = threading.Thread(target=test_multixact, args=(tblname,))
#    t.start()
#    threads.append(t)
#
#for threadno in range(nthreads):
#    threads[threadno].join()

test_multixact(sys.argv[1])


Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-11 Thread Tomas Vondra



On 11/11/24 23:41, Masahiko Sawada wrote:
> On Mon, Nov 11, 2024 at 6:17 AM Tomas Vondra  wrote:
>>
>> If this analysis is correct, I think it's rather suspicious we don't
>> reset the candidate fields on restart.  Can those "old" values ever be
>> valid? But I haven't tried resetting them.
> 
> I had the same question. IIRC resetting them also fixes the
> problem[1]. However, I got a comment from Alvaro[2]:
> 
>   Hmm, interesting -- I was studying some other bug recently involving the
>   xmin of a slot that had been invalidated and I remember wondering if
>   these "candidate" fields were being properly ignored when the slot is
>   marked not in use; but I didn't check. Are you sure that resetting them
>   when the slot is released is the appropriate thing to do? I mean,
>   shouldn't they be kept set while the slot is in use, and only reset if
>   we destroy it?
> 
> This made me re-investigate the issue, and I now think that we don't
> necessarily need to clear these candidate values in memory on
> releasing a slot, as long as we're carefully updating restart_lsn.

Not sure, but maybe it'd be useful to ask the opposite question. Why
shouldn't it be correct to reset the fields, which essentially puts the
slot into the same state as if it was just read from disk? That also
discards all these values, and we can't rely on accidentally keeping
some important info in memory (because if the instance restarts
we'd lose that).

But this reminds me that the patch I shared earlier today resets the
slot in the ReplicationSlotAcquire() function, but I guess that's not
quite correct. It probably should be in the "release" path.

> Which seems a bit efficient for example when restarting from a very
> old point. Of course, even if we reset them on releasing a slot, it
> would perfectly work since it's the same as restarting logical
> decoding with a server restart.

I find the "efficiency" argument a bit odd. It'd be fine if we had a
correct behavior to start with, but we don't have that ... Also, I'm not
quite sure why exactly it would be more efficient.

And how likely is this in practice? It seems to me that
performance-sensitive cases can't do reconnects very often anyway,
that's inherently inefficient. No?

> I think
> LogicalIncreaseRestartDecodingForSlot() should be fixed as it doesn't
> seem to be working as expected, but I don't have proof of whether we
> should keep or reset them on releasing a slot.
> 

Not sure. Chances are we need both fixes, if only to make
LogicalIncreaseRestartDecodingForSlot more like the other function.

regards

-- 
Tomas Vondra





Re: Eager aggregation, take 3

2024-11-11 Thread Richard Guo
On Tue, Nov 12, 2024 at 1:30 AM Robert Haas  wrote:
> On Sun, Nov 10, 2024 at 7:52 PM Richard Guo  wrote:
> > Hmm, currently we only consider grouped aggregation for eager
> > aggregation.  For grouped aggregation, the window function's
> > arguments, as well as the PARTITION BY expressions, must appear in the
> > GROUP BY clause.  That is to say, the depname column in the first
> > query, or the n column in the second query, will not be aggregated
> > into the partial groups.  Instead, they will remain as they are as
> > input for the WindowAgg nodes.  It seems to me that this ensures
> > that we're good with window functions.  But maybe I'm wrong.
>
> Unfortunately, I don't know what you mean by grouped aggregation. I
> think of grouping and aggregation as synonyms, pretty much.

Ah, sorry for the confusion.  By "grouped aggregation", I mean
aggregation with a GROUP BY clause, where we produce a result row for
each group.  This contrasts with plain aggregation, where there is a
single result row for the whole query.

Thanks
Richard




Re: explain plans for foreign servers

2024-11-11 Thread Andy Fan
dinesh salve  writes:

Hi,

>
> I am working on a feature in the postgres_fdw extension to show plans
> used by remote PostgreSQL servers in the output of the EXPLAIN command.
> 
> I think this will help end users understand the query execution plans
> used by remote servers. Sample output for table people, where people_1
> is a local partition and people_2 is a remote partition, would look like

This looks nice! Especially for people who want an FDW-based sharding
cluster.

> I would like community input on the following high-level thoughts:
>
> 1. To enable this feature, we can either introduce a new option in the
> EXPLAIN command, e.g. (fetch_remote_plans true), or control this
> behaviour using a GUC defined in the postgres_fdw extension. I am more
> inclined towards a GUC, as this feature is specific to the
> postgres_fdw extension. Adding an EXPLAIN command option might force
> other FDW extensions to handle this.

+1.

> 2. For ANALYZE = false, the idea is that postgres_fdw would create a
> connection to the remote server, prepare SQL to send over that
> connection, and store the received plans in ExplainState.

> 3. For ANALYZE = true, the idea is that postgres_fdw would set a new
> GUC over the connection to the remote server; the remote server's
> postgres_fdw would read this GUC and send back the query plan it used
> as a NOTICE (similar to what the auto_explain extension does) with a
> custom header which the postgres_fdw extension understands. We could
> also introduce a new message type in the protocol to send back explain
> plans, but that might be too much work for this feature. Open to ideas
> here.

This generally looks good to me. Looking forward to a patch with the
details.

-- 
Best Regards
Andy Fan





Re: intarray sort returns wrong result

2024-11-11 Thread Tom Lane
Junwang Zhao  writes:
> While working on general array sort[1], I played with intarray
> extension, found a bug (or at least inconsistency) when sorting
> multidimensional int array:

> create extension intarray;
> select sort('{{1,2,3}, {2,3,4}}');

> this returns {{1,2,2},{3,3,4}} instead of {{1,2,3},{2,3,4}}

This is documented, isn't it?

Many of these operations are only sensible for one-dimensional
arrays. Although they will accept input arrays of more dimensions,
the data is treated as though it were a linear array in storage
order.

I don't think anyone will thank us for changing intarray's behavior
many years after the fact.

regards, tom lane




intarray sort returns wrong result

2024-11-11 Thread Junwang Zhao
Hi Hackers,

While working on general array sort[1], I played with intarray
extension, found a bug (or at least inconsistency) when sorting
multidimensional int array:

create extension intarray;
select sort('{{1,2,3}, {2,3,4}}');

this returns {{1,2,2},{3,3,4}} instead of {{1,2,3},{2,3,4}}

I think this is misleading. If intarray is only meant for one-dimensional
arrays, we should error out when sorting a multidimensional int array. Or
we can do something like the attached POC patch to make it work with
multidimensional int arrays.

Thoughts?

[1]: 
https://www.postgresql.org/message-id/CAEG8a3KD7ZmQpxNhfPxyc0BjTTTUXiqb56VuMgB7Muu0%2ByV%3DqQ%40mail.gmail.com

-- 
Regards
Junwang Zhao


v1-0001-int-array-support-multi-dim.patch
Description: Binary data


Re: Add reject_limit option to file_fdw

2024-11-11 Thread torikoshia

On 2024-11-12 01:49, Fujii Masao wrote:

On 2024/11/11 21:45, torikoshia wrote:
Thanks for adding the comment. It clearly states that REJECT_LIMIT can be
a single-quoted string. However, it might also be helpful to mention that
it can be provided as an int64 in the COPY command option. How about
updating it like this?

REJECT_LIMIT can be specified in two ways: as an int64 for the COPY command
option or as a single-quoted string for the foreign table option using
file_fdw. Therefore this function needs to handle both formats.


Thanks! It seems better.


Attached v3 patch.


Thanks for updating the patch! It looks like you forgot to attach it, though.


Oops, thanks for pointing it out.
Here it is.


--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.From 6380a1a52e125671cfecb3a9056019ff85a32dd7 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Mon, 11 Nov 2024 21:36:57 +0900
Subject: [PATCH v3] file_fdw: Add reject_limit option to file_fdw

Commit 4ac2a9bece introduced REJECT_LIMIT option for COPY. This patch
extends support for this option to file_fdw.

As well as REJECT_LIMIT option for COPY, this option limits the
maximum number of erroneous rows that can be skipped.
If more rows encounter data type conversion errors than allowed by
reject_limit, the access to the file_fdw foreign table will fail
with an error, even when on_error = 'ignore'.

Based on the synopsis of the CREATE/ALTER FOREIGN TABLE commands, the
value for the foreign table's option must be single-quoted.
To accommodate this, the patch modifies defGetCopyRejectLimitOption
to accept not only int64 but also strings.
---
 contrib/file_fdw/data/agg.bad  |  1 +
 contrib/file_fdw/expected/file_fdw.out | 16 ++--
 contrib/file_fdw/file_fdw.c|  8 
 contrib/file_fdw/sql/file_fdw.sql  |  6 +-
 doc/src/sgml/file-fdw.sgml | 12 
 src/backend/commands/copy.c| 16 +++-
 6 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/contrib/file_fdw/data/agg.bad b/contrib/file_fdw/data/agg.bad
index 3415b15007..04279ce55b 100644
--- a/contrib/file_fdw/data/agg.bad
+++ b/contrib/file_fdw/data/agg.bad
@@ -2,3 +2,4 @@
 100;@99.097@
 0;@aaa@
 42;@324.78@
+1;@bbb@
diff --git a/contrib/file_fdw/expected/file_fdw.out b/contrib/file_fdw/expected/file_fdw.out
index 593fdc782e..f52e82bfcf 100644
--- a/contrib/file_fdw/expected/file_fdw.out
+++ b/contrib/file_fdw/expected/file_fdw.out
@@ -206,10 +206,10 @@ SELECT * FROM agg_csv c JOIN agg_text t ON (t.a = c.a) ORDER BY c.a;
 SELECT * FROM agg_bad;   -- ERROR
 ERROR:  invalid input syntax for type real: "aaa"
 CONTEXT:  COPY agg_bad, line 3, column b: "aaa"
--- on_error and log_verbosity tests
+-- on_error, log_verbosity and reject_limit tests
 ALTER FOREIGN TABLE agg_bad OPTIONS (ADD on_error 'ignore');
 SELECT * FROM agg_bad;
-NOTICE:  1 row was skipped due to data type incompatibility
+NOTICE:  2 rows were skipped due to data type incompatibility
   a  |   b
 -+
  100 | 99.097
@@ -224,6 +224,18 @@ SELECT * FROM agg_bad;
   42 | 324.78
 (2 rows)
 
+ALTER FOREIGN TABLE agg_bad OPTIONS (ADD reject_limit '1'); -- ERROR
+SELECT * FROM agg_bad;
+ERROR:  skipped more than REJECT_LIMIT (1) rows due to data type incompatibility
+CONTEXT:  COPY agg_bad, line 5, column b: "bbb"
+ALTER FOREIGN TABLE agg_bad OPTIONS (SET reject_limit '2');
+SELECT * FROM agg_bad;
+  a  |   b
+-+
+ 100 | 99.097
+  42 | 324.78
+(2 rows)
+
 ANALYZE agg_bad;
 -- misc query tests
 \t on
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 043204c3e7..9e2896f32a 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -77,6 +77,7 @@ static const struct FileFdwOption valid_options[] = {
 	{"encoding", ForeignTableRelationId},
 	{"on_error", ForeignTableRelationId},
 	{"log_verbosity", ForeignTableRelationId},
+	{"reject_limit", ForeignTableRelationId},
 	{"force_not_null", AttributeRelationId},
 	{"force_null", AttributeRelationId},
 
@@ -788,6 +789,13 @@ retry:
 			 */
 			ResetPerTupleExprContext(estate);
 
+			if (cstate->opts.reject_limit > 0 &&
+cstate->num_errors > cstate->opts.reject_limit)
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 errmsg("skipped more than REJECT_LIMIT (%lld) rows due to data type incompatibility",
+(long long) cstate->opts.reject_limit)));
+
 			/* Repeat NextCopyFrom() until no soft error occurs */
 			goto retry;
 		}
diff --git a/contrib/file_fdw/sql/file_fdw.sql b/contrib/file_fdw/sql/file_fdw.sql
index edd77c5cd2..0d6920e1f6 100644
--- a/contrib/file_fdw/sql/file_fdw.sql
+++ b/contrib/file_fdw/sql/file_fdw.sql
@@ -150,11 +150,15 @@ SELECT * FROM agg_csv c JOIN agg_text t ON (t.a = c.a) ORDER BY c.a;
 -- error context report tests
 SELECT * FROM agg_bad; 

Optimizing FastPathTransferRelationLocks()

2024-11-11 Thread Fujii Masao

Hi,

I've identified some opportunities to optimize FastPathTransferRelationLocks(),
which transfers locks with a specific lock tag from per-backend fast-path arrays
to the shared hash table. The attached patch includes these enhancements.

Currently, FastPathTransferRelationLocks() recalculates the fast-path group on
each loop iteration, even though it stays the same. This patch updates
the function to calculate the group once and reuse it, improving efficiency.

The patch also extends the function's logic to skip not only backends from
a different database but also backends with pid=0 (which don’t hold fast-path
locks) and groups with no registered fast-path locks.

Since MyProc->pid is reset to 0 when a backend exits but MyProc->databaseId
remains set, checking only databaseId isn’t enough. Backends with pid=0 also
should be skipped.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From f0986e753d5e25fe9d9a941be621e0e0e43d47ae Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Tue, 12 Nov 2024 09:29:52 +0900
Subject: [PATCH v1] Optimize lock transfer from per-backend fast-path arrays
 to shared hash table.

This commit improves FastPathTransferRelationLocks() for transferring locks
with a specific lock tag from per-backend fast-path arrays to the shared
hash table.

Previously, the fast-path group was recalculated on each loop iteration,
though it remains the same. This update now calculates the group once
and reuses it, improving efficiency. Additionally, the function now skips
backends with pid=0 (as they hold no fast-path locks) and skips empty
fast-path groups, further enhancing performance.
---
 src/backend/storage/lmgr/lock.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index edc5020c6a..69c4d156fe 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2773,6 +2773,10 @@ FastPathTransferRelationLocks(LockMethod 
lockMethodTable, const LOCKTAG *locktag
LWLock *partitionLock = LockHashPartitionLock(hashcode);
Oid relid = locktag->locktag_field2;
uint32  i;
+   uint32  group;
+
+   /* fast-path group the lock belongs to */
+   group = FAST_PATH_REL_GROUP(relid);
 
/*
 * Every PGPROC that can potentially hold a fast-path lock is present in
@@ -2783,8 +2787,7 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, 
const LOCKTAG *locktag
for (i = 0; i < ProcGlobal->allProcCount; i++)
{
PGPROC *proc = &ProcGlobal->allProcs[i];
-   uint32  j,
-   group;
+   uint32  j;
 
LWLockAcquire(&proc->fpInfoLock, LW_EXCLUSIVE);
 
@@ -2802,16 +2805,17 @@ FastPathTransferRelationLocks(LockMethod 
lockMethodTable, const LOCKTAG *locktag
 * less clear that our backend is certain to have performed a 
memory
 * fencing operation since the other backend set 
proc->databaseId.  So
 * for now, we test it after acquiring the LWLock just to be 
safe.
+*
+* Also skip backends with pid=0 as they don’t hold fast-path 
locks,
+* and skip groups without any registered fast-path locks.
 */
-   if (proc->databaseId != locktag->locktag_field1)
+   if (proc->databaseId != locktag->locktag_field1 || proc->pid == 
0 ||
+   proc->fpLockBits[group] == 0)
{
LWLockRelease(&proc->fpInfoLock);
continue;
}
 
-   /* fast-path group the lock belongs to */
-   group = FAST_PATH_REL_GROUP(relid);
-
for (j = 0; j < FP_LOCK_SLOTS_PER_GROUP; j++)
{
uint32  lockmode;
-- 
2.47.0



Re: intarray sort returns wrong result

2024-11-11 Thread Junwang Zhao
On Tue, Nov 12, 2024 at 9:13 AM Tom Lane  wrote:
>
> Junwang Zhao  writes:
> > While working on general array sort[1], I played with intarray
> > extension, found a bug (or at least inconsistency) when sorting
> > multidimensional int array:
>
> > create extension intarray;
> > select sort('{{1,2,3}, {2,3,4}}');
>
> > this returns {{1,2,2},{3,3,4}} instead of {{1,2,3},{2,3,4}}
>
> This is documented, isn't it?
>
> Many of these operations are only sensible for one-dimensional
> arrays. Although they will accept input arrays of more dimensions,
> the data is treated as though it were a linear array in storage
> order.
>

I did not notice this statement, my bad 😞

> I don't think anyone will thank us for changing intarray's behavior
> many years after the fact.
>

Agreed. Sorry for the noise.

> regards, tom lane



-- 
Regards
Junwang Zhao




Re: Skip collecting decoded changes of already-aborted transactions

2024-11-11 Thread Peter Smith
On Tue, Nov 12, 2024 at 5:00 AM Masahiko Sawada  wrote:
>
> I've attached the updated patch.
>

Hi, here are some review comments for the latest v6-0001.

==
contrib/test_decoding/sql/stats.sql

1.
+INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM
generate_series(1, 5000) g(i);

I didn't understand the meaning of "serialize-topbig--1". My guess is
it is a typo that was supposed to say "toobig".

Perhaps there should also be some comment to explain that this
"toobig" stuff was done deliberately like this to exceed
'logical_decoding_work_mem' because that would normally (if it was not
aborted) cause a spill to disk.

~~~

2.
+-- Check stats. We should not spill anything as the transaction is already
+-- aborted.
+SELECT pg_stat_force_next_flush();
+SELECT slot_name, spill_txns AS spill_txn, spill_count AS spill_count
FROM pg_stat_replication_slots WHERE slot_name =
'regression_slot_stats4_twophase';
+

Those aliases seem unnecessary: "spill_txns AS spill_txn" and
"spill_count AS spill_count"

==
.../replication/logical/reorderbuffer.c

ReorderBufferCheckTXNAbort:

3.
Other static functions are also declared at the top of this module.
For consistency, shouldn't this be the same?

~~~

4.
+ * We don't mark the transaction as streamed since this function can be
+ * called for non-streamed transactions too.
+ */
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
+ ReorderBufferToastReset(rb, txn);

Given the comment says "since this function can be called for
non-streamed transactions too", would it be easier to pass
rbtxn_is_streamed(txn) here instead of 'false', and then just remove
the comment?

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Incremental Sort Cost Estimation Instability

2024-11-11 Thread Andrei Lepikhov

Hi,

While revising the patch, I found that it may make sense to teach
estimate_num_groups to process a PathKey node inside the presortedExprs
list.
In this case it can pass through the EquivalenceClass members and choose
the correct number of distinct values.

Here is a sketch of the solution (see the attachment).

--
regards, Andrei LepikhovFrom 3d97c3aa36170460a82e66a06af13f95ed4a37c0 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" 
Date: Tue, 29 Oct 2024 08:49:33 +0700
Subject: [PATCH] Improve group number estimation.

Estimating GROUP BY x optimisation employs a distinct statistic for column x.
But if we have an expression like 'x=y' somewhere down the query tree, the
number of different values can't be more than the smaller distinct value on
columns 'x' and 'y'. That means it is possible to correct the estimation with
knowledge provided by the equivalence class.

In this commit, the estimate_num_groups routine is changed to include PathKey
nodes in the presortedExprs list. With the PathKey node, we can pass through
its equivalence class members and correct the distinct estimation.

To avoid multiple calls on statistic tuples, the em_ndistinct cache field
is introduced.
---
 src/backend/optimizer/path/costsize.c | 40 ++---
 src/backend/optimizer/path/equivclass.c   |  1 +
 src/backend/utils/adt/selfuncs.c  | 87 ++-
 src/include/nodes/pathnodes.h |  2 +
 .../regress/expected/incremental_sort.out | 51 +++
 src/test/regress/sql/incremental_sort.sql | 31 +++
 6 files changed, 180 insertions(+), 32 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 2bb6db1df7..fad63f694d 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2008,13 +2008,12 @@ cost_incremental_sort(Path *path,
 run_cost,
 input_run_cost = input_total_cost - input_startup_cost;
 	double		group_tuples,
-input_groups;
+input_groups,
+estimated_groups;
 	Cost		group_startup_cost,
 group_run_cost,
 group_input_run_cost;
 	List	   *presortedExprs = NIL;
-	ListCell   *l;
-	bool		unknown_varno = false;
 
 	Assert(presorted_keys > 0 && presorted_keys < list_length(pathkeys));
 
@@ -2025,9 +2024,6 @@ cost_incremental_sort(Path *path,
 	if (input_tuples < 2.0)
 		input_tuples = 2.0;
 
-	/* Default estimate of number of groups, capped to one group per row. */
-	input_groups = Min(input_tuples, DEFAULT_NUM_DISTINCT);
-
 	/*
 	 * Extract presorted keys as list of expressions.
 	 *
@@ -2050,33 +2046,15 @@ cost_incremental_sort(Path *path,
 	 * anyway - from that standpoint the DEFAULT_NUM_DISTINCT is defensive
 	 * while maintaining lower startup cost.
 	 */
-	foreach(l, pathkeys)
-	{
-		PathKey*key = (PathKey *) lfirst(l);
-		EquivalenceMember *member = (EquivalenceMember *)
-			linitial(key->pk_eclass->ec_members);
-
-		/*
-		 * Check if the expression contains Var with "varno 0" so that we
-		 * don't call estimate_num_groups in that case.
-		 */
-		if (bms_is_member(0, pull_varnos(root, (Node *) member->em_expr)))
-		{
-			unknown_varno = true;
-			break;
-		}
+	presortedExprs = list_copy_head(pathkeys, presorted_keys);
 
-		/* expression not containing any Vars with "varno 0" */
-		presortedExprs = lappend(presortedExprs, member->em_expr);
-
-		if (foreach_current_index(l) + 1 >= presorted_keys)
-			break;
-	}
-
-	/* Estimate the number of groups with equal presorted keys. */
-	if (!unknown_varno)
-		input_groups = estimate_num_groups(root, presortedExprs, input_tuples,
+	estimated_groups = estimate_num_groups(root, presortedExprs, input_tuples,
 		   NULL, NULL);
+	if (estimated_groups > 0.0)
+		input_groups = estimated_groups;
+	else
+		/* Default estimate of number of groups, capped to one group per row. */
+		input_groups = Min(input_tuples, DEFAULT_NUM_DISTINCT);
 
 	group_tuples = input_tuples / input_groups;
 	group_input_run_cost = input_run_cost / input_groups;
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index fae137dd82..3552614502 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -525,6 +525,7 @@ add_eq_member(EquivalenceClass *ec, Expr *expr, Relids relids,
 	em->em_datatype = datatype;
 	em->em_jdomain = jdomain;
 	em->em_parent = parent;
+	em->em_ndistinct = -1.0;
 
 	if (bms_is_empty(relids))
 	{
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 08fa6774d9..e49101f50e 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -213,6 +213,8 @@ static bool get_actual_variable_endpoint(Relation heapRel,
 		 MemoryContext outercontext,
 		 Datum *endpointDatum);
 static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
+static EquivalenceMember *identify_sort_ecmember(PlannerInfo *root,
+ EquivalenceClass *e

Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-11 Thread Tomas Vondra
On 11/11/24 14:51, Ashutosh Bapat wrote:
> ...
>
> I think the problem is about processing older running transactions
> record and setting data.restart_lsn based on the candidates those
> records produce. But what is not clear to me is how come a newer
> candidate_restart_lsn is available immediately upon WAL sender
> restart. I.e. in the sequence of events you mentioned in your first
> email
> 1.  344.139 LOG:  starting logical decoding for slot "s1"
> 
> 2. 344.139 DETAIL:  Streaming transactions committing after 1/E97EAC30,
>reading WAL from 1/E96FB4D0.
> 
>  3. 344.140 LOG:  logical decoding found consistent point at 1/E96FB4D0
> 
>  4. 344.140 DETAIL:  Logical decoding will begin using saved snapshot.
> 
>   5. 344.140 LOG:  LogicalConfirmReceivedLocation 1/E9865398
> 
> 6.  344.140 LOG:  LogicalConfirmReceivedLocation updating
>data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
>candidate_restart_valid 0/0 (from 1/E9865398)
>candidate_restart_lsn 0/0 (from 1/E979D4C8)
> 
> how did candidate_restart_lsn = 1/E979D4C8 and candidate_restart_valid
> = 1/E9865398 were set in ReplicationSlot after WAL sender? It means it
> must have read and processed running transaction record at 1/E9865398.
> If that's true, how come it went back to a running transactions WAL
> record at 1/E979D4C8? It should be reading WAL records sequentially,
> hence read 1/E979D4C8 first then 1/E9865398.
> 
>  344.145 LOG:  LogicalIncreaseRestartDecodingForSlot s1
>candidate_restart_valid_lsn 1/E979D4C8 (0/0)
>candidate_restart_lsn 1/E96FB4D0 (0/0)
> 

Those are good questions, but IIUC that's explained by this comment from
Masahiko-san's analysis [1]:

  Thinking about the root cause more, it seems to me that the root cause
  is not the fact that candidate_xxx values are not cleared when being
  released.

  In the scenario I reproduced, after restarting the logical decoding,
  the walsender sets the restart_lsn to a candidate_restart_lsn left in
  the slot upon receiving the ack from the subscriber. ...

If this is correct, then what happens is:

  1) replication is running, at some point we set candidate LSN to B

  2) something breaks, causing reconnect with restart LSN A (< B)

  3) we still have the candidate LSN B in memory, and after receiving
 some confirmation we set it as restart_lsn

  4) we get to decode the RUNNING_XACTS, which moves restart_lsn back


If this analysis is correct, I think it's rather suspicious we don't
reset the candidate fields on restart. Can those "old" values ever be
valid? But I haven't tried resetting them.

Also, this is why I'm not entirely sure just tweaking the conditions in
LogicalIncreaseRestartDecodingForSlot is quite correct. Maybe it fixes
this particular issue, but maybe the right fix would be to reset the
candidate fields on reconnect? And this change would be just hiding the
actual problem. I haven't tried this.
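
For concreteness, a minimal sketch of what "reset the candidate fields"
could look like (the field names exist in ReplicationSlot today; where
exactly this would run - e.g. when the walsender (re)acquires the slot for
decoding - is exactly the part I haven't tried):

    /*
     * Hypothetical helper: discard in-memory candidate values so that a
     * restarted walsender cannot reuse candidates computed before the
     * restart.  Needs the usual headers (replication/slot.h, storage/spin.h).
     */
    static void
    ResetSlotCandidates(ReplicationSlot *slot)
    {
        SpinLockAcquire(&slot->mutex);
        slot->candidate_catalog_xmin = InvalidTransactionId;
        slot->candidate_xmin_lsn = InvalidXLogRecPtr;
        slot->candidate_restart_valid = InvalidXLogRecPtr;
        slot->candidate_restart_lsn = InvalidXLogRecPtr;
        SpinLockRelease(&slot->mutex);
    }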


[1]
https://www.postgresql.org/message-id/CAD21AoBVhYnGBuW_o%3DwEGgTp01qiHNAx1a14b1X9kFXmuBe%3Dsg%40mail.gmail.com


-- 
Tomas Vondra





Re: define pg_structiszero(addr, s, r)

2024-11-11 Thread Bertrand Drouvot
Hi,

On Mon, Nov 11, 2024 at 06:19:50AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Sat, Nov 09, 2024 at 04:15:04AM +, Bertrand Drouvot wrote:
> > Hi,
> > 
> > On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> > > I tried with [1] and the
> > > latest gcc does not seem to be smart enough to figure this out.  I
> > > tried adding some additional len checks that the compiler can use as a
> > > cue and won't need to emit code for the checks providing the function
> > > does get inlined. That was enough to get the compiler to not emit the
> > > loops when they'll not be used. See the -DCHECK_LEN flag I'm passing
> > > in the 2nd compiler window. I just don't know if putting something
> > > like that into the code is a good idea as if the function wasn't
> > > inlined for some reason, the extra len checks would have to be
> > > compiled into the function.
> > > 
> > > David
> > > 
> > > [1] https://godbolt.org/z/xa81ro8GK
> > 
> > Looking at it, that looks like an issue.
> > 
> > I mean, without the -DCHECK_LEN flag then the SIMD code will read up to 48 
> > bytes
> > beyond the struct's memory (which is 16 bytes):
> > 
> > This is fine:
> > "
> > movdqu  xmm0, XMMWORD PTR [rdi]
> > "
> > 
> > But I don't think it is:
> > 
> > "
> > movdqu  xmm2, XMMWORD PTR [rdi+16]
> > movdqu  xmm1, XMMWORD PTR [rdi+32]
> > movdqu  xmm3, XMMWORD PTR [rdi+48]
> > "
> > 
> > given that the struct size is only 16 bytes.
> > 
> > Thoughts?
> 
> What about "simply" starting pg_memory_is_all_zeros() with?
> 
> "
> if (len < sizeof(size_t)*8) {
> while (p < end) {
> if (*p++ != 0)
> return false;
> }
> return true;
> }
> "
> 
> That way:
> 
> - we prevents reading beyond the memory area in the SIMD section (if < 64 
> bytes)
> - we make sure that aligned_end can not be after the real end (could be if the
> len is < 8 bytes)
> - there is no need for additional size checks later in the code
> - len < 64 bytes will be read byte per byte but that's likely "enough" (if not
> faster) for those "small" sizes
> 

To handle the "what about the len check if the function is not inlined?", I
can't think about a good approach.

I thought about using a macro like:

"
#define pg_memory_is_all_zeros(ptr, len) \
((len) < sizeof(size_t) * 8 ? \
pg_memory_is_all_zeros_small((ptr), (len)) : \
pg_memory_is_all_zeros_large((ptr), (len)))
"

but that would not help (as the len check would need to be done too, so same
run time cost).

I also thought about using __builtin_constant_p(len) but that would not help
because still we need the len check for safety.

So please find attached v11 that uses "only" one len check into the function to
ensure that we won't read beyond the memory area (should its size be < 64 or < 
8).
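
To make the shape concrete, here is a minimal standalone sketch of the idea
(not the attached v11 patch itself, and it omits the 8-words-at-a-time
bitwise-OR unrolling the patch uses to encourage SIMD): small inputs take the
byte loop only, so the word-sized path can never read past the end.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static inline bool
memory_is_all_zeros_sketch(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end = p + len;

    /* Small inputs: plain byte loop, cannot read beyond the area. */
    if (len < sizeof(size_t) * 8)
    {
        while (p < end)
        {
            if (*p++ != 0)
                return false;
        }
        return true;
    }

    /* Byte loop until the pointer is size_t-aligned. */
    while ((uintptr_t) p % sizeof(size_t) != 0)
    {
        if (*p++ != 0)
            return false;
    }

    /* Word-at-a-time over the aligned middle part. */
    for (; p + sizeof(size_t) <= end; p += sizeof(size_t))
    {
        if (*(const size_t *) p != 0)
            return false;
    }

    /* Remaining tail bytes. */
    while (p < end)
    {
        if (*p++ != 0)
            return false;
    }
    return true;
}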

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 200e7752335d0f537f82a27ce8425fad8309a9fb Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 8 Nov 2024 14:19:59 +0900
Subject: [PATCH v11 1/2] Optimize pg_memory_is_all_zeros()

pg_memory_is_all_zeros() is currently doing byte per byte comparison and so
could lead to performance regression or penalties when multi bytes comparison
could be done instead.

Let's provide an optimized version that divides the checks into four phases for
efficiency:

- Initial alignment (byte per byte comparison)
- Compare 8 size_t chunks at once using bitwise OR (candidate for SIMD optimization)
- Compare remaining size_t aligned chunks
- Compare remaining bytes (byte per byte comparison)

If the memory area size is < 64 bytes then we are using byte per byte comparison
only to ensure that no data beyond the memory area could be read (that's likely
to be good enough for such sizes).

Code mainly suggested by David Rowley.
---
 src/include/utils/memutils.h | 85 ++--
 1 file changed, 82 insertions(+), 3 deletions(-)
 100.0% src/include/utils/

diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index 3590c8bad9..f10b0ea05e 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -190,19 +190,98 @@ extern MemoryContext BumpContextCreate(MemoryContext parent,
 #define SLAB_LARGE_BLOCK_SIZE		(8 * 1024 * 1024)
 
 /*
+ * pg_memory_is_all_zeros
+ *
  * Test if a memory region starting at "ptr" and of size "len" is full of
  * zeroes.
+ *
+ * The test is divided into multiple phases, to be efficient for various
+ * length values:
+ * - Byte by byte comparison if len < 64 to ensure that we won't read beyond the
+ *   memory area.
+ * - Byte by byte comparison, until the pointer is aligned.
+ * - 8 * sizeof(size_t) comparisons using bitwise OR, to encourage compilers
+ *   to use SIMD instructions if available, up to the last aligned location
+ *   possible.
+ * - size_t comparisons, with aligned pointers, up to the last location
+ *   possible.
+ * - Byte by b

Re: Add html-serve target to autotools and meson

2024-11-11 Thread Jacob Champion
On Mon, Nov 11, 2024 at 7:50 AM Andres Freund  wrote:
> If you just want that we could print out a file:// url or even open it? Rather
> than an http server?

+1. The coverage-html recipe is a nice example of this; just
control-click the file: link and away you go.

--Jacob




explain plans for foreign servers

2024-11-11 Thread dinesh salve
Hi Hackers,

I am working on a feature in postgres_fdw extension to show plans used by
remote postgresql servers in the output of the EXPLAIN command.
I think this will help end users understand query execution plans used by
remote servers. Sample output for table people where people_1 is local
partition and people_2 is remote partition would look like -

postgres:5432> explain select * from "test"."people";
                              QUERY PLAN
 Append (cost=0.00..399.75 rows=2270 width=46)
   → Seq Scan on "people.people_1" people_1 (cost=0.00..21.00 rows=1100 width=46)
   → Foreign Scan on "people.people_2" people_2 (cost=100.00..367.40 rows=1170 width=46)
         Remote Plan
           Seq Scan on "people.people_2" (cost=0.00..21.00 rows=1100 width=46)
 (5 rows)

I would like community inputs on below high level thoughts:

1. To enable this feature, either we can introduce a new option in EXPLAIN
command e.g. (fetch_remote_plans true) or control this behaviour using a
guc defined in postgres_fdw extension.  I am more inclined towards guc
as this feature is for extension postgres_fdw. Adding the EXPLAIN command
option might force other FDW extensions to handle this.

2. For ANALYZE = false, the idea is that postgres_fdw would create a
connection to a remote server, prepare SQL to send over connection and
store received plans in ExplainState.

3. For ANALYZE = true, idea is that postgres_fdw would set a new guc over
connection to remote server, remote server postgres_fdw would read this guc
and send back used query plan as a NOTICE (similar to auto_explain
extension does) with custom header which postgres_fdw extension
understands. . We also have an opportunity to introduce a new message type
in the protocol to send back explain plans but it might look like too much
work for this feature. Open to ideas here.

Dinesh Salve
SDE@AWS


Re: Add html-serve target to autotools and meson

2024-11-11 Thread Andres Freund
Hi,

On 2024-11-11 19:28:24 +0530, Ashutosh Bapat wrote:
> On Mon, Nov 11, 2024 at 9:55 AM Michael Paquier  wrote:
> >
> > On Sat, Nov 09, 2024 at 09:26:20AM +0100, Peter Eisentraut wrote:
> > > What is the advantage of using an http server over just opening the file
> > > directly in a web browser (file:// scheme)?
> >
> > Not sure to see any on sight.  I just use file:// to check the state
> > of the docs produced by any patch sent to the lists, even keeping some
> > bookmarks for the bookindex to ease lookups.
> 
> I sometimes create separate directories and clones for my work. In
> that case, there is no fix doc URL that I can bookmark. So it would be
> handy if meson opens docs for me with build specific URLs. I am not
> sure if it's worth the code and effort though. I cook up the URL using
> the build directory and copy paste in the browser. One could start a
> browser with DOCs opened as well.

If you just want that we could print out a file:// url or even open it? Rather
than an http server?

Greetings,

Andres Freund




Re: Eager aggregation, take 3

2024-11-11 Thread Robert Haas
On Sun, Nov 10, 2024 at 7:52 PM Richard Guo  wrote:
> > I have similar but weaker feelings about ordered aggregates. Consider:
> >
> > explain select t1.id, array_agg(t2.v order by t3.o) from t1, t2, t3
> > where t1.id = t2.id and t2.id = t3.id group by 1;
> >
>
> It seems to me that a partially aggregated row might need to be
> combined with other partially aggregated rows after the join, if they
> belong to the same t1.id group.  IIUC, this implies that we cannot
> perform partial aggregation on ordered input before the join,
> otherwise we may get incorrect results during the final aggregation
> phase.

Hmm, I think you're right. I think that if the t1.id=t2.id join is one
to one, then it would work out fine, but that need not be the case.

> Hmm, currently we only consider grouped aggregation for eager
> aggregation.  For grouped aggregation, the window function's
> arguments, as well as the PARTITION BY expressions, must appear in the
> GROUP BY clause.  That is to say, the depname column in the first
> query, or the n column in the second query, will not be aggregated
> into the partial groups.  Instead, they will remain as they are as
> input for the WindowAgg nodes.  It seems to me that this ensures
> that we're good with window functions.  But maybe I'm wrong.

Unfortunately, I don't know what you mean by grouped aggregation. I
think of grouping and aggregation as synonyms, pretty much.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Offsets of `struct Port` are no longer constant

2024-11-11 Thread Jacob Champion
Hi all,

A comment at the end of the Port struct says

/*
 * OpenSSL structures. (Keep these last so that the locations of other
 * fields are the same whether or not you build with SSL enabled.)
 */

but as part of the direct-SSL changes in d39a49c1e45 (cc'd Heikki),
some additional fields snuck in after it.

I assume it's too late to change this for 17, but should this be
addressed in HEAD? I've attached a draft patch (untested, beware)
which should no longer rely on field order.
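
To make the hazard concrete, here is a tiny standalone illustration (a toy
struct, not the real Port, and USE_TOY_SSL is a made-up macro) of how a field
added after a conditionally-compiled block ends up with a build-dependent
offset:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in: fields after the #ifdef block shift depending on the build. */
typedef struct ToyPort
{
    int         sock;
#ifdef USE_TOY_SSL
    void       *ssl;            /* present only in "SSL" builds */
#endif
    bool        added_later;    /* offset differs between builds */
} ToyPort;

int
main(void)
{
    printf("offsetof(ToyPort, added_later) = %zu\n",
           offsetof(ToyPort, added_later));
    return 0;
}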

Thanks,
--Jacob


port-offset.diff
Description: Binary data


Re: [PoC] XMLCast (SQL/XML X025)

2024-11-11 Thread Robert Haas
On Sun, Nov 10, 2024 at 1:14 PM Jim Jones  wrote:
> rebase.

Hmm, this patch has gotten no responses for 4 months. That's kind of
unfortunate. Sadly, there's not a whole lot that I can do to better
the situation, because I know very little either about XML-related
standards or about how people make use of XML in practice. It's not
that much code, so if it does a useful thing that we actually want, we
can probably figure out how to verify that the code is correct, or fix
it. But I don't know whether it's a useful thing that we actually
want. Syntactically, XMLCAST() looks a lot like CAST(), so one might
ask whether the things that it does can already be accomplished using
CAST(); or whether, perhaps, we have some other existing method for
performing such conversions.

The only thing I found during a quick perusal of the documentation was
XMLTABLE(), which seems a bit baroque if you just want to convert one
value. Is this intended to plug that gap? Is there any other current
way of doing it?

Do we need to ensure some kind of consistency between XMLTABLE() and
XMLCAST() in terms of how they behave? The documentation at
https://www.postgresql.org/docs/current/xml-limits-conformance.html#FUNCTIONS-XML-LIMITS-CASTS
says that "When PostgreSQL maps SQL data values to XML (as in
xmlelement), or XML to SQL (as in the output columns of xmltable),
except for a few cases treated specially, PostgreSQL simply assumes
that the XML data type's XPath 1.0 string form will be valid as the
text-input form of the SQL datatype, and conversely." Unfortunately,
it does not specify what those cases treated specially are, and the
commit that added that documentation text is not the one that added
the underlying code, so I don't actually know where that code is, but
one would expect this function to conform to that general rule.

I emphasize again that if there are people other than the submitter
who are interested in this patch, they should really chime in. This
can't progress in a vacuum.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Parallel heap vacuum

2024-11-11 Thread Masahiko Sawada
On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> >
> > I've attached new version patches that fixes failures reported by
> > cfbot. I hope these changes make cfbot happy.
>
> Thanks for updating the patch and sorry for delaying the reply. I confirmed 
> cfbot
> for Linux/Windows said ok.
> I'm still learning the feature so I can post only one comment :-(.
>
> I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, 
> pre-existing API,
> TidStoreBeginIterate(), has already accepted the shared TidStore. The only 
> difference
> is whether elog(ERROR) exists, but I wonder if it benefits others. Is there 
> another
> reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?

TidStoreBeginIterateShared() is designed for multiple parallel workers
to iterate a shared TidStore. During an iteration, parallel workers
share the iteration state and iterate the underlying radix tree while
taking appropriate locks. Therefore, it's available only for a shared
TidStore. This is required to implement the parallel heap vacuum,
where multiple parallel workers do the iteration on the shared
TidStore.

On the other hand, TidStoreBeginIterate() is designed for a single
process to iterate a TidStore. It accepts even a shared TidStore as
you mentioned, but during an iteration there is no inter-process
coordination such as locking. When it comes to parallel vacuum,
supporting TidStoreBeginIterate() on a shared TidStore is necessary to
cover the case where we use only parallel index vacuum but not
parallel heap scan/vacuum. In this case, we need to store dead tuple
TIDs on the shared TidStore during heap scan so parallel workers can
use it during index vacuum. But it's not necessary to use
TidStoreBeginIterateShared() because only one (leader) process does
heap vacuum.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: RFC: Additional Directory for Extensions

2024-11-11 Thread David E. Wheeler
On Nov 11, 2024, at 02:16, Peter Eisentraut  wrote:

> I implemented a patch along the lines Craig had suggested.

Oh, nice, thank you.

> It's a new GUC variable that is a path for extension control files.  It's 
> called extension_control_path, and it works exactly the same way as 
> dynamic_library_path.  Except that the magic token is called $system instead 
> of $libdir.

I assume we can bikeshed these names later. :-)

> In fact, most of the patch is refactoring the routines in dfmgr.c to not 
> hardcode dynamic_library_path but allow searching for any file in any path.  
> Once a control file is found, the other extension support files (script files 
> and auxiliary control files) are looked for in the same directory.

What about shared libraries files?

> This works pretty much fine for the use cases that have been presented here, 
> including installing extensions outside of the core installation tree (for 
> CNPG and Postgres.app) and for testing uninstalled extensions (for Debian).

If I understand correctly, shared modules still lie in dynamic_library_path, 
yes? That makes things a bit more complicated, as the CNPG use case has to set 
up multiple persistent volumes to persist files put into various directories, 
notably sharedir and pkglibdir.

> - You can install extensions into alternative directories using PGXS like
> 
>make install datadir=/else/where/share pkglibdir=/else/where/lib
> 
> This works.  I was hoping it would work to use
> 
>make install prefix=/else/where
> 
> but that doesn't because of some details in Makefile.global.  I think we can 
> tweak that a little bit to make that work too.

In the broader extension organization RFC I’ve been working up[1], I propose a 
new Makefile prefix for the destination directory for an extension. It hinges 
on the idea that an extension has all of its files organized in a directory 
with the extension name, rather than separate params for data, pkglib, bin, etc.

> - With the current patch, if you install into datadir=/else/where/share, then 
> you need to set extension_control_path=/else/where/share/extension.  This is 
> a bit confusing.  Maybe we want to make the "extension" part implicit.

+1

> - The biggest problem is that many extensions set in their control file
> 
>module_pathname = '$libdir/foo'
> 
> This disables the use of dynamic_library_path, so this whole idea of 
> installing an extension elsewhere won't work that way.  The obvious solution 
> is that extensions change this to just 'foo'.  But this will require a lot 
> updating work for many extensions, or a lot of patching by packagers.

Since that’s set at build/install time, couldn’t the definition of `$libdir` 
here be changed to mean “the directory into which it’s being installed right 
now?”. Doesn’t seem necessary to search a path if the specific location is set 
at install time.

> Maybe we could devise some sort of rule that if the extension control file is 
> found via the path outside of $system, then the leading $libdir is ignored.
> 

This is what I propose,[1] yes. If we redefine the organization of extension 
files to live in a single directory named for an extension, then once you’ve 
found the control files all the other files are there, too. But `$libdir` is 
presumably still meaningful, since the first time you call an extension 
function in a new connection, it just tries to load that location, it doesn’t 
go through the control file search process, AFAIK.

Perhaps I misunderstand, but I would like to talk through the implications of a 
more radical rethinking of extension file location along the lines of the other 
thread[2] and the RFC I’m working up based on them both[1], especially since 
there are a few other use cases that inform it.

A notable one is the idea Gabriele shared with me in Athens to be able to add 
an extension to a running CNPG pod by simply mounting a read-only volume with 
all the files it requires.

Best,

David

[1]: https://github.com/theory/justatheory/pull/7/files





Re: [BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes

2024-11-11 Thread Joel Jacobson
On Sun, Nov 10, 2024, at 23:14, Tom Lane wrote:
> "Joel Jacobson"  writes:
>> On Sun, Nov 10, 2024, at 22:37, Tom Lane wrote:
>>> That seems like a hack, as it also changes the behavior w.r.t.
>>> prompts and EOF-mark detection, neither for the better.
>
>> Hmm, in what way does it change the behavior?
>> I ran almost all the tests, including the TAP ones, and none of them failed 
>> with
>> this fix. Is there some behavior that is currently not covered by the test 
>> suite?
>
> Of course.  (Our regression tests are very far from covering 100% of
> the behavioral space.)
>
> In the case at hand, this would break the
> expected line-by-line prompting for "\copy ... from '/dev/tty'".

Yes, of course, just wondered what kind of behavior in this area
that wasn't tested, thanks for explaining it.

> Playing with this just now, I notice that the prompt you get
> still claims that \. works to end input, although right now
> it does not:
>
> regression=# \copy int8_tbl from '/dev/tty'
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
>>> 13
>>> 56
>>> \.
>>> ^D
> COPY 2
>
> I'm inclined to think that the prompt still describes what should
> happen, at least in non-binary mode, although in binary mode we
> probably ought to just say (and do) "End with an EOF signal".
>
> So perhaps the if-test to choose the code path could be
>
>   if ((isbinary || copystream != pset.cur_cmd_source) && !showprompt)

Thanks for the guidance; that seems to fix it for 6 of the 8 cases I've 
figured out how to test.

> which would allow dropping the vestigial prompt logic in the first
> path,

I assume you mean that due to "&& !showprompt" the "if (showprompt)"
becomes unreachable and can therefore be dropped?

> and we would also need to change the test in the second path
> that decides if we should check for \.  (Likely this should be
> refactored a bit to make it more understandable.  An intermediate
> flag saying whether we intend to check for \. might help.)

Maybe check_dot_command?

const bool  check_dot_command = (copystream == pset.cur_cmd_source);

I haven't tried yet to refactor the code,
except than replacing the two "copystream == pset.cur_cmd_source"
occurrences with the new check_dot_command flag.
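
Putting your suggested if-test together with that flag, the branching I have
in mind is roughly this (a sketch of the structure only, not the full bodies
in psql's copy code):

    const bool  check_dot_command = (copystream == pset.cur_cmd_source);

    if ((isbinary || !check_dot_command) && !showprompt)
    {
        /* block-read path: no prompting, no \. end-of-data detection */
    }
    else
    {
        /* line-by-line path: prompt if requested, and check for \. only
         * where that still makes sense (e.g. not in binary mode) */
    }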

First want to understand if the two remaining cases are valid,
and if they can be tested:

> Anyway, my point is that we need to think through the desirable
> behavior for each possible combination of showprompt, isbinary, and
> copystream != pset.cur_cmd_source, because all 8 cases are reachable.

I guess these are the 8 cases?

+------+------------+----------+-------------------+
| CASE | showprompt | isbinary | check_dot_command |
+------+------------+----------+-------------------+
|  1   |   false    |  false   |       false       |
|  2   |   false    |  false   |       true        |
|  3   |   false    |  true    |       false       |
|  4   |   false    |  true    |       true        |
|  5   |   true     |  false   |       false       |
|  6   |   true     |  false   |       true        |
|  7*  |   true     |  true    |       false       |
|  8*  |   true     |  true    |       true        |
+------+------------+----------+-------------------+
* Cases 7 and 8 not tested yet

With the changed if-test, cases 1-6 work;
for case 1, the binary mode branch is taken
instead of the text mode branch,
whereas cases 2-6 take the same branch as before.

joel@Joels-MBP psql_tester % git diff --no-index -U100 /tmp/psql.log.HEAD 
/tmp/psql.log
diff --git a/tmp/psql.log.HEAD b/tmp/psql.log
index 5e44e30..1f48ac9 100644
--- a/tmp/psql.log.HEAD
+++ b/tmp/psql.log
@@ -1,6 +1,6 @@
-COPY case: 1 TEXT MODE
+COPY case: 1 BINARY MODE
 COPY case: 2 TEXT MODE
 COPY case: 3 BINARY MODE
 COPY case: 4 BINARY MODE
 COPY case: 5 TEXT MODE
 COPY case: 6 TEXT MODE

Here is how I tested each case:

# CASE 1:
# showprompt:false
# isbinary:  false
# check_dot_command: false
psql -c "\copy int8_tbl from '/tmp/int8_tbl.data'"

# CASE 2:
# showprompt:false
# isbinary:  false
# check_dot_command: true
psql -f /tmp/copy_stdin_text.sql

# CASE 3:
# showprompt:false
# isbinary:  true
# check_dot_command: false
psql -c "\copy int8_tbl from '/tmp/int8_tbl.bin' (format binary)"

# CASE 4:
# showprompt:false
# isbinary:  true
# check_dot_command: true
printf '\\copy int8_tbl from stdin (format binary)\n' 
>/tmp/copy_stdin_binary.sql
cat /tmp/int8_tbl.bin >>/tmp/copy_stdin_binary.sql
psql -f copy_stdin_binary.sql

# CASE 5:
# showprompt:true
# isbinary:  false
# check_dot_command: false

psql
\copy int8_tbl from '/dev/tty'
17  18
19  20
\.
# Send EOF (Ctrl+D)

# CASE 6:
# showprompt:true
# isbinary:  false
# check_dot_command: true

psql
\copy int8_tbl from stdin
21  22
23  24
\.

# CASE 7:
# showprompt:true
# isbinary:  true
# chec

Re: Add missing tab completion for ALTER TABLE ADD COLUMN IF NOT EXISTS

2024-11-11 Thread Karina Litskevich
On Sun, Nov 10, 2024 at 3:43 PM Kirill Reshke  wrote:
> > By the way, I suggest adding a completion of "ALTER TYPE ... ALTER
> > ATTRIBUTE ... TYPE" with the list of types while we are here. It should
> > probably go together with v4-0001.
>
> Surprisingly this is already supported in an interesting manner.  psql
> will tab-complete everything that end with token "TYPE" with list of
> datatypes:
>
> ```
> reshke=# FOO BAR BAZ
>  *nothing*
> reshke=# FOO BAR BAZ TYPE
> 
> Display all 119 possibilities? (y or n)
> ```

Agreed.

I wrote:
> I hope to look more thoroughly into tab-complete.in.c tomorrow or on
> Monday to see if there are any other problems I can't see at first
> glance. I'll send another mail when I get to do this.

As promised, I looked into the new set of patches more closely. Can't
see any other significant problems. However, you still need to do a
couple of cosmetic changes.

On Sun, Nov 10, 2024 at 3:43 PM Kirill Reshke  wrote:
> > 1) Try to keep comments identical, at least in one piece of code. Right
> > now you have "CREATE MATERIALIZED VIEW " and "CREATE MATERIALIZED
> > VIEW " within three consecutive lines. I can see there was the
> > same problem before your changes, so it's not exactly your fault. Let's
> > correct it, though.
>
> Ok, sure. I did the correction.

You now have the same problem with "USING " and
"USING " in v5-0004.

Also, make sure you ran pgindent before creating patches. In v5-0004
there are comments that are too long for one line, and there is a line
with a trailing space:

+ else if (Matches("CREATE", "MATERIALIZED", "VIEW", MatchAny, "AS") ||

Other than that, everything looks fine to me.

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/




Re: Linkify mentions of the primary/subscriber's max_replication_slots

2024-11-11 Thread Amit Kapila
On Thu, Nov 7, 2024 at 4:40 PM Amit Kapila  wrote:
>
> On Thu, Nov 7, 2024 at 9:36 AM Tristan Partin  wrote:
> >
> >
> > Here is a patch which does so.
> >
>
> 
>   Note that this parameter also applies on the subscriber side, but 
> with
> - a different meaning.
> + a different meaning. See <xref linkend="guc-max-replication-slots-subscriber"/>
> + in  for more
> + details.
>  
> 
>
> @@ -5215,7 +5217,9 @@ ANY  class="parameter">num_sync ( 
> +in  for more
> +details.
>
>  and <xref linkend="runtime-config-replication-primary"/> are incorrect. They
> should be instead <xref linkend="runtime-config-replication-subscriber"/> and <xref linkend="runtime-config-replication-sender"/>.
>

I have made the required changes and pushed the patch.

-- 
With Regards,
Amit Kapila.




RE: Parallel heap vacuum

2024-11-11 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> 
> I've attached new version patches that fixes failures reported by
> cfbot. I hope these changes make cfbot happy.

Thanks for updating the patch and sorry for delaying the reply. I confirmed 
cfbot
for Linux/Windows said ok.
I'm still learning the feature so I can post only one comment :-(.

I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, 
pre-existing API,
TidStoreBeginIterate(), has already accepted the shared TidStore. The only 
difference
is whether elog(ERROR) exists, but I wonder if it benefits others. Is there 
another
reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?

Another approach is to restrict TidStoreBeginIterate() to support only the 
local one.
How do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Support LIKE with nondeterministic collations

2024-11-11 Thread Heikki Linnakangas

On 04/11/2024 10:26, Peter Eisentraut wrote:

On 29.10.24 18:15, Jacob Champion wrote:

libfuzzer is unhappy about the following code in MatchText:


+    while (p1len > 0)
+    {
+    if (*p1 == '\\')
+    {
+    found_escape = true;
+    NextByte(p1, p1len);
+    }
+    else if (*p1 == '_' || *p1 == '%')
+    break;
+    NextByte(p1, p1len);
+    }


If the pattern ends with a backslash, we'll call NextByte() twice,
p1len will wrap around to INT_MAX, and we'll walk off the end of the
buffer. (I fixed it locally by duplicating the ERROR case that's
directly above this.)


Thanks.  Here is an updated patch with that fixed.


Sadly the algorithm is O(n^2) with non-deterministic collations. Is there 
any way this could be optimized? We make no claims on how expensive any 
functions or operators are, so I suppose a slow implementation is 
nevertheless better than throwing an error.


Let's at least add some CHECK_FOR_INTERRUPTS(). For example, this takes 
a very long time and is uninterruptible:


 SELECT repeat('x', 10) LIKE '%xxxy%' COLLATE ignore_accents;

--
Heikki Linnakangas
Neon (https://neon.tech)





RE: Commit Timestamp and LSN Inversion issue

2024-11-11 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> I understand your concern and appreciate the feedback. I've made some
> adjustments to the patch by directly placing the code to adjust the commit
> timestamp within the spinlock, aiming to keep it as efficient as possible. The
> changes have resulted in just a few extra lines. Would this approach be
> acceptable to you ?
> 
> Additionally, we're also doing performance tests for it and will share the
> results once they're available.

I've done performance tests comparing master vs. the v2 patch.
It showed that for small-transaction cases, the performance difference was 
0-2%,
which was almost the same as the run-by-run variation.

We may completely change the approach based on the recent discussion,
but let me share it once.

## Executed workload

Very small transactions with many clients were executed and results between 
master
and patched were compared. Two workloads were executed and compared:

- Insert single tuple to the table which does not have indices:
```
BEGIN;
INSERT INTO foo VALUES (1);
COMMIT;
```

- Emit a transactional logical replication message:
```
BEGIN;
SELECT pg_logical_emit_message(true, 'pgbench', 'benchmarking', false);
COMMIT;
```

## Results

Here is a result.
Each run had 60s periods and median of 5 runs were chosen.

### Single-tuple insertion

# of clients  HEAD  V2  diff
1  5602.828793  5593.991167  0.158%
2  10547.04503  10615.55583  -0.650%
4  15967.80926  15651.12383  1.983%
8  31213.14796  30584.75382  2.013%
16  60321.34215  59998.0144  0.536%
32  08.2883  110615.9216  0.443%
64  171860.0186  171359.8172  0.291%

### Transactional message

# of clients  HEAD  V2  diff
1  5714.611827  5648.9299  1.149%
2  10651.26476  10677.2745  -0.244%
4  16137.30248  15984.11671  0.949%
8  31220.16833  30772.53673  1.434%
16  60364.22808  59866.92579  0.824%
32  111887.1129  111406.4923  0.430%
64  172136.76  171347.5658  0.458%

Actually the standard deviation of each run was almost the same (0-2%), so we 
could
conclude that there was no significant difference.

## Appendix - measurement environments

- Used machine has 755GB memory and 120 cores (Intel(R) Xeon(R) CPU E7-4890).
- RHEL 7.9 is running on the machine
- HEAD pointed the commit 41b98ddb7 when tested.
- Only `--prefix` was specified for when configured.
- Changed parameters from the default were:
```
autovacuum = false
checkpoint_timeout = 1h
shared_buffers = '30GB'
max_wal_size = 20GB
min_wal_size = 10GB
```
Best regards,
Hayato Kuroda
FUJITSU LIMITED



clean up create|alter domain stmt incompatiable constraint error case and add regression test

2024-11-11 Thread jian he
hi.

https://coverage.postgresql.org/src/backend/commands/typecmds.c.gcov.html
shows DefineDomain has poor test coverage,
so I added a more extensive regression test.

also
create domain d_fail as int4 constraint cc GENERATED ALWAYS AS (2) STORED;
ERROR:  unrecognized constraint subtype: 4

This kind of error message is not helpful, so I changed it to
ERROR:  generated columns are not supported on domains

create domain d_int as int4;
ALTER DOMAIN d_int ADD constraint cc check(value > 1) no inherit not valid;
should fail,
so I made it fail.

typedef struct AlterDomainStmt
{
NodeTagtype;
charsubtype;
List   *typeName;/* domain to work on */
char   *name;/* column or constraint name to act on */
Node   *def;/* definition of default or constraint */
DropBehavior behavior;/* RESTRICT or CASCADE for DROP cases */
boolmissing_ok;/* skip error if missing? */
} AlterDomainStmt;
we only have one `Node   *def; ` in AlterDomainStmt
unlike CreateDomainStmt, have `List   *constraints;`

So I guess we have to allow the following ALTER DOMAIN queries,
even though the corresponding CREATE DOMAIN stmt would fail.

create domain d_int as int4;
ALTER DOMAIN d_int ADD constraint cc1 check(value > 1) INITIALLY IMMEDIATE; --ok
ALTER DOMAIN d_int ADD constraint cc2 check(value > 1) NOT DEFERRABLE; --ok
ALTER DOMAIN d_int ADD constraint cc3 check(value > 1) NOT DEFERRABLE
not valid; --ok
ALTER DOMAIN d_int ADD constraint cc4 check(value > 1) INITIALLY
IMMEDIATE not valid; --ok

Looking at pg_constraint, for the above queries, the columns {condeferrable
| condeferred | convalidated} are all set properly.


I didn't find a way to trigger
errmsg("exclusion constraints not possible for domains")));
From 35d9ce7318b259281d1c8cf347dc0594e9ac62ba Mon Sep 17 00:00:00 2001
From: jian he 
Date: Mon, 11 Nov 2024 16:49:34 +0800
Subject: [PATCH v1 1/1] add more CREATE|ALTER DOMAIN tests

also AlterDomainAddConstraint, DefineDomain refactoring
---
 src/backend/commands/typecmds.c  | 20 ++-
 src/test/regress/expected/domain.out | 84 
 src/test/regress/sql/domain.sql  | 37 
 3 files changed, 140 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c
index 859e2191f0..8c602ca63e 100644
--- a/src/backend/commands/typecmds.c
+++ b/src/backend/commands/typecmds.c
@@ -1011,6 +1011,16 @@ DefineDomain(CreateDomainStmt *stmt)
 		 errmsg("specifying constraint deferrability not supported for domains")));
 break;
 
+			case CONSTR_IDENTITY:
+ereport(ERROR,
+		(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+		 errmsg("identity columns are not supported on domains")));
+break;
+			case CONSTR_GENERATED:
+ereport(ERROR,
+		(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+		 errmsg("generated columns are not supported on domains")));
+break;
 			default:
 elog(ERROR, "unrecognized constraint subtype: %d",
 	 (int) constr->contype);
@@ -2934,8 +2944,16 @@ AlterDomainAddConstraint(List *names, Node *newConstraint,
 	switch (constr->contype)
 	{
 		case CONSTR_CHECK:
+			if (constr->is_no_inherit)
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+		 errmsg("check constraints for domains cannot be marked NO INHERIT")));
+			break;
 		case CONSTR_NOTNULL:
-			/* processed below */
+			if (constr->is_no_inherit)
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+		 errmsg("NOT NULL constraints for domains cannot be marked NO INHERIT")));
 			break;
 
 		case CONSTR_UNIQUE:
diff --git a/src/test/regress/expected/domain.out b/src/test/regress/expected/domain.out
index f65b66345a..f4131c2696 100644
--- a/src/test/regress/expected/domain.out
+++ b/src/test/regress/expected/domain.out
@@ -87,6 +87,90 @@ drop domain domainvarchar restrict;
 drop domain domainnumeric restrict;
 drop domain domainint4 restrict;
 drop domain domaintext;
+--error case.
+create table t1(a int primary key);
+create domain d_fail as int4 constraint cc references t1;
+ERROR:  foreign key constraints not possible for domains
+create domain d_fail as int4 not null no inherit;
+ERROR:  not-null constraints for domains cannot be marked NO INHERIT
+create domain d_fail as int4 not null null;
+ERROR:  conflicting NULL/NOT NULL constraints
+create domain d_fail as int4 not null default 3 default 3;
+ERROR:  multiple default expressions
+create domain d_fail as int4 unique;
+ERROR:  unique constraints not possible for domains
+create domain d_fail as int4 PRIMARY key;
+ERROR:  primary key constraints not possible for domains
+create domain d_fail as int4 constraint cc generated by default as identity;
+ERROR:  identity columns are not supported on domains
+create domain d_fail as int4 constraint cc GENERATED ALWAYS AS (2) STORED;
+ERROR:  generated columns are not supported on domains
+create domain d_f

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-11-11 Thread torikoshia

On 2024-11-09 21:55, Kirill Reshke wrote:

Thanks for working on this!

On Thu, 7 Nov 2024 at 23:00, Fujii Masao  
wrote:




On 2024/10/26 6:03, Kirill Reshke wrote:
> when the REJECT LIMIT is set to some non-zero number and the number of
> row NULL replacements exceeds the limit, is it OK to fail. Because
> there WAS errors, and we should not tolerate more than $limit errors .
> I do find this behavior to be consistent.

+1


> But what if we don't set a REJECT LIMIT, it is sane to do all
> replacements, as if REJECT LIMIT is inf.

+1


After thinking for a while, I'm now more opposed to this approach. I
think we should count rows with erroneous data as errors only if
null substitution for these rows failed, not the total number of rows
which were modified.
Then, to respect the REJECT LIMIT option, we compare this number with
the limit. This is actually simpler approach IMHO. What do You think?


IMHO I prefer the previous interpretation.
I'm not sure this is what people expect, but I assume that REJECT_LIMIT 
is used to specify how many malformed rows are acceptable in the 
"original" data source.



> But while I was trying to implement that, I realized that I don't
> understand v4 of this patch. My misunderstanding is about
> `t_on_error_null` tests. We are allowed to insert a NULL value for the
> first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why
> do we do that? My thought is we should try to execute
> InputFunctionCallSafe with NULL value  (i mean, here [1]) for the
> column after we failed to insert the input value. And, if this second
> call is successful, we do replacement, otherwise we count the row as
> erroneous.

Your concern is valid. Allowing NULL to be stored in a column with a 
NOT NULL
constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you 
suggested,

NULL values set by SET_TO_NULL should probably be re-evaluated.


Thank you. I updated the patch with a NULL re-evaluation.


PFA v7. I did not yet update the doc for this patch version, waiting
for feedback about REJECT LIMIT + SET_TO_NULL behaviour.




There were warnings when I applied the patch:

$ git apply 
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:170: 
trailing whitespace.

   /*
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:181: 
trailing whitespace.


v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:189: 
trailing whitespace.
   /* If datatype if okay with NULL, 
replace
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:196: 
trailing whitespace.


v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:212: 
trailing whitespace.

/*

 @@ -403,12 +403,14 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState 
*pstate, bool is_from)

  parser_errposition(pstate, def->location)));

...


 -   if (opts_out->reject_limit && !opts_out->on_error)
 +   if (opts_out->reject_limit && !(opts_out->on_error == 
COPY_ON_ERROR_NULL || opts_out->on_error == COPY_ON_ERROR_IGNORE))


Hmm, is this change necessary?
Personally, I feel the previous code is easier to read.


"REJECT LIMIT" should be "REJECT_LIMIT"?
1037 errhint("Consider specifying the 
REJECT LIMIT option to skip erroneous rows.")));



SET_TO_NULL is one of the values for ON_ERROR, but the patch treats 
SET_TO_NULL as an option for COPY:


221 --- a/src/bin/psql/tab-complete.in.c
222 +++ b/src/bin/psql/tab-complete.in.c
223 @@ -3235,7 +3235,7 @@ match_previous_words(int pattern_id,
224 COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL",
225   "HEADER", "QUOTE", "ESCAPE", "FORCE_QUOTE",
226   "FORCE_NOT_NULL", "FORCE_NULL", "ENCODING", 
"DEFAULT",

227 - "ON_ERROR", "LOG_VERBOSITY");
228 + "ON_ERROR", "SET_TO_NULL", "LOG_VERBOSITY");



Best regards,
Kirill Reshke


--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.




Re: doc: pgevent.dll location

2024-11-11 Thread Amit Kapila
On Thu, Nov 7, 2024 at 1:46 PM Dave Page  wrote:
>
> Hi
>
> On Wed, 6 Nov 2024 at 16:11, Peter Eisentraut  wrote:
>>
>> On 06.11.24 13:57, Ryohei Takahashi (Fujitsu) wrote:
>> > The dll install paths are changed as follows on Windows.
>> >
>> > (1) pgevent.dll
>> > PG16: lib/
>> > PG17: bin/
>> >
>> > (2) dll for user (like libpq.dll, libecpg.dll)
>> > PG16: Both in lib/ and bin/
>> > PG17: bin/
>> >
>> > (3) contrib dll (like amcheck.dll)
>> > PG16: lib/
>> > PG17: lib/
>> >
>> >
>> > I understand that Dave says (1) and (2) should exist on bin/ directory
>> > and the paths in PG17 are correct.
>> >
>> > So, I think it is correct to update documentation.
>>
>> I don't have Windows handy to test it out, but looking at the respective
>> build system source files, in master, pgevent is built and installed
>> like a normal shared library in both meson.build and Makefile, so it
>> should end up somewhere in lib.
>>
>> In src/tools/msvc in REL_16_STABLE, I see some code that appears to
>> suggest that it installs in bin.
>>
>> Again, this is just reading the code, but it seems to be backwards from
>> what is claimed earlier.
>
>
> I downloaded the builds of v16 and v17 from
> https://github.com/dpage/winpgbuild/actions/runs/11639786998, a project 
> worked on by Andres, Dave Cramer, and myself.
>
> I've just double-checked I didn't get mixed up, and can confirm that in v17, 
> pgevent.dll is installed into bin/ and in v16.4, it's in lib/
>

I can re-confirm that on my Windows setup, pgevent.dll is present in
bin/ for PG17 and HEAD and in lib/ for PG16.

-- 
With Regards,
Amit Kapila.




Re: Separate memory contexts for relcache and catcache

2024-11-11 Thread Ashutosh Bapat
On Sat, Nov 2, 2024 at 4:18 AM Andres Freund  wrote:
>
>
> > > I've previously proposed creating a type of memory context that's
> > > intended for
> > > places where we never expect to allocate much which allocates from
> > > either a
> > > superior memory context or just from the system allocator and tracks
> > > memory
> > > via linked lists.
> >
> > Why not just use ALLOCSET_SMALL_SIZES?
>
> That helps some, but not *that* much. You still end up with a bunch of 
> partially
> filled blocks. Here's e.g. an excerpt with your patch applied:
>
> │ name │ ident
>   │   type   │ level │ path  │ total_bytes │ total_nblocks │ 
> free_bytes │ free_chunks │ used_bytes │
> ├──┼┼──┼───┼───┼─┼───┼┼─┼┤
> │ CacheMemoryContext   │ (null)   
>   │ AllocSet │ 2 │ {1,19}│8192 │ 1 │   
> 7952 │   0 │240 │
> │ TypCacheContext  │ (null)   
>   │ AllocSet │ 3 │ {1,19,28} │8192 │ 1 │   
> 4816 │   0 │   3376 │
> │ search_path processing cache │ (null)   
>   │ AllocSet │ 3 │ {1,19,29} │8192 │ 1 │   
> 5280 │   7 │   2912 │
> │ CatCacheContext  │ (null)   
>   │ AllocSet │ 3 │ {1,19,30} │  262144 │ 6 │  
> 14808 │   0 │ 247336 │
> │ RelCacheContext  │ (null)   
>   │ AllocSet │ 3 │ {1,19,31} │  262144 │ 6 │   
> 8392 │   2 │ 253752 │
> │ relation rules   │ pg_backend_memory_contexts   
>   │ AllocSet │ 4 │ {1,19,31,34}  │8192 │ 4 │   
> 3280 │   1 │   4912 │
> │ index info   │ manyrows_pkey
>   │ AllocSet │ 4 │ {1,19,31,35}  │2048 │ 2 │
> 864 │   1 │   1184 │
> │ index info   │ pg_statistic_ext_relid_index 
>   │ AllocSet │ 4 │ {1,19,31,36}  │2048 │ 2 │
> 928 │   1 │   1120 │
> │ index info   │ pg_class_tblspc_relfilenode_index
>   │ AllocSet │ 4 │ {1,19,31,37}  │2048 │ 2 │
> 440 │   1 │   1608 │
>
> (this is a tiny bit misleading as "search_path processing cache" was just 
> moved")
>
> You can quickly see that the various contexts have a decent amount of free
> space, some of their space.
>
> We've already been more aggressive about using separate contets for indexes -
> and in aggregate that memory usage shows up:
>
> postgres[1088243][1]=# SELECT count(*), sum(total_bytes) as total_bytes, 
> sum(total_nblocks) as total_nblocks, sum(free_bytes) free_bytes, 
> sum(free_chunks) as free_chunks, sum(used_bytes) used_bytes FROM 
> pg_backend_memory_contexts WHERE path @> (SELECT path FROM 
> pg_backend_memory_contexts WHERE name = 'CacheMemoryContext') and name = 
> 'index info'
> ┌───┬─┬───┬┬─┬┐
> │ count │ total_bytes │ total_nblocks │ free_bytes │ free_chunks │ used_bytes 
> │
> ├───┼─┼───┼┼─┼┤
> │87 │  162816 │   144 │  48736 │ 120 │ 114080 
> │
> └───┴─┴───┴┴─┴┘
>
>
>
> And it's not just the partially filled blocks that are an "issue", it's also
> the freelists that are much less likely to be used soon if they're split very
> granularly. Often we'll end up with memory in freelists that are created while
> building some information that then will not be used again.
>
>
> Without your patch:
> ┌┬┬──┬───┬┬─┬───┬┬─┬┐
> │name│ ident  │   
> type   │ level │path│ total_bytes │ total_nblocks │ free_bytes │ 
> free_chunks │ used_bytes │
> ├┼┼──┼───┼┼─┼───┼┼─┼┤
> │ CacheMemoryContext │ (null) │ 
> AllocSet │ 2 │ {1,17} │  524288 │ 7 │  75448 │
>0 │ 448840 │
> │ relation rules │ pg_backend_memory_contexts │ 
> AllocSet │ 3 │ {1,17,27}  │8192 │ 4 │   3472 

Re: Virtual generated columns

2024-11-11 Thread jian he
On Wed, Nov 6, 2024 at 12:17 AM Peter Eisentraut  wrote:
>
> New patch version.  I've gone through the whole thread again and looked
> at all the feedback and various bug reports and test cases and made sure
> they are all addressed in the latest patch version.  (I'll send some
> separate messages to respond to some individual messages, but I'm
> keeping the latest patch here.)

Just a quick note about an unhelpful error message before you rebase.

src7=# create domain d_fail as int4 constraint cc GENERATED ALWAYS AS (2) ;
ERROR:  unrecognized constraint subtype: 4
src7=# create domain d_fail as int4 constraint cc GENERATED ALWAYS AS
(2) stored;
ERROR:  unrecognized constraint subtype: 4
src7=# create domain d_fail as int4 constraint cc GENERATED ALWAYS AS
(2) virtual;
ERROR:  unrecognized constraint subtype: 4

Reading gram.y, it seems typedef struct Constraint cannot distinguish whether we
are creating a domain or creating a table.
I cannot find a way to error out in gram.y.

So we have to error out in DefineDomain.




Re: Parallel workers stats in pg_stat_database

2024-11-11 Thread Benoit Lobréau

On 11/11/24 02:51, Michael Paquier wrote:

Okidoki, applied.  If tweaks are necessary depending on the feedback,
like column names, let's tackle things as required.  We still have a
good chunk of time for this release cycle.
--
Michael


Thanks !

--
Benoit Lobréau
Consultant
http://dalibo.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-11-11 Thread Nisha Moond
On Wed, Sep 18, 2024 at 3:31 PM shveta malik  wrote:
>
> On Wed, Sep 18, 2024 at 2:49 PM shveta malik  wrote:
> >
> > > > Please find the attached v46 patch having changes for the above review
> > > > comments and your test review comments and Shveta's review comments.
> > > >
>
> When the synced slot is marked as 'inactive_timeout' invalidated on
> hot standby due to invalidation of publisher 's failover slot, the
> former starts showing NULL' inactive_since'. Is this intentional
> behaviour? I feel inactive_since should be non-NULL here too?
> Thoughts?
>
> physical standby:
> postgres=# select slot_name, inactive_since, invalidation_reason,
> failover, synced from pg_replication_slots;
> slot_name  |  inactive_since  |
> invalidation_reason | failover | synced
> -+--+-+--+
> sub2 | 2024-09-18 15:20:04.364998+05:30 |   | t| t
> sub3 | 2024-09-18 15:20:04.364953+05:30 |   | t| t
>
> After sync of invalidation_reason:
>
> slot_name  |  inactive_since  | invalidation_reason |
> failover | synced
> -+--+-+--+
>  sub2 |   | inactive_timeout| t| t
>  sub3 |   | inactive_timeout| t| t
>
>

For synced slots on the standby, inactive_since indicates the last
synchronization time rather than the time the slot became inactive
(see doc - 
https://www.postgresql.org/docs/devel/view-pg-replication-slots.html).

In the reported case above, once a synced slot is invalidated we don't
even keep the last synchronization time for it. This is because when a
synced slot on the standby is marked invalid, inactive_since is reset
to NULL each time the slot-sync worker acquires a lock on it. This
lock acquisition before checking invalidation is done to avoid certain
race conditions and will activate the slot temporarily, resetting
inactive_since. Later, the slot-sync worker updates inactive_since for
all synced slots to the current synchronization time. However, for
invalid slots, this update is skipped, as per the patch’s design.

If we want to preserve the inactive_since value for the invalid synced
slots on standby, we need to clarify the time it should display. Here
are three possible approaches:

1) Copy the primary's inactive_since upon invalidation: When a slot
becomes invalid on the primary, the slot-sync worker could copy the
primary slot’s inactive_since to the standby slot and retain it, by
preventing future updates on the standby.

2) Use the current time of standby when the synced slot is marked
invalid for the first time and do not update it in subsequent sync
cycles if the slot is invalid.

Approach (2) seems more reasonable to me. However, both 1) and 2)
contradict the purpose of inactive_since, as it no longer
represents either the true "last sync time" or the "time the slot became
inactive", because the slot-sync worker acquires locks periodically for
syncing, and keeps activating the slot.

3) Continuously update inactive_since for invalid synced slots as
well: Treat invalid synced slots like valid ones by updating
inactive_since with each sync cycle. This way, we can keep the "last
sync time" in the inactive_since. However, this could confuse users
when "invalidation_reason=inactive_timeout" is set for a synced slot
on standby but inactive_since would reflect sync time rather than the
time slot became inactive. IIUC, on the primary, when
invalidation_reason=inactive_timeout for a slot, the inactive_since
represents the actual time the slot became inactive before getting
invalidated, unless the primary is restarted.

Thoughts?

--
Thanks,
Nisha




Re: Add reject_limit option to file_fdw

2024-11-11 Thread torikoshia

On 2024-11-08 01:44, Fujii Masao wrote:
Thanks for your review!


On 2024/11/05 22:30, torikoshia wrote:

Thanks for the patch! Could you add it to the next CommitFest?


Added an entry for this patch:
https://commitfest.postgresql.org/50/5331/


Thanks!



+ALTER FOREIGN TABLE agg_bad OPTIONS (reject_limit '1');
+SELECT * FROM agg_bad;
+  a  |   b
+-+
+ 100 | 99.097
+  42 | 324.78
+(2 rows)

Wouldn't it be better to include a test where a SELECT query fails, 
even with
on_error set to "ignore," because the number of errors exceeds 
reject_limit?


Agreed.


As for the agg.bad2 file that your latest patch added:

Instead of adding a new bad input file, what do you think about simply 
appending

"1;@bbb@" to the existing agg.bad file and adjusting sql/file_fdw.sql
as follows?
This approach might keep things simpler.


That seems better. Agreed.


---
-\set filename :abs_srcdir '/data/agg.bad2'
-ALTER FOREIGN TABLE agg_bad OPTIONS (SET filename :'filename', ADD
reject_limit '1');
+ALTER FOREIGN TABLE agg_bad OPTIONS (ADD reject_limit '1');
 SELECT * FROM agg_bad;
 ALTER FOREIGN TABLE agg_bad OPTIONS (SET reject_limit '2');
---


+  
+   reject_limit

This entry should be placed right after the on_error option,
following the same order as in the COPY command documentation.


Based on the synopsis of the CREATE/ALTER FOREIGN TABLE commands, 
the value for the foreign table's option must be single-quoted. I’m 
not entirely sure if this is the correct approach, but in order to 
accommodate this, the patch modifies the value of reject_limit 
option to accept not only numeric values but also strings.




I don't have a better approach for this, so I'm okay with your 
solution.

Just one note: it would be helpful to explain and comment
why defGetCopyRejectLimitOption() accepts and parses both int64 and
string values.


Added a comment for this.


Thanks for adding the comment. It clearly states that REJECT_LIMIT can 
be
a single-quoted string. However, it might also be helpful to mention 
that

it can be provided as an int64 in the COPY command option. How about
updating it like this?


REJECT_LIMIT can be specified in two ways: as an int64 for the COPY 
command

option or as a single-quoted string for the foreign table option using
file_fdw. Therefore this function needs to handle both formats.



Thanks! it seems better.


Attached v3 patch.

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.




Re: magical eref alias names

2024-11-11 Thread Andrei Lepikhov

On 11/8/24 20:33, Robert Haas wrote:

On Thu, Nov 7, 2024 at 4:38 PM Tom Lane  wrote:

The trick there is to keep them predictable, because as I mentioned in
my previous response, there may be people depending on knowing what
name will be assigned.  We're working with a ton of history here,
and I'm not really convinced that change will be change for the
better.


Yeah, I don't really want to be the one to break somebody's query that
explicitly references "*VALUES*" or whatever. At least not without a
better reason than I currently have. If this were just a display
artifact I think finding some way to clean it up would be pretty
worthwhile, but I would need a better reason to break working SQL.
Thanks for this topic! Having run into this years ago, I was confused by 
eref and alias fields.
I frequently use eref during debugging. Also, knowing the naming 
convention makes it much easier to resolve issues with only an 
explanation when the user can't provide any other information. I wonder 
if other people who work with EXPLAIN a lot already have some sort of 
habit here.
I agree that the naming convention can float, but please let it be 
stable and predictable.
Also, I'm not sure how other extension developers operate, but in a 
handful of mine, I use the fact that eref always contains a name - the 
relational model requires a name for each (even intermediate) table and 
column, doesn't it?
Also, do not forget that these names can be used in pg_hint_plan hints - 
one more reason to keep them stable.


--
regards, Andrei Lepikhov




Re: Doc: typo in config.sgml

2024-11-11 Thread Yugo Nagata
On Tue, 5 Nov 2024 10:08:17 +0100
Peter Eisentraut  wrote:


> >> So you convert LATIN1 characters to HTML entities so that it's easier
> >> to detect non-LATIN1 characters is in the SGML docs? If my
> >> understanding is correct, it can be also achieved by using some tools
> >> like:
> >>
> >> iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
> >>
> >> If there are some non-LATIN1 characters in release-17.sgml,
> >> it will complain like:
> >>
> >> iconv: illegal input sequence at position 175
> >>
> >> An advantage of this is, we don't need to covert each LATIN1
> >> characters to HTML entities and make the sgml file authors life a
> >> little bit easier.

> I think the iconv approach is an idea worth checking out.
> 
> It's also not necessarily true that the set of characters provided by 
> the built-in PDF fonts is exactly the set of characters in Latin 1.  It 
> appears to be close enough, but I'm not sure, and I haven't found any 
> authoritative information on that.  

I found a description in the FAQ on Apache FOP [1] that explains that some glyphs
for the Latin1 character set are not contained in the standard text fonts.

 The standard text fonts supplied with Acrobat Reader have mostly glyphs for
 characters from the ISO Latin 1 character set. For a variety of reasons, even
 those are not completely guaranteed to work, for example you can't use the fi
 ligature from the standard serif font.

[1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters

However, it seems that using iconv to detect non-Latin1 characters may still be
useful because these are likely not displayed in PDF. For example, we can do
this in make check, as in the attached patch 0002. It cannot show the filename
where such a character is found, though.

> Another approach for a fix would be 
> to get FOP produce the required warnings or errors more reliably.  I 
> know it has a bunch of logging settings (ultimately via log4j), so there 
> might be some possibilities.

When a character that cannot be displayed in PDF is found, a warning
"Glyph ... not available in font ..." is output in fop's log. We can
prevent such characters from being contained in the PDF by checking for
that message, as in the attached patch 0001. However, this is checked after
the PDF is generated, since I could not figure out how to terminate the
generation immediately when such a character is detected.

Regards,
Yugo Nagata

-- 
Yugo Nagata 
>From b6bed0089fa510480dc410969ecff42a55ea7442 Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Mon, 11 Nov 2024 19:45:18 +0900
Subject: [PATCH 2/2] Check non-latin1 characters in make check

---
 doc/src/sgml/Makefile | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index edc3725e5a..39822082c8 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -157,10 +157,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
 
 %.pdf: %.fo $(ALL_IMAGES)
 	$(FOP) -fo $< -pdf $@ 2>&1 | \
-	awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+	awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2  || \
 	(echo "Found characters that cannot be displayed in PDF" 1>&2;  exit 1)
 
-
 ##
 ## EPUB
 ##
@@ -197,7 +196,7 @@ MAKEINFO = makeinfo
 ##
 
 # Quick syntax check without style processing
-check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp
+check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp check-non-latin1
 	$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
 
 
@@ -270,6 +269,11 @@ check-nbsp:
 	  $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
 	(echo "Non-breaking spaces appear in SGML/XML files" 1>&2;  exit 1)
 
+# Non-Latin1 characters cannot be displayed in PDF.
+check-non-latin1:
+	@ (iconv -t ISO-8859-1 -f UTF-8 $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) >/dev/null 2>&1) || \
+	(echo "Non-Latin1 characters appear in SGML/XML files" 1>&2;  exit 1)
+
 ##
 ## Clean
 ##
-- 
2.34.1

>From 7e6a612c15bf65169e31906371218cdf13fcacdb Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Mon, 11 Nov 2024 19:22:02 +0900
Subject: [PATCH 1/2] Disallow characters that cannot be displayed in PDF

---
 doc/src/sgml/Makefile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index a04c532b53..edc3725e5a 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
 	$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
 
 %.pdf: %.fo $(ALL_IMAGES)
-	$(FOP) -fo $< -pdf $@
+	$(FOP) -fo $< -pdf $@ 2>&1 | \
+	awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+	(echo "Found characters that cannot be displayed in PDF" 1>&2;  exit 1)
 
 
 ##
-- 
2.34.1



Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-11 Thread Ashutosh Bapat
On Sat, Nov 9, 2024 at 5:05 PM Tomas Vondra  wrote:
>
> On 11/8/24 15:57, Ashutosh Bapat wrote:
> > ...
> >
> > After examining the code before reading [2], I came to the same
> > conclusion as Masahiko-san in [2]. We install candidate_restart_lsn
> > based on the running transaction record whose LSN is between
> > restart_lsn and confirmed_flush_lsn. Since candidate_restart_valid of
> > such candidates would be lesser than any confirmed_flush_lsn received
> > from downstream. I am surprised that the fix suggested by Masahiko-san
> > didn't work though. The fix also fix the asymmetry, between
> > LogicalIncreaseXminForSlot and LogicalIncreaseRestartDecodingForSlot,
> > that you have pointed out in your next email. What behaviour do you
> > see with that fix applied?
> >
> >
> > [1] https://www.postgresql.org/message-id/Yz2hivgyjS1RfMKs%40depesz.com
> > [2] 
> > https://www.postgresql.org/message-id/CAD21AoBVhYnGBuW_o%3DwEGgTp01qiHNAx1a14b1X9kFXmuBe%3Dsg%40mail.gmail.com
> >
> >
>
> I read that message (and the earlier discussion multiple times) while
> investigating the issue, and TBH it's not very clear to me what the
> conclusion is :-(
>
> There's some discussion about whether the candidate fields should be
> reset on release or not. There are even claims that it might be
> legitimate to not reset the fields and update the restart_lsn. Using
> such "stale" LSN values seems rather suspicious to me, but I don't have
> a proof that it's incorrect. FWIW this unclarity is what I mentioned the
> policy/contract for candidate fields is not explained anywhere.
>
> That being said, I gave that fix a try - see the attached 0001 patch. It
> tweaks LogicalIncreaseRestartDecodingForSlot (it needs a bit more care
> because of the spinlock), and it adds a couple asserts to make sure the
> data.restart_lsn never moves back.
>
> And indeed, with this my stress test script does not crash anymore.

:)

I think the problem is about processing older running-transactions
records and setting data.restart_lsn based on the candidates those
records produce. But what is not clear to me is how a newer
candidate_restart_lsn is available immediately upon WAL sender
restart, i.e. in the sequence of events you mentioned in your first
email:
1.  344.139 LOG:  starting logical decoding for slot "s1"

2. 344.139 DETAIL:  Streaming transactions committing after 1/E97EAC30,
   reading WAL from 1/E96FB4D0.

 3. 344.140 LOG:  logical decoding found consistent point at 1/E96FB4D0

 4. 344.140 DETAIL:  Logical decoding will begin using saved snapshot.

  5. 344.140 LOG:  LogicalConfirmReceivedLocation 1/E9865398

6.  344.140 LOG:  LogicalConfirmReceivedLocation updating
   data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
   candidate_restart_valid 0/0 (from 1/E9865398)
   candidate_restart_lsn 0/0 (from 1/E979D4C8)

how were candidate_restart_lsn = 1/E979D4C8 and candidate_restart_valid
= 1/E9865398 set in the ReplicationSlot after the WAL sender restart? It means
it must have read and processed the running-transactions record at 1/E9865398.
If that's true, how come it went back to a running-transactions WAL
record at 1/E979D4C8? It should be reading WAL records sequentially,
hence it should have read 1/E979D4C8 first and then 1/E9865398.

 344.145 LOG:  LogicalIncreaseRestartDecodingForSlot s1
   candidate_restart_valid_lsn 1/E979D4C8 (0/0)
   candidate_restart_lsn 1/E96FB4D0 (0/0)

-- 
Best Wishes,
Ashutosh Bapat




Re: Add html-serve target to autotools and meson

2024-11-11 Thread Ashutosh Bapat
On Mon, Nov 11, 2024 at 9:55 AM Michael Paquier  wrote:
>
> On Sat, Nov 09, 2024 at 09:26:20AM +0100, Peter Eisentraut wrote:
> > What is the advantage of using an http server over just opening the file
> > directly in a web browser (file:// scheme)?
>
> Not sure to see any on sight.  I just use file:// to check the state
> of the docs produced by any patch sent to the lists, even keeping some
> bookmarks for the bookindex to ease lookups.

I sometimes create separate directories and clones for my work. In
that case, there is no fixed doc URL that I can bookmark. So it would be
handy if meson opened the docs for me with build-specific URLs. I am not
sure if it's worth the code and effort though. I cook up the URL using
the build directory and paste it into the browser. One could start a
browser with the docs opened as well.

-- 
Best Wishes,
Ashutosh Bapat




Re: pgcrypto: better memory management?...

2024-11-11 Thread Peter Eisentraut

On 08.11.24 03:15, Bear Giles wrote:
The first is that openssl has supported custom memory management for a 
very long time - I remember using it in the late 90s. The function is in 
crypto.h


  typedef void *(*CRYPTO_malloc_fn)(size_t num, const char *file, int line);
  typedef void *(*CRYPTO_realloc_fn)(void *addr, size_t num, const char *file,
                                     int line);
  typedef void (*CRYPTO_free_fn)(void *addr, const char *file, int line);
  int CRYPTO_set_mem_functions(CRYPTO_malloc_fn malloc_fn,
                               CRYPTO_realloc_fn realloc_fn,
                               CRYPTO_free_fn free_fn);

The main benefit of this function is that openssl-internal functions 
will use the same memory pool as your application. I know the current 
extension uses palloc/pfree, but I didn't see a call to this function.


The problem with this kind of interface is that in PostgreSQL, the idea 
behind using palloc() is that you usually don't need to take care to 
free the memory, because it is automatically freed at some appropriate 
time (end of row, end of query, end of transaction, etc.).  But an 
external library like openssl doesn't know that.  So it could happen 
that PostgreSQL automatically frees some data structure that openssl 
thinks should still exist.


This kind of problem definitely existed with libxml, which has a very 
similar interface.  So you'd have to do some very careful analysis 
whether this would be safe for openssl.  I suspect it's not worth the 
bother.
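
To make the hazard concrete, the wiring would look roughly like the sketch
below (illustrative only; the callback names and the choice of
TopMemoryContext are my assumptions, not a proposal):

#include "postgres.h"
#include "utils/memutils.h"
#include <openssl/crypto.h>

/* Route OpenSSL's allocations into a context that is never reset. */
static void *
pg_crypto_malloc(size_t num, const char *file, int line)
{
	/*
	 * Note: unlike malloc(), this errors out on OOM instead of returning
	 * NULL, which OpenSSL does not expect.
	 */
	return MemoryContextAlloc(TopMemoryContext, num);
}

static void *
pg_crypto_realloc(void *addr, size_t num, const char *file, int line)
{
	return addr ? repalloc(addr, num)
		: MemoryContextAlloc(TopMemoryContext, num);
}

static void
pg_crypto_free(void *addr, const char *file, int line)
{
	if (addr)
		pfree(addr);
}

/* Must be called before OpenSSL makes its first allocation. */
void
example_init_openssl_memory(void)
{
	CRYPTO_set_mem_functions(pg_crypto_malloc,
							 pg_crypto_realloc,
							 pg_crypto_free);
}

And note that even pinning everything to TopMemoryContext, as above, only
trades the premature-free problem for a leak, since that context is never
reset.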



The second item is that openssl also has these wonderful functions:

int CRYPTO_secure_malloc_init(size_t sz, size_t minsize);
int CRYPTO_secure_malloc_done(void);
void *CRYPTO_secure_malloc(size_t num, const char *file, int line);
void *CRYPTO_secure_zalloc(size_t num, const char *file, int line);
void CRYPTO_secure_free(void *ptr, const char *file, int line);
void CRYPTO_secure_clear_free(void *ptr, size_t num,
                               const char *file, int line);
int CRYPTO_secure_allocated(const void *ptr);
int CRYPTO_secure_malloc_initialized(void);
size_t CRYPTO_secure_actual_size(void *ptr);
size_t CRYPTO_secure_used(void);

void OPENSSL_cleanse(void *ptr, size_t len);

These functions address (at least) two potential vulnerabilities. First, 
they tweak the memory block so it will never be written to the swap file, 
and second, they (IIRC) also tweak the memory block to reduce the access 
permissions further than usual.


PostgreSQL does do this kind of thing already, see explicit_bzero(). 
Maybe we could do more.  This would need a more specific analysis and 
proposal.






Re: meson and check-tests

2024-11-11 Thread Nazir Bilal Yavuz
Hi,

Thanks for the feedback!

On Mon, 11 Nov 2024 at 06:34, jian he  wrote:
>
> hi.
> Actually, there is a difference compared to make.
> make works fine with many whitespace characters, but with meson you can
> only have one.
>
> meson test -C $BUILD7 --num-processes 20 --suite setup --verbose
> export TESTS="test_setup boolean char"
>
> meson test -C $BUILD7 --suite regress --verbose
> --it will fail.
>
> TESTS="copy  jsonb" meson test -C $BUILD7 --suite regress --verbose
> --it will fail too.

Yes, you are right. It is fixed now.

> seems to need rebase, otherwise v4-0001 cannot use `git apply`.

Done.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 0e66e814aa70118b1b6c55f441e9506d2d700900 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 11 Nov 2024 10:35:02 +0300
Subject: [PATCH v5 1/2] Add 'make check-tests' behavior to the meson based
 builds

There was no way to run specific regression tests in the regress/regress
tests in the meson based builds. Add this behavior.

Author: Nazir Bilal Yavuz 
Reviewed-by: Ashutosh Bapat 
Reviewed-by: Jian He 
Discussion: postgr.es/m/CAExHW5tK-QqayUN0%2BN3MF5bjV6vLKDCkRuGwoDJwc7vGjwCygQ%40mail.gmail.com
---
 meson.build| 14 --
 src/tools/testwrap | 10 ++
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/meson.build b/meson.build
index 5b0510cef78..073ec71b30c 100644
--- a/meson.build
+++ b/meson.build
@@ -3394,11 +3394,11 @@ foreach test_dir : tests
 '--dbname', dbname,
   ] + t.get('regress_args', [])
 
-  test_selection = []
-  if t.has_key('schedule')
-test_selection += ['--schedule', t['schedule'],]
-  endif
+  # To avoid '--schedule' option to be treated as a separate argument in
+  # the testwrap script if not quoted correctly.
+  test_schedule = t.get('schedule', [])
 
+  test_selection = []
   if kind == 'isolation'
 test_selection += t.get('specs', [])
   else
@@ -3422,12 +3422,13 @@ foreach test_dir : tests
   testwrap_base,
   '--testgroup', test_group,
   '--testname', kind,
+  '--schedule', test_schedule,
+  '--tests', test_selection,
   '--',
   test_command_base,
   '--outputdir', test_output,
   '--temp-instance', test_output / 'tmp_check',
   '--port', testport.to_string(),
-  test_selection,
 ],
 suite: test_group,
 kwargs: test_kwargs,
@@ -3442,10 +3443,11 @@ foreach test_dir : tests
 testwrap_base,
 '--testgroup', test_group_running,
 '--testname', kind,
+'--schedule', test_schedule,
+'--tests', test_selection,
 '--',
 test_command_base,
 '--outputdir', test_output_running,
-test_selection,
   ],
   is_parallel: t.get('runningcheck-parallel', true),
   suite: test_group_running,
diff --git a/src/tools/testwrap b/src/tools/testwrap
index 8ae8fb79ba7..998006c7361 100755
--- a/src/tools/testwrap
+++ b/src/tools/testwrap
@@ -12,6 +12,8 @@ parser.add_argument('--srcdir', help='source directory of test', type=str)
 parser.add_argument('--basedir', help='base directory of test', type=str)
 parser.add_argument('--testgroup', help='test group', type=str)
 parser.add_argument('--testname', help='test name', type=str)
+parser.add_argument('--schedule', help='schedule tests', nargs='*')
+parser.add_argument('--tests', help='tests', nargs='*')
 parser.add_argument('--skip', help='skip test (with reason)', type=str)
 parser.add_argument('--pg-test-extra', help='extra tests', type=str)
 parser.add_argument('test_command', nargs='*')
@@ -51,6 +53,14 @@ env_dict = {**os.environ,
 if "PG_TEST_EXTRA" not in env_dict and args.pg_test_extra:
 env_dict["PG_TEST_EXTRA"] = args.pg_test_extra
 
+if "TESTS" in env_dict and args.testgroup == 'regress' and args.testname == 'regress':
+args.test_command += env_dict["TESTS"].split()
+else:
+if args.schedule:
+args.test_command += ['--schedule', ' '.join(args.schedule)]
+if args.tests:
+args.test_command.extend(args.tests)
+
 sp = subprocess.Popen(args.test_command, env=env_dict, stdout=subprocess.PIPE)
 # Meson categorizes a passing TODO test point as bad
 # (https://github.com/mesonbuild/meson/issues/13183).  Remove the TODO
-- 
2.45.2

From eede2e0862e19a34fde1576d842b848b96c95c1f Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 26 Sep 2024 10:24:52 +0300
Subject: [PATCH v5 2/2] Expand test selection behavior to all test types in
 meson based builds

Previously, the ability to select specific tests to run was limited to
regress/regress tests. This commit extends that functionality to all test
types in the meson based builds.

Author: Nazir Bilal Yavuz 
Reviewed-by: Ashutosh Bapat 
Reviewed-by: Jian He 
Discussion: postgr.es/m/CAExHW5tK-QqayUN0%2BN3MF5bjV6vLKDCkRuGwoDJwc7vGjwCygQ%40mail.gmail.com
---
 src/tools/test

Re: Commit Timestamp and LSN Inversion issue

2024-11-11 Thread Amit Kapila
On Fri, Nov 8, 2024 at 8:23 PM Andres Freund  wrote:
>
> On 2024-11-08 09:08:55 +, Zhijie Hou (Fujitsu) wrote:
> > >
> > > I think it's *completely* unacceptable to call a hook inside
> > > ReserveXLogInsertLocation, with the spinlock held, no less. That's the 
> > > most
> > > contended section of code in postgres.
> >
> > I understand your concern and appreciate the feedback. I've made some
> > adjustments to the patch by directly placing the code to adjust the commit
> > timestamp within the spinlock, aiming to keep it as efficient as possible. 
> > The
> > changes have resulted in just a few extra lines. Would this approach be
> > acceptable to you ?
>
> No, not at all. I think it's basically not acceptable to add any nontrivial
> instruction to the locked portion of ReserveXLogInsertLocation().
>
> Before long we're going to have to replace that spinlocked instruction with
> atomics, which will make it impossible to just block while holding the lock
> anyway.
>
> IMO nothing that touches core xlog insert infrastructure is acceptable for
> this.
>

I can't think of a solution other than the current proposal where we
do both operations (reserve WAL space for the commit and adjust the
commit_timestamp, if required) together to resolve the LSN<->Timestamp
inversion issue. Please let us know if you have any other ideas for
solving this. Even if we want to change ReserveXLogInsertLocation to
make the locked portion an atomic operation, we can still do it for
records other than commit. Now, if we don't have any other solution
for the LSN<->Timestamp inversion issue, changing the entire locked
portion to atomics will close the door to solving this problem forever
(unless we find some other way to solve it), which could make it difficult
to rely on commit_timestamps for certain scenarios.

-- 
With Regards,
Amit Kapila.




Re: Proposal to add page headers to SLRU pages

2024-11-11 Thread Li, Yong


> On Nov 10, 2024, at 01:32, Rishu Bagga  wrote:
>
>> Thanks for pointing this out. Here is what I have tried:
>> 1. Manually build and install PostgreSQL from the latest source code.
>> 2. Following the instructions from src/bin/pg_upgrade to manually dump the 
>> regression
>> database.
>> 3. Apply the patch to the latest code, and build from the source.
>> 4. Run “make check” by following the instructions from src/bin/pg_upgrade 
>> and setting up
>> the olddump and oldinstall to point to the “old” installation used in step 2.
>
>> All tests pass.
>
> Hi all,
>
> Following up on this. What remaining work do we need to do to get this in?
>
> Thanks,
>
> Rishu Bagga
>

Thanks for taking an interest in this proposal. There is no remaining work for 
this proposal; it’s now “waiting for review”. It would be great if you could 
provide a review report, so we can change the status to “ready for commit”.

I’ve updated the patch against the latest HEAD.


Yong





v7-0001-SLRU-header.patch
Description: v7-0001-SLRU-header.patch


Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes

2024-11-11 Thread Guillaume Lelarge
Le lun. 11 nov. 2024 à 03:05, Michael Paquier  a
écrit :

> On Tue, Oct 08, 2024 at 06:24:54PM +0300, Alena Rybakina wrote:
> > yes, I agree with you. Even when I experimented with vacuum settings for
> > database and used my vacuum statistics patch [0] for analyzes , I first
> > looked at this change in the number of blocks or deleted rows at the
> > database level,
> > and only then did an analysis of each table and index.
> >
> > [0] https://commitfest.postgresql.org/50/5012/
>
> As hinted on other related threads like around [1], I am so-so about
> the proposal of these numbers at table and index level now that we
> have e7a9496de906 and 5d4298e75f25.
>
> In such cases, I apply the concept that I call the "Mention Bien" (or
> when you get a baccalaureat diploma with honors and with a 14~16/20 in
> France).  What we have is not perfect, still it's good enough to get
> a 14/20 IMO, making hopefully 70~80% of users happy with these new
> metrics.  Perhaps I'm wrong, but I'd be curious to know if this
> thread's proposal is required at all at the end.
>
>
I agree with you. We'll see if we need more, but it's already good to have
the metrics that were committed.


> I have not looked at the logging proposal yet.
>
>
I hope you'll have time to look at it. It seems to me very important to get
that kind of info in the logs.

Thanks again.


> [1]: https://www.postgresql.org/message-id/zywxw7vqplbfv...@paquier.xyz
> --
> Michael
>


-- 
Guillaume.


Re: warn if GUC set to an invalid shared library

2024-11-11 Thread Heikki Linnakangas
I had a quick glance at this, and I agree with Robert's comment earlier 
that this requires a lot of context to understand.


Does this change the behavior of what is accepted or not? What are those 
cases? Can you give some examples, please?


When do you get the new warnings? Examples please.

Does this require documentation changes?

On 22/07/2024 19:28, Justin Pryzby wrote:

And added a preparatory patch to distinguish ALTER USER/DATABASE SET
from SET in a function, to avoid warning in that case.


The comment above enum GucSource needs updating.

PGC_S_TEST_FUNCTION is missing from GucSource_Names (there's a reminder 
of this in the comment above enum GucSource). Are there other places 
where PGC_S_TEST_FUNCTION is missing?


Do extensions need any changes for this? Consider renaming PGC_S_TEST 
too, to intentionally break any code that might now need to also check 
for PGC_S_TEST_FUNCTION.


Or perhaps there's a better way to distinguish between ALTER 
DATABASE/ROLE and CREATE FUNCTION settings. We already have different 
GucSource codes for per-database and per-user settings; I wonder if it 
would be better to use those codes and a separate "is_testing" flag. 
That would also naturally allow the combination of "source == PGC_S_FILE 
&& is_testing==true", if some settings would need different checks in 
ALTER SYSTEM for example.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Separate memory contexts for relcache and catcache

2024-11-11 Thread Ashutosh Bapat
On Sat, Nov 2, 2024 at 3:17 AM Jeff Davis  wrote:
>
> On Fri, 2024-11-01 at 15:19 -0400, Andres Freund wrote:
> > I'm a bit worried about the increase in "wasted" memory we might end
> > up when
> > creating one aset for *everything*. Just splitting out Relcache and
> > CatCache
> > isn't a big deal from that angle, they're always used reasonably
> > much. But
> > creating a bunch of barely used contexts does have the potential for
> > lots of
> > memory being wasted at the end of a page and on freelists.  It might
> > be ok as
> > far was what you proposed in the above email, I haven't analyzed that
> > in depth
> > yet.
>
> Melih raised similar concerns. The new contexts that my patch created
> were CatCacheContext, RelCacheContext, SPICacheContext,
> PgOutputContext, PlanCacheContext, TextSearchCacheContext, and
> TypCacheContext.
>
> Those are all created lazily, so you need to at least be using the
> relevant feature before it has any cost (with the exception of the
> first two).
>
> > > I agree with others that we should look at changing the initial
> > > size or
> > > type of the contexts, but that should be a separate commit.
> >
> > It needs to be done close together though, otherwise we'll increase
> > the
> > new-connection-memory-usage of postgres measurably.
>
> I don't have a strong opinion here; that was a passing comment. But I'm
> curious: why it would increase the per-connection memory usage much to
> just have a couple new memory contexts?


Without patch
First backend
SELECT count(*), pg_size_pretty(sum(total_bytes)) as total_bytes,
sum(total_nblocks) as total_nblocks, pg_size_pretty(sum(free_bytes))
free_bytes, sum(free_chunks) as free_chunks,
pg_size_pretty(sum(used_bytes)) used_bytes from
pg_get_backend_memory_contexts();
 count | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes
---+-+---++-+
   121 | 1917 kB |   208 | 716 kB | 128 | 1201 kB
(1 row)

Second backend
SELECT count(*), pg_size_pretty(sum(total_bytes)) as total_bytes,
sum(total_nblocks) as total_nblocks, pg_size_pretty(sum(free_bytes))
free_bytes, sum(free_chunks) as free_chunks,
pg_size_pretty(sum(used_bytes)) used_bytes from
pg_get_backend_memory_contexts();
 count | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes
---+-+---++-+
   121 | 1408 kB |   210 | 384 kB | 186 | 1024 kB
(1 row)

With both patches from Melih applied
First backend
SELECT count(*), pg_size_pretty(sum(total_bytes)) as total_bytes,
sum(total_nblocks) as total_nblocks, pg_size_pretty(sum(free_bytes))
free_bytes, sum(free_chunks) as free_chunks,
pg_size_pretty(sum(used_bytes)) used_bytes from
pg_get_backend_memory_contexts();
 count | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes
---+-+---++-+
   124 | 1670 kB |   207 | 467 kB | 128 | 1203 kB
(1 row)

Second backend
SELECT count(*), pg_size_pretty(sum(total_bytes)) as total_bytes,
sum(total_nblocks) as total_nblocks, pg_size_pretty(sum(free_bytes))
free_bytes, sum(free_chunks) as free_chunks,
pg_size_pretty(sum(used_bytes)) used_bytes from
pg_get_backend_memory_contexts();
 count | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes
---+-+---++-+
   124 | 1417 kB |   209 | 391 kB | 187 | 1026 kB
(1 row)

So it looks like the patches do reduce memory allocated at the start
of a backend. That is better as far as the conditions just after the
backend start are concerned.

The chunks of memory allocated in a given context will more likely
have similar sizes, since they will be allocated for the same types of
objects, as compared to one big context where chunks are allocated for
many different kinds of objects. I believe this will lead to better
utilization of the freelists.
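
For concreteness, the kind of per-cache context being discussed is roughly the
following (an illustrative sketch, not the actual patch; the name and the
parent context are assumptions):

#include "postgres.h"
#include "utils/memutils.h"

static MemoryContext CatCacheContext = NULL;

/*
 * Create the cache's own context lazily, on first use, with small initial
 * block sizes so that a barely-used context wastes little memory.
 */
static MemoryContext
example_get_catcache_context(void)
{
	if (CatCacheContext == NULL)
		CatCacheContext = AllocSetContextCreate(TopMemoryContext,
												"CatCacheContext",
												ALLOCSET_START_SMALL_SIZES);
	return CatCacheContext;
}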

-- 
Best Wishes,
Ashutosh Bapat




Re: index prefetching

2024-11-11 Thread Robert Haas
On Mon, Nov 11, 2024 at 1:03 PM Peter Geoghegan  wrote:
> I almost think of "pin held" and "buffer lock held" as synonymous when
> working on the nbtree code, even though you have this one obscure page
> deletion case where that isn't quite true (plus the TID recycle safety
> business imposed by heapam). As far as protecting the structure of the
> index itself is concerned, holding on to buffer pins alone does not
> matter at all.

That makes sense from the point of view of working with the btree code
itself, but from a system-wide perspective, it's weird to pretend like
the pins don't exist or don't matter just because a buffer lock is
also held. I had actually forgotten that the btree code tends to
pin+lock together; now that you mention it, I remember that I knew it
at one point, but it fell out of my head a long time ago...

> I think that this is exactly what I propose to do, said in a different
> way. (Again, I wouldn't have expressed it in this way because it seems
> obvious to me that buffer pins don't have nearly the same significance
> to an index AM as they do to heapam -- they have no value in
> protecting the index structure, or helping an index scan to reason
> about concurrency that isn't due to a heapam issue.)
>
> Does that make sense?

Yeah, it just really throws me for a loop that you're using "pin" to
mean "pin at a time when we don't also hold a lock." The fundamental
purpose of a pin is to prevent a buffer from being evicted while
someone is in the middle of looking at it, and nothing that uses
buffers can possibly work correctly without that guarantee. Everything
you've written in parentheses there is, AFAICT, 100% wrong if you mean
"any pin" and 100% correct if you mean "a pin held without a
corresponding lock."

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Fix for pageinspect bug in PG 17

2024-11-11 Thread Tomas Vondra
Hi,

Here's a fix for a pageinspect bug in PG 17, reported in [1]. The bug turns
out to have been introduced by my commit

commit dae761a87edae444d11a411f711f1d679bed5941
Author: Tomas Vondra 
Date:   Fri Dec 8 17:07:30 2023 +0100

Add empty BRIN ranges during CREATE INDEX

...

This adds an out argument to brin_page_items, but I failed to consider
that the user may still run with an older version of the extension - either
after pg_upgrade (as in the report), or when the CREATE EXTENSION
command specifies VERSION.

The new "empty" field is added in the middle of the output tuple, which
shifts the values, causing a segfault when accessing a text field at the
end of the array.

Of course, we add arguments to existing functions pretty often, and we
know how to do that in a backwards-compatible way - pg_stat_statements is
a good example of how to do that nicely. Unfortunately, it's too late to
do that for brin_page_items :-( There may already be upgraded systems, or
systems with pageinspect installed, etc.

The only fix I can think of is explicitly looking at TupleDesc->natts,
as in the attached fix.


regards

-- 
Tomas Vondra
diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c
index 22621d584fa..6086b02445b 100644
--- a/contrib/pageinspect/brinfuncs.c
+++ b/contrib/pageinspect/brinfuncs.c
@@ -240,25 +240,29 @@ brin_page_items(PG_FUNCTION_ARGS)
 		else
 		{
 			int			att = attno - 1;
+			int			idx = 0;
 
-			values[0] = UInt16GetDatum(offset);
+			values[idx++] = UInt16GetDatum(offset);
 			switch (TupleDescAttr(rsinfo->setDesc, 1)->atttypid)
 			{
 case INT8OID:
-	values[1] = Int64GetDatum((int64) dtup->bt_blkno);
+	values[idx++] = Int64GetDatum((int64) dtup->bt_blkno);
 	break;
 case INT4OID:
 	/* support for old extension version */
-	values[1] = UInt32GetDatum(dtup->bt_blkno);
+	values[idx++] = UInt32GetDatum(dtup->bt_blkno);
 	break;
 default:
 	elog(ERROR, "incorrect output types");
 			}
-			values[2] = UInt16GetDatum(attno);
-			values[3] = BoolGetDatum(dtup->bt_columns[att].bv_allnulls);
-			values[4] = BoolGetDatum(dtup->bt_columns[att].bv_hasnulls);
-			values[5] = BoolGetDatum(dtup->bt_placeholder);
-			values[6] = BoolGetDatum(dtup->bt_empty_range);
+			values[idx++] = UInt16GetDatum(attno);
+			values[idx++] = BoolGetDatum(dtup->bt_columns[att].bv_allnulls);
+			values[idx++] = BoolGetDatum(dtup->bt_columns[att].bv_hasnulls);
+			values[idx++] = BoolGetDatum(dtup->bt_placeholder);
+
+			if (rsinfo->setDesc->natts >= 8)
+values[idx++] = BoolGetDatum(dtup->bt_empty_range);
+
 			if (!dtup->bt_columns[att].bv_allnulls)
 			{
 BrinValues *bvalues = &dtup->bt_columns[att];
@@ -284,13 +288,15 @@ brin_page_items(PG_FUNCTION_ARGS)
 }
 appendStringInfoChar(&s, '}');
 
-values[7] = CStringGetTextDatum(s.data);
+values[idx++] = CStringGetTextDatum(s.data);
 pfree(s.data);
 			}
 			else
 			{
-nulls[7] = true;
+nulls[idx++] = true;
 			}
+
+			Assert(idx == rsinfo->setDesc->natts);
 		}
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);


Re: index prefetching

2024-11-11 Thread Peter Geoghegan
On Mon, Nov 11, 2024 at 1:33 PM Robert Haas  wrote:
> That makes sense from the point of view of working with the btree code
> itself, but from a system-wide perspective, it's weird to pretend like
> the pins don't exist or don't matter just because a buffer lock is
> also held.

I can see how that could cause confusion. If you're working on nbtree
all day long, it becomes natural, though. Both points are true, and
relevant to the discussion.

I prefer to over-communicate when discussing these points -- it's too
easy to talk past each other here. I think that the precise reasons
why the index AM does things with buffer pins will need to be put on a
more rigorous and formalized footing with Tomas' patch. The different
requirements/safety considerations will have to be carefully teased
apart.

> I had actually forgotten that the btree code tends to
> pin+lock together; now that you mention it, I remember that I knew it
> at one point, but it fell out of my head a long time ago...

The same thing appears to mostly be true of hash, which mostly uses
_hash_getbuf + _hash_relbuf (hash's idiosyncratic use of cleanup locks
notwithstanding).

To be fair it does look like GiST's gistdoinsert function holds onto
multiple buffer pins at a time, for its own reasons -- index AM
reasons. But this looks to be more or less an optimization to deal
with navigating the tree with a loose index order, where multiple
descents and ascents are absolutely expected. (This makes it a bit
like the nbtree "drop lock but not pin" case that I mentioned in my
last email.)

It's not as if these gistdoinsert buffer pins persist across calls to
amgettuple, though, so for the purposes of this discussion about the
new batch API to replace amgettuple they are not relevant -- they
don't actually undermine my point. (Though to be fair their existence
does help to explain why you found my characterization of buffer pins
as irrelevant to index AMs confusing.)

The real sign that what I said is generally true of index AMs is that
you'll see so few calls to
LockBufferForCleanup/ConditionalLockBufferForCleanup. Only hash calls
ConditionalLockBufferForCleanup at all (which I find a bit weird).
Both GiST and SP-GiST call neither functions -- even during VACUUM. So
GiST and SP-GiST make clear that index AMs (that support only MVCC
snapshot scans) can easily get by without any use of cleanup locks
(and with no externally significant use of buffer pins).

> > I think that this is exactly what I propose to do, said in a different
> > way. (Again, I wouldn't have expressed it in this way because it seems
> > obvious to me that buffer pins don't have nearly the same significance
> > to an index AM as they do to heapam -- they have no value in
> > protecting the index structure, or helping an index scan to reason
> > about concurrency that isn't due to a heapam issue.)
> >
> > Does that make sense?
>
> Yeah, it just really throws me for a loop that you're using "pin" to
> mean "pin at a time when we don't also hold a lock."

I'll try to be more careful about that in the future, then.

> The fundamental
> purpose of a pin is to prevent a buffer from being evicted while
> someone is in the middle of looking at it, and nothing that uses
> buffers can possibly work correctly without that guarantee. Everything
> you've written in parentheses there is, AFAICT, 100% wrong if you mean
> "any pin" and 100% correct if you mean "a pin held without a
> corresponding lock."

I agree.

-- 
Peter Geoghegan




Re: Offsets of `struct Port` are no longer constant

2024-11-11 Thread Heikki Linnakangas

On 11/11/2024 18:34, Jacob Champion wrote:

Hi all,

A comment at the end of the Port struct says

 /*
  * OpenSSL structures. (Keep these last so that the locations of other
  * fields are the same whether or not you build with SSL enabled.)
  */

but as part of the direct-SSL changes in d39a49c1e45 (cc'd Heikki),
some additional fields snuck in after it.

I assume it's too late to change this for 17, but should this be
addressed in HEAD? I've attached a draft patch (untested, beware)
which should no longer rely on field order.


Oops. Fortunately this is not likely to cause trouble in practice; I 
can't imagine an extension peeking into 'raw_buf' and friends. But yeah, 
we should do something at least on HEAD. +1 on your patch; I'll commit 
that unless someone has better ideas.


On REL_17_STABLE, we should probably adjust the comment to warn that 
'raw_buf' and friends can move depending on USE_OPENSSL.
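
For anyone wondering why the field order matters at all, it is just C struct
layout: any field placed after a conditionally-compiled member gets a
different offset depending on the build options, roughly as in this toy
example (not the real Port definition):

/* Toy illustration only -- not the actual struct Port. */
typedef struct ExamplePort
{
	int			sock;
	char	   *remote_host;
#ifdef USE_OPENSSL
	void	   *ssl;			/* present only in SSL-enabled builds */
#endif
	char	   *raw_buf;		/* offset now differs between builds */
} ExamplePort;

An extension compiled against a non-SSL build and loaded into an SSL-enabled
server (or vice versa) would then read raw_buf at the wrong offset, which is
exactly what the "keep these last" comment was trying to prevent.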


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Offsets of `struct Port` are no longer constant

2024-11-11 Thread Jacob Champion
On Mon, Nov 11, 2024 at 11:13 AM Heikki Linnakangas  wrote:
> On REL_17_STABLE, we should probably adjust the comment to warn that
> 'raw_buf' and friends can move depending on USE_OPENSSL.

Yeah, makes sense.

--Jacob




Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes

2024-11-11 Thread Robert Haas
On Sun, Nov 10, 2024 at 9:05 PM Michael Paquier  wrote:
> As hinted on other related threads like around [1], I am so-so about
> the proposal of these numbers at table and index level now that we
> have e7a9496de906 and 5d4298e75f25.

I think the question to which we don't have a clear answer is: for
what purpose would you want to be able to distinguish parallel and
non-parallel scans on a per-table basis?

I think it's fairly clear why the existing counters exist at a table
level. If an index isn't getting very many index scans, perhaps it's
useless -- or worse than useless if it interferes with HOT -- and
should be dropped. On the other hand if a table is getting a lot of
sequential scans even though it happens to be quite large, perhaps new
indexes are needed. Or if the indexes that we expect to get used are
not the same as those actually getting used, perhaps we want to add or
drop indexes or adjust the queries.

But it is unclear to me what sort of tuning we would do based on
knowing how many of the scans on a certain table or a certain index
were parallel vs non-parallel. I have not fully reviewed the threads
linked in the original post; but I did look at them briefly and did
not immediately see discussion of the specific counters proposed here.
I also don't see anything in this thread that clearly explains why we
should want this exact thing. I don't want to make it sound like I
know that this is useless; I'm sure that Guillaume probably has lots
of hands-on tuning experience with this stuff that I lack. But the
reasons aren't clearly spelled out as far as I can see, and I'm having
some trouble imagining what they are.

Compare the parallel worker draught stuff. It's really clear how that
is intended to be used. If we're routinely failing to launch workers,
then either max_parallel_workers_per_gather is too high or
max_parallel_workers is too low. Now, I will admit that I have a few
doubts about whether that feature will get much real-world use but it
seems hard to doubt that it HAS a use. In this case, that seems less
clear.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PG17 failing tests (DST related?)

2024-11-11 Thread Dave Page
On Mon, 11 Nov 2024 at 15:35, Tom Lane  wrote:

> Dave Page  writes:
> > For about a week, the Windows builds I've been running in Github have
> been
> > failing for PG17:
>
> Should be fixed as of a35801915 ?
>

Ah, yes - it does appear to be fine with head. I guess I'll have to put up
with the error messages until 17.1!

Thanks.

-- 
Dave Page
pgAdmin: https://www.pgadmin.org
PostgreSQL: https://www.postgresql.org


Re: Skip collecting decoded changes of already-aborted transactions

2024-11-11 Thread Masahiko Sawada
On Sun, Nov 10, 2024 at 11:24 PM Peter Smith  wrote:
>
> Hi Sawada-San, here are some review comments for the patch v5-0001.
>

Thank you for reviewing the patch!

> ==
> Commit message.
>
> 1.
> This commit introduces an additional check to determine if a
> transaction is already aborted by a CLOG lookup, so the logical
> decoding skips further change also when it doesn't touch system
> catalogs.
>
> ~
>
> Is that wording backwards? Is it meant to say:
>
> This commit introduces an additional CLOG lookup check to determine if
> a transaction is already aborted, so the ...

Fixed.

>
> ==
> contrib/test_decoding/sql/stats.sql
>
> 2
> +SELECT slot_name, spill_txns = 0 AS spill_txn, spill_count = 0 AS
> spill_count FROM pg_stat_replication_slots WHERE slot_name =
> 'regression_slot_stats4_twophase';
>
> Why do the SELECT "= 0" like this, instead of just having zeros in the
> "expected" results?

Indeed. I used "=0" like other queries in the same file do, but it
makes sense to me just to have zeros in the expected file. That way,
it would make it a bit easier to investigate in case of failures.

>
> ==
> .../replication/logical/reorderbuffer.c
>
> 3.
>  static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN 
> *txn,
> - bool txn_prepared);
> + bool txn_prepared, bool mark_streamed);
>
> That last parameter name ('mark_streamed') does not match the same
> parameter name in this function's definition.

Fixed.

>
> ~~~
>
> ReorderBufferTruncateTXN:
>
> 4.
> if (txn_streaming && (!txn_prepared) &&
>   (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
>   txn->txn_flags |= RBTXN_IS_STREAMED;
>
> if (txn_prepared)
> {
> ~
>
> Since the following condition was already "if (txn_prepared)" would it
> be better remove the "(!txn_prepared)" here and instead just refactor
> the code like:
>
> if (txn_prepared)
> {
>   ...
> }
> else if (txn_streaming && (rbtxn_is_toptxn(txn) || (txn->nentries_mem != 0)))
> {
>   ...
> }

Good idea.

>
> ~~~
>
> ReorderBufferProcessTXN:
>
> 5.
> +
> + /* Remember the transaction is aborted */
> + Assert((curtxn->txn_flags & RBTXN_IS_COMMITTED) == 0);
> + curtxn->txn_flags |= RBTXN_IS_ABORTED;
>
> Missing period on comment.

Fixed.

>
> ~~~
>
> ReorderBufferCheckTXNAbort:
>
> 6.
> + * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
> + * check the transaction status, so the caller always processes this
> + * transaction. This is to disable this check for regression tests.
> + */
> +static bool
> +ReorderBufferCheckTXNAbort(ReorderBuffer *rb, ReorderBufferTXN *txn)
> +{
> + /*
> + * If GUC 'debug_logical_replication_streaming' is "immediate", we don't
> + * check the transaction status, so the caller always processes this
> + * transaction.
> + */
> + if (unlikely(debug_logical_replication_streaming ==
> DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE))
> + return false;
> +
>
> The wording of the sentence "This is to disable..." seemed a bit
> confusing. Maybe this area can be simplified by doing the following.
>
> 6a.
> Change the function comment to say more like below:
>
> When the GUC 'debug_logical_replication_streaming' is set to
> "immediate", we don't check the transaction status, meaning the caller
> will always process this transaction. This mode is used by regression
> tests to avoid unnecessary transaction status checking.
>
> ~
>
> 6b.
> It is not necessary for this 2nd comment to repeat everything that was
> already said in the function comment. A simpler comment here might be
> all you need:
>
> SUGGESTION:
> Quick return for regression tests.

Agreed with the above two comments. Fixed.

>
> ~~~
>
> 7.
> Is it worth mentioning about this skipping of the transaction status
> check in the docs for this GUC? [1]

If we want to mention this optimization in the docs, we have to
explain how the optimization works too. I think it's too detailed.

I've attached the updated patch.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v6-0001-Skip-logical-decoding-of-already-aborted-transact.patch
Description: Binary data


Re: index prefetching

2024-11-11 Thread Peter Geoghegan
On Mon, Nov 11, 2024 at 12:23 PM Robert Haas  wrote:
> > I think that holding onto pins and whatnot has almost nothing to do
> > with the index AM as such -- it's about protecting against unsafe
> > concurrent TID recycling, which is a table AM/heap issue. You can make
> > a rather weak argument that the index AM needs it for _bt_killitems,
> > but that seems very secondary to me (if you go back long enough there
> > are no _bt_killitems, but the pin thing itself still existed).
>
> Much of this discussion is going over my head, but I have a comment on
> this part. I suppose that when any code in the system takes a pin on a
> buffer page, the initial concern is almost always to keep the page
> from disappearing out from under it.

That almost never comes up in index AM code, though -- cases where you
simply want to avoid having an index page evicted do exist, but are
naturally very rare. I think that nbtree only does this during page
deletion by VACUUM, since it works out to be slightly more convenient
to hold onto just the pin at one point where we quickly drop and
reacquire the lock. Index AMs find very little use for pins that don't
naturally coexist with buffer locks. And even the supposed exception
that happens for page deletion could easily be replaced by just
dropping the pin and the lock (there'd just be no point in it).

I almost think of "pin held" and "buffer lock held" as synonymous when
working on the nbtree code, even though you have this one obscure page
deletion case where that isn't quite true (plus the TID recycle safety
business imposed by heapam). As far as protecting the structure of the
index itself is concerned, holding on to buffer pins alone does not
matter at all.

I have a vague recollection of hash doing something novel with cleanup
locks, but I also seem to recall that that had problems -- I think
that we got rid of it not too long back. In any case my mental model
is that cleanup locks are for the benefit of heapam, never for the
benefit of index AMs themselves. This is why we require cleanup locks
for nbtree VACUUM but not nbtree page deletion, even though both
operations perform precisely the same kinds of page-level
modifications to the index leaf page.

> There might be a few exceptions,
> but hopefully not many. So I suppose what is happening here is that
> index AM pins an index page so that it can read that page -- and then
> it defers releasing the pin because of some interlocking concern. So
> at any given moment, there's some set of pins (possibly empty) that
> the index AM is holding for its own purposes, and some other set of
> pins (also possibly empty) that the index AM no longer requires for
> its own purposes but which are still required for heap/index
> interlocking.

That summary is correct, but FWIW I find the emphasis on index pins
slightly odd from an index AM point of view.

The nbtree code virtually always calls _bt_getbuf and _bt_relbuf, as
opposed to independently acquiring pins and locks -- that's why "lock"
and "pin" seem almost synonymous to me in nbtree contexts. Clearly no
index AM should hold onto a buffer lock for more than an instant, so
my natural instinct is to wonder why you're even talking about buffer
pins or buffer locks that the index AM cares about directly.
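
(To make that concrete for readers who don't spend much time in nbtree: the
usual access pattern is roughly the sketch below -- simplified, not the actual
_bt_getbuf/_bt_relbuf code.)

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Simplified sketch: in nbtree the pin and the content lock are taken and
 * released together, which is why they feel synonymous in that code.
 */
static void
example_examine_index_page(Relation rel, BlockNumber blkno)
{
	Buffer		buf;

	buf = ReadBuffer(rel, blkno);		/* acquires the pin */
	LockBuffer(buf, BUFFER_LOCK_SHARE);	/* acquires the content lock */

	/* ... examine the page while both are held ... */

	UnlockReleaseBuffer(buf);			/* drops lock and pin together */
}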

As I said to Tomas, yeah, the index AM kinda sometimes needs to hold
onto a leaf page pin to be able to correctly perform _bt_killitems.
But this is only because it needs to reason about concurrent TID
recycling. So this is also not really any kind of exception.
(_bt_killitems is even prepared to reason about cases where no pin was
held at all, and has been since commit 2ed5b87f96.)

> The second set of pins could possibly be managed in some
> AM-agnostic way. The AM could communicate that after the heap is done
> with X set of TIDs, it can unpin Y set of pages. But the first set of
> pins are of direct and immediate concern to the AM.
>
> Or at least, so it seems to me. Am I confused?

I think that this is exactly what I propose to do, said in a different
way. (Again, I wouldn't have expressed it in this way because it seems
obvious to me that buffer pins don't have nearly the same significance
to an index AM as they do to heapam -- they have no value in
protecting the index structure, or helping an index scan to reason
about concurrency that isn't due to a heapam issue.)

Does that make sense?

-- 
Peter Geoghegan




Re: [PATCH] Add sortsupport for range types and btree_gist

2024-11-11 Thread Andrey M. Borodin



> On 11 Nov 2024, at 21:41, Bernd Helmle  wrote:
> 
> Updated and rebased patches attached.

Hi Bernd!

Some nitpicking:
0. postgres % git apply ~/Downloads/v7.3-Add-GIST-sortsupport-*
/Users/x4mmm/Downloads/v7.3-Add-GIST-sortsupport-btree-gist.patch:19: space 
before tab in indent.
oid oid_buffering timestamp timestamp_buffering timestamptz \
/Users/x4mmm/Downloads/v7.3-Add-GIST-sortsupport-btree-gist.patch:20: space 
before tab in indent.
timestamptz_buffering time time_buffering timetz timetz_buffering date \
/Users/x4mmm/Downloads/v7.3-Add-GIST-sortsupport-btree-gist.patch:21: space 
before tab in indent.
date_buffering interval interval_buffering macaddr macaddr_buffering \
/Users/x4mmm/Downloads/v7.3-Add-GIST-sortsupport-btree-gist.patch:22: space 
before tab in indent.
macaddr8 macaddr8_buffering inet inet_buffering cidr cidr_buffering text \
/Users/x4mmm/Downloads/v7.3-Add-GIST-sortsupport-btree-gist.patch:23: space 
before tab in indent.
text_buffering varchar varchar_buffering char char_buffering \
warning: 5 lines add whitespace errors.

1. I'm mostly seeing patches formatted with `git format-patch` rather than 
plain diffs as patches...

2. Changes in contrib/btree_gist/btree_gist.c seem unnecessary.

3. Why do you move "typedef struct int32key" to btree_gist.h, but do not need 
to move all other keys?

4. These ifdefs are not needed, just do INJECTION_POINT():
#ifdef USE_INJECTION_POINTS
INJECTION_POINT("btree-gist-sorted-build");
#endif

Also, there are 61 occurrences of this code. Why not just 1 in 
gist_indexsortbuild()?

Thanks!


Best regards, Andrey Borodin.






PG17 failing tests (DST related?)

2024-11-11 Thread Dave Page
For about a week, the Windows builds I've been running in Github have been
failing for PG17:
https://github.com/dpage/winpgbuild/actions/workflows/postgresql.yml. The
consistently failing tests are:

postgresql:recovery / recovery/027_stream_regress
postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
postgresql:regress / regress/regress

which all appear to have the same failure in jsonb_jsonpath:

=
diff --strip-trailing-cr -U3
D:/a/winpgbuild/winpgbuild/postgresql-17.0/src/test/regress/expected/jsonb_jsonpath.out
D:/a/winpgbuild/winpgbuild/postgresql-17.0/build/testrun/regress/regress/results/jsonb_jsonpath.out
---
D:/a/winpgbuild/winpgbuild/postgresql-17.0/src/test/regress/expected/jsonb_jsonpath.out
2024-09-23 20:02:53.0 +
+++
D:/a/winpgbuild/winpgbuild/postgresql-17.0/build/testrun/regress/regress/results/jsonb_jsonpath.out
2024-11-11 04:16:42.243042300 +
@@ -2637,7 +2637,7 @@
 select jsonb_path_query_tz('"12:34:56"', '$.time_tz().string()');
  jsonb_path_query_tz
 -
- "12:34:56-07:00"
+ "12:34:56-08:00"
 (1 row)

 select jsonb_path_query('"12:34:56"', '$.time().string()');
=

It seems likely this is related to the DST change in the US, as the issue
started occurring on 4th November.

-- 
Dave Page
pgAdmin: https://www.pgadmin.org
PostgreSQL: https://www.postgresql.org


Re: Commit Timestamp and LSN Inversion issue

2024-11-11 Thread Tomas Vondra
On 11/11/24 09:19, Amit Kapila wrote:
> On Fri, Nov 8, 2024 at 8:23 PM Andres Freund  wrote:
>>
>> On 2024-11-08 09:08:55 +, Zhijie Hou (Fujitsu) wrote:

 I think it's *completely* unacceptable to call a hook inside
 ReserveXLogInsertLocation, with the spinlock held, no less. That's the most
 contended section of code in postgres.
>>>
>>> I understand your concern and appreciate the feedback. I've made some
>>> adjustments to the patch by directly placing the code to adjust the commit
>>> timestamp within the spinlock, aiming to keep it as efficient as possible. 
>>> The
>>> changes have resulted in just a few extra lines. Would this approach be
>>> acceptable to you ?
>>
>> No, not at all. I think it's basically not acceptable to add any nontrivial
>> instruction to the locked portion of ReserveXLogInsertLocation().
>>
>> Before long we're going to have to replace that spinlocked instruction with
>> atomics, which will make it impossible to just block while holding the lock
>> anyway.
>>
>> IMO nothing that touches core xlog insert infrastructure is acceptable for
>> this.
>>
> 
> I can't think of a solution other than the current proposal where we
> do both the operations (reserve WAL space for commit and adjust
> commit_timestamp, if required) together to resolve LSN<->Timestamp
> invertion issue. Please let us know if you have any other ideas for
> solving this. Even, if we want to change ReserveXLogInsertLocation to
> make the locked portion an atomic operation, we can still do it for
> records other than commit. Now, if we don't have any other solution
> for LSN<->Timestamp inversion issue, changing the entire locked
> portion to atomic will close the door to solving this problem forever
> unless we have some other way to solve it which can make it difficult
> to rely on commit_timestamps for certain scenarios.
> 

I don't know what the solution is, but isn't the problem that

(a) we record both values (LSN and timestamp) during commit

(b) reading timestamp from system clock can be quite expensive

It seems to me that if either of those two issues disappeared, we
wouldn't have such an issue.

For example, imagine getting a timestamp is super cheap - just reading
and updating a simple counter from shmem, just like we do for the LSN.
Wouldn't that solve the problem?

For example, let's say XLogCtlInsert has two fields

int64 CommitTimestamp;

and that ReserveXLogInsertLocation() also does this for each commit:

commit_timestamp = Insert->commit_timestamp++;

while holding the insertpos_lock. Now we have the LSN and timestamp
perfectly correlated. Of course, this timestamp would be completely
detached from "normal" timestamps; it'd just be a sequence. But we could
also once in a while "set" the timestamp to GetCurrentTimestamp(), or
something like that. For example, there could be a "ticker" process,
doing that every 1000ms, or 100ms, or whatever we think is enough.

Or maybe it could be driven by the commit itself, say when some timeout
expires or we have assigned too many items since the last commit, ...


Alternatively, we could simply stop relying on the timestamps recorded
in the commit, and instead derive "monotonic" commit timestamps after
the fact. For example, we could track timestamps for some subset of
commits, and then approximate the rest using LSN.

AFAIK the inversion can happen only for concurrent commits, and there
can't be that many of those. So if we record the (LSN, timestamp) for
every 1MB of WAL, we could approximate timestamps for commits in that 1MB
by linear approximation.

Of course, there are a lot of questions and details to solve - e.g. how
often would it need to happen, when exactly would it happen, etc. And
also how would that integrate with logical decoding - it's easy to
just get the timestamp from the WAL record, while this would require more
work to actually calculate it. It's only a very rough idea.


regards

-- 
Tomas Vondra





Re: PG17 failing tests (DST related?)

2024-11-11 Thread Tom Lane
Dave Page  writes:
> For about a week, the Windows builds I've been running in Github have been
> failing for PG17:

Should be fixed as of a35801915 ?

regards, tom lane




Re: Add reject_limit option to file_fdw

2024-11-11 Thread Fujii Masao




On 2024/11/11 21:45, torikoshia wrote:

Thanks for adding the comment. It clearly states that REJECT_LIMIT can be
a single-quoted string. However, it might also be helpful to mention that
it can be provided as an int64 in the COPY command option. How about
updating it like this?


REJECT_LIMIT can be specified in two ways: as an int64 for the COPY command
option or as a single-quoted string for the foreign table option using
file_fdw. Therefore this function needs to handle both formats.



Thanks! it seems better.


Attached v3 patch.


Thanks for updating the patch! It looks like you forgot to attach it, though.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Offsets of `struct Port` are no longer constant

2024-11-11 Thread Daniel Gustafsson
> On 11 Nov 2024, at 20:17, Jacob Champion  
> wrote:
> 
> On Mon, Nov 11, 2024 at 11:13 AM Heikki Linnakangas  wrote:
>> On REL_17_STABLE, we should probably adjust the comment to warn that
>> 'raw_buf' and friends can move depending on USE_OPENSSL.
> 
> Yeah, makes sense.

+1




Re: [PoC] XMLCast (SQL/XML X025)

2024-11-11 Thread Jim Jones
Hi Robert
Thanks for taking a look at it.

On 11.11.24 19:15, Robert Haas wrote:
> Hmm, this patch has gotten no responses for 4 months. That's kind of
> unfortunate. Sadly, there's not a whole lot that I can do to better
> the situation, because I know very little either about XML-related
> standards or about how people make use of XML in practice. It's not
> that much code, so if it does a useful thing that we actually want, we
> can probably figure out how to verify that the code is correct, or fix
> it. But I don't know whether it's a useful thing that we actually
> want. Syntactically, XMLCAST() looks a lot like CAST(), so one might
> ask whether the things that it does can already be accomplished using
> CAST(); or whether, perhaps, we have some other existing method for
> performing such conversions.
It indeed has a huge overlap with CAST(), except for a few handy SQL <->
XML mappings, such as

SELECT xmlcast('foo &amp; &lt;"bar"&gt;'::xml AS text);

    xmlcast    
---------------
 foo & <"bar">
(1 row)

--

SELECT
  xmlcast('2024-05-29 12:04:10.703585+02'::timestamp without time zone
AS xml),
  xmlcast('2024-05-29T12:04:10.703585'::xml AS timestamp without time zone);
 
  xmlcast   |  xmlcast   
----------------------------+----------------------------
 2024-05-29T12:04:10.703585 | 2024-05-29 12:04:10.703585
(1 row)

--

SELECT
  xmlcast('P1Y2M3DT4H5M6S'::xml AS interval),
  xmlcast('1 year 2 mons 3 days 4 hours 5 minutes 6 seconds'::interval
AS xml);
 
    xmlcast    |    xmlcast     
-------------------------------+----------------
 1 year 2 mons 3 days 04:05:06 | P1Y2M3DT4H5M6S
(1 row)

--

SELECT CAST('42'::xml AS int);

ERROR:  cannot cast type xml to integer
LINE 1: SELECT CAST('42'::xml AS int);
   ^
--

SELECT XMLCAST('42'::xml AS int);
 xmlcast
---------
  42
(1 row)


> The only thing I found during a quick perusal of the documentation was
> XMLTABLE(), which seems a bit baroque if you just want to convert one
> value. Is this intended to plug that gap? Is there any other current
> way of doing it?
>
> Do we need to ensure some kind of consistency between XMLTABLE() and
> XMLCAST() in terms of how they behave? 

I haven't considered any compatibility with XMLTABLE(), as it has a
different spec (X300-X305), but I can take a look at it! To implement
this function I just followed the SQL/XML spec "ISO/IEC IWD 9075-14",
and from time to time I also took a look at how other databases
implemented it.[1]

> The documentation at
> https://www.postgresql.org/docs/current/xml-limits-conformance.html#FUNCTIONS-XML-LIMITS-CASTS
> says that "When PostgreSQL maps SQL data values to XML (as in
> xmlelement), or XML to SQL (as in the output columns of xmltable),
> except for a few cases treated specially, PostgreSQL simply assumes
> that the XML data type's XPath 1.0 string form will be valid as the
> text-input form of the SQL datatype, and conversely." Unfortunately,
> it does not specify what those cases treated specially are, and the
> commit that added that documentation text is not the one that added
> the underlying code, so I don't actually know where that code is, but
> one would expect this function to conform to that general rule.

I agree. It would be nice to know which cases those are.
However, invalid inputs should normally return an error, e.g.

SELECT xmlcast('foo&bar'::xml AS text);

ERROR:  invalid XML content
LINE 1: SELECT xmlcast('foo&bar'::xml AS text);
   ^
DETAIL:  line 1: EntityRef: expecting ';'
foo&bar
   ^
--

SELECT xmlcast('foo'::xml AS date);
ERROR:  invalid input syntax for type date: "foo"

--

.. but perhaps the text means something else?

Thanks!

Best, Jim

1 - https://dbfiddle.uk/ZSpsyIal




Re: Update Unicode data to Unicode 16.0.0

2024-11-11 Thread Joe Conway

On 11/11/24 01:27, Peter Eisentraut wrote:

Here is the patch to update the Unicode data to version 16.0.0.

Normally, this would have been routine, but a few months ago there was
some debate about how this should be handled. [0]  AFAICT, the consensus
was to go ahead with it, but I just wanted to notify it here to be clear.

[0]:
https://www.postgresql.org/message-id/flat/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel%40j-davis.com


I ran a check and found that this patch causes changes in upper casing 
of some characters. Repro:


setup
8<-
wget https://joeconway.com/presentations/formated-unicode.txt
initdb
psql
CREATE DATABASE builtincoll
 LOCALE_PROVIDER builtin
 BUILTIN_LOCALE 'C.UTF-8'
 TEMPLATE template0;
\c builtincoll
CREATE TABLE unsorted_table(strings text);
\copy unsorted_table from formated-unicode.txt (format csv)
VACUUM FREEZE ANALYZE unsorted_table;
8<-


8<-
-- on master
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table 
ORDER BY 1)

SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5                
----------------------------------
 7ec7f5c2d8729ec960942942bb82aedd
(1 row)

builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table 
ORDER BY 1)

SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5                
----------------------------------
 97f83a4d1937aa65bcf8be134bf7b0c4
(1 row)

builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM 
unsorted_table ORDER BY 1)

SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5                
----------------------------------
 8cf65a43affc221f3a20645ef402085e
(1 row)
8<-


8<-
-- master+patch
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table 
ORDER BY 1)

SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5                
----------------------------------
 7ec7f5c2d8729ec960942942bb82aedd
(1 row)

Time: 19858.981 ms (00:19.859)
builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table 
ORDER BY 1)SELECT md5(string_agg(t.s,NULL)) FROM t;

               md5                
----------------------------------
 3055b3d5dff76c8c1250ef500c6ec13f
(1 row)

Time: 19774.467 ms (00:19.774)
builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM 
unsorted_table ORDER BY 1)

SELECT md5(string_agg(t.s,NULL)) FROM t;
               md5                
----------------------------------
 9985acddf7902ea603897cdaccd02114
(1 row)
8<-

So both UPPER and INITCAP produce different results unless I am missing 
something.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-11 Thread Ashutosh Bapat
On Tue, Nov 12, 2024 at 12:02 PM Masahiko Sawada  wrote:
>
> On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra  wrote:
> >
> >
> > But neither of those fixes prevents backwards move for confirmed_flush
> > LSN, as enforced by asserts in the 0005 patch. I don't know if this
> > assert is incorrect or now. It seems natural that once we get a
> > confirmation for some LSN, we can't move before that position, but I'm
> > not sure about that. Maybe it's too strict.
>
> Hmm, I'm concerned that it might be another problem. I think there are
> some cases where a subscriber sends a flush position older than slot's
> confirmed_flush as a feedback message. But it seems to be dangerous if
> we always accept it as a new confirmed_flush value. It could happen
> that confirm_flush could be set to a LSN older than restart_lsn.
>

If the confirmed_flush LSN moves backwards, it means the transactions
which were thought to be replicated earlier are no longer considered
to be replicated. This means that the restart_lsn of the slot needs to
be at least as far back as the oldest of the starting points of those
transactions; thus the restart_lsn of the slot has to be pushed further
back. That WAL may not be available anymore. There is a similar issue
with catalog_xmin: the older catalog rows may have been removed. Another
problem is that we may send some transactions twice, which might cause
trouble downstream. So I agree that the confirmed_flush LSN should not
move backwards. IIRC, if the downstream sends an older confirmed_flush
in the START_REPLICATION message, the WAL sender does not consider it
and instead uses the one in the replication slot.
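
In other words, something like the following hypothetical guard
(illustrative only, not the actual walsender code) is what keeps stale
feedback from moving the slot backwards:

#include <stdint.h>

/* Hypothetical guard: never let stale subscriber feedback move the slot back. */
static void
advance_confirmed_flush(uint64_t *slot_confirmed_flush, uint64_t reported_flush)
{
	if (reported_flush > *slot_confirmed_flush)
		*slot_confirmed_flush = reported_flush;
	/* else: ignore the older position and keep the slot's current value */
}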

-- 
Best Wishes,
Ashutosh Bapat




RE: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY

2024-11-11 Thread Zhijie Hou (Fujitsu)
On Friday, November 8, 2024 7:06 PM Shlok Kyal  wrote:
> 
> Hi Amit,
> 
> On Thu, 7 Nov 2024 at 11:37, Amit Kapila  wrote:
> >
> > On Tue, Nov 5, 2024 at 12:53 PM Shlok Kyal 
> wrote:
> > >
> > > To avoid the issue, we can disallow UPDATE/DELETE on table with
> > > unpublished generated column as REPLICA IDENTITY. I have attached a
> > > patch for the same.
> > >
> >
> > +CREATE PUBLICATION pub_gencol FOR TABLE testpub_gencol; UPDATE
> > +testpub_gencol SET a = 100 WHERE a = 1;
> > +ERROR:  cannot update table "testpub_gencol"
> > +DETAIL:  Column list used by the publication does not cover the
> > replica identity.
> >
> > This is not a correct ERROR message as the publication doesn't have
> > any column list associated with it. You have added the code to detect
> > this in the column list code path which I think is not required. BTW,
> > you also need to consider the latest commit 7054186c4e for this. I
> > guess you need to keep another flag in PublicationDesc to detect this
> > and then give an appropriate ERROR.
> 
> I have addressed the comments and provided an updated patch. Also, I am
> currently working to fix this issue in back branches.

Thanks for the patch. I am reviewing it and have some initial comments:


1.
+   char attgenerated = get_attgenerated(relid, attnum);
+

I think it's unnecessary to initialize attgenerated here because the value will
be overwritten if pubviaroot is true anyway. Also, the get_attgenerated()
is not cheap.

2.

I think the patch missed checking the case where the table is marked REPLICA
IDENTITY FULL and the generated column is not published:

CREATE TABLE testpub_gencol (a INT, b INT GENERATED ALWAYS AS (a + 1) STORED 
NOT NULL);
ALTER TABLE testpub_gencol REPLICA IDENTITY FULL;
CREATE PUBLICATION pub_gencol FOR TABLE testpub_gencol;
UPDATE testpub_gencol SET a = 2;

I expected the UPDATE to fail in the above case, but it still passes after 
applying the patch.

3.

+* If the publication is FOR ALL TABLES we can skip the 
validation.
+*/

This comment seems unclear to me; could you elaborate a bit more on it?

4.

Also, I think the patch does not handle the FOR ALL TABLES case correctly:

CREATE TABLE testpub_gencol (a INT, b INT GENERATED ALWAYS AS (a + 1) STORED 
NOT NULL);
CREATE UNIQUE INDEX testpub_gencol_idx ON testpub_gencol (b);
ALTER TABLE testpub_gencol REPLICA IDENTITY USING index testpub_gencol_idx;
CREATE PUBLICATION pub_gencol FOR ALL TABLEs;
UPDATE testpub_gencol SET a = 2;

I expected the UPDATE to fail in the above case as well.

5.

+   else if (cmd == CMD_UPDATE && !pubdesc.replident_has_valid_gen_cols)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
+                errmsg("cannot update table \"%s\"",
+                       RelationGetRelationName(rel)),
+                errdetail("REPLICA IDENTITY consists of an unpublished generated column.")));

I think it would be better to use lower case "replica identity" to be
consistent with other existing messages.

Best Regards,
Hou zj


Re: Fix for pageinspect bug in PG 17

2024-11-11 Thread Michael Paquier
On Mon, Nov 11, 2024 at 07:32:10PM +0100, Tomas Vondra wrote:
> This adds an out argument to brin_page_items, but I failed to consider
> the user may still run with an older version of the extension - either
> after pg_upgrade (as in the report), or when the CREATE EXTENSION
> command specifies VERSION.

Perhaps it would be worth adding a test where the function is called
in the context of a past extension version?  (I know, I'm noisy when
it comes to that.)

> The only fix I can think of is explicitly looking at TupleDesc->natts,
> as in the attached fix.

How about some consistency instead, as this is the same error as
691e8b2e1889?  btreefuncs.c decides to issue an error if we are
trying to call one of its functions in the context of a past version
of the extension.  pageinspect is a superuser module, so it does not seem
that bad to me to tell users that they should force an upgrade when
using it.
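
For illustration, a sketch of that approach could look like the code below;
the function name, the expected column count, and the error message are
assumptions modeled loosely on how the btree functions handle this, not the
actual fix:

#include "postgres.h"

#include "fmgr.h"
#include "funcapi.h"

/* Placeholder: number of output columns in the current SQL definition. */
#define BRIN_PAGE_ITEMS_EXPECTED_NATTS	8

PG_FUNCTION_INFO_V1(brin_page_items_sketch);

Datum
brin_page_items_sketch(PG_FUNCTION_ARGS)
{
	TupleDesc	tupdesc;

	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
		elog(ERROR, "return type must be a row type");

	/* Reject calls made through an older SQL definition of the function. */
	if (tupdesc->natts != BRIN_PAGE_ITEMS_EXPECTED_NATTS)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("function has been updated, please upgrade the extension")));

	/* ... build and return the result set as before ... */
	PG_RETURN_NULL();
}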
--
Michael




Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes

2024-11-11 Thread Michael Paquier
On Mon, Nov 11, 2024 at 11:06:43AM -0500, Robert Haas wrote:
> But it is unclear to me what sort of tuning we would do based on
> knowing how many of the scans on a certain table or a certain index
> were parallel vs non-parallel. I have not fully reviewed the threads
> linked in the original post; but I did look at them briefly and did
> not immediately see discussion of the specific counters proposed here.
> I also don't see anything in this thread that clearly explains why we
> should want this exact thing. I don't want to make it sound like I
> know that this is useless; I'm sure that Guillaume probably has lots
> of hands-on tuning experience with this stuff that I lack. But the
> reasons aren't clearly spelled out as far as I can see, and I'm having
> some trouble imagining what they are.

Thanks for the summary.  My main worry is that these are kind of hard
to act on for tuning when aggregated at relation level (Guillaume,
feel free to counter-argue!).  The main point that comes to mind is
that single-table scans would mostly be involved with OLTP workloads
or simple joins, where parallel workers are of little use.  This could
be much more interesting for analytical-ish workloads with more
complex plan patterns, where one or more Gather or GatherMerge nodes are
involved.  Still, even in that case I suspect that most users will
end up looking at plan patterns, and that these counters added for
indexes or tables would have a limited impact in the end.
--
Michael




Re: 2024-11-14 release announcement draft

2024-11-11 Thread Thomas Munro
On Tue, Nov 12, 2024 at 5:28 PM Jonathan S. Katz  wrote:
> Please review the draft of the release announcement for the 2024-11-14
> release. Please provide any feedback by 2024-11-14 12:00 UTC.

I think "* Fixes random crashes in JIT compilation on aarch64 systems"
might be worth highlighting.  We've had a handful of separate reports
about that and people have been turning off JIT to work around it.




Re: Commit Timestamp and LSN Inversion issue

2024-11-11 Thread Amit Kapila
On Mon, Nov 11, 2024 at 9:05 PM Tomas Vondra  wrote:
>
> On 11/11/24 09:19, Amit Kapila wrote:
> >
> > I can't think of a solution other than the current proposal where we
> > do both the operations (reserve WAL space for commit and adjust
> > commit_timestamp, if required) together to resolve LSN<->Timestamp
> > inversion issue. Please let us know if you have any other ideas for
> > solving this. Even, if we want to change ReserveXLogInsertLocation to
> > make the locked portion an atomic operation, we can still do it for
> > records other than commit. Now, if we don't have any other solution
> > for LSN<->Timestamp inversion issue, changing the entire locked
> > portion to atomic will close the door to solving this problem forever
> > unless we have some other way to solve it which can make it difficult
> > to rely on commit_timestamps for certain scenarios.
> >
>
> I don't know what the solution is, isn't the problem that
>
> (a) we record both values (LSN and timestamp) during commit
>
> (b) reading timestamp from system clock can be quite expensive
>
> It seems to me that if either of those two issues disappeared, we
> wouldn't have such an issue.
>
> For example, imagine getting a timestamp is super cheap - just reading
> and updating a simple counter from shmem, just like we do for the LSN.
> Wouldn't that solve the problem?
>

Yeah, this is exactly what I thought.

> For example, let's say XLogCtlInsert has two fields
>
> int64 CommitTimestamp;
>
> and that ReserveXLogInsertLocation() also does this for each commit:
>
> commit_timestamp = Insert->commit_timestamp++;
>
> while holding the insertpos_lock. Now we have the LSN and timestamp
> perfectly correlated.
>

Right, and the patch sent by Hou-San [1] (based on the original patch
by Jan) is somewhat along these lines. The idea you have shared, like
the one implemented by the patch, is a logical clock stored in shared
memory. What the patch does is: if the time recorded by the current
commit record is less than or equal to the logical clock (which means
a concurrent transaction happened after we recorded the time in the
commit code path and before we reserved the LSN), we increment the
value of the logical clock by one and use that as the commit time.

So, in most cases, we need to perform one additional "if check" and
"an assignment" under spinlock, and that too only for the commit
record. In some cases, when inversion happens, there will be "one
increment" and "two assignments."

> Of course, this timestamp would be completely
> detached from "normal" timestamps, it'd just be a sequence. But we could
> also once in a while "set" the timestamp to GetCurrentTimestamp(), or
> something like that. For example, there could be a "ticker" process,
> doing that every 1000ms, or 100ms, or whatever we think is enough.
>
> Or maybe it could be driven by the commit itself, say when some timeout
> expires, we assign too many items since the last commit, ...
>

If we follow the patch, we don't need this additional stuff. Isn't
what is proposed [1] quite similar to your idea? If so, we can save
this extra maintenance of commit timestamps.

[1] - 
https://www.postgresql.org/message-id/OS0PR01MB57162A227EC357482FEB4470945D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

With Regards,
Amit Kapila.




Re: define pg_structiszero(addr, s, r)

2024-11-11 Thread Michael Paquier
On Mon, Nov 11, 2024 at 05:07:51PM +, Bertrand Drouvot wrote:
> To handle the "what about the len check if the function is not inlined?", I
> can't think about a good approach.

FWIW, my choice would be to not over-engineer things more than what's
in v10 posted at [1], hence do something without the exception case
where the size is less than 64b.  We've proved that this would be
better for the 8k block case that does the size_t based comparison
anyway, without impacting the existing other cases with the pgstats
entries.

[1]: 
https://www.postgresql.org/message-id/Zy5LgNyHzOhnYTTy%40ip-10-97-1-34.eu-west-3.compute.internal
--
Michael




Re: Remove a unnecessary backslash in CopyFrom

2024-11-11 Thread Fujii Masao




On 2024/11/12 11:58, Tender Wang wrote:



Yugo Nagata <nag...@sraoss.co.jp> wrote on Tue, Nov 12, 2024 at 10:46:

Hi,

I found an unnecessary backslash in CopyFrom().

             if (cstate->opts.reject_limit > 0 && \
                 cstate->num_errors > cstate->opts.reject_limit)

It can be removed because this is not in a macro.
I've attached a patch.


Yeah, agree. The patch LGTM.


Thanks for the report and patch! This is a mistake in commit 4ac2a9bece.
I'll commit the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





2024-11-14 release announcement draft

2024-11-11 Thread Jonathan S. Katz

Hi,

Please review the draft of the release announcement for the 2024-11-14 
release. Please provide any feedback by 2024-11-14 12:00 UTC.


Thanks!

Jonathan
The PostgreSQL Global Development Group has released an update to all supported
versions of PostgreSQL, including 17.1, 16.5, 15.9, 14.14, 13.17, and 12.21.
This release fixes over 35 bugs reported over the last several months.

For the full list of changes, please review the
[release notes](https://www.postgresql.org/docs/release/).

PostgreSQL 12 EOL Notice
------------------------

**This is the final release of PostgreSQL 12**. PostgreSQL 12 is now end-of-life
and will no longer receive security and bug fixes. If you are
running PostgreSQL 12 in a production environment, we suggest that you make
plans to upgrade to a newer, supported version of PostgreSQL. Please see our
[versioning policy](https://www.postgresql.org/support/versioning/) for more
information.

Bug Fixes and Improvements
--------------------------
 
This update fixes over 35 bugs that were reported in the last several months.
The issues listed below affect PostgreSQL 17. Some of these issues may also
affect other supported versions of PostgreSQL.

* Fix when attaching or detaching table partitions with foreign key constraints.
After upgrade, users impacted by this issue will need to perform manual steps to
finish fixing it. Please see the "Upgrading" section and the release notes for
more information.
* Fix when using libc as the default collation provider when `LC_CTYPE` is `C`
while `LC_COLLATE` is a different locale. This could lead to incorrect query
results. If you have these settings in your database, please reindex any
affected indexes after updating to this release. This issue impacted 17.0 only.
* Several query planner fixes.
* Fix possible wrong answers or `wrong varnullingrels` planner errors for
[`MERGE ... WHEN NOT MATCHED BY 
SOURCE`](https://www.postgresql.org/docs/current/sql-merge.html)
actions.
* Fix validation of the 
[`COPY`](https://www.postgresql.org/docs/current/sql-copy.html)
`FORCE_NOT_NULL` and `FORCE_NULL` options.
* Fix server crash when a 
[`json_objectagg()`](https://www.postgresql.org/docs/current/functions-aggregate.html)
call contains a volatile function.
* Ensure there's a registered dependency between a partitioned table and a
non-built-in access method specified in `CREATE TABLE ... USING`. This fix only
prevents problems for partitioned tables created after this update.
* Fix race condition in committing a serializable transaction.
* Fix race condition in [`COMMIT 
PREPARED`](https://www.postgresql.org/docs/current/sql-commit-prepared.html)
that could require manual file removal after a crash-and-recovery.
* Fix for 
[`pg_cursors`](https://www.postgresql.org/docs/current/view-pg-cursors.html)
view to prevent errors by excluding cursors that aren't completely set up.
* Reduce logical decoding memory consumption.
* Fix to prevent stable functions from receiving stale row values when they're
called from a [`CALL`](https://www.postgresql.org/docs/current/sql-call.html)
statement's argument list and the `CALL` is within a
[PL/pgSQL 
`EXCEPTION`](https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING)
block.
* The `psql` `\watch` command now treats values that are less than 1ms as an
interval of 0 (no wait between executions).
* Fix failure to use credentials for a replication user in the
[password file](https://www.postgresql.org/docs/current/libpq-pgpass.html)
([`pgpass`](https://www.postgresql.org/docs/current/libpq-pgpass.html)).
* 
[`pg_combinebackup`](https://www.postgresql.org/docs/current/app-pgcombinebackup.html)
now throws an error if an incremental backup file is present in a directory
that should contain a full backup.
* Fix to avoid reindexing temporary tables and indexes in
[`vacuumdb`](https://www.postgresql.org/docs/current/app-vacuumdb.html) and
parallel 
[`reindexdb`](https://www.postgresql.org/docs/current/app-reindexdb.html).

This release also updates time zone data files to tzdata release 2024b. This
tzdata release changes the old System-V-compatibility zone names to duplicate
the corresponding geographic zones; for example `PST8PDT` is now an alias for
`America/Los_Angeles`. The main visible consequence is that for timestamps
before the introduction of standardized time zones, the zone is considered to
represent local mean solar time for the named location. For example, in
`PST8PDT`, timestamptz input such as 1801-01-01 00:00 would previously have been
rendered as `1801-01-01 00:00:00-08`, but now it is rendered as
`1801-01-01 00:00:00-07:52:58`.

Also, historical corrections for Mexico, Mongolia, and Portugal. Notably,
Asia/Choibalsan is now an alias for Asia/Ulaanbaatar rather than being a
separate zone, mainly because the differences between those zones were found to
be based on untrustworthy data. 

Updating
--------

All PostgreSQL update releases are cumulative. As with other minor releases,
use

Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY

2024-11-11 Thread Amit Kapila
On Sat, Nov 9, 2024 at 8:46 AM Amit Kapila  wrote:
>
> On Fri, Nov 8, 2024 at 5:17 PM Alvaro Herrera  wrote:
> >
> > On 2024-Nov-07, Amit Kapila wrote:
> >
> > > BTW, I was thinking as to how to fix it on back branches and it seems
> > > we should restrict to define REPLICA IDENTITY on stored generated
> > > columns in the first place in back branches as those can't be
> > > replicated. So, the following should fail:
> > >
> > > CREATE TABLE testpub_gencol (a INT, b INT GENERATED ALWAYS AS (a + 1)
> > > STORED NOT NULL);
> > > CREATE UNIQUE INDEX testpub_gencol_idx ON testpub_gencol (b);
> > > ALTER TABLE testpub_gencol REPLICA IDENTITY USING index 
> > > testpub_gencol_idx;
> > >
> > > Peter, do you have an opinion on this?
> >
> > I think a blanket restriction of this sort is not a good idea (at least
> > in back branches), because there might be people using replica
> > identities with stacks other than pgoutput.
> >
>
> Do you mean to say that people using plugins other than pgoutput may
> already be sending generated columns, so defining replica identity
> should be okay for them?
>

If we don't want to add a restriction against creating a replica identity
on generated columns, then I think a solution similar to the one on HEAD
should be okay, which is to restrict UPDATE/DELETE in such cases.

Also, another point against restricting the definition of REPLICA IDENTITY
on generated columns is that we do allow generated columns to be a PRIMARY
KEY, which is the default REPLICA IDENTITY, so that would also need to be
restricted. That wouldn't be a good idea.

-- 
With Regards,
Amit Kapila.




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-11-11 Thread Yugo Nagata
On Tue, 12 Nov 2024 01:27:53 +0500
Kirill Reshke  wrote:

> On Mon, 11 Nov 2024 at 16:11, torikoshia  wrote:
> >
> > On 2024-11-09 21:55, Kirill Reshke wrote:
> >
> > Thanks for working on this!
> 
> Thanks for reviewing the v7 patch series!
> 
> > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao 
> > > wrote:
> > >>
> > >>
> > >>
> > >> On 2024/10/26 6:03, Kirill Reshke wrote:
> > >> > when the REJECT LIMIT is set to some non-zero number and the number of
> > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> > >> > there WAS errors, and we should not tolerate more than $limit errors .
> > >> > I do find this behavior to be consistent.
> > >>
> > >> +1
> > >>
> > >>
> > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> > >> > replacements, as if REJECT LIMIT is inf.
> > >>
> > >> +1
> > >
> > > After thinking for a while, I'm now more opposed to this approach. I
> > > think we should count rows with erroneous data as errors only if
> > > null substitution for these rows failed, not the total number of rows
> > > which were modified.
> > > Then, to respect the REJECT LIMIT option, we compare this number with
> > > the limit. This is actually simpler approach IMHO. What do You think?
> >
> > IMHO I prefer the previous interpretation.
> > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> > is used to specify how many malformed rows are acceptable in the
> > "original" data source.

I also prefer the previous version.
 
> I do like the first version of interpretation, but I have a struggle
> with it. According to this interpretation, we will fail COPY command
> if the number
> of malformed data rows exceeds the limit, not the number of rejected
> rows (some percentage of malformed rows are accepted with null
> substitution)
> So, a proper name for the limit will be MALFORMED_LIMIT, or something.
> However, we are unable to change the name since the REJECT_LIMIT
> option has already been committed.
> I guess I'll just have to put up with this contradiction. I will send
> an updated patch shortly...

I think we can rename the REJECT_LIMIT option because it is not yet released.

The documentation says that REJECT_LIMIT "Specifies the maximum number of
errors", and there is no wording about "reject" in the description, so I
wonder whether it is clear what "REJECT" in REJECT_LIMIT means. It may be
more appropriate to use ERROR_LIMIT since it is supposed to be used with
ON_ERROR.

Alternatively, if we want to emphasize that errors are handled in some way
other than terminating the command, perhaps MALFORMED_LIMIT as proposed
above or TOLERANCE_LIMIT may be good, for example.

Regards,
Yugo Nagata

-- 
Yugo Nagata 




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-11-11 Thread Yugo Nagata
On Tue, 12 Nov 2024 14:03:50 +0900
Yugo Nagata  wrote:

> On Tue, 12 Nov 2024 01:27:53 +0500
> Kirill Reshke  wrote:
> 
> > On Mon, 11 Nov 2024 at 16:11, torikoshia  wrote:
> > >
> > > On 2024-11-09 21:55, Kirill Reshke wrote:
> > >
> > > Thanks for working on this!
> > 
> > Thanks for reviewing the v7 patch series!
> > 
> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao 
> > > > wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
> > > >> > when the REJECT LIMIT is set to some non-zero number and the number 
> > > >> > of
> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> > > >> > there WAS errors, and we should not tolerate more than $limit errors 
> > > >> > .
> > > >> > I do find this behavior to be consistent.
> > > >>
> > > >> +1
> > > >>
> > > >>
> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> > > >> > replacements, as if REJECT LIMIT is inf.
> > > >>
> > > >> +1
> > > >
> > > > After thinking for a while, I'm now more opposed to this approach. I
> > > > think we should count rows with erroneous data as errors only if
> > > > null substitution for these rows failed, not the total number of rows
> > > > which were modified.
> > > > Then, to respect the REJECT LIMIT option, we compare this number with
> > > > the limit. This is actually simpler approach IMHO. What do You think?
> > >
> > > IMHO I prefer the previous interpretation.
> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> > > is used to specify how many malformed rows are acceptable in the
> > > "original" data source.
> 
> I also prefer the previous version.
>  
> > I do like the first version of interpretation, but I have a struggle
> > with it. According to this interpretation, we will fail COPY command
> > if the number
> > of malformed data rows exceeds the limit, not the number of rejected
> > rows (some percentage of malformed rows are accepted with null
> > substitution)
> > So, a proper name for the limit will be MALFORMED_LIMIT, or something.
> > However, we are unable to change the name since the REJECT_LIMIT
> > option has already been committed.
> > I guess I'll just have to put up with this contradiction. I will send
> > an updated patch shortly...
> 
> I think we can rename the REJECT_LIMIT option because it is not yet released.
> 
> The documentation says that REJECT_LIMIT "Specifies the maximum number of 
> errors",
> and there are no wording "reject" in the description, so I wonder it is 
> unclear
> what means in "REJECT" in REJECT_LIMIT. It may be proper to use ERROR_LIMIT
> since it is supposed to be used with ON_ERROR. 
> 
> Alternatively, if we emphasize that errors are handled other than terminating
> the command,perhaps MALFORMED_LIMIT as proposed above or TOLERANCE_LIMIT may 
> be
> good, for example.

I might have misunderstood the meaning of the name. If REJECT_LIMIT means "a
limit on the number of rows with any malformed value allowed before the COPY
command is rejected", we would not have to rename it.

-- 
Yugo Nagata 




Re: Add reject_limit option to file_fdw

2024-11-11 Thread Yugo Nagata
On Tue, 12 Nov 2024 10:16:50 +0900
torikoshia  wrote:

> On 2024-11-12 01:49, Fujii Masao wrote:
> > On 2024/11/11 21:45, torikoshia wrote:
> >>> Thanks for adding the comment. It clearly states that REJECT_LIMIT 
> >>> can be
> >>> a single-quoted string. However, it might also be helpful to mention 
> >>> that
> >>> it can be provided as an int64 in the COPY command option. How about
> >>> updating it like this?
> >>> 
> >>> 
> >>> REJECT_LIMIT can be specified in two ways: as an int64 for the COPY 
> >>> command
> >>> option or as a single-quoted string for the foreign table option 
> >>> using
> >>> file_fdw. Therefore this function needs to handle both formats.
> >>> 
> >> 
> >> Thanks! it seems better.
> >> 
> >> 
> >> Attached v3 patch.
> > 
> > Thanks for updating the patch! It looks like you forgot to attach it, 
> > though.
> 
> Oops, thanks for pointing it out.
> Here it is.

I have one minor comment.

+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+                     errmsg("skipped more than REJECT_LIMIT (%lld) rows due to data type incompatibility",
+                            (long long) cstate->opts.reject_limit)));

Do we have to keep this message consistent with the ones from the COPY command?
With foreign data wrappers, it seems common for the option name to be passed in
lowercase, so would it be better to use "skipped more than reject_limit ..."
rather than using uppercase?

Regards,
Yugo Nagata

-- 
Yugo Nagata 




Re: define pg_structiszero(addr, s, r)

2024-11-11 Thread Bertrand Drouvot
Hi,

On Tue, Nov 12, 2024 at 12:28:53PM +0900, Michael Paquier wrote:
> On Mon, Nov 11, 2024 at 05:07:51PM +, Bertrand Drouvot wrote:
> > To handle the "what about the len check if the function is not inlined?", I
> > can't think about a good approach.
> 
> FWIW, my choice would be to not over-engineer things more than what's
> in v10 posted at [1], hence do something without the exception case
> where the size is less than 64b.

I think that the 64b len check done in v11 is mandatory for safety reasons.

1. First reason:

"
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
"

The loop above reads 64 bytes at once, so it would read beyond the memory area
bounds if len < 64: that could cause a crash or read invalid data.

It's observed in [1] (using the godbolt shared in [2]), where we can see:

"
movdqu  xmm2, XMMWORD PTR [rdi+16]
movdqu  xmm1, XMMWORD PTR [rdi+32]
movdqu  xmm3, XMMWORD PTR [rdi+48]
"

while the struct size is 16 bytes (so we are reading 48 bytes beyond it).

2. Second reason

"
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
"

aligned_end could be beyond the end for len < 8, so that we could read 
invalid data or crash here:

"
for (; p < aligned_end; p += sizeof(size_t)) {
"

The len < 8 check is covered by the len < 64 check, so only the 64b check is
needed.

[1]: 
https://www.postgresql.org/message-id/Zy7hyG8JUMC5P2T3%40ip-10-97-1-34.eu-west-3.compute.internal
[2]: 
https://www.postgresql.org/message-id/CAApHDvp2jx_%3DpFbgj-O1_ZmzP9WOZKfwLzZrS_%3DZmbsqMQQ59g%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Missing word in comment

2024-11-11 Thread Amit Langote
Hi,

Will push the attached shortly, which does:

  * An SMgrRelation may be "pinned", to prevent it from being destroyed while
- * it's in use.  We use this to prevent pointers relcache to smgr from being
+ * it's in use.  We use this to prevent pointers in relcache to smgr from being
  * invalidated.  SMgrRelations that are not pinned are deleted at end of

-- 
Thanks, Amit Langote


v1-0001-Add-missing-word-in-comment.patch
Description: Binary data


Re: define pg_structiszero(addr, s, r)

2024-11-11 Thread Michael Paquier
On Tue, Nov 12, 2024 at 06:09:04AM +, Bertrand Drouvot wrote:
> I think that the 64b len check done in v11 is mandatory for safety reasons.
> 
> The loop above reads 64 bytes at once, so would read beyond the memory area 
> bounds
> if len < 64: That could cause crash or read invalid data.

Sorry, I was not following your argument.  You're right that we need
something else here.  However..

+   /*
+* For len < 64, compare byte per byte to ensure we'll not read beyond 
the
+* memory area.
+*/
+   if (len < sizeof(size_t) * 8)
+   {
+   while (p < end)
+   {
+   if (*p++ != 0)
+   return false;
+   }
+   return true;
+   }
+
+   /* Compare bytes until the pointer "p" is aligned */
+   while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
+   {
+   if (p == end)
+   return true;
+
+   if (*p++ != 0)
+   return false;
+   }
+

Still, this is not optimal, based on what's been discussed upthread.
The byte-per-byte check is more expensive than the size_t check, so
shouldn't you make sure that you stack some size_t checks if dealing
with something smaller than 64 bytes?  That would be a bit more
complex, sure, but if you leave that within the block doing "len <
sizeof(size_t) * 8", perhaps that's OK..  Or just do what you
mentioned upthread with a second macro for sizes <= 64.  You'd need
three steps in this first block rather than one:
- byte-per-byte, up to aligned location.
- size_t loop.
- Again byte-per-byte, until the end.
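
For what it's worth, here is a standalone sketch of that three-step shape
(illustrative only, not the committed code; it uses memcpy for the word-sized
loads to stay clear of alignment and aliasing issues):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static inline bool
small_area_is_all_zeros(const void *ptr, size_t len)
{
	const unsigned char *p = (const unsigned char *) ptr;
	const unsigned char *end = p + len;
	const unsigned char *aligned_end = (const unsigned char *)
		((uintptr_t) end & ~(uintptr_t) (sizeof(size_t) - 1));

	/* 1. byte per byte, up to the first size_t-aligned address */
	while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
	{
		if (p == end)
			return true;
		if (*p++ != 0)
			return false;
	}

	/* 2. one size_t at a time; p + sizeof(size_t) never goes past end here */
	for (; p < aligned_end; p += sizeof(size_t))
	{
		size_t		chunk;

		memcpy(&chunk, p, sizeof(size_t));
		if (chunk != 0)
			return false;
	}

	/* 3. byte per byte, for the unaligned tail */
	while (p < end)
	{
		if (*p++ != 0)
			return false;
	}

	return true;
}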
--
Michael




Remove a unnecessary backslash in CopyFrom

2024-11-11 Thread Yugo Nagata
Hi,

I found an unnecessary backslash in CopyFrom().

if (cstate->opts.reject_limit > 0 && \
cstate->num_errors > cstate->opts.reject_limit)

It can be removed because this is not in a macro.
I've attached a patch.

Regards,
Yugo Nagata

-- 
Yugo Nagata 
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 07cbd5d22b..754cb49616 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1018,7 +1018,7 @@ CopyFrom(CopyFromState cstate)
 			pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
 		 cstate->num_errors);
 
-			if (cstate->opts.reject_limit > 0 && \
+			if (cstate->opts.reject_limit > 0 &&
 cstate->num_errors > cstate->opts.reject_limit)
 ereport(ERROR,
 		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),


Re: Remove a unnecessary backslash in CopyFrom

2024-11-11 Thread Tender Wang
Yugo Nagata wrote on Tue, Nov 12, 2024 at 10:46:

> Hi,
>
> I found an unnecessary backslash in CopyFrom().
>
> if (cstate->opts.reject_limit > 0 && \
> cstate->num_errors > cstate->opts.reject_limit)
>
> It can be removed because this is not in a macro.
> I've attached a patch.
>

Yeah, agree. The patch LGTM.

-- 
Thanks,
Tender Wang


Re: Commit Timestamp and LSN Inversion issue

2024-11-11 Thread Amit Kapila
On Tue, Nov 12, 2024 at 8:35 AM Amit Kapila  wrote:
>
> On Mon, Nov 11, 2024 at 9:05 PM Tomas Vondra  wrote:
> >
> > On 11/11/24 09:19, Amit Kapila wrote:
> > >
> > > I can't think of a solution other than the current proposal where we
> > > do both the operations (reserve WAL space for commit and adjust
> > > commit_timestamp, if required) together to resolve LSN<->Timestamp
> > > inversion issue. Please let us know if you have any other ideas for
> > > solving this. Even, if we want to change ReserveXLogInsertLocation to
> > > make the locked portion an atomic operation, we can still do it for
> > > records other than commit. Now, if we don't have any other solution
> > > for LSN<->Timestamp inversion issue, changing the entire locked
> > > portion to atomic will close the door to solving this problem forever
> > > unless we have some other way to solve it which can make it difficult
> > > to rely on commit_timestamps for certain scenarios.
> > >
> >
> > I don't know what the solution is, isn't the problem that
> >
> > (a) we record both values (LSN and timestamp) during commit
> >
> > (b) reading timestamp from system clock can be quite expensive
> >
> > It seems to me that if either of those two issues disappeared, we
> > wouldn't have such an issue.
> >
> > For example, imagine getting a timestamp is super cheap - just reading
> > and updating a simple counter from shmem, just like we do for the LSN.
> > Wouldn't that solve the problem?
> >
>
> Yeah, this is exactly what I thought.
>
> > For example, let's say XLogCtlInsert has two fields
> >
> > int64 CommitTimestamp;
> >
> > and that ReserveXLogInsertLocation() also does this for each commit:
> >
> > commit_timestamp = Insert->commit_timestamp++;
> >
> > while holding the insertpos_lock. Now we have the LSN and timestamp
> > perfectly correlated.
> >
>
> Right, and the patch sent by Hou-San [1] (based on the original patch
> by Jan) is somewhat on these lines. The idea you have shared or
> implemented by the patch is a logical clock stored in shared memory.
> So, what the patch does is that if the time recorded by the current
> commit record is lesser than or equal to the logical clock (which
> means after we record time in the commit code path and before we
> reserve the LSN, there happens a concurrent transaction), we shall
> increment the value of logical clock by one and use that as commit
> time.
>
> So, in most cases, we need to perform one additional "if check" and
> "an assignment" under spinlock, and that too only for the commit
> record. In some cases, when inversion happens, there will be "one
> increment" and "two assignments."
>

As the inversion issue mainly hampers logical replication-based
solutions, we can do any of this additional work under the spinlock only
when the current record is a commit record (which the currently
proposed patch is already doing) and "wal_level = logical", and we could
also have another option at the subscription level to enable this new
code path. I am not sure what is best, but I am just sharing the ideas here.

-- 
With Regards,
Amit Kapila.




Re: Commit Timestamp and LSN Inversion issue

2024-11-11 Thread Amit Kapila
On Mon, Nov 11, 2024 at 9:05 PM Tomas Vondra  wrote:
>
> Alternatively, we could simply stop relying on the timestamps recorded
> in the commit, and instead derive "monotonic" commit timestamps after
> the fact. For example, we could track timestamps for some subset of
> commits, and then approximate the rest using LSN.
>
> AFAIK the inversion can happen only for concurrent commits, and there
> can't be that many of those. So if we record the (LSN, timestamp) for
> every 1MB of WAL, we approximate timestamps for commits in that 1MB by
> linear approximation.
>
> Of course, there's a lot of questions and details to solve - e.g. how
> often would it need to happen, when exactly would it happen, etc. And
> also how would that integrate with the logical decoding - it's easy to
> just get the timestamp from the WAL record, this would require more work
> to actually calculate it. It's only a very rough idea.
>

I think for logical decoding it would probably be easy because it
reads all the WAL. So, it can remember the commit time of the previous
commit, and if any future commit has a commit timestamp lower than
that, it can fix it by incrementing it. But outside logical decoding,
it would be tricky because we may need a separate process to fix up
commit timestamps using linear approximation. IIUC, the bigger
challenge is that such a solution would require us to read the WAL on
a continuous basis and keep fixing commit timestamps, or we would need
to read the extra WAL before using or relying on commit timestamps.
This sounds like a somewhat complex and costlier solution; although the
cost is outside the hot code path, it still matters as it leads to
extra read I/O.

-- 
With Regards,
Amit Kapila.




Re: Add reject_limit option to file_fdw

2024-11-11 Thread Kirill Reshke
On Tue, 12 Nov 2024 at 06:17, torikoshia  wrote:
>
> On 2024-11-12 01:49, Fujii Masao wrote:
> > On 2024/11/11 21:45, torikoshia wrote:
> >>> Thanks for adding the comment. It clearly states that REJECT_LIMIT
> >>> can be
> >>> a single-quoted string. However, it might also be helpful to mention
> >>> that
> >>> it can be provided as an int64 in the COPY command option. How about
> >>> updating it like this?
> >>>
> >>> 
> >>> REJECT_LIMIT can be specified in two ways: as an int64 for the COPY
> >>> command
> >>> option or as a single-quoted string for the foreign table option
> >>> using
> >>> file_fdw. Therefore this function needs to handle both formats.
> >>> 
> >>
> >> Thanks! it seems better.
> >>
> >>
> >> Attached v3 patch.
> >
> > Thanks for updating the patch! It looks like you forgot to attach it,
> > though.
>
> Oops, thanks for pointing it out.
> Here it is.
>
>
> --
> Regards,
>
> --
> Atsushi Torikoshi
> Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

Hi!

A little question from me.

This is your doc for reject_limit:

+  
+   reject_limit
+
+   
+
+ Specifies the maximum number of errors tolerated while
converting a column's
+ input value to its data type, the same as COPY's
+REJECT_LIMIT option.
+
+   
+  
+

This is how it looks on the current HEAD for copy.


REJECT_LIMIT

 
  Specifies the maximum number of errors tolerated while converting a
  column's input value to its data type, when ON_ERROR is
  set to ignore.
  If the input causes more errors than the specified value, the
COPY
  command fails, even with ON_ERROR set to
ignore.
  This clause must be used with
ON_ERROR=ignore
  and maxerror must
be positive bigint.
  If not specified, ON_ERROR=ignore
  allows an unlimited number of errors, meaning COPY will
  skip all erroneous data.
 

   

There is a difference. Should we add the REJECT_LIMIT vs. ON_ERROR
clarification for file_fdw too? Or maybe we could put in a reference to the
COPY docs.

-- 
Best regards,
Kirill Reshke




Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

2024-11-11 Thread Masahiko Sawada
On Mon, Nov 11, 2024 at 2:08 PM Tomas Vondra  wrote:
>
>
> But neither of those fixes prevents backwards move for confirmed_flush
> LSN, as enforced by asserts in the 0005 patch. I don't know if this
> assert is incorrect or now. It seems natural that once we get a
> confirmation for some LSN, we can't move before that position, but I'm
> not sure about that. Maybe it's too strict.

Hmm, I'm concerned that it might be another problem. I think there are
some cases where a subscriber sends a flush position older than the slot's
confirmed_flush as a feedback message. But it seems dangerous to always
accept it as a new confirmed_flush value. It could happen that
confirmed_flush gets set to an LSN older than restart_lsn.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



