Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-12-08 Thread Amit Kapila
On Fri, Dec 8, 2023 at 7:16 PM Shlok Kyal  wrote:
>
> > Then let's go with the original patch only. BTW, it took almost the
> > same time (105 wallclock secs) in my environment (CentOS VM) to run
> > tests in src/test/subscription both with and without the patch. I took
> > a median of five runs. I have slightly adjusted the comments and
> > commit message in the attached. If you are fine with this, we can
> > commit and backpatch this.
>
> I have tested the patch for all the branches from PG 17 to PG 12.
> The same patch applies cleanly on all branches. Also, the same patch
> resolves the issue on all the branches.
> I ran all the tests and all the tests passed on each branch.
>
> I also reviewed the patch and it looks good to me.
>

Thanks, I could also reproduce the issue on back branches (tried till
12), and the fix works. I'll push this on Monday.

-- 
With Regards,
Amit Kapila.




Re: remaining sql/json patches

2023-12-08 Thread jian he
Hi.
Does the comment for JsonPathExecResult need to be updated? It has
changed a lot.




Re: Eager page freeze criteria clarification

2023-12-08 Thread Melanie Plageman
On Wed, Nov 8, 2023 at 9:23 PM Melanie Plageman
 wrote:
> The next step is to devise different heuristics and measure their
> efficacy. IMO, the goal of the algorithm is to freeze pages in a
> relation such that we drive early unfreezes/freezes -> 0 and pages
> frozen/number of pages of a certain age -> 1.

Attached is a patchset with an adaptive freezing algorithm that works
well. It keeps track of the pages' unmodified duration after having been
set all-visible and uses that to predict how likely a page is to be
modified given its age.

Each time an all-visible page is modified by an update, delete, or tuple
lock, if the page was all-visible for less than target_freeze_duration
(a new guc that specifies the minimum amount of time you would like data
to stay frozen), it is counted as an "early unset" and the duration that
it was unmodified is added into an accumulator. Before each relation is
vacuumed, we calculate the mean and standard deviation of these
durations using the accumulator. This is used to calculate a page LSN
threshold demarcating pages with a < 5% likelihood of being modified
before target_freeze_duration. We don't freeze pages younger than this
threshold.

I've done away with the ring buffer of vacuums and the idea of
attributing an "unset" to the vacuum that set it all visible. Instead,
using an "accumulator", I keep a running sum of the page ages, along
with the cardinality of the accumulated set and the sum squared of page
ages. This data structure allows us to extract the mean and standard
deviation, at any time, from an arbitrary number of values in constant
space; and is used to model the pattern of these unsets as a normal
distribution that we can use to try and predict whether or not a page
will be modified.

Values can be "removed" from the accumulator by simply decrementing its
cardinality and decreasing the sum and sum squared by a value that will
not change the mean and standard deviation of the overall distribution.
To adapt to a table's changing access patterns, we'll need to remove
values from this accumulator over time, but this patch doesn't yet
decide when to do this. A simple solution may be to cap the cardinality
of the accumulator to the greater of 1% of the table size, or some fixed
number of values (perhaps 200?). Even without such removal of values,
the distribution recorded in the accumulator will eventually skew toward
more recent data, albeit at a slower rate.
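
To make that concrete, a minimal sketch of such an accumulator could look
like the following (names and details here are illustrative only, not
necessarily what the patch does; sqrt() needs <math.h>):

typedef struct UnsetAccumulator
{
	double		n;			/* cardinality of the accumulated set */
	double		sum;		/* running sum of page ages */
	double		sum_sq;		/* running sum of squared page ages */
} UnsetAccumulator;

static void
accum_insert(UnsetAccumulator *acc, double age)
{
	acc->n += 1;
	acc->sum += age;
	acc->sum_sq += age * age;
}

static double
accum_mean(const UnsetAccumulator *acc)
{
	return acc->sum / acc->n;
}

static double
accum_stddev(const UnsetAccumulator *acc)
{
	double		mean = accum_mean(acc);

	/* variance = E[X^2] - E[X]^2 */
	return sqrt(acc->sum_sq / acc->n - mean * mean);
}

/*
 * Remove one value without changing the mean or standard deviation:
 * subtract the mean from the sum, subtract (variance + mean^2) from the
 * sum of squares, and decrement the cardinality.
 */
static void
accum_forget(UnsetAccumulator *acc)
{
	double		mean = accum_mean(acc);
	double		stddev = accum_stddev(acc);

	acc->sum -= mean;
	acc->sum_sq -= stddev * stddev + mean * mean;
	acc->n -= 1;
}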

This algorithm is able to predict when pages will be modified before
target_freeze_threshold as long as their modification pattern fits a
normal distribution -- this accommodates access patterns where, after
some period of time, pages become less likely to be modified the older
they are. As an example, however, access patterns where modification
times are bimodal aren't handled well by this model (imagine that half
your modifications are to very new data and the other half are to data
that is much older but still younger than target_freeze_duration). If
this is a common access pattern, the normal distribution could easily be
swapped out for another distribution. The current accumulator is capable
of tracking a distribution's first two moments of central tendency (the
mean and standard deviation). We could track more if we wanted to use a
fancier distribution.

We also must consider what to do when we have few unsets, e.g. with an
insert-only workload. When there are very few unsets (I chose 30 which
the internet says is the approximate minimum number required for the
central limit theorem to hold), we can choose to always freeze freezable
pages; above this limit, the calculated page threshold is used. However,
we may not want to make predictions based on 31 values either. To avoid
this "cliff", we could modify the algorithm a bit to skew the mean and
standard deviation of the distribution using a confidence interval based
on the number of values we've collected.
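
Putting those two rules together, the freeze decision might look roughly
like the sketch below (same illustrative names as above; 1.645 is the
one-sided 95% normal quantile corresponding to the < 5% modification
likelihood, and the confidence-interval skewing mentioned above is left
out):

#define MIN_UNSETS_FOR_PREDICTION 30	/* rough CLT minimum */

/*
 * Decide whether to freeze a freezable all-visible page whose age (in
 * LSNs) is "age".  With too few observations we always freeze; otherwise
 * we freeze only pages at or above the threshold, i.e. pages with a < 5%
 * estimated chance of being modified before target_freeze_duration.
 */
static bool
should_freeze(const UnsetAccumulator *acc, double age)
{
	double		threshold;

	if (acc->n < MIN_UNSETS_FOR_PREDICTION)
		return true;			/* e.g. insert-only workloads */

	/* 95th percentile of the fitted distribution of early-unset ages */
	threshold = accum_mean(acc) + 1.645 * accum_stddev(acc);

	return age >= threshold;
}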

The goal is to keep pages frozen for at least target_freeze_duration.
target_freeze_duration is in seconds and pages only have a last
modification LSN, so target_freeze_duration must be converted to LSNs.
To accomplish this, I've added an LSNTimeline data structure, containing
XLogRecPtr, TimestampTz pairs stored with decreasing precision as they
age. When we need to translate the guc value to LSNs, we linearly
interpolate it on this timeline. For the time being, the global
LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There
is no reason it can't be updated with some other cadence and/or by some
other process (nothing about it is inherently tied to vacuum). The
cached translated value of target_freeze_duration is stored in each
table's stats. This is arbitrary as it is not a table-level stat.
However, it needs to be located somewhere that is accessible on
update/delete. We may want to recalculate it more often than once per
table vacuum, especially in case of long-running vacuums.
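
For instance, the interpolation step alone could be sketched like this
(the real LSNTimeline additionally manages the array of pairs and their
decreasing precision; XLogRecPtr is a uint64 LSN and TimestampTz is int64
microseconds):

static XLogRecPtr
estimate_lsn_at_time(TimestampTz t0, XLogRecPtr lsn0,
					 TimestampTz t1, XLogRecPtr lsn1,
					 TimestampTz target)
{
	double		frac = (double) (target - t0) / (double) (t1 - t0);

	return lsn0 + (XLogRecPtr) (frac * (double) (lsn1 - lsn0));
}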

To benchmark this new heuristic (I'm 

How abnormal server shutdown could be detected by tests?

2023-12-08 Thread Alexander Lakhin

Hello hackers,

While studying bug #18158, I've come to the conclusion that the existing
testing infrastructure is unable to detect abnormal situations of some
kind.

Just a simple example:
With Assert(0) injected in walsender (say:
sed "s/WalSndDone(send_data)/Assert(0)/" -i src/backend/replication/walsender.c
), I observe the following:
$ make -s check -C src/test/recovery PROVE_TESTS="t/012*"

t/012_subtransactions.pl .. ok
All tests successful.

(The same with meson:
1/1 postgresql:recovery / recovery/012_subtransactions OK    3.24s   12 subtests passed)

But:
$ grep TRAP src/test/recovery/tmp_check/log/*
src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line: 2528, PID: 373729
src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line: 2528, PID: 373750
src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line: 2528, PID: 373821
src/test/recovery/tmp_check/log/012_subtransactions_standby.log:TRAP: failed Assert("0"), File: "walsender.c", Line: 2528, PID: 373786


src/test/recovery/tmp_check/log/012_subtransactions_primary.log contains:
...
2023-12-09 03:23:20.210 UTC [375933] LOG:  server process (PID 375975) was terminated by signal 6: Aborted
2023-12-09 03:23:20.210 UTC [375933] DETAIL:  Failed process was running: START_REPLICATION 0/300 TIMELINE 3
2023-12-09 03:23:20.210 UTC [375933] LOG:  terminating any other active server processes
2023-12-09 03:23:20.210 UTC [375933] LOG:  abnormal database system shutdown
2023-12-09 03:23:20.211 UTC [375933] LOG:  database system is shut down
...

So the shutdown definitely was considered abnormal, but we missed that.

With log_min_messages = DEBUG3, I can see in the log:
2023-12-09 03:26:50.145 UTC [377844] LOG:  abnormal database system shutdown
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  shmem_exit(1): 5 on_shmem_exit callbacks to make
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  cleaning up orphaned dynamic shared memory with ID 2898643884
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  cleaning up dynamic shared memory control segment with ID 3446966170
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  proc_exit(1): 2 callbacks to make
2023-12-09 03:26:50.146 UTC [377844] LOG:  database system is shut down
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  exit(1)
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  proc_exit(-1): 0 callbacks to make

The postmaster process exits with exit code 1, but pg_ctl can't get the
code and just reports that stop was completed successfully.

Should this be improved? And if yes, how can it be done?
Maybe the postmaster shouldn't remove its postmaster.pid when it exits
abnormally, to let pg_ctl know of it?

Best regards,
Alexander




RE: Make COPY format extendable: Extract COPY TO format implementations

2023-12-08 Thread Hayato Kuroda (Fujitsu)
Dear Junagn, Sutou-san,

Basically I agree with your point - improving extensibility is good.
(I remember that this theme was discussed at the Japan PostgreSQL conference.)
Below are my comments for your patch.

01. General

Just to confirm - is it OK to partially implement the APIs? E.g., only COPY TO
is available. Currently the patch does not seem to consider the case where an
API is not implemented.

02. General

It might be trivial, but could you please clarify how users can extend the
format? Is it OK to do the below steps?

1. Create a handler function, via CREATE FUNCTION,
2. Register a handler, via new SQL (CREATE COPY HANDLER),
3. Specify the added handler as COPY ... FORMAT clause.
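
If that is the flow, I imagine the handler in step 1 would be a skeleton
roughly like below, modeled on FDW handler functions. (CopyToRoutine and
its members are just my guesses here, not the actual names in your patch.)

PG_FUNCTION_INFO_V1(my_copy_handler);

Datum
my_copy_handler(PG_FUNCTION_ARGS)
{
	/* fill a routine struct with per-format callbacks */
	CopyToRoutine *routine = palloc0(sizeof(CopyToRoutine));

	routine->start_fn = my_copy_to_start;
	routine->onerow_fn = my_copy_to_one_row;
	routine->end_fn = my_copy_to_end;

	PG_RETURN_POINTER(routine);
}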

03. General

Could you please add documentation-related tasks to your TODO? I imagine
something like fdwhandler.sgml.

04. General - copyright

For newly added files, the below copyright seems sufficient. See 
applyparallelworker.c.

```
 * Copyright (c) 2023, PostgreSQL Global Development Group
```

05. src/include/catalog/* files

IIUC, 8000 or higher OIDs should be used while developing a patch. 
src/include/catalog/unused_oids
would suggest a candidate which you can use.

06. copy.c

I felt that we could create one file per copy format, like
copy_{text|csv|binary}.c, similar to index access methods.
What do others think?

07. fmt_to_name()

I'm not sure the function is really needed. Can we follow the style of
get_foreign_data_wrapper_oid() and remove the function?

08. GetCopyRoutineByName()

Should we use a syscache for searching the catalog?

09. CopyToFormatTextSendEndOfRow(), CopyToFormatBinaryStart()

Comments still refer to CopyHandlerOps, whereas it was renamed.

10. copy.h

Per foreign.h and fdwapi.h, should we add a new header file and move some APIs?

11. copy.h

```
-/* These are private in commands/copy[from|to].c */
-typedef struct CopyFromStateData *CopyFromState;
-typedef struct CopyToStateData *CopyToState;
```

Are above changes really needed?

12. CopyFormatOptions

Can we remove `bool binary` in the future?

13. external functions

```
+extern void CopyToFormatTextStart(CopyToState cstate, TupleDesc tupDesc);
+extern void CopyToFormatTextOneRow(CopyToState cstate, TupleTableSlot *slot);
+extern void CopyToFormatTextEnd(CopyToState cstate);
+extern void CopyFromFormatTextStart(CopyFromState cstate, TupleDesc tupDesc);
+extern bool CopyFromFormatTextNext(CopyFromState cstate, ExprContext *econtext,
+
Datum *values, bool *nulls);
+extern void CopyFromFormatTextErrorCallback(CopyFromState cstate);
+
+extern void CopyToFormatBinaryStart(CopyToState cstate, TupleDesc tupDesc);
+extern void CopyToFormatBinaryOneRow(CopyToState cstate, TupleTableSlot *slot);
+extern void CopyToFormatBinaryEnd(CopyToState cstate);
+extern void CopyFromFormatBinaryStart(CopyFromState cstate, TupleDesc tupDesc);
+extern bool CopyFromFormatBinaryNext(CopyFromState cstate,
ExprContext *econtext,
+
  Datum *values, bool *nulls);
+extern void CopyFromFormatBinaryErrorCallback(CopyFromState cstate);
```

FYI - If you add files for {text|csv|binary}, these declarations can be removed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-08 Thread Peter Geoghegan
On Fri, Dec 8, 2023 at 10:46 AM Alexander Korotkov  wrote:
> > In your example "foo = 90" is satisfied by _bt_first(), but "foo >
> > 99::int8" is not.  I think this could be resolved by introducing a
> > separate flag exactly distinguishing scan keys used for _bt_first().
> > I'm going to post the patch doing this.
>
> The draft patch is attached.  It requires polishing and proper
> commenting.  But I hope the basic idea is clear.  Do you think this is
> the way forward?

Does this really need to work at the scan key level, rather than at
the whole-scan level? Wouldn't it make more sense to just totally
disable it for the whole scan, since we'll barely ever need to do that
anyway?

My ScalarArrayOpExpr patch will need to disable this optimization,
since with that patch in place we don't necessarily go through
_bt_first each time the search-type scan keys must change. We might
need to check a few tuples from before the _bt_first-wise position of
the next set of array values, which is a problem with
opposite-direction-only inequalities (it's a little bit like the
situation from my test case, actually). That's partly why I'd prefer
this to work at the whole-scan level (though I also just don't think
that inventing SK_BT_BT_FIRST makes much sense).

I think that you should make it clearer that this whole optimization
only applies to required *inequalities*, which can be required in the
opposite direction *only*. It should be more obvious from looking at
the code that the optimization doesn't apply to required equality
strategy scan keys (those are always required in *both* scan
directions or in neither direction, so unlike the closely related
prefix optimization added by the same commit, they just can't use the
optimization that we need to fix here).

BTW, do we really need to keep around the BTScanOpaqueData.firstPage
field? Why can't the call to _bt_readpage from _bt_first (and from
_bt_endpoint) just pass "firstPage=true" as a simple argument? Note
that the first call to _bt_readpage must take place from _bt_first (or
from _bt_endpoint). The first _bt_first call is already kind of
special, in a way that is directly related to this issue. I added some
comments about that to today's commit c9c0589fda, in fact -- I think
it's an important issue in general.

-- 
Peter Geoghegan




Re: Streaming I/O, vectored I/O (WIP)

2023-12-08 Thread Thomas Munro
On Sat, Dec 9, 2023 at 7:25 AM Andres Freund  wrote:
> On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
> > On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas  wrote:
> > > Maybe we should bite the bullet and always retry short writes in
> > > FileWriteV(). Is that what you meant by "handling them"?
> > > If the total size is expensive to calculate, how about passing it as an
> > > extra argument? Presumably it is cheap for the callers to calculate at
> > > the same time that they build the iovec array?

> > There is another problem with pushing it down to fd.c, though.
> > Suppose you try to write 8192 bytes, and the kernel says "you wrote
> > 4096 bytes" so your loop goes around again with the second half the
> > data and now the kernel says "-1, ENOSPC".  What are you going to do?
> > fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
> > so you'd either have to return -1, ENOSPC (converting short writes
> > into actual errors, a lie because you did write some data), or return
> > 4096 (and possibly also set errno = ENOSPC as we have always done).
> > So you can't really handle this problem at this level, can you?
> > Unless you decide that fd.c should get into the business of raising
> > errors for I/O failures, which would be a bit of a departure.
> >
> > That's why I did the retry higher up in md.c.
>
> I think that's the right call. I think for AIO we can't do retry handling
> purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
> buy us that much in md.c anyway, we still need to handle the cross segment
> case and such, from what I can tell?

Heikki, what do you think about this:  we could go with the v3 fd.c
and md.c patches, but move adjust_iovec_for_partial_transfer() into
src/common/file_utils.c, so that at least that slightly annoying part
of the job is available for re-use by future code that faces the same
problem?
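
For reference, the core of that helper is small; roughly (the version in
the patch may differ in name and signature):

#include <sys/uio.h>

/*
 * After a partial read/write of "transferred" bytes, advance past fully
 * transferred iovecs and trim the first partially transferred one.
 * Returns the remaining iovec count; *iov is advanced in place.
 */
static int
adjust_iovec_for_partial_transfer(struct iovec **iov, int iovcnt,
								  size_t transferred)
{
	while (iovcnt > 0 && transferred >= (*iov)->iov_len)
	{
		transferred -= (*iov)->iov_len;
		(*iov)++;
		iovcnt--;
	}

	if (iovcnt > 0 && transferred > 0)
	{
		(*iov)->iov_base = (char *) (*iov)->iov_base + transferred;
		(*iov)->iov_len -= transferred;
	}

	return iovcnt;
}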

Note that in file_utils.c we already have pg_pwritev_with_retry(),
which is clearly related to all this: that is a function that
guarantees to either complete the full pwritev() or throw an ERROR,
but leaves it undefined whether any data has been written on ERROR.
It has to add up the size too, and it adjusts the iovec array at the
same time, so it wouldn't use adjust_iovec_for_partial_transfer().
This is essentially the type of interface that I declined to put into
fd.c's FileWrite() and FileRead() because I feel like it doesn't fit
with the existing functions' primary business of adding vfd support to
well known basic I/O functions that return bytes transferred and set
errno.  Perhaps someone might later want to introduce File*WithRetry()
wrappers or something if that proves useful?  I wouldn't want them for
md.c though because I already know the size.




Re: backtrace_on_internal_error

2023-12-08 Thread Tom Lane
Andres Freund  writes:
> On 2023-12-08 17:29:45 -0500, Tom Lane wrote:
>> Agreed.  I think we want to do that after the initial handshake,
>> too, so maybe as attached.

> I was wondering about that too. But if we do so, why not also do it for
> writes?

Writes don't act that way, do they?  EOF on a pipe gives you an error,
not silently reporting that zero bytes were written and leaving you
to retry indefinitely.

What I was wondering about was if we needed similar changes on the
libpq side, but it's still about reads not writes.

regards, tom lane




Re: backtrace_on_internal_error

2023-12-08 Thread Andres Freund
Hi,

On 2023-12-08 17:35:26 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > I thought it'd be nice to have a test for this, particularly because it's not
> > clear that the behaviour is consistent across openssl versions.
> 
> Perhaps, but ...
> 
> > To deal with that, I changed the test to instead check if "not accept SSL
> > connection: Success" is not logged.
> 
> ... testing only that much seems entirely not worth the cycles, given the
> shape of the patches we both just made.  If we can't rely on "errno != 0"
> to ensure we won't get "Success", there is one heck of a lot of other
> code that will be broken worse than this.

I was certainly more optimistic about the usefulness of the test before
discovering the above difficulties...

I considered accepting both ECONNRESET and the errno = 0 phrasing, but after
discovering that the phrasing differs between platforms that seemed less
attractive.

I guess the test might still provide some value, by ensuring those paths are
reached.

Greetings,

Andres Freund




Re: backtrace_on_internal_error

2023-12-08 Thread Andres Freund
On 2023-12-08 17:29:45 -0500, Tom Lane wrote:
> Andres Freund  writes:
> >> I think I figured it out. Looks like we need to translate a closed socket
> >> (recvfrom() returning 0) to ECONNRESET or such.
> 
> > Seems like we should just treat errno == 0 as a reason to emit the "EOF
> > detected" message?
> 
> Agreed.  I think we want to do that after the initial handshake,
> too, so maybe as attached.

I was wondering about that too. But if we do so, why not also do it for
writes?




Re: micro-optimizing json.c

2023-12-08 Thread Tom Lane
Nathan Bossart  writes:
> Here are a couple more easy micro-optimizations in nearby code.  I've split
> them into individual patches for review, but I'll probably just combine
> them into one patch before committing.

LGTM

regards, tom lane




Re: backtrace_on_internal_error

2023-12-08 Thread Tom Lane
Andres Freund  writes:
> I thought it'd be nice to have a test for this, particularly because it's not
> clear that the behaviour is consistent across openssl versions.

Perhaps, but ...

> To deal with that, I changed the test to instead check if "not accept SSL
> connection: Success" is not logged.

... testing only that much seems entirely not worth the cycles, given the
shape of the patches we both just made.  If we can't rely on "errno != 0"
to ensure we won't get "Success", there is one heck of a lot of other
code that will be broken worse than this.

regards, tom lane




Re: backtrace_on_internal_error

2023-12-08 Thread Tom Lane
Andres Freund  writes:
>> I think I figured it out. Looks like we need to translate a closed socket
>> (recvfrom() returning 0) to ECONNRESET or such.

> Seems like we should just treat errno == 0 as a reason to emit the "EOF
> detected" message?

Agreed.  I think we want to do that after the initial handshake,
too, so maybe as attached.

regards, tom lane

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 6b8125695a..f0b35f08c6 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -460,6 +460,7 @@ aloop:
 	 * per-thread error queue following another call to an OpenSSL I/O
 	 * routine.
 	 */
+	errno = 0;
 	ERR_clear_error();
 	r = SSL_accept(port->ssl);
 	if (r <= 0)
@@ -496,7 +497,7 @@ aloop:
 										 WAIT_EVENT_SSL_OPEN_SERVER);
 				goto aloop;
 			case SSL_ERROR_SYSCALL:
-				if (r < 0)
+				if (r < 0 && errno != 0)
 					ereport(COMMERROR,
 							(errcode_for_socket_access(),
 							 errmsg("could not accept SSL connection: %m")));
@@ -732,7 +733,7 @@ be_tls_read(Port *port, void *ptr, size_t len, int *waitfor)
 			break;
 		case SSL_ERROR_SYSCALL:
 			/* leave it to caller to ereport the value of errno */
-			if (n != -1)
+			if (n != -1 || errno == 0)
 			{
 				errno = ECONNRESET;
 				n = -1;


Re: Row pattern recognition

2023-12-08 Thread Tatsuo Ishii
> On 04.12.23 12:40, Tatsuo Ishii wrote:
>> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
>> index d631ac89a9..5a77fca17f 100644
>> --- a/src/backend/parser/gram.y
>> +++ b/src/backend/parser/gram.y
>> @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
>>  DefElem    *defelt;
>>  SortBy *sortby;
>>  WindowDef  *windef;
>> +RPCommonSyntax  *rpcom;
>> +RPSubsetItem *rpsubset;
>>  JoinExpr   *jexpr;
>>  IndexElem  *ielem;
>>  StatsElem  *selem;
>> @@ -278,6 +280,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
>>  MergeWhenClause *mergewhen;
>>  struct KeyActions *keyactions;
>>  struct KeyAction *keyaction;
>> +RPSkipTo skipto;
>>   }
>> %type  stmt toplevel_stmt schema_stmt routine_body_stmt
> 
> It is usually not the style to add an entry for every node type to the
> %union.  Otherwise, we'd have hundreds of entries in there.

Ok, I have removed the node types and used existing node types.  Also
I have moved the RPR-related %types to the same place to make it easier
to see what is added by RPR.

>> @@ -866,6 +878,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
>>   %nonassoc UNBOUNDED /* ideally would have same precedence as IDENT */
>>   %nonassoc IDENT PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE ROLLUP
>>  SET KEYS OBJECT_P SCALAR VALUE_P WITH WITHOUT
>> +%nonassoc   MEASURES AFTER INITIAL SEEK PATTERN_P
>>   %left Op OPERATOR /* multi-character ops and user-defined operators */
>>   %left  '+' '-'
>>   %left  '*' '/' '%'
> 
> It was recently discussed that these %nonassoc should ideally all have
> the same precedence.  Did you consider that here?

No, I didn't realize it. Thanks for pointing it out. I have removed
%nonassoc so that MEASURES etc. have the same precedence as IDENT etc.

Attached is the new diff of gram.y against master branch.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..6c41aa2e9f 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -659,6 +659,21 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
 
+%type	row_pattern_measure_item
+		row_pattern_definition
+%type	opt_row_pattern_common_syntax
+		opt_row_pattern_skip_to
+		row_pattern_subset_item
+		row_pattern_term
+%type	opt_row_pattern_measures
+		row_pattern_measure_list
+		row_pattern_definition_list
+		opt_row_pattern_subset_clause
+		row_pattern_subset_list
+		row_pattern_subset_rhs
+		row_pattern
+%type	opt_row_pattern_initial_or_seek
+		first_or_last
 
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
@@ -702,7 +717,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE
 
 	DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS
-	DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC
+	DEFERRABLE DEFERRED DEFINE DEFINER DELETE_P DELIMITER DELIMITERS DEPENDS DEPTH DESC
DETACH DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P
DOUBLE_P DROP
 
@@ -718,7 +733,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
HANDLER HAVING HEADER_P HOLD HOUR_P
 
 	IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE
-	INCLUDING INCREMENT INDENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
+	INCLUDING INCREMENT INDENT INDEX INDEXES INHERIT INHERITS INITIAL INITIALLY INLINE_P
INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION
 
@@ -731,7 +746,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
 
-   MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE METHOD
+   MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MEASURES MERGE METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
 
NAME_P NAMES NATIONAL NATURAL NCHAR NEW NEXT NFC NFD NFKC NFKD NO NONE
@@ 

Re: backtrace_on_internal_error

2023-12-08 Thread Andres Freund
Hi,

On 2023-12-08 11:33:16 -0800, Andres Freund wrote:
> On 2023-12-08 10:51:01 -0800, Andres Freund wrote:
> > On 2023-12-08 13:46:07 -0500, Tom Lane wrote:
> > > Andres Freund  writes:
> > > > On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
> > > >> Hmm, don't suppose you have a way to reproduce that?
> > >
> > > > After a bit of trying, yes.  I put an abort() into pgtls_open_client(), after
> > > > initialize_SSL(). Connecting does result in:
> > > > LOG:  could not accept SSL connection: Success
> > >
> > > OK.  I can dig into that, unless you're already on it?
>
> [...]
>
> Seems like we should just treat errno == 0 as a reason to emit the "EOF
> detected" message?

I thought it'd be nice to have a test for this, particularly because it's not
clear that the behaviour is consistent across openssl versions.

I couldn't think of a way to do that with psql. But it's just a few lines of
perl to gin up an "SSL" startup packet and then close the socket.  I couldn't
quite figure out when IO::Socket::INET was added, but I think it's likely been
long enough, I see references from 1999.

This worked easily on linux and freebsd, but not on windows and macos, where
it seems to cause ECONNRESET. I thought that explicitly shutting down the
socket might help, but that just additionally caused freebsd to fail.

Windows uses an older openssl, so it could also be caused by the behaviour
differing back then.

To deal with that, I changed the test to instead check if "not accept SSL
connection: Success" is not logged.  I'm not sure that actually would be
logged on windows, it does seem to have different strings for errors than
other platforms.

Greetings,

Andres Freund
>From d5133a5783b743c0235e016dcb438d2dda82a07b Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 8 Dec 2023 13:09:38 -0800
Subject: [PATCH v1] WIP: Fix error message for client disconnects during SSL
 startup

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20231208193316.5ylgs4zb6zngw...@awork3.anarazel.de
Backpatch:
---
 src/backend/libpq/be-secure-openssl.c |  7 -
 src/test/ssl/t/001_ssltests.pl| 42 +++
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 6b8125695a3..3c182c2d3a4 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -496,7 +496,12 @@ aloop:
 										 WAIT_EVENT_SSL_OPEN_SERVER);
 				goto aloop;
 			case SSL_ERROR_SYSCALL:
-				if (r < 0)
+				/*
+				 * If errno is 0, the client closed the socket without
+				 * shutting down the SSL connection, e.g. because the client
+				 * terminated.
+				 */
+				if (r < 0 && errno != 0)
 					ereport(COMMERROR,
 							(errcode_for_socket_access(),
 							 errmsg("could not accept SSL connection: %m")));
diff --git a/src/test/ssl/t/001_ssltests.pl b/src/test/ssl/t/001_ssltests.pl
index d921f1dde9f..e23300b318e 100644
--- a/src/test/ssl/t/001_ssltests.pl
+++ b/src/test/ssl/t/001_ssltests.pl
@@ -213,6 +213,48 @@ $node->connect_fails(
 	? qr/server accepted connection without a valid SSL certificate/
 	: qr/sslcertmode value "require" is not supported/);
 
+
+# Check log message makes sense when initiating SSL connection establishment
+# but then closing the socket immediately. In the past that sometimes was
+# logged as "could not accept SSL connection: Success", which is misleading.
+# Such disconnects happen regularly in real workloads, but can't easily be
+# tested with psql. Therefore just establish the connection the hard way -
+# it's easy enough here, because we just need to send a startup packet asking
+# for SSL to be initiated.
+{
+	use IO::Socket::INET;
+
+	my $sock = IO::Socket::INET->new(
+		PeerAddr => $SERVERHOSTADDR,
+		PeerPort => $node->port,
+		Proto => 'tcp');
+
+	my $log_location = -s $node->logfile;
+
+	# ssl request is message length as 32 bit big endian, 4 byte magic "ssl"
+	# protocol version, terminated by a 0  byte.
+	my $message = pack "NNx", 4 + 4 + 1, (1234 << 16) | 5679;
+
+	$sock->send($message);
+	$sock->close();
+
+	# On at least macos and windows closing the socket leads to a ECONNRESET
+	# on the server side. Therefore we don't check if the right message is
+	# logged, but that the "Success" message isn't.
+	#
+	# First wait for the connection attempt to be logged, so we can test for
+	# the absence of the "Success" message without a race.
+	ok( $node->wait_for_log(
+			qr/could not accept SSL connection:/,
+			$log_location),
+		'aborted connection attempt is logged');
+
+	$node->log_check("aborted connection attempt is logged as failure",
+		$log_location,
+		(log_unlike => [qr/could not accept SSL connection: Success/]));
+}
+
+
 # CRL tests
 
 # Invalid CRL filename is the same as no CRL, succeeds
-- 
2.38.0



Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Joe Conway

On 12/8/23 14:45, Daniel Verite wrote:
> Joe Conway wrote:
>
>> copyto_json.007.diff
>
> When the source has json fields with non-significant line feeds, the COPY
> output has these line feeds too, which makes the output incompatible
> with rule #2 at https://jsonlines.org  ("2. Each Line is a Valid JSON
> Value").
>
> create table j(f json);
>
> insert into j values('{"a":1,
> "b":2
> }');
>
> copy j to stdout (format json);
>
> Result:
> {"f":{"a":1,
> "b":2
> }}
>
> Is that expected? copy.sgml in 007 doesn't describe the output
> in terms of lines so it's hard to tell from the doc.


The patch as-is just does the equivalent of row_to_json():
8<
select row_to_json(j) from j;
 row_to_json
--
 {"f":{"a":1,+
 "b":2   +
 }}
(1 row)
8<

So yeah, that is how it works today. I will take a look at what it would 
take to fix it.
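
FWIW, the jsonlines-compatible output for that example would presumably be
the minified form:

{"f":{"a":1,"b":2}}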



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: micro-optimizing json.c

2023-12-08 Thread Nathan Bossart
Here are a couple more easy micro-optimizations in nearby code.  I've split
them into individual patches for review, but I'll probably just combine
them into one patch before committing.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 79727aab47c7d7cd733ce21c8241051de6c9ae1e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 8 Dec 2023 14:03:00 -0600
Subject: [PATCH v1 1/3] micro-optimize bool code path in
 datum_to_json_internal()

shows about an 18% speed up on my machine
---
 src/backend/utils/adt/json.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index e585981d08..4ac5837a23 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -208,11 +208,14 @@ datum_to_json_internal(Datum val, bool is_null, StringInfo result,
 			composite_to_json(val, result, false);
 			break;
 		case JSONTYPE_BOOL:
-			outputstr = DatumGetBool(val) ? "true" : "false";
 			if (key_scalar)
-				escape_json(result, outputstr);
+				appendStringInfoChar(result, '"');
+			if (DatumGetBool(val))
+				appendBinaryStringInfo(result, "true", strlen("true"));
 			else
-				appendStringInfoString(result, outputstr);
+				appendBinaryStringInfo(result, "false", strlen("false"));
+			if (key_scalar)
+				appendStringInfoChar(result, '"');
 			break;
 		case JSONTYPE_NUMERIC:
 			outputstr = OidOutputFunctionCall(outfuncoid, val);
-- 
2.25.1

>From bcba82c9ba451c011ac66a70d6cad5be3ccf2d67 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 8 Dec 2023 14:10:00 -0600
Subject: [PATCH v1 2/3] micro-optimize null code path in
 datum_to_json_internal()

shows about a 29% speed up on my machine
---
 src/backend/utils/adt/json.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 4ac5837a23..24e12244cc 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -186,7 +186,7 @@ datum_to_json_internal(Datum val, bool is_null, StringInfo result,
 
 	if (is_null)
 	{
-		appendStringInfoString(result, "null");
+		appendBinaryStringInfo(result, "null", strlen("null"));
 		return;
 	}
 
-- 
2.25.1

>From 35324988e9c42c275806763bd0bf6b570616a8e7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 8 Dec 2023 14:15:56 -0600
Subject: [PATCH v1 3/3] micro-optimize cast code path in
 datum_to_json_internal()

---
 src/backend/utils/adt/json.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 24e12244cc..aa20cf60ea 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -278,9 +278,8 @@ datum_to_json_internal(Datum val, bool is_null, StringInfo result,
 		case JSONTYPE_CAST:
 			/* outfuncoid refers to a cast function, not an output function */
 			jsontext = DatumGetTextPP(OidFunctionCall1(outfuncoid, val));
-			outputstr = text_to_cstring(jsontext);
-			appendStringInfoString(result, outputstr);
-			pfree(outputstr);
+			appendBinaryStringInfo(result, VARDATA_ANY(jsontext),
+   VARSIZE_ANY_EXHDR(jsontext));
 			pfree(jsontext);
 			break;
 		default:
-- 
2.25.1



Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Daniel Verite
Dave Cramer wrote:

> > This argument for leaving 3 as the column count makes sense to me.  I
> > agree this content is not meant to facilitate interpreting the contents at
> > a protocol level.
> >
> 
> I'd disagree. From my POV if the data comes back as a JSON Array this is
> one object and this should be reflected in the column count.

The doc says this:
"Int16
The number of columns in the data to be copied (denoted N below)."

and this formulation is repeated in PQnfields() for libpq:

"PQnfields
Returns the number of columns (fields) to be copied."

How should we interpret that sentence?
"to be copied" from what, into what, and by what means?

A plausible interpretation is "to be copied from the source data
into the COPY stream, by the backend".  So the number of columns
to be copied likely refers to the columns of the dataset, not the
"in-transit form" that is text or csv or json.

The interpretation you're proposing also makes sense, that it's just
one json column per row, or even a single-row single-column for the
entire dataset in the force_array case, but then the question is why
isn't that number of columns always 1 for the original "text" format,
since each row is represented in the stream as a single long piece of
text?
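
For reference, the CopyOutResponse ('H') message layout in the protocol
documentation is:

  Byte1('H')   identifies the message as a Start Copy Out response
  Int32        length of message contents
  Int8         overall format: 0 for textual, 1 for binary
  Int16        the number of columns in the data to be copied (N)
  Int16[N]     format codes for each column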


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Change GUC hashtable to use simplehash?

2023-12-08 Thread Jeff Davis


I committed 867dd2dc87, which means my use case for a fast GUC hash
table (quickly setting proconfigs) is now solved.

Andres mentioned that it could still be useful to reduce overhead in a
few other places:

https://postgr.es/m/20231117220830.t6sb7di6h6am4...@awork3.anarazel.de

How should we evaluate GUC hash table performance optimizations? Just
microbenchmarks, or are there end-to-end tests where the costs are
showing up?

(As I said in another email, I think the hash function APIs justify
themselves regardless of improvements to the GUC hash table.)

On Wed, 2023-12-06 at 07:39 +0700, John Naylor wrote:
> > There's already a patch to use simplehash, and the API is a bit
> > cleaner, and there's a minor performance improvement. It seems
> > fairly
> > non-controversial -- should I just proceed with that patch?
> 
> I won't object if you want to commit that piece now, but I hesitate
> to
> call it a performance improvement on its own.
> 
> - The runtime measurements I saw reported were well within the noise
> level.
> - The memory usage starts out better, but with more entries is worse.

I suppose I'll wait until there's a reason to commit it, then.

Regards,
Jeff Davis





Re: Change GUC hashtable to use simplehash?

2023-12-08 Thread Jeff Davis
On Wed, 2023-11-29 at 20:31 +0700, John Naylor wrote:
> Attached is a rough start with Andres's earlier ideas, to get
> something concrete out there.

The implementation of string hash in 0004 forgot to increment 'buf'.

I tested using the new hash function APIs for my search path cache, and
there's a significant speedup for cases not benefiting from a86c61c9ee.
It's enough that we almost don't need a86c61c9ee. So a definite +1 to
the new APIs.

Regards,
Jeff Davis





Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Daniel Verite
Joe Conway wrote:

> copyto_json.007.diff

When the source has json fields with non-significant line feeds, the COPY
output has these line feeds too, which makes the output incompatible
with rule #2 at https://jsonlines.org  ("2. Each Line is a Valid JSON
Value").

create table j(f json);

insert into j values('{"a":1,
"b":2
}');

copy j to stdout (format json);

Result:
{"f":{"a":1,
"b":2
}}

Is that expected? copy.sgml in 007 doesn't describe the output
in terms of lines so it's hard to tell from the doc.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: micro-optimizing json.c

2023-12-08 Thread Nathan Bossart
On Fri, Dec 08, 2023 at 11:51:15AM +0700, John Naylor wrote:
> This is less verbose and still compiles with constants:
> 
> use_line_feeds ? strlen(",\n ") : strlen(",");

This one worked on my machine.  I've committed the patch with that change.
Thanks everyone for the reviews!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: backtrace_on_internal_error

2023-12-08 Thread Andres Freund
Hi,

On 2023-12-08 10:51:01 -0800, Andres Freund wrote:
> On 2023-12-08 13:46:07 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
> > >> Hmm, don't suppose you have a way to reproduce that?
> >
> > > After a bit of trying, yes.  I put an abort() into pgtls_open_client(), after
> > > initialize_SSL(). Connecting does result in:
> > > LOG:  could not accept SSL connection: Success
> >
> > OK.  I can dig into that, unless you're already on it?
>
> I think I figured it out. Looks like we need to translate a closed socket
> (recvfrom() returning 0) to ECONNRESET or such.

I think we might just need to expand the existing branch for EOF:

	if (r < 0)
		ereport(COMMERROR,
				(errcode_for_socket_access(),
				 errmsg("could not accept SSL connection: %m")));
	else
		ereport(COMMERROR,
				(errcode(ERRCODE_PROTOCOL_VIOLATION),
				 errmsg("could not accept SSL connection: EOF detected")));

The openssl docs say:

 The following return values can occur:

 0
     The TLS/SSL handshake was not successful but was shut down controlled
     and by the specifications of the TLS/SSL protocol. Call SSL_get_error()
     with the return value ret to find out the reason.

 1
     The TLS/SSL handshake was successfully completed, a TLS/SSL connection
     has been established.

 <0
     The TLS/SSL handshake was not successful because a fatal error occurred
     either at the protocol level or a connection failure occurred. The
     shutdown was not clean. It can also occur if action is needed to
     continue the operation for nonblocking BIOs. Call SSL_get_error() with
     the return value ret to find out the reason.


Which fits with my reproducer - due to the abort the connection was *not* shut
down via SSL in a controlled manner, therefore r < 0.


Hm, oddly enough, there's this tidbit in the SSL_get_error() manpage:

 On an unexpected EOF, versions before OpenSSL 3.0 returned SSL_ERROR_SYSCALL,
 nothing was added to the error stack, and errno was 0. Since OpenSSL 3.0 the
 returned error is SSL_ERROR_SSL with a meaningful error on the error stack.

But I reproduced this with 3.1.


Seems like we should just treat errno == 0 as a reason to emit the "EOF
detected" message?



I wonder if we should treat send/recv returning 0 different from an error
message perspective during an established connection. Right now we produce
  could not receive data from client: Connection reset by peer

because be_tls_read() sets errno to ECONNRESET - despite that not having been
returned by the OS.  But I guess that's a topic for another day.

Greetings,

Andres Freund




Re: backtrace_on_internal_error

2023-12-08 Thread Andres Freund
Hi,

On 2023-12-08 13:46:07 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
> >> Hmm, don't suppose you have a way to reproduce that?
> 
> > After a bit of trying, yes.  I put an abort() into pgtls_open_client(), after
> > initialize_SSL(). Connecting does result in:
> > LOG:  could not accept SSL connection: Success
> 
> OK.  I can dig into that, unless you're already on it?

I think I figured it out. Looks like we need to translate a closed socket
(recvfrom() returning 0) to ECONNRESET or such.

Greetings,

Andres Freund




Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-08 Thread Alexander Korotkov
On Fri, Dec 8, 2023 at 8:30 PM Alexander Korotkov  wrote:
> On Wed, Dec 6, 2023 at 6:05 AM Alexander Korotkov  
> wrote:
> > On Wed, Dec 6, 2023 at 3:46 AM Peter Geoghegan  wrote:
> > > On Tue, Dec 5, 2023 at 4:41 PM Peter Geoghegan  wrote:
> > > > "In general, when inequality keys are present, the initial-positioning
> > > > code only promises to position before the first possible match, not
> > > > exactly at the first match, for a forward scan; or after the last
> > > > match for a backward scan."
> > > >
> > > > My test case mostly just demonstrates how to reproduce the scenario
> > > > described by this sentence.
> > >
> > > I just realized that my test case wasn't quite minimized correctly. It
> > > depended on a custom function that was no longer created.
> > >
> > > Attached is a revised version that uses btint84cmp instead.
> >
> > Thank you for raising this issue.  Preprocessing of btree scan keys is
> > normally removing the redundant scan keys.  However, redundant scan
> > keys aren't removed when they have arguments of different types.
> > Please give me a bit of time to figure out how to workaround this.
>
> I dug into the problem.  I think this assumption is wrong in my commit.
>
> "When the key is required for opposite direction scan, it must be
> already satisfied by _bt_first() ..."
>
> In your example "foo = 90" is satisfied by _bt_first(), but "foo >
> 99::int8" is not.  I think this could be resolved by introducing a
> separate flag exactly distinguishing scan keys used for _bt_first().
> I'm going to post the patch doing this.

The draft patch is attached.  It requires polishing and proper
commenting.  But I hope the basic idea is clear.  Do you think this is
the way forward?

--
Regards,
Alexander Korotkov


fix_btree_skip_scan_keys_optimization-v1.patch
Description: Binary data


Re: backtrace_on_internal_error

2023-12-08 Thread Tom Lane
Andres Freund  writes:
> On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
>> Hmm, don't suppose you have a way to reproduce that?

> After a bit of trying, yes.  I put an abort() into pgtls_open_client(), after
> initialize_SSL(). Connecting does result in:
> LOG:  could not accept SSL connection: Success

OK.  I can dig into that, unless you're already on it?

regards, tom lane




Re: backtrace_on_internal_error

2023-12-08 Thread Andres Freund
Hi,

On 2023-12-08 13:23:50 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-12-08 10:05:09 -0500, Tom Lane wrote:
> >> ... there was already opinion upthread that this should be on by
> >> default, which I agree with.  You shouldn't be hitting cases like
> >> this commonly (if so, they're bugs to fix or the errcode should be
> >> rethought), and the failure might be pretty hard to reproduce.
>
> > FWIW, I did some analysis on aggregated logs on a larger number of machines,
> > and it does look like that'd be a measurable increase in log volume. There are
> > a few voluminous internal errors in core, but the bigger issue is
> > extensions. They are typically much less disciplined about assigning error
> > codes than core PG is.
>
> Well, I don't see much wrong with making a push to assign error codes
> to more calls.

Oh, very much agreed. But I suspect we won't quickly do the same for
out-of-core extensions...


> Certainly these SSL failures are not "internal" errors.
>
> > could not accept SSL connection: %m - with zero errno
> > ...
> > I'm a bit confused about the huge number of "could not accept SSL
> > connection: %m" with a zero errno. I guess we must be clearing errno
> > somehow, but I don't immediately see where. Or perhaps we need to
> > actually look at what SSL_get_error() returns?
>
> Hmm, don't suppose you have a way to reproduce that?

After a bit of trying, yes.  I put an abort() into pgtls_open_client(), after
initialize_SSL(). Connecting does result in:

LOG:  could not accept SSL connection: Success

Greetings,

Andres Freund




Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)

2023-12-08 Thread Alexander Korotkov
On Wed, Dec 6, 2023 at 6:05 AM Alexander Korotkov  wrote:
> On Wed, Dec 6, 2023 at 3:46 AM Peter Geoghegan  wrote:
> > On Tue, Dec 5, 2023 at 4:41 PM Peter Geoghegan  wrote:
> > > "In general, when inequality keys are present, the initial-positioning
> > > code only promises to position before the first possible match, not
> > > exactly at the first match, for a forward scan; or after the last
> > > match for a backward scan."
> > >
> > > My test case mostly just demonstrates how to reproduce the scenario
> > > described by this sentence.
> >
> > I just realized that my test case wasn't quite minimized correctly. It
> > depended on a custom function that was no longer created.
> >
> > Attached is a revised version that uses btint84cmp instead.
>
> Thank you for raising this issue.  Preprocessing of btree scan keys is
> normally removing the redundant scan keys.  However, redundant scan
> keys aren't removed when they have arguments of different types.
> Please give me a bit of time to figure out how to workaround this.

I dug into the problem.  I think this assumption is wrong in my commit.

"When the key is required for opposite direction scan, it must be
already satisfied by _bt_first() ..."

In your example "foo = 90" is satisfied by _bt_first(), but "foo >
99::int8" is not.  I think this could be resolved by introducing a
separate flag exactly distinguishing scan keys used for _bt_first().
I'm going to post the patch doing this.

--
Regards,
Alexander Korotkov




Re: Streaming I/O, vectored I/O (WIP)

2023-12-08 Thread Andres Freund
Hi,

On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
> On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas  wrote:
> > On 29/11/2023 21:39, Thomas Munro wrote:
> > > One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
> > > callers believe that short writes set errno: they error out with a
> > > message including %m.  We have historically set errno = ENOSPC inside
> > > FileWrite() if the write size was unexpectedly small AND the kernel
> > > didn't set errno to a non-zero value (having set it to zero ourselves
> > > earlier).  In FileWriteV(), I didn't want to do that because it is
> > > expensive to compute the total write size from the vector array and we
> > > managed to measure an effect due to that in some workloads.
> > >
> > > Note that the smgr patch actually handles short writes by continuing,
> > > instead of raising an error.  Short writes do already occur in the
> > > wild on various systems for various rare technical reasons other than
> > > ENOSPC I have heard (imagine transient failure to acquire some
> > > temporary memory that the kernel chooses not to wait for, stuff like
> > > that, though certainly many people and programs believe they should
> > > not happen[1]), and it seems like a good idea to actually handle them
> > > as our write sizes increase and the probability of short writes might
> > > presumably increase.
> >
> > Maybe we should bite the bullet and always retry short writes in
> > FileWriteV(). Is that what you meant by "handling them"?
> > If the total size is expensive to calculate, how about passing it as an
> > extra argument? Presumably it is cheap for the callers to calculate at
> > the same time that they build the iovec array?
> 
> It's cheap for md.c, because it already has nblocks_this_segment.
> That's one reason I chose to put the retry there.  If we push it down
> to fd.c in order to be able to help other callers, you're right that
> we could pass in the total size (and I guess assert that it's
> correct), but that is sort of annoyingly redundant and further from
> the interface we're wrapping.

> There is another problem with pushing it down to fd.c, though.
> Suppose you try to write 8192 bytes, and the kernel says "you wrote
> 4096 bytes" so your loop goes around again with the second half the
> data and now the kernel says "-1, ENOSPC".  What are you going to do?
> fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
> so you'd either have to return -1, ENOSPC (converting short writes
> into actual errors, a lie because you did write some data), or return
> 4096 (and possibly also set errno = ENOSPC as we have always done).
> So you can't really handle this problem at this level, can you?
> Unless you decide that fd.c should get into the business of raising
> errors for I/O failures, which would be a bit of a departure.
> 
> That's why I did the retry higher up in md.c.

I think that's the right call. I think for AIO we can't do retry handling
purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
buy us that much in md.c anyway, we still need to handle the cross segment
case and such, from what I can tell?

Greetings,

Andres Freund




Re: backtrace_on_internal_error

2023-12-08 Thread Tom Lane
Andres Freund  writes:
> On 2023-12-08 10:05:09 -0500, Tom Lane wrote:
>> ... there was already opinion upthread that this should be on by
>> default, which I agree with.  You shouldn't be hitting cases like
>> this commonly (if so, they're bugs to fix or the errcode should be
>> rethought), and the failure might be pretty hard to reproduce.

> FWIW, I did some analysis on aggregated logs on a larger number of machines,
> and it does look like that'd be a measurable increase in log volume. There are
> a few voluminous internal errors in core, but the bigger issue is
> extensions. They are typically much less disciplined about assigning error
> codes than core PG is.

Well, I don't see much wrong with making a push to assign error codes
to more calls.  We've had other discussions about doing that.
Certainly these SSL failures are not "internal" errors.

> could not accept SSL connection: %m - with zero errno
> ...
> I'm a bit confused about the huge number of "could not accept SSL connection:
> %m" with a zero errno. I guess we must be clearing errno somehow, but I don't
> immediately see where. Or perhaps we need to actually look at what
> SSL_get_error() returns?

Hmm, don't suppose you have a way to reproduce that?

regards, tom lane




Re: backtrace_on_internal_error

2023-12-08 Thread Andres Freund
Hi,

On 2023-12-08 10:05:09 -0500, Tom Lane wrote:
> Peter Eisentraut  writes:
> > One possible question for discussion is whether the default for this
> > should be off, on, or possibly something like on-in-assert-builds.
> > (Personally, I'm happy to turn it on myself at run time, but everyone
> > has different workflows.)
>
> ... there was already opinion upthread that this should be on by
> default, which I agree with.  You shouldn't be hitting cases like
> this commonly (if so, they're bugs to fix or the errcode should be
> rethought), and the failure might be pretty hard to reproduce.

FWIW, I did some analysis on aggregated logs on a larger number of machines,
and it does look like that'd be a measurable increase in log volume. There are
a few voluminous internal errors in core, but the bigger issue is
extensions. They are typically much less disciplined about assigning error
codes than core PG is.

I've been wondering about doing some macro hackery to inform elog.c about
whether a log message is from core or an extension. It might even be possible
to identify the concrete extension, e.g. by updating the contents of
PG_MODULE_MAGIC during module loading, and referencing that.


Based on the aforementioned data, the most common, in-core, log messages
without assigned error codes are:

could not accept SSL connection: %m - with zero errno
archive command was terminated by signal %d: %s
could not send data to client: %m - with zero errno
cache lookup failed for type %u
archive command failed with exit code %d
tuple concurrently updated
could not restore file "%s" from archive: %s
archive command was terminated by signal %d: %s
%s at file "%s" line %u
invalid memory alloc request size %zu
could not send data to client: %m
could not open directory "%s": %m - errno indicating ENOMEM
could not write init file
out of relcache_callback_list slots
online backup was canceled, recovery cannot continue
requested timeline %u does not contain minimum recovery point %X/%X on timeline %u


There were a lot more in older PG versions, I tried to filter those out.


I'm a bit confused about the huge number of "could not accept SSL connection:
%m" with a zero errno. I guess we must be clearing errno somehow, but I don't
immediately see where. Or perhaps we need to actually look at what
SSL_get_error() returns?

Greetings,

Andres Freund




Re: Assert failure on 'list_member_ptr(rel->joininfo, restrictinfo)'

2023-12-08 Thread Alexander Korotkov
Hi, Ashutosh!

On Fri, Dec 8, 2023 at 3:28 PM Ashutosh Bapat
 wrote:
> I did some analysis of memory consumption by bitmapsets in such cases.
> [1] contains slides with the result of this analysis. The slides are
> crude and quite WIP. But they will give some idea.
>
> [1] 
> https://docs.google.com/presentation/d/1S9BiAADhX-Fv9tDbx5R5Izq4blAofhZMhHcO1c-wzfI/edit?usp=sharing

Thank you for sharing your analysis.  I understand that usage of a
plain bitmap becomes a problem with a large number of partitions.  But
I wonder what "post proposed fixes" means.  Is it the fixes posted
in [1]?  If so, it's very surprising to me that they reduce the
memory footprint.

Links.
1. 
https://www.postgresql.org/message-id/CAPpHfdtLgCryACcrmLv=Koq9rAB3=tr5y9d84dggvuhscvj...@mail.gmail.com

--
Regards,
Alexander Korotkov




Re: remaining sql/json patches

2023-12-08 Thread Andres Freund
Hi,

On 2023-12-07 21:07:59 +0900, Amit Langote wrote:
> --- a/src/include/executor/execExpr.h
> +++ b/src/include/executor/execExpr.h
> @@ -16,6 +16,7 @@
>  
>  #include "executor/nodeAgg.h"
>  #include "nodes/execnodes.h"
> +#include "nodes/miscnodes.h"
>  
>  /* forward references to avoid circularity */
>  struct ExprEvalStep;
> @@ -168,6 +169,7 @@ typedef enum ExprEvalOp
>  
>   /* evaluate assorted special-purpose expression types */
>   EEOP_IOCOERCE,
> + EEOP_IOCOERCE_SAFE,
>   EEOP_DISTINCT,
>   EEOP_NOT_DISTINCT,
>   EEOP_NULLIF,
> @@ -547,6 +549,7 @@ typedef struct ExprEvalStep
>   bool   *checknull;
>   /* OID of domain type */
>   Oid resulttype;
> + ErrorSaveContext *escontext;
>   }   domaincheck;
>  
>   /* for EEOP_CONVERT_ROWTYPE */
> @@ -776,6 +779,7 @@ extern void ExecEvalParamExec(ExprState *state, 
> ExprEvalStep *op,
> ExprContext 
> *econtext);
>  extern void ExecEvalParamExtern(ExprState *state, ExprEvalStep *op,
>   ExprContext 
> *econtext);
> +extern void ExecEvalCoerceViaIOSafe(ExprState *state, ExprEvalStep *op);
>  extern void ExecEvalSQLValueFunction(ExprState *state, ExprEvalStep *op);
>  extern void ExecEvalCurrentOfExpr(ExprState *state, ExprEvalStep *op);
>  extern void ExecEvalNextValueExpr(ExprState *state, ExprEvalStep *op);
> diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
> index 5d7f17dee0..6a7118d300 100644
> --- a/src/include/nodes/execnodes.h
> +++ b/src/include/nodes/execnodes.h
> @@ -34,6 +34,7 @@
>  #include "fmgr.h"
>  #include "lib/ilist.h"
>  #include "lib/pairingheap.h"
> +#include "nodes/miscnodes.h"
>  #include "nodes/params.h"
>  #include "nodes/plannodes.h"
>  #include "nodes/tidbitmap.h"
> @@ -129,6 +130,12 @@ typedef struct ExprState
>  
>   Datum  *innermost_domainval;
>   bool   *innermost_domainnull;
> +
> + /*
> +  * For expression nodes that support soft errors.  Should be set to NULL
> +  * before calling ExecInitExprRec() if the caller wants errors thrown.
> +  */
> + ErrorSaveContext *escontext;
>  } ExprState;

Why do we need this both in ExprState *and* in ExprEvalStep?



> From 38b53297b2d435d5cebf78c1f81e4748fed6c8b6 Mon Sep 17 00:00:00 2001
> From: Amit Langote 
> Date: Wed, 22 Nov 2023 13:18:49 +0900
> Subject: [PATCH v30 2/5] Add soft error handling to populate_record_field()
> 
> An upcoming patch would like the ability to call it from the
> executor for some SQL/JSON expression nodes and ask to suppress any
> errors that may occur.
> 
> This commit does two things mainly:
> 
> * It modifies the various interfaces internal to jsonfuncs.c to pass
>   the ErrorSaveContext around.
> 
> * Make necessary modifications to handle the cases where the
>   processing is aborted partway through various functions that take
>   an ErrorSaveContext when a soft error occurs.
> 
> Note that the above changes are only intended to suppress errors in
> the functions in jsonfuncs.c, but not those in any external functions
> that the functions in jsonfuncs.c in turn call, such as those from
> arrayfuncs.c.  It is assumed that the various populate_* functions
> validate the data before passing those to external functions.
> 
> Discussion: 
> https://postgr.es/m/CA+HiwqE4XTdfb1nW=ojoy_tqsrhyt-q_kb6i5d4xckyrlc1...@mail.gmail.com

The code here is getting substantially more verbose / less readable.  I wonder
if there's something more general that could be improved to make this less
painful?

I'd not at all be surprised if this caused a measurable slowdown.


> ---
>  src/backend/utils/adt/jsonfuncs.c | 310 +++---
>  1 file changed, 236 insertions(+), 74 deletions(-)

>  /* functions supporting jsonb_delete, jsonb_set and jsonb_concat */
>  static JsonbValue *IteratorConcat(JsonbIterator **it1, JsonbIterator **it2,
> @@ -2484,12 +2491,12 @@ 
> populate_array_report_expected_array(PopulateArrayContext *ctx, int ndim)
>   if (ndim <= 0)
>   {
>   if (ctx->colname)
> - ereport(ERROR,
> + errsave(ctx->escontext,
>   
> (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
>errmsg("expected JSON array"),
>errhint("See the value of key 
> \"%s\".", ctx->colname)));
>   else
> - ereport(ERROR,
> + errsave(ctx->escontext,
>   
> (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
>errmsg("expected JSON array")));
>   }
> @@ -2506,13 +2513,13 @@ 
> populate_array_report_expected_array(PopulateArrayContext 

Re: Parallel CREATE INDEX for BRIN indexes

2023-12-08 Thread Tomas Vondra
Hi,

I've pushed the first two parts (backfill of empty ranges for serial
builds, allowing parallelism) after a bit more cleanup, adding a simple
pageinspect test to 0001, improving comments, and making some minor adjustments.

I ended up removing the protections against BlockNumber overflows, and
moved them into a separate WIP patch. I still think we should probably
reconsider the position that we don't need to worry about issues so
close to the 32TB boundary, but it seemed rather weird to fix only the
new bits and leave the existing issues in place.

I'm attaching that as a WIP patch, but I don't know if/when I'll get
back to this.


Thanks for the reviews/reworks/ideas!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From cf53c109c73cfa9264df71763e9ec5712f1c1f7f Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Thu, 7 Dec 2023 15:07:04 +0100
Subject: [PATCH] WIP: Prevent overflows for BRIN page ranges

Introduces brin_next_range, which checks whether blkno overflows and, if
it does, sets it to InvalidBlockNumber.
---
 src/backend/access/brin/brin.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 23f081389b2..1952142d050 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -227,6 +227,7 @@ static bool add_values_to_range(Relation idxRel, BrinDesc *bdesc,
 static bool check_null_keys(BrinValues *bval, ScanKey *nullkeys, int nnullkeys);
 static void brin_fill_empty_ranges(BrinBuildState *state,
    BlockNumber prevRange, BlockNumber maxRange);
+static BlockNumber brin_next_range(BrinBuildState *state, BlockNumber blkno);
 
 /* parallel index builds */
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
@@ -2935,7 +2936,7 @@ brin_fill_empty_ranges(BrinBuildState *state,
 	 * If we already summarized some ranges, we need to start with the next
 	 * one. Otherwise start from the first range of the table.
 	 */
-	blkno = (prevRange == InvalidBlockNumber) ? 0 : (prevRange + state->bs_pagesPerRange);
+	blkno = (prevRange == InvalidBlockNumber) ? 0 : brin_next_range(state, prevRange);
 
 	/* Generate empty ranges until we hit the next non-empty range. */
 	while (blkno < nextRange)
@@ -2948,6 +2949,18 @@ brin_fill_empty_ranges(BrinBuildState *state,
 	  blkno, state->bs_emptyTuple, state->bs_emptyTupleLen);
 
 		/* try next page range */
-		blkno += state->bs_pagesPerRange;
+		blkno = brin_next_range(state, blkno);
 	}
 }
+
+static BlockNumber
+brin_next_range(BrinBuildState *state, BlockNumber blkno)
+{
+	BlockNumber ret = (blkno + state->bs_pagesPerRange);
+
+	/* overflow */
+	if (ret < blkno)
+		ret = InvalidBlockNumber;
+
+	return ret;
+}
-- 
2.41.0



Re: remaining sql/json patches

2023-12-08 Thread Andrew Dunstan



On 2023-12-08 Fr 11:37, Robert Haas wrote:

On Fri, Dec 8, 2023 at 1:59 AM Amit Langote  wrote:

Would it be messy to replace the lookahead approach by whatever's
suitable *in the future* when it becomes necessary to do so?

It might be. Changing grammar rules tends to change corner-case
behavior if nothing else. We're best off picking the approach that we
think is correct long term.



All this makes me wonder if Alvaro's first suggested solution (adding 
NESTED to the UNBOUNDED precedence level) wouldn't be better after all.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: remaining sql/json patches

2023-12-08 Thread Robert Haas
On Fri, Dec 8, 2023 at 1:59 AM Amit Langote  wrote:
> Would it be messy to replace the lookahead approach by whatever's
> suitable *in the future* when it becomes necessary to do so?

It might be. Changing grammar rules tends to change corner-case
behavior if nothing else. We're best off picking the approach that we
think is correct long term.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: UBSan pointer overflow in xlogreader.c

2023-12-08 Thread Robert Haas
On Thu, Dec 7, 2023 at 10:18 PM Thomas Munro  wrote:
> On Fri, Dec 8, 2023 at 3:57 AM Robert Haas  wrote:
> > On Tue, Dec 5, 2023 at 4:01 PM Nathan Bossart  
> > wrote:
> > > +1
> >
> > So, Thomas ... any chance you could commit this? So that my patch
> > stops making cfbot sad?
>
> Done.  Thanks both for the reviews.

Thank you!

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: backtrace_on_internal_error

2023-12-08 Thread Tom Lane
Peter Eisentraut  writes:
> Here is a patch to play with.

Didn't read the patch yet, but ...

> One possible question for discussion is whether the default for this 
> should be off, on, or possibly something like on-in-assert-builds. 
> (Personally, I'm happy to turn it on myself at run time, but everyone 
> has different workflows.)

... there was already opinion upthread that this should be on by
default, which I agree with.  You shouldn't be hitting cases like
this commonly (if so, they're bugs to fix or the errcode should be
rethought), and the failure might be pretty hard to reproduce.

I'm not really sold that we even need YA GUC, for that matter.
How about committing the behavior without a GUC, and then
back-filling one if we get pushback?

regards, tom lane




Re: GUC names in messages

2023-12-08 Thread Peter Eisentraut

On 08.12.23 05:10, Peter Smith wrote:

Patch 0001 -- "datestyle" becomes DateStyle in messages
Rebased this again, which was part of an earlier patch set
- I think any GUC names documented as MixedCase should keep that same
case in messages; this also obeys the guidelines recently pushed [1].
- Some others agreed, expecting the exact GUC name (in the message)
can be found in pg_settings [2].
- OTOH, Michael didn't like the diff churn [3] caused by this patch.


I'm fine with adjusting the mixed-case stuff, but intuitively, I don't 
think removing the quotes in this is an improvement:


- GUC_check_errdetail("Conflicting \"datestyle\" specifications.");
+ GUC_check_errdetail("Conflicting DateStyle specifications.");





Re: Memory consumed by paths during partitionwise join planning

2023-12-08 Thread Ashutosh Bapat
On Fri, Dec 8, 2023 at 1:02 PM David Rowley  wrote:
>
> On Fri, 8 Dec 2023 at 18:02, Ashutosh Bapat
>  wrote:
> > given path. E.g. we have three path chains as follows
> > 1. joinpath_1->joinpath_2->seqpath_1,
> > 2. joinpath_3->joinpath_4->seqpath_1,
> > 3. joinpath_5->joinpath_2->seqpath_1
> >
> > Please note that this is not full path tree/forest. It is difficult to
> > describe the whole path forest in text format. But this should be
> > sufficient to get the gist of how linking and unlinking works.
> >
> > Let's say all these paths are listed in pathlists of respective
> > RelOptInfos. Assume that joinpath_1, joinpath_3 and joinpath_5 are all
> > part of the topmost RelOptInfo. Then the refcounts will be as follows
> > joinpath_1, joinpath_3 and joinpath_5 -> each 1 (RelOptInfo)
> > joinpath_2 - 3 (joinpath_2, joinpath_5 and RelOptInfo)
> > joinpath_4 - 2 (joinpath_3, RelOptInfo)
> > seqpath_1 - 3 (joinpath_2, joinpath_4, RelOptInfo)
>
> I think this likely works fine providing you always unlink paths from
> the root of the path tree.  When you start unlinking from non-root
> paths I think you could start to see problems unless you increment
> refcounts recursively.
>
> The problem I had when working on improving the union planner was when
> building the final paths for a union child.
>
> Let's say I have the query:  SELECT a,b FROM t1 UNION SELECT x,y FROM t2;
>
> When planning the first union child, I'd like to provide the union
> planner with a sorted path and also the cheapest scan path on t1 to
> allow it to Hash Aggregate on the cheapest path.
>
> Let's say the planner produces the following paths on t1.
>
> SeqScan_t1
>
> When I create the final union child paths I want to try and produce a
> few sorted paths to allow MergeAppend to work. I'll loop over each
> path in t1's pathlist.
>
> SeqScan_t1 isn't sorted, so I'll make Sort1->SeqScan_t1 and add_path
> that to final_rel.
>
> Now, I also want to allow cheap Hash Aggregates to implement the
> UNION, so I'll include SeqScan_t1 without sorting it.
>
> If I now do add_path(final_rel, SeqScan_t1), add_path() could see that
> the SeqScan cost is fuzzily the same as the Sort->SeqScan_t1 and since
> the sort version has better PathKeys, add_path will unlink SeqScan_t1.
>
> Now, I don't think your scheme for refcounting paths is broken by this
> because you'll increment the refcount of the SeqScan_t1 when the Sort
> uses it as its subpath, but if instead of a Sort->SeqScan that was
> IncrementalSort->Sort->SeqScan, then suddenly you're not incrementing
> the refcount for the SeqScan when you create the Incremental Sort due
> to the Sort being between the two.
>
> So, if I now do add_path(final_rel, Sort2->Incremental_Sort->SeqScan_t1)
>
> Sort1 is rejected due to cost and we unlink it. Sort1 refcount gets to
> 0 and we free the path and the SeqScan_t1's refcount goes down by 1
> but does not reach 0.

What is Sort1? Sort1->Seqscan_t1 OR incremental_sort->Sort->seqscan_t1?

I am getting lost here. But I think you have misunderstood the code, so
this question is irrelevant. See next.

>
> We then add_path(final_rel, SeqScan_t1) but this path is fuzzily the
> same cost as Sort2 so we reject SeqScan_t1 due to Sort2 having better
> pathkeys and we unlink SeqScan_t1.  Refcount on SeqScan_t1 gets to 0
> and we free the path. We need SeqScan_t1 for the incremental sort.

The path being presented to add_path() is never passed to unlink_path().
If it's suboptimal compared to an existing path, it is passed to free_path(),
which in its current form will throw an error. But that can be fixed:
free_path() can just ignore a path with refcount > 0 if we want to
present the same path to add_path() multiple times.

>
> Perhaps in reality here, since SeqScan_t1 exists in the base rel's
> pathlist we'd still have a refcount from that, but I assume at some
> point you want to start freeing up paths from RelOptInfos that we're
> done with, in which case the SeqScan_t1's refcount would reach 0 at
> that point and we'd crash during create plan of the
> Sort2->Incremental_Sort->SeqScan_t1 plan.
>
> Have I misunderstood something here?  As far as I see it with your
> non-recursive refcounts, it's only safe to free from the root of a
> pathtree and isn't safe when unlinking paths that are the subpath of
> another path unless the unlinking is done from the root.
>

As long as we call link_path every time we reference a path, we could
start freeing paths from anywhere. The reference count takes care of
not freeing referenced paths. Of course, I will need to fix
free_path() as above.
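
To make the discipline concrete, here is a toy sketch of the scheme as I
understand it; the function names match the ones discussed above, but the
bodies are guesses for illustration, not the actual patch:

/*
 * Toy refcounting sketch, assuming at most one subpath per path for
 * brevity.  Every holder of a reference -- a RelOptInfo pathlist or a
 * parent path -- must have called link_path(); unlink_path() frees the
 * node and releases its child reference once the count hits zero.
 */
#include <assert.h>
#include <stdlib.h>

typedef struct Path Path;

struct Path
{
	int			refcount;
	Path	   *subpath;		/* at most one child, for brevity */
};

static Path *
link_path(Path *path)
{
	path->refcount++;
	return path;
}

static void
unlink_path(Path *path)
{
	assert(path->refcount > 0);
	if (--path->refcount == 0)
	{
		if (path->subpath)
			unlink_path(path->subpath);	/* release our child reference */
		free(path);
	}
}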

-- 
Best Wishes,
Ashutosh Bapat




Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Dave Cramer
On Thu, 7 Dec 2023 at 08:47, David G. Johnston 
wrote:

> On Thursday, December 7, 2023, Daniel Verite 
> wrote:
>
>> Joe Conway wrote:
>>
>> > The attached should fix the CopyOut response to say one column. I.e. it
>> > ought to look something like:
>>
>> Spending more time with the doc I came to the opinion that in this bit
>> of the protocol, in CopyOutResponse (B)
>> ...
>> Int16
>> The number of columns in the data to be copied (denoted N below).
>> ...
>>
>> this number must be the number of columns in the source.
>> That is for COPY table(a,b,c)   the number is 3, independently
>> on whether the result is formatted in text, cvs, json or binary.
>>
>> I think that changing it for json can reasonably be interpreted
>> as a protocol break and we should not do it.
>>
>> The fact that this value does not help parsing the CopyData
>> messages that come next is not a new issue. A reader that
>> doesn't know the field separator and whether it's text or csv
>> cannot parse these messages into fields anyway.
>> But just knowing how many columns there are in the original
>> data might be useful by itself and we don't want to break that.
>
>
> This argument for leaving 3 as the column count makes sense to me.  I
> agree this content is not meant to facilitate interpreting the contents at
> a protocol level.
>

I'd disagree. From my POV, if the data comes back as a JSON array, this is
one object, and that should be reflected in the column count.

>
>
>>
>>
>> The other question for me is, in the CopyData message, this
>> bit:
>> " Messages sent from the backend will always correspond to single data
>> rows"
>>
>> ISTM that considering that the "[" starting the json array is a
>> "data row" is a stretch.
>> That might be interpreted as a protocol break, depending
>> on how strict the interpretation is.
>>
>
Well, technically it is a single row if you send an array.

Regardless, I expect, as Euler commented above, that the JSON Lines format
is going to be the preferred one, as the client doesn't have to wait for
the entire object before starting to parse.
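
For reference, the CopyOutResponse ('H') message layout under discussion,
per the documented v3 wire protocol, written as a parsing sketch; the
Reader type and the read_u8()/read_u16() helpers are assumed to exist in
the client, they are not part of any real API:

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical input helpers; any message reader would provide these. */
typedef struct Reader Reader;
extern uint8_t read_u8(Reader *r);
extern uint16_t read_u16(Reader *r);

/* Sketch of the fields following the 'H' type byte and Int32 length. */
typedef struct CopyOutResponse
{
	uint8_t		overall_format;	/* 0 = textual, 1 = binary */
	uint16_t	natts;			/* the Int16 column count under debate */
	uint16_t   *column_formats;	/* natts per-column format codes */
} CopyOutResponse;

static void
parse_copy_out_response(Reader *r, CopyOutResponse *resp)
{
	resp->overall_format = read_u8(r);
	resp->natts = read_u16(r);	/* 3 for COPY t(a,b,c); 1 if a JSON array counts as one column */
	resp->column_formats = malloc(resp->natts * sizeof(uint16_t));
	for (int i = 0; i < resp->natts; i++)
		resp->column_formats[i] = read_u16(r);
}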

Dave

>


Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-12-08 Thread Shlok Kyal
Hi,

> Then let's go with the original patch only. BTW, it took almost the
> same time (105 wallclock secs) in my environment (CentOs VM) to run
> tests in src/test/subscription both with and without the patch. I took
> a median of five runs. I have slightly adjusted the comments and
> commit message in the attached. If you are fine with this, we can
> commit and backpatch this.

I have tested the patch for all the branches from PG 17 to PG 12.
The same patch applies cleanly on all branches. Also, the same patch
resolves the issue on all the branches.
I ran all the tests and all the tests passed on each branch.

I also reviewed the patch and it looks good to me.

Thanks and Regards,
Shlok Kyal




Re: Forbid the use of invalidated physical slots in streaming replication.

2023-12-08 Thread Ashutosh Bapat
> > > pg_replication_slot could be set back to null.
> >
> > In this case, since the basebackup was taken after the slot was 
> > invalidated, it
> > does not require the WAL that was removed. But it seems that once the
> > streaming starts, the slot sprints to life again and gets validated again. 
> > Here's
> > pg_replication_slot output after the standby starts.
>
> Actually, it doesn't bring the invalidated slot back to life completely.
> The slot's view data looks valid, while the 'invalidated' flag of this slot
> is still RS_INVAL_WAL_REMOVED (users are not aware of it).
>

I was misled by the code in pg_get_replication_slots(). I had not
read it up to the following

--- code ---
case WALAVAIL_REMOVED:

/*
* If we read the restart_lsn long enough ago, maybe that file
* has been removed by now. However, the walsender could have
* moved forward enough that it jumped to another file after
* we looked. If checkpointer signalled the process to
* termination, then it's definitely lost; but if a process is
* still alive, then "unreserved" seems more appropriate.
*
* If we do change it, save the state for safe_wal_size below.
*/
--- code ---

I see now how an invalid slot's wal status can be reported as
unreserved. So I think it's a bug.

>
> >
> > >
> > > So, I think it's a bug and one idea to fix is to check the validity of
> > > the physical slot in
> > > StartReplication() after acquiring the slot like what the attachment
> > > does, what do you think ?
> >
> > I am not sure whether that's necessarily a bug. Of course, we don't expect
> > invalid slots to be used but in this case I don't see any harm.
> > The invalid slot has been revived and has all the properties set just like 
> > a valid
> > slot. So it must be considered in
> > ReplicationSlotsComputeRequiredXmin() as well. I haven't verified it myself
> > though. In case the WAL is really lost and is requested by the standby it 
> > will
> > throw an error "requested WAL segment [0-9A-F]+ has already been
> > removed". So no harm there as well.
>
> Since the 'invalidated' field of the slot is still (RS_INVAL_WAL_REMOVED), 
> even
> if the walsender advances the restart_lsn, the slot will not be considered in 
> the
> ReplicationSlotsComputeRequiredXmin(), so the WAL and rows are not safe,
> and that's why I think it's a bug.
>
> After looking closer, it seems this behavior started from 15f8203 which 
> introduced the
> ReplicationSlotInvalidationCause 'invalidated', after that we check the 
> invalidated enum
> in Xmin/Lsn computation function.
>
> If we want to go back to the previous behavior, we need to revert/adjust the
> check for invalidated in ReplicationSlotsComputeRequiredXmin(), but since
> logical decoding (walsender) disallows using an invalidated slot, I feel it's
> consistent to do a similar check for the physical one. Besides,
> pg_replication_slot_advance() also disallows passing an invalidated slot.
> What do you think?

What happens if you run your script on a build prior to 15f8203?
Purely from reading the code, it looks like the physical slot would
sprint back to life since its restart_lsn would be updated. But I am
not able to see what happens to invalidated_at. It probably remains a
valid LSN and the slot would still not be considered in the xmin
calculation. It would be good to be compatible with pre-15f8203
behaviour.

I think making logical and physical slot behaviour consistent would be
better, but if the inconsistent behaviour is out there for some
releases, changing that now will break backward compatibility.

-- 
Best Wishes,
Ashutosh Bapat




Re: Assert failure on 'list_member_ptr(rel->joininfo, restrictinfo)'

2023-12-08 Thread Ashutosh Bapat
On Fri, Dec 8, 2023 at 12:43 PM Alexander Pyhalov
 wrote:
>
> Andrei Lepikhov писал(а) 2023-12-08 07:37:
> > On 28/11/2023 01:37, Alexander Korotkov wrote:
> >> On Mon, Nov 27, 2023 at 8:07 PM Andres Freund 
> >> wrote:
> > Sorry for the late answer, I missed this thread because of vacation.
> >>> On 2023-11-27 11:29:48 +0530, Ashutosh Bapat wrote:
>  How do we ensure that we are not making unnecessary copies of
>  Bitmapsets?
> >>>
> >>> We don't - but that's not specific to this patch. Bitmapsets
> >>> typically aren't
> >>> very large, I doubt that it's a significant proportion of the memory
> >>> usage. Adding refcounts or such would likely add more overhead than
> >>> it'd save,
> >>> both in time and memory.
> >
> > I'd already clashed with Tom on copying the required_relids field and
> > voluntarily made unnecessary copies in the project [1].
> > And ... got stuck with huge memory consumption. The reason was in
> > Bitmapsets:
> > When we have 1E3-1E4 partitions and try to reparameterize a join, one
> > bitmapset field can have a size of about 1kB. Having a bitmapset
> > referencing a Relation with a large index value, we had a lot of (for
> > example, 1E4 * 1kB = ~10MB) copies on each reparameterization of such a field.
> > Alexander Pyhalov should remember that case.
>
> Yes. If it matters, this happened during reparametrization when 2
> partitioned tables with 1000 partitions each were joined. Then
> asymmetric partitionwise join managed to eat lots of memory for bitmapsets (by
> lots of memory I mean all available on the test VM).

I did some analysis of memory consumption by bitmapsets in such cases.
[1] contains slides with the results of this analysis. The slides are
crude and quite WIP, but they will give some idea.

[1] 
https://docs.google.com/presentation/d/1S9BiAADhX-Fv9tDbx5R5Izq4blAofhZMhHcO1c-wzfI/edit?usp=sharing


-- 
Best Wishes,
Ashutosh Bapat




Re: backtrace_on_internal_error

2023-12-08 Thread Peter Eisentraut

Here is a patch to play with.

I also found a related typo.

One possible question for discussion is whether the default for this 
should be off, on, or possibly something like on-in-assert-builds. 
(Personally, I'm happy to turn it on myself at run time, but everyone 
has different workflows.)


From 2e01ee20358df323170fbfef6dc01f684371d873 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 8 Dec 2023 13:30:55 +0100
Subject: [PATCH v1 1/2] Fix variable name and comment

---
 src/backend/utils/error/elog.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 6aeb855e49..ecacba4ee1 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -115,8 +115,8 @@ char   *Log_destination_string = NULL;
 bool   syslog_sequence_numbers = true;
 bool   syslog_split_messages = true;
 
-/* Processed form of backtrace_symbols GUC */
-static char *backtrace_symbol_list;
+/* Processed form of backtrace_functions GUC */
+static char *backtrace_function_list;
 
 #ifdef HAVE_SYSLOG
 
@@ -831,13 +831,13 @@ matches_backtrace_functions(const char *funcname)
 {
const char *p;
 
-   if (!backtrace_symbol_list || funcname == NULL || funcname[0] == '\0')
+   if (!backtrace_function_list || funcname == NULL || funcname[0] == '\0')
return false;
 
-   p = backtrace_symbol_list;
+   p = backtrace_function_list;
for (;;)
{
-   if (*p == '\0') /* end of backtrace_symbol_list 
*/
+   if (*p == '\0') /* end of 
backtrace_function_list */
break;
 
if (strcmp(funcname, p) == 0)
@@ -2180,7 +2180,7 @@ check_backtrace_functions(char **newval, void **extra, 
GucSource source)
 void
 assign_backtrace_functions(const char *newval, void *extra)
 {
-   backtrace_symbol_list = (char *) extra;
+   backtrace_function_list = (char *) extra;
 }
 
 /*

base-commit: 7db01fbcefbd95a7c97afa128fab59f4a09b3ff1
-- 
2.43.0

From 06beed94067efaf5b93da5041048347fa22a4290 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 8 Dec 2023 14:04:53 +0100
Subject: [PATCH v1 2/2] Add GUC backtrace_on_internal_error

---
 doc/src/sgml/config.sgml| 27 +++
 src/backend/utils/error/elog.c  |  8 +---
 src/backend/utils/misc/guc_tables.c | 16 
 src/include/utils/guc.h |  1 +
 4 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 94d1eb2b81..b830539b9a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11027,6 +11027,33 @@ Developer Options
   
  
 
+ 
+  backtrace_on_internal_error 
(boolean)
+  
+backtrace_on_internal_error configuration 
parameter
+  
+  
+  
+   
+If this parameter is on and an error with error code XX000 (internal
+error; see also ) is raised, then a
+backtrace is written to the server log together with the error
+message.  This can be used to debug such internal errors (which should
+normally not happen in production).  The default is off.
+   
+
+   
+Backtrace support is not available on all platforms, and the quality
+of the backtraces depends on compilation options.
+   
+
+   
+Only superusers and users with the appropriate SET
+privilege can change this setting.
+   
+  
+ 
+
  
   debug_discard_caches (integer)
   
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index ecacba4ee1..6a7a487323 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -498,9 +498,11 @@ errfinish(const char *filename, int lineno, const char 
*funcname)
 
/* Collect backtrace, if enabled and we didn't already */
if (!edata->backtrace &&
-   edata->funcname &&
-   backtrace_functions &&
-   matches_backtrace_functions(edata->funcname))
+   ((edata->funcname &&
+ backtrace_functions &&
+ matches_backtrace_functions(edata->funcname)) ||
+(edata->sqlerrcode == ERRCODE_INTERNAL_ERROR &&
+ backtrace_on_internal_error)))
set_backtrace(edata, 2);
 
/*
diff --git a/src/backend/utils/misc/guc_tables.c 
b/src/backend/utils/misc/guc_tables.c
index 6474e35ec0..73252d4864 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -527,6 +527,7 @@ double  log_statement_sample_rate = 1.0;
 double log_xact_sample_rate = 0;
 inttrace_recovery_messages = LOG;
 char  *backtrace_functions;
+bool   backtrace_on_internal_error = false;
 
 int

Re: [PATCH] New [relation] option engine

2023-12-08 Thread Nikolay Shaplov
In a message of Friday, 8 December 2023 at 15:59:09 MSK, Alvaro Herrera wrote:

> > Theoretically I can create a patch with full options.c as it is in the patch
> > now, use that code only in index AMs, and keep reloptions.c mostly
> > unchanged.
> > 
> > This will be a total mess, with two different options mechanisms working at
> > the same time, but it might be much easier to review.  When we are
> > done with the first step, we can change the rest.
> > If this will help to finally get the patch into Postgres, I can do it.
> > Will that help you to review?
> 
> I don't think that's better, because we could create slight
> inconsistencies between the code used for index AMs and the users of
> reloptions.
I've written quite good regression tests for it; there should be no 
inconsistency.
 
> I'm not seeing any reasonable way to split this patch into smaller ones.
Actually me neither. Not a good one anyway.

But if somebody really needs it to be split, it can be done that way.

-- 
Nikolay Shaplov aka Nataraj
Fuzzing Engineer at Postgres Professional
Matrix IM: @dhyan:nataraj.su




Re: initdb caching during tests

2023-12-08 Thread Daniel Gustafsson
> On 7 Dec 2023, at 15:27, Matthias van de Meent  
> wrote:

> Then that'd be the attached patch, which also includes --auth instead
> of -A, for the same reason as -N vs --no-sync

Applied to master, thanks!

--
Daniel Gustafsson





Re: [PATCH] New [relation] option engine

2023-12-08 Thread Alvaro Herrera
On 2023-Dec-08, Nikolay Shaplov wrote:

> Theoretically I can create a patch with full options.c as it is in the patch
> now, use that code only in index AMs, and keep reloptions.c mostly
> unchanged.
> 
> This will be a total mess, with two different options mechanisms working at
> the same time, but it might be much easier to review.  When we are done with
> the first step, we can change the rest.
> If this will help to finally get the patch into Postgres, I can do it. Will
> that help you to review?

I don't think that's better, because we could create slight
inconsistencies between the code used for index AMs and the users of
reloptions.

I'm not seeing any reasonable way to split this patch into smaller ones.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No necesitamos banderas
 No reconocemos fronteras"  (Jorge González)




Re: GUC names in messages

2023-12-08 Thread Alvaro Herrera
On 2023-Dec-08, Peter Smith wrote:

> Patch 0001 -- "datestyle" becomes DateStyle in messages
> Rebased this again, which was part of an earlier patch set
> - I think any GUC names documented as MixedCase should keep that same
> case in messages; this also obeys the guidelines recently pushed [1].

I agree.

> Patch 0002 -- use mixed case for intervalstyle error message
> I found that the GUC name substituted into the error message was coming
> from the statement, not from the original name in the guc_tables, so
> there was a case mismatch:

I agree.  Let's also add a test that shows this difference (my 0002
here).

I'm annoyed that this saga has transiently created a few untranslated
strings by removing unnecessary quotes but failing to move the variable
names outside the translatable part of the string.  I change a few of
those in 0003 -- mostly the ones in strings already touched by commit
8d9978a7176a, but also a few others.  Didn't go out of my way to grep
for other possible messages to fix, though.  (I feel like this is
missing some "translator:" comments.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"We’ve narrowed the problem down to the customer’s pants being in a situation
 of vigorous combustion" (Robert Haas, Postgres expert extraordinaire)
>From 0ccf4f8342d49d42116a1b1b872b15aeeddaf568 Mon Sep 17 00:00:00 2001
From: Peter Smith 
Date: Fri, 8 Dec 2023 12:59:20 +1100
Subject: [PATCH v5 1/3] GUC names - use mixed case for datestyle in messages

---
 src/backend/commands/variable.c|  2 +-
 src/backend/utils/adt/datetime.c   |  2 +-
 src/test/regress/expected/date.out | 58 +-
 src/test/regress/expected/horology.out |  2 +-
 4 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index c361bb2079..2703d2edc4 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -182,7 +182,7 @@ check_datestyle(char **newval, void **extra, GucSource source)
 
 	if (!ok)
 	{
-		GUC_check_errdetail("Conflicting \"datestyle\" specifications.");
+		GUC_check_errdetail("Conflicting DateStyle specifications.");
 		return false;
 	}
 
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index fca9a2a6e9..8ef5bf0076 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -4022,7 +4022,7 @@ DateTimeParseError(int dterr, DateTimeErrorExtra *extra,
 	(errcode(ERRCODE_DATETIME_FIELD_OVERFLOW),
 	 errmsg("date/time field value out of range: \"%s\"",
 			str),
-	 errhint("Perhaps you need a different \"datestyle\" setting.")));
+	 errhint("Perhaps you need a different DateStyle setting.")));
 			break;
 		case DTERR_INTERVAL_OVERFLOW:
 			errsave(escontext,
diff --git a/src/test/regress/expected/date.out b/src/test/regress/expected/date.out
index f5949f3d17..99650bf231 100644
--- a/src/test/regress/expected/date.out
+++ b/src/test/regress/expected/date.out
@@ -94,17 +94,17 @@ SELECT date '1/8/1999';
 ERROR:  date/time field value out of range: "1/8/1999"
 LINE 1: SELECT date '1/8/1999';
 ^
-HINT:  Perhaps you need a different "datestyle" setting.
+HINT:  Perhaps you need a different DateStyle setting.
 SELECT date '1/18/1999';
 ERROR:  date/time field value out of range: "1/18/1999"
 LINE 1: SELECT date '1/18/1999';
 ^
-HINT:  Perhaps you need a different "datestyle" setting.
+HINT:  Perhaps you need a different DateStyle setting.
 SELECT date '18/1/1999';
 ERROR:  date/time field value out of range: "18/1/1999"
 LINE 1: SELECT date '18/1/1999';
 ^
-HINT:  Perhaps you need a different "datestyle" setting.
+HINT:  Perhaps you need a different DateStyle setting.
 SELECT date '01/02/03';
 date
 
@@ -139,7 +139,7 @@ SELECT date 'January 8, 99 BC';
 ERROR:  date/time field value out of range: "January 8, 99 BC"
 LINE 1: SELECT date 'January 8, 99 BC';
 ^
-HINT:  Perhaps you need a different "datestyle" setting.
+HINT:  Perhaps you need a different DateStyle setting.
 SELECT date '99-Jan-08';
 date
 
@@ -156,7 +156,7 @@ SELECT date '08-Jan-99';
 ERROR:  date/time field value out of range: "08-Jan-99"
 LINE 1: SELECT date '08-Jan-99';
 ^
-HINT:  Perhaps you need a different "datestyle" setting.
+HINT:  Perhaps you need a different DateStyle setting.
 SELECT date '08-Jan-1999';
 date
 
@@ -167,7 +167,7 @@ SELECT date 'Jan-08-99';
 ERROR:  date/time field value out of range: "Jan-08-99"
 LINE 1: SELECT date 'Jan-08-99';
 ^
-HINT:  Perhaps you need a different "datestyle" setting.
+HINT:  Perhaps you need a different DateStyle setting.
 SELECT date 'Jan-08-1999';
 date
 
@@ -198,7 +198,7 @@ SELECT date '08 Jan 99';
 ERROR:  date/time field value out of range: "08 Jan 99"
 LINE 1: 

Re: [PATCH] New [relation] option engine

2023-12-08 Thread Nikolay Shaplov
In a message of Friday, 8 December 2023 at 08:59:41 MSK, Michael Paquier wrote:

> > I've rebased the patch so it can be added to the commitfest again.
> 
> This is a 270kB patch with quite a few changes, and a lot of code
> 
> moved around:
> >  47 files changed, 2592 insertions(+), 2326 deletions(-)
> 
> Would it be possible to split that into several successive steps to ease
> its review?

Theoretically I can create a patch with full options.c as it is in the patch
now, use that code only in index AMs, and keep reloptions.c mostly
unchanged.

This will be a total mess, with two different options mechanisms working at the
same time, but it might be much easier to review.  When we are done with
the first step, we can change the rest.
If this will help to finally get the patch into Postgres, I can do it. Will
that help you to review?


-- 
Nikolay Shaplov aka Nataraj
Fuzzing Engineer at Postgres Professional
Matrix IM: @dhyan:nataraj.su




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-08 Thread Masahiko Sawada
On Fri, Dec 8, 2023 at 7:46 PM John Naylor  wrote:
>
> On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada  wrote:
> >
> > BTW Given that the actual value size can be calculated only by the
> > caller, how does the tree know if the value is embedded or not? It's
> > probably related to how to store combined pointer/value slots.
>
> Right, this is future work. At first, variable-length types will have
> to be single-value leaves. In fact, the idea for storing up to 3
> offsets in the bitmap header could be done this way -- it would just
> be a (small) single-value leaf.

Agreed.

>
> (Reminder: Currently, fixed-length values are compile-time embeddable
> if the platform pointer size is big enough.)
>
> > If leaf
> > nodes have a bitmap array that indicates the corresponding slot is an
> > embedded value or a pointer to a value, it would be easy.
>
> That's the most general way to do it. We could do it much more easily
> with a pointer tag, although for the above idea it may require some
> endian-aware coding. Both were mentioned in the paper, I recall.

True. Probably we can use the combined pointer/value slots approach
only if the tree is able to use pointer tagging. That is, if the
caller allows the tree to use one bit of the value.
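
For what it's worth, here is a sketch of the one-bit tagging idea; it
assumes values are pointer-sized and that out-of-line allocations are at
least 2-byte aligned, so bit 0 of a real pointer is always zero and can
serve as the "embedded" tag:

#include <stdbool.h>
#include <stdint.h>

#define SLOT_EMBEDDED_TAG	((uintptr_t) 1)

static inline uintptr_t
slot_embed_value(uintptr_t value)
{
	/* caller must ensure the value fits in the remaining bits */
	return (value << 1) | SLOT_EMBEDDED_TAG;
}

static inline bool
slot_is_embedded(uintptr_t slot)
{
	return (slot & SLOT_EMBEDDED_TAG) != 0;
}

static inline uintptr_t
slot_get_value(uintptr_t slot)
{
	if (slot_is_embedded(slot))
		return slot >> 1;
	return *(uintptr_t *) slot;	/* follow the pointer to the out-of-line value */
}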

I'm going to update the patch based on the recent discussion (RT_SET()
and variable-length values) etc., and post the patch set early next
week.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Remove some unnecessary includes of "access/xlog_internal.h"

2023-12-08 Thread Heikki Linnakangas

On 08/12/2023 13:44, Peter Eisentraut wrote:

I found a few places where access/xlog_internal.h was apparently
included unnecessarily.  In some of those places, a more specific header
file (that somehow came in via access/xlog_internal.h) can be used
instead.  The *.h file change passes headerscheck.


+1

--
Heikki Linnakangas
Neon (https://neon.tech)





Remove some unnecessary includes of "access/xlog_internal.h"

2023-12-08 Thread Peter Eisentraut
I found a few places where access/xlog_internal.h was apparently 
included unnecessarily.  In some of those places, a more specific header 
file (that somehow came in via access/xlog_internal.h) can be used 
instead.  The *.h file change passes headerscheck.

From c03641b59ec9575e74fa8eb4e519b59d548730af Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 8 Dec 2023 12:25:51 +0100
Subject: [PATCH] Remove some unnecessary includes of "access/xlog_internal.h"

---
 src/bin/pg_checksums/pg_checksums.c | 2 +-
 src/bin/pg_rewind/timeline.c| 1 -
 src/include/access/generic_xlog.h   | 2 +-
 3 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_checksums/pg_checksums.c 
b/src/bin/pg_checksums/pg_checksums.c
index 6543d9ce08..19c083be17 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -20,11 +20,11 @@
 #include 
 #include 
 
-#include "access/xlog_internal.h"
 #include "common/controldata_utils.h"
 #include "common/file_perm.h"
 #include "common/file_utils.h"
 #include "common/logging.h"
+#include "common/relpath.h"
 #include "fe_utils/option_utils.h"
 #include "getopt_long.h"
 #include "pg_getopt.h"
diff --git a/src/bin/pg_rewind/timeline.c b/src/bin/pg_rewind/timeline.c
index 2d445dac32..fd5f748448 100644
--- a/src/bin/pg_rewind/timeline.c
+++ b/src/bin/pg_rewind/timeline.c
@@ -10,7 +10,6 @@
 #include "postgres_fe.h"
 
 #include "access/timeline.h"
-#include "access/xlog_internal.h"
 #include "pg_rewind.h"
 
 /*
diff --git a/src/include/access/generic_xlog.h 
b/src/include/access/generic_xlog.h
index 66941f99a8..f099ec7321 100644
--- a/src/include/access/generic_xlog.h
+++ b/src/include/access/generic_xlog.h
@@ -15,7 +15,7 @@
 #define GENERIC_XLOG_H
 
 #include "access/xlog.h"
-#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
 #include "access/xloginsert.h"
 #include "storage/bufpage.h"
 #include "utils/rel.h"
-- 
2.43.0



Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-12-08 Thread Amit Kapila
On Thu, Dec 7, 2023 at 11:21 AM Shlok Kyal  wrote:
>
> > I mean to commit the open transaction at the below place in
> > wait_for_relation_state_change()
> >
> > wait_for_relation_state_change()
> > {
> > ...
> > -- commit the xact
> > WaitLatch();
> > ...
> > }
> >
> > Then start a new transaction after the wait is over. This is just to test whether it
> > improves the difference in regression test timing.
>
> I tried the above approach and observed that the performance of this
> approach is nearly the same as that of the previous approach.
>

Then let's go with the original patch only. BTW, it took almost the
same time (105 wallclock secs) in my environment (CentOs VM) to run
tests in src/test/subscription both with and without the patch. I took
a median of five runs. I have slightly adjusted the comments and
commit message in the attached. If you are fine with this, we can
commit and backpatch this.

-- 
With Regards,
Amit Kapila.


v3-0001-Fix-an-undetected-deadlock-due-to-apply-worker.patch
Description: Binary data


Add code indentation check to cirrus-ci (was Re: Add BF member koel-like indentation checks to SanityCheck CI)

2023-12-08 Thread Bharath Rupireddy
On Mon, Oct 30, 2023 at 12:50 PM Bharath Rupireddy
 wrote:
>
> Hi,
>
> How about adding code indent checks (like what BF member koel has) to
> the SanityCheck CI task? This helps catch indentation problems way
> before things are committed so that developers can find them in
> their respective CI runs and learn the postgres code
> indentation conventions. It also saves committers time - git log | grep
> 'koel' | wc -l gives me 11 commits and git log | grep 'indentation' |
> wc -l gives me 97. Most, if not all, of these commits went into fixing
> code indentation problems that could have been reported a bit earlier
> and fixed by developers/patch authors themselves.
>
> Thoughts?

Currently, some of the code indentation issues that pgindent reports
are caught only after the code gets committed, either via the buildfarm
member koel or when someone runs pgindent before a release.
These indentation issues are then fixed, mostly by committers, in
separate commits. This takes away development time and causes
back-and-forth emails on the mailing lists.

As of this writing, git log | grep 'koel' | wc -l gives 13 commits and
git log | grep 'indentation' | wc -l gives 100 commits (all may not be
related to code indentation per se). Almost all of these commits went
into fixing code indentation problems that could have been reported a
bit earlier and fixed by developers/patch authors themselves.

The attached patch adds a new cirrus-ci task with minimal resources
(with 1 CPU and ccache 150MB) that fails when pgindent is not happy
with any of the changes. This helps catch code indentation issues in the
development phase, way before things get committed. This step can kick
in on developers' cirrus-ci runs in their own accounts if cirrus-ci is
enabled in their development repos. Otherwise, it can also be enabled
to kick in on cfbot runs (I've not explored this yet).

If we don't want this new task to drain the free credits/run time that
cirrus-ci offers, one possible way is to fold the code indentation
check into either the SanityCheck or the CompilerWarnings task to save on
resources. If an indentation check like this is deemed needed, we can
think of adding a pgperltidy check as well.

Here's a development branch to see the new task in action
https://github.com/BRupireddy2/postgres/tree/add_code_indentation_check_to_cirrus_ci
- an intentional pgindent failure is detected by the new task where
the diff is uploaded as an artifact -
https://api.cirrus-ci.com/v1/artifact/task/6127561344811008/indentation/pgindent.diffs.

Thoughts?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Add-code-indentation-check-to-cirrus-ci.patch
Description: Binary data


Re: [PATCH] minor reloption regression tests improvement

2023-12-08 Thread Alvaro Herrera
On 2022-Mar-07, Greg Stark wrote:

> I don't think this is worth spending time adding tests for. I get what
> you're saying that this is at least semi-intentional behaviour and
> desirable behaviour so it should have tests ensuring that it continues
> to work. But it's not documented behaviour and the test is basically
> testing that the implementation is this specific implementation.
> 
> I don't think the test is really a bad idea but neither is it really
> very useful and I think it's not worth having people spend time
> reviewing and discussing. I'm leaning to saying this patch is
> rejected.

I've pushed this patch; I agree with Greg that this is semi-intentional
and desirable, even if not documented.  (I doubt we would document it
anyway, as having previously SET an invalid option is not something that
happens commonly; what reason would we have for documenting that it
works to RESET such an option?)  As for the implementation being this
specific one, that's something already embedded in all of
reloptions.sql, so I don't really mind that; if we do change it, we'll
have to rewrite pretty much the whole file and not "the whole file
except these three lines".

Lastly, there is a long-running patch proposal to refactor reloptions
quite significantly, so it'll be good to ensure that the new
implementation doesn't break the features we already have; and this
corner is one thing Nikolay already said his patch had broken without
noticing right away.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La verdad no siempre es bonita, pero el hambre de ella sí"




Re: Adding a pg_get_owned_sequence function?

2023-12-08 Thread Shubham Khanna
On Fri, Dec 8, 2023 at 3:43 PM Dagfinn Ilmari Mannsåker
 wrote:
>
> Hi hackers,
>
> I've always been annoyed by the fact that pg_get_serial_sequence takes
> the table and returns the sequence as strings rather than regclass. And
> since identity columns were added, the name is misleading as well (which
> is even acknowledged in the docs, together with a suggestion for a
> better name).
>
> So, instead of making excuses in the documentation, I thought why not
> add a new function which addresses all of these issues, and document the
> old one as a backward-compatibilty wrapper?
>
> Please see the attached patch for my stab at this.
>

I reviewed the patch and the compilation looks fine. I tested various
scenarios and did not find any issues.  I also ran 'make check'
and 'make check-world', and all the test cases passed successfully. I
found a small typo; please have a look at it:

This is the name the docs say `pg_get_serial_sequence` sholud have
had, and gives us the opportunity to change the return and table
argument types to `regclass` and the column argument to `name`,
instead of using `text` everywhere.  This matches what's in catalogs,
and requires less explaining than the rules for
`pg_get_serial_sequence`.

Here 'sholud' should have been 'should'.

Thanks and Regards,
Shubham Khanna.




Re: Test 002_pg_upgrade fails with olddump on Windows

2023-12-08 Thread Alvaro Herrera
On 2023-Dec-08, Michael Paquier wrote:

> On Thu, Dec 07, 2023 at 05:44:53PM +0900, Michael Paquier wrote:
> > Hmm.  Yes, it looks like you're right here.  That should allow all the
> > scenarios we expect to work to update the paths for the functions.
> 
> And done this one as well down to v15, where not only meson, but also
> vpath could have been confused with an update to an incorrect path.

Argh, yeah, this has caused me pain a couple of times.  Thanks for fixing.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it."   (New York Times, about Microsoft PowerPoint)




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-08 Thread John Naylor
On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada  wrote:
>
> BTW Given that the actual value size can be calculated only by the
> caller, how does the tree know if the value is embedded or not? It's
> probably related to how to store combined pointer/value slots.

Right, this is future work. At first, variable-length types will have
to be single-value leaves. In fact, the idea for storing up to 3
offsets in the bitmap header could be done this way -- it would just
be a (small) single-value leaf.

(Reminder: Currently, fixed-length values are compile-time embeddable
if the platform pointer size is big enough.)

> If leaf
> nodes have a bitmap array that indicates the corresponding slot is an
> embedded value or a pointer to a value, it would be easy.

That's the most general way to do it. We could do it much more easily
with a pointer tag, although for the above idea it may require some
endian-aware coding. Both were mentioned in the paper, I recall.

> But since
> the bitmap array is needed only in the leaf nodes, internal nodes and
> leaf nodes will no longer have identical structure, which is not a bad
> thing to me, though.

Absolutely no way we are going back to double everything: double
types, double functions, double memory contexts. Plus, that bitmap in
inner nodes could indicate a pointer to a leaf that got there by "lazy
expansion".




Re: [HACKERS] make async slave to wait for lsn to be replayed

2023-12-08 Thread Alexander Korotkov
On Fri, Dec 8, 2023 at 11:46 AM Kartyshov Ivan
 wrote:
>
> Should rise disscusion on separate utility statement or find
> case where procedure version is failed.
>
> 1) Classic (wait_classic_v3.patch)
> https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
> ==
> advantages: multiple wait events, separate WAIT FOR statement
> disadvantages: new words in grammar
>
>
>
> WAIT FOR  [ANY | ALL] event [, ...]
> BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
>  [ WAIT FOR [ANY | ALL] event [, ...]]
> event:
> LSN value
> TIMEOUT number_of_milliseconds
> timestamp

Nice, but as you stated, it requires new keywords.

> 2) After style: Kyotaro and Freund (wait_after_within_v2.patch)
> https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
> ==
> advantages: no new words in grammar
> disadvantages: a little harder to understand
>
>
>
> AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
> BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
>  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
> START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
>  [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

+1 from me

> 3) Procedure style: Tom Lane and Kyotaro (wait_proc_v7.patch)
> https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
> https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com
> ==
> advantages: no new words in grammar, like it's done for
> pg_last_wal_replay_lsn
> disadvantages: use snapshot xmin trick
> SELECT pg_waitlsn(‘LSN’, timeout);
> SELECT pg_waitlsn_infinite(‘LSN’);
> SELECT pg_waitlsn_no_wait(‘LSN’);

Nice, because of its simplicity.  But it's only safe if called within a
simple query containing nothing else.  Validating this from within the
function kills the simplicity.

--
Regards,
Alexander Korotkov




Re: Transaction timeout

2023-12-08 Thread Japin Li


On Fri, 08 Dec 2023 at 18:08, Andrey M. Borodin  wrote:
>> On 8 Dec 2023, at 12:59, Japin Li  wrote:
>>
>>
>> On Thu, 07 Dec 2023 at 20:40, Andrey M. Borodin  wrote:
>>>> On 7 Dec 2023, at 06:25, Japin Li  wrote:
>>>>
>>>> If idle_in_transaction_timeout is bigger than transaction_timeout,
>>>> the idle-in-transaction timeout isn't needed, right?
>>> Yes, I think so.
>>>
>>
>> Should we disable idle_in_transaction_timeout in this case? Of course, I'm
>> not strongly insisting on this.
> Good idea!
>
>> I think you forgot to disable transaction_timeout in AutoVacWorkerMain().
>> If not, can you elaborate on why you don't disable it?
>
> Seems like the code in autovacuum.c was copied, but the patch was not
> updated. I've fixed this oversight.
>

Thanks for updating the patch. LGTM.

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.




Re: [HACKERS] make async slave to wait for lsn to be replayed

2023-12-08 Thread Alexander Korotkov
On Fri, Dec 8, 2023 at 11:20 AM Kartyshov Ivan
 wrote:
>
> On 2023-11-27 03:08, Alexander Korotkov wrote:
> > I've retried my case with v6 and it doesn't fail anymore.  But I
> > wonder how safe it is to reset xmin within the user-visible function?
> > We have no guarantee that the function is not called inside the
> > complex query.  Then how will the rest of the query work with xmin
> > reset?  Separate utility statement still looks like more safe option
> > for me.
>
> As you mentioned, we can't guarantee that the function is not called
> inside a complex query, but we can return the xmin after waiting.

Returning the xmin back isn't safe, especially after a potentially long
wait.  The snapshot might no longer be valid, because the corresponding
tuples could have been VACUUM'ed.

--
Regards,
Alexander Korotkov




Re: Transaction timeout

2023-12-08 Thread Andrey M. Borodin


> On 8 Dec 2023, at 12:59, Japin Li  wrote:
> 
> 
> On Thu, 07 Dec 2023 at 20:40, Andrey M. Borodin  wrote:
>>> On 7 Dec 2023, at 06:25, Japin Li  wrote:
>>> 
>>> If idle_in_transaction_timeout is bigger than transaction_timeout,
>>> the idle-in-transaction timeout isn't needed, right?
>> Yes, I think so.
>> 
> 
> Should we disable idle_in_transaction_timeout in this case? Of course, I'm
> not strongly insisting on this.
Good idea!

> I think you forgot to disable transaction_timeout in AutoVacWorkerMain().
> If not, can you elaborate on why you don't disable it?

Seems like the code in autovacuum.c was copied, but the patch was not
updated. I've fixed this oversight.

Thanks!


Best regards, Andrey Borodin.


v8-0001-Introduce-transaction_timeout.patch
Description: Binary data


Re: trying again to get incremental backup

2023-12-08 Thread Jakub Wartak
On Thu, Dec 7, 2023 at 4:15 PM Robert Haas  wrote:

Hi Robert,

> On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
>  wrote:
> > Comment: I was wondering if it wouldn't make some sense to teach
> > pg_resetwal to actually delete all WAL summaries after any
> > WAL/controlfile alteration?
>
> I thought that this was a good idea so I decided to go implement it,
> only to discover that it was already part of the patch set ... did you
> find some case where it doesn't work as expected? The code looks like
> this:

Ah, my bad; with a fresh mind and coffee the error message makes it
clear, and of course it did reset the summaries properly.

While we are at it, maybe around the below in PrepareForIncrementalBackup()

if (tlep[i] == NULL)
ereport(ERROR,

(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 errmsg("timeline %u found in
manifest, but not in this server's history",
range->tli)));

we could add

errhint("You might need to start a new full backup instead of
incremental one")

?

> > test_pending_2pc.sh - getting GOOD on most recent runs, but several
> > times during early testing (probably due to my own mishaps), I've been
> > hit by Abort/TRAP. I'm still investigating and trying to reproduce
> > those ones. TRAP: failed Assert("summary_end_lsn >=
> > WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940
>
> I have a fix for this locally, but I'm going to hold off on publishing
> a new version until either there's a few more things I can address all
> at once, or until Thomas commits the ubsan fix.
>

Great, I cannot get it to fail again today; it must have been some dirty
state of the testing env. BTW: Thomas has pushed that ubsan fix.

-J.




Re: [HACKERS] make async slave to wait for lsn to be replayed

2023-12-08 Thread Kartyshov Ivan

We should either raise the discussion on a separate utility statement, or
find a case where the procedure version fails.

1) Classic (wait_classic_v3.patch)
https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru
==
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar



WAIT FOR  [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ WAIT FOR [ANY | ALL] event [, ...]]
event:
LSN value
TIMEOUT number_of_milliseconds
timestamp
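
For illustration, concrete statements under this grammar might look
like the following (LSN and timeout values are invented):

WAIT FOR ANY LSN '0/306EE20', TIMEOUT 5000;

BEGIN TRANSACTION READ ONLY
WAIT FOR LSN '0/306EE20';

Under ANY semantics, the first form would presumably return as soon as
either the LSN is replayed or 5 seconds elapse; the second delays the
start of the transaction until the LSN is replayed.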



2) After style: Kyotaro and Freund (wait_after_within_v2.patch)
https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru
==
advantages: no new words in grammar
disadvantages: a little harder to understand



AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]



3) Procedure style: Tom Lane and Kyotaro (wait_proc_v7.patch)
https://www.postgresql.org/message-id/27171.1586439221%40sss.pgh.pa.us
https://www.postgresql.org/message-id/20210121.173009.235021120161403875.horikyota.ntt%40gmail.com
==
advantages: no new words in the grammar; works like
pg_last_wal_replay_lsn
disadvantages: relies on the snapshot xmin trick
SELECT pg_waitlsn('LSN', timeout);
SELECT pg_waitlsn_infinite('LSN');
SELECT pg_waitlsn_no_wait('LSN');


Regards
--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..18695e013e 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..a2794763b1 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  
 
-BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] wait_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_event is:
+AFTER lsn_value [ WITHIN number_of_milliseconds ]
 
  
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  
 
-START TRANSACTION [ transaction_mode [, ...] ]
+START TRANSACTION [ transaction_mode [, ...] ] wait_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_event is:
+AFTER lsn_value [ WITHIN number_of_milliseconds ]
 
  
 
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a83ff4551e 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@



+   
 
  
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6156a..08399452ce 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1780,6 +1781,15 @@ PerformWalRecovery(void)
 break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in the shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+			{
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 42cced9ebe..ec6ab7722a 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c 

Re: Oversight in reparameterize_path_by_child leading to executor crash

2023-12-08 Thread Alena Rybakina

On 06.12.2023 10:30, Richard Guo wrote:

I've self-reviewed this patch again and I think it's now in a
committable state.  I'm wondering if we can mark it as 'Ready for
Committer' now, or we need more review comments/feedbacks.

To recap, this patch postpones reparameterization of paths until
createplan.c, which would help avoid building the reparameterized
expressions we might not use.  More importantly, it makes it possible to
modify the expressions in RTEs (e.g. tablesample) and in RelOptInfos
(e.g. baserestrictinfo) for reparameterization.  Failing to do that can
lead to executor crashes, wrong results, or planning errors, as we have
already seen.

I marked it as 'Ready for Committer'. I think it is ready.

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Proposal to add page headers to SLRU pages

2023-12-08 Thread Li, Yong
Given that so many different approaches were discussed, I have started a wiki 
page to record and coordinate all efforts towards SLRU improvements.  The wiki 
provides a concise overview of all the ideas discussed and can serve as a 
portal for all the historical discussions.  Currently, it summarizes four 
recent threads, ranging from the identifier format change, to the page header 
change, to moving SLRU into the main buffer pool, to reducing lock contention 
on SLRU latches.  We can keep the patch-related discussion in this thread and 
use the wiki as a live document for larger-scale collaboration.

The wiki page is here:  https://wiki.postgresql.org/wiki/SLRU_improvements

Regarding the benefits of this patch, here is a detailed explanation:

  1.  A checksum is added to each page, allowing us to verify whether a page 
has been corrupted when it is read from disk.
  2.  The ad-hoc LSN group structure is removed from the SLRU cache control 
data and replaced by the page LSN in the page header. This allows us to use 
the same WAL protocol as pages in the main buffer pool: flush WAL up to the 
page LSN before flushing the page itself (see the sketch after this list). If 
we move SLRU caches into the main buffer pool, this change fits naturally.
  3.  It leaves further optimizations open. We can continue to pursue the goal 
of moving SLRU into the main buffer pool, or we can follow the lock partition 
idea. This change by itself does not conflict with either proposal.
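
As a rough illustration of the write-side protocol in item 2 (a sketch
only: it uses the existing bufpage/checksum APIs, while the surrounding
SLRU wiring is assumed rather than taken from the patch):

    /* Before writing an SLRU page to disk. */
    XLogRecPtr  lsn = PageGetLSN(page);

    if (!XLogRecPtrIsInvalid(lsn))
        XLogFlush(lsn);     /* WAL-before-data: flush up to the page LSN */

    /* With a standard page header, the usual checksum machinery applies. */
    ((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blkno);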

Also, the patch is now complete and ready for review.  All check-world tests, 
including the TAP tests, passed with this patch.


Regards,
Yong

From: Robert Haas 
Date: Friday, December 8, 2023 at 03:51
To: Debnath, Shawn 
Cc: Andrey Borodin , PostgreSQL Hackers 
, Aleksander Alekseev 
, Li, Yong , Shyrabokau, Anton 
, Bagga, Rishu 
Subject: Re: Proposal to add page headers to SLRU pages
External Email

On Thu, Dec 7, 2023 at 1:28 PM Debnath, Shawn  wrote:
> What is being proposed now is the simple and core functionality of introducing
> page headers to SLRU pages while continuing to be in the SLRU cache. This
> allows the whole project to be iterative and reviewers to better reason about
> the smaller set of changes being introduced into the codebase.
>
> Once the set of on-disk changes are in, we can follow up on optimizations.
> It may be moving to buffer cache or reviewing Dilip's approach in [1], we
> will have the option to be flexible in our approach.

I basically agree with this. I don't think we should let the perfect
be the enemy of the good. Shooting down this patch because it doesn't
do everything that we want is a recipe for getting nothing done at
all.

That said, I don't think that the original post on this thread
provides a sufficiently clear and detailed motivation for making this
change. For this to eventually be committed, it's going to need (among
other things) a commit message that articulates a convincing rationale
for whatever changes it makes. Here's what the original email said:

> It adds a checksum to each SLRU page, tracks page LSN as if it is a standard 
> page and eases future page enhancements.

Of those three things, in my opinion, the first is good and the other
two are too vague. I assume that most people who would be likely to
read a commit message would understand the value of pages having
checksums. But I can't immediately think of what the value of tracking
the page LSN as if it were a standard page might be, so that probably
needs more explanation. Similarly, at least one or two of the future
page enhancements that might be eased should be spelled out, and/or
the ways in which they would be made easier should be articulated.

--
Robert Haas
EDB: http://www.enterprisedb.com


slru_page_header_v1.patch
Description: slru_page_header_v1.patch


Re: [HACKERS] make async slave to wait for lsn to be replayed

2023-12-08 Thread Kartyshov Ivan

On 2023-11-27 03:08, Alexander Korotkov wrote:

I've retried my case with v6 and it doesn't fail anymore.  But I
wonder how safe it is to reset xmin within a user-visible function?
We have no guarantee that the function is not called inside a
complex query.  Then how will the rest of the query work with xmin
reset?  A separate utility statement still looks like a safer option
to me.


As you mentioned, we can't guarantee that the function is not called
inside a complex query, but we could restore the xmin after waiting.
Still, you are right that a separate utility statement looks safer,
so I want to bring up the discussion on a separate utility statement
again.


--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com




Re: Synchronizing slots from primary to standby

2023-12-08 Thread Amit Kapila
On Wed, Dec 6, 2023 at 4:53 PM shveta malik  wrote:
>
> PFA v43, changes are:
>

I wanted to discuss the 0003 patch, about cascading standbys. It is not
clear to me whether we want to allow physical standbys to further wait
for cascading standbys to sync their slots. If we allow such a feature,
one may expect even the primary to wait for all the cascading standbys,
because otherwise a logical subscriber can still be ahead of one of the
cascading standbys. Even if we want to allow such behaviour, I feel we
can do it later once the main feature is committed. I think it would be
good to just allow logical walsenders on the primary to wait for the
physical standbys represented by the GUC 'standby_slot_names'. If we
agree on that, then it would be good to prohibit setting this GUC on a
standby, or at least make it a no-op even if it is set on a physical
standby.
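
For concreteness, the proposed setup would look roughly like this on
the primary (slot names invented; the GUC comes from this patch set,
not from any released version):

    # postgresql.conf on the primary
    standby_slot_names = 'physical_standby_1, physical_standby_2'

Logical walsenders would then wait until the listed physical slots have
confirmed the WAL before sending it to logical subscribers.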

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-08 Thread Masahiko Sawada
On Fri, Dec 8, 2023 at 3:45 PM Masahiko Sawada  wrote:
>
> On Fri, Dec 8, 2023 at 1:37 PM John Naylor  wrote:
> >
> > On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada  
> > wrote:
> >
> > > It's still unclear to me why the value doesn't need to contain the size.
> > >
> > > If I understand you correctly, in RT_GET(), the tree allocs a new
> > > memory and updates the slot where the value is embedded with the
> > > pointer to the allocated memory, and returns the pointer to the
> > > caller. Since the returned value, the newly allocated memory, is still
> > > empty, the caller needs to copy the contents of the old value to the
> > > new value and do whatever else it needs to.
> > >
> > > If the value is already a single-leaf value and RT_GET() is called
> > > with a larger size, the slot is always replaced with the newly
> > > allocated area and the caller needs to copy the contents? If the tree
> > > does realloc the value with a new size, how does the tree know the new
> > > value is larger than the existing value? It seems like the caller
> > > needs to provide a function to calculate the size of the value based
> > > on the length.
> >
> > Right. My brief description mentioned one thing without details: The
> > caller would need to control whether to re-alloc. RT_GET would pass
> > the size. If nothing is found, the tree would allocate. If there is a
> > value already, just return it. That means both the address of the
> > slot, and the local pointer to the value (with embedded, would be the
> > same address).

BTW, given that the actual value size can be calculated only by the
caller, how does the tree know whether the value is embedded or not?
It's probably related to how we store combined pointer/value slots. If
leaf nodes have a bitmap array indicating whether the corresponding
slot holds an embedded value or a pointer to a value, it would be easy.
But since the bitmap array is needed only in the leaf nodes, internal
nodes and leaf nodes would no longer have an identical structure, which
is not a bad thing to me, though.
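
A minimal sketch of that leaf-node layout (the names, fanout, and
bitmap width are invented for illustration, not taken from the patch):

    #include <stdbool.h>
    #include <stdint.h>

    #define RT_FANOUT 256   /* slots per leaf node (assumption) */

    typedef struct RTLeafNode
    {
        /* bit i set => slots[i] holds the value inline (embedded) */
        uint64_t    embedded_bitmap[RT_FANOUT / 64];
        /* each slot is either an embedded value or a pointer */
        uintptr_t   slots[RT_FANOUT];
    } RTLeafNode;

    /* Does slot i store an embedded value rather than a pointer? */
    static inline bool
    rt_slot_is_embedded(const RTLeafNode *node, int i)
    {
        return (node->embedded_bitmap[i / 64] >> (i % 64)) & 1;
    }

Internal nodes would simply omit the bitmap, which is what makes their
structure diverge from the leaf nodes.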

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Transaction timeout

2023-12-08 Thread Japin Li


On Thu, 07 Dec 2023 at 20:40, Andrey M. Borodin  wrote:
>> On 7 Dec 2023, at 06:25, Japin Li  wrote:
>>
>>  If idle_in_transaction_timeout is bigger than transaction_timeout,
>> the idle-in-transaction timeout isn't needed, right?
> Yes, I think so.
>

Should we disable the idle_in_transaction_timeout in this case? Of course, I
don't strongly insist on this.

I think you forgot to disable transaction_timeout in AutoVacWorkerMain().
If not, can you elaborate on why you didn't disable it?

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.