Re: GUC tables - use designated initializers

2022-10-04 Thread Peter Smith
On Tue, Oct 4, 2022 at 5:48 PM Michael Paquier  wrote:
>
> On Tue, Oct 04, 2022 at 04:20:36PM +1100, Peter Smith wrote:
> > The v2 patches are updated as follows:
> >
> > 0001 - Now this patch only fixes a comment that had a wrong enum name.
>
> This was wrong, so fixed.

Thanks for pushing!

>
> > 0002 - Removes unnecessary whitespace (same as v1-0002)
>
> This one does not seem worth doing, though..

Yeah, fair enough. I didn't really expect much support for that one,
but I thought I'd post it anyway, since it removed 250 lines from
the already-long source file.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




get rid of Abs()

2022-10-04 Thread Peter Eisentraut
I was wondering why we have a definition of Abs() in c.h when there are 
more standard functions such as abs() and fabs() in widespread use.  I 
think this one is left over from pre-ANSI-C days.  The attached patches 
replace all uses of Abs() with more standard functions.


The first patch installs uses of abs() and fabs().  These are already in 
use in the tree and should be straightforward.


The next two patches install uses of llabs() and fabsf(), which are not 
in use yet.  But they are in C99.


The last patch removes the definition of Abs().
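
For illustration, a minimal standalone sketch (not part of the patches)
showing the standard replacements, including the -0.0 corner case
mentioned below:

	#include <stdio.h>
	#include <stdlib.h>		/* abs(), llabs() */
	#include <math.h>		/* fabs(), fabsf() */

	#define Abs(x) ((x) >= 0 ? (x) : -(x))	/* the old macro from c.h */

	int
	main(void)
	{
		printf("%d\n", abs(-5));					/* 5 */
		printf("%lld\n", llabs(-5000000000LL));		/* 5000000000, C99 */
		printf("%g\n", (double) fabsf(-1.5f));		/* 1.5, C99 */
		printf("%g\n", fabs(-0.0));					/* 0  (sign bit cleared) */
		printf("%g\n", Abs(-0.0));					/* -0 (sign bit preserved) */
		return 0;
	}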


Fun fact: The current definition

#define Abs(x) ((x) >= 0 ? (x) : -(x))

is slightly wrong for floating-point values.  Abs(-0.0) returns -0.0, 
but fabs(-0.0) returns +0.0.

From 595664765c53b0ffe0958bbc277bfccf10ea06e0 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 4 Oct 2022 08:15:13 +0200
Subject: [PATCH 1/4] Remove unnecessary uses of Abs()

Use abs() or fabs() instead.
---
 contrib/btree_gist/btree_date.c   |  4 ++--
 contrib/btree_gist/btree_float8.c |  4 ++--
 contrib/btree_gist/btree_int2.c   |  2 +-
 contrib/btree_gist/btree_int4.c   |  2 +-
 contrib/btree_gist/btree_interval.c   |  2 +-
 contrib/btree_gist/btree_time.c   |  2 +-
 contrib/btree_gist/btree_ts.c |  2 +-
 contrib/btree_gist/btree_utils_num.h  |  2 +-
 contrib/btree_gist/btree_utils_var.c  |  2 +-
 contrib/cube/cube.c   |  2 +-
 contrib/intarray/_int_gist.c  |  3 ++-
 contrib/intarray/_intbig_gist.c   |  4 +++-
 contrib/ltree/_ltree_gist.c   |  4 +++-
 contrib/seg/seg.c |  4 ++--
 src/backend/access/gist/gistproc.c|  4 ++--
 src/backend/optimizer/geqo/geqo_erx.c |  8 
 src/backend/partitioning/partbounds.c |  4 ++--
 src/backend/utils/adt/datetime.c  |  8 
 src/backend/utils/adt/numeric.c   | 20 ++--
 src/backend/utils/adt/rangetypes_gist.c   |  4 ++--
 src/backend/utils/adt/selfuncs.c  |  2 +-
 src/backend/utils/adt/timestamp.c |  4 ++--
 src/backend/utils/adt/tsgistidx.c |  2 +-
 src/backend/utils/adt/tsrank.c|  2 +-
 src/backend/utils/misc/guc.c  |  2 +-
 src/interfaces/ecpg/pgtypeslib/interval.c |  4 ++--
 26 files changed, 54 insertions(+), 49 deletions(-)

diff --git a/contrib/btree_gist/btree_date.c b/contrib/btree_gist/btree_date.c
index 455a265a4975..68a4107dbf08 100644
--- a/contrib/btree_gist/btree_date.c
+++ b/contrib/btree_gist/btree_date.c
@@ -95,7 +95,7 @@ gdb_date_dist(const void *a, const void *b, FmgrInfo *flinfo)

   DateADTGetDatum(*((const DateADT *) a)),

   DateADTGetDatum(*((const DateADT *) b)));
 
-   return (float8) Abs(DatumGetInt32(diff));
+   return (float8) abs(DatumGetInt32(diff));
 }
 
 
@@ -123,7 +123,7 @@ date_dist(PG_FUNCTION_ARGS)

   PG_GETARG_DATUM(0),

   PG_GETARG_DATUM(1));
 
-   PG_RETURN_INT32(Abs(DatumGetInt32(diff)));
+   PG_RETURN_INT32(abs(DatumGetInt32(diff)));
 }
 
 
diff --git a/contrib/btree_gist/btree_float8.c b/contrib/btree_gist/btree_float8.c
index b95a08e228bb..081a719b0061 100644
--- a/contrib/btree_gist/btree_float8.c
+++ b/contrib/btree_gist/btree_float8.c
@@ -79,7 +79,7 @@ gbt_float8_dist(const void *a, const void *b, FmgrInfo *flinfo)
r = arg1 - arg2;
if (unlikely(isinf(r)) && !isinf(arg1) && !isinf(arg2))
float_overflow_error();
-   return Abs(r);
+   return fabs(r);
 }
 
 
@@ -110,7 +110,7 @@ float8_dist(PG_FUNCTION_ARGS)
if (unlikely(isinf(r)) && !isinf(a) && !isinf(b))
float_overflow_error();
 
-   PG_RETURN_FLOAT8(Abs(r));
+   PG_RETURN_FLOAT8(fabs(r));
 }
 
 /**
diff --git a/contrib/btree_gist/btree_int2.c b/contrib/btree_gist/btree_int2.c
index a91b95ff3985..fdbf156586c9 100644
--- a/contrib/btree_gist/btree_int2.c
+++ b/contrib/btree_gist/btree_int2.c
@@ -105,7 +105,7 @@ int2_dist(PG_FUNCTION_ARGS)
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
 errmsg("smallint out of range")));
 
-   ra = Abs(r);
+   ra = abs(r);
 
PG_RETURN_INT16(ra);
 }
diff --git a/contrib/btree_gist/btree_int4.c b/contrib/btree_gist/btree_int4.c
index 7ea98c478c73..8915fb5d0877 100644
--- a/contrib/btree_gist/btree_int4.c
+++ b/contrib/btree_gist/btree_int4.c
@@ -106,7 +106,7 @@ int4_dist(PG_FUNCTION_ARGS)
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
 errmsg("integer out of range")));

Re: log_heap_visible(): remove unused parameter and update comment

2022-10-04 Thread Bharath Rupireddy
On Fri, Sep 30, 2022 at 7:48 PM Japin Li  wrote:
>
> On Fri, 30 Sep 2022 at 22:09, Bharath Rupireddy wrote:
> > On Fri, Sep 30, 2022 at 7:30 PM Japin Li  wrote:
> >>
> >> When I try to use -Wunused-parameter, I find there are many warnings :-( .
> >
> > Great!
> >
> > I think we can't just remove every unused parameter; for instance, it
> > makes sense to retain the PlannerInfo *root parameter even though it's
> > not used now, since it may be used in the future. But if the parameter
> > is of a type unrelated to the context of the function, like the one
> > committed in 65b158ae4e892c2da7a5e31e2d2645e5e79a0bfd and, to some
> > extent, like the proposed patch, it could be removed.
> >
> > Others may have different thoughts here.
>
> Maybe we can define a macro like UNUSED(x) for those parameters, and
> apply it to the parameters that might still be useful; then
> we can use -Wunused-parameter when compiling.  Any thoughts?

We have the pg_attribute_unused() macro already. I'm not sure whether
adding -Wunused-parameter to the compilation, plus using
pg_attribute_unused() for unused-yet-contextually-required parameters,
is a great idea. But it has some merit, as it avoids unused variables
lying around in the code. We could also discuss this in a separate
thread, IMO, to hear more from other hackers.
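
For reference, the existing macro is attached after the declared name; a
minimal sketch (the function and its parameters are hypothetical,
pg_attribute_unused() is the real macro from c.h):

	static int
	count_rels(PlannerInfo *root pg_attribute_unused(), int nrels)
	{
		/* root is kept for API symmetry, but is not read today */
		return nrels;
	}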

While on this, I noticed that pg_attribute_unused() is applied to
npages in AdvanceXLInsertBuffer(), but the npages variable is actually
used there. I think we can remove the annotation.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




future of serial and identity columns

2022-10-04 Thread Peter Eisentraut
In PostgreSQL 10, we added identity columns, as an alternative to serial 
columns (since 6.something).  They mostly work the same.  Identity 
columns are SQL-conforming, have some more features (e.g., overriding 
clause), and are a bit more robust in schema management.  Some of that 
was described in [0].  AFAICT, there have been no complaints since then
that identity columns lack features or are somehow a regression over
serial columns.


But clearly, the syntax "serial" is more handy, and most casual examples 
use that syntax.  So it seems like we are stuck with maintaining these 
two variants in parallel forever.  I was thinking we could nudge this a 
little by remapping "serial" internally to create an identity column 
instead.  At least then over time, the use of the older serial 
mechanisms would go away.


Note that pg_dump dumps a serial column in pieces (CREATE SEQUENCE + 
ALTER SEQUENCE ... OWNED BY + ALTER TABLE ... SET DEFAULT).  So if we 
did this, any existing databases would keep their old semantics, and 
those who really need it can manually create the old semantics as well.


Attached is a demo patch showing what the implementation of this change
would look like.  This creates a bunch of regression test failures, but
AFAICT, those are mainly display differences and some very peculiar test 
setups that are intentionally examining some edge cases.  These would 
need to be investigated in more detail, of course.



[0]: 
https://www.enterprisedb.com/blog/postgresql-10-identity-columns-explained

From e94d059db4f19ee2c58ddb5d6522c8bef17576d2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 4 Oct 2022 09:38:12 +0200
Subject: [PATCH] WIP: Change serial types to map to identity columns

This changes serial types to convert internally to identity columns,
instead of the custom construction involving nextval defaults that
they previously did.
---
 src/backend/parser/parse_utilcmd.c | 44 ++
 1 file changed, 2 insertions(+), 42 deletions(-)

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index bd068bba05e4..491c60834fda 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -588,51 +588,11 @@ transformColumnDefinition(CreateStmtContext *cxt, ColumnDef *column)
/* Special actions for SERIAL pseudo-types */
if (is_serial)
{
-   char   *snamespace;
-   char   *sname;
-   char   *qstring;
-   A_Const*snamenode;
-   TypeCast   *castnode;
-   FuncCall   *funccallnode;
Constraint *constraint;
 
-   generateSerialExtraStmts(cxt, column,
-
column->typeName->typeOid, NIL,
-false, false,
-&snamespace, 
&sname);
-
-   /*
-* Create appropriate constraints for SERIAL.  We do this in 
full,
-* rather than shortcutting, so that we will detect any 
conflicting
-* constraints the user wrote (like a different DEFAULT).
-*
-* Create an expression tree representing the function call
-* nextval('sequencename').  We cannot reduce the raw tree to 
cooked
-* form until after the sequence is created, but there's no 
need to do
-* so.
-*/
-   qstring = quote_qualified_identifier(snamespace, sname);
-   snamenode = makeNode(A_Const);
-   snamenode->val.node.type = T_String;
-   snamenode->val.sval.sval = qstring;
-   snamenode->location = -1;
-   castnode = makeNode(TypeCast);
-   castnode->typeName = SystemTypeName("regclass");
-   castnode->arg = (Node *) snamenode;
-   castnode->location = -1;
-   funccallnode = makeFuncCall(SystemFuncName("nextval"),
-   
list_make1(castnode),
-   
COERCE_EXPLICIT_CALL,
-   -1);
-   constraint = makeNode(Constraint);
-   constraint->contype = CONSTR_DEFAULT;
-   constraint->location = -1;
-   constraint->raw_expr = (Node *) funccallnode;
-   constraint->cooked_expr = NULL;
-   column->constraints = lappend(column->constraints, constraint);
-
constraint = makeNode(Constraint);
-   constraint->contype = CONSTR_NOTNULL;
+   constraint->contype = CONSTR_IDENTITY;
+   constraint->generated_when = ATTRIBUTE_IDENTITY_BY_DEFAULT;
constraint->location = -1;
  

Assorted fixes related to WAL files (was: Use XLogFromFileName() in pg_resetwal to parse position from WAL file)

2022-10-04 Thread Bharath Rupireddy
On Tue, Oct 4, 2022 at 12:11 PM Kyotaro Horiguchi wrote:
>
> >
> > > static uint32 minXlogTli = 0;
>
> I have found three other instances of this in xlog.c and
> pg_receivewal.c.  Are they worth fixing?
>
> (pg_upgrade.c has "uint32 tli/logid/segno" but I'm not sure they need
>  to be "fixed".  At least the segno is not an XLogSegNo.)

There are quite a number of places where data types need to be fixed,
see XLogFileNameById() callers. They are all being parsed as uint32
and then used. I'm not sure if we want to fix all of them.

I think we can fix/refactor a few WAL-file-related things:

1. 0001 replaces explicit WAL file parsing code with
XLogFromFileName() and uses XLByteToSeg() in pg_resetwal.c. This
wasn't done back in PG 10 because XLogFromFileName() didn't accept the
file size as an input parameter, and pg_resetwal needed to use the WAL
file size from the control file; commit
fc49e24fa69a15efacd5b8958115ed9c43c48f9a later added the
wal_segsz_bytes parameter to XLogFromFileName(). This removes some
extra variables in pg_resetwal.c and a bit of duplicate code, and it
also replaces the hand-rolled code with the XLByteToSeg() macro (a
sketch of the calls involved appears below this list).

2. 0002 replaces MAXPGPATH with MAXFNAMELEN for WAL file names.
MAXFNAMELEN (64 bytes) is meant to be used for all WAL file names
across the code base: a WAL file name in postgres can't be bigger than
64 bytes; in fact, it is never more than XLOG_FNAME_LEN (24 bytes),
but there are suffixes, timeline history files, etc., and MAXFNAMELEN
was introduced to accommodate all of that. Some places in the code
base still use MAXPGPATH (1024 bytes) for WAL file names, which is an
unnecessary waste of stack memory. This makes the code consistent and
saves a bit of space.

3. 0003 replaces WAL file name calculation with XLogFileNameById() in
pg_upgrade/controldata.c to be consistent across the code base. Note
that this requires us to change the nextxlogfile size from hard-coded
25 bytes to MAXFNAMELEN (64 bytes).
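
For reference, a hedged sketch of the calls involved in 0001 and 0002
(XLogFromFileName, XLByteToSeg, XLogFileName, MAXFNAMELEN, TimeLineID,
and XLogSegNo are real names from xlog_internal.h/xlogdefs.h; fname,
lsn, WalSegSz, tli and segno here are illustrative):

	TimeLineID	tli;
	XLogSegNo	segno;
	char		xlogfname[MAXFNAMELEN];	/* 64 bytes, not MAXPGPATH (1024) */

	/* parse the timeline ID and segment number out of a WAL file name */
	XLogFromFileName(fname, &tli, &segno, WalSegSz);

	/* map an LSN to the number of the segment containing it */
	XLByteToSeg(lsn, segno, WalSegSz);

	/* rebuild a WAL file name into a MAXFNAMELEN-sized buffer */
	XLogFileName(xlogfname, tli, segno, WalSegSz);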

I'm attaching the v2 patch set.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v2-0001-Use-WAL-file-related-inline-functions-macros-in-p.patch
Description: Binary data


v2-0002-Use-MAXFNAMELEN-for-WAL-file-names-instead-of-MAX.patch
Description: Binary data


v2-0003-Use-XLogFileNameById-in-pg_upgrade-controldata.c.patch
Description: Binary data


Re: installcheck-world concurrency issues

2022-10-04 Thread Michael Paquier
On Mon, Oct 03, 2022 at 04:41:11PM -0700, Andres Freund wrote:
> There are a few further roles that seem to pose some danger going forward:

I have never seen that myself, but 0001 is a nice cleanup.
generated.sql includes a user named "regress_user11".  Perhaps that's
worth renaming while at it?

> BTW, shouldn't src/test/modules/unsafe_tests use the PG_TEST_EXTRA mechanism
> somehow?  Seems not great to run it as part of installcheck-world, if we don't
> want to run it as part of installcheck.

Indeed.

> A second issue I noticed is that advisory_lock.sql often fails, because the
> pg_locks queries don't restrict to the current database. Patch attached.

As in prepared_xacts.sql or just advisory locks taken in an installed
cluster?  Or both?

> I attached the meson patch as well, but just because I used it to get to
> these patches.

I am still studying a lot of this area, but it seems like all the
spots requiring a custom configuration (aka NO_INSTALLCHECK) are
covered.  --setup running is working here with 0003.
--
Michael


signature.asc
Description: PGP signature


pid_t on mingw

2022-10-04 Thread Peter Eisentraut
I wanted to propose the attached patch to get rid of the custom pgpid_t 
typedef in pg_ctl.  Since we liberally use pid_t elsewhere, this seemed 
plausible.


However, this patch fails the CompilerWarnings job on Cirrus, because 
apparently under mingw, pid_t is "volatile long long int", so all the 
printf placeholders mismatch.  However, we print pid_t as %d in a lot of 
other places, so I'm confused why this fails here.
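
For what it's worth, the portable idiom is to cast explicitly so the
placeholder always matches; a minimal standalone sketch (not from the
patch):

	#include <stdio.h>
	#include <unistd.h>				/* pid_t, getpid() */

	int
	main(void)
	{
		pid_t		pid = getpid();

		printf("pid: %d\n", (int) pid);				/* OK while the value fits in int */
		printf("pid: %lld\n", (long long) pid);		/* safe for any pid_t width */
		return 0;
	}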


Also, googling around a bit, it seems that mingw might have changed
pid_t from long long int to int some time ago.  Maybe that's how
pgpid_t came about to begin with.  The Cirrus job uses a
cross-compilation environment; I wonder how up to date that is compared
to, say, the native mingw installations used on the buildfarm.


Any clues?

From 11aed64ddb6e5615c15e18b05825545ce66ed951 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 4 Oct 2022 08:12:31 +0200
Subject: [PATCH] Remove pgpid_t type, use pid_t instead

It's unclear why a separate type would be needed here.  We use plain
pid_t everywhere else.

Reverts 66fa6eba5a61be740a6c07de92c42221fae79e9c.
---
 src/bin/pg_ctl/pg_ctl.c  | 111 +++
 src/tools/pgindent/typedefs.list |   1 -
 2 files changed, 54 insertions(+), 58 deletions(-)

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 9291c61d000f..a7fb4a9b330b 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -33,9 +33,6 @@
 #include "pqexpbuffer.h"
 #endif
 
-/* PID can be negative for standalone backend */
-typedef long pgpid_t;
-
 
 typedef enum
 {
@@ -103,7 +100,7 @@ static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
-static volatile pgpid_t postmasterPID = -1;
+static volatile pid_t postmasterPID = -1;
 
 #ifdef WIN32
 static DWORD pgctl_start_type = SERVICE_AUTO_START;
@@ -129,7 +126,7 @@ static void do_reload(void);
 static void do_status(void);
 static void do_promote(void);
 static void do_logrotate(void);
-static void do_kill(pgpid_t pid);
+static void do_kill(pid_t pid);
 static void print_msg(const char *msg);
 static void adjust_data_dir(void);
 
@@ -147,13 +144,13 @@ static int CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo,
 static PTOKEN_PRIVILEGES GetPrivilegesToDelete(HANDLE hToken);
 #endif
 
-static pgpid_t get_pgpid(bool is_status_request);
+static pid_t get_pgpid(bool is_status_request);
 static char **readfile(const char *path, int *numlines);
 static void free_readfile(char **optlines);
-static pgpid_t start_postmaster(void);
+static pid_t start_postmaster(void);
 static void read_post_opts(void);
 
-static WaitPMResult wait_for_postmaster_start(pgpid_t pm_pid, bool do_checkpoint);
+static WaitPMResult wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint);
 static bool wait_for_postmaster_stop(void);
 static bool wait_for_postmaster_promote(void);
 static bool postmaster_is_alive(pid_t pid);
@@ -245,11 +242,11 @@ print_msg(const char *msg)
}
 }
 
-static pgpid_t
+static pid_t
 get_pgpid(bool is_status_request)
 {
FILE   *pidf;
-   longpid;
+   int pid;
struct stat statbuf;
 
if (stat(pg_data, &statbuf) != 0)
@@ -289,7 +286,7 @@ get_pgpid(bool is_status_request)
exit(1);
}
}
-   if (fscanf(pidf, "%ld", &pid) != 1)
+   if (fscanf(pidf, "%d", &pid) != 1)
{
/* Is the file empty? */
if (ftell(pidf) == 0 && feof(pidf))
@@ -301,7 +298,7 @@ get_pgpid(bool is_status_request)
exit(1);
}
fclose(pidf);
-   return (pgpid_t) pid;
+   return (pid_t) pid;
 }
 
 
@@ -439,13 +436,13 @@ free_readfile(char **optlines)
  * On Windows, we also save aside a handle to the shell process in
  * "postmasterProcess", which the caller should close when done with it.
  */
-static pgpid_t
+static pid_t
 start_postmaster(void)
 {
char   *cmd;
 
 #ifndef WIN32
-   pgpid_t pm_pid;
+   pid_t   pm_pid;
 
/* Flush stdio channels just before fork, to avoid double-output problems */
fflush(NULL);
@@ -593,7 +590,7 @@ start_postmaster(void)
  * manager checkpoint, it's got nothing to do with database checkpoints!!
  */
 static WaitPMResult
-wait_for_postmaster_start(pgpid_t pm_pid, bool do_checkpoint)
+wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint)
 {
int i;
 
@@ -610,7 +607,7 @@ wait_for_postmaster_start(pgpid_t pm_pid, bool do_checkpoint)
numlines >= LOCK_FILE_LINE_PM_STATUS)
{
/* File is complete enough for us, parse it */
-   pgpid_t pmpid;
+   pid_t   pmpid;
time_t  pmstart;
 
/*
@@ -665,7 +662,7 @@ wait_for_postmaster_s

Re: Fix some newly modified tab-complete changes

2022-10-04 Thread Peter Smith
On Thu, Sep 29, 2022 at 12:50 PM shiy.f...@fujitsu.com wrote:
>
> On Wed, Sep 28, 2022 1:49 PM Kyotaro Horiguchi wrote:
> >
> > At Wed, 28 Sep 2022 14:14:01 +1000, Peter Smith wrote in
...
> > >
> > > 2. tab complete for GRANT
> > >
> > > test_pub=# grant 
> > > ALL            EXECUTE                    pg_execute_server_program  pg_read_server_files   postgres    TRIGGER
> > > ALTER SYSTEM   GRANT                      pg_monitor                 pg_signal_backend      REFERENCES  TRUNCATE
> > > CONNECT        INSERT                     pg_read_all_data           pg_stat_scan_tables    SELECT      UPDATE
> > > CREATE         pg_checkpoint              pg_read_all_settings       pg_write_all_data      SET         USAGE
> > > DELETE         pg_database_owner          pg_read_all_stats          pg_write_server_files  TEMPORARY
> > >
> > > 2a.
> > > grant "GRANT" ??
> >
> > Yeah, for the moment I thought that might be a kind of admin option, but
> > there's no such privilege.  REVOKE gets the same suggestion.
> >
>
> Maybe that's for "REVOKE GRANT OPTION FOR", but it is used by both GRANT
> and REVOKE.  I think it's a separate problem; I have tried to fix it in
> the attached 0002 patch.
>

I checked your v2-0002 patch, and AFAICT it properly fixes the
previously reported GRANT/REVOKE problem.

~

But while testing I noticed another quirk.

It seems that neither the GRANT nor the REVOKE auto-complete
recognises the optional PRIVILEGES keyword:

e.g. GRANT ALL  --> ON  (but not PRIVILEGES)
e.g. GRANT ALL PRIV --> ???

e.g. REVOKE ALL  --> ON (but not PRIVILEGES)
e.g. REVOKE ALL PRIV --> ???

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: log_heap_visible(): remove unused parameter and update comment

2022-10-04 Thread Peter Eisentraut

On 04.10.22 09:19, Bharath Rupireddy wrote:

We have the pg_attribute_unused() macro already. I'm not sure whether
adding -Wunused-parameter to the compilation, plus using
pg_attribute_unused() for unused-yet-contextually-required parameters,
is a great idea. But it has some merit, as it avoids unused variables
lying around in the code. We could also discuss this in a separate
thread, IMO, to hear more from other hackers.


I tried this once.  The patch I have from a few years ago is

 420 files changed, 1482 insertions(+), 1482 deletions(-)

and it was a lot of work to maintain.

I can send it in if there is interest.  But I'm not sure if it's worth it.





Re: interrupted tap tests leave postgres instances around

2022-10-04 Thread Peter Eisentraut

On 30.09.22 06:07, Andres Freund wrote:

When tap tests are interrupted (e.g. with ctrl-c), we don't cancel running
postgres instances etc. That doesn't strike me as a good thing.

In contrast, the postgres instances started by pg_regress do terminate. I
assume this is because pg_regress starts postgres directly, whereas tap tests
largely start postgres via pg_ctl. pg_ctl will, as it should, start postgres
without a controlling terminal. Thus a ctrl-c won't be delivered to it.


I recently ran into the problem that pg_upgrade starts the servers with
pg_ctl, and thus without a terminal, so you can't get any password
prompts for SSL keys, for example.  Taking out the setsid() call in
pg_ctl.c fixed that.  I suspect this is ultimately the same problem.


We could make the TAP tests and pg_upgrade not use pg_ctl and start the
postmaster directly.  I'm not sure how much work that would be, but
seeing that pg_regress does it, it doesn't seem unreasonable.


Alternatively, perhaps we could add a mode to pg_ctl in which it doesn't
call setsid().  This could be activated by an environment variable.
That might address all these problems, too.
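
For instance, the pg_ctl side could look roughly like this (a hedged
sketch only, not from an actual patch; the environment variable name is
hypothetical):

	/* in pg_ctl's child process, where it currently detaches unconditionally */
	if (getenv("PG_CTL_NO_SETSID") == NULL)
	{
		if (setsid() < 0)
		{
			perror("setsid");
			exit(1);
		}
	}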






Re: Assorted fixes related to WAL files (was: Use XLogFromFileName() in pg_resetwal to parse position from WAL file)

2022-10-04 Thread Kyotaro Horiguchi
At Tue, 4 Oct 2022 13:20:54 +0530, Bharath Rupireddy wrote in
> On Tue, Oct 4, 2022 at 12:11 PM Kyotaro Horiguchi wrote:
> >
> > >
> > > > static uint32 minXlogTli = 0;
> >
> > I have found three other instances of this in xlog.c and
> > pg_receivewal.c.  Are they worth fixing?
> >
> > (pg_upgrade.c has "uint32 tli/logid/segno" but I'm not sure they need
> >  to be "fixed".  At least the segno is not an XLogSegNo.)
> 
> There are quite a number of places where data types need to be fixed,
> see XLogFileNameById() callers. They are all being parsed as uint32
> and then used. I'm not sure if we want to fix all of them.
> 
> I think we can fix/refactor a few WAL-file-related things:
> 
> 1. 0001 replaces explicit WAL file parsing code with
> XLogFromFileName() and uses XLByteToSeg() in pg_resetwal.c. This
> wasn't done back in PG 10 because XLogFromFileName() didn't accept the
> file size as an input parameter, and pg_resetwal needed to use the WAL
> file size from the control file; commit
> fc49e24fa69a15efacd5b8958115ed9c43c48f9a later added the
> wal_segsz_bytes parameter to XLogFromFileName(). This removes some
> extra variables in pg_resetwal.c and a bit of duplicate code, and it
> also replaces the hand-rolled code with the XLByteToSeg() macro.

Looks good to me.

> 2. 0002 replaces MAXPGPATH with MAXFNAMELEN for WAL file names.
> MAXFNAMELEN (64 bytes) is meant to be used for all WAL file names
> across the code base: a WAL file name in postgres can't be bigger than
> 64 bytes; in fact, it is never more than XLOG_FNAME_LEN (24 bytes),
> but there are suffixes, timeline history files, etc., and MAXFNAMELEN
> was introduced to accommodate all of that. Some places in the code
> base still use MAXPGPATH (1024 bytes) for WAL file names, which is an
> unnecessary waste of stack memory. This makes the code consistent and
> saves a bit of space.

Looks reasonable, too.  I don't find other instances of the same mistake.

> 3. 0003 replaces WAL file name calculation with XLogFileNameById() in
> pg_upgrade/controldata.c to be consistent across the code base. Note
> that this requires us to change the nextxlogfile size from hard-coded
> 25 bytes to MAXFNAMELEN (64 bytes).

I'm not sure I like this. In other places where XLogFileNameById() is
used, the buffer is known to store longer strings, so MAXFNAMELEN is
reasonable.  But we don't need to add 39 useless bytes here.
Therefore, even if I wanted to change it, I would replace it with
"XLOG_FNAME_LEN + 1".

> I'm attaching the v2 patch set.
> 
> Thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Reducing the chunk header sizes on all memory context types

2022-10-04 Thread David Rowley
On Tue, 4 Oct 2022 at 13:35, Ranier Vilela  wrote:
> Revisiting my work on reducing memory consumption, I found this patch had
> been left out.
> I'm not sure I can help.
> But basically I was able to write and read the block size in the chunk.
> Could the context pointer be written and read in the same way?
> Sure, this leads to some performance loss, but would it make it possible
> to get the context pointer from the chunk?
> In other words, would it be possible to save the context pointer and
> compare it later in MemoryContextContains?

I'm not really sure I understand the intention of the patch. The
header size for all our memory contexts was already reduced in
c6e0fe1f2.  The patch you sent seems to pre-date that commit.

David




JUMBLE_SIZE macro in two files

2022-10-04 Thread bt22nakamorit

Hi,

queryjumble.c and queryjumble.h both define a macro JUMBLE_SIZE = 1024.
Since queryjumble.c includes queryjumble.h, the JUMBLE_SIZE definition 
in queryjumble.c should be deleted.

Thoughts?

Tatsu

diff --git a/src/backend/utils/misc/queryjumble.c b/src/backend/utils/misc/queryjumble.c
index a8508463e7..dd1ffbd546 100644
--- a/src/backend/utils/misc/queryjumble.c
+++ b/src/backend/utils/misc/queryjumble.c
@@ -37,8 +37,6 @@
 #include "parser/scansup.h"
 #include "utils/queryjumble.h"
 
-#define JUMBLE_SIZE			1024	/* query serialization buffer size */
-
 /* GUC parameters */
 intcompute_query_id = COMPUTE_QUERY_ID_AUTO;


Re: get rid of Abs()

2022-10-04 Thread Zhang Mingli
Hi,

On Oct 4, 2022, 15:07 +0800, Peter Eisentraut wrote:
> I was wondering why we have a definition of Abs() in c.h when there are
> more standard functions such as abs() and fabs() in widespread use. I
> think this one is left over from pre-ANSI-C days. The attached patches
> replace all uses of Abs() with more standard functions.
>
> The first patch installs uses of abs() and fabs(). These are already in
> use in the tree and should be straightforward.
>
> The next two patches install uses of llabs() and fabsf(), which are not
> in use yet. But they are in C99.
>
> The last patch removes the definition of Abs().
>
>
> Fun fact: The current definition
>
> #define Abs(x) ((x) >= 0 ? (x) : -(x))
>
> is slightly wrong for floating-point values. Abs(-0.0) returns -0.0,
> but fabs(-0.0) returns +0.0.
+1,

Like patch 3, I also found some places where fabsf could be used instead
of fabs, so I've added a patch to replace them.

Regards,
Zhang Mingli


v2-0005-replace-fabs-with-fabsf-if-possible.patch
Description: Binary data


Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-10-04 Thread Bharath Rupireddy
On Thu, Sep 29, 2022 at 7:43 PM Bharath Rupireddy wrote:
>
> Please see the attached v1 patch.

FWIW, I'm attaching Nathan's patch that introduced the new function
pg_walfile_offset_lsn as 0002 in the v1 patch set. Here's the CF entry
- https://commitfest.postgresql.org/40/3909/.

Please consider this for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b19aa25a0d1f2ce85abe0c2081c0e7ede256e329 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 29 Sep 2022 13:29:44 +
Subject: [PATCH v1] Add LSN along with offset to error messages reported for
 WAL file read/write/validate header failures

---
 src/backend/access/transam/xlog.c |  8 ++--
 src/backend/access/transam/xlogreader.c   | 16 +++-
 src/backend/access/transam/xlogrecovery.c | 13 -
 src/backend/access/transam/xlogutils.c| 10 ++
 src/include/access/xlogreader.h   |  1 +
 5 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e15256db8..a495bbac85 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2218,6 +2218,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 {
 	char		xlogfname[MAXFNAMELEN];
 	int			save_errno;
+	XLogRecPtr	lsn;
 
 	if (errno == EINTR)
 		continue;
@@ -2226,11 +2227,14 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 	XLogFileName(xlogfname, tli, openLogSegNo,
  wal_segment_size);
 	errno = save_errno;
+	XLogSegNoOffsetToRecPtr(openLogSegNo, startoffset,
+			wal_segment_size, lsn);
 	ereport(PANIC,
 			(errcode_for_file_access(),
 			 errmsg("could not write to log file %s "
-	"at offset %u, length %zu: %m",
-	xlogfname, startoffset, nleft)));
+	"at offset %u, LSN %X/%X, length %zu: %m",
+	xlogfname, startoffset,
+	LSN_FORMAT_ARGS(lsn), nleft)));
 }
 nleft -= written;
 from += written;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5a8fe81f82..c3befc44ba 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1229,9 +1229,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
-			  "invalid magic number %04X in WAL segment %s, offset %u",
+			  "invalid magic number %04X in WAL segment %s, LSN %X/%X, offset %u",
 			  hdr->xlp_magic,
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1243,9 +1244,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
-			  "invalid info bits %04X in WAL segment %s, offset %u",
+			  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 			  hdr->xlp_info,
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1284,9 +1286,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
-			  "invalid info bits %04X in WAL segment %s, offset %u",
+			  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 			  hdr->xlp_info,
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1303,9 +1306,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
-			  "unexpected pageaddr %X/%X in WAL segment %s, offset %u",
+			  "unexpected pageaddr %X/%X in WAL segment %s, LSN %X/%X, offset %u",
 			  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1328,10 +1332,11 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 			report_invalid_record(state,
-  "out-of-sequence timeline ID %u (after %u) in WAL segment %s, offset %u",
+  "out-of-sequence timeline ID %u (after %u) in WAL segment %s, LSN %X/%X, offset %u",
   hdr->xlp_tli,
   state->latestPageTLI,
   fname,
+  LSN_FORMAT_ARGS(recptr),
   offset);
 			return false;
 		}
@@ -1556,6 +1561,7 @@ WALRead(XLogReaderState *state,
 			errinfo->wre_req = segbytes;
 			errinfo->wre_read = readbytes;
 			errinfo->wre_off = startoff;
+			errinfo->wre_lsn = recptr;
 			errinfo->wre_seg = state->seg;
 			retur

Re: Miscellaneous tab completion issue fixes

2022-10-04 Thread vignesh C
On Tue, 4 Oct 2022 at 09:13, Michael Paquier  wrote:
>
> On Mon, Oct 03, 2022 at 06:29:32PM +0100, Dagfinn Ilmari Mannsåker wrote:
> > vignesh C  writes:
> >> +else if (TailMatchesCS("\\dRp*"))
> >> +COMPLETE_WITH_QUERY(Query_for_list_of_publications[0].query);
> >> +else if (TailMatchesCS("\\dRs*"))
> >> +COMPLETE_WITH_QUERY(Query_for_list_of_subscriptions[0].query);
> >
> > These are version-specific queries, so should be passed in their
> > entirety to COMPLETE_WITH_VERSIONED_QUERY() so that psql can pick the
> > right version, and avoid sending the query at all if the server is too
> > old.
>
> +1.
>

Modified.

> >> +/* add these to Query_for_list_of_roles in OWNER TO contexts */
> >> +#define Keywords_for_list_of_owner_to_roles \
> >> +"CURRENT_ROLE", "CURRENT_USER", "SESSION_USER"
> >
> > I think this would read better without the TO, both in the comment and
> > the constant name, similar to the below only having GRANT without TO:
>
> Keywords_for_list_of_grant_roles is used in six code paths, so it
> seems to me that there is little gain in having a separate #define
> here.  Let's just specify the list of allowed roles (RoleSpec) where
> OWNER TO is parsed.

I have removed the macro and specified the allowed roles.

Thanks for the comments; the attached v2 patch includes these changes.

Regards,
Vignesh
From 525eb4d3959f316338160bfbcc769f64c16310a0 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Sun, 2 Oct 2022 10:59:59 +0530
Subject: [PATCH v2 2/2] Include CURRENT_ROLE, CURRENT_USER and SESSION_USER in
 tab completion while changing owner.

Include CURRENT_ROLE, CURRENT_USER and SESSION_USER in tab completion
while changing owner.
---
 src/bin/psql/tab-complete.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 2818fb26a0..584d9d5ae6 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -4160,7 +4160,10 @@ psql_completion(const char *text, int start, int end)
 
 /* OWNER TO  - complete with available roles */
 	else if (TailMatches("OWNER", "TO"))
-		COMPLETE_WITH_QUERY(Query_for_list_of_roles);
+		COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_roles,
+ "CURRENT_ROLE",
+ "CURRENT_USER",
+ "SESSION_USER");
 
 /* ORDER BY */
 	else if (TailMatches("FROM", MatchAny, "ORDER"))
-- 
2.32.0

From 890ac3d01ecd4238572047e738a80acd1a5c7a67 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Sun, 2 Oct 2022 10:45:04 +0530
Subject: [PATCH v2 1/2] Display publications and subscriptions for tab
 completion of \dRp and \dRs.

Display publications and subscriptions for tab completion of \dRp and
\dRs.
---
 src/bin/psql/tab-complete.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 71cfe8aec1..2818fb26a0 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -4614,6 +4614,10 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_partitioned_tables);
 	else if (TailMatchesCS("\\dP*"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_partitioned_relations);
+	else if (TailMatchesCS("\\dRp*"))
+		COMPLETE_WITH_VERSIONED_QUERY(Query_for_list_of_publications);
+	else if (TailMatchesCS("\\dRs*"))
+		COMPLETE_WITH_VERSIONED_QUERY(Query_for_list_of_subscriptions);
 	else if (TailMatchesCS("\\ds*"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_sequences);
 	else if (TailMatchesCS("\\dt*"))
-- 
2.32.0



Re: future of serial and identity columns

2022-10-04 Thread Laurenz Albe
On Tue, 2022-10-04 at 09:41 +0200, Peter Eisentraut wrote:
> In PostgreSQL 10, we added identity columns, as an alternative to serial 
> columns (since 6.something).  They mostly work the same.  Identity 
> columns are SQL-conforming, have some more features (e.g., overriding 
> clause), and are a bit more robust in schema management.  Some of that 
> was described in [0].  AFAICT, there have been no complaints since then
> that identity columns lack features or are somehow a regression over
> serial columns.
> 
> But clearly, the syntax "serial" is more handy, and most casual examples 
> use that syntax.  So it seems like we are stuck with maintaining these 
> two variants in parallel forever.  I was thinking we could nudge this a 
> little by remapping "serial" internally to create an identity column 
> instead.  At least then over time, the use of the older serial 
> mechanisms would go away.

I think that would be great.
That might generate some confusion among users who follow old tutorials
and are surprised that the eventual table definition differs, but I'd say
that is a good thing.

Yours,
Laurenz Albe




Re: Miscellaneous tab completion issue fixes

2022-10-04 Thread Dagfinn Ilmari Mannsåker
vignesh C  writes:

> On Tue, 4 Oct 2022 at 09:13, Michael Paquier  wrote:
>>
>> On Mon, Oct 03, 2022 at 06:29:32PM +0100, Dagfinn Ilmari Mannsåker wrote:
>> > vignesh C  writes:
>> >> +else if (TailMatchesCS("\\dRp*"))
>> >> +COMPLETE_WITH_QUERY(Query_for_list_of_publications[0].query);
>> >> +else if (TailMatchesCS("\\dRs*"))
>> >> +
>> >> COMPLETE_WITH_QUERY(Query_for_list_of_subscriptions[0].query);
>> >
>> > These are version-specific queries, so should be passed in their
>> > entirety to COMPLETE_WITH_VERSIONED_QUERY() so that psql can pick the
>> > right version, and avoid sending the query at all if the server is too
>> > old.
>>
>> +1.
>
> Modified
>
>> >> +/* add these to Query_for_list_of_roles in OWNER TO contexts */
>> >> +#define Keywords_for_list_of_owner_to_roles \
>> >> +"CURRENT_ROLE", "CURRENT_USER", "SESSION_USER"
>> >
>> > I think this would read better without the TO, both in the comment and
>> > the constant name, similar to the below only having GRANT without TO:
>>
>> Keywords_for_list_of_grant_roles is used in six code paths, so it
>> seems to me that there is little gain in having a separate #define
>> here.  Let's just specify the list of allowed roles (RoleSpec) where
>> OWNER TO is parsed.
>
> I have removed the macro and specified the allowed roles.
>
> Thanks for the comments, the attached v2 patch has the changes for the same.

LGTM, +1 to commit.

- ilmari




Re: Pluggable toaster

2022-10-04 Thread Nikita Malakhov
Hi hackers!
Now cfbot is happy, but there were warnings due to recent changes in the
PointerGetDatum function, so here's a corrected patchset.

Patchset consists of:
v20-0001-toaster-interface.patch - Pluggable TOAST API interface along with
reference TOAST mechanics - new API is introduced but
reference TOAST is still unchanged;
v20-0002-toaster-default.patch - Default TOAST re-implemented
using Toaster API - reference TOAST is re-implemented via new API;
v20-0003-toaster-docs.patch - Pluggable TOAST API documentation package

Actual GitHub branch resides at
https://github.com/postgrespro/postgres/tree/toasterapi_clean

On Tue, Oct 4, 2022 at 1:02 AM Nikita Malakhov  wrote:

> Hi hackers!
> Cfbot failed in meson build with previous patchsets, so I've rebased them
> onto the latest master and added necessary meson build info.
>
> Patchset consists of:
> v19-0001-toaster-interface.patch - Pluggable TOAST API interface along
> with reference TOAST mechanics - new API is introduced but
> reference TOAST is still unchanged;
> v19-0002-toaster-default.patch - Default TOAST re-implemented using
> Toaster API - reference TOAST is re-implemented via new API;
> v19-0003-toaster-docs.patch - Pluggable TOAST API documentation package
>
> Actual GitHub branch resides at
> https://github.com/postgrespro/postgres/tree/toasterapi_clean
>
> On Tue, Sep 27, 2022 at 12:26 AM Nikita Malakhov 
> wrote:
>
>> Hi,
>> Meson build for the patchset failed, meson build files attached and
>> README/Doc package
>> reworked with more detailed explanation of virtual function table along
>> with other corrections.
>>
>> On Sun, Sep 25, 2022 at 1:41 AM Nikita Malakhov 
>> wrote:
>>
>>> Hi hackers!
>>> Last patchset has an invalid patch file - v16-0003-toaster-docs.patch.
>>> Here's corrected patchset,
>>> sorry for the noise.
>>>
>>> On Sat, Sep 24, 2022 at 3:50 PM Nikita Malakhov 
>>> wrote:
>>>
 Hi hackers!

 Cfbot is still not happy with the patchset, so I'm attaching a rebased
 one, rebased onto the current
 master (from today). The third patch contains documentation package,
 and the second one contains large
 README.toastapi file providing additional in-depth docs for developers.

 Comments would be greatly appreciated.

 Again, after checking patch sources I have a strong opinion that it
 needs some refactoring -
 move all files related to TOAST implementation into new folder
 /backend/access/toast where
 Generic (default) Toaster resides.

 Patchset consists of:
 v16-0001-toaster-interface.patch - Pluggable TOAST API interface along
 with reference TOAST mechanics;
 v16-0002-toaster-default.patch - Default TOAST re-implemented using
 Toaster API;
 v16-0003-toaster-docs.patch - Pluggable TOAST API documentation package

 Actual GitHub branch resides at
 https://github.com/postgrespro/postgres/tree/toasterapi_clean

 On Fri, Sep 23, 2022 at 10:54 PM Nikita Malakhov 
 wrote:

> Hi hackers!
>
> Cfbot is not happy with previous patchset, so I'm attaching new one,
> rebased onto current master
> (15b4). Also providing patch with documentation package, and the
> second one contains large
> README.toastapi file providing additional in-depth docs for developers.
>
> Comments would be greatly appreciated.
>
> Also, after checking patch sources I have a strong opinion that it
> needs some refactoring -
> move all files related to TOAST implementation into new folder
> /backend/access/toast where
> Generic (default) Toaster resides.
>
> Patchset consists of:
> v15-0001-toaster-interface.patch - Pluggable TOAST API interface along
> with reference TOAST mechanics;
> v15-0002-toaster-default.patch - Default TOAST re-implemented using
> Toaster API;
> v15-0003-toaster-docs.patch - Pluggable TOAST API documentation package
>
> On Tue, Sep 13, 2022 at 7:50 PM Jacob Champion <
> jchamp...@timescale.com> wrote:
>
>> On Mon, Sep 12, 2022 at 11:45 PM Nikita Malakhov 
>> wrote:
>> > It would be more clear for complex data types like JSONB, where
>> developers could
>> > need some additional functionality to work with internal
>> representation of data type,
>> > and its full potential is revealed in our JSONB toaster extension.
>> The JSONB toaster
>> > is still in development but we plan to make it available soon.
>>
>> Okay. It'll be good to have that, because as it is now it's hard to
>> see the whole picture.
>>
>> > On installing dummy_toaster contrib: I've just checked it by making
>> a patch from commit
>> > and applying onto my clone of master and 2 patches provided in
>> previous email without
>> > any errors and sll checks passed - applying with git am, configure
>> with debug, cassert,
>> > depend and enable-tap-tests flags and run checks.
>> > Plea

Re: JUMBLE_SIZE macro in two files

2022-10-04 Thread Julien Rouhaud
Hi,

On Tue, Oct 04, 2022 at 05:41:12PM +0900, bt22nakamorit wrote:
>
> queryjumble.c and queryjumble.h both define a macro JUMBLE_SIZE = 1024.
> Since queryjumble.c includes queryjumble.h, the JUMBLE_SIZE definition in
> queryjumble.c should be deleted.

+1




Re: Pluggable toaster

2022-10-04 Thread Nikita Malakhov
Hi hackers!
Now cfbot is happy, but there were warnings due to recent changes in the
PointerGetDatum function, so here's a corrected patchset.
Sorry, I forgot to attach the patch files. My fault.

Patchset consists of:
v20-0001-toaster-interface.patch - Pluggable TOAST API interface along with
reference TOAST mechanics - new API is introduced but
reference TOAST is still unchanged;
v20-0002-toaster-default.patch - Default TOAST re-implemented
using Toaster API - reference TOAST is re-implemented via new API;
v20-0003-toaster-docs.patch - Pluggable TOAST API documentation package

Actual GitHub branch resides at
https://github.com/postgrespro/postgres/tree/toasterapi_clean

On Tue, Oct 4, 2022 at 1:45 PM Nikita Malakhov  wrote:

> Hi hackers!
> Now cfbot is happy, but there were warnings due to recent changes in
> PointerGetDatum function, so here's corrected patchset.
>
> Patchset consists of:
> v20-0001-toaster-interface.patch - Pluggable TOAST API interface along
> with reference TOAST mechanics - new API is introduced but
> reference TOAST is still unchanged;
> v20-0002-toaster-default.patch - Default TOAST re-implemented
> using Toaster API - reference TOAST is re-implemented via new API;
> v20-0003-toaster-docs.patch - Pluggable TOAST API documentation package
>
> Actual GitHub branch resides at
> https://github.com/postgrespro/postgres/tree/toasterapi_clean
>
> On Tue, Oct 4, 2022 at 1:02 AM Nikita Malakhov  wrote:
>
>> Hi hackers!
>> Cfbot failed in meson build with previous patchsets, so I've rebased them
>> onto the latest master and added necessary meson build info.
>>
>> Patchset consists of:
>> v19-0001-toaster-interface.patch - Pluggable TOAST API interface along
>> with reference TOAST mechanics - new API is introduced but
>> reference TOAST is still unchanged;
>> v19-0002-toaster-default.patch - Default TOAST re-implemented using
>> Toaster API - reference TOAST is re-implemented via new API;
>> v19-0003-toaster-docs.patch - Pluggable TOAST API documentation package
>>
>> Actual GitHub branch resides at
>> https://github.com/postgrespro/postgres/tree/toasterapi_clean
>>
>> On Tue, Sep 27, 2022 at 12:26 AM Nikita Malakhov 
>> wrote:
>>
>>> Hi,
>>> Meson build for the patchset failed, meson build files attached and
>>> README/Doc package
>>> reworked with more detailed explanation of virtual function table along
>>> with other corrections.
>>>
>>> On Sun, Sep 25, 2022 at 1:41 AM Nikita Malakhov 
>>> wrote:
>>>
 Hi hackers!
 Last patchset has an invalid patch file - v16-0003-toaster-docs.patch.
 Here's corrected patchset,
 sorry for the noise.

 On Sat, Sep 24, 2022 at 3:50 PM Nikita Malakhov 
 wrote:

> Hi hackers!
>
> Cfbot is still not happy with the patchset, so I'm attaching a rebased
> one, rebased onto the current
> master (from today). The third patch contains documentation package,
> and the second one contains large
> README.toastapi file providing additional in-depth docs for developers.
>
> Comments would be greatly appreciated.
>
> Again, after checking patch sources I have a strong opinion that it
> needs some refactoring -
> move all files related to TOAST implementation into new folder
> /backend/access/toast where
> Generic (default) Toaster resides.
>
> Patchset consists of:
> v16-0001-toaster-interface.patch - Pluggable TOAST API interface along
> with reference TOAST mechanics;
> v16-0002-toaster-default.patch - Default TOAST re-implemented using
> Toaster API;
> v16-0003-toaster-docs.patch - Pluggable TOAST API documentation package
>
> Actual GitHub branch resides at
> https://github.com/postgrespro/postgres/tree/toasterapi_clean
>
> On Fri, Sep 23, 2022 at 10:54 PM Nikita Malakhov 
> wrote:
>
>> Hi hackers!
>>
>> Cfbot is not happy with previous patchset, so I'm attaching new one,
>> rebased onto current master
>> (15b4). Also providing patch with documentation package, and the
>> second one contains large
>> README.toastapi file providing additional in-depth docs for
>> developers.
>>
>> Comments would be greatly appreciated.
>>
>> Also, after checking patch sources I have a strong opinion that it
>> needs some refactoring -
>> move all files related to TOAST implementation into new folder
>> /backend/access/toast where
>> Generic (default) Toaster resides.
>>
>> Patchset consists of:
>> v15-0001-toaster-interface.patch - Pluggable TOAST API interface
>> along with reference TOAST mechanics;
>> v15-0002-toaster-default.patch - Default TOAST re-implemented using
>> Toaster API;
>> v15-0003-toaster-docs.patch - Pluggable TOAST API documentation
>> package
>>
>> On Tue, Sep 13, 2022 at 7:50 PM Jacob Champion <
>> jchamp...@timescale.com> wrote:
>>
>>> On Mon, Sep 12, 2022 at 11:45 PM Nikita Malakhov 
>>> wrote:
>>> > I

possibility of partial data dumps with pg_dump

2022-10-04 Thread Никита Старовойтов
Hello,
with a view to getting familiar with the postgres code and getting some
practice with it, I am making a small patch that adds the possibility of
partial table dumps. A filtering rule is specified with a standard SQL
where clause (without the "where" keyword).
There are three ways to pass data filters on the command line:

1) using a table pattern in the "where" parameter with the divider '@'
... --where "table_pattern@where_condition" ...

"Where" condition will be used for all tables that match the search pattern.

2) using the "where" parameter before any table inclusion
... --where "where condition" ...

All tables in the database will be filtered with the input condition.

3) using "where" parameter after table pattern
... -t table_pattern --where where_condition ...

Only tables matching the last pattern before --where will be filtered. The
third way is necessary to shorten the command line and to avoid duplicating
table patterns when specific tables are dumped.

Filters may also be read from files.
A file consists of lines, and every line is either a table pattern or a
where condition for the data.
For example, file
"""
where column_name_1 == 1
table_pattern
table_pattern where column_name_2 == 1
"""
corresponds to the parameters

 --where "column_name_1 == 1" -t table_pattern --where "column_name_2 == 1"

The file format is not very good, because it doesn't provide a way to pass
patterns for other object types, such as schemas, for example.
I am ready to change it if the functionality is actually needed.

All use cases are covered by tests.

I will be grateful if the patch gets some discussion.


where_condition_v1.patch
Description: Binary data


Re: Small miscellaneous fixes

2022-10-04 Thread Ranier Vilela
On Tue, Oct 4, 2022 at 01:18 Michael Paquier wrote:

> On Mon, Oct 03, 2022 at 08:05:57AM -0300, Ranier Vilela wrote:
> > On Mon, Oct 3, 2022 at 05:01 Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >> On Fri, Sep 30, 2022 at 9:08 AM Ranier Vilela 
> wrote:
> >>> 1. Avoid useless reassigning var _logsegno
> >> (src/backend/access/transam/xlog.c)
> >>> Commit 7d70809 left a little oversight.
> >>> XLByteToPrevSeg and XLByteToSeg are macros, and both assign _logsegno.
> >>> So, the first assignment is lost and is useless.
>
> Right, I have missed this one.  We do that now in
> build_backup_content() when building the contents of the backup
> history file.
>
> >>> 4. Fix declaration volatile signal var (src/bin/pgbench/pgbench.c)
> >>> Like how to commit 5ac9e86, this is a similar case.
> >>
> >> The same is true also for alarm_triggered in pg_test_fsync.c?
> >>
> > I don't think so.
> > If I understand the problem correctly, the failure can occur with true
> > signals provided by the OS.
> > In the case at hand, it seems to me more like an internal form of signal,
> > that is, a simulated one.
> > So bool works fine.
>
> I am not following your reasoning here.  Why does it matter to change
> one but not the other?  Both are used with SIGALRM, it seems.
>
Both are correct; I missed the pqsignal calls.

Attached is a patch to change this.
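
For reference, a minimal sketch of the pattern the fix applies (C requires
flags written from a signal handler to have type volatile sig_atomic_t;
the handler name here is illustrative):

	#include <signal.h>

	static volatile sig_atomic_t alarm_triggered = 0;

	static void
	process_alarm(int signo)
	{
		alarm_triggered = 1;	/* setting the flag is all the handler does */
	}

	/* registered elsewhere with pqsignal(SIGALRM, process_alarm) */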


> The other three seem fine, so fixed.
>
Thanks Michael for the commit.

regards,
Ranier Vilela


fix_declaration_volatile_signal_pg_test_fsync.patch
Description: Binary data


Re: Reducing the chunk header sizes on all memory context types

2022-10-04 Thread Ranier Vilela
On Tue, Oct 4, 2022 at 05:36 David Rowley wrote:

> On Tue, 4 Oct 2022 at 13:35, Ranier Vilela  wrote:
> > Revisiting my work on reducing memory consumption, I found this patch
> left out.
> > I'm not sure I can help.
> > But basically I was able to write and read the block size, in the chunk.
> > Could it be the case of writing and reading the context pointer in the
> same way?
> > Sure this leads to some performance loss, but would it make it possible
> to get the context pointer from the chunk?
> > In other words, would it be possible to save the context pointer and
> compare it later in MemoryContextContains?
>
> I'm not really sure I understand the intention of the patch. The
> header size for all our memory contexts was already reduced in
> c6e0fe1f2.  The patch you sent seems to pre-date that commit.
>
There is zero intention to commit this. It's just an experiment I did.

As the patch shows, it is possible to save the context pointer outside
the chunk header, making it possible to retrieve it in
MemoryContextContains.

It's just an idea.

regards,
Ranier Vilela


Re: possibility of partial data dumps with pg_dump

2022-10-04 Thread Pavel Stehule
Hi

On Tue, Oct 4, 2022 at 12:48 Никита Старовойтов wrote:

> Hello,
> with a view to meeting with postgres code and to get some practice with
> it, I am making a small patch that adds the possibility of partial tables
> dump.
> A rule of filtering is specified with standard SQL where clause (without
> "where" keyword)
> There are three ways to send data filters over command line:
>
> 1) using table pattern in "where" parameter with divider '@'
> ... --where "table_pattern@where_condition" ...
>
> "Where" condition will be used for all tables that match the search
> pattern.
>
> 2) using table parameter before any table inclusion
> ... --where "where condition" ...
>
> All tables in databases will be filtered with input condition.
>
> 3) using "where" parameter after table pattern
> ... -t table_pattern --where where_condition ...
>
> Only tables matching to last pattern before --where will be filtered.
> Third way is necessary to shorten the command
> line and to avoid duplicating tables pattern when specific tables are
> dumped.
>
> Also filters may be input from files.
> A file consists of lines, and every line is a table pattern or a where
> condition for data.
> For example, file
> """
> where column_name_1 == 1
> table_pattern
> table_pattern where column_name_2 == 1
> """
> corresponds to parameters
>
>  --where "column_name_1 == 1" -t table_pattern --where "column_name_2 == 1"
>
> The file format is not very good, because it doesn't provide sending
> patterns of other components such as schemas for example.
> And I am ready to change it, if functionality is actually needed.
>
> All use cases are provided with tests.
>
> I will be grateful if patch will get a discussion.
>

What is the benefit and use case? For this case I don't see any benefit
over a simple

\copy (select * from xx where ...) to file CSV

or how hard is it to write a trivial application that exports what you
want in the format that you want?

Regards

Pavel


Re: possibility of partial data dumps with pg_dump

2022-10-04 Thread Julien Rouhaud
Hi,

On Tue, Oct 04, 2022 at 02:15:16PM +0200, Pavel Stehule wrote:
>
> On Tue, Oct 4, 2022 at 12:48 Никита Старовойтов wrote:
>
> > Hello,
> > with a view to meeting with postgres code and to get some practice with
> > it, I am making a small patch that adds the possibility of partial tables
> > dump.
> > A rule of filtering is specified with standard SQL where clause (without
> > "where" keyword)
>
> What is benefit and use case? For this case I don't see any benefit against
> simple
>
> \copy (select * from xx where ...) to file CSV
>
> or how hard is it to write trivial application that does export of what you
> want in the format that you want?

Also, such an approach probably requires a lot of effort to get a valid
backup (with regard to foreign keys and such).

There's already a project dedicated to generate such partial (and consistent)
backups: https://github.com/mla/pg_sample.  Maybe that would address your
needs?




Re: Reducing the chunk header sizes on all memory context types

2022-10-04 Thread Ranier Vilela
On Tue, Oct 4, 2022 at 08:29 Ranier Vilela wrote:

> On Tue, Oct 4, 2022 at 05:36 David Rowley wrote:
>
>> On Tue, 4 Oct 2022 at 13:35, Ranier Vilela  wrote:
>> > Revisiting my work on reducing memory consumption, I found this patch
>> had been left out.
>> > I'm not sure I can help.
>> > But basically I was able to write and read the block size in the chunk.
>> > Could the context pointer be written and read in the
>> same way?
>> > Sure, this leads to some performance loss, but would it make it possible
>> to get the context pointer from the chunk?
>> > In other words, would it be possible to save the context pointer and
>> compare it later in MemoryContextContains?
>>
>> I'm not really sure I understand the intention of the patch. The
>> header size for all our memory contexts was already reduced in
>> c6e0fe1f2.  The patch you sent seems to pre-date that commit.
>>
> There is zero intention to commit this. It's just an experiment I did.
>
> As it is in the patch, it is possible to save the context pointer outside
> the header chunk, making it possible to retrieve it in
> MemoryContextContains.
>
Never mind, it's a waste of time.

regards,
Ranier Vilela


Re: Assorted fixes related to WAL files (was: Use XLogFromFileName() in pg_resetwal to parse position from WAL file)

2022-10-04 Thread Bharath Rupireddy
On Tue, Oct 4, 2022 at 2:01 PM Kyotaro Horiguchi
 wrote:
>
> > 1. 0001 replaces explicit WAL file parsing code with
>
> Looks good to me.
>
> > 2. 0002 replaces MAXPGPATH with MAXFNAMELEN for WAL file names.
>
> Looks reasonable, too.  I don't find other instances of the same mistake.

Thanks for reviewing.

> > 3. 0003 replaces WAL file name calculation with XLogFileNameById() in
> > pg_upgrade/controldata.c to be consistent across the code base. Note
> > that this requires us to change the nextxlogfile size from hard-coded
> > 25 bytes to MAXFNAMELEN (64 bytes).
>
> I'm not sure I like this. In other places where XLogFileNameById() is
> used, the buffer is known to store longer strings, so MAXFNAMELEN is
> reasonable.  But we don't need to add 39 useless bytes here.
> Therefore, even if I wanted to change it, I would replace it with
> "XLOG_FNAME_LEN + 1".

I'm fine with doing either of these things. Let's hear from others.

I've added a CF entry - https://commitfest.postgresql.org/40/3927/

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [patch] \g with multiple result sets and \watch with copy queries

2022-10-04 Thread Daniel Verite
Tom Lane wrote:

> Pushed after making some corrections.

Thanks!

> Given the time pressure, I did not worry about installing regression
> test coverage for this stuff, but I wonder if we shouldn't add some.

Currently, test/regress/sql/psql.sql doesn't AFAICS write anything
outside of stdout, but \g, \o, and \copy need to write to external
files to be tested properly.

Looking at nearby tests, I see that commit d1029bb5a26 brings
interesting additions in test/regress/sql/misc.sql that could be used
as a model to handle output files. psql.sql could write
into PG_ABS_BUILDDIR, then read the files back with \copy I guess,
then output that again to stdout for comparison. I'll see if I can get
that to work.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Matthias van de Meent
On Mon, 3 Oct 2022 at 23:26, Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:
> > On Mon, 3 Oct 2022, 19:01 Andres Freund,  wrote:
> > > Random idea: xl_prev is large. Store a full xl_prev in the page header, 
> > > but
> > > only store a 2 byte offset from the page header xl_prev within each 
> > > record.
> >
> > With that small xl_prev we may not detect partial page writes in
> > recycled segments; or other issues in the underlying file system. With
> > small record sizes, the chance of returning incorrect data would be
> > significant for small records (it would be approximately the chance of
> > getting a record boundary on the underlying page boundary * chance of
> > getting the same MAXALIGN-adjusted size record before the persistence
> > boundary). That issue is part of the reason why my proposed change
> > upthread still contains the full xl_prev.
>
> What exactly is the theory for this significant increase? I don't think
> xl_prev provides a meaningful protection against torn pages in the first
> place?

XLog pages don't have checksums, so they do not provide torn page
protection capabilities on their own.
A single xlog record is protected against torn page writes through
the checksum that covers the whole record - if only part of the record
was written, we can detect that through the mismatching checksum.
However, if records end at the tear boundary, we must know for certain
that any record that starts after the tear is the record that was
written after the one before the tear. Page-local references/offsets
would not work, because the record decoding doesn't know which xlog
page the record should be located on; it could be either the version
of the page from before it was recycled, or the one after.
Currently, we can detect this because the value of xl_prev will point
to a record far in the past (i.e. not the expected value), but with a
page-local version of xl_prev we would be less likely to detect torn
pages (and thus be unable to handle this without risk of corruption)
due to the significant chance of the truncated xl_prev value being the
same in both the old and new record.

Example: Page { [ record A ] | tear boundary | [ record B ] } gets
recycled and receives a new record C at the place of A with the same
length.

With your proposal, record B would still be a valid record when it
follows C, as the page-local serial number/offset reference to the
previous record would still match after the torn write.
With the current situation and a full LSN in xl_prev, the mismatching
value in the xl_prev pointer allows us to detect this torn page write
and halt replay, before redoing an old (incorrect) record.
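
For illustration, the check this relies on boils down to the following
sketch (simplified; the real validation is in xlogreader.c's header
checks, and the function name here is made up):

    #include "access/xlogrecord.h"	/* XLogRecord, XLogRecPtr */

    /*
     * Accept a record only if its full-width xl_prev matches the LSN of
     * the record we just replayed; stale records left over from before
     * the segment was recycled point far into the past and fail this.
     */
    static bool
    xl_prev_link_ok(const XLogRecord *record, XLogRecPtr prev_lsn)
    {
        return record->xl_prev == prev_lsn;
    }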

Kind regards,

Matthias van de Meent

PS. there are ideas floating around (I heard about this one from
Heikki) where we could concatenate WAL records into one combined
record that has only one shared xl_prev+crc, which would save these 12
bytes per record. However, that needs a lot of careful consideration
to make sure that the persistence guarantee of operations doesn't get
lost somewhere in the traffic.




Re: [patch] \g with multiple result sets and \watch with copy queries

2022-10-04 Thread Tom Lane
"Daniel Verite"  writes:
>   Tom Lane wrote:
>> Given the time pressure, I did not worry about installing regression
>> test coverage for this stuff, but I wonder if we shouldn't add some.

> Currently, test/regress/sql/psql.sql doesn't AFAICS write anything
> outside of stdout, but \g, \o, and \copy need to write to external
> files to be tested properly.

Yeah, I don't think we can usefully test these in psql.sql, because
file-system side effects are bad in that context.  But maybe a TAP
test could cope?

regards, tom lane




Re: JUMBLE_SIZE macro in two files

2022-10-04 Thread Tom Lane
bt22nakamorit  writes:
> queryjumble.c and queryjumble.h both define a macro JUMBLE_SIZE = 1024.
> Since queryjumble.c includes queryjumble.h, the JUMBLE_SIZE definition 
> in queryjumble.c should be deleted.

I would go more for taking it out of queryjumble.h.  I see no
reason why that constant needs to be world-visible.

regards, tom lane




Re: get rid of Abs()

2022-10-04 Thread Tom Lane
Peter Eisentraut  writes:
> I was wondering why we have a definition of Abs() in c.h when there are 
> more standard functions such as abs() and fabs() in widespread use.  I 
> think this one is left over from pre-ANSI-C days.  The attached patches 
> replace all uses of Abs() with more standard functions.

I'm not in favor of the llabs() changes.  I think what we really want
in those places, or at least most of them, is "abs() for int64".
That could be had by #define'ing "iabs64" (or some name along that
line) as labs or llabs depending on which type we are using for int64.
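
For concreteness, a minimal sketch of that wrapper (the macro name and
the choice of configure symbol are illustrative, not a settled proposal):

    /* abs() for int64; HAVE_LONG_INT_64 comes from pg_config.h */
    #ifdef HAVE_LONG_INT_64
    #define i64abs(x) labs(x)
    #else
    #define i64abs(x) llabs(x)
    #endif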

Seems OK beyond that nitpick.

regards, tom lane




Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Andres Freund
Hi,

On 2022-10-04 15:05:47 +0200, Matthias van de Meent wrote:
> On Mon, 3 Oct 2022 at 23:26, Andres Freund  wrote:
> > On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:
> > > On Mon, 3 Oct 2022, 19:01 Andres Freund,  wrote:
> > > > Random idea: xl_prev is large. Store a full xl_prev in the page header, 
> > > > but
> > > > only store a 2 byte offset from the page header xl_prev within each 
> > > > record.
> > >
> > > With that small xl_prev we may not detect partial page writes in
> > > recycled segments; or other issues in the underlying file system. With
> > > small record sizes, the chance of returning incorrect data would be
> > > significant for small records (it would be approximately the chance of
> > > getting a record boundary on the underlying page boundary * chance of
> > > getting the same MAXALIGN-adjusted size record before the persistence
> > > boundary). That issue is part of the reason why my proposed change
> > > upthread still contains the full xl_prev.
> >
> > What exactly is the theory for this significant increase? I don't think
> > xl_prev provides a meaningful protection against torn pages in the first
> > place?
> 
> XLog pages don't have checksums, so they do not provide torn page
> protection capabilities on their own.
> A single xlog record is protected against torn page writes through
> the checksum that covers the whole record - if only part of the record
> was written, we can detect that through the mismatching checksum.
> However, if records end at the tear boundary, we must know for certain
> that any record that starts after the tear is the record that was
> written after the one before the tear. Page-local references/offsets
> would not work, because the record decoding doesn't know which xlog
> page the record should be located on; it could be either the version
> of the page from before it was recycled, or the one after.
> Currently, we can detect this because the value of xl_prev will point
> to a record far in the past (i.e. not the expected value), but with a
> page-local version of xl_prev we would be less likely to detect torn
> pages (and thus be unable to handle this without risk of corruption)
> due to the significant chance of the truncated xl_prev value being the
> same in both the old and new record.

Think this is addressable, see below.


> Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> recycled and receives a new record C at the place of A with the same
> length.
> 
> With your proposal, record B would still be a valid record when it
> follows C, as the page-local serial number/offset reference to the
> previous record would still match after the torn write.
> With the current situation and a full LSN in xl_prev, the mismatching
> value in the xl_prev pointer allows us to detect this torn page write
> and halt replay, before redoing an old (incorrect) record.

In this concrete scenario the 8-byte xl_prev doesn't provide *any* protection?
As you specified it, C has the same length as A, so B's xl_prev will be the
same whether it's a page-local offset or the full 8 bytes.


The relevant protection against issues like this isn't xl_prev, it's the
CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
than the offset.


> PS. there are ideas floating around (I heard about this one from
> Heikki) where we could concatenate WAL records into one combined
> record that has only one shared xl_prev+crc, which would save these 12
> bytes per record. However, that needs a lot of careful consideration
> to make sure that the persistence guarantee of operations doesn't get
> lost somewhere in the traffic.

One version of that is to move the CRCs to the page header, make the pages
smaller (512 bytes / 4K, depending on the hardware), and to pad out partial
pages when flushing them out. Rewriting pages is bad for hardware and prevents
having multiple WAL IOs in flight at the same time.

Greetings,

Andres Freund




Re: Reducing the chunk header sizes on all memory context types

2022-10-04 Thread Tom Lane
I wrote:
> I think what we should look at is extending the aggregate/window
> function APIs so that such functions can report where they put their
> output, and then we can nuke MemoryContextContains(), with the
calling code set up to assume that it has to copy if the called function
> didn't report anything.  The existing FunctionCallInfo.resultinfo
> mechanism (cf. ReturnSetInfo for SRFs) is probably the right thing
> to pass the flag through.

After studying the existing usages of MemoryContextContains, I think
there is a better answer, which is to just nuke them.

As far as I can tell, the datumCopy steps associated with aggregate
finalfns are basically useless.  They only serve to prevent
returning a pointer directly to the aggregate's transition value
(or, perhaps, to a portion thereof).  But what's wrong with that?
It'll last as long as we need it to.  Maybe there was a reason
back before we required finalfns to not scribble on the transition
values, but those days are gone.

The same goes for aggregate serialfns --- although there, I can't
avoid the feeling that the datumCopy step was just cargo-culted in.
I don't think there can exist a serialfn that doesn't return a
freshly-palloced bytea.

The one place where we actually need the conditional datumCopy is
with window functions, and even there I don't think we need it
in simple cases with only one window function.  The case that is
hazardous is where multiple window functions are sharing a
WindowObject.  So I'm content to optimize the single-window-function
case and just always copy if there's more than one.  (Sadly, there
is no existing regression test that catches this, so I added one.)

See attached.

regards, tom lane

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index fe74e49814..45fcd37253 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1028,6 +1028,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
  *
  * The finalfn will be run, and the result delivered, in the
  * output-tuple context; caller's CurrentMemoryContext does not matter.
+ * (But note that in some cases, such as when there is no finalfn, the
+ * result might be a pointer to or into the agg's transition value.)
  *
  * The finalfn uses the state as set in the transno. This also might be
  * being used by another aggregate function, so it's important that we do
@@ -1112,21 +1114,13 @@ finalize_aggregate(AggState *aggstate,
 	}
 	else
 	{
-		/* Don't need MakeExpandedObjectReadOnly; datumCopy will copy it */
-		*resultVal = pergroupstate->transValue;
+		*resultVal =
+			MakeExpandedObjectReadOnly(pergroupstate->transValue,
+	   pergroupstate->transValueIsNull,
+	   pertrans->transtypeLen);
 		*resultIsNull = pergroupstate->transValueIsNull;
 	}
 
-	/*
-	 * If result is pass-by-ref, make sure it is in the right context.
-	 */
-	if (!peragg->resulttypeByVal && !*resultIsNull &&
-		!MemoryContextContains(CurrentMemoryContext,
-			   DatumGetPointer(*resultVal)))
-		*resultVal = datumCopy(*resultVal,
-			   peragg->resulttypeByVal,
-			   peragg->resulttypeLen);
-
 	MemoryContextSwitchTo(oldContext);
 }
 
@@ -1176,19 +1170,13 @@ finalize_partialaggregate(AggState *aggstate,
 	}
 	else
 	{
-		/* Don't need MakeExpandedObjectReadOnly; datumCopy will copy it */
-		*resultVal = pergroupstate->transValue;
+		*resultVal =
+			MakeExpandedObjectReadOnly(pergroupstate->transValue,
+	   pergroupstate->transValueIsNull,
+	   pertrans->transtypeLen);
 		*resultIsNull = pergroupstate->transValueIsNull;
 	}
 
-	/* If result is pass-by-ref, make sure it is in the right context. */
-	if (!peragg->resulttypeByVal && !*resultIsNull &&
-		!MemoryContextContains(CurrentMemoryContext,
-			   DatumGetPointer(*resultVal)))
-		*resultVal = datumCopy(*resultVal,
-			   peragg->resulttypeByVal,
-			   peragg->resulttypeLen);
-
 	MemoryContextSwitchTo(oldContext);
 }
 
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 8b0858e9f5..ce860bceeb 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -630,20 +630,13 @@ finalize_windowaggregate(WindowAggState *winstate,
 	}
 	else
 	{
-		/* Don't need MakeExpandedObjectReadOnly; datumCopy will copy it */
-		*result = peraggstate->transValue;
+		*result =
+			MakeExpandedObjectReadOnly(peraggstate->transValue,
+	   peraggstate->transValueIsNull,
+	   peraggstate->transtypeLen);
 		*isnull = peraggstate->transValueIsNull;
 	}
 
-	/*
-	 * If result is pass-by-ref, make sure it is in the right context.
-	 */
-	if (!peraggstate->resulttypeByVal && !*isnull &&
-		!MemoryContextContains(CurrentMemoryContext,
-			   DatumGetPointer(*result)))
-		*result = datumCopy(*result,
-			peraggstate->resulttypeByVal,
-			peraggstate->resulttypeLen);
 	MemoryContextSwitchTo(oldContext);
 }
 
@@ -1057,13 +1050,14 @@ eval_w

[PATCH] Expand character set for ltree labels

2022-10-04 Thread Garen Torikian
Dear hackers,

I am submitting a patch to expand the label requirements for ltree.

The current format is restricted to alphanumeric characters, plus _.
Unfortunately, for non-English labels, this set is insufficient. Rather
than figure out how to expand this set to include characters beyond the
ASCII limit, I have instead opted to support a widely used mechanism for
storing encoded UTF-8 characters: punycode
(https://en.wikipedia.org/wiki/Punycode).

The punycode range of characters is the exact same set as the existing
ltree range, with the addition of a hyphen (-). Within this system, any
human language can be encoded using just A-Za-z0-9-.

On top of this, I added support for two more characters: # and ;, which are
used for HTML entities. Note that & and % have special significance in the
existing ltree logic; users would have to encode items as #20; (rather than
%20). This seems a fair compromise.
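
For clarity, the allowed-character test amounts to something like the
following sketch (the function name is mine, the actual patch adjusts
ltree's existing character-class checks, and multibyte handling is
omitted here):

    #include <ctype.h>
    #include <stdbool.h>

    /* Proposed label alphabet: A-Za-z0-9_ plus '-', '#' and ';'. */
    static bool
    ltree_label_char_ok(unsigned char c)
    {
        return isalnum(c) || c == '_' || c == '-' ||
               c == '#' || c == ';';
    }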

Since the encoding could make a regular slug even longer, I have also
doubled the character limit, from 256 to 512.

Please let me know if I can provide any more information or changes.

Very sincerely,
Garen


0001-Expand-character-set-for-ltree-labels.patch
Description: Binary data


Re: interrupted tap tests leave postgres instances around

2022-10-04 Thread Andres Freund
Hi,

On 2022-10-04 10:24:19 +0200, Peter Eisentraut wrote:
> On 30.09.22 06:07, Andres Freund wrote:
> > When tap tests are interrupted (e.g. with ctrl-c), we don't cancel running
> > postgres instances etc. That doesn't strike me as a good thing.
> > 
> > In contrast, the postgres instances started by pg_regress do terminate. I
> > assume this is because pg_regress starts postgres directly, whereas tap 
> > tests
> > largely start postgres via pg_ctl. pg_ctl will, as it should, start postgres
> > without a controlling terminal. Thus a ctrl-c won't be delivered to it.
> 
> I ran into the problem recently that pg_upgrade starts the servers with
> pg_ctl, and thus without a terminal, and so you can't get any password prompts
> for SSL keys, for example.

For this specific case I wonder if pg_upgrade should disable ssl... That would
require fixing pg_upgrade to use a unix socket on windows, but that'd be a
good idea anyway.


> Taking out the setsid() call in pg_ctl.c fixed that.  I suspect this is
> ultimately the same problem.

> We could make TAP tests and pg_upgrade not use pg_ctl and start postmaster
> directly.  I'm not sure how much work that would be, but seeing that
> pg_regress does it, it doesn't seem unreasonable.

It's not trivial, particularly from perl. Check all the stuff pg_regress and
pg_ctl do around windows accounts and tokens.


> Alternatively, perhaps we could make a mode for pg_ctl that it doesn't call
> setsid().  This could be activated by an environment variable. That might
> address all these problems, too.

It looks like that won't help. Because pg_ctl exits after forking postgres,
postgres's parent isn't the shell anymore...

Greetings,

Andres Freund




Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Robert Haas
On Tue, Oct 4, 2022 at 11:34 AM Andres Freund  wrote:
> > Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> > recycled and receives a new record C at the place of A with the same
> > length.
> >
> > With your proposal, record B would still be a valid record when it
> > follows C, as the page-local serial number/offset reference to the
> > previous record would still match after the torn write.
> > With the current situation and a full LSN in xl_prev, the mismatching
> > value in the xl_prev pointer allows us to detect this torn page write
> > and halt replay, before redoing an old (incorrect) record.
>
> In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
> As you specified it, C has the same length as A, so B's xl_prev will be the
> same whether it's a page local offset or the full 8 bytes.
>
> The relevant protection against issues like this isn't xl_prev, it's the
> CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
> than the offset.

I'm really confused. xl_prev *is* a full-width LSN currently, as I
understand it. So in the scenario that Matthias poses, let's say the
segment was previously 000000010000000400000025 and now it's
000000010000000400000049. So if a given chunk of the page is leftover
from when the page was 000100040025, it will have xl_prev
values like 4/25xx. If it's been rewritten since the segment was
recycled, it will have xl_prev values like 4/49xx. So, we can tell
whether record B has been overwritten with a new record since the
segment was recycled. But if we stored only 2 bytes in each xl_prev
field, that would no longer be possible.

So I'm lost. It seems like Matthias has correctly identified a real
hazard, and not some weird corner case but actually something that
will happen regularly. All you need is for the old segment that got
recycled to have a record starting at the same place where the page
tore, and for the previous record to have been the same length as the
one on the new page. Given that there's only <~1024 places on a page
where a record can start, and given that in many workloads the lengths
of WAL records will be fairly uniform, this doesn't seem unlikely at
all.

A way to up the chances of detecting this case would be to store only
2 or 4 bytes of xl_prev on disk, but arrange to include the full
xl_prev value in the xl_crc calculation. Then your chances of a
collision are about 2^-32, or maybe more if you posit that CRC is a
weak and crappy algorithm, but even then it's strictly better than
just hoping that there isn't a tear point at a record boundary where
the same length record precedes the tear in both the old and new WAL
segments. However, on the flip side, even if you assume that CRC is a
fantastic algorithm with beautiful and state-of-the-art bit mixing,
the chances of it failing to notice the problem are still >0, whereas
the current algorithm that compares the full xl_prev value is a sure
thing. Because xl_prev values are never repeated, it's certain that
when a segment is recycled, any values that were legal for the old one
aren't legal in the new one.
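
To illustrate, a sketch of that scheme (function and variable names are
mine, not from any patch):

    #include "access/xlogdefs.h"	/* XLogRecPtr */
    #include "port/pg_crc32c.h"	/* INIT_CRC32C, COMP_CRC32C, FIN_CRC32C */

    /*
     * The on-disk header would carry only a truncated prev pointer, but
     * the CRC still covers the full previous-record LSN, so a stale
     * record chained onto the wrong predecessor fails its CRC check
     * with probability about 1 - 2^-32.
     */
    static pg_crc32c
    record_crc_with_full_prev(XLogRecPtr full_prev,
                              const char *rec_data, size_t rec_len)
    {
        pg_crc32c	crc;

        INIT_CRC32C(crc);
        COMP_CRC32C(crc, &full_prev, sizeof(full_prev));	/* not stored */
        COMP_CRC32C(crc, rec_data, rec_len);
        FIN_CRC32C(crc);
        return crc;
    }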

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2022-10-04 Thread Jeff Davis
On Mon, 2022-10-03 at 22:45 -0700, Peter Geoghegan wrote:
> Once a table becomes larger than vacuum_freeze_strategy_threshold,
> VACUUM stops marking pages all-visible in the first place,
> consistently marking them all-frozen instead.

What are the trade-offs here? Why does it depend on table size?

Regards,
Jeff Davis





Re: Suppressing useless wakeups in walreceiver

2022-10-04 Thread Nathan Bossart
Here is an updated patch set with the following changes:

* The creation of the struct for non-shared WAL receiver state is moved to
a prerequisite 0001 patch.  This should help ease review of 0002 a bit.

* I updated the nap time calculation to round up to the next millisecond,
as discussed upthread (see the sketch after this list).

* I attempted to minimize the calls to GetCurrentTimestamp().  The WAL
receiver code already calls this function pretty liberally, so I don't know
if this is important, but perhaps it could make a difference for systems
that don't have something like the vDSO to avoid real system calls.

* I removed the 'tli' argument from functions that now have an argument for
the non-shared state struct.  The 'tli' is stored within the struct, so the
extra argument is unnecessary.
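
A sketch of the rounding mentioned in the second item above (the
variable names are illustrative):

    /*
     * Round a remaining wait expressed in microseconds up to whole
     * milliseconds, so we don't wake just before the deadline and then
     * have to nap again for a sub-millisecond remainder.
     */
    static long
    nap_time_ms(long remaining_usecs)
    {
        return (remaining_usecs + 999) / 1000;
    }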

One thing that still bugs me a little bit about 0002 is that the calls to
GetCurrentTimestamp() feel a bit scattered and more difficult to reason
about.  AFAICT 0002 keeps 'now' relatively up-to-date, but it seems
possible that a future change could easily disrupt that.  I don't have any
other ideas at the moment, though.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 32dbc891cc829f9dbbfa69690743d97699862453 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 3 Oct 2022 19:39:44 -0700
Subject: [PATCH v3 1/2] Move WAL receivers' non-shared state to a new struct.

This is preparatory work for a follow-up change that will revamp
the wakeup mechanism for periodic tasks that WAL receivers must
perform.
---
 src/backend/replication/walreceiver.c | 46 +++
 1 file changed, 26 insertions(+), 20 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 3767466ef3..83e333e89c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -116,6 +116,14 @@ static struct
 	XLogRecPtr	Flush;			/* last byte + 1 flushed in the standby */
 }			LogstreamResult;
 
+/*
+ * A struct to keep track of non-shared state.
+ */
+typedef struct WalRcvInfo
+{
+	TimeLineID	startpointTLI;
+} WalRcvInfo;
+
 static StringInfoData reply_message;
 static StringInfoData incoming_message;
 
@@ -175,7 +183,6 @@ WalReceiverMain(void)
 	char		slotname[NAMEDATALEN];
 	bool		is_temp_slot;
 	XLogRecPtr	startpoint;
-	TimeLineID	startpointTLI;
 	TimeLineID	primaryTLI;
 	bool		first_stream;
 	WalRcvData *walrcv = WalRcv;
@@ -185,6 +192,7 @@ WalReceiverMain(void)
 	char	   *err;
 	char	   *sender_host = NULL;
 	int			sender_port = 0;
+	WalRcvInfo	state = {0};
 
 	/*
 	 * WalRcv should be set up already (if we are a backend, we inherit this
@@ -238,7 +246,7 @@ WalReceiverMain(void)
 	strlcpy(slotname, (char *) walrcv->slotname, NAMEDATALEN);
 	is_temp_slot = walrcv->is_temp_slot;
 	startpoint = walrcv->receiveStart;
-	startpointTLI = walrcv->receiveStartTLI;
+	state.startpointTLI = walrcv->receiveStartTLI;
 
 	/*
 	 * At most one of is_temp_slot and slotname can be set; otherwise,
@@ -258,7 +266,7 @@ WalReceiverMain(void)
 	pg_atomic_write_u64(&WalRcv->writtenUpto, 0);
 
 	/* Arrange to clean up at walreceiver exit */
-	on_shmem_exit(WalRcvDie, PointerGetDatum(&startpointTLI));
+	on_shmem_exit(WalRcvDie, PointerGetDatum(&state));
 
 	/* Properly accept or ignore signals the postmaster might send us */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload); /* set flag to read config
@@ -345,11 +353,11 @@ WalReceiverMain(void)
 		 * Confirm that the current timeline of the primary is the same or
 		 * ahead of ours.
 		 */
-		if (primaryTLI < startpointTLI)
+		if (primaryTLI < state.startpointTLI)
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 errmsg("highest timeline %u of the primary is behind recovery timeline %u",
-			primaryTLI, startpointTLI)));
+			primaryTLI, state.startpointTLI)));
 
 		/*
 		 * Get any missing history files. We do this always, even when we're
@@ -361,7 +369,7 @@ WalReceiverMain(void)
 		 * but let's avoid the confusion of timeline id collisions where we
 		 * can.
 		 */
-		WalRcvFetchTimeLineHistoryFiles(startpointTLI, primaryTLI);
+		WalRcvFetchTimeLineHistoryFiles(state.startpointTLI, primaryTLI);
 
 		/*
 		 * Create temporary replication slot if requested, and update slot
@@ -396,17 +404,17 @@ WalReceiverMain(void)
 		options.logical = false;
 		options.startpoint = startpoint;
 		options.slotname = slotname[0] != '\0' ? slotname : NULL;
-		options.proto.physical.startpointTLI = startpointTLI;
+		options.proto.physical.startpointTLI = state.startpointTLI;
 		if (walrcv_startstreaming(wrconn, &options))
 		{
 			if (first_stream)
 ereport(LOG,
 		(errmsg("started streaming WAL from primary at %X/%X on timeline %u",
-LSN_FORMAT_ARGS(startpoint), startpointTLI)));
+LSN_FORMAT_ARGS(startpoint), state.startpointTLI)));
 			else
 ereport(LOG,
 		(errmsg("restarted WAL streaming at %X/%X on timeline %u",
-LSN_FORMAT_ARGS(startpoint), startpointTLI)));

Re: Can we avoid chdir'ing in resolve_symlinks() ?

2022-10-04 Thread Tom Lane
Thomas Munro  writes:
> I tried lots of crazy stuff[1] to try to get an error out of this
> thing, but came up empty handed.  Unlike realpath(), _fullpath()
> doesn't resolve symlinks (or junctions), so I guess there's less to go
> wrong.  It still needs the present working directory, which is a
> per-drive concept on this OS, but even bogus drives don't seem to
> produce an error (despite what the manual says).

Interesting.

> I'd still lean towards assuming errno is set, given that the manual
> references errno and not GetLastError().

Agreed.  In the attached, I drop the _dosmaperr() step and instead
just do "errno = 0" before the call.  That way, if we ever do manage
to hit a _fullpath() failure, we can at least tell whether the errno
that's reported is real or not.
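
Concretely, something like this sketch (presumably inside pg_realpath()'s
WIN32 branch; the exact call form in the attached patch may differ):

    char	   *fname;

    errno = 0;					/* learn whether _fullpath() sets it */
    fname = _fullpath(NULL, path, 0);	/* result buffer is malloc'd */
    if (fname == NULL)
        log_error(errcode_for_file_access(),	/* errno may be bogus */
                  _("could not resolve path \"%s\": %m"), path);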

In this version I've attempted to resolve Peter's complaint by only
applying realpath() when the executable path we've obtained is relative
or has a symlink as the last component.  Things will definitely not
work right if either of those is true and we make no effort to get
a more trustworthy path.  I concede that things will usually work okay
without resolving a symlink that's two or more levels up the path,
but I wonder how much we want to trust that.  Suppose somebody changes
such a symlink while the server is running --- nothing very good is
likely to happen if it suddenly starts consulting some other libdir
or sharedir.  Maybe we need to add a flag telling whether we want
this behavior?  TBH I think that pg_config is the only place I'd
be comfortable with doing it like this.  Peter, would your concerns
be satisfied if we just made pg_config do it?

regards, tom lane

diff --git a/src/common/exec.c b/src/common/exec.c
index 22f04aafbe..06a8601b85 100644
--- a/src/common/exec.c
+++ b/src/common/exec.c
@@ -14,6 +14,15 @@
  *-
  */
 
+/*
+ * On macOS, "man realpath" avers:
+ *Defining _DARWIN_C_SOURCE or _DARWIN_BETTER_REALPATH before including
+ *stdlib.h will cause the provided implementation of realpath() to use
+ *F_GETPATH from fcntl(2) to discover the path.
+ * This should be harmless everywhere else.
+ */
+#define _DARWIN_BETTER_REALPATH
+
 #ifndef FRONTEND
 #include "postgres.h"
 #else
@@ -58,11 +67,8 @@ extern int	_CRT_glob = 0;		/* 0 turns off globbing; 1 turns it on */
 	(fprintf(stderr, __VA_ARGS__), fputc('\n', stderr))
 #endif
 
-#ifdef _MSC_VER
-#define getcwd(cwd,len)  GetCurrentDirectory(len, cwd)
-#endif
-
-static int	resolve_symlinks(char *path);
+static int	normalize_exec_path(char *path);
+static char *pg_realpath(const char *fname);
 
 #ifdef WIN32
 static BOOL GetTokenUser(HANDLE hToken, PTOKEN_USER *ppTokenUser);
@@ -87,7 +93,7 @@ validate_exec(const char *path)
 	char		path_exe[MAXPGPATH + sizeof(".exe") - 1];
 
 	/* Win32 requires a .exe suffix for stat() */
-	if (strlen(path) >= strlen(".exe") &&
+	if (strlen(path) < strlen(".exe") ||
 		pg_strcasecmp(path + strlen(path) - strlen(".exe"), ".exe") != 0)
 	{
 		strlcpy(path_exe, path, sizeof(path_exe) - 4);
@@ -151,30 +157,16 @@ validate_exec(const char *path)
 int
 find_my_exec(const char *argv0, char *retpath)
 {
-	char		cwd[MAXPGPATH],
-test_path[MAXPGPATH];
 	char	   *path;
 
-	if (!getcwd(cwd, MAXPGPATH))
-	{
-		log_error(errcode_for_file_access(),
-  _("could not identify current directory: %m"));
-		return -1;
-	}
-
 	/*
 	 * If argv0 contains a separator, then PATH wasn't used.
 	 */
-	if (first_dir_separator(argv0) != NULL)
+	strlcpy(retpath, argv0, MAXPGPATH);
+	if (first_dir_separator(retpath) != NULL)
 	{
-		if (is_absolute_path(argv0))
-			strlcpy(retpath, argv0, MAXPGPATH);
-		else
-			join_path_components(retpath, cwd, argv0);
-		canonicalize_path(retpath);
-
 		if (validate_exec(retpath) == 0)
-			return resolve_symlinks(retpath);
+			return normalize_exec_path(retpath);
 
 		log_error(errcode(ERRCODE_WRONG_OBJECT_TYPE),
   _("invalid binary \"%s\": %m"), retpath);
@@ -183,9 +175,8 @@ find_my_exec(const char *argv0, char *retpath)
 
 #ifdef WIN32
 	/* Win32 checks the current directory first for names without slashes */
-	join_path_components(retpath, cwd, argv0);
 	if (validate_exec(retpath) == 0)
-		return resolve_symlinks(retpath);
+		return normalize_exec_path(retpath);
 #endif
 
 	/*
@@ -208,21 +199,15 @@ find_my_exec(const char *argv0, char *retpath)
 			if (!endp)
 endp = startp + strlen(startp); /* point to end */
 
-			strlcpy(test_path, startp, Min(endp - startp + 1, MAXPGPATH));
+			strlcpy(retpath, startp, Min(endp - startp + 1, MAXPGPATH));
 
-			if (is_absolute_path(test_path))
-join_path_components(retpath, test_path, argv0);
-			else
-			{
-join_path_components(retpath, cwd, test_path);
-join_path_components(retpath, retpath, argv0);
-			}
+			join_path_components(retpath, retpath, argv0);
 			canonicalize_path(retpath);
 
 			switch (validate_exec(retpath))
 			{
 case 0:			/* found ok */
-			

Re: New strategies for freezing, advancing relfrozenxid early

2022-10-04 Thread Peter Geoghegan
On Tue, Oct 4, 2022 at 10:39 AM Jeff Davis  wrote:
> On Mon, 2022-10-03 at 22:45 -0700, Peter Geoghegan wrote:
> > Once a table becomes larger than vacuum_freeze_strategy_threshold,
> > VACUUM stops marking pages all-visible in the first place,
> > consistently marking them all-frozen instead.
>
> What are the trade-offs here? Why does it depend on table size?

That's a great question. The table-level threshold
vacuum_freeze_strategy_threshold more or less buckets every table into
one of two categories: small tables and big tables. Perhaps this seems
simplistic to you. That would be an understandable reaction, given the
central importance of this threshold. The current default of 4GB could
have easily been 8GB or perhaps even 16GB instead.
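
To make that concrete, the bucketing amounts to something like this
sketch (treating the threshold as a size in bytes is my assumption here;
the names follow the patch series, not committed code):

    /* Eager strategy for "big" tables, lazy for "small" ones. */
    bool		use_eager_strategy =
        (uint64) rel_pages * BLCKSZ >= vacuum_freeze_strategy_threshold;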

It's not so much size as the rate of growth over time that matters. We
really want to do eager freezing on "growth tables", particularly
append-only tables. On the other hand we don't want to do useless
freezing on small, frequently updated tables, like pgbench_tellers or
pgbench_branches -- those tables may well require zero freezing, and
yet each VACUUM will advance relfrozenxid to a very recent value
consistently (even on Postgres 15). But "growth" is hard to capture,
because in general we have to infer things about the future from the
past, which is difficult and messy.

Since it's hard to capture "growth table vs fixed size table"
directly, we use table size as a proxy. It's far from perfect, but I
think that it will work quite well in practice because most individual
tables simply never get very large. It's very common for a relatively
small number of tables to consistently grow, without bound (perhaps
not strictly append-only tables, but tables where nothing is ever
deleted and inserts keep happening). So a simplistic threshold
(combined with dynamic per-page decisions about freezing) should be
enough to avoid most of the downside of eager freezing. In particular,
we will still freeze lazily in tables where it's obviously very
unlikely to be worth it.

In general I think that being correct on average is overrated. It's
more important to always avoid being dramatically wrong -- especially
if there is no way to course correct in the next VACUUM. Although I
think that we have a decent chance of coming out ahead by every
available metric, that isn't really the goal. Why should performance
stability not have some cost, at least in some cases? I want to keep
the cost as low as possible (often "negative cost" relative to
Postgres 15), but overall I am consciously making a trade-off. There
are downsides.

-- 
Peter Geoghegan




Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Andres Freund
Hi,

On 2022-10-04 13:36:33 -0400, Robert Haas wrote:
> On Tue, Oct 4, 2022 at 11:34 AM Andres Freund  wrote:
> > > Example: Page { [ record A ] | tear boundary | [ record B ] } gets
> > > recycled and receives a new record C at the place of A with the same
> > > length.
> > >
> > > With your proposal, record B would still be a valid record when it
> > > follows C, as the page-local serial number/offset reference to the
> > > previous record would still match after the torn write.
> > > With the current situation and a full LSN in xl_prev, the mismatching
> > > value in the xl_prev pointer allows us to detect this torn page write
> > > and halt replay, before redoing an old (incorrect) record.
> >
> > In this concrete scenario the 8 byte xl_prev doesn't provide *any* 
> > protection?
> > As you specified it, C has the same length as A, so B's xl_prev will be the
> > same whether it's a page local offset or the full 8 bytes.
> >
> > The relevant protection against issues like this isn't xl_prev, it's the
> > CRC. We could improve the CRC by using the "full width" LSN for xl_prev 
> > rather
> > than the offset.
>
> I'm really confused. xl_prev *is* a full-width LSN currently, as I
> understand it. So in the scenario that Matthias poses, let's say the
> segment was previously 000000010000000400000025 and now it's
> 000000010000000400000049. So if a given chunk of the page is leftover
> from when the page was 000100040025, it will have xl_prev
> values like 4/25xx. If it's been rewritten since the segment was
> recycled, it will have xl_prev values like 4/49xx. So, we can tell
> whether record B has been overwritten with a new record since the
> segment was recycled. But if we stored only 2 bytes in each xl_prev
> field, that would no longer be possible.

Oh, I think I misunderstood the scenario. I was thinking of cases where we
write out a bunch of pages, crash, only some of the pages made it to disk, we
then write new ones of the same length, and now find a record after the "real"
end of the WAL to be valid.  Not sure how I mentally swallowed the "recycled".

For the recycling scenario to be a problem we'll also need to crash, with
parts of the page ending up with the new contents and parts of the page ending
up with the old "pre recycling" content, correct? Because without a crash
we'll have zeroed out the remainder of the page (well, leaving walreceiver out
of the picture, grr).

However, this can easily happen without any record boundaries on the partially
recycled page, so we rely on the CRCs to protect against this.


Here I originally wrote a more in-depth explanation of the scenario I was
thinking about, where we already rely on CRCs to protect us. But, ooph, I think
they don't do so reliably, with today's design. But maybe I'm missing more
things today. Consider the following sequence:

1) we write WAL like this:

[record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev 
B_lsn]

2) crash, the sectors with A and C made it to disk, the one with B didn't

3) We replay A, discover B is invalid (possibly zeroes or old contents),
   insert a new record B' with the same length. Now it looks like this:

[record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev 
B_lsn]

4) crash, the sector with B' makes it to disk

5) we replay A, B', C, because C has an xl_prev that's compatible with B'
   location and a valid CRC.

Oops.

I think this can happen both within a single page and across page boundaries.

I hope I am missing something here?


> A way to up the chances of detecting this case would be to store only
> 2 or 4 bytes of xl_prev on disk, but arrange to include the full
> xl_prev value in the xl_crc calculation.

Right, that's what I was suggesting as well.


> Then your chances of a collision are about 2^-32, or maybe more if you posit
> that CRC is a weak and crappy algorithm, but even then it's strictly better
> than just hoping that there isn't a tear point at a record boundary where
> the same length record precedes the tear in both the old and new WAL
> segments. However, on the flip side, even if you assume that CRC is a
> fantastic algorithm with beautiful and state-of-the-art bit mixing, the
> chances of it failing to notice the problem are still >0, whereas the
> current algorithm that compares the full xl_prev value is a sure
> thing. Because xl_prev values are never repeated, it's certain that when a
> segment is recycled, any values that were legal for the old one aren't legal
> in the new one.

Given that we already rely on the CRCs to detect corruption within a single
record spanning tear boundaries, this doesn't cause me a lot of heartburn. But
I suspect we might need to do something about the scenario I outlined above,
which likely would also increase the protection against this issue.

I think there might be reasonable ways to increase the guarantees based on the
2 byte xl_prev approach "alone". We don't have to store the offset from the
page header as a plain offset. What about storing something like:
  page_offset ^ (page_lsn >> wal_segsz_shift)

I think something like that'd result in prev_not_really_lsn typically not
simply matching after recycling. Of course it only provides so much
protection, given 16bits...
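
A sketch of that computation (variable names are illustrative):

    /*
     * 16-bit "prev" token: the record's offset within the page, mixed
     * with segment-granularity bits of the LSN, so that contents left
     * over from a recycled segment rarely produce a matching token.
     */
    uint16		prev_token =
        (uint16) (page_offset ^ (page_lsn >> wal_segsz_shift));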

Re: installcheck-world concurrency issues

2022-10-04 Thread Andres Freund
Hi,

On 2022-10-04 17:05:40 +0900, Michael Paquier wrote:
> On Mon, Oct 03, 2022 at 04:41:11PM -0700, Andres Freund wrote:
> > There's a few further roles that seem to pose some danger goign forward:
> 
> I have never seen that myself, but 0001 is a nice cleanup.
> generated.sql includes a user named "regress_user11".  Perhaps that's
> worth renaming while on it?

I think regress_* without a "namespace" is what src/test/regress uses, so
there's not really a need?


> > A second issue I noticed is that advisory_lock.sql often fails, because the
> > pg_locks queries don't restrict to the current database. Patch attached.
> 
> As in prepared_xacts.sql or just advisory locks taken in an installed
> cluster?  Or both?

There are various isolation tests, including several in src/test/isolation, that
use advisory locks.

prepared_xacts.sql shouldn't be an issue, because it's scheduled in a separate
group.


> > I attached the meson patch as well, but just because I used it to to get to
> > these patches.
> 
> I am still studying a lot of this area, but it seems like all the
> spots requiring a custom configuration (aka NO_INSTALLCHECK) are
> covered.  --setup running is working here with 0003.

Thanks for checking.

Greetings,

Andres Freund




Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Robert Haas
On Tue, Oct 4, 2022 at 2:30 PM Andres Freund  wrote:
> Consider the following sequence:
>
> 1) we write WAL like this:
>
> [record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev 
> B_lsn]
>
> 2) crash, the sectors with A and C made it to disk, the one with B didn't
>
> 3) We replay A, discover B is invalid (possibly zeroes or old contents),
>insert a new record B' with the same length. Now it looks like this:
>
> [record A][tear boundary][record B', prev A_lsn][tear boundary][record C, 
> prev B_lsn]
>
> 4) crash, the sector with B' makes it to disk
>
> 5) we replay A, B', C, because C has an xl_prev that's compatible with B'
>location and a valid CRC.
>
> Oops.
>
> I think this can happen both within a single page and across page boundaries.
>
> I hope I am missing something here?

If you are, I don't know what it is off-hand. That seems like a
plausible scenario to me. It does require the OS to write things out
of order, and I don't know how likely that is in practice, but the
answer probably isn't zero.

> I think there might be reasonable ways to increase the guarantees based on the
> 2 byte xl_prev approach "alone". We don't have to store the offset from the
> page header as a plain offset. What about storing something like:
>   page_offset ^ (page_lsn >> wal_segsz_shift)
>
> I think something like that'd result in prev_not_really_lsn typically not
> simply matching after recycling. Of course it only provides so much
> protection, given 16bits...

Maybe. That does seem somewhat better, but I feel like it's hard to
reason about whether it's safe in absolute terms or just resistant to
the precise scenario Matthias postulated while remaining vulnerable to
slightly modified versions.

How about this: remove xl_prev. widen xl_crc to 64 bits. include the
CRC of the previous WAL record in the xl_crc calculation. That doesn't
cut quite as many bytes out of the record size as your proposal, but
it feels like it should strongly resist pretty much every attack of
this general type, with only the minor disadvantage that the more
expensive CRC calculation will destroy all hope of getting anything
committed.
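
A sketch of the chaining (the 64-bit CRC macros here are hypothetical;
core currently only provides pg_crc32c):

    /*
     * Each record's checksum covers its predecessor's checksum, so a
     * stale record from a recycled segment can't chain onto new WAL.
     */
    uint64		crc;

    INIT_CRC64(crc);	/* hypothetical 64-bit variant */
    COMP_CRC64(crc, &prev_record_crc, sizeof(prev_record_crc));
    COMP_CRC64(crc, record_data, record_len);
    FIN_CRC64(crc);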

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Pluggable toaster

2022-10-04 Thread Nikita Malakhov
Hi hackers!
cfbot is unhappy again, with the documentation package. Here's the
corrected patchset.

Patchset consists of:
v21-0001-toaster-interface.patch - Pluggable TOAST API interface along with
reference TOAST mechanics - new API is introduced but
reference TOAST is still unchanged;
v21-0002-toaster-default.patch - Default TOAST re-implemented
using Toaster API - reference TOAST is re-implemented via new API;
v21-0003-toaster-docs.patch - Pluggable TOAST API documentation package

Actual GitHub branch resides at
https://github.com/postgrespro/postgres/tree/toasterapi_clean

On Tue, Oct 4, 2022 at 1:46 PM Nikita Malakhov  wrote:

> Hi hackers!
> Now cfbot is happy, but there were warnings due to recent changes in
> PointerGetDatum function, so here's corrected patchset.
> Sorry, forgot to attach patch files. My fault.
>
> Patchset consists of:
> v20-0001-toaster-interface.patch - Pluggable TOAST API interface along
> with reference TOAST mechanics - new API is introduced but
> reference TOAST is still unchanged;
> v20-0002-toaster-default.patch - Default TOAST re-implemented
> using Toaster API - reference TOAST is re-implemented via new API;
> v20-0003-toaster-docs.patch - Pluggable TOAST API documentation package
>
> Actual GitHub branch resides at
> https://github.com/postgrespro/postgres/tree/toasterapi_clean
>
> On Tue, Oct 4, 2022 at 1:45 PM Nikita Malakhov  wrote:
>
>> Hi hackers!
>> Now cfbot is happy, but there were warnings due to recent changes in
>> the PointerGetDatum function, so here's a corrected patchset.
>>
>> Patchset consists of:
>> v20-0001-toaster-interface.patch - Pluggable TOAST API interface along
>> with reference TOAST mechanics - new API is introduced but
>> reference TOAST is still unchanged;
>> v20-0002-toaster-default.patch - Default TOAST re-implemented
>> using Toaster API - reference TOAST is re-implemented via new API;
>> v20-0003-toaster-docs.patch - Pluggable TOAST API documentation package
>>
>> Actual GitHub branch resides at
>> https://github.com/postgrespro/postgres/tree/toasterapi_clean
>>
>> On Tue, Oct 4, 2022 at 1:02 AM Nikita Malakhov  wrote:
>>
>>> Hi hackers!
>>> Cfbot failed in the meson build with the previous patchsets, so I've rebased
>>> them onto the latest master and added the necessary meson build info.
>>>
>>> Patchset consists of:
>>> v19-0001-toaster-interface.patch - Pluggable TOAST API interface along
>>> with reference TOAST mechanics - new API is introduced but
>>> reference TOAST is still unchanged;
>>> v19-0002-toaster-default.patch - Default TOAST re-implemented using
>>> Toaster API - reference TOAST is re-implemented via new API;
>>> v19-0003-toaster-docs.patch - Pluggable TOAST API documentation package
>>>
>>> Actual GitHub branch resides at
>>> https://github.com/postgrespro/postgres/tree/toasterapi_clean
>>>
>>> On Tue, Sep 27, 2022 at 12:26 AM Nikita Malakhov 
>>> wrote:
>>>
 Hi,
 The meson build for the patchset failed; meson build files are attached,
 and the README/Doc package has been
 reworked with a more detailed explanation of the virtual function table,
 along with other corrections.

 On Sun, Sep 25, 2022 at 1:41 AM Nikita Malakhov 
 wrote:

> Hi hackers!
> The last patchset has an invalid patch file - v16-0003-toaster-docs.patch.
> Here's a corrected patchset;
> sorry for the noise.
>
> On Sat, Sep 24, 2022 at 3:50 PM Nikita Malakhov 
> wrote:
>
>> Hi hackers!
>>
>> Cfbot is still not happy with the patchset, so I'm attaching a
>> rebased one, rebased onto the current
>> master (from today). The third patch contains documentation package,
>> and the second one contains large
>> README.toastapi file providing additional in-depth docs for
>> developers.
>>
>> Comments would be greatly appreciated.
>>
>> Again, after checking the patch sources I have a strong opinion that it
>> needs some refactoring -
>> move all files related to the TOAST implementation into a new folder
>> /backend/access/toast, where
>> the Generic (default) Toaster resides.
>>
>> Patchset consists of:
>> v16-0001-toaster-interface.patch - Pluggable TOAST API interface
>> along with reference TOAST mechanics;
>> v16-0002-toaster-default.patch - Default TOAST re-implemented using
>> Toaster API;
>> v16-0003-toaster-docs.patch - Pluggable TOAST API documentation
>> package
>>
>> Actual GitHub branch resides at
>> https://github.com/postgrespro/postgres/tree/toasterapi_clean
>>
>> On Fri, Sep 23, 2022 at 10:54 PM Nikita Malakhov 
>> wrote:
>>
>>> Hi hackers!
>>>
>>> Cfbot is not happy with the previous patchset, so I'm attaching a new one,
>>> rebased onto the current master
>>> (15b4). Also providing a patch with the documentation package, and the
>>> second one contains a large
>>> README.toastapi file providing additional in-depth docs for
>>> developers.
>>>
>>> Comments would be greatly appreciated.
>>>
>>> Al

Re: [POC] Allow flattening of subquery with a link to upper query

2022-10-04 Thread Zhihong Yu
Hi,
For contain_placeholders():

+   if (IsA(node, Query))
+   return query_tree_walker((Query *) node, contain_placeholders,
context, 0);
+   else if (IsA(node, PlaceHolderVar))

The `else` is not needed.

For the correlated_t struct, it would be better if the fields had comments.

+* (for grouping, as an example). So, revert its status
to
+* a full valued entry.

full valued -> fully valued

Cheers


Re: Warning about using pg_stat_reset() and pg_stat_reset_shared()

2022-10-04 Thread David Rowley
On Thu, 29 Sept 2022 at 04:45, Bruce Momjian  wrote:
>
> We have discussed the problems caused by the use of pg_stat_reset() and
> pg_stat_reset_shared(), specifically the removal of information needed
> by autovacuum.  I don't see these risks documented anywhere.  Should we
> do that?  Are there other risks?

There was some discussion in [1] a few years back.  A few people were
for the warning. Nobody seemed to object to it.  There's a patch in
[2].

David

[1] 
https://www.postgresql.org/message-id/flat/CAKJS1f8DTbCHf9gedU0He6ARsd58E6qOhEHM1caomqj_r9MOiQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAKJS1f80o98hcfSk8j%3DfdN09S7Sjz%2BvuzhEwbyQqvHJb_sZw0g%40mail.gmail.com




Re: [PATCH] Expand character set for ltree labels

2022-10-04 Thread Nathan Bossart
On Tue, Oct 04, 2022 at 12:54:46PM -0400, Garen Torikian wrote:
> The punycode range of characters is the exact same set as the existing
> ltree range, with the addition of a hyphen (-). Within this system, any
> human language can be encoded using just A-Za-z0-9-.

IIUC ASCII characters like '!' and '<' are valid Punycode characters, but
even with your proposal, those wouldn't be allowed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records

2022-10-04 Thread Tom Lane
[ redirecting to -hackers because patch attached ]

David Rowley  writes:
> So that confirms there were 950k relations in the xl_standby_locks.
> The contents of that message seem to be produced by standby_desc().
> That should be the same WAL record that's processed by standby_redo()
> which adds the 950k locks to the RecoveryLockListsEntry.

> I'm not seeing why 950k becomes 134m.

I figured out what the problem is.  The standby's startup process
retains knowledge of all these locks in standby.c's RecoveryLockLists
data structure, which *has no de-duplication capability*.  It'll add
another entry to the per-XID list any time it's told about a given
exclusive lock.  And checkpoints cause us to regurgitate the entire
set of currently-held exclusive locks into the WAL.  So if you have
a process holding a lot of exclusive locks, and sitting on them
across multiple checkpoints, standby startup processes will bloat.
It's not a true leak, in that we know where the memory is and
we'll release it whenever we see that XID commit/abort.  And I doubt
that this is a common usage pattern, which probably explains the
lack of previous complaints.  Still, bloat bad.

PFA a quick-hack fix that solves this issue by making per-transaction
subsidiary hash tables.  That's overkill perhaps; I'm a little worried
about whether this slows down normal cases more than it's worth.
But we ought to do something about this, because aside from the
duplication aspect the current storage of these lists seems mighty
space-inefficient.

regards, tom lane

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 9dab931990..14b86cd76a 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -42,7 +42,26 @@ int			max_standby_archive_delay = 30 * 1000;
 int			max_standby_streaming_delay = 30 * 1000;
 bool		log_recovery_conflict_waits = false;
 
-static HTAB *RecoveryLockLists;
+/*
+ * Keep track of all the exclusive locks owned by original transactions.
+ * For each original transaction that is known to have any such locks,
+ * there is a RecoveryLockHashEntry in the RecoveryLockHash hash table,
+ * pointing to a subsidiary hash table containing RecoveryLockEntry items.
+ */
+typedef struct RecoveryLockHashEntry
+{
+	TransactionId xid;			/* hash key -- must be first */
+	HTAB	   *hashtab;		/* table of locks belonging to xid */
+} RecoveryLockHashEntry;
+
+typedef struct RecoveryLockEntry
+{
+	/* Both Oids are included in the hash key; there is no payload per se */
+	Oid			dbOid;			/* DB containing table */
+	Oid			relOid;			/* OID of table */
+} RecoveryLockEntry;
+
+static HTAB *RecoveryLockHash = NULL;
 
 /* Flags set by timeout handlers */
 static volatile sig_atomic_t got_standby_deadlock_timeout = false;
@@ -58,15 +77,6 @@ static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
 static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
 static const char *get_recovery_conflict_desc(ProcSignalReason reason);
 
-/*
- * Keep track of all the locks owned by a given transaction.
- */
-typedef struct RecoveryLockListsEntry
-{
-	TransactionId xid;
-	List	   *locks;
-} RecoveryLockListsEntry;
-
 /*
  * InitRecoveryTransactionEnvironment
  *		Initialize tracking of our primary's in-progress transactions.
@@ -85,16 +95,18 @@ InitRecoveryTransactionEnvironment(void)
 	VirtualTransactionId vxid;
 	HASHCTL		hash_ctl;
 
+	Assert(RecoveryLockHash == NULL);	/* don't run this twice */
+
 	/*
 	 * Initialize the hash table for tracking the list of locks held by each
 	 * transaction.
 	 */
 	hash_ctl.keysize = sizeof(TransactionId);
-	hash_ctl.entrysize = sizeof(RecoveryLockListsEntry);
-	RecoveryLockLists = hash_create("RecoveryLockLists",
-	64,
-	&hash_ctl,
-	HASH_ELEM | HASH_BLOBS);
+	hash_ctl.entrysize = sizeof(RecoveryLockHashEntry);
+	RecoveryLockHash = hash_create("RecoveryLockHash",
+   64,
+   &hash_ctl,
+   HASH_ELEM | HASH_BLOBS);
 
 	/*
 	 * Initialize shared invalidation management for Startup process, being
@@ -140,12 +152,12 @@ void
 ShutdownRecoveryTransactionEnvironment(void)
 {
 	/*
-	 * Do nothing if RecoveryLockLists is NULL because which means that
-	 * transaction tracking has not been yet initialized or has been already
-	 * shutdowned. This prevents transaction tracking from being shutdowned
-	 * unexpectedly more than once.
+	 * Do nothing if RecoveryLockHash is NULL because that means that
+	 * transaction tracking has not yet been initialized or has already been
+	 * shut down.  This makes it safe to have possibly-redundant calls of this
+	 * function during process exit.
 	 */
-	if (RecoveryLockLists == NULL)
+	if (RecoveryLockHash == NULL)
 		return;
 
 	/* Mark all tracked in-progress transactions as finished. */
@@ -155,8 +167,8 @@ ShutdownRecoveryTransactionEnvironment(void)
 	StandbyReleaseAllLocks();
 
 	/* Destroy the hash table of locks. */
-	hash_

Re: Move backup-related code to xlogbackup.c/.h

2022-10-04 Thread Nathan Bossart
On Wed, Sep 28, 2022 at 08:16:08PM +0530, Bharath Rupireddy wrote:
> In doing so, I had to add a few Get/Set functions
> for XLogCtl variables so that xlogbackup.c can use them.

I would suggest moving this to a separate prerequisite patch that can be
reviewed independently from the patches that simply move code to a
different file.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Expand character set for ltree labels

2022-10-04 Thread Garen Torikian
No, not quite.

Valid Punycode characters are `[A-Za-z0-9-]`. This proposal includes `-`,
as well as `#` and `;` for HTML entities.

I double-checked the RFC to see the valid Punycode characters and the set
above is indeed correct:
https://datatracker.ietf.org/doc/html/draft-ietf-idn-punycode-02#section-5

While it would be nice for ltree labels to support *any* printable
character, it can't because symbols like `!` and `%` already have special
meaning in the querying. This proposal leaves those as is and does not
depend on any existing special character.

On Tue, Oct 4, 2022 at 6:32 PM Nathan Bossart 
wrote:

> On Tue, Oct 04, 2022 at 12:54:46PM -0400, Garen Torikian wrote:
> > The punycode range of characters is the exact same set as the existing
> > ltree range, with the addition of a hyphen (-). Within this system, any
> > human language can be encoded using just A-Za-z0-9-.
>
> IIUC ASCII characters like '!' and '<' are valid Punycode characters, but
> even with your proposal, those wouldn't be allowed.
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


Re: problems with making relfilenodes 56-bits

2022-10-04 Thread Andres Freund
Hi,

On 2022-10-03 10:01:25 -0700, Andres Freund wrote:
> On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund  wrote:
> > I thought about trying to buy back some space elsewhere, and I think
> > that would be a reasonable approach to getting this committed if we
> > could find a way to do it. However, I don't see a terribly obvious way
> > of making it happen.
>
> I think there's plenty potential...

I lightly dusted off my old varint implementation from [1] and converted the
RelFileLocator and BlockNumber from fixed width integers to varint ones. This
isn't meant as a serious patch, but an experiment to see if this is a path
worth pursuing.

A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
(for increased reproducibility) shows a decent saving:

master: 241106544 - 230 MB
varint: 227858640 - 217 MB

The average record size goes from 102.7 to 95.7 bytes excluding the remaining
FPIs, 118.1 to 111.0 including FPIs.

There are plenty of other spots that could be converted (e.g. the record
length, which rarely needs four bytes); this is just meant as a demonstration.


I used pg_waldump --stats for that range of WAL to measure the CPU overhead. A
profile does show pg_varint_decode_uint64(), but partly that seems to be
offset by the reduced number of bytes to CRC. Maybe a ~2% overhead remains.

That would be tolerable, I think, because waldump --stats pretty much doesn't
do anything with the WAL.

But I suspect there's plenty of optimization potential in the varint
code. Right now it e.g. stores data as big endian, and the bswap instructions
do show up. And a lot of my bit-maskery could be optimized too.
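
For illustration, a minimal sketch of the unsigned encode/decode scheme
described in the patch's comments (my own names, assuming GCC/Clang's
__builtin_clzll()/__builtin_clz(); this is not the patch's actual code):

#include <stdint.h>

static int
sketch_varint_encode_uint64(uint64_t val, uint8_t *buf)
{
	int			bits = 64 - __builtin_clzll(val | 1);	/* significant bits */
	int			len = (bits + 6) / 7;	/* each output byte carries 7 data bits */

	if (len >= 9)
	{
		/* 9-byte form: first byte is all zero bits, then 8 raw data bytes */
		buf[0] = 0;
		for (int i = 0; i < 8; i++)
			buf[i + 1] = (uint8_t) (val >> (56 - 8 * i));
		return 9;
	}

	/* set the separator bit just above the data, then emit big-endian */
	val |= UINT64_C(1) << (7 * len);
	for (int i = 0; i < len; i++)
		buf[i] = (uint8_t) (val >> (8 * (len - 1 - i)));
	return len;
}

static int
sketch_varint_decode_uint64(const uint8_t *buf, uint64_t *result)
{
	uint64_t	val = 0;
	int			len;

	if (buf[0] == 0)
	{
		/* 9-byte form: 8 raw big-endian data bytes follow */
		for (int i = 0; i < 8; i++)
			val = (val << 8) | buf[i + 1];
		*result = val;
		return 9;
	}
	/* length is the count of leading zero bits in the first byte, plus one */
	len = __builtin_clz((uint32_t) buf[0]) - 24 + 1;
	for (int i = 0; i < len; i++)
		val = (val << 8) | buf[i];
	*result = val & ((UINT64_C(1) << (7 * len)) - 1);	/* strip separator */
	return len;
}

Encoding 145 with this sketch yields the two bytes 0x40 0x91, i.e.
0b0100'0000'1001'0001, matching the example in the patch's comments.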

Greetings,

Andres Freund
>From 68248e7aa220586cbf075d3cb6eb56a6461ae81f Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 4 Oct 2022 16:04:16 -0700
Subject: [PATCH v1 1/2] Add a varint implementation

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20191210015054.5otdfuftxrqb5...@alap3.anarazel.de
Backpatch:
---
 src/include/common/varint.h |  92 +++
 src/common/Makefile |   1 +
 src/common/meson.build  |   1 +
 src/common/varint.c | 228 
 4 files changed, 322 insertions(+)
 create mode 100644 src/include/common/varint.h
 create mode 100644 src/common/varint.c

diff --git a/src/include/common/varint.h b/src/include/common/varint.h
new file mode 100644
index 000..43d9ade7808
--- /dev/null
+++ b/src/include/common/varint.h
@@ -0,0 +1,92 @@
+#ifndef PG_VARINT_H
+#define PG_VARINT_H
+
+/*
+ * Variable length integer encoding.
+ *
+ * Represent the length of the varint, in bytes - 1, as the number of
+ * "leading" 0 bits in the first byte. I.e. the length is stored in
+ * unary.  If less than 9 bytes of data are needed, the leading 0 bits
+ * are followed by a 1 bit, followed by the data. When encoding 64bit
+ * integers, for 8 bytes of input data, there is however no high 1 bit
+ * in the second output byte, as that would be unnecessary (and
+ * increase the output size to 10 bytes). If, hypothetically,
+ * larger numbers were supported, that'd not be the case, obviously.
+ *
+ * The reason to encode the number of bytes as zero bits, rather than
+ * ones, is that more architectures have instructions for efficiently
+ * finding the first bit set, than finding the first bit unset.
+ *
+ * In contrast to encoding the length as e.g. the high-bit in each following
+ * byte, this allows for faster decoding (and encoding, although the benefit
+ * is smaller), because no length related branches are needed after processing
+ * the first byte and no 7 bit/byte -> 8 bit/byte conversion is needed. The
+ * mask / shift overhead is similar.
+ *
+ *
+ * I.e. the encoding of an unsigned 64bit integer is as follows
+ * [0 {#bytes - 1}][1 as separator][leading 0 bits][data]
+ *
+ * E.g.
+ * - 7 (input 0b0000'0111) is encoded as 0b1000'0111
+ * - 145 (input 0b1001'0001) is 0b0100'0000'1001'0001 (as the separator does
+ *   not fit into the one byte of the fixed width encoding)
+ * - 4141 (input 0b0001'0000'0010'1101) is 0b0101'0000'0010'1101 (the
+ *   leading 0, followed by a 1 indicating that two bytes are needed)
+ *
+ *
+ * Naively encoding negative two's complement integers this way would
+ * lead to such numbers always being stored as the maximum varint
+ * length, due to the leading sign bits.  Instead the sign bit is
+ * encoded in the least significant bit. To avoid different limits
+ * than native 64bit integers - in particular around the lowest
+ * possible value - the data portion is bit flipped instead of
+ * negated.
+ *
+ * The reason for the sign bit to be trailing is that it makes it
+ * trivial to reuse code for the unsigned case, whereas a leading bit
+ * would make that harder.
+ *
+ * XXX: One problem with storing negative numbers this way is that it
+ * makes varints not compare lexicographically. Is it worth the code
+ * duplication to allow 

Re: Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records

2022-10-04 Thread Tom Lane
I wrote:
> PFA a quick-hack fix that solves this issue by making per-transaction
> subsidiary hash tables.  That's overkill perhaps; I'm a little worried
> about whether this slows down normal cases more than it's worth.
> But we ought to do something about this, because aside from the
> duplication aspect the current storage of these lists seems mighty
> space-inefficient.

After further thought, maybe it'd be better to do it as attached,
with one long-lived hash table for all the locks.  This is a shade
less space-efficient than the current code once you account for
dynahash overhead, but the per-transaction overhead should be lower
than the previous patch since we only need to create/destroy a hash
table entry not a whole hash table.

regards, tom lane

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 9dab931990..7db86f7885 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -42,7 +42,29 @@ int			max_standby_archive_delay = 30 * 1000;
 int			max_standby_streaming_delay = 30 * 1000;
 bool		log_recovery_conflict_waits = false;
 
-static HTAB *RecoveryLockLists;
+/*
+ * Keep track of all the exclusive locks owned by original transactions.
+ * For each known exclusive lock, there is a RecoveryLockEntry in the
+ * RecoveryLockHash hash table.  All RecoveryLockEntrys belonging to a
+ * given XID are chained together so that we can find them easily.
+ * For each original transaction that is known to have any such locks,
+ * there is a RecoveryLockXidEntry in the RecoveryLockXidHash hash table,
+ * which stores the head of the chain of its locks.
+ */
+typedef struct RecoveryLockEntry
+{
+	xl_standby_lock key;		/* hash key: xid, dbOid, relOid */
+	struct RecoveryLockEntry *next; /* chain link */
+} RecoveryLockEntry;
+
+typedef struct RecoveryLockXidEntry
+{
+	TransactionId xid;			/* hash key -- must be first */
+	struct RecoveryLockEntry *head; /* chain head */
+} RecoveryLockXidEntry;
+
+static HTAB *RecoveryLockHash = NULL;
+static HTAB *RecoveryLockXidHash = NULL;
 
 /* Flags set by timeout handlers */
 static volatile sig_atomic_t got_standby_deadlock_timeout = false;
@@ -58,15 +80,6 @@ static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
 static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
 static const char *get_recovery_conflict_desc(ProcSignalReason reason);
 
-/*
- * Keep track of all the locks owned by a given transaction.
- */
-typedef struct RecoveryLockListsEntry
-{
-	TransactionId xid;
-	List	   *locks;
-} RecoveryLockListsEntry;
-
 /*
  * InitRecoveryTransactionEnvironment
  *		Initialize tracking of our primary's in-progress transactions.
@@ -85,16 +98,24 @@ InitRecoveryTransactionEnvironment(void)
 	VirtualTransactionId vxid;
 	HASHCTL		hash_ctl;
 
+	Assert(RecoveryLockHash == NULL);	/* don't run this twice */
+
 	/*
-	 * Initialize the hash table for tracking the list of locks held by each
+	 * Initialize the hash tables for tracking the locks held by each
 	 * transaction.
 	 */
+	hash_ctl.keysize = sizeof(xl_standby_lock);
+	hash_ctl.entrysize = sizeof(RecoveryLockEntry);
+	RecoveryLockHash = hash_create("RecoveryLockHash",
+   64,
+   &hash_ctl,
+   HASH_ELEM | HASH_BLOBS);
 	hash_ctl.keysize = sizeof(TransactionId);
-	hash_ctl.entrysize = sizeof(RecoveryLockListsEntry);
-	RecoveryLockLists = hash_create("RecoveryLockLists",
-	64,
-	&hash_ctl,
-	HASH_ELEM | HASH_BLOBS);
+	hash_ctl.entrysize = sizeof(RecoveryLockXidEntry);
+	RecoveryLockXidHash = hash_create("RecoveryLockXidHash",
+	  64,
+	  &hash_ctl,
+	  HASH_ELEM | HASH_BLOBS);
 
 	/*
 	 * Initialize shared invalidation management for Startup process, being
@@ -140,12 +161,12 @@ void
 ShutdownRecoveryTransactionEnvironment(void)
 {
 	/*
-	 * Do nothing if RecoveryLockLists is NULL because which means that
-	 * transaction tracking has not been yet initialized or has been already
-	 * shutdowned. This prevents transaction tracking from being shutdowned
-	 * unexpectedly more than once.
+	 * Do nothing if RecoveryLockHash is NULL because that means that
+	 * transaction tracking has not yet been initialized or has already been
+	 * shut down.  This makes it safe to have possibly-redundant calls of this
+	 * function during process exit.
 	 */
-	if (RecoveryLockLists == NULL)
+	if (RecoveryLockHash == NULL)
 		return;
 
 	/* Mark all tracked in-progress transactions as finished. */
@@ -154,9 +175,11 @@ ShutdownRecoveryTransactionEnvironment(void)
 	/* Release all locks the tracked transactions were holding */
 	StandbyReleaseAllLocks();
 
-	/* Destroy the hash table of locks. */
-	hash_destroy(RecoveryLockLists);
-	RecoveryLockLists = NULL;
+	/* Destroy the lock hash tables. */
+	hash_destroy(RecoveryLockHash);
+	hash_destroy(RecoveryLockXidHash);
+	RecoveryLockHash = NULL;
+	RecoveryLockXidHash = NULL;
 
 	/* Cleanup our
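
To illustrate the two-table scheme, releasing all locks belonging to one
transaction would look roughly like this (a sketch against the structs
above; the helper name StandbyReleaseXidEntryLocks and the details are
assumptions, not necessarily the final code):

static void
StandbyReleaseXidEntryLocks(RecoveryLockXidEntry *xidentry)
{
	RecoveryLockEntry *entry;
	RecoveryLockEntry *next;

	for (entry = xidentry->head; entry != NULL; entry = next)
	{
		LOCKTAG		locktag;

		/* release the lock that was taken on the primary's behalf */
		SET_LOCKTAG_RELATION(locktag, entry->key.dbOid, entry->key.relOid);
		if (!LockRelease(&locktag, AccessExclusiveLock, true))
			elog(LOG, "RecoveryLockHash contains entry for lock no longer recorded by lock manager");
		next = entry->next;
		hash_search(RecoveryLockHash, &entry->key, HASH_REMOVE, NULL);
	}
	/* finally drop the per-XID entry that headed the chain */
	hash_search(RecoveryLockXidHash, &xidentry->xid, HASH_REMOVE, NULL);
}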

Re: Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records

2022-10-04 Thread Nathan Bossart
On Tue, Oct 04, 2022 at 07:53:11PM -0400, Tom Lane wrote:
> I wrote:
>> PFA a quick-hack fix that solves this issue by making per-transaction
>> subsidiary hash tables.  That's overkill perhaps; I'm a little worried
>> about whether this slows down normal cases more than it's worth.
>> But we ought to do something about this, because aside from the
>> duplication aspect the current storage of these lists seems mighty
>> space-inefficient.
> 
> After further thought, maybe it'd be better to do it as attached,
> with one long-lived hash table for all the locks.  This is a shade
> less space-efficient than the current code once you account for
> dynahash overhead, but the per-transaction overhead should be lower
> than the previous patch since we only need to create/destroy a hash
> table entry not a whole hash table.

This feels like a natural way to solve this problem.  I saw several cases
of the issue that was fixed with 6301c3a, so I'm inclined to believe this
usage pattern is actually somewhat common.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records

2022-10-04 Thread Kyotaro Horiguchi
At Tue, 4 Oct 2022 17:15:31 -0700, Nathan Bossart  
wrote in 
> On Tue, Oct 04, 2022 at 07:53:11PM -0400, Tom Lane wrote:
> > I wrote:
> >> PFA a quick-hack fix that solves this issue by making per-transaction
> >> subsidiary hash tables.  That's overkill perhaps; I'm a little worried
> >> about whether this slows down normal cases more than it's worth.
> >> But we ought to do something about this, because aside from the
> >> duplication aspect the current storage of these lists seems mighty
> >> space-inefficient.
> > 
> > After further thought, maybe it'd be better to do it as attached,
> > with one long-lived hash table for all the locks.  This is a shade
> > less space-efficient than the current code once you account for
> > dynahash overhead, but the per-transaction overhead should be lower
> > than the previous patch since we only need to create/destroy a hash
> > table entry not a whole hash table.

The first one is a straightforward outcome of the current implementation,
but I like the new one.  I agree that it is natural and that the expected
overhead per (typical) transaction is lower than both the first patch
and doing the same operation on a list.  I don't think space
inefficiency of that extent matters, since it is the startup
process.

> This feels like a natural way to solve this problem.  I saw several cases
> of the issue that was fixed with 6301c3a, so I'm inclined to believe this
> usage pattern is actually somewhat common.

So releasing locks becomes somewhat slower? But it seems to still be
far faster than massively repetitive head-removal in a list.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Miscellaneous tab completion issue fixes

2022-10-04 Thread Michael Paquier
On Tue, Oct 04, 2022 at 10:35:25AM +0100, Dagfinn Ilmari Mannsåker wrote:
> LGTM, +1 to commit.

Fine by me, so done.
--
Michael




Re: New strategies for freezing, advancing relfrozenxid early

2022-10-04 Thread Jeff Davis
On Tue, 2022-10-04 at 11:09 -0700, Peter Geoghegan wrote:
> So a simplistic threshold
> (combined with dynamic per-page decisions about freezing) should be
> enough to avoid most of the downside of eager freezing.

...

> I want to keep
> the cost as low as possible (often "negative cost" relative to
> Postgres 15), but overall I am consciously making a trade-off. There
> are downsides.

I am fine with that, but I'd like us all to understand what the
downsides are.

If I understand correctly:

1. Eager freezing (meaning to freeze at the same time as setting all-
visible) causes a modest amount of WAL traffic, hopefully before the
next checkpoint so we can avoid FPIs. Lazy freezing (meaning set all-
visible but don't freeze) defers the work, and it might never need to
be done; but if it does, it can cause spikes at unfortunate times and
is more likely to generate more FPIs.

2. You're trying to mitigate the downsides of eager freezing by:
  a. when freezing a tuple, eagerly freeze other tuples on that page
  b. optimize WAL freeze records

3. You're trying to capture the trade-off in #1 by using the table size
as a proxy. Deferred work is only really a problem for big tables, so
that's where you use eager freezing. But maybe we can just always use
eager freezing?:
  a. You're mitigating the WAL work for freezing.
  b. A lot of people run with checksums on, meaning that setting the
all-visible bit requires WAL work anyway, and often FPIs.
  c. All-visible is conceptually similar to freezing, but less
important, and it feels more and more like the design concept of all-
visible isn't carrying its weight.
  d. (tangent) I had an old patch[1] that actually removed
PD_ALL_VISIBLE (the page bit, not the VM bit), which was rejected, but
perhaps its time has come?

Regards,
Jeff Davis


[1]
https://www.postgresql.org/message-id/1353551097.11440.128.camel%40sussancws0025





Re: JUMBLE_SIZE macro in two files

2022-10-04 Thread Michael Paquier
On Tue, Oct 04, 2022 at 09:16:44AM -0400, Tom Lane wrote:
> I would go more for taking it out of queryjumble.h.  I see no
> reason why that constant needs to be world-visible.

I was just looking at the patch before seeing your reply, and thought
the exact same thing.  Perhaps you'd prefer apply that yourself?
--
Michael




Re: JUMBLE_SIZE macro in two files

2022-10-04 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Oct 04, 2022 at 09:16:44AM -0400, Tom Lane wrote:
>> I would go more for taking it out of queryjumble.h.  I see no
>> reason why that constant needs to be world-visible.

> I was just looking at the patch before seeing your reply, and thought
> the exact same thing.  Perhaps you'd prefer apply that yourself?

Nah, feel free.

regards, tom lane




Re: Data is copied twice when specifying both child and parent table in publication

2022-10-04 Thread Peter Smith
Hi Wang-san. Here are my review comments for HEAD_v12-0001 patch.

==

1. Missing documentation.

In [1] you wrote:
> I think the behaviour of multiple publications with parameter 
> publish_via_partition_root could be added to the pg-doc later in a separate 
> patch.

~

That doesn't seem right to me. IMO the related documentation updates
cannot really be separated from this patch. Otherwise, what's the
alternative? Push this change, and then (while waiting for the
documentation patch) users will just have to use trial and error to
guess how it works...?

--

2. src/backend/catalog/pg_publication.c

+ typedef struct
+ {
+ Oid relid; /* OID of published table */
+ Oid pubid; /* OID of publication that publishes this
+ * table. */
+ } published_rel;

2a.
I think that should be added to typedefs.list

~

2b.
Maybe this also needs some comment to clarify that there will be
*multiple* of these structures in scenarios where the same table is
published by different publications in the array passed.
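
For instance, the struct plus the kind of clarifying comment 2b asks for
might look like this (just an illustrative sketch):

typedef struct
{
	Oid			relid;		/* OID of published table */
	Oid			pubid;		/* OID of publication that publishes this
							 * table; the same relid can appear in
							 * multiple entries when the table is
							 * published by more than one of the
							 * given publications */
} published_rel;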

--

3. QUESTION - pg_get_publication_tables / fetch_table_list

When the same table is published by different publications (but there
are other differences like row-filters/column-lists in each
publication) the result tuple of this function does not include the
pubid. Maybe the SQL of pg_publication_tables/fetch_table_list() is OK
as-is but how does it manage to associate each table with the correct
tuple?

I know it apparently all seems to work, but I'm not sure how that
happens. Can you explain why a puboid is not needed for the result
tuple of this function?

~~

test_pub=# create table t1(a int, b int, c int);
CREATE TABLE
test_pub=# create publication pub1 for table t1(a) where (a > 99);
CREATE PUBLICATION
test_pub=# create publication pub2 for table t1(a,b) where (b < 33);
CREATE PUBLICATION

Following seems OK when I swap orders of publication names...

test_pub=# SELECT gpt.relid, gpt.attrs, pg_get_expr(gpt.qual,
gpt.relid) AS rowfilter from pg_get_publication_tables(VARIADIC
ARRAY['pub2','pub1']) gpt(relid, attrs, qual);
 relid | attrs | rowfilter
---+---+---
 16385 | 1 2   | (b < 33)
 16385 | 1 | (a > 99)
(2 rows)

test_pub=# SELECT gpt.relid, gpt.attrs, pg_get_expr(gpt.qual,
gpt.relid) AS rowfilter from pg_get_publication_tables(VARIADIC
ARRAY['pub1','pub2']) gpt(relid, attrs, qual);
 relid | attrs | rowfilter
---+---+---
 16385 | 1 | (a > 99)
 16385 | 1 2   | (b < 33)
(2 rows)

But what about this (this is similar to the SQL fragment from
fetch_table_list); I swapped the pub names but the results are the
same...

test_pub=# SELECT pg_get_publication_tables(VARIADIC
array_agg(p.pubname)) from pg_publication p where pubname
IN('pub2','pub1');

 pg_get_publication_tables

--
--
---
 (16385,1,"{OPEXPR :opno 521 :opfuncid 147 :opresulttype 16 :opretset
false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 1
:vartype 23 :vartypmod -1 :var
collid 0 :varlevelsup 0 :varnosyn 1 :varattnosyn 1 :location 47}
{CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4
:constbyval true :constisnull false :
location 51 :constvalue 4 [ 99 0 0 0 0 0 0 0 ]}) :location 49}")
 (16385,"1 2","{OPEXPR :opno 97 :opfuncid 66 :opresulttype 16
:opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1
:varattno 2 :vartype 23 :vartypmod -1 :v
arcollid 0 :varlevelsup 0 :varnosyn 1 :varattnosyn 2 :location 49}
{CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4
:constbyval true :constisnull false
 :location 53 :constvalue 4 [ 33 0 0 0 0 0 0 0 ]}) :location 51}")
(2 rows)

test_pub=# SELECT pg_get_publication_tables(VARIADIC
array_agg(p.pubname)) from pg_publication p where pubname
IN('pub1','pub2');

 pg_get_publication_tables

--
--
---
 (16385,1,"{OPEXPR :opno 521 :opfuncid 147 :opresulttype 16 :opretset
false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 1
:vartype 23 :vartypmod -1 :var
collid 0 :varlevelsup 0 :varnosyn 1 :varattnosyn 1 :location 47}
{CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4
:constbyval true :constisnull false :
location 51 :constvalue 4 [ 99 0 0 0 0 0 0 0 ]}) :location 49}")
 (16385,"1 2","{OP

Re: New strategies for freezing, advancing relfrozenxid early

2022-10-04 Thread Peter Geoghegan
On Tue, Oct 4, 2022 at 7:59 PM Jeff Davis  wrote:
> I am fine with that, but I'd like us all to understand what the
> downsides are.

Although I'm sure that there must be one case that loses measurably,
it's not particularly obvious where to start looking for one. I mean
it's easy to imagine individual pages that we lose on, but a practical
test case where most of the pages are like that reliably is harder to
imagine.

> If I understand correctly:
>
> 1. Eager freezing (meaning to freeze at the same time as setting all-
> visible) causes a modest amount of WAL traffic, hopefully before the
> next checkpoint so we can avoid FPIs. Lazy freezing (meaning set all-
> visible but don't freeze) defers the work, and it might never need to
> be done; but if it does, it can cause spikes at unfortunate times and
> is more likely to generate more FPIs.

Lazy freezing means to freeze every eligible tuple (every XID <
OldestXmin) when one or more XIDs are before FreezeLimit. Eager
freezing means freezing every eligible tuple when the page is about to
be set all-visible, or whenever lazy freezing would trigger freezing.

Eager freezing tends to avoid big spikes in larger tables, which is
very important. It can sometimes be cheaper and better in every way
than lazy freezing. Though lazy freezing sometimes retains an
advantage by avoiding freezing that is never going to be needed
altogether, typically only in small tables.

Lazy freezing is fairly similar to what we do on HEAD now -- though
it's not identical. It's still "page level freezing". It has lazy
criteria for triggering page freezing.

> 2. You're trying to mitigate the downsides of eager freezing by:
>   a. when freezing a tuple, eagerly freeze other tuples on that page
>   b. optimize WAL freeze records

Sort of.

Both of these techniques apply to eager freezing too, in fact. It's
just that eager freezing is likely to do the bulk of all freezing that
actually goes ahead, so it'll be helped disproportionately by these
techniques (even when most VACUUM operations use the lazy freezing
strategy, which is probably the common case -- just because lazy
freezing freezes lazily).

> 3. You're trying to capture the trade-off in #1 by using the table size
> as a proxy. Deferred work is only really a problem for big tables, so
> that's where you use eager freezing.

Right.

> But maybe we can just always use
> eager freezing?:

That doesn't seem like a bad idea, though it might be tricky to put
into practice. It might be possible to totally unite the concept of
all-visible and all-frozen pages in the scope of this work. But there
are surprisingly many tricky details involved. I'm not surprised that
you're suggesting this -- it basically makes sense to me. It's just
the practicalities that I worry about here.

>   a. You're mitigating the WAL work for freezing.

I don't see why this would be true. Lazy vs Eager are exactly the same
for a given page at the point that freezing is triggered. We'll freeze
all eligible tuples (often though not always every tuple), or none at
all.

Lazy vs Eager describe the policy for deciding to freeze a page, but
do not affect the actual execution steps taken once we decide to
freeze.

>   b. A lot of people run with checksums on, meaning that setting the
> all-visible bit requires WAL work anyway, and often FPIs.

The idea of rolling the WAL records into one does seem appealing, but
we'd still need the original WAL record to set a page all-visible in
VACUUM's second heap pass (only setting a page all-visible in the
first heap pass could be optimized by making the FREEZE_PAGE WAL
record mark the page all-visible too). Or maybe we'd roll that into
the VACUUM WAL record at the same time.

In any case the second heap pass would have to have a totally
different WAL logging strategy to the first heap pass. Not
insurmountable, but not exactly an easy thing to do in passing either.

>   c. All-visible is conceptually similar to freezing, but less
> important, and it feels more and more like the design concept of all-
> visible isn't carrying its weight.

Well, not quite -- at least not on the VM side itself.

There are cases where heap_lock_tuple() will update a tuple's xmax,
replacing it with a new Multi. This will necessitate clearing the
page's all-frozen bit in the VM -- but the all-visible bit will stay
set. This is why it's possible for small numbers of all-visible pages
to appear even in large tables that have been eagerly frozen.

>   d. (tangent) I had an old patch[1] that actually removed
> PD_ALL_VISIBLE (the page bit, not the VM bit), which was rejected, but
> perhaps its time has come?

I remember that pgCon developer meeting well.  :-)

If anything your original argument for getting rid of PD_ALL_VISIBLE
is weakened by the proposal to merge together the WAL records for
freezing and for setting a heap page all visible. You'd know for sure
that the page will be dirtied when such a WAL reco

Re: Assorted fixes related to WAL files (was: Use XLogFromFileName() in pg_resetwal to parse position from WAL file)

2022-10-04 Thread Michael Paquier
On Tue, Oct 04, 2022 at 06:24:18PM +0530, Bharath Rupireddy wrote:
> I'm fine with doing either of these things. Let's hear from others.
> 
> I've added a CF entry - https://commitfest.postgresql.org/40/3927/

About 0002, I am not sure that it is worth bothering.  Sure, this
wastes a few bytes, but I recall that there are quite a few places in
the code where we imply a WAL segment but append a full path to it,
and this creates a few bumps with back-patches.

--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 
+#include "access/xlog_internal.h"
 #include "common/relpath.h"
 #include "libpq-fe.h"
Well, xlog_internal.h includes a few backend-only definitions.  I'd
rather have us untangle the backend/frontend dependencies more before
adding more includes of this type, even if I agree that nextxlogfile
would be better with the implied name size limit.

Saying that, 0001 is a nice catch, so applied it.  I have switched the
two TLI variables to use TimeLineID, while touching the area.
--
Michael




Re: JUMBLE_SIZE macro in two files

2022-10-04 Thread Michael Paquier
On Tue, Oct 04, 2022 at 11:17:09PM -0400, Tom Lane wrote:
> Nah, feel free.

Okay, thanks.  Applied, then.
--
Michael




Re: installcheck-world concurrency issues

2022-10-04 Thread Peter Eisentraut

On 04.10.22 01:41, Andres Freund wrote:

BTW, shouldn't src/test/modules/unsafe_tests use the PG_TEST_EXTRA mechanism
somehow?  Seems not great to run it as part of installcheck-world, if we don't
want to run it as part of installcheck.


I think there are different levels and kinds of unsafeness.  The ssl and 
kerberos tests start open (network-accessible) server processes on your 
machine.  The 
modules/unsafe_tests just make a mess of your postgres instance.  The 
latter isn't a problem when run against a temp instance.






Re: Can we avoid chdir'ing in resolve_symlinks() ?

2022-10-04 Thread Peter Eisentraut

On 15.09.22 16:43, Tom Lane wrote:

That seems ... a tad far-fetched, and even more to the point,
it'd be the other package's fault not ours.  We have never promised
that those directories point to anyplace that's not PG-specific.
I certainly do not buy that that's a good argument for breaking
Postgres installation setups that work today.

Also, there is nothing in that scenario that is in any way dependent
on the use of symlinks, or even absolute paths, so I don't quite
see the relevance to the current discussion.


Here is another variant of the same problem:

I have

$ which meson
/usr/local/bin/meson

Meson records its own path (somewhere under meson-info/ AFAICT), so it 
can re-run itself when any of the meson.build files change.  But since 
the above is a symlink, it records its own location as 
"/usr/local/Cellar/meson/0.63.1/bin/meson".  So now, whenever the meson 
package updates (even if it's just 0.63.0 -> 0.63.1), my build tree is 
broken.


To clarify, this instance is not at all the fault of any code in 
PostgreSQL.  But it's another instance where resolving symlinks just 
because we can causes problems.






Re: [RFC] building postgres with meson - v13

2022-10-04 Thread Peter Eisentraut

On 03.10.22 09:39, samay sharma wrote:

9f5be26c1215 meson: Add docs for building with meson

I do like the overall layout of this.

The "Supported Platforms" section should be moved back to near the end
of the chapter.  I don't see a reason to move it forward, at least
none that is related to the meson issue.


Agreed that it's unrelated to meson. However, I think it's better to 
move it to the front, as it's generally useful to know whether your platform 
is supported before you start performing the installation steps and get 
stuck somewhere.


The way it is currently organized is that 17.2 says

"In general, a modern Unix-compatible platform should be able to run 
PostgreSQL. The platforms that had received specific testing at the time 
of release are described in Section 17.6 below."


So basically, it says, don't worry about it, your platform is probably 
supported, but check below if you are interested in the details.


I don't see a reason to turn this around.



Do you think I should submit that as a separate commit in the same 
patch-set or just move it out to a completely different patch submission?



The changes to the "Getting the Source" section are also not
appropriate for this patch.


Given that many developers are now using Git for downloading the source 
code, I think it makes sense to be in the Getting the source section. 
Also, meson today doesn't cleanly build via the tarballs. Hence, I added 
it to the section (and patchset).


Section 17.3 already contains a link to section I.1 about using Git.


Do you think I should move this to a different patch?


If you wanted to pursue these changes, then yes, but I think they are 
not clear improvements, as mentioned above.


I suggest focusing on getting the actual meson documentation finished 
and then considering polishing the overall flow if desired.






Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-04 Thread Masahiko Sawada
On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada  wrote:
>
> On Fri, Sep 23, 2022 at 12:11 AM John Naylor
>  wrote:
> >
> >
> > On Thu, Sep 22, 2022 at 11:46 AM John Naylor  
> > wrote:
> > > One thing I want to try soon is storing fewer than 16/32 etc entries, so 
> > > that the whole node fits comfortably inside a power-of-two allocation. 
> > > That would allow us to use aset without wasting space for the smaller 
> > > nodes, which would be faster and possibly would solve the fragmentation 
> > > problem Andres referred to in
> >
> > > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
> >
> > While calculating node sizes that fit within a power-of-two size, I noticed 
> > the current base node is a bit wasteful, taking up 8 bytes. The node kind 
> > only has a small number of values, so it doesn't really make sense to use 
> > an enum here in the struct (in fact, Andres' prototype used a uint8 for 
> > node_kind). We could use a bitfield for the count and kind:
> >
> > uint16 -- kind and count bitfield
> > uint8 shift;
> > uint8 chunk;
> >
> > That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, 
> > the bitfield can just go back to being count only.
>
> Good point, agreed.
>
> >
> > Here are the v6 node kinds:
> >
> > node4:   8 +   4 +(4)+   4*8 =   48 bytes
> > node16:  8 +  16 +  16*8 =  152
> > node32:  8 +  32 +  32*8 =  296
> > node128: 8 + 256 + 128/8 + 128*8 = 1304
> > node256: 8   + 256/8 + 256*8 = 2088
> >
> > And here are the possible ways we could optimize nodes for space using aset 
> > allocation. Parentheses are padding bytes. Even if my math has mistakes, 
> > the numbers shouldn't be too far off:
> >
> > node3:   4 +   3 +(1)+   3*8 =   32 bytes
> > node6:   4 +   6 +(6)+   6*8 =   64
> > node13:  4 +  13 +(7)+  13*8 =  128
> > node28:  4 +  28 +  28*8 =  256
> > node31:  4 + 256 +  32/8 +  31*8 =  512 (XXX not good)
> > node94:  4 + 256 +  96/8 +  94*8 = 1024
> > node220: 4 + 256 + 224/8 + 220*8 = 2048
> > node256: = 4096
> >
> > The main disadvantage is that node256 would balloon in size.
>
> Yeah, node31 and node256 are bloated.  We probably could use slab for
> node256 independently. It's worth trying a benchmark to see how it
> affects the performance and the tree size.
>
> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> sizeof(dsa_area_span), 0,   /* special size classes */
> 8, 16, 24, 32, 40, 48, 56, 64,  /* 8 classes separated by 8 bytes */
> 80, 96, 112, 128,   /* 4 classes separated by 16 bytes */
> 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
> 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
> 640, 768, 896, 1024,/* 4 classes separated by 128 bytes */
> 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
> 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
> 5456, 6552, 7280, 8192  /* 4 classes separated by ~1024 bytes */
> };
>
> node256 will be classed as 2616, which is still not good.
>
> Anyway, I'll implement DSA support for radix tree.
>

Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to their child nodes, instead of C pointers (i.e., backend-local
addresses). I'm thinking of a straightforward approach as the first
step; inner nodes have a union of rt_node* and dsa_pointer and we
choose either one based on whether the radix tree is shared or not. We
allocate and free the shared memory for individual nodes by
dsa_allocate() and dsa_free(), respectively. Therefore we need to get
a C pointer from dsa_pointer by using dsa_get_address() while
descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could add performance overhead, but
I'm going to measure it anyway.
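
To sketch the idea (the type and field names here -- rt_ptr, is_shared,
area -- are illustrative assumptions, not the actual patch's API):

#include "utils/dsa.h"

/* A child pointer that works for both local and DSA-backed trees */
typedef union rt_ptr
{
	struct rt_node *local;		/* when the tree is backend-local */
	dsa_pointer	shared;			/* when the tree lives in a DSA area */
} rt_ptr;

static inline struct rt_node *
rt_ptr_get_node(radix_tree *tree, rt_ptr ptr)
{
	/* assumes the tree struct records whether it is shared, plus its area */
	if (tree->is_shared)
		return (struct rt_node *) dsa_get_address(tree->area, ptr.shared);
	return ptr.local;
}

This does mean one dsa_get_address() call per level of descent in the
shared case, which is exactly the overhead mentioned above.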

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com