Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On Tue, Aug 9, 2022 at 10:57 AM Thomas Munro wrote:
> OK, so there aren't many places in 0001 that error out where we
> previously did not.

Well, that's not true, I believe. The 0001 patch introduces get_dirent_type() with elevel ERROR, which means that lstat failures are now reported as ERRORs with ereport(ERROR, ...), as opposed to continuing as on HEAD:

-if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+if (get_dirent_type(path, mapping_de, false, ERROR) != PGFILETYPE_REG)
     continue;

-if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+if (get_dirent_type(path, snap_de, false, ERROR) != PGFILETYPE_REG)
 {

and so on. The main idea of using get_dirent_type() instead of lstat or stat is to benefit from avoiding system calls on platforms where the directory entry type is stored in the dirent structure, not to cause errors for lstat or stat failures where we previously didn't. I think 0001 must just replace lstat or stat with get_dirent_type(), using elevel ERROR where an lstat or stat failure was previously treated as an ERROR, and elevel LOG otherwise. I modified it as attached. Once this patch gets committed, we can continue the discussion as to whether to turn elog-LOG into elog-ERROR or vice versa.

I'm tempted to use get_dirent_type() in RemoveXlogFile(), but that requires passing struct dirent * from RemoveOldXlogFiles() (as attached 0002 for now). If done, this avoids one lstat() system call per WAL file. IMO it's okay to pass an extra function parameter to RemoveXlogFile() as it is a static function, and there can be many (thousands of) WAL files at times, so saving system calls (lstat() * number of WAL files) is definitely an advantage.

Thoughts?

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

v14-0001-Make-more-use-of-get_dirent_type.patch
Description: Binary data
v14-0002-Replace-lstat-with-get_dirent_type-in-xlog.c.patch
Description: Binary data
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Tue, Aug 9, 2022 at 9:46 PM Simon Riggs wrote:
> Those calls are unaffected, i.e. they both still work.
>
> Right now, we register all subxids in subtrans. But not all xids are
> subxids, so in fact, subtrans has many "holes" in it, where if you
> look up the parent for an xid it will just return
> InvalidTransactionId. There is a protection against that causing a
> problem because if you call TransactionIdDidCommit/Abort you can get a
> WARNING, or if you call SubTransGetTopmostTransaction() you can get an
> ERROR, but it is possible if you do a lookup for an inappropriate xid,
> i.e. if you call TransactionIdDidCommit() without first calling
> TransactionIdIsInProgress() as you are supposed to do.

IIUC, if SubTransGetParent() returns InvalidTransactionId then the SubTransGetTopmostTransaction() loop will break and return the previous xid. So if we pass any topxid to SubTransGetTopmostTransaction() it will return back the same xid, and that's fine, since next we are going to search in the snapshot->xip array. But suppose we call this function with a subxid that is present in the snapshot->subxip array: if we first call SubTransGetTopmostTransaction(), it will just return the same xid when no parent is set for it, and if we then search for that xid in the snapshot->xip array, we will get the wrong answer. So I still think some adjustment is required in XidInMVCCSnapshot() such that we first search the snapshot->subxip array. Am I still missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Using each rel as both outer and inner for JOIN_ANTI
On Tue, Aug 9, 2022 at 6:54 PM Alvaro Herrera wrote:
> I suppose this looks good as far as the plan goes, but the cost estimation
> might be a little bit too optimistic: it is reporting that the new plan
> costs 50% of the original, yet the execution time is only 5% lower.

Thanks for trying this patch. Yeah, the estimated cost doesn't match the execution time there. I tried the query locally and here is what I got (best of three):

Unpatched:

# explain analyze select * from foo left join bar on foo.a = bar.c where bar.c is null;
                                                  QUERY PLAN
---
 Hash Anti Join  (cost=154156.00..173691.19 rows=1 width=16) (actual time=1548.622..1548.624 rows=0 loops=1)
   Hash Cond: (foo.a = bar.c)
   ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual time=0.024..0.026 rows=10 loops=1)
   ->  Hash  (cost=72124.00..72124.00 rows=500 width=8) (actual time=1443.157..1443.158 rows=500 loops=1)
         Buckets: 262144  Batches: 64  Memory Usage: 5107kB
         ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=8) (actual time=0.045..481.059 rows=500 loops=1)
 Planning Time: 0.262 ms
 Execution Time: 1549.138 ms
(8 rows)

Patched:

# explain analyze select * from foo left join bar on foo.a = bar.c where bar.c is null;
                                                  QUERY PLAN
-
 Hash Right Anti Join  (cost=1.23..90875.33 rows=1 width=16) (actual time=985.773..985.775 rows=0 loops=1)
   Hash Cond: (bar.c = foo.a)
   ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=8) (actual time=0.095..438.333 rows=500 loops=1)
   ->  Hash  (cost=1.10..1.10 rows=10 width=8) (actual time=0.076..0.077 rows=10 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual time=0.060..0.064 rows=10 loops=1)
 Planning Time: 0.290 ms
 Execution Time: 985.830 ms
(8 rows)

Seems the cost matches the execution time better on my local box. The right-anti join plan has the same cost estimation as the right join plan in this case. So would you please help to test what the right join plan looks like in your env for the query below?

select * from foo left join bar on foo.a = bar.c;

Thanks
Richard
Re: [RFC] building postgres with meson - v10
Hi,

On 8/8/22 18:53, Andres Freund wrote:
> Bilal's version checked different directories for expected files, but I
> don't think that's necessary. Bilal, do you remember why you added that?

This was to avoid breaking the autoconf build: autoconf wasn't using expecteddir, so I checked different directories.

Greetings,
Nazir Bilal Yavuz
Re: createuser doesn't tell default settings for some options
> On 10 Aug 2022, at 08:12, Kyotaro Horiguchi wrote:
>
> (I suppose this is a pg15 issue)
>
> createuser --help shows the following help text.
>
>> --bypassrls           role can bypass row-level security (RLS) policy
>> --no-bypassrls        role cannot bypass row-level security (RLS) policy
>> --replication         role can initiate replication
>> --no-replication      role cannot initiate replication
>
> For other options the text tells which one is the default, which I
> think these two options should do as well.

Agreed. For --no-replication, the docs in createuser.sgml should be fixed to include a "This is the default." sentence like the others have as well.

> The interactive mode doesn't cover all options, but I'm not sure what
> we should do to the mode since I don't have a clear idea of how the
> mode is used. In the attached only --bypassrls is arbitrarily added.
> The remaining options omitted in the interactive mode are: password,
> valid-until, role, member and replication. (attached second)

I'm not convinced that we should add more to the interactive mode; it's IMO mostly a backwards-compat option for ancient (pre-9.2) createuser, where this was automatically done. Back then we had this in the documentation, which has since been removed: "You will be prompted for a name and other missing information if it is not specified on the command line."

> The ternary options are checked against decimal 0, but it should use
> TRI_DEFAULT instead. (attached third)

Agreed, nice catch.

--
Daniel Gustafsson
https://vmware.com/
Re: Fast COPY FROM based on batch insert
Hi,

On Wed, Aug 10, 2022 at 1:06 AM Zhihong Yu wrote:
> On Tue, Aug 9, 2022 at 4:45 AM Etsuro Fujita wrote:
>> * When running AFTER ROW triggers in CopyMultiInsertBufferFlush(), the
>> patch uses the slots passed to ExecForeignBatchInsert(), not the ones
>> returned by the callback function, but I don't think that that is
>> always correct, as the documentation about the callback function says:
>>
>>     The return value is an array of slots containing the data that was
>>     actually inserted (this might differ from the data supplied, for
>>     example as a result of trigger actions.)
>>     The passed-in slots can be re-used for this purpose.
>>
>> postgres_fdw re-uses the passed-in slots, but other FDWs might not, so
>> I modified the patch to reference the returned slots when running the
>> AFTER ROW triggers.

I noticed that my explanation was not correct. Let me explain. Before commit 82593b9a3, when batching into a view referencing a postgres_fdw foreign table that has WCO constraints, postgres_fdw used the passed-in slots to store the first tuple that was actually inserted into the remote table. But that commit disabled batching in that case, so postgres_fdw wouldn't use the passed-in slots (until we support batching when there are WCO constraints from the parent views and/or AFTER ROW triggers on the foreign table).

> +    /* If any rows were inserted, run AFTER ROW INSERT triggers. */
> ...
> +    for (i = 0; i < inserted; i++)
> +    {
> +        TupleTableSlot *slot = rslots[i];
> ...
> +        slot->tts_tableOid =
> +            RelationGetRelid(resultRelInfo->ri_RelationDesc);
>
> It seems the return value of
> `RelationGetRelid(resultRelInfo->ri_RelationDesc)` can be stored in a
> variable outside the for loop.
> Inside the for loop, assign this variable to slot->tts_tableOid.

Actually, I did this to match the code in ExecBatchInsert(), but that seems like a good idea, so I'll update the patch as such in the next version.

Thanks for reviewing! Best regards,
Etsuro Fujita
Re: Fix a typo in pgstatfuncs.c
On Wed, Aug 10, 2022 at 1:22 PM Drouvot, Bertrand wrote:
> Hi,
>
> It seems like there's the following typo in pgstatfuncs.c:
>
> -    /* Values only available to role member or pg_read_all_stats */
> +    /* Values only available to role member of pg_read_all_stats */
>
> Attaching a tiny patch to fix it.

I don't think it's a typo: the comment says that the values are only available to a user who has the privileges of the backend's role or of pg_read_all_stats — the macro HAS_PGSTAT_PERMISSIONS says it all. IMO, either of the following works better, if the existing comment is confusing:

/* Values only available to a member (or role or user) with privileges of the backend's role or pg_read_all_stats */
/* Values only available to a member (or role or user) that has membership in the backend's role or has privileges of pg_read_all_stats */

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Hello.

> Yes, that is correct.

Mmm. I believed that the log came from a single server run, since the PID (I believe the [359], [357] are PIDs) did not change through the log lines.

> 2022-08-05 18:50:13 UTC::@:[359]:LOG: creating missing WAL directory "pg_wal/archive_status"

This means that someone removed the contents of the pg_wal directory. Removing some WAL files in pg_wal may lead to this symptom.

> 2022-08-05 18:50:13 UTC::@:[359]:LOG: entering standby mode
> recovering 0002.history
> 2022-08-05 18:50:13 UTC::@:[359]:LOG: ### [S] @6/B8000198: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
> 2022-08-05 18:50:13 UTC::@:[359]:LOG: database system was not properly shut down; automatic recovery in progress
> 2022-08-05 18:50:13 UTC::@:[359]:LOG: redo starts at 6/B8E8
>
> And a few hours later, is when we see a panic

So, it seems that the *standby* received the inconsistent WAL stream (an aborted contrecord not followed by an overwriting-missing-contrecord record) from the primary. Thus the inconsistency happened on the primary, not on the standby. So... I'm still failing to draw the big picture of what is happening here. Could you show us the server configuration (the DB servers involved and their roles (primary/standby..)), and the exact steps when you restart the server after a crash?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Using each rel as both outer and inner for JOIN_ANTI
On 2022-Aug-10, Richard Guo wrote:
> The right-anti join plan has the same cost estimation with right join
> plan in this case. So would you please help to test what the right join
> plan looks like in your env for the query below?
>
> select * from foo left join bar on foo.a = bar.c;

You're right, it does.

55432 16devel 475322=# explain (analyze, buffers) select * from foo left join bar on foo.a = bar.c;
                                                  QUERY PLAN
──
 Hash Right Join  (cost=1.23..90875.24 rows=10 width=20) (actual time=456.410..456.415 rows=10 loops=1)
   Hash Cond: (bar.c = foo.a)
   Buffers: shared hit=15852 read=6273
   ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=12) (actual time=0.036..210.468 rows=500 loops=1)
         Buffers: shared hit=15852 read=6272
   ->  Hash  (cost=1.10..1.10 rows=10 width=8) (actual time=0.037..0.038 rows=10 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         Buffers: shared read=1
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual time=0.022..0.026 rows=10 loops=1)
               Buffers: shared read=1
 Planning:
   Buffers: shared hit=92 read=13
 Planning Time: 1.077 ms
 Execution Time: 456.458 ms
(14 filas)

55432 16devel 475322=# explain (analyze, buffers) select * from foo left join bar on foo.a = bar.c where bar.c is null;
                                                  QUERY PLAN
──
 Hash Right Anti Join  (cost=1.23..90875.24 rows=10 width=20) (actual time=451.747..451.751 rows=10 loops=1)
   Hash Cond: (bar.c = foo.a)
   Buffers: shared hit=15646 read=6479
   ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=12) (actual time=0.048..204.940 rows=500 loops=1)
         Buffers: shared hit=15645 read=6479
   ->  Hash  (cost=1.10..1.10 rows=10 width=8) (actual time=0.030..0.031 rows=10 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         Buffers: shared hit=1
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual time=0.017..0.020 rows=10 loops=1)
               Buffers: shared hit=1
 Planning Time: 0.227 ms
 Execution Time: 451.793 ms

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On Wed, Aug 10, 2022 at 7:15 PM Bharath Rupireddy wrote:
> The main idea of using get_dirent_type() instead of lstat or stat is
> to benefit from avoiding system calls on platforms where the directory
> entry type is stored in dirent structure, but not to cause errors for
> lstat or stat system calls failures where we previously didn't.

Yeah, I get it... I'm OK with separating mechanical changes like lstat() -> get_dirent_type() from policy changes on error handling. I initially thought the ERROR was surely better than what we're doing now (making extra system calls to avoid unlinking unexpected things, for which our only strategy on failure is to try to unlink the thing anyway), but it's better to discuss that separately.

> I think 0001 must be just replacing lstat or stat with get_dirent_type()
> with elevel ERROR if lstat or stat failure is treated as ERROR
> previously, otherwise with elevel LOG. I modified it as attached. Once
> this patch gets committed, then we can continue the discussion as to
> whether we make elog-LOG to elog-ERROR or vice-versa.

Agreed, will look at this version tomorrow.

> I'm tempted to use get_dirent_type() in RemoveXlogFile() but that
> requires us to pass struct dirent * from RemoveOldXlogFiles() (as
> attached 0002 for now), If done, this avoids one lstat() system call
> per WAL file. IMO, that's okay to pass an extra function parameter to
> RemoveXlogFile() as it is a static function and there can be many
> (thousands of) WAL files at times, so saving system calls (lstat() *
> number of WAL files) is definitely an advantage.

-lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
+get_dirent_type(path, xlde, false, LOG) != PGFILETYPE_REG &&

I think you wanted ==, but maybe it'd be nicer to pass in a "recyclable" boolean to RemoveXlogFile()?

If you're looking for more, rmtree() looks like another.
Small typo in OAT README
Reading over the new object access hook test I spotted a small typo in the documentation. Will apply a fix shortly.

-A real-world OAT hook should certainly provide more fine-grained conrol than
+A real-world OAT hook should certainly provide more fine-grained control than

--
Daniel Gustafsson
https://vmware.com/

OAT_hooks_README_typo.diff
Description: Binary data
Re: pg_get_constraintdef: Schema qualify foreign tables unless pretty printing is enabled
On 2022-Aug-09, Lukas Fittl wrote:
> But I wonder, why do we have an explicit pretty printing flag on these
> functions, and PRETTYFLAG_SCHEMA in the code to represent this behavior.
> If we don't want pretty printing to affect schema qualification, why
> does that flag exist?

Because of CVE-2018-1058. See commit 815172ba8068. I imagine that that commit only touched the minimum necessary to solve the immediate security problem, but that further work is needed to make PRETTYFLAG_SCHEMA become a fully functional gadget; that would require that the whole of ruleutils.c (and everything downstream from it) behaves sanely. In other words, I think your patch is too small.

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock
On Wed, Aug 10, 2022 at 10:58 AM Andres Freund wrote:
> Hi,
>
> On 2022-08-09 20:21:19 -0700, Mark Dilger wrote:
>> > On Aug 9, 2022, at 7:26 PM, Andres Freund wrote:
>> >
>> > The relevant code triggering it:
>> >
>> >     newbuf = XLogInitBufferForRedo(record, 1);
>> >     _hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
>> >                   xlrec->new_bucket_flag, true);
>> >     if (!IsBufferCleanupOK(newbuf))
>> >         elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock");
>> >
>> > Why do we just crash if we don't already have a cleanup lock? That can't be
>> > right. Or is there supposed to be a guarantee this can't happen?
>>
>> Perhaps the code assumes that when the xl_hash_split_allocate_page record was
>> written, the new_bucket field referred to an unused page, and so during
>> replay it should also refer to an unused page, and being unused, that nobody
>> will have it pinned. But at least in heap we sometimes pin unused pages
>> just long enough to examine them and to see that they are unused. Maybe
>> something like that is happening here?
>
> I don't think it's a safe assumption that nobody would hold a pin on such a
> page during recovery. While not the case here, somebody else could have used
> pg_prewarm to read it in.
>
> But also, the checkpointer or bgwriter could have it temporarily pinned, to
> write it out, or another backend could try to write it out as a victim buffer
> and have it temporarily pinned.
>
> static int
> SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
> {
> ...
>     /*
>      * Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
>      * buffer is clean by the time we've locked it.)
>      */
>     PinBuffer_Locked(bufHdr);
>     LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
>
> As you can see we acquire a pin without holding a lock on the page (and that
> can't be changed!).

I think this could be the probable reason for the failure, though I didn't try to debug/reproduce it yet. AFAIU, this is possible during recovery/replay of the WAL record XLOG_HASH_SPLIT_ALLOCATE_PAGE, as via XLogReadBufferForRedoExtended we can mark the buffer dirty while restoring it from a full-page image. OTOH, because during normal operation we didn't mark the page dirty, SyncOneBuffer would have skipped it due to the check (if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))).

> I assume this is trying to defend against some sort of deadlock by not
> actually getting a cleanup lock (by passing get_cleanup_lock = true to
> XLogReadBufferForRedoExtended()).

IIRC, this is just following what we do during normal operation, based on the theory that the meta-page is not updated yet so no backend will access it. I think we can do what you wrote unless there is some other reason behind this failure.

--
With Regards,
Amit Kapila.
Re: Perform streaming logical transactions by background workers and parallel apply
Here are some review comments for the patch v20-0001:

==

1. doc/src/sgml/catalogs.sgml

+          p = apply changes directly using a background
+          worker, if available, otherwise, it behaves the same as 't'

The different char values 'f','t','p' are separated by comma (,) in the list, which is normal for the pgdocs AFAIK. However, because of this I don't think it is a good idea to use those other commas within the description for 'p'. I suggest you remove them to avoid ambiguity with the separators.

==

2. doc/src/sgml/protocol.sgml

@@ -3096,7 +3096,7 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
        Protocol version. Currently versions 1, 2,
-       and 3 are supported.
+       3 and 4 are supported.

Put a comma after the penultimate value like it had before.

==

3. src/backend/replication/logical/applybgworker.c

There are multiple function comments and other code comments in this file that are missing a terminating period (.)

==

4. src/backend/replication/logical/applybgworker.c - apply_bgworker_start

+/*
+ * Try to get a free apply background worker.
+ *
+ * If there is at least one worker in the free list, then take one. Otherwise,
+ * try to start a new apply background worker. If successful, cache it in
+ * ApplyBgworkersHash keyed by the specified xid.
+ */
+ApplyBgworkerState *
+apply_bgworker_start(TransactionId xid)

SUGGESTION (for function comment)
Return the apply background worker that will be used for the specified xid. If an apply background worker is found in the free list then re-use it, otherwise start a fresh one. Cache the worker in ApplyBgworkersHash keyed by the specified xid.

~~~

5.

+    /* Try to get a free apply background worker */
+    if (list_length(ApplyBgworkersFreeList) > 0)

AFAIK a non-empty list is guaranteed to be not NIL, and an empty list is guaranteed to be NIL. So if you want, the above can simply be written as:

if (ApplyBgworkersFreeList)

~~~

6. src/backend/replication/logical/applybgworker.c - apply_bgworker_find

+/*
+ * Try to look up worker assigned before (see function apply_bgworker_get_free)
+ * inside ApplyBgworkersHash for requested xid.
+ */
+ApplyBgworkerState *
+apply_bgworker_find(TransactionId xid)

SUGGESTION (for function comment)
Find the worker previously assigned/cached for this xid. (see function apply_bgworker_start)

~~~

7.

+        Assert(status == APPLY_BGWORKER_BUSY);
+
+        return entry->wstate;
+    }
+    else
+        return NULL;

IMO here it is better to just remove that 'else' and unconditionally return NULL at the end of this function.

~~~

8. src/backend/replication/logical/applybgworker.c - apply_bgworker_subxact_info_add

+ * Inside apply background worker we can figure out that new subtransaction was
+ * started if new change arrived with different xid. In that case we can define
+ * named savepoint, so that we were able to commit/rollback it separately
+ * later.
+ * Special case is if the first change comes from subtransaction, then
+ * we check that current_xid differs from stream_xid.
+ */
+void
+apply_bgworker_subxact_info_add(TransactionId current_xid)

It is not quite English. Can you improve it a bit?

SUGGESTION (maybe like this?)
The apply background worker can figure out if a new subtransaction was started by checking if the new change arrived with a different xid. In that case define a named savepoint, so that we are able to commit/rollback it separately later. A special case is when the first change comes from a subtransaction – this is determined by checking if the current_xid differs from stream_xid.

==

9. src/backend/replication/logical/launcher.c - WaitForReplicationWorkerAttach

+ *
+ * Return false if the attach fails. Otherwise return true.
  */
-static void
+static bool
 WaitForReplicationWorkerAttach(LogicalRepWorker *worker,

Why not just say "Return whether the attach was successful."

~~~

10. src/backend/replication/logical/launcher.c - logicalrep_worker_stop

+    /* Found the main worker, then try to stop it. */
+    if (worker)
+        logicalrep_worker_stop_internal(worker);

IMO the comment is kind of pointless because it only says what the code is clearly doing. If you really wanted to reinforce that this worker is a main apply worker then you can do that with code like:

if (worker)
{
    Assert(!worker->subworker);
    logicalrep_worker_stop_internal(worker);
}

~~~

11. src/backend/replication/logical/launcher.c - logicalrep_worker_detach

@@ -599,6 +632,29 @@ logicalrep_worker_attach(int slot)
 static void
 logicalrep_worker_detach(void)
 {
+    /*
+     * This is the main apply worker, stop all the apply background workers we
+     * started before.
+     */
+    if (!MyLogicalRepWorker->subworker)

SUGGESTION (for comment)
This is the main apply worker. Stop all apply background workers previously started from here.

~~~

12. src/backend/replication/logical/launcher.c - logicalrep_apply_bgworker_count

+/*
+ * Count the
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On Wed, Aug 10, 2022 at 2:11 PM Thomas Munro wrote:
> On Wed, Aug 10, 2022 at 7:15 PM Bharath Rupireddy wrote:
>> The main idea of using get_dirent_type() instead of lstat or stat is
>> to benefit from avoiding system calls on platforms where the directory
>> entry type is stored in dirent structure, but not to cause errors for
>> lstat or stat system calls failures where we previously didn't.
>
> Yeah, I get it... I'm OK with separating mechanical changes like
> lstat() -> get_dirent_type() from policy changes on error handling. I
> initially thought the ERROR was surely better than what we're doing
> now (making extra system calls to avoid unlinking unexpected things,
> for which our only strategy on failure is to try to unlink the thing
> anyway), but it's better to discuss that separately.
>
>> I think 0001 must be just replacing lstat or stat with get_dirent_type()
>> with elevel ERROR if lstat or stat failure is treated as ERROR
>> previously, otherwise with elevel LOG. I modified it as attached. Once
>> this patch gets committed, then we can continue the discussion as to
>> whether we make elog-LOG to elog-ERROR or vice-versa.
>
> Agreed, will look at this version tomorrow.

Thanks.

>> I'm tempted to use get_dirent_type() in RemoveXlogFile() but that
>> requires us to pass struct dirent * from RemoveOldXlogFiles() (as
>> attached 0002 for now), If done, this avoids one lstat() system call
>> per WAL file. IMO, that's okay to pass an extra function parameter to
>> RemoveXlogFile() as it is a static function and there can be many
>> (thousands of) WAL files at times, so saving system calls (lstat() *
>> number of WAL files) is definitely an advantage.
>
> -lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
> +get_dirent_type(path, xlde, false, LOG) != PGFILETYPE_REG &&
>
> I think you wanted ==, but maybe it'd be nicer to pass in a
> "recyclable" boolean to RemoveXlogFile()?

Ah, my bad, I corrected it now. If you mean to do "bool recyclable = (get_dirent_type(path, xlde, false, LOG) == PGFILETYPE_REG)" and pass it as a parameter to RemoveXlogFile(), then we would have to call get_dirent_type() for every WAL file even when any of (wal_recycle && *endlogSegNo <= recycleSegNo && XLogCtl->InstallXLogFileSegmentActive) is false, which is unnecessary; it's better to pass the dirent structure as a parameter.

> If you're looking for more, rmtree() looks like another.

Are you suggesting not to use pgfnames() in rmtree() and to do opendir()-readdir()-get_dirent_type() directly? If yes, I don't think it's a great idea, because rmtree() has recursive calls, and holding opendir()-readdir() open across all rmtree() recursive calls without closedir() may eat up dynamic memory if there are deeper directory levels. I would like to leave it that way. Do you have any other ideas in mind?

Please find the v15 patch; I merged the RemoveXlogFile() changes into 0001.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

v15-0001-Make-more-use-of-get_dirent_type-in-place-of-sta.patch
Description: Binary data
Re: NAMEDATALEN increase because of non-latin languages
I wrote:
> The syscache use of GETSTRUCT still uses a simple cast of the tuple (for
> pg_cast those calls live in parse_coerce.c, which is unchanged from master in
> v3). Next step I think is to see about the syscache piece -- teaching a
> syscache miss to deform the entire tuple into a struct and store it in the
> syscache.

v4 is a hackish attempt at getting the syscache to deform the tuple and store the struct. Using only pg_cast again for ease, it seems to work for that. It's now about as far as I can get without thinking about byref types.

0001 just adds copies of some syscache / catcache functions for private use by pg_cast, as scaffolding. 0002 teaches the temporary CatalogCacheCreateEntry_STRUCT() to call heap_deform_tuple(). This then calls a for-now handwritten function (similar to the one generated in v3) which pallocs space for the struct and copies the fields over. Only this function knows the catalog struct type, so a future design could call the per-cache function via a pointer in the catcache control array. Since we already have the isnull/values arrays, it's also trivial to copy the datums for the cache keys.

WIP: CatCTup->tuple is still declared a tuple, but the t_data contents are now bogus, so there is a temporary GETSTRUCT_NEW that knows it's looking directly at the struct.

Getting to varlen attributes: For one, I think it was mentioned above that we will need a way to perform a deep copy of structs that contain pointers to varlen fields. I imagine keeping track of the attributes' lengths will come up for some types and might be tricky. And before that, the catalog machinery will need some preprocessor tricks to declare pointers in the structs.
--
John Naylor
EDB: http://www.enterprisedb.com

From 20fba44412e8ef1bb4cd5b051b9d7e82618a6d93 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Wed, 10 Aug 2022 17:19:24 +0700
Subject: [PATCH v4 2/2] Teach catcache to store structs for pg_cast

---
 src/backend/parser/parse_coerce.c  |  6 +--
 src/backend/utils/cache/catcache.c | 73 --
 src/include/access/htup_details.h  |  2 +
 3 files changed, 55 insertions(+), 26 deletions(-)

diff --git a/src/backend/parser/parse_coerce.c b/src/backend/parser/parse_coerce.c
index 39b7e5707b..07a1b047e3 100644
--- a/src/backend/parser/parse_coerce.c
+++ b/src/backend/parser/parse_coerce.c
@@ -3056,7 +3056,7 @@ IsBinaryCoercible(Oid srctype, Oid targettype)
 							ObjectIdGetDatum(targettype));
 	if (!HeapTupleIsValid(tuple))
 		return false;			/* no cast */
-	castForm = (Form_pg_cast) GETSTRUCT(tuple);
+	castForm = GETSTRUCT_NEW(pg_cast, tuple);
 
 	result = (castForm->castmethod == COERCION_METHOD_BINARY &&
 			  castForm->castcontext == COERCION_CODE_IMPLICIT);
@@ -3120,7 +3120,7 @@ find_coercion_pathway(Oid targetTypeId, Oid sourceTypeId,
 
 	if (HeapTupleIsValid(tuple))
 	{
-		Form_pg_cast castForm = (Form_pg_cast) GETSTRUCT(tuple);
+		Form_pg_cast castForm = GETSTRUCT_NEW(pg_cast, tuple);
 		CoercionContext castcontext;
 
 		/* convert char value for castcontext to CoercionContext enum */
@@ -3287,7 +3287,7 @@ find_typmod_coercion_function(Oid typeId,
 
 	if (HeapTupleIsValid(tuple))
 	{
-		Form_pg_cast castForm = (Form_pg_cast) GETSTRUCT(tuple);
+		Form_pg_cast castForm = GETSTRUCT_NEW(pg_cast, tuple);
 
 		*funcid = castForm->castfunc;
 		ReleaseSysCache(tuple);
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index b1287bb6a0..8ddc109052 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -21,6 +21,7 @@
 #include "access/table.h"
 #include "access/valid.h"
 #include "access/xact.h"
+#include "catalog/pg_cast.h" // fixme
 #include "catalog/pg_collation.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
@@ -2158,6 +2159,42 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 	return ct;
 }
 
+// WIP: generated functions would look like this and be called through a pointer
+// FIXME: ct->tuple is no longer a real tuple
+// XXX: for now assume the caller has switched to the right memory context
+static CatCTup *
+CatCArrayGetStruct_pg_cast(HeapTuple pg_cast_tuple, CatCTup *ct, Datum *values, bool *isnull)
+{
+	Form_pg_cast pg_cast_struct;
+
+	/* Allocate memory for CatCTup and the cached struct in one go */
+	ct = (CatCTup *) palloc(sizeof(CatCTup) +
+							MAXIMUM_ALIGNOF + sizeof(FormData_pg_cast));
+
+	/* copy the identification info */
+	// WIP: for caches we only need t_self, can we just have that as a
+	// separate field in CatCTup?
+	ct->tuple.t_len = pg_cast_tuple->t_len;
+	ct->tuple.t_self = pg_cast_tuple->t_self;
+	ct->tuple.t_tableOid = pg_cast_tuple->t_tableOid;
+
+	// WIP: treat t_data as a pointer to the struct
+	ct->tuple.t_data = (HeapTupleHeader)
+		MAXALIGN(((char *) ct) + sizeof(CatCTup));
+	pg_cast_struct = (Form_pg_cast) ct->tuple.t_data;
+
+	/* copy tuple contents */
+	// WIP: we can just assign because there are no varlen attribute
Re: Typo in misc_sanity.sql?
On 25.07.22 14:04, Japin Li wrote: Yeah, they do not have unique keys, however, here we check primary keys. So, IMO, the description of the exceptions should say they do not have primary keys, rather than that they do not have unique keys. The context of that check is that for each system catalog we pick one of the available unique keys and designate it as the one primary key. If a system catalog doesn't have a unique key to choose from, then we can't do that, hence the comment. Changing the comment as suggested would essentially be saying that this catalog has no primary key because it has no primary key, which wouldn't be helpful.
RE: logical replication restrictions
On Tuesday, August 9, 2022 6:47 AM Euler Taveira wrote: > I attached a v6. Hi, thank you for posting the updated patch. Minor review comments for v6. (1) commit message "If the subscriber sets min_apply_delay parameter, ..." I suggest we use subscription rather than subscriber, because this parameter refers to and is used for one subscription. My suggestion is "If one subscription sets min_apply_delay parameter, ..." If you agree, there are other places where this change applies. (2) commit message It might be better to write a note for the committer like "Bump catalog version" at the bottom of the commit message. (3) unit alignment between recovery_min_apply_delay and min_apply_delay The former interprets a unit-less input number as milliseconds, while the latter takes it as seconds. I feel it would be better to make them consistent. (4) catalogs.sgml + Delay the application of changes by a specified amount of time. The + unit is in milliseconds. As a column explanation, it'd be better to use a noun in the first sentence to make this description aligned with other places. My suggestion is "Application delay of changes by ". (5) pg_subscription.c There is a missing blank line before the if statement. One is present in AlterSubscription for the other cases. @@ -1100,6 +1130,12 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt, replaces[Anum_pg_subscription_subdisableonerr - 1] = true; } + if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)) (6) tab-complete.c The order of tab-complete parameters listed in the COMPLETE_WITH should follow alphabetical order. "min_apply_delay" can come before "origin". We can refer to commit d547f7c. (7) 032_apply_delay.pl There are missing spaces after the commas in the mod() calls. UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; DELETE FROM test_tab WHERE mod(a,3) = 0; Best Regards, Takamichi Osumi
Re: Generalize ereport_startup_progress infrastructure
> > > Here's an attempt to generalize the ereport_startup_progress > > > infrastructure. The attached v1 patch places the code in elog.c/.h, > > > renames associated functions and variables, something like > > > ereport_startup_progress to ereport_progress, > > > log_startup_progress_interval to log_progress_report_interval and so > > > on. > > > > I'm not averse to reusing this infrastructure in other places, but I > > doubt we'd want all of those places to be controlled by a single GUC, > > especially because that GUC is also the on/off switch for the feature. > > Thanks Robert! How about we tweak the function a bit - > begin_progress_report_phase(int timeout), so that each process can use > their own timeout interval? In this case, do we want to retain > log_startup_progress_interval as-is specific to the startup process? > If yes, other processes might come up with their own GUCs (if they > don't want to use hard-coded timeouts) similar to > log_startup_progress_interval, which isn't the right way IMO. > > I think the notion of ereport_progress feature being disabled when the > timeout is 0, makes sense to me at least. > > On the flip side, what if we just have a single GUC > log_progress_report_interval (as proposed in the v1 patch)? Do we ever > want different processes to emit progress report messages at different > frequencies? Well, I can think of the startup process during standby > recovery needing to emit recovery progress report messages at a much > lower frequency than the startup process during the crash recovery. > Again, controlling the frequencies with different GUCs isn't the way > forward. But we can do something like: process 1 emits messages with a > frequency of 2*log_progress_report_interval, process 2 with a > frequency 4*log_progress_report_interval and so on without needing > additional GUCs. > > Thoughts? +1 for the idea to generalize the infrastructure. 
Given two options, option-1 is to use a single GUC across all kinds of long running operations and option-2 is to use multiple GUCs (one for each kind of long running operation), I go with option-1 because if a user is interested in seeing a log message every 10s for startup operations (or any other long running operation), it is likely that they want to see other long running operations at the same interval. It does not make sense to use different intervals for each kind of long running operation here. It also increases the number of GUCs, which makes things complex. So it is a good idea to use a single GUC here. But I am worried about the on/off switch as Robert mentioned. How about using a new GUC to indicate which features are on/off, say "log_long_running_operations", which contains a comma-separated string indicating the features to be enabled. For example, "log_long_running_operations = startup, postmaster" will enable logging for startup and postmaster operations and disable logging of other long running operations. With this the number of GUCs will be limited to 2 and it is simple and easy for the user. Thanks & Regards, Nitin Jadhav On Thu, Aug 4, 2022 at 9:58 AM Bharath Rupireddy wrote: > > On Wed, Aug 3, 2022 at 12:11 AM Robert Haas wrote: > > > > On Tue, Aug 2, 2022 at 3:25 AM Bharath Rupireddy > > wrote: > > > ereport_startup_progress infrastructure added by commit 9ce346e [1] > > > will be super-useful for reporting progress of any long-running server > > > operations, not just the startup process operations. For instance, > > > postmaster can use it for reporting progress of temp file and temp > > > relation file removals [2], checkpointer can use it for reporting > > > progress of snapshot or mapping file processing or even WAL file > > > processing and so on. 
And I'm sure there can be many places in the > > > code where we have while or for loops which can, at times, take a long > > > time to finish and having a log message there would definitely help. > > > > > > Here's an attempt to generalize the ereport_startup_progress > > > infrastructure. The attached v1 patch places the code in elog.c/.h, > > > renames associated functions and variables, something like > > > ereport_startup_progress to ereport_progress, > > > log_startup_progress_interval to log_progress_report_interval and so > > > on. > > > > I'm not averse to reusing this infrastructure in other places, but I > > doubt we'd want all of those places to be controlled by a single GUC, > > especially because that GUC is also the on/off switch for the feature. > > Thanks Robert! How about we tweak the function a bit - > begin_progress_report_phase(int timeout), so that each process can use > their own timeout interval? In this case, do we want to retain > log_startup_progress_interval as-is specific to the startup process? > If yes, other processes might come up with their own GUCs (if they > don't want to use hard-coded timeouts) similar to > log_startup_progress_interval, which isn't the right way IMO. > > I think the notion of erepor
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Wed, 10 Aug 2022 at 08:34, Dilip Kumar wrote: > > Am I still missing something? No, you have found a dependency between the patches that I was unaware of. So there is no bug if you apply both patches. Thanks for looking. > So I still think some adjustment is required in XidInMVCCSnapshot() That is one way to resolve the issue, but not the only one. I can also change AssignTransactionId() to recursively register parent xids for all of a subxid's parents. I will add in a test case and resolve the dependency in my next patch. Thanks again. -- Simon Riggs http://www.EnterpriseDB.com/
s390x builds on buildfarm
Hi, Have the builds been paused on s390x? It looks like the s390x builds were last run 15 days ago. If so, I wonder what the reason for the pause is and what is required to resume the builds. The OS the builds were running on seems to have reached end of life. Please let me know if we can help with getting them updated and resuming the builds. Regards, Vivian Kong Linux on IBM Z Open Source Ecosystem IBM Canada Toronto Lab
Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock
On Wed, Aug 10, 2022 at 12:39 AM Thomas Munro wrote: > Here's an email about that: > > https://www.postgresql.org/message-id/CAE9k0P=OXww6RQCGrmDNa8=L3EeB01SGbYuP23y-qZJ=4td...@mail.gmail.com Hmm. If I'm reading that email correctly, it indicates that I noticed this problem before commit and asked for it to be changed, but then for some reason it wasn't changed and I still committed it. I can't immediately think of a reason why it wouldn't be safe to insist on acquiring a cleanup lock there. -- Robert Haas EDB: http://www.enterprisedb.com
Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size
On Wed, Aug 10, 2022 at 3:57 AM Andres Freund wrote: > One way this code could be drastically simplified is to force all > type-coercions to go through the "io coercion" path, which could be > implemented as a single execution step (which thus could trivially > start/finish a subtransaction) and would remove a lot of the complicated code > around coercions. Could you please clarify how you think we might do the io coercion wrapped with a subtransaction all as a single execution step? I would've thought that we couldn't do the sub-transaction without leaving ExecInterpExpr() anyway, so maybe you meant the io coercion itself was done using some code outside ExecInterpExpr()? The current JsonExpr code does it by recursively calling ExecInterpExpr() using the nested ExprState expressly for the coercion. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tuesday, August 9, 2022 12:59 AM Önder Kalacı wrote: > Attaching v5 of the patch which reflects the review on this email, also few > minor test improvements. Hi, Thank you for the updated patch. FYI, I noticed that v5 causes a cfbot failure in [1]. Could you please fix it in the next version? [19:44:38.420] execReplication.c: In function ‘RelationFindReplTupleByIndex’: [19:44:38.420] execReplication.c:186:24: error: ‘eq’ may be used uninitialized in this function [-Werror=maybe-uninitialized] [19:44:38.420] 186 | if (!indisunique && !tuples_equal(outslot, searchslot, eq)) [19:44:38.420] | ^ [19:44:38.420] cc1: all warnings being treated as errors [1] - https://cirrus-ci.com/task/6544573026533376 Best Regards, Takamichi Osumi
something has gone wrong, but what is it?
Hi, Today while hacking I encountered this delight: 2022-08-10 09:30:29.025 EDT [27126] FATAL: something has gone wrong I actually already knew that something had gone wrong, because the code I was writing was incomplete. And if I hadn't known that, the word FATAL would have been a real good clue. What I was hoping was that the error message might tell me WHAT had gone wrong, but it didn't. This seems to be the fault of Andres's commit 5aa4a9d2077fa902b4041245805082fec6be0648. In his defense, the addition of any kind of elog() at that point in the code appears to be an improvement over the previous state of affairs. Nonetheless I feel we could do better still, as in the attached. -- Robert Haas EDB: http://www.enterprisedb.com v1-0001-Be-more-specific-about-exactly-what-has-gone-wron.patch Description: Binary data
Re: pg_get_constraintdef: Schema qualify foreign tables unless pretty printing is enabled
Alvaro Herrera writes: > On 2022-Aug-09, Lukas Fittl wrote: >> But I wonder, why do we have an explicit pretty printing flag on these >> functions, and PRETTYFLAG_SCHEMA in the code to represent this behavior. >> If we don't want pretty printing to affect schema qualification, why >> does that flag exist? > Because of CVE-2018-1058. See commit 815172ba8068. > I imagine that that commit only touched the minimum necessary to solve > the immediate security problem, but that further work is needed to make > PRETTYFLAG_SCHEMA become a fully functional gadget; but that would > require that the whole of ruleutils.c (and everything downstream from > it) behaves sanely. In other words, I think your patch is too small. What I'm inclined to do, rather than repeat the same finicky & undocumented coding pattern in one more place, is write a convenience function for it that can be named and documented to reflect the coding rule about which call sites should use it (rather than calling plain generate_relation_name). However, the first requirement for that is to have a clearly defined rule. I think the intent of 815172ba8068 was to convert all uses that would determine the object-creation schema in commands issued by pg_dump. Do we want to widen that, and if so by how much? I'd be on board I think with adjusting other ruleutils.c functions that could plausibly be used for building creation commands, but happen not to be called by pg_dump. I'm not on board with converting every single generate_relation_name call --- mainly because it'd be pointless unless you also qualify every single function name, operator name, etc; and that would be unreadable. regards, tom lane
Re: something has gone wrong, but what is it?
> On 10 Aug 2022, at 15:41, Robert Haas wrote: > I feel we could do better still, as in the attached. +1, LGTM. -- Daniel Gustafsson https://vmware.com/
Re: something has gone wrong, but what is it?
Robert Haas writes: - elog(ERROR, "something has gone wrong"); + elog(ERROR, "unrecognized AuxProcType: %d", (int) auxtype); +1 ... the existing message is clearly not up to project standard. regards, tom lane
Re: s390x builds on buildfarm
On 2022-08-10 We 09:04, Vivian Kong wrote: > > Hi, > > > > Are builds being paused on s390x as it looks like the s390x builds > were last run 15 days ago. If so, wondering what is the reason for > the pause and what is required to resume the builds? > The OS the builds were running on seems to have reached end of life. > Please let me know if we can help with getting them updated and resume > the builds. > > > > Mark, I think you run most or all of these. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
designated initializers
(Coming from https://postgr.es/m/20220809193616.5uucf33piwdxn452@alvherre.pgsql ) On 2022-Aug-09, Alvaro Herrera wrote: > On 2022-Aug-09, Andres Freund wrote: > > > Mildly wondering whether we ought to use designated initializers instead, > > given we're whacking it around already. Too easy to get the order wrong when > > adding new members, and we might want to have optional callbacks too. > > Strong +1. It makes code much easier to navigate (see XmlTableRoutine > and compare with heapam_methods, for example). For example, I propose the attached. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "He who lives for the future is a dreamer, and he who lives for the past, a fool" (Luis Adler, "Los tripulantes de la noche") From 66f5981751cdaeb27a0614ca2b0643632f7cc128 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 10 Aug 2022 15:57:09 +0200 Subject: [PATCH] use designated initializers --- .../libpqwalreceiver/libpqwalreceiver.c | 30 +-- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index da9c359af1..2865024524 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -82,21 +82,21 @@ static WalRcvExecResult *libpqrcv_exec(WalReceiverConn *conn, static void libpqrcv_disconnect(WalReceiverConn *conn); static WalReceiverFunctionsType PQWalReceiverFunctions = { - libpqrcv_connect, - libpqrcv_check_conninfo, - libpqrcv_get_conninfo, - libpqrcv_get_senderinfo, - libpqrcv_identify_system, - libpqrcv_server_version, - libpqrcv_readtimelinehistoryfile, - libpqrcv_startstreaming, - libpqrcv_endstreaming, - libpqrcv_receive, - libpqrcv_send, - libpqrcv_create_slot, - libpqrcv_get_backend_pid, - libpqrcv_exec, - libpqrcv_disconnect + .walrcv_connect = libpqrcv_connect, + .walrcv_check_conninfo = libpqrcv_check_conninfo, + 
.walrcv_get_conninfo = libpqrcv_get_conninfo, + .walrcv_get_senderinfo = libpqrcv_get_senderinfo, + .walrcv_identify_system = libpqrcv_identify_system, + .walrcv_server_version = libpqrcv_server_version, + .walrcv_readtimelinehistoryfile = libpqrcv_readtimelinehistoryfile, + .walrcv_startstreaming = libpqrcv_startstreaming, + .walrcv_endstreaming = libpqrcv_endstreaming, + .walrcv_receive = libpqrcv_receive, + .walrcv_send = libpqrcv_send, + .walrcv_create_slot = libpqrcv_create_slot, + .walrcv_get_backend_pid = libpqrcv_get_backend_pid, + .walrcv_exec = libpqrcv_exec, + .walrcv_disconnect = libpqrcv_disconnect }; /* Prototypes for private functions */ -- 2.30.2
Re: moving basebackup code to its own directory
On Tue, Aug 9, 2022 at 3:28 PM Robert Haas wrote: > On Tue, Aug 9, 2022 at 2:40 PM Justin Pryzby wrote: > > It looks like this updates the header comments in the .h files but not the > > .c > > files. > > > > Personally, I find these to be silly boilerplate .. > > Here is a version with some updates to the silly boilerplate. If there are no further comments on this I will go ahead and commit it. David Steele voted for back-patching this on the grounds that it would make future back-patching easier, which is an argument that seems to me to have some merit, although on the other hand, we are already into August so it's quite late in the day. Anyone else want to vote? -- Robert Haas EDB: http://www.enterprisedb.com
Re: moving basebackup code to its own directory
On Wed, Aug 10, 2022 at 10:08:02AM -0400, Robert Haas wrote: > On Tue, Aug 9, 2022 at 3:28 PM Robert Haas wrote: > > On Tue, Aug 9, 2022 at 2:40 PM Justin Pryzby wrote: > > > It looks like this updates the header comments in the .h files but not > > > the .c > > > files. > > > > > > Personally, I find these to be silly boilerplate .. > > > > Here is a version with some updates to the silly boilerplate. > > If there are no further comments on this I will go ahead and commit it. > > David Steele voted for back-patching this on the grounds that it would > make future back-patching easier, which is an argument that seems to > me to have some merit, although on the other hand, we are already into > August so it's quite late in the day. Anyone else want to vote? No objection to backpatching to v15, but if you don't, git ought to handle renamed files just fine. These look like similar precedent for "late" renaming+backpatching: 41dae3553, 47ca48364 -- Justin
Re: something has gone wrong, but what is it?
On Wed, Aug 10, 2022 at 9:53 AM Tom Lane wrote: > Robert Haas writes: > > - elog(ERROR, "something has gone wrong"); > + elog(ERROR, "unrecognized AuxProcType: %d", (int) > auxtype); > > +1 ... the existing message is clearly not up to project standard. After a bit of further looking around I noticed that there's another check for an invalid auxtype in this function which uses a slightly different message text and also PANIC rather than ERROR. I think we should adopt that here too, for consistency, as in the attached. The distinction between PANIC and ERROR doesn't really seem to matter here. Either way, the server goes into an infinite crash-and-restart loop. May as well be consistent. -- Robert Haas EDB: http://www.enterprisedb.com v2-0001-Be-more-specific-about-exactly-what-has-gone-wron.patch Description: Binary data
SPI versus read/write expanded datums
In the report at [1] we learned that the SQL-language function handler is too cavalier about read/write expanded datums that it receives as input. A function that receives such a datum is entitled to scribble on its value, or even delete it. If the function turns around and passes the datum on to some other function, the same applies there. So in general, it can only be safe to pass such a datum to *one* subsidiary function. If you want to use the value more than once, you'd better convert the pointer to read-only. fmgr_sql wasn't doing that, leading to the reported bug. After fixing that, I wondered if we had the same problem anywhere else, and it didn't take long to think of such a place: SPI. If you pass a read/write datum to SPI_execute_plan or one of its siblings, and the executed query references that datum more than once, you're potentially in trouble. Even if it does only reference it once, you might be surprised that your copy of the datum got modified. However, we can't install a 100% fix in SPI itself, because plpgsql intentionally exploits exactly this behavior to optimize things like "arr := array_append(arr, val)". I considered the idea of adding a 90% fix by making _SPI_convert_params() convert R/W pointers to R/O. That would protect places using the old-style "char *Nulls" APIs, and then we'd deem it the responsibility of callers using ParamListInfo APIs to protect themselves. I can't get terribly excited about that though, because it'd be adding complexity and cycles for a problem that seems entirely theoretical at this point. I can't find any SPI callers that would *actually* be passing a R/W datum to a query that'd be likely to modify it. The non-plpgsql PLs are at the most risk of calling a hazardous query, but they all pass "flat" datums that are the immediate result of a typinput function or the like. So my inclination is to do nothing about this now, and maybe nothing ever. 
But I thought it'd be a good idea to memorialize this issue for the archives. regards, tom lane [1] https://www.postgresql.org/message-id/flat/WScDU5qfoZ7PB2gXwNqwGGgDPmWzz08VdydcPFLhOwUKZcdWbblbo-0Lku-qhuEiZoXJ82jpiQU4hOjOcrevYEDeoAvz6nR0IU4IHhXnaCA%3D%40mackler.email
Re: something has gone wrong, but what is it?
Hi, On 2022-08-10 10:49:59 -0400, Robert Haas wrote: > On Wed, Aug 10, 2022 at 9:53 AM Tom Lane wrote: > > Robert Haas writes: > > > > - elog(ERROR, "something has gone wrong"); > > + elog(ERROR, "unrecognized AuxProcType: %d", (int) > > auxtype); > > > > +1 ... the existing message is clearly not up to project standard. > > After a bit of further looking around I noticed that there's another > check for an invalid auxtype in this function which uses a slightly > different message text and also PANIC rather than ERROR. > > I think we should adopt that here too, for consistency, as in the attached. > > The distinction between PANIC and ERROR doesn't really seem to matter > here. Either way, the server goes into an infinite crash-and-restart > loop. May as well be consistent. Makes sense. Greetings, Andres Freund
Re: Generalize ereport_startup_progress infrastructure
On Tue, Aug 9, 2022 at 11:54 AM Bharath Rupireddy wrote: > I'm attaching 0002 for reporting removal of temp files and temp > relation files by postmaster. > > If this looks okay, I can code 0003 for reporting processing of > snapshot, mapping and old WAL files by checkpointer. I think that someone is going to complain about the changes to timeout.c. Some trouble has been taken to allow things like SetLatch(MyLatch) to be unconditional. Aside from that, I am unsure how generally safe it is to use the timeout infrastructure in the postmaster. From a user-interface point of view, log_postmaster_progress_interval seems a bit awkward. It's really quite narrow, basically just checking for one thing. I'm not sure I like adding a GUC for something that specific, although I also don't have another idea at the moment either. Hmm. Maybe the checkpointer is a better candidate, but somehow I feel that we can't consider this sort of thing separate from the existing progress reporting that checkpointer already does. Perhaps we need to think of changing or improving that in some way rather than adding something wholly new alongside the existing system. -- Robert Haas EDB: http://www.enterprisedb.com
Allow logical replication to copy tables in binary format
Hey hackers, I see that logical replication subscriptions have an option to enable binary [1]. When it's enabled, the subscription asks the publisher to send data in binary format. But this is only the case for the apply phase. During tablesync, tables are still copied as text. Table copy uses the COPY command, and that command supports copying in binary format, so it seems possible to copy in binary for tablesync too. I'm not sure whether there is a reason to always copy tables in text format, but I couldn't see why not to do it in binary if it's enabled. You can find attached a small patch that simply enables binary copy. What do you think about this change? Does it make sense? Am I missing something? [1] https://www.postgresql.org/docs/15/sql-createsubscription.html Best, Melih 0001-Allow-logical-replication-to-copy-table-in-binary.patch Description: Binary data
Re: [PATCH] Expose port->authn_id to extensions and triggers
On Tue, Aug 9, 2022 at 3:39 AM Drouvot, Bertrand wrote: > Agree that it makes sense to work on those patches in this particular > order then. Sounds good. The ClientConnectionInfo patch (previously 0002) is attached, with the SQL function removed. Thanks, --Jacob From a22ff3ba36f5eb93c582a957c7c2caca07ed21c5 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Wed, 23 Mar 2022 15:07:05 -0700 Subject: [PATCH] Allow parallel workers to read authn_id Move authn_id into a new global, MyClientConnectionInfo, which is intended to hold all the client information that needs to be shared between the backend and any parallel workers. MyClientConnectionInfo is serialized and restored using a new parallel key. --- src/backend/access/transam/parallel.c | 19 ++- src/backend/libpq/auth.c | 16 +++--- src/backend/utils/init/miscinit.c | 72 +++ src/include/libpq/libpq-be.h | 39 ++- src/include/miscadmin.h | 4 ++ 5 files changed, 129 insertions(+), 21 deletions(-) diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c index df0cd77558..bc93101ff7 100644 --- a/src/backend/access/transam/parallel.c +++ b/src/backend/access/transam/parallel.c @@ -76,6 +76,7 @@ #define PARALLEL_KEY_REINDEX_STATE UINT64CONST(0x000C) #define PARALLEL_KEY_RELMAPPER_STATE UINT64CONST(0x000D) #define PARALLEL_KEY_UNCOMMITTEDENUMS UINT64CONST(0x000E) +#define PARALLEL_KEY_CLIENTCONNINFO UINT64CONST(0x000F) /* Fixed-size parallel state. 
*/ typedef struct FixedParallelState @@ -212,6 +213,7 @@ InitializeParallelDSM(ParallelContext *pcxt) Size reindexlen = 0; Size relmapperlen = 0; Size uncommittedenumslen = 0; + Size clientconninfolen = 0; Size segsize = 0; int i; FixedParallelState *fps; @@ -272,8 +274,10 @@ InitializeParallelDSM(ParallelContext *pcxt) shm_toc_estimate_chunk(&pcxt->estimator, relmapperlen); uncommittedenumslen = EstimateUncommittedEnumsSpace(); shm_toc_estimate_chunk(&pcxt->estimator, uncommittedenumslen); + clientconninfolen = EstimateClientConnectionInfoSpace(); + shm_toc_estimate_chunk(&pcxt->estimator, clientconninfolen); /* If you add more chunks here, you probably need to add keys. */ - shm_toc_estimate_keys(&pcxt->estimator, 11); + shm_toc_estimate_keys(&pcxt->estimator, 12); /* Estimate space need for error queues. */ StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) == @@ -352,6 +356,7 @@ InitializeParallelDSM(ParallelContext *pcxt) char *session_dsm_handle_space; char *entrypointstate; char *uncommittedenumsspace; + char *clientconninfospace; Size lnamelen; /* Serialize shared libraries we have loaded. */ @@ -422,6 +427,12 @@ InitializeParallelDSM(ParallelContext *pcxt) shm_toc_insert(pcxt->toc, PARALLEL_KEY_UNCOMMITTEDENUMS, uncommittedenumsspace); + /* Serialize our ClientConnectionInfo. */ + clientconninfospace = shm_toc_allocate(pcxt->toc, clientconninfolen); + SerializeClientConnectionInfo(clientconninfolen, clientconninfospace); + shm_toc_insert(pcxt->toc, PARALLEL_KEY_CLIENTCONNINFO, + clientconninfospace); + /* Allocate space for worker information. 
*/ pcxt->worker = palloc0(sizeof(ParallelWorkerInfo) * pcxt->nworkers); @@ -1270,6 +1281,7 @@ ParallelWorkerMain(Datum main_arg) char *reindexspace; char *relmapperspace; char *uncommittedenumsspace; + char *clientconninfospace; StringInfoData msgbuf; char *session_dsm_handle_space; Snapshot tsnapshot; @@ -1479,6 +1491,11 @@ ParallelWorkerMain(Datum main_arg) false); RestoreUncommittedEnums(uncommittedenumsspace); + /* Restore the ClientConnectionInfo. */ + clientconninfospace = shm_toc_lookup(toc, PARALLEL_KEY_CLIENTCONNINFO, + false); + RestoreClientConnectionInfo(clientconninfospace); + /* Attach to the leader's serializable transaction, if SERIALIZABLE. */ AttachSerializableXact(fps->serializable_xact_handle); diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c index 2d9ab7edce..313a6ea701 100644 --- a/src/backend/libpq/auth.c +++ b/src/backend/libpq/auth.c @@ -342,15 +342,15 @@ auth_failed(Port *port, int status, const char *logdetail) * authorization will fail later. * * The provided string will be copied into TopMemoryContext, to match the - * lifetime of the Port, so it is safe to pass a string that is managed by an - * external library. + * lifetime of MyClientConnectionInfo, so it is safe to pass a string that is + * managed by an external library. */ static void set_authn_id(Port *port, const char *id) { Assert(id); - if (port->authn_id) + if (MyClientConnectionInfo.authn_id) { /* * An existing authn_id should never be overwritten; that means two @@ -361,17 +361,18 @@ set_authn_id(Port *port, const char *id) ereport(FATAL, (errmsg("authentication identifier set
fix stale help message
When parsing command-line options, the -f option supports disabling 8 scan and join methods; o, b, and t disable index-only scans, bitmap index scans, and TID scans respectively, but they are missing from the help message, so add them. diff --git a/src/backend/main/main.c b/src/backend/main/main.c index 5a964a0db6..f5da4260a1 100644 --- a/src/backend/main/main.c +++ b/src/backend/main/main.c @@ -351,7 +351,7 @@ help(const char *progname) printf(_(" -?, --help show this help, then exit\n")); printf(_("\nDeveloper options:\n")); - printf(_(" -f s|i|n|m|h forbid use of some plan types\n")); + printf(_(" -f s|i|o|b|t|n|m|h forbid use of some plan types\n")); printf(_(" -n do not reinitialize shared memory after abnormal exit\n")); printf(_(" -O allow system table structure changes\n")); printf(_(" -P disable system indexes\n")); -- Regards Junwang Zhao 0001-fix-stale-help-message.patch Description: Binary data
Re: SQL/JSON features for v15
On 2022-08-09 Tu 16:58, Jonathan S. Katz wrote: > Hi, > > (Personal hat, not RMT hat unless otherwise noted). > > This thread[1] raised some concerns around the implementation of the > SQL/JSON features that are slated for v15, which includes an > outstanding open item[2]. Given the current state of the discussion, > when the RMT met on Aug 8, they discussed several options, readable here[3]. > Given we are now into the later part of the release cycle, we need to > make some decisions on how to proceed with this feature given the > concerns raised. > > Per additional discussion on the thread, the group wanted to provide > more visibility into the discussion to get opinions on how to proceed > for the v15 release. > > Without rehashing the thread, the options presented were: > > 1. Fixing the concerns addressed in the thread around the v15 SQL/JSON > features implementation, noting that this would likely entail at least > one more beta release and would push the GA date past our normal > timeframe. > > 2. Try to commit a subset of the features that caused less debate. > This was ruled out. > > 3. Revert the v15 SQL/JSON features work. > > > Based on the current release timing and the open issues presented on > the thread, the RMT had recommended reverting, but preferred to > drive consensus on next steps. > > > From a release advocacy standpoint, I need about 6 weeks lead time to > put together the GA launch. We're at the point where I typically > deliver a draft release announcement. From this, given this involves a > high visibility feature, I would want some clarity on what option we > would like to pursue. Once the announcement translation process has > begun (and this is when we have consensus on the release > announcement), it becomes more challenging to change things out. > > From a personal standpoint (restating from[3]), I would like to see > what we could do to include support for this batch of the SQL/JSON > features in v15. 
What is included looks like it closes most of the gap > on what we've been missing syntactically since the standard was > adopted, and the JSON_TABLE work is very convenient for converting > JSON data into a relational format. I believe having this feature set > is important for maintaining standards compliance, interoperability, > tooling support, and general usability. Plus, JSON still seems to be > pretty popular. > > We're looking for additional input on what makes sense as a best > course of action, given what is presented in[3]. > > Thanks, > > Jonathan > > [1] > https://www.postgresql.org/message-id/flat/20220616233130.rparivafipt6doj3%40alap3.anarazel.de > [2] https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items > [3] > https://www.postgresql.org/message-id/787cef45-15de-8f1d-ed58-a1c1435bfc0e%40postgresql.org To preserve options I will start preparing reversion patches. Given there are I think more than 20 commits all told that could be fun, and will probably take me a little while. The sad part is that to the best of my knowledge this code is producing correct results, and not disturbing the stability or performance of anything else. There was a performance issue but it's been dealt with AIUI. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
use argument instead of global variable
The caller of `get_stats_option_name` passes optarg as the argument; it's saner to use the argument instead of the global variable set by getopt, and safer since the argument has a *const* qualifier. diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 11e802eba9..68552b8779 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3598,9 +3598,9 @@ get_stats_option_name(const char *arg) switch (arg[0]) { case 'p': - if (optarg[1] == 'a') /* "parser" */ + if (arg[1] == 'a') /* "parser" */ return "log_parser_stats"; - else if (optarg[1] == 'l') /* "planner" */ + else if (arg[1] == 'l') /* "planner" */ return "log_planner_stats"; break; -- Regards Junwang Zhao
Re: moving basebackup code to its own directory
Robert Haas writes: > David Steele voted for back-patching this on the grounds that it would > make future back-patching easier, which is an argument that seems to > me to have some merit, although on the other hand, we are already into > August so it's quite late in the day. Anyone else want to vote? Seems like low-risk refactoring, so +1 for keeping v15 close to HEAD. regards, tom lane
Re: moving basebackup code to its own directory
On Wed, Aug 10, 2022 at 6:20 PM Tom Lane wrote: > > Robert Haas writes: > > David Steele voted for back-patching this on the grounds that it would > > make future back-patching easier, which is an argument that seems to > > me to have some merit, although on the other hand, we are already into > > August so it's quite late in the day. Anyone else want to vote? > > Seems like low-risk refactoring, so +1 for keeping v15 close to HEAD. +1, but I suggest also getting a hat-tip from the RMT on it. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: moving basebackup code to its own directory
On 8/10/22 12:32 PM, Magnus Hagander wrote: On Wed, Aug 10, 2022 at 6:20 PM Tom Lane wrote: Robert Haas writes: David Steele voted for back-patching this on the grounds that it would make future back-patching easier, which is an argument that seems to me to have some merit, although on the other hand, we are already into August so it's quite late in the day. Anyone else want to vote? Seems like low-risk refactoring, so +1 for keeping v15 close to HEAD. +1, but I suggest also getting a hat-tip from the RMT on it. With RMT hat on, given a few folks who maintain backup utilities seem to be in favor of backpatching to v15 and they are the ones to be most affected by this, it seems to me that this is an acceptable, noncontroversial course of action. Jonathan OpenPGP_signature Description: OpenPGP digital signature
Re: designated initializers
Hi, On 2022-08-10 16:03:00 +0200, Alvaro Herrera wrote: > (Coming from > https://postgr.es/m/20220809193616.5uucf33piwdxn452@alvherre.pgsql ) > > On 2022-Aug-09, Alvaro Herrera wrote: > > > On 2022-Aug-09, Andres Freund wrote: > > > > > Mildly wondering whether we ought to use designated initializers instead, > > > given we're whacking it around already. Too easy to get the order wrong > > > when > > > adding new members, and we might want to have optional callbacks too. > > > > Strong +1. It makes code much easier to navigate (see XmlTableRoutine > > and compare with heapam_methods, for example). > > For example, I propose the attached. +1 I've fought with this one when fixing a conflict when rebasing a patch... Greetings, Andres Freund
Re: moving basebackup code to its own directory
On 2022-Aug-10, Robert Haas wrote: > David Steele voted for back-patching this on the grounds that it would > make future back-patching easier, which is an argument that seems to > me to have some merit, although on the other hand, we are already into > August so it's quite late in the day. Anyone else want to vote? Given that 10 of these 11 files are new in 15, I definitely agree with backpatching the move. Moving the include/ files is going to cause some pain for any third-party code #including those files. I don't think this is a problem. -- Álvaro Herrera  Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: [PATCH] CF app: add "Returned: Needs more interest"
On Wed, Aug 3, 2022 at 02:53:23PM -0400, Tom Lane wrote: > Andres Freund writes: > > My impression is that a lot of the patches floating from CF to CF have > > gotten > > sceptical feedback and at best a minor amount of work to address that has > > been > > done. > > That I think is a distinct issue: nobody wants to take on the > unpleasant job of saying "no, we don't want this" in a final way. > We may raise some objections but actually rejecting a patch is hard. > So it tends to slide forward until the author gives up. Agreed. There is a sense when I look at patches in that status that they seem like a good idea to someone and could be useful to someone, but the overhead or complexity it would add to the software doesn't seem warranted. It is complex to explain that to someone, and since it is a judgement call, not worth the argument. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: [RFC] building postgres with meson
Hi, On 2022-06-02 10:26:09 -0700, Andres Freund wrote: > > Could we have the meson build check that, say, if gram.c exists it > > is newer than gram.y? Or get it to ignore an in-tree gram.c? > > I suspect the problem with ignoring is gram.h, that's probably a bit harder to > ignore. I tried to ignore various generated files in the source tree, but I don't think it's doable for all of them. Consider e.g. src/backend/utils/misc/guc-file.c which gets built via #include "guc-file.c" from gram.c. Because it's a "" include, the search path starts in the current directory and only then -I is searched. To my knowledge there's no way of changing that. Quoting the gcc manpage: -I dir -iquote dir -isystem dir -idirafter dir Add the directory dir to the list of directories to be searched for header files during preprocessing. If dir begins with = or $SYSROOT, then the = or $SYSROOT is replaced by the sysroot prefix; see --sysroot and -isysroot. Directories specified with -iquote apply only to the quote form of the directive, "#include "file"". Directories specified with -I, -isystem, or -idirafter apply to lookup for both the "#include "file"" and "#include <file>" directives. You can specify any number or combination of these options on the command line to search for header files in several directories. The lookup order is as follows: 1. For the quote form of the include directive, the directory of the current file is searched first. 2. For the quote form of the include directive, the directories specified by -iquote options are searched in left-to-right order, as they appear on the command line. 3. Directories specified with -I options are scanned in left-to-right order. [...] Except for copying guc.c from source to build tree before building, I don't see a way of ignoring the in-build-tree guc-file.c. Not sure what a good way of dealing with this is. For now I'll make it just error out if there's any known such file in the source tree, but that's not a good solution forever. 
If it were just "normal" build leftovers I'd propose to (optionally) just remove them, but that's not good for tarballs. Greetings, Andres Freund
Re: Get the statistics based on the application name and IP address
On Mon, Aug 8, 2022 at 10:11 PM Julien Rouhaud wrote: > Hi, > > On Mon, Aug 08, 2022 at 08:21:06PM +0500, Ibrar Ahmed wrote: > > While working on pg_stat_statements, I got some questions from customers to > > have statistics by application and IP address. > > [...] > > name. I did some POC and had a patch. But before sharing the patch. > > > > I need to know if there has been any previous discussion about this > topic; > > by the way, > > Thanks for the input. > I don't think there was any discussion on this exactly, but there have been > some related discussions. > > This would likely bring 2 problems. > First, for now each entry contains its own > query text in the query file. There can already be some duplication, which > isn't great, but adding the application_name and/or IP address will make > things > way worse, so you would probably need to fix that first. I doubt that makes it worse because these (IP and Application) will be part of the key, not the query text. But yes, I agree that it will increase the footprint of rows, excluding query text. I am not 100% sure about the query text duplication but will look at that in detail; if you have more insight, it will help to solve that. > There has been some > discussion about it recently (1) but more work and benchmarking are needed. > > The other problem is the multiplication of entries. It's a well known > limitation that pg_stat_statements evictions are so costly that it makes it > unusable. The last numbers I saw about it was ~55% overhead (2). Adding > application_name or ip address to the key would probably make > pg_stat_statements unusable for anyone who would actually need those > metrics. > I am sure adding a new item in the key does not affect the performance of evictions of the row, as it will not be part of that area. I am doing some benchmarking and hacking to reduce that and will send results with the patch. 
> [1]: > https://www.postgresql.org/message-id/flat/604E3199-2DD2-47DD-AC47-774A6F97DCA9%40amazon.com > [2]: https://twitter.com/AndresFreundTec/status/1105585237772263424 > -- Ibrar Ahmed. Senior Software Engineer, PostgreSQL Consultant.
Re: moving basebackup code to its own directory
On Wed, Aug 10, 2022 at 12:57 PM Alvaro Herrera wrote: > Given that 10 of these 11 files are new in 15, I definitely agree with > backpatching the move. OK, done. -- Robert Haas EDB: http://www.enterprisedb.com
Re: something has gone wrong, but what is it?
> On 10 Aug 2022, at 19:49, Robert Haas wrote: > > After a bit of further looking around I noticed that there's another > check for an invalid auxtype in this function which uses a slightly > different message text and also PANIC rather than ERROR. Is there a reason to do MyBackendType = B_INVALID; after PANIC or ERROR? Best regards, Andrey Borodin.
Re: shared-memory based stats collector - v70
On Tue, 9 Aug 2022 at 12:48, Andres Freund wrote: > > The reason I want a C function is I'm trying to get as far as I can > > without a connection to a database, without a transaction, without > > accessing the catalog, and as much as possible without taking locks. > > I assume you don't include lwlocks under locks? I guess it depends on which lwlock :) I would be leery of a monitoring system taking an lwlock that could interfere with regular transactions doing work. Or taking a lock that is itself the cause of the problem elsewhere that you really need stats to debug would be a deal breaker. > I don't think that's a large enough issue to worry about unless you're > polling at a very high rate, which'd be a bad idea in itself. If a backend > can't get the lock for some stats change it'll defer flushing the stats a bit, > so it'll not cause a lot of other problems. Hm. I wonder if we're on the same page about what constitutes a "high rate". I've seen people try to push Prometheus or other similar systems to 5s poll intervals. That would be challenging for Postgres due to the volume of statistics. The default is 30s and people often struggle to even have that function for large fleets. But if you had a small fleet, perhaps an IoT-style system with a "one large table" type of schema you might well want stats every 5s or even every 1s. > I'm *dead* set against including catalog names in shared memory stats. That'll > add a good amount of memory usage and complexity, without any sort of > commensurate gain. Well it's pushing the complexity there from elsewhere. If the labels aren't in the stats structures then the exporter needs to connect to each database, gather all the names into some local cache and then it needs to worry about keeping it up to date. And if there are any database problems such as disk errors or catalog objects being locked then your monitoring breaks though perhaps it can be limited to just missing some object names or having out of date names. 
> > I also think it would be nice to have a change counter for every stat > > object, or perhaps a change time. Prometheus wouldn't be able to make > > use of it but other monitoring software might be able to receive only > > metrics that have changed since the last update which would really > > help on databases with large numbers of mostly static objects. > > I think you're proposing adding overhead that doesn't even have a real user. I guess I'm just brainstorming here. I don't need it currently, no. It doesn't seem like significant overhead compared to the locking and copying, though? -- greg
Re: something has gone wrong, but what is it?
On Wed, Aug 10, 2022 at 2:06 PM Andrey Borodin wrote: > > On 10 Aug 2022, at 19:49, Robert Haas wrote: > > After a bit of further looking around I noticed that there's another > > check for an invalid auxtype in this function which uses a slightly > > different message text and also PANIC rather than ERROR. > > Is there a reason to do > MyBackendType = B_INVALID; > after PANIC or ERROR? That could probably be taken out, but it doesn't seem important to take it out. -- Robert Haas EDB: http://www.enterprisedb.com
Re: shared-memory based stats collector - v70
One thing that's bugging me is that the names we use for these stats are *all* over the place. The names go through three different stages pgstat structs -> pgstatfunc tupledescs -> pg_stat_* views (Followed by a fourth stage where pg_exporter or whatever names for the monitoring software) And for some reason both transitions (plus the exporter) felt the need to fiddle with the names or values. And not in any sort of even vaguely consistent way. So there are three (or four) different sets of names for the same metrics :( e.g. * Some of the struct elements have abbreviated words which are expanded in the tupledesc names or the view columns -- some have long names which get abbreviated. * Some struct members have n_ prefixes (presumably to avoid C keywords or other namespace issues?) and then lose them at one of the other stages. But then the relation stats do not have n_ prefixes and then the pg_stat view *adds* n_ prefixes in the SQL view! * Some columns are added together in the SQL view which seems like gratuitously hiding information from the user. The pg_stat_*_tables view actually looks up information from the indexes stats and combines them to get idx_scan and idx_tup_fetch. * The pg_stat_bgwriter view returns data from two different fixed entries, the checkpointer and the bgwriter, is there a reason those are kept separately but then reported as if they're one thing? Some of the simpler renaming could be transparently fixed by making the internal stats match the public facing names. But for many of them I think the internal names are better. And the cases where the views aggregate data in a way that loses information are not something I want to reproduce. I had intended to use the internal names directly, reasoning that transparency and consistency are the direction to be headed. But in some cases I think the current public names are the better choice -- I certainly don't want to remove n_* prefixes from some names but then add them to different names! 
And some of the cases where the data is combined or modified do seem like they would be missed. -- greg
Re: pg_auth_members.grantor is bunk
On Mon, Aug 1, 2022 at 3:51 PM Robert Haas wrote: > On Mon, Aug 1, 2022 at 1:38 PM Tom Lane wrote: > >> I think the latter --- the cfbot thinks the July CF is no longer relevant, > > but Jacob hasn't yet moved your patches forward. You could wait for > > him to do that, or do it yourself. > > Done. New patches attached. Well, CI isn't happy with this, and for good reason: ALTER GROUP regress_priv_group2 ADD USER regress_priv_user2; -- duplicate -NOTICE: role "regress_priv_user2" has already been granted membership in role "regress_priv_group2" by role "rhaas" +NOTICE: role "regress_priv_user2" has already been granted membership in role "regress_priv_group2" by role "postgres" The problem here is that I revised the error message to include the name of the grantor, since that's now a part of the identity of the grant. It would be misleading to say, as we did previously... NOTICE: role "regress_priv_user2" is already a member of role "regress_priv_group2" ...because them being in the group isn't relevant so much as them being in the group by means of the same grantor. However, I suspect that I can't persuade all of you that we should hard-code the name of the bootstrap superuser as "rhaas", so this test case needs some alteration. I found, however, that the original intent of the test case couldn't be preserved with the patch as written, because when you grant membership in one role to another role as the superuser or a CREATEROLE user, the grant is attributed to the bootstrap superuser, whose name is variable, as this test failure shows. Therefore, to fix the test, I needed to use ALTER GROUP as a non-CREATEROLE user, some user created as part of the test, for the results to be stable. But that was impossible, because even though "GRANT user_name TO group_name" requires *either* CREATEROLE *or* ADMIN OPTION on the group, the equivalent command "ALTER GROUP group_name ADD USER user_name" requires specifically CREATEROLE. 
I debated whether to fix that inconsistency or just remove this test case and eventually came down on the side of fixing the inconsistency, so the attached version does it that way. -- Robert Haas EDB: http://www.enterprisedb.com v5-0001-Ensure-that-pg_auth_members.grantor-is-always-val.patch Description: Binary data v5-0002-Make-role-grant-system-more-consistent-with-other.patch Description: Binary data
Re: logical replication restrictions
On Wed, Aug 10, 2022, at 9:39 AM, osumi.takami...@fujitsu.com wrote: > Minor review comments for v6. Thanks for your review. I'm attaching v7. > "If the subscriber sets min_apply_delay parameter, ..." > > I suggest we use subscription rather than subscriber, because > this parameter refers to and is used for one subscription. > My suggestion is > "If one subscription sets min_apply_delay parameter, ..." > In case if you agree, there are other places to apply this change. I changed the terminology to subscription. I also checked other "subscriber" occurrences but I don't think it should be changed. Some of them are used as publisher/subscriber pair. If you think there is another sentence to consider, point it out. > It might be better to write a note for committer > like "Bump catalog version" at the bottom of the commit message. It is a committer task to bump the catalog number. IMO it is easy to notice (using a git hook?) that it must bump it when we are modifying the catalog. AFAICS there is no recommendation to add such a warning. > The former interprets input number as milliseconds in case of no units, > while the latter takes it as seconds without units. > I feel it would be better to make them aligned. In a previous version I decided not to add a code to attach a unit when there isn't one. Instead, I changed the documentation to reflect what interval_in uses (seconds as unit). Under reflection, let's use ms as default unit if the user doesn't specify one. I fixed all the other suggestions too. -- Euler Taveira EDB https://www.enterprisedb.com/ From a28987c8adb70d6932558f5e39f9dd4c55223a30 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Sat, 6 Nov 2021 11:31:10 -0300 Subject: [PATCH v7] Time-delayed logical replication subscriber Similar to physical replication, a time-delayed copy of the data for logical replication is useful for some scenarios (particularly to fix errors that might cause data loss). 
If the subscription sets min_apply_delay parameter, the logical replication worker will delay the transaction commit for min_apply_delay milliseconds. The delay is calculated between the WAL time stamp and the current time on the subscriber. The delay occurs only on WAL records for transaction begins. The main reason is to avoid keeping a transaction open for a long time. Regular and prepared transactions are covered. Streamed transactions are also covered. Author: Euler Taveira Discussion: https://postgr.es/m/CAB-JLwYOYwL=xtyaxkih5ctm_vm8kjkh7aaitckvmch4rzr...@mail.gmail.com --- doc/src/sgml/catalogs.sgml | 10 ++ doc/src/sgml/ref/alter_subscription.sgml | 5 +- doc/src/sgml/ref/create_subscription.sgml | 43 +- src/backend/catalog/pg_subscription.c | 1 + src/backend/catalog/system_views.sql | 7 +- src/backend/commands/subscriptioncmds.c| 59 +++- src/backend/replication/logical/worker.c | 100 + src/backend/utils/adt/timestamp.c | 32 src/bin/pg_dump/pg_dump.c | 17 ++- src/bin/pg_dump/pg_dump.h | 1 + src/bin/psql/describe.c| 9 +- src/bin/psql/tab-complete.c| 4 +- src/include/catalog/pg_subscription.h | 3 + src/include/datatype/timestamp.h | 2 + src/include/utils/timestamp.h | 2 + src/test/regress/expected/subscription.out | 165 - src/test/regress/sql/subscription.sql | 20 +++ src/test/subscription/t/032_apply_delay.pl | 129 18 files changed, 526 insertions(+), 83 deletions(-) create mode 100644 src/test/subscription/t/032_apply_delay.pl diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index cd2cc37aeb..291ebdafad 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -7833,6 +7833,16 @@ SCRAM-SHA-256$:&l + + + subapplydelay int8 + + + Application delay of changes by a specified amount of time. The + unit is in milliseconds. 
+ + + subname name diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml index 64efc21f53..8901e1361c 100644 --- a/doc/src/sgml/ref/alter_subscription.sgml +++ b/doc/src/sgml/ref/alter_subscription.sgml @@ -208,8 +208,9 @@ ALTER SUBSCRIPTION name RENAME TO < are slot_name, synchronous_commit, binary, streaming, - disable_on_error, and - origin. + disable_on_error, + origin, and + min_apply_delay. diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml index 7390c715bc..a794c07042 100644 --- a/doc/src/sgml/ref/create_subscription.sgml +++ b/doc/src/sgml/ref/create_subscription.sgml @@ -317,7 +317,36 @@ CREATE SUBSCRIPTION subscription_name - + + +min_apply_delay (integer) + +
Re: shared-memory based stats collector - v70
On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand wrote: > > Hi, > > On 8/9/22 6:00 PM, Greg Stark wrote: > > On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand wrote: > >> > >> What do you think about adding a function in core PG to provide such > >> functionality? (means being able to retrieve all the stats (+ eventually > >> add some filtering) without the need to connect to each database). > > I'm working on it myself too. I'll post a patch for discussion in a bit. > > Great! Thank you! So I was adding the code to pgstat.c because I had thought there were some data types I needed and/or static functions I needed. However you and Andres encouraged me to check again now. And indeed I was able, after fixing a couple things, to make the code work entirely externally. This is definitely not polished and there's a couple obvious things missing. But at the risk of embarrassment I've attached my WIP. Please be gentle :) I'll post the github link in a bit when I've fixed up some meta info. I'm definitely not wedded to the idea of using callbacks, it was just the most convenient way to get started, especially when I was putting the main loop in pgstat.c. Ideally I do want to keep open the possibility of streaming the results out without buffering the whole set in memory. > Out of curiosity, would you be also interested by such a feature for > previous versions (that will not get the patch in) ? I always had trouble understanding the existing stats code so I was hoping the new code would make it easier. It seems to have worked but it's possible I'm wrong and it was always possible and the problem was always just me :) -- greg /*- * * telemetry.c * * Most of this code was copied from pg_prewarm.c as a template. 
* * *- */ #include "postgres.h" #include #include #include #include "access/relation.h" #include "access/xact.h" #include "catalog/pg_class.h" #include "catalog/pg_type.h" #include "miscadmin.h" #include "pgstat.h" #include "postmaster/bgworker.h" #include "postmaster/interrupt.h" #include "storage/buf_internals.h" #include "storage/dsm.h" #include "storage/ipc.h" #include "storage/latch.h" #include "storage/lwlock.h" #include "storage/proc.h" #include "storage/procsignal.h" #include "storage/shmem.h" #include "storage/smgr.h" #include "tcop/tcopprot.h" #include "utils/acl.h" #include "utils/datetime.h" #include "utils/guc.h" #include "utils/memutils.h" #include "utils/rel.h" #include "utils/resowner.h" #include "telemetry.h" #include "telemetry_pgstat.h" PG_MODULE_MAGIC; /* We should already have included what we need to get uint64_t size_t fd_set socklen_t and struct sockaddr so don't let microhttpd include extra stuff for them */ #define MHD_PLATFORM_H #include /* MHD_USE_EPOLL */ /* MHD_USE_AUTO */ /* MHD_OPTION_CONNECTION_LIMIT */ /* MHD_OPTION_CONNECTION_TIMEOUT */ /* MHD_OPTION_EXTERNAL_LOGGER */ /* Background worker harness */ void _PG_init(void); /* Actual internal background worker main entry point */ static void telemetry_start_worker(unsigned short port); /* GUC variables. */ static int telemetry_port = 9187; /* TCP port to listen on for metrics */ static char *telemetry_listen_addresses; /* TCP listen addresses */ static bool telemetry = true; /* start worker automatically on default port */ static enum MHD_Result telemetry_handler(void *cls, struct MHD_Connection *connection, const char *url, const char *method, const char *version, const char *upload_data, size_t *upload_data_size, void **con_cls); /* * Module load callback. 
*/ void _PG_init(void) { DefineCustomIntVariable("telemetry.port", "TCP Port to serve metrics on by default", NULL, &telemetry_port, 9187, 1, 65535, PGC_SIGHUP, 0, /* flags */ NULL, NULL, NULL /* hooks */ ); DefineCustomStringVariable("telemetry.listen_addresses", "TCP Listen Addresses to serve metrics on by default", NULL, &telemetry_listen_addresses, "*", PGC_SIGHUP, GUC_LIST_INPUT, /* flags */ NULL, NULL, NULL /* hooks */ ); if (!process_shared_preload_libraries_in_progress) return; /* can't define PGC_POSTMASTER variable after startup */ DefineCustomBoolVariable("telemetry.start_server", "Starts the telemetry worker on startup.", NULL, &telemetry, true, PGC_POSTMASTER, 0, NULL, NULL, NULL); EmitWarningsOnPlaceholders("telemetry"); /* Register telemetry worker, if enabled. */ if (telemetry) telemetry_start_worker(telemetry_port); } static void telemetry_logger(void *arg, const char *fmt, va_list ap); struct MHD_OptionItem mhd_ops[] = { { MHD_OPTION_EXTERNAL_LOGGER, (intptr_t)telemetry_logger
Re: optimize lookups in snapshot [sub]xip arrays
On Wed, Aug 10, 2022 at 10:50:02AM +0700, John Naylor wrote: > LGTM, let's see what the buildfarm thinks of 0001. Thanks! I haven't noticed any related buildfarm failures yet. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: shared-memory based stats collector - v70
Hi, On 2022-08-10 14:18:25 -0400, Greg Stark wrote: > > I don't think that's a large enough issue to worry about unless you're > > polling at a very high rate, which'd be a bad idea in itself. If a backend > > can't get the lock for some stats change it'll defer flushing the stats a > > bit, > > so it'll not cause a lot of other problems. > > Hm. I wonder if we're on the same page about what constitutes a "high rate". > > I've seen people try to push Prometheus or other similar systems to 5s > poll intervals. That would be challenging for Postgres due to the > volume of statistics. The default is 30s and people often struggle to > even have that function for large fleets. But if you had a small > fleet, perhaps an IoT-style system with a "one large table" type of > schema you might well want stats every 5s or even every 1s. That's probably fine. Although I think you might run into trouble not from the stats subsystem side, but from the "amount of data" side. On a system with a lot of objects that can be a fair amount. If you really want to do very low latency stats reporting, I suspect you'd have to build an incremental system. > > I'm *dead* set against including catalog names in shared memory stats. > > That'll > > add a good amount of memory usage and complexity, without any sort of > > commensurate gain. > > Well it's pushing the complexity there from elsewhere. If the labels > aren't in the stats structures then the exporter needs to connect to > each database, gather all the names into some local cache and then it > needs to worry about keeping it up to date. And if there are any > database problems such as disk errors or catalog objects being locked > then your monitoring breaks though perhaps it can be limited to just > missing some object names or having out of date names. Shrug. If the stats system state desynchronizes from an alter table rename you'll also have a problem in monitoring. 
And even if you can benefit from having all that information, it'd still be an overhead borne by everybody for a very small share of users. > > > I also think it would be nice to have a change counter for every stat > > > object, or perhaps a change time. Prometheus wouldn't be able to make > > > use of it but other monitoring software might be able to receive only > > > metrics that have changed since the last update which would really > > > help on databases with large numbers of mostly static objects. > > > > I think you're proposing adding overhead that doesn't even have a real user. > > I guess I'm just brainstorming here. I don't need it currently, no. It > doesn't seem like significant overhead compared to the locking > and copying though? Yes, timestamps aren't cheap to determine (nor free to store, but that's a lesser issue). Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-08-10 15:48:15 -0400, Greg Stark wrote: > One thing that's bugging me is that the names we use for these stats > are *all* over the place. Yes. I had a huge issue with this when polishing the patch. And Horiguchi-san did as well. I had to limit the amount of cleanup done to make it feasible to get anything committed. I think it's a bit less bad than before, but by no means good. > * The pg_stat_bgwriter view returns data from two different fixed > entries, the checkpointer and the bgwriter, is there a reason those > are kept separately but then reported as if they're one thing? Historical raisins. Checkpointer and bgwriter used to be one thing, but aren't anymore. Greetings, Andres Freund
Re: use SSE2 for is_valid_ascii
On Wed, Aug 10, 2022 at 01:50:14PM +0700, John Naylor wrote: > Here is an updated patch using the new USE_SSE2 symbol. The style is > different from the last one in that each stanza has platform-specific > code. I wanted to try it this way because is_valid_ascii() is already > written in SIMD-ish style using general purpose registers and bit > twiddling, so it seemed natural to see the two side-by-side. Sometimes > they can share the same comment. If we think this is bad for > readability, I can go back to one block each, but that way leads to > duplication of code and it's difficult to see what's different for > each platform, IMO. This is a neat patch. I don't know that we need an entirely separate code block for the USE_SSE2 path, but I do think that a little bit of extra commentary would improve the readability. IMO the existing comment for the zero accumulator has the right amount of detail.

+    /*
+     * Set all bits in each lane of the error accumulator where input
+     * bytes are zero.
+     */
+    error_cum = _mm_or_si128(error_cum,
+                             _mm_cmpeq_epi8(chunk, _mm_setzero_si128()));

I wonder if reusing a zero vector (instead of creating a new one every time) has any noticeable effect on performance. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
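For readers following along, the accumulator scheme being discussed can be put together as a small standalone C function. This is a simplified sketch, not the patch code: it assumes the input length is a multiple of 16 and omits the portable fallback path.

```c
#include <emmintrin.h>			/* SSE2 intrinsics */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the SSE2 approach from this thread: OR every 16-byte chunk
 * into an "error accumulator" so any byte with its high bit set (i.e.
 * non-ASCII) survives, also OR in the result of comparing each byte with
 * zero (embedded NUL bytes are invalid too), and test the accumulated
 * high bits once at the end.  Assumes len is a multiple of 16.
 */
static bool
is_valid_ascii_sse2(const unsigned char *s, size_t len)
{
	__m128i		error_cum = _mm_setzero_si128();

	for (size_t i = 0; i < len; i += 16)
	{
		__m128i		chunk = _mm_loadu_si128((const __m128i *) (s + i));

		/* Accumulate any high bits (non-ASCII bytes). */
		error_cum = _mm_or_si128(error_cum, chunk);

		/* Set all bits in each lane where the input byte is zero. */
		error_cum = _mm_or_si128(error_cum,
								 _mm_cmpeq_epi8(chunk, _mm_setzero_si128()));
	}

	/* Valid iff no lane of the accumulator has its high bit set. */
	return _mm_movemask_epi8(error_cum) == 0;
}
```

As the follow-up messages note, the `_mm_setzero_si128()` inside the loop is effectively free: the compiler materializes the zero register once before the loop.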
Re: Checking pgwin32_is_junction() errors
On Tue, Aug 9, 2022 at 10:59 PM wrote: > On 2022-08-09 05:44, Thomas Munro wrote: > > On Tue, Aug 9, 2022 at 8:30 AM Thomas Munro > > wrote: > >> On Mon, Aug 8, 2022 at 8:23 PM wrote: > >> > "C:/HOME" is the junction point to the second volume on my hard drive - > >> > "\??\Volume{GUID}\" which name pgreadlink() erroneously strips here: > >> > https://github.com/postgres/postgres/blob/7e29a79a46d30dc236d097825ab849158929d977/src/port/dirmod.c#L357. > > > >> ... Would it > >> be better to say: if it doesn't begin with "\??\X:", where X could be > >> any letter, then don't modify it? > > > > Concretely, I wonder if this is a good fix at least in the short term. > > Does this work for you, and do the logic and explanation make sense? > > Yes, this patch works well with my junction points. Thanks for testing! I did a bit more reading on this stuff, so that I could update the comments with the correct terminology from Windows APIs. I also realised that the pattern we could accept to symlink() and expect to work is not just "C:..." (could be RtlPathTypeDriveRelative, which wouldn't actually work in a junction point) but "C:\..." (RtlPathTypeDriveAbsolute). I tweaked it a bit to test for that.

> I checked a few variants:
>
> 21.07.2022  15:11    HOME  [\??\Volume{GUID}\]
> 09.08.2022  15:06    Test1 [\\?\Volume{GUID}\]
> 09.08.2022  15:06    Test2 [\\.\Volume{GUID}\]
> 09.08.2022  15:17    Test3 [\??\Volume{GUID}\]
> 09.08.2022  15:27    Test4 [C:\temp\1]
> 09.08.2022  15:28    Test5 [C:\HOME\Temp\1]

One more thing I wondered about, now that we're following junctions outside PGDATA: can a junction point to another junction? If so, I didn't allow for that: stat() gives up after one hop, because I figured that was enough for the stuff we expect inside PGDATA and I couldn't find any evidence in the man pages that referred to chains. 
But if you *are* allowed to create a junction "c:\huey" that points to junction "c:\dewey" that points to "c:\louie", and then you do initdb -D c:\huey\pgdata, then I guess it would fail. Would you mind checking if that is a real possibility, and if so, testing this chain-following patch to see if it fixes it?

> After hours of reading the documentation and debugging, it seems to me
> we can use REPARSE_GUID_DATA_BUFFER structure instead of our
> REPARSE_JUNCTION_DATA_BUFFER [1]. DataBuffer doesn't contain any prefixes,
> so we don't need to strip them. But we still need to construct a correct
> volume name if a junction point is a volume mount point. Is it worth
> checking this idea?

I don't know. I think I prefer our current approach, because it can handle anything (raw/full NT paths) and doesn't try to be very clever, and I don't want to change to a different scheme for no real benefit...

From 6ed3edc77275aa51d1d0b1012c0a36db65735d39 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 11 Aug 2022 10:42:13 +1200
Subject: [PATCH v2 1/2] Fix readlink() for general Windows junction points.

Our symlink() and readlink() replacements perform a naive transformation from DOS paths to NT paths and back, as required by the junction point APIs. This was enough for the "drive absolute" paths we expect users to provide for tablespaces, for example "d:\my\fast\storage".

Since commit c5cb8f3b taught stat() to follow symlinks, and since initdb uses pg_mkdir_p(), and that examines parent directories, our humble readlink() implementation can now be exposed to junction points not of PostgreSQL origin. Those might be corrupted by our naive path mangling, which doesn't really understand NT paths in general.

Simply decline to transform paths that don't look like a drive absolute path. That means that readlink() returns the NT path directly when checking a parent directory of PGDATA that happens to point to a drive using "rooted" format. 
That works for the purposes of our stat() emulation.

Reported-by: Roman Zharkov
Discussion: https://postgr.es/m/4590c37927d7b8ee84f9855d83229018%40postgrespro.ru
---
 src/port/dirmod.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 2818bfd2e9..53310baafb 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -199,7 +199,11 @@ pgsymlink(const char *oldpath, const char *newpath)
 	if (dirhandle == INVALID_HANDLE_VALUE)
 		return -1;
 
-	/* make sure we have an unparsed native win32 path */
+	/*
+	 * We expect either an NT path or a "drive absolute" path like "C:\..."
+	 * that we convert to an NT path by bolting on a prefix and converting any
+	 * Unix-style path delimiters to NT-style.
+	 */
 	if (memcmp("\\??\\", oldpath, 4) != 0)
 		snprintf(nativeTarget, sizeof(nativeTarget), "\\??\\%s", oldpath);
 	else
@@ -351,10 +355,21 @@ pgreadlink(const char *path, char *buf, size_t size)
 	}
 
 	/*
-	 * If the path starts with "\??\", which it will do in most (all?) cases,
-	 * strip those out.
+	 * If the path starts with "\??\" follo
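The "drive absolute" test described in the commit message boils down to matching the pattern `\??\X:\` after the NT prefix. A minimal illustrative check might look like the following — the helper name and shape are invented for this sketch and are not taken from the patch:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/*
 * Illustrative helper only (not patch code): does an NT-prefixed path
 * look "drive absolute", i.e. "\??\X:\..."?  Anything else, such as
 * "\??\Volume{GUID}\..." or the drive-relative form "\??\X:foo", would
 * be left untransformed by readlink().
 */
static bool
looks_drive_absolute(const char *p)
{
	return p[0] == '\\' && p[1] == '?' && p[2] == '?' && p[3] == '\\' &&
		isalpha((unsigned char) p[4]) && p[5] == ':' && p[6] == '\\';
}
```

Note that the short-circuit evaluation keeps the check safe even for strings shorter than seven characters, since reading stops at the first mismatching (or terminating NUL) byte.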
Re: Get the statistics based on the application name and IP address
Hi, On Wed, Aug 10, 2022 at 10:42:31PM +0500, Ibrar Ahmed wrote: > On Mon, Aug 8, 2022 at 10:11 PM Julien Rouhaud wrote: > > > First, for now each entry contains its own > > query text in the query file. There can already be some duplication, which > > isn't great, but adding the application_name and/or IP address will make > > things > > way worse, so you would probably need to fix that first. > > I doubt that makes it worse because these (IP and Application) will be part > of > the key, not the query text. It's because you want to add new elements to the key that it would make it worse, as the exact same query text will be stored much more often. > But yes, I agree that it will increase the > footprint of rows, > excluding query text. I don't think that this part should be a concern. > I am not 100% sure about the query text duplication but will look at that > in detail, > if you have more insight, then it will help to solve that. You can refer to the mentioned thread for the (only?) discussion about that. > > There has been some > > discussion about it recently (1) but more work and benchmarking are needed. > > > > The other problem is the multiplication of entries. It's a well known > > limitation that pg_stat_statements evictions are so costly that it makes it > > unusable. The last numbers I saw about it were ~55% overhead (2). Adding > > application_name or ip address to the key would probably make > > pg_stat_statements unusable for anyone who would actually need those > > metrics. > > > > I am sure adding a new item in the key does not affect the performance of > evictions of the row, > as it will not be part of that area. I am doing some benchmarking and > hacking to reduce that and will > send results with the patch. Sorry if that was unclear. I didn't mean that adding new items to the key would make evictions way costlier (although it would have some impact), but much more frequent. 
Adding application_name and the IP to the key can very easily amplify the number of entries so much that you will either need an unreasonable value for pg_stat_statements.max (which would likely bring its own set of problems) if possible at all, or evict entries frequently.
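To make the multiplication concrete, here is a back-of-the-envelope calculation; the workload numbers are assumed for illustration, not measurements from the thread:

```c
#include <assert.h>

/*
 * Purely illustrative arithmetic.  Today the pg_stat_statements key is
 * roughly (userid, dbid, queryid), so each distinct query yields about
 * one entry.  Adding application_name and client IP to the key
 * multiplies the entry count instead of adding to it.
 */
static unsigned long
projected_entries(unsigned long distinct_queries,
				  unsigned long app_names,
				  unsigned long client_ips)
{
	return distinct_queries * app_names * client_ips;
}
```

With, say, 5000 distinct queries, 20 application names, and 50 client addresses, the projected entry count is 5 million — far beyond any reasonable pg_stat_statements.max, which is why frequent evictions would follow.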
Re: Allow logical replication to copy tables in binary format
On Wed, Aug 10, 2022, at 12:03 PM, Melih Mutlu wrote: > I see that logical replication subscriptions have an option to enable binary > [1]. > When it's enabled, subscription requests publisher to send data in binary > format. > But this is only the case for apply phase. In tablesync, tables are still > copied as text. This option could have been included in commit 9de77b54531; it wasn't. Maybe it wasn't considered because the initial table synchronization can be a separate step in your logical replication setup idk. I agree that the binary option should be available for the initial table synchronization. > To copy tables, COPY command is used and that command supports copying in > binary. So it seemed to me possible to copy in binary for tablesync too. > I'm not sure if there is a reason to always copy tables in text format. But I > couldn't see why not to do it in binary if it's enabled. The reason to use the text format is that the binary format is error prone. There are restrictions while using the binary format. For example, if your schema has different data types for a certain column, the copy will fail. Even with such restrictions, I think it is worth adding it. > You can find the small patch that only enables binary copy attached. I have a few points about your implementation.

* Are we considering supporting prior Postgres versions too? These releases support binary mode but it could be an unexpected behavior (initial sync in binary mode) for a publisher using 14 or 15 and a subscriber using 16. IMO you should only allow it for publishers on 16 or later.
* Docs should say that the binary option also applies to initial table synchronization and possibly emphasize some of the restrictions.
* Tests. Are the current tests enough? 014_binary.pl.

-- Euler Taveira EDB https://www.enterprisedb.com/
Re: optimize lookups in snapshot [sub]xip arrays
On Thu, Aug 11, 2022 at 4:46 AM Nathan Bossart wrote: > > On Wed, Aug 10, 2022 at 10:50:02AM +0700, John Naylor wrote: > > LGTM, let's see what the buildfarm thinks of 0001. > > Thanks! I haven't noticed any related buildfarm failures yet. I was waiting for all the Windows animals to report in, and it looks like they have, so I've pushed 0002. Thanks for picking this topic up again! -- John Naylor EDB: http://www.enterprisedb.com
Re: Using each rel as both outer and inner for JOIN_ANTI
On Wed, Aug 10, 2022 at 4:40 PM Alvaro Herrera wrote: > On 2022-Aug-10, Richard Guo wrote: > > > The right-anti join plan has the same cost estimation with right join > > plan in this case. So would you please help to test what the right join > > plan looks like in your env for the query below? > > > > select * from foo left join bar on foo.a = bar.c; > > You're right, it does.
>
> 55432 16devel 475322=# explain (analyze, buffers) select * from foo left join bar on foo.a = bar.c;
>                                QUERY PLAN
> ──────────────────────────────────────────────────────────────────────
>  Hash Right Join  (cost=1.23..90875.24 rows=10 width=20) (actual time=456.410..456.415 rows=10 loops=1)
>    Hash Cond: (bar.c = foo.a)
>    Buffers: shared hit=15852 read=6273
>    ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=12) (actual time=0.036..210.468 rows=500 loops=1)
>          Buffers: shared hit=15852 read=6272
>    ->  Hash  (cost=1.10..1.10 rows=10 width=8) (actual time=0.037..0.038 rows=10 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 9kB
>          Buffers: shared read=1
>          ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual time=0.022..0.026 rows=10 loops=1)
>                Buffers: shared read=1
>  Planning:
>    Buffers: shared hit=92 read=13
>  Planning Time: 1.077 ms
>  Execution Time: 456.458 ms
> (14 filas)

Thanks for helping test. Comparing the anti-join plan and the right join plan, the estimated costs and the actual execution times diverge a lot. It seems the cost estimate of the hashjoin path is not that precise for this case even in the unpatched code. Maybe this is something we need to improve.

Thanks
Richard
Re: [RFC] building postgres with meson
On Thu, Aug 11, 2022 at 12:19 AM Andres Freund wrote: > I tried to ignore various generated files in the source tree, but I don't > think it's doable for all of them. Consider > e.g. src/backend/utils/misc/guc-file.c which is gets built via #include > "guc-file.c" from gram.c With a bit of work, we could probably get rid of those includes. See 27199058d98ef7f for one example. -- John Naylor EDB: http://www.enterprisedb.com
Re: [RFC] building postgres with meson
John Naylor writes: > With a bit of work, we could probably get rid of those includes. See > 27199058d98ef7f for one example. Yeah --- it would mean creating gram.h files for all the bison grammars not just a few of them, but it's certainly do-able if there's motivation to make the changes. Most of the files that are done that way date from before we knew about flex's %top. regards, tom lane
Re: [RFC] building postgres with meson
I wrote: > John Naylor writes: >> With a bit of work, we could probably get rid of those includes. See >> 27199058d98ef7f for one example. > Yeah --- it would mean creating gram.h files for all the bison grammars > not just a few of them, but it's certainly do-able if there's motivation > to make the changes. Most of the files that are done that way date > from before we knew about flex's %top. BTW, 72b1e3a21 is another useful precedent in this area. regards, tom lane
Re: [RFC] building postgres with meson
On Thu, Aug 11, 2022 at 10:37 AM Tom Lane wrote: > > John Naylor writes: > > With a bit of work, we could probably get rid of those includes. See > > 27199058d98ef7f for one example. > > Yeah --- it would mean creating gram.h files for all the bison grammars > not just a few of them, but it's certainly do-able if there's motivation > to make the changes. Most of the files that are done that way date > from before we knew about flex's %top. I'll volunteer to work on this unless an easier solution happens to come along in the next couple days. (aside: guc-file.l doesn't have a grammar, so not yet sure if that makes the issue easier or harder...) -- John Naylor EDB: http://www.enterprisedb.com
tests and meson - test names and file locations
Hi,

One of the things motivating me to work on the meson conversion is the ability to run tests in an easily understandable way. Meson has a test runner that allows both running all tests at once and running subsets of tests.

= Test and testsuite naming =

Each test has a unique name, and 0-n labels (called 'suites'). One can run tests by their name, or select tests by suites. The way I've organized it so far is that tests are named like this:

  main/pg_regress
  main/isolation
  recovery/t/018_wal_optimize.pl
  pg_prewarm/t/001_basic.pl
  test_shm_mq/pg_regress
  psql/t/001_basic.pl

we could also name tests by their full path, but that makes the display very unwieldy. At the moment there are three suites differentiating by the type of test: 'pg_regress', 'isolation' and 'tap'. There's also a separate "axis" of suites, describing what's being tested, e.g. 'main', 'test_decoding', 'recovery' etc. That currently works out to each test having two suites, although I've wondered about adding a 'contrib' suite as well. Perhaps the pg_regress suite should just be named 'regress'? And perhaps the t/ in the tap tests is superfluous - the display is already fairly wide?

= Log and Data locations =

To make things like the selection of log files for a specific test easier, I've so far set it up so that test data and logs are stored in a separate directory from the sources:

  testrun///

The runner now creates a test.start at the start of a test and either test.success or test.failure at the end. That should make it pretty easy for e.g. the buildfarm and CI to make the logs for a failed test easily accessible. I've spent far too much time going through the ~hundred logs in src/test/recovery/ that the buildfarm displays as one thing.

I really like having all the test data separately from the sources, but I get that that's not what we've done so far. It's doable to just mirror the current choice, but I don't think we should. But I won't push too hard against keeping things the same. 
I do wonder if we should put test data and log files in a separate directory tree, but that'd be a bit more work probably. Any comments on the above?

= Example outputs =

Here's an example output that you mostly should be able to make sense of now:

$ m test --print-errorlogs
ninja: Entering directory `/tmp/meson'
ninja: no work to do.
  1/242 postgresql:setup / tmp_install                                 OK     0.33s
  2/242 postgresql:tap+pg_upgrade / pg_upgrade/t/001_basic.pl          OK     0.35s   8 subtests passed
  3/242 postgresql:tap+recovery / recovery/t/011_crash_recovery.pl     OK     2.67s   3 subtests passed
  4/242 postgresql:tap+recovery / recovery/t/013_crash_restart.pl      OK     2.91s   18 subtests passed
  5/242 postgresql:tap+recovery / recovery/t/014_unlogged_reinit.pl    OK     3.01s   23 subtests passed
  6/242 postgresql:tap+recovery / recovery/t/022_crash_temp_files.pl   OK     3.16s   11 subtests passed
  7/242 postgresql:tap+recovery / recovery/t/016_min_consistency.pl    OK     3.43s   1 subtests passed
  8/242 postgresql:tap+recovery / recovery/t/021_row_visibility.pl     OK     3.46s   10 subtests passed
  9/242 postgresql:isolation+tcn / tcn/isolation                       OK     3.42s
 10/242 postgresql:tap+recovery / recovery/t/023_pitr_prepared_xact.pl OK     3.63s   1 subtests passed
...
241/242 postgresql:isolation+main / main/isolation                     OK    46.69s
242/242 postgresql:tap+pg_upgrade / pg_upgrade/t/002_pg_upgrade.pl     OK    57.00s   13 subtests passed

Ok:                 242
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            0
Timeout:            0

Full log written to /tmp/meson/meson-logs/testlog.txt

The 'postgresql' is because meson supports subprojects (both to provide dependencies if needed, and "real" subprojects), and their tests can be run at once. 
If a test fails it'll show the error output at the time of test:

 39/242 postgresql:pg_regress+cube / cube/pg_regress   FAIL   3.74s   exit status 1
>>> REGRESS_SHLIB=/tmp/meson/src/test/regress/regress.so MALLOC_PERTURB_=44
>>> PG_TEST_EXTRA='kerberos ldap ssl'
>>> PG_REGRESS=/tmp/meson/src/test/regress/pg_regress
>>> PATH=/tmp/meson/tmp_install/tmp/meson-install/bin:/tmp/meson/contrib/cube:/home/andres/bin/perl5/bin:/home/andres/bin/pg:/home/andres/bin/bin:/usr/sbin:/sbin:/home/andres/bin/pg:/home/andres/bin/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/
Re: [RFC] building postgres with meson
Hi, On 2022-08-11 10:57:33 +0700, John Naylor wrote: > I'll volunteer to work on this unless an easier solution happens to > come along in the next couple days. Cool! > (aside: guc-file.l doesn't have a grammar, so not yet sure if that makes the > issue easier or harder...) I think we should consider compiling it separately from guc.c. guc.c already compiles quite slowly (iirc beat only by ecpg and main grammar), and it's a relatively commonly changed source file. It might even be a good idea to split guc.c so it only contains the settings arrays + direct dependencies... Greetings, Andres Freund
Re: use SSE2 for is_valid_ascii
On Thu, Aug 11, 2022 at 5:31 AM Nathan Bossart wrote: > > This is a neat patch. I don't know that we need an entirely separate code > block for the USE_SSE2 path, but I do think that a little bit of extra > commentary would improve the readability. IMO the existing comment for the > zero accumulator has the right amount of detail. > > + /* > +* Set all bits in each lane of the error accumulator where > input > +* bytes are zero. > +*/ > + error_cum = _mm_or_si128(error_cum, > + > _mm_cmpeq_epi8(chunk, _mm_setzero_si128())); Okay, I will think about the comments, thanks for looking. > I wonder if reusing a zero vector (instead of creating a new one every > time) has any noticeable effect on performance. Creating a zeroed register is just FOO PXOR FOO, which should get hoisted out of the (unrolled in this case) loop, and which a recent CPU will just map to a hard-coded zero in the register file, in which case the execution latency is 0 cycles. :-) -- John Naylor EDB: http://www.enterprisedb.com
Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)
On Mon, Aug 8, 2022 at 7:20 PM Bharath Rupireddy wrote: > > > Please review. > > I added this to current CF - https://commitfest.postgresql.org/39/3808/ Here's the v2 patch, no change from v1, just rebased because of commit a8c012869763c711abc9085f54b2a100b60a85fa (Move basebackup code to new directory src/backend/backup). -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/ v2-0001-Refactor-backup-related-code.patch Description: Binary data
Re: [RFC] building postgres with meson
John Naylor writes: > I'll volunteer to work on this unless an easier solution happens to > come along in the next couple days. (aside: guc-file.l doesn't have a > grammar, so not yet sure if that makes the issue easier or harder...) That one's probably mostly about the issue mentioned in the other commit you identified. Without %top, it's impossible to make a standalone flex module honor the rule about thou-shalt-have-no- other-includes-before-postgres.h. So embedding it in some other file was originally a necessity for that. Now that we know how to fix that, it's just a matter of making sure that any other stuff the scanner needs is available from a .h file. regards, tom lane
Re: optimize lookups in snapshot [sub]xip arrays
On Thu, Aug 11, 2022 at 09:50:54AM +0700, John Naylor wrote: > I was waiting for all the Windows animals to report in, and it looks > like they have, so I've pushed 0002. Thanks for picking this topic up > again! Thanks for reviewing and committing. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: making relfilenodes 56 bits
On Tue, Aug 9, 2022 at 8:51 PM Robert Haas wrote: > > On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar wrote: > > I think even if we start the range from the 4 billion we can not avoid > > keeping two separate ranges for system and user tables otherwise the > > next upgrade where old and new clusters both have 56 bits > > relfilenumber will get conflicting files. And, for the same reason we > > still have to call SetNextRelFileNumber() during upgrade. > > Well, my proposal to move everything from the new cluster up to higher > numbers would address this without requiring two ranges. > > > So the idea is, we will be having 2 ranges for relfilenumbers, system > > range will start from 4 billion and user range maybe something around > > 4.1 (I think we can keep it very small though, just reserve 50k > > relfilenumber for system for future expansion and start user range > > from there). > > A disadvantage of this is that it basically means all the file names > in new clusters are going to be 10 characters long. That's not a big > disadvantage, but it's not wonderful. File names that are only 5-7 > characters long are common today, and easier to remember. That's correct. > > So now system tables have no issues and also the user tables from the > > old cluster have no issues. But pg_largeobject might get conflict > > when both old and new cluster are using 56 bits relfilenumber, because > > it is possible that in the new cluster some other system table gets > > that relfilenumber which is used by pg_largeobject in the old cluster. > > > > This could be resolved if we allocate pg_largeobject's relfilenumber > > from the user range, that means this relfilenumber will always be the > > first value from the user range. 
> > So now if the old and new cluster > > both are using 56bits relfilenumber then pg_largeobject in both > > cluster would have got the same relfilenumber and if the old cluster > > is using the current 32 bits relfilenode system then the whole range > > of the new cluster is completely different than that of the old > > cluster. > > I think this can work, but it does rely to some extent on the fact > that there are no other tables which need to be treated like > pg_largeobject. If there were others, they'd need fixed starting > RelFileNumber assignments, or some other trick, like renumbering them > twice in the cluster, first to a known-unused value and then back to > the proper value. You'd have trouble if in the other cluster > pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new > cluster the reverse, without some hackery. Agreed, if there are more catalogs like pg_largeobject then it would require some hacking. > I do feel like your idea here has some advantages - my proposal > requires rewriting all the catalogs in the new cluster before we do > anything else, and that's going to take some time even though they > should be small. But I also feel like it has some disadvantages: it > seems to rely on complicated reasoning and special cases more than I'd > like. One other advantage of your approach is that we start the "nextrelfilenumber" after the old cluster's relfilenumber range, so we only need to set the "nextrelfilenumber" once at the beginning; while upgrading each object we don't need to set it every time because it is already higher than the complete old cluster range. In the other two approaches we will have to try to set the nextrelfilenumber every time we preserve the relfilenumber during upgrade. Other than these two approaches we have another approach (what the patch set is already doing) where we keep the system relfilenumber range same as Oid. 
I know you have already pointed out that this might have some hidden bug, but one advantage of this approach is that it is simpler than the above two approaches, in the sense that it doesn't need to maintain two ranges and it also doesn't need to rewrite all system tables in the new cluster. So I think it would be good if we can get others' opinions on all these 3 approaches. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
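As a rough illustration of the two-range scheme from upthread — all names and constants below are invented for the sketch, not taken from any patch version — system relfilenumbers start at 4 billion, a small window (50k in the example above) is reserved for future system use, and the user range starts right after it, so pg_largeobject, allocated first from the user range, always gets the same value:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t RelFileNumber;

/* Invented constants for illustration: 4 billion + a 50k reserved window. */
#define SYSTEM_RANGE_START	((RelFileNumber) 4000000000ULL)
#define USER_RANGE_START	(SYSTEM_RANGE_START + 50000)

static RelFileNumber next_system = SYSTEM_RANGE_START;
static RelFileNumber next_user = USER_RANGE_START;

/* Hypothetical allocator: system catalogs and user tables draw from disjoint ranges. */
static RelFileNumber
alloc_relfilenumber(bool is_system_catalog)
{
	if (is_system_catalog)
	{
		/* The reserved system window must not spill into the user range. */
		assert(next_system < USER_RANGE_START);
		return next_system++;
	}
	return next_user++;
}
```

The point of the split is that both ranges sit entirely above the 32-bit Oid-based relfilenumbers an old cluster can contain, while the first user-range value is deterministic across clusters.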
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Wed, Aug 10, 2022 at 6:31 PM Simon Riggs wrote: > > On Wed, 10 Aug 2022 at 08:34, Dilip Kumar wrote: > > > > Am I still missing something? > > No, you have found a dependency between the patches that I was unaware > of. So there is no bug if you apply both patches. Right > > > So I still think some adjustment is required in XidInMVCCSnapdhot() > > That is one way to resolve the issue, but not the only one. I can also > change AssignTransactionId() to recursively register parent xids for > all of a subxid's parents. > > I will add in a test case and resolve the dependency in my next patch. Okay, thanks, I will look into the updated patch after you submit that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: use SSE2 for is_valid_ascii
On Thu, Aug 11, 2022 at 11:10:34AM +0700, John Naylor wrote: >> I wonder if reusing a zero vector (instead of creating a new one every >> time) has any noticeable effect on performance. > > Creating a zeroed register is just FOO PXOR FOO, which should get > hoisted out of the (unrolled in this case) loop, and which a recent > CPU will just map to a hard-coded zero in the register file, in which > case the execution latency is 0 cycles. :-)

Ah, indeed. At -O2, my compiler seems to zero out two registers before the loop with either approach:

    pxor    %xmm0, %xmm0    ; accumulator
    pxor    %xmm2, %xmm2    ; always zeros

And within the loop, I see the following:

    movdqu  (%rdi), %xmm1
    movdqu  (%rdi), %xmm3
    addq    $16, %rdi
    pcmpeqb %xmm2, %xmm1    ; check for zeros
    por     %xmm3, %xmm0    ; OR data into accumulator
    por     %xmm1, %xmm0    ; OR zero check results into accumulator
    cmpq    %rdi, %rsi

So the call to _mm_setzero_si128() within the loop is fine. Apologies for the noise. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Stats collector's idx_blks_hit value is highly misleading in practice
Hi again, Having played with the task for a little while, I am no longer sure it completely justifies the effort involved. The reason is that the task requires modifying the buffer pool in one way or the other, which implies (a) significant effort on performance testing and (b) changes in the buffer pool interfaces that the community might not welcome just to get 1-2 extra statistics numbers. Any ideas? Regards, Sergey
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Mon, Aug 8, 2022 at 9:34 AM Amit Kapila wrote: > > On Wed, Aug 3, 2022 at 1:20 PM Masahiko Sawada wrote: > > > > > > Oops, thanks for pointing it out. I've fixed it and attached updated > > patches for all branches so as not to confuse the patch version. There > > is no update from v12 patch on REL12 - master patches. > > > > Thanks for the updated patches, the changes look good to me. > Horiguchi-San, and others, do you have any further comments on this or > do you want to spend time in review of it? If not, I would like to > push this after the current minor version release. > Pushed. -- With Regards, Amit Kapila.
Re: shared-memory based stats collector - v70
Hi, On 8/10/22 11:25 PM, Greg Stark wrote: On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand wrote: Hi, On 8/9/22 6:00 PM, Greg Stark wrote: On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand wrote: What do you think about adding a function in core PG to provide such functionality? (means being able to retrieve all the stats (+ eventually add some filtering) without the need to connect to each database). I'm working on it myself too. I'll post a patch for discussion in a bit. Great! Thank you! So I was adding the code to pgstat.c because I had thought there were some data types I needed and/or static functions I needed. However you and Andres encouraged me to check again now. And indeed I was able, after fixing a couple things, to make the code work entirely externally. Nice! Though I still think to have an SQL API in core could be useful to. As Andres was not -1 about that idea (as it should be low cost to add and maintain) as long as somebody cares enough to write something: then I'll give it a try and submit a patch for it. This is definitely not polished and there's a couple obvious things missing. But at the risk of embarrassment I've attached my WIP. Please be gentle :) I'll post the github link in a bit when I've fixed up some meta info. Thanks! I will have a look at it on github (once you share the link). Regards, -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
Re: Perform streaming logical transactions by background workers and parallel apply
On Thu, Aug 4, 2022 at 12:07 PM wangw.f...@fujitsu.com wrote: > > On Thurs, Jul 28, 2022 at 13:20 PM Kuroda, Hayato/黒田 隼人 > wrote: > > > > Dear Wang-san, > > > > Hi, I'm also interested in the patch and I started to review this. > > Followings are comments about 0001. > > Thanks for your kind review and comments. > To avoid making this thread too long, I will reply to all of your comments > (#1~#13) in this email. > > > 1. terminology > > > > In your patch a new worker "apply background worker" has been introduced, > > but I thought it might be confusing because PostgreSQL already has the worker > > "background worker". > > Both the apply worker and apply bgworker are categorized as bgworkers. > > Do you have any reasons not to use "apply parallel worker" or "apply > > streaming > > worker"? > > (Note that I'm not a native English speaker) > > Since we will later consider applying non-streamed transactions in parallel, I > think "apply streaming worker" might not be very suitable. I think PostgreSQL > also has the worker "parallel worker", so for "apply parallel worker" and > "apply background worker", I feel that "apply background worker" will make the > relationship between workers more clear. ("[main] apply worker" and "apply > background worker") > But, on similar lines, we do have vacuumparallel.c for parallelizing index vacuum. I agree with Kuroda-San on this point that the currently proposed terminology doesn't sound very clear. The other options that come to my mind are "apply streaming transaction worker", "apply parallel worker" and the file name could be applystreamworker.c, applyparallel.c, applyparallelworker.c, etc. I see the point why you are hesitant in calling it "apply parallel worker" but it is quite possible that even for non-streamed xacts, we will share quite some part of this code. Thoughts? -- With Regards, Amit Kapila.