date:20240710

On Wed, Jul 10, 2024 at 01:38:06PM +, Bertrand Drouvot wrote:
> On Wed, Jul 10, 2024 at 03:02:34PM +0900, Michael Paquier wrote:
>> and I am troubled by the approach taken (mentioned down by you), but
>> that's invasive compared to how pgstats wants to be transparent with
>> its stats kinds.
>> 
>> +   Oid objoid; /* object ID, either table or function
>> or tablespace. */
>> +   RelFileNumber relfile;  /* relfilenumber for RelFileLocator. */
>>  } PgStat_HashKey;
>> 
>> This adds a relfilenode component to the central hash key used for the
>> dshash of pgstats, which is something most stats types don't care
>> about.
> 
> That's right but that's an existing behavior without the patch as:
> 
> PGSTAT_KIND_DATABASE does not care care about the objoid
> PGSTAT_KIND_REPLSLOT does not care care about the dboid
> PGSTAT_KIND_SUBSCRIPTION does not care care about the dboid
> 
> That's 3 kinds out of the 5 non fixed stats kind.

I'd like to think that this is just going to increase across time.

>> That looks like the incorrect thing to do to me, particularly
>> seeing a couple of lines down that a stats kind is assigned so the
>> HashKey uniqueness is ensured by the KindInfo:
>> +   [PGSTAT_KIND_RELFILENODE] = {
>> +   .name = "relfilenode",
> 
> You mean, just rely on kind, dboid and relfile to ensure uniqueness?

Or table OID for the objid, with a hardcoded number of past
relfilenodes stats stored, to limit bloating the dshash with too much
past stats.  See below.

> So, I think it makes sense to link the hashkey to all the RelFileLocator
> fields, means:
> 
> dboid (linked to RelFileLocator's dbOid)
> objoid (linked to RelFileLocator's spcOid)
> relfile (linked to RelFileLocator's relNumber)

Hmm.  How about using the table OID as objoid, but store in the stats
of the new KindInfo an array of entries with the relfilenodes (current
and past, perhaps with more data than the relfilenode to ensure the
uniqueness tracking) and each of its stats?  The number of past
relfilenodes would be fixed, meaning that there would be a strict
control with the retention of the past stats.  When a table is
dropped, removing its relfilenode stats would be as cheap as when its
PGSTAT_KIND_RELATION is dropped.

> Yeah, I also thought about keeping a list of "previous" relfilenodes stats 
> for a
> relation but that would lead to:
> 
> 1. Keep previous relfilnode stats 
> 2. A more complicated way to look at relation stats (as you said)
> 3. Extra memory usage
> 
> I think the only reason "previous" relfilenode stats are needed is to provide
> accurate stats for the relation. Outside of this need, I don't think we would
> want to retrieve "individual" previous relfilenode stats in the past.
> 
> That's why the POC patch "simply" copies the stats to the relation during a
> rewrite (before getting rid of the "previous" relfilenode stats).

Hmm.  Okay.
--
Michael


signature.asc
Description: PGP signature

Re: Speed up Hash Join by teaching ExprState about hashing

On Mon, 13 May 2024 at 21:23, David Rowley  wrote:
> In master, if you look at ExecHashGetHashValue() in nodeHash.c, you
> can see that it calls ExecEvalExpr() and then manually calls the hash
> function on the returned value. This process is repeated once for each
> hash key. This is inefficient for a few reasons:
>
> 1) ExecEvalExpr() will only deform tuples up the max varattno that's
> mentioned in the hash key.  That means we might have to deform
> attributes in multiple steps, once for each hash key.
> 2) ExecHashGetHashValue() is very branchy and checks if hashStrict[]
> and keep_nulls on every loop.  There's also a branch to check which
> hash functions to use.
> 3) foreach isn't exactly the pinnacle of efficiency either.
>
> All of the above points can be improved by making ExprState handle
> hashing. This means we'll deform all attributes that are needed for
> hashing once, rather than incrementally once per key. This also allows
> JIT compilation of hashing ExprStates, which will make things even
> faster.
>
> The attached patch implements this. Here are some performance numbers.

I've been doing a bit more work on this to start to add support for
faster hashing for hashing needs other than Hash Join.  In the
attached, I've added support to give the hash value an initial value.
Support for that is required to allow Hash Aggregate to work. If you
look at what's being done now inside BuildTupleHashTableExt(), you'll
see that "hash_iv" exists there to allow an initial hash value. This
seems to be getting used to allow some variation in hash values
calculated inside parallel workers, per hashtable->hash_iv =
murmurhash32(ParallelWorkerNumber).  One of my aims for this patch is
to always produce the same hash value before and after the patch, so
I've gone and implemented the equivalent functionality which can be
enabled or disabled as required depending on the use case.

I've not added support for Hash Aggregate quite yet. I did look at
doing that, but it seems to need quite a bit of refactoring to do it
nicely.  The problem is that BuildTupleHashTableExt() receives
keyColIdx with the attribute numbers to hash.  The new
ExecBuildHash32Expr() function requires a List of Exprs.  It looks
like the keyColIdx array comes directly from the planner which is many
layers up and would need lots of code churn of function signatures to
change.  While I could form Vars using the keyColIdx array to populate
the required List of Exprs, I so far can't decide where exactly that
should happen. I think probably the planner should form the Expr List.
It seems a bit strange to be doing makeVar() in the executor.

I currently think that it's fine to speed up Hash Join as phase one
for this patch.  I can work more on improving hash value generation in
other locations later.

I'd be happy if someone else were to give this patch a review and
test. One part I struggled a bit with was finding a way to cast the
Size variable down to uint32 in LLVM. I tried to add a new supported
type for uint32 but just couldn't get it to work.  Instead, I did:

v_tmp1 = LLVMBuildAnd(b, v_tmp1,
  l_sizet_const(0x), "");

which works and I imagine compiled to the same code as a cast.  It
just looks a bit strange.

David

v2-0001-Speed-up-Hash-Join-by-making-ExprStates-hash.patch
Description: Binary data

Re: Improving the latch handling between logical replication launcher and worker processes.

2024-07-10 Thread vignesh C

On Mon, 8 Jul 2024 at 17:46, vignesh C  wrote:
>
> On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas  wrote:
> >
> > On 05/07/2024 14:07, vignesh C wrote:
> > > On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas  wrote:
> > >>
> > >> I'm don't quite understand the problem we're trying to fix:
> > >>
> > >>> Currently the launcher's latch is used for the following: a) worker
> > >>> process attach b) worker process exit and c) subscription creation.
> > >>> Since this same latch is used for multiple cases, the launcher process
> > >>> is not able to handle concurrent scenarios like: a) Launcher started a
> > >>> new apply worker and waiting for apply worker to attach and b) create
> > >>> subscription sub2 sending launcher wake up signal. In this scenario,
> > >>> both of them will set latch of the launcher process, the launcher
> > >>> process is not able to identify that both operations have occurred 1)
> > >>> worker is attached 2) subscription is created and apply worker should
> > >>> be started. As a result the apply worker does not get started for the
> > >>> new subscription created immediately and gets started after the
> > >>> timeout of 180 seconds.
> > >>
> > >> I don't see how we could miss a notification. Yes, both events will set
> > >> the same latch. Why is that a problem?
> > >
> > > The launcher process waits for the apply worker to attach at
> > > WaitForReplicationWorkerAttach function. The launcher's latch is
> > > getting set concurrently for another create subscription and apply
> > > worker attached. The launcher now detects the latch is set while
> > > waiting at WaitForReplicationWorkerAttach, it resets the latch and
> > > proceed to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE(as
> > > the latch has already been reset). Further details are provided below.
> > >
> > > The loop will see that the new
> > >> worker has attached, and that the new subscription has been created, and
> > >> process both events. Right?
> > >
> > > Since the latch is reset at WaitForReplicationWorkerAttach, it skips
> > > processing the create subscription event.
> > >
> > > Slightly detailing further:
> > > In the scenario when we execute two concurrent create subscription
> > > commands, first CREATE SUBSCRIPTION sub1, followed immediately by
> > > CREATE SUBSCRIPTION sub2.
> > > In a few random scenarios, the following events may unfold:
> > > After the first create subscription command(sub1), the backend will
> > > set the launcher's latch because of ApplyLauncherWakeupAtCommit.
> > > Subsequently, the launcher process will reset the latch and identify
> > > the addition of a new subscription in the pg_subscription list. The
> > > launcher process will proceed to request the postmaster to start the
> > > apply worker background process (sub1) and request the postmaster to
> > > notify the launcher once the apply worker(sub1) has been started.
> > > Launcher will then wait for the apply worker(sub1) to attach in the
> > > WaitForReplicationWorkerAttach function.
> > > Meanwhile, the second CREATE SUBSCRIPTION command (sub2) which was
> > > executed concurrently, also set the launcher's latch(because of
> > > ApplyLauncherWakeupAtCommit).
> > > Simultaneously when the launcher remains in the
> > > WaitForReplicationWorkerAttach function waiting for apply worker of
> > > sub1 to start, the postmaster will start the apply worker for
> > > subscription sub1 and send a SIGUSR1 signal to the launcher process
> > > via ReportBackgroundWorkerPID. Upon receiving this signal, the
> > > launcher process will then set its latch.
> > >
> > > Now, the launcher's latch has been set concurrently after the apply
> > > worker for sub1 is started and the execution of the CREATE
> > > SUBSCRIPTION sub2 command.
> > >
> > > At this juncture, the launcher, which had been awaiting the attachment
> > > of the apply worker, detects that the latch is set and proceeds to
> > > reset it. The reset operation of the latch occurs within the following
> > > section of code in WaitForReplicationWorkerAttach:
> > > ...
> > > rc = WaitLatch(MyLatch,
> > > WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > > 10L, WAIT_EVENT_BGWORKER_STARTUP);
> > >
> > > if (rc & WL_LATCH_SET)
> > > {
> > > ResetLatch(MyLatch);
> > > CHECK_FOR_INTERRUPTS();
> > > }
> > > ...
> > >
> > > After resetting the latch here, the activation signal intended to
> > > start the apply worker for subscription sub2 is no longer present. The
> > > launcher will return to the ApplyLauncherMain function, where it will
> > > await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
> > > processing the create subscription request (i.e., creating a new apply
> > > worker for sub2).
> > > The issue arises from the latch being reset in
> > > WaitForReplicationWorkerAttach, which can occasionally delay the
> > > synchronization of table data for the subscription.
> >
> > Ok, I see it now. Thanks for the explanation. So the problem isn't using
> > the

Re: improve performance of pg_dump with many sequences

2024-07-10 Thread Euler Taveira

On Wed, Jul 10, 2024, at 7:05 PM, Nathan Bossart wrote:
> I'm not following why that would be a better approach.  strncpy() will add
> a NUL to the end of the string unless it doesn't fit in the buffer, in
> which case we'll add our own via "seqtype[sizeof(seqtype) - 1] = '\0'".
> Furthermore, the compiler can determine the position where the NUL should
> be placed, whereas placing it at the end of the copied string requires a
> runtime strlen().

Nevermind, you are copying the whole buffer (n = sizeof(seqtype)).

> Unfortunately, I think we have to keep this workaround since older minor
> releases of PostgreSQL don't have the fix.

Hmm. Right.

> What pg_dump command did you test here?  Did you dump the sequence data, or
> was this --schema-only?

time pg_dump -f - -s -d postgres


--
Euler Taveira
EDB   https://www.enterprisedb.com/

RE: Conflict detection and logging in logical replication

2024-07-10 Thread Zhijie Hou (Fujitsu)

On Wednesday, July 10, 2024 5:39 PM shveta malik  wrote:
> 
> On Wed, Jul 3, 2024 at 8:31 AM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Wednesday, June 26, 2024 10:58 AM Zhijie Hou (Fujitsu)
>  wrote:
> > >
> >
> > Hi,
> >
> > As suggested by Sawada-san in another thread[1].
> >
> > I am attaching the V4 patch set which tracks the delete_differ
> > conflict in logical replication.
> >
> > delete_differ means that the replicated DELETE is deleting a row that
> > was modified by a different origin.
> >
> 
> Thanks for the patch. I am still in process of review but please find few
> comments:

Thanks for the comments!

> 1) When I try to *insert* primary/unique key on pub, which already exists on
> sub, conflict gets detected. But when I try to *update* primary/unique key to 
> a
> value on pub which already exists on sub, conflict is not detected. I get the
> error:
> 
> 2024-07-10 14:21:09.976 IST [647678] ERROR:  duplicate key value violates
> unique constraint "t1_pkey"
> 2024-07-10 14:21:09.976 IST [647678] DETAIL:  Key (pk)=(4) already exists.

Yes, I think the detection of this conflict is not added with the
intention to control the size of the patch in the first version.

> 
> This is because such conflict detection needs detection of constraint 
> violation
> using the *new value* rather than *existing* value during UPDATE. INSERT
> conflict detection takes care of this case i.e. the columns of incoming row 
> are
> considered as new values and it tries to see if all unique indexes are okay to
> digest such new values (all incoming columns) but update's logic is different.
> It searches based on oldTuple *only* and thus above detection is missing.

I think the logic is the same if we want to detect the unique violation
for UDPATE, we need to check if the new value of the UPDATE violates any
unique constraints same as the detection of insert_exists (e.g. check
the conflict around ExecInsertIndexTuples())

> 
> Shall we support such detection? If not, is it worth docuementing?

I am personally OK to support this detection. And
I think it's already documented that we only detect unique violation for
insert which mean update conflict is not detected.

> 2)
> Another case which might confuse user:
> 
> CREATE TABLE t1 (pk integer primary key, val1 integer, val2 integer);
> 
> On PUB: insert into t1 values(1,10,10); insert into t1 values(2,20,20);
> 
> On SUB: update t1 set pk=3 where pk=2;
> 
> Data on PUB: {1,10,10}, {2,20,20}
> Data on SUB: {1,10,10}, {3,20,20}
> 
> Now on PUB: update t1 set val1=200 where val1=20;
> 
> On Sub, I get this:
> 2024-07-10 14:44:00.160 IST [648287] LOG:  conflict update_missing detected
> on relation "public.t1"
> 2024-07-10 14:44:00.160 IST [648287] DETAIL:  Did not find the row to be
> updated.
> 2024-07-10 14:44:00.160 IST [648287] CONTEXT:  processing remote data for
> replication origin "pg_16389" during message type "UPDATE" for replication
> target relation "public.t1" in transaction 760, finished at 0/156D658
> 
> To user, it could be quite confusing, as val1=20 exists on sub but still he 
> gets
> update_missing conflict and the 'DETAIL' is not sufficient to give the 
> clarity. I
> think on HEAD as well (have not tested), we will get same behavior i.e. update
> will be ignored as we make search based on RI (pk in this case). So we are not
> worsening the situation, but now since we are detecting conflict, is it 
> possible
> to give better details in 'DETAIL' section indicating what is actually 
> missing?

I think It's doable to report the row value that cannot be found in the local
relation, but the concern is the potential risk of exposing some
sensitive data in the log. This may be OK, as we are already reporting the
key value for constraints violation, so if others also agree, we can add
the row value in the DETAIL as well.

Best Regards,
Hou zj

Re: Parent/child context relation in pg_get_backend_memory_contexts()

On Thu, 11 Jul 2024 at 09:19, Robert Haas  wrote:
> FWIW, I would have done what Melih did. A path normally is listed in
> root-to-leaf order, not leaf-to-root.

Melih and I talked about this in a meeting yesterday evening. I think
I'm about on the fence about having the IDs in leaf-to-root or
root-to-leaf.  My main concern about which order is chosen is around
how easy it is to write hierarchical queries. I think I'd feel better
about having it in root-to-leaf order if "level" was 1-based rather
than 0-based. That would allow querying CacheMemoryContext and all of
its descendants with:

WITH c AS (SELECT * FROM pg_backend_memory_contexts)
SELECT c1.*
FROM c c1, c c2
WHERE c2.name = 'CacheMemoryContext'
AND c1.path[c2.level] = c2.path[c2.level];

(With the v6 patch, you have to do level + 1.)

Ideally, no CTE would be needed here, but unfortunately, there's no
way to know the CacheMemoryContext's ID beforehand.  We could make the
ID more stable if we did a breadth-first traversal of the context.
i.e., assign IDs in level order.  This would stop TopMemoryContext's
2nd child getting a different ID if its first child became a parent
itself.

This allows easier ad-hoc queries, for example:

select * from pg_backend_memory_contexts;
-- Observe that CacheMemoryContext has ID=22 and level=2. Get the
total of that and all of its descendants.
select sum(total_bytes) from pg_backend_memory_contexts where path[2] = 22;
-- or just it and direct children
select sum(total_bytes) from pg_backend_memory_contexts where path[2]
= 22 and level <= 3;

Without the breadth-first assignment of context IDs, the sum() would
cause another context to be created for aggregation and the 2nd query
wouldn't work. Of course, it doesn't make it 100% guaranteed to be
stable, but it's way less annoying to write ad-hoc queries. It's more
stable the closer to the root you're interested in, which seems (to
me) the most likely area of interest for most people.

> On Fri, Jul 5, 2024 at 4:06 AM David Rowley  wrote:
> > I also imagined "path" would be called "context_ids". I thought that
> > might better indicate what the column is without consulting the
> > documentation.
>
> The only problem I see with this is that it doesn't make it clear that
> we're being shown parentage or ancestry, rather than values for the
> current node. I suspect path is fairly understandable, but if you
> don't like that, what about parent_ids?

I did a bit more work in the attached.  I changed "level" to be
1-based and because it's the column before "path" I find it much more
intuitive (assuming no prior knowledge) that the "path" column relates
to "level" somehow as it's easy to see that "level" is the same number
as the number of elements in "path". With 0-based levels, that's not
the case.

Please see the attached patch.  I didn't update any documentation.

David

v8_add_path_to_pg_backend_memory_contexts.patch
Description: Binary data

Re: Allow non-superuser to cancel superuser tasks.

On Wed, Jul 10, 2024 at 10:57:45PM +0500, Kirill Reshke wrote:
> Hi, that's for digging into this. Turns out I completely missed one of
> your emails today morning.

Don't worry.  Using this domain tends to put my emails in one's spam
folder.
--
Michael


signature.asc
Description: PGP signature

Re: POC, WIP: OR-clause support for indexes

2024-07-10 Thread Alena Rybakina


On 27.06.2024 23:06, Alena Rybakina wrote:
Tobe honest,I've alreadystartedwritingcodetodothis,butI'm facedwitha 
misunderstandingof howto correctlycreatea 
conditionfor"OR"expressionsthatare notsubjectto transformation.


For example,the expressions b=1in the query below:

alena@postgres=# explain select * from x where ( (a =5 or a=4) and a 
= ANY(ARRAY[5,4])) or (b=1); QUERY PLAN 
-- 
Seq Scan on x (cost=0.00..123.00 rows=1 width=8) Filter: a = 5) 
OR (a = 4)) AND (a = ANY ('{5,4}'::integer[]))) OR (b = 1)) (2 rows)


I see that two expressions have remained unchanged and it only works 
for "AND" binary operations.


But I think it might be worth applying this together, where does the 
optimizer generate indexes (build_paths_for_OR function)?


Iimplementedsuchcode,butatthe 
analysisstageinplanner,anditwasn'tfullyreadyyet,butIwas ableto 
drawsomeimportantconclusions.Firstof all,Ifacedtheproblemof the 
inequalityof the numberof columnsinthe expressionwiththe 
requiredone,atleastsomeextracolumnappeared,judgingby the 
crust.Ihaven'tfullyrealizedityet andhaven'tfixedit.


#0  __pthread_kill_implementation (no_tid=0, signo=6, 
threadid=134300960061248)

    at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=134300960061248) at 
./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=134300960061248, signo=signo@entry=6) 
at ./nptl/pthread_kill.c:89
#3  0x7a2560042476 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/posix/raise.c:26

#4  0x7a25600287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x5573f9df62a8 in ExceptionalCondition (
    conditionName=0x5573f9fec4c8 
"AttrNumberIsForUserDefinedAttr(list_attnums[i]) || 
!bms_is_member(attnum, clauses_attnums)", fileName=0x5573f9fec11c 
"dependencies.c", lineNumber=1525) at assert.c:66
#6  0x5573f9b8b85f in dependencies_clauselist_selectivity 
(root=0x5573fad534e8,
    clauses=0x5573fad0b2d8, varRelid=0, jointype=JOIN_INNER, 
sjinfo=0x0, rel=0x5573fad54b38,

    estimatedclauses=0x7ffe2e43f178) at dependencies.c:1525
#7  0x5573f9b8fed9 in statext_clauselist_selectivity 
(root=0x5573fad534e8, clauses=0x5573fad0b2d8,
    varRelid=0, jointype=JOIN_INNER, sjinfo=0x0, rel=0x5573fad54b38, 
estimatedclauses=0x7ffe2e43f178,

    is_or=false) at extended_stats.c:2035
--Type  for more, q to quit, c to continue without paging--
#8  0x5573f9a57f88 in clauselist_selectivity_ext 
(root=0x5573fad534e8, clauses=0x5573fad0b2d8,
    varRelid=0, jointype=JOIN_INNER, sjinfo=0x0, 
use_extended_stats=true) at clausesel.c:153
#9  0x5573f9a57e30 in clauselist_selectivity (root=0x5573fad534e8, 
clauses=0x5573fad0b2d8,

    varRelid=0, jointype=JOIN_INNER, sjinfo=0x0) at clausesel.c:106
#10 0x5573f9a62e03 in set_baserel_size_estimates 
(root=0x5573fad534e8, rel=0x5573fad54b38)

    at costsize.c:5247
#11 0x5573f9a51aa5 in set_plain_rel_size (root=0x5573fad534e8, 
rel=0x5573fad54b38,

    rte=0x5573fad0ec58) at allpaths.c:581
#12 0x5573f9a516ce in set_rel_size (root=0x5573fad534e8, 
rel=0x5573fad54b38, rti=1,

    rte=0x5573fad0ec58) at allpaths.c:411
#13 0x5573f9a514c7 in set_base_rel_sizes (root=0x5573fad534e8) at 
allpaths.c:322
#14 0x5573f9a5119d in make_one_rel (root=0x5573fad534e8, 
joinlist=0x5573fad0adf8) at allpaths.c:183

#15 0x5573f9a94d45 in query_planner (root=0x5573fad534e8,
    qp_callback=0x5573f9a9b59e , 
qp_extra=0x7ffe2e43f540) at planmain.c:280
#16 0x5573f9a977a8 in grouping_planner (root=0x5573fad534e8, 
tuple_fraction=0, setops=0x0)

    at planner.c:1520
#17 0x5573f9a96e47 in subquery_planner (glob=0x5573fad533d8, 
parse=0x5573fad0ea48, parent_root=0x0,

    hasRecursion=false, tuple_fraction=0, setops=0x0) at planner.c:1089
#18 0x5573f9a954aa in standard_planner (parse=0x5573fad0ea48,
    query_string=0x5573fad8b3b0 "explain analyze SELECT * FROM 
functional_dependencies WHERE ((a * 2) = 2 OR (a * 2) = 102) AND 
upper(b) = '1'", cursorOptions=2048, boundParams=0x0) at planner.c:415

#19 0x5573f9a951d4 in planner (parse=0x5573fad0ea48,
--Type  for more, q to quit, c to continue without paging--
    query_string=0x5573fad8b3b0 "explain analyze SELECT * FROM 
functional_dependencies WHERE ((a * 2) = 2 OR (a * 2) = 102) AND 
upper(b) = '1'", cursorOptions=2048, boundParams=0x0) at planner.c:282

#20 0x5573f9bf4e2e in pg_plan_query (querytree=0x5573fad0ea48,
    query_string=0x5573fad8b3b0 "explain analyze SELECT * FROM 
functional_dependencies WHERE ((a * 2) = 2 OR (a * 2) = 102) AND 
upper(b) = '1'", cursorOptions=2048, boundParams=0x0) at postgres.c:904
#21 0x5573f98613e7 in standard_ExplainOneQuery 
(query=0x5573fad0ea48, cursorOptions=2048, into=0x0,

    es=0x5573fad57da0,
    queryString=0x5573fad8b3b0 "explain analyze SELECT * FROM 
functional_dependencies WHERE ((a * 2) = 2 OR (a * 2) = 102) AND 
upper(b) = '1'", params=0x0, queryEnv=0x0) at explain.c:489
#22

Re: improve performance of pg_dump with many sequences

On Wed, Jul 10, 2024 at 05:08:56PM -0300, Euler Taveira wrote:
> Nice improvement. The numbers for a realistic scenario (10k sequences) are

Thanks for taking a look!

> You are changing internal representation from char to int64. Is the main goal 
> to
> validate catalog data? What if there is a new sequence data type whose
> representation is not an integer?

IIRC 0001 was primarily intended to reduce the amount of memory needed for
the sorted table.  Regarding a new sequence data type, I'm assuming we'll
have much bigger fish to fry if we do that (e.g., pg_sequence uses int8 for
the values), and I'd hope that adjusting this code wouldn't be too
difficult, anyway.

> This code path is adding zero byte to the last position of the fixed string. I
> suggest that the zero byte is added to the position after the string length.

I'm not following why that would be a better approach.  strncpy() will add
a NUL to the end of the string unless it doesn't fit in the buffer, in
which case we'll add our own via "seqtype[sizeof(seqtype) - 1] = '\0'".
Furthermore, the compiler can determine the position where the NUL should
be placed, whereas placing it at the end of the copied string requires a
runtime strlen().

> l = strlen(PQgetvalue(res, 0, 0));
> Assert(l < sizeof(seqtype));
> strncpy(seqtype, PQgetvalue(res, 0, 0), l);
> seqtype[l] = '\0';

I think the strncpy() should really be limited to the size of the seqtype
buffer.  IMHO an Assert is not sufficient.

> If you are not planning to apply 0003, make sure you fix collectSequences() to
> avoid versions less than 10. Move this part to 0002.

Yeah, no need to create the table if we aren't going to use it.

> Since you apply a fix for pg_sequence_last_value function, you can simplify 
> the
> query in 0003. CASE is not required.

Unfortunately, I think we have to keep this workaround since older minor
releases of PostgreSQL don't have the fix.

> patched (0001 and 0002):
> real 0m0,290s
> user 0m0,038s
> sys 0m0,104s
> 
> I'm not sure if 0003 is worth. Maybe if you have another table like you
> suggested.

What pg_dump command did you test here?  Did you dump the sequence data, or
was this --schema-only?

-- 
nathan

Re: improve predefined roles documentation

Committed.  Thank you for reviewing!

-- 
nathan

Re: Parent/child context relation in pg_get_backend_memory_contexts()

On Fri, Jul 5, 2024 at 4:06 AM David Rowley  wrote:
> I've been wondering about the order of the "path" column.  When we
> talked, I had in mind that the TopMemoryContext should always be at
> the end of the array rather than the start, but I see you've got it
> the other way around.

FWIW, I would have done what Melih did. A path normally is listed in
root-to-leaf order, not leaf-to-root.

> I also imagined "path" would be called "context_ids". I thought that
> might better indicate what the column is without consulting the
> documentation.

The only problem I see with this is that it doesn't make it clear that
we're being shown parentage or ancestry, rather than values for the
current node. I suspect path is fairly understandable, but if you
don't like that, what about parent_ids?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Parent/child context relation in pg_get_backend_memory_contexts()

On Wed, Apr 3, 2024 at 7:34 PM Michael Paquier  wrote:
> I've been re-reading the patch again to remember what this is about,
> and I'm OK with having this "path" column in the catalog.  However,
> I'm somewhat confused by the choice of having a temporary number that
> shows up in the catalog representation, because this may not be
> constant across multiple calls so this still requires a follow-up
> temporary ID <-> name mapping in any SQL querying this catalog.  A
> second thing is that array does not show the hierarchy of the path;
> the patch relies on the order of the elements in the output array
> instead.

This complaint doesn't seem reasonable to me. The point of the path,
as I understand it, is to allow the caller to make sense of the
results of a single call, which is otherwise impossible. Stability
across multiple calls would be much more difficult, particularly
because we have no unique, long-lived identifier for memory contexts,
except perhaps the address of the context. Exposing the pointer
address of the memory contexts to clients would be an extremely bad
idea from a security point of view -- and it also seems unnecessary,
because the point of this function is to get a clear snapshot of
memory usage at a particular moment, not to track changes in usage by
the same contexts over time. You could still build the latter on top
of this if you wanted to do that, but I don't think most people would,
and I don't think the transient path IDs make it any more difficult.

I feel like Melih has chosen a simple and natural representation and I
would have done pretty much the same thing. And AFAICS there's no
reasonable alternative design.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

On Wed, Jul 3, 2024 at 1:07 PM Robert Haas  wrote:
> I think the problem here is that the WAL summarizer believes that when
> a new timeline appears, it should pick up from where the old timeline
> ended. And here, that doesn't happen: the new timeline branches off
> before the end of the old timeline, because of the recovery target.
>
> I'm not yet sure what should be done about this. The obvious answer is
> "remove the assertion," and maybe that is all we need to do. However,
> I'm not quite sure what the actual behavior will be if we just do
> that, so I think more investigation is needed. I'll keep looking at
> this, although given the US holiday I may not have results until next
> week.

Here is a draft patch that attempts to fix this problem. I'm not
certain that it's completely correct, but it does seem to fix the
reported issue.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Fix-WAL-summarization-across-abrupt-timeline-swit.patch
Description: Binary data

Re: improve performance of pg_dump with many sequences

2024-07-10 Thread Euler Taveira

On Tue, Jul 9, 2024, at 4:11 PM, Nathan Bossart wrote:
> rebased

Nice improvement. The numbers for a realistic scenario (10k sequences) are

for i in `seq 1 1`; do echo "CREATE SEQUENCE s$i;"; done > /tmp/s.sql

master:
real 0m1,141s
user 0m0,056s
sys 0m0,147s

patched:
real 0m0,410s
user 0m0,045s
sys 0m0,103s

You are changing internal representation from char to int64. Is the main goal to
validate catalog data? What if there is a new sequence data type whose
representation is not an integer?

This code path is adding zero byte to the last position of the fixed string. I
suggest that the zero byte is added to the position after the string length.

Assert(strlen(PQgetvalue(res, 0, 0)) < sizeof(seqtype));
strncpy(seqtype, PQgetvalue(res, 0, 0), sizeof(seqtype));
seqtype[sizeof(seqtype) - 1] = '\0';

Something like

l = strlen(PQgetvalue(res, 0, 0));
Assert(l < sizeof(seqtype));
strncpy(seqtype, PQgetvalue(res, 0, 0), l);
seqtype[l] = '\0';

Another suggestion is to use a constant for seqtype

char seqtype[MAX_SEQNAME_LEN];

and simplify the expression:

size_t  seqtype_sz = sizeof(((SequenceItem *) 0)->seqtype);

If you are not planning to apply 0003, make sure you fix collectSequences() to
avoid versions less than 10. Move this part to 0002.

@@ -17233,11 +17235,24 @@ collectSequences(Archive *fout)   
PGresult   *res;
const char *query;  

+   if (fout->remoteVersion < 10)   
+   return; 
+ 

Since you apply a fix for pg_sequence_last_value function, you can simplify the
query in 0003. CASE is not required.

I repeated the same test but not applying 0003.

patched (0001 and 0002):
real 0m0,290s
user 0m0,038s
sys 0m0,104s

I'm not sure if 0003 is worth. Maybe if you have another table like you
suggested.


--
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: pg_maintain and USAGE privilege on schema

On Wed, Jul 10, 2024 at 12:29:00PM -0700, Jeff Davis wrote:
>> > It might be reasonable to give implicit USAGE privileges on all
>> > schemas
>> > during maintenance commands to pg_maintain roles.
> 
> That's an even more specific exception: having USAGE only in the
> context of a maintenance command. I think that's a new concept, right?

AFAIK yes.

-- 
nathan

Re: Simplifying width_bucket_numeric()

2024-07-10 Thread Dean Rasheed

On Sun, 7 Jul 2024 at 13:43, Joel Jacobson  wrote:
>
> > SELECT hash_array(array_agg(width_bucket(op, b1, b2, c))) FROM t;
> > -- Result not changed by patch
>
> Same hash_array on all my three machines:
>
> > SELECT sum(width_bucket(op, b1, b2, c)) FROM t;
> > Time: 3658.962 ms (00:03.659)  -- HEAD
> > Time: 3089.946 ms (00:03.090)  -- with patch
>
> Significant improvement on all my three machines:
>

Thanks for testing. I have committed this now.

(I also realised that the "reversed_bounds" code was unnecessary,
since the only effect was to negate both inputs to div_var(), so the
signs cancel out.)

Regards,
Dean

Re: pg_maintain and USAGE privilege on schema

2024-07-10 Thread Jeff Davis

On Wed, 2024-07-10 at 17:13 +0900, Fujii Masao wrote:
> ISTM that both
> pg_read_all_data and pg_write_all_data roles are defined similarly,
> with USAGE rights on all schemas.

I'm not so sure that was a great idea to begin with. If you create a
private schema with a SECURITY DEFINER function in it, it's a bit odd
to allow someone with pg_read_all_data to execute it. Granted, that's
documented behavior, but I'm not sure the privileges should be bundled
in that fashion.

> > It might be reasonable to give implicit USAGE privileges on all
> > schemas
> > during maintenance commands to pg_maintain roles.

That's an even more specific exception: having USAGE only in the
context of a maintenance command. I think that's a new concept, right?

Regards,
Jeff Davis

Re: Missed opportunity for bsearch() in TransactionIdIsCurrentTransactionId()?

On Wed, Jul 10, 2024 at 05:00:13PM +0200, Antonin Houska wrote:
> I don't quite understand why TransactionIdIsCurrentTransactionId() implements
> binary search in ParallelCurrentXids "from scratch" instead of using
> bsearch().

I believe there are a number of open-coded binary searches in the tree.  My
concern with switching them to bsearch() would be the performance impact of
using a function pointer for the comparisons.  Perhaps we could add a
couple of inlined binary search implementations for common types to replace
many of the open-coded ones.

-- 
nathan

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

On Wed, Jul 10, 2024 at 11:11:13AM -0400, Tom Lane wrote:
> We went through a ton of permutations of that kind of
> idea years ago, when it first became totally clear that cross-checks
> between GUCs do not work nicely if implemented in check_hooks.
> (You can find all the things we tried in the commit log, although
> I don't recall exactly when.)

Do you remember the general timeframe or any of the GUCs involved?  I spent
some time searching through the commit log and mailing lists, but I've thus
far only found allusions to past bad experiences with such hooks.

-- 
nathan

Re: Allow non-superuser to cancel superuser tasks.

2024-07-10 Thread Kirill Reshke

Hi, that's for digging into this. Turns out I completely missed one of
your emails today morning.

On Wed, 10 Jul 2024 at 10:15, Michael Paquier  wrote:
> And then the timestamp of the tests:
> [12:48:23.277](0.058s) not ok 2 - autovacuum worker signaled with
> pg_signal_autovacuum_worker granted
>
> We check for the contents of the logs 1ms before they are generated,
> hence failing the lookup check because the test is faster than the
> backend.
>
> Like what we are doing in 010_pg_basebackup.pl, we could do a
> poll_query_until() until the PID of the autovacuum worker is gone from
> pg_stat_activity before fetching the logs as ProcessInterrupts() stuff
> would be logged before the process exits, say:
> +# Wait for the autovacuum worker to exit before scanning the logs.
> +$node->poll_query_until('postgres',
> +   "SELECT count(*) = 0 FROM pg_stat_activity "
> +   . "WHERE pid = $av_pid AND backend_type = 'autovacuum worker';");
>
> That gives something like the attached.  Does that look correct to
> you?
> --
> Michael

+1.

Re: long-standing data loss bug in initial sync of logical replication

2024-07-10 Thread Nitin Motiani

On Wed, Jul 10, 2024 at 10:39 PM vignesh C  wrote:
>
> On Wed, 10 Jul 2024 at 12:28, Amit Kapila  wrote:
> > The patch missed to use the ShareRowExclusiveLock for partitions, see
> > attached. I haven't tested it but they should also face the same
> > problem. Apart from that, I have changed the comments in a few places
> > in the patch.
>
> I could not hit the updated ShareRowExclusiveLock changes through the
> partition table, instead I could verify it using the inheritance
> table. Added a test for the same and also attaching the backbranch
> patch.
>

Hi,

I tested alternative-experimental-fix-lock.patch provided by Tomas
(replaces SUE with SRE in OpenTableList). I believe there are a couple
of scenarios the patch does not cover.

1. It doesn't handle the case of "ALTER PUBLICATION  ADD TABLES
IN SCHEMA  ".

I took crash-test.sh provided by Tomas and modified it to add all
tables in the schema to publication using the following command :

   ALTER PUBLICATION p ADD TABLES IN SCHEMA  public

The modified script is attached (crash-test-with-schema.sh). With this
script, I can reproduce the issue even with the patch applied. This is
because the code path to add a schema to the publication doesn't go
through OpenTableList.

I have also attached a script run-test-with-schema.sh to run
crash-test-with-schema.sh in a loop with randomly generated parameters
(modified from run.sh provided by Tomas).

2.  The second issue is a deadlock which happens when the alter
publication command is run for a comma separated list of tables.

I created another script create-test-tables-order-reverse.sh. This
script runs a command like the following :

ALTER PUBLICATION p ADD TABLE test_2,test_1

Running the above script, I was able to get a deadlock error (the
output is attached in deadlock.txt). In the alter publication command,
I added the tables in the reverse order to increase the probability of
the deadlock. But it should happen with any order of tables.

I am not sure if the deadlock is a major issue because detecting the
deadlock is better than data loss. The schema issue is probably more
important. I didn't test it out with the latest patches sent by
Vignesh but since the code changes in that patch are also in
OpenTableList, I think the schema scenario won't be covered by those.

Thanks & Regards,
Nitin Motiani
Google
nitinmotiani@nitinmotiani:~/vanila_postgres/postgresql$ 
./crash-test-tables-order-reverse.sh test-data-dir-reverse-proper 2 2 5
/usr/local/pgsql/bin:/usr/local/pgsql/bin:/usr/local/pgsql/bin:/usr/lib/google-golang/bin:/usr/local/buildtools/java/jdk/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[Wed Jul 10 07:50:10 AM UTC 2024] [1720597810] NUMTABLES=2  SLEEP=5  REFRESH=2
The files belonging to this database system will be owned by user 
"nitinmotiani".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory test-data-dir-reverse-proper/data-pub ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default "max_connections" ... 100
selecting default "shared_buffers" ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option 
-A, or --auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

/usr/local/pgsql/bin/pg_ctl -D test-data-dir-reverse-proper/data-pub -l 
logfile start

The files belonging to this database system will be owned by user 
"nitinmotiani".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory test-data-dir-reverse-proper/data-sub ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default "max_connections" ... 100
selecting default "shared_buffers" ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option 
-A, or --auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

Re: Windows: openssl & gssapi dislike each other

2024-07-10 Thread Imran Zaheer

On Tue, Jul 9, 2024 at 2:32 AM Andres Freund  wrote:
>
> Hi,
>
>
> On 2024-06-13 00:12:51 +0900, Imran Zaheer wrote:
> > I removed the macro from the sslinfo.c as suggested by Andrew. Then I
> > was thinking maybe we can undo some other similar code.
>
> What precisely do you mean by that? Just getting rid of the "ordered include"
> of openssl headers in {fe,be}-secure-openssl.h?
>
Hi

I reordered the includes in {fe,be}-secure-openssl.h as they were also placed
there to resolve similar errors and also were contradicting the
project conventions [1].
But looks like it's better to not touch those as they were for future proofing.

>
> > I rearranged the headers to their previous position in
> > be-secure-openssl.c and in fe-secure-openssl.c. I was able to compile
> > with gssapi and openssl enabled. You can look into the original commits. [1,
> > 2]
> > Is it ok if we undo the changes from these commits?
> >
> > I am attaching two new patches.
> > One with macro guards removed from ssinfo.c.
> > Second patch will additionally rearrange headers for
> > be-secure-openssl.c and in fe-secure-openssl.c to their previous
> > position.
>
> One thing that concerns me with this is that there are other includes of
> gssapi/gssapi.h (e.g. in , which haven't been changed here. ISTM we ought to 
> do apply
> the same change to all of those, otherwise we're just waiting for the problem
> to re-appear.
>

Yes this should be better.

> I wonder if we should add a src/include/libpq/pg-gssapi.h or such, which could
> wrap the entire ifdeferry for gss support. Something like
>
>
> #ifdef ENABLE_GSS
>
> #if defined(HAVE_GSSAPI_H)
> #include 
> #include 
> #else
> #include 
> #include 
> #endif
>
> /*
>  * On Windows,  includes a #define for X509_NAME, which breaks our
>  * ability to use OpenSSL's version of that symbol if  is pulled
>  * in after  ... and, at least on some builds, it is.  We
>  * can't reliably fix that by re-ordering #includes, because libpq/libpq-be.h
>  * #includes .  Instead, just zap the #define again here.
>  */
> #ifdef X509_NAME
> #undef X509_NAME
> #endif
>
> #endif /* ENABLE_GSS */
>
> Which'd allow the various places using gss (libpq-be.h, be-gssapi-common.h,
> libpq-int.h) to just include pg-gssapi.h and get all of the above without
> redundancy?
>
>
> Another thing that concerns me about this approach is that it seems to assume
> that the only source of such conflicting includes is gssapi. What if some
> other header pulls in wincrypt.h?  But I can't really see a way out of that...
>
> Greetings,
>
> Andres Freund

Creating src/include/libpq/pg-gssapi.h can be another great way of
handling these includes. I compiled successfully but couldn't do
proper testing as there is something wrong with my windows env.

And you are right, the approach we are going with right now only
assumes that it's due to the
gssapi as the bug also appeared when building with gssapi (openssl &
gssapi build). What if openssl clashes with
some other lib too which indirectly includes wincrypt.h

For now maybe we can do the future proofing for gssapi & openssl includes
and do testing if openssl clashes with some other lib too.

Thanks
Imran Zaheer

[1]: https://wiki.postgresql.org/wiki/Committing_checklist#Policies
(3rd last point)

Re: long-standing data loss bug in initial sync of logical replication

2024-07-10 Thread vignesh C

On Wed, 10 Jul 2024 at 12:28, Amit Kapila  wrote:
>
> On Tue, Jul 9, 2024 at 8:14 PM vignesh C  wrote:
> >
> > On Tue, 9 Jul 2024 at 17:05, Amit Kapila  wrote:
> > >
> > > On Mon, Jul 1, 2024 at 10:51 AM vignesh C  wrote:
> > > >
> > > >
> > > > This issue is present in all supported versions. I was able to
> > > > reproduce it using the steps recommended by Andres and Tomas's
> > > > scripts. I also conducted a small test through TAP tests to verify the
> > > > problem. Attached is the alternate_lock_HEAD.patch, which includes the
> > > > lock modification(Tomas's change) and the TAP test.
> > > >
> > >
> > > @@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
> > >   /* Allow query cancel in case this takes a long time */
> > >   CHECK_FOR_INTERRUPTS();
> > >
> > > - rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
> > > + rel = table_openrv(t->relation, ShareRowExclusiveLock);
> > >
> > > The comment just above this code ("Open, share-lock, and check all the
> > > explicitly-specified relations") needs modification. It would be
> > > better to explain the reason of why we would need SRE lock here.
> >
> > Updated comments for the same.
> >
>
> The patch missed to use the ShareRowExclusiveLock for partitions, see
> attached. I haven't tested it but they should also face the same
> problem. Apart from that, I have changed the comments in a few places
> in the patch.

I could not hit the updated ShareRowExclusiveLock changes through the
partition table, instead I could verify it using the inheritance
table. Added a test for the same and also attaching the backbranch
patch.

Regards,
Vignesh
From d300868b61c65a6b575078c29c0d20994acae1fa Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Tue, 9 Jul 2024 19:23:10 +0530
Subject: [PATCH v4] Fix data loss during initial sync in logical replication.

Previously, when adding tables to a publication in PostgreSQL, they were
locked using ShareUpdateExclusiveLock mode. This mode allowed the lock to
succeed even if there were ongoing DML transactions on that table. As a
consequence, the ALTER PUBLICATION command could be completed before these
transactions, leading to a scenario where the catalog snapshot used for
replication did not include changes from transactions initiated before the
alteration.

To fix this issue, tables are now locked using ShareRowExclusiveLock mode
during the addition to a publication. This change ensures that the
ALTER PUBLICATION command waits for any ongoing transactions on the tables
(to be added to the publication) to be completed before proceeding. As a
result, transactions initiated before the publication alteration are
correctly included in the replication process.

Reported-by: Tomas Vondra
Diagnosed-by: Andres Freund
Author: Vignesh C, Tomas Vondra
Reviewed-by: Amit Kapila
Backpatch-through: 12
Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca7...@enterprisedb.com
---
 src/backend/commands/publicationcmds.c |  16 ++-
 src/test/subscription/t/100_bugs.pl| 142 +
 2 files changed, 153 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c
index 6ea709988e..341ea0318d 100644
--- a/src/backend/commands/publicationcmds.c
+++ b/src/backend/commands/publicationcmds.c
@@ -1542,8 +1542,14 @@ RemovePublicationSchemaById(Oid psoid)
 
 /*
  * Open relations specified by a PublicationTable list.
- * The returned tables are locked in ShareUpdateExclusiveLock mode in order to
- * add them to a publication.
+ *
+ * The returned tables are locked in ShareRowExclusiveLock mode to add them
+ * to a publication. The table needs to be locked in ShareRowExclusiveLock
+ * mode to ensure that any ongoing transactions involving that table are
+ * completed before adding it to the publication. Otherwise, the transaction
+ * initiated before the alteration of the publication will continue to use a
+ * catalog snapshot predating the publication change, leading to
+ * non-replication of these transaction changes.
  */
 static List *
 OpenTableList(List *tables)
@@ -1568,7 +1574,7 @@ OpenTableList(List *tables)
 		/* Allow query cancel in case this takes a long time */
 		CHECK_FOR_INTERRUPTS();
 
-		rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
+		rel = table_openrv(t->relation, ShareRowExclusiveLock);
 		myrelid = RelationGetRelid(rel);
 
 		/*
@@ -1594,7 +1600,7 @@ OpenTableList(List *tables)
 		 errmsg("conflicting or redundant column lists for table \"%s\"",
 RelationGetRelationName(rel;
 
-			table_close(rel, ShareUpdateExclusiveLock);
+			table_close(rel, ShareRowExclusiveLock);
 			continue;
 		}
 
@@ -1622,7 +1628,7 @@ OpenTableList(List *tables)
 			List	   *children;
 			ListCell   *child;
 
-			children = find_all_inheritors(myrelid, ShareUpdateExclusiveLock,
+			children = find_all_inheritors(myrelid, ShareRowExclusiveLock,
 		   NULL);
 
 			foreach(child, children)
diff --git

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

On Wed, Jul 10, 2024 at 10:10 AM Robert Haas  wrote:
> On Wed, Jul 10, 2024 at 1:56 AM Fujii Masao  
> wrote:
> > I believe this issue occurs when the server is shut down cleanly.
> > The shutdown-checkpoint record retains the wal_level value used
> > before the shutdown. If wal_level is changed after this,
> > the wal_level that indicated by the shutdown-checkpoint record
> > and that the WAL data generated afterwards depends on may not match.
>
> Oh, that's a problem. I'll have to think about that.

Here is an attempt at fixing this problem.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v3-0001-Do-not-summarize-WAL-if-generated-with-wal_level-.patch
Description: Binary data

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

On Thu, Jul 11, 2024 at 01:02:25AM +0900, Fujii Masao wrote:
> On 2024/07/10 23:18, Nathan Bossart wrote:
>> Yeah.  I initially thought this patch might be okay, at least as a stopgap,
>> but Jelte pointed out a case where it doesn't work, namely when you have
>> something like the following in the config file:
>> 
>>  wal_level = 'minimal'
>>  summarize_wal = 'true'
>>  wal_level = 'logical'
> 
> Unless I'm mistaken, the patch works fine in this case. If the check_hook
> triggered every time a parameter appears in the configuration file,
> it would mistakenly detect wal_level=minimal and summarize_wal=on together
> and raise an error. However, this isn't the case. The check_hook is
> designed to trigger after duplicate parameters are deduplicated.
> Am I missing something?

After further research, I think you are right about that.

-- 
nathan

Re: Streaming I/O, vectored I/O (WIP)

2024-07-10 Thread Nazir Bilal Yavuz

Hi,

It seems that Heikki's 'v9.heikki-0007-Trivial-comment-fixes.patch'
[1] is partially applied, the top comment is not updated. The attached
patch just updates it.

[1] 
https://www.postgresql.org/message-id/289a1c0e-8444-4009-a8c2-c2d77ced6f07%40iki.fi

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 76efb37b03b295070dc5259049825fe56460383f Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 10 Jul 2024 19:04:51 +0300
Subject: [PATCH] Fix the top comment at read_stream.c

---
 src/backend/storage/aio/read_stream.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 74b9bae6313..238ff448a91 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -51,7 +51,7 @@
  * I/Os that have been started with StartReadBuffers(), and for which
  * WaitReadBuffers() must be called before returning the buffer.
  *
- * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * For example, if the callback returns block numbers 10, 42, 43, 44, 60 in
  * successive calls, then these data structures might appear as follows:
  *
  *  buffers buf/data   ios
-- 
2.45.2

Re: Allow logical failover slots to wait on synchronous replication

Hi,

On Mon, Jul 08, 2024 at 12:08:58PM -0700, John H wrote:
> I took a deeper look at this with GDB and I think it's necessary to
> cache the value of "mode".
> We first check:
> 
> if (mode == SYNC_REP_NO_WAIT)
> return true;
> 
> However after this check it's possible to receive a SIGHUP changing
> SyncRepWaitMode
> to SYNC_REP_NO_WAIT (e.g. synchronous_commit = 'on' -> 'off'), leading
> to lsn[-1].

What about adding "SyncRepWaitMode" as a third StandbySlotsHaveCaughtup()
parameter then? (so that the function will used whatever value was passed during
the call).

> > 2 
> >
> > +   static XLogRecPtr   lsn[NUM_SYNC_REP_WAIT_MODE] = 
> > {InvalidXLogRecPtr};
> >
> > I did some testing and saw that the lsn[] values were not always set to
> > InvalidXLogRecPtr right after. It looks like that, in that case, we should
> > avoid setting the lsn[] values at compile time. Then, what about?
> >
> > 1. remove the "static".
> >
> > or
> >
> > 2. keep the static but set the lsn[] values after its declaration.
> 
> I'd prefer to keep the static because it reduces unnecessary
> contention on SyncRepLock if logical client has fallen behind.
> I'll add a change with your second suggestion.

Got it, you want lsn[] to be initialized only one time and that each call to
StandbySlotsHaveCaughtup() relies on the values that were previously stored in 
lsn[] and then return if lsn[mode] >= wait_for_lsn.

Then I think that:

1 ===

That's worth additional comments in the code.

2 ===

+   if (!initialized)
+   {
+   for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
+   {
+   lsn[i] = InvalidXLogRecPtr;
+   }
+   }

Looks like setting initialized to true is missing once done.

Also,

3 ===

+   /* Cache values to reduce contention on lock */
+   for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
+   {
+   lsn[i] = walsndctl->lsn[i];
+   }

NUM_SYNC_REP_WAIT_MODE is small but as the goal is the keep the lock time as
short as possible I wonder if it wouldn't be better to use memcpy() here instead
of this for loop.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

On Wed, Jul 10, 2024 at 11:54:38AM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> I haven't tested it, but from skimming around the code, it looks like
>> ProcessConfigFileInternal() would deduplicate any previous entries in the
>> file prior to applying the values and running the check hooks.  Else,
>> reloading a configuration file with multiple startup-only GUC entries could
>> fail, even without bogus GUC check hooks.
> 
> While it's been a little while since I actually traced the logic,
> I believe the reason that case doesn't fail is this bit in
> set_config_with_handle, about line 3477 as of HEAD:
> 
> case PGC_POSTMASTER:
> if (context == PGC_SIGHUP)
> {
> /*
>  * We are re-reading a PGC_POSTMASTER variable from
>  * postgresql.conf.  We can't change the setting, so we should
>  * give a warning if the DBA tries to change it.  However,
>  * because of variant formats, canonicalization by check
>  * hooks, etc, we can't just compare the given string directly
>  * to what's stored.  Set a flag to check below after we have
>  * the final storable value.
>  */
> prohibitValueChange = true;
> }
> else if (context != PGC_POSTMASTER)
> // throw "cannot be changed now" error

That's what I thought at first, but then I saw this in
ProcessConfigFileInternal():

/* If it's already marked, then this is a duplicate 
entry */
if (record->status & GUC_IS_IN_FILE)
{
/*
 * Mark the earlier occurrence(s) as 
dead/ignorable.  We could
 * avoid the O(N^2) behavior here with some 
additional state,
 * but it seems unlikely to be worth the 
trouble.
 */
ConfigVariable *pitem;

for (pitem = head; pitem != item; pitem = 
pitem->next)
{
if (!pitem->ignore &&
strcmp(pitem->name, item->name) 
== 0)
pitem->ignore = true;
}
}

-- 
nathan

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal





On 2024/07/11 0:44, Nathan Bossart wrote:

On Wed, Jul 10, 2024 at 11:11:13AM -0400, Tom Lane wrote:

Please, no.  We went through a ton of permutations of that kind of
idea years ago, when it first became totally clear that cross-checks
between GUCs do not work nicely if implemented in check_hooks.
(You can find all the things we tried in the commit log, although
I don't recall exactly when.)


Understood.


A counter-example for what you just
said is when a configuration file like the above is loaded after
postmaster start.


I haven't tested it, but from skimming around the code, it looks like
ProcessConfigFileInternal() would deduplicate any previous entries in the
file prior to applying the values and running the check hooks.  Else,
reloading a configuration file with multiple startup-only GUC entries could
fail, even without bogus GUC check hooks.


Yeah, I'm thinking the same.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

On 2024/07/10 23:18, Nathan Bossart wrote:

On Wed, Jul 10, 2024 at 10:10:30AM -0400, Robert Haas wrote:

On Wed, Jul 10, 2024 at 1:56 AM Fujii Masao wrote:

I'm sure this patch is necessary as a safeguard for WAL summarization.
OTOH, I also think we should apply the patch I proposed earlier
in this thread, which prevents summarize_wal from being enabled
when wal_level is set to minimal. This way, if there's
a misconfiguration, users will see an error message and
can quickly identify and fix the issue. Thought?

I interpreted these emails as meaning that we should not proceed with
that approach:

https://www.postgresql.org/message-id/CAGECzQR2r-rHFLQr5AonFehVP8DiFH+==r2yqdbvunynwxs...@mail.gmail.com
http://postgr.es/m/3253790.1720019...@sss.pgh.pa.us

Yeah. I initially thought this patch might be okay, at least as a stopgap,
but Jelte pointed out a case where it doesn't work, namely when you have
something like the following in the config file:

wal_level = 'minimal'
summarize_wal = 'true'
wal_level = 'logical'

Unless I'm mistaken, the patch works fine in this case. If the check_hook
triggered every time a parameter appears in the configuration file,
it would mistakenly detect wal_level=minimal and summarize_wal=on together
and raise an error. However, this isn't the case. The check_hook is
designed to trigger after duplicate parameters are deduplicated.
Am I missing something?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Send duration output to separate log files

2024-07-10 Thread Greg Sabino Mullane

Please find attached a patch to allow for durations to optionally be sent
to separate log files. In other words, rather than cluttering up our
postgres202007.log file with tons of output from
log_min_duration_statement, duration lines are sent instead to the file
postgres202007.duration.

Over the years, durations have been the number one cause of very large log
files, in which the more "important" items get buried in the noise. Also,
programs that are scanning for durations typically do not care about the
normal, non-duration output. Some people have a policy of logging
everything, which in effect means setting log_min_duration_statement to 0,
which in turn makes your log files nearly worthless for spotting
non-duration items. This feature will also be very useful for those who
need to temporarily turn on log_min_duration_statement, for some quick
auditing of exactly what is being run on their database. When done, you can
move or remove the duration file without messing up your existing log
stream.

This only covers the case when both the duration and statement are set on
the same line. In other words, log_min_duration_statement output, but not
log_duration (which is best avoided anyway). It also requires
logging_collector to be on, obviously.

Details:

The edata structure is expanded to have a new message_type, with a matching
function errmessagetype() created.
[include/utils/elog.h]
[backend/utils/elog.c]

Any errors that have both a duration and a statement are marked via
errmessagetype()
[backend/tcop/postgres.c]

A new GUC named "log_duration_destination" is created, which supports any
combination of stderr, csvlog, and jsonlog. It does not need to match
log_destination, in order to support different use cases. For example, the
user wants durations sent to a CSV file for processing by some other tool,
but still wants everything else going to a normal text log file.

Code: [include/utils/guc_hooks.h] [backend/utils/misc/guc_tables.c]
Docs: [sgml/config.sgml]  [backend/utils/misc/postgresql.conf.sample]

Create a new flag called PIPE_PROTO_DEST_DURATION
[include/postmaster/syslogger.h]

Create new flags:
  LOG_DESTINATION_DURATION,
  LOG_DESTINATION_DURATION_CSV,
  LOG_DESTINATION_DURATION_JSON
[include/utils/elog.h]

Routing and mapping LOG_DESTINATION to PIPE_PROTO
[backend/utils/error/elog.c]

Minor rerouting when using alternate forms
[backend/utils/error/csvlog.c]
[backend/utils/error/jsonlog.c]

Create new filehandles, do log rotation, map PIPE_PROTO to LOG_DESTINATION.
Rotation and entry into the "current_logfiles" file are the same as
existing log files. The new names/suffixes are duration, duration.csv, and
duration.json.
[backend/postmaster/syslogger.c]

Testing to ensure combinations of log_destination and
log_duration_destination work as intended
[bin/pg_ctl/meson.build]
[bin/pg_ctl/t/005_log_duration_destination.pl]

Questions I've asked along the way, and perhaps other might as well:

What about logging other things? Why just duration?

Duration logging is a very unique event in our logs. There is nothing quite
like it - it's always client-driven, yet automatically generated. And it
can be extraordinarily verbose. Removing it from the existing logging
stream has no particular downsides. Almost any other class of log message
would likely meet resistance as far as moving it to a separate log file,
with good reason.

Why not build a more generic log filtering case?

I looked into this, but it would be a large undertaking, given the way our
logging system works. And as per above, I don't think the pain would be
worth it, as duration covers 99% of the use cases for separate logs.
Certainly, nothing else other than a recurring ERROR from the client can
cause massive bloat in the size of the files. (There is a nearby patch to
exclude certain errors from the log file as a way to mitigate the error
spam - I don't like that idea, but should mention it here as another effort
to keep the log files a manageable size)

Why not use an extension for this?

I did start this as an extension, but it only goes so far. We can use
emit_log_hook, but it requires careful coordination of open filehandles,
has to do inefficient regex of every log message, and cannot do things like
log rotation.

Why not bitmap PIPE_PROTO *and* LOG_DESTINATION?

I tried to do both as simple bitmaps (i.e. duration+csv = duration.csv),
and not have to use e.g. LOG_DESTIATION_DURATION_CSV, but size_rotation_for
ruined it for me. Since our PIPE always sends one thing at a time, a single
new flag enables it to stay as a clean bits8 type.

What about Windows?

Untested. I don't have access to a Windows build, but I think in theory it
should work fine.

Cheers,
Greg
From c6d19d07cfa7eac51e5ead79f1c1b230b5b2d364 Mon Sep 17 00:00:00 2001
From: Greg Sabino Mullane 
Date: Tue, 2 Jul 2024 15:13:44 -0400
Subject: [PATCH] Add new parameter log_duration_destination

This allows writing of items from log_min_duration_statement to be sent to

Re: Is it possible to create a cursor with hold using extended query protocol

2024-07-10 Thread David G. Johnston

On Wed, Jul 10, 2024 at 8:29 AM Dave Cramer  wrote:

>
> On Wed, 10 Jul 2024 at 11:04, David G. Johnston <
> david.g.johns...@gmail.com> wrote:
>
>> On Wednesday, July 10, 2024, Dave Cramer  wrote:
>>
>>> Greetings,
>>>
>>> There are suggestions that you can use extended query to fetch from a
>>> cursor, however I don't see how you can bind values to the cursor ?
>>>
>>> PostgreSQL: Documentation: 16: FETCH
>>> 
>>>
>>> Is this possible?
>>>
>>
>> Not that i can see.  The declare’d query isn’t shown to accept $n
>> bindings rather it must be executable (select or values).  Per the note on
>> declare, the bind phase of the fetch command under the extended protocol is
>> used to determine whether values retrieved are text or binary.  Beyond
>> that, the bind is really just a formality of the protocol, the same as for
>> executing any other non-parameterized query that way.
>>
>
> Seems you can bind to the Declare though.
>
>
Right.  You got me trying to equate cursors and prepared statements and
they differ in exactly this way.  The prepared binds are held until execute
while cursor binds, and query execution for that matter, are immediate,
with fetch just being a way to obtain the rows already computed (at least
conceptually, optimizations might exist).  They both end up creating a
named portal.  You cannot declare an execute command which simplifies
things a bit.

David J.

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

2024-07-10 Thread Tom Lane

Nathan Bossart  writes:
> I haven't tested it, but from skimming around the code, it looks like
> ProcessConfigFileInternal() would deduplicate any previous entries in the
> file prior to applying the values and running the check hooks.  Else,
> reloading a configuration file with multiple startup-only GUC entries could
> fail, even without bogus GUC check hooks.

While it's been a little while since I actually traced the logic,
I believe the reason that case doesn't fail is this bit in
set_config_with_handle, about line 3477 as of HEAD:

case PGC_POSTMASTER:
if (context == PGC_SIGHUP)
{
/*
 * We are re-reading a PGC_POSTMASTER variable from
 * postgresql.conf.  We can't change the setting, so we should
 * give a warning if the DBA tries to change it.  However,
 * because of variant formats, canonicalization by check
 * hooks, etc, we can't just compare the given string directly
 * to what's stored.  Set a flag to check below after we have
 * the final storable value.
 */
prohibitValueChange = true;
}
else if (context != PGC_POSTMASTER)
// throw "cannot be changed now" error

regards, tom lane

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

On Wed, Jul 10, 2024 at 11:11:13AM -0400, Tom Lane wrote:
> Please, no.  We went through a ton of permutations of that kind of
> idea years ago, when it first became totally clear that cross-checks
> between GUCs do not work nicely if implemented in check_hooks.
> (You can find all the things we tried in the commit log, although
> I don't recall exactly when.)

Understood.

> A counter-example for what you just
> said is when a configuration file like the above is loaded after
> postmaster start.

I haven't tested it, but from skimming around the code, it looks like
ProcessConfigFileInternal() would deduplicate any previous entries in the
file prior to applying the values and running the check hooks.  Else,
reloading a configuration file with multiple startup-only GUC entries could
fail, even without bogus GUC check hooks.

-- 
nathan

Re: Is it possible to create a cursor with hold using extended query protocol

2024-07-10 Thread Dave Cramer

On Wed, 10 Jul 2024 at 11:04, David G. Johnston 
wrote:

> On Wednesday, July 10, 2024, Dave Cramer  wrote:
>
>> Greetings,
>>
>> There are suggestions that you can use extended query to fetch from a
>> cursor, however I don't see how you can bind values to the cursor ?
>>
>> PostgreSQL: Documentation: 16: FETCH
>> 
>>
>> Is this possible?
>>
>
> Not that i can see.  The declare’d query isn’t shown to accept $n bindings
> rather it must be executable (select or values).  Per the note on declare,
> the bind phase of the fetch command under the extended protocol is used to
> determine whether values retrieved are text or binary.  Beyond that, the
> bind is really just a formality of the protocol, the same as for executing
> any other non-parameterized query that way.
>

Seems you can bind to the Declare though.

execute : BEGIN
2024-07-10 11:18:57.247 EDT [98519] LOG:  duration: 0.239 ms  parse
: DECLARE c1 CURSOR WITH HOLD FOR select * from vactbl where id <
$1
2024-07-10 11:18:57.247 EDT [98519] LOG:  duration: 0.014 ms  bind
: DECLARE c1 CURSOR WITH HOLD FOR select * from vactbl where id <
$1
2024-07-10 11:18:57.247 EDT [98519] DETAIL:  Parameters: $1 = '400'
2024-07-10 11:18:57.248 EDT [98519] LOG:  duration: 1.080 ms  execute
: DECLARE c1 CURSOR WITH HOLD FOR select * from vactbl where id <
$1
2024-07-10 11:18:57.248 EDT [98519] DETAIL:  Parameters: $1 = '400'

Thanks,

Dave

>
>

Re: jsonpath: Inconsistency of timestamp_tz() Output

On Jul 10, 2024, at 10:54, David E. Wheeler  wrote:

> So it should be -7, not -8. Not sure where to tell it to pay proper attention 
> to daylight savings time.

Oh, and the time and date were wrong, too, because I blindly used the same 
conversion for dates as for timestamps. Fixed in v2.

PR: https://github.com/theory/postgres/pull/7
CF: https://commitfest.postgresql.org/49/5119/

Best,

David



v2-0001-Preserve-tz-when-converting-to-jsonb-timestamptz.patch
Description: Binary data

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

2024-07-10 Thread Tom Lane

Jelte Fennema-Nio  writes:
> On Wed, 10 Jul 2024 at 16:18, Nathan Bossart  wrote:
>> Yeah.  I initially thought this patch might be okay, at least as a stopgap,
>> but Jelte pointed out a case where it doesn't work, namely when you have
>> something like the following in the config file:
>> 
>> wal_level = 'minimal'
>> summarize_wal = 'true'
>> wal_level = 'logical'

> I think that issue can be solved fairly easily by making the guc
> check_hook always pass during postmaster startup (by e.g. checking
> pmState), and relying on the previous startup check instead during
> startup.

Please, no.  We went through a ton of permutations of that kind of
idea years ago, when it first became totally clear that cross-checks
between GUCs do not work nicely if implemented in check_hooks.
(You can find all the things we tried in the commit log, although
I don't recall exactly when.)  A counter-example for what you just
said is when a configuration file like the above is loaded after
postmaster start.

If we want to solve this, let's actually solve it, perhaps by
inventing a "consistency check" mechanism that GUC applies after
it thinks it's reached a final set of GUC values.  I'm not very
clear on how outside checking code would be able to look at the
tentative rather than active values of the variables, but that
should be solvable.

regards, tom lane

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

2024-07-10 Thread Jelte Fennema-Nio

On Wed, 10 Jul 2024 at 16:46, Nathan Bossart  wrote:
> Do we actually need to look at pmState?  Or could we just skip
> it if the context is <= PGC_S_ARGV?

I'm not 100% sure, but I think PGC_S_FILE would still be used when
postgresql.conf changes and on SIGHUP is sent. And we would want the
check_hook to be used then.

Re: Is it possible to create a cursor with hold using extended query protocol

2024-07-10 Thread David G. Johnston

On Wednesday, July 10, 2024, Dave Cramer  wrote:

> Greetings,
>
> There are suggestions that you can use extended query to fetch from a
> cursor, however I don't see how you can bind values to the cursor ?
>
> PostgreSQL: Documentation: 16: FETCH
> 
>
> Is this possible?
>

Not that i can see.  The declare’d query isn’t shown to accept $n bindings
rather it must be executable (select or values).  Per the note on declare,
the bind phase of the fetch command under the extended protocol is used to
determine whether values retrieved are text or binary.  Beyond that, the
bind is really just a formality of the protocol, the same as for executing
any other non-parameterized query that way.

David J.

Missed opportunity for bsearch() in TransactionIdIsCurrentTransactionId()?

2024-07-10 Thread Antonin Houska

I don't quite understand why TransactionIdIsCurrentTransactionId() implements
binary search in ParallelCurrentXids "from scratch" instead of using
bsearch().

If I read the code correctly, the contents of the ParallelCurrentXids is
composed in SerializeTransactionState(), which uses xidComparator:

qsort(workspace, nxids, sizeof(TransactionId), xidComparator);

so it should be o.k. to use bsearch(..., xidComparator).

For example, ReorderBufferCopySnap() also uses xidComparator to sort the
'subxip' array, and HeapTupleSatisfiesHistoricMVCC() then uses
TransactionIdInArray() (which is effectively bsearch(..., xidComparator)) to
search for particular XID in the array.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d119ab909d..8540e70e70 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -965,29 +965,9 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
 	 * the XIDs in this array are sorted numerically rather than according to
 	 * transactionIdPrecedes order.
 	 */
-	if (nParallelCurrentXids > 0)
-	{
-		int			low,
-	high;
-
-		low = 0;
-		high = nParallelCurrentXids - 1;
-		while (low <= high)
-		{
-			int			middle;
-			TransactionId probe;
-
-			middle = low + (high - low) / 2;
-			probe = ParallelCurrentXids[middle];
-			if (probe == xid)
-return true;
-			else if (probe < xid)
-low = middle + 1;
-			else
-high = middle - 1;
-		}
-		return false;
-	}
+	if (bsearch(, ParallelCurrentXids, nParallelCurrentXids,
+sizeof(TransactionId), xidComparator))
+		return true;
 
 	/*
 	 * We will return true for the Xid of the current subtransaction, any of

Re: jsonpath: Inconsistency of timestamp_tz() Output

On Jul 10, 2024, at 10:33, David E. Wheeler  wrote:

> Yeah I don’t know either, but now at least it’s consistent. I’ve attached a 
> patch to fix it.

Actually I think there’s a subtlety still missing here:

@@ -2914,7 +2914,7 @@ HINT:  Use *_tz() function for time zone support.
 select jsonb_path_query_tz('"2023-08-15"', '$.timestamp_tz()'); -- should work
  jsonb_path_query_tz   -
- "2023-08-15T07:00:00+00:00"
+ "2023-08-14T23:00:00-08:00"
 (1 row)

This test runs while the time zone is set to “PST8PDT”, but it’s got the PST 
offset when it should be PDT:

david=# set time zone 'PST8PDT';
SET
david=# select '2023-08-15'::timestamptz;
  timestamptz   

 2023-08-15 00:00:00-07

So it should be -7, not -8. Not sure where to tell it to pay proper attention 
to daylight savings time.

Best,

David

Re: Possible incorrect row estimation for Gather paths

2024-07-10 Thread Rafia Sabih

Hello Anthonin,

I spent some time on this problem and your proposed solution.
As I understand it, this is the correction for the row count when the
number of parallel workers < 4.
Once the number of workers is 4 or more, the output from
parallel_divisor is the same as the number of parallel workers.
And then the number of rows for such cases would be the same with or
without the proposed patch.
So that way I think it is good to fix this case for a smaller number of workers.

But I don't quite understood the purpose of this,
+ total_groups = input_rel->rows;
+
+ /*
+ * If the number of rows is unknown, fallback to gather rows
+ * estimation
+ */
+ if (total_groups == 0)
+ total_groups = gather_rows_estimate(input_path);

why not just use total_groups = gather_rows_estimate(input_path), what
is the importance of having total_groups = input_rel->rows?

With respect to the change introduced by the patch in the regression
test, I wonder if we should test it on the tables of a larger scale
and check for performance issues.
Imagine the case when Parallel Seq Scan on extremely_skewed s
(cost=0.00..167.01 rows=1 width=4) returns 1000 rows instead of 1 ...
I wonder which plan would perform better then or will there be a
totally different plan.

However, my hunch is that there should not be serious problems,
because before this patch the number of estimated rows was incorrect
anyway.

I don't see a problem in merging the two patches.


On Fri, 24 May 2024 at 11:35, Anthonin Bonnefoy
 wrote:
>
> Hi,
>
> While experimenting on an explain option to display all plan candidates (very 
> rough prototype here [1]), I've noticed some discrepancies in some generated 
> plans.
>
> EXPLAIN (ALL_CANDIDATES) SELECT * FROM pgbench_accounts order by aid;
>  Plan 1
> ->  Gather Merge  (cost=11108.32..22505.38 rows=10 width=97)
>  Workers Planned: 1
>  ->  Sort  (cost=10108.31..10255.37 rows=58824 width=97)
>Sort Key: aid
>->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..2228.24 
> rows=58824 width=97)
>  Plan 2
> ->  Gather Merge  (cost=11108.32..17873.08 rows=58824 width=97)
>  Workers Planned: 1
>  ->  Sort  (cost=10108.31..10255.37 rows=58824 width=97)
>Sort Key: aid
>->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..2228.24 
> rows=58824 width=97)
>
> The 2 plans are similar except one Gather Merge has a lower 58K estimated 
> rows.
>
> The first plan is created with generate_useful_gather_paths with 
> override_rows=false then create_gather_merge_path and will use rel->rows as 
> the row count (so the 100K rows of pgbench_accounts).
> #0: create_gather_merge_path(...) at pathnode.c:1885:30
> #1: generate_useful_gather_paths(... override_rows=false) at 
> allpaths.c:3286:11
> #2: apply_scanjoin_target_to_paths(...) at planner.c:7744:3
> #3: grouping_planner(...) at planner.c:1638:3
>
> The second plan is created through create_ordered_paths then 
> create_gather_merge_path and the number of rows is estimated to (worker_rows 
> * parallel_workers). Since we only have 1 parallel worker, this yields 58K 
> rows.
> #0: create_gather_merge_path(...) at pathnode.c:1885:30
> #1: create_ordered_paths(...) at planner.c:5287:5
> #2: grouping_planner(...) at planner.c:1717:17
>
> The 58K row estimate looks possibly incorrect. A worker row count is 
> estimated using total_rows/parallel_divisor. The parallel_divisor will 
> include the possible leader participation and will be 1.7 for 1 worker thus 
> the 58K rows (100K/1.7=58K)
> However, the gather merge will only do 58K*1, dropping the leader 
> participation component.
>
> I have a tentative patch split in two changes:
> 1: This is a related refactoring to remove an unnecessary and confusing 
> assignment of rows in create_gather_merge_path. This value is never used and 
> immediately overwritten in cost_gather_merge
> 2: This changes the row estimation of gather path to use 
> worker_rows*parallel_divisor to get a more accurate estimation. Additionally, 
> when creating a gather merge path in create_ordered_paths, we try to use the 
> source's rel rows when available.
>
> The patch triggered a small change in the hash_join regression test. Pre 
> patch, we had the following candidates.
> Plan 4
> ->  Aggregate  (cost=511.01..511.02 rows=1 width=8)
>  ->  Gather  (cost=167.02..511.00 rows=2 width=0)
>Workers Planned: 1
>->  Parallel Hash Join  (cost=167.02..510.80 rows=1 width=0)
>  Hash Cond: (r.id = s.id)
>  ->  Parallel Seq Scan on simple r  (cost=0.00..299.65 
> rows=11765 width=4)
>  ->  Parallel Hash  (cost=167.01..167.01 rows=1 width=4)
>->  Parallel Seq Scan on extremely_skewed s  
> (cost=0.00..167.01 rows=1 width=4)
> Plan 5
> ->  Finalize Aggregate  (cost=510.92..510.93 rows=1 width=8)
>  ->  Gather  (cost=510.80..510.91 rows=1 width=8)
>Workers Planned: 1
>->  Partial

Is it possible to create a cursor with hold using extended query protocol

2024-07-10 Thread Dave Cramer

Greetings,

There are suggestions that you can use extended query to fetch from a
cursor, however I don't see how you can bind values to the cursor ?

PostgreSQL: Documentation: 16: FETCH


Is this possible?

Dave Cramer

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

On Wed, Jul 10, 2024 at 04:29:14PM +0200, Jelte Fennema-Nio wrote:
> On Wed, 10 Jul 2024 at 16:18, Nathan Bossart  wrote:
>> Yeah.  I initially thought this patch might be okay, at least as a stopgap,
>> but Jelte pointed out a case where it doesn't work, namely when you have
>> something like the following in the config file:
>>
>> wal_level = 'minimal'
>> summarize_wal = 'true'
>> wal_level = 'logical'
> 
> I think that issue can be solved fairly easily by making the guc
> check_hook always pass during postmaster startup (by e.g. checking
> pmState), and relying on the previous startup check instead during
> startup.

I was actually just thinking about doing something similar in a different
thread [0].  Do we actually need to look at pmState?  Or could we just skip
it if the context is <= PGC_S_ARGV?

[0] https://postgr.es/m/Zow-DBaDY2IzAzA2%40nathan

-- 
nathan

Re: jsonpath: Inconsistency of timestamp_tz() Output

On Jul 10, 2024, at 10:33, David E. Wheeler  wrote:

> Yeah I don’t know either, but now at least it’s consistent. I’ve attached a 
> patch to fix it.
> 
> Ideally, I think, we wouldn’t convert the value and determine the offset 
> twice, but teach date_timestamptz and timestamp_timestamptz (or 
> date2timestamptz and timestamp2timestamptz?) how to return the offset, or 
> create alternate functions that do so. Not sure what calling style should be 
> adopted here, but this at least addresses the issue. Happy to resubmit 
> something more efficient upon function design feedback.

Here’s a September CommitFest item, though I think it should be fixed before 
the next beta.

  https://commitfest.postgresql.org/49/5119/

Best,

David

Re: jsonpath: Inconsistency of timestamp_tz() Output

On Jul 10, 2024, at 01:48, Junwang Zhao  wrote:

> I apply your patch with some minor change(to make the server not crash):

Oh, thank you! Kicking myself for not catching the obvious.

> It now gives the local tz:
> 
> [local] postgres@postgres:5432-54960=# set time zone 'America/New_York';
> SET
> Time: 2.894 ms
> [local] postgres@postgres:5432-54960=# select
> jsonb_path_query_tz('"2024-08-15 12:34:56"', '$.timestamp_tz()');
> ┌─┐
> │ jsonb_path_query_tz │
> ├─┤
> │ "2024-08-15T12:34:56-04:00" │
> └─┘
> (1 row)
> 
> Time: 293813.022 ms (04:53.813)

Yes, and I think that’s what we want, since it preserves and displays the 
offset for strings that contain them:

david=# set time zone 'America/New_York';
SET
david=# select jsonb_path_query_tz('"2024-08-15 12:34:56+10"', 
'$.timestamp_tz()');
jsonb_path_query_tz
-
"2024-08-15T12:34:56+10:00"

> I'm not sure whether the SQL/JSON standard mentioned this, I searched a
> little bit, but found no clue :(

Yeah I don’t know either, but now at least it’s consistent. I’ve attached a 
patch to fix it.

Ideally, I think, we wouldn’t convert the value and determine the offset twice, 
but teach date_timestamptz and timestamp_timestamptz (or date2timestamptz and 
timestamp2timestamptz?) how to return the offset, or create alternate functions 
that do so. Not sure what calling style should be adopted here, but this at 
least addresses the issue. Happy to resubmit something more efficient upon 
function design feedback.

Best,

David

v1-0001-Preserve-tz-when-converting-to-jsonb-timestamptz.patch
Description: Binary data

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

2024-07-10 Thread Jelte Fennema-Nio

On Wed, 10 Jul 2024 at 16:18, Nathan Bossart  wrote:
> Yeah.  I initially thought this patch might be okay, at least as a stopgap,
> but Jelte pointed out a case where it doesn't work, namely when you have
> something like the following in the config file:
>
> wal_level = 'minimal'
> summarize_wal = 'true'
> wal_level = 'logical'

I think that issue can be solved fairly easily by making the guc
check_hook always pass during postmaster startup (by e.g. checking
pmState), and relying on the previous startup check instead during
startup.

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

On Wed, Jul 10, 2024 at 10:10:30AM -0400, Robert Haas wrote:
> On Wed, Jul 10, 2024 at 1:56 AM Fujii Masao  
> wrote:
>> I'm sure this patch is necessary as a safeguard for WAL summarization.
>> OTOH, I also think we should apply the patch I proposed earlier
>> in this thread, which prevents summarize_wal from being enabled
>> when wal_level is set to minimal. This way, if there's
>> a misconfiguration, users will see an error message and
>> can quickly identify and fix the issue. Thought?
> 
> I interpreted these emails as meaning that we should not proceed with
> that approach:
> 
> https://www.postgresql.org/message-id/CAGECzQR2r-rHFLQr5AonFehVP8DiFH+==r2yqdbvunynwxs...@mail.gmail.com
> http://postgr.es/m/3253790.1720019...@sss.pgh.pa.us

Yeah.  I initially thought this patch might be okay, at least as a stopgap,
but Jelte pointed out a case where it doesn't work, namely when you have
something like the following in the config file:

wal_level = 'minimal'
summarize_wal = 'true'
wal_level = 'logical'

I'm not sure that's a dealbreaker for v17 if we can't come up with anything
else, but IMHO it at least deserves a loud comment and a plan for a better
solution in v18.

-- 
nathan

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

On Wed, Jul 10, 2024 at 1:56 AM Fujii Masao  wrote:
> I believe this issue occurs when the server is shut down cleanly.
> The shutdown-checkpoint record retains the wal_level value used
> before the shutdown. If wal_level is changed after this,
> the wal_level that indicated by the shutdown-checkpoint record
> and that the WAL data generated afterwards depends on may not match.

Oh, that's a problem. I'll have to think about that.

> I'm sure this patch is necessary as a safeguard for WAL summarization.
> OTOH, I also think we should apply the patch I proposed earlier
> in this thread, which prevents summarize_wal from being enabled
> when wal_level is set to minimal. This way, if there's
> a misconfiguration, users will see an error message and
> can quickly identify and fix the issue. Thought?

I interpreted these emails as meaning that we should not proceed with
that approach:

https://www.postgresql.org/message-id/CAGECzQR2r-rHFLQr5AonFehVP8DiFH+==r2yqdbvunynwxs...@mail.gmail.com
http://postgr.es/m/3253790.1720019...@sss.pgh.pa.us

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_maintain and USAGE privilege on schema

On Wed, Jul 10, 2024 at 05:13:58PM +0900, Fujii Masao wrote:
> However, unlike the database owner, pg_maintain by definition should
> have *all* the rights needed for maintenance tasks, including MAINTAIN
> rights on tables and USAGE rights on schemas? ISTM that both
> pg_read_all_data and pg_write_all_data roles are defined similarly,
> with USAGE rights on all schemas. So, granting USAGE rights to
> pg_maintain, but not the database owner, doesn't seem so odd to me.

It doesn't seem so odd to me, either.  But there are other things that
could prevent a role with privileges of pg_maintain from being able to
VACUUM a table.  For example, the role might not have LOGIN, or it might
not have CONNECT on the database.  I think the argument for giving
pg_maintain roles implicit USAGE on all schemas for only maintenance
commands is that we already do that in some cases (e.g., a database-wide
VACUUM).

> I'd like hear more opinions about this.

+1

-- 
nathan

Re: tests fail on windows with default git settings

2024-07-10 Thread Andrew Dunstan




On 2024-07-10 We 9:25 AM, Tom Lane wrote:

Dave Page  writes:

On Wed, 10 Jul 2024 at 12:12, Andrew Dunstan  wrote:

As I was looking at this I wondered if there might be anywhere else that
needed adjustment. One thing that occurred to me was that that maybe we
should replace the use of "-w" in pg_regress.c with this rather less
dangerous flag, so instead of ignoring any white space difference we would
only ignore line end differences. The use of "-w" apparently dates back to
2009.

That seems like a good improvement to me.

+1





OK, done.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: relfilenode statistics

Hi,

On Wed, Jul 10, 2024 at 03:02:34PM +0900, Michael Paquier wrote:
> On Sat, May 25, 2024 at 07:52:02AM +, Bertrand Drouvot wrote:
> > But I think that it is in a state that can be used to discuss the approach 
> > it
> > is implementing (so that we can agree or not on it) before moving
> > forward.
> 
> I have read through the patch to get an idea of how things are done,

Thanks!

> and I am troubled by the approach taken (mentioned down by you), but
> that's invasive compared to how pgstats wants to be transparent with
> its stats kinds.
> 
> +   Oid objoid; /* object ID, either table or function
> or tablespace. */
> +   RelFileNumber relfile;  /* relfilenumber for RelFileLocator. */
>  } PgStat_HashKey;
> 
> This adds a relfilenode component to the central hash key used for the
> dshash of pgstats, which is something most stats types don't care
> about.

That's right but that's an existing behavior without the patch as:

PGSTAT_KIND_DATABASE does not care care about the objoid
PGSTAT_KIND_REPLSLOT does not care care about the dboid
PGSTAT_KIND_SUBSCRIPTION does not care care about the dboid

That's 3 kinds out of the 5 non fixed stats kind.

Not saying it's good, just saying that's an existing behavior.

> That looks like the incorrect thing to do to me, particularly
> seeing a couple of lines down that a stats kind is assigned so the
> HashKey uniqueness is ensured by the KindInfo:
> +   [PGSTAT_KIND_RELFILENODE] = {
> +   .name = "relfilenode",

You mean, just rely on kind, dboid and relfile to ensure uniqueness?

I'm not sure that would work as there is this comment in relfilelocator.h:

"
 * Notice that relNumber is only unique within a database in a particular
 * tablespace.
"

So, I think it makes sense to link the hashkey to all the RelFileLocator
fields, means:

dboid (linked to RelFileLocator's dbOid)
objoid (linked to RelFileLocator's spcOid)
relfile (linked to RelFileLocator's relNumber)

> FWIW, I have on my stack of patches something to switch the objoid to
> 8 bytes, actually, which is something that would be required for
> pg_stat_statements as query IDs are wider than that and affect all
> databases, FWIW.  Relfilenodes are 4 bytes, okay still Robert has
> proposed a couple of years ago a patch set to bump that to 56 bits,
> change reverted in a448e49bcbe4.

Right, but it really looks like this extra field is needed to ensure
uniqueness (see above).

> What you would be looking instead is to use the relfilenode as an
> objoid

Not sure that works, as it looks like uniqueness won't be ensured (see above).

> and keep track of the OID of the original relation in each
> PgStat_StatRelFileNodeEntry so as it is possible to know where a past
> relfilenode was used?  That makes looking back at the past relation's
> elfilenodes stats more complicated as it would be necessary to keep a
> list of the past relfilenodes for a relation, as well.  Perhaps with
> some kind of cache that maintains a mapping between the relation and
> its relfilenode history?

Yeah, I also thought about keeping a list of "previous" relfilenodes stats for a
relation but that would lead to:

1. Keep previous relfilnode stats 
2. A more complicated way to look at relation stats (as you said)
3. Extra memory usage

I think the only reason "previous" relfilenode stats are needed is to provide
accurate stats for the relation. Outside of this need, I don't think we would
want to retrieve "individual" previous relfilenode stats in the past.

That's why the POC patch "simply" copies the stats to the relation during a
rewrite (before getting rid of the "previous" relfilenode stats).

What do you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?

2024-07-10 Thread Masahiko Sawada

On Wed, Jul 10, 2024 at 5:14 PM Masahiko Sawada  wrote:
>
> On Wed, Jul 10, 2024 at 4:14 PM Fujii Masao  
> wrote:
> >
> >
> >
> > On 2024/07/10 12:13, Masahiko Sawada wrote:
> > > On Sat, Jul 6, 2024 at 4:06 PM Fujii Masao  
> > > wrote:
> > >>
> > >> Hi,
> > >>
> > >> I noticed that ALTER TABLE MERGE PARTITIONS and SPLIT PARTITION commands
> > >> always create new partitions in the default tablespace, regardless of
> > >> the parent's tablespace. However, the indexes of these new partitions 
> > >> inherit
> > >> the tablespaces of their parent indexes. This inconsistency seems odd.
> > >> Is this an oversight or intentional?
> > >>
> > >> Here are the steps I used to test this:
> > >>
> > >> ---
> > >> CREATE TABLESPACE tblspc LOCATION '/tmp/tblspc';
> > >> CREATE TABLE t (i int PRIMARY KEY USING INDEX TABLESPACE tblspc)
> > >> PARTITION BY RANGE (i) TABLESPACE tblspc;
> > >>
> > >> CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
> > >> CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);
> > >>
> > >> ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
> > >>
> > >> SELECT tablename, tablespace FROM pg_tables WHERE tablename IN ('t', 
> > >> 'tp_0_2') ORDER BY tablename;
> > >>tablename | tablespace
> > >> ---+
> > >>t | tblspc
> > >>tp_0_2| (null)
> > >> (2 rows)
> > >>
> > >> SELECT indexname, tablespace FROM pg_indexes WHERE tablename IN ('t', 
> > >> 'tp_0_2') ORDER BY indexname;
> > >> indexname  | tablespace
> > >> -+
> > >>t_pkey  | tblspc
> > >>tp_0_2_pkey | tblspc
> > >> ---
> > >>
> > >>
> > >> If it's an oversight, I've attached a patch to ensure these commands 
> > >> create
> > >> new partitions in the parent's tablespace.
> > >
> > > +1
> > >
> > > Since creating a child table through the CREATE TABLE statement sets
> > > its parent table's tablespace as the child table's tablespace, it is
> > > logical to set the parent table's tablespace as the merged table's
> > > tablespace.

One expectation I had for MERGE PARTITION was that if all partition
tables to be merged are in the same tablespace, the merged table is
also created in the same tablespace. But it would be an exceptional
case in a sense, and I agree with the proposed behavior as it's
consistent. It might be a good idea that we can specify the tablespace
for each merged/split table in the future.

BTW the new regression tests don't check the table and index names.
Isn't it better to show table and index names for better
diagnosability?

+-- Check the new partition inherits parent's tablespace
+CREATE TABLE t (i int PRIMARY KEY USING INDEX TABLESPACE regress_tblspace)
+  PARTITION BY RANGE (i) TABLESPACE regress_tblspace;
+CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
+CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);
+ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
+SELECT tablespace FROM pg_tables WHERE tablename IN ('t', 'tp_0_2')
ORDER BY tablespace;
+tablespace
+--
+ regress_tblspace
+ regress_tblspace
+(2 rows)
+
+SELECT tablespace FROM pg_indexes WHERE tablename IN ('t', 'tp_0_2')
ORDER BY tablespace;
+tablespace
+--
+ regress_tblspace
+ regress_tblspace
+(2 rows)
+
+DROP TABLE t;


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: tests fail on windows with default git settings

2024-07-10 Thread Tom Lane

Dave Page  writes:
> On Wed, 10 Jul 2024 at 12:12, Andrew Dunstan  wrote:
>> As I was looking at this I wondered if there might be anywhere else that
>> needed adjustment. One thing that occurred to me was that that maybe we
>> should replace the use of "-w" in pg_regress.c with this rather less
>> dangerous flag, so instead of ignoring any white space difference we would
>> only ignore line end differences. The use of "-w" apparently dates back to
>> 2009.

> That seems like a good improvement to me.

+1

regards, tom lane

Re: Allow non-superuser to cancel superuser tasks.

2024-07-10 Thread Andrey M. Borodin

> On 10 Jul 2024, at 11:27, Kirill Reshke  wrote:
> 
> That's very strange, because the test works fine on my virtual
> machine. Also, it seems that it works in Cirrus [0], as there is this
> line:

So far I could not reproduce that failure.
I’ve checkouted 6edec53 from CFbot repository, but it works fine in both 
Cirrus[0,1,2] and my machines…
It seems like we have to rely on intuition to know what happened.

Best regards, Andrey Borodin.
[0] https://github.com/x4m/postgres_g/runs/27266322657
[1] https://github.com/x4m/postgres_g/runs/27266278325
[2] https://github.com/x4m/postgres_g/runs/27266052318

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-07-10 Thread Jelte Fennema-Nio

On Mon, 1 Jul 2024 at 00:38, Jelte Fennema-Nio  wrote:
> Ugh yes, I think this was a copy paste error. See attached patch 0003
> to fix this (rest of the patches are untouched from previous
> revision).

Alvaro committed 0003, which caused cfbot to think a rebase is
necessary. Attached should solve that.


v4-0001-Make-postgres_fdw-cancel-test-not-flaky-anymore.patch
Description: Binary data


v4-0002-Do-not-reset-statement_timeout-indicator-outside-.patch
Description: Binary data

Re: Should we work around msvc failing to compile tab-complete.c?

2024-07-10 Thread Andrew Dunstan



On 2024-07-10 We 5:55 AM, Dave Page wrote:




What is more relevant is that as far as I can see, we've never 
actually supported either libedit or readline in MSVC++ builds - which 
kinda makes sense because Windows terminals are very different from 
traditional *nix ones. Windows isn't supported by either libedit or 
readline as far as I can see, except through a couple of forks.


I imagine from mingw/cygwin builds, readline would only actually work 
properly in their respective terminals.




configure.ac explicitly forces --without-readline on win32, so no for 
mingw. configure.ac claims there are issues especially with non-US code 
pages.


One of the reasons I've continued to support building with Cygwin is 
that its readline does seem to work, making its psql the best I know of 
on Windows.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

Re: speed up a logical replica setup

2024-07-10 Thread Euler Taveira

On Tue, Jul 9, 2024, at 8:00 AM, Alexander Lakhin wrote:
> Hello Amit and Kuroda-san,
> 
> 03.07.2024 14:02, Amit Kapila wrote:
> > Pushed 0002 and 0003. Let's wait for a discussion on 0001.
> 
> Please look at another failure of the test [1]:
> [13:28:05.647](2.460s) not ok 26 - failover slot is synced
> [13:28:05.648](0.001s) #   Failed test 'failover slot is synced'
> #   at 
> /home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
>  line 307.
> [13:28:05.648](0.000s) #  got: ''
> # expected: 'failover_slot'

I'm wondering if the attached patch is sufficient to move the restart_lsn
forward. I experimented several lightweight ideas but none works. BTW the steps
to create the failover slot here is similar 040_standby_failover_slots_sync.pl.
I don't have a clue why it is failing for this one.


--
Euler Taveira
EDB   https://www.enterprisedb.com/
diff --git a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
index 74b90d9a913..232dbbbc55e 100644
--- a/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
+++ b/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
@@ -297,6 +297,17 @@ $node_p->safe_psql($db1,
 	"SELECT pg_create_logical_replication_slot('$fslotname', 'pgoutput', false, false, true)"
 );
 $node_s->start;
+$node_p->safe_psql($db1, qq(
+	SELECT pg_logical_emit_message(true, 'a', '1');
+	CHECKPOINT;
+));
+$node_p->safe_psql($db1, qq(
+	SELECT *
+	FROM pg_logical_slot_get_binary_changes('$fslotname', NULL, NULL,
+		'proto_version', '1',
+		'publication_names', 'dummy',
+		'messages', 'true');
+));
 # Wait for the standby to catch up so that the standby is not lagging behind
 # the failover slot.
 $node_p->wait_for_replay_catchup($node_s);

Re: tests fail on windows with default git settings

On Wed, 10 Jul 2024 at 12:12, Andrew Dunstan  wrote:

>
> On 2024-07-09 Tu 11:34 AM, Andrew Dunstan wrote:
>
>
> On 2024-07-09 Tu 9:52 AM, Dave Page wrote:
>
>
>
>> > What I suggest (see attached) is we run the diff command with
>> > --strip-trailing-cr on Windows. Then we just won't care if the expected
>> file
>> > and/or the output file has CRs.
>>
>> I was wondering about that too, but I wasn't sure we can rely on that flag
>> being supported...
>>
>
> I have 4 different diff.exe's on my ~6 week old build VM (not counting
> shims), all of which seem to support --strip-trailing-cr. Those builds came
> with:
>
> - git
> - VC++
> - diffutils (installed by chocolatey)
> - vcpkg
>
> I think it's reasonable to assume it'll be supported.
>
>
>
> Ok, cool. So I propose to patch the test_json_parser and pg_bsd_indent
> tests to use it on Windows, later today unless there's some objection.
>
>
>
>
> As I was looking at this I wondered if there might be anywhere else that
> needed adjustment. One thing that occurred to me was that that maybe we
> should replace the use of "-w" in pg_regress.c with this rather less
> dangerous flag, so instead of ignoring any white space difference we would
> only ignore line end differences. The use of "-w" apparently dates back to
> 2009.
>
That seems like a good improvement to me.

-- 
Dave Page
pgAdmin: https://www.pgadmin.org
PostgreSQL: https://www.postgresql.org
EDB: https://www.enterprisedb.com

Re: tests fail on windows with default git settings

2024-07-10 Thread Andrew Dunstan

On 2024-07-09 Tu 11:34 AM, Andrew Dunstan wrote:

On 2024-07-09 Tu 9:52 AM, Dave Page wrote:

> What I suggest (see attached) is we run the diff command with
> --strip-trailing-cr on Windows. Then we just won't care if the
expected file
> and/or the output file has CRs.

I was wondering about that too, but I wasn't sure we can rely on
that flag
being supported...

I have 4 different diff.exe's on my ~6 week old build VM (not 
counting shims), all of which seem to support --strip-trailing-cr. 
Those builds came with:

- git
- VC++
- diffutils (installed by chocolatey)
- vcpkg

I think it's reasonable to assume it'll be supported.

Ok, cool. So I propose to patch the test_json_parser and pg_bsd_indent 
tests to use it on Windows, later today unless there's some objection.

As I was looking at this I wondered if there might be anywhere else that 
needed adjustment. One thing that occurred to me was that that maybe we 
should replace the use of "-w" in pg_regress.c with this rather less 
dangerous flag, so instead of ignoring any white space difference we 
would only ignore line end differences. The use of "-w" apparently dates 
back to 2009.

Thoughts?

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

Re: Lock-free compaction. Why not?

2024-07-10 Thread Tomas Vondra

On 7/10/24 11:49, David Rowley wrote:
> On Tue, 9 Jul 2024 at 16:58, Ahmed Yarub Hani Al Nuaimi
>  wrote:
>> The thing is, after reading the code a million times, I still don't 
>> understand why lock-free (or minimum locking) is such a big problem! Is it 
>> that hard to lazily move tuples from one page to the other after 
>> defragmenting it lazily?
> 
> I think there are a few things to think about. You may have thought of
> some of these already.
> 
> 1. moving rows could cause deadlocking issues. Users might struggle to
> accept that some background process is causing their transaction to
> rollback.
> 2. transaction size: How large to make the batches of tuples to move
> at once? One transaction sounds much more prone to deadlocking.
> 3. xid consumption. Doing lots of small transactions to move tuples
> could consume lots of xids.
> 4. moving tuples around to other pages needs indexes to be updated and
> could cause index bloat.
> 
> For #1, maybe there's something that can be done to ensure it's always
> vacuum that's the deadlock victim.
> 
> You might be interested in [1].  There's also an older discussion in
> [2] that you might find interesting.
> 

IIRC long time ago VACUUM FULL actually worked in a similar way, i.e. by
moving rows around. I'm not sure if it did the lock-free thing as
proposed here (probably not), but I guess at least some of the reasons
why it was replaced by CLUSTER would still apply to this new thing.

Maybe it's a good trade off for some use cases (after all, people do
that using pg_repack/pg_squeeze/... so it clearly has value for them),
but it'd be a bit unfortunate to rediscover those old issues later.

The cluster vacuum was introduced by commit 946cf229e89 in 2010, and
then the "inplace" variant was removed by 0a469c87692 shortly after. I
haven't looked for the threads discussing those changes, but I guess it
should not be hard to find in the archives.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?

On Sat, 6 Jul 2024 at 19:06, Fujii Masao  wrote:
> I noticed that ALTER TABLE MERGE PARTITIONS and SPLIT PARTITION commands
> always create new partitions in the default tablespace, regardless of
> the parent's tablespace. However, the indexes of these new partitions inherit
> the tablespaces of their parent indexes. This inconsistency seems odd.
> Is this an oversight or intentional?

My expectation of this feature is that the tablespace choice would
work the same as what was done in ca4103025 to make it inherit from
the partition table's tablespace. I imagine we might get complaints if
it does not follow the same logic.

I've not looked at your patch, but if the behaviour is as you describe
and the patch changes that to follow ca4103025, then +1 from me.

David

Re: Should we work around msvc failing to compile tab-complete.c?

On Tue, 9 Jul 2024 at 17:23, Andres Freund  wrote:

> Hi,
>
> On 2024-07-09 09:14:33 +0100, Dave Page wrote:
> > On Mon, 8 Jul 2024 at 21:08, Andres Freund  wrote:
> > > I think we'd need to backpatch more for older branches. At least
> > >
> > > commit 3f28bd7337d
> > > Author: Thomas Munro 
> > > Date:   2022-12-22 17:14:23 +1300
> > >
> > > Add work-around for VA_ARGS_NARGS() on MSVC.
> > >
> > >
> > > Given that - afaict - tab completion never used to work with msvc, I
> think
> > > it'd be ok to just do it in 17 or 16+17 or such. Obviously nobody is
> > > currently
> > > building with readline support for windows - not sure if any packager
> is
> > > going
> > > to go back and add support for it in older branches.
> >
> >
> > Packagers aren't likely to be using readline, as it's GPL and it would
> have
> > to be shipped with packages on Windows.
>
> I'm not sure (I mean that literally, not as a way to state that I think
> it's
> not as you say) it'd actually be *that* big a problem to use readline -
> sure
> it'd make psql GPL, but that's the only thing using readline. Since psql
> doesn't link to extensible / user provided code, it might actually be ok.
>
> With openssl < 3.0 the mix between openssl and readline would be
> problematic,
> IIRC the licenses aren't really compatible. But luckily openssl relicensed
> to
> apache v2.
>

Certainly in the case of the packages produced at EDB, we didn't want any
potential surprises for users/redistributors/embedders, so *any* GPL was a
problem.

> But that is pretty orthogonal to the issue at hand:
>

Right.

What is more relevant is that as far as I can see, we've never actually
supported either libedit or readline in MSVC++ builds - which kinda makes
sense because Windows terminals are very different from traditional *nix
ones. Windows isn't supported by either libedit or readline as far as I can
see, except through a couple of forks.

I imagine from mingw/cygwin builds, readline would only actually work
properly in their respective terminals.

-- 
Dave Page
pgAdmin: https://www.pgadmin.org
PostgreSQL: https://www.postgresql.org
EDB: https://www.enterprisedb.com

Re: Lock-free compaction. Why not?

On Tue, 9 Jul 2024 at 16:58, Ahmed Yarub Hani Al Nuaimi
 wrote:
> The thing is, after reading the code a million times, I still don't 
> understand why lock-free (or minimum locking) is such a big problem! Is it 
> that hard to lazily move tuples from one page to the other after 
> defragmenting it lazily?

I think there are a few things to think about. You may have thought of
some of these already.

1. moving rows could cause deadlocking issues. Users might struggle to
accept that some background process is causing their transaction to
rollback.
2. transaction size: How large to make the batches of tuples to move
at once? One transaction sounds much more prone to deadlocking.
3. xid consumption. Doing lots of small transactions to move tuples
could consume lots of xids.
4. moving tuples around to other pages needs indexes to be updated and
could cause index bloat.

For #1, maybe there's something that can be done to ensure it's always
vacuum that's the deadlock victim.

You might be interested in [1].  There's also an older discussion in
[2] that you might find interesting.

David

[1] 
https://www.postgresql.org/message-id/flat/CAFj8pRDNDOrg90hLMmbo_hiWpgBm%2B73gmWMRUHRkTKwrGnvYJQ%40mail.gmail.com#cc4f8d730d2c5203f53c50260053fec5
[2] 
https://www.postgresql.org/message-id/flat/CANTTaev-LdgYj4uZoy67catS5SF5u_X-dTHiLH7OKwU6Gv3MFA%40mail.gmail.com

Re: Conflict detection and logging in logical replication

2024-07-10 Thread shveta malik

On Wed, Jul 3, 2024 at 8:31 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, June 26, 2024 10:58 AM Zhijie Hou (Fujitsu) 
>  wrote:
> >
>
> Hi,
>
> As suggested by Sawada-san in another thread[1].
>
> I am attaching the V4 patch set which tracks the delete_differ
> conflict in logical replication.
>
> delete_differ means that the replicated DELETE is deleting a row
> that was modified by a different origin.
>

Thanks for the patch. I am still in process of review but please find
few comments:

1) When I try to *insert* primary/unique key on pub, which already
exists on sub, conflict gets detected. But when I try to *update*
primary/unique key to a value on pub which already exists on sub,
conflict is not detected. I get the error:

2024-07-10 14:21:09.976 IST [647678] ERROR:  duplicate key value
violates unique constraint "t1_pkey"
2024-07-10 14:21:09.976 IST [647678] DETAIL:  Key (pk)=(4) already exists.

This is because such conflict detection needs detection of constraint
violation using the *new value* rather than *existing* value during
UPDATE. INSERT conflict detection takes care of this case i.e. the
columns of incoming row are considered as new values and it tries to
see if all unique indexes are okay to digest such new values (all
incoming columns) but update's logic is different. It  searches based
on oldTuple *only* and thus above detection is missing.

Shall we support such detection? If not, is it worth docuementing? It
basically falls in 'pkey_exists' conflict category but to user it
might seem like any ordinary update leading to 'unique key constraint
violation'.

2)
Another case which might confuse user:

CREATE TABLE t1 (pk integer primary key, val1 integer, val2 integer);

On PUB: insert into t1 values(1,10,10); insert into t1 values(2,20,20);

On SUB: update t1 set pk=3 where pk=2;

Data on PUB: {1,10,10}, {2,20,20}
Data on SUB: {1,10,10}, {3,20,20}

Now on PUB: update t1 set val1=200 where val1=20;

On Sub, I get this:
2024-07-10 14:44:00.160 IST [648287] LOG:  conflict update_missing
detected on relation "public.t1"
2024-07-10 14:44:00.160 IST [648287] DETAIL:  Did not find the row to
be updated.
2024-07-10 14:44:00.160 IST [648287] CONTEXT:  processing remote data
for replication origin "pg_16389" during message type "UPDATE" for
replication target relation "public.t1" in transaction 760, finished
at 0/156D658

To user, it could be quite confusing, as val1=20 exists on sub but
still he gets update_missing conflict and the 'DETAIL' is not
sufficient to give the clarity. I think on HEAD as well (have not
tested), we will get same behavior i.e. update will be ignored as we
make search based on RI (pk in this case). So we are not worsening the
situation, but now since we are detecting conflict, is it possible to
give better details in 'DETAIL' section indicating what is actually
missing?

 thanks
Shveta

Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-07-10 Thread Tatsuo Ishii

>>> Yes, I think so. I'd keep each as a separate patch so they can be
>>> considered independently. Doing all of them should hopefully ensure we
>>> strike the right balance of what code to put in explain.c and what
>>> code to put in tuplestore.c.
>> +1
>> 
>> + if (es->format != EXPLAIN_FORMAT_TEXT)
>> + {
>> + ExplainPropertyText("Storage", storageType, es);
>> + ExplainPropertyInteger("Maximum Storage", "kB", spaceUsedKB, es);
>> + }
>> + else
>> + {
>> + ExplainIndentText(es);
>> + appendStringInfo(es->str,
>> + "Storage: %s  Maximum Storage: " INT64_FORMAT "kB\n",
>> + storageType,
>> + spaceUsedKB);
>> + }
>> 
>> It will be good to move this code to a function which will be called
>> by show_*_info functions().
> 
> I have already implemented that in this direction in my working in
> progress patch:

Attached are the v2 patches. As suggested by David, I split them
into multiple patches so that each patch implements the feature for
each node. You need to apply the patches in the order of patch number
(if you want to apply all of them, "git apply v2-*.patch" should
work).

v2-0001-Refactor-show_material_info.patch:
This refactors show_material_info(). The guts are moved to new
show_storage_info() so that it can be shared by not only Materialized
node.

v2-0002-Add-memory-disk-usage-for-CTE-Scan-nodes-in-EXPLA.patch:
This adds memory/disk usage for CTE Scan nodes in EXPLAIN (ANALYZE) command.

v2-0003-Add-memory-disk-usage-for-Table-Function-Scan-nod.patch:
This adds memory/disk usage for Table Function Scan nodes in EXPLAIN (ANALYZE) 
command.

v2-0004-Add-memory-disk-usage-for-Recursive-Union-nodes-i.patch:
This adds memory/disk usage for Recursive Union nodes in EXPLAIN
(ANALYZE) command. Also show_storage_info() is changed so that it
accepts int64 storage_used, char *storage_type arguments. They are
used if the target node uses multiple tuplestores, in case a simple
call to tuplestore_space_used() does not work. Such executor nodes
need to collect storage_used while running the node. This type of node
includes Recursive Union and Window Aggregate.

v2-0005-Add-memory-disk-usage-for-Window-Aggregate-nodes-.patch: This
adds memory/disk usage for Window Aggregate nodes in EXPLAIN (ANALYZE)
command. Note that if David's proposal
https://www.postgresql.org/message-id/CAHoyFK9n-QCXKTUWT_xxtXninSMEv%2BgbJN66-y6prM3f4WkEHw%40mail.gmail.com
is committed, this will need to be adjusted.

For a demonstration, how storage/memory usage is shown in EXPLAIN
(notice "Storage: Memory Maximum Storage: 120kB" etc.). The script
used is attached (test.sql.txt). The SQLs are shamelessly copied from
David's example and the regression test (some of them were modified by
me).

EXPLAIN (ANALYZE, COSTS OFF)
SELECT count(t1.b) FROM (VALUES(1),(2)) t2(x) LEFT JOIN (SELECT * FROM t1 WHERE 
a <= 100) t1 ON TRUE;
   QUERY PLAN   
 
-
 Aggregate (actual time=0.345..0.346 rows=1 loops=1)
   ->  Nested Loop Left Join (actual time=0.015..0.330 rows=200 loops=1)
 ->  Values Scan on "*VALUES*" (actual time=0.001..0.003 rows=2 loops=1)
 ->  Materialize (actual time=0.006..0.152 rows=100 loops=2)
   Storage: Memory  Maximum Storage: 120kB
   ->  Seq Scan on t1 (actual time=0.007..0.213 rows=100 loops=1)
 Filter: (a <= 100)
 Rows Removed by Filter: 900
 Planning Time: 0.202 ms
 Execution Time: 0.377 ms
(10 rows)

-- CTE Scan node
EXPLAIN (ANALYZE, COSTS OFF)
WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) OVER() FROM t;
QUERY PLAN  
   
---
 WindowAgg (actual time=0.151..0.169 rows=100 loops=1)
   Storage: Memory  Maximum Storage: 20kB
   CTE t
 ->  Recursive Union (actual time=0.001..0.105 rows=100 loops=1)
   Storage: Memory  Maximum Storage: 17kB
   ->  Result (actual time=0.001..0.001 rows=1 loops=1)
   ->  WorkTable Scan on t t_1 (actual time=0.000..0.000 rows=1 
loops=100)
 Filter: (n < 100)
 Rows Removed by Filter: 0
   ->  CTE Scan on t (actual time=0.002..0.127 rows=100 loops=1)
 Storage: Memory  Maximum Storage: 20kB
 Planning Time: 0.053 ms
 Execution Time: 0.192 ms
(13 rows)

-- Table Function Scan node
CREATE OR REPLACE VIEW public.jsonb_table_view6 AS
 SELECT js2,
jsb2w,
jsb2q,
ia,
ta,
jba
   FROM JSON_TABLE(
'null'::jsonb, '$[*]' AS json_table_path_0
PASSING
1 + 2 AS a,
'"foo"'::json AS "b c"
COLUMNS (
js2 json PATH '$' WITHOUT WRAPPER KEEP QUOTES,
jsb2w jsonb PATH '$' WITH UNCONDITIONAL WRAPPER KEEP QUOTES,

Re: tests fail on windows with default git settings

Sorry - somehow managed to send whilst pasting in logs...

On Wed, 10 Jul 2024 at 10:30, Dave Page  wrote:

>
>
> On Tue, 9 Jul 2024 at 17:32, Andres Freund  wrote:
>
>> Hi,
>>
>> On 2024-07-09 14:52:39 +0100, Dave Page wrote:
>> > I have 4 different diff.exe's on my ~6 week old build VM (not counting
>> > shims), all of which seem to support --strip-trailing-cr. Those builds
>> came
>> > with:
>> >
>> > - git
>> > - VC++
>> > - diffutils (installed by chocolatey)
>> > - vcpkg
>> >
>> > I think it's reasonable to assume it'll be supported.
>>
>> I think the more likely issue would be an older setup with an older diff,
>> people on windows seem to not want to touch a working setup ever :). But
>> we
>> can deal with that if reports about it come in.
>>
>
> They've got to move to meson/ninja anyway, so... .
>
>
>>
>>
>> > > > Not sure what the issue is with pg_bsd_indent, though.
>> > >
>> >
>> > Yeah - that's odd, as that test always passes for me, with or without
>> > autocrlf.
>>
>> Huh.
>>
>>
>> > The other failures I see are the following, which I'm just starting to
>> dig
>> > into:
>> >
>> >  26/298 postgresql:recovery / recovery/019_replslot_limit
>> > ERROR43.05s   exit status 2
>> >  44/298 postgresql:recovery / recovery/027_stream_regress
>> > ERROR   383.08s   exit status 1
>> >  50/298 postgresql:recovery / recovery/035_standby_logical_decoding
>> > ERROR   138.06s   exit status 25
>> >  68/298 postgresql:recovery / recovery/040_standby_failover_slots_sync
>> >  ERROR   132.87s   exit status 25
>> > 170/298 postgresql:pg_dump / pg_dump/002_pg_dump
>> >  ERROR93.45s   exit status 2
>> > 233/298 postgresql:bloom / bloom/001_wal
>> >  ERROR54.47s   exit status 2
>> > 236/298 postgresql:subscription / subscription/001_rep_changes
>> >  ERROR46.46s   exit status 2
>> > 246/298 postgresql:subscription / subscription/010_truncate
>> > ERROR47.69s   exit status 2
>> > 253/298 postgresql:subscription / subscription/013_partition
>> >  ERROR   125.63s   exit status 25
>> > 255/298 postgresql:subscription / subscription/022_twophase_cascade
>> > ERROR58.13s   exit status 2
>> > 257/298 postgresql:subscription / subscription/015_stream
>> > ERROR   128.32s   exit status 2
>> > 262/298 postgresql:subscription / subscription/028_row_filter
>> > ERROR43.14s   exit status 2
>> > 263/298 postgresql:subscription / subscription/027_nosuperuser
>> >  ERROR   102.02s   exit status 2
>> > 269/298 postgresql:subscription / subscription/031_column_list
>> >  ERROR   123.16s   exit status 2
>> > 271/298 postgresql:subscription / subscription/032_subscribe_use_index
>> >  ERROR   139.33s   exit status 2
>>
>> Hm, it'd be good to see some of errors behind that ([1]).
>>
>> I suspect it might be related to conflicting ports. I had to use
>> PG_TEST_USE_UNIX_SOCKETS to avoid random tests from failing:
>>
>>   # use unix socket to prevent port conflicts
>>   $env:PG_TEST_USE_UNIX_SOCKETS = 1;
>>   # otherwise pg_regress insists on creating the directory and
>> does it
>>   # in a non-existing place, this needs to be fixed :(
>>   mkdir d:/sockets
>>   $env:PG_REGRESS_SOCK_DIR = "d:/sockets/"
>>
>
> No, it all seems to be fallout from GSSAPI being included in the build. If
> I build without that, everything passes. Most of the tests are failing with
> a "too many clients already" error, but a handful do seem to include GSSAPI
> auth related errors as well. For example, this is from
>


... this is from subscription/001_rep_changes:

[14:46:57.723](2.318s) ok 11 - check rows on subscriber after table drop
from publication
connection error: 'psql: error: connection to server at "127.0.0.1", port
58059 failed: could not initiate GSSAPI security context: No credentials
were supplied, or the credentials were unavailable or inaccessible:
Credential cache is empty
connection to server at "127.0.0.1", port 58059 failed: FATAL:  sorry, too
many clients already'
while running 'psql -XAtq -d port=58059 host=127.0.0.1 dbname='postgres' -f
- -v ON_ERROR_STOP=1' at
C:/Users/dpage/git/postgresql/src/test/perl/PostgreSQL/Test/Cluster.pm line
2129.
# Postmaster PID for node "publisher" is 14488
### Stopping node "publisher" using mode immediate
# Running: pg_ctl -D
C:\Users\dpage\git\postgresql\build/testrun/subscription/001_rep_changes\data/t_001_rep_changes_publisher_data/pgdata
-m immediate stop
waiting for server to shut down done
server stopped
# No postmaster PID for node "publisher"
# Postmaster PID for node "subscriber" is 15012
### Stopping node "subscriber" using mode immediate
# Running: pg_ctl -D

Re: tests fail on windows with default git settings

On Tue, 9 Jul 2024 at 17:32, Andres Freund  wrote:

> Hi,
>
> On 2024-07-09 14:52:39 +0100, Dave Page wrote:
> > I have 4 different diff.exe's on my ~6 week old build VM (not counting
> > shims), all of which seem to support --strip-trailing-cr. Those builds
> came
> > with:
> >
> > - git
> > - VC++
> > - diffutils (installed by chocolatey)
> > - vcpkg
> >
> > I think it's reasonable to assume it'll be supported.
>
> I think the more likely issue would be an older setup with an older diff,
> people on windows seem to not want to touch a working setup ever :). But we
> can deal with that if reports about it come in.
>

They've got to move to meson/ninja anyway, so... .


>
>
> > > > Not sure what the issue is with pg_bsd_indent, though.
> > >
> >
> > Yeah - that's odd, as that test always passes for me, with or without
> > autocrlf.
>
> Huh.
>
>
> > The other failures I see are the following, which I'm just starting to
> dig
> > into:
> >
> >  26/298 postgresql:recovery / recovery/019_replslot_limit
> > ERROR43.05s   exit status 2
> >  44/298 postgresql:recovery / recovery/027_stream_regress
> > ERROR   383.08s   exit status 1
> >  50/298 postgresql:recovery / recovery/035_standby_logical_decoding
> > ERROR   138.06s   exit status 25
> >  68/298 postgresql:recovery / recovery/040_standby_failover_slots_sync
> >  ERROR   132.87s   exit status 25
> > 170/298 postgresql:pg_dump / pg_dump/002_pg_dump
> >  ERROR93.45s   exit status 2
> > 233/298 postgresql:bloom / bloom/001_wal
> >  ERROR54.47s   exit status 2
> > 236/298 postgresql:subscription / subscription/001_rep_changes
> >  ERROR46.46s   exit status 2
> > 246/298 postgresql:subscription / subscription/010_truncate
> > ERROR47.69s   exit status 2
> > 253/298 postgresql:subscription / subscription/013_partition
> >  ERROR   125.63s   exit status 25
> > 255/298 postgresql:subscription / subscription/022_twophase_cascade
> > ERROR58.13s   exit status 2
> > 257/298 postgresql:subscription / subscription/015_stream
> > ERROR   128.32s   exit status 2
> > 262/298 postgresql:subscription / subscription/028_row_filter
> > ERROR43.14s   exit status 2
> > 263/298 postgresql:subscription / subscription/027_nosuperuser
> >  ERROR   102.02s   exit status 2
> > 269/298 postgresql:subscription / subscription/031_column_list
> >  ERROR   123.16s   exit status 2
> > 271/298 postgresql:subscription / subscription/032_subscribe_use_index
> >  ERROR   139.33s   exit status 2
>
> Hm, it'd be good to see some of errors behind that ([1]).
>
> I suspect it might be related to conflicting ports. I had to use
> PG_TEST_USE_UNIX_SOCKETS to avoid random tests from failing:
>
>   # use unix socket to prevent port conflicts
>   $env:PG_TEST_USE_UNIX_SOCKETS = 1;
>   # otherwise pg_regress insists on creating the directory and
> does it
>   # in a non-existing place, this needs to be fixed :(
>   mkdir d:/sockets
>   $env:PG_REGRESS_SOCK_DIR = "d:/sockets/"
>

No, it all seems to be fallout from GSSAPI being included in the build. If
I build without that, everything passes. Most of the tests are failing with
a "too many clients already" error, but a handful do seem to include auth
related errors as well. For example, this is from




>
>
> FWIW, building a tree with the patches I sent to the list last night and
> changes to make postgresql-dev.yml use a git checkout, I get:
>
>
> https://github.com/anarazel/winpgbuild/actions/runs/9852370209/job/27200784987#step:12:469
>
> Ok: 281
> Expected Fail:  0
> Fail:   0
> Unexpected Pass:0
> Skipped:17
> Timeout:0
>
> This is without readline and pltcl, as neither is currently built as part
> of
> winpgbuild. Otherwise it has all applicable dependencies enabled (no
> bonjour,
> bsd_auth, dtrace, llvm, pam, selinux, systemd, but that's afaict expected).
>
> Greetings,
>
> Andres Freund
>
>
> [1] I plan to submit a PR that'll collect the necessary information
>


-- 
Dave Page
pgAdmin: https://www.pgadmin.org
PostgreSQL: https://www.postgresql.org
EDB: https://www.enterprisedb.com

Re: Pluggable cumulative statistics

Hi,

On Wed, Jul 10, 2024 at 08:28:56AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jul 09, 2024 at 03:54:37PM +0900, Michael Paquier wrote:
> > On Mon, Jul 08, 2024 at 02:07:58PM +, Bertrand Drouvot wrote:
> > > It looks pretty straightforward, just one comment:
> > > 
> > > +   ptr = ((char *) ctl) + kind_info->shared_ctl_off;
> > > +   kind_info->init_shmem_cb((void *) ptr);
> > > 
> > > I don't think we need to cast ptr to void when calling init_shmem_cb(). 
> > > Looking
> > > at some examples in the code, it does not look like we cast the argument 
> > > to void
> > > when a function has (void *) as parameter (also there is examples in 0003 
> > > where
> > > it's not done, see next comments for 0003).
> > 
> > Yep.  Fine by me.
> 
> Thanks!
> 
> > 
> > Please find attached a rebased patch set for now, to make the
> > CF bot happy.
> 
> v5-0001 LGTM.
> 
> As far v5-0002:
> 
> +   goto error;
> +   info = pgstat_get_kind_info(kind);
> 
> Nit: add an empty line between the two?
> 
> Except this Nit, v5-0002 LGTM.

Oh, and also due to this change in 0002:

switch (t)
{
+   case PGSTAT_FILE_ENTRY_FIXED:
+   {

Then this comment:

/*
 * We found an existing statistics file. Read it and put all the hash
 * table entries into place.
 */
for (;;)
{
int t = fgetc(fpin);

switch (t)
{
case PGSTAT_FILE_ENTRY_FIXED:
{

is not correct anymore (as we're not reading the stats only into the hash table
anymore).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Pluggable cumulative statistics

Hi,

On Tue, Jul 09, 2024 at 03:54:37PM +0900, Michael Paquier wrote:
> On Mon, Jul 08, 2024 at 02:07:58PM +, Bertrand Drouvot wrote:
> > It looks pretty straightforward, just one comment:
> > 
> > +   ptr = ((char *) ctl) + kind_info->shared_ctl_off;
> > +   kind_info->init_shmem_cb((void *) ptr);
> > 
> > I don't think we need to cast ptr to void when calling init_shmem_cb(). 
> > Looking
> > at some examples in the code, it does not look like we cast the argument to 
> > void
> > when a function has (void *) as parameter (also there is examples in 0003 
> > where
> > it's not done, see next comments for 0003).
> 
> Yep.  Fine by me.

Thanks!

> 
> Please find attached a rebased patch set for now, to make the
> CF bot happy.

v5-0001 LGTM.

As far v5-0002:

+   goto error;
+   info = pgstat_get_kind_info(kind);

Nit: add an empty line between the two?

Except this Nit, v5-0002 LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Eager aggregation, take 3

2024-07-10 Thread Richard Guo

On Sun, Jul 7, 2024 at 10:45 AM Paul George  wrote:
> Thanks for reviving this patch and for all of your work on it! Eager 
> aggregation pushdown will be beneficial for my work and I'm hoping to see it 
> land.

Thanks for looking at this patch!

> The output of both the original query and this one match (and the plans with 
> eager aggregation and the subquery are nearly identical if you restore the 
> LEFT JOIN to a JOIN). I admittedly may be missing a subtlety, but does this 
> mean that there are conditions under which eager aggregation can be pushed 
> down to the nullable side?

I think it's a very risky thing to push a partial aggregation down to
the nullable side of an outer join, because the NULL-extended rows
produced by the outer join would not be available when we perform the
partial aggregation, while with a non-eager-aggregation plan these
rows are available for the top-level aggregation.  This may put the
rows into groups in a different way than expected, or get wrong values
from the aggregate functions.  I've managed to compose an example:

create table t (a int, b int);
insert into t select 1, 1;

select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
t2.a having t2.a is null;
 a | count
---+---
   | 1
(1 row)

This is the expected result, because after the outer join we have got
a NULL-extended row.

But if we somehow push down the partial aggregation to the nullable
side of this outer join, we would get a wrong result.

explain (costs off)
select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
t2.a having t2.a is null;
QUERY PLAN
---
 Finalize HashAggregate
   Group Key: t2.a
   ->  Nested Loop Left Join
 Filter: (t2.a IS NULL)
 ->  Seq Scan on t t1
 ->  Materialize
   ->  Partial HashAggregate
 Group Key: t2.a
 ->  Seq Scan on t t2
   Filter: (b > 1)
(10 rows)

select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
t2.a having t2.a is null;
 a | count
---+---
   | 0
(1 row)

I believe there are cases where pushing a partial aggregation down to
the nullable side of an outer join can be safe, but I doubt that there
is an easy way to identify these cases and do the push-down for them.
So for now I think we'd better refrain from doing that.

Thanks
Richard

Re: Logical Replication of sequences

2024-07-10 Thread Peter Smith

Here are a few comments for patch v20240705-0003.

(This is a WIP. I have only looked at the docs so far.)

==
doc/src/sgml/config.sgml

nitpick - max_logical_replication_workers: /and sequence
synchornization worker/and a sequence synchornization worker/

==
doc/src/sgml/logical-replication.sgml

nitpick - max_logical_replication_workers: re-order list of workers to
be consistent with other docs 1-apply,2-parallel,3-tablesync,4-seqsync

==
doc/src/sgml/ref/alter_subscription.sgml

1.
IIUC the existing "REFRESH PUBLICATION" command will fetch and sync
all new sequences, etc., and/or remove old ones no longer in the
publication. But current docs do not say anything at all about
sequences here. It should say something about sequence behaviour.

~~~

2.
For the existing "REFRESH PUBLICATION" there is a sub-option
"copy_data=true/false". Won't this need some explanation about how it
behaves for sequences? Or will there be another option
"copy_sequences=true/false".

~~~

3.
IIUC the main difference between REFRESH PUBLICATION and REFRESH
PUBLICATION SEQUENCES is that the 2nd command will try synchronize
with all the *existing* sequences to bring them to the same point as
on the publisher, but otherwise, they are the same command. If that is
correct understanding I don't think that distinction is made very
clear in the current docs.

~~~

nitpick - the synopsis is misplaced. It should not be between ENABLE
and DISABLE. I moved it. Also, it should say "REFRESH PUBLICATION
SEQUENCES" because that is how the new syntax is defined in gram.y

nitpick - REFRESH SEQUENCES. Renamed to "REFRESH PUBLICATION
SEQUENCES". And, shouldn't "from the publisher" say "with the
publisher"?

nitpick - changed the varlistentry "id".

==
99.
Please also see the attached diffs patch which implements any nitpicks
mentioned above.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 981ca51..645fc48 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5201,7 +5201,7 @@ ANY num_sync 
( 
 max_logical_replication_workers
 must be set to at least the number of subscriptions (for leader apply
-workers), plus some reserve for the table synchronization workers, sequence
-synchronization worker and parallel apply workers.
+workers), plus some reserve for the parallel apply workers, table 
synchronization workers, and a sequence
+synchronization worker.

 

diff --git a/doc/src/sgml/ref/alter_subscription.sgml 
b/doc/src/sgml/ref/alter_subscription.sgml
index fc8a33c..f81faf9 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -26,8 +26,8 @@ ALTER SUBSCRIPTION name SET PUBLICA
 ALTER SUBSCRIPTION name ADD 
PUBLICATION publication_name [, 
...] [ WITH ( publication_option 
[= value] [, ... ] ) ]
 ALTER SUBSCRIPTION name DROP 
PUBLICATION publication_name [, 
...] [ WITH ( publication_option 
[= value] [, ... ] ) ]
 ALTER SUBSCRIPTION name REFRESH 
PUBLICATION [ WITH ( refresh_option [= value] [, ... ] ) ]
+ALTER SUBSCRIPTION name REFRESH 
PUBLICATION SEQUENCES
 ALTER SUBSCRIPTION name ENABLE
-ALTER SUBSCRIPTION name REFRESH 
SEQUENCES
 ALTER SUBSCRIPTION name DISABLE
 ALTER SUBSCRIPTION name SET ( 
subscription_parameter [= 
value] [, ... ] )
 ALTER SUBSCRIPTION name SKIP ( 
skip_option = value )
@@ -195,12 +195,12 @@ ALTER SUBSCRIPTION name RENAME TO <
 

 
-   
-REFRESH SEQUENCES
+   
+REFRESH PUBLICATION SEQUENCES
 
  
   Fetch missing sequences information from publisher and re-synchronize the
-  sequence data from the publisher.
+  sequence data with the publisher.

Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?

2024-07-10 Thread Masahiko Sawada

On Wed, Jul 10, 2024 at 4:14 PM Fujii Masao  wrote:
>
>
>
> On 2024/07/10 12:13, Masahiko Sawada wrote:
> > On Sat, Jul 6, 2024 at 4:06 PM Fujii Masao  
> > wrote:
> >>
> >> Hi,
> >>
> >> I noticed that ALTER TABLE MERGE PARTITIONS and SPLIT PARTITION commands
> >> always create new partitions in the default tablespace, regardless of
> >> the parent's tablespace. However, the indexes of these new partitions 
> >> inherit
> >> the tablespaces of their parent indexes. This inconsistency seems odd.
> >> Is this an oversight or intentional?
> >>
> >> Here are the steps I used to test this:
> >>
> >> ---
> >> CREATE TABLESPACE tblspc LOCATION '/tmp/tblspc';
> >> CREATE TABLE t (i int PRIMARY KEY USING INDEX TABLESPACE tblspc)
> >> PARTITION BY RANGE (i) TABLESPACE tblspc;
> >>
> >> CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
> >> CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);
> >>
> >> ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
> >>
> >> SELECT tablename, tablespace FROM pg_tables WHERE tablename IN ('t', 
> >> 'tp_0_2') ORDER BY tablename;
> >>tablename | tablespace
> >> ---+
> >>t | tblspc
> >>tp_0_2| (null)
> >> (2 rows)
> >>
> >> SELECT indexname, tablespace FROM pg_indexes WHERE tablename IN ('t', 
> >> 'tp_0_2') ORDER BY indexname;
> >> indexname  | tablespace
> >> -+
> >>t_pkey  | tblspc
> >>tp_0_2_pkey | tblspc
> >> ---
> >>
> >>
> >> If it's an oversight, I've attached a patch to ensure these commands create
> >> new partitions in the parent's tablespace.
> >
> > +1
> >
> > Since creating a child table through the CREATE TABLE statement sets
> > its parent table's tablespace as the child table's tablespace, it is
> > logical to set the parent table's tablespace as the merged table's
> > tablespace.
>
> Thanks for the review!
>
>
> > While the patch does not include test cases for SPLIT PARTITIONS,
> > which is understandable as these commands use the common function that
> > we have fixed, I believe it would be prudent to test SPLIT PARTITIONS
> > as well since we could change it in the future development.
>
> Unless I'm mistaken, the patch already includes tests for the split case.
> Could you please check the tests added to partition_split.sql?
>

Oops, sorry, I missed that part for some reason.So the patch looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: pg_maintain and USAGE privilege on schema





On 2024/07/08 11:13, Nathan Bossart wrote:

On Mon, Jul 08, 2024 at 01:03:42AM +0900, Fujii Masao wrote:

I've noticed an issue with non-superusers who have the pg_maintain role.
When they run VACUUM on a specific table within a specific schema,
like "VACUUM mynsp.mytbl", it fails if they don't have the USAGE privilege
on the schema. For example, the error message logged is
"ERROR: permission denied for schema mynsp". However, running VACUUM
without specifying the table name, such as "VACUUM",
completes successfully and vacuums all tables, including those in schemas
where the user lacks the USAGE privilege.

Is this behavior intentional?


I'd consider it intentional because it matches the database owner behavior.
If the database owner does not have USAGE on a schema, they'll similarly be
unable to VACUUM a specific table in that schema while being able to VACUUM
it via a database-wide command. 


Yes, you're right.



That's admittedly a little weird, but IMHO
any changes in this area should apply to both pg_maintain and the database
owner.


However, unlike the database owner, pg_maintain by definition should
have *all* the rights needed for maintenance tasks, including MAINTAIN
rights on tables and USAGE rights on schemas? ISTM that both
pg_read_all_data and pg_write_all_data roles are defined similarly,
with USAGE rights on all schemas. So, granting USAGE rights to
pg_maintain, but not the database owner, doesn't seem so odd to me.



I assumed that a pg_maintain user could run VACUUM on specific tables
in any schema without needing additional privileges. So, shouldn't
pg_maintain users be able to perform maintenance commands as if they have
USAGE rights on all schemas?


It might be reasonable to give implicit USAGE privileges on all schemas
during maintenance commands to pg_maintain roles.  I would be a little
hesitant to consider this v17 material, though.


That's a valid concern. I'd like hear more opinions about this.



There are some other inconsistencies that predate MAINTAIN that I think we
ought to clear up at some point.  For example, the privilege checks for
REINDEX work a bit differently than VACUUM, ANALYZE, and CLUSTER.  I doubt
that's causing anyone too much trouble in the field, but since we're
grouping these commands together as "maintenance commands" now, it'd be
nice to make them as consistent as possible.


+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Avoid orphaned objects dependencies, take 3

Hi,

On Tue, Jul 02, 2024 at 05:56:23AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Jul 01, 2024 at 09:39:17AM +, Bertrand Drouvot wrote:
> > Hi,
> > 
> > On Wed, Jun 26, 2024 at 10:24:41AM +, Bertrand Drouvot wrote:
> > > Hi,
> > > 
> > > On Fri, Jun 21, 2024 at 01:22:43PM +, Bertrand Drouvot wrote:
> > > > Another thought for the RelationRelationId class case: we could check 
> > > > if there
> > > > is a lock first and if there is no lock then acquire one. That way that 
> > > > would
> > > > ensure the relation is always locked (so no "risk" anymore), but OTOH 
> > > > it may
> > > > add "unecessary" locking (see 2. mentioned previously).
> > > 
> > > Please find attached v12 implementing this idea for the 
> > > RelationRelationId class
> > > case. As mentioned, it may add unnecessary locking for 2. but I think 
> > > that's
> > > worth it to ensure that we are always on the safe side of thing. This 
> > > idea is
> > > implemented in LockNotPinnedObjectById().
> > 
> > Please find attached v13, mandatory rebase due to 0cecc908e97. In passing, 
> > make
> > use of CheckRelationOidLockedByMe() added in 0cecc908e97.
> 
> Please find attached v14, mandatory rebase due to 65b71dec2d5.

In [1] I mentioned that the object locking logic has been put outside of the 
dependency code except for:

recordDependencyOnExpr() 
recordDependencyOnSingleRelExpr()
makeConfigurationDependencies()

Please find attached v15 that also removes the logic outside of the 3 above 
functions. Well, for recordDependencyOnExpr() and 
recordDependencyOnSingleRelExpr()
that's now done in find_expr_references_walker(): It's somehow still in the
dependency code but at least it is now clear which objects we are locking (and
I'm not sure how we could do better than that for those 2 functions).

There is still one locking call in recordDependencyOnCurrentExtension() but I
think this one is clear enough and does not need to be put outside (for
the same reason mentioned in [1]).

So, to sum up:

A. Locking is now done exclusively with LockNotPinnedObject(Oid classid, Oid 
objid)
so that it's now always clear what object we want to acquire a lock for. It 
means
we are not manipulating directly an object address or a list of objects address
as it was the case when the locking was done "directly" within the dependency 
code.

B. A special case is done for objects that belong to the RelationRelationId 
class.
For those, we should be in one of the two following cases that would already
prevent the relation to be dropped:

 1. The relation is already locked (could be an existing relation or a relation
 that we are creating).

 2. The relation is protected indirectly (i.e an index protected by a lock on
 its table, a table protected by a lock on a function that depends the table...)

To avoid any risks for the RelationRelationId class case, we acquire a lock if
there is none. That may add unnecessary lock for 2. but that seems worth it. 

[1]: 
https://www.postgresql.org/message-id/ZnAVEBhlGvpDDVOD%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From bc47856d85e9e4f61fc22bc29ec40419e957b7e6 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v15] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place before the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 committs (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch takes into account any type of objects except the ones that are pinned
(they are not droppable because the system requires it).

A special case is done for objects that belong to the RelationRelationId class.
For those, we should be in one of the two following cases that would already
prevent the relation to be dropped:

1. The relation is already locked (could be an existing relation or a relation
that we are creating).

2. The relation is protected indirectly (i.e an index

Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?





On 2024/07/10 12:13, Masahiko Sawada wrote:

On Sat, Jul 6, 2024 at 4:06 PM Fujii Masao  wrote:


Hi,

I noticed that ALTER TABLE MERGE PARTITIONS and SPLIT PARTITION commands
always create new partitions in the default tablespace, regardless of
the parent's tablespace. However, the indexes of these new partitions inherit
the tablespaces of their parent indexes. This inconsistency seems odd.
Is this an oversight or intentional?

Here are the steps I used to test this:

---
CREATE TABLESPACE tblspc LOCATION '/tmp/tblspc';
CREATE TABLE t (i int PRIMARY KEY USING INDEX TABLESPACE tblspc)
PARTITION BY RANGE (i) TABLESPACE tblspc;

CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);

ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;

SELECT tablename, tablespace FROM pg_tables WHERE tablename IN ('t', 'tp_0_2') 
ORDER BY tablename;
   tablename | tablespace
---+
   t | tblspc
   tp_0_2| (null)
(2 rows)

SELECT indexname, tablespace FROM pg_indexes WHERE tablename IN ('t', 'tp_0_2') 
ORDER BY indexname;
indexname  | tablespace
-+
   t_pkey  | tblspc
   tp_0_2_pkey | tblspc
---


If it's an oversight, I've attached a patch to ensure these commands create
new partitions in the parent's tablespace.


+1

Since creating a child table through the CREATE TABLE statement sets
its parent table's tablespace as the child table's tablespace, it is
logical to set the parent table's tablespace as the merged table's
tablespace.


Thanks for the review!



While the patch does not include test cases for SPLIT PARTITIONS,
which is understandable as these commands use the common function that
we have fixed, I believe it would be prudent to test SPLIT PARTITIONS
as well since we could change it in the future development.


Unless I'm mistaken, the patch already includes tests for the split case.
Could you please check the tests added to partition_split.sql?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Doc: fix track_io_timing description to mention pg_stat_io





On 2024/07/08 12:01, hajime.matsun...@nttdata.com wrote:



From: Fujii Masao 
Sent: Wednesday, July 3, 2024 7:44 PM


On 2024/07/03 17:51, hajime.matsun...@nttdata.com wrote:

Thanks for the suggestions the other day.
I have created a patch that incorporates your suggestions.


-pg_stat_database, in the output of
+pg_stat_database and
+
+pg_stat_io, in the output of

I'm not a native English speaker, but it seems more natural to use a comma instead of 
"and" before "pg_stat_io."

-   of block read and write times.
+   of block read, block write, extend, and fsync times.

"block" before "fsync" doesn't seem necessary.


I'm also not a native English speaker, so I appreciate your feedback.
I have created a patch that incorporates your feedback.


Thanks for updating the patch! Pushed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: long-standing data loss bug in initial sync of logical replication

2024-07-10 Thread Amit Kapila

On Tue, Jul 9, 2024 at 8:14 PM vignesh C  wrote:
>
> On Tue, 9 Jul 2024 at 17:05, Amit Kapila  wrote:
> >
> > On Mon, Jul 1, 2024 at 10:51 AM vignesh C  wrote:
> > >
> > >
> > > This issue is present in all supported versions. I was able to
> > > reproduce it using the steps recommended by Andres and Tomas's
> > > scripts. I also conducted a small test through TAP tests to verify the
> > > problem. Attached is the alternate_lock_HEAD.patch, which includes the
> > > lock modification(Tomas's change) and the TAP test.
> > >
> >
> > @@ -1568,7 +1568,7 @@ OpenTableList(List *tables)
> >   /* Allow query cancel in case this takes a long time */
> >   CHECK_FOR_INTERRUPTS();
> >
> > - rel = table_openrv(t->relation, ShareUpdateExclusiveLock);
> > + rel = table_openrv(t->relation, ShareRowExclusiveLock);
> >
> > The comment just above this code ("Open, share-lock, and check all the
> > explicitly-specified relations") needs modification. It would be
> > better to explain the reason of why we would need SRE lock here.
>
> Updated comments for the same.
>

The patch missed to use the ShareRowExclusiveLock for partitions, see
attached. I haven't tested it but they should also face the same
problem. Apart from that, I have changed the comments in a few places
in the patch.

-- 
With Regards,
Amit Kapila.


v3-0001-Fix-data-loss-during-initial-sync-in-logical-repl.patch
Description: Binary data

Re: why there is not VACUUM FULL CONCURRENTLY?

2024-07-10 Thread Antonin Houska

Alvaro Herrera  wrote:

> On 2024-Jul-09, Antonin Houska wrote:
> 
> > Alvaro Herrera  wrote:
> > 
> > > > Is your plan to work on it soon or should I try to write a draft patch? 
> > > > (I
> > > > assume this is for PG >= 18.)
> > > 
> > > I don't have plans for it, so if you have resources, please go for it.
> > 
> > The first version is attached. The actual feature is in 0003. 0004 is 
> > probably
> > not necessary now, but I haven't realized until I coded it.
> 
> Thank you, this is great.  I'll be studying this during the next
> commitfest.

Thanks. I'll register it in the CF application.


> BTW I can apply 0003 from this email perfectly fine, but you're right
> that the archives don't show the file name.  I suspect the
> "Content-Disposition: inline" PLUS the Content-Type text/plain are what
> cause the problem -- for instance, [1] doesn't have a problem and they
> do have inline content disposition, but the content-type is not
> text/plain.  In any case, I encourage you not to send patches as
> tarballs :-)
> 
> [1]  https://postgr.es/m/32781.1714378236@antos

You're right, "Content-Disposition" is the problem. I forgot that "attachment"
is better for patches and my email client (emacs+nmh) defaults to
"inline". I'll pay attention next time.


-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Is creating logical replication slots in template databases useful at all?

On Tue, Jun 18, 2024 at 03:19:41PM +0530, Ashutosh Bapat wrote:
> On Mon, Jun 17, 2024 at 5:50 PM Bharath Rupireddy <
> bharath.rupireddyforpostg...@gmail.com> wrote:
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=29d0a77fa6606f9c01ba17311fc452dabd3f793d
>> ,
>> I noticed that get_old_cluster_logical_slot_infos gets called for even
>> template1 and template0 databases. Which means, pg_upgrade executes
>> queries against the template databases to get replication slot
>> information. I then realized that postgres allows one to connect to
>> template1 database (or any other user-defined template databases for
>> that matter), and create logical replication slots in it. If created,
>> all the subsequent database creations will end up adding inactive
>> logical replication slots in the postgres server. This might not be a
>> problem in production servers as I assume the connections to template
>> databases are typically restricted. Despite the connection
>> restrictions, if at all one gets to connect to template databases in
>> any way, it's pretty much possible to load the postgres server with
>> inactive replication slots.
> 
> The replication slot names are unique across databases [1] Hence
> replication slots created by connecting to template1 database should not
> get copied over when creating a new database. Is that broken? A logical
> replication slot is associated with a database but a physical replication
> slot is not. The danger you mention above applies only to logical
> replication slots I assume.

get_old_cluster_logical_slot_infos() on even template0 is still
correct, IMO, even if this template database is not something that
should be modified at all, or even have allow_connections enabled.  It
seems to me the correct answer here is that users should not create
slots where they are not going to use them.

>> This leads me to think why one would need logical replication slots in
>> template databases at all. Can postgres restrict logical replication
>> slots creation in template databases? If restricted, it may deviate
>> from the fundamental principle of template databases in the sense that
>> everything in the template database must be copied over to the new
>> database created using it. Is it okay to do this? Am I missing
>> something here?
> 
> If applications are using template1, they would want to keep the template1
> on primary and replica in sync. Replication slot associated with template1
> would be useful there.

Templates defined in CREATE DATABASE can be any active database as
long as they are in pg_database, so doing logical replication on
template1 to keep it in sync across nodes is fine.

In short, I am not quite seeing the problem here.
--
Michael


signature.asc
Description: PGP signature

Re: Allow non-superuser to cancel superuser tasks.

2024-07-10 Thread Kirill Reshke

> The script has two tests, and the CI is failing for the second test
> where we expect the signal to be processed:
> [12:48:23.370] #   Failed test 'autovacuum worker signaled with
> pg_signal_autovacuum_worker granted'
> [12:48:23.370] #   at t/006_signal_autovacuum.pl line 90.

> --
> Michael

That's very strange, because the test works fine on my virtual
machine. Also, it seems that it works in Cirrus [0], as there is this
line:

[autovacuum worker] FATAL:  terminating autovacuum process due to
administrator command

after `SET ROLE signal_autovacuum_worker_role;` and  `SELECT
pg_terminate_backend` in the log file.

Somehow the `node->log_contains` check does not catch that. Maybe
there is some issue with `$offset`? Will try to investigate

[0] 
https://api.cirrus-ci.com/v1/artifact/task/5668467599212544/log/src/test/modules/test_misc/tmp_check/log/006_signal_autovacuum_node.log

Re: relfilenode statistics