Re: Fix small typo, use InvalidRelFileNumber instead of InvalidOid
On Sat, Nov 9, 2024 at 12:41 PM Dilip Kumar wrote:
>
> On Fri, Nov 8, 2024 at 8:36 PM Tom Lane wrote:
>>
>> Dilip Kumar writes:
>> > IIRC, in the catalog we intentionally left it as Oid because RelFileNumber
>> > is an internal typedef; it is not an exposed datatype, so probably we
>> > cannot use it in the catalog.
>>
>> We could declare it as RelFileNumber so that that is what C code sees,
>> and teach Catalog.pm to translate that to OID in the emitted catalog
>> contents.
>
> Yeah, that makes sense, and yes we can actually keep this change.
>
>> I think you'd have to do that to avoid a ton of false
>> positives when you make RelFileNumber into a struct, and we might as
>> well keep the result.
>
> Even after this, there were tons of false positives whenever using any
> comparison operator on relfilenumbers [1], and there are tons of those, or
> using them in log messages with %u [2]. Anyway, I have gone through those
> manually, and after ignoring all false positives here is what I got; PFA
> patch (some 25 places where I have fixed usage of Oid instead of
> RelFileNumber).
>
> [1]
> ../../../../src/include/storage/buf_internals.h:157:20: error: invalid
> operands to binary expression ('const RelFileNumber' (aka 'const struct
> RelFileNumber') and 'const RelFileNumber')
>         (tag1->relNumber == tag2->relNumber)
>
> [2]
> fsmpage.c:277:47: warning: format specifies type 'unsigned int' but the
> argument has type 'RelFileNumber' (aka 'struct RelFileNumber') [-Wformat]
>         blknum, rlocator.spcOid, rlocator.dbOid, rlocator.relNumber);

Other than the changes I made in the patch in my previous email, there are a
couple more items we could consider for this change, but I am not sure, so
listing them here:

1. We are using PG_RETURN_OID() for returning the relfilenumber; should we
introduce something like PG_RETURN_RELFILENUMBER()? e.g. pg_relation_filenode()

2. We are using PG_GETARG_OID() for getting the relfilenumber as an external
function argument; shall we introduce something like PG_GETARG_RELFILENUMBER()?
e.g. binary_upgrade_set_next_heap_relfilenode()

3. info.c:611:23: error: assigning to 'RelFileNumber' (aka 'struct
RelFileNumber') from incompatible type 'Oid' (aka 'unsigned int')
        curr->relfilenumber = atooid(PQgetvalue(res, relnum, i_relfilenumber));

4. We are also using ObjectIdGetDatum() and DatumGetObjectId() for
relfilenumber, so shall we introduce new functions, i.e.
RelFileNumberGetDatum() and DatumGetRelFileNumber()?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
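[Editor's note] A rough sketch of what items (1), (2), and (4) could look
like, modeled on the existing Oid macros in postgres.h and fmgr.h; the names
and definitions here are only illustrative of the proposal, not a committed
API:

/* Datum conversions, mirroring DatumGetObjectId()/ObjectIdGetDatum() */
#define DatumGetRelFileNumber(X)	((RelFileNumber) DatumGetUInt32(X))
#define RelFileNumberGetDatum(X)	UInt32GetDatum((uint32) (X))

/* fmgr convenience wrappers, mirroring PG_GETARG_OID()/PG_RETURN_OID() */
#define PG_GETARG_RELFILENUMBER(n)	DatumGetRelFileNumber(PG_GETARG_DATUM(n))
#define PG_RETURN_RELFILENUMBER(x)	return RelFileNumberGetDatum(x)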
Re: Fix small typo, use InvalidRelFileNumber instead of InvalidOid
On Fri, Nov 8, 2024 at 8:36 PM Tom Lane wrote:
> Dilip Kumar writes:
> > IIRC, in the catalog we intentionally left it as Oid because RelFileNumber
> > is an internal typedef; it is not an exposed datatype, so probably we
> > cannot use it in the catalog.
>
> We could declare it as RelFileNumber so that that is what C code sees,
> and teach Catalog.pm to translate that to OID in the emitted catalog
> contents.

Yeah, that makes sense, and yes we can actually keep this change.

> I think you'd have to do that to avoid a ton of false
> positives when you make RelFileNumber into a struct, and we might as
> well keep the result.

Even after this, there were tons of false positives whenever using any
comparison operator on relfilenumbers [1], and there are tons of those, or
using them in log messages with %u [2]. Anyway, I have gone through those
manually, and after ignoring all false positives here is what I got; PFA
patch (some 25 places where I have fixed usage of Oid instead of
RelFileNumber).

[1]
../../../../src/include/storage/buf_internals.h:157:20: error: invalid
operands to binary expression ('const RelFileNumber' (aka 'const struct
RelFileNumber') and 'const RelFileNumber')
        (tag1->relNumber == tag2->relNumber)

[2]
fsmpage.c:277:47: warning: format specifies type 'unsigned int' but the
argument has type 'RelFileNumber' (aka 'struct RelFileNumber') [-Wformat]
        blknum, rlocator.spcOid, rlocator.dbOid, rlocator.relNumber);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment: remove_using_oid_for_relfilenumber.patch
Re: Fix small typo, use InvalidRelFileNumber instead of InvalidOid
On Fri, Nov 8, 2024 at 4:30 PM Alvaro Herrera wrote:
> On 2024-Nov-08, Dilip Kumar wrote:
>
> > It appears that we are passing InvalidOid instead of InvalidRelFileNumber
> > when calling index_create(). While this isn't technically a bug, since
> > InvalidRelFileNumber is defined as InvalidOid, it's preferable to use the
> > correct identifier for clarity and consistency.
>
> Oh ugh, this is quite a widespread problem actually, and it's just
> because RelFileNumber is a typedef to Oid that the compiler doesn't
> complain. In a very quick desultory scan, I found a bunch of similar
> issues elsewhere, such as below (also the assignment to relNumber in
> GetNewRelFileNumber). This isn't exhaustive by any means nor did I try
> to compile it ... are you motivated to search for this more
> exhaustively? You could try (temporarily) making RelFileNumber a
> typedef that tricks the compiler into creating a warning or error for
> every mischief, such as a struct, fix all the warnings/errors, then
> revert the temporary change.

Okay, I will try that.

> diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
> index 275615877eb..e2a73d88408 100644
> --- a/src/backend/backup/basebackup_incremental.c
> +++ b/src/backend/backup/basebackup_incremental.c
> @@ -740,7 +740,7 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
>  	 */
>  	rlocator.spcOid = spcoid;
>  	rlocator.dbOid = dboid;
> -	rlocator.relNumber = 0;
> +	rlocator.relNumber = InvalidRelFileNumber;
>  	if (BlockRefTableGetEntry(ib->brtab, &rlocator, MAIN_FORKNUM,
>  							  &limit_block) != NULL)
>  	{
> diff --git a/src/backend/bootstrap/bootparse.y b/src/backend/bootstrap/bootparse.y
> index 73a7592fb71..2f7f06a126f 100644
> --- a/src/backend/bootstrap/bootparse.y
> +++ b/src/backend/bootstrap/bootparse.y
> @@ -206,7 +206,7 @@ Boot_CreateStmt:
>  									PG_CATALOG_NAMESPACE,
>  									shared_relation ? GLOBALTABLESPACE_OID : 0,
>  									$3,
> -									InvalidOid,
> +									InvalidRelFileNumber,
>  									HEAP_TABLE_AM_OID,
>  									tupdesc,
>  									RELKIND_RELATION,
> diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
> index f9bb721c5fe..f8d4824a3f8 100644
> --- a/src/backend/catalog/index.c
> +++ b/src/backend/catalog/index.c
> @@ -3740,7 +3740,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
>  	if (set_tablespace)
>  	{
>  		/* Update its pg_class row */
> -		SetRelationTableSpace(iRel, params->tablespaceOid, InvalidOid);
> +		SetRelationTableSpace(iRel, params->tablespaceOid, InvalidRelFileNumber);
>
>  		/*
>  		 * Schedule unlinking of the old index storage at transaction commit.
> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
> index eaa81424270..8505cc8c1b5 100644
> --- a/src/backend/commands/tablecmds.c
> +++ b/src/backend/commands/tablecmds.c
> @@ -15441,7 +15441,7 @@ ATExecSetTableSpaceNoStorage(Relation rel, Oid newTableSpace)
>  	}
>
>  	/* Update can be done, so change reltablespace */
> -	SetRelationTableSpace(rel, newTableSpace, InvalidOid);
> +	SetRelationTableSpace(rel, newTableSpace, InvalidRelFileNumber);
>
>  	InvokeObjectPostAlterHook(RelationRelationId, RelationGetRelid(rel), 0);
>
> diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
> index 0fc2c093b0d..bffbb98779e 100644
> --- a/src/include/catalog/pg_class.h
> +++ b/src/include/catalog/pg_class.h
> @@ -54,7 +54,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
>
>  	/* identifier of physical storage file */
>  	/* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
> -	Oid			relfilenode BKI_DEFAULT(0);
> +	RelFileNumber relfilenode BKI_DEFAULT(0);
>
>  	/* identifier of table space for relation (0 means default for database) */
>  	Oid			reltablespace BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_tablespace);

IIRC, in the catalog we intentionally left it as Oid because RelFileNumber is
an internal typedef; it is not an exposed datatype, so probably we cannot use
it in the catalog.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
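[Editor's note] For reference, a minimal sketch of the temporary compiler
trick Alvaro suggests, assuming the usual "typedef Oid RelFileNumber;" is
swapped out for the duration of the search (compile-time-only scaffolding,
never meant to be committed):

/* Temporarily make RelFileNumber a distinct single-member struct so that
 * every implicit conversion between Oid and RelFileNumber becomes a
 * compile error or warning: */
typedef struct RelFileNumber
{
	Oid			relNumber;
} RelFileNumber;

With this in place, every "relfilenumber = someOid" assignment, every
comparison against an Oid, and every %u format argument fails to compile,
which is exactly the class of mismatch reported elsewhere in this thread.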
Fix small typo, use InvalidRelFileNumber instead of InvalidOid
It appears that we are passing InvalidOid instead of InvalidRelFileNumber
when calling index_create(). While this isn't technically a bug, since
InvalidRelFileNumber is defined as InvalidOid, it's preferable to use the
correct identifier for clarity and consistency.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment: 0001-Fix-typo-use-InvalidRelFileNumber-instead-of-Invalid.patch
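[Editor's note] For context, the relevant definitions (as of recent trees,
in src/include/storage/relpath.h; quoted here for reference) show why the
two names spell the same value:

typedef Oid RelFileNumber;
#define InvalidRelFileNumber	((RelFileNumber) InvalidOid)

Since the typedef is transparent, the compiler accepts either name and only
the reader can tell which concept was intended, which is what this patch
cleans up.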
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Sat, Oct 26, 2024 at 5:30 PM Melanie Plageman wrote:
> On Sat, Oct 26, 2024 at 12:17 AM Dilip Kumar wrote:
> >
> > On Fri, Oct 25, 2024 at 10:07 PM Melanie Plageman < melanieplage...@gmail.com> wrote:
> >>
> >> On Fri, Oct 25, 2024 at 10:29 AM Melanie Plageman wrote:
> >> >
> >> > Tom suggested off-list that if rs_cindex can't be negative, then it should
> >> > be unsigned. I checked the other scan types using the
> >> > HeapScanDescData, and it seems none of them use values of rs_cindex or
> >> > rs_ntuples < 0. As such, here is a patch making both rs_ntuples and
> >> > rs_cindex unsigned.
> >
> > @@ -943,8 +945,8 @@ heapgettup_pagemode(HeapScanDesc scan,
> > {
> > 	HeapTuple	tuple = &(scan->rs_ctup);
> > 	Page		page;
> > -	int			lineindex;
> > -	int			linesleft;
> > +	uint32		lineindex;
> > +	uint32		linesleft;
> >
> > IMHO we can't make "lineindex" a uint32, because just see the first
> > code block [1] of heapgettup_pagemode(): we use this index for +ve (forward
> > scan) as well as -ve (backward scan) movement.
>
> Yes, so in the case of backwards scan, if scan->rs_cindex is 0, when
> dir is -1, lineindex will wrap around, but we don't use it in that
> case because linesleft will be 0 and then we will not meet the
> conditions to execute the code in the loop under continue_page. And in
> the loop, when adding -1 to lineindex, for backwards scan, it always
> starts at linesleft and ticks down to 0.

Yeah, you are right: although lineindex will wrap around when rs_cindex is 0,
it would not be used. So it won't actually cause any issues, but I'm not
comfortable with the unintentional wraparound. I would have left
"scan->rs_cindex" as int itself, but I am fine with whatever you prefer.

> We could add an if statement above the goto that says something like
> if (linesleft > 0)
>     goto continue_page;
>
> Would that make it clearer?

Not sure it would make it clearer. In fact, in common cases it would add an
extra instruction to check if (linesleft > 0).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
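[Editor's note] A tiny standalone model of the stepping logic under
discussion, to illustrate why the uint32 wraparound is benign; the names
mirror the mail, not the real function:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int			dir = -1;			/* BackwardScanDirection */
	uint32_t	rs_cindex = 0;		/* first offset already consumed */
	uint32_t	linesleft = 0;		/* nothing left on this page */
	uint32_t	lineindex = rs_cindex + dir;	/* wraps to 0xFFFFFFFF */

	/* The loop guard tests linesleft, not lineindex, so the wrapped
	 * value is never actually used to fetch a tuple: */
	while (linesleft > 0)
	{
		printf("fetch offset %u\n", lineindex);
		lineindex += dir;
		linesleft--;
	}
	return 0;
}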
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Fri, Oct 25, 2024 at 10:07 PM Melanie Plageman wrote:
> On Fri, Oct 25, 2024 at 10:29 AM Melanie Plageman wrote:
> >
> > Tom suggested off-list that if rs_cindex can't be negative, then it should
> > be unsigned. I checked the other scan types using the
> > HeapScanDescData, and it seems none of them use values of rs_cindex or
> > rs_ntuples < 0. As such, here is a patch making both rs_ntuples and
> > rs_cindex unsigned.

@@ -943,8 +945,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	Page		page;
-	int			lineindex;
-	int			linesleft;
+	uint32		lineindex;
+	uint32		linesleft;

IMHO we can't make "lineindex" a uint32, because just see the first code
block [1] of heapgettup_pagemode(): we use this index for +ve (forward scan)
as well as -ve (backward scan) movement.

[1]
if (likely(scan->rs_inited))
{
	/* continue from previously returned page/tuple */
	page = BufferGetPage(scan->rs_cbuf);

	lineindex = scan->rs_cindex + dir;
	if (ScanDirectionIsForward(dir))

-- Refer the definition of ScanDirection:
typedef enum ScanDirection
{
	BackwardScanDirection = -1,
	NoMovementScanDirection = 0,
	ForwardScanDirection = 1
} ScanDirection;

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Thu, Oct 24, 2024 at 7:11 PM Melanie Plageman wrote:
> Thanks for the reply, Dilip!
>
> On Thu, Oct 24, 2024 at 12:50 AM Dilip Kumar wrote:
> >
> > On Thu, Oct 24, 2024 at 3:45 AM Melanie Plageman < melanieplage...@gmail.com> wrote:
> >>
> >> HeapScanDescData->rs_cindex (the current index into the array of
> >> visible tuples stored in the heap scan descriptor) is used for
> >> multiple scan types, but for bitmap heap scans, AFAICT, it should
> >> never be < 0.
> >
> > You are right, it can never be < 0 in this case at least.
>
> Cool, thanks for confirming. I will commit this change then.

+1

> > In fact you don't need to explicitly set it to 0 in initscan [1], because
> > before calling heapam_scan_bitmap_next_tuple() we must call
> > heapam_scan_bitmap_next_block(), and that function is initializing it to 0
> > (hscan->rs_cindex = 0;). Anyway, no objection even if you prefer to
> > initialize it in initscan(); the point is just that we are already doing it
> > for each block before fetching tuples from the block.
>
> I am inclined to still initialize it to 0 in initscan(). I was
> refactoring the bitmap heap scan code to use the read stream API and
> after moving some things around, this field was no longer getting
> initialized in heapam_scan_bitmap_next_block(). While that may not be
> a good reason to change it in this patch, most of the other fields in
> the heap scan descriptor (like rs_cbuf and rs_cblock) are
> re-initialized in initscan(), so it seems okay to do that there even
> though it isn't strictly necessary on master.

Makes sense.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Thu, Oct 24, 2024 at 3:45 AM Melanie Plageman wrote:
> Hi,
>
> HeapScanDescData->rs_cindex (the current index into the array of
> visible tuples stored in the heap scan descriptor) is used for
> multiple scan types, but for bitmap heap scans, AFAICT, it should
> never be < 0.
>
> As such, I find this test in heapam_scan_bitmap_next_tuple() pretty
> confusing.
>
> if (hscan->rs_cindex < 0 || hscan->rs_cindex >= hscan->rs_ntuples)
>
> I know it seems innocuous, but I find it pretty distracting. Am I
> missing something? Could it somehow be < 0?
>
> If not, I propose removing that part of the if statement like in the
> attached patch.

You are right, it can never be < 0 in this case at least. In fact you don't
need to explicitly set it to 0 in initscan [1], because before calling
heapam_scan_bitmap_next_tuple() we must call heapam_scan_bitmap_next_block(),
and that function is initializing it to 0 (hscan->rs_cindex = 0;). Anyway, no
objection even if you prefer to initialize it in initscan(); the point is just
that we are already doing it for each block before fetching tuples from the
block.

[1]
@@ -378,6 +378,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	ItemPointerSetInvalid(&scan->rs_ctup.t_self);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_cindex = 0;

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Make default subscription streaming option as Parallel
On Fri, 18 Oct 2024 at 5:24 PM, Amit Kapila wrote:
> On Fri, Oct 18, 2024 at 9:48 AM Dilip Kumar wrote:
> >
> > On Tue, Oct 8, 2024 at 3:38 PM Amit Kapila wrote:
> >>
> >> On Tue, Oct 8, 2024 at 2:25 PM shveta malik wrote:
> >> >
> >> > On Mon, Oct 7, 2024 at 4:03 PM vignesh C wrote:
> >> > >
> >> >
> >> > With parallel streaming as default, do you think there is a need to
> >> > increase the default for 'max_logical_replication_workers', as IIUC
> >> > parallel workers are taken from the same pool.
> >>
> >> Good question. But then one may say that we should proportionately
> >> increase max_worker_processes as well. I don't know what should be
> >> reasonable new defaults. I think we should make parallel streaming the
> >> default and then wait for some user feedback before changing other
> >> defaults.
> >
> > I agree; actually, streaming of in-progress transactions is a useful
> > feature for performance in case of large transactions, so it makes sense
> > to make it "on" by default. So +1 from my side.
>
> Your response is confusing. AFAIU, this proposal is to change the
> default value of the streaming option to 'parallel' but you are
> suggesting to make 'on' the default, which is different from the
> proposed default, but OTOH you are saying +1 as well. So, both can't be
> true.

Sorry for the confusion; I meant to say change the default to 'parallel'.

--
Dilip
Re: Make default subscription streaming option as Parallel
On Tue, Oct 8, 2024 at 3:38 PM Amit Kapila wrote:
> On Tue, Oct 8, 2024 at 2:25 PM shveta malik wrote:
> >
> > On Mon, Oct 7, 2024 at 4:03 PM vignesh C wrote:
> > >
> >
> > With parallel streaming as default, do you think there is a need to
> > increase the default for 'max_logical_replication_workers', as IIUC
> > parallel workers are taken from the same pool.
>
> Good question. But then one may say that we should proportionately
> increase max_worker_processes as well. I don't know what should be
> reasonable new defaults. I think we should make parallel streaming the
> default and then wait for some user feedback before changing other
> defaults.

I agree; actually, streaming of in-progress transactions is a useful feature
for performance in case of large transactions, so it makes sense to make it
"on" by default. So +1 from my side.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Invalid Assert while validating REPLICA IDENTITY?
On Fri, Sep 6, 2024 at 4:48 PM vignesh C wrote:
>
> On Mon, 2 Sept 2024 at 18:22, Dilip Kumar wrote:
> >
> > On Mon, Sep 2, 2024 at 3:32 PM Amit Kapila wrote:
> > >
> > > On Mon, Sep 2, 2024 at 11:21 AM Dilip Kumar wrote:
> > > >
> > > > While working on some other code I noticed that in
> > > > FindReplTupleInLocalRel() there is an assert [1] that seems to be
> > > > passing IndexRelation to GetRelationIdentityOrPK() whereas it should
> > > > be passing the normal relation.
> > >
> > > Agreed. But this should lead to assertion failure. Did you try testing it?
> >
> > No, I did not test this particular case; it impacted me with my other
> > addition of the code where I got Index Relation as input to the
> > RelationGetIndexList() function, and my local changes were impacted by
> > that. I will write a test for this stand-alone case so that it hits
> > the assert. Thanks for looking into this.
>
> The FindReplTupleInLocalRel function can be triggered by both update
> and delete operations, but this only occurs if the relation has been
> marked as updatable by the logicalrep_rel_mark_updatable function. If
> the relation is marked as non-updatable, an error will be thrown by
> check_relation_updatable. Given this, if a relation is updatable, the
> IsIndexUsableForReplicaIdentityFull condition might always evaluate to
> true due to the previous checks in logicalrep_rel_mark_updatable.
> Therefore, it's possible that we might not encounter the Assert
> statement, as IsIndexUsableForReplicaIdentityFull may consistently be
> true.
> Thoughts?

With that, it seems that the first Assert condition is useless, isn't it?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Deadlock due to locking order violation while inserting into a leaf relation
Basically, when we are inserting into a leaf relation (or a lower-level
relation of a partitioned relation), we acquire the lock on the leaf relation
during parsing time itself, whereas the parent lock is acquired during
generate_partition_qual(). Now, if we concurrently try to drop the partition
root, it will acquire the locks in the reverse order, i.e. parent first and
then child, so this can create a deadlock. The example below reproduces this
case.

Setup:
CREATE TABLE test(a int, b int) partition by range(a);
CREATE TABLE test1 partition of test for values from (1) to (10);

Test:
--
--Session1:
INSERT INTO test1 VALUES (1, 4);
-- let session 1 lock the relation test1 and make it wait before it locks
-- test (put a breakpoint in ExecInitModifyTable)

--Session2:
-- try to drop the top table, which will try to take AccessExclusiveLock on
-- all partitions
DROP TABLE test;

--Session3:
-- see pg_locks: session2 (pid 30410) has locked the root table test (16384)
-- and is waiting on test1 (16387), as session1 (pid 30368) is holding that
-- lock

 locktype | database | relation |  pid  |        mode         | granted
----------+----------+----------+-------+---------------------+---------
 relation |        5 |    16387 | 30368 | RowExclusiveLock    | t
 relation |        5 |    16387 | 30410 | AccessExclusiveLock | f
 relation |        5 |    16384 | 30410 | AccessExclusiveLock | t
(11 rows)

--Session1: now, as soon as you continue in gdb in session 1, it will hit the
deadlock:

ERROR:  40P01: deadlock detected
DETAIL:  Process 30368 waits for AccessShareLock on relation 16384 of database 5; blocked by process 30410.
Process 30410 waits for AccessExclusiveLock on relation 16387 of database 5; blocked by process 30368.
HINT:  See server log for query details.
LOCATION:  DeadLockReport, deadlock.c:1135

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
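[Editor's note] In sketch form, the two code paths take the same pair of
locks in opposite orders (an illustration of the ordering only, not the
actual call sites):

/* INSERT INTO test1 (session 1): leaf first, parent later */
LockRelationOid(leafOid, RowExclusiveLock);		/* at parse/lock time */
/* ... planning/executor startup ... */
LockRelationOid(parentOid, AccessShareLock);	/* generate_partition_qual() */

/* DROP TABLE test (session 2): parent first, then children */
LockRelationOid(parentOid, AccessExclusiveLock);
LockRelationOid(childOid, AccessExclusiveLock);

Two sessions running these concurrently can therefore deadlock, which is what
the pg_locks output above shows.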
Re: Invalid Assert while validating REPLICA IDENTITY?
On Mon, Sep 2, 2024 at 3:32 PM Amit Kapila wrote:
>
> On Mon, Sep 2, 2024 at 11:21 AM Dilip Kumar wrote:
> >
> > While working on some other code I noticed that in
> > FindReplTupleInLocalRel() there is an assert [1] that seems to be
> > passing IndexRelation to GetRelationIdentityOrPK() whereas it should
> > be passing the normal relation.
>
> Agreed. But this should lead to assertion failure. Did you try testing it?

No, I did not test this particular case; it impacted me with my other addition
of the code where I got Index Relation as input to the RelationGetIndexList()
function, and my local changes were impacted by that. I will write a test for
this stand-alone case so that it hits the assert. Thanks for looking into
this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Invalid Assert while validating REPLICA IDENTITY?
While working on some other code I noticed that in FindReplTupleInLocalRel()
there is an assert [1] that seems to be passing IndexRelation to
GetRelationIdentityOrPK() whereas it should be passing the normal relation.

[1]
if (OidIsValid(localidxoid))
{
#ifdef USE_ASSERT_CHECKING
	Relation	idxrel = index_open(localidxoid, AccessShareLock);

	/* Index must be PK, RI, or usable for REPLICA IDENTITY FULL tables */
	Assert(GetRelationIdentityOrPK(idxrel) == localidxoid ||
		   IsIndexUsableForReplicaIdentityFull(BuildIndexInfo(idxrel),
												edata->targetRel->attrmap));
	index_close(idxrel, AccessShareLock);
#endif

In the above code, we are passing idxrel to GetRelationIdentityOrPK(), whereas
we should be passing localrel. The fix should be:

@@ -2929,7 +2929,7 @@ FindReplTupleInLocalRel(ApplyExecutionData *edata, Relation localrel,
 	Relation	idxrel = index_open(localidxoid, AccessShareLock);

 	/* Index must be PK, RI, or usable for REPLICA IDENTITY FULL tables */
-	Assert(GetRelationIdentityOrPK(idxrel) == localidxoid ||
+	Assert(GetRelationIdentityOrPK(localrel) == localidxoid ||
 		   IsIndexUsableForReplicaIdentityFull(BuildIndexInfo(idxrel),
 											   edata->targetRel->attrmap));
 	index_close(idxrel, AccessShareLock);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Tue, Jul 30, 2024 at 4:56 PM shveta malik wrote:
>
> On Tue, Jul 30, 2024 at 4:04 PM Dilip Kumar wrote:
> >
> > On Fri, Jul 26, 2024 at 9:50 AM Ajin Cherian wrote:
> >
> > Comment in 0002:
> >
> > 1) I do not see any test case that sets a proper conflict type and
> > conflict resolver; all tests either give an incorrect conflict
> > type/conflict resolver, or the conflict resolver is ignored.
> >
> > 0003
> > 2) I was trying to think about this patch, so suppose we consider this
> > case: conflict_type -> update_differ, resolver -> remote_apply. My
> > question is to confirm whether my understanding is correct. So if
> > this is set and we have 2 nodes and set up a 2-way logical
> > replication, and if a conflict occurs, node-1 will take the changes of
> > node-2 and node-2 will take the changes of node-1?
>
> Yes, that's right.
>
> > Maybe so. I think
> > to avoid such cases the user needs to set the resolver more thoughtfully;
> > on node-1 it may be set as "skip" and on node-2 as "remote-apply", so
> > in such cases if a conflict happens both nodes will have the value from
> > node-1. But maybe it would be more difficult to get a consistent
> > value if we are setting up a mess replication topology, right? Maybe
> > there I think a more advanced timestamp-based option would work better,
> > IMHO.
>
> Yes, that's correct. We can get data divergence with resolvers like
> 'remote_apply', 'keep_local' etc. If you meant 'mesh' replication
> topology, then yes, it is difficult to get a consistent value there with
> resolvers other than timestamp-based ones. And thus timestamp-based
> resolvers are needed and should be the default when implemented.

Thanks for the clarification.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Fri, Jul 26, 2024 at 9:50 AM Ajin Cherian wrote:

Comment in 0002:

1) I do not see any test case that sets a proper conflict type and conflict
resolver; all tests either give an incorrect conflict type/conflict resolver,
or the conflict resolver is ignored.

0003
2) I was trying to think about this patch, so suppose we consider this case:
conflict_type -> update_differ, resolver -> remote_apply. My question is to
confirm whether my understanding is correct. So if this is set and we have 2
nodes and set up a 2-way logical replication, and if a conflict occurs, node-1
will take the changes of node-2 and node-2 will take the changes of node-1?
Maybe so. I think to avoid such cases the user needs to set the resolver more
thoughtfully; on node-1 it may be set as "skip" and on node-2 as
"remote-apply", so in such cases if a conflict happens both nodes will have
the value from node-1. But maybe it would be more difficult to get a
consistent value if we are setting up a mess replication topology, right?
Maybe there I think a more advanced timestamp-based option would work better,
IMHO.

I am doing a code-level review as well and will share my comments soon on
0003 and 0004.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict detection and logging in logical replication
On Tue, Jul 30, 2024 at 1:49 PM Zhijie Hou (Fujitsu) wrote:
>
> > On Monday, July 29, 2024 6:59 PM Dilip Kumar wrote:
> > >
> > > On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu) wrote:
> > > >
> > >
> > > I was going through v7-0001, and I have some initial comments.
> >
> > Thanks for the comments!
> >
> > > @@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
> > >  	ExprContext *econtext;
> > >  	Datum		values[INDEX_MAX_KEYS];
> > >  	bool		isnull[INDEX_MAX_KEYS];
> > > -	ItemPointerData invalidItemPtr;
> > >  	bool		checkedIndex = false;
> > >
> > >  	ItemPointerSetInvalid(conflictTid);
> > > -	ItemPointerSetInvalid(&invalidItemPtr);
> > >
> > >  	/*
> > >  	 * Get information from the result relation info structure.
> > > @@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
> > >
> > >  		satisfiesConstraint =
> > >  			check_exclusion_or_unique_constraint(heapRelation, indexRelation,
> > > -												 indexInfo, &invalidItemPtr,
> > > +												 indexInfo, &slot->tts_tid,
> > >  												 values, isnull, estate, false,
> > >  												 CEOUC_WAIT, true,
> > >  												 conflictTid);
> > >
> > > What is the purpose of this change? I mean
> > > 'check_exclusion_or_unique_constraint' says that 'tupleid'
> > > should be invalidItemPtr if the tuple is not yet inserted, and
> > > ExecCheckIndexConstraints is called by ExecInsert before inserting the tuple.
> > > So what is this change?
> >
> > Because the function ExecCheckIndexConstraints() is now invoked after
> > inserting a tuple (in the patch). So, we need to ignore the newly inserted
> > tuple when checking conflicts in check_exclusion_or_unique_constraint().
> >
> > > Would this change ExecInsert's behavior as well?
> >
> > Thanks for pointing it out, I will check and reply.
>
> After checking, I think it may affect ExecInsert's behavior if the slot passed
> to ExecCheckIndexConstraints() comes from other tables (e.g. when executing
> INSERT INTO SELECT FROM othertbl), because the slot->tts_tid points to a valid
> position from another table in this case, which can cause
> check_exclusion_or_unique_constraint to skip a tuple unexpectedly.
>
> I thought about two ideas to fix this: One is to reset the slot->tts_tid
> before calling ExecCheckIndexConstraints() in ExecInsert(), but I feel a bit
> uncomfortable with this since it is touching existing logic. So, another idea
> is to just add a new parameter 'tupletid' in ExecCheckIndexConstraints(),
> then pass tupletid = InvalidOffsetNumber when invoking the function in
> ExecInsert() and pass a valid tupletid in the new code paths in the patch.
> The new 'tupletid' will be passed to check_exclusion_or_unique_constraint to
> skip the target tuple. I feel the second one may be better.
>
> What do you think?

Yes, the second approach seems good to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
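[Editor's note] To make idea (2) concrete, a hedged sketch of what the new
parameter could look like; the base signature is roughly the one in
execIndexing.c, but the 'tupletid' addition and both call sites are
illustrative only:

extern bool ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo,
									  TupleTableSlot *slot,
									  EState *estate,
									  ItemPointer conflictTid,
									  ItemPointer tupletid,	/* new: tuple to skip, or invalid */
									  List *arbiterIndexes);

/* In ExecInsert(): nothing inserted yet, so nothing to skip */
ItemPointerSetInvalid(&invalidItemPtr);
ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
						  &invalidItemPtr, arbiterIndexes);

/* In the new conflict-detection paths: skip the just-inserted tuple */
ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
						  &slot->tts_tid, arbiterIndexes);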
Re: Conflict detection and logging in logical replication
On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu) wrote:
>

I was going through v7-0001, and I have some initial comments.

@@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	ExprContext *econtext;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ItemPointerData invalidItemPtr;
 	bool		checkedIndex = false;

 	ItemPointerSetInvalid(conflictTid);
-	ItemPointerSetInvalid(&invalidItemPtr);

 	/*
 	 * Get information from the result relation info structure.
@@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,

 		satisfiesConstraint =
 			check_exclusion_or_unique_constraint(heapRelation, indexRelation,
-												 indexInfo, &invalidItemPtr,
+												 indexInfo, &slot->tts_tid,
												 values, isnull, estate, false,
												 CEOUC_WAIT, true,
												 conflictTid);

What is the purpose of this change? I mean,
'check_exclusion_or_unique_constraint' says that 'tupleid' should be
invalidItemPtr if the tuple is not yet inserted, and
ExecCheckIndexConstraints is called by ExecInsert before inserting the tuple.
So what is this change? Would this change ExecInsert's behavior as well?

+ReCheckConflictIndexes(ResultRelInfo *resultRelInfo, EState *estate,
+					   ConflictType type, List *recheckIndexes,
+					   TupleTableSlot *slot)
+{
+	/* Re-check all the unique indexes for potential conflicts */
+	foreach_oid(uniqueidx, resultRelInfo->ri_onConflictArbiterIndexes)
+	{
+		TupleTableSlot *conflictslot;
+
+		if (list_member_oid(recheckIndexes, uniqueidx) &&
+			FindConflictTuple(resultRelInfo, estate, uniqueidx, slot, &conflictslot))
+		{
+			RepOriginId origin;
+			TimestampTz committs;
+			TransactionId xmin;
+
+			GetTupleCommitTs(conflictslot, &xmin, &origin, &committs);
+			ReportApplyConflict(ERROR, type, resultRelInfo->ri_RelationDesc, uniqueidx,
+								xmin, origin, committs, conflictslot);
+		}
+	}
+}

and

+	conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes;
+
 	if (resultRelInfo->ri_NumIndices > 0)
 		recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-											   slot, estate, false, false,
-											   NULL, NIL, false);
+											   slot, estate, false,
+											   conflictindexes ? true : false,
+											   &conflict,
+											   conflictindexes, false);
+
+	/*
+	 * Rechecks the conflict indexes to fetch the conflicting local tuple
+	 * and reports the conflict. We perform this check here, instead of
+	 * performing an additional index scan before the actual insertion and
+	 * reporting the conflict if any conflicting tuples are found. This is
+	 * to avoid the overhead of executing the extra scan for each INSERT
+	 * operation, even when no conflict arises, which could introduce
+	 * significant overhead to replication, particularly in cases where
+	 * conflicts are rare.
+	 */
+	if (conflict)
+		ReCheckConflictIndexes(resultRelInfo, estate, CT_INSERT_EXISTS,
+							   recheckIndexes, slot);

This logic is confusing. First, you are calling ExecInsertIndexTuples() with
no-duplicate-error for the indexes present in 'ri_onConflictArbiterIndexes',
which means the indexes returned by the function must be a subset of
'ri_onConflictArbiterIndexes'; and later, in ReCheckConflictIndexes(), you are
again processing all the indexes of 'ri_onConflictArbiterIndexes' and checking
whether each of them is in the list returned by ExecInsertIndexTuples(). Why
are we doing that? I think we can directly use the 'recheckIndexes' list
returned by ExecInsertIndexTuples(), since those indexes are guaranteed to be
a subset of ri_onConflictArbiterIndexes. No?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
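[Editor's note] For what it's worth, a sketch of the simplification suggested
above: iterate directly over the list returned by ExecInsertIndexTuples(),
which is already restricted to the arbiter indexes. This is an illustrative
rearrangement of the patch's own code (inside the same ReCheckConflictIndexes
function), not a tested change:

	/* Re-check only the indexes that actually reported a potential conflict */
	foreach_oid(uniqueidx, recheckIndexes)
	{
		TupleTableSlot *conflictslot;

		if (FindConflictTuple(resultRelInfo, estate, uniqueidx, slot,
							  &conflictslot))
		{
			RepOriginId origin;
			TimestampTz committs;
			TransactionId xmin;

			GetTupleCommitTs(conflictslot, &xmin, &origin, &committs);
			ReportApplyConflict(ERROR, type, resultRelInfo->ri_RelationDesc,
								uniqueidx, xmin, origin, committs,
								conflictslot);
		}
	}

This drops both the outer scan of ri_onConflictArbiterIndexes and the
list_member_oid() probe.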
Re: Conflict Detection and Resolution
On Fri, Jul 5, 2024 at 2:23 PM Amit Kapila wrote:
>
> On Fri, Jul 5, 2024 at 11:58 AM Dilip Kumar wrote:
> >
> > On Thu, Jul 4, 2024 at 5:37 PM Amit Kapila wrote:
> > >
> > > So, the situation will be the same. We can even
> > > decide to spill the data to files if the decision is that we need to
> > > wait to avoid network buffer-fill situations. But note that the wait
> > > in the apply worker has consequences: the subscriber won't be able to
> > > confirm the flush position, the publisher won't be able to vacuum the
> > > dead rows, and we won't be able to remove WAL either. Last time when we
> > > discussed the delay_apply feature, we decided not to proceed because
> > > of such issues. This is the reason I proposed a cap on wait time.
> >
> > Yes, spilling to file or a cap on the wait time should help, and as I
> > said above maybe a parallel apply worker can also help.
> >
> > > It is not clear to me how a parallel apply worker can help in this
> > > case. Can you elaborate on what you have in mind?
> >
> > If we decide to wait at commit time, and before starting to apply we
> > already see that the remote commit_ts clock is ahead, then if we apply
> > such transactions using the parallel worker, wouldn't it solve the
> > issue of the network buffer congestion? Now the apply worker can move
> > ahead and fetch new transactions from the buffer, as our waiting
> > transaction will not block it. I understand that if this transaction
> > is going to wait at commit, then any future transaction that we are
> > going to fetch might also wait again, because if the previous
> > transaction committed before it is in the future, then the subsequent
> > transaction committed after it must also be in the future; so eventually
> > that will also go to another parallel worker, and soon we end up
> > consuming all the parallel workers if the clock skew is large. So I
> > won't say this will resolve the problem, and we would still have to
> > fall back to spilling to the disk, but that's just in the worst
> > case when the clock skew is really huge. In most cases, which are due
> > to slight clock drift, by the time we apply the medium-to-large-size
> > transaction, the local clock should be able to catch up to the remote
> > commit_ts and we might not have to wait in most of the cases.
>
> Yeah, this is possible, but even if we go with the spilling logic at first,
> it should work for all cases. If we get some complaints then we can
> explore executing such transactions by parallel apply workers.
> Personally, I am of the opinion that clock synchronization should be
> handled outside the database system via network time protocols like
> NTP. Still, we can have some simple solution to inform users about the
> clock skew.

Yeah, that makes sense; in the first version we can have a simple solution,
and we can further improve it based on the feedback.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Thu, Jul 4, 2024 at 5:37 PM Amit Kapila wrote:
>
> So, the situation will be the same. We can even
> decide to spill the data to files if the decision is that we need to
> wait to avoid network buffer-fill situations. But note that the wait
> in the apply worker has consequences: the subscriber won't be able to
> confirm the flush position, the publisher won't be able to vacuum the
> dead rows, and we won't be able to remove WAL either. Last time when we
> discussed the delay_apply feature, we decided not to proceed because
> of such issues. This is the reason I proposed a cap on wait time.

Yes, spilling to file or a cap on the wait time should help, and as I said
above maybe a parallel apply worker can also help.

> It is not clear to me how a parallel apply worker can help in this
> case. Can you elaborate on what you have in mind?

If we decide to wait at commit time, and before starting to apply we already
see that the remote commit_ts clock is ahead, then if we apply such
transactions using the parallel worker, wouldn't it solve the issue of the
network buffer congestion? Now the apply worker can move ahead and fetch new
transactions from the buffer, as our waiting transaction will not block it. I
understand that if this transaction is going to wait at commit, then any
future transaction that we are going to fetch might also wait again, because
if the previous transaction committed before it is in the future, then the
subsequent transaction committed after it must also be in the future; so
eventually that will also go to another parallel worker, and soon we end up
consuming all the parallel workers if the clock skew is large. So I won't say
this will resolve the problem, and we would still have to fall back to
spilling to the disk, but that's just in the worst case when the clock skew is
really huge. In most cases, which are due to slight clock drift, by the time
we apply the medium-to-large-size transaction, the local clock should be able
to catch up to the remote commit_ts and we might not have to wait in most of
the cases.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 5:08 PM shveta malik wrote:
>
> On Wed, Jul 3, 2024 at 4:12 PM Dilip Kumar wrote:
> >
> > On Wed, Jul 3, 2024 at 4:02 PM shveta malik wrote:
> > >
> > > On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar wrote:
> > > >
> > > > On Wed, Jul 3, 2024 at 11:00 AM shveta malik wrote:
> > > > >
> > > > > > Yes, I also think it should be independent of CDR. IMHO, it should be
> > > > > > based on the user-configured maximum clock skew tolerance and can be
> > > > > > independent of CDR.
> > > > >
> > > > > +1
> > > > >
> > > > > > IIUC we would make the remote apply wait just
> > > > > > before committing if the remote commit timestamp is ahead of the local
> > > > > > clock by more than the maximum clock skew tolerance, is that correct?
> > > > >
> > > > > +1 on the condition to wait.
> > > > >
> > > > > But I think we should make the apply worker wait during begin
> > > > > (apply_handle_begin) instead of commit. It makes more sense to delay
> > > > > the entire operation to manage clock skew rather than the commit
> > > > > alone. And only then can CDR's timestamp-based resolutions, which happen
> > > > > much prior to the commit stage, benefit from this. Thoughts?
> > > >
> > > > But do we really need to wait at apply_handle_begin()? I mean if we
> > > > already know the commit_ts then we can perform the conflict resolution, no?
> > >
> > > I would like to highlight one point here: the resultant data may
> > > be different depending upon at what stage (begin or commit) we
> > > conclude to wait. Example:
> > >
> > > --max_clock_skew set to 0, i.e. no tolerance for clock skew.
> > > --Remote update with commit_timestamp = 10.20AM.
> > > --Local clock (which is say 5 min behind) shows = 10.15AM.
> > >
> > > Case 1: Wait during Begin:
> > > When the remote update arrives at the local node, the apply worker waits
> > > till the local clock hits 'remote's commit_ts - max_clock_skew', i.e. till
> > > 10.20 AM. In the meantime (during the wait period of the apply worker), if
> > > some local update on the same row has happened at say 10.18am (local
> > > clock), that will be applied first. Now when the apply worker's wait is
> > > over, it will detect an 'update_differ' conflict, and as per
> > > 'last_update_wins', the remote tuple will win as 10.20 is later than
> > > 10.18.
> > >
> > > Case 2: Wait during Commit:
> > > When the remote update arrives at the local node, it finds no conflict and
> > > goes for commit. But before commit, it waits till the local clock hits
> > > 10.20 AM. In the meantime (during the wait period of the apply worker), if
> > > some local update is trying to update the same row, say at 10.18, it
> > > has to wait (due to locks taken by the remote update on that row) and the
> > > remote tuple will get committed first with a commit timestamp of 10.20.
> > > Then the local update will proceed and will overwrite the remote tuple.
> > >
> > > So in case 1, the remote tuple is the final change, while in case 2, the
> > > local tuple is the final change.
> >
> > Got it, but which case is correct? I think both. Because in case 1 the
> > local commit's commit_ts is 10:18 and the remote commit's commit_ts is
> > 10:20, so the remote apply wins. And in case 2, the remote commit's
> > commit_ts is 10:20 whereas the local commit's commit_ts must be
> > 10:20 + delta (because it waited for the remote transaction to get
> > committed).
> >
> > Now say which is better: in case 1 we have to make the remote apply
> > wait at the beginning state without knowing what the local clock would
> > be when it actually comes to commit; it may so happen that if we
> > choose case 2, by the time the remote transaction finishes applying, the
> > local clock is beyond 10:20 and we do not even need to wait?
>
> Yes, agreed that the wait time could be lesser to some extent in case 2.
> But the wait during commit will make user operations on the same row
> wait, without the user having any clue of the concurrent blocking
> operations. I am not sure if it will be acceptable.

I don't think there is any problem with the acceptance of the user
experience, because even while applying the remote transaction (irrespective
of whether we implement this wait feature), the user transaction might have
to wait if it is updating the common rows, right?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
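[Editor's note] For reference, a minimal sketch of the wait discussed in both
cases, expressed with existing backend timestamp primitives; the function
name and placement are illustrative only, and TimestampTz arithmetic is in
microseconds:

static void
wait_for_local_clock(TimestampTz remote_commit_ts, int64 max_clock_skew_us)
{
	/* Delay until local clock >= remote commit_ts - max_clock_skew; in
	 * case 1 this would run in apply_handle_begin(), in case 2 just
	 * before applying the commit record. */
	for (;;)
	{
		TimestampTz now = GetCurrentTimestamp();

		if (now >= remote_commit_ts - max_clock_skew_us)
			break;

		/* sleep in short slices so the worker stays responsive */
		pg_usleep((long) Min(remote_commit_ts - max_clock_skew_us - now,
							 100000));
	}
}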
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 4:48 PM Amit Kapila wrote:
>
> On Wed, Jul 3, 2024 at 4:04 PM Dilip Kumar wrote:
> >
> > On Wed, Jul 3, 2024 at 3:35 PM Amit Kapila wrote:
> > >
> > > On Wed, Jul 3, 2024 at 2:16 PM Dilip Kumar wrote:
> > > >
> > > > On Wed, Jul 3, 2024 at 12:30 PM Amit Kapila wrote:
> > > > >
> > > > > On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar wrote:
> > > > >
> > > > > But waiting after applying the operations and before applying the
> > > > > commit would mean that we need to wait with the locks held. That could
> > > > > be a recipe for deadlocks in the system. I see your point related to
> > > > > performance, but as we are not expecting clock skew in normal cases, we
> > > > > shouldn't be too bothered about the performance due to this. If
> > > > > there is clock skew, we expect users to fix it; this is just a
> > > > > worst-case aid for users.
> > > >
> > > > But if we make it wait at the very first operation, that means we will
> > > > not suck more decoded data from the network, and wouldn't that make the
> > > > sender wait for the network buffer to get consumed by the receiver?
> > >
> > > That would be true even if we wait just before applying the commit
> > > record, considering the transaction is small and the wait time is
> > > large.
> >
> > What I am saying is that if we are not applying the whole transaction,
> > it means we are not receiving it either, unless we plan to spill it to
> > a file. If we don't spill it to a file, the network buffer will fill
> > up very quickly. This issue wouldn't occur if we waited right before
> > the commit because, by that time, we would have already received all
> > the data from the network.
>
> We would have received the transaction data, but there could be other
> transactions that need to wait because the apply worker is waiting
> before the commit.

Yeah, that's a valid point; can a parallel apply worker help here?

> So, the situation will be the same. We can even
> decide to spill the data to files if the decision is that we need to
> wait to avoid network buffer-fill situations. But note that the wait
> in the apply worker has consequences: the subscriber won't be able to
> confirm the flush position, the publisher won't be able to vacuum the
> dead rows, and we won't be able to remove WAL either. Last time when we
> discussed the delay_apply feature, we decided not to proceed because
> of such issues. This is the reason I proposed a cap on wait time.

Yes, spilling to file or a cap on the wait time should help, and as I said
above maybe a parallel apply worker can also help. So I agree that the
problem with network buffers arises in both cases, whether we wait before
committing or before beginning. So, keeping that in mind, I don't have any
strong objections against waiting at the beginning if it simplifies the
design compared to waiting at the commit.

However, one point to remember in favor of waiting before applying the commit
is that if we decide to wait before beginning the transaction, we would end
up waiting in many more cases compared to waiting before committing. Because
in cases when transactions are large and the clock skew is small, the local
clock would have already passed the remote commit_ts by the time we reach the
commit.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 4:02 PM shveta malik wrote:
>
> On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar wrote:
> >
> > On Wed, Jul 3, 2024 at 11:00 AM shveta malik wrote:
> > >
> > > > Yes, I also think it should be independent of CDR. IMHO, it should be
> > > > based on the user-configured maximum clock skew tolerance and can be
> > > > independent of CDR.
> > >
> > > +1
> > >
> > > > IIUC we would make the remote apply wait just
> > > > before committing if the remote commit timestamp is ahead of the local
> > > > clock by more than the maximum clock skew tolerance, is that correct?
> > >
> > > +1 on the condition to wait.
> > >
> > > But I think we should make the apply worker wait during begin
> > > (apply_handle_begin) instead of commit. It makes more sense to delay
> > > the entire operation to manage clock skew rather than the commit
> > > alone. And only then can CDR's timestamp-based resolutions, which happen
> > > much prior to the commit stage, benefit from this. Thoughts?
> >
> > But do we really need to wait at apply_handle_begin()? I mean if we
> > already know the commit_ts then we can perform the conflict resolution, no?
>
> I would like to highlight one point here: the resultant data may
> be different depending upon at what stage (begin or commit) we
> conclude to wait. Example:
>
> --max_clock_skew set to 0, i.e. no tolerance for clock skew.
> --Remote update with commit_timestamp = 10.20AM.
> --Local clock (which is say 5 min behind) shows = 10.15AM.
>
> Case 1: Wait during Begin:
> When the remote update arrives at the local node, the apply worker waits
> till the local clock hits 'remote's commit_ts - max_clock_skew', i.e. till
> 10.20 AM. In the meantime (during the wait period of the apply worker), if
> some local update on the same row has happened at say 10.18am (local clock),
> that will be applied first. Now when the apply worker's wait is over, it
> will detect an 'update_differ' conflict, and as per 'last_update_wins', the
> remote tuple will win as 10.20 is later than 10.18.
>
> Case 2: Wait during Commit:
> When the remote update arrives at the local node, it finds no conflict and
> goes for commit. But before commit, it waits till the local clock hits
> 10.20 AM. In the meantime (during the wait period of the apply worker), if
> some local update is trying to update the same row, say at 10.18, it has to
> wait (due to locks taken by the remote update on that row) and the remote
> tuple will get committed first with a commit timestamp of 10.20. Then the
> local update will proceed and will overwrite the remote tuple.
>
> So in case 1, the remote tuple is the final change, while in case 2, the
> local tuple is the final change.

Got it, but which case is correct? I think both. Because in case 1 the local
commit's commit_ts is 10:18 and the remote commit's commit_ts is 10:20, so
the remote apply wins. And in case 2, the remote commit's commit_ts is 10:20
whereas the local commit's commit_ts must be 10:20 + delta (because it waited
for the remote transaction to get committed).

Now say which is better: in case 1 we have to make the remote apply wait at
the beginning state without knowing what the local clock would be when it
actually comes to commit; it may so happen that if we choose case 2, by the
time the remote transaction finishes applying, the local clock is beyond
10:20 and we do not even need to wait?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 3:35 PM Amit Kapila wrote:
>
> On Wed, Jul 3, 2024 at 2:16 PM Dilip Kumar wrote:
> >
> > On Wed, Jul 3, 2024 at 12:30 PM Amit Kapila wrote:
> > >
> > > On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar wrote:
> > >
> > > But waiting after applying the operations and before applying the
> > > commit would mean that we need to wait with the locks held. That could
> > > be a recipe for deadlocks in the system. I see your point related to
> > > performance, but as we are not expecting clock skew in normal cases, we
> > > shouldn't be too bothered about the performance due to this. If
> > > there is clock skew, we expect users to fix it; this is just a
> > > worst-case aid for users.
> >
> > But if we make it wait at the very first operation, that means we will
> > not suck more decoded data from the network, and wouldn't that make the
> > sender wait for the network buffer to get consumed by the receiver?
>
> That would be true even if we wait just before applying the commit
> record, considering the transaction is small and the wait time is
> large.

What I am saying is that if we are not applying the whole transaction, it
means we are not receiving it either, unless we plan to spill it to a file.
If we don't spill it to a file, the network buffer will fill up very quickly.
This issue wouldn't occur if we waited right before the commit because, by
that time, we would have already received all the data from the network.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 12:30 PM Amit Kapila wrote:
>
> On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar wrote:
>
> But waiting after applying the operations and before applying the
> commit would mean that we need to wait with the locks held. That could
> be a recipe for deadlocks in the system. I see your point related to
> performance, but as we are not expecting clock skew in normal cases, we
> shouldn't be too bothered about the performance due to this. If
> there is clock skew, we expect users to fix it; this is just a
> worst-case aid for users.

But if we make it wait at the very first operation, that means we will not
suck more decoded data from the network, and wouldn't that make the sender
wait for the network buffer to get consumed by the receiver? Also, we already
have handling for parallel apply workers, so if we do not have an issue of
deadlock there, or if we can handle those issues there, we can do it here as
well, no?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 11:00 AM shveta malik wrote:
>
> > Yes, I also think it should be independent of CDR. IMHO, it should be
> > based on the user-configured maximum clock skew tolerance and can be
> > independent of CDR.
>
> +1
>
> > IIUC we would make the remote apply wait just
> > before committing if the remote commit timestamp is ahead of the local
> > clock by more than the maximum clock skew tolerance, is that correct?
>
> +1 on the condition to wait.
>
> But I think we should make the apply worker wait during begin
> (apply_handle_begin) instead of commit. It makes more sense to delay
> the entire operation to manage clock skew rather than the commit
> alone. And only then can CDR's timestamp-based resolutions, which happen
> much prior to the commit stage, benefit from this. Thoughts?

But do we really need to wait at apply_handle_begin()? I mean if we already
know the commit_ts then we can perform the conflict resolution, no? I mean we
should wait before committing because we are considering this remote
transaction to be in the future, and we do not want to confirm the commit of
this transaction to the remote node before the local clock reaches the
record's commit_ts, to preserve the causal order. However, we can still
perform conflict resolution beforehand since we already know the commit_ts.
The conflict resolution function will be something like "out_version =
CRF(version1_commit_ts, version2_commit_ts)", so the result should be the
same regardless of when we apply it, correct? From a performance standpoint,
wouldn't it be beneficial to perform as much work as possible in advance? By
the time we apply all the operations, the local clock might already be in
sync with the commit_ts of the remote transaction. Am I missing something?

However, while thinking about this, I'm wondering about how we will handle
the streaming of in-progress transactions. If we start applying with parallel
workers, we might not know the commit_ts of those transactions since they may
not have been committed yet. One simple option could be to prevent parallel
workers from applying in-progress transactions when CDR is set up. Instead,
we could let these transactions spill to files and only apply them once we
receive the commit record.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
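[Editor's note] A minimal sketch of the CRF shape mentioned above, for a
'last_update_wins' resolver; the name and tie-breaking are illustrative, and
a real implementation would likely also consult the replication origin for
exact timestamp ties:

static bool
remote_change_wins(TimestampTz local_commit_ts, TimestampTz remote_commit_ts)
{
	/* A pure function of the two commit timestamps, so it yields the same
	 * answer whether it runs while applying the operations or at commit. */
	return remote_commit_ts >= local_commit_ts;
}

Since the decision depends only on the two commit timestamps, performing it
early (while still delaying the commit confirmation) gives the same outcome
as performing it after the wait.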
Re: Conflict Detection and Resolution
On Tue, Jul 2, 2024 at 2:40 PM shveta malik wrote:
>
> On Wed, Jun 19, 2024 at 1:52 PM Dilip Kumar wrote:
> >
> > On Tue, Jun 18, 2024 at 3:29 PM shveta malik wrote:
> > > On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar wrote:
> > >
> > > I tried to work out a few scenarios with this, where the apply worker
> > > will wait until its local clock hits 'remote_commit_ts - max_skew
> > > permitted'. Please have a look.
> > >
> > > Let's say we have a GUC to configure the max_clock_skew permitted.
> > > The resolver is last_update_wins in both cases.
> > >
> > > 1) Case 1: max_clock_skew set to 0, i.e. no tolerance for clock skew.
> > >
> > > Remote update with commit_timestamp = 10.20AM.
> > > Local clock (which is say 5 min behind) shows = 10.15AM.
> > >
> > > When the remote update arrives at the local node, we see that the skew
> > > is greater than max_clock_skew, and thus the apply worker waits till the
> > > local clock hits 'remote's commit_ts - max_clock_skew', i.e. till 10.20
> > > AM. Once the local clock hits 10.20 AM, the worker applies the remote
> > > change with a commit_ts of 10.20AM. In the meantime (during the wait
> > > period of the apply worker), if some local update on the same row has
> > > happened at say 10.18am, that will be applied first, which will later be
> > > overwritten by the above remote change of 10.20AM as the remote change's
> > > timestamp appears more recent, even though it happened earlier than the
> > > local change.
> >
> > For the sake of simplicity let's call the change that happened at
> > 10:20 AM change-1 and the change that happened at 10:15 as change-2,
> > and assume we are talking about the synchronous commit only.
> >
> > I think now from an application perspective the change-1 wouldn't have
> > caused the change-2, because we delayed applying change-2 on the local
> > node, which would have delayed the confirmation of the change-1 to the
> > application; that means we have got the change-2 on the local node
> > without the confirmation of change-1, hence change-2 has no causal
> > dependency on the change-1. So it's fine that we perform change-1
> > before change-2, and the timestamp will also show the same at any other
> > node if they receive these 2 changes.
> >
> > (These change-1/change-2 mix-ups are corrected later in the thread.)
> >
> > The goal is to ensure that if we define the order where change-2
> > happens before change-1, this same order should be visible on all
> > other nodes. This will hold true because the commit timestamp of
> > change-2 is earlier than that of change-1.
> >
> > > 2) Case 2: max_clock_skew is set to 2min.
> > >
> > > Remote update with commit_timestamp = 10.20AM.
> > > Local clock (which is say 5 min behind) = 10.15AM.
> > >
> > > Now the apply worker will notice a skew greater than 2min and thus will
> > > wait till the local clock hits 'remote's commit_ts - max_clock_skew',
> > > i.e. 10.18, and will apply the change with a commit_ts of 10.20 (as we
> > > always save the origin's commit timestamp into the local commit_ts; see
> > > RecordTransactionCommit->TransactionTreeSetCommitTsData). Now let's say
> > > another local update is triggered at 10.19am; it will be applied locally
> > > but it will be ignored on the remote node. On the remote node, the
> > > existing change with a timestamp of 10.20 am will win, resulting in data
> > > divergence.
> >
> > Let's call the 10:20 AM change change-1 and the change that happened at
> > 10:19 change-2.
> >
> > IIUC, although we apply change-1 at 10:18 AM, the commit_ts of that
> > change is 10:20, and the same will be visible to all other nodes. So in
> > conflict resolution, change-1 still happened after change-2 because
> > change-2's commit_ts is 10:19 AM. Now there could be a problem with the
> > causal order, because we applied change-1 at 10:18 AM, so the
> > application might have gotten confirmation at 10:18 AM, and change-2 on
> > the local node may be triggered as a result of the confirmation of
> > change-1; that means change-2 now has a causal dependency on change-1,
> > but the commit_ts shows change-2 happened before change-1 on all the
> > nodes.
> >
> > So, is this acceptable? I think yes, because the user has configured a
> > maximum clock skew of 2 minutes, which means the detected order might
> > not always align with the causal order for transactions occurring
> > within that time frame.
Re: Conflict Detection and Resolution
On Wed, Jun 19, 2024 at 2:36 PM shveta malik wrote: > > On Wed, Jun 19, 2024 at 1:52 PM Dilip Kumar wrote: > > > > On Tue, Jun 18, 2024 at 3:29 PM shveta malik wrote: > > > On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar > > > wrote: > > > > > > I tried to work out a few scenarios with this, where the apply worker > > > will wait until its local clock hits 'remote_commit_tts - max_skew > > > permitted'. Please have a look. > > > > > > Let's say, we have a GUC to configure max_clock_skew permitted. > > > Resolver is last_update_wins in both cases. > > > > > > 1) Case 1: max_clock_skew set to 0 i.e. no tolerance for clock skew. > > > > > > Remote Update with commit_timestamp = 10.20AM. > > > Local clock (which is say 5 min behind) shows = 10.15AM. > > > > > > When remote update arrives at local node, we see that skew is greater > > > than max_clock_skew and thus apply worker waits till local clock hits > > > 'remote's commit_tts - max_clock_skew' i.e. till 10.20 AM. Once the > > > local clock hits 10.20 AM, the worker applies the remote change with > > > commit_tts of 10.20AM. In the meantime (during wait period of apply > > > worker)) if some local update on same row has happened at say 10.18am, > > > that will applied first, which will be later overwritten by above > > > remote change of 10.20AM as remote-change's timestamp appear more > > > latest, even though it has happened earlier than local change.

Oops, lots of mistakes in the usage of change-1 and change-2, sorry about that.

> > For the sake of simplicity let's call the change that happened at > > 10:20 AM change-1 and the change that happened at 10:15 as change-2 > > and assume we are talking about the synchronous commit only. > > Do you mean "the change that happened at 10:18 as change-2"

Right

> > > > I think now from an application perspective the change-1 wouldn't have > > caused the change-2 because we delayed applying change-2 on the local > > node > > Do you mean "we delayed applying change-1 on the local node."

Right

> >which would have delayed the confirmation of the change-1 to the > > application that means we have got the change-2 on the local node > > without the confirmation of change-1 hence change-2 has no causal > > dependency on the change-1. So it's fine that we perform change-1 > > before change-2 > > Do you mean "So it's fine that we perform change-2 before change-1"

Right

> >and the timestamp will also show the same at any other > > node if they receive these 2 changes. > > > > The goal is to ensure that if we define the order where change-2 > > happens before change-1, this same order should be visible on all > > other nodes. This will hold true because the commit timestamp of > > change-2 is earlier than that of change-1. > > Considering the above corrections as base, I agree with this.

+1

> > > 2) Case 2: max_clock_skew is set to 2min. > > > > > > Remote Update with commit_timestamp=10.20AM > > > Local clock (which is say 5 min behind) = 10.15AM. > > > > > > Now apply worker will notice skew greater than 2min and thus will wait > > > till local clock hits 'remote's commit_tts - max_clock_skew' i.e. > > > 10.18 and will apply the change with commit_tts of 10.20 ( as we > > > always save the origin's commit timestamp into local commit_tts, see > > > RecordTransactionCommit->TransactionTreeSetCommitTsData). Now lets say > > > another local update is triggered at 10.19am, it will be applied > > > locally but it will be ignored on remote node.
On the remote node , > > > the existing change with a timestamp of 10.20 am will win resulting in > > > data divergence. > > > > Let's call the 10:20 AM change as a change-1 and the change that > > happened at 10:19 as change-2 > > > > IIUC, although we apply the change-1 at 10:18 AM the commit_ts of that > > commit_ts of that change is 10:20, and the same will be visible to all > > other nodes. So in conflict resolution still the change-1 happened > > after the change-2 because change-2's commit_ts is 10:19 AM. Now > > there could be a problem with the causal order because we applied the > > change-1 at 10:18 AM so the application might have gotten confirmation > > at 10:18 AM and the change-2 of the local node may be triggered as a > > result of confirmation of the change-1 that means now change-2 has a > > causal dependency on the change-1 but commit_ts shows change-2 > > happened before the change-1 on all the nodes. > > > > So, is this acceptable? I think yes because the user has configured a > > maximum clock skew of 2 minutes, which means the detected order might > > not always align with the causal order for transactions occurring > > within that time frame. > > Agree. I had the same thoughts, and wanted to confirm my understanding. Okay -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Tue, Jun 18, 2024 at 3:29 PM shveta malik wrote: > On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar wrote: > > I tried to work out a few scenarios with this, where the apply worker > will wait until its local clock hits 'remote_commit_tts - max_skew > permitted'. Please have a look. > > Let's say, we have a GUC to configure max_clock_skew permitted. > Resolver is last_update_wins in both cases. > > 1) Case 1: max_clock_skew set to 0 i.e. no tolerance for clock skew. > > Remote Update with commit_timestamp = 10.20AM. > Local clock (which is say 5 min behind) shows = 10.15AM. > > When remote update arrives at local node, we see that skew is greater > than max_clock_skew and thus apply worker waits till local clock hits > 'remote's commit_tts - max_clock_skew' i.e. till 10.20 AM. Once the > local clock hits 10.20 AM, the worker applies the remote change with > commit_tts of 10.20AM. In the meantime (during wait period of apply > worker)) if some local update on same row has happened at say 10.18am, > that will applied first, which will be later overwritten by above > remote change of 10.20AM as remote-change's timestamp appear more > latest, even though it has happened earlier than local change.

For the sake of simplicity let's call the change that happened at 10:20 AM change-1 and the change that happened at 10:15 as change-2 and assume we are talking about the synchronous commit only.

I think now from an application perspective the change-1 wouldn't have caused the change-2 because we delayed applying change-2 on the local node which would have delayed the confirmation of the change-1 to the application that means we have got the change-2 on the local node without the confirmation of change-1 hence change-2 has no causal dependency on the change-1. So it's fine that we perform change-1 before change-2 and the timestamp will also show the same at any other node if they receive these 2 changes.

The goal is to ensure that if we define the order where change-2 happens before change-1, this same order should be visible on all other nodes. This will hold true because the commit timestamp of change-2 is earlier than that of change-1.

> 2) Case 2: max_clock_skew is set to 2min. > > Remote Update with commit_timestamp=10.20AM > Local clock (which is say 5 min behind) = 10.15AM. > > Now apply worker will notice skew greater than 2min and thus will wait > till local clock hits 'remote's commit_tts - max_clock_skew' i.e. > 10.18 and will apply the change with commit_tts of 10.20 ( as we > always save the origin's commit timestamp into local commit_tts, see > RecordTransactionCommit->TransactionTreeSetCommitTsData). Now lets say > another local update is triggered at 10.19am, it will be applied > locally but it will be ignored on remote node. On the remote node , > the existing change with a timestamp of 10.20 am will win resulting in > data divergence.

Let's call the 10:20 AM change as change-1 and the change that happened at 10:19 as change-2.

IIUC, although we apply the change-1 at 10:18 AM the commit_ts of that change is 10:20, and the same will be visible to all other nodes. So in conflict resolution still the change-1 happened after the change-2 because change-2's commit_ts is 10:19 AM.
Now there could be a problem with the causal order: because we applied change-1 at 10:18 AM, the application might have gotten confirmation at 10:18 AM, and change-2 on the local node may be triggered as a result of that confirmation of change-1. That means change-2 now has a causal dependency on change-1, yet the commit_ts shows change-2 happening before change-1 on all the nodes.

So, is this acceptable? I think yes, because the user has configured a maximum clock skew of 2 minutes, which means the detected order might not always align with the causal order for transactions occurring within that time frame. Generally, the ideal configuration for max_clock_skew should be a multiple of the network round-trip time. Assuming this configuration, we wouldn't encounter this problem, because for change-2 to be caused by change-1, the client would need to get confirmation of change-1 and then trigger change-2, which would take at least 2-3 network round trips.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
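The wait being described is simple to state in code. A standalone C sketch (names are hypothetical; inside the backend this would presumably use GetCurrentTimestamp() and the latch machinery rather than a raw sleep):

#include <stdint.h>
#include <unistd.h>

typedef int64_t TimestampTz;                 /* microseconds */

extern TimestampTz local_clock_now(void);    /* stand-in for GetCurrentTimestamp() */

/*
 * Delay applying a remote change until the local clock reaches
 * remote_commit_ts - max_clock_skew. With max_clock_skew = 0, a local
 * clock at 10.15 waits until 10.20 before applying a 10.20 commit.
 */
static void
wait_for_local_clock(TimestampTz remote_commit_ts, int64_t max_clock_skew_us)
{
    TimestampTz wakeup = remote_commit_ts - max_clock_skew_us;
    TimestampTz now = local_clock_now();

    if (now < wakeup)
        usleep((useconds_t) (wakeup - now));  /* sketch only; not interruptible */
}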
Re: Conflict Detection and Resolution
On Tue, Jun 18, 2024 at 12:11 PM Amit Kapila wrote: > > On Tue, Jun 18, 2024 at 11:54 AM Dilip Kumar wrote: > > > > On Mon, Jun 17, 2024 at 8:51 PM Robert Haas wrote: > > > > > > On Mon, Jun 17, 2024 at 1:42 AM Amit Kapila > > > wrote: > > > > The difference w.r.t the existing mechanisms for holding deleted data > > > > is that we don't know whether we need to hold off the vacuum from > > > > cleaning up the rows because we can't say with any certainty whether > > > > other nodes will perform any conflicting operations in the future. > > > > Using the example we discussed, > > > > Node A: > > > > T1: INSERT INTO t (id, value) VALUES (1,1); > > > > T2: DELETE FROM t WHERE id = 1; > > > > > > > > Node B: > > > > T3: UPDATE t SET value = 2 WHERE id = 1; > > > > > > > > Say the order of receiving the commands is T1-T2-T3. We can't predict > > > > whether we will ever get T-3, so on what basis shall we try to prevent > > > > vacuum from removing the deleted row? > > > > > > The problem arises because T2 and T3 might be applied out of order on > > > some nodes. Once either one of them has been applied on every node, no > > > further conflicts are possible. > > > > If we decide to skip the update whether the row is missing or deleted, > > we indeed reach the same end result regardless of the order of T2, T3, > > and Vacuum. Here's how it looks in each case: > > > > Case 1: T1, T2, Vacuum, T3 -> Skip the update for a non-existing row > > -> end result we do not have a row. > > Case 2: T1, T2, T3 -> Skip the update for a deleted row -> end result > > we do not have a row. > > Case 3: T1, T3, T2 -> deleted the row -> end result we do not have a row. > > > > In case 3, how can deletion be successful? The row required to be > deleted has already been updated.

Hmm, I was considering this case in the example you gave above [1]: we have updated some fields of the row with id=1, but isn't this row still detectable by the delete, since the delete will find it by id=1 and we haven't updated the id? I was making the point w.r.t. the example used above.

[1] > > > > Node A: > > > > T1: INSERT INTO t (id, value) VALUES (1,1); > > > > T2: DELETE FROM t WHERE id = 1; > > > > > > > > Node B: > > > > T3: UPDATE t SET value = 2 WHERE id = 1;

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Mon, Jun 17, 2024 at 8:51 PM Robert Haas wrote: > > On Mon, Jun 17, 2024 at 1:42 AM Amit Kapila wrote: > > The difference w.r.t the existing mechanisms for holding deleted data > > is that we don't know whether we need to hold off the vacuum from > > cleaning up the rows because we can't say with any certainty whether > > other nodes will perform any conflicting operations in the future. > > Using the example we discussed, > > Node A: > > T1: INSERT INTO t (id, value) VALUES (1,1); > > T2: DELETE FROM t WHERE id = 1; > > > > Node B: > > T3: UPDATE t SET value = 2 WHERE id = 1; > > > > Say the order of receiving the commands is T1-T2-T3. We can't predict > > whether we will ever get T-3, so on what basis shall we try to prevent > > vacuum from removing the deleted row? > > The problem arises because T2 and T3 might be applied out of order on > some nodes. Once either one of them has been applied on every node, no > further conflicts are possible.

If we decide to skip the update whether the row is missing or deleted, we indeed reach the same end result regardless of the order of T2, T3, and Vacuum. Here's how it looks in each case:

Case 1: T1, T2, Vacuum, T3 -> Skip the update for a non-existing row -> end result we do not have a row.
Case 2: T1, T2, T3 -> Skip the update for a deleted row -> end result we do not have a row.
Case 3: T1, T3, T2 -> deleted the row -> end result we do not have a row.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
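The convergence of the three cases can be seen in a tiny standalone C sketch (illustrative names only): an incoming UPDATE whose target row is gone, whether just deleted or already vacuumed away, is simply skipped, so every ordering of T2, T3, and Vacuum ends with no row.

typedef enum { ROW_FOUND, ROW_DELETED, ROW_MISSING } RowLookup;

/* Returns 1 if the remote UPDATE should be applied, 0 if it is skipped. */
static int
should_apply_remote_update(RowLookup lookup)
{
    switch (lookup)
    {
        case ROW_FOUND:
            return 1;       /* normal apply (case 3: T3 applies, then T2 deletes) */
        case ROW_DELETED:   /* dead tuple still on disk: case 2 */
        case ROW_MISSING:   /* tuple already vacuumed away: case 1 */
        default:
            return 0;       /* skip; all three orderings converge to "no row" */
    }
}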
Re: Conflict Detection and Resolution
On Tue, Jun 18, 2024 at 10:17 AM Dilip Kumar wrote: > > On Mon, Jun 17, 2024 at 3:23 PM Amit Kapila wrote: > > > > On Wed, Jun 12, 2024 at 10:03 AM Dilip Kumar wrote: > > > > > > On Tue, Jun 11, 2024 at 7:44 PM Tomas Vondra > > > wrote: > > > > > > > > Yes, that's correct. However, many cases could benefit from the > > > > > update_deleted conflict type if it can be implemented reliably. That's > > > > > why we wanted to give it a try. But if we can't achieve predictable > > > > > results with it, I'm fine to drop this approach and conflict_type. We > > > > > can consider a better design in the future that doesn't depend on > > > > > non-vacuumed entries and provides a more robust method for identifying > > > > > deleted rows. > > > > > > > > > > > > > I agree having a separate update_deleted conflict would be beneficial, > > > > I'm not arguing against that - my point is actually that I think this > > > > conflict type is required, and that it needs to be detected reliably. > > > > > > > > > > When working with a distributed system, we must accept some form of > > > eventual consistency model. However, it's essential to design a > > > predictable and acceptable behavior. For example, if a change is a > > > result of a previous operation (such as an update on node B triggered > > > after observing an operation on node A), we can say that the operation > > > on node A happened before the operation on node B. Conversely, if > > > operations on nodes A and B are independent, we consider them > > > concurrent. > > > > > > In distributed systems, clock skew is a known issue. To establish a > > > consistency model, we need to ensure it guarantees the > > > "happens-before" relationship. Consider a scenario with three nodes: > > > NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and > > > subsequently NodeB makes changes, and then both NodeA's and NodeB's > > > changes are sent to NodeC, the clock skew might make NodeB's changes > > > appear to have occurred before NodeA's changes. However, we should > > > maintain data that indicates NodeB's changes were triggered after > > > NodeA's changes arrived at NodeB. This implies that logically, NodeB's > > > changes happened after NodeA's changes, despite what the timestamps > > > suggest. > > > > > > A common method to handle such cases is using vector clocks for > > > conflict resolution. > > > > > > > I think the unbounded size of the vector could be a problem to store > > for each event. However, while researching previous discussions, it > > came to our notice that we have discussed this topic in the past as > > well in the context of standbys. For recovery_min_apply_delay, we > > decided the clock skew is not a problem as the settings of this > > parameter are much larger than typical time deviations between servers > > as mentioned in docs. Similarly for casual reads [1], there was a > > proposal to introduce max_clock_skew parameter and suggesting the user > > to make sure to have NTP set up correctly. We have tried to check > > other databases (like Ora and BDR) where CDR is implemented but didn't > > find anything specific to clock skew. So, I propose to go with a GUC > > like max_clock_skew such that if the difference of time between the > > incoming transaction's commit time and the local time is more than > > max_clock_skew then we raise an ERROR. 
It is not clear to me that > > putting bigger effort into clock skew is worth especially when other > > systems providing CDR feature (like Ora or BDR) for decades have not > > done anything like vector clocks. It is possible that this is less of > > a problem w.r.t CDR and just detecting the anomaly in clock skew is > > good enough. > > I believe that if we've accepted this solution elsewhere, then we can > also consider the same. Basically, we're allowing the application to > set its tolerance for clock skew. And, if the skew exceeds that > tolerance, it's the application's responsibility to synchronize; > otherwise, an error will occur. This approach seems reasonable.

This model can be further extended by making the apply worker wait if the remote transaction's commit_ts is greater than the local timestamp. This ensures that no local transactions occurring after the remote transaction appear to have committed before it.
Re: Conflict Detection and Resolution
On Mon, Jun 17, 2024 at 3:23 PM Amit Kapila wrote: > > On Wed, Jun 12, 2024 at 10:03 AM Dilip Kumar wrote: > > > > On Tue, Jun 11, 2024 at 7:44 PM Tomas Vondra > > wrote: > > > > > > Yes, that's correct. However, many cases could benefit from the > > > > update_deleted conflict type if it can be implemented reliably. That's > > > > why we wanted to give it a try. But if we can't achieve predictable > > > > results with it, I'm fine to drop this approach and conflict_type. We > > > > can consider a better design in the future that doesn't depend on > > > > non-vacuumed entries and provides a more robust method for identifying > > > > deleted rows. > > > > > > > > > > I agree having a separate update_deleted conflict would be beneficial, > > > I'm not arguing against that - my point is actually that I think this > > > conflict type is required, and that it needs to be detected reliably. > > > > > > > When working with a distributed system, we must accept some form of > > eventual consistency model. However, it's essential to design a > > predictable and acceptable behavior. For example, if a change is a > > result of a previous operation (such as an update on node B triggered > > after observing an operation on node A), we can say that the operation > > on node A happened before the operation on node B. Conversely, if > > operations on nodes A and B are independent, we consider them > > concurrent. > > > > In distributed systems, clock skew is a known issue. To establish a > > consistency model, we need to ensure it guarantees the > > "happens-before" relationship. Consider a scenario with three nodes: > > NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and > > subsequently NodeB makes changes, and then both NodeA's and NodeB's > > changes are sent to NodeC, the clock skew might make NodeB's changes > > appear to have occurred before NodeA's changes. However, we should > > maintain data that indicates NodeB's changes were triggered after > > NodeA's changes arrived at NodeB. This implies that logically, NodeB's > > changes happened after NodeA's changes, despite what the timestamps > > suggest. > > > > A common method to handle such cases is using vector clocks for > > conflict resolution. > > > > I think the unbounded size of the vector could be a problem to store > for each event. However, while researching previous discussions, it > came to our notice that we have discussed this topic in the past as > well in the context of standbys. For recovery_min_apply_delay, we > decided the clock skew is not a problem as the settings of this > parameter are much larger than typical time deviations between servers > as mentioned in docs. Similarly for casual reads [1], there was a > proposal to introduce max_clock_skew parameter and suggesting the user > to make sure to have NTP set up correctly. We have tried to check > other databases (like Ora and BDR) where CDR is implemented but didn't > find anything specific to clock skew. So, I propose to go with a GUC > like max_clock_skew such that if the difference of time between the > incoming transaction's commit time and the local time is more than > max_clock_skew then we raise an ERROR. It is not clear to me that > putting bigger effort into clock skew is worth especially when other > systems providing CDR feature (like Ora or BDR) for decades have not > done anything like vector clocks. It is possible that this is less of > a problem w.r.t CDR and just detecting the anomaly in clock skew is > good enough. 
I believe that if we've accepted this solution elsewhere, then we can also consider the same. Basically, we're allowing the application to set its tolerance for clock skew. And, if the skew exceeds that tolerance, it's the application's responsibility to synchronize; otherwise, an error will occur. This approach seems reasonable. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
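A minimal sketch of the check being agreed to here, assuming backend context (ereport() and errmsg() are the real error-reporting API; the GUC variable, its unit, and the call site are assumptions):

/* hypothetical GUC, in milliseconds; -1 disables the check */
static int max_clock_skew = -1;

static void
check_remote_commit_ts(TimestampTz remote_commit_ts)
{
    TimestampTz local_ts = GetCurrentTimestamp();

    if (max_clock_skew >= 0 &&
        remote_commit_ts - local_ts > (TimestampTz) max_clock_skew * 1000)
        ereport(ERROR,
                (errmsg("remote commit timestamp is ahead of the local clock "
                        "by more than max_clock_skew")));
}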
Re: Conflict Detection and Resolution
On Mon, Jun 17, 2024 at 5:38 AM Tomas Vondra wrote: > > > The issue with using commit timestamps is that, when multiple nodes > > are involved, the commit timestamp won't accurately represent the > > actual order of operations. There's no reliable way to determine the > > perfect order of each operation happening on different nodes roughly > > simultaneously unless we use some globally synchronized counter. > > Generally, that order might not cause real issues unless one operation > > is triggered by a previous operation, and relying solely on physical > > timestamps would not detect that correctly. > > > This whole conflict detection / resolution proposal is based on using > commit timestamps. Aren't you suggesting it can't really work with > commit timestamps? > > FWIW there are ways to build distributed consistency with timestamps, > as long as it's monotonic - e.g. clock-SI does that. It's not perfect, > but it shows it's possible.

Hmm, I see that clock-SI does this by delaying the transaction when it detects clock skew.

> However, I'm not sure we have to go there - it depends on what the goal is. > For a one-directional replication (multiple nodes replicating to the > same target) it might be sufficient if the conflict resolution is > "deterministic" (e.g. not dependent on the order in which the changes > are applied). I'm not sure, but it's why I asked what's the goal in my > very first message in this thread.

I'm not completely certain about this. Even in one-directional replication, if multiple nodes are sending data, how can we guarantee determinism in the presence of clock skew without using some other mechanism like logical counters, or something like what clock-SI is doing?

I don't want to insist on using any specific solution here. However, I noticed that we haven't addressed how we plan to manage clock skew, which is my primary concern. I believe that if multiple nodes are involved and we're receiving data from them with unsynchronized clocks, ensuring determinism about their order will require us to take some measures to handle that.

> > We need some sort of logical counter, such as a vector clock, which > > might be an independent counter on each node but can perfectly track > > the causal order. For example, if NodeA observes an operation from > > NodeB with a counter value of X, NodeA will adjust its counter to X+1. > > This ensures that if NodeA has seen an operation from NodeB, its next > > operation will appear to have occurred after NodeB's operation. > > > > I admit that I haven't fully thought through how we could design such > > version tracking in our logical replication protocol or how it would > > fit into our system. However, my point is that we need to consider > > something beyond commit timestamps to achieve reliable ordering. > > > > I can't really respond to this as there's no suggestion how it would be > implemented in the patch discussed in this thread. >

No worries, I'll consider whether finding such a solution is feasible for our situation. Thank you!

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Logical Replication of sequences
On Thu, Jun 13, 2024 at 11:53 AM vignesh C wrote: > > On Thu, 13 Jun 2024 at 10:27, Dilip Kumar wrote: > > Thanks for the explanation, but I am still not getting it completely, > > do you mean to say unless all the sequences are not synced any of the > > sequences would not be marked "ready" in pg_subscription_rel? Is that > > necessary? I mean why we can not sync the sequences one by one and > > mark them ready? Why it is necessary to either have all the sequences > > synced or none of them? > > Since updating the sequence is one operation and setting > pg_subscription_rel is another, I was trying to avoid a situation > where the sequence is updated but its state is not reflected in > pg_subscription_rel. It seems you are suggesting that it's acceptable > for the sequence to be updated even if its state isn't updated in > pg_subscription_rel, and in such cases, the sequence value does not > need to be reverted. Right, the complexity we're adding to achieve a behavior that may not be truly desirable is a concern. For instance, if we mark the status as ready but do not sync the sequences, it could lead to issues. However, if we have synced some sequences but encounter a failure without marking the status as ready, I don't consider it inconsistent in any way. But anyway, now I understand your thinking behind that so it's a good idea to leave this design behavior for a later decision. Gathering more opinions and insights during later stages will provide a clearer perspective on how to proceed with this aspect. Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Logical Replication of sequences
On Thu, Jun 13, 2024 at 10:10 AM vignesh C wrote: > > > So, you're saying that when we synchronize the sequence values on the > > subscriber side, we will create a new relfilenode to allow reverting > > to the old state of the sequence in case of an error or transaction > > rollback? But why would we want to do that? Generally, even if you > > call nextval() on a sequence and then roll back the transaction, the > > sequence value doesn't revert to the old value. So, what specific > > problem on the subscriber side are we trying to avoid by operating on > > a new relfilenode? > > Let's consider a situation where we have two sequences: seq1 with a > value of 100 and seq2 with a value of 200. Now, let's say seq1 is > synced and updated to 100, then we attempt to synchronize seq2, > there's a failure due to the sequence not existing or encountering > some other issue. In this scenario, we don't want to halt operations > where seq1 is synchronized, but the sequence state for sequence isn't > changed to "ready" in pg_subscription_rel.

Thanks for the explanation, but I am still not getting it completely. Do you mean to say that unless all the sequences are synced, none of them would be marked "ready" in pg_subscription_rel? Is that necessary? I mean, why can we not sync the sequences one by one and mark them ready? Why is it necessary to have either all the sequences synced or none of them?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jun 12, 2024 at 5:26 PM Tomas Vondra wrote: > > >> I agree having a separate update_deleted conflict would be beneficial, > >> I'm not arguing against that - my point is actually that I think this > >> conflict type is required, and that it needs to be detected reliably. > >> > > > > When working with a distributed system, we must accept some form of > > eventual consistency model. > > I'm not sure this is necessarily true. There are distributed databases > implementing (or aiming to) regular consistency models, without eventual > consistency. I'm not saying it's easy, but it shows eventual consistency > is not the only option. Right, that statement might not be completely accurate. Based on the CAP theorem, when a network partition is unavoidable and availability is expected, we often choose an eventual consistency model. However, claiming that a fully consistent model is impossible in any distributed system is incorrect, as it can be achieved using mechanisms like Two-Phase Commit. We must also accept that our PostgreSQL replication mechanism does not guarantee a fully consistent model. Even with synchronous commit, it only waits for the WAL to be replayed on the standby but does not change the commit decision based on other nodes. This means, at most, we can only guarantee "Read Your Write" consistency. > > However, it's essential to design a > > predictable and acceptable behavior. For example, if a change is a > > result of a previous operation (such as an update on node B triggered > > after observing an operation on node A), we can say that the operation > > on node A happened before the operation on node B. Conversely, if > > operations on nodes A and B are independent, we consider them > > concurrent. > > > > Right. And this is precisely the focus or my questions - understanding > what behavior we aim for / expect in the end. Or said differently, what > anomalies / weird behavior would be considered expected. > Because that's important both for discussions about feasibility, etc. > And also for evaluation / reviews of the patch. +1 > > In distributed systems, clock skew is a known issue. To establish a > > consistency model, we need to ensure it guarantees the > > "happens-before" relationship. Consider a scenario with three nodes: > > NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and > > subsequently NodeB makes changes, and then both NodeA's and NodeB's > > changes are sent to NodeC, the clock skew might make NodeB's changes > > appear to have occurred before NodeA's changes. However, we should > > maintain data that indicates NodeB's changes were triggered after > > NodeA's changes arrived at NodeB. This implies that logically, NodeB's > > changes happened after NodeA's changes, despite what the timestamps > > suggest. > > > > A common method to handle such cases is using vector clocks for > > conflict resolution. "Vector clocks" allow us to track the causal > > relationships between changes across nodes, ensuring that we can > > correctly order events and resolve conflicts in a manner that respects > > the "happens-before" relationship. This method helps maintain > > consistency and predictability in the system despite issues like clock > > skew. > > > > I'm familiar with the concept of vector clock (or logical clock in > general), but it's not clear to me how you plan to use this in the > context of the conflict handling. Can you elaborate/explain? 
> > The way I see it, conflict handling is pretty tightly coupled with > regular commit timestamps and MVCC in general. How would you use vector > clock to change that? The issue with using commit timestamps is that, when multiple nodes are involved, the commit timestamp won't accurately represent the actual order of operations. There's no reliable way to determine the perfect order of each operation happening on different nodes roughly simultaneously unless we use some globally synchronized counter. Generally, that order might not cause real issues unless one operation is triggered by a previous operation, and relying solely on physical timestamps would not detect that correctly. We need some sort of logical counter, such as a vector clock, which might be an independent counter on each node but can perfectly track the causal order. For example, if NodeA observes an operation from NodeB with a counter value of X, NodeA will adjust its counter to X+1. This ensures that if NodeA has seen an operation from NodeB, its next operation will appear to have occurred after NodeB's operation. I admit that I haven't fully thought through how we could design such version tracking in our logical replication protocol or how it would fit into our system. However, my point is that we need to consider something beyond commit timestamps to achieve reliable ordering. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
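For illustration only, a toy standalone C vector clock over a fixed set of nodes, showing the rule sketched above (take the component-wise max of what we have seen, then tick our own slot); the unbounded-size concern raised elsewhere in the thread is precisely that N_NODES cannot be fixed in a real cluster:

#include <stdint.h>

#define N_NODES 3

typedef struct VectorClock { uint64_t c[N_NODES]; } VectorClock;

/* A local event on node "me": tick our own component. */
static void
vc_local_event(VectorClock *vc, int me)
{
    vc->c[me]++;
}

/*
 * On receiving a change stamped "remote": component-wise max, then tick
 * our own component, so whatever we do next is ordered after everything
 * we have observed.
 */
static void
vc_receive(VectorClock *vc, const VectorClock *remote, int me)
{
    for (int i = 0; i < N_NODES; i++)
        if (remote->c[i] > vc->c[i])
            vc->c[i] = remote->c[i];
    vc->c[me]++;
}

/*
 * a happened-before b iff a <= b component-wise and a != b; if neither
 * clock dominates the other, the two events were concurrent.
 */
static int
vc_happened_before(const VectorClock *a, const VectorClock *b)
{
    int strictly_less = 0;

    for (int i = 0; i < N_NODES; i++)
    {
        if (a->c[i] > b->c[i])
            return 0;
        if (a->c[i] < b->c[i])
            strictly_less = 1;
    }
    return strictly_less;
}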
Re: Logical Replication of sequences
On Wed, Jun 12, 2024 at 4:08 PM vignesh C wrote: > > On Wed, 12 Jun 2024 at 10:51, Dilip Kumar wrote: > > > > On Tue, Jun 11, 2024 at 4:06 PM vignesh C wrote: > > > > > > Amit and I engaged in an offline discussion regarding the design and > > > contemplated that it could be like below: > > > > If I understand correctly, does this require the sequences to already > > exist on the subscribing node before creating the subscription, or > > will it also copy any non-existing sequences? > > Sequences must exist in the subscriber; we'll synchronize only their > values. Any sequences that are not present in the subscriber will > trigger an error. Okay, that makes sense. > > > > 3) Refreshing the sequence can be achieved through the existing > > > command: ALTER SUBSCRIPTION ... REFRESH PUBLICATION(no syntax change > > > here). > > > The subscriber identifies stale sequences, meaning sequences present > > > in pg_subscription_rel but absent from the publication, and removes > > > them from the pg_subscription_rel system table. The subscriber also > > > checks for newly added sequences in the publisher and synchronizes > > > their values from the publisher using the steps outlined in the > > > subscription creation process. It's worth noting that previously > > > synchronized sequences won't be synchronized again; the sequence sync > > > will occur solely for the newly added sequences. > > > > > 4) Introducing a new command for refreshing all sequences: ALTER > > > SUBSCRIPTION ... REFRESH PUBLICATION SEQUENCES. > > > The subscriber will remove stale sequences and add newly added > > > sequences from the publisher. Following this, it will re-synchronize > > > the sequence values for all sequences in the updated list from the > > > publisher, following the steps outlined in the subscription creation > > > process. > > > > Okay, this answers my first question: we will remove the sequences > > that are removed from the publisher and add the new sequences. I don't > > see any problem with this, but doesn't it seem like we are effectively > > doing DDL replication only for sequences without having a > > comprehensive plan for overall DDL replication? > > What I intended to convey is that we'll eliminate the sequences from > pg_subscription_rel. We won't facilitate the DDL replication of > sequences; instead, we anticipate users to create the sequences > themselves. hmm okay. > > > 5) Incorporate the pg_sequence_state function to fetch the sequence > > > value from the publisher, along with the page LSN. Incorporate > > > SetSequence function, which will procure a new relfilenode for the > > > sequence and set the new relfilenode with the specified value. This > > > will facilitate rollback in case of any failures. > > > > I do not understand this point, you mean whenever we are fetching the > > sequence value from the publisher we need to create a new relfilenode > > on the subscriber? Why not just update the catalog tuple is > > sufficient? Or this is for handling the ALTER SEQUENCE case? > > Sequences operate distinctively from tables. Alterations to sequences > reflect instantly in another session, even before committing the > transaction. To ensure the synchronization of sequence value and state > updates in pg_subscription_rel, we assign it a new relfilenode. This > strategy ensures that any potential errors allow for the rollback of > both the sequence state in pg_subscription_rel and the sequence values > simultaneously. 
So, you're saying that when we synchronize the sequence values on the subscriber side, we will create a new relfilenode to allow reverting to the old state of the sequence in case of an error or transaction rollback? But why would we want to do that? Generally, even if you call nextval() on a sequence and then roll back the transaction, the sequence value doesn't revert to the old value. So, what specific problem on the subscriber side are we trying to avoid by operating on a new relfilenode? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
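A rough sketch of the shape being described, with heavy caveats: RelationSetNewRelfilenumber() is the real backend routine used for TRUNCATE-style rewrites, but SetSequence() and fill_seq_with_data() here are only stand-ins for whatever the patch actually does.

static void
SetSequence(Oid seqoid, int64 value)
{
    Relation    seqrel = relation_open(seqoid, AccessExclusiveLock);

    /*
     * Swap in a fresh relfilenode. The catalog change is transactional,
     * so if anything fails afterwards, the sequence value and the
     * pg_subscription_rel state roll back together.
     */
    RelationSetNewRelfilenumber(seqrel, seqrel->rd_rel->relpersistence);

    /* hypothetical: write the first sequence page with the remote value */
    fill_seq_with_data(seqrel, value);

    relation_close(seqrel, NoLock);
}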
Re: Logical Replication of sequences
On Tue, Jun 11, 2024 at 4:06 PM vignesh C wrote: > > Amit and I engaged in an offline discussion regarding the design and > contemplated that it could be like below: If I understand correctly, does this require the sequences to already exist on the subscribing node before creating the subscription, or will it also copy any non-existing sequences? > 1) CREATE PUBLICATION syntax enhancement: > CREATE PUBLICATION ... FOR ALL SEQUENCES; > The addition of a new column titled "all sequences" in the > pg_publication system table will signify whether the publication is > designated as all sequences publication or not. > > 2) CREATE SUBSCRIPTION -- no syntax change. > Upon creation of a subscription, the following additional steps will > be managed by the subscriber: > i) The subscriber will retrieve the list of sequences associated with > the subscription's publications. > ii) For each sequence: a) Retrieve the sequence value from the > publisher by invoking the pg_sequence_state function. b) Set the > sequence with the value obtained from the publisher. iv) Once the > subscription creation is completed, all sequence values will become > visible at the subscriber's end. > > An alternative design approach could involve retrieving the sequence > list from the publisher during subscription creation and inserting the > sequences with an "init" state into the pg_subscription_rel system > table. These tasks could be executed by a single sequence sync worker, > which would: > i) Retrieve the list of sequences in the "init" state from the > pg_subscription_rel system table. > ii) Initiate a transaction. > iii) For each sequence: a) Obtain the sequence value from the > publisher by utilizing the pg_sequence_state function. b) Update the > sequence with the value obtained from the publisher. > iv) Commit the transaction. > > The benefit with the second approach is that if there are large number > of sequences, the sequence sync can be enhanced to happen in parallel > and also if there are any locks held on the sequences in the > publisher, the sequence worker can wait to acquire the lock instead of > blocking the whole create subscription command which will delay the > initial copy of the tables too. Yeah w.r.t. this point second approach seems better. > 3) Refreshing the sequence can be achieved through the existing > command: ALTER SUBSCRIPTION ... REFRESH PUBLICATION(no syntax change > here). > The subscriber identifies stale sequences, meaning sequences present > in pg_subscription_rel but absent from the publication, and removes > them from the pg_subscription_rel system table. The subscriber also > checks for newly added sequences in the publisher and synchronizes > their values from the publisher using the steps outlined in the > subscription creation process. It's worth noting that previously > synchronized sequences won't be synchronized again; the sequence sync > will occur solely for the newly added sequences. > 4) Introducing a new command for refreshing all sequences: ALTER > SUBSCRIPTION ... REFRESH PUBLICATION SEQUENCES. > The subscriber will remove stale sequences and add newly added > sequences from the publisher. Following this, it will re-synchronize > the sequence values for all sequences in the updated list from the > publisher, following the steps outlined in the subscription creation > process. Okay, this answers my first question: we will remove the sequences that are removed from the publisher and add the new sequences. 
I don't see any problem with this, but doesn't it seem like we are effectively doing DDL replication only for sequences without having a comprehensive plan for overall DDL replication? > 5) Incorporate the pg_sequence_state function to fetch the sequence > value from the publisher, along with the page LSN. Incorporate > SetSequence function, which will procure a new relfilenode for the > sequence and set the new relfilenode with the specified value. This > will facilitate rollback in case of any failures. I do not understand this point, you mean whenever we are fetching the sequence value from the publisher we need to create a new relfilenode on the subscriber? Why not just update the catalog tuple is sufficient? Or this is for handling the ALTER SEQUENCE case? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
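Putting the pieces of the second approach together, a sketch of the sequence-sync worker's pass (StartTransactionCommand(), UpdateSubscriptionRelState(), and the SUBREL_STATE_* constants are existing backend APIs; the helper functions and exact signatures are assumptions):

static void
sync_init_state_sequences(Oid subid)
{
    /* hypothetical: sequences inserted as SUBREL_STATE_INIT at CREATE SUBSCRIPTION */
    List       *seqs = get_init_state_sequences(subid);
    ListCell   *lc;

    StartTransactionCommand();
    foreach(lc, seqs)
    {
        Oid     seqoid = lfirst_oid(lc);

        /* hypothetical: remote call wrapping pg_sequence_state() */
        int64   value = fetch_remote_sequence_value(seqoid);

        SetSequence(seqoid, value);     /* new relfilenode; see the sketch above */
        UpdateSubscriptionRelState(subid, seqoid, SUBREL_STATE_READY,
                                   InvalidXLogRecPtr);
    }
    CommitTransactionCommand();
}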
Re: Conflict Detection and Resolution
On Tue, Jun 11, 2024 at 7:44 PM Tomas Vondra wrote: > > Yes, that's correct. However, many cases could benefit from the > > update_deleted conflict type if it can be implemented reliably. That's > > why we wanted to give it a try. But if we can't achieve predictable > > results with it, I'm fine to drop this approach and conflict_type. We > > can consider a better design in the future that doesn't depend on > > non-vacuumed entries and provides a more robust method for identifying > > deleted rows. > > > > I agree having a separate update_deleted conflict would be beneficial, > I'm not arguing against that - my point is actually that I think this > conflict type is required, and that it needs to be detected reliably. > When working with a distributed system, we must accept some form of eventual consistency model. However, it's essential to design a predictable and acceptable behavior. For example, if a change is a result of a previous operation (such as an update on node B triggered after observing an operation on node A), we can say that the operation on node A happened before the operation on node B. Conversely, if operations on nodes A and B are independent, we consider them concurrent. In distributed systems, clock skew is a known issue. To establish a consistency model, we need to ensure it guarantees the "happens-before" relationship. Consider a scenario with three nodes: NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and subsequently NodeB makes changes, and then both NodeA's and NodeB's changes are sent to NodeC, the clock skew might make NodeB's changes appear to have occurred before NodeA's changes. However, we should maintain data that indicates NodeB's changes were triggered after NodeA's changes arrived at NodeB. This implies that logically, NodeB's changes happened after NodeA's changes, despite what the timestamps suggest. A common method to handle such cases is using vector clocks for conflict resolution. "Vector clocks" allow us to track the causal relationships between changes across nodes, ensuring that we can correctly order events and resolve conflicts in a manner that respects the "happens-before" relationship. This method helps maintain consistency and predictability in the system despite issues like clock skew. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Fri, Jun 7, 2024 at 2:40 PM Matthias van de Meent wrote: > > On Fri, 7 Jun 2024 at 10:28, Dilip Kumar wrote: > > > > On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent > > wrote: > >> > >> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar wrote: > >>> > >>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent > >>> wrote: > >>> > >>> I agree with you that we introduced the WAL_LOG strategy to avoid > >>> these force checkpoints. However, in binary upgrade cases where no > >>> operations are happening in the system, the FILE_COPY strategy should > >>> be faster. > >> > >> While you would be correct if there were no operations happening in > >> the system, during binary upgrade we're still actively modifying > >> catalogs; and this is done with potentially many concurrent jobs. I > >> think it's not unlikely that this would impact performance. > > > > Maybe, but generally, long checkpoints are problematic because they > > involve a lot of I/O, which hampers overall system performance. > > However, in the case of a binary upgrade, the concurrent operations > > are only performing a schema restore, not a real data restore. > > Therefore, it shouldn't have a significant impact, and the checkpoints > > should also not do a lot of I/O during binary upgrade, right? > > My primary concern isn't the IO, but the O(shared_buffers) that we > have to go through during a checkpoint. As I mentioned upthread, it is > reasonably possible the new cluster is already setup with a good > fraction of the old system's shared_buffers configured. Every > checkpoint has to scan all those buffers, which IMV can get (much) > more expensive than the IO overhead caused by the WAL_LOG strategy. It > may be a baseless fear as I haven't done the performance benchmarks > for this, but I wouldn't be surprised if shared_buffers=8GB would > measurably impact the upgrade performance in the current patch (vs the > default 128MB). Okay, that's a valid point. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
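The O(shared_buffers) cost Matthias describes is visible in the shape of BufferSync() in bufmgr.c, which (heavily simplified here, locking elided) walks every buffer header whether or not it is dirty:

/*
 * Simplified from BufferSync(): the scan is O(NBuffers) even when
 * almost nothing is dirty, which is why checkpoint cost scales with
 * shared_buffers rather than with the amount of dirty data.
 */
static int
count_buffers_to_checkpoint(void)
{
    int     num_to_scan = 0;

    for (int buf_id = 0; buf_id < NBuffers; buf_id++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buf_id);

        /* the real code takes the buffer-header lock and marks dirty
         * buffers with BM_CHECKPOINT_NEEDED */
        if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY)
            num_to_scan++;
    }
    return num_to_scan;
}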
Re: Compress ReorderBuffer spill files using LZ4
On Fri, Jun 7, 2024 at 2:39 PM Alvaro Herrera wrote: > > On 2024-Jun-07, Dilip Kumar wrote: > > > I think the compression option should be supported at the CREATE > > SUBSCRIPTION level instead of being controlled by a GUC. This way, we > > can decide on compression for each subscription individually rather > > than applying it to all subscribers. It makes more sense for the > > subscriber to control this, especially when we are planning to > > compress the data sent downstream. > > True. (I think we have some options that are in GUCs for the general > behavior and can be overridden by per-subscription options for specific > tailoring; would that make sense here? I think it does, considering > that what we mostly want is to save disk space in the publisher when > spilling to disk.) Yeah, that makes sense. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 7:54 PM Alvaro Herrera wrote: > > On 2024-Jun-06, Amit Kapila wrote: > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > > > When the content of a large transaction (size exceeding > > > logical_decoding_work_mem) and its sub-transactions has to be > > > reordered during logical decoding, then, all the changes are written > > > on disk in temporary files located in pg_replslot/. > > > Decoding very large transactions by multiple replication slots can > > > lead to disk space saturation and high I/O utilization. > > I like the general idea of compressing the output of logical decoding. > It's not so clear to me that we only want to do so for spilling to disk; > for instance, if the two nodes communicate over a slow network, it may > even be beneficial to compress when streaming, so to this question: > > > Why can't one use 'streaming' option to send changes to the client > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > I would say that streaming doesn't necessarily have to mean we don't > want compression, because for some users it might be beneficial. +1 > I think a GUC would be a good idea. Also, what if for whatever reason > you want a different compression algorithm or different compression > parameters? Looking at the existing compression UI we offer in > pg_basebackup, perhaps you could add something like this: > > compress_logical_decoding = none > compress_logical_decoding = lz4:42 > compress_logical_decoding = spill-zstd:99 > > "none" says to never use compression (perhaps should be the default), > "lz4:42" says to use lz4 with parameters 42 on both spilling and > streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but > only for spilling to disk. > I think the compression option should be supported at the CREATE SUBSCRIPTION level instead of being controlled by a GUC. This way, we can decide on compression for each subscription individually rather than applying it to all subscribers. It makes more sense for the subscriber to control this, especially when we are planning to compress the data sent downstream. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
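If a setting along those lines were adopted, the value would parse into a small struct; a self-contained C sketch of that parsing (the format itself is only the strawman suggested above):

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct CompressSetting
{
    bool    spill_only;     /* "spill-" prefix: compress only when spilling */
    char    algo[16];       /* "none", "lz4", "zstd", ... */
    int     level;          /* value after ':', or -1 if absent */
} CompressSetting;

static bool
parse_compress_setting(const char *raw, CompressSetting *out)
{
    const char *p = raw;
    const char *colon;
    size_t      alen;

    out->spill_only = (strncmp(p, "spill-", 6) == 0);
    if (out->spill_only)
        p += 6;

    colon = strchr(p, ':');
    out->level = colon ? atoi(colon + 1) : -1;

    alen = colon ? (size_t) (colon - p) : strlen(p);
    if (alen == 0 || alen >= sizeof(out->algo))
        return false;
    memcpy(out->algo, p, alen);
    out->algo[alen] = '\0';
    return true;
}

So "spill-zstd:99" would yield spill_only=true, algo="zstd", level=99, while plain "none" yields spill_only=false, algo="none", level=-1.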
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent wrote: > > On Fri, 7 Jun 2024 at 07:18, Dilip Kumar wrote: > > > > On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent > > wrote: > >> > >> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela wrote: > >>> > >>> Why not use it too, if not binary_upgrade? > >> > >> Because in the normal case (not during binary_upgrade) you don't want > >> to have to generate 2 checkpoints for every created database, > >> especially not when your shared buffers are large. Checkpoints' costs > >> scale approximately linearly with the size of shared buffers, so being > >> able to skip those checkpoints (with strategy=WAL_LOG) will save a lot > >> of performance in the systems where this performance impact matters > >> most. > > > > I agree with you that we introduced the WAL_LOG strategy to avoid > > these force checkpoints. However, in binary upgrade cases where no > > operations are happening in the system, the FILE_COPY strategy should > > be faster. > > While you would be correct if there were no operations happening in > the system, during binary upgrade we're still actively modifying > catalogs; and this is done with potentially many concurrent jobs. I > think it's not unlikely that this would impact performance. Maybe, but generally, long checkpoints are problematic because they involve a lot of I/O, which hampers overall system performance. However, in the case of a binary upgrade, the concurrent operations are only performing a schema restore, not a real data restore. Therefore, it shouldn't have a significant impact, and the checkpoints should also not do a lot of I/O during binary upgrade, right? > Now that I think about it, arguably, we shouldn't need to run > checkpoints during binary upgrade for the FILE_COPY strategy after > we've restored the template1 database and created a checkpoint after > that: All other databases use template1 as their template database, > and the checkpoint is there mostly to guarantee the FS knows about all > changes in the template database before we task it with copying the > template database over to our new database, so the protections we get > from more checkpoints are practically useless. > If such a change were implemented (i.e. no checkpoints for FILE_COPY > in binary upgrade, with a single manual checkpoint after restoring > template1 in create_new_objects) I think most of my concerns with this > patch would be alleviated. Yeah, I think that's a valid point. The second checkpoint is to ensure that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for binary upgrades, we don't need that guarantee because a checkpoint will be performed during shutdown at the end of the upgrade anyway. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Vacuum statistics
On Thu, May 30, 2024 at 11:57 PM Alena Rybakina wrote: > > On 30.05.2024 10:33, Alena Rybakina wrote: > > > > I suggest gathering information about vacuum resource consumption for > > processing indexes and tables and storing it in the table and index > > relationships (for example, PgStat_StatTabEntry structure like it has > > realized for usual statistics). It will allow us to determine how well > > the vacuum is configured and evaluate the effect of overhead on the > > system at the strategic level, the vacuum has gathered this > > information already, but this valuable information doesn't store it. > > > My colleagues and I have prepared a patch that can help to solve this > problem. > > We are open to feedback.

I was reading through the patch; here are some initial comments.

--
+typedef struct LVExtStatCounters
+{
+       TimestampTz time;
+       PGRUsage ru;
+       WalUsage walusage;
+       BufferUsage bufusage;
+       int64 VacuumPageMiss;
+       int64 VacuumPageHit;
+       int64 VacuumPageDirty;
+       double VacuumDelayTime;
+       PgStat_Counter blocks_fetched;
+       PgStat_Counter blocks_hit;
+} LVExtStatCounters;

I noticed that you are storing both pgBufferUsage and VacuumPage(Hit/Miss/Dirty) stats. Aren't these essentially the same? It seems they both exist in the system because some code, like heap_vacuum_rel(), uses pgBufferUsage, while do_analyze_rel() still relies on the old counters. And there is already a patch to remove those old counters.

--
+static Datum
+pg_stats_vacuum(FunctionCallInfo fcinfo, ExtVacReportType type, int ncolumns)
+{

I don't think you need this last parameter (ncolumns); we can anyway fetch that from the tupledesc, so adding an additional parameter just for checking doesn't look good to me.

--
+       /* Tricky turn here: enforce pgstat to think that our database us dbid */
+
+       MyDatabaseId = dbid;

typo
/think that our database us dbid/think that our database has dbid

Also, remove the blank line between the comment and the next code block that is related to that comment.

--
        VacuumPageDirty = 0;
+       VacuumDelayTime = 0.;

There is an extra "." after 0

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
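On the ncolumns point, a sketch of what is being suggested (get_call_result_type() is the real fmgr helper; the body is otherwise a stub, not the patch's code):

static Datum
pg_stats_vacuum(FunctionCallInfo fcinfo, ExtVacReportType type)
{
    TupleDesc   tupdesc;
    int         ncolumns;

    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "return type must be a row type");

    /* the column count comes from the declared result type */
    ncolumns = tupdesc->natts;

    /* ... build and return the result tuple using tupdesc/ncolumns ... */
    return (Datum) 0;
}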
Re: How about using dirty snapshots to locate dependent objects?
On Fri, Jun 7, 2024 at 11:53 AM Ashutosh Sharma wrote: > > On Fri, Jun 7, 2024 at 10:06 AM Dilip Kumar wrote: > > > > On Thu, Jun 6, 2024 at 7:39 PM Ashutosh Sharma > > wrote: > > > > > > On Thu, Jun 6, 2024 at 6:20 PM Dilip Kumar wrote: > > >> > > >> On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma > > >> wrote: > > >> > > > >> > Hello everyone, > > >> > > > >> > At present, we use MVCC snapshots to identify dependent objects. This > > >> > implies that if a new dependent object is inserted within a > > >> > transaction that is still ongoing, our search for dependent objects > > >> > won't include this recently added one. Consequently, if someone > > >> > attempts to drop the referenced object, it will be dropped, and when > > >> > the ongoing transaction completes, we will end up having an entry for > > >> > a referenced object that has already been dropped. This situation can > > >> > lead to an inconsistent state. Below is an example illustrating this > > >> > scenario: > > >> > > >> I don't think it's correct to allow the index to be dropped while a > > >> transaction is creating it. Instead, the right solution should be for > > >> the create index operation to protect the object it is using from > > >> being dropped. Specifically, the create index operation should acquire > > >> a shared lock on the Access Method (AM) to ensure it doesn't get > > >> dropped concurrently while the transaction is still in progress. > > > > > > > > > If I'm following you correctly, that's exactly what the patch is trying > > > to do; while the index creation is in progress, if someone tries to drop > > > the object referenced by the index under creation, the referenced object > > > being dropped is able to know about the dependent object (in this case > > > the index being created) using dirty snapshot and hence, it is unable to > > > acquire the lock on the dependent object, and as a result of that, it is > > > unable to drop it. > > > > You are aiming for the same outcome, but not in the conventional way. > > In my opinion, the correct approach is not to find objects being > > created using a dirty snapshot. Instead, when creating an object, you > > should acquire a proper lock on any dependent objects to prevent them > > from being dropped during the creation process. For instance, when > > creating an index that depends on the btree_gist access method, the > > create index operation should protect btree_gist from being dropped by > > acquiring the appropriate lock. It is not the responsibility of the > > drop extension to identify in-progress index creations. > > Thanks for sharing your thoughts, I appreciate your inputs and > completely understand your perspective, but I wonder if that is > feasible? For example, if an object (index in this case) has > dependency on lets say 'n' number of objects, and those 'n' number of > objects belong to say 'n' different catalog tables, so should we > acquire locks on each of them until the create index command succeeds, > or, should we just check for the presence of dependent objects and > record their dependency inside the pg_depend table. Talking about this > particular case, we are trying to create gist index that has > dependency on gist_int4 opclass, it is one of the tuple inside > pg_opclass catalog table, so should acquire lock in this tuple/table > until the create index command succeeds and is that the thing to be > done for all the dependent objects? 
I am not sure what the best way to do it is, but if you are creating an object which depends on other objects then you need to check the existence of those objects, record dependencies on them, and also lock them so that they don't get dropped while you are creating your object (a rough sketch follows below). I haven't looked into the patch, but something similar is achieved in the thread Bertrand has pointed out, by locking the database object while recording the dependency on it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
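As a rough sketch of that idea (my illustration under stated assumptions, not the actual patch): when creating an index that uses a given operator class, the creation path could lock the pg_opclass entry, re-check that it still exists, and only then record the dependency. The helper name is hypothetical; the individual calls (LockDatabaseObject, SearchSysCacheExists1, recordDependencyOn) are existing backend functions.

#include "postgres.h"
#include "catalog/dependency.h"
#include "catalog/objectaddress.h"
#include "catalog/pg_opclass.h"
#include "storage/lmgr.h"
#include "utils/syscache.h"

/* hypothetical helper: pin down a referenced opclass before depending on it */
static void
lock_and_record_opclass_dependency(const ObjectAddress *index, Oid opclassOid)
{
	ObjectAddress referenced;

	/* Block a concurrent DROP of the opclass while the index is created. */
	LockDatabaseObject(OperatorClassRelationId, opclassOid, 0, AccessShareLock);

	/* It may have been dropped before we got the lock, so re-check. */
	if (!SearchSysCacheExists1(CLAOID, ObjectIdGetDatum(opclassOid)))
		ereport(ERROR,
				(errcode(ERRCODE_UNDEFINED_OBJECT),
				 errmsg("operator class %u does not exist", opclassOid)));

	ObjectAddressSet(referenced, OperatorClassRelationId, opclassOid);
	recordDependencyOn(index, &referenced, DEPENDENCY_NORMAL);
}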
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent wrote: > > On Wed, 5 Jun 2024 at 18:47, Ranier Vilela wrote: > > > > On Tue, Jun 4, 2024 at 16:39, Nathan Bossart > > wrote: > >> > >> I noticed that the "Restoring database schemas in the new cluster" part of > >> pg_upgrade can take a while if you have many databases, so I experimented > >> with a couple different settings to see if there are any easy ways to speed > >> it up. The FILE_COPY strategy for CREATE DATABASE helped quite > >> significantly on my laptop. For ~3k empty databases, this step went from > >> ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41 > >> made a similar change for initdb, so there might even be an argument for > >> back-patching this to v15 (where STRATEGY was introduced). One thing I > >> still need to verify is that this doesn't harm anything when there are lots > >> of objects in the databases, i.e., more WAL generated during many > >> concurrent CREATE-DATABASE-induced checkpoints. > >> > >> Thoughts? > > > > Why not use it too, if not binary_upgrade? > > Because in the normal case (not during binary_upgrade) you don't want > to have to generate 2 checkpoints for every created database, > especially not when your shared buffers are large. Checkpoints' costs > scale approximately linearly with the size of shared buffers, so being > able to skip those checkpoints (with strategy=WAL_LOG) will save a lot > of performance in the systems where this performance impact matters > most. I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
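For context, the change being discussed amounts to something like the following sketch on the side that builds the CREATE DATABASE command (my illustration; the buffer and variable names are placeholders and not necessarily what the attached patch does):

#include <stdbool.h>
#include "pqexpbuffer.h"

/* hypothetical sketch: pick the copy strategy when emitting CREATE DATABASE */
static void
append_create_database_command(PQExpBuffer creaQry, const char *qdatname,
							   bool binary_upgrade)
{
	appendPQExpBuffer(creaQry, "CREATE DATABASE %s", qdatname);

	/*
	 * During pg_upgrade nothing else runs concurrently, so the extra
	 * checkpoints forced by FILE_COPY are cheap, and skipping the
	 * block-by-block WAL_LOG copy makes creation faster.
	 */
	if (binary_upgrade)
		appendPQExpBufferStr(creaQry, " STRATEGY = FILE_COPY");
}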
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 7:39 PM Ashutosh Sharma wrote: > > On Thu, Jun 6, 2024 at 6:20 PM Dilip Kumar wrote: >> >> On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote: >> > >> > Hello everyone, >> > >> > At present, we use MVCC snapshots to identify dependent objects. This >> > implies that if a new dependent object is inserted within a transaction >> > that is still ongoing, our search for dependent objects won't include this >> > recently added one. Consequently, if someone attempts to drop the >> > referenced object, it will be dropped, and when the ongoing transaction >> > completes, we will end up having an entry for a referenced object that has >> > already been dropped. This situation can lead to an inconsistent state. >> > Below is an example illustrating this scenario: >> >> I don't think it's correct to allow the index to be dropped while a >> transaction is creating it. Instead, the right solution should be for >> the create index operation to protect the object it is using from >> being dropped. Specifically, the create index operation should acquire >> a shared lock on the Access Method (AM) to ensure it doesn't get >> dropped concurrently while the transaction is still in progress. > > > If I'm following you correctly, that's exactly what the patch is trying to > do; while the index creation is in progress, if someone tries to drop the > object referenced by the index under creation, the referenced object being > dropped is able to know about the dependent object (in this case the index > being created) using dirty snapshot and hence, it is unable to acquire the > lock on the dependent object, and as a result of that, it is unable to drop > it. You are aiming for the same outcome, but not in the conventional way. In my opinion, the correct approach is not to find objects being created using a dirty snapshot. Instead, when creating an object, you should acquire a proper lock on any dependent objects to prevent them from being dropped during the creation process. For instance, when creating an index that depends on the btree_gist access method, the create index operation should protect btree_gist from being dropped by acquiring the appropriate lock. It is not the responsibility of the drop extension to identify in-progress index creations. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote: > > Hello everyone, > > At present, we use MVCC snapshots to identify dependent objects. This implies > that if a new dependent object is inserted within a transaction that is still > ongoing, our search for dependent objects won't include this recently added > one. Consequently, if someone attempts to drop the referenced object, it will > be dropped, and when the ongoing transaction completes, we will end up having > an entry for a referenced object that has already been dropped. This > situation can lead to an inconsistent state. Below is an example illustrating > this scenario: I don't think it's correct to allow the index to be dropped while a transaction is creating it. Instead, the right solution should be for the create index operation to protect the object it is using from being dropped. Specifically, the create index operation should acquire a shared lock on the Access Method (AM) to ensure it doesn't get dropped concurrently while the transaction is still in progress. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 9:34 AM Amit Kapila wrote: > > On Wed, Jun 5, 2024 at 3:17 PM Bharath Rupireddy > wrote: > > > > On Tue, Jun 4, 2024 at 5:40 PM Amit Kapila wrote: > > > > > > Even if we decode it periodically (say each time we decode the > > > checkpoint record) then also we need to send the entire set of > > > sequences at shutdown. This is because the sequences may have changed > > > from the last time we sent them. > > > > Agree. How about decoding and sending only the sequences that are > > changed from the last time when they were sent? I know it requires a > > bit of tracking and more work, but all I'm looking for is to reduce > > the amount of work that walsenders need to do during the shutdown. > > > > I see your point but going towards tracking the changed sequences > sounds like moving towards what we do for incremental backups unless > we can invent some other smart way. Yes, we would need an entirely new infrastructure to track the sequence change since the last sync. We can only determine this from WAL, and relying on it would somehow bring us back to the approach we were trying to achieve with logical decoding of sequences patch. > > Having said that, I like the idea of letting the user sync the > > sequences via ALTER SUBSCRIPTION command and not weave the logic into > > the shutdown checkpoint path. As Peter Eisentraut said here > > https://www.postgresql.org/message-id/42e5cb35-4aeb-4f58-8091-90619c7c3ecc%40eisentraut.org, > > this can be a good starting point to get going. > > > > Agreed. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 4:43 PM Amit Kapila wrote: > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > When the content of a large transaction (size exceeding > > logical_decoding_work_mem) and its sub-transactions has to be > > reordered during logical decoding, then, all the changes are written > > on disk in temporary files located in pg_replslot/. > > Decoding very large transactions by multiple replication slots can > > lead to disk space saturation and high I/O utilization. > > > > Why can't one use 'streaming' option to send changes to the client > once it reaches the configured limit of 'logical_decoding_work_mem'? > > > > > 2. Do we want a GUC to switch compression on/off? > > > > It depends on the overhead of decoding. Did you try to measure the > decoding overhead of decompression when reading compressed files? I think it depends on the trade-off between the I/O savings from reducing the data size and the performance cost of compressing and decompressing the data. This balance is highly dependent on the hardware. For example, if you have a very slow disk and a powerful processor, compression could be advantageous. Conversely, if the disk is very fast, the I/O savings might be minimal, and the compression overhead could outweigh the benefits. Additionally, the effectiveness of compression also depends on the compression ratio, which varies with the type of data being compressed. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
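To make the compression side of that trade-off concrete, here is a small standalone sketch (my illustration, assuming liblz4 is available; not code from the proposed patch) showing the LZ4 default compressor whose CPU cost would be paid on every spill:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lz4.h>

int
main(void)
{
	const char *src = "change data change data change data change data";
	int			src_len = (int) strlen(src) + 1;
	int			max_len = LZ4_compressBound(src_len);
	char	   *dst = malloc(max_len);

	/*
	 * Fast, modest-ratio compression: this CPU cost must be outweighed by
	 * writing fewer bytes to pg_replslot/ for the patch to be a win.
	 */
	int			comp_len = LZ4_compress_default(src, dst, src_len, max_len);

	if (comp_len <= 0)
	{
		fprintf(stderr, "compression failed\n");
		return 1;
	}
	printf("compressed %d bytes to %d bytes\n", src_len, comp_len);
	free(dst);
	return 0;
}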
Re: Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel
On Thu, Jun 6, 2024 at 3:23 PM Anthonin Bonnefoy wrote: > > Hi, > > I sent a similar patch for this in > https://www.postgresql.org/message-id/flat/cao6_xqr__kttclkftqs0qscm-j7_xbrg3ge2rwhucxqjmjh...@mail.gmail.com Okay, I see. In that case, we can just discard mine; thanks for notifying me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Thu, Jun 6, 2024 at 3:43 PM Amit Kapila wrote: > > On Wed, Jun 5, 2024 at 7:29 PM Dilip Kumar wrote: > > > > On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote: > > > > > > Can you share the use case of "earliest_timestamp_wins" resolution > > > method? It seems after the initial update on the local node, it will > > > never allow remote update to succeed which sounds a bit odd. Jan has > > > shared this and similar concerns about this resolution method, so I > > > have added him to the email as well. > > > > > I can not think of a use case exactly in this context but it's very > > common to have such a use case while designing a distributed > > application with multiple clients. For example, when we are doing git > > push concurrently from multiple clients it is expected that the > > earliest commit wins. > > > > Okay, I think it mostly boils down to something like what Shveta > mentioned where Inserts for a primary key can use > "earliest_timestamp_wins" resolution method [1]. So, it seems useful > to support this method as well. Correct, but we still need to think about how to make it work correctly in the presence of clock skew, as I mentioned in one of my previous emails. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel
As part of commit 5cd72cc0c5017a9d4de8b5d465a75946da5abd1d, the dependency on global counters such as VacuumPage(Hit/Miss/Dirty) was removed from the vacuum. However, do_analyze_rel() was still using these counters, necessitating the tracking of global counters alongside BufferUsage counters. The attached patch addresses the issue by eliminating the need to track VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel(), making the global counters obsolete. This simplifies the code and improves consistency. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v1-0001-Remove-duplicate-tracking-of-the-page-stats-durin.patch Description: Binary data
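For reference, here is a sketch of the delta-based pattern the patch moves do_analyze_rel() to, matching what heap_vacuum_rel() already does: snapshot pgBufferUsage at the start and report the difference at the end, with no global VacuumPage(Hit/Miss/Dirty) counters involved. This is an illustrative fragment, not the literal patch hunk:

#include "postgres.h"
#include "executor/instrument.h"

/* illustrative only: accounting skeleton around the actual ANALYZE work */
static void
analyze_with_buffer_accounting(void)
{
	BufferUsage startbufferusage = pgBufferUsage;
	BufferUsage bufferusage;

	/* ... the actual sampling and statistics computation happens here ... */

	memset(&bufferusage, 0, sizeof(BufferUsage));
	BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);

	elog(LOG, "ANALYZE buffer usage: " INT64_FORMAT " hits, " INT64_FORMAT
		 " reads, " INT64_FORMAT " dirtied",
		 bufferusage.shared_blks_hit,
		 bufferusage.shared_blks_read,
		 bufferusage.shared_blks_dirtied);
}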
Re: Conflict Detection and Resolution
On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote: > > Can you share the use case of "earliest_timestamp_wins" resolution > method? It seems after the initial update on the local node, it will > never allow remote update to succeed which sounds a bit odd. Jan has > shared this and similar concerns about this resolution method, so I > have added him to the email as well. > I cannot think of a use case exactly in this context, but it's very common to have such a use case while designing a distributed application with multiple clients. For example, when we are doing git push concurrently from multiple clients, it is expected that the earliest commit wins. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jun 5, 2024 at 9:12 AM shveta malik wrote: > > On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote: > > > > > > > > > > > > > > > > Conflict Resolution > > > > > > > > > > a) latest_timestamp_wins:The change with later commit timestamp > > > > > wins. > > > > > b) earliest_timestamp_wins: The change with earlier commit > > > > > timestamp wins. > > > > Can you share the use case of "earliest_timestamp_wins" resolution > > method? It seems after the initial update on the local node, it will > > never allow remote update to succeed which sounds a bit odd. Jan has > > shared this and similar concerns about this resolution method, so I > > have added him to the email as well. > > I do not have the exact scenario for this. But I feel, if 2 nodes are > concurrently inserting different data against a primary key, then some > users may have preferences that retain the row which was inserted > earlier. It is no different from latest_timestamp_wins. It totally > depends upon what kind of application and requirement the user may > have, based on which, he may discard the later coming rows (specially > for INSERT case). I haven't read the complete design yet, but have we discussed how we plan to deal with clock drift if we use timestamp-based conflict resolution? For example, a user might insert something conflicting on node1 first and then on node2. However, due to clock drift, the timestamp from node2 might appear earlier. In this case, if we choose "earliest timestamp wins," we would keep the changes from node2. I haven't fully considered if this would cause any problems, but users might detect this issue. For instance, a client machine might send a change to node1 first and then, upon confirmation, send it to node2. If the clocks on node1 and node2 are not synchronized, the changes might appear in a different order. Does this seem like a potential problem? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: New committers: Melanie Plageman, Richard Guo
On Fri, Apr 26, 2024 at 5:24 PM Jonathan S. Katz wrote: > > The Core Team would like to extend our congratulations to Melanie > Plageman and Richard Guo, who have accepted invitations to become our > newest PostgreSQL committers. > > Please join us in wishing them much success and few reverts! Congratulations to both of you. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Why don't we support external input/output functions for the composite types
On Thu, Apr 25, 2024 at 10:14 AM Tom Lane wrote: > > Dilip Kumar writes: > > I'm curious about composite types in PostgreSQL. By default, when we > > create a composite type, it utilizes the "record_in" and "record_out" > > functions for input/output. Do you think it would be beneficial to > > expand the syntax to allow users to specify custom input/output > > functions when creating composite types? > > No. > > > I believe it would be beneficial because users creating a new type > > might prefer to define specific input/output syntax rather than > > conforming to what is accepted by the RECORD type. > Thanks for the quick response, Tom. > The primary outcome would be to require a huge amount of new work > to be done by a lot of software, much of it not under our control. Yeah, I agree with that. > And the impact wouldn't only be to software that would prefer not > to know about this. For example, how likely do you think it is > that these hypothetical user-defined I/O functions would cope > well with ALTER TABLE/ALTER TYPE commands that change those > rowtypes? That's a good point. I was primarily focused on altering the representation of input and output values, rather than considering changes to internal storage. However, offering this feature could indeed allow users to influence how values are stored. And that can potentially affect ALTER TYPE because then we do not have control over how those values are stored internally. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Why don't we support external input/output functions for the composite types
Hi, I'm curious about composite types in PostgreSQL. By default, when we create a composite type, it utilizes the "record_in" and "record_out" functions for input/output. Do you think it would be beneficial to expand the syntax to allow users to specify custom input/output functions when creating composite types? Has anyone attempted this before, and are there any design challenges associated with it? Or is it simply not implemented because it's not seen as a valuable addition? I believe it would be beneficial because users creating a new type might prefer to define specific input/output syntax rather than conforming to what is accepted by the RECORD type. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Apr 3, 2024 at 7:40 PM Alvaro Herrera wrote: > > Hello, > > On 2024-Apr-03, Alexander Lakhin wrote: > > > I've managed to trigger an assert added by 53c2a97a9. > > Please try the following script against a server compiled with > > -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the > > define, it just simplifies reproducing...): > > Ah yes, absolutely, we're missing to trade the correct SLRU bank lock > there. This rewrite of that small piece should fix it. Thanks for > reporting this. > Yeah, we missed acquiring the bank lock w.r.t. intervening pages, thanks for reporting. Your fix looks correct to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Mar 14, 2024 at 4:07 PM Heikki Linnakangas wrote: > > > Yeah, that's a very valid point. So I think now Heikki/Melanie might > > have got an answer to their question, about the thought process behind > > serializing the snapshot for each scan node. And the same thing is > > followed for BitmapHeapNode as well. > > I see. Thanks, understanding the thought process helps. > > So when a parallel table or index scan runs in the executor as part of a > query, we could just use the active snapshot. But there are some other > callers of parallel table scans that don't use the executor, namely > parallel index builds. For those it makes sense to pass the snapshot for > the scan independent of the active snapshot. Right > A parallel bitmap heap scan isn't really a parallel scan as far as the > table AM is concerned, though. It's more like an independent bitmap heap > scan in each worker process, nodeBitmapHeapscan.c does all the > coordination of which blocks to scan. So I think that > table_parallelscan_initialize() was the wrong role model, and we should > still remove the snapshot serialization code from nodeBitmapHeapscan.c. I think that seems right. > Digging deeper into the question of whether es_snapshot == > GetActiveSnapshot() is a valid assumption: > > > > es_snapshot is copied from the QueryDesc in standard_ExecutorStart(). > Looking at the callers of ExecutorStart(), they all get the QueryDesc by > calling CreateQueryDesc() with GetActiveSnapshot(). And I don't see any > callers changing the active snapshot between the ExecutorStart() and > ExecutorRun() calls either. In pquery.c, we explicitly > PushActiveSnapshot(queryDesc->snapshot) before calling ExecutorRun(). So > no live bug here AFAICS, es_snapshot == GetActiveSnapshot() holds. > > _SPI_execute_plan() has code to deal with the possibility that the > active snapshot is not set. That seems fishy; do we really support SPI > without any snapshot? I'm inclined to turn that into an error. I ran the > regression tests with an "Assert(ActiveSnapshotSet())" there, and > everything worked. IMHO, we can call SPI_connect() and SPI_execute() from any C extension, so I don't think we can guarantee that a snapshot is always set, can we? > If es_snapshot was different from the active snapshot, things would get > weird, even without parallel query. The scans would use es_snapshot for > the visibility checks, but any functions you execute in quals would use > the active snapshot. > > We could double down on that assumption, and remove es_snapshot > altogether and use GetActiveSnapshot() instead. And perhaps add > "PushActiveSnapshot(queryDesc->snapshot)" to ExecutorRun(). > > > > In summary, this es_snapshot stuff is a bit confusing and could use some > cleanup. But for now, I'd like to just add some assertions and a > comments about this, and remove the snapshot serialization from bitmap > heap scan node, to make it consistent with other non-parallel scan nodes > (it's not really a parallel scan as far as the table AM is concerned). > See attached patch, which is the same as previous patch with some extra > assertions. Maybe for now we can just handle this specific case and remove the snapshot serialization for the BitmapHeapScan, as you are doing in the patch. After looking into the code, your theory seems correct: we just copy the active snapshot while building the query descriptor, and from there we copy it into the EState, so logically there should not be any reason for these two to be different (a sketch of the pquery.c pattern follows below).
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
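For reference, the pquery.c pattern mentioned above looks roughly like this (a sketch; the exact ExecutorRun() argument list varies across versions, so treat the call as schematic):

#include "postgres.h"
#include "executor/executor.h"
#include "utils/snapmgr.h"

/* sketch: keep es_snapshot and the active snapshot in agreement for the run */
static void
run_query_with_snapshot(QueryDesc *queryDesc)
{
	PushActiveSnapshot(queryDesc->snapshot);
	ExecutorRun(queryDesc, ForwardScanDirection, 0 /* no row-count limit */, true);
	PopActiveSnapshot();
}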
Re: BitmapHeapScan streaming read user and prelim refactoring
On Wed, Mar 13, 2024 at 9:25 PM Robert Haas wrote: > > On Wed, Mar 13, 2024 at 11:39 AM Dilip Kumar wrote: > > > Andres already commented on the snapshot stuff on an earlier patch > > > version, and that's much nicer with this version. However, I don't > > > understand why a parallel bitmap heap scan needs to do anything at all > > > with the snapshot, even before these patches. The parallel worker > > > infrastructure already passes the active snapshot from the leader to the > > > parallel worker. Why does bitmap heap scan code need to do that too? > > > > Yeah thinking on this now it seems you are right that the parallel > > infrastructure is already passing the active snapshot so why do we > > need it again. Then I checked other low scan nodes like indexscan and > > seqscan and it seems we are doing the same things there as well. > > Check for SerializeSnapshot() in table_parallelscan_initialize() and > > index_parallelscan_initialize() which are being called from > > ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM() > > respectively. > > I remember thinking about this when I was writing very early parallel > query code. It seemed to me that there must be some reason why the > EState has a snapshot, as opposed to just using the active snapshot, > and so I took care to propagate that snapshot, which is used for the > leader's scans, to the worker scans also. Now, if the EState doesn't > need to contain a snapshot, then all of that mechanism is unnecessary, > but I don't see how it can be right for the leader to do > table_beginscan() using estate->es_snapshot and the worker to use the > active snapshot. Yeah, that's a very valid point. So I think now Heikki/Melanie might have got an answer to their question, about the thought process behind serializing the snapshot for each scan node. And the same thing is followed for BitmapHeapNode as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On Wed, Mar 13, 2024 at 7:04 PM Heikki Linnakangas wrote: > > (Adding Dilip, the original author of the parallel bitmap heap scan > patch all those years ago, in case you remember anything about the > snapshot stuff below.) > > On 27/02/2024 16:22, Melanie Plageman wrote: > Andres already commented on the snapshot stuff on an earlier patch > version, and that's much nicer with this version. However, I don't > understand why a parallel bitmap heap scan needs to do anything at all > with the snapshot, even before these patches. The parallel worker > infrastructure already passes the active snapshot from the leader to the > parallel worker. Why does bitmap heap scan code need to do that too? Yeah, thinking about this now, it seems you are right that the parallel infrastructure is already passing the active snapshot, so why do we need it again? Then I checked other low-level scan nodes like indexscan and seqscan, and it seems we are doing the same thing there as well. Check for SerializeSnapshot() in table_parallelscan_initialize() and index_parallelscan_initialize(), which are being called from ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM() respectively. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Streaming I/O, vectored I/O (WIP)
On Tue, Mar 12, 2024 at 12:10 PM Thomas Munro wrote: > > I think you'd be right if StartReadBuffers() were capable of > processing a sequence consisting of a hit followed by misses, but > currently it always gives up after the first hit. That is, it always > processes some number of misses (0-16) and then at most one hit. So > for now the variable would always turn out to be the same as blockNum. > Okay, then shouldn't this "if (found)" block immediately break the loop, so that when we hit a block we just return it? What you explained makes sense, but with the current code, if the first few blocks are hits followed by misses, then we will issue smgrprefetch() for the initial hit blocks as well.

+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;    (Dilip: I think we should break after this?)
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			operation->io_buffers_len++;
+		}
+	}

> The reason is that I wanted to allows "full sized" read system calls > to form. If you said "hey please read these 16 blocks" (I'm calling > that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits, > then it could only form a read of 14 blocks, but there might be more > blocks that could be read after those. We would have some arbitrary > shorter read system calls, when we wanted to make them all as big as > possible. So in the current patch you say "hey please read these 16 > blocks" and it returns saying "only read 1", you call again with 15 > and it says "only read 1", and you call again and says "read 16!" > (assuming 2 more were readable after the original range we started > with). Then physical reads are maximised. Maybe there is some nice > way to solve that, but I thought this way was the simplest (and if > there is some instruction-cache-locality/tight-loop/perf reason why we > should work harder to find ranges of hits, it could be for later). > Does that make sense? Understood, I think this makes sense. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Streaming I/O, vectored I/O (WIP)
On Sat, Mar 9, 2024 at 3:55 AM Thomas Munro wrote: > Hi Thomas, I am planning to review this patch set, so I started going through 0001. I have a question about how we are issuing smgrprefetch() in StartReadBuffers():

+	if (operation->io_buffers_len > 0)
+	{
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
+			/*
+			 * In theory we should only do this if PrepareReadBuffers() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}

This always issues smgrprefetch() starting with the input blockNum; shouldn't we pass the first blockNum that we did not find in the buffer pool? So basically, in the loop above this call where we are doing PrepareReadBuffer(), we should track the first blockNum for which 'found' is not true and pass that blockNum to smgrprefetch() as the first block, right? (A sketch follows below.) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
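In code form, the suggestion is roughly the following (a sketch against the WIP patch: PrepareReadBuffer(), READ_BUFFERS_ISSUE_ADVICE, and the surrounding variable names come from that patch and may well differ in later versions, so treat this purely as an illustration):

	BlockNumber first_miss = InvalidBlockNumber;

	for (int i = 0; i < actual_nblocks; i++)
	{
		bool		found;

		buffers[i] = PrepareReadBuffer(bmr, forkNum, blockNum + i,
									   strategy, &found);

		/* remember the first block that was NOT already in shared buffers */
		if (!found && first_miss == InvalidBlockNumber)
			first_miss = blockNum + i;
	}

	if ((flags & READ_BUFFERS_ISSUE_ADVICE) && first_miss != InvalidBlockNumber)
		smgrprefetch(bmr.smgr, forkNum, first_miss,
					 (int) (blockNum + operation->io_buffers_len - first_miss));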
Re: PostgreSQL Contributors Updates
On Sun, Mar 3, 2024 at 9:28 PM Joe Conway wrote: > > All, > > The PostgreSQL Contributor Page > (https://www.postgresql.org/community/contributors/) includes people who > have made substantial, long-term contributions of time and effort to the > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > following people for their contributions. > > New PostgreSQL Contributors: > > * Bertrand Drouvot > * Gabriele Bartolini > * Richard Guo > > New PostgreSQL Major Contributors: > > * Alexander Lakhin > * Daniel Gustafsson > * Dean Rasheed > * John Naylor > * Melanie Plageman > * Nathan Bossart > > Thank you and congratulations to all! > Congratulations to all! -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: pgsql: Improve performance of subsystems on top of SLRU
On Mon, Mar 4, 2024 at 1:56 AM Alvaro Herrera wrote: > > On 2024-Feb-28, Alvaro Herrera wrote: > > > Improve performance of subsystems on top of SLRU > > Coverity had the following complaint about this commit: > > > *** CID NNN: Control flow issues (DEADCODE)
> /srv/coverity/git/pgsql-git/postgresql/src/backend/access/transam/multixact.c:
> 1375 in GetMultiXactIdMembers()
> 1369	 * and acquire the lock of the new bank.
> 1370	 */
> 1371	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
> 1372	if (lock != prevlock)
> 1373	{
> 1374		if (prevlock != NULL)
> >>> CID 1592913: Control flow issues (DEADCODE)
> >>> Execution cannot reach this statement: "LWLockRelease(prevlock);".
> 1375			LWLockRelease(prevlock);
> 1376		LWLockAcquire(lock, LW_EXCLUSIVE);
> 1377		prevlock = lock;
> 1378	}
> 1379
> 1380	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
>
> And I think it's correct that this is somewhat bogus, or at least > confusing: the only way to have control back here on line 1371 after > having executed once is via the "goto retry" line below; and there we > release "prevlock" and set it to NULL beforehand, so it's impossible for > prevlock to be NULL. Looking closer I think this code is all confused, > so I suggest to rework it as shown in the attached patch. > > I'll have a look at the other places where we use this "prevlock" coding > pattern tomorrow.

+	/* Acquire the bank lock for the page we need. */
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-	if (lock != prevlock)
-	{
-		if (prevlock != NULL)
-			LWLockRelease(prevlock);
-		LWLockAcquire(lock, LW_EXCLUSIVE);
-		prevlock = lock;
-	}
+	LWLockAcquire(lock, LW_EXCLUSIVE);

This part is definitely an improvement. I am not sure about the other changes; they make the code much simpler, but now we are not releasing the 'MultiXactOffsetCtl'-related bank lock, and later in the following loop we compare that lock against the 'MultiXactMemberCtl'-related bank lock. This makes the code simpler because now in the loop we are sure that we are always holding the lock, but I do not like comparing the bank locks of two different SLRUs, although there is no real problem since there would not be a common lock address. Anyway, I do not have any strong objection to what you have done here. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 27, 2024 at 11:41 PM Alvaro Herrera wrote: > On 2024-Feb-27, Alvaro Herrera wrote: > > > Here's the complete set, with these two names using the singular. > > BTW one thing I had not noticed is that before this patch we have > minimum shmem size that's lower than the lowest you can go with the new > code. > > This means Postgres may no longer start when extremely tight memory > restrictions (and of course use more memory even when idle or with small > databases). I wonder to what extent should we make an effort to relax > that. For small, largely inactive servers, this is just memory we use > for no good reason. However, anything we do here will impact > performance on the high end, because as Andrey says this will add > calculations and jumps where there are none today. > > I was just comparing the minimum memory required for SLRU when the system is minimally configured, correct me if I am wrong.

SLRU                         unpatched   patched
commit_timestamp_buffers          4         16
subtransaction_buffers           32         16
transaction_buffers               4         16
multixact_offset_buffers          8         16
multixact_member_buffers         16         16
notify_buffers                    8         16
serializable_buffers             16         16
------------------------------------------------
total buffers                    88        112

so that is < 200kB of extra memory on a minimally configured system, IMHO this should not matter. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Feb 26, 2024 at 9:46 PM Alvaro Herrera wrote: > On 2024-Feb-23, Dilip Kumar wrote: > > + > + For each SLRU area that's part of the core server, > + there is a configuration parameter that controls its size, with the > suffix > + _buffers appended. For historical > + reasons, the names are not exact matches, but Xact > + corresponds to transaction_buffers and the rest > should > + be obvious. > + > + > > I think I would like to suggest renaming the GUCs to have the _slru_ bit > in the middle: > > +# - SLRU Buffers (change requires restart) - > + > +#commit_timestamp_slru_buffers = 0 # memory for pg_commit_ts (0 > = auto) > +#multixact_offsets_slru_buffers = 16# memory for > pg_multixact/offsets > +#multixact_members_slru_buffers = 32# memory for > pg_multixact/members > +#notify_slru_buffers = 16 # memory for pg_notify > +#serializable_slru_buffers = 32 # memory for pg_serial > +#subtransaction_slru_buffers = 0# memory for pg_subtrans (0 = > auto) > +#transaction_slru_buffers = 0 # memory for pg_xact (0 = > auto) > > and the pgstat_internal.h table: > > static const char *const slru_names[] = { > "commit_timestamp", > "multixact_members", > "multixact_offsets", > "notify", > "serializable", > "subtransaction", > "transaction", > "other" /* has to be last > */ > }; > > This way they match perfectly. > Yeah, I think this looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Feb 23, 2024 at 1:06 PM Dilip Kumar wrote: > > On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera > wrote: > > > > On 2024-Feb-07, Dilip Kumar wrote: > > > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera > > > wrote: > > > > > > Sure, but is that really what we want? > > > > > > So your question is do we want these buffers to be in multiple of > > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > > don't think it should create any problem logically. I mean we can > > > look again in the patch to see if we have made any such assumptions > > > but that should be fairly easy to fix, then maybe if we are going in > > > this way we should get rid of the check_slru_buffers() function as > > > well. > > > > Not really, I just don't think the macro should be in slru.h. > > Okay > > > Another thing I've been thinking is that perhaps it would be useful to > > make the banks smaller, when the total number of buffers is small. For > > example, if you have 16 or 32 buffers, it's not really clear to me that > > it makes sense to have just 1 bank or 2 banks. It might be more > > sensible to have 4 banks with 4 or 8 buffers instead. That should make > > the algorithm scale down as well as up ... > > It might be helpful to have small-size banks when SLRU buffers are set > to a very low value and we are only accessing a couple of pages at a > time (i.e. no buffer replacement) because in such cases most of the > contention will be on SLRU Bank lock. Although I am not sure how > practical such a use case would be, I mean if someone is using > multi-xact very heavily or creating frequent subtransaction overflow > then wouldn't they should set this buffer limit to some big enough > value? By doing this we would lose some simplicity of the patch I > mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would > need to compute this and store it in SlruShared. Maybe that's not that > bad. > > > > > I haven't done either of those things in the attached v19 version. I > > did go over the comments once again and rewrote the parts I was unhappy > > with, including some existing ones. I think it's OK now from that point > > of view ... at some point I thought about creating a separate README, > > but in the end I thought it not necessary. > > Thanks, I will review those changes. Few other things I noticed while reading through the patch; I haven't read it completely yet, but this is what I have got so far.

1.
+ * If no process is already in the list, we're the leader; our first step
+ * is to "close out the group" by resetting the list pointer from
+ * ProcGlobal->clogGroupFirst (this lets other processes set up other
+ * groups later); then we lock the SLRU bank corresponding to our group's
+ * page, do the SLRU updates, release the SLRU bank lock, and wake up the
+ * sleeping processes.

I think here we are saying that we "close out the group" before acquiring the SLRU lock, but that's not true. We keep the group open until we get the lock, so that we can gather the maximum number of members while we are anyway waiting for the lock.

2.
 static void
 TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
 						 RepOriginId nodeid, int slotno)
 {
-	Assert(TransactionIdIsNormal(xid));
+	if (!TransactionIdIsNormal(xid))
+		return;
+
+	entryno = TransactionIdToCTsEntry(xid);

I do not understand why we need this change. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera > > wrote: > > > > Sure, but is that really what we want? > > > > So your question is do we want these buffers to be in multiple of > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > don't think it should create any problem logically. I mean we can > > look again in the patch to see if we have made any such assumptions > > but that should be fairly easy to fix, then maybe if we are going in > > this way we should get rid of the check_slru_buffers() function as > > well. > > Not really, I just don't think the macro should be in slru.h. Okay > Another thing I've been thinking is that perhaps it would be useful to > make the banks smaller, when the total number of buffers is small. For > example, if you have 16 or 32 buffers, it's not really clear to me that > it makes sense to have just 1 bank or 2 banks. It might be more > sensible to have 4 banks with 4 or 8 buffers instead. That should make > the algorithm scale down as well as up ... It might be helpful to have small-size banks when SLRU buffers are set to a very low value and we are only accessing a couple of pages at a time (i.e. no buffer replacement), because in such cases most of the contention will be on the SLRU bank lock. Although I am not sure how practical such a use case would be: if someone is using multi-xact very heavily or creating frequent subtransaction overflow, wouldn't they set this buffer limit to a big enough value? By doing this we would lose some of the simplicity of the patch; instead of using the simple SLRU_BANK_SIZE macro, we would need to compute this and store it in SlruShared. Maybe that's not that bad. > > I haven't done either of those things in the attached v19 version. I > did go over the comments once again and rewrote the parts I was unhappy > with, including some existing ones. I think it's OK now from that point > of view ... at some point I thought about creating a separate README, > but in the end I thought it not necessary. Thanks, I will review those changes. > I did add a bunch of Assert()s to make sure the locks that are supposed > to be held are actually held. This led me to testing the page status to > be not EMPTY during SimpleLruWriteAll() before calling > SlruInternalWritePage(), because the assert was firing. The previous > code is not really *buggy*, but to me it's weird to call WritePage() on > a slot with no contents. Okay, I mean internally SlruInternalWritePage() will flush only if the status is SLRU_PAGE_VALID, but it is better the way you have done it. > Another change was in TransactionGroupUpdateXidStatus: the original code > had the leader doing pg_atomic_read_u32(&procglobal->clogGroupFirst) to > know which bank to lock. I changed it to simply be the page used by the > leader process; this doesn't need an atomic read, and should be the same > page anyway. (If it isn't, it's no big deal). But what's more: even if > we do read ->clogGroupFirst at that point, there's no guarantee that > this is going to be exactly for the same process that ends up being the > first in the list, because since we have not set it to INVALID by the > time we grab the bank lock, it is quite possible for more processes to > add themselves to the list. Yeah, this looks better > I realized all this while rewriting the comments in a way that would let > me understand what was going on ... so IMO the effort was worthwhile.
+1. I will review and do some more testing early next week and share my feedback. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 21, 2024 at 2:52 PM Robert Haas wrote: > > On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar wrote: > > So the problem is that we might consider the transaction change as > > non-transaction and mark this flag as true. > > But it's not "might" right? It's absolutely 100% certain that we will > consider that transaction's changes as non-transactional ... because > when we're in fast-forward mode, the table of new relfilenodes is not > built, and so whenever we check whether any transaction made a new > relfilenode for this sequence, the answer will be no. > > > But what would have > > happened if we would have identified it correctly as transactional? > > In such cases, we wouldn't have set this flag here but then we would > > have set this while processing the DecodeAbort/DecodeCommit, so the > > net effect would be the same no? You may question what if the > > Abort/Commit WAL never appears in the WAL, but this flag is > > specifically for the upgrade case, and in that case we have to do a > > clean shutdown so may not be an issue. But in the future, if we try > > to use 'ctx->processing_required' for something else where the clean > > shutdown is not guaranteed then this flag can be set incorrectly. > > > > I am not arguing that this is a perfect design but I am just making a > > point about why it would work. > > Even if this argument is correct (and I don't know if it is), the code > and comments need some updating. We should not be testing a flag that > is guaranteed false with comments that make it sound like the value of > the flag is trustworthy when it isn't. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 21, 2024 at 1:24 PM Robert Haas wrote: > > On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar wrote: > > > But I am wondering why this flag is always set to true in > > > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the > > > aborted transactions are not supposed to be replayed? So if my > > > observation is correct that for the aborted transaction, this > > > shouldn't be set to true then we have a problem with sequence where we > > > are identifying the transactional changes as non-transaction changes > > > because now for transactional changes this should depend upon commit > > > status. > > > > I have checked this case with Amit Kapila. So it seems in the cases > > where we have sent the prepared transaction or streamed in-progress > > transaction we would need to send the abort also, and for that reason, > > we are setting 'ctx->processing_required' as true so that if these > > WALs are not streamed we do not allow upgrade of such slots. > > I don't find this explanation clear enough for me to understand. Here is why we set 'ctx->processing_required' to true from DecodeCommit as well as DecodeAbort: for upgrading logical replication slots, it's essential to ensure these slots are completely synchronized with the subscriber. To identify that, we process all the pending WAL in 'fast_forward' mode to find whether there is any decodable WAL or not. So, in short, any WAL that we stream to the standby in normal (non-fast_forward) mode is considered decodable, and so is the abort WAL. That's the reason why, at the end of the transaction commit/abort, we need to set 'ctx->processing_required' to true: some decodable WAL exists, so we cannot upgrade this slot.

Why is the below check safe?

> +	if (ctx->fast_forward)
> +	{
> +		/*
> +		 * We need to set processing_required flag to notify the sequence
> +		 * change existence to the caller. Usually, the flag is set when
> +		 * either the COMMIT or ABORT records are decoded, but this must be
> +		 * turned on here because the non-transactional logical message is
> +		 * decoded without waiting for these records.
> +		 */
> +		if (!transactional)
> +			ctx->processing_required = true;
> +
> +		return;
> +	}

So the problem is that we might consider a transactional change as non-transactional and mark this flag as true. But what would have happened if we had identified it correctly as transactional? In that case, we wouldn't have set this flag here, but we would have set it while processing DecodeAbort/DecodeCommit, so the net effect would be the same, no? You may ask what happens if the Abort/Commit record never appears in the WAL; this flag is specifically for the upgrade case, and in that case we have to do a clean shutdown, so it may not be an issue. But in the future, if we try to use 'ctx->processing_required' for something else where a clean shutdown is not guaranteed, then this flag can be set incorrectly. I am not arguing that this is a perfect design, but I am just making a point about why it would work. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 4:32 PM Dilip Kumar wrote: > > On Tue, Feb 20, 2024 at 3:38 PM Robert Haas wrote: > > > Let's say fast_forward is true. Then smgr_decode() is going to skip > > recording anything about the relfilenode, so we'll identify all > > sequence changes as non-transactional. But look at how this case is > > handled in seq_decode(): > > > > + if (ctx->fast_forward) > > + { > > + /* > > + * We need to set processing_required flag to notify the sequence > > + * change existence to the caller. Usually, the flag is set when > > + * either the COMMIT or ABORT records are decoded, but this must be > > + * turned on here because the non-transactional logical message is > > + * decoded without waiting for these records. > > + */ > > + if (!transactional) > > + ctx->processing_required = true; > > + > > + return; > > + } > > It appears that the 'processing_required' flag was introduced as part > of supporting upgrades for logical replication slots. Its purpose is > to determine whether a slot is fully caught up, meaning that there are > no pending decodable changes left before it can be upgraded. > > So now if some change was transactional but we have identified it as > non-transaction then we will mark this flag 'ctx->processing_required > = true;' so we temporarily set this flag incorrectly, but even if the > flag would have been correctly identified initially, it would have > been set again to true in the DecodeTXNNeedSkip() function regardless > of whether the transaction is committed or aborted. As a result, the > flag would eventually be set to 'true', and the behavior would align > with the intended logic. > > But I am wondering why this flag is always set to true in > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the > aborted transactions are not supposed to be replayed? So if my > observation is correct that for the aborted transaction, this > shouldn't be set to true then we have a problem with sequence where we > are identifying the transactional changes as non-transaction changes > because now for transactional changes this should depend upon commit > status. I have checked this case with Amit Kapila. So it seems in the cases where we have sent the prepared transaction or streamed in-progress transaction we would need to send the abort also, and for that reason, we are setting 'ctx->processing_required' as true so that if these WALs are not streamed we do not allow upgrade of such slots. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 3:38 PM Robert Haas wrote: > Let's say fast_forward is true. Then smgr_decode() is going to skip > recording anything about the relfilenode, so we'll identify all > sequence changes as non-transactional. But look at how this case is > handled in seq_decode():
>
> +	if (ctx->fast_forward)
> +	{
> +		/*
> +		 * We need to set processing_required flag to notify the sequence
> +		 * change existence to the caller. Usually, the flag is set when
> +		 * either the COMMIT or ABORT records are decoded, but this must be
> +		 * turned on here because the non-transactional logical message is
> +		 * decoded without waiting for these records.
> +		 */
> +		if (!transactional)
> +			ctx->processing_required = true;
> +
> +		return;
> +	}

It appears that the 'processing_required' flag was introduced as part of supporting upgrades for logical replication slots. Its purpose is to determine whether a slot is fully caught up, meaning that there are no pending decodable changes left before it can be upgraded.

So now, if some change was transactional but we have identified it as non-transactional, we will set 'ctx->processing_required = true;'. We temporarily set this flag incorrectly, but even if the change had been identified correctly initially, the flag would have been set to true anyway in the DecodeTXNNeedSkip() function, regardless of whether the transaction is committed or aborted. As a result, the flag would eventually be set to 'true', and the behavior would align with the intended logic.

But I am wondering why this flag is always set to true in DecodeTXNNeedSkip() irrespective of commit or abort, given that aborted transactions are not supposed to be replayed. If my observation is correct that this shouldn't be set to true for an aborted transaction, then we have a problem with sequences, where we are identifying transactional changes as non-transactional ones, because for transactional changes this should depend upon the commit status.

On another thought, can there be a situation where we have wrongly identified a change as non-transactional and set this flag, and the commit/abort record never appears in the WAL so it is never decoded? That could also lead to an incorrect decision during the upgrade. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 10:30 AM Robert Haas wrote: > > Is the rule that changes are transactional if and only if the current > transaction has assigned a new relfilenode to the sequence? Yes, that's the rule. > Why does the logic get confused if the state of the snapshot changes? The rule doesn't change, but the way this identification is implemented in decoding gets confused and treats a transactional change as non-transactional. The identification of whether a sequence change is transactional or not is based on what WAL we have decoded from the particular transaction, and whether we decode a particular WAL record or not depends upon the snapshot state (it's about what we decode, not necessarily what we send). So if the snapshot state changed mid-transaction, that means we haven't decoded the WAL that created the new relfilenode, but we will decode the WAL that operates on the sequence. Here we will assume the change is non-transactional whereas it was actually transactional, because we did not decode some of the changes of the transaction that we rely on for identifying whether it is transactional or not (a standalone sketch of this rule follows at the end of this message). > My naive reaction is that it kinda sounds like you're relying on two > different mistakes cancelling each other out, and that might be a bad > idea, because maybe there's some situation where they don't. But I > don't understand the issue well enough to have an educated opinion at > this point. I would say the first one is a mistake in identifying a transactional change as non-transactional during decoding, and that mistake happens only when we decode the transaction partially. But we never stream partially decoded transactions downstream, which means that even though we have made a mistake in decoding it, we are not streaming it, so our mistake does not turn into a real problem. But again, I agree there is a temporary wrong decision, and if we try to do something else based on this decision then it could be an issue. You might be interested in more detail [1] where I first reported this problem and also [2] where we concluded why this is not creating a real problem. [1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
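The rule itself is simple enough to state in code. This standalone sketch (purely illustrative, not PostgreSQL source) shows why partial decoding breaks it: if the relfilenode-creation record was skipped, the tracked set is empty and the change is misclassified:

#include <stdbool.h>
#include <stdio.h>

#define MAX_NEW_FILENODES 16

/* relfilenodes created by the transaction so far, as seen by the decoder */
static unsigned new_filenodes[MAX_NEW_FILENODES];
static int	n_new_filenodes = 0;

static void
decode_create_filenode(unsigned relfilenode)
{
	if (n_new_filenodes < MAX_NEW_FILENODES)
		new_filenodes[n_new_filenodes++] = relfilenode;
}

/* transactional iff this transaction created the sequence's relfilenode */
static bool
sequence_change_is_transactional(unsigned relfilenode)
{
	for (int i = 0; i < n_new_filenodes; i++)
		if (new_filenodes[i] == relfilenode)
			return true;
	return false;
}

int
main(void)
{
	/*
	 * Full decoding: the creation record is seen, so a later sequence
	 * change on relfilenode 1234 is correctly classified as transactional.
	 */
	decode_create_filenode(1234);
	printf("%d\n", sequence_change_is_transactional(1234));	/* prints 1 */

	/*
	 * Partial decoding (snapshot became consistent mid-transaction): the
	 * creation record was never decoded, so the same change would be
	 * misclassified as non-transactional.
	 */
	n_new_filenodes = 0;
	printf("%d\n", sequence_change_is_transactional(1234));	/* prints 0 */
	return 0;
}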
Re: Synchronizing slots from primary to standby
> > So now if we have such a functionality then it would be even better to > > extend it to selectively sync the slot. For example, if there is some > > issue in syncing all slots, maybe some bug or taking a long time to > > sync because there are a lot of slots but if the user needs to quickly > > failover and he/she is interested in only a couple of slots then such > > a option could be helpful. no? > > > > I see your point but not sure how useful it is in the field. I am fine > if others also think such a parameter will be useful and anyway I > think we can even extend it after v1 is done. > Okay, I am fine with that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera > > > wrote: > > > > > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > > > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > > > because we don't have that macro at that point, so I just used constant > > > 16. Obviously need a better solution for this. > > > > If we define SLRU_BANK_SIZE in slru.h then we can use it here right, > > because these files are anyway include slru.h so. > > Sure, but is that really what we want? So your question is: do we want these buffers to be a multiple of SLRU_BANK_SIZE? Maybe we can have the last bank be partial; I don't think it should create any problem logically. We can look again at the patch to see if we have made any such assumptions, but that should be fairly easy to fix, and then, if we are going this way, maybe we should get rid of the check_slru_buffers() function as well. > > > I've been wondering whether we should add a "slru" to the name of the > > > GUCs: > > > > > > commit_timestamp_slru_buffers > > > transaction_slru_buffers > > > etc > > > > I am not sure we are exposing anything related to SLRU to the user, > > We do -- we have pg_stat_slru already. > > > I mean transaction_buffers should make sense for the user that it > > stores transaction-related data in some buffers pool but whether that > > buffer pool is called SLRU or not doesn't matter much to the user > > IMHO. > > Yeah, that's exactly what my initial argument was for naming these this > way. But since the term slru already escaped into the wild via the > pg_stat_slru view, perhaps it helps users make the connection between > these things. Alternatively, we can cross-reference each term from the > other's documentation and call it a day. Yeah, that's true; I forgot this point about pg_stat_slru. From this PoV, if the configuration names contain slru, users would be able to make a better connection between the configured values and the stats in this view, and based on that they can decide whether they need to increase the size of these SLRU buffers. > Another painful point is that pg_stat_slru uses internal names in the > data it outputs, which obviously do not match the new GUCs. Yeah, that's true, but I think this could be explained somewhere; I am not sure what the right place for this is. FYI, I have also repeated all the performance tests I performed in my first email[1], and I am seeing a similar gain. Some comments on v18 in my first pass of the review.

1.
@@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
 	lsnindex = GetLSNIndex(slotno, xid);
 	*lsn = XactCtl->shared->group_lsn[lsnindex];
-	LWLockRelease(XactSLRULock);
+	LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno));

Maybe here we can add an assert before releasing the lock, for a safety check:

Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno)));

2.
+ *
+ * XXX could we make the LSNs to be bank-based?
  */
 	XLogRecPtr *group_lsn;

IMHO, the flush still happens at the page level, so the LSN up to which we must flush before flushing a particular clog page should also be maintained at the page level.

[1] https://www.postgresql.org/message-id/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera wrote: > > Here's the rest of it rebased on top of current master. I think it > makes sense to have this as one individual commit. > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > because we don't have that macro at that point, so I just used constant > 16. Obviously need a better solution for this. If we define SLRU_BANK_SIZE in slru.h then we can use it here, right? These files already include slru.h anyway. > > I also changed the location of bank_mask in SlruCtlData for better > packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO() > to SlotGetBankNumber(). Okay. > Some very critical comments still need to be updated to the new design, > particularly anything that mentions "control lock"; but also the overall > model needs to be explained in some central location, rather than > incongruently some pieces here and other pieces there. I'll see about > this later. But at least this is code you should be able to play with. Okay, I will review and test this. > I've been wondering whether we should add a "slru" to the name of the > GUCs: > > commit_timestamp_slru_buffers > transaction_slru_buffers > etc I am not sure we are exposing anything related to SLRU to the user. I mean, transaction_buffers should make sense to the user in that it stores transaction-related data in some buffer pool, but whether that buffer pool is called SLRU or not doesn't matter much to the user, IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
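To make the suggestion concrete, a rough sketch of what one of these functions could look like once SLRU_BANK_SIZE is available from slru.h (the clamping constants follow the existing CLOGShmemBuffers() formula; the rounding policy is an assumption):

#include "access/slru.h"    /* for SLRU_BANK_SIZE */

/* Sketch: size the CLOG buffer pool as a whole number of banks. */
Size
CLOGShmemBuffers(void)
{
    Size    nbuffers = Min(128, Max(SLRU_BANK_SIZE, NBuffers / 512));

    /* round down to a multiple of the bank size */
    return nbuffers - (nbuffers % SLRU_BANK_SIZE);
}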
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 6, 2024 at 4:23 PM Alvaro Herrera wrote: > > > > (We also have SimpleLruTruncate, but I think it's not as critical to > > > have a barrier there anyhow: accessing a slightly outdated page number > > > could only be a problem if a bug elsewhere causes us to try to truncate > > > in the current page. I think we only have this code there because we > > > did have such bugs in the past, but IIUC this shouldn't happen anymore.) > > > > +1, I agree with this theory in general. But the below comment in > > SimpleLruTruncate in your v3 patch doesn't seem correct, because here > > we are checking if the latest_page_number is smaller than the cutoff > > if so we log it as wraparound and skip the whole thing and that is > > fine even if we are reading with atomic variable and slightly outdated > > value should not be a problem but the comment claims that this is safe > > because we have the same bank lock as SimpleLruZeroPage(), but that's > > not true; here we will be acquiring different bank locks one by one > > based on which slotno we are checking. Am I missing something? > > I think you're correct. I reworded this comment, so now it says this: > > /* > * An important safety check: the current endpoint page must not be > * eligible for removal. This check is just a backstop against wraparound > * bugs elsewhere in SLRU handling, so we don't care if we read a slightly > * outdated value; therefore we don't add a memory barrier. > */ > > Pushed with those changes. Thank you! Yeah, this looks perfect, thanks. > Now I'll go rebase the rest of the patch on top. Okay, I will review and test after that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Synchronizing slots from primary to standby
On Tue, Feb 6, 2024 at 3:41 PM Amit Kapila wrote: > > On Tue, Feb 6, 2024 at 3:23 PM Dilip Kumar wrote: > > > > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada > > wrote: > > > > > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila > > > wrote: > > > > > > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada > > > > wrote: > > > > > > > > > > --- > > > > > Since two processes (e.g. the slotsync worker and > > > > > pg_sync_replication_slots()) concurrently fetch and update the slot > > > > > information, there is a race condition where slot's > > > > > confirmed_flush_lsn goes backward. > > > > > > > > > > > > > Right, this is possible, though there shouldn't be a problem because > > > > anyway, slotsync is an async process. Till we hold restart_lsn, the > > > > required WAL won't be removed. Having said that, I can think of two > > > > ways to avoid it: (a) We can have some flag in shared memory using > > > > which we can detect whether any other process is doing slot > > > > synchronization and then either error out at that time or simply wait > > > > or may take nowait kind of parameter from user to decide what to do? > > > > If this is feasible, we can simply error out for the first version and > > > > extend it later if we see any use cases for the same (b) similar to > > > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an > > > > error, this is good for now but in future we may still have another > > > > similar issue, so I would prefer (a) among these but I am fine if you > > > > prefer (b) or have some other ideas like just note down in comments > > > > that this is a harmless case and can happen only very rarely. > > > > > > Thank you for sharing the ideas. I would prefer (a). For (b), the same > > > issue still happens for other fields. > > > > I agree that (a) looks better. On a separate note, while looking at > > this API, pg_sync_replication_slots(PG_FUNCTION_ARGS): shouldn't there > > be an optional parameter to give one slot or multiple slots, with all > > slots as the default? That would give better control to the user, no? > > > > As of now, we want to give functionality similar to slotsync worker > with a difference that users can use this new function for planned > switchovers. So, syncing all failover slots by default. I think if > there is a use case to selectively sync some of the failover slots > then we can probably extend this function and slotsync worker as well. > Normally, if the primary goes down due to whatever reason users would > want to restart the replication for all the defined publications via > existing failover slots. Why would anyone want to do it partially? If we consider the usability of such a function (I mean as it is implemented now, without any argument), one use case could be that the slot sync worker is not keeping up, or at some point the user doesn't want to wait for the worker and instead wants to do the sync himself. So if we have such functionality, then it would be even better to extend it to selectively sync the slots. For example, suppose there is some issue in syncing all slots, maybe some bug, or the sync takes a long time because there are a lot of slots, but the user needs to fail over quickly and is interested in only a couple of slots; then such an option could be helpful, no? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
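At the C level, the optional argument could look roughly like this sketch (the NULL-means-all convention and the SyncReplicationSlots() signature are assumptions for illustration, not the patch's actual code):

Datum
pg_sync_replication_slots(PG_FUNCTION_ARGS)
{
    /* Hypothetical: a NULL argument means "sync all failover slots". */
    if (PG_ARGISNULL(0))
        SyncReplicationSlots(NULL);
    else
    {
        ArrayType  *slot_names = PG_GETARG_ARRAYTYPE_P(0);

        SyncReplicationSlots(slot_names);   /* only the named slots */
    }

    PG_RETURN_VOID();
}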
Re: Question on LWLockMode in dsa.c
On Tue, Jan 30, 2024 at 6:24 AM Masahiko Sawada wrote: > > Hi, > > While working on radix tree patch[1], John Naylor found that dsa.c > doesn't already use shared locks even in dsa_dump(). dsa_dump() seems > a pure read-only function so I thought we could use a shared lock mode > there. Is there any reason to use exclusive mode even in dsa_dump()? > > Ultimately, since we're trying to add a new function > dsa_get_total_size() that just returns > dsa_area_control.total_segment_size and therefore would also be a > read-only function, I'd like to find out the correct lock mode there. > Doesn't seem like there is any reason for this to be an exclusive lock. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
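Concretely, the new function would presumably look like this sketch (assuming dsa.c's existing DSA_AREA_LOCK macro; treat it as an illustration rather than the final patch):

/* Sketch: a read-only accessor should only need a shared lock. */
size_t
dsa_get_total_size(dsa_area *area)
{
    size_t  size;

    LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
    size = area->control->total_segment_size;
    LWLockRelease(DSA_AREA_LOCK(area));

    return size;
}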
Re: Synchronizing slots from primary to standby
On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada wrote: > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila wrote: > > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada > > wrote: > > > > > > --- > > > Since two processes (e.g. the slotsync worker and > > > pg_sync_replication_slots()) concurrently fetch and update the slot > > > information, there is a race condition where slot's > > > confirmed_flush_lsn goes backward. > > > > > > > Right, this is possible, though there shouldn't be a problem because > > anyway, slotsync is an async process. Till we hold restart_lsn, the > > required WAL won't be removed. Having said that, I can think of two > > ways to avoid it: (a) We can have some flag in shared memory using > > which we can detect whether any other process is doing slot > > synchronization and then either error out at that time or simply wait > > or may take nowait kind of parameter from user to decide what to do? > > If this is feasible, we can simply error out for the first version and > > extend it later if we see any use cases for the same (b) similar to > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an > > error, this is good for now but in future we may still have another > > similar issue, so I would prefer (a) among these but I am fine if you > > prefer (b) or have some other ideas like just note down in comments > > that this is a harmless case and can happen only very rarely. > > Thank you for sharing the ideas. I would prefer (a). For (b), the same > issue still happens for other fields. I agree that (a) looks better. On a separate note, while looking at this API, pg_sync_replication_slots(PG_FUNCTION_ARGS): shouldn't there be an optional parameter to give one slot or multiple slots, with all slots as the default? That would give better control to the user, no? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
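For option (a), a minimal sketch of the shared-memory flag (the struct and function names are made up for illustration; error-out-only matches the "first version" behavior suggested above):

/* Hypothetical shared state guarding slot synchronization. */
typedef struct SlotSyncCtxStruct
{
    bool        syncing;    /* is some process already syncing slots? */
    slock_t     mutex;
} SlotSyncCtxStruct;

static SlotSyncCtxStruct *SlotSyncCtx;

static void
set_syncing_or_error(void)
{
    SpinLockAcquire(&SlotSyncCtx->mutex);
    if (SlotSyncCtx->syncing)
    {
        SpinLockRelease(&SlotSyncCtx->mutex);
        ereport(ERROR,
                (errmsg("cannot synchronize replication slots concurrently")));
    }
    SlotSyncCtx->syncing = true;
    SpinLockRelease(&SlotSyncCtx->mutex);

    /* the flag must be reset on exit and on error (e.g. via PG_ENSURE_ERROR_CLEANUP) */
}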
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Thu, Jan 11, 2024 at 10:48 AM Bharath Rupireddy wrote: > > Hi, > > Replication slots in postgres will prevent removal of required > resources when there is no connection using them (inactive). This > consumes storage because neither required WAL nor required rows from > the user tables/system catalogs can be removed by VACUUM as long as > they are required by a replication slot. In extreme cases this could > cause the transaction ID wraparound. > > Currently postgres has the ability to invalidate inactive replication > slots based on the amount of WAL (set via max_slot_wal_keep_size GUC) > that will be needed for the slots in case they become active. However, > the wraparound issue isn't effectively covered by > max_slot_wal_keep_size - one can't tell postgres to invalidate a > replication slot if it is blocking VACUUM. Also, it is often tricky to > choose a default value for max_slot_wal_keep_size, because the amount > of WAL that gets generated and allocated storage for the database can > vary. > > Therefore, it is often easy for developers to do the following: > a) set an XID age (age of slot's xmin or catalog_xmin) of say 1 or 1.5 > billion, after which the slots get invalidated. > b) set a timeout of say 1 or 2 or 3 days, after which the inactive > slots get invalidated. > > To implement (a), postgres needs a new GUC called max_slot_xid_age. > The checkpointer then invalidates all the slots whose xmin (the oldest > transaction that this slot needs the database to retain) or > catalog_xmin (the oldest transaction affecting the system catalogs > that this slot needs the database to retain) has reached the age > specified by this setting. > > To implement (b), first postgres needs to track the replication slot > metrics like the time at which the slot became inactive (inactive_at > timestamptz) and the total number of times the slot became inactive in > its lifetime (inactive_count numeric) in ReplicationSlotPersistentData > structure. And, then it needs a new timeout GUC called > inactive_replication_slot_timeout. Whenever a slot becomes inactive, > the current timestamp and inactive count are stored in > ReplicationSlotPersistentData structure and persisted to disk. The > checkpointer then invalidates all the slots that are lying inactive > for about inactive_replication_slot_timeout duration starting from > inactive_at. > > In addition to implementing (b), these two new metrics enable > developers to improve their monitoring tools as the metrics are > exposed via pg_replication_slots system view. For instance, one can > build a monitoring tool that signals when replication slots are lying > inactive for a day or so using inactive_at metric, and/or when a > replication slot is becoming inactive too frequently using inactive_at > metric. > > I’m attaching the v1 patch set as described below: > 0001 - Tracks invalidation_reason in pg_replication_slots. This is > needed because slots now have multiple reasons for slot invalidation. > 0002 - Tracks inactive replication slot information inactive_at and > inactive_timeout. > 0003 - Adds inactive_timeout based replication slot invalidation. > 0004 - Adds XID based replication slot invalidation. > > Thoughts? > +1 for the idea, here are some comments on 0002, I will review other patches soon and respond. 1. + + inactive_at timestamptz + + +The time at which the slot became inactive. +NULL if the slot is currently actively being +used. + + Maybe we can change the field name to 'last_inactive_at'? 
or maybe the comment can explain the timestamp at which the slot was last inactivated. Since we are already maintaining inactive_count, I think it is better to explicitly say this is the last inactive time. 2. + /* + * XXX: Can inactive_count of type uint64 ever overflow? It takes + * about a half-billion years for inactive_count to overflow even + * if slot becomes inactive for every 1 millisecond. So, using + * pg_add_u64_overflow might be an overkill. + */ Correct, we don't need to use pg_add_u64_overflow for this counter. 3. + + /* Convert to numeric. */ + snprintf(buf, sizeof buf, UINT64_FORMAT, slot_contents.data.inactive_count); + values[i++] = DirectFunctionCall3(numeric_in, + CStringGetDatum(buf), + ObjectIdGetDatum(0), + Int32GetDatum(-1)); What is the purpose of doing this? I mean, inactive_count is an 8-byte integer, and you can define the function's out parameter as 'int8', which is also an 8-byte integer; then you don't need to convert the integer to a string and then to numeric. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
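To make point 3 concrete, this is the simpler path being suggested (it assumes the view column is redeclared as int8 instead of numeric):

/* With the out parameter declared as int8, no string round-trip is needed. */
values[i++] = Int64GetDatum((int64) slot_contents.data.inactive_count);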
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Sun, Feb 4, 2024 at 7:10 PM Alvaro Herrera wrote: > > On 2024-Feb-02, Dilip Kumar wrote: > > > I have checked the patch and it looks fine to me other than the above > > question related to memory barrier usage one more question about the > > same, basically below two instances 1 and 2 look similar but in 1 you > > are not using the memory write_barrier whereas in 2 you are using the > > write_barrier, why is it so? I mean why the reordering can not happen > > in 1 and it may happen in 2? > > What I was thinking is that there's a lwlock operation just below, which > acts as a barrier. But I realized something more important: there are > only two places that matter, which are SlruSelectLRUPage and > SimpleLruZeroPage. The others are all initialization code that run at a > point where there's no going to be any concurrency in SLRU access, so we > don't need barriers anyway. In SlruSelectLRUPage we definitely don't > want to evict the page that SimpleLruZeroPage has initialized, starting > from the point where it returns that new page to its caller. > But if you consider the code of those two routines, you realize that the > only time an equality between latest_page_number and "this_page_number" > is going to occur, is when both pages are in the same bank ... and both > routines are required to be holding the bank lock while they run, so in > practice this is never a problem. Right. In fact, when I first converted 'latest_page_number' to an atomic, the thinking was to protect it from concurrent writes in SimpleLruZeroPage(), and also so that a concurrent read in SlruSelectLRUPage() would not see a corrupted value. All other usages were during the initialization phase, where we do not need any protection. > > We need the atomic write and atomic read so that multiple processes > processing pages in different banks can update latest_page_number > simultaneously. But the equality condition that we're looking for? > it can never happen concurrently. Yeah, that's right; after you explained it, I also realized that the case is protected by the bank lock. Earlier I hadn't thought about this case. > In other words, these barriers are fully useless. > > (We also have SimpleLruTruncate, but I think it's not as critical to > have a barrier there anyhow: accessing a slightly outdated page number > could only be a problem if a bug elsewhere causes us to try to truncate > in the current page. I think we only have this code there because we > did have such bugs in the past, but IIUC this shouldn't happen anymore.) +1, I agree with this theory in general. But the below comment in SimpleLruTruncate in your v3 patch doesn't seem correct. Here we are checking whether latest_page_number is smaller than the cutoff; if so, we log it as wraparound and skip the whole thing. That is fine even if we are reading an atomic variable, and a slightly outdated value should not be a problem. But the comment claims that this is safe because we hold the same bank lock as SimpleLruZeroPage(), and that's not true: here we will be acquiring different bank locks one by one, based on which slotno we are checking. Am I missing something? + * An important safety check: the current endpoint page must not be + * eligible for removal. Like SlruSelectLRUPage, we don't need a + * memory barrier here because for the affected page to be relevant, + * we'd have to have the same bank lock as SimpleLruZeroPage.
*/ - if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage)) + if (ctl->PagePrecedes(pg_atomic_read_u64(&shared->latest_page_number), + cutoffPage)) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
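To illustrate the locking point, a simplified sketch (not the patch's actual code) of a truncate-side scan that takes each bank's lock in turn rather than holding a single lock throughout:

/* Sketch: scanning slots while acquiring bank locks one at a time. */
for (int slotno = 0; slotno < shared->num_slots; slotno++)
{
    LWLock *banklock = SimpleLruGetBankLock(ctl, shared->page_number[slotno]);

    LWLockAcquire(banklock, LW_EXCLUSIVE);
    /* ... examine and possibly evict this slot's page ... */
    LWLockRelease(banklock);
}

Because no single bank lock is held across the whole loop, the scan cannot claim the same bank-lock guarantee that SimpleLruZeroPage() relies on, which is exactly why the comment was reworded.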
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 4:34 PM Dilip Kumar wrote: > > On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar wrote: > > > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera > > wrote: > > > Okay. > > > > > > While I have your attention -- if you could give a look to the 0001 > > > patch I posted, I would appreciate it. > > > > > > > I will look into it. Thanks. > > Some quick observations: > > Do we need the below two write barriers at the end of the function? > Because the next instruction is separated by the function boundary > > @@ -766,14 +766,11 @@ StartupCLOG(void) > .. > - XactCtl->shared->latest_page_number = pageno; > - > - LWLockRelease(XactSLRULock); > + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); > + pg_write_barrier(); > } > > /* > * Initialize member's idea of the latest page number. > */ > pageno = MXOffsetToMemberPage(offset); > - MultiXactMemberCtl->shared->latest_page_number = pageno; > + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, > +pageno); > + > + pg_write_barrier(); > } I have checked the patch and it looks fine to me, other than the above question related to memory barrier usage. One more question about the same: the below two instances 1 and 2 look similar, but in 1 you are not using the write_barrier whereas in 2 you are; why is that? I mean, why can the reordering not happen in 1 while it may happen in 2? 1. + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); vs 2. - shared->latest_page_number = pageno; + pg_atomic_write_u64(&shared->latest_page_number, pageno); + pg_write_barrier(); /* update the stats counter of zeroed pages */ pgstat_count_slru_page_zeroed(shared->slru_stats_idx); -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar wrote: > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera wrote: > Okay. > > > > While I have your attention -- if you could give a look to the 0001 > > patch I posted, I would appreciate it. > > > > I will look into it. Thanks. Some quick observations: do we need the below two write barriers at the end of the function? The next instruction is separated by the function boundary anyway. @@ -766,14 +766,11 @@ StartupCLOG(void) .. - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); + pg_write_barrier(); } /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, +pageno); + + pg_write_barrier(); } I am looking more into this from the concurrency point of view and will update you soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera wrote: > > On 2024-Feb-01, Dilip Kumar wrote: > > > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera > > wrote: > > > > > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > > > > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter > > > "transaction_buffers": 17 > > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must > > > be a multiple of 16 > > > > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE > > instead of giving an error? > > Since this is the auto-tuning feature, I think it should use the > previous multiple rather than the next, but yeah, something like that. Okay. > > While I have your attention -- if you could give a look to the 0001 > patch I posted, I would appreciate it. > I will look into it. Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
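For the auto-tuning case, the rounding could be done in the GUC check hook, something like this sketch (the hook name follows the pattern already discussed; the exact limits are assumptions):

/* Sketch: round an SLRU buffer GUC down to the previous whole bank. */
static bool
check_transaction_buffers(int *newval, void **extra, GucSource source)
{
    if (*newval % SLRU_BANK_SIZE != 0)
        *newval = Max(SLRU_BANK_SIZE,
                      (*newval / SLRU_BANK_SIZE) * SLRU_BANK_SIZE);

    return true;
}

With this, a shared_buffers-derived value such as 17 would silently become 16 instead of failing at startup.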
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera wrote: > > Hah: > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter > "transaction_buffers": 17 > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be > a multiple of 16 Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE instead of giving an error? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Jan 26, 2024 at 11:08 PM Alvaro Herrera wrote: > > I've continued to review this and decided that I don't like the mess > this patch proposes in order to support pg_commit_ts's deletion of all > files. (Yes, I know that I was the one that proposed this idea. It's > still a bad one). I'd like to change that code by removing the limit > that we can only have 128 bank locks in a SLRU. To recap, the reason we > did this is that commit_ts sometimes wants to delete all files while > running (DeactivateCommitTs), and for this it needs to acquire all bank > locks. Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we > added the SLRU limit making multiple banks share lwlocks. > > I propose two alternative solutions: > > 1. The easiest is to have DeactivateCommitTs continue to hold > CommitTsLock until the end, including while SlruScanDirectory does its > thing. This sounds terrible, but considering that this code only runs > when the module is being deactivated, I don't think it's really all that > bad in practice. I mean, if you deactivate the commit_ts module and > then try to read commit timestamp data, you deserve to wait for a few > seconds just as a punishment for your stupidity. I think this idea looks reasonable. I agree that if we are trying to read commit_ts after deactivating it then it's fine to make it wait. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Jan 25, 2024 at 10:03 PM Alvaro Herrera wrote: > > On 2024-Jan-25, Alvaro Herrera wrote: > > > Here's a touched-up version of this patch. > > > diff --git a/src/backend/storage/lmgr/lwlock.c > > b/src/backend/storage/lmgr/lwlock.c > > index 98fa6035cc..4a5e05d5e4 100644 > > --- a/src/backend/storage/lmgr/lwlock.c > > +++ b/src/backend/storage/lmgr/lwlock.c > > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = { > > [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash", > > [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA", > > [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash", > > + [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU", > > + [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU", > > + [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU", > > + [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU", > > + [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU" > > + [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU", > > + [LWTRANCHE_XACT_SLRU] = "XactSLRU", > > }; > > Eeek. Last minute changes ... Fixed here. Thank you for working on this. There is one thing that I feel is problematic: we have kept the allowed values for these GUCs in multiples of SLRU_BANK_SIZE, i.e. 16, and that's the reason the min values were changed to 16; but in this refactoring patch, for some of the buffers, you have changed that to 8, and I think that's not good. + { + {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM, + gettext_noop("Sets the size of the dedicated buffer pool used for the MultiXact offset cache."), + NULL, + GUC_UNIT_BLOCKS + }, + &multixact_offsets_buffers, + 16, 8, SLRU_MAX_ALLOWED_BUFFERS, + check_multixact_offsets_buffers, NULL, NULL + }, Other than this, the patch looks good to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
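For clarity, the suggested fix is just to restore a whole-bank minimum in entries like the one quoted above, e.g.:

{
    {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM,
        gettext_noop("Sets the size of the dedicated buffer pool used for the MultiXact offset cache."),
        NULL,
        GUC_UNIT_BLOCKS
    },
    &multixact_offsets_buffers,
    16, 16, SLRU_MAX_ALLOWED_BUFFERS,   /* boot, min, max: min stays one full bank */
    check_multixact_offsets_buffers, NULL, NULL
},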
Re: index prefetching
On Wed, Jan 24, 2024 at 11:43 PM Tomas Vondra wrote: > On 1/22/24 07:35, Konstantin Knizhnik wrote: > > > > On 22/01/2024 1:47 am, Tomas Vondra wrote: > >> h, right. Well, you're right in this case we perhaps could set just one > >> of those flags, but the "purpose" of the two places is quite different. > >> > >> The "prefetch" flag is fully controlled by the prefetcher, and it's up > >> to it to change it (e.g. I can easily imagine some new logic touching > >> setting it to "false" for some reason). > >> > >> The "data" flag is fully controlled by the custom callbacks, so whatever > >> the callback stores, will be there. > >> > >> I don't think it's worth simplifying this. In particular, I don't think > >> the callback can assume it can rely on the "prefetch" flag. > >> > > Why not to add "all_visible" flag to IndexPrefetchEntry ? If will not > > cause any extra space overhead (because of alignment), but allows to > > avoid dynamic memory allocation (not sure if it is critical, but nice to > > avoid if possible). > > > While reading through the first patch I got some questions; I haven't read it completely yet, but this is what I have so far. 1. +static bool +IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block) +{ + int idx; ... + if (prefetch->blockItems[idx] != (block - i)) + return false; + + /* Don't prefetch if the block happens to be the same. */ + if (prefetch->blockItems[idx] == block) + return false; + } + + /* not sequential, not recently prefetched */ + return true; +} The above function's name is BlockIsSequential, but at the end it returns true if it is not sequential; that seems like a problem. Also, the other two checks right above the end of the function return false if the block is the same or if the pattern is sequential; I think those are wrong too. 2. I have noticed that the prefetch history is maintained at the backend level, but what if multiple backends are trying to fetch the same heap blocks, maybe scanning the same index? Should that be in some shared structure? I haven't thought much deeper about this from the implementation POV, but should we think about it, or does it not matter? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
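For what it's worth, a sketch of what the function might look like with the return values matching its name; this is my reading of the intent, and the constants and fields are assumed to exist in the patch:

/* Sketch: return true only when the recent request history is sequential. */
static bool
IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block)
{
    for (int i = 1; i <= PREFETCH_SEQ_PATTERN_BLOCKS; i++)
    {
        int idx = (prefetch->blockIndex - i + PREFETCH_QUEUE_HISTORY)
                  % PREFETCH_QUEUE_HISTORY;

        /* the i-th previous request must be exactly (block - i) */
        if (prefetch->blockItems[idx] != (block - i))
            return false;
    }

    return true;
}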
Re: Opportunistically pruning page before update
On Tue, Jan 23, 2024 at 7:18 AM James Coleman wrote: > > On Mon, Jan 22, 2024 at 8:21 PM James Coleman wrote: > > > > See rebased patch attached. > > I just realized I left a change in during the rebase that wasn't necessary. > > v4 attached. I have noticed that you perform the opportunistic pruning after we have decided that the updated tuple cannot fit in the current page, and then the pruning is done on the new target page. Why don't we first perform the pruning on the existing page of the tuple itself? Or is this already being done before this patch? I could not find such existing pruning, hence the question, because such pruning could convert many non-HOT updates into HOT updates, right? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
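In other words, something along these lines before giving up on the tuple's own page (a sketch; whether heap_page_prune_opt() is the right entry point at that spot is my assumption):

/* Sketch: try to make room on the old page first, so the update can stay HOT. */
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(&oldtup.t_self));
heap_page_prune_opt(relation, buffer);  /* opportunistic prune of the old page */
/* ... then re-check free space here before falling back to a new target page ... */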
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Wed, Jan 17, 2024 at 1:46 PM Kartyshov Ivan wrote: > > Add some fixes and rebase. > While quickly looking into the patch, I understood the idea of what we are trying to achieve here, and I feel it is a useful feature. But while looking at both the patches, I could not quickly differentiate between the two approaches. I believe that at the core both implement similar wait logic and just provide different syntaxes. So if we want to keep both approaches open for the sake of discussion, it would be better to first create a patch that implements the core part, i.e. the waiting logic and the other common code, and then add top-up patches for the two different approaches; that would be easier to review. I also see that in v4 there is no documentation for the syntax part, which makes it even harder to understand. I think this thread is implementing a useful feature, so my suggestion is to add some documentation in v4 and also make it more readable with respect to the clear differences between these two approaches; adding a commit message would also help. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Synchronizing slots from primary to standby
On Fri, Jan 19, 2024 at 5:24 PM Amit Kapila wrote: > > On Wed, Jan 17, 2024 at 4:00 PM shveta malik wrote: > > > > I had some off-list discussions with Sawada-San, Hou-San, and Shveta > on the topic of extending replication commands instead of using the > current model where we fetch the required slot information via SQL > using a database connection. I would like to summarize the discussion > and would like to know the thoughts of others on this topic. > > In the current patch, we launch the slotsync worker on physical > standby which connects to the specified database (currently we let > users specify the required dbname in primary_conninfo) on the primary. > It then fetches the required information for failover marked slots > from the primary and also does some primitive checks on the upstream > node via SQL (the additional checks are like whether the upstream node > has a specified physical slot or whether the upstream node is a > primary node or a standby node). To fetch the required information it > uses a libpqwalreceiver API which is mostly apt for this purpose as it > supports SQL execution but for this patch, we don't need a replication > connection, so we extend the libpqwalreceiver connect API. What sort of extension have we done to 'libpqwalreceiver'? Is it that, by default, it supports replication connections, so we have extended the API to provide an option for choosing whether to create a replication connection or a normal connection? > Now, the concerns related to this could be that users would probably > need to change existing mechanisms/tools to update primary_conninfo > and one of the alternatives proposed is to have an additional GUC like > slot_sync_dbname. Users won't be able to drop the database this worker > is connected to aka whatever is specified in slot_sync_dbname but as > the user herself sets up the configuration it shouldn't be a big deal. Yeah, for this purpose users may use template1 or so, which they generally don't plan to drop. So in case the user wants to drop that database, the user first needs to turn off the slot syncing option, and then it can be done? > Then we also discussed whether extending libpqwalreceiver's connect > API is a good idea and whether we need to further extend it in the > future. As far as I can see, slotsync worker's primary requirement is > to execute SQL queries which the current API is sufficient, and don't > see something that needs any drastic change in this API. Note that > tablesync worker that executes SQL also uses these APIs, so we may > need something in the future for either of those. Then finally we need > a slotsync worker to also connect to a database to use SQL and fetch > results. While looking into the patch v64-0002, I could not exactly point out what sort of extensions are there in libpqwalreceiver.c; I just saw one extra API for fetching the dbname from the connection info. > Now, let us consider if we extend the replication commands like > READ_REPLICATION_SLOT and or introduce a new set of replication > commands to fetch the required information then we don't need a DB > connection with primary or a connection in slotsync worker. As per my > current understanding, it is quite doable but I think we will slowly > go in the direction of making replication commands something like SQL > because today we need to extend it to fetch all slots info that have > failover marked as true, the existence of a particular replication, > etc.
> Then tomorrow, if we want to extend this work to have multiple > slotsync workers say workers per db then we have to extend the > replication command to fetch per-database failover marked slots. To > me, it sounds more like we are slowly adding SQL-like features to > replication commands. > > Apart from this when we are reading per-db replication slots without > connecting to a database, we probably need some additional protection > mechanism so that the database won't get dropped. Something like locking the database only while fetching the slots? > Considering all this it seems that for now probably extending > replication commands can simplify a few things like mentioned above > but using SQL's with db-connection is more extendable. I have similar thoughts too. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
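On the connect-API question, my understanding is that the extension is roughly of this shape (a sketch; the exact parameter list is an assumption based on the discussion):

/* Hypothetical: let callers ask for a plain SQL (non-replication) connection. */
WalReceiverConn *
walrcv_connect(const char *conninfo, bool replication, bool logical,
               bool must_use_password, const char *appname, char **err);

/* slotsync worker: SQL-only connection to the primary */
wrconn = walrcv_connect(PrimaryConnInfo, false /* replication */,
                        false, false, "slotsyncworker", &err);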
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)
On Wed, Nov 1, 2023 at 2:42 AM Matthias van de Meent wrote: > > Hi, > > Currently, nbtree code compares each and every column of an index > tuple during the binary search on the index page. With large indexes > that have many duplicate prefix column values (e.g. an index on (bool, > bool, uuid) ) that means a lot of wasted time getting to the right > column. > > The attached patch improves on that by doing per-page dynamic prefix > truncation: If we know that on both the right and left side there are > index tuples where the first two attributes are equal to the scan key, > we skip comparing those attributes at the current index tuple and > start with comparing attribute 3, saving two attribute compares. We > gain performance whenever comparing prefixing attributes is expensive > and when there are many tuples with a shared prefix - in unique > indexes this doesn't gain much, but we also don't lose much in this > case. > > This patch was originally suggested at [0], but it was mentioned that > they could be pulled out into it's own thread. Earlier, the > performance gains were not clearly there for just this patch, but > after further benchmarking this patch stands on its own for > performance: it sees no obvious degradation of performance, while > gaining 0-5% across various normal indexes on the cc-complete sample > dataset, with the current worst-case index shape getting a 60%+ > improved performance on INSERTs in the tests at [0]. +1 for the idea, I have some initial comments while reading through the patch. 1. Commit message refers to a non-existing reference '(see [0])'. 2. +When we do a binary search on a sorted set (such as a BTree), we know that a +tuple will be smaller than its left neighbour, and larger than its right +neighbour. I think this should be 'larger than left neighbour and smaller than right neighbour' instead of the other way around. 3. +With the above optimizations, dynamic prefix truncation improves the worst +case complexity of indexing from O(tree_height * natts * log(tups_per_page)) +to O(tree_height * (3*natts + log(tups_per_page))) Where do the 3*natts come from? Is it related to setting up the dynamic prefix at each level? 4. + /* + * All tuple attributes are equal to the scan key, only later attributes + * could potentially not equal the scan key. + */ + *compareattr = ntupatts + 1; Can you elaborate on this more? If all tuple attributes are equal to the scan key then what do those 'later attributes' mean? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
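To make the idea concrete, a simplified sketch of a per-page binary search that carries the established prefix on each bound (the helper name and the exact bookkeeping are illustrative, not the patch's code):

/* Sketch: binary search with dynamic prefix truncation. */
int         low_skip = 1,       /* attrs known equal at the low boundary */
            high_skip = 1;      /* attrs known equal at the high boundary */

while (low < high)
{
    OffsetNumber mid = low + (high - low) / 2;
    /* attributes established equal on both sides can be skipped */
    int         cmpcol = Min(low_skip, high_skip);
    int32       result = _bt_compare_from(rel, key, page, mid, &cmpcol);

    if (result > 0)
    {
        low = mid + 1;
        low_skip = cmpcol;      /* first cmpcol-1 attrs matched at mid */
    }
    else
    {
        high = mid;
        high_skip = cmpcol;
    }
}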
Re: Synchronizing slots from primary to standby
On Tue, Jan 16, 2024 at 9:37 AM Amit Kapila wrote: > > On Tue, Jan 16, 2024 at 9:03 AM shveta malik wrote: > > > Agreed and as said earlier I think it is better to make it a > PGC_SIGHUP. Also, not sure we can say it is a non-standard way as > already autovacuum launcher is handled in the same way. One more minor > thing is it will save us for having a new bgworker state > BgWorkerStart_ConsistentState_HotStandby as introduced by this patch. Yeah, it's not a non-standard way. But bgworker provides a lot of built-in infrastructure which we would otherwise have to maintain ourselves if we opt for option 2. I would have preferred option 3 from the simplicity point of view, but I prefer making this PGC_SIGHUP over that simplicity. Anyway, if there are issues in doing so, then we can keep it PGC_POSTMASTER, but it's worth trying this out. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Synchronizing slots from primary to standby
On Wed, Jan 10, 2024 at 5:53 PM Zhijie Hou (Fujitsu) wrote: > > > IIUC on the standby we just want to overwrite what we get from primary no? > > If > > so why we are using those APIs that are meant for the actual decoding slots > > where it needs to take certain logical decisions instead of mere > > overwriting? > > I think we don't have a strong reason to use these APIs, but it was > convenient to > use these APIs as they can take care of updating the slots info and will call > functions like, ReplicationSlotsComputeRequiredXmin, > ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly > overwriting > the fields and call these manually ? I might be missing something, but do you want to call ReplicationSlotsComputeRequiredXmin()-style functions on the standby? I mean, those will ultimately update the catalog xmin and replication xmin in the ProcArray, and that prevents Vacuum from cleaning up some of the required xids. But on the standby, those shared memory parameters are not used, IIUC. In my opinion, on the standby we just need to update the values in the local slots with whatever we get from the remote slots, without taking all the logical decisions in the hope that they will all fall into a particular path. For example, if you look at LogicalIncreaseXminForSlot(), it performs the steps shown below[1]. These all make sense when you are doing candidate-based updating, where we first mark the candidates and then update the candidate to the real value once we get the confirmation for the LSN. Following all this logic here looks completely wrong, and unless it falls into a particular path, I feel it will do some duplicate steps as well. For example, in local_slot_update(), first you call LogicalConfirmReceivedLocation(), which sets 'data.confirmed_flush'; then you call LogicalIncreaseXminForSlot(), which sets 'updated_xmin = true;' and calls LogicalConfirmReceivedLocation() again. I don't think this is the correct way of reusing the function, unless you really need to go through those paths and I am missing something. [1] LogicalIncreaseXminForSlot() { if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin)) { } else if (current_lsn <= slot->data.confirmed_flush) { } else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr) { } if (updated_xmin) LogicalConfirmReceivedLocation(slot->data.confirmed_flush); } -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
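To spell out the alternative, a minimal sketch of directly overwriting the synced slot on the standby (field and function names follow the existing slot code; whether anything else must be recomputed there is exactly the open question):

/* Sketch: plain overwrite of the synced slot's positions on the standby. */
SpinLockAcquire(&slot->mutex);
slot->data.restart_lsn = remote_slot->restart_lsn;
slot->data.confirmed_flush = remote_slot->confirmed_lsn;
slot->data.catalog_xmin = remote_slot->catalog_xmin;
slot->effective_catalog_xmin = remote_slot->catalog_xmin;
SpinLockRelease(&slot->mutex);

ReplicationSlotMarkDirty();
ReplicationSlotSave();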
Re: Synchronizing slots from primary to standby
On Tue, Jan 9, 2024 at 5:44 PM Zhijie Hou (Fujitsu) wrote: > comments on 0002 1. +/* Worker's nap time in case of regular activity on the primary server */ +#define WORKER_DEFAULT_NAPTIME_MS 10L /* 10 ms */ + +/* Worker's nap time in case of no-activity on the primary server */ +#define WORKER_INACTIVITY_NAPTIME_MS 10000L /* 10 sec */ Instead of directly switching between 10ms and 10s, shouldn't we increase the nap time gradually? I mean, it could even go beyond 10 sec, but instead of switching from 10ms straight to 10 sec we could grow it each time by some multiplier, with a max limit up to which it can grow (see the backoff sketch at the end of this review). We can still reset back to 10ms directly as soon as we observe some activity. 2. SlotSyncWorkerCtxStruct: add this to the typedefs.list file. 3. +/* + * Update local slot metadata as per remote_slot's positions + */ +static void +local_slot_update(RemoteSlot *remote_slot) +{ + Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE); + + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn); + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn, +remote_slot->catalog_xmin); + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn, + remote_slot->restart_lsn); +} IIUC, on the standby we just want to overwrite what we get from the primary, no? If so, why are we using those APIs that are meant for the actual decoding slots, where they need to take certain logical decisions, instead of merely overwriting? 4. +/* + * Helper function for drop_obsolete_slots() + * + * Drops synced slot identified by the passed in name. + */ +static void +drop_synced_slots_internal(const char *name, bool nowait) Suggestion: add one line in the header comment to explain 'nowait'. 5. +/* + * Helper function to check if local_slot is present in remote_slots list. + * + * It also checks if logical slot is locally invalidated i.e. invalidated on + * the standby but valid on the primary server. If found so, it sets + * locally_invalidated to true. + */ Instead of saying "but valid on the primary server", better to say "but valid in the remote_slots list", because this function is just checking the remote_slots list regardless of where that list came from. Mentioning the primary makes it seem like this function might fetch directly from the primary, so it is a bit confusing. 6. +/* + * Check that all necessary GUCs for slot synchronization are set + * appropriately. If not, raise an ERROR. + */ +static void +validate_slotsync_parameters(char **dbname) The function name just says 'validate_slotsync_parameters', but it also fetches the dbname, so I think we had better change the name accordingly. Also, instead of passing dbname as an out parameter, just return it directly; there is no need to pass this extra parameter and make the function return void. 7. + tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); + tuple_ok = tuplestore_gettupleslot(res->tuplestore, true, false, tupslot); + Assert(tuple_ok); /* It must return one tuple */ The comment says 'It must return one tuple', but we assert only that there is at least one tuple; shouldn't we enhance the assert so that it checks that we got exactly one tuple? 8. /* No need to check further, return that we are cascading standby */ + *am_cascading_standby = true; We are not returning immediately; we are just setting am_cascading_standby to true, so adjust the comment accordingly. 9. + /* No need to check further, return that we are cascading standby */ + *am_cascading_standby = true; + } + else + { + /* We are a normal standby.
*/ Single-line comments do not follow a uniform pattern for the full stop: either use a full stop for all single-line comments or for none; at least follow the same rule within a file or in nearby comments. 10. + errmsg("exiting from slot synchronization due to bad configuration"), + errhint("%s must be defined.", "primary_slot_name")); Why are we using the constant string "primary_slot_name" as a format argument in this error message? 11. + /* + * Hot_standby_feedback must be enabled to cooperate with the physical + * replication slot, which allows informing the primary about the xmin and + * catalog_xmin values on the standby. I do not like capitalizing the first letter of 'hot_standby_feedback', which is a GUC parameter. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
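For point 1, a sketch of the kind of gradual backoff I have in mind (the multiplier and cap are arbitrary assumptions):

/* Hypothetical gradual backoff for the slotsync worker's nap time. */
#define WORKER_DEFAULT_NAPTIME_MS   10L     /* 10 ms */
#define WORKER_MAX_NAPTIME_MS       30000L  /* cap; could be larger than 10 sec */

static long naptime = WORKER_DEFAULT_NAPTIME_MS;

static void
update_worker_naptime(bool activity_seen)
{
    if (activity_seen)
        naptime = WORKER_DEFAULT_NAPTIME_MS;    /* reset immediately */
    else
        naptime = Min(naptime * 2, WORKER_MAX_NAPTIME_MS);  /* grow gradually */
}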
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera wrote: > > The more I look at TransactionGroupUpdateXidStatus, the more I think > it's broken, and while we do have some tests, I don't have confidence > that they cover all possible cases. > > Or, at least, if this code is good, then it hasn't been sufficiently > explained. Do you have any thoughts about a case in which it might be broken? Even a vague idea of where you think it might not work as expected would help, so that I can think in that direction too. It is possible that I am not considering some perspective that you are, and that the comments are lacking from that point of view. > If we have multiple processes trying to write bits to clog, and they are > using different banks, then the LWLockConditionalAcquire will be able to > acquire the bank lock Do you think there is a problem with multiple processes getting the lock? I mean, they are modifying different CLOG pages, so that can be done concurrently, right?