Re: Fix small typo, use InvalidRelFileNumber instead of InvalidOid
On Sat, Nov 9, 2024 at 12:41 PM Dilip Kumar wrote:
>
> On Fri, Nov 8, 2024 at 8:36 PM Tom Lane wrote:
>>
>> Dilip Kumar writes:
>> > IIRC, in the catalog we intentionally left it as Oid because RelFileNumber
>> > is an internal typedef; it is not an exposed datatype, so probably we
>> > cannot use it in the catalog.
>>
>> We could declare it as RelFileNumber so that that is what C code sees,
>> and teach Catalog.pm to translate that to OID in the emitted catalog
>> contents.
>
> Yeah, that makes sense, and yes we can actually keep this change.
>
>> I think you'd have to do that to avoid a ton of false
>> positives when you make RelFileNumber into a struct, and we might as
>> well keep the result.
>
> Even after this, there were tons of false positives whenever using any
> comparison operator on relfilenumbers [1], and there are tons of those, or
> using them in log messages with %u [2]. Anyway, I have gone through those
> manually, and after ignoring all false positives here is what I got; PFA
> patch (some 25 places where I have fixed usage of Oid instead of
> RelFileNumber).
>
> [1]
> ../../../../src/include/storage/buf_internals.h:157:20: error: invalid
> operands to binary expression ('const RelFileNumber' (aka 'const struct
> RelFileNumber') and 'const RelFileNumber')
>         (tag1->relNumber == tag2->relNumber)
>
> [2]
> fsmpage.c:277:47: warning: format specifies type 'unsigned int' but the
> argument has type 'RelFileNumber' (aka 'struct RelFileNumber') [-Wformat]
>         blknum, rlocator.spcOid, rlocator.dbOid, rlocator.relNumber);

Other than the changes I made in the patch in my previous email, there are a
couple more items we could consider for this change, but I am not sure, so
listing them here:

1. We are using PG_RETURN_OID() for returning the relfilenumber; should we
introduce something like PG_RETURN_RELFILENUMBER()? e.g. pg_relation_filenode()

2. We are using PG_GETARG_OID() for getting the relfilenumber as an external
function argument; shall we introduce something like PG_GETARG_RELFILENUMBER()?
e.g. binary_upgrade_set_next_heap_relfilenode()

3. info.c:611:23: error: assigning to 'RelFileNumber' (aka 'struct
RelFileNumber') from incompatible type 'Oid' (aka 'unsigned int')
        curr->relfilenumber = atooid(PQgetvalue(res, relnum, i_relfilenumber));

4. We are also using ObjectIdGetDatum() and DatumGetObjectId() for
relfilenumber, so shall we introduce new functions, i.e.
RelFileNumberGetDatum() and DatumGetRelFileNumber()?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
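[Editor's note] A rough sketch of what items (1), (2), and (4) could look
like, modeled on the existing Oid macros in postgres.h and fmgr.h; the names
and definitions here are only illustrative of the proposal, not a committed
API:

/* Datum conversions, mirroring DatumGetObjectId()/ObjectIdGetDatum() */
#define DatumGetRelFileNumber(X)	((RelFileNumber) DatumGetUInt32(X))
#define RelFileNumberGetDatum(X)	UInt32GetDatum((uint32) (X))

/* fmgr convenience wrappers, mirroring PG_GETARG_OID()/PG_RETURN_OID() */
#define PG_GETARG_RELFILENUMBER(n)	DatumGetRelFileNumber(PG_GETARG_DATUM(n))
#define PG_RETURN_RELFILENUMBER(x)	return RelFileNumberGetDatum(x)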
Re: Fix small typo, use InvalidRelFileNumber instead of InvalidOid
On Fri, Nov 8, 2024 at 8:36 PM Tom Lane wrote:
> Dilip Kumar writes:
> > IIRC, in the catalog we intentionally left it as Oid because RelFileNumber
> > is an internal typedef; it is not an exposed datatype, so probably we
> > cannot use it in the catalog.
>
> We could declare it as RelFileNumber so that that is what C code sees,
> and teach Catalog.pm to translate that to OID in the emitted catalog
> contents.

Yeah, that makes sense, and yes we can actually keep this change.

> I think you'd have to do that to avoid a ton of false
> positives when you make RelFileNumber into a struct, and we might as
> well keep the result.

Even after this, there were tons of false positives whenever using any
comparison operator on relfilenumbers [1], and there are tons of those, or
using them in log messages with %u [2]. Anyway, I have gone through those
manually, and after ignoring all false positives here is what I got; PFA
patch (some 25 places where I have fixed usage of Oid instead of
RelFileNumber).

[1]
../../../../src/include/storage/buf_internals.h:157:20: error: invalid
operands to binary expression ('const RelFileNumber' (aka 'const struct
RelFileNumber') and 'const RelFileNumber')
        (tag1->relNumber == tag2->relNumber)

[2]
fsmpage.c:277:47: warning: format specifies type 'unsigned int' but the
argument has type 'RelFileNumber' (aka 'struct RelFileNumber') [-Wformat]
        blknum, rlocator.spcOid, rlocator.dbOid, rlocator.relNumber);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment: remove_using_oid_for_relfilenumber.patch
Re: Fix small typo, use InvalidRelFileNumber instead of InvalidOid
On Fri, Nov 8, 2024 at 4:30 PM Alvaro Herrera wrote:
> On 2024-Nov-08, Dilip Kumar wrote:
>
> > It appears that we are passing InvalidOid instead of InvalidRelFileNumber
> > when calling index_create(). While this isn't technically a bug, since
> > InvalidRelFileNumber is defined as InvalidOid, it's preferable to use the
> > correct identifier for clarity and consistency.
>
> Oh ugh, this is quite a widespread problem actually, and it's just
> because RelFileNumber is a typedef to Oid that the compiler doesn't
> complain. In a very quick desultory scan, I found a bunch of similar
> issues elsewhere, such as below (also the assignment to relNumber in
> GetNewRelFileNumber). This isn't exhaustive by any means nor did I try
> to compile it ... are you motivated to search for this more
> exhaustively? You could try (temporarily) making RelFileNumber a
> typedef that tricks the compiler into creating a warning or error for
> every mischief, such as a struct, fix all the warnings/errors, then
> revert the temporary change.

Okay, I will try that.

> diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
> index 275615877eb..e2a73d88408 100644
> --- a/src/backend/backup/basebackup_incremental.c
> +++ b/src/backend/backup/basebackup_incremental.c
> @@ -740,7 +740,7 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
>  	 */
>  	rlocator.spcOid = spcoid;
>  	rlocator.dbOid = dboid;
> -	rlocator.relNumber = 0;
> +	rlocator.relNumber = InvalidRelFileNumber;
>  	if (BlockRefTableGetEntry(ib->brtab, &rlocator, MAIN_FORKNUM,
>  							  &limit_block) != NULL)
>  	{
> diff --git a/src/backend/bootstrap/bootparse.y b/src/backend/bootstrap/bootparse.y
> index 73a7592fb71..2f7f06a126f 100644
> --- a/src/backend/bootstrap/bootparse.y
> +++ b/src/backend/bootstrap/bootparse.y
> @@ -206,7 +206,7 @@ Boot_CreateStmt:
>  									PG_CATALOG_NAMESPACE,
>  									shared_relation ? GLOBALTABLESPACE_OID : 0,
>  									$3,
> -									InvalidOid,
> +									InvalidRelFileNumber,
>  									HEAP_TABLE_AM_OID,
>  									tupdesc,
>  									RELKIND_RELATION,
> diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
> index f9bb721c5fe..f8d4824a3f8 100644
> --- a/src/backend/catalog/index.c
> +++ b/src/backend/catalog/index.c
> @@ -3740,7 +3740,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
>  	if (set_tablespace)
>  	{
>  		/* Update its pg_class row */
> -		SetRelationTableSpace(iRel, params->tablespaceOid, InvalidOid);
> +		SetRelationTableSpace(iRel, params->tablespaceOid, InvalidRelFileNumber);
>
>  		/*
>  		 * Schedule unlinking of the old index storage at transaction commit.
> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
> index eaa81424270..8505cc8c1b5 100644
> --- a/src/backend/commands/tablecmds.c
> +++ b/src/backend/commands/tablecmds.c
> @@ -15441,7 +15441,7 @@ ATExecSetTableSpaceNoStorage(Relation rel, Oid newTableSpace)
>  	}
>
>  	/* Update can be done, so change reltablespace */
> -	SetRelationTableSpace(rel, newTableSpace, InvalidOid);
> +	SetRelationTableSpace(rel, newTableSpace, InvalidRelFileNumber);
>
>  	InvokeObjectPostAlterHook(RelationRelationId, RelationGetRelid(rel), 0);
>
> diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
> index 0fc2c093b0d..bffbb98779e 100644
> --- a/src/include/catalog/pg_class.h
> +++ b/src/include/catalog/pg_class.h
> @@ -54,7 +54,7 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
>
>  	/* identifier of physical storage file */
>  	/* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
> -	Oid			relfilenode BKI_DEFAULT(0);
> +	RelFileNumber relfilenode BKI_DEFAULT(0);
>
>  	/* identifier of table space for relation (0 means default for database) */
>  	Oid			reltablespace BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_tablespace);

IIRC, in the catalog we intentionally left it as Oid because RelFileNumber is
an internal typedef; it is not an exposed datatype, so probably we cannot use
it in the catalog.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
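[Editor's note] For reference, a minimal sketch of the temporary compiler
trick Alvaro suggests, assuming the usual "typedef Oid RelFileNumber;" is
swapped out for the duration of the search (compile-time-only scaffolding,
never meant to be committed):

/* Temporarily make RelFileNumber a distinct single-member struct so that
 * every implicit conversion between Oid and RelFileNumber becomes a
 * compile error or warning: */
typedef struct RelFileNumber
{
	Oid			relNumber;
} RelFileNumber;

With this in place, every "relfilenumber = someOid" assignment, every
comparison against an Oid, and every %u format argument fails to compile,
which is exactly the class of mismatch reported elsewhere in this thread.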
Fix small typo, use InvalidRelFileNumber instead of InvalidOid
It appears that we are passing InvalidOid instead of InvalidRelFileNumber
when calling index_create(). While this isn't technically a bug, since
InvalidRelFileNumber is defined as InvalidOid, it's preferable to use the
correct identifier for clarity and consistency.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment: 0001-Fix-typo-use-InvalidRelFileNumber-instead-of-Invalid.patch
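[Editor's note] For context, the relevant definitions (as of recent trees,
in src/include/storage/relpath.h; quoted here for reference) show why the
two names spell the same value:

typedef Oid RelFileNumber;
#define InvalidRelFileNumber	((RelFileNumber) InvalidOid)

Since the typedef is transparent, the compiler accepts either name and only
the reader can tell which concept was intended, which is what this patch
cleans up.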
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Sat, Oct 26, 2024 at 5:30 PM Melanie Plageman wrote:
> On Sat, Oct 26, 2024 at 12:17 AM Dilip Kumar wrote:
> >
> > On Fri, Oct 25, 2024 at 10:07 PM Melanie Plageman < melanieplage...@gmail.com> wrote:
> >>
> >> On Fri, Oct 25, 2024 at 10:29 AM Melanie Plageman wrote:
> >> >
> >> > Tom suggested off-list that if rs_cindex can't be negative, then it should
> >> > be unsigned. I checked the other scan types using the
> >> > HeapScanDescData, and it seems none of them use values of rs_cindex or
> >> > rs_ntuples < 0. As such, here is a patch making both rs_ntuples and
> >> > rs_cindex unsigned.
> >
> > @@ -943,8 +945,8 @@ heapgettup_pagemode(HeapScanDesc scan,
> > {
> > 	HeapTuple	tuple = &(scan->rs_ctup);
> > 	Page		page;
> > -	int			lineindex;
> > -	int			linesleft;
> > +	uint32		lineindex;
> > +	uint32		linesleft;
> >
> > IMHO we can't make "lineindex" a uint32, because just see the first
> > code block [1] of heapgettup_pagemode(): we use this index for +ve (forward
> > scan) as well as -ve (backward scan) movement.
>
> Yes, so in the case of backwards scan, if scan->rs_cindex is 0, when
> dir is -1, lineindex will wrap around, but we don't use it in that
> case because linesleft will be 0 and then we will not meet the
> conditions to execute the code in the loop under continue_page. And in
> the loop, when adding -1 to lineindex, for backwards scan, it always
> starts at linesleft and ticks down to 0.

Yeah, you are right: although lineindex will wrap around when rs_cindex is 0,
it would not be used. So it won't actually cause any issues, but I'm not
comfortable with the unintentional wraparound. I would have left
"scan->rs_cindex" as int itself, but I am fine with whatever you prefer.

> We could add an if statement above the goto that says something like
> if (linesleft > 0)
>     goto continue_page;
>
> Would that make it clearer?

Not sure it would make it clearer. In fact, in common cases it would add an
extra instruction to check if (linesleft > 0).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
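[Editor's note] A tiny standalone model of the stepping logic under
discussion, to illustrate why the uint32 wraparound is benign; the names
mirror the mail, not the real function:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int			dir = -1;			/* BackwardScanDirection */
	uint32_t	rs_cindex = 0;		/* first offset already consumed */
	uint32_t	linesleft = 0;		/* nothing left on this page */
	uint32_t	lineindex = rs_cindex + dir;	/* wraps to 0xFFFFFFFF */

	/* The loop guard tests linesleft, not lineindex, so the wrapped
	 * value is never actually used to fetch a tuple: */
	while (linesleft > 0)
	{
		printf("fetch offset %u\n", lineindex);
		lineindex += dir;
		linesleft--;
	}
	return 0;
}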
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Fri, Oct 25, 2024 at 10:07 PM Melanie Plageman wrote:
> On Fri, Oct 25, 2024 at 10:29 AM Melanie Plageman wrote:
> >
> > Tom suggested off-list that if rs_cindex can't be negative, then it should
> > be unsigned. I checked the other scan types using the
> > HeapScanDescData, and it seems none of them use values of rs_cindex or
> > rs_ntuples < 0. As such, here is a patch making both rs_ntuples and
> > rs_cindex unsigned.

@@ -943,8 +945,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	Page		page;
-	int			lineindex;
-	int			linesleft;
+	uint32		lineindex;
+	uint32		linesleft;

IMHO we can't make "lineindex" a uint32, because just see the first code
block [1] of heapgettup_pagemode(): we use this index for +ve (forward scan)
as well as -ve (backward scan) movement.

[1]
if (likely(scan->rs_inited))
{
	/* continue from previously returned page/tuple */
	page = BufferGetPage(scan->rs_cbuf);

	lineindex = scan->rs_cindex + dir;
	if (ScanDirectionIsForward(dir))

-- Refer the definition of ScanDirection:
typedef enum ScanDirection
{
	BackwardScanDirection = -1,
	NoMovementScanDirection = 0,
	ForwardScanDirection = 1
} ScanDirection;

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Thu, Oct 24, 2024 at 7:11 PM Melanie Plageman wrote:
> Thanks for the reply, Dilip!
>
> On Thu, Oct 24, 2024 at 12:50 AM Dilip Kumar wrote:
> >
> > On Thu, Oct 24, 2024 at 3:45 AM Melanie Plageman < melanieplage...@gmail.com> wrote:
> >>
> >> HeapScanDescData->rs_cindex (the current index into the array of
> >> visible tuples stored in the heap scan descriptor) is used for
> >> multiple scan types, but for bitmap heap scans, AFAICT, it should
> >> never be < 0.
> >
> > You are right, it can never be < 0 in this case at least.
>
> Cool, thanks for confirming. I will commit this change then.

+1

> > In fact you don't need to explicitly set it to 0 in initscan [1], because
> > before calling heapam_scan_bitmap_next_tuple() we must call
> > heapam_scan_bitmap_next_block(), and that function is initializing it to 0
> > (hscan->rs_cindex = 0;). Anyway, no objection even if you prefer to
> > initialize it in initscan(); the point is just that we are already doing it
> > for each block before fetching tuples from the block.
>
> I am inclined to still initialize it to 0 in initscan(). I was
> refactoring the bitmap heap scan code to use the read stream API and
> after moving some things around, this field was no longer getting
> initialized in heapam_scan_bitmap_next_block(). While that may not be
> a good reason to change it in this patch, most of the other fields in
> the heap scan descriptor (like rs_cbuf and rs_cblock) are
> re-initialized in initscan(), so it seems okay to do that there even
> though it isn't strictly necessary on master.

Makes sense.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Thu, Oct 24, 2024 at 3:45 AM Melanie Plageman wrote:
> Hi,
>
> HeapScanDescData->rs_cindex (the current index into the array of
> visible tuples stored in the heap scan descriptor) is used for
> multiple scan types, but for bitmap heap scans, AFAICT, it should
> never be < 0.
>
> As such, I find this test in heapam_scan_bitmap_next_tuple() pretty
> confusing.
>
> if (hscan->rs_cindex < 0 || hscan->rs_cindex >= hscan->rs_ntuples)
>
> I know it seems innocuous, but I find it pretty distracting. Am I
> missing something? Could it somehow be < 0?
>
> If not, I propose removing that part of the if statement like in the
> attached patch.

You are right, it can never be < 0 in this case at least. In fact you don't
need to explicitly set it to 0 in initscan [1], because before calling
heapam_scan_bitmap_next_tuple() we must call heapam_scan_bitmap_next_block(),
and that function is initializing it to 0 (hscan->rs_cindex = 0;). Anyway, no
objection even if you prefer to initialize it in initscan(); the point is just
that we are already doing it for each block before fetching tuples from the
block.

[1]
@@ -378,6 +378,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	ItemPointerSetInvalid(&scan->rs_ctup.t_self);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_cindex = 0;

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Make default subscription streaming option as Parallel
On Fri, 18 Oct 2024 at 5:24 PM, Amit Kapila wrote:
> On Fri, Oct 18, 2024 at 9:48 AM Dilip Kumar wrote:
> >
> > On Tue, Oct 8, 2024 at 3:38 PM Amit Kapila wrote:
> >>
> >> On Tue, Oct 8, 2024 at 2:25 PM shveta malik wrote:
> >> >
> >> > On Mon, Oct 7, 2024 at 4:03 PM vignesh C wrote:
> >> > >
> >> >
> >> > With parallel streaming as default, do you think there is a need to
> >> > increase the default for 'max_logical_replication_workers', as IIUC
> >> > parallel workers are taken from the same pool.
> >>
> >> Good question. But then one may say that we should proportionately
> >> increase max_worker_processes as well. I don't know what should be
> >> reasonable new defaults. I think we should make parallel streaming the
> >> default and then wait for some user feedback before changing other
> >> defaults.
> >
> > I agree; actually, streaming of in-progress transactions is a useful
> > feature for performance in case of large transactions, so it makes sense
> > to make it "on" by default. So +1 from my side.
>
> Your response is confusing. AFAIU, this proposal is to change the
> default value of the streaming option to 'parallel' but you are
> suggesting to make 'on' the default, which is different from the
> proposed default, but OTOH you are saying +1 as well. So, both can't be
> true.

Sorry for the confusion; I meant to say change the default to 'parallel'.

--
Dilip
Re: Make default subscription streaming option as Parallel
On Tue, Oct 8, 2024 at 3:38 PM Amit Kapila wrote:
> On Tue, Oct 8, 2024 at 2:25 PM shveta malik wrote:
> >
> > On Mon, Oct 7, 2024 at 4:03 PM vignesh C wrote:
> > >
> >
> > With parallel streaming as default, do you think there is a need to
> > increase the default for 'max_logical_replication_workers', as IIUC
> > parallel workers are taken from the same pool.
>
> Good question. But then one may say that we should proportionately
> increase max_worker_processes as well. I don't know what should be
> reasonable new defaults. I think we should make parallel streaming the
> default and then wait for some user feedback before changing other
> defaults.

I agree; actually, streaming of in-progress transactions is a useful feature
for performance in case of large transactions, so it makes sense to make it
"on" by default. So +1 from my side.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Invalid Assert while validating REPLICA IDENTITY?
On Fri, Sep 6, 2024 at 4:48 PM vignesh C wrote:
>
> On Mon, 2 Sept 2024 at 18:22, Dilip Kumar wrote:
> >
> > On Mon, Sep 2, 2024 at 3:32 PM Amit Kapila wrote:
> > >
> > > On Mon, Sep 2, 2024 at 11:21 AM Dilip Kumar wrote:
> > > >
> > > > While working on some other code I noticed that in
> > > > FindReplTupleInLocalRel() there is an assert [1] that seems to be
> > > > passing IndexRelation to GetRelationIdentityOrPK() whereas it should
> > > > be passing the normal relation.
> > >
> > > Agreed. But this should lead to assertion failure. Did you try testing it?
> >
> > No, I did not test this particular case; it impacted me with my other
> > addition of the code where I got Index Relation as input to the
> > RelationGetIndexList() function, and my local changes were impacted by
> > that. I will write a test for this stand-alone case so that it hits
> > the assert. Thanks for looking into this.
>
> The FindReplTupleInLocalRel function can be triggered by both update
> and delete operations, but this only occurs if the relation has been
> marked as updatable by the logicalrep_rel_mark_updatable function. If
> the relation is marked as non-updatable, an error will be thrown by
> check_relation_updatable. Given this, if a relation is updatable, the
> IsIndexUsableForReplicaIdentityFull condition might always evaluate to
> true due to the previous checks in logicalrep_rel_mark_updatable.
> Therefore, it's possible that we might not encounter the Assert
> statement, as IsIndexUsableForReplicaIdentityFull may consistently be
> true.
> Thoughts?

With that, it seems that the first Assert condition is useless, isn't it?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Deadlock due to locking order violation while inserting into a leaf relation
Basically, when we are inserting into a leaf relation (or a lower-level
relation of a partitioned relation), we acquire the lock on the leaf relation
during parsing time itself, whereas the parent lock is acquired during
generate_partition_qual(). Now, if we concurrently try to drop the partition
root, it will acquire the locks in the reverse order, i.e. parent first and
then child, so this can create a deadlock. The example below reproduces this
case.

Setup:
CREATE TABLE test(a int, b int) partition by range(a);
CREATE TABLE test1 partition of test for values from (1) to (10);

Test:
--
--Session1:
INSERT INTO test1 VALUES (1, 4);
-- let session 1 lock the relation test1 and make it wait before it locks
-- test (put a breakpoint in ExecInitModifyTable)

--Session2:
-- try to drop the top table, which will try to take AccessExclusiveLock on
-- all partitions
DROP TABLE test;

--Session3:
-- see pg_locks: session2 (pid 30410) has locked the root table test (16384)
-- and is waiting on test1 (16387), as session1 (pid 30368) is holding that
-- lock

 locktype | database | relation |  pid  |        mode         | granted
----------+----------+----------+-------+---------------------+---------
 relation |        5 |    16387 | 30368 | RowExclusiveLock    | t
 relation |        5 |    16387 | 30410 | AccessExclusiveLock | f
 relation |        5 |    16384 | 30410 | AccessExclusiveLock | t
(11 rows)

--Session1: now, as soon as you continue in gdb in session 1, it will hit the
deadlock:

ERROR:  40P01: deadlock detected
DETAIL:  Process 30368 waits for AccessShareLock on relation 16384 of database 5; blocked by process 30410.
Process 30410 waits for AccessExclusiveLock on relation 16387 of database 5; blocked by process 30368.
HINT:  See server log for query details.
LOCATION:  DeadLockReport, deadlock.c:1135

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
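[Editor's note] In sketch form, the two code paths take the same pair of
locks in opposite orders (an illustration of the ordering only, not the
actual call sites):

/* INSERT INTO test1 (session 1): leaf first, parent later */
LockRelationOid(leafOid, RowExclusiveLock);		/* at parse/lock time */
/* ... planning/executor startup ... */
LockRelationOid(parentOid, AccessShareLock);	/* generate_partition_qual() */

/* DROP TABLE test (session 2): parent first, then children */
LockRelationOid(parentOid, AccessExclusiveLock);
LockRelationOid(childOid, AccessExclusiveLock);

Two sessions running these concurrently can therefore deadlock, which is what
the pg_locks output above shows.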
Re: Invalid Assert while validating REPLICA IDENTITY?
On Mon, Sep 2, 2024 at 3:32 PM Amit Kapila wrote:
>
> On Mon, Sep 2, 2024 at 11:21 AM Dilip Kumar wrote:
> >
> > While working on some other code I noticed that in
> > FindReplTupleInLocalRel() there is an assert [1] that seems to be
> > passing IndexRelation to GetRelationIdentityOrPK() whereas it should
> > be passing the normal relation.
>
> Agreed. But this should lead to assertion failure. Did you try testing it?

No, I did not test this particular case; it impacted me with my other addition
of the code where I got Index Relation as input to the RelationGetIndexList()
function, and my local changes were impacted by that. I will write a test for
this stand-alone case so that it hits the assert. Thanks for looking into
this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Invalid Assert while validating REPLICA IDENTITY?
While working on some other code I noticed that in FindReplTupleInLocalRel()
there is an assert [1] that seems to be passing IndexRelation to
GetRelationIdentityOrPK() whereas it should be passing the normal relation.

[1]
if (OidIsValid(localidxoid))
{
#ifdef USE_ASSERT_CHECKING
	Relation	idxrel = index_open(localidxoid, AccessShareLock);

	/* Index must be PK, RI, or usable for REPLICA IDENTITY FULL tables */
	Assert(GetRelationIdentityOrPK(idxrel) == localidxoid ||
		   IsIndexUsableForReplicaIdentityFull(BuildIndexInfo(idxrel),
												edata->targetRel->attrmap));
	index_close(idxrel, AccessShareLock);
#endif

In the above code, we are passing idxrel to GetRelationIdentityOrPK(), whereas
we should be passing localrel. The fix should be:

@@ -2929,7 +2929,7 @@ FindReplTupleInLocalRel(ApplyExecutionData *edata, Relation localrel,
 	Relation	idxrel = index_open(localidxoid, AccessShareLock);

 	/* Index must be PK, RI, or usable for REPLICA IDENTITY FULL tables */
-	Assert(GetRelationIdentityOrPK(idxrel) == localidxoid ||
+	Assert(GetRelationIdentityOrPK(localrel) == localidxoid ||
 		   IsIndexUsableForReplicaIdentityFull(BuildIndexInfo(idxrel),
 											   edata->targetRel->attrmap));
 	index_close(idxrel, AccessShareLock);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Tue, Jul 30, 2024 at 4:56 PM shveta malik wrote:
>
> On Tue, Jul 30, 2024 at 4:04 PM Dilip Kumar wrote:
> >
> > On Fri, Jul 26, 2024 at 9:50 AM Ajin Cherian wrote:
> >
> > Comment in 0002:
> >
> > 1) I do not see any test case that sets a proper conflict type and
> > conflict resolver; all tests either give an incorrect conflict
> > type/conflict resolver, or the conflict resolver is ignored.
> >
> > 0003
> > 2) I was trying to think about this patch, so suppose we consider this
> > case: conflict_type -> update_differ, resolver -> remote_apply. My
> > question is to confirm whether my understanding is correct. So if
> > this is set and we have 2 nodes and set up a 2-way logical
> > replication, and if a conflict occurs, node-1 will take the changes of
> > node-2 and node-2 will take the changes of node-1?
>
> Yes, that's right.
>
> > Maybe so. I think
> > to avoid such cases the user needs to set the resolver more thoughtfully;
> > on node-1 it may be set as "skip" and on node-2 as "remote-apply", so
> > in such cases if a conflict happens both nodes will have the value from
> > node-1. But maybe it would be more difficult to get a consistent
> > value if we are setting up a mess replication topology, right? Maybe
> > there I think a more advanced timestamp-based option would work better,
> > IMHO.
>
> Yes, that's correct. We can get data divergence with resolvers like
> 'remote_apply', 'keep_local' etc. If you meant 'mesh' replication
> topology, then yes, it is difficult to get a consistent value there with
> resolvers other than timestamp-based ones. And thus timestamp-based
> resolvers are needed and should be the default when implemented.

Thanks for the clarification.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Fri, Jul 26, 2024 at 9:50 AM Ajin Cherian wrote:

Comment in 0002:

1) I do not see any test case that sets a proper conflict type and conflict
resolver; all tests either give an incorrect conflict type/conflict resolver,
or the conflict resolver is ignored.

0003
2) I was trying to think about this patch, so suppose we consider this case:
conflict_type -> update_differ, resolver -> remote_apply. My question is to
confirm whether my understanding is correct. So if this is set and we have 2
nodes and set up a 2-way logical replication, and if a conflict occurs, node-1
will take the changes of node-2 and node-2 will take the changes of node-1?
Maybe so. I think to avoid such cases the user needs to set the resolver more
thoughtfully; on node-1 it may be set as "skip" and on node-2 as
"remote-apply", so in such cases if a conflict happens both nodes will have
the value from node-1. But maybe it would be more difficult to get a
consistent value if we are setting up a mess replication topology, right?
Maybe there I think a more advanced timestamp-based option would work better,
IMHO.

I am doing a code-level review as well and will share my comments soon on
0003 and 0004.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict detection and logging in logical replication
On Tue, Jul 30, 2024 at 1:49 PM Zhijie Hou (Fujitsu) wrote:
>
> > On Monday, July 29, 2024 6:59 PM Dilip Kumar wrote:
> > >
> > > On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu) wrote:
> > > >
> > >
> > > I was going through v7-0001, and I have some initial comments.
> >
> > Thanks for the comments!
> >
> > > @@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
> > >  	ExprContext *econtext;
> > >  	Datum		values[INDEX_MAX_KEYS];
> > >  	bool		isnull[INDEX_MAX_KEYS];
> > > -	ItemPointerData invalidItemPtr;
> > >  	bool		checkedIndex = false;
> > >
> > >  	ItemPointerSetInvalid(conflictTid);
> > > -	ItemPointerSetInvalid(&invalidItemPtr);
> > >
> > >  	/*
> > >  	 * Get information from the result relation info structure.
> > > @@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
> > >
> > >  		satisfiesConstraint =
> > >  			check_exclusion_or_unique_constraint(heapRelation, indexRelation,
> > > -												 indexInfo, &invalidItemPtr,
> > > +												 indexInfo, &slot->tts_tid,
> > >  												 values, isnull, estate, false,
> > >  												 CEOUC_WAIT, true,
> > >  												 conflictTid);
> > >
> > > What is the purpose of this change? I mean
> > > 'check_exclusion_or_unique_constraint' says that 'tupleid'
> > > should be invalidItemPtr if the tuple is not yet inserted, and
> > > ExecCheckIndexConstraints is called by ExecInsert before inserting the tuple.
> > > So what is this change?
> >
> > Because the function ExecCheckIndexConstraints() is now invoked after
> > inserting a tuple (in the patch). So, we need to ignore the newly inserted
> > tuple when checking conflicts in check_exclusion_or_unique_constraint().
> >
> > > Would this change ExecInsert's behavior as well?
> >
> > Thanks for pointing it out, I will check and reply.
>
> After checking, I think it may affect ExecInsert's behavior if the slot passed
> to ExecCheckIndexConstraints() comes from other tables (e.g. when executing
> INSERT INTO SELECT FROM othertbl), because the slot->tts_tid points to a valid
> position from another table in this case, which can cause
> check_exclusion_or_unique_constraint to skip a tuple unexpectedly.
>
> I thought about two ideas to fix this: One is to reset the slot->tts_tid
> before calling ExecCheckIndexConstraints() in ExecInsert(), but I feel a bit
> uncomfortable with this since it is touching existing logic. So, another idea
> is to just add a new parameter 'tupletid' in ExecCheckIndexConstraints(),
> then pass tupletid = InvalidOffsetNumber when invoking the function in
> ExecInsert() and pass a valid tupletid in the new code paths in the patch.
> The new 'tupletid' will be passed to check_exclusion_or_unique_constraint to
> skip the target tuple. I feel the second one may be better.
>
> What do you think?

Yes, the second approach seems good to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
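[Editor's note] To make idea (2) concrete, a hedged sketch of what the new
parameter could look like; the base signature is roughly the one in
execIndexing.c, but the 'tupletid' addition and both call sites are
illustrative only:

extern bool ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo,
									  TupleTableSlot *slot,
									  EState *estate,
									  ItemPointer conflictTid,
									  ItemPointer tupletid,	/* new: tuple to skip, or invalid */
									  List *arbiterIndexes);

/* In ExecInsert(): nothing inserted yet, so nothing to skip */
ItemPointerSetInvalid(&invalidItemPtr);
ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
						  &invalidItemPtr, arbiterIndexes);

/* In the new conflict-detection paths: skip the just-inserted tuple */
ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
						  &slot->tts_tid, arbiterIndexes);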
Re: Conflict detection and logging in logical replication
On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu) wrote:
>

I was going through v7-0001, and I have some initial comments.

@@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	ExprContext *econtext;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ItemPointerData invalidItemPtr;
 	bool		checkedIndex = false;

 	ItemPointerSetInvalid(conflictTid);
-	ItemPointerSetInvalid(&invalidItemPtr);

 	/*
 	 * Get information from the result relation info structure.
@@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,

 		satisfiesConstraint =
 			check_exclusion_or_unique_constraint(heapRelation, indexRelation,
-												 indexInfo, &invalidItemPtr,
+												 indexInfo, &slot->tts_tid,
												 values, isnull, estate, false,
												 CEOUC_WAIT, true,
												 conflictTid);

What is the purpose of this change? I mean,
'check_exclusion_or_unique_constraint' says that 'tupleid' should be
invalidItemPtr if the tuple is not yet inserted, and
ExecCheckIndexConstraints is called by ExecInsert before inserting the tuple.
So what is this change? Would this change ExecInsert's behavior as well?

+ReCheckConflictIndexes(ResultRelInfo *resultRelInfo, EState *estate,
+					   ConflictType type, List *recheckIndexes,
+					   TupleTableSlot *slot)
+{
+	/* Re-check all the unique indexes for potential conflicts */
+	foreach_oid(uniqueidx, resultRelInfo->ri_onConflictArbiterIndexes)
+	{
+		TupleTableSlot *conflictslot;
+
+		if (list_member_oid(recheckIndexes, uniqueidx) &&
+			FindConflictTuple(resultRelInfo, estate, uniqueidx, slot, &conflictslot))
+		{
+			RepOriginId origin;
+			TimestampTz committs;
+			TransactionId xmin;
+
+			GetTupleCommitTs(conflictslot, &xmin, &origin, &committs);
+			ReportApplyConflict(ERROR, type, resultRelInfo->ri_RelationDesc, uniqueidx,
+								xmin, origin, committs, conflictslot);
+		}
+	}
+}

and

+	conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes;
+
 	if (resultRelInfo->ri_NumIndices > 0)
 		recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-											   slot, estate, false, false,
-											   NULL, NIL, false);
+											   slot, estate, false,
+											   conflictindexes ? true : false,
+											   &conflict,
+											   conflictindexes, false);
+
+	/*
+	 * Rechecks the conflict indexes to fetch the conflicting local tuple
+	 * and reports the conflict. We perform this check here, instead of
+	 * performing an additional index scan before the actual insertion and
+	 * reporting the conflict if any conflicting tuples are found. This is
+	 * to avoid the overhead of executing the extra scan for each INSERT
+	 * operation, even when no conflict arises, which could introduce
+	 * significant overhead to replication, particularly in cases where
+	 * conflicts are rare.
+	 */
+	if (conflict)
+		ReCheckConflictIndexes(resultRelInfo, estate, CT_INSERT_EXISTS,
+							   recheckIndexes, slot);

This logic is confusing. First, you are calling ExecInsertIndexTuples() with
no-duplicate-error for the indexes present in 'ri_onConflictArbiterIndexes',
which means the indexes returned by the function must be a subset of
'ri_onConflictArbiterIndexes'; and later, in ReCheckConflictIndexes(), you are
again processing all the indexes of 'ri_onConflictArbiterIndexes' and checking
whether each of them is in the list returned by ExecInsertIndexTuples(). Why
are we doing that? I think we can directly use the 'recheckIndexes' list
returned by ExecInsertIndexTuples(), since those indexes are guaranteed to be
a subset of ri_onConflictArbiterIndexes. No?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
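[Editor's note] For what it's worth, a sketch of the simplification suggested
above: iterate directly over the list returned by ExecInsertIndexTuples(),
which is already restricted to the arbiter indexes. This is an illustrative
rearrangement of the patch's own code (inside the same ReCheckConflictIndexes
function), not a tested change:

	/* Re-check only the indexes that actually reported a potential conflict */
	foreach_oid(uniqueidx, recheckIndexes)
	{
		TupleTableSlot *conflictslot;

		if (FindConflictTuple(resultRelInfo, estate, uniqueidx, slot,
							  &conflictslot))
		{
			RepOriginId origin;
			TimestampTz committs;
			TransactionId xmin;

			GetTupleCommitTs(conflictslot, &xmin, &origin, &committs);
			ReportApplyConflict(ERROR, type, resultRelInfo->ri_RelationDesc,
								uniqueidx, xmin, origin, committs,
								conflictslot);
		}
	}

This drops both the outer scan of ri_onConflictArbiterIndexes and the
list_member_oid() probe.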
Re: Conflict Detection and Resolution
On Fri, Jul 5, 2024 at 2:23 PM Amit Kapila wrote:
>
> On Fri, Jul 5, 2024 at 11:58 AM Dilip Kumar wrote:
> >
> > On Thu, Jul 4, 2024 at 5:37 PM Amit Kapila wrote:
> > >
> > > So, the situation will be the same. We can even
> > > decide to spill the data to files if the decision is that we need to
> > > wait to avoid network buffer-fill situations. But note that the wait
> > > in the apply worker has consequences: the subscriber won't be able to
> > > confirm the flush position, the publisher won't be able to vacuum the
> > > dead rows, and we won't be able to remove WAL either. Last time when we
> > > discussed the delay_apply feature, we decided not to proceed because
> > > of such issues. This is the reason I proposed a cap on wait time.
> >
> > Yes, spilling to file or a cap on the wait time should help, and as I
> > said above maybe a parallel apply worker can also help.
> >
> > > It is not clear to me how a parallel apply worker can help in this
> > > case. Can you elaborate on what you have in mind?
> >
> > If we decide to wait at commit time, and before starting to apply we
> > already see that the remote commit_ts clock is ahead, then if we apply
> > such transactions using the parallel worker, wouldn't it solve the
> > issue of the network buffer congestion? Now the apply worker can move
> > ahead and fetch new transactions from the buffer, as our waiting
> > transaction will not block it. I understand that if this transaction
> > is going to wait at commit, then any future transaction that we are
> > going to fetch might also wait again, because if the previous
> > transaction committed before it is in the future, then the subsequent
> > transaction committed after it must also be in the future; so eventually
> > that will also go to another parallel worker, and soon we end up
> > consuming all the parallel workers if the clock skew is large. So I
> > won't say this will resolve the problem, and we would still have to
> > fall back to spilling to the disk, but that's just in the worst
> > case when the clock skew is really huge. In most cases, which are due
> > to slight clock drift, by the time we apply the medium-to-large-size
> > transaction, the local clock should be able to catch up to the remote
> > commit_ts and we might not have to wait in most of the cases.
>
> Yeah, this is possible, but even if we go with the spilling logic at first,
> it should work for all cases. If we get some complaints then we can
> explore executing such transactions by parallel apply workers.
> Personally, I am of the opinion that clock synchronization should be
> handled outside the database system via network time protocols like
> NTP. Still, we can have some simple solution to inform users about the
> clock skew.

Yeah, that makes sense; in the first version we can have a simple solution,
and we can further improve it based on the feedback.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Thu, Jul 4, 2024 at 5:37 PM Amit Kapila wrote:
>
> So, the situation will be the same. We can even
> decide to spill the data to files if the decision is that we need to
> wait to avoid network buffer-fill situations. But note that the wait
> in the apply worker has consequences: the subscriber won't be able to
> confirm the flush position, the publisher won't be able to vacuum the
> dead rows, and we won't be able to remove WAL either. Last time when we
> discussed the delay_apply feature, we decided not to proceed because
> of such issues. This is the reason I proposed a cap on wait time.

Yes, spilling to file or a cap on the wait time should help, and as I said
above maybe a parallel apply worker can also help.

> It is not clear to me how a parallel apply worker can help in this
> case. Can you elaborate on what you have in mind?

If we decide to wait at commit time, and before starting to apply we already
see that the remote commit_ts clock is ahead, then if we apply such
transactions using the parallel worker, wouldn't it solve the issue of the
network buffer congestion? Now the apply worker can move ahead and fetch new
transactions from the buffer, as our waiting transaction will not block it. I
understand that if this transaction is going to wait at commit, then any
future transaction that we are going to fetch might also wait again, because
if the previous transaction committed before it is in the future, then the
subsequent transaction committed after it must also be in the future; so
eventually that will also go to another parallel worker, and soon we end up
consuming all the parallel workers if the clock skew is large. So I won't say
this will resolve the problem, and we would still have to fall back to
spilling to the disk, but that's just in the worst case when the clock skew is
really huge. In most cases, which are due to slight clock drift, by the time
we apply the medium-to-large-size transaction, the local clock should be able
to catch up to the remote commit_ts and we might not have to wait in most of
the cases.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 5:08 PM shveta malik wrote:
>
> On Wed, Jul 3, 2024 at 4:12 PM Dilip Kumar wrote:
> >
> > On Wed, Jul 3, 2024 at 4:02 PM shveta malik wrote:
> > >
> > > On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar wrote:
> > > >
> > > > On Wed, Jul 3, 2024 at 11:00 AM shveta malik wrote:
> > > > >
> > > > > > Yes, I also think it should be independent of CDR. IMHO, it should be
> > > > > > based on the user-configured maximum clock skew tolerance and can be
> > > > > > independent of CDR.
> > > > >
> > > > > +1
> > > > >
> > > > > > IIUC we would make the remote apply wait just
> > > > > > before committing if the remote commit timestamp is ahead of the local
> > > > > > clock by more than the maximum clock skew tolerance, is that correct?
> > > > >
> > > > > +1 on the condition to wait.
> > > > >
> > > > > But I think we should make the apply worker wait during begin
> > > > > (apply_handle_begin) instead of commit. It makes more sense to delay
> > > > > the entire operation to manage clock skew rather than the commit
> > > > > alone. And only then can CDR's timestamp-based resolutions, which happen
> > > > > much prior to the commit stage, benefit from this. Thoughts?
> > > >
> > > > But do we really need to wait at apply_handle_begin()? I mean if we
> > > > already know the commit_ts then we can perform the conflict resolution, no?
> > >
> > > I would like to highlight one point here: the resultant data may
> > > be different depending upon at what stage (begin or commit) we
> > > conclude to wait. Example:
> > >
> > > --max_clock_skew set to 0, i.e. no tolerance for clock skew.
> > > --Remote update with commit_timestamp = 10.20AM.
> > > --Local clock (which is say 5 min behind) shows = 10.15AM.
> > >
> > > Case 1: Wait during Begin:
> > > When the remote update arrives at the local node, the apply worker waits
> > > till the local clock hits 'remote's commit_ts - max_clock_skew', i.e. till
> > > 10.20 AM. In the meantime (during the wait period of the apply worker), if
> > > some local update on the same row has happened at say 10.18am (local
> > > clock), that will be applied first. Now when the apply worker's wait is
> > > over, it will detect an 'update_differ' conflict, and as per
> > > 'last_update_wins', the remote tuple will win as 10.20 is later than
> > > 10.18.
> > >
> > > Case 2: Wait during Commit:
> > > When the remote update arrives at the local node, it finds no conflict and
> > > goes for commit. But before commit, it waits till the local clock hits
> > > 10.20 AM. In the meantime (during the wait period of the apply worker), if
> > > some local update is trying to update the same row, say at 10.18, it
> > > has to wait (due to locks taken by the remote update on that row) and the
> > > remote tuple will get committed first with a commit timestamp of 10.20.
> > > Then the local update will proceed and will overwrite the remote tuple.
> > >
> > > So in case 1, the remote tuple is the final change, while in case 2, the
> > > local tuple is the final change.
> >
> > Got it, but which case is correct? I think both. Because in case 1 the
> > local commit's commit_ts is 10:18 and the remote commit's commit_ts is
> > 10:20, so the remote apply wins. And in case 2, the remote commit's
> > commit_ts is 10:20 whereas the local commit's commit_ts must be
> > 10:20 + delta (because it waited for the remote transaction to get
> > committed).
> >
> > Now say which is better: in case 1 we have to make the remote apply
> > wait at the beginning state without knowing what the local clock would
> > be when it actually comes to commit; it may so happen that if we
> > choose case 2, by the time the remote transaction finishes applying, the
> > local clock is beyond 10:20 and we do not even need to wait?
>
> Yes, agreed that the wait time could be lesser to some extent in case 2.
> But the wait during commit will make user operations on the same row
> wait, without the user having any clue of the concurrent blocking
> operations. I am not sure if it will be acceptable.

I don't think there is any problem with the acceptance of the user
experience, because even while applying the remote transaction (irrespective
of whether we implement this wait feature), the user transaction might have
to wait if it is updating the common rows, right?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
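[Editor's note] For reference, a minimal sketch of the wait discussed in both
cases, expressed with existing backend timestamp primitives; the function
name and placement are illustrative only, and TimestampTz arithmetic is in
microseconds:

static void
wait_for_local_clock(TimestampTz remote_commit_ts, int64 max_clock_skew_us)
{
	/* Delay until local clock >= remote commit_ts - max_clock_skew; in
	 * case 1 this would run in apply_handle_begin(), in case 2 just
	 * before applying the commit record. */
	for (;;)
	{
		TimestampTz now = GetCurrentTimestamp();

		if (now >= remote_commit_ts - max_clock_skew_us)
			break;

		/* sleep in short slices so the worker stays responsive */
		pg_usleep((long) Min(remote_commit_ts - max_clock_skew_us - now,
							 100000));
	}
}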
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 4:48 PM Amit Kapila wrote:
>
> On Wed, Jul 3, 2024 at 4:04 PM Dilip Kumar wrote:
> >
> > On Wed, Jul 3, 2024 at 3:35 PM Amit Kapila wrote:
> > >
> > > On Wed, Jul 3, 2024 at 2:16 PM Dilip Kumar wrote:
> > > >
> > > > On Wed, Jul 3, 2024 at 12:30 PM Amit Kapila wrote:
> > > > >
> > > > > On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar wrote:
> > > > >
> > > > > But waiting after applying the operations and before applying the
> > > > > commit would mean that we need to wait with the locks held. That could
> > > > > be a recipe for deadlocks in the system. I see your point related to
> > > > > performance, but as we are not expecting clock skew in normal cases, we
> > > > > shouldn't be too bothered about the performance due to this. If
> > > > > there is clock skew, we expect users to fix it; this is just a
> > > > > worst-case aid for users.
> > > >
> > > > But if we make it wait at the very first operation, that means we will
> > > > not suck more decoded data from the network, and wouldn't that make the
> > > > sender wait for the network buffer to get consumed by the receiver?
> > >
> > > That would be true even if we wait just before applying the commit
> > > record, considering the transaction is small and the wait time is
> > > large.
> >
> > What I am saying is that if we are not applying the whole transaction,
> > it means we are not receiving it either, unless we plan to spill it to
> > a file. If we don't spill it to a file, the network buffer will fill
> > up very quickly. This issue wouldn't occur if we waited right before
> > the commit because, by that time, we would have already received all
> > the data from the network.
>
> We would have received the transaction data, but there could be other
> transactions that need to wait because the apply worker is waiting
> before the commit.

Yeah, that's a valid point; can a parallel apply worker help here?

> So, the situation will be the same. We can even
> decide to spill the data to files if the decision is that we need to
> wait to avoid network buffer-fill situations. But note that the wait
> in the apply worker has consequences: the subscriber won't be able to
> confirm the flush position, the publisher won't be able to vacuum the
> dead rows, and we won't be able to remove WAL either. Last time when we
> discussed the delay_apply feature, we decided not to proceed because
> of such issues. This is the reason I proposed a cap on wait time.

Yes, spilling to file or a cap on the wait time should help, and as I said
above maybe a parallel apply worker can also help. So I agree that the
problem with network buffers arises in both cases, whether we wait before
committing or before beginning. So, keeping that in mind, I don't have any
strong objections against waiting at the beginning if it simplifies the
design compared to waiting at the commit.

However, one point to remember in favor of waiting before applying the commit
is that if we decide to wait before beginning the transaction, we would end
up waiting in many more cases compared to waiting before committing. Because
in cases when transactions are large and the clock skew is small, the local
clock would have already passed the remote commit_ts by the time we reach the
commit.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 4:02 PM shveta malik wrote:
>
> On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar wrote:
> >
> > On Wed, Jul 3, 2024 at 11:00 AM shveta malik wrote:
> > >
> > > > Yes, I also think it should be independent of CDR. IMHO, it should be
> > > > based on the user-configured maximum clock skew tolerance and can be
> > > > independent of CDR.
> > >
> > > +1
> > >
> > > > IIUC we would make the remote apply wait just
> > > > before committing if the remote commit timestamp is ahead of the local
> > > > clock by more than the maximum clock skew tolerance, is that correct?
> > >
> > > +1 on the condition to wait.
> > >
> > > But I think we should make the apply worker wait during begin
> > > (apply_handle_begin) instead of commit. It makes more sense to delay
> > > the entire operation to manage clock skew rather than the commit
> > > alone. And only then can CDR's timestamp-based resolutions, which happen
> > > much prior to the commit stage, benefit from this. Thoughts?
> >
> > But do we really need to wait at apply_handle_begin()? I mean if we
> > already know the commit_ts then we can perform the conflict resolution, no?
>
> I would like to highlight one point here: the resultant data may
> be different depending upon at what stage (begin or commit) we
> conclude to wait. Example:
>
> --max_clock_skew set to 0, i.e. no tolerance for clock skew.
> --Remote update with commit_timestamp = 10.20AM.
> --Local clock (which is say 5 min behind) shows = 10.15AM.
>
> Case 1: Wait during Begin:
> When the remote update arrives at the local node, the apply worker waits
> till the local clock hits 'remote's commit_ts - max_clock_skew', i.e. till
> 10.20 AM. In the meantime (during the wait period of the apply worker), if
> some local update on the same row has happened at say 10.18am (local clock),
> that will be applied first. Now when the apply worker's wait is over, it
> will detect an 'update_differ' conflict, and as per 'last_update_wins', the
> remote tuple will win as 10.20 is later than 10.18.
>
> Case 2: Wait during Commit:
> When the remote update arrives at the local node, it finds no conflict and
> goes for commit. But before commit, it waits till the local clock hits
> 10.20 AM. In the meantime (during the wait period of the apply worker), if
> some local update is trying to update the same row, say at 10.18, it has to
> wait (due to locks taken by the remote update on that row) and the remote
> tuple will get committed first with a commit timestamp of 10.20. Then the
> local update will proceed and will overwrite the remote tuple.
>
> So in case 1, the remote tuple is the final change, while in case 2, the
> local tuple is the final change.

Got it, but which case is correct? I think both. Because in case 1 the local
commit's commit_ts is 10:18 and the remote commit's commit_ts is 10:20, so
the remote apply wins. And in case 2, the remote commit's commit_ts is 10:20
whereas the local commit's commit_ts must be 10:20 + delta (because it waited
for the remote transaction to get committed).

Now say which is better: in case 1 we have to make the remote apply wait at
the beginning state without knowing what the local clock would be when it
actually comes to commit; it may so happen that if we choose case 2, by the
time the remote transaction finishes applying, the local clock is beyond
10:20 and we do not even need to wait?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 3:35 PM Amit Kapila wrote:
>
> On Wed, Jul 3, 2024 at 2:16 PM Dilip Kumar wrote:
> >
> > On Wed, Jul 3, 2024 at 12:30 PM Amit Kapila wrote:
> > >
> > > On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar wrote:
> > >
> > > But waiting after applying the operations and before applying the
> > > commit would mean that we need to wait with the locks held. That could
> > > be a recipe for deadlocks in the system. I see your point related to
> > > performance, but as we are not expecting clock skew in normal cases, we
> > > shouldn't be too bothered about the performance due to this. If
> > > there is clock skew, we expect users to fix it; this is just a
> > > worst-case aid for users.
> >
> > But if we make it wait at the very first operation, that means we will
> > not suck more decoded data from the network, and wouldn't that make the
> > sender wait for the network buffer to get consumed by the receiver?
>
> That would be true even if we wait just before applying the commit
> record, considering the transaction is small and the wait time is
> large.

What I am saying is that if we are not applying the whole transaction, it
means we are not receiving it either, unless we plan to spill it to a file.
If we don't spill it to a file, the network buffer will fill up very quickly.
This issue wouldn't occur if we waited right before the commit because, by
that time, we would have already received all the data from the network.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 12:30 PM Amit Kapila wrote:
>
> On Wed, Jul 3, 2024 at 11:29 AM Dilip Kumar wrote:
>
> But waiting after applying the operations and before applying the
> commit would mean that we need to wait with the locks held. That could
> be a recipe for deadlocks in the system. I see your point related to
> performance, but as we are not expecting clock skew in normal cases, we
> shouldn't be too bothered about the performance due to this. If
> there is clock skew, we expect users to fix it; this is just a
> worst-case aid for users.

But if we make it wait at the very first operation, that means we will not
suck more decoded data from the network, and wouldn't that make the sender
wait for the network buffer to get consumed by the receiver? Also, we already
have handling for parallel apply workers, so if we do not have an issue of
deadlock there, or if we can handle those issues there, we can do it here as
well, no?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jul 3, 2024 at 11:00 AM shveta malik wrote:
>
> > Yes, I also think it should be independent of CDR. IMHO, it should be
> > based on the user-configured maximum clock skew tolerance and can be
> > independent of CDR.
>
> +1
>
> > IIUC we would make the remote apply wait just
> > before committing if the remote commit timestamp is ahead of the local
> > clock by more than the maximum clock skew tolerance, is that correct?
>
> +1 on the condition to wait.
>
> But I think we should make the apply worker wait during begin
> (apply_handle_begin) instead of commit. It makes more sense to delay
> the entire operation to manage clock skew rather than the commit
> alone. And only then can CDR's timestamp-based resolutions, which happen
> much prior to the commit stage, benefit from this. Thoughts?

But do we really need to wait at apply_handle_begin()? I mean if we already
know the commit_ts then we can perform the conflict resolution, no? I mean we
should wait before committing because we are considering this remote
transaction to be in the future, and we do not want to confirm the commit of
this transaction to the remote node before the local clock reaches the
record's commit_ts, to preserve the causal order. However, we can still
perform conflict resolution beforehand since we already know the commit_ts.
The conflict resolution function will be something like "out_version =
CRF(version1_commit_ts, version2_commit_ts)", so the result should be the
same regardless of when we apply it, correct? From a performance standpoint,
wouldn't it be beneficial to perform as much work as possible in advance? By
the time we apply all the operations, the local clock might already be in
sync with the commit_ts of the remote transaction. Am I missing something?

However, while thinking about this, I'm wondering about how we will handle
the streaming of in-progress transactions. If we start applying with parallel
workers, we might not know the commit_ts of those transactions since they may
not have been committed yet. One simple option could be to prevent parallel
workers from applying in-progress transactions when CDR is set up. Instead,
we could let these transactions spill to files and only apply them once we
receive the commit record.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
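[Editor's note] A minimal sketch of the CRF shape mentioned above, for a
'last_update_wins' resolver; the name and tie-breaking are illustrative, and
a real implementation would likely also consult the replication origin for
exact timestamp ties:

static bool
remote_change_wins(TimestampTz local_commit_ts, TimestampTz remote_commit_ts)
{
	/* A pure function of the two commit timestamps, so it yields the same
	 * answer whether it runs while applying the operations or at commit. */
	return remote_commit_ts >= local_commit_ts;
}

Since the decision depends only on the two commit timestamps, performing it
early (while still delaying the commit confirmation) gives the same outcome
as performing it after the wait.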
Re: Conflict Detection and Resolution
On Tue, Jul 2, 2024 at 2:40 PM shveta malik wrote:
>
> On Wed, Jun 19, 2024 at 1:52 PM Dilip Kumar wrote:
> >
> > On Tue, Jun 18, 2024 at 3:29 PM shveta malik wrote:
> > > On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar wrote:
> > >
> > > I tried to work out a few scenarios with this, where the apply worker
> > > will wait until its local clock hits 'remote_commit_ts - max_skew
> > > permitted'. Please have a look.
> > >
> > > Let's say we have a GUC to configure the max_clock_skew permitted.
> > > The resolver is last_update_wins in both cases.
> > >
> > > 1) Case 1: max_clock_skew set to 0, i.e. no tolerance for clock skew.
> > >
> > > Remote update with commit_timestamp = 10.20AM.
> > > Local clock (which is say 5 min behind) shows = 10.15AM.
> > >
> > > When the remote update arrives at the local node, we see that the skew
> > > is greater than max_clock_skew, and thus the apply worker waits till the
> > > local clock hits 'remote's commit_ts - max_clock_skew', i.e. till 10.20
> > > AM. Once the local clock hits 10.20 AM, the worker applies the remote
> > > change with a commit_ts of 10.20AM. In the meantime (during the wait
> > > period of the apply worker), if some local update on the same row has
> > > happened at say 10.18am, that will be applied first, which will later be
> > > overwritten by the above remote change of 10.20AM as the remote change's
> > > timestamp appears more recent, even though it happened earlier than the
> > > local change.
> >
> > For the sake of simplicity let's call the change that happened at
> > 10:20 AM change-1 and the change that happened at 10:15 as change-2,
> > and assume we are talking about the synchronous commit only.
> >
> > I think now from an application perspective the change-1 wouldn't have
> > caused the change-2, because we delayed applying change-2 on the local
> > node, which would have delayed the confirmation of the change-1 to the
> > application; that means we have got the change-2 on the local node
> > without the confirmation of change-1, hence change-2 has no causal
> > dependency on the change-1. So it's fine that we perform change-1
> > before change-2, and the timestamp will also show the same at any other
> > node if they receive these 2 changes.
> >
> > (These change-1/change-2 mix-ups are corrected later in the thread.)
> >
> > The goal is to ensure that if we define the order where change-2
> > happens before change-1, this same order should be visible on all
> > other nodes. This will hold true because the commit timestamp of
> > change-2 is earlier than that of change-1.
> >
> > > 2) Case 2: max_clock_skew is set to 2min.
> > >
> > > Remote update with commit_timestamp = 10.20AM.
> > > Local clock (which is say 5 min behind) = 10.15AM.
> > >
> > > Now the apply worker will notice a skew greater than 2min and thus will
> > > wait till the local clock hits 'remote's commit_ts - max_clock_skew',
> > > i.e. 10.18, and will apply the change with a commit_ts of 10.20 (as we
> > > always save the origin's commit timestamp into the local commit_ts; see
> > > RecordTransactionCommit->TransactionTreeSetCommitTsData). Now let's say
> > > another local update is triggered at 10.19am; it will be applied locally
> > > but it will be ignored on the remote node. On the remote node, the
> > > existing change with a timestamp of 10.20 am will win, resulting in data
> > > divergence.
> >
> > Let's call the 10:20 AM change change-1 and the change that happened at
> > 10:19 change-2.
> >
> > IIUC, although we apply change-1 at 10:18 AM, the commit_ts of that
> > change is 10:20, and the same will be visible to all other nodes. So in
> > conflict resolution, change-1 still happened after change-2 because
> > change-2's commit_ts is 10:19 AM. Now there could be a problem with the
> > causal order, because we applied change-1 at 10:18 AM, so the
> > application might have gotten confirmation at 10:18 AM, and change-2 on
> > the local node may be triggered as a result of the confirmation of
> > change-1; that means change-2 now has a causal dependency on change-1,
> > but the commit_ts shows change-2 happened before change-1 on all the
> > nodes.
> >
> > So, is this acceptable? I think yes, because the user has configured a
> > maximum clock skew of 2 minutes, which means the detected order might
> > not always align with the causal order for transactions occurring
> > within that time frame.
Re: Conflict Detection and Resolution
On Wed, Jun 19, 2024 at 2:36 PM shveta malik wrote: > > On Wed, Jun 19, 2024 at 1:52 PM Dilip Kumar wrote: > > > > On Tue, Jun 18, 2024 at 3:29 PM shveta malik wrote: > > > On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar > > > wrote: > > > > > > I tried to work out a few scenarios with this, where the apply worker > > > will wait until its local clock hits 'remote_commit_tts - max_skew > > > permitted'. Please have a look. > > > > > > Let's say, we have a GUC to configure max_clock_skew permitted. > > > Resolver is last_update_wins in both cases. > > > > > > 1) Case 1: max_clock_skew set to 0 i.e. no tolerance for clock skew. > > > > > > Remote Update with commit_timestamp = 10.20AM. > > > Local clock (which is say 5 min behind) shows = 10.15AM. > > > > > > When remote update arrives at local node, we see that skew is greater > > > than max_clock_skew and thus apply worker waits till local clock hits > > > 'remote's commit_tts - max_clock_skew' i.e. till 10.20 AM. Once the > > > local clock hits 10.20 AM, the worker applies the remote change with > > > commit_tts of 10.20AM. In the meantime (during wait period of apply > > > worker)) if some local update on same row has happened at say 10.18am, > > > that will applied first, which will be later overwritten by above > > > remote change of 10.20AM as remote-change's timestamp appear more > > > latest, even though it has happened earlier than local change.

Oops, lots of mistakes in the usage of change-1 and change-2, sorry about that.

> > For the sake of simplicity let's call the change that happened at > > 10:20 AM change-1 and the change that happened at 10:15 as change-2 > > and assume we are talking about the synchronous commit only. > > Do you mean "the change that happened at 10:18 as change-2"

Right

> > > > I think now from an application perspective the change-1 wouldn't have > > caused the change-2 because we delayed applying change-2 on the local > > node > > Do you mean "we delayed applying change-1 on the local node."

Right

> >which would have delayed the confirmation of the change-1 to the > > application that means we have got the change-2 on the local node > > without the confirmation of change-1 hence change-2 has no causal > > dependency on the change-1. So it's fine that we perform change-1 > > before change-2 > > Do you mean "So it's fine that we perform change-2 before change-1"

Right

> >and the timestamp will also show the same at any other > > node if they receive these 2 changes. > > > > The goal is to ensure that if we define the order where change-2 > > happens before change-1, this same order should be visible on all > > other nodes. This will hold true because the commit timestamp of > > change-2 is earlier than that of change-1. > > Considering the above corrections as base, I agree with this.

+1

> > > 2) Case 2: max_clock_skew is set to 2min. > > > > > > Remote Update with commit_timestamp=10.20AM > > > Local clock (which is say 5 min behind) = 10.15AM. > > > > > > Now apply worker will notice skew greater than 2min and thus will wait > > > till local clock hits 'remote's commit_tts - max_clock_skew' i.e. > > > 10.18 and will apply the change with commit_tts of 10.20 ( as we > > > always save the origin's commit timestamp into local commit_tts, see > > > RecordTransactionCommit->TransactionTreeSetCommitTsData). Now lets say > > > another local update is triggered at 10.19am, it will be applied > > > locally but it will be ignored on remote node.
On the remote node , > > > the existing change with a timestamp of 10.20 am will win resulting in > > > data divergence. > > > > Let's call the 10:20 AM change as a change-1 and the change that > > happened at 10:19 as change-2 > > > > IIUC, although we apply the change-1 at 10:18 AM the commit_ts of that > > commit_ts of that change is 10:20, and the same will be visible to all > > other nodes. So in conflict resolution still the change-1 happened > > after the change-2 because change-2's commit_ts is 10:19 AM. Now > > there could be a problem with the causal order because we applied the > > change-1 at 10:18 AM so the application might have gotten confirmation > > at 10:18 AM and the change-2 of the local node may be triggered as a > > result of confirmation of the change-1 that means now change-2 has a > > causal dependency on the change-1 but commit_ts shows change-2 > > happened before the change-1 on all the nodes. > > > > So, is this acceptable? I think yes because the user has configured a > > maximum clock skew of 2 minutes, which means the detected order might > > not always align with the causal order for transactions occurring > > within that time frame. > > Agree. I had the same thoughts, and wanted to confirm my understanding. Okay -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Tue, Jun 18, 2024 at 3:29 PM shveta malik wrote: > On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar wrote: > > I tried to work out a few scenarios with this, where the apply worker > will wait until its local clock hits 'remote_commit_tts - max_skew > permitted'. Please have a look. > > Let's say, we have a GUC to configure max_clock_skew permitted. > Resolver is last_update_wins in both cases. > > 1) Case 1: max_clock_skew set to 0 i.e. no tolerance for clock skew. > > Remote Update with commit_timestamp = 10.20AM. > Local clock (which is say 5 min behind) shows = 10.15AM. > > When remote update arrives at local node, we see that skew is greater > than max_clock_skew and thus apply worker waits till local clock hits > 'remote's commit_tts - max_clock_skew' i.e. till 10.20 AM. Once the > local clock hits 10.20 AM, the worker applies the remote change with > commit_tts of 10.20AM. In the meantime (during wait period of apply > worker)) if some local update on same row has happened at say 10.18am, > that will applied first, which will be later overwritten by above > remote change of 10.20AM as remote-change's timestamp appear more > latest, even though it has happened earlier than local change.

For the sake of simplicity let's call the change that happened at 10:20 AM change-1 and the change that happened at 10:15 as change-2 and assume we are talking about the synchronous commit only.

I think now from an application perspective the change-1 wouldn't have caused the change-2 because we delayed applying change-2 on the local node which would have delayed the confirmation of the change-1 to the application that means we have got the change-2 on the local node without the confirmation of change-1 hence change-2 has no causal dependency on the change-1. So it's fine that we perform change-1 before change-2 and the timestamp will also show the same at any other node if they receive these 2 changes.

The goal is to ensure that if we define the order where change-2 happens before change-1, this same order should be visible on all other nodes. This will hold true because the commit timestamp of change-2 is earlier than that of change-1.

> 2) Case 2: max_clock_skew is set to 2min. > > Remote Update with commit_timestamp=10.20AM > Local clock (which is say 5 min behind) = 10.15AM. > > Now apply worker will notice skew greater than 2min and thus will wait > till local clock hits 'remote's commit_tts - max_clock_skew' i.e. > 10.18 and will apply the change with commit_tts of 10.20 ( as we > always save the origin's commit timestamp into local commit_tts, see > RecordTransactionCommit->TransactionTreeSetCommitTsData). Now lets say > another local update is triggered at 10.19am, it will be applied > locally but it will be ignored on remote node. On the remote node , > the existing change with a timestamp of 10.20 am will win resulting in > data divergence.

Let's call the 10:20 AM change as change-1 and the change that happened at 10:19 as change-2.

IIUC, although we apply the change-1 at 10:18 AM the commit_ts of that change is 10:20, and the same will be visible to all other nodes. So in conflict resolution still the change-1 happened after the change-2 because change-2's commit_ts is 10:19 AM.
Now there could be a problem with the causal order: because we applied change-1 at 10:18 AM, the application might have gotten confirmation at 10:18 AM, and change-2 on the local node may be triggered as a result of that confirmation of change-1. That means change-2 now has a causal dependency on change-1, yet the commit_ts shows change-2 happening before change-1 on all the nodes.

So, is this acceptable? I think yes, because the user has configured a maximum clock skew of 2 minutes, which means the detected order might not always align with the causal order for transactions occurring within that time frame. Generally, the ideal configuration for max_clock_skew should be a multiple of the network round-trip time. Assuming this configuration, we wouldn't encounter this problem, because for change-2 to be caused by change-1, the client would need to get confirmation of change-1 and then trigger change-2, which would take at least 2-3 network round trips.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
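The wait being described is simple to state in code. A standalone C sketch (names are hypothetical; inside the backend this would presumably use GetCurrentTimestamp() and the latch machinery rather than a raw sleep):

#include <stdint.h>
#include <unistd.h>

typedef int64_t TimestampTz;                 /* microseconds */

extern TimestampTz local_clock_now(void);    /* stand-in for GetCurrentTimestamp() */

/*
 * Delay applying a remote change until the local clock reaches
 * remote_commit_ts - max_clock_skew. With max_clock_skew = 0, a local
 * clock at 10.15 waits until 10.20 before applying a 10.20 commit.
 */
static void
wait_for_local_clock(TimestampTz remote_commit_ts, int64_t max_clock_skew_us)
{
    TimestampTz wakeup = remote_commit_ts - max_clock_skew_us;
    TimestampTz now = local_clock_now();

    if (now < wakeup)
        usleep((useconds_t) (wakeup - now));  /* sketch only; not interruptible */
}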
Re: Conflict Detection and Resolution
On Tue, Jun 18, 2024 at 12:11 PM Amit Kapila wrote: > > On Tue, Jun 18, 2024 at 11:54 AM Dilip Kumar wrote: > > > > On Mon, Jun 17, 2024 at 8:51 PM Robert Haas wrote: > > > > > > On Mon, Jun 17, 2024 at 1:42 AM Amit Kapila > > > wrote: > > > > The difference w.r.t the existing mechanisms for holding deleted data > > > > is that we don't know whether we need to hold off the vacuum from > > > > cleaning up the rows because we can't say with any certainty whether > > > > other nodes will perform any conflicting operations in the future. > > > > Using the example we discussed, > > > > Node A: > > > > T1: INSERT INTO t (id, value) VALUES (1,1); > > > > T2: DELETE FROM t WHERE id = 1; > > > > > > > > Node B: > > > > T3: UPDATE t SET value = 2 WHERE id = 1; > > > > > > > > Say the order of receiving the commands is T1-T2-T3. We can't predict > > > > whether we will ever get T-3, so on what basis shall we try to prevent > > > > vacuum from removing the deleted row? > > > > > > The problem arises because T2 and T3 might be applied out of order on > > > some nodes. Once either one of them has been applied on every node, no > > > further conflicts are possible. > > > > If we decide to skip the update whether the row is missing or deleted, > > we indeed reach the same end result regardless of the order of T2, T3, > > and Vacuum. Here's how it looks in each case: > > > > Case 1: T1, T2, Vacuum, T3 -> Skip the update for a non-existing row > > -> end result we do not have a row. > > Case 2: T1, T2, T3 -> Skip the update for a deleted row -> end result > > we do not have a row. > > Case 3: T1, T3, T2 -> deleted the row -> end result we do not have a row. > > > > In case 3, how can deletion be successful? The row required to be > deleted has already been updated.

Hmm, I was considering this case in the example you gave above [1]: we have updated some fields of the row with id=1, but isn't this row still detectable by the delete, since the delete will find it by id=1 and we haven't updated the id? I was making the point w.r.t. the example used above.

[1] > > > > Node A: > > > > T1: INSERT INTO t (id, value) VALUES (1,1); > > > > T2: DELETE FROM t WHERE id = 1; > > > > > > > > Node B: > > > > T3: UPDATE t SET value = 2 WHERE id = 1;

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Mon, Jun 17, 2024 at 8:51 PM Robert Haas wrote: > > On Mon, Jun 17, 2024 at 1:42 AM Amit Kapila wrote: > > The difference w.r.t the existing mechanisms for holding deleted data > > is that we don't know whether we need to hold off the vacuum from > > cleaning up the rows because we can't say with any certainty whether > > other nodes will perform any conflicting operations in the future. > > Using the example we discussed, > > Node A: > > T1: INSERT INTO t (id, value) VALUES (1,1); > > T2: DELETE FROM t WHERE id = 1; > > > > Node B: > > T3: UPDATE t SET value = 2 WHERE id = 1; > > > > Say the order of receiving the commands is T1-T2-T3. We can't predict > > whether we will ever get T-3, so on what basis shall we try to prevent > > vacuum from removing the deleted row? > > The problem arises because T2 and T3 might be applied out of order on > some nodes. Once either one of them has been applied on every node, no > further conflicts are possible.

If we decide to skip the update whether the row is missing or deleted, we indeed reach the same end result regardless of the order of T2, T3, and Vacuum. Here's how it looks in each case:

Case 1: T1, T2, Vacuum, T3 -> Skip the update for a non-existing row -> end result we do not have a row.
Case 2: T1, T2, T3 -> Skip the update for a deleted row -> end result we do not have a row.
Case 3: T1, T3, T2 -> deleted the row -> end result we do not have a row.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
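The convergence of the three cases can be seen in a tiny standalone C sketch (illustrative names only): an incoming UPDATE whose target row is gone, whether just deleted or already vacuumed away, is simply skipped, so every ordering of T2, T3, and Vacuum ends with no row.

typedef enum { ROW_FOUND, ROW_DELETED, ROW_MISSING } RowLookup;

/* Returns 1 if the remote UPDATE should be applied, 0 if it is skipped. */
static int
should_apply_remote_update(RowLookup lookup)
{
    switch (lookup)
    {
        case ROW_FOUND:
            return 1;       /* normal apply (case 3: T3 applies, then T2 deletes) */
        case ROW_DELETED:   /* dead tuple still on disk: case 2 */
        case ROW_MISSING:   /* tuple already vacuumed away: case 1 */
        default:
            return 0;       /* skip; all three orderings converge to "no row" */
    }
}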
Re: Conflict Detection and Resolution
On Tue, Jun 18, 2024 at 10:17 AM Dilip Kumar wrote: > > On Mon, Jun 17, 2024 at 3:23 PM Amit Kapila wrote: > > > > On Wed, Jun 12, 2024 at 10:03 AM Dilip Kumar wrote: > > > > > > On Tue, Jun 11, 2024 at 7:44 PM Tomas Vondra > > > wrote: > > > > > > > > Yes, that's correct. However, many cases could benefit from the > > > > > update_deleted conflict type if it can be implemented reliably. That's > > > > > why we wanted to give it a try. But if we can't achieve predictable > > > > > results with it, I'm fine to drop this approach and conflict_type. We > > > > > can consider a better design in the future that doesn't depend on > > > > > non-vacuumed entries and provides a more robust method for identifying > > > > > deleted rows. > > > > > > > > > > > > > I agree having a separate update_deleted conflict would be beneficial, > > > > I'm not arguing against that - my point is actually that I think this > > > > conflict type is required, and that it needs to be detected reliably. > > > > > > > > > > When working with a distributed system, we must accept some form of > > > eventual consistency model. However, it's essential to design a > > > predictable and acceptable behavior. For example, if a change is a > > > result of a previous operation (such as an update on node B triggered > > > after observing an operation on node A), we can say that the operation > > > on node A happened before the operation on node B. Conversely, if > > > operations on nodes A and B are independent, we consider them > > > concurrent. > > > > > > In distributed systems, clock skew is a known issue. To establish a > > > consistency model, we need to ensure it guarantees the > > > "happens-before" relationship. Consider a scenario with three nodes: > > > NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and > > > subsequently NodeB makes changes, and then both NodeA's and NodeB's > > > changes are sent to NodeC, the clock skew might make NodeB's changes > > > appear to have occurred before NodeA's changes. However, we should > > > maintain data that indicates NodeB's changes were triggered after > > > NodeA's changes arrived at NodeB. This implies that logically, NodeB's > > > changes happened after NodeA's changes, despite what the timestamps > > > suggest. > > > > > > A common method to handle such cases is using vector clocks for > > > conflict resolution. > > > > > > > I think the unbounded size of the vector could be a problem to store > > for each event. However, while researching previous discussions, it > > came to our notice that we have discussed this topic in the past as > > well in the context of standbys. For recovery_min_apply_delay, we > > decided the clock skew is not a problem as the settings of this > > parameter are much larger than typical time deviations between servers > > as mentioned in docs. Similarly for casual reads [1], there was a > > proposal to introduce max_clock_skew parameter and suggesting the user > > to make sure to have NTP set up correctly. We have tried to check > > other databases (like Ora and BDR) where CDR is implemented but didn't > > find anything specific to clock skew. So, I propose to go with a GUC > > like max_clock_skew such that if the difference of time between the > > incoming transaction's commit time and the local time is more than > > max_clock_skew then we raise an ERROR. 
It is not clear to me that > > putting bigger effort into clock skew is worth especially when other > > systems providing CDR feature (like Ora or BDR) for decades have not > > done anything like vector clocks. It is possible that this is less of > > a problem w.r.t CDR and just detecting the anomaly in clock skew is > > good enough. > > I believe that if we've accepted this solution elsewhere, then we can > also consider the same. Basically, we're allowing the application to > set its tolerance for clock skew. And, if the skew exceeds that > tolerance, it's the application's responsibility to synchronize; > otherwise, an error will occur. This approach seems reasonable.

This model can be further extended by making the apply worker wait if the remote transaction's commit_ts is greater than the local timestamp. This ensures that no local transactions occurring after the remote transaction appear to have committed before it.
Re: Conflict Detection and Resolution
On Mon, Jun 17, 2024 at 3:23 PM Amit Kapila wrote: > > On Wed, Jun 12, 2024 at 10:03 AM Dilip Kumar wrote: > > > > On Tue, Jun 11, 2024 at 7:44 PM Tomas Vondra > > wrote: > > > > > > Yes, that's correct. However, many cases could benefit from the > > > > update_deleted conflict type if it can be implemented reliably. That's > > > > why we wanted to give it a try. But if we can't achieve predictable > > > > results with it, I'm fine to drop this approach and conflict_type. We > > > > can consider a better design in the future that doesn't depend on > > > > non-vacuumed entries and provides a more robust method for identifying > > > > deleted rows. > > > > > > > > > > I agree having a separate update_deleted conflict would be beneficial, > > > I'm not arguing against that - my point is actually that I think this > > > conflict type is required, and that it needs to be detected reliably. > > > > > > > When working with a distributed system, we must accept some form of > > eventual consistency model. However, it's essential to design a > > predictable and acceptable behavior. For example, if a change is a > > result of a previous operation (such as an update on node B triggered > > after observing an operation on node A), we can say that the operation > > on node A happened before the operation on node B. Conversely, if > > operations on nodes A and B are independent, we consider them > > concurrent. > > > > In distributed systems, clock skew is a known issue. To establish a > > consistency model, we need to ensure it guarantees the > > "happens-before" relationship. Consider a scenario with three nodes: > > NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and > > subsequently NodeB makes changes, and then both NodeA's and NodeB's > > changes are sent to NodeC, the clock skew might make NodeB's changes > > appear to have occurred before NodeA's changes. However, we should > > maintain data that indicates NodeB's changes were triggered after > > NodeA's changes arrived at NodeB. This implies that logically, NodeB's > > changes happened after NodeA's changes, despite what the timestamps > > suggest. > > > > A common method to handle such cases is using vector clocks for > > conflict resolution. > > > > I think the unbounded size of the vector could be a problem to store > for each event. However, while researching previous discussions, it > came to our notice that we have discussed this topic in the past as > well in the context of standbys. For recovery_min_apply_delay, we > decided the clock skew is not a problem as the settings of this > parameter are much larger than typical time deviations between servers > as mentioned in docs. Similarly for casual reads [1], there was a > proposal to introduce max_clock_skew parameter and suggesting the user > to make sure to have NTP set up correctly. We have tried to check > other databases (like Ora and BDR) where CDR is implemented but didn't > find anything specific to clock skew. So, I propose to go with a GUC > like max_clock_skew such that if the difference of time between the > incoming transaction's commit time and the local time is more than > max_clock_skew then we raise an ERROR. It is not clear to me that > putting bigger effort into clock skew is worth especially when other > systems providing CDR feature (like Ora or BDR) for decades have not > done anything like vector clocks. It is possible that this is less of > a problem w.r.t CDR and just detecting the anomaly in clock skew is > good enough. 
I believe that if we've accepted this solution elsewhere, then we can also consider the same. Basically, we're allowing the application to set its tolerance for clock skew. And, if the skew exceeds that tolerance, it's the application's responsibility to synchronize; otherwise, an error will occur. This approach seems reasonable. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
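A minimal sketch of the check being agreed to here, assuming backend context (ereport() and errmsg() are the real error-reporting API; the GUC variable, its unit, and the call site are assumptions):

/* hypothetical GUC, in milliseconds; -1 disables the check */
static int max_clock_skew = -1;

static void
check_remote_commit_ts(TimestampTz remote_commit_ts)
{
    TimestampTz local_ts = GetCurrentTimestamp();

    if (max_clock_skew >= 0 &&
        remote_commit_ts - local_ts > (TimestampTz) max_clock_skew * 1000)
        ereport(ERROR,
                (errmsg("remote commit timestamp is ahead of the local clock "
                        "by more than max_clock_skew")));
}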
Re: Conflict Detection and Resolution
On Mon, Jun 17, 2024 at 5:38 AM Tomas Vondra wrote: > > > The issue with using commit timestamps is that, when multiple nodes > > are involved, the commit timestamp won't accurately represent the > > actual order of operations. There's no reliable way to determine the > > perfect order of each operation happening on different nodes roughly > > simultaneously unless we use some globally synchronized counter. > > Generally, that order might not cause real issues unless one operation > > is triggered by a previous operation, and relying solely on physical > > timestamps would not detect that correctly. > > > This whole conflict detection / resolution proposal is based on using > commit timestamps. Aren't you suggesting it can't really work with > commit timestamps? > > FWIW there are ways to build distributed consistency with timestamps, > as long as it's monotonic - e.g. clock-SI does that. It's not perfect, > but it shows it's possible.

Hmm, I see that clock-SI does this by delaying the transaction when it detects clock skew.

> However, I'm not sure we have to go there - it depends on what the goal is. > For a one-directional replication (multiple nodes replicating to the > same target) it might be sufficient if the conflict resolution is > "deterministic" (e.g. not dependent on the order in which the changes > are applied). I'm not sure, but it's why I asked what's the goal in my > very first message in this thread.

I'm not completely certain about this. Even in one-directional replication, if multiple nodes are sending data, how can we guarantee determinism in the presence of clock skew without using some other mechanism like logical counters, or something like what clock-SI is doing?

I don't want to insist on using any specific solution here. However, I noticed that we haven't addressed how we plan to manage clock skew, which is my primary concern. I believe that if multiple nodes are involved and we're receiving data from them with unsynchronized clocks, ensuring determinism about their order will require us to take some measures to handle that.

> > We need some sort of logical counter, such as a vector clock, which > > might be an independent counter on each node but can perfectly track > > the causal order. For example, if NodeA observes an operation from > > NodeB with a counter value of X, NodeA will adjust its counter to X+1. > > This ensures that if NodeA has seen an operation from NodeB, its next > > operation will appear to have occurred after NodeB's operation. > > > > I admit that I haven't fully thought through how we could design such > > version tracking in our logical replication protocol or how it would > > fit into our system. However, my point is that we need to consider > > something beyond commit timestamps to achieve reliable ordering. > > > > I can't really respond to this as there's no suggestion how it would be > implemented in the patch discussed in this thread. >

No worries, I'll consider whether finding such a solution is feasible for our situation. Thank you!

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Logical Replication of sequences
On Thu, Jun 13, 2024 at 11:53 AM vignesh C wrote: > > On Thu, 13 Jun 2024 at 10:27, Dilip Kumar wrote: > > Thanks for the explanation, but I am still not getting it completely, > > do you mean to say unless all the sequences are not synced any of the > > sequences would not be marked "ready" in pg_subscription_rel? Is that > > necessary? I mean why we can not sync the sequences one by one and > > mark them ready? Why it is necessary to either have all the sequences > > synced or none of them? > > Since updating the sequence is one operation and setting > pg_subscription_rel is another, I was trying to avoid a situation > where the sequence is updated but its state is not reflected in > pg_subscription_rel. It seems you are suggesting that it's acceptable > for the sequence to be updated even if its state isn't updated in > pg_subscription_rel, and in such cases, the sequence value does not > need to be reverted. Right, the complexity we're adding to achieve a behavior that may not be truly desirable is a concern. For instance, if we mark the status as ready but do not sync the sequences, it could lead to issues. However, if we have synced some sequences but encounter a failure without marking the status as ready, I don't consider it inconsistent in any way. But anyway, now I understand your thinking behind that so it's a good idea to leave this design behavior for a later decision. Gathering more opinions and insights during later stages will provide a clearer perspective on how to proceed with this aspect. Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Logical Replication of sequences
On Thu, Jun 13, 2024 at 10:10 AM vignesh C wrote: > > > So, you're saying that when we synchronize the sequence values on the > > subscriber side, we will create a new relfilenode to allow reverting > > to the old state of the sequence in case of an error or transaction > > rollback? But why would we want to do that? Generally, even if you > > call nextval() on a sequence and then roll back the transaction, the > > sequence value doesn't revert to the old value. So, what specific > > problem on the subscriber side are we trying to avoid by operating on > > a new relfilenode? > > Let's consider a situation where we have two sequences: seq1 with a > value of 100 and seq2 with a value of 200. Now, let's say seq1 is > synced and updated to 100, then we attempt to synchronize seq2, > there's a failure due to the sequence not existing or encountering > some other issue. In this scenario, we don't want to halt operations > where seq1 is synchronized, but the sequence state for sequence isn't > changed to "ready" in pg_subscription_rel.

Thanks for the explanation, but I am still not getting it completely. Do you mean to say that unless all the sequences are synced, none of them would be marked "ready" in pg_subscription_rel? Is that necessary? I mean, why can we not sync the sequences one by one and mark them ready? Why is it necessary to have either all the sequences synced or none of them?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jun 12, 2024 at 5:26 PM Tomas Vondra wrote: > > >> I agree having a separate update_deleted conflict would be beneficial, > >> I'm not arguing against that - my point is actually that I think this > >> conflict type is required, and that it needs to be detected reliably. > >> > > > > When working with a distributed system, we must accept some form of > > eventual consistency model. > > I'm not sure this is necessarily true. There are distributed databases > implementing (or aiming to) regular consistency models, without eventual > consistency. I'm not saying it's easy, but it shows eventual consistency > is not the only option. Right, that statement might not be completely accurate. Based on the CAP theorem, when a network partition is unavoidable and availability is expected, we often choose an eventual consistency model. However, claiming that a fully consistent model is impossible in any distributed system is incorrect, as it can be achieved using mechanisms like Two-Phase Commit. We must also accept that our PostgreSQL replication mechanism does not guarantee a fully consistent model. Even with synchronous commit, it only waits for the WAL to be replayed on the standby but does not change the commit decision based on other nodes. This means, at most, we can only guarantee "Read Your Write" consistency. > > However, it's essential to design a > > predictable and acceptable behavior. For example, if a change is a > > result of a previous operation (such as an update on node B triggered > > after observing an operation on node A), we can say that the operation > > on node A happened before the operation on node B. Conversely, if > > operations on nodes A and B are independent, we consider them > > concurrent. > > > > Right. And this is precisely the focus or my questions - understanding > what behavior we aim for / expect in the end. Or said differently, what > anomalies / weird behavior would be considered expected. > Because that's important both for discussions about feasibility, etc. > And also for evaluation / reviews of the patch. +1 > > In distributed systems, clock skew is a known issue. To establish a > > consistency model, we need to ensure it guarantees the > > "happens-before" relationship. Consider a scenario with three nodes: > > NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and > > subsequently NodeB makes changes, and then both NodeA's and NodeB's > > changes are sent to NodeC, the clock skew might make NodeB's changes > > appear to have occurred before NodeA's changes. However, we should > > maintain data that indicates NodeB's changes were triggered after > > NodeA's changes arrived at NodeB. This implies that logically, NodeB's > > changes happened after NodeA's changes, despite what the timestamps > > suggest. > > > > A common method to handle such cases is using vector clocks for > > conflict resolution. "Vector clocks" allow us to track the causal > > relationships between changes across nodes, ensuring that we can > > correctly order events and resolve conflicts in a manner that respects > > the "happens-before" relationship. This method helps maintain > > consistency and predictability in the system despite issues like clock > > skew. > > > > I'm familiar with the concept of vector clock (or logical clock in > general), but it's not clear to me how you plan to use this in the > context of the conflict handling. Can you elaborate/explain? 
> > The way I see it, conflict handling is pretty tightly coupled with > regular commit timestamps and MVCC in general. How would you use vector > clock to change that? The issue with using commit timestamps is that, when multiple nodes are involved, the commit timestamp won't accurately represent the actual order of operations. There's no reliable way to determine the perfect order of each operation happening on different nodes roughly simultaneously unless we use some globally synchronized counter. Generally, that order might not cause real issues unless one operation is triggered by a previous operation, and relying solely on physical timestamps would not detect that correctly. We need some sort of logical counter, such as a vector clock, which might be an independent counter on each node but can perfectly track the causal order. For example, if NodeA observes an operation from NodeB with a counter value of X, NodeA will adjust its counter to X+1. This ensures that if NodeA has seen an operation from NodeB, its next operation will appear to have occurred after NodeB's operation. I admit that I haven't fully thought through how we could design such version tracking in our logical replication protocol or how it would fit into our system. However, my point is that we need to consider something beyond commit timestamps to achieve reliable ordering. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
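For illustration only, a toy standalone C vector clock over a fixed set of nodes, showing the rule sketched above (take the component-wise max of what we have seen, then tick our own slot); the unbounded-size concern raised elsewhere in the thread is precisely that N_NODES cannot be fixed in a real cluster:

#include <stdint.h>

#define N_NODES 3

typedef struct VectorClock { uint64_t c[N_NODES]; } VectorClock;

/* A local event on node "me": tick our own component. */
static void
vc_local_event(VectorClock *vc, int me)
{
    vc->c[me]++;
}

/*
 * On receiving a change stamped "remote": component-wise max, then tick
 * our own component, so whatever we do next is ordered after everything
 * we have observed.
 */
static void
vc_receive(VectorClock *vc, const VectorClock *remote, int me)
{
    for (int i = 0; i < N_NODES; i++)
        if (remote->c[i] > vc->c[i])
            vc->c[i] = remote->c[i];
    vc->c[me]++;
}

/*
 * a happened-before b iff a <= b component-wise and a != b; if neither
 * clock dominates the other, the two events were concurrent.
 */
static int
vc_happened_before(const VectorClock *a, const VectorClock *b)
{
    int strictly_less = 0;

    for (int i = 0; i < N_NODES; i++)
    {
        if (a->c[i] > b->c[i])
            return 0;
        if (a->c[i] < b->c[i])
            strictly_less = 1;
    }
    return strictly_less;
}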
Re: Logical Replication of sequences
On Wed, Jun 12, 2024 at 4:08 PM vignesh C wrote: > > On Wed, 12 Jun 2024 at 10:51, Dilip Kumar wrote: > > > > On Tue, Jun 11, 2024 at 4:06 PM vignesh C wrote: > > > > > > Amit and I engaged in an offline discussion regarding the design and > > > contemplated that it could be like below: > > > > If I understand correctly, does this require the sequences to already > > exist on the subscribing node before creating the subscription, or > > will it also copy any non-existing sequences? > > Sequences must exist in the subscriber; we'll synchronize only their > values. Any sequences that are not present in the subscriber will > trigger an error. Okay, that makes sense. > > > > 3) Refreshing the sequence can be achieved through the existing > > > command: ALTER SUBSCRIPTION ... REFRESH PUBLICATION(no syntax change > > > here). > > > The subscriber identifies stale sequences, meaning sequences present > > > in pg_subscription_rel but absent from the publication, and removes > > > them from the pg_subscription_rel system table. The subscriber also > > > checks for newly added sequences in the publisher and synchronizes > > > their values from the publisher using the steps outlined in the > > > subscription creation process. It's worth noting that previously > > > synchronized sequences won't be synchronized again; the sequence sync > > > will occur solely for the newly added sequences. > > > > > 4) Introducing a new command for refreshing all sequences: ALTER > > > SUBSCRIPTION ... REFRESH PUBLICATION SEQUENCES. > > > The subscriber will remove stale sequences and add newly added > > > sequences from the publisher. Following this, it will re-synchronize > > > the sequence values for all sequences in the updated list from the > > > publisher, following the steps outlined in the subscription creation > > > process. > > > > Okay, this answers my first question: we will remove the sequences > > that are removed from the publisher and add the new sequences. I don't > > see any problem with this, but doesn't it seem like we are effectively > > doing DDL replication only for sequences without having a > > comprehensive plan for overall DDL replication? > > What I intended to convey is that we'll eliminate the sequences from > pg_subscription_rel. We won't facilitate the DDL replication of > sequences; instead, we anticipate users to create the sequences > themselves. hmm okay. > > > 5) Incorporate the pg_sequence_state function to fetch the sequence > > > value from the publisher, along with the page LSN. Incorporate > > > SetSequence function, which will procure a new relfilenode for the > > > sequence and set the new relfilenode with the specified value. This > > > will facilitate rollback in case of any failures. > > > > I do not understand this point, you mean whenever we are fetching the > > sequence value from the publisher we need to create a new relfilenode > > on the subscriber? Why not just update the catalog tuple is > > sufficient? Or this is for handling the ALTER SEQUENCE case? > > Sequences operate distinctively from tables. Alterations to sequences > reflect instantly in another session, even before committing the > transaction. To ensure the synchronization of sequence value and state > updates in pg_subscription_rel, we assign it a new relfilenode. This > strategy ensures that any potential errors allow for the rollback of > both the sequence state in pg_subscription_rel and the sequence values > simultaneously. 
So, you're saying that when we synchronize the sequence values on the subscriber side, we will create a new relfilenode to allow reverting to the old state of the sequence in case of an error or transaction rollback? But why would we want to do that? Generally, even if you call nextval() on a sequence and then roll back the transaction, the sequence value doesn't revert to the old value. So, what specific problem on the subscriber side are we trying to avoid by operating on a new relfilenode? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
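A rough sketch of the shape being described, with heavy caveats: RelationSetNewRelfilenumber() is the real backend routine used for TRUNCATE-style rewrites, but SetSequence() and fill_seq_with_data() here are only stand-ins for whatever the patch actually does.

static void
SetSequence(Oid seqoid, int64 value)
{
    Relation    seqrel = relation_open(seqoid, AccessExclusiveLock);

    /*
     * Swap in a fresh relfilenode. The catalog change is transactional,
     * so if anything fails afterwards, the sequence value and the
     * pg_subscription_rel state roll back together.
     */
    RelationSetNewRelfilenumber(seqrel, seqrel->rd_rel->relpersistence);

    /* hypothetical: write the first sequence page with the remote value */
    fill_seq_with_data(seqrel, value);

    relation_close(seqrel, NoLock);
}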
Re: Logical Replication of sequences
On Tue, Jun 11, 2024 at 4:06 PM vignesh C wrote: > > Amit and I engaged in an offline discussion regarding the design and > contemplated that it could be like below: If I understand correctly, does this require the sequences to already exist on the subscribing node before creating the subscription, or will it also copy any non-existing sequences? > 1) CREATE PUBLICATION syntax enhancement: > CREATE PUBLICATION ... FOR ALL SEQUENCES; > The addition of a new column titled "all sequences" in the > pg_publication system table will signify whether the publication is > designated as all sequences publication or not. > > 2) CREATE SUBSCRIPTION -- no syntax change. > Upon creation of a subscription, the following additional steps will > be managed by the subscriber: > i) The subscriber will retrieve the list of sequences associated with > the subscription's publications. > ii) For each sequence: a) Retrieve the sequence value from the > publisher by invoking the pg_sequence_state function. b) Set the > sequence with the value obtained from the publisher. iv) Once the > subscription creation is completed, all sequence values will become > visible at the subscriber's end. > > An alternative design approach could involve retrieving the sequence > list from the publisher during subscription creation and inserting the > sequences with an "init" state into the pg_subscription_rel system > table. These tasks could be executed by a single sequence sync worker, > which would: > i) Retrieve the list of sequences in the "init" state from the > pg_subscription_rel system table. > ii) Initiate a transaction. > iii) For each sequence: a) Obtain the sequence value from the > publisher by utilizing the pg_sequence_state function. b) Update the > sequence with the value obtained from the publisher. > iv) Commit the transaction. > > The benefit with the second approach is that if there are large number > of sequences, the sequence sync can be enhanced to happen in parallel > and also if there are any locks held on the sequences in the > publisher, the sequence worker can wait to acquire the lock instead of > blocking the whole create subscription command which will delay the > initial copy of the tables too. Yeah w.r.t. this point second approach seems better. > 3) Refreshing the sequence can be achieved through the existing > command: ALTER SUBSCRIPTION ... REFRESH PUBLICATION(no syntax change > here). > The subscriber identifies stale sequences, meaning sequences present > in pg_subscription_rel but absent from the publication, and removes > them from the pg_subscription_rel system table. The subscriber also > checks for newly added sequences in the publisher and synchronizes > their values from the publisher using the steps outlined in the > subscription creation process. It's worth noting that previously > synchronized sequences won't be synchronized again; the sequence sync > will occur solely for the newly added sequences. > 4) Introducing a new command for refreshing all sequences: ALTER > SUBSCRIPTION ... REFRESH PUBLICATION SEQUENCES. > The subscriber will remove stale sequences and add newly added > sequences from the publisher. Following this, it will re-synchronize > the sequence values for all sequences in the updated list from the > publisher, following the steps outlined in the subscription creation > process. Okay, this answers my first question: we will remove the sequences that are removed from the publisher and add the new sequences. 
I don't see any problem with this, but doesn't it seem like we are effectively doing DDL replication only for sequences without having a comprehensive plan for overall DDL replication? > 5) Incorporate the pg_sequence_state function to fetch the sequence > value from the publisher, along with the page LSN. Incorporate > SetSequence function, which will procure a new relfilenode for the > sequence and set the new relfilenode with the specified value. This > will facilitate rollback in case of any failures. I do not understand this point, you mean whenever we are fetching the sequence value from the publisher we need to create a new relfilenode on the subscriber? Why not just update the catalog tuple is sufficient? Or this is for handling the ALTER SEQUENCE case? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
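Putting the pieces of the second approach together, a sketch of the sequence-sync worker's pass (StartTransactionCommand(), UpdateSubscriptionRelState(), and the SUBREL_STATE_* constants are existing backend APIs; the helper functions and exact signatures are assumptions):

static void
sync_init_state_sequences(Oid subid)
{
    /* hypothetical: sequences inserted as SUBREL_STATE_INIT at CREATE SUBSCRIPTION */
    List       *seqs = get_init_state_sequences(subid);
    ListCell   *lc;

    StartTransactionCommand();
    foreach(lc, seqs)
    {
        Oid     seqoid = lfirst_oid(lc);

        /* hypothetical: remote call wrapping pg_sequence_state() */
        int64   value = fetch_remote_sequence_value(seqoid);

        SetSequence(seqoid, value);     /* new relfilenode; see the sketch above */
        UpdateSubscriptionRelState(subid, seqoid, SUBREL_STATE_READY,
                                   InvalidXLogRecPtr);
    }
    CommitTransactionCommand();
}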
Re: Conflict Detection and Resolution
On Tue, Jun 11, 2024 at 7:44 PM Tomas Vondra wrote: > > Yes, that's correct. However, many cases could benefit from the > > update_deleted conflict type if it can be implemented reliably. That's > > why we wanted to give it a try. But if we can't achieve predictable > > results with it, I'm fine to drop this approach and conflict_type. We > > can consider a better design in the future that doesn't depend on > > non-vacuumed entries and provides a more robust method for identifying > > deleted rows. > > > > I agree having a separate update_deleted conflict would be beneficial, > I'm not arguing against that - my point is actually that I think this > conflict type is required, and that it needs to be detected reliably. > When working with a distributed system, we must accept some form of eventual consistency model. However, it's essential to design a predictable and acceptable behavior. For example, if a change is a result of a previous operation (such as an update on node B triggered after observing an operation on node A), we can say that the operation on node A happened before the operation on node B. Conversely, if operations on nodes A and B are independent, we consider them concurrent. In distributed systems, clock skew is a known issue. To establish a consistency model, we need to ensure it guarantees the "happens-before" relationship. Consider a scenario with three nodes: NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and subsequently NodeB makes changes, and then both NodeA's and NodeB's changes are sent to NodeC, the clock skew might make NodeB's changes appear to have occurred before NodeA's changes. However, we should maintain data that indicates NodeB's changes were triggered after NodeA's changes arrived at NodeB. This implies that logically, NodeB's changes happened after NodeA's changes, despite what the timestamps suggest. A common method to handle such cases is using vector clocks for conflict resolution. "Vector clocks" allow us to track the causal relationships between changes across nodes, ensuring that we can correctly order events and resolve conflicts in a manner that respects the "happens-before" relationship. This method helps maintain consistency and predictability in the system despite issues like clock skew. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Fri, Jun 7, 2024 at 2:40 PM Matthias van de Meent wrote: > > On Fri, 7 Jun 2024 at 10:28, Dilip Kumar wrote: > > > > On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent > > wrote: > >> > >> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar wrote: > >>> > >>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent > >>> wrote: > >>> > >>> I agree with you that we introduced the WAL_LOG strategy to avoid > >>> these force checkpoints. However, in binary upgrade cases where no > >>> operations are happening in the system, the FILE_COPY strategy should > >>> be faster. > >> > >> While you would be correct if there were no operations happening in > >> the system, during binary upgrade we're still actively modifying > >> catalogs; and this is done with potentially many concurrent jobs. I > >> think it's not unlikely that this would impact performance. > > > > Maybe, but generally, long checkpoints are problematic because they > > involve a lot of I/O, which hampers overall system performance. > > However, in the case of a binary upgrade, the concurrent operations > > are only performing a schema restore, not a real data restore. > > Therefore, it shouldn't have a significant impact, and the checkpoints > > should also not do a lot of I/O during binary upgrade, right? > > My primary concern isn't the IO, but the O(shared_buffers) that we > have to go through during a checkpoint. As I mentioned upthread, it is > reasonably possible the new cluster is already setup with a good > fraction of the old system's shared_buffers configured. Every > checkpoint has to scan all those buffers, which IMV can get (much) > more expensive than the IO overhead caused by the WAL_LOG strategy. It > may be a baseless fear as I haven't done the performance benchmarks > for this, but I wouldn't be surprised if shared_buffers=8GB would > measurably impact the upgrade performance in the current patch (vs the > default 128MB). Okay, that's a valid point. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
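The O(shared_buffers) cost Matthias describes is visible in the shape of BufferSync() in bufmgr.c, which (heavily simplified here, locking elided) walks every buffer header whether or not it is dirty:

/*
 * Simplified from BufferSync(): the scan is O(NBuffers) even when
 * almost nothing is dirty, which is why checkpoint cost scales with
 * shared_buffers rather than with the amount of dirty data.
 */
static int
count_buffers_to_checkpoint(void)
{
    int     num_to_scan = 0;

    for (int buf_id = 0; buf_id < NBuffers; buf_id++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buf_id);

        /* the real code takes the buffer-header lock and marks dirty
         * buffers with BM_CHECKPOINT_NEEDED */
        if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY)
            num_to_scan++;
    }
    return num_to_scan;
}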
Re: Compress ReorderBuffer spill files using LZ4
On Fri, Jun 7, 2024 at 2:39 PM Alvaro Herrera wrote: > > On 2024-Jun-07, Dilip Kumar wrote: > > > I think the compression option should be supported at the CREATE > > SUBSCRIPTION level instead of being controlled by a GUC. This way, we > > can decide on compression for each subscription individually rather > > than applying it to all subscribers. It makes more sense for the > > subscriber to control this, especially when we are planning to > > compress the data sent downstream. > > True. (I think we have some options that are in GUCs for the general > behavior and can be overridden by per-subscription options for specific > tailoring; would that make sense here? I think it does, considering > that what we mostly want is to save disk space in the publisher when > spilling to disk.) Yeah, that makes sense. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 7:54 PM Alvaro Herrera wrote: > > On 2024-Jun-06, Amit Kapila wrote: > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > > > When the content of a large transaction (size exceeding > > > logical_decoding_work_mem) and its sub-transactions has to be > > > reordered during logical decoding, then, all the changes are written > > > on disk in temporary files located in pg_replslot/. > > > Decoding very large transactions by multiple replication slots can > > > lead to disk space saturation and high I/O utilization. > > I like the general idea of compressing the output of logical decoding. > It's not so clear to me that we only want to do so for spilling to disk; > for instance, if the two nodes communicate over a slow network, it may > even be beneficial to compress when streaming, so to this question: > > > Why can't one use 'streaming' option to send changes to the client > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > I would say that streaming doesn't necessarily have to mean we don't > want compression, because for some users it might be beneficial. +1 > I think a GUC would be a good idea. Also, what if for whatever reason > you want a different compression algorithm or different compression > parameters? Looking at the existing compression UI we offer in > pg_basebackup, perhaps you could add something like this: > > compress_logical_decoding = none > compress_logical_decoding = lz4:42 > compress_logical_decoding = spill-zstd:99 > > "none" says to never use compression (perhaps should be the default), > "lz4:42" says to use lz4 with parameters 42 on both spilling and > streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but > only for spilling to disk. > I think the compression option should be supported at the CREATE SUBSCRIPTION level instead of being controlled by a GUC. This way, we can decide on compression for each subscription individually rather than applying it to all subscribers. It makes more sense for the subscriber to control this, especially when we are planning to compress the data sent downstream. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
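If a setting along those lines were adopted, the value would parse into a small struct; a self-contained C sketch of that parsing (the format itself is only the strawman suggested above):

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct CompressSetting
{
    bool    spill_only;     /* "spill-" prefix: compress only when spilling */
    char    algo[16];       /* "none", "lz4", "zstd", ... */
    int     level;          /* value after ':', or -1 if absent */
} CompressSetting;

static bool
parse_compress_setting(const char *raw, CompressSetting *out)
{
    const char *p = raw;
    const char *colon;
    size_t      alen;

    out->spill_only = (strncmp(p, "spill-", 6) == 0);
    if (out->spill_only)
        p += 6;

    colon = strchr(p, ':');
    out->level = colon ? atoi(colon + 1) : -1;

    alen = colon ? (size_t) (colon - p) : strlen(p);
    if (alen == 0 || alen >= sizeof(out->algo))
        return false;
    memcpy(out->algo, p, alen);
    out->algo[alen] = '\0';
    return true;
}

So "spill-zstd:99" would yield spill_only=true, algo="zstd", level=99, while plain "none" yields spill_only=false, algo="none", level=-1.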
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent wrote: > > On Fri, 7 Jun 2024 at 07:18, Dilip Kumar wrote: > > > > On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent > > wrote: > >> > >> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela wrote: > >>> > >>> Why not use it too, if not binary_upgrade? > >> > >> Because in the normal case (not during binary_upgrade) you don't want > >> to have to generate 2 checkpoints for every created database, > >> especially not when your shared buffers are large. Checkpoints' costs > >> scale approximately linearly with the size of shared buffers, so being > >> able to skip those checkpoints (with strategy=WAL_LOG) will save a lot > >> of performance in the systems where this performance impact matters > >> most. > > > > I agree with you that we introduced the WAL_LOG strategy to avoid > > these force checkpoints. However, in binary upgrade cases where no > > operations are happening in the system, the FILE_COPY strategy should > > be faster. > > While you would be correct if there were no operations happening in > the system, during binary upgrade we're still actively modifying > catalogs; and this is done with potentially many concurrent jobs. I > think it's not unlikely that this would impact performance. Maybe, but generally, long checkpoints are problematic because they involve a lot of I/O, which hampers overall system performance. However, in the case of a binary upgrade, the concurrent operations are only performing a schema restore, not a real data restore. Therefore, it shouldn't have a significant impact, and the checkpoints should also not do a lot of I/O during binary upgrade, right? > Now that I think about it, arguably, we shouldn't need to run > checkpoints during binary upgrade for the FILE_COPY strategy after > we've restored the template1 database and created a checkpoint after > that: All other databases use template1 as their template database, > and the checkpoint is there mostly to guarantee the FS knows about all > changes in the template database before we task it with copying the > template database over to our new database, so the protections we get > from more checkpoints are practically useless. > If such a change were implemented (i.e. no checkpoints for FILE_COPY > in binary upgrade, with a single manual checkpoint after restoring > template1 in create_new_objects) I think most of my concerns with this > patch would be alleviated. Yeah, I think that's a valid point. The second checkpoint is to ensure that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for binary upgrades, we don't need that guarantee because a checkpoint will be performed during shutdown at the end of the upgrade anyway. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Vacuum statistics
On Thu, May 30, 2024 at 11:57 PM Alena Rybakina wrote: > > On 30.05.2024 10:33, Alena Rybakina wrote: > > > > I suggest gathering information about vacuum resource consumption for > > processing indexes and tables and storing it in the table and index > > relationships (for example, PgStat_StatTabEntry structure like it has > > realized for usual statistics). It will allow us to determine how well > > the vacuum is configured and evaluate the effect of overhead on the > > system at the strategic level, the vacuum has gathered this > > information already, but this valuable information doesn't store it. > > > My colleagues and I have prepared a patch that can help to solve this > problem. > > We are open to feedback.

I was reading through the patch; here are some initial comments.

--
+typedef struct LVExtStatCounters
+{
+       TimestampTz time;
+       PGRUsage ru;
+       WalUsage walusage;
+       BufferUsage bufusage;
+       int64 VacuumPageMiss;
+       int64 VacuumPageHit;
+       int64 VacuumPageDirty;
+       double VacuumDelayTime;
+       PgStat_Counter blocks_fetched;
+       PgStat_Counter blocks_hit;
+} LVExtStatCounters;

I noticed that you are storing both pgBufferUsage and VacuumPage(Hit/Miss/Dirty) stats. Aren't these essentially the same? It seems they both exist in the system because some code, like heap_vacuum_rel(), uses pgBufferUsage, while do_analyze_rel() still relies on the old counters. And there is already a patch to remove those old counters.

--
+static Datum
+pg_stats_vacuum(FunctionCallInfo fcinfo, ExtVacReportType type, int ncolumns)
+{

I don't think you need this last parameter (ncolumns); we can anyway fetch that from the tupledesc, so adding an additional parameter just for checking doesn't look good to me.

--
+       /* Tricky turn here: enforce pgstat to think that our database us dbid */
+
+       MyDatabaseId = dbid;

typo
/think that our database us dbid/think that our database has dbid

Also, remove the blank line between the comment and the next code block that is related to that comment.

--
        VacuumPageDirty = 0;
+       VacuumDelayTime = 0.;

There is an extra "." after 0

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
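On the ncolumns point, a sketch of what is being suggested (get_call_result_type() is the real fmgr helper; the body is otherwise a stub, not the patch's code):

static Datum
pg_stats_vacuum(FunctionCallInfo fcinfo, ExtVacReportType type)
{
    TupleDesc   tupdesc;
    int         ncolumns;

    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "return type must be a row type");

    /* the column count comes from the declared result type */
    ncolumns = tupdesc->natts;

    /* ... build and return the result tuple using tupdesc/ncolumns ... */
    return (Datum) 0;
}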
Re: How about using dirty snapshots to locate dependent objects?
On Fri, Jun 7, 2024 at 11:53 AM Ashutosh Sharma wrote: > > On Fri, Jun 7, 2024 at 10:06 AM Dilip Kumar wrote: > > > > On Thu, Jun 6, 2024 at 7:39 PM Ashutosh Sharma > > wrote: > > > > > > On Thu, Jun 6, 2024 at 6:20 PM Dilip Kumar wrote: > > >> > > >> On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma > > >> wrote: > > >> > > > >> > Hello everyone, > > >> > > > >> > At present, we use MVCC snapshots to identify dependent objects. This > > >> > implies that if a new dependent object is inserted within a > > >> > transaction that is still ongoing, our search for dependent objects > > >> > won't include this recently added one. Consequently, if someone > > >> > attempts to drop the referenced object, it will be dropped, and when > > >> > the ongoing transaction completes, we will end up having an entry for > > >> > a referenced object that has already been dropped. This situation can > > >> > lead to an inconsistent state. Below is an example illustrating this > > >> > scenario: > > >> > > >> I don't think it's correct to allow the index to be dropped while a > > >> transaction is creating it. Instead, the right solution should be for > > >> the create index operation to protect the object it is using from > > >> being dropped. Specifically, the create index operation should acquire > > >> a shared lock on the Access Method (AM) to ensure it doesn't get > > >> dropped concurrently while the transaction is still in progress. > > > > > > > > > If I'm following you correctly, that's exactly what the patch is trying > > > to do; while the index creation is in progress, if someone tries to drop > > > the object referenced by the index under creation, the referenced object > > > being dropped is able to know about the dependent object (in this case > > > the index being created) using dirty snapshot and hence, it is unable to > > > acquire the lock on the dependent object, and as a result of that, it is > > > unable to drop it. > > > > You are aiming for the same outcome, but not in the conventional way. > > In my opinion, the correct approach is not to find objects being > > created using a dirty snapshot. Instead, when creating an object, you > > should acquire a proper lock on any dependent objects to prevent them > > from being dropped during the creation process. For instance, when > > creating an index that depends on the btree_gist access method, the > > create index operation should protect btree_gist from being dropped by > > acquiring the appropriate lock. It is not the responsibility of the > > drop extension to identify in-progress index creations. > > Thanks for sharing your thoughts, I appreciate your inputs and > completely understand your perspective, but I wonder if that is > feasible? For example, if an object (index in this case) has > dependency on lets say 'n' number of objects, and those 'n' number of > objects belong to say 'n' different catalog tables, so should we > acquire locks on each of them until the create index command succeeds, > or, should we just check for the presence of dependent objects and > record their dependency inside the pg_depend table. Talking about this > particular case, we are trying to create gist index that has > dependency on gist_int4 opclass, it is one of the tuple inside > pg_opclass catalog table, so should acquire lock in this tuple/table > until the create index command succeeds and is that the thing to be > done for all the dependent objects? 
I am not sure what the best way to do it is, but if you are creating an object which depends on other objects then you need to check the existence of those objects, record dependencies on them, and also lock them so that they don't get dropped while you are creating your object (a rough sketch follows below). I haven't looked into the patch, but something similar is achieved in the thread Bertrand has pointed out, by locking the database object while recording the dependency on it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
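As a rough sketch of that idea (my illustration under stated assumptions, not the actual patch): when creating an index that uses a given operator class, the creation path could lock the pg_opclass entry, re-check that it still exists, and only then record the dependency. The helper name is hypothetical; the individual calls (LockDatabaseObject, SearchSysCacheExists1, recordDependencyOn) are existing backend functions.

#include "postgres.h"
#include "catalog/dependency.h"
#include "catalog/objectaddress.h"
#include "catalog/pg_opclass.h"
#include "storage/lmgr.h"
#include "utils/syscache.h"

/* hypothetical helper: pin down a referenced opclass before depending on it */
static void
lock_and_record_opclass_dependency(const ObjectAddress *index, Oid opclassOid)
{
	ObjectAddress referenced;

	/* Block a concurrent DROP of the opclass while the index is created. */
	LockDatabaseObject(OperatorClassRelationId, opclassOid, 0, AccessShareLock);

	/* It may have been dropped before we got the lock, so re-check. */
	if (!SearchSysCacheExists1(CLAOID, ObjectIdGetDatum(opclassOid)))
		ereport(ERROR,
				(errcode(ERRCODE_UNDEFINED_OBJECT),
				 errmsg("operator class %u does not exist", opclassOid)));

	ObjectAddressSet(referenced, OperatorClassRelationId, opclassOid);
	recordDependencyOn(index, &referenced, DEPENDENCY_NORMAL);
}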
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent wrote: > > On Wed, 5 Jun 2024 at 18:47, Ranier Vilela wrote: > > > > On Tue, Jun 4, 2024 at 16:39, Nathan Bossart > > wrote: > >> > >> I noticed that the "Restoring database schemas in the new cluster" part of > >> pg_upgrade can take a while if you have many databases, so I experimented > >> with a couple different settings to see if there are any easy ways to speed > >> it up. The FILE_COPY strategy for CREATE DATABASE helped quite > >> significantly on my laptop. For ~3k empty databases, this step went from > >> ~100 seconds to ~30 seconds with the attached patch. I see commit ad43a41 > >> made a similar change for initdb, so there might even be an argument for > >> back-patching this to v15 (where STRATEGY was introduced). One thing I > >> still need to verify is that this doesn't harm anything when there are lots > >> of objects in the databases, i.e., more WAL generated during many > >> concurrent CREATE-DATABASE-induced checkpoints. > >> > >> Thoughts? > > > > Why not use it too, if not binary_upgrade? > > Because in the normal case (not during binary_upgrade) you don't want > to have to generate 2 checkpoints for every created database, > especially not when your shared buffers are large. Checkpoints' costs > scale approximately linearly with the size of shared buffers, so being > able to skip those checkpoints (with strategy=WAL_LOG) will save a lot > of performance in the systems where this performance impact matters > most. I agree with you that we introduced the WAL_LOG strategy to avoid these forced checkpoints. However, in binary upgrade cases where no operations are happening in the system, the FILE_COPY strategy should be faster. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
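For context, the change being discussed amounts to something like the following sketch on the side that builds the CREATE DATABASE command (my illustration; the buffer and variable names are placeholders and not necessarily what the attached patch does):

#include <stdbool.h>
#include "pqexpbuffer.h"

/* hypothetical sketch: pick the copy strategy when emitting CREATE DATABASE */
static void
append_create_database_command(PQExpBuffer creaQry, const char *qdatname,
							   bool binary_upgrade)
{
	appendPQExpBuffer(creaQry, "CREATE DATABASE %s", qdatname);

	/*
	 * During pg_upgrade nothing else runs concurrently, so the extra
	 * checkpoints forced by FILE_COPY are cheap, and skipping the
	 * block-by-block WAL_LOG copy makes creation faster.
	 */
	if (binary_upgrade)
		appendPQExpBufferStr(creaQry, " STRATEGY = FILE_COPY");
}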
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 7:39 PM Ashutosh Sharma wrote: > > On Thu, Jun 6, 2024 at 6:20 PM Dilip Kumar wrote: >> >> On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote: >> > >> > Hello everyone, >> > >> > At present, we use MVCC snapshots to identify dependent objects. This >> > implies that if a new dependent object is inserted within a transaction >> > that is still ongoing, our search for dependent objects won't include this >> > recently added one. Consequently, if someone attempts to drop the >> > referenced object, it will be dropped, and when the ongoing transaction >> > completes, we will end up having an entry for a referenced object that has >> > already been dropped. This situation can lead to an inconsistent state. >> > Below is an example illustrating this scenario: >> >> I don't think it's correct to allow the index to be dropped while a >> transaction is creating it. Instead, the right solution should be for >> the create index operation to protect the object it is using from >> being dropped. Specifically, the create index operation should acquire >> a shared lock on the Access Method (AM) to ensure it doesn't get >> dropped concurrently while the transaction is still in progress. > > > If I'm following you correctly, that's exactly what the patch is trying to > do; while the index creation is in progress, if someone tries to drop the > object referenced by the index under creation, the referenced object being > dropped is able to know about the dependent object (in this case the index > being created) using dirty snapshot and hence, it is unable to acquire the > lock on the dependent object, and as a result of that, it is unable to drop > it. You are aiming for the same outcome, but not in the conventional way. In my opinion, the correct approach is not to find objects being created using a dirty snapshot. Instead, when creating an object, you should acquire a proper lock on any dependent objects to prevent them from being dropped during the creation process. For instance, when creating an index that depends on the btree_gist access method, the create index operation should protect btree_gist from being dropped by acquiring the appropriate lock. It is not the responsibility of the drop extension to identify in-progress index creations. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: How about using dirty snapshots to locate dependent objects?
On Thu, Jun 6, 2024 at 5:59 PM Ashutosh Sharma wrote: > > Hello everyone, > > At present, we use MVCC snapshots to identify dependent objects. This implies > that if a new dependent object is inserted within a transaction that is still > ongoing, our search for dependent objects won't include this recently added > one. Consequently, if someone attempts to drop the referenced object, it will > be dropped, and when the ongoing transaction completes, we will end up having > an entry for a referenced object that has already been dropped. This > situation can lead to an inconsistent state. Below is an example illustrating > this scenario: I don't think it's correct to allow the index to be dropped while a transaction is creating it. Instead, the right solution should be for the create index operation to protect the object it is using from being dropped. Specifically, the create index operation should acquire a shared lock on the Access Method (AM) to ensure it doesn't get dropped concurrently while the transaction is still in progress. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Logical Replication of sequences
On Thu, Jun 6, 2024 at 9:34 AM Amit Kapila wrote: > > On Wed, Jun 5, 2024 at 3:17 PM Bharath Rupireddy > wrote: > > > > On Tue, Jun 4, 2024 at 5:40 PM Amit Kapila wrote: > > > > > > Even if we decode it periodically (say each time we decode the > > > checkpoint record) then also we need to send the entire set of > > > sequences at shutdown. This is because the sequences may have changed > > > from the last time we sent them. > > > > Agree. How about decoding and sending only the sequences that are > > changed from the last time when they were sent? I know it requires a > > bit of tracking and more work, but all I'm looking for is to reduce > > the amount of work that walsenders need to do during the shutdown. > > > > I see your point but going towards tracking the changed sequences > sounds like moving towards what we do for incremental backups unless > we can invent some other smart way. Yes, we would need an entirely new infrastructure to track the sequence change since the last sync. We can only determine this from WAL, and relying on it would somehow bring us back to the approach we were trying to achieve with logical decoding of sequences patch. > > Having said that, I like the idea of letting the user sync the > > sequences via ALTER SUBSCRIPTION command and not weave the logic into > > the shutdown checkpoint path. As Peter Eisentraut said here > > https://www.postgresql.org/message-id/42e5cb35-4aeb-4f58-8091-90619c7c3ecc%40eisentraut.org, > > this can be a good starting point to get going. > > > > Agreed. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 4:43 PM Amit Kapila wrote: > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > When the content of a large transaction (size exceeding > > logical_decoding_work_mem) and its sub-transactions has to be > > reordered during logical decoding, then, all the changes are written > > on disk in temporary files located in pg_replslot/. > > Decoding very large transactions by multiple replication slots can > > lead to disk space saturation and high I/O utilization. > > > > Why can't one use 'streaming' option to send changes to the client > once it reaches the configured limit of 'logical_decoding_work_mem'? > > > > > 2. Do we want a GUC to switch compression on/off? > > > > It depends on the overhead of decoding. Did you try to measure the > decoding overhead of decompression when reading compressed files? I think it depends on the trade-off between the I/O savings from reducing the data size and the performance cost of compressing and decompressing the data. This balance is highly dependent on the hardware. For example, if you have a very slow disk and a powerful processor, compression could be advantageous. Conversely, if the disk is very fast, the I/O savings might be minimal, and the compression overhead could outweigh the benefits. Additionally, the effectiveness of compression also depends on the compression ratio, which varies with the type of data being compressed. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
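To make the compression side of that trade-off concrete, here is a small standalone sketch (my illustration, assuming liblz4 is available; not code from the proposed patch) showing the LZ4 default compressor whose CPU cost would be paid on every spill:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lz4.h>

int
main(void)
{
	const char *src = "change data change data change data change data";
	int			src_len = (int) strlen(src) + 1;
	int			max_len = LZ4_compressBound(src_len);
	char	   *dst = malloc(max_len);

	/*
	 * Fast, modest-ratio compression: this CPU cost must be outweighed by
	 * writing fewer bytes to pg_replslot/ for the patch to be a win.
	 */
	int			comp_len = LZ4_compress_default(src, dst, src_len, max_len);

	if (comp_len <= 0)
	{
		fprintf(stderr, "compression failed\n");
		return 1;
	}
	printf("compressed %d bytes to %d bytes\n", src_len, comp_len);
	free(dst);
	return 0;
}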
Re: Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel
On Thu, Jun 6, 2024 at 3:23 PM Anthonin Bonnefoy wrote: > > Hi, > > I sent a similar patch for this in > https://www.postgresql.org/message-id/flat/cao6_xqr__kttclkftqs0qscm-j7_xbrg3ge2rwhucxqjmjh...@mail.gmail.com Okay, I see. In that case, we can just discard mine; thanks for notifying me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Thu, Jun 6, 2024 at 3:43 PM Amit Kapila wrote: > > On Wed, Jun 5, 2024 at 7:29 PM Dilip Kumar wrote: > > > > On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote: > > > > > > Can you share the use case of "earliest_timestamp_wins" resolution > > > method? It seems after the initial update on the local node, it will > > > never allow remote update to succeed which sounds a bit odd. Jan has > > > shared this and similar concerns about this resolution method, so I > > > have added him to the email as well. > > > > > I can not think of a use case exactly in this context but it's very > > common to have such a use case while designing a distributed > > application with multiple clients. For example, when we are doing git > > push concurrently from multiple clients it is expected that the > > earliest commit wins. > > > > Okay, I think it mostly boils down to something like what Shveta > mentioned where Inserts for a primary key can use > "earliest_timestamp_wins" resolution method [1]. So, it seems useful > to support this method as well. Correct, but we still need to think about how to make it work correctly in the presence of clock skew, as I mentioned in one of my previous emails. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Remove dependency on VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel
As part of commit 5cd72cc0c5017a9d4de8b5d465a75946da5abd1d, the dependency on global counters such as VacuumPage(Hit/Miss/Dirty) was removed from the vacuum. However, do_analyze_rel() was still using these counters, necessitating the tracking of global counters alongside BufferUsage counters. The attached patch addresses the issue by eliminating the need to track VacuumPage(Hit/Miss/Dirty) counters in do_analyze_rel(), making the global counters obsolete. This simplifies the code and improves consistency. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v1-0001-Remove-duplicate-tracking-of-the-page-stats-durin.patch Description: Binary data
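For reference, here is a sketch of the delta-based pattern the patch moves do_analyze_rel() to, matching what heap_vacuum_rel() already does: snapshot pgBufferUsage at the start and report the difference at the end, with no global VacuumPage(Hit/Miss/Dirty) counters involved. This is an illustrative fragment, not the literal patch hunk:

#include "postgres.h"
#include "executor/instrument.h"

/* illustrative only: accounting skeleton around the actual ANALYZE work */
static void
analyze_with_buffer_accounting(void)
{
	BufferUsage startbufferusage = pgBufferUsage;
	BufferUsage bufferusage;

	/* ... the actual sampling and statistics computation happens here ... */

	memset(&bufferusage, 0, sizeof(BufferUsage));
	BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);

	elog(LOG, "ANALYZE buffer usage: " INT64_FORMAT " hits, " INT64_FORMAT
		 " reads, " INT64_FORMAT " dirtied",
		 bufferusage.shared_blks_hit,
		 bufferusage.shared_blks_read,
		 bufferusage.shared_blks_dirtied);
}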
Re: Conflict Detection and Resolution
On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote: > > Can you share the use case of "earliest_timestamp_wins" resolution > method? It seems after the initial update on the local node, it will > never allow remote update to succeed which sounds a bit odd. Jan has > shared this and similar concerns about this resolution method, so I > have added him to the email as well. > I cannot think of a use case exactly in this context, but it's very common to have such a use case while designing a distributed application with multiple clients. For example, when we are doing git push concurrently from multiple clients, it is expected that the earliest commit wins. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On Wed, Jun 5, 2024 at 9:12 AM shveta malik wrote: > > On Tue, Jun 4, 2024 at 9:37 AM Amit Kapila wrote: > > > > > > > > > > > > > > > > Conflict Resolution > > > > > > > > > > a) latest_timestamp_wins:The change with later commit timestamp > > > > > wins. > > > > > b) earliest_timestamp_wins: The change with earlier commit > > > > > timestamp wins. > > > > Can you share the use case of "earliest_timestamp_wins" resolution > > method? It seems after the initial update on the local node, it will > > never allow remote update to succeed which sounds a bit odd. Jan has > > shared this and similar concerns about this resolution method, so I > > have added him to the email as well. > > I do not have the exact scenario for this. But I feel, if 2 nodes are > concurrently inserting different data against a primary key, then some > users may have preferences that retain the row which was inserted > earlier. It is no different from latest_timestamp_wins. It totally > depends upon what kind of application and requirement the user may > have, based on which, he may discard the later coming rows (specially > for INSERT case). I haven't read the complete design yet, but have we discussed how we plan to deal with clock drift if we use timestamp-based conflict resolution? For example, a user might insert something conflicting on node1 first and then on node2. However, due to clock drift, the timestamp from node2 might appear earlier. In this case, if we choose "earliest timestamp wins," we would keep the changes from node2. I haven't fully considered if this would cause any problems, but users might detect this issue. For instance, a client machine might send a change to node1 first and then, upon confirmation, send it to node2. If the clocks on node1 and node2 are not synchronized, the changes might appear in a different order. Does this seem like a potential problem? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: New committers: Melanie Plageman, Richard Guo
On Fri, Apr 26, 2024 at 5:24 PM Jonathan S. Katz wrote: > > The Core Team would like to extend our congratulations to Melanie > Plageman and Richard Guo, who have accepted invitations to become our > newest PostgreSQL committers. > > Please join us in wishing them much success and few reverts! Congratulations to both of you. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Why don't we support external input/output functions for the composite types
On Thu, Apr 25, 2024 at 10:14 AM Tom Lane wrote: > > Dilip Kumar writes: > > I'm curious about composite types in PostgreSQL. By default, when we > > create a composite type, it utilizes the "record_in" and "record_out" > > functions for input/output. Do you think it would be beneficial to > > expand the syntax to allow users to specify custom input/output > > functions when creating composite types? > > No. > > > I believe it would be beneficial because users creating a new type > > might prefer to define specific input/output syntax rather than > > conforming to what is accepted by the RECORD type. > Thanks for the quick response, Tom. > The primary outcome would be to require a huge amount of new work > to be done by a lot of software, much of it not under our control. Yeah, I agree with that. > And the impact wouldn't only be to software that would prefer not > to know about this. For example, how likely do you think it is > that these hypothetical user-defined I/O functions would cope > well with ALTER TABLE/ALTER TYPE commands that change those > rowtypes? That's a good point. I was primarily focused on altering the representation of input and output values, rather than considering changes to internal storage. However, offering this feature could indeed allow users to influence how values are stored. And that can potentially affect ALTER TYPE because then we do not have control over how those values are stored internally. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Why don't we support external input/output functions for the composite types
Hi, I'm curious about composite types in PostgreSQL. By default, when we create a composite type, it utilizes the "record_in" and "record_out" functions for input/output. Do you think it would be beneficial to expand the syntax to allow users to specify custom input/output functions when creating composite types? Has anyone attempted this before, and are there any design challenges associated with it? Or is it simply not implemented because it's not seen as a valuable addition? I believe it would be beneficial because users creating a new type might prefer to define specific input/output syntax rather than conforming to what is accepted by the RECORD type. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Apr 3, 2024 at 7:40 PM Alvaro Herrera wrote: > > Hello, > > On 2024-Apr-03, Alexander Lakhin wrote: > > > I've managed to trigger an assert added by 53c2a97a9. > > Please try the following script against a server compiled with > > -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the > > define, it just simplifies reproducing...): > > Ah yes, absolutely, we're missing to trade the correct SLRU bank lock > there. This rewrite of that small piece should fix it. Thanks for > reporting this. > Yeah, we missed acquiring the bank lock w.r.t. intervening pages, thanks for reporting. Your fix looks correct to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Mar 14, 2024 at 4:07 PM Heikki Linnakangas wrote: > > > Yeah, that's a very valid point. So I think now Heikki/Melanie might > > have got an answer to their question, about the thought process behind > > serializing the snapshot for each scan node. And the same thing is > > followed for BitmapHeapNode as well. > > I see. Thanks, understanding the thought process helps. > > So when a parallel table or index scan runs in the executor as part of a > query, we could just use the active snapshot. But there are some other > callers of parallel table scans that don't use the executor, namely > parallel index builds. For those it makes sense to pass the snapshot for > the scan independent of the active snapshot. Right > A parallel bitmap heap scan isn't really a parallel scan as far as the > table AM is concerned, though. It's more like an independent bitmap heap > scan in each worker process, nodeBitmapHeapscan.c does all the > coordination of which blocks to scan. So I think that > table_parallelscan_initialize() was the wrong role model, and we should > still remove the snapshot serialization code from nodeBitmapHeapscan.c. I think that seems right. > Digging deeper into the question of whether es_snapshot == > GetActiveSnapshot() is a valid assumption: > > > > es_snapshot is copied from the QueryDesc in standard_ExecutorStart(). > Looking at the callers of ExecutorStart(), they all get the QueryDesc by > calling CreateQueryDesc() with GetActiveSnapshot(). And I don't see any > callers changing the active snapshot between the ExecutorStart() and > ExecutorRun() calls either. In pquery.c, we explicitly > PushActiveSnapshot(queryDesc->snapshot) before calling ExecutorRun(). So > no live bug here AFAICS, es_snapshot == GetActiveSnapshot() holds. > > _SPI_execute_plan() has code to deal with the possibility that the > active snapshot is not set. That seems fishy; do we really support SPI > without any snapshot? I'm inclined to turn that into an error. I ran the > regression tests with an "Assert(ActiveSnapshotSet())" there, and > everything worked. IMHO, we can call SPI_connect() and SPI_execute() from any C extension, so I don't think we can guarantee that a snapshot is always set, can we? > If es_snapshot was different from the active snapshot, things would get > weird, even without parallel query. The scans would use es_snapshot for > the visibility checks, but any functions you execute in quals would use > the active snapshot. > > We could double down on that assumption, and remove es_snapshot > altogether and use GetActiveSnapshot() instead. And perhaps add > "PushActiveSnapshot(queryDesc->snapshot)" to ExecutorRun(). > > > > In summary, this es_snapshot stuff is a bit confusing and could use some > cleanup. But for now, I'd like to just add some assertions and a > comments about this, and remove the snapshot serialization from bitmap > heap scan node, to make it consistent with other non-parallel scan nodes > (it's not really a parallel scan as far as the table AM is concerned). > See attached patch, which is the same as previous patch with some extra > assertions. Maybe for now we can just handle this specific case and remove the snapshot serialization for the BitmapHeapScan, as you are doing in the patch. After looking into the code, your theory seems correct: we just copy the active snapshot while building the query descriptor, and from there we copy it into the EState, so logically there should not be any reason for these two to be different (a sketch of the pquery.c pattern follows below).
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
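For reference, the pquery.c pattern mentioned above looks roughly like this (a sketch; the exact ExecutorRun() argument list varies across versions, so treat the call as schematic):

#include "postgres.h"
#include "executor/executor.h"
#include "utils/snapmgr.h"

/* sketch: keep es_snapshot and the active snapshot in agreement for the run */
static void
run_query_with_snapshot(QueryDesc *queryDesc)
{
	PushActiveSnapshot(queryDesc->snapshot);
	ExecutorRun(queryDesc, ForwardScanDirection, 0 /* no row-count limit */, true);
	PopActiveSnapshot();
}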
Re: BitmapHeapScan streaming read user and prelim refactoring
On Wed, Mar 13, 2024 at 9:25 PM Robert Haas wrote: > > On Wed, Mar 13, 2024 at 11:39 AM Dilip Kumar wrote: > > > Andres already commented on the snapshot stuff on an earlier patch > > > version, and that's much nicer with this version. However, I don't > > > understand why a parallel bitmap heap scan needs to do anything at all > > > with the snapshot, even before these patches. The parallel worker > > > infrastructure already passes the active snapshot from the leader to the > > > parallel worker. Why does bitmap heap scan code need to do that too? > > > > Yeah thinking on this now it seems you are right that the parallel > > infrastructure is already passing the active snapshot so why do we > > need it again. Then I checked other low scan nodes like indexscan and > > seqscan and it seems we are doing the same things there as well. > > Check for SerializeSnapshot() in table_parallelscan_initialize() and > > index_parallelscan_initialize() which are being called from > > ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM() > > respectively. > > I remember thinking about this when I was writing very early parallel > query code. It seemed to me that there must be some reason why the > EState has a snapshot, as opposed to just using the active snapshot, > and so I took care to propagate that snapshot, which is used for the > leader's scans, to the worker scans also. Now, if the EState doesn't > need to contain a snapshot, then all of that mechanism is unnecessary, > but I don't see how it can be right for the leader to do > table_beginscan() using estate->es_snapshot and the worker to use the > active snapshot. Yeah, that's a very valid point. So I think now Heikki/Melanie might have got an answer to their question, about the thought process behind serializing the snapshot for each scan node. And the same thing is followed for BitmapHeapNode as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On Wed, Mar 13, 2024 at 7:04 PM Heikki Linnakangas wrote: > > (Adding Dilip, the original author of the parallel bitmap heap scan > patch all those years ago, in case you remember anything about the > snapshot stuff below.) > > On 27/02/2024 16:22, Melanie Plageman wrote: > Andres already commented on the snapshot stuff on an earlier patch > version, and that's much nicer with this version. However, I don't > understand why a parallel bitmap heap scan needs to do anything at all > with the snapshot, even before these patches. The parallel worker > infrastructure already passes the active snapshot from the leader to the > parallel worker. Why does bitmap heap scan code need to do that too? Yeah, thinking about this now, it seems you are right that the parallel infrastructure is already passing the active snapshot, so why do we need it again? Then I checked other low-level scan nodes like indexscan and seqscan, and it seems we are doing the same thing there as well. Check for SerializeSnapshot() in table_parallelscan_initialize() and index_parallelscan_initialize(), which are being called from ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM() respectively. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Streaming I/O, vectored I/O (WIP)
On Tue, Mar 12, 2024 at 12:10 PM Thomas Munro wrote: > > I think you'd be right if StartReadBuffers() were capable of > processing a sequence consisting of a hit followed by misses, but > currently it always gives up after the first hit. That is, it always > processes some number of misses (0-16) and then at most one hit. So > for now the variable would always turn out to be the same as blockNum. > Okay, then shouldn't this "if (found)" block immediately break the loop, so that when we hit a block we just return it? What you explained makes sense, but with the current code, if the first few blocks are hits followed by misses, then we will issue smgrprefetch() for the initial hit blocks as well.

+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;    (Dilip: I think we should break after this?)
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			operation->io_buffers_len++;
+		}
+	}

> The reason is that I wanted to allows "full sized" read system calls > to form. If you said "hey please read these 16 blocks" (I'm calling > that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits, > then it could only form a read of 14 blocks, but there might be more > blocks that could be read after those. We would have some arbitrary > shorter read system calls, when we wanted to make them all as big as > possible. So in the current patch you say "hey please read these 16 > blocks" and it returns saying "only read 1", you call again with 15 > and it says "only read 1", and you call again and says "read 16!" > (assuming 2 more were readable after the original range we started > with). Then physical reads are maximised. Maybe there is some nice > way to solve that, but I thought this way was the simplest (and if > there is some instruction-cache-locality/tight-loop/perf reason why we > should work harder to find ranges of hits, it could be for later). > Does that make sense? Understood, I think this makes sense. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Streaming I/O, vectored I/O (WIP)
On Sat, Mar 9, 2024 at 3:55 AM Thomas Munro wrote: > Hi Thomas, I am planning to review this patch set, so I started going through 0001. I have a question about how we are issuing smgrprefetch() in StartReadBuffers():

+	if (operation->io_buffers_len > 0)
+	{
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
+			/*
+			 * In theory we should only do this if PrepareReadBuffers() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}

This always issues smgrprefetch() starting with the input blockNum; shouldn't we pass the first blockNum that we did not find in the buffer pool? So basically, in the loop above this call where we are doing PrepareReadBuffer(), we should track the first blockNum for which 'found' is not true and pass that blockNum to smgrprefetch() as the first block, right? (A sketch follows below.) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
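In code form, the suggestion is roughly the following (a sketch against the WIP patch: PrepareReadBuffer(), READ_BUFFERS_ISSUE_ADVICE, and the surrounding variable names come from that patch and may well differ in later versions, so treat this purely as an illustration):

	BlockNumber first_miss = InvalidBlockNumber;

	for (int i = 0; i < actual_nblocks; i++)
	{
		bool		found;

		buffers[i] = PrepareReadBuffer(bmr, forkNum, blockNum + i,
									   strategy, &found);

		/* remember the first block that was NOT already in shared buffers */
		if (!found && first_miss == InvalidBlockNumber)
			first_miss = blockNum + i;
	}

	if ((flags & READ_BUFFERS_ISSUE_ADVICE) && first_miss != InvalidBlockNumber)
		smgrprefetch(bmr.smgr, forkNum, first_miss,
					 (int) (blockNum + operation->io_buffers_len - first_miss));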
Re: PostgreSQL Contributors Updates
On Sun, Mar 3, 2024 at 9:28 PM Joe Conway wrote: > > All, > > The PostgreSQL Contributor Page > (https://www.postgresql.org/community/contributors/) includes people who > have made substantial, long-term contributions of time and effort to the > PostgreSQL project. The PostgreSQL Contributors Team recognizes the > following people for their contributions. > > New PostgreSQL Contributors: > > * Bertrand Drouvot > * Gabriele Bartolini > * Richard Guo > > New PostgreSQL Major Contributors: > > * Alexander Lakhin > * Daniel Gustafsson > * Dean Rasheed > * John Naylor > * Melanie Plageman > * Nathan Bossart > > Thank you and congratulations to all! > Congratulations to all! -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: pgsql: Improve performance of subsystems on top of SLRU
On Mon, Mar 4, 2024 at 1:56 AM Alvaro Herrera wrote: > > On 2024-Feb-28, Alvaro Herrera wrote: > > > Improve performance of subsystems on top of SLRU > > Coverity had the following complaint about this commit: > > > *** CID NNN: Control flow issues (DEADCODE)
> /srv/coverity/git/pgsql-git/postgresql/src/backend/access/transam/multixact.c:
> 1375 in GetMultiXactIdMembers()
> 1369	 * and acquire the lock of the new bank.
> 1370	 */
> 1371	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
> 1372	if (lock != prevlock)
> 1373	{
> 1374		if (prevlock != NULL)
> >>> CID 1592913: Control flow issues (DEADCODE)
> >>> Execution cannot reach this statement: "LWLockRelease(prevlock);".
> 1375			LWLockRelease(prevlock);
> 1376		LWLockAcquire(lock, LW_EXCLUSIVE);
> 1377		prevlock = lock;
> 1378	}
> 1379
> 1380	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
>
> And I think it's correct that this is somewhat bogus, or at least > confusing: the only way to have control back here on line 1371 after > having executed once is via the "goto retry" line below; and there we > release "prevlock" and set it to NULL beforehand, so it's impossible for > prevlock to be NULL. Looking closer I think this code is all confused, > so I suggest to rework it as shown in the attached patch. > > I'll have a look at the other places where we use this "prevlock" coding > pattern tomorrow.

+	/* Acquire the bank lock for the page we need. */
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-	if (lock != prevlock)
-	{
-		if (prevlock != NULL)
-			LWLockRelease(prevlock);
-		LWLockAcquire(lock, LW_EXCLUSIVE);
-		prevlock = lock;
-	}
+	LWLockAcquire(lock, LW_EXCLUSIVE);

This part is definitely an improvement. I am not sure about the other changes; they make the code much simpler, but now we are not releasing the 'MultiXactOffsetCtl'-related bank lock, and later in the following loop we compare that lock against the 'MultiXactMemberCtl'-related bank lock. This makes the code simpler because now in the loop we are sure that we are always holding the lock, but I do not like comparing the bank locks of two different SLRUs, although there is no real problem since there would not be a common lock address. Anyway, I do not have any strong objection to what you have done here. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 27, 2024 at 11:41 PM Alvaro Herrera wrote: > On 2024-Feb-27, Alvaro Herrera wrote: > > > Here's the complete set, with these two names using the singular. > > BTW one thing I had not noticed is that before this patch we have > minimum shmem size that's lower than the lowest you can go with the new > code. > > This means Postgres may no longer start when extremely tight memory > restrictions (and of course use more memory even when idle or with small > databases). I wonder to what extent should we make an effort to relax > that. For small, largely inactive servers, this is just memory we use > for no good reason. However, anything we do here will impact > performance on the high end, because as Andrey says this will add > calculations and jumps where there are none today. > > I was just comparing the minimum memory required for SLRU when the system is minimally configured, correct me if I am wrong.

SLRU                         unpatched   patched
commit_timestamp_buffers          4         16
subtransaction_buffers           32         16
transaction_buffers               4         16
multixact_offset_buffers          8         16
multixact_member_buffers         16         16
notify_buffers                    8         16
serializable_buffers             16         16
------------------------------------------------
total buffers                    88        112

so that is < 200kB of extra memory on a minimally configured system, IMHO this should not matter. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Feb 26, 2024 at 9:46 PM Alvaro Herrera wrote: > On 2024-Feb-23, Dilip Kumar wrote: > > + > + For each SLRU area that's part of the core server, > + there is a configuration parameter that controls its size, with the > suffix > + _buffers appended. For historical > + reasons, the names are not exact matches, but Xact > + corresponds to transaction_buffers and the rest > should > + be obvious. > + > + > > I think I would like to suggest renaming the GUCs to have the _slru_ bit > in the middle: > > +# - SLRU Buffers (change requires restart) - > + > +#commit_timestamp_slru_buffers = 0 # memory for pg_commit_ts (0 > = auto) > +#multixact_offsets_slru_buffers = 16# memory for > pg_multixact/offsets > +#multixact_members_slru_buffers = 32# memory for > pg_multixact/members > +#notify_slru_buffers = 16 # memory for pg_notify > +#serializable_slru_buffers = 32 # memory for pg_serial > +#subtransaction_slru_buffers = 0# memory for pg_subtrans (0 = > auto) > +#transaction_slru_buffers = 0 # memory for pg_xact (0 = > auto) > > and the pgstat_internal.h table: > > static const char *const slru_names[] = { > "commit_timestamp", > "multixact_members", > "multixact_offsets", > "notify", > "serializable", > "subtransaction", > "transaction", > "other" /* has to be last > */ > }; > > This way they match perfectly. > Yeah, I think this looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Feb 23, 2024 at 1:06 PM Dilip Kumar wrote: > > On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera > wrote: > > > > On 2024-Feb-07, Dilip Kumar wrote: > > > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera > > > wrote: > > > > > > Sure, but is that really what we want? > > > > > > So your question is do we want these buffers to be in multiple of > > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > > don't think it should create any problem logically. I mean we can > > > look again in the patch to see if we have made any such assumptions > > > but that should be fairly easy to fix, then maybe if we are going in > > > this way we should get rid of the check_slru_buffers() function as > > > well. > > > > Not really, I just don't think the macro should be in slru.h. > > Okay > > > Another thing I've been thinking is that perhaps it would be useful to > > make the banks smaller, when the total number of buffers is small. For > > example, if you have 16 or 32 buffers, it's not really clear to me that > > it makes sense to have just 1 bank or 2 banks. It might be more > > sensible to have 4 banks with 4 or 8 buffers instead. That should make > > the algorithm scale down as well as up ... > > It might be helpful to have small-size banks when SLRU buffers are set > to a very low value and we are only accessing a couple of pages at a > time (i.e. no buffer replacement) because in such cases most of the > contention will be on SLRU Bank lock. Although I am not sure how > practical such a use case would be, I mean if someone is using > multi-xact very heavily or creating frequent subtransaction overflow > then wouldn't they should set this buffer limit to some big enough > value? By doing this we would lose some simplicity of the patch I > mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would > need to compute this and store it in SlruShared. Maybe that's not that > bad. > > > > > I haven't done either of those things in the attached v19 version. I > > did go over the comments once again and rewrote the parts I was unhappy > > with, including some existing ones. I think it's OK now from that point > > of view ... at some point I thought about creating a separate README, > > but in the end I thought it not necessary. > > Thanks, I will review those changes. Few other things I noticed while reading through the patch; I haven't read it completely yet, but this is what I have got so far.

1.
+ * If no process is already in the list, we're the leader; our first step
+ * is to "close out the group" by resetting the list pointer from
+ * ProcGlobal->clogGroupFirst (this lets other processes set up other
+ * groups later); then we lock the SLRU bank corresponding to our group's
+ * page, do the SLRU updates, release the SLRU bank lock, and wake up the
+ * sleeping processes.

I think here we are saying that we "close out the group" before acquiring the SLRU lock, but that's not true. We keep the group open until we get the lock, so that we can gather the maximum number of members while we are anyway waiting for the lock.

2.
 static void
 TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts,
 						 RepOriginId nodeid, int slotno)
 {
-	Assert(TransactionIdIsNormal(xid));
+	if (!TransactionIdIsNormal(xid))
+		return;
+
+	entryno = TransactionIdToCTsEntry(xid);

I do not understand why we need this change. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera > > wrote: > > > > Sure, but is that really what we want? > > > > So your question is do we want these buffers to be in multiple of > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > don't think it should create any problem logically. I mean we can > > look again in the patch to see if we have made any such assumptions > > but that should be fairly easy to fix, then maybe if we are going in > > this way we should get rid of the check_slru_buffers() function as > > well. > > Not really, I just don't think the macro should be in slru.h. Okay > Another thing I've been thinking is that perhaps it would be useful to > make the banks smaller, when the total number of buffers is small. For > example, if you have 16 or 32 buffers, it's not really clear to me that > it makes sense to have just 1 bank or 2 banks. It might be more > sensible to have 4 banks with 4 or 8 buffers instead. That should make > the algorithm scale down as well as up ... It might be helpful to have small-size banks when SLRU buffers are set to a very low value and we are only accessing a couple of pages at a time (i.e. no buffer replacement), because in such cases most of the contention will be on the SLRU bank lock. Although I am not sure how practical such a use case would be: if someone is using multi-xact very heavily or creating frequent subtransaction overflow, wouldn't they set this buffer limit to a big enough value? By doing this we would lose some of the simplicity of the patch; instead of using the simple SLRU_BANK_SIZE macro, we would need to compute this and store it in SlruShared. Maybe that's not that bad. > > I haven't done either of those things in the attached v19 version. I > did go over the comments once again and rewrote the parts I was unhappy > with, including some existing ones. I think it's OK now from that point > of view ... at some point I thought about creating a separate README, > but in the end I thought it not necessary. Thanks, I will review those changes. > I did add a bunch of Assert()s to make sure the locks that are supposed > to be held are actually held. This led me to testing the page status to > be not EMPTY during SimpleLruWriteAll() before calling > SlruInternalWritePage(), because the assert was firing. The previous > code is not really *buggy*, but to me it's weird to call WritePage() on > a slot with no contents. Okay, I mean internally SlruInternalWritePage() will flush only if the status is SLRU_PAGE_VALID, but it is better the way you have done it. > Another change was in TransactionGroupUpdateXidStatus: the original code > had the leader doing pg_atomic_read_u32(&procglobal->clogGroupFirst) to > know which bank to lock. I changed it to simply be the page used by the > leader process; this doesn't need an atomic read, and should be the same > page anyway. (If it isn't, it's no big deal). But what's more: even if > we do read ->clogGroupFirst at that point, there's no guarantee that > this is going to be exactly for the same process that ends up being the > first in the list, because since we have not set it to INVALID by the > time we grab the bank lock, it is quite possible for more processes to > add themselves to the list. Yeah, this looks better > I realized all this while rewriting the comments in a way that would let > me understand what was going on ... so IMO the effort was worthwhile.
+1. I will review and do some more testing early next week and share my feedback. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 21, 2024 at 2:52 PM Robert Haas wrote: > > On Wed, Feb 21, 2024 at 2:43 PM Dilip Kumar wrote: > > So the problem is that we might consider the transaction change as > > non-transaction and mark this flag as true. > > But it's not "might" right? It's absolutely 100% certain that we will > consider that transaction's changes as non-transactional ... because > when we're in fast-forward mode, the table of new relfilenodes is not > built, and so whenever we check whether any transaction made a new > relfilenode for this sequence, the answer will be no. > > > But what would have > > happened if we would have identified it correctly as transactional? > > In such cases, we wouldn't have set this flag here but then we would > > have set this while processing the DecodeAbort/DecodeCommit, so the > > net effect would be the same no? You may question what if the > > Abort/Commit WAL never appears in the WAL, but this flag is > > specifically for the upgrade case, and in that case we have to do a > > clean shutdown so may not be an issue. But in the future, if we try > > to use 'ctx->processing_required' for something else where the clean > > shutdown is not guaranteed then this flag can be set incorrectly. > > > > I am not arguing that this is a perfect design but I am just making a > > point about why it would work. > > Even if this argument is correct (and I don't know if it is), the code > and comments need some updating. We should not be testing a flag that > is guaranteed false with comments that make it sound like the value of > the flag is trustworthy when it isn't. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Wed, Feb 21, 2024 at 1:24 PM Robert Haas wrote: > > On Wed, Feb 21, 2024 at 1:06 PM Dilip Kumar wrote: > > > But I am wondering why this flag is always set to true in > > > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the > > > aborted transactions are not supposed to be replayed? So if my > > > observation is correct that for the aborted transaction, this > > > shouldn't be set to true then we have a problem with sequence where we > > > are identifying the transactional changes as non-transaction changes > > > because now for transactional changes this should depend upon commit > > > status. > > > > I have checked this case with Amit Kapila. So it seems in the cases > > where we have sent the prepared transaction or streamed in-progress > > transaction we would need to send the abort also, and for that reason, > > we are setting 'ctx->processing_required' as true so that if these > > WALs are not streamed we do not allow upgrade of such slots. > > I don't find this explanation clear enough for me to understand. Here is why we set 'ctx->processing_required' to true from DecodeCommit as well as DecodeAbort: for upgrading logical replication slots, it's essential to ensure these slots are completely synchronized with the subscriber. To identify that, we process all the pending WAL in 'fast_forward' mode to find whether there is any decodable WAL or not. So, in short, any WAL that we stream to the standby in normal (non-fast_forward) mode is considered decodable, and so is the abort WAL. That's the reason why, at the end of the transaction commit/abort, we need to set 'ctx->processing_required' to true: some decodable WAL exists, so we cannot upgrade this slot.

Why is the below check safe?

> +	if (ctx->fast_forward)
> +	{
> +		/*
> +		 * We need to set processing_required flag to notify the sequence
> +		 * change existence to the caller. Usually, the flag is set when
> +		 * either the COMMIT or ABORT records are decoded, but this must be
> +		 * turned on here because the non-transactional logical message is
> +		 * decoded without waiting for these records.
> +		 */
> +		if (!transactional)
> +			ctx->processing_required = true;
> +
> +		return;
> +	}

So the problem is that we might consider a transactional change as non-transactional and mark this flag as true. But what would have happened if we had identified it correctly as transactional? In that case, we wouldn't have set this flag here, but we would have set it while processing DecodeAbort/DecodeCommit, so the net effect would be the same, no? You may ask what happens if the Abort/Commit record never appears in the WAL; this flag is specifically for the upgrade case, and in that case we have to do a clean shutdown, so it may not be an issue. But in the future, if we try to use 'ctx->processing_required' for something else where a clean shutdown is not guaranteed, then this flag can be set incorrectly. I am not arguing that this is a perfect design, but I am just making a point about why it would work. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 4:32 PM Dilip Kumar wrote: > > On Tue, Feb 20, 2024 at 3:38 PM Robert Haas wrote: > > > Let's say fast_forward is true. Then smgr_decode() is going to skip > > recording anything about the relfilenode, so we'll identify all > > sequence changes as non-transactional. But look at how this case is > > handled in seq_decode(): > > > > + if (ctx->fast_forward) > > + { > > + /* > > + * We need to set processing_required flag to notify the sequence > > + * change existence to the caller. Usually, the flag is set when > > + * either the COMMIT or ABORT records are decoded, but this must be > > + * turned on here because the non-transactional logical message is > > + * decoded without waiting for these records. > > + */ > > + if (!transactional) > > + ctx->processing_required = true; > > + > > + return; > > + } > > It appears that the 'processing_required' flag was introduced as part > of supporting upgrades for logical replication slots. Its purpose is > to determine whether a slot is fully caught up, meaning that there are > no pending decodable changes left before it can be upgraded. > > So now if some change was transactional but we have identified it as > non-transaction then we will mark this flag 'ctx->processing_required > = true;' so we temporarily set this flag incorrectly, but even if the > flag would have been correctly identified initially, it would have > been set again to true in the DecodeTXNNeedSkip() function regardless > of whether the transaction is committed or aborted. As a result, the > flag would eventually be set to 'true', and the behavior would align > with the intended logic. > > But I am wondering why this flag is always set to true in > DecodeTXNNeedSkip() irrespective of the commit or abort. Because the > aborted transactions are not supposed to be replayed? So if my > observation is correct that for the aborted transaction, this > shouldn't be set to true then we have a problem with sequence where we > are identifying the transactional changes as non-transaction changes > because now for transactional changes this should depend upon commit > status. I have checked this case with Amit Kapila. So it seems in the cases where we have sent the prepared transaction or streamed in-progress transaction we would need to send the abort also, and for that reason, we are setting 'ctx->processing_required' as true so that if these WALs are not streamed we do not allow upgrade of such slots. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 3:38 PM Robert Haas wrote: > Let's say fast_forward is true. Then smgr_decode() is going to skip > recording anything about the relfilenode, so we'll identify all > sequence changes as non-transactional. But look at how this case is > handled in seq_decode():
>
> +	if (ctx->fast_forward)
> +	{
> +		/*
> +		 * We need to set processing_required flag to notify the sequence
> +		 * change existence to the caller. Usually, the flag is set when
> +		 * either the COMMIT or ABORT records are decoded, but this must be
> +		 * turned on here because the non-transactional logical message is
> +		 * decoded without waiting for these records.
> +		 */
> +		if (!transactional)
> +			ctx->processing_required = true;
> +
> +		return;
> +	}

It appears that the 'processing_required' flag was introduced as part of supporting upgrades for logical replication slots. Its purpose is to determine whether a slot is fully caught up, meaning that there are no pending decodable changes left before it can be upgraded.

So now, if some change was transactional but we have identified it as non-transactional, we will set 'ctx->processing_required = true;'. We temporarily set this flag incorrectly, but even if the change had been identified correctly initially, the flag would have been set to true anyway in the DecodeTXNNeedSkip() function, regardless of whether the transaction is committed or aborted. As a result, the flag would eventually be set to 'true', and the behavior would align with the intended logic.

But I am wondering why this flag is always set to true in DecodeTXNNeedSkip() irrespective of commit or abort, given that aborted transactions are not supposed to be replayed. If my observation is correct that this shouldn't be set to true for an aborted transaction, then we have a problem with sequences, where we are identifying transactional changes as non-transactional ones, because for transactional changes this should depend upon the commit status.

On another thought, can there be a situation where we have wrongly identified a change as non-transactional and set this flag, and the commit/abort record never appears in the WAL so it is never decoded? That could also lead to an incorrect decision during the upgrade. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding and replication of sequences, take 2
On Tue, Feb 20, 2024 at 10:30 AM Robert Haas wrote: > > Is the rule that changes are transactional if and only if the current > transaction has assigned a new relfilenode to the sequence? Yes, that's the rule. > Why does the logic get confused if the state of the snapshot changes? The rule doesn't change, but the way this identification is implemented in decoding gets confused and treats a transactional change as non-transactional. The identification of whether a sequence change is transactional or not is based on what WAL we have decoded from the particular transaction, and whether we decode a particular WAL record or not depends upon the snapshot state (it's about what we decode, not necessarily what we send). So if the snapshot state changed mid-transaction, that means we haven't decoded the WAL that created the new relfilenode, but we will decode the WAL that operates on the sequence. Here we will assume the change is non-transactional whereas it was actually transactional, because we did not decode some of the changes of the transaction that we rely on for identifying whether it is transactional or not (a standalone sketch of this rule follows at the end of this message). > My naive reaction is that it kinda sounds like you're relying on two > different mistakes cancelling each other out, and that might be a bad > idea, because maybe there's some situation where they don't. But I > don't understand the issue well enough to have an educated opinion at > this point. I would say the first one is a mistake in identifying a transactional change as non-transactional during decoding, and that mistake happens only when we decode the transaction partially. But we never stream partially decoded transactions downstream, which means that even though we have made a mistake in decoding it, we are not streaming it, so our mistake does not turn into a real problem. But again, I agree there is a temporary wrong decision, and if we try to do something else based on this decision then it could be an issue. You might be interested in more detail [1] where I first reported this problem and also [2] where we concluded why this is not creating a real problem. [1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
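The rule itself is simple enough to state in code. This standalone sketch (purely illustrative, not PostgreSQL source) shows why partial decoding breaks it: if the relfilenode-creation record was skipped, the tracked set is empty and the change is misclassified:

#include <stdbool.h>
#include <stdio.h>

#define MAX_NEW_FILENODES 16

/* relfilenodes created by the transaction so far, as seen by the decoder */
static unsigned new_filenodes[MAX_NEW_FILENODES];
static int	n_new_filenodes = 0;

static void
decode_create_filenode(unsigned relfilenode)
{
	if (n_new_filenodes < MAX_NEW_FILENODES)
		new_filenodes[n_new_filenodes++] = relfilenode;
}

/* transactional iff this transaction created the sequence's relfilenode */
static bool
sequence_change_is_transactional(unsigned relfilenode)
{
	for (int i = 0; i < n_new_filenodes; i++)
		if (new_filenodes[i] == relfilenode)
			return true;
	return false;
}

int
main(void)
{
	/*
	 * Full decoding: the creation record is seen, so a later sequence
	 * change on relfilenode 1234 is correctly classified as transactional.
	 */
	decode_create_filenode(1234);
	printf("%d\n", sequence_change_is_transactional(1234));	/* prints 1 */

	/*
	 * Partial decoding (snapshot became consistent mid-transaction): the
	 * creation record was never decoded, so the same change would be
	 * misclassified as non-transactional.
	 */
	n_new_filenodes = 0;
	printf("%d\n", sequence_change_is_transactional(1234));	/* prints 0 */
	return 0;
}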
Re: Synchronizing slots from primary to standby
> > So now if we have such a functionality then it would be even better to > > extend it to selectively sync the slot. For example, if there is some > > issue in syncing all slots, maybe some bug or taking a long time to > > sync because there are a lot of slots but if the user needs to quickly > > failover and he/she is interested in only a couple of slots then such > > a option could be helpful. no? > > > > I see your point but not sure how useful it is in the field. I am fine > if others also think such a parameter will be useful and anyway I > think we can even extend it after v1 is done. > Okay, I am fine with that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera > > > wrote: > > > > > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > > > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > > > because we don't have that macro at that point, so I just used constant > > > 16. Obviously need a better solution for this. > > > > If we define SLRU_BANK_SIZE in slru.h then we can use it here right, > > because these files are anyway include slru.h so. > > Sure, but is that really what we want? So your question is: do we want these buffers to be a multiple of SLRU_BANK_SIZE? Maybe we can have the last bank be partial; I don't think it should create any problem logically. We can look again at the patch to see if we have made any such assumptions, but that should be fairly easy to fix, and then, if we are going this way, maybe we should get rid of the check_slru_buffers() function as well. > > > I've been wondering whether we should add a "slru" to the name of the > > > GUCs: > > > > > > commit_timestamp_slru_buffers > > > transaction_slru_buffers > > > etc > > > > I am not sure we are exposing anything related to SLRU to the user, > > We do -- we have pg_stat_slru already. > > > I mean transaction_buffers should make sense for the user that it > > stores transaction-related data in some buffers pool but whether that > > buffer pool is called SLRU or not doesn't matter much to the user > > IMHO. > > Yeah, that's exactly what my initial argument was for naming these this > way. But since the term slru already escaped into the wild via the > pg_stat_slru view, perhaps it helps users make the connection between > these things. Alternatively, we can cross-reference each term from the > other's documentation and call it a day. Yeah, that's true; I forgot this point about pg_stat_slru. From this PoV, if the configuration names contain slru, users would be able to make a better connection between the configured values and the stats in this view, and based on that they can decide whether they need to increase the size of these SLRU buffers. > Another painful point is that pg_stat_slru uses internal names in the > data it outputs, which obviously do not match the new GUCs. Yeah, that's true, but I think this could be explained somewhere; I am not sure what the right place for this is. FYI, I have also repeated all the performance tests I performed in my first email[1], and I am seeing a similar gain. Some comments on v18 in my first pass of the review.

1.
@@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
 	lsnindex = GetLSNIndex(slotno, xid);
 	*lsn = XactCtl->shared->group_lsn[lsnindex];
-	LWLockRelease(XactSLRULock);
+	LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno));

Maybe here we can add an assert before releasing the lock, for a safety check:

Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno)));

2.
+ *
+ * XXX could we make the LSNs to be bank-based?
  */
 	XLogRecPtr *group_lsn;

IMHO, the flush still happens at the page level, so the LSN up to which we must flush before flushing a particular clog page should also be maintained at the page level.

[1] https://www.postgresql.org/message-id/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera wrote: > > Here's the rest of it rebased on top of current master. I think it > makes sense to have this as one individual commit. > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > because we don't have that macro at that point, so I just used constant > 16. Obviously need a better solution for this. If we define SLRU_BANK_SIZE in slru.h then we can use it here, right? These files already include slru.h anyway. > > I also changed the location of bank_mask in SlruCtlData for better > packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO() > to SlotGetBankNumber(). Okay. > Some very critical comments still need to be updated to the new design, > particularly anything that mentions "control lock"; but also the overall > model needs to be explained in some central location, rather than > incongruently some pieces here and other pieces there. I'll see about > this later. But at least this is code you should be able to play with. Okay, I will review and test this. > I've been wondering whether we should add a "slru" to the name of the > GUCs: > > commit_timestamp_slru_buffers > transaction_slru_buffers > etc I am not sure we are exposing anything related to SLRU to the user. I mean, transaction_buffers should make sense to the user in that it stores transaction-related data in some buffer pool, but whether that buffer pool is called SLRU or not doesn't matter much to the user, IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
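To make the suggestion concrete, a rough sketch of what one of these functions could look like once SLRU_BANK_SIZE is available from slru.h (the clamping constants follow the existing CLOGShmemBuffers() formula; the rounding policy is an assumption):

#include "access/slru.h"    /* for SLRU_BANK_SIZE */

/* Sketch: size the CLOG buffer pool as a whole number of banks. */
Size
CLOGShmemBuffers(void)
{
    Size    nbuffers = Min(128, Max(SLRU_BANK_SIZE, NBuffers / 512));

    /* round down to a multiple of the bank size */
    return nbuffers - (nbuffers % SLRU_BANK_SIZE);
}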
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 6, 2024 at 4:23 PM Alvaro Herrera wrote: > > > > (We also have SimpleLruTruncate, but I think it's not as critical to > > > have a barrier there anyhow: accessing a slightly outdated page number > > > could only be a problem if a bug elsewhere causes us to try to truncate > > > in the current page. I think we only have this code there because we > > > did have such bugs in the past, but IIUC this shouldn't happen anymore.) > > > > +1, I agree with this theory in general. But the below comment in > > SimpleLruTruncate in your v3 patch doesn't seem correct, because here > > we are checking if the latest_page_number is smaller than the cutoff > > if so we log it as wraparound and skip the whole thing and that is > > fine even if we are reading with atomic variable and slightly outdated > > value should not be a problem but the comment claims that this is safe > > because we have the same bank lock as SimpleLruZeroPage(), but that's > > not true; here we will be acquiring different bank locks one by one > > based on which slotno we are checking. Am I missing something? > > I think you're correct. I reworded this comment, so now it says this: > > /* > * An important safety check: the current endpoint page must not be > * eligible for removal. This check is just a backstop against wraparound > * bugs elsewhere in SLRU handling, so we don't care if we read a slightly > * outdated value; therefore we don't add a memory barrier. > */ > > Pushed with those changes. Thank you! Yeah, this looks perfect, thanks. > Now I'll go rebase the rest of the patch on top. Okay, I will review and test after that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Synchronizing slots from primary to standby
On Tue, Feb 6, 2024 at 3:41 PM Amit Kapila wrote: > > On Tue, Feb 6, 2024 at 3:23 PM Dilip Kumar wrote: > > > > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada > > wrote: > > > > > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila > > > wrote: > > > > > > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada > > > > wrote: > > > > > > > > > > --- > > > > > Since two processes (e.g. the slotsync worker and > > > > > pg_sync_replication_slots()) concurrently fetch and update the slot > > > > > information, there is a race condition where slot's > > > > > confirmed_flush_lsn goes backward. > > > > > > > > > > > > > Right, this is possible, though there shouldn't be a problem because > > > > anyway, slotsync is an async process. Till we hold restart_lsn, the > > > > required WAL won't be removed. Having said that, I can think of two > > > > ways to avoid it: (a) We can have some flag in shared memory using > > > > which we can detect whether any other process is doing slot > > > > synchronization and then either error out at that time or simply wait > > > > or may take nowait kind of parameter from user to decide what to do? > > > > If this is feasible, we can simply error out for the first version and > > > > extend it later if we see any use cases for the same (b) similar to > > > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an > > > > error, this is good for now but in future we may still have another > > > > similar issue, so I would prefer (a) among these but I am fine if you > > > > prefer (b) or have some other ideas like just note down in comments > > > > that this is a harmless case and can happen only very rarely. > > > > > > Thank you for sharing the ideas. I would prefer (a). For (b), the same > > > issue still happens for other fields. > > > > I agree that (a) looks better. On a separate note, while looking at > > this API, pg_sync_replication_slots(PG_FUNCTION_ARGS): shouldn't there > > be an optional parameter to give one slot or multiple slots, with all > > slots as the default? That would give better control to the user, no? > > > > As of now, we want to give functionality similar to slotsync worker > with a difference that users can use this new function for planned > switchovers. So, syncing all failover slots by default. I think if > there is a use case to selectively sync some of the failover slots > then we can probably extend this function and slotsync worker as well. > Normally, if the primary goes down due to whatever reason users would > want to restart the replication for all the defined publications via > existing failover slots. Why would anyone want to do it partially? If we consider the usability of such a function (I mean as it is implemented now, without any argument), one use case could be that the slot sync worker is not keeping up, or at some point the user doesn't want to wait for the worker and instead wants to do the sync himself. So if we have such functionality, then it would be even better to extend it to selectively sync the slots. For example, suppose there is some issue in syncing all slots, maybe some bug, or the sync takes a long time because there are a lot of slots, but the user needs to fail over quickly and is interested in only a couple of slots; then such an option could be helpful, no? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
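At the C level, the optional argument could look roughly like this sketch (the NULL-means-all convention and the SyncReplicationSlots() signature are assumptions for illustration, not the patch's actual code):

Datum
pg_sync_replication_slots(PG_FUNCTION_ARGS)
{
    /* Hypothetical: a NULL argument means "sync all failover slots". */
    if (PG_ARGISNULL(0))
        SyncReplicationSlots(NULL);
    else
    {
        ArrayType  *slot_names = PG_GETARG_ARRAYTYPE_P(0);

        SyncReplicationSlots(slot_names);   /* only the named slots */
    }

    PG_RETURN_VOID();
}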
Re: Question on LWLockMode in dsa.c
On Tue, Jan 30, 2024 at 6:24 AM Masahiko Sawada wrote: > > Hi, > > While working on radix tree patch[1], John Naylor found that dsa.c > doesn't already use shared locks even in dsa_dump(). dsa_dump() seems > a pure read-only function so I thought we could use a shared lock mode > there. Is there any reason to use exclusive mode even in dsa_dump()? > > Ultimately, since we're trying to add a new function > dsa_get_total_size() that just returns > dsa_area_control.total_segment_size and therefore would also be a > read-only function, I'd like to find out the correct lock mode there. > Doesn't seem like there is any reason for this to be an exclusive lock. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
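Concretely, the new function would presumably look like this sketch (assuming dsa.c's existing DSA_AREA_LOCK macro; treat it as an illustration rather than the final patch):

/* Sketch: a read-only accessor should only need a shared lock. */
size_t
dsa_get_total_size(dsa_area *area)
{
    size_t  size;

    LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
    size = area->control->total_segment_size;
    LWLockRelease(DSA_AREA_LOCK(area));

    return size;
}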
Re: Synchronizing slots from primary to standby
On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada wrote: > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila wrote: > > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada > > wrote: > > > > > > --- > > > Since two processes (e.g. the slotsync worker and > > > pg_sync_replication_slots()) concurrently fetch and update the slot > > > information, there is a race condition where slot's > > > confirmed_flush_lsn goes backward. > > > > > > > Right, this is possible, though there shouldn't be a problem because > > anyway, slotsync is an async process. Till we hold restart_lsn, the > > required WAL won't be removed. Having said that, I can think of two > > ways to avoid it: (a) We can have some flag in shared memory using > > which we can detect whether any other process is doing slot > > synchronization and then either error out at that time or simply wait > > or may take nowait kind of parameter from user to decide what to do? > > If this is feasible, we can simply error out for the first version and > > extend it later if we see any use cases for the same (b) similar to > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an > > error, this is good for now but in future we may still have another > > similar issue, so I would prefer (a) among these but I am fine if you > > prefer (b) or have some other ideas like just note down in comments > > that this is a harmless case and can happen only very rarely. > > Thank you for sharing the ideas. I would prefer (a). For (b), the same > issue still happens for other fields. I agree that (a) looks better. On a separate note, while looking at this API, pg_sync_replication_slots(PG_FUNCTION_ARGS): shouldn't there be an optional parameter to give one slot or multiple slots, with all slots as the default? That would give better control to the user, no? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
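For option (a), a minimal sketch of the shared-memory flag (the struct and function names are made up for illustration; error-out-only matches the "first version" behavior suggested above):

/* Hypothetical shared state guarding slot synchronization. */
typedef struct SlotSyncCtxStruct
{
    bool        syncing;    /* is some process already syncing slots? */
    slock_t     mutex;
} SlotSyncCtxStruct;

static SlotSyncCtxStruct *SlotSyncCtx;

static void
set_syncing_or_error(void)
{
    SpinLockAcquire(&SlotSyncCtx->mutex);
    if (SlotSyncCtx->syncing)
    {
        SpinLockRelease(&SlotSyncCtx->mutex);
        ereport(ERROR,
                (errmsg("cannot synchronize replication slots concurrently")));
    }
    SlotSyncCtx->syncing = true;
    SpinLockRelease(&SlotSyncCtx->mutex);

    /* the flag must be reset on exit and on error (e.g. via PG_ENSURE_ERROR_CLEANUP) */
}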
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Thu, Jan 11, 2024 at 10:48 AM Bharath Rupireddy wrote: > > Hi, > > Replication slots in postgres will prevent removal of required > resources when there is no connection using them (inactive). This > consumes storage because neither required WAL nor required rows from > the user tables/system catalogs can be removed by VACUUM as long as > they are required by a replication slot. In extreme cases this could > cause the transaction ID wraparound. > > Currently postgres has the ability to invalidate inactive replication > slots based on the amount of WAL (set via max_slot_wal_keep_size GUC) > that will be needed for the slots in case they become active. However, > the wraparound issue isn't effectively covered by > max_slot_wal_keep_size - one can't tell postgres to invalidate a > replication slot if it is blocking VACUUM. Also, it is often tricky to > choose a default value for max_slot_wal_keep_size, because the amount > of WAL that gets generated and allocated storage for the database can > vary. > > Therefore, it is often easy for developers to do the following: > a) set an XID age (age of slot's xmin or catalog_xmin) of say 1 or 1.5 > billion, after which the slots get invalidated. > b) set a timeout of say 1 or 2 or 3 days, after which the inactive > slots get invalidated. > > To implement (a), postgres needs a new GUC called max_slot_xid_age. > The checkpointer then invalidates all the slots whose xmin (the oldest > transaction that this slot needs the database to retain) or > catalog_xmin (the oldest transaction affecting the system catalogs > that this slot needs the database to retain) has reached the age > specified by this setting. > > To implement (b), first postgres needs to track the replication slot > metrics like the time at which the slot became inactive (inactive_at > timestamptz) and the total number of times the slot became inactive in > its lifetime (inactive_count numeric) in ReplicationSlotPersistentData > structure. And, then it needs a new timeout GUC called > inactive_replication_slot_timeout. Whenever a slot becomes inactive, > the current timestamp and inactive count are stored in > ReplicationSlotPersistentData structure and persisted to disk. The > checkpointer then invalidates all the slots that are lying inactive > for about inactive_replication_slot_timeout duration starting from > inactive_at. > > In addition to implementing (b), these two new metrics enable > developers to improve their monitoring tools as the metrics are > exposed via pg_replication_slots system view. For instance, one can > build a monitoring tool that signals when replication slots are lying > inactive for a day or so using inactive_at metric, and/or when a > replication slot is becoming inactive too frequently using inactive_at > metric. > > I’m attaching the v1 patch set as described below: > 0001 - Tracks invalidation_reason in pg_replication_slots. This is > needed because slots now have multiple reasons for slot invalidation. > 0002 - Tracks inactive replication slot information inactive_at and > inactive_timeout. > 0003 - Adds inactive_timeout based replication slot invalidation. > 0004 - Adds XID based replication slot invalidation. > > Thoughts? > +1 for the idea, here are some comments on 0002, I will review other patches soon and respond. 1. + + inactive_at timestamptz + + +The time at which the slot became inactive. +NULL if the slot is currently actively being +used. + + Maybe we can change the field name to 'last_inactive_at'? 
or maybe the comment can explain the timestamp at which the slot was last inactivated. Since we are already maintaining inactive_count, I think it is better to explicitly say this is the last inactive time. 2. + /* + * XXX: Can inactive_count of type uint64 ever overflow? It takes + * about a half-billion years for inactive_count to overflow even + * if slot becomes inactive for every 1 millisecond. So, using + * pg_add_u64_overflow might be an overkill. + */ Correct, we don't need to use pg_add_u64_overflow for this counter. 3. + + /* Convert to numeric. */ + snprintf(buf, sizeof buf, UINT64_FORMAT, slot_contents.data.inactive_count); + values[i++] = DirectFunctionCall3(numeric_in, + CStringGetDatum(buf), + ObjectIdGetDatum(0), + Int32GetDatum(-1)); What is the purpose of doing this? I mean, inactive_count is an 8-byte integer, and you can define the function's out parameter as 'int8', which is also an 8-byte integer; then you don't need to convert the integer to a string and then to numeric. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
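To make point 3 concrete, this is the simpler path being suggested (it assumes the view column is redeclared as int8 instead of numeric):

/* With the out parameter declared as int8, no string round-trip is needed. */
values[i++] = Int64GetDatum((int64) slot_contents.data.inactive_count);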
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Sun, Feb 4, 2024 at 7:10 PM Alvaro Herrera wrote: > > On 2024-Feb-02, Dilip Kumar wrote: > > > I have checked the patch and it looks fine to me other than the above > > question related to memory barrier usage one more question about the > > same, basically below two instances 1 and 2 look similar but in 1 you > > are not using the memory write_barrier whereas in 2 you are using the > > write_barrier, why is it so? I mean why the reordering can not happen > > in 1 and it may happen in 2? > > What I was thinking is that there's a lwlock operation just below, which > acts as a barrier. But I realized something more important: there are > only two places that matter, which are SlruSelectLRUPage and > SimpleLruZeroPage. The others are all initialization code that run at a > point where there's no going to be any concurrency in SLRU access, so we > don't need barriers anyway. In SlruSelectLRUPage we definitely don't > want to evict the page that SimpleLruZeroPage has initialized, starting > from the point where it returns that new page to its caller. > But if you consider the code of those two routines, you realize that the > only time an equality between latest_page_number and "this_page_number" > is going to occur, is when both pages are in the same bank ... and both > routines are required to be holding the bank lock while they run, so in > practice this is never a problem. Right. In fact, when I first converted 'latest_page_number' to an atomic, the thinking was to protect it from concurrent writes in SimpleLruZeroPage(), and also so that a concurrent read in SlruSelectLRUPage() would not see a corrupted value. All other usages were during the initialization phase, where we do not need any protection. > > We need the atomic write and atomic read so that multiple processes > processing pages in different banks can update latest_page_number > simultaneously. But the equality condition that we're looking for? > it can never happen concurrently. Yeah, that's right; after you explained it, I also realized that the case is protected by the bank lock. Earlier I hadn't thought about this case. > In other words, these barriers are fully useless. > > (We also have SimpleLruTruncate, but I think it's not as critical to > have a barrier there anyhow: accessing a slightly outdated page number > could only be a problem if a bug elsewhere causes us to try to truncate > in the current page. I think we only have this code there because we > did have such bugs in the past, but IIUC this shouldn't happen anymore.) +1, I agree with this theory in general. But the below comment in SimpleLruTruncate in your v3 patch doesn't seem correct. Here we are checking whether latest_page_number is smaller than the cutoff; if so, we log it as wraparound and skip the whole thing. That is fine even if we are reading an atomic variable, and a slightly outdated value should not be a problem. But the comment claims that this is safe because we hold the same bank lock as SimpleLruZeroPage(), and that's not true: here we will be acquiring different bank locks one by one, based on which slotno we are checking. Am I missing something? + * An important safety check: the current endpoint page must not be + * eligible for removal. Like SlruSelectLRUPage, we don't need a + * memory barrier here because for the affected page to be relevant, + * we'd have to have the same bank lock as SimpleLruZeroPage.
*/ - if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage)) + if (ctl->PagePrecedes(pg_atomic_read_u64(&shared->latest_page_number), + cutoffPage)) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
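To illustrate the locking point, a simplified sketch (not the patch's actual code) of a truncate-side scan that takes each bank's lock in turn rather than holding a single lock throughout:

/* Sketch: scanning slots while acquiring bank locks one at a time. */
for (int slotno = 0; slotno < shared->num_slots; slotno++)
{
    LWLock *banklock = SimpleLruGetBankLock(ctl, shared->page_number[slotno]);

    LWLockAcquire(banklock, LW_EXCLUSIVE);
    /* ... examine and possibly evict this slot's page ... */
    LWLockRelease(banklock);
}

Because no single bank lock is held across the whole loop, the scan cannot claim the same bank-lock guarantee that SimpleLruZeroPage() relies on, which is exactly why the comment was reworded.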
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 4:34 PM Dilip Kumar wrote: > > On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar wrote: > > > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera > > wrote: > > > Okay. > > > > > > While I have your attention -- if you could give a look to the 0001 > > > patch I posted, I would appreciate it. > > > > > > > I will look into it. Thanks. > > Some quick observations: > > Do we need the below two write barriers at the end of the function? > Because the next instruction is separated by the function boundary > > @@ -766,14 +766,11 @@ StartupCLOG(void) > .. > - XactCtl->shared->latest_page_number = pageno; > - > - LWLockRelease(XactSLRULock); > + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); > + pg_write_barrier(); > } > > /* > * Initialize member's idea of the latest page number. > */ > pageno = MXOffsetToMemberPage(offset); > - MultiXactMemberCtl->shared->latest_page_number = pageno; > + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, > +pageno); > + > + pg_write_barrier(); > } I have checked the patch and it looks fine to me, other than the above question related to memory barrier usage. One more question about the same: the below two instances 1 and 2 look similar, but in 1 you are not using the write_barrier whereas in 2 you are; why is that? I mean, why can the reordering not happen in 1 while it may happen in 2? 1. + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); vs 2. - shared->latest_page_number = pageno; + pg_atomic_write_u64(&shared->latest_page_number, pageno); + pg_write_barrier(); /* update the stats counter of zeroed pages */ pgstat_count_slru_page_zeroed(shared->slru_stats_idx); -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar wrote: > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera wrote: > Okay. > > > > While I have your attention -- if you could give a look to the 0001 > > patch I posted, I would appreciate it. > > > > I will look into it. Thanks. Some quick observations: do we need the below two write barriers at the end of the function? The next instruction is separated by the function boundary anyway. @@ -766,14 +766,11 @@ StartupCLOG(void) .. - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); + pg_write_barrier(); } /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, +pageno); + + pg_write_barrier(); } I am looking more into this from the concurrency point of view and will update you soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera wrote: > > On 2024-Feb-01, Dilip Kumar wrote: > > > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera > > wrote: > > > > > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > > > > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter > > > "transaction_buffers": 17 > > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must > > > be a multiple of 16 > > > > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE > > instead of giving an error? > > Since this is the auto-tuning feature, I think it should use the > previous multiple rather than the next, but yeah, something like that. Okay. > > While I have your attention -- if you could give a look to the 0001 > patch I posted, I would appreciate it. > I will look into it. Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
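For the auto-tuning case, the rounding could be done in the GUC check hook, something like this sketch (the hook name follows the pattern already discussed; the exact limits are assumptions):

/* Sketch: round an SLRU buffer GUC down to the previous whole bank. */
static bool
check_transaction_buffers(int *newval, void **extra, GucSource source)
{
    if (*newval % SLRU_BANK_SIZE != 0)
        *newval = Max(SLRU_BANK_SIZE,
                      (*newval / SLRU_BANK_SIZE) * SLRU_BANK_SIZE);

    return true;
}

With this, a shared_buffers-derived value such as 17 would silently become 16 instead of failing at startup.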
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera wrote: > > Hah: > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter > "transaction_buffers": 17 > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be > a multiple of 16 Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE instead of giving an error? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Jan 26, 2024 at 11:08 PM Alvaro Herrera wrote: > > I've continued to review this and decided that I don't like the mess > this patch proposes in order to support pg_commit_ts's deletion of all > files. (Yes, I know that I was the one that proposed this idea. It's > still a bad one). I'd like to change that code by removing the limit > that we can only have 128 bank locks in a SLRU. To recap, the reason we > did this is that commit_ts sometimes wants to delete all files while > running (DeactivateCommitTs), and for this it needs to acquire all bank > locks. Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we > added the SLRU limit making multiple banks share lwlocks. > > I propose two alternative solutions: > > 1. The easiest is to have DeactivateCommitTs continue to hold > CommitTsLock until the end, including while SlruScanDirectory does its > thing. This sounds terrible, but considering that this code only runs > when the module is being deactivated, I don't think it's really all that > bad in practice. I mean, if you deactivate the commit_ts module and > then try to read commit timestamp data, you deserve to wait for a few > seconds just as a punishment for your stupidity. I think this idea looks reasonable. I agree that if we are trying to read commit_ts after deactivating it then it's fine to make it wait. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Jan 25, 2024 at 10:03 PM Alvaro Herrera wrote: > > On 2024-Jan-25, Alvaro Herrera wrote: > > > Here's a touched-up version of this patch. > > > diff --git a/src/backend/storage/lmgr/lwlock.c > > b/src/backend/storage/lmgr/lwlock.c > > index 98fa6035cc..4a5e05d5e4 100644 > > --- a/src/backend/storage/lmgr/lwlock.c > > +++ b/src/backend/storage/lmgr/lwlock.c > > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = { > > [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash", > > [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA", > > [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash", > > + [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU", > > + [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU", > > + [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU", > > + [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU", > > + [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU" > > + [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU", > > + [LWTRANCHE_XACT_SLRU] = "XactSLRU", > > }; > > Eeek. Last minute changes ... Fixed here. Thank you for working on this. There is one thing that I feel is problematic: we have kept the allowed values for these GUCs in multiples of SLRU_BANK_SIZE, i.e. 16, and that's the reason the min values were changed to 16; but in this refactoring patch, for some of the buffers, you have changed that to 8, and I think that's not good. + { + {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM, + gettext_noop("Sets the size of the dedicated buffer pool used for the MultiXact offset cache."), + NULL, + GUC_UNIT_BLOCKS + }, + &multixact_offsets_buffers, + 16, 8, SLRU_MAX_ALLOWED_BUFFERS, + check_multixact_offsets_buffers, NULL, NULL + }, Other than this, the patch looks good to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
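For clarity, the suggested fix is just to restore a whole-bank minimum in entries like the one quoted above, e.g.:

{
    {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM,
        gettext_noop("Sets the size of the dedicated buffer pool used for the MultiXact offset cache."),
        NULL,
        GUC_UNIT_BLOCKS
    },
    &multixact_offsets_buffers,
    16, 16, SLRU_MAX_ALLOWED_BUFFERS,   /* boot, min, max: min stays one full bank */
    check_multixact_offsets_buffers, NULL, NULL
},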
Re: index prefetching
On Wed, Jan 24, 2024 at 11:43 PM Tomas Vondra wrote: > On 1/22/24 07:35, Konstantin Knizhnik wrote: > > > > On 22/01/2024 1:47 am, Tomas Vondra wrote: > >> h, right. Well, you're right in this case we perhaps could set just one > >> of those flags, but the "purpose" of the two places is quite different. > >> > >> The "prefetch" flag is fully controlled by the prefetcher, and it's up > >> to it to change it (e.g. I can easily imagine some new logic touching > >> setting it to "false" for some reason). > >> > >> The "data" flag is fully controlled by the custom callbacks, so whatever > >> the callback stores, will be there. > >> > >> I don't think it's worth simplifying this. In particular, I don't think > >> the callback can assume it can rely on the "prefetch" flag. > >> > > Why not to add "all_visible" flag to IndexPrefetchEntry ? If will not > > cause any extra space overhead (because of alignment), but allows to > > avoid dynamic memory allocation (not sure if it is critical, but nice to > > avoid if possible). > > > While reading through the first patch I got some questions; I haven't read it completely yet, but this is what I have so far. 1. +static bool +IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block) +{ + int idx; ... + if (prefetch->blockItems[idx] != (block - i)) + return false; + + /* Don't prefetch if the block happens to be the same. */ + if (prefetch->blockItems[idx] == block) + return false; + } + + /* not sequential, not recently prefetched */ + return true; +} The above function's name is BlockIsSequential, but at the end it returns true if it is not sequential; that seems like a problem. Also, the other two checks right above the end of the function return false if the block is the same or if the pattern is sequential; I think those are wrong too. 2. I have noticed that the prefetch history is maintained at the backend level, but what if multiple backends are trying to fetch the same heap blocks, maybe scanning the same index? Should that be in some shared structure? I haven't thought much deeper about this from the implementation POV, but should we think about it, or does it not matter? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
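For what it's worth, a sketch of what the function might look like with the return values matching its name; this is my reading of the intent, and the constants and fields are assumed to exist in the patch:

/* Sketch: return true only when the recent request history is sequential. */
static bool
IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block)
{
    for (int i = 1; i <= PREFETCH_SEQ_PATTERN_BLOCKS; i++)
    {
        int idx = (prefetch->blockIndex - i + PREFETCH_QUEUE_HISTORY)
                  % PREFETCH_QUEUE_HISTORY;

        /* the i-th previous request must be exactly (block - i) */
        if (prefetch->blockItems[idx] != (block - i))
            return false;
    }

    return true;
}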
Re: Opportunistically pruning page before update
On Tue, Jan 23, 2024 at 7:18 AM James Coleman wrote: > > On Mon, Jan 22, 2024 at 8:21 PM James Coleman wrote: > > > > See rebased patch attached. > > I just realized I left a change in during the rebase that wasn't necessary. > > v4 attached. I have noticed that you perform the opportunistic pruning after we have decided that the updated tuple cannot fit in the current page, and then the pruning is done on the new target page. Why don't we first perform the pruning on the existing page of the tuple itself? Or is this already being done before this patch? I could not find such existing pruning, hence the question, because such pruning could convert many non-HOT updates into HOT updates, right? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
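In other words, something along these lines before giving up on the tuple's own page (a sketch; whether heap_page_prune_opt() is the right entry point at that spot is my assumption):

/* Sketch: try to make room on the old page first, so the update can stay HOT. */
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(&oldtup.t_self));
heap_page_prune_opt(relation, buffer);  /* opportunistic prune of the old page */
/* ... then re-check free space here before falling back to a new target page ... */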
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Wed, Jan 17, 2024 at 1:46 PM Kartyshov Ivan wrote: > > Add some fixes and rebase. > While quickly looking into the patch, I understood the idea of what we are trying to achieve here, and I feel it is a useful feature. But while looking at both the patches, I could not quickly differentiate between the two approaches. I believe that at the core both implement similar wait logic and just provide different syntaxes. So if we want to keep both approaches open for the sake of discussion, it would be better to first create a patch that implements the core part, i.e. the waiting logic and the other common code, and then add top-up patches for the two different approaches; that would be easier to review. I also see that in v4 there is no documentation for the syntax part, which makes it even harder to understand. I think this thread is implementing a useful feature, so my suggestion is to add some documentation in v4 and also make it more readable with respect to the clear differences between these two approaches; adding a commit message would also help. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Synchronizing slots from primary to standby
On Fri, Jan 19, 2024 at 5:24 PM Amit Kapila wrote: > > On Wed, Jan 17, 2024 at 4:00 PM shveta malik wrote: > > > > I had some off-list discussions with Sawada-San, Hou-San, and Shveta > on the topic of extending replication commands instead of using the > current model where we fetch the required slot information via SQL > using a database connection. I would like to summarize the discussion > and would like to know the thoughts of others on this topic. > > In the current patch, we launch the slotsync worker on physical > standby which connects to the specified database (currently we let > users specify the required dbname in primary_conninfo) on the primary. > It then fetches the required information for failover marked slots > from the primary and also does some primitive checks on the upstream > node via SQL (the additional checks are like whether the upstream node > has a specified physical slot or whether the upstream node is a > primary node or a standby node). To fetch the required information it > uses a libpqwalreceiver API which is mostly apt for this purpose as it > supports SQL execution but for this patch, we don't need a replication > connection, so we extend the libpqwalreceiver connect API. What sort of extension have we done to 'libpqwalreceiver'? Is it that, by default, it supports replication connections, so we have extended the API to provide an option for choosing whether to create a replication connection or a normal connection? > Now, the concerns related to this could be that users would probably > need to change existing mechanisms/tools to update primary_conninfo > and one of the alternatives proposed is to have an additional GUC like > slot_sync_dbname. Users won't be able to drop the database this worker > is connected to aka whatever is specified in slot_sync_dbname but as > the user herself sets up the configuration it shouldn't be a big deal. Yeah, for this purpose users may use template1 or so, which they generally don't plan to drop. So in case the user wants to drop that database, the user first needs to turn off the slot syncing option, and then it can be done? > Then we also discussed whether extending libpqwalreceiver's connect > API is a good idea and whether we need to further extend it in the > future. As far as I can see, slotsync worker's primary requirement is > to execute SQL queries which the current API is sufficient, and don't > see something that needs any drastic change in this API. Note that > tablesync worker that executes SQL also uses these APIs, so we may > need something in the future for either of those. Then finally we need > a slotsync worker to also connect to a database to use SQL and fetch > results. While looking into the patch v64-0002, I could not exactly point out what sort of extensions are there in libpqwalreceiver.c; I just saw one extra API for fetching the dbname from the connection info. > Now, let us consider if we extend the replication commands like > READ_REPLICATION_SLOT and or introduce a new set of replication > commands to fetch the required information then we don't need a DB > connection with primary or a connection in slotsync worker. As per my > current understanding, it is quite doable but I think we will slowly > go in the direction of making replication commands something like SQL > because today we need to extend it to fetch all slots info that have > failover marked as true, the existence of a particular replication, > etc.
> Then tomorrow, if we want to extend this work to have multiple > slotsync workers say workers per db then we have to extend the > replication command to fetch per-database failover marked slots. To > me, it sounds more like we are slowly adding SQL-like features to > replication commands. > > Apart from this when we are reading per-db replication slots without > connecting to a database, we probably need some additional protection > mechanism so that the database won't get dropped. Something like locking the database only while fetching the slots? > Considering all this it seems that for now probably extending > replication commands can simplify a few things like mentioned above > but using SQL's with db-connection is more extendable. I have similar thoughts too. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
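On the connect-API question, my understanding is that the extension is roughly of this shape (a sketch; the exact parameter list is an assumption based on the discussion):

/* Hypothetical: let callers ask for a plain SQL (non-replication) connection. */
WalReceiverConn *
walrcv_connect(const char *conninfo, bool replication, bool logical,
               bool must_use_password, const char *appname, char **err);

/* slotsync worker: SQL-only connection to the primary */
wrconn = walrcv_connect(PrimaryConnInfo, false /* replication */,
                        false, false, "slotsyncworker", &err);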
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)
On Wed, Nov 1, 2023 at 2:42 AM Matthias van de Meent wrote: > > Hi, > > Currently, nbtree code compares each and every column of an index > tuple during the binary search on the index page. With large indexes > that have many duplicate prefix column values (e.g. an index on (bool, > bool, uuid) ) that means a lot of wasted time getting to the right > column. > > The attached patch improves on that by doing per-page dynamic prefix > truncation: If we know that on both the right and left side there are > index tuples where the first two attributes are equal to the scan key, > we skip comparing those attributes at the current index tuple and > start with comparing attribute 3, saving two attribute compares. We > gain performance whenever comparing prefixing attributes is expensive > and when there are many tuples with a shared prefix - in unique > indexes this doesn't gain much, but we also don't lose much in this > case. > > This patch was originally suggested at [0], but it was mentioned that > they could be pulled out into it's own thread. Earlier, the > performance gains were not clearly there for just this patch, but > after further benchmarking this patch stands on its own for > performance: it sees no obvious degradation of performance, while > gaining 0-5% across various normal indexes on the cc-complete sample > dataset, with the current worst-case index shape getting a 60%+ > improved performance on INSERTs in the tests at [0]. +1 for the idea, I have some initial comments while reading through the patch. 1. Commit message refers to a non-existing reference '(see [0])'. 2. +When we do a binary search on a sorted set (such as a BTree), we know that a +tuple will be smaller than its left neighbour, and larger than its right +neighbour. I think this should be 'larger than left neighbour and smaller than right neighbour' instead of the other way around. 3. +With the above optimizations, dynamic prefix truncation improves the worst +case complexity of indexing from O(tree_height * natts * log(tups_per_page)) +to O(tree_height * (3*natts + log(tups_per_page))) Where do the 3*natts come from? Is it related to setting up the dynamic prefix at each level? 4. + /* + * All tuple attributes are equal to the scan key, only later attributes + * could potentially not equal the scan key. + */ + *compareattr = ntupatts + 1; Can you elaborate on this more? If all tuple attributes are equal to the scan key then what do those 'later attributes' mean? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
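To make the idea concrete, a simplified sketch of a per-page binary search that carries the established prefix on each bound (the helper name and the exact bookkeeping are illustrative, not the patch's code):

/* Sketch: binary search with dynamic prefix truncation. */
int         low_skip = 1,       /* attrs known equal at the low boundary */
            high_skip = 1;      /* attrs known equal at the high boundary */

while (low < high)
{
    OffsetNumber mid = low + (high - low) / 2;
    /* attributes established equal on both sides can be skipped */
    int         cmpcol = Min(low_skip, high_skip);
    int32       result = _bt_compare_from(rel, key, page, mid, &cmpcol);

    if (result > 0)
    {
        low = mid + 1;
        low_skip = cmpcol;      /* first cmpcol-1 attrs matched at mid */
    }
    else
    {
        high = mid;
        high_skip = cmpcol;
    }
}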
Re: Synchronizing slots from primary to standby
On Tue, Jan 16, 2024 at 9:37 AM Amit Kapila wrote: > > On Tue, Jan 16, 2024 at 9:03 AM shveta malik wrote: > > > Agreed and as said earlier I think it is better to make it a > PGC_SIGHUP. Also, not sure we can say it is a non-standard way as > already autovacuum launcher is handled in the same way. One more minor > thing is it will save us for having a new bgworker state > BgWorkerStart_ConsistentState_HotStandby as introduced by this patch. Yeah, it's not a non-standard way. But bgworker provides a lot of built-in infrastructure which we would otherwise have to maintain ourselves if we opt for option 2. I would have preferred option 3 from the simplicity point of view, but I prefer making this PGC_SIGHUP over that simplicity. Anyway, if there are issues in doing so, then we can keep it PGC_POSTMASTER, but it's worth trying this out. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Synchronizing slots from primary to standby
On Wed, Jan 10, 2024 at 5:53 PM Zhijie Hou (Fujitsu) wrote: > > > IIUC on the standby we just want to overwrite what we get from primary no? > > If > > so why we are using those APIs that are meant for the actual decoding slots > > where it needs to take certain logical decisions instead of mere > > overwriting? > > I think we don't have a strong reason to use these APIs, but it was > convenient to > use these APIs as they can take care of updating the slots info and will call > functions like, ReplicationSlotsComputeRequiredXmin, > ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly > overwriting > the fields and call these manually ? I might be missing something, but do you want to call ReplicationSlotsComputeRequiredXmin()-style functions on the standby? I mean, those will ultimately update the catalog xmin and replication xmin in the ProcArray, and that prevents Vacuum from cleaning up some of the required xids. But on the standby, those shared memory parameters are not used, IIUC. In my opinion, on the standby we just need to update the values in the local slots with whatever we get from the remote slots, without taking all the logical decisions in the hope that they will all fall into a particular path. For example, if you look at LogicalIncreaseXminForSlot(), it performs the steps shown below[1]. These all make sense when you are doing candidate-based updating, where we first mark the candidates and then update the candidate to the real value once we get the confirmation for the LSN. Following all this logic here looks completely wrong, and unless it falls into a particular path, I feel it will do some duplicate steps as well. For example, in local_slot_update(), first you call LogicalConfirmReceivedLocation(), which sets 'data.confirmed_flush'; then you call LogicalIncreaseXminForSlot(), which sets 'updated_xmin = true;' and calls LogicalConfirmReceivedLocation() again. I don't think this is the correct way of reusing the function, unless you really need to go through those paths and I am missing something. [1] LogicalIncreaseXminForSlot() { if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin)) { } else if (current_lsn <= slot->data.confirmed_flush) { } else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr) { } if (updated_xmin) LogicalConfirmReceivedLocation(slot->data.confirmed_flush); } -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
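To spell out the alternative, a minimal sketch of directly overwriting the synced slot on the standby (field and function names follow the existing slot code; whether anything else must be recomputed there is exactly the open question):

/* Sketch: plain overwrite of the synced slot's positions on the standby. */
SpinLockAcquire(&slot->mutex);
slot->data.restart_lsn = remote_slot->restart_lsn;
slot->data.confirmed_flush = remote_slot->confirmed_lsn;
slot->data.catalog_xmin = remote_slot->catalog_xmin;
slot->effective_catalog_xmin = remote_slot->catalog_xmin;
SpinLockRelease(&slot->mutex);

ReplicationSlotMarkDirty();
ReplicationSlotSave();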
Re: Synchronizing slots from primary to standby
On Tue, Jan 9, 2024 at 5:44 PM Zhijie Hou (Fujitsu) wrote: > comments on 0002 1. +/* Worker's nap time in case of regular activity on the primary server */ +#define WORKER_DEFAULT_NAPTIME_MS 10L /* 10 ms */ + +/* Worker's nap time in case of no-activity on the primary server */ +#define WORKER_INACTIVITY_NAPTIME_MS 10000L /* 10 sec */ Instead of directly switching between 10ms and 10s, shouldn't we increase the nap time gradually? I mean, it could even go beyond 10 sec, but instead of switching from 10ms straight to 10 sec we could grow it each time by some multiplier, with a max limit up to which it can grow (see the backoff sketch at the end of this review). We can still reset back to 10ms directly as soon as we observe some activity. 2. SlotSyncWorkerCtxStruct: add this to the typedefs.list file. 3. +/* + * Update local slot metadata as per remote_slot's positions + */ +static void +local_slot_update(RemoteSlot *remote_slot) +{ + Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE); + + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn); + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn, +remote_slot->catalog_xmin); + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn, + remote_slot->restart_lsn); +} IIUC, on the standby we just want to overwrite what we get from the primary, no? If so, why are we using those APIs that are meant for the actual decoding slots, where they need to take certain logical decisions, instead of merely overwriting? 4. +/* + * Helper function for drop_obsolete_slots() + * + * Drops synced slot identified by the passed in name. + */ +static void +drop_synced_slots_internal(const char *name, bool nowait) Suggestion: add one line in the header comment to explain 'nowait'. 5. +/* + * Helper function to check if local_slot is present in remote_slots list. + * + * It also checks if logical slot is locally invalidated i.e. invalidated on + * the standby but valid on the primary server. If found so, it sets + * locally_invalidated to true. + */ Instead of saying "but valid on the primary server", better to say "but valid in the remote_slots list", because this function is just checking the remote_slots list regardless of where that list came from. Mentioning the primary makes it seem like this function might fetch directly from the primary, so it is a bit confusing. 6. +/* + * Check that all necessary GUCs for slot synchronization are set + * appropriately. If not, raise an ERROR. + */ +static void +validate_slotsync_parameters(char **dbname) The function name just says 'validate_slotsync_parameters', but it also fetches the dbname, so I think we had better change the name accordingly. Also, instead of passing dbname as an out parameter, just return it directly; there is no need to pass this extra parameter and make the function return void. 7. + tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); + tuple_ok = tuplestore_gettupleslot(res->tuplestore, true, false, tupslot); + Assert(tuple_ok); /* It must return one tuple */ The comment says 'It must return one tuple', but we assert only that there is at least one tuple; shouldn't we enhance the assert so that it checks that we got exactly one tuple? 8. /* No need to check further, return that we are cascading standby */ + *am_cascading_standby = true; We are not returning immediately; we are just setting am_cascading_standby to true, so adjust the comment accordingly. 9. + /* No need to check further, return that we are cascading standby */ + *am_cascading_standby = true; + } + else + { + /* We are a normal standby.
*/ Single-line comments do not follow a uniform pattern for the full stop: either use a full stop for all single-line comments or for none; at least follow the same rule within a file or in nearby comments. 10. + errmsg("exiting from slot synchronization due to bad configuration"), + errhint("%s must be defined.", "primary_slot_name")); Why are we using the constant string "primary_slot_name" as a format argument in this error message? 11. + /* + * Hot_standby_feedback must be enabled to cooperate with the physical + * replication slot, which allows informing the primary about the xmin and + * catalog_xmin values on the standby. I do not like capitalizing the first letter of 'hot_standby_feedback', which is a GUC parameter. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
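For point 1, a sketch of the kind of gradual backoff I have in mind (the multiplier and cap are arbitrary assumptions):

/* Hypothetical gradual backoff for the slotsync worker's nap time. */
#define WORKER_DEFAULT_NAPTIME_MS   10L     /* 10 ms */
#define WORKER_MAX_NAPTIME_MS       30000L  /* cap; could be larger than 10 sec */

static long naptime = WORKER_DEFAULT_NAPTIME_MS;

static void
update_worker_naptime(bool activity_seen)
{
    if (activity_seen)
        naptime = WORKER_DEFAULT_NAPTIME_MS;    /* reset immediately */
    else
        naptime = Min(naptime * 2, WORKER_MAX_NAPTIME_MS);  /* grow gradually */
}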
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera wrote: > > The more I look at TransactionGroupUpdateXidStatus, the more I think > it's broken, and while we do have some tests, I don't have confidence > that they cover all possible cases. > > Or, at least, if this code is good, then it hasn't been sufficiently > explained. Do you have any thoughts about a case in which it might be broken? Even a vague idea of where you think it might not work as expected would help, so that I can think in that direction too. It is possible that I am not considering some perspective that you are, and that the comments are lacking from that point of view. > If we have multiple processes trying to write bits to clog, and they are > using different banks, then the LWLockConditionalAcquire will be able to > acquire the bank lock Do you think there is a problem with multiple processes getting the lock? I mean, they are modifying different CLOG pages, so that can be done concurrently, right?