Re: sequences vs. synchronous replication

2022-01-25 Thread Tomas Vondra

On 1/25/22 10:18, Peter Eisentraut wrote:


On 15.01.22 23:57, Tomas Vondra wrote:
This approach (and also my previous proposal) seems to assume that 
the value returned from nextval() should not be used until the 
transaction executing that nextval() has been committed successfully. 
But I'm not sure how many applications follow this assumption. Some 
application might use the return value of nextval() instantly before 
issuing commit command. Some might use the return value of nextval() 
executed in rollbacked transaction.




IMO any application that assumes data from uncommitted transactions is 
outright broken and we should not try to fix that because it's quite 
futile (and likely will affect well-behaving applications).


The issue I'm trying to fix in this thread is much narrower - we don't 
actually meet the guarantees for committed transactions (that only did 
nextval without generating any WAL).


The wording in the SQL standard is:

"Changes to the current base value of a sequence generator are not 
controlled by SQL-transactions; therefore, commits and rollbacks of 
SQL-transactions have no effect on the current base value of a sequence 
generator."


This implies the well-known behavior that consuming a sequence value is 
not rolled back.  But it also appears to imply that committing a 
transaction has no impact on the validity of a sequence value produced 
during that transaction.  In other words, this appears to imply that 
making use of a sequence value produced in a rolled-back transaction is 
valid.


A very strict reading of this would seem to imply that every single 
nextval() call needs to be flushed to WAL immediately, which is of 
course impractical.


I'm not an expert in reading standards, but I'd not interpret it that 
way. I think it simply says the sequence must not go back, no matter 
what happened to the transaction.


IMO interpreting this as "must not lose any increments from uncommitted 
transactions" is maybe a bit too strict, and as you point out it's also 
impractical because it'd mean calling nextval() repeatedly flushes WAL 
all the time. Not great for batch loads, for example.


I don't think we need to flush WAL for every nextval() call, if we don't 
write WAL for every increment - I think we still can batch WAL for 32 
increments just like we do now (AFAICS that'd not contradict even this 
quite strict interpretation of the standard).


OTOH the flush would have to happen immediately, we can't delay that 
until the end of the transaction. Which is going to affect even cases 
that generate WAL for other reasons (e.g. doing insert), which was 
entirely unaffected by the previous patches.


And the flush would have to happen even for sessions that didn't write 
WAL (which was what started this thread) - we could use page LSN and 
flush only to that (so we'd flush once and then it'd be noop until the 
sequence increments 32-times and writes another WAL record).


Of course, it's not enough to just flush WAL, we have to wait for the 
sync replica too :-(


I don't have any benchmark results quantifying this yet, but I'll do 
some tests in the next day or two. But my expectation is this is going 
to be pretty expensive, and considering how concerned we were about 
affecting current workloads, making the impact worse seems wrong.



My opinion is we should focus on fixing this given the current (weaker) 
interpretation of the standard, i.e. accepting the loss of increments 
observed only by uncommitted transactions. The page LSN patch seems like 
the best way to do that so far.



We may try reworking this to provide the stronger guarantees (i.e. not 
losing even increments from uncommitted transactions) in the future, of 
course. But considering (a) we're not sure that's really what the SQL 
standard requires, (b) no one complained about that in years, and (c) 
it's going to make sequences way more expensive, I doubt that's really 
desirable.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Companydiff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 27cb6307581..0ab95bd3d8e 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -37,6 +37,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "parser/parse_type.h"
+#include "replication/syncrep.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -812,6 +813,22 @@ nextval_internal(Oid relid, bool check_permissions)
 		PageSetLSN(page, recptr);
 	}
 
+	/*
+	 * Make sure the WAL for the sequence is properly flushed, before returning
+	 * the nextval result. For sessions that did not generate WAL (and so use a
+	 * cached value), this ensures the WAL generated by other sessions reaches
+	 * disk. Also, wait for sync replica.
+	 * 
+	 */
+	if (RelationNeedsWAL(seqrel))
+	{
+		XLogRecPtr	recptr = PageGetLSN(page);
+
+		XLogFlush(recptr);
+
+		SyncRepWaitForLSN(recptr, 

Re: sequences vs. synchronous replication

2022-01-25 Thread Peter Eisentraut



On 15.01.22 23:57, Tomas Vondra wrote:
This approach (and also my previous proposal) seems to assume that the 
value returned from nextval() should not be used until the transaction 
executing that nextval() has been committed successfully. But I'm not 
sure how many applications follow this assumption. Some application 
might use the return value of nextval() instantly before issuing 
commit command. Some might use the return value of nextval() executed 
in rollbacked transaction.




IMO any application that assumes data from uncommitted transactions is 
outright broken and we should not try to fix that because it's quite 
futile (and likely will affect well-behaving applications).


The issue I'm trying to fix in this thread is much narrower - we don't 
actually meet the guarantees for committed transactions (that only did 
nextval without generating any WAL).


The wording in the SQL standard is:

"Changes to the current base value of a sequence generator are not 
controlled by SQL-transactions; therefore, commits and rollbacks of 
SQL-transactions have no effect on the current base value of a sequence 
generator."


This implies the well-known behavior that consuming a sequence value is 
not rolled back.  But it also appears to imply that committing a 
transaction has no impact on the validity of a sequence value produced 
during that transaction.  In other words, this appears to imply that 
making use of a sequence value produced in a rolled-back transaction is 
valid.


A very strict reading of this would seem to imply that every single 
nextval() call needs to be flushed to WAL immediately, which is of 
course impractical.





Re: sequences vs. synchronous replication

2022-01-15 Thread Tomas Vondra

On 1/15/22 06:12, Fujii Masao wrote:



On 2022/01/12 1:07, Tomas Vondra wrote:

I explored the idea of using page LSN a bit


Many thanks!



The patch from 22/12 simply checks if the change should/would wait for
sync replica, and if yes it WAL-logs the sequence increment. There's a
couple problems with this, unfortunately:


Yes, you're right.



So I think this approach is not really an improvement over WAL-logging
every increment. But there's a better way, I think - we don't need to
generate WAL, we just need to ensure we wait for it to be flushed at
transaction end in RecordTransactionCommit().

That is, instead of generating more WAL, simply update XactLastRecEnd
and then ensure RecordTransactionCommit flushes/waits etc. Attached is a
patch doing that - the changes in sequence.c are trivial, changes in
RecordTransactionCommit simply ensure we flush/wait even without XID
(this actually raises some additional questions that I'll discuss in a
separate message in this thread).


This approach (and also my previous proposal) seems to assume that the 
value returned from nextval() should not be used until the transaction 
executing that nextval() has been committed successfully. But I'm not 
sure how many applications follow this assumption. Some application 
might use the return value of nextval() instantly before issuing commit 
command. Some might use the return value of nextval() executed in 
rollbacked transaction.




IMO any application that assumes data from uncommitted transactions is 
outright broken and we should not try to fix that because it's quite 
futile (and likely will affect well-behaving applications).


The issue I'm trying to fix in this thread is much narrower - we don't 
actually meet the guarantees for committed transactions (that only did 
nextval without generating any WAL).


If we want to avoid duplicate sequence value even in those cases, ISTM 
that the transaction needs to wait for WAL flush and sync rep before 
nextval() returns the value. Of course, this might cause other issues 
like performance decrease, though.




Right, something like that. But that'd hurt well-behaving applications, 
because by doing the wait earlier (in nextval, not at commit) it 
increases the probability of waiting.


FWIW I'm not against improvements in this direction, but it goes way 
beyong fixing the original issue.





On btrfs, it looks like this (the numbers next to nextval are the cache
size, with 1 being the default):

   client  test master   log-all  page-lsn   log-all  page-lsn
   ---
    1  insert  829   807   802   97%   97%
   nextval/1 16491   814 16465    5%  100%
   nextval/32    24487 16462 24632   67%  101%
   nextval/64    24516 24918 24671  102%  101%
   nextval/128   32337 33178 32863  103%  102%

   client  test master   log-all  page-lsn   log-all  page-lsn
   ---
    4  insert 1577  1590  1546  101%   98%
   nextval/1 45607  1579 21220    3%   47%
   nextval/32    68453 49141 51170   72%   75%
   nextval/64    66928 65534 66408   98%   99%
   nextval/128   83502 81835 82576   98%   99%

The results seem clearly better, I think.


Thanks for benchmarking this! I agree that page-lsn is obviously better 
than log-all.




For "insert" there's no drop at all (same as before), because as soon as
a transaction generates any WAL, it has to flush/wait anyway.

And for "nextval" there's a drop, but only with 4 clients, and it's much
smaller (53% instead of 97%). And increasing the cache size eliminates
even that.


Yes, but 53% drop would be critial for some applications that don't want 
to increase the cache size for some reasons. So IMO it's better to 
provide the option to enable/disable that page-lsn approach.




I disagree. This drop applies only to extremely simple transactions - 
once the transaction does any WAL write, it disappears. Even if the 
transaction does only a couple reads, it'll go away. I find it hard to 
believe there's any serious application doing this.


So I think we should get it reliable (to not lose data after commit) 
first and then maybe see if we can improve this.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2022-01-14 Thread Fujii Masao




On 2022/01/12 1:07, Tomas Vondra wrote:

I explored the idea of using page LSN a bit


Many thanks!



The patch from 22/12 simply checks if the change should/would wait for
sync replica, and if yes it WAL-logs the sequence increment. There's a
couple problems with this, unfortunately:


Yes, you're right.



So I think this approach is not really an improvement over WAL-logging
every increment. But there's a better way, I think - we don't need to
generate WAL, we just need to ensure we wait for it to be flushed at
transaction end in RecordTransactionCommit().

That is, instead of generating more WAL, simply update XactLastRecEnd
and then ensure RecordTransactionCommit flushes/waits etc. Attached is a
patch doing that - the changes in sequence.c are trivial, changes in
RecordTransactionCommit simply ensure we flush/wait even without XID
(this actually raises some additional questions that I'll discuss in a
separate message in this thread).


This approach (and also my previous proposal) seems to assume that the value 
returned from nextval() should not be used until the transaction executing that 
nextval() has been committed successfully. But I'm not sure how many 
applications follow this assumption. Some application might use the return 
value of nextval() instantly before issuing commit command. Some might use the 
return value of nextval() executed in rollbacked transaction.

If we want to avoid duplicate sequence value even in those cases, ISTM that the 
transaction needs to wait for WAL flush and sync rep before nextval() returns 
the value. Of course, this might cause other issues like performance decrease, 
though.



On btrfs, it looks like this (the numbers next to nextval are the cache
size, with 1 being the default):

   client  test master   log-all  page-lsn   log-all  page-lsn
   ---
1  insert  829   807   802   97%   97%
   nextval/1 16491   814 164655%  100%
   nextval/3224487 16462 24632   67%  101%
   nextval/6424516 24918 24671  102%  101%
   nextval/128   32337 33178 32863  103%  102%

   client  test master   log-all  page-lsn   log-all  page-lsn
   ---
4  insert 1577  1590  1546  101%   98%
   nextval/1 45607  1579 212203%   47%
   nextval/3268453 49141 51170   72%   75%
   nextval/6466928 65534 66408   98%   99%
   nextval/128   83502 81835 82576   98%   99%

The results seem clearly better, I think.


Thanks for benchmarking this! I agree that page-lsn is obviously better than 
log-all.



For "insert" there's no drop at all (same as before), because as soon as
a transaction generates any WAL, it has to flush/wait anyway.

And for "nextval" there's a drop, but only with 4 clients, and it's much
smaller (53% instead of 97%). And increasing the cache size eliminates
even that.


Yes, but 53% drop would be critial for some applications that don't want to 
increase the cache size for some reasons. So IMO it's better to provide the 
option to enable/disable that page-lsn approach.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: sequences vs. synchronous replication

2022-01-11 Thread Tomas Vondra
On 12/24/21 11:40, Tomas Vondra wrote:
> 
> ...
> 
> FWIW I plan to explore the idea of looking at sequence page LSN, and
> flushing up to that position.
> 

So, I explored the page LSN idea, and it seems to be working pretty
nicely. There still is some impact on the workload doing just nextval
calls, but it's much better than WAL-logging everything. The patch is
available at [1].

While working on that patch, I realized this actually affects even
systems without any replicas - it's trivial to get a sequence going
backwards. Imagine you do this:

  BEGIN;
  SELECT nextval('s') FROM generate_series(1,50) s(i);
  ROLLBACK;

  SELECT nextval('s');

  -- kill -9 postgres

It's pretty likely a nextval('s') after restarting the server returns a
value from before the last nextval('s'), in case we do not manage to
flush the WAL before the kill.

The patch deals with this by updating XactLastRecEnd and then flushing
up to that point in RecordTransactionCommit(). But for that to happen,
we have to do the flush/wait even without a valid XID (which we may not
have when nextval gets called outside a transaction).

So I was wondering what other places do the same thing (generates WAL
without setting a XID), because that might either have similar issues
with not flushing data, and/or be affected by this change.

RecordTransactionCommit() says about such cases this:

  /*
   * Check if we want to commit asynchronously.  We can allow the
   * XLOG flush to happen asynchronously if synchronous_commit=off,
   * or if the current transaction has not performed any WAL-logged
   * operation or didn't assign an xid.  The transaction can end up
   * not writing any WAL, even if it has an xid, if it only wrote to
   * temporary and/or unlogged tables.  It can end up having written
   * WAL without an xid if it did HOT pruning.  In case of a crash,
   * the loss of such a transaction will be irrelevant; temp tables
   * will be lost anyway, unlogged tables will be truncated and HOT
   * pruning will be done again later. (Given the foregoing, you
   * might think that it would be unnecessary to emit the XLOG record
   * at all in this case, but we don't currently try to do that.  It
   * would certainly cause problems at least in Hot Standby mode,
   * where the KnownAssignedXids machinery requires tracking every
   * XID assignment.  It might be OK to skip it only when wal_level <
   * replica, but for now we don't.)
   *
   * However, if we're doing cleanup of any non-temp rels or
   * committing any command that wanted to force sync commit, then we
   * must flush XLOG immediately.  (We must not allow asynchronous
   * commit if there are any non-temp tables to be deleted, because
   * we might delete the files before the COMMIT record is flushed to
   * disk.  We do allow asynchronous commit if all to-be-deleted
   * tables are temporary though, since they are lost anyway if we
   * crash.)
   */

Note: This relates only to XLogFlush vs. XLogSetAsyncXactLSN, not about
waiting for sync standby. For that we ignore forceSyncCommit, which
seems a bit weird ...

Anyway, I was wondering what happens in practice, so I added very simple
logging to RecordTransactionCommit():

if (wrote_log && !markXidCommitted)
elog(WARNING, "not flushing at %X/%X",
 (uint32) (XactLastRecEnd >> 32),
 (uint32) XactLastRecEnd);

and then ran installcheck, which produces ~700 messages. Looking at the
WAL (last few records before the LSN reported by the log message), most
of this is related to HOT pruning (i.e. PRUNE), but there's plenty of
other WAL records. And I'm not sure if it's OK to just lose (some of)
those messages, as the comment claims for PRUNE.

It's quite possible I miss something basic/important, and everything is
fine and dandy, but here's a couple non-pruning examples - command
triggering the log message, along with the last few WAL records without
XID assigned right before RecordTransactionCommit() was called.

A more complete data set (full WAL dump, regression.diffs etc.) is
available at [2].



VACUUM ANALYZE num_exp_add;
---
VISIBLE cutoff xid 37114 flags 0x01, blkref #0: rel 1663/63341/3697 ...
INPLACE off 39, blkref #0: rel 1663/63341/1259 blk 5


SELECT proname, provolatile FROM pg_proc
   WHERE oid in ('functest_B_1'::regproc,
 'functest_B_2'::regproc,
 'functest_B_3'::regproc,
 'functest_B_4'::regproc) ORDER BY proname;

VACUUM nunused 223, blkref #0: rel 1663/63341/2608 blk 39
VISIBLE cutoff xid 39928 flags 0x01, blkref #0: rel 1663/63341/2608 ...
VACUUM nunused 6, blkref #0: rel 1663/63341/2608 blk 40
META_CLEANUP last_cleanup_num_delpages 5, blkref #0: rel 1663/63341 ...
META_CLEANUP last_cleanup_num_delpages 1, blkref #0: rel 1663/63341 ...
INPLACE off 13, blkref #0: rel 1663/63341/1259 blk 4

Re: sequences vs. synchronous replication

2022-01-11 Thread Tomas Vondra


On 12/22/21 18:50, Fujii Masao wrote:
> 
> 
> On 2021/12/22 21:11, Tomas Vondra wrote:
>> Interesting idea, but I think it has a couple of issues :-(
> 
> Thanks for the review!
> 
>> 1) We'd need to know the LSN of the last WAL record for any given
>> sequence, and we'd need to communicate that between backends somehow.
>> Which seems rather tricky to do without affecting performance.
> 
> How about using the page lsn for the sequence? nextval_internal()
> already uses that to check whether it's less than or equal to checkpoint
> redo location.
> 

I explored the idea of using page LSN a bit, and there's some good and
bad news.

The patch from 22/12 simply checks if the change should/would wait for
sync replica, and if yes it WAL-logs the sequence increment. There's a
couple problems with this, unfortunately:

1) Imagine a high-concurrency environment, with a lot of sessions doing
nextval('s') at the same time. One session WAL-logs the increment, but
before the WAL gets flushed / sent to replica, another session calls
nextval. SyncRepNeedsWait() says true, so it WAL-logs it again, moving
the page LSN forward. And so on. So in a high-concurrency environments,
this simply makes the matters worse - it causes an avalanche of WAL
writes instead of saving anything.

(You don't even need multiple sessions - a single session calling
nextval would have the same issue, WAL-logging every call.)


2) It assumes having a synchronous replica, but that's wrong. It's
partially my fault because I formulated this issue as if it was just
about sync replicas, but that's just one symptom. It applies even to
systems without any replicas.

Imagine you do

  BEGIN;
  SELECT nextval('s') FROM generate_series(1,40);
  ROLLBACK;

  SELECT nextval('s');

and then you murder the server by "kill -9". If you restart it and do a
nextval('s') again, the value will likely go back, generating duplicate
values :-(


So I think this approach is not really an improvement over WAL-logging
every increment. But there's a better way, I think - we don't need to
generate WAL, we just need to ensure we wait for it to be flushed at
transaction end in RecordTransactionCommit().

That is, instead of generating more WAL, simply update XactLastRecEnd
and then ensure RecordTransactionCommit flushes/waits etc. Attached is a
patch doing that - the changes in sequence.c are trivial, changes in
RecordTransactionCommit simply ensure we flush/wait even without XID
(this actually raises some additional questions that I'll discuss in a
separate message in this thread).

I repeated the benchmark measurements with nextval/insert workloads, to
compare this with the other patch (WAL-logging every increment). I had
to use a different machine, so the the results are not directly
comparable to the numbers presented earlier.

On btrfs, it looks like this. The log-all is the first patch, page-lsn
is the new patch using page LSN. The first columns are raw pgbench tps
values, the last two columns are comparison to master.

On btrfs, it looks like this (the numbers next to nextval are the cache
size, with 1 being the default):

  client  test master   log-all  page-lsn   log-all  page-lsn
  ---
   1  insert  829   807   802   97%   97%
  nextval/1 16491   814 164655%  100%
  nextval/3224487 16462 24632   67%  101%
  nextval/6424516 24918 24671  102%  101%
  nextval/128   32337 33178 32863  103%  102%

  client  test master   log-all  page-lsn   log-all  page-lsn
  ---
   4  insert 1577  1590  1546  101%   98%
  nextval/1 45607  1579 212203%   47%
  nextval/3268453 49141 51170   72%   75%
  nextval/6466928 65534 66408   98%   99%
  nextval/128   83502 81835 82576   98%   99%

The results seem clearly better, I think.

For "insert" there's no drop at all (same as before), because as soon as
a transaction generates any WAL, it has to flush/wait anyway.

And for "nextval" there's a drop, but only with 4 clients, and it's much
smaller (53% instead of 97%). And increasing the cache size eliminates
even that.

Out of curiosity I ran the tests on tmpfs too, which should show overhed
not related to I/O. The results are similar:

  client  test master   log-all  page-lsn   log-all  page-lsn
  ---
1 insert44033 43740 43215   99%   98%
  nextval/1 58640 48384 59243   83%  101%
  nextval/3261089 60901 60830  100%  100%
  nextval/6460412 61315 61550  101%  102%
  nextval/128   

Re: sequences vs. synchronous replication

2021-12-28 Thread Pavel Stehule
út 28. 12. 2021 v 9:53 odesílatel Sascha Kuhl 
napsal:

>
>
> Pavel Stehule  schrieb am Di., 28. Dez. 2021,
> 09:51:
>
>> Hi
>>
>> út 28. 12. 2021 v 9:28 odesílatel Sascha Kuhl 
>> napsal:
>>
>>> Sequence validation by step, in total is great. If the sequence is
>>> Familie or professional, does it make sense to a have a total validation by
>>> an expert. I can only say true by chi square Networks, but would a medical
>>> opinion be an improvement?
>>>
>>
>> Is it generated by boot or by a human?
>>
>
> I validation my family and Société, only when them Show me not their
> Sekret, part of their truth. Works fine by a Boot level, as far as I can
> detektei, without the Boot showing up 
>

don't spam this mailing list, please

Thank you

Pavel


>>
>>
>>> Fujii Masao  schrieb am Di., 28. Dez.
>>> 2021, 07:56:
>>>


 On 2021/12/24 19:40, Tomas Vondra wrote:
 > Maybe, but what would such workload look like? Based on the tests I
 did, such workload probably can't generate any WAL. The amount of WAL added
 by the change is tiny, the regression is caused by having to flush WAL.
 >
 > The only plausible workload I can think of is just calling nextval,
 and the cache pretty much fixes that.

 Some users don't want to increase cache setting, do they? Because

 - They may expect that setval() affects all subsequent nextval(). But
 if cache is set to greater than one, the value set by setval() doesn't
 affect other backends until they consumed all the cached sequence values.
 - They may expect that the value returned from nextval() is basically
 increased monotonically. If cache is set to greater than one, subsequent
 nextval() can easily return smaller value than one returned by previous
 nextval().
 - They may want to avoid "hole" of a sequence as much as possible,
 e.g., as far as the server is running normally. If cache is set to greater
 than one, such "hole" can happen even thought the server doesn't crash yet.


 > FWIW I plan to explore the idea of looking at sequence page LSN, and
 flushing up to that position.

 Sounds great, thanks!

 Regards,

 --
 Fujii Masao
 Advanced Computing Technology Center
 Research and Development Headquarters
 NTT DATA CORPORATION





Re: sequences vs. synchronous replication

2021-12-28 Thread Sascha Kuhl
Pavel Stehule  schrieb am Di., 28. Dez. 2021,
09:51:

> Hi
>
> út 28. 12. 2021 v 9:28 odesílatel Sascha Kuhl 
> napsal:
>
>> Sequence validation by step, in total is great. If the sequence is
>> Familie or professional, does it make sense to a have a total validation by
>> an expert. I can only say true by chi square Networks, but would a medical
>> opinion be an improvement?
>>
>
> Is it generated by boot or by a human?
>

I validation my family and Société, only when them Show me not their
Sekret, part of their truth. Works fine by a Boot level, as far as I can
detektei, without the Boot showing up 

>
>
>
>> Fujii Masao  schrieb am Di., 28. Dez. 2021,
>> 07:56:
>>
>>>
>>>
>>> On 2021/12/24 19:40, Tomas Vondra wrote:
>>> > Maybe, but what would such workload look like? Based on the tests I
>>> did, such workload probably can't generate any WAL. The amount of WAL added
>>> by the change is tiny, the regression is caused by having to flush WAL.
>>> >
>>> > The only plausible workload I can think of is just calling nextval,
>>> and the cache pretty much fixes that.
>>>
>>> Some users don't want to increase cache setting, do they? Because
>>>
>>> - They may expect that setval() affects all subsequent nextval(). But if
>>> cache is set to greater than one, the value set by setval() doesn't affect
>>> other backends until they consumed all the cached sequence values.
>>> - They may expect that the value returned from nextval() is basically
>>> increased monotonically. If cache is set to greater than one, subsequent
>>> nextval() can easily return smaller value than one returned by previous
>>> nextval().
>>> - They may want to avoid "hole" of a sequence as much as possible, e.g.,
>>> as far as the server is running normally. If cache is set to greater than
>>> one, such "hole" can happen even thought the server doesn't crash yet.
>>>
>>>
>>> > FWIW I plan to explore the idea of looking at sequence page LSN, and
>>> flushing up to that position.
>>>
>>> Sounds great, thanks!
>>>
>>> Regards,
>>>
>>> --
>>> Fujii Masao
>>> Advanced Computing Technology Center
>>> Research and Development Headquarters
>>> NTT DATA CORPORATION
>>>
>>>
>>>


Re: sequences vs. synchronous replication

2021-12-28 Thread Pavel Stehule
Hi

út 28. 12. 2021 v 9:28 odesílatel Sascha Kuhl 
napsal:

> Sequence validation by step, in total is great. If the sequence is Familie
> or professional, does it make sense to a have a total validation by an
> expert. I can only say true by chi square Networks, but would a medical
> opinion be an improvement?
>

Is it generated by boot or by a human?



> Fujii Masao  schrieb am Di., 28. Dez. 2021,
> 07:56:
>
>>
>>
>> On 2021/12/24 19:40, Tomas Vondra wrote:
>> > Maybe, but what would such workload look like? Based on the tests I
>> did, such workload probably can't generate any WAL. The amount of WAL added
>> by the change is tiny, the regression is caused by having to flush WAL.
>> >
>> > The only plausible workload I can think of is just calling nextval, and
>> the cache pretty much fixes that.
>>
>> Some users don't want to increase cache setting, do they? Because
>>
>> - They may expect that setval() affects all subsequent nextval(). But if
>> cache is set to greater than one, the value set by setval() doesn't affect
>> other backends until they consumed all the cached sequence values.
>> - They may expect that the value returned from nextval() is basically
>> increased monotonically. If cache is set to greater than one, subsequent
>> nextval() can easily return smaller value than one returned by previous
>> nextval().
>> - They may want to avoid "hole" of a sequence as much as possible, e.g.,
>> as far as the server is running normally. If cache is set to greater than
>> one, such "hole" can happen even thought the server doesn't crash yet.
>>
>>
>> > FWIW I plan to explore the idea of looking at sequence page LSN, and
>> flushing up to that position.
>>
>> Sounds great, thanks!
>>
>> Regards,
>>
>> --
>> Fujii Masao
>> Advanced Computing Technology Center
>> Research and Development Headquarters
>> NTT DATA CORPORATION
>>
>>
>>


Re: sequences vs. synchronous replication

2021-12-28 Thread Sascha Kuhl
Sequence validation by step, in total is great. If the sequence is Familie
or professional, does it make sense to a have a total validation by an
expert. I can only say true by chi square Networks, but would a medical
opinion be an improvement?

Fujii Masao  schrieb am Di., 28. Dez. 2021,
07:56:

>
>
> On 2021/12/24 19:40, Tomas Vondra wrote:
> > Maybe, but what would such workload look like? Based on the tests I did,
> such workload probably can't generate any WAL. The amount of WAL added by
> the change is tiny, the regression is caused by having to flush WAL.
> >
> > The only plausible workload I can think of is just calling nextval, and
> the cache pretty much fixes that.
>
> Some users don't want to increase cache setting, do they? Because
>
> - They may expect that setval() affects all subsequent nextval(). But if
> cache is set to greater than one, the value set by setval() doesn't affect
> other backends until they consumed all the cached sequence values.
> - They may expect that the value returned from nextval() is basically
> increased monotonically. If cache is set to greater than one, subsequent
> nextval() can easily return smaller value than one returned by previous
> nextval().
> - They may want to avoid "hole" of a sequence as much as possible, e.g.,
> as far as the server is running normally. If cache is set to greater than
> one, such "hole" can happen even thought the server doesn't crash yet.
>
>
> > FWIW I plan to explore the idea of looking at sequence page LSN, and
> flushing up to that position.
>
> Sounds great, thanks!
>
> Regards,
>
> --
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION
>
>
>


Re: sequences vs. synchronous replication

2021-12-27 Thread Fujii Masao




On 2021/12/24 19:40, Tomas Vondra wrote:

Maybe, but what would such workload look like? Based on the tests I did, such 
workload probably can't generate any WAL. The amount of WAL added by the change 
is tiny, the regression is caused by having to flush WAL.

The only plausible workload I can think of is just calling nextval, and the 
cache pretty much fixes that.


Some users don't want to increase cache setting, do they? Because

- They may expect that setval() affects all subsequent nextval(). But if cache 
is set to greater than one, the value set by setval() doesn't affect other 
backends until they consumed all the cached sequence values.
- They may expect that the value returned from nextval() is basically increased 
monotonically. If cache is set to greater than one, subsequent nextval() can 
easily return smaller value than one returned by previous nextval().
- They may want to avoid "hole" of a sequence as much as possible, e.g., as far as the 
server is running normally. If cache is set to greater than one, such "hole" can happen 
even thought the server doesn't crash yet.



FWIW I plan to explore the idea of looking at sequence page LSN, and flushing 
up to that position.


Sounds great, thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: sequences vs. synchronous replication

2021-12-27 Thread Tomas Vondra

On 12/27/21 21:24, Peter Eisentraut wrote:

On 24.12.21 09:04, Kyotaro Horiguchi wrote:

Still, as Fujii-san concerns, I'm afraid that some people may suffer
the degradation the patch causes.  I wonder it is acceptable to get
back the previous behavior by exposing SEQ_LOG_VALS itself or a
boolean to do that, as a 'not-recommended-to-use' variable.


There is also the possibility of unlogged sequences if you want to avoid 
the WAL logging and get higher performance.


But unlogged sequences are not supported:

  test=# create unlogged sequence s;
  ERROR:  unlogged sequences are not supported

And even if we did, what would be the behavior after crash? For tables 
we discard the contents, so for sequences we'd probably discard it too 
and start from scratch? That doesn't seem particularly useful.


We could also write / fsync the sequence buffer, but that has other 
downsides. But that's not implemented either, and it's certainly out of 
scope for this patch.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-27 Thread Peter Eisentraut

On 24.12.21 09:04, Kyotaro Horiguchi wrote:

Still, as Fujii-san concerns, I'm afraid that some people may suffer
the degradation the patch causes.  I wonder it is acceptable to get
back the previous behavior by exposing SEQ_LOG_VALS itself or a
boolean to do that, as a 'not-recommended-to-use' variable.


There is also the possibility of unlogged sequences if you want to avoid 
the WAL logging and get higher performance.





Re: sequences vs. synchronous replication

2021-12-24 Thread Tomas Vondra




On 12/24/21 09:04, Kyotaro Horiguchi wrote:

...
So, strictly speaking, that is a violation of the constraint I
mentioned regardless whether the transaction is committed or
not. However we have technical limitations as below.



I don't follow. What violates what?

If the transaction commits (and gets a confirmation from sync
replica), the modified WAL logging prevents duplicate values. It does
nothing for uncommitted transactions. Seems like an improvement to me.


Sorry for the noise. I misunderstand that ROLLBACK is being changed to
rollback sequences.



No problem, this part of the code is certainly rather confusing due to 
several layers of caching and these WAL-logging optimizations.



No idea. IMHO from the correctness / behavior point of view, the
modified logging is an improvement. The only issue is the additional
overhead, and I think the cache addresses that quite well.


Now I understand the story here.

I agree that the patch is improvment from the current behavior.
I agree that the overhead is eventually-nothing for WAL-emitting workloads.



OK, thanks.


Still, as Fujii-san concerns, I'm afraid that some people may suffer
the degradation the patch causes.  I wonder it is acceptable to get
back the previous behavior by exposing SEQ_LOG_VALS itself or a
boolean to do that, as a 'not-recommended-to-use' variable.



Maybe, but what would such workload look like? Based on the tests I did, 
such workload probably can't generate any WAL. The amount of WAL added 
by the change is tiny, the regression is caused by having to flush WAL.


The only plausible workload I can think of is just calling nextval, and 
the cache pretty much fixes that.


FWIW I plan to explore the idea of looking at sequence page LSN, and 
flushing up to that position.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-24 Thread Kyotaro Horiguchi
At Fri, 24 Dec 2021 08:23:13 +0100, Tomas Vondra 
 wrote in 
> 
> 
> On 12/24/21 06:37, Kyotaro Horiguchi wrote:
> > At Thu, 23 Dec 2021 19:50:22 +0100, Tomas Vondra
> >  wrote in
> >> On 12/23/21 15:42, Fujii Masao wrote:
> >>> On 2021/12/23 3:49, Tomas Vondra wrote:
>  Attached is a patch tweaking WAL logging - in wal_level=minimal we do
>  the same thing as now, in higher levels we log every sequence fetch.
> >>> Thanks for the patch!
> >>> With the patch, I found that the regression test for sequences failed.
> >>> +    fetch = log = fetch;
> >>> This should be "log = fetch"?
> >>> On second thought, originally a sequence doesn't guarantee that the
> >>> value already returned by nextval() will never be returned by
> >>> subsequent nextval() after the server crash recovery. That is,
> >>> nextval() may return the same value across crash recovery. Is this
> >>> understanding right? For example, this case can happen if the server
> >>> crashes after nextval() returned the value but before WAL for the
> >>> sequence was flushed to the permanent storage.
> >>
> >> I think the important step is commit. We don't guarantee anything for
> >> changes in uncommitted transactions. If you do nextval in a
> >> transaction and the server crashes before the WAL gets flushed before
> >> COMMIT, then yes, nextval may generate the same nextval again. But
> >> after commit that is not OK - it must not happen.
> > I don't mean to stand on Fujii-san's side particularly, but it seems
> > to me sequences of RDBSs are not rolled back generally.  Some googling
> > told me that at least Oracle (documented), MySQL, DB2 and MS-SQL
> > server doesn't rewind sequences at rollback, that is, sequences are
> > incremented independtly from transaction control.  It seems common to
> > think that two nextval() calls for the same sequence must not return
> > the same value in any context.
> > 
> 
> Yes, sequences are not rolled back on abort generally. That would
> require much stricter locking, and that'd go against using sequences
> in concurrent sessions.

I thinks so.

> But we're not talking about sequence rollback - we're talking about
> data loss, caused by failure to flush WAL for a sequence. But that
> affects the *current* code too, and to much greater extent.

Ah, yes, I don't object to that aspect.

> Consider this:
> 
> BEGIN;
> SELECT nextval('s') FROM generate_series(1,1000) s(i);
> ROLLBACK; -- or crash of a different backend
> 
> BEGIN;
> SELECT nextval('s');
> COMMIT;
> 
> With the current code, this may easily lose the WAL, and we'll
> generate duplicate values from the sequence. We pretty much ignore the
> COMMIT.
>
> With the proposed change to WAL logging, that is not possible. The
> COMMIT flushes enough WAL to prevent this issue.
> 
> So this actually makes this issue less severe.
> 
> Maybe I'm missing some important detail, though. Can you show an
> example where the proposed changes make the issue worse?

No. It seems to me improvoment at least from the current state, for
the reason you mentioned.

> >>> So it's not a bug that sync standby may return the same value as
> >>> already returned in the primary because the corresponding WAL has not
> >>> been replicated yet, isn't it?
> >>>
> >>
> >> No, I don't think so. Once the COMMIT happens (and gets confirmed by
> >> the sync standby), it should be possible to failover to the sync
> >> replica without losing any data in committed transaction. Generating
> >> duplicate values is a clear violation of that.
> > So, strictly speaking, that is a violation of the constraint I
> > mentioned regardless whether the transaction is committed or
> > not. However we have technical limitations as below.
> > 
> 
> I don't follow. What violates what?
> 
> If the transaction commits (and gets a confirmation from sync
> replica), the modified WAL logging prevents duplicate values. It does
> nothing for uncommitted transactions. Seems like an improvement to me.

Sorry for the noise. I misunderstand that ROLLBACK is being changed to
rollback sequences.

> No idea. IMHO from the correctness / behavior point of view, the
> modified logging is an improvement. The only issue is the additional
> overhead, and I think the cache addresses that quite well.

Now I understand the story here.

I agree that the patch is improvment from the current behavior.
I agree that the overhead is eventually-nothing for WAL-emitting workloads.

Still, as Fujii-san concerns, I'm afraid that some people may suffer
the degradation the patch causes.  I wonder it is acceptable to get
back the previous behavior by exposing SEQ_LOG_VALS itself or a
boolean to do that, as a 'not-recommended-to-use' variable.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: sequences vs. synchronous replication

2021-12-23 Thread Tomas Vondra




On 12/24/21 06:37, Kyotaro Horiguchi wrote:

At Thu, 23 Dec 2021 19:50:22 +0100, Tomas Vondra 
 wrote in

On 12/23/21 15:42, Fujii Masao wrote:

On 2021/12/23 3:49, Tomas Vondra wrote:

Attached is a patch tweaking WAL logging - in wal_level=minimal we do
the same thing as now, in higher levels we log every sequence fetch.

Thanks for the patch!
With the patch, I found that the regression test for sequences failed.
+    fetch = log = fetch;
This should be "log = fetch"?
On second thought, originally a sequence doesn't guarantee that the
value already returned by nextval() will never be returned by
subsequent nextval() after the server crash recovery. That is,
nextval() may return the same value across crash recovery. Is this
understanding right? For example, this case can happen if the server
crashes after nextval() returned the value but before WAL for the
sequence was flushed to the permanent storage.


I think the important step is commit. We don't guarantee anything for
changes in uncommitted transactions. If you do nextval in a
transaction and the server crashes before the WAL gets flushed before
COMMIT, then yes, nextval may generate the same nextval again. But
after commit that is not OK - it must not happen.


I don't mean to stand on Fujii-san's side particularly, but it seems
to me sequences of RDBSs are not rolled back generally.  Some googling
told me that at least Oracle (documented), MySQL, DB2 and MS-SQL
server doesn't rewind sequences at rollback, that is, sequences are
incremented independtly from transaction control.  It seems common to
think that two nextval() calls for the same sequence must not return
the same value in any context.



Yes, sequences are not rolled back on abort generally. That would 
require much stricter locking, and that'd go against using sequences in 
concurrent sessions.


But we're not talking about sequence rollback - we're talking about data 
loss, caused by failure to flush WAL for a sequence. But that affects 
the *current* code too, and to much greater extent.


Consider this:

BEGIN;
SELECT nextval('s') FROM generate_series(1,1000) s(i);
ROLLBACK; -- or crash of a different backend

BEGIN;
SELECT nextval('s');
COMMIT;

With the current code, this may easily lose the WAL, and we'll generate 
duplicate values from the sequence. We pretty much ignore the COMMIT.


With the proposed change to WAL logging, that is not possible. The 
COMMIT flushes enough WAL to prevent this issue.


So this actually makes this issue less severe.

Maybe I'm missing some important detail, though. Can you show an example 
where the proposed changes make the issue worse?



So it's not a bug that sync standby may return the same value as
already returned in the primary because the corresponding WAL has not
been replicated yet, isn't it?



No, I don't think so. Once the COMMIT happens (and gets confirmed by
the sync standby), it should be possible to failover to the sync
replica without losing any data in committed transaction. Generating
duplicate values is a clear violation of that.


So, strictly speaking, that is a violation of the constraint I
mentioned regardless whether the transaction is committed or
not. However we have technical limitations as below.



I don't follow. What violates what?

If the transaction commits (and gets a confirmation from sync replica), 
the modified WAL logging prevents duplicate values. It does nothing for 
uncommitted transactions. Seems like an improvement to me.



IMHO the fact that we allow a transaction to commit (even just
locally) without flushing all the WAL it depends on is clearly a data
loss bug.


BTW, if the returned value is stored in database, the same value is
guaranteed not to be returned again after the server crash or by sync
standby. Because in that case the WAL of the transaction storing that
value is flushed and replicated.



True, assuming the table is WAL-logged etc. I agree the issue may be
affecting a fairly small fraction of workloads, because most people
use sequences to generate data for inserts etc.


It seems to me, from the fact that sequences are designed explicitly
untransactional and that behavior is widely adopted, the discussion
might be missing some significant use-cases.  But there's a
possibility that the spec of sequence came from some technical
limitation in the past, but I'm not sure..



No idea. IMHO from the correctness / behavior point of view, the 
modified logging is an improvement. The only issue is the additional 
overhead, and I think the cache addresses that quite well.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-23 Thread Kyotaro Horiguchi
At Thu, 23 Dec 2021 19:50:22 +0100, Tomas Vondra 
 wrote in 
> On 12/23/21 15:42, Fujii Masao wrote:
> > On 2021/12/23 3:49, Tomas Vondra wrote:
> >> Attached is a patch tweaking WAL logging - in wal_level=minimal we do
> >> the same thing as now, in higher levels we log every sequence fetch.
> > Thanks for the patch!
> > With the patch, I found that the regression test for sequences failed.
> > +    fetch = log = fetch;
> > This should be "log = fetch"?
> > On second thought, originally a sequence doesn't guarantee that the
> > value already returned by nextval() will never be returned by
> > subsequent nextval() after the server crash recovery. That is,
> > nextval() may return the same value across crash recovery. Is this
> > understanding right? For example, this case can happen if the server
> > crashes after nextval() returned the value but before WAL for the
> > sequence was flushed to the permanent storage.
> 
> I think the important step is commit. We don't guarantee anything for
> changes in uncommitted transactions. If you do nextval in a
> transaction and the server crashes before the WAL gets flushed before
> COMMIT, then yes, nextval may generate the same nextval again. But
> after commit that is not OK - it must not happen.

I don't mean to stand on Fujii-san's side particularly, but it seems
to me sequences of RDBSs are not rolled back generally.  Some googling
told me that at least Oracle (documented), MySQL, DB2 and MS-SQL
server doesn't rewind sequences at rollback, that is, sequences are
incremented independtly from transaction control.  It seems common to
think that two nextval() calls for the same sequence must not return
the same value in any context.

> > So it's not a bug that sync standby may return the same value as
> > already returned in the primary because the corresponding WAL has not
> > been replicated yet, isn't it?
> > 
> 
> No, I don't think so. Once the COMMIT happens (and gets confirmed by
> the sync standby), it should be possible to failover to the sync
> replica without losing any data in committed transaction. Generating
> duplicate values is a clear violation of that.

So, strictly speaking, that is a violation of the constraint I
mentioned regardless whether the transaction is committed or
not. However we have technical limitations as below.

> IMHO the fact that we allow a transaction to commit (even just
> locally) without flushing all the WAL it depends on is clearly a data
> loss bug.
> 
> > BTW, if the returned value is stored in database, the same value is
> > guaranteed not to be returned again after the server crash or by sync
> > standby. Because in that case the WAL of the transaction storing that
> > value is flushed and replicated.
> > 
> 
> True, assuming the table is WAL-logged etc. I agree the issue may be
> affecting a fairly small fraction of workloads, because most people
> use sequences to generate data for inserts etc.

It seems to me, from the fact that sequences are designed explicitly
untransactional and that behavior is widely adopted, the discussion
might be missing some significant use-cases.  But there's a
possibility that the spec of sequence came from some technical
limitation in the past, but I'm not sure..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: sequences vs. synchronous replication

2021-12-23 Thread Tomas Vondra

On 12/23/21 15:42, Fujii Masao wrote:



On 2021/12/23 3:49, Tomas Vondra wrote:
Attached is a patch tweaking WAL logging - in wal_level=minimal we do 
the same thing as now, in higher levels we log every sequence fetch.


Thanks for the patch!

With the patch, I found that the regression test for sequences failed.

+    fetch = log = fetch;

This should be "log = fetch"?

On second thought, originally a sequence doesn't guarantee that the 
value already returned by nextval() will never be returned by subsequent 
nextval() after the server crash recovery. That is, nextval() may return 
the same value across crash recovery. Is this understanding right? For 
example, this case can happen if the server crashes after nextval() 
returned the value but before WAL for the sequence was flushed to the 
permanent storage.


I think the important step is commit. We don't guarantee anything for 
changes in uncommitted transactions. If you do nextval in a transaction 
and the server crashes before the WAL gets flushed before COMMIT, then 
yes, nextval may generate the same nextval again. But after commit that 
is not OK - it must not happen.



So it's not a bug that sync standby may return the same value as
already returned in the primary because the corresponding WAL has not
been replicated yet, isn't it?



No, I don't think so. Once the COMMIT happens (and gets confirmed by the 
sync standby), it should be possible to failover to the sync replica 
without losing any data in committed transaction. Generating duplicate 
values is a clear violation of that.


IMHO the fact that we allow a transaction to commit (even just locally) 
without flushing all the WAL it depends on is clearly a data loss bug.


BTW, if the returned value is stored in database, the same value is 
guaranteed not to be returned again after the server crash or by sync 
standby. Because in that case the WAL of the transaction storing that 
value is flushed and replicated.




True, assuming the table is WAL-logged etc. I agree the issue may be 
affecting a fairly small fraction of workloads, because most people use 
sequences to generate data for inserts etc.


So I think this makes it acceptable / manageable. Of course, this 
means the values are much less monotonous (across backends), but I 
don't think we really promised that. And I doubt anyone is really 
using sequences like this (just nextval) in performance critical use 
cases.


I think that this approach is not acceptable to some users. So, if we 
actually adopt WAL-logging every sequence fetch, also how about exposing 
SEQ_LOG_VALS as reloption for a sequence? If so, those who want to log 
every sequence fetch can set this SEQ_LOG_VALS reloption to 0. OTOH, 
those who prefer the current behavior in spite of the risk we're 
discussing at this thread can set the reloption to 32 like it is for 
now, for example.




I think it'd be worth explaining why you think it's not acceptable?

I've demonstrated the impact on regular workloads (with other changes 
that write stuff to WAL) is not measurable, and enabling sequence 
caching eliminates most of the overhead for the rare corner case 
workloads if needed. It does generate a bit more WAL, but the sequence 
WAL records are pretty tiny.


I'm opposed to adding relooptions that affect correctness - it just 
seems like a bad idea to me. Moreover setting the CACHE for a sequence 
does almost the same thing - if you set CACHE 32, we only generate WAL 
once every 32 increments. The only difference is that this cache is not 
shared between backends, so one backend will generate 1,2,3,... and 
another backend will generate 33,34,35,... etc. I don't think that's a 
problem, because if you want strictly monotonous / gap-less sequences 
you can't use our sequences anyway. Yes, with short-lived backends this 
may consume the sequences faster, but well - short-lived backends are 
expensive anyway and overflowing bigserial is still unlikely.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-23 Thread Fujii Masao




On 2021/12/23 3:49, Tomas Vondra wrote:

Attached is a patch tweaking WAL logging - in wal_level=minimal we do the same 
thing as now, in higher levels we log every sequence fetch.


Thanks for the patch!

With the patch, I found that the regression test for sequences failed.

+   fetch = log = fetch;

This should be "log = fetch"?

On second thought, originally a sequence doesn't guarantee that the value 
already returned by nextval() will never be returned by subsequent nextval() 
after the server crash recovery. That is, nextval() may return the same value 
across crash recovery. Is this understanding right? For example, this case can 
happen if the server crashes after nextval() returned the value but before WAL 
for the sequence was flushed to the permanent storage. So it's not a bug that 
sync standby may return the same value as already returned in the primary 
because the corresponding WAL has not been replicated yet, isn't it?

BTW, if the returned value is stored in database, the same value is guaranteed 
not to be returned again after the server crash or by sync standby. Because in 
that case the WAL of the transaction storing that value is flushed and 
replicated.


So I think this makes it acceptable / manageable. Of course, this means the 
values are much less monotonous (across backends), but I don't think we really 
promised that. And I doubt anyone is really using sequences like this (just 
nextval) in performance critical use cases.


I think that this approach is not acceptable to some users. So, if we actually 
adopt WAL-logging every sequence fetch, also how about exposing SEQ_LOG_VALS as 
reloption for a sequence? If so, those who want to log every sequence fetch can 
set this SEQ_LOG_VALS reloption to 0. OTOH, those who prefer the current 
behavior in spite of the risk we're discussing at this thread can set the 
reloption to 32 like it is for now, for example.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: sequences vs. synchronous replication

2021-12-22 Thread Tomas Vondra

On 12/22/21 18:50, Fujii Masao wrote:



On 2021/12/22 21:11, Tomas Vondra wrote:

Interesting idea, but I think it has a couple of issues :-(


Thanks for the review!

1) We'd need to know the LSN of the last WAL record for any given 
sequence, and we'd need to communicate that between backends somehow. 
Which seems rather tricky to do without affecting performance.


How about using the page lsn for the sequence? nextval_internal() 
already uses that to check whether it's less than or equal to checkpoint 
redo location.




Hmm, maybe.



2) SyncRepWaitForLSN() is used only in commit-like situations, and 
it's a simple wait, not a decision to write more WAL. Environments 
without sync replicas are affected by this too - yes, the data loss 
issue is not there, but the amount of WAL is still increased.


How about reusing only a part of code in SyncRepWaitForLSN()? Attached 
is the PoC patch that implemented what I'm thinking.



IIRC sync_standby_names can change while a transaction is running, 
even just right before commit, at which point we can't just go back in 
time and generate WAL for sequences accessed earlier. But we still 
need to ensure the sequence is properly replicated.


Yes. In the PoC patch, SyncRepNeedsWait() still checks 
sync_standbys_defined and uses SyncRepWaitMode. But they should not be 
checked nor used because their values can be changed on the fly, as you 
pointed out. Probably SyncRepNeedsWait() will need to be changed so that 
it doesn't use them.




Right. I think the data loss with sync standby is merely a symptom, not 
the root cause. We'd need to deduce the LSN for which to wait at commit.




3) I don't think it'd actually reduce the amount of WAL records in 
environments with many sessions (incrementing the same sequence). In 
those cases the WAL (generated by in-progress xact from another 
session) is likely to not be flushed, so we'd generate the extra WAL 
record. (And if the other backends would need flush LSN of this new 
WAL record, which would make it more likely they have to generate WAL 
too.)


With the PoC patch, only when previous transaction that executed 
nextval() and caused WAL record is aborted, subsequent nextval() 
generates additional WAL record. So this approach can reduce WAL volume 
than other approach?

 > In the PoC patch, to reduce WAL volume more, it might be better to make
nextval_internal() update XactLastRecEnd and assign XID rather than 
emitting new WAL record, when SyncRepNeedsWait() returns true.




Yes, but I think there are other cases. For example the WAL might have 
been generated by another backend, in a transaction that might be still 
running. In which case I don't see how updating XactLastRecEnd in 
nextval_internal would fix this, right?


I did some experiments with increasing CACHE for the sequence, and that 
mostly eliminates the overhead. See the message I sent a couple minutes 
ago. IMHO that's a reasonable solution for the tiny number of people 
using nextval() in a way that'd be affected by this (i.e. without 
writing anything else in the xact).



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-22 Thread Tomas Vondra



On 12/21/21 03:49, Tomas Vondra wrote:

On 12/21/21 02:01, Tom Lane wrote:

Tomas Vondra  writes:

OK, I did a quick test with two very simple benchmarks - simple select
from a sequence, and 'pgbench -N' on scale 1. Benchmark was on current
master, patched means SEQ_LOG_VALS was set to 1.


But ... pgbench -N doesn't use sequences at all, does it?

Probably inserts into a table with a serial column would constitute a
plausible real-world case.



D'oh! For some reason I thought pgbench has a sequence on the history 
table, but clearly I was mistaken. There's another thinko, because after 
inspecting pg_waldump output I realized "SEQ_LOG_VALS 1" actually logs 
only every 2nd increment. So it should be "SEQ_LOG_VALS 0".


So I repeated the test fixing SEQ_LOG_VALS, and doing the pgbench with a 
table like this:


   create table test (a serial, b int);

and a script doing

   insert into test (b) values (1);

The results look like this:

1) select nextval('s');

  clients  1 4
     --
  master   39533    124998
  patched   3748  9114
     --
  diff  -91%  -93%


2) insert into test (b) values (1);

  clients  1 4
     --
  master    3718  9188
  patched   3698  9209
     --
  diff    0%    0%

So the nextval() results are a bit worse, due to not caching 1/2 the 
nextval calls. The -90% is roughly expected, due to generating about 32x 
more WAL (and having to wait for commit).


But results for the more realistic insert workload are about the same as 
before (i.e. no measurable difference). Also kinda expected, because 
those transactions have to wait for WAL anyway.




Attached is a patch tweaking WAL logging - in wal_level=minimal we do 
the same thing as now, in higher levels we log every sequence fetch.


After thinking about this a bit more, I think even the nextval workload 
is not such a big issue, because we can set cache for the sequences. 
Until now this had fairly limited impact, but it can significantly 
reduce the performance drop caused by WAL-logging every sequence fetch.


I've repeated the nextval test on a different machine (the one I used 
before is busy with something else), and the results look like this:


1) 1 client

cache  1 32128
--
master 13975  14425  19886
patched  886   7900  18397
--
diff-94%   -45%-7%

4) 4 clients

cache 1 32128
-
master 8338  12849  18248
patched 331   8124  18983
-
diff   -96%   -37% 4%

So I think this makes it acceptable / manageable. Of course, this means 
the values are much less monotonous (across backends), but I don't think 
we really promised that. And I doubt anyone is really using sequences 
like this (just nextval) in performance critical use cases.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom eeaa7cb36c69af048f0321e4883864ebe2542429 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Wed, 22 Dec 2021 03:18:46 +0100
Subject: [PATCH 1/6] WAL-log individual sequence fetches

---
 src/backend/commands/sequence.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 72bfdc07a4..0f309d0a4e 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -52,6 +52,9 @@
  * We don't want to log each fetching of a value from a sequence,
  * so we pre-log a few fetches in advance. In the event of
  * crash we can lose (skip over) as many values as we pre-logged.
+ *
+ * We only pre-log fetches in wal_level=minimal. For higher levels we
+ * WAL-log every individual sequence increment, as if this was 0.
  */
 #define SEQ_LOG_VALS	32
 
@@ -666,11 +669,18 @@ nextval_internal(Oid relid, bool check_permissions)
 	 * WAL record to be written anyway, else replay starting from the
 	 * checkpoint would fail to advance the sequence past the logged values.
 	 * In this case we may as well fetch extra values.
+	 *
+	 * We only pre-log fetches in wal_level=minimal. For higher levels we
+	 * WAL-log every individual sequence increment.
 	 */
 	if (log < fetch || !seq->is_called)
 	{
 		/* forced log to satisfy local demand for values */
-		fetch = log = fetch + SEQ_LOG_VALS;
+		if (XLogIsNeeded())
+			fetch = log = fetch;
+		else
+			fetch = log = fetch + SEQ_LOG_VALS;
+
 		logit = true;
 	}
 	else
@@ -680,7 +690,11 @@ nextval_internal(Oid relid, bool check_permissions)
 		if (PageGetLSN(page) <= redoptr)
 		{
 			

Re: sequences vs. synchronous replication

2021-12-22 Thread Fujii Masao



On 2021/12/22 21:11, Tomas Vondra wrote:

Interesting idea, but I think it has a couple of issues :-(


Thanks for the review!


1) We'd need to know the LSN of the last WAL record for any given sequence, and 
we'd need to communicate that between backends somehow. Which seems rather 
tricky to do without affecting performance.


How about using the page lsn for the sequence? nextval_internal() already uses 
that to check whether it's less than or equal to checkpoint redo location.



2) SyncRepWaitForLSN() is used only in commit-like situations, and it's a 
simple wait, not a decision to write more WAL. Environments without sync 
replicas are affected by this too - yes, the data loss issue is not there, but 
the amount of WAL is still increased.


How about reusing only a part of code in SyncRepWaitForLSN()? Attached is the 
PoC patch that implemented what I'm thinking.



IIRC sync_standby_names can change while a transaction is running, even just 
right before commit, at which point we can't just go back in time and generate 
WAL for sequences accessed earlier. But we still need to ensure the sequence is 
properly replicated.


Yes. In the PoC patch, SyncRepNeedsWait() still checks sync_standbys_defined 
and uses SyncRepWaitMode. But they should not be checked nor used because their 
values can be changed on the fly, as you pointed out. Probably 
SyncRepNeedsWait() will need to be changed so that it doesn't use them.



3) I don't think it'd actually reduce the amount of WAL records in environments 
with many sessions (incrementing the same sequence). In those cases the WAL 
(generated by in-progress xact from another session) is likely to not be 
flushed, so we'd generate the extra WAL record. (And if the other backends 
would need flush LSN of this new WAL record, which would make it more likely 
they have to generate WAL too.)


With the PoC patch, only when previous transaction that executed nextval() and 
caused WAL record is aborted, subsequent nextval() generates additional WAL 
record. So this approach can reduce WAL volume than other approach?

In the PoC patch, to reduce WAL volume more, it might be better to make 
nextval_internal() update XactLastRecEnd and assign XID rather than emitting 
new WAL record, when SyncRepNeedsWait() returns true.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATIONdiff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 72bfdc07a4..38cd55b81a 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -37,6 +37,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "parser/parse_type.h"
+#include "replication/syncrep.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -676,8 +677,9 @@ nextval_internal(Oid relid, bool check_permissions)
else
{
XLogRecPtr  redoptr = GetRedoRecPtr();
+   XLogRecPtr  pagelsn = PageGetLSN(page);
 
-   if (PageGetLSN(page) <= redoptr)
+   if (pagelsn <= redoptr || SyncRepNeedsWait(pagelsn))
{
/* last update of seq was before checkpoint */
fetch = log = fetch + SEQ_LOG_VALS;
diff --git a/src/backend/replication/syncrep.c 
b/src/backend/replication/syncrep.c
index bdbc9ef844..589364cb58 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -130,6 +130,34 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
  * ===
  */
 
+bool
+SyncRepNeedsWait(XLogRecPtr lsn)
+{
+   int mode;
+
+   if (!SyncRepRequested() ||
+   !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+   return false;
+
+   mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
+   Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+   Assert(WalSndCtl != NULL);
+
+   LWLockAcquire(SyncRepLock, LW_SHARED);
+   Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+
+   if (!WalSndCtl->sync_standbys_defined ||
+   lsn <= WalSndCtl->lsn[mode])
+   {
+   LWLockRelease(SyncRepLock);
+   return false;
+   }
+
+   LWLockRelease(SyncRepLock);
+   return true;
+}
+
 /*
  * Wait for synchronous replication, if requested by user.
  *
diff --git a/src/include/replication/syncrep.h 
b/src/include/replication/syncrep.h
index 4266afde8b..b08f3f32d1 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -82,6 +82,7 @@ extern char *syncrep_parse_error_msg;
 extern char *SyncRepStandbyNames;
 
 /* called by user backend */
+extern bool SyncRepNeedsWait(XLogRecPtr lsn);
 extern void SyncRepWaitForLSN(XLogRecPtr lsn, bool commit);
 
 /* called at backend exit */


Re: sequences vs. synchronous replication

2021-12-22 Thread Tomas Vondra




On 12/22/21 05:56, Fujii Masao wrote:



On 2021/12/22 10:57, Tomas Vondra wrote:



On 12/19/21 04:03, Amit Kapila wrote:

On Sat, Dec 18, 2021 at 7:24 AM Tomas Vondra
 wrote:


while working on logical decoding of sequences, I ran into an issue 
with

nextval() in a transaction that rolls back, described in [1]. But after
thinking about it a bit more (and chatting with Petr Jelinek), I think
this issue affects physical sync replication too.

Imagine you have a primary <-> sync_replica cluster, and you do this:

    CREATE SEQUENCE s;

    -- shutdown the sync replica

    BEGIN;
    SELECT nextval('s') FROM generate_series(1,50);
    ROLLBACK;

    BEGIN;
    SELECT nextval('s');
    COMMIT;

The natural expectation would be the COMMIT gets stuck, waiting for the
sync replica (which is not running), right? But it does not.



How about if we always WAL log the first sequence change in a 
transaction?




I've been thinking about doing something like this, but I think it 
would not have any significant advantages compared to using 
"SEQ_LOG_VALS 0". It would still have the same performance hit for 
plain nextval() calls, and there's no measurable impact on simple 
workloads that already write WAL in transactions even with 
SEQ_LOG_VALS 0.


Just idea; if wal_level > minimal, how about making nextval_internal() 
(1) check whether WAL is replicated to sync standbys, up to the page lsn 
of the sequence, and (2) forcibly emit a WAL record if not replicated 
yet? The similar check is performed at the beginning of 
SyncRepWaitForLSN(), so probably we can reuse that code.




Interesting idea, but I think it has a couple of issues :-(

1) We'd need to know the LSN of the last WAL record for any given 
sequence, and we'd need to communicate that between backends somehow. 
Which seems rather tricky to do without affecting performance.


2) SyncRepWaitForLSN() is used only in commit-like situations, and it's 
a simple wait, not a decision to write more WAL. Environments without 
sync replicas are affected by this too - yes, the data loss issue is not 
there, but the amount of WAL is still increased.


IIRC sync_standby_names can change while a transaction is running, even 
just right before commit, at which point we can't just go back in time 
and generate WAL for sequences accessed earlier. But we still need to 
ensure the sequence is properly replicated.


3) I don't think it'd actually reduce the amount of WAL records in 
environments with many sessions (incrementing the same sequence). In 
those cases the WAL (generated by in-progress xact from another session) 
is likely to not be flushed, so we'd generate the extra WAL record. (And 
if the other backends would need flush LSN of this new WAL record, which 
would make it more likely they have to generate WAL too.)



So I don't think this would actually help much.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-21 Thread Fujii Masao




On 2021/12/22 10:57, Tomas Vondra wrote:



On 12/19/21 04:03, Amit Kapila wrote:

On Sat, Dec 18, 2021 at 7:24 AM Tomas Vondra
 wrote:


while working on logical decoding of sequences, I ran into an issue with
nextval() in a transaction that rolls back, described in [1]. But after
thinking about it a bit more (and chatting with Petr Jelinek), I think
this issue affects physical sync replication too.

Imagine you have a primary <-> sync_replica cluster, and you do this:

    CREATE SEQUENCE s;

    -- shutdown the sync replica

    BEGIN;
    SELECT nextval('s') FROM generate_series(1,50);
    ROLLBACK;

    BEGIN;
    SELECT nextval('s');
    COMMIT;

The natural expectation would be the COMMIT gets stuck, waiting for the
sync replica (which is not running), right? But it does not.



How about if we always WAL log the first sequence change in a transaction?



I've been thinking about doing something like this, but I think it would not have any 
significant advantages compared to using "SEQ_LOG_VALS 0". It would still have 
the same performance hit for plain nextval() calls, and there's no measurable impact on 
simple workloads that already write WAL in transactions even with SEQ_LOG_VALS 0.


Just idea; if wal_level > minimal, how about making nextval_internal() (1) 
check whether WAL is replicated to sync standbys, up to the page lsn of the 
sequence, and (2) forcibly emit a WAL record if not replicated yet? The similar 
check is performed at the beginning of SyncRepWaitForLSN(), so probably we can 
reuse that code.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: sequences vs. synchronous replication

2021-12-21 Thread Tomas Vondra




On 12/19/21 04:03, Amit Kapila wrote:

On Sat, Dec 18, 2021 at 7:24 AM Tomas Vondra
 wrote:


while working on logical decoding of sequences, I ran into an issue with
nextval() in a transaction that rolls back, described in [1]. But after
thinking about it a bit more (and chatting with Petr Jelinek), I think
this issue affects physical sync replication too.

Imagine you have a primary <-> sync_replica cluster, and you do this:

CREATE SEQUENCE s;

-- shutdown the sync replica

BEGIN;
SELECT nextval('s') FROM generate_series(1,50);
ROLLBACK;

BEGIN;
SELECT nextval('s');
COMMIT;

The natural expectation would be the COMMIT gets stuck, waiting for the
sync replica (which is not running), right? But it does not.



How about if we always WAL log the first sequence change in a transaction?



I've been thinking about doing something like this, but I think it would 
not have any significant advantages compared to using "SEQ_LOG_VALS 0". 
It would still have the same performance hit for plain nextval() calls, 
and there's no measurable impact on simple workloads that already write 
WAL in transactions even with SEQ_LOG_VALS 0.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-20 Thread Tomas Vondra

On 12/21/21 02:01, Tom Lane wrote:

Tomas Vondra  writes:

OK, I did a quick test with two very simple benchmarks - simple select
from a sequence, and 'pgbench -N' on scale 1. Benchmark was on current
master, patched means SEQ_LOG_VALS was set to 1.


But ... pgbench -N doesn't use sequences at all, does it?

Probably inserts into a table with a serial column would constitute a
plausible real-world case.



D'oh! For some reason I thought pgbench has a sequence on the history 
table, but clearly I was mistaken. There's another thinko, because after 
inspecting pg_waldump output I realized "SEQ_LOG_VALS 1" actually logs 
only every 2nd increment. So it should be "SEQ_LOG_VALS 0".


So I repeated the test fixing SEQ_LOG_VALS, and doing the pgbench with a 
table like this:


  create table test (a serial, b int);

and a script doing

  insert into test (b) values (1);

The results look like this:

1) select nextval('s');

 clients  1 4
--
 master   39533124998
 patched   3748  9114
--
 diff  -91%  -93%


2) insert into test (b) values (1);

 clients  1 4
--
 master3718  9188
 patched   3698  9209
--
 diff0%0%

So the nextval() results are a bit worse, due to not caching 1/2 the 
nextval calls. The -90% is roughly expected, due to generating about 32x 
more WAL (and having to wait for commit).


But results for the more realistic insert workload are about the same as 
before (i.e. no measurable difference). Also kinda expected, because 
those transactions have to wait for WAL anyway.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-20 Thread Tom Lane
Tomas Vondra  writes:
> OK, I did a quick test with two very simple benchmarks - simple select 
> from a sequence, and 'pgbench -N' on scale 1. Benchmark was on current 
> master, patched means SEQ_LOG_VALS was set to 1.

But ... pgbench -N doesn't use sequences at all, does it?

Probably inserts into a table with a serial column would constitute a
plausible real-world case.

regards, tom lane




Re: sequences vs. synchronous replication

2021-12-20 Thread Tomas Vondra

On 12/20/21 17:40, Tomas Vondra wrote:

On 12/20/21 15:31, Peter Eisentraut wrote:

On 18.12.21 22:48, Tomas Vondra wrote:
What do you mean by "not caching unused sequence numbers"? Reducing 
SEQ_LOG_VALS to 1, i.e. WAL-logging every sequence increment?


That'd work, but I wonder how significant the impact will be. It'd 
bet it hurts the patch adding logical decoding of sequences quite a bit.


It might be worth testing.  This behavior is ancient and has never 
really been scrutinized since it was added.




OK, I'll do some testing to measure the overhead, and I'll see how much 
it affects the sequence decoding patch.




OK, I did a quick test with two very simple benchmarks - simple select 
from a sequence, and 'pgbench -N' on scale 1. Benchmark was on current 
master, patched means SEQ_LOG_VALS was set to 1.


Average of 10 runs, each 30 seconds long, look like this:

1) select nextval('s');

 clients  1 4
--
 master   39497123137
 patched   6813 18326
--
 diff  -83%  -86%

2) pgbench -N

 clients  1 4
--
 master2935  9156
 patched   2937  9100
--
 diff0%0%


Clearly the extreme case (1) is hit pretty bad, while the much mure 
likely workload (2) is almost unaffected.



I'm not sure what conclusion to make from this, but assuming almost no 
one does just nextval calls, it should be acceptable.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-20 Thread Tomas Vondra

On 12/20/21 15:31, Peter Eisentraut wrote:

On 18.12.21 22:48, Tomas Vondra wrote:
What do you mean by "not caching unused sequence numbers"? Reducing 
SEQ_LOG_VALS to 1, i.e. WAL-logging every sequence increment?


That'd work, but I wonder how significant the impact will be. It'd bet 
it hurts the patch adding logical decoding of sequences quite a bit.


It might be worth testing.  This behavior is ancient and has never 
really been scrutinized since it was added.




OK, I'll do some testing to measure the overhead, and I'll see how much 
it affects the sequence decoding patch.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-20 Thread Peter Eisentraut

On 18.12.21 22:48, Tomas Vondra wrote:
What do you mean by "not caching unused sequence numbers"? Reducing 
SEQ_LOG_VALS to 1, i.e. WAL-logging every sequence increment?


That'd work, but I wonder how significant the impact will be. It'd bet 
it hurts the patch adding logical decoding of sequences quite a bit.


It might be worth testing.  This behavior is ancient and has never 
really been scrutinized since it was added.





Re: sequences vs. synchronous replication

2021-12-18 Thread Amit Kapila
On Sat, Dec 18, 2021 at 7:24 AM Tomas Vondra
 wrote:
>
> while working on logical decoding of sequences, I ran into an issue with
> nextval() in a transaction that rolls back, described in [1]. But after
> thinking about it a bit more (and chatting with Petr Jelinek), I think
> this issue affects physical sync replication too.
>
> Imagine you have a primary <-> sync_replica cluster, and you do this:
>
>CREATE SEQUENCE s;
>
>-- shutdown the sync replica
>
>BEGIN;
>SELECT nextval('s') FROM generate_series(1,50);
>ROLLBACK;
>
>BEGIN;
>SELECT nextval('s');
>COMMIT;
>
> The natural expectation would be the COMMIT gets stuck, waiting for the
> sync replica (which is not running), right? But it does not.
>

How about if we always WAL log the first sequence change in a transaction?

-- 
With Regards,
Amit Kapila.




Re: sequences vs. synchronous replication

2021-12-18 Thread Tom Lane
Tomas Vondra  writes:
> What do you mean by "not caching unused sequence numbers"? Reducing 
> SEQ_LOG_VALS to 1, i.e. WAL-logging every sequence increment?

Right.

> That'd work, but I wonder how significant the impact will be.

As I said, we've accepted worse in order to have stable replication
behavior.

regards, tom lane




Re: sequences vs. synchronous replication

2021-12-18 Thread Tomas Vondra




On 12/18/21 22:27, Tom Lane wrote:

Tomas Vondra  writes:

Here's a PoC demonstrating this idea. I'm not convinced it's the right
way to deal with this - it surely seems more like a duct tape fix than a
clean solution. But it does the trick.


I was imagining something a whole lot simpler, like "don't try to
cache unused sequence numbers when wal_level > minimal".  We've
accepted worse performance hits in that operating mode, and it'd
fix a number of user complaints we've seen about weird sequence
behavior on standbys.



What do you mean by "not caching unused sequence numbers"? Reducing 
SEQ_LOG_VALS to 1, i.e. WAL-logging every sequence increment?


That'd work, but I wonder how significant the impact will be. It'd bet 
it hurts the patch adding logical decoding of sequences quite a bit.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-18 Thread Tom Lane
Tomas Vondra  writes:
> Here's a PoC demonstrating this idea. I'm not convinced it's the right 
> way to deal with this - it surely seems more like a duct tape fix than a 
> clean solution. But it does the trick.

I was imagining something a whole lot simpler, like "don't try to
cache unused sequence numbers when wal_level > minimal".  We've
accepted worse performance hits in that operating mode, and it'd
fix a number of user complaints we've seen about weird sequence
behavior on standbys.

regards, tom lane




Re: sequences vs. synchronous replication

2021-12-18 Thread Tomas Vondra



On 12/18/21 07:00, Tomas Vondra wrote:



On 12/18/21 05:52, Tom Lane wrote:

Tomas Vondra  writes:

The problem is exactly the same as in [1] - the aborted transaction
generated WAL, but RecordTransactionAbort() ignores that and does not
update LogwrtResult.Write, with the reasoning that aborted transactions
do not matter. But sequences violate that, because we only write WAL
once every 32 increments, so the following nextval() gets "committed"
without waiting for the replica (because it did not produce WAL).


Ugh.


I'm not sure this is a clear data corruption bug, but it surely walks
and quacks like one. My proposal is to fix this by tracking the lsn of
the last LSN for a sequence increment, and then check that LSN in
RecordTransactionCommit() before calling XLogFlush().


(1) Does that work if the aborted increment was in a different
session?  I think it is okay but I'm tired enough to not be sure.



Good point - it doesn't :-( At least not by simply storing LSN in a 
global variable or something like that.


The second backend needs to know the LSN of the last WAL-logged sequence 
increment, but only the first backend knows that. So we'd need to share 
that between backends somehow. I doubt we want to track LSN for every 
individual sequence (because for clusters with many dbs / sequences that 
may be a lot).


Perhaps we could track just a fixed number o LSN values in shared memory 
(say, 1024), and update/read just the element determined by hash(oid). 
That is, the backend WAL-logging sequence with given oid would set the 
current LSN to array[hash(oid) % 1024], and backend doing nextval() 
would simply remember the LSN in that slot. Yes, if there are conflicts 
that'll flush more than needed.




Here's a PoC demonstrating this idea. I'm not convinced it's the right 
way to deal with this - it surely seems more like a duct tape fix than a 
clean solution. But it does the trick.


I wonder if storing this in shmem is good enough - we lose the LSN info 
on restart, but the checkpoint should trigger FPI which makes it OK.


A bigger question is whether sequences are the only thing affected by 
this. If you look at RecordTransactionCommit() then we skip flush/wait 
in two cases:


1) !wrote_xlog - if the xact did not produce WAL

2) !markXidCommitted - if the xact does not have a valid XID

Both apply to sequences, and the PoC patch tweaks them. But maybe there 
are other places where we don't generate WAL and/or assign XID in some 
cases, to save time?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Companydiff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e7b0bc804d..c6a0c07846 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -35,6 +35,7 @@
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
 #include "commands/async.h"
+#include "commands/sequence.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
 #include "common/pg_prng.h"
@@ -1289,6 +1290,13 @@ RecordTransactionCommit(void)
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
 
+	/*
+	 * Force flush and synchronous replication even if no XID was assigned.
+	 * This is needed when depending on WAL produced by another transaction
+	 * (possibly in a different backend). Sequences need this, for example.
+	 */
+	bool		force_sync = false;
+
 	/*
 	 * Log pending invalidations for logical decoding of in-progress
 	 * transactions.  Normally for DDLs, we log this at each command end,
@@ -1299,6 +1307,24 @@ RecordTransactionCommit(void)
 	if (XLogLogicalInfoActive())
 		LogLogicalInvalidations();
 
+	/*
+	 * Check the LSN of increments for sequences we touched in this transaction.
+	 * If it's higher than XactLastRecEnd, we need to force flush/sync.
+	 */
+	{
+		/* Separate call, so that we can compare to XactLastRecEnd. */
+		XLogRecPtr tmpptr = SequenceGetLastLSN();
+
+		/*
+		 * If higher than XactLastRecEnd, make sure we flush even without
+		 * a XID assigned.
+		 */
+		force_sync = (tmpptr > XactLastRecEnd);
+
+		/* Override the XactLastRecEnd value. */
+		XactLastRecEnd = Max(XactLastRecEnd, tmpptr);
+	}
+
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, );
 	nchildren = xactGetCommittedChildren();
@@ -1446,7 +1472,7 @@ RecordTransactionCommit(void)
 	 * if all to-be-deleted tables are temporary though, since they are lost
 	 * anyway if we crash.)
 	 */
-	if ((wrote_xlog && markXidCommitted &&
+	if ((wrote_xlog && (markXidCommitted || force_sync) &&
 		 synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
 		forceSyncCommit || nrels > 0)
 	{
@@ -1504,7 +1530,7 @@ RecordTransactionCommit(void)
 	 * Note that at this stage we have marked clog, but still show as running
 	 * in the procarray and continue to hold locks.
 	 */
-	if (wrote_xlog && markXidCommitted)
+	if (wrote_xlog && (markXidCommitted || force_sync))
 		SyncRepWaitForLSN(XactLastRecEnd, true);
 
 	

Re: sequences vs. synchronous replication

2021-12-17 Thread Tomas Vondra




On 12/18/21 05:52, Tom Lane wrote:

Tomas Vondra  writes:

The problem is exactly the same as in [1] - the aborted transaction
generated WAL, but RecordTransactionAbort() ignores that and does not
update LogwrtResult.Write, with the reasoning that aborted transactions
do not matter. But sequences violate that, because we only write WAL
once every 32 increments, so the following nextval() gets "committed"
without waiting for the replica (because it did not produce WAL).


Ugh.


I'm not sure this is a clear data corruption bug, but it surely walks
and quacks like one. My proposal is to fix this by tracking the lsn of
the last LSN for a sequence increment, and then check that LSN in
RecordTransactionCommit() before calling XLogFlush().


(1) Does that work if the aborted increment was in a different
session?  I think it is okay but I'm tired enough to not be sure.



Good point - it doesn't :-( At least not by simply storing LSN in a 
global variable or something like that.


The second backend needs to know the LSN of the last WAL-logged sequence 
increment, but only the first backend knows that. So we'd need to share 
that between backends somehow. I doubt we want to track LSN for every 
individual sequence (because for clusters with many dbs / sequences that 
may be a lot).


Perhaps we could track just a fixed number o LSN values in shared memory 
(say, 1024), and update/read just the element determined by hash(oid). 
That is, the backend WAL-logging sequence with given oid would set the 
current LSN to array[hash(oid) % 1024], and backend doing nextval() 
would simply remember the LSN in that slot. Yes, if there are conflicts 
that'll flush more than needed.


Alternatively we could simply use the current insert LSN, but that's 
going to flush more stuff than needed all the time.




(2) I'm starting to wonder if we should rethink the sequence logging
mechanism altogether.  It was cool when designed, but it seems
really problematic when you start thinking about replication
behaviors.  Perhaps if wal_level > minimal, we don't do things
the same way?


Maybe, but I have no idea how should the reworked WAL logging work. Any 
batching seems to have this issue, and loging individual increments is 
likely going to be slower.


Of course, reworking how sequences are WAL-logged may invalidate the 
"sequence decoding" patch I've been working on :-(



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sequences vs. synchronous replication

2021-12-17 Thread Tom Lane
Tomas Vondra  writes:
> The problem is exactly the same as in [1] - the aborted transaction 
> generated WAL, but RecordTransactionAbort() ignores that and does not 
> update LogwrtResult.Write, with the reasoning that aborted transactions 
> do not matter. But sequences violate that, because we only write WAL 
> once every 32 increments, so the following nextval() gets "committed" 
> without waiting for the replica (because it did not produce WAL).

Ugh.

> I'm not sure this is a clear data corruption bug, but it surely walks 
> and quacks like one. My proposal is to fix this by tracking the lsn of 
> the last LSN for a sequence increment, and then check that LSN in 
> RecordTransactionCommit() before calling XLogFlush().

(1) Does that work if the aborted increment was in a different
session?  I think it is okay but I'm tired enough to not be sure.

(2) I'm starting to wonder if we should rethink the sequence logging
mechanism altogether.  It was cool when designed, but it seems
really problematic when you start thinking about replication
behaviors.  Perhaps if wal_level > minimal, we don't do things
the same way?

regards, tom lane




sequences vs. synchronous replication

2021-12-17 Thread Tomas Vondra

Hi,

while working on logical decoding of sequences, I ran into an issue with 
nextval() in a transaction that rolls back, described in [1]. But after 
thinking about it a bit more (and chatting with Petr Jelinek), I think 
this issue affects physical sync replication too.


Imagine you have a primary <-> sync_replica cluster, and you do this:

  CREATE SEQUENCE s;

  -- shutdown the sync replica

  BEGIN;
  SELECT nextval('s') FROM generate_series(1,50);
  ROLLBACK;

  BEGIN;
  SELECT nextval('s');
  COMMIT;

The natural expectation would be the COMMIT gets stuck, waiting for the 
sync replica (which is not running), right? But it does not.


The problem is exactly the same as in [1] - the aborted transaction 
generated WAL, but RecordTransactionAbort() ignores that and does not 
update LogwrtResult.Write, with the reasoning that aborted transactions 
do not matter. But sequences violate that, because we only write WAL 
once every 32 increments, so the following nextval() gets "committed" 
without waiting for the replica (because it did not produce WAL).


I'm not sure this is a clear data corruption bug, but it surely walks 
and quacks like one. My proposal is to fix this by tracking the lsn of 
the last LSN for a sequence increment, and then check that LSN in 
RecordTransactionCommit() before calling XLogFlush().



regards


[1] 
https://www.postgresql.org/message-id/ae3cab67-c31e-b527-dd73-08f196999ad4%40enterprisedb.com


--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company