Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2023-04-26 Thread Dimos Stamatakis
Hi hackers,

I was wondering whether there are any updates on the visibility-check bug
introduced in version 14.5.

Many thanks,
Dimos


Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-29 Thread Dimos Stamatakis

> On 2022-Nov-25, Dimos Stamatakis wrote:
>
> > So does this mean there is no race condition in this case and that
> > this error is redundant?
>
> No, it means I believe a bug exists but that I haven't spent enough time
> on it to understand what it is.



Great! Please keep me posted and let me know if you need any more evidence to 
debug. 

Thanks,
Dimos


Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-25 Thread Alvaro Herrera
On 2022-Nov-25, Dimos Stamatakis wrote:

> So does this mean there is no race condition in this case and that
> this error is redundant?

No, it means I believe a bug exists but that I haven't spent enough time
on it to understand what it is.

Please do not top-post.
https://wiki.postgresql.org/wiki/Mailing_Lists#Email_etiquette_mechanics

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Rebellion is man's original virtue" (Arthur Schopenhauer)




Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-25 Thread Dimos Stamatakis
So does this mean there is no race condition in this case and that this error 
is redundant?

Thanks,
Dimos


From: Alvaro Herrera 
Date: Thursday, 24. November 2022 at 19:24
To: Dimos Stamatakis 
Cc: Peter Geoghegan , simon.ri...@enterprisedb.com 
, pgsql-hackers@lists.postgresql.org 

Subject: Re: Fix for visibility check on 14.5 fails on tpcc with high 
concurrency

On 2022-Nov-24, Alvaro Herrera wrote:

> On 2022-Nov-24, Dimos Stamatakis wrote:
>
> > rmgr: MultiXact len (rec/tot): 54/ 54, tx: 248477, lsn: 0/66DB82A8, prev 
> > 0/66DB8260, desc: CREATE_ID 133 offset 265 nmembers 2: 248477 (nokeyupd) 
> > 248500 (keysh)
> > rmgr: Heap len (rec/tot): 70/ 70, tx: 248477, lsn: 0/66DB82E0, prev 
> > 0/66DB82A8, desc: HOT_UPDATE off 20 xmax 133 flags 0x20 IS_MULTI EXCL_LOCK 
> > ; new off 59 xmax 132, blkref #0: rel 1663/16384/16385 blk 422
> > rmgr: Transaction len (rec/tot): 34/ 34, tx: 248477, lsn: 0/66DBA710, prev 
> > 0/66DBA6D0, desc: ABORT 2022-11-23 20:33:03.712298 UTC
> > …
> > rmgr: Transaction len (rec/tot): 34/ 34, tx: 248645, lsn: 0/66DBB060, prev 
> > 0/66DBB020, desc: ABORT 2022-11-23 20:33:03.712388 UTC
>
> Ah, it seems clear enough: the transaction that aborted after having
> updated the tuple, is still considered live when doing the second
> update. That sounds wrong.

Hmm, if a transaction is aborted but its Xid is after latestCompletedXid,
it would be reported as still running. AFAICS that is only modified
with ProcArrayLock held in exclusive mode, and only read with it held in
share mode, so this should be safe.

I can see nothing else ATM.

--
Álvaro Herrera  Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I must say, I am absolutely impressed with what pgsql's implementation of
VALUES allows me to do. It's kind of ridiculous how much "work" goes away in
my code. Too bad I can't do this at work (Oracle 8/9)." (Tom Allison)
http://archives.postgresql.org/pgsql-general/2007-06/msg00016.php


Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-24 Thread Alvaro Herrera
On 2022-Nov-24, Alvaro Herrera wrote:

> On 2022-Nov-24, Dimos Stamatakis wrote:
> 
> > rmgr: MultiXact   len (rec/tot): 54/54, tx: 248477, lsn: 
> > 0/66DB82A8, prev 0/66DB8260, desc: CREATE_ID 133 offset 265 nmembers 2: 
> > 248477 (nokeyupd) 248500 (keysh)
> > rmgr: Heap len (rec/tot): 70/70, tx: 248477, lsn: 
> > 0/66DB82E0, prev 0/66DB82A8, desc: HOT_UPDATE off 20 xmax 133 flags 0x20 
> > IS_MULTI EXCL_LOCK ; new off 59 xmax 132, blkref #0: rel 1663/16384/16385 
> > blk 422
> > rmgr: Transaction len (rec/tot): 34/34, tx: 248477, lsn: 
> > 0/66DBA710, prev 0/66DBA6D0, desc: ABORT 2022-11-23 20:33:03.712298 UTC
> > …
> > rmgr: Transaction len (rec/tot): 34/34, tx: 248645, lsn: 
> > 0/66DBB060, prev 0/66DBB020, desc: ABORT 2022-11-23 20:33:03.712388 UTC
> 
> Ah, it seems clear enough: the transaction that aborted after having
> updated the tuple, is still considered live when doing the second
> update.  That sounds wrong.

Hmm, if a transaction is aborted but its Xid is after latestCompletedXid,
it would be reported as still running.  AFAICS that is only modified
with ProcArrayLock held in exclusive mode, and only read with it held in
share mode, so this should be safe.

I can see nothing else ATM.

-- 
Álvaro Herrera  Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I must say, I am absolutely impressed with what pgsql's implementation of
VALUES allows me to do. It's kind of ridiculous how much "work" goes away in
my code.  Too bad I can't do this at work (Oracle 8/9)."   (Tom Allison)
   http://archives.postgresql.org/pgsql-general/2007-06/msg00016.php




Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-24 Thread Alvaro Herrera
On 2022-Nov-24, Dimos Stamatakis wrote:

> Thanks for your feedback!
> I applied the patch to print more information about the error. Here’s what I 
> got:
> 
> 2022-11-23 20:33:03 UTC [638 test_database]: [5458] ERROR:  new multixact has 
> more than one updating member: 0 2[248477 (nokeyupd), 248645 (nokeyupd)]
> 2022-11-23 20:33:03 UTC [638 test_database]: [5459] STATEMENT:  UPDATE 
> warehouse1
>   SET w_ytd = w_ytd + 498
> WHERE w_id = 5
> 
> I then inspected the WAL and I found the log records for these 2 transactions:
> 
> …
> rmgr: MultiXact   len (rec/tot): 54/54, tx: 248477, lsn: 
> 0/66DB82A8, prev 0/66DB8260, desc: CREATE_ID 133 offset 265 nmembers 2: 
> 248477 (nokeyupd) 248500 (keysh)
> rmgr: Heap len (rec/tot): 70/70, tx: 248477, lsn: 
> 0/66DB82E0, prev 0/66DB82A8, desc: HOT_UPDATE off 20 xmax 133 flags 0x20 
> IS_MULTI EXCL_LOCK ; new off 59 xmax 132, blkref #0: rel 1663/16384/16385 blk 
> 422
> rmgr: Transaction len (rec/tot): 34/34, tx: 248477, lsn: 
> 0/66DBA710, prev 0/66DBA6D0, desc: ABORT 2022-11-23 20:33:03.712298 UTC
> …
> rmgr: Transaction len (rec/tot): 34/34, tx: 248645, lsn: 
> 0/66DBB060, prev 0/66DBB020, desc: ABORT 2022-11-23 20:33:03.712388 UTC

Ah, it seems clear enough: the transaction that aborted after having
updated the tuple, is still considered live when doing the second
update.  That sounds wrong.

-- 
Álvaro Herrera  Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"You can live only once, but if you do it right, once is enough"




Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-24 Thread Alvaro Herrera
On 2022-Nov-23, Peter Geoghegan wrote:

> On Wed, Nov 23, 2022 at 2:54 AM Alvaro Herrera  
> wrote:
> > Something like the attached.  It would result in output like this:
> > WARNING:  new multixact has more than one updating member: 0 2[17378 
> > (keysh), 17381 (nokeyupd)]
> >
> > Then it should be possible to trace (in pg_waldump output) the
> > operations of each of the transactions that have any status in the
> > multixact that includes some form of "upd".
> 
> That seems very useful.

Okay, pushed to all branches.

> Separately, I wonder if it would make sense to add additional
> defensive checks to FreezeMultiXactId() for this. There is an
> assertion that should catch the presence of multiple updaters in a
> single Multi when it looks like we have to generate a new Multi to
> carry the XID members forward (typically something we only need to do
> during a VACUUM FREEZE). We could at least make that
> "Assert(!TransactionIdIsValid(update_xid));" line into a defensive
> "can't happen" ereport(). It couldn't hurt, at least -- we already
> have a similar relfrozenxid check nearby, added after the "freeze the
> dead" bug was fixed.

Hmm, agreed.  I'll see about that separately.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I'm always right, but sometimes I'm more right than other times."
  (Linus Torvalds)




Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-23 Thread Peter Geoghegan
On Wed, Nov 23, 2022 at 2:54 AM Alvaro Herrera  wrote:
> Something like the attached.  It would result in output like this:
> WARNING:  new multixact has more than one updating member: 0 2[17378 (keysh), 
> 17381 (nokeyupd)]
>
> Then it should be possible to trace (in pg_waldump output) the
> operations of each of the transactions that have any status in the
> multixact that includes some form of "upd".

That seems very useful.

Separately, I wonder if it would make sense to add additional
defensive checks to FreezeMultiXactId() for this. There is an
assertion that should catch the presence of multiple updaters in a
single Multi when it looks like we have to generate a new Multi to
carry the XID members forward (typically something we only need to do
during a VACUUM FREEZE). We could at least make that
"Assert(!TransactionIdIsValid(update_xid));" line into a defensive
"can't happen" ereport(). It couldn't hurt, at least -- we already
have a similar relfrozenxid check nearby, added after the "freeze the
dead" bug was fixed.

-- 
Peter Geoghegan




Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-23 Thread Alvaro Herrera
On 2022-Nov-23, Alvaro Herrera wrote:

> I suggest that we could improve that elog() so that it includes the
> members of the multixact in question, which could help us better
> understand what is going on.

Something like the attached.  It would result in output like this:
WARNING:  new multixact has more than one updating member: 0 2[17378 (keysh), 
17381 (nokeyupd)]

Then it should be possible to trace (in pg_waldump output) the
operations of each of the transactions that have any status in the
multixact that includes some form of "upd".

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Just treat us the way you want to be treated + some extra allowance
 for ignorance."(Michael Brusser)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 204aa95045..e1191a7564 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -799,7 +799,8 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 			if (ISUPDATE_from_mxstatus(members[i].status))
 			{
 				if (has_update)
-					elog(ERROR, "new multixact has more than one updating member");
+					elog(ERROR, "new multixact has more than one updating member: %s",
+						 mxid_to_string(InvalidMultiXactId, nmembers, members));
 				has_update = true;
 			}
 		}


Re: Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-23 Thread Alvaro Herrera
Hello Dimos

On 2022-Nov-22, Dimos Stamatakis wrote:

> When running tpcc on sysbench with high concurrency (96 threads, scale
> factor 5), we found that a fix for a visibility check (introduced in
> PG-14.5) causes sysbench to fail in 1 out of 70 runs.
> The error is the following:
> 
> SQL error, errno = 0, state = 'XX000': new multixact has more than one 
> updating member

Ouch.

I did not remember any reports of this.  Searching I found this recent
one:
https://postgr.es/m/17518-04e368df5ad7f...@postgresql.org

However, the reporter there says they're using 12.10, and according to
src/tools/git_changelog the commit appeared only in 12.12:

Author: Heikki Linnakangas 
Branch: master Release: REL_15_BR [adf6d5dfb] 2022-06-27 08:21:08 +0300
Branch: REL_14_STABLE Release: REL_14_5 [e24615a00] 2022-06-27 08:24:30 +0300
Branch: REL_13_STABLE Release: REL_13_8 [7ba325fd7] 2022-06-27 08:24:35 +0300
Branch: REL_12_STABLE Release: REL_12_12 [af530898e] 2022-06-27 08:24:36 +0300
Branch: REL_11_STABLE Release: REL_11_17 [b49889f3c] 2022-06-27 08:24:37 +0300
Branch: REL_10_STABLE Release: REL_10_22 [4822b4627] 2022-06-27 08:24:38 +0300

Fix visibility check when XID is committed in CLOG but not in procarray.
[...]


Thinking further, one problem in tracking this down is that at this
point the multixact in question is *being created*, so we don't have a
WAL trail we could trace through.

I suggest that we could improve that elog() so that it includes the
members of the multixact in question, which could help us better
understand what is going on.

> This commit was supposed to fix a race condition during the visibility
> check. Please let us know whether you are aware of this issue and if
> there is a quick fix.

I don't think so.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Fix for visibility check on 14.5 fails on tpcc with high concurrency

2022-11-22 Thread Dimos Stamatakis
Hi hackers,

When running tpcc on sysbench with high concurrency (96 threads, scale factor
5), we found that a fix for a visibility check (introduced in PG-14.5) causes
sysbench to fail in 1 out of 70 runs.
The error is the following:

SQL error, errno = 0, state = 'XX000': new multixact has more than one updating 
member

And it is caused by the following statement:

UPDATE warehouse1
  SET w_ytd = w_ytd + 234
  WHERE w_id = 3;

The commit that fixes the visibility check is the following:
https://github.com/postgres/postgres/commit/e24615a0057a9932904317576cf5c4d42349b363

We reverted this commit and tpcc no longer fails, which indicates this change
is implicated.
Steps to reproduce:
1. Install sysbench
  https://github.com/akopytov/sysbench
2. Install percona sysbench TPCC
  https://github.com/Percona-Lab/sysbench-tpcc
3. Run percona sysbench -- prepare
  # sysbench-tpcc/tpcc.lua --pgsql-host=localhost --pgsql-port=5432 
--pgsql-user={USER} --pgsql-password={PASSWORD} --pgsql-db=test_database 
--db-driver=pgsql --tables=1 --threads=96 --scale=5 --time=60 prepare
4. Run percona sysbench -- run
  # sysbench-tpcc/tpcc.lua --pgsql-host=localhost --pgsql-port=5432 
--pgsql-user={USER} --pgsql-password={PASSWORD} --pgsql-db=test_database 
--db-driver=pgsql --tables=1 --report-interval=1 --rand-seed=1 --threads=96 
--scale=5 --time=60 run

We tested on a machine with 2 NUMA nodes, 16 physical cores per node, and 2 
threads per core, resulting in 64 threads total. The total memory is 376GB.
Attached please find the configuration file we used (postgresql.conf).

This commit was supposed to fix a race condition during the visibility check. 
Please let us know whether you are aware of this issue and if there is a quick 
fix.
Any input is highly appreciated.

Thanks,
Dimos


Attachment: postgresql.conf