On 2021-02-09 22:54, Fujii Masao wrote:
On 2021/02/09 19:11, Fujii Masao wrote:


On 2021/02/09 18:13, Fujii Masao wrote:


On 2021/02/09 17:48, torikoshia wrote:
On 2021-02-05 18:49, Fujii Masao wrote:
On 2021/02/05 0:03, torikoshia wrote:
On 2021-02-03 11:23, Fujii Masao wrote:
64-bit fetches are not atomic on some platforms. So isn't a spinlock necessary when updating "waitStart" without holding the partition lock? And doesn't GetLockStatusData() need a spinlock when reading "waitStart"?

Also it might be worth thinking to use 64-bit atomic operations like
pg_atomic_read_u64(), for that.

Thanks for your suggestion and advice!

In the attached patch I used pg_atomic_read_u64() and pg_atomic_write_u64().

waitStart is a TimestampTz, i.e., int64, but pg_atomic_read_xxx and pg_atomic_write_xxx seem to support only unsigned integers, so I cast the type.
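To show concretely what I mean by the cast, the pattern looks roughly like this (a simplified sketch, not the exact hunks from the patch):

    /* In PGPROC: lock wait start time, read without holding the partition lock */
    pg_atomic_uint64 waitStart;

    /* Writer side: "now" is a TimestampTz (int64), so I went through a cast */
    TimestampTz now = GetCurrentTimestamp();
    pg_atomic_write_u64(&MyProc->waitStart,
                        pg_atomic_read_u64((pg_atomic_uint64 *) &now));

    /* Reader side (GetLockStatusData()): cast the value back to TimestampTz */
    TimestampTz waitStart = (TimestampTz) pg_atomic_read_u64(&proc->waitStart);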

I may not be using these functions correctly, so if something is wrong, I would appreciate any comments.


About the documentation, since your suggestion seems better than v6, I used it as is.

Thanks for updating the patch!

+    if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
+        pg_atomic_write_u64(&MyProc->waitStart,
+                            pg_atomic_read_u64((pg_atomic_uint64 *) &now));

Is pg_atomic_read_u64() really necessary here? I think that
"pg_atomic_write_u64(&MyProc->waitStart, now)" is enough.

+        deadlockStart = get_timeout_start_time(DEADLOCK_TIMEOUT);
+        pg_atomic_write_u64(&MyProc->waitStart,
+                    pg_atomic_read_u64((pg_atomic_uint64 *) &deadlockStart));

Same as above.

+        /*
+         * Record waitStart reusing the deadlock timeout timer.
+         *
+         * It would be ideal this can be synchronously done with updating
+         * lock information. Howerver, since it gives performance impacts
+         * to hold partitionLock longer time, we do it here asynchronously.
+         */

IMO it's better to comment why we reuse the deadlock timeout timer.
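For example, something along the following lines might work, assuming my understanding of the reason is right (the deadlock timeout timer is started anyway when the wait begins, so reusing its start time avoids another GetCurrentTimestamp() call):

+        /*
+         * Record waitStart reusing the deadlock timeout timer.
+         *
+         * Reusing the start time of the deadlock timeout timer avoids
+         * obtaining the current time again here, since that timer is
+         * enabled right when the process starts waiting for the lock.
+         * Because holding the partitionLock longer would hurt performance,
+         * the write is done here asynchronously, outside the lock.
+         */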

     proc->waitStatus = waitStatus;
+    pg_atomic_init_u64(&MyProc->waitStart, 0);

Shouldn't pg_atomic_write_u64() be used instead, because waitStart can be
accessed concurrently there?
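To be explicit about the distinction (a sketch, not the exact hunks): pg_atomic_init_u64() is only for places where no other backend can access the variable yet, e.g. when the PGPROC is first set up, while pg_atomic_write_u64() must be used once concurrent access is possible.

    /* OK at PGPROC setup, before the proc is visible to other backends */
    pg_atomic_init_u64(&MyProc->waitStart, 0);

    /* Required where other backends may already read waitStart concurrently */
    pg_atomic_write_u64(&MyProc->waitStart, 0);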

I updated the patch and addressed the above review comments. Patch attached.
Barring any objection, I will commit this version.

Thanks for modifying the patch!
I agree with your comments.

BTW, I ran pgbench several times before and after applying
this patch.

The environment is a virtual machine (CentOS 8), so this is
just for reference, but there was no significant difference
in latency or tps (both differences were below 1%).
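I didn't keep the exact options in this mail, but the runs were of this general form (shown only as an illustration; the scale factor and duration here are not the actual values):

    $ pgbench -i -s 100 postgres
    $ pgbench -c 32 -j 4 -T 60 postgres    # compare latency and tps before/after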

Thanks for the test! I pushed the patch.

But I reverted the patch because buildfarm members rorqual and
prion didn't like it. I'm trying to investigate the cause
of these failures.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2021-02-09%2009%3A20%3A10

-    relation     | locktype |        mode
------------------+----------+---------------------
- test_prepared_1 | relation | RowExclusiveLock
- test_prepared_1 | relation | AccessExclusiveLock
-(2 rows)
-
+ERROR:  invalid spinlock number: 0

"rorqual" reported that the above error happened in the server built with
--disable-atomics --disable-spinlocks when reading pg_locks after
the transaction was prepared. The cause of this issue is that the "waitStart"
atomic variable in the dummy proc created at the end of the prepare
transaction was not initialized. I updated the patch so that
pg_atomic_init_u64() is called for "waitStart" in the dummy proc for the
prepared transaction. Patch attached. I confirmed that the patched server
built with --disable-atomics --disable-spinlocks passes all the regression tests.
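(As I understand it, the error shows up only in that build because, with --disable-atomics --disable-spinlocks, pg_atomic_uint64 falls back to a spinlock-emulated implementation whose spinlocks are identified by number, and an uninitialized one reads as the invalid number 0.) In spirit, the fix is just this kind of addition where the dummy PGPROC for the prepared transaction is set up in twophase.c (simplified sketch, not the exact hunk):

    /* Initialize waitStart in the dummy PGPROC used for the prepared xact */
    pg_atomic_init_u64(&proc->waitStart, 0);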

Thanks for fixing the bug! I also tested v9.patch built with
--disable-atomics --disable-spinlocks in my environment and confirmed
that all the tests passed.


BTW, while investigating this issue, I found that pg_stat_wal_receiver could
also cause this error, even on the current master (without the patch).
I will report that in a separate thread.


https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-09%2009%3A13%3A16

"prion" reported the following error. But I'm not sure how the changes of pg_locks caused this error. I found that Heikki also reported at [1] that
"prion" failed with the same error but was not sure how it happened.
This makes me think for now that this issue is not directly related to
the pg_locks changes.

Thanks! I was wondering how these errors were related to the commit.


Regards,

--
Atsushi Torikoshi

-------------------------------------
pg_dump: error: query failed: ERROR:  missing chunk number 0 for toast
value 14444 in pg_toast_2619
pg_dump: error: query was: SELECT
a.attnum,
a.attname,
a.atttypmod,
a.attstattarget,
a.attstorage,
t.typstorage,
a.attnotnull,
a.atthasdef,
a.attisdropped,
a.attlen,
a.attalign,
a.attislocal,
pg_catalog.format_type(t.oid, a.atttypmod) AS atttypname,
array_to_string(a.attoptions, ', ') AS attoptions,
CASE WHEN a.attcollation <> t.typcollation THEN a.attcollation ELSE 0
END AS attcollation,
pg_catalog.array_to_string(ARRAY(SELECT
pg_catalog.quote_ident(option_name) || ' ' ||
pg_catalog.quote_literal(option_value) FROM
pg_catalog.pg_options_to_table(attfdwoptions) ORDER BY option_name),
E',
    ') AS attfdwoptions,
a.attidentity,
CASE WHEN a.atthasmissing AND NOT a.attisdropped THEN a.attmissingval
ELSE null END AS attmissingval,
a.attgenerated
FROM pg_catalog.pg_attribute a LEFT JOIN pg_catalog.pg_type t ON
a.atttypid = t.oid
WHERE a.attrelid = '35987'::pg_catalog.oid AND a.attnum > 0::pg_catalog.int2
ORDER BY a.attnum
pg_dumpall: error: pg_dump failed on database "regression", exiting
waiting for server to shut down.... done
server stopped
pg_dumpall of post-upgrade database cluster failed
-------------------------------------

[1]
https://www.postgresql.org/message-id/f03ea04a-9b77-e371-9ab9-182cb35db...@iki.fi

