Re: Slow standby snapshot

2022-02-20 Thread Michail Nikolaev
Hello, Andrey.

Thanks for your efforts.

> Patch on barrier seems too complicated to me right now. I’d propose to focus
> on KnownAssignedXidsNext patch: it’s clean, simple and effective.
I'll extract it into a separate patch later.

> I’ve rebased the patch so that it does not depend on the previous step. Please
> check out its current state; if you are OK with it - let’s mark the patch
> Ready for Committer. Just maybe a slightly better commit message would make the
> patch easier to understand.
Everything seems to be correct.

Best regards,
Michail.




Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second

2022-03-21 Thread Michail Nikolaev
Hello, Peter.

> * Instead of avoiding the FPI when this happens, proactively call
> _bt_simpledel_pass() just before _bt_killitems() returns. Accept the
> immediate cost of setting an LP_DEAD bit, just like today, but avoid
> repeated FPIs.

Hm, I am not sure here.
AFAIK the current implementation does not produce repeated FPIs. The page
is marked dirty when the first bit is set, so the remaining LP_DEAD bits
(if not set by a single scan) do not generate an FPI until the next
checkpoint.
Also, the issue affects GiST and hash indexes, and heap pages.

Best regards,
Michail.




Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second

2022-03-22 Thread Michail Nikolaev
Hello, Peter.

Thanks for your comments.

> There is one FPI per checkpoint for any leaf page that is modified
> during that checkpoint. The difference between having that happen once
> or twice per leaf page and having that happen many more times per leaf
> page could be very large.

Yes, I am almost sure that proactively calling _bt_simpledel_pass() will
positively impact the system on many workloads. But I am also almost
sure it will not change the behavior of the incident I mentioned,
because that incident is not related to multiple checkpoints.

> Of course it's true that that might not make that much difference. Who
> knows? But if you're not willing to measure it then we'll never know.
> What version are you using here? How frequently were checkpoints
> occurring in the period in question, and how does that compare to
> normal? You didn't even include this basic information.

Yes, I probably should have provided more details. The downtime is pretty
short (you can see the network peak on the telemetry image from the first
letter) - just 1-3 minutes. Checkpoints occur about every 30 minutes.
It is simply an issue of super-high WAL traffic caused by tons of FPI
traffic after a long transaction commits. The issue resolved quickly on
its own, but the downtime still happened.

> Many things have changed in this area already, and it's rather unclear
> how much just upgrading to Postgres 14 would help.

The version is 11. Yes, many things have changed, but AFAIK nothing has
changed in the FPI mechanics (LP_DEAD and other hint bits, including
heap).

I could probably try to reproduce the issue, but I'm not sure how to
do it in a fast and reliable way (it is hard to wait a day for each
test). It may be possible with some temporary crutch in the Postgres
source (to emulate an old transaction commit somehow).

> The main reason that this can be so complex is that FPIs are caused by
> more frequent checkpoints, but *also* cause more frequent checkpoints
> in turn. So you could have a "death spiral" with FPIs -- the effect is
> nonlinear, which has the potential to lead to pathological, chaotic
> behavior. The impact on response time is *also* nonlinear and chaotic,
> in turn.

Could you please explain "death spiral" mechanics related to FPIs?

> What do you do when there were too many FPIs for a long time, but also too 
> much
> avoiding them earlier on? It's very complicated.

Yes, avoiding FPIs too aggressively could cause at least performance
degradation. I am 100% sure such a setting should be disabled by default.
It is more about the physical limits of servers. Personally, I would like
to set it to about 75% of resources.

Also, checkpoints and vacuum have something in common: both are processes
that must be done regularly (but not right now), and both are limited in
the resources they consume. Setting LP_DEAD (and other hint bits,
especially in heap) is also something that must be done regularly (but
not right now) - yet it is not limited in resources.

BTW, new index creation probably has the same nature.

Best regards,
Michail.




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-03-22 Thread Michail Nikolaev
Hello, Andres.

> Fails to apply at the moment: http://cfbot.cputube.org/patch_37_2947.log

Thanks for notifying me. BTW, some kind of automatic email in case of
status change could be very helpful.

> Marked as waiting for author.

A new version is attached and the build is passing
(https://cirrus-ci.com/build/5599876384817152), so I am moving it back to
"Ready for Committer".

Best regards,
Michail.
From 9ecb33a54971cfa1c766ed9d129c6abb44e39f98 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Sat, 15 Jan 2022 16:21:51 +0300
Subject: [PATCH v10 1/3] code

---
 src/backend/access/common/bufmask.c  | 25 
 src/backend/access/gist/gistget.c| 43 +++--
 src/backend/access/gist/gistxlog.c   | 15 +
 src/backend/access/hash/hash.c   |  4 +-
 src/backend/access/hash/hash_xlog.c  | 17 +
 src/backend/access/hash/hashsearch.c | 18 --
 src/backend/access/hash/hashutil.c   | 33 +-
 src/backend/access/heap/heapam.c | 42 +---
 src/backend/access/heap/heapam_handler.c |  5 +-
 src/backend/access/index/genam.c | 20 +++---
 src/backend/access/index/indexam.c   | 81 +---
 src/backend/access/nbtree/nbtinsert.c| 22 +--
 src/backend/access/nbtree/nbtree.c   |  4 +-
 src/backend/access/nbtree/nbtsearch.c| 14 +++-
 src/backend/access/nbtree/nbtutils.c | 33 +-
 src/backend/access/nbtree/nbtxlog.c  | 16 +
 src/backend/access/table/tableam.c   |  4 +-
 src/backend/access/transam/rmgr.c|  4 +-
 src/backend/access/transam/xlogutils.c   |  6 ++
 src/backend/storage/ipc/standby.c|  6 ++
 src/bin/pg_rewind/parsexlog.c|  2 +-
 src/bin/pg_waldump/rmgrdesc.c|  2 +-
 src/include/access/bufmask.h |  1 +
 src/include/access/gist.h|  5 ++
 src/include/access/gistxlog.h|  1 +
 src/include/access/hash.h|  2 +
 src/include/access/hash_xlog.h   |  1 +
 src/include/access/heapam.h  |  2 +-
 src/include/access/nbtree.h  |  2 +
 src/include/access/nbtxlog.h |  1 +
 src/include/access/relscan.h | 15 -
 src/include/access/rmgr.h|  2 +-
 src/include/access/rmgrlist.h| 44 ++---
 src/include/access/tableam.h | 14 ++--
 src/include/access/xlog_internal.h   |  4 ++
 35 files changed, 421 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/common/bufmask.c 
b/src/backend/access/common/bufmask.c
index 4e953bfd61..22026482ad 100644
--- a/src/backend/access/common/bufmask.c
+++ b/src/backend/access/common/bufmask.c
@@ -128,3 +128,28 @@ mask_page_content(Page page)
memset(&((PageHeader) page)->pd_upper, MASK_MARKER,
   sizeof(uint16));
 }
+
+/*
+ * mask_lp_dead
+ *
+ * In some index AMs, line pointer flags can be modified without emitting any
+ * WAL record. Sometimes it is necessary to mask LP_DEAD flags set on the
+ * primary so that the standby can set its own values.
+ */
+void
+mask_lp_dead(Page page)
+{
+   OffsetNumber offnum,
+maxoff;
+
+   maxoff = PageGetMaxOffsetNumber(page);
+   for (offnum = FirstOffsetNumber;
+offnum <= maxoff;
+offnum = OffsetNumberNext(offnum))
+   {
+   ItemId  itemId = PageGetItemId(page, offnum);
+
+   if (ItemIdHasStorage(itemId) && ItemIdIsDead(itemId))
+   itemId->lp_flags = LP_NORMAL;
+   }
+}
diff --git a/src/backend/access/gist/gistget.c 
b/src/backend/access/gist/gistget.c
index adbf622c83..1905c04c51 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -14,6 +14,7 @@
  */
 #include "postgres.h"
 
+#include "access/bufmask.h"
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/relscan.h"
@@ -49,6 +50,7 @@ gistkillitems(IndexScanDesc scan)
Assert(so->curBlkno != InvalidBlockNumber);
Assert(!XLogRecPtrIsInvalid(so->curPageLSN));
Assert(so->killedItems != NULL);
+   Assert(so->numKilled > 0);
 
buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
if (!BufferIsValid(buffer))
@@ -62,8 +64,13 @@ gistkillitems(IndexScanDesc scan)
 * If page LSN differs it means that the page was modified since the 
last
 * read. killedItems could be not valid so LP_DEAD hints applying is not
 * safe.
+*
+* Another case - standby was promoted after start of current 
transaction.
+* It is not required for correctness, but it is better to just skip
+* everything.
 */
-   if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
+   if ((BufferGetLSNAtomic(buffer) != so->curPageLSN) ||
+   (scan->xactStartedInRecover

Why latestRemovedXid|cutoff_xid are always sent?

2021-01-02 Thread Michail Nikolaev
Hello, hackers.

Working on some stuff, I realized I do not understand why
latestRemovedXid|cutoff_xid (in different types of WAL records) is
sent every time it appears on the primary side.

latestRemovedXid|cutoff_xid is used to call
ResolveRecoveryConflictWithSnapshot and cancel conflicting backends on
the standby. In some cases, snapshot conflict resolution is the only
work REDO does (heap_xlog_cleanup_info or btree_xlog_reuse_page, for
example).

Could we try to somehow optimistically advance the latest sent
latestRemovedXid value in shared memory on the primary and skip sending
it if a newer xid has already been sent? That way we could reduce the
number of ResolveRecoveryConflictWithSnapshot calls on the standby and
even skip some WAL records.

At the very least we could do the same optimization on the standby side
(skipping ResolveRecoveryConflictWithSnapshot if it has already been
called with a newer xid).

Is this a sane idea, or have I missed something huge?

Thanks,
Michail.




Re: Why latestRemovedXid|cutoff_xid are always sent?

2021-01-08 Thread Michail Nikolaev
Hello, Peter.

Thanks for your explanation. One of the reasons I was asking is an idea
to use the same technique in the "LP_DEAD index hint bits on standby" WIP
patch to reduce the amount of additional WAL.

Now I am sure such optimization should work correctly.

Thanks,
Michail.


[PATCH] Full support for index LP_DEAD hint bits on standby

2021-01-18 Thread Michail Nikolaev
Hello, hackers.

[ABSTRACT]

Executing queries on a hot standby is one of the most popular ways to
scale application workload. Most modern Postgres installations have two
standby nodes for high-availability support, so utilizing the replicas'
CPU seems like a reasonable idea.
At the same time, some queries (index scans) can be much slower on a hot
standby than on the primary. This happens because the LP_DEAD index hint
bit mechanism is ignored in index scans during recovery. This is done for
good reasons, of course [1]:

 * We do this because the xmin on the primary node could easily be
 * later than the xmin on the standby node, so that what the primary
 * thinks is killed is supposed to be visible on standby. So for correct
 * MVCC for queries during recovery we must ignore these hints and check
 * all tuples.

Also, according to [2] and cases like [3], it seems to be a good idea
to support "ignore_killed_tuples" on standby.

The goal of this patch is to provide full support for index hint bits
on hot standby. The mechanism should be based on well-tested
functionality and not cause a lot of recovery conflicts.

This thread is the continuation (and partly a copy-paste) of the
previous one [4].

[PROBLEM]

The standby itself can set and read hint bits during recovery. Such
bits are even correct according to standby visibility rules. But the
problem is the full-page-write WAL records coming from the primary.
Such WAL records can bring hint bits that are invalid according to the
standby's xmin.

So, if we can be sure the scan doesn't see any invalid hint bit from the
primary - the problem is solved. And we would even be able to allow the
standby to set its own LP_DEAD bits.

The idea is simple: let's WAL-log hint bits before the FPW somehow. That
could cause a lot of additional WAL, however...

But there are ways to avoid that:
1) Send only one `latestRemovedXid` for all tuples marked as dead during
a page scan.
2) Remember the latest sent `latestRemovedXid` in shared memory, and
optimistically skip WAL records with older xid values [5].

Such WAL records would cause a lot of recovery conflicts on standbys.
But we can be tricky here - let's use hint bits only if
hot_standby_feedback is enabled and effective on the standby. If HSF is
effective, conflicts are not possible. If HSF is off, the standby ignores
both the hint bits and the additional conflict resolution. The major
point here is that HSF is just an optimization and has nothing to do with
MVCC correctness.

[DETAILS]

The patch introduces a new WAL record (named
XLOG_INDEX_HINT_BITS_HORIZON) to define the xmin horizon required for a
standby's snapshot to use LP_DEAD bits in an index scan.

`table_index_fetch_tuple` now returns `latest_removed_xid` value
additionally to `all_dead`. This value is used to advance
`killedLatestRemovedXid` at time of updating `killedItems` (see
`IndexHintBitAdvanceLatestRemovedXid`).

Primary sends the value of `killedLatestRemovedXid` in
XLOG_INDEX_HINT_BITS_HORIZON before it marks page dirty after setting
LP_DEAD bits on the index page (by calling
`MarkBufferDirtyIndexHint`).

The new WAL record is always sent before a possible FPW. It is only
necessary to send such a record if its `latestRemovedXid` is newer than
the one previously sent for the current database (see
`LogIndexHintBitsHorizonIfNeeded`).

There is a new flag in the PGPROC structure -
`indexIgnoreKilledTuples`. If the flag is set to true, standby queries
are going to use LP_DEAD bits in index scans. In such a case the
snapshot is required to satisfy the new horizon pushed by
XLOG_INDEX_HINT_BITS_HORIZON records.

It is safe to set `indexIgnoreKilledTuples` to any value from the
perspective of correctness, but a `true` value could cause recovery
conflicts. It is a kind of compromise - use LP_DEAD bits but be prepared
for XLOG_INDEX_HINT_BITS_HORIZON conflicts, or vice versa.

What is the way to make the right decision about this compromise? It is
pretty simple - if `hot_standby_feedback` is on and the primary has
confirmed that feedback was received, then set
`indexIgnoreKilledTuples` (see `GetSnapshotIndexIgnoreKilledTuples`).

While the feedback is working as expected, a query will never be
canceled by XLOG_INDEX_HINT_BITS_HORIZON.

To support cascading standby setups (with a possible break of the
feedback chain in the middle), an additional byte was added to the
keep-alive message of the feedback protocol. This byte is used to make
sure our xmin is honored by the primary (see
`sender_propagates_feedback_to_primary`). Also, the WAL sender now
always sends a keep-alive after receiving a feedback message.

So, this way, it is safe to use LP_DEAD bits received from the primary
when we want to.

And, as a result, it is safe to set LP_DEAD bits on standby.
Even if:
* the primary changes vacuum_defer_cleanup_age
* standby restarted
* standby promoted to the primary
* base backup taken from standby
* standby is serving queries during recovery
– nothing could go wrong here.

Because `HeapTupleIsSurelyDead` (and i

Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-01-21 Thread Michail Nikolaev
Hello, everyone.

Oh, I just realized that it seems I was too naive to allow the standby
to set LP_DEAD bits this way.
There is a possible consistency problem in the case of a low
minRecoveryPoint value (because hint bits do not move the page LSN
forward).

Something like this:

LSN=10  STANDBY INSERTS NEW ROW IN INDEX (index_lsn=10)
<---minRecoveryPoint will go here
LSN=20  STANDBY DELETES ROW FROM HEAP, INDEX UNTOUCHED (index_lsn=10)
   REPLICA SCANS INDEX AND SETS hint bits (index_lsn=10)
   INDEX IS FLUSHED (minRecoveryPoint=index_lsn=10)
   CRASH

After crash recovery, the standby will be able to handle queries once it
reaches LSN=10. But the index page contains hint bits from the future
(LSN=20).
So, I need to think more here.

Thanks,
Michail.




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-01-27 Thread Michail Nikolaev
Hello, hackers.

I think I was able to fix the issue related to minRecoveryPoint and crash
recovery. To make sure standby will be consistent after crash recovery, we
need to take the current value of minRecoveryPoint into account while
setting LP_DEAD hints (almost the same way as it is done for *heap* hint
bits already).

I have introduced new structure IndexHintBitsData:
---
/* guaranteed not visible for all backends */
bool all_dead;

/* latest removed xid if known */
TransactionId latest_removed_xid;

 /* lsn of page where dead tuple located */
XLogRecPtr page_lsn;
---

This structure is filled by the `heap_hot_search_buffer` function.
Afterwards, we decide whether or not to set `kill_prior_tuple` depending
on its content (calling `IsMarkBufferDirtyIndexHintAllowed`).

For the primary, it is always safe to set LP_DEAD in the index if
`all_dead` == true.

In the case of a standby, we need to check `latest_removed_xid` (if
available) first. If the commit LSN of the latest removed xid is already
lower than minRecoveryPoint (`XLogNeedsFlush`), it is safe to set
`kill_prior_tuple`.

Sometimes we are not sure about the latest removed xid - the heap record
could have been marked dead by an XLOG_HEAP2_CLEAN record, for example.
In such a case we check the LSN of the *heap* page containing the tuple
(the LSN could already have been updated by other transactions, but that
does not matter in this situation). If the page LSN is lower than
minRecoveryPoint, it is safe to set LP_DEAD in the index too.
Otherwise, just leave the index tuple alive.


So, to bring it all together:

* Normal operation, proc->indexIgnoreKilledTuples is true:
  It is safe for the standby to use hint bits from the primary FPI
because of XLOG_INDEX_HINT_BITS_HORIZON conflict resolution.
  It is safe for the standby to set its own index hint bits because
`ComputeXidHorizons` honors other read-only procs' xmin and the lowest
xid on the primary (`KnownAssignedXidsGetOldestXmin`).

* Normal operation, proc->indexIgnoreKilledTuples is false:
  Index hint bits are never set or taken into account.

* Crash recovery, proc->indexIgnoreKilledTuples is true:
  It is safe for the standby to use hint bits from the primary FPW
because XLOG_INDEX_HINT_BITS_HORIZON is always logged before the FPI,
and the commit record of the transaction that removed the tuple is
logged before XLOG_INDEX_HINT_BITS_HORIZON. So, if an FPI with hints was
flushed (and taken into account by minRecoveryPoint), both the
transaction-remover and horizon records are replayed before queries are
served.
  It is safe for the standby to use its own hint bits because they can
be set only if the commit record of the transaction-remover is below
minRecoveryPoint, or the LSN of the heap page with the removed tuples is
below minRecoveryPoint.

* Crash recovery, proc->indexIgnoreKilledTuples is false:
  Index hint bits are never set or taken into account.

So, now it seems correct to me.

Another interesting point: the position of minRecoveryPoint now affects
performance a lot. This already happens (because of *heap* hint bits),
but after the patch it is even more noticeable. Is there any reason to
keep minRecoveryPoint at a low value?

A rebased and updated patch is attached.

I will be happy if someone could recheck my ideas or even the code :)

Thanks a lot,
Michail.
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 96442ceb4e..6399184a8c 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -10,6 +10,7 @@
 #-
 
 EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL+=contrib/pageinspect
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/022_index_hint_bits.pl b/src/test/recovery/t/022_index_hint_bits.pl
new file mode 100644
index 00..737dca0185
--- /dev/null
+++ b/src/test/recovery/t/022_index_hint_bits.pl
@@ -0,0 +1,282 @@
+# Checks that index hints on standby work as expected.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 15;
+use Config;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', qq{
+autovacuum = off
+enable_seqscan = off
+enable_indexonlyscan = off
+});
+$node_primary->start;
+
+$node_primary->safe_psql('postgres', 'CREATE EXTENSION pageinspect');
+# Create test table with primary index
+$node_primary->safe_psql(
+'postgres', 'CREATE TABLE test_index_hint (id int, value int)');
+$node_primary->safe_psql(
+'postgres', 'CREATE INDEX test_index ON test_index_hint (value, id)');
+# Fill some data into it; note: do not insert a lot of records, to avoid a
+# heap_page_prune_opt call which would cause a recovery conflict, hiding
+# conflicts caused by index hint bits
+$node_primary->safe_psql('postgres',
+'INSERT INTO test_index_hint VALUES (generate_series(1, 30), 0)');
+# And vacuum to allow index hint bits to be set

Re: Thoughts on "killed tuples" index hint bits support on standby

2021-01-27 Thread Michail Nikolaev
Hello, hackers.

Sorry for necroposting, but if someone is interested - I hope the patch
is now ready and available in the other thread [1].

Thanks,
Michail.

[1]
https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com


Re: Thoughts on "killed tuples" index hint bits support on standby

2021-01-28 Thread Michail Nikolaev
Hello, Peter.


> I wonder if it would help to not actually use the LP_DEAD bit for
> this. Instead, you could use the currently-unused-in-indexes
> LP_REDIRECT bit.

Hm… Sounds very promising - an additional bit is a lot in this situation.

> Whether or not "recently dead" means "dead to my
> particular MVCC snapshot" can be determined using some kind of
> in-memory state that won't survive a crash (or a per-index-page
> epoch?).

Do you have any additional information about this idea? (maybe some
thread). What kind of "in-memory state that won't survive a crash", and
how do we deal with already-flushed bits after a crash?

> "Not known to be dead in any sense" (0), "Unambiguously dead to all"
> (what we now simply call LP_DEAD), "recently dead on standby"
> (currently-spare bit is set), and "recently dead on primary" (both
> 'lp_flags' bits set).

Hm. What is about this way:

10 - dead to all on standby (LP_REDIRECT)
11 - dead to all on primary (LP_DEAD)
01 - future “recently DEAD” on primary (LP_NORMAL)

In such a case the standby could just always ignore all LP_DEAD bits
from the primary (the standby will lose its own hints after an FPI - but
that is not a big deal). So, we don't need any conflict resolution (or
any additional WAL records). Also, the hot_standby_feedback-related
stuff is no longer required.
All we need to do (without going into details, of course) is correctly
check whether it is safe to set LP_REDIRECT on the standby according to
`minRecoveryPoint` (to avoid consistency issues during crash recovery).
Or, probably, introduce some kind of `indexHintMinRecoveryPoint`.

Also, it looks like GiST and hash indexes do not use LP_REDIRECT either.

So, it will remove more than 80% of the current patch complexity!

Also, BTW, do you know of any reason to keep minRecoveryPoint at a low
value? It already blocks the standby from setting hint bits in the
*heap*. And it will probably block the standby from setting *index* hint
bits aggressively.

Thanks a lot,
Michail.


Re: Thoughts on "killed tuples" index hint bits support on standby

2021-01-30 Thread Michail Nikolaev
Hello, Peter.

> Yeah, it would help a lot. But those bits are precious. So it makes
> sense to think about what to do with both of them in index AMs at the
> same time.  Otherwise we risk missing some important opportunity.

Hm. I was trying to "expand the scope" as you said and got an idea...
Why should we even do any conflict resolution for hint bits? Or share
precious LP bits?

The only way a standby can get an "invalid" hint bit is via an FPI from
the primary. We could just use the current *btree_mask* infrastructure
and clear all "probably invalid" bits coming from the primary while
applying an FPI on a standby (in `XLogReadBufferForRedoExtended`)!
For binary compatibility, we could use one of the `btpo_flags` bits to
mark the page as "primary bits masked". The same approach would work for
hash and GiST too.

No conflicts, only the LP_DEAD bit is used by the standby, `btpo_flags`
has many free bits for now, it is easy to implement, and the page content
of primary and standby already differs in these bits...
Looks like an easy and effective solution to me.

What do you think?

>> Also, btw, do you know any reason to keep minRecoveryPoint at a low
value?
> Not offhand.

If so, it looks like it is not a bad idea to move minRecoveryPoint
forward from time to time (for more aggressive standby index hint bits).
On each `xl_running_xacts` (about every 15s), for example.

> BTW, what happens when the page splits on the primary, btw? Does your
> patch "move over" the LP_DEAD bits to each half of the split?

That part is not changed in any way.

Thanks,
Michail.


Re: Thoughts on "killed tuples" index hint bits support on standby

2021-02-01 Thread Michail Nikolaev
Hello, Peter.

Thanks a lot for your comments.
Here are some of my thoughts related to the "masked bits" solution and
your comments:

> During recovery, we will probably always have to consider the
> possibility that LP_DEAD bits that get set on the primary may be
> received by a replica through some implementation detail (e.g. LP_DEAD
> bits are set in FPIs we replay, maybe even some other thing that
> neither of us have thought of).

It is fine for the standby to receive a page from any source:
`btpo_flags` should have some kind of "LP_DEAD safe for standby" bit set
to allow new bits to be set and old ones to be read.
> We can't really mask LP_DEAD bits from
> the primary in recovery anyway, because of stuff like page-level
> checksums. I suspect that it just isn't useful to fight against that.

As far as I can see, there is no problem here. Checksums already differ
for both heap and index pages on standby and primary. Checksums are
calculated before the page is written to disk (not after applying an
FPI). So, masking the page while *applying* the FPI is semantically the
same as setting a bit in it one nanosecond after.

And `btree_mask` (and the other mask functions) are already used by
consistency checks to exclude LP_DEAD.

> Plus you'll introduce new overhead for this process during replay,
> which creates significant overhead -- often most leaf pages have some
> LP_DEAD bits set during recovery.

I hope it is not too big, because FPIs are not too frequent, and the
positive effect will easily outweigh the additional CPU cycles of
`btree_mask` (plus the page is already in the CPU cache at that moment).

> I don't like that idea. Apart from what I said already, you're
> assuming that setting LP_DEAD bits in indexes on the primary won't
> eventually have some value on the standby after it is promoted and can
> accept writes -- they really do have meaning and value on standbys.

I think it is fine to lose part of the LP_DEAD bits on promotion (in
practice they can only be received via FPI). They will be set again on
the first scan anyway. Also, bits set by the standby may be used by a
newly promoted primary if we honor the OldestXmin of the previous
primary while setting them (we just need to add OldestXmin to
xl_running_xacts and include it in the dead-horizon on the standby).

> Why shouldn't this break page-level checksums (or wal_log_hints) in
> some way? What about pg_rewind, some eventual implementation of
> incremental backups, etc? I suspect that it will be necessary to
> invent some entirely new concept that is like a hint bit, but works on
> standbys (without breaking anything else).

As I said before, applying the mask on the *standby* will not break any
checksums, because the page is still dirty after that (and it is even
possible to call `PageSetChecksumInplace` for additional paranoia).
Actual checksums on standby and primary already have different values
(probably on most pages, because of LP_DEAD and "classic" hint bits).

> If you invent some entirely new category of standby-only hint bit at a
> level below the access method code, then you can use it inside access
> method code such as nbtree. Maybe you don't have to play games with
> minRecoveryPoint in code like the "if (RecoveryInProgress())" path
> from the XLogNeedsFlush() function. Maybe you can do some kind of
> rudimentary "masking" for the in recovery case at the point that we
> *write out* a buffer (*not* at the point hint bits are initially set)
> -- maybe this could happen near to or inside FlushBuffer(), and maybe
> only when checksums are enabled? I'm unsure.

Not sure I was able to understand your idea here, sorry.

> The difference may be subtle, but it's important here -- it justifies
> inventing a whole new type of LP_DEAD-style status bit that gets set
> only on standbys. Even today, heap tuples can have hint bits
> "independently" set on standbys, subject to certain limitations needed
> to avoid breaking things like data page checksums

Yes, and I see three major ways to implement it in the current
infrastructure:

1) Use LP_REDIRECT (or other free value) instead of LP_DEAD on standby
2) Use LP_DEAD on standby, but involve some kind of recovery conflicts
(like here -
https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com
)
3) Mask index FPI during a replay on hot standby + mark page as “primary
LP_DEAD free” in btpo_flags

Of course, each variant requires some special additional things to keep
everything safe.

As far as I can see, the limitations in SetHintBits are related to
XLogNeedsFlush (the check of minRecoveryPoint in the case of a standby).

> Conclusion: The whole minRecoveryPoint question that you're trying to
> answer to improve things for your patch is just the wrong question.
> Because LP_DEAD bits in indexes are not *true* "hint bits". Maybe it
> would be useful to set "true hint bits" on standbys earlier, and maybe
> thinking about minRecoveryPoint would help with that problem, but that
> doesn't help your index-related

Re: Thoughts on "killed tuples" index hint bits support on standby

2021-02-02 Thread Michail Nikolaev
Hello, Peter.

> AFAICT that's not true, at least not in any practical sense. See the
> comment in the middle of MarkBufferDirtyHint() that begins with "If we
> must not write WAL, due to a relfilenode-specific...", and see the
> "Checksums" section at the end of src/backend/storage/page/README. The
> last paragraph in the README is particularly relevant:

I have attached a TAP test to demonstrate how easily checksums on
standby and primary start to differ. The test shows two different
scenarios - for both heap and index (and the bit is placed on both
standby and primary).

Yes, MarkBufferDirtyHint does not mark the page as dirty… So, hint bits
on a standby can easily be lost. But it leaves the page dirty if it
already is (or it could be marked dirty by WAL replay later). So, hint
bits can easily be flushed and taken into account during checksum
calculation on both standby and primary.

> "We can set the hint, just not dirty the page as a result so the hint
> is lost when we evict the page or shutdown"

Yes, it is not allowed to mark a page dirty because of hints on a
standby, because we could end up with this:

CHECKPOINT
SET HINT BIT
TORN FLUSH + CRASH = BROKEN CHECKSUM, SERVER FAULT

But this scenario is totally fine:

CHECKPOINT
FPI (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED, EVERYTHING IS OK

And, as result, this is fine too:

CHECKPOINT
FPI WITH MASKED LP_DEAD (page is still dirty)
SET HINT BIT
TORN FLUSH + CRASH = PAGE IS RECOVERED + LP_DEAD MASKED AGAIN IF STANDBY

So, my point here - it is fine to mask LP_DEAD bits during replay because
they are already different on standby and primary. And it is fine to set
and flush hint bits (and LP_DEADs) on standby because they can already be
easily flushed (we just need to consider minRecoveryPoint and, probably,
OldestXmin from the primary in the case of LP_DEAD to make promotion easy).
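The difference between the unsafe and safe scenarios above comes down to whether the torn on-disk image could ever pass checksum verification without an FPI to repair it. A toy model of the hazard (hypothetical C; toy_checksum is an FNV-style stand-in for pg_checksum_page, and the 512-byte sector is an assumed atomic write unit):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192
#define SECTOR    512           /* assumed atomic write unit of the disk */

/* Toy FNV-1a checksum over everything after the 4-byte header slot
 * (illustrative only -- not pg_checksum_page). */
static uint32_t toy_checksum(const uint8_t *page)
{
    uint32_t h = 2166136261u;
    for (int i = 4; i < PAGE_SIZE; i++)
        h = (h ^ page[i]) * 16777619u;
    return h;
}

/* Returns 1 if the on-disk page fails verification after a torn flush. */
static int torn_flush_breaks_checksum(void)
{
    uint8_t  mem[PAGE_SIZE], disk[PAGE_SIZE];
    uint32_t c;

    memset(mem, 0xAA, PAGE_SIZE);
    c = toy_checksum(mem);
    memcpy(mem, &c, 4);
    memcpy(disk, mem, PAGE_SIZE);   /* clean page on disk */

    mem[5000] |= 0x01;              /* set a hint bit in a later sector */
    c = toy_checksum(mem);
    memcpy(mem, &c, 4);             /* checksum recomputed at flush time */

    memcpy(disk, mem, SECTOR);      /* torn flush + crash: sector 0 only */

    /* Header now carries the new checksum, but the hint bit never hit
     * disk, so verification fails. With a preceding FPI, WAL replay
     * would rewrite the whole page and the torn image would be harmless. */
    memcpy(&c, disk, 4);
    return c != toy_checksum(disk);
}
```

This is why a page made dirty only by a hint must have been covered by an FPI since the last checkpoint before it is safe to flush.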

>> And `btree_mask` (and other mask functions) already used for consistency
>> checks to exclude LP_DEAD.
> I don't see how that is relevant. btree_mask() is only used by
> wal_consistency_checking, which is mostly just for Postgres hackers.

I was thinking about the possibility to reuse these functions in masking
during replay.

Thanks,
Michail.


022_checksum_tests.pl
Description: Perl program


Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-02-10 Thread Michail Nikolaev
Hello, everyone.

After some correspondence with Peter Geoghegan (1) and his ideas, I
have reworked the patch a lot and now it is much more simple with even
better performance (no new WAL or conflict resolution, hot standby
feedback is unrelated).

The idea is pretty simple now - let’s mark pages whose LP_DEAD hints are
“standby-safe” with a bit in btpo_flags
(BTP_LP_SAFE_ON_STANDBY and similar for gist and hash).

If the standby wants to set LP_DEAD, it checks BTP_LP_SAFE_ON_STANDBY on
the page first; if it is not set, all “primary” hints are removed
first, and then the flag is set (with a memory barrier to avoid memory
ordering issues in concurrent scans).
Also, the standby checks BTP_LP_SAFE_ON_STANDBY to be sure it is safe to
skip tuples marked LP_DEAD during a scan.
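The flag check/clear/set sequence described above can be sketched like this (hypothetical, simplified C using C11 atomics in place of Postgres' barrier primitives; ToyPage and the function names are illustrative, not the real page layout):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_ITEMS 256

/* Illustrative page image: per-item LP_DEAD bits plus a "safe on standby"
 * flag; names mimic the patch but this is not the real page layout. */
typedef struct
{
    bool        lp_dead[MAX_ITEMS];
    atomic_bool lp_safe_on_standby;     /* BTP_LP_SAFE_ON_STANDBY */
} ToyPage;

/* Standby wants to set an LP_DEAD bit. */
static void standby_kill_item(ToyPage *page, int item)
{
    if (!atomic_load_explicit(&page->lp_safe_on_standby, memory_order_acquire))
    {
        /* Primary-set hints are untrustworthy here: clear them all first. */
        for (int i = 0; i < MAX_ITEMS; i++)
            page->lp_dead[i] = false;
        /* Release store plays the role of the memory barrier: concurrent
         * scans must see the cleared bits before they see the flag. */
        atomic_store_explicit(&page->lp_safe_on_standby, true,
                              memory_order_release);
    }
    page->lp_dead[item] = true;
}

/* Scans on the standby only trust LP_DEAD when the flag is set. */
static bool standby_item_is_dead(const ToyPage *page, int item)
{
    if (!atomic_load_explicit(&page->lp_safe_on_standby, memory_order_acquire))
        return false;
    return page->lp_dead[item];
}
```

In the real patch the page content lock already serializes writers; the barrier only matters for lock-free readers of the flag.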

Of course, it is not so easy. If standby was promoted (or primary was
restored from standby backup) - it is still possible to receive FPI
with such flag set in WAL logs. So, the main problem is still there.

But we could just clear this flag while applying FPI because the page
remains dirty after that anyway! It should not cause any checksum,
consistency, or pg_rewind issues as explained in (2).
Semantically it is the same as setting the hint bit one millisecond after
the FPI was applied (while the page still remains dirty after FPI replay) -
and the standby already does this with *heap* hint bits.

Also, the TAP-test attached to (2) shows how easy it already is to flush a
hint bit set by the standby and end up with a checksum different from the
primary's.

If standby was promoted (or restored from standby backup) it is safe
to use LP_DEAD with or without BTP_LP_SAFE_ON_STANDBY on a page. But
for accuracy BTP_LP_SAFE_ON_STANDBY is cleared by primary if found.

Also, we should take into account minRecoveryPoint as described in (3)
to avoid consistency issues during crash recovery (see
IsIndexLpDeadAllowed).

Also, as far as I know, there is no practical reason to keep
minRecoveryPoint at a low value. So, there is an optional patch that
moves minRecoveryPoint forward at each xl_running_xacts record (to allow
the standby to set hint bits and LP_DEADs more aggressively). That happens
about every 15s.

There are some graphics showing performance testing results on my PC
in the attachment (test is taken from (4)). Each test was running for
10 minutes.
The additional primary performance is probably just measurement error. But
the standby performance gain is huge.

Feel free to ask if you need more proof about correctness.

Thanks,
Michail.

[1] - 
https://www.postgresql.org/message-id/flat/CAH2-Wz%3D-BoaKgkN-MnKj6hFwO1BOJSA%2ByLMMO%2BLRZK932fNUXA%40mail.gmail.com#6d7cdebd68069cc493c11b9732fd2040
[2] - 
https://www.postgresql.org/message-id/flat/CANtu0oiAtteJ%2BMpPonBg6WfEsJCKrxuLK15P6GsaGDcYGjefVQ%40mail.gmail.com#091fca433185504f2818d5364819f7a4
[3] - 
https://www.postgresql.org/message-id/flat/CANtu0oh28mX5gy5jburH%2Bn1mcczK5_dCQnhbBnCM%3DPfqh-A26Q%40mail.gmail.com#ecfe5a331a3058f895c0cba698fbc4d3
[4] - 
https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 39a30c00f7..7ebfa0d1a7 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1070,6 +1070,12 @@ standby_redo(XLogReaderState *record)
 		running.xids = xlrec->xids;
 
 		ProcArrayApplyRecoveryInfo(&running);
+		if (InHotStandby)
+		{
+			/* Move minRecoveryPoint forward to allow the standby to set
+			 * hint bits and index LP_DEAD more aggressively. */
+			XLogFlush(record->currRecPtr);
+		}
 	}
 	else if (info == XLOG_INVALIDATIONS)
 	{
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 92205325fb..14e547ee6b 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -653,17 +653,23 @@ lax about how same-level locks are acquired during recovery (most kinds
 of readers could still move right to recover if we didn't couple
 same-level locks), but we prefer to be conservative here.
 
-During recovery all index scans start with ignore_killed_tuples = false
-and we never set kill_prior_tuple. We do this because the oldest xmin
-on the standby server can be older than the oldest xmin on the primary
-server, which means tuples can be marked LP_DEAD even when they are
-still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
-they can still appear in the standby because of full page writes. So
-we must always ignore them in standby, and that means it's not worth
-setting them either.  (When LP_DEAD-marked tuples are eventually deleted
-on the primary, the deletion is WAL-logged.  Queries that run on a
-standby therefore get much of the benefit of any LP_DEAD setting that
-takes place on the primary.)
+There is some complexity in using LP_DEAD bits during recovery. Generally,
+bits can be set and read by scans, but it is also possible to encounter
+the bit applied on the primary. We don't WAL log tuple L

Re: Thoughts on "killed tuples" index hint bits support on standby

2021-02-10 Thread Michail Nikolaev
Hello, Peter.

If you are interested, the possible patch (based on FPI mask during
replay) was sent with some additional explanation and graphics to (1).
At the moment I unable to find any "incorrectness" in it.

Thanks again for your comments.

Michail.


[1] 
https://www.postgresql.org/message-id/flat/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com#4c81a4d623d8152f5e8889e97e750eec




Re: Stream replication test fails of cfbot/windows server 2019

2022-01-12 Thread Michail Nikolaev
Hello.

Looks like logical replication also affected:

[08:26:35.599] # poll_query_until timed out executing this query:
[08:26:35.599] # SELECT count(1) = 0 FROM pg_subscription_rel WHERE
srsubstate NOT IN ('r', 's');
[08:26:35.599] # expecting this output:
[08:26:35.599] # t
[08:26:35.599] # last actual query output:
[08:26:35.599] # f

https://cirrus-ci.com/task/6532060239101952
https://cirrus-ci.com/task/471606276096

Best regards,
Michail.




Re: Windows vs recovery tests

2022-01-12 Thread Michail Nikolaev
Hello.

It could also be related -
https://www.postgresql.org/message-id/flat/20220112112425.pgzymqcgdy62e7m3%40jrouhaud#097b54a539ac3091ca4e4ed8ce9ab89c
(both Windows and Linux cases).

Best regards,
Michail.


Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-01-15 Thread Michail Nikolaev
Hello, Junien.

Thanks for your attention.

> The cfbot reports that this patch is currently failing at least on
> Linux and Windows, e.g. https://cirrus-ci.com/task/6532060239101952.

Fixed. It was an issue with the test - it sometimes hangs on Windows
because of psql + a spurious vacuum.

> I'm switching this patch on Waiting on Author.

I have tested it multiple times in my GitHub repo; it seems to be stable now.
Switching back to Ready for committer.

Best regards.
Michail.
From 9372bac9b56d27cf993e9d1fa66127c86b51f25c Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Sat, 15 Jan 2022 16:21:51 +0300
Subject: [PATCH v7 1/3] code

---
 src/backend/access/common/bufmask.c  | 25 
 src/backend/access/gist/gistget.c| 43 +++--
 src/backend/access/gist/gistxlog.c   | 15 +
 src/backend/access/hash/hash.c   |  4 +-
 src/backend/access/hash/hash_xlog.c  | 17 +
 src/backend/access/hash/hashsearch.c | 18 --
 src/backend/access/hash/hashutil.c   | 33 +-
 src/backend/access/heap/heapam.c | 42 +---
 src/backend/access/heap/heapam_handler.c |  5 +-
 src/backend/access/index/genam.c | 20 +++---
 src/backend/access/index/indexam.c   | 81 +---
 src/backend/access/nbtree/nbtinsert.c| 22 +--
 src/backend/access/nbtree/nbtree.c   |  4 +-
 src/backend/access/nbtree/nbtsearch.c| 14 +++-
 src/backend/access/nbtree/nbtutils.c | 33 +-
 src/backend/access/nbtree/nbtxlog.c  | 16 +
 src/backend/access/table/tableam.c   |  4 +-
 src/backend/access/transam/rmgr.c|  4 +-
 src/backend/access/transam/xlogutils.c   |  6 ++
 src/backend/storage/ipc/standby.c|  6 ++
 src/bin/pg_rewind/parsexlog.c|  2 +-
 src/bin/pg_waldump/rmgrdesc.c|  2 +-
 src/include/access/bufmask.h |  1 +
 src/include/access/gist.h|  5 ++
 src/include/access/gistxlog.h|  1 +
 src/include/access/hash.h|  2 +
 src/include/access/hash_xlog.h   |  1 +
 src/include/access/heapam.h  |  2 +-
 src/include/access/nbtree.h  |  2 +
 src/include/access/nbtxlog.h |  1 +
 src/include/access/relscan.h | 15 -
 src/include/access/rmgr.h|  2 +-
 src/include/access/rmgrlist.h| 46 +++---
 src/include/access/tableam.h | 14 ++--
 src/include/access/xlog_internal.h   |  4 ++
 35 files changed, 422 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/common/bufmask.c b/src/backend/access/common/bufmask.c
index 4e953bfd61..22026482ad 100644
--- a/src/backend/access/common/bufmask.c
+++ b/src/backend/access/common/bufmask.c
@@ -128,3 +128,28 @@ mask_page_content(Page page)
 	memset(&((PageHeader) page)->pd_upper, MASK_MARKER,
 		   sizeof(uint16));
 }
+
+/*
+ * mask_lp_dead
+ *
+ * In some index AMs, line pointer flags can be modified without emitting any
+ * WAL record. Sometimes it is required to mask LP_DEAD flags set on the
+ * primary so that the standby can set its own values.
+ */
+void
+mask_lp_dead(Page page)
+{
+	OffsetNumber offnum,
+ maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+
+		if (ItemIdHasStorage(itemId) && ItemIdIsDead(itemId))
+			itemId->lp_flags = LP_NORMAL;
+	}
+}
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index adbf622c83..1905c04c51 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -14,6 +14,7 @@
  */
 #include "postgres.h"
 
+#include "access/bufmask.h"
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/relscan.h"
@@ -49,6 +50,7 @@ gistkillitems(IndexScanDesc scan)
 	Assert(so->curBlkno != InvalidBlockNumber);
 	Assert(!XLogRecPtrIsInvalid(so->curPageLSN));
 	Assert(so->killedItems != NULL);
+	Assert(so->numKilled > 0);
 
 	buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
 	if (!BufferIsValid(buffer))
@@ -62,8 +64,13 @@ gistkillitems(IndexScanDesc scan)
 	 * If page LSN differs it means that the page was modified since the last
 	 * read. killedItems could be not valid so LP_DEAD hints applying is not
 	 * safe.
+	 *
+	 * Another case - the standby was promoted after the start of the current
+	 * transaction. Skipping is not required for correctness here, but it is
+	 * better to just skip everything.
 	 */
-	if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
+	if ((BufferGetLSNAtomic(buffer) != so->curPageLSN) ||
+			(scan->xactStartedInRecovery && !RecoveryInProgress()))
 	{
 		UnlockReleaseBuffer(buffer);
 		so->numKilled = 0;		/* reset counter */
@@ -71,6 +78,20 @@ gistkillitems(IndexScanDesc scan)
 	}
 
 	Assert(GistPageIsLeaf(page));

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

2020-08-02 Thread Michail Nikolaev
Hello, Peter.

> Attached is a revised version of your patch
Thanks for your work, the patch is looking better now.

Michail.




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-11-05 Thread Michail Nikolaev
Hello, Antonin.

Thanks for pushing it forward.

> I understand that the RR snapshot is used to check the MVCC behaviour, however
> this comment seems to indicate that the RR snapshot should also prevent the
> standby from setting the hint bits.
> # Make sure previous queries not set the hints on standby because
> # of RR snapshot
> I can imagine that on the primary, but I don't think that the backend that
> checks visibility on standby does checks other snapshots/backends. And it
> didn't work when I ran the test manually, although I could have missed
> something.

Yes, it does check - see ComputeXidHorizons for details. It is
central to the correctness of the whole feature. I added some
details about it to the test.

> * 026_standby_index_lp_dead.pl should probably be renamed to
>  027_standby_index_lp_dead.pl (026_* was created in the master branch
>  recently)

Done.

> BEGIN failed--compilation aborted at t/026_standby_index_lp_dead.pl line 5.
> t/026_standby_index_lp_dead.pl .. Dubious, test returned 2 (wstat 512, 0x200)

Fixed.

> * The messages like this

Fixed.

> * There's an extra colon in mask_lp_dead():

Oh, it is a huge error really (the loop was empty) :) Fixed.

> * the header comment of heap_hot_search_buffer() still says "*all_dead"
>  whereas I'd expect "->all_dead".
>  The same for "*page_lsn".

I was trying to mimic the existing style of the comment (it has said “*tid”
since 2007). So, I think it is better to keep the whole function comment in
the same style.

> * I can see no test for the INDEX_LP_DEAD_OK_MIN_LSN value of the
>   IndexLpDeadAllowedResult enumeration. Shouldn't there be only two values,
>   e.g. INDEX_LP_DEAD_OK and INDEX_LP_DEAD_MAYBE_OK ? Or a boolean variable (in
>   index_fetch_heap()) of the appropriate name, e.g. kill_maybe_allowed, and
>   rename the function is_index_lp_dead_allowed() to
>   is_index_lp_dead_maybe_allowed()?

Yes, this way it is looks better. Done. Also, I have added some checks
for “maybe” LSN-related logic to the test.

Thanks a lot,
Michail.
From f8a87a2329e81b55b484547dd50edfd97a722ad2 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Fri, 5 Nov 2021 19:28:12 +0300
Subject: [PATCH v5 3/3] doc

---
 src/backend/access/nbtree/README | 35 ++--
 src/backend/storage/page/README  |  8 +---
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 2a7332d07c..67b3e38ace 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -714,17 +714,30 @@ lax about how same-level locks are acquired during recovery (most kinds
 of readers could still move right to recover if we didn't couple
 same-level locks), but we prefer to be conservative here.
 
-During recovery all index scans start with ignore_killed_tuples = false
-and we never set kill_prior_tuple. We do this because the oldest xmin
-on the standby server can be older than the oldest xmin on the primary
-server, which means tuples can be marked LP_DEAD even when they are
-still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
-they can still appear in the standby because of full page writes. So
-we must always ignore them in standby, and that means it's not worth
-setting them either.  (When LP_DEAD-marked tuples are eventually deleted
-on the primary, the deletion is WAL-logged.  Queries that run on a
-standby therefore get much of the benefit of any LP_DEAD setting that
-takes place on the primary.)
+There is some complexity in using LP_DEAD bits during recovery. Generally,
+bits can be set and read by scans, but it is also possible to encounter
+a bit applied on the primary. We don't WAL log tuple LP_DEAD bits, but
+they can still appear on the standby because of full-page writes. This
+could cause MVCC failures because the oldest xmin on the standby
+server can be older than the oldest xmin on the primary server, which means
+tuples can be marked LP_DEAD even when they are still visible on the standby.
+
+To prevent such failures, we mark pages whose LP_DEAD bits were set by the
+standby with a special hint. In the case of an FPW from the primary the hint is
+always cleared while applying the full-page write, so LP_DEAD bits received
+from the primary are ignored on the standby. Also, the standby clears all
+LP_DEAD bits set by the primary on the page before setting its own bits.
+
+There are restrictions on setting LP_DEAD bits by the standby related to the
+minRecoveryPoint value. In case of crash recovery the standby will start to
+process queries after replaying WAL up to the minRecoveryPoint position (a kind
+of rewind to the previous state). At the same time, setting LP_DEAD bits is not
+protected by WAL in any way. So, to mark a tuple as dead we must be sure it was
+"killed" before minRecoveryPoint

Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-11-09 Thread Michail Nikolaev
I have changed approach, so it is better to start from this email:
https://www.postgresql.org/message-id/flat/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com#4c81a4d623d8152f5e8889e97e750eec

Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-11-09 Thread Michail Nikolaev
Woo-hoo :)

> Attached is a proposal for a minor addition that would make sense to me, add
> it if you think it's appropriate.

Yes, I'll add to the patch.

> I think I've said enough, changing the status to "ready for committer" :-)

Thanks a lot for your help and attention!

Best regards,
Michail.




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-11-09 Thread Michail Nikolaev
Hello.

> Attached is a proposal for a minor addition that would make sense to me, add
> it if you think it's appropriate.

Added. Also, I updated the documentation a little.

> I have changed approach, so it is better to start from this email:

Oops, I was thinking the comments feature in the commitfest app works
in a different way :)

Best regards,
Michail.
From 02b0dd27944c37007d8a92905a14e6b3e8e50fa8 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Tue, 9 Nov 2021 21:43:58 +0300
Subject: [PATCH v6 2/3] test

---
 src/test/recovery/Makefile|   1 +
 .../recovery/t/027_standby_index_lp_dead.pl   | 343 ++
 2 files changed, 344 insertions(+)
 create mode 100644 src/test/recovery/t/027_standby_index_lp_dead.pl

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 288c04b861..4049f720f3 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -10,6 +10,7 @@
 #-
 
 EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL+=contrib/pageinspect
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/027_standby_index_lp_dead.pl b/src/test/recovery/t/027_standby_index_lp_dead.pl
new file mode 100644
index 00..2e1a66e13b
--- /dev/null
+++ b/src/test/recovery/t/027_standby_index_lp_dead.pl
@@ -0,0 +1,343 @@
+# Checks that index hints on standby work as expected.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Config;
+
+plan tests => 30;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', qq{
+autovacuum = off
+enable_seqscan = off
+enable_indexonlyscan = off
+checkpoint_timeout = 1h
+});
+$node_primary->start;
+
+$node_primary->safe_psql('postgres', 'CREATE EXTENSION pageinspect');
+# Create test table with primary index
+$node_primary->safe_psql(
+'postgres', 'CREATE TABLE test_table (id int, value int)');
+$node_primary->safe_psql(
+'postgres', 'CREATE INDEX test_index ON test_table (value, id)');
+# Fill it with some data; keep the number of records small to avoid a
+# heap_page_prune_opt call, which would cause a recovery conflict hiding
+# the conflicts caused by index hint bits
+$node_primary->safe_psql('postgres',
+'INSERT INTO test_table VALUES (generate_series(1, 30), 0)');
+# And vacuum to allow index hint bits to be set
+$node_primary->safe_psql('postgres', 'VACUUM test_table');
+# Fail fast in case of an FPW from the primary
+$node_primary->safe_psql('postgres', 'CHECKPOINT');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Restore standby node from the backup
+my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
+$node_standby_1->init_from_backup($node_primary, $backup_name,
+has_streaming => 1);
+
+my $standby_settings = qq{
+max_standby_streaming_delay = 1
+wal_receiver_status_interval = 1
+hot_standby_feedback = off
+enable_seqscan = off
+enable_indexonlyscan = off
+checkpoint_timeout = 1h
+};
+$node_standby_1->append_conf('postgresql.conf', $standby_settings);
+$node_standby_1->start;
+
+$node_standby_1->backup($backup_name);
+
+# Create second standby node linking to standby 1
+my $node_standby_2 = PostgreSQL::Test::Cluster->new('standby_2');
+$node_standby_2->init_from_backup($node_standby_1, $backup_name,
+has_streaming => 1);
+$node_standby_2->append_conf('postgresql.conf', $standby_settings);
+$node_standby_2->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(10);
+
+# One psql to run commands at the repeatable read isolation level.
+# It is used to test the xactStartedInRecovery snapshot after promotion.
+# Also, it is used to check that an active snapshot on the standby prevents
+# LP_DEAD from being set (ComputeXidHorizons works on the standby).
+my %psql_standby_repeatable_read = ('stdin' => '', 'stdout' => '', 'stderr' => '');
+$psql_standby_repeatable_read{run} =
+IPC::Run::start(
+[ 'psql', '-XAb', '-f', '-', '-d', $node_standby_1->connstr('postgres') ],
+'<', \$psql_standby_repeatable_read{stdin},
+'>', \$psql_standby_repeatable_rea

Re: Slow standby snapshot

2021-11-09 Thread Michail Nikolaev
Hello, Andrey.

Thanks for your feedback.

> Current patch addresses another problem. In presence of old enough 
> transaction enumeration of KnownAssignedXids with shared lock prevents adding 
> new transactions with exclusive lock. And recovery effectively pauses.

Actually, I see two problems here (caused by the presence of old long
transactions). The first one is lock contention, which causes recovery
pauses. The second one is just high CPU usage on the standby in
KnownAssignedXidsGetAndSetXmin.

> All in all, I think using proposed "KnownAssignedXidsNext" patch solves real 
> problem and the problem of binary searches should be addressed by compressing 
> KnownAssignedXids more often.

I updated the patch a little. KnownAssignedXidsGetAndSetXmin now
causes fewer cache misses because some values are stored in variables
(registers). I think it is better to not lean on the compiler here
because of `volatile` args.
Also, I have added some comments.

Best regards,
Michail.
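The lazily-maintained next-offset idea can be sketched on a toy array (hypothetical, simplified C; the real patch works on KnownAssignedXids/KnownAssignedXidsValid under ProcArrayLock, and the hints only tend toward the true offsets):

```c
#include <stdbool.h>
#include <stdint.h>

#define NXIDS 16

/* Toy model: valid[] marks live entries, next_hint[i] is a hint for how
 * far to jump from slot i to reach the next live entry. Hints start at 1
 * (plain linear scan) and are tightened lazily by each scan. */
static bool     valid[NXIDS];
static uint32_t next_hint[NXIDS];

static void init_arrays(void)
{
    for (int i = 0; i < NXIDS; i++)
    {
        valid[i] = false;
        next_hint[i] = 1;
    }
}

/* Scan [tail, head), collecting live entries and updating the hints so the
 * next scan skips runs of invalidated entries in a single step. */
static int scan_and_tighten(int tail, int head, int *out)
{
    int count = 0;
    int prev = -1;              /* index of the previous live entry seen */

    for (int i = tail; i < head; i += next_hint[i])
    {
        if (valid[i])
        {
            if (prev >= 0 && (uint32_t) (i - prev) > next_hint[prev])
                next_hint[prev] = i - prev;     /* remember the jump */
            prev = i;
            out[count++] = i;
        }
    }
    return count;
}
```

The first scan after a batch of invalidations still walks the gaps once; every later scan jumps live entry to live entry, which is where the win against long-lived gaps comes from.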
From 1d55c6fae8cc160eadd705da0d70d9e7fb5bc00f Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Wed, 10 Nov 2021 00:02:18 +0300
Subject: [PATCH v2] known assignment xid next

---
 src/backend/storage/ipc/procarray.c | 67 -
 1 file changed, 57 insertions(+), 10 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 892f0f6799..422004ad31 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -267,6 +267,7 @@ static PGPROC *allProcs;
  */
 static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
+static pg_atomic_uint32 *KnownAssignedXidsNext;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -405,6 +406,7 @@ void
 CreateSharedProcArray(void)
 {
 	bool		found;
+	int			i;
 
 	/* Create or attach to the ProcArray shared structure */
 	procArray = (ProcArrayStruct *)
@@ -446,6 +448,12 @@ CreateSharedProcArray(void)
 			ShmemInitStruct("KnownAssignedXidsValid",
 			mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
 			&found);
+		KnownAssignedXidsNext = (pg_atomic_uint32 *)
+ShmemInitStruct("KnownAssignedXidsNext",
+mul_size(sizeof(pg_atomic_uint32), TOTAL_MAX_CACHED_SUBXIDS),
+&found);
+		for (i = 0; i < TOTAL_MAX_CACHED_SUBXIDS; i++)
+			pg_atomic_init_u32(&KnownAssignedXidsNext[i], 1);
 	}
 }
 
@@ -4517,7 +4525,13 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * XID entry itself.  This preserves the property that the XID entries are
  * sorted, so we can do binary searches easily.  Periodically we compress
  * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
+ * array immediately on every deletion. Also, we lazily maintain offsets
+ * in the KnownAssignedXidsNext[] array to skip known-invalid xids. It
+ * helps to skip the gaps and could significantly increase performance in
+ * the case of long transactions on the primary. KnownAssignedXidsNext[] is
+ * updated while taking a snapshot. In the general case KnownAssignedXidsNext
+ * does not contain the exact offset to the next valid xid but a value that
+ * converges toward that offset.
  *
  * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
  * are those with indexes tail <= i < head; items outside this subscript range
@@ -4544,7 +4558,10 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * head/tail pointers.  (We could dispense with the spinlock if we were to
  * create suitable memory access barrier primitives and use those instead.)
  * The spinlock must be taken to read or write the head/tail pointers unless
- * the caller holds ProcArrayLock exclusively.
+ * the caller holds ProcArrayLock exclusively. Access to KnownAssignedXidsNext
+ * is not specially protected by any lock because it is just a hint to reduce
+ * the scan cost; at least shared ProcArrayLock is held anyway, since it is
+ * required to access KnownAssignedXids.
  *
  * Algorithmic analysis:
  *
@@ -4555,7 +4572,7 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  *		must happen)
  *	* Compressing the array is O(S) and requires exclusive lock
  *	* Removing an XID is O(logS) and requires exclusive lock
- *	* Taking a snapshot is O(S) and requires shared lock
+ *	* Taking a snapshot is O(S) first, tending to O(N) on later calls; requires shared lock
  *	* Checking for an XID is O(logS) and requires shared lock
  *
  * In comparison, using a hash table for KnownAssignedXids would mean that
@@ -4615,12 +4632,13 @@ KnownAssignedXidsCompress(bool force)
 	 * re-aligning data to 0th element.
 	 */
 	compress_index = 0;
-	for (i = tail; i < head; i++)
+	for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
 	{
 		if (KnownAssignedXidsValid[i])
 		{
 			KnownAssignedXids[compress_index] = KnownAssigne

Re: Slow standby snapshot

2021-11-21 Thread Michail Nikolaev
Hello, Alexander.

Thanks for your review.

> It looks like KnownAssignedXidsNext doesn't have to be
> pg_atomic_uint32.  I see it only gets read with pg_atomic_read_u32()
> and written with pg_atomic_write_u32().  Existing code believes that
> read/write of 32-bit values is atomic.  So, you can use just uint32
> instead of pg_atomic_uint32.

Fixed. Looks better now, yeah.

Also, I added an additional commit (not depending on the
KnownAssignedXidsNext feature) replacing the spinlock with a memory
barrier. The idea goes back to Simon Riggs and Tom Lane in 2010:

> (We could dispense with the spinlock if we were to
> create suitable memory access barrier primitives and use those instead.)

Now it is possible to avoid an additional spinlock on each
KnownAssignedXidsGetAndSetXmin. I have not measured the performance
impact of this particular change yet (and it is probably not easy to
reliably measure an impact of less than 0.5%), but I think it is worth
adding. We need to protect only the head pointer because the tail is
updated only with exclusive ProcArrayLock held. BTW, should I provide a
separate patch for this?

So, now we have a pretty successful benchmark for the typical use-case,
and some additional investigation has been done. So, I think I’ll
re-add the patch to the commitfest app.

Thanks,
Michail
From 94cd2fbf37b5f0b824e0f9a9bc23f762a8bb19b5 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Sun, 21 Nov 2021 21:23:02 +0300
Subject: [PATCH v3 1/2] memory barrier instead of spinlock

---
 src/backend/storage/ipc/procarray.c | 42 +++--
 1 file changed, 9 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index a9945c80eb..da0c4eaa00 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -60,7 +60,6 @@
 #include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
-#include "storage/spin.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
@@ -81,7 +80,6 @@ typedef struct ProcArrayStruct
 	int			numKnownAssignedXids;	/* current # of valid entries */
 	int			tailKnownAssignedXids;	/* index of oldest valid element */
 	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-	slock_t		known_assigned_xids_lck;	/* protects head/tail pointers */
 
 	/*
 	 * Highest subxid that has been removed from KnownAssignedXids array to
@@ -425,7 +423,6 @@ CreateSharedProcArray(void)
 		procArray->numKnownAssignedXids = 0;
 		procArray->tailKnownAssignedXids = 0;
 		procArray->headKnownAssignedXids = 0;
-		SpinLockInit(&procArray->known_assigned_xids_lck);
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
@@ -4553,10 +4550,9 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * pointer.  This wouldn't require any lock at all, except that on machines
  * with weak memory ordering we need to be careful that other processors
  * see the array element changes before they see the head pointer change.
- * We handle this by using a spinlock to protect reads and writes of the
- * head/tail pointers.  (We could dispense with the spinlock if we were to
- * create suitable memory access barrier primitives and use those instead.)
- * The spinlock must be taken to read or write the head/tail pointers unless
+ * We handle this by using a memory barrier to protect writes of the
+ * head pointer.
+ * The memory barrier is issued before writing the head pointer unless
  * the caller holds ProcArrayLock exclusively.
  *
  * Algorithmic analysis:
@@ -4600,7 +4596,6 @@ KnownAssignedXidsCompress(bool force)
 	int			compress_index;
 	int			i;
 
-	/* no spinlock required since we hold ProcArrayLock exclusively */
 	head = pArray->headKnownAssignedXids;
 	tail = pArray->tailKnownAssignedXids;
 
@@ -4686,7 +4681,7 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 
 	/*
 	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
+	 * can safely read them here.
 	 */
 	head = pArray->headKnownAssignedXids;
 	tail = pArray->tailKnownAssignedXids;
@@ -4744,21 +4739,20 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 	pArray->numKnownAssignedXids += nxids;
 
 	/*
-	 * Now update the head pointer.  We use a spinlock to protect this
+	 * Now update the head pointer.  We use a memory barrier to protect this
 	 * pointer, not because the update is likely to be non-atomic, but to
 	 * ensure that other processors see the above array updates before they
 	 * see the head pointer change.
 	 *
 	 * If we're holding ProcArrayLock exclusively, there's no need to take the
-	 * spinlock.
+	 * barrier.
 	 */
 	if

Re: Slow standby snapshot

2021-11-22 Thread Michail Nikolaev
Hello, Andrey.

> Write barrier must be issued after write, not before.
> Don't we need to issue read barrier too?

The write barrier is issued after the changes to KnownAssignedXidsNext
and KnownAssignedXidsValid arrays and before the update of
headKnownAssignedXids.
So, it seems to be correct. We make sure that once a CPU sees the change to
headKnownAssignedXids, the underlying arrays already contain all the required
data.

Thanks,
Michail.




Re: Optimize single tuple fetch from nbtree index

2019-08-04 Thread Michail Nikolaev
Hello everyone.

I was also looking into the possibility of such an optimisation a few days ago
(an attempt to reduce memcpy overhead in IndexOnlyScan).

One thing I noticed here - the whole page is scanned only if the index quals
are "open" on one side.

So, in case of
SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC
LIMIT 1;
the whole index page will be read.

But
SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp AND ts>=:timestamp -
:interval ORDER BY k, ts DESC LIMIT 1;
is semantically the same, but only the few records within :interval will be
processed.

So, you could try comparing such a query in your benchmarks.

Also, some info about current design is contained in
src\backend\access\nbtree\README ("To minimize lock/unlock traffic, an
index scan always searches a leaf page
to identify all the matching items at once").

Thanks,
 Michail.


[PATCH] fix for C4141 warning on MSVC

2018-01-23 Thread Michail Nikolaev
Hello.

Just very small fix for C4141 warning (
https://docs.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-1-c4141
).

Also could be viewed on Github -
https://github.com/michail-nikolaev/postgres/commit/38a590a00110a4ea870d625470e4c898e5ad79aa

Tested both MSVC and gcc builds.

Thanks.


C4141.patch
Description: Binary data


[WIP PATCH] Index scan offset optimisation using visibility map

2018-01-31 Thread Michail Nikolaev
Hello.

WIP-Patch for optimisation of OFFSET + IndexScan using visibility map.
Patch based on idea of Maxim Boguk [1] with some inspiration from Douglas
Doole [2].
-
Everyone knows that using OFFSET (especially a big one) is not good practice.
But in reality it is widely used, mostly for paging (because it is simple).

A typical situation is some table (for example, tickets) with indexes used
for paging/sorting:

VACUUM FULL;
VACUUM ANALYZE ticket;
SET work_mem = '512MB';
SET random_page_cost = 1.0;

CREATE TABLE ticket AS
SELECT
id,
TRUNC(RANDOM() * 100 + 1) AS project_id,
NOW() + (RANDOM() * (NOW()+'365 days' - NOW())) AS created_date,
repeat((TRUNC(RANDOM() * 100 + 1)::text), 1000) as payload
FROM GENERATE_SERIES(1, 100) AS g(id);

CREATE INDEX simple_index ON ticket using btree(project_id, created_date);

And a typical query doing OFFSET on tickets of some project, with paging,
some filtering (based on the index) and sorting:

SELECT * FROM ticket
WHERE project_id = ?
AND created_date > '20.06.2017'
ORDER BY created_date offset 500 limit 100;

At the moment the IndexScan node is required to do 600 heap
fetches to execute the query.
But the first 500 of them are just dropped by the nodeLimit.

The idea of the patch is to push down offset information in
ExecSetTupleBound (as is done for top-N sort) to IndexScan in the case
of a simple scan (without projection, reordering or quals). In such a
situation we could use a kind of index-only scan
(even better, because we don't need the index data) to avoid fetching tuples
that are just thrown away by nodeLimit.

Patch is also available on GitHub:
https://github.com/michail-nikolaev/postgres/commit/a368c3483250e4c02046d418a27091678cb963f4?diff=split
And some test here:
https://gist.github.com/michail-nikolaev/b7cbe1d6f463788407ebcaec8917d1e0

So, at the moment everything seems to work (check-world is ok too) and I
got the following results for the test ticket table:
| offset | master | patch
| 100 | ~1.3ms | ~0.7ms
| 1000 | ~5.6ms | ~1.1ms
| 1 | ~46.7ms | ~3.6ms

To continue development I have following questions:
0) Maybe I missed something huge...
1) Is it required to support non-MVCC (dirty) snapshots? They are not
supported for IndexOnlyScan - not sure about IndexScan.
2) Should I try to pass information about such an optimisation to the
planner/optimizer? It is not too easy with the current design but seems
possible.
3) If so, should I add something to EXPLAIN?
4) If so, should I add some counters to EXPLAIN ANALYZE? (for example,
the number of heap fetches avoided).
5) Should I add a description of the optimisation to
https://www.postgresql.org/docs/10/static/queries-limit.html ?
6) Maybe you have some ideas for additional tests I need to add.

Thanks a lot.

[1]
https://www.postgresql.org/message-id/CAK-MWwQpZobHfuTtHj9%2B9G%2B5%3Dck%2BaX-ANWHtBK_0_D_qHYxWuw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/CADE5jYLuugnEEUsyW6Q_4mZFYTxHxaVCQmGAsF0yiY8ZDggi-w%40mail.gmail.com
*** a/src/backend/executor/execParallel.c
--- b/src/backend/executor/execParallel.c
***
*** 1303,1310  ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
  	pwcxt.seg = seg;
  	ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);
  
! 	/* Pass down any tuple bound */
! 	ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
  
  	/*
  	 * Run the plan.  If we specified a tuple bound, be careful not to demand
--- 1303,1310 
  	pwcxt.seg = seg;
  	ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);
  
! 	/* Pass down any tuple bound. Offset cannot be optimized because of parallel execution. */
! 	ExecSetTupleBound(fpes->tuples_needed, 0, queryDesc->planstate);
  
  	/*
  	 * Run the plan.  If we specified a tuple bound, be careful not to demand
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***
*** 785,793  ExecShutdownNode(PlanState *node)
   *
   * Set a tuple bound for a planstate node.  This lets child plan nodes
   * optimize based on the knowledge that the maximum number of tuples that
!  * their parent will demand is limited.  The tuple bound for a node may
!  * only be changed between scans (i.e., after node initialization or just
!  * before an ExecReScan call).
   *
   * Any negative tuples_needed value means "no limit", which should be the
   * default assumption when this is not called at all for a particular node.
--- 785,794 
   *
   * Set a tuple bound for a planstate node.  This lets child plan nodes
   * optimize based on the knowledge that the maximum number of tuples that
!  * their parent will demand is limited.  Also, the number of skipped tuples
!  * may be used by child nodes to optimize retrieval of tuples which are
!  * immediately skipped by the parent (nodeLimit).  The tuple bound for a node
!  * may only be changed between scans (i.e., after node initialization or just
!  * before an ExecReScan call).
   *
   * Any negative tuples_needed value means

Re: [WIP PATCH] Index scan offset optimisation using visibility map

2018-02-13 Thread Michail Nikolaev
Hello.

Thanks a lot for review.

Patch updated + rebased on master. check-world is passing.

Still not sure about comment formatting. Have you seen any style guide about
it except "strict ANSI C comment formatting"? Anyway, I still need to work on
the comments.

Also, non-MVCC snapshots are now supported.

Github is updated too.
https://github.com/postgres/postgres/compare/master...michail-nikolaev:offset_index_only

Still not sure about questions 0, 2, 3, 4, 5, and 6 from initial mail
(about explain, explain analyze, documentation and optimiser).

Thanks.

Mon, Feb 5, 2018 at 23:36, Andrey Borodin :

> Hi, Michail!
>
> Thanks for the patch!
>
> > On Feb 1, 2018, at 1:17, Michail Nikolaev 
> wrote:
> >
> > Hello.
> >
> > WIP-Patch for optimisation of OFFSET + IndexScan using visibility map.
>
> While the patch seems to me useful improvement, I see few problems with
> code:
> 1. Both branches of "if (node->iss_tuples_skipped_remaning != 0)" seem too
> similar. There are a lot of duplicated comments etc. I think that this
> branch should be refactored to avoid code duplication.
> 2. Most of comments are formatted not per project style.
>
> Besides this, patch looks good. Please, add it to the following commitfest
> so that work on the patch could be tracked.
>
> Best regards, Andrey Borodin.


offset_index_only_v2.patch
Description: Binary data


Thoughts on "killed tuples" index hint bits support on standby

2020-01-16 Thread Michail Nikolaev
Hello, hackers.

Currently hint bits in the index pages (dead tuples) are set and taken
into account only on the primary server. The standby just ignores them. This
is done for good reasons, of course (see RelationGetIndexScan and [1]):

* We do this because the xmin on the primary node could easily be
* later than the xmin on the standby node, so that what the primary
* thinks is killed is supposed to be visible on standby. So for correct
* MVCC for queries during recovery we must ignore these hints and check
* all tuples.

Also, according to [2] and cases like [3] it seems to be a good idea to
support "ignore_killed_tuples" on standby.

I hope I know the way to support it correctly with a reasonable amount of changes.

The first thing we need to consider is that checksums and wal_log_hints are
widely used these days. So, at any moment the master could send an FPW page
with new "killed tuples" hints and overwrite hints set by the standby.
Moreover, it is not possible to distinguish whether hints were set by the
primary or the standby.

And this is where hot_standby_feedback comes into play. The master node
considers the xmin of hot_standby_feedback replicas (RecentGlobalXmin)
while setting "killed tuples" bits. So, if hot_standby_feedback has been
enabled on a standby for a while - it can safely trust hint bits from
the master.
Also, the standby could set its own hints using the xmin it sends to the
primary during feedback (but without marking the page as dirty).

Of course all is not so easy; there are a few things and corner cases
to care about:
* Looks like RecentGlobalXmin could be moved backwards in case a new
replica with a lower xmin is connected (or by switching some replica to
hot_standby_feedback=on). We must ensure RecentGlobalXmin moves
strictly forward.
* hot_standby_feedback could be enabled on the fly. In such a case we
need to distinguish transactions which are safe or unsafe to deal with
hints. The standby could receive a fresh RecentGlobalXmin as a response to
the feedback message. All standby transactions with xmin >=
RecentGlobalXmin are safe to use hints.
* hot_standby_feedback could be disabled on the fly. In such a situation
the standby needs to continue sending feedback while canceling all queries
with ignore_killed_tuples=true. Once all such queries are canceled,
feedback is no longer needed and should be disabled.

Could someone validate my thoughts please? If the idea is mostly
correct - I could try to implement and test it.

[1] - 
https://www.postgresql.org/message-id/flat/7067.1529246768%40sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38
[2] - 
https://www.postgresql.org/message-id/flat/12843.1529331619%40sss.pgh.pa.us#6df9694fdfd5d550fbb38e711d162be8
[3] - 
https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org




Re: Thoughts on "killed tuples" index hint bits support on standby

2020-01-24 Thread Michail Nikolaev
Hello again.

Andres, Peter, thanks for your comments.

Some of the issues you mentioned (reporting feedback to another
cascading standby, processing queries after a restart when a newer xid
was already reported) could be fixed in the provided design, but your
intention to have an "independent correctness backstop" is the right thing
to do.

So, I was thinking about another approach which is:
* still not too tricky to implement
* easy to understand
* does not rely on hot_standby_feedback for correctness, but only for efficiency
* could be used with any kind of index
* does not generate a lot of WAL

Let's add a new type of WAL record like "some index killed tuple hint
bits are set according to RecentGlobalXmin=x" (without specifying the page
or even the relation). Let's call 'x' 'LastKilledIndexTuplesXmin' and
track it in standby memory. It is sent only in the case of
wal_log_hints=true. If the hints cause an FPW, it is sent before the FPW
record. Also, it is not required to write such WAL every time the primary
marks an index tuple as dead. It should be done only when
'LastKilledIndexTuplesXmin' changes (moves forward).

On the standby such a record is used to cancel queries. If a transaction is
executed with "ignore_killed_tuples==true" (set at snapshot creation)
and its xid is less than the received LastKilledIndexTuplesXmin, just
cancel the query (because it could rely on an invalid hint bit). So,
technically it would be correct to use hints received from the master to
skip tuples according to MVCC, but "the conflict rate goes through the
roof".

To avoid any real conflicts the standby sets
ignore_killed_tuples = (hot_standby_feedback is on)
    AND (wal_log_hints is on on the primary)
    AND (standby's new snapshot xid >= last LastKilledIndexTuplesXmin received)
    AND (hot_standby_feedback is reported directly to the master).

So, hot_standby_feedback loop effectively eliminates any conflicts
(because LastKilledIndexTuplesXmin is technically RecentGlobalXmin in
such case). But if feedback is broken for some reason - query
cancellation logic will keep everything safe.

For correctness LastKilledIndexTuplesXmin (and as a consequence
RecentGlobalXmin) should be moved only forward.

To set killed bits on the standby we should check tuple visibility
according to the last LastKilledIndexTuplesXmin received. This is just how
the master sets these bits according to its state - so it is even safe to
transfer them to another standby.

Does it look better now?

Thanks, Michail.




Re: Contention preventing locking

2018-02-16 Thread Michail Nikolaev
Hello.

Just want to note - this work also correlates with the paper referenced in
https://www.postgresql.org/message-id/CAEepm%3D18buPTwNWKZMrAXLqja1Tvezw6sgFJKPQ%2BsFFTuwM0bQ%40mail.gmail.com
It provides a more general way to address the issue compared to individual
optimisations (but those could do the job too, of course).

Thanks,
Michail.


Thu, Feb 15, 2018 at 19:00, Konstantin Knizhnik :

> Hi,
>
> PostgreSQL performance degrades significantly in case of high contention.
> You can look at the attached YCSB results (ycsb-zipf-pool.png) to
> estimate the level of this degradation.
>
> Postgres acquires two kinds of heavyweight locks during an update: it
> locks the TID of the updated tuple and the XID of the transaction that created this version.
> In debugger I see the following picture: if several transactions are
> trying to update the same record, then first of all they compete for
> SnapshotDirty.xmax transaction lock in EvalPlanQualFetch.
> Then in heap_tuple_update they are trying to lock TID of the updated
> tuple: heap_acquire_tuplock in heapam.c
> After that transactions wait completion of the transaction updated the
> tuple: XactLockTableWait in heapam.c
>
> So in heap_acquire_tuplock all competing transactions are waiting for
> the TID of the updated version. When the transaction which changed this tuple
> is committed, one of the competitors will be granted this lock and proceed,
> creating a new version of the tuple. Then all other competitors will be
> awakened and ... find out that the locked tuple is not the last version any
> more. Then they will locate the new version and try to lock it... The more
> competitors we have, the more attempts they all have to perform
> (quadratic complexity).
>
> My idea was that we can avoid such performance degradation if we somehow
> queue competing transactions so that they will get control one-by-one,
> without doing useless work.
> First of all I tried to replace the TID lock with a PK (primary key) lock.
> Unlike TID, the PK of a record is not changed during a hot update. The second
> idea is that instead of immediately releasing the lock after the update we can
> hold it until the end of the transaction. And this optimization really
> improves performance - it corresponds to the pg_f1_opt configuration at
> the attached diagram.
> For some workloads it provides up to a two times improvement compared
> with vanilla Postgres. But there are many problems with a correct
> (non-prototype) implementation of this approach:
> handling different types of PK, including compound keys, PK updates,...
>
> Another approach,  which I have recently implemented, is much simpler
> and address another lock kind: transaction lock.
> The idea o this approach is mostly the same: all competing transaction
> are waiting for transaction which is changing the tuple.
> Then one of them is given a chance to proceed and now all other
> transactions are waiting for this transaction and so on:
>
> T1<-T2,T3,T4,T5,T6,T7,T8,T9
> T2<-T3,T4,T5,T6,T7,T8,T9
> T3<-T4,T5,T6,T7,T8,T9
> ...
>
> <- here corresponds to "wait for" dependency between transactions.
>
> If we change this picture to
>
> T1<-T2, T2<-T3, T3<-T4, T4<-T5, T5<-T6, T6<-T7, T7<-T8, T8<-T9
>
> then number of lock requests can be significantly reduced.
> And it can be implemented using very simple patch (attached xlock.patch).
> I hope that I have not done something incorrect here.
> The effect of this simple patch is even larger: more than 3 times for
> 50..70 clients.
> Please look at the attached xlock.pdf spreadsheet.
>
> Unfortunately the combination of these two approaches gives worse results
> than just the single xlock.patch.
> Certainly this patch also requires further improvement; for example, it
> will not work correctly in the case of transactions aborting due to deadlock
> or some other reasons.
> I just want to know the opinion of the community on whether the proposed
> approaches to reduce contention are really promising and it makes sense to
> continue working in this direction.
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: [HACKERS] Can ICU be used for a database's default sort order?

2018-02-18 Thread Michail Nikolaev
Hello.

Just want to inform:
I have run
check,installcheck,plcheck,contribcheck,modulescheck,ecpgcheck,isolationcheck,upgradecheck
tests on Windows 10, VC2017 with patch applied on top of
2a41507dab0f293ff241fe8ae326065998668af8 as Andrey asked me.

Everything is passing with and without $config->{icu} =
'D:\Dev\postgres\icu\';

Best regards,
Michail.


Fri, Feb 16, 2018 at 11:13, Andrey Borodin :

> Hi everyone!
>
> > On Feb 10, 2018, at 20:45, Andrey Borodin 
> wrote:
> >
> > I'm planning to provide review
> >
>
> So, I was looking into the patch.
> The patch adds:
> 1. Ability to specify collation provider (with version) in --locale for
> initdb and createdb.
> 2. Changes to locale checks
> 3. Sets ICU as default collation provider. For example
> "ru_RU@icu.153.80.32.1" is default on my machine with patch
> 4. Tests and necessary changes to documentation
>
> With patch I get correct ICU ordering by default
> postgres=# select unnest(array['е','ё','ж']) order by 1;
>  unnest
> 
>  е
>  ё
>  ж
> (3 rows)
>
> While libc locale provides incorrect order (I also get same ordering by
> default without patch)
>
> postgres=# select c from unnest(array['е','ё','ж']) c order by c collate
> "ru_RU";
>  c
> ---
>  е
>  ж
>  ё
> (3 rows)
>
>
> Unfortunately, neither "ru_RU@icu.153.80.32.1" (exposed by LC_COLLATE and
> other places) nor "ru_RU@icu" can be used in the COLLATE SQL clause.
> Also, patch removes compatibility with MSVC 1800 (Visual Studio 2013) on
> Windows XP and Windows Server 2003. This is done to use newer
> locale-related functions in VS2013 build.
>
> If the database was initialized with default locale without this patch,
> one cannot connect to it anymore
> psql: FATAL:  could not find out the collation provider for datcollate
> "ru_RU.UTF-8" of database "postgres"
> This problem is mentioned in commit message of the patch. I think that
> this problem should be addressed somehow.
> What do you think?
>
> Overall patch looks solid and thoughtful work and adds important
> functionality.
>
> Best regards, Andrey Borodin.
>


Re: New gist vacuum.

2018-02-25 Thread Michail Nikolaev
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Hello.

I have added a small change to the patch to allow it to be compiled using MSVC
(uint64_t -> uint64).
Everything seems to work, check-world is passing.

Actually the patch fixes two issues:
1) Partial GiST indexes now have correct tuple count estimation.
2) Subsequent calls to VACUUM on a GiST index (like "VACUUM table_name") no 
longer change the tuple count to the estimated number of tuples in the table 
(which previously changed even without any updates to the table, due to the 
current implementation).

I think it is fine to commit.

Patch is also available on GitHub:
https://github.com/michail-nikolaev/postgres/commit/ff5171b586e4eb60ea5d15a18055d8ea4e44d244?ts=4

I'll attach patch file next message.

The new status of this patch is: Ready for Committer


Re: New gist vacuum.

2018-02-25 Thread Michail Nikolaev
> I'll attach patch file next message.

Updated patch is attached.


gist-vacuum-count.patch
Description: Binary data


Re: [WIP PATCH] Index scan offset optimisation using visibility map

2018-03-06 Thread Michail Nikolaev
Hello, Andrey.

Thanks for review.

I have updated the comments according to your review and also renamed some
fields for consistency.
Some additional notes were added to the documentation.

Updated patch in attach, github updated too.


offset_index_only_v3.patch
Description: Binary data


Re: [WIP PATCH] Index scan offset optimisation using visibility map

2018-03-10 Thread Michail Nikolaev
Hello.

Andrey, Tels - thanks for review.

> It could be named "SkipTuples" (e.g. this is the number of tuples we need
> to skip, not the number we have skipped), and the other one then
> "iss_SkipTuplesRemaining" so they are consistent with each other.

Agreed, done.

> Also, I think that this check could be removed from the loop. We do not
> expect that its state will change during execution, do we?

Removed.

Patch is attached, github is updated too.

Michail.


offset_index_only_v4.patch
Description: Binary data


Re: [WIP PATCH] Index scan offset optimisation using visibility map

2018-03-12 Thread Michail Nikolaev
Hello.

> Sorry, seems like I've incorrectly expressed what I wanted to say.
> I mean this Assert() can be checked before loop, not on every loop cycle.

Yes, I understood it. The condition should be checked every cycle - at least
it is done this way for index-only scan:
https://github.com/postgres/postgres/blob/master/src/backend/executor/nodeIndexonlyscan.c#L234

But since the original index scan does not contain such a check (probably due to
https://github.com/postgres/postgres/blob/master/src/backend/executor/nodeIndexscan.c#L552)
- I think it can be removed.

Michail.


Re: [WIP PATCH] Index scan offset optimisation using visibility map

2018-03-13 Thread Michail Nikolaev
Hello.

Tom, thanks a lot for your thorough review.

> What you've done to
> IndexNext() is a complete disaster from a modularity standpoint: it now
> knows all about the interactions between index_getnext, index_getnext_tid,
> and index_fetch_heap.

I was looking into the current IndexOnlyNext implementation as a starting
point - and it knows about index_getnext_tid and index_fetch_heap already.
At the same time I was trying to keep the patch non-invasive.
The patched IndexNext now only knows about index_getnext_tid and
index_fetch_heap - the same as IndexOnlyNext.
But yes, it probably could be done better.

> I'm not sure about a nicer way to refactor that, but I'd suggest that
> maybe you need an additional function in indexam.c that hides all this
> knowledge about the internal behavior of an IndexScanDesc.

I'll try to split index_getnext into two functions. A new one will do
everything index_getnext does except index_fetch_heap.

> Or that is, it knows too much and still not enough,
> because it's flat out wrong for the case that xs_continue_hot is set.
> You can't call index_getnext_tid when that's still true from last time.

Oh.. Yes, clear error here.

> The PredicateLockPage call also troubles me quite a bit, not only from
> a modularity standpoint but because that implies a somewhat-user-visible
> behavioral change when this optimization activates. People who are using
> serializable mode are not going to find it to be an improvement if their
> queries fail and need retries a lot more often than they did before.
> I don't know if that problem is bad enough that we should disable skipping
> when serializable mode is active, but it's something to think about.

Current IndexOnlyScan already does that. And I think a user should expect
such a change in serializable mode.

> You haven't documented the behavior required for tuple-skipping in any
> meaningful fashion, particularly not the expectation that the child plan
> node will still return tuples that just need not contain any valid
> content.

Only nodeLimit could receive such tuples, and they are immediately
discarded. I'll add a comment about it.

> is a gross hack and probably wrong. You could use ExecStoreAllNullTuple,
> perhaps.

Oh, nice, missed it.

> I'm disturbed by the fact that you've not taught the planner about the
> potential cost saving from this, so that it won't have any particular
> reason to pick a regular indexscan over some other plan type when this
> optimization applies. Maybe there's no practical way to do that, or maybe
> it wouldn't really matter in practice; I've not looked into it. But not
> doing anything feels like a hack.

I was trying to do it. But the current planner architecture does not provide a
nice way to achieve it, due to the way it handles limit and offset.
So, I think it is better to avoid it for now.

> Setting this back to Waiting on Author.

I'll try to make the required changes in a few days.

Thanks.


Re: [WIP PATCH] Index scan offset optimisation using visibility map

2018-03-20 Thread Michail Nikolaev
Hello everyone.

I need an advice.

I was reworking the patch: added support for the planner, added support for
queries with projection, and added support for predicates which could be
executed over index data.
And... I realized that my IndexScan is even more index-only than the
original IndexOnlyScan. So, it seems to be the wrong way.

I think the task could be split into two:

1. Extended support for index-only-scan

Currently IndexOnlyScan is used only when the target data is fully located
in the index. If we need some additional columns - a regular index scan is
used anyway.

For example, let's consider such table and index:

CREATE TABLE test_events (
  time timestamp ,
  event varchar(255),
  data jsonb
);

CREATE INDEX on test_events USING btree(time, event);

It is some kind of big table with log events. And let's consider such a
query:

SELECT data->>'event_id'
FROM test_events
WHERE
time > (now() - interval '2 year') AND -- indexqual
event = 'tax' AND -- indexqual
extract(ISODOW from time) = 1 --qpquals
ORDER BY time DESC

At the moment an IndexScan plan will be used for such a query, because the
result needs heap data. But only 1/7 of the heap accesses will yield
matching rows. At the same time "extract(ISODOW from time) = 1" (qpquals)
could easily be calculated over index data.

The idea is simple: extend IndexOnlyScan to be used if all query
predicates (both indexqual and qpquals) can be calculated over index
data. If all checks pass - just load the tuple from the heap to return it.
It seems like index-data access is really cheap now, and such a plan will
be faster even for qpquals without high selectivity. At least for
READ COMMITTED.

I think I will be able to create a prototype within a few days (most of the
work is done in the current patch rework).

Probably it is not a new idea - so, is it worth implementing? Maybe
I've missed something huge.

2. For the extended IndexOnlyScan - add support to avoid the heap fetch in
case OFFSET is applied to the tuple.

If the first part is implemented - the OFFSET optimisation is much easier to
achieve.

Thanks,
Michail.


Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-01 Thread Michail Nikolaev
> > > I just realised there is one issue with this design: We can't cheaply
> > > reset the snapshot during the second table scan:
> > > It is critically important that the second scan of R/CIC uses an index
> > > contents summary (made with index_bulk_delete) that was created while
> > > the current snapshot was already registered.
> >
> > > So, the "reset the snapshot every so often" trick cannot be applied in
> > > phase 3 (the rescan), or we'd have to do an index_bulk_delete call
> > > every time we reset the snapshot. Rescanning might be worth the cost
> > > (e.g. when using BRIN), but that is very unlikely.
> >
> > Hm, I think it is still possible. We could just manually recheck the
> > tuples we see
> > against the snapshot currently used for the scan. If an "old" snapshot can see
> > the tuple also (HeapTupleSatisfiesHistoricMVCC) then search for it in the
> > index summary.
> That's an interesting method.
>
> How would this deal with tuples not visible to the old snapshot?
> Presumably we can assume they're newer than that snapshot (the old
> snapshot didn't have it, but the new one does, so it's committed after
> the old snapshot, making them newer), so that backend must have
> inserted it into the index already, right?

I made a draft of the patch and this idea is not working.

The problem is generally the same:

* reference snapshot sees tuple X
* reference snapshot is used to create index summary (but there is no
tuple X in the index summary)
* tuple X is updated to Y creating a HOT-chain
* we started scan with new temporary snapshot (it sees Y, X is too old for it)
* tuple X is pruned from HOT-chain because it is not protected by any snapshot
* we see tuple Y in the scan with temporary snapshot
* it is not in the index summary - so, we need to check if
reference snapshot can see it
* there is no way to understand whether the reference snapshot was able
to see tuple X - because we need the full HOT chain (with the X tuple) for
that

Best regards,
Michail.




Replace known_assigned_xids_lck by memory barrier

2023-03-19 Thread Michail Nikolaev
Hello everyone and Tom.

Tom, this is about your idea (1) from 2010 to replace the spinlock with a
memory barrier in the known-assigned-xids machinery.

You mentioned it again in (2), and in (3) we decided to
extract this change into a separate commitfest entry.

So, creating it here with a rebased version of (4).

In a nutshell: KnownAssignedXids as well as the head/tail pointers are
modified only by the startup process, so the spinlock is used only to ensure
that updates of the array and the head/tail pointers are seen by other
processes in the correct order. Issuing a write barrier after writing to the
array (but before updating the pointers) achieves the same result.

Best regards.

[1]: 
https://github.com/postgres/postgres/commit/2871b4618af1acc85665eec0912c48f8341504c4#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R2408-R2412

[2]: 
https://www.postgresql.org/message-id/flat/1249332.1668553589%40sss.pgh.pa.us#19d00eb435340f5c5455e3bf259eccc8

[3]: 
https://www.postgresql.org/message-id/flat/1225350.1669757944%40sss.pgh.pa.us#23ca1956e694910fd7795a514a3bc79f

[4]: 
https://www.postgresql.org/message-id/flat/CANtu0oiPoSdQsjRd6Red5WMHi1E83d2%2B-bM9J6dtWR3c5Tap9g%40mail.gmail.com#cc4827dee902978f93278732435e8521
From d968645551412abdb3fac6b8514c3d6746e8ac3d Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Sat, 18 Mar 2023 21:28:00 +0300
Subject: [PATCH v2] Removes known_assigned_xids_lck, using the write memory
 barrier as suggested earlier.

---
 src/backend/storage/ipc/procarray.c | 41 +++--
 1 file changed, 9 insertions(+), 32 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ea91ce355f..95e2672f9a 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -61,7 +61,6 @@
 #include "port/pg_lfind.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
-#include "storage/spin.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
@@ -82,7 +81,6 @@ typedef struct ProcArrayStruct
 	int			numKnownAssignedXids;	/* current # of valid entries */
 	int			tailKnownAssignedXids;	/* index of oldest valid element */
 	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-	slock_t		known_assigned_xids_lck;	/* protects head/tail pointers */
 
 	/*
 	 * Highest subxid that has been removed from KnownAssignedXids array to
@@ -444,7 +442,6 @@ CreateSharedProcArray(void)
 		procArray->numKnownAssignedXids = 0;
 		procArray->tailKnownAssignedXids = 0;
 		procArray->headKnownAssignedXids = 0;
-		SpinLockInit(&procArray->known_assigned_xids_lck);
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
@@ -4651,10 +4648,9 @@ KnownAssignedTransactionIdsIdleMaintenance(void)
  * pointer.  This wouldn't require any lock at all, except that on machines
  * with weak memory ordering we need to be careful that other processors
  * see the array element changes before they see the head pointer change.
- * We handle this by using a spinlock to protect reads and writes of the
- * head/tail pointers.  (We could dispense with the spinlock if we were to
- * create suitable memory access barrier primitives and use those instead.)
- * The spinlock must be taken to read or write the head/tail pointers unless
+ * We handle this by using a memory barrier to protect writes of the
+ * head pointer.
+ * The memory barrier is issued before writing the head pointer unless
  * the caller holds ProcArrayLock exclusively.
  *
  * Algorithmic analysis:
@@ -4712,7 +4708,7 @@ KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock)
 
 	/*
 	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
+	 * are safe to read them here.
 	 */
 	head = pArray->headKnownAssignedXids;
 	tail = pArray->tailKnownAssignedXids;
@@ -4893,21 +4889,20 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 	pArray->numKnownAssignedXids += nxids;
 
 	/*
-	 * Now update the head pointer.  We use a spinlock to protect this
+	 * Now update the head pointer.  We use a memory barrier to protect this
 	 * pointer, not because the update is likely to be non-atomic, but to
 	 * ensure that other processors see the above array updates before they
 	 * see the head pointer change.
 	 *
 	 * If we're holding ProcArrayLock exclusively, there's no need to take the
-	 * spinlock.
+	 * barrier.
 	 */
 	if (exclusive_lock)
 		pArray->headKnownAssignedXids = head;
 	else
 	{
-		SpinLockAcquire(&pArray->known_assigned_xids_lck);
+		pg_write_barrier();
 		pArray->headKnownAssignedXids = head;
-		SpinLockRelease(&pArray->known_assigned_xids_lck);
 	}
 }
 
@@ -4930,20 +4925,8 @@ 

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-20 Thread Michail Nikolaev
Hello!

> I think the best way for this to work would be an index method that
> exclusively stores TIDs, and of which we can quickly determine new
> tuples, too. I was thinking about something like GIN's format, but
> using (generation number, tid) instead of ([colno, colvalue], tid) as
> key data for the internal trees, and would be unlogged (because the
> data wouldn't have to survive a crash)

Yeah, this seems to be a reasonable approach, but there are some
doubts related to it - it needs a new index type as well as unlogged
indexes to be introduced, which may make the patch too invasive to be
merged. Also, some way to remove the index from the catalog in case of
a crash may be required.

A few more thoughts:
* it is possible to go without generation number - we may provide a
way to do some kind of fast index lookup (by TID) directly during the
second table scan phase.
* one more option is to maintain a Tuplesorts (instead of an index)
with TIDs as changelog and merge with index snapshot after taking a
new visibility snapshot. But it is not clear how to share the same
Tuplesort with multiple inserting backends.
* crazy idea - what about doing the scan in the index we are
building? We have the tuple, so we have all the data indexed in the
index. We may try to do an index scan using that data to get all
tuples and find the one with our TID :) Yes, in some cases it may
perform badly because of the huge number of TIDs we need to scan, and
btree also copies the whole page even though we need a single item. But
some additional index method may help - it feels like something related
to uniqueness (which only exists in btree anyway).

Thanks,
Mikhail.




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-21 Thread Michail Nikolaev
One more idea is to simply forbid HOT pruning while the second phase is
running. It is not possible currently anyway because of the snapshot being held.

Possible enhancements:
* we may apply restriction only to particular tables
* we may apply restrictions only to part of the tables (not yet
scanned by R/CICs).

Yes, it is not an elegant solution, limited, not reliable in terms of
architecture, but a simple one.




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-02-21 Thread Michail Nikolaev
Hi!

> How do you suppose this would work differently from a long-lived
> normal snapshot, which is how it works right now?

The difference is the ability to take a new visibility snapshot
periodically during the second phase, rechecking the visibility of tuples
according to the "reference" snapshot (which is taken only once, like now).
It is the approach from (1) but with a workaround for the issues
caused by heap_page_prune_opt.

> Would it be exclusively for that relation?
Yes, only for that affected relation. Other relations are unaffected.

> How would this be integrated with e.g. heap_page_prune_opt?
Probably by some flag in RelationData, but not sure here yet.

If the idea looks sane, I could try to extend my POC - it should be
not too hard, likely (I already have tests to make sure it is
correct).

(1): 
https://www.postgresql.org/message-id/flat/CANtu0oijWPRGRpaRR_OvT2R5YALzscvcOTFh-%3DuZKUpNJmuZtw%40mail.gmail.com#8141eb2ea177ff560ee713b3f20de404




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-03-07 Thread Michail Nikolaev
Hello!

> I'm not a fan of this approach. Changing visibility and cleanup
> semantics to only benefit R/CIC sounds like a pain to work with in
> essentially all visibility-related code. I'd much rather have to deal
> with another index AM, even if it takes more time: the changes in
> semantics will be limited to a new plug in the index AM system and a
> behaviour change in R/CIC, rather than behaviour that changes in all
> visibility-checking code.

Technically, this does not affect the visibility logic, only the
cleanup semantics.
All visibility-related code remains untouched.
But yes, it is still an inelegant and slightly strange-looking option.

At the same time, perhaps it can be made more presentable somehow - for
example, by adding to ComputeXidHorizonsResult, as a first-class citizen,
a list of blocks to clean up for certain relations.

> But regardless of second scan snapshots, I think we can worry about
> that part at a later moment: The first scan phase is usually the most
> expensive and takes the most time of all phases that hold snapshots,
> and in the above discussion we agreed that we can already reduce the
> time that a snapshot is held during that phase significantly. Sure, it
> isn't great that we have to scan the table again with only a single
> snapshot, but generally phase 2 doesn't have that much to do (except
> when BRIN indexes are involved) so this is likely less of an issue.
> And even if it is, we would still have reduced the number of
> long-lived snapshots by half.

Hmm, but it looks like we don't have the infrastructure to "update" the xmin
propagated to the horizon after the first snapshot in a transaction is taken.

One option I know of is to reuse the approach from
d9d076222f5b94a85e0e318339cfc44b8f26022d (1).
But in that case there is no point in re-taking the snapshot during the
first phase - just apply this "if" only for the first phase and you're done.

Do you know of any less hacky way? Or is this a fine way to go?

[1]: 
https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793




Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

2022-12-26 Thread Michail Nikolaev
Hello.

Just a small story about small data-loss on logical replication.

We were logically replicating a 4 TB database from

> PostgreSQL 12.12 (Ubuntu 12.12-201-yandex.49163.d86383ed5b) on 
> x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 
> 64-bit

to

> PostgreSQL 14.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
> 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

The database includes many tables; among them are tables A, B and C.
Tables A, B and C are changed in the same transaction (new tuples are
created in B and C with a corresponding update of A).

Table A was added to the PUBLICATION from the start. Once the initial sync
was done, tables B and C were added to the PUBLICATION with a REFRESH on
the subscription (to reduce WAL accumulation on the source database).

So, we see in logs:

> -2022-12-13 13:19:55 UTC-63987bfb.2733-LOG: logical replication table 
> synchronization worker for subscription "cloud_production_main_sub_v4", table 
> "A" has started
> -2022-12-13 14:41:49 UTC-63987bfb.2733-LOG: logical replication table 
> synchronization worker for subscription "cloud_production_main_sub_v4", table 
> "A" has finished


> -2022-12-14 08:08:34 UTC-63998482.7d10-LOG: logical replication table 
> synchronization worker for subscription "cloud_production_main_sub_v4", table 
> "B" has started
> -2022-12-14 10:19:08 UTC-63998482.7d10-LOG: logical replication table 
> synchronization worker for subscription "cloud_production_main_sub_v4", table 
> "B" has finished


> -2022-12-14 10:37:47 UTC-6399a77b.1fc-LOG: logical replication table 
> synchronization worker for subscription "cloud_production_main_sub_v4", table 
> "C" has started
> -2022-12-14 10:48:46 UTC-6399a77b.1fc-LOG: logical replication table 
> synchronization worker for subscription "cloud_production_main_sub_v4", table 
> "C" has finished


Also, we had to reboot the subscriber server twice. Moreover, we have
plenty of messages like:


> -2022-12-13 15:53:30 UTC-639872fa.1-LOG: background worker "logical 
> replication worker" (PID 47960) exited with exit code 1
> -2022-12-13 21:04:31 UTC-6398e8df.4f7c-LOG: logical replication apply worker 
> for subscription "cloud_production_main_sub_v4" has started
> -2022-12-14 10:19:22 UTC-6398e8df.4f7c-ERROR: could not receive data from WAL 
> stream: SSL SYSCALL error: EOF detected

Additionally, the HA replica of the subscriber was broken and recovered by
support… And we saw logs like this:

> psql-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: user requested cancel 
> while waiting for synchronous replication ack.
> The COMMIT record has already flushed to WAL locally and might not have been 
> replicated to the standby. We must wait here.

Furthermore, we were adding/removing another table D from the publication
a few times. So, it was a bit of a messy story.

In the end, we got streaming working for the whole database.

But after some time we realized we had lost 203 records from table B,
created from

2022-12-14 09:21:25.705 to
2022-12-14 09:49:20.664 (after synchronization start, but before finish).

And the trickiest part here: A, B and C are changed in the same
transaction (related tuples). But tables A and C are fine - only a few
records from B are lost.

We have compared all other tables record by record - only 203 records
from B are missing. We have restored the server from backup with
point-in-time recovery (to exclude an application or human
error) - the same results. Furthermore, we have tried different
indexes in the search (to exclude an issue with a broken btree) - the same
results.

So, yes, we understand our replication story was not even close to a
classical happy path. But the result feels a little scary.

Currently, I have access to database and logs - so, feel free to ask
for additional debugging information if you like.

Thanks a lot,
Michail.




Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

2022-12-26 Thread Michail Nikolaev
Hello again.

Just a small fix for:

> 2022-12-14 09:21:25.705 to
> 2022-12-14 09:49:20.664 (after synchronization start, but before finish).

Correct values are:

2022-12-14 09:49:31.340
2022-12-14 09:49:41.683

So, it looks like we lost about 10s of WAL for one of the tables.




Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

2022-12-27 Thread Michail Nikolaev
Hello, Amit!

> IUC, this is the time when only table B's initial sync was
> in-progress. Table A's initial sync was finished by that time and for
> Table C, it is yet not started.
Yes, it is correct. C was started too, but unsuccessfully (it was
restarted afterwards, see below).

> During the time of
> the initial sync of B, are there any other operations on that table
> apart from the missing ones?
Yes, a lot of them. Tuples were created all the time without any pauses;
only a ~10s interval is gone.

> You may want to see the LOGs of
> subscribers during the initial sync time for any errors or other
> messages.

Looks like I have found something interesting. Below, sorted by time
with some comments.

Restart of the logical apply worker the day before the issue:

-2022-12-13 21:04:25 UTC-6398e8d9.4ec3-LOG: logical
replication apply worker for subscription
"cloud_production_main_sub_v4" has started
-2022-12-13 21:04:26 UTC-6398e8d9.4ec3-ERROR: could not start
WAL streaming: FATAL: out of relcache_callback_list slots CONTEXT:
slot "cloud_production_main_sub_v4", output plugin "pgoutput", in the
startup callback ERROR: odyssey: c1747b31d0187: remote server
read/write error s721c052ace56: (null) SSL SYSCALL error: EOF detected
-2022-12-13 21:04:26 UTC-639872fa.1-LOG: background worker
"logical replication worker" (PID 20163) exited with exit code 1
-2022-12-13 21:04:31 UTC-6398e8df.4f7c-LOG: logical
replication apply worker for subscription
"cloud_production_main_sub_v4" has started

Start of B and C table initial sync (adding tables to the
subscription, table A already in streaming mode):

-2022-12-14 08:08:34 UTC-63998482.7d10-LOG: logical
replication table synchronization worker for subscription
"cloud_production_main_sub_v4", table "B" has started
-2022-12-14 08:08:34 UTC-63998482.7d13-LOG: logical
replication table synchronization worker for subscription
"cloud_production_main_sub_v4", table "C" has started

B is synchronized without any errors:

-2022-12-14 10:19:08 UTC-63998482.7d10-LOG: logical
replication table synchronization worker for subscription
"cloud_production_main_sub_v4", table "B" has finished

After about 15 seconds, the replication worker is restarted because of
I/O issues:

-2022-12-14 10:19:22 UTC-6398e8df.4f7c-ERROR: could not
receive data from WAL stream: SSL SYSCALL error: EOF detected
-2022-12-14 10:19:22 UTC-6399a32a.6af3-LOG: logical
replication apply worker for subscription
"cloud_production_main_sub_v4" has started
-2022-12-14 10:19:22 UTC-639872fa.1-LOG: background worker
"logical replication worker" (PID 20348) exited with exit code 1

Then a query cancel (something about an insert into public.lsnmover):

-2022-12-14 10:21:03 UTC-63999c6a.f25d-ERROR: canceling
statement due to user request
-2022-12-14 10:21:39 UTC-639997f9.fd6e-ERROR: canceling
statement due to user request

After a small amount of time, synchronous replication seems to be
broken; we see tons of:

-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: shutdown
requested while waiting for synchronous replication ack.
-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: user requested
cancel while waiting for synchronous

After a few minutes, at 10:36:05, we initiated a database restart:

-2022-12-14 10:35:20 UTC-63999c6a.f25d-ERROR: canceling
statement due to user request
-2022-12-14 10:37:25 UTC-6399a765.1-LOG: pgms_stats: Finishing PG_init
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on IPv4
address "0.0.0.0", port 5432
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on IPv6
address "::", port 5432
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: starting PostgreSQL
14.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
-2022-12-14 10:37:26 UTC-6399a765.1-LOG: listening on Unix
socket "/tmp/.s.PGSQL.5432"
-2022-12-14 10:37:26 UTC-6399a766.103-LOG: database system was
interrupted; last known up at 2022-12-14 10:36:47 UTC
-2022-12-14 10:37:26 UTC-6399a766.105-FATAL: the database
system is starting up
-2022-12-14 10:37:26 UTC-6399a766.107-FATAL: the database
system is starting up
-2022-12-14 10:37:27 UTC-6399a767.108-FATAL: the database
system is starting up
-2022-12-14 10:37:27 UTC-6399a767.109-FATAL: the database
system is starting up
-2022-12-14 10:37:27 UTC-6399a767.10a-FATAL: the database
system is starting up
-2022-12-14 10:37:27 UTC-6399a767.10b-FATAL: the database
system is starting up
-2022-12-14 10:37:27 UTC-6399a767.113-FATAL: the database
system is starting up
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered
replication state of node 4 to 185F6/CBF4BBB8
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered
replication state of node 1 to 18605/FE229E10
-2022-12-14 10:37:29 UTC-6399a766.103-LOG: recovered
replication state of node 6 to 185F8/86AEC880
-2022-12-

Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

2022-12-28 Thread Michail Nikolaev
Hello.

> None of these entries are from the point mentioned by you [1]
> yesterday where you didn't find the corresponding data in the
> subscriber. How did you identify that the entries corresponding to
> that timing were missing?

Some of them are from before the interval, some after... But the source
database was generating a lot of WAL during logical replication
- some of these log entries are from the time AFTER completion of the
initial sync of B but (probably) BEFORE table B finished catching up
(entering streaming mode).

Just to clarify, tables A, B and C are updated in the same
transaction, something like:

BEGIN;
UPDATE A SET x = x +1 WHERE id = :id;
INSERT INTO B(current_time, :id);
INSERT INTO C(current_time, :id);
COMMIT;

Other (not mentioned) tables are also included in this transaction, but
only B is missing a small amount of data.

So, shortly the story looks like:

* initial sync of A (and other tables) started and completed, they are
in streaming mode
* B and C initial sync started (by altering PUBLICATION and SUBSCRIPTION)
* B sync completed, but new changes are still being applied to the table
to catch up with the primary
* logical replication apply worker is restarted because of an I/O error
on WAL receive
* Postgres killed
* Postgres restarted
* C initial sync restarted
* logical replication apply worker restarted a few times because of I/O
errors on WAL receive
* finally, every table is in streaming mode, but with a small gap in B

Thanks,
Michail.




Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

2023-01-03 Thread Michail Nikolaev
Hello, Amit.

> The point which is not completely clear from your description is the
> timing of missing records. In one of your previous emails, you seem to
> have indicated that the data missed from Table B is from the time when
> the initial sync for Table B was in-progress, right? Also, from your
> description, it seems there is no error or restart that happened
> during the time of initial sync for Table B. Is that understanding
> correct?

Yes and yes.
* B sync started - 08:08:34
* lost records are created - 09:49:xx
* B initial sync finished - 10:19:08
* I/O error with WAL - 10:19:22
* SIGTERM - 10:35:20

"Finished" here is `logical replication table synchronization worker
for subscription "cloud_production_main_sub_v4", table "B" has
finished`.
As far as I know, it refers to the COPY command.

> I am not able to see how these steps can lead to the problem.

One idea I have here - it may be something related to the patch
forbidding cancellation of queries while waiting for synchronous
replication acknowledgement [1].
It is applied to Postgres in the cloud we were using [2]. We started
to see such errors at 10:24:18:

  `The COMMIT record has already flushed to WAL locally and might
not have been replicated to the standby. We must wait here.`

I wonder, could it be some tricky race caused by downtime of the
synchronous replica and queries stuck waiting for the ACK forever?

> If the problem is reproducible at your end, you might want to increase LOG
> verbosity to DEBUG1 and see if there is additional information in the
> LOGs that can help or it would be really good if there is a
> self-sufficient test to reproduce it.

Unfortunately, it looks like it is really hard to reproduce.

Best regards,
Michail.

[1]: 
https://www.postgresql.org/message-id/flat/CALj2ACU%3DnzEb_dEfoLqez5CLcwvx1GhkdfYRNX%2BA4NDRbjYdBg%40mail.gmail.com#8b7ffc8cdecb89de43c0701b4b6b5142
[2]: 
https://www.postgresql.org/message-id/flat/CAAhFRxgcBy-UCvyJ1ZZ1UKf4Owrx4J2X1F4tN_FD%3Dfh5wZgdkw%40mail.gmail.com#9c71a85cb6009eb60d0361de82772a50




Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

2023-01-03 Thread Michail Nikolaev
> Does that by any chance mean you are using a non-community version of
> Postgres which has some other changes?

It is a managed Postgres service in a general cloud. Usually, such
providers apply some custom minor patches.
The only one I know about is the one forbidding cancellation of queries
while waiting for synchronous replication acknowledgement.

> It is possible but ideally, in that case, the client should request
> such a transaction again.

I am not sure I get you here.

I'll try to explain what I mean:

The patch I'm referring to does not allow canceling a query while it is
waiting for the ACK of a COMMIT message in the case of synchronous
replication.
If the synchronous standby is down, the query and connection are just
stuck until a server restart (or until the standby becomes available to
process the ACK).
Tuples changed by such a hanging transaction are not visible to other
transactions. It is all done to prevent seeing spurious tuples in the
case of a network split.

So, it seems like we had such a situation during that story because of
our synchronous standby downtime (before the server restart).
My thought is just about the possibility that such transactions
(waiting for the COMMIT ACK) are handled incorrectly somehow by the
logical replication engine.

Michail.




BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-05 Thread Michail Nikolaev
Hello, hackers.

It seems like PG 14 works incorrectly with vacuum_defer_cleanup_age
(or just with rows that have not been cleaned up, not sure) and SELECT
FOR UPDATE + UPDATE.
I am not certain, but hot_standby_feedback is probably able to cause the
same issues.

Steps to reproduce:

1) Start Postgres like this:

  docker run -it -p 5432:5432 --name pg -e
POSTGRES_PASSWORD=postgres -e LANG=C.UTF-8 -d postgres:14.6  -c
vacuum_defer_cleanup_age=100

2) Prepare scheme:

  CREATE TABLE something_is_wrong_here (id bigserial PRIMARY KEY,
value numeric(15,4) DEFAULT 0 NOT NULL);
  INSERT INTO something_is_wrong_here (value) (SELECT 1 from
generate_series(0, 100));

3) Prepare file for pgbench:

  BEGIN;

  SELECT omg.*
 FROM something_is_wrong_here AS omg
 ORDER BY random()
 LIMIT 1
 FOR UPDATE
  \gset

  UPDATE something_is_wrong_here SET value = :value + 1 WHERE id = :id;

  END;

4) Run pgbench:

  pgbench -c 50 -j 2 -n -f reproduce.bench 'port= 5432
host=localhost user=postgres dbname=postgres password=postgres' -T 100
-P 1

I was able to see such a set of errors (looks scary):

ERROR:  MultiXactId 30818104 has not been created yet -- apparent wraparound
ERROR:  could not open file "base/13757/16385.1" (target block
39591744): previous segment is only 24 blocks
ERROR:  attempted to lock invisible tuple
ERROR:  could not access status of transaction 38195704
DETAIL:  Could not open file "pg_subtrans/0246": No such file or directory.


Best regards,
Michail.




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-05 Thread Michail Nikolaev
Hello, Andres.

I apologize for the direct ping, but I think your snapshot scalability
work in PG14 could be related to the issue.

The TransactionIdRetreatedBy implementation looks correct... But with
txid_current=212195 I see errors like "could not access status of
transaction 58643736"...
So, maybe vacuum_defer_cleanup_age just highlights some special case
(something with "previous" xids on the left side of zero?)
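To make the "left side of zero" suspicion concrete, here is a purely illustrative Python sketch (hypothetical arithmetic, not the actual TransactionIdRetreatedBy code) of what plain 32-bit retreat does when the retreat amount is >= txid_current():

```python
# Hypothetical model of 32-bit xid retreat.  PostgreSQL reserves
# xids 0..2 (Invalid, Bootstrap, Frozen); normal xids start at 3.
FIRST_NORMAL_XID = 3
UINT32_MOD = 2**32

def retreat_by(xid, amount):
    # plain modular subtraction, with no clamping to the normal range
    return (xid - amount) % UINT32_MOD

def is_normal(xid):
    return xid >= FIRST_NORMAL_XID

# vacuum_defer_cleanup_age >= txid_current(): the horizon crosses zero
print(retreat_by(734, 734))             # 0 -> InvalidTransactionId
print(is_normal(retreat_by(734, 734)))  # False
print(retreat_by(734, 736))             # 4294967294, i.e. "from the future"
```

In other words, on a young cluster the retreated horizon can land in the reserved special-xid range or wrap around to an enormous "future" xid - consistent with the value 734 being the smallest setting that reproduces the crash in the report below.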

Thanks,
Michail.




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-06 Thread Michail Nikolaev
Hello!

Thanks for checking the issue!

> FWIW, I find that the "Assert(ItemIdIsNormal(lp));" at the top of
> heap_lock_tuple() is the first thing that fails on my assert-enabled
> build.

Yes, the same for me:

 TRAP: failed Assert("ItemIdIsNormal(lp)"), File: "heapam.c",
Line: 4292, PID: 33416


> Reproduced this on HEAD locally (no docker), without any difficulty.

It is a little bit harder without docker in my case; I need to adjust
the number of connections and threads:

 pgbench -c 90 -j 8 -n -f reproduce.bench 'port= 5432
host=localhost user=postgres dbname=postgres password=postgres' -T
2000 -P 1


Best regards,
Michail.




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-07 Thread Michail Nikolaev
Hello.

The few things I have got so far:

1) It is not required to order by random() to reproduce the issue - it
could be done using queries like:

  BEGIN;
  SELECT omg.*
 FROM something_is_wrong_here AS omg
 ORDER BY value -- change is here
 LIMIT 1
 FOR UPDATE
  \gset


  UPDATE something_is_wrong_here SET value = :value + 1 WHERE id = :id;
  COMMIT;

But for some reason it is harder to reproduce without random in my
case (I typically need to wait about a minute with 100 connections).

2) It is not an issue at table creation time. The issue is reproducible
if vacuum_defer_cleanup_age is set after table preparation.

3) To reproduce the issue, vacuum_defer_cleanup_age should flip the xid
over zero (be >= txid_current()).
And it is stable. So, for example, I am unable to reproduce it with a
value of 733, but 734 gives the error each time.
Just a single additional txid_current() (after the data is filled) avoids
the crash... It looks like the first SELECT FOR UPDATE + UPDATE silently
poisons everything somehow.
You could use a psql script like this:

  DROP TABLE IF EXISTS something_is_wrong_here;

  CREATE TABLE something_is_wrong_here (id bigserial PRIMARY KEY,
value numeric(15,4) DEFAULT 0 NOT NULL);

  INSERT INTO something_is_wrong_here (value) (SELECT 1 from
generate_series(0, 100));

  SELECT txid_current() \gset

  SELECT :txid_current + 1 as txid \gset

  ALTER SYSTEM SET vacuum_defer_cleanup_age to :txid;SELECT
pg_reload_conf();

I have attached some scripts in case someone wants to reproduce it.

Best regards,
Michail.


reproduce.sh
Description: application/shellscript


reproduce.bench
Description: Binary data


scheme.sql
Description: application/sql


Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second

2022-03-24 Thread Michail Nikolaev
Hello, Peter.

>> * Add code to _bt_killitems() that detects if it has generated an FPI,
>> just to set some LP_DEAD bits.
>> * Instead of avoiding the FPI when this happens, proactively call
>> _bt_simpledel_pass() just before _bt_killitems() returns. Accept the
>> immediate cost of setting an LP_DEAD bit, just like today, but avoid
>> repeated FPIs.

> Yes, I am almost sure proactively calling of_bt_simpledel_pass() will
> positively impact the system on many workloads. But also I am almost
> sure it will not change the behavior of the incident I mention -
> because it is not related to multiple checkpoints.

I just realized that it seems to be a dangerous approach because of
the locking mechanism.
Currently _bt_killitems requires only a read lock, but _bt_simpledel_pass
requires a write lock (it ends with _bt_delitems_delete).
The locking mode would have to be increased in order to call _bt_simpledel_pass.

Such a change may negatively affect many workloads because of the write
lock taken during scanning - and it is really hard to prove the absence
of a regression (I have no idea how).

Thanks,
Michail.




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-03-29 Thread Michail Nikolaev
Hello, Greg.

> I'm seeing a recovery test failure. Not sure if this represents an
> actual bug or just a test that needs to be adjusted for the new
> behaviour.

Thanks for notifying me. It is a failure of a test added in the patch.
It is a little hard to make it stable (because it depends on
minRecoveryLSN, which can be changed in an asynchronous way without any
control). I'll check how to make it more stable.

Thanks,
Michail.




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-03-29 Thread Michail Nikolaev
Hello, Peter.

Thanks for your review!

> I doubt that the patch's use of pg_memory_barrier() in places like
> _bt_killitems() is correct. There is no way to know for sure if this
> novel new lockless algorithm is correct or not, since it isn't
> explained anywhere.

The memory barrier is used only to ensure memory ordering in the case of
clearing LP_DEAD bits - just to make sure the flag allowing the use of
LP_DEAD bits is seen AFTER the bits are cleared.
Yes, it should be described in more detail.
The flapping test is the one added in the patch and is not related to
memory ordering. I have already tried to make it stable once before, but
it depends on minRecoveryLSN propagation. I'll think about how to make it
stable.

> If there is an LP_DEAD bit set on a posting list on the primary, and
> we need to do a posting list split against the posting tuple, we need
> to be careful -- we cannot allow our new TID to look like it's LP_DEAD
> immediately, before our transaction even commits/aborts. We cannot
> swap out our new TID with an old LP_DEAD TID, because we'll think that
> our new TID is LP_DEAD when we shouldn't.

Oh, good catch! I was thinking it was safe to have additional hint bits
on primary, but it seems not. BTW, I am wondering whether it is possible
to achieve the same situation via pg_rewind and standby promotion…

> Overall, I think that this patch has serious design flaws, and that
> this issue is really just a symptom of a bigger problem.

Could you please advise me on something? The ways I see:
* give up :)
* try to fix this concept
* go back to the concept with LP_DEAD horizon WAL and optional cancellation
* try to limit the scope to "allow standby to use LP_DEAD bits set on
primary in some cases" (probably by marking something in the btree page)
* look for some new way

Best regards,
Michail.




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-03-29 Thread Michail Nikolaev
UPD:

> I was thinking it is safe to have additional hint bits
> on primary, but it seems like no.

Oh, sorry for the mistake - it is about the standby, of course.

> BTW I am wondering if it is possible
> to achieve the same situation by pg_rewind and standby promotion…

Looks like it is impossible, because wal_log_hints is required in
order to use pg_rewind.
It is possible to achieve a situation with some additional LP_DEAD bits
on the standby compared to the primary, but any change on the primary
would cause an FPI, so the LP_DEAD bits would be cleared.

Thanks,
Michail.




Re: Slow standby snapshot

2022-03-31 Thread Michail Nikolaev
Hello.

Just an updated commit message.

Thanks,
Michail.
From 934d649a18c66f8b448463e57375865b33ce53e7 Mon Sep 17 00:00:00 2001
From: nkey 
Date: Fri, 1 Apr 2022 02:17:55 +0300
Subject: [PATCH v5] Optimize KnownAssignedXidsGetAndSetXmin by maintaining
 offset to next valid xid.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, KnownAssignedXidsGetAndSetXmin requires an iterative loop
through the KnownAssignedXids array, including xids marked as invalid.
The performance impact is especially noticeable in the presence of long
(a few seconds) transactions on the primary, a high value (a few
thousand) of max_connections and a high read workload on the standby.
Most of the CPU time is spent looping through KnownAssignedXids while
almost all xids are invalid anyway. KnownAssignedXidsCompress removes
invalid xids from time to time, but performance is still affected.

To increase performance we lazily maintain an offset in the 
KnownAssignedXidsNext array to skip known
to be invalid xids. KnownAssignedXidsNext does not always point to “next” valid 
xid, it is just some
offset safe to skip (known to be filled by only invalid xids).

It helps to skip the gaps and could significantly increase performance - up to 
10% more TPS on standby.

Thread: 
https://www.postgresql.org/message-id/flat/CALdSSPgahNUD_%3DpB_j%3D1zSnDBaiOtqVfzo8Ejt5J_k7qZiU1Tw%40mail.gmail.com

Benchmark:
https://www.postgresql.org/message-id/flat/CANtu0ohzBFTYwdLtcanWo4%2B794WWUi7LY2rnbHyorJdE8_ZnGg%40mail.gmail.com#379c1be7b8134ada5a574078d51b64c6
---
 src/backend/storage/ipc/procarray.c | 56 -
 1 file changed, 47 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c 
b/src/backend/storage/ipc/procarray.c
index 13d192ec2b..b5cb3423fb 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -267,6 +267,7 @@ static PGPROC *allProcs;
  */
 static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
+static int32 *KnownAssignedXidsNext;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -455,6 +456,12 @@ CreateSharedProcArray(void)
ShmemInitStruct("KnownAssignedXidsValid",
mul_size(sizeof(bool), 
TOTAL_MAX_CACHED_SUBXIDS),
&found);
+   KnownAssignedXidsNext = (int32 *)
+   ShmemInitStruct("KnownAssignedXidsNext",
+   
mul_size(sizeof(int32), TOTAL_MAX_CACHED_SUBXIDS),
+   &found);
+   for (int i = 0; i < TOTAL_MAX_CACHED_SUBXIDS; i++)
+   KnownAssignedXidsNext[i] = 1;
}
 }
 
@@ -4544,7 +4551,13 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * XID entry itself.  This preserves the property that the XID entries are
  * sorted, so we can do binary searches easily.  Periodically we compress
  * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
+ * array immediately on every deletion. Also, we lazily maintain an offset
+ * in KnownAssignedXidsNext[] array to skip known to be invalid xids. It
+ * helps to skip the gaps; it could significantly increase performance in
+ * the case of long transactions on the primary. KnownAssignedXidsNext[] is
+ * updated while taking a snapshot. In the general case KnownAssignedXidsNext
+ * contains not an offset to the next valid xid but a number which tends
+ * toward that offset.
  *
  * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
  * are those with indexes tail <= i < head; items outside this subscript range
@@ -4582,7 +4595,7 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * must happen)
  * * Compressing the array is O(S) and requires exclusive lock
  * * Removing an XID is O(logS) and requires exclusive lock
- * * Taking a snapshot is O(S) and requires shared lock
+ * * Taking a snapshot is O(S), O(N) next call; requires shared lock
  * * Checking for an XID is O(logS) and requires shared lock
  *
  * In comparison, using a hash table for KnownAssignedXids would mean that
@@ -4642,12 +4655,13 @@ KnownAssignedXidsCompress(bool force)
 * re-aligning data to 0th element.
 */
compress_index = 0;
-   for (i = tail; i < head; i++)
+   for (i = tail; i < head; i += KnownAssignedXidsNext[i])
{
if (KnownAssignedXidsValid[i])
{
KnownAssignedXids[compress_index] = 
KnownAssignedXids[i];
KnownAssignedXidsValid[compress_index] = true;
+   KnownAssignedXidsNext[compress_index] = 1;
compress_index++;
}
 

Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-03-31 Thread Michail Nikolaev
Hello, Peter.

> The simple answer is: I don't know. I could probably come up with a
> better answer than that, but it would take real effort, and time.

I remember you had an idea about using the LP_REDIRECT bit in btree
indexes as some kind of “recently dead” flag (1).
Is this idea still in progress? Maybe an additional bit could provide
a space for a better solution.

> I think that you could do a better job of explaining and promoting the
> problem that you're trying to solve here. Emphasis on the problem, not
> so much the solution.

The system I am working on depends heavily on the performance of reading
from standby. In our workloads, queries on standby are sometimes
10-100x slower than on the primary due to the absence of LP_DEAD support.
Other users have the same issues (2). I believe such functionality is
great optimization for read replicas with both analytics and OLTP
(read-only) workloads.

> You must understand that this whole area is *scary*. The potential for
> serious data corruption bugs is very real. And because the whole area
> is so complicated (it is at the intersection of 2-3 complicated
> areas), we can expect those bugs to be hidden for a long time. We
> might never be 100% sure that we've fixed all of them if the initial
> design is not generally robust. Most patches are not like that.

Moved to “Waiting for Author” for now.

[1]: 
https://www.postgresql.org/message-id/flat/CAH2-Wz%3D-BoaKgkN-MnKj6hFwO1BOJSA%2ByLMMO%2BLRZK932fNUXA%40mail.gmail.com#6d7cdebd68069cc493c11b9732fd2040
[2]: 
https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org

Thanks,
Michail.




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-03-31 Thread Michail Nikolaev
Hello, David.

Thanks for your review!

> As a specific recommendation here - submit patches with a complete commit 
> message.
> Tweak it for each new version so that any prior discussion that informed the 
> general design of
> the patch is reflected in the commit message.

Yes, agreed. Applied to my other patch (1).

> In terms of having room for bugs this description seems like a lot of logic 
> to have to get correct.

Yes, it is the scary part. But it is contained in a single
is_index_lp_dead_maybe_allowed function for now.

> Could we just do this first pass as:
> Enable recovery mode LP_DEAD hint bit updates after the first streamed 
> CHECKPOINT record comes over from the primary.
> ?

Not sure, but yes, it is better to split the patch into more detailed commits.

Thanks,
Michail.

[1]: 
https://www.postgresql.org/message-id/flat/CANtu0ogzo4MsR7My9%2BNhu3to5%3Dy7G9zSzUbxfWYOn9W5FfHjTA%40mail.gmail.com#341a3c3b033f69b260120b3173a66382




Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-15 Thread Michail Nikolaev
Hello, hackers!

I am thinking about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
improvements) in some lighter way.

Yes, a serious bug (2) was caused by this optimization and it is now reverted.

But what about a more safe idea in that direction:
1) add a new horizon which ignores PROC_IN_SAFE_IC backends and standby queries
2) use this horizon for setting LP_DEAD bits in indexes (excluding
indexes being built, of course)

Index LP_DEAD hints are not used by standby in any way (they are just
ignored); also, the heap scan done by index building does not use them
either.

But, at the same time:
1) index scans will be much faster during index creation or standby
reporting queries
2) indexes can keep themselves fit using different optimizations
3) less WAL due to the huge amount of full-page writes (which are caused by
tons of LP_DEAD bits in indexes)

The patch seems more or less easy to implement.
Is it worth implementing? Or too scary?

[1]: https://postgr.es/m/20210115133858.GA18931@alvherre.pgsql
[2]: https://postgr.es/m/17485-396609c6925b982d%40postgresql.org




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-17 Thread Michail Nikolaev
Hello!

> I've thought about alternative solutions, too: how about getting a new 
> snapshot every so often?
> We don't really care about the liveness of the already-scanned data; the 
> snapshots used for RIC
> are used only during the scan. C/RIC's relation's lock level means vacuum 
> can't run to clean up
> dead line items, so as long as we only swap the backend's reported snapshot 
> (thus xmin) while
> the scan is between pages we should be able to reduce the time C/RIC is the 
> one backend
> holding back cleanup of old tuples.

Hm, it looks like an interesting idea! It may be more dangerous, but
at least it feels much more elegant than the LP_DEAD-related way.
Also, it feels like we may apply this to both phases (the first and second scans).
The original patch (1) helped only the second one (after the call
to set_indexsafe_procflags).

But for the first scan it was allowed only for non-unique indexes
because of:

> * The reason for doing that is to avoid
> * bogus unique-index failures due to concurrent UPDATEs (we might see
> * different versions of the same row as being valid when we pass over them,
> * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
> * does not contain any tuples added to the table while we built the index.

Also, (1) was limited to indexes without expressions and predicates
(2) because such expressions may execute queries against other tables (sic!).
One possible solution is to add some checks to make sure no
user-defined functions are used.
But as far as I understand, it affects only CIC for now and does not
affect the ability to use the proposed technique (updating the snapshot
from time to time).

However, I think we need a more or less formal proof that it is safe - it
is really challenging to keep all the possible cases in one's head. I’ll
try to do something here.
Another possible issue may be caused by the new locking pattern - we
will be required to wait for all transactions started before the end
of the phase to exit.

[1]: https://postgr.es/m/20210115133858.GA18931@alvherre.pgsql
[2]: 
https://www.postgresql.org/message-id/flat/CAAaqYe_tq_Mtd9tdeGDsgQh%2BwMvouithAmcOXvCbLaH2PPGHvA%40mail.gmail.com#cbe3997b75c189c3713f243e25121c20




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-20 Thread Michail Nikolaev
Hello!

> Also, feels like we may apply this to both phases (first and the second 
> scans).
> The original patch (1) was helping only to the second one (after call
> to set_indexsafe_procflags).

Oops, I was wrong here. The original version of the patch was also applied to
both phases.

> Note that the use of such expressions would be a violation of the
> function's definition; it would depend on data from other tables which
> makes the function behave like a STABLE function, as opposed to the
> IMMUTABLE that is required for index expressions. So, I don't think we
> should specially care about being correct for incorrectly marked
> function definitions.

Yes, but such cases could probably cause crashes as well...
So, I think it is better to check for custom functions. But I am
still not sure
whether such limitations are still required for the proposed optimization or not.

> I just realised there is one issue with this design: We can't cheaply
> reset the snapshot during the second table scan:
> It is critically important that the second scan of R/CIC uses an index
> contents summary (made with index_bulk_delete) that was created while
> the current snapshot was already registered.

> So, the "reset the snapshot every so often" trick cannot be applied in
> phase 3 (the rescan), or we'd have to do an index_bulk_delete call
> every time we reset the snapshot. Rescanning might be worth the cost
> (e.g. when using BRIN), but that is very unlikely.

Hm, I think it is still possible. We could just manually recheck the
tuples we see
against the snapshot currently used for the scan. If the "old" snapshot can see
the tuple as well (HeapTupleSatisfiesHistoricMVCC), then search for it in the
index summary.

> What do you mean by "new locking pattern"? We already keep an
> ShareUpdateExclusiveLock on every heap table we're accessing during
> R/CIC, and that should already prevent any concurrent VACUUM
> operations, right?

I was thinking not about "classical" locking, but about waiting for
other backends
by WaitForLockers(heaplocktag, ShareLock, true). But I think
everything should be
fine.

Best regards,
Michail.




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-20 Thread Michail Nikolaev
Hello!

> How would this deal with tuples not visible to the old snapshot?
> Presumably we can assume they're newer than that snapshot (the old
> snapshot didn't have it, but the new one does, so it's committed after
> the old snapshot, making them newer), so that backend must have
> inserted it into the index already, right?

Yes, exactly.

>> HeapTupleSatisfiesHistoricMVCC
> That function has this comment marker:
> "Only usable on tuples from catalog tables!"
> Is that correct even for this?

Yeah, we just need HeapTupleSatisfiesVisibility (which calls
HeapTupleSatisfiesMVCC) instead.

> Should this deal with any potential XID wraparound, too?

Yeah, looks like we should care about such a case somehow.

Possible options here:

1) Skip vac_truncate_clog while CIC is running. In fact, I think it's
not that much worse than the current state - datfrozenxid is still
updated in the catalog and will be considered the next time
vac_update_datfrozenxid is called (the next VACUUM on any table).

2) Delay vac_truncate_clog while CIC is running.
In such a case, if it was skipped, we will need to re-run it using the
index-building backend later.

3) Wait for 64-bit xids :)

4) Any ideas?

In addition, for the first and second options, we need logic to cancel
the second phase in the case of ForceTransactionIdLimitUpdate.
But maybe I'm missing something and the tuples may be frozen, ignoring
the set datfrozenxid values (over some horizon calculated at runtime
based on the backends' xmin).

> How does this behave when the newly inserted tuple's xmin gets frozen?
> This would be allowed to happen during heap page pruning, afaik - no
> rules that I know of which are against that - but it would create
> issues where normal snapshot visibility rules would indicate it
> visible to both snapshots regardless of whether it actually was
> visible to the older snapshot when that snapshot was created...

Yes, good catch.
Assuming we have somehow prevented vac_truncate_clog from occurring
during CIC, we can leave frozen and potentially frozen
(xmin < frozenXID) tuples. So, the tuples to check are:

first phase:
* frozen
* xmin > frozenXID (may not be frozen)
* visible by snapshot

second phase:
* frozen
* xmin>frozenXID (may be frozen)
* not in the index summary
* visible by "old" snapshot

You might also think – why is the first stage needed at all? Just use
batch processing during initial index building?

Best regards,
Mikhail.




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-20 Thread Michail Nikolaev
> Yes, good catch.
> Assuming we have somehow prevented vac_truncate_clog from occurring
> during CIC, we can leave frozen and potentially frozen
> (xmin

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-21 Thread Michail Nikolaev
Hello.

Realized my last idea is invalid (because tuples are frozen using a
dynamically calculated horizon) - so, don't waste your time on it :)

Need to think a little bit more here.

Thanks,
Mikhail.




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2023-12-25 Thread Michail Nikolaev
Hello!

It seems like the idea of "old" snapshot is still a valid one.

> Should this deal with any potential XID wraparound, too?

As far as I understand, in our case we are not affected by this in any way.
Vacuum on our table is not possible because of locking, so nothing
may be frozen (see below).
In the case of super long index building, transaction limits will
stop new connections using the current
regular infrastructure because it is based on relation data (not on the
actual xmin of backends).

> How does this behave when the newly inserted tuple's xmin gets frozen?
> This would be allowed to happen during heap page pruning, afaik - no
> rules that I know of which are against that - but it would create
> issues where normal snapshot visibility rules would indicate it
> visible to both snapshots regardless of whether it actually was
> visible to the older snapshot when that snapshot was created...

As I can see, heap_page_prune never freezes any tuples.
In the case of regular vacuum, it works this way: heap_page_prune is
called first, then heap_prepare_freeze_tuple, and then
heap_freeze_execute_prepared.

Merry Christmas,
Mikhail.




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-01-04 Thread Michail Nikolaev
Hello!

> Correct, but there are changes being discussed where we would freeze
> tuples during pruning as well [0], which would invalidate that
> implementation detail. And, if I had to choose between improved
> opportunistic freezing and improved R/CIC, I'd probably choose
> improved freezing over R/CIC.

As another option, we could extract a dedicated horizon value for
opportunistic freezing,
and use some flags in the R/CIC backend to keep it at the required value.

Best regards,
Michail.




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-01-09 Thread Michail Nikolaev
Hello, Melanie!

Sorry to interrupt you, just a quick question.

> Correct, but there are changes being discussed where we would freeze
> tuples during pruning as well [0], which would invalidate that
> implementation detail. And, if I had to choose between improved
> opportunistic freezing and improved R/CIC, I'd probably choose
> improved freezing over R/CIC.

Do you have any patches/threads related to that refactoring
(opportunistic freezing of tuples during pruning) [0]?
This may affect the idea of the current thread (latest version of it
mostly in [1]) - it may be required to disable such a feature for a
particular relation temporarily, or to affect the horizon used for pruning
(without holding xmin back).

Just not sure - is it reasonable to start coding right now, or wait for
some prune-freeze-related patch first?

[0] 
https://www.postgresql.org/message-id/caakru_a+g2oe6ahjcbibftnfiy2aib4e31x9qyj_qkjxzmz...@mail.gmail.com
[1] 
https://www.postgresql.org/message-id/flat/CANtu0ojRX%3DosoiXL9JJG6g6qOowXVbVYX%2BmDsN%2B2jmFVe%3DeG7w%40mail.gmail.com#a8ff53f23d0fc7edabd446b4d634e7b5




Re: Replace known_assigned_xids_lck by memory barrier

2023-09-05 Thread Michail Nikolaev
Thanks everyone for help!


[PATCH] Btree BackwardScan race condition on Standby during VACUUM

2020-03-16 Thread Michail Nikolaev
Hello, hackers.

-- ABSTRACT --
There is a race condition between btree_xlog_unlink_page and _bt_walk_left.
Many versions are affected, including 12 and the upcoming 13.
It happens only on standby. It seems it could not cause invalid query results.

-- REMARK --
While working on support for index hint bits on standby [1] I started
getting
"ERROR:  could not find left sibling of block  in index "
during stress tests.

I was sure I had broken something in btree and spent a lot of time
trying to figure out what.
And later...  I realized that it is a bug that has been in btree since very old times...
Because of much faster scans with LP_DEAD support on a standby it
happens much more frequently in my case.

-- HOW TO REPRODUCE  --
It is not easy to reproduce the issue but you can try (tested on
REL_12_STABLE and master):

1) Setup master (sync replica and 'remote_apply' are not required -
they just make the scripts simpler):
autovacuum = off
synchronous_standby_names = '*'
synchronous_commit = 'remote_apply'

2) Setup standby:
primary_conninfo = 'user=postgres host=127.0.0.1 port=5432
sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres
target_session_attrs=any'
port = 6543

3) Prepare pgbench file with content (test.bench):
BEGIN;
select * from pgbench_accounts order by aid desc limit 1;
END;

4) Prepare the index:
./pgbench -i -s 10 -U postgres
./psql -U postgres -c "delete from pgbench_accounts where aid IN
(select aid from pgbench_accounts order by aid desc limit 50)"

5) Start index scans on the standby:
./pgbench -f test.bench -j 1 -c ${NUMBER_OF_CORES} -n -P 1 -T 1 -U
postgres -p 6543

6) Run vacuum on the master:
./psql -U postgres -c "vacuum pgbench_accounts"

7) You should see something like this:
> progress: 1.0 s, 5.0 tps, lat 614.530 ms stddev 95.902
> .
> progress: 5.0 s, 10.0 tps, lat 508.561 ms stddev 82.338
> client 3 script 0 aborted in command 1 query 0: ERROR:  could not find left 
> sibling of block 1451 in index "pgbench_accounts_pkey"
> progress: 6.0 s, 47.0 tps, lat 113.001 ms stddev 55.709
> .
> progress: 12.0 s, 84.0 tps, lat 48.451 ms stddev 7.238
> client 2 script 0 aborted in command 1 query 0: ERROR:  could not find left 
> sibling of block 2104 in index "pgbench_accounts_pkey"
> progress: 13.0 s, 87.0 tps, lat 39.338 ms stddev 5.417
> .
> progress: 16.0 s, 158.0 tps, lat 18.988 ms stddev 3.251
> client 4 script 0 aborted in command 1 query 0: ERROR:  could not find left 
> sibling of block 2501 in index "pgbench_accounts_pkey"

I was able to reproduce issue with vanilla PG_12 on different systems
including the Windows machine.
On some servers it happens 100% times. On others - very seldom.

It is possible to radically increase the chance of reproducing the issue by
adding a sleep in btree_xlog_unlink_page[7]:
>   + pg_usleep(10 * 1000L);
>
>   /* Rewrite target page as empty deleted page */
>   buffer = XLogInitBufferForRedo(record, 0);

--  WHAT HAPPENS  --
It is race condition issue between btree_xlog_unlink_page and _bt_walk_left.

btree_xlog_unlink_page removes a page from the btree, changing btpo_prev and
btpo_next of its left and right siblings to point
to each other, and marks the target page as removed. All these
operations are done using one-page-at-a-time locking because of[4]:

>* In normal operation, we would lock all the pages this WAL record
>* touches before changing any of them.  In WAL replay, it should be okay
>* to lock just one page at a time, since no concurrent index updates can
>* be happening, and readers should not care whether they arrive at the
>* target page or not (since it's surely empty).

_bt_walk_left walks left in a very tricky way. Please refer to
src/backend/access/nbtree/README for details[5]:

>Moving left in a backward scan is complicated because we must consider
>the possibility that the left sibling was just split (meaning we must find
>the rightmost page derived from the left sibling), plus the possibility
>that the page we were just on has now been deleted and hence isn't in the
>sibling chain at all anymore.

So, this is how race is happens:

0) this is target page (B) and its siblings.
A <---> B <---> C ---> END

1) walreceiver starts btree_xlog_unlink_page for B. It changes
the links from C to A and from A to C (I hope my scheme will be
displayed correctly):
A <--- B ---> C ---> END
^             ^
 \___________/

But B is not marked as BTP_DELETED yet - walreceiver stops at nbtxlog:697[2].

2) another backend starts _bt_walk_left from B.
It checks A, goes from A to C via the updated btpo_next, and later sees
the end of the btree.
So, the next step is to check whether B was deleted (nbtsearch:2011)[3] and try
to recover.

But B is not yet deleted! It will be marked as BTP_DELETED a few
milliseconds later by the walreceiver, but not yet.
So, the error at nbtsearch:2046[6] happens.


--  HOW TO FIX  --
The first idea was to mark the page as BTP_DELETED before updating sibling links

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

2020-03-27 Thread Michail Nikolaev
Hello.

>  Probably, patch in this thread should fix this in btree_xlog_split() too?

I have spent some time trying to find any possible race condition
between btree_xlog_split and _bt_walk_left… But I can’t find any.
Also, I have tried to cause an issue by putting a pg_sleep into
btree_xlog_split (between releasing and taking locks) but without
any luck.

I agree it is better to keep the same locking logic for primary and
standby in general. But that is possible scope for another patch.

Thanks,
Michail.




Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

2020-04-05 Thread Michail Nikolaev
Hello, Peter.

>  I added
>  something about this to the nbtree README in commit 9f83468b353.

I have added some updates to your notes in the updated patch version.

I was also trying to keep the original wrapping of the paragraph, so
the patch looks a bit wordy.

Thanks,
Michail.


btree-race-and-docs.patch
Description: Binary data


Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-01-23 Thread Michail Nikolaev
Hello, Justin.

Thanks for your attention.
After some investigation, I think I have found the problem. It is
caused by XLOG_RUNNING_XACTS arriving at an undetermined moment (some test
parts rely on it).

Now the test waits for XLOG_RUNNING_XACTS to happen (maximum is 15s) and
proceeds forward.

I'll move entry back to "Ready for Committer" once it passes tests.

Best regards,
Michail.
From a46315fd96b5432241ab6c67c37493ef41d7dc73 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Sun, 23 Jan 2022 20:47:51 +0300
Subject: [PATCH v8 2/3] test

---
 src/test/recovery/Makefile|   1 +
 .../recovery/t/027_standby_index_lp_dead.pl   | 372 ++
 2 files changed, 373 insertions(+)
 create mode 100644 src/test/recovery/t/027_standby_index_lp_dead.pl

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index e3011c3e37..b5eb3a3a70 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -10,6 +10,7 @@
 #-
 
 EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL+=contrib/pageinspect
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/027_standby_index_lp_dead.pl b/src/test/recovery/t/027_standby_index_lp_dead.pl
new file mode 100644
index 00..5237d7603c
--- /dev/null
+++ b/src/test/recovery/t/027_standby_index_lp_dead.pl
@@ -0,0 +1,372 @@
+# Checks that index hints on standby work as expected.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Config;
+
+plan tests => 30;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', qq{
+autovacuum = off
+enable_seqscan = off
+enable_indexonlyscan = off
+checkpoint_timeout = 1h
+});
+$node_primary->start;
+
+$node_primary->safe_psql('postgres', 'CREATE EXTENSION pageinspect');
+# Create test table with primary index
+$node_primary->safe_psql(
+'postgres', 'CREATE TABLE test_table (id int, value int)');
+$node_primary->safe_psql(
+'postgres', 'CREATE INDEX test_index ON test_table (value, id)');
+# Fill some data into it; note: do not put a lot of records, to avoid a
+# heap_page_prune_opt call which causes a recovery conflict hiding the
+# conflict caused by index hint bits
+$node_primary->safe_psql('postgres',
+'INSERT INTO test_table VALUES (generate_series(1, 30), 0)');
+# And vacuum to allow index hint bits to be set
+$node_primary->safe_psql('postgres', 'VACUUM test_table');
+# For fail-fast in case FPW from primary
+$node_primary->safe_psql('postgres', 'CHECKPOINT');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Restore standby node from backup
+my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
+$node_standby_1->init_from_backup($node_primary, $backup_name,
+has_streaming => 1);
+
+my $standby_settings = qq{
+max_standby_streaming_delay = 1
+wal_receiver_status_interval = 1
+hot_standby_feedback = off
+autovacuum = off
+enable_seqscan = off
+enable_indexonlyscan = off
+checkpoint_timeout = 1h
+};
+$node_standby_1->append_conf('postgresql.conf', $standby_settings);
+$node_standby_1->start;
+
+$node_standby_1->backup($backup_name);
+
+# Create second standby node linking to standby 1
+my $node_standby_2 = PostgreSQL::Test::Cluster->new('standby_2');
+$node_standby_2->init_from_backup($node_standby_1, $backup_name,
+has_streaming => 1);
+$node_standby_2->append_conf('postgresql.conf', $standby_settings);
+$node_standby_2->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(300);
+
+# One psql to run commands at the repeatable read isolation level.
+# It is used to test the xactStartedInRecovery snapshot after promotion.
+# Also, it is used to check the fact that an active snapshot on standby
+# prevents LP_DEAD from being set (ComputeXidHorizons works on standby).
+my %psql_standby_repeatable_read = ('stdin' => '', 'stdout' => '', 'stderr' => '');
+$psql_standby_repeatable_read{run} =
+IPC::Run::start(
+[ 'psql', '-XAb', '-f', '-', '-d', $node_standby_1->connstr('postgres') ],
+'<', \$psql_standby_repeatable_read{stdin},
+'>', \$psql_standby_repeata

Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-01-25 Thread Michail Nikolaev
Hello, Julien.

> I rebased the pathset in attached v9.  Please double check that I didn't miss
> anything in the rebase.

Thanks a lot for your help.

> I will let you mark the patch as Ready for Committer once you validate that 
> the
> rebase was ok.

Yes, rebase looks good.

Best regards,
Michail.




Re: BufferAlloc: don't take two simultaneous locks

2022-01-30 Thread Michail Nikolaev
Hello, Yura.

Test results look promising. But it seems like the naming and dynahash
API change is a little confusing.

1) I think it is better to split the main part and atomic nentries
optimization into separate commits.
2) Also, it would be nice to also fix hash_update_hash_key bug :)
3) Do we really need a SIZEOF_LONG check? I think pg_atomic_uint64 is
fine these days.
4) Looks like hash_insert_with_hash_nocheck could potentially break
the hash table. Is it better to replace it with
hash_search_with_hash_value with HASH_ATTACH action?
5) In such a case hash_delete_skip_freelist with
hash_search_with_hash_value with HASH_DETTACH.
6) And then hash_return_to_freelist -> hash_dispose_dettached_entry?

Another approach is a new version of hash_update_hash_key with
callbacks. Probably it is the most "correct" way to keep a hash table
implementation details closed. It should be doable, I think.

Thanks,
Michail.




Re: BufferAlloc: don't take two simultaneous locks

2022-02-06 Thread Michail Nikolaev
Hello, Yura.

One additional point:

> 1332: Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
> 1333: CLEAR_BUFFERTAG(buf->tag);
> 1334: buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
> 1335: UnlockBufHdr(buf, buf_state);

I think there is no sense in unlocking the buffer here because it will be
locked again after a few moments (and no one is able to find it anyway). Of
course, it should be unlocked in case of a collision.

BTW, I still think it is better to introduce some kind of
hash_update_hash_key and use it.

It may look like this:

// should be called with oldPartitionLock acquired
// newPartitionLock hold on return
// oldPartitionLock and newPartitionLock are not taken at the same time
// if newKeyPtr is present - existingEntry is removed
bool hash_update_hash_key_or_remove(
  HTAB *hashp,
  void *existingEntry,
  const void *newKeyPtr,
  uint32 newHashValue,
  LWLock *oldPartitionLock,
  LWLock *newPartitionLock
);

Thanks,
Michail.




Re: Slow standby snapshot

2022-11-16 Thread Michail Nikolaev
Hello everyone.

> However ... I tried to reproduce the original complaint, and
> failed entirely.  I do see KnownAssignedXidsGetAndSetXmin
> eating a bit of time in the standby backends, but it's under 1%
> and doesn't seem to be rising over time.  Perhaps we've already
> applied some optimization that ameliorates the problem?  But
> I tested v13 as well as HEAD, and got the same results.

> Hmm.  I wonder if my inability to detect a problem is because the startup
> process does keep ahead of the workload on my machine, while it fails
> to do so on the OP's machine.  I've only got a 16-CPU machine at hand,
> which probably limits the ability of the primary to saturate the standby's
> startup process.

Yes, the optimization by Andres Freund made things much better, but the
impact is still noticeable.

I was also using 16-CPU machines - two of them (primary and standby).

Here are the scripts (1) I was using for the benchmark - maybe they could help.


> Nowadays we've *got* those primitives.  Can we get rid of
> known_assigned_xids_lck, and if so would it make a meaningful
> difference in this scenario?

I already tried that, but was unable to find real benefits from it.
A WIP patch is attached.

Hm, I see I sent it to the list, but it is absent in the archives... Just
quoting from it:

> First potential positive effect I could see is
> (TransactionIdIsInProgress -> KnownAssignedXidsSearch) locking but
> seems like it is not on standby hotpath.

> Second one - locking for KnownAssignedXidsGetAndSetXmin (build
> snapshot). But I was unable to measure impact. It wasn’t visible
> separately in (3) test.

> Maybe someone knows scenario causing known_assigned_xids_lck or
> TransactionIdIsInProgress become bottleneck on standby?

That last question still stands :)

> I think it might be a bigger effect than one might immediately think. Because
> the spinlock will typically be on the same cacheline as head/tail, and because
> every spinlock acquisition requires the cacheline to be modified (and thus
> owned exclusively) by the current core, uses of head/tail will very commonly
> be cache misses even in workloads without a lot of KAX activity.

I was trying to find some way to practically achieve any noticeable
impact here, but without success.

>> But yeah, it does feel like the proposed
>> approach is only going to be optimal over a small range of conditions.

> In particular, it doesn't adapt at all to workloads that don't replay all that
> much, but do compute a lot of snapshots.

The approach (2) was optimized to avoid any additional work for all processes
except the startup process (an approach with offsets to skip gaps while
building a snapshot).


[1]: https://gist.github.com/michail-nikolaev/e1dfc70bdd7cfd1b902523dbb3db2f28
[2]: 
https://www.postgresql.org/message-id/flat/CANtu0ogzo4MsR7My9%2BNhu3to5%3Dy7G9zSzUbxfWYOn9W5FfHjTA%40mail.gmail.com#341a3c3b033f69b260120b3173a66382

--
Michail Nikolaev
From 9759ea174db5e140415e4ce80a62f63075e91fe3 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Sun, 21 Nov 2021 21:23:02 +0300
Subject: [PATCH] memory barrier instead of spinlock

---
 src/backend/storage/ipc/procarray.c | 42 +++++++++---------------------------------
 1 file changed, 9 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4da53a5b3f..bca336a80d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -60,7 +60,6 @@
 #include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
-#include "storage/spin.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
@@ -81,7 +80,6 @@ typedef struct ProcArrayStruct
 	int			numKnownAssignedXids;	/* current # of valid entries */
 	int			tailKnownAssignedXids;	/* index of oldest valid element */
 	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-	slock_t		known_assigned_xids_lck;	/* protects head/tail pointers */
 
 	/*
 	 * Highest subxid that has been removed from KnownAssignedXids array to
@@ -424,7 +422,6 @@ CreateSharedProcArray(void)
 	procArray->numKnownAssignedXids = 0;
 	procArray->tailKnownAssignedXids = 0;
 	procArray->headKnownAssignedXids = 0;
-	SpinLockInit(&procArray->known_assigned_xids_lck);
 	procArray->lastOverflowedXid = InvalidTransactionId;
 	procArray->replication_slot_xmin = InvalidTransactionId;
 	procArray->replication_slot_catalog_xmin = InvalidTransactionId;
@@ -4552,10 +4549,9 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * p

Re: Slow standby snapshot

2022-11-20 Thread Michail Nikolaev
Hello.

On Wed, Nov 16, 2022 at 3:44 AM Andres Freund  wrote:
> Approach 1:

> We could have an atomic variable in ProcArrayStruct that counts the amount of
> wasted effort and have processes update it whenever they've wasted a
> meaningful amount of effort.  Something like counting the skipped elements in
> KnownAssignedXidsGetAndSetXmin in a function local static variable and
> updating the shared counter whenever that reaches

I made a WIP patch for that approach and ran some initial tests. It
seems to work pretty well.
At least it is better than the previous approaches for standbys without a
high read-only load.

Both the patch and a graph are attached. The strange numbers are the limit
of wasted work that triggers compression.
I have used the same testing script and configuration (1) as before
(two 16-CPU machines, a long transaction on the primary at the 60th second,
simple-update and select-only for pgbench).

If this approach looks committable, I could do more careful
performance testing to find the best value for
WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS.

[1]: https://gist.github.com/michail-nikolaev/e1dfc70bdd7cfd1b902523dbb3db2f28
--
Michail Nikolaev
Index: src/backend/storage/ipc/procarray.c
===
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
--- a/src/backend/storage/ipc/procarray.c	(revision fb32748e32e2b6b2fcb32220980b93d5436f855e)
+++ b/src/backend/storage/ipc/procarray.c	(date 1668950653116)
@@ -272,6 +272,7 @@
  */
 static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
+static pg_atomic_uint32 *KnownAssignedXidsWastedSnapshotWork;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -451,6 +452,10 @@
 			ShmemInitStruct("KnownAssignedXidsValid",
 			mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
 			&found);
+		KnownAssignedXidsWastedSnapshotWork = (pg_atomic_uint32 *)
+			ShmemInitStruct("KnownAssignedXidsWastedSnapshotWork",
+			sizeof(pg_atomic_uint32), &found);
+		pg_atomic_init_u32(KnownAssignedXidsWastedSnapshotWork, 0);
 	}
 }
 
@@ -4616,20 +4621,9 @@
 
 	if (!force)
 	{
-		/*
-		 * If we can choose how much to compress, use a heuristic to avoid
-		 * compressing too often or not often enough.
-		 *
-		 * Heuristic is if we have a large enough current spread and less than
-		 * 50% of the elements are currently in use, then compress. This
-		 * should ensure we compress fairly infrequently. We could compress
-		 * less often though the virtual array would spread out more and
-		 * snapshots would become more expensive.
-		 */
-		int			nelements = head - tail;
-
-		if (nelements < 4 * PROCARRAY_MAXPROCS ||
-			nelements < 2 * pArray->numKnownAssignedXids)
+#define WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS 111
+		if (pg_atomic_read_u32(KnownAssignedXidsWastedSnapshotWork)
+	< WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS)
 			return;
 	}
 
@@ -4650,6 +4644,8 @@
 
 	pArray->tailKnownAssignedXids = 0;
 	pArray->headKnownAssignedXids = compress_index;
+	/* Reset wasted work counter */
+	pg_atomic_write_u32(KnownAssignedXidsWastedSnapshotWork, 0);
 }
 
 /*
@@ -5031,6 +5027,7 @@
 KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
 			   TransactionId xmax)
 {
+	ProcArrayStruct *pArray = procArray;
 	int			count = 0;
 	int			head,
 tail;
@@ -5078,6 +5075,10 @@
 		}
 	}
 
+	/* Add number of invalid items scanned to wasted work counter */
+	pg_atomic_add_fetch_u32(KnownAssignedXidsWastedSnapshotWork,
+			(head - tail) - pArray->numKnownAssignedXids);
+
 	return count;
 }
 


Re: Slow standby snapshot

2022-11-20 Thread Michail Nikolaev
Oops, wrong image - this is the correct one. But these are single-run
tests, so they show only a basic correlation.


Re: Slow standby snapshot

2022-11-20 Thread Michail Nikolaev
Oh, it seems it is not my day :) The image is fixed again.


Re: Slow standby snapshot

2022-11-22 Thread Michail Nikolaev
Hello, everyone.

I have tried to put it all together.

> In the absence of that approach, falling back to a counter that
> compresses every N xids would be best, in addition to the two new
> forced compression events.

Done.

> Also, if we add more forced compressions, it seems like we should have
> a short-circuit for a forced compression where there's nothing to do.

Done.

> I'm also wondering why there's not an
>
> Assert(compress_index == pArray->numKnownAssignedXids);
>
> after the loop, to make sure our numKnownAssignedXids tracking
> is sane.

Done.

> * when idle - since we have time to do it when that happens, which
> happens often since most workloads are bursty

I have added acquisition of ProcArrayLock for this case.
Also, I have capped the frequency at once per second to avoid
contention with a heavy read load in the case of small,
episodic but regular WAL traffic (WakeupRecovery() every 100ms, for
example). Or is that useless?

> It'd be more reliable
> to use a static counter to skip all but every N'th compress attempt
> (something we could do inside KnownAssignedXidsCompress itself, instead
> of adding warts at the call sites).

Done. I have added a “reason” enum for calling KnownAssignedXidsCompress
to keep it as clean as possible.
But I am not sure I was successful here.

Also, while we are still in this context, I think it is worth adding:
* Simon's optimization (1) for KnownAssignedXidsRemoveTree (it is
simple and effective in some situations, with no observed drawbacks)
* Maybe my patch (2) for replacing known_assigned_xids_lck with a memory barrier?

A new version is attached. I am running benchmarks now.
Preliminary results are attached (16 CPUs, 5000 max_connections, now 64
active connections instead of 16).
Also, an interesting observation: with 64 connections, the vanilla version
is unable to recover its performance after a 30-second transaction on the primary.

[1]: 
https://www.postgresql.org/message-id/flat/CANbhV-Ey8HRYPvnvQnsZAteCfzN3VHVhZVKfWMYcnjMnSzs4dQ%40mail.gmail.com#05993cf2bc87e35e0dff38fec26b9805
[2]: 
https://www.postgresql.org/message-id/flat/CANtu0oiPoSdQsjRd6Red5WMHi1E83d2%2B-bM9J6dtWR3c5Tap9g%40mail.gmail.com#cc4827dee902978f93278732435e8521

--
Michail Nikolaev
From 926403598e9506edfa32c9534be9a4e48f756002 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Wed, 23 Nov 2022 00:13:32 +0300
Subject: [PATCH vnext] Currently, KnownAssignedXidsGetAndSetXmin requires an
 iterative loop through the KnownAssignedXids array, including xids marked as
 invalid. The performance impact is especially noticeable in the presence of
 long (a few seconds) transactions on the primary, a high value (a few
 thousand) of max_connections, and a high read workload on the standby. Most
 of the CPU is spent looping through KnownAssignedXids while almost all xids
 are invalid anyway. KnownAssignedXidsCompress removes invalid xids from time
 to time, but performance is still affected.

To increase performance, the frequency of running KnownAssignedXidsCompress is increased. It is now called for each xid % 64 == 0 (the number was selected by running benchmarks). Also, the minimum bound on the number of elements to compress (4 * PROCARRAY_MAXPROCS) is removed.

Additionally, compression is called on RecoveryInfo and when the startup process is idle.

Simon Riggs and Michail Nikolaev, with help from Tom Lane and
Andres Freund.
---
 src/backend/access/transam/xlogrecovery.c |   3 +
 src/backend/storage/ipc/procarray.c   | 103 +-
 src/include/storage/standby.h |   1 +
 3 files changed, 85 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea..2cc9581cb6 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3836,6 +3836,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		streaming_reply_sent = true;
 	}
 
+	/* Do any background tasks that might benefit us later */
+	StartupProcessIdleMaintenance();
+
 	/* Update pg_stat_recovery_prefetch before sleeping. */
 	XLogPrefetcherComputeStats(xlogprefetcher);
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 283517d956..fde748519e 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -274,6 +274,10 @@ static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
+/* Counters for KnownAssignedXids compression heuristic */
+static int transactionEndsCounter;
+static TimestampTz lastCompressTs;
+
 /*
  * If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is
  * the highest xid that might still be running that we don't have in
@@ -335,8 +339,16 @@ static void DisplayXidCache(void);
 #define xc_slow_answer_inc()		((void) 0)
 #endif			/* XI

Re: Slow standby snapshot

2022-11-29 Thread Michail Nikolaev
Hello, Tom.

> Since we're running out of time in the current commitfest,
> I went ahead and changed that, and made the cosmetic fixes
> I wanted, and pushed.

Great, thanks!

One small thing I was thinking of adding to KnownAssignedXidsCompress is
an assertion like

Assert(MyBackendType == B_STARTUP);

Just to make it more clear that locking is not the only thing required
for the call.

>  I'd be willing to
> make the memory barrier change anyway, because that seems like
> a simple change that can't hurt.

I'm going to create a separate commitfest entry for it, OK?

Best regards,
Michail.




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-05-07 Thread Michail Nikolaev
Hello, Antonin.

> I'm trying to review the patch, but not sure if I understand this problem,
> please see my comment below.

Thanks a lot for your attention. It is strongly recommended to look at
version N3 (1) because it is a much more elegant, easy, and reliable
solution :) But the minRecoveryPoint-related issue affects it anyway.

> Why doesn't minRecoveryPoint get updated to 20? IMO that should happen by
> replaying the commit record. And if the standby happens to crash before the
> commit record could be replayed, no query should see the deletion and thus no
> hint bit should be set in the index.

minRecoveryPoint is not affected by replaying the commit record in
most cases. It is updated in a lazy way, something like this:
minRecoveryPoint = max LSN of flushed pages. Version 3 of the patch
contains code_optional.patch, which moves minRecoveryPoint more
aggressively to get additional performance on the standby (based on
Peter’s answer in (2)).

So, “minRecoveryPoint will go here” is not because of “STANDBY INSERTS
NEW ROW IN INDEX”; it is just a random event.

Thanks,
Michail.

[1]: 
https://www.postgresql.org/message-id/CANtu0ohHu1r1xQfTzEJuxeaOMYncG7xRxUQWdH%3DcMXZSf%2Bnzvg%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CAH2-WzkSUcuFukhJdSxHFgtL6zEQgNhgOzNBiTbP_4u%3Dk6igAg%40mail.gmail.com
(“Also, btw, do you know any reason to keep minRecoveryPoint at a low
value?”)




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-05-10 Thread Michail Nikolaev
Hello,
Antonin.

> Sorry, I missed the fact that your example can be executed inside BEGIN - END
> block, in which case minRecoveryPoint won't advance after each command.

No, the block is not executed as a single transaction; all commands
are separate transactions (see below)

> Actually I think that a commit record should be replayed
> more often than XLOG_RUNNING_XACTS, shouldn't it?

Yes, but replaying commit records DOES NOT affect minRecoveryPoint in
almost all cases.

UpdateMinRecoveryPoint is called by XLogFlush, but xact_redo_commit
calls XLogFlush only in two cases:
* DropRelationFiles is called (some relations are dropped)
* ForceSyncCommit was used on the primary - a few “heavy” commands, like
DropTableSpace, CreateTableSpace, movedb, etc.

But a “regular” commit record is replayed without XLogFlush and, as a
result, without UpdateMinRecoveryPoint.

So, in practice, minRecoveryPoint is updated in an “async” way
by the checkpoint job. This is why it makes sense to call
UpdateMinRecoveryPoint on XLOG_RUNNING_XACTS.

Thanks,
Michail.




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-05-12 Thread Michail Nikolaev
Hello,
Antonin.

> My review that started in [1] continues here.
Thanks a lot for the review.

> (Please note that code.patch does not apply to the current master branch.)
Rebased.

> Especially for the problem discussed in [1] it should be
> explained what would happen if kill_prior_tuple_min_lsn was not checked.
Updated the README; hope it is better now. Also, added a few details
related to the flushing of hint bits.

> However I think there's one more case: if heap_hot_search_buffer() considers
> all tuples in the chain to be "surely dead", but
> HeapTupleHeaderAdvanceLatestRemovedXid() skips them all for this reason:
Yes, good catch, missed it.

> I think that the dead tuples produced this way should never be visible on the
> standby (and even if they were, they would change the page LSN so your
> algorithm would treat them correctly) so I see no correctness problem. But it
> might be worth explaining better the meaning of invalid "latest_removed_xid"
> in comments.
Added additional comment.

> but it's not clear to me what "latestRemovedXid" is. If you mean the
> scan->kill_prior_tuple_min_lsn field, you probably need more words to explain
> it.
Hope it is better now.

> should probably be
>/* It is always allowed on primary if *all_dead. */
Fixed.

> As the function is only called if (so->numKilled > 0), I think both
> "killedsomething" and "dirty" variables should always have the same value, so
> one variable should be enough. Assert(so->numKilled) would be appropriate in
> that case.
Fixed, but only partly, because I have added additional checks for a
long transaction in the case of a promoted server.

> "+applying the fill page write."
Fixed.

Updated version in attach.

Thanks a lot,
Michail.
From 004b2dea9b700d890147b840573bb5b796c1f96a Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Wed, 12 May 2021 22:56:18 +0300
Subject: [PATCH v2 4/4] doc

---
 src/backend/access/nbtree/README | 35 ++--
 src/backend/storage/page/README  |  8 +---
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index bfe33b6b43..969d7b6928 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -705,17 +705,30 @@ lax about how same-level locks are acquired during recovery (most kinds
 of readers could still move right to recover if we didn't couple
 same-level locks), but we prefer to be conservative here.
 
-During recovery all index scans start with ignore_killed_tuples = false
-and we never set kill_prior_tuple. We do this because the oldest xmin
-on the standby server can be older than the oldest xmin on the primary
-server, which means tuples can be marked LP_DEAD even when they are
-still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
-they can still appear in the standby because of full page writes. So
-we must always ignore them in standby, and that means it's not worth
-setting them either.  (When LP_DEAD-marked tuples are eventually deleted
-on the primary, the deletion is WAL-logged.  Queries that run on a
-standby therefore get much of the benefit of any LP_DEAD setting that
-takes place on the primary.)
+There is some complexity in using LP_DEAD bits during recovery. Generally,
+the bits can be set and read by scans, but it is possible to encounter a
+bit that was set on the primary. We don't WAL-log tuple LP_DEAD bits, but
+they can still appear on the standby because of full-page writes. Such
+bits could cause MVCC failures because the oldest xmin on the standby
+server can be older than the oldest xmin on the primary server, which means
+tuples can be marked LP_DEAD even when they are still visible on the standby.
+
+To prevent such failures, we mark pages with LP_DEAD bits set by the standby
+with a special hint. In the case of an FPW from the primary, the hint is
+always cleared while applying the full-page write, so LP_DEAD bits received
+from the primary are ignored on the standby. Also, the standby clears all
+LP_DEAD bits set by the primary on a page before setting its own bits.
+
+There are restrictions on setting LP_DEAD bits on the standby related to the
+minRecoveryPoint value. In case of crash recovery, the standby will start to
+process queries after replaying WAL to the minRecoveryPoint position (a kind
+of rewind to a previous state). At the same time, the setting of LP_DEAD bits
+is not protected by WAL in any way. So, to mark a tuple as dead we must be
+sure it was "killed" before minRecoveryPoint (comparing the LSN of the commit
+record). Another valid option is to compare the "killer" LSN with the index
+page LSN, because minRecoveryPoint is moved forward if the index page is
+flushed. Also, in some cases the xid of the "killer" is unknown - tuples were cleared by XLOG_HEAP2_CLEAN. In that case, 

Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-05-13 Thread Michail Nikolaev
Hello.

Added a check for standby promotion with a long transaction to the
test (code and docs are unchanged).

Thanks,
Michail.
From c5e1053805c537b50b0922151bcf127754500adb Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Fri, 14 May 2021 00:32:30 +0300
Subject: [PATCH v3 3/4] test

---
 src/test/recovery/Makefile|   1 +
 .../recovery/t/022_standby_index_lp_dead.pl   | 265 ++
 2 files changed, 266 insertions(+)
 create mode 100644 src/test/recovery/t/022_standby_index_lp_dead.pl

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 96442ceb4e..6399184a8c 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -10,6 +10,7 @@
 #-
 
 EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL+=contrib/pageinspect
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/022_standby_index_lp_dead.pl b/src/test/recovery/t/022_standby_index_lp_dead.pl
new file mode 100644
index 00..fc91f789a1
--- /dev/null
+++ b/src/test/recovery/t/022_standby_index_lp_dead.pl
@@ -0,0 +1,265 @@
+# Checks that index hints on standby work as expected.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 18;
+use Config;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', qq{
+autovacuum = off
+enable_seqscan = off
+enable_indexonlyscan = off
+});
+$node_primary->start;
+
+$node_primary->safe_psql('postgres', 'CREATE EXTENSION pageinspect');
+# Create test table with primary index
+$node_primary->safe_psql(
+'postgres', 'CREATE TABLE test_table (id int, value int)');
+$node_primary->safe_psql(
+'postgres', 'CREATE INDEX test_index ON test_table (value, id)');
+# Fill it with some data; note: do not insert a lot of records, to avoid a
+# heap_page_prune_opt call which causes a recovery conflict hiding the
+# conflict caused by index hint bits
+$node_primary->safe_psql('postgres',
+'INSERT INTO test_table VALUES (generate_series(1, 30), 0)');
+# And vacuum to allow index hint bits to be set
+$node_primary->safe_psql('postgres', 'VACUUM test_table');
+# To fail fast in case of an FPW from the primary
+$node_primary->safe_psql('postgres', 'CHECKPOINT');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Restore standby node from the backup
+my $node_standby_1 = get_new_node('standby_1');
+$node_standby_1->init_from_backup($node_primary, $backup_name,
+has_streaming => 1);
+
+my $standby_settings = qq{
+max_standby_streaming_delay = 1
+wal_receiver_status_interval = 1
+hot_standby_feedback = off
+enable_seqscan = off
+enable_indexonlyscan = off
+};
+$node_standby_1->append_conf('postgresql.conf', $standby_settings);
+$node_standby_1->start;
+
+$node_standby_1->backup($backup_name);
+
+# Create second standby node linking to standby 1
+my $node_standby_2 = get_new_node('standby_2');
+$node_standby_2->init_from_backup($node_standby_1, $backup_name,
+has_streaming => 1);
+$node_standby_2->append_conf('postgresql.conf', $standby_settings);
+$node_standby_2->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(10);
+
+# One psql to run command in repeatable read isolation level
+my %psql_standby_repeatable_read = ('stdin' => '', 'stdout' => '', 'stderr' => '');
+$psql_standby_repeatable_read{run} =
+IPC::Run::start(
+[ 'psql', '-XAb', '-f', '-', '-d', $node_standby_1->connstr('postgres') ],
+'<', \$psql_standby_repeatable_read{stdin},
+'>', \$psql_standby_repeatable_read{stdout},
+'2>', \$psql_standby_repeatable_read{stderr},
+$psql_timeout);
+
+# Another psql to run command in read committed isolation level
+my %psql_standby_read_committed = ('stdin' => '', 'stdout' => '', 'stderr' => '');
+$psql_standby_read_committed{run} =
+IPC::Run::start(
+[ 'psql', '-XAb', '-f', '-', '-d', $node_standby_1->connstr('postgres') ],
+'<', \$psql_standby_read_committed{stdin},
+'>', \$psql_stand

Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-01-22 Thread Michail Nikolaev
Hello.

I have registered it as patch in the commit fest:
https://commitfest.postgresql.org/42/4138/

Best regards,
Michail.




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-02-04 Thread Michail Nikolaev
Hello, Andres.

> Not sure I like TransactionIdRetreatSafely() as a name. Maybe
> TransactionIdRetreatClamped() is better?

I think it is better to just replace TransactionIdRetreatedBy.
It is not used anymore after
`v3-0001-WIP-Fix-corruption-due-to-vacuum_defer_cleanup_ag.patch` -
so, it is better to replace the dangerous version in order to avoid
similar issues in the future.
But we also need to move FullXidRelativeTo in that case (not sure that is safe).

> I think it'll make the code easier to read in the other places too, the
> variable names / function names in this space are uncomfortably long to
> fit into 78chars..., particularly when there's two references to the
> same variable in the same line.

Looks fine to my taste, but it is pretty far from perfect :)

Best regards,
Michail.
From ffa80fece3384e3a12a52de18102a0d169f52841 Mon Sep 17 00:00:00 2001
From: Nkey 
Date: Sat, 4 Feb 2023 15:16:52 +0300
Subject: [PATCH] WIP: Fix corruption due to vacuum_defer_cleanup_age
 underflowing 64bit xids

---
 src/backend/storage/ipc/procarray.c | 56 ++++++++++++++++----------------------------------------
 src/include/access/transam.h        | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+), 40 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4340bf9641..5dd662cd75 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -368,8 +368,6 @@ static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 static void MaintainLatestCompletedXid(TransactionId latestXid);
 static void MaintainLatestCompletedXidRecovery(TransactionId latestXid);
 
-static inline FullTransactionId FullXidRelativeTo(FullTransactionId rel,
-												  TransactionId xid);
 static void GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons);
 
 /*
@@ -1889,16 +1887,32 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	 * in varsup.c.  Also note that we intentionally don't apply
 	 * vacuum_defer_cleanup_age on standby servers.
 	 */
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->shared_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
+
 	h->oldest_considered_running =
 		TransactionIdRetreatedBy(h->oldest_considered_running,
-								 vacuum_defer_cleanup_age);
-	h->shared_oldest_nonremovable =
+								 vacuum_defer_cleanup_age,
+								 h->latest_completed);
+
+	h->shared_oldest_nonremovable =
 		TransactionIdRetreatedBy(h->shared_oldest_nonremovable,
-								 vacuum_defer_cleanup_age);
-	h->data_oldest_nonremovable =
+								 vacuum_defer_cleanup_age,
+								 h->latest_completed);
+
+	h->data_oldest_nonremovable =
 		TransactionIdRetreatedBy(h->data_oldest_nonremovable,
-								 vacuum_defer_cleanup_age);
+								 vacuum_defer_cleanup_age,
+								 h->latest_completed);
+
 	/* defer doesn't apply to temp relations */
+
+	Assert(TransactionIdPrecedesOrEquals(h->oldest_considered_running,
+										 h->shared_oldest_nonremovable));
+	Assert(TransactionIdPrecedesOrEquals(h->shared_oldest_nonremovable,
+										 h->data_oldest_nonremovable));
 }
 
 /*
@@ -2471,7 +2485,7 @@ GetSnapshotData(Snapshot snapshot)
 
 		/* apply vacuum_defer_cleanup_age */
 		def_vis_xid_data =
-			TransactionIdRetreatedBy(xmin, vacuum_defer_cleanup_age);
+			TransactionIdRetreatedBy(xmin, vacuum_defer_cleanup_age, oldestfxid);
 
 		/* Check whether there's a replication slot requiring an older xmin. */
 		def_vis_xid_data =
@@ -4295,32 +4309,6 @@ GlobalVisCheckRemovableXid(Relation rel, TransactionId xid)
  

Re: Slow standby snapshot

2021-06-13 Thread Michail Nikolaev
Hello.

> I recently ran into a problem in one of our production postgresql cluster.
> I had noticed lock contention on procarray lock on standby, which causes WAL
> replay lag growth.

Yes, I saw the same issue on my production cluster.

> 1) set max_connections to big number, like 10

I made the tests with a more realistic value - 5000. It is a valid value
for Amazon RDS, for example (the default is
LEAST({DBInstanceClassMemory/9531392}, 5000)).

The test looks like this:

pgbench -i -s 10 -U postgres -d postgres
pgbench -b select-only -p 6543 -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
pgbench -b simple-update -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
long transaction on primary - begin;select txid_current();
perf top -p 

So, on postgres 14 (master) the non-patched version looks like this:

   5.13%  postgres[.] KnownAssignedXidsGetAndSetXmin
   4.61%  postgres[.] pg_checksum_block
   2.54%  postgres[.] AllocSetAlloc
   2.44%  postgres[.] base_yyparse

It is too much to spend 5-6% of CPU running through an array :) I think
it should be fixed for both the 13 and 14 versions.

The patched version looks like this (I was unable to spot
KnownAssignedXidsGetAndSetXmin):

   3.08%  postgres[.] pg_checksum_block
   2.89%  postgres[.] AllocSetAlloc
   2.66%  postgres[.] base_yyparse
   2.00%  postgres[.] MemoryContextAllocZeroAligned

On postgres 13, the non-patched version looks even worse (it definitely
needs to be fixed, in my opinion):

  26.44%  postgres   [.] KnownAssignedXidsGetAndSetXmin
   2.17%  postgres[.] base_yyparse
   2.01%  postgres[.] AllocSetAlloc
   1.55%  postgres[.] MemoryContextAllocZeroAligned

But your patch does not apply to REL_13_STABLE. Could you please
provide two versions?

Also, there are warnings while building with patch:

procarray.c:4595:9: warning: ISO C90 forbids mixed
declarations and code [-Wdeclaration-after-statement]
4595 | int prv = -1;
 | ^~~
procarray.c: In function ‘KnownAssignedXidsGetOldestXmin’:
procarray.c:5056:5: warning: variable ‘tail’ set but not used
[-Wunused-but-set-variable]
5056 | tail;
 | ^~~~
procarray.c:5067:38: warning: ‘i’ is used uninitialized in
this function [-Wuninitialized]
5067 | i = KnownAssignedXidsValidDLL[i].nxt;


Some of them are clear errors, so, please recheck the code.

Also, maybe it is better to reduce the invasiveness by using a simpler
approach. For example, use the first bit to mark an xid as valid
and the remaining 7 bits (128 values) as an optimistic offset to the next
valid xid (jump by 127 steps in the worst case).
What do you think?

Also, it is a good idea to register the patch in the commitfest app
(https://commitfest.postgresql.org/).

Thanks,
Michail.




Re: Slow standby snapshot

2021-07-11 Thread Michail Nikolaev
Hello, Kirill.

> Also, maybe it is better to reduce the invasiveness by using a simpler
> approach. For example, use the first bit to mark an xid as valid
> and the remaining 7 bits (128 values) as an optimistic offset to the next
> valid xid (jump by 127 steps in the worst case).
> What do you think?

I have tried such an approach, but it looks like it is not effective,
probably because of CPU caching issues.

I have looked again at your patch, and it seems like it has a lot of
issues at the moment:

* error in KnownAssignedXidsGetOldestXmin, `i` is uninitialized, logic is wrong
* error in compressing function
(```KnownAssignedXidsValidDLL[compress_index].prv = prv;```, `prv` is
never updated)
* probably other errors?
* compilation warnings
* the logic around `KAX_DLL_ENTRY_INVALID` looks a little too complex
* variable/method placement is poor (see `KAX_E_INVALID` and others)
* need to update documentation about KnownAssignedXidsValid, see ```To
keep individual deletions cheap, we need to allow gaps in the array```
in procarray.c
* formatting is broken

Do you have plans to update it? If not - I could try to rewrite it.

Also, what about adding the patch to the commitfest?

Thanks,
Michail.




CPU time for pg_stat_statement

2022-05-20 Thread Michail Nikolaev
Hello, hackers.

Today I was doing some aggregates over pg_stat_statements in order to
find the types of queries consuming most of the CPU. The aggregates
were made over two pg_stat_statements snapshots taken 30 seconds apart.

The sum(total_time) had the biggest value for a very frequent query
with about 10ms execution time, so I assumed it was the biggest CPU
consumer.

But after reducing the frequency of queries a lot I was unable to see
any significant difference in server CPU usage...

So it looks like clock_gettime is not accurate enough to measure real
CPU usage for some OLTP workloads. I suppose this is caused by the
difference between wall time and CPU time (I/O, thread switches, etc.).

But what do you think about adding a cpu_time column (by calling
getrusage) to pg_stat_statements? It seems it could be very useful for
CPU profiling.

I am probably able to prepare the patch, but it is always better to
get some feedback on the idea first :)

Best regards,
Michail.




Re: CPU time for pg_stat_statement

2022-05-20 Thread Michail Nikolaev
Hello, Thomas.

> This might be interesting:
> https://github.com/powa-team/pg_stat_kcache

Oh, nice, looks like it could help me to reduce CPU and test my
assumption (using exec_user_time and exec_system_time).

BTW, do you know why the extension is not in standard contrib? It looks mature.

Best regards,
Michail.




Re: CPU time for pg_stat_statement

2022-05-20 Thread Michail Nikolaev
Hello, Tom.

> This is a pretty broad claim to make on the basis of one undocumented
> test case on one unmentioned platform.

I'll try to use pg_stat_kcache to check the difference between Wall
and CPU for my case.

> On what grounds do you claim getrusage will be better?  One thing we
> can be pretty certain of is that it will be slower, since it has to
> return many more pieces of information.  And the API for it only allows
> time info to be specified to microseconds, versus nanoseconds for
> clock_gettime, so it's also going to be taking a precision hit.

My idea was not to replace the wall clock (clock_gettime) with the CPU
clock (getrusage). I am thinking about adding getrusage as an
additional column (with a flag to enable the actual measurement).
Looks like I needed to be more precise with words :)

It is just two different clocks - and sometimes you need physical
time, sometimes CPU time (and sometimes, for example, amount of WAL
written).

Best regards,
Michail.




Re: CPU time for pg_stat_statement

2022-06-08 Thread Michail Nikolaev
Hello, Tom.

>> This is a pretty broad claim to make on the basis of one undocumented
>> test case on one unmentioned platform.

> I'll try to use pg_stat_kcache to check the difference between Wall
> and CPU for my case.

In my case I see pretty high correlation of pg_stat_kcache and
pg_stat_statement (clock_gettime vs getrusage).
It looks like the CPU usage is hidden somewhere else (probably
planning: it is not measured on Postgres 11, but I see a really high
*clauselist_selectivity* in perf).

Thanks,
Michail.



