Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-26 Thread Masahiko Sawada
On Tue, Dec 27, 2022 at 2:24 PM John Naylor
 wrote:
>
> On Tue, Dec 27, 2022 at 12:14 AM Masahiko Sawada  
> wrote:
> >
> > On Fri, Dec 23, 2022 at 8:47 PM John Naylor
> >  wrote:
>
> > These 4 patches make sense to me. We can merge them into 0002 patch
>
> Okay, then I'll squash them when I post my next patch.
>
> > and I'll do similar changes for functions for leaf nodes as well.
>
> I assume you meant something else? -- some of the differences between inner 
> and leaf are already abstracted away.

Right. If we template these routines I don't need that.

>
> In any case, some things are still half-baked, so please wait until my next 
> patch before doing work on these files.
>
> Also, CI found a bug on 32-bit -- I know what I missed and will fix next week.

Thanks!

>
> > > 0010 and 0011 template a common implementation for both leaf and inner 
> > > nodes for searching and inserting.
> > >
> > > 0012: While at it, I couldn't resist using this technique to separate out 
> > > delete from search, which makes sense and might give a small performance 
> > > boost (at least on less capable hardware). I haven't got to the iteration 
> > > functions, but they should be straightforward.
>
> Two things came to mind since I posted this, which I'll make clear next patch:
> - A good compiler will get rid of branches when inlining, so maybe no 
> difference in code generation, but it still looks nicer this way.
> - Delete should really use its own template, because it only _accidentally_ 
> looks like search because we don't yet shrink nodes.

Okay.

>
> > What do you
> > think about how we can expand this template method to deal with DSA
> > memory? I imagined that we load say radixtree_template.h with some
> > macros to use the radix tree like we do for simplehash.h. And
> > radixtree_template.h further loads xxx_impl.h files for some internal
> > functions.
>
> Right, I was thinking the same. I wanted to start small and look for 
> opportunities to shrink the code footprint.

Thank you for your confirmation!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-26 Thread Masahiko Sawada
On Mon, Dec 26, 2022 at 10:29 PM Amit Kapila  wrote:
>
> On Mon, Dec 26, 2022 at 6:33 PM Masahiko Sawada  wrote:
> >
> > ---
> > +    if (!pa_can_start(xid))
> > +        return;
> > +
> > +    /* First time through, initialize parallel apply worker state hashtable. */
> > +    if (!ParallelApplyTxnHash)
> > +    {
> > +        HASHCTL     ctl;
> > +
> > +        MemSet(&ctl, 0, sizeof(ctl));
> > +        ctl.keysize = sizeof(TransactionId);
> > +        ctl.entrysize = sizeof(ParallelApplyWorkerEntry);
> > +        ctl.hcxt = ApplyContext;
> > +
> > +        ParallelApplyTxnHash = hash_create("logical replication parallel apply workers hash",
> > +                                           16, &ctl,
> > +                                           HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
> > +    }
> > +
> > +    /*
> > +     * It's necessary to reread the subscription information before assigning
> > +     * the transaction to a parallel apply worker. Otherwise, the leader may
> > +     * not be able to reread the subscription information if streaming
> > +     * transactions keep coming and are handled by parallel apply workers.
> > +     */
> > +    maybe_reread_subscription();
> >
> > pa_can_start() checks whether the skiplsn is invalid, and
> > then maybe_reread_subscription() could update the skiplsn to a valid
> > value. As the comment in pa_can_start() says, it won't work. I think
> > we should call maybe_reread_subscription() in
> > apply_handle_stream_start() before calling pa_allocate_worker().
> >
>
> But I think a similar thing can happen when we start the worker and
> then before the transaction ends, we do maybe_reread_subscription().

Where do we do maybe_reread_subscription() in this case? IIUC if the
leader sends all changes to the worker, there is no chance for the
leader to do maybe_reread_subscription() except when waiting for
input. On reflection, adding maybe_reread_subscription() to
apply_handle_stream_start() adds one extra call of it, so it's not
good. Alternatively, we can do that in pa_can_start() before checking
the skiplsn. I think we do a similar thing in AllTablesyncsReady() --
update the information before the check if necessary.

> I think we should try to call maybe_reread_subscription() when we are
> reasonably sure that we are going to enter parallel mode, otherwise,
> anyway, it will be later called by the leader worker.

It isn't a big problem even if we update the skiplsn after launching a
worker, since we will skip the transaction the next time. But doing it
beforehand would be more consistent with the current behavior. As I
mentioned above, doing it in pa_can_start() seems reasonable to me.
What do you think?
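
To be concrete, the ordering I have in mind is roughly the following; this is
only a sketch that assumes pa_can_start()'s current shape, with the other
checks elided:

```
static bool
pa_can_start(TransactionId xid)
{
    /*
     * Refresh the subscription info first, so that a skiplsn set by
     * ALTER SUBSCRIPTION ... SKIP is visible to the check below.  This
     * mirrors how AllTablesyncsReady() fetches the latest state before
     * performing its check.
     */
    maybe_reread_subscription();

    /* Don't assign the transaction to a parallel apply worker if we are
     * going to skip it anyway. */
    if (!XLogRecPtrIsInvalid(MySubscription->skiplsn))
        return false;

    /* ... the other existing checks on xid etc. remain unchanged ... */

    return true;
}
```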

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-26 Thread Masahiko Sawada
On Fri, Dec 23, 2022 at 8:47 PM John Naylor
 wrote:
>
> I wrote:
>
> > - Try templating out the differences between local and shared memory.
>
> Here is a brief progress report before Christmas vacation.

Thanks!

>
> I thought the best way to approach this was to go "inside out", that is, 
> start with the modest goal of reducing duplicated code for v16.
>
> 0001-0005 are copies from v13.
>
> 0006 whacks around the rt_node_insert_inner function to reduce the "surface 
> area" as far as symbols and casts. This includes replacing the goto with an 
> extra "unlikely" branch.
>
> 0007 removes the STRICT pragma for one of our benchmark functions that crept 
> in somewhere -- it should use the default and not just return NULL instantly.
>
> 0008 further whacks around the node-growing code in rt_node_insert_inner to 
> remove casts. When growing the size class within the same kind, we have no 
> need for a "new32" (etc) variable. Also, to keep from getting confused about 
> what an assert build verifies at the end, add a "newnode" variable and assign 
> it to "node" as soon as possible.
>
> 0009 uses the bitmap logic from 0004 for node256 also. There is no 
> performance reason for this, because there is no iteration needed, but it's 
> good for simplicity and consistency.

These 4 patches make sense to me. We can merge them into 0002 patch
and I'll do similar changes for functions for leaf nodes as well.

> 0010 and 0011 template a common implementation for both leaf and inner nodes 
> for searching and inserting.
>
> 0012: While at it, I couldn't resist using this technique to separate out 
> delete from search, which makes sense and might give a small performance 
> boost (at least on less capable hardware). I haven't got to the iteration 
> functions, but they should be straightforward.

Cool!

>
> There is more that could be done here, but I didn't want to get too ahead of 
> myself. For example, it's possible that struct members "children" and 
> "values" are names that don't need to be distinguished. Making them the same 
> would reduce code like
>
> +#ifdef RT_NODE_LEVEL_LEAF
> + n32->values[insertpos] = value;
> +#else
> + n32->children[insertpos] = child;
> +#endif
>
> ...but there could be downsides and I don't want to distract from the goal of 
> dealing with shared memory.

With these patches, some functions in radixtree.h load the header
files, radixtree_xxx_impl.h, that have the function body. What do you
think about how we can expand this template method to deal with DSA
memory? I imagined that we load say radixtree_template.h with some
macros to use the radix tree like we do for simplehash.h. And
radixtree_template.h further loads xxx_impl.h files for some internal
functions.
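
To be concrete, the usage I'm imagining would look something like this. The
RT_* macro names and the header file names below are hypothetical and just
mirror the simplehash.h convention:

```
/* in the caller, e.g. vacuum code */
#define RT_PREFIX   local_rt        /* prefix for the emitted function names */
#define RT_SCOPE    static
#define RT_DECLARE
#define RT_DEFINE
/* #define RT_SHMEM */              /* if defined, emit the DSA-based variant */
#define RT_VALUE_TYPE uint64
#include "lib/radixtree_template.h"

/*
 * radixtree_template.h would then expand the public functions
 * (local_rt_create(), local_rt_set(), local_rt_search(), ...) and internally
 * #include the radixtree_xxx_impl.h files for the node-level routines, the
 * same way simplehash.h builds the SH_* functions from SH_PREFIX.
 */
```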

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-26 Thread Masahiko Sawada
On Mon, Dec 26, 2022 at 1:22 PM houzj.f...@fujitsu.com
 wrote:
>
> On Friday, December 23, 2022 5:20 PM houzj.f...@fujitsu.com 
>  wrote:
> >
> > I noticed a CFbot failure in one of the new testcases in 015_stream.pl which
> > comes from the old 032_xx.pl. It's because I slightly adjusted the change size
> > in a transaction in the last version, which caused the transaction's size not to
> > exceed the decoding work mem, so the transaction is not applied as a streaming
> > transaction as expected (it is applied as a non-streaming transaction), which
> > caused the failure. Attached is the new version patch which fixes this.
> >
>
> Since the GUC used to force stream changes has been committed, I removed that
> patch from the patch set here and rebased the testcases based on that commit.
> Here is the rebased patch set.
>

Thank you for updating the patches. Here are some comments for 0001
and 0002 patches:


I think it'd be better to write a log message when the leader enters
serialization mode. It would be helpful for investigating issues.
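
For example, something like the following (the wording and placement are just
a suggestion) would make it easy to tell from the server log when and why the
leader started serializing changes to a file:

```
ereport(LOG,
        (errmsg("logical replication apply worker for subscription \"%s\" will serialize the remaining changes of remote transaction %u to a file",
                MySubscription->name, xid)));  /* xid: the streamed transaction being serialized */
```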

---
+    if (!pa_can_start(xid))
+        return;
+
+    /* First time through, initialize parallel apply worker state hashtable. */
+    if (!ParallelApplyTxnHash)
+    {
+        HASHCTL     ctl;
+
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(TransactionId);
+        ctl.entrysize = sizeof(ParallelApplyWorkerEntry);
+        ctl.hcxt = ApplyContext;
+
+        ParallelApplyTxnHash = hash_create("logical replication parallel apply workers hash",
+                                           16, &ctl,
+                                           HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /*
+     * It's necessary to reread the subscription information before assigning
+     * the transaction to a parallel apply worker. Otherwise, the leader may
+     * not be able to reread the subscription information if streaming
+     * transactions keep coming and are handled by parallel apply workers.
+     */
+    maybe_reread_subscription();

pa_can_start() checks whether the skiplsn is invalid, and
then maybe_reread_subscription() could update the skiplsn to a valid
value. As the comment in pa_can_start() says, it won't work. I think
we should call maybe_reread_subscription() in
apply_handle_stream_start() before calling pa_allocate_worker().

---
+static inline bool
+am_leader_apply_worker(void)
+{
+    return (!OidIsValid(MyLogicalRepWorker->relid) &&
+            !isParallelApplyWorker(MyLogicalRepWorker));
+}

How about using !am_tablesync_worker() instead of
!OidIsValid(MyLogicalRepWorker->relid) for better readability?
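
That is, something like the following (just a readability suggestion, assuming
am_tablesync_worker() is visible here):

```
static inline bool
am_leader_apply_worker(void)
{
    return (!am_tablesync_worker() &&
            !isParallelApplyWorker(MyLogicalRepWorker));
}
```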

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Force streaming every change in logical decoding

2022-12-23 Thread Masahiko Sawada
On Fri, Dec 23, 2022 at 5:32 PM shiy.f...@fujitsu.com
 wrote:
>
> On Fri, Dec 23, 2022 1:50 PM Amit Kapila 
> >
> > On Thu, Dec 22, 2022 at 6:18 PM shiy.f...@fujitsu.com
> >  wrote:
> > >
> > >
> > > Besides, I tried to reduce data size in streaming subscription tap tests 
> > > by this
> > > new GUC (see 0002 patch). But I didn't convert all streaming tap tests
> > because I
> > > think we also need to cover the case that there are lots of changes. So, 
> > > 015*
> > is
> > > not modified. And 017* is not modified because streaming transactions and
> > > non-streaming transactions are tested alternately in this test.
> > >
> >
> > I think we can remove the newly added test from the patch and instead
> > combine the 0001 and 0002 patches. I think we should leave the
> > 022_twophase_cascade as it is because it can impact code coverage,
> > especially the below part of the test:
> > # 2PC PREPARE with a nested ROLLBACK TO SAVEPOINT
> > $node_A->safe_psql(
> > 'postgres', "
> > BEGIN;
> > INSERT INTO test_tab VALUES (, 'foobar');
> > SAVEPOINT sp_inner;
> > INSERT INTO test_tab SELECT i, md5(i::text) FROM
> > generate_series(3, 5000) s(i);
> >
> > Here, we will stream first time after the subtransaction, so can
> > impact the below part of the code in ReorderBufferStreamTXN:
> > if (txn->snapshot_now == NULL)
> > {
> > ...
> > dlist_foreach(subxact_i, &txn->subtxns)
> > {
> > ReorderBufferTXN *subtxn;
> >
> > subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
> > ReorderBufferTransferSnapToParent(txn, subtxn);
> > }
> > ...
> >
>
> OK, I removed the modification in 022_twophase_cascade.pl and combined the two
> patches.

Thank you for updating the patch. The v6 patch looks good to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Force streaming every change in logical decoding

2022-12-22 Thread Masahiko Sawada
On Thu, Dec 22, 2022 at 9:48 PM shiy.f...@fujitsu.com
 wrote:
>
> On Thu, Dec 22, 2022 5:24 PM Amit Kapila  wrote:
> >
> > On Thu, Dec 22, 2022 at 1:15 PM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Thu, 22 Dec 2022 12:35:46 +0530, Amit Kapila
> >  wrote in
> > > > I have addressed these comments in the attached. Additionally, I have
> > > > modified the docs and commit messages to make those clear. I think
> > > > instead of adding new tests with this patch, it may be better to
> > > > change some of the existing tests related to streaming to use this
> > > > parameter as that will clearly show one of the purposes of this patch.
> > >
> > > Being late but I'm worried that we might sometime regret the lack
> > > of the explicit default. Don't we really need it?
> > >
> >
> > For this, I like your proposal for "buffered" as an explicit default value.
> >
> > > +Allows streaming or serializing changes immediately in logical
> > decoding.
> > > +The allowed values of logical_decoding_mode
> > are the
> > > +empty string and immediate. When set to
> > > +immediate, stream each change if
> > > +streaming option is enabled, otherwise, 
> > > serialize
> > > +each change.  When set to an empty string, which is the default,
> > > +decoding will stream or serialize changes when
> > > +logical_decoding_work_mem is reached.
> > >
> > > With (really) fresh eyes, I took a bit long time to understand what
> > > the "streaming" option is. Couldn't we augment the description by a
> > > bit?
> > >
> >
> > Okay, how about modifying it to: "When set to
> > immediate, stream each change if
> > streaming option (see optional parameters set by
> > CREATE SUBSCRIPTION) is enabled, otherwise, serialize each change.
> >
>
> I updated the patch to use "buffered" as the explicit default value, and
> included Amit's changes to the document.
>
> Besides, I tried to reduce data size in streaming subscription tap tests by 
> this
> new GUC (see 0002 patch). But I didn't convert all streaming tap tests because
> I
> think we also need to cover the case that there are lots of changes. So, 015* 
> is
> not modified.

If we want to eventually convert 015 some time, isn't it better to
include it even if it requires many changes? Is there any reason we
want to change 017 in a separate patch?

> And 017* is not modified because streaming transactions and
> non-streaming transactions are tested alternately in this test.

How about 029_on_error.pl? It also sets logical_decoding_work_mem to
64kb to test the STREAM COMMIT case.

>
> I collected the time to run these tests before and after applying the patch 
> set
> on my machine. In debug version, it saves about 5.3 s; and in release version,
> it saves about 1.8 s. The time of each test is attached.

Nice improvements.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Force streaming every change in logical decoding

2022-12-22 Thread Masahiko Sawada
On Thu, Dec 22, 2022 at 6:19 PM Amit Kapila  wrote:
>
> On Thu, Dec 22, 2022 at 12:47 PM Masahiko Sawada  
> wrote:
> >
> > On Thu, Dec 22, 2022 at 4:05 PM Amit Kapila  wrote:
> > >
> >
> > >  I think
> > > instead of adding new tests with this patch, it may be better to
> > > change some of the existing tests related to streaming to use this
> > > parameter as that will clearly show one of the purposes of this patch.
> >
> > +1. I think test_decoding/sql/stream.sql and spill.sql are good
> > candidates and we change logical replication TAP tests in a separate
> > patch.
> >
>
> I prefer the other way, let's first do TAP tests because that will
> also help immediately with the parallel apply feature. We need to
> execute most of those tests in parallel mode.

Good point. Or we can do both if changes for test_decoding tests are not huge?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-22 Thread Masahiko Sawada
On Fri, Dec 23, 2022 at 12:21 PM Amit Kapila  wrote:
>
> On Thu, Dec 22, 2022 at 6:18 PM Masahiko Sawada  wrote:
> >
> > On Thu, Dec 22, 2022 at 7:04 PM Amit Kapila  wrote:
> > >
> > > On Thu, Dec 22, 2022 at 11:39 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > Thank you for updating the patch. Here are some comments on v64 patches:
> > > >
> > > > While testing the patch, I realized that if all streamed transactions
> > > > are handled by parallel workers, there is no chance for the leader to
> > > > call maybe_reread_subscription() except for when waiting for the next
> > > > message. Due to this, the leader didn't stop for a while even if the
> > > > subscription gets disabled. It's an extreme case since my test was
> > > > that pgbench runs 30 concurrent transactions and logical_decoding_mode
> > > > = 'immediate', but we might want to make sure to call
> > > > maybe_reread_subscription() at least after committing/preparing a
> > > > transaction.
> > > >
> > >
> > > Won't it be better to call it only if we handle the transaction by the
> > > parallel worker?
> >
> > Agreed. And we won't need to do that after handling stream_prepare as
> > we don't do that now.
> >
>
> I think we do this for both prepare and non-prepare cases via
> begin_replication_step(). Here, in both cases, as the changes are sent
> to the parallel apply worker, we missed in both cases. So, I think it
> is better to do in both cases.

Agreed. I missed that we call maybe_reread_subscription() even in the
prepare case.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-22 Thread Masahiko Sawada
On Thu, Dec 22, 2022 at 7:24 PM John Naylor
 wrote:
>
>
> On Wed, Dec 21, 2022 at 3:09 PM Masahiko Sawada  wrote:
> >
> > On Tue, Dec 20, 2022 at 3:09 PM John Naylor
> >  wrote:
>
> > > https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de
> > >
> > > I'm guessing the hash join case can afford to be precise about memory 
> > > because it must spill to disk when exceeding workmem. We don't have that 
> > > design constraint.
> >
> > You mean that the memory used by the radix tree should be limited not
> > by the amount of memory actually used, but by the amount of memory
> > allocated? In other words, it checks by MemoryContextMemAllocated() in
> > the local cases and by dsa_get_total_size() in the shared case.
>
> I mean, if this patch set uses 10x less memory than v15 (not always, but easy 
> to find cases where it does), and if it's also expensive to track memory use 
> precisely, then we don't have an incentive to track memory precisely. Even if 
> we did, we don't want to assume that every future caller of radix tree is 
> willing to incur that cost.

Understood.

>
> > The idea of using up to half of maintenance_work_mem might be a good
> > idea compared to the current flat-array solution. But since it only
> > uses half, I'm concerned that there will be users who double their
> > maintenance_work_mem. When it is improved, the user needs to restore
> > maintenance_work_mem again.
>
> I find it useful to step back and look at the usage patterns:
>
> Autovacuum: Limiting the memory allocated by vacuum is important, since there 
> are multiple workers and they can run at any time (possibly most of the 
> time). This case will not use parallel index vacuum, so will use slab, where 
> the quick estimation of memory taken by the context is not terribly far off, 
> so we can afford to be more optimistic here.
>
> Manual vacuum: The default configuration assumes we want to finish as soon as 
> possible (vacuum_cost_delay is zero). Parallel index vacuum can be used. My 
> experience leads me to believe users are willing to use a lot of memory to 
> make manual vacuum finish as quickly as possible, and are disappointed to 
> learn that even if maintenance work mem is 10GB, vacuum can only use 1GB.

Agreed.

> So I don't believe anyone will have to double maintenance work mem after 
> upgrading (even with pessimistic accounting) because we'll be both
> - much more efficient with memory on average
> - free from the 1GB cap

Makes sense.

>
> That said, it's possible 50% is too pessimistic -- a 75% threshold will bring 
> us very close to powers of two for example:
>
> 2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> keep going
> 766 + 256 = 1022MB -> stop
>
> I'm not sure if that calculation could cause going over the limit, or how 
> common that would be.
>

If the value is a power of 2, it seems to work perfectly fine. But if,
for example, it's 700MB, the total memory exceeds the limit:

2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
510 + 256 = 766MB -> stop but it exceeds the limit.

In a bigger case, if it's 11000MB,

2*(1+2+...+2048) = 8190MB (74.4%)
8190 + 4096 = 12286MB

That being said, I don't think these are common cases, so the 75%
threshold seems to work fine in most cases.
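
Put differently, the stopping rule we're discussing amounts to something like
the following (just a sketch; the names are made up):

```
/*
 * Keep growing the radix tree until allocated memory crosses a fixed
 * fraction of the budget.  As the examples above show, with 0.75 the
 * final allocation can still overshoot the budget when the budget isn't
 * close to a power of two.
 */
#define RT_STOP_FRACTION    0.75

static inline bool
rt_memory_limit_reached(Size total_allocated, Size limit_bytes)
{
    return total_allocated >= (Size) (limit_bytes * RT_STOP_FRACTION);
}
```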

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-22 Thread Masahiko Sawada
On Thu, Dec 22, 2022 at 7:04 PM Amit Kapila  wrote:
>
> On Thu, Dec 22, 2022 at 11:39 AM Masahiko Sawada  
> wrote:
> >
> > Thank you for updating the patch. Here are some comments on v64 patches:
> >
> > While testing the patch, I realized that if all streamed transactions
> > are handled by parallel workers, there is no chance for the leader to
> > call maybe_reread_subscription() except for when waiting for the next
> > message. Due to this, the leader didn't stop for a while even if the
> > subscription gets disabled. It's an extreme case since my test was
> > that pgbench runs 30 concurrent transactions and logical_decoding_mode
> > = 'immediate', but we might want to make sure to call
> > maybe_reread_subscription() at least after committing/preparing a
> > transaction.
> >
>
> Won't it be better to call it only if we handle the transaction by the
> parallel worker?

Agreed. And we won't need to do that after handling stream_prepare as
we don't do that now.

>
> > ---
> > +if (pg_atomic_read_u32(>pending_stream_count) == 
> > 0)
> > +{
> > +if (pa_has_spooled_message_pending())
> > +return;
> > +
> > +elog(ERROR, "invalid pending streaming block number");
> > +}
> >
> > I think it's helpful if the error message shows the invalid block number.
> >
>
> +1. Additionally, I suggest changing the message to "invalid pending
> streaming chunk"?
>
> > ---
> > On Wed, Dec 7, 2022 at 10:13 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada 
> > >  wrote:
> > > > ---
> > > > If a value of max_parallel_apply_workers_per_subscription is not
> > > > sufficient, we get the LOG "out of parallel apply workers" every time
> > > > when the apply worker doesn't launch a worker. But do we really need
> > > > this log? It seems not consistent with
> > > > max_sync_workers_per_subscription behavior. I think we can check if
> > > > the number of running parallel workers is less than
> > > > max_parallel_apply_workers_per_subscription before calling
> > > > logicalrep_worker_launch(). What do you think?
> > >
> > > (Sorry, I missed this comment in last email)
> > >
> > > I personally feel giving a hint might help the user to realize that the
> > > max_parallel_applyxxx is not enough for the current workload and then
> > > they can adjust the parameter. Otherwise, the user might not have an easy
> > > way to check if more workers are needed.
> > >
> >
> > Sorry, I missed this comment.
> >
> > I think the number of concurrent transactions on the publisher could
> > be several hundreds, and the number of streamed transactions among
> > them could be several tens. I agree setting
> > max_parallel_apply_workers_per_subscription to a value high enough is
> > ideal but I'm not sure we want to inform users immediately that the
> > setting value is not enough. I think that with the default value
> > (i.e., 2), it will not be enough for many systems and the server logs
> > could be flooded with the LOG "out of parallel apply workers".
> >
>
> It seems currently we give a similar message when the logical
> replication worker slots are finished "out of logical replication
> worker slots" or when we are not able to register background workers
> "out of background worker slots". Now, OTOH, when we exceed the limit
> of sync workers "max_sync_workers_per_subscription", we don't display
> any message. Personally, I think if any user has used the streaming
> option as "parallel" she wants all large transactions to be performed
> in parallel and if the system is not able to deal with it, displaying
> a LOG message will be useful for users. This is because the
> performance difference for large transactions between parallel and
> non-parallel is big (30-40%) and it is better for users to know as
> soon as possible instead of expecting them to run some monitoring
> query to notice the same.

I see your point. But looking at other parallel features such as
parallel queries, parallel vacuum and parallel index creation, we
don't give such messages even if the number of parallel workers
actually launched is lower than the ideal. They also bring a big
performance benefit.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Call lazy_check_wraparound_failsafe earlier for parallel vacuum

2022-12-22 Thread Masahiko Sawada
On Wed, Dec 21, 2022 at 2:44 AM Imseih (AWS), Sami  wrote:
>
> Attached is a patch to check scanned pages rather
> than blockno.

Thank you for the patch. It looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Call lazy_check_wraparound_failsafe earlier for parallel vacuum

2022-12-22 Thread Masahiko Sawada
On Sat, Nov 12, 2022 at 12:28 AM Imseih (AWS), Sami  wrote:
>
> >Yeah, it's a little inconsistent.
>
> Yes, this should be corrected by calling the failsafe
> inside the parallel vacuum loops and handling the case by exiting
> the loop and parallel vacuum if failsafe kicks in.

I agree it's better to be consistent, but I think we cannot simply call
lazy_check_wraparound_failsafe() inside the parallel vacuum loops.
IIUC the failsafe is heap (or lazy vacuum) specific, whereas parallel
vacuum is common infrastructure for doing index vacuuming in parallel. We
should not break this design. For example, we would need to have a
callback for the index scan loop so that the caller (i.e. lazy vacuum) can
do its work.
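
For example, the shape I have in mind is something like this (all the names
here are hypothetical, just to illustrate keeping the heap-specific check on
the caller's side):

```
/* Callback supplied by vacuumlazy.c when it starts parallel index vacuuming */
typedef bool (*ParallelVacuumAbortCheck) (void *cb_arg);

/* In the leader's per-index loop in vacuumparallel.c (sketch): */
for (int i = 0; i < pvs->nindexes; i++)
{
    /* Let the caller (lazy vacuum) run its wraparound failsafe check. */
    if (pvs->abort_check != NULL && pvs->abort_check(pvs->abort_check_arg))
        break;                  /* stop index vacuuming; failsafe kicked in */

    /* ... vacuum or clean up the next index ... */
}
```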

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Force streaming every change in logical decoding

2022-12-22 Thread Masahiko Sawada
On Thu, Dec 22, 2022 at 4:18 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Amit,
>
> Thank you for updating the patch. I have also checked the patch
> and basically it has worked well. Almost all things I found were modified
> by v4.
>
> One comment: while setting logical_decoding_mode to wrong value,
> I got unfriendly ERROR message.
>
> ```
> postgres=# SET logical_decoding_mode = 1;
> ERROR:  invalid value for parameter "logical_decoding_mode": "1"
> HINT:  Available values: , immediate
> ```
>
> Here all acceptable enum values should be output as a HINT, but we could not see the
> empty string.
> Should we modify config_enum_get_options() for treating empty string, maybe
> like (empty)?

Good point. I think the hint message can say "The only allowed value
is \"immediate\"", as recovery_target does. Or, considering the name of
logical_decoding_mode, we might want to have a non-empty string, say
'normal' as Kuroda-san proposed, as the default value.
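
If we go that way, the GUC's option table would simply list both values and
the generated HINT needs no special treatment (the enum symbols below are
assumptions, not committed names):

```
static const struct config_enum_entry logical_decoding_mode_options[] = {
    {"normal", LOGICAL_DECODING_MODE_NORMAL, false},       /* default: buffer up to logical_decoding_work_mem */
    {"immediate", LOGICAL_DECODING_MODE_IMMEDIATE, false}, /* stream/serialize every change */
    {NULL, 0, false}
};
```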

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Force streaming every change in logical decoding

2022-12-21 Thread Masahiko Sawada
On Thu, Dec 22, 2022 at 4:05 PM Amit Kapila  wrote:
>
> On Wed, Dec 21, 2022 at 7:25 PM Masahiko Sawada  wrote:
> >
> > On Wed, Dec 21, 2022 at 10:14 PM shiy.f...@fujitsu.com
> >  wrote:
> >
> > The patch looks good to me. Some minor comments are:
> >
> > - * limit, but we might also adapt a more elaborate eviction strategy
> > - for example
> > - * evicting enough transactions to free certain fraction (e.g. 50%)
> > of the memory
> > - * limit.
> > + * limit, but we might also adapt a more elaborate eviction strategy - for
> > + * example evicting enough transactions to free certain fraction (e.g. 
> > 50%) of
> > + * the memory limit.
> >
> > This change is not relevant with this feature.
> >
> > ---
> > +if (logical_decoding_mode == LOGICAL_DECODING_MODE_DEFAULT
> > +&& rb->size < logical_decoding_work_mem * 1024L)
> >
> > Since we put '&&' before the new line in all other places in
> > reorderbuffer.c, I think it's better to make it consistent. The same
> > is true for the change for while loop in the patch.
> >
>
> I have addressed these comments in the attached. Additionally, I have
> modified the docs and commit messages to make those clear.

Thanks!

>  I think
> instead of adding new tests with this patch, it may be better to
> change some of the existing tests related to streaming to use this
> parameter as that will clearly show one of the purposes of this patch.

+1. I think test_decoding/sql/stream.sql and spill.sql are good
candidates and we change logical replication TAP tests in a separate
patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-21 Thread Masahiko Sawada
On Wed, Dec 21, 2022 at 2:32 PM houzj.f...@fujitsu.com
 wrote:
>
> On Wed, Dec 21, 2022 9:07 AM Peter Smith  wrote:
> > FYI - applying v63-0001 using the latest master does not work.
> >
> > git apply 
> > ../patches_misc/v63-0001-Perform-streaming-logical-transactions-by-
> > parall.patch
> > error: patch failed: src/backend/replication/logical/meson.build:1
> > error: src/backend/replication/logical/meson.build: patch does not apply
> >
> > Looks like a recent commit [1] to add copyrights broke the patch
>
> Thanks for your reminder.
> Rebased the patch set.
>
> Attach the new patch set which also includes some
> cosmetic comment changes.
>

Thank you for updating the patch. Here are some comments on v64 patches:

While testing the patch, I realized that if all streamed transactions
are handled by parallel workers, there is no chance for the leader to
call maybe_reread_subscription() except for when waiting for the next
message. Due to this, the leader didn't stop for a while even if the
subscription gets disabled. It's an extreme case since my test was
that pgbench runs 30 concurrent transactions and logical_decoding_mode
= 'immediate', but we might want to make sure to call
maybe_reread_subscription() at least after committing/preparing a
transaction.

---
+if (pg_atomic_read_u32(>pending_stream_count) == 0)
+{
+if (pa_has_spooled_message_pending())
+return;
+
+elog(ERROR, "invalid pending streaming block number");
+}

I think it's helpful if the error message shows the invalid block number.

---
On Wed, Dec 7, 2022 at 10:13 PM houzj.f...@fujitsu.com
 wrote:
>
> On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada 
>  wrote:
> > ---
> > If a value of max_parallel_apply_workers_per_subscription is not
> > sufficient, we get the LOG "out of parallel apply workers" every time
> > when the apply worker doesn't launch a worker. But do we really need
> > this log? It seems not consistent with
> > max_sync_workers_per_subscription behavior. I think we can check if
> > the number of running parallel workers is less than
> > max_parallel_apply_workers_per_subscription before calling
> > logicalrep_worker_launch(). What do you think?
>
> (Sorry, I missed this comment in last email)
>
> I personally feel giving a hint might help the user to realize that the
> max_parallel_applyxxx is not enough for the current workload and then they can
> adjust the parameter. Otherwise, the user might not have an easy way to check
> if more workers are needed.
>

Sorry, I missed this comment.

I think the number of concurrent transactions on the publisher could
be several hundreds, and the number of streamed transactions among
them could be several tens. I agree setting
max_parallel_apply_workers_per_subscription to a value high enough is
ideal but I'm not sure we want to inform users immediately that the
setting value is not enough. I think that with the default value
(i.e., 2), it will not be enough for many systems and the server logs
could be flooded with the LOG "out of parallel apply workers". If we
want to give a hint to users, we can probably show the statistics on
pg_stat_subscription_stats view such as the number of streamed
transactions that are handled by the leader and parallel workers.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Force streaming every change in logical decoding

2022-12-21 Thread Masahiko Sawada
On Wed, Dec 21, 2022 at 10:14 PM shiy.f...@fujitsu.com
 wrote:
>
> On Wed, Dec 21, 2022 4:54 PM Amit Kapila  wrote:
> >
> > On Wed, Dec 21, 2022 at 1:55 PM Peter Smith 
> > wrote:
> > >
> > > On Wed, Dec 21, 2022 at 6:22 PM Masahiko Sawada
> >  wrote:
> > > >
> > > > On Tue, Dec 20, 2022 at 7:49 PM Amit Kapila 
> > wrote:
> > > > >
> > > > > On Tue, Dec 20, 2022 at 2:46 PM Hayato Kuroda (Fujitsu)
> > > > >  wrote:
> > > > > >
> > > > > > Dear hackers,
> > > > > >
> > > > > > > We have discussed three different ways to provide GUC for these
> > > > > > > features. (1) Have separate GUCs like force_server_stream_mode,
> > > > > > > force_server_serialize_mode, force_client_serialize_mode (we can
> > use
> > > > > > > different names for these) for each of these; (2) Have two sets of
> > > > > > > GUCs for server and client. We can have logical_decoding_mode with
> > > > > > > values as 'stream' and 'serialize' for the server and then
> > > > > > > logical_apply_serialize = true/false for the client. (3) Have one 
> > > > > > > GUC
> > > > > > > like logical_replication_mode with values as 'server_stream',
> > > > > > > 'server_serialize', 'client_serialize'.
> > > > > >
> > > > > > I also agreed for adding new GUC parameters (and I have already done
> > partially
> > > > > > in parallel apply[1]), and basically options 2 made sense for me. 
> > > > > > But is
> > it OK
> > > > > > that we can choose "serialize" mode even if subscribers require
> > streaming?
> > > > > >
> > > > > > Currently the reorder buffer transactions are serialized on 
> > > > > > publisher
> > only when
> > > > > > the there are no streamable transaction. So what happen if the
> > > > > > logical_decoding_mode = "serialize" but streaming option streaming 
> > > > > > is
> > on? If we
> > > > > > break the first one and serialize changes on publisher anyway, it 
> > > > > > may
> > be not
> > > > > > suitable for testing the normal operation.
> > > > > >
> > > > >
> > > > > I think the change will be streamed as soon as the next change is
> > > > > processed even if we serialize based on this option. See
> > > > > ReorderBufferProcessPartialChange. However, I see your point that
> > when
> > > > > the streaming option is given, the value 'serialize' for this GUC may
> > > > > not make much sense.
> > > > >
> > > > > > Therefore, I came up with the variant of (2): logical_decoding_mode
> > can be
> > > > > > "normal" or "immediate".
> > > > > >
> > > > > > "normal" is a default value, which is same as current HEAD. Changes
> > are streamed
> > > > > > or serialized when the buffered size exceeds
> > logical_decoding_work_mem.
> > > > > >
> > > > > > When users set to "immediate", the walsenders starts to stream or
> > serialize all
> > > > > > changes. The choice is depends on the subscription option.
> > > > > >
> > > > >
> > > > > The other possibility to achieve what you are saying is that we allow
> > > > > a minimum value of logical_decoding_work_mem as 0 which would
> > mean
> > > > > stream or serialize each change depending on whether the streaming
> > > > > option is enabled. I think we normally don't allow a minimum value
> > > > > below a certain threshold for other *_work_mem parameters (like
> > > > > maintenance_work_mem, work_mem), so we have followed the same
> > here.
> > > > > And, I think it makes sense from the user's perspective because below
> > > > > a certain threshold it will just add overhead by either writing small
> > > > > changes to the disk or by sending those over the network. However, it
> > > > > can be quite useful for testing/debugging. So, not sure, if we should
> > > > > restrict setting logical_decoding_work_mem below a certain threshold.
> > > > >

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-21 Thread Masahiko Sawada
On Tue, Dec 20, 2022 at 3:09 PM John Naylor
 wrote:
>
>
> On Mon, Dec 19, 2022 at 2:14 PM Masahiko Sawada  wrote:
> >
> > On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada  
> > wrote:
>
> > > Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
> > > seems that they look at only memory that are actually dsa_allocate'd.
> > > To be exact, we estimate the number of hash buckets based on work_mem
> > > (and hash_mem_multiplier) and use it as the upper limit. So I've
> > > confirmed that the result of dsa_get_total_size() could exceed the
> > > limit. I'm not sure it's a known and legitimate usage. If we can
> > > follow such usage, we can probably track how much dsa_allocate'd
> > > memory is used in the radix tree.
> >
> > I've experimented with this idea. The newly added 0008 patch changes
> > the radix tree so that it counts the memory usage for both local and
> > shared cases. As shown below, there is an overhead for that:
> >
> > w/o 0008 patch
> >  298453544 | 282
>
> > w/ 0008 patch
> >  293603184 | 297
>
> This adds about as much overhead as the improvement I measured in the v4 slab 
> allocator patch.

Oh, yes, that's bad.

> https://www.postgresql.org/message-id/20220704211822.kfxtzpcdmslzm2dy%40awork3.anarazel.de
>
> I'm guessing the hash join case can afford to be precise about memory because 
> it must spill to disk when exceeding workmem. We don't have that design 
> constraint.

You mean that the memory used by the radix tree should be limited not
by the amount of memory actually used, but by the amount of memory
allocated? In other words, it checks by MemoryContextMemAllocated() in
the local cases and by dsa_get_total_size() in the shared case.

The idea of using up to half of maintenance_work_mem might be a good
idea compared to the current flat-array solution. But since it only
uses half, I'm concerned that there will be users who double their
maintenance_work_mem. When it is improved, the user needs to restore
maintenance_work_mem again.

A better solution would be to have slab-like DSA. We allocate the
dynamic shared memory by adding fixed-length large segments. However, the
downside would be that as the segment size gets large, we would need to
increase maintenance_work_mem as well. Also, this patch set is already
getting bigger and more complicated, so I don't think it's a good idea to
add more.

If we limit the memory usage by checking the amount of memory actually
used, we can use SlabStats() for the local cases. Since DSA doesn't
have such functionality for now we would need to add it. Or we can
track it in the radix tree only in the shared cases.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Force streaming every change in logical decoding

2022-12-20 Thread Masahiko Sawada
On Tue, Dec 20, 2022 at 7:49 PM Amit Kapila  wrote:
>
> On Tue, Dec 20, 2022 at 2:46 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear hackers,
> >
> > > We have discussed three different ways to provide GUC for these
> > > features. (1) Have separate GUCs like force_server_stream_mode,
> > > force_server_serialize_mode, force_client_serialize_mode (we can use
> > > different names for these) for each of these; (2) Have two sets of
> > > GUCs for server and client. We can have logical_decoding_mode with
> > > values as 'stream' and 'serialize' for the server and then
> > > logical_apply_serialize = true/false for the client. (3) Have one GUC
> > > like logical_replication_mode with values as 'server_stream',
> > > 'server_serialize', 'client_serialize'.
> >
> > I also agreed for adding new GUC parameters (and I have already done 
> > partially
> > in parallel apply[1]), and basically options 2 made sense for me. But is it 
> > OK
> > that we can choose "serialize" mode even if subscribers require streaming?
> >
> > Currently the reorder buffer transactions are serialized on publisher only 
> > when
> > there are no streamable transactions. So what happens if the
> > logical_decoding_mode = "serialize" but streaming option streaming is on? 
> > If we
> > break the first one and serialize changes on the publisher anyway, it may not be
> > suitable for testing the normal operation.
> >
>
> I think the change will be streamed as soon as the next change is
> processed even if we serialize based on this option. See
> ReorderBufferProcessPartialChange. However, I see your point that when
> the streaming option is given, the value 'serialize' for this GUC may
> not make much sense.
>
> > Therefore, I came up with the variant of (2): logical_decoding_mode can be
> > "normal" or "immediate".
> >
> > "normal" is a default value, which is same as current HEAD. Changes are 
> > streamed
> > or serialized when the buffered size exceeds logical_decoding_work_mem.
> >
> > When users set to "immediate", the walsenders starts to stream or serialize 
> > all
> > changes. The choice is depends on the subscription option.
> >
>
> The other possibility to achieve what you are saying is that we allow
> a minimum value of logical_decoding_work_mem as 0 which would mean
> stream or serialize each change depending on whether the streaming
> option is enabled. I think we normally don't allow a minimum value
> below a certain threshold for other *_work_mem parameters (like
> maintenance_work_mem, work_mem), so we have followed the same here.
> And, I think it makes sense from the user's perspective because below
> a certain threshold it will just add overhead by either writing small
> changes to the disk or by sending those over the network. However, it
> can be quite useful for testing/debugging. So, not sure, if we should
> restrict setting logical_decoding_work_mem below a certain threshold.
> What do you think?

I agree with (2), having separate GUCs for publisher side and
subscriber side. Also, on the publisher side, Amit's idea, controlling
the logical decoding behavior by changing logical_decoding_work_mem,
seems like a good idea.

But I'm not sure it's a good idea if we lower the minimum value of
logical_decoding_work_mem to 0. I agree it's helpful for testing and
debugging but setting logical_decoding_work_mem = 0 doesn't benefit
users at all; rather, it brings risks.

I prefer the idea Kuroda-san previously proposed; setting
logical_decoding_mode = 'immediate' means setting
logical_decoding_work_mem = 0. We might not need to have it as an enum
parameter since it has only two values, though.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-15 Thread Masahiko Sawada
On Fri, Dec 16, 2022 at 12:49 AM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > I don't think we need an additional PG_TRY() for that since exec_stmts()
> > is already called in PG_TRY() if there is an exception block. I meant
> > to call stmt_end() in PG_CATCH() in exec_stmt_block() (i.e., only when
> > an error is caught by the exception block). Currently, if an error is
> > caught, we call stmt_beg() and stmt_end() for statements executed
> > inside the exception block but call only stmt_beg() for the
> > statement that raised an error.
>
> I fail to see anything wrong with that.  We never completed execution
> of the statement that raised an error, but calling stmt_end for it
> would imply that we did.

Thank you for the comment. Agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-15 Thread Masahiko Sawada
On Thu, Dec 15, 2022 at 12:28 PM houzj.f...@fujitsu.com
 wrote:
>
> On Wednesday, December 14, 2022 2:49 PM Amit Kapila  
> wrote:
>
> >
> > On Wed, Dec 14, 2022 at 9:50 AM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada
> >  wrote:
> > > >
> > > > Here are comments on v59 0001, 0002 patches:
> > >
> > > Thanks for the comments!
> > >
> > > > +void
> > > > +pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
> > > > +{
> > > > +    while (1)
> > > > +    {
> > > > +        SpinLockAcquire(&wshared->mutex);
> > > > +
> > > > +        /*
> > > > +         * Don't try to increment the count if the parallel apply worker is
> > > > +         * taking the stream lock. Otherwise, there would be a race condition
> > > > +         * that the parallel apply worker checks there is no pending streaming
> > > > +         * block and before it actually starts waiting on a lock, the leader
> > > > +         * sends another streaming block and take the stream lock again. In
> > > > +         * this case, the parallel apply worker will start waiting for the next
> > > > +         * streaming block whereas there is actually a pending streaming block
> > > > +         * available.
> > > > +         */
> > > > +        if (!wshared->pa_wait_for_stream)
> > > > +        {
> > > > +            wshared->pending_stream_count++;
> > > > +            SpinLockRelease(&wshared->mutex);
> > > > +            break;
> > > > +        }
> > > > +
> > > > +        SpinLockRelease(&wshared->mutex);
> > > > +    }
> > > > +}
> > > >
> > > > I think we should add an assertion to check if we don't hold the stream 
> > > > lock.
> > > >
> > > > I think that waiting for pa_wait_for_stream to be false in a busy
> > > > loop is not a good idea. It's not interruptible and there is no
> > > > guarantee that we can break from this loop in a short time. For
> > > > instance, if PA executes
> > > > pa_decr_and_wait_stream_block() a bit earlier than LA executes
> > > > pa_increment_stream_block(), LA has to wait for PA to acquire and
> > > > release the stream lock in a busy loop. It should not be long in
> > > > normal cases but the duration LA needs to wait for PA depends on PA,
> > > > which could be long. Also what if PA raises an error in
> > > > pa_lock_stream() due to some reasons? I think LA won't be able to
> > > > detect the failure.
> > > >
> > > > I think we should at least make it interruptible and maybe need to
> > > > add some sleep. Or perhaps we can use the condition variable for this 
> > > > case.
> > >
> >
> > Or we can leave this while (true) logic altogether for the first version 
> > and have a
> > comment to explain this race. Anyway, after restarting, it will probably be
> > solved. We can always change this part of the code later if this really 
> > turns out
> > to be problematic.
>
> Agreed, and reverted this part.
>
> >
> > > Thanks for the analysis, I will research this part.
> > >
> > > > ---
> > > > In worker.c, we have the following common pattern:
> > > >
> > > > case TRANS_LEADER_PARTIAL_SERIALIZE:
> > > > write change to the file;
> > > > do some work;
> > > > break;
> > > >
> > > > case TRANS_LEADER_SEND_TO_PARALLEL:
> > > > pa_send_data();
> > > >
> > > > if (winfo->serialize_changes)
> > > > {
> > > > do some worker required after writing changes to the file.
> > > > }
> > > > :
> > > > break;
> > > >
> > > > IIUC there are two different paths for partial serialization: (a)
> > > > where apply_action is TRANS_LEADER_PARTIAL_SERIALIZE, and (b) where
> > > > apply_action is TRANS_

Re: plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-15 Thread Masahiko Sawada
On Thu, Dec 15, 2022 at 4:53 PM Kyotaro Horiguchi
 wrote:
>
> At Thu, 15 Dec 2022 08:41:21 +0100, Pavel Stehule  
> wrote in
> > On Thu, Dec 15, 2022 at 8:25 AM Masahiko Sawada wrote:
> > > Is this a bug in plpgsql?
> > >
> >
> > I think it is by design.  There is not any callback that is called after an
> > exception.
> >
> > It is true, so some callbacks on statement error and function's error can
> > be nice. It can help me to implement profilers, or tracers more simply and
> > more robustly.
> >
> > But I am not sure about performance impacts. This is on a critical path.
>
> I didn't search for it, but I guess the end-side callbacks of all
> begin-end type callbacks are not called on exception. An additional
> PG_TRY level wouldn't be acceptable for performance reasons.

I don't think we need an additional PG_TRY() for that since exec_stmts()
is already called in PG_TRY() if there is an exception block. I meant
to call stmt_end() in PG_CATCH() in exec_stmt_block() (i.e., only when
an error is caught by the exception block). Currently, if an error is
caught, we call stmt_beg() and stmt_end() for statements executed
inside the exception block but call only stmt_beg() for the
statement that raised an error.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




plpgsq_plugin's stmt_end() is not called when an error is caught

2022-12-14 Thread Masahiko Sawada
Hi,

While investigating the issue reported on pg_hint_plan[1], I realized
that the stmt_end() callback is not called if an error raised during
statement execution is caught. I've attached a patch to check when
stmt_beg() and stmt_end() are called. Here is an example:

postgres(1:3220232)=# create or replace function testfn(a text) returns int as
$$
declare
  x int;
begin
  select a::int into x;
  return x;
  exception when others then return 99;
end;
$$
language plpgsql;
CREATE FUNCTION

postgres(1:3220232)=# select testfn('1');
NOTICE:  stmt_beg toplevel_block
NOTICE:  stmt_beg stmt SQL statement
NOTICE:  stmt_end stmt SQL statement
NOTICE:  stmt_beg stmt RETURN
NOTICE:  stmt_end stmt RETURN
NOTICE:  stmt_end toplevel_block
 testfn
--------
      1
(1 row)

postgres(1:3220232)=# select testfn('x');
NOTICE:  stmt_beg toplevel_block
NOTICE:  stmt_beg stmt SQL statement
NOTICE:  stmt_beg stmt RETURN
NOTICE:  stmt_end stmt RETURN
NOTICE:  stmt_end toplevel_block
 testfn
--------
     99
(1 row)

In exec_stmt_block(), we call exec_stmts() in a PG_TRY() block and
call the stmt_beg() and stmt_end() callbacks for each statement executed
there. However, if an error is caught while executing a statement, we
jump to the PG_CATCH() block in exec_stmt_block(), so we don't call the
stmt_end() callback that is supposed to be called in exec_stmts(). To
fix it, I think we can call the stmt_end() callback in the PG_CATCH() block.
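
Roughly, what I have in mind is something like this in exec_stmt_block()'s
error path (an untested sketch, not the attached patch):

```
PG_CATCH();
{
    /*
     * Let the plugin see the end of the statement that raised the error,
     * since exec_stmts() never got a chance to call stmt_end for it.
     */
    if (*plpgsql_plugin_ptr && (*plpgsql_plugin_ptr)->stmt_end)
        ((*plpgsql_plugin_ptr)->stmt_end) (estate, estate->err_stmt);

    /* ... the existing exception-handling code follows ... */
}
PG_END_TRY();
```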

pg_hint_plan increments and decrements a count in stmt_beg() and
stmt_end() callbacks, respectively[2]. It resets the counter when
raising an ERROR (not caught). But if an ERROR is caught, the counter
could be left as an invalid value.

Is this a bug in plpgsql?

Regards,

[1] https://github.com/ossc-db/pg_hint_plan/issues/93
[2] https://github.com/ossc-db/pg_hint_plan/blob/master/pg_hint_plan.c#L4870

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


test_plpgsql.patch
Description: Binary data


Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-14 Thread Masahiko Sawada
On Wed, Dec 14, 2022 at 3:48 PM Amit Kapila  wrote:
>
> On Wed, Dec 14, 2022 at 9:50 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada 
> >  wrote:
> > >
> > > Here are comments on v59 0001, 0002 patches:
> >
> > Thanks for the comments!
> >
> > > +void
> > > +pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
> > > +{
> > > +    while (1)
> > > +    {
> > > +        SpinLockAcquire(&wshared->mutex);
> > > +
> > > +        /*
> > > +         * Don't try to increment the count if the parallel apply worker is
> > > +         * taking the stream lock. Otherwise, there would be a race condition
> > > +         * that the parallel apply worker checks there is no pending streaming
> > > +         * block and before it actually starts waiting on a lock, the leader
> > > +         * sends another streaming block and take the stream lock again. In
> > > +         * this case, the parallel apply worker will start waiting for the next
> > > +         * streaming block whereas there is actually a pending streaming block
> > > +         * available.
> > > +         */
> > > +        if (!wshared->pa_wait_for_stream)
> > > +        {
> > > +            wshared->pending_stream_count++;
> > > +            SpinLockRelease(&wshared->mutex);
> > > +            break;
> > > +        }
> > > +
> > > +        SpinLockRelease(&wshared->mutex);
> > > +    }
> > > +}
> > >
> > > I think we should add an assertion to check if we don't hold the stream 
> > > lock.
> > >
> > > I think that waiting for pa_wait_for_stream to be false in a busy loop is 
> > > not a
> > > good idea. It's not interruptible and there is no guarantee that we can 
> > > break
> > > from this loop in a short time. For instance, if PA executes
> > > pa_decr_and_wait_stream_block() a bit earlier than LA executes
> > > pa_increment_stream_block(), LA has to wait for PA to acquire and release 
> > > the
> > > stream lock in a busy loop. It should not be long in normal cases but the
> > > duration LA needs to wait for PA depends on PA, which could be long. Also
> > > what if PA raises an error in
> > > pa_lock_stream() due to some reasons? I think LA won't be able to detect 
> > > the
> > > failure.
> > >
> > > I think we should at least make it interruptible and maybe need to add 
> > > some
> > > sleep. Or perhaps we can use the condition variable for this case.
> >
>
> Or we can leave this while (true) logic altogether for the first
> version and have a comment to explain this race. Anyway, after
> restarting, it will probably be solved. We can always change this part
> of the code later if this really turns out to be problematic.
>

+1. Thank you Hou-san for adding this comment in the latest version (v61) patch!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-14 Thread Masahiko Sawada
On Wed, Dec 14, 2022 at 1:20 PM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada 
>  wrote:
> >
> > On Sun, Dec 11, 2022 at 8:45 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Friday, December 9, 2022 3:14 PM Amit Kapila
> >  wrote:
> > > >
> > > > On Thu, Dec 8, 2022 at 12:37 PM houzj.f...@fujitsu.com
> > > >  wrote:
> > > > >
> > > >
> > > > Review comments
> > >
> > > Thanks for the comments!
> > >
> > > > ==
> > > > 1. Currently, we don't release the stream lock in LA (leader apply
> > > > worker) for "rollback to savepoint" and the reason is mentioned in
> > > > comments of
> > > > apply_handle_stream_abort() in the patch. But, today, while testing,
> > > > I found that can lead to deadlock which otherwise, won't happen on
> > > > the publisher. The key point is rollback to savepoint releases the
> > > > locks acquired by the particular subtransaction, so parallel apply
> > > > worker should also do the same. Consider the following example where
> > > > the transaction in session-1 is being performed by the parallel
> > > > apply worker and the transaction in session-2 is being performed by the
> > leader apply worker. I have simulated it by using GUC force_stream_mode.
> > > > Publisher
> > > > ==
> > > > Session-1
> > > > postgres=# begin;
> > > > BEGIN
> > > > postgres=*# savepoint s1;
> > > > SAVEPOINT
> > > > postgres=*# truncate t1;
> > > > TRUNCATE TABLE
> > > >
> > > > Session-2
> > > > postgres=# begin;
> > > > BEGIN
> > > > postgres=*# insert into t1 values(4);
> > > >
> > > > Session-1
> > > > postgres=*# rollback to savepoint s1; ROLLBACK
> > > >
> > > > Session-2
> > > > Commit;
> > > >
> > > > With or without commit of Session-2, this scenario will lead to
> > > > deadlock on the subscriber because PA (parallel apply worker) is
> > > > waiting for LA to send the next command, and LA is blocked by
> > > > Exclusive of PA. There is no deadlock on the publisher because
> > > > rollback to savepoint will release the lock acquired by truncate.
> > > >
> > > > To solve this, How about if we do three things before sending abort
> > > > of sub-transaction (a) unlock the stream lock, (b) increment
> > > > pending_stream_count,
> > > > (c) take the stream lock again?
> > > >
> > > > Now, if the PA is not already waiting on the stop, it will not wait
> > > > at stream_stop but will wait after applying abort of sub-transaction
> > > > and if it is already waiting at stream_stop, the wait will be
> > > > released. If this works then probably we should try to do (b) before 
> > > > (a) to
> > match the steps with stream_start.
> > >
> > > The solution works for me, I have changed the code as suggested.
> > >
> > >
> > > > 2. There seems to be another general problem in the way the patch
> > > > waits for stream_stop in PA (parallel apply worker). Currently, PA
> > > > checks, if there are no more pending streams then it tries to wait
> > > > for the next stream by waiting on a stream lock. However, it is
> > > > possible after PA checks there is no pending stream and before it
> > > > actually starts waiting on a lock, the LA sends another stream for
> > > > which even stream_stop is sent, in this case, PA will start waiting
> > > > for the next stream whereas there is actually a pending stream
> > > > available. In this case, it won't lead to any problem apart from
> > > > delay in applying the changes in such cases but for the case mentioned 
> > > > in
> > the previous point (Pont 1), it can lead to deadlock even after we 
> > implement the
> > solution proposed to solve it.
> > >
> > > Thanks for reporting, I have introduced another flag in shared memory
> > > and use it to prevent the leader from incrementing the
> > > pending_stream_count if the parallel apply worker is trying to lock the 
> > > stream
> > lock.
> > >
> > >
> > > > 3. The other point to consider is that for
> > > > stream_commit/prepare/abo

Re: Add index scan progress to pg_stat_progress_vacuum

2022-12-13 Thread Masahiko Sawada
On Tue, Dec 13, 2022 at 1:40 PM Imseih (AWS), Sami  wrote:
>
> Thanks for the feedback. I agree with the feedback, except
> for
>
> >need to have ParallelVacuumProgress. I see
> >parallel_vacuum_update_progress() uses this value but I think it's
> >better to pass ParallelVacuumState to via IndexVacuumInfo.
>
> I was trying to avoid passing a pointer to
> ParallelVacuumState in IndexVacuuminfo.
>
> ParallelVacuumProgress is implemented in the same
> way as VacuumSharedCostBalance and
> VacuumActiveNWorkers. See vacuum.h
>
> These values are reset at the start of a parallel vacuum cycle
> and reset at the end of an index vacuum cycle.
>
> This seems like a better approach and less invasive.
> What would be a reason not to go with this approach?

First of all, I don't think we need to declare ParallelVacuumProgress
in vacuum.c since it's set and used only in vacuumparallel.c. But I
don't even think it's a good idea to declare it in vacuumparallel.c as
a static variable. The primary reason is that it adds things we need
to care about. For example, what if we raise an error during index
vacuum? The transaction aborts but ParallelVacuumProgress still refers
to something old. Suppose further that the next parallel vacuum doesn't
launch any workers; the leader process would then still end up accessing
the stale value pointed to by ParallelVacuumProgress, which causes a
SEGV. So we need to reset it anyway at the beginning of each parallel
vacuum. That's easy to do today, but once the parallel vacuum code gets
more complex, we could forget to take care of it.

IMO VacuumSharedCostBalance and VacuumActiveNWorkers have a different
story. They are set in vacuumparallel.c and are used in vacuum.c for
vacuum delay. If they weren't global variables, we would need to pass
them to every function that could eventually call the vacuum delay
function. So it makes sense to me to have them as global variables. On
the other hand, for ParallelVacuumProgress, it's a common pattern that
ambulkdelete(), amvacuumcleanup() or a common index scan routine like
btvacuumscan() checks the progress. I don't think index AM needs to
pass the value down to many of its functions. So it makes sense to me
to pass it via IndexVacuumInfo.
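
To make that concrete, here is a minimal sketch of what I have in mind
(the new field name and the NULL convention are my assumptions, not code
from any posted patch):

typedef struct IndexVacuumInfo
{
    Relation    index;          /* the index being vacuumed */
    /* ... existing fields stay unchanged ... */
    ParallelVacuumState *pvs;   /* assumption: NULL unless the caller is a
                                 * parallel vacuum leader */
} IndexVacuumInfo;

/* then, inside an index AM scan loop such as btvacuumscan(): */
if (info->pvs)
    parallel_vacuum_update_progress(info->pvs);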

Having said that, I'd like to hear opinions also from other hackers, I
might be wrong and it's more invasive as you pointed out.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-13 Thread Masahiko Sawada
 * sending the STREAM_ABORT to ensure that the
parallel apply
+* worker will wait on the lock for the next
set of changes after
+* processing the STREAM_ABORT message if it
is not already waiting
+* for STREAM_STOP message.
+*/
+   if (!toplevel_xact)
+   {
+   pa_increment_stream_block(winfo->shared);
+   pa_lock_stream(xid, AccessExclusiveLock);
+   }
+
+   /* Send STREAM ABORT message to the parallel
apply worker. */
+   pa_send_data(winfo, s->len, s->data);
+
+   if (toplevel_xact)
+   (void) pa_free_worker(winfo, xid);
+
+   break;

In apply_handle_stream_abort(), it's better to add a comment explaining
why we don't need to wait for PA to finish.


Also, given that we don't wait for PA to finish in this case, does it
really make sense to call pa_free_worker() immediately after sending
STREAM_ABORT?

---
PA acquires the transaction lock in AccessShare mode whereas LA
acquires it in AccessExclusive mode. Is it better to do the opposite?
Just as a backend process acquires the lock on its own XID in Exclusive
mode, we could have PA acquire the lock on its XID in Exclusive mode
whereas others acquire it in Share mode to wait.

---
 void
pa_lock_stream(TransactionId xid, LOCKMODE lockmode)
{
LockApplyTransactionForSession(MyLogicalRepWorker->subid, xid,
   PARALLEL_APPLY_LOCK_STREAM, lockmode);
}

I think that since we don't need to let the caller specify an arbitrary
lock mode but only need shared and exclusive modes, we can simplify this
by taking a boolean argument, say "shared", instead of a lockmode.
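
For illustration, the signature I have in mind would look something like
this (a sketch only):

void
pa_lock_stream(TransactionId xid, bool shared)
{
    LockApplyTransactionForSession(MyLogicalRepWorker->subid, xid,
                                   PARALLEL_APPLY_LOCK_STREAM,
                                   shared ? AccessShareLock : AccessExclusiveLock);
}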

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-12 Thread Masahiko Sawada
On Mon, Dec 12, 2022 at 7:14 PM John Naylor
 wrote:
>
>
> On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada  wrote:
> >
> > On Fri, Dec 9, 2022 at 5:53 PM John Naylor  
> > wrote:
> > >
>
> > > I don't think that'd be very controversial, but I'm also not sure why 
> > > we'd need 4MB -- can you explain in more detail what exactly we'd need so 
> > > that the feature would work? (The minimum doesn't have to work *well* 
> > > IIUC, just do some useful work and not fail).
> >
> > The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
> > the radix tree is using dsa_get_total_size(). If the size returned by
> > dsa_get_total_size() (+ some memory used by TIDStore meta information)
> > exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
> > and heap vacuum. However, when allocating DSA memory for
> > radix_tree_control at creation, we allocate 1MB
> > (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
> > radix_tree_control from it. dsa_get_total_size() returns 1MB even if
> > there is no TID collected.
>
> 2MB makes sense.
>
> If the metadata is small, it seems counterproductive to count it towards the 
> total. We want the decision to be driven by blocks allocated. I have an idea 
> on that below.
>
> > > Remember when we discussed how we might approach parallel pruning? I 
> > > envisioned a local array of a few dozen kilobytes to reduce contention on 
> > > the tidstore. We could use such an array even for a single worker (always 
> > > doing the same thing is simpler anyway). When the array fills up enough 
> > > so that the next heap page *could* overflow it: Stop, insert into the 
> > > store, and check the store's memory usage before continuing.
> >
> > Right, I think it's no problem in slab cases. In DSA cases, the new
> > segment size follows a geometric series that approximately doubles the
> > total storage each time we create a new segment. This behavior comes
> > from the fact that the underlying DSM system isn't designed for large
> > numbers of segments.
>
> And taking a look, the size of a new segment can get quite large. It seems we 
> could test if the total DSA area allocated is greater than half of 
> maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, 
> then the area will contain just under a power of two the last time it passes 
> the test. The next segment will bring it to about 3/4 full, like this:
>
> maintenance work mem = 256MB, so stop if we go over 128MB:
>
> 2*(1+2+4+8+16+32) = 126MB -> keep going
> 126MB + 64 = 190MB-> stop
>
> That would be a simple way to be conservative with the memory limit. The 
> unfortunate aspect is that the last segment would be mostly wasted, but it's 
> paradise compared to the pessimistically-sized single array we have now (even 
> with Peter G.'s VM snapshot informing the allocation size, I imagine).

Right. In this case, even if we allocate 64MB, we will use only 2088
bytes at maximum. So I think the memory space used for vacuum is
practically limited to half.

>
> And as for minimum possible maintenance work mem, I think this would work 
> with 2MB, if the community is okay with technically going over the limit by a 
> few bytes of overhead if a buildfarm animal set to that value. I imagine it 
> would never go over the limit for realistic (and even most unrealistic) 
> values. Even with a VM snapshot page in memory and small local arrays of 
> TIDs, I think with this scheme we'll be well under the limit.

Looking at other code using DSA, such as tidbitmap.c and nodeHash.c, it
seems that they look only at memory that is actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree. Templating whether or not to count
the memory usage might help avoid the overhead.
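
To sketch what I mean by counting only dsa_allocate'd memory and
templating that counting (all names below are hypothetical, not from the
PoC patch):

/* Hypothetical allocation wrapper inside the radix tree. */
static dsa_pointer
rt_alloc_node(radix_tree *tree, Size size)
{
    dsa_pointer node = dsa_allocate(tree->dsa, size);

#ifdef RT_MEASURE_MEMORY            /* template switch: count memory or not */
    tree->ctl->mem_used += size;    /* bytes handed out, not DSA segment sizes */
#endif

    return node;
}

/* The caller (e.g. TIDStore) would then check the counter, not the segments: */
bool    exceeded = rt_memory_usage(tree) > (Size) maintenance_work_mem * 1024;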

> After this feature is complete, I think we should consider a follow-on patch 
> to get rid of vacuum_work_mem, since it would no longer be needed.

I think you meant autovacuum_work_mem. Yes, I also think we can get rid of it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-09 Thread Masahiko Sawada
On Fri, Dec 9, 2022 at 5:53 PM John Naylor  wrote:
>
>
> On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada  wrote:
>
> > In the meanwhile, I've been working on vacuum integration. There are
> > two things I'd like to discuss some time:
> >
> > The first is the minimum of maintenance_work_mem, 1 MB. Since the
> > initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
> > vacuum with radix tree cannot work with the minimum
> > maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
> > we can start a new thread for that.
>
> I don't think that'd be very controversial, but I'm also not sure why we'd 
> need 4MB -- can you explain in more detail what exactly we'd need so that the 
> feature would work? (The minimum doesn't have to work *well* IIUC, just do 
> some useful work and not fail).

The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
the radix tree is using dsa_get_total_size(). If the size returned by
dsa_get_total_size() (+ some memory used by TIDStore meta information)
exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate 1MB
(DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
radix_tree_control from it. dsa_get_total_size() returns 1MB even if
there is no TID collected.
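
For reference, the check in the PoC behaves roughly like this (a
simplified sketch; the struct and field names here are assumptions):

static bool
tidstore_exceeds_limit(TidStore *ts)
{
    /*
     * dsa_get_total_size() already reports DSA_INITIAL_SEGMENT_SIZE (1MB)
     * right after creation, so with maintenance_work_mem below 2MB the
     * limit can be hit before any TID is stored.
     */
    Size    used = dsa_get_total_size(ts->dsa) + ts->meta_bytes;

    return used > (Size) maintenance_work_mem * 1024;
}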

>
> > The second is how to limit the size of the radix tree to
> > maintenance_work_mem. I think that it's tricky to estimate the maximum
> > number of keys in the radix tree that fit in maintenance_work_mem. The
> > radix tree size varies depending on the key distribution. The next
> > idea I considered was how to limit the size when inserting a key. In
> > order to strictly limit the radix tree size, probably we have to
> > change the rt_set so that it breaks off and returns false if the radix
> > tree size is about to exceed the memory limit when we allocate a new
> > node or grow a node kind/class.
>
> That seems complex, fragile, and wrong scope.
>
> > Ideally, I'd like to control the size
> > outside of radix tree (e.g. TIDStore) since it could introduce
> > overhead to rt_set() but probably we need to add such logic in radix
> > tree.
>
> Does the TIDStore have the ability to ask the DSA (or slab context) to see 
> how big it is?

Yes, TIDStore can check it using dsa_get_total_size().

> If a new segment has been allocated that brings us to the limit, we can stop 
> when we discover that fact. In the local case with slab blocks, it won't be 
> on nice neat boundaries, but we could check if we're within the largest block 
> size (~64kB) of overflow.
>
> Remember when we discussed how we might approach parallel pruning? I 
> envisioned a local array of a few dozen kilobytes to reduce contention on the 
> tidstore. We could use such an array even for a single worker (always doing 
> the same thing is simpler anyway). When the array fills up enough so that the 
> next heap page *could* overflow it: Stop, insert into the store, and check 
> the store's memory usage before continuing.

Right, I think it's no problem in slab cases. In DSA cases, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.


Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-09 Thread Masahiko Sawada
On Fri, Dec 9, 2022 at 3:05 PM Amit Kapila  wrote:
>
> On Fri, Dec 9, 2022 at 7:45 AM Peter Smith  wrote:
> >
> > On Thu, Dec 8, 2022 at 7:43 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Dec 8, 2022 at 4:42 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Dec 8, 2022 at 12:42 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Wed, Dec 7, 2022 at 10:03 PM houzj.f...@fujitsu.com
> > > > >  wrote:
> > > > > >
> > > > > >
> > > > > > > +static void
> > > > > > > +ProcessParallelApplyInterrupts(void)
> > > > > > > +{
> > > > > > > +CHECK_FOR_INTERRUPTS();
> > > > > > > +
> > > > > > > +if (ShutdownRequestPending)
> > > > > > > +{
> > > > > > > +ereport(LOG,
> > > > > > > +(errmsg("logical replication 
> > > > > > > parallel
> > > > > > > apply worker for subscrip
> > > > > > > tion \"%s\" has finished",
> > > > > > > +
> > > > > > > MySubscription->name)));
> > > > > > > +
> > > > > > > +apply_worker_clean_exit(false);
> > > > > > > +}
> > > > > > > +
> > > > > > > +if (ConfigReloadPending)
> > > > > > > +{
> > > > > > > +ConfigReloadPending = false;
> > > > > > > +ProcessConfigFile(PGC_SIGHUP);
> > > > > > > +}
> > > > > > > +}
> > > > > > >
> > > > > > > I personally think that we don't need to have a function to do 
> > > > > > > only
> > > > > > > these few things.
> > > > > >
> > > > > > I thought that introduce a new function make the handling of worker 
> > > > > > specific
> > > > > > Interrupts logic similar to other existing ones. Like:
> > > > > > ProcessWalRcvInterrupts () in walreceiver.c and 
> > > > > > HandlePgArchInterrupts() in
> > > > > > pgarch.c ...
> > > > >
> > > > > I think the difference from them is that there is only one place to
> > > > > call ProcessParallelApplyInterrupts().
> > > > >
> > > >
> > > > But I feel it is better to isolate this code in a separate function.
> > > > What if we decide to extend it further by having some logic to stop
> > > > workers after reloading of config?
> > >
> > > I think we can separate the function at that time. But let's keep the
> > > current code as you and Hou agree with the current code. I'm not going
> > > to insist on that.
> > >
> > > >
> > > > > >
> > > > > > > ---
> > > > > > > server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> > > > > > > options.proto.logical.proto_version =
> > > > > > > +server_version >= 16 ?
> > > > > > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
> > > > > > > server_version >= 15 ?
> > > > > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
> > > > > > > server_version >= 14 ?
> > > > > > > LOGICALREP_PROTO_STREAM_VERSION_NUM :
> > > > > > > LOGICALREP_PROTO_VERSION_NUM;
> > > > > > >
> > > > > > > Instead of always using the new protocol version, I think we can 
> > > > > > > use
> > > > > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM if the streaming is not
> > > > > > > 'parallel'. That way, we don't need to change protocl version 
> > > > > > > check
> > > > > > > logic in pgoutput.c and don't need to expose 
> > > > > > > defGetStreamingMode().
> > > > > > > What do you think?
> > > > > >
> > > > > > I think that some user can also use the new version number when 
> > > > > > trying to get

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-08 Thread Masahiko Sawada
On Tue, Dec 6, 2022 at 7:32 PM John Naylor  wrote:
>
> On Fri, Dec 2, 2022 at 11:42 PM Masahiko Sawada  wrote:
> >
> > > On Mon, Nov 14, 2022 at 7:59 PM John Naylor 
> > >  wrote:
> > > >
> > > > - Optimize node128 insert.
> > >
> > > I've attached a rough start at this. The basic idea is borrowed from our 
> > > bitmapset nodes, so we can iterate over and operate on word-sized (32- or 
> > > 64-bit) types at a time, rather than bytes.
> >
> > Thanks! I think this is a good idea.
> >
> > > To make this easier, I've moved some of the lower-level macros and types 
> > > from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a 
> > > separate email thread to resolve the coding style clash this causes, so 
> > > that can be put off for later.
>
> I started a separate thread [1], and 0002 comes from feedback on that. There 
> is a FIXME about using WORDNUM and BITNUM, at least with that spelling. I'm 
> putting that off to ease rebasing the rest as v13 -- getting some CI testing 
> with 0002 seems like a good idea. There are no other changes yet. Next, I 
> will take a look at templating local vs. shared memory. I might try basing 
> that on the styles of both v12 and v8, and see which one works best with 
> templating.

Thank you so much!

In the meanwhile, I've been working on vacuum integration. There are
two things I'd like to discuss some time:

The first is the minimum of maintenance_work_mem, 1 MB. Since the
initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
vacuum with radix tree cannot work with the minimum
maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
we can start a new thread for that.

The second is how to limit the size of the radix tree to
maintenance_work_mem. I think that it's tricky to estimate the maximum
number of keys in the radix tree that fit in maintenance_work_mem. The
radix tree size varies depending on the key distribution. The next
idea I considered was how to limit the size when inserting a key. In
order to strictly limit the radix tree size, probably we have to
change the rt_set so that it breaks off and returns false if the radix
tree size is about to exceed the memory limit when we allocate a new
node or grow a node kind/class. Ideally, I'd like to control the size
outside of radix tree (e.g. TIDStore) since it could introduce
overhead to rt_set() but probably we need to add such logic in radix
tree.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-08 Thread Masahiko Sawada
On Thu, Dec 8, 2022 at 4:42 PM Amit Kapila  wrote:
>
> On Thu, Dec 8, 2022 at 12:42 PM Masahiko Sawada  wrote:
> >
> > On Wed, Dec 7, 2022 at 10:03 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > >
> > > > +static void
> > > > +ProcessParallelApplyInterrupts(void)
> > > > +{
> > > > +CHECK_FOR_INTERRUPTS();
> > > > +
> > > > +if (ShutdownRequestPending)
> > > > +{
> > > > +ereport(LOG,
> > > > +(errmsg("logical replication parallel
> > > > apply worker for subscrip
> > > > tion \"%s\" has finished",
> > > > +
> > > > MySubscription->name)));
> > > > +
> > > > +apply_worker_clean_exit(false);
> > > > +}
> > > > +
> > > > +if (ConfigReloadPending)
> > > > +{
> > > > +ConfigReloadPending = false;
> > > > +ProcessConfigFile(PGC_SIGHUP);
> > > > +}
> > > > +}
> > > >
> > > > I personally think that we don't need to have a function to do only
> > > > these few things.
> > >
> > > I thought that introduce a new function make the handling of worker 
> > > specific
> > > Interrupts logic similar to other existing ones. Like:
> > > ProcessWalRcvInterrupts () in walreceiver.c and HandlePgArchInterrupts() 
> > > in
> > > pgarch.c ...
> >
> > I think the difference from them is that there is only one place to
> > call ProcessParallelApplyInterrupts().
> >
>
> But I feel it is better to isolate this code in a separate function.
> What if we decide to extend it further by having some logic to stop
> workers after reloading of config?

I think we can split it into a separate function at that time. But
let's keep the current code since you and Hou-san agree with it; I'm not
going to insist on that.

>
> > >
> > > > ---
> > > > server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> > > > options.proto.logical.proto_version =
> > > > +server_version >= 16 ?
> > > > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
> > > > server_version >= 15 ?
> > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
> > > > server_version >= 14 ?
> > > > LOGICALREP_PROTO_STREAM_VERSION_NUM :
> > > > LOGICALREP_PROTO_VERSION_NUM;
> > > >
> > > > Instead of always using the new protocol version, I think we can use
> > > > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM if the streaming is not
> > > > 'parallel'. That way, we don't need to change protocl version check
> > > > logic in pgoutput.c and don't need to expose defGetStreamingMode().
> > > > What do you think?
> > >
> > > I think that some user can also use the new version number when trying to 
> > > get
> > > changes (via pg_logical_slot_peek_binary_changes or other functions), so 
> > > I feel
> > > leave the check for new version number seems fine.
> > >
> > > Besides, I feel even if we don't use new version number, we still need to 
> > > use
> > > defGetStreamingMode to check if parallel mode in used as we need to send
> > > abort_lsn when parallel is in used. I might be missing something, sorry 
> > > for
> > > that. Can you please explain the idea a bit ?
> >
> > My idea is that we use LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM if
> > (server_version >= 16 && MySubscription->stream ==
> > SUBSTREAM_PARALLEL). If the stream is SUBSTREAM_ON, we use
> > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM even if server_version is
> > 16. That way, in pgoutput.c, we can send abort_lsn if the protocol
> > version is LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM. We don't need
> > to send "streaming = parallel" to the publisher since the publisher
> > can decide whether or not to send abort_lsn based on the protocol
> > version (still needs to send "streaming = on" though). I might be
> > missing something.
> >
>
> What if we decide to send some more additional information as part of
> another patch like we are discussing in the thread [1]? Now, we won't
> be able to decide the version number based on just the streaming
> option. Also, in such a case, even for
> LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM, it may not be a good
> idea to send additional abort information unless the user has used the
> streaming=parallel option.

If we're going to send the additional information, it makes sense to
send streaming=parallel. But then the next question is: why do we need
to increase the protocol version for the parallel apply feature? If
sending the additional information is also controlled by an option
like "streaming", we can decide what we send based on these options,
no?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-07 Thread Masahiko Sawada
On Wed, Dec 7, 2022 at 10:03 PM houzj.f...@fujitsu.com
 wrote:
>
> On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada 
>  wrote:
> >
> > On Mon, Dec 5, 2022 at 1:29 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Sunday, December 4, 2022 7:17 PM houzj.f...@fujitsu.com
> > 
> > > >
> > > > Thursday, December 1, 2022 8:40 PM Amit Kapila
> > 
> > > > wrote:
> > > > > Some other comments:
> > > > ...
> > > > Attach the new version patch set which addressed most of the comments
> > > > received so far except some comments being discussed[1].
> > > > [1]
> > https://www.postgresql.org/message-id/OS0PR01MB57167BF64FC0891734C
> > 8E81A94149%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> > >
> > > Attach a new version patch set which fixed a testcase failure on CFbot.
> >
> > Here are some comments on v56 0001, 0002 patches. Please ignore
> > comments if you already incorporated them in v57.
>
> Thanks for the comments!
>
> > +static void
> > +ProcessParallelApplyInterrupts(void)
> > +{
> > +CHECK_FOR_INTERRUPTS();
> > +
> > +if (ShutdownRequestPending)
> > +{
> > +ereport(LOG,
> > +(errmsg("logical replication parallel
> > apply worker for subscrip
> > tion \"%s\" has finished",
> > +MySubscription->name)));
> > +
> > +apply_worker_clean_exit(false);
> > +}
> > +
> > +if (ConfigReloadPending)
> > +{
> > +ConfigReloadPending = false;
> > +ProcessConfigFile(PGC_SIGHUP);
> > +}
> > +}
> >
> > I personally think that we don't need to have a function to do only
> > these few things.
>
> I thought that introduce a new function make the handling of worker specific
> Interrupts logic similar to other existing ones. Like:
> ProcessWalRcvInterrupts () in walreceiver.c and HandlePgArchInterrupts() in
> pgarch.c ...

I think the difference from them is that there is only one place to
call ProcessParallelApplyInterrupts().

>
> >
> > Should we change the names to something like
> > LOGICALREP_STREAM_PARALLEL?
>
> Agreed, will change.
>
> > ---
> > + * The lock graph for the above example will look as follows:
> > + * LA (waiting to acquire the lock on the unique index) -> PA (waiting to
> > + * acquire the lock on the remote transaction) -> LA
> >
> > and
> >
> > + * The lock graph for the above example will look as follows:
> > + * LA (waiting to acquire the transaction lock) -> PA-2 (waiting to 
> > acquire the
> > + * lock due to unique index constraint) -> PA-1 (waiting to acquire the 
> > stream
> > + * lock) -> LA
> >
> > "(waiting to acquire the lock on the remote transaction)" in the first
> > example and "(waiting to acquire the stream lock)" in the second
> > example is the same meaning, right? If so, I think we should use
> > either term for consistency.
>
> Will change.
>
> > ---
> > +bool   write_abort_info = (data->streaming ==
> > SUBSTREAM_PARALLEL);
> >
> > I think that instead of setting write_abort_info every time when
> > pgoutput_stream_abort() is called, we can set it once, probably in
> > PGOutputData, at startup.
>
> I thought that since we already have a "stream" flag in PGOutputData, I am not
> sure if it would be better to introduce another flag for the same option.

I see your point. Another way is to have it as a static variable like
publish_no_origin. But since it's a trivial change, I'm also fine with
the current code.

>
> > ---
> > server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
> > options.proto.logical.proto_version =
> > +server_version >= 16 ?
> > LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
> > server_version >= 15 ?
> > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
> > server_version >= 14 ?
> > LOGICALREP_PROTO_STREAM_VERSION_NUM :
> > LOGICALREP_PROTO_VERSION_NUM;
> >
> > Instead of always using the new protocol version, I think we can use
> > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM if the streaming is not
> > 'parallel'. That way, we don't need to change protocl version check
> > logic in pgoutput.c and don't need to expose defGetStreamingMode().
> > What do you t

Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-07 Thread Masahiko Sawada
On Thu, Dec 8, 2022 at 1:52 PM Amit Kapila  wrote:
>
> On Wed, Dec 7, 2022 at 6:33 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Wednesday, December 7, 2022 7:51 PM Masahiko Sawada 
> >  wrote:
> > >
> >
> > > ---
> > > When max_parallel_apply_workers_per_subscription is changed to a value
> > > lower than the number of parallel worker running at that time, do we
> > > need to stop extra workers?
> >
> > I think we can do this, like adding a check in the main loop of leader 
> > worker, and
> > check every time after reloading the conf. OTOH, we will also stop the 
> > worker after
> > finishing a transaction, so I am slightly not sure do we need to add 
> > another check logic here.
> > But I am fine to add it if you think it would be better.
> >
>
> I think this is tricky because it is possible that all active workers
> are busy with long-running transactions, so, I think stopping them
> doesn't make sense.

Right, we should not stop running parallel workers.

> I think as long as we are freeing them after use
> it seems okay to me. OTOH, each time after finishing the transaction,
> we can stop the workers, if the workers in the free pool exceed
> 'max_parallel_apply_workers_per_subscription'.

Or the leader apply worker can check that after reloading the config file.
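
For illustration, the check after a config reload could look something
like this (a rough sketch; the pool list, the in_use flag, and the
threshold are my assumptions based on the patch's naming):

static void
pa_stop_excess_free_workers(void)
{
    List       *workers = list_copy(ParallelApplyWorkersList);
    ListCell   *lc;

    foreach(lc, workers)
    {
        ParallelApplyWorkerInfo *winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
        int         slot_no;
        uint16      generation;

        if (list_length(ParallelApplyWorkersList) <=
            max_parallel_apply_workers_per_subscription / 2)
            break;              /* pool is back within the configured size */

        if (winfo->in_use)
            continue;           /* never stop a worker applying a transaction */

        SpinLockAcquire(&winfo->shared->mutex);
        generation = winfo->shared->logicalrep_worker_generation;
        slot_no = winfo->shared->logicalrep_worker_slot_no;
        SpinLockRelease(&winfo->shared->mutex);

        logicalrep_pa_worker_stop(slot_no, generation);
        pa_free_worker_info(winfo);     /* assumed to unlink it from the list */
    }

    list_free(workers);
}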

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Assertion failure in SnapBuildInitialSnapshot()

2022-12-07 Thread Masahiko Sawada
On Mon, Nov 21, 2022 at 4:31 PM Amit Kapila  wrote:
>
> On Sat, Nov 19, 2022 at 6:35 AM Andres Freund  wrote:
> >
> > On 2022-11-18 11:20:36 +0530, Amit Kapila wrote:
> > > Okay, updated the patch accordingly.
> >
> > Assuming it passes tests etc, this'd work for me.
> >
>
> Thanks, Pushed.

The same assertion failure has been reported on another thread[1].
Since I could reproduce this issue several times in my environment
I've investigated the root cause.

I think there is a race condition between CreateInitDecodingContext()
and LogicalConfirmReceivedLocation() when updating
procArray->replication_slot_xmin.

What I observed in the test was that a walsender process called:
SnapBuildProcessRunningXacts()
  LogicalIncreaseXminForSlot()
LogicalConfirmReceivedLocation()
  ReplicationSlotsComputeRequiredXmin(false).

In ReplicationSlotsComputeRequiredXmin() it acquired the
ReplicationSlotControlLock and got 0 as the minimum xmin since there
was no wal sender having effective_xmin. Before calling
ProcArraySetReplicationSlotXmin() (i.e. before acquiring
ProcArrayLock), another walsender process called
CreateInitDecodingContext(), acquired ProcArrayLock, computed
slot->effective_catalog_xmin, called
ReplicationSlotsComputeRequiredXmin(true). Since its
effective_catalog_xmin had been set, it got 39968 as the minimum xmin,
and updated replication_slot_xmin. However, as soon as the second
walsender released ProcArrayLock, the first walsender updated the
replication_slot_xmin to 0. After that, the second walsender called
SnapBuildInitialSnapshot(), and GetOldestSafeDecodingTransactionId()
returned an XID newer than snap->xmin.

One idea to fix this issue is that in
ReplicationSlotsComputeRequiredXmin(), we compute the minimum xmin
while holding both ProcArrayLock and ReplicationSlotControlLock, and
release only ReplicationSlotControlLock before updating the
replication_slot_xmin. I'm concerned that it will increase contention
on ProcArrayLock, but I've attached the patch for discussion.
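
Roughly, the idea is along these lines (a simplified sketch, not the
attached patch verbatim; the existing per-slot loop is elided):

void
ReplicationSlotsComputeRequiredXmin(bool already_locked)
{
    TransactionId agg_xmin = InvalidTransactionId;
    TransactionId agg_catalog_xmin = InvalidTransactionId;

    /* Hold ProcArrayLock across both the computation and the update. */
    if (!already_locked)
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

    /* ... existing loop over the slots computing agg_xmin and
     * agg_catalog_xmin ... */

    LWLockRelease(ReplicationSlotControlLock);

    /*
     * Update replication_slot_xmin while still holding ProcArrayLock, so a
     * concurrent walsender cannot overwrite it with a stale computation.
     */
    ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, true);

    if (!already_locked)
        LWLockRelease(ProcArrayLock);
}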

Regards,

[1] 
https://www.postgresql.org/message-id/tencent_7EB71DA5D7BA00EB0B429DCE45D0452B6406%40qq.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_slot_xmin_race_condition.patch
Description: Binary data


Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-07 Thread Masahiko Sawada
On Wed, Dec 7, 2022 at 4:31 PM Amit Kapila  wrote:
>
> On Wed, Dec 7, 2022 at 10:10 AM Masahiko Sawada  wrote:
> >
> > On Wed, Dec 7, 2022 at 1:29 PM Amit Kapila  wrote:
> > >
> > > Right, but the leader will anyway exit at some point either due to an
> > > ERROR like "lost connection ... to parallel worker" or with a LOG
> > > like: "... will restart because of a parameter change" but I see your
> > > point. So, will it be better if we have a LOG message here and then
> > > proc_exit()? Do you have something else in mind for this?
> >
> > No, I was thinking that too. It's better to write a LOG message and do
> > proc_exit().
> >
> > Regarding the error "lost connection ... to parallel worker", it could
> > still happen depending on the timing even if the parallel worker
> > cleanly exits due to parameter changes, right? If so, I'm concerned
> > that it could lead to disabling the subscription unexpectedly if
> > disable_on_error is enabled.
> >
>
> If we want to avoid this then I think we have the following options
> (a) parallel apply skips checking parameter change (b) parallel worker
> won't exit on parameter change but will silently absorb the parameter
> and continue its processing; anyway, the leader will detect it and
> stop the worker for the parameter change
>
> Among these, the advantage of (b) is that it will allow reflecting the
> parameter change (that doesn't need restart) in the parallel worker.
> Do you have any better idea to deal with this?

I think (b) is better. We also need the synchronous_commit parameter
to be reflected in the parallel workers in the worker pool.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-07 Thread Masahiko Sawada
;
+}
+else if (server_version >= 14 &&
+ MySubscription->stream != SUBSTREAM_OFF)
+{
+options.proto.logical.streaming_str = pstrdup("on");
+MyLogicalRepWorker->parallel_apply = false;
+}

I think we don't need to use pstrdup().

---
-   BeginTransactionBlock();
-   CommitTransactionCommand(); /* Completes the preceding Begin command. */
+   if (!IsTransactionBlock())
+   {
+   BeginTransactionBlock();
+   CommitTransactionCommand(); /* Completes the preceding
Begin command. */
+   }

Do we need this change? In my environment, 'make check-world' passes
without this change.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Force streaming every change in logical decoding

2022-12-06 Thread Masahiko Sawada
On Wed, Dec 7, 2022 at 12:55 PM Amit Kapila  wrote:
>
> On Wed, Dec 7, 2022 at 7:31 AM Masahiko Sawada  wrote:
> >
> > On Wed, Dec 7, 2022 at 8:46 AM Peter Smith  wrote:
> > >
> > > >
> > > > Yeah, I think this can also help in reducing the time for various
> > > > tests in test_decoding/stream and
> > > > src/test/subscription/t/*_stream_*.pl file by reducing the number of
> > > > changes required to invoke streaming mode. Can we think of making this
> > > > GUC extendible to even test more options on server-side (publisher)
> > > > and client-side (subscriber) cases? For example, we can have something
> > > > like logical_replication_mode with the following valid values: (a)
> > > > server_serialize: this will serialize each change to file on
> > > > publishers and then on commit restore and send all changes; (b)
> > > > server_stream: this will stream each change as currently proposed in
> > > > your patch. Then if we want to extend it for subscriber-side testing
> > > > then we can introduce new options like client_serialize for the case
> > > > being discussed in the email [1].
> > > >
> > > > Thoughts?
> > >
> > > There is potential for lots of developer GUCs for testing/debugging in
> > > the area of logical replication but IMO it might be better to keep
> > > them all separated. Putting everything into a single
> > > 'logical_replication_mode' might cause difficulties later when/if you
> > > want combinations of the different modes.
> >
> > I think we want the developer option that forces streaming changes
> > during logical decoding to be PGC_USERSET but probably the developer
> > option for testing the parallel apply feature would be PGC_SIGHUP.
> >
>
> Ideally, that is true but if we want to combine the multiple modes in
> one parameter, is there a harm in keeping it as PGC_SIGHUP?

It doesn't do much harm, but we will end up doing ALTER SYSTEM and
pg_reload_conf() even in regression tests (e.g. in
test_decoding/stream.sql).

>
> > Also, since streaming changes is not specific to logical replication
> > but to logical decoding, I'm not sure logical_replication_XXX is a
> > good name. IMO having force_stream_mode and a different GUC for
> > testing the parallel apply feature makes sense to me.
> >
>
> But if we want to have a separate variable for testing/debugging
> streaming like force_stream_mode, why not for serializing as well? And
> if we want for both then we can even think of combining them in one
> variable as logical_decoding_mode with values as 'stream' and
> 'serialize'.

Making it enum makes sense to me.
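
For example, the enum part could look something like this (purely
illustrative; the GUC name, values, and default are assumptions, not a
proposal for the final spelling):

/* Sketch of a combined developer GUC for logical decoding. */
typedef enum
{
    LOGICAL_DECODING_MODE_BUFFERED,     /* default: obey logical_decoding_work_mem */
    LOGICAL_DECODING_MODE_STREAM,       /* stream each change immediately */
    LOGICAL_DECODING_MODE_SERIALIZE     /* spill each change to disk */
} LogicalDecodingMode;

static const struct config_enum_entry logical_decoding_mode_options[] = {
    {"buffered", LOGICAL_DECODING_MODE_BUFFERED, false},
    {"stream", LOGICAL_DECODING_MODE_STREAM, false},
    {"serialize", LOGICAL_DECODING_MODE_SERIALIZE, false},
    {NULL, 0, false}
};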

> The first one specified would be given preference. Also,
> the name force_stream_mode doesn't seem to convey that it is for
> logical decoding.

Agreed.

> On one side having separate GUCs for publisher and subscriber seems to
> give better flexibility but having more GUCs also sometimes makes them
> less usable. Here, my thought was to have a single or as few GUCs as
> possible which can be extendible by providing multiple values instead
> of having different GUCs. I was trying to map this with the existing
> string parameters in developer options.

I see your point. On the other hand, I'm not sure it's a good idea in
general to control different features with a single GUC. Perhaps
developer options could be an exception?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-06 Thread Masahiko Sawada
On Wed, Dec 7, 2022 at 1:29 PM Amit Kapila  wrote:
>
> On Wed, Dec 7, 2022 at 9:00 AM Masahiko Sawada  wrote:
> >
> > On Thu, Dec 1, 2022 at 7:17 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > > ---
> > > > if (am_parallel_apply_worker() && on_subinfo_change) {
> > > > /*
> > > >  * If a parallel apply worker exits due to the subscription
> > > >  * information change, we notify the leader apply worker so that the
> > > >  * leader can report more meaningful message in time and restart the
> > > >  * logical replication.
> > > >  */
> > > > pq_putmessage('X', NULL, 0);
> > > > }
> > > >
> > > > and
> > > >
> > > > ereport(ERROR,
> > > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > >  errmsg("logical replication parallel apply worker 
> > > > exited
> > > > because of subscription information change")));
> > > >
> > > > Do we really need an additional message in case of 'X'? When we call
> > > > apply_worker_clean_exit with on_subinfo_change = true, we have reported 
> > > > the
> > > > error message such as:
> > > >
> > > > ereport(LOG,
> > > > (errmsg("logical replication parallel apply worker for 
> > > > subscription
> > > > \"%s\" will stop because of a parameter change",
> > > > MySubscription->name)));
> > > >
> > > > I think that reporting a similar message from the leader might not be
> > > > meaningful for users.
> > >
> > > The intention is to let leader report more meaningful message if a worker
> > > exited due to subinfo change. Otherwise, the leader is likely to report an
> > > error like " lost connection ... to parallel apply worker" when trying to 
> > > send
> > > data via shared memory if the worker exited. What do you think ?
> >
> > Agreed. But do we need to have the leader exit with an error in spite
> > of the fact that the worker cleanly exits? If the leader exits with an
> > error, the subscription will be disabled if disable_on_error is true,
> > right?
> >
>
> Right, but the leader will anyway exit at some point either due to an
> ERROR like "lost connection ... to parallel worker" or with a LOG
> like: "... will restart because of a parameter change" but I see your
> point. So, will it be better if we have a LOG message here and then
> proc_exit()? Do you have something else in mind for this?

No, I was thinking that too. It's better to write a LOG message and do
proc_exit().

Regarding the error "lost connection ... to parallel worker", it could
still happen depending on the timing even if the parallel worker
cleanly exits due to parameter changes, right? If so, I'm concerned
that it could lead to disabling the subscription unexpectedly if
disable_on_error is enabled.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-06 Thread Masahiko Sawada
On Thu, Dec 1, 2022 at 7:17 PM houzj.f...@fujitsu.com
 wrote:
>
> On Thursday, December 1, 2022 3:58 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, Nov 30, 2022 at 10:51 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Wednesday, November 30, 2022 9:41 PM houzj.f...@fujitsu.com
> >  wrote:
> > > >
> > > > On Tuesday, November 29, 2022 8:34 PM Amit Kapila
> > > > > Review comments on v53-0001*
> > > >
> > > > Attach the new version patch set.
> > >
> > > Sorry, there were some mistakes in the previous patch set.
> > > Here is the correct V54 patch set. I also ran pgindent for the patch set.
> > >
> >
> > Thank you for updating the patches. Here are random review comments for
> > 0001 and 0002 patches.
>
> Thanks for the comments!
>
> >
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("logical replication parallel apply worker exited
> > abnormally"),
> >  errcontext("%s", edata.context))); and
> >
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("logical replication parallel apply worker exited
> > because of subscription information change")));
> >
> > I'm not sure ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is appropriate
> > here. Given that parallel apply worker has already reported the error 
> > message
> > with the error code, I think we don't need to set the errorcode for the logs
> > from the leader process.
> >
> > Also, I'm not sure the term "exited abnormally" is appropriate since we use 
> > it
> > when the server crashes for example. I think ERRORs reported here don't mean
> > that in general.
>
> How about reporting "xxx worker exited due to error" ?

Sounds better to me.

>
> > ---
> > if (am_parallel_apply_worker() && on_subinfo_change) {
> > /*
> >  * If a parallel apply worker exits due to the subscription
> >  * information change, we notify the leader apply worker so that the
> >  * leader can report more meaningful message in time and restart the
> >  * logical replication.
> >  */
> > pq_putmessage('X', NULL, 0);
> > }
> >
> > and
> >
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >  errmsg("logical replication parallel apply worker exited
> > because of subscription information change")));
> >
> > Do we really need an additional message in case of 'X'? When we call
> > apply_worker_clean_exit with on_subinfo_change = true, we have reported the
> > error message such as:
> >
> > ereport(LOG,
> > (errmsg("logical replication parallel apply worker for subscription
> > \"%s\" will stop because of a parameter change",
> > MySubscription->name)));
> >
> > I think that reporting a similar message from the leader might not be
> > meaningful for users.
>
> The intention is to let leader report more meaningful message if a worker
> exited due to subinfo change. Otherwise, the leader is likely to report an
> error like " lost connection ... to parallel apply worker" when trying to send
> data via shared memory if the worker exited. What do you think ?

Agreed. But do we need to have the leader exit with an error even
though the worker exits cleanly? If the leader exits with an
error, the subscription will be disabled if disable_on_error is true,
right?

And what do you think about the error code?

>
> > ---
> > -if (options->proto.logical.streaming &&
> > -PQserverVersion(conn->streamConn) >= 14)
> > -appendStringInfoString(&cmd, ", streaming 'on'");
> > +if (options->proto.logical.streaming_str)
> > +appendStringInfo(&cmd, ", streaming '%s'",
> > +
> > options->proto.logical.streaming_str);
> >
> > and
> >
> > +/*
> > + * Assign the appropriate option value for streaming option
> > according to
> > + * the 'streaming' mode and the publisher's ability to
> > support that mode.
> > + */
> > +if (server_version >= 16 &&
> > +MySubscription->stream == SUBSTREAM_PARALLEL)
> > 

Re: Force streaming every change in logical decoding

2022-12-06 Thread Masahiko Sawada
On Wed, Dec 7, 2022 at 8:46 AM Peter Smith  wrote:
>
> On Tue, Dec 6, 2022 at 9:29 PM Amit Kapila  wrote:
> >
> > On Tue, Dec 6, 2022 at 11:53 AM shiy.f...@fujitsu.com
> >  wrote:
> > >
> > > Hi hackers,
> > >
> > > In logical decoding, when logical_decoding_work_mem is exceeded, the 
> > > changes are
> > > sent to output plugin in streaming mode. But there is a restriction that 
> > > the
> > > minimum value of logical_decoding_work_mem is 64kB. I tried to add a GUC 
> > > to
> > > allow sending every change to output plugin without waiting until
> > > logical_decoding_work_mem is exceeded.
> > >
> > > This helps to test streaming mode. For example, to test "Avoid streaming 
> > > the
> > > transaction which are skipped" [1], it needs many XLOG_XACT_INVALIDATIONS
> > > messages. With the new option, it can be tested with fewer changes and in 
> > > less
> > > time. Also, this new option helps to test more scenarios for "Perform 
> > > streaming
> > > logical transactions by background workers" [2].
> > >
>
> +1
>
> >
> > Yeah, I think this can also help in reducing the time for various
> > tests in test_decoding/stream and
> > src/test/subscription/t/*_stream_*.pl file by reducing the number of
> > changes required to invoke streaming mode. Can we think of making this
> > GUC extendible to even test more options on server-side (publisher)
> > and client-side (subscriber) cases? For example, we can have something
> > like logical_replication_mode with the following valid values: (a)
> > server_serialize: this will serialize each change to file on
> > publishers and then on commit restore and send all changes; (b)
> > server_stream: this will stream each change as currently proposed in
> > your patch. Then if we want to extend it for subscriber-side testing
> > then we can introduce new options like client_serialize for the case
> > being discussed in the email [1].
> >
> > Thoughts?
>
> There is potential for lots of developer GUCs for testing/debugging in
> the area of logical replication but IMO it might be better to keep
> them all separated. Putting everything into a single
> 'logical_replication_mode' might cause difficulties later when/if you
> want combinations of the different modes.

I think we want the developer option that forces streaming changes
during logical decoding to be PGC_USERSET but probably the developer
option for testing the parallel apply feature would be PGC_SIGHUP.
Also, since streaming changes is not specific to logical replication
but to logical decoding, I'm not sure logical_replication_XXX is a
good name. IMO having force_stream_mode and a different GUC for
testing the parallel apply feature makes sense to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Force streaming every change in logical decoding

2022-12-06 Thread Masahiko Sawada
On Tue, Dec 6, 2022 at 7:29 PM Amit Kapila  wrote:
>
> On Tue, Dec 6, 2022 at 11:53 AM shiy.f...@fujitsu.com
>  wrote:
> >
> > Hi hackers,
> >
> > In logical decoding, when logical_decoding_work_mem is exceeded, the 
> > changes are
> > sent to output plugin in streaming mode. But there is a restriction that the
> > minimum value of logical_decoding_work_mem is 64kB. I tried to add a GUC to
> > allow sending every change to output plugin without waiting until
> > logical_decoding_work_mem is exceeded.
> >
> > This helps to test streaming mode. For example, to test "Avoid streaming the
> > transaction which are skipped" [1], it needs many XLOG_XACT_INVALIDATIONS
> > messages. With the new option, it can be tested with fewer changes and in 
> > less
> > time. Also, this new option helps to test more scenarios for "Perform 
> > streaming
> > logical transactions by background workers" [2].
> >
>
> Yeah, I think this can also help in reducing the time for various
> tests in test_decoding/stream and
> src/test/subscription/t/*_stream_*.pl file by reducing the number of
> changes required to invoke streaming mode.

+1

> Can we think of making this
> GUC extendible to even test more options on server-side (publisher)
> and client-side (subscriber) cases? For example, we can have something
> like logical_replication_mode with the following valid values: (a)
> server_serialize: this will serialize each change to file on
> publishers and then on commit restore and send all changes; (b)
> server_stream: this will stream each change as currently proposed in
> your patch. Then if we want to extend it for subscriber-side testing
> then we can introduce new options like client_serialize for the case
> being discussed in the email [1].

Does setting logical_replication_mode = 'client_serialize' imply that
the publisher behaves as server_stream? Or do you mean we could set
something like logical_replication_mode = 'server_stream,
client_serialize'?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-12-05 Thread Masahiko Sawada
s info after updating
pvs->shared->idx_completed_progress.

---
+/*
+ * Check if we are done vacuuming indexes and report
+ * progress.

How about "Waiting for all indexes to be vacuumed while updating the
parallel index vacuum progress"?

+ *
+ * We nap using with a WaitLatch to avoid a busy loop.
+ *
+ * Note: This function should be used by the leader process only,
+ * and it's up to the caller to ensure this.
+ */

I think these comments are not necessary.

+void
+parallel_wait_for_workers_to_finish(ParallelVacuumState *pvs)

How about "parallel_vacuum_wait_to_finish"?

---
+/*
+ * Read the shared ParallelVacuumProgress and update progress.h
+ * with indexes vacuumed so far. This function is called periodically
+ * by index AMs as well as parallel_vacuum_process_one_index.
+ *
+ * To avoid unnecessarily updating progress, we check the progress
+ * values from the backend entry and only update if the value
+ * of completed indexes increases.
+ *
+ * Note: This function should be used by the leader process only,
+ * and it's up to the caller to ensure this.
+ */
+void
+parallel_vacuum_update_progress(void)
+{
+volatile PgBackendStatus *beentry = MyBEEntry;
+
+Assert(!IsParallelWorker);
+
+if (beentry && ParallelVacuumProgress)
+{
+int parallel_vacuum_current_value =
beentry->st_progress_param[PROGRESS_VACUUM_INDEX_COMPLETED];
+int parallel_vacuum_new_value =
pg_atomic_read_u32(ParallelVacuumProgress);
+
+if (parallel_vacuum_new_value > parallel_vacuum_current_value)
+
pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
parallel_vacuum_new_value);
+}
+}

parallel_vacuum_update_progress() is typically called every 1GB, so I
think we don't need to worry about unnecessary updates. Also, I think
this code doesn't work when pgstat_track_activities is false. Instead,
I think that in parallel_wait_for_workers_to_finish() we can check the
value of pvs->nindexes_completed and update the progress whenever it
has changed, or on the first pass.
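
To illustrate the suggestion, the leader-side wait could look roughly
like this (a sketch only; I'm assuming pvs->shared->nindexes_completed
is the shared atomic counter and pvs->nindexes is the total, which may
not match the patch's exact names):

static void
parallel_vacuum_wait_to_finish(ParallelVacuumState *pvs)
{
    int     last_reported = -1;

    for (;;)
    {
        int     ncompleted;

        CHECK_FOR_INTERRUPTS();

        ncompleted = (int) pg_atomic_read_u32(&(pvs->shared->nindexes_completed));

        /* Report only on the first pass or when the value has changed. */
        if (ncompleted != last_reported)
        {
            pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
                                         ncompleted);
            last_reported = ncompleted;
        }

        if (ncompleted >= pvs->nindexes)
            break;

        (void) WaitLatch(MyLatch,
                         WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
                         1000L, WAIT_EVENT_PARALLEL_VACUUM_FINISH);
        ResetLatch(MyLatch);
    }
}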

---
+(void) WaitLatch(MyLatch, WL_TIMEOUT | WL_LATCH_SET |
WL_EXIT_ON_PM_DEATH, PARALLEL_VACUUM_PROGRESS_TIMEOUT,
+
WAIT_EVENT_PARALLEL_VACUUM_FINISH);
+ResetLatch(MyLatch);

I think we don't necessarily need to use
PARALLEL_VACUUM_PROGRESS_TIMEOUT here. Probably we can use 1000L
instead. If we want to use PARALLEL_VACUUM_PROGRESS_TIMEOUT, we need
comments for that:

+#define PARALLEL_VACUUM_PROGRESS_TIMEOUT   1000

---
-WAIT_EVENT_XACT_GROUP_UPDATE
+WAIT_EVENT_XACT_GROUP_UPDATE,
+WAIT_EVENT_PARALLEL_VACUUM_FINISH
 } WaitEventIPC;

 Enums of WaitEventIPC should be defined in alphabetical order.

---
cfbot fails.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-01 Thread Masahiko Sawada
On Thu, Dec 1, 2022 at 4:00 PM John Naylor  wrote:
>
>
> On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada  
> wrote:
> >
> > I've investigated this issue and have a question about using atomic
> > variables on palloc'ed memory. In non-parallel vacuum cases,
> > radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
> > the memory allocated by aset.c is 4-bytes aligned so these atomic
> > variables are not always 8-bytes aligned. Is there any way to enforce
> > 8-bytes aligned memory allocations in 32-bit machines?
>
> The bigger question in my mind is: Why is there an atomic variable in 
> backend-local memory?

Because I use the same radix_tree and radix_tree_control structs for
non-parallel and parallel vacuum. Therefore, radix_tree_control is
allocated in DSM for parallel-vacuum cases or in backend-local memory
for non-parallel vacuum cases.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-30 Thread Masahiko Sawada
ialized on first use). */
+static HTAB *ParallelApplyWorkersHash = NULL;
+
+/*
+ * A list to maintain the active parallel apply workers. The information for
+ * the new worker is added to the list after successfully launching it. The
+ * list entry is removed if there are already enough workers in the worker
+ * pool either at the end of the transaction or while trying to find a free
+ * worker for applying the transaction. For more information about the worker
+ * pool, see comments atop this file.
+ */
+static List *ParallelApplyWorkersList = NIL;

The names ParallelApplyWorkersHash and ParallelApplyWorkersList are very
similar but their usages are completely different. Probably we can find
better names such as ParallelApplyTxnHash and ParallelApplyWorkerPool.
And probably we can add more comments for ParallelApplyWorkersHash.

---
if (winfo->serialize_changes ||
napplyworkers > (max_parallel_apply_workers_per_subscription / 2))
{
int slot_no;
uint16  generation;

SpinLockAcquire(&winfo->shared->mutex);
generation = winfo->shared->logicalrep_worker_generation;
slot_no = winfo->shared->logicalrep_worker_slot_no;
SpinLockRelease(&winfo->shared->mutex);

logicalrep_pa_worker_stop(slot_no, generation);

pa_free_worker_info(winfo);

return true;
}

/* Unlink any files that were needed to serialize partial changes. */
if (winfo->serialize_changes)
stream_cleanup_files(MyLogicalRepWorker->subid, winfo->shared->xid);

If winfo->serialize_changes is true, we return true in the first if
statement. So stream_cleanup_files in the second if statement is never
executed.

---
+/*
+ * First, try to get a parallel apply worker from the pool,
if available.
+ * Otherwise, try to start a new parallel apply worker.
+ */
+winfo = pa_get_available_worker();
+if (!winfo)
+{
+winfo = pa_init_and_launch_worker();
+if (!winfo)
+return;
+}

I think we don't necessarily need two separate functions for getting a
worker from the pool and launching a new worker; the split seems to
reduce readability. Instead, I think we can have one function that
returns winfo if there is a free worker in the worker pool, or otherwise
launches a new worker. That way, we can simply do something like:

winfo = pg_launch_parallel_worker()
if (!winfo)
return;
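
A possible shape for that single function (a sketch only; the in_use
flag is my assumption, and pa_init_and_launch_worker() refers to the
patch's existing launch path):

static ParallelApplyWorkerInfo *
pa_launch_parallel_worker(void)
{
    ListCell   *lc;

    /* Reuse an idle worker from the pool if one is available. */
    foreach(lc, ParallelApplyWorkersList)
    {
        ParallelApplyWorkerInfo *winfo = (ParallelApplyWorkerInfo *) lfirst(lc);

        if (!winfo->in_use)
            return winfo;
    }

    /*
     * No free worker: set up the shared memory and queues and try to start
     * a new one; returns NULL if the worker could not be launched.
     */
    return pa_init_and_launch_worker();
}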

---
+/* Setup replication origin tracking. */
+StartTransactionCommand();
+ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
+
 originname, sizeof(originname));
+originid = replorigin_by_name(originname, true);
+if (!OidIsValid(originid))
+originid = replorigin_create(originname);

This code seems to allow parallel workers to use different origins in
cases where the origin doesn't exist, but is that okay? Shouldn't we
pass missing_ok = false in this case?

---
cfbot seems to fails:

https://cirrus-ci.com/task/6264595342426112

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-30 Thread Masahiko Sawada
On Wed, Nov 30, 2022 at 7:54 PM Amit Kapila  wrote:
>
> On Tue, Nov 29, 2022 at 10:18 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > Attach the new version patch which addressed all comments.
> >
>
> Some comments on v53-0002*
> 
> 1. I think testing the scenario where the shm_mq buffer is full
> between the leader and parallel apply worker would require a large
> amount of data and then also there is no guarantee. How about having a
> developer GUC [1] force_apply_serialize which allows us to serialize
> the changes and only after commit the parallel apply worker would be
> allowed to apply it?

+1

The code coverage report shows that we don't cover the partial
serialization code paths. This GUC would improve the code coverage.
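
For illustration, such a developer GUC might be declared roughly as below.
This is only a sketch: the variable name force_apply_serialize, its boolean
type, and its placement in the ConfigureNamesBool[] table are assumptions,
not part of the patch.

bool        force_apply_serialize = false;

/* entry for the ConfigureNamesBool[] table */
{
    {"force_apply_serialize", PGC_SIGHUP, DEVELOPER_OPTIONS,
     gettext_noop("Forces the leader apply worker to serialize changes "
                  "to files instead of sending them to parallel apply "
                  "workers."),
     NULL,
     GUC_NOT_IN_SAMPLE},
    &force_apply_serialize,
    false,
    NULL, NULL, NULL
},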

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-30 Thread Masahiko Sawada
On Wed, Nov 23, 2022 at 2:10 AM Andres Freund  wrote:
>
> On 2022-11-21 17:06:56 +0900, Masahiko Sawada wrote:
> > Sure. I've attached the v10 patches. 0004 is the pure refactoring
> > patch and 0005 patch introduces the pointer tagging.
>
> This failed on cfbot, with som many crashes that the VM ran out of disk for
> core dumps. During testing with 32bit, so there's probably something broken
> around that.
>
> https://cirrus-ci.com/task/4635135954386944
>
> A failure is e.g. at: 
> https://api.cirrus-ci.com/v1/artifact/task/4635135954386944/testrun/build-32/testrun/adminpack/regress/log/initdb.log
>
> performing post-bootstrap initialization ... 
> ../src/backend/lib/radixtree.c:1696:21: runtime error: member access within 
> misaligned address 0x590faf74 for type 'struct radix_tree_control', which 
> requires 8 byte alignment
> 0x590faf74: note: pointer points here
>   90 11 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  
> 00 00 00 00 00 00 00 00
>   ^

radix_tree_control struct has two pg_atomic_uint64 variables, and the
assertion check in pg_atomic_init_u64() failed:

static inline void
pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
{
/*
 * Can't necessarily enforce alignment - and don't need it - when using
 * the spinlock based fallback implementation. Therefore only assert when
 * not using it.
 */
#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
AssertPointerAlignment(ptr, 8);
#endif
pg_atomic_init_u64_impl(ptr, val);
}

I've investigated this issue and have a question about using atomic
variables on palloc'ed memory. In non-parallel vacuum cases,
radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
the memory allocated by aset.c is 4-bytes aligned so these atomic
variables are not always 8-bytes aligned. Is there any way to enforce
8-bytes aligned memory allocations in 32-bit machines?
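
One way this could be worked around (a hypothetical sketch, not from the
patch) is to over-allocate and round the pointer up to an 8-byte boundary
with TYPEALIGN, at the cost of having to remember the original pointer if it
ever needs to be pfree'd:

static radix_tree_control *
rtc_alloc_aligned(MemoryContext context)
{
    char   *raw;

    /* Pad by 8 bytes so we can always round up to an 8-byte boundary. */
    raw = MemoryContextAllocZero(context, sizeof(radix_tree_control) + 8);

    return (radix_tree_control *) TYPEALIGN(8, raw);
}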

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-30 Thread Masahiko Sawada
On Tue, Nov 29, 2022 at 1:36 PM John Naylor
 wrote:
>
> While creating a benchmark for inserting into node128-inner, I found a bug. 
> If a caller deletes from a node128, the slot index is set to invalid, but the 
> child pointer is still valid. Do that a few times, and every child pointer is 
> valid, even if no slot index points to it. When the next inserter comes 
> along, something surprising happens. This function:
>
> /* Return an unused slot in node-128 */
> static int
> node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
> {
>   int slotpos = 0;
>
>   Assert(!NODE_IS_LEAF(node));
>   while (node_inner_128_is_slot_used(node, slotpos))
>   slotpos++;
>
>   return slotpos;
> }
>
> ...passes an integer to this function, whose parameter is a uint8:
>
> /* Is the slot in the node used? */
> static inline bool
> node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
> {
>   Assert(!NODE_IS_LEAF(node));
>   return (node->children[slot] != NULL);
> }
>
> ...so instead of growing the node unnecessarily or segfaulting, it enters an 
> infinite loop doing this:
>
> add eax, 1
> movzx   ecx, al
> cmp QWORD PTR [rbx+264+rcx*8], 0
> jne .L147
>
> The fix is easy enough -- set the child pointer to null upon deletion,

Good catch!
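
For illustration, the fix might look like the sketch below (the slot_idxs
array and the invalid-index constant are assumed names based on the
description above, and the sketch assumes the chunk is present in the node):

static inline void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
    int         slotpos = node->slot_idxs[chunk];

    Assert(!NODE_IS_LEAF(node));

    /* previously the child pointer was left pointing to the old child */
    node->children[slotpos] = NULL;
    node->slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}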

> but I'm somewhat astonished that the regression tests didn't hit this. I do 
> still intend to replace this code with something faster, but before I do so 
> the tests should probably exercise the deletion paths more. Since VACUUM

Indeed, there are some tests for deletion but all of them delete all
keys in the node so we end up deleting the node. I've added tests of
repeating deletion and insertion as well as additional assertions.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-29 Thread Masahiko Sawada
On Fri, Nov 25, 2022 at 6:47 PM John Naylor
 wrote:
>
>
>
> On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada  wrote:
> >
> > [v11]
>
> There is one more thing that just now occurred to me: In expanding the use of 
> size classes, that makes rebasing and reworking the shared memory piece more 
> work than it should be. That's important because there are still some open 
> questions about the design around shared memory. To keep unnecessary churn to 
> a minimum, perhaps we should limit size class expansion to just one (or 5 
> total size classes) for the near future?

Make sense. We can add size classes once we have a good design and
implementation around shared memory.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-29 Thread Masahiko Sawada
On Fri, Nov 25, 2022 at 5:00 PM John Naylor
 wrote:
>
> On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada  wrote:
> >
> > So it seems that there are two candidates of rt_node structure: (1)
> > all nodes except for node256 are variable-size nodes and use pointer
> > tagging, and (2) node32 and node128 are variable-sized nodes and do
> > not use pointer tagging (fanout is in part of only these two nodes).
> > rt_node can be 5 bytes in both cases. But before going to this step, I
> > started to verify the idea of variable-size nodes by using 6-bytes
> > rt_node. We can adjust the node kinds and node classes later.
>
> First, I'm glad you picked up the size class concept and expanded it. (I have 
> some comments about some internal APIs below.)
>
> Let's leave the pointer tagging piece out until the main functionality is 
> committed. We have all the prerequisites in place, except for a benchmark 
> random enough to demonstrate benefit. I'm still not quite satisfied with how 
> the shared memory coding looked, and that is the only sticky problem we still 
> have, IMO. The rest is "just work".
>
> That said, (1) and (2) above are still relevant -- variable sizing any given 
> node is optional, and we can refine as needed.
>
> > Overall, the idea of variable-sized nodes is good, smaller size
> > without losing search performance.
>
> Good.
>
> > I'm going to check the load
> > performance as well.
>
> Part of that is this, which gets called a lot more now, when node1 expands:
>
> + if (inner)
> + newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
> + rt_node_kind_info[kind].inner_size);
> + else
> + newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
> + rt_node_kind_info[kind].leaf_size);
>
> Since memset for expanding size class is now handled separately, these can 
> use the non-zeroing versions. When compiling MemoryContextAllocZero, the 
> compiler has no idea how big the size is, so it assumes the worst and 
> optimizes for large sizes. On x86-64, that means using "rep stos", which 
> calls microcode found in the CPU's ROM. This is slow for small sizes. The 
> "init" function should be always inline with const parameters where possible. 
> That way, memset can compile to a single instruction for the smallest node 
> kind. (More on alloc/init below)

Right. I forgot to update it.

>
> Note, there is a wrinkle: As currently written inner_node128 searches the 
> child pointers for NULL when inserting, so when expanding from partial to 
> full size class, the new node must be zeroed (Worth fixing in the short term. 
> I thought of this while writing the proof-of-concept for size classes, but 
> didn't mention it.) Medium term, rather than special-casing this, I actually 
> want to rewrite the inner-node128 to be more similar to the leaf, with an 
> "isset" array, but accessed and tested differently. I guarantee it's *really* 
> slow now to load (maybe somewhat true even for leaves), but I'll leave the 
> details for later.

Agreed, I'll start by zeroing out the node when expanding from partial
to full size.

> Regarding node128 leaf, note that it's slightly larger than a DSA size class, 
> and we can trim it to fit:
>
> node61:  6 + 256+(2) +16 +  61*8 =  768
> node125: 6 + 256+(2) +16 + 125*8 = 1280

Agreed, changed.

>
> > I've attached the patches I used for the verification. I don't include
> > patches for pointer tagging, DSA support, and vacuum integration since
> > I'm investigating the issue on cfbot that Andres reported. Also, I've
> > modified tests to improve the test coverage.
>
> Sounds good. For v12, I think size classes have proven themselves, so v11's 
> 0002/4/5 can be squashed. Plus, some additional comments:
>
> +/* Return a new and initialized node */
> +static rt_node *
> +rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, 
> bool inner)
> +{
> + rt_node *newnode;
> +
> + newnode = rt_alloc_node(tree, kind, inner);
> + rt_init_node(newnode, kind, shift, chunk, inner);
> +
> + return newnode;
> +}
>
> I don't see the point of a function that just calls two functions.

Removed.

>
> +/*
> + * Create a new node with 'new_kind' and the same shift, chunk, and
> + * count of 'node'.
> + */
> +static rt_node *
> +rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
> +{
> + rt_node*newnode;
> +
> + newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
> + node->shift > 0);
> + newnode->count = node->count;
> +
> + return newnode;
> +}
>
> This, in turn, just calls a function that does _a

Re: Fix comment in SnapBuildFindSnapshot

2022-11-28 Thread Masahiko Sawada
On Tue, Nov 29, 2022 at 8:54 AM Michael Paquier  wrote:
>
> On Mon, Nov 28, 2022 at 04:46:44PM +0900, Michael Paquier wrote:
> > Hm, yes, that seems right.  There are three "c) states" in these
> > paragraphs, they are incremental steps.  Will apply if there are no
> > objections.
>
> And done.

Thank you!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Fix comment in SnapBuildFindSnapshot

2022-11-27 Thread Masahiko Sawada
Hi,

We have the following comment in SnapBuildFindSnapshot():

* c) transition from FULL_SNAPSHOT to CONSISTENT.
*
* In FULL_SNAPSHOT state (see d) ), and this xl_running_xacts'

It mentions "(state d) )", which seems like a typo of "(state d)", but
there is no "state d" in the first place. Reading the discussion of
the commit 955a684e040 that introduced this comment, this was a
comment for an old version patch[1]. So I think we can remove this
part.

I've attached the patch.

Regards,

[1] 
https://www.postgresql.org/message-id/20170505004237.edtahvrwb3uwd5rs%40alap3.anarazel.de

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_comment_in_SnapBuildFindSnapshot.patch
Description: Binary data


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-27 Thread Masahiko Sawada
On Fri, Nov 25, 2022 at 5:58 PM Maxim Orlov  wrote:
>
>
>
> On Fri, 25 Nov 2022 at 09:40, Amit Kapila  wrote:
>>
>> On Thu, Nov 24, 2022 at 4:43 PM Amit Kapila  wrote:
>> >
>> > On Thu, Nov 24, 2022 at 1:48 PM Masahiko Sawada  
>> > wrote:
>> > >
>> > > On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila  
>> > > wrote:
>> > >
>> > > Agreed not to have a test case for this.
>> > >
>> > > I've attached an updated patch. I've confirmed this patch works for
>> > > all supported branches.
>> > >
>> >
>> > I have slightly changed the checks used in the patch, otherwise looks
>> > good to me. I am planning to push (v11-v15) the attached tomorrow
>> > unless there are more comments.
>> >
>>
>> Pushed.
>
> A big thanks to you! Could you also, close the commitfest entry 
> https://commitfest.postgresql.org/41/4024/, please?

Closed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-24 Thread Masahiko Sawada
On Wed, Nov 23, 2022 at 12:00 PM Amit Kapila  wrote:
>
> On Tue, Nov 22, 2022 at 10:33 PM Maxim Orlov  wrote:
> >>
> >>
> >> Regarding the tests, the patch includes a new scenario to
> >> reproduce this issue. However, since the issue can be reproduced also
> >> by the existing scenario (with low probability, though), I'm not sure
> >> it's worth adding the new scenario.
> >
> > AFAICS, the test added doesn't 100% reproduce this issue, so, maybe, it 
> > does not worth it. But, I do not have a strong opinion here.
> > Let's add tests in a separate commit and let the actual committer to decide 
> > what to do, should we?
> >
>
> +1 to not have a test for this as the scenario can already be tested
> by the existing set of tests.

Agreed not to have a test case for this.

I've attached an updated patch. I've confirmed this patch works for
all supported branches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


0001-Reset-InitialRunningXacts-array-when-freeing-SnapBui.patch
Description: Binary data


Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

2022-11-22 Thread Masahiko Sawada
Hi,

On Tue, Nov 22, 2022 at 6:37 PM Amit Kapila  wrote:
>
> On Mon, Nov 21, 2022 at 6:17 PM Maxim Orlov  wrote:
> >
> > PROBLEM
> >
> > After some investigation, I think, the problem is in the snapbuild.c 
> > (commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts
> > array in the context of builder->context, but for the time when we call 
> > SnapBuildPurgeOlderTxn this context may be already free'd.
> >
>
> I think you are seeing it freed in SnapBuildPurgeOlderTxn when we
> finish and restart decoding in the same session. After finishing the
> first decoding, it frees the decoding context but we forgot to reset
> NInitialRunningXacts and InitialRunningXacts array. So, next time when
> we start decoding in the same session where we don't restore any
> serialized snapshot, it can lead to the problem you are seeing because
> NInitialRunningXacts (and InitialRunningXacts array) won't have sane
> values.
>
> This can happen in the catalog_change_snapshot test as we have
> multiple permutations and those use the same session across a restart
> of decoding.

I have the same analysis. In order to restart the decoding from the
LSN where we don't restore any serialized snapshot, we somehow need to
advance the slot's restart_lsn. In this case, I think it happened
since the same session drops at the end of the first scenario and
creates the replication slot with the same name at the beginning of
the second scenario in catalog_change_snapshot.spec.

>
> >
> > Simple fix like:
> > @@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr 
> > lsn, xl_running_xacts *runn
> >  * changes. See SnapBuildXidSetCatalogChanges.
> >  */
> > NInitialRunningXacts = nxacts;
> > -   InitialRunningXacts = MemoryContextAlloc(builder->context, 
> > sz);
> > +   InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, 
> > sz);
> > memcpy(InitialRunningXacts, running->xids, sz);
> > qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), 
> > xidComparator);
> >
> > seems to solve the described problem, but I'm not in the context of [0] and 
> > why array is allocated in builder->context.
> >
>
> It will leak the memory for InitialRunningXacts. We need to reset
> NInitialRunningXacts (and InitialRunningXacts) as mentioned above.
>
> Thank you for the report and initial analysis. I have added Sawada-San
> to know his views as he was the primary author of this work.

Thanks!

I've attached a draft patch. To fix it, I think we can reset
InitialRunningXacts and NInitialRunningXacts at FreeSnapshotBuilder()
and add an assertion in AllocateSnapshotBuilder() to make sure both
are reset. Regarding the tests, the patch includes a new scenario to
reproduce this issue. However, since the issue can be reproduced also
by the existing scenario (with low probability, though), I'm not sure
it's worth adding the new scenario.
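
For illustration, the reset described above amounts to roughly the following
fragments placed in the existing snapbuild.c functions (a sketch; the
attached patch may differ in detail):

/* In FreeSnapshotBuilder(): also forget the initial running xacts, whose
 * memory belonged to the builder's context. */
NInitialRunningXacts = 0;
InitialRunningXacts = NULL;

/* In AllocateSnapshotBuilder(): a previous decoding session must have
 * cleaned up after itself. */
Assert(NInitialRunningXacts == 0);
Assert(InitialRunningXacts == NULL);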

I've not checked if the patch works for version 14 or older yet.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


reset_initial_running_xacts.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-21 Thread Masahiko Sawada
On Mon, Nov 21, 2022 at 4:20 PM John Naylor
 wrote:
>
>
> On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> > One issue with this patch: The "fanout" member is a uint8, so it can't hold 
> > 256 for the largest node kind. That's not an issue in practice, since we 
> > never need to grow it, and we only compare that value with the count in an 
> > Assert(), so I just set it to zero. That does break an invariant, so it's 
> > not great. We could use 2 bytes to be strictly correct in all cases, but 
> > that limits what we can do with the smallest node kind.
>
> Thinking about this part, there's an easy resolution -- use a different macro 
> for fixed- and variable-sized node kinds to determine if there is a free slot.
>
> Also, I wanted to share some results of adjusting the boundary between the 
> two smallest node kinds. In the hackish attached patch, I modified the fixed 
> height search benchmark to search a small (within L1 cache) tree thousands of 
> times. For the first set I modified node4's maximum fanout and filled it up. 
> For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 
> (actually the partially-filled node15 size class as demoed earlier).
>
> node4:
>
> NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 
> = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   2 |16 |16520 |  0 |3
>
> NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 
> = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   3 |81 |16456 |  0 |   17
>
> NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, 
> n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   4 |   256 |16456 |  0 |   89
>
> NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, 
> n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   5 |   625 |16488 |  0 |  327
>
>
> node32:
>
> NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 
> = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   2 |16 |16488 |  0 |5
> (1 row)
>
> NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 
> = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   3 |81 |16520 |  0 |   28
>
> NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, 
> n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   4 |   256 |16408 |  0 |   79
>
> NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, 
> n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> +---+--++--
>   5 |   625 |24616 |  0 |  199
>
> In this test, node32 seems slightly faster than node4 with 4 elements, at the 
> cost of more memory.
>
> Assuming the smallest node is fixed size (i.e. fanout/capacity member not 
> part of the common set, so only part of variable-sized nodes), 3 has a nice 
> property: no wasted padding space:
>
> node4: 5 + 4+(7) + 4*8 = 48 bytes
> node3: 5 + 3 + 3*8 = 32

IIUC if we store the fanout member only in variable-sized nodes,
rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
size doesn't change but there is 1 byte padding space.

Also, even if we make node3 a variable-sized node, size class 1
for node3 could be a good choice since it also doesn't need padding
space and could be a good alternative to path compression.

node3 :  5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-18 Thread Masahiko Sawada
On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada  wrote:
>
> On Wed, Nov 16, 2022 at 4:39 PM John Naylor
>  wrote:
> >
> >
> > On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Nov 16, 2022 at 1:46 PM John Naylor
> > >  wrote:
> > > >
> > > >
> > > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
> > > >  wrote:
> > > > > Thanks! Please let me know if there is something I can help with.
> > > >
> > > > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> > > >
> > > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> > > > "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
> > >
> > > Which tests do you use to get this assertion failure? I've confirmed
> > > there is a bug in 0005 patch but without it, "make check-world"
> > > passed.
> >
> > Hmm, I started over and rebuilt and it didn't reproduce. Not sure what 
> > happened, sorry for the noise.
>
> Good to know. No problem.
>
> > I'm attaching a test I wrote to stress test branch prediction in search, 
> > and while trying it out I found two possible issues.
>
> Thank you for testing!
>
> >
> > It's based on the random int load test, but tests search speed. Run like 
> > this:
> >
> > select * from bench_search_random_nodes(10 * 1000 * 1000)
> >
> > It also takes some care to include all the different node kinds, 
> > restricting the possible keys by AND-ing with a filter. Here's a simple 
> > demo:
> >
> > filter = ((uint64)1<<40)-1;
> > LOG:  num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 = 
> > 62663, n256 = 3130
> >
> > Just using random integers leads to >99% using the smallest node. I wanted 
> > to get close to having the same number of each, but that's difficult while 
> > still using random inputs. I ended up using
> >
> > filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
> >
> > which gives
> >
> > LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 
> > 182670, n256 = 1024
> >
> > Which seems okay for the task. One puzzling thing I found while trying 
> > various filters is that sometimes the reported tree height would change. 
> > For example:
> >
> > filter = (((uint64) 1<<32) | (0xFF<<24));
> > LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 = 
> > 62632, n256 = 3161
> >
> > 1) Any idea why the tree height would be reported as 7 here? I didn't 
> > expect that.
>
> In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
> It seems the filter should be (((uint64) 1<<32) | ((uint64)
> 0xFF<<24)).
>
> >
> > 2) It seems that 0004 actually causes a significant slowdown in this test 
> > (as in the attached, using the second filter above and with turboboost 
> > disabled):
> >
> > v9 0003: 2062 2051 2050
> > v9 0004: 2346 2316 2321
> >
> > That means my idea for the pointer struct might have some problems, at 
> > least as currently implemented. Maybe in the course of separating out and 
> > polishing that piece, an inefficiency will fall out. Or, it might be 
> > another reason to template local and shared separately. Not sure yet. I 
> > also haven't tried to adjust this test for the shared memory case.
>
> I'll also run the test on my environment and do the investigation tomorrow.
>

FYI I've not tested the patch you shared today but here are the
benchmark results I did with the v9 patch in my environment (I used
the second filter). I split the 0004 patch into two patches: a pure
refactoring patch to introduce rt_node_ptr, and a patch to do pointer
tagging.

v9 0003 patch: 1113 1114 1114
introduce rt_node_ptr: 1127 1128 1128
pointer tagging  : 1085 1087 1086 (equivalent to 0004 patch)

In my environment, rt_node_ptr seemed to introduce some overhead but
pointer tagging had performance benefits. I'm not sure why the results
are different from yours. The radix tree stats show the same numbers as
your tests.

=# select * from bench_search_random_nodes(10 * 1000 * 1000);
2022-11-18 22:18:21.608 JST [3913544] LOG:  num_keys = 9291812, height
= 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-11-18 Thread Masahiko Sawada
On Sat, Nov 12, 2022 at 4:10 AM Imseih (AWS), Sami  wrote:
>
> >I don't think any of these progress callbacks should be done while 
> > pinning a
> >buffer and ...
>
> Good point.
>
> >I also don't understand why info->parallel_progress_callback exists? 
> > It's only
> >set to parallel_vacuum_progress_report(). Why make this stuff more 
> > expensive
> >than it has to already be?
>
> Agree. Modified the patch to avoid the callback .
>
> >So each of the places that call this need to make an additional external
> >function call for each page, just to find that there's nothing to do or 
> > to
> >make yet another indirect function call. This should probably a static 
> > inline
> >function.
>
> Even better to just remove a function call altogether.
>
> >This is called, for every single page, just to read the number of indexes
> >completed by workers? A number that barely ever changes?
>
> I will take the initial suggestion by Sawada-san to update the progress
> every 1GB of blocks scanned.
>
> Also, It sems to me that we don't need to track progress in brin indexes,
> Since it is rare, if ever, this type of index will go through very heavy
> block scans. In fact, I noticed the vacuum AMs for brin don't invoke the
> vacuum_delay_point at all.
>
> The attached patch addresses the feedback.
>

Thank you for updating the patch! Here are review comments on v15 patch:

+  
+   Number of indexes that wil be vacuumed. This value will be
+   0 if there are no indexes to vacuum or
+   vacuum failsafe is triggered.

I think that indexes_total should be 0 also when INDEX_CLEANUP is off.

---
+/*
+ * Reset the indexes completed at this point.
+ * If we end up in another index vacuum cycle, we will
+ * start counting from the start.
+ */
+pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED, 0);

I think we don't need to reset it at the end of index vacuuming. There
is a small window before switching to the next phase. If we reset this
value while showing "index vacuuming" phase, the user might get
confused. Instead, we can reset it at the beginning of the index
vacuuming.

---
+void
+parallel_wait_for_workers_to_finish(ParallelVacuumState *pvs)
+{
+while (pg_atomic_read_u32(&(pvs->shared->idx_completed_progress))
< pvs->nindexes)
+{
+pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
+
   pg_atomic_read_u32(&(pvs->shared->idx_completed_progress)));
+
+(void) WaitLatch(MyLatch, WL_LATCH_SET |
WL_EXIT_ON_PM_DEATH, -1,
+ WAIT_EVENT_PARALLEL_FINISH);
+ResetLatch(MyLatch);
+}
+}

We should add CHECK_FOR_INTERRUPTS() at the beginning of the loop to
make the wait interruptible.

I think it would be better to update the counter only when the value
has been increased.

I think we should set a timeout, say 1 sec, to WaitLatch so that it
can periodically check the progress.

Probably it's better to have a new wait event for this wait in order
to distinguish wait for workers to complete index vacuum from the wait
for workers to exit.
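
Putting these points together, the loop might end up looking roughly like
the sketch below (the wait event name and the 1-second timeout are
illustrative assumptions, not proposals for the exact values):

void
parallel_wait_for_workers_to_finish(ParallelVacuumState *pvs)
{
    uint32      last_reported = 0;

    while (pg_atomic_read_u32(&(pvs->shared->idx_completed_progress)) <
           pvs->nindexes)
    {
        uint32      completed;

        CHECK_FOR_INTERRUPTS();

        /* Report only when the counter has actually advanced. */
        completed = pg_atomic_read_u32(&(pvs->shared->idx_completed_progress));
        if (completed > last_reported)
        {
            pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
                                         completed);
            last_reported = completed;
        }

        /* Nap for up to 1 second so the progress is re-checked periodically. */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         1000L,
                         WAIT_EVENT_PARALLEL_VACUUM_INDEX_WAIT);
        ResetLatch(MyLatch);
    }

    /* Report the final count before returning. */
    pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
                                 pg_atomic_read_u32(&(pvs->shared->idx_completed_progress)));
}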

---
@@ -838,7 +867,12 @@
parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation
indrel,
 ivinfo.estimated_count = pvs->shared->estimated_count;
 ivinfo.num_heap_tuples = pvs->shared->reltuples;
 ivinfo.strategy = pvs->bstrategy;
-
+ivinfo.idx_completed_progress = pvs->shared->idx_completed_progress;

and

@@ -998,6 +998,9 @@ btvacuumscan(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats,
 if (info->report_progress)

pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,

   scanblkno);
+if (info->report_parallel_progress &&
(scanblkno % REPORT_PARALLEL_VACUUM_EVERY_PAGES) == 0)
+
pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
+

pg_atomic_read_u32(&(info->idx_completed_progress)));
 }

I think this doesn't work, since ivinfo.idx_completed_progress is in
backend-local memory. Instead, I think we can have a function in
vacuumparallel.c that updates the progress. Then we can have index AM
call this function.

---
+if (!IsParallelWorker())
+ivinfo.report_parallel_progress = true;
+else
+ivinfo.report_parallel_progress = false;

We can do like:

ivinfo.report_parallel_progress = !IsParallelWorker();

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-17 Thread Masahiko Sawada
On Fri, Nov 18, 2022 at 1:47 PM Amit Kapila  wrote:
>
> On Fri, Nov 18, 2022 at 8:01 AM Peter Smith  wrote:
> >
> > On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada  
> > wrote:
> > >
> > ...
> > > ---
> > > The streaming parameter has the new value "parallel" for "streaming"
> > > option to enable the parallel apply. It fits so far but I think the
> > > parallel apply feature doesn't necessarily need to be tied up with
> > > streaming replication. For example, we might want to support parallel
> > > apply also for non-streaming transactions in the future. It might be
> > > better to have another option, say "parallel", to control parallel
> > > apply behavior. The "parallel" option can be a boolean option and
> > > setting parallel = on requires streaming = on.
> > >
>
> If we do that then how will the user be able to use streaming
> serialize mode (write to file for streaming transactions) as we have
> now? Because after we introduce parallelism for non-streaming
> transactions, the user would want parallel = on irrespective of the
> streaming mode. Also, users may wish to only parallelize large
> transactions because of additional overhead for non-streaming
> transactions for transaction dependency tracking, etc. So, the user
> may wish to have a separate knob for large transactions as the patch
> has now.

One idea for that would be to make it an enum. For example, setting
parallel = "streaming" could work for that.

>
> >
> > FWIW, I tend to agree with this idea but for a different reason. In
> > this patch, the 'streaming' parameter had become a kind of hybrid
> > boolean/enum. AFAIK there are no other parameters anywhere that use a
> > hybrid pattern like this so I was thinking it may be better not to be
> > different.
> >
>
> I think we have a similar pattern for GUC parameters like
> constraint_exclusion (see constraint_exclusion_options),
> backslash_quote (see backslash_quote_options), etc.

Right. The vacuum_index_cleanup and buffering storage parameters (which
accept 'on', 'off', or 'auto') are other examples.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-17 Thread Masahiko Sawada
*
+ * Release all session level locks that could be held in parallel apply
+ * mode.
+ */
+LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+

I think we call LockReleaseAll() at the process exit (in ProcKill()),
but do we really need to do LockReleaseAll() here too?

---

+elog(ERROR, "could not find replication state slot
for replication"
+ "origin with OID %u which was acquired by
%d", node, acquired_by);

Let's not break the error log message in the middle so that the user
can search the message by grep easily.

---
+{
+{"max_parallel_apply_workers_per_subscription",
+PGC_SIGHUP,
+REPLICATION_SUBSCRIBERS,
+gettext_noop("Maximum number of parallel
apply workers per subscription."),
+NULL,
+},
+&max_parallel_apply_workers_per_subscription,
+2, 0, MAX_BACKENDS,
+NULL, NULL, NULL
+},
+

I think we should use MAX_PARALLEL_WORKER_LIMIT as the max value
instead. MAX_BACKENDS is too high.

---
+/*
+ * Indicates whether there are pending messages in the queue.
The parallel
+ * apply worker will check it before starting to wait.
+ */
+pg_atomic_uint32   pending_message_count;

The "pending messages" sounds like individual logical replication
messages such as LOGICAL_REP_MSG_INSERT. But IIUC what this value
actually shows is how many streamed chunks are pending to process,
right?

---
The streaming parameter has the new value "parallel" for "streaming"
option to enable the parallel apply. It fits so far but I think the
parallel apply feature doesn't necessarily need to be tied up with
streaming replication. For example, we might want to support parallel
apply also for non-streaming transactions in the future. It might be
better to have another option, say "parallel", to control parallel
apply behavior. The "parallel" option can be a boolean option and
setting parallel = on requires streaming = on.

Another variant is to have a new subscription parameter for example
"parallel_workers" parameter that specifies the number of parallel
workers. That way, users can specify the number of parallel workers
per subscription.

---
When the parallel apply worker raises an error, I got the same error
twice from the leader worker and parallel worker as follows. Can we
suppress either one?

2022-11-17 17:30:23.490 JST [3814552] LOG:  logical replication
parallel apply worker for subscription "test_sub1" has started
2022-11-17 17:30:23.490 JST [3814552] ERROR:  duplicate key value
violates unique constraint "test1_c_idx"
2022-11-17 17:30:23.490 JST [3814552] DETAIL:  Key (c)=(1) already exists.
2022-11-17 17:30:23.490 JST [3814552] CONTEXT:  processing remote data
for replication origin "pg_16390" during message type "INSERT" for
replication target relatio
n "public.test1" in transaction 731
2022-11-17 17:30:23.490 JST [3814550] ERROR:  duplicate key value
violates unique constraint "test1_c_idx"
2022-11-17 17:30:23.490 JST [3814550] DETAIL:  Key (c)=(1) already exists.
2022-11-17 17:30:23.490 JST [3814550] CONTEXT:  processing remote data
for replication origin "pg_16390" during message type "INSERT" for
replication target relatio
n "public.test1" in transaction 731
parallel apply worker

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-16 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
 wrote:
>
>
> On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, Nov 16, 2022 at 1:46 PM John Naylor
> >  wrote:
> > >
> > >
> > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> > > wrote:
> > > > Thanks! Please let me know if there is something I can help with.
> > >
> > > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> > >
> > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> > > "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
> >
> > Which tests do you use to get this assertion failure? I've confirmed
> > there is a bug in 0005 patch but without it, "make check-world"
> > passed.
>
> Hmm, I started over and rebuilt and it didn't reproduce. Not sure what 
> happened, sorry for the noise.

Good to know. No problem.

> I'm attaching a test I wrote to stress test branch prediction in search, and 
> while trying it out I found two possible issues.

Thank you for testing!

>
> It's based on the random int load test, but tests search speed. Run like this:
>
> select * from bench_search_random_nodes(10 * 1000 * 1000)
>
> It also takes some care to include all the different node kinds, restricting 
> the possible keys by AND-ing with a filter. Here's a simple demo:
>
> filter = ((uint64)1<<40)-1;
> LOG:  num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 = 
> 62663, n256 = 3130
>
> Just using random integers leads to >99% using the smallest node. I wanted to 
> get close to having the same number of each, but that's difficult while still 
> using random inputs. I ended up using
>
> filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
>
> which gives
>
> LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 
> 182670, n256 = 1024
>
> Which seems okay for the task. One puzzling thing I found while trying 
> various filters is that sometimes the reported tree height would change. For 
> example:
>
> filter = (((uint64) 1<<32) | (0xFF<<24));
> LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 = 
> 62632, n256 = 3161
>
> 1) Any idea why the tree height would be reported as 7 here? I didn't expect 
> that.

In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
It seems the filter should be (((uint64) 1<<32) | ((uint64)
0xFF<<24)).

>
> 2) It seems that 0004 actually causes a significant slowdown in this test (as 
> in the attached, using the second filter above and with turboboost disabled):
>
> v9 0003: 2062 2051 2050
> v9 0004: 2346 2316 2321
>
> That means my idea for the pointer struct might have some problems, at least 
> as currently implemented. Maybe in the course of separating out and polishing 
> that piece, an inefficiency will fall out. Or, it might be another reason to 
> template local and shared separately. Not sure yet. I also haven't tried to 
> adjust this test for the shared memory case.

I'll also run the test on my environment and do the investigation tomorrow.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 2:17 PM John Naylor
 wrote:
>
>
>
> On Wed, Nov 16, 2022 at 11:46 AM John Naylor  
> wrote:
> >
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> > wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> > "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Actually I do want to offer some general advice. Upthread I recommended a 
> purely refactoring patch that added the node-pointer struct but did nothing 
> else, so that the DSA changes would be smaller. 0004 attempted pointer 
> tagging in the same commit, which makes it no longer a purely refactoring 
> patch, so that 1) makes it harder to tell what part caused the bug and 2) 
> obscures what is necessary for DSA pointers and what was additionally 
> necessary for pointer tagging. Shared memory support is a prerequisite for a 
> shippable feature, but pointer tagging is (hopefully) a performance 
> optimization. Let's keep them separate.

Totally agreed. I'll separate them in the next version patch. Thank
you for your advice.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
 wrote:
>
>
> On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> wrote:
> > Thanks! Please let me know if there is something I can help with.
>
> I didn't get very far because the tests fail on 0004 in rt_verify_node:
>
> TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-14 Thread Masahiko Sawada
On Mon, Nov 14, 2022 at 10:00 PM John Naylor
 wrote:
>
> On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada  wrote:
> >
> > 0004 patch is a new patch supporting a pointer tagging of the node
> > kind. Also, it introduces rt_node_ptr we discussed so that internal
> > functions use it rather than having two arguments for encoded and
> > decoded pointers. With this intermediate patch, the DSA support patch
> > became more readable and understandable. Probably we can make it
> > smaller further if we move the change of separating the control object
> > from radix_tree to the main patch (0002). The patch still needs to be
> > polished but I'd like to check if this idea is worthwhile. If we agree
> > on this direction, this patch will be merged into the main radix tree
> > implementation patch.
>
> Thanks for the new patch set. I've taken a very brief look at 0004 and I 
> think the broad outlines are okay. As you say it needs polish, but before 
> going further, I'd like to do some experiments of my own as I mentioned 
> earlier:
>
> - See how much performance we actually gain from tagging the node kind.
> - Try additional size classes while keeping the node kinds to only four.
> - Optimize node128 insert.
> - Try templating out the differences between local and shared memory. With 
> local memory, the node-pointer struct would be a union, for example. 
> Templating would also reduce branches and re-simplify some internal APIs, but 
> it's likely that would also make the TID store and/or vacuum more complex, 
> because at least some external functions would be duplicated.

Thanks! Please let me know if there is something I can help with.

In the meanwhile, I'd like to make some progress on the vacuum
integration and improving the test coverages.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-08 Thread Masahiko Sawada
On Sat, Nov 5, 2022 at 6:23 PM John Naylor  wrote:
>
> On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada  wrote:
> >
> > For parallel heap pruning, multiple workers will insert key-value
> > pairs to the radix tree concurrently. The simplest solution would be a
> > single lock to protect writes but the performance will not be good.
> > Another solution would be that we can divide the tables into multiple
> > ranges so that keys derived from TIDs are not conflicted with each
> > other and have parallel workers process one or more ranges. That way,
> > parallel vacuum workers can build *sub-trees* and the leader process
> > can merge them. In use cases of lazy vacuum, since the write phase and
> > read phase are separated the readers don't need to worry about
> > concurrent updates.
>
> It's a good idea to use ranges for a different reason -- readahead. See 
> commit 56788d2156fc3, which aimed to improve readahead for sequential scans. 
> It might work to use that as a model: Each worker prunes a range of 64 pages, 
> keeping the dead tids in a local array. At the end of the range: lock the tid 
> store, enter the tids into the store, unlock, free the local array, and get 
> the next range from the leader. It's possible contention won't be too bad, 
> and I suspect using small local arrays as-we-go would be faster and use less 
> memory than merging multiple sub-trees at the end.

Seems like a promising idea. I think it might work well even in the current
parallel vacuum (i.e., single writer). I mean, I think we can have a
single lwlock for shared cases in the first version. If the overhead
of acquiring the lwlock per insertion of key-value is not negligible,
we might want to try this idea.

Apart from that, I'm going to incorporate the comments on 0004 patch
and try a pointer tagging.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-11-07 Thread Masahiko Sawada
On Thu, Nov 3, 2022 at 1:52 AM Imseih (AWS), Sami  wrote:
>
> Attached is v13-0001--Show-progress-for-index-vacuums.patch which addresses
> the latest comments.

Thank you for updating the patch!

> 4/ Went back to calling parallel_vacuum_progress_report inside
> WaitForParallelWorkersToFinish to cover the case when a
> leader is waiting for parallel workers to finish.

I don't think we need to modify WaitForParallelWorkersToFinish to
cover that case. Instead, I think the leader process can execute a new
function. The function will be like WaitForParallelWorkersToFinish()
but simpler; it just updates the progress information if necessary and
then checks if idx_completed_progress is equal to the number of
indexes to vacuum. If yes, return from the function and call
WaitForParallelWorkersToFinish() to wait for all workers to finish.
Otherwise, it naps by using WaitLatch() and does this loop again.

---
@@ -46,6 +46,8 @@ typedef struct ParallelContext
 ParallelWorkerInfo *worker;
 intnknown_attached_workers;
 bool  *known_attached_workers;
+void   (*parallel_progress_callback)(void *arg);
+void   *parallel_progress_arg;
 } ParallelContext;

With the above change I suggested, I think we won't need to have a
callback function in ParallelContext. Instead, I think we can have
index-AMs call parallel_vacuum_report() if report_parallel_progress is
true.

Regards,

--
Masahiko Sawada

Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-07 Thread Masahiko Sawada
On Thu, Nov 3, 2022 at 10:06 PM houzj.f...@fujitsu.com
 wrote:
>
> On Wednesday, November 2, 2022 10:50 AM Masahiko Sawada 
>  wrote:
> >
> > On Mon, Oct 24, 2022 at 8:42 PM Masahiko Sawada
> >  wrote:
> > >
> > > On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila 
> > wrote:
> > > >
> > > > On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada
> >  wrote:
> > > > >
> > > > > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila 
> > wrote:
> > > > > >
> > > > > > About your point that having different partition structures for
> > > > > > publisher and subscriber, I don't know how common it will be once we
> > > > > > have DDL replication. Also, the default value of
> > > > > > publish_via_partition_root is false which doesn't seem to indicate
> > > > > > that this is a quite common case.
> > > > >
> > > > > So how can we consider these concurrent issues that could happen only
> > > > > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > > > > the problem or can we have a safeguard against these conflicts?
> > > > >
> > > >
> > > > Yeah, right now the strategy is to disallow parallel apply for such
> > > > cases as you can see in *0003* patch.
> > >
> > > Tightening the restrictions could work in some cases but there might
> > > still be corner cases and it could reduce the usability. I'm not really
> > > sure that we can ensure such a deadlock won't happen with the current
> > > restrictions. I think we need something safeguard just in case. For
> > > example, if the leader apply worker is waiting for a lock acquired by
> > > its parallel worker, it cancels the parallel worker's transaction,
> > > commits its transaction, and restarts logical replication. Or the
> > > leader can log the deadlock to let the user know.
> > >
> >
> > As another direction, we could make the parallel apply feature robust
> > if we can detect deadlocks that happen among the leader worker and
> > parallel workers. I'd like to summarize the idea discussed off-list
> > (with Amit, Hou-San, and Kuroda-San) for discussion. The basic idea is
> > that when the leader worker or parallel worker needs to wait for
> > something (eg. transaction completion, messages) we use lmgr
> > functionality so that we can create wait-for edges and detect
> > deadlocks in lmgr.
> >
> > For example, a scenario where a deadlock occurs is the following:
> >
> > [Publisher]
> > create table tab1(a int);
> > create publication pub for table tab1;
> >
> > [Subcriber]
> > create table tab1(a int primary key);
> > create subscription sub connection 'port=1 dbname=postgres'
> > publication pub with (streaming = parallel);
> >
> > TX1:
> > BEGIN;
> > INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
> > Tx2:
> > BEGIN;
> > INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- 
> > streamed
> > COMMIT;
> > COMMIT;
> >
> > Suppose a parallel apply worker (PA-1) is executing TX-1 and the
> > leader apply worker (LA) is executing TX-2 concurrently on the
> > subscriber. Now, LA is waiting for PA-1 because of the unique key of
> > tab1 while PA-1 is waiting for LA to send further messages. There is a
> > deadlock between PA-1 and LA but lmgr cannot detect it.
> >
> > One idea to resolve this issue is that we have LA acquire a session
> > lock on a shared object (by LockSharedObjectForSession()) and have
> > PA-1 wait on the lock before trying to receive messages. IOW,  LA
> > acquires the lock before sending STREAM_STOP and releases it if
> > already acquired before sending STREAM_START, STREAM_PREPARE and
> > STREAM_COMMIT. For PA-1, it always needs to acquire the lock after
> > processing STREAM_STOP and then release immediately after acquiring
> > it. That way, when PA-1 is waiting for LA, we can have a wait-edge
> > from PA-1 to LA in lmgr, which will make a deadlock in lmgr like:
> >
> > LA (waiting to acquire lock) -> PA-1 (waiting to acquire the shared
> > object) -> LA
> >
> > We would need the shared objects per parallel apply worker.
> >
> > After detecting a deadlock, we can restart logical replication with
> > temporarily disabling the parallel apply, which is done by 0005 patch.
> >
> > Another scenario is similar to the previous case but T

Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-06 Thread Masahiko Sawada
On Mon, Nov 7, 2022 at 12:58 PM Amit Kapila  wrote:
>
> On Mon, Nov 7, 2022 at 8:26 AM Masahiko Sawada  wrote:
> >
> > On Sun, Nov 6, 2022 at 3:40 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Saturday, November 5, 2022 1:43 PM Amit Kapila 
> > > 
> > > >
> > > > On Fri, Nov 4, 2022 at 7:35 PM houzj.f...@fujitsu.com
> > > >  wrote:
> > > > >
> > > > > On Friday, November 4, 2022 4:07 PM Amit Kapila
> > > >  wrote:
> > > > > >
> > > > > > On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com
> > > > > >  wrote:
> > > > > > >
> > > > > > > Thanks for the analysis and summary !
> > > > > > >
> > > > > > > I tried to implement the above idea and here is the patch set.
> > > > > > >
> > > > > >
> > > > > > Few comments on v42-0001
> > > > > > ===
> > > > >
> > > > > Thanks for the comments.
> > > > >
> > > > > >
> > > > > > 10.
> > > > > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > > > > + winfo->shared->transaction_lock_id =
> > > > > > + winfo->shared->parallel_apply_get_unique_id();
> > > > > >
> > > > > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > > > > (one generated by parallel apply) for the other?
> > > > ...
> > > > ...
> > > > >
> > > > > I also considered using xid for these locks, but it seems the objsubid
> > > > > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > > > > to generate a unique 16bit id here.
> > > > >
> > > >
> > > > Okay, I see your point. Can we think of having a new lock tag for this 
> > > > with classid,
> > > > objid, objsubid for the first three fields of locktag field? We can use 
> > > > a new
> > > > macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set the
> > > > tag and acquire the lock. One more point related to this is that I am 
> > > > suggesting
> > > > classid by referring to SET_LOCKTAG_OBJECT as that is used in the 
> > > > current
> > > > patch but do you think we need it for our purpose, won't subscription 
> > > > id and
> > > > xid can uniquely identify the tag?
> > >
> > > I agree that it could be better to have a new lock tag. Another point is 
> > > that
> > > the remote xid and Local xid could be the same in some rare cases, so I 
> > > think
> > > we might need to add another identifier to make it unique.
> > >
> > > Maybe :
> > > locktag_field1 : subscription oid
> > > locktag_field2 : xid(remote or local)
> > > locktag_field3 : 0(lock for stream block)/1(lock for transaction)
> >
> > Or I think we can use locktag_field2 for remote xid and locktag_field3
> > for local xid.
> >
>
> We can do that way as well but OTOH, I think for the local
> transactions we don't need subscription oid, so field1 could be
> InvalidOid and field2 will be xid of local xact. Won't that be better?

This would work. But I'm a bit concerned that we cannot identify which
subscription the lock belongs to when checking the pg_locks view.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-06 Thread Masahiko Sawada
On Sun, Nov 6, 2022 at 3:40 PM houzj.f...@fujitsu.com
 wrote:
>
> On Saturday, November 5, 2022 1:43 PM Amit Kapila 
> >
> > On Fri, Nov 4, 2022 at 7:35 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Friday, November 4, 2022 4:07 PM Amit Kapila
> >  wrote:
> > > >
> > > > On Thu, Nov 3, 2022 at 6:36 PM houzj.f...@fujitsu.com
> > > >  wrote:
> > > > >
> > > > > Thanks for the analysis and summary !
> > > > >
> > > > > I tried to implement the above idea and here is the patch set.
> > > > >
> > > >
> > > > Few comments on v42-0001
> > > > ===
> > >
> > > Thanks for the comments.
> > >
> > > >
> > > > 10.
> > > > + winfo->shared->stream_lock_id = parallel_apply_get_unique_id();
> > > > + winfo->shared->transaction_lock_id =
> > > > + winfo->shared->parallel_apply_get_unique_id();
> > > >
> > > > Why can't we use xid (remote_xid) for one of these and local_xid
> > > > (one generated by parallel apply) for the other?
> > ...
> > ...
> > >
> > > I also considered using xid for these locks, but it seems the objsubid
> > > for the shared object lock is 16bit while xid is 32 bit. So, I tried
> > > to generate a unique 16bit id here.
> > >
> >
> > Okay, I see your point. Can we think of having a new lock tag for this with 
> > classid,
> > objid, objsubid for the first three fields of locktag field? We can use a 
> > new
> > macro SET_LOCKTAG_APPLY_TRANSACTION and a common function to set the
> > tag and acquire the lock. One more point related to this is that I am 
> > suggesting
> > classid by referring to SET_LOCKTAG_OBJECT as that is used in the current
> > patch but do you think we need it for our purpose, won't subscription id and
> > xid can uniquely identify the tag?
>
> I agree that it could be better to have a new lock tag. Another point is that
> the remote xid and Local xid could be the same in some rare cases, so I think
> we might need to add another identifier to make it unique.
>
> Maybe :
> locktag_field1 : subscription oid
> locktag_field2 : xid(remote or local)
> locktag_field3 : 0(lock for stream block)/1(lock for transaction)

Or I think we can use locktag_field2 for remote xid and locktag_field3
for local xid.
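
For illustration, a lock tag along those lines could be set with a macro
like the sketch below. The macro name follows Amit's suggestion, but the new
LOCKTAG_APPLY_TRANSACTION locktag type and the exact field assignments are
assumptions for this sketch only:

#define SET_LOCKTAG_APPLY_TRANSACTION(tag, suboid, remote_xid, local_xid) \
    ((tag).locktag_field1 = (suboid), \
     (tag).locktag_field2 = (remote_xid), \
     (tag).locktag_field3 = (local_xid), \
     (tag).locktag_field4 = 0, \
     (tag).locktag_type = LOCKTAG_APPLY_TRANSACTION, \
     (tag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)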

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-04 Thread Masahiko Sawada
On Thu, Nov 3, 2022 at 1:59 PM John Naylor  wrote:
>
> On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada  
> wrote:
> >
> > I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
> > the comments I got so far. 0004 patch is a DSA support patch for PoC.
>
> Thanks for the new patchset. This is not a full review, but I have some 
> comments:
>
> 0001 and 0002 look okay on a quick scan -- I will use this as a base for 
> further work that we discussed. However, before I do so I'd like to request 
> another revision regarding the following:
>
> > In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
> > to point its children, and we use rt_node_ptr as either rt_node* or
> > dsa_pointer depending on whether the radix tree is shared or not (ie,
> > by checking radix_tree->dsa == NULL).
>

Thank you for the comments!

> 0004: Looks like a good start, but this patch has a large number of changes 
> like these, making it hard to read:
>
> - if (found && child_p)
> - *child_p = child;
> + if (found && childp_p)
> + *childp_p = childp;
> ...
>   rt_node_inner_32 *new32;
> + rt_node_ptr new32p;
>
>   /* grow node from 4 to 32 */
> - new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
> -  RT_NODE_KIND_32);
> + new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
> + new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
>
> It's difficult to keep in my head what all the variables refer to. I thought 
> a bit about how to split this patch up to make this easier to read. Here's 
> what I came up with:
>
> typedef struct rt_node_ptr
> {
>   uintptr_t encoded;
>   rt_node * decoded;
> }
>
> Note that there is nothing about "dsa or local". That's deliberate. That way, 
> we can use the "encoded" field for a tagged pointer as well, as I hope we can 
> do (at least for local pointers) in the future. So an intermediate patch 
> would have "static inline void" functions  node_ptr_encode() and  
> node_ptr_decode(), which would only copy from one member to another. I 
> suspect that: 1. The actual DSA changes will be *much* smaller and easier to 
> reason about. 2. Experimenting with tagged pointers will be easier.

Good idea. I'll try that in the next version of the patch.
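
Just to confirm my understanding of the intermediate step, I'm
picturing something like this (no DSA involvement yet; the functions
only copy from one member to the other, as you described):

typedef struct rt_node_ptr
{
    uintptr_t   encoded;
    rt_node    *decoded;
} rt_node_ptr;

static inline void
node_ptr_encode(rt_node_ptr *ptr)
{
    ptr->encoded = (uintptr_t) ptr->decoded;
}

static inline void
node_ptr_decode(rt_node_ptr *ptr)
{
    ptr->decoded = (rt_node *) ptr->encoded;
}

Then the DSA patch (and pointer tagging later) would mostly need to
change only these two functions.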

>
> Also, quick question: 0004 has a new function rt_node_update_inner() -- is 
> that necessary because of DSA?, or does this ideally belong in 0002? What's 
> the reason for it?

Oh, this was needed at one point while I was initially writing the DSA
support, but thinking about it again now, I think we can remove it and
use rt_node_insert_inner() with parent = NULL instead.

>
> Regarding the performance, I've
> > added another boolean argument to bench_seq/shuffle_search(),
> > specifying whether to use the shared radix tree or not. Here are
> > benchmark results in my environment,
>
> > [...]
>
> > In non-shared radix tree cases (the forth argument is false), I don't
> > see a visible performance degradation. On the other hand, in shared
> > radix tree cases (the forth argument is true), I see visible overheads
> > because of dsa_get_address().
>
> Thanks, this is useful.
>
> > Please note that the current shared radix tree implementation doesn't
> > support any locking, so it cannot be read while written by someone.
>
> I think at the very least we need a global lock to enforce this.
>
> > Also, only one process can iterate over the shared radix tree. When it
> > comes to parallel vacuum, these don't become restriction as the leader
> > process writes the radix tree while scanning heap and the radix tree
> > is read by multiple processes while vacuuming indexes. And only the
> > leader process can do heap vacuum by iterating the key-value pairs in
> > the radix tree. If we want to use it for other cases too, we would
> > need to support locking, RCU or something.
>
> A useful exercise here is to think about what we'd need to do parallel heap 
> pruning. We don't need to go that far for v16 of course, but what's the 
> simplest thing we can do to make that possible? Other use cases can change to 
> more sophisticated schemes if need be.

For parallel heap pruning, multiple workers will insert key-value
pairs into the radix tree concurrently. The simplest solution would be
a single lock to protect writes, but the performance will not be good.
Another solution would be to divide the table into multiple ranges so
that keys derived from TIDs don't conflict with each other, and have
parallel workers each process one or more ranges. That way, parallel
vacuum workers can build *sub-trees* and the leader process can merge
them. In use cases of l

Re: Fix comments atop ReorderBufferAddInvalidations

2022-11-03 Thread Masahiko Sawada
Hi

On Thu, Nov 3, 2022 at 7:53 PM Amit Kapila  wrote:
>
> The comments atop seem to indicate that we always accumulate
> invalidation messages in top-level transactions which is neither
> required nor match with the code. This is introduced in the commit
> c55040ccd0 and I have observed it while working on a fix for commit
> 16b1fe0037.

Thank you for the patch. It looks good to me.

I think we can backpatch it to avoid confusion in the future.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-01 Thread Masahiko Sawada
On Mon, Oct 24, 2022 at 8:42 PM Masahiko Sawada  wrote:
>
> On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila  wrote:
> >
> > On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila  
> > > wrote:
> > > >
> > > > About your point that having different partition structures for
> > > > publisher and subscriber, I don't know how common it will be once we
> > > > have DDL replication. Also, the default value of
> > > > publish_via_partition_root is false which doesn't seem to indicate
> > > > that this is a quite common case.
> > >
> > > So how can we consider these concurrent issues that could happen only
> > > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > > the problem or can we have a safeguard against these conflicts?
> > >
> >
> > Yeah, right now the strategy is to disallow parallel apply for such
> > cases as you can see in *0003* patch.
>
> Tightening the restrictions could work in some cases but there might
> still be coner cases and it could reduce the usability. I'm not really
> sure that we can ensure such a deadlock won't happen with the current
> restrictions. I think we need something safeguard just in case. For
> example, if the leader apply worker is waiting for a lock acquired by
> its parallel worker, it cancels the parallel worker's transaction,
> commits its transaction, and restarts logical replication. Or the
> leader can log the deadlock to let the user know.
>

As another direction, we could make the parallel apply feature robust
if we can detect deadlocks that happen among the leader worker and
parallel workers. I'd like to summarize the idea discussed off-list
(with Amit, Hou-San, and Kuroda-San) for discussion. The basic idea is
that when the leader worker or parallel worker needs to wait for
something (eg. transaction completion, messages) we use lmgr
functionality so that we can create wait-for edges and detect
deadlocks in lmgr.

For example, a scenario where a deadlock occurs is the following:

[Publisher]
create table tab1(a int);
create publication pub for table tab1;

[Subscriber]
create table tab1(a int primary key);
create subscription sub connection 'port=1 dbname=postgres'
publication pub with (streaming = parallel);

TX1:
BEGIN;
INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
Tx2:
BEGIN;
INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
COMMIT;
COMMIT;

Suppose a parallel apply worker (PA-1) is executing TX-1 and the
leader apply worker (LA) is executing TX-2 concurrently on the
subscriber. Now, LA is waiting for PA-1 because of the unique key of
tab1 while PA-1 is waiting for LA to send further messages. There is a
deadlock between PA-1 and LA but lmgr cannot detect it.

One idea to resolve this issue is to have LA acquire a session lock on
a shared object (via LockSharedObjectForSession()) and have PA-1 wait
on that lock before trying to receive messages. IOW, LA acquires the
lock before sending STREAM_STOP and releases it (if already acquired)
before sending STREAM_START, STREAM_PREPARE, and STREAM_COMMIT. PA-1
always needs to acquire the lock after processing STREAM_STOP and then
release it immediately after acquiring it. That way, when PA-1 is
waiting for LA, we have a wait-edge from PA-1 to LA in lmgr, which
makes the deadlock visible to lmgr like:

LA (waiting to acquire the lock) -> PA-1 (waiting to acquire the
shared object) -> LA

We would need one such shared object per parallel apply worker.
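
To be concrete, the locking pattern I have in mind looks roughly like
the following (using the per-worker stream_lock_id in the worker's
shared area; the classid and lock mode here are just for
illustration):

/* Leader apply worker (LA) */
LockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
                           winfo->shared->stream_lock_id, AccessExclusiveLock);
/* ... send STREAM_STOP ... */

UnlockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
                             winfo->shared->stream_lock_id, AccessExclusiveLock);
/* ... send STREAM_START, STREAM_PREPARE, or STREAM_COMMIT ... */

/* Parallel apply worker (PA-1), after processing STREAM_STOP;
 * 'pshared' is PA-1's pointer to the same shared area */
LockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
                           pshared->stream_lock_id, AccessExclusiveLock);
UnlockSharedObjectForSession(SubscriptionRelationId, MySubscription->oid,
                             pshared->stream_lock_id, AccessExclusiveLock);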

After detecting a deadlock, we can restart logical replication while
temporarily disabling parallel apply, which is done by the 0005 patch.

Another scenario is similar to the previous case but TX-1 and TX-2 are
executed by two parallel apply workers (PA-1 and PA-2 respectively).
In this scenario, PA-2 is waiting for PA-1 to complete its transaction
while PA-1 is waiting for subsequent input from LA. Also, LA is
waiting for PA-2 to complete its transaction in order to preserve the
commit order. There is a deadlock among three processes but it cannot
be detected in lmgr because the fact that LA is waiting for PA-2 to
complete its transaction doesn't appear in lmgr (see
parallel_apply_wait_for_xact_finish()). To fix it, we can use
XactLockTableWait() instead.
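
For example, LA could wait for PA-2's transaction with something like
the following (assuming PA-2's locally assigned xid is published
through its shared area; 'pshared->local_xid' is a hypothetical
field):

XactLockTableWait(pshared->local_xid, NULL, NULL, XLTW_None);

That way the wait shows up as an ordinary transaction-id lock wait
that the deadlock detector can see.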

However, since XactLockTableWait() considers a PREPARED TRANSACTION as
still in progress, we probably need a trick similar to the above in
the case where a transaction is prepared. For example, suppose that
TX-2 was prepared instead of committed in the above scenario; PA-2
acquires another shared lock at STREAM_START and releases it at
STREAM_COMMIT/PREPARE, and LA can wait on that lock.

Yet another scenario where LA has to wait is the case where the shm_mq
buffe

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-27 Thread Masahiko Sawada
On Thu, Oct 27, 2022 at 11:34 AM shiy.f...@fujitsu.com
 wrote:
>
> On Wed, Oct 26, 2022 7:19 PM Amit Kapila  wrote:
> >
> > On Tue, Oct 25, 2022 at 8:38 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Fri, Oct 21, 2022 at 6:32 PM houzj.f...@fujitsu.com
> > >  wrote:
> > >
> > > I've started to review this patch. I tested v40-0001 patch and have
> > > one question:
> > >
> > > IIUC even when most of the changes in the transaction are filtered out
> > > in pgoutput (eg., by relation filter or row filter), the walsender
> > > sends STREAM_START. This means that the subscriber could end up
> > > launching parallel apply workers also for almost empty (and streamed)
> > > transactions. For example, I created three subscriptions each of which
> > > subscribes to a different table. When I loaded a large amount of data
> > > into one table, all three (leader) apply workers received START_STREAM
> > > and launched their parallel apply workers.
> > >
> >
> > The apply workers will be launched just the first time then we
> > maintain a pool so that we don't need to restart them.
> >
> > > However, two of them
> > > finished without applying any data. I think this behaviour looks
> > > problematic since it wastes workers and rather decreases the apply
> > > performance if the changes are not large. Is it worth considering a
> > > way to delay launching a parallel apply worker until we find out the
> > > amount of changes is actually large?
> > >
> >
> > I think even if changes are less there may not be much difference
> > because we have observed that the performance improvement comes from
> > not writing to file.
> >
> > > For example, the leader worker
> > > writes the streamed changes to files as usual and launches a parallel
> > > worker if the amount of changes exceeds a threshold or the leader
> > > receives the second segment. After that, the leader worker switches to
> > > send the streamed changes to parallel workers via shm_mq instead of
> > > files.
> > >
> >
> > I think writing to file won't be a good idea as that can hamper the
> > performance benefit in some cases and not sure if it is worth.
> >
>
> I tried to test some cases that only a small part of the transaction or an 
> empty
> transaction is sent to subscriber, to see if using streaming parallel will 
> bring
> performance degradation.
>
> The test was performed ten times, and the average was taken.
> The results are as follows. The details and the script of the test is 
> attached.
>
> 10% of rows are sent
> --
> HEAD24.4595
> patched 18.4545
>
> 5% of rows are sent
> --
> HEAD21.244
> patched 17.9655
>
> 0% of rows are sent
> --
> HEAD18.0605
> patched 17.893
>
>
> It shows that when only 5% or 10% of rows are sent to subscriber, using 
> parallel
> apply takes less time than HEAD, and even if all rows are filtered there's no
> performance degradation.

Thank you for the testing!

I think this performance improvement comes both from applying changes
in parallel with receiving them and from avoiding writing to a file.
I'm happy to know there is also a benefit for small streaming
transactions. I've also measured the overhead when processing streamed
empty transactions and confirmed that the overhead is negligible.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-26 Thread Masahiko Sawada
On Wed, Oct 26, 2022 at 8:06 PM John Naylor
 wrote:
>
> On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada  
> wrote:
>
> > I've attached updated PoC patches for discussion and cfbot. From the
> > previous version, I mainly changed the following things:
> >

Thank you for the comments!

> > * Separate treatment of inner and leaf nodes
>
> Overall, this looks much better!
>
> > * Pack both the node kind and node count to an uint16 value.
>
> For this, I did mention a bitfield earlier as something we "could" do, but it 
> wasn't clear we should. After looking again at the node types, I must not 
> have thought through this at all. Storing one byte instead of four for the 
> full enum is a good step, but saving one more byte usually doesn't buy 
> anything because of padding, with a few exceptions like this example:
>
> node4:   4 +  4   +  4*8 =   40
> node4:   5 +  4+(7)   +  4*8 =   48 bytes
>
> Even there, I'd rather not spend the extra cycles to access the members. And 
> with my idea of decoupling size classes from kind, the variable-sized kinds 
> will require another byte to store "capacity". Then, even if the kind gets 
> encoded in a pointer tag, we'll still have 5 bytes in the base type. So I 
> think we should assume 5 bytes from the start. (Might be 6 temporarily if I 
> work on size decoupling first).

True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes. Encoding the kind in a pointer tag could be tricky given DSA
support, so currently I'm thinking of packing the node kind and the
node capacity class into a uint8.
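
For example, something like this (the field name and the bit split are
placeholders; the current 5 kinds fit in 3 bits):

#define RT_NODE_KIND_BITS      3
#define RT_NODE_KIND_MASK      ((1 << RT_NODE_KIND_BITS) - 1)

/* 'kind_class' is a uint8 field in the common node header */
#define RT_NODE_KIND(n)        ((n)->kind_class & RT_NODE_KIND_MASK)
#define RT_NODE_SIZE_CLASS(n)  ((n)->kind_class >> RT_NODE_KIND_BITS)
#define RT_MAKE_KIND_CLASS(kind, size_class) \
    ((uint8) (((size_class) << RT_NODE_KIND_BITS) | (kind)))

Reading either value is then just a mask or a shift, without bitfield
syntax.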

>
> (Side note, if you have occasion to use bitfields again in the future, C99 
> has syntactic support for them, so no need to write your own shifting/masking 
> code).

Thanks!

>
> > I've not done SIMD part seriously yet. But overall the performance
> > seems good so far. If we agree with the current approach, I think we
> > can proceed with the verification of decoupling node sizes from node
> > kind. And I'll investigate DSA support.
>
> Sounds good. I have some additional comments about v7, and after these are 
> addressed, we can proceed independently with the above two items. Seeing the 
> DSA work will also inform me how invasive pointer tagging will be. There will 
> still be some performance tuning and cosmetic work, but it's getting closer.
>

I've made some progress on investigating DSA support. I've written a
draft patch for it and the regression tests pass. I'll share it as a
separate patch for discussion along with the v8 radix tree patch.

While implementing DSA support, I realized that we may not need
pointer tagging to distinguish between a backend-local address and a
dsa_pointer. In order to get a backend-local address from a
dsa_pointer, we need to pass the dsa_area, like:

node = dsa_get_address(tree->dsa, node_dp);

As shown above, the dsa_area used by the shared radix tree is stored
in the radix_tree struct, so we can tell whether the radix tree is
shared by checking (tree->dsa == NULL). That is, if it's shared we
interpret a stored node pointer as a dsa_pointer, and if not we
interpret it as a backend-local pointer. We don't need to encode
anything in the pointer itself.
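
So the helper that turns a stored child pointer into a usable pointer
can look something like this (simplified; 'rt_pointer' stands for
whatever integer type we end up storing in the node):

static inline rt_node *
node_ptr_get_local(radix_tree *tree, rt_pointer ptr)
{
    if (tree->dsa == NULL)
        return (rt_node *) ptr;    /* backend-local pointer */

    return (rt_node *) dsa_get_address(tree->dsa, (dsa_pointer) ptr);
}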

> -
> 0001:
>
> +#ifndef USE_NO_SIMD
> +#include "port/pg_bitutils.h"
> +#endif
>
> Leftover from an earlier version?
>
> +static inline int vector8_find(const Vector8 v, const uint8 c);
> +static inline int vector8_find_ge(const Vector8 v, const uint8 c);
>
> Leftovers, causing compiler warnings. (Also see new variable shadow warning)

Will fix.

>
> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) 
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = Min(((const uint8 *) )[i], ((const uint8 *) )[i]);
> +
> + return r;
> +#endif
>
> As I mentioned a couple versions ago, this style is really awkward, and 
> potential non-SIMD callers will be better off writing their own byte-wise 
> loop rather than using this API. Especially since the "min" function exists 
> only as a workaround for lack of unsigned comparison in (at least) SSE2. 
> There is one existing function in this file with that idiom for non-assert 
> code (for completeness), but even there, inputs of current interest to us use 
> the uint64 algorithm.

Agreed. Will remove non-SIMD code.

>
> 0002:
>
> + /* XXX: should not to use vector8_highbit_mask */
> + bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << 
> sizeof(Vector8));
>
> Hmm?

That's an outdated note of mine; I'll remove it.

>
> +/*
> + * Return index of the first element in chunks in the given node that is 
> greater
> + * than or equal to 'key'.  Return -1 if 

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-24 Thread Masahiko Sawada
On Fri, Oct 21, 2022 at 6:32 PM houzj.f...@fujitsu.com
 wrote:
>
> On Thursday, October 20, 2022 5:49 PM Amit Kapila  
> wrote:
> > On Thu, Oct 20, 2022 at 2:08 PM Peter Smith 
> > wrote:
> > >
> > > 7. get_transaction_apply_action
> > >
> > > > 12. get_transaction_apply_action
> > > >
> > > > I still felt like there should be some tablesync checks/comments in
> > > > this function, just for sanity, even if it works as-is now.
> > > >
> > > > For example, are you saying ([3] #22b) that there might be rare
> > > > cases where a Tablesync would call to parallel_apply_find_worker?
> > > > That seems strange, given that "for streaming transactions that are
> > > > being applied in the parallel ... we disallow applying changes on a
> > > > table that is not in the READY state".
> > > >
> > > > --
> > >
> > > Houz wrote [2] -
> > >
> > > I think because we won't try to start parallel apply worker in table
> > > sync worker(see the check in parallel_apply_can_start()), so we won't
> > > find any worker in parallel_apply_find_worker() which means
> > > get_transaction_apply_action will return TRANS_LEADER_SERIALIZE. And
> > > get_transaction_apply_action is a function which can be invoked for
> > > all kinds of workers(same is true for all apply_handle_xxx functions),
> > > so not sure if table sync check/comment is necessary.
> > >
> > > ~
> > >
> > > Sure, and I believe you when you say it all works OK - but IMO there
> > > is something still not quite right with this current code. For
> > > example,
> > >
> > > e.g.1 the functional will return TRANS_LEADER_SERIALIZE for Tablesync
> > > worker, and yet the comment for TRANS_LEADER_SERIALIZE says "means
> > > that we are in the leader apply worker" (except we are not)
> > >
> > > e.g.2 we know for a fact that Tablesync workers cannot start their own
> > > parallel apply workers, so then why do we even let the Tablesync
> > > worker make a call to parallel_apply_find_worker() looking for
> > > something we know will not be found?
> > >
> >
> > I don't see much benefit in adding an additional check for tablesync workers
> > here. It will unnecessarily make this part of the code look bit ugly.
>
> Thanks for the review, here is the new version patch set which addressed 
> Peter[1]
> and Kuroda-san[2]'s comments.

I've started to review this patch. I tested the v40-0001 patch and
have one question:

IIUC even when most of the changes in the transaction are filtered out
in pgoutput (e.g., by a relation filter or row filter), the walsender
sends STREAM_START. This means that the subscriber could end up
launching parallel apply workers also for almost empty (and streamed)
transactions. For example, I created three subscriptions each of which
subscribes to a different table. When I loaded a large amount of data
into one table, all three (leader) apply workers received START_STREAM
and launched their parallel apply workers. However, two of them
finished without applying any data. I think this behaviour is
problematic since it wastes workers and can even decrease the apply
performance if the changes are not large. Is it worth considering a
way to delay launching a parallel apply worker until we find out the
amount of changes is actually large? For example, the leader worker
writes the streamed changes to files as usual and launches a parallel
worker if the amount of changes exceeds a threshold or the leader
receives the second segment. After that, the leader worker switches to
send the streamed changes to parallel workers via shm_mq instead of
files.
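
In pseudo-code, the leader-side logic would be something like this
(all names here are hypothetical, just to illustrate the idea):

/* on receiving a streamed change belonging to transaction 'entry' */
if (entry->parallel_worker == NULL)
{
    stream_write_change_to_file(entry, change);

    if (entry->nchanges > launch_threshold || entry->nsegments > 1)
        entry->parallel_worker = parallel_apply_start_worker(entry->xid);
}
else
    parallel_apply_send_data(entry->parallel_worker, change);

The changes already written to the file would also need to be handed
to (or replayed by) the parallel worker before switching to the shm_mq
path.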

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-24 Thread Masahiko Sawada
On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila  wrote:
>
> On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada  wrote:
> >
> > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila  wrote:
> > >
> > > About your point that having different partition structures for
> > > publisher and subscriber, I don't know how common it will be once we
> > > have DDL replication. Also, the default value of
> > > publish_via_partition_root is false which doesn't seem to indicate
> > > that this is a quite common case.
> >
> > So how can we consider these concurrent issues that could happen only
> > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > the problem or can we have a safeguard against these conflicts?
> >
>
> Yeah, right now the strategy is to disallow parallel apply for such
> cases as you can see in *0003* patch.

Tightening the restrictions could work in some cases but there might
still be corner cases, and it could reduce the usability. I'm not
really sure that we can ensure such a deadlock won't happen with the
current restrictions. I think we need some safeguard just in case. For
example, if the leader apply worker is waiting for a lock acquired by
its parallel worker, it cancels the parallel worker's transaction,
commits its own transaction, and restarts logical replication. Or the
leader can log the deadlock to let the user know.

>
> > We
> > could find a new problematic scenario in the future and if it happens,
> > logical replication gets stuck, it cannot be resolved only by apply
> > workers themselves.
> >
>
> I think users can change streaming option to on/off and internally the
> parallel apply worker can detect and restart to allow replication to
> proceed. Having said that, I think that would be a bug in the code and
> we should try to fix it. We may need to disable parallel apply in the
> problematic case.
>
> The other ideas that occurred to me in this regard are (a) provide a
> reloption (say parallel_apply) at table level and we can use that to
> bypass various checks like different Unique Key between
> publisher/subscriber, constraints/expressions having mutable
> functions, Foreign Key (when enabled on subscriber), operations on
> Partitioned Table. We can't detect whether those are safe or not
> (primarily because of a different structure in publisher and
> subscriber) so we prohibit parallel apply but if users use this
> option, we can allow it even in those cases.

The parallel apply worker is assigned per transaction, right? If so,
how can we know in advance which tables are modified in the
transaction? And what if two tables, one with the reloption set to
true and one set to false, are modified in the same transaction?

> (b) While enabling the
> parallel option in the subscription, we can try to match all the
> table(s) information of the publisher/subscriber. It will be tricky to
> make this work because say even if match some trigger function name,
> we won't be able to match the function body. The other thing is when
> at a later point the table definition is changed on the subscriber, we
> need to again validate the information between publisher and
> subscriber which I think would be difficult as we would be already in
> between processing some message and getting information from the
> publisher at that stage won't be possible.

Indeed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: START_REPLICATION SLOT causing a crash in an assert build

2022-10-21 Thread Masahiko Sawada
On Thu, Oct 20, 2022 at 6:54 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-13 15:57:28 +0900, Masahiko Sawada wrote:
> > I've attached an updated patch. I've added the common function to
> > start pg_recvlogical and wait for it to become active. Please review
> > it.
>
> > +# Start pg_recvlogical process and wait for it to become active.
> > +sub start_pg_recvlogical
> > +{
> > + my ($node, $slot_name, $create_slot) = @_;
> > +
> > + my @cmd = (
> > + 'pg_recvlogical', '-S', "$slot_name", '-d',
> > + $node->connstr('postgres'),
> > + '--start', '--no-loop', '-f', '-');
> > + push @cmd, '--create-slot' if $create_slot;
> > +
> > + # start pg_recvlogical process.
> > + my $pg_recvlogical = IPC::Run::start(@cmd);
> > +
> > + # Wait for the replication slot to become active.
> > + $node->poll_query_until('postgres',
> > + "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE 
> > slot_name = '$slot_name' AND active_pid IS NOT NULL)"
> > + ) or die "slot never became active";
> > +
> > + return $pg_recvlogical;
> > +}
>
> Because you never process the output from pg_recvlogical I think this test
> will fail if the pipe buffer is small (or more changes are made). I think
> either it needs to output to a file, or we need to process the output.

Okay, but how can we test this situation? In my testing, if we don't
redirect pg_recvlogical's output as above, its stdout and stderr go to
the log file, so I could not reproduce the issue you're concerned
about. Which pipe are you referring to?

>
> A file seems the easier solution in this case, we don't really care about what
> changes are streamed to the client, right?
>
>
> > +$node = PostgreSQL::Test::Cluster->new('test2');
> > +$node->init(allows_streaming => 'logical');
> > +$node->start;
> > +$node->safe_psql('postgres', "CREATE TABLE test(i int)");
>
> Why are we creating a new cluster? Initdbs takes a fair bit of time on some
> platforms, so this seems unnecessary?

Agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-20 Thread Masahiko Sawada
On Thu, Oct 20, 2022 at 6:57 PM Amit Kapila  wrote:
>
> On Wed, Oct 19, 2022 at 9:40 AM Masahiko Sawada  wrote:
> >
> > I've attached patches for Change-3 with a test case. Please review them as 
> > well.
> >
>
> The patch looks mostly good to me apart from few minor comments which
> are as follows:
> 1.
> +# The last decoding restarts from the first checkpoint, and add
> invalidation messages
> +# generated by "s0_truncate" to the subtransaction. When decoding the
> commit record of
> +# the top-level transaction, we mark both top-level transaction and
> its subtransactions
> +# as containing catalog changes. However, we check if we don't create
> the association
> +# between top-level and subtransactions at this time. Otherwise, we
> miss executing
> +# invalidation messages when forgetting the transaction.
> +permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert"
> "s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin"
> "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit"
> "s1_get_changes"
>
> The second part of this comment seems to say things more than required
> which makes it less clear. How about something like: "The last
> decoding restarts from the first checkpoint and adds invalidation
> messages generated by "s0_truncate" to the subtransaction. While
> processing the commit record for the top-level transaction, we decide
> to skip this xact but ensure that corresponding invalidation messages
> get processed."?
>
> 2.
> + /*
> + * We will assign subtransactions to the top transaction before
> + * replaying the contents of the transaction.
> + */
>
> I don't think we need this comment.
>

Thank you for the comments! I agree with both of them and have updated
the patches accordingly.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v12-v15_v2-0001-Fix-executing-invalidation-messages-generated-by-.patch
Description: Binary data


v11_v2-0001-Fix-executing-invalidation-messages-generated-by-.patch
Description: Binary data


Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-20 Thread Masahiko Sawada
On Thu, Oct 20, 2022 at 8:09 PM Amit Kapila  wrote:
>
> On Wed, Oct 19, 2022 at 4:47 PM Amit Kapila  wrote:
> >
> > On Wed, Oct 19, 2022 at 1:08 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > >
> > > > I've attached two patches that need to be back-patched to all branches
> > > > and includes Change-1, Change-2, and a test case for them. FYI this
> > > > patch resolves the assertion failure reported in this thread as well
> > > > as one reported in another thread[2]. So I borrowed some of the
> > > > changes from the patch[2] Osumi-san recently proposed.
> > > >
> > >
> > > Amit pointed out offlist that the changes in reorderbuffer.c is not
> > > pgindent'ed. I've run pgindent and attached updated patches.
> > >
> >
> > Thanks, I have tested these across all branches till v10 and it works
> > as expected. I am planning to push this tomorrow unless I see any
> > further comments.
> >
>
> Pushed.

Thank you!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-19 Thread Masahiko Sawada
On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada  wrote:
>
> On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada  wrote:
> >
> > On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila  wrote:
> > >
> > > On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > >
> > > > > I think because the test case proposed needs all three changes, we can
> > > > > push the change-1 without a test case and then as a second patch have
> > > > > change-2 for HEAD and change-3 for back branches with the test case.
> > > > > Do you have any other ideas to proceed here?
> > > >
> > > > I found another test case that causes the assertion failure at
> > > > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > > > attached the patch for the test case. In this test case, I modified a
> > > > user-catalog table instead of system-catalog table. That way, we don't
> > > > generate invalidation messages while generating NEW_CID records. As a
> > > > result, we mark only the subtransactions as containing catalog change
> > > > and don't make association between top-level and sub transactions. The
> > > > assertion failure happens on all supported branches. If we need to fix
> > > > this (I believe so), Change-2 needs to be backpatched to all supported
> > > > branches.
> > > >
> > > > There are three changes as Amit mentioned, and regarding the test
> > > > case, we have three test cases I've attached: truncate_testcase.patch,
> > > > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > > > between assertion failures and test cases are very complex. I could
> > > > not find any test case to cause only one assertion failure on all
> > > > branches. One idea to proceed is:
> > > >
> > > > Patch-1 includes Change-1 and is applied to all branches.
> > > >
> > > > Patch-2 includes Change-2 and the user_catalog test case, and is
> > > > applied to all branches.
> > > >
> > > > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > > > test case), and is applied to v14 and v15 (also till v11 if we
> > > > prefer).
> > > >
> > > > The patch-1 doesn't include any test case but the user_catalog test
> > > > case can test both Change-1 and Change-2 on all branches.
> > > >
> > >
> > > I was wondering if it makes sense to commit both Change-1 and Change-2
> > > together as one patch? Both assertions are caused by a single test
> > > case and are related to the general problem that the association of
> > > top and sub transaction is only guaranteed to be formed before we
> > > decode transaction changes. Also, it would be good to fix the problem
> > > with a test case that can cause it. What do you think?
> >
> > Yeah, it makes sense to me.
> >
>
> I've attached two patches that need to be back-patched to all branches
> and includes Change-1, Change-2, and a test case for them. FYI this
> patch resolves the assertion failure reported in this thread as well
> as one reported in another thread[2]. So I borrowed some of the
> changes from the patch[2] Osumi-san recently proposed.
>

Amit pointed out offlist that the changes in reorderbuffer.c are not
pgindent'ed. I've run pgindent and attached updated patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


HEAD_v4-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


v10-v15_v4-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada
On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada  wrote:
>
> On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada  wrote:
> >
> > On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila  wrote:
> > >
> > > On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > >
> > > > > I think because the test case proposed needs all three changes, we can
> > > > > push the change-1 without a test case and then as a second patch have
> > > > > change-2 for HEAD and change-3 for back branches with the test case.
> > > > > Do you have any other ideas to proceed here?
> > > >
> > > > I found another test case that causes the assertion failure at
> > > > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > > > attached the patch for the test case. In this test case, I modified a
> > > > user-catalog table instead of system-catalog table. That way, we don't
> > > > generate invalidation messages while generating NEW_CID records. As a
> > > > result, we mark only the subtransactions as containing catalog change
> > > > and don't make association between top-level and sub transactions. The
> > > > assertion failure happens on all supported branches. If we need to fix
> > > > this (I believe so), Change-2 needs to be backpatched to all supported
> > > > branches.
> > > >
> > > > There are three changes as Amit mentioned, and regarding the test
> > > > case, we have three test cases I've attached: truncate_testcase.patch,
> > > > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > > > between assertion failures and test cases are very complex. I could
> > > > not find any test case to cause only one assertion failure on all
> > > > branches. One idea to proceed is:
> > > >
> > > > Patch-1 includes Change-1 and is applied to all branches.
> > > >
> > > > Patch-2 includes Change-2 and the user_catalog test case, and is
> > > > applied to all branches.
> > > >
> > > > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > > > test case), and is applied to v14 and v15 (also till v11 if we
> > > > prefer).
> > > >
> > > > The patch-1 doesn't include any test case but the user_catalog test
> > > > case can test both Change-1 and Change-2 on all branches.
> > > >
> > >
> > > I was wondering if it makes sense to commit both Change-1 and Change-2
> > > together as one patch? Both assertions are caused by a single test
> > > case and are related to the general problem that the association of
> > > top and sub transaction is only guaranteed to be formed before we
> > > decode transaction changes. Also, it would be good to fix the problem
> > > with a test case that can cause it. What do you think?
> >
> > Yeah, it makes sense to me.
> >
>
> I've attached two patches that need to be back-patched to all branches
> and includes Change-1, Change-2, and a test case for them. FYI this
> patch resolves the assertion failure reported in this thread as well
> as one reported in another thread[2]. So I borrowed some of the
> changes from the patch[2] Osumi-san recently proposed.
>

I've attached patches for Change-3 with a test case. Please review them as well.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From 763cb604fae97131b5be2da36deb10054b6dbf3a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Wed, 19 Oct 2022 12:16:43 +0900
Subject: [PATCH v1] Don't assign subtransactions to top transaction in
 SnapBuildXidSetCatalogChanges().

Previously, when decoding the commit record of the transaction, we
mark both top-level transaction and its subtransactions as containing
catalog changes and assign the subtransactions to the top-level
transaction. However, if the subtransactions have invalidation
messages, we missed executing them when forgetting the
transactions. In commit 272248a0c1 where we introduced
SnapBuildXidSetCatalogChanges(), the reason why we assigned them is
just to avoid the assertion failure in AssertTXNLsnOrder() as they
have the same LSN. Now that with commit XXX we skip this assertion
check until we reach the LSN at which we start decoding the contents
of the transaction, there is no reason for that anymore.

SnapBuildXidSetCatalogChanges() was introduced in 15 or older but this
bug doesn't exist in branches prior to 14 since we don't add
invalidation messages to subtransactions. We decided to backpatch
through 11 for consistency bu

Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada
On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada  wrote:
>
> On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila  wrote:
> >
> > On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  
> > wrote:
> > >
> > > >
> > > > I think because the test case proposed needs all three changes, we can
> > > > push the change-1 without a test case and then as a second patch have
> > > > change-2 for HEAD and change-3 for back branches with the test case.
> > > > Do you have any other ideas to proceed here?
> > >
> > > I found another test case that causes the assertion failure at
> > > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > > attached the patch for the test case. In this test case, I modified a
> > > user-catalog table instead of system-catalog table. That way, we don't
> > > generate invalidation messages while generating NEW_CID records. As a
> > > result, we mark only the subtransactions as containing catalog change
> > > and don't make association between top-level and sub transactions. The
> > > assertion failure happens on all supported branches. If we need to fix
> > > this (I believe so), Change-2 needs to be backpatched to all supported
> > > branches.
> > >
> > > There are three changes as Amit mentioned, and regarding the test
> > > case, we have three test cases I've attached: truncate_testcase.patch,
> > > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > > between assertion failures and test cases are very complex. I could
> > > not find any test case to cause only one assertion failure on all
> > > branches. One idea to proceed is:
> > >
> > > Patch-1 includes Change-1 and is applied to all branches.
> > >
> > > Patch-2 includes Change-2 and the user_catalog test case, and is
> > > applied to all branches.
> > >
> > > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > > test case), and is applied to v14 and v15 (also till v11 if we
> > > prefer).
> > >
> > > The patch-1 doesn't include any test case but the user_catalog test
> > > case can test both Change-1 and Change-2 on all branches.
> > >
> >
> > I was wondering if it makes sense to commit both Change-1 and Change-2
> > together as one patch? Both assertions are caused by a single test
> > case and are related to the general problem that the association of
> > top and sub transaction is only guaranteed to be formed before we
> > decode transaction changes. Also, it would be good to fix the problem
> > with a test case that can cause it. What do you think?
>
> Yeah, it makes sense to me.
>

I've attached two patches that need to be back-patched to all branches
and include Change-1, Change-2, and a test case for them. FYI this
patch resolves the assertion failure reported in this thread as well
as one reported in another thread[2]. So I borrowed some of the
changes from the patch[2] Osumi-san recently proposed.

Regards,

[1] 
https://www.postgresql.org/message-id/TYCPR01MB83733C6CEAE47D0280814D5AED7A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com
[2] 
https://www.postgresql.org/message-id/TYAPR01MB5866B30A1439043B1FC3F21EF5229%40TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v15_v3-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


HEAD_v3-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada
On Tue, Oct 18, 2022 at 7:56 PM Amit Kapila  wrote:
>
> On Mon, Oct 17, 2022 at 7:05 AM Masahiko Sawada  wrote:
> >
> > On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila  wrote:
> > >
> > > --- a/src/backend/replication/logical/decode.c
> > > +++ b/src/backend/replication/logical/decode.c
> > > @@ -113,6 +113,15 @@
> > > LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> > > XLogReaderState *recor
> > >   buf.origptr);
> > >   }
> > >
> > > +#ifdef USE_ASSERT_CHECKING
> > > + /*
> > > + * Check the order of transaction LSNs when we reached the start decoding
> > > + * LSN. See the comments in AssertTXNLsnOrder() for details.
> > > + */
> > > + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
> > > + AssertTXNLsnOrder(ctx->reorder);
> > > +#endif
> > > +
> > >   rmgr = GetRmgr(XLogRecGetRmid(record));
> > > >
> > >
> > > I am not able to think how/when this check will be useful. Because we
> > > skipped assert checking only for records that are prior to
> > > start_decoding_at point, I think for those records ordering should
> > > have been checked before the restart. start_decoding_at point will be
> > > either (a) confirmed_flush location, or (b) lsn sent by client, and
> > > any record prior to that must have been processed before restart.
> >
> > Good point. I was considering the case where the client sets far ahead
> > LSN but it's not worth considering this case in this context. I've
> > updated the patch accoringly.
> >
>
> One minor comment:
> Can we slightly change the comment: ". The ordering of the records
> prior to the LSN, we should have been checked before the restart." to
> ". The ordering of the records prior to the start_decoding_at LSN
> should have been checked before the restart."?

Agreed. I'll update the patch accordingly.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada
On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila  wrote:
>
> On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada  wrote:
> >
> > >
> > > I think because the test case proposed needs all three changes, we can
> > > push the change-1 without a test case and then as a second patch have
> > > change-2 for HEAD and change-3 for back branches with the test case.
> > > Do you have any other ideas to proceed here?
> >
> > I found another test case that causes the assertion failure at
> > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > attached the patch for the test case. In this test case, I modified a
> > user-catalog table instead of system-catalog table. That way, we don't
> > generate invalidation messages while generating NEW_CID records. As a
> > result, we mark only the subtransactions as containing catalog change
> > and don't make association between top-level and sub transactions. The
> > assertion failure happens on all supported branches. If we need to fix
> > this (I believe so), Change-2 needs to be backpatched to all supported
> > branches.
> >
> > There are three changes as Amit mentioned, and regarding the test
> > case, we have three test cases I've attached: truncate_testcase.patch,
> > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > between assertion failures and test cases are very complex. I could
> > not find any test case to cause only one assertion failure on all
> > branches. One idea to proceed is:
> >
> > Patch-1 includes Change-1 and is applied to all branches.
> >
> > Patch-2 includes Change-2 and the user_catalog test case, and is
> > applied to all branches.
> >
> > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > test case), and is applied to v14 and v15 (also till v11 if we
> > prefer).
> >
> > The patch-1 doesn't include any test case but the user_catalog test
> > case can test both Change-1 and Change-2 on all branches.
> >
>
> I was wondering if it makes sense to commit both Change-1 and Change-2
> together as one patch? Both assertions are caused by a single test
> case and are related to the general problem that the association of
> top and sub transaction is only guaranteed to be formed before we
> decode transaction changes. Also, it would be good to fix the problem
> with a test case that can cause it. What do you think?

Yeah, it makes sense to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-18 Thread Masahiko Sawada
On Tue, Oct 18, 2022 at 1:07 PM Amit Kapila  wrote:
>
> On Tue, Oct 18, 2022 at 6:29 AM Masahiko Sawada  wrote:
> >
> > On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila  wrote:
> > >
> > >
> > > IIUC, here you are speaking of three different changes. Change-1: Add
> > > a check in AssertTXNLsnOrder() to skip assert checking till we reach
> > > start_decoding_at. Change-2: Set needs_timetravel to true in one of
> > > the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
> > > call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
> > > in PG-14/15 as that won't be required after Change-1.
> >
> > Yes.
> >
> > >
> > > AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
> > > required in HEAD/v15/v14 to fix the problem.
> >
> > IIUC Change-2 is required in v16 and HEAD
> >
>
> Why are you referring v16 and HEAD separately?

Sorry, my mistake; I was confused.

>
> > but not mandatory in v15 and
> > v14. The reason why we need Change-2 is that there is a case where we
> > mark only subtransactions as containing catalog change while not doing
> > that for its top-level transaction. In v15 and v14, since we mark both
> > subtransactions and top-level transaction in
> > SnapBuildXidSetCatalogChanges() as containing catalog changes, we
> > don't get the assertion failure at "Assert(!needs_snapshot ||
> > needs_timetravel)".
> >
> > Regarding Change-3, it's required in v15 and v14 but not in HEAD and
> > v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
> > HEAD, Change-3 cannot be applied to the two branches.
> >
> > > Now, the second and third
> > > changes are not required in branches prior to v14 because we don't
> > > record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
> > > we want, we can even back-patch Change-2 and Change-3 to keep the code
> > > consistent or maybe just Change-3.
> >
> > Right. I don't think it's a good idea to back-patch Change-2 in
> > branches prior to v14 as it's not a relevant issue.
> >
>
> Fair enough but then why to even backpatch it to v15 and v14?

Oops, it's a typo. I wanted to say Change-2 should be back-patched only to HEAD.

>
> > Regarding
> > back-patching Change-3 to branches prior 14, I think it may be okay
> > til v11, but I'd be hesitant for v10 as the final release comes in a
> > month.
> >
>
> So to fix the issue in all branches, what we need to do is to
> backpatch change-1: in all branches till v10, change-2: in HEAD, and
> change-3: in V15 and V14. Additionally, we think, it is okay to
> backpatch change-3 till v11 as it is mainly done to avoid the problem
> fixed by change-1 and it makes code consistent in back branches.

Right.

>
> I think because the test case proposed needs all three changes, we can
> push the change-1 without a test case and then as a second patch have
> change-2 for HEAD and change-3 for back branches with the test case.
> Do you have any other ideas to proceed here?

I found another test case that causes the assertion failure at
"Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
attached the patch for the test case. In this test case, I modified a
user-catalog table instead of system-catalog table. That way, we don't
generate invalidation messages while generating NEW_CID records. As a
result, we mark only the subtransactions as containing catalog change
and don't make association between top-level and sub transactions. The
assertion failure happens on all supported branches. If we need to fix
this (I believe so), Change-2 needs to be backpatched to all supported
branches.

There are three changes as Amit mentioned, and regarding the test
case, we have three test cases I've attached: truncate_testcase.patch,
analyze_testcase.patch, and user_catalog_testcase.patch. The
relationship between assertion failures and test cases is very
complex. I could not find any test case that causes only one assertion
failure on all
branches. One idea to proceed is:

Patch-1 includes Change-1 and is applied to all branches.

Patch-2 includes Change-2 and the user_catalog test case, and is
applied to all branches.

Patch-3 includes Change-3 and the truncate test case (or the analyze
test case), and is applied to v14 and v15 (also till v11 if we
prefer).

Patch-1 doesn't include any test case, but the user_catalog test case
can test both Change-1 and Change-2 on all branches. In v15 and v14,
the analyze test case causes both the assertion failures at
"Assert(txn->ninvalidations == 0);" and "Assert(prev_first_lsn <
cur_txn->first_lsn);", whereas the truncate test case causes only the
one at "Assert(txn->ninvalidations == 0);". Since Patch-2 is applied
on top of Patch-1, there is no difference in terms of testing
Change-2.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


user_catalog_testcase.patch
Description: Binary data


analyze_testcase.patch
Description: Binary data


truncate_testcase.patch
Description: Binary data


Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-17 Thread Masahiko Sawada
On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila  wrote:
>
> On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada  
> wrote:
> >
> > Please note that to pass the new regression tests, the fix proposed in
> > a related thread[1] is required. Particularly, we need:
> >
> > @@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder,
> > XLogRecPtr lsn, TransactionId xid,
> > else if (sub_needs_timetravel)
> > {
> > /* track toplevel txn as well, subxact alone isn't 
> > meaningful */
> > +   elog(DEBUG2, "forced transaction %u to do timetravel
> > due to one of its subtransaction",
> > +xid);
> > +   needs_timetravel = true;
> > SnapBuildAddCommittedTxn(builder, xid);
> > }
> > else if (needs_timetravel)
> >
> > A side benefit of this approach is that we can fix another assertion
> > failure too that happens on REL14 and REL15 and reported here[2]. In
> > the commits 68dcce247f1a(REL14) and 272248a0c1(REL15), the reason why
> > we make the association between sub-txns to top-txn in
> > SnapBuildXidSetCatalogChanges() is just to avoid the assertion failure
> > in AssertTXNLsnOrder(). However, since the invalidation messages are
> > not transported from sub-txn to top-txn during the assignment, another
> > assertion check in ReorderBufferForget() fails when forgetting the
> > subtransaction. If we apply this idea of skipping the assertion
> > checks, we no longer need to make the such association in
> > SnapBuildXidSetCatalogChanges() and resolve this issue as well.
> >
>
> IIUC, here you are speaking of three different changes. Change-1: Add
> a check in AssertTXNLsnOrder() to skip assert checking till we reach
> start_decoding_at. Change-2: Set needs_timetravel to true in one of
> the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
> call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
> in PG-14/15 as that won't be required after Change-1.

Yes.

>
> AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
> required in HEAD/v15/v14 to fix the problem.

IIUC Change-2 is required in v16 and HEAD but not mandatory in v15 and
v14. The reason why we need Change-2 is that there is a case where we
mark only subtransactions as containing catalog changes while not
doing that for their top-level transaction. In v15 and v14, since we
mark both
subtransactions and top-level transaction in
SnapBuildXidSetCatalogChanges() as containing catalog changes, we
don't get the assertion failure at "Assert(!needs_snapshot ||
needs_timetravel)".

Regarding Change-3, it's required in v15 and v14 but not in HEAD and
v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
HEAD, Change-3 cannot be applied to the two branches.

> Now, the second and third
> changes are not required in branches prior to v14 because we don't
> record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
> we want, we can even back-patch Change-2 and Change-3 to keep the code
> consistent or maybe just Change-3.

Right. I don't think it's a good idea to back-patch Change-2 in
branches prior to v14 as it's not a relevant issue there. Regarding
back-patching Change-3 to branches prior to v14, I think it may be
okay down to v11, but I'd be hesitant about v10 as its final release
comes in a month.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-16 Thread Masahiko Sawada
On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila  wrote:
>
> On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada  
> wrote:
> >
> > Summarizing this issue, the assertion check in AssertTXNLsnOrder()
> > fails as reported because the current logical decoding cannot properly
> > handle the case where the decoding restarts from NEW_CID. Since we
> > don't make the association between top-level transaction and its
> > subtransaction while decoding NEW_CID (ie, in
> > SnapBuildProcessNewCid()), two transactions are created in
> > ReorderBuffer as top-txn and have the same LSN. This failure happens
> > on all supported versions.
> >
> > To fix the problem, one idea is that we make the association between
> > top-txn and sub-txn during that by calling ReorderBufferAssignChild(),
> > as Tomas proposed. On the other hand, since we don't guarantee to make
> > the association between the top-level transaction and its
> > sub-transactions until we try to decode the actual contents of the
> > transaction, it makes sense to me that instead of trying to solve by
> > making association, we need to change the code which are assuming that
> > it is associated.
> >
> > I've attached the patch for this idea. With the patch, we skip the
> > assertion checks in AssertTXNLsnOrder() until we reach the LSN at
> > which we start decoding the contents of transaction, ie.
> > start_decoding_at in SnapBuild. The minor concern is other way that
> > the assertion check could miss some faulty cases where two unrelated
> > top-transactions could have same LSN. With this patch, it will pass
> > for such a case. Therefore, for transactions that we skipped checking,
> > we do the check when we reach the LSN.
> >
>
> >
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -113,6 +113,15 @@
> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> XLogReaderState *recor
>   buf.origptr);
>   }
>
> +#ifdef USE_ASSERT_CHECKING
> + /*
> + * Check the order of transaction LSNs when we reached the start decoding
> + * LSN. See the comments in AssertTXNLsnOrder() for details.
> + */
> + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
> + AssertTXNLsnOrder(ctx->reorder);
> +#endif
> +
>   rmgr = GetRmgr(XLogRecGetRmid(record));
> >
>
> I am not able to think how/when this check will be useful. Because we
> skipped assert checking only for records that are prior to
> start_decoding_at point, I think for those records ordering should
> have been checked before the restart. start_decoding_at point will be
> either (a) confirmed_flush location, or (b) lsn sent by client, and
> any record prior to that must have been processed before restart.

Good point. I was considering the case where the client sets a
far-ahead LSN, but it's not worth considering that case in this
context. I've updated the patch accordingly.

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v2-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-14 Thread Masahiko Sawada
Hi,

On Mon, Oct 10, 2022 at 2:16 PM John Naylor
 wrote:
>
> The following is not quite a full review, but has plenty to think about. 
> There is too much to cover at once, and I have to start somewhere...
>
> My main concerns are that internal APIs:
>
> 1. are difficult to follow
> 2. lead to poor branch prediction and too many function calls
>
> Some of the measurements are picking on the SIMD search code, but I go into 
> details in order to demonstrate how a regression there can go completely 
> unnoticed. Hopefully the broader themes are informative.
>
> On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada  wrote:
> > [fixed benchmarks]
>
> Thanks for that! Now I can show clear results on some aspects in a simple 
> way. The attached patches (apply on top of v6) are not intended to be 
> incorporated as-is quite yet, but do point the way to some reorganization 
> that I think is necessary. I've done some testing on loading, but will leave 
> it out for now in the interest of length.
>
>
> 0001-0003 are your performance test fix and and some small conveniences for 
> testing. Binary search is turned off, for example, because we know it 
> already. And the sleep call is so I can run perf in a different shell 
> session, on only the search portion.
>
> Note the v6 test loads all block numbers in the range. Since the test item 
> ids are all below 64 (reasonable), there are always 32 leaf chunks, so all 
> the leaves are node32 and completely full. This had the effect of never 
> taking the byte-wise loop in the proposed pg_lsearch function. These two 
> aspects make this an easy case for the branch predictor:
>
> john=# select * from bench_seq_search(0, 1*1000*1000);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |167 | 
> 0 |  822 |   0
>
>  1,470,141,841  branches:u
> 63,693  branch-misses:u   #0.00% of all branches
>
> john=# select * from bench_shuffle_search(0, 1*1000*1000);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |168 | 
> 0 | 2174 |   0
>
>  1,470,142,569  branches:u
> 15,023,983  branch-misses:u   #1.02% of all branches
>
>
> 0004 randomizes block selection in the load part of the search test so that 
> each block has a 50% chance of being loaded.  Note that now we have many 
> node16s where we had none before. Although node 16 and node32 appear to share 
> the same path in the switch statement of rt_node_search(), the chunk 
> comparison and node_get_values() calls each must go through different 
> branches. The shuffle case is most affected, but even the sequential case 
> slows down. (The leaves are less full -> there are more of them, so memory 
> use is larger, but it shouldn't matter much, in the sequential case at least)
>
> john=# select * from bench_seq_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, 
> n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms 
> | rt_search_ms | array_serach_ms
> +--+-++---+--+-
>  999654 | 14893056 |   179937720 |173 | 0 
> |  907 |   0
>
>  1,684,114,926  branches:u
>  1,989,901  branch-misses:u   #0.12% of all branches
>
> john=# select * from bench_shuffle_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, 
> n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms 
> | rt_search_ms | array_serach_ms
> +--+-++---+--+-
>  999654 | 14893056 |   17993

Re: START_REPLICATION SLOT causing a crash in an assert build

2022-10-13 Thread Masahiko Sawada
On Thu, Oct 13, 2022 at 1:21 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-11 17:10:52 +0900, Masahiko Sawada wrote:
> > +# Reset the replication slot statistics.
> > +$node->safe_psql('postgres',
> > + "SELECT pg_stat_reset_replication_slot('regression_slot');");
> > +my $result = $node->safe_psql('postgres',
> > + "SELECT * FROM pg_stat_replication_slots WHERE slot_name = 
> > 'regrssion_slot'"
> > +);
>
> Typo in the slot name "regrssion_slot" instead of "regression_slot". We can't
> use * here, because that'll include the reset timestamp.

Fixed.

>
>
> > +# Teardown the node so the statistics is removed.
> > +$pg_recvlogical->kill_kill;
> > +$node->teardown_node;
> > +$node->start;
>
> ISTM that removing the file instead of shutting down the cluster with force
> would make it a more targeted test.

Agreed.

>
>
> > +# Check if the replication slot statistics have been removed.
> > +$result = $node->safe_psql('postgres',
> > + "SELECT * FROM pg_stat_replication_slots WHERE slot_name = 
> > 'regrssion_slot'"
> > +);
> > +is($result, "", "replication slot statistics are removed");
>
> Same typo as above. We can't assert a specific result here either, because
> recvlogical will have processed a bunch of changes. Perhaps we could check at
> least that the reset time is NULL?

Agreed.

>
>
> > +# Test if the replication slot staistics continue to be accumulated even 
> > after
>
> s/staistics/statistics/

Fixed.

I've attached an updated patch. I've added a common function to
start pg_recvlogical and wait for it to become active. Please review
it.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


regression_test_for_replslot_stats_v2.patch
Description: Binary data


Re: test_decoding assertion failure for the loss of top-sub transaction relationship

2022-10-12 Thread Masahiko Sawada
On Wed, Oct 12, 2022 at 3:35 PM kuroda.hay...@fujitsu.com
 wrote:
>
> Dear Sawada-san,
>
> > FYI, as I just replied to the related thread[1], the assertion failure
> > in REL14 and REL15 can be fixed by the patch proposed there. So I'd
> > like to see how the discussion goes. Regardless of this proposed fix,
> > the patch proposed by Kuroda-san is required for HEAD, REL14, and
> > REL15, in order to fix the assertion failure in SnapBuildCommitTxn().
>
> I understood that my patches for REL14 and REL15 might be not needed.

No, sorry for confusing you. I meant that even if we agreed with the
patch I proposed there, your patch is still required to fix the issue.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add index scan progress to pg_stat_progress_vacuum

2022-10-12 Thread Masahiko Sawada
On Tue, Oct 11, 2022 at 10:50 PM Imseih (AWS), Sami  wrote:
>
> >One idea would be to add a flag, say report_parallel_vacuum_progress,
> >to IndexVacuumInfo struct and expect index AM to check and update the
> >parallel index vacuum progress, say every 1GB blocks processed. The
> >flag is true only when the leader process is vacuuming an index.
>
> >Regards,
>
> Sorry for the long delay on this. I have taken the approach as suggested
> by Sawada-san and Robert and attached is v12.

Thank you for updating the patch!

>
> 1. The patch introduces a new counter in the same shared memory already
> used by the parallel leader and workers to keep track of the number
> of indexes completed. This way there is no reason to loop through
> the index status every time we want to get the status of indexes completed.

While it seems to be a good idea to have an atomic counter for the
number of indexes completed, I think we should not use a global
variable referencing the counter as follows:

+static pg_atomic_uint32 *index_vacuum_completed = NULL;
:
+void
+parallel_vacuum_progress_report(void)
+{
+   if (IsParallelWorker() || !report_parallel_vacuum_progress)
+   return;
+
+   pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
+pg_atomic_read_u32(index_vacuum_completed));
+}

I think we can pass ParallelVacuumState (or PVIndStats) to the
reporting function so that it can check the counter and report the
progress.
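
For example, a rough sketch of what I have in mind (the shared-struct
member names here are just for illustration):

void
parallel_vacuum_progress_report(ParallelVacuumState *pvs)
{
    if (IsParallelWorker() || !pvs->shared->report_parallel_vacuum_progress)
        return;

    pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
                                 pg_atomic_read_u32(&(pvs->shared->nindexes_completed)));
}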

> 2. A new function in vacuumparallel.c will be used to update
> the progress of indexes completed by reading from the
> counter created in point #1.
>
> 3. The function is called during the vacuum_delay_point as a
> matter of convenience, since this is called in all major vacuum
> loops. The function will only do something if the caller
> sets a boolean to report progress. Doing so will also ensure
> progress is being reported in case the parallel workers completed
> before the leader.

Robert pointed out:

---
I am not too sure that the idea of using the vacuum delay points is the best
plan. I think we should try to avoid piggybacking on such general
infrastructure if we can, and instead look for a way to tie this to
something that is specific to parallel vacuum.
---

I agree with this part.

Instead, I think we can add a boolean and a pointer to
ParallelVacuumState to IndexVacuumInfo. If the boolean is true, an
index AM can call the reporting function with the pointer to
ParallelVacuumState while scanning index blocks, for example every
1GB. The boolean can be true only for the leader process.
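
As a rough sketch (the new member names and scanned_blocks are made up):

    /* new fields in IndexVacuumInfo */
    bool        report_parallel_progress;  /* true only in the leader */
    ParallelVacuumState *pvs;               /* NULL unless parallel vacuum */

and then in an index AM's block scan loop, something like:

    /* report progress roughly every 1GB of index blocks scanned */
    if (info->report_parallel_progress &&
        (scanned_blocks % (((BlockNumber) 1024 * 1024 * 1024) / BLCKSZ)) == 0)
        parallel_vacuum_progress_report(info->pvs);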

>
> 4. Rather than adding any complexity to WaitForParallelWorkersToFinish
> and introducing a new callback, vacuumparallel.c will wait until
> the number of vacuum workers is 0 and then call
> WaitForParallelWorkersToFinish as it does currently.

Agreed, but with the following change, the leader process waits in a
busy loop, which should not be acceptable:

+   if (VacuumActiveNWorkers)
+   {
+   while (pg_atomic_read_u32(VacuumActiveNWorkers) > 0)
+   {
+   parallel_vacuum_progress_report();
+   }
+   }
+

Also, I think it's better to check whether idx_completed_progress
equals the number of indexes instead.
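
For example, instead of the busy loop, the leader could sleep between
checks, roughly like this (reusing the counter from the patch and the
reporting function signature suggested above):

    while (pg_atomic_read_u32(&(pvs->shared->idx_completed_progress)) < pvs->nindexes)
    {
        parallel_vacuum_progress_report(pvs);

        /* sleep instead of burning CPU while waiting for the workers */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         100L, WAIT_EVENT_PARALLEL_FINISH);
        ResetLatch(MyLatch);
    }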

> 5. Went back to the idea of adding a new view called 
> pg_stat_progress_vacuum_index
> which is accomplished by adding a new type called VACUUM_PARALLEL in 
> progress.h

Probably we can divide the patch into two patches: one for adding the
new statistics of the number of vacuumed indexes to
pg_stat_progress_vacuum, and another for adding the new statistics
view pg_stat_progress_vacuum_index.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: test_decoding assertion failure for the loss of top-sub transaction relationship

2022-10-12 Thread Masahiko Sawada
On Wed, Sep 7, 2022 at 11:06 AM kuroda.hay...@fujitsu.com
 wrote:
>
> Dear Amit,
>
> Thanks for giving comments!
>
> > Did you get this new assertion failure after you applied the patch for
> > the first failure? Because otherwise, how can you reach it with the
> > same test case?
>
> The first failure is occurred only in the HEAD, so I did not applied the 
> first patch
> to REL14 and REL15.
> This difference is caused because the commit [Fix catalog lookup...] in 
> REL15(272248a) and older is different
> from the HEAD one.
> In order versions SnapBuildXidSetCatalogChanges() has been added. In the 
> function
> a transaction will be marked as containing catalog changes if the transaction 
> is in InitialRunningXacts,
> and after that the relation between sub-top transactions is assigned based on 
> the parsed->subxact.
> The marking avoids the first failure, but the assignment triggers new failure.
>

FYI, as I just replied to the related thread[1], the assertion failure
in REL14 and REL15 can be fixed by the patch proposed there. So I'd
like to see how the discussion goes. Regardless of this proposed fix,
the patch proposed by Kuroda-san is required for HEAD, REL14, and
REL15, in order to fix the assertion failure in SnapBuildCommitTxn().

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoA1gV9pfu8hoXpTQBWH8uEMRg_F_MKM%2BU3Sr0HnyH4AUQ%40mail.gmail.com

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-11 Thread Masahiko Sawada
napshots and how we share them between the transactions :-( If we share
> > the snapshots between transactions, you're probably right we can't just
> > skip these changes.
> >
> > However, doesn't that pretty much mean we *have* to do something about
> > the assignment? I mean, suppose we miss the assignment (like now), so
> > that we end up with two TXNs that we think are top-level. And then we
> > get the commit for the actual top-level transaction. AFAICS that won't
> > clean-up the subxact, and we end up with a lingering TXN.
> >
>
> I think we will clean up such a subxact. Such a xact should be skipped
> via DecodeTXNNeedSkip() and then it will call ReorderBufferForget()
> for each of the subxacts and that will make sure that we clean up each
> of subtxn's.
>

Right.

Regards,

[1] 
https://www.postgresql.org/message-id/TYAPR01MB58666BD6BE24853269624282F5419%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[2] 
https://www.postgresql.org/message-id/TYAPR01MB58660803BCAA7849C8584AA4F57E9%40TYAPR01MB5866.jpnprd01.prod.outlook.com


--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data


Re: test_decoding assertion failure for the loss of top-sub transaction relationship

2022-10-11 Thread Masahiko Sawada
On Wed, Sep 7, 2022 at 11:06 AM kuroda.hay...@fujitsu.com
 wrote:
>
> Dear Amit,
>
> Thanks for giving comments!
>
> > Did you get this new assertion failure after you applied the patch for
> > the first failure? Because otherwise, how can you reach it with the
> > same test case?
>
> The first failure is occurred only in the HEAD, so I did not applied the 
> first patch
> to REL14 and REL15.
> This difference is caused because the commit [Fix catalog lookup...] in 
> REL15(272248a) and older is different
> from the HEAD one.
> In order versions SnapBuildXidSetCatalogChanges() has been added. In the 
> function
> a transaction will be marked as containing catalog changes if the transaction 
> is in InitialRunningXacts,
> and after that the relation between sub-top transactions is assigned based on 
> the parsed->subxact.
> The marking avoids the first failure, but the assignment triggers new failure.
>
>
> > About patch:
> > else if (sub_needs_timetravel)
> >   {
> > - /* track toplevel txn as well, subxact alone isn't meaningful */
> > + elog(DEBUG2, "forced transaction %u to do timetravel due to one of
> > its subtransaction",
> > + xid);
> > + needs_timetravel = true;
> >   SnapBuildAddCommittedTxn(builder, xid);
> >
> > Why did you remove the above comment? I think it still makes sense to 
> > retain it.
>
> Fixed.

Here are some review comments for v2 patch:

+# Test that we can force the top transaction to do timetravel when one of sub
+# transactions needs that. This is necessary when we restart decoding from
+# RUNNING_XACT without the wal to associate subtransaction to its top
+# transaction.

I don't think the second sentence is necessary.

---
The last decoding
+# starts from the first checkpoint and NEW_CID of "s0_truncate" doesn't mark
+# the top transaction as catalog modifying transaction. In this scenario, the
+# enforcement sets needs_timetravel to true even if the top transaction is
+# regarded as that it does not have catalog changes and thus the decoding
+# works without a contradition that one subtransaction needed timetravel
+# while its top transaction didn't.

I don't understand the last sentence; it's probably too long.

How about the following description?

# Test that we can handle the case where only the subtransaction is marked
# as containing catalog changes. The last decoding starts from NEW_CID
# generated by "s0_truncate" and marks only the subtransaction as containing
# catalog changes, but we don't create the association between the top-level
# transaction and the subtransaction yet. When decoding the commit record of
# the top-level transaction, we must force the top-level transaction to do
# timetravel since one of its subtransactions is marked as containing
# catalog changes.

---
+ elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransaction",
+ xid);
+ needs_timetravel = true;

I think "one of its subtransaction" should be "one of its subtransactions".

Regards,

--
Masahiko Sawada




Re: START_REPLICATION SLOT causing a crash in an assert build

2022-10-11 Thread Masahiko Sawada
On Sun, Oct 9, 2022 at 2:42 AM Andres Freund  wrote:
>
> On 2022-10-08 09:53:50 -0700, Andres Freund wrote:
> > On 2022-10-07 19:56:33 -0700, Andres Freund wrote:
> > > I'm planning to push this either later tonight (if I feel up to it after
> > > cooking dinner) or tomorrow morning PST, due to the release wrap deadline.
> >
> > I looked this over again, tested a bit more, and pushed the adjusted 15 and
> > master versions to github to get a CI run. Once that passes, as I expect, 
> > I'll
> > push them for real.
>
> Those passed and thus pushed.
>
> Thanks for the report, debugging and review everyone!

Thanks!

>
>
> I think we need at least the following tests for replslots:
> - a reset while decoding is ongoing works correctly
> - replslot stats continue to be accumulated after stats have been removed
>
>
> I wonder how much it'd take to teach isolationtester to handle the replication
> protocol...

I think we can do these tests by using pg_recvlogical in TAP tests.
I've attached a patch to do that.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


regression_test_for_replslot_stats.patch
Description: Binary data


Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-10 Thread Masahiko Sawada
On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila  wrote:
>
> On Fri, Oct 7, 2022 at 8:47 AM Masahiko Sawada  wrote:
> >
> > On Thu, Oct 6, 2022 at 9:04 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > I think the root reason for this kind of deadlock problems is the table
> > > structure difference between publisher and subscriber(similar to the 
> > > unique
> > > difference reported earlier[1]). So, I think we'd better disallow this 
> > > case. For
> > > example to avoid the reported problem, we could only support parallel 
> > > apply if
> > > pubviaroot is false on publisher and replicated tables' types(relkind) 
> > > are the
> > > same between publisher and subscriber.
> > >
> > > Although it might restrict some use cases, but I think it only restrict 
> > > the
> > > cases when the partitioned table's structure is different between 
> > > publisher and
> > > subscriber. User can still use parallel apply for cases when the table
> > > structure is the same between publisher and subscriber which seems 
> > > acceptable
> > > to me. And we can also document that the feature is expected to be used 
> > > for the
> > > case when tables' structure are the same. Thoughts ?
> >
> > I'm concerned that it could be a big restriction for users. Having
> > different partitioned table's structures on the publisher and the
> > subscriber is quite common use cases.
> >
> > From the feature perspective, the root cause seems to be the fact that
> > the apply worker does both receiving and applying changes. Since it
> > cannot receive the subsequent messages while waiting for a lock on a
> > table, the parallel apply worker also cannot move forward. If we have
> > a dedicated receiver process, it can off-load the messages to the
> > worker while another process waiting for a lock. So I think that
> > separating receiver and apply worker could be a building block for
> > parallel-apply.
> >
>
> I think the disadvantage that comes to mind is the overhead of passing
> messages between receiver and applier processes even for non-parallel
> cases. Now, I don't think it is advisable to have separate handling
> for non-parallel cases. The other thing is that we need to someway
> deal with feedback messages which helps to move synchronous replicas
> and update subscriber's progress which in turn helps to keep the
> restart point updated. These messages also act as heartbeat messages
> between walsender and walapply process.
>
> To deal with this, one idea is that we can have two connections to
> walsender process, one with walreceiver and the other with walapply
> process which according to me could lead to a big increase in resource
> consumption and it will bring another set of complexities in the
> system. Now, in this, I think we have two possibilities, (a) The first
> one is that we pass all messages to the leader apply worker and then
> it decides whether to execute serially or pass it to the parallel
> apply worker. However, that can again deadlock in the truncate
> scenario we discussed because the main apply worker won't be able to
> receive new messages once it is blocked at the truncate command. (b)
> The second one is walreceiver process itself takes care of passing
> streaming transactions to parallel apply workers but if we do that
> then walreceiver needs to wait at the transaction end to maintain
> commit order which means it can also lead to deadlock in case the
> truncate happens in a streaming xact.

I imagined (b) but I had missed the point of preserving the commit
order. Separating the receiver and apply worker cannot resolve this
problem.

>
> The other alternative is that we allow walreceiver process to wait for
> apply process to finish transaction and send the feedback but that
> seems to be again an overhead if we have to do it even for small
> transactions, especially it can delay sync replication cases. Even, if
> we don't consider overhead, it can still lead to a deadlock because
> walreceiver won't be able to move in the scenario we are discussing.
>
> About your point that having different partition structures for
> publisher and subscriber, I don't know how common it will be once we
> have DDL replication. Also, the default value of
> publish_via_partition_root is false which doesn't seem to indicate
> that this is a quite common case.

So how should we deal with these concurrency issues that can happen
only when streaming = 'parallel'? Can we restrict some use cases to
avoid the problem, or can we have a safeguard against these conflicts?
We could find a new problematic scenario in the future, and if that
happens and logical replication gets stuck, it cannot be resolved by
the apply workers themselves.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-07 Thread Masahiko Sawada
On Fri, Oct 7, 2022 at 2:29 PM John Naylor  wrote:
>
> On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada  wrote:
> > In addition to two patches, I've attached the third patch. It's not
> > part of radix tree implementation but introduces a contrib module
> > bench_radix_tree, a tool for radix tree performance benchmarking. It
> > measures loading and lookup performance of both the radix tree and a
> > flat array.
>
> Hi Masahiko, I've been using these benchmarks, along with my own variations, 
> to try various things that I've mentioned. I'm long overdue for an update, 
> but the picture is not yet complete.

Thanks!

> For now, I have two questions that I can't figure out on my own:
>
> 1. There seems to be some non-obvious limit on the number of keys that are 
> loaded (or at least what the numbers report). This is independent of the 
> number of tids per block. Example below:
>
> john=# select * from bench_shuffle_search(0, 8*1000*1000);
> NOTICE:  num_keys = 800, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 
> 25, n256 = 981
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  800 |268435456 |4800 |661 |
> 29 |  276 | 389
>
> john=# select * from bench_shuffle_search(0, 9*1000*1000);
> NOTICE:  num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 
> 262144, n256 = 1028
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  8388608 |276824064 |5400 |718 |
> 33 |  311 | 446
>
> The array is the right size, but nkeys hasn't kept pace. Can you reproduce 
> this? Attached is the patch I'm using to show the stats when running the 
> test. (Side note: The numbers look unfavorable for radix tree because I'm 
> using 1 tid per block here.)

Yes, I can reproduce this. In tid_to_key_off() we need to cast to
uint64 when packing the offset number and the block number:

   tid_i = ItemPointerGetOffsetNumber(tid);
   tid_i |= ItemPointerGetBlockNumber(tid) << shift;
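
That is, something like this (so that the block number isn't shifted as
a 32-bit value and its upper bits lost):

   tid_i = ItemPointerGetOffsetNumber(tid);
   tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;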

>
> 2. I found that bench_shuffle_search() is much *faster* for traditional 
> binary search on an array than bench_seq_search(). I've found this to be true 
> in every case. This seems counterintuitive to me -- any idea why this is? 
> Example:
>
> john=# select * from bench_seq_search(0, 100);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |168 |   
> 106 |  827 |3348
>
> john=# select * from bench_shuffle_search(0, 100);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |171 |   
> 107 |  827 |1400
>

Ugh, in shuffle_itemptrs(), we shuffled itemptrs instead of itemptr:

for (int i = 0; i < nitems - 1; i++)
{
int j = shuffle_randrange(, i, nitems - 1);
   ItemPointerData t = itemptrs[j];

   itemptrs[j] = itemptrs[i];
   itemptrs[i] = t;

With the fix, the results on my environment were:

postgres(1:4093192)=# select * from bench_seq_search(0, 1000);
2022-10-07 16:57:03.124 JST [4093192] LOG:  num_keys = 1000,
height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
  nkeys   | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--+--+-++---+--+-
 1000 |101826560 |  18 |846 |
 486 | 6096 |   21128
(1 row)

Time: 28975.566 ms (00:28.976)
postgres(1:4093192)=# select * from bench_shuffle_search(0, 1000);
2022-10-07 16:57:37.476 JST [4093192] LOG:  num_keys = 1000,
height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
  nkeys   | rt_mem_allocated | array_mem_a

Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread Masahiko Sawada
On Thu, Oct 6, 2022 at 9:04 PM houzj.f...@fujitsu.com
 wrote:
>
>
>
> > -Original Message-
> > From: Masahiko Sawada 
> > Sent: Thursday, October 6, 2022 4:07 PM
> > To: Hou, Zhijie/侯 志杰 
> > Cc: Amit Kapila ; Wang, Wei/王 威
> > ; Peter Smith ; Dilip
> > Kumar ; Shi, Yu/侍 雨 ;
> > PostgreSQL Hackers 
> > Subject: Re: Perform streaming logical transactions by background workers 
> > and
> > parallel apply
> >
> > On Tue, Sep 27, 2022 at 9:26 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Saturday, September 24, 2022 7:40 PM Amit Kapila
> >  wrote:
> > > >
> > > > On Thu, Sep 22, 2022 at 3:41 PM Amit Kapila 
> > > > wrote:
> > > > >
> > > > > On Thu, Sep 22, 2022 at 8:59 AM wangw.f...@fujitsu.com
> > > > >  wrote:
> > > > > >
> > > > >
> > > > > Few comments on v33-0001
> > > > > ===
> > > > >
> > > >
> > > > Some more comments on v33-0001
> > > > =
> > > > 1.
> > > > + /* Information from the corresponding LogicalRepWorker slot. */
> > > > + uint16 logicalrep_worker_generation;
> > > > +
> > > > + int logicalrep_worker_slot_no;
> > > > +} ParallelApplyWorkerShared;
> > > >
> > > > Both these variables are read/changed by leader/parallel workers without
> > > > using any lock (mutex). It seems currently there is no problem because 
> > > > of
> > the
> > > > way the patch is using in_parallel_apply_xact but I think it won't be a 
> > > > good
> > idea
> > > > to rely on it. I suggest using mutex to operate on these variables and 
> > > > also
> > check
> > > > if the slot_no is in a valid range after reading it in 
> > > > parallel_apply_free_worker,
> > > > otherwise error out using elog.
> > >
> > > Changed.
> > >
> > > > 2.
> > > >  static void
> > > >  apply_handle_stream_stop(StringInfo s)
> > > >  {
> > > > - if (!in_streamed_transaction)
> > > > + ParallelApplyWorkerInfo *winfo = NULL; TransApplyAction apply_action;
> > > > +
> > > > + if (!am_parallel_apply_worker() &&
> > > > + (!in_streamed_transaction && !stream_apply_worker))
> > > >   ereport(ERROR,
> > > >   (errcode(ERRCODE_PROTOCOL_VIOLATION),
> > > >   errmsg_internal("STREAM STOP message without STREAM START")));
> > > >
> > > > This check won't be able to detect missing stream start messages for 
> > > > parallel
> > > > apply workers apart from the first pair of start/stop. I thought of 
> > > > adding
> > > > in_remote_transaction check along with
> > > > am_parallel_apply_worker() to detect the same but that also won't work
> > > > because the parallel worker doesn't reset it at the stop message.
> > > > Another possibility is to introduce yet another variable for this but 
> > > > that
> > doesn't
> > > > seem worth it. I would like to keep this check simple.
> > > > Can you think of any better way?
> > >
> > > I feel we can reuse the in_streamed_transaction in parallel apply worker 
> > > to
> > > simplify the check there. I tried to set this flag in parallel apply 
> > > worker
> > > when stream starts and reset it when stream stop so that we can directly 
> > > check
> > > this flag for duplicate stream start message and other related things.
> > >
> > > > 3. I think we can skip sending start/stop messages from the leader to 
> > > > the
> > > > parallel worker because unlike apply worker it will process only one
> > > > transaction-at-a-time. However, it is not clear whether that is worth 
> > > > the
> > effort
> > > > because it is sent after logical_decoding_work_mem changes. For now, I 
> > > > have
> > > > added a comment for this in the attached patch but let me if I am 
> > > > missing
> > > > something or if I am wrong.
> > >
> > > I the suggested comments look good.
> > >
> > > > 4.
> > > > postgres=# select pid, leader_pid, application_name, backend_type from
> > > > pg_stat_activity;
> > > >   pid  | leader_pi

Re: START_REPLICATION SLOT causing a crash in an assert build

2022-10-06 Thread Masahiko Sawada
On Fri, Oct 7, 2022 at 8:00 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-06 14:10:46 +0900, Kyotaro Horiguchi wrote:
> > +1. FWIW, the atttached is an example of what it looks like if we
> > avoid file format change.
>
> What about if we go the other direction - simply remove the name from the
> stats entry at all. I don't actually think we need it anymore. Unless I am
> missing something right now - entirely possible! - the danger that
> pgstat_acquire_replslot() mentions doesn't actually exist [anymore]. After a
> crash we throw away the old stats data and if a slot is dropped while shut
> down, we'll not load the slot data at startup.

+1. I think it works. Since the replication slot index doesn't change
while the server is running, we can fetch the name from
ReplicationSlotCtl->replication_slots.
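
For example, given the slot's index, something like this (untested
sketch) should be enough to get the name:

    /* would need to hold ReplicationSlotControlLock and check slot->in_use */
    ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[index];
    const char *slotname = NameStr(slot->data.name);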

If we don't need the name in the stats entry, is
pgstat_acquire_replslot() no longer necessary?

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Issue with pg_stat_subscription_stats

2022-10-06 Thread Masahiko Sawada
On Fri, Oct 7, 2022 at 9:27 AM Andres Freund  wrote:
>
> On 2022-10-06 16:43:43 -0700, Andres Freund wrote:
> > On 2022-10-06 14:10:56 +0900, Michael Paquier wrote:
> > > On Tue, Jul 12, 2022 at 09:31:16AM +0530, Amit Kapila wrote:
> > > > I am not against backpatching this but OTOH it doesn't appear critical
> > > > enough to block one's work, so not backpatching should be fine.
> > >
> > > We are just talking about the reset timestamp not being set at
> > > when the object is created, right?  This does not strike me as
> > > critical, so applying it only on HEAD is fine IMO.  A few months ago,
> > > while in beta, I would have been fine with something applied to
> > > REL_15_STABLE.  Now that we are in RC, that's not worth taking a risk
> > > in my opinion.
> >
> > Agreed.
> >
> > > Amit or Andres, are you planning to double-check and perhaps merge
> > > this patch to take care of the inconsistency?
> >
> > I'll run it through CI and then to master unless somebody pipes up in the
> > meantime.
>
> And pushed. Thanks all!

Thanks!

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-10-06 Thread Masahiko Sawada
rue');

* On subscriber
create table p (c int) partition by list (c);
create table c1 partition of p for values In (2);
create table c2 partition of p for values In (1);
create subscription test_sub connection 'port=5551 dbname=postgres'
publication test_pub with (streaming = 'parallel', copy_data =
'false');

Note that while both the publisher and the subscriber have tables with
the same names, the partition structure is different, so rows go to a
different table on the subscriber (e.g., a row with c=1 will go to the
c2 table on the subscriber). If two concurrent transactions are
executed as follows, the apply worker (i.e., the leader apply worker)
waits for a lock on c2 held by its parallel apply worker:

* TX-1
BEGIN;
INSERT INTO p SELECT 1 FROM generate_series(1, 1); --- changes are streamed

* TX-2
BEGIN;
TRUNCATE c2; --- wait for a lock on c2

* TX-1
INSERT INTO p SELECT 1 FROM generate_series(1, 1);
COMMIT;

This might not be a common case in practice, but it could mean that
there is a restriction on how partitioned tables should be structured
on the publisher and the subscriber when using streaming = 'parallel'.
When this happens, since logical replication cannot move forward, the
users need to disable parallel-apply mode or increase
logical_decoding_work_mem. We could describe this limitation in the
doc, but it would be hard for users to detect a problematic table
structure.

BTW, when the leader apply worker waits for a lock on c2 in the above
example, the parallel apply worker is in a busy-loop, which should be
fixed.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



