Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-21 Thread Amit Kapila
On Thu, May 20, 2021 at 12:46 PM tanghy.f...@fujitsu.com
 wrote:
>
> On Thursday, May 20, 2021 3:05 PM, Amit Kapila  
> wrote:
> > Okay, I have prepared the patches for all branches (11...HEAD). Each
> > version needs minor changes in the test, the code doesn't need much
> > change. Some notable changes in the tests:
> > 1. I have removed the conf change for max_logical_replication_workers
> > on the publisher node. We only need this for the subscriber node.
> > 2. After creating the new subscriptions wait for initial
> > synchronization as we do for other tests.
> > 3. synchronous_standby_names need to be reset for the previous test.
> > This is only required for HEAD.
> > 4. In PG-11, we need to specify the application_name in the connection
> > string, otherwise, it took the testcase file name as application_name.
> > This is the same as other tests are doing in PG11.
> >
> > Can you please once verify the attached patches?
>
> I have tested your patches for all branches(11...HEAD). All of them passed. 
> B.T.W, I also confirmed that the bug exists in these branches without your 
> fix.
>
> The changes in tests LGTM.
> But I saw whitespace warnings when applied the patches for PG11 and PG12, 
> please take a look at this.
>

Thanks, I have pushed after fixing the whitespace warning.

-- 
With Regards,
Amit Kapila.




RE: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-20 Thread tanghy.f...@fujitsu.com
On Thursday, May 20, 2021 3:05 PM, Amit Kapila  wrote:
> Okay, I have prepared the patches for all branches (11...HEAD). Each
> version needs minor changes in the test, the code doesn't need much
> change. Some notable changes in the tests:
> 1. I have removed the conf change for max_logical_replication_workers
> on the publisher node. We only need this for the subscriber node.
> 2. After creating the new subscriptions wait for initial
> synchronization as we do for other tests.
> 3. synchronous_standby_names need to be reset for the previous test.
> This is only required for HEAD.
> 4. In PG-11, we need to specify the application_name in the connection
> string, otherwise, it took the testcase file name as application_name.
> This is the same as other tests are doing in PG11.
> 
> Can you please once verify the attached patches?

I have tested your patches for all branches(11...HEAD). All of them passed. 
B.T.W, I also confirmed that the bug exists in these branches without your fix.

The changes in tests LGTM. 
But I saw whitespace warnings when applied the patches for PG11 and PG12, 
please take a look at this.

Regards
Tang


Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-20 Thread Amit Kapila
On Tue, May 18, 2021 at 6:52 AM Peter Smith  wrote:
>
> >
> > Yeah, have you checked it in the back branches?
> >
>
> Yes, the apply_handle_truncate function was introduced in April/2018 [1].
>
> REL_11_0 was tagged in Oct/2018.
>
> The "ERROR:  deadlock detected" log is reproducible in PG 11.0.
>

Okay, I have prepared the patches for all branches (11...HEAD). Each
version needs minor changes in the test, the code doesn't need much
change. Some notable changes in the tests:
1. I have removed the conf change for max_logical_replication_workers
on the publisher node. We only need this for the subscriber node.
2. After creating the new subscriptions wait for initial
synchronization as we do for other tests.
3. synchronous_standby_names need to be reset for the previous test.
This is only required for HEAD.
4. In PG-11, we need to specify the application_name in the connection
string, otherwise, it took the testcase file name as application_name.
This is the same as other tests are doing in PG11.

Can you please once verify the attached patches?

-- 
With Regards,
Amit Kapila.


v4-0001-Fix-deadlock-for-multiple-replicating-truncates-11.patch
Description: Binary data


v4-0001-Fix-deadlock-for-multiple-replicating-truncates-12.patch
Description: Binary data


v4-0001-Fix-deadlock-for-multiple-replicating-truncates-13.patch
Description: Binary data


v4-0001-Fix-deadlock-for-multiple-replicating-truncates-HEAD.patch
Description: Binary data


Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Amit Kapila
On Tue, May 18, 2021 at 6:19 AM Peter Smith  wrote:
>
> On Mon, May 17, 2021 at 8:13 PM Amit Kapila  wrote:
> >
> > On Mon, May 17, 2021 at 3:05 PM Dilip Kumar  wrote:
> > >
> > > On Mon, May 17, 2021 at 12:30 PM Peter Smith  
> > > wrote:
> > >
> > > > PSA a patch adding a test for this scenario.
> > >
> > > I am not sure this test case is exactly targeting the problematic
> > > behavior because that will depends upon the order of execution of the
> > > apply workers right?
> > >
> >
> > Yeah, so we can't guarantee that this test will always reproduce the
> > problem but OTOH, I have tried two times and it reproduced both times.
> > I guess we don't have a similar test where Truncate will replicate to
> > two subscriptions, otherwise, we would have caught such a problem.
> > Having said that, I am fine with leaving this test if others feel so
> > on the grounds that it won't always lead to the problem reported.
> >
>
> If there is any concern that the problem won't always happen then I
> think we should just increase the numbers of subscriptions.
>
> Having more simultaneous subscriptions (e.g. I have tried 4). will
> make it much more likely for the test to encounter the deadlock, and
> it probably would also be quite a useful worker stress test in it's
> own right.
>

I don't think we need to go that far.

> ~~
>
> Also, should this test be in the 010_truncate,pl,
>

+1 for keeping it in 010_truncate.pl but remove the synchronous part
of it from the testcase and comments as that is not required.

-- 
With Regards,
Amit Kapila.




Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Peter Smith
On Mon, May 17, 2021 at 6:47 PM Amit Kapila  wrote:
>
> On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:
> >
[...]
> > The essence of the trouble seems to be that the apply_handle_truncate
> > function never anticipated it may end up truncating the same table
> > from 2 separate workers (subscriptions) like this test case is doing.
> > Probably this is quite an old problem because the
> > apply_handle_truncate code has not changed much for a long time.
> >
>
> Yeah, have you checked it in the back branches?
>

Yes, the apply_handle_truncate function was introduced in April/2018 [1].

REL_11_0 was tagged in Oct/2018.

The "ERROR:  deadlock detected" log is reproducible in PG 11.0.

--
[1] 
https://github.com/postgres/postgres/commit/039eb6e92f20499ac36cc74f8a5cef7430b706f6

Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Peter Smith
On Mon, May 17, 2021 at 8:13 PM Amit Kapila  wrote:
>
> On Mon, May 17, 2021 at 3:05 PM Dilip Kumar  wrote:
> >
> > On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:
> >
> > > PSA a patch adding a test for this scenario.
> >
> > I am not sure this test case is exactly targeting the problematic
> > behavior because that will depends upon the order of execution of the
> > apply workers right?
> >
>
> Yeah, so we can't guarantee that this test will always reproduce the
> problem but OTOH, I have tried two times and it reproduced both times.
> I guess we don't have a similar test where Truncate will replicate to
> two subscriptions, otherwise, we would have caught such a problem.
> Having said that, I am fine with leaving this test if others feel so
> on the grounds that it won't always lead to the problem reported.
>

If there is any concern that the problem won't always happen then I
think we should just increase the numbers of subscriptions.

Having more simultaneous subscriptions (e.g. I have tried 4). will
make it much more likely for the test to encounter the deadlock, and
it probably would also be quite a useful worker stress test in it's
own right.

~~

Also, should this test be in the 010_truncate,pl, or does it belong in
the 100_bugs.pl?  (I don't know what are the rules for when a test
gets put into 100_bugs.pl)

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Dilip Kumar
On Mon, May 17, 2021 at 3:43 PM Amit Kapila  wrote:
>
> On Mon, May 17, 2021 at 3:05 PM Dilip Kumar  wrote:
> >
> > On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:
> >
> > > PSA a patch adding a test for this scenario.
> >
> > I am not sure this test case is exactly targeting the problematic
> > behavior because that will depends upon the order of execution of the
> > apply workers right?
> >
>
> Yeah, so we can't guarantee that this test will always reproduce the
> problem but OTOH, I have tried two times and it reproduced both times.
> I guess we don't have a similar test where Truncate will replicate to
> two subscriptions, otherwise, we would have caught such a problem.
> Having said that, I am fine with leaving this test if others feel so
> on the grounds that it won't always lead to the problem reported.

Although it is not guaranteed to reproduce the scenario every time, it
is testing a new scenario, so +1 for the test.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Amit Kapila
On Mon, May 17, 2021 at 3:05 PM Dilip Kumar  wrote:
>
> On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:
>
> > PSA a patch adding a test for this scenario.
>
> I am not sure this test case is exactly targeting the problematic
> behavior because that will depends upon the order of execution of the
> apply workers right?
>

Yeah, so we can't guarantee that this test will always reproduce the
problem but OTOH, I have tried two times and it reproduced both times.
I guess we don't have a similar test where Truncate will replicate to
two subscriptions, otherwise, we would have caught such a problem.
Having said that, I am fine with leaving this test if others feel so
on the grounds that it won't always lead to the problem reported.

-- 
With Regards,
Amit Kapila.




RE: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread tanghy.f...@fujitsu.com
On Monday, May 17, 2021 5:47 PM, Amit Kapila  wrote
> +$node_publisher->safe_psql('postgres',
> + "ALTER SYSTEM SET synchronous_standby_names TO 'any 2(sub5_1,
> sub5_2)'");
> +$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");
> 
> Do you really need these steps to reproduce the problem? IIUC, this
> has nothing to do with synchronous replication.

Agreed. 
I tested in asynchronous mode, and could reproduce this problem, too.

The attached patch removed the steps for setting synchronous replication.
And the test passed after applying Peter's patch.
Please take it as your reference.

Regards
Tang



v3_test_for_deadlock.patch
Description: v3_test_for_deadlock.patch


Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Dilip Kumar
On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:

> The essence of the trouble seems to be that the apply_handle_truncate
> function never anticipated it may end up truncating the same table
> from 2 separate workers (subscriptions) like this test case is doing.
> Probably this is quite an old problem because the
> apply_handle_truncate code has not changed much for a long time. The
> code of apply_handle_truncate function (worker.c) has a very similar
> pattern to the ExecuteTruncate function (tablecmds.c) but the
> ExecuteTruncate is using a more powerful AcccessExclusiveLock than the
> apply_handle_truncate was using.

Right, that's a problem.

>
> PSA a patch to make the apply_handle_truncate use AccessExclusiveLock
> same as the ExecuteTruncate function does.

I think the fix makes sense to me.

> PSA a patch adding a test for this scenario.

I am not sure this test case is exactly targeting the problematic
behavior because that will depends upon the order of execution of the
apply workers right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Amit Kapila
On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:
>
> While doing logical replication testing we encountered a problem which
> causes a deadlock error to be logged when replicating a TRUNCATE
> simultaneously to 2 Subscriptions.
> e.g.
> --
> 2021-05-12 11:30:19.457 AEST [11393] ERROR:  deadlock detected
> 2021-05-12 11:30:19.457 AEST [11393] DETAIL:  Process 11393 waits for
> ShareLock on transaction 597; blocked by process 11582.
> Process 11582 waits for ShareLock on relation 16384 of database 14896;
> blocked by process 11393.
> --
>
> At this time, both the subscriber (apply worker) processes are
> executing inside the ExecuteTruncateGuts function simultaneously and
> they become co-dependent.
>
> e.g.
> --
> Process 11582
> (gdb) bt
> #0  0x7fa1979515e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x0093e5d0 in WaitEventSetWaitBlock (set=0x2facac8,
> cur_timeout=-1, occurred_events=0x7ffed5fdff00, nevents=1) at
> latch.c:1450
> #2  0x0093e468 in WaitEventSetWait (set=0x2facac8, timeout=-1,
> occurred_events=0x7ffed5fdff00, nevents=1, wait_event_info=50331648)
> at latch.c:1396
> #3  0x0093d8cd in WaitLatch (latch=0x7fa191042654,
> wakeEvents=33, timeout=0, wait_event_info=50331648) at latch.c:473
> #4  0x009660f0 in ProcSleep (locallock=0x2fd06d8,
> lockMethodTable=0xcd90a0 ) at proc.c:1361
..
> --
> Process 11393
> (gdb) bt
> #0  0x7fa197917f90 in __nanosleep_nocancel () from /lib64/libc.so.6
> #1  0x7fa197917e44 in sleep () from /lib64/libc.so.6
> #2  0x00950f84 in DeadLockReport () at deadlock.c:1151
> #3  0x00955013 in WaitOnLock (locallock=0x2fd05d0,
> owner=0x2fd9a48) at lock.c:1873
>
..
> --
>
> The essence of the trouble seems to be that the apply_handle_truncate
> function never anticipated it may end up truncating the same table
> from 2 separate workers (subscriptions) like this test case is doing.
> Probably this is quite an old problem because the
> apply_handle_truncate code has not changed much for a long time.
>

Yeah, have you checked it in the back branches?

I am also able to reproduce and have analyzed the cause of the above
error. In the above, Process 11393 waits while updating pg_class tuple
via RelationSetNewRelfilenode() which is already updated by process
11582 (with transaction id 597 which is yet not committed). Now,
process 11582 waits for acquiring ShareLock on relation 16384 which is
acquired RowExclusiveMode by process 11393 in function
apply_handle_truncate. So, both the processes started waiting on each
other which causes a deadlock.

>
> PSA a patch adding a test for this scenario.
>

+
+$node_publisher->safe_psql('postgres',
+ "ALTER SYSTEM SET synchronous_standby_names TO 'any 2(sub5_1, sub5_2)'");
+$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");

Do you really need these steps to reproduce the problem? IIUC, this
has nothing to do with synchronous replication.

-- 
With Regards,
Amit Kapila.