On Mon, 26 Feb 2018, Tignor, Tom wrote:
In the slony1 log of our primary host (the same one which later showed
the bug) we had:
2018-02-21 19:19:49 UTC [22582] INFO remoteWorkerThread_8: SYNC 5002075962 done in 1.725 seconds
2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
The timestamp of the problem event was somewhat earlier: 2018-02-21
19:19:41.
To me it looks like there is a race condition: despite the order of the
log events, the DROP NODE was processed first and the SYNC was accepted
later, thus creating an sl_event record. Of course, that’s simply a
guess.
This might be possible.
If the node being dropped is 8, and the event node for the DROP NODE is 4,
then some third node, say node 3, might see a sequence like this:

remoteWorkerThread_8: starts processing the SYNC
remoteWorkerThread_4: processes the DROP NODE, updates the runtime config
remoteWorkerThread_8: finishes processing the SYNC and COMMITs
I'd feel more comfortable if we had a way of reproducing this, though.
One option might be to have the remoteWorkerThread recheck the runtime
config before the COMMIT to make sure that the node is still active.
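
To sketch the idea in plain SQL (not actual slon code; the schema name
"_ams_cluster" and node 8 are just taken from this incident): inside the
same transaction, immediately before the COMMIT, the worker would re-read
the node row for the SYNC's origin:

    -- hypothetical pre-commit recheck for a SYNC whose ev_origin is 8
    select no_id, no_active
      from "_ams_cluster".sl_node
     where no_id = 8;
    -- zero rows (node dropped) or no_active = false means a DROP NODE was
    -- applied concurrently, so the worker should ROLLBACK rather than
    -- COMMIT and re-create the sl_event row.

Whether it's cheaper to take the runtime config lock and recheck the
in-memory state instead of re-querying sl_node is an implementation
detail; the point is just to close the window between the last config
check and the COMMIT.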
Tom ☺
On 2/26/18, 12:01 PM, "Steve Singer" <st...@ssinger.info> wrote:
On Mon, 26 Feb 2018, Tignor, Tom wrote:
>
> Thanks. I see the deletes added for sl_seqlog and sl_log_script. The
> constraint violation appearing in the errors was for sl_event. Do we
> expect these changes to fully remove all state, including sl_event? The
> checkNodeDeleted function doesn’t look at sl_event.
It could be that this is a different issue, then.
That function (dropNode_int) should have removed the rows from sl_event.
The question then becomes: did it not remove them for some reason, or did
they get added back later, and if so, how?
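
One way to narrow that down on the affected node (a sketch; the cluster
schema "_ams_cluster" and node 8 are from the logs below):

    -- any rows returned are state dropNode_int should have deleted
    select ev_origin, ev_seqno, ev_timestamp, ev_type
      from "_ams_cluster".sl_event
     where ev_origin = 8;
    -- the matching confirm rows would be leftovers too
    select * from "_ams_cluster".sl_confirm where con_origin = 8;

If the race theory holds, deleting those leftover rows by hand before
restarting the slons would presumably unfreeze replication, but I'd treat
that as a workaround rather than a fix.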
> Tom ☺
>
>
> On 2/26/18, 11:03 AM, "Steve Singer" <st...@ssinger.info> wrote:
>
> On Mon, 26 Feb 2018, Tignor, Tom wrote:
>
> You can get it from the github branch (latest commit) at
> https://github.com/ssinger/slony1-engine/tree/bug375
>
> >
> > Steve,
> > The patch link actually goes to a page which says “Bugzilla
> > is down for maintenance.” Is there a way to see the patch currently?
> > Does it exist or is it scheduled in some Slony-I release?
> >
> > Tom ☺
> >
> >
> > On 2/22/18, 6:06 PM, "Steve Singer" <st...@ssinger.info> wrote:
> >
> > On Thu, 22 Feb 2018, Tignor, Tom wrote:
> >
> > Looks like
> > http://lists.slony.info/pipermail/slony1-general/2016-September/013331.html ?
> >
> > I can't remember if that was what prompted
> >
> > http://lists.slony.info/pipermail/slony1-hackers/2016-December/000560.html
> >
> >
> > https://github.com/ssinger/slony1-engine/tree/bug375
> >
> > I can't seem to find a reason why this wasn't committed.
> >
> >
> > >
> > >
> > >
> > > Hello slony1 community,
> > >
> > > We have a head scratcher here. It appears a DROP NODE command was not
> > > fully processed. The command was issued and confirmed on all our nodes
> > > at approximately 2018-02-21 19:19:50 UTC. When we went to restore it
> > > over two hours later, all replication stopped on an sl_event constraint
> > > violation. Investigation showed a SYNC event for the dropped node with
> > > a timestamp of just a few seconds before the drop. I believe this is a
> > > first for us. The DROP NODE command is supposed to remove all state for
> > > the dropped node. Is that right? Is there a potential race condition
> > > somewhere which could leave behind state?
> > >
> > > Thanks in advance,
> > >
> > >
> > >
> > > ---- master log replication freeze error ----
> > >
> > > 2018-02-21 21:38:52 UTC [5775] ERROR remoteWorkerThread_8: "insert into
> > > "_ams_cluster".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_snapshot,
> > > ev_type) values ('8', '5002075962', '2018-02-21 19:19:41.958719+00',
> > > '87044110:87044110:', 'SYNC'); insert into "_ams_cluster".sl_confirm
> > > (con_origin, con_received, con_seqno, con_timestamp) values (8, 1,
> > > '5002075962', now()); select "_ams_cluster".logApplySaveStats('_ams_cluster',
> > > 8, '0.139 s'::interval); commit transaction;" PGRES_FATAL_ERROR ERROR:
> > > duplicate key value violates unique constraint "sl_event-pkey"
> > >
> > > DETAIL: Key (ev_origin, ev_seqno)=(8, 5002075962) already exists.
> > >
> > > 2018-02-21 21:38:52 UTC [13649] CONFIG slon: child terminated
> > > signal: 9; pid: 5775, current worker pid: 5775
> > >
> > > 2018-02-21 21:38:52 UTC [13649] CONFIG slon: restart of worker
> > > in 10 seconds
> > >
> > > ---- master log replication freeze error ----
> > >
> > >
> > >
> > >
> > >
> > > ---- master DB leftover event ----
> > >
> > > a...@ams6.cmb.netmgmt:~$ psql -U akamai -d ams
> > >
> > > psql (9.1.24)
> > >
> > > Type "help" for help.
> > >
> > >
> > >
> > > ams=# select * from sl_event_bak;
> > >
> > >  ev_origin | ev_seqno   | ev_timestamp                  | ev_snapshot        | ev_type
> > > -----------+------------+-------------------------------+--------------------+---------
> > >          8 | 5002075962 | 2018-02-21 19:19:41.958719+00 | 87044110:87044110: | SYNC
> > > (1 row)
> > >
> > > (ev_data1 through ev_data8 are all empty and omitted above)
> > >
> > >
> > >
> > > ams=#
> > >
> > > ---- master DB leftover event ----
> > >
> > >
> > >
> > > ---- master log drop node record ----
> > >
> > > 2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
> > >
> > > 2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=4 li_receiver=1 li_provider=4
> > >
> > > 2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=7 li_receiver=1 li_provider=7
> > >
> > > 2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=3 li_receiver=1 li_provider=3
> > >
> > > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: update provider configuration
> > >
> > > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: connection for provider 4 terminated
> > >
> > > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: disconnecting from data provider 4
> > >
> > > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: connection for provider 7 terminated
> > >
> > > ---- master log drop node record ----
> > >
> > >
> > >
> > > ---- replica log drop node record ----
> > >
> > > 2018-02-21 19:19:51 UTC [22650] WARN remoteWorkerThread_1: got DROP NODE for local node ID
> > >
> > > NOTICE: Slony-I: Please drop schema "_ams_cluster"
> > >
> > > 2018-02-21 19:19:53 UTC [22650] INFO remoteWorkerThread_7: SYNC 5001868819 done in 2.153 seconds
> > >
> > > NOTICE: drop cascades to 243 other objects
> > >
> > > DETAIL: drop cascades to table _ams_cluster.sl_node
> > >
> > > drop cascades to table _ams_cluster.sl_nodelock
> > >
> > > drop cascades to table _ams_cluster.sl_set
> > >
> > > drop cascades to table _ams_cluster.sl_setsync
> > >
> > > drop cascades to table _ams_cluster.sl_table
> > >
> > > drop cascades to table _ams_cluster.sl_sequence
> > >
> > > ---- replica log drop node record ----
> > >
> > >
> > >
> > >
> > >
> > > Tom ☺
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
>
>
>
_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general