Re: [ovs-dev] [OVN][RAFT] Follower refusing new entries from leader

2019-12-12 Thread Han Zhou
Hi Yun,

Sorry for the late reply, and thanks for trying the patch. In fact, OVSDB
triggers compaction by itself, depending on DB size and time, and the snapshot
lives in the same data file: it is the file header (usually the
second line). So based on what you said, it seems the patch solved your
problem. Please let me know if you still see problems afterwards.
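As a rough illustration of that layout (assuming, per the description above, that the snapshot sits as a JSON document on the second line of the database file; the real clustered file format has more framing than this, so treat the reader below as a sketch, and the fabricated header and field names as placeholders):

```python
import json
import os
import tempfile

def read_snapshot_line(path):
    """Return the JSON document found on the file's second line, if any."""
    with open(path) as f:
        f.readline()                 # line 1: file header / magic (skipped)
        return json.loads(f.readline())

# Demo with a fabricated two-line file standing in for ovnsb_db.db.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("OVSDB CLUSTER ...\n")   # placeholder header, not the real magic
    f.write(json.dumps({"prev_index": 42051, "prev_term": 1103}) + "\n")
print(read_snapshot_line(path))      # prints the snapshot document
os.remove(path)
```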

Thanks,
Han


Re: [ovs-dev] [OVN][RAFT] Follower refusing new entries from leader

2019-12-09 Thread taoyunupt
Hi, Han,
 I have not encountered that problem these days after using this
patch. I think there is no compaction in my environment; actually, I don't see
any snapshot file in /var/lib/openvswitch.


Thanks,
Yun






___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [OVN][RAFT] Follower refusing new entries from leader

2019-12-03 Thread Han Zhou
Hi,

Could you see if this patch fixes your problem?
https://patchwork.ozlabs.org/patch/1203951/

Thanks,
Han




Re: [ovs-dev] [OVN][RAFT] Follower refusing new entries from leader

2019-12-02 Thread Han Zhou
Sorry for the late reply; it was a holiday here.
I didn't see such a problem when there was no compaction. Did you see this
problem when DB compaction hadn't happened? The difference is that after
compaction the RAFT log doesn't have any entries and all the data is in the
snapshot.
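A toy model of that difference (illustrative Python only; the class and method names are invented, and the indices are borrowed loosely from the cluster/status output in this thread):

```python
# Toy model of a RAFT leader deciding how to catch up a lagging follower.
# Illustrative only -- not how ovsdb-server is actually structured.

class Leader:
    def __init__(self, log_start, log_end):
        self.log_start = log_start  # first index still present in the log
        self.log_end = log_end      # last appended index

    def catch_up_action(self, follower_next_index):
        """Decide how to bring a follower up to date."""
        if follower_next_index >= self.log_start:
            return "append_entries"    # replay entries from the log
        return "install_snapshot"      # those entries were compacted away

# Without compaction the log covers everything, so a restarted follower
# can simply be replayed the entries it missed.
fresh = Leader(log_start=1, log_end=51008)
print(fresh.catch_up_action(14891))      # prints: append_entries

# After compaction, indices below 42052 exist only in the snapshot, so the
# leader must transfer the snapshot instead.
compacted = Leader(log_start=42052, log_end=51008)
print(compacted.catch_up_action(14891))  # prints: install_snapshot
```

Without compaction, a restarted follower can always be caught up by replaying log entries; after compaction the older entries exist only in the snapshot, so the leader has to fall back to transferring the snapshot, which is the restart-after-compaction path suspected to be buggy here.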



Re: [ovs-dev] [OVN][RAFT] Follower refusing new entries from leader

2019-11-29 Thread taoyunupt
Hi, Han,
  Hope to receive your reply.


Thanks,
Yun





Re: [ovs-dev] [OVN][RAFT] Follower refusing new entries from leader

2019-11-28 Thread taoyunupt
Hi, Han,
 Another question, with no compaction: if a follower is restarted, and
the leader sends some entries during the downtime, will the same problem also
happen once the follower has started? What is the difference between a simple
restart and a restart after compaction?


Thanks,
Yun










Re: [ovs-dev] [OVN][RAFT] Follower refusing new entries from leader

2019-11-27 Thread taoyunupt
Hi, Han,
 Thanks for your reply. I think maybe we can disconnect the failed
follower from HAProxy, synchronize the data, and reconnect it to HAProxy once
that completes. But I do not know how to do the synchronization, actually.
 It is just my naive idea. Do you have any suggestion about how to fix
this problem? If it is not very complicated, I will have a try.


Thanks 
Yun








[ovs-dev] [OVN][RAFT] Follower refusing new entries from leader

2019-11-27 Thread taoyunupt
Hi,
My OVN cluster has 3 OVN-northd nodes, proxied by HAProxy with a VIP.
Recently I have been restarting the OVN cluster frequently, and one of the
members reports the logs below.
After reading the RAFT code and paper, this seems like the normal process:
if the follower does not find an entry in its log with the same index and
term, it refuses the new entries.
I think it is reasonable to refuse. But since we cannot control HAProxy (or
whatever proxy sits in front), an error will occur whenever a session is
assigned to the failed follower.
Are there some means or ways to solve this problem? Maybe we could kick the
failed follower out, or disconnect it from HAProxy, synchronize the data, and
then reconnect it? Hope to hear your suggestion.
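The consistency check in question can be sketched as a minimal model (illustrative Python, not OVSDB's actual raft.c; the helper name `check_append` is invented):

```python
# Minimal model of the RAFT AppendEntries consistency check on a follower.
# Illustrative only -- not OVSDB's raft.c implementation.

def check_append(local_log, prev_index, prev_term, first_index=1):
    """local_log maps index -> term for the entries the follower still holds."""
    if prev_index < first_index:
        return "ok"                        # nothing earlier to match against
    if prev_index not in local_log:
        return "mismatch past end of log"  # follower's log ends before prev_index
    if local_log[prev_index] != prev_term:
        return "term mismatch"             # conflicting entry: follower truncates
    return "ok"

# The follower here has log_end=14891 while the leader's previous entry is
# (term=1103, index=50975), reproducing the rejection seen in the logs.
follower_log = {i: 1103 for i in range(1, 14892)}
print(check_append(follower_log, prev_index=50975, prev_term=1103))
# prints: mismatch past end of log
```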




2019-11-27T14:22:17.060Z|00240|raft|INFO|rejecting append_request because 
previous entry 1103,50975 not in local log (mismatch past end of log)
2019-11-27T14:22:17.064Z|00241|raft|ERR|Dropped 34 log messages in last 12 
seconds (most recently, 0 seconds ago) due to excessive rate
2019-11-27T14:22:17.064Z|00242|raft|ERR|internal error: deferred append_reply 
message completed but not ready to send because message index 14890 is past 
last synced index 0: a2b2 append_reply "mismatch past end of log": term=1103 
log_end=14891 result="inconsistency"
2019-11-27T14:22:17.402Z|00243|raft|INFO|rejecting append_request because 
previous entry 1103,50975 not in local log (mismatch past end of log)




[root@ovn1 ~]#  ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl cluster/status 
OVN_Southbound
a2b2
Name: OVN_Southbound
Cluster ID: 4c54 (4c546513-77e3-4602-b211-2e200014ad79)
Server ID: a2b2 (a2b2a9c5-cf58-4724-8421-88fd5ca5d94d)
Address: tcp:10.254.8.209:6644
Status: cluster member
Role: leader
Term: 1103
Leader: self
Vote: self


Log: [42052, 51009]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->beaf ->9a33 <-9a33 <-beaf
Servers:
a2b2 (a2b2 at tcp:10.254.8.209:6644) (self) next_index=15199 
match_index=51008
beaf (beaf at tcp:10.254.8.208:6644) next_index=51009 match_index=0
9a33 (9a33 at tcp:10.254.8.210:6644) next_index=51009 match_index=51008



Re: [ovs-dev] [OVN][RAFT] Follower refusing new entries from leader

2019-11-27 Thread Han Zhou
On Wed, Nov 27, 2019 at 7:22 PM taoyunupt  wrote:

I think it is a bug. I noticed that this problem happens when the cluster
is restarted after DB compaction. I mentioned it in one of the test cases:
https://github.com/openvswitch/ovs/blob/master/tests/ovsdb-cluster.at#L252
I also mentioned another problem related to compaction:
https://github.com/openvswitch/ovs/blob/master/tests/ovsdb-cluster.at#L239
I was planning to debug these but haven't had the time yet. I will try to
find some time next week (it would be great if you could figure it out and
submit patches).

Thanks,
Han