On 7/17/20 2:58 AM, Winson Wang wrote:
> Hi Dumitru,
> 
> most of the flows are in table 19.

This is the ls_in_pre_hairpin table where we add flows for each backend
of the load balancers.
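
By the way, assuming the dump in br-int.txt came from "ovs-ofctl
dump-flows br-int", a one-liner along these lines gives the same kind of
per-table breakdown in one shot:

  grep -oE 'table=[0-9]+' br-int.txt | sort | uniq -c | sort -rn | head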

> 
> -rw-r--r-- 1 root root 142M Jul 16 17:07 br-int.txt (all flows dump file)
> -rw-r--r-- 1 root root 102M Jul 16 17:43 table-19.txt
> -rw-r--r-- 1 root root 7.8M Jul 16 17:43 table-11.txt
> -rw-r--r-- 1 root root 3.7M Jul 16 17:43 table-21.txt
> 
> # cat table-19.txt | wc -l
> 408458
> # cat table-19.txt | grep "=9153" | wc -l
> 124744
> # cat table-19.txt | grep "=53" | wc -l
> 249488
> The CoreDNS pods have a service with port numbers 53 and 9153.
> 

How many backends do you have for these VIPs (with port number 53 and
9153) in your load_balancer config?
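
If it's easier to check from the NB side, something like this should
show the VIPs and their backends (assuming ovn-nbctl can reach your NB
DB):

  ovn-nbctl lb-list
  # or the raw rows, including the full vips map per load balancer:
  ovn-nbctl list Load_Balancer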

Thanks,
Dumitru

> Please let me know if you need more information.
> 
> 
> Regards,
> Winson
> 
> 
> On Thu, Jul 16, 2020 at 11:23 AM Dumitru Ceara <dce...@redhat.com> wrote:
> 
>     On 7/15/20 8:02 PM, Winson Wang wrote:
>     > +add ovn-Kubernetes group.
>     >
>     > Hi Dumitru,
>     >
>     > With the recent patches from you and Han, node and pod resource usage
>     > for basic k8s workloads is now stable and looks good.
>     > Many thanks!
> 
>     Hi Winson,
> 
>     Glad to hear that!
> 
>     >
>     > Exposing a k8s workload behind a service IP is very common; the
>     > CoreDNS deployment is one example.
>     > With a large cluster size such as 1000 nodes, a service auto-scales the
>     > CoreDNS deployment; with the default of one CoreDNS replica per 16
>     > nodes, that comes to 63 CoreDNS pods.
>     > On my 1006-node setup, the CoreDNS deployment scaled from 2 to 63 pods.
>     > An SB RAFT election timer of 16s is not enough for this operation in my
>     > test environment: one RAFT node cannot finish the election within two
>     > election slots, all of its clients disconnect and reconnect to the
>     > other two RAFT nodes, and the RAFT clients end up unbalanced after this
>     > operation.
>     > This condition probably cannot be avoided without a larger election
>     > timer.
>     >
>     > On the SB and worker node resource side:
>     > SB DB size increased by 27MB.
>     > br-int OpenFlow flow count increased by around 369K.
>     > RSS memory of (OVS + ovn-controller) increased by more than 600MB.
> 
>     This increase on the hypervisor side is most likely because of the
>     OpenFlow flows for hairpin traffic for VIPs (service IPs). To confirm,
>     would it be possible to take a snapshot of the OVS flow table and see
>     how many flows there are per table?
> 
>     >
>     > So if the OVN experts can figure out how to optimize this, it would be
>     > a big help for scaling ovn-k8s up to large cluster sizes, I think.
>     >
> 
>     If the above is due to the LB flows that handle hairpin traffic, the
>     only idea I have is to use the OVS "learn" action to have the flows
>     generated as needed. However, I didn't get the chance to try it out yet.
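>
>     Just to illustrate the mechanism (this is the classic MAC-learning
>     style of "learn" flow, not the actual hairpin logic, and the table
>     numbers are made up):
>
>       table=0 actions=learn(table=10,hard_timeout=60,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],output:NXM_OF_IN_PORT[]),resubmit(,10)
>
>     The idea would be that a handful of static flows install, on demand, a
>     narrow flow for the specific backend IP seen in hairpinned traffic,
>     instead of pre-installing one flow per backend.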
> 
>     Thanks,
>     Dumitru
> 
>     >
>     > Regards,
>     > Winson
>     >
>     >
>     > On Fri, May 1, 2020 at 1:35 AM Dumitru Ceara <dce...@redhat.com> wrote:
>     >
>     >     On 5/1/20 12:00 AM, Winson Wang wrote:
>     >     > Hi Han,  Dumitru,
>     >     >
>     >
>     >     Hi Winson,
>     >
>     >     > With the fix from Dumitru
>     >     > (https://github.com/ovn-org/ovn/commit/97e82ae5f135a088c9e95b49122d8217718d23f4)
>     >     > the OVN SB DB RAFT workload is greatly reduced in my stress test
>     >     > with k8s services that have many endpoints.
>     >     >
>     >     > The DB file size increases much less with the fix, so the same
>     >     > workload no longer triggers a leader election.
>     >     >
>     >     > Dumitru, based on my test, the number of logical flows is fixed
>     >     > for a given cluster size, regardless of the number of VIP
>     >     > endpoints.
>     >
>     >     The number of logical flows will be fixed based on the number of
>     >     VIPs (2 per VIP), but the size of the match expression depends on
>     >     the number of backends per VIP, so the SB DB size will increase
>     >     when adding backends to existing VIPs.
>     >
>     >     >
>     >     > But the OpenFlow flow count on each node still scales with the
>     >     > number of endpoints.
>     >
>     >     Yes, this is due to the match expression in the logical flow above,
>     >     which is of the form:
>     >
>     >     (ip.src == backend-ip1 && ip.dst == backend-ip1) || .. ||
>     >     (ip.src == backend-ipn && ip.dst == backend-ipn)
>     >
>     >     This will get expanded to n OpenFlow rules, one per backend, to
>     >     determine whether traffic was hairpinned.
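>     >
>     >     For example, with two (made-up) backend IPs 10.244.0.5 and
>     >     10.244.1.7, the expansion in that OpenFlow table would look roughly
>     >     like:
>     >
>     >       ip,nw_src=10.244.0.5,nw_dst=10.244.0.5 actions=... (mark hairpin)
>     >       ip,nw_src=10.244.1.7,nw_dst=10.244.1.7 actions=... (mark hairpin)
>     >
>     >     i.e. one rule per backend IP per VIP, which is why the flow count
>     >     scales with the number of endpoints.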
>     >
>     >     > Any idea how to reduce the OpenFlow flow count on each node's br-int?
>     >     >
>     >     >
>     >
>     >     Unfortunately I don't think there's a way to determine if traffic
>     >     was hairpinned, because I don't think we can have OpenFlow rules
>     >     that match on "ip.src == ip.dst". So in the worst case, we will
>     >     probably need two OpenFlow rules per backend IP (one for initiator
>     >     traffic, one for reply).
>     >
>     >     I'll think more about it though.
>     >
>     >     Regards,
>     >     Dumitru
>     >
>     >     > Regards,
>     >     > Winson
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >
>     >     > On Wed, Apr 29, 2020 at 1:42 PM Winson Wang <windson.w...@gmail.com> wrote:
>     >     >
>     >     >     Hi Han,
>     >     >
>     >     >     Thanks for the quick reply.
>     >     >     Please see my reply below.
>     >     >
>     >     >     On Wed, Apr 29, 2020 at 12:31 PM Han Zhou <hz...@ovn.org> wrote:
>     >     >
>     >     >
>     >     >
>     >     >         On Wed, Apr 29, 2020 at 10:29 AM Winson Wang <windson.w...@gmail.com> wrote:
>     >     >         >
>     >     >         > Hello Experts,
>     >     >         >
>     >     >         > I am doing stress testing of a k8s cluster with OVN. One
>     >     >         > thing I am seeing is that when the RAFT nodes get a large
>     >     >         > data update in a short time from ovn-northd, the 3 RAFT
>     >     >         > nodes trigger voting and the leader role switches from
>     >     >         > one node to another.
>     >     >         >
>     >     >         > From the ovn-northd side, I can see ovn-northd go through
>     >     >         > BACKOFF, RECONNECT...
>     >     >         >
>     >     >         > Since ovn-northd connects to the NB/SB leader only, how
>     >     >         > can we make ovn-northd more available most of the time?
>     >     >         >
>     >     >         > Is it possible to make ovn-northd keep established
>     >     >         > connections to all RAFT nodes to avoid the reconnect
>     >     >         > mechanism? The backoff time of 8s is not configurable for
>     >     >         > now.
>     >     >         >
>     >     >         >
>     >     >         > Test logs:
>     >     >         >
>     >     >         >
>     >     >         > 2020-04-29T17:03:08.296Z|41861|ovsdb_idl|INFO|tcp:10.0.2.152:6642: clustered database server is not cluster leader; trying another server
>     >     >         > 2020-04-29T17:03:08.296Z|41862|reconnect|DBG|tcp:10.0.2.152:6642: entering RECONNECT
>     >     >         > 2020-04-29T17:03:08.304Z|41863|reconnect|DBG|tcp:10.0.2.152:6642: entering BACKOFF
>     >     >         > 2020-04-29T17:03:09.708Z|41867|coverage|INFO|Dropped 2 log messages in last 78 seconds (most recently, 71 seconds ago) due to excessive rate
>     >     >         > 2020-04-29T17:03:09.708Z|41868|coverage|INFO|Skipping details of duplicate event coverage for hash=ceada91f
>     >     >         > 2020-04-29T17:03:16.304Z|41869|reconnect|DBG|tcp:10.0.2.153:6642: entering CONNECTING
>     >     >         > 2020-04-29T17:03:16.308Z|41870|reconnect|INFO|tcp:10.0.2.153:6642: connected
>     >     >         > 2020-04-29T17:03:16.308Z|41871|reconnect|DBG|tcp:10.0.2.153:6642: entering ACTIVE
>     >     >         > 2020-04-29T17:03:16.308Z|41872|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
>     >     >         > 2020-04-29T17:03:16.309Z|41873|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
>     >     >         > 2020-04-29T17:03:16.311Z|41874|ovsdb_idl|INFO|tcp:10.0.2.153:6642: clustered database server is disconnected from cluster; trying another server
>     >     >         > 2020-04-29T17:03:16.311Z|41875|reconnect|DBG|tcp:10.0.2.153:6642: entering RECONNECT
>     >     >         > 2020-04-29T17:03:16.312Z|41876|reconnect|DBG|tcp:10.0.2.153:6642: entering BACKOFF
>     >     >         > 2020-04-29T17:03:24.316Z|41877|reconnect|DBG|tcp:10.0.2.151:6642: entering CONNECTING
>     >     >         > 2020-04-29T17:03:24.321Z|41878|reconnect|INFO|tcp:10.0.2.151:6642: connected
>     >     >         > 2020-04-29T17:03:24.321Z|41879|reconnect|DBG|tcp:10.0.2.151:6642: entering ACTIVE
>     >     >         > 2020-04-29T17:03:24.321Z|41880|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
>     >     >         > 2020-04-29T17:03:24.354Z|41881|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
>     >     >         > 2020-04-29T17:03:24.358Z|41882|ovsdb_idl|INFO|tcp:10.0.2.151:6642: clustered database server is not cluster leader; trying another server
>     >     >         > 2020-04-29T17:03:24.358Z|41883|reconnect|DBG|tcp:10.0.2.151:6642: entering RECONNECT
>     >     >         > 2020-04-29T17:03:24.360Z|41884|reconnect|DBG|tcp:10.0.2.151:6642: entering BACKOFF
>     >     >         > 2020-04-29T17:03:32.367Z|41885|reconnect|DBG|tcp:10.0.2.152:6642: entering CONNECTING
>     >     >         > 2020-04-29T17:03:32.372Z|41886|reconnect|INFO|tcp:10.0.2.152:6642: connected
>     >     >         > 2020-04-29T17:03:32.372Z|41887|reconnect|DBG|tcp:10.0.2.152:6642: entering ACTIVE
>     >     >         > 2020-04-29T17:03:32.372Z|41888|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
>     >     >         > 2020-04-29T17:03:32.373Z|41889|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
>     >     >         > 2020-04-29T17:03:32.376Z|41890|ovsdb_idl|INFO|tcp:10.0.2.152:6642: clustered database server is not cluster leader; trying another server
>     >     >         > 2020-04-29T17:03:32.376Z|41891|reconnect|DBG|tcp:10.0.2.152:6642: entering RECONNECT
>     >     >         > 2020-04-29T17:03:32.378Z|41892|reconnect|DBG|tcp:10.0.2.152:6642: entering BACKOFF
>     >     >         > 2020-04-29T17:03:40.381Z|41893|reconnect|DBG|tcp:10.0.2.153:6642: entering CONNECTING
>     >     >         > 2020-04-29T17:03:40.385Z|41894|reconnect|INFO|tcp:10.0.2.153:6642: connected
>     >     >         > 2020-04-29T17:03:40.385Z|41895|reconnect|DBG|tcp:10.0.2.153:6642: entering ACTIVE
>     >     >         > 2020-04-29T17:03:40.385Z|41896|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
>     >     >         > 2020-04-29T17:03:40.385Z|41897|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
>     >     >         >
>     >     >         > --
>     >     >         > Winson
>     >     >
>     >     >         Hi Winson,
>     >     >
>     >     >         Since northd writes heavily to the SB DB, it is
>     >     >         implemented to connect to the leader only, for better
>     >     >         performance (avoiding the extra cost of a follower
>     >     >         forwarding writes to the leader). When a leader
>     >     >         re-election happens, it has to reconnect to the new
>     >     >         leader. However, if the cluster is unstable, this step can
>     >     >         also take longer than expected. I'd suggest tuning the
>     >     >         election timer to avoid re-elections during heavy
>     >     >         operations.
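>     >     >
>     >     >         For reference, the election timer can be bumped at runtime
>     >     >         with ovs-appctl on the current SB leader, along these
>     >     >         lines (the ctl socket path depends on the packaging, and
>     >     >         if I remember correctly the timer can only be roughly
>     >     >         doubled per invocation, so large increases take a few
>     >     >         steps):
>     >     >
>     >     >           ovs-appctl -t /var/run/ovn/ovnsb_db.ctl \
>     >     >               cluster/change-election-timer OVN_Southbound 20000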
>     >     >
>     >     >     I can see that setting the election timer to a higher value
>     >     >     avoids this, but if more stress is generated I see it happen
>     >     >     again.
>     >     >     A real workload may not hit the kind of spike I trigger in the
>     >     >     stress test, so this is just for scale profiling.
>     >     >
>     >     >
>     >     >         If the server is overloaded for too long and a longer
>     >     >         election timer is unacceptable, the only way to solve the
>     >     >         availability problem is to improve ovsdb performance. How
>     >     >         big is your transaction and what's your election timer
>     >     >         setting?
>     >     >
>     >     >     I can see ovn-northd send 33MB of data in a short time, and
>     >     >     ovsdb-server needs to sync that with its clients. I ran iftop
>     >     >     on the ovn-controller side; each node receives around a 25MB
>     >     >     update.
>     >     >     With each ovn-controller getting 25MB of data, the 3 RAFT
>     >     >     nodes send roughly 25MB * 646 ~= 16GB in total.
>     >     >
>     >     >         The number of clients also impacts the performance, since
>     >     >         the heavy update needs to be synced to all clients. How
>     >     >         many clients do you have?
>     >     >
>     >     >     Is there a mechanism for all the ovn-controller clients to
>     >     >     connect to the RAFT followers only and skip the leader?
>     >     >     That would leave the leader node more CPU for voting and
>     >     >     cluster-level sync.
>     >     >     Based on my stress test, after the ovn-controllers connected
>     >     >     to the 2 follower nodes, only ovn-northd was connected to the
>     >     >     leader node.
>     >     >     With this model, the RAFT voting finishes in a shorter time
>     >     >     when ovn-northd triggers the same workload.
>     >     >
>     >     >     The total number of clients is 646 nodes.
>     >     >     Before the leader role change, all clients were connected to
>     >     >     the 3 nodes in a balanced way; each RAFT node had 200+
>     >     >     connections.
>     >     >     After the leader role change, the ovn-controller side got the
>     >     >     following messages:
>     >     >     2020-04-29T04:21:14.566Z|00674|ovsdb_idl|INFO|tcp:10.0.2.153:6642: clustered database server is disconnected from cluster; trying another server
>     >     >
>     >     >     Node 10.0.2.153:
>     >     >
>     >     >     SB role changed from follower to candidate on 21:21:06
>     >     >
>     >     >     SB role changed from candidate to leader on 21:22:16
>     >     >
>     >     >     netstat for 6642 port connections:
>     >     >
>     >     >     21:21:31 ESTABLISHED 202
>     >     >
>     >     >     21:21:31 Pending 0
>     >     >
>     >     >     21:21:41 ESTABLISHED 0
>     >     >
>     >     >     21:21:41 Pending 0
>     >     >
>     >     >
>     >     >     The above node was in the candidate role for more than 60s,
>     >     >     which is more than my election timer setting of 30s.
>     >     >
>     >     >     All 202 connections of node 10.0.2.153 shifted to the other
>     >     >     two nodes in a short time. After that, only ovn-northd was
>     >     >     connected to this node.
>     >     >
>     >     >
>     >     >     Node 10.0.2.151:
>     >     >
>     >     >     SB role changed from leader to follower on 21:21:23
>     >     >
>     >     >
>     >     >     21:21:35 ESTABLISHED 233
>     >     >
>     >     >     21:21:35 Pending 0
>     >     >
>     >     >     21:21:45 ESTABLISHED 282
>     >     >
>     >     >     21:21:45 Pending 9
>     >     >
>     >     >     21:21:55 ESTABLISHED 330
>     >     >
>     >     >     21:21:55 Pending 1
>     >     >
>     >     >     21:22:05 ESTABLISHED 330
>     >     >
>     >     >     21:22:05 Pending 1
>     >     >
>     >     >
>     >     >
>     >     >     Node 10.0.2.152:
>     >     >
>     >     >     SB role changed from follower to candidate on 21:21:57
>     >     >
>     >     >     SB role changed from candidate to follower on 21:22:17
>     >     >
>     >     >
>     >     >     21:21:35 ESTABLISHED 211
>     >     >
>     >     >     21:21:35 Pending 0
>     >     >
>     >     >     21:21:45 ESTABLISHED 263
>     >     >
>     >     >     21:21:45 Pending 5
>     >     >
>     >     >     21:21:55 ESTABLISHED 316
>     >     >
>     >     >     21:21:55 Pending 0
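>     >     >
>     >     >     For reference, this kind of per-interval count can be
>     >     >     collected with a simple loop like the one below (the exact
>     >     >     command and the 10s interval are just an example):
>     >     >
>     >     >       while true; do
>     >     >           n=$(netstat -ant | awk '$4 ~ /:6642$/ && $6 == "ESTABLISHED"' | wc -l)
>     >     >           echo "$(date +%H:%M:%S) ESTABLISHED $n"
>     >     >           sleep 10
>     >     >       done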
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >         Thanks,
>     >     >         Han
>     >     >
>     >     >
>     >     >
>     >     >     --
>     >     >     Winson
>     >     >
>     >     >
>     >     >
>     >     > --
>     >     > Winson
>     >
>     >
>     >
>     > --
>     > Winson
> 
> 
> 
> -- 
> Winson

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
