Re: [ovs-discuss] Too many resubmits in the OVS pipeline for scaled OVN topologies due to multicast_group implementation

2019-10-03 Thread Ben Pfaff
On Wed, Oct 02, 2019 at 06:10:54PM +0200, Dumitru Ceara wrote:
> I'm afraid that increasing the limit will just postpone the problem.
> I'm also worried about the time the packet would spend in the
> pipeline and the potential for high jitter between delivery times to
> different ports.

This raises the question of what scale OVN should support for L2
broadcast domain size.  This question always arises for network
virtualization systems.  It came up in the very first one we built at
Nicira and it's come up in every subsequent one.  It's not even specific
to virtual networks, since L2 broadcast domain size is a primary reason
to partition L2 physical networks.

Along most scaling dimensions in OVN, the question has come down to
acceptable performance.  A great example comes from Han's early work,
where I believe he found that 2000 nodes scaled acceptably and 3000
nodes did not.  That generally works OK for things where the main
bottleneck is performance.

Dumitru has shown that we've got a different kind of bottleneck for
broadcast domain size.  Performance is implicated here, too, since it's
at best O(n) in the number of members in the broadcast domain, but we
also have a harder limit in terms of what OVS is willing to process.
That's come up before in Nicira network virtualization systems, although
if I recall correctly the actual limit was the number of actions per
OpenFlow flow (there is a ~64 kB limit) rather than the number of
resubmits.

So, it might be worth estimating the L2 broadcast domain that OVS+OVN
can currently handle, rounding it down a bit, and then saying that that
is what we plan to support.  Perhaps we should arrange to test it, too.
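
As a back-of-envelope illustration of such an estimate, in Python (both
inputs are assumptions to be refined by measurement, not authoritative
figures; the per-port cost is derived from the numbers reported in this
thread):

    # Rough estimate of the largest L2 broadcast domain that stays under
    # the OVS resubmit limit.  Assumed inputs, to be refined by testing.
    MAX_RESUBMITS = 4096        # ~4K limit behind "Too many resubmits"
    RESUBMITS_PER_PORT = 14     # assumed cost of one pipeline pass per port
                                # (~4K resubmits / ~300 ports reported here)
    SAFETY_MARGIN = 0.8         # "rounding it down a bit"

    supported = int(MAX_RESUBMITS / RESUBMITS_PER_PORT * SAFETY_MARGIN)
    print(f"supportable broadcast domain: ~{supported} logical ports")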

There's another angle that might be worth pursuing.  The problem here is
broadcast/unknown-unicast/multicast (BUM) traffic, not inherently the
size of an L2 domain.  A lot of OVN use cases don't really need BUM
traffic because all of the MAC addresses are known.  Perhaps, therefore,
there should be a setting in the northbound database to drop BUM traffic
for a given logical switch.
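
If such a knob existed, enabling it might look something like this (purely
hypothetical: the option name is invented for illustration and no such
setting exists in the northbound schema today):

    ovn-nbctl set Logical_Switch LS-pub other_config:drop-bum-traffic=true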


Re: [ovs-discuss] Too many resubmits in the OVS pipeline for scaled OVN topologies due to multicast_group implementation

2019-10-03 Thread Dumitru Ceara
On Wed, Oct 2, 2019 at 8:46 PM Han Zhou  wrote:
>
>
>
> On Wed, Oct 2, 2019 at 9:11 AM Dumitru Ceara  wrote:
> >
> > On Tue, Oct 1, 2019 at 8:41 PM Han Zhou  wrote:
> > >
> > >
> > >
> > > On Tue, Oct 1, 2019 at 3:34 AM Dumitru Ceara  wrote:
> > > >
> > > > Hi,
> > > >
> > > > We've hit a scaling issue recently [1] in the following topology:
> > > >
> > > > - External network connected to public logical switch "LS-pub"
> > > > - ~300 logical networks (LR-i <--> LS-i <--> VMi) connected to LS-pub
> > > > with dnat_and_snat rules.
> > > >
> > > > While trying to ping the VMs from outside, the ARP request packet from
> > > > the external host doesn't reach all the LR-i pipelines because it gets
> > > > dropped due to "Too many resubmits".
> > > >
> > > > This happens because the MC_FLOOD group for LS-pub installs openflow
> > > > entries that resubmit the packet:
> > > > - through the patch ports towards all LR-i (one entry in table 32 with
> > > > 300 resubmit actions).
> > > > - to the egress pipeline for each VIF that's local to LS-pub (one
> > > > entry in table 33).
> > > >
> > > > This means that for the ARP broadcast packet we'll execute the LR-i
> > > > ingress/egress pipeline 300 times. Each execution performs a fair
> > > > number of resubmits through the different tables of the pipeline,
> > > > leading to a total of more than 4K resubmits for the single initial
> > > > broadcast packet, which is more than the maximum allowed by OVS.
> > > >
> > > > After looking at the implementation I couldn't figure out a way to
> > > > avoid running the full pipelines for each potential logical output
> > > > port (patch or local VIF) because we might have flows later in the
> > > > pipelines that perform actions based on the value of the logical
> > > > output port (e.g., out ACL, qos).
> > > >
> > > > Do you think there's a different approach that we could use to
> > > > implement flooding of broadcast/unknown unicast packets that would
> > > > require fewer resubmit actions?
> > > >
> > > > This issue could also appear in a flat topology with a single logical
> > > > switch and multiple VIFs (>300). In this case the resubmits would be
> > > > part of the openflow entry in table 33, but the result would be the
> > > > same: too many resubmits due to the egress pipeline resubmits for each
> > > > logical output port.
> > > >
> > > > Thanks,
> > > > Dumitru
> > > >
> > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1756945
> > >
> > > Thanks Dumitru for reporting this interesting problem.
> >
> > Hi Han,
> >
> > Thanks for the reply.
> >
> > > In theory, I don't think we can avoid this because different datapaths
> > > can have different actions for the same packet, so when a packet needs
> > > to be sent to multiple datapaths, we have to let it go through the
> > > steps of each datapath.
> > > In practice, the number of stages to go through depends on the number of
> > > output ports in the broadcast domain and the number of stages in each
> > > output port. To alleviate the problem, we should try our best to handle
> > > broadcast packets at the earliest stages of each datapath, and terminate
> > > the packet pipeline as early as possible (e.g., ARP responses), to reduce
> > > the total number of resubmits.
> > >
> > > So I think what we can do is (although I hope there are better ways):
> > > - Increase the limit in OVS for the maximum number of allowed resubmits,
> > > choosing a value that covers the worst-case reasonable real-world deployment
> >
> > I'm afraid that increasing the limit will just postpone the problem.
> > I'm also worried about the time the packet would spend in the
> > pipeline and the potential for high jitter between delivery times to
> > different ports.
> >
>
> It should not be a big concern if fast-path flows are generated. But I am not
> sure if that's the case for ARPing different IPs. Did you observe high jitter
> with a large number of ports in the same L2?

I don't know if it's completely realistic, but I just ran the following
basic test:
- 1 logical switch with 250 emulated VMs (OVS internal interfaces)
- flush the IP neighbor cache on vm1
- ping (count 1) vm2-vm250 from vm1: this results in ARP requests for
all 249 destinations and one ICMP request per destination.

I get ping RTTs ranging from 0.5 ms to 14 ms, distributed per 0.1 ms
interval like this:

RTT 0.5 ms: 1
RTT 0.7 ms: 1
RTT 0.9 ms: 3
RTT 1.0 ms: 1
RTT 1.2 ms: 1
RTT 1.3 ms: 2
RTT 1.8 ms: 31
RTT 1.9 ms: 152
RTT 2.0 ms: 36
RTT 2.1 ms: 7
RTT 2.2 ms: 5
RTT 2.3 ms: 1
RTT 2.4 ms: 2
RTT 2.5 ms: 3
RTT 2.6 ms: 1
RTT 2.8 ms: 1
RTT 4.0 ms: 1
RTT 14.0 ms: 1

If I execute the same pings immediately afterwards, without flushing the
ARP cache (so no ARP requests):

RTT 0.0 ms: 10
RTT 0.1 ms: 237
RTT 0.2 ms: 2
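
For reference, the per-0.1 ms buckets above can be reproduced from raw
ping output with a small Python helper along these lines (a sketch only,
assuming the standard "time=X ms" field in ping output; not the exact
tooling used for the numbers above):

    import re
    import sys
    from collections import Counter

    # Count ping replies per 0.1 ms RTT bucket, reading ping output on stdin.
    buckets = Counter()
    for line in sys.stdin:
        m = re.search(r"time=([0-9.]+) ms", line)
        if m:
            buckets[round(float(m.group(1)), 1)] += 1

    for rtt in sorted(buckets):
        print(f"RTT {rtt} ms: {buckets[rtt]}")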

So it seems to me that, at least in this simple test, there is some jitter.
As you said, I'm not sure if we can improve the implementation or if
it's really a big issue because it will happen only with the 

Re: [ovs-discuss] Too many resubmits in the OVS pipeline for scaled OVN topologies due to multicast_group implementation

2019-10-02 Thread Dumitru Ceara
On Tue, Oct 1, 2019 at 8:41 PM Han Zhou  wrote:
>
>
>
> On Tue, Oct 1, 2019 at 3:34 AM Dumitru Ceara  wrote:
> >
> > Hi,
> >
> > We've hit a scaling issue recently [1] in the following topology:
> >
> > - External network connected to public logical switch "LS-pub"
> > - ~300 logical networks (LR-i <--> LS-i <--> VMi) connected to LS-pub
> > with dnat_and_snat rules.
> >
> > While trying to ping the VMs from outside, the ARP request packet from
> > the external host doesn't reach all the LR-i pipelines because it gets
> > dropped due to "Too many resubmits".
> >
> > This happens because the MC_FLOOD group for LS-pub installs openflow
> > entries that resubmit the packet:
> > - through the patch ports towards all LR-i (one entry in table 32 with
> > 300 resubmit actions).
> > - to the egress pipeline for each VIF that's local to LS-pub (one
> > entry in table 33).
> >
> > This means that for the ARP broadcast packet we'll execute the LR-i
> > ingress/egress pipeline 300 times. Each execution performs a fair
> > number of resubmits through the different tables of the pipeline,
> > leading to a total of more than 4K resubmits for the single initial
> > broadcast packet, which is more than the maximum allowed by OVS.
> >
> > After looking at the implementation I couldn't figure out a way to
> > avoid running the full pipelines for each potential logical output
> > port (patch or local VIF) because we might have flows later in the
> > pipelines that perform actions based on the value of the logical
> > output port (e.g., out ACL, qos).
> >
> > Do you think there's a different approach that we could use to
> > implement flooding of broadcast/unknown unicast packets that would
> > require fewer resubmit actions?
> >
> > This issue could also appear in a flat topology with a single logical
> > switch and multiple VIFs (>300). In this case the resubmits would be
> > part of the openflow entry in table 33, but the result would be the
> > same: too many resubmits due to the egress pipeline resubmits for each
> > logical output port.
> >
> > Thanks,
> > Dumitru
> >
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1756945
>
> Thanks Dumitru for reporting this interesting problem.

Hi Han,

Thanks for the reply.

> In theory, I don't think we can avoid this because different datapaths can
> have different actions for the same packet, so when a packet needs to be sent
> to multiple datapaths, we have to let it go through the steps of each
> datapath.
> In practice, the number of stages to go through depends on the number of
> output ports in the broadcast domain and the number of stages in each output
> port. To alleviate the problem, we should try our best to handle broadcast
> packets at the earliest stages of each datapath, and terminate the packet
> pipeline as early as possible (e.g., ARP responses), to reduce the total
> number of resubmits.
>
> So I think what we can do is (although I hope there are better ways):
> - Increase the limit in OVS for the maximum number of allowed resubmits,
> choosing a value that covers the worst-case reasonable real-world deployment

I'm afraid that increasing the limit will just postpone the problem.
I'm also worried about the time the packet would spend in the
pipeline and the potential for high jitter between delivery times to
different ports.

> - Try to avoid large broadcast domains in deployments (of course, only when
> we have a choice)

Agreed, but I guess this is outside the scope of OVN.

> - See if there is an optimization opportunity to move broadcast-related
> processing earlier in the pipeline and complete the processing of the
> packet as early as possible.
>

I don't think the problem is broadcast-specific. It should be the same
for unknown unicast, right?

> Thanks,
> Han

Thanks,
Dumitru


Re: [ovs-discuss] Too many resubmits in the OVS pipeline for scaled OVN topologies due to multicast_group implementation

2019-10-01 Thread Han Zhou
On Tue, Oct 1, 2019 at 3:34 AM Dumitru Ceara  wrote:
>
> Hi,
>
> We've hit a scaling issue recently [1] in the following topology:
>
> - External network connected to public logical switch "LS-pub"
> - ~300 logical networks (LR-i <--> LS-i <--> VMi) connected to LS-pub
> with dnat_and_snat rules.
>
> While trying to ping the VMs from outside, the ARP request packet from
> the external host doesn't reach all the LR-i pipelines because it gets
> dropped due to "Too many resubmits".
>
> This happens because the MC_FLOOD group for LS-pub installs openflow
> entries that resubmit the packet:
> - through the patch ports towards all LR-i (one entry in table 32 with
> 300 resubmit actions).
> - to the egress pipeline for each VIF that's local to LS-pub (one
> entry in table 33).
>
> This means that for the ARP broadcast packet we'll execute the LR-i
> ingress/egress pipeline 300 times. Each execution performs a fair
> number of resubmits through the different tables of the pipeline,
> leading to a total of more than 4K resubmits for the single initial
> broadcast packet, which is more than the maximum allowed by OVS.
>
> After looking at the implementation I couldn't figure out a way to
> avoid running the full pipelines for each potential logical output
> port (patch or local VIF) because we might have flows later in the
> pipelines that perform actions based on the value of the logical
> output port (e.g., out ACL, qos).
>
> Do you think there's a different approach that we could use to
> implement flooding of broadcast/unknown unicast packets that would
> require fewer resubmit actions?
>
> This issue could also appear in a flat topology with a single logical
> switch and multiple VIFs (>300). In this case the resubmits would be
> part of the the openflow entry in table 33 but the result would be the
> same: too many resubmits due to the egress pipeline resubmits for each
> logical output port.
>
> Thanks,
> Dumitru
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1756945

Thanks Dumitru for reporting this interesting problem.
In theory, I don't think we can avoid this because different datapaths can
have different actions for the same packet, so when a packet needs to be
sent to multiple datapaths, we have to let it go through the steps of
each datapath.
In practice, the number of stages to go through depends on the number of
output ports in the broadcast domain and the number of stages in each
output port. To alleviate the problem, we should try our best to handle
broadcast packets at the earliest stages of each datapath, and terminate
the packet pipeline as early as possible (e.g., ARP responses), to reduce
the total number of resubmits.

So I think what we can do is (although I hope there are better ways):
- Increase the limit in OVS for the maximum number of allowed resubmits,
choosing a value that covers the worst-case reasonable real-world deployment
- Try to avoid large broadcast domains in deployments (of course, only when
we have a choice)
- See if there is an optimization opportunity to move broadcast-related
processing earlier in the pipeline and complete the processing of the
packet as early as possible.

Thanks,
Han


[ovs-discuss] Too many resubmits in the OVS pipeline for scaled OVN topologies due to multicast_group implementation

2019-10-01 Thread Dumitru Ceara
Hi,

We've hit a scaling issue recently [1] in the following topology:

- External network connected to public logical switch "LS-pub"
- ~300 logical networks (LR-i <--> LS-i <--> VMi) connected to LS-pub
with dnat_and_snat rules.

While trying to ping the VMs from outside, the ARP request packet from
the external host doesn't reach all the LR-i pipelines because it gets
dropped due to "Too many resubmits".

This happens because the MC_FLOOD group for LS-pub installs openflow
entries that resubmit the packet:
- through the patch ports towards all LR-i (one entry in table 32 with
300 resubmit actions).
- to the egress pipeline for each VIF that's local to LS-pub (one
entry in table 33).

This means that for the ARP broadcast packet we'll execute the LR-i
ingress/egress pipeline 300 times. Each execution performs a fair
number of resubmits through the different tables of the pipeline,
leading to a total of more than 4K resubmits for the single initial
broadcast packet, which is more than the maximum allowed by OVS.
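
As a quick sanity check on that arithmetic (the per-pipeline resubmit
count below is an assumed average for illustration, not a measured value):

    # Rough arithmetic behind the "Too many resubmits" drop, in Python.
    OVS_RESUBMIT_LIMIT = 4096     # the ~4K limit mentioned above
    ROUTERS = 300                 # LR-i pipelines reached via the flood group
    RESUBMITS_PER_ROUTER = 14     # assumed resubmits per LR ingress/egress pass

    total = ROUTERS * RESUBMITS_PER_ROUTER
    print(total, total > OVS_RESUBMIT_LIMIT)   # 4200 True -> packet dropped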

After looking at the implementation I couldn't figure out a way to
avoid running the full pipelines for each potential logical output
port (patch or local VIF) because we might have flows later in the
pipelines that perform actions based on the value of the logical
output port (e.g., out ACL, qos).

Do you think there's a different approach that we could use to
implement flooding of broadcast/unknown unicast packets that would
require fewer resubmit actions?

This issue could also appear in a flat topology with a single logical
switch and multiple VIFs (>300). In this case the resubmits would be
part of the openflow entry in table 33, but the result would be the
same: too many resubmits due to the egress pipeline resubmits for each
logical output port.

Thanks,
Dumitru

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1756945