Sounds like a great plan, Ben! Thanks for that. It'd be great if
people could chime in on this thread to help identify those gaps.

As for anecdotes, we were recently involved in a case where OVN was
in use and packets were dropped in conntrack:

Two VMs on different Logical Switches (externally routed), running on
the same hypervisor, were communicating with each other and packet
loss was observed, but only for small (<64B) packets. These packets
were padded by the NIC before being put on the wire and, when they
came back, the ACLs sent them through conntrack, where they were
dropped. We determined this by inspecting the datapath flows with
'ovs-dpctl dump-flows', and then enabled netfilter logging, which
showed a checksum calculation error. It turned out to be a bug on the
OVS kernel side that was already fixed in newer kernels, but it took
quite a while, and a good understanding of what was going on, to
figure that out. In this scenario traffic worked once the OVN ACLs
were removed, so OVN was the first to be blamed. And often the OVN
user/engineer is not enough of an OVS expert to tell what actually
happened to a packet.
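
To make the failure mode a bit more concrete, here is a minimal Python
sketch of one way link-layer padding can trip a checksum check. It is
a toy model and an assumption on my side, not the kernel code and not
necessarily the exact bug we hit: a small UDP datagram is padded with
zeros up to the 60-byte minimum Ethernet frame (without FCS), and the
verifier either trims to the real L4 length first or takes the length
from the padded buffer. The addresses, ports and helper names are made
up for illustration.

```python
import struct

ETH_MIN_NO_FCS = 60   # minimum Ethernet frame size, excluding the 4-byte FCS
ETH_HDR_LEN = 14
IPV4_HDR_LEN = 20     # assumption: IPv4 header without options
UDP_PROTO = 17


def ones_complement_sum(data: bytes) -> int:
    """Folded 16-bit ones-complement sum (RFC 1071 style), not inverted."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src: bytes, dst: bytes, l4_len: int) -> bytes:
    """IPv4 pseudo-header used by the UDP checksum."""
    return src + dst + struct.pack("!BBH", 0, UDP_PROTO, l4_len)


def build_udp(src: bytes, dst: bytes, sport: int, dport: int, payload: bytes) -> bytes:
    """Build a UDP datagram with a correct checksum."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", sport, dport, length, 0)
    csum = 0xFFFF ^ ones_complement_sum(pseudo_header(src, dst, length) + header + payload)
    csum = csum or 0xFFFF  # a computed 0 is transmitted as 0xFFFF
    return struct.pack("!HHHH", sport, dport, length, csum) + payload


def udp_checksum_ok(src: bytes, dst: bytes, l4_bytes: bytes) -> bool:
    """Verify the checksum, taking the L4 length from the buffer size."""
    return ones_complement_sum(pseudo_header(src, dst, len(l4_bytes)) + l4_bytes) == 0xFFFF


# Hypothetical endpoints, for illustration only.
SRC = bytes([10, 0, 0, 1])
DST = bytes([10, 0, 1, 1])

datagram = build_udp(SRC, DST, 40000, 53, b"ping")

# Pad the L4 part so that Ethernet + IPv4 + L4 reaches the minimum frame size.
pad_len = ETH_MIN_NO_FCS - ETH_HDR_LEN - IPV4_HDR_LEN - len(datagram)
padded = datagram + b"\x00" * pad_len

# The length field inside the UDP header still holds the real length.
real_len = struct.unpack("!H", padded[4:6])[0]

trimmed_ok = udp_checksum_ok(SRC, DST, padded[:real_len])
untrimmed_ok = udp_checksum_ok(SRC, DST, padded)
print("trimmed:", trimmed_ok, "untrimmed:", untrimmed_ok)
```

In this model the zero padding itself does not change the
ones-complement sum; what breaks verification is the length that goes
into the pseudo-header, so trimming to the declared length before
checking makes the same packet valid again.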

Maybe the example is not the best, as it was resolved using just the
'ovs-dpctl' tool and some logging, but support engineers may loop in
OVN engineers, who may loop in OVS engineers, who may loop in kernel
engineers. It'd be great to improve the experience somehow so that the
initial assessment doesn't always have to go all the way down.

I'm also curious about the experiences of other folks here who have
more of a pure OVS background.

Thanks a lot!
Daniel

On Thu, Mar 14, 2019 at 5:55 PM Ben Pfaff <b...@ovn.org> wrote:
>
> On Thu, Mar 14, 2019 at 04:55:56PM +0100, Daniel Alvarez Sanchez wrote:
> > Hi folks,
> >
> > Lately I'm getting the question in the subject line more and more
> > frequently and facing it myself, especially in the context of
> > OpenStack.
> >
> > The shift to OVN in OpenStack involves a totally different approach
> > when it comes to tracing packet drops. Before OVN, there were a bunch
> > of network namespaces and devices where you could attach tcpdump
> > and inspect the traffic. People are used to those troubleshooting
> > techniques, and OVS was merely used for normal action switches.
> >
> > It's clear that there are tools and techniques to analyze this (trace
> > tool, port mirroring, etc.), but they often require a deep
> > knowledge and understanding of the pipeline and OVS itself to
> > effectively trace where a packet got dropped. Furthermore, there could
> > be some scenarios where the packet is silently dropped.
> >
> > I came across this patch [0] and a presentation about it [1], which
> > aim to partly tackle the problem described here (focusing on the DPDK
> > datapath).
> >
> > The intent of this email is to gather some feedback on how to provide
> > efficient tools and techniques to troubleshoot OVS/OVN issues and on
> > what you think is immediately missing in this context.
>
> I guess that there are multiple things to do here:
>
> - Better document the tools that are available.
>
> - Implement improvements, especially UX-wise, to the existing tools.
>
> - Identify gaps in the available tools (and then fill them).
>
> Do you have any good anecdotes about user/admin frustration?  They might
> be helpful for figuring out how to help.  A lot of us here designed and
> built this stuff and so the gaps are not always obvious to us.
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
