Sounds like a great plan, Ben! Thanks for that. It'd be great if people could chime in on this thread to help identify those gaps.

As for anecdotes, we were recently involved in a case where OVN was used and packets were dropped at conntrack. Two VMs on different Logical Switches (externally routed), running on the same hypervisor, were communicating with each other and packet loss was observed. The loss affected only small (<64B) packets. These packets were padded by the NIC before being put on the wire and, when they came back, the ACLs sent them through conntrack, where they were dropped.

We determined this by inspecting the datapath flows via 'ovs-dpctl dump-flows' and then enabling logging on netfilter, which showed an error in the checksum calculation. It turned out to be a bug on the OVS kernel side which was already fixed in newer kernels, but it took quite a while to figure out and required a good understanding of what was going on.

In this scenario, traffic worked once the OVN ACLs were removed, so OVN was the first to be blamed. And often the OVN user/engineer is not enough of an OVS expert to tell effectively what happened to a packet. Maybe the example is not the best, as it was resolved using just the 'ovs-dpctl' tool and some logging, but support engineers may loop in OVN engineers, who may loop in OVS engineers, who may loop in kernel engineers. It'd be great to improve the experience somehow so that the initial assessment doesn't always have to go all the way down.

I'm also curious about the experiences of other folks here who work more with pure OVS.

Thanks a lot!
Daniel

On Thu, Mar 14, 2019 at 5:55 PM Ben Pfaff <b...@ovn.org> wrote:
>
> On Thu, Mar 14, 2019 at 04:55:56PM +0100, Daniel Alvarez Sanchez wrote:
> > Hi folks,
> >
> > Lately I'm getting the question in the subject line more and more
> > frequently and facing it myself, especially in the context of
> > OpenStack.
> >
> > The shift to OVN in OpenStack involves a totally different approach
> > when it comes to tracing packet drops. Before OVN, there were a bunch
> > of network namespaces and devices where you could hook a tcpdump on
> > and inspect the traffic. People are used to those troubleshooting
> > techniques and OVS was merely used for normal action switches.
> >
> > It's clear that there's tools and techniques to analyze this (trace
> > tool, port mirroring, etc.), but often times requires quite high
> > knowledge and understanding of the pipeline and OVS itself to
> > effectively trace where a packet got dropped. Furthermore, there could
> > be some scenarios where the packet can be silently dropped.
> >
> > I came across this patch [0] and presentation about it [1] which aims
> > to tackle partly the problem described here (focusing in the DPDK
> > datapath).
> >
> > The intent of this email is to gather some feedback as how to provide
> > efficient tools and techniques to troubleshoot OVS/OVN issues and what
> > do you think is immediately missing in this context.
>
> I guess that there are multiple things to do here:
>
> - Better document the tools that are available.
>
> - Implement improvements, especially UX-wise, to the existing tools.
>
> - Identify gaps in the available tools (and then fill them).
>
> Do you have any good anecdotes about user/admin frustration? They might
> be helpful for figuring out how to help. A lot of us here designed and
> built this stuff and so the gaps are not always obvious to us.
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
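
To illustrate the first step of the anecdote above (looking through 'ovs-dpctl dump-flows' output for where packets disappear), here is a minimal Python sketch. It is not the procedure we actually followed, only one way such a check could be scripted. The helper names (`parse_flow`, `dropped_flows`) and the two sample flow lines are made up for illustration; real datapath flows carry many more match fields, and the exact rendering of `ct_state` may differ between OVS versions.

```python
import re

# Illustrative sample only (hypothetical flows, not taken from the real case).
SAMPLE = """\
recirc_id(0),in_port(2),eth_type(0x0800),ipv4(frag=no), packets:40, bytes:3920, used:0.2s, actions:ct(zone=5),recirc(0x1)
recirc_id(0x1),in_port(2),ct_state(+inv+trk),eth_type(0x0800),ipv4(frag=no), packets:12, bytes:720, used:0.1s, actions:drop
"""


def parse_flow(line):
    """Split one datapath flow line into packet count, ct_state and actions.

    Returns None for lines that do not look like a flow (no 'actions:').
    """
    line = line.strip()
    if "actions:" not in line:
        return None
    head, actions = line.rsplit("actions:", 1)
    packets = re.search(r"packets:(\d+)", head)
    ct_state = re.search(r"ct_state\(([^)]*)\)", head)
    return {
        "packets": int(packets.group(1)) if packets else 0,
        "ct_state": ct_state.group(1) if ct_state else None,
        "actions": actions.strip(),
        "raw": line,
    }


def dropped_flows(lines):
    """Return flows whose only action is 'drop' and that have matched packets."""
    flows = (parse_flow(line) for line in lines)
    return [f for f in flows
            if f and f["actions"] == "drop" and f["packets"] > 0]


for flow in dropped_flows(SAMPLE.splitlines()):
    print(f"{flow['packets']} packets dropped, ct_state={flow['ct_state']}")
    print(f"  {flow['raw']}")
```

The same `dropped_flows` function could be fed the lines of a real 'ovs-dpctl dump-flows' capture saved to a file. A drop flow that matches on an invalid conntrack state would only be a hint that conntrack rejected the packet; finding out why (a bad checksum in our case) still needed netfilter logging.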