On Fri, May 26, 2023 at 10:31 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 5/26/23 22:07, Ilya Maximets wrote:
> > CC: Marcelo.
> >
> > On 5/26/23 22:02, Ilya Maximets wrote:
> >> On 5/26/23 21:19, Eelco Chaudron wrote:
> >>>
> >>>
> >>> Send from my phone
> >>>
> >>>> Op 26 mei 2023 om 20:52 heeft Ilya Maximets <i.maxim...@ovn.org> het 
> >>>> volgende geschreven:
> >>>>
> >>>> On 5/26/23 20:43, Ilya Maximets wrote:
> >>>>> On 5/23/23 12:39, Frode Nordahl wrote:
> >>>>>> The tc module combines the use of the `tc_transact` helper
> >>>>>> function for communication with the in-kernel tc infrastructure
> >>>>>> with assertions on the reply data by `ofpbuf_at_assert` on the
> >>>>>> received data prior to further processing.
> >>>>>>
> >>>>>> `tc_transact` in turn calls `nl_transact`, which via
> >>>>>> `nl_transact_multiple__` ultimately calls and handles return
> >>>>>> value from `recvmsg`.  On error a check for EAGAIN is performed
> >>>>>> and a consequence of this condition is effectively to provide a
> >>>>>> non-error (0) result and an empty reply buffer.
> >>>>>
> >>>>> Hi, Frode, others.

Hello, Ilya, thank you for weighing in on the issue and thread, much
appreciated.

> >>>>> I took a closer look at the patch and the code in the netlink-socket.
> >>>>> IIUC, the EAGAIN here is not a result of operation that we're 
> >>>>> requesting,
> >>>>> it's just a EAGAIN on a non-blocking (MSG_DONTWAIT) netlink socket while
> >>>>> trying to read.  The reply to that transaction will arrive eventually 
> >>>>> and
> >>>>> we will interpret it later as a reply to a different netlink 
> >>>>> transaction.
> >>>>>
> >>>>> The issue appears to be introduced in the following commit:
> >>>>>  407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message 
> >>>>> in a transaction.")
> >>>>>
> >>>>> And it was a performance optimization introduced as part of the set:
> >>>>>  https://mail.openvswitch.org/pipermail/ovs-dev/2012-April/260122.html
> >>>>>
> >>>>> The problem is that we can't tell apart socket EAGAIN if there is 
> >>>>> nothing
> >>>>> to wait (requests didn't need a reply) and EAGAIN if kernel just didn't
> >>>>> manage to reply yet.

I did find that change suspicious, and it is listed as one of the
commits fixed in
my patch, although I'm only addressing it by documenting its behavior as I do
not have full visibility of the blast radius for changing it.

> >>>> It's still strance though that reply is delayed.  Typically netlink
> >>>> replies are formed right in the request handler.  Did you manage to
> >>>> figure out what is causing this in tc specifically?  It wasn't an issue
> >>>> for an OVS and other common families for many years.
> >>>>
> >>>
> >>> Replying from my phone so I did not look at any code. However when I 
> >>> looked at the code during the review I noticed it will ignore earlier 
> >>> (none processed messages) replies based in the receive loop on the 
> >>> transaction ID. So I do not think it should be an issue of messages 
> >>> getting out of sync.
> >>
> >> That makes sense, OK.

Our challenge here is that this crash is happening in a production environment,
and we are currently unable to reproduce it in a lab environment. The
consequence being that production has disabled the feature. So the proposed
patch is in my mind a step on the way to figure out what is actually going on by
allowing the system to be up and capture more logs without interruption of
service.

> >>>
> >>> I assumed the delay happens when we request something to a hardware 
> >>> offload drive which is replying async due to it being busy.
> >>
> >> I'm not sure it is actually possible for netlink replies to be delivered
> >> asynchronously.  Looking at the kernel code, by the time the original
> >> request is processed, the reply should already be sent.
> >>
> >> A simpler explanation, would be that it just doesn't reply for some reason
> >> to a request that needs a reply. e.g. some incorrect error handling in the
> >> kernel that leaves request unanswered.
> >>
> >> One potential issue I see in the tc_new_tfilter() implementation is that
> >> if tfilter_notify() will fail for any reason, the function will still
> >> return 0, because tc_new_tfilter() doesn't check the result.  And no reply
> >> will be generated, because the return value is zero (success) and the
> >> rtnetlink_send() was never called due to an error in the tfilter_notify().
> >>
> >> I'm not sure if that's what happening here, but it is one option how we
> >> end up with a successful netlink transaction with NLM_F_ECHO set, but no
> >> reply on the socket.
> >>
> >> A potential fix would be (not tested):
> >>
> >> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> >> index 2621550bfddc..9c334dbf9c5f 100644
> >> --- a/net/sched/cls_api.c
> >> +++ b/net/sched/cls_api.c
> >> @@ -2311,7 +2311,7 @@ static int tc_new_tfilter(struct sk_buff *skb, 
> >> struct nlmsghdr *n,
> >>      err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
> >>                            flags, extack);
> >>      if (err == 0) {
> >> -            tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> >> +            err = tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> >>                             RTM_NEWTFILTER, false, rtnl_held, extack);
> >>              tfilter_put(tp, fh);
> >>              /* q pointer is NULL for shared blocks */
> >> ---
>
> ^^ This is definitely not a correct solution though, because we already
> changed the filter, but we just failed to dump it back.  Some other
> solution is needed.  Most likely, it needs to be re-tried until it succeeds.
> Or the change() will need to be reverted and a failure reported to the
> userspace.

Good catch, the cited code does indeed look problematic.

> >>
> >> Solution on the OVS side might be to fail transactions that requested a
> >> reply, but didn't get one.  Async replies sound like something that should
> >> not be allowed and likely is not allowed.

Even if this turns out to be a bug on the kernel side, we ought to
protect ourself
from the condition. Just to understand your proposal, you are thinking partial
revert of 407556ac6c90 and add a check for no/empty reply in the nl_transact?

-- 
Frode Nordahl

> >> Best regards, Ilya Maximets.
> >>
> >>>
> >>> Maybe Frode can get some perf traces to confirm this.
> >>>
> >>> //Eelco
> >>>>>
> >>>>> The real solution would be to revert commit 407556ac6c90 or be a bit
> >>>>> smarter and wait for reply only on requests that specify a reply buffer.
> >>>>>
> >>>>> Or am I missing something?
> >>>>>
> >>>>> Best regards, Ilya Maximets.
> >>>>
> >>
> >
>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to