On 5/27/23 09:21, Frode Nordahl wrote:
> On Fri, May 26, 2023 at 10:31 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 5/26/23 22:07, Ilya Maximets wrote:
>>> CC: Marcelo.
>>>
>>> On 5/26/23 22:02, Ilya Maximets wrote:
>>>> On 5/26/23 21:19, Eelco Chaudron wrote:
>>>>>
>>>>>
>>>>> Send from my phone
>>>>>
>>>>>> Op 26 mei 2023 om 20:52 heeft Ilya Maximets <i.maxim...@ovn.org> het 
>>>>>> volgende geschreven:
>>>>>>
>>>>>> On 5/26/23 20:43, Ilya Maximets wrote:
>>>>>>> On 5/23/23 12:39, Frode Nordahl wrote:
>>>>>>>> The tc module combines the use of the `tc_transact` helper
>>>>>>>> function for communication with the in-kernel tc infrastructure
>>>>>>>> with assertions on the reply data by `ofpbuf_at_assert` on the
>>>>>>>> received data prior to further processing.
>>>>>>>>
>>>>>>>> `tc_transact` in turn calls `nl_transact`, which via
>>>>>>>> `nl_transact_multiple__` ultimately calls and handles return
>>>>>>>> value from `recvmsg`.  On error a check for EAGAIN is performed
>>>>>>>> and a consequence of this condition is effectively to provide a
>>>>>>>> non-error (0) result and an empty reply buffer.
>>>>>>>
>>>>>>> Hi, Frode, others.
> 
> Hello, Ilya, thank you for weighing in on the issue and thread, much
> appreciated.
> 
>>>>>>> I took a closer look at the patch and the code in the netlink-socket.
>>>>>>> IIUC, the EAGAIN here is not a result of operation that we're 
>>>>>>> requesting,
>>>>>>> it's just a EAGAIN on a non-blocking (MSG_DONTWAIT) netlink socket while
>>>>>>> trying to read.  The reply to that transaction will arrive eventually 
>>>>>>> and
>>>>>>> we will interpret it later as a reply to a different netlink 
>>>>>>> transaction.
>>>>>>>
>>>>>>> The issue appears to be introduced in the following commit:
>>>>>>>  407556ac6c90 ("netlink-socket: Avoid forcing a reply for final message 
>>>>>>> in a transaction.")
>>>>>>>
>>>>>>> And it was a performance optimization introduced as part of the set:
>>>>>>>  https://mail.openvswitch.org/pipermail/ovs-dev/2012-April/260122.html
>>>>>>>
>>>>>>> The problem is that we can't tell apart socket EAGAIN if there is 
>>>>>>> nothing
>>>>>>> to wait (requests didn't need a reply) and EAGAIN if kernel just didn't
>>>>>>> manage to reply yet.
> 
> I did find that change suspicious, and it is listed as one of the
> commits fixed in
> my patch, although I'm only addressing it by documenting its behavior as I do
> not have full visibility of the blast radius for changing it.
> 
>>>>>> It's still strance though that reply is delayed.  Typically netlink
>>>>>> replies are formed right in the request handler.  Did you manage to
>>>>>> figure out what is causing this in tc specifically?  It wasn't an issue
>>>>>> for an OVS and other common families for many years.
>>>>>>
>>>>>
>>>>> Replying from my phone so I did not look at any code. However when I 
>>>>> looked at the code during the review I noticed it will ignore earlier 
>>>>> (none processed messages) replies based in the receive loop on the 
>>>>> transaction ID. So I do not think it should be an issue of messages 
>>>>> getting out of sync.
>>>>
>>>> That makes sense, OK.
> 
> Our challenge here is that this crash is happening in a production 
> environment,
> and we are currently unable to reproduce it in a lab environment. The
> consequence being that production has disabled the feature. So the proposed
> patch is in my mind a step on the way to figure out what is actually going on 
> by
> allowing the system to be up and capture more logs without interruption of
> service.
> 
>>>>>
>>>>> I assumed the delay happens when we request something to a hardware 
>>>>> offload drive which is replying async due to it being busy.
>>>>
>>>> I'm not sure it is actually possible for netlink replies to be delivered
>>>> asynchronously.  Looking at the kernel code, by the time the original
>>>> request is processed, the reply should already be sent.
>>>>
>>>> A simpler explanation, would be that it just doesn't reply for some reason
>>>> to a request that needs a reply. e.g. some incorrect error handling in the
>>>> kernel that leaves request unanswered.
>>>>
>>>> One potential issue I see in the tc_new_tfilter() implementation is that
>>>> if tfilter_notify() will fail for any reason, the function will still
>>>> return 0, because tc_new_tfilter() doesn't check the result.  And no reply
>>>> will be generated, because the return value is zero (success) and the
>>>> rtnetlink_send() was never called due to an error in the tfilter_notify().
>>>>
>>>> I'm not sure if that's what happening here, but it is one option how we
>>>> end up with a successful netlink transaction with NLM_F_ECHO set, but no
>>>> reply on the socket.
>>>>
>>>> A potential fix would be (not tested):
>>>>
>>>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>>>> index 2621550bfddc..9c334dbf9c5f 100644
>>>> --- a/net/sched/cls_api.c
>>>> +++ b/net/sched/cls_api.c
>>>> @@ -2311,7 +2311,7 @@ static int tc_new_tfilter(struct sk_buff *skb, 
>>>> struct nlmsghdr *n,
>>>>      err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
>>>>                            flags, extack);
>>>>      if (err == 0) {
>>>> -            tfilter_notify(net, skb, n, tp, block, q, parent, fh,
>>>> +            err = tfilter_notify(net, skb, n, tp, block, q, parent, fh,
>>>>                             RTM_NEWTFILTER, false, rtnl_held, extack);
>>>>              tfilter_put(tp, fh);
>>>>              /* q pointer is NULL for shared blocks */
>>>> ---
>>
>> ^^ This is definitely not a correct solution though, because we already
>> changed the filter, but we just failed to dump it back.  Some other
>> solution is needed.  Most likely, it needs to be re-tried until it succeeds.
>> Or the change() will need to be reverted and a failure reported to the
>> userspace.
> 
> Good catch, the cited code does indeed look problematic.
> 
>>>>
>>>> Solution on the OVS side might be to fail transactions that requested a
>>>> reply, but didn't get one.  Async replies sound like something that should
>>>> not be allowed and likely is not allowed.
> 
> Even if this turns out to be a bug on the kernel side, we ought to
> protect ourself
> from the condition. Just to understand your proposal, you are thinking partial
> revert of 407556ac6c90 and add a check for no/empty reply in the nl_transact?
> 

I was thinking about revert, but in the presence of the kernel bug if we
block on receive, we will just wait forever, if the kernel simply doesn't
reply to the ECHO.

Solution from the OVS side is similar to your patch, but with EPROTO instead.
Since EAGAIN is not a kernel's reply, we're not really expected to re-try
the receive.  EAGAIN on a socket just means that the kernel didn't and will
not reply.

Only caller of the tc_transact() knows if they expected a reply or not.
And if it was expected, but didn't happen, that's a protocol error.

What we can do is to check that reply->size > NLMSG_HDRLEN before pulling
the header, and return EPROTO if it's not.  That's what ofpbuf_try_pull()
can do.  tc_transact() users that also use ofpbuf_at_assert() should be
converted to use ofpbuf_try_pull() instead, so we do not crash if something
like this happens.  See dpif_netlink_vport_from_ofpbuf() for example.

And we should have a loud error/warning message for this condition.

No retries are needed.  And EAGAIN documentation part is also not needed,
because we're not supposed to retry, AFAICT.

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to