On 1/13/23 14:30, David Marchand wrote:
> On Fri, Jan 13, 2023 at 1:58 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 1/12/23 21:55, David Marchand wrote:
>>> As Ilya reported, we have an ABBA deadlock between the DPDK vq->access_lock
>>> and the OVS dev->mutex when the OVS main thread refreshes statistics while
>>> a vring state change event is being processed for the same vhost port.
>>>
>>> To break out of this situation, move vring state change notification
>>> handling from the vhost-events DPDK thread back to the OVS main thread,
>>> using a lockless queue.
>>>
>>> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2023-January/401101.html
>>> Fixes: 3b29286db1c5 ("netdev-dpdk: Add per virtqueue statistics.")
>>> Signed-off-by: David Marchand <david.march...@redhat.com>
>>> ---
>>>  lib/netdev-dpdk.c | 67 +++++++++++++++++++++++++++++++++++++----------
>>>  1 file changed, 53 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index 5e2d64651d..7946e7f2df 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -49,6 +49,7 @@
>>>  #include "dpif-netdev.h"
>>>  #include "fatal-signal.h"
>>>  #include "if-notifier.h"
>>> +#include "mpsc-queue.h"
>>>  #include "netdev-provider.h"
>>>  #include "netdev-vport.h"
>>>  #include "odp-util.h"
>>> @@ -4255,30 +4256,37 @@ destroy_device(int vid)
>>>      }
>>>  }
>>>
>>> -static int
>>> -vring_state_changed(int vid, uint16_t queue_id, int enable)
>>> +static struct mpsc_queue vhost_state_change_queue
>>> +    = MPSC_QUEUE_INITIALIZER(&vhost_state_change_queue);
>>> +
>>> +struct vhost_state_change {
>>> +    struct mpsc_queue_node node;
>>> +    char ifname[IF_NAME_SZ];
>>> +    uint16_t queue_id;
>>> +    int enable;
>>> +};
>>> +
>>> +static void
>>> +vring_state_changed__(struct vhost_state_change *sc)
>>>  {
>>>      struct netdev_dpdk *dev;
>>>      bool exists = false;
>>> -    int qid = queue_id / VIRTIO_QNUM;
>>> -    bool is_rx = (queue_id % VIRTIO_QNUM) == VIRTIO_TXQ;
>>> -    char ifname[IF_NAME_SZ];
>>> -
>>> -    rte_vhost_get_ifname(vid, ifname, sizeof ifname);
>>> +    int qid = sc->queue_id / VIRTIO_QNUM;
>>> +    bool is_rx = (sc->queue_id % VIRTIO_QNUM) == VIRTIO_TXQ;
>>>
>>>      ovs_mutex_lock(&dpdk_mutex);
>>>      LIST_FOR_EACH (dev, list_node, &dpdk_list) {
>>>          ovs_mutex_lock(&dev->mutex);
>>> -        if (nullable_string_is_equal(ifname, dev->vhost_id)) {
>>> +        if (nullable_string_is_equal(sc->ifname, dev->vhost_id)) {
>>>              if (is_rx) {
>>>                  bool old_state = dev->vhost_rxq_enabled[qid];
>>>
>>> -                dev->vhost_rxq_enabled[qid] = enable != 0;
>>> +                dev->vhost_rxq_enabled[qid] = sc->enable != 0;
>>>                  if (old_state != dev->vhost_rxq_enabled[qid]) {
>>>                      netdev_change_seq_changed(&dev->up);
>>>                  }
>>>              } else {
>>> -                if (enable) {
>>> +                if (sc->enable) {
>>>                      dev->tx_q[qid].map = qid;
>>>                  } else {
>>>                      dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
>>> @@ -4295,11 +4303,41 @@ vring_state_changed(int vid, uint16_t queue_id, int enable)
>>>
>>>      if (exists) {
>>>          VLOG_INFO("State of queue %d ( %s_qid %d ) of vhost device '%s' "
>>> -                  "changed to \'%s\'", queue_id, is_rx == true ? "rx" : "tx",
>>> -                  qid, ifname, (enable == 1) ? "enabled" : "disabled");
>>> +                  "changed to \'%s\'", sc->queue_id, is_rx ? "rx" : "tx",
>>> +                  qid, sc->ifname, sc->enable == 1 ? "enabled" : "disabled");
>>>      } else {
>>> -        VLOG_INFO("vHost Device '%s' not found", ifname);
>>> -        return -1;
>>> +        VLOG_INFO("vHost Device '%s' not found", sc->ifname);
>>> +    }
>>> +}
>>> +
>>> +static void
>>> +netdev_dpdk_vhost_run(const struct netdev_class *netdev_class OVS_UNUSED)
>>> +{
>>> +    struct mpsc_queue_node *node;
>>> +
>>> +    mpsc_queue_acquire(&vhost_state_change_queue);
>>> +    MPSC_QUEUE_FOR_EACH_POP (node, &vhost_state_change_queue) {
>>> +        struct vhost_state_change *sc;
>>> +
>>> +        sc = CONTAINER_OF(node, struct vhost_state_change, node);
>>> +        vring_state_changed__(sc);
>>> +        free(sc);
>>> +    }
>>> +    mpsc_queue_release(&vhost_state_change_queue);
>>> +}
>>> +
>>> +static int
>>> +vring_state_changed(int vid, uint16_t queue_id, int enable)
>>> +{
>>> +    struct vhost_state_change *sc;
>>> +
>>> +    sc = xmalloc(sizeof *sc);
>>> +    if (!rte_vhost_get_ifname(vid, sc->ifname, sizeof sc->ifname)) {
>>> +        sc->queue_id = queue_id;
>>> +        sc->enable = enable;
>>> +        mpsc_queue_insert(&vhost_state_change_queue, &sc->node);
>>> +    } else {
>>> +        free(sc);
>>>      }
>>>
>>>      return 0;
>>> @@ -5766,6 +5804,7 @@ static const struct netdev_class dpdk_vhost_class = {
>>>  static const struct netdev_class dpdk_vhost_client_class = {
>>>      .type = "dpdkvhostuserclient",
>>>      NETDEV_DPDK_CLASS_COMMON,
>>> +    .run = netdev_dpdk_vhost_run,
>>
>> This part is also needed for the 'server' type.
>>
>> However, there is no guarantee that run() will be called within any
>> reasonable time, because we don't have a wait() method that would schedule
>> a wake-up for the main thread.  The main thread wakes up for other reasons,
>> such as stats updates, but that doesn't happen very frequently and we can't
>> rely on it anyway.  For example, the default stats update interval is
>> 5 seconds.
>>
>> The problem is that we can't really iterate over devices and take dev->mutex
>> in the context of a callback.  Meaning, we can't really implement a good
>> wait() method that would trigger some seqno change to wake up the main
>> thread.  We could trigger a global connectivity seqno, I guess.  Not sure
>> how safe that is, since sequence numbers use an internal mutex.
>>
>> OTOH, maybe we can employ the dpdk-watchdog thread to perform vhost
>> housekeeping?  Currently it does a very similar thing for physical ports,
>> i.e. it updates the link state periodically.  Its sleep interval is a bit
>> large, though; maybe we could tweak that, or keep updating phy link states
>> at the current interval but check the vhost event queue a bit more
>> frequently.  Maybe even have something like the backoff in
>> dp_netdev_flow_offload_main().
> 
> Having a dedicated thread seems safer, for the same reason Maxime
> mentioned for the OVS main thread: vhost events may be triggered by a
> malicious guest, and OVS would be blind to some phy link updates.
> 
> 
> Another thought on my patch... is it safe to have an unbounded queue with
> respect to memory consumption?
> Again, a malicious guest could make OVS consume a lot of memory (in my
> tests, OVS complained that the heap increased to 500M).

We're entering gross over-engineering territory, but the solution
might be the following:

An RCU-protected list whose nodes have the following elements:
 - ifname
 - qid
 - atomic_bool enabled
 - atomic_bool seen
 - timestamp

callback:

 1. Get the ifname.
 2. Iterate over the list looking for an existing {ifname, qid} node.
 3. If found:
    - set 'enabled' to the new value -- relaxed
    - set 'seen' to false -- release
    - update the timestamp
 4. If not found, add a new node with seen == false.
 5. Iterate over the list and remove all 'seen' nodes older than a few
    seconds.
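
For illustration, a rough sketch of the callback side using plain C11
atomics as a stand-in for OVS's ovs-atomic and RCU helpers; all names
(vhost_qstate, vhost_qstate_update, the IF_NAME_SZ value) are hypothetical,
and step 5 (reclamation) is only outlined because it would need
RCU-deferred freeing in OVS:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define IF_NAME_SZ 64            /* Placeholder for the real constant. */

struct vhost_qstate {
    struct vhost_qstate *next;   /* Head insertion only, single writer. */
    char ifname[IF_NAME_SZ];
    uint16_t qid;
    atomic_bool enabled;
    atomic_bool seen;
    long long timestamp_ms;      /* Written by the callback only. */
};

/* List head; the vhost-events callback is the only writer. */
static _Atomic(struct vhost_qstate *) vhost_qstate_list;

/* Runs in the vhost-events thread. */
static void
vhost_qstate_update(const char *ifname, uint16_t qid, bool enabled,
                    long long now_ms)
{
    struct vhost_qstate *node;

    /* 2. Look for an existing {ifname, qid} node. */
    for (node = atomic_load_explicit(&vhost_qstate_list,
                                     memory_order_relaxed);
         node; node = node->next) {
        if (node->qid == qid && !strcmp(node->ifname, ifname)) {
            /* 3. Publish the new state, then clear 'seen' with release
             * so a reader that observes seen == false also sees the
             * up-to-date 'enabled'. */
            atomic_store_explicit(&node->enabled, enabled,
                                  memory_order_relaxed);
            atomic_store_explicit(&node->seen, false, memory_order_release);
            node->timestamp_ms = now_ms;
            return;
        }
    }

    /* 4. Not found: insert a new node at the head with seen == false. */
    node = calloc(1, sizeof *node);
    if (!node) {
        return;
    }
    snprintf(node->ifname, sizeof node->ifname, "%s", ifname);
    node->qid = qid;
    atomic_init(&node->enabled, enabled);
    atomic_init(&node->seen, false);
    node->timestamp_ms = now_ms;
    node->next = atomic_load_explicit(&vhost_qstate_list,
                                      memory_order_relaxed);
    /* Release so a concurrent reader never sees a half-built node. */
    atomic_store_explicit(&vhost_qstate_list, node, memory_order_release);

    /* 5. Removing 'seen' nodes older than a few seconds is elided: the
     * unlink is trivial for a single writer, but the free() must be
     * deferred (RCU) until no reader can still hold a pointer. */
}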

OVS thread:

 1. Iterate over the list.
 2. atomic_compare_exchange 'seen': false -> true -- acquire, relaxed
 3. On success, get the new 'enabled' and process -- relaxed
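
A matching sketch of the reader side, reusing struct vhost_qstate and
vhost_qstate_list from the snippet above; in OVS this loop would run
under an RCU read section in the main thread, which is what keeps nodes
from being freed mid-walk:

/* Runs in the OVS main thread; reader only. */
static void
vhost_qstate_run(void)
{
    struct vhost_qstate *node;

    /* 1. Walk the list; acquire pairs with the release publish above. */
    for (node = atomic_load_explicit(&vhost_qstate_list,
                                     memory_order_acquire);
         node; node = node->next) {
        bool expected = false;

        /* 2. Claim the update: 'seen' false -> true.  Acquire on success
         * so the 'enabled' written before the callback's release store
         * is visible here. */
        if (atomic_compare_exchange_strong_explicit(
                &node->seen, &expected, true,
                memory_order_acquire, memory_order_relaxed)) {
            /* 3. Process the new state, e.g. flip vhost_rxq_enabled[]
             * or tx_q[].map as vring_state_changed__() does today. */
            bool enabled = atomic_load_explicit(&node->enabled,
                                                memory_order_relaxed);
            (void) enabled;
        }
    }
}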


The callback is the only writer, so there is no need to lock.  The OVS
thread is an RCU-protected reader.  We will never drain the list, but it
should never contain too many entries.  There is also a possibility of
processing the same event twice, but that should not be a big deal.

Note: I didn't give too much thought to the usage of atomics,
so it needs to be re-checked.

Best regards, Ilya Maximets.