Cong Wang wrote: > On Wed, Sep 2, 2020 at 7:22 PM Yunsheng Lin <linyunsh...@huawei.com> wrote: > > > > On 2020/9/3 9:48, Cong Wang wrote: > > > On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin <linyunsh...@huawei.com> > > > wrote: > > >> > > >> On 2020/9/3 8:35, Cong Wang wrote: > > >>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin <linyunsh...@huawei.com> > > >>> wrote: > > >>>> > > >>>> On 2020/9/2 12:41, Cong Wang wrote: > > >>>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <linyunsh...@huawei.com> > > >>>>> wrote: > > >>>>>> > > >>>>>> On 2020/9/2 2:24, Cong Wang wrote: > > >>>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin > > >>>>>>> <linyunsh...@huawei.com> wrote: > > >>>>>>>> > > >>>>>>>> Currently there is concurrent reset and enqueue operation for the > > >>>>>>>> same lockless qdisc when there is no lock to synchronize the > > >>>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in > > >>>>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may > > >>>>>>>> cause > > >>>>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has > > >>>>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a > > >>>>>>>> skb with a larger queue_mapping after the corresponding qdisc is > > >>>>>>>> reset, and call hns3_nic_net_xmit() with that skb later. > > >>>>>>> > > >>>>>>> Can you be more specific here? Which call path requests a smaller > > >>>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly > > >>>>>>> we already have a synchronize_net() there. > > >>>>>> > > >>>>>> When the netdevice is in active state, the synchronize_net() seems to > > >>>>>> do the correct work, as below: > > >>>>>> > > >>>>>> CPU 0: CPU1: > > >>>>>> __dev_queue_xmit() > > >>>>>> netif_set_real_num_tx_queues() > > >>>>>> rcu_read_lock_bh(); > > >>>>>> netdev_core_pick_tx(dev, skb, sb_dev); > > >>>>>> . > > >>>>>> . dev->real_num_tx_queues = > > >>>>>> txq; > > >>>>>> . . > > >>>>>> . . > > >>>>>> . synchronize_net(); > > >>>>>> . . > > >>>>>> q->enqueue() . > > >>>>>> . . > > >>>>>> rcu_read_unlock_bh() . > > >>>>>> qdisc_reset_all_tx_gt > > >>>>>> > > >>>>>> > > >>>>> > > >>>>> Right. > > >>>>> > > >>>>> > > >>>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a > > >>>>>> problem > > >>>>>> too. > > >>>>>> > > >>>>>> The problem we hit is as below: > > >>>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is > > >>>>>> called > > >>>>>> to deactive the netdevice when user requested a smaller queue num, > > >>>>>> and > > >>>>>> txq->qdisc is already changed to noop_qdisc when calling > > >>>>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the > > >>>>>> function > > >>>>>> netif_set_real_num_tx_queues() does not help here. > > >>>>> > > >>>>> How could qdisc still be running after deactivating the device? > > >>>> > > >>>> qdisc could be running during the device deactivating process. > > >>>> > > >>>> The main process of changing channel number is as below: > > >>>> > > >>>> 1. dev_deactivate() > > >>>> 2. hns3 handware related setup > > >>>> 3. netif_set_real_num_tx_queues() > > >>>> 4. netif_tx_wake_all_queues() > > >>>> 5. dev_activate() > > >>>> > > >>>> During step 1, qdisc could be running while qdisc is resetting, so > > >>>> there could be skb left in the old qdisc(which will be restored back to > > >>>> txq->qdisc during dev_activate()), as below: > > >>>> > > >>>> CPU 0: CPU1: > > >>>> __dev_queue_xmit(): dev_deactivate_many(): > > >>>> rcu_read_lock_bh(); qdisc_deactivate(qdisc); > > >>>> q = rcu_dereference_bh(txq->qdisc); . > > >>>> netdev_core_pick_tx(dev, skb, sb_dev); . > > >>>> . > > >>>> . > > >>>> rcu_assign_pointer(dev_queue->qdisc, qdisc_default); > > >>>> . . > > >>>> . . > > >>>> . . > > >>>> . . > > >>>> q->enqueue() . > > >>> > > >>> > > >>> Well, like I said, if the deactivated bit were tested before > > >>> ->enqueue(), > > >>> there would be no packet queued after qdisc_deactivate().
Trying to unwind this through git history :/ Original code had a test_bit in dev_xmit_skb(), if (q->flags & TCQ_F_NOLOCK) { if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) { __qdisc_drop(skb, &to_free); rc = NET_XMIT_DROP; } else { rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK; __qdisc_run(q); } if (unlikely(to_free)) kfree_skb_list(to_free); return rc; } So we would never enqueue something on top of a deactivated qdisc. And to ensure we don't have any in-flight enqueues we have to swap qdiscs, wait a grace period, and then reset the qdisc. That _should_ be OK. But, I'm still not entirely sure how you got here. In the drivers I did I always stop the queue before messing with these things with netif_tx_stop_queue(). Then we really should not get these packets into the driver. In sch_direct_xmit(): if (likely(skb)) { HARD_TX_LOCK(dev, txq, smp_processor_id()); if (!netif_xmit_frozen_or_stopped(txq)) skb = dev_hard_start_xmit(skb, dev, txq, &ret); HARD_TX_UNLOCK(dev, txq); } else { if (root_lock) spin_lock(root_lock); return true; } Maybe I missed something? Does your driver use the netif_tx_stop/start APIs? > > >> > > >> Only if the deactivated bit testing is also protected by qdisc->seqlock? > > >> otherwise there is still window between setting and testing the > > >> deactivated bit. > > > > > > Can you be more specific here? Why testing or setting a bit is not atomic? > > > > testing a bit or setting a bit separately is atomic. > > But testing a bit and setting a bit is not atomic, right? > > > > cpu0: cpu1: > > testing A bit > > setting A bit . > > . . > > . qdisc enqueuing > > qdisc reset > > > > Well, this was not a problem until commit d518d2ed8640c1cbbb. > Prior to that commit, qdsic can still be scheduled even with this > race condition, that is, the packet just enqueued after resetting can > still be dequeued with qdisc_run(). > > It is amazing to see how messy the lockless qdisc is now. > > Thanks.