On Thu, Dec 18, 2025 at 08:51:41PM +0100, Harald Mommer wrote:
> Hello,
> 
> On 12/14/25 16:25, Francesco Valla wrote:
> 
> >>>> +/* Compare with m_can.c/m_can_echo_tx_event() */
> 
> To the question whether some comments were originally more personal notes: 
> Yes!
> 
> This applies especially to the ones stating which already accepted 
> driver(s) were consulted to get an idea of how things are expected to be done. 
> Most of those comments should have served their purpose by now.
> 
> >>>> +static int virtio_can_read_tx_queue(struct virtqueue *vq)
> >>>> +{
> >>>> +        struct virtio_can_priv *can_priv = vq->vdev->priv;
> >>>> +        struct net_device *dev = can_priv->dev;
> >>>> +        struct virtio_can_tx *can_tx_msg;
> >>>> +        struct net_device_stats *stats;
> >>>> +        unsigned long flags;
> >>>> +        unsigned int len;
> >>>> +        u8 result;
> >>>> +
> >>>> +        stats = &dev->stats;
> >>>> +
> >>>> +        /* Protect list and virtio queue operations */
> >>>> +        spin_lock_irqsave(&can_priv->tx_lock, flags);
> >>>
> >>> The section below seems a pretty big one to protect behind a spin lock. 
> >>>
> >>
> >> How can I split it? 
> >>
> > 
> > The question here is: what needs to be protected? As far as I can tell, the
> > only entity needing some kind of locking here is the queue, while both the
> > ida_* and tx_inflight operations are already covered (the former by
> > design [1], the latter because it is implemented using an atomic).
> > 
> > If I'm not wrong (but I might be, so please double check) this can be
> > limited to:
> > 
> >     /* Protect queue operations */
> >     scoped_guard(spinlock_irqsave, &priv->tx_lock)
> >             err = virtqueue_add_sgs(vq, sgs, 1u, 1u, can_tx_msg, GFP_ATOMIC);
> > 
> > 
> > Maybe the whole locking pattern is a leftover from a previous version, 
> > where a list of TX messages was kept?
> 
> 1.) There is virtqueue_get_buf() => virtqueue_get_buf_ctx(), and it carries 
> the comment:
> " * Caller must ensure we don't call this with other virtqueue
>   * operations at the same time (except where noted)."
> 
> Are we safe when a queue operation is done in parallel in 
> virtio_can_start_xmit()?
> Locking may or may not be necessary here; I cannot tell at the moment.

You need a way to make sure add buf and get buf do not run
in parallel. virtio does not provide this protection itself;
it is up to the caller.
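
In practice that means both sides have to take the same lock: the
scoped_guard() around virtqueue_add_sgs() suggested above on the xmit
side, and an equivalent critical section around virtqueue_get_buf() in
virtio_can_read_tx_queue(). A minimal sketch of the xmit side (names
follow the snippets above, but this is only an illustration, not the
actual patch code):

        /* Protect the virtqueue against the completion path */
        spin_lock_irqsave(&priv->tx_lock, flags);
        err = virtqueue_add_sgs(vq, sgs, 1u, 1u, can_tx_msg, GFP_ATOMIC);
        if (!err)
                virtqueue_kick(vq);
        spin_unlock_irqrestore(&priv->tx_lock, flags);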

> 
> 2.) There was once a "list_del(&can_tx_msg->list);" in the code here.
> 
> If a list_add_tail() or a list_del() had been executed in 
> virtio_can_start_xmit() at the same time, we would have had a garbled 
> linked list.
> 
> That linked list no longer exists in the newer code base.
> 
> => it could be that the lock is not needed at all any more
> => it could be that we only have to protect the queue operations now, which 
> would shorten the locking time and simplify the code
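
If that is the case, the completion side could be narrowed in the same
way, holding the lock only around the queue operation. A sketch only,
under the assumption that the result handling touches no state shared
with the xmit path:

        spin_lock_irqsave(&can_priv->tx_lock, flags);
        can_tx_msg = virtqueue_get_buf(vq, &len);
        spin_unlock_irqrestore(&can_priv->tx_lock, flags);

        if (!can_tx_msg)
                return 0; /* No more data */

        /* result checking and statistics can then run unlocked */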
> 
> >>>> +
> >>>> +        can_tx_msg = virtqueue_get_buf(vq, &len);
> >>>> +        if (!can_tx_msg) {
> >>>> +                spin_unlock_irqrestore(&can_priv->tx_lock, flags);
> >>>> +                return 0; /* No more data */
> >>>> +        }
> >>>> +
> >>>> +        if (unlikely(len < sizeof(struct virtio_can_tx_in))) {
> >>>> +                netdev_err(dev, "TX ACK: Device sent no result code\n");
> >>>> +                result = VIRTIO_CAN_RESULT_NOT_OK; /* Keep things going 
> >>>> */
> >>>> +        } else {
> >>>> +                result = can_tx_msg->tx_in.result;
> >>>> +        }
> >>>> +
> > 
> > (snip)
> > 
> 
> >>>> +        if (!priv->rpkt) {
> >>>> +                virtio_can_del_vq(vdev);
> >>>> +                goto on_failure;
> >>>> +        }
> >>>> +        virtio_can_populate_rx_vq(vdev);
> >>>> +
> >>>> +        err = register_virtio_can_dev(dev);
> >>>> +        if (err) {
> >>>> +                virtio_can_del_vq(vdev);
> >>>> +                goto on_failure;
> >>>> +        }
> >>>> +
> >>>> +        napi_enable(&priv->napi);
> >>>> +        napi_enable(&priv->napi_tx);
> >>>
> >>> Most of the existing drivers enable the napi(s) during the open() phase,
> >>> IIUC to avoid scheduling napi operations for devices that might never
> >>> get used. But here maybe there is a specific reason to do it this way?
> >>>
> >>
> >> I have no idea. I moved it to open() and something stopped working. I
> >> am investigating it.
> >>
> > 
> > On second thought, it may be wiser to have the NAPIs enabled on probe,
> > so that incoming messages are dropped even when the interface is brought down.
> 
> It's been a while, but I wanted to drop messages so that a 3 hours old 
> cooling water temperature is not left lurking in some virtio message buffer 
> and misinterpreted as a current value. This may have the disadvantage of 
> causing load when the driver is not opened. But I see you also thought about 
> 3 hours old outdated values now, which may cause trouble.
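
For reference, the pattern the existing drivers follow is roughly the one
below (a sketch only; the function names and the exact priv layout are
assumptions, not taken from the patch). The trade-off discussed above
remains: with NAPI enabled only here, anything the device queued while
the interface was down is not consumed until the next open().

        static int virtio_can_open(struct net_device *dev)
        {
                struct virtio_can_priv *priv = netdev_priv(dev);
                int err;

                err = open_candev(dev);
                if (err)
                        return err;

                napi_enable(&priv->napi);
                napi_enable(&priv->napi_tx);
                netif_start_queue(dev);

                return 0;
        }

        static int virtio_can_stop(struct net_device *dev)
        {
                struct virtio_can_priv *priv = netdev_priv(dev);

                netif_stop_queue(dev);
                napi_disable(&priv->napi_tx);
                napi_disable(&priv->napi);
                close_candev(dev);

                return 0;
        }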
> 
> > 
> > (last snip)
> > 
> > 
> > While stress testing this, I noticed that flooding the virtio-can
> > interface with packets leads to a hang of the interface itself.
> > I am seeing this by issuing, at the host side:
> > 
> >     while true; do cansend can0 123#00; done
> > 
> > with:
> > 
> >  - QEMU: the tip of the master branch plus [2]
> >  - vhost-device: the tip of the main branch
> > 
> > and the following QEMU invocation:
> > 
> > qemu-system-x86_64 -serial mon:stdio \
> >     -m 2G -smp 2 \
> >     -kernel $(pwd)/BUILD.bin/arch/x86/boot/bzImage \
> >     -initrd /home/francesco/SRC/LINUX_KERNEL/initramfs.gz \
> >     -append "loglevel=7 console=ttyS0" \
> >     -machine memory-backend=pc.ram \
> >     -object memory-backend-file,id=pc.ram,size=2G,mem-path=/tmp/pc.ram,share=on \
> >     -chardev socket,id=can0,path=/tmp/sock-can0 \
> >     -device vhost-user-can-pci,chardev=can0
> 
> I had this problem when I enabled the experimental "late TX ACK" feature on 
> the device side, instead of sending the TX ACK early, i.e. immediately, even 
> when the CAN message had not yet been transmitted on the (physical) bus. In 
> that case I relied on no ACK message (own sent message received) being lost; 
> otherwise I ran out of messages in the transmit queue, with everything 
> waiting until doomsday for ACKs which would never come.
> 
> The problem was that somewhere in the Linux stack those acknowledgements got 
> lost under heavy load on the device side. The workaround was to ACK the TX 
> message early in the virtio device (meaning the message is put back into the 
> used queue immediately when received). But this is a device thing: the device 
> MUST put ALL messages back into the used queue, not forgetting about some 
> under whatever circumstances, otherwise the avail queue will become empty 
> forever.
> 
> Besides that, I could stress the code however I wanted and it did not stop. 
> But that code was different from what I see now, and the testing environment 
> was also a different one.
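
That matches what the usual flow-control pattern would predict. Assuming
the driver stops the netdev queue when tx_inflight reaches some maximum
number of TX slots and only wakes it when a TX ACK comes back (a sketch
under that assumption; tx_max_inflight is an invented name, not a field
from the patch), every ACK that never returns permanently consumes one
slot, and enough lost ACKs stall the interface for good. That would also
explain why ip link down/up does not recover it unless the counters and
buffers are reset there.

        /* xmit side: account for the message before queuing it */
        if (atomic_inc_return(&priv->tx_inflight) >= priv->tx_max_inflight)
                netif_stop_queue(dev);  /* no free slot until an ACK returns */

        /* completion side: only a returned TX ACK frees the slot again */
        atomic_dec(&priv->tx_inflight);
        if (netif_queue_stopped(dev))
                netif_wake_queue(dev);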
> 
> > Restarting the interface (i.e.: ip link set down and the up) does not
> > fix the situation.
> > 
> > I'll try to do some more testing during the next days.
> 
> Other than fixing the swapped feature flag values for the next release 
> internally, I have not yet had the chance to look deeply into all those 
> changes and really think about them in depth.
> 

