On Thu, Jul 24, 2025 at 12:03:33AM +0000, Hay, Joshua A wrote: > > On Thu, Jul 17, 2025 at 05:21:44PM -0700, Joshua Hay wrote: > > > This series fixes a stability issue in the flow scheduling Tx send/clean > > > path that results in a Tx timeout. > > > > > > The existing guardrails in the Tx path were not sufficient to prevent > > > the driver from reusing completion tags that were still in flight (held > > > by the HW). This collision would cause the driver to erroneously clean > > > the wrong packet thus leaving the descriptor ring in a bad state. > > > > > > The main point of this refactor is to replace the flow scheduling buffer > > > ring with a large pool/array of buffers. The completion tag then simply > > > is the index into this array. The driver tracks the free tags and pulls > > > the next free one from a refillq. The cleaning routines simply use the > > > completion tag from the completion descriptor to index into the array to > > > quickly find the buffers to clean. > > > > > > All of the code to support the refactor is added first to ensure traffic > > > still passes with each patch. The final patch then removes all of the > > > obsolete stashing code. > > > > > > --- > > > v2: > > > - Add a new patch "idpf: simplify and fix splitq Tx packet rollback > > > error path" that fixes a bug in the error path. It also sets up > > > changes in patch 4 that are necessary to prevent a crash when a packet > > > rollback occurs using the buffer pool. > > > > > > v1: > > > https://lore.kernel.org/intel-wired-lan/c6444d15-bc20-41a8-9230- > > [email protected]/T/#maf9f464c598951ee860e5dd24ef8a45 > > 1a488c5a0 > > > > > > Joshua Hay (6): > > > idpf: add support for Tx refillqs in flow scheduling mode > > > idpf: improve when to set RE bit logic > > > idpf: simplify and fix splitq Tx packet rollback error path > > > idpf: replace flow scheduling buffer ring with buffer pool > > > idpf: stop Tx if there are insufficient buffer resources > > > idpf: remove obsolete stashing code > > > > > > .../ethernet/intel/idpf/idpf_singleq_txrx.c | 61 +- > > > drivers/net/ethernet/intel/idpf/idpf_txrx.c | 723 +++++++----------- > > > drivers/net/ethernet/intel/idpf/idpf_txrx.h | 87 +-- > > > 3 files changed, 356 insertions(+), 515 deletions(-) > > > > Hi Joshua, all, > > > > Perhaps it is not followed much anymore, but at least according to [1] > > patches for stable should not be more than 100 lines, with context. > > > > This patch-set is an order of magnitude larger. > > Can something be done to create a more minimal fix? > > > > [1] https://docs.kernel.org/process/stable-kernel-rules.html#stable-kernel- > > rules > > Hi Simon, > > It will be quite difficult to make this series smaller without any negative > side effects. Here's a summary of why certain patches are necessary or where > deferrals are possible (and the side effects): > > "idpf: add support for Tx refillqs in flow scheduling mode" is required to > keep the Tx path lockless. Otherwise, we would have to use locking in hot > path to pass the tags freed in the cleaning thread back to whatever data > struct the sending thread pulls the tags from. > > Without "idpf: improve when to set RE bit logic", the driver will be much > more susceptible to Tx timeouts when sending large packets. > > "idpf: simplify and fix splitq Tx packet rollback error path" is a must to > setup for the new buffer chaining. The previous implementation rolled back > based on ring indexing, which will crash the system if a dma_mapping_error > occurs or we run out of tags (more on that below). The buffer tags (indexes) > pulled for a packet are not guaranteed to be contiguous, especially as > out-of-order completions are processed. > > "idpf: stop Tx if there are insufficient buffer resources" could possibly be > deferred, but that makes the rollback change above even more critical as we > lose an extra layer of protection from running out of tags. It's also one of > the smaller patches and won't make a significant difference in terms of size. > > "idpf: remove obsolete stashing code" could also potentially be deferred but > is 95% removing obsolete code, which leaves a lot of dead code.
Thanks Joshua, if there is justification, which is indeed the case, then I withdraw my suggestion to create a smaller fix.
