On Thu, Apr 02, 2026 at 08:50:55PM -0400, Aaron Tomlin wrote:
> On Thu, Apr 02, 2026 at 11:09:40AM +0200, Sebastian Andrzej Siewior wrote:
> > On 2026-04-01 16:58:22 [-0400], Aaron Tomlin wrote:
> > > Hi Sebastian,
> >
> > Hi,
> >
> > > Thank you for taking the time to document the "managed_irq" behaviour;
> > > it is immensely helpful. You raise a highly pertinent point regarding
> > > the potential proliferation of "isolcpus=" flags. It is certainly a
> > > situation that must be managed carefully to prevent every subsystem
> > > from demanding its own bit.
> > >
> > > To clarify the reasoning behind introducing "io_queue" rather than
> > > strictly relying on managed_irq:
> > >
> > > The managed_irq flag belongs firmly to the interrupt subsystem. It
> > > dictates whether a CPU is eligible to receive hardware interrupts
> > > whose affinity is managed by the kernel. Whilst many modern block
> > > drivers use managed IRQs, the block layer multi-queue mapping
> > > encompasses far more than just interrupt routing. It maps logical
> > > queues to CPUs to handle I/O submission, software queues, and
> > > crucially, poll queues, which do not utilise interrupts at all.
> > > Furthermore, there are specific drivers that do not use the managed
> > > IRQ infrastructure but still rely on the block layer for queue
> > > distribution.
> >
> > Could you tell block which queue maps to which CPU at /sys/block/$$/mq/
> > level? Then you have one queue going to one CPU.
> > Then the drive could request one or more interrupts, managed or not.
> > For managed you could specify a CPU mask which you desire to occupy.
> > You have the case where
> > - you have more queues than CPUs
> >   - use all of them
> >   - use fewer
> > - fewer queues than CPUs
> >   - a queue mapped to more than one CPU, in case one goes down or
> >     becomes unavailable
> >   - a queue mapped to one CPU
> >
> > Ideally you solve this at one level so that the device(s) can request
> > fewer queues than CPUs if told so, without patching each and every
> > driver.
> >
> > This should give you the freedom to isolate CPUs and decide at boot
> > time which CPUs get I/O queues assigned. At run time you can tell
> > which queues go to which CPUs. If you shut down a queue, the interrupt
> > remains but does not get any I/O requests assigned, so no problem. If
> > the CPU goes down, same thing.
> >
> > I am trying to come up with a design here which I haven't found so
> > far. But I might be late to the party and everyone else is fully
> > aware.
> >
> > > If managed_irq were solely relied upon, the IRQ subsystem would
> > > successfully keep hardware interrupts off the isolated CPUs, but the
> > > block
> >
> > The managed_irqs can't be influenced by userland. The CPUs are
> > auto-distributed.
> >
> > > layer would still blindly map polling queues or non-managed queues
> > > to those same isolated CPUs. This would force isolated CPUs to
> > > process I/O submissions or handle polling tasks, thereby breaking
> > > the strict isolation.
> > >
> > > Regarding the point about the networking subsystem, it is a very
> > > valid comparison. If the networking layer wishes to respect isolcpus
> > > in the future, adding a net flag would indeed exacerbate the bit
> > > proliferation.
> >
> > Networking could also have different cases, like adding an RX filter
> > and having the HW put packets based on it into a dedicated queue. But
> > also in this case I would like to have the freedom to decide which
> > isolated CPUs should receive interrupts/traffic and which don't.
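
For illustration only, here is a very rough sketch of the mapping side
being discussed. The helper name and the "allowed" mask are invented for
this example, and the real blk_mq_map_queues() also takes topology into
account; the point is merely that the queue map (which includes poll
queues that never raise an interrupt) is built independently of interrupt
delivery and can, in principle, be restricted to an arbitrary CPU mask:

#include <linux/blk-mq.h>
#include <linux/cpumask.h>

/* Invented helper, simplified: spread a map's queues over "allowed" CPUs. */
static void example_map_queues(struct blk_mq_queue_map *qmap,
			       const struct cpumask *allowed)
{
	unsigned int cpu, queue = 0;

	/* Every possible CPU still needs some queue for submission. */
	for_each_possible_cpu(cpu)
		qmap->mq_map[cpu] = qmap->queue_offset;

	/* Round-robin the hardware queues across the allowed CPUs only. */
	for_each_cpu(cpu, allowed) {
		qmap->mq_map[cpu] = qmap->queue_offset + queue;
		queue = (queue + 1) % qmap->nr_queues;
	}
}

With more CPUs than queues, several CPUs end up sharing a queue; with
fewer allowed CPUs than queues, the surplus queues simply stay idle, which
matches the cases listed above.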
> > > For the present time, retaining io_queue seems the most prudent
> > > approach to ensure that block queue mapping remains semantically
> > > distinct from interrupt delivery. This provides an immediate and
> > > clean architectural boundary. However, if the consensus amongst the
> > > maintainers suggests that this is too granular, alternative
> > > approaches could certainly be considered for the future. For
> > > instance, a broader, more generic flag could be introduced to
> > > encompass both block and future networking queue mappings.
> > > Alternatively, if semantic conflation is deemed acceptable, the
> > > existing managed_irq housekeeping mask could simply be overloaded
> > > within the block layer to restrict all queue mappings.
> > >
> > > Keeping the current separation appears to be the cleanest solution
> > > for this series, but your thoughts, and those of the wider
> > > community, on potentially migrating to a consolidated generic flag
> > > in the future would be very much welcomed.
> >
> > I just don't like introducing yet another boot argument, making it a
> > boot constraint, while in my naive view this could be managed to some
> > degree via sysfs as suggested above.
>
> Hi Sebastian,
>
> I believe it would be more prudent to defer to Thomas Gleixner and Jens
> Axboe on this matter.
>
> Indeed, I am entirely sympathetic to your reluctance to introduce yet
> another boot parameter, and I concur that run-time configurability
> represents the ideal scenario for system tuning.
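
To make the "overloading" alternative above concrete, the difference
inside the block layer would roughly come down to which housekeeping mask
is consulted. This is only a sketch; HK_TYPE_IO_QUEUE is an assumed name
for what the series proposes and does not exist in today's tree, whereas
HK_TYPE_MANAGED_IRQ does:

#include <linux/sched/isolation.h>
#include <linux/cpumask.h>

/* Invented helper: which CPUs may the block layer map queues onto? */
static const struct cpumask *example_blk_mq_allowed_cpus(void)
{
	/*
	 * Option A (a dedicated flag, as in this series): keep queue
	 * mapping semantically separate from interrupt delivery.  The
	 * constant name is assumed; it is not in today's tree:
	 *
	 *	return housekeeping_cpumask(HK_TYPE_IO_QUEUE);
	 *
	 * Option B (the conflation mentioned above): reuse the existing
	 * managed_irq mask for the queue map as well, avoiding a new
	 * boot flag at the cost of mixing the two meanings.
	 */
	return housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
}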
`io_queue` introduces the cost of potential failure when a CPU is
offlined, so how can it replace the existing `managed_irq`?

> At present, a device such as an NVMe controller allocates its hardware
> queues and requests its interrupt vectors during the initial device
> probe phase. The block layer calculates the optimal queue-to-CPU
> mapping based on the system topology at that precise moment. Altering
> this mapping dynamically at runtime via sysfs would be an exceptionally
> intricate undertaking. It would necessitate freezing all active
> operations, tearing down the physical hardware queues on the device,
> renegotiating the interrupt vectors with the PCI subsystem, and finally
> reconstructing the entire queue map.
>
> Furthermore, the proposed io_queue boot parameter successfully achieves
> the objective of avoiding driver-level modifications. By applying the
> housekeeping mask constraint centrally within the core block layer
> mapping helpers, all multiqueue drivers automatically inherit the CPU
> isolation boundaries without requiring a single line of code to be
> changed within the individual drivers themselves.
>
> Because the hardware queue count and CPU alignment must be calculated
> as the device initialises, a reliable mechanism is required to inform
> the block layer of which CPUs are strictly isolated before the probe
> sequence commences. This is precisely why integrating with the existing
> boot-time housekeeping infrastructure is currently the most viable and
> robust solution.
>
> Whilst a fully dynamic sysfs-driven reconfiguration architecture would
> be great, it would represent a substantial paradigm shift for the block
> layer. For the present time, the io_queue flag resolves the immediate
> and severe latency issues experienced by users with isolated CPUs,
> employing an established and safe methodology.

I'd suggest documenting the exact existing problem, because `managed_irq`
should already cover it in a best-effort way, so people know how to
choose between the two parameters.

Thanks,
Ming
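
To illustrate the probe-time constraint described above, here is a rough,
heavily simplified sketch of the ordering in a typical PCI storage
driver. The function is invented and a real probe does considerably more;
the series would presumably consult its proposed io_queue mask at the
point where the managed_irq mask is read here:

#include <linux/pci.h>
#include <linux/cpumask.h>
#include <linux/minmax.h>
#include <linux/sched/isolation.h>

/* Invented, heavily simplified probe-style function. */
static int example_probe(struct pci_dev *pdev)
{
	const struct cpumask *hk = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
	unsigned int nr_queues;
	int nr_vecs;

	/*
	 * The queue count is derived from the CPUs we are allowed to use,
	 * so the isolation mask must already be known here -- long before
	 * any sysfs knob could be consulted at run time.
	 */
	nr_queues = min(num_possible_cpus(), cpumask_weight(hk));

	/* Interrupt vectors are negotiated with PCI once, at probe time. */
	nr_vecs = pci_alloc_irq_vectors(pdev, 1, nr_queues,
					PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
	if (nr_vecs < 0)
		return nr_vecs;

	/* ... allocate tag sets and build the queue map from nr_vecs ... */
	return 0;
}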

