I'm sponsoring this fasttrack for Scott Carter. It is set to time out
on 10/15/2008. Because this is a relatively minor amendment to a
larger, approved case, PSARC/2004/253 "Advanced DDI Interrupt
Functions", I think it qualifies as a fasttrack. As the discussion
progresses, extending the timer or even promoting it to a full case
may be considered.

In addition to this proposal, three man page files can be found in the
case directory:

    ddi_cb_register.9f.txt
    ddi_intr_set_nreq.9f.txt
    ndi_irm_create.9f.txt

-Artem

Template Version: @(#)sac_nextcase %I% %G% SMI
This information is Copyright 2008 Sun Microsystems

1. Introduction
    1.1. Project/Component Working Name:
         Interrupt Resource Management
    1.2. Name of Document Author/Supplier:
         Author: Scott Carter
    1.3  Date of This Document:
         07 October, 2008

4. Technical Description

4.1 Project Description

This project delivers the resource management feature originally
defined in PSARC/2004/253 "Advanced DDI Interrupt Functions" with
minor changes (see 4.1.3 for details). It provides a mechanism for
device drivers to get more interrupt vectors (and increased
performance). When a driver participates in the new feature, its
number of available interrupts becomes dynamic and can increase or
decrease. The goal is to maximize utilization of interrupt vectors in
a fair manner, and to rebalance the allocations whenever devices are
added to or removed from the system.

4.1.1 Definition

The project defines new DDI interfaces for device drivers to register
and unregister a generic callback handler. If a driver has registered
a handler, then it can be notified when its number of interrupts has
been increased or decreased. The number of interrupt vectors a device
driver wants is set by the initial number it attempts to allocate
through ddi_intr_alloc(). A driver can also explicitly adjust this
number at later times with a new DDI function that is introduced by
this project.

The project also defines new NDI interfaces for nexus drivers to
define the supplies of interrupt vectors that are available in the
system. A supply can be associated with a specific IO bus (e.g. PCIe
root complex) or it can be a global supply shared by all devices.

Along with the new DDI and NDI interfaces, the project also delivers
an MDB debugging module with two new debug macros.
One macro is used to display statistics on all of the defined
interrupt supplies, and the other displays statistics about how a
supply is divided up between the drivers that map to it.

4.1.2 Motivation, Goals, and Requirements

Currently, interrupt vectors are given to device drivers in very small
numbers. This is to avoid exhausting a system's supply before all the
devices have been attached, and to keep interrupt vectors in reserve
for later hotplugs. Interrupts are given so conservatively because
there is no mechanism to take them back later unless a driver
detaches.

Fewer interrupt vectors means less IO performance. More interrupt
vectors means more parallelism for handling interrupt conditions. So
the motivation for this project is to increase IO performance. The
goal of this project is to maximize the allocation of interrupts given
to each device, but in a way that is fair and balanced across the full
set of attached devices.

The requirement of this project is:

- Provide a mechanism for drivers to get more interrupts.

The feature is optional, so drivers that don't use it still work even
if the system has implemented support for the feature. Conversely,
drivers that do use it also work if the system does not implement the
support.

A full implementation of the platform level support will be provided
by this project for PCIe IO bus drivers on SPARC.

4.1.3 Changes From the Previous Case

This project is a micro change to the approved case PSARC/2004/253
"Advanced DDI Interrupt Functions". The existing DDI interrupt
interfaces are not changed. But some new DDI interrupt interfaces are
added, which extend the capabilities of the existing interfaces.

The interfaces described in section 6.3 of the specifications for
PSARC/2004/253 "Advanced DDI Interrupt Functions" have already been
approved, but those resource management interfaces have never been
implemented. This project is providing the implementation of those
interfaces, with minor changes.
The changes include:

- Generalization of the proposed callback interfaces, so they can be
  shared by both IRM (Interrupt Resource Management) and any future
  projects.
- Removal of the semantics originally proposed to temporarily disable
  callbacks.
- An additional function (ddi_intr_set_nreq()) for explicitly setting
  how many interrupt vectors a device driver requests.

The interface described in PSARC/2007/453 "MSI-X interrupt limit
override" provides a temporary workaround for device drivers to
request more interrupts, and that workaround is currently in use. This
project supersedes the functionality of the workaround, and drivers
using it should ultimately be converted. In the meantime, the
workaround is preserved and still works in conjunction with this
project.

4.1.4 Competitive Analysis

This project is important for the overall IO performance on all of
Sun's platforms in order to remain competitive in the marketplace.
Modern IO bus technologies support large numbers of interrupts. A
single PCI or PCIe device could use up to 32 MSI interrupts, or 2048
MSI-X interrupts. Without this project, Solaris gives at most 2
interrupts to each device. We are limiting the IO throughput of our
systems by not having this feature, and by not giving some devices
nearly as many interrupt vectors as they can support and utilize.

Current Sun platforms have limited numbers of interrupts to give to
the devices. SPARC PCIe platforms have 256 interrupts per root
complex, and x64 systems have only 256 interrupts per processor. But
we have future platforms in development with many thousands of
interrupt vectors available. We need this feature so that we can
achieve the full potential of advanced MSI and MSI-X devices on such
platforms. In the meantime, we already have some devices that could
benefit from getting more than the current limit of 2 interrupt
vectors.

4.2 Technical Description

4.2.1 Architecture

The basic strategy is to organize representations of each supply of
interrupt vectors in the system with the drivers that consume them,
and to compute the optimal way to divide each supply amongst those
devices. When a driver is attached or detached, the computations are
performed to rebalance the system. Drivers can also explicitly change
how many interrupts they want in response to load. The computations
seek to maximize the use of the system's interrupts and derive fair
allocations for each device. Drivers are notified using a callback
mechanism when the computations result in giving the driver more or
fewer interrupts.

The main components of this project are:

- NDI interfaces for nexus drivers to create or destroy the
  descriptions of individual supplies of interrupt vectors.
- DDI interfaces for drivers to register or unregister callback
  mechanisms. The callbacks notify them of changes to their interrupt
  availability.
- DDI interface for drivers to explicitly set their requested number
  of interrupts dynamically.
- A mechanism to map individual devices to interrupt supplies.
- Background threads which keep the allocations of interrupts from
  each supply to each device optimized and balanced.
- Routines to initialize the new feature at boot time.
- An MDB debugging module to display the status of how the interrupt
  subsystem has been balanced.

Each supply of interrupt vectors in the system is described by a data
structure (ddi_irm_pool_t), representing one pool of interrupts that
can be shared by multiple devices. For each device that maps to an
interrupt pool, a data structure (ddi_irm_req_t) represents how many
interrupts it wants versus how many it received. Nexus drivers create
the interrupt pools, and then they map individual devices to them
through the existing bus nexus driver INTROP feature.
The request data structures are created internally and associated with
the interrupt pools when devices are attached and mapped to a pool.

Existing device drivers do not benefit, and they continue to get only
the same small number of interrupt vectors that they currently
receive. In order to benefit, they must be modified with optional
enhancements so that they can participate. To participate, they must
provide a new callback mechanism so that the system can notify them
when they have been given more or fewer interrupt vectors.

A modified driver first registers a generic callback handler so it can
receive notifications of interrupt availability changes. Then it calls
ddi_intr_alloc() to request an initial number of interrupt vectors. If
the system has the necessary support (from nexus drivers), then it
associates the requesting driver with an interrupt pool. Through this
association, the system will compute whether the driver gets more
interrupts.

A driver may initially receive all the interrupts it requested, or it
may receive callbacks at a later time (post attach(9F)) notifying it
when more interrupt vectors are available. More could be available if
other devices were removed, or if workload changes cause other drivers
to reduce their requests. But to qualify for additional interrupts, a
driver must also yield and call ddi_intr_free() when necessary. This
may occur if another device is inserted, or if changes in workload
cause other drivers to need more.

This project introduces the interfaces for nexus drivers to describe
the interrupt pools, and for device drivers to engage with those
interrupt pools to possibly receive more resources. It also delivers
the additional implementation behind the scenes that performs the
related computations when necessary.

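As a rough illustration of the balancing described in 4.2.1, the
following user-space sketch models one interrupt pool shared by
several requests. It is not the IRM implementation. The names
toy_req_t and toy_rebalance(), and the round-robin policy itself, are
assumptions chosen only to show what a fair division of a limited
supply could look like; the real policy may differ.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for one driver's request. This is NOT the
 * real ddi_irm_req_t; the field names are assumptions.
 */
typedef struct toy_req {
	int nreq;	/* vectors the driver asked for */
	int navail;	/* vectors granted by the toy balancer */
} toy_req_t;

/*
 * Divide `total` vectors among `n` requests. Each pass hands one
 * vector to every request that is still unsatisfied, so no request
 * receives more than it asked for and unsatisfied requests never
 * differ by more than one vector. Returns the number of vectors
 * left unassigned.
 */
int
toy_rebalance(toy_req_t *reqs, size_t n, int total)
{
	int left = total;
	int progress = 1;
	size_t i;

	assert(total >= 0);

	/* Start from scratch on every rebalance. */
	for (i = 0; i < n; i++)
		reqs[i].navail = 0;

	/* Stop when the supply is empty or every request is met. */
	while (left > 0 && progress) {
		progress = 0;
		for (i = 0; i < n && left > 0; i++) {
			if (reqs[i].navail < reqs[i].nreq) {
				reqs[i].navail++;
				left--;
				progress = 1;
			}
		}
	}
	return (left);
}
```

For example, with a supply of 12 vectors and requests of 8, 2 and 16,
the sketch grants 5, 2 and 5 and leaves nothing in reserve. If the
third request is later lowered to 1, rerunning it grants 8, 2 and 1
with one vector left over, which mirrors how a driver reducing its
request can free vectors for the others.
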
4.2.2 Interfaces

4.2.2.1 Exported Interfaces

Interface                           Stability  Comments
-----------------------------------+----------+-------------------------
ndi_irm_create()                    Committed  Create an IRM pool
ndi_irm_destroy()                   Committed  Destroy an IRM pool
DDI_INTROP_GETPOOL                  Committed  Get IRM pool INTROP
ddi_cb_register()                   Committed  Install callback handler
ddi_cb_unregister()                 Committed  Remove callback handler
ddi_cb_action_t                     Committed  Callback action type
ddi_cb_flags_t                      Committed  Callback flags type
ddi_cb_func_t                       Committed  Callback function type
ddi_cb_handle_t                     Committed  Callback handle type
ddi_intr_set_nreq()                 Committed  Set IRM request size
-----------------------------------+----------+-------------------------

4.2.2.2 Imported Interfaces

Interface                           Stability  Comments
-----------------------------------+----------+-------------------------
ddi_intr_alloc()                    Committed  Added hooks into IRM
ddi_intr_free()                     Committed  Added hooks into IRM
-----------------------------------+----------+-------------------------

4.2.2.3 Removed Interfaces

These interfaces were previously approved, but never implemented. They
are now superseded by the interfaces of this project.

Interface                           Stability  Comments
-----------------------------------+----------+-------------------------
ddi_intr_register_management_cb()   Committed  Register callback
ddi_intr_unregister_management_cb() Committed  Unregister callback
ddi_intr_enable_management_cb()     Committed  Enable callback
ddi_intr_disable_management_cb()    Committed  Disable callback
-----------------------------------+----------+-------------------------

5. References

This Project Implements, Extends, or Replaces these Projects:

- PSARC/2004/253: Advanced DDI Interrupt Functions
- PSARC/2007/453: MSI-X interrupt limit override
  (A future RFE will remove the workaround once all consumers have
  been converted to use this project.)

Consumers of this Project:

- PSARC/2008/181: Solaris Hotplug Framework
  (Uses this project's generic callback mechanism)
- IRM Enhancements for Atlas/Neptune Driver
  (Convert from existing workaround to use this project)
- x86 APIC Expansion and IRM Support
  (Provide interrupt pool definitions on x86. The scope of this
  project is to only deliver interrupt pools on SPARC.)

Design and Implementation Specification of this Project:

- http://pciexpress.sfbay/intr/docs/irm/irm_design.txt

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open