I'm sponsoring this fast track for Govinda Tatti and the PCI team. This project introduces new DDI interfaces, and changes PCITool's command line syntax in an incompatible way. However, the change is intended to correct an incompatibility with respect to CLIP, and the original code has only been integrated in the last couple of builds of Nevada, so we believe it is an opportune time to fix this.
The project is seeking Minor Commitment, since the interfaces are primarily intended for consumption by Crossbow, which is not available in Solaris 10. Man pages, headers, and supporting materials are also located in the case directory under "materials/".

- Garrett

Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems

1. Introduction
    1.1. Project/Component Working Name:
         Interrupt affinity interfaces and PCITool enhancements
    1.2. Name of Document Author/Supplier:
         Author: Govinda Tatti
    1.3  Date of This Document:
         03 June, 2009

4. Technical Description

Template Version: @(#)sac_nextcase 1.9 06/02/09 SMI
This information is Copyright 2009 Sun Microsystems

1. Introduction
    1.1 Project/Component Working Name:
        Interrupt Affinity Interfaces and PCITool Enhancements
    1.2 Name of Document Author/Supplier:
        Author: Govinda Tatti
    1.3 Date of This Document:
        02 June, 2009

4. Technical Description

4.1 Project Summary

This project provides a mechanism for device drivers, IO frameworks such as Crossbow, and users who want to know the current CPU binding of their interrupts and fine-tune those bindings to achieve maximum IO performance.

The first phase of this project delivers simple DDI interrupt affinity interfaces that allow a device driver to retrieve the current interrupt target CPU and to express its interrupt target preference. In addition, it will deliver some PCITool enhancements to retarget MSI/X interrupts.

In the next phase, these simple DDI interrupt affinity interfaces will be replaced with hint-based or preference-based interfaces. In addition, the DDI interrupt framework and the platform-specific implementation will be modified to query the NUMA-IO framework for the optimal interrupt target CPU before configuring the platform interrupt targeting hardware logic.

4.2 Problem and Requirements

Modern IO bus technologies support large numbers of interrupts.
A single PCI or PCIe device could use up to 32 MSI interrupts, or 2048 MSI-X interrupts. The IRM project (PSARC/2008/628) fixed the MSI-X allocation limit issue and solved part of an IO performance problem. The other part of this problem is how to fine-tune the CPU bindings for these multiple MSI-X interrupts to achieve the expected IO performance.

Currently there is a need for Solaris device drivers such as NIC (10G) and HBA (Emulex) drivers, and for IO frameworks such as Crossbow, to retrieve and reroute the target CPU for their interrupts.

For example, Crossbow provides a framework by which NIC resources such as Rx and Tx rings are exposed to the MAC layer. The MAC layer doles out these resources to VNICs when they are created, while reserving a fixed amount for the primary NIC. The CPUs on which packet processing takes place can be specified at VNIC creation time or later. If they are specified, the interrupts associated with the Rx/Tx rings need to be retargeted to the specified CPUs. A mechanism by which a specific MSI-X interrupt can be retargeted to a different CPU is therefore needed. This is for the virtualization part of Crossbow.

For optimal performance of regular NICs (as well as VNICs), the poll thread associated with an Rx ring should be bound to the same CPU as the interrupt CPU. So, given an interrupt handle and a CPU, we need a mechanism to retarget the interrupt to the specified CPU. This has become a major performance issue (on Maramba) when multiple 10 Gig NICs are present: the poll threads belonging to one NIC can end up running on CPUs that are taking interrupts from another NIC.

Presently Crossbow uses the PCITool ioctls (sys/pci_tools.h) to retarget fixed interrupts from inside the kernel. That interface is not well suited to this kind of work from inside the kernel, and a better interface is needed. This mechanism also does not currently work for MSI-X interrupts on SPARC platforms, which should be addressed.

To achieve the above objectives, the following interfaces are required:

1. Given an interrupt handle (ddi_intr_handle_t) that is associated with an Rx/Tx ring, provide the CPU (processorid_t) that the interrupt currently targets.

2. Given an interrupt handle (ddi_intr_handle_t) that is associated with an Rx/Tx ring and a CPU, bind the interrupt to the specified CPU.

4.3 Changes From the Previous Case

This project is an extension to the approved cases PSARC/2004/253 "Advanced DDI Interrupt Interfaces" and PSARC/2008/628 "Interrupt Resource Management". The existing DDI interrupt interfaces are not changed, but some new DDI interrupt interfaces are added that extend the capabilities of the existing ones.

The changes include:

- A new function (ddi_intr_get_affinity(9f)) to return the interrupt target CPU for a given DDI interrupt handle h.

- A new function (ddi_intr_set_affinity(9f)) to set the interrupt target CPU for a given DDI interrupt handle h.

- A modified ddi_intr_get_cap(9f) function that returns the new capability flag DDI_INTR_FLAG_RETARGETABLE, indicating that all interrupts of the interrupt type currently in use are retargetable.

- A new PCITool option, -m, to retarget MSI/X interrupts.

4.4 Competitive Analysis

Linux and Microsoft operating systems already provide interrupt retarget interfaces of some fashion to their device drivers. It is therefore important to provide similar features to Solaris device drivers, to achieve both individual device performance and overall IO performance on all of Sun's platforms, in order to remain competitive in the marketplace.

4.5 Project Description

4.5.1 Interrupt Affinity Interfaces

The basic strategy is to give device drivers an opportunity to provide their input in selecting the proper interrupt target CPU (such as a CPU# or a preference) for their interrupts.
Device drivers or IO frameworks will call the proposed affinity interfaces either during initialization or at run time to optimize their IO performance based on the available resources, such as DMA channels, rings, allocated interrupts and current CPU bindings.

    typedef processorid_t ddi_intr_target_t;

    int ddi_intr_get_affinity(ddi_intr_handle_t h, ddi_intr_target_t *tgt_p);

    int ddi_intr_set_affinity(ddi_intr_handle_t h, ddi_intr_target_t tgt);

These interfaces are optional for device drivers, so drivers that don't use them still work even if the system has implemented this feature. Conversely, drivers that do use them also work if the system does not implement the support.

This case also includes the contract for the Crossbow framework to use these interrupt affinity interfaces in place of the existing PCITool ioctl interfaces.

Constraints:

a) Set affinity limitations for certain interrupt types

Fixed or INTx interrupts could be either exclusive or sharable depending on hardware. Because there is no good way to detect that, the current implementation will refuse any set affinity requests for INTx interrupts.

On x86 platforms, multiple MSI interrupts of a single PCI function need to be rerouted together, since all MSI interrupts share the same MSI address, which in turn includes the same CPU number. Hence the current x86 implementation will refuse any set affinity requests for MSI interrupts. A future phase of this project may support MSI group retarget, similar to the PCITool method.

b) CPU offline considerations

CPUs may be brought online or taken offline through administrative interfaces. When a CPU is offlined, all of the interrupts targeting it are retargeted. The OS will pick any set of the surviving CPUs for retargeting. The OS is under no obligation to maintain drivers' interrupt affinity preferences.

The first phase of this project will not provide any callback on CPU online/offline events. Such callback events need to be defined in the future.
If a driver or framework is interested in maintaining optimal CPU targeting, it should monitor its interrupt CPU bindings on a regular basis using ddi_intr_get_affinity(9f), or register a callback to receive various CPU-specific events using register_cpu_setup_func(). Userland entities, by contrast, should subscribe to CPU DR specific sysevents.

4.5.2 PCITool Enhancements

Current syntax:

    pcitool pci@<unit-address> -i ino=ino [ -r [ -c ] | -w cpu=CPU [ -g ] ] [ -v ] [ -q ]

Proposed syntax:

    pcitool pci@<unit-address> -i <ino#> | all [ -r [ -c ] | -w <cpu#> [ -g ] ] [ -v ] [ -q ]

    pcitool pci@<unit-address> -m <msi#> | all [ -r [ -c ] | -w <cpu#> [ -g ] ] [ -v ] [ -q ]

PCITool is a low-level tool that provides a facility for getting and setting interrupt routing information. This project makes some minor syntax changes to PCITool, since the current syntax is not compliant with existing userland guidelines. In addition, this project adds a new "-m" option to retrieve and reroute the interrupt target CPU for MSI/Xs on SPARC platforms.

On SPARC platforms, the INO is mapped to an interrupt mondo, whereas one or more MSI/Xs are mapped to an INO. INOs and MSI/Xs are therefore individually retargetable. Use the "-i" option to retrieve or reroute a given INO, and the "-m" option for MSI/Xs.

On x86 platforms, both INOs and MSI/Xs are mapped to the same interrupt vectors. Use the "-i" option to retrieve and reroute any interrupt vector (both INO and MSI/Xs). The "-m" option is therefore not required on x86 platforms and is not supported there.
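For illustration only, the following userland C sketch shows one way a driver or framework might use the affinity interfaces from section 4.5.1: read the current target, check DDI_INTR_FLAG_RETARGETABLE, and retarget only when needed. It is not taken from the project's code. The handle structure, the flag value, and the bodies of the three ddi_intr_*() functions are stand-ins invented so the example compiles outside the kernel, and ring_ensure_affinity() is a hypothetical helper name. Only the prototypes of ddi_intr_get_affinity() and ddi_intr_set_affinity(), the existing ddi_intr_get_cap() prototype, and the flag name come from the DDI and this case.

```c
/*
 * Userland sketch only. Everything marked "stand-in" is an assumption made
 * for illustration and does not describe the real kernel implementation.
 */
typedef int processorid_t;
typedef processorid_t ddi_intr_target_t;

/* Stand-in for the opaque DDI interrupt handle. */
typedef struct mock_intr {
    int           cap;  /* capability flags of the interrupt type in use */
    processorid_t cpu;  /* CPU the interrupt currently targets */
} *ddi_intr_handle_t;

#define DDI_SUCCESS                 0
#define DDI_FAILURE                 (-1)
#define DDI_INTR_FLAG_RETARGETABLE  0x1000  /* stand-in value */

/* Stand-in: report the capability flags stored in the mock handle. */
int
ddi_intr_get_cap(ddi_intr_handle_t h, int *flagsp)
{
    *flagsp = h->cap;
    return (DDI_SUCCESS);
}

/* Stand-in: report the CPU stored in the mock handle. */
int
ddi_intr_get_affinity(ddi_intr_handle_t h, ddi_intr_target_t *tgt_p)
{
    *tgt_p = h->cpu;
    return (DDI_SUCCESS);
}

/* Stand-in: refuse the request unless the type is marked retargetable. */
int
ddi_intr_set_affinity(ddi_intr_handle_t h, ddi_intr_target_t tgt)
{
    if (!(h->cap & DDI_INTR_FLAG_RETARGETABLE))
        return (DDI_FAILURE);
    h->cpu = tgt;
    return (DDI_SUCCESS);
}

/*
 * Hypothetical helper: make sure the interrupt behind h targets `want`.
 * Returns DDI_SUCCESS if it already does or was retargeted, and
 * DDI_FAILURE if the interrupt type is not retargetable or a call fails.
 */
int
ring_ensure_affinity(ddi_intr_handle_t h, processorid_t want)
{
    ddi_intr_target_t cur;
    int cap;

    if (ddi_intr_get_affinity(h, &cur) != DDI_SUCCESS)
        return (DDI_FAILURE);
    if (cur == want)
        return (DDI_SUCCESS);   /* already bound where we want it */

    if (ddi_intr_get_cap(h, &cap) != DDI_SUCCESS ||
        !(cap & DDI_INTR_FLAG_RETARGETABLE))
        return (DDI_FAILURE);

    return (ddi_intr_set_affinity(h, want));
}
```

Because the OS may retarget interrupts when a CPU goes offline, a caller could invoke a helper of this kind periodically, or from a CPU event callback, to re-check the binding. On failure, a caller that cares about locality could instead move its poll thread to the CPU reported by ddi_intr_get_affinity().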
4.6 Interfaces

4.6.1 Exported Interfaces

Interface                   | Stability       | Comments
----------------------------+-----------------+--------------------------
ddi_intr_target_t           | Project Private | Interrupt target CPU
ddi_intr_get_affinity       | Project Private | Get interrupt target CPU
ddi_intr_set_affinity       | Project Private | Set interrupt target CPU
-------------------------------------------------------------------------

4.6.2 Imported Interfaces

Interface                   | Stability       | Comments
----------------------------+-----------------+--------------------------
DDI_INTR_FLAG_RETARGETABLE  | Project Private | Return this new flag (RO)
                            |                 | to ddi_intr_get_cap()
                            |                 | callers if current
                            |                 | interrupt type in use is
                            |                 | retargetable
pcitool                     | Project Private | Minor syntax changes.
                            |                 | Added new -m option for
                            |                 | MSI/Xs.
-------------------------------------------------------------------------

5. References

[1] Solaris Interrupt Project Webpage
    http://pciexpress.sfbay/intr

[2] Advanced DDI Interrupt Functions - PSARC/2004/253
    http://sac.sfbay.sun.com/PSARC/2004/253

[3] Interrupt Resource Management - PSARC/2008/628
    http://sac.sfbay.sun.com/PSARC/2008/628

[4] PCITool and its nexus ioctl support - PSARC/2005/232
    http://sac.sfbay.sun.com/PSARC/2005/232

[5] PCITool Public Interrupts - PSARC/2009/215
    http://sac.sfbay.sun.com/PSARC/2009/215

6. Resources and Schedule
    6.4 Steering Committee requested information
        6.4.1 Consolidation C-team Name:
              ON
    6.5 ARC review type: FastTrack
    6.6 ARC Exposure: open