I'm sponsoring this fasttrack for Scott Carter. It is set to time out on 
10/15/2008. Because this is a relatively minor amendment to a larger, 
approved case, PSARC/2004/253 "Advanced DDI Interrupt Functions", I think 
it qualifies as a fasttrack. As the discussion progresses, extending 
the timer or even promoting it to a full case may be considered.

In addition to this proposal, three man page files can be found in the 
case directory:

ddi_cb_register.9f.txt
ddi_intr_set_nreq.9f.txt
ndi_irm_create.9f.txt

-Artem

Template Version: @(#)sac_nextcase %I% %G% SMI
This information is Copyright 2008 Sun Microsystems
1. Introduction
     1.1. Project/Component Working Name:
         Interrupt Resource Management
     1.2. Name of Document Author/Supplier:
         Author:  Scott Carter
     1.3. Date of This Document:
         07 October, 2008
4. Technical Description
4.1 Project Description

This project delivers the resource management feature originally
defined in PSARC/2004/253 "Advanced DDI Interrupt Functions"
with minor changes (see 4.1.3 for details).

It provides a mechanism for device drivers to obtain more interrupt
vectors, and with them better performance.  For a driver that
participates in the new feature, the number of available interrupts
becomes dynamic and can increase or decrease.  The goal is to maximize
utilization of interrupt vectors in a fair manner, and to rebalance the
allocations whenever devices are added to or removed from the system.


4.1.1 Definition

The project defines new DDI interfaces for device drivers to register
and unregister a generic callback handler.  If a driver has registered
a handler, then it can be notified when its number of interrupts has
been increased or decreased.

The number of interrupt vectors a device driver wants is initially the
number it attempts to allocate through ddi_intr_alloc().  A driver can
explicitly adjust this number later with a new DDI function,
ddi_intr_set_nreq(), that is introduced by this project.

The project also defines new NDI interfaces for nexus drivers to define
the supplies of interrupt vectors that are available in the system.  A
supply can be associated with a specific IO bus (e.g. PCIe root complex)
or it could be a global supply shared by all devices.

Along with the new DDI and NDI interfaces, the project also delivers
an MDB debugging module with two new debug macros.  One macro is used
to display statistics on all of the defined interrupt supplies, and the
other displays statistics about how a supply is divided up between the
drivers that map to it.


4.1.2 Motivation, Goals, and Requirements

Currently, interrupt vectors are given to device drivers in very small
numbers.  This is to avoid exhausting a system's supply before all the
devices have been attached, and to keep interrupt vectors in reserve for
later hotplugs.  Interrupts are given so conservatively because there
is no mechanism to take them back later unless a driver detaches.

Fewer interrupt vectors mean lower IO performance, and more interrupt
vectors mean more parallelism for handling interrupt conditions.  The
motivation for this project is therefore to increase IO performance.
The goal is to maximize the allocation of interrupts given to each
device, in a way that is fair and balanced across the full set of
attached devices.

The requirement of this project is:

- Provide a mechanism for drivers to get more interrupts.

The feature is optional so drivers that don't use it still work even
if the system has implemented support for the feature.  And conversely,
drivers that do use it also work if the system does not implement the
support.

A full implementation of the platform level support will be provided
by this project for PCIe IO bus drivers on SPARC.


4.1.3 Changes From the Previous Case

This project is a minor change to the approved case PSARC/2004/253
"Advanced DDI Interrupt Functions".  The existing DDI interrupt
interfaces are not changed.  Some new DDI interrupt interfaces are
added, which extend the capabilities of the existing interfaces.

The interfaces described in section 6.3 of the specifications for
PSARC/2004/253 "Advanced DDI Interrupt Functions" have already been
approved, but those resource management interfaces have never been
implemented.  This project is providing the implementation of those
interfaces, with minor changes.

The changes include:

- Generalization of the proposed callback interfaces, so they
   can be shared by both IRM and any future projects.

- Removal of the semantics originally proposed to temporarily
   disable callbacks.

- Addition of a function (ddi_intr_set_nreq()) for explicitly
   setting how many interrupt vectors a device driver requests.

The interface described in PSARC/2007/453 "MSI-X interrupt limit
override" provides a temporary workaround for device drivers to request
more interrupts, and that workaround is currently in use.  This project
supersedes the functionality of the workaround, and drivers using it
should ultimately be converted.  In the meantime, the workaround is
preserved and still works in conjunction with this project.


4.1.4 Competitive Analysis

This project is important for the overall IO performance on all of
Sun's platforms in order to remain competitive in the marketplace.

Modern IO bus technologies support large numbers of interrupts.  A
single PCI or PCIe device could use up to 32 MSI interrupts, or 2048
MSI-X interrupts.  Without this project, Solaris gives at most 2
interrupts to each device.  We are limiting the IO throughput of our
systems by not having this feature, and by not giving some devices
nearly as many interrupt vectors as they can support and utilize.

Current Sun platforms have limited numbers of interrupts to give to
the devices.  SPARC PCIe platforms have 256 interrupts per root complex,
and x64 systems have only 256 interrupts per processor.  But we have
future platforms in development with many thousands of interrupt
vectors available.  We need this feature so that we can achieve the
full potential of advanced MSI and MSI-X devices on such platforms.
In the meantime, we already have some devices that could benefit from
getting more than the current limit of 2 interrupt vectors.


4.2 Technical Description

4.2.1 Architecture

The basic strategy is to associate a representation of each supply of
interrupt vectors in the system with the drivers that consume it, and
to compute the optimal way to divide each supply among those devices.

When a driver is attached or detached, the computations are performed
to rebalance the system.  Drivers can also explicitly change how many
interrupts they want in response to load.  The computations seek to
maximize the use of the system's interrupts and derive fair allocations
for each device.  Drivers are notified using a callback mechanism when
the computations result in giving the driver more or fewer interrupts.
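
The kind of computation described above can be illustrated with a small
userland sketch.  It is not the algorithm delivered by this project (that
is specified in the design document); it only assumes a simple policy:
every request first gets a small base share, and leftover vectors are
then handed out round-robin to requests that still want more.  The names
sketch_req_t and sketch_rebalance() and the "base" parameter are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one driver's request against a supply. */
typedef struct sketch_req {
    int nreq;    /* vectors the driver asked for */
    int navail;  /* vectors granted by the last rebalance */
} sketch_req_t;

/*
 * Divide `total` vectors among `n` requests (assumed policy, see above).
 * Pass 1 grants each request up to `base` vectors.  Pass 2 hands out the
 * remainder one vector at a time, round-robin, to requests that still
 * want more.  Returns the number of vectors left unassigned.
 */
int
sketch_rebalance(sketch_req_t *reqs, size_t n, int total, int base)
{
    int left = total;
    int progress = 1;
    size_t i;

    for (i = 0; i < n; i++) {
        int want = reqs[i].nreq < base ? reqs[i].nreq : base;
        int give = want < left ? want : left;

        if (give < 0)
            give = 0;
        reqs[i].navail = give;
        left -= give;
    }

    while (left > 0 && progress) {
        progress = 0;
        for (i = 0; i < n && left > 0; i++) {
            if (reqs[i].navail < reqs[i].nreq) {
                reqs[i].navail++;
                left--;
                progress = 1;
            }
        }
    }

    assert(left >= 0);
    return (left);
}
```

Under this assumed policy, a supply of 16 vectors shared by requests for
8, 2 and 16 vectors with a base share of 2 is divided as 7, 2 and 7.
Rerunning the function after a request is added, removed or resized
models the rebalance step.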

The main components of this project are:

- NDI interfaces for nexus drivers to create or destroy the
   descriptions of individual supplies of interrupt vectors.

- DDI interfaces for drivers to register or unregister callback
   mechanisms.  The callbacks notify them of changes to their
   interrupt availability.

- DDI interface for drivers to explicitly set their requested
   number of interrupts dynamically.

- A mechanism to map individual devices to interrupt supplies.

- Background threads which keep the allocations of interrupts
   from each supply to each device optimized and balanced.

- Routines to initialize the new feature at boot time.

- An MDB debugging module to display the status of how the
   interrupt subsystem has been balanced.

Each supply of interrupt vectors in the system is described by a data
structure (ddi_irm_pool_t), representing one pool of interrupts that
can be shared by multiple devices.  And for each device that maps to
an interrupt pool, a data structure (ddi_irm_req_t) represents how
many interrupts it wants versus how many it received.

Nexus drivers create the interrupt pools, and then they map individual
devices to them through the existing bus nexus driver INTROP feature.
The request data structures are created internally and associated with
the interrupt pools when devices are attached and mapped to a pool.
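
The relationship between a pool and its requests might be pictured as
follows.  This is a userland sketch only: the real ddi_irm_pool_t and
ddi_irm_req_t layouts are not given in this document, so every type,
field and function name below (all prefixed "sketch_") is a hypothetical
stand-in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device request: wanted versus received. */
typedef struct sketch_irm_req {
    const char *driver;           /* consumer name, for display only */
    int nreq;                     /* vectors wanted */
    int navail;                   /* vectors received */
    struct sketch_irm_req *next;  /* next request in the same pool */
} sketch_irm_req_t;

/* Hypothetical pool: one supply shared by several devices. */
typedef struct sketch_irm_pool {
    int total;                    /* size of this supply */
    sketch_irm_req_t *reqs;       /* requests mapped to this pool */
} sketch_irm_pool_t;

/* Map a device's request onto a pool, as happens at attach time. */
void
sketch_pool_add(sketch_irm_pool_t *pool, sketch_irm_req_t *req)
{
    req->next = pool->reqs;
    pool->reqs = req;
}

/* Unlink a request, as happens at detach.  Returns 1 if found, else 0. */
int
sketch_pool_remove(sketch_irm_pool_t *pool, sketch_irm_req_t *req)
{
    sketch_irm_req_t **pp;

    for (pp = &pool->reqs; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == req) {
            *pp = req->next;
            req->next = NULL;
            return (1);
        }
    }
    return (0);
}

/* Sum what the pool's consumers want and what they currently hold. */
void
sketch_pool_stats(const sketch_irm_pool_t *pool, int *wanted, int *granted)
{
    const sketch_irm_req_t *r;

    *wanted = 0;
    *granted = 0;
    for (r = pool->reqs; r != NULL; r = r->next) {
        *wanted += r->nreq;
        *granted += r->navail;
    }
    assert(*granted <= pool->total);
}
```

A per-pool summary of wanted versus granted vectors, like the one
sketch_pool_stats() computes, is the sort of figure the MDB macros
mentioned in 4.1.1 are meant to display.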

Existing device drivers do not benefit, and they continue to get only
the same small number of interrupt vectors that they currently receive.
In order to benefit, they must be modified with optional enhancements
so that they can participate.  To participate, they must provide a new
callback mechanism so that the system can notify them when they have
been given more or fewer interrupt vectors.

A modified driver first registers a generic callback handler so it can
receive notifications of interrupt availability changes.  Then it calls
ddi_intr_alloc() to request an initial number of interrupt vectors.  If
the system has the necessary support (from nexus drivers), then it
associates the requesting driver with an interrupt pool.  Through this
association, the system computes whether the driver gets more interrupts.

A driver may initially receive all the interrupts it requested, or it
may receive callbacks at a later time (post attach(9F)) notifying it
when more interrupt vectors are available.  More could be available if
other devices were removed, or if workload changes cause other drivers
to reduce their requests.

But to qualify for additional interrupts, a driver must also yield and
call ddi_intr_free() when necessary.  This may occur if another device
is inserted, or if changes in workload cause other drivers to need more.
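
The grow and yield behavior expected of a participating driver can be
sketched in userland as below.  The callback signature, the action
names and the driver state are simplified stand-ins invented for
illustration (all prefixed "sketch_" or "SKETCH_"); the real interface
is defined in ddi_cb_register.9f.txt in the case directory.  Where the
comments mention ddi_intr_alloc() and ddi_intr_free(), a real handler
would presumably make those calls; here only the bookkeeping is modeled.

```c
#include <assert.h>

/* Hypothetical callback actions: vectors were added or must be returned. */
typedef enum {
    SKETCH_INTR_ADD,
    SKETCH_INTR_REMOVE
} sketch_action_t;

/* Hypothetical per-instance driver state. */
typedef struct sketch_drv {
    int nintrs;     /* vectors currently allocated and in use */
    int max_intrs;  /* most vectors the device can make use of */
} sketch_drv_t;

/*
 * Handle an interrupt availability change of `count` vectors.
 * Returns 0 on success and -1 if the request cannot be honored.
 */
int
sketch_intr_cb(sketch_drv_t *drv, sketch_action_t action, int count)
{
    if (count <= 0)
        return (-1);

    switch (action) {
    case SKETCH_INTR_ADD:
        /* Grow: a real driver would ddi_intr_alloc() the new vectors. */
        if (drv->nintrs + count > drv->max_intrs)
            count = drv->max_intrs - drv->nintrs;
        drv->nintrs += count;
        return (0);

    case SKETCH_INTR_REMOVE:
        /* Yield: a real driver would ddi_intr_free() the vectors. */
        if (count >= drv->nintrs)
            return (-1);  /* this sketch always keeps one vector */
        drv->nintrs -= count;
        return (0);
    }
    return (-1);
}
```

Keeping at least one vector and capping growth at max_intrs are choices
of this sketch, not requirements stated by the project.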

This project introduces the interfaces for nexus drivers to describe
the interrupt pools, and for device drivers to engage with those
interrupt pools to possibly receive more resources.  It also delivers
all of the additional implementation behind the scenes that performs
the related computations when necessary.


4.2.2 Interfaces

4.2.2.1 Exported Interfaces

Interface                            Stability  Comments
-----------------------------------+----------+-------------------------
ndi_irm_create()                     Committed  Create an IRM pool
ndi_irm_destroy()                    Committed  Destroy an IRM pool
DDI_INTROP_GETPOOL                   Committed  Get IRM pool INTROP

ddi_cb_register()                    Committed  Install callback handler
ddi_cb_unregister()                  Committed  Remove callback handler
ddi_cb_action_t                      Committed  Callback action type
ddi_cb_flags_t                       Committed  Callback flags type
ddi_cb_func_t                        Committed  Callback function type
ddi_cb_handle_t                      Committed  Callback handle type

ddi_intr_set_nreq()                  Committed  Set IRM request size
-----------------------------------+----------+-------------------------


4.2.2.2 Imported Interfaces

Interface                            Stability  Comments
-----------------------------------+----------+-------------------------
ddi_intr_alloc()                     Committed  Added hooks into IRM
ddi_intr_free()                      Committed  Added hooks into IRM
-----------------------------------+----------+-------------------------


4.2.2.3 Removed Interfaces

These interfaces were previously approved, but never implemented.
They are now superseded by the interfaces of this project.

Interface                            Stability  Comments
-----------------------------------+----------+-------------------------
ddi_intr_register_management_cb()    Committed  Register callback
ddi_intr_unregister_management_cb()  Committed  Unregister callback
ddi_intr_enable_management_cb()      Committed  Enable callback
ddi_intr_disable_management_cb()     Committed  Disable callback
-----------------------------------+----------+-------------------------


5. References

This Project Implements, Extends, or Replaces these Projects:

- PSARC/2004/253: Advanced DDI Interrupt Functions
- PSARC/2007/453: MSI-X interrupt limit override
   (A future RFE will remove the workaround once all
   consumers have been converted to use this project.)

Consumers of this Project:

- PSARC/2008/181: Solaris Hotplug Framework
   (Uses this project's generic callback mechanism)
- IRM Enhancements for Atlas/Neptune Driver
   (Convert from existing workaround to use this project)
- x86 APIC Expansion and IRM Support
   (Provide interrupt pool definitions on x86.  The scope of
   this project is to only deliver interrupt pools on SPARC.)

Design and Implementation Specification of this Project:
- http://pciexpress.sfbay/intr/docs/irm/irm_design.txt

6. Resources and Schedule
     6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
                ON
     6.5. ARC review type: FastTrack
     6.6. ARC Exposure: open

