Hi Hal,
Please see my responses inline, marked [EZ].
Eitan

> > RFC: OpenFabrics Enhancements for QoS Support
> > ===============================================
> >
> > Authors:  Eitan Zahavi <[EMAIL PROTECTED]>
> > Date:     May 2006
> > Revision: 0.1
> >
> > Table of contents:
> > 1. Overview
> > 2. Architecture
> > 3. Supported Policy
> > 4. IPoIB functionality
> > 5. CMA functionality
> > 6. SDP functionality
> > 7. SRP functionality
> > 8. iSER functionality
> > 9. OpenSM functionality
> >
> > 1. Overview
> > ------------
> > Quality of Service requirements stem from the realization of I/O consolidation
> > over an IB network: as multiple applications and ULPs share the same fabric, means
> > to control their use of the network resources are becoming a must. The basic
> > need is to differentiate the service levels provided to different traffic flows,
> > so that a policy can be enforced to control each flow's utilization of the
> > fabric resources.
> >
> > The IBTA specification defines several hardware features and management interfaces
> > to support QoS:
> > * Up to 15 Virtual Lanes (VL) can carry traffic in a non-blocking manner
> > * Arbitration between traffic of different VLs is performed by a weighted
> >   round robin arbiter with 2 priority levels. The arbiter is programmable with
> >   a sequence of (VL, weight) pairs and a maximal number of high priority credits
> >   to be processed before low priority is served
> > * Packets carry a class of service marking in the range 0 to 15 in their
> >   header SL field
> > * Each switch can map an incoming packet by its SL to a particular output
> >   VL based on a programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
> > * The Subnet Administrator controls the parameters of each communication flow
> >   by providing them in the response to a PathRecord query
> >
> > The IB QoS features provide the means to implement a DiffServ like architecture.
> > The DiffServ architecture (IETF RFC 2474 and RFC 2475) is widely used today in
> > highly dynamic fabrics.
>
> Only certain DSCP code point equivalents are provided by IBA.

[EZ] True.

> > This proposal provides the detailed functional definition for the various
> > software elements that are required to enable a DiffServ like architecture over
> > the OpenFabrics software stack.
> >
> > 2. Architecture
> > ----------------
> > This proposal splits the QoS functionality between the SM/SA, CMA and the various
> > ULPs. We take the "chronology approach" to describe how the overall system
> > works:
> >
> > 2.1. The network manager (human) provides a set of rules (policy) that defines
> > how the network is being configured and how its resources are split to different
> > QoS-Levels. The policy also defines how to decide which QoS-Level each
> > application, ULP or service uses.
> >
> > 2.2. The SM analyzes the provided policy to see if it is realizable and performs
> > the necessary fabric setup. The SM may continuously monitor the policy and adapt
> > to changes in it.
>
> Do you mean monitor the policy or the fabric here ?

[EZ] I mean monitor the policy, such that changes in it are enforced.

> > Part of this policy defines the default QoS-Level of each
> > partition. The SA is being enhanced to match the requested Source, Destination,
> > TClass, Service-ID
>
> Service ID does not apply to many ULPs. Also, how is it known what
> ULP/application a particular service ID refers to (other than perhaps
> some well known ones) ?

[EZ] True - only well known Service-IDs can have a predefined policy attached
to them. But I disagree that services are unknown: if they are unknown, how are
they found by the clients?

> > (and optionally SL and priority) against the policy. So
> > clients (ULPs, programs) can obtain a policy enforced QoS. The SM is also
> > enhanced to support setting up partitions with an appropriate IPoIB broadcast
> > group. This broadcast group carries its QoS attributes: TClass, SL, MTU and
> > RATE.
> >
> > 2.3. IPoIB is being setup.
> > IPoIB uses the SL, MTU and RATE available on the
> > multicast group which forms the broadcast group of this partition.
> >
> > 2.4. MPI, which provides non IB based connection management, should be configured
> > to run using hard coded SLs. It uses these SLs in every QP being opened.
> >
> > 2.5. ULPs that use the CM interface (like SRP) should have their own pre-assigned
> > Service-ID and use it while obtaining a PathRecord for establishing their
> > connections. The SA receiving the PathRecord request should match it against the
> > policy and return the appropriate PathRecord including SL, MTU, RATE and TClass.
> >
> > 2.6. ULPs and programs using CMA to establish an RC connection should provide the
> > CMA with the target IP and Service-ID. Some of the ULPs might also provide TClass
> > (e.g. for SDP sockets that are provided the TOS socket option). The CMA should
> > then use the provided Service-ID and optional TClass and pass them in the
> > PathRecord request. The resulting PathRecord should be used for configuring the
> > connection QP.
> >
> > PathRecord and MultiPathRecord enhancement for QoS:
> > As mentioned above the PathRecord and MultiPathRecord attributes should be
> > enhanced to carry the Service-ID which is a 64bit value. Given the existing
> > definition for these attributes we propose to use the following fields for
> > Service-ID:
> > * For PathRecord: use the first 2 reserved fields which are 32 bits each
> >   (component masks 0x1 and 0x2). Component mask 1 should be used to refer to the
> >   merged Service-ID field
> > * For MultiPathRecord: use 2 reserved fields:
> >   1. after the packet life (8 bits) which is component mask bit 0x10000 (17)
> >   2. the field before SDGID1 (56 bits) which is component mask bit 0x200000 (22)
>
> This is not possible with the existing approved 1.2 erratum changes.

[EZ] Oops, I was using the 1.2 spec. Can you elaborate on the field I missed?
Can we find a replacement?
>
> >   Once merged they should be selected using component mask bit 0x10000 (17)
> > A new capability bit should describe the SM QoS support in the SA class port
> > info. This approach provides an easy migration path for existing access layer
> > and ULPs by not introducing a new attribute.
> >
> >
> > 3. Supported Policy
> > --------------------
> >
> > The QoS policy supported by this proposal is divided into 4 sub sections:
> >
> > * Node Group: a set of HCAs, Routers or Switches that share the same settings.
> > A node group might be a partition defined by the partition manager policy in
> > terms of GUIDs. Future implementations might provide support for NodeDescription
> > based definition of node groups.
> >
> > * Fabric Setup:
> > Defines how the SL2VL and VLArb tables should be setup. This policy definition
> > assumes the computation of target behavior should be performed outside of
> > OpenSM.
> >
> > * QoS-Levels Definition:
> > This section defines the possible sets of parameters for QoS that a client might
> > be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate, Path Bits
> > (in case LMC > 0 is used for QoS) and TClass.
> >
> > * Matching Rules:
> > A list of rules that match an incoming PathRecord request to a QoS-Level. The
> > rules are processed in order such that the first match is applied. Each rule is
> > built out of a set of match expressions which should all match for the rule to
> > apply. The matching expressions are defined for the following fields:
> > ** SRC and DST to lists of node groups
> > ** Service-ID to a list of Service-IDs or Service-ID ranges
> > ** TClass to a list of TClass values or ranges
> >
> > XML style syntax is provided for the policy file. However, a strict BNF format
> > (provided in section 8)
>
> What section ?

[EZ] Sorry, I planned to add it and did not make it for this mail. Please ignore
this. I will provide the BNF once we make some progress.

> > should be used for parsing it.
> >
> > <?xml version="1.0" encoding="ISO-8859-1"?>
> > <qos-policy>
> >  <!-- Port Groups define sets of ports to be used later in the settings -->
> >  <port-groups>
> >   <!-- using port GUIDs -->
> >   <port-group> <name>Storage</name> <use>our SRP storage targets</use>
> >    <port-guid>0x1000000000000001</port-guid>
> >    <port-guid>0x1000000000000002</port-guid>
> >   </port-group>
> >   <!-- using names obtained by concatenation of first 2 words of NodeDescription
> >        and port number -->
> >   <port-group> <name>Virtual Servers</name> <use>node desc and IB port #</use>
> >    <port-name>vs1/HCA-1/P1</port-name>
> >    <port-name>vs3/HCA-1/P1</port-name>
> >    <port-name>vs3/HCA-2/P1</port-name>
> >   </port-group>
> >   <!-- using partitions defined in the partition policy -->
> >   <port-group> <name>Partition 1</name> <use>default settings</use>
> >    <partition>Part1</partition>
> >   </port-group>
> >   <!-- using node types HCA|ROUTER|SWITCH -->
> >   <port-group> <name>Routers</name> <use>all routers</use>
> >    <node-type>ROUTER</node-type>
> >   </port-group>
> >  </port-groups>
> >  <qos-setup>
> >   <!-- define all types of SL2VL tables always have 16 VL entries -->
> >   <sl2vl-tables>
> >    <!-- scope defines the exact devices and in/out ports the tables apply to
> >         if the same port is matching several rules the last one applies -->
> >    <sl2vl-scope> <group>Part1</group> <from>*</from> <to>*</to>
> >     <sl2vl-table>0,1,2,3,4,5,6,7,8,9,10,11,12,13,14</sl2vl-table>
> >    </sl2vl-scope>
> >    <!-- "across" means the port just connected to the given group,
> >         also the link across port 1 is probably supporting only 2 VLs -->
> >    <sl2vl-scope> <across>Storage</across> <from>*</from> <to>1</to>
> >     <sl2vl-table>0,1,1,1,1,1,1,1,1,1,1,1,1,1,1</sl2vl-table>
> >    </sl2vl-scope>
> >   </sl2vl-tables>
> >
> >   <!-- define all types of VLArb tables. The length of the tables should
> >        match the physically supported tables by their target ports -->
> >   <vlarb-tables>
> >    <!-- scope defines the exact ports the VLArb tables apply to -->
> >    <vlarb-scope> <group>Storage</group> <to>*</to>
> >     <!-- VLArb table holds VL and weight pairs -->
> >     <vlarb-high>0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1</vlarb-high>
> >     <vlarb-low>8:255,9:127,10:63,11:31,12:15,13:7,14:3</vlarb-low>
> >     <vl-high-limit>10</vl-high-limit>
> >    </vlarb-scope>
> >   </vlarb-tables>
> >  </qos-setup>
> >
> >  <qos-levels>
> >   <!-- the first one is just setting SL -->
> >   <qos-level> <sn>1</sn> <use>for the lowest priority comm</use>
> >    <sl>16</sl>
> >   </qos-level>
> >   <!-- the second sets SL and TClass -->
> >   <qos-level> <sn>2</sn> <use>low latency best bandwidth</use>
> >    <sl>0</sl> <tclass>7</tclass>
> >   </qos-level>
> >   <!-- the whole set: SL, TClass, MTU-Limit, Rate-Limit, Path-Bits -->
> >   <qos-level> <sn>3</sn> <use>just an example</use>
> >    <sl>0</sl> <tclass>32</tclass> <mtu_limit>1</mtu_limit>
> >    <rate_limit>1</rate_limit>
> >   </qos-level>
> >  </qos-levels>
> >
> >  <qos_match_rules>
> >   <!-- matching by single criteria: tclass (list of values and ranges) -->
> >   <qos_match_rule> <sn>1</sn> <use>low latency by tclass 7-9 or 11</use>
> >    <tclass>7-9,11</tclass> <match-level>1</match-level>
> >   </qos_match_rule>
> >   <!-- show matching by destination group AND service-ids -->
> >   <qos_match_rule> <sn>2</sn> <use>Storage targets connection</use>
> >    <destination>Storage</destination> <service>22,4719</service>
> >    <match-level>3</match-level>
> >   </qos_match_rule>
> >  </qos_match_rules>
> >
> > </qos-policy>
> >
> >
> > 4. IPoIB
> > ---------
> >
> > IPoIB already queries the SA for its broadcast group information. The additional
> > functionality required is for IPoIB to provide the broadcast group SL, MTU, RATE
> > and TClass in every following PathRecord query performed when a new UDAV is
> > needed by IPoIB.
> > We could assign a special Service-ID for IPoIB use, but since all communication
> > on the same IPoIB interface shares the same QoS-Level without the ability to
> > differentiate it by target service we can ignore it for simplicity.
> >
> > 5. CMA features
> > ----------------
> >
> > The CMA interface supports Service-ID through the notion of port space as a
> > prefix to the port_num which is part of the sockaddr provided to
> > rdma_resolve_addr(). What is missing is the explicit request for a TClass that
> > should allow the ULP (like SDP) to propagate a specific request for a class of
> > service. A mechanism for providing the TClass is available in the IPv6 address,
> > so we could use that address field. Another option is to implement a special
> > connection options API for CMA.
> >
> > Missing functionality by CMA is the usage of the provided TClass and Service-ID
> > in the sent PathRecord. When a response is obtained it is an existing
> > requirement for the CMA to use the PathRecord from the response in setting up
> > the QP address vector.
> >
> >
> > 6. SDP
> > -------
> >
> > SDP uses CMA for building its connections.
> > The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
> > holding the remote TCP/IP Port Number to connect to.
> > SDP might be provided with the SO_PRIORITY socket option. In that case the value
> > provided should be sent to the CMA as the TClass option of that connection.
> >
> > 7. SRP
> > -------
> >
> > The current SRP implementation uses its own CM callbacks (not CMA). So SRP should
> > fill in the Service-ID in the PathRecord by itself and use that information in
> > setting up the QP. The T10 SRP standard defines the SRP Service-ID to be defined
> > by the SRP target I/O Controller (but they should also comply with IBTA
> > Service-ID rules).
> > Anyway, the Service-ID is reported by the I/O Controller in the
> > ServiceEntries DM attribute and should be used in the PathRecord if the SA
> > reports its ability to handle QoS PathRecords.
> >
> > 8. iSER
> > --------
> > iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
> > is TBD.
> >
> >
> > 9. OpenSM features
> > -------------------
> > The QoS related functionality to be provided by OpenSM can be split into two
> > main parts:
> >
> > 3.1. Fabric Setup
> > During fabric initialization the SM should parse the policy and apply its
> > settings to the discovered fabric elements. The following actions should be
> > performed:
> > * Parsing of policy
> > * Node Group identification. A warning should be provided for each node not
> >   specified but found.
>
> What about the other way 'round too (nodes specified but not found) ?

[EZ] Yep. That will require a warning too.

> > * SL2VL settings should be validated:
> >   + A warning will be provided if there are no matching targets for the SL2VL
> >     setting statement.
> >   + An error message will be printed to the log file if an invalid setting is
> >     found. A setting is invalid if it refers to:
> >     - Non existing port numbers of the target devices
> >     - Unsupported VLs for the target device. In the latter case the mapping to
> >       non existing VLs should be replaced by VL15, i.e. packets will be dropped.
> > * SL2VL setting is to be performed
> > * VL Arbitration table settings should be validated according to the following
> >   rules:
> >   + A warning will be provided if there are no matching targets for the setting
> >     statement
> >   + An error will be provided if the port number exceeds the target ports
> >   + An error will be generated if the table length exceeds device capabilities
> >   + A warning will be generated if the table references a VL that is not
> >     supported by the target device
> > * VL Arbitration tables will be set on the appropriate targets
>
> One needs to be careful about these rules as there are a number of
> different "shapes" to these tables.

[EZ] Not sure what you mean by shape. IBTA defined all VLArb tables with the
same format?

> > 3.2. PathRecord query handling:
> > OpenSM should be able to enforce the provided policy on client requests.
> > The overall flow for such requests is: first the request is matched against the
> > defined match rules such that the target QoS-Level definition is found. Given
> > the QoS-Level a path(s) search is performed with the given restrictions imposed
> > by that level. The following two sections describe these steps.
> >
> > One issue not standardized by the IBTA is how Service-ID is carried in the
> > PathRecord and MultiPathRecord attributes. There are basically two options:
> > a. Replace the SM-Key field by the Service-ID. In that case no component mask
> >    bit will be assigned to it, so if the field is zero we should treat it
> >    as if the component mask bit is clear.
> > b. Encode it into spare fields. For PathRecord the first two fields are reserved
> >    and are 64 bits when combined. The first component mask bit maps to the first
> >    reserved field and should be used for Service-ID masking. For the
> >    MultiPathRecord attribute there are no adjacent reserved fields that make up
> >    a 64 bit field.
> >    So the reserved field following the packet-lifetime (8 bits) combined with
> >    the reserved field DGIDCount (56 bits) can make the Service-ID. In this case
> >    also the first reserved field component mask bit should be used as the
> >    Service-ID component mask bit.
> >
> >
> > 3.2.1. Matching rule search:
> > A rule is "matching" a PathRecord request using the following criteria:
> > * Matching rules provide values in a list of either single values or ranges of
> >   values. A PathRecord field is "matching" the rule field if it is explicitly
> >   noted in the list of values or is one of the values covered by a range
> >   included in the field values list.
> > * Only PathRecord fields that have their component mask bit set should be
> >   compared.
> > * For a rule to be "matching" a PathRecord request all the rule fields should be
> >   "matching" their PathRecord fields. Thus a PathRecord request that does
> >   not have the component mask bit set for one of the rule defined fields can
> >   not match that rule.
> > * A PathRecord request that has a component mask bit set for one of the fields
> >   that is not defined by the rule can match the rule.
> >
> > The algorithm to be used for searching for a rule match might be as simple as a
> > sequential search through all rules or enhanced for better performance. The
> > semantics of every rule field and its matching PathRecord field are described
> > below:
> > * Source: the SGID or SLID should be part of this group
> > * Destination: the DGID or DLID should be part of this group
> > * Service-ID: check if the requested Service-ID (available in the PathRecord old
> >   SM-Key field) is matching any of this rule's Service-IDs
> > * TClass: check if the PathRecord TClass field is matching
> >
> > 3.2.2 PathRecord response generation:
> > The QoS-Level pointed to by the first rule that matches the PathRecord request
> > should be used for obtaining the response SL, MTU-Limit, RATE-Limit, Path-Bits
> > and TClass.
> > A default QoS-Level should be used if no rule is matching the query.
> >
> > The efficient algorithm for finding paths that meet the QoS-Level criteria is
> > beyond the scope of this RFC and left for the implementer to provide. However
> > the criteria by which the paths match the QoS-Level are described below:
> >
> > * SL: The paths found should all use the given SL. For that purpose the
> >   PathRecord algorithm should traverse the path from source to destination only
> >   through ports that carry a valid VL (not VL15) by the SL2VL map (should
> >   consider input and output ports and SL).
> > * MTU-Limit: The resulting paths MTU should not exceed the given MTU-Limit
> > * Rate-Limit: The resulting paths RATE should not exceed the given RATE-Limit
> >   (rate limit is given in units of link BW = Width*Speed according to IBTA
> >   Specification Vol-1 table-205 p-901 l-24).
> > * Path-Bits: define the target LID lowest bits (number of bits defined by the
> >   target port PortInfo.LMC field). The path should traverse the LFT using the
> >   target port LID with the path-bits set.
> > * TClass: should be returned in the result PathRecord. When routing is going to
> >   be supported by OpenSM we might use this field in selecting the target
> >   router too in a TBD way.
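For illustration only, the matching semantics of 3.2.1 and the default
fallback of 3.2.2 can be sketched in a few lines of Python. This is a minimal
sketch under assumptions, not OpenSM code: the names Rule, rule_matches and
match_qos_level are hypothetical, a request is modeled as a dict holding only
the PathRecord fields whose component mask bit is set, and only numeric fields
(Service-ID, TClass) are modeled. Source/destination node group membership is
left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A rule field is a list of inclusive (low, high) ranges;
# a single value v is written as (v, v).
Ranges = List[Tuple[int, int]]


@dataclass
class Rule:
    """Hypothetical in-memory form of one <qos_match_rule>."""
    match_level: int                          # QoS-Level selected by this rule
    fields: Dict[str, Ranges] = field(default_factory=dict)


def in_ranges(value: int, ranges: Ranges) -> bool:
    """True if value is listed explicitly or covered by one of the ranges."""
    return any(low <= value <= high for low, high in ranges)


def rule_matches(rule: Rule, request: Dict[str, int]) -> bool:
    """request holds only fields whose component mask bit is set."""
    for name, ranges in rule.fields.items():
        if name not in request:               # mask bit clear: cannot match
            return False
        if not in_ranges(request[name], ranges):
            return False
    return True                               # extra request fields are ignored


def match_qos_level(rules: List[Rule], request: Dict[str, int],
                    default_level: int) -> int:
    """Sequential search: the first matching rule wins, else the default."""
    for rule in rules:
        if rule_matches(rule, request):
            return rule.match_level
    return default_level


# Simplified versions of the two rules from the policy example
# (the destination group of rule 2 is omitted here).
rules = [
    Rule(match_level=1, fields={"tclass": [(7, 9), (11, 11)]}),
    Rule(match_level=3, fields={"service_id": [(22, 22), (4719, 4719)]}),
]
print(match_qos_level(rules, {"tclass": 8}, default_level=0))
print(match_qos_level(rules, {"service_id": 22, "tclass": 10}, default_level=0))
print(match_qos_level(rules, {"tclass": 10}, default_level=0))
```

With these rules the three requests select levels 1, 3 and the default (0)
respectively: the second request carries a TClass that rule 1 rejects, and the
third has no Service-ID mask bit set, so rule 2 cannot match it.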