Hi,

On 12/14/2012 1:24 PM, Jens Domke wrote:
> Hello Hal,
> 
> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
> 
>> Hi again,
>>
>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>> Hello Hal,
>>>
>>> thank you for the fast response. I will try to clarify some points.
>>>
>>>>> d) OpenMPI runs are executed with "--mca 
>>>>> btl_openib_ib_path_record_service_level 1"
>>>>
>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>> there should be no need to set this. The proper SL for querying the SA
>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>> pushes this into each port. That should be used. It's possible that SL1
>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not 
>>> specify the SL for querying the PathRecords.
>>> It just enables the functionality. And the ompi processes use the 
>>> PortInfo.SMSL to send the request.
>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the 
>>> SA received the requests.  
>>>>
>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>
>>>>> As far as I understand the whole system:
>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to 
>>>>> the OpenSM
>>>>> 2. the SA receives the request on QP1
>>>>
>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>> set for that port.
>>> Hmm, there you might have a point. I think I saw that the query itself had 
>>> SL=0 specified.
>>> In fact OpenMPI sets everything to 0 except for slid and dlid.
>>>>
>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about 
>>>>> a special service level for the slid/dlid path
>>>>
>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>> than the one the query used and is the one returned inside the
>>>> PathRecord attribute/data.
>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is 
>>> running on a port which is also used for MPI comm.
>>
>> With DFSSSP, are all SLs the same from a source port to any destination?
> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == 
> SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).

If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
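
To make the reversibility condition concrete, here is a minimal sketch in
Python. The sl_table dict, its values, and both function names are
hypothetical and only stand in for whatever per-direction SL assignment the
routing engine computed; this is not OpenSM code.

```python
def is_reversible(sl_table, lid_a, lid_b):
    """Return True if both directions between two LIDs use the same SL."""
    return sl_table[(lid_a, lid_b)] == sl_table[(lid_b, lid_a)]


def irreversible_pairs(sl_table):
    """Return the sorted list of LID pairs whose two directions use different SLs."""
    pairs = set()
    for (src, dst), sl in sl_table.items():
        # A missing reverse entry is treated as a mismatch.
        if sl_table.get((dst, src)) != sl:
            pairs.add(tuple(sorted((src, dst))))
    return sorted(pairs)


# Made-up per-direction SL assignment, for illustration only.
sl_table = {
    (16, 19): 6,
    (19, 16): 6,
    (16, 23): 2,
    (23, 16): 5,
}

print(is_reversible(sl_table, 16, 19))   # same SL in both directions
print(irreversible_pairs(sl_table))      # pairs where the SLs differ
```

A response sent back on the request's SL can only be expected to arrive if
the pair (requester, SA) would not show up in such a list.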

>>
>>>>
>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in 
>>>>> libvendor/osm_vendor_ibumad.c
>>>>
>>>> By the response reversibility rule, I think this is returned on the SL
>>>> of the original query but haven't verified this in the code base yet.
>>> Ok, I was not aware of that rule. But if this is true, then the SA should 
>>> also be able to send via SL>0.
>>
>> I double-checked, and indeed the SA response does use the SL that the
>> incoming request was received on.
>>
>>>>
>>>>> The osm_vendor_send() function builds the MAD packet with the following 
>>>>> attributes:
>>>>>       /* GS classes */
>>>>>       umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>                         p_mad_addr->addr_type.gsi.remote_qp,
>>>>>                         p_mad_addr->addr_type.gsi.service_level,
>>>>>                         IB_QP1_WELL_KNOWN_Q_KEY);
>>>>> So, the SL is the same as the one used by the OMPI process. 
>>>>> The Q_Key matches the Q_Key on the OMPI process, and remote_qp and 
>>>>> dest_lid are correct, too.
>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, 
>>>>> and this send does not work (except for SL=0).
>>>>
>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>> requester with no message in the OpenSM log or not received at the
>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>> received at the SM or the response not to make it back to the requester
>>>> from the SA if the SL used is not "reversible".
>>> By "not working" I mean, that the MPI process does not receive any response 
>>> from the SA.
>>> I get messages from the MPI process like the following:
>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info]
>>>  No response from SA after 20 retries
>>> The log of OpenSM shows that the SA received the PathRequest query, dumps 
>>> the query into the log, and sends the reply back.
>>> And I think I saw some messages in the log about "…1 outstanding MAD…".
>>>>
>>>>> If I look into the MAD before it is sent, then it looks like this:
>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>   at src/umad.c:791
>>>>> 791             if (umaddebug > 1)
>>>>> (gdb) p *mad
>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0,
>>>>>   addr = {qpn = 1325427712, qkey = 384,
>>>>>   lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000',
>>>>>   hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0,
>>>>>   pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>
>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>> OpenSM side ? SL is 6 rather than 1 here.
>>> This is the response on the OpenSM side (inside the umad_send function, 
>>> right before it is written to the device with write(fd, …)).
>>> SL=6 indicates that the MPI process was sending the request on SL 6.
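
A side note on reading that gdb dump: the qpn, qkey and lid members of the
umad address are kept in network byte order, so gdb on a little-endian host
(x86_64 here) prints them byte-swapped. The small Python sketch below
converts the printed values back; the helper name net_to_host is made up for
this illustration, and a little-endian host is assumed.

```python
def net_to_host(value, width):
    """Convert an integer that a little-endian host read from
    big-endian (network order) bytes back to its real value."""
    return int.from_bytes(value.to_bytes(width, "little"), "big")


lid = net_to_host(4096, 2)          # lid  = 4096 in the dump
qkey = net_to_host(384, 4)          # qkey = 384 in the dump
qpn = net_to_host(1325427712, 4)    # qpn  = 1325427712 in the dump

print(lid)        # 16
print(hex(qkey))  # 0x80010000
print(hex(qpn))   # 0x6c004f
```

Decoded this way, the qkey is 0x80010000 (decimal 2147549184, the well-known
QP1 Q_Key) and the destination LID is 16, so the addressing of the response
looks sane and only the SL stands out.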
>>
>> What is SMSL for the requester ? Was it SL 6 ?
> Yes, it was SL 6.
> Here is the content of a similar packet that was received by the SA. I 
> used ibdump on the port where OpenSM was running:
> ======================================================================================
> No.     Time        Source                Destination           Protocol   Length Info
>     785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
> 
> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>     Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>     Epoch Time: 1355389784.437633332 seconds
>     [Time delta from previous captured frame: 4.332020528 seconds]
>     [Time delta from previous displayed frame: 4.332020528 seconds]
>     [Time since reference or first frame: 14.352168681 seconds]
>     Frame Number: 785
>     Frame Length: 290 bytes (2320 bits)
>     Capture Length: 290 bytes (2320 bits)
>     [Frame is marked: False]
>     [Frame is ignored: False]
>     [Protocols in frame: erf:infiniband]
> Extensible Record Format
>     [ERF Header]
>         Timestamp: 0x50c99b587008bcf2
>         [Header type]
>             .001 0101 = type: INFINIBAND (21)
>             0... .... = Extension header present: 0
>         0000 0100 = flags: 4
>             .... ..00 = capture interface: 0
>             .... .1.. = varying record length: 1
>             .... 0... = truncated: 0
>             ...0 .... = rx error: 0
>             ..0. .... = ds error: 0
>             00.. .... = reserved: 0
>         record length: 306
>         loss counter: 0
>         wire length: 290
> InfiniBand
>     Local Route Header
>         0110 .... = Virtual Lane: 0x06
>         .... 0000 = Link Version: 0
>         0110 .... = Service Level: 6
>         .... 00.. = Reserved (2 bits): 0
>         .... ..10 = Link Next Header: 0x02
>         Destination Local ID: 19
>         0000 0... .... .... = Reserved (5 bits): 0
>         .... .000 0100 1000 = Packet Length: 72
>         Source Local ID: 16
>     Base Transport Header
>         Opcode: 100
>         1... .... = Solicited Event: True
>         .1.. .... = MigReq: True
>         ..00 .... = Pad Count: 0
>         .... 0000 = Header Version: 0
>         Partition Key: 65535
>         Reserved (8 bits): 0
>         Destination Queue Pair: 0x000001
>         0... .... = Acknowledge Request: False
>         .000 0000 = Reserved (7 bits): 0
>         Packet Sequence Number: 0
>     DETH - Datagram Extended Transport Header
>         Queue Key: 2147549184
>         Reserved (8 bits): 0
>         Source Queue Pair: 0x00380050
>     MAD Header - Common Management Datagram
>         Base Version: 0x01
>         Management Class: 0x03
>         Class Version: 0x02
>         Method: Get() (0x01)
>         Status: 0x0000
>         Class Specific: 0x0000
>         Transaction ID: 0x0010000f38005000
>         Attribute ID: 0x0035
>         Reserved: 0x0000
>         Attribute Modifier: 0x00000000
>         MAD Data Payload: 000000000000000000000000000000000000000000000000...
>      Illegal RMPP Type (0)! 
>         RMPP Type: 0x00
>         RMPP Type: 0x00
>         0000 .... = R Resp Time: 0x00
>         .... 0000 = RMPP Flags: Unknown (0x00)
>         RMPP Status:  (Normal) (0x00)
>         RMPP Data 1: 0x00000000
>         RMPP Data 2: 0x00000000
>     SMASubnAdmGet(PathRecord)
>         SM_Key (Verification Key): 0x0000000000000000
>         Attribute Offset: 0x0000
>         Reserved: 0x0000
>         Component Mask: 0x0000003000000000
>         Attribute (PathRecord)
>             PathRecord
>                 DGID: :: (::)
>                 SGID: ::0.15.0.16 (::0.15.0.16)
>                 DLID: 0x0000
>                 SLID: 0x0000
>                 0... .... = RawTraffic: 0x00
>                 .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>                 HopLimit: 0x00
>                 TClass: 0x00
>                 0... .... = Reversible: 0x00
>                 .000 0000 = NumbPath: 0x00
>                 P_Key: 0x0000
>                 .... .... .... 0000 = SL: 0x0000
>                 00.. .... = MTUSelector: 0x00
>                 ..00 0000 = MTU: 0x00
>                 00.. .... = RateSelector: 0x00
>                 ..00 0000 = Rate: 0x00
>                 00.. .... = PacketLifeTimeSelector: 0x00
>                 ..00 0000 = PacketLifeTime: 0x00
>                 Preference: 0x00
>     Variant CRC: 0xad4e
> ======================================================================================

And the SubnAdmGetResp(PathRecord) is not seen? If not, it doesn't get
out of that machine and the issue is internal to that machine. It could be
because of the underlying issue which hangs OpenSM when some IB program
tries to unregister from the MAD layer while there are outstanding work
completions. That's based on your original email earlier this AM.

>>
>> One would need to walk the SLToVLMappingTables from requester (OMPI
>> port) to SA and back to see whether SL6 would even have a chance of
>> working (not dropping) aside from whether it's really the correct SL to use.
> All SL2VL tables look the same. I checked the output of OpenSM.
>       SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>       VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
> But this is also as expected, because I have set the QoS in the opensm config 
> as follows:
>       qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
> This was set for "default", "CA" and "Switch external ports". I have not 
> touched the config for "Switch Port 0" and "Router ports", they remained: 
> qos_[sw0 | rtr]_sl2vl (null)

That works as long as all links have (at least) 8 data VLs (VLCap 4).
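
As a rough illustration of that condition, the Python sketch below checks a
qos_sl2vl string against a VLCap value. The VLCap encoding (1 = VL0,
2 = VL0-1, 3 = VL0-3, 4 = VL0-7, 5 = VL0-14) is the PortInfo one; the
function names are made up, and the sketch looks at a single capability
value only, whereas on a real fabric the operational VLs of every link on
the path are what count.

```python
# PortInfo.VLCap encoding -> number of data VLs (VL15 is management only).
DATA_VLS_BY_VLCAP = {1: 1, 2: 2, 3: 4, 4: 8, 5: 15}


def parse_sl2vl(option):
    """Parse a comma-separated qos_sl2vl value into a list of 16 VLs."""
    table = [int(v, 0) for v in option.split(",")]
    if len(table) != 16:
        raise ValueError("expected 16 entries, one per SL")
    return table


def unusable_sls(table, vlcap):
    """Return the SLs whose mapped VL does not exist for this VLCap."""
    data_vls = DATA_VLS_BY_VLCAP[vlcap]
    return [sl for sl, vl in enumerate(table) if vl >= data_vls]


table = parse_sl2vl("0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7")
print(unusable_sls(table, 4))  # []
print(unusable_sls(table, 3))  # [4, 5, 6, 7, 12, 13, 14, 15]
```

With only 4 data VLs (VLCap 3) somewhere on the path, SL 6 would be among
the SLs that cannot be mapped, which is why the VL count on every link
matters for this table.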

-- Hal

> Regards
> Jens
> 
>>
>> -- Hal
>>
>>>>
>>>>> The output of OpenMPI or OpenSM's log file don't show any useful 
>>>>> information for this problem, even with higher debug levels.
>>>>
>>>> So nothing interesting logged relative to the PathRecord queries ?
>>> In the OpenSM log, only that it was received, what the request looks like, 
>>> and that it was sent back.
>>> And a few "outstanding MADs" a few lines later in the log.
>>>>
>>>>> So, right now I'm stuck, and have no idea if there is an error in the 
>>>>> kernel driver, the HCA firmware or something completely different. Or if 
>>>>> umad_send basically does not support SL>0.
>>>>> A workaround for the moment is to set the SL in the 
>>>>> umad_set_addr_net(...) call to 0.
>>>>
>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>> that's how SMSL is set by DFSSSP.
>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our 
>>> case (OpenSM running on a compute node), it sets the same SL that is used
>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>
>>> Regards
>>> Jens
>>>
>>> --------------------------------
>>> Dipl.-Math. Jens Domke
>>> Researcher - Tokyo Institute of Technology
>>> Satoshi MATSUOKA Laboratory
>>> Global Scientific Information and Computing Center
>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>> Tokyo, 152-8550, JAPAN
>>> Tel/Fax: +81-3-5734-3876
>>> E-Mail: domke.j...@m.titech.ac.jp
>>> --------------------------------
>>>
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
