Hello Hal,

On Dec 17, 2012, at 9:04 PM, Hal Rosenstock wrote:

> Hi,
> 
> On 12/17/2012 1:16 AM, Jens Domke wrote:
>> Hello Hal,
>> 
>> I have checked the smpquery and saquery command today.
>> 
>> The smpquery SL2VL and PI commands for the opensm port work fine, and I get 
>> the expected results:
>> ======================================================
>> # SL2VL table: Lid 19
>> #                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
>> ports: in  0, out  0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
>> ======================================================
>> # Port info: Lid 19 port 0
>> Mkey:............................<not displayed>
>> GidPrefix:.......................0xfe80000000000000
>> Lid:.............................19
>> SMLid:...........................19
>> CapMask:.........................0x251086a
>>                                IsSM
>>                                IsTrapSupported
>>                                IsAutomaticMigrationSupported
>>                                IsSLMappingSupported
>>                                IsSystemImageGUIDsupported
>>                                IsCommunicatonManagementSupported
>>                                IsVendorClassSupported
>>                                IsCapabilityMaskNoticeSupported
>>                                IsClientRegistrationSupported
>> DiagCode:........................0x0000
>> MkeyLeasePeriod:.................0
>> LocalPort:.......................1
>> LinkWidthEnabled:................1X or 4X
>> LinkWidthSupported:..............1X or 4X
>> LinkWidthActive:.................4X
>> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
>> LinkState:.......................Active
>> PhysLinkState:...................LinkUp
>> LinkDownDefState:................Polling
>> ProtectBits:.....................0
>> LMC:.............................0
>> LinkSpeedActive:.................5.0 Gbps
>> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
>> NeighborMTU:.....................2048
>> SMSL:............................0
>> VLCap:...........................VL0-7
>> InitType:........................0x00
>> VLHighLimit:.....................0
>> VLArbHighCap:....................8
>> VLArbLowCap:.....................8
>> InitReply:.......................0x00
>> MtuCap:..........................2048
>> VLStallCount:....................0
>> HoqLife:.........................31
>> OperVLs:.........................VL0-7
>> PartEnforceInb:..................0
>> PartEnforceOutb:.................0
>> FilterRawInb:....................0
>> FilterRawOutb:...................0
>> MkeyViolations:..................0
>> PkeyViolations:..................0
>> QkeyViolations:..................0
>> GuidCap:.........................32
>> ClientReregister:................0
>> McastPkeyTrapSuppressionEnabled:.0
>> SubnetTimeout:...................18
>> RespTimeVal:.....................16
>> LocalPhysErr:....................8
>> OverrunErr:......................8
>> MaxCreditHint:...................0
>> RoundTrip:.......................0
>> CapabilityMask2:.................0x0000
>> LinkSpeedExtActive:..............No Extended Speed
>> LinkSpeedExtSupported:...........0
>> LinkSpeedExtEnabled:.............0
>> ======================================================
>> 
>> 
>> The problem is the saquery commands on other nodes.
>> In most cases the execution fails, and the node shows the same behaviour 
>> as the OpenSM node when it tries to send on SL>0. The PathRequest packet 
>> does not arrive at the node running OpenSM (checked with ibdump). 
>> At some point during execution the saquery binary hangs, the kernel log 
>> indicates errors, and the only option is to reboot. 
>> This is the output I see for the saquery:
>> ======================================================
>> saquery -P --src-to-dst 4:8
>> ibwarn: [2535] sa_query: umad_recv failed: attr 0x11: Connection timed out
>> 
>> Query SA failed: Connection timed out
>> ======================================================
>> (In really rare cases I get the PathRequest back and see the dump, but the 
>> saquery binary stalls afterwards, too.)
>> 
>> 
>> I did some debugging with gdb again, and stepped through the saquery code.
>> When I change the SL to 0 in the addr vector of the MAD right before 
>> umad_send is called, then everything works.
>> So, the saquery on the compute nodes shows the same behaviour as the opensm 
>> with respect to the SL value for umad_send.
>> 
>> 
>> At the end I tried to run MinHop instead of DFSSSP, and specified sm_sl 1 in 
>> the config file of opensm.
>> Sadly, this configuration results in the same crashes of the saquery 
>> commands.
>> For the runs with MinHop I used also a different SL2VL mapping, just to be 
>> sure, that there is no problem with VL>0 and every SL travels on VL=0:
>> ======================================================
>> #                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
>> ports: in  0, out  0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
>> ======================================================
> 
> Non QoS routing algorithms still need -Q otherwise the full range of QoS
> is not available. Was OpenSM started with -Q for this test ?

Yes, I had QoS enabled in my configuration file with "qos TRUE".

Jens

> 
> -- Hal
>> 
>> Regards,
>> Jens
>> 
>> 
>> On Dec 16, 2012, at 11:59 PM, Jens Domke wrote:
>> 
>>> 
>>> On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote:
>>> 
>>>> On 12/16/2012 8:39 AM, Jens Domke wrote:
>>>>> Hi,
>>>>> 
>>>>> On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> On 12/16/2012 7:03 AM, Jens Domke wrote:
>>>>>>> Hello Hal,
>>>>>>> 
>>>>>>> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>>>>>>>> Hello Hal,
>>>>>>>>> 
>>>>>>>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>>>>>>>> Hello Hal,
>>>>>>>>>>> 
>>>>>>>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi again,
>>>>>>>>>>>> 
>>>>>>>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>>>>>>>> Hello Hal,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> thank you for the fast response. I will try to clarify some 
>>>>>>>>>>>>> points.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> d) OpenMPI runs are executed with "--mca 
>>>>>>>>>>>>>>> btl_openib_ib_path_record_service_level 1"
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly 
>>>>>>>>>>>>>> but
>>>>>>>>>>>>>> there should be no need to set this. The proper SL for querying 
>>>>>>>>>>>>>> the SA
>>>>>>>>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of 
>>>>>>>>>>>>>> DFSSSP
>>>>>>>>>>>>>> (and other QoS based routing algorithms), it calculates that and 
>>>>>>>>>>>>>> the SM
>>>>>>>>>>>>>> pushes this into each port. That should be used. It's possible 
>>>>>>>>>>>>>> that SL1
>>>>>>>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level 
>>>>>>>>>>>>> does not specify the SL for querying the PathRecords.
>>>>>>>>>>>>> It just enables the functionality. And the ompi processes use the 
>>>>>>>>>>>>> PortInfo.SMSL to send the request.
>>>>>>>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, 
>>>>>>>>>>>>> and the SA received the requests.  
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> As far as I understand the whole system:
>>>>>>>>>>>>>>> 1. the OMPI processes are sending MAD requests 
>>>>>>>>>>>>>>> (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> There is the SL in the query itself. This should be the SMSL 
>>>>>>>>>>>>>> that the SM
>>>>>>>>>>>>>> set for that port.
>>>>>>>>>>>>> Hmm, there you might have a point. I think I saw that the query 
>>>>>>>>>>>>> itself had SL=0 specified.
>>>>>>>>>>>>> In fact OpenMPI sets everything to 0 except for slid and dlid.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or 
>>>>>>>>>>>>>>> Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port 
>>>>>>>>>>>>>> communication)
>>>>>>>>>>>>>> than the one the query used and is the one returned inside the
>>>>>>>>>>>>>> PathRecord attribute/data.
>>>>>>>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because 
>>>>>>>>>>>>> the SM is running on a port which is also used for MPI comm.
>>>>>>>>>>>> 
>>>>>>>>>>>> With DFSSSP are all SLs same from source port to get to any 
>>>>>>>>>>>> destination ?
>>>>>>>>>>> No, not necessarily. In general DFSSSP does not enforce 
>>>>>>>>>>> SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == 
>>>>>>>>>>> SL(LID1->LID3).
>>>>>>>>>> 
>>>>>>>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
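
The reversibility condition stated above can be checked mechanically. A minimal sketch, assuming a hypothetical dictionary `sl_of` that maps (source LID, destination LID) pairs to the SL picked by the routing engine. The LIDs and SL values are invented for illustration and are not taken from the cluster discussed in this thread:

```python
def is_reversible(sl_of, lid_a, lid_b):
    """Return True if both directions between lid_a and lid_b use the same SL."""
    forward = sl_of.get((lid_a, lid_b))
    backward = sl_of.get((lid_b, lid_a))
    # A missing entry in either direction is treated as "not reversible".
    return forward is not None and forward == backward


# Hypothetical SL assignments (illustrative values only).
sl_of = {(4, 8): 3, (8, 4): 3, (4, 19): 6, (19, 4): 1}

print(is_reversible(sl_of, 4, 8))   # same SL in both directions
print(is_reversible(sl_of, 4, 19))  # SL differs per direction
```

Running such a check over every LID pair would show quickly whether a routing result satisfies SL(LID1->LID2) == SL(LID2->LID1) everywhere or only for some pairs.
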
>>>>>>>>> True. But I don't think that the SA asks the DFSSSP routing about the 
>>>>>>>>> SL for the reversible path.
>>>>>>>>> So, the SA could use any SL which is a valid SL, even if the DFSSSP 
>>>>>>>>> would recommend another SL.
>>>>>>>>> 
>>>>>>>>> I just read the IB Specs and it says, that "SL specified in the 
>>>>>>>>> received packet is used as the SL in the response packet" for MAD 
>>>>>>>>> packets.
>>>>>>>>> So, it's most likely that there is a mismatch between the way OMPI 
>>>>>>>>> sets up the PathRequest and the way the SA builds 
>>>>>>>>> the response packet.
>>>>>>>>> OMPI always specifies SL=0 (let's say SL_a) inside of the PathRequest 
>>>>>>>>> packet, 
>>>>>>>> 
>>>>>>>> So CompMask in the query has the SL bit on and SL is set to 0 inside 
>>>>>>>> the
>>>>>>>> SubnAdmGet of PathRecord ?
>>>>>>> 
>>>>>>> No, the CompMask didn't have the SL bit set, and the SL was set to 0.
>>>>>> 
>>>>>> That means the SL in the request is wildcarded so the SA/SM fills in a
>>>>>> valid one in the response.
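
The wildcard rule above comes down to a single bit test on the component mask. A minimal sketch, assuming the PathRecord component-mask bit positions as I recall them from OpenSM's ib_types.h (DLID = bit 4, SLID = bit 5, SL = bit 15), written here in host byte order. These positions are an assumption and should be checked against the header; they do fit the 0x30 and 0x8030 patterns visible in the Component Mask fields of the dumps in this thread:

```python
# Assumed bit positions (host byte order); verify against ib_types.h.
PR_COMPMASK_DLID = 1 << 4
PR_COMPMASK_SLID = 1 << 5
PR_COMPMASK_SL = 1 << 15


def sl_is_wildcarded(comp_mask):
    """True if the SL field of the query is ignored and the SA picks the SL."""
    return (comp_mask & PR_COMPMASK_SL) == 0


original_query = PR_COMPMASK_DLID | PR_COMPMASK_SLID    # SL bit not set
patched_query = original_query | PR_COMPMASK_SL         # SL bit set

print(hex(original_query), sl_is_wildcarded(original_query))
print(hex(patched_query), sl_is_wildcarded(patched_query))
```

With the SL bit clear, the value 0 in the SL field carries no meaning; only after the bit is set does the field become a request for that specific SL.
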
>>>>> Ok.
>>>>>> 
>>>>>>> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the 
>>>>>>> only reference I found was in osm_sa_path_record.c
>>>>>>> The SA just treats the SL in the PathRequest as a "I would like to use 
>>>>>>> this SL" in case the SL bit is set.
>>>>>>> But the routing engine can overwrite the requested SL before the reply 
>>>>>>> is sent.
>>>>>>> 
>>>>>>> Nevertheless, I have changed the code of OMPI so that it sets the SL 
>>>>>>> bit in the CompMask and sets the SL to SMSL for the PathRequest, so 
>>>>>>> that SL_a == SL_b.
>>>>>>> Sadly, the reply sent by the SA does not leave the node (for SL_b>0). 
>>>>>>> Only if I change the SL to 0 in the MAD right before umad_send is 
>>>>>>> called by the SA, the packet is able to leave the node and reaches the 
>>>>>>> OMPI process.
>>>>>> 
>>>>>> Are you sure the response doesn't leave the SA node or it's not received
>>>>>> at the requester (OMPI node) ?
>>>>> No, I'm not sure. Is there any possibility to check that? As far as I 
>>>>> know, ibdump does not show MAD packets which leave a port, it only shows 
>>>>> the packets when they are received on the other end.
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> and sends the packet on SL_b (PortInfo.SMSL).
>>>>>>>> 
>>>>>>>> Good.
>>>>>>>> 
>>>>>>>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, 
>>>>>>>>> for the response.
>>>>>>>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>>>>>>> 
>>>>>>>> Depends. It may be that both SLs work but maybe not.
>>>>>>>> 
>>>>>>>>> If I analyse this correctly, then there are two bugs. One is in OMPI, 
>>>>>>>>> that it does not specify the SL within the PathRequest in an 
>>>>>>>>> appropriate way (which would be a SL suggested by DFSSSP for the 
>>>>>>>>> reversible path). And the second bug is that the SA uses the SL, on 
>>>>>>>>> which the PathRequest packet was sent, and not the SL specified 
>>>>>>>>> within the packet.
>>>>>>>>> What do you think?
>>>>>>>> 
>>>>>>>> Yes, it might be better to wildcard the SL in the query. The only
>>>>>>>> scenario that would fail with the query you are making if there's no SL
>>>>>>>> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
>>>>>>>> If that's the case, SA should return MAD status 0xc (status code 3 -
>>>>>>>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>>>>>>>> OMPI node so it's not even getting that far.
>>>>>>> 
>>>>>>> Yes, exactly. So, do you have an idea why the response hangs in the SA 
>>>>>>> node?
>>>>>>> I have no insight into the underlying layer (kernel driver and firmware). 
>>>>>>> Maybe there are some implementations, which prevent the SA from sending 
>>>>>>> MADs back on SL>0?
>>>>>> 
>>>>>> If you're sure this response doesn't get out of the SA node, please
>>>>>> contact Mellanox support with the details.
>>>>> Ok, I can do this, if it turns out to be true.
>>>>>> 
>>>>>>>> 
>>>>>>>>> I can try to change the PathRequest of OMPI tomorrow, so that it 
>>>>>>>>> matches addr_type.gsi.service_level.
>>>>>>>>> Maybe, with this change the packets of the SA will reach the OMPI 
>>>>>>>>> process on a SL>0.
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via 
>>>>>>>>>>>>>>> umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> By the response reversibility rule, I think this is returned on 
>>>>>>>>>>>>>> the SL
>>>>>>>>>>>>>> of the original query but haven't verified this in the code base 
>>>>>>>>>>>>>> yet.
>>>>>>>>>>>>> Ok, I was not aware of that rule. But if this is true, then the 
>>>>>>>>>>>>> SA should also be able to send via SL>0.
>>>>>>>>>>>> 
>>>>>>>>>>>> I double checked and indeed the SA response does use the SL that 
>>>>>>>>>>>> the
>>>>>>>>>>>> incoming request was received on.
>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the 
>>>>>>>>>>>>>>> following attributes:
>>>>>>>>>>>>>>> /* GS classes */
>>>>>>>>>>>>>>> umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>>>>>>>                   p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>>>>>>>                   p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>>>>>>>                   IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>>>>>>>> So, the SL is the same as the one which was used by the OMPI 
>>>>>>>>>>>>>>> process. The Q_Key matches the Q_Key on the OMPI process, and 
>>>>>>>>>>>>>>> remote_qp and dest_lid are correct, too.
>>>>>>>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the 
>>>>>>>>>>>>>>> PathRecord, and this send does not work (except for SL=0).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> By not working, what do you mean ? Do you mean it's not received 
>>>>>>>>>>>>>> at the
>>>>>>>>>>>>>> requester with no message in the OpenSM log or not received at 
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> OpenSM or something else ? It could be due to the wrong SL being 
>>>>>>>>>>>>>> used in
>>>>>>>>>>>>>> the original request (forcing it to SL 1). That could cause it 
>>>>>>>>>>>>>> not to be
>>>>>>>>>>>>>> received at the SM or the response not to make it back to the 
>>>>>>>>>>>>>> requester
>>>>>>>>>>>>>> from the SA if the SL used is not "reversible".
>>>>>>>>>>>>> By "not working" I mean, that the MPI process does not receive 
>>>>>>>>>>>>> any response from the SA.
>>>>>>>>>>>>> I get messages from the MPI process like the following:
>>>>>>>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info]
>>>>>>>>>>>>>  No response from SA after 20 retries
>>>>>>>>>>>>> The log of OpenSM shows that the SA received the PathRequest 
>>>>>>>>>>>>> query, dumps the query into the log, and sends the reply back.
>>>>>>>>>>>>> And I think I saw some messages in the log about "…1 outstanding 
>>>>>>>>>>>>> MAD…".
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If I look into the MAD before it is sent, then it looks like 
>>>>>>>>>>>>>>> this:
>>>>>>>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, 
>>>>>>>>>>>>>>> length=120, timeout_ms=0, retries=3)
>>>>>>>>>>>>>>> at src/umad.c:791
>>>>>>>>>>>>>>> 791             if (umaddebug > 1)
>>>>>>>>>>>>>>> (gdb) p *mad
>>>>>>>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, 
>>>>>>>>>>>>>>> length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 
>>>>>>>>>>>>>>> 0 '\000', gid_index = 0 '\000', 
>>>>>>>>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' 
>>>>>>>>>>>>>>> <repeats 15 times>, flow_label = 0, 
>>>>>>>>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 
>>>>>>>>>>>>>>> 0x7fffe8012530 "\002"}
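
The addr values in this gdb dump look odd when read as plain host integers. A minimal sketch of how they can be decoded, assuming the host is little-endian (the thread mentions an x86_64 kernel) and assuming libibumad keeps qpn, qkey and lid in network byte order; the helper name `be_to_host` is made up for this example:

```python
def be_to_host(value, width):
    """Reinterpret an integer that gdb printed from a big-endian field
    of `width` bytes on a little-endian host."""
    return int.from_bytes(value.to_bytes(width, "little"), "big")


print(be_to_host(4096, 2))             # lid
print(hex(be_to_host(384, 4)))         # qkey
print(hex(be_to_host(1325427712, 4)))  # qpn
```

Under these assumptions lid 4096 decodes to 16, which agrees with "Source Local ID: 16" in the ibdump capture quoted elsewhere in this thread, and qkey 384 decodes to 0x80010000 (2147549184), which agrees with the Queue Key shown in that capture.
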
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response 
>>>>>>>>>>>>>> on the
>>>>>>>>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>>>>>>>>> This is the response on the OpenSM side (inside the umad_send 
>>>>>>>>>>>>> function, right before it is written to the device with write(fd, 
>>>>>>>>>>>>> …).
>>>>>>>>>>>>> SL=6 indicates, that the MPI process was sending the request on 
>>>>>>>>>>>>> SL 6.
>>>>>>>>>>>> 
>>>>>>>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>>>>>>>> Yes, it was SL 6.
>>>>>>>>>>> Here is a content of a similar packet which was received by the SA. 
>>>>>>>>>>> I have used ibdump on the port where the OpenSM was running:
>>>>>>>>>>> ======================================================================================
>>>>>>>>>>> No.     Time        Source                Destination           Protocol Length Info
>>>>>>>>>>> 785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>>>>>>>>> 
>>>>>>>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 
>>>>>>>>>>> bits)
>>>>>>>>>>> Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>>>>>>> Epoch Time: 1355389784.437633332 seconds
>>>>>>>>>>> [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>>>>>>> [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>>>>>>> [Time since reference or first frame: 14.352168681 seconds]
>>>>>>>>>>> Frame Number: 785
>>>>>>>>>>> Frame Length: 290 bytes (2320 bits)
>>>>>>>>>>> Capture Length: 290 bytes (2320 bits)
>>>>>>>>>>> [Frame is marked: False]
>>>>>>>>>>> [Frame is ignored: False]
>>>>>>>>>>> [Protocols in frame: erf:infiniband]
>>>>>>>>>>> Extensible Record Format
>>>>>>>>>>> [ERF Header]
>>>>>>>>>>>   Timestamp: 0x50c99b587008bcf2
>>>>>>>>>>>   [Header type]
>>>>>>>>>>>       .001 0101 = type: INFINIBAND (21)
>>>>>>>>>>>       0... .... = Extension header present: 0
>>>>>>>>>>>   0000 0100 = flags: 4
>>>>>>>>>>>       .... ..00 = capture interface: 0
>>>>>>>>>>>       .... .1.. = varying record length: 1
>>>>>>>>>>>       .... 0... = truncated: 0
>>>>>>>>>>>       ...0 .... = rx error: 0
>>>>>>>>>>>       ..0. .... = ds error: 0
>>>>>>>>>>>       00.. .... = reserved: 0
>>>>>>>>>>>   record length: 306
>>>>>>>>>>>   loss counter: 0
>>>>>>>>>>>   wire length: 290
>>>>>>>>>>> InfiniBand
>>>>>>>>>>> Local Route Header
>>>>>>>>>>>   0110 .... = Virtual Lane: 0x06
>>>>>>>>>>>   .... 0000 = Link Version: 0
>>>>>>>>>>>   0110 .... = Service Level: 6
>>>>>>>>>>>   .... 00.. = Reserved (2 bits): 0
>>>>>>>>>>>   .... ..10 = Link Next Header: 0x02
>>>>>>>>>>>   Destination Local ID: 19
>>>>>>>>>>>   0000 0... .... .... = Reserved (5 bits): 0
>>>>>>>>>>>   .... .000 0100 1000 = Packet Length: 72
>>>>>>>>>>>   Source Local ID: 16
>>>>>>>>>>> Base Transport Header
>>>>>>>>>>>   Opcode: 100
>>>>>>>>>>>   1... .... = Solicited Event: True
>>>>>>>>>>>   .1.. .... = MigReq: True
>>>>>>>>>>>   ..00 .... = Pad Count: 0
>>>>>>>>>>>   .... 0000 = Header Version: 0
>>>>>>>>>>>   Partition Key: 65535
>>>>>>>>>>>   Reserved (8 bits): 0
>>>>>>>>>>>   Destination Queue Pair: 0x000001
>>>>>>>>>>>   0... .... = Acknowledge Request: False
>>>>>>>>>>>   .000 0000 = Reserved (7 bits): 0
>>>>>>>>>>>   Packet Sequence Number: 0
>>>>>>>>>>> DETH - Datagram Extended Transport Header
>>>>>>>>>>>   Queue Key: 2147549184
>>>>>>>>>>>   Reserved (8 bits): 0
>>>>>>>>>>>   Source Queue Pair: 0x00380050
>>>>>>>>>>> MAD Header - Common Management Datagram
>>>>>>>>>>>   Base Version: 0x01
>>>>>>>>>>>   Management Class: 0x03
>>>>>>>>>>>   Class Version: 0x02
>>>>>>>>>>>   Method: Get() (0x01)
>>>>>>>>>>>   Status: 0x0000
>>>>>>>>>>>   Class Specific: 0x0000
>>>>>>>>>>>   Transaction ID: 0x0010000f38005000
>>>>>>>>>>>   Attribute ID: 0x0035
>>>>>>>>>>>   Reserved: 0x0000
>>>>>>>>>>>   Attribute Modifier: 0x00000000
>>>>>>>>>>>   MAD Data Payload: 
>>>>>>>>>>> 000000000000000000000000000000000000000000000000...
>>>>>>>>>>> Illegal RMPP Type (0)! 
>>>>>>>>>>>   RMPP Type: 0x00
>>>>>>>>>>>   RMPP Type: 0x00
>>>>>>>>>>>   0000 .... = R Resp Time: 0x00
>>>>>>>>>>>   .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>>>>>>>   RMPP Status:  (Normal) (0x00)
>>>>>>>>>>>   RMPP Data 1: 0x00000000
>>>>>>>>>>>   RMPP Data 2: 0x00000000
>>>>>>>>>>> SMASubnAdmGet(PathRecord)
>>>>>>>>>>>   SM_Key (Verification Key): 0x0000000000000000
>>>>>>>>>>>   Attribute Offset: 0x0000
>>>>>>>>>>>   Reserved: 0x0000
>>>>>>>>>>>   Component Mask: 0x0000003000000000
>>>>>>>>>>>   Attribute (PathRecord)
>>>>>>>>>>>       PathRecord
>>>>>>>>>>>           DGID: :: (::)
>>>>>>>>>>>           SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>>>>>>>           DLID: 0x0000
>>>>>>>>>>>           SLID: 0x0000
>>>>>>>>>>>           0... .... = RawTraffic: 0x00
>>>>>>>>>>>           .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>>>>>>>           HopLimit: 0x00
>>>>>>>>>>>           TClass: 0x00
>>>>>>>>>>>           0... .... = Reversible: 0x00
>>>>>>>>>>>           .000 0000 = NumbPath: 0x00
>>>>>>>>>>>           P_Key: 0x0000
>>>>>>>>>>>           .... .... .... 0000 = SL: 0x0000
>>>>>>>>>>>           00.. .... = MTUSelector: 0x00
>>>>>>>>>>>           ..00 0000 = MTU: 0x00
>>>>>>>>>>>           00.. .... = RateSelector: 0x00
>>>>>>>>>>>           ..00 0000 = Rate: 0x00
>>>>>>>>>>>           00.. .... = PacketLifeTimeSelector: 0x00
>>>>>>>>>>>           ..00 0000 = PacketLifeTime: 0x00
>>>>>>>>>>>           Preference: 0x00
>>>>>>>>>>> Variant CRC: 0xad4e
>>>>>>>>>>> ======================================================================================
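
The length fields in this capture can be cross-checked against each other. A minimal sketch, assuming my reading of the InfiniBand specification is right that the LRH Packet Length field counts 4-byte words from the first byte of the LRH up to the last byte before the 2-byte variant CRC:

```python
VCRC_BYTES = 2  # variant CRC, not covered by the LRH Packet Length field


def wire_length(pkt_len_words):
    """Bytes on the wire for a local packet with the given LRH Packet Length."""
    return pkt_len_words * 4 + VCRC_BYTES


# The 11-bit Packet Length field as shown in the capture: .000 0100 1000
pkt_len = int("00001001000", 2)

print(pkt_len)
print(wire_length(pkt_len))
```

With Packet Length 72 this gives 72 * 4 + 2 = 290 bytes, the same as the "wire length" and "Frame Length" reported by the capture, so at least the framing of the request looks consistent.
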
>>>>>>>>>> 
>>>>>>>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't 
>>>>>>>>>> get
>>>>>>>>>> out that machine and the issue is internal to that machine. It could 
>>>>>>>>>> be
>>>>>>>>>> because of the underlying issue which hangs OpenSM when some IB 
>>>>>>>>>> program
>>>>>>>>>> tried to unregister from the MAD layer but there were outstanding 
>>>>>>>>>> work
>>>>>>>>>> completions. That's based on your original email earlier this AM.
>>>>>>>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI 
>>>>>>>>> side and the SA uses a SL>0.
>>>>>>>> 
>>>>>>>> Can ibdump be used to capture output on the SM port ?
>>>>>>> 
>>>>>>> Yes, that works quite well, despite the warning in the ibdump manual.
>>>>>>> But I have started ibdump before opensm, maybe that makes a difference, 
>>>>>>> not sure.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Jens
>>>>>>> 
>>>>>>> PS: I have seen a small bug. Not sure if it's a bug in wireshark or 
>>>>>>> ibdump, but the response received by the OMPI node isn't shown 
>>>>>>> correctly. The PathRecord contains an offset which is either missing in 
>>>>>>> the dump or is not treated correctly by wireshark. But it causes 
>>>>>>> wireshark to show the PathRecord data with wrong values.
>>>>>>> Maybe you could redirect this to the developer of ibdump, so that he 
>>>>>>> can check/fix it.
>>>>>> 
>>>>>> Are you referring to the fields after the SA AttributeOffset or
>>>>>> something else ?
>>>>> Yes, after the SMASubnAdmGet Attribute Offset. Here an example:
>>>>> I get on the OMPI side:
>>>>>  SMASubnAdmGetResp(PathRecord)
>>>>>      SM_Key (Verification Key): 0x0000000000000000
>>>>>      Attribute Offset: 0x0008
>>>>>      Reserved: 0x0000
>>>>>      Component Mask: 0x0000803000000000
>>>>>      Attribute (PathRecord)
>>>>>          PathRecord
>>>>>              DGID: ::8:f104:399:ebb5:fe80:0 (::8:f104:399:ebb5:fe80:0)
>>>>>              SGID: ::8:f104:399:ecd5:4:8 (::8:f104:399:ecd5:4:8)
>>>>>              DLID: 0x0000
>>>>>              SLID: 0x0000
>>>>>              0... .... = RawTraffic: 0x00
>>>>>              .... 0000 1000 0000 1111 1111 = FlowLabel: 0x0080ff
>>>>>              HopLimit: 0xff
>>>>>              TClass: 0x00
>>>>>              0... .... = Reversible: 0x00
>>>>>              .000 0011 = NumbPath: 0x03
>>>>>              P_Key: 0x8486
>>>>>              .... .... .... 0000 = SL: 0x0000
>>>>>              00.. .... = MTUSelector: 0x00
>>>>>              ..00 0000 = MTU: 0x00
>>>>>              00.. .... = RateSelector: 0x00
>>>>>              ..00 0000 = Rate: 0x00
>>>>>              00.. .... = PacketLifeTimeSelector: 0x00
>>>>>              ..00 0000 = PacketLifeTime: 0x00
>>>>>              Preference: 0x00
>>>>> 
>>>>> But it should show (see the difference in SLID, DLID, SL which are now 
>>>>> correct):
>>>>>  SMASubnAdmGetResp(PathRecord)
>>>>>      SM_Key (Verification Key): 0x0000000000000000
>>>>>      Attribute Offset: 0x0008
>>>>>      Reserved: 0x0000
>>>>>      Component Mask: 0x0000803000000000
>>>>>      Attribute (PathRecord)
>>>>>          PathRecord
>>>>>              DGID: ::8:f104:399:ebb5 (::8:f104:399:ebb5)
>>>>>              SGID: fe80::8:f104:399:ecd5 (fe80::8:f104:399:ecd5)
>>>>>              DLID: 0x0004
>>>>>              SLID: 0x0008
>>>>>              0... .... = RawTraffic: 0x00
>>>>>              .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>              HopLimit: 0x00
>>>>>              TClass: 0x00
>>>>>              1... .... = Reversible: 0x01
>>>>>              .000 0000 = NumbPath: 0x00
>>>>>              P_Key: 0xffff
>>>>>              .... .... .... 0011 = SL: 0x0003
>>>>>              10.. .... = MTUSelector: 0x02
>>>>>              ..00 0100 = MTU: 0x04
>>>>>              10.. .... = RateSelector: 0x02
>>>>>              ..00 0110 = Rate: 0x06
>>>>>              10.. .... = PacketLifeTimeSelector: 0x02
>>>>>              ..01 0010 = PacketLifeTime: 0x12
>>>>>              Preference: 0x00
>>>> 
>>>> 
>>>> I think everything after AttributeOffset is off by 2 bytes. DGID doesn't
>>>> look right to me (no subnet prefix fe80:: in front of GUID).
>>> 
>>> Yes, I made a small mistake with the hex editor. I started the shift after 
>>> the subnet prefix.
>>> Sorry for the confusion.
>>> 
>>> Thank you for the hint with smpquery and saquery, I will check that 
>>> tomorrow.
>>> 
>>> Jens
>>> 
>>>> 
>>>> -- Hal
>>>> 
>>>>> 
>>>>> Regards,
>>>>> Jens
>>>>> 
>>>>>> 
>>>>>> -- Hal
>>>>>> 
>>>>>>>> 
>>>>>>>> -- Hal
>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>>>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>>>>>>>> working (not dropping) aside from whether it's really the correct 
>>>>>>>>>>>> SL to use.
>>>>>>>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>>>>>>>>     SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>>>>>>>>>>     VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>>>>>>>>> But this is also as expected, because I have set the QoS in the 
>>>>>>>>>>> opensm config as follows:
>>>>>>>>>>>     qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>>>>>>>> This was set for "default", "CA" and "Switch external ports". I 
>>>>>>>>>>> have not touched the config for "Switch Port 0" and "Router ports", 
>>>>>>>>>>> they remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>>>>>>> 
>>>>>>>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>>>>>>>> Yes, all VL_CAP show 4 in the OpenSM log file.
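
The check discussed here (does every SL map to a data VL that the link actually has?) can be sketched for a single link. This assumes the usual PortInfo VLCap encoding (1 = VL0, 2 = VL0-1, 3 = VL0-3, 4 = VL0-7, 5 = VL0-14); the function names are made up for this example, and walking a whole path would mean repeating the check for the table of every hop:

```python
# Number of data VLs for each PortInfo VLCap code (assumed standard encoding).
DATA_VLS_BY_VLCAP = {1: 1, 2: 2, 3: 4, 4: 8, 5: 15}


def parse_sl2vl(text):
    """Parse a qos_sl2vl style string into a list of 16 VL numbers."""
    table = [int(v) for v in text.split(",")]
    if len(table) != 16:
        raise ValueError("an SL2VL table needs exactly 16 entries")
    return table


def unusable_sls(sl2vl, vl_cap):
    """Return the SLs whose VL does not exist on a link with this VLCap.
    VL15 is never a data VL, so an SL mapped to it is reported as well."""
    data_vls = DATA_VLS_BY_VLCAP[vl_cap]
    return [sl for sl, vl in enumerate(sl2vl) if vl >= data_vls]


table = parse_sl2vl("0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7")
print(unusable_sls(table, 4))  # links with VL0-7
print(unusable_sls(table, 3))  # links with only VL0-3
```

For the table from the config and VLCap 4 the list is empty, which fits the statement that the mapping works with 8 data VLs; with VLCap 3 the same table would leave SLs 4-7 and 12-15 without a usable VL.
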
>>>>>>>>> 
>>>>>>>>> Regards
>>>>>>>>> Jens
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -- Hal
>>>>>>>>>> 
>>>>>>>>>>> Regards
>>>>>>>>>>> Jens
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> -- Hal
>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any 
>>>>>>>>>>>>>>> useful information for this problem, even with higher debug 
>>>>>>>>>>>>>>> levels.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So nothing interesting logged relative to the PathRecord queries 
>>>>>>>>>>>>>> ?
>>>>>>>>>>>>> In the OpenSM log, only that it was received, what the request 
>>>>>>>>>>>>> looks like, and that it was sent back.
>>>>>>>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error 
>>>>>>>>>>>>>>> in the kernel driver, the HCA firmware or something completely 
>>>>>>>>>>>>>>> different. Or if umad_send basically does not support SL>0.
>>>>>>>>>>>>>>> A workaround for the moment is to set the SL in the 
>>>>>>>>>>>>>>> umad_set_addr_net(...) call to 0.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So SL 0 works between all nodes and SA for querying/responses. 
>>>>>>>>>>>>>> Wonder if
>>>>>>>>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>>>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked 
>>>>>>>>>>>>> this. In our case (OpenSM running on a compute node), it sets the 
>>>>>>>>>>>>> same SL, which is used for MPI<->MPI traffic, to ensure deadlock freedom.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> Jens
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>>>>>>> E-Mail: domke.j...@m.titech.ac.jp
>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe 
>>>>>>>>>>>> linux-rdma" in
>>>>>>>>>>>> the body of a message to majord...@vger.kernel.org
>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--------------------------------
Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, 
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: domke.j...@m.titech.ac.jp
--------------------------------
