Hi,

On 12/17/2012 1:16 AM, Jens Domke wrote:
> Hello Hal,
> 
> I have checked the smpquery and saquery command today.
> 
> The smpquery SL2VL and PI commands for the opensm port work fine, and I get 
> the expected results:
> ======================================================
> # SL2VL table: Lid 19
> #                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
> ports: in  0, out  0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ======================================================
> # Port info: Lid 19 port 0
> Mkey:............................<not displayed>
> GidPrefix:.......................0xfe80000000000000
> Lid:.............................19
> SMLid:...........................19
> CapMask:.........................0x251086a
>                                 IsSM
>                                 IsTrapSupported
>                                 IsAutomaticMigrationSupported
>                                 IsSLMappingSupported
>                                 IsSystemImageGUIDsupported
>                                 IsCommunicatonManagementSupported
>                                 IsVendorClassSupported
>                                 IsCapabilityMaskNoticeSupported
>                                 IsClientRegistrationSupported
> DiagCode:........................0x0000
> MkeyLeasePeriod:.................0
> LocalPort:.......................1
> LinkWidthEnabled:................1X or 4X
> LinkWidthSupported:..............1X or 4X
> LinkWidthActive:.................4X
> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
> LinkState:.......................Active
> PhysLinkState:...................LinkUp
> LinkDownDefState:................Polling
> ProtectBits:.....................0
> LMC:.............................0
> LinkSpeedActive:.................5.0 Gbps
> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
> NeighborMTU:.....................2048
> SMSL:............................0
> VLCap:...........................VL0-7
> InitType:........................0x00
> VLHighLimit:.....................0
> VLArbHighCap:....................8
> VLArbLowCap:.....................8
> InitReply:.......................0x00
> MtuCap:..........................2048
> VLStallCount:....................0
> HoqLife:.........................31
> OperVLs:.........................VL0-7
> PartEnforceInb:..................0
> PartEnforceOutb:.................0
> FilterRawInb:....................0
> FilterRawOutb:...................0
> MkeyViolations:..................0
> PkeyViolations:..................0
> QkeyViolations:..................0
> GuidCap:.........................32
> ClientReregister:................0
> McastPkeyTrapSuppressionEnabled:.0
> SubnetTimeout:...................18
> RespTimeVal:.....................16
> LocalPhysErr:....................8
> OverrunErr:......................8
> MaxCreditHint:...................0
> RoundTrip:.......................0
> CapabilityMask2:.................0x0000
> LinkSpeedExtActive:..............No Extended Speed
> LinkSpeedExtSupported:...........0
> LinkSpeedExtEnabled:.............0
> ======================================================
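To compare SMSL or OperVLs across many ports, the dotted "Key:....value" layout of the dump above can be parsed mechanically. This is a minimal sketch in Python that assumes the output format is exactly as pasted in this mail. parse_port_info is a hypothetical helper, not part of infiniband-diags.

```python
import re

# Matches "Key:.....value" lines as printed in the port info dump above.
_FIELD = re.compile(r"^([A-Za-z0-9]+):\.+(.*)$")

def parse_port_info(text):
    """Return {field: value}; lines without the dotted layout are skipped."""
    fields = {}
    for raw in text.splitlines():
        # Tolerate mail quoting ("> ") in front of the tool output.
        line = raw.lstrip("> ").strip()
        m = _FIELD.match(line)
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return fields

sample = """\
# Port info: Lid 19 port 0
Lid:.............................19
SMLid:...........................19
CapMask:.........................0x251086a
                                IsSM
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
SMSL:............................0
VLCap:...........................VL0-7
OperVLs:.........................VL0-7
"""

info = parse_port_info(sample)
print("SMSL =", info["SMSL"], "OperVLs =", info["OperVLs"])
```

The capability names listed under CapMask have no dotted key, so the sketch skips them on purpose.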
> 
> 
> The problem is the saquery commands on other nodes.
> In most cases the execution fails, and the node shows the same behaviour 
> as the OpenSM node when it tries to send on SL>0. The PathRequest packet 
> does not arrive at the node running OpenSM (checked with ibdump). At 
> some point during the execution the saquery binary hangs, the kernel log 
> indicates errors, and the only option is to reboot. 
> This is the output I see for the saquery:
> ======================================================
> saquery -P --src-to-dst 4:8
> ibwarn: [2535] sa_query: umad_recv failed: attr 0x11: Connection timed out
> 
> Query SA failed: Connection timed out
> ======================================================
> (In really rare cases I get the PathRequest back and see the dump, but the 
> saquery binary stalls afterwards, too.)
> 
> 
> I did some debugging with gdb again, and stepped through the saquery code.
> When I change the SL to 0 in the addr vector of the MAD right before 
> umad_send is called, then everything works.
> So, the saquery on the compute nodes shows the same behaviour as the opensm 
> with respect to the SL value for umad_send.
> 
> 
> At the end I tried to run MinHop instead of DFSSSP, and specified sm_sl 1 in 
> the config file of opensm.
> Sadly, this configuration results in the same crashes of the saquery commands.
> For the runs with MinHop I also used a different SL2VL mapping in which 
> every SL travels on VL=0, just to be sure that there is no problem with VL>0:
> ======================================================
> #                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
> ports: in  0, out  0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
> ======================================================

Non-QoS routing algorithms still need -Q; otherwise the full range of QoS
is not available. Was OpenSM started with -Q for this test?

-- Hal
> 
> Regards,
> Jens
> 
> 
> On Dec 16, 2012, at 11:59 PM, Jens Domke wrote:
> 
>>
>> On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote:
>>
>>> On 12/16/2012 8:39 AM, Jens Domke wrote:
>>>> Hi,
>>>>
>>>> On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 12/16/2012 7:03 AM, Jens Domke wrote:
>>>>>> Hello Hal,
>>>>>>
>>>>>> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>>>>>>> Hello Hal,
>>>>>>>>
>>>>>>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>>>>>>> Hello Hal,
>>>>>>>>>>
>>>>>>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi again,
>>>>>>>>>>>
>>>>>>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>>>>>>> Hello Hal,
>>>>>>>>>>>>
>>>>>>>>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>>>>>>>>
>>>>>>>>>>>>>> d) OpenMPI runs are executed with "--mca 
>>>>>>>>>>>>>> btl_openib_ib_path_record_service_level 1"
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>>>>>>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>>>>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>>>>>>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>>>>>>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>>>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does 
>>>>>>>>>>>> not specify the SL for querying the PathRecords.
>>>>>>>>>>>> It just enables the functionality. And the ompi processes use the 
>>>>>>>>>>>> PortInfo.SMSL to send the request.
>>>>>>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, 
>>>>>>>>>>>> and the SA received the requests.  
>>>>>>>>>>>>>
>>>>>>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As far as I understand the whole system:
>>>>>>>>>>>>>> 1. the OMPI processes are sending MAD requests 
>>>>>>>>>>>>>> (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>>>>>>>>>> set for that port.
>>>>>>>>>>>> Hmm, there you might have a point. I think I saw that the query 
>>>>>>>>>>>> itself had SL=0 specified.
>>>>>>>>>>>> In fact OpenMPI sets everything to 0 except for slid and dlid.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or 
>>>>>>>>>>>>>> Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>>>>>>>>>> than the one the query used and is the one returned inside the
>>>>>>>>>>>>> PathRecord attribute/data.
>>>>>>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the 
>>>>>>>>>>>> SM is running on a port which is also used for MPI comm.
>>>>>>>>>>>
>>>>>>>>>>> With DFSSSP are all SLs same from source port to get to any 
>>>>>>>>>>> destination ?
>>>>>>>>>> No, not necessarily. In general DFSSSP does not enforce 
>>>>>>>>>> SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>>>>>>>>
>>>>>>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>>>>>>>> True. But i don't think that the SA asks the DFSSSP routing about the 
>>>>>>>> SL for the reversible path.
>>>>>>>> So, the SA could use any SL which is a valid SL, even if the DFSSSP 
>>>>>>>> would recommend another SL.
>>>>>>>>
>>>>>>>> I just read the IB Specs and it says, that "SL specified in the 
>>>>>>>> received packet is used as the SL in the response packet" for MAD 
>>>>>>>> packets.
>>>>>>>> So, it's most likely that there is a mismatch between how OMPI sets 
>>>>>>>> up the PathRequest and how the SA builds the 
>>>>>>>> response packet.
>>>>>>>> OMPI always specifies SL=0 (let's say SL_a) inside of the PathRequest 
>>>>>>>> packet, 
>>>>>>>
>>>>>>> So CompMask in the query has the SL bit on and SL is set to 0 inside the
>>>>>>> SubnAdmGet of PathRecord ?
>>>>>>
>>>>>> No, the CompMask didn't have the SL bit set, and the SL was set to 0.
>>>>>
>>>>> That means the SL in the request is wildcarded so the SA/SM fills in a
>>>>> valid one in the response.
>>>> Ok.
>>>>>
>>>>>> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the 
>>>>>> only reference I found was in osm_sa_path_record.c
>>>>>> The SA just treats the SL in the PathRequest as a "I would like to use 
>>>>>> this SL" in case the SL bit is set.
>>>>>> But the routing engine can overwrite the requested SL before the reply 
>>>>>> is sent.
>>>>>>
>>>>>> Nevertheless, I have changed the code of OMPI so that it sets the SL bit 
>>>>>> in the CompMask and sets the SL to SMSL for the PathRequest, so that 
>>>>>> SL_a == SL_b.
>>>>>> Sadly, the reply sent by the SA does not leave the node (for SL_b>0). 
>>>>>> Only if I change the SL to 0 in the MAD right before umad_send is called 
>>>>>> by the SA is the packet able to leave the node and reach the OMPI 
>>>>>> process.
>>>>>
>>>>> Are you sure the response doesn't leave the SA node or it's not received
>>>>> at the requester (OMPI node) ?
>>>> No, I'm not sure. Is there any possibility to check that? As far as I 
>>>> know, ibdump does not show MAD packets which leave a port; it only shows 
>>>> the packets when they are received on the other end.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> and sends the packet on SL_b (PortInfo.SMSL).
>>>>>>>
>>>>>>> Good.
>>>>>>>
>>>>>>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, 
>>>>>>>> for the response.
>>>>>>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>>>>>>
>>>>>>> Depends. It may be that both SLs work but maybe not.
>>>>>>>
>>>>>>>> If I analyse this correctly, then there are two bugs. One is in OMPI, 
>>>>>>>> that it does not specify the SL within the PathRequest in an 
>>>>>>>> appropriate way (which would be a SL suggested by DFSSSP for the 
>>>>>>>> reversible path). And the second bug is that the SA uses the SL on 
>>>>>>>> which the PathRequest packet was sent, and not the SL specified within 
>>>>>>>> the packet.
>>>>>>>> What do you think?
>>>>>>>
>>>>>>> Yes, it might be better to wildcard the SL in the query. The only
>>>>>>> scenario that would fail with the query you are making is if there's no SL
>>>>>>> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
>>>>>>> If that's the case, SA should return MAD status 0xc (status code 3 -
>>>>>>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>>>>>>> OMPI node so it's not even getting that far.
>>>>>>
>>>>>> Yes, exactly. So, do you have an idea why the response hangs in the SA 
>>>>>> node?
>>>>>> I have no insight into the underlying layers (kernel driver and firmware). 
>>>>>> Maybe there are some implementations which prevent the SA from sending 
>>>>>> MADs back on SL>0?
>>>>>
>>>>> If you're sure this response doesn't get out of the SA node, please
>>>>> contact Mellanox support with the details.
>>>> Ok, I can do this, if it turns out to be true.
>>>>>
>>>>>>>
>>>>>>>> I can try to change the PathRequest of OMPI tomorrow, so that it 
>>>>>>>> matches addr_type.gsi.service_level.
>>>>>>>> Maybe, with this change the packets of the SA will reach the OMPI 
>>>>>>>> process on a SL>0.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via 
>>>>>>>>>>>>>> umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>>>>>>>>
>>>>>>>>>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>>>>>>>>>> of the original query but haven't verified this in the code base yet.
>>>>>>>>>>>> Ok, I was not aware of that rule. But if this is true, then the SA 
>>>>>>>>>>>> should also be able to send via SL>0.
>>>>>>>>>>>
>>>>>>>>>>> I double-checked and indeed the SA response does use the SL that the
>>>>>>>>>>> incoming request was received on.
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the 
>>>>>>>>>>>>>> following attributes:
>>>>>>>>>>>>>>  /* GS classes */
>>>>>>>>>>>>>>  umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>>>>>>                    p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>>>>>>                    p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>>>>>>                    IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>>>>>>> So, the SL is the same as the one which was used by the OMPI 
>>>>>>>>>>>>>> process. The Q_Key matches the Q_Key on the OMPI process, and 
>>>>>>>>>>>>>> remote_qp and dest_lid are correct, too.
>>>>>>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the 
>>>>>>>>>>>>>> PathRecord, and this send does not work (except for SL=0).
>>>>>>>>>>>>>
>>>>>>>>>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>>>>>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>>>>>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>>>>>>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>>>>>>>>>> received at the SM or the response not to make it back to the requester
>>>>>>>>>>>>> from the SA if the SL used is not "reversible".
>>>>>>>>>>>> By "not working" I mean, that the MPI process does not receive any 
>>>>>>>>>>>> response from the SA.
>>>>>>>>>>>> I get messages from the MPI process like the following:
>>>>>>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info]
>>>>>>>>>>>>  No response from SA after 20 retries
>>>>>>>>>>>> The log of OpenSM shows that the SA received the PathRequest 
>>>>>>>>>>>> query, dumps the query into the log, and sends the reply back.
>>>>>>>>>>>> And I think I saw some messages in the log about "…1 outstanding 
>>>>>>>>>>>> MAD…".
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I look into the MAD before it is sent, then it looks like 
>>>>>>>>>>>>>> this:
>>>>>>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, 
>>>>>>>>>>>>>> length=120, timeout_ms=0, retries=3)
>>>>>>>>>>>>>> at src/umad.c:791
>>>>>>>>>>>>>> 791             if (umaddebug > 1)
>>>>>>>>>>>>>> (gdb) p *mad
>>>>>>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, 
>>>>>>>>>>>>>> length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 
>>>>>>>>>>>>>> '\000', gid_index = 0 '\000', 
>>>>>>>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' 
>>>>>>>>>>>>>> <repeats 15 times>, flow_label = 0, 
>>>>>>>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 
>>>>>>>>>>>>>> 0x7fffe8012530 "\002"}
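The odd-looking qpn, qkey and lid values in this gdb dump are consistent with libibumad keeping those address fields in network byte order, while gdb prints them as host integers on a little-endian x86_64 machine. This is a small Python sketch of the conversion, assuming a little-endian host. wire_value is a made-up helper name.

```python
def wire_value(host_int, width):
    """Reinterpret an integer printed on a little-endian host as the
    big-endian (network order) value that is actually stored in memory."""
    return int.from_bytes(host_int.to_bytes(width, "little"), "big")

lid = wire_value(4096, 2)           # 16-bit LID from the dump
qkey = wire_value(384, 4)           # 32-bit Q_Key from the dump
qpn = wire_value(1325427712, 4)     # 32-bit QPN from the dump

print(f"lid={lid} qkey={qkey:#x} qpn={qpn:#x}")
```

With the numbers from the dump this gives LID 16 and Q_Key 0x80010000 (2147549184 decimal), which match the Source Local ID and Queue Key in the ibdump capture of the similar request further down in this thread.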
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>>>>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>>>>>>>> This is the response on the OpenSM side (inside the umad_send 
>>>>>>>>>>>> function, right before it is written to the device with write(fd, 
>>>>>>>>>>>> …).
>>>>>>>>>>>> SL=6 indicates, that the MPI process was sending the request on SL 
>>>>>>>>>>>> 6.
>>>>>>>>>>>
>>>>>>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>>>>>>> Yes, it was SL 6.
>>>>>>>>>> Here is a content of a similar packet which was received by the SA. 
>>>>>>>>>> I have used ibdump on the port where the OpenSM was running:
>>>>>>>>>> ======================================================================================
>>>>>>>>>> No.     Time        Source                Destination           Protocol Length Info
>>>>>>>>>> 785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>>>>>>>>
>>>>>>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 
>>>>>>>>>> bits)
>>>>>>>>>> Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>>>>>> Epoch Time: 1355389784.437633332 seconds
>>>>>>>>>> [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>>>>>> [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>>>>>> [Time since reference or first frame: 14.352168681 seconds]
>>>>>>>>>> Frame Number: 785
>>>>>>>>>> Frame Length: 290 bytes (2320 bits)
>>>>>>>>>> Capture Length: 290 bytes (2320 bits)
>>>>>>>>>> [Frame is marked: False]
>>>>>>>>>> [Frame is ignored: False]
>>>>>>>>>> [Protocols in frame: erf:infiniband]
>>>>>>>>>> Extensible Record Format
>>>>>>>>>> [ERF Header]
>>>>>>>>>>    Timestamp: 0x50c99b587008bcf2
>>>>>>>>>>    [Header type]
>>>>>>>>>>        .001 0101 = type: INFINIBAND (21)
>>>>>>>>>>        0... .... = Extension header present: 0
>>>>>>>>>>    0000 0100 = flags: 4
>>>>>>>>>>        .... ..00 = capture interface: 0
>>>>>>>>>>        .... .1.. = varying record length: 1
>>>>>>>>>>        .... 0... = truncated: 0
>>>>>>>>>>        ...0 .... = rx error: 0
>>>>>>>>>>        ..0. .... = ds error: 0
>>>>>>>>>>        00.. .... = reserved: 0
>>>>>>>>>>    record length: 306
>>>>>>>>>>    loss counter: 0
>>>>>>>>>>    wire length: 290
>>>>>>>>>> InfiniBand
>>>>>>>>>> Local Route Header
>>>>>>>>>>    0110 .... = Virtual Lane: 0x06
>>>>>>>>>>    .... 0000 = Link Version: 0
>>>>>>>>>>    0110 .... = Service Level: 6
>>>>>>>>>>    .... 00.. = Reserved (2 bits): 0
>>>>>>>>>>    .... ..10 = Link Next Header: 0x02
>>>>>>>>>>    Destination Local ID: 19
>>>>>>>>>>    0000 0... .... .... = Reserved (5 bits): 0
>>>>>>>>>>    .... .000 0100 1000 = Packet Length: 72
>>>>>>>>>>    Source Local ID: 16
>>>>>>>>>> Base Transport Header
>>>>>>>>>>    Opcode: 100
>>>>>>>>>>    1... .... = Solicited Event: True
>>>>>>>>>>    .1.. .... = MigReq: True
>>>>>>>>>>    ..00 .... = Pad Count: 0
>>>>>>>>>>    .... 0000 = Header Version: 0
>>>>>>>>>>    Partition Key: 65535
>>>>>>>>>>    Reserved (8 bits): 0
>>>>>>>>>>    Destination Queue Pair: 0x000001
>>>>>>>>>>    0... .... = Acknowledge Request: False
>>>>>>>>>>    .000 0000 = Reserved (7 bits): 0
>>>>>>>>>>    Packet Sequence Number: 0
>>>>>>>>>> DETH - Datagram Extended Transport Header
>>>>>>>>>>    Queue Key: 2147549184
>>>>>>>>>>    Reserved (8 bits): 0
>>>>>>>>>>    Source Queue Pair: 0x00380050
>>>>>>>>>> MAD Header - Common Management Datagram
>>>>>>>>>>    Base Version: 0x01
>>>>>>>>>>    Management Class: 0x03
>>>>>>>>>>    Class Version: 0x02
>>>>>>>>>>    Method: Get() (0x01)
>>>>>>>>>>    Status: 0x0000
>>>>>>>>>>    Class Specific: 0x0000
>>>>>>>>>>    Transaction ID: 0x0010000f38005000
>>>>>>>>>>    Attribute ID: 0x0035
>>>>>>>>>>    Reserved: 0x0000
>>>>>>>>>>    Attribute Modifier: 0x00000000
>>>>>>>>>>    MAD Data Payload: 
>>>>>>>>>> 000000000000000000000000000000000000000000000000...
>>>>>>>>>> Illegal RMPP Type (0)! 
>>>>>>>>>>    RMPP Type: 0x00
>>>>>>>>>>    RMPP Type: 0x00
>>>>>>>>>>    0000 .... = R Resp Time: 0x00
>>>>>>>>>>    .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>>>>>>    RMPP Status:  (Normal) (0x00)
>>>>>>>>>>    RMPP Data 1: 0x00000000
>>>>>>>>>>    RMPP Data 2: 0x00000000
>>>>>>>>>> SMASubnAdmGet(PathRecord)
>>>>>>>>>>    SM_Key (Verification Key): 0x0000000000000000
>>>>>>>>>>    Attribute Offset: 0x0000
>>>>>>>>>>    Reserved: 0x0000
>>>>>>>>>>    Component Mask: 0x0000003000000000
>>>>>>>>>>    Attribute (PathRecord)
>>>>>>>>>>        PathRecord
>>>>>>>>>>            DGID: :: (::)
>>>>>>>>>>            SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>>>>>>            DLID: 0x0000
>>>>>>>>>>            SLID: 0x0000
>>>>>>>>>>            0... .... = RawTraffic: 0x00
>>>>>>>>>>            .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>>>>>>            HopLimit: 0x00
>>>>>>>>>>            TClass: 0x00
>>>>>>>>>>            0... .... = Reversible: 0x00
>>>>>>>>>>            .000 0000 = NumbPath: 0x00
>>>>>>>>>>            P_Key: 0x0000
>>>>>>>>>>            .... .... .... 0000 = SL: 0x0000
>>>>>>>>>>            00.. .... = MTUSelector: 0x00
>>>>>>>>>>            ..00 0000 = MTU: 0x00
>>>>>>>>>>            00.. .... = RateSelector: 0x00
>>>>>>>>>>            ..00 0000 = Rate: 0x00
>>>>>>>>>>            00.. .... = PacketLifeTimeSelector: 0x00
>>>>>>>>>>            ..00 0000 = PacketLifeTime: 0x00
>>>>>>>>>>            Preference: 0x00
>>>>>>>>>> Variant CRC: 0xad4e
>>>>>>>>>> ======================================================================================
>>>>>>>>>
>>>>>>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>>>>>>>>> out of that machine and the issue is internal to that machine. It could be
>>>>>>>>> because of the underlying issue which hangs OpenSM when some IB program
>>>>>>>>> tried to unregister from the MAD layer but there were outstanding work
>>>>>>>>> completions. That's based on your original email earlier this AM.
>>>>>>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI 
>>>>>>>> side and the SA uses a SL>0.
>>>>>>>
>>>>>>> Can ibdump be used to capture output on the SM port ?
>>>>>>
>>>>>> Yes, that works quite well, despite the warning in the ibdump manual.
>>>>>> But I have started ibdump before opensm, maybe that makes a difference, 
>>>>>> not sure.
>>>>>>
>>>>>> Regards,
>>>>>> Jens
>>>>>>
>>>>>> PS: I have seen a small bug. Not sure if it's a bug in wireshark or 
>>>>>> ibdump, but the response received by the OMPI node isn't shown 
>>>>>> correctly. The PathRecord contains an offset which is either missing in 
>>>>>> the dump or is not treated correctly by wireshark. Either way, it causes 
>>>>>> wireshark to show the PathRecord data with wrong values.
>>>>>> Maybe you could redirect this to the developer of ibdump, so that he can 
>>>>>> check/fix it.
>>>>>
>>>>> Are you referring to the fields after the SA AttributeOffset or
>>>>> something else ?
>>>> Yes, after the SMASubnAdmGet Attribute Offset. Here is an example:
>>>> I get on the OMPI side:
>>>>   SMASubnAdmGetResp(PathRecord)
>>>>       SM_Key (Verification Key): 0x0000000000000000
>>>>       Attribute Offset: 0x0008
>>>>       Reserved: 0x0000
>>>>       Component Mask: 0x0000803000000000
>>>>       Attribute (PathRecord)
>>>>           PathRecord
>>>>               DGID: ::8:f104:399:ebb5:fe80:0 (::8:f104:399:ebb5:fe80:0)
>>>>               SGID: ::8:f104:399:ecd5:4:8 (::8:f104:399:ecd5:4:8)
>>>>               DLID: 0x0000
>>>>               SLID: 0x0000
>>>>               0... .... = RawTraffic: 0x00
>>>>               .... 0000 1000 0000 1111 1111 = FlowLabel: 0x0080ff
>>>>               HopLimit: 0xff
>>>>               TClass: 0x00
>>>>               0... .... = Reversible: 0x00
>>>>               .000 0011 = NumbPath: 0x03
>>>>               P_Key: 0x8486
>>>>               .... .... .... 0000 = SL: 0x0000
>>>>               00.. .... = MTUSelector: 0x00
>>>>               ..00 0000 = MTU: 0x00
>>>>               00.. .... = RateSelector: 0x00
>>>>               ..00 0000 = Rate: 0x00
>>>>               00.. .... = PacketLifeTimeSelector: 0x00
>>>>               ..00 0000 = PacketLifeTime: 0x00
>>>>               Preference: 0x00
>>>>
>>>> But it should show (see the difference in SLID, DLID, SL which are now 
>>>> correct):
>>>>   SMASubnAdmGetResp(PathRecord)
>>>>       SM_Key (Verification Key): 0x0000000000000000
>>>>       Attribute Offset: 0x0008
>>>>       Reserved: 0x0000
>>>>       Component Mask: 0x0000803000000000
>>>>       Attribute (PathRecord)
>>>>           PathRecord
>>>>               DGID: ::8:f104:399:ebb5 (::8:f104:399:ebb5)
>>>>               SGID: fe80::8:f104:399:ecd5 (fe80::8:f104:399:ecd5)
>>>>               DLID: 0x0004
>>>>               SLID: 0x0008
>>>>               0... .... = RawTraffic: 0x00
>>>>               .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>               HopLimit: 0x00
>>>>               TClass: 0x00
>>>>               1... .... = Reversible: 0x01
>>>>               .000 0000 = NumbPath: 0x00
>>>>               P_Key: 0xffff
>>>>               .... .... .... 0011 = SL: 0x0003
>>>>               10.. .... = MTUSelector: 0x02
>>>>               ..00 0100 = MTU: 0x04
>>>>               10.. .... = RateSelector: 0x02
>>>>               ..00 0110 = Rate: 0x06
>>>>               10.. .... = PacketLifeTimeSelector: 0x02
>>>>               ..01 0010 = PacketLifeTime: 0x12
>>>>               Preference: 0x00
>>>
>>>
>>> I think everything after AttributeOffset is off by 2 bytes. DGID doesn't
>>> look right to me (no subnet prefix fe80:: in front of GUID).
>>
>> Yes, I made a small mistake with the hexeditor. I started the shift after 
>> the subnet prefix.
>> Sorry for the confusion.
>>
>> Thank you for the hint with smpquery and saquery, I will check that tomorrow.
>>
>> Jens
>>
>>>
>>> -- Hal
>>>
>>>>
>>>> Regards,
>>>> Jens
>>>>
>>>>>
>>>>> -- Hal
>>>>>
>>>>>>>
>>>>>>> -- Hal
>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>>>>>>> working (not dropping) aside from whether it's really the correct 
>>>>>>>>>>> SL to use.
>>>>>>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>>>>>>>      SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>>>>>>>>>      VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>>>>>>>> But this is also as expected, because I have set the QoS in the 
>>>>>>>>>> opensm config as follows:
>>>>>>>>>>      qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>>>>>>> This was set for "default", "CA" and "Switch external ports". I have 
>>>>>>>>>> not touched the config for "Switch Port 0" and "Router ports", they 
>>>>>>>>>> remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>>>>>>
>>>>>>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>>>>>>> Yes, all VL_CAP show 4 in the OpenSM log file.
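The VLCap condition above can be tested mechanically for a given qos_sl2vl setting. This is a sketch in Python, assuming the PortInfo VLCap encoding from the IB specification (1 = VL0, 2 = VL0-1, 3 = VL0-3, 4 = VL0-7, 5 = VL0-14). The function names are made up for illustration, and on a real fabric OperVLs, not just VLCap, decides which VLs are usable.

```python
# Number of data VLs for each PortInfo VLCap code.
DATA_VLS = {1: 1, 2: 2, 3: 4, 4: 8, 5: 15}

def parse_sl2vl(option):
    """Parse a qos_sl2vl value such as '0,1,2,...' into a 16-entry list."""
    table = [int(v) for v in option.split(",")]
    if len(table) != 16:
        raise ValueError("expected 16 SL entries, got %d" % len(table))
    return table

def unusable_sls(table, vl_cap):
    """Return the SLs whose VL is not available on a link with this VLCap."""
    limit = DATA_VLS[vl_cap]
    return [sl for sl, vl in enumerate(table) if vl >= limit]

table = parse_sl2vl("0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7")
print(unusable_sls(table, 4))   # links with VL0-7
print(unusable_sls(table, 3))   # links with only VL0-3
```

For the mapping quoted above, no SL is flagged with VLCap 4. With VLCap 3, SLs 4-7 and 12-15 would map to VLs that the link does not have.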
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Jens
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> -- Hal
>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Jens
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -- Hal
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful 
>>>>>>>>>>>>>> information for this problem, even with higher debug levels.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>>>>>>>>> In the OpenSM log, only that it was received, what the request 
>>>>>>>>>>>> looks like, and that it was sent back.
>>>>>>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error 
>>>>>>>>>>>>>> in the kernel driver, the HCA firmware or something completely 
>>>>>>>>>>>>>> different. Or if umad_send basically does not support SL>0.
>>>>>>>>>>>>>> A workaround for the moment is to set the SL in the 
>>>>>>>>>>>>>> umad_set_addr_net(...) call to 0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>>>>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked 
>>>>>>>>>>>> this. In our case (OpenSM running on a compute node), it sets the 
>>>>>>>>>>>> same SL, which is used for MPI<->MPI traffic, to ensure deadlock 
>>>>>>>>>>>> freedom.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Jens
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>>>>>> E-Mail: domke.j...@m.titech.ac.jp
>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe 
>>>>>>>>>>> linux-rdma" in
>>>>>>>>>>> the body of a message to majord...@vger.kernel.org
>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
> 
> 
> 

