Hello Hal,

On Dec 17, 2012, at 9:04 PM, Hal Rosenstock wrote:
> Hi,
>
> On 12/17/2012 1:16 AM, Jens Domke wrote:
>> Hello Hal,
>>
>> I have checked the smpquery and saquery commands today.
>>
>> The smpquery SL2VL and PI commands for the opensm port work fine, and I get
>> the expected results:
>> ======================================================
>> # SL2VL table: Lid 19
>> #               SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
>> ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
>> ======================================================
>> # Port info: Lid 19 port 0
>> Mkey:............................<not displayed>
>> GidPrefix:.......................0xfe80000000000000
>> Lid:.............................19
>> SMLid:...........................19
>> CapMask:.........................0x251086a
>>                                  IsSM
>>                                  IsTrapSupported
>>                                  IsAutomaticMigrationSupported
>>                                  IsSLMappingSupported
>>                                  IsSystemImageGUIDsupported
>>                                  IsCommunicatonManagementSupported
>>                                  IsVendorClassSupported
>>                                  IsCapabilityMaskNoticeSupported
>>                                  IsClientRegistrationSupported
>> DiagCode:........................0x0000
>> MkeyLeasePeriod:.................0
>> LocalPort:.......................1
>> LinkWidthEnabled:................1X or 4X
>> LinkWidthSupported:..............1X or 4X
>> LinkWidthActive:.................4X
>> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
>> LinkState:.......................Active
>> PhysLinkState:...................LinkUp
>> LinkDownDefState:................Polling
>> ProtectBits:.....................0
>> LMC:.............................0
>> LinkSpeedActive:.................5.0 Gbps
>> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
>> NeighborMTU:.....................2048
>> SMSL:............................0
>> VLCap:...........................VL0-7
>> InitType:........................0x00
>> VLHighLimit:.....................0
>> VLArbHighCap:....................8
>> VLArbLowCap:.....................8
>> InitReply:.......................0x00
>> MtuCap:..........................2048
>> VLStallCount:....................0
>> HoqLife:.........................31
>> OperVLs:.........................VL0-7
>> PartEnforceInb:..................0
>> PartEnforceOutb:.................0
>> FilterRawInb:....................0
>> FilterRawOutb:...................0
>> MkeyViolations:..................0
>> PkeyViolations:..................0
>> QkeyViolations:..................0
>> GuidCap:.........................32
>> ClientReregister:................0
>> McastPkeyTrapSuppressionEnabled:.0
>> SubnetTimeout:...................18
>> RespTimeVal:.....................16
>> LocalPhysErr:....................8
>> OverrunErr:......................8
>> MaxCreditHint:...................0
>> RoundTrip:.......................0
>> CapabilityMask2:.................0x0000
>> LinkSpeedExtActive:..............No Extended Speed
>> LinkSpeedExtSupported:...........0
>> LinkSpeedExtEnabled:.............0
>> ======================================================
>>
>>
>> The problem is the saquery commands on other nodes.
>> In most cases the execution fails, and the node shows the same behaviour
>> as the OpenSM node when it tries to send on SL>0. The PathRequest packet
>> does not arrive at the node with the running OpenSM (checked with ibdump).
>> At some point of the execution the saquery binary hangs, the kernel log
>> indicates errors and the only option is to reboot.
>> This is the output I see for the saquery:
>> ======================================================
>> saquery -P --src-to-dst 4:8
>> ibwarn: [2535] sa_query: umad_recv failed: attr 0x11: Connection timed out
>>
>> Query SA failed: Connection timed out
>> ======================================================
>> (In really rare cases I get the PathRequest back and see the dump, but the
>> saquery binary stalls afterwards, too.)
>>
>>
>> I did some debugging with gdb again, and stepped through the saquery code.
>> When I change the SL to 0 in the addr vector of the MAD right before
>> umad_send is called, then everything works.
>> So, saquery on the compute nodes shows the same behaviour as opensm
>> with respect to the SL value for umad_send.
>>
>>
>> At the end I tried to run MinHop instead of DFSSSP, and specified sm_sl 1 in
>> the config file of opensm.
>> Sadly, this configuration results in the same crashes of the saquery
>> commands.
>> For the runs with MinHop I also used a different SL2VL mapping, just to be
>> sure that there is no problem with VL>0 and that every SL travels on VL=0:
>> ======================================================
>> #               SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
>> ports: in 0, out 0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
>> ======================================================
>
> Non-QoS routing algorithms still need -Q, otherwise the full range of QoS
> is not available. Was OpenSM started with -Q for this test ?

Yes, I had QoS enabled in my configuration file with "qos TRUE".

Jens

>
> -- Hal
>>
>> Regards,
>> Jens
>>
>>
>> On Dec 16, 2012, at 11:59 PM, Jens Domke wrote:
>>
>>>
>>> On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote:
>>>
>>>> On 12/16/2012 8:39 AM, Jens Domke wrote:
>>>>> Hi,
>>>>>
>>>>> On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 12/16/2012 7:03 AM, Jens Domke wrote:
>>>>>>> Hello Hal,
>>>>>>>
>>>>>>> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>>>>>>>> Hello Hal,
>>>>>>>>>
>>>>>>>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>>>>>>>> Hello Hal,
>>>>>>>>>>>
>>>>>>>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi again,
>>>>>>>>>>>>
>>>>>>>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>>>>>>>> Hello Hal,
>>>>>>>>>>>>>
>>>>>>>>>>>>> thank you for the fast response. I will try to clarify some
>>>>>>>>>>>>> points.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> d) OpenMPI runs are executed with "--mca
>>>>>>>>>>>>>>> btl_openib_ib_path_record_service_level 1"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly
>>>>>>>>>>>>>> but there should be no need to set this. The proper SL for
>>>>>>>>>>>>>> querying the SA for PathRecords, etc. is always in
>>>>>>>>>>>>>> PortInfo.SMSL. In the case of DFSSSP (and other QoS based
>>>>>>>>>>>>>> routing algorithms), it calculates that and the SM pushes this
>>>>>>>>>>>>>> into each port. That should be used. It's possible that SL1
>>>>>>>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level
>>>>>>>>>>>>> does not specify the SL for querying the PathRecords.
>>>>>>>>>>>>> It just enables the functionality. And the ompi processes use the
>>>>>>>>>>>>> PortInfo.SMSL to send the request.
>>>>>>>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test,
>>>>>>>>>>>>> and the SA received the requests.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As far as I understand the whole system:
>>>>>>>>>>>>>>> 1. the OMPI processes are sending MAD requests
>>>>>>>>>>>>>>> (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is the SL in the query itself. This should be the SMSL
>>>>>>>>>>>>>> that the SM set for that port.
>>>>>>>>>>>>> Hmm, there you might have a point. I think I saw that the query
>>>>>>>>>>>>> itself had SL=0 specified.
>>>>>>>>>>>>> In fact OpenMPI sets everything to 0 except for slid and dlid.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or
>>>>>>>>>>>>>>> Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port
>>>>>>>>>>>>>> communication) than the one the query used and is the one
>>>>>>>>>>>>>> returned inside the PathRecord attribute/data.
>>>>>>>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because
>>>>>>>>>>>>> the SM is running on a port which is also used for MPI comm.
>>>>>>>>>>>>
>>>>>>>>>>>> With DFSSSP are all SLs same from source port to get to any
>>>>>>>>>>>> destination ?
>>>>>>>>>>> No, not necessarily. In general DFSSSP does not enforce
>>>>>>>>>>> SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) ==
>>>>>>>>>>> SL(LID1->LID3).
>>>>>>>>>>
>>>>>>>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>>>>>>>>> True. But I don't think that the SA asks the DFSSSP routing about the
>>>>>>>>> SL for the reversible path.
>>>>>>>>> So, the SA could use any SL which is a valid SL, even if DFSSSP
>>>>>>>>> would recommend another SL.
>>>>>>>>>
>>>>>>>>> I just read the IB Specs and it says that "SL specified in the
>>>>>>>>> received packet is used as the SL in the response packet" for MAD
>>>>>>>>> packets.
>>>>>>>>> So, it's most likely that there is a mismatch between the way OMPI
>>>>>>>>> sets up the PathRequest and the way the SA builds the response
>>>>>>>>> packet.
>>>>>>>>> OMPI always specifies SL=0 (let's say SL_a) inside of the PathRequest
>>>>>>>>> packet,
>>>>>>>>
>>>>>>>> So CompMask in the query has the SL bit on and SL is set to 0 inside
>>>>>>>> the SubnAdmGet of PathRecord ?
>>>>>>>
>>>>>>> No, the CompMask didn't have the SL bit set and the SL was set to 0.
>>>>>>
>>>>>> That means the SL in the request is wildcarded so the SA/SM fills in a
>>>>>> valid one in the response.
>>>>> Ok.
>>>>>>
>>>>>>> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the
>>>>>>> only reference I found was in osm_sa_path_record.c
>>>>>>> The SA just treats the SL in the PathRequest as an "I would like to use
>>>>>>> this SL" in case the SL bit is set.
>>>>>>> But the routing engine can overwrite the requested SL before the reply
>>>>>>> is sent.
>>>>>>>
>>>>>>> Nevertheless, I have changed the code of OMPI so that it sets the SL
>>>>>>> bit in the CompMask and sets the SL to SMSL for the PathRequest, so
>>>>>>> that SL_a == SL_b.
>>>>>>> Sadly, the reply sent by the SA does not leave the node (for SL_b>0).
>>>>>>> Only if I change the SL to 0 in the MAD right before umad_send is
>>>>>>> called by the SA is the packet able to leave the node and reach the
>>>>>>> OMPI process.
>>>>>>
>>>>>> Are you sure the response doesn't leave the SA node or it's not received
>>>>>> at the requester (OMPI node) ?
>>>>> No, I'm not sure. Is there any possibility to check that? As far as I
>>>>> know, ibdump does not show MAD packets which leave a port, it only shows
>>>>> the packets when they are received on the other end.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> and sends the packet on SL_b (PortInfo.SMSL).
>>>>>>>>
>>>>>>>> Good.
>>>>>>>>
>>>>>>>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b,
>>>>>>>>> for the response.
>>>>>>>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>>>>>>>
>>>>>>>> Depends. It may be that both SLs work but maybe not.
>>>>>>>>
>>>>>>>>> If I analyse this correctly, then there are two bugs. One is in OMPI,
>>>>>>>>> that it does not specify the SL within the PathRequest in an
>>>>>>>>> appropriate way (which would be an SL suggested by DFSSSP for the
>>>>>>>>> reversible path). And the second bug is that the SA uses the SL on
>>>>>>>>> which the PathRequest packet was sent, and not the SL specified
>>>>>>>>> within the packet.
>>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Yes, it might be better to wildcard the SL in the query. The only
>>>>>>>> scenario that would fail with the query you are making is if there's
>>>>>>>> no SL 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord
>>>>>>>> query.
>>>>>>>> If that's the case, SA should return MAD status 0xc (status code 3 -
>>>>>>>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>>>>>>>> OMPI node so it's not even getting that far.
>>>>>>>
>>>>>>> Yes, exactly. So, do you have an idea why the response hangs in the SA
>>>>>>> node?
>>>>>>> I have no insight into the underlying layers (kernel driver and
>>>>>>> firmware).
>>>>>>> Maybe there are some implementations which prevent the SA from sending
>>>>>>> MADs back on SL>0?
>>>>>>
>>>>>> If you're sure this response doesn't get out of the SA node, please
>>>>>> contact Mellanox support with the details.
>>>>> Ok, I can do this, if it turns out to be true.
>>>>>>
>>>>>>>>
>>>>>>>>> I can try to change the PathRequest of OMPI tomorrow, so that it
>>>>>>>>> matches addr_type.gsi.service_level.
>>>>>>>>> Maybe, with this change the packets of the SA will reach the OMPI
>>>>>>>>> process on an SL>0.
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via
>>>>>>>>>>>>>>> umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> By the response reversibility rule, I think this is returned on
>>>>>>>>>>>>>> the SL of the original query but haven't verified this in the
>>>>>>>>>>>>>> code base yet.
>>>>>>>>>>>>> Ok, I was not aware of that rule. But if this is true, then the
>>>>>>>>>>>>> SA should also be able to send via SL>0.
>>>>>>>>>>>>
>>>>>>>>>>>> I double checked and indeed the SA response does use the SL that
>>>>>>>>>>>> the incoming request was received on.
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the
>>>>>>>>>>>>>>> following attributes:
>>>>>>>>>>>>>>>     /* GS classes */
>>>>>>>>>>>>>>>     umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>>>>>>>                       p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>>>>>>>                       p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>>>>>>>                       IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>>>>>>>> So, the SL is the same as the one which was used by the OMPI
>>>>>>>>>>>>>>> process. The Q_Key matches the Q_Key on the OMPI process, and
>>>>>>>>>>>>>>> remote_qp and dest_lid are correct, too.
>>>>>>>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the
>>>>>>>>>>>>>>> PathRecord, and this send does not work (except for SL=0).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> By not working, what do you mean ? Do you mean it's not received
>>>>>>>>>>>>>> at the requester with no message in the OpenSM log or not
>>>>>>>>>>>>>> received at the OpenSM or something else ? It could be due to
>>>>>>>>>>>>>> the wrong SL being used in the original request (forcing it to
>>>>>>>>>>>>>> SL 1). That could cause it not to be received at the SM or the
>>>>>>>>>>>>>> response not to make it back to the requester from the SA if
>>>>>>>>>>>>>> the SL used is not "reversible".
>>>>>>>>>>>>> By "not working" I mean that the MPI process does not receive
>>>>>>>>>>>>> any response from the SA.
>>>>>>>>>>>>> I get messages from the MPI process like the following:
>>>>>>>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info]
>>>>>>>>>>>>> No response from SA after 20 retries
>>>>>>>>>>>>> The log of OpenSM shows that the SA received the PathRequest
>>>>>>>>>>>>> query, dumps the query into the log, and sends the reply back.
>>>>>>>>>>>>> And I think I saw some messages in the log about "…1 outstanding
>>>>>>>>>>>>> MAD…".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I look into the MAD before it is sent, then it looks like
>>>>>>>>>>>>>>> this:
>>>>>>>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530,
>>>>>>>>>>>>>>>     length=120, timeout_ms=0, retries=3)
>>>>>>>>>>>>>>>     at src/umad.c:791
>>>>>>>>>>>>>>> 791         if (umaddebug > 1)
>>>>>>>>>>>>>>> (gdb) p *mad
>>>>>>>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3,
>>>>>>>>>>>>>>>   length = 0, addr = {qpn = 1325427712, qkey = 384,
>>>>>>>>>>>>>>>     lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present =
>>>>>>>>>>>>>>>     0 '\000', gid_index = 0 '\000',
>>>>>>>>>>>>>>>     hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000'
>>>>>>>>>>>>>>>     <repeats 15 times>, flow_label = 0,
>>>>>>>>>>>>>>>     pkey_index = 0, reserved = "\000\000\000\000\000"}, data =
>>>>>>>>>>>>>>>     0x7fffe8012530 "\002"}
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response
>>>>>>>>>>>>>> on the OpenSM side ? SL is 6 rather than 1 here.
>>>>>>>>>>>>> This is the response on the OpenSM side (inside the umad_send
>>>>>>>>>>>>> function, right before it is written to the device with write(fd,
>>>>>>>>>>>>> …)).
>>>>>>>>>>>>> SL=6 indicates that the MPI process was sending the request on
>>>>>>>>>>>>> SL 6.
>>>>>>>>>>>>
>>>>>>>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>>>>>>>> Yes, it was SL 6.
>>>>>>>>>>> Here is the content of a similar packet which was received by the SA.
>>>>>>>>>>> I have used ibdump on the port where the OpenSM was running:
>>>>>>>>>>> ======================================================================================
>>>>>>>>>>> No.   Time        Source     Destination   Protocol     Length   Info
>>>>>>>>>>> 785   14.352168   LID: 384   LID: 4140     InfiniBand   290      UD Send Only SubnAdmGet(PathRecord)
>>>>>>>>>>>
>>>>>>>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320
>>>>>>>>>>> bits)
>>>>>>>>>>>     Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>>>>>>>     Epoch Time: 1355389784.437633332 seconds
>>>>>>>>>>>     [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>>>>>>>     [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>>>>>>>     [Time since reference or first frame: 14.352168681 seconds]
>>>>>>>>>>>     Frame Number: 785
>>>>>>>>>>>     Frame Length: 290 bytes (2320 bits)
>>>>>>>>>>>     Capture Length: 290 bytes (2320 bits)
>>>>>>>>>>>     [Frame is marked: False]
>>>>>>>>>>>     [Frame is ignored: False]
>>>>>>>>>>>     [Protocols in frame: erf:infiniband]
>>>>>>>>>>> Extensible Record Format
>>>>>>>>>>>     [ERF Header]
>>>>>>>>>>>         Timestamp: 0x50c99b587008bcf2
>>>>>>>>>>>         [Header type]
>>>>>>>>>>>             .001 0101 = type: INFINIBAND (21)
>>>>>>>>>>>             0... .... = Extension header present: 0
>>>>>>>>>>>         0000 0100 = flags: 4
>>>>>>>>>>>             .... ..00 = capture interface: 0
>>>>>>>>>>>             .... .1.. = varying record length: 1
>>>>>>>>>>>             .... 0... = truncated: 0
>>>>>>>>>>>             ...0 .... = rx error: 0
>>>>>>>>>>>             ..0. .... = ds error: 0
>>>>>>>>>>>             00.. .... = reserved: 0
>>>>>>>>>>>         record length: 306
>>>>>>>>>>>         loss counter: 0
>>>>>>>>>>>         wire length: 290
>>>>>>>>>>> InfiniBand
>>>>>>>>>>>     Local Route Header
>>>>>>>>>>>         0110 .... = Virtual Lane: 0x06
>>>>>>>>>>>         .... 0000 = Link Version: 0
>>>>>>>>>>>         0110 .... = Service Level: 6
>>>>>>>>>>>         .... 00.. = Reserved (2 bits): 0
>>>>>>>>>>>         .... ..10 = Link Next Header: 0x02
>>>>>>>>>>>         Destination Local ID: 19
>>>>>>>>>>>         0000 0... .... .... = Reserved (5 bits): 0
>>>>>>>>>>>         .... .000 0100 1000 = Packet Length: 72
>>>>>>>>>>>         Source Local ID: 16
>>>>>>>>>>>     Base Transport Header
>>>>>>>>>>>         Opcode: 100
>>>>>>>>>>>         1... .... = Solicited Event: True
>>>>>>>>>>>         .1.. .... = MigReq: True
>>>>>>>>>>>         ..00 .... = Pad Count: 0
>>>>>>>>>>>         .... 0000 = Header Version: 0
>>>>>>>>>>>         Partition Key: 65535
>>>>>>>>>>>         Reserved (8 bits): 0
>>>>>>>>>>>         Destination Queue Pair: 0x000001
>>>>>>>>>>>         0... .... = Acknowledge Request: False
>>>>>>>>>>>         .000 0000 = Reserved (7 bits): 0
>>>>>>>>>>>         Packet Sequence Number: 0
>>>>>>>>>>>     DETH - Datagram Extended Transport Header
>>>>>>>>>>>         Queue Key: 2147549184
>>>>>>>>>>>         Reserved (8 bits): 0
>>>>>>>>>>>         Source Queue Pair: 0x00380050
>>>>>>>>>>>     MAD Header - Common Management Datagram
>>>>>>>>>>>         Base Version: 0x01
>>>>>>>>>>>         Management Class: 0x03
>>>>>>>>>>>         Class Version: 0x02
>>>>>>>>>>>         Method: Get() (0x01)
>>>>>>>>>>>         Status: 0x0000
>>>>>>>>>>>         Class Specific: 0x0000
>>>>>>>>>>>         Transaction ID: 0x0010000f38005000
>>>>>>>>>>>         Attribute ID: 0x0035
>>>>>>>>>>>         Reserved: 0x0000
>>>>>>>>>>>         Attribute Modifier: 0x00000000
>>>>>>>>>>>     MAD Data Payload:
>>>>>>>>>>>     000000000000000000000000000000000000000000000000...
>>>>>>>>>>>     Illegal RMPP Type (0)!
>>>>>>>>>>>         RMPP Type: 0x00
>>>>>>>>>>>         RMPP Type: 0x00
>>>>>>>>>>>         0000 .... = R Resp Time: 0x00
>>>>>>>>>>>         .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>>>>>>>         RMPP Status: (Normal) (0x00)
>>>>>>>>>>>         RMPP Data 1: 0x00000000
>>>>>>>>>>>         RMPP Data 2: 0x00000000
>>>>>>>>>>>     SMASubnAdmGet(PathRecord)
>>>>>>>>>>>         SM_Key (Verification Key): 0x0000000000000000
>>>>>>>>>>>         Attribute Offset: 0x0000
>>>>>>>>>>>         Reserved: 0x0000
>>>>>>>>>>>         Component Mask: 0x0000003000000000
>>>>>>>>>>>         Attribute (PathRecord)
>>>>>>>>>>>             PathRecord
>>>>>>>>>>>                 DGID: :: (::)
>>>>>>>>>>>                 SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>>>>>>>                 DLID: 0x0000
>>>>>>>>>>>                 SLID: 0x0000
>>>>>>>>>>>                 0... .... = RawTraffic: 0x00
>>>>>>>>>>>                 .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>>>>>>>                 HopLimit: 0x00
>>>>>>>>>>>                 TClass: 0x00
>>>>>>>>>>>                 0... .... = Reversible: 0x00
>>>>>>>>>>>                 .000 0000 = NumbPath: 0x00
>>>>>>>>>>>                 P_Key: 0x0000
>>>>>>>>>>>                 .... .... .... 0000 = SL: 0x0000
>>>>>>>>>>>                 00.. .... = MTUSelector: 0x00
>>>>>>>>>>>                 ..00 0000 = MTU: 0x00
>>>>>>>>>>>                 00.. .... = RateSelector: 0x00
>>>>>>>>>>>                 ..00 0000 = Rate: 0x00
>>>>>>>>>>>                 00.. .... = PacketLifeTimeSelector: 0x00
>>>>>>>>>>>                 ..00 0000 = PacketLifeTime: 0x00
>>>>>>>>>>>                 Preference: 0x00
>>>>>>>>>>>     Variant CRC: 0xad4e
>>>>>>>>>>> ======================================================================================
>>>>>>>>>>
>>>>>>>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't
>>>>>>>>>> get out of that machine and the issue is internal to that machine.
>>>>>>>>>> It could be because of the underlying issue which hangs OpenSM when
>>>>>>>>>> some IB program tried to unregister from the MAD layer but there
>>>>>>>>>> were outstanding work completions. That's based on your original
>>>>>>>>>> email earlier this AM.
>>>>>>>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI
>>>>>>>>> side and the SA uses an SL>0.
>>>>>>>>
>>>>>>>> Can ibdump be used to capture output on the SM port ?
>>>>>>>
>>>>>>> Yes, that works quite well, despite the warning in the ibdump manual.
>>>>>>> But I have started ibdump before opensm, maybe that makes a difference,
>>>>>>> not sure.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Jens
>>>>>>>
>>>>>>> PS: I have seen a small bug. Not sure if it's a bug in wireshark or
>>>>>>> ibdump, but the response received by the OMPI node isn't shown
>>>>>>> correctly. The PathRecord contains an offset which is either missing in
>>>>>>> the dump or is not treated correctly by wireshark. But it causes
>>>>>>> wireshark to show the PathRecord data with wrong values.
>>>>>>> Maybe you could redirect this to the developer of ibdump, so that he
>>>>>>> can check/fix it.
>>>>>>
>>>>>> Are you referring to the fields after the SA AttributeOffset or
>>>>>> something else ?
>>>>> Yes, after the SMASubnAdmGet Attribute Offset. Here is an example:
>>>>> I get on the OMPI side:
>>>>> SMASubnAdmGetResp(PathRecord)
>>>>>     SM_Key (Verification Key): 0x0000000000000000
>>>>>     Attribute Offset: 0x0008
>>>>>     Reserved: 0x0000
>>>>>     Component Mask: 0x0000803000000000
>>>>>     Attribute (PathRecord)
>>>>>         PathRecord
>>>>>             DGID: ::8:f104:399:ebb5:fe80:0 (::8:f104:399:ebb5:fe80:0)
>>>>>             SGID: ::8:f104:399:ecd5:4:8 (::8:f104:399:ecd5:4:8)
>>>>>             DLID: 0x0000
>>>>>             SLID: 0x0000
>>>>>             0... .... = RawTraffic: 0x00
>>>>>             .... 0000 1000 0000 1111 1111 = FlowLabel: 0x0080ff
>>>>>             HopLimit: 0xff
>>>>>             TClass: 0x00
>>>>>             0... .... = Reversible: 0x00
>>>>>             .000 0011 = NumbPath: 0x03
>>>>>             P_Key: 0x8486
>>>>>             .... .... .... 0000 = SL: 0x0000
>>>>>             00.. .... = MTUSelector: 0x00
>>>>>             ..00 0000 = MTU: 0x00
>>>>>             00.. .... = RateSelector: 0x00
>>>>>             ..00 0000 = Rate: 0x00
>>>>>             00.. .... = PacketLifeTimeSelector: 0x00
>>>>>             ..00 0000 = PacketLifeTime: 0x00
>>>>>             Preference: 0x00
>>>>>
>>>>> But it should show (see the difference in SLID, DLID, SL which are now
>>>>> correct):
>>>>> SMASubnAdmGetResp(PathRecord)
>>>>>     SM_Key (Verification Key): 0x0000000000000000
>>>>>     Attribute Offset: 0x0008
>>>>>     Reserved: 0x0000
>>>>>     Component Mask: 0x0000803000000000
>>>>>     Attribute (PathRecord)
>>>>>         PathRecord
>>>>>             DGID: ::8:f104:399:ebb5 (::8:f104:399:ebb5)
>>>>>             SGID: fe80::8:f104:399:ecd5 (fe80::8:f104:399:ecd5)
>>>>>             DLID: 0x0004
>>>>>             SLID: 0x0008
>>>>>             0... .... = RawTraffic: 0x00
>>>>>             .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>             HopLimit: 0x00
>>>>>             TClass: 0x00
>>>>>             1... .... = Reversible: 0x01
>>>>>             .000 0000 = NumbPath: 0x00
>>>>>             P_Key: 0xffff
>>>>>             .... .... .... 0011 = SL: 0x0003
>>>>>             10.. .... = MTUSelector: 0x02
>>>>>             ..00 0100 = MTU: 0x04
>>>>>             10.. .... = RateSelector: 0x02
>>>>>             ..00 0110 = Rate: 0x06
>>>>>             10.. .... = PacketLifeTimeSelector: 0x02
>>>>>             ..01 0010 = PacketLifeTime: 0x12
>>>>>             Preference: 0x00
>>>>
>>>>
>>>> I think everything after AttributeOffset is off by 2 bytes. DGID doesn't
>>>> look right to me (no subnet prefix fe80:: in front of GUID).
>>>
>>> Yes, I made a small mistake with the hex editor. I started the shift after
>>> the subnet prefix.
>>> Sorry for the confusion.
>>>
>>> Thank you for the hint with smpquery and saquery, I will check that
>>> tomorrow.
>>>
>>> Jens
>>>
>>>>
>>>> -- Hal
>>>>
>>>>>
>>>>> Regards,
>>>>> Jens
>>>>>
>>>>>>
>>>>>> -- Hal
>>>>>>
>>>>>>>>
>>>>>>>> -- Hal
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>>>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>>>>>>>> working (not dropping) aside from whether it's really the correct
>>>>>>>>>>>> SL to use.
>>>>>>>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>>>>>>>> SL: |   0 |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |  11 |  12 |  13 |  14 |  15 |
>>>>>>>>>>> VL: | 0x0 | 0x1 | 0x2 | 0x3 | 0x4 | 0x5 | 0x6 | 0x7 | 0x0 | 0x1 | 0x2 | 0x3 | 0x4 | 0x5 | 0x6 | 0x7 |
>>>>>>>>>>> But this is also as expected, because I have set the QoS in the
>>>>>>>>>>> opensm config as follows:
>>>>>>>>>>> qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>>>>>>>> This was set for "default", "CA" and "Switch external ports". I
>>>>>>>>>>> have not touched the config for "Switch Port 0" and "Router ports",
>>>>>>>>>>> they remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>>>>>>>
>>>>>>>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>>>>>>>> Yes, all VL_CAP show 4 in the OpenSM log file.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Jens
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -- Hal
>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Jens
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -- Hal
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The output of OpenMPI or OpenSM's log file doesn't show any
>>>>>>>>>>>>>>> useful information for this problem, even with higher debug
>>>>>>>>>>>>>>> levels.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So nothing interesting logged relative to the PathRecord queries
>>>>>>>>>>>>>> ?
>>>>>>>>>>>>> In the OpenSM log, only that it was received, what the request
>>>>>>>>>>>>> looks like, and that it was sent back.
>>>>>>>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error
>>>>>>>>>>>>>>> in the kernel driver, the HCA firmware or something completely
>>>>>>>>>>>>>>> different. Or if umad_send basically does not support SL>0.
>>>>>>>>>>>>>>> A workaround for the moment is to set the SL in the
>>>>>>>>>>>>>>> umad_set_addr_net(...) call to 0.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So SL 0 works between all nodes and SA for querying/responses.
>>>>>>>>>>>>>> Wonder if that's how SMSL is set by DFSSSP.
>>>>>>>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked
>>>>>>>>>>>>> this. In our case (OpenSM running on a compute node), it sets the
>>>>>>>>>>>>> same SL, which is used for MPI<->MPI traffic, to ensure deadlock
>>>>>>>>>>>>> freedom.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> Jens
>>>>>>>>>>>>>
>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku,
>>>>>>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>>>>>>> E-Mail: domke.j...@m.titech.ac.jp
>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>>>>>>> the body of a message to majord...@vger.kernel.org
>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html