Hi again,

On 12/14/2012 10:17 AM, Jens Domke wrote:
> Hello Hal,
>
> thank you for the fast response. I will try to clarify some points.
>
>>> d) OpenMPI runs are executed with "--mca
>>> btl_openib_ib_path_record_service_level 1"
>>
>> I'm not familiar with what DFSSSP does to figure out SLs exactly, but
>> there should be no need to set this. The proper SL for querying the SA
>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>> (and other QoS-based routing algorithms), it calculates that and the SM
>> pushes it into each port. That should be used. It's possible that SL 1
>> is not a valid SL for port <-> SA querying with DFSSSP.
>
> The OpenMPI parameter btl_openib_ib_path_record_service_level does not
> specify the SL for querying the PathRecords; it just enables the
> functionality, and the OMPI processes use PortInfo.SMSL to send the
> request. For the "port -> SA" request, every SL with 0 <= SL <= 7 was
> used in the test, and the SA received the requests.
>
>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>
>>> As far as I understand the whole system:
>>> 1. the OMPI processes send MAD requests (SubnAdmGet:PathRecord) to
>>>    the OpenSM
>>> 2. the SA receives the request on QP1
>>
>> There is the SL in the query itself. This should be the SMSL that the
>> SM set for that port.
>
> Hmm, there you might have a point. I think I saw that the query itself
> had SL=0 specified. In fact, OpenMPI sets everything to 0 except for
> slid and dlid.
>
>>> 3. the SA asks the routing algorithm (like LASH, DFSSSP or
>>>    Torus_2QoS) about a special service level for the slid/dlid path
>>
>> This is a (potentially) different SL (for MPI <-> MPI port
>> communication) than the one the query used, and it is the one returned
>> inside the PathRecord attribute/data.
>
> Yes, it can be different, but DFSSSP sets the same SL, because the SM
> is running on a port which is also used for MPI communication.
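For cross-checking what SMSL the SM actually pushed into each local port,
the kernel exposes the value in sysfs. A small sketch of my own (not part
of OpenSM or OpenMPI; it assumes the standard /sys/class/infiniband
layout and that the kernel exposes the sm_sl port attribute):

```python
from pathlib import Path

def port_sm_sl(sysfs_root="/sys/class/infiniband"):
    """Collect {(device, port): SMSL} for every IB port exposing sm_sl."""
    smsl = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return smsl
    for dev in root.iterdir():
        # Each HCA has a ports/<n>/ subtree with per-port attributes.
        for port in sorted((dev / "ports").glob("*")):
            attr = port / "sm_sl"
            if attr.is_file():
                smsl[(dev.name, port.name)] = int(attr.read_text())
    return smsl

if __name__ == "__main__":
    for (dev, port), sl in port_sm_sl().items():
        print(f"{dev} port {port}: SMSL {sl}")
```

Comparing this value on the requesting node against the SL seen in the
query at the SA would show directly whether OMPI really uses the pushed
SMSL.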
With DFSSSP, are all SLs the same from a given source port to any
destination?

>>> 4. the SA sends the PathRecord back to the OMPI process via
>>>    umad_send() in libvendor/osm_vendor_ibumad.c
>>
>> By the response reversibility rule, I think this is returned on the SL
>> of the original query, but I haven't verified this in the code base
>> yet.
>
> Ok, I was not aware of that rule. But if this is true, then the SA
> should also be able to send via SL > 0.

I double-checked, and indeed the SA response does use the SL that the
incoming request was received on.

>>> The osm_vendor_send() function builds the MAD packet with the
>>> following attributes:
>>> /* GS classes */
>>> umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>                   p_mad_addr->addr_type.gsi.remote_qp,
>>>                   p_mad_addr->addr_type.gsi.service_level,
>>>                   IB_QP1_WELL_KNOWN_Q_KEY);
>>> So, the SL is the same as the one which was used by the OMPI process.
>>> The Q_Key matches the Q_Key on the OMPI process, and remote_qp and
>>> dest_lid are correct, too.
>>> Afterwards umad_send(…) is used to send the reply with the
>>> PathRecord, and this send does not work (except for SL=0).
>>
>> By "not working", what do you mean? Is it not received at the
>> requester with no message in the OpenSM log, not received at the
>> OpenSM, or something else? It could be due to the wrong SL being used
>> in the original request (forcing it to SL 1). That could cause it not
>> to be received at the SM, or the response not to make it back from the
>> SA to the requester if the SL used is not "reversible".
>
> By "not working" I mean that the MPI process does not receive any
> response from the SA. I get messages from the MPI process like the
> following:
> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info]
> No response from SA after 20 retries
> The log of OpenSM shows that the SA received the PathRecord query,
> dumps the query into the log, and sends the reply back.
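As an aside for anyone reading such dumps: umad_set_addr_net() expects
its lid/qpn/qkey arguments already in network byte order, so the
integers gdb prints for the addr fields on a little-endian x86_64 host
look byte-swapped. A small sketch (my own illustration, not OpenSM
code) decoding two of the raw values from the dump quoted further down:

```python
import struct

def be32_as_host_le(v):
    """Reinterpret an integer printed on a little-endian host as the
    big-endian (network-order) value stored in a __be32 field."""
    return struct.unpack(">I", struct.pack("<I", v))[0]

def be16_as_host_le(v):
    """Same reinterpretation for a __be16 field."""
    return struct.unpack(">H", struct.pack("<H", v))[0]

# Values exactly as gdb printed them from the addr struct below:
print(hex(be32_as_host_le(384)))   # qkey -> 0x80010000, the well-known QP1 Q_Key
print(be16_as_host_le(4096))       # lid  -> destination LID 16
```

The qkey decoding to 0x80010000 and the lid to a plausible LID suggest
the reply address itself is well-formed; the sl field is a single byte,
so SL=6 needs no swapping.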
> And I think I saw some messages in the log about "…1 outstanding
> MAD…".
>
>>> If I look into the MAD before it is sent, then it looks like this:
>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530,
>>>     length=120, timeout_ms=0, retries=3) at src/umad.c:791
>>> 791         if (umaddebug > 1)
>>> (gdb) p *mad
>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3,
>>>   length = 0, addr = {qpn = 1325427712, qkey = 384, lid = 4096,
>>>   sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000',
>>>   gid_index = 0 '\000', hop_limit = 0 '\000',
>>>   traffic_class = 0 '\000', gid = '\000' <repeats 15 times>,
>>>   flow_label = 0, pkey_index = 0,
>>>   reserved = "\000\000\000\000\000"},
>>>   data = 0x7fffe8012530 "\002"}
>>
>> Is this the PathRecord query on the OpenMPI side or the response on
>> the OpenSM side? SL is 6 rather than 1 here.
>
> This is the response on the OpenSM side (inside the umad_send()
> function, right before it is written to the device with write(fd, …)).
> SL=6 indicates that the MPI process was sending the request on SL 6.

What is the SMSL for the requester? Was it SL 6? One would need to walk
the SLtoVLMappingTables from the requester (OMPI port) to the SA and
back to see whether SL 6 would even have a chance of working (not being
dropped), aside from whether it's really the correct SL to use.

-- Hal

>>> The output of OpenMPI or OpenSM's log file doesn't show any useful
>>> information for this problem, even with higher debug levels.
>>
>> So nothing interesting logged relative to the PathRecord queries?
>
> In the OpenSM log, only that the request was received, what it looks
> like, and that it was sent back. And a few "outstanding MADs" a few
> lines later in the log.
>
>>> So, right now I'm stuck, and have no idea if there is an error in
>>> the kernel driver, the HCA firmware or something completely
>>> different. Or if umad_send() basically does not support SL > 0.
>>> A workaround for the moment is to set the SL in the
>>> umad_set_addr_net(...)
>>> call to 0.
>>
>> So SL 0 works between all nodes and the SA for queries/responses. I
>> wonder whether that's how DFSSSP sets the SMSL.
>
> No, the SMSL set by DFSSSP is different from 0, I have checked this.
> In our case (OpenSM running on a compute node), it sets the same SL
> which is used for MPI <-> MPI traffic, to ensure deadlock freedom.
>
> Regards
> Jens
>
> --------------------------------
> Dipl.-Math. Jens Domke
> Researcher - Tokyo Institute of Technology
> Satoshi MATSUOKA Laboratory
> Global Scientific Information and Computing Center
> 2-12-1-E2-7 Ookayama, Meguro-ku,
> Tokyo, 152-8550, JAPAN
> Tel/Fax: +81-3-5734-3876
> E-Mail: domke.j...@m.titech.ac.jp
> --------------------------------

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html