RE: When IBoE will be merged to upstream?
>>>> ...A verbs consumer using a RoCE network relies strictly on so-called Layer 3 addressing (GIDs); layer 2 addresses (e.g. subnet local identifiers) are not passed across the verbs interface...

>>> Ah, hmm, well, I was on that list during this time and I don't think this statement means what you are saying it does :)

>> ?? It doesn't get any clearer than this.

> 'subnet local identifier' == LID
> The text is saying that the specification does not use any of the LID fields in the verbs interface, that is it. It isn't talking about MAC addresses. Exactly how and where the MAC address comes about was never decided, and at least some participants thought it should be a 1:1 algorithmic mapping from the GID. Ditto for VLANs, how and where the vlan tag comes about is not part of the spec.

You are trying to rewrite history. Read the spec, the address handle fields are fixed.

>> Good idea! This is exactly what we do today for addresses that the user explicitly declares as link-local addresses. But, we can't mandate an overload of the GID in a way that it prevents its use as a true L3 address (eventually routable).

> We are very unlikely to see routable IBoE, ever..

Says who?

> But, even if we do get there some day then we could extend the AH.

This is unacceptable - we are not going to add another L3 identifier.

> BTW, I absolutely hate the mixing of 'Sometimes it is a IPv4, sometimes it is a GID, and sometimes it is an IPv6' in the same field. That is just so nasty. The GID is a GID, don't overload it in an ambiguous way to mean 2 other things!

A GID is a GID indeed -- in a RoCE environment, it's the layer 3 identifier. All of our intended values are standard ipv6 encapsulations.

>>> create_ah does not accept any sort of source address specifier

>> You are wrong -- sgid_index specifies it.

> So, what do you propose to put in sgid_index? It isn't big enough to store an IPv6 address. You can't exactly number every IP assigned to every ethernet interface.
An iboe device is associated with a specific Ethernet interface. Thus, its gid table only needs to map the ip addresses assigned to that interface.

> The other fields you mention are not a superset of socket parameters, they are only IPv6 parameters, IPv4 uses a different set.

Like what?

>> Jason, bottom line, I think that we both agree that the rdmacm should do the address resolution. The difference is that by having the rdmacm initially only bind to the device and complete the resolution later (by a call from create_ah()), we don't change the user API for *all* gid types. Having addressed your concerns regarding resolution below the Verbs, we continue to believe that this is the best approach.

> Again, I don't see how what I've outlined changes the API in any way.

We currently support link-local gids, but the architecture must not limit the scope.

> Doing two routing lookups for the same connection is bad design, it is racy. L2 parameters have to flow from the first routing lookup in RDMA-CM to everything else.

So is caching L3-->L2 mappings that change a second later... So what?

> Liran, I don't think you have at all come close to addressing my concerns, you still haven't explained how a full route lookup is even possible in create_ah, for instance. Let alone my other concerns!
>
> Jason

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] opensm/osm_helper.c: Add some missing message names to disp_msg_str
On 10:27 Fri 09 Jul , Hal Rosenstock wrote: Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com Applied. Thanks. Sasha
Re: some dapl assistance
Davis, Arlin R wrote:
> There is limited debug in the non-debug builds. If you want full debugging capabilities you can install the source RPM and configure and make as follows [..] (OFED target example):

okay, got that. Once I built the sources by hand as you suggested I could see debug prints, but things didn't really work, so I stepped back and installed the latest rpms - dapl-2.0.29-1 and compat-dapl-1.2.18-1. Now I can't get intel-mpi to run:

[r...@dodly0 ~]# rpm -qav | grep dapl
dapl-utils-2.0.29-1
dapl-2.0.29-1
compat-dapl-1.2.18-1

[r...@dodly0 ~]# ldconfig -p | grep libdat
	libdat2.so.2 (libc6,x86-64) => /usr/lib64/libdat2.so.2
	libdat.so.1 (libc6,x86-64) => /usr/lib64/libdat.so.1

[r...@dodly0 ~]# rpm -qf /usr/lib64/libdat.so.1
compat-dapl-1.2.18-1
[r...@dodly0 ~]# rpm -qf /usr/lib64/libdat2.so.2
dapl-2.0.29-1

[r...@dodly0 ~]# /opt/intel/impi/4.0.0.027/intel64/bin/mpiexec -ppn 1 -n 2 -env DAPL_IB_PKEY 0x8002 -env DAPL_DBG_TYPE 0xff -env DAPL_DBG_DEST 0x3 -env I_MPI_DEBUG 3 -env I_MPI_CHECK_DAPL_PROVIDER_MISMATCH none -env I_MPI_FABRICS dapl:dapl /tmp/osu
[0] MPI startup(): cannot open dynamic library libdat.so
[1] MPI startup(): cannot open dynamic library libdat.so
[0] MPI startup(): cannot open dynamic library libdat2.so
[0] dapl fabric is not available and fallback fabric is not enabled
[1] MPI startup(): cannot open dynamic library libdat2.so
[1] dapl fabric is not available and fallback fabric is not enabled
rank 1 in job 5 dodly0_54941 caused collective abort of all ranks
exit status of rank 1: return code 254
rank 0 in job 5 dodly0_54941 caused collective abort of all ranks
exit status of rank 0: return code 254

Any idea what we're doing wrong? BTW - before things stopped working, exporting LD_DEBUG=libs to the MPI rank, I noticed that it used the compat-1.2 rpm ...

Now, I can run dapltest fine:

[r...@dodly0 ~]# dapltest -T S -D ofa-v2-mthca0-1
Dapltest: Service Point Ready - ofa-v2-mthca0-1
Dapltest: Service Point Ready - ofa-v2-mthca0-1
Server: Transaction Test Finished for this client

[r...@dodly4 ~]# dapltest -T T -D ofa-v2-mlx4_0-1 -s dodly0 -i 1000 server SR 65536 4 client SR 65536 4
Server Name: dodly0
Server Net Address: 172.30.3.230
DT_cs_Client: Starting Test ...
- Stats : 1 threads, 1 EPs
Total WQE:2919.70 WQE/Sec
Total Time : 0.68 sec
Total Send : 262.14 MB - 382.69 MB/Sec
Total Recv : 262.14 MB - 382.69 MB/Sec
Total RDMA Read : 0.00 MB - 0.00 MB/Sec
Total RDMA Write : 0.00 MB - 0.00 MB/Sec
DT_cs_Client: == End of Work -- Client Exiting

I also noted that dapl-utils and compat-dapl-utils are mutually exclusive, as both attempt to install the same man page for dat.conf:

# rpm -Uvh /usr/src/redhat/RPMS/x86_64/compat-dapl-utils-1.2.18-1.x86_64.rpm
Preparing...### [100%]
file /usr/share/man/man5/dat.conf.5.gz from install of compat-dapl-utils-1.2.18-1.x86_64 conflicts with file from package dapl-utils-2.0.29-1.x86_64

Or.
Re: [ANNOUNCE] management tarballs release
On 13:17 Thu 08 Jul , Hal Rosenstock wrote:
>> d3586e7a17bca99fd384a943f00e259e  libibumad-1.3.5.tar.gz
>> 754d93f567393d3b9987a65326f40917  libibmad-1.3.5.tar.gz
>> 5c94d6ee49e9c51c801f6634823b5ad5  opensm-3.3.6.tar.gz
>> ba28f6b5323e6067ca019a999eeaf907  infiniband-diags-1.5.6.tar.gz
>
> Shouldn't these versions be labeled/tagged in your management git tree? Would you do that?

I did, but forgot to push the tags to the openfabrics tree. Fixed now.

Sasha
RE: some dapl assistance
Sorry, Intel MPI requires the development packages, which include libdat.so and libdat2.so.

Please see the install instructions on http://www.openfabrics.org/downloads/dapl/

---
For 1.2 and 2.0 support on the same system, including development, install the RPM packages as follows:

dapl-2.0.29-1
dapl-utils-2.0.29-1
dapl-devel-2.0.29-1
dapl-debuginfo-2.0.29-1
compat-dapl-1.2.18-1
compat-dapl-devel-1.2.18-1
---

Thanks for the heads up on the dat.conf manpage. I will fix the conflict in the next release.

-arlin

-----Original Message-----
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Tuesday, July 13, 2010 4:41 AM
To: Davis, Arlin R
Cc: Itay Berman; linux-rdma
Subject: Re: some dapl assistance

[..]
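The failure discussed in this thread comes down to the unversioned names libdat.so and libdat2.so being absent while the versioned libdat.so.1 and libdat2.so.2 are installed. A minimal sketch of that check follows; the function name and the default directory are illustrative assumptions, not part of dapl or Intel MPI.

```python
import os

# Unversioned names that the MPI startup messages in this thread ask for.
DEV_LINKS = ("libdat.so", "libdat2.so")


def missing_dev_links(filenames, wanted=DEV_LINKS):
    """Return the unversioned library names not present in a directory listing."""
    present = set(filenames)
    return [name for name in wanted if name not in present]


if __name__ == "__main__":
    libdir = "/usr/lib64"  # assumption: the library directory shown by ldconfig above
    try:
        listing = os.listdir(libdir)
    except OSError:
        listing = []
    for name in missing_dev_links(listing):
        print("missing %s/%s (shipped by the -devel packages)" % (libdir, name))
```

On a host with only dapl and compat-dapl installed, both names would be reported as missing, which matches the "cannot open dynamic library" messages above.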
Re: basic questions about partitions
Hal,

>> The procedure is a little different. You'll need to create a child interface (on a partition) first and then you will be able to configure it as follows:
>>
>>   echo 0x8001 > /sys/class/net/ib0/create_child
>>   ifconfig ib0.8001 ...
>>
>> Note that you'll want the 0x8000 bit on for full membership.

What does full membership mean? And are you saying that for full membership, any value beginning with 0x8 would work? Can an HCA have full membership on multiple partitions simultaneously?

Thanks,
Tom

On 7/12/2010 1:06 PM, Hal Rosenstock wrote:
> On Mon, Jul 12, 2010 at 3:03 PM, Hal Rosenstock hal.rosenst...@gmail.com wrote:
>> Hi Tom,
>>
>> On 7/12/10, Tom Ammon tom.am...@utah.edu wrote:
>>> Hi, I have some basic questions about IB partitions. Can an HCA port belong to more than 1 partition at a time?
>>
>> Yes.
>>
>>> How do you configure partitions with opensm? From reading the opensmd man page, it looks like you just create a file called /etc/osm-partitions.conf, with port GUIDs and such, but is this current?
>>
>> The default location depends on how OpenSM is configured/built.
>
> I missed this the first time around: and yes, the syntax indicated is current.
>
> -- Hal
>
>>> I ask because according to the man page the opensm configuration file is in /etc/opensm/ . Can you tell opensm where to look for the partitions file?
>>
>> Yes, with either the -P option on the command line or partition_config_file line in the options file.
>>
>> -- Hal
>>
>>> Thanks,
>>> Tom

--
Tom Ammon
Network Engineer
Office: 801.587.0976
Mobile: 801.674.9273
Center for High Performance Computing
University of Utah
http://www.chpc.utah.edu
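The child-interface step quoted in this message can be sketched as a small helper that derives the value to write and the resulting interface name. This is only an illustration: the helper name is hypothetical, and the `parent.xxxx` naming pattern is inferred from the single `ib0.8001` example in the thread.

```python
FULL_MEMBER_BIT = 0x8000  # the bit the thread says to set for full membership


def ipoib_child(parent, pkey):
    """Describe the sysfs write that creates an IPoIB child interface.

    Assumption: the child is named '<parent>.<pkey as 4 hex digits>',
    as in the 'ib0.8001' example from the thread.
    """
    pkey = (pkey | FULL_MEMBER_BIT) & 0xFFFF
    return {
        "pkey": "0x%04x" % pkey,
        "sysfs": "/sys/class/net/%s/create_child" % parent,
        "ifname": "%s.%04x" % (parent, pkey),
    }


if __name__ == "__main__":
    child = ipoib_child("ib0", 0x0001)
    # Prints the command from the thread rather than executing it.
    print("echo %s > %s" % (child["pkey"], child["sysfs"]))
    print("ifconfig %s ..." % child["ifname"])
```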
RE: RDMA test performance comment
> I see that this rdma_post_send call contributes heavily to CPU use on the client side. The CPU usage is now about 95%-99%.

CPU utilization is usually related to how you process completions. If you switch from polling the CQ to using events, the CPU utilization will go down. This will also result in the latency going up.

> Could you please also send me a comment about the unstable plateau in the graph of speed versus buffer size at a buffer size of 10^5 bytes? (attached files)

This is likely just an artifact of the test and hardware.
Re: basic questions about partitions
Tom,

On Tue, Jul 13, 2010 at 12:50 PM, Tom Ammon tom.am...@utah.edu wrote:
> Hal,
>
>> The procedure is a little different. You'll need to create a child interface (on a partition) first and then you will be able to configure it as follows:
>>
>>   echo 0x8001 > /sys/class/net/ib0/create_child
>>   ifconfig ib0.8001 ...
>>
>> Note that you'll want the 0x8000 bit on for full membership.
>
> What does full membership mean?

For a partition, full members can talk with both full and limited members whereas limited members can only talk with full members.

> And are you saying that for full membership, any value beginning with 0x8 would work?

The 0x8000 bit is the (full) membership bit, so 0x8abc is the pkey for full membership in the 0xabc partition. 15 bits of the pkey can be used for the partition, although my example above only used 12 bits.

> Can an HCA have full membership on multiple partitions simultaneously?

Yes.

-- Hal
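The pkey layout and membership rule described in this reply can be written down as a short sketch. The function names are illustrative; the bit layout (top bit for membership, low 15 bits for the partition) and the communication rule are taken from the explanation above.

```python
FULL_MEMBER_BIT = 0x8000  # membership bit
PARTITION_MASK = 0x7FFF   # remaining 15 bits identify the partition


def partition(pkey):
    """Partition number carried in the low 15 bits of a pkey."""
    return pkey & PARTITION_MASK


def is_full_member(pkey):
    """True when the membership bit is set."""
    return bool(pkey & FULL_MEMBER_BIT)


def can_talk(pkey_a, pkey_b):
    """Same partition, and at least one side is a full member."""
    same = partition(pkey_a) == partition(pkey_b)
    return same and (is_full_member(pkey_a) or is_full_member(pkey_b))


if __name__ == "__main__":
    print(hex(partition(0x8abc)), is_full_member(0x8abc))
    print(can_talk(0x8abc, 0x0abc), can_talk(0x0abc, 0x0abc))
```

So 0x8abc and 0x0abc name the same partition (0xabc); two limited members (0x0abc on both sides) cannot talk to each other.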
Re: When IBoE will be merged to upstream?
On Tue, Jul 13, 2010 at 11:26:41AM +0300, Liran Liss wrote:
>> 'subnet local identifier' == LID
>> The text is saying that the specification does not use any of the LID fields in the verbs interface, that is it. It isn't talking about MAC addresses. Exactly how and where the MAC address comes about was never decided, and at least some participants thought it should be a 1:1 algorithmic mapping from the GID. Ditto for VLANs, how and where the vlan tag comes about is not part of the spec.
>
> You are trying to rewrite history. Read the spec, the address handle fields are fixed.

Not really, this was all discussed on this list before the IBxoE working group was formed, it was discussed in the working group, and I even objected to the draft spec leaving this area absent. The spec doesn't say squat about how MAC and VLAN values get into the AH, and you have already heard how my opinion on this subject differs from others.

>> But, even if we do get there some day then we could extend the AH.
>
> This is unacceptable - we are not going to add another L3 identifier.

It wouldn't be adding another L3 identifier, it would be an L2 next hop MAC address for the router. It would be nice to do this from the start, but if growing the AH is really that scary then it should wait until someone figures out how to solve the lossless routing problem on ethernet.

>> BTW, I absolutely hate the mixing of 'Sometimes it is a IPv4, sometimes it is a GID, and sometimes it is an IPv6' in the same field. That is just so nasty. The GID is a GID, don't overload it in an ambiguous way to mean 2 other things!
>
> A GID is a GID indeed -- in a RoCE environment, it's the layer 3 identifier. All of our intended values are standard ipv6 encapsulations.

What makes a GID a GID is the fact that it is a separate addressing space from IPv6! If it is a GID then you don't overload it, if it is an IPv6 then you don't get to special case certain things, like link local!
>>>> create_ah does not accept any sort of source address specifier
>>>
>>> You are wrong -- sgid_index specifies it.
>>
>> So, what do you propose to put in sgid_index? It isn't big enough to store an IPv6 address. You can't exactly number every IP assigned to every ethernet interface.
>
> An iboe device is associated with a specific Ethernet interface. Thus, its gid table only needs to map the ip addresses assigned to that interface.

A few messages ago you said there was only one RDMA device per physical ethernet interface, not one per vlan! VLAN interfaces can have overlapping addresses (i.e. IPv6 link local) so I really don't see how creating a GID table helps disambiguate these cases.

>> Doing two routing lookups for the same connection is bad design, it is racy. L2 parameters have to flow from the first routing lookup in RDMA-CM to everything else.
>
> So is caching L3-->L2 mappings that change a second later... So what?

No, it is not the same. If you do a route lookup you get an atomic result from the routing table that represents something an admin configured. If you do two lookups and use information from both, then the net result might be a configuration that was never admin configured - i.e. you lose the atomicity of a route configuration change.

Normally ND mappings (L3->L2) track updates through the things that use them. The fact this cannot happen with IBoE is another bug, and again, a reason why it is unsuitable to treat a GID as an IPv6 address when you cannot provide the same functionality.

Jason
Re: When IBoE will be merged to upstream?
On Mon, Jul 12, 2010 at 02:20:22PM -0700, Roland Dreier wrote:
> So the best solution I can see is to declare that an IBoE GID must be an IPv6 address coming from an EUI-64 Ethernet address for the corresponding port; for MGIDs I guess we use the standard IPv6 mapping to Ethernet address 33:33:xx:xx:xx:xx.

This is what I have been advocating..

A quibble about multicast - AFAIK this is unsolved. I think some spec needs to be agreed that documents what sort of multicast snooping operations switches need to do, i.e. whether IGMP joins imply that IBoE traffic for the same DMAC is included in the join, or whether IBoE requires a separate IGMP type process on its own ether-type. That would make it much clearer what to do with MGIDs. IPv4 could be handled by mapping an IPv4 multicast address within an IPv6 mapped address, if necessary.

It would be nice to at least have a plan on how to integrate a non-link local address, if that is ever necessary in future. An extended AH with an additional 48-bit DMAC field seems reasonable to me?

Jason
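The two mappings Roland proposes can be illustrated with the ordinary IPv6 rules: a link-local address built from the MAC via modified EUI-64, and a multicast MAC built from 33:33 plus the low 32 bits of the IPv6 group address. The sketch below assumes those standard IPv6-over-Ethernet conventions; the thread only proposes them for IBoE, so nothing here should be read as what the IBoE code or spec actually does, and the function names are hypothetical.

```python
import ipaddress


def mac_to_link_local(mac):
    """Link-local IPv6 address from a 48-bit MAC using modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)


def ipv6_multicast_to_mac(addr):
    """Ethernet multicast MAC: 33:33 followed by the low 32 bits of the address."""
    low32 = ipaddress.IPv6Address(addr).packed[-4:]
    return "33:33:" + ":".join("%02x" % byte for byte in low32)


if __name__ == "__main__":
    print(mac_to_link_local("00:02:c9:01:02:03"))
    print(ipv6_multicast_to_mac("ff02::1"))
```

Because the unicast mapping is purely algorithmic, the destination MAC can be recovered from such a GID without a neighbour lookup, which is the property the link-local proposal relies on.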
Re: basic questions about partitions
Hal,

I have some more partition questions. Do I have to configure the nodes as members of partition(s) both in opensm and on the individual node (if the node is a member of two partitions)? And what about the switches - do the switches need to be configured as part of any partition?

Tom

On 7/13/2010 11:44 AM, Hal Rosenstock wrote:
[..]
Re: When IBoE will be merged to upstream?
> A quibble about multicast - AFAIK this is unsolved. I think some spec needs to be agreed that documents what sort of multicast snooping operations switches need to do, i.e. whether IGMP joins imply that IBoE traffic for the same DMAC is included in the join, or whether IBoE requires a separate IGMP type process on its own ether-type. That would make it much clearer what to do with MGIDs.

I agree -- the current spec is rather broken for multicast. Choosing a different ethertype and then saying that all switches will just flood multicast traffic is half-baked at best.

> It would be nice to at least have a plan on how to integrate a non-link local address, if that is ever necessary in future. An extended AH with an additional 48-bit DMAC field seems reasonable to me?

You mean have a next-hop destination + a final destination? Could be done I guess. But I'm not sure how having a routing table where you have to look up 48-bit Ethernet addresses is all that different from just having a standard Ethernet forwarding table. I suppose something based on MAC-in-MAC (a la 802.1ah) could be done, but to be honest the IBoE spec that the IBTA came up with looks rather broken for routing.

 - R.
--
Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html
Re: basic questions about partitions
Tom,

On Tue, Jul 13, 2010 at 3:44 PM, Tom Ammon tom.am...@utah.edu wrote:
> Hal,
>
> I have some more partition questions. Do I have to configure the nodes as members of partition(s) both in opensm and on the individual node (if the node is a member of two partitions)?

OpenSM needs the configuration we've been discussing. It pushes the pkeys to the end ports. All that needs to be done at the end port is whatever the ULP/application needs. I sent the info for IPoIB. The partitions used there need to match how OpenSM is configured.

> And what about the switches - do the switches need to be configured as part of any partition?

OpenSM automatically calculates the peer switch port partitions based on the end port partitions, so no explicit configuration is required for switch ports.

-- Hal
Re: OpenMPI over RoCEE
Does it work with Open MPI v1.4.2?

On Jul 12, 2010, at 4:21 PM, Steve Wise wrote:
> I'm running OFED-1.5.1 with the RoCEE mlx4 drivers. I can run low level verbs programs ok, but when running open mpi, I'm getting this error. Anybody seen this?
>
> [o...@escher ~]$ mpirun -np 2 -host 10.192.176.111,10.192.176.112 --mca btl openib,sm,self /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 -msglen msglen.txt -iter 100 pingpong
> [escher][[36356,1],1][connect/btl_openib_connect_oob.c:325:qp_connect_all] error modifing QP to RTR errno says Invalid argument
> [escher][[36356,1],1][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in endpoint reply start connect
> --
> mpirun has exited due to process rank 1 with PID 4894 on node escher exiting without calling finalize. This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
> --

--
Jeff Squyres jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: OpenMPI over RoCEE
You know, I got it running by adding this:

  --mca btl_openib_cpc_include rdmacm

Which basically sez use only the rdmacm to setup the connection.

Thanks,

Steve.

Jeff Squyres wrote:
> Does it work with Open MPI v1.4.2?
[..]
Re: [PATCHv5 1/2][RESEND] opensm/PerfMgr: Better redirection support
Sasha,

What is the status of this patch and the follow on patch:

opensm/osm_console.c: Add dump and clear redir perfmgr command support

Thanks,
Ira

On Thu, 17 Jun 2010 10:28:49 -0700 Hal Rosenstock hnr...@comcast.net wrote:

Handle PKey and QPN redirection information
GID redirection handling remains

Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
---
Changes since v4:
Fixed some trailing whitespace problems

Changes since v3:
Rebased

Changes since v2:
Use OpenSM DB rather than vendor layer for local port number and PKeys
Change most log levels from ERROR to VERBOSE
Redirection info validity now determined by single flag
validate_redir_pkey returns pkey index or -1 rather than boolean
Removed redir_ prefixes

Changes since v1:
Added include of osm_helper.h to osm_perfmgr.c

diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h
index c26c141..34925e8 100644
--- a/opensm/include/opensm/osm_perfmgr.h
+++ b/opensm/include/opensm/osm_perfmgr.h
@@ -1,7 +1,7 @@
 /*
  * Copyright (c) 2007 The Regents of the University of California.
  * Copyright (c) 2007-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2009 HNR Consulting. All rights reserved.
+ * Copyright (c) 2009,2010 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses. You may choose to be licensed under the terms of the GNU
@@ -90,11 +90,17 @@ typedef enum {
 	PERFMGR_SWEEP_SUSPENDED
 } osm_perfmgr_sweep_state_t;

-/* Redirection information */
-typedef struct redir {
-	ib_net16_t redir_lid;
-	ib_net32_t redir_qp;
-} redir_t;
+typedef struct monitored_port {
+	uint16_t pkey_ix;
+	ib_net16_t orig_lid;
+	boolean_t redirection;
+	boolean_t valid;
+	/* Redirection fields from ClassPortInfo */
+	ib_gid_t gid;
+	ib_net16_t lid;
+	ib_net16_t pkey;
+	ib_net32_t qp;
+} monitored_port_t;

 /* Node to store information about nodes being monitored */
 typedef struct monitored_node {
@@ -104,7 +110,7 @@ typedef struct monitored_node {
 	boolean_t esp0;
 	char *name;
 	uint32_t num_ports;
-	redir_t redir_port[1];	/* redirection on a per port basis */
+	monitored_port_t port[1];
 } monitored_node_t;

 struct osm_opensm;
@@ -134,6 +140,8 @@ typedef struct osm_perfmgr {
 	uint32_t max_outstanding_queries;
 	cl_qmap_t monitored_map;	/* map the nodes being tracked */
 	monitored_node_t *remove_list;
+	ib_net64_t port_guid;
+	int16_t local_port;
 } osm_perfmgr_t;
 /*
  * FIELDS
diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
index 398b463..fccf9d6 100644
--- a/opensm/opensm/osm_perfmgr.c
+++ b/opensm/opensm/osm_perfmgr.c
@@ -1,7 +1,7 @@
 /*
  * Copyright (c) 2007 The Regents of the University of California.
  * Copyright (c) 2007-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2009 HNR Consulting. All rights reserved.
+ * Copyright (c) 2009,2010 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses. You may choose to be licensed under the terms of the GNU
@@ -64,6 +64,7 @@
 #include <opensm/osm_log.h>
 #include <opensm/osm_node.h>
 #include <opensm/osm_opensm.h>
+#include <opensm/osm_helper.h>

 #define PERFMGR_INITIAL_TID_VALUE 0xcafe

@@ -194,6 +195,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context,
 	uint8_t port = context->perfmgr_context.port;
 	cl_map_item_t *p_node;
 	monitored_node_t *p_mon_node;
+	ib_net16_t orig_lid;

 	OSM_LOG_ENTER(pm->log);

@@ -225,9 +227,11 @@ static void perfmgr_mad_send_err_callback(void *bind_context,
 				p_mon_node->num_ports);
 			goto Exit;
 		}
-		/* Clear redirection info */
-		p_mon_node->redir_port[port].redir_lid = 0;
-		p_mon_node->redir_port[port].redir_qp = 0;
+		/* Clear redirection info for this port except orig_lid */
+		orig_lid = p_mon_node->port[port].orig_lid;
+		memset(&p_mon_node->port[port], 0, sizeof(monitored_port_t));
+		p_mon_node->port[port].orig_lid = orig_lid;
+		p_mon_node->port[port].valid = TRUE;
 		cl_plock_release(&pm->osm->lock);
 	}

@@ -256,7 +260,7 @@ ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * pm, ib_net64_t port_guid)
 		goto Exit;
 	}

-	bind_info.port_guid = port_guid;
+	bind_info.port_guid = pm->port_guid = port_guid;
 	bind_info.mad_class = IB_MCLASS_PERF;
 	bind_info.class_version = 1;
 	bind_info.is_responder = FALSE;
@@ -309,24 +313,14 @@ static ib_net32_t get_qp(monitored_node_t * mon_node, uint8_t port)
 	ib_net32_t qp = IB_QP1;
Re: ib_qib: Allow writes to the diag_counters to be able to clear them
On Tue, 13 Jul 2010 03:31:09 -0700 Bart Van Assche bvanass...@acm.org wrote:
On Sat, Jul 10, 2010 at 5:25 PM, Bart Van Assche bvanass...@acm.org wrote:
On Sat, Jul 10, 2010 at 2:56 AM, Ira Weiny wei...@llnl.gov wrote:
On Fri, 9 Jul 2010 12:33:14 -0700 Bart Van Assche bvanass...@acm.org wrote:
On Thu, Jul 8, 2010 at 8:04 PM, Ira Weiny wei...@llnl.gov wrote:
On Thu, 8 Jul 2010 10:37:26 -0700 Bart Van Assche bvanass...@acm.org wrote:
On Thu, Jul 8, 2010 at 2:33 AM, Ira Weiny wei...@llnl.gov wrote:

From 80eecc4046455999254fb312c4ba229b3a52d4c6 Mon Sep 17 00:00:00 2001
From: Ira Weiny wei...@llnl.gov
Date: Wed, 7 Jul 2010 17:35:34 -0700
Subject: [PATCH] ib_qib: Allow writes to the diag_counters to be able to clear them

Signed-off-by: Ira Weiny wei...@llnl.gov
---
 drivers/infiniband/hw/qib/qib_sysfs.c | 16 +++-
 1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_sysfs.c b/drivers/infiniband/hw/qib/qib_sysfs.c
index dab4d9f..91cd1b8 100644
--- a/drivers/infiniband/hw/qib/qib_sysfs.c
+++ b/drivers/infiniband/hw/qib/qib_sysfs.c
@@ -347,7 +347,7 @@ static struct kobj_type qib_sl2vl_ktype = {

 #define QIB_DIAGC_ATTR(N) \
 	static struct qib_diagc_attr qib_diagc_attr_##N = { \
-		.attr = { .name = __stringify(N), .mode = 0444 }, \
+		.attr = { .name = __stringify(N), .mode = 0664 }, \
 		.counter = offsetof(struct qib_ibport, n_##N) \
 	}

@@ -403,8 +403,22 @@ static ssize_t diagc_attr_show(struct kobject *kobj, struct attribute *attr,
 	return sprintf(buf, "%u\n", *(u32 *)((char *)qibp + dattr->counter));
 }

+static ssize_t diagc_attr_store(struct kobject *kobj, struct attribute *attr,
+				const char *buf, size_t size)
+{
+	struct qib_diagc_attr *dattr =
+		container_of(attr, struct qib_diagc_attr, attr);
+	struct qib_pportdata *ppd =
+		container_of(kobj, struct qib_pportdata, diagc_kobj);
+	struct qib_ibport *qibp = &ppd->ibport_data;
+
+	*(u32 *)((char *)qibp + dattr->counter) = simple_strtol(buf, NULL, 0);
+	return 4;

The above line is not correct -- it should probably be something like return size;. See also http://***git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/sysfs.txt.

My mistake. From the document above the return should be:

return strnlen(buf, PAGE_SIZE);

Correct? Also I think I should check for invalid values as well.

(resending as plain text)

The document I referred to was written before the size argument was added to sysfs store methods.

Are you sure? The document's signature for store methods includes the size argument?

I'm not sure it is still required that the buf argument that is passed to store methods is '\0'-terminated. So both return 4 and return strnlen(buf, PAGE_SIZE) can potentially return a value that is larger than the size argument, which I think is incorrect. Sorry that I pointed you to misleading documentation.

Also, the document is the same as the one in Roland's latest master. Is the document wrong? I have not found anything newer (22 February 2009).

Let's ask the experts. The feedback on the sysfs documentation patch that I just submitted will tell us whether or not the sysfs documentation is correct (http://*lkml.org/lkml/2010/7/10/72).

I have just received a private e-mail that informed me that Andrew Morton was so kind to sign off that sysfs documentation patch and to add it to his -mm tree. So if nobody complains about that patch within the next few days, that patch will be integrated in a future version of the Linux kernel.

Sorry, I see where my mistake was. Some of the store methods did have a count parameter and others (more importantly the sysfs_ops structure) did not. :-)

I will send out V3 shortly.

There was one more place the count parameter was missing:

diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
index d78ed0b..c853eee 100644
--- a/Documentation/filesystems/sysfs.txt
+++ b/Documentation/filesystems/sysfs.txt
@@ -333,7 +333,7 @@ Structure:
 struct bus_attribute {
 	struct attribute	attr;
 	ssize_t (*show)(struct bus_type *, char * buf);
-	ssize_t (*store)(struct bus_type *, const char * buf);
+	ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
 };

 Declaring:

I can send a
[PATCH V3] ib_qib: Allow writes to the diag_counters to be able to clear them
From: Ira Weiny wei...@llnl.gov
Date: Wed, 7 Jul 2010 17:35:34 -0700
Subject: [PATCH] ib_qib: Allow writes to the diag_counters to be able to clear them

Changes in V3:
Add non-number error check
Return proper length

Changes in V2:
Add check for negative values
Return proper length

Signed-off-by: Ira Weiny wei...@llnl.gov
---
 drivers/infiniband/hw/qib/qib_sysfs.c | 21 -
 1 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_sysfs.c b/drivers/infiniband/hw/qib/qib_sysfs.c
index dab4d9f..b214eff 100644
--- a/drivers/infiniband/hw/qib/qib_sysfs.c
+++ b/drivers/infiniband/hw/qib/qib_sysfs.c
@@ -347,7 +347,7 @@ static struct kobj_type qib_sl2vl_ktype = {

 #define QIB_DIAGC_ATTR(N) \
 	static struct qib_diagc_attr qib_diagc_attr_##N = { \
-		.attr = { .name = __stringify(N), .mode = 0444 }, \
+		.attr = { .name = __stringify(N), .mode = 0664 }, \
 		.counter = offsetof(struct qib_ibport, n_##N) \
 	}

@@ -403,8 +403,27 @@ static ssize_t diagc_attr_show(struct kobject *kobj, struct attribute *attr,
 	return sprintf(buf, "%u\n", *(u32 *)((char *)qibp + dattr->counter));
 }

+static ssize_t diagc_attr_store(struct kobject *kobj, struct attribute *attr,
+				const char *buf, size_t size)
+{
+	struct qib_diagc_attr *dattr =
+		container_of(attr, struct qib_diagc_attr, attr);
+	struct qib_pportdata *ppd =
+		container_of(kobj, struct qib_pportdata, diagc_kobj);
+	struct qib_ibport *qibp = &ppd->ibport_data;
+	char *endp;
+	long val = simple_strtol(buf, &endp, 0);
+
+	if (val < 0 || endp == buf)
+		return -EINVAL;
+
+	*(u32 *)((char *)qibp + dattr->counter) = (u32)val;
+	return size;
+}
+
 static const struct sysfs_ops qib_diagc_ops = {
 	.show = diagc_attr_show,
+	.store = diagc_attr_store,
 };

 static struct kobj_type qib_diagc_ktype = {
--
1.5.4.5
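For readers who want to try the V3 checks outside the kernel, here is a minimal userspace sketch of the same validation. It is an illustration only, not the driver code: counter_store is a hypothetical name, standard strtol stands in for the kernel's simple_strtol, and a plain uint32_t pointer stands in for the qib_ibport counter field.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical userspace stand-in for a sysfs store method.
 * Parses buf as a number, rejects negative or non-numeric input,
 * stores the value in *counter and reports the whole write as consumed.
 */
ssize_t counter_store(uint32_t *counter, const char *buf, size_t size)
{
	char *endp;
	long val;

	assert(counter != NULL && buf != NULL);

	/* Base 0 accepts decimal, 0x-prefixed hex and 0-prefixed octal. */
	val = strtol(buf, &endp, 0);

	/* endp == buf means no digits were consumed at all. */
	if (val < 0 || endp == buf)
		return -EINVAL;

	/* As in the patch, values wider than 32 bits are truncated here. */
	*counter = (uint32_t)val;

	/* Return the caller's size, never more than was handed in. */
	return (ssize_t)size;
}
```

Returning size rather than a constant or a string length is the point raised in review: the return value tells the caller how many bytes were consumed, so it should not exceed what the caller passed in.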
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
Roland Dreier wrote:

I guess I came to a premature conclusion. One set of tests ran fine and I made that conclusion. Another set of tests caused the following crash:

I don't really know how to interpret this. Is this crash new, or is it the same crash you were hoping this patch fixed?

This is a new crash.

Thanks
Pradeep