Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
Walukiewicz, Miroslaw wrote:
> From my measurements it looks like the problem is related to memory allocation in the user-space and kernel path, which is a very, very expensive operation. Look at the tx path (rx is very similar).
>
> ibv_post_send():
>   post_send_wrapper_1_0
>     for (w = wr; w; w = w->next) {
>         real_wr = alloca(sizeof *real_wr);   <- 1. dyn alloc
>         real_wr->wr_id = w->wr_id;
>
> next the call to HW specific part and prepare message to send
>     cmd = alloca(cmd_size);   <- 2. dyn allocation

Hi Mirek,

I don't think there are applications around which would use raw qp AND are linked against libibverbs-1.0, such that they would exercise the 1_0 wrapper, so we can ignore the 1st allocation, the one in the wrapper code. As for the 2nd allocation, since a WQE --posting-- is synchronous, using the maximal values specified during the creation of the QP, I believe that this allocation can be done once per QP and reused later.

> dive to kernel: ib_uverbs_post_send()
>     user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);   <- 3. dyn alloc
>     next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
>                    user_wr->num_sge * sizeof (struct ib_sge),
>                    GFP_KERNEL);   <- 4. dyn alloc
>
> And now there is the final call to the driver.

~same here: for #4 you can compute/allocate once the maximal possible size for "next" per qp and use it later. As for #3, this needs further thinking. But before diving into all these design changes, what was the penalty introduced by these allocations? Is it in packets-per-second, latency?

> Diving to kernel is treated as something like passing a signal to the kernel that there is prepared information to post_send/post_recv. The information about buffers is passed through a shared page (available to userspace through mmap) to avoid copying of data. The write() op is used for passing the signal about post_send. The read() op is used to pass information about post_recv(). We avoid additional copying of the data that way.

Thanks for the heads-up. I took a look, and this user/kernel shared memory page is used to hold the work request; it has nothing to do with data. As for the work request, you still have to copy it in user space from the user work request to the library mmaped buffer. So the only difference would be the copy_from_user done by uverbs, for a few tens of bytes. Can you tell if/what is the extra penalty introduced by this copy?

    struct nes_ud_send_wr {
        u32 wr_cnt;
        u32 qpn;
        u32 flags;
        u32 resv[1];
        struct ib_sge sg_list[64];
    };

    struct nes_ud_recv_wr {
        u32 wr_cnt;
        u32 qpn;
        u32 resv[2];
        struct ib_sge sg_list[64];
    };

Looking at struct nes_ud_send/recv_wr, I wasn't sure I follow: the same instance can be used to post a list of work requests, where each work request is limited to one SGE, am I correct? I don't think there is a need to support posting 64 --send-- requests. For recv it might make sense, but it could be done in a "batch/background" flow. Thoughts?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
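The "allocate once per QP" suggestion for allocations #2 and #4 can be sketched roughly as below. This is only a user-space illustration, not libibverbs or uverbs code: struct qp_scratch, struct example_sge and the qp_scratch_* helpers are hypothetical names, and the sketch assumes that posts on a given QP are serialized, so a single buffer sized for the QP's maximum can be reused.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a scatter/gather entry. */
struct example_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Hypothetical per-QP scratch buffer, sized once from the QP's maximum. */
struct qp_scratch {
	size_t max_sge;	/* max SGEs per work request, fixed at QP creation */
	size_t size;	/* bytes reserved: header + max_sge entries */
	void *buf;	/* reused for every post on this QP */
};

/* Called once at QP creation; returns 0 on success, -1 on failure. */
int qp_scratch_init(struct qp_scratch *s, size_t hdr_size, size_t max_sge)
{
	assert(s != NULL);
	s->max_sge = max_sge;
	s->size = hdr_size + max_sge * sizeof(struct example_sge);
	s->buf = malloc(s->size);
	return s->buf ? 0 : -1;
}

/* Post path: no allocation, only a bounds check against the QP maximum. */
void *qp_scratch_get(struct qp_scratch *s, size_t num_sge)
{
	if (num_sge > s->max_sge)
		return NULL;
	return s->buf;
}

/* Called once at QP destruction. */
void qp_scratch_destroy(struct qp_scratch *s)
{
	free(s->buf);
	s->buf = NULL;
}
```

The cost is one buffer per QP that stays allocated for the QP's lifetime; whether that trade is worthwhile depends on the measured penalty asked about above.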
Re: When IBoE will be merged to upstream?
On Wed, Jul 07, 2010 at 09:00:26AM +0300, Or Gerlitz wrote:
> I think we need to let resolve through the rdma-cm && get to know at
> the consumer level, what are the source / destination macs, vlan id
> and vlan priority used by an IBoE QP, in the exact manner all the IB
> equivalents (src/dst lid, pkey, sl) are resolved by the rdma-cm and
> exposed to the consumer app for IB QP.

I agree. Clearly following the model of IB is the best way to fit this in without major changes. RDMA-CM is the way to get IP integration: it uses existing eth devices attached to the master eth device (analogous to IPoIB devices) and resolves IP to eth device to VLAN header and neighbour.

If someone needs to do something else special, it is pretty easy to do all the same steps in userspace using netlink, and that could go into a library, just like PR queries for IB.

Jason
Re: When IBoE will be merged to upstream?
Liran Liss wrote:
> but keeping ib_create_ah() callable from any context is not a goal by itself.

Going with your approach, if your proposed design is accepted, I believe that you would probably need to patch all the code chains that make calls under the current assumption.

> I am looking for constructive ideas for supporting iboe without breaking
> Verbs/CQE/CM syntax.

I don't agree that exposing the Ethernet L2 related information to the caller is breaking something; on the contrary, it is a required enhancement. I think we need to resolve through the rdma-cm, and expose at the consumer level, the source / destination MACs, VLAN id and VLAN priority used by an IBoE QP, in the exact manner that all the IB equivalents (src/dst lid, pkey, sl) are resolved by the rdma-cm and exposed to the consumer app for an IB QP.

Or.
Re: root owned writable files under /sys
Sumeet Lahorani wrote:
> # find /sys -type f -perm -222
> /sys/devices/pci:00/:00:04.0/:13:00.0/port_trigger
> /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port2
> /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port1

Jack, Tziporet,

Can you clarify the status of the upstream kernel mlx4 multi-protocol support? Looking at Linus' git tree, I see one commit, 7ff93f8b7ecbc36e7ffc5c11a61643821c1bfee5 "mlx4_core: Multiple port type support", dated Oct 2008, whereas OFED ships a couple of patches touching this area, e.g. adding the above sysfs entries. So what is the extra functionality introduced, or the bug/s fixed, by those patches? Any reason not to push them upstream?

Or.
Re: [PATCH] opensm: event plig-in API fixed to compile with g++
On 07-Jul-10 12:03 AM, Ira Weiny wrote:
> On Mon, 5 Jul 2010 11:41:44 -0700 Sasha Khapyorsky wrote:
> > On 14:30 Mon 05 Jul , Hal Rosenstock wrote:
> > > On Mon, Jul 5, 2010 at 2:11 PM, Sasha Khapyorsky wrote:
> > > > On 11:10 Thu 24 Jun , Yevgeny Kliteynik wrote:
> > > > > Event API should have been able to be used by libraries written both in C and C++.
> > > >
> > > > I don't know about such requirement.
> > >
> > > Are you saying it isn't a valid requirement to allow OpenSM plugins to be C++ based ? If so, why not ?
> >
> > I'm saying that there is no requirement for plugin API to support C++ - obviously (following method names) plugin API was never developed for using it in C++.
>
> Actually IMO this is not correct. The use of "delete" was introduced by commit a5963f93fa3d4514cc526e4ad029b036724b8167. I was at fault to not have objected back then.
>
> The use of "extern C" in all of the header files below implies a desire to support C++.

Couldn't agree more.

-- Yevgeny

> 10:28:14> pwd; grep "BEGIN_C_DECLS extern" *
> /home/weiny2/OpenIB/git-trees/management/opensm/include/opensm
> osm_attrib_req.h:# define BEGIN_C_DECLS extern "C" {
> osm_base.h:# define BEGIN_C_DECLS extern "C" {
> osm_console.h:# define BEGIN_C_DECLS extern "C" {
> osm_console_io.h:# define BEGIN_C_DECLS extern "C" {
> osm_db.h:# define BEGIN_C_DECLS extern "C" {
> osm_db_pack.h:# define BEGIN_C_DECLS extern "C" {
> osm_event_plugin.h:# define BEGIN_C_DECLS extern "C" {
> osm_helper.h:# define BEGIN_C_DECLS extern "C" {
> osm_inform.h:# define BEGIN_C_DECLS extern "C" {
> osm_lid_mgr.h:# define BEGIN_C_DECLS extern "C" {
> osm_log.h:# define BEGIN_C_DECLS extern "C" {
> osm_mad_pool.h:# define BEGIN_C_DECLS extern "C" {
> osm_madw.h:# define BEGIN_C_DECLS extern "C" {
> osm_mcast_tbl.h:# define BEGIN_C_DECLS extern "C" {
> osm_mcm_port.h:# define BEGIN_C_DECLS extern "C" {
> osm_msgdef.h:# define BEGIN_C_DECLS extern "C" {
> osm_mtree.h:# define BEGIN_C_DECLS extern "C" {
> osm_multicast.h:# define BEGIN_C_DECLS extern "C" {
> osm_node.h:# define BEGIN_C_DECLS extern "C" {
> osm_opensm.h:# define BEGIN_C_DECLS extern "C" {
> osm_partition.h:# define BEGIN_C_DECLS extern "C" {
> osm_path.h:# define BEGIN_C_DECLS extern "C" {
> osm_perfmgr_db.h:# define BEGIN_C_DECLS extern "C" {
> osm_pkey.h:# define BEGIN_C_DECLS extern "C" {
> osm_port.h:# define BEGIN_C_DECLS extern "C" {
> osm_port_profile.h:# define BEGIN_C_DECLS extern "C" {
> osm_prefix_route.h:# define BEGIN_C_DECLS extern "C" {
> osm_remote_sm.h:# define BEGIN_C_DECLS extern "C" {
> osm_router.h:# define BEGIN_C_DECLS extern "C" {
> osm_sa.h:# define BEGIN_C_DECLS extern "C" {
> osm_sa_mad_ctrl.h:# define BEGIN_C_DECLS extern "C" {
> osm_service.h:# define BEGIN_C_DECLS extern "C" {
> osm_sm.h:# define BEGIN_C_DECLS extern "C" {
> osm_sm.h.orig:# define BEGIN_C_DECLS extern "C" {
> osm_sm_mad_ctrl.h:# define BEGIN_C_DECLS extern "C" {
> osm_stats.h:# define BEGIN_C_DECLS extern "C" {
> osm_subnet.h:# define BEGIN_C_DECLS extern "C" {
> osm_subnet.h.orig:# define BEGIN_C_DECLS extern "C" {
> osm_switch.h:# define BEGIN_C_DECLS extern "C" {
> osm_ucast_cache.h:# define BEGIN_C_DECLS extern "C" {
> osm_ucast_mgr.h:# define BEGIN_C_DECLS extern "C" {
> osm_vl15intf.h:# define BEGIN_C_DECLS extern "C" {
> st.h:# define BEGIN_C_DECLS extern "C" {
>
> Ira
>
> > Why not is another question - for instance in order to not deal with C/C++ compatibility issues (such as castings, function names limitation, linking mess, etc.)
> >
> > Sasha
Re: [PATCH] IB/qib: if qib_init() fails, driver fails to clean up properly
thanks, applied -- Roland Dreier || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html
Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs
> We thought about this too, but in some special cases, we do not know the
> correct page size of a memory range. For example when getting a 16M chunk
> from a 16M huge page region which is also aligned to 16M, the first madvise()
> will work fine and the code will assume that the page size is 64K.

I see ... yes, that does break my idea completely.

OK, another half-baked idea: what if we pay attention to when madvise(DOFORK) fails as well as when madvise(DONTFORK) fails, and use that as a hint that we had better check the page size?

Perhaps this adds too much complexity ... in which case your idea:

> As this issue was not present in version 2 of the code, but there we had
> a big performance penalty, I suggest the following: we could go back to
> version 2 and introduce a new RDMAV_HUGEPAGE_SAFE env variable to let the
> user decide between huge page support and better performance (the same
> approach we use for the COW protection itself).

seems like a reasonable alternative.

Thanks,
Roland
Re: [PATCH 2.6.35 3/3] RDMA/cxgb4: Avoid false GTS CIDX_INC overflows.
thanks, applied -- Roland Dreier
Re: [PATCH 2.6.35 1/3] RDMA/cxgb4: Don't call abort_connection() for active connect failures.
thanks, applied -- Roland Dreier
Re: [PATCH v2 1/3] RDMA/cxgb4: derive smac_idx from port viid.
thanks, applied -- Roland Dreier
Re: [PATCH 7/7] IB/qib: completion queue callback needs to be single threaded
thanks, applied all except 2/7 (which seems to be only an optimization) -- Roland Dreier
Re: root owned writeable files under /sys
thanks, applied -- Roland Dreier
Re: [PATCH] opensm: event plig-in API fixed to compile with g++
On Mon, 5 Jul 2010 11:41:44 -0700 Sasha Khapyorsky wrote:
> On 14:30 Mon 05 Jul , Hal Rosenstock wrote:
> > On Mon, Jul 5, 2010 at 2:11 PM, Sasha Khapyorsky wrote:
> > > On 11:10 Thu 24 Jun , Yevgeny Kliteynik wrote:
> > >> Event API should have been able to be used by libraries
> > >> written both in C and C++.
> > >
> > > I don't know about such requirement.
> >
> > Are you saying it isn't a valid requirement to allow OpenSM plugins to
> > be C++ based ? If so, why not ?
>
> I'm saying that there is no requirement for plugin API to support C++ -
> obviously (following method names) plugin API was never developed for
> using it in C++.

Actually IMO this is not correct. The use of "delete" was introduced by commit a5963f93fa3d4514cc526e4ad029b036724b8167. I was at fault to not have objected back then.

The use of "extern C" in all of the header files below implies a desire to support C++.

10:28:14 > pwd; grep "BEGIN_C_DECLS extern" *
/home/weiny2/OpenIB/git-trees/management/opensm/include/opensm
osm_attrib_req.h:# define BEGIN_C_DECLS extern "C" {
osm_base.h:# define BEGIN_C_DECLS extern "C" {
osm_console.h:# define BEGIN_C_DECLS extern "C" {
osm_console_io.h:# define BEGIN_C_DECLS extern "C" {
osm_db.h:# define BEGIN_C_DECLS extern "C" {
osm_db_pack.h:# define BEGIN_C_DECLS extern "C" {
osm_event_plugin.h:# define BEGIN_C_DECLS extern "C" {
osm_helper.h:# define BEGIN_C_DECLS extern "C" {
osm_inform.h:# define BEGIN_C_DECLS extern "C" {
osm_lid_mgr.h:# define BEGIN_C_DECLS extern "C" {
osm_log.h:# define BEGIN_C_DECLS extern "C" {
osm_mad_pool.h:# define BEGIN_C_DECLS extern "C" {
osm_madw.h:# define BEGIN_C_DECLS extern "C" {
osm_mcast_tbl.h:# define BEGIN_C_DECLS extern "C" {
osm_mcm_port.h:# define BEGIN_C_DECLS extern "C" {
osm_msgdef.h:# define BEGIN_C_DECLS extern "C" {
osm_mtree.h:# define BEGIN_C_DECLS extern "C" {
osm_multicast.h:# define BEGIN_C_DECLS extern "C" {
osm_node.h:# define BEGIN_C_DECLS extern "C" {
osm_opensm.h:# define BEGIN_C_DECLS extern "C" {
osm_partition.h:# define BEGIN_C_DECLS extern "C" {
osm_path.h:# define BEGIN_C_DECLS extern "C" {
osm_perfmgr_db.h:# define BEGIN_C_DECLS extern "C" {
osm_pkey.h:# define BEGIN_C_DECLS extern "C" {
osm_port.h:# define BEGIN_C_DECLS extern "C" {
osm_port_profile.h:# define BEGIN_C_DECLS extern "C" {
osm_prefix_route.h:# define BEGIN_C_DECLS extern "C" {
osm_remote_sm.h:# define BEGIN_C_DECLS extern "C" {
osm_router.h:# define BEGIN_C_DECLS extern "C" {
osm_sa.h:# define BEGIN_C_DECLS extern "C" {
osm_sa_mad_ctrl.h:# define BEGIN_C_DECLS extern "C" {
osm_service.h:# define BEGIN_C_DECLS extern "C" {
osm_sm.h:# define BEGIN_C_DECLS extern "C" {
osm_sm.h.orig:# define BEGIN_C_DECLS extern "C" {
osm_sm_mad_ctrl.h:# define BEGIN_C_DECLS extern "C" {
osm_stats.h:# define BEGIN_C_DECLS extern "C" {
osm_subnet.h:# define BEGIN_C_DECLS extern "C" {
osm_subnet.h.orig:# define BEGIN_C_DECLS extern "C" {
osm_switch.h:# define BEGIN_C_DECLS extern "C" {
osm_ucast_cache.h:# define BEGIN_C_DECLS extern "C" {
osm_ucast_mgr.h:# define BEGIN_C_DECLS extern "C" {
osm_vl15intf.h:# define BEGIN_C_DECLS extern "C" {
st.h:# define BEGIN_C_DECLS extern "C" {

Ira

> Why not is another question - for instance in order to not deal with
> C/C++ compatibility issues (such as castings, function names limitation,
> linking mess, etc.)
>
> Sasha
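For readers unfamiliar with the pattern the grep output refers to, here is a minimal sketch of a header guarded this way, combined with a callback table that avoids C++ keywords as member names. It is an illustration only: struct example_plugin_ops and every example_* name are hypothetical and are not the OpenSM event plugin API.

```c
#include <assert.h>
#include <stddef.h>

/* Guard pattern: the macros expand to extern "C" { ... } under a C++
 * compiler and to nothing under a C compiler. */
#ifdef __cplusplus
#  define BEGIN_C_DECLS extern "C" {
#  define END_C_DECLS   }
#else
#  define BEGIN_C_DECLS
#  define END_C_DECLS
#endif

BEGIN_C_DECLS

/* Hypothetical callback table. A member named "delete" is legal C but a
 * syntax error in C++, so a neutral name ("destroy") is used instead. */
struct example_plugin_ops {
	void *(*create)(void *ctx);
	void (*destroy)(void *plugin_data);
};

/* Runs one create/destroy cycle; returns 0 on success, -1 on bad input. */
int example_plugin_cycle(const struct example_plugin_ops *ops, void *ctx);

END_C_DECLS

/* Sample callbacks that only count how often they were invoked. */
static int example_calls;

static void *example_create(void *ctx)
{
	example_calls++;
	return ctx;
}

static void example_destroy(void *plugin_data)
{
	(void)plugin_data;
	example_calls++;
}

int example_plugin_cycle(const struct example_plugin_ops *ops, void *ctx)
{
	void *data;

	if (ops == NULL || ops->create == NULL || ops->destroy == NULL)
		return -1;
	data = ops->create(ctx);
	ops->destroy(data);
	return 0;
}
```

With the guard in place and no reserved words in the declarations, the same header should be includable from both C and C++ translation units, which is the point argued in the message above.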
Re: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage
I haven't thought through all the details, but in principle this should work. But this isn't just an iWARP issue. Currently all RDMA-CM users share the same port space. I think we need to maintain this, so a transport-independent RDMA app can run over both IB and IW. This goes for the server side wrt listen/accept as well.

Steve.

Tung, Chien Tin wrote:
> Steve,
>
> Do you see any issues with Bernard's proposal? Is this something we can agree on?
>
> Chien
>
> > -----Original Message-----
> > From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Tung, Chien Tin
> > Sent: Friday, June 25, 2010 3:15 PM
> > To: Bernard Metzler; Roland Dreier
> > Cc: Jason Gunthorpe; linux-rdma@vger.kernel.org; linux-rdma-ow...@vger.kernel.org; Waskiewicz Jr, Peter P; Steve Wise
> > Subject: RE: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage
> >
> > > To my understanding, our discussion touches two topics. One is to solve the TCP port space issue, the other is more general, it's about proper integration of offloaded TCP within Linux. So, the second topic is a generalization of the first.
> > >
> > > Regarding the first topic, what I was about to propose is that the iWARP kernel driver (software iWARP or RNIC) itself should take care of port space allocations. Port space maintenance functionality should be minimized at iWARP CM level. It looks straightforward to me if during the rdma_connect() call the driver picks a free port using a socket/bind sequence for its local interface. The same would be possible for the passive connection setup, which always involves an rdma_bind_addr() - we would have to pass the rdma_bind_addr() call down to the driver and EADDRINUSE would be a reasonable return value.
> > > Here things are getting a little more complicated, if it comes to INADDR_ANY and port 0 bindings. In private email, Bob Sharp already suggested it - the iWARP CM would have to pick a port and try it on all interfaces, maybe by starting with port 0 binding on one interface and trying to extend with the returned port on all remaining interfaces. That introduces an unbind() call if things fail, too. In any case, the rdma_bind_addr() call would create additional state at driver level.
> >
> > I am okay with adding rdma_bind_addr and rdma_unbind_addr calls. I won't speak for Sean and the work that needs to go into the CM. But this will allow all known iWARP implementations to work together.
> >
> > > For softiwarp, during bind() or connect(), a TCP socket would be created and bound, for an RNIC driver (currently) the same would happen. While with softiwarp this socket would be used for communication later, the RNIC driver would simply have to keep it around until the connection endpoint gets destroyed or the port gets unbound.
> >
> > We want to be careful and make sure there is only one iWARP provider per IP address. If softiWARP binds and surfaces another verbs interface on an existing one, this scheme will not work.
> >
> > Chien
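The "socket/bind sequence" for picking a free port can be illustrated in user space as below. This is a sketch under assumptions, not the proposed kernel change: reserve_tcp_port() is a hypothetical helper, and an in-kernel driver would use the kernel socket API rather than these POSIX calls. It binds to port 0 so the stack chooses a free port, reads the choice back with getsockname(), and keeps the socket open so the port stays reserved.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: reserve a free TCP port on the given IPv4 address
 * (network byte order). Returns the socket fd that holds the reservation,
 * or -1 on error. The chosen port is stored in host byte order. */
int reserve_tcp_port(uint32_t addr_be, uint16_t *port_out)
{
	struct sockaddr_in sa;
	socklen_t len = sizeof sa;
	int fd;

	assert(port_out != NULL);

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&sa, 0, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_addr.s_addr = addr_be;
	sa.sin_port = 0;	/* port 0: let the TCP stack pick a free port */

	if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0 ||
	    getsockname(fd, (struct sockaddr *)&sa, &len) < 0) {
		close(fd);
		return -1;
	}

	*port_out = ntohs(sa.sin_port);
	return fd;	/* closing fd releases the port (the "unbind") */
}
```

For the INADDR_ANY case discussed above, the returned port would then have to be tried with an explicit bind on each remaining interface, releasing everything if any of those binds fails with EADDRINUSE.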
RE: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage
Steve,

Do you see any issues with Bernard's proposal? Is this something we can agree on?

Chien

> -----Original Message-----
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Tung, Chien Tin
> Sent: Friday, June 25, 2010 3:15 PM
> To: Bernard Metzler; Roland Dreier
> Cc: Jason Gunthorpe; linux-rdma@vger.kernel.org; linux-rdma-ow...@vger.kernel.org; Waskiewicz Jr, Peter P; Steve Wise
> Subject: RE: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage
>
> > To my understanding, our discussion touches two topics. One is to solve the TCP port space issue, the other is more general, it's about proper integration of offloaded TCP within Linux. So, the second topic is a generalization of the first.
> >
> > Regarding the first topic, what I was about to propose is that the iWARP kernel driver (software iWARP or RNIC) itself should take care of port space allocations. Port space maintenance functionality should be minimized at iWARP CM level. It looks straightforward to me if during the rdma_connect() call the driver picks a free port using a socket/bind sequence for its local interface. The same would be possible for the passive connection setup, which always involves an rdma_bind_addr() - we would have to pass the rdma_bind_addr() call down to the driver and EADDRINUSE would be a reasonable return value.
> > Here things are getting a little more complicated, if it comes to INADDR_ANY and port 0 bindings. In private email, Bob Sharp already suggested it - the iWARP CM would have to pick a port and try it on all interfaces, maybe by starting with port 0 binding on one interface and trying to extend with the returned port on all remaining interfaces. That introduces an unbind() call if things fail, too. In any case, the rdma_bind_addr() call would create additional state at driver level.
>
> I am okay with adding rdma_bind_addr and rdma_unbind_addr calls. I won't speak for Sean and the work that needs to go into the CM. But this will allow all known iWARP implementations to work together.
>
> > For softiwarp, during bind() or connect(), a TCP socket would be created and bound, for an RNIC driver (currently) the same would happen. While with softiwarp this socket would be used for communication later, the RNIC driver would simply have to keep it around until the connection endpoint gets destroyed or the port gets unbound.
>
> We want to be careful and make sure there is only one iWARP provider per IP address. If softiWARP binds and surfaces another verbs interface on an existing one, this scheme will not work.
>
> Chien
Re: [PATCH] opensm: event plig-in API fixed to compile with g++
> FWIW, I agree with Hal. Support for external plug-ins written in C++ seems
> desirable.

Seems that anyone who cared could already easily write a tiny shim in C and then write the rest of their plugin in C++. Or are there deeper issues than names of methods?

- R.
-- Roland Dreier
Re: [PATCH net-next-2.6] IB/{nes,ipoib}: Pass supported flags to ethtool_op_set_flags()
On 07/03/10 12:41, Ben Hutchings wrote:
> Following commit 1437ce3983bcbc0447a0dedcd644c14fe833d266 "ethtool:
> Change ethtool_op_set_flags to validate flags", ethtool_op_set_flags
> takes a third parameter and cannot be used directly as an
> implementation of ethtool_ops::set_flags.
>
> Changes nes and ipoib driver to pass in the appropriate value.
>
> Signed-off-by: Ben Hutchings
> ---
> This is compile-tested only.

Ack, thanks.

> Dave, Roland, you'd better decide between yourselves who should apply this.
>
> Ben.
>
>  drivers/infiniband/hw/nes/nes_nic.c          | 8 +++-
>  drivers/infiniband/ulp/ipoib/ipoib_ethtool.c | 7 ++-
>  2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
> index 5cc0a9a..42e7aad 100644
> --- a/drivers/infiniband/hw/nes/nes_nic.c
> +++ b/drivers/infiniband/hw/nes/nes_nic.c
> @@ -1567,6 +1567,12 @@ static int nes_netdev_set_settings(struct net_device *netdev, struct ethtool_cmd
>  }
>
>
> +static int nes_netdev_set_flags(struct net_device *netdev, u32 flags)
> +{
> +	return ethtool_op_set_flags(netdev, flags, ETH_FLAG_LRO);
> +}
> +
> +
>  static const struct ethtool_ops nes_ethtool_ops = {
>  	.get_link = ethtool_op_get_link,
>  	.get_settings = nes_netdev_get_settings,
> @@ -1588,7 +1594,7 @@ static const struct ethtool_ops nes_ethtool_ops = {
>  	.get_tso = ethtool_op_get_tso,
>  	.set_tso = ethtool_op_set_tso,
>  	.get_flags = ethtool_op_get_flags,
> -	.set_flags = ethtool_op_set_flags,
> +	.set_flags = nes_netdev_set_flags,
>  };
>
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
> index 40e8584..1a1657c 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
> @@ -147,6 +147,11 @@ static void ipoib_get_ethtool_stats(struct net_device *dev,
>  	data[index++] = priv->lro.lro_mgr.stats.no_desc;
>  }
>
> +static int ipoib_set_flags(struct net_device *dev, u32 flags)
> +{
> +	return ethtool_op_set_flags(dev, flags, ETH_FLAG_LRO);
> +}
> +
>  static const struct ethtool_ops ipoib_ethtool_ops = {
>  	.get_drvinfo	= ipoib_get_drvinfo,
>  	.get_rx_csum	= ipoib_get_rx_csum,
> @@ -154,7 +159,7 @@ static const struct ethtool_ops ipoib_ethtool_ops = {
>  	.get_coalesce	= ipoib_get_coalesce,
>  	.set_coalesce	= ipoib_set_coalesce,
>  	.get_flags	= ethtool_op_get_flags,
> -	.set_flags	= ethtool_op_set_flags,
> +	.set_flags	= ipoib_set_flags,
>  	.get_strings	= ipoib_get_strings,
>  	.get_sset_count	= ipoib_get_sset_count,
>  	.get_ethtool_stats = ipoib_get_ethtool_stats,

--
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
RE: [PATCH] opensm: event plig-in API fixed to compile with g++
> > OpenSM is written in C, not C++, I'm pretty fine with it and don't see > > any reason to get C++ direction. > > That's OpenSM but why limit plug in writers to C ? FWIW, I agree with Hal. Support for external plug-ins written in C++ seems desirable.
Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs
On Sat, 03 Jul 2010 13:19:07 -0700 Roland Dreier wrote:
> > When registering two memory regions A and B from within the same huge page, we will end up with one node in the tree which covers the whole huge page after registering A. When the second MR is registered, a node is created with the MR size rounded to the system page size (as there is no need to call madvise(), it is not noticed that MR B is part of a huge page).
> >
> > Now if MR A is deregistered before MR B, I see that the tree containing mem_nodes is empty afterwards, which causes problems for the deregistration of MR B, leaving the tree in a corrupted state with negative refcounts. This also breaks later registrations of other memory regions within this huge page.
>
> Good thing I didn't get around to applying the patch yet ;)
>
> I haven't thought this through fully, but it seems that maybe we could extend the madvise tracking tree to keep track of the page size used for each node in the tree. Then for the registration of MR B above, we would find the node for MR A covered MR B and we should be able to get the ref counting right.

We thought about this too, but in some special cases, we do not know the correct page size of a memory range. For example when getting a 16M chunk from a 16M huge page region which is also aligned to 16M, the first madvise() will work fine and the code will assume that the page size is 64K. If trying to register a 16M - 64K + 1 byte region, the first madvise() also works fine. Now if a second memory region which resides in the last 64K is registered, we end up with the same situation as above.

As this issue was not present in version 2 of the code, but there we had a big performance penalty, I suggest the following: we could go back to version 2 and introduce a new RDMAV_HUGEPAGE_SAFE env variable to let the user decide between huge page support and better performance (the same approach we use for the COW protection itself).

Would this be okay or do you see a problem with this?

Regards,
Alex
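A rough sketch of the two pieces involved in the proposal: an opt-in check on the environment variable, and the rounding of a memory range to a given page size. It is an illustration, not the libibverbs patch. The helper names are hypothetical, and treating the variable as "enabled if set to anything" is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: nonzero if the user opted in to huge page safety.
 * Assumption: the variable only needs to be present, whatever its value. */
int hugepage_safe_requested(void)
{
	return getenv("RDMAV_HUGEPAGE_SAFE") != NULL;
}

/* Round [addr, addr + len) outward to multiples of page_size, which must
 * be a power of two. The result is the half-open range [*start, *end). */
void round_range(uintptr_t addr, size_t len, size_t page_size,
		 uintptr_t *start, uintptr_t *end)
{
	uintptr_t mask = (uintptr_t)page_size - 1;

	assert(page_size != 0 && (page_size & mask) == 0);
	*start = addr & ~mask;
	*end = (addr + len + mask) & ~mask;
}
```

Using the numbers from the message: a region in the last 64K of a 16M huge page rounds to a single 64K range when the system page size is assumed, but to the whole 16M page when the huge page size is used. That mismatch is what leaves the tracking tree with two nodes for the same huge page.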
RE: When IBoE will be merged to upstream?
> reading your replies would be much easier if you use strict
> bottom posting

OK :)

> >> For a long time we've assumed that the create_ah verb can't sleep, so
> >> where are you going to do neighbor discovery?
>
> > re [...] implementation, there is no inherent issue that prevents create_ah() from sleeping:
> > - Change a few spinlocks to mutexes in the cma (which sleeps a lot anyway because it
> >   modifies QP states)
> > - Trivial for user-space calls...
>
> Documentation/infiniband/core_locking.txt states that "The
> corresponding functions exported to upper level protocol
> consumers: ... ib_create_ah ... are therefore safe to call
> from any context."
>
> Which is in turn assumed by a bunch of components in the kernel
> IB stack (look for ib_create_ah calls under
> drivers/infiniband/core/). The examples you brought here
> don't cover them. One way or another, I don't see any reason
> to break that convention just for the sake of this aspect of
> the iboe implementation; simply code with this assumption at hand.
>
> Or.
>

I understand this, but keeping ib_create_ah() callable from any context is not a goal by itself. I am looking for constructive ideas for supporting iboe without breaking Verbs/CQE/CM syntax. What do you propose?
RE: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
Hello Or,

> I still don't see what is the performance issue with the uverbs
> post_send/post_recv and if there is such why it can't be fixed, to avoid
> introducing lib/driver nes special char device. Could you explain it with
> some more details? You mentioned the rdma-cm device file, but the uverbs
> cmd api is used by libibverbs / uverbs and not by librdmacm / rdma-ucm,
> which is anyway a slow path.

From my measurements it looks like the problem is related to memory allocation in the user-space and kernel paths, which is a very, very expensive operation. Look at the tx path (rx is very similar).

ibv_post_send():

    post_send_wrapper_1_0
        for (w = wr; w; w = w->next) {
            real_wr = alloca(sizeof *real_wr);    <- 1. dyn alloc
            real_wr->wr_id = w->wr_id;

Next comes the call to the HW-specific part, which prepares the message to send:

    cmd = alloca(cmd_size);    <- 2. dyn alloc
    IBV_INIT_CMD_RESP(cmd, cmd_size, POST_SEND, &resp, sizeof resp);

Dive to kernel, ib_uverbs_post_send():

    user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);    <- 3. dyn alloc
    next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
                   user_wr->num_sge * sizeof (struct ib_sge),
                   GFP_KERNEL);    <- 4. dyn alloc

And now there is the final call to the driver.

Adding the additional device makes it possible to dive into the kernel without those memory allocations.

> Also, I understand that .read (.write) entry maps to posting a receive
> (send) buffer, what is the use case for .mmap entry

Not exactly. Diving into the kernel is treated as something like passing a signal to the kernel that there is prepared information for post_send/post_recv. The information about buffers is passed through a shared page (available to userspace through mmap) to avoid copying of data. The write() op is used to pass the signal about post_send. The read() op is used to pass information about post_recv(). We avoid additional copying of the data that way.
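The shared-page hand-off described above can be sketched in user space roughly as follows. The layout of `struct nes_ud_send_wr` is the one quoted from the patch elsewhere in this thread, and `struct ib_sge` follows the kernel's field layout. Everything else is an assumption made for illustration: the helpers `sksq_reset` and `sksq_queue_send` are hypothetical, one SGE per request is assumed, and the mmap() of the page and the write() that signals the kernel are not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NES_UD_MAX_SGE 64   /* size of sg_list[] in the quoted struct */

/* Same field layout as the kernel's struct ib_sge. */
struct ib_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* Layout as quoted from the patch in this thread. */
struct nes_ud_send_wr {
    uint32_t wr_cnt;
    uint32_t qpn;
    uint32_t flags;
    uint32_t resv[1];
    struct ib_sge sg_list[NES_UD_MAX_SGE];
};

/* Hypothetical helper: clear the shared page before building a batch. */
static void sksq_reset(struct nes_ud_send_wr *page, uint32_t qpn)
{
    assert(page != NULL);
    memset(page, 0, sizeof(*page));
    page->qpn = qpn;
}

/* Hypothetical helper: append one buffer descriptor to the shared page.
 * Returns 0 on success, -1 if the page already holds NES_UD_MAX_SGE
 * entries. Assumes one SGE per posted request. */
static int sksq_queue_send(struct nes_ud_send_wr *page, uint64_t addr,
                           uint32_t length, uint32_t lkey)
{
    assert(page != NULL);
    if (page->wr_cnt >= NES_UD_MAX_SGE)
        return -1;

    struct ib_sge *sge = &page->sg_list[page->wr_cnt++];
    sge->addr = addr;
    sge->length = length;
    sge->lkey = lkey;
    return 0;
}
```

In this sketch only the lkey/vaddr/len triple is written into the page; the payload itself stays in the registered buffer, so the only per-request copy is these few bytes of descriptor.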
> @@ -2939,6 +3130,9 @@ int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
>          nesqp->hwqp.qp_id, attr->qp_state, nesqp->ibqp_state,
>          nesqp->iwarp_state, atomic_read(&nesqp->refcount));
>
> +    if (ibqp->qp_type == IB_QPT_RAW_PACKET)
> +        return 0;
>
> isn't a raw qp associated with a specific port of the device?

In the NES architecture the QP type and number define a specific device or port. It is a one-to-one mapping.

Regards,
Mirek

-----Original Message-----
From: Or Gerlitz [mailto:ogerl...@voltaire.com]
Sent: Tuesday, July 06, 2010 10:50 AM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org; aleks...@voltaire.com
Subject: Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

miroslaw.walukiew...@intel.com wrote:
> adds a IB_QPT_RAW_PACKET QP type implementation for nes driver

> +++ b/drivers/infiniband/hw/nes/nes_ud.c
> +static const struct file_operations nes_ud_sksq_fops = {
> +    .owner = THIS_MODULE,
> +    .open = nes_ud_sksq_open,
> +    .release = nes_ud_sksq_close,
> +    .write = nes_ud_sksq_write,
> +    .read = nes_ud_sksq_read,
> +    .mmap = nes_ud_sksq_mmap,
> +};
> +
> +
> +static struct miscdevice nes_ud_sksq_misc = {
> +    .minor = MISC_DYNAMIC_MINOR,
> +    .name = "nes_ud_sksq",
> +    .fops = &nes_ud_sksq_fops,
> +};

Reading through the May 2010 "RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver" email thread, e.g. at the links below, you say

> The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is
> shared with all other user-kernel communication and it is quite complex.
> It is a perfect path for QP/CQ/PD/mem management but for me it is too
> complex for traffic acceleration.
> The user<->kernel path through additional driver, shared page for
> lkey/vaddr/len passing and SW memory translation in kernel is much more
> effective.

http://marc.info/?l=linux-rdma&m=127299659017928
http://marc.info/?l=linux-rdma&m=127306694704653

I still don't see what is the performance issue with the uverbs post_send/post_recv and if there is such why it can't be fixed, to avoid introducing lib/driver nes special char device. Could you explain it with some more details? You mentioned the rdma-cm device file, but the uverbs cmd api is used by libibverbs / uverbs and not by librdmacm / rdma-ucm, which is anyway a slow path.

Also, I understand that .read (.write) entry maps to posting a receive (send) buffer, what is the use case for .mmap entry

> --- a/drivers/infiniband/hw/nes/nes_verbs.c
> +++ b/drivers/infiniband/hw/nes/nes_verbs.c
> @@ -1139,7 +1141,6 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
[...]
> -    atomic_inc(&qps_created);
> @@ -1405,10 +1406,122 @@ static struct ib_qp *nes_create_qp(
Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
miroslaw.walukiew...@intel.com wrote:
> adds a IB_QPT_RAW_PACKET QP type implementation for nes driver

> +++ b/drivers/infiniband/hw/nes/nes_ud.c
> +static const struct file_operations nes_ud_sksq_fops = {
> +    .owner = THIS_MODULE,
> +    .open = nes_ud_sksq_open,
> +    .release = nes_ud_sksq_close,
> +    .write = nes_ud_sksq_write,
> +    .read = nes_ud_sksq_read,
> +    .mmap = nes_ud_sksq_mmap,
> +};
> +
> +
> +static struct miscdevice nes_ud_sksq_misc = {
> +    .minor = MISC_DYNAMIC_MINOR,
> +    .name = "nes_ud_sksq",
> +    .fops = &nes_ud_sksq_fops,
> +};

Reading through the May 2010 "RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver" email thread, e.g. at the links below, you say

> The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is
> shared with all other user-kernel communication and it is quite complex.
> It is a perfect path for QP/CQ/PD/mem management but for me it is too
> complex for traffic acceleration.
> The user<->kernel path through additional driver, shared page for
> lkey/vaddr/len passing and SW memory translation in kernel is much more
> effective.

http://marc.info/?l=linux-rdma&m=127299659017928
http://marc.info/?l=linux-rdma&m=127306694704653

I still don't see what the performance issue with the uverbs post_send/post_recv is and, if there is one, why it can't be fixed, to avoid introducing a nes-specific lib/driver char device. Could you explain it in some more detail? You mentioned the rdma-cm device file, but the uverbs cmd api is used by libibverbs / uverbs and not by librdmacm / rdma-ucm, which is a slow path anyway.

Also, I understand that the .read (.write) entry maps to posting a receive (send) buffer; what is the use case for the .mmap entry?

> --- a/drivers/infiniband/hw/nes/nes_verbs.c
> +++ b/drivers/infiniband/hw/nes/nes_verbs.c
> @@ -1139,7 +1141,6 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
[...]
> -    atomic_inc(&qps_created);
> @@ -1405,10 +1406,122 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
[...]
> +    /* moved here to be sure that QP is really created */
> +    /*(now it counted a number of QP creation trials */
> +    atomic_inc(&qps_created);

It would be best if this change, and a couple more like it, were placed in a clean-up patch to nes_verbs.c, so that the amount of RAW QP related changes to review is minimized.

> @@ -2939,6 +3130,9 @@ int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
>          nesqp->hwqp.qp_id, attr->qp_state, nesqp->ibqp_state,
>          nesqp->iwarp_state, atomic_read(&nesqp->refcount));
>
> +    if (ibqp->qp_type == IB_QPT_RAW_PACKET)
> +        return 0;

Isn't a raw qp associated with a specific port of the device?

Or.