Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

2010-07-06 Thread Or Gerlitz

Walukiewicz, Miroslaw wrote:

> From my measurements it looks like the problem is related to memory
> allocation in the user-space and kernel paths, which is a very, very
> expensive operation. Look at the tx path (rx is very similar).
> ibv_post_send():
>
>   post_send_wrapper_1_0()
>     for (w = wr; w; w = w->next) {
>         real_wr = alloca(sizeof *real_wr);  <- 1. dyn alloc
>         real_wr->wr_id = w->wr_id;
>
>   next comes the call to the HW-specific part, which
>   prepares the message to send:
>     cmd = alloca(cmd_size);  <- 2. dyn alloc


Hi Mirek,

I don't think there are applications around that would use a raw QP AND
are linked against libibverbs-1.0 such that they would exercise the 1_0
wrapper, so we can ignore the 1st allocation, the one in the wrapper code.


As for the 2nd allocation, since a WQE posting is synchronous, I believe
this allocation can be done once per QP, using the maximal values
specified when the QP was created, and reused later.
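
A minimal sketch of the once-per-QP idea in user-space C. All names here (demo_qp, demo_sge, demo_post_cmd, demo_qp_init, demo_qp_cmd_buf, demo_qp_destroy) are hypothetical stand-ins and not libibverbs or nes API. The only point is that a buffer sized from the create-time maximum can replace the per-post alloca():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real verbs structures. */
struct demo_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

struct demo_post_cmd {
    uint32_t qpn;
    uint32_t num_sge;
    struct demo_sge sge[];      /* filled in on every post */
};

struct demo_qp {
    uint32_t qpn;
    uint32_t max_send_sge;      /* fixed when the QP is created */
    size_t cmd_size;            /* worst-case command size */
    struct demo_post_cmd *cmd;  /* allocated once, reused on every post */
};

/* Size the scratch command buffer for the worst case, once per QP. */
int demo_qp_init(struct demo_qp *qp, uint32_t qpn, uint32_t max_send_sge)
{
    qp->qpn = qpn;
    qp->max_send_sge = max_send_sge;
    qp->cmd_size = sizeof(*qp->cmd) + max_send_sge * sizeof(struct demo_sge);
    qp->cmd = malloc(qp->cmd_size);
    return qp->cmd ? 0 : -1;
}

/* Hot path: no allocation, just hand back the per-QP buffer. */
struct demo_post_cmd *demo_qp_cmd_buf(struct demo_qp *qp, uint32_t num_sge)
{
    assert(qp->cmd != NULL);
    if (num_sge > qp->max_send_sge)
        return NULL;            /* more than was requested at creation */
    qp->cmd->qpn = qp->qpn;
    qp->cmd->num_sge = num_sge;
    return qp->cmd;
}

void demo_qp_destroy(struct demo_qp *qp)
{
    free(qp->cmd);
    qp->cmd = NULL;
}
```

This relies on posts to one QP being serialized, which is the "synchronous" assumption made above; concurrent posters would need a lock or a per-thread buffer.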



> dive to kernel:
>   ib_uverbs_post_send()
>     user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);  <- 3. dyn alloc
>     next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
>                    user_wr->num_sge * sizeof (struct ib_sge),
>                    GFP_KERNEL);  <- 4. dyn alloc
>
> And now there is the final call to the driver.

Same here: for #4 you can compute and allocate the maximal possible
size for "next" once per QP and reuse it later. As for #3, this needs
further thinking.
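
The worst-case size for "next" follows from the quoted kmalloc() expression with num_sge replaced by the QP's maximum. A sketch in plain C, where demo_send_wr and demo_sge are hypothetical stand-ins for struct ib_send_wr and struct ib_sge, and DEMO_ALIGN is a local power-of-two round-up assumed to behave like the kernel's ALIGN():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
#define DEMO_ALIGN(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Hypothetical stand-in for struct ib_sge. */
struct demo_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* Hypothetical, trimmed stand-in for struct ib_send_wr. */
struct demo_send_wr {
    struct demo_send_wr *next;
    uint64_t wr_id;
    struct demo_sge *sg_list;
    int num_sge;
};

/* Bytes needed for one work request plus its largest possible SGE list. */
size_t demo_next_wr_max_size(uint32_t max_sge)
{
    /* The mask-based round-up is only valid for power-of-two alignment. */
    assert((sizeof(struct demo_sge) & (sizeof(struct demo_sge) - 1)) == 0);
    return DEMO_ALIGN(sizeof(struct demo_send_wr), sizeof(struct demo_sge)) +
           (size_t)max_sge * sizeof(struct demo_sge);
}
```

A buffer of this size could be allocated when the QP is created and handed to every post, provided posts on that QP do not run concurrently.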


But before diving into all these design changes, what was the penalty
introduced by these allocations? Is it in packets per second, or in latency?



> Diving to the kernel is treated as something like passing a signal to the
> kernel that there is prepared information for post_send/post_recv. The
> information about buffers is passed through a shared page (available to
> userspace through mmap) to avoid copying of data. The write() op is used to
> pass the signal about post_send. The read() op is used to pass information
> about post_recv(). We avoid additional copying of the data that way.

Thanks for the heads-up. I took a look, and this user/kernel shared
memory page is used to hold the work request; it has nothing to do with data.


As for the work request, you still have to copy it in user space from
the user work request to the library's mmapped buffer. So the only
difference would be the copy_from_user done by uverbs, for a few tens of
bytes. Can you tell whether this copy introduces an extra penalty, and if
so how large it is?



> struct nes_ud_send_wr {
>         u32   wr_cnt;
>         u32   qpn;
>         u32   flags;
>         u32   resv[1];
>         struct ib_sge sg_list[64];
> };
>
> struct nes_ud_recv_wr {
>         u32   wr_cnt;
>         u32   qpn;
>         u32   resv[2];
>         struct ib_sge sg_list[64];
> };

Looking at struct nes_ud_send/recv_wr, I wasn't sure I follow: the same
instance can be used to post a list of work requests, where each work
request is limited to one SGE, am I correct?


I don't think there is a need to support posting 64 --send-- requests. For
recv it might make sense, but it could be done in a "batch/background"
flow. Thoughts?


Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: When IBoE will be merged to upstream?

2010-07-06 Thread Jason Gunthorpe
On Wed, Jul 07, 2010 at 09:00:26AM +0300, Or Gerlitz wrote:

> I think we need to resolve through the rdma-cm, and expose at the
> consumer level, the source / destination MACs, VLAN id and VLAN
> priority used by an IBoE QP, in exactly the manner that all the IB
> equivalents (src/dst lid, pkey, sl) are resolved by the rdma-cm and
> exposed to the consumer app for an IB QP.

I agree.

Clearly following the model of IB is the best way to fit this in
without major changes. RDMA-CM is the way to get IP integration: it
uses existing eth devices attached to the master eth device (analogous
to IPoIB devices) and resolves IP to eth device to VLAN header and
neighbour.

If someone needs to do something else special it is pretty easy to do
all the same steps in userspace using netlink, and that could go into
a library, just like PR queries for IB.

Jason


Re: When IBoE will be merged to upstream?

2010-07-06 Thread Or Gerlitz
Liran Liss wrote:
> but keeping ib_create_ah() callable from any context is not a goal by itself.

Going with your approach, if your proposed design is accepted, I believe
you would need to patch all the call chains that invoke it under the
current assumption.

> I am looking for constructive ideas for supporting iboe without breaking 
> Verbs/CQE/CM syntax. 

I don't agree that exposing the Ethernet L2 related information to the caller
is breaking something; on the contrary, it is a required enhancement.

I think we need to resolve through the rdma-cm, and expose at the consumer
level, the source / destination MACs, VLAN id and VLAN priority used by an
IBoE QP, in exactly the manner that all the IB equivalents (src/dst lid, pkey,
sl) are resolved by the rdma-cm and exposed to the consumer app for an IB QP.
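
To make the analogy concrete, here is a sketch of what such resolved L2 information might look like next to its IB counterpart. Both structures and the helper are hypothetical illustrations, not rdma-cm API. The only external fact relied on is the 802.1Q tag layout (3-bit priority in the top bits, 12-bit VLAN id in the low bits):

```c
#include <assert.h>
#include <stdint.h>

/* IB attributes that the rdma-cm already resolves, per the text above. */
struct demo_ib_l2 {
    uint16_t slid, dlid;
    uint16_t pkey;
    uint8_t  sl;
};

/* Hypothetical IBoE counterpart carrying the Ethernet L2 values. */
struct demo_iboe_l2 {
    uint8_t  smac[6], dmac[6];
    uint16_t vlan_id;    /* 12 bits used */
    uint8_t  vlan_prio;  /* 3 bits used */
};

/*
 * Pack priority and VLAN id into an 802.1Q tag control field:
 * priority in bits 15..13, DEI bit 12 left at 0, VLAN id in bits 11..0.
 */
uint16_t demo_vlan_tci(const struct demo_iboe_l2 *l2)
{
    assert(l2->vlan_id < 4096 && l2->vlan_prio < 8);
    return (uint16_t)((l2->vlan_prio << 13) | l2->vlan_id);
}
```

A consumer that receives both values from the resolution step could build the VLAN header itself, in the same way an IB consumer uses the resolved SL and P_Key.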

Or.



Re: root owned writable files under /sys

2010-07-06 Thread Or Gerlitz
Sumeet Lahorani wrote:
> # find /sys -type f -perm -222
> /sys/devices/pci:00/:00:04.0/:13:00.0/port_trigger
> /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port2
> /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port1

Jack, Tziporet 

Can you clarify the status of the upstream kernel mlx4 multi-protocol support?
Looking at Linus's git tree, I see one commit,
7ff93f8b7ecbc36e7ffc5c11a61643821c1bfee5 "mlx4_core: Multiple port type
support" dated Oct 2008, whereas OFED ships a couple of patches touching this
area, e.g. adding the above sysfs entries. So what extra functionality is
introduced, or which bugs are fixed, by those patches? Is there any reason not
to push them upstream?


Or.


Re: [PATCH] opensm: event plig-in API fixed to compile with g++

2010-07-06 Thread Yevgeny Kliteynik

On 07-Jul-10 12:03 AM, Ira Weiny wrote:
> On Mon, 5 Jul 2010 11:41:44 -0700
> Sasha Khapyorsky  wrote:
>
>> On 14:30 Mon 05 Jul , Hal Rosenstock wrote:
>>> On Mon, Jul 5, 2010 at 2:11 PM, Sasha Khapyorsky  wrote:
>>>> On 11:10 Thu 24 Jun , Yevgeny Kliteynik wrote:
>>>>> Event API should have been able to be used by libraries
>>>>> written both in C and C++.
>>>>
>>>> I don't know about such requirement.
>>>
>>> Are you saying it isn't a valid requirement to allow OpenSM plugins to
>>> be C++ based ? If so, why not ?
>>
>> I'm saying that there is no requirement for plugin API to support C++ -
>> obviously (following method names) plugin API was never developed for
>> using it in C++.
>
> Actually IMO this is not correct.  The use of "delete" was introduced
> by commit a5963f93fa3d4514cc526e4ad029b036724b8167.  I was at fault to
> not have objected back then.  The use of "extern C" in all of the header
> files below implies a desire to support C++.

Couldn't agree more.

-- Yevgeny

> 10:28:14>  pwd; grep "BEGIN_C_DECLS extern" *
> /home/weiny2/OpenIB/git-trees/management/opensm/include/opensm
> osm_attrib_req.h:#  define BEGIN_C_DECLS extern "C" {
> osm_base.h:#  define BEGIN_C_DECLS extern "C" {
> osm_console.h:#  define BEGIN_C_DECLS extern "C" {
> osm_console_io.h:#  define BEGIN_C_DECLS extern "C" {
> osm_db.h:#  define BEGIN_C_DECLS extern "C" {
> osm_db_pack.h:#  define BEGIN_C_DECLS extern "C" {
> osm_event_plugin.h:#  define BEGIN_C_DECLS extern "C" {
> osm_helper.h:#  define BEGIN_C_DECLS extern "C" {
> osm_inform.h:#  define BEGIN_C_DECLS extern "C" {
> osm_lid_mgr.h:#  define BEGIN_C_DECLS extern "C" {
> osm_log.h:#  define BEGIN_C_DECLS extern "C" {
> osm_mad_pool.h:#  define BEGIN_C_DECLS extern "C" {
> osm_madw.h:#  define BEGIN_C_DECLS extern "C" {
> osm_mcast_tbl.h:#  define BEGIN_C_DECLS extern "C" {
> osm_mcm_port.h:#  define BEGIN_C_DECLS extern "C" {
> osm_msgdef.h:#  define BEGIN_C_DECLS extern "C" {
> osm_mtree.h:#  define BEGIN_C_DECLS extern "C" {
> osm_multicast.h:#  define BEGIN_C_DECLS extern "C" {
> osm_node.h:#  define BEGIN_C_DECLS extern "C" {
> osm_opensm.h:#  define BEGIN_C_DECLS extern "C" {
> osm_partition.h:#  define BEGIN_C_DECLS extern "C" {
> osm_path.h:#  define BEGIN_C_DECLS extern "C" {
> osm_perfmgr_db.h:#  define BEGIN_C_DECLS extern "C" {
> osm_pkey.h:#  define BEGIN_C_DECLS extern "C" {
> osm_port.h:#  define BEGIN_C_DECLS extern "C" {
> osm_port_profile.h:#  define BEGIN_C_DECLS extern "C" {
> osm_prefix_route.h:#  define BEGIN_C_DECLS extern "C" {
> osm_remote_sm.h:#  define BEGIN_C_DECLS extern "C" {
> osm_router.h:#  define BEGIN_C_DECLS extern "C" {
> osm_sa.h:#  define BEGIN_C_DECLS extern "C" {
> osm_sa_mad_ctrl.h:#  define BEGIN_C_DECLS extern "C" {
> osm_service.h:#  define BEGIN_C_DECLS extern "C" {
> osm_sm.h:#  define BEGIN_C_DECLS extern "C" {
> osm_sm.h.orig:#  define BEGIN_C_DECLS extern "C" {
> osm_sm_mad_ctrl.h:#  define BEGIN_C_DECLS extern "C" {
> osm_stats.h:#  define BEGIN_C_DECLS extern "C" {
> osm_subnet.h:#  define BEGIN_C_DECLS extern "C" {
> osm_subnet.h.orig:#  define BEGIN_C_DECLS extern "C" {
> osm_switch.h:#  define BEGIN_C_DECLS extern "C" {
> osm_ucast_cache.h:#  define BEGIN_C_DECLS extern "C" {
> osm_ucast_mgr.h:#  define BEGIN_C_DECLS extern "C" {
> osm_vl15intf.h:#  define BEGIN_C_DECLS extern "C" {
> st.h:#  define BEGIN_C_DECLS extern "C" {
>
> Ira
>
>> Why not is another question - for instance in order to not deal with
>> C/C++ compatibility issues (such as castings, function names limitation,
>> linking mess, etc.)
>>
>> Sasha


Re: [PATCH] IB/qib: if qib_init() fails, driver fails to clean up properly

2010-07-06 Thread Roland Dreier
thanks, applied
-- 
Roland Dreier  || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs

2010-07-06 Thread Roland Dreier
 > We thought about this too, but in some special cases, we do not know the
 > correct page size of a memory range. For example when getting a 16M chunk
 > from a 16M huge page region which is also aligned to 16M, the first madvise()
 > will work fine and the code will assume that the page size is 64K.

I see ... yes, that does break my idea completely.

OK, another half-baked idea: what if we pay attention to when
madvise(DOFORK) fails as well as when madvise(DONTFORK) fails, and use
that as a hint that we'd better check the page size?

Perhaps this adds too much complexity ... in which case your idea:

 > As this issue was not present in version 2 of the code, but there we had
 > a big performance penalty, I suggest the following: we could go back to
 > version 2 and introduce a new RDMAV_HUGEPAGE_SAFE env variable to let the
 > user decide between huge page support and better performance (the same
 > approach we use for the COW protection itself).

seems like a reasonable alternative.

Thanks,
  Roland


Re: [PATCH 2.6.35 3/3] RDMA/cxgb4: Avoid false GTS CIDX_INC overflows.

2010-07-06 Thread Roland Dreier
thanks, applied


Re: [PATCH 2.6.35 1/3] RDMA/cxgb4: Don't call abort_connection() for active connect failures.

2010-07-06 Thread Roland Dreier
thanks, applied


Re: [PATCH v2 1/3] RDMA/cxgb4: derive smac_idx from port viid.

2010-07-06 Thread Roland Dreier
thanks, applied


Re: [PATCH 7/7] IB/qib: completion queue callback needs to be single threaded

2010-07-06 Thread Roland Dreier
thanks, applied all except 2/7 (which seems to be only an optimization)


Re: root owned writeable files under /sys

2010-07-06 Thread Roland Dreier
thanks, applied


Re: [PATCH] opensm: event plig-in API fixed to compile with g++

2010-07-06 Thread Ira Weiny
On Mon, 5 Jul 2010 11:41:44 -0700
Sasha Khapyorsky  wrote:

> On 14:30 Mon 05 Jul , Hal Rosenstock wrote:
> > On Mon, Jul 5, 2010 at 2:11 PM, Sasha Khapyorsky  wrote:
> > > On 11:10 Thu 24 Jun     , Yevgeny Kliteynik wrote:
> > >> Event API should have been able to be used by libraries
> > >> written both in C and C++.
> > >
> > > I don't know about such requirement.
> > 
> > Are you saying it isn't a valid requirement to allow OpenSM plugins to
> > be C++ based ? If so, why not ?
> 
> I'm saying that there is no requirement for plugin API to support C++ -
> obviously (following method names) plugin API was never developed for
> using it in C++.

Actually IMO this is not correct.  The use of "delete" was introduced by commit 
a5963f93fa3d4514cc526e4ad029b036724b8167.  I was at fault to not have objected 
back then.  The use of "extern C" in all of the header files below implies a 
desire to support C++.

10:28:14 > pwd; grep "BEGIN_C_DECLS extern" *
/home/weiny2/OpenIB/git-trees/management/opensm/include/opensm
osm_attrib_req.h:#  define BEGIN_C_DECLS extern "C" {
osm_base.h:#  define BEGIN_C_DECLS extern "C" {
osm_console.h:#  define BEGIN_C_DECLS extern "C" {
osm_console_io.h:#  define BEGIN_C_DECLS extern "C" {
osm_db.h:#  define BEGIN_C_DECLS extern "C" {
osm_db_pack.h:#  define BEGIN_C_DECLS extern "C" {
osm_event_plugin.h:#  define BEGIN_C_DECLS extern "C" {
osm_helper.h:#  define BEGIN_C_DECLS extern "C" {
osm_inform.h:#  define BEGIN_C_DECLS extern "C" {
osm_lid_mgr.h:#  define BEGIN_C_DECLS extern "C" {
osm_log.h:#  define BEGIN_C_DECLS extern "C" {
osm_mad_pool.h:#  define BEGIN_C_DECLS extern "C" {
osm_madw.h:#  define BEGIN_C_DECLS extern "C" {
osm_mcast_tbl.h:#  define BEGIN_C_DECLS extern "C" {
osm_mcm_port.h:#  define BEGIN_C_DECLS extern "C" {
osm_msgdef.h:#  define BEGIN_C_DECLS extern "C" {
osm_mtree.h:#  define BEGIN_C_DECLS extern "C" {
osm_multicast.h:#  define BEGIN_C_DECLS extern "C" {
osm_node.h:#  define BEGIN_C_DECLS extern "C" {
osm_opensm.h:#  define BEGIN_C_DECLS extern "C" {
osm_partition.h:#  define BEGIN_C_DECLS extern "C" {
osm_path.h:#  define BEGIN_C_DECLS extern "C" {
osm_perfmgr_db.h:#  define BEGIN_C_DECLS extern "C" {
osm_pkey.h:#  define BEGIN_C_DECLS extern "C" {
osm_port.h:#  define BEGIN_C_DECLS extern "C" {
osm_port_profile.h:#  define BEGIN_C_DECLS extern "C" {
osm_prefix_route.h:#  define BEGIN_C_DECLS extern "C" {
osm_remote_sm.h:#  define BEGIN_C_DECLS extern "C" {
osm_router.h:#  define BEGIN_C_DECLS extern "C" {
osm_sa.h:#  define BEGIN_C_DECLS extern "C" {
osm_sa_mad_ctrl.h:#  define BEGIN_C_DECLS extern "C" {
osm_service.h:#  define BEGIN_C_DECLS extern "C" {
osm_sm.h:#  define BEGIN_C_DECLS extern "C" {
osm_sm.h.orig:#  define BEGIN_C_DECLS extern "C" {
osm_sm_mad_ctrl.h:#  define BEGIN_C_DECLS extern "C" {
osm_stats.h:#  define BEGIN_C_DECLS extern "C" {
osm_subnet.h:#  define BEGIN_C_DECLS extern "C" {
osm_subnet.h.orig:#  define BEGIN_C_DECLS extern "C" {
osm_switch.h:#  define BEGIN_C_DECLS extern "C" {
osm_ucast_cache.h:#  define BEGIN_C_DECLS extern "C" {
osm_ucast_mgr.h:#  define BEGIN_C_DECLS extern "C" {
osm_vl15intf.h:#  define BEGIN_C_DECLS extern "C" {
st.h:#  define BEGIN_C_DECLS extern "C" {

Ira

> 
> Why not is another question - for instance in order to not deal with
> C/C++ compatibility issues (such as castings, function names limitation,
> linking mess, etc.)
> 
> Sasha


Re: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage

2010-07-06 Thread Steve Wise
I haven't thought through all the details, but in principle this should
work.  But this isn't just an iWARP issue.  Currently all RDMA-CM users
share the same port space.  I think we need to maintain this, so a
transport-independent RDMA app can run over both IB and IW.  This goes
for the server side wrt listen/accept as well.


Steve.


Tung, Chien Tin wrote:

Steve,

Do you see any issues with Bernard's proposal?  Is this something we can agree 
on?

Chien

  

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Tung,
Chien Tin
Sent: Friday, June 25, 2010 3:15 PM
To: Bernard Metzler; Roland Dreier
Cc: Jason Gunthorpe; linux-rdma@vger.kernel.org; 
linux-rdma-ow...@vger.kernel.org; Waskiewicz Jr,
Peter P; Steve Wise
Subject: RE: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage



To my understanding, our discussion touches two topics. One is
to solve the TCP port space issue, the other is more general, it's about
proper integration of offloaded TCP within Linux. So, the second
topic is a generalization of the first.

Regarding the first topic, what I was about to propose is that the
iWARP kernel driver (software iWARP or RNIC) itself should take care of
port space allocations. Port space maintenance functionality should
be minimized at iWARP CM level. It looks straightforward to me if
during the rdma_connect() call the driver picks a free port using
a socket/bind sequence for its local interface. The same would be possible
for the passive connection setup, which always involves an rdma_bind_addr()
- we would have to pass the rdma_bind_addr() call down to the driver
and EADDRINUSE would be a reasonable return value.
Here things are getting a little more complicated when it comes to
INADDR_ANY and port 0 bindings. In private email, Bob Sharp already
suggested it - the iWARP CM would have to pick a port and
try it on all interfaces, maybe by starting with a port 0 binding
on one interface and trying to extend with the returned port on
all remaining interfaces. That introduces an unbind() call if things
fail, too. In any case, the rdma_bind_addr() call would create additional
state at driver level.
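
The socket/bind sequence described above can be illustrated from user space. This is only a sketch under stated assumptions: demo_reserve_port() and demo_release_port() are hypothetical helper names, and a kernel driver would use the in-kernel socket interfaces rather than these libc calls. It shows the two behaviours the proposal relies on: binding to port 0 lets the stack pick a free port, and a second bind to the same address and port fails with EADDRINUSE for as long as the first socket is held open.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a TCP socket to ip:port (port in host order; 0 lets the stack pick).
 * On success returns the socket fd, which must stay open to hold the port,
 * and stores the bound port in *bound_port. On failure returns -errno.
 */
int demo_reserve_port(const char *ip, uint16_t port, uint16_t *bound_port)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);
    int fd, err;

    assert(ip != NULL && bound_port != NULL);

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &sa.sin_addr) != 1)
        return -EINVAL;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -errno;

    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        getsockname(fd, (struct sockaddr *)&sa, &len) < 0) {
        err = errno;
        close(fd);
        return -err;
    }

    *bound_port = ntohs(sa.sin_port);
    return fd;
}

/* The "unbind" step: closing the socket gives the port back. */
void demo_release_port(int fd)
{
    close(fd);
}
```

For the INADDR_ANY case, the same helper could be called once with port 0 on the first interface and then again with the returned port on each remaining interface, releasing everything if any of those calls returns -EADDRINUSE.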
  

I am okay with adding rdma_bind_addr and rdma_unbind_addr calls.  I won't
speak for Sean and the work that needs to go into the CM.  But this will allow
all known iWARP implementations to work together.



For softiwarp, during bind() or connect(), a TCP socket would be created
and bound, for an RNIC driver (currently) the same would happen. While with
softiwarp this socket would be used for communication later, the RNIC
driver would simply have to keep it around until the connection endpoint
gets destroyed or the port gets unbound.
  

We want to be careful and make sure there is only one iWARP provider per IP 
address.
If softiWARP binds and surfaces another verbs interface on an existing one,
this scheme will not work.

Chien




RE: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage

2010-07-06 Thread Tung, Chien Tin
Steve,

Do you see any issues with Bernard's proposal?  Is this something we can agree 
on?

Chien

> -Original Message-
> From: linux-rdma-ow...@vger.kernel.org 
> [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Tung,
> Chien Tin
> Sent: Friday, June 25, 2010 3:15 PM
> To: Bernard Metzler; Roland Dreier
> Cc: Jason Gunthorpe; linux-rdma@vger.kernel.org; 
> linux-rdma-ow...@vger.kernel.org; Waskiewicz Jr,
> Peter P; Steve Wise
> Subject: RE: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage
> 
> > To my understanding, our discussion touches two topics. One is
> > to solve the TCP port space issue, the other is more general, it's about
> > proper integration of offloaded TCP within Linux. So, the second
> > topic is a generalization of the first.
> >
> > Regarding the first topic, what I was about to propose is that the
> > iWARP kernel driver (software iWARP or RNIC) itself should take care of
> > port space allocations. Port space maintenance functionality should
> > be minimized at iWARP CM level. It looks straightforward to me if
> > during the rdma_connect() call the driver picks a free port using
> > a socket/bind sequence for its local interface. The same would be possible
> > for the passive connection setup, which always involves an rdma_bind_addr()
> > - we would have to pass the rdma_bind_addr() call down to the driver
> > and EADDRINUSE would be a reasonable return value.
> > Here things are getting a little more complicated when it comes to
> > INADDR_ANY and port 0 bindings. In private email, Bob Sharp already
> > suggested it - the iWARP CM would have to pick a port and
> > try it on all interfaces, maybe by starting with a port 0 binding
> > on one interface and trying to extend with the returned port on
> > all remaining interfaces. That introduces an unbind() call if things
> > fail, too. In any case, the rdma_bind_addr() call would create additional
> > state at driver level.
> 
> I am okay with adding rdma_bind_addr and rdma_unbind_addr calls.  I won't
> speak for Sean and the work that needs to go into the CM.  But this will allow
> all known iWARP implementations to work together.
> 
> > For softiwarp, during bind() or connect(), a TCP socket would be created
> > and bound, for an RNIC driver (currently) the same would happen. While with
> > softiwarp this socket would be used for communication later, the RNIC
> > driver would simply have to keep it around until the connection endpoint
> > gets destroyed or the port gets unbound.
> 
> We want to be careful and make sure there is only one iWARP provider per IP 
> address.
> If softiWARP binds and surfaces another verbs interface on an existing one,
> this scheme will not work.
> 
> Chien
> 
> 


Re: [PATCH] opensm: event plig-in API fixed to compile with g++

2010-07-06 Thread Roland Dreier
 > FWIW, I agree with Hal.  Support for external plug-ins written in C++ seems 
 > desirable.

Seems that anyone who cared could already easily write a tiny shim in C
and then write the rest of their plugin in C++.  Or are there deeper
issues than names of methods?
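
A sketch of the kind of shim meant here, assuming the only obstacle is a callback member named "delete". The table demo_plugin_ops and all demo_* functions are hypothetical and are not the OpenSM plugin API. In a real build the demo_cxx_* entry points would live in a C++ file and be declared extern "C"; they are stubbed in C below so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical plugin table. "delete" is a legal identifier in C but a
 * keyword in C++, so a C++ translation unit could not include this as-is.
 */
struct demo_plugin_ops {
    void *(*create)(void);
    void (*delete)(void *plugin_data);
    void (*report)(void *plugin_data, int event_id);
};

/* Stand-ins for the C++ side; state is a single int for the demo. */
static int demo_state;

void *demo_cxx_create(void)
{
    demo_state = 0;
    return &demo_state;
}

void demo_cxx_destroy(void *data)
{
    *(int *)data = -1;
}

void demo_cxx_report(void *data, int event_id)
{
    assert(data != NULL);
    *(int *)data += event_id;
}

/* The shim: the only file that has to spell the "delete" member name. */
const struct demo_plugin_ops demo_plugin = {
    .create = demo_cxx_create,
    .delete = demo_cxx_destroy,
    .report = demo_cxx_report,
};
```

With this split, the C++ code never sees the C header at all, which is why the shim works but also why it does nothing for a plugin author who wants to include the API headers directly from C++.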

 - R.


Re: [PATCH net-next-2.6] IB/{nes,ipoib}: Pass supported flags to ethtool_op_set_flags()

2010-07-06 Thread Randy Dunlap
On 07/03/10 12:41, Ben Hutchings wrote:
> Following commit 1437ce3983bcbc0447a0dedcd644c14fe833d266 "ethtool:
> Change ethtool_op_set_flags to validate flags", ethtool_op_set_flags
> takes a third parameter and cannot be used directly as an
> implementation of ethtool_ops::set_flags.
> 
> Change the nes and ipoib drivers to pass in the appropriate value.
> 
> Signed-off-by: Ben Hutchings 
> ---
> This is compile-tested only.

Ack, thanks.

> Dave, Roland, you'd better decide between yourselves who should apply this.
> 
> Ben.
> 
>  drivers/infiniband/hw/nes/nes_nic.c  |8 +++-
>  drivers/infiniband/ulp/ipoib/ipoib_ethtool.c |7 ++-
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
> index 5cc0a9a..42e7aad 100644
> --- a/drivers/infiniband/hw/nes/nes_nic.c
> +++ b/drivers/infiniband/hw/nes/nes_nic.c
> @@ -1567,6 +1567,12 @@ static int nes_netdev_set_settings(struct net_device *netdev, struct ethtool_cmd
>  }
>  
> 
> +static int nes_netdev_set_flags(struct net_device *netdev, u32 flags)
> +{
> + return ethtool_op_set_flags(netdev, flags, ETH_FLAG_LRO);
> +}
> +
> +
>  static const struct ethtool_ops nes_ethtool_ops = {
>   .get_link = ethtool_op_get_link,
>   .get_settings = nes_netdev_get_settings,
> @@ -1588,7 +1594,7 @@ static const struct ethtool_ops nes_ethtool_ops = {
>   .get_tso = ethtool_op_get_tso,
>   .set_tso = ethtool_op_set_tso,
>   .get_flags = ethtool_op_get_flags,
> - .set_flags = ethtool_op_set_flags,
> + .set_flags = nes_netdev_set_flags,
>  };
>  
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
> index 40e8584..1a1657c 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
> @@ -147,6 +147,11 @@ static void ipoib_get_ethtool_stats(struct net_device *dev,
>   data[index++] = priv->lro.lro_mgr.stats.no_desc;
>  }
>  
> +static int ipoib_set_flags(struct net_device *dev, u32 flags)
> +{
> + return ethtool_op_set_flags(dev, flags, ETH_FLAG_LRO);
> +}
> +
>  static const struct ethtool_ops ipoib_ethtool_ops = {
>   .get_drvinfo= ipoib_get_drvinfo,
>   .get_rx_csum= ipoib_get_rx_csum,
> @@ -154,7 +159,7 @@ static const struct ethtool_ops ipoib_ethtool_ops = {
>   .get_coalesce   = ipoib_get_coalesce,
>   .set_coalesce   = ipoib_set_coalesce,
>   .get_flags  = ethtool_op_get_flags,
> - .set_flags  = ethtool_op_set_flags,
> + .set_flags  = ipoib_set_flags,
>   .get_strings= ipoib_get_strings,
>   .get_sset_count = ipoib_get_sset_count,
>   .get_ethtool_stats  = ipoib_get_ethtool_stats,
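
The pattern in the quoted patch, reduced to a self-contained sketch: a helper that now takes a third "supported" argument no longer matches the two-argument callback slot, so each driver supplies a thin wrapper that fixes the mask. Everything prefixed demo_ or DEMO_ is a stand-in invented for illustration, the flag bit values are arbitrary, and the helper body is an assumption about what "validate flags" means, not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FLAG_LRO    (1u << 15)
#define DEMO_FLAG_NTUPLE (1u << 27)

struct demo_netdev {
    uint32_t flags;
};

/* Stand-in for the three-argument helper: reject unsupported bits. */
int demo_op_set_flags(struct demo_netdev *dev, uint32_t data, uint32_t supported)
{
    if (data & ~supported)
        return -EINVAL;
    dev->flags = data;
    return 0;
}

/* Two-argument callback type, as the ops table expects. */
typedef int (*demo_set_flags_fn)(struct demo_netdev *dev, uint32_t data);

/* Driver wrapper: adapts the helper to the callback by fixing the mask. */
int demo_drv_set_flags(struct demo_netdev *dev, uint32_t data)
{
    assert(dev != NULL);
    return demo_op_set_flags(dev, data, DEMO_FLAG_LRO);
}

struct demo_ethtool_ops {
    demo_set_flags_fn set_flags;
};

const struct demo_ethtool_ops demo_ops = {
    .set_flags = demo_drv_set_flags,
};
```

A request for a flag the driver did not list, such as DEMO_FLAG_NTUPLE here, is rejected with -EINVAL and leaves the device flags unchanged.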


-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


RE: [PATCH] opensm: event plig-in API fixed to compile with g++

2010-07-06 Thread Hefty, Sean
> > OpenSM is written in C, not C++, I'm pretty fine with it and don't see
> > any reason to get C++ direction.
> 
> That's OpenSM but why limit plug in writers to C ?

FWIW, I agree with Hal.  Support for external plug-ins written in C++ seems 
desirable.


Re: [ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs

2010-07-06 Thread Alexander Schmidt
On Sat, 03 Jul 2010 13:19:07 -0700
Roland Dreier  wrote:

>  > When registering two memory regions A and B from within the same huge
>  > page, we will end up with one node in the tree which covers the whole
>  > huge page after registering A. When the second MR is registered, a node
>  > is created with the MR size rounded to the system page size (as there is
>  > no need to call madvise(), it is not noticed that MR B is part of a huge
>  > page).
>  >
>  > Now if MR A is deregistered before MR B, I see that the tree containing
>  > mem_nodes is empty afterwards, which causes problems for the
>  > deregistration of MR B, leaving the tree in a corrupted state with
>  > negative refcounts. This also breaks later registrations of other memory
>  > regions within this huge page.
> 
> Good thing I didn't get around to applying the patch yet ;)
> 
> I haven't thought this through fully, but it seems that maybe we could
> extend the madvise tracking tree to keep track of the page size used for
> each node in the tree.  Then for the registration of MR B above, we
> would find the node for MR A covered MR B and we should be able to get
> the ref counting right.

We thought about this too, but in some special cases, we do not know the
correct page size of a memory range. For example when getting a 16M chunk
from a 16M huge page region which is also aligned to 16M, the first madvise()
will work fine and the code will assume that the page size is 64K.

If trying to register a 16M - 64K + 1 byte region, the first madvise() also
works fine. Now if a second memory region which resides in the last 64K is
registered, we end up with the same situation as above.
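
The arithmetic behind this second case can be checked with a small sketch. It assumes, as the description above implies, that ranges are rounded outward to the 64K system page size before madvise() and that madvise() on a huge page mapping only succeeds when the range is aligned to the 16M huge page size. The names demo_range, demo_round_range and demo_huge_aligned are hypothetical, not libibverbs code.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_64K (64ull * 1024)
#define DEMO_16M (16ull * 1024 * 1024)

struct demo_range {
    uint64_t start, end;        /* half-open: [start, end) */
};

/* Expand [addr, addr + len) outward to page_size boundaries. */
struct demo_range demo_round_range(uint64_t addr, uint64_t len,
                                   uint64_t page_size)
{
    struct demo_range r;

    /* Mask arithmetic below needs a power-of-two page size. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    r.start = addr & ~(page_size - 1);
    r.end = (addr + len + page_size - 1) & ~(page_size - 1);
    return r;
}

/* Is the rounded range also aligned to the 16M huge page size? */
int demo_huge_aligned(struct demo_range r)
{
    return (r.start % DEMO_16M) == 0 && (r.end % DEMO_16M) == 0;
}
```

For a region of 16M - 64K + 1 bytes starting on a 16M boundary, rounding to 64K yields exactly the full 16M range, so it happens to be huge page aligned and nothing reveals that the 64K assumption is wrong. A second region in the last 64K then rounds to a range that lies inside the first but is not huge page aligned.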

As this issue was not present in version 2 of the code, but there we had
a big performance penalty, I suggest the following: we could go back to
version 2 and introduce a new RDMAV_HUGEPAGE_SAFE env variable to let the user
decide between huge page support and better performance (the same approach we
use for the COW protection itself). Would this be okay or do you see a problem
with this?
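
A sketch of how such an opt-in switch might be read, assuming it works like a simple presence check on the environment variable. The variable name is the one proposed above; demo_env_flag() and the demo_fork_mode enum are hypothetical, and nothing here says what the huge-page-safe path itself would do.

```c
#define _POSIX_C_SOURCE 200112L  /* only needed if callers use setenv() */
#include <assert.h>
#include <stdlib.h>

enum demo_fork_mode {
    DEMO_MODE_FAST,             /* assume the system page size everywhere */
    DEMO_MODE_HUGEPAGE_SAFE     /* slower path that copes with huge pages */
};

/* Nonzero when the named environment variable is present at all. */
int demo_env_flag(const char *name)
{
    assert(name != NULL);
    return getenv(name) != NULL;
}

/* Decide once, at init time, which code path later calls will take. */
enum demo_fork_mode demo_select_mode(void)
{
    return demo_env_flag("RDMAV_HUGEPAGE_SAFE") ? DEMO_MODE_HUGEPAGE_SAFE
                                                : DEMO_MODE_FAST;
}
```

Reading the variable once during initialization keeps the check off the registration fast path, so users who do not set it pay nothing for the option.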

Regards,
Alex


RE: When IBoE will be merged to upstream?

2010-07-06 Thread Liran Liss
> reading your replies would be much easier if you use strict bottom posting

OK :)

> 
> >> For a long time we've assumed that the create_ah verb can't sleep, so
> >> where are you going to do neighbor discovery?
>
> > re [...] implementation, there is no inherent issue that prevents
> > create_ah() from sleeping:
> > - Change a few spinlocks to mutexes in the cma (which sleeps a lot anyway
> >   because it modifies QP states)
> > - Trivial for user-space calls...
>
> Documentation/infiniband/core_locking.txt states that "The corresponding
> functions exported to upper level protocol consumers: ... ib_create_ah ...
> are therefore safe to call from any context."
>
> Which is in turn assumed by a bunch of components in the kernel IB stack
> (look for ib_create_ah calls under drivers/infiniband/core/). The examples
> you brought here don't cover them. One way or another, I don't see any
> reason to break that convention just for the sake of this aspect of the
> iboe implementation; simply code with this assumption at hand.
> 
> Or.
> 

I understand this, but keeping ib_create_ah() callable from any context is not
a goal in itself.
I am looking for constructive ideas for supporting iboe without breaking
Verbs/CQE/CM syntax.

What do you propose?


RE: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

2010-07-06 Thread Walukiewicz, Miroslaw
Hello Or, 

> I still don't see what is the performance issue with the uverbs
> post_send/post_recv and if there is such why it can't be fixed, to avoid
> introducing lib/driver nes special char device. Could you explain it with
> some more details? You were mention the rdma-cm device file, but the uverbs
> cmd api is used by libibverbs / uverbs and not by librdmacm / rdma-ucm,
> which is anyway a slow path.

From my measurements, the problem appears to be memory allocation in both the
user-space and kernel paths, which is a very expensive operation. Look at the
tx path (rx is very similar):

ibv_post_send()
  post_send_wrapper_1_0()
    for (w = wr; w; w = w->next) {
        real_wr = alloca(sizeof *real_wr);           <- 1. dyn alloc
        real_wr->wr_id = w->wr_id;

  next, the call to the HW-specific part, which prepares the message to send:
    cmd = alloca(cmd_size);                          <- 2. dyn alloc
    IBV_INIT_CMD_RESP(cmd, cmd_size, POST_SEND, &resp, sizeof resp);

  dive to kernel:
    ib_uverbs_post_send()
      user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);   <- 3. dyn alloc
      next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
                     user_wr->num_sge * sizeof (struct ib_sge),
                     GFP_KERNEL);                    <- 4. dyn alloc

And now there is the final call to the driver.

Adding the additional device makes it possible to dive into the kernel
without those memory allocations.
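
For comparison, here is a minimal user-space sketch of the alternative raised
in Or's reply in this thread: size the command buffer once per QP from the
maximum values given at QP creation and reuse it on every post, so the hot
path allocates nothing. All names here (struct sge_entry, struct qp_cmd_buf,
qp_cmd_buf_init, qp_cmd_buf_get, qp_cmd_buf_destroy) are hypothetical and are
not part of libibverbs, uverbs, or the nes driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scatter/gather entry, used only to size the buffer. */
struct sge_entry {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* Hypothetical per-QP command buffer, allocated once at QP creation. */
struct qp_cmd_buf {
    void *buf;
    size_t size;
};

/* Slow path: allocate room for the largest possible post on this QP. */
static int qp_cmd_buf_init(struct qp_cmd_buf *cb, size_t hdr_size,
                           size_t wr_size, unsigned max_wr, unsigned max_sge)
{
    size_t per_wr = wr_size + (size_t)max_sge * sizeof(struct sge_entry);

    cb->size = hdr_size + (size_t)max_wr * per_wr;
    cb->buf = malloc(cb->size);
    return cb->buf ? 0 : -1;
}

/* Hot path: hand out the preallocated buffer, no allocation involved.
 * Returns NULL if the request exceeds what was sized at creation. */
static void *qp_cmd_buf_get(struct qp_cmd_buf *cb, size_t needed)
{
    assert(cb->buf != NULL);
    return needed <= cb->size ? cb->buf : NULL;
}

/* Slow path: release the buffer when the QP is destroyed. */
static void qp_cmd_buf_destroy(struct qp_cmd_buf *cb)
{
    free(cb->buf);
    cb->buf = NULL;
    cb->size = 0;
}
```

This reuse is only safe under the assumption stated in that reply, namely
that posting on a given QP is synchronous, so the buffer is never needed by
two posts at the same time.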

> Also, I understand that .read (.write) entry maps to posting a receive
> (send) buffer, what is the use case for .mmap entry

Not exactly. Diving into the kernel is treated as a signal to the kernel that
information has been prepared for post_send/post_recv. The information about
the buffers is passed through a shared page (made available to userspace
through mmap) to avoid copying data. The write() op is used to signal a
post_send; the read() op is used to signal a post_recv. That way we avoid
additional copying of the data.
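
The shared-page idea can be sketched roughly as follows. This is only an
illustration under assumptions: the layout (struct shared_page, struct
buf_desc, RING_SLOTS) and the helpers ring_post() and ring_drain() are
hypothetical, and the actual layout used by the nes patch is not shown in
this thread. The producer writes lkey/vaddr/len descriptors into the page and
the consumer reads them in place, so no descriptor is copied across the
user/kernel boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u   /* hypothetical ring size, must be a power of two */

/* One buffer descriptor as passed through the shared page. */
struct buf_desc {
    uint64_t vaddr;
    uint32_t len;
    uint32_t lkey;
};

/* Hypothetical layout of the page shared between user space and kernel. */
struct shared_page {
    uint32_t head;                    /* advanced by the producer */
    uint32_t tail;                    /* advanced by the consumer */
    struct buf_desc slot[RING_SLOTS];
};

/* Producer side: queue one descriptor. Returns -1 if the ring is full. */
static int ring_post(struct shared_page *sp, uint64_t vaddr,
                     uint32_t len, uint32_t lkey)
{
    assert(sp != NULL);
    if (sp->head - sp->tail == RING_SLOTS)
        return -1;

    struct buf_desc *d = &sp->slot[sp->head % RING_SLOTS];
    d->vaddr = vaddr;
    d->len = len;
    d->lkey = lkey;
    sp->head++;
    return 0;
}

/* Consumer side: take up to max queued descriptors, in posting order. */
static unsigned ring_drain(struct shared_page *sp, struct buf_desc *out,
                           unsigned max)
{
    unsigned n = 0;

    assert(sp != NULL && out != NULL);
    while (sp->tail != sp->head && n < max) {
        out[n++] = sp->slot[sp->tail % RING_SLOTS];
        sp->tail++;
    }
    return n;
}
```

In a real driver the structure would live in a page obtained with mmap() on
the device file, and the producer would follow ring_post() with a write() on
that file descriptor to signal the kernel. The sketch is single-threaded and
omits the memory barriers a real shared ring needs.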


> @@ -2939,6 +3130,9 @@ int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
>   nesqp->hwqp.qp_id, attr->qp_state, nesqp->ibqp_state,
>   nesqp->iwarp_state, atomic_read(&nesqp->refcount));
>  
> + if (ibqp->qp_type == IB_QPT_RAW_PACKET)
> + return 0;

> isn't a raw qp associated with a specific port of the device?

In the NES architecture, the QP type and number define a specific device or
port. It is a one-to-one mapping.

Regards,

Mirek

-----Original Message-----
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Tuesday, July 06, 2010 10:50 AM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org; aleks...@voltaire.com
Subject: Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

miroslaw.walukiew...@intel.com wrote:
> adds a IB_QPT_RAW_PACKET QP type implementation for nes driver 

> +++ b/drivers/infiniband/hw/nes/nes_ud.c
> +static const struct file_operations nes_ud_sksq_fops = {
> + .owner = THIS_MODULE,
> + .open = nes_ud_sksq_open,
> + .release = nes_ud_sksq_close,
> + .write = nes_ud_sksq_write,
> + .read = nes_ud_sksq_read,
> + .mmap = nes_ud_sksq_mmap,
> +};
> +
> +
> +static struct miscdevice nes_ud_sksq_misc = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "nes_ud_sksq",
> + .fops = &nes_ud_sksq_fops,
> +};

Reading through the May 2010 "RDMA/nes: IB_QPT_RAW_PACKET QP type support for 
nes driver" email thread, e.g at the below links, you say


> The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is
> shared with all other user-kernel  communication and it is quite complex.
> It is a perfect path for QP/CQ/PD/mem management but for me it is too
> complex for traffic acceleration. The user<->kernel  path  through
> additional driver, shared page for lkey/vaddr/len passing and SW memory
> translation in kernel is much more effective.

http://marc.info/?l=linux-rdma&m=127299659017928
http://marc.info/?l=linux-rdma&m=127306694704653

I still don't see what is the performance issue with the uverbs 
post_send/post_recv and if there is such why it can't be fixed, to avoid 
introducing lib/driver nes special char device. Could you explain it with some 
more details? You were mention the rdma-cm device file, but the uverbs cmd api 
is used by libibverbs / uverbs and not by librdmacm / rdma-ucm, which is anyway 
a slow path.

Also, I understand that .read (.write) entry maps to posting a receive (send) 
buffer, what is the use case for .mmap entry

> --- a/drivers/infiniband/hw/nes/nes_verbs.c
> +++ b/drivers/infiniband/hw/nes/nes_verbs.c

> @@ -1139,7 +1141,6 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
[...]
> - atomic_inc(&qps_created);
> @@ -1405,10 +1406,122 @@ static struct ib_qp *nes_create_qp(

Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

2010-07-06 Thread Or Gerlitz
miroslaw.walukiew...@intel.com wrote:
> adds a IB_QPT_RAW_PACKET QP type implementation for nes driver 

> +++ b/drivers/infiniband/hw/nes/nes_ud.c
> +static const struct file_operations nes_ud_sksq_fops = {
> + .owner = THIS_MODULE,
> + .open = nes_ud_sksq_open,
> + .release = nes_ud_sksq_close,
> + .write = nes_ud_sksq_write,
> + .read = nes_ud_sksq_read,
> + .mmap = nes_ud_sksq_mmap,
> +};
> +
> +
> +static struct miscdevice nes_ud_sksq_misc = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "nes_ud_sksq",
> + .fops = &nes_ud_sksq_fops,
> +};

Reading through the May 2010 "RDMA/nes: IB_QPT_RAW_PACKET QP type support for
nes driver" email thread, e.g. at the links below, you say


> The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is
> shared with all other user-kernel  communication and it is quite complex.
> It is a perfect path for QP/CQ/PD/mem management but for me it is too
> complex for traffic acceleration. The user<->kernel  path  through
> additional driver, shared page for lkey/vaddr/len passing and SW memory
> translation in kernel is much more effective.

http://marc.info/?l=linux-rdma&m=127299659017928
http://marc.info/?l=linux-rdma&m=127306694704653

I still don't see what the performance issue with the uverbs
post_send/post_recv is and, if there is one, why it can't be fixed to avoid
introducing a nes-specific lib/driver char device. Could you explain it in
more detail? You mentioned the rdma-cm device file, but the uverbs cmd API
is used by libibverbs / uverbs and not by librdmacm / rdma-ucm, which is a
slow path anyway.

Also, I understand that the .read (.write) entry maps to posting a receive
(send) buffer; what is the use case for the .mmap entry?

> --- a/drivers/infiniband/hw/nes/nes_verbs.c
> +++ b/drivers/infiniband/hw/nes/nes_verbs.c

> @@ -1139,7 +1141,6 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
[...]
> - atomic_inc(&qps_created);
> @@ -1405,10 +1406,122 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
[...]
> + /* moved here to be sure that QP is really created */
> + /*(now it counted a number of QP creation trials */
> + atomic_inc(&qps_created);

It would be best to place this change, and a couple more like it, in a
clean-up patch to nes_verbs.c, so that the amount of RAW QP related changes to
review is minimized.

> @@ -2939,6 +3130,9 @@ int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
>   nesqp->hwqp.qp_id, attr->qp_state, nesqp->ibqp_state,
>   nesqp->iwarp_state, atomic_read(&nesqp->refcount));
>  
> + if (ibqp->qp_type == IB_QPT_RAW_PACKET)
> + return 0;

Isn't a raw QP associated with a specific port of the device?

Or.