RE: [PATCH 0/4] add RAW Packet QP type

2012-01-17 Thread Walukiewicz, Miroslaw
Regarding VLANs, a better solution for the RAW QP is to allow inserting VLANs per 
packet, by adding a new flag to ib_post_send together with a field carrying the 
VLAN ID.

On ingress, it would be useful to see the ingress VLAN in the CQE, by 
introducing a new field plus a flag indicating that a VLAN is present.

As Steve mentioned, HW checksum offload is also necessary in the API for 
sending IP fragments. When an IP packet is not fragmented, it should be sent 
with both the L3 and L4 checksums computed by HW; when fragments are sent, only 
the L3 checksum computation is requested.
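The fragment/non-fragment split described above could be sketched as follows. This is purely illustrative: the flag names, the wr layout, and the helper are hypothetical, not an existing verbs API.

```c
#include <stdint.h>

/* Hypothetical send flags for the proposed RAW QP extension. */
enum hyp_send_flags {
	HYP_SEND_SIGNALED = 1 << 0,
	HYP_SEND_IP_CSUM  = 1 << 1,	/* compute L3 checksum in HW */
	HYP_SEND_L4_CSUM  = 1 << 2,	/* compute L4 checksum (unfragmented only) */
	HYP_SEND_VLAN     = 1 << 3,	/* insert vlan_id into the frame */
};

/* Sketch of a per-WR extension carrying the VLAN ID. */
struct hyp_raw_send_wr {
	uint64_t wr_id;
	uint32_t send_flags;	/* bitmask of hyp_send_flags */
	uint16_t vlan_id;	/* valid only when HYP_SEND_VLAN is set */
};

/* Fragments get an L3-only checksum; whole packets get L3 + L4. */
static inline uint32_t hyp_csum_flags(int is_fragment)
{
	return is_fragment ? HYP_SEND_IP_CSUM
			   : (HYP_SEND_IP_CSUM | HYP_SEND_L4_CSUM);
}
```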

Regards,

Mirek


-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Steve Wise
Sent: Tuesday, January 17, 2012 4:08 PM
To: Or Gerlitz
Cc: Roland Dreier; linux-rdma; Christoph Lameter; Liran Liss
Subject: Re: [PATCH 0/4] add RAW Packet QP type


On 01/17/2012 05:34 AM, Or Gerlitz wrote:
> The new qp type designated usage is from user-space in Ethernet environments,
> e.g by applications that do TCP/IP offloading. Only processes with the NET_RAW
> capability may open such qp. The name raw packet was selected to resemble the
> similarity to AF_PACKET / SOL_RAW sockets. Applications that use this qp type
> should deal with whole packets, including link level headers.
>
> This series allows to create such QPs and send packets over them. To receive
> packets, flow steering support has to be added to the verbs and low-level
> drivers. Flow Steering is the ability to program the HCA to direct a packet
> which matches a given flow specification to a given QP. Flow specs set by
> applications are typically made of L3 (IP) and L4 (TCP/UDP) based tuples,
> where network drivers typically use L2 based tuples. Core and mlx4 patches
> for flow steering are expected in the coming weeks.

Hey Or,

I think this series should add some new send flags for HW that does checksum 
offload:

For example, cxgb4 supports these:

enum { /* TX_PKT_XT checksum types */
	TX_CSUM_TCP    = 0,
	TX_CSUM_UDP    = 1,
	TX_CSUM_CRC16  = 4,
	TX_CSUM_CRC32  = 5,
	TX_CSUM_CRC32C = 6,
	TX_CSUM_FCOE   = 7,
	TX_CSUM_TCPIP  = 8,
	TX_CSUM_UDPIP  = 9,
	TX_CSUM_TCPIP6 = 10,
	TX_CSUM_UDPIP6 = 11,
	TX_CSUM_IP     = 12,
};

I'm sure mlx4 has this sort of functionality too?

Another form of HW assist is with VLAN insertion/extraction.  The API should 
provide a way to specify if a VLAN ID 
should be inserted by HW and removed from a packet on ingress (and passed to 
the app via the CQE).  In fact, we probably 
want a way to associate a VLAN with a RAW QP, maybe as a QP attribute?

Also, on ingress, most hardware can do INET checksum validation, and a way to 
indicate the results to the application is needed.  Perhaps flags in the CQE?  
The cxgb4 device provides many fields on an ingress packet completion that would 
be useful for user-mode applications, including indications of MAC RX errors, 
protocol length vs packet length mismatches, IP version not 4 or 6, and more.  
Does mlx4 have these sorts of indications on ingress packet CQEs?
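One way the ingress-side reporting could look is a set of CQE flag bits plus a VLAN field. All names below are invented for illustration; no driver in this thread defines them.

```c
#include <stdint.h>

/* Hypothetical ingress completion flags. */
enum hyp_wc_flags {
	HYP_WC_IP_CSUM_OK = 1 << 0,	/* HW validated the L3 checksum */
	HYP_WC_L4_CSUM_OK = 1 << 1,	/* HW validated the TCP/UDP checksum */
	HYP_WC_VLAN       = 1 << 2,	/* vlan_id below is valid */
	HYP_WC_MAC_ERR    = 1 << 3,	/* MAC-level RX error */
	HYP_WC_LEN_ERR    = 1 << 4,	/* protocol vs packet length mismatch */
};

/* Sketch of the extra CQE fields a RAW QP completion might carry. */
struct hyp_wc {
	uint32_t flags;		/* bitmask of hyp_wc_flags */
	uint16_t vlan_id;	/* stripped VLAN, if HYP_WC_VLAN is set */
};

/* The packet's checksums are good only if HW validated both layers. */
static inline int hyp_wc_csum_ok(const struct hyp_wc *wc)
{
	return (wc->flags & (HYP_WC_IP_CSUM_OK | HYP_WC_L4_CSUM_OK))
	    == (HYP_WC_IP_CSUM_OK | HYP_WC_L4_CSUM_OK);
}
```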

Food for thought.

Steve.



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: ibv_post_send/recv kernel path optimizations

2011-01-24 Thread Walukiewicz, Miroslaw
Sean,

The assumption here is that the user-space library prepares the vendor-specific 
data in user space using a shared page allocated by the vendor driver. 
Information about posted buffers is passed not through ib_wr but through the 
shared page. That is why the pointers to ib_wr in post_send are not set: they 
are not passed to the kernel at all, to avoid copying them in the kernel.

As there is no ib_wr structure in the kernel, there is no reference to bad_wr or 
to the buffer that failed, so the only reasonable operation state that can be 
reported through bad_wr is binary: operation successful (bad_wr == 0) or not 
(bad_wr != 0).

Instead of special-casing RAW_QP, it is possible to signal the buffer-posting 
method through the QP creation flags:

enum ib_qp_create_flags {
	IB_QP_CREATE_IPOIB_UD_LSO             = 1 << 0,
	IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK = 1 << 1
};

extending them with IB_QP_CREATE_USE_SHARED_PAGE = 1 << 2.

In that case the new method could be used for any type of QP and would be 
backward compatible.
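A minimal sketch of how uverbs could branch on the proposed flag instead of the QP type. The flag value and the helper are hypothetical; only the first two enum values exist in ib_verbs.h.

```c
#include <stdint.h>

/* Existing flags occupy bits 0 and 1; the proposed flag takes bit 2. */
#define IB_QP_CREATE_USE_SHARED_PAGE (1u << 2)

/* uverbs would test the creation flags saved on the QP rather than
 * hard-coding qp->qp_type == IB_QPT_RAW_ETH: */
static inline int uses_shared_page(uint32_t create_flags)
{
	return (create_flags & IB_QP_CREATE_USE_SHARED_PAGE) != 0;
}
```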

Regards,

Mirek
-Original Message-
From: Hefty, Sean 
Sent: Friday, January 21, 2011 4:50 PM
To: Walukiewicz, Miroslaw; Roland Dreier
Cc: Or Gerlitz; Jason Gunthorpe; linux-rdma@vger.kernel.org
Subject: RE: ibv_post_send/recv kernel path optimizations

> + qp = idr_read_qp(cmd.qp_handle, file->ucontext);
> + if (!qp)
> + goto out_raw_qp;
> +
> + if (qp->qp_type == IB_QPT_RAW_ETH) {
> + resp.bad_wr = 0;
> + ret = qp->device->post_send(qp, NULL, NULL);

This looks odd to me and can definitely confuse someone reading the code.  It 
adds assumptions to uverbs about the underlying driver implementation and ties 
that to the QP type.  I don't know if it makes more sense to key off something 
in the cmd or define some other property of the QP, but the NULL parameters 
into post_send are non-intuitive.
 
> + if (ret)
> + resp.bad_wr = cmd.wr_count;

Is this always the case?

- Sean


RE: ibv_post_send/recv kernel path optimizations

2011-01-21 Thread Walukiewicz, Miroslaw
Roland, 

You are right that the idr implementation introduces an insignificant change in 
performance. I made a version with idr and semaphore usage, and I see only a 
minimal change compared to the hash table. Now only a shared page is used 
instead of kmalloc and copy_to_user.

I simplified the changes to uverbs and achieved the performance I wanted. The 
patch now looks like below.

Are these changes acceptable for k.org?

Regards,

Mirek

--- ../SOURCES_19012011/ofa_kernel-1.5.3/drivers/infiniband/core/uverbs_cmd.c   
2011-01-19 05:37:55.0 +0100
+++ ofa_kernel-1.5.3_idr_qp/drivers/infiniband/core/uverbs_cmd.c
2011-01-21 04:10:07.0 +0100
@@ -1449,15 +1449,29 @@
 
if (cmd.wqe_size < sizeof (struct ib_uverbs_send_wr))
return -EINVAL;
+   qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+   if (!qp)
+   goto out_raw_qp;
+
+   if (qp->qp_type == IB_QPT_RAW_ETH) {
+   resp.bad_wr = 0;
+   ret = qp->device->post_send(qp, NULL, NULL);
+   if (ret)
+   resp.bad_wr = cmd.wr_count;
+
+   if (copy_to_user((void __user *) (unsigned long)
+   cmd.response,
+   &resp,
+   sizeof resp))
+   ret = -EFAULT;
+   put_qp_read(qp);
+   goto out_raw_qp;
+   }
 
user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);
if (!user_wr)
return -ENOMEM;
 
-   qp = idr_read_qp(cmd.qp_handle, file->ucontext);
-   if (!qp)
-   goto out;
-
is_ud = qp->qp_type == IB_QPT_UD;
sg_ind = 0;
last = NULL;
@@ -1577,9 +1591,8 @@
wr = next;
}
 
-out:
kfree(user_wr);
-
+out_raw_qp:
return ret ? ret : in_len;
 }
 
@@ -1681,16 +1694,31 @@
if (copy_from_user(&cmd, buf, sizeof cmd))
return -EFAULT;
 
+   qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+   if (!qp)
+   goto out_raw_qp;
+
+   if (qp->qp_type == IB_QPT_RAW_ETH) {
+   resp.bad_wr = 0;
+   ret = qp->device->post_recv(qp, NULL, NULL);
+   if (ret)
+   resp.bad_wr = cmd.wr_count;
+
+   if (copy_to_user((void __user *) (unsigned long)
+   cmd.response,
+   &resp,
+   sizeof resp))
+   ret = -EFAULT;
+   put_qp_read(qp);
+   goto out_raw_qp;
+   }
+
wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd,
   in_len - sizeof cmd, cmd.wr_count,
   cmd.sge_count, cmd.wqe_size);
if (IS_ERR(wr))
return PTR_ERR(wr);
 
-   qp = idr_read_qp(cmd.qp_handle, file->ucontext);
-   if (!qp)
-   goto out;
-
resp.bad_wr = 0;
ret = qp->device->post_recv(qp, wr, &bad_wr);
 
@@ -1707,13 +1735,13 @@
 &resp, sizeof resp))
ret = -EFAULT;
 
-out:
while (wr) {
next = wr->next;
kfree(wr);
wr = next;
}
 
+out_raw_qp:
return ret ? ret : in_len;
 }




-Original Message-
From: Roland Dreier [mailto:rdre...@cisco.com] 
Sent: Monday, January 10, 2011 9:38 PM
To: Walukiewicz, Miroslaw
Cc: Or Gerlitz; Jason Gunthorpe; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

 > You are right that the most of the speed-up is coming from avoid semaphores, 
 > but not only.
 > 
 > From the oprof traces, the semaphores made half of difference.
 > 
 > The next one was copy_from_user and kmalloc/kfree usage (in my proposal - 
 > shared page method is used instead)

OK, but in any case the switch from idr to hash table seems to be
insignificant.  I agree that using a shared page is a good idea, but
removing locking needed for correctness is not a good optimization.

 > In my opinion, the responsibility for cases like protection of QP
 > against destroy during buffer post (and other similar cases) should
 > be moved to vendor driver. The OFED code should move only the code
 > path to driver.

Not sure what OFED code you're talking about.  We're discussing the
kernel uverbs code, right?

In any case I'd be interested in seeing how it looks if you move the
protection into the individual drivers.  I'd be worried about having to
duplicate the same code everywhere (which leads to bugs in individual
drivers) -- I guess this could be resolved by having the code be a
library that individual drivers call into.  But also I'm not sure if I
see how you could make such a scheme work -- you need to make sure that
the data str

RE: Problems with ibv_post_send and completion queues

2011-01-19 Thread Walukiewicz, Miroslaw
Hello Manoj,

Responding to your questions:
1. 510 is the HW maximum QP length for the NE020 card, so it cannot be increased.

2. The NE020 driver keeps posted buffers on a FIFO-like queue, which is 510 
entries long in your application. Head and tail pointers are maintained so that 
the QP cannot overflow: the head pointer is updated during post_send, and the 
tail pointer is updated during poll_cq on the CQ assigned to your QP. When you 
call post_send without checking the CQ, the tail pointer is never updated, so 
after 510 post_send calls the QP looks full to the driver and, in effect, you 
cannot send more data on that QP.
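The FIFO accounting described above can be illustrated with a small sketch. The names and the fixed depth are only for this example; the driver's real bookkeeping is internal to libnes/iw_nes.

```c
#include <stdint.h>

#define SQ_DEPTH 510	/* HW limit for the NE020 in this example */

struct sq_ring {
	uint32_t head;	/* advanced by post_send */
	uint32_t tail;	/* advanced by poll_cq */
};

/* Full once head has advanced SQ_DEPTH slots past tail (wrap-safe
 * because the subtraction is unsigned). */
static inline int sq_full(const struct sq_ring *sq)
{
	return sq->head - sq->tail >= SQ_DEPTH;
}

/* post_send: fails once the ring is full and no completions were polled. */
static inline int sq_post(struct sq_ring *sq)
{
	if (sq_full(sq))
		return -1;	/* surfaces to the app as EINVAL (22) */
	sq->head++;
	return 0;
}

/* poll_cq: reaping one completion frees one slot. */
static inline void sq_complete(struct sq_ring *sq)
{
	sq->tail++;
}
```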

Regards,

Mirek

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Manoj Nambiar
Sent: Wednesday, January 19, 2011 12:21 PM
To: linux-rdma@vger.kernel.org
Subject: Problems with ibv_post_send and completion queues

Hi,

I am writing some rdma based programs using the Intel NetEffect iWARP NICs.

I am running into the following problems with my code -

1.
I can only set maximum work requests to 510 using rdma_create_qp; otherwise 
it gives me an error: "libnes: nes_ucreate_qp Bad sq attr parameters 
max_send_wr=511 max_send_sge=1".
Is there a way to increase this?

2.
Is there a way to do RDMA writes without using a completion queue? I use an 
alternative channel to determine whether my work requests were executed 
correctly. When I tried this I could post 510 (maybe related to the previous 
question) work requests successfully. After that, ibv_post_send returns 22, and 
repeatedly retrying ibv_post_send doesn't clear the problem; it returns the 
same error code, which decodes to "invalid argument". I am unable to make sense 
of this. Is there a way to clean up the work requests in the system?

I am creating the queue pair with sq_sig_all = 0 in struct ibv_qp_init_attr 
and am not setting IBV_SEND_SIGNALED in the send_flags member of 
struct ibv_send_wr, which means I do not get any completion events. When 
calling rdma_create_qp I initialize the send_cq and recv_cq of struct 
ibv_qp_init_attr with a completion queue created using ibv_create_cq (with the 
cqe parameter equal to cap.max_send_wr in struct ibv_qp_init_attr). (I think 
that creating a completion queue is a must for creating an rdma queue pair - 
please correct me if I am wrong.)

Please note - I do not get this problem when I poll completion queues (when 
the rdma queue pair is created with sq_sig_all = 0 and IBV_SEND_SIGNALED is set 
in the flags of the work requests passed to ibv_post_send).

Thanks,
Manoj Nambiar 



RE: ibv_post_send/recv kernel path optimizations

2011-01-10 Thread Walukiewicz, Miroslaw
Roland,

You are right that most of the speed-up comes from avoiding semaphores, but not 
only that.

From the oprof traces, the semaphores made up half of the difference.

The next contributor was copy_from_user and kmalloc/kfree usage (in my proposal 
the shared-page method is used instead).

In my opinion, the responsibility for cases like protecting a QP against 
destroy during buffer post (and other similar cases) should be moved to the 
vendor driver. The OFED code should only route the code path to the driver.

Mirek

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Roland Dreier
Sent: Wednesday, January 05, 2011 7:17 PM
To: Walukiewicz, Miroslaw
Cc: Or Gerlitz; Jason Gunthorpe; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

 > The patch for ofed-1.5.3 looks like below. I will try to push it to 
 > kernel.org after porting.
 > 
 > Now an uverbs  post_send/post_recv path is modified to make pre-lookup
 > for RAW_ETH QPs. When a RAW_ETH QP is found the driver specific path
 > is used for posting buffers. for example using a shared page approach in
 > cooperation with user-space library

I don't quite see why a hash table helps performance much vs. an IDR.
Is the actual IDR lookup a significant part of the cost?  (By the way,
instead of list_head you could use hlist_head to make your hash table
denser and save cache footprint -- that way an 8-entry table on 64-bit
systems fits in one cacheline)
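A user-space stand-in for the small learning hash table Roland describes might look like this. The kernel would use hlist_head/hlist_node from <linux/list.h>; the types and names here are illustrative only.

```c
#include <stdint.h>
#include <stddef.h>

#define QP_HASH_BUCKETS 8	/* 8 x 8-byte heads = one 64-byte cacheline */

struct qp_node {
	uint32_t qpn;
	struct qp_node *next;	/* single pointer, like hlist_node */
};

/* 8-entry table of single-pointer heads, dense like hlist_head[]. */
static struct qp_node *qp_hash[QP_HASH_BUCKETS];

static inline unsigned int qp_bucket(uint32_t qpn)
{
	return qpn & (QP_HASH_BUCKETS - 1);
}

/* Insert at the head of the bucket's chain. */
static void qp_hash_add(struct qp_node *n)
{
	unsigned int b = qp_bucket(n->qpn);

	n->next = qp_hash[b];
	qp_hash[b] = n;
}

/* Walk one short chain instead of the full idr lookup. */
static struct qp_node *qp_hash_find(uint32_t qpn)
{
	struct qp_node *n;

	for (n = qp_hash[qp_bucket(qpn)]; n; n = n->next)
		if (n->qpn == qpn)
			return n;
	return NULL;
}
```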

Also it seems that you get rid of all the locking on QPs when you look
them up in your hash table.  What protects against userspace posting a
send in one thread and destroying the QP in another thread, and ending
up having the destroy complete before the send is posted (leading to
use-after-free in the kernel)?

I would guess that your speedup is really coming from getting rid of
locking that is actually required for correctness.  Maybe I'm wrong
though, I'm just guessing here.

 - R.


RE: ibv_post_send/recv kernel path optimizations

2010-12-27 Thread Walukiewicz, Miroslaw
> Just to clarify, when saying "achieved performance comparable to 
> previous solution" you refer to the approach which bypasses uverbs on 
> the post send path? Also, why enhance only the raw qp flow?

I compare to my previous solution, which used a private device for passing 
information about packets. Compared to the current approach I see more than a 
20% improvement.

This solution introduces a new path for posting buffers using a shared-page 
approach. It works the following way:
1. Create a RAW QP and add it to the raw QP hash list.
2. The user-space library mmaps the shared page (this is a device-specific 
action and must be implemented separately in each driver).
3. During buffer posting, the library puts the buffer info into the shared page 
and calls uverbs.
4. uverbs detects the raw QP and informs the driver, bypassing the current path.
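The shared page in steps 2-3 could be laid out roughly as below. Every structure and name here is invented for illustration; the real format is vendor specific.

```c
#include <stdint.h>

#define SLOTS_PER_PAGE 64

/* One posted-buffer descriptor written by the user-space library. */
struct shared_wqe {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* The page mmap()ed between the library and the driver (step 2). */
struct shared_page {
	uint32_t producer;	/* written by the user-space library */
	uint32_t consumer;	/* written by the driver */
	struct shared_wqe wqe[SLOTS_PER_PAGE];
};

/* Library side, step 3: put the buffer info into the shared page.
 * The subsequent write() to uverbs carries no WR data at all. */
static int sp_post(struct shared_page *sp, uint64_t addr,
		   uint32_t len, uint32_t lkey)
{
	struct shared_wqe *w;

	if (sp->producer - sp->consumer >= SLOTS_PER_PAGE)
		return -1;	/* ring full */
	w = &sp->wqe[sp->producer % SLOTS_PER_PAGE];
	w->addr = addr;
	w->length = len;
	w->lkey = lkey;
	sp->producer++;
	return 0;
}
```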

The solution cannot be shared between RDMA drivers because it requires a 
redesign of each driver (the shared-page format is vendor specific). Currently 
only the NES driver implements the RAW QP path through the kernel (other 
vendors use a pure user-space solution), so no other vendor will use this path.

There is a possibility to add a new QP capability or attribute that informs 
uverbs that the new transmit path is in use; then the solution could be 
extended to all drivers.

Mirek



-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Monday, December 27, 2010 4:22 PM
To: Walukiewicz, Miroslaw
Cc: Jason Gunthorpe; Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 12/27/2010 5:18 PM, Walukiewicz, Miroslaw wrote:
> I implemented the very small hash table and I achieved performance 
> comparable to previous solution.

Just to clarify, when saying "achieved performance comparable to 
previous solution" you refer to the approach which bypasses uverbs on 
the post send path? Also, why enhance only the raw qp flow?

Or.



RE: ibv_post_send/recv kernel path optimizations

2010-12-27 Thread Walukiewicz, Miroslaw
  cmd.response,
+   &resp,
+   sizeof resp))
+   ret = -EFAULT;
+   goto out_raw_qp;
+   }
+   }
+
user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);
if (!user_wr)
return -ENOMEM;
@@ -1579,7 +1649,7 @@ out_put:
 
 out:
kfree(user_wr);
-
+out_raw_qp:
return ret ? ret : in_len;
 }
 
@@ -1664,7 +1734,6 @@ err:
kfree(wr);
wr = next;
}
-
return ERR_PTR(ret);
 }
 
@@ -1681,6 +1750,25 @@ ssize_t ib_uverbs_post_recv(struct ib_uverbs_file *file,
if (copy_from_user(&cmd, buf, sizeof cmd))
return -EFAULT;
 
+   mutex_lock(&file->mutex);
+   qp = raw_qp_lookup(cmd.qp_handle, file->ucontext);
+   mutex_unlock(&file->mutex);
+   if (qp) {
+   if (qp->qp_type == IB_QPT_RAW_ETH) {
+   resp.bad_wr = 0;
+   ret = qp->device->post_recv(qp, NULL, &bad_wr);
+   if (ret)
+   resp.bad_wr = cmd.wr_count;
+
+   if (copy_to_user((void __user *) (unsigned long)
+   cmd.response,
+   &resp,
+   sizeof resp))
+   ret = -EFAULT;
+   goto out_raw_qp;
+   }
+   }
+
wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd,
   in_len - sizeof cmd, cmd.wr_count,
   cmd.sge_count, cmd.wqe_size);
@@ -1713,7 +1801,7 @@ out:
kfree(wr);
wr = next;
}
-
+out_raw_qp:
return ret ? ret : in_len;
 }
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 
f5b054a..adf1dd8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -838,6 +838,8 @@ struct ib_fmr_attr {
u8  page_shift;
 };
 
+#define MAX_RAW_QP_HASH 8
+
 struct ib_ucontext {
	struct ib_device       *device;
	struct list_head	pd_list;
@@ -848,6 +850,7 @@ struct ib_ucontext {
	struct list_head	srq_list;
	struct list_head	ah_list;
	struct list_head	xrc_domain_list;
+	struct list_head	raw_qp_hash[MAX_RAW_QP_HASH];
	int			closing;
 };
 
@@ -859,6 +862,7 @@ struct ib_uobject {
	int			id;		/* index into kernel idr */
	struct kref		ref;
	struct rw_semaphore	mutex;		/* protects .live */
+	struct list_head	raw_qp_list;	/* raw qp hash */
	int			live;
 };

Mirek

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Or Gerlitz
Sent: Monday, December 27, 2010 1:39 PM
To: Jason Gunthorpe; Walukiewicz, Miroslaw
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

Jason Gunthorpe wrote:
> Walukiewicz, Miroslaw wrote:

>> called for many QPs, there is a single entry point to
>> ib_uverbs_post_send using write to /dev/infiniband/uverbsX. In that
>> case there is a lookup to QP store (idr_read_qp) necessary to find a
>> correct ibv_qp Structure, what is a big time consumer on the path.

> I don't think this should be such a big problem. The simplest solution
> would be to front the idr_read_qp with a small learning hashing table.

yes, there must be a few ways (e.g as Jason suggested) to do this house-keeping
much more efficient, in a manner that fits fast path - which maybe wasn't the 
mindset 
when this code was written as its primary use was to invoke control plane 
commands.

>> The NES IMA kernel path also has such QP lookup but the QP number
>> format is designed to make such lookup very quickly.  The QP numbers in
>> OFED are not defined so generic lookup functions like idr_read_qp() must be 
>> use.

> Maybe look at moving the QPN to ibv_qp translation into the driver
> then - or better yet, move allocation out of the driver, if Mellanox
> could change their FW.. You are right that we could do this much
> faster if the QPN was structured in some way

I think there should be some validation at the uverbs level, since the caller 
is an untrusted user-space application - e.g. in a similar way to what is done 
for each system call on a file descriptor.

Or.


RE: ibv_post_send/recv kernel path optimizations

2010-12-14 Thread Walukiewicz, Miroslaw
Or,

I looked into the shared-page approach for passing post_send/post_recv info. I 
still have some concerns.

The shared page must be allocated per QP, and there should be a common way to 
allocate such a page in each driver.

As Jason and Roland said, the best way to pass this parameter through mmap is 
the offset. However, there is no common definition of how the offset is used; 
it is a driver-specific parameter.

The next problem is how many shared pages the driver should allocate to share 
with user space. They must have room for every buffer posted by the 
application, which is a big concern for post_recv, where large numbers of 
buffers are posted. The current implementation has no such limit.

Even if a common offset definition were agreed and accepted, the shared page 
must be stored in the ib_qp structure. When post_send is called for many QPs, 
there is a single entry point into ib_uverbs_post_send via a write to 
/dev/infiniband/uverbsX. In that case a lookup in the QP store (idr_read_qp) is 
needed to find the correct ibv_qp structure, which is a big time consumer on 
the path.

The NES IMA kernel path also has such a QP lookup, but the QP number format is 
designed to make that lookup very quick. QP numbers in OFED have no defined 
structure, so generic lookup functions like idr_read_qp() must be used.

Regards,

Mirek


-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Wednesday, December 01, 2010 9:12 AM
To: Walukiewicz, Miroslaw; Jason Gunthorpe; Roland Dreier
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 11/26/2010 1:56 PM, Walukiewicz, Miroslaw wrote:
> Form the trace it looks like the __up_read() - 11% wastes most of time. It is 
> called from idr_read_qp when a  put_uobj_read is called. if 
> (copy_from_user(&cmd, buf, sizeof cmd))  - 5% it is called twice from 
> ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per each frame... 
> and __kmalloc/kfree - 5% is the third function that has a big meaning. It is 
> called twice for each frame transmitted. It is about 20% of performance loss 
> comparing to nes_ud_sksq path which we miss when we use a OFED path.
>
> What I can modify is a kmalloc/kfree optimization - it is possible to make 
> allocation only at start and use pre-allocated buffers. I don't see any way 
> for optimalization of idr_read_qp usage or copy_user. In current approach we 
> use a shared page and a separate nes_ud_sksq handle for each created QP so 
> there is no need for any user space data copy or QP lookup.
As was mentioned earlier on this thread, and repeated here, the 
kmalloc/kfree can be removed; as for the 2nd copy_from_user, I don't see 
why the ib uverbs flow (BTW - the data path has nothing to do with the 
rdma_cm; you're working with /dev/infiniband/uverbsX) can't be enhanced, 
e.g. to support a shared page which is allocated and mmapped from uverbs to 
user space and used in the same manner your implementation does. The 1st 
copy_from_user should add pretty much nothing, and if it does, it can be 
replaced with a different user/kernel IPC mechanism that costs less. So 
we're basically left with the idr_read_qp; I wonder what other 
people think about if/how this can be optimized?

Or.


RE: ibv_post_send/recv kernel path optimizations

2010-12-08 Thread Walukiewicz, Miroslaw
Or, 

>I don't see why the ib uverbs flow (BTW - the data path has nothing to do with 
>the 
>rdma_cm, you're working with /dev/infiniband/uverbsX), can't be enhanced 
>e.g to support shared-page which is allocated && mmaped from uverbs to 
>user space and used in the same manner your implementation does.

The problem I see is that mmap is currently used for mapping the doorbell page 
in various drivers.

We can use it for mapping a page for transmit/receive operations only if we can 
differentiate whether the doorbell page or our shared page is to be mapped.

The second problem is that this rx/tx mmap should map a separate page per QP, 
to avoid unnecessary QP lookups, so the page identifier passed to mmap should 
be based on a QP identifier.
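One illustrative way to solve both problems at once is to encode a "kind" (doorbell vs. shared page) and the QP number into the page-aligned mmap offset. This encoding is purely a sketch; no driver in the thread defines it.

```c
#include <stdint.h>

#define MAP_KIND_DOORBELL    0
#define MAP_KIND_SHARED_PAGE 1
#define MAP_KIND_BITS        4
#define PAGE_SHIFT           12	/* 4 KiB pages assumed */

/* Build the offset userspace passes to mmap(): kind in the low bits of
 * the page index, QP number above it. */
static inline uint64_t map_offset(unsigned int kind, uint32_t qpn)
{
	return (((uint64_t)qpn << MAP_KIND_BITS) | kind) << PAGE_SHIFT;
}

/* Driver side: recover what is being mapped... */
static inline unsigned int map_kind(uint64_t off)
{
	return (off >> PAGE_SHIFT) & ((1u << MAP_KIND_BITS) - 1);
}

/* ...and for which QP, avoiding any lookup on the data path. */
static inline uint32_t map_qpn(uint64_t off)
{
	return (uint32_t)(off >> (PAGE_SHIFT + MAP_KIND_BITS));
}
```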

I cannot find the specific code for /dev/infiniband/uverbsX. Does this device 
driver share the same functions as /dev/infiniband/rdma_cm, or does it have its 
own implementation?

Mirek

-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Wednesday, December 01, 2010 9:12 AM
To: Walukiewicz, Miroslaw; Jason Gunthorpe; Roland Dreier
Cc: Roland Dreier; Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations

On 11/26/2010 1:56 PM, Walukiewicz, Miroslaw wrote:
> Form the trace it looks like the __up_read() - 11% wastes most of time. It is 
> called from idr_read_qp when a  put_uobj_read is called. if 
> (copy_from_user(&cmd, buf, sizeof cmd))  - 5% it is called twice from 
> ib_uverbs_post_send() for IMA and once in ib_uverbs_write() per each frame... 
> and __kmalloc/kfree - 5% is the third function that has a big meaning. It is 
> called twice for each frame transmitted. It is about 20% of performance loss 
> comparing to nes_ud_sksq path which we miss when we use a OFED path.
>
> What I can modify is a kmalloc/kfree optimization - it is possible to make 
> allocation only at start and use pre-allocated buffers. I don't see any way 
> for optimalization of idr_read_qp usage or copy_user. In current approach we 
> use a shared page and a separate nes_ud_sksq handle for each created QP so 
> there is no need for any user space data copy or QP lookup.
As was mentioned earlier on this thread, and repeated here, the 
kmalloc/kfree can be removed; as for the 2nd copy_from_user, I don't see 
why the ib uverbs flow (BTW - the data path has nothing to do with the 
rdma_cm; you're working with /dev/infiniband/uverbsX) can't be enhanced, 
e.g. to support a shared page which is allocated and mmapped from uverbs to 
user space and used in the same manner your implementation does. The 1st 
copy_from_user should add pretty much nothing, and if it does, it can be 
replaced with a different user/kernel IPC mechanism that costs less. So 
we're basically left with the idr_read_qp; I wonder what other 
people think about if/how this can be optimized?

Or.


RE: ibv_post_send/recv kernel path optimizations (was: uverbs: handle large number of entries)

2010-11-26 Thread Walukiewicz, Miroslaw
Some time ago we discussed the possibility of removing the use of nes_ud_sksq 
in the IMA driver, since it blocks pushing the IMA solution to kernel.org.

The proposal was to use the OFED transmit-optimized path via 
/dev/infiniband/rdma_cm instead of the private nes_ud_sksq device.

I made an implementation of this solution to check the performance impact and 
to look for ways to optimize the existing code.

I ran a simple send test (sendto in kernel) on my NEHALEM i7 machine. The 
current nes_ud_sksq implementation achieved about 1.25 M pkts/sec; the OFED 
path (with the rdma_cm call) achieved about 0.9 M pkts/sec.


I ran oprofile on the rdma_cm code and got the following results:

samples  %        linenr info                app name             symbol name
2586067  24.5323  nes_uverbs.c:558           libnes-rdmav2.so     nes_upoll_cq
1198042  11.3650  (no location information)  vmlinux              __up_read
539258    5.1156  (no location information)  vmlinux              copy_user_generic_string
407884    3.8693  msa_verbs.c:1692           libmsa.so.1.0.0      msa_post_send
304569    2.8892  msa_verbs.c:2098           libmsa.so.1.0.0      usq_sendmsg_noblock
299954    2.8455  (no location information)  vmlinux              __kmalloc
297463    2.8218  (no location information)  libibverbs.so.1.0.0  /usr/lib64/libibverbs.so.1.0.0
267951    2.5419  uverbs_cmd.c:1433          ib_uverbs.ko         ib_uverbs_post_send
264709    2.5111  (no location information)  vmlinux              kfree
205107    1.9457  port.c:2947                libmsa.so.1.0.0      sendto
146225    1.3871  (no location information)  vmlinux              __down_read
145941    1.3844  (no location information)  libpthread-2.5.so    __write_nocancel
139934    1.3275  nes_ud.c:1746              iw_nes.ko            nes_ud_post_send_new_path
131879    1.2510  send.c:32                  msa_tst              blocking_test_send(void*)
127519    1.2097  (no location information)  vmlinux              system_call
123552    1.1721  port.c:858                 libmsa.so.1.0.0      find_mcast
109249    1.0364  nes_verbs.c:3478           iw_nes.ko            nes_post_send
 92060    0.8733  (no location information)  vmlinux              vfs_write
 90187    0.8555  uverbs_cmd.c:144           ib_uverbs.ko         __idr_get_uobj
 89563    0.8496  nes_uverbs.c:1460          libnes-rdmav2.so     nes_upost_send

From the trace it looks like __up_read() - 11% - wastes most of the time. It is 
called from idr_read_qp when put_uobj_read is called.

copy_from_user(&cmd, buf, sizeof cmd) - 5% - is called twice from 
ib_uverbs_post_send() for IMA and once in ib_uverbs_write(), per frame.

__kmalloc/kfree - 5% - is the third significant item. It is called twice for 
each frame transmitted.

That is roughly the 20% of performance we lose relative to the nes_ud_sksq path 
when we use the OFED path.

What I can modify is the kmalloc/kfree usage - it is possible to allocate only 
at start and use pre-allocated buffers.
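The kmalloc/kfree optimization mentioned above can be sketched in user-space terms: keep one cached buffer per file/context and grow it only when a larger request arrives, so the per-frame allocate/free pair disappears. Names are hypothetical.

```c
#include <stdlib.h>

/* Per-context work-request buffer cache (illustrative). */
struct wr_cache {
	void  *buf;
	size_t cap;
};

/* Return a buffer of at least `need` bytes, reusing the cached one.
 * There is no per-call free(); the buffer is released once at teardown. */
static void *wr_cache_get(struct wr_cache *c, size_t need)
{
	if (need > c->cap) {
		void *p = realloc(c->buf, need);	/* grow only */

		if (!p)
			return NULL;
		c->buf = p;
		c->cap = need;
	}
	return c->buf;
}
```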

I don't see any way to optimize the idr_read_qp usage or copy_user. In the 
current approach we use a shared page and a separate nes_ud_sksq handle for 
each created QP, so there is no need for any user-space data copy or QP lookup.

Do you have any idea how we can optimize this path?

Regards,

Mirek

-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Thursday, November 25, 2010 4:01 PM
To: Walukiewicz, Miroslaw
Cc: Jason Gunthorpe; Roland Dreier; Roland Dreier; Hefty, Sean; 
linux-rdma@vger.kernel.org
Subject: Re: ibv_post_send/recv kernel path optimizations (was: uverbs: handle 
large number of entries)

Jason Gunthorpe wrote:
> Hmm, considering your list is everything but Mellanox, maybe it makes much 
> more sense to push the copy_to_user down into the driver - ie a 
> ibv_poll_cq_user - then the driver can construct each CQ entry on the stack 
> and copy it to userspace, avoid the double copy, allocation and avoid any 
> fixed overhead of ibv_poll_cq.
>
> A bigger change to be sure, but remember this old thread:
> http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg05114.html
> 2x improvement by removing allocs on the post path..
Hi Mirek,

Any updates on your findings with the patches?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: {RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-10 Thread Walukiewicz, Miroslaw
Hello Jason, 

Do you have any benchmarks that show the alloca is a measurable
overhead?  

We changed the overall path (both kernel and user space) to an allocation-less 
approach, and we achieved twice better latency using the call into the kernel 
driver. I have no data on which part is dominant - kernel or user space. I 
should have some measurements next week, so I will share my results.

Roland is right, all you
really need is a per-context (+per-cpu?) buffer you can grab, fill,
and put back.

I agree. I will go into this direction.

Regards,

Mirek

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Friday, August 06, 2010 6:33 PM
To: Walukiewicz, Miroslaw
Cc: Roland Dreier; linux-rdma@vger.kernel.org
Subject: Re: {RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

On Fri, Aug 06, 2010 at 11:03:36AM +0100, Walukiewicz, Miroslaw wrote:

> Currently the transmit/receive path works following way: User calls
> ibv_post_send() where vendor specific function is called.  When the
> path should go through kernel the ibv_cmd_post_send() is called.
> The function creates the POST_SEND message body that is passed to
> kernel.  As the number of sges is unknown the dynamic allocation for
> message body is performed.  (see libibverbs/src/cmd.c)

Do you have any benchmarks that show the alloca is a measurable
overhead?  I'm pretty skeptical... alloca will generally boil down to
one or two assembly instructions adjusting the stack pointer, and not
even that if you are lucky and it can be merged into the function
prologe.

> In the kernel the message body is parsed and a structure of wr and
> sges is recreated using dynamic allocations in kernel The goal of
> this operation is having a similar structure like in user space.

.. the kmalloc call(s) on the other hand definately seems worth
looking at ..

> In kernel in ib_uverbs_post_send() instead of dynamic allocation of
> the ib_send_wr structures the table of 512 ib_send_wr structures
> will be defined and all entries will be linked to unidirectional
> list so qp->device->post_send(qp, wr, &bad_wr) API will be not
> changed.

Isn't there a kernel API already for managing a pool of
pre-allocated fixed-size allocations?

It isn't clear to me that is even necessary, Roland is right, all you
really need is a per-context (+per-cpu?) buffer you can grab, fill,
and put back.

> As I know no driver uses that kernel path to posting buffers so
> iWARP multicast acceleration implemented in NES driver Would be a
> first application that can utilize the optimized path.

??

Jason


RE: {RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-10 Thread Walukiewicz, Miroslaw
I agree with you that changing the kernel ABI is not necessary. 
I will follow your direction regarding a single allocation at start. 

Regards,

Mirek 

-Original Message-
From: Roland Dreier [mailto:rdre...@cisco.com] 
Sent: Friday, August 06, 2010 5:58 PM
To: Walukiewicz, Miroslaw
Cc: linux-rdma@vger.kernel.org
Subject: Re: {RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

 > The proposed path optimization is removing of dynamic allocations 
 > by redefining a structure definition passed to kernel. 

 > To 
 > 
 > struct ibv_post_send {
 > __u32 command;
 > __u16 in_words;
 > __u16 out_words;
 > __u64 response;
 > __u32 qp_handle;
 > __u32 wr_count;
 > __u32 sge_count;
 > __u32 wqe_size;
 > struct ibv_kern_send_wr send_wr[512];
 > };

I don't see how this can possibly work.  Where does the scatter/gather
list go if you make this have a fixed size array of send_wr?

Also I don't see why you need to change the user/kernel ABI at all to
get rid of dynamic allocations... can't you just have the kernel keep a
cached send_wr allocation (say, per user context) and reuse that?  (ie
allocate memory but don't free the first time into post_send, and only
reallocate if a bigger send request comes, and only free when destroying
the context)
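A minimal userspace model of this cached-allocation scheme (hypothetical names; realloc() stands in for the kernel allocation) might look like:

```c
#include <stdlib.h>

/* Sketch of a per-context cached work-request buffer: allocate on the
 * first post, grow only when a bigger request arrives, free only when
 * the context is destroyed.  Names are illustrative, not uverbs API. */
struct wr_cache {
    void   *buf;
    size_t  size;
};

static void *wr_cache_get(struct wr_cache *c, size_t needed)
{
    if (needed > c->size) {
        void *p = realloc(c->buf, needed);   /* grow-only reallocation */
        if (!p)
            return NULL;
        c->buf = p;
        c->size = needed;
    }
    return c->buf;   /* reused for every later post of size <= c->size */
}

static void wr_cache_destroy(struct wr_cache *c)
{
    free(c->buf);
    c->buf = NULL;
    c->size = 0;
}
```

With this shape the common case (same-sized posts in a loop) performs zero allocations after the first call.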

 - R.
-- 
Roland Dreier  || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


{RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-06 Thread Walukiewicz, Miroslaw
Currently the ibv_post_send()/ibv_post_recv() path through kernel 
(using /dev/infiniband/rdmacm) could be optimized by removing dynamic memory 
allocations on the path. 

Currently the transmit/receive path works the following way:
The user calls ibv_post_send(), where a vendor-specific function is called. 
When the path should go through the kernel, ibv_cmd_post_send() is called.
The function creates the POST_SEND message body that is passed to the kernel. 
As the number of SGEs is unknown, a dynamic allocation for the message body is 
performed 
(see libibverbs/src/cmd.c).

In the kernel the message body is parsed, and the structure of WRs and SGEs is 
recreated using dynamic allocations in the kernel. 
The goal of this operation is to end up with a structure similar to the one in 
user space. 

The proposed path optimization is to remove the dynamic allocations 
by redefining the structure passed to the kernel. 
From 

struct ibv_post_send {
__u32 command;
__u16 in_words;
__u16 out_words;
__u64 response;
__u32 qp_handle;
__u32 wr_count;
__u32 sge_count;
__u32 wqe_size;
struct ibv_kern_send_wr send_wr[0];
};
To 

struct ibv_post_send {
__u32 command;
__u16 in_words;
__u16 out_words;
__u64 response;
__u32 qp_handle;
__u32 wr_count;
__u32 sge_count;
__u32 wqe_size;
struct ibv_kern_send_wr send_wr[512];
};

A similar change is required for the kernel struct ib_uverbs_post_send defined 
in /ofa_kernel/include/rdma/ib_uverbs.h.

This change limits the number of send_wr entries passed from unlimited (assured 
by dynamic allocation) to a reasonable 512. 
I think this number should be the maximum number of QP entries available to send. 
As all IB/iWARP applications are low-latency applications, the number of WRs 
passed is never unbounded in practice.

As a result, instead of a dynamic allocation, ibv_cmd_post_send() fills the 
proposed structure directly and passes it to the kernel. Whenever the number of 
send_wr entries exceeds the limit, the ENOMEM error is returned.
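A simplified model of that library-side fill (field layout abridged and names hypothetical, not the actual uverbs ABI) could be:

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MAX_SEND_WR 512

/* Abridged stand-in for the fixed-size command proposed in the RFC:
 * the library copies the caller's work requests straight into the
 * command buffer and rejects lists longer than the fixed limit. */
struct kern_send_wr { uint64_t wr_id; uint32_t num_sge; uint32_t opcode; };

struct post_send_cmd {
    uint32_t qp_handle;
    uint32_t wr_count;
    struct kern_send_wr send_wr[MAX_SEND_WR];
};

static int fill_post_send(struct post_send_cmd *cmd,
                          const struct kern_send_wr *wr, uint32_t n)
{
    if (n > MAX_SEND_WR)
        return -ENOMEM;   /* limit exceeded, as the RFC proposes */
    cmd->wr_count = n;
    memcpy(cmd->send_wr, wr, n * sizeof(*wr));
    return 0;
}
```

Note the trade-off this illustrates: the per-post allocation disappears, but the command grows to a fixed ~8 KB regardless of how few WRs are actually posted.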

In the kernel, in ib_uverbs_post_send(), instead of dynamically allocating the 
ib_send_wr structures, 
a table of 512 ib_send_wr structures will be defined, and 
all entries will be linked into a singly linked list, so the 
qp->device->post_send(qp, wr, &bad_wr) API will not change. 
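The kernel-side chaining of a static table into a list could be sketched like this (plain C, small table, hypothetical names - the real proposal uses 512 entries):

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_WR 8   /* 512 in the RFC; kept small here for illustration */

/* Sketch: a pre-defined table of work requests is chained into a
 * singly linked list, so the unchanged
 * qp->device->post_send(qp, wr, &bad_wr) interface can walk it. */
struct send_wr {
    uint64_t        wr_id;
    struct send_wr *next;
};

static struct send_wr wr_table[MAX_WR];

static struct send_wr *link_wr_table(unsigned int count)
{
    unsigned int i;

    if (count == 0 || count > MAX_WR)
        return NULL;                       /* caller exceeded the table */
    for (i = 0; i + 1 < count; i++)
        wr_table[i].next = &wr_table[i + 1];
    wr_table[count - 1].next = NULL;       /* list terminator */
    return &wr_table[0];
}
```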

As far as I know, no driver uses that kernel path for posting buffers, so the 
iWARP multicast acceleration implemented in the nes driver 
would be the first application to utilize the optimized path. 

Regards,

Mirek

Signed-off-by: Mirek Walukiewicz 




RE: InfiniBand/RDMA merge plans for 2.6.36

2010-08-05 Thread Walukiewicz, Miroslaw
Hello Or, 
My patch was implemented using the most effective method available for the 
current version of the code, and it is ready and working as far as 
functionality goes. 

For me, optimizing the existing SW paths in the InfiniBand code and adapting 
earlier code to new interfaces is a separate matter from functionality. 

Tomorrow I will start a new discussion regarding optimization of the 
post_send/post_recv path. Thank you for the reminder.

Regards,

Mirek

-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Thursday, August 05, 2010 2:28 PM
To: Walukiewicz, Miroslaw
Cc: Roland Dreier; linux-ker...@vger.kernel.org; linux-rdma@vger.kernel.org
Subject: Re: InfiniBand/RDMA merge plans for 2.6.36

Walukiewicz, Miroslaw wrote:
> Hello Roland,  What about a series from Aleksey Senin [...] And my patch 
> RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver 
> https://patchwork.kernel.org/patch/110252
Hi Mirek,

Reading your response @ http://marc.info/?l=linux-rdma&m=127954552519544 
to the comments made during the review, I was under the impression that 
you're going to try and modify the NES implementation, isn't this the 
case any more?

Or.


RE: InfiniBand/RDMA merge plans for 2.6.36

2010-08-05 Thread Walukiewicz, Miroslaw
Hello Roland, 

What about a series from Aleksey Senin 

[PATCH V1 0/4] New RAW_PACKET QP type
[PATCH V1 1/4] Rename RAW_ETY to RAW_ETHERTYPE
[PATCH V1 2/4]  New RAW_PACKET QP  type definition
[PATCH V1 3/4] Security check on QP type
[PATCH V1 4/4] Add RAW_PACKET to ib_attach/detach mcast calls

And my patch RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
https://patchwork.kernel.org/patch/110252/

I see only one patch from that series in your plans

Aleksey Senin (1):
  IB: Rename RAW_ETY to RAW_ETHERTYPE

Regards,

Mirek


-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Roland Dreier
Sent: Thursday, August 05, 2010 1:34 AM
To: linux-ker...@vger.kernel.org; linux-rdma@vger.kernel.org
Subject: InfiniBand/RDMA merge plans for 2.6.36

Since 2.6.35 is here, it's probably a good time to talk about 2.6.36
merge plans.  All the pending things that I'm aware of are listed below.

Boilerplate:

If something isn't already in my tree and it isn't listed below, I
probably missed it or dropped it unintentionally.  Please remind me.

As usual, when submitting a patch:

 - Give a good changelog that explains what issue your patch
   addresses, how you address the issue, how serious the issue is, and
   any other information that would be useful to someone evaluating
   your patch now, or trying to understand it years from now.

 - Please make sure that you include a "Signed-off-by:" line, and put
   any extra junk that should not go into the final kernel log *after*
   the "---" line so that git tools strip it off automatically.  Make
   the subject line be appropriate for inclusion in the kernel log as
   well once the leading "[PATCH ...]" stuff is stripped off.  I waste a
   lot of time fixing patches by hand that could otherwise be spent
   doing something productive like watching youtube.

 - Run your patch through checkpatch.pl so I don't have to nag you to
   fix trivial issues (or spend time fixing them myself).

 - Check your patch over at least enough so I don't see a memory leak
   or deadlock as soon as I look at it.

 - Build your patch with sparse checking ("C=2 CF=-D__CHECK_ENDIAN__")
   and make sure it doesn't introduce new warnings.  (A big bonus in
   goodwill for sending patches that fix old warnings)

 - Test your patch on a kernel with things like slab debugging and
   lockdep turned on.

And while you're waiting for me to get to your patch, I sure wouldn't
mind if you read and commented on someone else's patch.  We currently
have a big imbalance between people who are writing patches (many) and
people who are reviewing patches (mostly me).  None of this means you
shouldn't remind me about pending patches, since I often lose track of
things and drop them accidentally.

I don't think it makes sense to break down what I merged by topics
this time around -- there wasn't anything big that I can think of.  It
was really just a matter of small improvements, fixes, and cleanups
all over.

Here are a few topics I'm tracking that are not ready in time for the
2.6.36 window and will need to wait for 2.6.37 at least:

 - XRC.  While I think we made significant progress here, the fact is
   that this is not ready to merge at the beginning of the merge
   window, and so we'll need to keep working on it and wait for the
   next merge window.  I think this is just blocked on me at the
   moment.

 - IBoE.  Same as XRC; we made significant progress (and I opened an
   iboe branch to track this), but I don't think we have gotten the
   user-kernel interface nailed down yet, and it's just too late.

 - ummunotify-as-part-of-uverbs.  I'm working on this but don't have
   anything ready for the merge window.

 - AF_IB work.  I have not even had a chance to think about this yet,
   since I haven't dug through earlier backlog items.

 - mlx4 SR-IOV support.  See AF_IB above.

Here are all the patches I already have in my for-next branch:

Aleksey Senin (1):
  IB: Rename RAW_ETY to RAW_ETHERTYPE

Alexander Schmidt (3):
  IB/ehca: Fix bitmask handling for lock_hcalls
  IB/ehca: Catch failing ioremap()
  IB/ehca: Init irq tasklet before irq can happen

Arnd Bergmann (1):
  IB/qib: Use generic_file_llseek

Bart Van Assche (3):
  IB/srp: Use print_hex_dump()
  IB/srp: Make receive buffer handling more robust
  IB/srp: Export req_lim via sysfs

Ben Hutchings (1):
  IB/ipath: Fix probe failure path

Chien Tung (1):
  RDMA/nes: Store and print eeprom version

Dan Carpenter (2):
  RDMA/cxgb4: Remove unneeded assignment
  RDMA/cxgb3: Clean up signed check of unsigned variable

Dave Olson (1):
  IB/qib: Allow PSM to select from multiple port assignment algorithms

David Rientjes (1):
  RDMA/cxgb4: Remove dependency on __GFP_NOFAIL

Faisal Latif (1):
  RDMA/nes: Fix hangs on ifdown

Ira Weiny (1):
  IB/qib: Allow writes to the diag_counters to be able to clear them

Miroslaw Walukiew

RE: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

2010-07-19 Thread Walukiewicz, Miroslaw
Hello Or, 

I am still thinking about other options, and I am not sure yet which option I 
should choose for the implementation. 

I agree with you that it is possible to fix the post_send path in OFED.

Let me think a few days yet.

Regards,

Mirek

-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Sunday, July 18, 2010 6:52 PM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org; aleks...@voltaire.com
Subject: Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

> I don't think there are applications around which would use raw qp AND
> are linked against libibverbs-1.0, such that they would exercise the 1_0
> wrapper, so we can ignore the 1st allocation, the one at the wrapper code.
> As for the 2nd allocation, since a WQE --posting-- is synchronous, 
> using the maximal values specified during the creation of the QP, I
> believe that this allocation can be done once per QP and used later.

[...] 

Hi Mirek, any comment on my response to the NES patch you sent?

Or.



> 
>> dive to kernel:
>> ib_uverbs_post_send()
>> user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL); <- 3. dyn alloc
>> next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
>>user_wr->num_sge * sizeof (struct ib_sge),
>>GFP_KERNEL); <- 4. dyn
>> alloc
>>  And now there is finel call to driver. 
> ~same here for #4 you can compute/allocate once the maximal possible
> size for "next" per qp and use it later. As for #3, this need further
> thinking.
> 
> But before diving to all this design changes, what was the penalty
> introduced by these allocations? is it in packets-per-second, latency?
> 
>> Diving to kernel is treated as a something like passing signal to
>> kernel that there is prepared information to post_send/post_recv. The
>> information about buffers are passed through shared page (available to
>> userspace through mmap) to avoid copying of data. Write() ops is used
>> to passing signal about post_send. Read() ops is used to pass
>> information about post_recv(). We avoid additional copying of the data
>> that way.
> thanks for the heads-up, I took a look and this user/kernel shared
> memory page is used to hold the work-request, nothing to do with data.
> 
> As for the work request, you still have to copy it in user space from
> the user work request to the library mmaped buffer. So the only
> difference would be the copy_from_user done by uverbs, for few tens of
> bytes, can you tell if/what is the extra penalty introduced by this copy?
> 
>> struct nes_ud_send_wr {
>> u32   wr_cnt;
>> u32   qpn;
>> u32   flags;
>> u32   resv[1];
>> struct ib_sge sg_list[64];
>> };
>>
>> struct nes_ud_recv_wr {
>> u32   wr_cnt;
>> u32   qpn;
>> u32   resv[2];
>> struct ib_sge sg_list[64];
>> };
> Looking on struct nes_ud_send/recv_wr, I wasn't sure to follow, the same
> instance can be used to post list of work requests, where is work
> request is limited to use one SGE, am I correct?
> 
> I don't think there a need to support posting 64 --send-- requests, for
> recv it might makes sense, but it could be done in a "batch/background"
> flow, thoughts?
> 
> Or.



FW: [PATCH] RDMA/nes: corrected link type for nes cards

2010-07-14 Thread Walukiewicz, Miroslaw
Now the correct interface link type is set for ibv_query_port().

Signed-off-by: Mirek Walukiewicz 
---

 drivers/infiniband/hw/nes/nes_verbs.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/drivers/infiniband/hw/nes/nes_verbs.c 
b/drivers/infiniband/hw/nes/nes_verbs.c
index f179586..45bf56c 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -599,7 +599,7 @@ static int nes_query_port(struct ib_device *ibdev, u8 port, 
struct ib_port_attr
props->active_width = IB_WIDTH_4X;
props->active_speed = 1;
props->max_msg_sz = 0x8000;
-
+   props->link_layer = IB_LINK_LAYER_ETHERNET;
return 0;
 }
 



[PATCH] RDMA/nes: corrected firmware version update

2010-07-14 Thread Walukiewicz, Miroslaw
Now the firmware version is read from the correct place.

Signed-off-by: Mirek Walukiewicz 
---

 drivers/infiniband/hw/nes/nes_verbs.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/drivers/infiniband/hw/nes/nes_verbs.c 
b/drivers/infiniband/hw/nes/nes_verbs.c
index 0abd4f2..f179586 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -520,7 +520,7 @@ static int nes_query_device(struct ib_device *ibdev, struct 
ib_device_attr *prop
memset(props, 0, sizeof(*props));
memcpy(&props->sys_image_guid, nesvnic->netdev->dev_addr, 6);
 
-   props->fw_ver = nesdev->nesadapter->fw_ver;
+   props->fw_ver = nesdev->nesadapter->firmware_version;
props->device_cap_flags = nesdev->nesadapter->device_cap_flags;
props->vendor_id = nesdev->nesadapter->vendor_id;
props->vendor_part_id = nesdev->nesadapter->vendor_part_id;




RE: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

2010-07-06 Thread Walukiewicz, Miroslaw
Hello Or, 

I still don't see what is the performance issue with the uverbs 
post_send/post_recv and if there is such why it can't be fixed, to avoid 
introducing lib/driver nes special char device. Could you explain it with some 
more details? You were mention the rdma-cm device file, but the uverbs cmd api 
is used by libibverbs / uverbs and not by librdmacm / rdma-ucm, which is anyway 
a slow path.

 From my measurements it looks like the problem is related to memory 
allocation on the user-space and kernel paths, which is a very, very expensive 
operation. Look at the TX path (RX is very similar).
Ibv_post_send()
post_send_wrapper_1_0
for (w = wr; w; w = w->next) {
real_wr = alloca(sizeof *real_wr);  <- 1. dyn alloc 
real_wr->wr_id = w->wr_id;
  next the call to HW specific part
and prepare message to send

cmd  = alloca(cmd_size);  <- 2. dyn allocation

IBV_INIT_CMD_RESP(cmd, cmd_size, POST_SEND, &resp, sizeof resp);
dive to kernel:
ib_uverbs_post_send()
user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL); <- 3. dyn alloc
next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
   user_wr->num_sge * sizeof (struct ib_sge),
   GFP_KERNEL); <- 4. dyn alloc 

And now there is finel call to driver. 

Adding the additional device makes it possible to dive into the kernel without 
those memory allocations. 

 Also, I understand that .read (.write) entry maps to posting a receive 
(send) buffer, what is the use case for .mmap entry

 Not exactly. Diving into the kernel is treated as something like a signal 
to the kernel that prepared information for post_send/post_recv is available. 
The information about the buffers is passed through a shared page (available to 
user space through mmap) to avoid copying data. The write() op is used to 
signal a post_send; the read() op is used to pass information about a 
post_recv(). We avoid additional copying of the data that way.


> @@ -2939,6 +3130,9 @@ int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr 
> *attr,
>   nesqp->hwqp.qp_id, attr->qp_state, nesqp->ibqp_state,
>   nesqp->iwarp_state, atomic_read(&nesqp->refcount));
>  
> + if (ibqp->qp_type == IB_QPT_RAW_PACKET)
> + return 0;

 isn't a raw qp associated with a specific port of the device?

 In the NES architecture the QP type and number define a specific device 
or port. It is a one-to-one mapping.

Regards,

Mirek

-Original Message-
From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
Sent: Tuesday, July 06, 2010 10:50 AM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org; aleks...@voltaire.com
Subject: Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

miroslaw.walukiew...@intel.com wrote:
> adds a IB_QPT_RAW_PACKET QP type implementation for nes driver 

> +++ b/drivers/infiniband/hw/nes/nes_ud.c
> +static const struct file_operations nes_ud_sksq_fops = {
> + .owner = THIS_MODULE,
> + .open = nes_ud_sksq_open,
> + .release = nes_ud_sksq_close,
> + .write = nes_ud_sksq_write,
> + .read = nes_ud_sksq_read,
> + .mmap = nes_ud_sksq_mmap,
> +};
> +
> +
> +static struct miscdevice nes_ud_sksq_misc = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "nes_ud_sksq",
> + .fops = &nes_ud_sksq_fops,
> +};

Reading through the May 2010 "RDMA/nes: IB_QPT_RAW_PACKET QP type support for 
nes driver" email thread, e.g. at the links below, you say


> The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is 
> shared with
> all other user-kernel  communication and it is quite complex. It is a perfect 
> path
> for QP/CQ/PD/mem management but for me it is too complex for traffic 
> acceleration.
> The user<->kernel  path  through additional driver, shared page for 
> lkey/vaddr/len
> passing and SW memory translation in kernel is much more effective.

http://marc.info/?l=linux-rdma&m=127299659017928
http://marc.info/?l=linux-rdma&m=127306694704653

I still don't see what is the performance issue with the uverbs 
post_send/post_recv and if there is such why it can't be fixed, to avoid 
introducing lib/driver nes special char device. Could you explain it with some 
more details? You were mention the rdma-cm device file, but the uverbs cmd api 
is used by libibverbs / uverbs and not by librdmacm / rdma-ucm, which is anyway 
a slow path.

Also, I understand that .read (.write) entry maps to posting a receive (send) 
buffer, what is the use case for .mmap entry

> --- a/drivers/infiniband/hw/nes/nes_verbs.c
> +++ b/d

RE: Name for a new type of QP

2010-06-23 Thread Walukiewicz, Miroslaw
I would prefer the name IBV_QPT_FRAME, since it is an L2-layer QP. The term 
"packet" is reserved for L3.

Regards,

Mirek

-Original Message-
From: Moni Shoua [mailto:mo...@voltaire.com] 
Sent: Wednesday, June 23, 2010 11:20 AM
To: linux-rdma
Cc: Walukiewicz, Miroslaw; Roland Dreier; al...@voltaire.com
Subject: Name for a new type of QP

Hi,
This message follows a discussion in the EWG mailing list.

We want to promote a patch that enables use of a new QP type.
This QP type lets the user post_send() data to its SQ and treat it as the 
entire packet, including headers.
An example of use with this QP is sending Ethernet packets from userspace (and 
enjoying kernel bypass).

An open question in this matter it how should we call this QP type.
The first name, IBV_QPT_RAW_ETH, seems too similar to the existing type 
IBV_QPT_RAW_ETY.

My suggestions (that were posted in a different thread) are

IBV_QPT_FRAME
IBV_QPT_PACKET
IBV_QPT_NOHDR

Please make your comments and send your suggestions.

When we decide about a name we will send a patch that enables the use of this 
QP type.


thanks

Moni


RE: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type

2010-06-22 Thread Walukiewicz, Miroslaw
Thanks Moni, 

I treated them as accepted. 

Sean,
the answer to your question is that the RAW_QP patches need to be accepted 
first for this application to work.

Mirek


-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Moni Shoua
Sent: Monday, June 21, 2010 1:17 PM
To: Walukiewicz, Miroslaw
Cc: Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: [PATCH] librdmacm/mcraw: Add a new test application for user-space 
IBV_QPT_RAW_ETH QP type

Walukiewicz, Miroslaw wrote:
> No, no more changes are necessary. It is a standalone application.
> 
> Mirek
> 
Are you sure?
AFAIK, patches to kernel and libibverbs are required which were not accepted 
yet.


RE: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type

2010-06-16 Thread Walukiewicz, Miroslaw
No, no more changes are necessary. It is a standalone application.

Mirek

-Original Message-
From: Hefty, Sean 
Sent: Monday, June 14, 2010 6:33 PM
To: Walukiewicz, Miroslaw
Cc: linux-rdma@vger.kernel.org
Subject: RE: [PATCH] librdmacm/mcraw: Add a new test application for user-space 
IBV_QPT_RAW_ETH QP type

> The patch adds a new test application describing a usage of the
> IBV_QPT_RAW_ETH
> for IPv4 multicast acceleration on iWARP cards.

Are any changes still needed to the kernel or libibverbs to make this work?


RE: FW: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type

2010-06-14 Thread Walukiewicz, Miroslaw
Hello Or,

any reason not to patch mckey to support both IB and Ethernet raw QPs?
The mckey works on UD_QP type and mcraw works on RAW_QP type. 

The data payloads prepared for UD and RAW_QP are at different layers. 

The mckey uses rdma_join_multicast(), which triggers a state machine for IB 
multicast joining. 

The mcraw does not trigger such a state machine, because sending Ethernet 
multicast does not require any multicast-join state machine. The multicast 
destination address on Ethernet is determined by the multicast group address.
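For reference, the RFC 1112 mapping that makes this work - the destination MAC is derived directly from the group IP address, so no join handshake is needed just to address the frames (sketch; the function name is illustrative):

```c
#include <stdint.h>

/* RFC 1112: an IPv4 multicast group maps to the Ethernet MAC
 * 01:00:5e plus the low 23 bits of the group address (the 24th bit
 * of the host part is dropped). */
static void ipv4_mcast_to_mac(uint32_t group /* host byte order */,
                              uint8_t mac[6])
{
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (group >> 16) & 0x7f;   /* top bit of the 23 masked off */
    mac[4] = (group >> 8) & 0xff;
    mac[5] = group & 0xff;
}
```

For example, group 224.1.2.3 always maps to 01:00:5e:01:02:03, which is why an Ethernet sender needs no per-group state.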

For me the API differences between mcraw and mckey are quite large, and adding 
extra options could be confusing for users. 

does raw qp has any relation to the iWARP/TOE HW stack?

Yes. As I said, the logic for joining a multicast group is different for IB and 
Ethernet (I mean Ethernet in general, not iWARP specifics). The mcraw handles 
an Ethernet path for sending multicasts that could be similar for nes and mlx4. 

Regards,

Mirek

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Or Gerlitz
Sent: Friday, June 11, 2010 10:13 PM
To: Walukiewicz, Miroslaw
Cc: Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: FW: [PATCH] librdmacm/mcraw: Add a new test application for 
user-space IBV_QPT_RAW_ETH QP type

Walukiewicz, Miroslaw  wrote:
> The patch adds a new test application describing a usage of the 
> IBV_QPT_RAW_ETH
> for IPv4 multicast acceleration on iWARP cards. See man mcraw for parameters 
> description

So this is the only raw qp related patch to librdmacm? any reason not
to patch mckey to support both IB and Ethernet raw QPs? does raw qp
has any relation to the iWARP/TOE HW stack? there's also raw qp patch
posted to ewg for mlx4 which has no backing iwarp logic.

Or


FW: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type

2010-06-11 Thread Walukiewicz, Miroslaw

The patch adds a new test application describing a usage of the IBV_QPT_RAW_ETH
for IPv4 multicast acceleration on iWARP cards.

See man mcraw for parameters description.

Signed-off-by: Mirek Walukiewicz 
---

diff --git a/Makefile.am b/Makefile.am
index 4ddbcfa..0132b36 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -18,7 +18,7 @@ src_librdmacm_la_LDFLAGS = -version-info 1 -export-dynamic \
 src_librdmacm_la_DEPENDENCIES =  $(srcdir)/src/librdmacm.map

 bin_PROGRAMS = examples/ucmatose examples/rping examples/udaddy examples/mckey 
\
-  examples/rdma_client examples/rdma_server
+  examples/rdma_client examples/rdma_server examples/mcraw
 examples_ucmatose_SOURCES = examples/cmatose.c
 examples_ucmatose_LDADD = $(top_builddir)/src/librdmacm.la
 examples_rping_SOURCES = examples/rping.c
@@ -31,6 +31,8 @@ examples_rdma_client_SOURCES = examples/rdma_client.c
 examples_rdma_client_LDADD = $(top_builddir)/src/librdmacm.la
 examples_rdma_server_SOURCES = examples/rdma_server.c
 examples_rdma_server_LDADD = $(top_builddir)/src/librdmacm.la
+examples_mcraw_SOURCES = examples/mcraw.c
+examples_mcraw_LDADD = $(top_builddir)/src/librdmacm.la

 librdmacmincludedir = $(includedir)/rdma
 infinibandincludedir = $(includedir)/infiniband
@@ -77,7 +79,8 @@ man_MANS = \
man/udaddy.1 \
man/mckey.1 \
man/rping.1 \
-   man/rdma_cm.7
+   man/rdma_cm.7 \
+   man/mcraw.1

 EXTRA_DIST = include/rdma/rdma_cma_abi.h include/rdma/rdma_cma.h \
 include/infiniband/ib.h include/rdma/rdma_verbs.h \
diff --git a/examples/mcraw.c b/examples/mcraw.c
new file mode 100644
index 000..864c20d
--- /dev/null
+++ b/examples/mcraw.c
@@ -0,0 +1,897 @@
+/*
+ * Copyright (c) 2010 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define IB_SEND_IP_CSUM	0x10
+#define IMA_VLAN_FLAG	0x20
+
+#define VLAN_PRIORITY	0x0
+
+#define UDP_HEADER_SIZE	(sizeof(struct udphdr))
+
+#define HEADER_LEN	14 + 28
+
+struct cmatest_node {
+   int id;
+   struct rdma_cm_id   *cma_id;
+   int connected;
+   struct ibv_pd   *pd;
+   struct ibv_cq   *scq;
+   struct ibv_cq   *rcq;
+   struct ibv_mr   *mr;
+   struct ibv_ah   *ah;
+   uint32_tremote_qpn;
+   uint32_tremote_qkey;
+   uint8_t *mem;
+   struct ibv_comp_channel *channel;
+};
+
+struct cmatest {
+   struct rdma_event_channel *channel;
+   struct cmatest_node *nodes;
+   int conn_index;
+   int connects_left;
+
+   struct sockaddr_in6 dst_in;
+   struct sockaddr *dst_addr;
+   struct sockaddr_in6 src_in;
+   struct sockaddr *src_addr;
+   int fd[1024];
+};
+
+static struct cmatest test;
+static int connections = 1;
+static int message_size = 100;
+static int message_count = 10;
+static int is_sender;
+static int unmapped_addr;
+static char *dst_addr;
+static char *src_addr;
+static enum rdma_port_space port_space = RDMA_PS_UDP;
+
+int vlan_flag;
+int vlan_ident;
+
+static int cq_len = 512;
+static int qp_len = 256;
+
+uint16_t IP_CRC(void *buf, int hdr_len)
+{
+	unsigned long sum = 0;
+	const uint16_t *ip1;
+
+	ip1
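The archive truncates the patch here, partway through IP_CRC(). For reference, the function it begins is the standard RFC 1071 one's-complement Internet checksum; the following is a complete sketch of that algorithm, not the exact patch body:

```c
#include <stdint.h>

/* RFC 1071 one's-complement checksum over a header.  A sketch of what
 * the truncated IP_CRC() above presumably computes; the actual patch
 * body is cut off by the archive. */
uint16_t ip_checksum(const void *buf, int len)
{
	unsigned long sum = 0;
	const uint16_t *p = buf;

	while (len > 1) {		/* sum 16-bit words */
		sum += *p++;
		len -= 2;
	}
	if (len == 1)			/* odd trailing byte */
		sum += *(const uint8_t *)p;
	while (sum >> 16)		/* fold carries back into low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

For an IPv4 header the checksum field is zeroed first, the function is run over the 20-byte header, and the result is stored back; re-running it over the completed header then yields 0.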

RE: [PATCH v2] libibverbs: add path record definitions to sa.h

2010-05-21 Thread Walukiewicz, Miroslaw
Hello Steve, 

I want to add a change preventing creation of the L2 RAW QP type by 
unprivileged users (only effective uid = 0 would be able to perform the 
operation). 

What is the best place for such a change: ibv_create_qp in libibverbs 
(verbs.c), or should we let NIC vendors decide whether to expose this API to 
ordinary users or root only? In that case the change would be needed only in 
the libnes library.

Regards,

Mirek

-Original Message-
From: Steve Wise [mailto:sw...@opengridcomputing.com] 
Sent: Wednesday, May 19, 2010 6:00 PM
To: Walukiewicz, Miroslaw
Cc: Roland Dreier; Hefty, Sean; linux-rdma
Subject: Re: [PATCH v2] libibverbs: add path record definitions to sa.h

Walukiewicz, Miroslaw wrote:
> Hello Steve, 
>
> Do you plan some changes in the core code related to RAW_QPT? 
>
>   


The only change I see needed in the kernel core is the mcast change you 
already proposed to allow mcast attach/detach for RAW_ETY qps...


> Could you explain me better what means "privileged interface" for you?
>
>   


I just mean that allocating these raw qps should only be allowed for 
effective UID 0.  This is analogous to PF_PACKET sockets, which are 
privileged as well.
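Steve's rule above ("effective UID 0 only, like PF_PACKET") can be sketched as a simple gate at QP-creation time. The function name and the plain-UID comparison here are illustrative only; an in-kernel implementation would more likely use capable(CAP_NET_RAW), as PF_PACKET itself does:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical privilege gate for RAW QP creation, mirroring the
 * PF_PACKET rule: only effective UID 0 may create one.  A real
 * kernel-side check would use capable(CAP_NET_RAW) rather than a
 * raw UID comparison. */
int raw_qp_create_allowed(uint32_t euid)
{
	return euid == 0 ? 0 : -EPERM;
}
```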



Steve.



RE: [PATCH v2] libibverbs: add path record definitions to sa.h

2010-05-19 Thread Walukiewicz, Miroslaw
Hello Steve, 

Do you plan some changes in the core code related to RAW_QPT? 

Could you explain to me what "privileged interface" means to you?

Regards, 

Mirek
-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Steve Wise
Sent: Tuesday, May 18, 2010 4:04 PM
To: Roland Dreier
Cc: Hefty, Sean; linux-rdma
Subject: Re: [PATCH v2] libibverbs: add path record definitions to sa.h

Roland Dreier wrote:
>  > Can you add the RAW_ETY qp type in this release as well?
>
> To be honest I haven't looked at the iWARP datagram stuff at all.  I'm
> not sure overloading the RAW_ETY QP type is necessarily the right thing
> to do -- it has quite different (never implemented) semantics in the IB
> case.  Is there any overview of what you guys are planning as far as
> how work requests are created for such QPs?
>   

The RAW_ETY qp would be just that:  A kernel-bypass/user mode qp that 
allows sending/receiving ethernet packets.   It would also provide a way 
for user applications to join/leave ethernet mcast groups (which 
requires an rdma core kernel change that Intel posted too).  What the 
iWARP vendors are doing on top of that is implementing some form of UDP 
in user mode.  The main goal here is to provide an ultra low latency UDP 
multicast and unicast channel for important market segments that desire 
this paradigm.  Also, due to the nature of this (send/recv raw eth 
frames), the interface would be privileged.

If you want to wait, then later I'll post patches on how this is being 
done for cxgb4.  But I thought adding the RAW_ETY was definitely a 
common requirement for Intel and Chelsio.


Steve.





RE: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type

2010-05-06 Thread Walukiewicz, Miroslaw

Steve Wise wrote:
Is this all just optimizing mcast packets?

The RAW ETH QP API could be used to accelerate sending and receiving of any 
L2 packets; it depends on the application and HW setup. We use it to 
accelerate multicast traffic.

Steve Wise wrote:
Does this raw qp service share the mac address with the ports being used 
by the host stack?  Or does each raw qp get its own mac address?

We use the MAC address of the port as the source MAC. The destination MAC is 
derived from the multicast group. In theory it is possible to use another MAC 
for unicast traffic acceleration, but that is much more complex, since correct 
ARP responses must be generated and the HW must be able to steer unicast 
packets to the correct QPs. 
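The destination-MAC derivation mentioned here is the standard IPv4 multicast mapping (RFC 1112): a fixed 01:00:5e prefix followed by the low 23 bits of the group address. A minimal sketch:

```c
#include <stdint.h>

/* Map an IPv4 multicast group address (host byte order) to its
 * Ethernet multicast MAC per RFC 1112: fixed 01:00:5e prefix plus
 * the low 23 bits of the group address. */
void mcast_ip_to_mac(uint32_t group, uint8_t mac[6])
{
	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;	/* top 9 bits of the group are dropped */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```

Because only 23 bits survive, 32 different group addresses share each MAC, which is why the HW MAC filters enabled by attach/detach are a coarse first-level filter.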

Steve Wise wrote:
Do you all have a user mode UDP/IP running on this raw qp? 

Yes, we use a modified mckey for most tests. 

Steve Wise wrote:
If so, does it use its own IP address separate from the host stack or does it 
share the host's IP address.

Our test application shares the IP address of the host interface, using it as 
the source IP address. 

Regards,

Mirek
 

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Steve Wise
Sent: Wednesday, May 05, 2010 4:56 PM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org
Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration 
over IB_QPT_RAW_ETY QP type


> I see here some misunderstanding. Let me explain better how our transmit path 
> works. 
>
> In our implementation we use normal memory registration path using ibv_reg_mr 
> and we use ibv_post_send() with lkey/vaddr/len.
>
> The implementation of ibv_post_send (nes_post_send in libnes) for RAW QP 
> passes lkey/virtual_addr/len information to kernel using shared page to our 
> device driver (ud_post_send). There is no data copy here and the driver is 
> used only for fast synchronization.
>
> Because our RAW ETH QP must use physical addresses only,  ud_post_send() in 
> kernel makes a virtual to physical memory translation and accesses the QP HW 
> for packet transmission. Previously a packet buffer memory was registered and 
> pinned by ibv_reg_mr to provide necessary information for making such 
> translation.
>
>   

I see.  Thanks!


> The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is 
> shared with all other user-kernel  communication and it is quite complex. It 
> is a perfect path for QP/CQ/PD/mem management but for me it is too complex 
> for traffic acceleration. 
>
> The user<->kernel  path  through additional driver, shared page for 
> lkey/vaddr/len passing and SW memory translation in kernel is much more 
> effective. 
>
> Maybe it is a good idea to make that API more official after some kind of 
> standardization. Our tests proved that it works. We achieved twice better 
> performance and latency. That way could open the way for adding some non-RDMA 
> devices to devices supported OFED API. 
>
>   

Sounds good.

Do you have specific perf numbers to share?  Is this all just optimizing 
mcast packets? 

Also:

Does this raw qp service share the mac address with the ports being used 
by the host stack?  Or does each raw qp get its own mac address?

Do you all have a user mode UDP/IP running on this raw qp?  If so, does 
it use its own IP address separate from the host stack or does it share 
the host's IP address.


Thanks,

Steve.






RE: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type

2010-05-05 Thread Walukiewicz, Miroslaw
Steve, 

> ud_post_send and friends implements the transmit path for IMA. Our RAW ETH QP 
> needs access to physical addresses from user space. Due to security reasons 
> we should make a virtual-to-physical address translation in kernel. 
>
>   
Steve Wise wrote:
But why couldn't you just use the normal memory registration paths?  IE 
the user mode app does ibv_reg_mr() and then uses lkey/addr/len in SGEs 
in the ibv_post_send() which could do kernel bypass.

I see here some misunderstanding. Let me explain better how our transmit path 
works. 

In our implementation we use the normal memory registration path via 
ibv_reg_mr, and we use ibv_post_send() with lkey/vaddr/len.

The implementation of ibv_post_send (nes_post_send in libnes) for the RAW QP 
passes the lkey/virtual_addr/len information to the kernel through a page 
shared with our device driver (ud_post_send). There is no data copy here; the 
driver is used only for fast synchronization.

Because our RAW ETH QP must use physical addresses only, ud_post_send() in 
the kernel performs a virtual-to-physical memory translation and accesses the 
QP HW for packet transmission. The packet buffer memory was previously 
registered and pinned by ibv_reg_mr, which provides the information necessary 
for that translation.
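A rough picture of the descriptor the library might write to that shared page follows. The names and layout are purely illustrative, not the actual nes ABI:

```c
#include <stdint.h>

/* Hypothetical layout of one send descriptor on the user<->kernel
 * shared page described above: libnes writes lkey/vaddr/len, and the
 * driver's ud_post_send() resolves vaddr to physical pages through
 * the MR pinned by ibv_reg_mr().  Illustrative only. */
struct raw_qp_send_desc {
	uint32_t lkey;		/* lkey returned by ibv_reg_mr() */
	uint32_t len;		/* frame length, including the L2 header */
	uint64_t vaddr;		/* user virtual address of the frame */
};
```

Passing only (lkey, vaddr, len) is what keeps the path zero-copy: the kernel validates the registration and translates addresses, but never touches the payload.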

Steve Wise wrote:
Seems like maybe you could fix the non-bypass post_send/recv paths 
instead of implementing an entirely new user<->kernel interface...

The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is 
shared with all other user-kernel communication and it is quite complex. It is 
a perfect path for QP/CQ/PD/mem management, but for me it is too complex for 
traffic acceleration. 

The user<->kernel path through an additional driver, with a shared page for 
lkey/vaddr/len passing and SW memory translation in the kernel, is much more 
effective. 

Maybe it is a good idea to make that API more official after some kind of 
standardization. Our tests proved that it works: we achieved twice the 
performance and better latency. It could also open the way to adding some 
non-RDMA devices to the set of devices supported by the OFED API. 

Regards,

Mirek


-Original Message-
From: Steve Wise [mailto:sw...@opengridcomputing.com] 
Sent: Tuesday, May 04, 2010 8:15 PM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org
Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration 
over IB_QPT_RAW_ETY QP type

Walukiewicz, Miroslaw wrote:
> Hello Steve,
>
> Our Hw QP is not a UD type QP but L2 raw QP. In verbs API there is assumption 
> that user provides a data payload only for TX and similarly receives a 
> payload only. The protocol headers (in case of UD - MAC/IP/UDP) are attached 
> by HW. 
>
> Our QP implementation in HW  does not provide such possibility of attaching 
> headers by HW for UD traffic so for multicast acceleration we choose L2 raw 
> path. It provides some overhead for user application but it is still zero 
> copy approach.
>
> I thought about using a simulation of UD path using L2 raw QP to get the same 
> result like for true UD QP (user handles a payload only). Such approach costs 
> additional copy of payload in SW due to putting headers first and next 
> payload to single tx buffer. Similar situation is for rx. It is a need for 
> copy payload to posted buffers or provide data with some offset. 
>
> ud_post_send and friends implements the transmit path for IMA. Our RAW ETH QP 
> needs access to physical addresses from user space. Due to security reasons 
> we should make a virtual-to-physical address translation in kernel. 
>
>   

But why couldn't you just use the normal memory registration paths?  IE 
the user mode app does ibv_reg_mr() and then uses lkey/addr/len in SGEs 
in the ibv_post_send() which could do kernel bypass.

> Unfortunately an OFED path for ibv_post_send diving to kernel is quite slow 
> due to some number of dynamic memory allocations in the path. We choose to 
> create own private post_send channel to increase tx bandwidth using 
> ud_post_send and friends.

Seems like maybe you could fix the non-bypass post_send/recv paths 
instead of implementing an entirely new user<->kernel interface...

Steve.


>  
>
> Regards,
>
> Mirek
>
> -Original Message-
> From: Steve Wise [mailto:sw...@opengridcomputing.com] 
> Sent: Tuesday, May 04, 2010 7:19 PM
> To: Walukiewicz, Miroslaw
> Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast 
> acceleration over IB_QPT_RAW_ETY QP type
>
> Hey Mirek,
>
> It looks like this patch adds a new file interface for a UD service.  
> Why didn't you extend the existing UD interface as needed? 
>
> What IO is supported with these changes?  IMA via the raw QP, but what are 
> ud_post_send() and friends used for?
>

RE: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type

2010-05-04 Thread Walukiewicz, Miroslaw
Hello Steve,

Our HW QP is not a UD type QP but an L2 raw QP. The verbs API assumes that 
the user provides only a data payload for TX and similarly receives only a 
payload; the protocol headers (in the case of UD: MAC/IP/UDP) are attached by 
HW. 

Our QP implementation in HW does not provide the possibility of attaching 
headers for UD traffic, so for multicast acceleration we chose the L2 raw 
path. It adds some overhead for the user application but it is still a 
zero-copy approach.

I thought about simulating the UD path using the L2 raw QP to get the same 
result as with a true UD QP (the user handles a payload only). Such an 
approach costs an additional copy of the payload in SW, due to putting the 
headers first and then the payload into a single tx buffer. The situation is 
similar for rx: the payload must either be copied into the posted buffers or 
delivered with some offset. 

ud_post_send and friends implement the transmit path for IMA. Our RAW ETH QP 
needs access to physical addresses from user space; for security reasons the 
virtual-to-physical address translation must be done in the kernel. 

Unfortunately the OFED path for ibv_post_send that dives into the kernel is 
quite slow, due to a number of dynamic memory allocations along the way. We 
chose to create our own private post_send channel, using ud_post_send and 
friends, to increase tx bandwidth. 

Regards,

Mirek

-Original Message-
From: Steve Wise [mailto:sw...@opengridcomputing.com] 
Sent: Tuesday, May 04, 2010 7:19 PM
To: Walukiewicz, Miroslaw
Cc: rdre...@cisco.com; linux-rdma@vger.kernel.org
Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration 
over IB_QPT_RAW_ETY QP type

Hey Mirek,

It looks like this patch adds a new file interface for a UD service.  
Why didn't you extend the existing UD interface as needed? 

What IO is supported with these changes?  IMA via the raw QP, but what are 
ud_post_send() and friends used for?


Steve.



miroslaw.walukiew...@intel.com wrote:
> This patch implements iWarp multicast acceleration (IMA)
> over IB_QPT_RAW_ETY QP type in nes driver.
>
> Application creates a raw eth QP (IBV_QPT_RAW_ETH in user-space) and
> manages the multicast via ibv_attach_mcast and ibv_detach_mcast calls.
>
> Calling ibv_attach_mcast/ibv_datach_mcast has an effect of
> enabling/disabling L2 MAC address filters in HW.
>
> Signed-off-by: Mirek Walukiewicz 
>
>
>


How to use IB_QPT_RAW_ETY QP from user space.

2010-03-09 Thread Walukiewicz, Miroslaw
Hello,

I am looking for the equivalent of the kernel's IB_QPT_RAW_ETY definition in 
libibverbs/include/infiniband/verbs.h.

In ib_verbs.h:

enum ib_qp_type {
/*
 * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries
 * here (and in that order) since the MAD layer uses them as
 * indices into a 2-entry table.
 */
IB_QPT_SMI,
IB_QPT_GSI,

IB_QPT_RC,
IB_QPT_UC,
IB_QPT_UD,
IB_QPT_RAW_IPV6,
IB_QPT_RAW_ETY
};

In verbs.h I see only:
enum ibv_qp_type {
IBV_QPT_RC = 2,
IBV_QPT_UC,
IBV_QPT_UD,
IBV_QPT_XRC
};

Is something missing in libibverbs' verbs.h?
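For comparison, a user-space extension mirroring the kernel enum might look like this. The IBV_QPT_RAW_ETH name and its value are assumptions here, not an accepted libibverbs API:

```c
/* Hypothetical user-space counterpart of the kernel's IB_QPT_RAW_ETY:
 * extend enum ibv_qp_type in libibverbs/include/infiniband/verbs.h.
 * Name and value are a sketch, not an accepted API. */
enum ibv_qp_type {
	IBV_QPT_RC = 2,
	IBV_QPT_UC,		/* = 3 */
	IBV_QPT_UD,		/* = 4 */
	IBV_QPT_XRC,		/* = 5 */
	IBV_QPT_RAW_ETH		/* = 6, raw L2 (Ethernet) QP */
};
```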

Regards, 

Mirek

-
Intel Technology Poland sp. z o.o.
z siedziba w Gdansku
ul. Slowackiego 173
80-298 Gdansk

Sad Rejonowy Gdansk Polnoc w Gdansku, 
VII Wydzial Gospodarczy Krajowego Rejestru Sadowego, 
numer KRS 101882

NIP 957-07-52-316
Kapital zakladowy 200.000 zl

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
