[PATCH] ummunotify: Userspace support for MMU notifications

2010-04-12 Thread Eric B Munson
Andrew,

I am resubmitting this patch because I believe that the discussion
has shown this to be an acceptable solution.  I have fixed the 32 bit
build errors, but other than that change, the code is the same as
Roland's V3 patch.

From: Roland Dreier rola...@cisco.com

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunotify, that creates a /dev/ummunotify
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
kernel (cf struct ummunotify_register_ioctl in linux/ummunotify.h).

 2. read() to retrieve events generated when a mapping in a watched
address range is invalidated (cf struct ummunotify_event in
linux/ummunotify.h).  select()/poll()/epoll() and SIGIO are
handled for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
generation counter that is incremented each time an event is
generated.  This allows userspace to have a fast path that checks
that no events have occurred without a system call.
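
The fast path in item 3 might look roughly like the sketch below. It is not taken from the patch: struct umn_cache and umn_cache_is_stale() are hypothetical names, and the counter pointer is assumed to come from mmap()ing one read-only page of the /dev/ummunotify fd at offset 0. Any readable uint64_t can stand in for it when experimenting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace bookkeeping for the generation-counter fast path.
 * 'counter' is assumed to point at the kernel page mapped at offset 0;
 * 'last_seen' is the counter value observed at the previous check.
 */
struct umn_cache {
    const volatile uint64_t *counter;
    uint64_t last_seen;
};

/*
 * Returns true if the counter moved since the last check, i.e. at least
 * one event was generated and the slow path should run. No system call
 * is made. On 32-bit targets a 64-bit load may tear; this sketch
 * ignores that.
 */
static bool umn_cache_is_stale(struct umn_cache *c)
{
    assert(c != NULL && c->counter != NULL);

    uint64_t now = *c->counter;
    if (now == c->last_seen)
        return false;

    c->last_seen = now;
    return true;
}
```

When umn_cache_is_stale() returns true, the caller would fall back to the slow path and read() the queued events from the fd before trusting any cached registration.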

Thanks to Jason Gunthorpe jgunthorpe at obsidianresearch.com for
suggestions on the interface design.  Also thanks to Jeff Squyres
jsquyres at cisco.com for prototyping support for this in Open MPI, which
helped find several bugs during development.

Signed-off-by: Roland Dreier rola...@cisco.com
Signed-off-by: Eric B Munson ebmun...@us.ibm.com

---

Changes since v3:
 - Replaced [get|put]_user with copy_[from|to]_user to fix x86
   builds
---
 Documentation/Makefile  |3 +-
 Documentation/ummunotify/Makefile   |7 +
 Documentation/ummunotify/ummunotify.txt |  150 
 Documentation/ummunotify/umn-test.c |  200 +++
 drivers/char/Kconfig|   12 +
 drivers/char/Makefile   |1 +
 drivers/char/ummunotify.c   |  567 +++
 include/linux/ummunotify.h  |  121 +++
 8 files changed, 1060 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/ummunotify/Makefile
 create mode 100644 Documentation/ummunotify/ummunotify.txt
 create mode 100644 Documentation/ummunotify/umn-test.c
 create mode 100644 drivers/char/ummunotify.c
 create mode 100644 include/linux/ummunotify.h

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 6fc7ea1..27ba76a 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -1,3 +1,4 @@
 obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
-   pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
+   pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
+   watchdog/src/
diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
new file mode 100644
index 000..89f31a0
--- /dev/null
+++ b/Documentation/ummunotify/Makefile
@@ -0,0 +1,7 @@
+# List of programs to build
+hostprogs-y := umn-test
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
new file mode 100644
index 000..78a79c2
--- /dev/null
+++ b/Documentation/ummunotify/ummunotify.txt
@@ -0,0 +1,150 @@
+UMMUNOTIFY
+
+  Ummunotify relays MMU notifier events to userspace.  This is useful
+  for libraries that need to track the memory mapping of applications;
+  for example, MPI implementations using RDMA want to cache memory
+  registrations for performance, but tracking all possible crazy cases
+  such as when, say, the FORTRAN runtime frees memory is impossible
+  without kernel help.
+
+Basic Model
+
+  A userspace process uses it by opening /dev/ummunotify, which
+  returns a file descriptor.  Interest in address ranges is registered
+  using ioctl() and MMU notifier events are retrieved using read(), as
+  described in more detail below.  Userspace can register multiple
+  address ranges to watch, and can unregister individual ranges.
+
+  Userspace can also mmap() a single read-only page at offset 0 on
+  this file descriptor.  This page contains (at offset 0) a single
+  64-bit generation counter that the kernel increments each time an
+  MMU notifier event occurs.  Userspace can use this to very quickly
+  check if there are any events to retrieve without needing to do a
+ 

Socket Direct Protocol: help

2010-04-12 Thread Andrea Gozzelino
Good morning,

I'm testing some Neteffect cards (Intel code E10G81GP - Neteffect
NE020.LP.1.SSR).
The PC runs Linux (kernel version 2.6.18-164.15.1.el5, x86_64
GNU/Linux).

In this phase, I am measuring the bandwidth with ad hoc
netserver/netperf tests (version netperf-2.4.5).
They work fine over TCP, as do the OFED 1.5.1 example programs, but
they have problems over SDP.

I'm trying to test with the command lines below:

server: LD_PRELOAD=/usr/local/lib64/libsdp.so netserver

client: LD_PRELOAD=/usr/local/lib64/libsdp.so netperf -H server_address -c -C -- -m 65536

The /etc/libsdp.conf file contains rules below:
use both listen * *:*
use both connect * *:*
log min-level 9 destination file libsdp.log

Client displays Connection error: Can not allocate memory and the
connection fails.
(original text on client log file:libsdp Error connect: failed for SDP
fd:6 with error:Cannot allocate memory)

The library path is:
/usr/local/lib64/libsdp.so


Could someone explain to me how the LD_PRELOAD environment variable must
be set?
I don't understand why the tests work with TCP and not with SDP.
Could I be working with the wrong Linux kernel environment or parameters?

I don't know if there is a specific mailing list for SDP, so I am asking
for help here.

Thank you very much,
Andrea 







Andrea Gozzelino

INFN - Laboratori Nazionali di Legnaro  (LNL)
Viale dell'Universita' 2
I-35020 - Legnaro (PD)- ITALIA
Tel: +39 049 8068346
Fax: +39 049 641925
Mail: andrea.gozzel...@lnl.infn.it  

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 0/2] Add support for enhanced atomic operations

2010-04-12 Thread Vladimir Sokolovsky

Vladimir Sokolovsky wrote:

Roland Dreier wrote:

  Hence, I think it would be cleaner if a new capability,
  masked_atomic_cap, were introduced, using the original definitions
  (NONE, HCA, GLOB).

Vlad, what do you think about that?  The more I think about it, the
cleaner this seems to me.  And it doesn't even consume a device
capability flag bit, which is a nice bonus.


Hi Roland,
Do you propose to use IB_ATOMIC_GLOB instead of IB_ATOMIC_HCA while setting
atomic capability in the code below?

props->atomic_cap = dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_ATOMIC ?
	IB_ATOMIC_HCA : IB_ATOMIC_NONE;

Or add IB_MASKED_ATOMIC to ib_atomic_cap enum and use this one instead 
of IB_ATOMIC_HCA?


All of this, of course, would replace setting IB_DEVICE_MASKED_ATOMIC
as a device capability.


Thanks,
Vladimir




Hi Roland,
Can you comment?

Thanks,
Vladimir


Socket Direct Protocol: help (2)

2010-04-12 Thread Andrea Gozzelino
On Apr 12, 2010 10:14 AM, Andrea Gozzelino
andrea.gozzel...@lnl.infn.it wrote:

 Good morning,
 
 I'm testing some Neteffect cards (Intel code E10G81GP - Neteffect
 NE020.LP.1.SSR).
 PC has Linux| (kernel version) 2.6.18-164.15.1.el5 | x86_64 x86_64
 x86_64 GNU/Linux.
 
 In this phase, I measure the bandwidth with the netserver/netperf
 (version netperf-2.4.5) ad hoc tests.
 They work fine with TCP protocol - as OFED 1.5.1 example programs -
 and
 they have some problems with SDP one.
 
 I'm trying test with the command lines below:
 
 server: LD_PRELOAD=/usr/local/lib64/libsdp.so netserver
 
 client: LD_PRELOAD=/usr/local/lib64/libsdp.so netperf -H server_address -c -C -- -m 65536
 
 The /etc/libsdp.conf file contains rules below:
 use both listen * *:*
 use both connect * *:*
 log min-level 9 destination file libsdp.log
 
 Client displays Connection error: Can not allocate memory and the
 connection fails.
 (original text on client log file:libsdp Error connect: failed for SDP
 fd:6 with error:Cannot allocate memory)
 
 The library path is:
 /usr/local/lib64/libsdp.so
 
 
 Could someone explain me how LD_PRELOAD environment variable must be
 set?
 I don't understand why the test work with TCP and not with SDP.
 Could I work with wrong Linux kernel environment or parameters?
 
 I don't know if there is a specific mailing list for SDP so I ask you
 help.
 
 Thank you very much,
 Andrea 
 
 
 
 
 
 
 

Hi all,

I should add that the kernel-side SDP debug output shows the following error:

command line: dmesg
sdp_init_qp:95 sdp_sock( 2100:2 40720:0): recv sge's. capability: 4 needed: 9
sdp_init_qp:95 sdp_sock( 2100:2 41203:0): recv sge's. capability: 4 needed: 9

The function sdp_init_qp() is defined in
/usr/src/ofa_kernel-1.5.1/drivers/infiniband/ulp/sdp/sdp_cma.c (lines 76
- 141).

Could it be a firmware problem?
I have this situation:
command line: ethtool -i eth2
driver: iw_nes
version: 1.5.0.0
firmware-version: 3.16
bus-info: :03:00.0

Thank you very much,
Andrea
Andrea Gozzelino

INFN - Laboratori Nazionali di Legnaro  (LNL)
Viale dell'Universita' 2
I-35020 - Legnaro (PD)- ITALIA
Tel: +39 049 8068346
Fax: +39 049 641925
Mail: andrea.gozzel...@lnl.infn.it  



Re: [PATCH v2 38/51] IB/qib: Add qib_sysfs.c

2010-04-12 Thread Ralph Campbell
On Fri, 2010-04-09 at 17:27 -0700, Jason Gunthorpe wrote:
 On Fri, Apr 09, 2010 at 05:13:24PM -0700, Ralph Campbell wrote:
 
  For the QSFP data, I hope I can leave it as is since it is
  related to the link state that the other files contain.
  It is a read-only file so no issue with trying to set a value.
 
 There was some flak for other stuff like this a while back.
 
 IMHO, it would be appropriate to have a hex dump of the entire QSFP
 EEPROM and leave parsing to userspace, or put the parsed version in
 debugfs.
 
 Jason

OK. I will move it to our file system which is used
to export binary data.



IPoIB performance benchmarking

2010-04-12 Thread Tom Ammon

Hi,

I'm trying to do some performance benchmarking of IPoIB on a DDR IB 
cluster, and I am having a hard time understanding what I am seeing.


When I do a simple netperf, I get results like these:

[r...@gateway3 ~]# netperf -H 192.168.23.252
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.23.252 (192.168.23.252) port 0 AF_INET

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  65536  65536    10.01    4577.70


Which is disappointing since it is simply two DDR IB-connected nodes 
plugged in to a DDR switch - I would expect much higher throughput than 
that. When I do a test with ibv_srq_pingpong (using the same message 
size reported above), here's what I get:


[r...@gateway3 ~]# ibv_srq_pingpong 192.168.23.252 -m 4096 -s 65536
  local address:  LID 0x012b, QPN 0x000337, PSN 0x19cc85
  local address:  LID 0x012b, QPN 0x000338, PSN 0x956fc2
...
[output omitted]
...
  remote address: LID 0x0129, QPN 0x00032e, PSN 0x891ce3
131072000 bytes in 0.08 seconds = 12763.08 Mbit/sec
1000 iters in 0.08 seconds = 82.16 usec/iter

Which is much closer to what I would expect with DDR.

The MTU on both of the QLogic DDR HCAs is set to 4096, as it is on the 
QLogic switch.


I know the above is not completely apples-to-apples, since the 
ibv_srq_pingpong is layer2 and is using 16 QPs. So I ran it again with 
only a single QP, to make it more roughly equivalent of my single-stream 
netperf test, and I still get almost double the performance:


[r...@gateway3 ~]# ibv_srq_pingpong 192.168.23.252 -m 4096 -s 65536 -q 1
  local address:  LID 0x012b, QPN 0x000347, PSN 0x65fb56
  remote address: LID 0x0129, QPN 0x00032f, PSN 0x5e52f9
131072000 bytes in 0.13 seconds = 8323.22 Mbit/sec
1000 iters in 0.13 seconds = 125.98 usec/iter


Is there something that I am not understanding, here? Is there any way 
to make single-stream TCP IPoIB performance better than 4.5Gb/s on a DDR 
network? Am I just not using the benchmarking tools correctly?
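
The ibv_srq_pingpong figures quoted above can be cross-checked by hand. The sketch below is only an illustration of that arithmetic; mbit_per_sec() and elapsed_from_iters() are hypothetical helper names, not part of netperf or libibverbs. The test moves 2 x 65536 x 1000 = 131072000 bytes, and because the tool prints the elapsed time rounded to two decimals, the time is recomputed here from the usec/iter figure.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in Mbit/s (10^6 bits per second) for 'bytes' moved in 'seconds'. */
static double mbit_per_sec(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes * 8.0 / seconds / 1e6;
}

/* Elapsed seconds reconstructed from an iteration count and usec per iteration. */
static double elapsed_from_iters(unsigned iters, double usec_per_iter)
{
    return (double)iters * usec_per_iter / 1e6;
}
```

Feeding in 1000 iterations at 82.16 usec/iter gives roughly 12763 Mbit/s, and 1000 iterations at 125.98 usec/iter gives roughly 8323 Mbit/s, which agrees with the tool's own output to within rounding.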


Thanks,

Tom

--


Tom Ammon
Network Engineer
Office: 801.587.0976
Mobile: 801.674.9273

Center for High Performance Computing
University of Utah
http://www.chpc.utah.edu



Re: [PATCH] ummunotify: Userspace support for MMU notifications

2010-04-12 Thread Pavel Machek
Hi!

 I am resubmitting this patch because I believe that the discussion
 has shown this to be an acceptable solution.  I have fixed the 32 bit
 build errors, but other than that change, the code is the same as
 Roland's V3 patch.
 
 From: Roland Dreier rola...@cisco.com
 
 As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
 and follow-up messages, libraries using RDMA would like to track
 precisely when application code changes memory mapping via free(),
 munmap(), etc.  Current pure-userspace solutions using malloc hooks
 and other tricks are not robust, and the feeling among experts is that
 the issue is unfixable without kernel help.

I do not know. I still believe that this does not belong in the
kernel; application should not need to trace itself to know what it does.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: IPoIB performance benchmarking

2010-04-12 Thread Tom Ammon

Dave,

Thanks for the pointer. I thought it was running in connected mode, and 
looking at that variable that you mentioned confirms it:


[r...@gateway3 ~]# cat /sys/class/net/ib0/mode
connected

And the IP MTU shows up as:

[r...@gateway3 ~]# ifconfig ib0
ib0   Link encap:InfiniBand  HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
      inet addr:192.168.23.253  Bcast:192.168.23.255  Mask:255.255.254.0
      inet6 addr: fe80::211:7500:ff:6edc/64 Scope:Link
      UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
      RX packets:2319010 errors:0 dropped:0 overruns:0 frame:0
      TX packets:4512605 errors:0 dropped:33011 overruns:0 carrier:0
      collisions:0 txqueuelen:256
      RX bytes:5450805352 (5.0 GiB)  TX bytes:154353169896 (143.7 GiB)


This is partly why I'm stumped - I've seen threads about how connected 
mode is supposed to improve IPoIB performance, but I'm not seeing as 
much performance as I'd like.


Tom

On 04/12/2010 02:19 PM, Dave Olson wrote:

On Mon, 12 Apr 2010, Tom Ammon wrote:
| I'm trying to do some performance benchmarking of IPoIB on a DDR IB
| cluster, and I am having a hard time understanding what I am seeing.
|
| When I do a simple netperf, I get results like these:
|
| [r...@gateway3 ~]# netperf -H 192.168.23.252
| TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.23.252
| (192.168.23.252) port 0 AF_INET
| Recv   Send    Send
| Socket Socket  Message  Elapsed
| Size   Size    Size     Time     Throughput
| bytes  bytes   bytes    secs.    10^6bits/sec
|
|   87380  65536  65536    10.01    4577.70

Are you using connected mode, or UD?  Since you say you have a 4K MTU,
I'm guessing you are using UD.  Change to use connected mode (edit
/etc/infiniband/openib.conf), or as a quick test

 echo connected > /sys/class/net/ib0/mode

and then the mtu should show as 65520.  That should help
the bandwidth a fair amount.


Dave Olson
dave.ol...@qlogic.com
   


--

Tom Ammon
Network Engineer
Office: 801.587.0976
Mobile: 801.674.9273

Center for High Performance Computing
University of Utah
http://www.chpc.utah.edu



[PATCH] uDAPL v1.2 - cma: memory leak of FD's (pipe) created during dat_evd_create

2010-04-12 Thread Davis, Arlin R

Add checking for pipe FD's during destroy and clean them up with close.

Signed-off-by: Arlin Davis arlin.r.da...@intel.com
---
 dapl/openib_cma/dapl_ib_cq.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/dapl/openib_cma/dapl_ib_cq.c b/dapl/openib_cma/dapl_ib_cq.c
index cf19f38..c54bbaf 100644
--- a/dapl/openib_cma/dapl_ib_cq.c
+++ b/dapl/openib_cma/dapl_ib_cq.c
@@ -462,8 +462,11 @@ dapls_ib_wait_object_create(IN DAPL_EVD *evd_ptr,
         ibv_create_comp_channel(
             evd_ptr->header.owner_ia->hca_ptr->ib_hca_handle);

-   if ((*p_cq_wait_obj_handle)->events == NULL)
+   if ((*p_cq_wait_obj_handle)->events == NULL) {
+       close((*p_cq_wait_obj_handle)->pipe[0]);
+       close((*p_cq_wait_obj_handle)->pipe[1]);
        goto bail;
+   }
 
    return DAT_SUCCESS;
 bail:
@@ -483,6 +486,9 @@ dapls_ib_wait_object_destroy(IN ib_wait_obj_handle_t p_cq_wait_obj_handle)
 
    ibv_destroy_comp_channel(p_cq_wait_obj_handle->events);
 
+   close(p_cq_wait_obj_handle->pipe[0]);
+   close(p_cq_wait_obj_handle->pipe[1]);
+
    dapl_os_free(p_cq_wait_obj_handle,
                 sizeof(struct _ib_wait_obj_handle));
 
-- 
1.5.2.5
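
The pairing this patch restores (every pipe() at create time matched by two close() calls at destroy time) can be illustrated in isolation. The sketch below is not uDAPL code: struct wait_obj, wait_obj_create(), wait_obj_destroy() and fd_is_open() are hypothetical names standing in for the real wait-object handling, and only POSIX pipe(), close() and fcntl() are used.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for a wait object that owns a wakeup pipe. */
struct wait_obj {
    int pipe[2];
};

/* Returns nonzero if 'fd' currently refers to an open descriptor. */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/* Creates the pipe; on failure both slots are set to -1. */
static int wait_obj_create(struct wait_obj *w)
{
    assert(w != NULL);
    if (pipe(w->pipe) != 0) {
        w->pipe[0] = w->pipe[1] = -1;
        return -1;
    }
    return 0;
}

/* Closes both ends so that repeated create/destroy cycles do not leak FDs. */
static void wait_obj_destroy(struct wait_obj *w)
{
    assert(w != NULL);
    if (w->pipe[0] >= 0)
        close(w->pipe[0]);
    if (w->pipe[1] >= 0)
        close(w->pipe[1]);
    w->pipe[0] = w->pipe[1] = -1;
}
```

Without the close() calls in the destroy path, each create/destroy cycle would leave two descriptors open until the process hits its FD limit.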



[PATCH] uDAPL v1.2 - cma: memory leak of verbs CQ and completion channels created during dat_ia_open

2010-04-12 Thread Davis, Arlin R

check/cleanup CQ and completion channels during dat_ia_close

Signed-off-by: Arlin Davis arlin.r.da...@intel.com
---
 dapl/openib_cma/dapl_ib_util.c |   22 ++++++++++++++++------
 1 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c
index 9d97ae1..00aa5fb 100755
--- a/dapl/openib_cma/dapl_ib_util.c
+++ b/dapl/openib_cma/dapl_ib_util.c
@@ -373,12 +373,6 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA *hca_ptr)
    dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: %p->%p\n",
                 hca_ptr,hca_ptr->ib_hca_handle);
 
-   if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) {
-       if (rdma_destroy_id(hca_ptr->ib_trans.cm_id)) 
-           return(dapl_convert_errno(errno,"ib_close_device"));
-       hca_ptr->ib_hca_handle = IB_INVALID_HANDLE;
-   }
-
    dapl_os_lock(&g_hca_lock);
    if (g_ib_thread_state != IB_THREAD_RUN) {
        dapl_os_unlock(&g_hca_lock);
@@ -410,6 +404,22 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA *hca_ptr)
        nanosleep (&sleep, &remain);
    }
 bail:
+   if (hca_ptr->ib_trans.ib_cq)
+       ibv_destroy_comp_channel(hca_ptr->ib_trans.ib_cq);
+
+   if (hca_ptr->ib_trans.ib_cq_empty) {
+       struct ibv_comp_channel *channel;
+       channel = hca_ptr->ib_trans.ib_cq_empty->channel;
+       ibv_destroy_cq(hca_ptr->ib_trans.ib_cq_empty);
+       ibv_destroy_comp_channel(channel);
+   }
+
+   if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) {
+       if (rdma_destroy_id(hca_ptr->ib_trans.cm_id))
+           return (dapl_convert_errno(errno, "ib_close_device"));
+       hca_ptr->ib_hca_handle = IB_INVALID_HANDLE;
+   }
+
    return (DAT_SUCCESS);
 }
   
-- 
1.5.2.5



Re: IPoIB performance benchmarking

2010-04-12 Thread Dave Olson
On Mon, 12 Apr 2010, Tom Ammon wrote:
| Thanks for the pointer. I thought it was running in connected mode, and 
| looking at that variable that you mentioned confirms it:


| [r...@gateway3 ~]# ifconfig ib0
| ib0   Link encap:InfiniBand  HWaddr 
| 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
|inet addr:192.168.23.253  Bcast:192.168.23.255  Mask:255.255.254.0
|RX packets:2319010 errors:0 dropped:0 overruns:0 frame:0
|TX packets:4512605 errors:0 dropped:33011 overruns:0 carrier:0

That's a lot of packets dropped on the tx side.

If you have the qlogic software installed, running ipathstats -c1 while
you are running the test would be useful, otherwise perfquery -r at
start and another perfquery at the end on both nodes might point to
something.

Oh, and depending on your tcp stack tuning, setting the receive and/or
send buffer size might help.   These are all ddr results, on a more
or less OFED 1.5.1 stack (completely unofficial, blah blah).

And yes, multi-thread will bring the results up (iperf, rather than netperf).

# netperf -H ib-host TCP_STREAM -- -m 65536  
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  65536  65536    10.03    5150.24
# netperf -H ib-host TCP_STREAM -- -m 65536 -S 131072
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262144  65536  65536    10.03    5401.83

# netperf -H ib-host TCP_STREAM -- -m 65536 -S 262144
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

524288  65536  65536    10.01    5478.28


Dave Olson
dave.ol...@qlogic.com


Re: [PATCH] ummunotify: Userspace support for MMU notifications

2010-04-12 Thread Andrew Morton
On Mon, 12 Apr 2010 07:22:17 +0100
Eric B Munson ebmun...@us.ibm.com wrote:

 Andrew,
 
 I am resubmitting this patch because I believe that the discussion
 has shown this to be an acceptable solution.

To whom?  Some acked-by's would clarify.

  I have fixed the 32 bit
 build errors, but other than that change, the code is the same as
 Roland's V3 patch.
 
 From: Roland Dreier rola...@cisco.com
 
 As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
 and follow-up messages, libraries using RDMA would like to track
 precisely when application code changes memory mapping via free(),
 munmap(), etc.  Current pure-userspace solutions using malloc hooks
 and other tricks are not robust, and the feeling among experts is that
 the issue is unfixable without kernel help.

But this info could be reassembled by tracking syscall activity, yes? 
Perhaps some discussion here explaining why the (possibly enhanced)
ptrace, audit, etc interfaces are unsuitable.

 We solve this not by implementing the full API proposed in the email
 linked above but rather with a simpler and more generic interface,
 which may be useful in other contexts.  Specifically, we implement a
 new character device driver, ummunotify, that creates a /dev/ummunotify
 node.  A userspace process can open this node read-only and use the fd
 as follows:
 
  1. ioctl() to register/unregister an address range to watch in the
 kernel (cf struct ummunotify_register_ioctl in linux/ummunotify.h).
 
  2. read() to retrieve events generated when a mapping in a watched
 address range is invalidated (cf struct ummunotify_event in
 linux/ummunotify.h).  select()/poll()/epoll() and SIGIO are
 handled for this IO.
 
  3. mmap() one page at offset 0 to map a kernel page that contains a
 generation counter that is incremented each time an event is
 generated.  This allows userspace to have a fast path that checks
 that no events have occurred without a system call.

OK, what's missing from this whole description and from ummunotify.txt
is: how does one specify the target process?  Does /dev/ummunotify
implicitly attach to current-mm?  If so, why, and what are the
implications of this?

If instead it is possible to attach to some other process's mmu
activity (/proc/pid/ummunotify?) then how is that done and what are
the security/permissions implications?

Also, the whole thing is obviously racy: by the time userspace finds
out that something has happened, it might have changed.  This
inevitably reduces the applicability/usefulness of the whole thing as
compared to some synchronous mechanism which halts the monitored thread
until the request has been processed and acked.  All this should (IMO)
be explored, explained and justified.

Also, what prevents the obvious DoS which occurs when I register for
events and just let them queue up until the kernel runs out of memory?
Presumably events get dropped - what are the reliability implications
of this and how is the max queue length managed?

Also, ioctls are unpopular.  Were other interfaces considered?

 Thanks to Jason Gunthorpe jgunthorpe at obsidianresearch.com for
 suggestions on the interface design.  Also thanks to Jeff Squyres
 jsquyres at cisco.com for prototyping support for this in Open MPI, which
 helped find several bugs during development.
 
 Signed-off-by: Roland Dreier rola...@cisco.com
 Signed-off-by: Eric B Munson ebmun...@us.ibm.com
 
 ---
 
 Changes since v3:
  - Replaced [get|put]_user with copy_[from|to]_user to fix x86
    builds
 ---
  Documentation/Makefile  |3 +-
  Documentation/ummunotify/Makefile   |7 +
  Documentation/ummunotify/ummunotify.txt |  150 
  Documentation/ummunotify/umn-test.c |  200 +++
  drivers/char/Kconfig|   12 +
  drivers/char/Makefile   |1 +
  drivers/char/ummunotify.c   |  567 
 +++
  include/linux/ummunotify.h  |  121 +++
  8 files changed, 1060 insertions(+), 1 deletions(-)
  create mode 100644 Documentation/ummunotify/Makefile
  create mode 100644 Documentation/ummunotify/ummunotify.txt
  create mode 100644 Documentation/ummunotify/umn-test.c
  create mode 100644 drivers/char/ummunotify.c
  create mode 100644 include/linux/ummunotify.h
 
 diff --git a/Documentation/Makefile b/Documentation/Makefile
 index 6fc7ea1..27ba76a 100644
 --- a/Documentation/Makefile
 +++ b/Documentation/Makefile
 @@ -1,3 +1,4 @@
  obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
   filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
 - pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
 + pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
 + watchdog/src/
 diff --git a/Documentation/ummunotify/Makefile 
 b/Documentation/ummunotify/Makefile
 new file mode 100644
 index 000..89f31a0
 --- /dev/null
 +++ b/Documentation/ummunotify/Makefile
 @@ -0,0 +1,7 @@
 +# List