[PATCH] ummunotify: Userspace support for MMU notifications
Andrew, I am resubmitting this patch because I believe that the discussion has shown this to be an acceptable solution. I have fixed the 32-bit build errors, but other than that change, the code is the same as Roland's V3 patch.

From: Roland Dreier rola...@cisco.com

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925 and follow-up messages, libraries using RDMA would like to track precisely when application code changes memory mapping via free(), munmap(), etc. Current pure-userspace solutions using malloc hooks and other tricks are not robust, and the feeling among experts is that the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email linked above, but rather with a simpler and more generic interface, which may be useful in other contexts. Specifically, we implement a new character device driver, ummunotify, that creates a /dev/ummunotify node. A userspace process can open this node read-only and use the fd as follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf. struct ummunotify_register_ioctl in linux/ummunotify.h).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf. struct ummunotify_event in
    linux/ummunotify.h). select()/poll()/epoll() and SIGIO are handled
    for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated. This allows userspace to have a fast path that checks
    that no events have occurred without a system call.

Thanks to Jason Gunthorpe jgunthorpe at obsidianresearch.com for suggestions on the interface design. Also thanks to Jeff Squyres jsquyres at cisco.com for prototyping support for this in Open MPI, which helped find several bugs during development.
Signed-off-by: Roland Dreier rola...@cisco.com
Signed-off-by: Eric B Munson ebmun...@us.ibm.com
---
Changes since v3:
 - Replaced [get|put]_user with copy_[from|to]_user to fix x86 builds
---
 Documentation/Makefile                  |    3 +-
 Documentation/ummunotify/Makefile       |    7 +
 Documentation/ummunotify/ummunotify.txt |  150
 Documentation/ummunotify/umn-test.c     |  200 +++
 drivers/char/Kconfig                    |   12 +
 drivers/char/Makefile                   |    1 +
 drivers/char/ummunotify.c               |  567 +++
 include/linux/ummunotify.h              |  121 +++
 8 files changed, 1060 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/ummunotify/Makefile
 create mode 100644 Documentation/ummunotify/ummunotify.txt
 create mode 100644 Documentation/ummunotify/umn-test.c
 create mode 100644 drivers/char/ummunotify.c
 create mode 100644 include/linux/ummunotify.h

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 6fc7ea1..27ba76a 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -1,3 +1,4 @@
 obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
 	filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
-	pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
+	pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
+	watchdog/src/
diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
new file mode 100644
index 000..89f31a0
--- /dev/null
+++ b/Documentation/ummunotify/Makefile
@@ -0,0 +1,7 @@
+# List of programs to build
+hostprogs-y := umn-test
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
new file mode 100644
index 000..78a79c2
--- /dev/null
+++ b/Documentation/ummunotify/ummunotify.txt
@@ -0,0 +1,150 @@
+UMMUNOTIFY
+
+  Ummunotify relays MMU notifier events to userspace.  This is useful
+  for libraries that need to track the memory mapping of applications;
+  for example, MPI implementations using RDMA want to cache memory
+  registrations for performance, but tracking all possible crazy cases
+  such as when, say, the FORTRAN runtime frees memory is impossible
+  without kernel help.
+
+Basic Model
+
+  A userspace process uses it by opening /dev/ummunotify, which
+  returns a file descriptor.  Interest in address ranges is registered
+  using ioctl() and MMU notifier events are retrieved using read(), as
+  described in more detail below.  Userspace can register multiple
+  address ranges to watch, and can unregister individual ranges.
+
+  Userspace can also mmap() a single read-only page at offset 0 on
+  this file descriptor.  This page contains (at offset 0) a single
+  64-bit generation counter that the kernel increments each time an
+  MMU notifier event occurs.  Userspace can use this to very quickly
+  check if there are any events to retrieve without needing to do a
+  system call.
Socket Direct Protocol: help
Good morning,

I'm testing some NetEffect cards (Intel code E10G81GP - NetEffect NE020.LP.1.SSR). The PC runs Linux kernel 2.6.18-164.15.1.el5 x86_64.

In this phase, I measure the bandwidth with ad hoc netserver/netperf tests (version netperf-2.4.5). They work fine with the TCP protocol - as do the OFED 1.5.1 example programs - but they have some problems with SDP. I'm running the tests with the command lines below:

server: LD_PRELOAD=/usr/local/lib64/libsdp.so netserver
client: LD_PRELOAD=/usr/local/lib64/libsdp.so netperf -H server_address -c -C -- -m 65536

The /etc/libsdp.conf file contains the rules below:

use both listen * *:*
use both connect * *:*
log min-level 9 destination file libsdp.log

The client displays "Connection error: Can not allocate memory" and the connection fails. (Original text in the client log file: "libsdp Error connect: failed for SDP fd:6 with error:Cannot allocate memory".)

The library path is /usr/local/lib64/libsdp.so. Could someone explain to me how the LD_PRELOAD environment variable must be set? I don't understand why the tests work with TCP and not with SDP. Could I be working with the wrong Linux kernel environment or parameters?

I don't know if there is a specific mailing list for SDP, so I am asking for your help.

Thank you very much,
Andrea

Andrea Gozzelino
INFN - Laboratori Nazionali di Legnaro (LNL)
Viale dell'Universita' 2
I-35020 - Legnaro (PD) - ITALIA
Tel: +39 049 8068346
Fax: +39 049 641925
Mail: andrea.gozzel...@lnl.infn.it
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 0/2] Add support for enhanced atomic operations
Vladimir Sokolovsky wrote:
> Roland Dreier wrote:
> > > Hence, I think it would be cleaner if a new capability,
> > > masked_atomic_cap, were introduced, using the original definitions
> > > (NONE, HCA, GLOB). Vlad, what do you think about that?
> >
> > The more I think about it, the cleaner this seems to me. And it
> > doesn't even consume a device capability flag bit, which is a nice
> > bonus.
>
> Hi Roland,
>
> Do you propose to use IB_ATOMIC_GLOB instead of IB_ATOMIC_HCA while
> setting the atomic capability in the code below?
>
> 	props->atomic_cap = dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_ATOMIC ?
> 		IB_ATOMIC_HCA : IB_ATOMIC_NONE;
>
> Or add IB_MASKED_ATOMIC to the ib_atomic_cap enum and use this one
> instead of IB_ATOMIC_HCA? All this, of course, comes to replace setting
> IB_DEVICE_MASKED_ATOMIC for the device capability.
>
> Thanks,
> Vladimir

Hi Roland,

Can you comment?

Thanks,
Vladimir
Socket Direct Protocol: help (2)
On Apr 12, 2010 10:14 AM, Andrea Gozzelino andrea.gozzel...@lnl.infn.it wrote:
> [...]

Hi all,

I would add that in the kernel-space SDP debug output the error is:

command line: dmesg

sdp_init_qp:95 sdp_sock( 2100:2 40720:0): recv sge's. capability: 4 needed: 9
sdp_init_qp:95 sdp_sock( 2100:2 41203:0): recv sge's. capability: 4 needed: 9

The function sdp_init_qp() is defined in /usr/src/ofa_kernel-1.5.1/drivers/infiniband/ulp/sdp/sdp_cma.c (lines 76 - 141).

Could it be a firmware problem? I have this situation:

command line: ethtool -i eth2

driver: iw_nes
version: 1.5.0.0
firmware-version: 3.16
bus-info: :03:00.0

Thank you very much,
Andrea

Andrea Gozzelino
INFN - Laboratori Nazionali di Legnaro (LNL)
Viale dell'Universita' 2
I-35020 - Legnaro (PD) - ITALIA
Tel: +39 049 8068346
Fax: +39 049 641925
Mail: andrea.gozzel...@lnl.infn.it
Re: [PATCH v2 38/51] IB/qib: Add qib_sysfs.c
On Fri, 2010-04-09 at 17:27 -0700, Jason Gunthorpe wrote:
> On Fri, Apr 09, 2010 at 05:13:24PM -0700, Ralph Campbell wrote:
> > For the QSFP data, I hope I can leave it as is since it is related to
> > the link state that the other files contain. It is a read-only file
> > so no issue with trying to set a value.
>
> There was some flak for other stuff like this a while back. IMHO, it
> would be appropriate to have a hex dump of the entire QSFP EEPROM and
> leave parsing to userspace, or put the parsed version in debugfs..
>
> Jason

OK. I will move it to our file system which is used to export binary data.
IPoIB performance benchmarking
Hi,

I'm trying to do some performance benchmarking of IPoIB on a DDR IB cluster, and I am having a hard time understanding what I am seeing.

When I do a simple netperf, I get results like these:

[r...@gateway3 ~]# netperf -H 192.168.23.252
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.23.252 (192.168.23.252) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  65536  65536    10.01    4577.70

Which is disappointing, since it is simply two DDR IB-connected nodes plugged into a DDR switch - I would expect much higher throughput than that. When I do a test with ibv_srq_pingpong (using the same message size reported above), here's what I get:

[r...@gateway3 ~]# ibv_srq_pingpong 192.168.23.252 -m 4096 -s 65536
  local address:  LID 0x012b, QPN 0x000337, PSN 0x19cc85
  local address:  LID 0x012b, QPN 0x000338, PSN 0x956fc2
  ... [output omitted] ...
  remote address: LID 0x0129, QPN 0x00032e, PSN 0x891ce3
131072000 bytes in 0.08 seconds = 12763.08 Mbit/sec
1000 iters in 0.08 seconds = 82.16 usec/iter

Which is much closer to what I would expect with DDR. The MTU on both of the QLogic DDR HCAs is set to 4096, as it is on the QLogic switch.

I know the above is not completely apples-to-apples, since ibv_srq_pingpong is layer 2 and is using 16 QPs. So I ran it again with only a single QP, to make it more roughly equivalent to my single-stream netperf test, and I still get almost double the performance:

[r...@gateway3 ~]# ibv_srq_pingpong 192.168.23.252 -m 4096 -s 65536 -q 1
  local address:  LID 0x012b, QPN 0x000347, PSN 0x65fb56
  remote address: LID 0x0129, QPN 0x00032f, PSN 0x5e52f9
131072000 bytes in 0.13 seconds = 8323.22 Mbit/sec
1000 iters in 0.13 seconds = 125.98 usec/iter

Is there something that I am not understanding here? Is there any way to make single-stream TCP IPoIB performance better than 4.5 Gb/s on a DDR network? Am I just not using the benchmarking tools correctly?

Thanks,

Tom

--
Tom Ammon
Network Engineer
Office: 801.587.0976
Mobile: 801.674.9273
Center for High Performance Computing
University of Utah
http://www.chpc.utah.edu
Re: [PATCH] ummunotify: Userspace support for MMU notifications
Hi!

> I am resubmitting this patch because I believe that the discussion has
> shown this to be an acceptable solution. I have fixed the 32-bit build
> errors, but other than that change, the code is the same as Roland's
> V3 patch.
>
> From: Roland Dreier rola...@cisco.com
>
> As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
> and follow-up messages, libraries using RDMA would like to track
> precisely when application code changes memory mapping via free(),
> munmap(), etc. Current pure-userspace solutions using malloc hooks and
> other tricks are not robust, and the feeling among experts is that the
> issue is unfixable without kernel help.

I do not know. I still believe that this does not belong in the kernel; an application should not need to trace itself to know what it does.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: IPoIB performance benchmarking
Dave,

Thanks for the pointer. I thought it was running in connected mode, and looking at that variable that you mentioned confirms it:

[r...@gateway3 ~]# cat /sys/class/net/ib0/mode
connected

And the IP MTU shows up as:

[r...@gateway3 ~]# ifconfig ib0
ib0       Link encap:InfiniBand  HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.23.253  Bcast:192.168.23.255  Mask:255.255.254.0
          inet6 addr: fe80::211:7500:ff:6edc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:2319010 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4512605 errors:0 dropped:33011 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:5450805352 (5.0 GiB)  TX bytes:154353169896 (143.7 GiB)

This is partly why I'm stumped - I've seen threads about how connected mode is supposed to improve IPoIB performance, but I'm not seeing as much performance as I'd like.

Tom

On 04/12/2010 02:19 PM, Dave Olson wrote:
> On Mon, 12 Apr 2010, Tom Ammon wrote:
> | I'm trying to do some performance benchmarking of IPoIB on a DDR IB
> | cluster, and I am having a hard time understanding what I am seeing.
> |
> | When I do a simple netperf, I get results like these:
> |
> | [r...@gateway3 ~]# netperf -H 192.168.23.252
> | TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> | 192.168.23.252 (192.168.23.252) port 0 AF_INET
> | Recv   Send    Send
> | Socket Socket  Message  Elapsed
> | Size   Size    Size     Time     Throughput
> | bytes  bytes   bytes    secs.    10^6bits/sec
> |
> |  87380  65536  65536    10.01    4577.70
>
> Are you using connected mode, or UD? Since you say you have a 4K MTU,
> I'm guessing you are using UD. Change to use connected mode (edit
> /etc/infiniband/openib.conf), or as a quick test
> 	echo connected > /sys/class/net/ib0/mode
> and then the MTU should show as 65520. That should help the bandwidth
> a fair amount.
>
> Dave Olson
> dave.ol...@qlogic.com

--
Tom Ammon
Network Engineer
Office: 801.587.0976
Mobile: 801.674.9273
Center for High Performance Computing
University of Utah
http://www.chpc.utah.edu
[PATCH] uDAPL v1.2 - cma: memory leak of FD's (pipe) created during dat_evd_create
Add checking for pipe FD's during destroy and clean them up with close.

Signed-off-by: Arlin Davis arlin.r.da...@intel.com
---
 dapl/openib_cma/dapl_ib_cq.c |    8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/dapl/openib_cma/dapl_ib_cq.c b/dapl/openib_cma/dapl_ib_cq.c
index cf19f38..c54bbaf 100644
--- a/dapl/openib_cma/dapl_ib_cq.c
+++ b/dapl/openib_cma/dapl_ib_cq.c
@@ -462,8 +462,11 @@ dapls_ib_wait_object_create(IN DAPL_EVD *evd_ptr,
 		ibv_create_comp_channel(
 			evd_ptr->header.owner_ia->hca_ptr->ib_hca_handle);
 
-	if ((*p_cq_wait_obj_handle)->events == NULL)
+	if ((*p_cq_wait_obj_handle)->events == NULL) {
+		close((*p_cq_wait_obj_handle)->pipe[0]);
+		close((*p_cq_wait_obj_handle)->pipe[1]);
 		goto bail;
+	}
 
 	return DAT_SUCCESS;
 bail:
@@ -483,6 +486,9 @@ dapls_ib_wait_object_destroy(IN ib_wait_obj_handle_t p_cq_wait_obj_handle)
 
 	ibv_destroy_comp_channel(p_cq_wait_obj_handle->events);
 
+	close(p_cq_wait_obj_handle->pipe[0]);
+	close(p_cq_wait_obj_handle->pipe[1]);
+
 	dapl_os_free(p_cq_wait_obj_handle,
 		     sizeof(struct _ib_wait_obj_handle));
--
1.5.2.5
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] uDAPL v1.2 - cma: memory leak of verbs CQ and completion channels created during dat_ia_open
check/cleanup CQ and completion channels during dat_ia_close

Signed-off-by: Arlin Davis arlin.r.da...@intel.com
---
 dapl/openib_cma/dapl_ib_util.c |   22 ++++++++++++++++------
 1 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c
index 9d97ae1..00aa5fb 100755
--- a/dapl/openib_cma/dapl_ib_util.c
+++ b/dapl/openib_cma/dapl_ib_util.c
@@ -373,12 +373,6 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA *hca_ptr)
 	dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: %p->%p\n",
 		     hca_ptr, hca_ptr->ib_hca_handle);
 
-	if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) {
-		if (rdma_destroy_id(hca_ptr->ib_trans.cm_id))
-			return (dapl_convert_errno(errno, "ib_close_device"));
-		hca_ptr->ib_hca_handle = IB_INVALID_HANDLE;
-	}
-
 	dapl_os_lock(&g_hca_lock);
 	if (g_ib_thread_state != IB_THREAD_RUN) {
 		dapl_os_unlock(&g_hca_lock);
@@ -410,6 +404,22 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA *hca_ptr)
 		nanosleep(&sleep, &remain);
 	}
 bail:
+	if (hca_ptr->ib_trans.ib_cq)
+		ibv_destroy_comp_channel(hca_ptr->ib_trans.ib_cq);
+
+	if (hca_ptr->ib_trans.ib_cq_empty) {
+		struct ibv_comp_channel *channel;
+		channel = hca_ptr->ib_trans.ib_cq_empty->channel;
+		ibv_destroy_cq(hca_ptr->ib_trans.ib_cq_empty);
+		ibv_destroy_comp_channel(channel);
+	}
+
+	if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) {
+		if (rdma_destroy_id(hca_ptr->ib_trans.cm_id))
+			return (dapl_convert_errno(errno, "ib_close_device"));
+		hca_ptr->ib_hca_handle = IB_INVALID_HANDLE;
+	}
+
 	return (DAT_SUCCESS);
 }
--
1.5.2.5
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IPoIB performance benchmarking
On Mon, 12 Apr 2010, Tom Ammon wrote:
| Thanks for the pointer. I thought it was running in connected mode, and
| looking at that variable that you mentioned confirms it:
| [r...@gateway3 ~]# ifconfig ib0
| ib0       Link encap:InfiniBand  HWaddr
| 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
|           inet addr:192.168.23.253  Bcast:192.168.23.255  Mask:255.255.254.0
|           RX packets:2319010 errors:0 dropped:0 overruns:0 frame:0
|           TX packets:4512605 errors:0 dropped:33011 overruns:0 carrier:0

That's a lot of packets dropped on the tx side. If you have the QLogic software installed, running ipathstats -c1 while you are running the test would be useful; otherwise perfquery -r at the start and another perfquery at the end on both nodes might point to something.

Oh, and depending on your TCP stack tuning, setting the receive and/or send buffer size might help.

These are all DDR results, on a more or less OFED 1.5.1 stack (completely unofficial, blah blah). And yes, multi-threading will bring the results up (iperf, rather than netperf).

# netperf -H ib-host TCP_STREAM -- -m 65536
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  65536  65536    10.03    5150.24

# netperf -H ib-host TCP_STREAM -- -m 65536 -S 131072
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262144  65536  65536    10.03    5401.83

# netperf -H ib-host TCP_STREAM -- -m 65536 -S 262144
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

524288  65536  65536    10.01    5478.28

Dave Olson
dave.ol...@qlogic.com
Re: [PATCH] ummunotify: Userspace support for MMU notifications
On Mon, 12 Apr 2010 07:22:17 +0100 Eric B Munson ebmun...@us.ibm.com wrote:

> Andrew, I am resubmitting this patch because I believe that the
> discussion has shown this to be an acceptable solution.

To whom? Some Acked-by's would clarify.

> I have fixed the 32-bit build errors, but other than that change, the
> code is the same as Roland's V3 patch.
>
> From: Roland Dreier rola...@cisco.com
>
> As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
> and follow-up messages, libraries using RDMA would like to track
> precisely when application code changes memory mapping via free(),
> munmap(), etc. Current pure-userspace solutions using malloc hooks and
> other tricks are not robust, and the feeling among experts is that the
> issue is unfixable without kernel help.

But this info could be reassembled by tracking syscall activity, yes? Perhaps some discussion here explaining why the (possibly enhanced) ptrace, audit, etc. interfaces are unsuitable.

> We solve this not by implementing the full API proposed in the email
> linked above but rather with a simpler and more generic interface,
> which may be useful in other contexts. Specifically, we implement a new
> character device driver, ummunotify, that creates a /dev/ummunotify
> node. A userspace process can open this node read-only and use the fd
> as follows:
>
> 1. ioctl() to register/unregister an address range to watch in the
>    kernel (cf. struct ummunotify_register_ioctl in linux/ummunotify.h).
>
> 2. read() to retrieve events generated when a mapping in a watched
>    address range is invalidated (cf. struct ummunotify_event in
>    linux/ummunotify.h). select()/poll()/epoll() and SIGIO are handled
>    for this IO.
>
> 3. mmap() one page at offset 0 to map a kernel page that contains a
>    generation counter that is incremented each time an event is
>    generated. This allows userspace to have a fast path that checks
>    that no events have occurred without a system call.
OK, what's missing from this whole description and from ummunotify.txt is: how does one specify the target process? Does /dev/ummunotify implicitly attach to current->mm? If so, why, and what are the implications of this? If instead it is possible to attach to some other process's MMU activity (/proc/pid/ummunotify?) then how is that done, and what are the security/permissions implications?

Also, the whole thing is obviously racy: by the time userspace finds out that something has happened, it might have changed. This inevitably reduces the applicability/usefulness of the whole thing as compared to some synchronous mechanism which halts the monitored thread until the request has been processed and acked. All this should (IMO) be explored, explained and justified.

Also, what prevents the obvious DoS which occurs when I register for events and just let them queue up until the kernel runs out of memory? Presumably events get dropped - what are the reliability implications of this, and how is the max queue length managed?

Also, ioctls are unpopular. Were other interfaces considered?

> Thanks to Jason Gunthorpe jgunthorpe at obsidianresearch.com for
> suggestions on the interface design. Also thanks to Jeff Squyres
> jsquyres at cisco.com for prototyping support for this in Open MPI,
> which helped find several bugs during development.
> Signed-off-by: Roland Dreier rola...@cisco.com
> Signed-off-by: Eric B Munson ebmun...@us.ibm.com
> [...]