Re: [PATCH] ummunotify: Userspace support for MMU notifications V2
Hi,

I understand that this patch went through to the -mm tree.
MVAPICH/MVAPICH2 MPI stacks intend to utilize this feature as well.

Thanks.

On Thu, Apr 22, 2010 at 6:38 AM, Eric B Munson <ebmun...@us.ibm.com> wrote:

> From: Roland Dreier <rola...@cisco.com>
>
> As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
> and follow-up messages, libraries using RDMA would like to track
> precisely when application code changes memory mapping via free(),
> munmap(), etc.  Current pure-userspace solutions using malloc hooks
> and other tricks are not robust, and the feeling among experts is that
> the issue is unfixable without kernel help.
>
> We solve this not by implementing the full API proposed in the email
> linked above but rather with a simpler and more generic interface,
> which may be useful in other contexts.  Specifically, we implement a
> new character device driver, ummunotify, that creates a
> /dev/ummunotify node.  A userspace process can open this node
> read-only and use the fd as follows:
>
>  1. ioctl() to register/unregister an address range to watch in the
>     kernel (cf struct ummunotify_register_ioctl in linux/ummunotify.h).
>
>  2. read() to retrieve events generated when a mapping in a watched
>     address range is invalidated (cf struct ummunotify_event in
>     linux/ummunotify.h).  select()/poll()/epoll() and SIGIO are
>     handled for this IO.
>
>  3. mmap() one page at offset 0 to map a kernel page that contains a
>     generation counter that is incremented each time an event is
>     generated.  This allows userspace to have a fast path that checks
>     that no events have occurred without a system call.
>
> Thanks to Jason Gunthorpe <jgunthorpe at obsidianresearch.com> for
> suggestions on the interface design.  Also thanks to Jeff Squyres
> <jsquyres at cisco.com> for prototyping support for this in Open MPI,
> which helped find several bugs during development.
>
> Signed-off-by: Roland Dreier <rola...@cisco.com>
> Signed-off-by: Eric B Munson <ebmun...@us.ibm.com>
> ---
>
> Changes from V1:
> - Update Kbuild to handle test program build properly
> - Update documentation to cover questions not addressed in previous
>   thread
> ---
>  Documentation/Makefile                  |    3 +-
>  Documentation/ummunotify/Makefile       |    7 +
>  Documentation/ummunotify/ummunotify.txt |  162 +
>  Documentation/ummunotify/umn-test.c     |  200 +++
>  drivers/char/Kconfig                    |   12 +
>  drivers/char/Makefile                   |    1 +
>  drivers/char/ummunotify.c               |  567 +++
>  include/linux/Kbuild                    |    1 +
>  include/linux/ummunotify.h              |  121 +++
>  9 files changed, 1073 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/ummunotify/Makefile
>  create mode 100644 Documentation/ummunotify/ummunotify.txt
>  create mode 100644 Documentation/ummunotify/umn-test.c
>  create mode 100644 drivers/char/ummunotify.c
>  create mode 100644 include/linux/ummunotify.h
>
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 6fc7ea1..27ba76a 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -1,3 +1,4 @@
>  obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
>  	filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
> -	pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
> +	pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
> +	watchdog/src/
> diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
> new file mode 100644
> index 000..89f31a0
> --- /dev/null
> +++ b/Documentation/ummunotify/Makefile
> @@ -0,0 +1,7 @@
> +# List of programs to build
> +hostprogs-y := umn-test
> +
> +# Tell kbuild to always build the programs
> +always := $(hostprogs-y)
> +
> +HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
> diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
> new file mode 100644
> index 000..d6c2ccc
> --- /dev/null
> +++ b/Documentation/ummunotify/ummunotify.txt
> @@ -0,0 +1,162 @@
> +UMMUNOTIFY
> +
> +  Ummunotify relays MMU notifier events to userspace.  This is useful
> +  for libraries that need to track the memory mapping of applications;
> +  for example, MPI implementations using RDMA want to cache memory
> +  registrations for performance, but tracking all possible crazy cases
> +  such as when, say, the FORTRAN runtime frees memory is impossible
> +  without kernel help.
> +
> +Basic Model
> +
> +  A userspace process uses it by opening /dev/ummunotify, which
> +  returns a file descriptor.  Interest in address ranges is registered
> +  using ioctl() and MMU notifier events are retrieved using read(), as
> +  described in more detail below.  Userspace can register multiple
> +  address ranges to watch, and can unregister individual ranges.
> +
> +  Userspace can also mmap() a single read-only page at offset 0 on
> +  this file descriptor.  This page contains (at offset 0) a single
> +  64-bit generation
RE: [PATCH] ummunotify: Userspace support for MMU notifications V2
>> As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
>> and follow-up messages, libraries using RDMA would like to track
>> precisely when application code changes memory mapping via free(),
>> munmap(), etc.  Current pure-userspace solutions using malloc hooks
>> and other tricks are not robust, and the feeling among experts is that
>> the issue is unfixable without kernel help.
>
> Sorry for not replying earlier -- just to throw in my $0.02 here: the
> MPI community is *very interested* in having this stuff in upstream
> kernels.  It solves a fairly major problem for us.
>
> Open MPI (www.open-mpi.org) is ready to pretty much immediately take
> advantage of these capabilities.  The code to use ummunotify is in a
> Mercurial branch; we're only waiting for ummunotify to go upstream
> before committing our support for it to our main SVN development trunk.

Intel's MPI team has examined this proposal as well and would also like
to see this merged upstream.  It is helpful in implementing MPI over
RDMA devices.

- Sean
Re: [PATCH] ummunotify: Userspace support for MMU notifications V2
On Apr 22, 2010, at 9:38 AM, Eric B Munson wrote:

> From: Roland Dreier <rola...@cisco.com>
>
> As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
> and follow-up messages, libraries using RDMA would like to track
> precisely when application code changes memory mapping via free(),
> munmap(), etc.  Current pure-userspace solutions using malloc hooks
> and other tricks are not robust, and the feeling among experts is that
> the issue is unfixable without kernel help.

Sorry for not replying earlier -- just to throw in my $0.02 here: the
MPI community is *very interested* in having this stuff in upstream
kernels.  It solves a fairly major problem for us.

Open MPI (www.open-mpi.org) is ready to pretty much immediately take
advantage of these capabilities.  The code to use ummunotify is in a
Mercurial branch; we're only waiting for ummunotify to go upstream
before committing our support for it to our main SVN development trunk.

> We solve this not by implementing the full API proposed in the email
> linked above but rather with a simpler and more generic interface,
> which may be useful in other contexts.  Specifically, we implement a
> new character device driver, ummunotify, that creates a
> /dev/ummunotify node.  A userspace process can open this node
> read-only and use the fd as follows:
>
>  1. ioctl() to register/unregister an address range to watch in the
>     kernel (cf struct ummunotify_register_ioctl in linux/ummunotify.h).
>
>  2. read() to retrieve events generated when a mapping in a watched
>     address range is invalidated (cf struct ummunotify_event in
>     linux/ummunotify.h).  select()/poll()/epoll() and SIGIO are
>     handled for this IO.
>
>  3. mmap() one page at offset 0 to map a kernel page that contains a
>     generation counter that is incremented each time an event is
>     generated.  This allows userspace to have a fast path that checks
>     that no events have occurred without a system call.
>
> Thanks to Jason Gunthorpe <jgunthorpe at obsidianresearch.com> for
> suggestions on the interface design.  Also thanks to Jeff Squyres
> <jsquyres at cisco.com> for prototyping support for this in Open MPI,
> which helped find several bugs during development.
>
> Signed-off-by: Roland Dreier <rola...@cisco.com>
> Signed-off-by: Eric B Munson <ebmun...@us.ibm.com>

Acked-by: Jeff Squyres <jsquy...@cisco.com>

> ---
>
> Changes from V1:
> - Update Kbuild to handle test program build properly
> - Update documentation to cover questions not addressed in previous
>   thread
> ---
>  Documentation/Makefile                  |    3 +-
>  Documentation/ummunotify/Makefile       |    7 +
>  Documentation/ummunotify/ummunotify.txt |  162 +
>  Documentation/ummunotify/umn-test.c     |  200 +++
>  drivers/char/Kconfig                    |   12 +
>  drivers/char/Makefile                   |    1 +
>  drivers/char/ummunotify.c               |  567 +++
>  include/linux/Kbuild                    |    1 +
>  include/linux/ummunotify.h              |  121 +++
>  9 files changed, 1073 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/ummunotify/Makefile
>  create mode 100644 Documentation/ummunotify/ummunotify.txt
>  create mode 100644 Documentation/ummunotify/umn-test.c
>  create mode 100644 drivers/char/ummunotify.c
>  create mode 100644 include/linux/ummunotify.h
>
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 6fc7ea1..27ba76a 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -1,3 +1,4 @@
>  obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
>  	filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
> -	pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
> +	pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
> +	watchdog/src/
> diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
> new file mode 100644
> index 000..89f31a0
> --- /dev/null
> +++ b/Documentation/ummunotify/Makefile
> @@ -0,0 +1,7 @@
> +# List of programs to build
> +hostprogs-y := umn-test
> +
> +# Tell kbuild to always build the programs
> +always := $(hostprogs-y)
> +
> +HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
> diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
> new file mode 100644
> index 000..d6c2ccc
> --- /dev/null
> +++ b/Documentation/ummunotify/ummunotify.txt
> @@ -0,0 +1,162 @@
> +UMMUNOTIFY
> +
> +  Ummunotify relays MMU notifier events to userspace.  This is useful
> +  for libraries that need to track the memory mapping of applications;
> +  for example, MPI implementations using RDMA want to cache memory
> +  registrations for performance, but tracking all possible crazy cases
> +  such as when, say, the FORTRAN runtime frees memory is impossible
> +  without kernel help.
> +
> +Basic Model
> +
> +  A userspace process uses it by opening /dev/ummunotify, which
> +  returns a file descriptor.  Interest in address ranges is registered
> +  using
[PATCH] ummunotify: Userspace support for MMU notifications V2
From: Roland Dreier <rola...@cisco.com>

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunotify, that creates a
/dev/ummunotify node.  A userspace process can open this node
read-only and use the fd as follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf struct ummunotify_register_ioctl in linux/ummunotify.h).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf struct ummunotify_event in
    linux/ummunotify.h).  select()/poll()/epoll() and SIGIO are
    handled for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated.  This allows userspace to have a fast path that checks
    that no events have occurred without a system call.

Thanks to Jason Gunthorpe <jgunthorpe at obsidianresearch.com> for
suggestions on the interface design.  Also thanks to Jeff Squyres
<jsquyres at cisco.com> for prototyping support for this in Open MPI,
which helped find several bugs during development.

Signed-off-by: Roland Dreier <rola...@cisco.com>
Signed-off-by: Eric B Munson <ebmun...@us.ibm.com>
---

Changes from V1:
- Update Kbuild to handle test program build properly
- Update documentation to cover questions not addressed in previous
  thread
---
 Documentation/Makefile                  |    3 +-
 Documentation/ummunotify/Makefile       |    7 +
 Documentation/ummunotify/ummunotify.txt |  162 +
 Documentation/ummunotify/umn-test.c     |  200 +++
 drivers/char/Kconfig                    |   12 +
 drivers/char/Makefile                   |    1 +
 drivers/char/ummunotify.c               |  567 +++
 include/linux/Kbuild                    |    1 +
 include/linux/ummunotify.h              |  121 +++
 9 files changed, 1073 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/ummunotify/Makefile
 create mode 100644 Documentation/ummunotify/ummunotify.txt
 create mode 100644 Documentation/ummunotify/umn-test.c
 create mode 100644 drivers/char/ummunotify.c
 create mode 100644 include/linux/ummunotify.h

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 6fc7ea1..27ba76a 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -1,3 +1,4 @@
 obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
 	filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
-	pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
+	pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
+	watchdog/src/
diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
new file mode 100644
index 000..89f31a0
--- /dev/null
+++ b/Documentation/ummunotify/Makefile
@@ -0,0 +1,7 @@
+# List of programs to build
+hostprogs-y := umn-test
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
new file mode 100644
index 000..d6c2ccc
--- /dev/null
+++ b/Documentation/ummunotify/ummunotify.txt
@@ -0,0 +1,162 @@
+UMMUNOTIFY
+
+  Ummunotify relays MMU notifier events to userspace.  This is useful
+  for libraries that need to track the memory mapping of applications;
+  for example, MPI implementations using RDMA want to cache memory
+  registrations for performance, but tracking all possible crazy cases
+  such as when, say, the FORTRAN runtime frees memory is impossible
+  without kernel help.
+
+Basic Model
+
+  A userspace process uses it by opening /dev/ummunotify, which
+  returns a file descriptor.  Interest in address ranges is registered
+  using ioctl() and MMU notifier events are retrieved using read(), as
+  described in more detail below.  Userspace can register multiple
+  address ranges to watch, and can unregister individual ranges.
+
+  Userspace can also mmap() a single read-only page at offset 0 on
+  this file descriptor.  This page contains (at offset 0) a single
+  64-bit generation counter that the kernel increments each time an
+  MMU notifier event occurs.  Userspace can use this to very quickly
+  check if there are any events to retrieve without needing to do a
+  system call.
+
+Control
+
+  To start using ummunotify, a process opens /dev/ummunotify in
+  read-only mode.  This will attach