Re: [PATCH] ummunotify: Userspace support for MMU notifications V2

2010-05-11 Thread Sayantan Sur
Hi,

I understand that this patch went through to the -mm tree.
MVAPICH/MVAPICH2 MPI stacks intend to utilize this feature as well.

Thanks.

On Thu, Apr 22, 2010 at 6:38 AM, Eric B Munson ebmun...@us.ibm.com wrote:
 From: Roland Dreier rola...@cisco.com

 As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
 and follow-up messages, libraries using RDMA would like to track
 precisely when application code changes memory mapping via free(),
 munmap(), etc.  Current pure-userspace solutions using malloc hooks
 and other tricks are not robust, and the feeling among experts is that
 the issue is unfixable without kernel help.

 We solve this not by implementing the full API proposed in the email
 linked above but rather with a simpler and more generic interface,
 which may be useful in other contexts.  Specifically, we implement a
 new character device driver, ummunotify, that creates a /dev/ummunotify
 node.  A userspace process can open this node read-only and use the fd
 as follows:

  1. ioctl() to register/unregister an address range to watch in the
     kernel (cf struct ummunotify_register_ioctl in linux/ummunotify.h).

  2. read() to retrieve events generated when a mapping in a watched
     address range is invalidated (cf struct ummunotify_event in
     linux/ummunotify.h).  select()/poll()/epoll() and SIGIO are
     handled for this IO.

  3. mmap() one page at offset 0 to map a kernel page that contains a
     generation counter that is incremented each time an event is
     generated.  This allows userspace to have a fast path that checks
     that no events have occurred without a system call.

 Thanks to Jason Gunthorpe jgunthorpe at obsidianresearch.com for
 suggestions on the interface design.  Also thanks to Jeff Squyres
 jsquyres at cisco.com for prototyping support for this in Open MPI, which
 helped find several bugs during development.

 Signed-off-by: Roland Dreier rola...@cisco.com
 Signed-off-by: Eric B Munson ebmun...@us.ibm.com

 ---

 Changes from V1:
 - Update Kbuild to handle test program build properly
 - Update documentation to cover questions not addressed in previous
   thread
 ---
  Documentation/Makefile                  |    3 +-
  Documentation/ummunotify/Makefile       |    7 +
  Documentation/ummunotify/ummunotify.txt |  162 +
  Documentation/ummunotify/umn-test.c     |  200 +++
  drivers/char/Kconfig                    |   12 +
  drivers/char/Makefile                   |    1 +
  drivers/char/ummunotify.c               |  567 +++
  include/linux/Kbuild                    |    1 +
  include/linux/ummunotify.h              |  121 +++
  9 files changed, 1073 insertions(+), 1 deletions(-)
  create mode 100644 Documentation/ummunotify/Makefile
  create mode 100644 Documentation/ummunotify/ummunotify.txt
  create mode 100644 Documentation/ummunotify/umn-test.c
  create mode 100644 drivers/char/ummunotify.c
  create mode 100644 include/linux/ummunotify.h

 diff --git a/Documentation/Makefile b/Documentation/Makefile
 index 6fc7ea1..27ba76a 100644
 --- a/Documentation/Makefile
 +++ b/Documentation/Makefile
 @@ -1,3 +1,4 @@
  obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
     filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
 -   pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
 +   pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
 +   watchdog/src/
 diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
 new file mode 100644
 index 0000000..89f31a0
 --- /dev/null
 +++ b/Documentation/ummunotify/Makefile
 @@ -0,0 +1,7 @@
 +# List of programs to build
 +hostprogs-y := umn-test
 +
 +# Tell kbuild to always build the programs
 +always := $(hostprogs-y)
 +
 +HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
 diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
 new file mode 100644
 index 0000000..d6c2ccc
 --- /dev/null
 +++ b/Documentation/ummunotify/ummunotify.txt
 @@ -0,0 +1,162 @@
 +UMMUNOTIFY
 +
 +  Ummunotify relays MMU notifier events to userspace.  This is useful
 +  for libraries that need to track the memory mapping of applications;
 +  for example, MPI implementations using RDMA want to cache memory
 +  registrations for performance, but tracking all possible crazy cases
 +  such as when, say, the FORTRAN runtime frees memory is impossible
 +  without kernel help.
 +
 +Basic Model
 +
 +  A userspace process uses it by opening /dev/ummunotify, which
 +  returns a file descriptor.  Interest in address ranges is registered
 +  using ioctl() and MMU notifier events are retrieved using read(), as
 +  described in more detail below.  Userspace can register multiple
 +  address ranges to watch, and can unregister individual ranges.
 +
 +  Userspace can also mmap() a single read-only page at offset 0 on
 +  this file descriptor.  This page contains (at offset 0) a single
 +  64-bit generation 

RE: [PATCH] ummunotify: Userspace support for MMU notifications V2

2010-05-10 Thread Sean Hefty
 As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
 and follow-up messages, libraries using RDMA would like to track
 precisely when application code changes memory mapping via free(),
 munmap(), etc.  Current pure-userspace solutions using malloc hooks
 and other tricks are not robust, and the feeling among experts is that
 the issue is unfixable without kernel help.

 Sorry for not replying earlier -- just to throw in my $0.02 here: the MPI
 community is *very interested* in having this stuff in upstream kernels.  It
 solves a fairly major problem for us.

 Open MPI (www.open-mpi.org) is ready to pretty much immediately take advantage
 of these capabilities.  The code to use ummunotify is in a Mercurial branch;
 we're only waiting for ummunotify to go upstream before committing our support
 for it to our main SVN development trunk.

Intel's MPI team has examined this proposal as well and would also like to see
this merged upstream.  It is helpful for implementing MPI over RDMA devices.

- Sean



Re: [PATCH] ummunotify: Userspace support for MMU notifications V2

2010-05-07 Thread Jeff Squyres
On Apr 22, 2010, at 9:38 AM, Eric B Munson wrote:

 From: Roland Dreier rola...@cisco.com
 
 As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
 and follow-up messages, libraries using RDMA would like to track
 precisely when application code changes memory mapping via free(),
 munmap(), etc.  Current pure-userspace solutions using malloc hooks
 and other tricks are not robust, and the feeling among experts is that
 the issue is unfixable without kernel help.

Sorry for not replying earlier -- just to throw in my $0.02 here: the MPI 
community is *very interested* in having this stuff in upstream kernels.  It 
solves a fairly major problem for us. 

Open MPI (www.open-mpi.org) is ready to pretty much immediately take advantage 
of these capabilities.  The code to use ummunotify is in a Mercurial branch; 
we're only waiting for ummunotify to go upstream before committing our support 
for it to our main SVN development trunk.

 We solve this not by implementing the full API proposed in the email
 linked above but rather with a simpler and more generic interface,
 which may be useful in other contexts.  Specifically, we implement a
 new character device driver, ummunotify, that creates a /dev/ummunotify
 node.  A userspace process can open this node read-only and use the fd
 as follows:
 
  1. ioctl() to register/unregister an address range to watch in the
     kernel (cf struct ummunotify_register_ioctl in linux/ummunotify.h).
 
  2. read() to retrieve events generated when a mapping in a watched
     address range is invalidated (cf struct ummunotify_event in
     linux/ummunotify.h).  select()/poll()/epoll() and SIGIO are
     handled for this IO.
 
  3. mmap() one page at offset 0 to map a kernel page that contains a
     generation counter that is incremented each time an event is
     generated.  This allows userspace to have a fast path that checks
     that no events have occurred without a system call.
 
 Thanks to Jason Gunthorpe jgunthorpe at obsidianresearch.com for
 suggestions on the interface design.  Also thanks to Jeff Squyres
 jsquyres at cisco.com for prototyping support for this in Open MPI, which
 helped find several bugs during development.
 
 Signed-off-by: Roland Dreier rola...@cisco.com
 Signed-off-by: Eric B Munson ebmun...@us.ibm.com

Acked-by: Jeff Squyres jsquy...@cisco.com

 ---
 
 Changes from V1:
 - Update Kbuild to handle test program build properly
 - Update documentation to cover questions not addressed in previous
   thread
 ---
  Documentation/Makefile                  |    3 +-
  Documentation/ummunotify/Makefile       |    7 +
  Documentation/ummunotify/ummunotify.txt |  162 +
  Documentation/ummunotify/umn-test.c     |  200 +++
  drivers/char/Kconfig                    |   12 +
  drivers/char/Makefile                   |    1 +
  drivers/char/ummunotify.c               |  567 +++
  include/linux/Kbuild                    |    1 +
  include/linux/ummunotify.h              |  121 +++
  9 files changed, 1073 insertions(+), 1 deletions(-)
  create mode 100644 Documentation/ummunotify/Makefile
  create mode 100644 Documentation/ummunotify/ummunotify.txt
  create mode 100644 Documentation/ummunotify/umn-test.c
  create mode 100644 drivers/char/ummunotify.c
  create mode 100644 include/linux/ummunotify.h
 
 diff --git a/Documentation/Makefile b/Documentation/Makefile
 index 6fc7ea1..27ba76a 100644
 --- a/Documentation/Makefile
 +++ b/Documentation/Makefile
 @@ -1,3 +1,4 @@
  obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
    filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
 -   pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
 +   pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
 +   watchdog/src/
 diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
 new file mode 100644
 index 0000000..89f31a0
 --- /dev/null
 +++ b/Documentation/ummunotify/Makefile
 @@ -0,0 +1,7 @@
 +# List of programs to build
 +hostprogs-y := umn-test
 +
 +# Tell kbuild to always build the programs
 +always := $(hostprogs-y)
 +
 +HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
 diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
 new file mode 100644
 index 0000000..d6c2ccc
 --- /dev/null
 +++ b/Documentation/ummunotify/ummunotify.txt
 @@ -0,0 +1,162 @@
 +UMMUNOTIFY
 +
 +  Ummunotify relays MMU notifier events to userspace.  This is useful
 +  for libraries that need to track the memory mapping of applications;
 +  for example, MPI implementations using RDMA want to cache memory
 +  registrations for performance, but tracking all possible crazy cases
 +  such as when, say, the FORTRAN runtime frees memory is impossible
 +  without kernel help.
 +
 +Basic Model
 +
 +  A userspace process uses it by opening /dev/ummunotify, which
 +  returns a file descriptor.  Interest in address ranges is registered
 +  using 

[PATCH] ummunotify: Userspace support for MMU notifications V2

2010-04-22 Thread Eric B Munson
From: Roland Dreier rola...@cisco.com

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunotify, that creates a /dev/ummunotify
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf struct ummunotify_register_ioctl in linux/ummunotify.h).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf struct ummunotify_event in
    linux/ummunotify.h).  select()/poll()/epoll() and SIGIO are
    handled for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated.  This allows userspace to have a fast path that checks
    that no events have occurred without a system call.
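
As a rough illustration of the fast path in item 3, here is a minimal
sketch of how a registration cache might consult the generation
counter.  Everything named here is an assumption made for the example
and is not taken from the patch: fake_kernel_page stands in for the
page that would be mmap()ed from /dev/ummunotify, and struct reg_cache,
drain_events() and cache_is_current() are hypothetical helpers.  In
real use the slow path would read() events from the fd and drop the
affected registrations.

```c
#include <stdint.h>

/* Assumption: stand-in for the read-only page mapped from the device fd.
 * The 64-bit generation counter is taken to live at offset 0. */
static volatile uint64_t fake_kernel_page[1];

/* Hypothetical per-process cache state (not part of the patch). */
struct reg_cache {
    const volatile uint64_t *counter; /* mapped generation counter */
    uint64_t last_seen;               /* counter value at the last sync */
    int slow_path_calls;              /* times the slow path was taken */
};

/* Hypothetical slow path.  Real code would read() events here and
 * invalidate the cached registrations they cover. */
static void drain_events(struct reg_cache *c, uint64_t seen)
{
    c->slow_path_calls++;
    /* Record the value sampled before draining, so an event arriving
     * during the drain still forces another slow-path pass later. */
    c->last_seen = seen;
}

/* Returns 1 if no events occurred since the last sync (no system call
 * needed), 0 if the slow path had to run. */
static int cache_is_current(struct reg_cache *c)
{
    uint64_t now = *c->counter;

    if (now == c->last_seen)
        return 1;
    drain_events(c, now);
    return 0;
}
```

A caller would initialize a struct reg_cache with the mapped counter
address and call cache_is_current() before reusing a cached
registration; the common case is then a single memory load and compare.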

Thanks to Jason Gunthorpe jgunthorpe at obsidianresearch.com for
suggestions on the interface design.  Also thanks to Jeff Squyres
jsquyres at cisco.com for prototyping support for this in Open MPI, which
helped find several bugs during development.

Signed-off-by: Roland Dreier rola...@cisco.com
Signed-off-by: Eric B Munson ebmun...@us.ibm.com

---

Changes from V1:
- Update Kbuild to handle test program build properly
- Update documentation to cover questions not addressed in previous
  thread
---
 Documentation/Makefile                  |    3 +-
 Documentation/ummunotify/Makefile       |    7 +
 Documentation/ummunotify/ummunotify.txt |  162 +
 Documentation/ummunotify/umn-test.c     |  200 +++
 drivers/char/Kconfig                    |   12 +
 drivers/char/Makefile                   |    1 +
 drivers/char/ummunotify.c               |  567 +++
 include/linux/Kbuild                    |    1 +
 include/linux/ummunotify.h              |  121 +++
 9 files changed, 1073 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/ummunotify/Makefile
 create mode 100644 Documentation/ummunotify/ummunotify.txt
 create mode 100644 Documentation/ummunotify/umn-test.c
 create mode 100644 drivers/char/ummunotify.c
 create mode 100644 include/linux/ummunotify.h

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 6fc7ea1..27ba76a 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -1,3 +1,4 @@
 obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
    filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
-   pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
+   pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
+   watchdog/src/
diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
new file mode 100644
index 0000000..89f31a0
--- /dev/null
+++ b/Documentation/ummunotify/Makefile
@@ -0,0 +1,7 @@
+# List of programs to build
+hostprogs-y := umn-test
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
new file mode 100644
index 0000000..d6c2ccc
--- /dev/null
+++ b/Documentation/ummunotify/ummunotify.txt
@@ -0,0 +1,162 @@
+UMMUNOTIFY
+
+  Ummunotify relays MMU notifier events to userspace.  This is useful
+  for libraries that need to track the memory mapping of applications;
+  for example, MPI implementations using RDMA want to cache memory
+  registrations for performance, but tracking all possible crazy cases
+  such as when, say, the FORTRAN runtime frees memory is impossible
+  without kernel help.
+
+Basic Model
+
+  A userspace process uses it by opening /dev/ummunotify, which
+  returns a file descriptor.  Interest in address ranges is registered
+  using ioctl() and MMU notifier events are retrieved using read(), as
+  described in more detail below.  Userspace can register multiple
+  address ranges to watch, and can unregister individual ranges.
+
+  Userspace can also mmap() a single read-only page at offset 0 on
+  this file descriptor.  This page contains (at offset 0) a single
+  64-bit generation counter that the kernel increments each time an
+  MMU notifier event occurs.  Userspace can use this to very quickly
+  check if there are any events to retrieve without needing to do a
+  system call.
+
+Control
+
+  To start using ummunotify, a process opens /dev/ummunotify in
+  read-only mode.  This will attach