Re: [ewg] [RFC] libibverbs: ibv_fork_init() and libhugetlbfs

2010-05-18 Thread Stefan Roscher
Hi Roland,

On Wednesday 12 May 2010 06:40:16 pm Roland Dreier wrote:
* added get_huge_page_size() to read the huge page size from
  /proc/meminfo. This is done at ibv_fork_init() time.
 
 That doesn't work on systems that have multiple huge page sizes (eg
 powerpc).  You can compare the code to get the size in libhugetlbfs.
 
 Also I think the munging through /proc/pid/maps doesn't really work.
 First of all, essentially grepping for libhugetlbfs is not robust as I
 mentioned -- this will break in the same way for apps using huge pages
 directly.  

I agree that this two points (multiple huge page sizes and performance) are two 
critical
issues which we have to attack.
One suggestion from our side would be to use /proc/pid/smaps which would 
workaround
the multiple huge page size problem. It would not solve the perfomance issue, 
because
reading the file would assume long time even if we buffer the ranges and 
pagesize during 
fork_init.
Do you have an oppinion on that?

We appreciate your help!

regards Stefan

 


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] libibverbs: ibv_fork_init() and libhugetlbfs

2010-05-12 Thread Roland Dreier
   * added get_huge_page_size() to read the huge page size from
 /proc/meminfo. This is done at ibv_fork_init() time.

That doesn't work on systems that have multiple huge page sizes (eg
powerpc).  You can compare the code to get the size in libhugetlbfs.

Also I think the munging through /proc/pid/maps doesn't really work.
First of all, essentially grepping for libhugetlbfs is not robust as I
mentioned -- this will break in the same way for apps using huge pages
directly.  And while it is nice to be able to tell if a range came from
libhugetlbfs, I think there may be some bad performance impact from
reading /proc/pid/maps line-by-line.  (And by the way, as a trivial
optimization, it would make sense to me to check the address of each map
before doing the strstr).

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] libibverbs: ibv_fork_init() and libhugetlbfs

2010-05-07 Thread Alexander Schmidt
On Thu, 06 May 2010 13:55:31 -0700
Roland Dreier rdre...@cisco.com wrote:
 I think that we cannot assume huge pages only come from libhugetlbfs --
 we should support an application directly enabling huge pages (possibly
 via another library too, so we can't assume that an application knows
 the page size for a memory range it is about to register).
 
 And also the 16 MB page size constant is of course not feasible -- with
 all due respect, the x86 page size of 2 MB is much more likely in
 practice :)  (Although perhaps the much slower PowerPC TLB refill makes
 users more likely to try and use hugetlb pages ;)
 
 Alex suggested parsing files in the same way as libhugetlbfs does to get
 the page size, and that seems to be the best solution, since I don't
 think the libhugetlbfs license is compatible with the BSD license for
 libibverbs.
 
 But your trick of using /proc/*/maps looks nice.  Does that only work
 for libhugetlbfs or can we recognize direct mmap of hugetlb pages?

Hi Roland, thanks for your comments!

I've reworked my patch:
 * added get_huge_page_size() to read the huge page size from
   /proc/meminfo. This is done at ibv_fork_init() time.
 * I noticed that some applications like ibv_rc_pingpong already
   get memory from libhugetlbfs when running ibv_fork_init(). So
   I changed the code for testing madvise() to allocate a huge page
   if the huge page size is set in the system.

I have not tested this code with different libraries providing huge
pages / mmaped pages yet, but I hope this can be added later on when
we have agreed on an approach to handle huge pages.

Signed-off-by: Alexander Schmidt al...@linux.vnet.ibm.com
---
 src/memory.c |  103 ++-
 1 file changed, 95 insertions(+), 8 deletions(-)

--- libibverbs-1.1.2.orig/src/memory.c
+++ libibverbs-1.1.2/src/memory.c
@@ -40,6 +40,8 @@
 #include unistd.h
 #include stdlib.h
 #include stdint.h
+#include stdio.h
+#include string.h
 
 #include ibverbs.h
 
@@ -68,12 +70,45 @@ struct ibv_mem_node {
 static struct ibv_mem_node *mm_root;
 static pthread_mutex_t mm_mutex = PTHREAD_MUTEX_INITIALIZER;
 static int page_size;
+static int huge_page_size;
 static int too_late;
 
+static int get_huge_page_size(void)
+{
+   int ret = -1;
+   FILE *file;
+   char *path = /proc/meminfo;
+   char buf[1024], type[128];
+
+   file = fopen(path, r);
+   if (!file)
+   goto out;
+
+   while (fgets(buf, sizeof(buf), file) != NULL) {
+   int n;
+   unsigned long size;
+
+   n = sscanf(buf, %127s %lu %*s, type, size);
+
+   if (n  2)
+   continue;
+
+   if (!strcmp(type, Hugepagesize:)) {
+   /* huge page size is printed in Kb */
+   ret = size * 1024;
+   break;
+   }
+   }
+   fclose(file);
+
+out:
+   return ret;
+}
+
 int ibv_fork_init(void)
 {
void *tmp;
-   int ret;
+   int ret, size;
 
if (mm_root)
return 0;
@@ -85,11 +120,18 @@ int ibv_fork_init(void)
if (page_size  0)
return errno;
 
-   if (posix_memalign(tmp, page_size, page_size))
+   huge_page_size = get_huge_page_size();
+
+   if (huge_page_size  page_size)
+   size = huge_page_size;
+   else
+   size = page_size;
+
+   if (posix_memalign(tmp, size, size))
return ENOMEM;
 
-   ret = madvise(tmp, page_size, MADV_DONTFORK) ||
- madvise(tmp, page_size, MADV_DOFORK);
+   ret = madvise(tmp, size, MADV_DONTFORK) ||
+ madvise(tmp, size, MADV_DOFORK);
 
free(tmp);
 
@@ -446,11 +488,51 @@ static struct ibv_mem_node *__mm_find_st
return node;
 }
 
+static int is_huge_page(void *base)
+{
+   int ret = 0;
+   pid_t pid;
+   FILE *file;
+   char buf[1024], lib[128];
+
+   pid = getpid();
+   snprintf(buf, sizeof(buf), /proc/%d/maps, pid);
+
+   file = fopen(buf, r);
+   if (!file)
+   goto out;
+
+   while (fgets(buf, sizeof(buf), file) != NULL) {
+   int n;
+   char *substr;
+   uintptr_t range_start, range_end;
+
+   n = sscanf(buf, %lx-%lx %*s %*x %*s %*u %127s,
+   range_start, range_end, lib);
+
+   if (n  3)
+   continue;
+
+   substr = strstr(lib, libhugetlbfs);
+   if (substr) {
+   if ((uintptr_t) base = range_start 
+   (uintptr_t) base  range_end) {
+   ret = 1;
+   break;
+   }
+   }
+   }
+   fclose(file);
+
+out:
+   return ret;
+}
+
 static int ibv_madvise_range(void *base, size_t size, int advice)
 {
uintptr_t start, end;
struct 

[RFC] libibverbs: ibv_fork_init() and libhugetlbfs

2010-05-06 Thread Alexander Schmidt
Hi all,

we are trying to make use of libhugetlbfs in an application that relies on
ibv_fork_init() to enable fork() support. The problem we are running into is
that calls to the madvise system call fail when registering a memory region
for memory that is provided by libhugetlbfs. We have written a preliminary
fix (see below) for this and are looking for comments / feedback to get an
acceptable solution.

When fork support is enabled in libibverbs, madvise() is called for every
memory page that is registered as a memory region. Memory ranges that
are passed to madvise() must be page aligned and the size must be a
multiple of the page size. libibverbs uses sysconf(_SC_PAGESIZE) to find
out the system page size and rounds all ranges passed to reg_mr() according
to this page size. When memory from libhugetlbfs is passed to reg_mr(), this
does not work as the page size for this memory range might be different
(e.g. 16Mb). So libibverbs would have to use the huge page size to
calculate a page aligned range for madvise.

As huge pages are provided to the application under the hood when
preloading libhugetlbfs, the application does not have any knowledge about
when it registers a huge page or a usual page.

The patch below demonstrates a possible solution for this. It parses the
/proc/PID/maps file when registering a memory region and decides if the
memory that is to be registered is part of a libhugetlbfs range or not. If so,
a page size of 16Mb is used to align the memory range passed to madvise().

We see two problems with this: it is not a very elegant solution to parse the
procfs file and the 16Mb are hardcoded currently. The latter point could be
solved by calling gethugepagesize() from libhugetlbfs, which would add a new
dependency to libibverbs.

We are highly interested in reviews, comments, suggestions to get this solved
soon. Thanks!

Signed-off-by: Alexander Schmidt al...@linux.vnet.ibm.com
---
 src/memory.c |   50 +++---
 1 file changed, 47 insertions(+), 3 deletions(-)

--- libibverbs-1.1.2.orig/src/memory.c
+++ libibverbs-1.1.2/src/memory.c
@@ -40,6 +40,8 @@
 #include unistd.h
 #include stdlib.h
 #include stdint.h
+#include stdio.h
+#include string.h
 
 #include ibverbs.h
 
@@ -54,6 +56,8 @@
 #define MADV_DOFORK11
 #endif
 
+#define HUGE_PAGE_SIZE (16 * 1024 * 1024)
+
 struct ibv_mem_node {
enum {
IBV_RED,
@@ -446,6 +450,48 @@ static struct ibv_mem_node *__mm_find_st
return node;
 }
 
+static void get_range_address(uintptr_t *start, uintptr_t *end, void *base, 
size_t size)
+{
+   pid_t pid;
+   FILE *file;
+   char buf[1024], lib[128];
+   int range_page_size = page_size;
+
+   pid = getpid();
+   snprintf(buf, sizeof(buf), /proc/%d/maps, pid);
+
+   file = fopen(buf, r);
+   if (!file)
+   goto out;
+
+   while (fgets(buf, sizeof(buf), file) != NULL) {
+   int n;
+   char *substr;
+   uintptr_t range_start, range_end;
+
+   n = sscanf(buf, %lx-%lx %*s %*x %*s %*u %127s,
+   range_start, range_end, lib);
+
+   if (n  3)
+   continue;
+
+   substr = strstr(lib, libhugetlbfs);
+   if (substr) {
+   if ((uintptr_t) base = range_start 
+   (uintptr_t) base  range_end) {
+   range_page_size = HUGE_PAGE_SIZE;
+   break;
+   }
+   }
+   }
+   fclose(file);
+
+out:
+   *start = (uintptr_t) base  ~(range_page_size - 1);
+   *end   = ((uintptr_t) (base + size + range_page_size - 1) 
+~(range_page_size - 1)) - 1;
+}
+
 static int ibv_madvise_range(void *base, size_t size, int advice)
 {
uintptr_t start, end;
@@ -458,9 +504,7 @@ static int ibv_madvise_range(void *base,
 
inc = advice == MADV_DONTFORK ? 1 : -1;
 
-   start = (uintptr_t) base  ~(page_size - 1);
-   end   = ((uintptr_t) (base + size + page_size - 1) 
-~(page_size - 1)) - 1;
+   get_range_address(start, end, base, size);
 
pthread_mutex_lock(mm_mutex);
 
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html