Re: [Libhugetlbfs-devel] [RFC][PATCH] disable heap shrinking by default
David Gibson wrote: > On Wed, Feb 20, 2008 at 03:43:08PM -0600, Andrew Hastings wrote: >> David Gibson wrote: >>> On Thu, Feb 14, 2008 at 11:48:42AM -0600, Andrew Hastings wrote: Disable heap shrinking by default unless HUGETLB_MORECORE_SHRINK=yes is set in the environment. If malloc allocates memory on the heap before libhugetlbfs's setup_morecore is called, then when malloc calls hugetlbfs_morecore >>> Hrm... thing is, I'm pretty sure we're already in deep trouble if >>> malloc() is called before our setup_morecore(). Most of the various >>> link issues we've had to deal with have been about ensuring that the >>> libhugetlbfs constructor goes as early as possible, before everything >>> except libc itself, so that we can establish our morecore hooks before >>> anyone attempts to use the heap. Well, and so that segment remapping >>> also goes early, which is even more stuffed if things like malloc() >>> have been called first. >> But it's not hard to call malloc() before setup_morecore(). > > Indeed it's not. In fact it's hard to ensure that malloc() isn't > valled beforehand. > > But I don't think your patch was anything like a complete solution to > allowing the hugepage heap setup to go after other malloc() calls. It works for all of the test cases I tried (admittedly, not exhaustive). The restriction that other malloc calls can't occur before hugepage heap setup isn't documented in the HOWTO, and we agree that it's easy to violate this restriction. Wouldn't a partial solution be better than none? >> But we've encountered this issue in two other cases: >> 1. Link against a shared library that has a constructor that calls >> malloc, and then use LD_PRELOAD=libhugetlbfs. > > Erm... I think PRELOAD constructors always go before non-preload > constructors, other than the preload's dependencies, obviously. I'd thought so, too, but if I do: cc -o hs-2 heapshrink.o testutils.o libheapshrink.so and then use LD_PRELOAD=libhugetlbfs, I still hit the malloc misbehavior. >> 2. Link against a static libhugetlbfs. >> >> The second case is a bigger show-stopper for Cray; since most of our >> nodes are diskless we prefer to use static linking. > > Ick. The assumption that we're in a dynamically linked environment is > pretty widespread through libhugetlbfs. > > In this situation are you statically linking everything? If not, are > you statically linking libc? Yes, we static link everything with the exception of a few commands which use a shared libc. -Andrew Hastings Cray Inc. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH] Add configuration information to the HOWTO
For new users of libhugetlbfs, the requirements of configuring the hugepage pool and mounting hugetlbfs should be made explicit. Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/HOWTO b/HOWTO index a10461b..3998ef1 100644 --- a/HOWTO +++ b/HOWTO @@ -99,6 +99,42 @@ to use both gcc and GNU binutils. For PowerPC and AMD64 systems you will need a "biarch" compiler, which can build both 32-bit and 64-bit binaries. +Configuration prerequisites +--- + +In kernels before 2.6.24, hugepages must be allocated at boot-time via +the hugepages= command-line parameter or at run-time via the +/proc/sys/vm/nr_hugepages sysctl. If memory is restricted on the system, +boot-time allocation is recommended. Hugepages so allocated will be in +the static hugepage pool. + +In kernels starting with 2.6.24, the hugepage pool can grown on-demand. +If this feature should be used, /proc/sys/vm/nr_overcommit_hugepages +should be set to the maximum size of the hugepage pool. No hugepages +need to be allocated via /proc/sys/vm/nr_hugepages or hugepages= in this +case. Hugepages so allocated will be in the dynamic hugepage pool. + +For the running of the libhugetlbfs testsuite (see below), allocating 20 +static hugepages is recommended. Due to memory restrictions, the number +of hugepages requested may not be allocated if the allocation is +attempted at run-time. Users should verify the actual number of +hugepages allocated by either + + cat /proc/sys/vm/nr_hugepages + +or + + grep HugePages_Free /proc/meminfo + +To use libhugetlbfs features, as well as to run the testsuite, hugetlbfs +must be mounted: + + mkdir -p /mnt/hugetlbfs + mount -t hugetlbfs none /mnt/hugetlbfs + +If hugepages should be available to non-root users, the permissions on +the mountpoint need to be set appropriately. + Installation - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [PATCH] Add configuration information to the HOWTO
Nishanth Aravamudan wrote: > +To use libhugetlbfs features, as well as to run the testsuite, hugetlbfs > +must be mounted: > + > + mkdir -p /mnt/hugetlbfs > + mount -t hugetlbfs none /mnt/hugetlbfs > + > +If hugepages should be available to non-root users, the permissions on > +the mountpoint need to be set appropriately. We've been recommending "chmod 1777 /mnt/hugetlbfs" here. (For some reason "mount -o mode=1777 ..." doesn't work.) -Andrew Hastings Cray Inc. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] Lnux 2.6.24-rc5
On 22.02.2008 [18:35:53 +0530], Subrata Modak wrote: > > We maintain a set of detailed functionality tests for hugepages in > > libhugetlbfs. Any time you are looking for new tests to add to LTP, > > feel free to check the source code. > > Adam/Nishant, > > I am not sure if i found the test cases for this. Can you please provide > me more pointers on this? Recently we had patched LTP?? hugetlb test > cases for some enhancments. Would you also like to see the attached > patch for that. Get the current libhugetlbfs release from libhugetlbfs.ozlabs.org. Untar it and run `make check`. Thanks, Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH] libhugetlbfs: consolidate to one constructor
On 22.02.2008 [14:20:13 -0600], Andrew Hastings wrote: > Nish, > > Thanks for the review! > Glad you suggested this! It turns out there's a dependency on the order in > which the three libhugetlbfs constructors run, which in turns depends on > the order the objects appear in the link line. If the order is "stack.o > morecore.o elflink.o debug.o" the tests work. > > But rather than manipulate the link line (and depend on undocumented > behavior in binutils) I think it would be cleaner to have one constructor > that calls the other setup functions. Do you agree? How does this look? Use one constructor to control the constructor order for libhugetlbfs. Currently, the constructors are run in the order their containing object files are linked in to libhugetlbfs.so. This is fragile as new features are added. Instead, have one constructor that calls the others (which are now no longer actually constructors). Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/Makefile b/Makefile index 2c0f0f6..a689ad6 100644 --- a/Makefile +++ b/Makefile @@ -1,6 +1,6 @@ PREFIX = /usr/local -LIBOBJS = hugeutils.o version.o morecore.o debug.o +LIBOBJS = hugeutils.o version.o init.o morecore.o debug.o INSTALL_OBJ_LIBS = libhugetlbfs.so libhugetlbfs.a LDSCRIPT_TYPES = B BDT LDSCRIPT_DIST_ELF = elf32ppclinux elf64ppc elf_i386 elf_x86_64 diff --git a/debug.c b/debug.c index 967c99c..745e782 100644 --- a/debug.c +++ b/debug.c @@ -45,7 +45,7 @@ static void __hugetlbfs_init_debug(void) initialized = 1; } -static void __attribute__ ((constructor)) setup_debug(void) +void setup_debug(void) { __hugetlbfs_init_debug(); } diff --git a/elflink.c b/elflink.c index 227b476..7751469 100644 --- a/elflink.c +++ b/elflink.c @@ -1060,7 +1060,7 @@ static int parse_elf() return 0; } -static void __attribute__ ((constructor)) setup_elflink(void) +void setup_elflink(void) { int i, ret; diff --git a/init.c b/init.c new file mode 100644 index 000..63c9749 --- /dev/null +++ b/init.c @@ -0,0 +1,8 @@ +#include "libhugetlbfs_internal.h" + +static void __attribute__ ((constructor)) setup_libhugetlbfs(void) +{ + setup_morecore(); + setup_elflink(); + setup_debug(); +} diff --git a/libhugetlbfs_internal.h b/libhugetlbfs_internal.h index bbe7a6b..466c4e8 100644 --- a/libhugetlbfs_internal.h +++ b/libhugetlbfs_internal.h @@ -29,6 +29,9 @@ #define ALIGN(x, a)(((x) + (a) - 1) & ~((a) - 1)) extern int __hugetlbfs_verbose; +extern void setup_elflink(); +extern void setup_morecore(); +extern void setup_debug(); #define ERROR(...) \ do { \ diff --git a/morecore.c b/morecore.c index c24a888..98ea145 100644 --- a/morecore.c +++ b/morecore.c @@ -187,7 +187,7 @@ static void *hugetlbfs_morecore(ptrdiff_t increment) return p; } -static void __attribute__((constructor)) setup_morecore(void) +void setup_morecore(void) { char *env, *ep; unsigned long heapaddr; -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH] debug: make HUGETLB_DEBUG generic
On 27.02.2008 [13:29:24 -0800], Nishanth Aravamudan wrote: > On 22.02.2008 [14:20:13 -0600], Andrew Hastings wrote: > > Nish, > > > > Thanks for the review! > > > > > Glad you suggested this! It turns out there's a dependency on the order in > > which the three libhugetlbfs constructors run, which in turns depends on > > the order the objects appear in the link line. If the order is "stack.o > > morecore.o elflink.o debug.o" the tests work. > > > > But rather than manipulate the link line (and depend on undocumented > > behavior in binutils) I think it would be cleaner to have one constructor > > that calls the other setup functions. Do you agree? > > How does this look? And, as a follow-on: debug: make HUGETLB_DEBUG generic Move the __debug variable and HUGETLB_DEBUG parsing to debug.c. I plan on using the environment variable to control /proc/pid/maps dumping in morecore.c when the heap is forced to move. Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/debug.c b/debug.c index 745e782..5edf37e 100644 --- a/debug.c +++ b/debug.c @@ -28,6 +28,7 @@ #include "libhugetlbfs_internal.h" int __hugetlbfs_verbose = 1; +int __debug = 0; static int initialized; @@ -42,6 +43,10 @@ static void __hugetlbfs_init_debug(void) if (env) __hugetlbfs_verbose = atoi(env); + env = getenv("HUGETLB_DEBUG"); + if (env) + __debug = 1; + initialized = 1; } diff --git a/elflink.c b/elflink.c index 7751469..0fe5fc1 100644 --- a/elflink.c +++ b/elflink.c @@ -163,7 +163,6 @@ static int htlb_num_segs; static int minimal_copy = 1; static int sharing; /* =0 */ static unsigned long force_remap; /* =0 */ -int __debug = 0; /** * assemble_path - handy wrapper around snprintf() for building paths @@ -1017,12 +1016,6 @@ static int check_env(void) } } - env = getenv("HUGETLB_DEBUG"); - if (env) { - DEBUG("HUGETLB_DEBUG=%s, enabling extra checking\n", env); - __debug = 1; - } - return 0; } diff --git a/libhugetlbfs_internal.h b/libhugetlbfs_internal.h index 466c4e8..02de327 100644 --- a/libhugetlbfs_internal.h +++ b/libhugetlbfs_internal.h @@ -29,6 +29,7 @@ #define ALIGN(x, a)(((x) + (a) - 1) & ~((a) - 1)) extern int __hugetlbfs_verbose; +extern int __debug; extern void setup_elflink(); extern void setup_morecore(); extern void setup_debug(); -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [PATCH] libhugetlbfs: consolidate to one constructor
Nish, Thanks! You beat me to it! Nishanth Aravamudan wrote: > --- /dev/null > +++ b/init.c > @@ -0,0 +1,8 @@ > +#include "libhugetlbfs_internal.h" > + > +static void __attribute__ ((constructor)) setup_libhugetlbfs(void) > +{ > + setup_morecore(); > + setup_elflink(); > + setup_debug(); > +} Don't we want setup_debug called first, so ERROR/WARNING/DEBUG etc. can be used in the other constructors? > diff --git a/libhugetlbfs_internal.h b/libhugetlbfs_internal.h > index bbe7a6b..466c4e8 100644 > --- a/libhugetlbfs_internal.h > +++ b/libhugetlbfs_internal.h > @@ -29,6 +29,9 @@ > #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1)) > > extern int __hugetlbfs_verbose; > +extern void setup_elflink(); > +extern void setup_morecore(); > +extern void setup_debug(); Shouldn't all these setup functions have a __hugetlbfs_ prefix to avoid namespace pollution (esp. for static links)? -Andrew Hastings Cray Inc. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [PATCH] debug: make HUGETLB_DEBUG generic
Nish, Nice! Just one comment: Nishanth Aravamudan wrote: > --- a/libhugetlbfs_internal.h > +++ b/libhugetlbfs_internal.h > @@ -29,6 +29,7 @@ > #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1)) > > extern int __hugetlbfs_verbose; > +extern int __debug; This should be __hugetlbfs_debug to avoid namespace pollution, esp. for static links. -Andrew Hastings Cray Inc. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH] morecore.c: eliminate use of mlock()
hugetlbfs_morecore() is currently calling mlock() presumably to instantiate newly-mapped hugepages before returning to malloc(). However, this is ineffective for two reasons: 1. The return value of mlock is ignored. If there aren't enough hugepages available, the process may be killed later, when it tries to access the hugepages just added to the heap. 2. Most distros have a default mlock limit well below the size of one hugepage. (The kernel default is 8 base pages.) This patch changes hugetlbfs_morecore to use readv() to instantiate the newly-mapped hugepages. If not all hugepages can be instantiated, the new mapping is removed, and hugetlbfs_morecore returns NULL to malloc. This allows for a graceful fallback to base pages if insufficient hugepages are available. This patch also adds a test case to verify that a process using HUGETLB_MORECORE does not get killed when it exhausts the available free hugepages, assuming sufficient free base pages are available. Signed-off-by: Andrew Hastings <[EMAIL PROTECTED]> on behalf of Cray Inc. --- I've tested this on x86_64 and PPC64. -Andrew Hastings Cray Inc. morecore.c| 47 ++-- tests/Makefile|2 - tests/heap-overflow.c | 97 ++ tests/run_tests.sh|1 4 files changed, 135 insertions(+), 12 deletions(-) diff -ruNp libhugetlbfs-dev-20080219/morecore.c libhugetlbfs-dev-20080219-nomlock/morecore.c --- libhugetlbfs-dev-20080219/morecore.c2008-02-19 09:44:01.0 -0600 +++ libhugetlbfs-dev-20080219-nomlock/morecore.c2008-02-27 15:51:18.185747000 -0600 @@ -27,12 +27,15 @@ #include #include #include +#include +#include #include "hugetlbfs.h" #include "libhugetlbfs_internal.h" static int heap_fd; +static int zero_fd; static long blocksize; static void *heapbase; @@ -65,9 +68,12 @@ static long hugetlbfs_next_addr(long add * Luckily, if it does not do so and we error out malloc will happily * go back to small pages and use mmap to get them. Hurrah. */ +#define IOV_LEN64 static void *hugetlbfs_morecore(ptrdiff_t increment) { + int i, j; + struct iovec iov[IOV_LEN]; int ret; void *p; long delta; @@ -114,20 +120,38 @@ static void *hugetlbfs_morecore(ptrdiff_ return NULL; } - /* Use of mlock was reintroduced in libhugetlbfs 1.1, -* as the NUMA issues have been fixed in-kernel. The -* NUMA users of libhugetlbfs' malloc feature are + /* The NUMA users of libhugetlbfs' malloc feature are * expected to use the numactl program to specify an * appropriate policy for hugepage allocation */ - /* Use mlock to guarantee these pages to the process */ - ret = mlock(p, delta); - if (ret) { - WARNING("Failed to reserve huge pages in " - "hugetlbfs_morecore(): %s\n", - strerror(errno)); - } else { - munlock(p, delta); + /* +* Use readv(2) to instantiate the hugepages. If we +* can't get all that were requested, release the entire +* mapping and return NULL. Glibc malloc will then fall back +* to using mmap of base pages. +* +* If we instead returned a hugepage mapping with insufficient +* hugepages, the VM system would kill the process when the +* process tried to access the missing memory. +*/ + + for (i = 0; i < delta; ) { + for (j = 0; j < IOV_LEN && i < delta; j++) { + iov[j].iov_base = p + i; + iov[j].iov_len = 1; + i += blocksize; + DEBUG("iov[%d]=%p\n", j, iov[j].iov_base); + } + DEBUG("readv(%d)\n", j); + ret = readv(zero_fd, iov, j); + DEBUG("Got %d of %d requested; err=%d\n", ret, j, + ret < 0 ? errno : 0); + if (ret != j) { + WARNING("Failed to reserve huge pages in " + "hugetlbfs_morecore()\n"); + munmap(p, delta); + return NULL; + } } /* we now have mmap'd further */ @@ -225,6 +249,7 @@ static void __attribute__((constructor)) heapaddr = (unsigned long)sbrk(0); heapaddr = hugetlbfs_next_addr(heapaddr); } + zero_fd = open("/dev/zero", O_RDONLY); DEBUG("setup_morecore
[Libhugetlbfs-devel] [PATCH] morecore: dump /proc/pid/maps when the heap is forced to move and DEBUG
morecore: dump /proc/pid/maps when the heap is forced to move and DEBUG If HUGETLB_DEBUG is set in the environment, use a new helper function, dump_proc_pid_maps() to indicate the address space layout. This is very useful in finding out *why* the hugepage heap cannot be placed where we'd like it. Getting this information for benchmarks, often not run directly, helps resolve bugs. Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/hugetlbfs.h b/hugetlbfs.h index ec48732..0492ae9 100644 --- a/hugetlbfs.h +++ b/hugetlbfs.h @@ -29,6 +29,7 @@ int hugetlbfs_unlinked_fd(void); /* Diagnoses/debugging only functions */ long hugetlbfs_num_free_pages(void); long hugetlbfs_num_pages(void); +long dump_proc_pid_maps(void); #define PF_LINUX_HUGETLB 0x10 diff --git a/hugeutils.c b/hugeutils.c index 2080731..51ee5ba 100644 --- a/hugeutils.c +++ b/hugeutils.c @@ -259,3 +259,35 @@ long hugetlbfs_num_pages(void) { return read_meminfo("HugePages_Total:"); } + +#define MAPS_BUF_SZ 4096 +long dump_proc_pid_maps() +{ + FILE *f; + char line[MAPS_BUF_SZ]; + size_t ret; + + f = fopen("/proc/self/maps", "r"); + if (!f) { + ERROR("Failed to open /proc/self/maps\n"); + return -1; + } + + while (1) { + ret = fread(line, sizeof(char), MAPS_BUF_SZ, f); + if (ret < 0) { + ERROR("Failed to read /proc/self/maps\n"); + return -1; + } + if (ret == 0) + break; + ret = fwrite(line, sizeof(char), ret, stderr); + if (ret < 0) { + ERROR("Failed to write /proc/self/maps to stderr\n"); + return -1; + } + } + + fclose(f); + return 0; +} diff --git a/morecore.c b/morecore.c index 98ea145..b7a0ee8 100644 --- a/morecore.c +++ b/morecore.c @@ -101,9 +101,12 @@ static void *hugetlbfs_morecore(ptrdiff_t increment) /* if this is the first map */ if (! mapsize) { - if (heapbase && (heapbase != p)) + if (heapbase && (heapbase != p)) { WARNING("Heap originates at %p instead of %p\n", p, heapbase); + if (__debug) + dump_proc_pid_maps(); + } /* then setup the heap variables */ heapbase = heaptop = p; } else if (p != (heapbase + mapsize)) { @@ -111,6 +114,8 @@ static void *hugetlbfs_morecore(ptrdiff_t increment) munmap(p, delta); WARNING("Mapped at %p instead of %p in hugetlbfs_morecore()\n", p, heapbase + mapsize); + if (__debug) + dump_proc_pid_maps(); return NULL; } -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH] elflink: filter extracopy on GLIBC symbols
I found that when using OpenMPI with libhugetlbfs there is a gigantic symbol in the BSS. From readelf -a: 81: 120048b0 6caf2d0 OBJECT GLOBAL DEFAULT 26fields This object is almost 109M in size! But because it is an OBJECT type and GLOBAL, we copy it in the extracopy logic. However, per the comments in elflink.c, this is not a GLIBC symbol, so we shouldn't need to. Use strstr() to search for GLIBC in the symbol names we find, and skip those that dont' match. Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/elflink.c b/elflink.c index 227b476..aaec4a2 100644 --- a/elflink.c +++ b/elflink.c @@ -405,7 +405,7 @@ static int find_numsyms(Elf_Sym *symtab, char *strtab) * - Object type (variable) * - Non-zero size (zero size means the symbol is just a marker with no data) */ -static inline int keep_symbol(Elf_Sym *s, void *start, void *end) +static inline int keep_symbol(char *strtab, Elf_Sym *s, void *start, void *end) { if ((void *)s->st_value < start) return 0; @@ -418,6 +418,8 @@ static inline int keep_symbol(Elf_Sym *s, void *start, void *end) return 0; if (s->st_size == 0) return 0; + if (!strstr(strtab + s->st_name, "GLIBC")) + return 0; return 1; } @@ -468,10 +470,8 @@ static void get_extracopy(struct seg_info *seg, const Elf_Phdr *phdr, int phnum) end = start; for (sym = symtab; sym < symtab + numsyms; sym++) { - if (!keep_symbol(sym, start, end_orig)) + if (!keep_symbol(strtab, sym, start, end_orig)) continue; - /* TODO - add filtering so that we only look at symbols from glibc - (@@GLIBC_*) */ /* These are the droids we are looking for */ found_sym = 1; -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH] elflink: emit symbol name when extracopy symbols are found
To ease with debugging, output the address and name of extra symbols to copy from within the BSS. Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/elflink.c b/elflink.c index aaec4a2..427f8c3 100644 --- a/elflink.c +++ b/elflink.c @@ -421,6 +421,10 @@ static inline int keep_symbol(char *strtab, Elf_Sym *s, void *start, void *end) if (!strstr(strtab + s->st_name, "GLIBC")) return 0; + if (__debug) + DEBUG("symbol to copy at %p: %s\n", s->st_value, + strtab + s->st_name); + return 1; } -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH] DEBUG overrides VERBOSE checks
For debugging purposes, setting HUGETLB_DEBUG should be sufficient. It should also imply maximum verbosity from the library, so spit out all types of messages (DEBUG, WARNING or ERROR). Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/libhugetlbfs_internal.h b/libhugetlbfs_internal.h index bbe7a6b..09112d5 100644 --- a/libhugetlbfs_internal.h +++ b/libhugetlbfs_internal.h @@ -32,7 +32,7 @@ extern int __hugetlbfs_verbose; #define ERROR(...) \ do { \ - if (__hugetlbfs_verbose >= 1) { \ + if (__debug || __hugetlbfs_verbose >= 1) { \ fprintf(stderr, "libhugetlbfs: ERROR: " __VA_ARGS__); \ fflush(stderr); \ } \ @@ -40,7 +40,7 @@ extern int __hugetlbfs_verbose; #define WARNING(...) \ do { \ - if (__hugetlbfs_verbose >= 2) { \ + if (__debug || __hugetlbfs_verbose >= 2) { \ fprintf(stderr, "libhugetlbfs: WARNING: " __VA_ARGS__); \ fflush(stderr); \ } \ @@ -48,7 +48,7 @@ extern int __hugetlbfs_verbose; #define DEBUG(...) \ do { \ - if (__hugetlbfs_verbose >= 3) { \ + if (__debug || __hugetlbfs_verbose >= 3) { \ fprintf(stderr, "libhugetlbfs: " __VA_ARGS__); \ fflush(stderr); \ } \ @@ -56,7 +56,7 @@ extern int __hugetlbfs_verbose; #define DEBUG_CONT(...) \ do { \ - if (__hugetlbfs_verbose >= 3) { \ + if (__debug || __hugetlbfs_verbose >= 3) { \ fprintf(stderr, __VA_ARGS__); \ fflush(stderr); \ } \ -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH][UPDATED] libhugetlbfs: consolidate to one constructor
On 27.02.2008 [15:46:47 -0600], Andrew Hastings wrote: > Nish, > > Thanks! You beat me to it! > > Nishanth Aravamudan wrote: >> --- /dev/null >> +++ b/init.c >> @@ -0,0 +1,8 @@ >> +#include "libhugetlbfs_internal.h" >> + >> +static void __attribute__ ((constructor)) setup_libhugetlbfs(void) >> +{ >> +setup_morecore(); >> +setup_elflink(); >> +setup_debug(); >> +} > > Don't we want setup_debug called first, so ERROR/WARNING/DEBUG etc. can be > used in the other constructors? You're right. I was going by your e-mail that put debug.o last. Updated patch follows. >> diff --git a/libhugetlbfs_internal.h b/libhugetlbfs_internal.h >> index bbe7a6b..466c4e8 100644 >> --- a/libhugetlbfs_internal.h >> +++ b/libhugetlbfs_internal.h >> @@ -29,6 +29,9 @@ >> #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1)) >> extern int __hugetlbfs_verbose; >> +extern void setup_elflink(); >> +extern void setup_morecore(); >> +extern void setup_debug(); > > Shouldn't all these setup functions have a __hugetlbfs_ prefix to avoid > namespace pollution (esp. for static links)? Also updated. Use one constructor to control the constructor order for libhugetlbfs. Currently, the constructors are run in the order their containing object files are linked in to libhugetlbfs.so. This is fragile as new features are added. Instead, have one constructor that calls the others (which are now no longer actually constructors). Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/Makefile b/Makefile index 2c0f0f6..a689ad6 100644 --- a/Makefile +++ b/Makefile @@ -1,6 +1,6 @@ PREFIX = /usr/local -LIBOBJS = hugeutils.o version.o morecore.o debug.o +LIBOBJS = hugeutils.o version.o init.o morecore.o debug.o INSTALL_OBJ_LIBS = libhugetlbfs.so libhugetlbfs.a LDSCRIPT_TYPES = B BDT LDSCRIPT_DIST_ELF = elf32ppclinux elf64ppc elf_i386 elf_x86_64 diff --git a/debug.c b/debug.c index 967c99c..b563542 100644 --- a/debug.c +++ b/debug.c @@ -45,7 +45,7 @@ static void __hugetlbfs_init_debug(void) initialized = 1; } -static void __attribute__ ((constructor)) setup_debug(void) +void __hugetlbfs_setup_debug(void) { __hugetlbfs_init_debug(); } diff --git a/elflink.c b/elflink.c index 227b476..9f4989a 100644 --- a/elflink.c +++ b/elflink.c @@ -1060,7 +1060,7 @@ static int parse_elf() return 0; } -static void __attribute__ ((constructor)) setup_elflink(void) +void __hugetlbfs_setup_elflink(void) { int i, ret; diff --git a/init.c b/init.c new file mode 100644 index 000..3866990 --- /dev/null +++ b/init.c @@ -0,0 +1,27 @@ +/* + * libhugetlbfs - Easy use of Linux hugepages + * Copyright (C) 2008 Nishanth Aravamudan, IBM Corporation + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public License + * as published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, write to the Free Software + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libhugetlbfs_internal.h" + +static void __attribute__ ((constructor)) setup_libhugetlbfs(void) +{ + __hugetlbfs_setup_debug(); + __hugetlbfs_setup_morecore(); + __hugetlbfs_setup_elflink(); +} diff --git a/libhugetlbfs_internal.h b/libhugetlbfs_internal.h index bbe7a6b..9a68db8 100644 --- a/libhugetlbfs_internal.h +++ b/libhugetlbfs_internal.h @@ -29,6 +29,9 @@ #define ALIGN(x, a)(((x) + (a) - 1) & ~((a) - 1)) extern int __hugetlbfs_verbose; +extern void __hugetlbfs_setup_elflink(); +extern void __hugetlbfs_setup_morecore(); +extern void __hugetlbfs_setup_debug(); #define ERROR(...) \ do { \ diff --git a/morecore.c b/morecore.c index c24a888..39020f8 100644 --- a/morecore.c +++ b/morecore.c @@ -187,7 +187,7 @@ static void *hugetlbfs_morecore(ptrdiff_t increment) return p; } -static void __attribute__((constructor)) setup_morecore(void) +void __hugetlbfs_setup_morecore(void) { char *env, *ep; unsigned long heapaddr; -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH][UPDATED] debug: make HUGETLB_DEBUG generic
On 27.02.2008 [15:48:24 -0600], Andrew Hastings wrote: > Nish, > > Nice! Just one comment: > > Nishanth Aravamudan wrote: >> --- a/libhugetlbfs_internal.h >> +++ b/libhugetlbfs_internal.h >> @@ -29,6 +29,7 @@ >> #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1)) >> extern int __hugetlbfs_verbose; >> +extern int __debug; > > This should be __hugetlbfs_debug to avoid namespace pollution, esp. for > static links. debug: make HUGETLB_DEBUG generic Move the __debug variable and HUGETLB_DEBUG parsing to debug.c. I plan on using the environment variable to control /proc/pid/maps dumping in morecore.c when the heap is forced to move. Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/debug.c b/debug.c index b563542..7e9b948 100644 --- a/debug.c +++ b/debug.c @@ -28,6 +28,7 @@ #include "libhugetlbfs_internal.h" int __hugetlbfs_verbose = 1; +int __hugetlbfs_debug = 0; static int initialized; @@ -42,6 +43,10 @@ static void __hugetlbfs_init_debug(void) if (env) __hugetlbfs_verbose = atoi(env); + env = getenv("HUGETLB_DEBUG"); + if (env) + __hugetlbfs_debug = 1; + initialized = 1; } diff --git a/elflink.c b/elflink.c index 9f4989a..38727d9 100644 --- a/elflink.c +++ b/elflink.c @@ -163,7 +163,6 @@ static int htlb_num_segs; static int minimal_copy = 1; static int sharing; /* =0 */ static unsigned long force_remap; /* =0 */ -int __debug = 0; /** * assemble_path - handy wrapper around snprintf() for building paths @@ -489,7 +488,7 @@ static void get_extracopy(struct seg_info *seg, const Elf_Phdr *phdr, int phnum) DEBUG("Found __libhuge_filesz at %p\n", &__libhuge_filesz); } - if (__debug) + if (__hugetlbfs_debug) check_bss(start, end_orig); if (found_sym) { @@ -624,7 +623,7 @@ int parse_elf_relinked(struct dl_phdr_info *info, size_t size, void *data) htlb_num_segs++; } - if (__debug) + if (__hugetlbfs_debug) check_memsz(); return 1; } @@ -1017,12 +1016,6 @@ static int check_env(void) } } - env = getenv("HUGETLB_DEBUG"); - if (env) { - DEBUG("HUGETLB_DEBUG=%s, enabling extra checking\n", env); - __debug = 1; - } - return 0; } diff --git a/libhugetlbfs_internal.h b/libhugetlbfs_internal.h index 9a68db8..3ea7705 100644 --- a/libhugetlbfs_internal.h +++ b/libhugetlbfs_internal.h @@ -29,6 +29,7 @@ #define ALIGN(x, a)(((x) + (a) - 1) & ~((a) - 1)) extern int __hugetlbfs_verbose; +extern int __hugetlbfs_debug; extern void __hugetlbfs_setup_elflink(); extern void __hugetlbfs_setup_morecore(); extern void __hugetlbfs_setup_debug(); -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH][UPDATED] morecore: dump /proc/pid/maps when the heap is force to move and DEBUG
On 27.02.2008 [14:23:18 -0800], Nishanth Aravamudan wrote: > morecore: dump /proc/pid/maps when the heap is forced to move and DEBUG > > If HUGETLB_DEBUG is set in the environment, use a new helper function, > dump_proc_pid_maps() to indicate the address space layout. This is very > useful in finding out *why* the hugepage heap cannot be placed where > we'd like it. Getting this information for benchmarks, often not run > directly, helps resolve bugs. Updated to the new name for __debug: morecore: dump /proc/pid/maps when the heap is forced to move and DEBUG If HUGETLB_DEBUG is set in the environment, use a new helper function, dump_proc_pid_maps() to indicate the address space layout. This is very useful in finding out *why* the hugepage heap cannot be placed where we'd like it. Getting this information for benchmarks, often not run directly, helps resolve bugs. Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/hugetlbfs.h b/hugetlbfs.h index ec48732..0492ae9 100644 --- a/hugetlbfs.h +++ b/hugetlbfs.h @@ -29,6 +29,7 @@ int hugetlbfs_unlinked_fd(void); /* Diagnoses/debugging only functions */ long hugetlbfs_num_free_pages(void); long hugetlbfs_num_pages(void); +long dump_proc_pid_maps(void); #define PF_LINUX_HUGETLB 0x10 diff --git a/hugeutils.c b/hugeutils.c index 2080731..51ee5ba 100644 --- a/hugeutils.c +++ b/hugeutils.c @@ -259,3 +259,35 @@ long hugetlbfs_num_pages(void) { return read_meminfo("HugePages_Total:"); } + +#define MAPS_BUF_SZ 4096 +long dump_proc_pid_maps() +{ + FILE *f; + char line[MAPS_BUF_SZ]; + size_t ret; + + f = fopen("/proc/self/maps", "r"); + if (!f) { + ERROR("Failed to open /proc/self/maps\n"); + return -1; + } + + while (1) { + ret = fread(line, sizeof(char), MAPS_BUF_SZ, f); + if (ret < 0) { + ERROR("Failed to read /proc/self/maps\n"); + return -1; + } + if (ret == 0) + break; + ret = fwrite(line, sizeof(char), ret, stderr); + if (ret < 0) { + ERROR("Failed to write /proc/self/maps to stderr\n"); + return -1; + } + } + + fclose(f); + return 0; +} diff --git a/morecore.c b/morecore.c index 39020f8..b55349c 100644 --- a/morecore.c +++ b/morecore.c @@ -101,9 +101,12 @@ static void *hugetlbfs_morecore(ptrdiff_t increment) /* if this is the first map */ if (! mapsize) { - if (heapbase && (heapbase != p)) + if (heapbase && (heapbase != p)) { WARNING("Heap originates at %p instead of %p\n", p, heapbase); + if (__hugetlbfs_debug) + dump_proc_pid_maps(); + } /* then setup the heap variables */ heapbase = heaptop = p; } else if (p != (heapbase + mapsize)) { @@ -111,6 +114,8 @@ static void *hugetlbfs_morecore(ptrdiff_t increment) munmap(p, delta); WARNING("Mapped at %p instead of %p in hugetlbfs_morecore()\n", p, heapbase + mapsize); + if (__hugetlbfs_debug) + dump_proc_pid_maps(); return NULL; } -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH][UPDATED] DEBUG overrides VERBOSE checks
On 27.02.2008 [14:43:38 -0800], Nishanth Aravamudan wrote: > For debugging purposes, setting HUGETLB_DEBUG should be sufficient. It > should also imply maximum verbosity from the library, so spit out all > types of messages (DEBUG, WARNING or ERROR). Update to new name for __debug: DEBUG overrides VERBOSE checks For debugging purposes, setting HUGETLB_DEBUG should be sufficient. It should also imply maximum verbosity from the library, so spit out all types of messages (DEBUG, WARNING or ERROR). Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/libhugetlbfs_internal.h b/libhugetlbfs_internal.h index 3ea7705..4a044e9 100644 --- a/libhugetlbfs_internal.h +++ b/libhugetlbfs_internal.h @@ -36,7 +36,7 @@ extern void __hugetlbfs_setup_debug(); #define ERROR(...) \ do { \ - if (__hugetlbfs_verbose >= 1) { \ + if (__hugetlbfs_debug || __hugetlbfs_verbose >= 1) { \ fprintf(stderr, "libhugetlbfs: ERROR: " __VA_ARGS__); \ fflush(stderr); \ } \ @@ -44,7 +44,7 @@ extern void __hugetlbfs_setup_debug(); #define WARNING(...) \ do { \ - if (__hugetlbfs_verbose >= 2) { \ + if (__hugetlbfs_debug || __hugetlbfs_verbose >= 2) { \ fprintf(stderr, "libhugetlbfs: WARNING: " __VA_ARGS__); \ fflush(stderr); \ } \ @@ -52,7 +52,7 @@ extern void __hugetlbfs_setup_debug(); #define DEBUG(...) \ do { \ - if (__hugetlbfs_verbose >= 3) { \ + if (__hugetlbfs_debug || __hugetlbfs_verbose >= 3) { \ fprintf(stderr, "libhugetlbfs: " __VA_ARGS__); \ fflush(stderr); \ } \ @@ -60,7 +60,7 @@ extern void __hugetlbfs_setup_debug(); #define DEBUG_CONT(...) \ do { \ - if (__hugetlbfs_verbose >= 3) { \ + if (__hugetlbfs_debug || __hugetlbfs_verbose >= 3) { \ fprintf(stderr, __VA_ARGS__); \ fflush(stderr); \ } \ -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH] elflink: emit symbol name when extracopy symbols are found
On 27.02.2008 [14:43:04 -0800], Nishanth Aravamudan wrote: > To ease with debugging, output the address and name of extra symbols to > copy from within the BSS. Update to new name for __debug: elflink: emit symbol name when extracopy symbols are found To ease with debugging, output the address and name of extra symbols to copy from within the BSS. Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/elflink.c b/elflink.c index f1cf6b0..fe022a4 100644 --- a/elflink.c +++ b/elflink.c @@ -420,6 +420,10 @@ static inline int keep_symbol(char *strtab, Elf_Sym *s, void *start, void *end) if (!strstr(strtab + s->st_name, "GLIBC")) return 0; + if (__hugetlbfs_debug) + DEBUG("symbol to copy at %p: %s\n", s->st_value, + strtab + s->st_name); + return 1; } -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [PATCH] elflink: emit symbol name when extracopy symbols are found
On 27.02.2008 [15:23:36 -0800], Nishanth Aravamudan wrote: > On 27.02.2008 [14:43:04 -0800], Nishanth Aravamudan wrote: > > To ease with debugging, output the address and name of extra symbols to > > copy from within the BSS. > > Update to new name for __debug: This introduced a warning actually, forgot a cast. elflink: emit symbol name when extracopy symbols are found To ease with debugging, output the address and name of extra symbols to copy from within the BSS. Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> diff --git a/elflink.c b/elflink.c index f1cf6b0..43ce562 100644 --- a/elflink.c +++ b/elflink.c @@ -420,6 +420,10 @@ static inline int keep_symbol(char *strtab, Elf_Sym *s, void *start, void *end) if (!strstr(strtab + s->st_name, "GLIBC")) return 0; + if (__hugetlbfs_debug) + DEBUG("symbol to copy at %p: %s\n", (void *)s->st_value, + strtab + s->st_name); + return 1; } -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [PATCH][UPDATED] debug: make HUGETLB_DEBUG generic
Nishanth Aravamudan wrote: > On 27.02.2008 [15:48:24 -0600], Andrew Hastings wrote: >> Nish, >> >> Nice! Just one comment: >> >> Nishanth Aravamudan wrote: >>> --- a/libhugetlbfs_internal.h >>> +++ b/libhugetlbfs_internal.h >>> @@ -29,6 +29,7 @@ >>> #define ALIGN(x, a)(((x) + (a) - 1) & ~((a) - 1)) >>> extern int __hugetlbfs_verbose; >>> +extern int __debug; >> This should be __hugetlbfs_debug to avoid namespace pollution, esp. for >> static links. > > debug: make HUGETLB_DEBUG generic > > Move the __debug variable and HUGETLB_DEBUG parsing to debug.c. I plan > on using the environment variable to control /proc/pid/maps dumping in > morecore.c when the heap is forced to move. > > Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> Acked-by: Andrew Hastings <[EMAIL PROTECTED]> - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [PATCH][UPDATED] libhugetlbfs: consolidate to one constructor
Nishanth Aravamudan wrote: > On 27.02.2008 [15:46:47 -0600], Andrew Hastings wrote: >> Nishanth Aravamudan wrote: >>> --- /dev/null >>> +++ b/init.c >>> @@ -0,0 +1,8 @@ >>> +#include "libhugetlbfs_internal.h" >>> + >>> +static void __attribute__ ((constructor)) setup_libhugetlbfs(void) >>> +{ >>> + setup_morecore(); >>> + setup_elflink(); >>> + setup_debug(); >>> +} >> Don't we want setup_debug called first, so ERROR/WARNING/DEBUG etc. can be >> used in the other constructors? > > You're right. I was going by your e-mail that put debug.o last. Updated > patch follows. Yes -- the toolchain I'm using calls constructors LIFO, so debug.o was last in order to be called first. Acked-by: Andrew Hastings <[EMAIL PROTECTED]> -Andrew Hastings Cray Inc. - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [PATCH] morecore.c: eliminate use of mlock()
On 27.02.2008 [16:12:14 -0600], Andrew Hastings wrote: > hugetlbfs_morecore() is currently calling mlock() presumably to > instantiate newly-mapped hugepages before returning to malloc(). > > However, this is ineffective for two reasons: > 1. The return value of mlock is ignored. If there aren't enough > hugepages available, the process may be killed later, when it tries > to access the hugepages just added to the heap. Hrm, so the idea was this -- by faulting in the hugepages, we avoid exactly this problem where we'd race with some other binary trying to use hugepages. We'd be able to extend the hugetlbfs file, but without the mlock() we wouldn't reserve the pages for our process, so we would get killed later when the nopage() handler was invoked (I think). I see what you're saying, though, that we only emit a warning if we don't grab the pages at mlock() time. But there's a chance that when we actually go and use the hugepages, that racing consumer may be done...so it's not entirely correct to fail. And mis-sizing the hugepage pool (or not enabling the dynamic pool, if it's available) is viewed (by me, at least) as a configuration error. > 2. Most distros have a default mlock limit well below the size of > one hugepage. (The kernel default is 8 base pages.) We usually advise customers to modify this. However, I think your mistaking hugepages for smallpages here. I don't believe hugepages are accounted as locked, even when mlock() is called on them. e.g. on a 4-way x86_64 running SLES10: # ulimit -l 32 LD_LIBRARY_PATH=./obj64 LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./tests/obj64/malloc Starting testcase "./tests/obj64/malloc", pid 10292 HUGETLB_MORECORE=yes malloc(4) = 0x600010 malloc(1024) = 0x600010 malloc(131072) = 0x600010 malloc(1048576) = 0x600010 malloc(16777216) = 0x600010 malloc(33554432) = 0x600010 PASS > This patch changes hugetlbfs_morecore to use readv() to instantiate > the newly-mapped hugepages. If not all hugepages can be instantiated, > the new mapping is removed, and hugetlbfs_morecore returns NULL to > malloc. This allows for a graceful fallback to base pages if > insufficient hugepages are available. Couldn't we just munmap() and return NULL from the case where mlock() returned !0? This seems like overkill? > This patch also adds a test case to verify that a process using > HUGETLB_MORECORE does not get killed when it exhausts the available > free hugepages, assuming sufficient free base pages are available. So, is this a problem you actually witnessed in production? I'm just curious. Thanks, Nish > Signed-off-by: Andrew Hastings <[EMAIL PROTECTED]> on behalf of Cray Inc. > --- > I've tested this on x86_64 and PPC64. > > -Andrew Hastings > Cray Inc. > > morecore.c| 47 ++-- > tests/Makefile|2 - > tests/heap-overflow.c | 97 > ++ > tests/run_tests.sh|1 > 4 files changed, 135 insertions(+), 12 deletions(-) > > diff -ruNp libhugetlbfs-dev-20080219/morecore.c > libhugetlbfs-dev-20080219-nomlock/morecore.c > --- libhugetlbfs-dev-20080219/morecore.c 2008-02-19 09:44:01.0 > -0600 > +++ libhugetlbfs-dev-20080219-nomlock/morecore.c 2008-02-27 > 15:51:18.185747000 -0600 > @@ -27,12 +27,15 @@ > #include > #include > #include > +#include > +#include > > #include "hugetlbfs.h" > > #include "libhugetlbfs_internal.h" > > static int heap_fd; > +static int zero_fd; > static long blocksize; > > static void *heapbase; > @@ -65,9 +68,12 @@ static long hugetlbfs_next_addr(long add > * Luckily, if it does not do so and we error out malloc will happily > * go back to small pages and use mmap to get them. Hurrah. > */ > +#define IOV_LEN 64 > > static void *hugetlbfs_morecore(ptrdiff_t increment) > { > + int i, j; > + struct iovec iov[IOV_LEN]; > int ret; > void *p; > long delta; > @@ -114,20 +120,38 @@ static void *hugetlbfs_morecore(ptrdiff_ > return NULL; > } > > - /* Use of mlock was reintroduced in libhugetlbfs 1.1, > - * as the NUMA issues have been fixed in-kernel. The > - * NUMA users of libhugetlbfs' malloc feature are > + /* The NUMA users of libhugetlbfs' malloc feature are >* expected to use the numactl program to specify an >* appropriate policy for hugepage allocation */ > > - /* Use mlock to guarantee these pages to the process */ > - ret = mlock(p, delta); > - if (ret) { > - WARNING("Failed to reserve huge pages in " > - "hugetlbfs_morecore(): %s\n", > - strerror(errno)); > - } else { > - munlock(p, delta); > + /* > + * Use readv(2) to instantiate the hug
Re: [Libhugetlbfs-devel] [PATCH] morecore.c: eliminate use of mlock()
On 27.02.2008 [17:02:15 -0800], Nishanth Aravamudan wrote: > On 27.02.2008 [16:12:14 -0600], Andrew Hastings wrote: > > hugetlbfs_morecore() is currently calling mlock() presumably to > > instantiate newly-mapped hugepages before returning to malloc(). > > > > However, this is ineffective for two reasons: > > 1. The return value of mlock is ignored. If there aren't enough > > hugepages available, the process may be killed later, when it tries > > to access the hugepages just added to the heap. > > Hrm, so the idea was this -- by faulting in the hugepages, we avoid > exactly this problem where we'd race with some other binary trying to > use hugepages. We'd be able to extend the hugetlbfs file, but without > the mlock() we wouldn't reserve the pages for our process, so we would > get killed later when the nopage() handler was invoked (I think). I see > what you're saying, though, that we only emit a warning if we don't grab > the pages at mlock() time. But there's a chance that when we actually go > and use the hugepages, that racing consumer may be done...so it's not > entirely correct to fail. And mis-sizing the hugepage pool (or not > enabling the dynamic pool, if it's available) is viewed (by me, at > least) as a configuration error. > > > 2. Most distros have a default mlock limit well below the size of > > one hugepage. (The kernel default is 8 base pages.) > > We usually advise customers to modify this. Sorry this first sentence is false. We actually modify the stack rlimit for some tests, but not the locked one. More reason to believe hugepages are "special" in this case. Thanks, Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
[Libhugetlbfs-devel] [RFC] Release plans...
So, with the 6-patch drop I did today, we have between 23 and 29 patches post 1.2. How are folks feeling about that? I am starting to lean towards cutting a 1.3-rc0 release to start that cycle going. There are a few threatened features, though: * Andrew's HUGETLB_STACK set pros - Nicely isolated to stack.c - should't threaten any other features' performance cons: - Doesn't always work, which makes it relatively tough to use for performance measurements - May go unused except in special cases for the same reason, meaning lack of testing * Adam's new segment remapping code pros - Makes libhuge work with newer binutils cons - Needs lots of testing - David's responses indicate there is still hashing out to be done * My T-only relinking pros - Desired feature for database customers cons - Current implementation doesn't work on ppc32 (David did provide an idea for working around this, just haven't implemented it yet) - More linker script headaches, potentially Given that, I'm tempted to let Andrew's stack remapping code get in for the next release (1.3, in that case). I believe you're planning on reposting the code with the cleanups we've been discussing, Andrew. I also believe we should hold off on Adam's code and my T-only relinking stuff for this release. I think it will hold up the release unnecessarily, as things are working in RH5{,.1} and SLES10{,-SP1} with the current setup. Also, changing the relinking method so drastically necessitates (to me) moving to a 2.0 release. But nothing else in the 1.3 queue seems to do so. So perhaps just a release with the new relinking code would be good for bisection purposes. Finally, the need for the old-style T-mode may go away if we get Adam's method in (hence holding off on T-only for 1.3). What do folks think? I know Adam's on vacation until next week, so I won't make an dramatic decisions until he's had a chance to reply, but I think it's best to talk about the releases out in the open. Thanks, Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [PATCH][UPDATED] debug: make HUGETLB_DEBUG generic
On Wed, Feb 27, 2008 at 03:15:14PM -0800, Nishanth Aravamudan wrote: > On 27.02.2008 [15:48:24 -0600], Andrew Hastings wrote: > > Nish, > > > > Nice! Just one comment: > > > > Nishanth Aravamudan wrote: > >> --- a/libhugetlbfs_internal.h > >> +++ b/libhugetlbfs_internal.h > >> @@ -29,6 +29,7 @@ > >> #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1)) > >> extern int __hugetlbfs_verbose; > >> +extern int __debug; > > > > This should be __hugetlbfs_debug to avoid namespace pollution, esp. for > > static links. > > debug: make HUGETLB_DEBUG generic > > Move the __debug variable and HUGETLB_DEBUG parsing to debug.c. I plan > on using the environment variable to control /proc/pid/maps dumping in > morecore.c when the heap is forced to move. > > Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> Acked-by: David Gibson <[EMAIL PROTECTED]> Seems sensible. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [RFC][PATCH] disable heap shrinking by default
On Tue, Feb 26, 2008 at 12:11:56AM -0800, Nishanth Aravamudan wrote: > On 26.02.2008 [16:16:01 +1100], David Gibson wrote: > > On Mon, Feb 25, 2008 at 09:05:42PM -0800, Nishanth Aravamudan wrote: > > > On 26.02.2008 [15:33:55 +1100], David Gibson wrote: > > > > On Mon, Feb 25, 2008 at 08:23:48PM -0800, Nishanth Aravamudan wrote: > > > > > On 15.02.2008 [11:52:39 +1100], David Gibson wrote: > > > > > > On Thu, Feb 14, 2008 at 11:48:42AM -0600, Andrew Hastings wrote: [snip] > > > If we keep trimming disabled, and allow the hugepage heap to jump in the > > > address space (presumably because we're on power and something > > > (/lib/ld-2.4.so for a 32-bit binary seems to be the common case) backed > > > by small pages is in the way of a large mmap() [512M or so]), will that > > > break applications? > > > > Um.. I don't see how the hugepage heap can jump. That's the whole > > point of that else case, if we can't allocate following on from our > > existing heap, we fail the morecore. > > The "if" applied to all of that first sentence. Sorry, I see the source > of the confusion. I am not asking if we can currently violate brk() > semantics. I am asking if we could change the code as I indicate below > and whether a) that would violate brk() semantics and b) whether > violating brk() semantics matters if we don't trim the heap. Ah, right, I think I understand now. Um.. a) yes, I think it would violate brk() semantics and b) I'm not sure. I suspect violating brk() semantics won't work though, because I think glibc's allocator will assume it can use everything between the previous morecore value and the new one. And will probably barf if the new value is less than the old, which it could be with this change. > > > In code, that is the following in the second conditional above: > > > > > > else if (p != (heapbase + mapsize)) { > > > /* Couldn't get the mapping where we wanted > > > Continue anyways, because we won't free it back */ > > > WARNING("Mapped at %p instead of %p in hugetlbfs_morecore()\n", > > > p, heapbase + mapsize); > > > } > > > > > > That is, we don't munmap the most recently created mapping and we don't > > > return NULL to malloc(). > > > > Except that we do munmap() and we do return NULL to malloc(). > > > > Even more confused now... > > Is that any clearer? I am asking about a hypothetical alternative that > relies on MORECORE not trimming the heap and allow the heap thus to be > discontinuous. Where does it matter that it is not -- and where is it > checked? > > Thanks, > Nish > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [PATCH][UPDATED] libhugetlbfs: consolidate to one constructor
On Wed, Feb 27, 2008 at 03:13:20PM -0800, Nishanth Aravamudan wrote: > On 27.02.2008 [15:46:47 -0600], Andrew Hastings wrote: > > Nish, > > > > Thanks! You beat me to it! > > > > Nishanth Aravamudan wrote: > >> --- /dev/null > >> +++ b/init.c > >> @@ -0,0 +1,8 @@ > >> +#include "libhugetlbfs_internal.h" > >> + > >> +static void __attribute__ ((constructor)) setup_libhugetlbfs(void) > >> +{ > >> + setup_morecore(); > >> + setup_elflink(); > >> + setup_debug(); > >> +} > > > > Don't we want setup_debug called first, so ERROR/WARNING/DEBUG etc. can be > > used in the other constructors? > > You're right. I was going by your e-mail that put debug.o last. Updated > patch follows. Yes, thought that was the wrong order. I suspect you also want elflink before morecore, although it may not matter in practice. > >> diff --git a/libhugetlbfs_internal.h b/libhugetlbfs_internal.h > >> index bbe7a6b..466c4e8 100644 > >> --- a/libhugetlbfs_internal.h > >> +++ b/libhugetlbfs_internal.h > >> @@ -29,6 +29,9 @@ > >> #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1)) > >> extern int __hugetlbfs_verbose; > >> +extern void setup_elflink(); > >> +extern void setup_morecore(); > >> +extern void setup_debug(); > > > > Shouldn't all these setup functions have a __hugetlbfs_ prefix to avoid > > namespace pollution (esp. for static links)? > > Also updated. > > Use one constructor to control the constructor order for libhugetlbfs. > Currently, the constructors are run in the order their containing object > files are linked in to libhugetlbfs.so. This is fragile as new features > are added. Instead, have one constructor that calls the others (which > are now no longer actually constructors). Avoiding namespace pollution was one reason I used separate constructors in the first place (they can all remain static, that way). But a less finicky way of controlling the execution order is probably worth that. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [RFC][PATCH] disable heap shrinking by default
On 28.02.2008 [13:00:34 +1100], David Gibson wrote: > On Tue, Feb 26, 2008 at 12:11:56AM -0800, Nishanth Aravamudan wrote: > > On 26.02.2008 [16:16:01 +1100], David Gibson wrote: > > > On Mon, Feb 25, 2008 at 09:05:42PM -0800, Nishanth Aravamudan wrote: > > > > On 26.02.2008 [15:33:55 +1100], David Gibson wrote: > > > > > On Mon, Feb 25, 2008 at 08:23:48PM -0800, Nishanth Aravamudan wrote: > > > > > > On 15.02.2008 [11:52:39 +1100], David Gibson wrote: > > > > > > > On Thu, Feb 14, 2008 at 11:48:42AM -0600, Andrew Hastings wrote: > [snip] > > > > If we keep trimming disabled, and allow the hugepage heap to jump in the > > > > address space (presumably because we're on power and something > > > > (/lib/ld-2.4.so for a 32-bit binary seems to be the common case) backed > > > > by small pages is in the way of a large mmap() [512M or so]), will that > > > > break applications? > > > > > > Um.. I don't see how the hugepage heap can jump. That's the whole > > > point of that else case, if we can't allocate following on from our > > > existing heap, we fail the morecore. > > > > The "if" applied to all of that first sentence. Sorry, I see the source > > of the confusion. I am not asking if we can currently violate brk() > > semantics. I am asking if we could change the code as I indicate below > > and whether a) that would violate brk() semantics and b) whether > > violating brk() semantics matters if we don't trim the heap. > > Ah, right, I think I understand now. Sorry for all the confusion. > Um.. a) yes, I think it would violate brk() semantics and b) I'm not > sure. I suspect violating brk() semantics won't work though, because > I think glibc's allocator will assume it can use everything between > the previous morecore value and the new one. And will probably barf > if the new value is less than the old, which it could be with this > change. Hrm, I guess I hadn't thought about it that way. I really am curious about these semantics. We already have cases where a small page [heap] exists in a processes address space, say at 0x1000 for a 32-bit process and we start the hugepage heap in the morecore constructor at 0x2000. If glibc thought it could use that entire area, but we didn't actually allocate any pages for it to use there, does it just take a bunch of page faults? I'm not sure how the new value can be less than the old? I think the only way that happens is with trimming in response to a negative allocation request... But I'm suggesting we have a situation like: 0x1000 [heap] 0x2000 /hugetlbfs/tmp.file 0x3000 /lib/ld-2.4.so 0x4000 /hugetlbfs/tmp.file I think we'll still be returning higher addresses every time. I don't know, maybe it's not worth pursuing, but I was just curious. Do you happen to have a reference to the definitions for brk() semantics? Or do you recommend I just look at glibc's source [not something I often hear recommended :)]? Thanks, Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [PATCH][UPDATED] debug: make HUGETLB_DEBUG generic
On 28.02.2008 [12:47:08 +1100], David Gibson wrote: > On Wed, Feb 27, 2008 at 03:15:14PM -0800, Nishanth Aravamudan wrote: > > On 27.02.2008 [15:48:24 -0600], Andrew Hastings wrote: > > > Nish, > > > > > > Nice! Just one comment: > > > > > > Nishanth Aravamudan wrote: > > >> --- a/libhugetlbfs_internal.h > > >> +++ b/libhugetlbfs_internal.h > > >> @@ -29,6 +29,7 @@ > > >> #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1)) > > >> extern int __hugetlbfs_verbose; > > >> +extern int __debug; > > > > > > This should be __hugetlbfs_debug to avoid namespace pollution, esp. for > > > static links. > > > > debug: make HUGETLB_DEBUG generic > > > > Move the __debug variable and HUGETLB_DEBUG parsing to debug.c. I plan > > on using the environment variable to control /proc/pid/maps dumping in > > morecore.c when the heap is forced to move. > > > > Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]> > > Acked-by: David Gibson <[EMAIL PROTECTED]> Applied, thanks. -Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [PATCH][UPDATED] libhugetlbfs: consolidate to one constructor
On 28.02.2008 [12:43:13 +1100], David Gibson wrote: > On Wed, Feb 27, 2008 at 03:13:20PM -0800, Nishanth Aravamudan wrote: > > On 27.02.2008 [15:46:47 -0600], Andrew Hastings wrote: > > > Nish, > > > > > > Thanks! You beat me to it! > > > > > > Nishanth Aravamudan wrote: > > >> --- /dev/null > > >> +++ b/init.c > > >> @@ -0,0 +1,8 @@ > > >> +#include "libhugetlbfs_internal.h" > > >> + > > >> +static void __attribute__ ((constructor)) setup_libhugetlbfs(void) > > >> +{ > > >> +setup_morecore(); > > >> +setup_elflink(); > > >> +setup_debug(); > > >> +} > > > > > > Don't we want setup_debug called first, so ERROR/WARNING/DEBUG etc. can > > > be > > > used in the other constructors? > > > > You're right. I was going by your e-mail that put debug.o last. Updated > > patch follows. > > Yes, thought that was the wrong order. I suspect you also want > elflink before morecore, although it may not matter in practice. Andrew, do you know if it matters for your cases? I can move elflink in my commit. > > >> diff --git a/libhugetlbfs_internal.h b/libhugetlbfs_internal.h > > >> index bbe7a6b..466c4e8 100644 > > >> --- a/libhugetlbfs_internal.h > > >> +++ b/libhugetlbfs_internal.h > > >> @@ -29,6 +29,9 @@ > > >> #define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1)) > > >> extern int __hugetlbfs_verbose; > > >> +extern void setup_elflink(); > > >> +extern void setup_morecore(); > > >> +extern void setup_debug(); > > > > > > Shouldn't all these setup functions have a __hugetlbfs_ prefix to avoid > > > namespace pollution (esp. for static links)? > > > > Also updated. > > > > Use one constructor to control the constructor order for libhugetlbfs. > > Currently, the constructors are run in the order their containing object > > files are linked in to libhugetlbfs.so. This is fragile as new features > > are added. Instead, have one constructor that calls the others (which > > are now no longer actually constructors). > > Avoiding namespace pollution was one reason I used separate > constructors in the first place (they can all remain static, that > way). But a less finicky way of controlling the execution order is > probably worth that. Sure enough. I think the tradeoff is worth it, as well. Thanks, Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [RFC][PATCH] disable heap shrinking by default
On Wed, Feb 27, 2008 at 11:29:03AM -0600, Andrew Hastings wrote: > David Gibson wrote: > > On Wed, Feb 20, 2008 at 03:43:08PM -0600, Andrew Hastings wrote: > >> David Gibson wrote: > >>> On Thu, Feb 14, 2008 at 11:48:42AM -0600, Andrew Hastings wrote: > Disable heap shrinking by default unless HUGETLB_MORECORE_SHRINK=yes is > set in the environment. > > If malloc allocates memory on the heap before libhugetlbfs's > setup_morecore is called, then when malloc calls hugetlbfs_morecore > >>> Hrm... thing is, I'm pretty sure we're already in deep trouble if > >>> malloc() is called before our setup_morecore(). Most of the various > >>> link issues we've had to deal with have been about ensuring that the > >>> libhugetlbfs constructor goes as early as possible, before everything > >>> except libc itself, so that we can establish our morecore hooks before > >>> anyone attempts to use the heap. Well, and so that segment remapping > >>> also goes early, which is even more stuffed if things like malloc() > >>> have been called first. > >> But it's not hard to call malloc() before setup_morecore(). > > > > Indeed it's not. In fact it's hard to ensure that malloc() isn't > > valled beforehand. > > > > But I don't think your patch was anything like a complete solution to > > allowing the hugepage heap setup to go after other malloc() calls. > > It works for all of the test cases I tried (admittedly, not exhaustive). > The restriction that other malloc calls can't occur before hugepage > heap setup isn't documented in the HOWTO, and we agree that it's easy to > violate this restriction. Wouldn't a partial solution be better than none? Well, no, I'm not convinced that it is. Getting the link and constructor order right is fiddly, but it shouldn't be impossible. I think I'd prefer that the library fail completely if this is wrong, rather than sort-of work mostly until it doesn't. Even if your patch is reliable for now, I suspect it will be very fragile to glibc changes, and even if not I'd prefer not to give the impression that having libhugetlbfs initialized later is ok, because the elflink code certainly won't work in that case. I would like to make it easier to get the link order correct. That's part of what the ld.hugetlbfs script is supposed to do - it would be interesting to hear if there's a way we can extend that to suit your needs. > >> But we've encountered this issue in two other cases: > >> 1. Link against a shared library that has a constructor that calls > >> malloc, and then use LD_PRELOAD=libhugetlbfs. > > > > Erm... I think PRELOAD constructors always go before non-preload > > constructors, other than the preload's dependencies, obviously. > > I'd thought so, too, but if I do: > > cc -o hs-2 heapshrink.o testutils.o libheapshrink.so > > and then use LD_PRELOAD=libhugetlbfs, I still hit the malloc misbehavior. Ok, that's a bit weird. Can you run this with LD_DEBUG="all" to see if we can track down what's happening a bit more precisely. > >> 2. Link against a static libhugetlbfs. > >> > >> The second case is a bigger show-stopper for Cray; since most of our > >> nodes are diskless we prefer to use static linking. > > > > Ick. The assumption that we're in a dynamically linked environment is > > pretty widespread through libhugetlbfs. > > > > In this situation are you statically linking everything? If not, are > > you statically linking libc? > > Yes, we static link everything with the exception of a few commands > which use a shared libc. Ok. For a statically linked binary, I'd expect you'd have to have a statically linked libhugetlbfs for it to work properly, again with correct link order so that its constructors go immediately after libc. Segment remapping certainly won't work with static linking, though. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [RFC][PATCH] disable heap shrinking by default
On Wed, Feb 27, 2008 at 06:17:02PM -0800, Nishanth Aravamudan wrote: > On 28.02.2008 [13:00:34 +1100], David Gibson wrote: > > On Tue, Feb 26, 2008 at 12:11:56AM -0800, Nishanth Aravamudan wrote: > > > On 26.02.2008 [16:16:01 +1100], David Gibson wrote: > > > > On Mon, Feb 25, 2008 at 09:05:42PM -0800, Nishanth Aravamudan wrote: > > > > > On 26.02.2008 [15:33:55 +1100], David Gibson wrote: > > > > > > On Mon, Feb 25, 2008 at 08:23:48PM -0800, Nishanth Aravamudan wrote: > > > > > > > On 15.02.2008 [11:52:39 +1100], David Gibson wrote: > > > > > > > > On Thu, Feb 14, 2008 at 11:48:42AM -0600, Andrew Hastings wrote: > > [snip] > > > > > If we keep trimming disabled, and allow the hugepage heap to jump in > > > > > the > > > > > address space (presumably because we're on power and something > > > > > (/lib/ld-2.4.so for a 32-bit binary seems to be the common case) > > > > > backed > > > > > by small pages is in the way of a large mmap() [512M or so]), will > > > > > that > > > > > break applications? > > > > > > > > Um.. I don't see how the hugepage heap can jump. That's the whole > > > > point of that else case, if we can't allocate following on from our > > > > existing heap, we fail the morecore. > > > > > > The "if" applied to all of that first sentence. Sorry, I see the source > > > of the confusion. I am not asking if we can currently violate brk() > > > semantics. I am asking if we could change the code as I indicate below > > > and whether a) that would violate brk() semantics and b) whether > > > violating brk() semantics matters if we don't trim the heap. > > > > Ah, right, I think I understand now. > > Sorry for all the confusion. > > > Um.. a) yes, I think it would violate brk() semantics and b) I'm not > > sure. I suspect violating brk() semantics won't work though, because > > I think glibc's allocator will assume it can use everything between > > the previous morecore value and the new one. And will probably barf > > if the new value is less than the old, which it could be with this > > change. > > Hrm, I guess I hadn't thought about it that way. I really am curious > about these semantics. We already have cases where a small page [heap] > exists in a processes address space, say at 0x1000 for a 32-bit > process and we start the hugepage heap in the morecore constructor at > 0x2000. If glibc thought it could use that entire area, but we > didn't actually allocate any pages for it to use there, does it just > take a bunch of page faults? We really have such cases? I wouldn't have expected that to work. When I wrote this, I thought from what I'd read of the glibc documentation and code that it was only going to work reliably if we got the very first morecore call, and every one after that. > I'm not sure how the new value can be less than the old? I think the > only way that happens is with trimming in response to a negative > allocation request... But I'm suggesting we have a situation like: > > 0x1000[heap] > 0x2000/hugetlbfs/tmp.file > 0x3000/lib/ld-2.4.so > 0x4000/hugetlbfs/tmp.file > > I think we'll still be returning higher addresses every time. I don't > know, maybe it's not worth pursuing, but I was just curious. In practice, we probably will, because of the way the kernel scans for mapping addresses. But if we don't get our hinted address, there's no guarantee where the mmap() will be - it could be in any free address space, including before the executable itself. > Do you happen to have a reference to the definitions for brk() > semantics? Or do you recommend I just look at glibc's source [not > something I often hear recommended :)]? Alas, I think you will need to look at the glibc source. Good luck. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [RFC][PATCH] disable heap shrinking by default
On 28.02.2008 [13:28:47 +1100], David Gibson wrote: > On Wed, Feb 27, 2008 at 06:17:02PM -0800, Nishanth Aravamudan wrote: > > On 28.02.2008 [13:00:34 +1100], David Gibson wrote: > > > On Tue, Feb 26, 2008 at 12:11:56AM -0800, Nishanth Aravamudan wrote: > > > > On 26.02.2008 [16:16:01 +1100], David Gibson wrote: > > > > > On Mon, Feb 25, 2008 at 09:05:42PM -0800, Nishanth Aravamudan wrote: > > > > > > On 26.02.2008 [15:33:55 +1100], David Gibson wrote: > > > > > > > On Mon, Feb 25, 2008 at 08:23:48PM -0800, Nishanth Aravamudan > > > > > > > wrote: > > > > > > > > On 15.02.2008 [11:52:39 +1100], David Gibson wrote: > > > > > > > > > On Thu, Feb 14, 2008 at 11:48:42AM -0600, Andrew Hastings > > > > > > > > > wrote: > > > [snip] > > > > > > If we keep trimming disabled, and allow the hugepage heap to jump > > > > > > in the > > > > > > address space (presumably because we're on power and something > > > > > > (/lib/ld-2.4.so for a 32-bit binary seems to be the common case) > > > > > > backed > > > > > > by small pages is in the way of a large mmap() [512M or so]), will > > > > > > that > > > > > > break applications? > > > > > > > > > > Um.. I don't see how the hugepage heap can jump. That's the whole > > > > > point of that else case, if we can't allocate following on from our > > > > > existing heap, we fail the morecore. > > > > > > > > The "if" applied to all of that first sentence. Sorry, I see the source > > > > of the confusion. I am not asking if we can currently violate brk() > > > > semantics. I am asking if we could change the code as I indicate below > > > > and whether a) that would violate brk() semantics and b) whether > > > > violating brk() semantics matters if we don't trim the heap. > > > > > > Ah, right, I think I understand now. > > > > Sorry for all the confusion. > > > > > Um.. a) yes, I think it would violate brk() semantics and b) I'm not > > > sure. I suspect violating brk() semantics won't work though, because > > > I think glibc's allocator will assume it can use everything between > > > the previous morecore value and the new one. And will probably barf > > > if the new value is less than the old, which it could be with this > > > change. > > > > Hrm, I guess I hadn't thought about it that way. I really am curious > > about these semantics. We already have cases where a small page [heap] > > exists in a processes address space, say at 0x1000 for a 32-bit > > process and we start the hugepage heap in the morecore constructor at > > 0x2000. If glibc thought it could use that entire area, but we > > didn't actually allocate any pages for it to use there, does it just > > take a bunch of page faults? > > We really have such cases? I wouldn't have expected that to work. We definitely do. It's been a source of confusion for the testers, but it does seem to work. And given the code I pasted earlier in the thread (from morecore.c), it's clearly been coded to be a non-failure case if the first mmap() is moved (mapsize == 0), but if we've already allocated any of the heap, we fail. > When I wrote this, I thought from what I'd read of the glibc > documentation and code that it was only going to work reliably if we > got the very first morecore call, and every one after that. That was always my impression to -- I just chocked these binaries working up to luck. But I began to wonder if maybe since we're not doing trimming, that it's working as expected. > > I'm not sure how the new value can be less than the old? I think the > > only way that happens is with trimming in response to a negative > > allocation request... But I'm suggesting we have a situation like: > > > > 0x1000 [heap] > > 0x2000 /hugetlbfs/tmp.file > > 0x3000 /lib/ld-2.4.so > > 0x4000 /hugetlbfs/tmp.file > > > > I think we'll still be returning higher addresses every time. I don't > > know, maybe it's not worth pursuing, but I was just curious. > > In practice, we probably will, because of the way the kernel scans for > mapping addresses. But if we don't get our hinted address, there's no > guarantee where the mmap() will be - it could be in any free address > space, including before the executable itself. Ah, that's true. I'm ok with bailing out at that point like we do now if the only address the kernel can place our mmap() is "invalid". But I'd also like to make it more likely for MORECORE to work in cases where it tends to bail out early now. I'll code something up and see what happens with SPEC -- they tend to provoke our bugs :) > > Do you happen to have a reference to the definitions for brk() > > semantics? Or do you recommend I just look at glibc's source [not > > something I often hear recommended :)]? > > Alas, I think you will need to look at the glibc source. Good luck. Alright, I'll add it to my todo :) Thanks for all the input, David. -Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center ---
Re: [Libhugetlbfs-devel] [RFC][PATCH] disable heap shrinking by default
On Wed, Feb 27, 2008 at 06:34:37PM -0800, Nishanth Aravamudan wrote: > On 28.02.2008 [13:28:47 +1100], David Gibson wrote: > > On Wed, Feb 27, 2008 at 06:17:02PM -0800, Nishanth Aravamudan wrote: [snip] > > > Hrm, I guess I hadn't thought about it that way. I really am curious > > > about these semantics. We already have cases where a small page [heap] > > > exists in a processes address space, say at 0x1000 for a 32-bit > > > process and we start the hugepage heap in the morecore constructor at > > > 0x2000. If glibc thought it could use that entire area, but we > > > didn't actually allocate any pages for it to use there, does it just > > > take a bunch of page faults? > > > > We really have such cases? I wouldn't have expected that to work. > > We definitely do. It's been a source of confusion for the testers, but > it does seem to work. And given the code I pasted earlier in the thread > (from morecore.c), it's clearly been coded to be a non-failure case if > the first mmap() is moved (mapsize == 0), but if we've already allocated > any of the heap, we fail. Hrm. Have we checked in these cases that malloc() or (glibc's default) morecore is actually being called before our version. I'm wondering if it might just be that the tail end of the BSS is being labelled "[heap]" in /proc/pid/maps, but we don't actually have a heap from the glibc allocator's point of view. > > When I wrote this, I thought from what I'd read of the glibc > > documentation and code that it was only going to work reliably if we > > got the very first morecore call, and every one after that. > > That was always my impression to -- I just chocked these binaries > working up to luck. But I began to wonder if maybe since we're not doing > trimming, that it's working as expected. > > > > I'm not sure how the new value can be less than the old? I think the > > > only way that happens is with trimming in response to a negative > > > allocation request... But I'm suggesting we have a situation like: > > > > > > 0x1000[heap] > > > 0x2000/hugetlbfs/tmp.file > > > 0x3000/lib/ld-2.4.so > > > 0x4000/hugetlbfs/tmp.file > > > > > > I think we'll still be returning higher addresses every time. I don't > > > know, maybe it's not worth pursuing, but I was just curious. > > > > In practice, we probably will, because of the way the kernel scans for > > mapping addresses. But if we don't get our hinted address, there's no > > guarantee where the mmap() will be - it could be in any free address > > space, including before the executable itself. > > Ah, that's true. I'm ok with bailing out at that point like we do now if > the only address the kernel can place our mmap() is "invalid". But I'd > also like to make it more likely for MORECORE to work in cases where it > tends to bail out early now. I'll code something up and see what happens > with SPEC -- they tend to provoke our bugs :) > > > > Do you happen to have a reference to the definitions for brk() > > > semantics? Or do you recommend I just look at glibc's source [not > > > something I often hear recommended :)]? > > > > Alas, I think you will need to look at the glibc source. Good luck. > > Alright, I'll add it to my todo :) Thanks for all the input, David. > > -Nish > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [RFC][PATCH] disable heap shrinking by default
On 28.02.2008 [13:55:58 +1100], David Gibson wrote: > On Wed, Feb 27, 2008 at 06:34:37PM -0800, Nishanth Aravamudan wrote: > > On 28.02.2008 [13:28:47 +1100], David Gibson wrote: > > > On Wed, Feb 27, 2008 at 06:17:02PM -0800, Nishanth Aravamudan wrote: > [snip] > > > > Hrm, I guess I hadn't thought about it that way. I really am curious > > > > about these semantics. We already have cases where a small page [heap] > > > > exists in a processes address space, say at 0x1000 for a 32-bit > > > > process and we start the hugepage heap in the morecore constructor at > > > > 0x2000. If glibc thought it could use that entire area, but we > > > > didn't actually allocate any pages for it to use there, does it just > > > > take a bunch of page faults? > > > > > > We really have such cases? I wouldn't have expected that to work. > > > > We definitely do. It's been a source of confusion for the testers, but > > it does seem to work. And given the code I pasted earlier in the thread > > (from morecore.c), it's clearly been coded to be a non-failure case if > > the first mmap() is moved (mapsize == 0), but if we've already allocated > > any of the heap, we fail. > > Hrm. Have we checked in these cases that malloc() or (glibc's > default) morecore is actually being called before our version. I'm > wondering if it might just be that the tail end of the BSS is being > labelled "[heap]" in /proc/pid/maps, but we don't actually have a heap > from the glibc allocator's point of view. Hrm, I suppose that could be true. maps sees the brk() and says [heap] must be there, but it might not be. How would we (I) figure out if malloc() or morecore has been called before us? Thanks, Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel
Re: [Libhugetlbfs-devel] [RFC][PATCH] disable heap shrinking by default
On 27.02.2008 [23:20:35 -0800], Nishanth Aravamudan wrote: > On 28.02.2008 [13:55:58 +1100], David Gibson wrote: > > On Wed, Feb 27, 2008 at 06:34:37PM -0800, Nishanth Aravamudan wrote: > > > On 28.02.2008 [13:28:47 +1100], David Gibson wrote: > > > > On Wed, Feb 27, 2008 at 06:17:02PM -0800, Nishanth Aravamudan wrote: > > [snip] > > > > > Hrm, I guess I hadn't thought about it that way. I really am curious > > > > > about these semantics. We already have cases where a small page [heap] > > > > > exists in a processes address space, say at 0x1000 for a 32-bit > > > > > process and we start the hugepage heap in the morecore constructor at > > > > > 0x2000. If glibc thought it could use that entire area, but we > > > > > didn't actually allocate any pages for it to use there, does it just > > > > > take a bunch of page faults? > > > > > > > > We really have such cases? I wouldn't have expected that to work. > > > > > > We definitely do. It's been a source of confusion for the testers, but > > > it does seem to work. And given the code I pasted earlier in the thread > > > (from morecore.c), it's clearly been coded to be a non-failure case if > > > the first mmap() is moved (mapsize == 0), but if we've already allocated > > > any of the heap, we fail. > > > > Hrm. Have we checked in these cases that malloc() or (glibc's > > default) morecore is actually being called before our version. I'm > > wondering if it might just be that the tail end of the BSS is being > > labelled "[heap]" in /proc/pid/maps, but we don't actually have a heap > > from the glibc allocator's point of view. > > Hrm, I suppose that could be true. maps sees the brk() and says [heap] > must be there, but it might not be. > > How would we (I) figure out if malloc() or morecore has been called > before us? Actually thinking about it, I suppose strace() could help a little? Looking at the kernel, the following snippet from fs/proc/task_mmu.c is the relevant bit: if (vma->vm_start <= mm->start_brk && vma->vm_end >= mm->brk) { name = "[heap]"; } else if (vma->vm_start <= mm->start_stack && vma->vm_end >= mm->start_stack) { name = "[stack]"; } So if the VMA we're looking at starts at or before the original brk() and ends at or after the current brk(), it's considered the heap. How would that be the case if it's the tail-end of the heap, since we are not setting the brk()? Thanks, Nish -- Nishanth Aravamudan <[EMAIL PROTECTED]> IBM Linux Technology Center - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Libhugetlbfs-devel mailing list Libhugetlbfs-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel