Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)
On Mon, Sep 16, 2013 at 9:13 AM, Heikki Linnakangas wrote:
> Robert, do you remember why you put the "pagesize = sysconf(_SC_PAGE_SIZE);"
> call in the new mmap() shared memory allocator?

Hmm, no. Unfortunately, I don't. We could try ripping it out and see if the
buildfarm breaks. If it is needed, then the dynamic shared memory patch I
posted probably needs it as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)
On 2013-09-16 15:18:50 +0200, Andres Freund wrote:
>> So even a tiny allocation, much smaller than any page size, succeeds, and it
>> reserves a huge page. I tried the same with larger values; the kernel always
>> uses huge pages, and rounds up the allocation to a multiple of the huge page
>> size.
>
> When developing the prototype I am pretty sure I had to add the rounding
> up - but I am not sure why now, because after chatting with Heikki about
> it, I've looked around and the initial MAP_HUGETLB support in the kernel
> (commit 4e52780d41a741fb4861ae1df2413dd816ec11b1) has support for
> rounding up.

Ok, the reason for that seems to have been the following bug:
https://bugzilla.kernel.org/show_bug.cgi?id=56881

Greetings,

Andres Freund
Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)
On 2013-09-16 16:13:57 +0300, Heikki Linnakangas wrote:
> On 16.09.2013 13:15, Andres Freund wrote:
>> On 2013-09-16 11:15:28 +0300, Heikki Linnakangas wrote:
>>> On 14.09.2013 02:41, Richard Poole wrote:
>>>> The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
>>>> on systems that support it. It's based on Christian Kruse's patch from
>>>> last year, incorporating suggestions from Andres Freund.
>>>
>>> I don't understand the logic in figuring out the pagesize, and the smallest
>>> supported hugepage size. First of all, even without the patch, why do we
>>> round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel
>>> will round up the request all by itself. The mmap() man page doesn't say
>>> anything about length having to be a multiple of page size.
>>
>> I think it does:
>> EINVAL We don't like addr, length, or offset (e.g., they are too
>>        large, or not aligned on a page boundary).
>
> That doesn't mean that they *all* have to be aligned on a page boundary.
> It's understandable that 'addr' and 'offset' have to be, but it doesn't
> make much sense for 'length'.
>
>> and
>> A file is mapped in multiples of the page size. For a file that is not
>> a multiple of the page size, the remaining memory is zeroed when mapped,
>> and writes to that region are not written out to the file. The effect of
>> changing the size of the underlying file of a mapping on the pages that
>> correspond to added or removed regions of the file is unspecified.
>>
>> And no, according to my past experience, the kernel does *not* do any
>> such rounding up. It will just fail.
>
> I wrote a little test program to play with different values (attached).
> I tried this on my laptop with a 3.2 kernel (uname -r: 3.10-2-amd64), and
> on a VM with a fresh Centos 6.4 install with 2.6.32 kernel
> (2.6.32-358.18.1.el6.x86_64), and they both work the same:
>
> $ ./mmaptest 100    # mmap 100 bytes
>
> in a different terminal:
> $ cat /proc/meminfo | grep HugePages_Rsvd
> HugePages_Rsvd:       1
>
> So even a tiny allocation, much smaller than any page size, succeeds, and
> it reserves a huge page. I tried the same with larger values; the kernel
> always uses huge pages, and rounds up the allocation to a multiple of the
> huge page size.

When developing the prototype I am pretty sure I had to add the rounding
up - but I am not sure why now, because after chatting with Heikki about
it, I've looked around and the initial MAP_HUGETLB support in the kernel
(commit 4e52780d41a741fb4861ae1df2413dd816ec11b1) has support for rounding
up.

> So, let's just get rid of the /sys scanning code.

Alternatively we could round up NBuffers to actually use the additionally
allocated space. Not sure if that's worth the amount of code, but wasting
several megabytes - or even gigabytes - of memory isn't nice either.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)
On 16.09.2013 13:15, Andres Freund wrote:
> On 2013-09-16 11:15:28 +0300, Heikki Linnakangas wrote:
>> On 14.09.2013 02:41, Richard Poole wrote:
>>> The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
>>> on systems that support it. It's based on Christian Kruse's patch from
>>> last year, incorporating suggestions from Andres Freund.
>>
>> I don't understand the logic in figuring out the pagesize, and the smallest
>> supported hugepage size. First of all, even without the patch, why do we
>> round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel
>> will round up the request all by itself. The mmap() man page doesn't say
>> anything about length having to be a multiple of page size.
>
> I think it does:
> EINVAL We don't like addr, length, or offset (e.g., they are too
>        large, or not aligned on a page boundary).

That doesn't mean that they *all* have to be aligned on a page boundary.
It's understandable that 'addr' and 'offset' have to be, but it doesn't
make much sense for 'length'.

> and
> A file is mapped in multiples of the page size. For a file that is not
> a multiple of the page size, the remaining memory is zeroed when mapped,
> and writes to that region are not written out to the file. The effect of
> changing the size of the underlying file of a mapping on the pages that
> correspond to added or removed regions of the file is unspecified.
>
> And no, according to my past experience, the kernel does *not* do any
> such rounding up. It will just fail.

I wrote a little test program to play with different values (attached).
I tried this on my laptop with a 3.2 kernel (uname -r: 3.10-2-amd64), and
on a VM with a fresh Centos 6.4 install with 2.6.32 kernel
(2.6.32-358.18.1.el6.x86_64), and they both work the same:

$ ./mmaptest 100    # mmap 100 bytes

in a different terminal:
$ cat /proc/meminfo | grep HugePages_Rsvd
HugePages_Rsvd:       1

So even a tiny allocation, much smaller than any page size, succeeds, and
it reserves a huge page. I tried the same with larger values; the kernel
always uses huge pages, and rounds up the allocation to a multiple of the
huge page size.

So, let's just get rid of the /sys scanning code.

Robert, do you remember why you put the "pagesize = sysconf(_SC_PAGE_SIZE);"
call in the new mmap() shared memory allocator?

- Heikki

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>

int
main(int argc, char **argv)
{
	char	   *ptr;
	int			size;

	size = (argc > 1) ? atoi(argv[1]) : (100 * 4096);

	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (ptr != (void *) -1)
		printf("success: %p\n", ptr);
	else
		printf("failure: %s\n", strerror(errno));

	sleep(10);

	return 0;
}
Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)
On 2013-09-16 11:15:28 +0300, Heikki Linnakangas wrote:
> On 14.09.2013 02:41, Richard Poole wrote:
>> The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
>> on systems that support it. It's based on Christian Kruse's patch from
>> last year, incorporating suggestions from Andres Freund.
>
> I don't understand the logic in figuring out the pagesize, and the smallest
> supported hugepage size. First of all, even without the patch, why do we
> round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel
> will round up the request all by itself. The mmap() man page doesn't say
> anything about length having to be a multiple of page size.

I think it does:

	EINVAL We don't like addr, length, or offset (e.g., they are too
	       large, or not aligned on a page boundary).

and

	A file is mapped in multiples of the page size. For a file that is not
	a multiple of the page size, the remaining memory is zeroed when mapped,
	and writes to that region are not written out to the file. The effect
	of changing the size of the underlying file of a mapping on the pages
	that correspond to added or removed regions of the file is unspecified.

And no, according to my past experience, the kernel does *not* do any such
rounding up. It will just fail.

> And with the patch, why do you bother detecting the minimum supported
> hugepage size? Surely the kernel will choose the appropriate hugepage size
> just fine on its own, no?

It will fail if it's not a multiple.

>> It is still WIP as there are a couple of points that Andres has pointed
>> out to me that haven't been addressed yet;
>
> Which points are those?

I don't know which points Richard already has fixed, so I'll let him
comment on that.

> I wonder if it would be better to allow setting huge_tlb_pages=try even on
> platforms that don't have hugepages. It would simply mean the same as 'off'
> on such platforms.

I wouldn't argue against that.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)
On 14.09.2013 02:41, Richard Poole wrote:
> The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
> on systems that support it. It's based on Christian Kruse's patch from
> last year, incorporating suggestions from Andres Freund.

I don't understand the logic in figuring out the pagesize, and the smallest
supported hugepage size. First of all, even without the patch, why do we
round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel
will round up the request all by itself. The mmap() man page doesn't say
anything about length having to be a multiple of page size.

And with the patch, why do you bother detecting the minimum supported
hugepage size? Surely the kernel will choose the appropriate hugepage size
just fine on its own, no?

> It is still WIP as there are a couple of points that Andres has pointed
> out to me that haven't been addressed yet;

Which points are those?

I wonder if it would be better to allow setting huge_tlb_pages=try even on
platforms that don't have hugepages. It would simply mean the same as 'off'
on such platforms.

- Heikki
Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)
On Sat, 2013-09-14 at 00:41 +0100, Richard Poole wrote:
> The attached patch adds the MAP_HUGETLB flag to mmap() for shared
> memory on systems that support it.

Please fix the tabs in the SGML files.
[HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)
The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
on systems that support it. It's based on Christian Kruse's patch from
last year, incorporating suggestions from Andres Freund.

On a system with 4GB shared_buffers, doing pgbench runs long enough for
each backend to touch most of the buffers, this patch saves nearly 8MB of
memory per backend and improves performance by just over 2% on average.

It is still WIP as there are a couple of points that Andres has pointed
out to me that haven't been addressed yet; also, the documentation is
incomplete.

Richard

--
Richard Poole                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 23ebc11..703b28f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1052,6 +1052,42 @@ include 'filename'
+
+      huge_tlb_pages (enum)
+
+       huge_tlb_pages configuration parameter
+
+
+        Enables/disables the use of huge tlb pages. Valid values are
+        on, off and try.
+        The default value is try.
+
+
+        Use of huge tlb pages reduces the cpu time spent on memory
+        management and the amount of memory used for page tables and
+        therefore improves performance.
+
+
+        With huge_tlb_pages set to on
+        mmap() will be called with MAP_HUGETLB.
+        If the call fails the server will fail fatally.
+
+
+        With huge_tlb_pages set to off we
+        will not use MAP_HUGETLB at all.
+
+
+        With huge_tlb_pages set to try
+        we will try to use MAP_HUGETLB and fall back to
+        mmap() without MAP_HUGETLB.
+
+
+
       temp_buffers (integer)
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 20e3c32..57fff35 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -27,10 +27,14 @@
 #ifdef HAVE_SYS_SHM_H
 #include
 #endif
+#ifdef MAP_HUGETLB
+#include
+#endif
 
 #include "miscadmin.h"
 #include "storage/ipc.h"
 #include "storage/pg_shmem.h"
+#include "utils/guc.h"
 
 typedef key_t IpcMemoryKey;		/* shared memory key passed to shmget(2) */
@@ -61,6 +65,13 @@ typedef int IpcMemoryId;	/* shared memory ID returned by shmget(2) */
 #define MAP_FAILED ((void *) -1)
 #endif
 
+#ifdef MAP_HUGETLB
+#define PG_HUGETLB_BASE_ADDR (void *)(0x0UL)
+#define PG_MAP_HUGETLB MAP_HUGETLB
+#else
+#define PG_MAP_HUGETLB 0
+#endif
+
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 
@@ -342,6 +353,161 @@ PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2)
 }
 
 
+#ifdef MAP_HUGETLB
+#define HUGE_PAGE_INFO_DIR "/sys/kernel/mm/hugepages"
+
+/*
+ * static long InternalGetFreeHugepagesCount(const char *name)
+ *
+ * Attempt to read the number of available hugepages from
+ * /sys/kernel/mm/hugepages/hugepages-/free_hugepages
+ * Will fail (return -1) if file could not be opened, 0 if no pages are
+ * available and > 0 if there are free pages
+ *
+ */
+static long
+InternalGetFreeHugepagesCount(const char *name)
+{
+	int			fd;
+	char		buff[1024];
+	size_t		len;
+	long		result;
+	char	   *ptr;
+
+	len = snprintf(buff, 1024, "%s/%s/free_hugepages", HUGE_PAGE_INFO_DIR, name);
+	if (len == 1024)			/* I don't think that this will happen ever */
+	{
+		ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
+				(errmsg("Filename %s/%s/free_hugepages is too long", HUGE_PAGE_INFO_DIR, name),
+				 errcontext("while checking hugepage size")));
+		return -1;
+	}
+
+	fd = open(buff, O_RDONLY);
+	if (fd <= 0)
+	{
+		ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
+				(errmsg("Could not open file %s: %s", buff, strerror(errno)),
+				 errcontext("while checking hugepage size")));
+		return -1;
+	}
+
+	len = read(fd, buff, 1024);
+	if (len <= 0)
+	{
+		ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
+				(errmsg("Error reading from file %s: %s", buff, strerror(errno)),
+				 errcontext("while checking hugepage size")));
+		close(fd);
+		return -1;
+	}
+
+	/*
+	 * If the content of free_hugepages is longer than or equal to 1024 bytes
+	 * the rest is irrelevant; we simply want to know if there are any
+	 * hugepages left
+	 */
+	if (len == 1024)
+	{
+		buff[1023] = 0;
+	}
+	else
+	{
+		buff[len] = 0;
+	}
+
+	close(fd);
+
+	result = strtol(buff, &ptr, 10);
+
+	if (ptr == NULL)
+	{
+		ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
+				(errmsg("Could not convert contents of file %s/%s/free_hugepages to number", HUGE_PAGE_INFO_DIR, name),
+				 errcontext("while checking hugepage size")));
+		return -1;
+	}
+
+	return result;
+}
+
+/*
+ * static long InternalGetHugepageSize()
+ *
+ * Attempt to get a valid hugepage size from /sys/kernel/mm/hugepages/ by
+ * reading directory contents
+ * Will fail (return -1) if the directory could not be opened or no valid
+ * page si