Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)

2013-09-17 Thread Robert Haas
On Mon, Sep 16, 2013 at 9:13 AM, Heikki Linnakangas
 wrote:
> Robert, do you remember why you put the "pagesize = sysconf(_SC_PAGE_SIZE);"
> call in the new mmap() shared memory allocator?

Hmm, no.  Unfortunately, I don't.  We could try ripping it out and see
if the buildfarm breaks. If it is needed, then the dynamic shared
memory patch I posted probably needs it as well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)

2013-09-16 Thread Andres Freund
On 2013-09-16 15:18:50 +0200, Andres Freund wrote:
> > So even a tiny allocation, much smaller than any page size, succeeds, and it
> > reserves a huge page. I tried the same with larger values; the kernel always
> > uses huge pages, and rounds up the allocation to a multiple of the huge page
> > size.
> 
> When developing the prototype I am pretty sure I had to add the rounding
> up - but I am not sure why now, because after chatting with Heikki about
> it, I've looked around and the initial MAP_HUGETLB support in the kernel
> (commit 4e52780d41a741fb4861ae1df2413dd816ec11b1) has support for
> rounding up.

Ok, the reason for that seems to have been the following bug
https://bugzilla.kernel.org/show_bug.cgi?id=56881

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)

2013-09-16 Thread Andres Freund
On 2013-09-16 16:13:57 +0300, Heikki Linnakangas wrote:
> On 16.09.2013 13:15, Andres Freund wrote:
> >On 2013-09-16 11:15:28 +0300, Heikki Linnakangas wrote:
> >>On 14.09.2013 02:41, Richard Poole wrote:
> >>>The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
> >>>on systems that support it. It's based on Christian Kruse's patch from
> >>>last year, incorporating suggestions from Andres Freund.
> >>
> >>I don't understand the logic in figuring out the pagesize, and the smallest
> >>supported hugepage size. First of all, even without the patch, why do we
> >>round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel
> >>will round up the request all by itself. The mmap() man page doesn't say
> >>anything about length having to be a multiple of pages size.
> >
> >I think it does:
> >EINVAL We don't like addr, length, or offset (e.g., they are  too
> >   large,  or  not aligned on a page boundary).
> 
> That doesn't mean that they *all* have to be aligned on a page boundary.
> It's understandable that 'addr' and 'offset' have to be, but it doesn't make
> much sense for 'length'.
> 
> >and
> >A file is mapped in multiples of the page size.  For a file that is 
> > not a multiple
> >of  the  page size, the remaining memory is zeroed when mapped, and 
> > writes to that
> >region are not written out to the file.  The effect of changing the  
> > size  of  the
> >underlying  file  of  a  mapping  on the pages that correspond to 
> > added or removed
> >regions of the file is unspecified.
> >
> >And no, according to my past experience, the kernel does *not* do any
> >such rounding up. It will just fail.
> 
> I wrote a little test program to play with different values (attached). I
> tried this on my laptop with a 3.2 kernel (uname -r: 3.10-2-amd6), and on a
> VM with a fresh Centos 6.4 install with 2.6.32 kernel
> (2.6.32-358.18.1.el6.x86_64), and they both work the same:
> 
> $ ./mmaptest 100 # mmap 100 bytes
> 
> in a different terminal:
> $ cat /proc/meminfo  | grep HugePages_Rsvd
> HugePages_Rsvd:1
> 
> So even a tiny allocation, much smaller than any page size, succeeds, and it
> reserves a huge page. I tried the same with larger values; the kernel always
> uses huge pages, and rounds up the allocation to a multiple of the huge page
> size.

When developing the prototype I am pretty sure I had to add the rounding
up - but I am not sure why now, because after chatting with Heikki about
it, I've looked around and the initial MAP_HUGETLB support in the kernel
(commit 4e52780d41a741fb4861ae1df2413dd816ec11b1) has support for
rounding up.

> So, let's just get rid of the /sys scanning code.

Alternatively we could round up NBuffers to actually use the
additionally allocated space. Not sure if that's worth the amount of
code, but wasting several megabytes - or even gigabytes - of memory
isn't nice either.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)

2013-09-16 Thread Heikki Linnakangas

On 16.09.2013 13:15, Andres Freund wrote:

On 2013-09-16 11:15:28 +0300, Heikki Linnakangas wrote:

On 14.09.2013 02:41, Richard Poole wrote:

The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
on systems that support it. It's based on Christian Kruse's patch from
last year, incorporating suggestions from Andres Freund.


I don't understand the logic in figuring out the pagesize, and the smallest
supported hugepage size. First of all, even without the patch, why do we
round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel
will round up the request all by itself. The mmap() man page doesn't say
anything about length having to be a multiple of pages size.


I think it does:
EINVAL We don't like addr, length, or offset (e.g., they are  too
   large,  or  not aligned on a page boundary).


That doesn't mean that they *all* have to be aligned on a page boundary. 
It's understandable that 'addr' and 'offset' have to be, but it doesn't 
make much sense for 'length'.



and
A file is mapped in multiples of the page size.  For a file that is not 
a multiple
of  the  page size, the remaining memory is zeroed when mapped, and 
writes to that
region are not written out to the file.  The effect of changing the  
size  of  the
underlying  file  of  a  mapping  on the pages that correspond to added 
or removed
regions of the file is unspecified.

And no, according to my past experience, the kernel does *not* do any
such rounding up. It will just fail.


I wrote a little test program to play with different values (attached). 
I tried this on my laptop with a 3.2 kernel (uname -r: 3.10-2-amd6), and 
on a VM with a fresh Centos 6.4 install with 2.6.32 kernel 
(2.6.32-358.18.1.el6.x86_64), and they both work the same:


$ ./mmaptest 100 # mmap 100 bytes

in a different terminal:
$ cat /proc/meminfo  | grep HugePages_Rsvd
HugePages_Rsvd:1

So even a tiny allocation, much smaller than any page size, succeeds, 
and it reserves a huge page. I tried the same with larger values; the 
kernel always uses huge pages, and rounds up the allocation to a 
multiple of the huge page size.


So, let's just get rid of the /sys scanning code.

Robert, do you remember why you put the "pagesize = 
sysconf(_SC_PAGE_SIZE);" call in the new mmap() shared memory allocator?


- Heikki
#include 
#include 
#include 
#include 

int main(int argc, char **argv)
{
	char *ptr;
	int size;

	size = (argc > 1) ? atoi(argv[1]) : (100 * 4096);

	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (ptr != (void *) -1)
		printf("success: %p\n", ptr);
	else
		printf("failure: %s\n", strerror(errno));

	sleep(10);

	return 0;
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)

2013-09-16 Thread Andres Freund
On 2013-09-16 11:15:28 +0300, Heikki Linnakangas wrote:
> On 14.09.2013 02:41, Richard Poole wrote:
> >The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
> >on systems that support it. It's based on Christian Kruse's patch from
> >last year, incorporating suggestions from Andres Freund.
> 
> I don't understand the logic in figuring out the pagesize, and the smallest
> supported hugepage size. First of all, even without the patch, why do we
> round up the size passed to mmap() to the _SC_PAGE_SIZE? Surely the kernel
> will round up the request all by itself. The mmap() man page doesn't say
> anything about length having to be a multiple of pages size.

I think it does:
   EINVAL We don't like addr, length, or offset (e.g., they are  too
  large,  or  not aligned on a page boundary).
and
   A file is mapped in multiples of the page size.  For a file that is not 
a multiple
   of  the  page size, the remaining memory is zeroed when mapped, and 
writes to that
   region are not written out to the file.  The effect of changing the  
size  of  the
   underlying  file  of  a  mapping  on the pages that correspond to added 
or removed
   regions of the file is unspecified.

And no, according to my past experience, the kernel does *not* do any
such rounding up. It will just fail.

> And with the patch, why do you bother detecting the minimum supported
> hugepage size? Surely the kernel will choose the appropriate hugepage size
> just fine on its own, no?

It will fail if it's not a multiple.

> >It is still WIP as there are a couple of points that Andres has pointed
> >out to me that haven't been addressed yet;
> 
> Which points are those?

I don't know which point Richard already has fixed, so I'll let him
comment on that.

> I wonder if it would be better to allow setting huge_tlb_pages=try even on
> platforms that don't have hugepages. It would simply mean the same as 'off'
> on such platforms.

I wouldn't argue against that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)

2013-09-16 Thread Heikki Linnakangas

On 14.09.2013 02:41, Richard Poole wrote:

The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
on systems that support it. It's based on Christian Kruse's patch from
last year, incorporating suggestions from Andres Freund.


I don't understand the logic in figuring out the pagesize, and the 
smallest supported hugepage size. First of all, even without the patch, 
why do we round up the size passed to mmap() to the _SC_PAGE_SIZE? 
Surely the kernel will round up the request all by itself. The mmap() 
man page doesn't say anything about length having to be a multiple of 
pages size.


And with the patch, why do you bother detecting the minimum supported 
hugepage size? Surely the kernel will choose the appropriate hugepage 
size just fine on its own, no?



It is still WIP as there are a couple of points that Andres has pointed
out to me that haven't been addressed yet;


Which points are those?

I wonder if it would be better to allow setting huge_tlb_pages=try even 
on platforms that don't have hugepages. It would simply mean the same as 
'off' on such platforms.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)

2013-09-14 Thread Peter Eisentraut
On Sat, 2013-09-14 at 00:41 +0100, Richard Poole wrote:
> The attached patch adds the MAP_HUGETLB flag to mmap() for shared
> memory on systems that support it. 

Please fix the tabs in the SGML files.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] patch: add MAP_HUGETLB to mmap() where supported (WIP)

2013-09-13 Thread Richard Poole
The attached patch adds the MAP_HUGETLB flag to mmap() for shared memory
on systems that support it. It's based on Christian Kruse's patch from
last year, incorporating suggestions from Andres Freund.

On a system with 4GB shared_buffers, doing pgbench runs long enough for
each backend to touch most of the buffers, this patch saves nearly 8MB of
memory per backend and improves performances by just over 2% on average.

It is still WIP as there are a couple of points that Andres has pointed
out to me that haven't been addressed yet; also, the documentation is
incomplete.

Richard

-- 
Richard Poole http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 23ebc11..703b28f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1052,6 +1052,42 @@ include 'filename'
   
  
 
+ 
+  huge_tlb_pages (enum)
+  
+   huge_tlb_pages configuration parameter
+  
+  
+   
+Enables/disables the use of huge tlb pages. Valid values are
+on, off and try.
+The default value is try.
+   
+
+	   
+	   Use of huge tlb pages reduces the cpu time spent on memory management and
+	   the amount of memory used for page tables and therefore improves performance.
+	   
+
+   
+With huge_tlb_pages set to on
+mmap() will be called with MAP_HUGETLB.
+If the call fails the server will fail fatally.
+   
+
+   
+With huge_tlb_pages set to off we
+will not use MAP_HUGETLB at all.
+   
+
+   
+With huge_tlb_pages set to try
+we will try to use MAP_HUGETLB and fall back to
+mmap() without MAP_HUGETLB.
+   
+  
+ 
+
  
   temp_buffers (integer)
   
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 20e3c32..57fff35 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -27,10 +27,14 @@
 #ifdef HAVE_SYS_SHM_H
 #include 
 #endif
+#ifdef MAP_HUGETLB
+#include 
+#endif
 
 #include "miscadmin.h"
 #include "storage/ipc.h"
 #include "storage/pg_shmem.h"
+#include "utils/guc.h"
 
 
 typedef key_t IpcMemoryKey;		/* shared memory key passed to shmget(2) */
@@ -61,6 +65,13 @@ typedef int IpcMemoryId;		/* shared memory ID returned by shmget(2) */
 #define MAP_FAILED ((void *) -1)
 #endif
 
+#ifdef MAP_HUGETLB
+#define PG_HUGETLB_BASE_ADDR (void *)(0x0UL)
+#define PG_MAP_HUGETLB MAP_HUGETLB
+#else
+#define PG_MAP_HUGETLB 0
+#endif
+
 
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
@@ -342,6 +353,161 @@ PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2)
 }
 
 
+#ifdef MAP_HUGETLB
+#define HUGE_PAGE_INFO_DIR  "/sys/kernel/mm/hugepages"
+
+/*
+ *	static long InternalGetFreeHugepagesCount(const char *name)
+ *
+ * Attempt to read the number of available hugepages from
+ * /sys/kernel/mm/hugepages/hugepages-/free_hugepages
+ * Will fail (return -1) if file could not be opened, 0 if no pages are available
+ * and > 0 if there are free pages
+ *
+ */
+static long
+InternalGetFreeHugepagesCount(const char *name)
+{
+	int fd;
+	char buff[1024];
+	size_t len;
+	long result;
+	char *ptr;
+
+	len = snprintf(buff, 1024, "%s/%s/free_hugepages", HUGE_PAGE_INFO_DIR, name);
+	if (len == 1024) /* I don't think that this will happen ever */
+	{
+		ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
+(errmsg("Filename %s/%s/free_hugepages is too long", HUGE_PAGE_INFO_DIR, name),
+ errcontext("while checking hugepage size")));
+		return -1;
+	}
+
+	fd = open(buff, O_RDONLY);
+	if (fd <= 0)
+	{
+		ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
+(errmsg("Could not open file %s: %s", buff, strerror(errno)),
+ errcontext("while checking hugepage size")));
+		return -1;
+	}
+
+	len = read(fd, buff, 1024);
+	if (len <= 0)
+	{
+		ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
+(errmsg("Error reading from file %s: %s", buff, strerror(errno)),
+ errcontext("while checking hugepage size")));
+		close(fd);
+		return -1;
+	}
+
+	/*
+	 * If the content of free_hugepages is longer than or equal to 1024 bytes
+	 * the rest is irrelevant; we simply want to know if there are any
+	 * hugepages left
+	 */
+	if (len == 1024)
+	{
+		buff[1023] = 0;
+	}
+	else
+	{
+		buff[len] = 0;
+	}
+
+	close(fd);
+
+	result = strtol(buff, &ptr, 10);
+
+	if (ptr == NULL)
+	{
+		ereport(huge_tlb_pages == HUGE_TLB_TRY ? DEBUG1 : WARNING,
+(errmsg("Could not convert contents of file %s/%s/free_hugepages to number", HUGE_PAGE_INFO_DIR, name),
+ errcontext("while checking hugepage size")));
+		return -1;
+	}
+
+	return result;
+}
+
+/*
+ *	static long InternalGetHugepageSize()
+ *
+ * Attempt to get a valid hugepage size from /sys/kernel/mm/hugepages/ by
+ * reading directory contents
+ * Will fail (return -1) if the directory could not be opened or no valid
+ * page si