remap the .text segment into huge pages at run time

2022-11-01 Thread John Naylor
It's been known for a while that Postgres spends a lot of time translating
instruction addresses, and using huge pages in the text segment yields a
substantial performance boost in OLTP workloads [1][2]. The difficulty is,
this normally requires a lot of painstaking work (unless your OS does
superpage promotion, like FreeBSD).

I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
remap the .text segment to huge pages at program start. Attached is a
hackish, Meson-only, "works on my machine" patchset to experiment with this
idea.

0001 adapts the library to our error logging and GUC system. The overview:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap a temporary region and memcpy the aligned portion of the .text
segment
- mmap aligned start address to a second region with huge pages and
MAP_FIXED
- memcpy over from the temp region and revoke the PROT_WRITE bit

The reason this doesn't "saw off the branch you're standing on" is that the
remapping is done in a function that's forced to live in a different
segment, and doesn't call any non-libc functions living elsewhere:

static void
__attribute__((__section__("lpstub")))
__attribute__((__noinline__))
MoveRegionToLargePages(const mem_range * r, int mmap_flags)

Debug messages show

2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text start: 0x487540
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text end:   0x96cf12
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text start: 0x600000
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text end:   0x800000
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  binary mapped to huge pages
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  un-mmapping temporary code region

Here, out of 5MB of Postgres text, only 1 huge page can be used, but that
still saves 512 entries in the TLB and might bring a small improvement. The
un-remapped region below 0x600000 contains the ~600kB of "cold" code, since
the linker puts the cold section first, at least in recent versions of ld and
lld.

0002 is my attempt to force the linker's hand and get the entire text
segment mapped to huge pages. It's quite a finicky hack, and easily broken
(see below). That said, it still builds easily within our normal build
process, and maybe there is a better way to get the effect.

It does two things:

- Pass the linker -Wl,-zcommon-page-size=2097152
-Wl,-zmax-page-size=2097152 which aligns .init to a 2MB boundary. That's
done for predictability, but that means the next 2MB boundary is very
nearly 2MB away.

- Add a "cold" __asm__ filler function that just takes up space, enough to
push the end of the .text segment over the next aligned boundary, or to
~8MB in size.

In a non-assert build:

0001:

$ bloaty inst-perf/bin/postgres

    FILE SIZE         VM SIZE
    ---------         -------
  53.7%  4.90Mi   58.7%  4.90Mi    .text
...
 100.0%  9.12Mi  100.0%  8.35Mi    TOTAL

$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name   Type      Address   Off     Size    ES Flg Lk Inf Al
...
  [12] .init  PROGBITS  00486000  086000  00001b  00  AX  0   0  4
  [13] .plt   PROGBITS  00486020  086020  001520  10  AX  0   0 16
  [14] .text  PROGBITS  00487540  087540  4e59d2  00  AX  0   0 16
...

0002:

$ bloaty inst-perf/bin/postgres

    FILE SIZE         VM SIZE
    ---------         -------
  46.9%  8.00Mi   69.9%  8.00Mi    .text
...
 100.0%  17.1Mi  100.0%  11.4Mi    TOTAL


$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name   Type      Address   Off     Size    ES Flg Lk Inf Al
...
  [12] .init  PROGBITS  00600000  200000  00001b  00  AX  0   0  4
  [13] .plt   PROGBITS  00600020  200020  001520  10  AX  0   0 16
  [14] .text  PROGBITS  00601540  201540  7ff512  00  AX  0   0 16
...

Debug messages with 0002 show 6MB mapped:

2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text start: 0x601540
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text end:   0xe00a52
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text start: 0x800000
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text end:   0xe00000
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  binary mapped to huge pages
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  un-mmapping temporary code region

Since the front is all-cold, and there is very little at the end,
practically all hot pages are now remapped. The biggest problem with the
hackish filler function (in addition to maintainability) is, if explicit
huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
causes complete startup failure if the .text segment is larger than 8MB. I
haven't looked into what's happening there yet, but I didn't want to get
too far in the weeds before getting feedback on whether the entire approach
in this thread is sound enough.

Re: remap the .text segment into huge pages at run time

2022-11-03 Thread Andres Freund
Hi,

On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> It's been known for a while that Postgres spends a lot of time translating
> instruction addresses, and using huge pages in the text segment yields a
> substantial performance boost in OLTP workloads [1][2].

Indeed. Some of that we eventually should address by making our code less
"jumpy", but that's a large amount of work and only going to go so far.


> The difficulty is,
> this normally requires a lot of painstaking work (unless your OS does
> superpage promotion, like FreeBSD).

I still am confused by FreeBSD being able to do this without changing the
section alignment to be big enough. Or is the default alignment on FreeBSD
large enough already?


> I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> remap the .text segment to huge pages at program start. Attached is a
> hackish, Meson-only, "works on my machine" patchset to experiment with this
> idea.

I wonder how far we can get with just using the linker hints to align
sections. I know that the linux folks are working on promoting sufficiently
aligned executable pages to huge pages too, and might have succeeded already.

IOW, adding the linker flags might be a good first step.


> 0001 adapts the library to our error logging and GUC system. The overview:
> 
> - read ELF info to get the start/end addresses of the .text segment
> - calculate addresses therein aligned at huge page boundaries
> - mmap a temporary region and memcpy the aligned portion of the .text
> segment
> - mmap aligned start address to a second region with huge pages and
> MAP_FIXED
> - memcpy over from the temp region and revoke the PROT_WRITE bit

Would mremap()'ing the temporary region also work? That might be simpler and
more robust (you'd see the MAP_HUGETLB failure before doing anything
irreversible). And you then might not even need this:

> The reason this doesn't "saw off the branch you're standing on" is that the
> remapping is done in a function that's forced to live in a different
> segment, and doesn't call any non-libc functions living elsewhere:
> 
> static void
> __attribute__((__section__("lpstub")))
> __attribute__((__noinline__))
> MoveRegionToLargePages(const mem_range * r, int mmap_flags)


This would likely need a bunch more gating than the patch, understandably,
has. I think it'd fail horribly if there were .text relocations, for example?
I think there are some architectures that do that by default...


> 0002 is my attempt to force the linker's hand and get the entire text
> segment mapped to huge pages. It's quite a finicky hack, and easily broken
> (see below). That said, it still builds easily within our normal build
> process, and maybe there is a better way to get the effect.
> 
> It does two things:
> 
> - Pass the linker -Wl,-zcommon-page-size=2097152
> -Wl,-zmax-page-size=2097152 which aligns .init to a 2MB boundary. That's
> done for predictability, but that means the next 2MB boundary is very
> nearly 2MB away.

Yep. FWIW, my notes say

# align sections to 2MB boundaries for hugepage support
# bfd and gold linkers:
# -Wl,-zmax-page-size=0x200000 -Wl,-zcommon-page-size=0x200000
# lld:
# -Wl,-zmax-page-size=0x200000 -Wl,-z,separate-loadable-segments
# then copy binary to tmpfs mounted with -o huge=always

I.e. with lld you need slightly different flags 
-Wl,-z,separate-loadable-segments

The meson bit should probably just use
cc.get_supported_link_arguments([
  '-Wl,-zmax-page-size=0x200000',
  '-Wl,-zcommon-page-size=0x200000',
  '-Wl,-zseparate-loadable-segments'])

Afaict there's really no reason to not do that by default, allowing kernels
that can promote to huge pages to do so.


My approach to forcing huge pages to be used was to then:

# copy binary to tmpfs mounted with -o huge=always


> - Add a "cold" __asm__ filler function that just takes up space, enough to
> push the end of the .text segment over the next aligned boundary, or to
> ~8MB in size.

I don't understand why this is needed - as long as the pages are aligned to
2MB, why do we need to fill things up on disk? The in-memory contents are the
relevant bit, no?


> Since the front is all-cold, and there is very little at the end,
> practically all hot pages are now remapped. The biggest problem with the
> hackish filler function (in addition to maintainability) is, if explicit
> huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> causes complete startup failure if the .text segment is larger than 8MB.

I would expect MAP_HUGETLB to always fail if not enabled in the kernel,
independent of the .text segment size?



> +/* Callback for dl_iterate_phdr to set the start and end of the .text segment */
> +static int
> +FindMapping(struct dl_phdr_info *hdr, size_t size, void *data)
> +{
> + ElfW(Shdr) text_section;
> + FindParams *find_params = (FindParams *) data;
> +
> + /*
> +  * We are only interested in the mapping matching the main executable.
> +  * This

Re: remap the .text segment into huge pages at run time

2022-11-04 Thread Andres Freund
Hi,

This nerd-sniped me badly :)

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> > I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> > remap the .text segment to huge pages at program start. Attached is a
> > hackish, Meson-only, "works on my machine" patchset to experiment with this
> > idea.
>
> I wonder how far we can get with just using the linker hints to align
> sections. I know that the linux folks are working on promoting sufficiently
> aligned executable pages to huge pages too, and might have succeeded already.
>
> IOW, adding the linker flags might be a good first step.

Indeed, I did see that that works to some degree on the 5.19 kernel I was
running. However, it never seems to get around to using huge pages
sufficiently to compete with explicit use of huge pages.

More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
added into linux 6.1. That explicitly remaps a region and uses huge pages for
it. Of course that's going to take a while to be widely available, but it
seems like a safer approach than the remapping approach from this thread.

I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
the address / length), and it seems to work nicely.

With the weird caveat that on some filesystems one needs to make sure that the
executable doesn't use reflinks to reuse parts of other files, and that the
mold linker and cp do... Not a concern on ext4, but it is on xfs. I took to
copying the postgres binary with cp --reflink=never


FWIW, you can see the state of the page mapping in more detail with the
kernel's page-types tool

sudo /home/andres/src/kernel/tools/vm/page-types -L -p 12297 -a 0x55800,0x56122
sudo /home/andres/src/kernel/tools/vm/page-types -f /srv/dev/build/m-opt/src/backend/postgres2


Perf results:

c=150; psql -f ~/tmp/prewarm.sql; \
perf stat -a -e cycles,iTLB-loads,iTLB-load-misses,itlb_misses.walk_active,itlb_misses.walk_completed_4k,itlb_misses.walk_completed_2m_4m,itlb_misses.walk_completed_1g \
  pgbench -n -M prepared -S -P1 -c$c -j$c -T10

without MADV_COLLAPSE:

tps = 1038230.070771 (without initial connection time)

 Performance counter stats for 'system wide':

 1,184,344,476,152  cycles                                        (71.41%)
     2,846,146,710  iTLB-loads                                    (71.43%)
     2,021,885,782  iTLB-load-misses      # 71.04% of all iTLB cache accesses  (71.44%)
    75,633,850,933  itlb_misses.walk_active                       (71.44%)
     2,020,962,930  itlb_misses.walk_completed_4k                 (71.44%)
         1,213,368  itlb_misses.walk_completed_2m_4m              (57.12%)
             2,293  itlb_misses.walk_completed_1g                 (57.11%)

  10.064352587 seconds time elapsed



with MADV_COLLAPSE:

tps = 1113717.114278 (without initial connection time)

 Performance counter stats for 'system wide':

 1,173,049,140,611  cycles                                        (71.42%)
     1,059,224,678  iTLB-loads                                    (71.44%)
       653,603,712  iTLB-load-misses      # 61.71% of all iTLB cache accesses  (71.44%)
    26,135,902,949  itlb_misses.walk_active                       (71.44%)
       628,314,285  itlb_misses.walk_completed_4k                 (71.44%)
        25,462,916  itlb_misses.walk_completed_2m_4m              (57.13%)
             2,228  itlb_misses.walk_completed_1g                 (57.13%)

Note that while the rate of itlb-misses stays roughly the same, the total
number of iTLB loads reduced substantially, and the number of cycles in which
an itlb miss was in progress is 1/3 of what it was before.


A lot of the remaining misses are from the context switches. The iTLB is
flushed on context switches, and of course pgbench -S is extremely context
switch heavy.

Comparing plain -S with 10 pipelined -S transactions (using -t 10 / -t
1 to compare the same amount of work) I get:


without MADV_COLLAPSE:

not pipelined:

tps = 1037732.722805 (without initial connection time)

 Performance counter stats for 'system wide':

 1,691,411,678,007  cycles                                        (62.48%)
         8,856,107  itlb.itlb_flush                               (62.48%)
     4,600,041,062  iTLB-loads                                    (62.48%)
     2,598,218,236  iTLB-load-misses      # 56.48% of all iTLB cache accesses  (62.50%)
   100,095,862,126  itlb_misses.walk_active

Re: remap the .text segment into huge pages at run time

2022-11-04 Thread Andres Freund
Hi,

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > push the end of the .text segment over the next aligned boundary, or to
> > ~8MB in size.
>
> I don't understand why this is needed - as long as the pages are aligned to
> 2MB, why do we need to fill things up on disk? The in-memory contents are the
> relevant bit, no?

I now assume it's because you observed the mappings set up by the
loader to not include the space between the segments?

With sufficient linker flags the segments are sufficiently aligned both on
disk and in memory to just map more:

bfd: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000
  Type   Offset     VirtAddr   PhysAddr   FileSiz    MemSiz     Flags  Align
...
  LOAD   0x00000000 0x00000000 0x00000000 0x000c7f58 0x000c7f58 R      0x200000
  LOAD   0x00200000 0x00200000 0x00200000 0x00921d39 0x00921d39 R E    0x200000
  LOAD   0x00c00000 0x00c00000 0x00c00000 0x002626b8 0x002626b8 R      0x200000
  LOAD   0x00fdf510 0x011df510 0x011df510 0x00037fd6 0x0006a310 RW     0x200000

gold: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,--rosegment
  Type   Offset     VirtAddr   PhysAddr   FileSiz    MemSiz     Flags  Align
...
  LOAD   0x00000000 0x00000000 0x00000000 0x009230f9 0x009230f9 R E    0x200000
  LOAD   0x00a00000 0x00a00000 0x00a00000 0x0033a738 0x0033a738 R      0x200000
  LOAD   0x00ddf4e0 0x00fdf4e0 0x00fdf4e0 0x0003800a 0x0006a340 RW     0x200000

lld: -Wl,-zmax-page-size=0x200000,-zseparate-loadable-segments
  LOAD   0x00000000 0x00000000 0x00000000 0x0033710c 0x0033710c R      0x200000
  LOAD   0x00400000 0x00400000 0x00400000 0x00921cb0 0x00921cb0 R E    0x200000
  LOAD   0x00e00000 0x00e00000 0x00e00000 0x00020ae0 0x00020ae0 RW     0x200000
  LOAD   0x01000000 0x01000000 0x01000000 0x000174ea 0x00049820 RW     0x200000

mold: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,-zseparate-loadable-segments
  Type   Offset     VirtAddr   PhysAddr   FileSiz    MemSiz     Flags  Align
...
  LOAD   0x00000000 0x00000000 0x00000000 0x0032dde9 0x0032dde9 R      0x200000
  LOAD   0x00400000 0x00400000 0x00400000 0x00921cbe 0x00921cbe R E    0x200000
  LOAD   0x00e00000 0x00e00000 0x00e00000 0x002174e8 0x00249820 RW     0x200000

With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
are padded to the next 2MiB boundary. However the OS / dynamic loader only
maps the necessary part, not all the zero padding.

This means that if we were to issue a MADV_COLLAPSE, we can before it do an
mremap() to increase the length of the mapping.


MADV_COLLAPSE without mremap:

tps = 1117335.766756 (without initial connection time)

 Performance counter stats for 'system wide':

 1,169,012,466,070  cycles                                        (55.53%)
   729,146,640,019  instructions          # 0.62 insn per cycle   (66.65%)
         7,062,923  itlb.itlb_flush                               (66.65%)
     1,041,825,587  iTLB-loads                                    (66.65%)
       634,272,420  iTLB-load-misses      # 60.88% of all iTLB cache accesses  (66.66%)
    27,018,254,873  itlb_misses.walk_active                       (66.68%)
       610,639,252  itlb_misses.walk_completed_4k                 (44.47%)
        24,262,549  itlb_misses.walk_completed_2m_4m              (44.46%)
             2,948  itlb_misses.walk_completed_1g                 (44.43%)

  10.039217004 seconds time elapsed


MADV_COLLAPSE with mremap:

tps = 1140869.853616 (without initial connection time)

 Performance counter stats for 'system wide':

 1,173,272,878,934  cycle

Re: remap the .text segment into huge pages at run time

2022-11-04 Thread John Naylor
On Sat, Nov 5, 2022 at 1:33 AM Andres Freund  wrote:

> > I wonder how far we can get with just using the linker hints to align
> > sections. I know that the linux folks are working on promoting sufficiently
> > aligned executable pages to huge pages too, and might have succeeded already.
> >
> > IOW, adding the linker flags might be a good first step.
>
> Indeed, I did see that that works to some degree on the 5.19 kernel I was
> running. However, it never seems to get around to using huge pages
> sufficiently to compete with explicit use of huge pages.

Oh nice, I didn't know that! There might be some threshold of pages mapped
before it does so. At least, that issue is mentioned in that paper linked
upthread for FreeBSD.

> More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
> added into linux 6.1. That explicitly remaps a region and uses huge pages for
> it. Of course that's going to take a while to be widely available, but it
> seems like a safer approach than the remapping approach from this thread.

I didn't know that either, funny timing.

> I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> the address / length), and it seems to work nicely.
>
> With the weird caveat that on some filesystems one needs to make sure that the
> executable doesn't use reflinks to reuse parts of other files, and that the
> mold linker and cp do... Not a concern on ext4, but it is on xfs. I took to
> copying the postgres binary with cp --reflink=never

What happens otherwise? That sounds like a difficult thing to guard against.

> The difference in itlb.itlb_flush between pipelined / non-pipelined cases
> unsurprisingly is stark.
>
> While the pipelined case still sees a good bit reduced itlb traffic, the total
> amount of cycles in which a walk is active is just not large enough to matter,
> by the looks of it.

Good to know, thanks for testing. Maybe the pipelined case is something
devs should consider when microbenchmarking, to reduce noise from context
switches.

On Sat, Nov 5, 2022 at 4:21 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > - Add a "cold" __asm__ filler function that just takes up space,
enough to
> > > push the end of the .text segment over the next aligned boundary, or
to
> > > ~8MB in size.
> >
> > I don't understand why this is needed - as long as the pages are
aligned to
> > 2MB, why do we need to fill things up on disk? The in-memory contents
are the
> > relevant bit, no?
>
> I now assume it's because you either observed the mappings set up by the
> loader to not include the space between the segments?

My knowledge is not quite that deep. The iodlr repo has an example "hello
world" program, which links with 8 filler objects, each with 32768
__attribute((used)) dummy functions. I just cargo-culted that idea and
simplified it. Interestingly enough, looking through the commit history,
they used to align the segments via linker flags, but took it out here:

https://github.com/intel/iodlr/pull/25#discussion_r397787559

...saying "I'm not sure why we added this". :/

I quickly tried to align the segments with the linker and then in my patch
have the address for mmap() rounded *down* from the .text start to the
beginning of that segment. It refused to start without logging an error.

BTW, that's what I meant before, although I wasn't clear:

> > Since the front is all-cold, and there is very little at the end,
> > practically all hot pages are now remapped. The biggest problem with the
> > hackish filler function (in addition to maintainability) is, if explicit
> > huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> > causes complete startup failure if the .text segment is larger than 8MB.
>
> I would expect MAP_HUGETLB to always fail if not enabled in the kernel,
> independent of the .text segment size?

With the file-level hack, it would just fail without a trace with .text >
8MB (I have yet to enable core dumps on this new OS I have...), whereas
without it I did see the failures in the log, and successful fallback.

> With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> are padded to the next 2MiB boundary. However the OS / dynamic loader only
> maps the necessary part, not all the zero padding.
>
> This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> mremap() to increase the length of the mapping.

I see, interesting. What location are you passing for madvise() and
mremap()? The beginning of the segment (for me has .init/.plt) or an
aligned boundary within .text?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: remap the .text segment into huge pages at run time

2022-11-05 Thread Andres Freund
Hi,

On 2022-11-05 12:54:18 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 1:33 AM Andres Freund  wrote:
> > I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> > the address / length), and it seems to work nicely.
> >
> > With the weird caveat that on some filesystems one needs to make sure that
> > the executable doesn't use reflinks to reuse parts of other files, and that
> > the mold linker and cp do... Not a concern on ext4, but it is on xfs. I took
> > to copying the postgres binary with cp --reflink=never
>
> What happens otherwise? That sounds like a difficult thing to guard against.

MADV_COLLAPSE fails, but otherwise things continue on. I think it's mostly an
issue on dev systems, not on prod systems, because there the files will be
unpacked from a package or such.


> > On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > > - Add a "cold" __asm__ filler function that just takes up space, enough
> > > > to push the end of the .text segment over the next aligned boundary, or
> > > > to ~8MB in size.
> > >
> > > I don't understand why this is needed - as long as the pages are aligned
> > > to 2MB, why do we need to fill things up on disk? The in-memory contents
> > > are the relevant bit, no?
> >
> > I now assume it's because you observed the mappings set up by the
> > loader to not include the space between the segments?
>
> My knowledge is not quite that deep. The iodlr repo has an example "hello
> world" program, which links with 8 filler objects, each with 32768
> __attribute((used)) dummy functions. I just cargo-culted that idea and
> simplified it. Interestingly enough, looking through the commit history,
> they used to align the segments via linker flags, but took it out here:
>
> https://github.com/intel/iodlr/pull/25#discussion_r397787559
>
> ...saying "I'm not sure why we added this". :/

That was about using a linker script, not really linker flags though.

I don't think the dummy functions are a good approach; there were plenty of
things after them when I played with this.



> I quickly tried to align the segments with the linker and then in my patch
> have the address for mmap() rounded *down* from the .text start to the
> beginning of that segment. It refused to start without logging an error.

Hm, what linker was that? I did note that you need some additional flags for
some of the linkers.


> > With these flags the "R E" segments all start on a 0x200000/2MiB boundary
> > and are padded to the next 2MiB boundary. However the OS / dynamic loader
> > only maps the necessary part, not all the zero padding.
> >
> > This means that if we were to issue a MADV_COLLAPSE, we can before it do
> > an mremap() to increase the length of the mapping.
>
> I see, interesting. What location are you passing for madvise() and
> mremap()? The beginning of the segment (for me has .init/.plt) or an
> aligned boundary within .text?

I started postgres with setarch -R, looked at /proc/$pid/[s]maps to see the
start/end of the r-xp mapped segment.  Here's my hacky code, with a bunch of
comments added.

	void	   *addr = (void *) 0x5580;
	void	   *end = (void *) 0x55e09000;
	size_t		advlen = (uintptr_t) end - (uintptr_t) addr;

	const size_t bound = 1024 * 1024 * 2;
	size_t		advlen_up = (advlen + bound - 1) & ~(bound - 1);
	void	   *r2;
	int			r;

	/*
	 * Increase size of mapping to cover the trailing padding to the next
	 * segment. Otherwise all the code in that range can't be put into
	 * a huge page (access in the non-mapped range needs to cause a fault,
	 * hence can't be in the huge page).
	 * XXX: Should probably assert that that space is actually zeroes.
	 */
	r2 = mremap(addr, advlen, advlen_up, 0);
	if (r2 == MAP_FAILED)
		fprintf(stderr, "mremap failed: %m\n");
	else if (r2 != addr)
		fprintf(stderr, "mremap wrong addr: %m\n");
	else
		advlen = advlen_up;

	/*
	 * The docs for MADV_COLLAPSE say there should be at least one page
	 * in the mapped space "for every eligible hugepage-aligned/sized
	 * region to be collapsed". I just forced that. But probably not
	 * necessary.
	 */
	r = madvise(addr, advlen, MADV_WILLNEED);
	if (r != 0)
		fprintf(stderr, "MADV_WILLNEED failed: %m\n");

	r = madvise(addr, advlen, MADV_POPULATE_READ);
	if (r != 0)
		fprintf(stderr, "MADV_POPULATE_READ failed: %m\n");

	/*
	 * Make huge pages out of it. Requires at least linux 6.1.  We could
	 * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
	 * much in older kernels.
	 */
#define MADV_COLLAPSE 25
	r = madvise(addr, advlen, MADV_COLLAPSE);
	if (r != 0)
		fprintf(stderr, "MADV_COLLAPSE failed: %m\n");


A real version would have to open /proc/self/maps and do this for at least
postgres' r-xp mapping. We could do it for librarie

Re: remap the .text segment into huge pages at run time

2022-11-05 Thread John Naylor
On Sat, Nov 5, 2022 at 3:27 PM Andres Freund  wrote:

> > simplified it. Interestingly enough, looking through the commit history,
> > they used to align the segments via linker flags, but took it out here:
> >
> > https://github.com/intel/iodlr/pull/25#discussion_r397787559
> >
> > ...saying "I'm not sure why we added this". :/
>
> That was about using a linker script, not really linker flags though.

Oops, the commit I was referring to pointed to that discussion, but I
should have shown it instead:

--- a/large_page-c/example/Makefile
+++ b/large_page-c/example/Makefile
@@ -28,7 +28,6 @@ OBJFILES=  \
   filler16.o   \

 OBJS=$(addprefix $(OBJDIR)/,$(OBJFILES))
-LDFLAGS=-Wl,-z,max-page-size=2097152

But from what you're saying, this flag wouldn't have been enough anyway...

> I don't think the dummy functions are a good approach; there were plenty of
> things after them when I played with this.

To be technical, the point wasn't to have no code after it, but to have no
*hot* code *before* it, since with the iodlr approach the first 1.99MB of
.text is below the first aligned boundary within that section. But yeah,
I'm happy to ditch that hack entirely.

> > > With these flags the "R E" segments all start on a 0x200000/2MiB boundary
> > > and are padded to the next 2MiB boundary. However the OS / dynamic loader
> > > only maps the necessary part, not all the zero padding.
> > >
> > > This means that if we were to issue a MADV_COLLAPSE, we can before it do
> > > an mremap() to increase the length of the mapping.
> >
> > I see, interesting. What location are you passing for madvise() and
> > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > aligned boundary within .text?

> /*
>  * Make huge pages out of it. Requires at least linux 6.1.  We could
>  * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
>  * much in older kernels.
>  */

About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
THP? The man page seems to indicate that.

In the support work I've done, the standard recommendation is to turn THP
off, especially if they report sudden performance problems. If explicit
HP's are used for shared mem, maybe THP is less of a risk? I need to look
back at the tests that led to that advice...

> A real version would have to open /proc/self/maps and do this for at least

I can try and generalize your above sketch into a v2 patch.

> postgres' r-xp mapping. We could do it for libraries too, if they're suitably
> aligned (both in memory and on-disk).

It looks like plpgsql is only 27 standard pages in size...

Regarding glibc, we could try moving a couple of the hotter functions into
PG, using smaller and simpler coding, if that has better frontend cache
behavior. The paper "Understanding and Mitigating Front-End Stalls in
Warehouse-Scale Computers" talks about this, particularly section 4.4
regarding memcmp().

> > I quickly tried to align the segments with the linker and then in my patch
> > have the address for mmap() rounded *down* from the .text start to the
> > beginning of that segment. It refused to start without logging an error.
>
> Hm, what linker was that? I did note that you need some additional flags for
> some of the linkers.

BFD, but I wouldn't worry about that failure too much, since the
mremap()/madvise() strategy has a lot fewer moving parts.

On the subject of linkers, though, one thing that tripped me up was trying
to change the linker with Meson. First I tried

-Dc_args='-fuse-ld=lld'

but that led to warnings like this:
/usr/bin/ld: warning: -z separate-loadable-segments ignored

When using this in the top level meson.build

elif host_system == 'linux'
  sema_kind = 'unnamed_posix'
  cppflags += '-D_GNU_SOURCE'
  # Align the loadable segments to 2MB boundaries to support remapping to
  # huge pages.
  ldflags += cc.get_supported_link_arguments([
    '-Wl,-zmax-page-size=0x200000',
    '-Wl,-zcommon-page-size=0x200000',
    '-Wl,-zseparate-loadable-segments'
  ])


According to

https://mesonbuild.com/howtox.html#set-linker

I need to add CC_LD=lld to the env vars before invoking, which got rid of
the warning. Then I wanted to verify that lld was actually used, and in

https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

it says I can run this and it should show “Linker: LLD”, but that doesn't
appear for me:

$ readelf --string-dump .comment inst-perf/bin/postgres

String dump of section '.comment':
  [ 0]  GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)


--
John Naylor
EDB: http://www.enterprisedb.com


Re: remap the .text segment into huge pages at run time

2022-11-06 Thread Andres Freund
Hi,

On 2022-11-06 13:56:10 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund  wrote:
> > I don't think the dummy functions are a good approach; there were plenty of
> > things after them when I played with this.
>
> To be technical, the point wasn't to have no code after it, but to have no
> *hot* code *before* it, since with the iodlr approach the first 1.99MB of
> .text is below the first aligned boundary within that section. But yeah,
> I'm happy to ditch that hack entirely.

Just because code is colder than the alternative branch doesn't necessarily
mean it's entirely cold overall. I saw hits to things after the dummy function
that had a perf effect.


> > > > With these flags the "R E" segments all start on a 0x200000/2MiB
> > > > boundary and are padded to the next 2MiB boundary. However the
> > > > OS / dynamic loader only maps the necessary part, not all the
> > > > zero padding.
> > > >
> > > > This means that if we were to issue a MADV_COLLAPSE, we can
> > > > before it do an mremap() to increase the length of the mapping.
> > >
> > > I see, interesting. What location are you passing for madvise() and
> > > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > > aligned boundary within .text?
>
> >/*
> > * Make huge pages out of it. Requires at least linux 6.1.  We
> could
> > * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all
> that
> > * much in older kernels.
> > */
>
> About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
> THP? The man page seems to indicate that.

MADV_HUGEPAGE works as long as /sys/kernel/mm/transparent_hugepage/enabled is
set to always or madvise.  My understanding is that MADV_COLLAPSE will work
even if /sys/kernel/mm/transparent_hugepage/enabled is set to never.


> In the support work I've done, the standard recommendation is to turn THP
> off, especially if they report sudden performance problems.

I think that's pretty much an outdated suggestion FWIW. Largely caused by Red
Hat extremely aggressively backpatching transparent hugepages into RHEL 6
(IIRC). Lots of improvements have been made to THP since then. I've tried to
see negative effects maybe 2-3 years back, without success.

I really don't see a reason to ever set
/sys/kernel/mm/transparent_hugepage/enabled to 'never', rather than just 
'madvise'.
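For anyone wanting to check what their system is currently set to, the active
mode is the bracketed value in the sysfs file:

```shell
# The bracketed entry is the active THP policy, e.g. "always [madvise] never".
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null \
    || echo "transparent hugepage support not available"
```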


> If explicit HP's are used for shared mem, maybe THP is less of a risk? I
> need to look back at the tests that led to that advice...

I wouldn't give that advice to customers anymore, unless they use extremely
old platforms or unless there's very concrete evidence.


> > A real version would have to open /proc/self/maps and do this for at least
>
> I can try and generalize your above sketch into a v2 patch.

Cool.


> > postgres' r-xp mapping. We could do it for libraries too, if they're
> > suitably aligned (both in memory and on-disk).
>
> It looks like plpgsql is only 27 standard pages in size...
>
> Regarding glibc, we could try moving a couple of the hotter functions into
> PG, using smaller and simpler coding, if that has better frontend cache
> behavior. The paper "Understanding and Mitigating Front-End Stalls in
> Warehouse-Scale Computers" talks about this, particularly section 4.4
> regarding memcmp().

I think the amount of work necessary for that is nontrivial and continual, so
I'm loath to go there.


> > > I quickly tried to align the segments with the linker and then in my
> patch
> > > have the address for mmap() rounded *down* from the .text start to the
> > > beginning of that segment. It refused to start without logging an error.
> >
> > Hm, what linker was that? I did note that you need some additional flags
> for
> > some of the linkers.
>
> BFD, but I wouldn't worry about that failure too much, since the
> mremap()/madvise() strategy has a lot fewer moving parts.
>
> On the subject of linkers, though, one thing that tripped me up was trying
> to change the linker with Meson. First I tried
>
> -Dc_args='-fuse-ld=lld'

It's -Dc_link_args=...


> but that led to warnings like this when linking:
> /usr/bin/ld: warning: -z separate-loadable-segments ignored
>
> When using this in the top level meson.build
>
> elif host_system == 'linux'
>   sema_kind = 'unnamed_posix'
>   cppflags += '-D_GNU_SOURCE'
>   # Align the loadable segments to 2MB boundaries to support remapping to
>   # huge pages.
>   ldflags += cc.get_supported_link_arguments([
> '-Wl,-zmax-page-size=0x200000',
> '-Wl,-zcommon-page-size=0x200000',
> '-Wl,-zseparate-loadable-segments'
>   ])
>
>
> According to
>
> https://mesonbuild.com/howtox.html#set-linker
>
> I need to add CC_LD=lld to the env vars before invoking, which got rid of
> the warning. Then I wanted to verify that lld was actually used, and in
>
> https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

You can just look at build.ninja, fwiw. Or use ninja -v (in postgres's case
with -d keeprsp, be
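For example, a quick sketch (the "c_LINKER" rule name is an assumption about
how Meson names its ninja rules):

```shell
# Show the command line of the C link rule in the Meson-generated build.ninja,
# which reveals which linker driver is actually being invoked.
grep -A1 '^rule c_LINKER' build.ninja 2>/dev/null \
    || echo "no build.ninja in this directory"
```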

Re: remap the .text segment into huge pages at run time

2023-06-13 Thread John Naylor
On Sat, Nov 5, 2022 at 3:27 PM Andres Freund  wrote:

>	/*
>	 * Make huge pages out of it. Requires at least linux 6.1.  We could
>	 * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
>	 * much in older kernels.
>	 */
> #define MADV_COLLAPSE 25
>	r = madvise(addr, advlen, MADV_COLLAPSE);
>	if (r != 0)
>		fprintf(stderr, "MADV_COLLAPSE failed: %m\n");
>
>
> A real version would have to open /proc/self/maps and do this for at least
> postgres' r-xp mapping. We could do it for libraries too, if they're
> suitably aligned (both in memory and on-disk).

Hi Andres, my kernel has been new enough for a while now, and since TLBs
and context switches came up in the thread on... threads, I'm swapping this
back in my head.

For the postmaster, it should be simple to have a function that just takes
the address of itself, then parses /proc/self/maps to find the boundaries
within which it lies. I haven't thought about libraries much. Though with
just the postmaster it seems that would give us the biggest bang for the
buck?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: remap the .text segment into huge pages at run time

2023-06-14 Thread Andres Freund
Hi,

On 2023-06-14 12:40:18 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund  wrote:
> 
> >	/*
> >	 * Make huge pages out of it. Requires at least linux 6.1.  We could
> >	 * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
> >	 * much in older kernels.
> >	 */
> > #define MADV_COLLAPSE 25
> >	r = madvise(addr, advlen, MADV_COLLAPSE);
> >	if (r != 0)
> >		fprintf(stderr, "MADV_COLLAPSE failed: %m\n");
> >
> >
> > A real version would have to open /proc/self/maps and do this for at least
> > postgres' r-xp mapping. We could do it for libraries too, if they're
> > suitably aligned (both in memory and on-disk).
> 
> Hi Andres, my kernel has been new enough for a while now, and since TLBs
> and context switches came up in the thread on... threads, I'm swapping this
> back in my head.

Cool - I think we have some real potential for substantial wins around this.


> For the postmaster, it should be simple to have a function that just takes
> the address of itself, then parses /proc/self/maps to find the boundaries
> within which it lies. I haven't thought about libraries much. Though with
> just the postmaster it seems that would give us the biggest bang for the
> buck?

I think that is the main bit, yes. We could just try to do this for the
libraries, but accept failure to do so?

Greetings,

Andres Freund




Re: remap the .text segment into huge pages at run time

2023-06-19 Thread John Naylor
On Wed, Jun 14, 2023 at 12:40 PM John Naylor 
wrote:
>
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund  wrote:

> > A real version would have to open /proc/self/maps and do this for at
> > least postgres' r-xp mapping. We could do it for libraries too, if
> > they're suitably aligned (both in memory and on-disk).

> For the postmaster, it should be simple to have a function that just
> takes the address of itself, then parses /proc/self/maps to find the
> boundaries within which it lies. I haven't thought about libraries much.
> Though with just the postmaster it seems that would give us the biggest
> bang for the buck?

Here's a start at that, trying with postmaster only. Unfortunately, I get
"MADV_COLLAPSE failed: Invalid argument". I tried different addresses with
no luck, and also got the same result with a small standalone program. I'm
on ext4, so I gather I don't need "cp --reflink=never" but tried it anyway.
Configuration looks normal by "grep HUGEPAGE /boot/config-$(uname -r)".
Maybe there's something obvious I'm missing?

--
John Naylor
EDB: http://www.enterprisedb.com
From ca38a370e866d27c8b51c83f8f18bdda1587b3df Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Mon, 31 Oct 2022 15:24:29 +0700
Subject: [PATCH v2 2/2] Attempt to remap the .text segment into huge pages at
 postmaster start

Use MADV_COLLAPSE advice, available since Linux kernel 6.1.

Andres Freund and John Naylor
---
 src/backend/port/huge_page.c| 113 
 src/backend/port/meson.build|   4 +
 src/backend/postmaster/postmaster.c |   7 ++
 src/include/port/huge_page.h|  18 +
 4 files changed, 142 insertions(+)
 create mode 100644 src/backend/port/huge_page.c
 create mode 100644 src/include/port/huge_page.h

diff --git a/src/backend/port/huge_page.c b/src/backend/port/huge_page.c
new file mode 100644
index 00..92f87bb3c2
--- /dev/null
+++ b/src/backend/port/huge_page.c
@@ -0,0 +1,113 @@
+/*-
+ *
+ * huge_page.c
+ *	  Map .text segment of binary to huge pages
+ *
+ * TODO: better rationale for separate file if the huge page handling
+ * in sysv_shmem.c were moved here.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	  src/backend/port/huge_page.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include <sys/mman.h>
+
+#include "port/huge_page.h"
+#include "storage/fd.h"
+
+/*
+ * Collapse specified memory range to huge pages.
+ */
+static void
+CollapseRegionToHugePages(void *addr, size_t advlen)
+{
+#ifdef __linux__
+	size_t advlen_up;
+	int r;
+	void *r2;
+	const size_t bound = 1024*1024*2; // FIXME: x86
+
+	fprintf(stderr, "old advlen: %lx\n", advlen);
+	advlen_up = (advlen + bound - 1) & ~(bound - 1);
+
+	/*
+	 * Increase size of mapping to cover the trailing padding to the next
+	 * segment. Otherwise all the code in that range can't be put into
+	 * a huge page (access in the non-mapped range needs to cause a fault,
+	 * hence can't be in the huge page).
+	 * XXX: Should probably assert that that space is actually zeroes.
+	 */
+	r2 = mremap(addr, advlen, advlen_up, 0);
+	if (r2 == MAP_FAILED)
+		fprintf(stderr, "mremap failed: %m\n");
+	else if (r2 != addr)
+		fprintf(stderr, "mremap wrong addr: %m\n");
+	else
+		advlen = advlen_up;
+
+	fprintf(stderr, "new advlen: %lx\n", advlen);
+
+	/*
+	 * The docs for MADV_COLLAPSE say there should be at least one page
+	 * in the mapped space "for every eligible hugepage-aligned/sized
+	 * region to be collapsed". I just forced that. But probably not
+	 * necessary.
+	 */
+	r = madvise(addr, advlen, MADV_WILLNEED);
+	if (r != 0)
+		fprintf(stderr, "MADV_WILLNEED failed: %m\n");
+
+	r = madvise(addr, advlen, MADV_POPULATE_READ);
+	if (r != 0)
+		fprintf(stderr, "MADV_POPULATE_READ failed: %m\n");
+
+	/*
+	 * Make huge pages out of it. Requires at least linux 6.1.  We could
+	 * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
+	 * much in older kernels.
+	 */
+	r = madvise(addr, advlen, MADV_COLLAPSE);
+	if (r != 0)
+	{
+		fprintf(stderr, "MADV_COLLAPSE failed: %m\n");
+
+		r = madvise(addr, advlen, MADV_HUGEPAGE);
+		if (r != 0)
+			fprintf(stderr, "MADV_HUGEPAGE failed: %m\n");
+	}
+#endif
+}
+
+/*  Map the postgres .text segment into huge pages. */
+void
+MapStaticCodeToLargePages(void)
+{
+#ifdef __linux__
+	FILE	   *fp = AllocateFile("/proc/self/maps", "r");
+	char		buf[128]; // got this from code reading /proc/meminfo -- enough?
+	uintptr_t 	addr;
+	uintptr_t 	end;
+	void * 		self = &MapStaticCodeToLargePages;
+
+	if (fp)
+	{
+		while (fgets(buf, sizeof(buf), fp))
+		{
+			if (sscanf(buf, "%lx-%lx", &addr, &end) == 2 &&
+				addr <= (uintptr_t) self && (uintptr_t) self < end)
+			{
+				fprintf(stderr, "self: %p start: %lx end: %lx\n", self, addr, end);
+				CollapseRegionToHugePages((void *) addr, 

Re: remap the .text segment into huge pages at run time

2023-06-20 Thread Andres Freund
Hi,

On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> Here's a start at that, trying with postmaster only. Unfortunately, I get
> "MADV_COLLAPSE failed: Invalid argument".

I also see that. But depending on the steps, I also see
  MADV_COLLAPSE failed: Resource temporarily unavailable

I suspect there's some kernel issue. I'll try to ping somebody.

Greetings,

Andres Freund




Re: remap the .text segment into huge pages at run time

2023-06-20 Thread Andres Freund
Hi,

On 2023-06-20 10:29:41 -0700, Andres Freund wrote:
> On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> > Here's a start at that, trying with postmaster only. Unfortunately, I get
> > "MADV_COLLAPSE failed: Invalid argument".
> 
> I also see that. But depending on the steps, I also see
>   MADV_COLLAPSE failed: Resource temporarily unavailable
> 
> I suspect there's some kernel issue. I'll try to ping somebody.

Which kernel version are you using? It looks like the issue I am hitting might
be specific to the in-development 6.4 kernel.

One thing I now remember, after trying older kernels, is that it looks like
one sometimes needs to call 'sync' to ensure the page cache data for the
executable is clean, before executing postgres.

Greetings,

Andres Freund




Re: remap the .text segment into huge pages at run time

2023-06-20 Thread John Naylor
On Wed, Jun 21, 2023 at 12:46 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-06-20 10:29:41 -0700, Andres Freund wrote:
> > On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> > > Here's a start at that, trying with postmaster only. Unfortunately,
> > > I get "MADV_COLLAPSE failed: Invalid argument".
> >
> > I also see that. But depending on the steps, I also see
> >   MADV_COLLAPSE failed: Resource temporarily unavailable
> >
> > I suspect there's some kernel issue. I'll try to ping somebody.
>
> Which kernel version are you using? It looks like the issue I am hitting
> might be specific to the in-development 6.4 kernel.

(Fedora 38) uname -r shows

6.3.7-200.fc38.x86_64

--
John Naylor
EDB: http://www.enterprisedb.com


Re: remap the .text segment into huge pages at run time

2023-06-20 Thread Andres Freund
Hi,

On 2023-06-21 09:35:36 +0700, John Naylor wrote:
> On Wed, Jun 21, 2023 at 12:46 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-06-20 10:29:41 -0700, Andres Freund wrote:
> > > On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> > > > Here's a start at that, trying with postmaster only. Unfortunately,
> > > > I get "MADV_COLLAPSE failed: Invalid argument".
> > >
> > > I also see that. But depending on the steps, I also see
> > >   MADV_COLLAPSE failed: Resource temporarily unavailable
> > >
> > > I suspect there's some kernel issue. I'll try to ping somebody.
> >
> > Which kernel version are you using? It looks like the issue I am hitting
> > might be specific to the in-development 6.4 kernel.
> 
> (Fedora 38) uname -r shows
> 
> 6.3.7-200.fc38.x86_64

FWIW, I bisected the bug I was encountering.

As far as I understand, it should not affect you, it was only merged into
6.4-rc1 and a fix is scheduled to be merged into 6.4 before its release. See
https://lore.kernel.org/all/ZJIWAvTczl0rHJBv@x1n/

So I am wondering if you're encountering a different kind of problem. As I
mentioned, I have observed that the pages need to be clean for this to
work. For me adding a "sync path/to/postgres" makes it work on 6.3.8. Without
the sync it starts to work a while later (presumably when the kernel got
around to writing the data back).


without sync:

self: 0x563b2abf0a72 start: 563b2a800000 end: 563b2afe3000
old advlen: 7e3000
new advlen: 80
MADV_COLLAPSE failed: Invalid argument

with sync:
self: 0x555c947f0a72 start: 555c94400000 end: 555c94be3000
old advlen: 7e3000
new advlen: 80


Greetings,

Andres Freund




Re: remap the .text segment into huge pages at run time

2023-06-21 Thread John Naylor
On Wed, Jun 21, 2023 at 10:42 AM Andres Freund  wrote:

> So I am wondering if you're encountering a different kind of problem. As I
> mentioned, I have observed that the pages need to be clean for this to
> work. For me adding a "sync path/to/postgres" makes it work on 6.3.8.
> Without the sync it starts to work a while later (presumably when the kernel got
> around to writing the data back).

Hmm, then after rebooting today, it shouldn't have that problem until a
build links again, but I'll make sure to do that when building. Still same
failure, though. Looking more closely at the manpage for madvise, it has
this under MADV_HUGEPAGE:

"The  MADV_HUGEPAGE,  MADV_NOHUGEPAGE,  and  MADV_COLLAPSE  operations  are
available only if the kernel was configured with
CONFIG_TRANSPARENT_HUGEPAGE and file/shmem memory is only supported if the
kernel was configured with CONFIG_READ_ONLY_THP_FOR_FS."

Earlier, I only checked the first config option but didn't know about the
second...

$ grep CONFIG_READ_ONLY_THP_FOR_FS /boot/config-$(uname -r)
# CONFIG_READ_ONLY_THP_FOR_FS is not set

Apparently, it's experimental. That could be the explanation, but now I'm
wondering why the fallback

madvise(addr, advlen, MADV_HUGEPAGE);

didn't also give an error. I wonder if we could mremap to some anonymous
region and call madvise on that. That would be more similar to the hack I
shared last year, which may be more fragile, but now it wouldn't
need explicit huge pages.

--
John Naylor
EDB: http://www.enterprisedb.com