remap the .text segment into huge pages at run time
It's been known for a while that Postgres spends a lot of time translating instruction addresses, and using huge pages in the text segment yields a substantial performance boost in OLTP workloads [1][2]. The difficulty is that this normally requires a lot of painstaking work (unless your OS does superpage promotion, like FreeBSD).

I found an MIT-licensed library "iodlr" from Intel [3] that allows one to remap the .text segment to huge pages at program start. Attached is a hackish, Meson-only, "works on my machine" patchset to experiment with this idea.

0001 adapts the library to our error logging and GUC system. The overview:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap a temporary region and memcpy the aligned portion of the .text segment
- mmap the aligned start address to a second region with huge pages and MAP_FIXED
- memcpy over from the temp region and revoke the PROT_WRITE bit

The reason this doesn't "saw off the branch you're standing on" is that the remapping is done in a function that's forced to live in a different segment, and doesn't call any non-libc functions living elsewhere:

static void
__attribute__((__section__("lpstub")))
__attribute__((__noinline__))
MoveRegionToLargePages(const mem_range * r, int mmap_flags)

Debug messages show:

2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text start: 0x487540
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text end: 0x96cf12
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text start: 0x600000
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text end: 0x800000
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  binary mapped to huge pages
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  un-mmapping temporary code region

Here, out of 5MB of Postgres text, only one huge page can be used, but that still saves 512 entries in the TLB and might bring a small improvement.
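To make the alignment step above concrete, here is a minimal standalone sketch of the boundary arithmetic (my own illustration, not the iodlr code; names are mine), using the .text range from the debug output:

```c
#include <assert.h>
#include <stdint.h>

#define HUGEPAGE_SIZE ((uintptr_t) (2 * 1024 * 1024))	/* 2MB */

/* Round up to the next 2MB boundary: first address we can remap. */
static inline uintptr_t
hugepage_align_up(uintptr_t addr)
{
	return (addr + HUGEPAGE_SIZE - 1) & ~(HUGEPAGE_SIZE - 1);
}

/* Round down to the previous 2MB boundary: end of the remappable range. */
static inline uintptr_t
hugepage_align_down(uintptr_t addr)
{
	return addr & ~(HUGEPAGE_SIZE - 1);
}

/*
 * For .text spanning 0x487540..0x96cf12, the remappable range is
 * hugepage_align_up(0x487540) = 0x600000 through
 * hugepage_align_down(0x96cf12) = 0x800000, i.e. exactly one huge page.
 */
```

Only the huge pages lying entirely inside .text can be remapped; the partial pages at both ends stay as regular 4kB pages, which is why most of the 5MB binary is left alone here.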
The un-remapped region below 0x600000 contains the ~600kB of "cold" code, since the linker puts the cold section first, at least with recent versions of ld and lld.

0002 is my attempt to force the linker's hand and get the entire text segment mapped to huge pages. It's quite a finicky hack, and easily broken (see below). That said, it still builds easily within our normal build process, and maybe there is a better way to get the effect. It does two things:

- Pass the linker -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152, which aligns .init to a 2MB boundary. That's done for predictability, but it means the next 2MB boundary is very nearly 2MB away.

- Add a "cold" __asm__ filler function that just takes up space, enough to push the end of the .text segment over the next aligned boundary, or to ~8MB in size.

In a non-assert build:

0001:

$ bloaty inst-perf/bin/postgres
    FILE SIZE        VM SIZE
 --------------  --------------
  53.7%  4.90Mi   58.7%  4.90Mi  .text
...
 100.0%  9.12Mi  100.0%  8.35Mi  TOTAL

$ readelf -S --wide inst-perf/bin/postgres
  [Nr] Name   Type      Address   Off     Size    ES Flg Lk Inf Al
...
  [12] .init  PROGBITS  00486000  086000  00001b  00  AX  0   0  4
  [13] .plt   PROGBITS  00486020  086020  001520  10  AX  0   0 16
  [14] .text  PROGBITS  00487540  087540  4e59d2  00  AX  0   0 16
...

0002:

$ bloaty inst-perf/bin/postgres
    FILE SIZE        VM SIZE
 --------------  --------------
  46.9%  8.00Mi   69.9%  8.00Mi  .text
...
 100.0%  17.1Mi  100.0%  11.4Mi  TOTAL

$ readelf -S --wide inst-perf/bin/postgres
  [Nr] Name   Type      Address   Off     Size    ES Flg Lk Inf Al
...
  [12] .init  PROGBITS  00600000  200000  00001b  00  AX  0   0  4
  [13] .plt   PROGBITS  00600020  200020  001520  10  AX  0   0 16
  [14] .text  PROGBITS  00601540  201540  7ff512  00  AX  0   0 16
...
Debug messages with 0002 show 6MB mapped:

2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text start: 0x601540
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text end: 0xe00a52
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text start: 0x800000
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text end: 0xe00000
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  binary mapped to huge pages
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  un-mmapping temporary code region

Since the front is all-cold, and there is very little at the end, practically all hot pages are now remapped.

The biggest problem with the hackish filler function (in addition to maintainability) is that, if explicit huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB causes complete startup failure if the .text segment is larger than 8MB. I haven't looked into what's happening there yet, but I didn't want to get too far into the weeds before getting feedback on whether the entire approach in this thread is sound enough.
Re: remap the .text segment into huge pages at run time
Hi,

On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> It's been known for a while that Postgres spends a lot of time translating
> instruction addresses, and using huge pages in the text segment yields a
> substantial performance boost in OLTP workloads [1][2].

Indeed. Some of that we eventually should address by making our code less "jumpy", but that's a large amount of work and only going to go so far.

> The difficulty is,
> this normally requires a lot of painstaking work (unless your OS does
> superpage promotion, like FreeBSD).

I still am confused by FreeBSD being able to do this without changing the section alignment to be big enough. Or is the default alignment on FreeBSD large enough already?

> I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> remap the .text segment to huge pages at program start. Attached is a
> hackish, Meson-only, "works on my machine" patchset to experiment with this
> idea.

I wonder how far we can get with just using the linker hints to align sections. I know that the linux folks are working on promoting sufficiently aligned executable pages to huge pages too, and might have succeeded already.

IOW, adding the linker flags might be a good first step.

> 0001 adapts the library to our error logging and GUC system. The overview:
>
> - read ELF info to get the start/end addresses of the .text segment
> - calculate addresses therein aligned at huge page boundaries
> - mmap a temporary region and memcpy the aligned portion of the .text
>   segment
> - mmap aligned start address to a second region with huge pages and
>   MAP_FIXED
> - memcpy over from the temp region and revoke the PROT_WRITE bit

Would mremap()'ing the temporary region also work? That might be simpler and more robust (you'd see the MAP_HUGETLB failure before doing anything irreversible).
And you then might not even need this:

> The reason this doesn't "saw off the branch you're standing on" is that the
> remapping is done in a function that's forced to live in a different
> segment, and doesn't call any non-libc functions living elsewhere:
>
> static void
> __attribute__((__section__("lpstub")))
> __attribute__((__noinline__))
> MoveRegionToLargePages(const mem_range * r, int mmap_flags)

This would likely need a bunch more gating than the patch, understandably, has. I think it'd fail horribly if there were .text relocations, for example? I think there are some architectures that do that by default...

> 0002 is my attempt to force the linker's hand and get the entire text
> segment mapped to huge pages. It's quite a finicky hack, and easily broken
> (see below). That said, it still builds easily within our normal build
> process, and maybe there is a better way to get the effect.
>
> It does two things:
>
> - Pass the linker -Wl,-zcommon-page-size=2097152
>   -Wl,-zmax-page-size=2097152 which aligns .init to a 2MB boundary. That's
>   done for predictability, but that means the next 2MB boundary is very
>   nearly 2MB away.

Yep. FWIW, my notes say:

# align sections to 2MB boundaries for hugepage support
# bfd and gold linkers:
#   -Wl,-zmax-page-size=0x200000 -Wl,-zcommon-page-size=0x200000
# lld:
#   -Wl,-zmax-page-size=0x200000 -Wl,-z,separate-loadable-segments
# then copy binary to tmpfs mounted with -o huge=always

I.e. with lld you need slightly different flags: -Wl,-z,separate-loadable-segments

The meson bit should probably just use

cc.get_supported_link_arguments([
  '-Wl,-zmax-page-size=0x200000',
  '-Wl,-zcommon-page-size=0x200000',
  '-Wl,-zseparate-loadable-segments'])

Afaict there's really no reason to not do that by default, allowing kernels that can promote to huge pages to do so.
My approach to forcing huge pages to be used was to then:

# copy binary to tmpfs mounted with -o huge=always

> - Add a "cold" __asm__ filler function that just takes up space, enough to
>   push the end of the .text segment over the next aligned boundary, or to
>   ~8MB in size.

I don't understand why this is needed - as long as the pages are aligned to 2MB, why do we need to fill things up on disk? The in-memory contents are the relevant bit, no?

> Since the front is all-cold, and there is very little at the end,
> practically all hot pages are now remapped. The biggest problem with the
> hackish filler function (in addition to maintainability) is, if explicit
> huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> causes complete startup failure if the .text segment is larger than 8MB.

I would expect MAP_HUGETLB to always fail if not enabled in the kernel, independent of the .text segment size?

> +/* Callback for dl_iterate_phdr to set the start and end of the .text segment */
> +static int
> +FindMapping(struct dl_phdr_info *hdr, size_t size, void *data)
> +{
> +	ElfW(Shdr) text_section;
> +	FindParams *find_params = (FindParams *) data;
> +
> +	/*
> +	 * We are only interested in the mapping matching the main executable.
> +	 * This
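For reference, the quoted FindMapping() callback is built on dl_iterate_phdr(); a self-contained sketch of such a walk looks roughly like this (simplified to use only the program headers rather than ELF section headers; the names and structure here are illustrative, not the patch's):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stdint.h>

typedef struct exec_range
{
	uintptr_t	start;
	uintptr_t	end;
} exec_range;

/* Callback: record the executable PT_LOAD segment of the main binary. */
static int
exec_range_cb(struct dl_phdr_info *info, size_t size, void *data)
{
	exec_range *r = (exec_range *) data;

	/* The main executable is reported with an empty name. */
	if (info->dlpi_name[0] != '\0')
		return 0;

	for (int i = 0; i < info->dlpi_phnum; i++)
	{
		const ElfW(Phdr) *phdr = &info->dlpi_phdr[i];

		if (phdr->p_type == PT_LOAD && (phdr->p_flags & PF_X))
		{
			/* dlpi_addr is the load bias; zero for non-PIE binaries */
			r->start = info->dlpi_addr + phdr->p_vaddr;
			r->end = r->start + phdr->p_memsz;
			return 1;			/* nonzero stops the iteration */
		}
	}
	return 0;
}

static exec_range
find_exec_segment(void)
{
	exec_range	r = {0, 0};

	dl_iterate_phdr(exec_range_cb, &r);
	return r;
}
```

This finds the whole executable LOAD segment (.init/.plt/.text together); the huge-page-aligned boundaries then have to be computed within that range.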
Re: remap the .text segment into huge pages at run time
Hi,

This nerd-sniped me badly :)

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> > I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> > remap the .text segment to huge pages at program start. Attached is a
> > hackish, Meson-only, "works on my machine" patchset to experiment with this
> > idea.
>
> I wonder how far we can get with just using the linker hints to align
> sections. I know that the linux folks are working on promoting sufficiently
> aligned executable pages to huge pages too, and might have succeeded already.
>
> IOW, adding the linker flags might be a good first step.

Indeed, I did see that that works to some degree on the 5.19 kernel I was running. However, it never seems to get around to using huge pages sufficiently to compete with explicit use of huge pages.

More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was added into linux 6.1. That explicitly remaps a region and uses huge pages for it. Of course that's going to take a while to be widely available, but it seems like a safer approach than the remapping approach from this thread.

I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode the address / length), and it seems to work nicely.

With the weird caveat that on some filesystems one needs to make sure that the executable doesn't use reflinks to reuse parts of other files - and the mold linker and cp do... Not a concern on ext4, but it is on xfs.
I took to copying the postgres binary with cp --reflink=never.

FWIW, you can see the state of the page mapping in more detail with the kernel's page-types tool:

sudo /home/andres/src/kernel/tools/vm/page-types -L -p 12297 -a 0x55800,0x56122
sudo /home/andres/src/kernel/tools/vm/page-types -f /srv/dev/build/m-opt/src/backend/postgres2

Perf results:

c=150; psql -f ~/tmp/prewarm.sql; perf stat -a -e cycles,iTLB-loads,iTLB-load-misses,itlb_misses.walk_active,itlb_misses.walk_completed_4k,itlb_misses.walk_completed_2m_4m,itlb_misses.walk_completed_1g pgbench -n -M prepared -S -P1 -c$c -j$c -T10

without MADV_COLLAPSE:

tps = 1038230.070771 (without initial connection time)

 Performance counter stats for 'system wide':

 1,184,344,476,152      cycles                             (71.41%)
     2,846,146,710      iTLB-loads                         (71.43%)
     2,021,885,782      iTLB-load-misses     # 71.04% of all iTLB cache accesses  (71.44%)
    75,633,850,933      itlb_misses.walk_active            (71.44%)
     2,020,962,930      itlb_misses.walk_completed_4k      (71.44%)
         1,213,368      itlb_misses.walk_completed_2m_4m   (57.12%)
             2,293      itlb_misses.walk_completed_1g      (57.11%)

      10.064352587 seconds time elapsed

with MADV_COLLAPSE:

tps = 1113717.114278 (without initial connection time)

 Performance counter stats for 'system wide':

 1,173,049,140,611      cycles                             (71.42%)
     1,059,224,678      iTLB-loads                         (71.44%)
       653,603,712      iTLB-load-misses     # 61.71% of all iTLB cache accesses  (71.44%)
    26,135,902,949      itlb_misses.walk_active            (71.44%)
       628,314,285      itlb_misses.walk_completed_4k      (71.44%)
        25,462,916      itlb_misses.walk_completed_2m_4m   (57.13%)
             2,228      itlb_misses.walk_completed_1g      (57.13%)

Note that while the rate of itlb-misses stays roughly the same, the total number of iTLB loads reduced substantially, and the number of cycles in which an itlb miss was in progress is 1/3 of what it was before. A lot of the remaining misses are from the context switches: the iTLB is flushed on context switches, and of course pgbench -S is extremely context switch heavy.
Comparing plain -S with 10 pipelined -S transactions (using -t 10 / -t 1 to compare the same amount of work) I get:

without MADV_COLLAPSE, not pipelined:

tps = 1037732.722805 (without initial connection time)

 Performance counter stats for 'system wide':

 1,691,411,678,007      cycles                             (62.48%)
         8,856,107      itlb.itlb_flush                    (62.48%)
     4,600,041,062      iTLB-loads                         (62.48%)
     2,598,218,236      iTLB-load-misses     # 56.48% of all iTLB cache accesses  (62.50%)
   100,095,862,126      itlb_misses.walk_active
Re: remap the .text segment into huge pages at run time
Hi,

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > push the end of the .text segment over the next aligned boundary, or to
> > ~8MB in size.
>
> I don't understand why this is needed - as long as the pages are aligned to
> 2MB, why do we need to fill things up on disk? The in-memory contents are the
> relevant bit, no?

I now assume it's because you observed the mappings set up by the loader to not include the space between the segments? With sufficient linker flags the segments are sufficiently aligned both on disk and in memory to just map more:

bfd: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000

  Type  Offset      VirtAddr    PhysAddr    FileSiz     MemSiz      Flags  Align
  ...
  LOAD  0x00000000  0x00000000  0x00000000  0x000c7f58  0x000c7f58  R      0x200000
  LOAD  0x00200000  0x00200000  0x00200000  0x00921d39  0x00921d39  R E    0x200000
  LOAD  0x00c00000  0x00c00000  0x00c00000  0x002626b8  0x002626b8  R      0x200000
  LOAD  0x00fdf510  0x011df510  0x011df510  0x00037fd6  0x0006a310  RW     0x200000

gold: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,--rosegment

  Type  Offset      VirtAddr    PhysAddr    FileSiz     MemSiz      Flags  Align
  ...
  LOAD  0x00000000  0x00000000  0x00000000  0x009230f9  0x009230f9  R E    0x200000
  LOAD  0x00a00000  0x00a00000  0x00a00000  0x0033a738  0x0033a738  R      0x200000
  LOAD  0x00ddf4e0  0x00fdf4e0  0x00fdf4e0  0x0003800a  0x0006a340  RW     0x200000

lld: -Wl,-zmax-page-size=0x200000,-zseparate-loadable-segments

  LOAD  0x00000000  0x00000000  0x00000000  0x0033710c  0x0033710c  R      0x200000
  LOAD  0x00400000  0x00400000  0x00400000  0x00921cb0  0x00921cb0  R E    0x200000
  LOAD  0x00e00000  0x00e00000  0x00e00000  0x00020ae0  0x00020ae0  RW     0x200000
  LOAD  0x01000000  0x01000000  0x01000000  0x000174ea  0x00049820  RW     0x200000

mold: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,-zseparate-loadable-segments

  Type  Offset      VirtAddr    PhysAddr    FileSiz     MemSiz      Flags  Align
  ...
  LOAD  0x00000000  0x00000000  0x00000000  0x0032dde9  0x0032dde9  R      0x200000
  LOAD  0x00400000  0x00400000  0x00400000  0x00921cbe  0x00921cbe  R E    0x200000
  LOAD  0x00e00000  0x00e00000  0x00e00000  0x002174e8  0x00249820  RW     0x200000

With these flags the "R E" segments all start on a 0x200000/2MiB boundary and are padded to the next 2MiB boundary.
However the OS / dynamic loader only maps the necessary part, not all the zero padding. This means that if we were to issue a MADV_COLLAPSE, we can first do an mremap() to increase the length of the mapping.

MADV_COLLAPSE without mremap:

tps = 1117335.766756 (without initial connection time)

 Performance counter stats for 'system wide':

 1,169,012,466,070      cycles                             (55.53%)
   729,146,640,019      instructions         # 0.62 insn per cycle  (66.65%)
         7,062,923      itlb.itlb_flush                    (66.65%)
     1,041,825,587      iTLB-loads                         (66.65%)
       634,272,420      iTLB-load-misses     # 60.88% of all iTLB cache accesses  (66.66%)
    27,018,254,873      itlb_misses.walk_active            (66.68%)
       610,639,252      itlb_misses.walk_completed_4k      (44.47%)
        24,262,549      itlb_misses.walk_completed_2m_4m   (44.46%)
             2,948      itlb_misses.walk_completed_1g      (44.43%)

      10.039217004 seconds time elapsed

MADV_COLLAPSE with mremap:

tps = 1140869.853616 (without initial connection time)

 Performance counter stats for 'system wide':

 1,173,272,878,934      cycles
Re: remap the .text segment into huge pages at run time
On Sat, Nov 5, 2022 at 1:33 AM Andres Freund wrote:
> > I wonder how far we can get with just using the linker hints to align
> > sections. I know that the linux folks are working on promoting sufficiently
> > aligned executable pages to huge pages too, and might have succeeded already.
> >
> > IOW, adding the linker flags might be a good first step.
>
> Indeed, I did see that that works to some degree on the 5.19 kernel I was
> running. However, it never seems to get around to using huge pages
> sufficiently to compete with explicit use of huge pages.

Oh nice, I didn't know that! There might be some threshold of pages mapped before it does so. At least, that issue is mentioned in that paper linked upthread for FreeBSD.

> More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
> added into linux 6.1. That explicitly remaps a region and uses huge pages for
> it. Of course that's going to take a while to be widely available, but it
> seems like a safer approach than the remapping approach from this thread.

I didn't know that either, funny timing.

> I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> the address / length), and it seems to work nicely.
>
> With the weird caveat that on fs one needs to make sure that the executable
> doesn't reflinks to reuse parts of other files, and that the mold linker and
> cp do... Not a concern on ext4, but on xfs. I took to copying the postgres
> binary with cp --reflink=never

What happens otherwise? That sounds like a difficult thing to guard against.

> The difference in itlb.itlb_flush between pipelined / non-pipelined cases
> unsurprisingly is stark.
>
> While the pipelined case still sees a good bit reduced itlb traffic, the total
> amount of cycles in which a walk is active is just not large enough to matter,
> by the looks of it.

Good to know, thanks for testing.
Maybe the pipelined case is something devs should consider when microbenchmarking, to reduce noise from context switches.

On Sat, Nov 5, 2022 at 4:21 AM Andres Freund wrote:
>
> Hi,
>
> On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > > push the end of the .text segment over the next aligned boundary, or to
> > > ~8MB in size.
> >
> > I don't understand why this is needed - as long as the pages are aligned to
> > 2MB, why do we need to fill things up on disk? The in-memory contents are the
> > relevant bit, no?
>
> I now assume it's because you either observed the mappings set up by the
> loader to not include the space between the segments?

My knowledge is not quite that deep. The iodlr repo has an example "hello world" program, which links with 8 filler objects, each with 32768 __attribute((used)) dummy functions. I just cargo-culted that idea and simplified it. Interestingly enough, looking through the commit history, they used to align the segments via linker flags, but took it out here:

https://github.com/intel/iodlr/pull/25#discussion_r397787559

...saying "I'm not sure why we added this". :/

I quickly tried to align the segments with the linker and then in my patch have the address for mmap() rounded *down* from the .text start to the beginning of that segment. It refused to start without logging an error.

BTW, that's what I meant before, although I wasn't clear:

> > Since the front is all-cold, and there is very little at the end,
> > practically all hot pages are now remapped. The biggest problem with the
> > hackish filler function (in addition to maintainability) is, if explicit
> > huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> > causes complete startup failure if the .text segment is larger than 8MB.
>
> I would expect MAP_HUGETLB to always fail if not enabled in the kernel,
> independent of the .text segment size?
With the file-level hack, it would just fail without a trace with .text > 8MB (I have yet to enable core dumps on this new OS I have...), whereas without it I did see the failures in the log, and successful fallback.

> With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> are padded to the next 2MiB boundary. However the OS / dynamic loader only
> maps the necessary part, not all the zero padding.
>
> This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> mremap() to increase the length of the mapping.

I see, interesting. What location are you passing for madvise() and mremap()? The beginning of the segment (which for me contains .init/.plt) or an aligned boundary within .text?

--
John Naylor
EDB: http://www.enterprisedb.com
Re: remap the .text segment into huge pages at run time
Hi,

On 2022-11-05 12:54:18 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 1:33 AM Andres Freund wrote:
> > I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just
> > hardcode the address / length), and it seems to work nicely.
> >
> > With the weird caveat that on fs one needs to make sure that the
> > executable doesn't reflinks to reuse parts of other files, and that the
> > mold linker and cp do... Not a concern on ext4, but on xfs. I took to
> > copying the postgres binary with cp --reflink=never
>
> What happens otherwise? That sounds like a difficult thing to guard against.

MADV_COLLAPSE fails, but otherwise things continue on. I think it's mostly an issue on dev systems, not on prod systems, because there the files will be unpacked from a package or such.

> > On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > > - Add a "cold" __asm__ filler function that just takes up space,
> > > > enough to push the end of the .text segment over the next aligned
> > > > boundary, or to ~8MB in size.
> > >
> > > I don't understand why this is needed - as long as the pages are
> > > aligned to 2MB, why do we need to fill things up on disk? The in-memory
> > > contents are the relevant bit, no?
> >
> > I now assume it's because you either observed the mappings set up by the
> > loader to not include the space between the segments?
>
> My knowledge is not quite that deep. The iodlr repo has an example "hello
> world" program, which links with 8 filler objects, each with 32768
> __attribute((used)) dummy functions. I just cargo-culted that idea and
> simplified it. Interestingly enough, looking through the commit history,
> they used to align the segments via linker flags, but took it out here:
>
> https://github.com/intel/iodlr/pull/25#discussion_r397787559
>
> ...saying "I'm not sure why we added this". :/

That was about using a linker script, not really linker flags though.
I don't think the dummy functions are a good approach, there were plenty of things after them when I played with them.

> I quickly tried to align the segments with the linker and then in my patch
> have the address for mmap() rounded *down* from the .text start to the
> beginning of that segment. It refused to start without logging an error.

Hm, what linker was that? I did note that you need some additional flags for some of the linkers.

> > With these flags the "R E" segments all start on a 0x200000/2MiB boundary
> > and are padded to the next 2MiB boundary. However the OS / dynamic loader
> > only maps the necessary part, not all the zero padding.
> >
> > This means that if we were to issue a MADV_COLLAPSE, we can before it do
> > an mremap() to increase the length of the mapping.
>
> I see, interesting. What location are you passing for madvise() and
> mremap()? The beginning of the segment (for me has .init/.plt) or an
> aligned boundary within .text?

I started postgres with setarch -R, looked at /proc/$pid/[s]maps to see the start/end of the r-xp mapped segment. Here's my hacky code, with a bunch of comments added.

    void       *addr = (void *) 0x5580;
    void       *end = (void *) 0x55e09000;
    size_t      advlen = (uintptr_t) end - (uintptr_t) addr;
    const size_t bound = 1024 * 1024 * 2;   /* huge page size */
    size_t      advlen_up = (advlen + bound - 1) & ~(bound - 1);
    void       *r2;

    /*
     * Increase size of mapping to cover the tailing padding to the next
     * segment. Otherwise all the code in that range can't be put into
     * a huge page (access in the non-mapped range needs to cause a fault,
     * hence can't be in the huge page).
     * XXX: Should probably assert that that space is actually zeroes.
     */
    r2 = mremap(addr, advlen, advlen_up, 0);
    if (r2 == MAP_FAILED)
        fprintf(stderr, "mremap failed: %m\n");
    else if (r2 != addr)
        fprintf(stderr, "mremap wrong addr: %m\n");
    else
        advlen = advlen_up;

    /*
     * The docs for MADV_COLLAPSE say there should be at least one page
     * in the mapped space "for every eligible hugepage-aligned/sized
     * region to be collapsed". I just forced that. But probably not
     * necessary.
     */
    r = madvise(addr, advlen, MADV_WILLNEED);
    if (r != 0)
        fprintf(stderr, "MADV_WILLNEED failed: %m\n");

    r = madvise(addr, advlen, MADV_POPULATE_READ);
    if (r != 0)
        fprintf(stderr, "MADV_POPULATE_READ failed: %m\n");

    /*
     * Make huge pages out of it. Requires at least linux 6.1. We could
     * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
     * much in older kernels.
     */
#define MADV_COLLAPSE 25
    r = madvise(addr, advlen, MADV_COLLAPSE);
    if (r != 0)
        fprintf(stderr, "MADV_COLLAPSE failed: %m\n");

A real version would have to open /proc/self/maps and do this for at least postgres' r-xp mapping. We could do it for libraries too, if they're suitably aligned (both in memory and on-disk).
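A minimal sketch of that /proc/self/maps scan might look like the following (my own illustration; a real version would match the mapping against the postgres binary's path rather than just taking the first executable one):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the first r-xp mapping in /proc/self/maps and return its bounds
 * through *start / *end.  Returns 1 on success, 0 otherwise.
 */
static int
find_rx_mapping(uintptr_t *start, uintptr_t *end)
{
	FILE	   *f = fopen("/proc/self/maps", "r");
	char		line[1024];
	int			found = 0;

	if (f == NULL)
		return 0;

	while (fgets(line, sizeof(line), f) != NULL)
	{
		uintptr_t	lo, hi;
		char		perms[5];

		/* each line starts with "start-end perms ..." */
		if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR " %4s",
				   &lo, &hi, perms) == 3 &&
			strcmp(perms, "r-xp") == 0)
		{
			*start = lo;
			*end = hi;
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}
```

The returned range would then be fed to the mremap()/madvise(MADV_COLLAPSE) sequence above instead of the hardcoded addresses.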
Re: remap the .text segment into huge pages at run time
On Sat, Nov 5, 2022 at 3:27 PM Andres Freund wrote:
> > simplified it. Interestingly enough, looking through the commit history,
> > they used to align the segments via linker flags, but took it out here:
> >
> > https://github.com/intel/iodlr/pull/25#discussion_r397787559
> >
> > ...saying "I'm not sure why we added this". :/
>
> That was about using a linker script, not really linker flags though.

Oops, the commit I was referring to pointed to that discussion, but I should have shown it instead:

--- a/large_page-c/example/Makefile
+++ b/large_page-c/example/Makefile
@@ -28,7 +28,6 @@ OBJFILES= \
   filler16.o \

 OBJS=$(addprefix $(OBJDIR)/,$(OBJFILES))
-LDFLAGS=-Wl,-z,max-page-size=2097152

But from what you're saying, this flag wouldn't have been enough anyway...

> I don't think the dummy functions are a good approach, there were plenty
> things after it when I played with them.

To be technical, the point wasn't to have no code after it, but to have no *hot* code *before* it, since with the iodlr approach the first 1.99MB of .text is below the first aligned boundary within that section. But yeah, I'm happy to ditch that hack entirely.

> > > With these flags the "R E" segments all start on a 0x200000/2MiB boundary
> > > and are padded to the next 2MiB boundary. However the OS / dynamic loader
> > > only maps the necessary part, not all the zero padding.
> > >
> > > This means that if we were to issue a MADV_COLLAPSE, we can before it do
> > > an mremap() to increase the length of the mapping.
> >
> > I see, interesting. What location are you passing for madvise() and
> > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > aligned boundary within .text?
>
> /*
>  * Make huge pages out of it. Requires at least linux 6.1. We could
>  * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
>  * much in older kernels.
>  */

About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for THP? The man page seems to indicate that.
In the support work I've done, the standard recommendation is to turn THP off, especially if they report sudden performance problems. If explicit HPs are used for shared mem, maybe THP is less of a risk? I need to look back at the tests that led to that advice...

> A real version would have to open /proc/self/maps and do this for at least

I can try and generalize your above sketch into a v2 patch.

> postgres' r-xp mapping. We could do it for libraries too, if they're
> suitably aligned (both in memory and on-disk).

It looks like plpgsql is only 27 standard pages in size...

Regarding glibc, we could try moving a couple of the hotter functions into PG, using smaller and simpler coding, if that has better frontend cache behavior. The paper "Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers" talks about this, particularly section 4.4 regarding memcmp().

> > I quickly tried to align the segments with the linker and then in my
> > patch have the address for mmap() rounded *down* from the .text start to
> > the beginning of that segment. It refused to start without logging an error.
>
> Hm, what linker was that? I did note that you need some additional flags
> for some of the linkers.

BFD, but I wouldn't worry about that failure too much, since the mremap()/madvise() strategy has a lot fewer moving parts.

On the subject of linkers, though, one thing that tripped me up was trying to change the linker with Meson. First I tried

-Dc_args='-fuse-ld=lld'

but that led to warnings like this:

/usr/bin/ld: warning: -z separate-loadable-segments ignored

When using this in the top level meson.build

elif host_system == 'linux'
  sema_kind = 'unnamed_posix'
  cppflags += '-D_GNU_SOURCE'
  # Align the loadable segments to 2MB boundaries to support remapping to
  # huge pages.
  ldflags += cc.get_supported_link_arguments([
    '-Wl,-zmax-page-size=0x200000',
    '-Wl,-zcommon-page-size=0x200000',
    '-Wl,-zseparate-loadable-segments'
  ])

According to

https://mesonbuild.com/howtox.html#set-linker

I need to add CC_LD=lld to the env vars before invoking, which got rid of the warning. Then I wanted to verify that lld was actually used, and in

https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

it says I can run this and it should show “Linker: LLD”, but that doesn't appear for me:

$ readelf --string-dump .comment inst-perf/bin/postgres

String dump of section '.comment':
  [     0]  GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)

--
John Naylor
EDB: http://www.enterprisedb.com
Re: remap the .text segment into huge pages at run time
Hi,

On 2022-11-06 13:56:10 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund wrote:
> > I don't think the dummy functions are a good approach, there were plenty
> > things after it when I played with them.
>
> To be technical, the point wasn't to have no code after it, but to have no
> *hot* code *before* it, since with the iodlr approach the first 1.99MB of
> .text is below the first aligned boundary within that section. But yeah,
> I'm happy to ditch that hack entirely.

Just because code is colder than the alternative branch doesn't necessarily mean it's entirely cold overall. I saw hits to things after the dummy function have a perf effect.

> > > > With these flags the "R E" segments all start on a 0x200000/2MiB
> > > > boundary and are padded to the next 2MiB boundary. However the OS /
> > > > dynamic loader only maps the necessary part, not all the zero padding.
> > > >
> > > > This means that if we were to issue a MADV_COLLAPSE, we can before it
> > > > do an mremap() to increase the length of the mapping.
> > >
> > > I see, interesting. What location are you passing for madvise() and
> > > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > > aligned boundary within .text?
>
> > /*
> >  * Make huge pages out of it. Requires at least linux 6.1. We could
> >  * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
> >  * much in older kernels.
> >  */
>
> About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
> THP? The man page seems to indicate that.

MADV_HUGEPAGE works as long as /sys/kernel/mm/transparent_hugepage/enabled is set to always or madvise. My understanding is that MADV_COLLAPSE will work even if /sys/kernel/mm/transparent_hugepage/enabled is set to never.

> In the support work I've done, the standard recommendation is to turn THP
> off, especially if they report sudden performance problems.

I think that's pretty much an outdated suggestion FWIW.
Largely caused by Red Hat extremely aggressively backpatching transparent
hugepages into RHEL 6 (IIRC). Lots of improvements have been made to THP
since then. I've tried to see negative effects maybe 2-3 years back, without
success. I really don't see a reason to ever set
/sys/kernel/mm/transparent_hugepage/enabled to 'never', rather than just
'madvise'.

> If explicit HP's are used for shared mem, maybe THP is less of a risk? I
> need to look back at the tests that led to that advice...

I wouldn't give that advice to customers anymore, unless they use extremely
old platforms or unless there's very concrete evidence.

> > A real version would have to open /proc/self/maps and do this for at
> > least postgres' r-xp mapping. We could do it for libraries too, if
> > they're suitably aligned (both in memory and on-disk).
>
> I can try and generalize your above sketch into a v2 patch.

Cool.

> It looks like plpgsql is only 27 standard pages in size...
>
> Regarding glibc, we could try moving a couple of the hotter functions into
> PG, using smaller and simpler coding, if that has better frontend cache
> behavior. The paper "Understanding and Mitigating Front-End Stalls in
> Warehouse-Scale Computers" talks about this, particularly section 4.4
> regarding memcmp().

I think the amount of work necessary for that is nontrivial and continual.
So I'm loath to go there.

> > > I quickly tried to align the segments with the linker and then in my
> > > patch have the address for mmap() rounded *down* from the .text start
> > > to the beginning of that segment. It refused to start without logging
> > > an error.
> >
> > Hm, what linker was that? I did note that you need some additional
> > flags for some of the linkers.
>
> BFD, but I wouldn't worry about that failure too much, since the
> mremap()/madvise() strategy has a lot fewer moving parts.
>
> On the subject of linkers, though, one thing that tripped me up was
> trying to change the linker with Meson.
> First I tried
>
> -Dc_args='-fuse-ld=lld'

It's -Dc_link_args=...

> but that led to warnings like this when linking:
>
> /usr/bin/ld: warning: -z separate-loadable-segments ignored
>
> When using this in the top level meson.build:
>
> elif host_system == 'linux'
>   sema_kind = 'unnamed_posix'
>   cppflags += '-D_GNU_SOURCE'
>   # Align the loadable segments to 2MB boundaries to support remapping to
>   # huge pages.
>   ldflags += cc.get_supported_link_arguments([
>     '-Wl,-zmax-page-size=0x200000',
>     '-Wl,-zcommon-page-size=0x200000',
>     '-Wl,-zseparate-loadable-segments'
>   ])
>
> According to
>
> https://mesonbuild.com/howtox.html#set-linker
>
> I need to add CC_LD=lld to the env vars before invoking, which got rid of
> the warning. Then I wanted to verify that lld was actually used, and in
>
> https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

You can just look at build.ninja, fwiw. Or use ninja -v (in postgres's
cases with -d keeprsp, be
Re: remap the .text segment into huge pages at run time
On Sat, Nov 5, 2022 at 3:27 PM Andres Freund wrote:

> /*
>  * Make huge pages out of it. Requires at least linux 6.1. We could
>  * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
>  * much in older kernels.
>  */
> #define MADV_COLLAPSE 25
> r = madvise(addr, advlen, MADV_COLLAPSE);
> if (r != 0)
>     fprintf(stderr, "MADV_COLLAPSE failed: %m\n");
>
> A real version would have to open /proc/self/maps and do this for at least
> postgres' r-xp mapping. We could do it for libraries too, if they're
> suitably aligned (both in memory and on-disk).

Hi Andres, my kernel has been new enough for a while now, and since TLBs
and context switches came up in the thread on... threads, I'm swapping this
back in my head.

For the postmaster, it should be simple to have a function that just takes
the address of itself, then parses /proc/self/maps to find the boundaries
within which it lies. I haven't thought about libraries much. Though with
just the postmaster it seems that would give us the biggest bang for the
buck?

--
John Naylor
EDB: http://www.enterprisedb.com
Re: remap the .text segment into huge pages at run time
Hi,

On 2023-06-14 12:40:18 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund wrote:
> > /*
> >  * Make huge pages out of it. Requires at least linux 6.1. We could
> >  * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
> >  * much in older kernels.
> >  */
> > #define MADV_COLLAPSE 25
> > r = madvise(addr, advlen, MADV_COLLAPSE);
> > if (r != 0)
> >     fprintf(stderr, "MADV_COLLAPSE failed: %m\n");
> >
> > A real version would have to open /proc/self/maps and do this for at
> > least postgres' r-xp mapping. We could do it for libraries too, if
> > they're suitably aligned (both in memory and on-disk).
>
> Hi Andres, my kernel has been new enough for a while now, and since TLBs
> and context switches came up in the thread on... threads, I'm swapping
> this back in my head.

Cool - I think we have some real potential for substantial wins around this.

> For the postmaster, it should be simple to have a function that just takes
> the address of itself, then parses /proc/self/maps to find the boundaries
> within which it lies. I haven't thought about libraries much. Though with
> just the postmaster it seems that would give us the biggest bang for the
> buck?

I think that is the main bit, yes. We could just try to do this for the
libraries, but accept failure to do so?

Greetings,

Andres Freund
Re: remap the .text segment into huge pages at run time
On Wed, Jun 14, 2023 at 12:40 PM John Naylor wrote:
>
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund wrote:
> > A real version would have to open /proc/self/maps and do this for at
> > least postgres' r-xp mapping. We could do it for libraries too, if
> > they're suitably aligned (both in memory and on-disk).
>
> For the postmaster, it should be simple to have a function that just takes
> the address of itself, then parses /proc/self/maps to find the boundaries
> within which it lies. I haven't thought about libraries much. Though with
> just the postmaster it seems that would give us the biggest bang for the
> buck?

Here's a start at that, trying with postmaster only. Unfortunately, I get
"MADV_COLLAPSE failed: Invalid argument". I tried different addresses with
no luck, and also got the same result with a small standalone program. I'm
on ext4, so I gather I don't need "cp --reflink=never" but tried it anyway.
Configuration looks normal by "grep HUGEPAGE /boot/config-$(uname -r)".
Maybe there's something obvious I'm missing?

--
John Naylor
EDB: http://www.enterprisedb.com

From ca38a370e866d27c8b51c83f8f18bdda1587b3df Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Mon, 31 Oct 2022 15:24:29 +0700
Subject: [PATCH v2 2/2] Attempt to remap the .text segment into huge pages
 at postmaster start

Use MADV_COLLAPSE advice, available since Linux kernel 6.1.
Andres Freund and John Naylor
---
 src/backend/port/huge_page.c        | 113 ++++++++++++++++++++++++++++++++
 src/backend/port/meson.build        |   4 +
 src/backend/postmaster/postmaster.c |   7 ++
 src/include/port/huge_page.h        |  18 ++++
 4 files changed, 142 insertions(+)
 create mode 100644 src/backend/port/huge_page.c
 create mode 100644 src/include/port/huge_page.h

diff --git a/src/backend/port/huge_page.c b/src/backend/port/huge_page.c
new file mode 100644
index 0000000000..92f87bb3c2
--- /dev/null
+++ b/src/backend/port/huge_page.c
@@ -0,0 +1,113 @@
+/*-------------------------------------------------------------------------
+ *
+ * huge_page.c
+ *	  Map .text segment of binary to huge pages
+ *
+ * TODO: better rationale for separate file if the huge page handling
+ * in sysv_shmem.c were moved here.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/port/huge_page.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/mman.h>
+
+#include "port/huge_page.h"
+#include "storage/fd.h"
+
+/*
+ * Collapse specified memory range to huge pages.
+ */
+static void
+CollapseRegionToHugePages(void *addr, size_t advlen)
+{
+#ifdef __linux__
+	size_t		advlen_up;
+	int			r;
+	void	   *r2;
+	const size_t bound = 1024 * 1024 * 2;	// FIXME: x86
+
+	fprintf(stderr, "old advlen: %lx\n", advlen);
+	advlen_up = (advlen + bound - 1) & ~(bound - 1);
+
+	/*
+	 * Increase size of mapping to cover the trailing padding to the next
+	 * segment. Otherwise all the code in that range can't be put into
+	 * a huge page (access in the non-mapped range needs to cause a fault,
+	 * hence can't be in the huge page).
+	 * XXX: Should probably assert that that space is actually zeroes.
+	 */
+	r2 = mremap(addr, advlen, advlen_up, 0);
+	if (r2 == MAP_FAILED)
+		fprintf(stderr, "mremap failed: %m\n");
+	else if (r2 != addr)
+		fprintf(stderr, "mremap wrong addr: %m\n");
+	else
+		advlen = advlen_up;
+
+	fprintf(stderr, "new advlen: %lx\n", advlen);
+
+	/*
+	 * The docs for MADV_COLLAPSE say there should be at least one page
+	 * in the mapped space "for every eligible hugepage-aligned/sized
+	 * region to be collapsed". I just forced that. But probably not
+	 * necessary.
+	 */
+	r = madvise(addr, advlen, MADV_WILLNEED);
+	if (r != 0)
+		fprintf(stderr, "MADV_WILLNEED failed: %m\n");
+
+	r = madvise(addr, advlen, MADV_POPULATE_READ);
+	if (r != 0)
+		fprintf(stderr, "MADV_POPULATE_READ failed: %m\n");
+
+	/*
+	 * Make huge pages out of it. Requires at least linux 6.1. We could
+	 * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
+	 * much in older kernels.
+	 */
+	r = madvise(addr, advlen, MADV_COLLAPSE);
+	if (r != 0)
+	{
+		fprintf(stderr, "MADV_COLLAPSE failed: %m\n");
+
+		r = madvise(addr, advlen, MADV_HUGEPAGE);
+		if (r != 0)
+			fprintf(stderr, "MADV_HUGEPAGE failed: %m\n");
+	}
+#endif
+}
+
+/* Map the postgres .text segment into huge pages. */
+void
+MapStaticCodeToLargePages(void)
+{
+#ifdef __linux__
+	FILE	   *fp = AllocateFile("/proc/self/maps", "r");
+	char		buf[128];	// got this from code reading /proc/meminfo -- enough?
+	uintptr_t	addr;
+	uintptr_t	end;
+	void	   *self = &MapStaticCodeToLargePages;
+
+	if (fp)
+	{
+		while (fgets(buf, sizeof(buf), fp))
+		{
+			if (sscanf(buf, "%lx-%lx", &addr, &end) == 2 &&
+				addr <= (uintptr_t) self && (uintptr_t) self < end)
+			{
+				fprintf(stderr, "self: %p start: %lx end: %lx\n", self, addr, end);
+				CollapseRegionToHugePages((void *) addr,
Re: remap the .text segment into huge pages at run time
Hi,

On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> Here's a start at that, trying with postmaster only. Unfortunately, I get
> "MADV_COLLAPSE failed: Invalid argument".

I also see that. But depending on the steps, I also see
MADV_COLLAPSE failed: Resource temporarily unavailable

I suspect there's some kernel issue. I'll try to ping somebody.

Greetings,

Andres Freund
Re: remap the .text segment into huge pages at run time
Hi,

On 2023-06-20 10:29:41 -0700, Andres Freund wrote:
> On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> > Here's a start at that, trying with postmaster only. Unfortunately, I
> > get "MADV_COLLAPSE failed: Invalid argument".
>
> I also see that. But depending on the steps, I also see
> MADV_COLLAPSE failed: Resource temporarily unavailable
>
> I suspect there's some kernel issue. I'll try to ping somebody.

Which kernel version are you using? It looks like the issue I am hitting
might be specific to the in-development 6.4 kernel.

One thing I now remember, after trying older kernels, is that it looks like
one sometimes needs to call 'sync' to ensure the page cache data for the
executable is clean, before executing postgres.

Greetings,

Andres Freund
Re: remap the .text segment into huge pages at run time
On Wed, Jun 21, 2023 at 12:46 AM Andres Freund wrote:
>
> Hi,
>
> On 2023-06-20 10:29:41 -0700, Andres Freund wrote:
> > On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> > > Here's a start at that, trying with postmaster only. Unfortunately, I
> > > get "MADV_COLLAPSE failed: Invalid argument".
> >
> > I also see that. But depending on the steps, I also see
> > MADV_COLLAPSE failed: Resource temporarily unavailable
> >
> > I suspect there's some kernel issue. I'll try to ping somebody.
>
> Which kernel version are you using? It looks like the issue I am hitting
> might be specific to the in-development 6.4 kernel.

(Fedora 38) uname -r shows

6.3.7-200.fc38.x86_64

--
John Naylor
EDB: http://www.enterprisedb.com
Re: remap the .text segment into huge pages at run time
Hi,

On 2023-06-21 09:35:36 +0700, John Naylor wrote:
> On Wed, Jun 21, 2023 at 12:46 AM Andres Freund wrote:
> >
> > Hi,
> >
> > On 2023-06-20 10:29:41 -0700, Andres Freund wrote:
> > > On 2023-06-20 10:23:14 +0700, John Naylor wrote:
> > > > Here's a start at that, trying with postmaster only. Unfortunately,
> > > > I get "MADV_COLLAPSE failed: Invalid argument".
> > >
> > > I also see that. But depending on the steps, I also see
> > > MADV_COLLAPSE failed: Resource temporarily unavailable
> > >
> > > I suspect there's some kernel issue. I'll try to ping somebody.
> >
> > Which kernel version are you using? It looks like the issue I am hitting
> > might be specific to the in-development 6.4 kernel.
>
> (Fedora 38) uname -r shows
>
> 6.3.7-200.fc38.x86_64

FWIW, I bisected the bug I was encountering. As far as I understand, it
should not affect you, it was only merged into 6.4-rc1 and a fix is
scheduled to be merged into 6.4 before its release. See
https://lore.kernel.org/all/ZJIWAvTczl0rHJBv@x1n/

So I am wondering if you're encountering a different kind of problem. As I
mentioned, I have observed that the pages need to be clean for this to
work. For me adding a "sync path/to/postgres" makes it work on 6.3.8.
Without the sync it starts to work a while later (presumably when the
kernel got around to writing the data back).

without sync:
self: 0x563b2abf0a72 start: 563b2a800000 end: 563b2afe3000
old advlen: 7e3000
new advlen: 800000
MADV_COLLAPSE failed: Invalid argument

with sync:
self: 0x555c947f0a72 start: 555c94400000 end: 555c94be3000
old advlen: 7e3000
new advlen: 800000

Greetings,

Andres Freund
Re: remap the .text segment into huge pages at run time
On Wed, Jun 21, 2023 at 10:42 AM Andres Freund wrote:
> So I am wondering if you're encountering a different kind of problem. As I
> mentioned, I have observed that the pages need to be clean for this to
> work. For me adding a "sync path/to/postgres" makes it work on 6.3.8.
> Without the sync it starts to work a while later (presumably when the
> kernel got around to writing the data back).

Hmm, then after rebooting today, it shouldn't have that problem until a
build links again, but I'll make sure to do that when building. Still same
failure, though.

Looking more closely at the manpage for madvise, it has this under
MADV_HUGEPAGE:

"The MADV_HUGEPAGE, MADV_NOHUGEPAGE, and MADV_COLLAPSE operations are
available only if the kernel was configured with
CONFIG_TRANSPARENT_HUGEPAGE and file/shmem memory is only supported if the
kernel was configured with CONFIG_READ_ONLY_THP_FOR_FS."

Earlier, I only checked the first config option but didn't know about the
second...

$ grep CONFIG_READ_ONLY_THP_FOR_FS /boot/config-$(uname -r)
# CONFIG_READ_ONLY_THP_FOR_FS is not set

Apparently, it's experimental. That could be the explanation, but now I'm
wondering why the fallback

madvise(addr, advlen, MADV_HUGEPAGE);

didn't also give an error.

I wonder if we could mremap to some anonymous region and call madvise on
that. That would be more similar to the hack I shared last year, which may
be more fragile, but now it wouldn't need explicit huge pages.

--
John Naylor
EDB: http://www.enterprisedb.com