Re: The "5-level Page Table" Default in Linux May Impact Throughput

2022-06-19 Thread Mark Dawson
Yes, I'm aware that ARM has offered LAM-style masking (Top-Byte Ignore) for a
long time now. And regarding LAM on Intel, we shouldn't have to wait that long
at all, since it's due in Sapphire Rapids.

As far as Java workloads go, it seems some runtimes already *require*
disabling 5-level Page Tables, because they play pointer-tagging tricks
with the upper bits for metadata-tracking purposes and can't afford to
lose the extra 9 bits. Azul Zing even requires disabling it if using ZST
(though I'm unsure why that is, specifically). All this, coupled with your
feedback on some GC inner workings, makes me think the Java world certainly
adds some complication to the discussion. Perhaps I'll write a followup
article discussing these special-case scenarios.

But the gist of my article is mainly about the generic performance impact,
and things like page faults and pipe throughput and munmap() will certainly
take a hit.

But thanks again for the compliment on the content and the insightful feedback!

On Sun, Jun 19, 2022, 10:42 AM 'Gil Tene' via mechanical-sympathy <
mechanical-sympathy@googlegroups.com> wrote:

> Yes yes yes. I want LAM. I've wanted it (and have been asking for it or
> something like it on x86) for over 14 years. And *when* LAM becomes a real
> thing we can run on and manipulate from user mode, I'll absolutely prefer
> using it to what we do today with VM multi-mapping, for many many reasons.
> Including (but not limited to) not having to deal with vma count limits,
> virtual memory fragmentation (and "innovative" ways to cap it and
> defragment it), cost of creating/changing mappings, etc. etc., and not
> having to explain to people that Linux's RSS is one of those words they use
> a lot, and that I do not think it means what they think it means (but that
> e.g. cgroup memory use is a thing that tracks what they think RSS means but
> doesn't), etc. etc.
>
> And when LAM is available, I'd much prefer 4-level page tables to 5-level,
> because it would give me 9 more LAM-provided metadata bits to play with...
>
> But until LAM becomes a thing (in hardware lots of people run on, with
> Linux kernel and glibc support in Linux distros people actually run their
> OS images and container hosts on, etc.), 5-level page tables currently add
> speed (when dealing with >256GB of heap) on modern JVMs running on actual
> production x86 servers. And I think there may be other (non-Java) workloads
> for which the same is true. So I personally like 5 level page tables being
> on by default (on hardware that supports them) and prefer that people not
> turn them off unless their OS images are provisioned with physical memory
> small enough that they have no value even there. I.e. for physical machines
> or VM images with <256GB of RAM, turning off 5-level page tables makes
> perfect sense. But above that, I like them on better than I like them off.
>
> BTW, several other processors have (and had) this masking or "don't
> careness" stuff that LAM will finally be bringing to x86, and on those
> processors, we have (and had) better options than doing more than 2-way VM
> multi-mapping. But since the vast majority of the world runs on x86, we do
> what we must to make things work there, and that makes us do some funny
> stuff with virtual memory, and some a-little-slower funnier stuff (with
> 4-level page tables) when heap sizes go above sizes that, even when not
> super-common, are very real today. 5-level page tables move the line where
> we need the a-little-slower funnier stuff out to well beyond the RAM
> sizes mortals can get their hands on today. Hopefully LAM arrives before
> real-world RAM grows too much even for that.
>
> On Saturday, June 18, 2022 at 7:58:00 PM UTC+1 Mark E. Dawson, Jr. wrote:
>
>> Thanks, Gil, for the compliment!
>>
>> Regarding the VM tricks used by certain GCs out there, wouldn't that be
>> better served instead by Intel's upcoming Linear Address Masking (LAM) in
>> Sapphire Rapids, which would allow you to apply all types of tricks to the
>> upper 15 bits of unused bits on 4-level Page Table systems (only the upper
>> 7 bits on 5-level Page Table systems)? With LAM, user applications no
>> longer need to worry about masking off updates to those bits when accessing
>> memory via those pointers since the system will ignore them (i.e., the CPU
>> no longer checks for canonicality).
>>
>> In that case, 4-level Page Tables maintain their advantage over 5-level
>> Page Tables provided an end user does not require the 64TB or more of RAM
>> which 5-level Page Tables enable.
>>
>>
>> On Sat, Jun 18, 2022, 10:30 AM 'Gil Tene' via mechanical-sympathy <
>> mechanica...@googlegroups.com> wrote:
>>
>>> A good read, and very well written.
>>>
>>> I'm with you in general on most/all of it except for one thing: That
>>> 64TB line.
>>>
>>> I draw that same line at 256GB. Or at least at a 256GB
>>> guest/pod/container/process. There are things out there that use virtual
>>> memory multi (and "many") mapping tricks

Re: The "5-level Page Table" Default in Linux May Impact Throughput

2022-06-19 Thread 'Gil Tene' via mechanical-sympathy
Yes yes yes. I want LAM. I've wanted it (and have been asking for it or 
something like it on x86) for over 14 years. And *when* LAM becomes a real 
thing we can run on and manipulate from user mode, I'll absolutely prefer 
using it to what we do today with VM multi-mapping, for many many reasons. 
Including (but not limited to) not having to deal with vma count limits, 
virtual memory fragmentation (and "innovative" ways to cap it and 
defragment it), cost of creating/changing mappings, etc. etc., and not 
having to explain to people that Linux's RSS is one of those words they use 
a lot, and that I do not think it means what they think it means (but that 
e.g. cgroup memory use is a thing that tracks what they think RSS means but 
doesn't), etc. etc. 

And when LAM is available, I'd much prefer 4-level page tables to 5-level, 
because it would give me 9 more LAM-provided metadata bits to play with...

But until LAM becomes a thing (in hardware lots of people run on, with 
Linux kernel and glibc support in Linux distros people actually run their 
OS images and container hosts on, etc.), 5-level page tables currently add 
speed (when dealing with >256GB of heap) on modern JVMs running on actual 
production x86 servers. And I think there may be other (non-Java) workloads 
for which the same is true. So I personally like 5 level page tables being 
on by default (on hardware that supports them) and prefer that people not 
turn them off unless their OS images are provisioned with physical memory 
small enough that they have no value even there. I.e. for physical machines 
or VM images with <256GB of RAM, turning off 5-level page tables makes 
perfect sense. But above that, I like them on better than I like them off.

BTW, several other processors have (and had) this masking or "don't 
careness" stuff that LAM will finally be bringing to x86, and on those 
processors, we have (and had) better options than doing more than 2-way VM 
multi-mapping. But since the vast majority of the world runs on x86, we do 
what we must to make things work there, and that makes us do some funny 
stuff with virtual memory, and some a-little-slower funnier stuff (with 
4-level page tables) when heap sizes go above sizes that, even when not 
super-common, are very real today. 5-level page tables move the line where 
we need the a-little-slower funnier stuff out to well beyond the RAM sizes 
mortals can get their hands on today. Hopefully LAM arrives before 
real-world RAM grows too much even for that.

On Saturday, June 18, 2022 at 7:58:00 PM UTC+1 Mark E. Dawson, Jr. wrote:

> Thanks, Gil, for the compliment!
>
> Regarding the VM tricks used by certain GCs out there, wouldn't that be 
> better served instead by Intel's upcoming Linear Address Masking (LAM) in 
> Sapphire Rapids, which would allow you to apply all types of tricks to the 
> 15 unused upper bits on 4-level Page Table systems (only the upper 
> 6 bits on 5-level Page Table systems)? With LAM, user applications no 
> longer need to worry about masking off updates to those bits when accessing 
> memory via those pointers since the system will ignore them (i.e., the CPU 
> no longer checks for canonicality).
>
> In that case, 4-level Page Tables maintain their advantage over 5-level 
> Page Tables provided an end user does not require the 64TB or more of RAM 
> which 5-level Page Tables enable.
>
>
> On Sat, Jun 18, 2022, 10:30 AM 'Gil Tene' via mechanical-sympathy <
> mechanica...@googlegroups.com> wrote:
>
>> A good read, and very well written. 
>>
>> I'm with you in general on most/all of it except for one thing: That 64TB 
>> line. 
>>
>> I draw that same line at 256GB. Or at least at a 256GB 
>> guest/pod/container/process. There are things out there that use virtual 
>> memory multi (and "many") mapping tricks for speed, and eat up a bunch of 
>> that seemingly plentiful 47 bit virtual user space in the process. The ones 
>> I know the most about (because I am mostly to blame for them) are high 
>> throughput concurrent GC mechanisms. Those have a multitude of 
>> implementation variants, all of which encode 
>> phases/generations/spaces/colors in higher-order virtual bits and use 
>> multi-mapping to efficiently recycle physical memory.
>>
>> For a concrete example, when running on a vanilla linux kernel with 
>> 4-level page tables, the current C4 collector in the Prime JVM (formerly 
>> known as "Zing"), uses different phase encoding, and different LVB barrier 
>> instruction encodings, depending on whether the heap size is above or below 
>> 256GB. Below 256GB, C4 gets to use sparse phase and generation encodings 
>> (using up 6 bits of virtual space) and a faster LVB test (test&jmp), and 
>> above 256GB, it uses denser encodings (using up only 3 bits) with slightly 
> more expensive LVBs (test a bit in a mask & jmp). With a 5-level page table 
> (on hardware that supports it) we can move that line out by 512x, which means 
>> that even many-TB Java heaps can 

Re: The "5-level Page Table" Default in Linux May Impact Throughput

2022-06-18 Thread Mark Dawson
Thanks, Gil, for the compliment!

Regarding the VM tricks used by certain GCs out there, wouldn't that be
better served instead by Intel's upcoming Linear Address Masking (LAM) in
Sapphire Rapids, which would allow you to apply all types of tricks to the
15 unused upper bits on 4-level Page Table systems (only the upper
6 bits on 5-level Page Table systems)? With LAM, user applications no
longer need to worry about masking off updates to those bits when accessing
memory via those pointers since the system will ignore them (i.e., the CPU
no longer checks for canonicality).

In that case, 4-level Page Tables maintain their advantage over 5-level
Page Tables provided an end user does not require the 64TB or more of RAM
which 5-level Page Tables enable.


On Sat, Jun 18, 2022, 10:30 AM 'Gil Tene' via mechanical-sympathy <
mechanical-sympathy@googlegroups.com> wrote:

> A good read, and very well written.
>
> I'm with you in general on most/all of it except for one thing: That 64TB
> line.
>
> I draw that same line at 256GB. Or at least at a 256GB
> guest/pod/container/process. There are things out there that use virtual
> memory multi (and "many") mapping tricks for speed, and eat up a bunch of
> that seemingly plentiful 47 bit virtual user space in the process. The ones
> I know the most about (because I am mostly to blame for them) are high
> throughput concurrent GC mechanisms. Those have a multitude of
> implementation variants, all of which encode
> phases/generations/spaces/colors in higher-order virtual bits and use
> multi-mapping to efficiently recycle physical memory.
>
> For a concrete example, when running on a vanilla linux kernel with
> 4-level page tables, the current C4 collector in the Prime JVM (formerly
> known as "Zing"), uses different phase encoding, and different LVB barrier
> instruction encodings, depending on whether the heap size is above or below
> 256GB. Below 256GB, C4 gets to use sparse phase and generation encodings
> (using up 6 bits of virtual space) and a faster LVB test (test&jmp), and
> above 256GB, it uses denser encodings (using up only 3 bits) with slightly
> more expensive LVBs (test a bit in a mask & jmp). With a 5-level page table
> (on hardware that supports it) we can move that line out by 512x, which means
> that even many-TB Java heaps can use the same cheap LVB tests that the
> smaller ones do.
>
> I expect the (exact) same considerations will be true for ZGC in OpenJDK
> (once it adds a generational mode to be able to keep up with high
> throughput allocations and large live sets), as ZGC's virtual space
> encoding needs and resulting LVB test instruction encodings are identical
> to C4's.
>
> So I'd say you can safely turn off 5-level tables on machines that
> physically have less than 256GB of memory, or on machines that are known to
> not run Java (now or in the future), or some other in-memory application
> technology that uses virtual memory tricks at scale. But above 256GB, I'd
> keep it on, especially if the thing is e.g. a Kubernetes node that may
> want to run some cool Java workload tomorrow with the best speed and
> efficiency.
> On Friday, June 17, 2022 at 7:06:42 PM UTC+2 Mark E. Dawson, Jr. wrote:
>
>> In the article below, I address just *one* of the areas where this new
>> default Linux build option in most recent distros can adversely impact
>> multithreaded performance - page fault handling. But any workload which
>> requires the kernel to mimic MMU page table walking to accomplish a task
>> could be impacted adversely, as well (e.g., pipe communication). You'd do
>> well to do your own testing:
>>
>> https://www.jabperf.com/5-level-vs-4-level-page-tables-does-it-matter/
>>
>> *NOTE*: 5-level Page Table can be disabled with "no5lvl" on the kernel
>> boot command line.
>>
> --
> You received this message because you are subscribed to the Google Groups
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to mechanical-sympathy+unsubscr...@googlegroups.com.
> To view this discussion on the web, visit
> https://groups.google.com/d/msgid/mechanical-sympathy/c7d49c5d-4872-4f2d-8010-3035ccf5d7d8n%40googlegroups.com
> 
> .
>



Re: The "5-level Page Table" Default in Linux May Impact Throughput

2022-06-18 Thread 'Gil Tene' via mechanical-sympathy
A good read, and very well written. 

I'm with you in general on most/all of it except for one thing: That 64TB 
line. 

I draw that same line at 256GB. Or at least at a 256GB 
guest/pod/container/process. There are things out there that use virtual 
memory multi (and "many") mapping tricks for speed, and eat up a bunch of 
that seemingly plentiful 47 bit virtual user space in the process. The ones 
I know the most about (because I am mostly to blame for them) are high 
throughput concurrent GC mechanisms. Those have a multitude of 
implementation variants, all of which encode 
phases/generations/spaces/colors in higher-order virtual bits and use 
multi-mapping to efficiently recycle physical memory.

For a concrete example, when running on a vanilla linux kernel with 4-level 
page tables, the current C4 collector in the Prime JVM (formerly known as 
"Zing"), uses different phase encoding, and different LVB barrier 
instruction encodings, depending on whether the heap size is above or below 
256GB. Below 256GB, C4 gets to use sparse phase and generation encodings 
(using up 6 bits of virtual space) and a faster LVB test (test&jmp), and 
above 256GB, it uses denser encodings (using up only 3 bits) with slightly 
more expensive LVBs (test a bit in a mask & jmp). With a 5-level page table 
(on hardware that supports it) we can move that line out by 512x, which means 
that even many-TB Java heaps can use the same cheap LVB tests that the 
smaller ones do.

I expect the (exact) same considerations will be true for ZGC in OpenJDK 
(once it adds a generational mode to be able to keep up with high 
throughput allocations and large live sets), as ZGC's virtual space 
encoding needs and resulting LVB test instruction encodings are identical 
to C4's.

So I'd say you can safely turn off 5-level tables on machines that physically 
have less than 256GB of memory, or on machines that are known to not run 
Java (now or in the future), or some other in-memory application technology 
that uses virtual memory tricks at scale. But above 256GB, I'd keep it 
on, especially if the thing is e.g. a Kubernetes node that may want to run 
some cool Java workload tomorrow with the best speed and efficiency.

On Friday, June 17, 2022 at 7:06:42 PM UTC+2 Mark E. Dawson, Jr. wrote:

> In the article below, I address just *one* of the areas where this new 
> default Linux build option in most recent distros can adversely impact 
> multithreaded performance - page fault handling. But any workload which 
> requires the kernel to mimic MMU page table walking to accomplish a task 
> could be impacted adversely, as well (e.g., pipe communication). You'd do 
> well to do your own testing:
>
> https://www.jabperf.com/5-level-vs-4-level-page-tables-does-it-matter/
>
> *NOTE*: 5-level Page Table can be disabled with "no5lvl" on the kernel 
> boot command line.
>



The "5-level Page Table" Default in Linux May Impact Throughput

2022-06-17 Thread Mark E. Dawson, Jr.
In the article below, I address just *one* of the areas where this new 
default Linux build option in most recent distros can adversely impact 
multithreaded performance - page fault handling. But any workload which 
requires the kernel to mimic MMU page table walking to accomplish a task 
could be impacted adversely, as well (e.g., pipe communication). You'd do 
well to do your own testing:

https://www.jabperf.com/5-level-vs-4-level-page-tables-does-it-matter/

*NOTE*: 5-level Page Table can be disabled with "no5lvl" on the kernel boot 
command line.
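
For anyone wanting to check their own machines, these commands (standard
procfs paths on a reasonably recent kernel) show whether the CPU supports
5-level paging and whether no5lvl is already in effect:

```shell
# Does the CPU advertise 5-level paging support? (the "la57" flag)
grep -o -m1 la57 /proc/cpuinfo || echo "no la57: CPU is 4-level only"

# Was 5-level paging disabled on the kernel command line?
grep -o no5lvl /proc/cmdline || echo "no5lvl not set"

# To disable 5-level page tables, add no5lvl to the kernel boot
# parameters (e.g. in GRUB_CMDLINE_LINUX) and reboot.
```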
