Thanks!
I have quite a few comments below, but most of them are small points,
nothing earth-shattering.


On Sun, Jan 17, 2021 at 4:52 AM Waldemar Kozaczuk <jwkozac...@gmail.com>
wrote:

> As issue #143 explains, application stacks are currently eagerly
> allocated (aka pre-populated), which may lead to substantial memory waste.
> We want these stacks to be lazily allocated instead, so that the physical
> memory behind a stack gets allocated, as the stack grows, by the standard
> page-fault handling mechanism.
>
> The page fault handling logic requires that both interrupts and preemption
> are enabled. Therefore this patch implements a fairly simple strategy: read
> a single byte on the stack one page (4K) deeper before disabling interrupts
> or preemption. The idea is to pre-fault the next deeper page of the stack,
> in order to avoid a page fault later, after interrupts or preemption have
> been disabled. We assume that 4K of stack is enough for all these cases.
>
> It turns out we cannot read that single byte "blindly", because in some
> scenarios interrupts might have already been disabled when we are about to
> disable preemption, and vice versa. Similarly, and more speculatively,
> disabling preemption (and possibly interrupts) may nest. Therefore we need
> a scheme where we "conditionally" read a byte on the stack. More
> specifically, we should ONLY try to pre-fault IF both interrupts and
> preemption are enabled before we disable either of the two.
>

Interesting.
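
If I understand the scheme correctly, the rule is conceptually something like
the sketch below (my own illustration, not code from the patch - and as you
explain next, the real check needs to be much cheaper than this):

```
// Conceptual sketch only (assumes OSv's sched.hh/arch.hh): pre-fault the
// next stack page only when a page fault could actually be handled now,
// i.e. both preemption and interrupts are still enabled.
inline void conditionally_pre_fault_stack()
{
    if (sched::preemptable() && arch::irq_enabled()) {
        char probe;
        volatile char* one_page_deeper = &probe - 4096; // mirrors -4096(%rsp)
        (void)*one_page_deeper;
    }
}
```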


> In theory, we could have an "if" that calls sched::preemptable() and
> arch::irq_enabled() before we read a byte of the stack. But together this
> would be quite expensive, given that irq_enabled() uses at least 4
> instructions and, more importantly, needs a stack to dump RFLAGS to.
>
> So instead, this patch takes a different approach by adding a new
> "composite" thread-local counter - irq_preempt_counters - that tracks the
> disabling/enabling of both interrupts and preemption, and operates somewhat
> similarly to the preemption counter. Every time interrupts are disabled we
> increment the counter, and every time interrupts are enabled again we
> decrement it. Now, in order to cheaply determine whether we should
> pre-fault, we place both the 32-bit preempt counter and the 16-bit irq
> counter together in a union shared with the 64-bit field
> lazy_stack_no_pre_fault. This way we can increment/decrement each counter
> independently and, at the same time, simply check whether all bits of
> lazy_stack_no_pre_fault are 0 and only then proceed to read a byte. This
> is exactly what arch::ensure_next_stack_page() does.
>
> In addition, we have another 16-bit field, lazy_stack_flags, that allows
> us to set certain bits to disable and enable stack pre-faulting
> dynamically. This allows us to do three things:
> 1) disable stack pre-faulting if the stack is pre-allocated (not lazy),
> 2) dynamically disable/enable stack pre-faulting when handling a page
> fault,
> 3) dynamically disable/enable stack pre-faulting where some methods
> in core/mmu.cc take the vma_list_mutex for write, to avoid a deadlock
> when a page fault triggered by an unallocated stack tries to take the
> same mutex for read.
>

> Compared to previous versions of this patch, this one also adds more
> elaborate logic to determine whether a stack should be handled as lazy or
> not - please see the comments in the init_stack() method.
>
> Finally, given that lazy stack handling affects many critical parts of
> the kernel, including the scheduler and other code that uses RAII-like
> guards - irq_lock_type, irq_save_lock_type, rcu_lock_type and
> preempt_lock_t - this version adds a configurable compilation flag,
> conf-lazy_stack, which is disabled by default.
>

I like the idea of the configurable parameter - it should be documented as
a *tradeoff* between memory and CPU:
1. With lazy stack allocation, multi-threaded applications take less memory
for their stacks.
2. With lazy stack allocation, kernel code that disables preemption (or
interrupts) becomes slower.

However, I'd like this to be useful and *correct* in both settings. And I
think it's easy to do:
To make the "off" setting useful, it needs to correctly support
user-supplied lazily-allocated stacks (as happened to the current user who
complained), even if it still normally uses malloc() for the stacks.
There is an easy way to do this! Just *populate* the given stack, entirely,
once.
This can be as straightforward as reading one byte from every page in the
entire range, or something more sophisticated using the page tables.
But after that step, we will have basically "converted" a lazy stack into a
pre-populated one, and we can continue to use it normally with no need for
runtime trickery.
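
Something like the following untested sketch (the function name is mine,
assuming the usual 4K mmu::page_size):

```
// Untested sketch: touch one byte in every page of a user-supplied stack,
// once, so that from then on it behaves exactly like a pre-populated stack.
static void populate_stack(void* begin, size_t size)
{
    char* p = static_cast<char*>(begin);
    for (size_t off = 0; off < size; off += mmu::page_size) {
        volatile char* byte = p + off;
        (void)*byte;   // the read faults the page in
    }
}
```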

If you do this, both "on" and "off" settings will be 100% correct and
useful. The benefit of the "on" setting will be saving memory in
multi-threaded applications, because stacks aren't fully allocated.
If you don't do this, this patch will not help the user who started this
discussion.

In my patch, which solved (only) his scenario, I included a test which
fails before my patch. Please make sure that this test (which you can
include in your patch) passes with your patch, with the flag both "on" and
"off".


>
> Please note that, performance-wise, this patch is slightly more expensive
> than originally hoped. As the disassembled snippets illustrate, it costs us
> 3-4 extra instructions every time we disable preemption or interrupts.
>
> A. Disabling preemption in __tls_get_addr before the patch:
> ```
>  4035cae8:   e8 83 f6 ff ff          callq  4035c170 <elf::get_program()>
>  4035caed:   4c 8b 2d 6c 40 51 00    mov    0x51406c(%rip),%r13        #
> 40870b60 <sched::preempt_counter+0x40870b60>
>  4035caf4:   4c 8b 33                mov    (%rbx),%r14
>  4035caf7:   64 41 83 45 00 01       addl   $0x1,%fs:0x0(%r13)
>  4035cafd:   e8 6e f6 ff ff          callq  4035c170 <elf::get_program()>
> ```
>
> B. Disabling preemption in __tls_get_addr after the patch:
> ```
>  4035fdd8:   e8 63 f6 ff ff          callq  4035f440 <elf::get_program()>
>  4035fddd:   4c 8b 2d a4 4d 51 00    mov    0x514da4(%rip),%r13        #
> 40874b88 <arch::irq_preempt_counters+0x40874b88>
>  4035fde4:   4c 8b 33                mov    (%rbx),%r14
>  4035fde7:   64 49 83 7d 00 00       cmpq   $0x0,%fs:0x0(%r13)
>  4035fded:   75 07                   jne    4035fdf6 <__tls_get_addr+0x76>
>  4035fdef:   8a 84 24 00 f0 ff ff    mov    -0x1000(%rsp),%al
>  4035fdf6:   64 41 83 45 00 01       addl   $0x1,%fs:0x0(%r13)
>  4035fdfc:   e8 3f f6 ff ff          callq  4035f440 <elf::get_program()>
> ```
>

In https://github.com/cloudius-systems/osv/issues/143 I *wished* we could do
this without a branch, just a single read, but we can't always get what we
wish for :-(


> As an example of the memory saving, tomcat using ~300 threads ended up at
> 365MB instead of the 620MB it used before the patch.
>

Very nice!


> Fixes #143
>

I am ambivalent on whether we can actually say "Fixes #143" here :-(

The thing is, #143 is about the ability to save a lot of memory by enabling
this feature.
What this patch does is add this as an option, but it does NOT enable it by
default. So most users will not turn it on, and will not save the memory
that #143 says they can save. Most users will feel that issue #143 still
exists...

So I'm not sure what to do about #143...



>
> Signed-off-by: Matthew Pabst <pabstmatt...@gmail.com>
> Signed-off-by: Waldemar Kozaczuk <jwkozac...@gmail.com>
> ---
>  Makefile                |  2 +-
>  arch/aarch64/arch.hh    |  6 +++++
>  arch/x64/arch-switch.hh | 24 ++++++++++++++++++
>  arch/x64/arch.hh        | 51 ++++++++++++++++++++++++++++++++++++++
>  arch/x64/mmu.cc         |  5 ++++
>  conf/aarch64.mk         |  3 +++
>  conf/x64.mk             |  1 +
>  core/mmu.cc             | 40 +++++++++++++++++++++++++++---
>  core/sched.cc           |  7 +++++-
>  include/osv/counters.hh | 54 +++++++++++++++++++++++++++++++++++++++++
>  include/osv/irqlock.hh  |  5 ++++
>  include/osv/mmu-defs.hh |  1 +
>  include/osv/mmu.hh      |  1 +
>  include/osv/sched.hh    | 11 ++++++---
>  libc/mman.cc            |  9 +++----
>  libc/pthread.cc         |  7 +++++-
>  16 files changed, 212 insertions(+), 15 deletions(-)
>  create mode 100644 include/osv/counters.hh
>
> diff --git a/Makefile b/Makefile
> index 66448a9b..4f02db5b 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -328,7 +328,7 @@ $(out)/bsd/%.o: INCLUDES += -isystem bsd/
>  # for machine/
>  $(out)/bsd/%.o: INCLUDES += -isystem bsd/$(arch)
>
> -configuration-defines = conf-preempt conf-debug_memory conf-logger_debug
> conf-debug_elf
> +configuration-defines = conf-preempt conf-debug_memory conf-logger_debug
> conf-debug_elf conf-lazy_stack
>
>  configuration = $(foreach cf,$(configuration-defines), \
>                        -D$(cf:conf-%=CONF_%)=$($(cf)))
> diff --git a/arch/aarch64/arch.hh b/arch/aarch64/arch.hh
> index 855f1987..1292a4b4 100644
> --- a/arch/aarch64/arch.hh
> +++ b/arch/aarch64/arch.hh
> @@ -20,6 +20,12 @@ namespace arch {
>  #define INSTR_SIZE_MIN 4
>  #define ELF_IMAGE_START (OSV_KERNEL_BASE + 0x10000)
>
> +inline void ensure_next_stack_page()
> +{
> +    //TODO: Implement lazy stack check for AARCH64 once there is
> +    //a dedicated exception stack to handle page faults
> +}
> +
>  inline void irq_disable()
>  {
>      processor::irq_disable();
> diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
> index ee8b7e99..00956bdb 100644
> --- a/arch/x64/arch-switch.hh
> +++ b/arch/x64/arch-switch.hh
> @@ -10,6 +10,7 @@
>
>  #include "msr.hh"
>  #include <osv/barrier.hh>
> +#include <osv/mmu.hh>
>  #include <string.h>
>
>  //
> @@ -142,12 +143,28 @@ void thread::init_stack()
>      if (!stack.begin) {
>          stack.begin = malloc(stack.size);
>          stack.deleter = stack.default_deleter;
> +#ifdef CONF_lazy_stack
> +        // As of now, malloc() always returns pre-populated memory, so
> let us mark the stack as not lazy
> +        // If that changes in future, we would need to accomodate this
> accordingly
> +        stack.lazy = false;
> +        // The stack has been already allocated so let us determine if is
> pre-populated
> +    } else if (mmu::is_linear_mapped(stack.begin, stack.size) ||
> mmu::ismapped_and_populated(stack.begin, stack.size)) {
> +        stack.lazy = false;
> +    } else {
> +        // Otherwise let us assume it is lazy
> +        stack.lazy = true;
> +#endif
>      }
>      void** stacktop = reinterpret_cast<void**>(stack.begin + stack.size);
>      _state.rbp = this;
>      _state.rip = reinterpret_cast<void*>(thread_main);
>      _state.rsp = stacktop;
>      _state.exception_stack = _arch.exception_stack +
> sizeof(_arch.exception_stack);
> +
> +    // In case, the thread stack was setup to get lazily allocated, and
> given the thread initially starts
> +    // running with preemption disabled, we need to pre-fault the first
> page from the top of the stack.
> +    volatile char r = *((char *) (stacktop - 1));
> +    (void) r; // trick the compiler into thinking that r is used
>

It's strange that above you set up this stack.lazy flag, but a few lines
later, you don't check it.

Also, maybe we should have some inline function that does this
song-and-dance of reading one byte from the stack instead of repeating it
in multiple places?
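
For example (the name and exact signature are just a suggestion):

```
// Hypothetical helper so the one-byte-read trick is written once:
inline void pre_fault_stack_top(void* stacktop)
{
    // read a byte just below the stack top; the volatile pointer keeps
    // the compiler from dropping the otherwise-unused load
    volatile char* p = static_cast<char*>(stacktop) - 1;
    (void)*p;
}
```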



>  }
>
>  void thread::setup_tcb()
> @@ -315,6 +332,13 @@ void thread::free_tcb()
>
>  void thread_main_c(thread* t)
>  {
> +#ifdef CONF_lazy_stack
> +    // Enable lazy stack pre-fault logic for this thread
> +    // if stack is lazily allocated (most likely application thread)
> +    if (t->get_stack_info().lazy) {
> +
> arch::irq_preempt_counters._data.set_lazy_flags_bit(arch::lazy_stack_disable,
> false);
> +    }
> +#endif
>
     arch::irq_enable();
>  #ifdef CONF_preempt
>      preempt_enable();
> diff --git a/arch/x64/arch.hh b/arch/x64/arch.hh
> index 17df5f5c..17885245 100644
> --- a/arch/x64/arch.hh
> +++ b/arch/x64/arch.hh
> @@ -10,6 +10,8 @@
>
>  #include "processor.hh"
>  #include "msr.hh"
> +#include <osv/barrier.hh>
> +#include <osv/counters.hh>
>
>  // namespace arch - architecture independent interface for architecture
>  //                  dependent operations (e.g. irq_disable vs. cli)
> @@ -20,11 +22,41 @@ namespace arch {
>  #define INSTR_SIZE_MIN 1
>  #define ELF_IMAGE_START OSV_KERNEL_BASE
>
> +inline void ensure_next_stack_page() {
> +    // Because both irq and preempt counters and lazy stack
> +    // flags are co-located within same 8-bytes long union
>
+    // we can check single field for non-zero value to determine
> +    // if we should trigger pre-fault on lazy stack
> +    if (irq_preempt_counters.lazy_stack_no_pre_fault) {
> +        return;
> +    }
> +
> +    char i;
> +    asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
>

Any benefit to doing this read in assembler and not C (like you did in
other places)?
If we can do this in C, it has two benefits: it is easier to read for
mortals, and potentially the same function can be used for other
architectures too - since the concept of a lazy stack is probably not
x86-specific.
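For example, something like this (a sketch - I haven't checked what code gcc
generates for it):

```
inline void ensure_next_stack_page() {
    if (irq_preempt_counters.lazy_stack_no_pre_fault) {
        return;
    }
    // Roughly the portable equivalent of "movb -4096(%rsp), ...":
    // probe one byte one page below the current stack frame.
    char probe;
    volatile char* one_page_deeper = &probe - 4096;
    (void)*one_page_deeper;
}
```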

+}
> +
> +inline void ensure_next_two_stack_pages() {
> +    if (irq_preempt_counters.lazy_stack_no_pre_fault) {
> +        return;
> +    }
> +
> +    char i;
> +    asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
> +    asm volatile("movb -8192(%%rsp), %0" : "=r"(i));
> +}
> +
>  inline void irq_disable()
>  {
> +#if CONF_lazy_stack
> +    ensure_next_stack_page();
> +    ++irq_preempt_counters._data.irq;
> +#endif
>
     processor::cli();
>
 }
>
> +// Given there is no corresponding enable function we choose to
> +// not increment the irq counter and instead rely on the save()/restore()
> +// methods of the irq_flag_notrace guard below.
>  __attribute__((no_instrument_function))
>  inline void irq_disable_notrace();
>
> @@ -36,11 +68,19 @@ inline void irq_disable_notrace()
>  inline void irq_enable()
>  {
>      processor::sti();
> +#if CONF_lazy_stack
> +    --irq_preempt_counters._data.irq;
> +    barrier();
>

What is the theory on why a compiler barrier is needed here and in
wait_for_interrupt() below, but not in irq_disable()?


> +#endif
>

I really hope our calls to irq_enable()/irq_disable() are balanced, because
processor::cli() / processor::sti() could be unbalanced and still work, but
I hope we didn't rely on that anywhere.

 }
>
>  inline void wait_for_interrupt()
>  {
>      processor::sti_hlt();
> +#if CONF_lazy_stack
> +    --irq_preempt_counters._data.irq;
>

Hmm, this assumes that wait_for_interrupt() is always called with
interrupts disabled. Is this true? (I didn't check, but I hope you did?)

+    barrier();
> +#endif
>
 }
>
>  inline void halt_no_interrupts()
> @@ -76,14 +116,25 @@ private:
>      unsigned long _rflags;
>  };
>
> +// The irq_flag_notrace guard is used in the trace code (core/trace.cc
> and include/osv/trace.hh)
> +// to save and restore RFLAGS around code that disables interrupts by
> calling irq_disable_notrace().
> +// Knowing that, we ensure next page of the stack is populated and
> increment irq counter in save()
> +// and accordingly we decrement same counter in restore().
>  inline void irq_flag_notrace::save()
>  {
> +#if CONF_lazy_stack
> +    ensure_next_stack_page();
> +    ++irq_preempt_counters._data.irq;
> +#endif
>      asm volatile("sub $128, %%rsp; pushfq; popq %0; add $128, %%rsp" :
> "=r"(_rflags));
>  }
>
>  inline void irq_flag_notrace::restore()
>  {
>      asm volatile("sub $128, %%rsp; pushq %0; popfq; add $128, %%rsp" : :
> "r"(_rflags));
> +#if CONF_lazy_stack
> +    --irq_preempt_counters._data.irq;
> +#endif
>  }
>
>  inline bool irq_flag_notrace::enabled() const
> diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
> index 1af268c0..cd613c15 100644
> --- a/arch/x64/mmu.cc
> +++ b/arch/x64/mmu.cc
> @@ -37,6 +37,11 @@ void page_fault(exception_frame *ef)
>      assert(sched::preemptable());
>      assert(ef->rflags & processor::rflags_if);
>
> +    // Even though we are on exception stack that is not lazy
>

Is this "even though", or "because"?
I.e., the per-thread lazy flag is no longer relevant when running in the
other
stack, so we want to override it?

+    // for pre-caution we are disabling the pre-fault logic
> +    // for the time the fault is being handled
> +    arch::lazy_stack_flags_guard
> lazy_stack_guard(arch::lazy_stack_fault_disable);
> +
>      // And since we may sleep, make sure interrupts are enabled.
>      DROP_LOCK(irq_lock) { // irq_lock is acquired by HW
>          mmu::vm_fault(addr, ef);
> diff --git a/conf/aarch64.mk b/conf/aarch64.mk
> index de18ae39..96c87712 100644
> --- a/conf/aarch64.mk
> +++ b/conf/aarch64.mk
> @@ -7,3 +7,6 @@
>  # different objects depending on the TLS model used.
>  # Force all __thread variables encountered to local exec.
>  arch-cflags = -mstrict-align -mtls-dialect=desc -ftls-model=local-exec
> -DAARCH64_PORT_STUB
> +# Until we implement dedicated exception stack to handle page faults on
> +# the lazy stack feature should be disabled
> +conf-lazy_stack=0
> diff --git a/conf/x64.mk b/conf/x64.mk
> index 3d81bb89..d41767c5 100644
> --- a/conf/x64.mk
> +++ b/conf/x64.mk
> @@ -1 +1,2 @@
>  arch-cflags = -msse2
> +conf-lazy_stack=0
> diff --git a/core/mmu.cc b/core/mmu.cc
> index 37a1c60b..aef9b177 100644
> --- a/core/mmu.cc
> +++ b/core/mmu.cc
> @@ -42,6 +42,19 @@ extern const char text_start[], text_end[];
>
>  namespace mmu {
>
> +#if CONF_lazy_stack
> +// We need to ensure that lazy stack is populated deeply enough (2 pages)
> +// for all the cases when the vma_list_mutex is taken for write so that
> +// we can disable the pre-fault check (arch::ensure_next_stack_page)
> which,
> +// if page fault triggered on stack, would make page-fault handling logic
> +// attempt to take same vma_list_mutex fo read and end up with a deadlock.
> +#define PREVENT_STACK_PAGE_FAULT \
> +    arch::ensure_next_two_stack_pages(); \
> +    arch::lazy_stack_flags_guard
> lazy_stack_guard(arch::lazy_stack_mmu_disable);
> +#else
> +#define PREVENT_STACK_PAGE_FAULT
> +#endif
> +
>  namespace bi = boost::intrusive;
>
>  class vma_compare {
> @@ -1183,6 +1196,7 @@ static void nohugepage(void* addr, size_t length)
>
>  error advise(void* addr, size_t size, int advice)
>  {
> +    PREVENT_STACK_PAGE_FAULT
>      WITH_LOCK(vma_list_mutex.for_write()) {
>          if (!ismapped(addr, size)) {
>              return make_error(ENOMEM);
> @@ -1222,6 +1236,8 @@ void* map_anon(const void* addr, size_t size,
> unsigned flags, unsigned perm)
>      size = align_up(size, mmu::page_size);
>      auto start = reinterpret_cast<uintptr_t>(addr);
>      auto* vma = new mmu::anon_vma(addr_range(start, start + size), perm,
> flags);
> +
> +    PREVENT_STACK_PAGE_FAULT
>      SCOPE_LOCK(vma_list_mutex.for_write());
>      auto v = (void*) allocate(vma, start, size, search);
>      if (flags & mmap_populate) {
> @@ -1248,6 +1264,8 @@ void* map_file(const void* addr, size_t size,
> unsigned flags, unsigned perm,
>      auto start = reinterpret_cast<uintptr_t>(addr);
>      auto *vma = f->mmap(addr_range(start, start + size), flags |
> mmap_file, perm, offset).release();
>      void *v;
> +
> +    PREVENT_STACK_PAGE_FAULT
>      WITH_LOCK(vma_list_mutex.for_write()) {
>          v = (void*) allocate(vma, start, size, search);
>          if (flags & mmap_populate) {
> @@ -1265,8 +1283,9 @@ bool is_linear_mapped(const void *addr, size_t size)
>      return addr >= phys_mem;
>  }
>
> -// Checks if the entire given memory region is mmap()ed (in vma_list).
> -bool ismapped(const void *addr, size_t size)
> +// Find vma with the entire given memory region in it and return result of
> +// the predicate function against it or return false if vma not found
> +static bool find_vma(const void *addr, size_t size,
> std::function<bool(vma_list_type::iterator)> predicate)

Since this is a static function, please reduce runtime overhead by using a
template type (e.g., Func predicate) instead of std::function<>.
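
I.e., something along these lines (just a sketch - the loop body stays
exactly as in your patch, so I elided it here):

```
// Sketch: a function template lets the compiler inline the predicate,
// avoiding std::function's type erasure (and possible allocation).
template <typename Func>
static bool find_vma(const void *addr, size_t size, Func predicate)
{
    // ... loop body unchanged from the patch, still ending with
    //     "return predicate(p);" when the whole range is covered ...
    return false;
}
```

The ismapped() and ismapped_and_populated() wrappers below can stay exactly
as they are.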


 {
>      uintptr_t start = (uintptr_t) addr;
>      uintptr_t end = start + size;
> @@ -1277,11 +1296,23 @@ bool ismapped(const void *addr, size_t size)
>              return false;
>          start = p->end();
>          if (start >= end)
> -            return true;
> +            return predicate(p);
>      }
>      return false;
>  }
>
> +// Checks if the entire given memory region is mmap()ed (in vma_list).
> +bool ismapped(const void *addr, size_t size)
> +{
> +    return find_vma(addr, size, [](vma_list_type::iterator v) { return
> true; });
> +}
> +
> +// Checks if the entire given memory region is mmap()ed (in vma_list) and
> populated.
> +bool ismapped_and_populated(const void *addr, size_t size)
> +{
> +    return find_vma(addr, size, [](vma_list_type::iterator v) { return
> v->has_flags(mmap_populate); });
> +}
> +
>  // Checks if the entire given memory region is readable.
>  bool isreadable(void *addr, size_t size)
>  {
> @@ -1614,6 +1645,7 @@ ulong map_jvm(unsigned char* jvm_addr, size_t size,
> size_t align, balloon_ptr b)
>
>      auto* vma = new mmu::jvm_balloon_vma(jvm_addr, start, start + size,
> b, v->perm(), v->flags());
>
> +    PREVENT_STACK_PAGE_FAULT
>

Did you consider, for each of these functions, whether it needs this extra
protection?
In this case, why would isreadable() need to run any code with preemption
or IRQ disabled? It's just supposed to check things, not to fault anything
in, no?

Similarly for the other functions - did you consider each one, or did you
just add this PREVENT_STACK_PAGE_FAULT everywhere vma_list_mutex is used,
as in "better safe than sorry"?

     WITH_LOCK(vma_list_mutex.for_write()) {
>          // This means that the mapping that we had before was a balloon
> mapping
>          // that was laying around and wasn't updated to an anon mapping.
> If we
> @@ -1875,6 +1907,7 @@ void free_initial_memory_range(uintptr_t addr,
> size_t size)
>
>  error mprotect(const void *addr, size_t len, unsigned perm)
>  {
> +    PREVENT_STACK_PAGE_FAULT
>      SCOPE_LOCK(vma_list_mutex.for_write());
>
>      if (!ismapped(addr, len)) {
> @@ -1886,6 +1919,7 @@ error mprotect(const void *addr, size_t len,
> unsigned perm)
>
>  error munmap(const void *addr, size_t length)
>  {
> +    PREVENT_STACK_PAGE_FAULT
>      SCOPE_LOCK(vma_list_mutex.for_write());
>
>      length = align_up(length, mmu::page_size);
> diff --git a/core/sched.cc b/core/sched.cc
> index 06f849d1..8d217169 100644
> --- a/core/sched.cc
> +++ b/core/sched.cc
> @@ -42,6 +42,12 @@ extern char _percpu_start[], _percpu_end[];
>  using namespace osv;
>  using namespace osv::clock::literals;
>
> +namespace arch {
> +// By default disable lazy stack pre-fault logic
> +// At the same time we also disable preemption and mark interrupts as
> disabled
> +counters __thread irq_preempt_counters = {{1, 1, 1}};
> +}
> +
>  namespace sched {
>
>  TRACEPOINT(trace_sched_idle, "");
> @@ -69,7 +75,6 @@ std::vector<cpu*> cpus
> __attribute__((init_priority((int)init_prio::cpus)));
>  thread __thread * s_current;
>  cpu __thread * current_cpu;
>
> -unsigned __thread preempt_counter = 1;
>

Why does the new irq_preempt_counters replace preempt_counter even outside
the #ifdef?
Could this change in itself have a performance penalty, or is
*preempt_counter* the first item in irq_preempt_counters, so the
performance is exactly the same as it was?


>  bool __thread need_reschedule = false;
>
>  elf::tls_data tls;
> diff --git a/include/osv/counters.hh b/include/osv/counters.hh
> new file mode 100644
> index 00000000..10f96dc8
> --- /dev/null
> +++ b/include/osv/counters.hh
>

Is there a benefit to putting this in a separate header file, rather than
in the same place where we previously had "preempt_counter" (sched.hh)?

I think the name "counters.hh" is a bit too generic. These are very
specific counters, for a very specific reason.



> @@ -0,0 +1,54 @@
> +/*
> + * Copyright (C) 2021 Waldemar Kozaczuk
> + *
> + * This work is open source software, licensed under the terms of the
> + * BSD license as described in the LICENSE file in the top-level
> directory.
> + */
> +
> +#ifndef OSV_COUNTERS_HH
> +#define OSV_COUNTERS_HH
> +
> +namespace arch {
> +
> +// Both preempt and irq counters are co-located next to each other
> +// along with the the flags that drive lazy stack related logic
> +// union by design. It is so that compiler can optimize
> ensure_next_stack_page()
> +// to test against single 64-bit lazy_stack_no_pre_fault field
> +union counters {
> +    struct data {
> +        unsigned preempt;
> +        u16 irq;
>

The two counters above count *disablings* of preemption and of interrupts.
Maybe you should have a comment saying this if the names aren't
self-explanatory.


> +        u16 lazy_stack_flags;
>
+
> +        inline void set_lazy_flags_bit(unsigned nr, bool v) {
> +            lazy_stack_flags &= ~(u64(1) << nr);
> +            lazy_stack_flags |= u64(v) << nr;
> +        }
>

If you always call this code with v being a constant true or false, then it
is more efficient to have separate set() and unset() functions, each of
which only needs to do one operation (&= or |=, but not both).

If you don't do that, maybe at least change the "&=" line to ~(u64(!v) <<
nr), so that when v is a constant false the optimizer can more easily
optimize that line away?

Do we really need this many bits for these flags?
Maybe just one "permanent disable" flag (for stacks we know aren't lazy)
and one "temporary disable" flag?

Then you won't need functions taking nr (and below, also saving nr).
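
For the set()/unset() split above, I mean something like this (sketch, as
member functions of struct data):

```
// Sketch: with dedicated set/unset helpers, each call is a single
// or/and with a (usually constant) mask - no need to pass a bool.
inline void set_lazy_flags_bit(unsigned nr) {
    lazy_stack_flags |= (1u << nr);
}
inline void unset_lazy_flags_bit(unsigned nr) {
    lazy_stack_flags &= ~(1u << nr);
}
```

and lazy_stack_flags_guard would call set in its constructor and unset in
its destructor.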



> +    } _data;
> +    uint64_t lazy_stack_no_pre_fault; // Pre-faulting is disabled if this
> field is non-zero
> +};
> +
> +extern counters __thread irq_preempt_counters;
> +
> +// Simple guard intended to temporarily disable lazy stack pre-faulting
> +class lazy_stack_flags_guard {
> +    unsigned _flag_nr;
> +public:
> +    lazy_stack_flags_guard(unsigned flag_nr) {
> +        _flag_nr = flag_nr;
> +        irq_preempt_counters._data.set_lazy_flags_bit(_flag_nr, true);
> +    }
> +    ~lazy_stack_flags_guard() {
> +        irq_preempt_counters._data.set_lazy_flags_bit(_flag_nr, false);
> +    }
> +};
> +
> +enum {
> +    lazy_stack_disable = 0,      // Intended to disable pre-faulting for
> non-lazy stacks
> +    lazy_stack_mmu_disable = 1,  // Intended to temporarily disable
> pre-faulting in mmu functions
> +    lazy_stack_fault_disable = 2,// Intended to temporarily disable
> pre-faulting when handling a page fault
>
+};
> +
> +}
> +
> +#endif //OSV_COUNTERS_HH
> diff --git a/include/osv/irqlock.hh b/include/osv/irqlock.hh
> index 2d5372fc..898924b5 100644
> --- a/include/osv/irqlock.hh
> +++ b/include/osv/irqlock.hh
> @@ -28,12 +28,17 @@ private:
>  inline void irq_save_lock_type::lock()
>  {
>      _flags.save();
> +    // The irq_disable() increments the irq counter and ensures next lazy
> stack page
>      arch::irq_disable();
>  }
>
>  inline void irq_save_lock_type::unlock()
>  {
>      _flags.restore();
> +#if CONF_lazy_stack
> +    --arch::irq_preempt_counters._data.irq;
> +    barrier();
> +#endif
>  }
>
>  extern irq_lock_type irq_lock;
> diff --git a/include/osv/mmu-defs.hh b/include/osv/mmu-defs.hh
> index 18edf441..694815f8 100644
> --- a/include/osv/mmu-defs.hh
> +++ b/include/osv/mmu-defs.hh
> @@ -84,6 +84,7 @@ enum {
>      mmap_small       = 1ul << 5,
>      mmap_jvm_balloon = 1ul << 6,
>      mmap_file        = 1ul << 7,
> +    mmap_stack       = 1ul << 8,
>

Strangely, I don't see where you changed the implementation to do anything
when this bit is on... Is it intentionally ignored, and is that fine just
because you pass this flag only where it's not needed anyway? That's
strange. Why not support this bit properly and always use it on stacks?
More on this below.

 };
>
>  enum {
> diff --git a/include/osv/mmu.hh b/include/osv/mmu.hh
> index 12fcb8a4..6505a75a 100644
> --- a/include/osv/mmu.hh
> +++ b/include/osv/mmu.hh
> @@ -161,6 +161,7 @@ error msync(const void* addr, size_t length, int
> flags);
>  error mincore(const void *addr, size_t length, unsigned char *vec);
>  bool is_linear_mapped(const void *addr, size_t size);
>  bool ismapped(const void *addr, size_t size);
> +bool ismapped_and_populated(const void *addr, size_t size);
>  bool isreadable(void *addr, size_t size);
>  std::unique_ptr<file_vma> default_file_mmap(file* file, addr_range range,
> unsigned flags, unsigned perm, off_t offset);
>  std::unique_ptr<file_vma> map_file_mmap(file* file, addr_range range,
> unsigned flags, unsigned perm, off_t offset);
> diff --git a/include/osv/sched.hh b/include/osv/sched.hh
> index 0fb44f77..13511f90 100644
> --- a/include/osv/sched.hh
> +++ b/include/osv/sched.hh
> @@ -335,6 +335,7 @@ public:
>          void* begin;
>          size_t size;
>          void (*deleter)(stack_info si);  // null: don't delete
> +        bool lazy;
>
         static void default_deleter(stack_info si);
>      };
>      struct attr {
> @@ -978,14 +979,13 @@ inline void release(mutex_t* mtx)
>      }
>  }
>
> -extern unsigned __thread preempt_counter;
>  extern bool __thread need_reschedule;
>
>  #ifdef __OSV_CORE__
>  inline unsigned int get_preempt_counter()
>  {
>      barrier();
> -    return preempt_counter;
> +    return arch::irq_preempt_counters._data.preempt;
>

It is sad how we turned the arch-agnostic "preempt_counter" into something
arch-specific.
I'm not sure how to improve that, though.

 }
>
>  inline bool preemptable()
> @@ -1005,14 +1005,17 @@ inline void preempt()
>
>  inline void preempt_disable()
>  {
> -    ++preempt_counter;
> +#if CONF_lazy_stack
> +    arch::ensure_next_stack_page();
> +#endif
> +    ++arch::irq_preempt_counters._data.preempt;
>      barrier();
>  }
>
>  inline void preempt_enable()
>  {
>      barrier();
> -    --preempt_counter;
> +    --arch::irq_preempt_counters._data.preempt;
>      if (preemptable() && need_reschedule && arch::irq_enabled()) {
>          cpu::schedule();
>      }
> diff --git a/libc/mman.cc b/libc/mman.cc
> index add8ef15..0b5a0c53 100644
> --- a/libc/mman.cc
> +++ b/libc/mman.cc
> @@ -42,12 +42,11 @@ unsigned libc_flags_to_mmap(int flags)
>          mmap_flags |= mmu::mmap_populate;
>      }
>      if (flags & MAP_STACK) {
> -        // OSv currently requires that stacks be pinned (see issue #143).
> So
> -        // if an application wants to mmap() a stack for
> pthread_attr_setstack
> -        // and did us the courtesy of telling this to ue (via MAP_STACK),
> -        // let's return the courtesy by returning pre-faulted memory.
> -        // FIXME: If issue #143 is fixed, this workaround should be
> removed.
> +#ifdef AARCH64_PORT_STUB
>          mmap_flags |= mmu::mmap_populate;
> +#else
> +        mmap_flags |= mmu::mmap_stack;
>


Why is this aarch64-specific here? Why can't aarch64 support mmap_stack
(and have it do the same as mmap_populate)?
mmap_stack can simply do nothing, but why not leave this decision to the
places where these flags are used, rather than here.

Also, below you have exactly the same choice - of using mmap_populate or
mmap_stack - and you used "CONF_lazy_stack"
to decide, so why is it different here?

By the way, above I mentioned the idea that maybe even when
!CONF_lazy_stack, you still need to support a lazy stack passed by the user
- because he mmap()ed a stack *without* MAP_STACK - so we need to pre-fault
the entire stack when we start using it. If we do this, we can simplify
everything and make MAP_STACK a no-op... The mmap() will not populate the
new area, but pthread_create() will. If we do this, MAP_STACK can be
ignored, just like it is ignored in Linux.



> +#endif
>      }
>      if (flags & MAP_SHARED) {
>          mmap_flags |= mmu::mmap_shared;
> diff --git a/libc/pthread.cc b/libc/pthread.cc
> index a9a31eaf..90a9f3be 100644
> --- a/libc/pthread.cc
> +++ b/libc/pthread.cc
> @@ -140,7 +140,12 @@ namespace pthread_private {
>              return {attr.stack_begin, attr.stack_size};
>          }
>          size_t size = attr.stack_size;
> -        void *addr = mmu::map_anon(nullptr, size, mmu::mmap_populate,
> mmu::perm_rw);
> +#if CONF_lazy_stack
> +        unsigned stack_flags = mmu::mmap_stack;
> +#else
> +        unsigned stack_flags = mmu::mmap_populate;
>

Again, if you already have the mmap_stack flag, why not just use it
unconditionally?

+#endif
> +        void *addr = mmu::map_anon(nullptr, size, stack_flags,
> mmu::perm_rw);
>          mmu::mprotect(addr, attr.guard_size, 0);
>          sched::thread::stack_info si{addr, size};
>          si.deleter = free_stack;
> --
> 2.29.2
>
