[osv-dev] [PATCH 8/8] lazy stack: new tracepoint for stack pre-faults
This last patch of the series adds a new tracepoint - mmu_vm_stack_fault - which, when enabled, allows one to see how a particular app triggers stack page faults. The tracepoint captures the faulting stack address, the thread id and the number of the page (0 being the 1st page). Please note this does not capture the 1st page of the stack (page_no 0), as that one is pre-faulted by the parent thread that creates the new one.

./scripts/run.py -e /tests/tst-pipe.so --trace=mmu_vm_stack_fault --trace-backtrace -H
./scripts/trace.py extract && ./scripts/trace.py list -bl

0x816b7040 >init0  0.002215401 mmu_vm_stack_fault  thread=32, addr=0x200ff9d0, page_no=1
  mmu::vm_fault(unsigned long, exception_frame*)
  page_fault
  ex_pf
  std_malloc(unsigned long, unsigned long)
  malloc
  operator new(unsigned long)
  do_main_thread(void*)
  std::_Function_handler::_M_invoke(std::_Any_data const&)
  __invoke_impl
  __invoke_r
  _M_invoke
  sched::thread::main()
  thread_main_c
  ...

0x816b7040 >init0  0.084799151 mmu_vm_stack_fault  thread=32, addr=0x200f8440, page_no=8
  mmu::vm_fault(unsigned long, exception_frame*)
  page_fault
  ex_pf
  memory::page_pool::l1::alloc_page()
  untracked_alloc_page
  memory::alloc_page()
  std_malloc(unsigned long, unsigned long)
  malloc
  operator new(unsigned long)
  lookup
  sys_lstat

Signed-off-by: Waldemar Kozaczuk
---
 core/mmu.cc | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/core/mmu.cc b/core/mmu.cc
index e41de215..afab7cc9 100644
--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -1413,6 +1413,9 @@ bool access_fault(vma& vma, unsigned int error_code)
 TRACEPOINT(trace_mmu_vm_fault, "addr=%p, error_code=%x", uintptr_t, unsigned int);
 TRACEPOINT(trace_mmu_vm_fault_sigsegv, "addr=%p, error_code=%x, %s", uintptr_t, unsigned int, const char*);
 TRACEPOINT(trace_mmu_vm_fault_ret, "addr=%p, error_code=%x", uintptr_t, unsigned int);
+#if CONF_lazy_stack
+TRACEPOINT(trace_mmu_vm_stack_fault, "thread=%d, addr=%p, page_no=%d", unsigned int, uintptr_t, unsigned int);
+#endif

 static void vm_sigsegv(uintptr_t addr,
                        exception_frame* ef)
 {
@@ -1438,6 +1441,14 @@ void vm_fault(uintptr_t addr, exception_frame* ef)
         trace_mmu_vm_fault_sigsegv(addr, ef->get_error(), "fast");
         return;
     }
+#if CONF_lazy_stack
+    auto stack = sched::thread::current()->get_stack_info();
+    void *v_addr = reinterpret_cast<void*>(addr);
+    if (v_addr >= stack.begin && v_addr < stack.begin + stack.size) {
+        trace_mmu_vm_stack_fault(sched::thread::current()->id(), addr,
+            ((u64)(stack.begin + stack.size - addr)) / 4096);
+    }
+#endif
     addr = align_down(addr, mmu::page_size);
     WITH_LOCK(vma_list_mutex.for_read()) {
         auto vma = find_intersecting_vma(addr);
-- 
2.34.1

-- 
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/20220831042433.140243-8-jwkozaczuk%40gmail.com.
[osv-dev] [PATCH 7/8] lazy stack: activate lazy stack in pthreads
This patch adds a new mmap flag - mmap_stack - that is used when mmapping a stack while creating a new pthread. The new flag is only acted upon when the build parameter CONF_lazy_stack is enabled.

Signed-off-by: Waldemar Kozaczuk
---
 include/osv/mmu-defs.hh | 1 +
 libc/mman.cc            | 7 +------
 libc/pthread.cc         | 7 ++++++-
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/osv/mmu-defs.hh b/include/osv/mmu-defs.hh
index 18edf441..694815f8 100644
--- a/include/osv/mmu-defs.hh
+++ b/include/osv/mmu-defs.hh
@@ -84,6 +84,7 @@ enum {
     mmap_small       = 1ul << 5,
     mmap_jvm_balloon = 1ul << 6,
     mmap_file        = 1ul << 7,
+    mmap_stack       = 1ul << 8,
 };

 enum {
diff --git a/libc/mman.cc b/libc/mman.cc
index 75a94eb0..115b0313 100644
--- a/libc/mman.cc
+++ b/libc/mman.cc
@@ -43,12 +43,7 @@ unsigned libc_flags_to_mmap(int flags)
         mmap_flags |= mmu::mmap_populate;
     }
     if (flags & MAP_STACK) {
-        // OSv currently requires that stacks be pinned (see issue #143). So
-        // if an application wants to mmap() a stack for pthread_attr_setstack
-        // and did us the courtesy of telling this to ue (via MAP_STACK),
-        // let's return the courtesy by returning pre-faulted memory.
-        // FIXME: If issue #143 is fixed, this workaround should be removed.
-        mmap_flags |= mmu::mmap_populate;
+        mmap_flags |= mmu::mmap_stack;
     }
     if (flags & MAP_SHARED) {
         mmap_flags |= mmu::mmap_shared;
diff --git a/libc/pthread.cc b/libc/pthread.cc
index cda6cf90..de5979e8 100644
--- a/libc/pthread.cc
+++ b/libc/pthread.cc
@@ -141,7 +141,12 @@ namespace pthread_private {
             return {attr.stack_begin, attr.stack_size};
         }
         size_t size = attr.stack_size;
-        void *addr = mmu::map_anon(nullptr, size, mmu::mmap_populate, mmu::perm_rw);
+#if CONF_lazy_stack
+        unsigned stack_flags = mmu::mmap_stack;
+#else
+        unsigned stack_flags = mmu::mmap_populate;
+#endif
+        void *addr = mmu::map_anon(nullptr, size, stack_flags, mmu::perm_rw);
         mmu::mprotect(addr, attr.guard_size, 0);
         sched::thread::stack_info si{addr, size};
         si.deleter = free_stack;
-- 
2.34.1

-- 
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/20220831042433.140243-7-jwkozaczuk%40gmail.com.
[osv-dev] [PATCH 5/8] lazy stack: ensure next stack page conditionally if interrupts and preemption enabled
This patch modifies the last set of call sites, where we do not know statically whether interrupts or preemption are enabled. In those cases we dynamically check that both preemption and interrupts are enabled, and only then pre-fault the stack. Most of these places are in the tracepoint and sampler implementation. This is obviously the most costly variant, but even here we are lucky that it affects performance only when tracepoints are enabled, and even then the code already saves the interrupt state using arch::irq_flag_notrace, so checking whether interrupts are enabled is pretty cheap. With the sampler, on the other hand, performance is only affected when starting or stopping the sampler, which is quite rare.

Signed-off-by: Waldemar Kozaczuk
---
 arch/aarch64/interrupt.cc |  5 +++++
 core/sampler.cc           | 31 +++++++++++++++++++++++++++----
 core/trace.cc             | 17 +++++++++++++++++
 include/osv/trace.hh      |  5 +++++
 4 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/arch/aarch64/interrupt.cc b/arch/aarch64/interrupt.cc
index e26e10ee..0b244e2e 100644
--- a/arch/aarch64/interrupt.cc
+++ b/arch/aarch64/interrupt.cc
@@ -31,6 +31,11 @@ void sgi_interrupt::send(sched::cpu* cpu)

 void sgi_interrupt::send_allbutself()
 {
+#if CONF_lazy_stack
+    if (sched::preemptable() && arch::irq_enabled()) {
+        arch::ensure_next_stack_page();
+    }
+#endif
     gic::gic->send_sgi(gic::sgi_filter::SGI_TARGET_ALL_BUT_SELF, 0, get_id());
 }

diff --git a/core/sampler.cc b/core/sampler.cc
index 9e37ba4f..b9e2b05c 100644
--- a/core/sampler.cc
+++ b/core/sampler.cc
@@ -32,11 +32,16 @@ private:
     sched::timer_base _timer;
     bool _active;

-    void rearm()
+    void arm()
     {
         _timer.set(_config.period);
     }

+    void rearm()
+    {
+        _timer.set_with_irq_disabled(_config.period);
+    }
+
 public:
     cpu_sampler()
         : _timer(*this)
@@ -54,7 +59,11 @@ public:
     {
         assert(!_active);
         _active = true;
-        rearm();
+        if (arch::irq_enabled()) {
+            arm();
+        } else {
+            rearm();
+        }
     }

     void stop()
@@ -97,7 +106,11 @@ static void start_on_current()

     if (prev_active + 1 == _n_cpus) {
         _started = true;
-        _controller.wake();
+        if (arch::irq_enabled()) {
+            _controller.wake();
+        } else {
+            _controller.wake_from_kernel_or_with_irq_disabled();
+        }
     }
 }

@@ -110,7 +123,11 @@ static void stop_on_current()
     _sampler->stop();

     if (--_active_cpus == 0) {
-        _controller.wake();
+        if (arch::irq_enabled()) {
+            _controller.wake();
+        } else {
+            _controller.wake_from_kernel_or_with_irq_disabled();
+        }
     }
 }

@@ -170,6 +187,12 @@ void stop_sampler() throw()

     WITH_LOCK(migration_lock) {
         stop_sampler_ipi.send_allbutself();
+#if CONF_lazy_stack_invariant
+        assert(arch::irq_enabled());
+#endif
+#if CONF_lazy_stack
+        sched::ensure_next_stack_page_if_preemptable();
+#endif
         stop_on_current();
     }

diff --git a/core/trace.cc b/core/trace.cc
index dc69c807..c9ed2bab 100644
--- a/core/trace.cc
+++ b/core/trace.cc
@@ -247,6 +247,13 @@ void tracepoint_base::update()
     WITH_LOCK(trace_control_lock) {
         bool empty;

+#if CONF_lazy_stack_invariant
+        assert(arch::irq_enabled());
+        assert(sched::preemptable());
+#endif
+#if CONF_lazy_stack
+        arch::ensure_next_stack_page();
+#endif
         WITH_LOCK(osv::rcu_read_lock) {
             auto& probes = *probes_ptr.read();

@@ -376,6 +383,11 @@ extern "C" void __cyg_profile_func_enter(void *this_fn, void *call_site)
     }
     arch::irq_flag_notrace irq;
     irq.save();
+#if CONF_lazy_stack
+    if (sched::preemptable() && irq.enabled()) {
+        arch::ensure_next_stack_page();
+    }
+#endif
     arch::irq_disable_notrace();
     if (func_trace_nesting++ == 0) {
         trace_function_entry(this_fn, call_site);
@@ -391,6 +403,11 @@ extern "C" void __cyg_profile_func_exit(void *this_fn, void *call_site)
     }
     arch::irq_flag_notrace irq;
     irq.save();
+#if CONF_lazy_stack
+    if (sched::preemptable() && irq.enabled()) {
+        arch::ensure_next_stack_page();
+    }
+#endif
     arch::irq_disable_notrace();
     if (func_trace_nesting++ == 0) {
         trace_function_exit(this_fn, call_site);
diff --git a/include/osv/trace.hh b/include/osv/trace.hh
index d735575c..01d72022 100644
--- a/include/osv/trace.hh
+++ b/include/osv/trace.hh
@@ -348,6 +348,11 @@ public:
         if (active) {
             arch::irq_flag_notrace irq;
             irq.save();
+#if CONF_lazy_stack
+            if (sched::preemptable() && irq.enabled()) {
+                arch::ensure_next_stack_page();
+            }
+#endif
             arch::irq_disable_notrace();
             log(as);
             run_probes();
-- 
2.34.1
[osv-dev] [PATCH 6/8] lazy stack: prevent deadlock when taking vma_list_mutex for write
This patch makes all functions in core/mmu.cc that take vma_list_mutex for write pre-fault the stack two pages deep before taking the mutex. This is necessary to prevent any follow-up stack faults down the call stack after vma_list_mutex is taken for write, as this would lead to a deadlock - experienced when testing one of the apps - where the page fault handler and the mmu::vm_fault() function would try to take the same vma_list_mutex for read.

Signed-off-by: Waldemar Kozaczuk
---
 core/mmu.cc | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/core/mmu.cc b/core/mmu.cc
index 007d4331..e41de215 100644
--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -47,6 +47,17 @@ extern const char text_start[], text_end[];

 namespace mmu {

+#if CONF_lazy_stack
+// We need to ensure that the lazy stack is populated deeply enough (2 pages)
+// for all the cases when the vma_list_mutex is taken for write, to prevent
+// page faults triggered on the stack. The page-fault handling logic would
+// attempt to take the same vma_list_mutex for read and end up with a deadlock.
+#define PREVENT_STACK_PAGE_FAULT \
+    arch::ensure_next_two_stack_pages();
+#else
+#define PREVENT_STACK_PAGE_FAULT
+#endif
+
 struct vma_range_compare {
     bool operator()(const vma_range& a, const vma_range& b) {
         return a.start() < b.start();
@@ -1271,6 +1282,7 @@ static void nohugepage(void* addr, size_t length)

 error advise(void* addr, size_t size, int advice)
 {
+    PREVENT_STACK_PAGE_FAULT
     WITH_LOCK(vma_list_mutex.for_write()) {
         if (!ismapped(addr, size)) {
             return make_error(ENOMEM);
@@ -1310,6 +1322,7 @@ void* map_anon(const void* addr, size_t size, unsigned flags, unsigned perm)
     size = align_up(size, mmu::page_size);
     auto start = reinterpret_cast<uintptr_t>(addr);
     auto* vma = new mmu::anon_vma(addr_range(start, start + size), perm, flags);
+    PREVENT_STACK_PAGE_FAULT
     SCOPE_LOCK(vma_list_mutex.for_write());
     auto v = (void*) allocate(vma, start, size, search);
     if (flags & mmap_populate) {
@@ -1336,6 +1349,7 @@ void* map_file(const void* addr, size_t size, unsigned flags, unsigned perm,
     auto start = reinterpret_cast<uintptr_t>(addr);
     auto *vma = f->mmap(addr_range(start, start + size), flags | mmap_file, perm, offset).release();
     void *v;
+    PREVENT_STACK_PAGE_FAULT
     WITH_LOCK(vma_list_mutex.for_write()) {
         v = (void*) allocate(vma, start, size, search);
         if (flags & mmap_populate) {
@@ -1708,6 +1722,7 @@ ulong map_jvm(unsigned char* jvm_addr, size_t size, size_t align, balloon_ptr b)

     auto* vma = new mmu::jvm_balloon_vma(jvm_addr, start, start + size, b, v->perm(), v->flags());
+    PREVENT_STACK_PAGE_FAULT
     WITH_LOCK(vma_list_mutex.for_write()) {
         // This means that the mapping that we had before was a balloon mapping
         // that was laying around and wasn't updated to an anon mapping.
         // If we
@@ -2014,6 +2029,7 @@ void free_initial_memory_range(uintptr_t addr, size_t size)

 error mprotect(const void *addr, size_t len, unsigned perm)
 {
+    PREVENT_STACK_PAGE_FAULT
     SCOPE_LOCK(vma_list_mutex.for_write());

     if (!ismapped(addr, len)) {
@@ -2025,6 +2041,7 @@ error mprotect(const void *addr, size_t len, unsigned perm)

 error munmap(const void *addr, size_t length)
 {
+    PREVENT_STACK_PAGE_FAULT
     SCOPE_LOCK(vma_list_mutex.for_write());

     length = align_up(length, mmu::page_size);
-- 
2.34.1

-- 
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/20220831042433.140243-6-jwkozaczuk%40gmail.com.
[osv-dev] [PATCH 4/8] lazy stack: ensure next stack page dynamically if preemption enabled
This patch modifies a number of critical places, mostly in the scheduler code, to dynamically pre-fault the stack if preemption is enabled. In all these places we can be statically sure that interrupts are enabled, but not so sure about preemption (maybe in the future we can prove it is enabled in at least some of the cases, and replace the conditional ensure_next_stack_page_if_preemptable() with the cheaper single-instruction ensure_next_stack_page()).

The three call sites before irq_lock is taken (interrupts are disabled) are:
- cpu::schedule() before WITH_LOCK(irq_lock)
- thread::pin() before WITH_LOCK(irq_lock)
- thread::yield() before guard*(irq_lock)

The reasoning goes like this: these methods were designed with the intention of being used when interrupts are enabled, because otherwise the act of using the WITH_LOCK or guard construct would imply that interrupts get re-enabled after the block - the code does not care about restoring interrupts to the proper state they were in before. If restoring the previous state were the intention, these methods would use irq_save_lock_type.

Two other call sites in the scheduler are:
- the timer_base destructor, before calling timer_base::cancel() which disables interrupts
- thread::wake_with_from_mutex(), only called from waiter::wake()

These follow the reasoning that interrupts are enabled most of the time, except in a few places in the scheduler and the interrupt handler. We also assume the mutex is not used when interrupts are disabled, and based on my analysis of all the code that disables/enables interrupts, that seems to be the case. In any case, we put invariants in these two places to verify that interrupts are indeed enabled.

The last call site, in abort(), assumes that calling irq_disable() implies interrupts were enabled before, so we only need to check the preemption status.
Signed-off-by: Waldemar Kozaczuk
---
 core/sched.cc        | 24 ++++++++++++++++++++++++
 include/osv/sched.hh | 17 +++++++++++++++++
 runtime.cc           |  3 +++
 3 files changed, 44 insertions(+)

diff --git a/core/sched.cc b/core/sched.cc
index 1e109694..65842ff3 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -224,6 +224,9 @@ void thread::cputime_estimator_get(
 // scheduler on a different CPU would be disastrous.
 void cpu::schedule()
 {
+#if CONF_lazy_stack
+    sched::ensure_next_stack_page_if_preemptable();
+#endif
     WITH_LOCK(irq_lock) {
 #ifdef __aarch64__
         reschedule_from_interrupt(sched::cpu::current(), false, thyst);
@@ -566,6 +569,9 @@ void thread::pin(cpu *target_cpu)
             t.wake();
         }, sched::thread::attr().pin(source_cpu)));
     wakeme->start();
+#if CONF_lazy_stack
+    sched::ensure_next_stack_page_if_preemptable();
+#endif
     WITH_LOCK(irq_lock) {
         trace_sched_migrate(&t, target_cpu->id);
         t.stat_migrations.incr();
@@ -822,6 +828,12 @@ void thread::yield(thread_runtime::duration preempt_after)
 {
     trace_sched_yield();
     auto t = current();
+#if CONF_lazy_stack_invariant
+    assert(arch::irq_enabled());
+#endif
+#if CONF_lazy_stack
+    sched::ensure_next_stack_page_if_preemptable();
+#endif
     std::lock_guard guard(irq_lock);
     // FIXME: drive by IPI
     cpu::current()->handle_incoming_wakeups();
@@ -1258,6 +1270,12 @@ void thread::wake_impl(detached_state* st, unsigned allowed_initial_states_mask)

 void thread::wake()
 {
+#if CONF_lazy_stack_invariant
+    assert(arch::irq_enabled());
+#endif
+#if CONF_lazy_stack
+    sched::ensure_next_stack_page_if_preemptable();
+#endif
     WITH_LOCK(rcu_read_lock) {
         wake_impl(_detached_state.get());
     }
@@ -1604,6 +1622,12 @@ timer_base::timer_base(timer_base::client& t)

 timer_base::~timer_base()
 {
+#if CONF_lazy_stack_invariant
+    assert(arch::irq_enabled());
+#endif
+#if CONF_lazy_stack
+    sched::ensure_next_stack_page_if_preemptable();
+#endif
     cancel();
 }

diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index 1691bed8..8a2694cb 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -1043,6 +1043,15 @@ inline bool
 preemptable()
 {
     return !get_preempt_counter();
 }

+#if CONF_lazy_stack
+inline void ensure_next_stack_page_if_preemptable() {
+    if (!preemptable()) {
+        return;
+    }
+    arch::ensure_next_stack_page();
+}
+#endif
+
 inline void preempt()
 {
     if (preemptable()) {
@@ -1350,6 +1359,14 @@ template <class Action>
 inline
 void thread::wake_with_from_mutex(Action action)
 {
+#if CONF_lazy_stack_invariant
+    assert(arch::irq_enabled());
+#endif
+#if CONF_lazy_stack
+    if (preemptable()) {
+        arch::ensure_next_stack_page();
+    }
+#endif
     return do_wake_with(action, (1 << unsigned(status::waiting))
                               | (1 << unsigned(status::sending_lock)));
 }

diff --git a/runtime.cc b/runtime.cc
index 521b5c24..5f67e79b 100644
--- a/runtime.cc
+++ b/runtime.cc
@@ -113,6 +113,9 @@ void abort(const char *fmt, ...)
         do {} while (true);
     }
+#if CONF_lazy_stack
+    sched::ensure_next_stack_page_if_preemptable();
+#endif
[osv-dev] [PATCH 3/8] lazy stack: ensure next stack page statically
This patch modifies all relevant places where we statically know that both interrupts and preemption should be enabled, to unconditionally pre-fault the stack. These include places in code before:
- a WITH_LOCK(preemption_lock) block
- a WITH_LOCK(rcu_read_lock) block
- a sched::preempt_disable() call
- a WITH_LOCK(irq_lock) block, in one case

In general, these are the places that follow the assumption that most of the time preemption and interrupts are enabled. The functions/methods below are called directly or indirectly by an application, and there is no other kernel code in that application call stack above them that also disables interrupts or preemption. One good example is the memory allocation code in mempool.cc, which disables preemption in quite a few places. Many of those use the WITH_LOCK/DROP_LOCK(preempt_lock) combination, which implies they were intended to be called with preemption enabled - otherwise the code in DROP_LOCK would not work. Also, all of those end up calling some form of the thread::wait() method, which asserts that both interrupts and preemption are enabled. To validate the reasoning, we add the relevant invariants before we pre-fault the stack.
Signed-off-by: Waldemar Kozaczuk
---
 bsd/porting/uma_stub.cc     | 12 ++++++
 bsd/sys/net/routecache.hh   |  6 ++++
 bsd/sys/netinet/arpcache.hh |  7 ++++
 core/async.cc               | 18 +++++++++
 core/condvar.cc             |  7 ++++
 core/elf.cc                 | 12 ++++++
 core/mempool.cc             | 50 ++++++++++++++++++++++-
 core/rcu.cc                 |  6 ++++
 core/sched.cc               | 52 ++++++++++++++++++++++++
 fs/vfs/kern_descrip.cc      |  6 ++++
 include/osv/elf.hh          |  7 ++++
 include/osv/percpu_xmit.hh  | 13 ++++++
 include/osv/sched.hh        |  7 ++++
 13 files changed, 202 insertions(+), 1 deletion(-)

diff --git a/bsd/porting/uma_stub.cc b/bsd/porting/uma_stub.cc
index cf320b13..a9d689c3 100644
--- a/bsd/porting/uma_stub.cc
+++ b/bsd/porting/uma_stub.cc
@@ -34,6 +34,12 @@ void * uma_zalloc_arg(uma_zone_t zone, void *udata, int flags)
 {
     void * ptr;

+#if CONF_lazy_stack_invariant
+    assert(sched::preemptable() && arch::irq_enabled());
+#endif
+#if CONF_lazy_stack
+    arch::ensure_next_stack_page();
+#endif
     WITH_LOCK(preempt_lock) {
         ptr = (*zone->percpu_cache)->alloc();
     }
@@ -101,6 +107,12 @@ void uma_zfree_arg(uma_zone_t zone, void *item, void *udata)
         zone->uz_dtor(item, zone->uz_size, udata);
     }

+#if CONF_lazy_stack_invariant
+    assert(sched::preemptable() && arch::irq_enabled());
+#endif
+#if CONF_lazy_stack
+    arch::ensure_next_stack_page();
+#endif
     WITH_LOCK(preempt_lock) {
         if ((*zone->percpu_cache)->free(item)) {
             return;
diff --git a/bsd/sys/net/routecache.hh b/bsd/sys/net/routecache.hh
index bdcdf496..e2140f06 100644
--- a/bsd/sys/net/routecache.hh
+++ b/bsd/sys/net/routecache.hh
@@ -162,6 +162,12 @@ public:
         // route.cc).
         assert(fibnum == 0);

+#if CONF_lazy_stack_invariant
+        assert(sched::preemptable() && arch::irq_enabled());
+#endif
+#if CONF_lazy_stack
+        arch::ensure_next_stack_page();
+#endif
         WITH_LOCK(osv::rcu_read_lock) {
             auto *c = cache.read();
             auto entry = c->search(dst->sin_addr.s_addr);
diff --git a/bsd/sys/netinet/arpcache.hh b/bsd/sys/netinet/arpcache.hh
index 4556d40e..289a2f29 100644
--- a/bsd/sys/netinet/arpcache.hh
+++ b/bsd/sys/netinet/arpcache.hh
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -98,6 +99,12 @@ struct arp_cache {

     boost::optional lookup(const in_addr ip)
     {
+#if CONF_lazy_stack_invariant
+        assert(sched::preemptable() && arch::irq_enabled());
+#endif
+#if CONF_lazy_stack
+        arch::ensure_next_stack_page();
+#endif
         WITH_LOCK(osv::rcu_read_lock) {
             auto i = _entries.reader_find(ip, std::hash(), entry_compare());
             return boost::optional(!!i, *i);
diff --git a/core/async.cc b/core/async.cc
index 9f3d61d2..ada19fba 100644
--- a/core/async.cc
+++ b/core/async.cc
@@ -104,6 +104,12 @@ public:

     void insert(percpu_timer_task& task)
     {
+#if CONF_lazy_stack_invariant
+        assert(arch::irq_enabled() && sched::preemptable());
+#endif
+#if CONF_lazy_stack
+        arch::ensure_next_stack_page();
+#endif
         WITH_LOCK(preempt_lock) {
             trace_async_timer_task_insert(this, &task);
@@ -125,6 +131,12 @@ public:

     percpu_timer_task& borrow_task()
     {
+#if CONF_lazy_stack_invariant
+        assert(arch::irq_enabled() && sched::preemptable());
+#endif
+#if CONF_lazy_stack
+        arch::ensure_next_stack_page();
+#endif
         WITH_LOCK(preempt_lock) {
             auto task = released_timer_tasks.pop();
             if (task) {
@@ -142,6 +154,12 @@ public:
     {
         auto task = new one_shot_task(std::move(callback));

+#if CONF_lazy_stack_invariant
+        assert(arch::irq_enabled() &&
[osv-dev] [PATCH 2/8] lazy stack: do nothing in kernel threads and on populated stack
This patch annotates the relevant call sites with invariant assert expressions to validate the assumptions that let us do "nothing" in all these cases. We also reorganize some code in the scheduler to help differentiate between the cases when a given function/method is called with interrupts or preemption disabled, from a kernel thread, or by an interrupt handler.

The following methods get added to the scheduler code, with names describing the state of interrupts or preemption, or a kernel caller:
- timer_base::set_with_irq_disabled(osv::clock::uptime::time_point time)
- timer_base::set_with_irq_disabled(std::chrono::duration duration)
- thread::wake_with_irq_disabled()
- thread::wake_with_irq_or_preemption_disabled(Action action)
- thread_handle::wake_from_kernel_or_with_irq_disabled()

In general:
- we modify all interrupt handlers (those that are executed on the interrupt stack) to call one of the 3 new wake_...() methods (mostly wake_with_irq_disabled()) to indicate that we do not need to/should not pre-fault the stack; most of those are in device driver code
- we modify all code executed on kernel threads that disables preemption or interrupts by adding the relevant invariant - assert(!sched::thread::current()->is_app()); we do not need to pre-fault because the stack is populated
- we also modify the code which is indirectly called from kernel threads, like classifier::post_packet() in net channels
- finally, we also modify the scheduler code to use timer_base::set_with_irq_disabled(), mostly around preemption_timer, to indicate that we should not pre-fault the stack downstream

Signed-off-by: Waldemar Kozaczuk
---
 arch/aarch64/exceptions.cc |  3 ++
 arch/aarch64/interrupt.cc  |  3 ++
 arch/x64/exceptions.cc     |  3 ++
 arch/x64/mmu.cc            |  2 +-
 arch/x64/msi.cc            |  2 +-
 core/async.cc              |  6 +++
 core/epoll.cc              |  2 +-
 core/mempool.cc            | 12 ++++--
 core/net_channel.cc        |  8 +++-
 core/rcu.cc                |  5 ++-
 core/sched.cc              | 79 ++++++++++++++++++++++++++++++++++++++
 drivers/acpi.cc            |  2 +-
 drivers/ahci.cc            |  2 +-
 drivers/ahci.hh            |  2 +-
 drivers/cadence-uart.cc    |  2 +-
 drivers/isa-serial.cc      |  2 +-
 drivers/kbd.cc             |  2 +-
 drivers/mmio-isa-serial.cc |  2 +-
 drivers/pl011.cc           |  2 +-
 drivers/virtio-blk.cc      |  6 +--
 drivers/virtio-fs.cc       |  4 +-
 drivers/virtio-net.cc      |  6 +--
 drivers/virtio-rng.cc      |  2 +-
 drivers/virtio-scsi.cc     |  2 +-
 drivers/virtio-vring.hh    |  2 +-
 drivers/vmw-pvscsi.cc      |  2 +-
 drivers/xenconsole.cc      |  2 +-
 include/osv/net_channel.hh |  5 ++-
 include/osv/sched.hh       | 16 ++++++++
 include/osv/xen_intr.hh    |  2 +-
 libc/signal.cc             |  3 ++
 libc/timerfd.cc            |  3 ++
 32 files changed, 157 insertions(+), 39 deletions(-)

diff --git a/arch/aarch64/exceptions.cc b/arch/aarch64/exceptions.cc
index cadbb3a2..5c2c59ab 100644
--- a/arch/aarch64/exceptions.cc
+++ b/arch/aarch64/exceptions.cc
@@ -122,6 +122,9 @@ void interrupt_table::unregister_interrupt(interrupt *interrupt)

 bool interrupt_table::invoke_interrupt(unsigned int id)
 {
+#if CONF_lazy_stack_invariant
+    assert(!arch::irq_enabled());
+#endif
     WITH_LOCK(osv::rcu_read_lock) {
         assert(id < this->nr_irqs);
         interrupt_desc *desc = this->irq_desc[id].read();
diff --git a/arch/aarch64/interrupt.cc b/arch/aarch64/interrupt.cc
index b8337e23..e26e10ee 100644
--- a/arch/aarch64/interrupt.cc
+++ b/arch/aarch64/interrupt.cc
@@ -22,6 +22,9 @@ sgi_interrupt::~sgi_interrupt()

 void sgi_interrupt::send(sched::cpu* cpu)
 {
+#if CONF_lazy_stack_invariant
+    assert(!arch::irq_enabled() || !sched::preemptable());
+#endif
     gic::gic->send_sgi(gic::sgi_filter::SGI_TARGET_LIST,
                        cpu->arch.smp_idx, get_id());
 }
diff --git a/arch/x64/exceptions.cc b/arch/x64/exceptions.cc
index 7c9eaf51..fbf6be65 100644
--- a/arch/x64/exceptions.cc
+++ b/arch/x64/exceptions.cc
@@ -220,6 +220,9 @@ void interrupt_descriptor_table::unregister_interrupt(gsi_level_interrupt *inter

 void interrupt_descriptor_table::invoke_interrupt(unsigned vector)
 {
+#if CONF_lazy_stack_invariant
+    assert(!arch::irq_enabled());
+#endif
     WITH_LOCK(osv::rcu_read_lock) {
         unsigned i, nr_shared;
         bool handled = false;
diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
index 1af268c0..675410d0 100644
---
a/arch/x64/mmu.cc
+++ b/arch/x64/mmu.cc
@@ -64,7 +64,7 @@ std::atomic tlb_flush_pendingconfirms;
 inter_processor_interrupt tlb_flush_ipi{IPI_TLB_FLUSH, [] {
     mmu::flush_tlb_local();
     if (tlb_flush_pendingconfirms.fetch_add(-1) == 1) {
-        tlb_flush_waiter.wake();
+        tlb_flush_waiter.wake_from_kernel_or_with_irq_disabled();
     }
 }};

diff --git a/arch/x64/msi.cc b/arch/x64/msi.cc
index 9a28e3a5..cf0c3dc5 100644
--- a/arch/x64/msi.cc
+++ b/arch/x64/msi.cc
@@ -126,7 +126,7 @@ static
[osv-dev] [PATCH 1/8] lazy stack: inline assembly to pre-fault stack
This is the 1st of a total of eight patches that implement optional support for the so-called "lazy stack" feature. The lazy stack is well explained by issue #143; it allows saving a substantial amount of memory when an application spawns many pthreads with large stacks, by letting each stack grow dynamically as needed instead of getting pre-populated ahead of time.

The crux of this solution, like the previous versions, is based on the observation that the OSv memory fault handler requires that both interrupts and preemption be enabled when a fault is triggered. Therefore, if the stack is dynamically mapped, we need to make sure that stack page faults NEVER happen in the relatively few places of kernel code that execute with either interrupts or preemption disabled. We satisfy this requirement by "pre-faulting" the stack - reading a byte one page (4096 bytes) down from the stack pointer - just before preemption or interrupts are disabled.

Now, the problem is complicated by the fact that kernel code A that disables preemption or interrupts may nest, by calling another kernel function B that also disables preemption or interrupts, in which case function B should NOT try to pre-fault the stack - otherwise the fault handler would abort due to the violated constraint stated above. In short, we cannot "blindly" or unconditionally pre-fault the stack in all places before interrupts or preemption are disabled.

Some of the previous solutions modified both arch::irq_disable() and sched::preempt_disable() to check if both preemption and interrupts are enabled and only then read a byte at the -4096 offset. Unfortunately, this makes it more costly than envisioned by Nadav Har'El - instead of a single instruction to read from memory, the compiler needs 4-5 instructions to read the data only if preemption and interrupts are enabled and perform the relevant jump. To make it worse, the implementation of arch::irq_enabled() is pretty expensive, at least on x64, where it uses the stack with pushfq.
To avoid that, the previous solutions would add a new thread-local counter and pack the irq-disabling counter together with the preemption one. But even with this optimization, I found this approach to deteriorate performance quite substantially. For example, the memory allocation logic disables preemption in quite a few places (see core/mempool.cc), and the corresponding test - misc-free-perf.cc - would show performance - the number of malloc()/free() calls executed per second - degrade on average by 15-20%.

So this latest version, implemented by this and the next 7 patches, takes a different approach. Instead of putting the conditional pre-faulting of the stack in both irq_disable() and preempt_disable(), we analyze the OSv code to find all places where irq_disable() and/or preempt_disable() is called directly (or sometimes indirectly) and pre-fault the stack there only if necessary. This is obviously way more laborious and prone to human error (we can miss some places), but it is way more performant (no noticeable performance degradation) compared to the earlier versions described in the paragraph above.

As we analyze all call sites, we need to make some observations to help us decide what exactly to do in each case:
- do nothing
- blindly pre-fault the stack (single instruction)
- conditionally pre-fault the stack (hopefully in very few places)

Rule 1: Do nothing if the call site in question ALWAYS executes in a kernel thread.
Rule 2: Do nothing if the call site executes on a populated stack - this includes the above, but also code executing on an interrupt, exception or syscall stack.
Rule 3: Do nothing if the call site executes when we know that either interrupts or preemption are disabled. Good examples are an interrupt handler, or code within WITH_LOCK(irq_lock) or WITH_LOCK(preemption_lock) blocks.
Rule 4: Pre-fault unconditionally if we know that BOTH preemption and interrupts are enabled. These cases can mostly only be deduced by analyzing where the particular function is called.
In general, any such function called by user code, like libc, satisfies the condition. But sometimes it is tricky, because the kernel might be calling libc functions such as malloc().

Rule 5: Otherwise, pre-fault the stack by determining dynamically: only if sched::preemptable() and irq::enabled().

One general rule is that all potential stack page faults happen on an application thread stack when some kernel code gets executed down the call stack. We identify the call sites in the following categories:
- direct calls to arch::irq_disable() and arch::irq_disable_notrace() (tracepoints)
- direct calls to sched::preempt_disable()
- code using WITH_LOCK() with an instance of irq_lock_type or irq_save_lock_type
- code using WITH_LOCK(preempt_lock)
- code using WITH_LOCK(osv::rcu_read_lock)

The above locations can be found with simple grep, but also with an IDE like CLion from JetBrains that can help more efficiently find all direct, but also more importantly indirect, usages of the
Re: [osv-dev] [COMMIT osv master] trace: do not use malloc/free when interrupts are disabled
On Tue, Aug 30, 2022 at 6:59 PM Commit Bot wrote:
> From: Waldemar Kozaczuk
> Committer: Waldemar Kozaczuk
> Branch: master
>
> trace: do not use malloc/free when interrupts are disabled
>
> This patch fixes a subtle bug found when working on lazy stack changes
> and trying to establish some invariants in relevant places in the kernel
> code. One of the findings was that the memory allocation logic, or more
> specifically the functions malloc() and free(), requires both interrupts
> and preemption to be enabled to function correctly.
>
> If one starts the analysis with memory::pool::alloc(), he/she will notice
> a DROP_LOCK(preempt_lock) executed under WITH_LOCK(preempt) if
> the pool is empty. This means that the code assumes preemption must
> be enabled before calling alloc(); otherwise the code in DROP_LOCK would
> run with preemption disabled, which is contrary to the intention.
> Also, after drilling down the add_page() calling path, one eventually
> finds the call to reclaimers::wait(), which calls thread::wait_until(). The
> thread::do_wait_until() called by the former enforces with an assertion
> that both interrupts and preemption need to be enabled.
>
> The bottom line of the analysis above is that both interrupts and
> preemption must be enabled before calling malloc() (the same applies to
> free()). This effectively means that trying to call malloc() after
> disabling interrupts might not work (which would only happen if the
> relevant pool was out of free pages AND the l1 pool was empty and had to
> refill from the l2 pool, which is probably quite rare).
>
> So this patch changes the code in trace::create_trace_dump() to avoid
> calling new/malloc() (see copies.emplace_back()) and instead makes it
> preallocate the vector ahead of time and then simply copy the trace
> buffer to the relevant spot in the vector.
>
> Signed-off-by: Waldemar Kozaczuk
>
> ---
> diff --git a/core/trace.cc b/core/trace.cc
> --- a/core/trace.cc
> +++ b/core/trace.cc
> @@ -697,7 +697,7 @@ std::string
>  trace::create_trace_dump()
>  {
>      semaphore signal(0);
> -    std::vector<trace_buf> copies;
> +    std::vector<trace_buf> copies(sched::cpus.size());
>
>      auto is_valid_tracepoint = [](const tracepoint_base * tp_test) {
>          for (auto & tp : tracepoint_base::tp_list) {
> @@ -717,7 +717,7 @@ trace::create_trace_dump()
>              irq.save();
>              arch::irq_disable_notrace();
>              auto * tbp = percpu_trace_buffer.for_cpu(cpu);
> -            copies.emplace_back(*tbp);
> +            copies[i] = trace_buf(*tbp);

This change is good, but I think you didn't really have to change the
emplace_back() call - you could have kept the emplace_back() and, instead
of the constructor in the beginning, used copies.reserve(sched::cpus.size()).

>              irq.restore();
>              signal.post();
>          }, sched::thread::attr().pin(cpu)));
>
> --
> You received this message because you are subscribed to the Google Groups
> "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to osv-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/osv-dev/0da02b05e708%40google.com.
Re: [osv-dev] [COMMIT osv master] aarch64 trace: make compiler pick a register instead of x0
What a difference one character makes :-)

--
Nadav Har'El
n...@scylladb.com

On Tue, Aug 30, 2022 at 6:59 PM Commit Bot wrote:
> From: Waldemar Kozaczuk
> Committer: Waldemar Kozaczuk
> Branch: master
>
> aarch64 trace: make compiler pick a register instead of x0
>
> This subtle 1-character patch fixes a nasty bug that causes interrupts
> to be enabled instead of correctly restored to the state they were in
> when the state was saved. This bug would affect the tracing logic and
> result in crashes described by issues #1158 and #1195.
>
> This bug in inline assembly was most likely a typo, as I am sure
> the intention was to use '%0' instead of 'x0' to let the compiler
> correctly pick a register instead of using whatever garbage is in the
> x0 register.
>
> Fixes #1158
> Fixes #1195
>
> Signed-off-by: Waldemar Kozaczuk
>
> ---
> diff --git a/arch/aarch64/arch.hh b/arch/aarch64/arch.hh
> --- a/arch/aarch64/arch.hh
> +++ b/arch/aarch64/arch.hh
> @@ -79,7 +79,7 @@ inline void irq_flag_notrace::save() {
>  }
>
>  inline void irq_flag_notrace::restore() {
> -    asm volatile("msr daif, x0" :: "r"(daif) : "memory");
> +    asm volatile("msr daif, %0" :: "r"(daif) : "memory");
>  }
>
>  inline bool irq_flag_notrace::enabled() const {
[osv-dev] [COMMIT osv master] aarch64: download boost program-options library needed by some tests
From: Waldemar Kozaczuk
Committer: Waldemar Kozaczuk
Branch: master

aarch64: download boost program-options library needed by some tests

Signed-off-by: Waldemar Kozaczuk

---
diff --git a/scripts/download_aarch64_packages.py b/scripts/download_aarch64_packages.py
--- a/scripts/download_aarch64_packages.py
+++ b/scripts/download_aarch64_packages.py
@@ -34,6 +34,7 @@ def fedora_download_commands(fedora_version):
                 'boost-filesystem',
                 'boost-test',
                 'boost-chrono',
+                'boost-program-options',
                 'boost-timer']
     script_path = '%s/scripts/download_fedora_aarch64_rpm_package.sh' % osv_root
@@ -55,6 +56,8 @@ def ubuntu_download_commands(boost_long_version):
                 'libboost-test%s-dev' % boost_short_version,
                 'libboost-timer%s' % boost_short_version,
                 'libboost-timer%s-dev' % boost_short_version,
+                'libboost-program-options%s' % boost_short_version,
+                'libboost-program-options%s-dev' % boost_short_version,
                 'libboost-chrono%s' % boost_short_version,
                 'libboost-chrono%s-dev' % boost_short_version]
[osv-dev] [COMMIT osv master] trace: do not use malloc/free when interrupts are disabled
From: Waldemar Kozaczuk
Committer: Waldemar Kozaczuk
Branch: master

trace: do not use malloc/free when interrupts are disabled

This patch fixes a subtle bug found when working on lazy stack changes
and trying to establish some invariants in relevant places in the kernel
code. One of the findings was that the memory allocation logic, or more
specifically the functions malloc() and free(), requires both interrupts
and preemption to be enabled to function correctly.

If one starts the analysis with memory::pool::alloc(), he/she will notice
a DROP_LOCK(preempt_lock) executed under WITH_LOCK(preempt) if the pool is
empty. This means that the code assumes preemption must be enabled before
calling alloc(); otherwise the code in DROP_LOCK would run with preemption
disabled, which is contrary to the intention.
Also, after drilling down the add_page() calling path, one eventually finds
the call to reclaimers::wait(), which calls thread::wait_until(). The
thread::do_wait_until() called by the former enforces with an assertion
that both interrupts and preemption need to be enabled.

The bottom line of the analysis above is that both interrupts and
preemption must be enabled before calling malloc() (the same applies to
free()). This effectively means that trying to call malloc() after
disabling interrupts might not work (which would only happen if the
relevant pool was out of free pages AND the l1 pool was empty and had to
refill from the l2 pool, which is probably quite rare).

So this patch changes the code in trace::create_trace_dump() to avoid
calling new/malloc() (see copies.emplace_back()) and instead makes it
preallocate the vector ahead of time and then simply copy the trace buffer
to the relevant spot in the vector.

Signed-off-by: Waldemar Kozaczuk

---
diff --git a/core/trace.cc b/core/trace.cc
--- a/core/trace.cc
+++ b/core/trace.cc
@@ -697,7 +697,7 @@ std::string
 trace::create_trace_dump()
 {
     semaphore signal(0);
-    std::vector<trace_buf> copies;
+    std::vector<trace_buf> copies(sched::cpus.size());

     auto is_valid_tracepoint = [](const tracepoint_base * tp_test) {
         for (auto & tp : tracepoint_base::tp_list) {
@@ -717,7 +717,7 @@ trace::create_trace_dump()
             irq.save();
             arch::irq_disable_notrace();
             auto * tbp = percpu_trace_buffer.for_cpu(cpu);
-            copies.emplace_back(*tbp);
+            copies[i] = trace_buf(*tbp);
             irq.restore();
             signal.post();
         }, sched::thread::attr().pin(cpu)));
[osv-dev] [COMMIT osv master] aarch64 trace: make compiler pick a register instead of x0
From: Waldemar Kozaczuk
Committer: Waldemar Kozaczuk
Branch: master

aarch64 trace: make compiler pick a register instead of x0

This subtle 1-character patch fixes a nasty bug that causes interrupts
to be enabled instead of correctly restored to the state they were in
when the state was saved. This bug would affect the tracing logic and
result in crashes described by issues #1158 and #1195.

This bug in inline assembly was most likely a typo, as I am sure
the intention was to use '%0' instead of 'x0' to let the compiler
correctly pick a register instead of using whatever garbage is in the x0
register.

Fixes #1158
Fixes #1195

Signed-off-by: Waldemar Kozaczuk

---
diff --git a/arch/aarch64/arch.hh b/arch/aarch64/arch.hh
--- a/arch/aarch64/arch.hh
+++ b/arch/aarch64/arch.hh
@@ -79,7 +79,7 @@ inline void irq_flag_notrace::save() {
 }

 inline void irq_flag_notrace::restore() {
-    asm volatile("msr daif, x0" :: "r"(daif) : "memory");
+    asm volatile("msr daif, %0" :: "r"(daif) : "memory");
 }

 inline bool irq_flag_notrace::enabled() const {