It does for me. Are you using the latest version of this script from
master? I think there were couple of things I changed there recently as
part of wide upgrade to Python 3.

On Sun, Mar 22, 2020 at 20:27 Rick Payne <ri...@rossfell.co.uk> wrote:

>
> Does that command work for you? For me I get:
>
> (gdb) osv heap
> Python Exception <class 'TypeError'> %x format: an integer is required,
> not gdb.Value:
> Error occurred in Python command: %x format: an integer is required,
> not gdb.Value
>
> Rick
>
> On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote:
> > Can you run ‘osv heap’ from gdb at this point?
> >
> > On Sun, Mar 22, 2020 at 19:31 Rick Payne <ri...@rossfell.co.uk>
> > wrote:
> > > Ok, so i applied the patch, and the printf and this is what I see:
> > >
> > > page_range_allocator: no ranges found for size 39849984 and exact
> > > order: 14
> > > Waiter: 1_scheduler, bytes: 39849984
> > > Out of memory: could not reclaim any further. Current memory:
> > > 10933128
> > > Kb, target: 0 Kb
> > >
> > > Rick
> > >
> > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote:
> > > >
> > > >
> > > > I think that OSv might be hitting some sort of fragmentation
> > > scenario
> > > > (another question is why we get to this point) which should be
> > > > handled but is not due to the bug this patch fixes:
> > > >
> > > > diff --git a/core/mempool.cc b/core/mempool.cc
> > > > index d902eea8..11fd1456 100644
> > > > --- a/core/mempool.cc
> > > > +++ b/core/mempool.cc
> > > > @@ -702,10 +702,13 @@ page_range*
> > > page_range_allocator::alloc(size_t
> > > > size)
> > > >          for (auto&& pr : _free[exact_order - 1]) {
> > > >              if (pr.size >= size) {
> > > >                  range = &pr;
> > > > +                remove_list(exact_order - 1, *range);
> > > >                  break;
> > > >              }
> > > >          }
> > > > -        return nullptr;
> > > > +        if (!range) {
> > > > +            return nullptr;
> > > > +        }
> > > >      } else if (order == max_order) {
> > > >          range = &*_free_huge.rbegin();
> > > >          if (range->size < size) {
> > > >
> > > > Can you give it a try and see what happens with this patch?
> > > >
> > > > Waldek
> > > >
> > > > PS. You might also add this printout to verify if you are hitting
> > > the
> > > > hard fragmentation case:
> > > >
> > > >      page_range* range = nullptr;
> > > >      if (!bitset) {
> > > >          if (!exact_order || _free[exact_order - 1].empty()) {
> > > > +            printf("page_range_allocator: no ranges found for
> > > size
> > > > %ld and exact order: %d\n", size, exact_order);
> > > >              return nullptr;
> > > >          }
> > > >
> > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote:
> > > > > Ok, with the waiter_print, we see:
> > > > >
> > > > > Waiter: 2_scheduler, bytes: 4984832
> > > > > Waiter: 5_scheduler, bytes: 4984832
> > > > > Out of memory: could not reclaim any further. Current memory:
> > > > > 10871576Kb, target: 0 Kb
> > > > >
> > > > > 'osv mem' in gdb reports loads of free memory:
> > > > >
> > > > > (gdb) osv mem
> > > > > Total Memory: 12884372480 Bytes
> > > > > Mmap Memory:  1610067968 Bytes (12.50%)
> > > > > Free Memory:  11132493824 Bytes (86.40%)
> > > > >
> > > > > So why is it failing to allocate 5MB when it claims to have
> > > 11GB
> > > > > free?
> > > > >
> > > > > Any more debug I can provide?
> > > > >
> > > > > Rick
> > > > >
> > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote:
> > > > > > And indeed, back on the 'release' image, and we hit an oom:
> > > > > >
> > > > > > (gdb) b memory::oom
> > > > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505.
> > > > > > (gdb) c
> > > > > > Continuing.
> > > > > >
> > > > > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0
> > > )
> > > > > at
> > > > > > core/mempool.cc:505
> > > > > > 505     core/mempool.cc: No such file or directory.
> > > > > > (gdb) osv memory
> > > > > > Total Memory: 12884372480 Bytes
> > > > > > Mmap Memory:  1541332992 Bytes (11.96%)
> > > > > > Free Memory:  11490983936 Bytes (89.19%)
> > > > > >
> > > > > > I don't have the _waiters list print unforutnately (as I
> > > hadn't
> > > > > > included it in this image). It looks like this though:
> > > > > >
> > > > > > (gdb) p _oom_blocked._waiters
> > > > > > $4 =
> > > > > >
> > > > >
> > > {<boost::intrusive::list_impl<boost::intrusive::bhtraits<memory::re
> > > > > cl
> > > > > > ai
> > > > > > mer_waiters::wait_node,
> > > > > boost::intrusive::list_node_traits<void*>,
> > > > > > (boost::intrusive::link_mode_type)1,
> > > boost::intrusive::dft_tag,
> > > > > 1>,
> > > > > > unsigned long, false, void>> = {static constant_time_size =
> > > > > false,
> > > > > > static stateful_value_traits = <optimized out>,
> > > > > >     static has_container_from_iterator = <optimized out>,
> > > static
> > > > > > safemode_or_autounlink = true,
> > > > > >     data_ =
> > > > > >
> > > > >
> > > {<boost::intrusive::bhtraits<memory::reclaimer_waiters::wait_node,
> > > > > > boost::intrusive::list_node_traits<void*>,
> > > > > > (boost::intrusive::link_mode_type)1,
> > > boost::intrusive::dft_tag,
> > > > > 1>> =
> > > > > >
> > > > >
> > > {<boost::intrusive::bhtraits_base<memory::reclaimer_waiters::wait_n
> > > > > od
> > > > > > e,
> > > > > > boost::intrusive::list_node<void*>*,
> > > boost::intrusive::dft_tag,
> > > > > 1>> =
> > > > > > {<No data fields>}, static link_mode =
> > > > > boost::intrusive::safe_link},
> > > > > >       root_plus_size_ =
> > > > > > {<boost::intrusive::detail::size_holder<false,
> > > > > > unsigned long, void>> = {
> > > > > >           static constant_time_size = <optimized out>},
> > > m_header
> > > > > =
> > > > > > {<boost::intrusive::list_node<void*>> = {
> > > > > >             next_ = 0x200041305580, prev_ = 0x200041305580},
> > > <No
> > > > > data
> > > > > > fields>}}}}, <No data fields>}
> > > > > >
> > > > > > 'osv waiters' fails, annoyingly:
> > > > > >
> > > > > > (gdb) osv waiters
> > > > > > waiters:
> > > > > > Python Exception <class 'gdb.MemoryError'> Cannot access
> > > memory
> > > > > at
> > > > > > address 0x42c00020982ac010:
> > > > > > Error occurred in Python command: Cannot access memory at
> > > > > address
> > > > > > 0x42c00020982ac010
> > > > > >
> > > > > > Rick
> > > > > >
> > > > > > On Sat, 2020-03-21 at 06:42 +1000, Rick Payne wrote:
> > > > > > > So we've been trying with a debug image (with the tcp
> > > kassert
> > > > > > > disabled). We do not get an OOM crash, but after some time
> > > > > (24hrs
> > > > > > > or
> > > > > > > more) all cores just hit idle and never recover. It stops
> > > > > > > responding
> > > > > > > to
> > > > > > > TCP connections and when you attached gdb, you see this:
> > > > > > >
> > > > > > > Thread 1 received signal SIGINT, Interrupt.
> > > > > > > processor::sti_hlt () at arch/x64/processor.hh:252
> > > > > > > 252     arch/x64/processor.hh: No such file or directory.
> > > > > > > (gdb) info thread
> > > > > > >   Id   Target Id         Frame
> > > > > > > * 1    Thread 1 (CPU#0 [halted ]) processor::sti_hlt () at
> > > > > > > arch/x64/processor.hh:252
> > > > > > >   2    Thread 2 (CPU#1 [halted ]) processor::sti_hlt () at
> > > > > > > arch/x64/processor.hh:252
> > > > > > >   3    Thread 3 (CPU#2 [halted ]) processor::sti_hlt () at
> > > > > > > arch/x64/processor.hh:252
> > > > > > >   4    Thread 4 (CPU#3 [halted ]) processor::sti_hlt () at
> > > > > > > arch/x64/processor.hh:252
> > > > > > >   5    Thread 5 (CPU#4 [halted ]) processor::sti_hlt () at
> > > > > > > arch/x64/processor.hh:252
> > > > > > >   6    Thread 6 (CPU#5 [halted ]) processor::sti_hlt () at
> > > > > > > arch/x64/processor.hh:252
> > > > > > >
> > > > > > > The threads themselves look liks this:
> > > > > > >
> > > > > > > (gdb) bt
> > > > > > > #0  processor::sti_hlt () at arch/x64/processor.hh:252
> > > > > > > #1  0x00000000405e4016 in arch::wait_for_interrupt () at
> > > > > > > arch/x64/arch.hh:43
> > > > > > > #2  0x00000000405d7f10 in sched::cpu::do_idle
> > > > > > > (this=0xffff80000001b040)
> > > > > > > at core/sched.cc:404
> > > > > > > #3  0x00000000405d7fc1 in sched::cpu::idle
> > > > > > > (this=0xffff80000001b040)
> > > > > > > at
> > > > > > > core/sched.cc:423
> > > > > > > #4  0x00000000405d73ef in
> > > > > sched::cpu::<lambda()>::operator()(void)
> > > > > > > const (__closure=0xffff800100156070)
> > > > > > >     at core/sched.cc:165
> > > > > > > #5  0x00000000405e0b57 in std::_Function_handler<void(),
> > > > > > > sched::cpu::init_idle_thread()::<lambda()>
> > > >::_M_invoke(const
> > > > > > > std::_Any_data &) (__functor=...) at
> > > > > > > /usr/include/c++/9/bits/std_function.h:300
> > > > > > > #6  0x0000000040496206 in std::function<void
> > > ()>::operator()()
> > > > > > > const
> > > > > > > (this=0xffff800100156070)
> > > > > > >     at /usr/include/c++/9/bits/std_function.h:690
> > > > > > > #7  0x00000000405db386 in sched::thread::main
> > > > > > > (this=0xffff800100156040)
> > > > > > > at core/sched.cc:1210
> > > > > > > #8  0x00000000405d7173 in sched::thread_main_c
> > > > > > > (t=0xffff800100156040)
> > > > > > > at arch/x64/arch-switch.hh:321
> > > > > > > #9  0x00000000404911b3 in thread_main () at
> > > > > arch/x64/entry.S:113
> > > > > > >
> > > > > > > I tried the 'osv waiters' command but jsut get:
> > > > > > >
> > > > > > > (gdb) osv waiters
> > > > > > > waiters:
> > > > > > > Python Exception <class 'gdb.MemoryError'> Cannot access
> > > memory
> > > > > at
> > > > > > > address 0x42c00020880a9010:
> > > > > > > Error occurred in Python command: Cannot access memory at
> > > > > address
> > > > > > > 0x42c00020880a9010
> > > > > > >
> > > > > > > I think we'll go back to a 'release' image and see if we
> > > get
> > > > > the
> > > > > > > oom
> > > > > > > with a few more clues...
> > > > > > >
> > > > > > > Rick
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 2020-03-11 at 08:07 -0700, Waldek Kozaczuk wrote:
> > > > > > > > On Tuesday, March 10, 2020 at 10:53:17 PM UTC-4, rickp
> > > > > wrote:
> > > > > > > > > I've not found a way to reproduce this other than in
> > > > > production
> > > > > > > > > yet,
> > > > > > > > > which is annoying. I've built an image with this patch,
> > > > > and
> > > > > > > > > will
> > > > > > > > > see if
> > > > > > > > > we can run it with gdb too.
> > > > > > > > >
> > > > > > > > > What should I be looking for if it hits?
> > > > > > > >
> > > > > > > > In case of oom() this patch should show a list of
> > > threads
> > > > > > > > (waiters)
> > > > > > > > along with amount of memory requested. I am hoping it
> > > will
> > > > > give
> > > > > > > > better clues where the problem is.
> > > > > > > > > Note, the 'osv pagetable walk' command (and a few
> > > others)
> > > > > seem
> > > > > > > > > not
> > > > > > > > > to
> > > > > > > > > work for me:
> > > > > > > > >
> > > > > > > > > (gdb) osv pagetable walk
> > > > > > > > > Python Exception <class 'gdb.error'> Argument required
> > > > > > > > > (expression
> > > > > > > > > to
> > > > > > > > > compute).:
> > > > > > > > > Error occurred in Python: Argument required (expression
> > > to
> > > > > > > > > compute).
> > > > > > > >
> > > > > > > > I will look into it.
> > > > > > > > > Rick
> > > > > > > > >
> > > > > > > > > On Tue, 2020-03-10 at 11:21 -0700, Waldek Kozaczuk
> > > wrote:
> > > > > > > > > > Is there a way to have a reproducible test app? If
> > > not I
> > > > > was
> > > > > > > > > > wondering if this patch enabling some extra logging
> > > > > would
> > > > > > > > > > help:
> > > > > > > > > >
> > > > > > > > > > diff --git a/core/mempool.cc b/core/mempool.cc
> > > > > > > > > > index d902eea8..ace28b38 100644
> > > > > > > > > > --- a/core/mempool.cc
> > > > > > > > > > +++ b/core/mempool.cc
> > > > > > > > > > @@ -501,9 +501,9 @@ ssize_t
> > > > > > > > > reclaimer::bytes_until_normal(pressure
> > > > > > > > > > curr)
> > > > > > > > > >      }
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > > -void oom()
> > > > > > > > > > +void oom(ssize_t target)
> > > > > > > > > >  {
> > > > > > > > > > -    abort("Out of memory: could not reclaim any
> > > > > further.
> > > > > > > > > Current
> > > > > > > > > > memory: %d Kb", stats::free() >> 10);
> > > > > > > > > > +    abort("Out of memory: could not reclaim any
> > > > > further.
> > > > > > > > > Current
> > > > > > > > > > memory: %d Kb, target: %d Kb", stats::free() >> 10,
> > > > > target >>
> > > > > > > > > 10);
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > >  void reclaimer::wait_for_minimum_memory()
> > > > > > > > > > @@ -924,6 +924,17 @@ bool
> > > > > reclaimer_waiters::wake_waiters()
> > > > > > > > > >      return woken;
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > > +void reclaimer_waiters::print_waiters()
> > > > > > > > > > +{
> > > > > > > > > > +    auto it = _waiters.begin();
> > > > > > > > > > +    while (it != _waiters.end()) {
> > > > > > > > > > +        auto& wr = *it;
> > > > > > > > > > +        it++;
> > > > > > > > > > +
> > > > > > > > > > +        printf( "Waiter: %s, bytes: %ld\n",
> > > wr.owner-
> > > > > > > > > > >name(),
> > > > > > > > > > wr.bytes);
> > > > > > > > > > +    }
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > >  // Note for callers: Ideally, we would not only
> > > wake,
> > > > > but
> > > > > > > > > already
> > > > > > > > > > allocate
> > > > > > > > > >  // memory here and pass it back to the waiter.
> > > However,
> > > > > > > > > > memory
> > > > > > > > > is
> > > > > > > > > > not always
> > > > > > > > > >  // allocated the same way (ex: refill_page_buffer
> > > is
> > > > > > > > > > completely
> > > > > > > > > > different from
> > > > > > > > > > @@ -943,7 +954,7 @@ void
> > > reclaimer_waiters::wait(size_t
> > > > > > > > > > bytes)
> > > > > > > > > >
> > > > > > > > > >      // Wait for whom?
> > > > > > > > > >      if (curr == reclaimer_thread._thread.get()) {
> > > > > > > > > > -        oom();
> > > > > > > > > > +        oom(bytes);
> > > > > > > > > >       }
> > > > > > > > > >
> > > > > > > > > >      wait_node wr;
> > > > > > > > > > @@ -1027,7 +1038,8 @@ void reclaimer::_do_reclaim()
> > > > > > > > > >                  // Wake up all waiters that are
> > > waiting
> > > > > and
> > > > > > > > > > now
> > > > > > > > > have
> > > > > > > > > > a chance to succeed.
> > > > > > > > > >                  // If we could not wake any, there
> > > is
> > > > > > > > > > nothing
> > > > > > > > > really
> > > > > > > > > > we can do.
> > > > > > > > > >                  if (!_oom_blocked.wake_waiters()) {
> > > > > > > > > > -                    oom();
> > > > > > > > > > +                    _oom_blocked.print_waiters();
> > > > > > > > > > +                    oom(target);
> > > > > > > > > >                  }
> > > > > > > > > >              }
> > > > > > > > > >
> > > > > > > > > > diff --git a/include/osv/mempool.hh
> > > > > b/include/osv/mempool.hh
> > > > > > > > > > index 10fe5602..620f1a5b 100644
> > > > > > > > > > --- a/include/osv/mempool.hh
> > > > > > > > > > +++ b/include/osv/mempool.hh
> > > > > > > > > > @@ -137,6 +137,7 @@ public:
> > > > > > > > > >      bool wake_waiters();
> > > > > > > > > >      void wait(size_t bytes);
> > > > > > > > > >      bool has_waiters() { return !_waiters.empty();
> > > }
> > > > > > > > > > +    void print_waiters();
> > > > > > > > > >  private:
> > > > > > > > > >      struct wait_node:
> > > boost::intrusive::list_base_hook<>
> > > > > {
> > > > > > > > > >          sched::thread* owner;
> > > > > > > > > >
> > > > > > > > > > Waldek
> > > > > > > > > >
> > > > > > > > > > On Tuesday, March 10, 2020 at 12:26:54 PM UTC-4,
> > > Waldek
> > > > > > > > > > Kozaczuk
> > > > > > > > > > wrote:
> > > > > > > > > > > Well I found this commit changing original "> 0 "
> > > to
> > > > > ">= 0"
> > > > > > > > > > > -
> > > > > > > > > > >
> > > > > > > > >
> > > > >
> > >
> https://github.com/cloudius-systems/osv/commit/f888c39d744c38cf6fa2f7568c04cea3a7217dca
> > > > >
> > > > > > > > >
> > > > > > > > > > >  - with some explanation. So maybe my theory is
> > > wrong.
> > > > > Also
> > > > > > > > > > > I
> > > > > > > > > am
> > > > > > > > > > > not sure calling bytes_until_normal() before that
> > > "if"
> > > > > > > > > > > would
> > > > > > > > > change
> > > > > > > > > > > anything then.
> > > > > > > > > > >
> > > > > > > > > > > On Tuesday, March 10, 2020 at 12:04:27 PM UTC-4,
> > > > > Waldek
> > > > > > > > > Kozaczuk
> > > > > > > > > > > wrote:
> > > > > > > > > > > > So here is full code of of _do_reclaim():
> > > > > > > > > > > >
> > > > > > > > > > > > 1001 void reclaimer::_do_reclaim()
> > > > > > > > > > > > 1002 {
> > > > > > > > > > > > 1003     ssize_t target;
> > > > > > > > > > > > 1004     emergency_alloc_level = 1;
> > > > > > > > > > > > 1005
> > > > > > > > > > > > 1006     while (true) {
> > > > > > > > > > > > 1007         WITH_LOCK(free_page_ranges_lock) {
> > > > > > > > > > > > 1008
> > > > > _blocked.wait(free_page_ranges_lock);
> > > > > > > > > > > > 1009             target = bytes_until_normal();
> > > > > > > > > > > > 1010         }
> > > > > > > > > > > > 1011
> > > > > > > > > > > > 1012         // This means that we are currently
> > > > > > > > > > > > ballooning,
> > > > > > > > > we
> > > > > > > > > > > > should
> > > > > > > > > > > > 1013         // try to serve the waiters from
> > > > > temporary
> > > > > > > > > memory
> > > > > > > > > > > > without
> > > > > > > > > > > > 1014         // going on hard mode. A big batch
> > > of
> > > > > more
> > > > > > > > > memory is
> > > > > > > > > > > > likely
> > > > > > > > > > > > 1015         // in its way.
> > > > > > > > > > > > 1016         if (_oom_blocked.has_waiters() &&
> > > > > > > > > > > > throttling_needed()) {
> > > > > > > > > > > > 1017             _shrinker_loop(target, [] {
> > > return
> > > > > > > > > > > > false;
> > > > > > > > > });
> > > > > > > > > > > > 1018             WITH_LOCK(free_page_ranges_lock)
> > > {
> > > > > > > > > > > > 1019                 if
> > > (_oom_blocked.wake_waiters())
> > > > > {
> > > > > > > > > > > > 1020                         continue;
> > > > > > > > > > > > 1021                 }
> > > > > > > > > > > > 1022             }
> > > > > > > > > > > > 1023         }
> > > > > > > > > > > > 1024
> > > > > > > > > > > > 1025         _shrinker_loop(target, [this] {
> > > return
> > > > > > > > > > > > _oom_blocked.has_waiters(); });
> > > > > > > > > > > > 1026
> > > > > > > > > > > > 1027         WITH_LOCK(free_page_ranges_lock) {
> > > > > > > > > > > > 1028             if (target >= 0) {
> > > > > > > > > > > > 1029                 // Wake up all waiters that
> > > are
> > > > > > > > > > > > waiting
> > > > > > > > > and
> > > > > > > > > > > > now have a chance to succeed.
> > > > > > > > > > > > 1030                 // If we could not wake any,
> > > > > there
> > > > > > > > > > > > is
> > > > > > > > > > > > nothing really we can do.
> > > > > > > > > > > > 1031                 if
> > > > > (!_oom_blocked.wake_waiters()) {
> > > > > > > > > > > > 1032                     oom();
> > > > > > > > > > > > 1033                 }
> > > > > > > > > > > > 1034             }
> > > > > > > > > > > > 1035
> > > > > > > > > > > > 1036             if (balloon_api) {
> > > > > > > > > > > > 1037                 balloon_api-
> > > > > >voluntary_return();
> > > > > > > > > > > > 1038             }
> > > > > > > > > > > > 1039         }
> > > > > > > > > > > > 1040     }
> > > > > > > > > > > > 1041 }
> > > > > > > > > > > >
> > > > > > > > > > > > We got oom() because target was '>= 0'. Now the
> > > > > target
> > > > > > > > > > > > is
> > > > > > > > > > > > calculated as the result of
> > > bytes_until_normal().
> > > > > > > > > > > >
> > > > > > > > > > > >  495 ssize_t
> > > reclaimer::bytes_until_normal(pressure
> > > > > > > > > > > > curr)
> > > > > > > > > > > >  496 {
> > > > > > > > > > > >  497
> > > > > assert(mutex_owned(&free_page_ranges_lock));
> > > > > > > > > > > >  498     if (curr == pressure::PRESSURE) {
> > > > > > > > > > > >  499         return watermark_lo -
> > > stats::free();
> > > > > > > > > > > >  500     } else {
> > > > > > > > > > > >  501         return 0;
> > > > > > > > > > > >  502     }
> > > > > > > > > > > >  503 }
> > > > > > > > > > > >
> > > > > > > > > > > > which seems to indicate that when 0 is returned
> > > there
> > > > > no
> > > > > > > > > > > > need
> > > > > > > > > to
> > > > > > > > > > > > reclaim any memory.
> > > > > > > > > > > >
> > > > > > > > > > > > So here are two things that might be wrong:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Shouldn't if (target >= 0) be changed to if
> > > > > (target >
> > > > > > > > > > > > 0)
> > > > > > > > > {?
> > > > > > > > > > > > 2. Shouldn't we re-read the target in second
> > > > > WITH_LOCK
> > > > > > > > > instead of
> > > > > > > > > > > > comparing the original value in the beginning of
> > > the
> > > > > body
> > > > > > > > > > > > of
> > > > > > > > > the
> > > > > > > > > > > > loop? The line before - _shrinker_loop(target,
> > > [this]
> > > > > {
> > > > > > > > > return
> > > > > > > > > > > > _oom_blocked.has_waiters(); }); - might have
> > > just
> > > > > > > > > > > > released
> > > > > > > > > enough
> > > > > > > > > > > > memory to bring target below 0, right?
> > > > > > > > > > > >
> > > > > > > > > > > > In any case it would be useful to print the value
> > > of
> > > > > the
> > > > > > > > > target
> > > > > > > > > > > > before oom():
> > > > > > > > > > > >
> > > > > > > > > > > >                  if
> > > (!_oom_blocked.wake_waiters()) {
> > > > > > > > > > > >
> > > > > > > > > > > >                      printf("--> Target: %ld\n",
> > > > > > > > > > > > target);
> > > > > > > > > > > >
> > > > > > > > > > > >                      oom();
> > > > > > > > > > > >
> > > > > > > > > > > >                  }
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Monday, March 9, 2020 at 10:51:02 PM UTC-4,
> > > rickp
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > We're pretty close to current on OSv, but it
> > > also
> > > > > > > > > > > > > happens
> > > > > > > > > on an
> > > > > > > > > > > > > older
> > > > > > > > > > > > > image. We have changed some stuff in our app,
> > > but
> > > > > I
> > > > > > > > > > > > > think
> > > > > > > > > that
> > > > > > > > > > > > > may just
> > > > > > > > > > > > > be provking the bug. Certainly from gdb, I can
> > > see
> > > > > that
> > > > > > > > > both
> > > > > > > > > > > > > mmaped and
> > > > > > > > > > > > > normal memory fluctuate up and down but
> > > eveything
> > > > > looks
> > > > > > > > > sane.
> > > > > > > > > > > > > More debug in wake_waiters would be useful, but
> > > > > I'm
> > > > > > > > > > > > > losing
> > > > > > > > > the
> > > > > > > > > > > > > argument
> > > > > > > > > > > > > to continue with OSv at the moment which makes
> > > > > testing
> > > > > > > > > > > > > this
> > > > > > > > > a
> > > > > > > > > > > > > bit
> > > > > > > > > > > > > 'political'.
> > > > > > > > > > > > >
> > > > > > > > > > > > > btw - when we do run the system out of memory,
> > > it
> > > > > seems
> > > > > > > > > > > > > to
> > > > > > > > > hang
> > > > > > > > > > > > > rather
> > > > > > > > > > > > > than generate an oom. Have you tried it?
> > > > > > > > > > > > >
> > > > > > > > > > > > > The tcp_do_segment one has been mentioned
> > > before
> > > > > (by
> > > > > > > > > someone
> > > > > > > > > > > > > else). The
> > > > > > > > > > > > > issue is that the kassert only has effect in
> > > the
> > > > > debug
> > > > > > > > > build.
> > > > > > > > > > > > > I'd guess
> > > > > > > > > > > > > that the socket is being closed, but still has
> > > > > segments
> > > > > > > > > that
> > > > > > > > > > > > > have not
> > > > > > > > > > > > > been processed, or something like that. I'll
> > > try
> > > > > and
> > > > > > > > > > > > > narrow
> > > > > > > > > it
> > > > > > > > > > > > > down a
> > > > > > > > > > > > > bit if I get time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Rick
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, 2020-03-09 at 22:32 -0400, Waldek
> > > Kozaczuk
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > Does it happen with the very latest OSv code?
> > > Did
> > > > > it
> > > > > > > > > start
> > > > > > > > > > > > > happening
> > > > > > > > > > > > > > at some point more often?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I wonder if we could add some helpful
> > > printouts
> > > > > in
> > > > > > > > > > > > > wake_waiters().
> > > > > > > > > > > > > > Btw that assert() failure in tcp_do_segment()
> > > > > rings a
> > > > > > > > > bell.
> > > > > > > > > > > > > > On Mon, Mar 9, 2020 at 22:25 Rick Payne <
> > > > > > > > > ri...@rossfell.co.uk
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > I can't add much other than I doubt its
> > > > > > > > > > > > > > > fragmentation.
> > > > > > > > > > > > > Sometimes
> > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > happens within a few minutes of the system
> > > > > > > > > > > > > > > starting.
> > > > > > > > > > > > > > > At
> > > > > > > > > no
> > > > > > > > > > > > > point do
> > > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > think we're using more than 2GB of ram (of
> > > the
> > > > > > > > > > > > > > > 12GB)
> > > > > > > > > > > > > either.
> > > > > > > > > > > > > > > I did compile up a debug verison of OSv and
> > > > > built
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > system with
> > > > > > > > > > > > > > > that,
> > > > > > > > > > > > > > > but I've been unable to trigger the oom().
> > > > > Worse, I
> > > > > > > > > > > > > > > hit
> > > > > > > > > a
> > > > > > > > > > > > > kassert
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > the netchannel code that seems to be
> > > ignored
> > > > > in
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > 'release'
> > > > > > > > > > > > > > > build,
> > > > > > > > > > > > > > > but panics in the debug build:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [E/384 bsd-kassert]: tcp_do_segment:
> > > > > TCPS_LISTEN
> > > > > > > > > > > > > > > Assertion failed: tp->get_state() > 1
> > > > > > > > > > > > > > > (bsd/sys/netinet/tcp_input.cc:
> > > > > > > > > > > > > > > tcp_do_segment: 1076)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [backtrace]
> > > > > > > > > > > > > > > 0x0000000040221330 <abort(char const*,
> > > > > ...)+280>
> > > > > > > > > > > > > > > 0x0000000040221399 <__assert_fail+64>
> > > > > > > > > > > > > > > 0x00000000402a4798 <???+1076512664>
> > > > > > > > > > > > > > > 0x00000000402a97c2 <???+1076533186>
> > > > > > > > > > > > > > > 0x00000000402a98a1 <???+1076533409>
> > > > > > > > > > > > > > > 0x00000000402aa448 <???+1076536392>
> > > > > > > > > > > > > > > 0x0000000040656a9a <std::function<void
> > > > > > > > > > > > > (mbuf*)>::operator()(mbuf*)
> > > > > > > > > > > > > > > const+76>
> > > > > > > > > > > > > > > 0x0000000040655855
> > > > > > > > > > > > > > > <net_channel::process_queue()+61>
> > > > > > > > > > > > > > > 0x000000004023b165 <???+1076080997>
> > > > > > > > > > > > > > > 0x000000004023b4d7 <soclose+878>
> > > > > > > > > > > > > > > 0x000000004024cd21
> > > <socket_file::close()+51>
> > > > > > > > > > > > > > > 0x00000000406a6a10 <fdrop+151>
> > > > > > > > > > > > > > > 0x00000000406a64f7 <fdclose(int)+184>
> > > > > > > > > > > > > > > 0x000000004067cd42 <close+41>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So at the moment, I'm a bit stuck with
> > > getting
> > > > > any
> > > > > > > > > more
> > > > > > > > > > > > > info...
> > > > > > > > > > > > > > > Rick
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, 2020-03-09 at 08:52 -0700, Waldek
> > > > > Kozaczuk
> > > > > > > > > wrote:
> > > > > > > > > > > > > > > > As I understand this stack trace the
> > > oom()
> > > > > was
> > > > > > > > > called
> > > > > > > > > > > > > here as
> > > > > > > > > > > > > > > part of
> > > > > > > > > > > > > > > > _do_reclaim():
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1025
> > >  WITH_LOCK(free_page_ranges_lock)
> > > > > {
> > > > > > > > > > > > > > > > 1026             if (target >= 0) {
> > > > > > > > > > > > > > > > 1027                 // Wake up all
> > > waiters
> > > > > that
> > > > > > > > > > > > > > > > are
> > > > > > > > > > > > > waiting and
> > > > > > > > > > > > > > > now
> > > > > > > > > > > > > > > > have a chance to succeed.
> > > > > > > > > > > > > > > > 1028                 // If we could not
> > > wake
> > > > > any,
> > > > > > > > > there
> > > > > > > > > > > > > is
> > > > > > > > > > > > > > > nothing
> > > > > > > > > > > > > > > > really we can do.
> > > > > > > > > > > > > > > > 1029                 if
> > > > > > > > > (!_oom_blocked.wake_waiters()) {
> > > > > > > > > > > > > > > > 1030                     oom();
> > > > > > > > > > > > > > > > 1031                 }
> > > > > > > > > > > > > > > > 1032             }
> > > > > > > > > > > > > > > > 1033
> > > > > > > > > > > > > > > > 1034             if (balloon_api) {
> > > > > > > > > > > > > > > > 1035                 balloon_api-
> > > > > > > > > > voluntary_return();
> > > > > > > > > > > > > > > > 1036             }
> > > > > > > > > > > > > > > > 1037         }
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > so it seems wake_waiters() returned
> > > false. I
> > > > > > > > > > > > > > > > wonder
> > > > > > > > > if
> > > > > > > > > > > > > the memory
> > > > > > > > > > > > > > > was
> > > > > > > > > > > > > > > > heavily fragmented or there is some
> > > logical
> > > > > bug
> > > > > > > > > > > > > > > > in
> > > > > > > > > there.
> > > > > > > > > > > > > This
> > > > > > > > > > > > > > > method
> > > > > > > > > > > > > > > > is called from two places and I wonder if
> > > > > this
> > > > > > > > > > > > > > > > part
> > > > > > > > > of
> > > > > > > > > > > > > > > wake_waiters()
> > > > > > > > > > > > > > > > is correct:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  921     if (!_waiters.empty()) {
> > > > > > > > > > > > > > > >  922         reclaimer_thread.wake();
> > > > > > > > > > > > > > > >  923     }
> > > > > > > > > > > > > > > >  924     return woken;
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > should this if also set woken to true?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Also could we also enhance the oom()
> > > logic
> > > > > to
> > > > > > > > > > > > > > > > print
> > > > > > > > > out
> > > > > > > > > > > > > more
> > > > > > > > > > > > > > > useful
> > > > > > > > > > > > > > > > information if this happens once again?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tuesday, March 3, 2020 at 2:21:40 AM
> > > UTC-
> > > > > 5,
> > > > > > > > > > > > > > > > rickp
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > Had a crash on a system that I don't
> > > > > > > > > > > > > > > > > understand.
> > > > > > > > > Its a
> > > > > > > > > > > > > VM with
> > > > > > > > > > > > > > > > > 12GB
> > > > > > > > > > > > > > > > > allocated, we were running without
> > > about
> > > > > 10.5GB
> > > > > > > > > free
> > > > > > > > > > > > > according
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > API.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Out of the blue, we had a panic:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Out of memory: could not reclaim any
> > > > > further.
> > > > > > > > > Current
> > > > > > > > > > > > > memory:
> > > > > > > > > > > > > > > > > 10954988
> > > > > > > > > > > > > > > > > Kb
> > > > > > > > > > > > > > > > > [backtrace]
> > > > > > > > > > > > > > > > > 0x00000000403f6320 <memory::oom()+32>
> > > > > > > > > > > > > > > > > 0x00000000403f71cc
> > > > > > > > > > > > > <memory::reclaimer::_do_reclaim()+380>
> > > > > > > > > > > > > > > > > 0x00000000403f722f <???+1077899823>
> > > > > > > > > > > > > > > > > 0x000000004040f29b <thread_main_c+43>
> > > > > > > > > > > > > > > > > 0x00000000403ae412 <???+1077601298>
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The 'Out of memory' message seems to
> > > print
> > > > > > > > > > > > > stats::free() and
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > number suggests we have plenty of free
> > > > > ram.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Have I misunderstood, or is there
> > > something
> > > > > I
> > > > > > > > > > > > > > > > > need
> > > > > > > > > to
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > looking
> > > > > > > > > > > > > > > > > at?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > Rick
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > You received this message because you
> > > are
> > > > > > > > > > > > > > > > subscribed
> > > > > > > > > to
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > Google
> > > > > > > > > > > > > > > > Groups "OSv Development" group.
> > > > > > > > > > > > > > > > To unsubscribe from this group and stop
> > > > > receiving
> > > > > > > > > emails
> > > > > > > > > > > > > from it,
> > > > > > > > > > > > > > > > send an email to osv...@googlegroups.com.
> > >
> > > > > > > > > > > > > > > > To view this discussion on the web visit
> > > > > > > > > > > > > > > >
> > > > > > > > >
> > > > >
> > >
> https://groups.google.com/d/msgid/osv-dev/8f7e00a5-edfe-4487-aa5a-5072a560c6e3%40googlegroups.com
> > > > >
> > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > > .
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > You received this message because you are subscribed
> > > to
> > > > > the
> > > > > > > > > Google
> > > > > > > > > > Groups "OSv Development" group.
> > > > > > > > > > To unsubscribe from this group and stop receiving
> > > emails
> > > > > from
> > > > > > > > > it,
> > > > > > > > > > send an email to osv...@googlegroups.com.
> > > > > > > > > > To view this discussion on the web visit
> > > > > > > > > >
> > > > > > > > >
> > > > >
> > >
> https://groups.google.com/d/msgid/osv-dev/b8a57083-2882-4878-8668-f510eaceff84%40googlegroups.com
> > > > >
> > > > > > > > >
> > > > > > > > > > .
> > > > > > > >
> > > > > > > > --
> > > > > > > > You received this message because you are subscribed to
> > > the
> > > > > > > > Google
> > > > > > > > Groups "OSv Development" group.
> > > > > > > > To unsubscribe from this group and stop receiving emails
> > > from
> > > > > it,
> > > > > > > > send an email to osv...@googlegroups.com.
> > > > > > > > To view this discussion on the web visit
> > > > > > > >
> > > > >
> > >
> https://groups.google.com/d/msgid/osv-dev/79e94c93-f832-4ed9-935e-0ffd9a1a4ec6%40googlegroups.com
> > > > >
> > > > > > > > .
> > > > >
> > > >
> > > > --
> > > > You received this message because you are subscribed to the
> > > Google
> > > > Groups "OSv Development" group.
> > > > To unsubscribe from this group and stop receiving emails from it,
> > > > send an email to osv-dev+unsubscr...@googlegroups.com.
> > > > To view this discussion on the web visit
> > > >
> > >
> https://groups.google.com/d/msgid/osv-dev/443b4410-5c69-436c-b9ba-85b4a42ad61f%40googlegroups.com
> > > > .
> > >
>
> --
> You received this message because you are subscribed to the Google Groups
> "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to osv-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/osv-dev/188c71d297eb49013250377e82df96a298b39df6.camel%40rossfell.co.uk
> .
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/CAL9cFfO3%3DFWM0bB0AEEgsLLz4k6Te%2Bp_rTMGmZ4NW9nSFD8LFQ%40mail.gmail.com.

Reply via email to