BTW, the patch I attached to my previous email should have had this:

printf( "Waiter: %s, bytes: %ld\n", wr.owner->name().c_str(), wr.bytes);

instead of

printf( "Waiter: %s, bytes: %ld\n", wr.owner->name(), wr.bytes);
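For reference, passing a std::string object through printf()'s C varargs is undefined behavior; %s expects a NUL-terminated const char*, which is what c_str() provides. A minimal self-contained illustration (format_waiter() is just a hypothetical stand-in for the patch's loop body, not OSv code):

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the print_waiters() loop body: format one
// waiter line. The important detail is c_str(): %s needs a const char*,
// and pushing a std::string object through C varargs is undefined behavior.
std::string format_waiter(const std::string& name, long bytes) {
    char buf[128];
    std::snprintf(buf, sizeof(buf), "Waiter: %s, bytes: %ld",
                  name.c_str(), bytes);
    return buf;
}
```

e.g. `printf("%s\n", format_waiter("reclaimer", 4096).c_str());` prints `Waiter: reclaimer, bytes: 4096`.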

Meanwhile, I have found at least two issues in mempool.cc, which I think 
mostly matter when we get truly close to running out of memory.

1) I see this crash when I stress the app's memory usage:

page fault outside application, addr: 0x0000000000000000
[registers]
RIP: 0x00000000403f3dd6 <memory::page_pool::l2::refill()+230>
RFL: 0x0000000000010286  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x0000000000000000  RBX: 0x0000000000020000  RCX: 0x0000000000000004  RDX: 0x0000000000000000
RSI: 0x00000000409082f0  RDI: 0x0000000040908520  RBP: 0xffff80000010ffc0  R8:  0xfffffffffffff758
R9:  0x00000000409085d8  R10: 0xffffa00000ef5640  R11: 0xffffa00000d7c300  R12: 0xffff80000010ff90
R13: 0x00000000409085d8  R14: 0xffffa00000ef5640  R15: 0xffffa00000d7c300  RSP: 0xffff80000010fe70
Aborted

[backtrace]
0x0000000040347cb9 <???+1077181625>
0x000000004034988d <mmu::vm_fault(unsigned long, exception_frame*)+397>
0x00000000403aab8b <page_fault+155>
0x00000000403a99d6 <???+1077582294>
0x00000000403f53a7 <memory::page_pool::l2::fill_thread()+231>
0x00000000403f9b6f <std::_Function_handler<void (), memory::page_pool::l2::l2()::{lambda()#1}>::_M_invoke(std::_Any_data const&)+15>
0x000000004040b66b <thread_main_c+43>
0x00000000403aa952 <???+1077586258>

which is caused by the fact that *page_range_allocator::alloc()* can return 
nullptr and we do not check for that in l2::refill():

void l2::refill()
{
    page_batch batch;
    page_batch* pb;
    while (get_nr() < _max / 2) {
        WITH_LOCK(free_page_ranges_lock) {
            reclaimer_thread.wait_for_minimum_memory();
            if (free_page_ranges.empty()) {
                // That is almost a guaranteed oom, but we can still have some hope
                // if the current allocation is a small one. Another advantage
                // of waiting here instead of oom'ing directly is that we can have
                // less points in the code where we can oom, and be more
                // predictable.
                reclaimer_thread.wait_for_memory(mmu::page_size);
            }
            auto total_size = 0;
            for (size_t i = 0 ; i < page_batch::nr_pages; i++) {
                batch.pages[i] = free_page_ranges.alloc(page_size);
                *if (!batch.pages[i]) {*
                *    printf("Could not get a range\n"); //TODO: How shall we handle it?*
                *}*
                total_size += page_size;
            }
            on_alloc(total_size);
        }
        // Use the last page to store other page address
        pb = static_cast<page_batch*>(batch.pages[page_batch::nr_pages - 1]);
        *pb = batch;
        if (_stack.push(pb)) {
            inc_nr();
        } else {
            // FIXME: _nr can change within {alloc,free}_page_batch_{fast,slow}
            // _stack might be full at this point, so we need to free the newly
            // allocated pages!!!
            free_batch(batch);
        }
    }
}

I am not sure how to handle it, but it is clearly a bug.
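One possible way to handle it, sketched here with simplified stand-ins (fake_allocator, nr_pages, try_fill_batch are all hypothetical, not the real OSv types): when alloc() returns nullptr mid-batch, give back the pages already taken and report failure, so the caller can wait on the reclaimer and retry instead of dereferencing a null page later.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Simplified stand-ins for the OSv types, just to show the unwind pattern.
constexpr size_t nr_pages = 4;
struct page_batch { std::array<void*, nr_pages> pages; };

// Hypothetical allocator that can fail, like page_range_allocator::alloc().
struct fake_allocator {
    std::vector<void*> pool;
    void* alloc() {
        if (pool.empty()) return nullptr;
        void* p = pool.back();
        pool.pop_back();
        return p;
    }
    void free(void* p) { pool.push_back(p); }
};

// Try to fill a whole batch; on failure, return the pages already taken
// and report failure, so the caller can wait for memory and retry rather
// than storing a nullptr in the batch as the current refill() would.
bool try_fill_batch(fake_allocator& a, page_batch& batch) {
    for (size_t i = 0; i < nr_pages; i++) {
        batch.pages[i] = a.alloc();
        if (!batch.pages[i]) {
            for (size_t j = 0; j < i; j++) {   // unwind partial allocation
                a.free(batch.pages[j]);
            }
            return false;
        }
    }
    return true;
}
```

In l2::refill() the failure path would presumably release free_page_ranges_lock and call reclaimer_thread.wait_for_memory() before retrying, but that is a design question for someone who knows this code better.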

2) This code in page_range_allocator::alloc() looks wrong or at least 
suspicious:

page_range* page_range_allocator::alloc(size_t size)
{
    auto exact_order = ilog2_roundup(size / page_size);
    if (exact_order > max_order) {
        exact_order = max_order;
    }
    auto bitset = _not_empty.to_ulong();
    if (exact_order) {
        bitset &= ~((1 << exact_order) - 1);
    }
    auto order = count_trailing_zeros(bitset);

    page_range* range = nullptr;
    if (!bitset) {
        printf("___ ALE1!\n");
        if (!exact_order || _free[exact_order - 1].empty()) {
            return nullptr;
        }
        // TODO: This linear search makes worst case complexity of the allocator
        // O(n). It would be better to fall back to non-contiguous allocation
        // and make worst case complexity depend on the size of requested memory
        // block and the logarithm of the number of free huge page ranges.
        *for (auto&& pr : _free[exact_order - 1]) {*
        *    if (pr.size >= size) {*
        *        range = &pr;*
        *        break; //TODO: Why are we returning nullptr below anyway?*
        *    }*
        *}*
        *return nullptr;*
    } else if (order == max_order) {
        range = &*_free_huge.rbegin();
        if (range->size < size) {
            return nullptr;
        }
        remove_huge(*range);
    } else {
        range = &_free[order].front();
        remove_list(order, *range);
    }

    auto& pr = *range;
    if (pr.size > size) {
        auto& np = *new (static_cast<void*>(&pr) + size)
                        page_range(pr.size - size);
        insert<UseBitmap>(np);
        pr.size = size;
    }
    if (UseBitmap) {
        set_bits(pr, false);
    }
    return &pr;
}
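If the intent is what the TODO hints at, this branch presumably should return nullptr only when the linear search finds nothing, and otherwise detach the found range from its free list the way the other branches do with remove_list()/remove_huge(). A simplified, self-contained model of just the search logic (my assumption of the intended behavior, using a std::list of range sizes instead of the real intrusive lists):

```cpp
#include <cstddef>
#include <list>

// Model of the linear-search branch: free_list holds the sizes of the
// ranges in _free[exact_order - 1]. Unlike the quoted code, this returns
// nullptr only when no range is big enough, instead of unconditionally.
size_t* find_fitting_range(std::list<size_t>& free_list, size_t size) {
    size_t* range = nullptr;
    for (auto& pr : free_list) {
        if (pr >= size) {     // first-fit: keep the first big-enough range
            range = &pr;
            break;
        }
    }
    return range;  // nullptr only if nothing fits; the real allocator
                   // would still have to detach the range before splitting
}
```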

I have not found a bug that would exactly cause the type of crash you 
reported.

On Wednesday, March 11, 2020 at 11:07:28 AM UTC-4, Waldek Kozaczuk wrote:
>
>
>
> On Tuesday, March 10, 2020 at 10:53:17 PM UTC-4, rickp wrote:
>>
>>
>> I've not found a way to reproduce this other than in production yet, 
>> which is annoying. I've built an image with this patch, and will see if 
>> we can run it with gdb too. 
>>
>> What should I be looking for if it hits? 
>>
> In case of oom() this patch should show a list of threads (waiters) along 
> with amount of memory requested. I am hoping it will give better clues 
> where the problem is. 
>
>>
>> Note, the 'osv pagetable walk' command (and a few others) seem not to 
>> work for me: 
>>
>> (gdb) osv pagetable walk 
>> Python Exception <class 'gdb.error'> Argument required (expression to 
>> compute).: 
>> Error occurred in Python: Argument required (expression to compute). 
>>
> I will look into it. 
>
>>
>> Rick 
>>
>> On Tue, 2020-03-10 at 11:21 -0700, Waldek Kozaczuk wrote: 
>> > Is there a way to have a reproducible test app? If not I was 
>> > wondering if this patch enabling some extra logging would help: 
>> > 
>> > diff --git a/core/mempool.cc b/core/mempool.cc 
>> > index d902eea8..ace28b38 100644 
>> > --- a/core/mempool.cc 
>> > +++ b/core/mempool.cc 
>> > @@ -501,9 +501,9 @@ ssize_t reclaimer::bytes_until_normal(pressure curr) 
>> >      } 
>> >  } 
>> >   
>> > -void oom() 
>> > +void oom(ssize_t target) 
>> >  { 
>> > -    abort("Out of memory: could not reclaim any further. Current memory: %d Kb", stats::free() >> 10); 
>> > +    abort("Out of memory: could not reclaim any further. Current memory: %d Kb, target: %d Kb", stats::free() >> 10, target >> 10); 
>> >  } 
>> >   
>> >  void reclaimer::wait_for_minimum_memory() 
>> > @@ -924,6 +924,17 @@ bool reclaimer_waiters::wake_waiters() 
>> >      return woken; 
>> >  } 
>> >   
>> > +void reclaimer_waiters::print_waiters() 
>> > +{ 
>> > +    auto it = _waiters.begin(); 
>> > +    while (it != _waiters.end()) { 
>> > +        auto& wr = *it; 
>> > +        it++; 
>> > + 
>> > +        printf( "Waiter: %s, bytes: %ld\n", wr.owner->name(), wr.bytes); 
>> > +    } 
>> > +} 
>> > + 
>> >  // Note for callers: Ideally, we would not only wake, but already allocate 
>> >  // memory here and pass it back to the waiter. However, memory is not always 
>> >  // allocated the same way (ex: refill_page_buffer is completely different from 
>> > @@ -943,7 +954,7 @@ void reclaimer_waiters::wait(size_t bytes) 
>> >   
>> >      // Wait for whom? 
>> >      if (curr == reclaimer_thread._thread.get()) { 
>> > -        oom(); 
>> > +        oom(bytes); 
>> >       } 
>> >   
>> >      wait_node wr; 
>> > @@ -1027,7 +1038,8 @@ void reclaimer::_do_reclaim() 
>> >                  // Wake up all waiters that are waiting and now have a chance to succeed. 
>> >                  // If we could not wake any, there is nothing really we can do. 
>> >                  if (!_oom_blocked.wake_waiters()) { 
>> > -                    oom(); 
>> > +                    _oom_blocked.print_waiters(); 
>> > +                    oom(target); 
>> >                  } 
>> >              } 
>> >   
>> > diff --git a/include/osv/mempool.hh b/include/osv/mempool.hh 
>> > index 10fe5602..620f1a5b 100644 
>> > --- a/include/osv/mempool.hh 
>> > +++ b/include/osv/mempool.hh 
>> > @@ -137,6 +137,7 @@ public: 
>> >      bool wake_waiters(); 
>> >      void wait(size_t bytes); 
>> >      bool has_waiters() { return !_waiters.empty(); } 
>> > +    void print_waiters(); 
>> >  private: 
>> >      struct wait_node: boost::intrusive::list_base_hook<> { 
>> >          sched::thread* owner; 
>> > 
>> > Waldek 
>> > 
>> > On Tuesday, March 10, 2020 at 12:26:54 PM UTC-4, Waldek Kozaczuk 
>> > wrote: 
>> > > Well I found this commit changing the original "> 0" to ">= 0" - 
>> > > https://github.com/cloudius-systems/osv/commit/f888c39d744c38cf6fa2f7568c04cea3a7217dca
>> > > - with some explanation. So maybe my theory is wrong. Also I am 
>> > > not sure calling bytes_until_normal() before that "if" would change 
>> > > anything then. 
>> > > 
>> > > On Tuesday, March 10, 2020 at 12:04:27 PM UTC-4, Waldek Kozaczuk 
>> > > wrote: 
>> > > > So here is the full code of _do_reclaim(): 
>> > > > 
>> > > > 1001 void reclaimer::_do_reclaim() 
>> > > > 1002 { 
>> > > > 1003     ssize_t target; 
>> > > > 1004     emergency_alloc_level = 1; 
>> > > > 1005 
>> > > > 1006     while (true) { 
>> > > > 1007         WITH_LOCK(free_page_ranges_lock) { 
>> > > > 1008             _blocked.wait(free_page_ranges_lock); 
>> > > > 1009             target = bytes_until_normal(); 
>> > > > 1010         } 
>> > > > 1011 
>> > > > 1012         // This means that we are currently ballooning, we should 
>> > > > 1013         // try to serve the waiters from temporary memory without 
>> > > > 1014         // going on hard mode. A big batch of more memory is likely 
>> > > > 1015         // in its way. 
>> > > > 1016         if (_oom_blocked.has_waiters() && throttling_needed()) { 
>> > > > 1017             _shrinker_loop(target, [] { return false; }); 
>> > > > 1018             WITH_LOCK(free_page_ranges_lock) { 
>> > > > 1019                 if (_oom_blocked.wake_waiters()) { 
>> > > > 1020                         continue; 
>> > > > 1021                 } 
>> > > > 1022             } 
>> > > > 1023         } 
>> > > > 1024 
>> > > > 1025         _shrinker_loop(target, [this] { return _oom_blocked.has_waiters(); }); 
>> > > > 1026 
>> > > > 1027         WITH_LOCK(free_page_ranges_lock) { 
>> > > > 1028             if (target >= 0) { 
>> > > > 1029                 // Wake up all waiters that are waiting and now have a chance to succeed. 
>> > > > 1030                 // If we could not wake any, there is nothing really we can do. 
>> > > > 1031                 if (!_oom_blocked.wake_waiters()) { 
>> > > > 1032                     oom(); 
>> > > > 1033                 } 
>> > > > 1034             } 
>> > > > 1035 
>> > > > 1036             if (balloon_api) { 
>> > > > 1037                 balloon_api->voluntary_return(); 
>> > > > 1038             } 
>> > > > 1039         } 
>> > > > 1040     } 
>> > > > 1041 } 
>> > > > 
>> > > > We got oom() because target was '>= 0'. Now the target is 
>> > > > calculated as the result of  bytes_until_normal(). 
>> > > > 
>> > > >  495 ssize_t reclaimer::bytes_until_normal(pressure curr) 
>> > > >  496 { 
>> > > >  497     assert(mutex_owned(&free_page_ranges_lock)); 
>> > > >  498     if (curr == pressure::PRESSURE) { 
>> > > >  499         return watermark_lo - stats::free(); 
>> > > >  500     } else { 
>> > > >  501         return 0; 
>> > > >  502     } 
>> > > >  503 } 
>> > > > 
>> > > > which seems to indicate that when 0 is returned there is no need to 
>> > > > reclaim any memory. 
>> > > > 
>> > > > So here are two things that might be wrong: 
>> > > > 
>> > > > 1. Shouldn't if (target >= 0) be changed to if (target > 0) {? 
>> > > > 
>> > > > 2. Shouldn't we re-read the target in second WITH_LOCK instead of 
>> > > > comparing the original value in the beginning of the body of the 
>> > > > loop? The line before - _shrinker_loop(target, [this] { return 
>> > > > _oom_blocked.has_waiters(); }); - might have just released enough 
>> > > > memory to bring target below 0, right? 
>> > > > 
>> > > > In any case it would be useful to print the value of the target 
>> > > > before oom(): 
>> > > > 
>> > > >                  if (!_oom_blocked.wake_waiters()) { 
>> > > > 
>> > > >                      printf("--> Target: %ld\n", target); 
>> > > > 
>> > > >                      oom(); 
>> > > > 
>> > > >                  } 
>> > > > 
>> > > > 
>> > > > On Monday, March 9, 2020 at 10:51:02 PM UTC-4, rickp wrote: 
>> > > > > We're pretty close to current on OSv, but it also happens on an 
>> > > > > older 
>> > > > > image. We have changed some stuff in our app, but I think that 
>> > > > > may just 
>> > > > > be provking the bug. Certainly from gdb, I can see that both 
>> > > > > mmaped and 
>> > > > > normal memory fluctuate up and down but eveything looks sane. 
>> > > > > 
>> > > > > More debug in wake_waiters would be useful, but I'm losing the 
>> > > > > argument 
>> > > > > to continue with OSv at the moment which makes testing this a 
>> > > > > bit 
>> > > > > 'political'. 
>> > > > > 
>> > > > > btw - when we do run the system out of memory, it seems to hang 
>> > > > > rather 
>> > > > > than generate an oom. Have you tried it? 
>> > > > > 
>> > > > > The tcp_do_segment one has been mentioned before (by someone 
>> > > > > else). The 
>> > > > > issue is that the kassert only has effect in the debug build. 
>> > > > > I'd guess 
>> > > > > that the socket is being closed, but still has segments that 
>> > > > > have not 
>> > > > > been processed, or something like that. I'll try and narrow it 
>> > > > > down a 
>> > > > > bit if I get time. 
>> > > > > 
>> > > > > Rick 
>> > > > > 
>> > > > > On Mon, 2020-03-09 at 22:32 -0400, Waldek Kozaczuk wrote: 
>> > > > > > Does it happen with the very latest OSv code? Did it start 
>> > > > > happening 
>> > > > > > at some point more often? 
>> > > > > > 
>> > > > > > I wonder if we could add some helpful printouts in 
>> > > > > wake_waiters(). 
>> > > > > > 
>> > > > > > Btw that assert() failure in tcp_do_segment() rings a bell. 
>> > > > > > 
>> > > > > > > On Mon, Mar 9, 2020 at 22:25 Rick Payne <ri...@rossfell.co.uk> 
>> > > > > > > wrote: 
>> > > > > > > I can't add much other than I doubt its fragmentation. 
>> > > > > Sometimes 
>> > > > > > > this 
>> > > > > > > happens within a few minutes of the system starting. At no 
>> > > > > point do 
>> > > > > > > I 
>> > > > > > > think we're using more than 2GB of ram (of the 12GB) 
>> > > > > either. 
>> > > > > > > 
>> > > > > > > I did compile up a debug verison of OSv and built the 
>> > > > > system with 
>> > > > > > > that, 
>> > > > > > > but I've been unable to trigger the oom(). Worse, I hit a 
>> > > > > kassert 
>> > > > > > > in 
>> > > > > > > the netchannel code that seems to be ignored in the 
>> > > > > 'release' 
>> > > > > > > build, 
>> > > > > > > but panics in the debug build: 
>> > > > > > > 
>> > > > > > > [E/384 bsd-kassert]: tcp_do_segment: TCPS_LISTEN 
>> > > > > > > Assertion failed: tp->get_state() > 1 
>> > > > > > > (bsd/sys/netinet/tcp_input.cc: 
>> > > > > > > tcp_do_segment: 1076) 
>> > > > > > > 
>> > > > > > > [backtrace] 
>> > > > > > > 0x0000000040221330 <abort(char const*, ...)+280> 
>> > > > > > > 0x0000000040221399 <__assert_fail+64> 
>> > > > > > > 0x00000000402a4798 <???+1076512664> 
>> > > > > > > 0x00000000402a97c2 <???+1076533186> 
>> > > > > > > 0x00000000402a98a1 <???+1076533409> 
>> > > > > > > 0x00000000402aa448 <???+1076536392> 
>> > > > > > > 0x0000000040656a9a <std::function<void 
>> > > > > (mbuf*)>::operator()(mbuf*) 
>> > > > > > > const+76> 
>> > > > > > > 0x0000000040655855 <net_channel::process_queue()+61> 
>> > > > > > > 0x000000004023b165 <???+1076080997> 
>> > > > > > > 0x000000004023b4d7 <soclose+878> 
>> > > > > > > 0x000000004024cd21 <socket_file::close()+51> 
>> > > > > > > 0x00000000406a6a10 <fdrop+151> 
>> > > > > > > 0x00000000406a64f7 <fdclose(int)+184> 
>> > > > > > > 0x000000004067cd42 <close+41> 
>> > > > > > > 
>> > > > > > > So at the moment, I'm a bit stuck with getting any more 
>> > > > > info... 
>> > > > > > > 
>> > > > > > > Rick 
>> > > > > > > 
>> > > > > > > On Mon, 2020-03-09 at 08:52 -0700, Waldek Kozaczuk wrote: 
>> > > > > > > > As I understand this stack trace the oom() was called 
>> > > > > here as 
>> > > > > > > part of 
>> > > > > > > > _do_reclaim(): 
>> > > > > > > > 
>> > > > > > > > 1025         WITH_LOCK(free_page_ranges_lock) { 
>> > > > > > > > 1026             if (target >= 0) { 
>> > > > > > > > 1027                 // Wake up all waiters that are 
>> > > > > waiting and 
>> > > > > > > now 
>> > > > > > > > have a chance to succeed. 
>> > > > > > > > 1028                 // If we could not wake any, there 
>> > > > > is 
>> > > > > > > nothing 
>> > > > > > > > really we can do. 
>> > > > > > > > 1029                 if (!_oom_blocked.wake_waiters()) { 
>> > > > > > > > 1030                     oom(); 
>> > > > > > > > 1031                 } 
>> > > > > > > > 1032             } 
>> > > > > > > > 1033 
>> > > > > > > > 1034             if (balloon_api) { 
>> > > > > > > > 1035                 balloon_api->voluntary_return(); 
>> > > > > > > > 1036             } 
>> > > > > > > > 1037         } 
>> > > > > > > > 
>> > > > > > > > so it seems wake_waiters() returned false. I wonder if 
>> > > > > the memory 
>> > > > > > > was 
>> > > > > > > > heavily fragmented or there is some logical bug in there. 
>> > > > > This 
>> > > > > > > method 
>> > > > > > > > is called from two places and I wonder if this part of 
>> > > > > > > wake_waiters() 
>> > > > > > > > is correct: 
>> > > > > > > > 
>> > > > > > > >  921     if (!_waiters.empty()) { 
>> > > > > > > >  922         reclaimer_thread.wake(); 
>> > > > > > > >  923     } 
>> > > > > > > >  924     return woken; 
>> > > > > > > > 
>> > > > > > > > 
>> > > > > > > > should this if also set woken to true? 
>> > > > > > > > 
>> > > > > > > > Also could we also enhance the oom() logic to print out 
>> > > > > more 
>> > > > > > > useful 
>> > > > > > > > information if this happens once again? 
>> > > > > > > > 
>> > > > > > > > On Tuesday, March 3, 2020 at 2:21:40 AM UTC-5, rickp 
>> > > > > wrote: 
>> > > > > > > > > Had a crash on a system that I don't understand. Its a 
>> > > > > VM with 
>> > > > > > > > > 12GB 
>> > > > > > > > > allocated, we were running without about 10.5GB free 
>> > > > > according 
>> > > > > > > to 
>> > > > > > > > > the 
>> > > > > > > > > API. 
>> > > > > > > > > 
>> > > > > > > > > Out of the blue, we had a panic: 
>> > > > > > > > > 
>> > > > > > > > > Out of memory: could not reclaim any further. Current 
>> > > > > memory: 
>> > > > > > > > > 10954988 
>> > > > > > > > > Kb 
>> > > > > > > > > [backtrace] 
>> > > > > > > > > 0x00000000403f6320 <memory::oom()+32> 
>> > > > > > > > > 0x00000000403f71cc 
>> > > > > <memory::reclaimer::_do_reclaim()+380> 
>> > > > > > > > > 0x00000000403f722f <???+1077899823> 
>> > > > > > > > > 0x000000004040f29b <thread_main_c+43> 
>> > > > > > > > > 0x00000000403ae412 <???+1077601298> 
>> > > > > > > > > 
>> > > > > > > > > The 'Out of memory' message seems to print 
>> > > > > stats::free() and 
>> > > > > > > that 
>> > > > > > > > > number suggests we have plenty of free ram. 
>> > > > > > > > > 
>> > > > > > > > > Have I misunderstood, or is there something I need to 
>> > > > > be 
>> > > > > > > looking 
>> > > > > > > > > at? 
>> > > > > > > > > 
>> > > > > > > > > Cheers, 
>> > > > > > > > > Rick 
>> > > > > > > > > 
>> > > > > > > > 
>> > > > > > > > -- 
>> > > > > > > > You received this message because you are subscribed to the Google Groups "OSv Development" group. 
>> > > > > > > > To unsubscribe from this group and stop receiving emails from it, send an email to osv...@googlegroups.com. 
>> > > > > > > > To view this discussion on the web visit 
>> > > > > > > > https://groups.google.com/d/msgid/osv-dev/8f7e00a5-edfe-4487-aa5a-5072a560c6e3%40googlegroups.com 
>> > > > > > > > . 
