I can't add much other than I doubt it's fragmentation. Sometimes this
happens within a few minutes of the system starting. At no point do I
think we're using more than 2GB of RAM (of the 12GB) either.

I did compile a debug version of OSv and built the system with that,
but I've been unable to trigger the oom(). Worse, I hit a kassert in
the netchannel code that seems to be ignored in the 'release' build
but panics in the debug build:

[E/384 bsd-kassert]: tcp_do_segment: TCPS_LISTEN
Assertion failed: tp->get_state() > 1 (bsd/sys/netinet/tcp_input.cc: tcp_do_segment: 1076)

[backtrace]
0x0000000040221330 <abort(char const*, ...)+280>
0x0000000040221399 <__assert_fail+64>
0x00000000402a4798 <???+1076512664>
0x00000000402a97c2 <???+1076533186>
0x00000000402a98a1 <???+1076533409>
0x00000000402aa448 <???+1076536392>
0x0000000040656a9a <std::function<void (mbuf*)>::operator()(mbuf*) const+76>
0x0000000040655855 <net_channel::process_queue()+61>
0x000000004023b165 <???+1076080997>
0x000000004023b4d7 <soclose+878>
0x000000004024cd21 <socket_file::close()+51>
0x00000000406a6a10 <fdrop+151>
0x00000000406a64f7 <fdclose(int)+184>
0x000000004067cd42 <close+41>
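
For what it's worth, my working assumption about why the release build
sails past the same condition is something like the sketch below: a
KASSERT-style macro that only logs the failure unless assertions are
compiled in. This is purely illustrative, not OSv's actual bsd-kassert
macro:

    // Illustrative sketch, not OSv's real macro: log the failed check in
    // a release build, but turn it into a hard assertion failure in a
    // debug build (where NDEBUG is not defined).
    #include <cassert>
    #include <cstdio>

    #define SKETCH_KASSERT(cond, msg)                                   \
        do {                                                            \
            if (!(cond)) {                                              \
                std::fprintf(stderr, "[E bsd-kassert]: %s\n", (msg));   \
                assert(!(msg));  /* compiled out when NDEBUG is set */  \
            }                                                           \
        } while (0)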

So at the moment, I'm a bit stuck with getting any more info...

Rick

On Mon, 2020-03-09 at 08:52 -0700, Waldek Kozaczuk wrote:
> As I understand this stack trace, the oom() was called here as part
> of _do_reclaim():
> 
> 1025         WITH_LOCK(free_page_ranges_lock) {
> 1026             if (target >= 0) {
> 1027                 // Wake up all waiters that are waiting and now have a chance to succeed.
> 1028                 // If we could not wake any, there is nothing really we can do.
> 1029                 if (!_oom_blocked.wake_waiters()) {
> 1030                     oom();
> 1031                 }
> 1032             }
> 1033 
> 1034             if (balloon_api) {
> 1035                 balloon_api->voluntary_return();
> 1036             }
> 1037         }
> 
> So it seems wake_waiters() returned false. I wonder if the memory was
> heavily fragmented, or if there is some logical bug in there. This
> method is called from two places, and I wonder if this part of
> wake_waiters() is correct:
> 
>  921     if (!_waiters.empty()) {
>  922         reclaimer_thread.wake();
>  923     }
>  924     return woken;
> 
> 
> Should this if-branch also set woken to true?
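> 
> Something like this, perhaps (just a sketch against the four lines
> above, not a tested patch):
> 
>     if (!_waiters.empty()) {
>         reclaimer_thread.wake();
>         // also count deferring to the reclaimer thread as progress,
>         // so the caller does not fall through to oom()
>         woken = true;
>     }
>     return woken;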
> 
> Also, could we enhance the oom() logic to print out more useful
> information if this happens again?
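> 
> For example, something along these lines (a sketch only: the abort()
> format and stats::free() are what the current message already uses,
> while the waiter count and largest-free-range values are hypothetical
> placeholders that would need real accessors):
> 
>     static void oom()
>     {
>         // assuming stats::free() returns bytes, as the existing
>         // "Current memory: ... Kb" message suggests
>         abort("Out of memory: could not reclaim any further.\n"
>               "Current memory: %lu Kb, blocked waiters: %lu, "
>               "largest free range: %lu Kb\n",
>               stats::free() / 1024,
>               oom_waiter_count,        // hypothetical counter
>               largest_free_range_kb);  // hypothetical accessor
>     }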
> 
> On Tuesday, March 3, 2020 at 2:21:40 AM UTC-5, rickp wrote:
> > Had a crash on a system that I don't understand. It's a VM with
> > 12GB allocated, and we were running with about 10.5GB free
> > according to the API.
> > 
> > Out of the blue, we had a panic: 
> > 
> > Out of memory: could not reclaim any further. Current memory: 10954988 Kb
> > [backtrace] 
> > 0x00000000403f6320 <memory::oom()+32> 
> > 0x00000000403f71cc <memory::reclaimer::_do_reclaim()+380> 
> > 0x00000000403f722f <???+1077899823> 
> > 0x000000004040f29b <thread_main_c+43> 
> > 0x00000000403ae412 <???+1077601298> 
> > 
> > The 'Out of memory' message seems to print stats::free(), and that
> > number suggests we have plenty of free RAM.
> > 
> > Have I misunderstood, or is there something I need to be looking at?
> > 
> > Cheers, 
> > Rick 
> > 
> 
