Re: [osv-dev] Re: OOM query

2020-03-24 Thread Waldek Kozaczuk
I remember it is an Erlang app. Is there any comparable open source
app/example I could run to re-create your issue?

On Tue, Mar 24, 2020 at 17:49 Waldek Kozaczuk  wrote:

> Also, besides the main app, what other OSv modules do you have in the
> image? Httpserver-api, cloud-init, ?
>
> On Tue, Mar 24, 2020 at 17:44 Waldek Kozaczuk 
> wrote:
>
>>
>>
>> On Tue, Mar 24, 2020 at 17:34 Rick Payne  wrote:
>>
>>>
>>> I backed out the original patch, applied the other two. Do I need the
>>> first one still?
>>
>>
>> Nope. You should be fine.
>>
>>> We're not on master, but we're pretty close. Last
>>> synced on March 3rd, commit ref 92eb26f3a645
>>>
>>> scripts/build check runs fine:
>>>
>>> OK (131 tests run, 274.753 s)
>>
>> Well there may be a bug in my patches. It looks my patches make your
>> situation even worse. Did you try to connect with gdb to get better stack
>> trace? Ideally ‘osv info threads’
>>
>>>
>>>
>>> Rick
>>>
>>> On Tue, 2020-03-24 at 17:15 -0400, Waldek Kozaczuk wrote:
>>> > Is it with the exact same code as on master with the latest 2 patches
>>> > I sent applied ? Does ‘scripts/build check’ pass for you?
>>> >
>>> > On Tue, Mar 24, 2020 at 16:56 Rick Payne 
>>> > wrote:
>>> > > I tried the patches, but its crashing almost instantly...
>>> > >
>>> > > page fault outside application, addr: 0x56c0
>>> > > [registers]
>>> > > RIP: 0x403edd23 
>>> > > RFL:
>>> > > 0x00010206  CS:  0x0008  SS:
>>> > > 0x0010
>>> > > RAX: 0x56c0  RBX: 0x200056c00040  RCX:
>>> > > 0x004c  RDX: 0x0008
>>> > > RSI: 0x004c  RDI: 0x200056c00040  RBP:
>>> > > 0x200041501740  R8:  0x
>>> > > R9:  0x5e7a7333  R10: 0x  R11:
>>> > > 0x  R12: 0x004c
>>> > > R13: 0x5e7a7333  R14: 0x  R15:
>>> > > 0x  RSP: 0x2000415016f8
>>> > > Aborted
>>> > >
>>> > > [backtrace]
>>> > > 0x40343779 
>>> > > 0x4034534d >> > > exception_frame*)+397>
>>> > > 0x403a667b 
>>> > > 0x403a54c6 
>>> > > 0x1174c2b0 
>>> > > 0x 
>>> > >
>>> > > (gdb) osv heap
>>> > > 0x8e9ad000 0x22a53000
>>> > > 0x8e9a2000 0x4000
>>> > >
>>> > > Rick
>>> > >
>>> > >
>>> > >
>>> > > On Mon, 2020-03-23 at 22:06 -0700, Waldek Kozaczuk wrote:
>>> > > > I have sent a more complete patch that should also address
>>> > > > fragmentation issue with requests >= 4K and < 2MB.
>>> > > >
>>> > > > On Monday, March 23, 2020 at 6:12:51 PM UTC-4, Waldek Kozaczuk
>>> > > wrote:
>>> > > > > I have just sent a new patch to the mailing list. I am hoping
>>> > > it
>>> > > > > will address the OOM crash if my theory of heavy memory
>>> > > > > fragmentation is right. It would be nice if Nadav could review
>>> > > it.
>>> > > > >
>>> > > > > Regardless if you have another crash in production and are able
>>> > > to
>>> > > > > connect with gdb, could you run 'osv heap' - it should show
>>> > > > > free_page_ranges. If memory is heavily fragmented we should see
>>> > > a
>>> > > > > long list.
>>> > > > >
>>> > > > > It would be nice to recreate that load in dev env and capture
>>> > > the
>>> > > > > memory trace data (BWT you do not need to enable backtrace to
>>> > > have
>>> > > > > enough useful information). It would help us better understand
>>> > > how
>>> > > > > memory is allocated by the app. I saw you send me one trace but
>>> > > it
>>> > > > > does not seem to be revealing anything interesting.
>>> > > > >
>>> > > > > Waldek
>>> > > > >
>>> > > > > On Monday, March 23, 2020 at 1:19:18 AM UTC-4, rickp wrote:
>>> > > > > > On Sun, 2020-03-22 at 22:08 -0700, Waldek Kozaczuk wrote:
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp
>>> > > wrote:
>>> > > > > > > > Looks to me like its trying to allocate 40MB but the
>>> > > > > > available
>>> > > > > > > > memory
>>> > > > > > > > is 10GB, surely? 10933128KB is 10,933MB
>>> > > > > > > >
>>> > > > > > >
>>> > > > > > > I misread the number - forgot about 1K.
>>> > > > > > >
>>> > > > > > > Any chance you could run the app outside of production with
>>> > > > > > memory
>>> > > > > > > tracing enabled -
>>> > > > > > >
>>> > > > > >
>>> > >
>>> https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
>>> > > > > >
>>> > > > > > >  (without --trace-backtrace) for while? And then we can
>>> > > have a
>>> > > > > > better
>>> > > > > > > sense of what kind of allocations it makes. The output of
>>> > > > > > trace
>>> > > > > > > memory-analyzer would be really helpful.
>>> > > > > >
>>> > > > > > I can certainly run that local with locally generated
>>> > > workloads,
>>> > > > > > which
>>> > > > > > should be close enough - but we've never managed to trigger
>>> > > the
>>> > > > > > oom
>>> > > > > > condition that way (other 

Re: [osv-dev] Re: OOM query

2020-03-24 Thread Waldek Kozaczuk
Also, besides the main app, what other OSv modules do you have in the
image? Httpserver-api, cloud-init, ?

On Tue, Mar 24, 2020 at 17:44 Waldek Kozaczuk  wrote:

>
>
> On Tue, Mar 24, 2020 at 17:34 Rick Payne  wrote:
>
>>
>> I backed out the original patch, applied the other two. Do I need the
>> first one still?
>
>
> Nope. You should be fine.
>
>> We're not on master, but we're pretty close. Last
>> synced on March 3rd, commit ref 92eb26f3a645
>>
>> scripts/build check runs fine:
>>
>> OK (131 tests run, 274.753 s)
>
> Well there may be a bug in my patches. It looks my patches make your
> situation even worse. Did you try to connect with gdb to get better stack
> trace? Ideally ‘osv info threads’
>
>>
>>
>> Rick
>>
>> On Tue, 2020-03-24 at 17:15 -0400, Waldek Kozaczuk wrote:
>> > Is it with the exact same code as on master with the latest 2 patches
>> > I sent applied ? Does ‘scripts/build check’ pass for you?
>> >
>> > On Tue, Mar 24, 2020 at 16:56 Rick Payne 
>> > wrote:
>> > > I tried the patches, but its crashing almost instantly...
>> > >
>> > > page fault outside application, addr: 0x56c0
>> > > [registers]
>> > > RIP: 0x403edd23 
>> > > RFL:
>> > > 0x00010206  CS:  0x0008  SS:
>> > > 0x0010
>> > > RAX: 0x56c0  RBX: 0x200056c00040  RCX:
>> > > 0x004c  RDX: 0x0008
>> > > RSI: 0x004c  RDI: 0x200056c00040  RBP:
>> > > 0x200041501740  R8:  0x
>> > > R9:  0x5e7a7333  R10: 0x  R11:
>> > > 0x  R12: 0x004c
>> > > R13: 0x5e7a7333  R14: 0x  R15:
>> > > 0x  RSP: 0x2000415016f8
>> > > Aborted
>> > >
>> > > [backtrace]
>> > > 0x40343779 
>> > > 0x4034534d > > > exception_frame*)+397>
>> > > 0x403a667b 
>> > > 0x403a54c6 
>> > > 0x1174c2b0 
>> > > 0x 
>> > >
>> > > (gdb) osv heap
>> > > 0x8e9ad000 0x22a53000
>> > > 0x8e9a2000 0x4000
>> > >
>> > > Rick
>> > >
>> > >
>> > >
>> > > On Mon, 2020-03-23 at 22:06 -0700, Waldek Kozaczuk wrote:
>> > > > I have sent a more complete patch that should also address
>> > > > fragmentation issue with requests >= 4K and < 2MB.
>> > > >
>> > > > On Monday, March 23, 2020 at 6:12:51 PM UTC-4, Waldek Kozaczuk
>> > > wrote:
>> > > > > I have just sent a new patch to the mailing list. I am hoping
>> > > it
>> > > > > will address the OOM crash if my theory of heavy memory
>> > > > > fragmentation is right. It would be nice if Nadav could review
>> > > it.
>> > > > >
>> > > > > Regardless if you have another crash in production and are able
>> > > to
>> > > > > connect with gdb, could you run 'osv heap' - it should show
>> > > > > free_page_ranges. If memory is heavily fragmented we should see
>> > > a
>> > > > > long list.
>> > > > >
>> > > > > It would be nice to recreate that load in dev env and capture
>> > > the
>> > > > > memory trace data (BWT you do not need to enable backtrace to
>> > > have
>> > > > > enough useful information). It would help us better understand
>> > > how
>> > > > > memory is allocated by the app. I saw you send me one trace but
>> > > it
>> > > > > does not seem to be revealing anything interesting.
>> > > > >
>> > > > > Waldek
>> > > > >
>> > > > > On Monday, March 23, 2020 at 1:19:18 AM UTC-4, rickp wrote:
>> > > > > > On Sun, 2020-03-22 at 22:08 -0700, Waldek Kozaczuk wrote:
>> > > > > > >
>> > > > > > >
>> > > > > > > On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp
>> > > wrote:
>> > > > > > > > Looks to me like its trying to allocate 40MB but the
>> > > > > > available
>> > > > > > > > memory
>> > > > > > > > is 10GB, surely? 10933128KB is 10,933MB
>> > > > > > > >
>> > > > > > >
>> > > > > > > I misread the number - forgot about 1K.
>> > > > > > >
>> > > > > > > Any chance you could run the app outside of production with
>> > > > > > memory
>> > > > > > > tracing enabled -
>> > > > > > >
>> > > > > >
>> > >
>> https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
>> > > > > >
>> > > > > > >  (without --trace-backtrace) for while? And then we can
>> > > have a
>> > > > > > better
>> > > > > > > sense of what kind of allocations it makes. The output of
>> > > > > > trace
>> > > > > > > memory-analyzer would be really helpful.
>> > > > > >
>> > > > > > I can certainly run that local with locally generated
>> > > workloads,
>> > > > > > which
>> > > > > > should be close enough - but we've never managed to trigger
>> > > the
>> > > > > > oom
>> > > > > > condition that way (other than by really constraining the
>> > > memory
>> > > > > > artificially). It should be close enough though - let me see
>> > > what
>> > > > > > I can
>> > > > > > do.
>> > > > > >
>> > > > > > Rick
>> > > > > >
>> > > > > >
>> > > >
>> > > > --
>> > > > You received this message because you are subscribed to the
>> > 

Re: [osv-dev] Re: OOM query

2020-03-24 Thread Waldek Kozaczuk
On Tue, Mar 24, 2020 at 17:34 Rick Payne  wrote:

>
> I backed out the original patch, applied the other two. Do I need the
> first one still?


Nope. You should be fine.

> We're not on master, but we're pretty close. Last
> synced on March 3rd, commit ref 92eb26f3a645
>
> scripts/build check runs fine:
>
> OK (131 tests run, 274.753 s)

Well, there may be a bug in my patches. It looks like my patches make your
situation even worse. Did you try to connect with gdb to get a better stack
trace? Ideally ‘osv info threads’.

>
>
> Rick
>
> On Tue, 2020-03-24 at 17:15 -0400, Waldek Kozaczuk wrote:
> > Is it with the exact same code as on master with the latest 2 patches
> > I sent applied ? Does ‘scripts/build check’ pass for you?
> >
> > On Tue, Mar 24, 2020 at 16:56 Rick Payne 
> > wrote:
> > > I tried the patches, but its crashing almost instantly...
> > >
> > > page fault outside application, addr: 0x56c0
> > > [registers]
> > > RIP: 0x403edd23 
> > > RFL:
> > > 0x00010206  CS:  0x0008  SS:
> > > 0x0010
> > > RAX: 0x56c0  RBX: 0x200056c00040  RCX:
> > > 0x004c  RDX: 0x0008
> > > RSI: 0x004c  RDI: 0x200056c00040  RBP:
> > > 0x200041501740  R8:  0x
> > > R9:  0x5e7a7333  R10: 0x  R11:
> > > 0x  R12: 0x004c
> > > R13: 0x5e7a7333  R14: 0x  R15:
> > > 0x  RSP: 0x2000415016f8
> > > Aborted
> > >
> > > [backtrace]
> > > 0x40343779 
> > > 0x4034534d  > > exception_frame*)+397>
> > > 0x403a667b 
> > > 0x403a54c6 
> > > 0x1174c2b0 
> > > 0x 
> > >
> > > (gdb) osv heap
> > > 0x8e9ad000 0x22a53000
> > > 0x8e9a2000 0x4000
> > >
> > > Rick
> > >
> > >
> > >
> > > On Mon, 2020-03-23 at 22:06 -0700, Waldek Kozaczuk wrote:
> > > > I have sent a more complete patch that should also address
> > > > fragmentation issue with requests >= 4K and < 2MB.
> > > >
> > > > On Monday, March 23, 2020 at 6:12:51 PM UTC-4, Waldek Kozaczuk
> > > wrote:
> > > > > I have just sent a new patch to the mailing list. I am hoping
> > > it
> > > > > will address the OOM crash if my theory of heavy memory
> > > > > fragmentation is right. It would be nice if Nadav could review
> > > it.
> > > > >
> > > > > Regardless if you have another crash in production and are able
> > > to
> > > > > connect with gdb, could you run 'osv heap' - it should show
> > > > > free_page_ranges. If memory is heavily fragmented we should see
> > > a
> > > > > long list.
> > > > >
> > > > > It would be nice to recreate that load in dev env and capture
> > > the
> > > > > memory trace data (BWT you do not need to enable backtrace to
> > > have
> > > > > enough useful information). It would help us better understand
> > > how
> > > > > memory is allocated by the app. I saw you send me one trace but
> > > it
> > > > > does not seem to be revealing anything interesting.
> > > > >
> > > > > Waldek
> > > > >
> > > > > On Monday, March 23, 2020 at 1:19:18 AM UTC-4, rickp wrote:
> > > > > > On Sun, 2020-03-22 at 22:08 -0700, Waldek Kozaczuk wrote:
> > > > > > >
> > > > > > >
> > > > > > > On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp
> > > wrote:
> > > > > > > > Looks to me like its trying to allocate 40MB but the
> > > > > > available
> > > > > > > > memory
> > > > > > > > is 10GB, surely? 10933128KB is 10,933MB
> > > > > > > >
> > > > > > >
> > > > > > > I misread the number - forgot about 1K.
> > > > > > >
> > > > > > > Any chance you could run the app outside of production with
> > > > > > memory
> > > > > > > tracing enabled -
> > > > > > >
> > > > > >
> > >
> https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
> > > > > >
> > > > > > >  (without --trace-backtrace) for while? And then we can
> > > have a
> > > > > > better
> > > > > > > sense of what kind of allocations it makes. The output of
> > > > > > trace
> > > > > > > memory-analyzer would be really helpful.
> > > > > >
> > > > > > I can certainly run that local with locally generated
> > > workloads,
> > > > > > which
> > > > > > should be close enough - but we've never managed to trigger
> > > the
> > > > > > oom
> > > > > > condition that way (other than by really constraining the
> > > memory
> > > > > > artificially). It should be close enough though - let me see
> > > what
> > > > > > I can
> > > > > > do.
> > > > > >
> > > > > > Rick
> > > > > >
> > > > > >
> > > >

Re: [osv-dev] Re: OOM query

2020-03-24 Thread Rick Payne


I backed out the original patch, applied the other two. Do I need the
first one still? We're not on master, but we're pretty close. Last
synced on March 3rd, commit ref 92eb26f3a645

scripts/build check runs fine:

OK (131 tests run, 274.753 s)

Rick

On Tue, 2020-03-24 at 17:15 -0400, Waldek Kozaczuk wrote:
> Is it with the exact same code as on master with the latest 2 patches
> I sent applied ? Does ‘scripts/build check’ pass for you?
> 
> On Tue, Mar 24, 2020 at 16:56 Rick Payne 
> wrote:
> > I tried the patches, but its crashing almost instantly...
> > 
> > page fault outside application, addr: 0x56c0
> > [registers]
> > RIP: 0x403edd23 
> > RFL:
> > 0x00010206  CS:  0x0008  SS: 
> > 0x0010
> > RAX: 0x56c0  RBX: 0x200056c00040  RCX:
> > 0x004c  RDX: 0x0008
> > RSI: 0x004c  RDI: 0x200056c00040  RBP:
> > 0x200041501740  R8:  0x
> > R9:  0x5e7a7333  R10: 0x  R11:
> > 0x  R12: 0x004c
> > R13: 0x5e7a7333  R14: 0x  R15:
> > 0x  RSP: 0x2000415016f8
> > Aborted
> > 
> > [backtrace]
> > 0x40343779 
> > 0x4034534d  > exception_frame*)+397>
> > 0x403a667b 
> > 0x403a54c6 
> > 0x1174c2b0 
> > 0x 
> > 
> > (gdb) osv heap
> > 0x8e9ad000 0x22a53000
> > 0x8e9a2000 0x4000
> > 
> > Rick
> > 
> > 
> > 
> > On Mon, 2020-03-23 at 22:06 -0700, Waldek Kozaczuk wrote:
> > > I have sent a more complete patch that should also address
> > > fragmentation issue with requests >= 4K and < 2MB.
> > > 
> > > On Monday, March 23, 2020 at 6:12:51 PM UTC-4, Waldek Kozaczuk
> > wrote:
> > > > I have just sent a new patch to the mailing list. I am hoping
> > it
> > > > will address the OOM crash if my theory of heavy memory
> > > > fragmentation is right. It would be nice if Nadav could review
> > it.
> > > > 
> > > > Regardless if you have another crash in production and are able
> > to
> > > > connect with gdb, could you run 'osv heap' - it should show
> > > > free_page_ranges. If memory is heavily fragmented we should see
> > a
> > > > long list.
> > > > 
> > > > It would be nice to recreate that load in dev env and capture
> > the
> > > > memory trace data (BWT you do not need to enable backtrace to
> > have
> > > > enough useful information). It would help us better understand
> > how
> > > > memory is allocated by the app. I saw you send me one trace but
> > it
> > > > does not seem to be revealing anything interesting.
> > > > 
> > > > Waldek 
> > > > 
> > > > On Monday, March 23, 2020 at 1:19:18 AM UTC-4, rickp wrote:
> > > > > On Sun, 2020-03-22 at 22:08 -0700, Waldek Kozaczuk wrote: 
> > > > > > 
> > > > > > 
> > > > > > On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp
> > wrote: 
> > > > > > > Looks to me like its trying to allocate 40MB but the
> > > > > available 
> > > > > > > memory 
> > > > > > > is 10GB, surely? 10933128KB is 10,933MB 
> > > > > > > 
> > > > > > 
> > > > > > I misread the number - forgot about 1K. 
> > > > > > 
> > > > > > Any chance you could run the app outside of production with
> > > > > memory 
> > > > > > tracing enabled - 
> > > > > > 
> > > > > 
> > https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
> > > > >  
> > > > > >  (without --trace-backtrace) for while? And then we can
> > have a
> > > > > better 
> > > > > > sense of what kind of allocations it makes. The output of
> > > > > trace 
> > > > > > memory-analyzer would be really helpful. 
> > > > > 
> > > > > I can certainly run that local with locally generated
> > workloads,
> > > > > which 
> > > > > should be close enough - but we've never managed to trigger
> > the
> > > > > oom 
> > > > > condition that way (other than by really constraining the
> > memory 
> > > > > artificially). It should be close enough though - let me see
> > what
> > > > > I can 
> > > > > do. 
> > > > > 
> > > > > Rick 
> > > > > 
> > > > > 
> > > 
> > > -- 
> > > You received this message because you are subscribed to the
> > Google
> > > Groups "OSv Development" group.
> > > To unsubscribe from this group and stop receiving emails from it,
> > > send an email to osv-dev+unsubscr...@googlegroups.com.
> > > To view this discussion on the web visit 
> > > 
> > https://groups.google.com/d/msgid/osv-dev/e88bb103-ca6f-4f12-b944-e2d1391e2f8e%40googlegroups.com
> > > .
> > 

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/75323fbb15af066032b00c4d0bcfce0f10b2c9af.camel%40rossfell.co.uk.


Re: [osv-dev] Re: OOM query

2020-03-24 Thread Waldek Kozaczuk
Is it with the exact same code as on master, with the latest 2 patches I
sent applied? Does ‘scripts/build check’ pass for you?

On Tue, Mar 24, 2020 at 16:56 Rick Payne  wrote:

>
> I tried the patches, but its crashing almost instantly...
>
> page fault outside application, addr: 0x56c0
> [registers]
> RIP: 0x403edd23 
> RFL:
> 0x00010206  CS:  0x0008  SS:  0x0010
> RAX: 0x56c0  RBX: 0x200056c00040  RCX:
> 0x004c  RDX: 0x0008
> RSI: 0x004c  RDI: 0x200056c00040  RBP:
> 0x200041501740  R8:  0x
> R9:  0x5e7a7333  R10: 0x  R11:
> 0x  R12: 0x004c
> R13: 0x5e7a7333  R14: 0x  R15:
> 0x  RSP: 0x2000415016f8
> Aborted
>
> [backtrace]
> 0x40343779 
> 0x4034534d 
> 0x403a667b 
> 0x403a54c6 
> 0x1174c2b0 
> 0x 
>
> (gdb) osv heap
> 0x8e9ad000 0x22a53000
> 0x8e9a2000 0x4000
>
> Rick
>
>
>
> On Mon, 2020-03-23 at 22:06 -0700, Waldek Kozaczuk wrote:
> > I have sent a more complete patch that should also address
> > fragmentation issue with requests >= 4K and < 2MB.
> >
> > On Monday, March 23, 2020 at 6:12:51 PM UTC-4, Waldek Kozaczuk wrote:
> > > I have just sent a new patch to the mailing list. I am hoping it
> > > will address the OOM crash if my theory of heavy memory
> > > fragmentation is right. It would be nice if Nadav could review it.
> > >
> > > Regardless if you have another crash in production and are able to
> > > connect with gdb, could you run 'osv heap' - it should show
> > > free_page_ranges. If memory is heavily fragmented we should see a
> > > long list.
> > >
> > > It would be nice to recreate that load in dev env and capture the
> > > memory trace data (BWT you do not need to enable backtrace to have
> > > enough useful information). It would help us better understand how
> > > memory is allocated by the app. I saw you send me one trace but it
> > > does not seem to be revealing anything interesting.
> > >
> > > Waldek
> > >
> > > On Monday, March 23, 2020 at 1:19:18 AM UTC-4, rickp wrote:
> > > > On Sun, 2020-03-22 at 22:08 -0700, Waldek Kozaczuk wrote:
> > > > >
> > > > >
> > > > > On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp wrote:
> > > > > > Looks to me like its trying to allocate 40MB but the
> > > > available
> > > > > > memory
> > > > > > is 10GB, surely? 10933128KB is 10,933MB
> > > > > >
> > > > >
> > > > > I misread the number - forgot about 1K.
> > > > >
> > > > > Any chance you could run the app outside of production with
> > > > memory
> > > > > tracing enabled -
> > > > >
> > > >
> https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
> > > >
> > > > >  (without --trace-backtrace) for while? And then we can have a
> > > > better
> > > > > sense of what kind of allocations it makes. The output of
> > > > trace
> > > > > memory-analyzer would be really helpful.
> > > >
> > > > I can certainly run that local with locally generated workloads,
> > > > which
> > > > should be close enough - but we've never managed to trigger the
> > > > oom
> > > > condition that way (other than by really constraining the memory
> > > > artificially). It should be close enough though - let me see what
> > > > I can
> > > > do.
> > > >
> > > > Rick
> > > >
> > > >
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "OSv Development" group.
> > To unsubscribe from this group and stop receiving emails from it,
> > send an email to osv-dev+unsubscr...@googlegroups.com.
> > To view this discussion on the web visit
> >
> https://groups.google.com/d/msgid/osv-dev/e88bb103-ca6f-4f12-b944-e2d1391e2f8e%40googlegroups.com
> > .
>
> --
> You received this message because you are subscribed to the Google Groups
> "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to osv-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/osv-dev/8e9dae61b2f412998468916508404d4867e759a3.camel%40rossfell.co.uk
> .
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/CAL9cFfNp4ceOgdgmDZoLd6JMJb5XDuyp8zGOgTsZukBEQxODCA%40mail.gmail.com.


Re: [osv-dev] Re: OOM query

2020-03-24 Thread Rick Payne


I tried the patches, but it's crashing almost instantly...

page fault outside application, addr: 0x56c0
[registers]
RIP: 0x403edd23 
RFL:
0x00010206  CS:  0x0008  SS:  0x0010
RAX: 0x56c0  RBX: 0x200056c00040  RCX:
0x004c  RDX: 0x0008
RSI: 0x004c  RDI: 0x200056c00040  RBP:
0x200041501740  R8:  0x
R9:  0x5e7a7333  R10: 0x  R11:
0x  R12: 0x004c
R13: 0x5e7a7333  R14: 0x  R15:
0x  RSP: 0x2000415016f8
Aborted

[backtrace]
0x40343779 
0x4034534d 
0x403a667b 
0x403a54c6 
0x1174c2b0 
0x 

(gdb) osv heap
0x8e9ad000 0x22a53000
0x8e9a2000 0x4000

Rick



On Mon, 2020-03-23 at 22:06 -0700, Waldek Kozaczuk wrote:
> I have sent a more complete patch that should also address
> fragmentation issue with requests >= 4K and < 2MB.
> 
> On Monday, March 23, 2020 at 6:12:51 PM UTC-4, Waldek Kozaczuk wrote:
> > I have just sent a new patch to the mailing list. I am hoping it
> > will address the OOM crash if my theory of heavy memory
> > fragmentation is right. It would be nice if Nadav could review it.
> > 
> > Regardless if you have another crash in production and are able to
> > connect with gdb, could you run 'osv heap' - it should show
> > free_page_ranges. If memory is heavily fragmented we should see a
> > long list.
> > 
> > It would be nice to recreate that load in dev env and capture the
> > memory trace data (BWT you do not need to enable backtrace to have
> > enough useful information). It would help us better understand how
> > memory is allocated by the app. I saw you send me one trace but it
> > does not seem to be revealing anything interesting.
> > 
> > Waldek 
> > 
> > On Monday, March 23, 2020 at 1:19:18 AM UTC-4, rickp wrote:
> > > On Sun, 2020-03-22 at 22:08 -0700, Waldek Kozaczuk wrote: 
> > > > 
> > > > 
> > > > On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp wrote: 
> > > > > Looks to me like its trying to allocate 40MB but the
> > > available 
> > > > > memory 
> > > > > is 10GB, surely? 10933128KB is 10,933MB 
> > > > > 
> > > > 
> > > > I misread the number - forgot about 1K. 
> > > > 
> > > > Any chance you could run the app outside of production with
> > > memory 
> > > > tracing enabled - 
> > > > 
> > > https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
> > >  
> > > >  (without --trace-backtrace) for while? And then we can have a
> > > better 
> > > > sense of what kind of allocations it makes. The output of
> > > trace 
> > > > memory-analyzer would be really helpful. 
> > > 
> > > I can certainly run that local with locally generated workloads,
> > > which 
> > > should be close enough - but we've never managed to trigger the
> > > oom 
> > > condition that way (other than by really constraining the memory 
> > > artificially). It should be close enough though - let me see what
> > > I can 
> > > do. 
> > > 
> > > Rick 
> > > 
> > > 
> 
> -- 
> You received this message because you are subscribed to the Google
> Groups "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to osv-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/osv-dev/e88bb103-ca6f-4f12-b944-e2d1391e2f8e%40googlegroups.com
> .

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/8e9dae61b2f412998468916508404d4867e759a3.camel%40rossfell.co.uk.


Re: [osv-dev] Re: OOM query

2020-03-23 Thread Waldek Kozaczuk
I have sent a more complete patch that should also address the fragmentation 
issue with requests >= 4K and < 2MB.
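
To make the order numbers in this size class concrete: a request is rounded up 
to whole 4K pages and then bucketed by the smallest power of two that covers it. 
The sketch below is only illustrative (it assumes order = ceil(log2(size / 4096)), 
which is not necessarily the exact formula OSv uses), but it does reproduce the 
"size 39849984 ... exact order: 14" line quoted later in the thread:

#include <cstdio>
#include <cstddef>

// Map a request size in bytes to a buddy-style order of 4K pages.
// Assumption: order = smallest n such that 2^n pages >= ceil(size / 4096).
static unsigned order_for(size_t bytes)
{
    const size_t page = 4096;
    size_t pages = (bytes + page - 1) / page;   // round up to whole pages
    unsigned order = 0;
    while ((size_t(1) << order) < pages) {
        ++order;
    }
    return order;
}

int main()
{
    printf("%u\n", order_for(39849984));   // 9729 pages -> prints 14
    printf("%u\n", order_for(4984832));    // ~5MB waiter -> prints 11 under this formula
    return 0;
}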

On Monday, March 23, 2020 at 6:12:51 PM UTC-4, Waldek Kozaczuk wrote:
>
> I have just sent a new patch to the mailing list. I am hoping it will 
> address the OOM crash if my theory of heavy memory fragmentation is right. 
> It would be nice if Nadav could review it.
>
> Regardless if you have another crash in production and are able to connect 
> with gdb, could you run 'osv heap' - it should show free_page_ranges. If 
> memory is heavily fragmented we should see a long list.
>
> It would be nice to recreate that load in dev env and capture the memory 
> trace data (BWT you do not need to enable backtrace to have enough useful 
> information). It would help us better understand how memory is allocated by 
> the app. I saw you send me one trace but it does not seem to be revealing 
> anything interesting.
>
> Waldek 
>
> On Monday, March 23, 2020 at 1:19:18 AM UTC-4, rickp wrote:
>>
>> On Sun, 2020-03-22 at 22:08 -0700, Waldek Kozaczuk wrote: 
>> > 
>> > 
>> > On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp wrote: 
>> > > Looks to me like its trying to allocate 40MB but the available 
>> > > memory 
>> > > is 10GB, surely? 10933128KB is 10,933MB 
>> > > 
>> > 
>> > I misread the number - forgot about 1K. 
>> > 
>> > Any chance you could run the app outside of production with memory 
>> > tracing enabled - 
>> > 
>> https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
>>  
>> >  (without --trace-backtrace) for while? And then we can have a better 
>> > sense of what kind of allocations it makes. The output of trace 
>> > memory-analyzer would be really helpful. 
>>
>> I can certainly run that local with locally generated workloads, which 
>> should be close enough - but we've never managed to trigger the oom 
>> condition that way (other than by really constraining the memory 
>> artificially). It should be close enough though - let me see what I can 
>> do. 
>>
>> Rick 
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/e88bb103-ca6f-4f12-b944-e2d1391e2f8e%40googlegroups.com.


Re: [osv-dev] Re: OOM query

2020-03-23 Thread Waldek Kozaczuk
I have just sent a new patch to the mailing list. I am hoping it will 
address the OOM crash if my theory of heavy memory fragmentation is right. 
It would be nice if Nadav could review it.

Regardless, if you have another crash in production and are able to connect 
with gdb, could you run 'osv heap'? It should show free_page_ranges. If 
memory is heavily fragmented we should see a long list.

It would be nice to recreate that load in a dev environment and capture the 
memory trace data (BTW, you do not need to enable backtrace to have enough 
useful information). It would help us better understand how memory is 
allocated by the app. I saw you sent me one trace, but it does not seem to 
reveal anything interesting.

Waldek 
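
A side note on reading the 'osv heap' output: what matters for these failures is 
not the total free memory but whether any single free range is big enough for the 
request. A self-contained sketch of that comparison, with made-up numbers (this is 
not the gdb script itself):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    // Hypothetical free_page_ranges as 'osv heap' might list them:
    // many small ranges plus one medium one.
    std::vector<uint64_t> ranges(1000, 64 * 1024);   // 1000 x 64K
    ranges.push_back(8 * 1024 * 1024);               // one 8M range

    uint64_t total = 0, largest = 0;
    for (uint64_t r : ranges) {
        total += r;
        largest = std::max(largest, r);
    }
    const uint64_t request = 39849984;               // the failing ~38MB allocation
    printf("total free %llu MB, largest range %llu MB, request %llu MB -> %s\n",
           (unsigned long long)(total >> 20),
           (unsigned long long)(largest >> 20),
           (unsigned long long)(request >> 20),
           largest >= request ? "fits" : "cannot be satisfied");
    return 0;
}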

On Monday, March 23, 2020 at 1:19:18 AM UTC-4, rickp wrote:
>
> On Sun, 2020-03-22 at 22:08 -0700, Waldek Kozaczuk wrote: 
> > 
> > 
> > On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp wrote: 
> > > Looks to me like its trying to allocate 40MB but the available 
> > > memory 
> > > is 10GB, surely? 10933128KB is 10,933MB 
> > > 
> > 
> > I misread the number - forgot about 1K. 
> > 
> > Any chance you could run the app outside of production with memory 
> > tracing enabled - 
> > 
> https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
>  
> >  (without --trace-backtrace) for while? And then we can have a better 
> > sense of what kind of allocations it makes. The output of trace 
> > memory-analyzer would be really helpful. 
>
> I can certainly run that local with locally generated workloads, which 
> should be close enough - but we've never managed to trigger the oom 
> condition that way (other than by really constraining the memory 
> artificially). It should be close enough though - let me see what I can 
> do. 
>
> Rick 
>
>
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/2e1a9467-d34f-4ae4-b7fa-06ce8052209e%40googlegroups.com.


Re: [osv-dev] Re: OOM query

2020-03-22 Thread Rick Payne
On Sun, 2020-03-22 at 22:08 -0700, Waldek Kozaczuk wrote:
> 
> 
> On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp wrote:
> > Looks to me like its trying to allocate 40MB but the available
> > memory 
> > is 10GB, surely? 10933128KB is 10,933MB 
> > 
> 
> I misread the number - forgot about 1K.
> 
> Any chance you could run the app outside of production with memory
> tracing enabled - 
> https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
>  (without --trace-backtrace) for while? And then we can have a better
> sense of what kind of allocations it makes. The output of trace
> memory-analyzer would be really helpful.

I can certainly run that locally with locally generated workloads, which
should be close enough - but we've never managed to trigger the oom
condition that way (other than by really constraining the memory
artificially). It should be close enough though - let me see what I can
do.

Rick


-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/77725fd3d095c86e37648713ab4bd3185a496500.camel%40rossfell.co.uk.


Re: [osv-dev] Re: OOM query

2020-03-22 Thread Waldek Kozaczuk
Based on what I can tell, the available remaining memory seems to be heavily 
fragmented (I have no idea why). If it is, it would help to implement 
https://github.com/cloudius-systems/osv/issues/854, which I will give a try. 
Hopefully, I can have a patch within a day.
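
For context, the idea behind issue 854 as I read it (a sketch only, not OSv code): 
instead of requiring one physically contiguous run for a large request, stitch many 
independently allocated 4K pieces together behind a contiguous-looking interface - 
in the kernel that stitching would be the virtual-memory mapping, here it is just 
an index table over chunks:

#include <cstddef>
#include <cstdint>
#include <vector>

struct chunked_buffer {
    static constexpr size_t chunk = 4096;
    // Stand-in for scattered 4K physical pages, each allocated on its own.
    std::vector<std::vector<uint8_t>> pieces;

    explicit chunked_buffer(size_t bytes)
        : pieces((bytes + chunk - 1) / chunk)
    {
        for (auto& p : pieces) {
            p.resize(chunk);
        }
    }
    // Contiguous-looking byte access over non-contiguous storage.
    uint8_t& operator[](size_t off)
    {
        return pieces[off / chunk][off % chunk];
    }
};

int main()
{
    chunked_buffer buf(39849984);        // ~38MB built from independent 4K pieces
    buf[0] = 1;
    buf[39849983] = 2;
    return buf[0] + buf[39849983] - 3;   // 0
}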

On Monday, March 23, 2020 at 1:08:22 AM UTC-4, Waldek Kozaczuk wrote:
>
>
>
> On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp wrote:
>
>
> Looks to me like its trying to allocate 40MB but the available memory 
> is 10GB, surely? 10933128KB is 10,933MB 
>
> I misread the number - forgot about 1K.
>
> Any chance you could run the app outside of production with memory tracing 
> enabled - 
> https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
>  (without --trace-backtrace) for while? And then we can have a better 
> sense of what kind of allocations it makes. The output of trace 
> memory-analyzer would be really helpful.
>
>
> Its probably erlang garbage collection, but it seems very strange to me 
> as sometimes it happens very quickly after we start the system. There's 
> no way we're allocate more memory than the system ever claims to have - 
> so if it is fragmentation, why do we do so poor compared to the linux 
> kernel? 
>
> Is anyone else running VMs based on OSv for more than a few days? 
>
> Rick 
>
> On Sun, 2020-03-22 at 18:01 -0700, Waldek Kozaczuk wrote: 
> > Either way, it looks like the app is trying to allocate almost 40MB 
> > and the remaining memory is ~ 10MB. So, in this case, there is no 
> > surprise, is there? 
> > 
> > In the previous case, you reported there were two threads each 
> > requesting ~5MB and there was plenty of memory. If I understand the 
> > code correctly the remaining memory must have been fragmented and 
> > there was no big enough free page range. 
> > 
> > On Sunday, March 22, 2020 at 8:31:01 PM UTC-4, Waldek Kozaczuk wrote: 
> > > It does for me. Are you using the latest version of this script 
> > > from master? I think there were couple of things I changed there 
> > > recently as part of wide upgrade to Python 3. 
> > > 
> > > On Sun, Mar 22, 2020 at 20:27 Rick Payne  
> > > wrote: 
> > > > Does that command work for you? For me I get: 
> > > > 
> > > > (gdb) osv heap 
> > > > Python Exception  %x format: an integer is 
> > > > required, 
> > > > not gdb.Value: 
> > > > Error occurred in Python command: %x format: an integer is 
> > > > required, 
> > > > not gdb.Value 
> > > > 
> > > > Rick 
> > > > 
> > > > On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote: 
> > > > > Can you run ‘osv heap’ from gdb at this point? 
> > > > > 
> > > > > On Sun, Mar 22, 2020 at 19:31 Rick Payne  
> > > > > wrote: 
> > > > > > Ok, so i applied the patch, and the printf and this is what I 
> > > > see: 
> > > > > > 
> > > > > > page_range_allocator: no ranges found for size 39849984 and 
> > > > exact 
> > > > > > order: 14 
> > > > > > Waiter: 1_scheduler, bytes: 39849984 
> > > > > > Out of memory: could not reclaim any further. Current memory: 
> > > > > > 10933128 
> > > > > > Kb, target: 0 Kb 
> > > > > > 
> > > > > > Rick 
> > > > > > 
> > > > > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote: 
> > > > > > > 
> > > > > > > 
> > > > > > > I think that OSv might be hitting some sort of 
> > > > fragmentation 
> > > > > > scenario 
> > > > > > > (another question is why we get to this point) which should 
> > > > be 
> > > > > > > handled but is not due to the bug this patch fixes: 
> > > > > > > 
> > > > > > > diff --git a/core/mempool.cc b/core/mempool.cc 
> > > > > > > index d902eea8..11fd1456 100644 
> > > > > > > --- a/core/mempool.cc 
> > > > > > > +++ b/core/mempool.cc 
> > > > > > > @@ -702,10 +702,13 @@ page_range* 
> > > > > > page_range_allocator::alloc(size_t 
> > > > > > > size) 
> > > > > > >  for (auto&& pr : _free[exact_order - 1]) { 
> > > > > > >  if (pr.size >= size) { 
> > > > > > >  range = &pr; 
> > > > > > > +remove_list(exact_order - 1, *range); 
> > > > > > >  break; 
> > > > > > >  } 
> > > > > > >  } 
> > > > > > > -return nullptr; 
> > > > > > > +if (!range) { 
> > > > > > > +return nullptr; 
> > > > > > > +} 
> > > > > > >  } else if (order == max_order) { 
> > > > > > >  range = &*_free_huge.rbegin(); 
> > > > > > >  if (range->size < size) { 
> > > > > > > 
> > > > > > > Can you give it a try and see what happens with this patch? 
> > > > > > > 
> > > > > > > Waldek 
> > > > > > > 
> > > > > > > PS. You might also add this printout to verify if you are 
> > > > hitting 
> > > > > > the 
> > > > > > > hard fragmentation case: 
> > > > > > > 
> > > > > > >  page_range* range = nullptr; 
> > > > > > >  if (!bitset) { 
> > > > > > >  if (!exact_order || _free[exact_order - 
> > > > 1].empty()) { 
> > > > > > > +printf("page_range_allocator: no 

Re: [osv-dev] Re: OOM query

2020-03-22 Thread Waldek Kozaczuk


On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp wrote:
>
>
> Looks to me like its trying to allocate 40MB but the available memory 
> is 10GB, surely? 10933128KB is 10,933MB 
>
> I misread the number - forgot about 1K.

Any chance you could run the app outside of production with memory tracing 
enabled - 
https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
 (without --trace-backtrace) for a while? And then we can have a better sense 
of what kind of allocations it makes. The output of trace memory-analyzer would 
be really helpful.


> Its probably erlang garbage collection, but it seems very strange to me 
> as sometimes it happens very quickly after we start the system. There's 
> no way we're allocate more memory than the system ever claims to have - 
> so if it is fragmentation, why do we do so poor compared to the linux 
> kernel? 
>
> Is anyone else running VMs based on OSv for more than a few days? 
>
> Rick 
>
> On Sun, 2020-03-22 at 18:01 -0700, Waldek Kozaczuk wrote: 
> > Either way, it looks like the app is trying to allocate almost 40MB 
> > and the remaining memory is ~ 10MB. So, in this case, there is no 
> > surprise, is there? 
> > 
> > In the previous case, you reported there were two threads each 
> > requesting ~5MB and there was plenty of memory. If I understand the 
> > code correctly the remaining memory must have been fragmented and 
> > there was no big enough free page range. 
> > 
> > On Sunday, March 22, 2020 at 8:31:01 PM UTC-4, Waldek Kozaczuk wrote: 
> > > It does for me. Are you using the latest version of this script 
> > > from master? I think there were couple of things I changed there 
> > > recently as part of wide upgrade to Python 3. 
> > > 
> > > On Sun, Mar 22, 2020 at 20:27 Rick Payne  > 
> > > wrote: 
> > > > Does that command work for you? For me I get: 
> > > > 
> > > > (gdb) osv heap 
> > > > Python Exception  %x format: an integer is 
> > > > required, 
> > > > not gdb.Value: 
> > > > Error occurred in Python command: %x format: an integer is 
> > > > required, 
> > > > not gdb.Value 
> > > > 
> > > > Rick 
> > > > 
> > > > On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote: 
> > > > > Can you run ‘osv heap’ from gdb at this point? 
> > > > > 
> > > > > On Sun, Mar 22, 2020 at 19:31 Rick Payne  > 
> > > > > wrote: 
> > > > > > Ok, so i applied the patch, and the printf and this is what I 
> > > > see: 
> > > > > > 
> > > > > > page_range_allocator: no ranges found for size 39849984 and 
> > > > exact 
> > > > > > order: 14 
> > > > > > Waiter: 1_scheduler, bytes: 39849984 
> > > > > > Out of memory: could not reclaim any further. Current memory: 
> > > > > > 10933128 
> > > > > > Kb, target: 0 Kb 
> > > > > > 
> > > > > > Rick 
> > > > > > 
> > > > > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote: 
> > > > > > > 
> > > > > > > 
> > > > > > > I think that OSv might be hitting some sort of 
> > > > fragmentation 
> > > > > > scenario 
> > > > > > > (another question is why we get to this point) which should 
> > > > be 
> > > > > > > handled but is not due to the bug this patch fixes: 
> > > > > > > 
> > > > > > > diff --git a/core/mempool.cc b/core/mempool.cc 
> > > > > > > index d902eea8..11fd1456 100644 
> > > > > > > --- a/core/mempool.cc 
> > > > > > > +++ b/core/mempool.cc 
> > > > > > > @@ -702,10 +702,13 @@ page_range* 
> > > > > > page_range_allocator::alloc(size_t 
> > > > > > > size) 
> > > > > > >  for (auto&& pr : _free[exact_order - 1]) { 
> > > > > > >  if (pr.size >= size) { 
> > > > > > >  range = &pr; 
> > > > > > > +remove_list(exact_order - 1, *range); 
> > > > > > >  break; 
> > > > > > >  } 
> > > > > > >  } 
> > > > > > > -return nullptr; 
> > > > > > > +if (!range) { 
> > > > > > > +return nullptr; 
> > > > > > > +} 
> > > > > > >  } else if (order == max_order) { 
> > > > > > >  range = &*_free_huge.rbegin(); 
> > > > > > >  if (range->size < size) { 
> > > > > > > 
> > > > > > > Can you give it a try and see what happens with this patch? 
> > > > > > > 
> > > > > > > Waldek 
> > > > > > > 
> > > > > > > PS. You might also add this printout to verify if you are 
> > > > hitting 
> > > > > > the 
> > > > > > > hard fragmentation case: 
> > > > > > > 
> > > > > > >  page_range* range = nullptr; 
> > > > > > >  if (!bitset) { 
> > > > > > >  if (!exact_order || _free[exact_order - 
> > > > 1].empty()) { 
> > > > > > > +printf("page_range_allocator: no ranges found 
> > > > for 
> > > > > > size 
> > > > > > > %ld and exact order: %d\n", size, exact_order); 
> > > > > > >  return nullptr; 
> > > > > > >  } 
> > > > > > > 
> > > > > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp 
> > > > wrote: 
> > > > > > > > Ok, with the waiter_print, we see: 
> > > > > > > > 
> > > > > > 

Re: [osv-dev] Re: OOM query

2020-03-22 Thread Rick Payne


Looks to me like it's trying to allocate 40MB but the available memory
is 10GB, surely? 10933128KB is 10,933MB.

It's probably Erlang garbage collection, but it seems very strange to me
as sometimes it happens very quickly after we start the system. There's
no way we're allocating more memory than the system ever claims to have -
so if it is fragmentation, why do we do so poorly compared to the Linux
kernel?

Is anyone else running VMs based on OSv for more than a few days?

Rick

On Sun, 2020-03-22 at 18:01 -0700, Waldek Kozaczuk wrote:
> Either way, it looks like the app is trying to allocate almost 40MB
> and the remaining memory is ~ 10MB. So, in this case, there is no
> surprise, is there?
> 
> In the previous case, you reported there were two threads each
> requesting ~5MB and there was plenty of memory. If I understand the
> code correctly the remaining memory must have been fragmented and
> there was no big enough free page range.
> 
> On Sunday, March 22, 2020 at 8:31:01 PM UTC-4, Waldek Kozaczuk wrote:
> > It does for me. Are you using the latest version of this script
> > from master? I think there were couple of things I changed there
> > recently as part of wide upgrade to Python 3. 
> > 
> > On Sun, Mar 22, 2020 at 20:27 Rick Payne 
> > wrote:
> > > Does that command work for you? For me I get:
> > > 
> > > (gdb) osv heap
> > > Python Exception  %x format: an integer is
> > > required,
> > > not gdb.Value: 
> > > Error occurred in Python command: %x format: an integer is
> > > required, 
> > > not gdb.Value
> > > 
> > > Rick
> > > 
> > > On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote:
> > > > Can you run ‘osv heap’ from gdb at this point?
> > > > 
> > > > On Sun, Mar 22, 2020 at 19:31 Rick Payne 
> > > > wrote:
> > > > > Ok, so i applied the patch, and the printf and this is what I
> > > see:
> > > > > 
> > > > > page_range_allocator: no ranges found for size 39849984 and
> > > exact
> > > > > order: 14
> > > > > Waiter: 1_scheduler, bytes: 39849984
> > > > > Out of memory: could not reclaim any further. Current memory:
> > > > > 10933128
> > > > > Kb, target: 0 Kb
> > > > > 
> > > > > Rick
> > > > > 
> > > > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote:
> > > > > > 
> > > > > > 
> > > > > > I think that OSv might be hitting some sort of
> > > fragmentation
> > > > > scenario
> > > > > > (another question is why we get to this point) which should
> > > be
> > > > > > handled but is not due to the bug this patch fixes:
> > > > > > 
> > > > > > diff --git a/core/mempool.cc b/core/mempool.cc
> > > > > > index d902eea8..11fd1456 100644
> > > > > > --- a/core/mempool.cc
> > > > > > +++ b/core/mempool.cc
> > > > > > @@ -702,10 +702,13 @@ page_range*
> > > > > page_range_allocator::alloc(size_t
> > > > > > size)
> > > > > >  for (auto&& pr : _free[exact_order - 1]) {
> > > > > >  if (pr.size >= size) {
> > > >  range = &pr;
> > > > > > +remove_list(exact_order - 1, *range);
> > > > > >  break;
> > > > > >  }
> > > > > >  }
> > > > > > -return nullptr;
> > > > > > +if (!range) {
> > > > > > +return nullptr;
> > > > > > +}
> > > > > >  } else if (order == max_order) {
> > > > > >  range = &*_free_huge.rbegin();
> > > > > >  if (range->size < size) {
> > > > > > 
> > > > > > Can you give it a try and see what happens with this patch?
> > > > > > 
> > > > > > Waldek
> > > > > > 
> > > > > > PS. You might also add this printout to verify if you are
> > > hitting
> > > > > the
> > > > > > hard fragmentation case:
> > > > > > 
> > > > > >  page_range* range = nullptr;
> > > > > >  if (!bitset) {
> > > > > >  if (!exact_order || _free[exact_order -
> > > 1].empty()) {
> > > > > > +printf("page_range_allocator: no ranges found
> > > for
> > > > > size
> > > > > > %ld and exact order: %d\n", size, exact_order);
> > > > > >  return nullptr;
> > > > > >  }
> > > > > > 
> > > > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp
> > > wrote:
> > > > > > > Ok, with the waiter_print, we see: 
> > > > > > > 
> > > > > > > Waiter: 2_scheduler, bytes: 4984832 
> > > > > > > Waiter: 5_scheduler, bytes: 4984832 
> > > > > > > Out of memory: could not reclaim any further. Current
> > > memory: 
> > > > > > > 10871576Kb, target: 0 Kb 
> > > > > > > 
> > > > > > > 'osv mem' in gdb reports loads of free memory: 
> > > > > > > 
> > > > > > > (gdb) osv mem 
> > > > > > > Total Memory: 12884372480 Bytes 
> > > > > > > Mmap Memory:  1610067968 Bytes (12.50%) 
> > > > > > > Free Memory:  11132493824 Bytes (86.40%) 
> > > > > > > 
> > > > > > > So why is it failing to allocate 5MB when it claims to
> > > have
> > > > > 11GB
> > > > > > > free? 
> > > > > > > 
> > > > > > > Any more debug I can provide? 
> > > > > > > 
> > > > > > > Rick 
> > > > > > > 
> > > > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne 

Re: [osv-dev] Re: OOM query

2020-03-22 Thread Waldek Kozaczuk
Do you know if the application uses "huge" (>=2MB) malloc? I found this 
relevant discussion - 
https://groups.google.com/d/msg/osv-dev/uLoxcjU164k/WWl6dMzsCAAJ - about 
how OSv currently uses contiguous physical memory areas to back these kinds 
of requests. 

There is an open issue about that - 
https://github.com/cloudius-systems/osv/issues/854. It should not be 
difficult to implement.

It would be nice to confirm that memory is indeed fragmented in your case.
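
To illustrate why confirming fragmentation matters: a system can have most of its 
memory free and still be unable to satisfy one "huge" request if the free pages are 
interleaved with allocated ones, since such a request currently needs a single 
contiguous physical run. A small simulation with made-up numbers (not OSv code):

#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
    const int total_pages = 1 << 20;                 // 4GB worth of 4K pages
    std::vector<bool> in_use(total_pages, false);
    for (int i = 0; i < total_pages; i += 256) {
        in_use[i] = true;                            // one page pinned every 1MB
    }

    int free_pages = 0, longest_run = 0, run = 0;
    for (int i = 0; i < total_pages; ++i) {
        if (in_use[i]) {
            run = 0;
        } else {
            ++free_pages;
            longest_run = std::max(longest_run, ++run);
        }
    }
    const int request_pages = 1024;                  // a 4MB contiguous request
    printf("free %d MB, longest contiguous run %d KB -> %s\n",
           free_pages / 256, longest_run * 4,
           longest_run >= request_pages ? "fits" : "fails");
    return 0;
}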

On Sunday, March 22, 2020 at 9:01:26 PM UTC-4, Waldek Kozaczuk wrote:
>
> Either way, it looks like the app is trying to allocate almost 40MB and 
> the remaining memory is ~ 10MB. So, in this case, there is no surprise, is 
> there?
>
> In the previous case, you reported there were two threads each requesting 
> ~5MB and there was plenty of memory. If I understand the code correctly the 
> remaining memory must have been fragmented and there was no big enough free 
> page range.
>
> On Sunday, March 22, 2020 at 8:31:01 PM UTC-4, Waldek Kozaczuk wrote:
>>
>> It does for me. Are you using the latest version of this script from 
>> master? I think there were couple of things I changed there recently as 
>> part of wide upgrade to Python 3. 
>>
>> On Sun, Mar 22, 2020 at 20:27 Rick Payne  wrote:
>>
>>>
>>> Does that command work for you? For me I get:
>>>
>>> (gdb) osv heap
>>> Python Exception  %x format: an integer is required,
>>> not gdb.Value: 
>>> Error occurred in Python command: %x format: an integer is required, 
>>> not gdb.Value
>>>
>>> Rick
>>>
>>> On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote:
>>> > Can you run ‘osv heap’ from gdb at this point?
>>> > 
>>> > On Sun, Mar 22, 2020 at 19:31 Rick Payne 
>>> > wrote:
>>> > > Ok, so i applied the patch, and the printf and this is what I see:
>>> > > 
>>> > > page_range_allocator: no ranges found for size 39849984 and exact
>>> > > order: 14
>>> > > Waiter: 1_scheduler, bytes: 39849984
>>> > > Out of memory: could not reclaim any further. Current memory:
>>> > > 10933128
>>> > > Kb, target: 0 Kb
>>> > > 
>>> > > Rick
>>> > > 
>>> > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote:
>>> > > > 
>>> > > > 
>>> > > > I think that OSv might be hitting some sort of fragmentation
>>> > > scenario
>>> > > > (another question is why we get to this point) which should be
>>> > > > handled but is not due to the bug this patch fixes:
>>> > > > 
>>> > > > diff --git a/core/mempool.cc b/core/mempool.cc
>>> > > > index d902eea8..11fd1456 100644
>>> > > > --- a/core/mempool.cc
>>> > > > +++ b/core/mempool.cc
>>> > > > @@ -702,10 +702,13 @@ page_range*
>>> > > page_range_allocator::alloc(size_t
>>> > > > size)
>>> > > >  for (auto&& pr : _free[exact_order - 1]) {
>>> > > >  if (pr.size >= size) {
>>> > > >  range = &pr;
>>> > > > +remove_list(exact_order - 1, *range);
>>> > > >  break;
>>> > > >  }
>>> > > >  }
>>> > > > -return nullptr;
>>> > > > +if (!range) {
>>> > > > +return nullptr;
>>> > > > +}
>>> > > >  } else if (order == max_order) {
>>> > > >  range = &*_free_huge.rbegin();
>>> > > >  if (range->size < size) {
>>> > > > 
>>> > > > Can you give it a try and see what happens with this patch?
>>> > > > 
>>> > > > Waldek
>>> > > > 
>>> > > > PS. You might also add this printout to verify if you are hitting
>>> > > the
>>> > > > hard fragmentation case:
>>> > > > 
>>> > > >  page_range* range = nullptr;
>>> > > >  if (!bitset) {
>>> > > >  if (!exact_order || _free[exact_order - 1].empty()) {
>>> > > > +printf("page_range_allocator: no ranges found for
>>> > > size
>>> > > > %ld and exact order: %d\n", size, exact_order);
>>> > > >  return nullptr;
>>> > > >  }
>>> > > > 
>>> > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote:
>>> > > > > Ok, with the waiter_print, we see: 
>>> > > > > 
>>> > > > > Waiter: 2_scheduler, bytes: 4984832 
>>> > > > > Waiter: 5_scheduler, bytes: 4984832 
>>> > > > > Out of memory: could not reclaim any further. Current memory: 
>>> > > > > 10871576Kb, target: 0 Kb 
>>> > > > > 
>>> > > > > 'osv mem' in gdb reports loads of free memory: 
>>> > > > > 
>>> > > > > (gdb) osv mem 
>>> > > > > Total Memory: 12884372480 Bytes 
>>> > > > > Mmap Memory:  1610067968 Bytes (12.50%) 
>>> > > > > Free Memory:  11132493824 Bytes (86.40%) 
>>> > > > > 
>>> > > > > So why is it failing to allocate 5MB when it claims to have
>>> > > 11GB
>>> > > > > free? 
>>> > > > > 
>>> > > > > Any more debug I can provide? 
>>> > > > > 
>>> > > > > Rick 
>>> > > > > 
>>> > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote: 
>>> > > > > > And indeed, back on the 'release' image, and we hit an oom: 
>>> > > > > > 
>>> > > > > > (gdb) b memory::oom 
>>> > > > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505. 
>>> > > > > > (gdb) c 
>>> > > > > > Continuing. 

Re: [osv-dev] Re: OOM query

2020-03-22 Thread Waldek Kozaczuk
Either way, it looks like the app is trying to allocate almost 40MB and the 
remaining memory is ~ 10MB. So, in this case, there is no surprise, is 
there?

In the previous case, you reported there were two threads each requesting 
~5MB and there was plenty of memory. If I understand the code correctly, the 
remaining memory must have been fragmented and there was no free page range 
big enough.

On Sunday, March 22, 2020 at 8:31:01 PM UTC-4, Waldek Kozaczuk wrote:
>
> It does for me. Are you using the latest version of this script from 
> master? I think there were couple of things I changed there recently as 
> part of wide upgrade to Python 3. 
>
> On Sun, Mar 22, 2020 at 20:27 Rick Payne  wrote:
>
>>
>> Does that command work for you? For me I get:
>>
>> (gdb) osv heap
>> Python Exception  %x format: an integer is required,
>> not gdb.Value: 
>> Error occurred in Python command: %x format: an integer is required, 
>> not gdb.Value
>>
>> Rick
>>
>> On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote:
>> > Can you run ‘osv heap’ from gdb at this point?
>> > 
>> > On Sun, Mar 22, 2020 at 19:31 Rick Payne 
>> > wrote:
>> > > Ok, so i applied the patch, and the printf and this is what I see:
>> > > 
>> > > page_range_allocator: no ranges found for size 39849984 and exact
>> > > order: 14
>> > > Waiter: 1_scheduler, bytes: 39849984
>> > > Out of memory: could not reclaim any further. Current memory:
>> > > 10933128
>> > > Kb, target: 0 Kb
>> > > 
>> > > Rick
>> > > 
>> > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote:
>> > > > 
>> > > > 
>> > > > I think that OSv might be hitting some sort of fragmentation
>> > > scenario
>> > > > (another question is why we get to this point) which should be
>> > > > handled but is not due to the bug this patch fixes:
>> > > > 
>> > > > diff --git a/core/mempool.cc b/core/mempool.cc
>> > > > index d902eea8..11fd1456 100644
>> > > > --- a/core/mempool.cc
>> > > > +++ b/core/mempool.cc
>> > > > @@ -702,10 +702,13 @@ page_range*
>> > > page_range_allocator::alloc(size_t
>> > > > size)
>> > > >  for (auto&& pr : _free[exact_order - 1]) {
>> > > >  if (pr.size >= size) {
>> > > >  range = &pr;
>> > > > +remove_list(exact_order - 1, *range);
>> > > >  break;
>> > > >  }
>> > > >  }
>> > > > -return nullptr;
>> > > > +if (!range) {
>> > > > +return nullptr;
>> > > > +}
>> > > >  } else if (order == max_order) {
>> > > >  range = &*_free_huge.rbegin();
>> > > >  if (range->size < size) {
>> > > > 
>> > > > Can you give it a try and see what happens with this patch?
>> > > > 
>> > > > Waldek
>> > > > 
>> > > > PS. You might also add this printout to verify if you are hitting
>> > > the
>> > > > hard fragmentation case:
>> > > > 
>> > > >  page_range* range = nullptr;
>> > > >  if (!bitset) {
>> > > >  if (!exact_order || _free[exact_order - 1].empty()) {
>> > > > +printf("page_range_allocator: no ranges found for
>> > > size
>> > > > %ld and exact order: %d\n", size, exact_order);
>> > > >  return nullptr;
>> > > >  }
>> > > > 
>> > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote:
>> > > > > Ok, with the waiter_print, we see: 
>> > > > > 
>> > > > > Waiter: 2_scheduler, bytes: 4984832 
>> > > > > Waiter: 5_scheduler, bytes: 4984832 
>> > > > > Out of memory: could not reclaim any further. Current memory: 
>> > > > > 10871576Kb, target: 0 Kb 
>> > > > > 
>> > > > > 'osv mem' in gdb reports loads of free memory: 
>> > > > > 
>> > > > > (gdb) osv mem 
>> > > > > Total Memory: 12884372480 Bytes 
>> > > > > Mmap Memory:  1610067968 Bytes (12.50%) 
>> > > > > Free Memory:  11132493824 Bytes (86.40%) 
>> > > > > 
>> > > > > So why is it failing to allocate 5MB when it claims to have
>> > > 11GB
>> > > > > free? 
>> > > > > 
>> > > > > Any more debug I can provide? 
>> > > > > 
>> > > > > Rick 
>> > > > > 
>> > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote: 
>> > > > > > And indeed, back on the 'release' image, and we hit an oom: 
>> > > > > > 
>> > > > > > (gdb) b memory::oom 
>> > > > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505. 
>> > > > > > (gdb) c 
>> > > > > > Continuing. 
>> > > > > > 
>> > > > > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0
>> > > )
>> > > > > at 
>> > > > > > core/mempool.cc:505 
>> > > > > > 505 core/mempool.cc: No such file or directory. 
>> > > > > > (gdb) osv memory 
>> > > > > > Total Memory: 12884372480 Bytes 
>> > > > > > Mmap Memory:  1541332992 Bytes (11.96%) 
>> > > > > > Free Memory:  11490983936 Bytes (89.19%) 
>> > > > > > 
>> > > > > > I don't have the _waiters list print unfortunately (as I
>> > > hadn't 
>> > > > > > included it in this image). It looks like this though: 
>> > > > > > 
>> > > > > > (gdb) p _oom_blocked._waiters 
>> > > > > > $4 = 
>> > > > > >
>> > > > >
>> > > 

Re: [osv-dev] Re: OOM query

2020-03-22 Thread Waldek Kozaczuk
It does for me. Are you using the latest version of this script from
master? I think there were a couple of things I changed there recently as
part of the wider upgrade to Python 3.

On Sun, Mar 22, 2020 at 20:27 Rick Payne  wrote:

>
> Does that command work for you? For me I get:
>
> (gdb) osv heap
> Python Exception  %x format: an integer is required,
> not gdb.Value:
> Error occurred in Python command: %x format: an integer is required,
> not gdb.Value
>
> Rick
>
> On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote:
> > Can you run ‘osv heap’ from gdb at this point?
> >
> > On Sun, Mar 22, 2020 at 19:31 Rick Payne 
> > wrote:
>> > > > Ok, so I applied the patch and the printf, and this is what I see:
> > >
> > > page_range_allocator: no ranges found for size 39849984 and exact
> > > order: 14
> > > Waiter: 1_scheduler, bytes: 39849984
> > > Out of memory: could not reclaim any further. Current memory:
> > > 10933128
> > > Kb, target: 0 Kb
> > >
> > > Rick
> > >
> > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote:
> > > >
> > > >
> > > > I think that OSv might be hitting some sort of fragmentation
> > > scenario
> > > > (another question is why we get to this point) which should be
> > > > handled but is not due to the bug this patch fixes:
> > > >
> > > > diff --git a/core/mempool.cc b/core/mempool.cc
> > > > index d902eea8..11fd1456 100644
> > > > --- a/core/mempool.cc
> > > > +++ b/core/mempool.cc
> > > > @@ -702,10 +702,13 @@ page_range*
> > > page_range_allocator::alloc(size_t
> > > > size)
> > > >  for (auto&& pr : _free[exact_order - 1]) {
> > > >  if (pr.size >= size) {
> > > >  range = &pr;
> > > > +remove_list(exact_order - 1, *range);
> > > >  break;
> > > >  }
> > > >  }
> > > > -return nullptr;
> > > > +if (!range) {
> > > > +return nullptr;
> > > > +}
> > > >  } else if (order == max_order) {
> > > >  range = &*_free_huge.rbegin();
> > > >  if (range->size < size) {
> > > >
> > > > Can you give it a try and see what happens with this patch?
> > > >
> > > > Waldek
> > > >
> > > > PS. You might also add this printout to verify if you are hitting
> > > the
> > > > hard fragmentation case:
> > > >
> > > >  page_range* range = nullptr;
> > > >  if (!bitset) {
> > > >  if (!exact_order || _free[exact_order - 1].empty()) {
> > > > +printf("page_range_allocator: no ranges found for
> > > size
> > > > %ld and exact order: %d\n", size, exact_order);
> > > >  return nullptr;
> > > >  }
> > > >
> > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote:
> > > > > Ok, with the waiter_print, we see:
> > > > >
> > > > > Waiter: 2_scheduler, bytes: 4984832
> > > > > Waiter: 5_scheduler, bytes: 4984832
> > > > > Out of memory: could not reclaim any further. Current memory:
> > > > > 10871576Kb, target: 0 Kb
> > > > >
> > > > > 'osv mem' in gdb reports loads of free memory:
> > > > >
> > > > > (gdb) osv mem
> > > > > Total Memory: 12884372480 Bytes
> > > > > Mmap Memory:  1610067968 Bytes (12.50%)
> > > > > Free Memory:  11132493824 Bytes (86.40%)
> > > > >
> > > > > So why is it failing to allocate 5MB when it claims to have
> > > 11GB
> > > > > free?
> > > > >
> > > > > Any more debug I can provide?
> > > > >
> > > > > Rick
> > > > >
> > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote:
> > > > > > And indeed, back on the 'release' image, and we hit an oom:
> > > > > >
> > > > > > (gdb) b memory::oom
> > > > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505.
> > > > > > (gdb) c
> > > > > > Continuing.
> > > > > >
> > > > > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0
> > > )
> > > > > at
> > > > > > core/mempool.cc:505
> > > > > > 505 core/mempool.cc: No such file or directory.
> > > > > > (gdb) osv memory
> > > > > > Total Memory: 12884372480 Bytes
> > > > > > Mmap Memory:  1541332992 Bytes (11.96%)
> > > > > > Free Memory:  11490983936 Bytes (89.19%)
> > > > > >
> > > > > > I don't have the _waiters list print unfortunately (as I
> > > hadn't
> > > > > > included it in this image). It looks like this though:
> > > > > >
> > > > > > (gdb) p _oom_blocked._waiters
> > > > > > $4 =
> > > > > >
> > > > >
> > > { > > > > cl
> > > > > > ai
> > > > > > mer_waiters::wait_node,
> > > > > boost::intrusive::list_node_traits,
> > > > > > (boost::intrusive::link_mode_type)1,
> > > boost::intrusive::dft_tag,
> > > > > 1>,
> > > > > > unsigned long, false, void>> = {static constant_time_size =
> > > > > false,
> > > > > > static stateful_value_traits = ,
> > > > > > static has_container_from_iterator = ,
> > > static
> > > > > > safemode_or_autounlink = true,
> > > > > > data_ =
> > > > > >
> > > > >
> > > { > > > > > boost::intrusive::list_node_traits,
> > > > > > (boost::intrusive::link_mode_type)1,
> > > boost::intrusive::dft_tag,
> > > > > 1>> =
> > > > > >
> > 

Re: [osv-dev] Re: OOM query

2020-03-22 Thread Rick Payne


Does that command work for you? For me I get:

(gdb) osv heap
Python Exception  %x format: an integer is required,
not gdb.Value: 
Error occurred in Python command: %x format: an integer is required, 
not gdb.Value

Rick

On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote:
> Can you run ‘osv heap’ from gdb at this point?
> 
> On Sun, Mar 22, 2020 at 19:31 Rick Payne 
> wrote:
> > > Ok, so I applied the patch and the printf, and this is what I see:
> > 
> > page_range_allocator: no ranges found for size 39849984 and exact
> > order: 14
> > Waiter: 1_scheduler, bytes: 39849984
> > Out of memory: could not reclaim any further. Current memory:
> > 10933128
> > Kb, target: 0 Kb
> > 
> > Rick
> > 
> > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote:
> > > 
> > > 
> > > I think that OSv might be hitting some sort of fragmentation
> > scenario
> > > (another question is why we get to this point) which should be
> > > handled but is not due to the bug this patch fixes:
> > > 
> > > diff --git a/core/mempool.cc b/core/mempool.cc
> > > index d902eea8..11fd1456 100644
> > > --- a/core/mempool.cc
> > > +++ b/core/mempool.cc
> > > @@ -702,10 +702,13 @@ page_range*
> > page_range_allocator::alloc(size_t
> > > size)
> > >  for (auto&& pr : _free[exact_order - 1]) {
> > >  if (pr.size >= size) {
> > >  range = &pr;
> > > +remove_list(exact_order - 1, *range);
> > >  break;
> > >  }
> > >  }
> > > -return nullptr;
> > > +if (!range) {
> > > +return nullptr;
> > > +}
> > >  } else if (order == max_order) {
> > >  range = &*_free_huge.rbegin();
> > >  if (range->size < size) {
> > > 
> > > Can you give it a try and see what happens with this patch?
> > > 
> > > Waldek
> > > 
> > > PS. You might also add this printout to verify if you are hitting
> > the
> > > hard fragmentation case:
> > > 
> > >  page_range* range = nullptr;
> > >  if (!bitset) {
> > >  if (!exact_order || _free[exact_order - 1].empty()) {
> > > +printf("page_range_allocator: no ranges found for
> > size
> > > %ld and exact order: %d\n", size, exact_order);
> > >  return nullptr;
> > >  }
> > > 
> > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote:
> > > > Ok, with the waiter_print, we see: 
> > > > 
> > > > Waiter: 2_scheduler, bytes: 4984832 
> > > > Waiter: 5_scheduler, bytes: 4984832 
> > > > Out of memory: could not reclaim any further. Current memory: 
> > > > 10871576Kb, target: 0 Kb 
> > > > 
> > > > 'osv mem' in gdb reports loads of free memory: 
> > > > 
> > > > (gdb) osv mem 
> > > > Total Memory: 12884372480 Bytes 
> > > > Mmap Memory:  1610067968 Bytes (12.50%) 
> > > > Free Memory:  11132493824 Bytes (86.40%) 
> > > > 
> > > > So why is it failing to allocate 5MB when it claims to have
> > 11GB
> > > > free? 
> > > > 
> > > > Any more debug I can provide? 
> > > > 
> > > > Rick 
> > > > 
> > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote: 
> > > > > And indeed, back on the 'release' image, and we hit an oom: 
> > > > > 
> > > > > (gdb) b memory::oom 
> > > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505. 
> > > > > (gdb) c 
> > > > > Continuing. 
> > > > > 
> > > > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0
> > )
> > > > at 
> > > > > core/mempool.cc:505 
> > > > > 505 core/mempool.cc: No such file or directory. 
> > > > > (gdb) osv memory 
> > > > > Total Memory: 12884372480 Bytes 
> > > > > Mmap Memory:  1541332992 Bytes (11.96%) 
> > > > > Free Memory:  11490983936 Bytes (89.19%) 
> > > > > 
> > > > > I don't have the _waiters list print unfortunately (as I
> > hadn't 
> > > > > included it in this image). It looks like this though: 
> > > > > 
> > > > > (gdb) p _oom_blocked._waiters 
> > > > > $4 = 
> > > > > {<boost::intrusive::list_impl<boost::intrusive::bhtraits<memory::reclaimer_waiters::wait_node,
> > > > >   boost::intrusive::list_node_traits<void*>, (boost::intrusive::link_mode_type)1,
> > > > >   boost::intrusive::dft_tag, 1>, unsigned long, false, void>> = {
> > > > >     static constant_time_size = false, static safemode_or_autounlink = true,
> > > > >     static link_mode = boost::intrusive::safe_link, ...,
> > > > >     root_plus_size_ = {..., m_header = {next_ = 0x200041305580,
> > > > >       prev_ = 0x200041305580}, ...

Re: [osv-dev] Re: OOM query

2020-03-22 Thread Waldek Kozaczuk
Can you run ‘osv heap’ from gdb at this point?

On Sun, Mar 22, 2020 at 19:31 Rick Payne  wrote:

>
> Ok, so I applied the patch and the printf, and this is what I see:
>
> page_range_allocator: no ranges found for size 39849984 and exact
> order: 14
> Waiter: 1_scheduler, bytes: 39849984
> Out of memory: could not reclaim any further. Current memory: 10933128
> Kb, target: 0 Kb
>
> Rick
>
> On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote:
> >
> >
> > I think that OSv might be hitting some sort of fragmentation scenario
> > (another question is why we get to this point) which should be
> > handled but is not due to the bug this patch fixes:
> >
> > diff --git a/core/mempool.cc b/core/mempool.cc
> > index d902eea8..11fd1456 100644
> > --- a/core/mempool.cc
> > +++ b/core/mempool.cc
> > @@ -702,10 +702,13 @@ page_range* page_range_allocator::alloc(size_t
> > size)
> >  for (auto&& pr : _free[exact_order - 1]) {
> >  if (pr.size >= size) {
> >  range = &pr;
> > +remove_list(exact_order - 1, *range);
> >  break;
> >  }
> >  }
> > -return nullptr;
> > +if (!range) {
> > +return nullptr;
> > +}
> >  } else if (order == max_order) {
> >  range = &*_free_huge.rbegin();
> >  if (range->size < size) {
> >
> > Can you give it a try and see what happens with this patch?
> >
> > Waldek
> >
> > PS. You might also add this printout to verify if you are hitting the
> > hard fragmentation case:
> >
> >  page_range* range = nullptr;
> >  if (!bitset) {
> >  if (!exact_order || _free[exact_order - 1].empty()) {
> > +printf("page_range_allocator: no ranges found for size
> > %ld and exact order: %d\n", size, exact_order);
> >  return nullptr;
> >  }
> >
> > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote:
> > > Ok, with the waiter_print, we see:
> > >
> > > Waiter: 2_scheduler, bytes: 4984832
> > > Waiter: 5_scheduler, bytes: 4984832
> > > Out of memory: could not reclaim any further. Current memory:
> > > 10871576Kb, target: 0 Kb
> > >
> > > 'osv mem' in gdb reports loads of free memory:
> > >
> > > (gdb) osv mem
> > > Total Memory: 12884372480 Bytes
> > > Mmap Memory:  1610067968 Bytes (12.50%)
> > > Free Memory:  11132493824 Bytes (86.40%)
> > >
> > > So why is it failing to allocate 5MB when it claims to have 11GB
> > > free?
> > >
> > > Any more debug I can provide?
> > >
> > > Rick
> > >
> > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote:
> > > > And indeed, back on the 'release' image, and we hit an oom:
> > > >
> > > > (gdb) b memory::oom
> > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505.
> > > > (gdb) c
> > > > Continuing.
> > > >
> > > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0)
> > > at
> > > > core/mempool.cc:505
> > > > 505 core/mempool.cc: No such file or directory.
> > > > (gdb) osv memory
> > > > Total Memory: 12884372480 Bytes
> > > > Mmap Memory:  1541332992 Bytes (11.96%)
> > > > Free Memory:  11490983936 Bytes (89.19%)
> > > >
> > > > I don't have the _waiters list print unfortunately (as I hadn't
> > > > included it in this image). It looks like this though:
> > > >
> > > > (gdb) p _oom_blocked._waiters
> > > > $4 =
> > > > {<boost::intrusive::list_impl<boost::intrusive::bhtraits<memory::reclaimer_waiters::wait_node,
> > > >   boost::intrusive::list_node_traits<void*>, (boost::intrusive::link_mode_type)1,
> > > >   boost::intrusive::dft_tag, 1>, unsigned long, false, void>> = {
> > > >     static constant_time_size = false, static safemode_or_autounlink = true,
> > > >     static link_mode = boost::intrusive::safe_link, ...,
> > > >     root_plus_size_ = {..., m_header = {next_ = 0x200041305580,
> > > >       prev_ = 0x200041305580}, <no data fields>}}}
> > > >
> > > > 'osv waiters' fails, annoyingly:
> > > >
> > > > (gdb) osv waiters
> > > > waiters:
> > > > Python Exception  Cannot access memory
> > > at
> > > > address 0x42c00020982ac010:
> > > > Error occurred in Python command: Cannot access memory at
> > > address
> > > > 0x42c00020982ac010
> > > >
> > > > Rick
> > > >
> > > > On Sat, 2020-03-21 at 06:42 +1000, Rick Payne wrote:
> > > > > So we've been trying with a debug image (with the tcp kassert
> > > > > disabled). We do not get an OOM crash, but after some time
> > > (24hrs
> > > > > or
> > > > > more) all cores just hit idle and never recover. It stops
> > > > > responding
> > > > > to
> 

Re: [osv-dev] Re: OOM query

2020-03-22 Thread Rick Payne


Ok, so I applied the patch and the printf, and this is what I see:

page_range_allocator: no ranges found for size 39849984 and exact
order: 14
Waiter: 1_scheduler, bytes: 39849984
Out of memory: could not reclaim any further. Current memory: 10933128
Kb, target: 0 Kb
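
For what it's worth, a back-of-the-envelope check of that "exact order: 14"
(a sketch assuming 4 KB pages and power-of-two order buckets, which is what
the log format suggests - not code taken from the OSv tree):

// Hypothetical stand-alone check: why a 39849984-byte request lands in order 14.
#include <cstdio>
#include <cstddef>

int main() {
    size_t bytes = 39849984;                // the failing allocation above
    size_t pages = (bytes + 4095) / 4096;   // 9729 pages of 4 KB
    unsigned order = 0;
    while ((size_t(1) << order) < pages) {
        ++order;                            // smallest order with 2^order >= pages
    }
    std::printf("pages=%zu order=%u\n", pages, order);   // prints: pages=9729 order=14
    return 0;
}

In other words the request needs one contiguous free range of roughly 38 MB,
and the message above says nothing suitable is left in the matching order
buckets - so the ~10.9 GB of free memory must be split into smaller pieces,
which is the fragmentation scenario described in the quoted message below.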

Rick

On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote:
> 
> 
> I think that OSv might be hitting some sort of fragmentation scenario
> (another question is why we get to this point) which should be
> handled but is not due to the bug this patch fixes:
> 
> diff --git a/core/mempool.cc b/core/mempool.cc
> index d902eea8..11fd1456 100644
> --- a/core/mempool.cc
> +++ b/core/mempool.cc
> @@ -702,10 +702,13 @@ page_range* page_range_allocator::alloc(size_t
> size)
>  for (auto&& pr : _free[exact_order - 1]) {
>  if (pr.size >= size) {
>  range = &pr;
> +remove_list(exact_order - 1, *range);
>  break;
>  }
>  }
> -return nullptr;
> +if (!range) {
> +return nullptr;
> +}
>  } else if (order == max_order) {
>  range = &*_free_huge.rbegin();
>  if (range->size < size) {
> 
> Can you give it a try and see what happens with this patch?
> 
> Waldek
> 
> PS. You might also add this printout to verify if you are hitting the
> hard fragmentation case:
> 
>  page_range* range = nullptr;
>  if (!bitset) {
>  if (!exact_order || _free[exact_order - 1].empty()) {
> +printf("page_range_allocator: no ranges found for size
> %ld and exact order: %d\n", size, exact_order);
>  return nullptr;
>  }
> 
> On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote:
> > Ok, with the waiter_print, we see: 
> > 
> > Waiter: 2_scheduler, bytes: 4984832 
> > Waiter: 5_scheduler, bytes: 4984832 
> > Out of memory: could not reclaim any further. Current memory: 
> > 10871576Kb, target: 0 Kb 
> > 
> > 'osv mem' in gdb reports loads of free memory: 
> > 
> > (gdb) osv mem 
> > Total Memory: 12884372480 Bytes 
> > Mmap Memory:  1610067968 Bytes (12.50%) 
> > Free Memory:  11132493824 Bytes (86.40%) 
> > 
> > So why is it failing to allocate 5MB when it claims to have 11GB
> > free? 
> > 
> > Any more debug I can provide? 
> > 
> > Rick 
> > 
> > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote: 
> > > And indeed, back on the 'release' image, and we hit an oom: 
> > > 
> > > (gdb) b memory::oom 
> > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505. 
> > > (gdb) c 
> > > Continuing. 
> > > 
> > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0)
> > at 
> > > core/mempool.cc:505 
> > > 505 core/mempool.cc: No such file or directory. 
> > > (gdb) osv memory 
> > > Total Memory: 12884372480 Bytes 
> > > Mmap Memory:  1541332992 Bytes (11.96%) 
> > > Free Memory:  11490983936 Bytes (89.19%) 
> > > 
> > > I don't have the _waiters list print unfortunately (as I hadn't 
> > > included it in this image). It looks like this though: 
> > > 
> > > (gdb) p _oom_blocked._waiters 
> > > $4 = 
> > > {<boost::intrusive::list_impl<boost::intrusive::bhtraits<memory::reclaimer_waiters::wait_node,
> > >   boost::intrusive::list_node_traits<void*>, (boost::intrusive::link_mode_type)1,
> > >   boost::intrusive::dft_tag, 1>, unsigned long, false, void>> = {
> > >     static constant_time_size = false, static safemode_or_autounlink = true,
> > >     static link_mode = boost::intrusive::safe_link, ...,
> > >     root_plus_size_ = {..., m_header = {next_ = 0x200041305580,
> > >       prev_ = 0x200041305580}, <no data fields>}}}
> > > 
> > > 'osv waiters' fails, annoyingly: 
> > > 
> > > (gdb) osv waiters 
> > > waiters: 
> > > Python Exception  Cannot access memory
> > at 
> > > address 0x42c00020982ac010: 
> > > Error occurred in Python command: Cannot access memory at
> > address 
> > > 0x42c00020982ac010 
> > > 
> > > Rick 
> > > 
> > > On Sat, 2020-03-21 at 06:42 +1000, Rick Payne wrote: 
> > > > So we've been trying with a debug image (with the tcp kassert 
> > > > disabled). We do not get an OOM crash, but after some time
> > (24hrs 
> > > > or 
> > > > more) all cores just hit idle and never recover. It stops 
> > > > responding 
> > > > to 
>> > > > > TCP connections, and when you attach gdb, you see this: 
> > > > 
> > > > Thread 1 received signal SIGINT, Interrupt. 
> > > > processor::sti_hlt () at arch/x64/processor.hh:252 
> > > > 252 arch/x64/processor.hh: No such file or directory. 
> > > > (gdb) info thread 
> > > >   Id   Target Id 

Re: [osv-dev] Re: OOM query

2020-03-21 Thread Rick Payne


Ok, with the waiter_print, we see:

Waiter: 2_scheduler, bytes: 4984832
Waiter: 5_scheduler, bytes: 4984832
Out of memory: could not reclaim any further. Current memory:
10871576Kb, target: 0 Kb

'osv mem' in gdb reports loads of free memory:

(gdb) osv mem
Total Memory: 12884372480 Bytes
Mmap Memory:  1610067968 Bytes (12.50%)
Free Memory:  11132493824 Bytes (86.40%)

So why is it failing to allocate 5MB when it claims to have 11GB free?

Any more debug I can provide?

Rick

On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote:
> And indeed, back on the 'release' image, and we hit an oom:
> 
> (gdb) b memory::oom
> Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505.
> (gdb) c
> Continuing.
> 
> Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0) at
> core/mempool.cc:505
> 505 core/mempool.cc: No such file or directory.
> (gdb) osv memory
> Total Memory: 12884372480 Bytes
> Mmap Memory:  1541332992 Bytes (11.96%)
> Free Memory:  11490983936 Bytes (89.19%)
> 
> I don't have the _waiters list print unfortunately (as I hadn't
> included it in this image). It looks like this though:
> 
> (gdb) p _oom_blocked._waiters
> $4 =
> {<boost::intrusive::list_impl<boost::intrusive::bhtraits<memory::reclaimer_waiters::wait_node,
>   boost::intrusive::list_node_traits<void*>, (boost::intrusive::link_mode_type)1,
>   boost::intrusive::dft_tag, 1>, unsigned long, false, void>> = {
>     static constant_time_size = false, static safemode_or_autounlink = true,
>     static link_mode = boost::intrusive::safe_link, ...,
>     root_plus_size_ = {..., m_header = {next_ = 0x200041305580,
>       prev_ = 0x200041305580}, <no data fields>}}}
> 
> 'osv waiters' fails, annoyingly:
> 
> (gdb) osv waiters
> waiters:
> Python Exception  Cannot access memory at
> address 0x42c00020982ac010: 
> Error occurred in Python command: Cannot access memory at address
> 0x42c00020982ac010
> 
> Rick
> 
> On Sat, 2020-03-21 at 06:42 +1000, Rick Payne wrote:
> > So we've been trying with a debug image (with the tcp kassert
> > disabled). We do not get an OOM crash, but after some time (24hrs
> > or
> > more) all cores just hit idle and never recover. It stops
> > responding
> > to
> > TCP connections, and when you attach gdb, you see this:
> > 
> > Thread 1 received signal SIGINT, Interrupt.
> > processor::sti_hlt () at arch/x64/processor.hh:252
> > 252 arch/x64/processor.hh: No such file or directory.
> > (gdb) info thread
> >   Id   Target Id Frame 
> > * 1Thread 1 (CPU#0 [halted ]) processor::sti_hlt () at
> > arch/x64/processor.hh:252
> >   2Thread 2 (CPU#1 [halted ]) processor::sti_hlt () at
> > arch/x64/processor.hh:252
> >   3Thread 3 (CPU#2 [halted ]) processor::sti_hlt () at
> > arch/x64/processor.hh:252
> >   4Thread 4 (CPU#3 [halted ]) processor::sti_hlt () at
> > arch/x64/processor.hh:252
> >   5Thread 5 (CPU#4 [halted ]) processor::sti_hlt () at
> > arch/x64/processor.hh:252
> >   6Thread 6 (CPU#5 [halted ]) processor::sti_hlt () at
> > arch/x64/processor.hh:252
> > 
> > The threads themselves look like this:
> > 
> > (gdb) bt
> > #0  processor::sti_hlt () at arch/x64/processor.hh:252
> > #1  0x405e4016 in arch::wait_for_interrupt () at
> > arch/x64/arch.hh:43
> > #2  0x405d7f10 in sched::cpu::do_idle
> > (this=0x8001b040)
> > at core/sched.cc:404
> > #3  0x405d7fc1 in sched::cpu::idle
> > (this=0x8001b040)
> > at
> > core/sched.cc:423
> > #4  0x405d73ef in sched::cpuoperator()(void)
> > const (__closure=0x800100156070)
> > at core/sched.cc:165
> > #5  0x405e0b57 in std::_Function_handler > sched::cpu::init_idle_thread():: >::_M_invoke(const
> > std::_Any_data &) (__functor=...) at
> > /usr/include/c++/9/bits/std_function.h:300
> > #6  0x40496206 in std::function::operator()()
> > const
> > (this=0x800100156070)
> > at /usr/include/c++/9/bits/std_function.h:690
> > #7  0x405db386 in sched::thread::main
> > (this=0x800100156040)
> > at core/sched.cc:1210
> > #8  0x405d7173 in sched::thread_main_c
> > (t=0x800100156040)
> > at arch/x64/arch-switch.hh:321
> > #9  0x404911b3 in thread_main () at arch/x64/entry.S:113
> > 
> > I tried the 'osv waiters' command but just get:
> > 
> > (gdb) osv waiters
> > waiters:
> > Python Exception  Cannot access memory at
> > address 0x42c00020880a9010: 
> > Error occurred in Python command: Cannot access memory at address
> > 0x42c00020880a9010
> > 
> > I think we'll go back to a 'release' image and see if we get the
> > oom
> > with a few more clues...
> > 
> > Rick
> > 
> > 
> > On Wed, 2020-03-11 at 08:07 -0700, Waldek Kozaczuk wrote:
> > > On Tuesday, March 

Re: [osv-dev] Re: OOM query

2020-03-20 Thread Rick Payne


So we've been trying with a debug image (with the tcp kassert
disabled). We do not get an OOM crash, but after some time (24hrs or
more) all cores just hit idle and never recover. It stops responding to
TCP connections, and when you attach gdb, you see this:

Thread 1 received signal SIGINT, Interrupt.
processor::sti_hlt () at arch/x64/processor.hh:252
252 arch/x64/processor.hh: No such file or directory.
(gdb) info thread
  Id   Target Id Frame 
* 1Thread 1 (CPU#0 [halted ]) processor::sti_hlt () at
arch/x64/processor.hh:252
  2Thread 2 (CPU#1 [halted ]) processor::sti_hlt () at
arch/x64/processor.hh:252
  3Thread 3 (CPU#2 [halted ]) processor::sti_hlt () at
arch/x64/processor.hh:252
  4Thread 4 (CPU#3 [halted ]) processor::sti_hlt () at
arch/x64/processor.hh:252
  5Thread 5 (CPU#4 [halted ]) processor::sti_hlt () at
arch/x64/processor.hh:252
  6Thread 6 (CPU#5 [halted ]) processor::sti_hlt () at
arch/x64/processor.hh:252

The threads themselves look like this:

(gdb) bt
#0  processor::sti_hlt () at arch/x64/processor.hh:252
#1  0x405e4016 in arch::wait_for_interrupt () at
arch/x64/arch.hh:43
#2  0x405d7f10 in sched::cpu::do_idle (this=0x8001b040)
at core/sched.cc:404
#3  0x405d7fc1 in sched::cpu::idle (this=0x8001b040) at
core/sched.cc:423
#4  0x405d73ef in sched::cpuoperator()(void)
const (__closure=0x800100156070)
at core/sched.cc:165
#5  0x405e0b57 in std::_Function_handler >::_M_invoke(const
std::_Any_data &) (__functor=...) at
/usr/include/c++/9/bits/std_function.h:300
#6  0x40496206 in std::function::operator()() const
(this=0x800100156070)
at /usr/include/c++/9/bits/std_function.h:690
#7  0x405db386 in sched::thread::main (this=0x800100156040)
at core/sched.cc:1210
#8  0x405d7173 in sched::thread_main_c (t=0x800100156040)
at arch/x64/arch-switch.hh:321
#9  0x404911b3 in thread_main () at arch/x64/entry.S:113
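
That backtrace is just the per-CPU idle thread parked waiting for an
interrupt. Assuming the usual x86 implementation behind the name (I have not
copied this from arch/x64/processor.hh), processor::sti_hlt() is essentially:

// Assumed sketch, based on the function name and standard x86 semantics:
inline void sti_hlt()
{
    asm volatile("sti; hlt" : : : "memory");  // enable interrupts, halt until one arrives
}

So all six vCPUs sitting there means nothing is runnable and nothing is
waking the blocked threads up, rather than an outright crash.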

I tried the 'osv waiters' command but just get:

(gdb) osv waiters
waiters:
Python Exception  Cannot access memory at
address 0x42c00020880a9010: 
Error occurred in Python command: Cannot access memory at address
0x42c00020880a9010

I think we'll go back to a 'release' image and see if we get the oom
with a few more clues...

Rick


On Wed, 2020-03-11 at 08:07 -0700, Waldek Kozaczuk wrote:
> 
> 
> On Tuesday, March 10, 2020 at 10:53:17 PM UTC-4, rickp wrote:
> > I've not found a way to reproduce this other than in production
> > yet, 
> > which is annoying. I've built an image with this patch, and will
> > see if 
> > we can run it with gdb too. 
> > 
> > What should I be looking for if it hits? 
> 
> In case of oom() this patch should show a list of threads (waiters)
> along with the amount of memory requested. I am hoping it will give
> better clues where the problem is. 
> > Note, the 'osv pagetable walk' command (and a few others) seem not
> > to 
> > work for me: 
> > 
> > (gdb) osv pagetable walk 
> > Python Exception  Argument required (expression
> > to 
> > compute).: 
> > Error occurred in Python: Argument required (expression to
> > compute). 
> 
> I will look into it. 
> > Rick 
> > 
> > On Tue, 2020-03-10 at 11:21 -0700, Waldek Kozaczuk wrote: 
> > > Is there a way to have a reproducible test app? If not, I was 
> > > wondering if this patch, enabling some extra logging, would help: 
> > > 
> > > diff --git a/core/mempool.cc b/core/mempool.cc 
> > > index d902eea8..ace28b38 100644 
> > > --- a/core/mempool.cc 
> > > +++ b/core/mempool.cc 
> > > @@ -501,9 +501,9 @@ ssize_t
> > reclaimer::bytes_until_normal(pressure 
> > > curr) 
> > >  } 
> > >  } 
> > >   
> > > -void oom() 
> > > +void oom(ssize_t target) 
> > >  { 
> > > -abort("Out of memory: could not reclaim any further.
> > Current 
> > > memory: %d Kb", stats::free() >> 10); 
> > > +abort("Out of memory: could not reclaim any further.
> > Current 
> > > memory: %d Kb, target: %d Kb", stats::free() >> 10, target >>
> > 10); 
> > >  } 
> > >   
> > >  void reclaimer::wait_for_minimum_memory() 
> > > @@ -924,6 +924,17 @@ bool reclaimer_waiters::wake_waiters() 
> > >  return woken; 
> > >  } 
> > >   
> > > +void reclaimer_waiters::print_waiters() 
> > > +{ 
> > > +auto it = _waiters.begin(); 
> > > +while (it != _waiters.end()) { 
> > > +auto& wr = *it; 
> > > +it++; 
> > > + 
> > > +printf( "Waiter: %s, bytes: %ld\n", wr.owner->name(), 
> > > wr.bytes); 
> > > +} 
> > > +} 
> > > + 
> > >  // Note for callers: Ideally, we would not only wake, but
> > already 
> > > allocate 
> > >  // memory here and pass it back to the waiter. However, memory
> > is 
> > > not always 
> > >  // allocated the same way (ex: refill_page_buffer is completely 
> > > different from 
> > > @@ -943,7 +954,7 @@ void reclaimer_waiters::wait(size_t bytes) 
> > >   
> > >  // Wait for whom? 
> > >  if (curr == reclaimer_thread._thread.get()) { 
> 

Re: [osv-dev] Re: OOM query

2020-03-10 Thread Waldek Kozaczuk
Is there a way to have a reproducible test app? If not, I was wondering if 
this patch, enabling some extra logging, would help:

diff --git a/core/mempool.cc b/core/mempool.cc
index d902eea8..ace28b38 100644
--- a/core/mempool.cc
+++ b/core/mempool.cc
@@ -501,9 +501,9 @@ ssize_t reclaimer::bytes_until_normal(pressure curr)
 }
 }
 
-void oom()
+void oom(ssize_t target)
 {
-abort("Out of memory: could not reclaim any further. Current memory: 
%d Kb", stats::free() >> 10);
+abort("Out of memory: could not reclaim any further. Current memory: 
%d Kb, target: %d Kb", stats::free() >> 10, target >> 10);
 }
 
 void reclaimer::wait_for_minimum_memory()
@@ -924,6 +924,17 @@ bool reclaimer_waiters::wake_waiters()
 return woken;
 }
 
+void reclaimer_waiters::print_waiters()
+{
+auto it = _waiters.begin();
+while (it != _waiters.end()) {
+auto& wr = *it;
+it++;
+
+printf( "Waiter: %s, bytes: %ld\n", wr.owner->name(), wr.bytes);
+}
+}
+
 // Note for callers: Ideally, we would not only wake, but already allocate
 // memory here and pass it back to the waiter. However, memory is not 
always
 // allocated the same way (ex: refill_page_buffer is completely different 
from
@@ -943,7 +954,7 @@ void reclaimer_waiters::wait(size_t bytes)
 
 // Wait for whom?
 if (curr == reclaimer_thread._thread.get()) {
-oom();
+oom(bytes);
  }
 
 wait_node wr;
@@ -1027,7 +1038,8 @@ void reclaimer::_do_reclaim()
 // Wake up all waiters that are waiting and now have a 
chance to succeed.
 // If we could not wake any, there is nothing really we 
can do.
 if (!_oom_blocked.wake_waiters()) {
-oom();
+_oom_blocked.print_waiters();
+oom(target);
 }
 }
 
diff --git a/include/osv/mempool.hh b/include/osv/mempool.hh
index 10fe5602..620f1a5b 100644
--- a/include/osv/mempool.hh
+++ b/include/osv/mempool.hh
@@ -137,6 +137,7 @@ public:
 bool wake_waiters();
 void wait(size_t bytes);
 bool has_waiters() { return !_waiters.empty(); }
+void print_waiters();
 private:
 struct wait_node: boost::intrusive::list_base_hook<> {
 sched::thread* owner;

Waldek

On Tuesday, March 10, 2020 at 12:26:54 PM UTC-4, Waldek Kozaczuk wrote:
>
> Well, I found this commit changing the original "> 0" to ">= 0" - 
> https://github.com/cloudius-systems/osv/commit/f888c39d744c38cf6fa2f7568c04cea3a7217dca
>  - 
> with some explanation. So maybe my theory is wrong. Also I am not sure 
> calling bytes_until_normal() before that "if" would change anything then.
>
> On Tuesday, March 10, 2020 at 12:04:27 PM UTC-4, Waldek Kozaczuk wrote:
>>
>> So here is the full code of _do_reclaim():
>>
>> 1001 void reclaimer::_do_reclaim()
>> 1002 {
>> 1003 ssize_t target;
>> 1004 emergency_alloc_level = 1;
>> 1005 
>> 1006 while (true) {
>> 1007 WITH_LOCK(free_page_ranges_lock) {
>> 1008 _blocked.wait(free_page_ranges_lock);
>> 1009 *target = bytes_until_normal();*
>> 1010 }
>> 1011 
>> 1012 // This means that we are currently ballooning, we should
>> 1013 // try to serve the waiters from temporary memory without
>> 1014 // going on hard mode. A big batch of more memory is likely
>> 1015 // in its way.
>> 1016 if (_oom_blocked.has_waiters() && throttling_needed()) {
>> 1017 _shrinker_loop(target, [] { return false; });
>> 1018 WITH_LOCK(free_page_ranges_lock) {
>> 1019 if (_oom_blocked.wake_waiters()) {
>> 1020 continue;
>> 1021 }
>> 1022 }
>> 1023 }
>> 1024 
>> 1025 _shrinker_loop(target, [this] { return 
>> _oom_blocked.has_waiters(); });
>> 1026 
>> 1027 WITH_LOCK(free_page_ranges_lock) {
>> 1028 if (target >= 0) {
>> 1029 // Wake up all waiters that are waiting and now have 
>> a chance to succeed.
>> 1030 // If we could not wake any, there is nothing really 
>> we can do.
>> 1031 if (!_oom_blocked.wake_waiters()) {
>> 1032 oom();
>> 1033 }
>> 1034 }
>> 1035 
>> 1036 if (balloon_api) {
>> 1037 balloon_api->voluntary_return();
>> 1038 }
>> 1039 }
>> 1040 }
>> 1041 }
>>
>> We got oom() because target was '>= 0'. Now the target is calculated as 
>> the result of  *bytes_until_normal()*. 
>>
>>  495 ssize_t reclaimer::bytes_until_normal(pressure curr)
>>  496 {
>>  497 assert(mutex_owned(_page_ranges_lock));
>>  498 if (curr == pressure::PRESSURE) {
>>  499 return watermark_lo - stats::free();
>>  500 } else {
>>  501 return 0;
>>  502 }
>>  503 }
>>
>> which seems to indicate that when 0 is returned there is no need to reclaim 
>> any memory.
>>
>> So here are two things 

Re: [osv-dev] Re: OOM query

2020-03-10 Thread Waldek Kozaczuk
Well, I found this commit changing the original "> 0" to ">= 0" - 
https://github.com/cloudius-systems/osv/commit/f888c39d744c38cf6fa2f7568c04cea3a7217dca
 - 
with some explanation. So maybe my theory is wrong. Also I am not sure 
calling bytes_until_normal() before that "if" would change anything then.

On Tuesday, March 10, 2020 at 12:04:27 PM UTC-4, Waldek Kozaczuk wrote:
>
> So here is the full code of _do_reclaim():
>
> 1001 void reclaimer::_do_reclaim()
> 1002 {
> 1003 ssize_t target;
> 1004 emergency_alloc_level = 1;
> 1005 
> 1006 while (true) {
> 1007 WITH_LOCK(free_page_ranges_lock) {
> 1008 _blocked.wait(free_page_ranges_lock);
> 1009 *target = bytes_until_normal();*
> 1010 }
> 1011 
> 1012 // This means that we are currently ballooning, we should
> 1013 // try to serve the waiters from temporary memory without
> 1014 // going on hard mode. A big batch of more memory is likely
> 1015 // in its way.
> 1016 if (_oom_blocked.has_waiters() && throttling_needed()) {
> 1017 _shrinker_loop(target, [] { return false; });
> 1018 WITH_LOCK(free_page_ranges_lock) {
> 1019 if (_oom_blocked.wake_waiters()) {
> 1020 continue;
> 1021 }
> 1022 }
> 1023 }
> 1024 
> 1025 _shrinker_loop(target, [this] { return 
> _oom_blocked.has_waiters(); });
> 1026 
> 1027 WITH_LOCK(free_page_ranges_lock) {
> 1028 if (target >= 0) {
> 1029 // Wake up all waiters that are waiting and now have 
> a chance to succeed.
> 1030 // If we could not wake any, there is nothing really 
> we can do.
> 1031 if (!_oom_blocked.wake_waiters()) {
> 1032 oom();
> 1033 }
> 1034 }
> 1035 
> 1036 if (balloon_api) {
> 1037 balloon_api->voluntary_return();
> 1038 }
> 1039 }
> 1040 }
> 1041 }
>
> We got oom() because target was '>= 0'. Now the target is calculated as 
> the result of  *bytes_until_normal()*. 
>
>  495 ssize_t reclaimer::bytes_until_normal(pressure curr)
>  496 {
>  497 assert(mutex_owned(_page_ranges_lock));
>  498 if (curr == pressure::PRESSURE) {
>  499 return watermark_lo - stats::free();
>  500 } else {
>  501 return 0;
>  502 }
>  503 }
>
> which seems to indicate that when 0 is returned there is no need to reclaim 
> any memory.
>
> So here are two things that might be wrong:
>
> 1. Shouldn't if (target >= 0) be changed to if (target > 0) {? 
>
> 2. Shouldn't we re-read the target in the second WITH_LOCK instead of 
> comparing the value computed at the beginning of the loop body? The 
> line before - _shrinker_loop(target, [this] { return 
> _oom_blocked.has_waiters(); }); - might have just released enough memory 
> to bring target below 0, right?
>
> In any case it would be useful to print the value of the target before 
> oom():
>
>  if (!_oom_blocked.wake_waiters()) {
>
>  printf("--> Target: %ld\n", target);
>
>  oom();
>
>  }
>
> On Monday, March 9, 2020 at 10:51:02 PM UTC-4, rickp wrote:
>>
>>
>> We're pretty close to current on OSv, but it also happens on an older 
>> image. We have changed some stuff in our app, but I think that may just 
>> be provoking the bug. Certainly from gdb, I can see that both mmapped and 
>> normal memory fluctuate up and down but everything looks sane. 
>>
>> More debug in wake_waiters would be useful, but I'm losing the argument 
>> to continue with OSv at the moment which makes testing this a bit 
>> 'political'. 
>>
>> btw - when we do run the system out of memory, it seems to hang rather 
>> than generate an oom. Have you tried it? 
>>
>> The tcp_do_segment one has been mentioned before (by someone else). The 
>> issue is that the kassert only has effect in the debug build. I'd guess 
>> that the socket is being closed, but still has segments that have not 
>> been processed, or something like that. I'll try and narrow it down a 
>> bit if I get time. 
>>
>> Rick 
>>
>> On Mon, 2020-03-09 at 22:32 -0400, Waldek Kozaczuk wrote: 
>> > Does it happen with the very latest OSv code? Did it start happening 
>> > at some point more often? 
>> > 
>> > I wonder if we could add some helpful printouts in wake_waiters(). 
>> > 
>> > Btw that assert() failure in tcp_do_segment() rings a bell. 
>> > 
>> > On Mon, Mar 9, 2020 at 22:25 Rick Payne  wrote: 
>> > > I can't add much other than I doubt it's fragmentation. Sometimes 
>> > > this 
>> > > happens within a few minutes of the system starting. At no point do 
>> > > I 
>> > > think we're using more than 2GB of ram (of the 12GB) either. 
>> > > 
>> > > I did compile up a debug version of OSv and built the system with 
>> > > that, 
>> > > but I've been unable to trigger the oom(). Worse, I hit 

Re: [osv-dev] Re: OOM query

2020-03-10 Thread Waldek Kozaczuk
So here is the full code of _do_reclaim():

1001 void reclaimer::_do_reclaim()
1002 {
1003 ssize_t target;
1004 emergency_alloc_level = 1;
1005 
1006 while (true) {
1007 WITH_LOCK(free_page_ranges_lock) {
1008 _blocked.wait(free_page_ranges_lock);
1009 *target = bytes_until_normal();*
1010 }
1011 
1012 // This means that we are currently ballooning, we should
1013 // try to serve the waiters from temporary memory without
1014 // going on hard mode. A big batch of more memory is likely
1015 // in its way.
1016 if (_oom_blocked.has_waiters() && throttling_needed()) {
1017 _shrinker_loop(target, [] { return false; });
1018 WITH_LOCK(free_page_ranges_lock) {
1019 if (_oom_blocked.wake_waiters()) {
1020 continue;
1021 }
1022 }
1023 }
1024 
1025 _shrinker_loop(target, [this] { return 
_oom_blocked.has_waiters(); });
1026 
1027 WITH_LOCK(free_page_ranges_lock) {
1028 if (target >= 0) {
1029 // Wake up all waiters that are waiting and now have a 
chance to succeed.
1030 // If we could not wake any, there is nothing really 
we can do.
1031 if (!_oom_blocked.wake_waiters()) {
1032 oom();
1033 }
1034 }
1035 
1036 if (balloon_api) {
1037 balloon_api->voluntary_return();
1038 }
1039 }
1040 }
1041 }

We got oom() because target was '>= 0'. Now the target is calculated as the 
result of  *bytes_until_normal()*. 

 495 ssize_t reclaimer::bytes_until_normal(pressure curr)
 496 {
 497 assert(mutex_owned(_page_ranges_lock));
 498 if (curr == pressure::PRESSURE) {
 499 return watermark_lo - stats::free();
 500 } else {
 501 return 0;
 502 }
 503 }

which seems to indicate that when 0 is returned there is no need to reclaim 
any memory.

So here are two things that might be wrong:

1. Shouldn't if (target >= 0) be changed to if (target > 0) {? 

2. Shouldn't we re-read the target in the second WITH_LOCK instead of comparing 
the value computed at the beginning of the loop body? The line 
before - _shrinker_loop(target, [this] { return _oom_blocked.has_waiters(); 
}); - might have just released enough memory to bring target below 0, right?

In any case it would be useful to print the value of the target before 
oom():

 if (!_oom_blocked.wake_waiters()) {

 printf("--> Target: %ld\n", target);

 oom();

 }
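
Put together, the two suggestions would look roughly like this (an untested
sketch against the code quoted above, not a patch I have tried to run):

 WITH_LOCK(free_page_ranges_lock) {
     // re-read the target here: _shrinker_loop() above may have freed enough already
     target = bytes_until_normal();
     if (target > 0) {
         // Wake up all waiters that are waiting and now have a chance to succeed.
         // If we could not wake any, there is nothing really we can do.
         if (!_oom_blocked.wake_waiters()) {
             printf("--> Target: %ld\n", target);
             oom();
         }
     }

     if (balloon_api) {
         balloon_api->voluntary_return();
     }
 }

(If the ">= 0" comparison turns out to be intentional, only the re-read part
of this sketch would still apply.)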

On Monday, March 9, 2020 at 10:51:02 PM UTC-4, rickp wrote:
>
>
> We're pretty close to current on OSv, but it also happens on an older 
> image. We have changed some stuff in our app, but I think that may just 
> be provoking the bug. Certainly from gdb, I can see that both mmapped and 
> normal memory fluctuate up and down but everything looks sane. 
>
> More debug in wake_waiters would be useful, but I'm losing the argument 
> to continue with OSv at the moment which makes testing this a bit 
> 'political'. 
>
> btw - when we do run the system out of memory, it seems to hang rather 
> than generate an oom. Have you tried it? 
>
> The tcp_do_segment one has been mentioned before (by someone else). The 
> issue is that the kassert only has effect in the debug build. I'd guess 
> that the socket is being closed, but still has segments that have not 
> been processed, or something like that. I'll try and narrow it down a 
> bit if I get time. 
>
> Rick 
>
> On Mon, 2020-03-09 at 22:32 -0400, Waldek Kozaczuk wrote: 
> > Does it happen with the very latest OSv code? Did it start happening 
> > at some point more often? 
> > 
> > I wonder if we could add some helpful printouts in wake_waiters(). 
> > 
> > Btw that assert() failure in tcp_do_segment() rings a bell. 
> > 
> > On Mon, Mar 9, 2020 at 22:25 Rick Payne  > wrote: 
> > > I can't add much other than I doubt it's fragmentation. Sometimes 
> > > this 
> > > happens within a few minutes of the system starting. At no point do 
> > > I 
> > > think we're using more than 2GB of ram (of the 12GB) either. 
> > > 
> > > I did compile up a debug version of OSv and built the system with 
> > > that, 
> > > but I've been unable to trigger the oom(). Worse, I hit a kassert 
> > > in 
> > > the netchannel code that seems to be ignored in the 'release' 
> > > build, 
> > > but panics in the debug build: 
> > > 
> > > [E/384 bsd-kassert]: tcp_do_segment: TCPS_LISTEN 
> > > Assertion failed: tp->get_state() > 1 
> > > (bsd/sys/netinet/tcp_input.cc: 
> > > tcp_do_segment: 1076) 
> > > 
> > > [backtrace] 
> > > 0x40221330  
> > > 0x40221399 <__assert_fail+64> 
> > > 0x402a4798  
> > > 0x402a97c2  
> > > 0x402a98a1  
> > > 0x402aa448  
> > > 0x40656a9a ::operator()(mbuf*) 
> > 

Re: [osv-dev] Re: OOM query

2020-03-09 Thread Rick Payne


We're pretty close to current on OSv, but it also happens on an older
image. We have changed some stuff in our app, but I think that may just
be provoking the bug. Certainly from gdb, I can see that both mmapped and
normal memory fluctuate up and down but everything looks sane.

More debug in wake_waiters would be useful, but I'm losing the argument
to continue with OSv at the moment which makes testing this a bit
'political'.

btw - when we do run the system out of memory, it seems to hang rather
than generate an oom. Have you tried it?

The tcp_do_segment one has been mentioned before (by someone else). The
issue is that the kassert only has effect in the debug build. I'd guess
that the socket is being closed, but still has segments that have not
been processed, or something like that. I'll try and narrow it down a
bit if I get time.
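
For context, a rough illustration of why the check only bites in the debug
build (an assumption about how the BSD-style KASSERT is wired up in OSv, not
copied from the source; the macro names here are made up):

// Hypothetical sketch only: in a release image the failed KASSERT is at most
// logged and execution continues, while a debug image escalates it to an abort.
// OSV_DEBUG_BUILD and log_kassert are made-up names for illustration.
#ifdef OSV_DEBUG_BUILD
#define KASSERT(exp, msg) do { if (!(exp)) { log_kassert msg; abort(); } } while (0)
#else
#define KASSERT(exp, msg) do { if (!(exp)) { log_kassert msg; } } while (0)
#endif

Either way, the assertion firing means tcp_do_segment() was handed a socket
whose state is already back at TCPS_LISTEN or TCPS_CLOSED, which fits the
"closed but still has queued segments" guess above.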

Rick

On Mon, 2020-03-09 at 22:32 -0400, Waldek Kozaczuk wrote:
> Does it happen with the very latest OSv code? Did it start happening
> at some point more often? 
> 
> I wonder if we could add some helpful printouts in wake_waiters(). 
> 
> Btw that assert() failure in tcp_do_segment() rings a bell. 
> 
> On Mon, Mar 9, 2020 at 22:25 Rick Payne  wrote:
> > I can't add much other than I doubt it's fragmentation. Sometimes
> > this
> > happens within a few minutes of the system starting. At no point do
> > I
> > think we're using more than 2GB of ram (of the 12GB) either.
> > 
> > I did compile up a debug version of OSv and built the system with
> > that,
> > but I've been unable to trigger the oom(). Worse, I hit a kassert
> > in
> > the netchannel code that seems to be ignored in the 'release'
> > build,
> > but panics in the debug build:
> > 
> > [E/384 bsd-kassert]: tcp_do_segment: TCPS_LISTEN
> > Assertion failed: tp->get_state() > 1
> > (bsd/sys/netinet/tcp_input.cc:
> > tcp_do_segment: 1076)
> > 
> > [backtrace]
> > 0x40221330 
> > 0x40221399 <__assert_fail+64>
> > 0x402a4798 
> > 0x402a97c2 
> > 0x402a98a1 
> > 0x402aa448 
> > 0x40656a9a ::operator()(mbuf*)
> > const+76>
> > 0x40655855 
> > 0x4023b165 
> > 0x4023b4d7 
> > 0x4024cd21 
> > 0x406a6a10 
> > 0x406a64f7 
> > 0x4067cd42 
> > 
> > So at the moment, I'm a bit stuck with getting any more info...
> > 
> > Rick
> > 
> > On Mon, 2020-03-09 at 08:52 -0700, Waldek Kozaczuk wrote:
> > > As I understand this stack trace the oom() was called here as
> > part of
> > > _do_reclaim():
> > > 
> > > 1025 WITH_LOCK(free_page_ranges_lock) {
> > > 1026 if (target >= 0) {
> > > 1027 // Wake up all waiters that are waiting and
> > now
> > > have a chance to succeed.
> > > 1028 // If we could not wake any, there is
> > nothing
> > > really we can do.
> > > 1029 if (!_oom_blocked.wake_waiters()) {
> > > 1030 oom();
> > > 1031 }
> > > 1032 }
> > > 1033 
> > > 1034 if (balloon_api) {
> > > 1035 balloon_api->voluntary_return();
> > > 1036 }
> > > 1037 }
> > > 
> > > so it seems wake_waiters() returned false. I wonder if the memory
> > was
> > > heavily fragmented or there is some logical bug in there. This
> > method
> > > is called from two places and I wonder if this part of
> > wake_waiters()
> > > is correct:
> > > 
> > >  921 if (!_waiters.empty()) {
> > >  922 reclaimer_thread.wake();
> > >  923 }
> > >  924 return woken;
> > > 
> > > 
> > > should this if also set woken to true?
> > > 
> > > Could we also enhance the oom() logic to print out more
> > useful
> > > information if this happens once again?
> > > 
> > > On Tuesday, March 3, 2020 at 2:21:40 AM UTC-5, rickp wrote:
> > > > Had a crash on a system that I don't understand. It's a VM with
> > > > 12GB allocated, and we were running with about 10.5GB free
> > > > according to the API.
> > > > 
> > > > Out of the blue, we had a panic: 
> > > > 
> > > > Out of memory: could not reclaim any further. Current memory:
> > > > 10954988 
> > > > Kb 
> > > > [backtrace] 
> > > > 0x403f6320  
> > > > 0x403f71cc  
> > > > 0x403f722f  
> > > > 0x4040f29b  
> > > > 0x403ae412  
> > > > 
> > > > The 'Out of memory' message seems to print stats::free() and
> > that 
> > > > number suggests we have plenty of free ram. 
> > > > 
> > > > Have I misunderstood, or is there something I need to be
> > looking
> > > > at? 
> > > > 
> > > > Cheers, 
> > > > Rick 
> > > > 
> > > 

Re: [osv-dev] Re: OOM query

2020-03-09 Thread Waldek Kozaczuk
Does it happen with the very latest OSv code? Did it start happening at
some point more often?

I wonder if we could add some helpful printouts in wake_waiters().

Btw that assert() failure in tcp_do_segment() rings a bell.

On Mon, Mar 9, 2020 at 22:25 Rick Payne  wrote:

>
> I can't add much other than I doubt it's fragmentation. Sometimes this
> happens within a few minutes of the system starting. At no point do I
> think we're using more than 2GB of ram (of the 12GB) either.
>
> I did compile up a debug version of OSv and built the system with that,
> but I've been unable to trigger the oom(). Worse, I hit a kassert in
> the netchannel code that seems to be ignored in the 'release' build,
> but panics in the debug build:
>
> [E/384 bsd-kassert]: tcp_do_segment: TCPS_LISTEN
> Assertion failed: tp->get_state() > 1 (bsd/sys/netinet/tcp_input.cc:
> tcp_do_segment: 1076)
>
> [backtrace]
> 0x40221330 
> 0x40221399 <__assert_fail+64>
> 0x402a4798 
> 0x402a97c2 
> 0x402a98a1 
> 0x402aa448 
> 0x40656a9a ::operator()(mbuf*)
> const+76>
> 0x40655855 
> 0x4023b165 
> 0x4023b4d7 
> 0x4024cd21 
> 0x406a6a10 
> 0x406a64f7 
> 0x4067cd42 
>
> So at the moment, I'm a bit stuck with getting any more info...
>
> Rick
>
> On Mon, 2020-03-09 at 08:52 -0700, Waldek Kozaczuk wrote:
> > As I understand this stack trace the oom() was called here as part of
> > _do_reclaim():
> >
> > 1025 WITH_LOCK(free_page_ranges_lock) {
> > 1026 if (target >= 0) {
> > 1027 // Wake up all waiters that are waiting and now
> > have a chance to succeed.
> > 1028 // If we could not wake any, there is nothing
> > really we can do.
> > 1029 if (!_oom_blocked.wake_waiters()) {
> > 1030 oom();
> > 1031 }
> > 1032 }
> > 1033
> > 1034 if (balloon_api) {
> > 1035 balloon_api->voluntary_return();
> > 1036 }
> > 1037 }
> >
> > so it seems wake_waiters() returned false. I wonder if the memory was
> > heavily fragmented or there is some logical bug in there. This method
> > is called from two places and I wonder if this part of wake_waiters()
> > is correct:
> >
> >  921 if (!_waiters.empty()) {
> >  922 reclaimer_thread.wake();
> >  923 }
> >  924 return woken;
> >
> >
> > should this if also set woken to true?
> >
> > Could we also enhance the oom() logic to print out more useful
> > information if this happens once again?
> >
> > On Tuesday, March 3, 2020 at 2:21:40 AM UTC-5, rickp wrote:
> > > Had a crash on a system that I don't understand. It's a VM with
> > > 12GB allocated, and we were running with about 10.5GB free
> > > according to the API.
> > >
> > > Out of the blue, we had a panic:
> > >
> > > Out of memory: could not reclaim any further. Current memory:
> > > 10954988
> > > Kb
> > > [backtrace]
> > > 0x403f6320 
> > > 0x403f71cc 
> > > 0x403f722f 
> > > 0x4040f29b 
> > > 0x403ae412 
> > >
> > > The 'Out of memory' message seems to print stats::free() and that
> > > number suggests we have plenty of free ram.
> > >
> > > Have I misunderstood, or is there something I need to be looking
> > > at?
> > >
> > > Cheers,
> > > Rick
> > >
> >


Re: [osv-dev] Re: OOM query

2020-03-09 Thread Rick Payne


I can't add much other than I doubt it's fragmentation. Sometimes this
happens within a few minutes of the system starting. At no point do I
think we're using more than 2GB of ram (of the 12GB) either.

I did compile up a debug version of OSv and built the system with that,
but I've been unable to trigger the oom(). Worse, I hit a kassert in
the netchannel code that seems to be ignored in the 'release' build,
but panics in the debug build:

[E/384 bsd-kassert]: tcp_do_segment: TCPS_LISTEN
Assertion failed: tp->get_state() > 1 (bsd/sys/netinet/tcp_input.cc:
tcp_do_segment: 1076)

[backtrace]
0x40221330 
0x40221399 <__assert_fail+64>
0x402a4798 
0x402a97c2 
0x402a98a1 
0x402aa448 
0x40656a9a ::operator()(mbuf*)
const+76>
0x40655855 
0x4023b165 
0x4023b4d7 
0x4024cd21 
0x406a6a10 
0x406a64f7 
0x4067cd42 

So at the moment, I'm a bit stuck with getting any more info...

Rick

On Mon, 2020-03-09 at 08:52 -0700, Waldek Kozaczuk wrote:
> As I understand this stack trace the oom() was called here as part of
> _do_reclaim():
> 
> 1025 WITH_LOCK(free_page_ranges_lock) {
> 1026 if (target >= 0) {
> 1027 // Wake up all waiters that are waiting and now
> have a chance to succeed.
> 1028 // If we could not wake any, there is nothing
> really we can do.
> 1029 if (!_oom_blocked.wake_waiters()) {
> 1030 oom();
> 1031 }
> 1032 }
> 1033 
> 1034 if (balloon_api) {
> 1035 balloon_api->voluntary_return();
> 1036 }
> 1037 }
> 
> so it seems wake_waiters() returned false. I wonder if the memory was
> heavily fragmented or there is some logical bug in there. This method
> is called from two places and I wonder if this part of wake_waiters()
> is correct:
> 
>  921 if (!_waiters.empty()) {
>  922 reclaimer_thread.wake();
>  923 }
>  924 return woken;
> 
> 
> should this if also set woken to true?
> 
> Could we also enhance the oom() logic to print out more useful
> information if this happens once again?
> 
> On Tuesday, March 3, 2020 at 2:21:40 AM UTC-5, rickp wrote:
> > Had a crash on a system that I don't understand. It's a VM with
> > 12GB allocated, and we were running with about 10.5GB free
> > according to the API.
> > 
> > Out of the blue, we had a panic: 
> > 
> > Out of memory: could not reclaim any further. Current memory:
> > 10954988 
> > Kb 
> > [backtrace] 
> > 0x403f6320  
> > 0x403f71cc  
> > 0x403f722f  
> > 0x4040f29b  
> > 0x403ae412  
> > 
> > The 'Out of memory' message seems to print stats::free() and that 
> > number suggests we have plenty of free ram. 
> > 
> > Have I misunderstood, or is there something I need to be looking
> > at? 
> > 
> > Cheers, 
> > Rick 
> > 
> 