Re: [osv-dev] Re: OOM query
I remember it is an Erlang app. Is there any comparable open source app/example I could run to re-create your issue?
Re: [osv-dev] Re: OOM query
Also, besides the main app, what other OSv modules do you have in the image? Httpserver-api, cloud-init, ?
Re: [osv-dev] Re: OOM query
On Tue, Mar 24, 2020 at 17:34 Rick Payne wrote:
> I backed out the original patch, applied the other two. Do I need the
> first one still?

Nope. You should be fine.

> We're not on master, but we're pretty close. Last synced on March 3rd,
> commit ref 92eb26f3a645
>
> scripts/build check runs fine:
>
> OK (131 tests run, 274.753 s)

Well, there may be a bug in my patches. It looks like my patches make your situation even worse. Did you try to connect with gdb to get a better stack trace? Ideally ‘osv info threads’.
Re: [osv-dev] Re: OOM query
I backed out the original patch, applied the other two. Do I need the first one still?

We're not on master, but we're pretty close. Last synced on March 3rd, commit ref 92eb26f3a645.

scripts/build check runs fine:

OK (131 tests run, 274.753 s)

Rick

--
You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/75323fbb15af066032b00c4d0bcfce0f10b2c9af.camel%40rossfell.co.uk.
Re: [osv-dev] Re: OOM query
Is it with the exact same code as on master with the latest 2 patches I sent applied? Does ‘scripts/build check’ pass for you?
Re: [osv-dev] Re: OOM query
I tried the patches, but it's crashing almost instantly...

page fault outside application, addr: 0x56c0
[registers]
RIP: 0x403edd23  RFL: 0x00010206  CS: 0x0008  SS: 0x0010
RAX: 0x56c0  RBX: 0x200056c00040  RCX: 0x004c  RDX: 0x0008
RSI: 0x004c  RDI: 0x200056c00040  RBP: 0x200041501740  R8: 0x
R9: 0x5e7a7333  R10: 0x  R11: 0x  R12: 0x004c
R13: 0x5e7a7333  R14: 0x  R15: 0x  RSP: 0x2000415016f8
Aborted

[backtrace]
0x40343779
0x4034534d
0x403a667b
0x403a54c6
0x1174c2b0
0x

(gdb) osv heap
0x8e9ad000 0x22a53000
0x8e9a2000 0x4000

Rick
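[Editor's sketch] The 'osv heap' output above lists free page ranges as "address size" pairs in hex. A small helper like the one below (an assumption based on the two-column dump in this thread, not a documented format) can summarize such a dump: a long list of small ranges with no single large range would support the fragmentation theory.

```python
# Summarize an 'osv heap'-style dump: count of free ranges, total free
# bytes, and the largest contiguous range. The input format is assumed
# from the two-column dump quoted in this thread.
def summarize_heap(dump: str):
    ranges = []
    for line in dump.strip().splitlines():
        addr, size = (int(tok, 16) for tok in line.split())
        ranges.append((addr, size))
    total = sum(size for _, size in ranges)
    largest = max(size for _, size in ranges)
    return len(ranges), total, largest

dump = """\
0x8e9ad000 0x22a53000
0x8e9a2000 0x4000"""

count, total, largest = summarize_heap(dump)
print(count, total, largest)
```

For the dump above this reports two ranges, with almost all free memory in a single large range, i.e. not yet a fragmented picture.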
Re: [osv-dev] Re: OOM query
I have sent a more complete patch that should also address the fragmentation issue with requests >= 4K and < 2MB.
Re: [osv-dev] Re: OOM query
I have just sent a new patch to the mailing list. I am hoping it will address the OOM crash, if my theory of heavy memory fragmentation is right. It would be nice if Nadav could review it.

Regardless, if you have another crash in production and are able to connect with gdb, could you run 'osv heap'? It should show free_page_ranges. If memory is heavily fragmented we should see a long list.

It would be nice to recreate that load in a dev env and capture the memory trace data (BTW, you do not need to enable backtrace to have enough useful information). It would help us better understand how memory is allocated by the app. I saw you sent me one trace but it does not seem to reveal anything interesting.

Waldek
Re: [osv-dev] Re: OOM query
On Sun, 2020-03-22 at 22:08 -0700, Waldek Kozaczuk wrote:
> On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp wrote:
> > Looks to me like its trying to allocate 40MB but the available
> > memory is 10GB, surely? 10933128KB is 10,933MB
>
> I misread the number - forgot about 1K.
>
> Any chance you could run the app outside of production with memory
> tracing enabled -
> https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations
> (without --trace-backtrace) for a while? And then we can have a better
> sense of what kind of allocations it makes. The output of trace
> memory-analyzer would be really helpful.

I can certainly run that locally with locally generated workloads, which should be close enough - but we've never managed to trigger the oom condition that way (other than by really constraining the memory artificially). It should be close enough though - let me see what I can do.

Rick
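[Editor's note] The unit confusion in the quoted exchange is easy to trip over; a quick sanity check of the figures quoted above (10,933,128 KB free versus a ~40 MB request):

```python
# Unit check for the numbers in the quoted exchange.
free_kb = 10_933_128                    # free memory as reported, in KB
free_mb = free_kb / 1000                # ~10,933 MB in decimal units, as quoted
free_gib = free_kb * 1024 / 1024**3     # ~10.4 GiB in binary units
print(round(free_mb), round(free_gib, 1))
```

Either way, roughly 10 GB is free against a 40 MB request, which is why fragmentation rather than outright exhaustion is the suspect.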
Re: [osv-dev] Re: OOM query
Based on what I can tell, the available remaining memory seems to be heavily fragmented (I have no idea why). If it is, it would help to implement https://github.com/cloudius-systems/osv/issues/854, which I will give a try. Hopefully, I can have a patch within a day. On Monday, March 23, 2020 at 1:08:22 AM UTC-4, Waldek Kozaczuk wrote: > > > > On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp wrote: > > > Looks to me like its trying to allocate 40MB but the available memory > is 10GB, surely? 10933128KB is 10,933MB > > I misread the number - forgot about 1K. > > Any chance you could run the app outside of production with memory tracing > enabled - > https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations > (without --trace-backtrace) for while? And then we can have a better > sense of what kind of allocations it makes. The output of trace > memory-analyzer would be really helpful. > > > Its probably erlang garbage collection, but it seems very strange to me > as sometimes it happens very quickly after we start the system. There's > no way we're allocate more memory than the system ever claims to have - > so if it is fragmentation, why do we do so poor compared to the linux > kernel? > > Is anyone else running VMs based on OSv for more than a few days? > > Rick > > On Sun, 2020-03-22 at 18:01 -0700, Waldek Kozaczuk wrote: > > Either way, it looks like the app is trying to allocate almost 40MB > > and the remaining memory is ~ 10MB. So, in this case, there is no > > surprise, is there? > > > > In the previous case, you reported there were two threads each > > requesting ~5MB and there was plenty of memory. If I understand the > > code correctly the remaining memory must have been fragmented and > > there was no big enough free page range. > > > > On Sunday, March 22, 2020 at 8:31:01 PM UTC-4, Waldek Kozaczuk wrote: > > > It does for me. Are you using the latest version of this script > > > from master? 
I think there were couple of things I changed there > > > recently as part of wide upgrade to Python 3. > > > > > > On Sun, Mar 22, 2020 at 20:27 Rick Payne > > > wrote: > > > > Does that command work for you? For me I get: > > > > > > > > (gdb) osv heap > > > > Python Exception %x format: an integer is > > > > required, > > > > not gdb.Value: > > > > Error occurred in Python command: %x format: an integer is > > > > required, > > > > not gdb.Value > > > > > > > > Rick > > > > > > > > On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote: > > > > > Can you run ‘osv heap’ from gdb at this point? > > > > > > > > > > On Sun, Mar 22, 2020 at 19:31 Rick Payne > > > > > wrote: > > > > > > Ok, so i applied the patch, and the printf and this is what I > > > > see: > > > > > > > > > > > > page_range_allocator: no ranges found for size 39849984 and > > > > exact > > > > > > order: 14 > > > > > > Waiter: 1_scheduler, bytes: 39849984 > > > > > > Out of memory: could not reclaim any further. Current memory: > > > > > > 10933128 > > > > > > Kb, target: 0 Kb > > > > > > > > > > > > Rick > > > > > > > > > > > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote: > > > > > > > > > > > > > > > > > > > > > I think that OSv might be hitting some sort of > > > > fragmentation > > > > > > scenario > > > > > > > (another question is why we get to this point) which should > > > > be > > > > > > > handled but is not due to the bug this patch fixes: > > > > > > > > > > > > > > diff --git a/core/mempool.cc b/core/mempool.cc > > > > > > > index d902eea8..11fd1456 100644 > > > > > > > --- a/core/mempool.cc > > > > > > > +++ b/core/mempool.cc > > > > > > > @@ -702,10 +702,13 @@ page_range* > > > > > > page_range_allocator::alloc(size_t > > > > > > > size) > > > > > > > for (auto&& pr : _free[exact_order - 1]) { > > > > > > > if (pr.size >= size) { > > > > > > > range = > > > > > > > +remove_list(exact_order - 1, *range); > > > > > > > break; > > > > > > > } > > > > > > > } > > > > > > 
> -return nullptr; > > > > > > > +if (!range) { > > > > > > > +return nullptr; > > > > > > > +} > > > > > > > } else if (order == max_order) { > > > > > > > range = &*_free_huge.rbegin(); > > > > > > > if (range->size < size) { > > > > > > > > > > > > > > Can you give it a try and see what happens with this patch? > > > > > > > > > > > > > > Waldek > > > > > > > > > > > > > > PS. You might also add this printout to verify if you are > > > > hitting > > > > > > the > > > > > > > hard fragmentation case: > > > > > > > > > > > > > > page_range* range = nullptr; > > > > > > > if (!bitset) { > > > > > > > if (!exact_order || _free[exact_order - > > > > 1].empty()) { > > > > > > > +printf("page_range_allocator: no
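The control flow that the quoted mempool.cc patch fixes can be modeled in a few lines of Python. This is a simplified sketch with hypothetical names, not the real C++ code from core/mempool.cc: before the fix, the fallback scan of the next-lower-order free list could find a big-enough range and still fall through to an unconditional "return nullptr"; after the fix, the allocator returns early only when the scan genuinely finds nothing, and the found range is also removed from its free list.

```python
# Simplified model of the page_range_allocator fallback path
# (hypothetical names; the real implementation is C++ in core/mempool.cc).
# free_lists[order] holds free ranges whose page count is bounded below
# by 2**order, so a list one order down may still contain a big-enough range.

def alloc(free_lists, exact_order, size):
    """Return a free range of at least `size` pages, or None.

    Mirrors the patched logic: on a bitset miss (no list at exact_order
    or above has an entry), scan the next-lower order before giving up.
    """
    if exact_order == 0 or not free_lists.get(exact_order - 1):
        return None  # nothing to scan at all
    range_found = None
    for pr in free_lists[exact_order - 1]:
        if pr >= size:
            range_found = pr
            # The patch's remove_list(): unlink the range we hand out.
            free_lists[exact_order - 1].remove(pr)
            break
    # Pre-patch bug: execution reached an unconditional "return nullptr"
    # here even when the loop had found a range. The fix returns None
    # only when the scan failed.
    return range_found
```

With an order-14 request of 9729 pages (the failing size from the logs), a 10000-page range sitting on the order-13 list is found by the fallback scan; the pre-patch code would have discarded it and returned nullptr anyway.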
Re: [osv-dev] Re: OOM query
On Monday, March 23, 2020 at 12:36:52 AM UTC-4, rickp wrote: > > > Looks to me like its trying to allocate 40MB but the available memory > is 10GB, surely? 10933128KB is 10,933MB > > I misread the number - forgot about 1K. Any chance you could run the app outside of production with memory tracing enabled - https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py#tracing-memory-allocations (without --trace-backtrace) for a while? And then we can have a better sense of what kind of allocations it makes. The output of trace memory-analyzer would be really helpful. Its probably erlang garbage collection, but it seems very strange to me > as sometimes it happens very quickly after we start the system. There's > no way we're allocate more memory than the system ever claims to have - > so if it is fragmentation, why do we do so poor compared to the linux > kernel? > > Is anyone else running VMs based on OSv for more than a few days? > > Rick > > On Sun, 2020-03-22 at 18:01 -0700, Waldek Kozaczuk wrote: > > Either way, it looks like the app is trying to allocate almost 40MB > > and the remaining memory is ~ 10MB. So, in this case, there is no > > surprise, is there? > > > > In the previous case, you reported there were two threads each > > requesting ~5MB and there was plenty of memory. If I understand the > > code correctly the remaining memory must have been fragmented and > > there was no big enough free page range. > > > > On Sunday, March 22, 2020 at 8:31:01 PM UTC-4, Waldek Kozaczuk wrote: > > > It does for me. Are you using the latest version of this script > > > from master? I think there were couple of things I changed there > > > recently as part of wide upgrade to Python 3. > > > > > > On Sun, Mar 22, 2020 at 20:27 Rick Payne > > > > wrote: > > > > Does that command work for you? 
For me I get: > > > > > > > > (gdb) osv heap > > > > Python Exception %x format: an integer is > > > > required, > > > > not gdb.Value: > > > > Error occurred in Python command: %x format: an integer is > > > > required, > > > > not gdb.Value > > > > > > > > Rick > > > > > > > > On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote: > > > > > Can you run ‘osv heap’ from gdb at this point? > > > > > > > > > > On Sun, Mar 22, 2020 at 19:31 Rick Payne > > > > > > wrote: > > > > > > Ok, so i applied the patch, and the printf and this is what I > > > > see: > > > > > > > > > > > > page_range_allocator: no ranges found for size 39849984 and > > > > exact > > > > > > order: 14 > > > > > > Waiter: 1_scheduler, bytes: 39849984 > > > > > > Out of memory: could not reclaim any further. Current memory: > > > > > > 10933128 > > > > > > Kb, target: 0 Kb > > > > > > > > > > > > Rick > > > > > > > > > > > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote: > > > > > > > > > > > > > > > > > > > > > I think that OSv might be hitting some sort of > > > > fragmentation > > > > > > scenario > > > > > > > (another question is why we get to this point) which should > > > > be > > > > > > > handled but is not due to the bug this patch fixes: > > > > > > > > > > > > > > diff --git a/core/mempool.cc b/core/mempool.cc > > > > > > > index d902eea8..11fd1456 100644 > > > > > > > --- a/core/mempool.cc > > > > > > > +++ b/core/mempool.cc > > > > > > > @@ -702,10 +702,13 @@ page_range* > > > > > > page_range_allocator::alloc(size_t > > > > > > > size) > > > > > > > for (auto&& pr : _free[exact_order - 1]) { > > > > > > > if (pr.size >= size) { > > > > > > > range = > > > > > > > +remove_list(exact_order - 1, *range); > > > > > > > break; > > > > > > > } > > > > > > > } > > > > > > > -return nullptr; > > > > > > > +if (!range) { > > > > > > > +return nullptr; > > > > > > > +} > > > > > > > } else if (order == max_order) { > > > > > > > range = &*_free_huge.rbegin(); > > > > > > > if 
(range->size < size) { > > > > > > > > > > > > > > Can you give it a try and see what happens with this patch? > > > > > > > > > > > > > > Waldek > > > > > > > > > > > > > > PS. You might also add this printout to verify if you are > > > > hitting > > > > > > the > > > > > > > hard fragmentation case: > > > > > > > > > > > > > > page_range* range = nullptr; > > > > > > > if (!bitset) { > > > > > > > if (!exact_order || _free[exact_order - > > > > 1].empty()) { > > > > > > > +printf("page_range_allocator: no ranges found > > > > for > > > > > > size > > > > > > > %ld and exact order: %d\n", size, exact_order); > > > > > > > return nullptr; > > > > > > > } > > > > > > > > > > > > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp > > > > wrote: > > > > > > > > Ok, with the waiter_print, we see: > > > > > > > > > > > > > >
Re: [osv-dev] Re: OOM query
Looks to me like it's trying to allocate 40MB but the available memory is 10GB, surely? 10933128KB is 10,933MB. It's probably Erlang garbage collection, but it seems very strange to me as sometimes it happens very quickly after we start the system. There's no way we're allocating more memory than the system ever claims to have - so if it is fragmentation, why do we do so poorly compared to the Linux kernel? Is anyone else running VMs based on OSv for more than a few days? Rick On Sun, 2020-03-22 at 18:01 -0700, Waldek Kozaczuk wrote: > Either way, it looks like the app is trying to allocate almost 40MB > and the remaining memory is ~ 10MB. So, in this case, there is no > surprise, is there? > > In the previous case, you reported there were two threads each > requesting ~5MB and there was plenty of memory. If I understand the > code correctly the remaining memory must have been fragmented and > there was no big enough free page range. > > On Sunday, March 22, 2020 at 8:31:01 PM UTC-4, Waldek Kozaczuk wrote: > > It does for me. Are you using the latest version of this script > > from master? I think there were couple of things I changed there > > recently as part of wide upgrade to Python 3. > > > > On Sun, Mar 22, 2020 at 20:27 Rick Payne > > wrote: > > > Does that command work for you? For me I get: > > > > > > (gdb) osv heap > > > Python Exception %x format: an integer is > > > required, > > > not gdb.Value: > > > Error occurred in Python command: %x format: an integer is > > > required, > > > not gdb.Value > > > > > > Rick > > > > > > On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote: > > > > Can you run ‘osv heap’ from gdb at this point? 
> > > > > > > > On Sun, Mar 22, 2020 at 19:31 Rick Payne > > > > wrote: > > > > > Ok, so i applied the patch, and the printf and this is what I > > > see: > > > > > > > > > > page_range_allocator: no ranges found for size 39849984 and > > > exact > > > > > order: 14 > > > > > Waiter: 1_scheduler, bytes: 39849984 > > > > > Out of memory: could not reclaim any further. Current memory: > > > > > 10933128 > > > > > Kb, target: 0 Kb > > > > > > > > > > Rick > > > > > > > > > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote: > > > > > > > > > > > > > > > > > > I think that OSv might be hitting some sort of > > > fragmentation > > > > > scenario > > > > > > (another question is why we get to this point) which should > > > be > > > > > > handled but is not due to the bug this patch fixes: > > > > > > > > > > > > diff --git a/core/mempool.cc b/core/mempool.cc > > > > > > index d902eea8..11fd1456 100644 > > > > > > --- a/core/mempool.cc > > > > > > +++ b/core/mempool.cc > > > > > > @@ -702,10 +702,13 @@ page_range* > > > > > page_range_allocator::alloc(size_t > > > > > > size) > > > > > > for (auto&& pr : _free[exact_order - 1]) { > > > > > > if (pr.size >= size) { > > > > > > range = > > > > > > +remove_list(exact_order - 1, *range); > > > > > > break; > > > > > > } > > > > > > } > > > > > > -return nullptr; > > > > > > +if (!range) { > > > > > > +return nullptr; > > > > > > +} > > > > > > } else if (order == max_order) { > > > > > > range = &*_free_huge.rbegin(); > > > > > > if (range->size < size) { > > > > > > > > > > > > Can you give it a try and see what happens with this patch? > > > > > > > > > > > > Waldek > > > > > > > > > > > > PS. 
You might also add this printout to verify if you are > > > hitting > > > > > the > > > > > > hard fragmentation case: > > > > > > > > > > > > page_range* range = nullptr; > > > > > > if (!bitset) { > > > > > > if (!exact_order || _free[exact_order - > > > 1].empty()) { > > > > > > +printf("page_range_allocator: no ranges found > > > for > > > > > size > > > > > > %ld and exact order: %d\n", size, exact_order); > > > > > > return nullptr; > > > > > > } > > > > > > > > > > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp > > > wrote: > > > > > > > Ok, with the waiter_print, we see: > > > > > > > > > > > > > > Waiter: 2_scheduler, bytes: 4984832 > > > > > > > Waiter: 5_scheduler, bytes: 4984832 > > > > > > > Out of memory: could not reclaim any further. Current > > > memory: > > > > > > > 10871576Kb, target: 0 Kb > > > > > > > > > > > > > > 'osv mem' in gdb reports loads of free memory: > > > > > > > > > > > > > > (gdb) osv mem > > > > > > > Total Memory: 12884372480 Bytes > > > > > > > Mmap Memory: 1610067968 Bytes (12.50%) > > > > > > > Free Memory: 11132493824 Bytes (86.40%) > > > > > > > > > > > > > > So why is it failing to allocate 5MB when it claims to > > > have > > > > > 11GB > > > > > > > free? > > > > > > > > > > > > > > Any more debug I can provide? > > > > > > > > > > > > > > Rick > > > > > > > > > > > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne
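As a sanity check on the log line quoted throughout this thread ("no ranges found for size 39849984 and exact order: 14"): assuming 4KiB pages and buddy-style power-of-two orders (my inference; the thread never states this explicitly), the two numbers are consistent. 39849984 bytes is exactly 9729 pages, and the smallest order whose block size covers 9729 pages is 14, since 2^13 = 8192 < 9729 <= 16384 = 2^14.

```python
import math

# Assumed model: 4KiB pages, order n covers up to 2**n pages.
def exact_order(size_bytes, page_size=4096):
    """Order of the smallest power-of-two page count covering the request."""
    pages = math.ceil(size_bytes / page_size)
    return (pages - 1).bit_length()

# The failing 39849984-byte request from the log maps to order 14;
# the earlier ~5MB waiter requests (4984832 bytes) map to order 11.
```

So the allocator is effectively hunting for one contiguous run of up to 16384 free pages (64MiB of address-contiguous page ranges) to serve a ~38MiB request, which is exactly where fragmentation bites.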
Re: [osv-dev] Re: OOM query
Do you know if the application uses "huge" (>=2MB) malloc? I found this relevant discussion - https://groups.google.com/d/msg/osv-dev/uLoxcjU164k/WWl6dMzsCAAJ - about how OSv currently uses contiguous physical memory areas to back these kinds of requests. There is an open issue about that - https://github.com/cloudius-systems/osv/issues/854. It should not be difficult to implement it. It would be nice to confirm that memory is indeed fragmented in your case. On Sunday, March 22, 2020 at 9:01:26 PM UTC-4, Waldek Kozaczuk wrote: > > Either way, it looks like the app is trying to allocate almost 40MB and > the remaining memory is ~ 10MB. So, in this case, there is no surprise, is > there? > > In the previous case, you reported there were two threads each requesting > ~5MB and there was plenty of memory. If I understand the code correctly the > remaining memory must have been fragmented and there was no big enough free > page range. > > On Sunday, March 22, 2020 at 8:31:01 PM UTC-4, Waldek Kozaczuk wrote: >> >> It does for me. Are you using the latest version of this script from >> master? I think there were couple of things I changed there recently as >> part of wide upgrade to Python 3. >> >> On Sun, Mar 22, 2020 at 20:27 Rick Payne wrote: >> >>> >>> Does that command work for you? For me I get: >>> >>> (gdb) osv heap >>> Python Exception %x format: an integer is required, >>> not gdb.Value: >>> Error occurred in Python command: %x format: an integer is required, >>> not gdb.Value >>> >>> Rick >>> >>> On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote: >>> > Can you run ‘osv heap’ from gdb at this point? >>> > >>> > On Sun, Mar 22, 2020 at 19:31 Rick Payne >>> > wrote: >>> > > Ok, so i applied the patch, and the printf and this is what I see: >>> > > >>> > > page_range_allocator: no ranges found for size 39849984 and exact >>> > > order: 14 >>> > > Waiter: 1_scheduler, bytes: 39849984 >>> > > Out of memory: could not reclaim any further. 
Current memory: >>> > > 10933128 >>> > > Kb, target: 0 Kb >>> > > >>> > > Rick >>> > > >>> > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote: >>> > > > >>> > > > >>> > > > I think that OSv might be hitting some sort of fragmentation >>> > > scenario >>> > > > (another question is why we get to this point) which should be >>> > > > handled but is not due to the bug this patch fixes: >>> > > > >>> > > > diff --git a/core/mempool.cc b/core/mempool.cc >>> > > > index d902eea8..11fd1456 100644 >>> > > > --- a/core/mempool.cc >>> > > > +++ b/core/mempool.cc >>> > > > @@ -702,10 +702,13 @@ page_range* >>> > > page_range_allocator::alloc(size_t >>> > > > size) >>> > > > for (auto&& pr : _free[exact_order - 1]) { >>> > > > if (pr.size >= size) { >>> > > > range = >>> > > > +remove_list(exact_order - 1, *range); >>> > > > break; >>> > > > } >>> > > > } >>> > > > -return nullptr; >>> > > > +if (!range) { >>> > > > +return nullptr; >>> > > > +} >>> > > > } else if (order == max_order) { >>> > > > range = &*_free_huge.rbegin(); >>> > > > if (range->size < size) { >>> > > > >>> > > > Can you give it a try and see what happens with this patch? >>> > > > >>> > > > Waldek >>> > > > >>> > > > PS. You might also add this printout to verify if you are hitting >>> > > the >>> > > > hard fragmentation case: >>> > > > >>> > > > page_range* range = nullptr; >>> > > > if (!bitset) { >>> > > > if (!exact_order || _free[exact_order - 1].empty()) { >>> > > > +printf("page_range_allocator: no ranges found for >>> > > size >>> > > > %ld and exact order: %d\n", size, exact_order); >>> > > > return nullptr; >>> > > > } >>> > > > >>> > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote: >>> > > > > Ok, with the waiter_print, we see: >>> > > > > >>> > > > > Waiter: 2_scheduler, bytes: 4984832 >>> > > > > Waiter: 5_scheduler, bytes: 4984832 >>> > > > > Out of memory: could not reclaim any further. 
Current memory: >>> > > > > 10871576Kb, target: 0 Kb >>> > > > > >>> > > > > 'osv mem' in gdb reports loads of free memory: >>> > > > > >>> > > > > (gdb) osv mem >>> > > > > Total Memory: 12884372480 Bytes >>> > > > > Mmap Memory: 1610067968 Bytes (12.50%) >>> > > > > Free Memory: 11132493824 Bytes (86.40%) >>> > > > > >>> > > > > So why is it failing to allocate 5MB when it claims to have >>> > > 11GB >>> > > > > free? >>> > > > > >>> > > > > Any more debug I can provide? >>> > > > > >>> > > > > Rick >>> > > > > >>> > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote: >>> > > > > > And indeed, back on the 'release' image, and we hit an oom: >>> > > > > > >>> > > > > > (gdb) b memory::oom >>> > > > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505. >>> > > > > > (gdb) c >>> > > > > > Continuing.
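The symptom reported above - a request of a few MB failing while 'osv mem' shows ~11GB free - is what fragmentation of contiguous ranges looks like when large mallocs must be backed by contiguous memory (the situation issue #854 is about). A toy illustration, not OSv code: total free memory can be enormous while the largest single contiguous run stays far below the request size.

```python
# Toy model: free memory as a list of contiguous free-range sizes (bytes).
MiB = 1 << 20

def largest_contiguous(free_ranges):
    """The biggest single range; what a contiguous allocation is limited by."""
    return max(free_ranges, default=0)

# 10,000 scattered 1 MiB holes: nearly 10 GiB free in total,
# yet a 5 MiB contiguous request cannot be satisfied from any of them.
free_ranges = [1 * MiB] * 10_000
total_free = sum(free_ranges)
```

This is why 'osv heap' is the interesting command here: a long list of small free_page_ranges confirms the fragmentation theory, while 'osv mem' only reports the (misleadingly healthy) total.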
Re: [osv-dev] Re: OOM query
Either way, it looks like the app is trying to allocate almost 40MB and the remaining memory is ~ 10MB. So, in this case, there is no surprise, is there? In the previous case, you reported there were two threads each requesting ~5MB and there was plenty of memory. If I understand the code correctly the remaining memory must have been fragmented and there was no big enough free page range. On Sunday, March 22, 2020 at 8:31:01 PM UTC-4, Waldek Kozaczuk wrote: > > It does for me. Are you using the latest version of this script from > master? I think there were couple of things I changed there recently as > part of wide upgrade to Python 3. > > On Sun, Mar 22, 2020 at 20:27 Rick Payne wrote: > >> >> Does that command work for you? For me I get: >> >> (gdb) osv heap >> Python Exception %x format: an integer is required, >> not gdb.Value: >> Error occurred in Python command: %x format: an integer is required, >> not gdb.Value >> >> Rick >> >> On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote: >> > Can you run ‘osv heap’ from gdb at this point? >> > >> > On Sun, Mar 22, 2020 at 19:31 Rick Payne >> > wrote: >> > > Ok, so i applied the patch, and the printf and this is what I see: >> > > >> > > page_range_allocator: no ranges found for size 39849984 and exact >> > > order: 14 >> > > Waiter: 1_scheduler, bytes: 39849984 >> > > Out of memory: could not reclaim any further. 
Current memory: >> > > 10933128 >> > > Kb, target: 0 Kb >> > > >> > > Rick >> > > >> > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote: >> > > > >> > > > >> > > > I think that OSv might be hitting some sort of fragmentation >> > > scenario >> > > > (another question is why we get to this point) which should be >> > > > handled but is not due to the bug this patch fixes: >> > > > >> > > > diff --git a/core/mempool.cc b/core/mempool.cc >> > > > index d902eea8..11fd1456 100644 >> > > > --- a/core/mempool.cc >> > > > +++ b/core/mempool.cc >> > > > @@ -702,10 +702,13 @@ page_range* >> > > page_range_allocator::alloc(size_t >> > > > size) >> > > > for (auto&& pr : _free[exact_order - 1]) { >> > > > if (pr.size >= size) { >> > > > range = >> > > > +remove_list(exact_order - 1, *range); >> > > > break; >> > > > } >> > > > } >> > > > -return nullptr; >> > > > +if (!range) { >> > > > +return nullptr; >> > > > +} >> > > > } else if (order == max_order) { >> > > > range = &*_free_huge.rbegin(); >> > > > if (range->size < size) { >> > > > >> > > > Can you give it a try and see what happens with this patch? >> > > > >> > > > Waldek >> > > > >> > > > PS. You might also add this printout to verify if you are hitting >> > > the >> > > > hard fragmentation case: >> > > > >> > > > page_range* range = nullptr; >> > > > if (!bitset) { >> > > > if (!exact_order || _free[exact_order - 1].empty()) { >> > > > +printf("page_range_allocator: no ranges found for >> > > size >> > > > %ld and exact order: %d\n", size, exact_order); >> > > > return nullptr; >> > > > } >> > > > >> > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote: >> > > > > Ok, with the waiter_print, we see: >> > > > > >> > > > > Waiter: 2_scheduler, bytes: 4984832 >> > > > > Waiter: 5_scheduler, bytes: 4984832 >> > > > > Out of memory: could not reclaim any further. 
Current memory: >> > > > > 10871576Kb, target: 0 Kb >> > > > > >> > > > > 'osv mem' in gdb reports loads of free memory: >> > > > > >> > > > > (gdb) osv mem >> > > > > Total Memory: 12884372480 Bytes >> > > > > Mmap Memory: 1610067968 Bytes (12.50%) >> > > > > Free Memory: 11132493824 Bytes (86.40%) >> > > > > >> > > > > So why is it failing to allocate 5MB when it claims to have >> > > 11GB >> > > > > free? >> > > > > >> > > > > Any more debug I can provide? >> > > > > >> > > > > Rick >> > > > > >> > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote: >> > > > > > And indeed, back on the 'release' image, and we hit an oom: >> > > > > > >> > > > > > (gdb) b memory::oom >> > > > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505. >> > > > > > (gdb) c >> > > > > > Continuing. >> > > > > > >> > > > > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0 >> > > ) >> > > > > at >> > > > > > core/mempool.cc:505 >> > > > > > 505 core/mempool.cc: No such file or directory. >> > > > > > (gdb) osv memory >> > > > > > Total Memory: 12884372480 Bytes >> > > > > > Mmap Memory: 1541332992 Bytes (11.96%) >> > > > > > Free Memory: 11490983936 Bytes (89.19%) >> > > > > > >> > > > > > I don't have the _waiters list print unforutnately (as I >> > > hadn't >> > > > > > included it in this image). It looks like this though: >> > > > > > >> > > > > > (gdb) p _oom_blocked._waiters >> > > > > > $4 = >> > > > > > >> > > > > >> > >
Re: [osv-dev] Re: OOM query
It does for me. Are you using the latest version of this script from master? I think there were a couple of things I changed there recently as part of a wider upgrade to Python 3. On Sun, Mar 22, 2020 at 20:27 Rick Payne wrote: > > Does that command work for you? For me I get: > > (gdb) osv heap > Python Exception %x format: an integer is required, > not gdb.Value: > Error occurred in Python command: %x format: an integer is required, > not gdb.Value > > Rick > > On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote: > > Can you run ‘osv heap’ from gdb at this point? > > > > On Sun, Mar 22, 2020 at 19:31 Rick Payne > > wrote: > > > Ok, so i applied the patch, and the printf and this is what I see: > > > > > > page_range_allocator: no ranges found for size 39849984 and exact > > > order: 14 > > > Waiter: 1_scheduler, bytes: 39849984 > > > Out of memory: could not reclaim any further. Current memory: > > > 10933128 > > > Kb, target: 0 Kb > > > > > > Rick > > > > > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote: > > > > > > > > > > > > I think that OSv might be hitting some sort of fragmentation > > > scenario > > > > (another question is why we get to this point) which should be > > > > handled but is not due to the bug this patch fixes: > > > > > > > > diff --git a/core/mempool.cc b/core/mempool.cc > > > > index d902eea8..11fd1456 100644 > > > > --- a/core/mempool.cc > > > > +++ b/core/mempool.cc > > > > @@ -702,10 +702,13 @@ page_range* > > > page_range_allocator::alloc(size_t > > > > size) > > > > for (auto&& pr : _free[exact_order - 1]) { > > > > if (pr.size >= size) { > > > > range = > > > > +remove_list(exact_order - 1, *range); > > > > break; > > > > } > > > > } > > > > -return nullptr; > > > > +if (!range) { > > > > +return nullptr; > > > > +} > > > > } else if (order == max_order) { > > > > range = &*_free_huge.rbegin(); > > > > if (range->size < size) { > > > > > > > > Can you give it a try and see what happens with this patch? 
> > > > > > > > Waldek > > > > > > > > PS. You might also add this printout to verify if you are hitting > > > the > > > > hard fragmentation case: > > > > > > > > page_range* range = nullptr; > > > > if (!bitset) { > > > > if (!exact_order || _free[exact_order - 1].empty()) { > > > > +printf("page_range_allocator: no ranges found for > > > size > > > > %ld and exact order: %d\n", size, exact_order); > > > > return nullptr; > > > > } > > > > > > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote: > > > > > Ok, with the waiter_print, we see: > > > > > > > > > > Waiter: 2_scheduler, bytes: 4984832 > > > > > Waiter: 5_scheduler, bytes: 4984832 > > > > > Out of memory: could not reclaim any further. Current memory: > > > > > 10871576Kb, target: 0 Kb > > > > > > > > > > 'osv mem' in gdb reports loads of free memory: > > > > > > > > > > (gdb) osv mem > > > > > Total Memory: 12884372480 Bytes > > > > > Mmap Memory: 1610067968 Bytes (12.50%) > > > > > Free Memory: 11132493824 Bytes (86.40%) > > > > > > > > > > So why is it failing to allocate 5MB when it claims to have > > > 11GB > > > > > free? > > > > > > > > > > Any more debug I can provide? > > > > > > > > > > Rick > > > > > > > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote: > > > > > > And indeed, back on the 'release' image, and we hit an oom: > > > > > > > > > > > > (gdb) b memory::oom > > > > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505. > > > > > > (gdb) c > > > > > > Continuing. > > > > > > > > > > > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0 > > > ) > > > > > at > > > > > > core/mempool.cc:505 > > > > > > 505 core/mempool.cc: No such file or directory. 
> > > > > > (gdb) osv memory > > > > > > Total Memory: 12884372480 Bytes > > > > > > Mmap Memory: 1541332992 Bytes (11.96%) > > > > > > Free Memory: 11490983936 Bytes (89.19%) > > > > > > > > > > > > I don't have the _waiters list print unforutnately (as I > > > hadn't > > > > > > included it in this image). It looks like this though: > > > > > > > > > > > > (gdb) p _oom_blocked._waiters > > > > > > $4 = > > > > > > > > > > > > > > { > > > > cl > > > > > > ai > > > > > > mer_waiters::wait_node, > > > > > boost::intrusive::list_node_traits, > > > > > > (boost::intrusive::link_mode_type)1, > > > boost::intrusive::dft_tag, > > > > > 1>, > > > > > > unsigned long, false, void>> = {static constant_time_size = > > > > > false, > > > > > > static stateful_value_traits = , > > > > > > static has_container_from_iterator = , > > > static > > > > > > safemode_or_autounlink = true, > > > > > > data_ = > > > > > > > > > > > > > > { > > > > > boost::intrusive::list_node_traits, > > > > > > (boost::intrusive::link_mode_type)1, > > > boost::intrusive::dft_tag, > > > > > 1>> = > > > > > > > >
Re: [osv-dev] Re: OOM query
Does that command work for you? For me I get: (gdb) osv heap Python Exception %x format: an integer is required, not gdb.Value: Error occurred in Python command: %x format: an integer is required, not gdb.Value Rick On Sun, 2020-03-22 at 19:37 -0400, Waldek Kozaczuk wrote: > Can you run ‘osv heap’ from gdb at this point? > > On Sun, Mar 22, 2020 at 19:31 Rick Payne > wrote: > > Ok, so i applied the patch, and the printf and this is what I see: > > > > page_range_allocator: no ranges found for size 39849984 and exact > > order: 14 > > Waiter: 1_scheduler, bytes: 39849984 > > Out of memory: could not reclaim any further. Current memory: > > 10933128 > > Kb, target: 0 Kb > > > > Rick > > > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote: > > > > > > > > > I think that OSv might be hitting some sort of fragmentation > > scenario > > > (another question is why we get to this point) which should be > > > handled but is not due to the bug this patch fixes: > > > > > > diff --git a/core/mempool.cc b/core/mempool.cc > > > index d902eea8..11fd1456 100644 > > > --- a/core/mempool.cc > > > +++ b/core/mempool.cc > > > @@ -702,10 +702,13 @@ page_range* > > page_range_allocator::alloc(size_t > > > size) > > > for (auto&& pr : _free[exact_order - 1]) { > > > if (pr.size >= size) { > > > range = > > > +remove_list(exact_order - 1, *range); > > > break; > > > } > > > } > > > -return nullptr; > > > +if (!range) { > > > +return nullptr; > > > +} > > > } else if (order == max_order) { > > > range = &*_free_huge.rbegin(); > > > if (range->size < size) { > > > > > > Can you give it a try and see what happens with this patch? > > > > > > Waldek > > > > > > PS. 
You might also add this printout to verify if you are hitting > > the > > > hard fragmentation case: > > > > > > page_range* range = nullptr; > > > if (!bitset) { > > > if (!exact_order || _free[exact_order - 1].empty()) { > > > +printf("page_range_allocator: no ranges found for > > size > > > %ld and exact order: %d\n", size, exact_order); > > > return nullptr; > > > } > > > > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote: > > > > Ok, with the waiter_print, we see: > > > > > > > > Waiter: 2_scheduler, bytes: 4984832 > > > > Waiter: 5_scheduler, bytes: 4984832 > > > > Out of memory: could not reclaim any further. Current memory: > > > > 10871576Kb, target: 0 Kb > > > > > > > > 'osv mem' in gdb reports loads of free memory: > > > > > > > > (gdb) osv mem > > > > Total Memory: 12884372480 Bytes > > > > Mmap Memory: 1610067968 Bytes (12.50%) > > > > Free Memory: 11132493824 Bytes (86.40%) > > > > > > > > So why is it failing to allocate 5MB when it claims to have > > 11GB > > > > free? > > > > > > > > Any more debug I can provide? > > > > > > > > Rick > > > > > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote: > > > > > And indeed, back on the 'release' image, and we hit an oom: > > > > > > > > > > (gdb) b memory::oom > > > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505. > > > > > (gdb) c > > > > > Continuing. > > > > > > > > > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0 > > ) > > > > at > > > > > core/mempool.cc:505 > > > > > 505 core/mempool.cc: No such file or directory. > > > > > (gdb) osv memory > > > > > Total Memory: 12884372480 Bytes > > > > > Mmap Memory: 1541332992 Bytes (11.96%) > > > > > Free Memory: 11490983936 Bytes (89.19%) > > > > > > > > > > I don't have the _waiters list print unforutnately (as I > > hadn't > > > > > included it in this image). 
It looks like this though: > > > > > > > > > > (gdb) p _oom_blocked._waiters > > > > > $4 = > > > > > > > > > > > { > > > cl > > > > > ai > > > > > mer_waiters::wait_node, > > > > boost::intrusive::list_node_traits, > > > > > (boost::intrusive::link_mode_type)1, > > boost::intrusive::dft_tag, > > > > 1>, > > > > > unsigned long, false, void>> = {static constant_time_size = > > > > false, > > > > > static stateful_value_traits = , > > > > > static has_container_from_iterator = , > > static > > > > > safemode_or_autounlink = true, > > > > > data_ = > > > > > > > > > > > { > > > > boost::intrusive::list_node_traits, > > > > > (boost::intrusive::link_mode_type)1, > > boost::intrusive::dft_tag, > > > > 1>> = > > > > > > > > > > > { > > > od > > > > > e, > > > > > boost::intrusive::list_node*, > > boost::intrusive::dft_tag, > > > > 1>> = > > > > > {}, static link_mode = > > > > boost::intrusive::safe_link}, > > > > > root_plus_size_ = > > > > > { > > > > unsigned long, void>> = { > > > > > static constant_time_size = }, > > m_header > > > > = > > > > > {> = { > > > > > next_ = 0x200041305580, prev_ = 0x200041305580}, > > > >
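[Editor's note: the "Python Exception %x format: an integer is required, not gdb.Value" failure from 'osv heap' earlier in this message is the classic symptom of passing a gdb.Value straight into %-style formatting; converting with int() first fixes the formatting call. A minimal stand-alone sketch of the failure mode — FakeValue is a hypothetical stand-in for gdb.Value, since the gdb module is only importable inside gdb, and this is not the literal scripts/loader.py code:]

```python
# Reproduces the "%x format: an integer is required" error outside gdb.
# gdb.Value wraps an integer but is not one, so old-style %-formatting
# rejects it unless it is converted explicitly.

class FakeValue:
    """Hypothetical stand-in for gdb.Value: wraps an address, supports int()."""
    def __init__(self, addr):
        self._addr = addr
    def __int__(self):
        return self._addr

addr = FakeValue(0x8e9ad000)

# What the broken script effectively does -- raises TypeError:
try:
    print('%x' % addr)
except TypeError as e:
    print('broken:', e)

# The fix: convert before formatting.
print('fixed: 0x%x' % int(addr))   # -> fixed: 0x8e9ad000
```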
Re: [osv-dev] Re: OOM query
Can you run ‘osv heap’ from gdb at this point? On Sun, Mar 22, 2020 at 19:31 Rick Payne wrote: > > Ok, so i applied the patch, and the printf and this is what I see: > > page_range_allocator: no ranges found for size 39849984 and exact > order: 14 > Waiter: 1_scheduler, bytes: 39849984 > Out of memory: could not reclaim any further. Current memory: 10933128 > Kb, target: 0 Kb > > Rick > > On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote: > > > > > > I think that OSv might be hitting some sort of fragmentation scenario > > (another question is why we get to this point) which should be > > handled but is not due to the bug this patch fixes: > > > > diff --git a/core/mempool.cc b/core/mempool.cc > > index d902eea8..11fd1456 100644 > > --- a/core/mempool.cc > > +++ b/core/mempool.cc > > @@ -702,10 +702,13 @@ page_range* page_range_allocator::alloc(size_t > > size) > > for (auto&& pr : _free[exact_order - 1]) { > > if (pr.size >= size) { > > range = > > +remove_list(exact_order - 1, *range); > > break; > > } > > } > > -return nullptr; > > +if (!range) { > > +return nullptr; > > +} > > } else if (order == max_order) { > > range = &*_free_huge.rbegin(); > > if (range->size < size) { > > > > Can you give it a try and see what happens with this patch? > > > > Waldek > > > > PS. You might also add this printout to verify if you are hitting the > > hard fragmentation case: > > > > page_range* range = nullptr; > > if (!bitset) { > > if (!exact_order || _free[exact_order - 1].empty()) { > > +printf("page_range_allocator: no ranges found for size > > %ld and exact order: %d\n", size, exact_order); > > return nullptr; > > } > > > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote: > > > Ok, with the waiter_print, we see: > > > > > > Waiter: 2_scheduler, bytes: 4984832 > > > Waiter: 5_scheduler, bytes: 4984832 > > > Out of memory: could not reclaim any further. 
Current memory: > > > 10871576Kb, target: 0 Kb > > > > > > 'osv mem' in gdb reports loads of free memory: > > > > > > (gdb) osv mem > > > Total Memory: 12884372480 Bytes > > > Mmap Memory: 1610067968 Bytes (12.50%) > > > Free Memory: 11132493824 Bytes (86.40%) > > > > > > So why is it failing to allocate 5MB when it claims to have 11GB > > > free? > > > > > > Any more debug I can provide? > > > > > > Rick > > > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote: > > > > And indeed, back on the 'release' image, and we hit an oom: > > > > > > > > (gdb) b memory::oom > > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505. > > > > (gdb) c > > > > Continuing. > > > > > > > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0) > > > at > > > > core/mempool.cc:505 > > > > 505 core/mempool.cc: No such file or directory. > > > > (gdb) osv memory > > > > Total Memory: 12884372480 Bytes > > > > Mmap Memory: 1541332992 Bytes (11.96%) > > > > Free Memory: 11490983936 Bytes (89.19%) > > > > > > > > I don't have the _waiters list print unforutnately (as I hadn't > > > > included it in this image). 
It looks like this though: > > > > > > > > (gdb) p _oom_blocked._waiters > > > > $4 = > > > > > > > { > > cl > > > > ai > > > > mer_waiters::wait_node, > > > boost::intrusive::list_node_traits, > > > > (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, > > > 1>, > > > > unsigned long, false, void>> = {static constant_time_size = > > > false, > > > > static stateful_value_traits = , > > > > static has_container_from_iterator = , static > > > > safemode_or_autounlink = true, > > > > data_ = > > > > > > > { > > > boost::intrusive::list_node_traits, > > > > (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, > > > 1>> = > > > > > > > { > > od > > > > e, > > > > boost::intrusive::list_node*, boost::intrusive::dft_tag, > > > 1>> = > > > > {}, static link_mode = > > > boost::intrusive::safe_link}, > > > > root_plus_size_ = > > > > { > > > unsigned long, void>> = { > > > > static constant_time_size = }, m_header > > > = > > > > {> = { > > > > next_ = 0x200041305580, prev_ = 0x200041305580}, > > data > > > > fields>, } > > > > > > > > 'osv waiters' fails, annoyingly: > > > > > > > > (gdb) osv waiters > > > > waiters: > > > > Python Exception Cannot access memory > > > at > > > > address 0x42c00020982ac010: > > > > Error occurred in Python command: Cannot access memory at > > > address > > > > 0x42c00020982ac010 > > > > > > > > Rick > > > > > > > > On Sat, 2020-03-21 at 06:42 +1000, Rick Payne wrote: > > > > > So we've been trying with a debug image (with the tcp kassert > > > > > disabled). We do not get an OOM crash, but after some time > > > (24hrs > > > > > or > > > > > more) all cores just hit idle and never recover. It stops > > > > > responding > > > > > to >
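[Editor's note: the "size 39849984 and exact order: 14" printout above lines up with the buddy-style bucketing in page_range_allocator: 39849984 bytes is exactly 9729 4 KiB pages, and the smallest power-of-two order covering that is 2^14 = 16384 pages, so a single contiguous free range in a high-order bucket is needed even though total free memory is huge. A rough sketch of that arithmetic — assuming 4 KiB pages and a ceil-log2 order convention, which matches the printed value, not the literal mempool.cc computation:]

```python
import math

PAGE_SIZE = 4096  # assumed 4 KiB pages

def exact_order(size_bytes):
    """Smallest buddy order whose span (2**order pages) covers the request.

    Mirrors in spirit how page_range_allocator buckets free ranges; the
    exact in-kernel computation may differ slightly."""
    pages = math.ceil(size_bytes / PAGE_SIZE)
    return max(0, math.ceil(math.log2(pages)))

size = 39849984            # the failing allocation from the log
print(size // PAGE_SIZE)   # -> 9729 pages
print(exact_order(size))   # -> 14, matching "exact order: 14" in the printout
```

So with 11 GB free but no single free range that large, the allocator can still come up empty — which is why fragmentation, not exhaustion, is the suspect.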
Re: [osv-dev] Re: OOM query
Ok, so i applied the patch, and the printf and this is what I see: page_range_allocator: no ranges found for size 39849984 and exact order: 14 Waiter: 1_scheduler, bytes: 39849984 Out of memory: could not reclaim any further. Current memory: 10933128 Kb, target: 0 Kb Rick On Sat, 2020-03-21 at 21:16 -0700, Waldek Kozaczuk wrote: > > > I think that OSv might be hitting some sort of fragmentation scenario > (another question is why we get to this point) which should be > handled but is not due to the bug this patch fixes: > > diff --git a/core/mempool.cc b/core/mempool.cc > index d902eea8..11fd1456 100644 > --- a/core/mempool.cc > +++ b/core/mempool.cc > @@ -702,10 +702,13 @@ page_range* page_range_allocator::alloc(size_t > size) > for (auto&& pr : _free[exact_order - 1]) { > if (pr.size >= size) { > range = > +remove_list(exact_order - 1, *range); > break; > } > } > -return nullptr; > +if (!range) { > +return nullptr; > +} > } else if (order == max_order) { > range = &*_free_huge.rbegin(); > if (range->size < size) { > > Can you give it a try and see what happens with this patch? > > Waldek > > PS. You might also add this printout to verify if you are hitting the > hard fragmentation case: > > page_range* range = nullptr; > if (!bitset) { > if (!exact_order || _free[exact_order - 1].empty()) { > +printf("page_range_allocator: no ranges found for size > %ld and exact order: %d\n", size, exact_order); > return nullptr; > } > > On Saturday, March 21, 2020 at 7:17:16 PM UTC-4, rickp wrote: > > Ok, with the waiter_print, we see: > > > > Waiter: 2_scheduler, bytes: 4984832 > > Waiter: 5_scheduler, bytes: 4984832 > > Out of memory: could not reclaim any further. 
Current memory: > > 10871576Kb, target: 0 Kb > > > > 'osv mem' in gdb reports loads of free memory: > > > > (gdb) osv mem > > Total Memory: 12884372480 Bytes > > Mmap Memory: 1610067968 Bytes (12.50%) > > Free Memory: 11132493824 Bytes (86.40%) > > > > So why is it failing to allocate 5MB when it claims to have 11GB > > free? > > > > Any more debug I can provide? > > > > Rick > > > > On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote: > > > And indeed, back on the 'release' image, and we hit an oom: > > > > > > (gdb) b memory::oom > > > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505. > > > (gdb) c > > > Continuing. > > > > > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0) > > at > > > core/mempool.cc:505 > > > 505 core/mempool.cc: No such file or directory. > > > (gdb) osv memory > > > Total Memory: 12884372480 Bytes > > > Mmap Memory: 1541332992 Bytes (11.96%) > > > Free Memory: 11490983936 Bytes (89.19%) > > > > > > I don't have the _waiters list print unforutnately (as I hadn't > > > included it in this image). 
It looks like this though: > > > > > > (gdb) p _oom_blocked._waiters > > > $4 = > > > > > { > cl > > > ai > > > mer_waiters::wait_node, > > boost::intrusive::list_node_traits, > > > (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, > > 1>, > > > unsigned long, false, void>> = {static constant_time_size = > > false, > > > static stateful_value_traits = , > > > static has_container_from_iterator = , static > > > safemode_or_autounlink = true, > > > data_ = > > > > > { > > boost::intrusive::list_node_traits, > > > (boost::intrusive::link_mode_type)1, boost::intrusive::dft_tag, > > 1>> = > > > > > { > od > > > e, > > > boost::intrusive::list_node*, boost::intrusive::dft_tag, > > 1>> = > > > {}, static link_mode = > > boost::intrusive::safe_link}, > > > root_plus_size_ = > > > { > > unsigned long, void>> = { > > > static constant_time_size = }, m_header > > = > > > {> = { > > > next_ = 0x200041305580, prev_ = 0x200041305580}, > data > > > fields>, } > > > > > > 'osv waiters' fails, annoyingly: > > > > > > (gdb) osv waiters > > > waiters: > > > Python Exception Cannot access memory > > at > > > address 0x42c00020982ac010: > > > Error occurred in Python command: Cannot access memory at > > address > > > 0x42c00020982ac010 > > > > > > Rick > > > > > > On Sat, 2020-03-21 at 06:42 +1000, Rick Payne wrote: > > > > So we've been trying with a debug image (with the tcp kassert > > > > disabled). We do not get an OOM crash, but after some time > > (24hrs > > > > or > > > > more) all cores just hit idle and never recover. It stops > > > > responding > > > > to > > > > TCP connections and when you attached gdb, you see this: > > > > > > > > Thread 1 received signal SIGINT, Interrupt. > > > > processor::sti_hlt () at arch/x64/processor.hh:252 > > > > 252 arch/x64/processor.hh: No such file or directory. > > > > (gdb) info thread > > > > Id Target Id
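[Editor's note: the patch quoted above makes two changes to the fallback path in page_range_allocator::alloc(): the range found in the order-(exact_order-1) free list must actually be unlinked from that list, and the function should return nullptr only when the scan found nothing — the unpatched code returned nullptr unconditionally, so the fallback could never succeed. A toy model of the fixed control flow, with illustrative names and plain Python lists rather than the real intrusive-list structures:]

```python
def alloc_from_exact_order(free_lists, size, exact_order):
    """Sketch of the patched fallback in page_range_allocator::alloc().

    free_lists maps order -> list of free-range sizes (illustrative only)."""
    rng = None
    bucket = free_lists.get(exact_order - 1, [])
    for i, pr_size in enumerate(bucket):
        if pr_size >= size:
            rng = bucket.pop(i)   # the fix: unlink the range we picked
            break
    if rng is None:               # the fix: bail out only if nothing fit
        return None
    return rng

# One order-13 range happens to be big enough for the order-14 request:
free_lists = {13: [30000000, 40000000]}
print(alloc_from_exact_order(free_lists, 39849984, 14))  # -> 40000000
print(free_lists[13])                                    # -> [30000000]
```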
Re: [osv-dev] Re: OOM query
Ok, with the waiter_print, we see: Waiter: 2_scheduler, bytes: 4984832 Waiter: 5_scheduler, bytes: 4984832 Out of memory: could not reclaim any further. Current memory: 10871576Kb, target: 0 Kb 'osv mem' in gdb reports loads of free memory: (gdb) osv mem Total Memory: 12884372480 Bytes Mmap Memory: 1610067968 Bytes (12.50%) Free Memory: 11132493824 Bytes (86.40%) So why is it failing to allocate 5MB when it claims to have 11GB free? Any more debug I can provide? Rick On Sat, 2020-03-21 at 15:32 +1000, Rick Payne wrote: > And indeed, back on the 'release' image, and we hit an oom: > > (gdb) b memory::oom > Breakpoint 1 at 0x403ee300: file core/mempool.cc, line 505. > (gdb) c > Continuing. > > Thread 1 hit Breakpoint 1, memory::oom (target=target@entry=0) at > core/mempool.cc:505 > 505 core/mempool.cc: No such file or directory. > (gdb) osv memory > Total Memory: 12884372480 Bytes > Mmap Memory: 1541332992 Bytes (11.96%) > Free Memory: 11490983936 Bytes (89.19%) > > I don't have the _waiters list print unfortunately (as I hadn't > included it in this image). 
It looks like this though: > > (gdb) p _oom_blocked._waiters > $4 = {<boost::intrusive::list<reclaimer_waiters::wait_node, ...>> = {static constant_time_size = false, static safemode_or_autounlink = true, data_ = {root_plus_size_ = {m_header = {next_ = 0x200041305580, prev_ = 0x200041305580}}}}, <No data fields>} > > 'osv waiters' fails, annoyingly: > > (gdb) osv waiters > waiters: > Python Exception Cannot access memory at address 0x42c00020982ac010: > Error occurred in Python command: Cannot access memory at address 0x42c00020982ac010 > > Rick > > On Sat, 2020-03-21 at 06:42 +1000, Rick Payne wrote: > > So we've been trying with a debug image (with the tcp kassert disabled). We do not get an OOM crash, but after some time (24hrs or more) all cores just hit idle and never recover. It stops responding to TCP connections and when you attach gdb, you see this: > > > > Thread 1 received signal SIGINT, Interrupt. > > processor::sti_hlt () at arch/x64/processor.hh:252 > > 252 arch/x64/processor.hh: No such file or directory. 
> > (gdb) info thread > > Id Target Id Frame > > * 1 Thread 1 (CPU#0 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252 > > 2 Thread 2 (CPU#1 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252 > > 3 Thread 3 (CPU#2 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252 > > 4 Thread 4 (CPU#3 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252 > > 5 Thread 5 (CPU#4 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252 > > 6 Thread 6 (CPU#5 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252 > > > > The threads themselves look like this: > > > > (gdb) bt > > #0 processor::sti_hlt () at arch/x64/processor.hh:252 > > #1 0x405e4016 in arch::wait_for_interrupt () at arch/x64/arch.hh:43 > > #2 0x405d7f10 in sched::cpu::do_idle (this=0x8001b040) at core/sched.cc:404 > > #3 0x405d7fc1 in sched::cpu::idle (this=0x8001b040) at core/sched.cc:423 > > #4 0x405d73ef in sched::cpu::init_idle_thread()::<lambda()>::operator()(void) const (__closure=0x800100156070) at core/sched.cc:165 > > #5 0x405e0b57 in std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) 
at > > /usr/include/c++/9/bits/std_function.h:300 > > #6 0x40496206 in std::function::operator()() > > const > > (this=0x800100156070) > > at /usr/include/c++/9/bits/std_function.h:690 > > #7 0x405db386 in sched::thread::main > > (this=0x800100156040) > > at core/sched.cc:1210 > > #8 0x405d7173 in sched::thread_main_c > > (t=0x800100156040) > > at arch/x64/arch-switch.hh:321 > > #9 0x404911b3 in thread_main () at arch/x64/entry.S:113 > > > > I tried the 'osv waiters' command but jsut get: > > > > (gdb) osv waiters > > waiters: > > Python Exception Cannot access memory at > > address 0x42c00020880a9010: > > Error occurred in Python command: Cannot access memory at address > > 0x42c00020880a9010 > > > > I think we'll go back to a 'release' image and see if we get the > > oom > > with a few more clues... > > > > Rick > > > > > > On Wed, 2020-03-11 at 08:07 -0700, Waldek Kozaczuk wrote: > > > On Tuesday, March

Re: [osv-dev] Re: OOM query
So we've been trying with a debug image (with the tcp kassert disabled). We do not get an OOM crash, but after some time (24hrs or more) all cores just hit idle and never recover. It stops responding to TCP connections and when you attach gdb, you see this:

Thread 1 received signal SIGINT, Interrupt.
processor::sti_hlt () at arch/x64/processor.hh:252
252 arch/x64/processor.hh: No such file or directory.
(gdb) info thread
Id Target Id Frame
* 1 Thread 1 (CPU#0 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252
2 Thread 2 (CPU#1 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252
3 Thread 3 (CPU#2 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252
4 Thread 4 (CPU#3 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252
5 Thread 5 (CPU#4 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252
6 Thread 6 (CPU#5 [halted ]) processor::sti_hlt () at arch/x64/processor.hh:252

The threads themselves look like this:

(gdb) bt
#0 processor::sti_hlt () at arch/x64/processor.hh:252
#1 0x405e4016 in arch::wait_for_interrupt () at arch/x64/arch.hh:43
#2 0x405d7f10 in sched::cpu::do_idle (this=0x8001b040) at core/sched.cc:404
#3 0x405d7fc1 in sched::cpu::idle (this=0x8001b040) at core/sched.cc:423
#4 0x405d73ef in sched::cpu::init_idle_thread()::<lambda()>::operator()(void) const (__closure=0x800100156070) at core/sched.cc:165
#5 0x405e0b57 in std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) 
at /usr/include/c++/9/bits/std_function.h:300 #6 0x40496206 in std::function::operator()() const (this=0x800100156070) at /usr/include/c++/9/bits/std_function.h:690 #7 0x405db386 in sched::thread::main (this=0x800100156040) at core/sched.cc:1210 #8 0x405d7173 in sched::thread_main_c (t=0x800100156040) at arch/x64/arch-switch.hh:321 #9 0x404911b3 in thread_main () at arch/x64/entry.S:113 I tried the 'osv waiters' command but jsut get: (gdb) osv waiters waiters: Python Exception Cannot access memory at address 0x42c00020880a9010: Error occurred in Python command: Cannot access memory at address 0x42c00020880a9010 I think we'll go back to a 'release' image and see if we get the oom with a few more clues... Rick On Wed, 2020-03-11 at 08:07 -0700, Waldek Kozaczuk wrote: > > > On Tuesday, March 10, 2020 at 10:53:17 PM UTC-4, rickp wrote: > > I've not found a way to reproduce this other than in production > > yet, > > which is annoying. I've built an image with this patch, and will > > see if > > we can run it with gdb too. > > > > What should I be looking for if it hits? > > In case of oom() this patch should show a list of threads (waiters) > along with amount of memory requested. I am hoping it will give > better clues where the problem is. > > Note, the 'osv pagetable walk' command (and a few others) seem not > > to > > work for me: > > > > (gdb) osv pagetable walk > > Python Exception Argument required (expression > > to > > compute).: > > Error occurred in Python: Argument required (expression to > > compute). > > I will look into it. > > Rick > > > > On Tue, 2020-03-10 at 11:21 -0700, Waldek Kozaczuk wrote: > > > Is there a way to have a reproducible test app? 
If not I was > > > wondering if this patch enabling some extra logging would help: > > > > > > diff --git a/core/mempool.cc b/core/mempool.cc > > > index d902eea8..ace28b38 100644 > > > --- a/core/mempool.cc > > > +++ b/core/mempool.cc > > > @@ -501,9 +501,9 @@ ssize_t > > reclaimer::bytes_until_normal(pressure > > > curr) > > > } > > > } > > > > > > -void oom() > > > +void oom(ssize_t target) > > > { > > > -abort("Out of memory: could not reclaim any further. > > Current > > > memory: %d Kb", stats::free() >> 10); > > > +abort("Out of memory: could not reclaim any further. > > Current > > > memory: %d Kb, target: %d Kb", stats::free() >> 10, target >> > > 10); > > > } > > > > > > void reclaimer::wait_for_minimum_memory() > > > @@ -924,6 +924,17 @@ bool reclaimer_waiters::wake_waiters() > > > return woken; > > > } > > > > > > +void reclaimer_waiters::print_waiters() > > > +{ > > > +auto it = _waiters.begin(); > > > +while (it != _waiters.end()) { > > > +auto& wr = *it; > > > +it++; > > > + > > > +printf( "Waiter: %s, bytes: %ld\n", wr.owner->name(), > > > wr.bytes); > > > +} > > > +} > > > + > > > // Note for callers: Ideally, we would not only wake, but > > already > > > allocate > > > // memory here and pass it back to the waiter. However, memory > > is > > > not always > > > // allocated the same way (ex: refill_page_buffer is completely > > > different from > > > @@ -943,7 +954,7 @@ void reclaimer_waiters::wait(size_t bytes) > > > > > > // Wait for whom? > > > if (curr == reclaimer_thread._thread.get()) { >
Re: [osv-dev] Re: OOM query
Is there a way to have a reproducible test app? If not I was wondering if this patch enabling some extra logging would help:

diff --git a/core/mempool.cc b/core/mempool.cc
index d902eea8..ace28b38 100644
--- a/core/mempool.cc
+++ b/core/mempool.cc
@@ -501,9 +501,9 @@ ssize_t reclaimer::bytes_until_normal(pressure curr)
     }
 }

-void oom()
+void oom(ssize_t target)
 {
-    abort("Out of memory: could not reclaim any further. Current memory: %d Kb", stats::free() >> 10);
+    abort("Out of memory: could not reclaim any further. Current memory: %d Kb, target: %d Kb", stats::free() >> 10, target >> 10);
 }

 void reclaimer::wait_for_minimum_memory()
@@ -924,6 +924,17 @@ bool reclaimer_waiters::wake_waiters()
     return woken;
 }

+void reclaimer_waiters::print_waiters()
+{
+    auto it = _waiters.begin();
+    while (it != _waiters.end()) {
+        auto& wr = *it;
+        it++;
+
+        printf("Waiter: %s, bytes: %ld\n", wr.owner->name(), wr.bytes);
+    }
+}
+
 // Note for callers: Ideally, we would not only wake, but already allocate
 // memory here and pass it back to the waiter. However, memory is not always
 // allocated the same way (ex: refill_page_buffer is completely different from
@@ -943,7 +954,7 @@ void reclaimer_waiters::wait(size_t bytes)

     // Wait for whom?
     if (curr == reclaimer_thread._thread.get()) {
-        oom();
+        oom(bytes);
     }

     wait_node wr;
@@ -1027,7 +1038,8 @@ void reclaimer::_do_reclaim()
             // Wake up all waiters that are waiting and now have a chance to succeed.
             // If we could not wake any, there is nothing really we can do.
             if (!_oom_blocked.wake_waiters()) {
-                oom();
+                _oom_blocked.print_waiters();
+                oom(target);
             }
         }

diff --git a/include/osv/mempool.hh b/include/osv/mempool.hh
index 10fe5602..620f1a5b 100644
--- a/include/osv/mempool.hh
+++ b/include/osv/mempool.hh
@@ -137,6 +137,7 @@ public:
     bool wake_waiters();
     void wait(size_t bytes);
     bool has_waiters() { return !_waiters.empty(); }
+    void print_waiters();
 private:
     struct wait_node: boost::intrusive::list_base_hook<> {
         sched::thread* owner;

Waldek

On Tuesday, March 10, 2020 at 12:26:54 PM UTC-4, Waldek Kozaczuk wrote: > > Well I found this commit changing original "> 0 " to ">= 0" - https://github.com/cloudius-systems/osv/commit/f888c39d744c38cf6fa2f7568c04cea3a7217dca - with some explanation. So maybe my theory is wrong. Also I am not sure calling bytes_until_normal() before that "if" would change anything then. > > On Tuesday, March 10, 2020 at 12:04:27 PM UTC-4, Waldek Kozaczuk wrote: >> >> So here is the full code of _do_reclaim(): >> >> 1001 void reclaimer::_do_reclaim() >> 1002 { >> 1003 ssize_t target; >> 1004 emergency_alloc_level = 1; >> 1005 >> 1006 while (true) { >> 1007 WITH_LOCK(free_page_ranges_lock) { >> 1008 _blocked.wait(free_page_ranges_lock); >> 1009 *target = bytes_until_normal();* >> 1010 } >> 1011 >> 1012 // This means that we are currently ballooning, we should >> 1013 // try to serve the waiters from temporary memory without >> 1014 // going on hard mode. A big batch of more memory is likely >> 1015 // in its way. 
>> 1016 if (_oom_blocked.has_waiters() && throttling_needed()) { >> 1017 _shrinker_loop(target, [] { return false; }); >> 1018 WITH_LOCK(free_page_ranges_lock) { >> 1019 if (_oom_blocked.wake_waiters()) { >> 1020 continue; >> 1021 } >> 1022 } >> 1023 } >> 1024 >> 1025 _shrinker_loop(target, [this] { return >> _oom_blocked.has_waiters(); }); >> 1026 >> 1027 WITH_LOCK(free_page_ranges_lock) { >> 1028 if (target >= 0) { >> 1029 // Wake up all waiters that are waiting and now have >> a chance to succeed. >> 1030 // If we could not wake any, there is nothing really >> we can do. >> 1031 if (!_oom_blocked.wake_waiters()) { >> 1032 oom(); >> 1033 } >> 1034 } >> 1035 >> 1036 if (balloon_api) { >> 1037 balloon_api->voluntary_return(); >> 1038 } >> 1039 } >> 1040 } >> 1041 } >> >> We got oom() because target was '>= 0'. Now the target is calculated as >> the result of *bytes_until_normal()*. >> >> 495 ssize_t reclaimer::bytes_until_normal(pressure curr) >> 496 { >> 497 assert(mutex_owned(_page_ranges_lock)); >> 498 if (curr == pressure::PRESSURE) { >> 499 return watermark_lo - stats::free(); >> 500 } else { >> 501 return 0; >> 502 } >> 503 } >> >> which seems to indicate that when 0 is returned there no need to reclaim >> any memory. >> >> So here are two things
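[Editor's note: plugging the numbers from the later crash report into the patched oom() message is instructive: the waiter asked for 39849984 bytes, free memory was reported as 10933128 KiB, yet the reclaim target printed as 0 KiB — meaning there was no global memory pressure at all, which points at fragmentation rather than exhaustion. A quick check of the ">> 10" arithmetic the patch prints, with values taken from the log:]

```python
waiter_bytes = 39849984     # "Waiter: 1_scheduler, bytes: 39849984"
free_kb = 10933128          # "Current memory: 10933128 Kb"
target_bytes = 0            # bytes_until_normal() == 0: no pressure

print(waiter_bytes >> 10)   # -> 38916, the waiter's request in KiB
print(target_bytes >> 10)   # -> 0, the "target: 0 Kb" in the abort message

# Free memory exceeds the request by a factor of ~280, so the failure
# cannot be a plain shortage of memory:
print((free_kb * 1024) // waiter_bytes)
```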
Re: [osv-dev] Re: OOM query
Well I found this commit changing original "> 0 " to ">= 0" - https://github.com/cloudius-systems/osv/commit/f888c39d744c38cf6fa2f7568c04cea3a7217dca - with some explanation. So maybe my theory is wrong. Also I am not sure calling bytes_until_normal() before that "if" would change anything then. On Tuesday, March 10, 2020 at 12:04:27 PM UTC-4, Waldek Kozaczuk wrote: > > So here is full code of of _do_reclaim(): > > 1001 void reclaimer::_do_reclaim() > 1002 { > 1003 ssize_t target; > 1004 emergency_alloc_level = 1; > 1005 > 1006 while (true) { > 1007 WITH_LOCK(free_page_ranges_lock) { > 1008 _blocked.wait(free_page_ranges_lock); > 1009 *target = bytes_until_normal();* > 1010 } > 1011 > 1012 // This means that we are currently ballooning, we should > 1013 // try to serve the waiters from temporary memory without > 1014 // going on hard mode. A big batch of more memory is likely > 1015 // in its way. > 1016 if (_oom_blocked.has_waiters() && throttling_needed()) { > 1017 _shrinker_loop(target, [] { return false; }); > 1018 WITH_LOCK(free_page_ranges_lock) { > 1019 if (_oom_blocked.wake_waiters()) { > 1020 continue; > 1021 } > 1022 } > 1023 } > 1024 > 1025 _shrinker_loop(target, [this] { return > _oom_blocked.has_waiters(); }); > 1026 > 1027 WITH_LOCK(free_page_ranges_lock) { > 1028 if (target >= 0) { > 1029 // Wake up all waiters that are waiting and now have > a chance to succeed. > 1030 // If we could not wake any, there is nothing really > we can do. > 1031 if (!_oom_blocked.wake_waiters()) { > 1032 oom(); > 1033 } > 1034 } > 1035 > 1036 if (balloon_api) { > 1037 balloon_api->voluntary_return(); > 1038 } > 1039 } > 1040 } > 1041 } > > We got oom() because target was '>= 0'. Now the target is calculated as > the result of *bytes_until_normal()*. 
> > 495 ssize_t reclaimer::bytes_until_normal(pressure curr) > 496 { > 497 assert(mutex_owned(_page_ranges_lock)); > 498 if (curr == pressure::PRESSURE) { > 499 return watermark_lo - stats::free(); > 500 } else { > 501 return 0; > 502 } > 503 } > > which seems to indicate that when 0 is returned there no need to reclaim > any memory. > > So here are two things that might be wrong: > > 1. Shouldn't if (target >= 0) be changed to if (target > 0) {? > > 2. Shouldn't we re-read the target in second WITH_LOCK instead of > comparing the original value in the beginning of the body of the loop? The > line before - _shrinker_loop(target, [this] { return > _oom_blocked.has_waiters(); }); - might have just released enough memory > to bring target below 0, right? > > In any case it would be useful to print the value of the target before > oom(): > > if (!_oom_blocked.wake_waiters()) { > > printf("--> Target: %ld\n", target); > > oom(); > > } > > On Monday, March 9, 2020 at 10:51:02 PM UTC-4, rickp wrote: >> >> >> We're pretty close to current on OSv, but it also happens on an older >> image. We have changed some stuff in our app, but I think that may just >> be provking the bug. Certainly from gdb, I can see that both mmaped and >> normal memory fluctuate up and down but eveything looks sane. >> >> More debug in wake_waiters would be useful, but I'm losing the argument >> to continue with OSv at the moment which makes testing this a bit >> 'political'. >> >> btw - when we do run the system out of memory, it seems to hang rather >> than generate an oom. Have you tried it? >> >> The tcp_do_segment one has been mentioned before (by someone else). The >> issue is that the kassert only has effect in the debug build. I'd guess >> that the socket is being closed, but still has segments that have not >> been processed, or something like that. I'll try and narrow it down a >> bit if I get time. 
>> >> Rick >> >> On Mon, 2020-03-09 at 22:32 -0400, Waldek Kozaczuk wrote: >> > Does it happen with the very latest OSv code? Did it start happening >> > at some point more often? >> > >> > I wonder if we could add some helpful printouts in wake_waiters(). >> > >> > Btw that assert() failure in tcp_do_segment() rings a bell. >> > >> > On Mon, Mar 9, 2020 at 22:25 Rick Payne wrote: >> > > I can't add much other than I doubt its fragmentation. Sometimes >> > > this >> > > happens within a few minutes of the system starting. At no point do >> > > I >> > > think we're using more than 2GB of ram (of the 12GB) either. >> > > >> > > I did compile up a debug verison of OSv and built the system with >> > > that, >> > > but I've been unable to trigger the oom(). Worse, I hit
Re: [osv-dev] Re: OOM query
So here is the full code of _do_reclaim():

1001 void reclaimer::_do_reclaim()
1002 {
1003     ssize_t target;
1004     emergency_alloc_level = 1;
1005
1006     while (true) {
1007         WITH_LOCK(free_page_ranges_lock) {
1008             _blocked.wait(free_page_ranges_lock);
1009             *target = bytes_until_normal();*
1010         }
1011
1012         // This means that we are currently ballooning, we should
1013         // try to serve the waiters from temporary memory without
1014         // going on hard mode. A big batch of more memory is likely
1015         // in its way.
1016         if (_oom_blocked.has_waiters() && throttling_needed()) {
1017             _shrinker_loop(target, [] { return false; });
1018             WITH_LOCK(free_page_ranges_lock) {
1019                 if (_oom_blocked.wake_waiters()) {
1020                     continue;
1021                 }
1022             }
1023         }
1024
1025         _shrinker_loop(target, [this] { return _oom_blocked.has_waiters(); });
1026
1027         WITH_LOCK(free_page_ranges_lock) {
1028             if (target >= 0) {
1029                 // Wake up all waiters that are waiting and now have a chance to succeed.
1030                 // If we could not wake any, there is nothing really we can do.
1031                 if (!_oom_blocked.wake_waiters()) {
1032                     oom();
1033                 }
1034             }
1035
1036             if (balloon_api) {
1037                 balloon_api->voluntary_return();
1038             }
1039         }
1040     }
1041 }

We got oom() because target was '>= 0'. Now the target is calculated as the result of *bytes_until_normal()*.

495 ssize_t reclaimer::bytes_until_normal(pressure curr)
496 {
497     assert(mutex_owned(_page_ranges_lock));
498     if (curr == pressure::PRESSURE) {
499         return watermark_lo - stats::free();
500     } else {
501         return 0;
502     }
503 }

which seems to indicate that when 0 is returned there is no need to reclaim any memory.

So here are two things that might be wrong:

1. Shouldn't if (target >= 0) be changed to if (target > 0) {?

2. Shouldn't we re-read the target in the second WITH_LOCK instead of comparing the original value from the beginning of the body of the loop? The line before - _shrinker_loop(target, [this] { return _oom_blocked.has_waiters(); }); - might have just released enough memory to bring target below 0, right?

In any case it would be useful to print the value of the target before oom():

if (!_oom_blocked.wake_waiters()) {
    printf("--> Target: %ld\n", target);
    oom();
}

On Monday, March 9, 2020 at 10:51:02 PM UTC-4, rickp wrote: > > > We're pretty close to current on OSv, but it also happens on an older > image. We have changed some stuff in our app, but I think that may just > be provoking the bug. Certainly from gdb, I can see that both mmaped and > normal memory fluctuate up and down but everything looks sane. > > More debug in wake_waiters would be useful, but I'm losing the argument > to continue with OSv at the moment which makes testing this a bit > 'political'. > > btw - when we do run the system out of memory, it seems to hang rather > than generate an oom. Have you tried it? > > The tcp_do_segment one has been mentioned before (by someone else). The > issue is that the kassert only has effect in the debug build. I'd guess > that the socket is being closed, but still has segments that have not > been processed, or something like that. I'll try and narrow it down a > bit if I get time. > > Rick > > On Mon, 2020-03-09 at 22:32 -0400, Waldek Kozaczuk wrote: > > Does it happen with the very latest OSv code? Did it start happening > > at some point more often? > > > > I wonder if we could add some helpful printouts in wake_waiters(). > > > > Btw that assert() failure in tcp_do_segment() rings a bell. > > > > On Mon, Mar 9, 2020 at 22:25 Rick Payne > wrote: > > > I can't add much other than I doubt it's fragmentation. Sometimes > > > this happens within a few minutes of the system starting. At no point do > > > I think we're using more than 2GB of ram (of the 12GB) either. 
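[Editor's note: the two suspicions above can be made concrete: bytes_until_normal() returns 0 whenever the reclaimer is not under PRESSURE, so with the current `if (target >= 0)` test the loop can reach oom() even though there was never any deficit to reclaim, while `> 0` would skip it. A toy restatement of that condition — illustrative constants and names, not OSv code:]

```python
def bytes_until_normal(pressure, watermark_lo, free):
    # Mirrors reclaimer::bytes_until_normal(): only under PRESSURE is
    # there a real deficit to make up; otherwise the target is 0.
    return watermark_lo - free if pressure == "PRESSURE" else 0

def reaches_oom(target, woke_any):
    # The suspect condition from _do_reclaim(): with ">= 0", a target of
    # 0 (no pressure at all) still reaches oom() if no waiter was woken.
    return target >= 0 and not woke_any

target = bytes_until_normal("NORMAL", watermark_lo=2**20, free=11_132_493_824)
print(target)                    # -> 0
print(reaches_oom(target, False))  # -> True: oom() despite no deficit
print(target > 0)                # -> False: a "> 0" test would skip oom()
```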
> > > I did compile up a debug version of OSv and built the system with
> > > that, but I've been unable to trigger the oom(). Worse, I hit a
> > > kassert in the netchannel code that seems to be ignored in the
> > > 'release' build, but panics in the debug build:
> > >
> > > [E/384 bsd-kassert]: tcp_do_segment: TCPS_LISTEN
> > > Assertion failed: tp->get_state() > 1
> > > (bsd/sys/netinet/tcp_input.cc: tcp_do_segment: 1076)
> > >
> > > [backtrace]
> > > 0x40221330
> > > 0x40221399 <__assert_fail+64>
> > > 0x402a4798
> > > 0x402a97c2
> > > 0x402a98a1
> > > 0x402aa448
> > > 0x40656a9a ::operator()(mbuf*)
Re: [osv-dev] Re: OOM query
We're pretty close to current on OSv, but it also happens on an older image. We have changed some stuff in our app, but I think that may just be provoking the bug. Certainly from gdb, I can see that both mmaped and normal memory fluctuate up and down but everything looks sane.

More debug in wake_waiters would be useful, but I'm losing the argument to continue with OSv at the moment, which makes testing this a bit 'political'.

Btw - when we do run the system out of memory, it seems to hang rather than generate an oom. Have you tried it?

The tcp_do_segment one has been mentioned before (by someone else). The issue is that the kassert only has effect in the debug build. I'd guess that the socket is being closed but still has segments that have not been processed, or something like that. I'll try to narrow it down a bit if I get time.

Rick

On Mon, 2020-03-09 at 22:32 -0400, Waldek Kozaczuk wrote:
> Does it happen with the very latest OSv code? Did it start happening
> at some point more often?
>
> I wonder if we could add some helpful printouts in wake_waiters().
>
> Btw that assert() failure in tcp_do_segment() rings a bell.
>
> On Mon, Mar 9, 2020 at 22:25 Rick Payne wrote:
> > I can't add much other than I doubt it's fragmentation. Sometimes this
> > happens within a few minutes of the system starting. At no point do I
> > think we're using more than 2GB of RAM (of the 12GB) either.
> >
> > I did compile up a debug version of OSv and built the system with that,
> > but I've been unable to trigger the oom().
> > Worse, I hit a kassert in the netchannel code that seems to be
> > ignored in the 'release' build, but panics in the debug build:
> >
> > [E/384 bsd-kassert]: tcp_do_segment: TCPS_LISTEN
> > Assertion failed: tp->get_state() > 1
> > (bsd/sys/netinet/tcp_input.cc: tcp_do_segment: 1076)
> >
> > [backtrace]
> > 0x40221330
> > 0x40221399 <__assert_fail+64>
> > 0x402a4798
> > 0x402a97c2
> > 0x402a98a1
> > 0x402aa448
> > 0x40656a9a ::operator()(mbuf*) const+76>
> > 0x40655855
> > 0x4023b165
> > 0x4023b4d7
> > 0x4024cd21
> > 0x406a6a10
> > 0x406a64f7
> > 0x4067cd42
> >
> > So at the moment, I'm a bit stuck with getting any more info...
> >
> > Rick
> >
> > On Mon, 2020-03-09 at 08:52 -0700, Waldek Kozaczuk wrote:
> > > As I understand this stack trace, the oom() was called here as part
> > > of _do_reclaim():
> > >
> > > 1025         WITH_LOCK(free_page_ranges_lock) {
> > > 1026             if (target >= 0) {
> > > 1027                 // Wake up all waiters that are waiting and now
> > > have a chance to succeed.
> > > 1028                 // If we could not wake any, there is nothing
> > > really we can do.
> > > 1029                 if (!_oom_blocked.wake_waiters()) {
> > > 1030                     oom();
> > > 1031                 }
> > > 1032             }
> > > 1033
> > > 1034             if (balloon_api) {
> > > 1035                 balloon_api->voluntary_return();
> > > 1036             }
> > > 1037         }
> > >
> > > so it seems wake_waiters() returned false. I wonder if the memory
> > > was heavily fragmented or there is some logical bug in there. This
> > > method is called from two places, and I wonder if this part of
> > > wake_waiters() is correct:
> > >
> > > 921     if (!_waiters.empty()) {
> > > 922         reclaimer_thread.wake();
> > > 923     }
> > > 924     return woken;
> > >
> > > should this if also set woken to true?
> > >
> > > Also, could we enhance the oom() logic to print out more useful
> > > information if this happens once again?
> > >
> > > On Tuesday, March 3, 2020 at 2:21:40 AM UTC-5, rickp wrote:
> > > > Had a crash on a system that I don't understand. It's a VM with
> > > > 12GB allocated, and we were running with about 10.5GB free
> > > > according to the API.
> > > >
> > > > Out of the blue, we had a panic:
> > > >
> > > > Out of memory: could not reclaim any further. Current memory:
> > > > 10954988 Kb
> > > > [backtrace]
> > > > 0x403f6320
> > > > 0x403f71cc
> > > > 0x403f722f
> > > > 0x4040f29b
> > > > 0x403ae412
> > > >
> > > > The 'Out of memory' message seems to print stats::free() and that
> > > > number suggests we have plenty of free RAM.
> > > >
> > > > Have I misunderstood, or is there something I need to be looking
> > > > at?
> > > >
> > > > Cheers,
> > > > Rick
Re: [osv-dev] Re: OOM query
Does it happen with the very latest OSv code? Did it start happening at some point more often?

I wonder if we could add some helpful printouts in wake_waiters().

Btw that assert() failure in tcp_do_segment() rings a bell.

On Mon, Mar 9, 2020 at 22:25 Rick Payne wrote:
>
> I can't add much other than I doubt it's fragmentation. Sometimes this
> happens within a few minutes of the system starting. At no point do I
> think we're using more than 2GB of RAM (of the 12GB) either.
>
> I did compile up a debug version of OSv and built the system with that,
> but I've been unable to trigger the oom(). Worse, I hit a kassert in
> the netchannel code that seems to be ignored in the 'release' build,
> but panics in the debug build:
>
> [E/384 bsd-kassert]: tcp_do_segment: TCPS_LISTEN
> Assertion failed: tp->get_state() > 1 (bsd/sys/netinet/tcp_input.cc:
> tcp_do_segment: 1076)
>
> [backtrace]
> 0x40221330
> 0x40221399 <__assert_fail+64>
> 0x402a4798
> 0x402a97c2
> 0x402a98a1
> 0x402aa448
> 0x40656a9a ::operator()(mbuf*) const+76>
> 0x40655855
> 0x4023b165
> 0x4023b4d7
> 0x4024cd21
> 0x406a6a10
> 0x406a64f7
> 0x4067cd42
>
> So at the moment, I'm a bit stuck with getting any more info...
>
> Rick
>
> On Mon, 2020-03-09 at 08:52 -0700, Waldek Kozaczuk wrote:
> > As I understand this stack trace, the oom() was called here as part of
> > _do_reclaim():
> >
> > 1025         WITH_LOCK(free_page_ranges_lock) {
> > 1026             if (target >= 0) {
> > 1027                 // Wake up all waiters that are waiting and now
> > have a chance to succeed.
> > 1028                 // If we could not wake any, there is nothing
> > really we can do.
> > 1029                 if (!_oom_blocked.wake_waiters()) {
> > 1030                     oom();
> > 1031                 }
> > 1032             }
> > 1033
> > 1034             if (balloon_api) {
> > 1035                 balloon_api->voluntary_return();
> > 1036             }
> > 1037         }
> >
> > so it seems wake_waiters() returned false. I wonder if the memory was
> > heavily fragmented or there is some logical bug in there. This method
> > is called from two places, and I wonder if this part of wake_waiters()
> > is correct:
> >
> > 921     if (!_waiters.empty()) {
> > 922         reclaimer_thread.wake();
> > 923     }
> > 924     return woken;
> >
> > should this if also set woken to true?
> >
> > Also, could we enhance the oom() logic to print out more useful
> > information if this happens once again?
> >
> > On Tuesday, March 3, 2020 at 2:21:40 AM UTC-5, rickp wrote:
> > > Had a crash on a system that I don't understand. It's a VM with 12GB
> > > allocated, and we were running with about 10.5GB free according to
> > > the API.
> > >
> > > Out of the blue, we had a panic:
> > >
> > > Out of memory: could not reclaim any further. Current memory:
> > > 10954988 Kb
> > > [backtrace]
> > > 0x403f6320
> > > 0x403f71cc
> > > 0x403f722f
> > > 0x4040f29b
> > > 0x403ae412
> > >
> > > The 'Out of memory' message seems to print stats::free() and that
> > > number suggests we have plenty of free RAM.
> > >
> > > Have I misunderstood, or is there something I need to be looking at?
> > >
> > > Cheers,
> > > Rick
Re: [osv-dev] Re: OOM query
I can't add much other than I doubt it's fragmentation. Sometimes this happens within a few minutes of the system starting. At no point do I think we're using more than 2GB of RAM (of the 12GB) either.

I did compile up a debug version of OSv and built the system with that, but I've been unable to trigger the oom(). Worse, I hit a kassert in the netchannel code that seems to be ignored in the 'release' build, but panics in the debug build:

[E/384 bsd-kassert]: tcp_do_segment: TCPS_LISTEN
Assertion failed: tp->get_state() > 1 (bsd/sys/netinet/tcp_input.cc:
tcp_do_segment: 1076)

[backtrace]
0x40221330
0x40221399 <__assert_fail+64>
0x402a4798
0x402a97c2
0x402a98a1
0x402aa448
0x40656a9a ::operator()(mbuf*) const+76>
0x40655855
0x4023b165
0x4023b4d7
0x4024cd21
0x406a6a10
0x406a64f7
0x4067cd42

So at the moment, I'm a bit stuck with getting any more info...

Rick

On Mon, 2020-03-09 at 08:52 -0700, Waldek Kozaczuk wrote:
> As I understand this stack trace, the oom() was called here as part of
> _do_reclaim():
>
> 1025         WITH_LOCK(free_page_ranges_lock) {
> 1026             if (target >= 0) {
> 1027                 // Wake up all waiters that are waiting and now
> have a chance to succeed.
> 1028                 // If we could not wake any, there is nothing
> really we can do.
> 1029                 if (!_oom_blocked.wake_waiters()) {
> 1030                     oom();
> 1031                 }
> 1032             }
> 1033
> 1034             if (balloon_api) {
> 1035                 balloon_api->voluntary_return();
> 1036             }
> 1037         }
>
> so it seems wake_waiters() returned false. I wonder if the memory was
> heavily fragmented or there is some logical bug in there. This method
> is called from two places, and I wonder if this part of wake_waiters()
> is correct:
>
> 921     if (!_waiters.empty()) {
> 922         reclaimer_thread.wake();
> 923     }
> 924     return woken;
>
> should this if also set woken to true?
>
> Also, could we enhance the oom() logic to print out more useful
> information if this happens once again?
>
> On Tuesday, March 3, 2020 at 2:21:40 AM UTC-5, rickp wrote:
> > Had a crash on a system that I don't understand. It's a VM with 12GB
> > allocated, and we were running with about 10.5GB free according to
> > the API.
> >
> > Out of the blue, we had a panic:
> >
> > Out of memory: could not reclaim any further. Current memory: 10954988
> > Kb
> > [backtrace]
> > 0x403f6320
> > 0x403f71cc
> > 0x403f722f
> > 0x4040f29b
> > 0x403ae412
> >
> > The 'Out of memory' message seems to print stats::free() and that
> > number suggests we have plenty of free RAM.
> >
> > Have I misunderstood, or is there something I need to be looking at?
> >
> > Cheers,
> > Rick