Re: ATF t_mlock() babylon5 kernel panics
Date:Thu, 14 Mar 2019 18:12:18 - (UTC) From:chris...@astron.com (Christos Zoulas) Message-ID: | Great debugging :-) The hard part was all Jason's work - he picked out just where things would be going wrong - after that it wasn't difficult. | LGTM, at this point I'd merge them: That's reasonable, but it isn't required for this bug fix, nor is correcting the error codes returned, so I think I'll leave those for a bit (or you can just make this change in a few minutes...) kre
Re: ATF t_mlock() babylon5 kernel panics
> On Mar 14, 2019, at 10:27 AM, Robert Elz wrote: > >Date:Thu, 14 Mar 2019 08:06:58 -0700 >From:Jason Thorpe >Message-ID: <134778ad-a675-414a-bbb3-7eeeaf2c2...@me.com> > > | Great sleuthing, you pretty much nailed it. > > OK, great.This is the patch I plan to commit soon (rather than > waiting on lots of review - so that the b5 tests can start working > again ASAP). > > If anyone has any issues with this, please make them known soon > (as in quite soon...) Looks good to me. This is probably worth pulling up to netbsd-8 as well. > Also, after this patch, I tested address wraparround (addr+llen < addr) > and get the ENOMEM error that the patch returns (which seems to be the > error that POSIX specifies for this kind of thing. Ans speaking of which > we appear to differ with POSIX in several error codes, POSIX does not use > EINVAL for mlock/munlock of a region which crosses a "hole" that would be > ENOMEM (munlock converts the internal EINVAL into ENOMEM before returning, > mlock() doesn't), and munlock() without a preceding mlock is not an error > at all (POSIX says "regardless of how many times mlock( ) has been called by > the process for any of the pages in the specified range." Zero is a > number of times that mlock() might have been called. ALso, EAGAIN is > correct for mlock() when the max number of pages the kernel will allow > to be locked have been locked, but not for the case when a process exceeds > its resource limit of locked pages ... that should be ENOMEM. There > might be more like this. This patch doesn't do anything with those > issues, they're not urgent, and we can fix them anytime (if we agree to > do so.) Yes, we should fix the error return cases as well. > > kre > > Index: uvm_map.c > === > RCS file: /cvsroot/src/sys/uvm/uvm_map.c,v > retrieving revision 1.358 > diff -u -r1.358 uvm_map.c > --- uvm_map.c 3 Mar 2019 17:37:36 - 1.358 > +++ uvm_map.c 14 Mar 2019 16:29:57 - > @@ -3359,6 +3359,14 @@ > } > entry = start_entry; > > + if (start == end) { /* nothing required */ > + if ((lockflags & UVM_LK_EXIT) == 0) > + vm_map_unlock(map); > + > + UVMHIST_LOG(maphist,"<- done (nothing)",0,0,0,0); > + return 0; > + } > + > /* >* handle wiring and unwiring separately. >*/ > Index: uvm_mmap.c > === > RCS file: /cvsroot/src/sys/uvm/uvm_mmap.c,v > retrieving revision 1.169 > diff -u -r1.169 uvm_mmap.c > --- uvm_mmap.c19 Dec 2017 18:34:47 - 1.169 > +++ uvm_mmap.c14 Mar 2019 16:29:58 - > @@ -759,8 +759,12 @@ > > pageoff = (addr & PAGE_MASK); > addr -= pageoff; > - size += pageoff; > - size = (vsize_t)round_page(size); > + if (size != 0) { > + size += pageoff; > + size = (vsize_t)round_page(size); > + } > + if (addr + size < addr) > + return ENOMEM; > > error = range_test(>p_vmspace->vm_map, addr, size, false); > if (error) > @@ -810,8 +814,12 @@ > > pageoff = (addr & PAGE_MASK); > addr -= pageoff; > - size += pageoff; > - size = (vsize_t)round_page(size); > + if (size != 0) { > + size += pageoff; > + size = (vsize_t)round_page(size); > + } > + if (addr + size < addr) > + return ENOMEM; > > error = range_test(>p_vmspace->vm_map, addr, size, false); > if (error) > Index: uvm_page.c > === > RCS file: /cvsroot/src/sys/uvm/uvm_page.c,v > retrieving revision 1.198 > diff -u -r1.198 uvm_page.c > --- uvm_page.c19 May 2018 15:03:26 - 1.198 > +++ uvm_page.c14 Mar 2019 16:29:58 - > @@ -1605,9 +1605,11 @@ > uvm_pageunwire(struct vm_page *pg) > { > KASSERT(mutex_owned(_pageqlock)); > + KASSERT(pg->wire_count != 0); > pg->wire_count--; > if (pg->wire_count == 0) { > uvm_pageactivate(pg); > + KASSERT(uvmexp.wired != 0); > uvmexp.wired--; > } > } > > -- thorpej
Re: ATF t_mlock() babylon5 kernel panics
In article <26375.1552584...@jinx.noi.kre.to>, Robert Elz wrote: > Great debugging :-) LGTM, at this point I'd merge them: static int round_and_check(const struct vm_map *map, vaddr_t *addr, vsize_t *size) { const vsize_t pageoff = (vsize_t)(*addr & PAGE_MASK); *addr -= pageoff; if (*size != 0) { *size += pageoff; *size = (vsize_t)round_page(*size); } if (*addr + *size < *addr) return ENOMEM; // Make range_test() take const map as it should have in the first place return range_test(map, *addr, *size, false); } christos
Re: ATF t_mlock() babylon5 kernel panics
Date:Thu, 14 Mar 2019 08:06:58 -0700 From:Jason Thorpe Message-ID: <134778ad-a675-414a-bbb3-7eeeaf2c2...@me.com> | Great sleuthing, you pretty much nailed it. OK, great.This is the patch I plan to commit soon (rather than waiting on lots of review - so that the b5 tests can start working again ASAP). If anyone has any issues with this, please make them known soon (as in quite soon...) Of course, we can always revise, or revert, these changes later. They do seem to work as intended however. Before I added this to my test system, I (using the original kernel) did a test of (several iterations of) munlock(buf, 0) (where buf is page aligned) without any problems, as expected. After applying this patch, the t_mlock test works with no failings (as it exists in -current ... I will fix all its idiocy after it has been used a few times in its current state - no matter how broken it should not be able to crash the kernel). Also, after this patch, I tested address wraparround (addr+llen < addr) and get the ENOMEM error that the patch returns (which seems to be the error that POSIX specifies for this kind of thing. Ans speaking of which we appear to differ with POSIX in several error codes, POSIX does not use EINVAL for mlock/munlock of a region which crosses a "hole" that would be ENOMEM (munlock converts the internal EINVAL into ENOMEM before returning, mlock() doesn't), and munlock() without a preceding mlock is not an error at all (POSIX says "regardless of how many times mlock( ) has been called by the process for any of the pages in the specified range." Zero is a number of times that mlock() might have been called. ALso, EAGAIN is correct for mlock() when the max number of pages the kernel will allow to be locked have been locked, but not for the case when a process exceeds its resource limit of locked pages ... that should be ENOMEM. There might be more like this. This patch doesn't do anything with those issues, they're not urgent, and we can fix them anytime (if we agree to do so.) kre Index: uvm_map.c === RCS file: /cvsroot/src/sys/uvm/uvm_map.c,v retrieving revision 1.358 diff -u -r1.358 uvm_map.c --- uvm_map.c 3 Mar 2019 17:37:36 - 1.358 +++ uvm_map.c 14 Mar 2019 16:29:57 - @@ -3359,6 +3359,14 @@ } entry = start_entry; + if (start == end) { /* nothing required */ + if ((lockflags & UVM_LK_EXIT) == 0) + vm_map_unlock(map); + + UVMHIST_LOG(maphist,"<- done (nothing)",0,0,0,0); + return 0; + } + /* * handle wiring and unwiring separately. */ Index: uvm_mmap.c === RCS file: /cvsroot/src/sys/uvm/uvm_mmap.c,v retrieving revision 1.169 diff -u -r1.169 uvm_mmap.c --- uvm_mmap.c 19 Dec 2017 18:34:47 - 1.169 +++ uvm_mmap.c 14 Mar 2019 16:29:58 - @@ -759,8 +759,12 @@ pageoff = (addr & PAGE_MASK); addr -= pageoff; - size += pageoff; - size = (vsize_t)round_page(size); + if (size != 0) { + size += pageoff; + size = (vsize_t)round_page(size); + } + if (addr + size < addr) + return ENOMEM; error = range_test(>p_vmspace->vm_map, addr, size, false); if (error) @@ -810,8 +814,12 @@ pageoff = (addr & PAGE_MASK); addr -= pageoff; - size += pageoff; - size = (vsize_t)round_page(size); + if (size != 0) { + size += pageoff; + size = (vsize_t)round_page(size); + } + if (addr + size < addr) + return ENOMEM; error = range_test(>p_vmspace->vm_map, addr, size, false); if (error) Index: uvm_page.c === RCS file: /cvsroot/src/sys/uvm/uvm_page.c,v retrieving revision 1.198 diff -u -r1.198 uvm_page.c --- uvm_page.c 19 May 2018 15:03:26 - 1.198 +++ uvm_page.c 14 Mar 2019 16:29:58 - @@ -1605,9 +1605,11 @@ uvm_pageunwire(struct vm_page *pg) { KASSERT(mutex_owned(_pageqlock)); + KASSERT(pg->wire_count != 0); pg->wire_count--; if (pg->wire_count == 0) { uvm_pageactivate(pg); + KASSERT(uvmexp.wired != 0); uvmexp.wired--; } }
Re: ATF t_mlock() babylon5 kernel panics
On Mar 14, 2019, at 5:44 AM, Robert Elz wrote: > > am guessing at the UVMHIST_LOG() stuff (copy/paste/edit...) > > Does all of this sound plausible, and reasonable to do ? > Please do remember, that before this, I have never been > anywhere near uvm or anything pmap related, so have mercy! Great sleuthing, you pretty much nailed it. -- thorpej
Re: ATF t_mlock() babylon5 kernel panics
Date:Wed, 13 Mar 2019 11:44:51 -0700 From:Jason Thorpe Message-ID: <9a2a4a34-35b0-490e-9a92-aab44174f...@me.com> | I would suggest instrumenting-with-printf the "new_pageable" | case of uvm_map_pageable() That didn't show much we didn't already know. Turns out the problem is in the other side, the mlock() case, that's where things go bad, then the munlock() falls foul of the corrupted setup. I haven't tested it yet, but I will, but I am fairly sure that simply doing munlock(addr, 0) when there has been no mlock() will cause no problems at all. I am less sure about the case mlock(addr, 100); munlock(addr+50, 0); but I will investigate that one too. What happens in the mlock(addr, 0) case is this: uvm_map_pageable() starts by validating the start and end addresses (ed6000 for both, since size == 0 ... that is the real address from the test, but with all the upper bits dropped for the e-mail, they're irrelevant). uvm_map_lookup_entry() finds the entry for the start address, which in this case is ecd000..eff000 (start..end) with wired_count == 0. Since this is the !new_pageable case we drop to pass 1 of the "do wiring" code. There entry != >header (the end ... that's impossible in the first iteration I think) and entry->start (ecd000) < end (ed6000) so we enter the loop. The map isn't wired (wired_count == 0) so we do the test to see if an amap_copy() is needed. It isn't, fortunately, or I'd have to go learn that that's all about... Then we get to the first interesting bit; In general, if we have a map entry covering addresses a..b and we want to lock a range x..y where a (ENTRY)->start && (VA) < (ENTRY)->end) { \ uvm_map_clip_end(MAP,ENTRY,VA); \ } \ (that's a macro definition, hence the \'s). Here ENTRY->start == x, ENTRY->end == b, and VA == y. But x == y, so VA == x too. Then since VA == ENTRY->start the condition fails, and no splitting (clipping) is done. So, now we have two entries [ecd000..ed6000] and [ed6000..eff000] with "entry" (the one we're going to work on) being the second of those (it is the one that contains the addresses we want to lock) whereas the first is now irrelevant, untouched, and we can (and will) simply forget about. The second one gets its wired_count incremented (which in our case moves it from 0 to 1.) That's where things really start to go wrong, as that's supposed to mean that the addresses [ed6000..eff000] are wired. That might be OK if we actually wired them (yet to come) but it is not what was requested. There is then a check for holes, not really sure what that's about, but if it happens we get an error (I'm guessing it indicates that we have the situation aFrom this we go onto the next map entry, which in our case isn't interesting (cannot possibly be: either there are no more, or the next has a value for its ->start which is >= eff000, which is > end (ed6000) so one way or the other, the loop that is pass1 terminates. On to pass 2. The entry we start at remains the [ed6000..eff000] which now has wired_count == 1. The pass2 loop is controlled by while (entry != >header && entry->start < end) { The first part of that is not an issue, and like in pass1, cannot ever happen the first time around. But: entry->start == ed6000, and end == ed6000, entry->start is not < end, and so we never enter the loop. Oops. That loop is what actually makes the pages wired, and we're doing none of that. Pass2 loop "done", rv remains 0 (it only gets set inside the loop, where we never go) so we simply return. mlock() "succeeded". And now, for the lead in to the panic (new KASSERT failed) when we come to munlock()... (but of course, it is not simple!) We enter uvm_map_pageable() again, with start==end==ed6000 since we are doing an munlock(same addr as locked, 0). As nothing else has affected this process, uvm_map_lookup_entry() finds the entry we were just working on [ed6000..eff000] - the entry that contains the start addr (just the same as last time). But now that entry has wired_count == 1 (which happened just a bit earlier in this message.) The new_pagable case starts with UVM_MAP_CLIP_START(map, entry, start); to strip off any leading section of the map entry, that we are not unlocking, but here start == entry->start so the unwire begins at the beginning of the map entry. That call changes nothing. As with the wiring case, there are 2 loops, loop 1 while ((entry != >header) && (entry->start < end)) { Just like before the first condition is impossible for the first iteration, but for the second, entry->start == end. Loop done before it started. Doesn't matter, all that loop does is validity checking on the range of addresses to be unlocked (just like the "hole" check above I think). Our address range is valid (well, kind of). On to the second loop. while ((entry != >header) && (entry->start < end)) { Well, that's boring.
Re: ATF t_mlock() babylon5 kernel panics
Date:Wed, 13 Mar 2019 11:44:51 -0700 From:Jason Thorpe Message-ID: <9a2a4a34-35b0-490e-9a92-aab44174f...@me.com> | Ok, well, I see some problematic code in sys_mlock() and sys_munlock(), | but I don't think it's affecting this case (and it may in fact have | the effect of making the test pass if a non-page-aligned buffer is passed): Yes, I was expecting that might happen - and in some ways that's a convenient thing, in (just one) recent run, the b5 tests actually passed (without any apparent changes anywhere that could have affected anything.) Tests that pass don't save any logs to be looked at (they're not supposed to be interesting!) so there's no way we will ever know for sure, but I was kind of hoping that this might have been the explanation. Usually we're getting page aligned memory, but perhaps not, that one time. | I would suggest instrumenting-with-printf the "new_pageable" | case of uvm_map_pageable() I will look at that later today (unless someone else solves the problem first) - more unrelated tasks for the next few hours... kre
Re: ATF t_mlock() babylon5 kernel panics
> On Mar 13, 2019, at 10:27 AM, Robert Elz wrote: > > Some progress. > > #1 touching the buffer that malloc() returns (which is page aligned) > made no difference - it is likely that the malloc llibrary would have > done that in any case (the malloc is for just one page, so either it > is resident (or paged) or it is a ZFoD page, and most likely not the > latter - certainly not after this small mod.) > > #2 changing the mlock_clip to avoid the sequence > > mlock(buf, 0); > munlock(buf, 0); > > avoided all of the issues and problems, and the test passed. > > What locking 0 bytes might achieve Finding bugs :-) > I have no idea, but that is > what the test does: > >buf = malloc(page); > /*...*/ >for (size_t i = page; i >= 1; i = i - 1024) { >err1 = mlock(buf, page - i); >if (err1 != 0) >fprintf(stderr, "mlock_clip: page=%ld i=%zu," >" mlock(%p, %ld): %s\n", page, i, buf, page - i, >strerror(errno)); >err2 = munlock(buf, page - i); > /*...*/ > > That first munlock() is where the panic (with the extra KASSERTs added) > occurs. Ok, well, I see some problematic code in sys_mlock() and sys_munlock(), but I don't think it's affecting this case (and it may in fact have the effect of making the test pass if a non-page-aligned buffer is passed): pageoff = (addr & PAGE_MASK); addr -= pageoff; size += pageoff; size = (vsize_t)round_page(size); If you pass in (0x1004, 0), for example, you end up with: pageoff = 0x4; addr = 0x1000; size = 0x4 ROUND UP-> 0x1000; It's not clear that's the actual intent ... seems like if you pass 0, you should get 0, regardless of alignment... but let's fix that problem LATER. Since the logic for trunc / round is the same in both mlock and munlock, let's turn our attention to any logical differences in uvm_map_pageable(). If first looks up the entry that covers the range, and assigns the same value to "entry" and "start_entry". The code paths for the "pageable" and "not pageable" are split, so we'll focus on "pageable" since that's the one that's panic'ing. Assuming the region that's passed to munlock() is page-aligned (I'll use the example 0x1000 with size 0), we should see uvm_map_pageable() called with: start = 0x1000; end = 0x1000; The first thing it does is call UVM_MAP_CLIP_START(), which ensures that "entry"'s start is the start address, splitting the entry as necessary (inserting the new split-off entry ahead of "entry"). Next it does a pass over all the entries in the range to ensure that every entry is actually wired and that there are no unmapped regions in the range. The check it uses is: - If the current entry is not wired OR - If the current entry ends before the end of the range AND - This is the last entry OR - The next entry starts after the end of the current entry (hole) ...then return EINVAL The logic there seems correct, and you're not seeing EINVAL. Now, AFTER that first loop, it does this: entry = start_entry; ...because it wants to pass over the range again. This should be safe, because UVM_MAP_CLIP_START() is *supposed* to preserve the value of "entry". The *second* loop uses the same termination logic as the first, and the is-entry-wired test is equivalent (look at the macro's expansion). It's obvious from the panic that the bottom layer of this cake is freaking out because it's being asked to unwire something that's not actually wired, but I don't see how this can happen for a page-aligned zero-length region. I would suggest instrumenting-with-printf the "new_pageable" case of uvm_map_pageable() > The err1 = and err2 = are recent additions so that printf (and a > similar one after the munlock) can indicate if anything goes wrong. > > The mlock(buf, 0) (first time around the loop when page == i) > call did not fail (err1 == 0). Or at least it seemed to be OK > (there was no output, but there might have been some coming when > the munlock() caused the panic i I will try again (much much > later today) with a sleep between mlock() and munlock() so any > output has time to drain. > > Anyway, this might give someone enough of a clue as to what is happening > that that someone (who understands UVM and the pmap code) can fix things. > > There's no indication in the mlock() man page, or in the posix spec, > that len == 0 is anything but a valid call (one would expect it to be > a no-op though, rather than scrambling the kernel data structs). > > With that loop modified so i starts at page - 1024 everything is fine, > > kre > -- thorpej
Re: ATF t_mlock() babylon5 kernel panics
Some progress. #1 touching the buffer that malloc() returns (which is page aligned) made no difference - it is likely that the malloc llibrary would have done that in any case (the malloc is for just one page, so either it is resident (or paged) or it is a ZFoD page, and most likely not the latter - certainly not after this small mod.) #2 changing the mlock_clip to avoid the sequence mlock(buf, 0); munlock(buf, 0); avoided all of the issues and problems, and the test passed. What locking 0 bytes might achieve I have no idea, but that is what the test does: buf = malloc(page); /*...*/ for (size_t i = page; i >= 1; i = i - 1024) { err1 = mlock(buf, page - i); if (err1 != 0) fprintf(stderr, "mlock_clip: page=%ld i=%zu," " mlock(%p, %ld): %s\n", page, i, buf, page - i, strerror(errno)); err2 = munlock(buf, page - i); /*...*/ That first munlock() is where the panic (with the extra KASSERTs added) occurs. The err1 = and err2 = are recent additions so that printf (and a similar one after the munlock) can indicate if anything goes wrong. The mlock(buf, 0) (first time around the loop when page == i) call did not fail (err1 == 0). Or at least it seemed to be OK (there was no output, but there might have been some coming when the munlock() caused the panic i I will try again (much much later today) with a sleep between mlock() and munlock() so any output has time to drain. Anyway, this might give someone enough of a clue as to what is happening that that someone (who understands UVM and the pmap code) can fix things. There's no indication in the mlock() man page, or in the posix spec, that len == 0 is anything but a valid call (one would expect it to be a no-op though, rather than scrambling the kernel data structs). With that loop modified so i starts at page - 1024 everything is fine, kre
Re: ATF t_mlock() babylon5 kernel panics
OK, with DIAGNOSTIC enabled, and with this patch made: --- uvm_page.c 19 May 2018 15:03:26 - 1.198 +++ uvm_page.c 13 Mar 2019 08:51:11 - @@ -1605,9 +1605,11 @@ uvm_pageunwire(struct vm_page *pg) { KASSERT(mutex_owned(_pageqlock)); + KASSERT(pg->wire_count != 0); pg->wire_count--; if (pg->wire_count == 0) { uvm_pageactivate(pg); + KASSERT(uvmexp.wired != 0); uvmexp.wired--; } } I now get a *very* quick panic... t_mlock: pagesize 4096 tp-start: 1552467032.204169, t_mlock, 5 tc-start: 1552467032.204178, mlock_clip [ 47.4101095] pmap_unwire: wiring for pmap 0xaf80023f6b40 va 0x7f7ff7ed6000 did not change! [ 47.4101095] panic: kernel diagnostic assertion "pg->wire_count != 0" failed: file "/readonly/release/testing/src/sys/uvm/uvm_page.c", line 1608 [ 47.4101095] cpu0: Begin traceback... [ 47.4101095] vpanic() at netbsd:vpanic+0x143 [ 47.4101095] kern_assert() at netbsd:kern_assert+0x48 [ 47.4101095] uvm_pageunwire() at netbsd:uvm_pageunwire+0x81 [ 47.4101095] uvm_fault_unwire_locked() at netbsd:uvm_fault_unwire_locked+0x138 [ 47.4101095] uvm_map_pageable() at netbsd:uvm_map_pageable+0x346 [ 47.4101095] sys_munlock() at netbsd:sys_munlock+0x63 [ 47.4101095] syscall() at netbsd:syscall+0x9c [ 47.4101095] --- syscall (number 204) --- [ 47.4101095] 7f7ff7042ffa: [ 47.4101095] cpu0: End traceback... [ 47.4101095] fatal breakpoint trap in supervisor mode [ 47.4101095] trap type 1 code 0 rip 0x802059b5 cs 0xe030 rflags 0x202 cr2 0x7f7ff7ed6000 ilevel 0 rsp 0xaf804d94cd10 [ 47.4101095] curlwp 0xaf80024f4b00 pid 364.1 lowest kstack 0xaf804d9482c0 Stopped in pid 364.1 (t_mlock) at netbsd:breakpoint+0x5: leave breakpoint() at netbsd:breakpoint+0x5 vpanic() at netbsd:vpanic+0x143 kern_assert() at netbsd:kern_assert+0x48 uvm_pageunwire() at netbsd:uvm_pageunwire+0x81 uvm_fault_unwire_locked() at netbsd:uvm_fault_unwire_locked+0x138 uvm_map_pageable() at netbsd:uvm_map_pageable+0x346 sys_munlock() at netbsd:sys_munlock+0x63 syscall() at netbsd:syscall+0x9c --- syscall (number 204) --- 7f7ff7042ffa: ds cd20 es ccd0 fs cd10 gs 10 rdi 0 rsi af804d94cabc rbp af804d94cd10 rbx 104 rdx 1 rcx 0 rax 0 r8 805b2780cpu_info_primary r9 3e93e7 r10 0 r11 0 r12 80502e70ostype+0xab0 r13 af804d94cd58 r14 805202f0ostype+0x1df30 r15 af80023f5730 rip 802059b5breakpoint+0x5 cs e030 rflags 202 rsp af804d94cd10 ss e02b netbsd:breakpoint+0x5: leave And the system is left sitting in ddb (which I also enabled for this) so if anyone can think of anything I should be looking at to help diagnose this, now would be the time to tell me! The mlock_clip() test is the first one run. It can be seen in .../src/tests/lib/libc/sys/t_mlock.c But I have non-computer work to do for a couple of hours, so this will be the last you hear from me about this for a while. kre
Re: ATF t_mlock() babylon5 kernel panics
On Wed, Mar 13, 2019 at 03:22:27PM +0700, Robert Elz wrote: > [...] > > netbsd# df /tmp > Filesystem1K-blocks Used Avail %Cap Mounted on > tmpfs 4 4 0 100% /tmp > > That's what it showed (it was still in my xterm scrollback buffer from > the window I use as the DomU console).So, no, df isn't showing space > allocated, just nothing available. Something had "stolen" all the available > ram, and wasn't letting go. > > It did not look as if there was no free memory however: > > netbsd# vmstat m > procsmemory page disk faults cpu > r b avmfre flt re pi po fr sr x0 in sy cs us sy id > 0 034868 945956 264 0 0000 15 428 426 119 0 0 100 > > but something as simple as attempting to read a man page (to check > what other options I could try): > > netbsd# man vmstat > man: Formatting manual page... > mandoc: stdout: No space left on device I've seen this too. df showed my tmpfs as full (although there was only a few mb in the 'used' colum) , top was showing a few MB free RAM but lots of ram allocated to files (I have 8GB RAM). UVM could have freed some memory from the file cache for tmpfs ... -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: ATF t_mlock() babylon5 kernel panics
Date:Tue, 12 Mar 2019 23:21:59 -0700 From:Jason Thorpe Message-ID: | THAT is particularly special, because the code in question is: | | | void | uvm_pagewire(struct vm_page *pg) | { | KASSERT(mutex_owned(_pageqlock)); | #if defined(READAHEAD_STATS) | if ((pg->pqflags & PQ_READAHEAD) != 0) { | uvm_ra_hit.ev_count++; | pg->pqflags &= ~PQ_READAHEAD; | } | #endif /* defined(READAHEAD_STATS) */ | if (pg->wire_count == 0) { | uvm_pagedequeue(pg); | uvmexp.wired++; | } | pg->wire_count++; | KASSERT(pg->wire_count > 0);/* detect wraparound */ | } | Actually, that probably also explains why my kernel is not crashing. I don't have DIAGNOSTIC enabled, so that KASSERT() does not happen in my kernel. Realised that when I was about to add the KASSERT you suggested... I will change that while continuing to test this. kre
Re: ATF t_mlock() babylon5 kernel panics
Date:Tue, 12 Mar 2019 23:21:59 -0700 From:Jason Thorpe Message-ID: Thanks for the reply. I have dropped tech-kern and tech-userlevel from this reply though. | The test employs a bogus understanding of how malloc() is specified. Yes, that is kind of obvious, but perhaps understandable given the man page. | I've also seen the term "fundamental object" used. I think that might be pushing understanding a bit far. That could refer to a quark, or qason or one of those truly fundamental objects, or perhaps a proton/neutron/electron, or even atom or molecule... I am not going to change the man page, as that should be done by someone who knows what they're actually talking about - what NetBSD libc malloc() is actually willing to promise, but perhaps something like suitable for any C primitive or constructed data type ? | One has to remember that malloc() is specified by the C standard, | and C has no notion of "pages" or any other such silliness that | we Unix people assume are fundamental :-) Yes, of course, but as long as we meet its requirements, there's no reason that we can't provide "more" alignment than is required. It would not perhaps be completely unreasonable for a malloc of any power of two size (perhaps up to the page size, or perhaps even more, depending upon what works best for the architecture) to align the result to that same size (or the minimum required alignment, whichever is greater). An application should not assume that, but it might make some operations magically work more efficiently if things just happened to be aligned that way - eg: a read for a (one or more) full page into a page aligned buffer could simply do page flipping (sharing with kernel on a CoW basis, or if the kernel doesn't need to keep the page, simply a page swap) rather than a byte by byte copy. No-one need even know that such a thing was happening... | POSIX specifically states that mlock() //may// require that the | address is page-aligned ... Our implementation does not require this: Yes, I know - the test is specifically testing that the implementation does not require it (run on a different system that test might fail). | And there are no files there? No, there were no files, and unless init has taken to holding open large unlinked files in /tmp (a /tmp that is not yet mounted when init does most of is work) then there were no remaining processes to be keeping any unlinked files. | Even an open-unliked file should disappear when the offending process exits. yes. I don't think it was that, I think it is some anon mmap()'d object that is no longer referenced by anyone but which is still hanging around. | Well, note that tmpfs also uses anonymous memory. Yes, I'm aware. | Is it that "df" on the tmpfs is really showing a bunch of space | allocated to the tmpfs? netbsd# df /tmp Filesystem1K-blocks Used Avail %Cap Mounted on tmpfs 4 4 0 100% /tmp That's what it showed (it was still in my xterm scrollback buffer from the window I use as the DomU console).So, no, df isn't showing space allocated, just nothing available. Something had "stolen" all the available ram, and wasn't letting go. It did not look as if there was no free memory however: netbsd# vmstat m procsmemory page disk faults cpu r b avmfre flt re pi po fr sr x0 in sy cs us sy id 0 034868 945956 264 0 0000 15 428 426 119 0 0 100 but something as simple as attempting to read a man page (to check what other options I could try): netbsd# man vmstat man: Formatting manual page... mandoc: stdout: No space left on device (attempting to write a temp file on /tmp I presume). netbsd# ps ax PID TTY STATTIME COMMAND 0 ? OKl 0:27.38 [system] 1 ? Ss 0:00.02 init 1966 xencons O+ 0:00.00 ps -ax 7241 xencons Ss 0:00.01 login 23376 xencons S0:00.00 -sh netbsd# fstat [...] The fstat showed nothing interesting, in partucular, init had no open files at all (not surprising). The ps was done by the time the fstat started of course, fstat login and sh had files open (also no surprise - except why login is there at all (it is the parent of the sh) - when did we start keeping login around rather than simply having it exec sh, and why? In any case I made that login and sh go away, by logging out and in again, and confirming new pids... A "normal" df of /tmp shows something more like ... netbsd# df /tmp Filesystem1K-blocks Used Avail %Cap Mounted on tmpfs524664 4 524660 0% /tmp | We use basic 4K page size. Yes, I saw that when I added diagnostics into the test (I printed that value).I also saw when I ran it on my laptop, where I do most basic development work initially, that whatever malloc it is using (userland from -current from just over a year
Re: ATF t_mlock() babylon5 kernel panics
> On Mar 12, 2019, at 9:09 PM, Robert Elz wrote: > > The first issue I noticed, is that t_mlock() apparently belives > the malloc(3) man page, which states: > > The malloc() function allocates size bytes of uninitialized memory. The > allocated space is suitably aligned (after possible pointer coercion) for > storage of any type of object. > > and in particular, those last few words. The "any type of object" that > t_mlock wants to store is a "page" - that is a hardware page. The test employs a bogus understanding of how malloc() is specified. On x86, malloc() should return memory that is 16-byte aligned because that is the maximum alignment requirement of the fundamental types used by the compiler. > It obtains > the size of that using: > > page = sysconf(_SC_PAGESIZE); > > and then does > >buf = malloc(page); > > and if buf is not NULL (which it does check) assumes that it now > has a correctly page aligned page sized block of memory, in which > it can run mlock() related tests. > > Something tells me that the "any type of object" does not include this > one, and that t_mlock should be using posix_memalign() instead to allocate > its page, so it can specify that it needs a page aligned page. Correct. Or mmap() (which always returns page-aligned pointers). > Again, I am not proposing fixing the test until the kernel issues > are corrected, but it would be good for someone who knows what alignment > malloc() really promises to return (across all NetBSD architectures) > to rewrite the man page to say something more specific than "any type of > object" ! I've also seen the term "fundamental object" used. One has to remember that malloc() is specified by the C standard, and C has no notion of "pages" or any other such silliness that we Unix people assume are fundamental :-) > NetBSD's mlock() rounds down, so regardless of the alignment of the > space allocated, the mlock() tests should be working (the locking might > not be exactly what the test is expecting, but all it is doing is keeping > pages locked in memory - which pages exactly this test does not really > care). POSIX specifically states that mlock() //may// require that the address is page-aligned ... Our implementation does not require this: /* * align the address to a page boundary and adjust the size accordingly */ pageoff = (addr & PAGE_MASK); addr -= pageoff; size += pageoff; size = (vsize_t)round_page(size); That is to say, the intent of our implementation is to wire the page where the range begins through the page where the range ends. Note that internally, UVM represents all ranges as [start, start+size) (assuming start and size are page aligned / rounded). > On my test setup, the kernel did not panic. It does however experience > some other weirdness, some of which is also apparent in the bablylon5 > tests, and others which might be. > > My test system is an amd64 XEN DomU - configired with no swap space, and > just 1GB RAM. It typically has a tmpfs mounted limitted to 1/2 GB > (actually slightly more than that - not sure where I got the number from, > there may have been a typo... the -s param is -s=537255936 in fstab. > That oddity should be irrelevant. > > The first thing I noticed was that when I run the t_mlock test in this > environment, it ends up failing when /tmp has run out of space. And I > mean really run out of space, in that it is gone forever, and nothing I > have thought of so far to try gets any of that space back again. And there are no files there? Even an open-unliked file should disappear when the offending process exits. > I assume that jemalloc() (aka malloc() in the test) is doing some kind > of mmap() that is backed by space on /tmp and continually grabbing more > until it eventually runs out, and that the kernel never releases that > space (even after the program that mapped it has exited). That seems > sub-optimal, and needs fixing in the kernel, anonymous mmap's (or whatever > kind jemalloc() is doing) need to be released when there are no more > processes that can possibly use them. Well, note that tmpfs also uses anonymous memory. Is it that "df" on the tmpfs is really showing a bunch of space allocated to the tmpfs? > I did not try umount -f (easier to just reboot...) but a regular umount > failed (EBUSY) even though there was nothing visibly using anything on > /tmp (and I killed every possible program, leaving only init - and yes, > that did include the console shell I use to test things). > > Umounting the tmpfs before running the t_mlock test worked fine (which also > illustrates that none of the very few daemon processes, nor the shell, etc, > from my login, are just happening to be using /tmp - and that it is the > results of the malloc() calls from t_mlock that must be the culprit. > (While ATF is running, it would be using /tmp as both its working > directory,
Re: ATF t_mlock() babylon5 kernel panics
Date:Wed, 13 Mar 2019 11:09:09 +0700 From:Robert Elz Message-ID: <27829.1552450...@jinx.noi.kre.to> A few corrections/additions to my message: | "page" is the page size.(4KB, 8KB or 16KB or ...) Looks to be 4K. Is that correct? | From the number of kernel messages, I'm assuming one from each of | those mlock/munlock calls Now I think more than one, with a 4K page size, that's just 4 iterations of the loop, and there are more than 8 printfs. So now I suspect 2 printf's per mlock() and munlock() (16 printfs) which looks about right - though there might be 19 printfs, which is a bit weird (but no-one ever claimed I can count). | (possibly excepting the first, but because | a similar buffer was malloc()/free() 'd in the previous sub-test, That's not possible, there was no previous sub-test. So: | it is conceivable that the page started out locked - that is, if malloc() | returned the same page as last time. isn't possible. | The buffer is free()'d after that loop ends (of course). In this test that always happens, as the only way the test fails is if the page size < 1024 (and not "fails" there but skips) in which case the test never bothers freeing the buffer (like I said before, this test has bugs ... it should do the page size test before the malloc()) or if the malloc() fails (when there is nothing to free). But in other tests, the buffer does not get free'd if the test fails (and I think all of the others fail right now). | While I will not fix the bugs in the test that might alter the way | it interacts with the kernel, I will update the test to get more | information from the tests that fail (and check results that are | currently not checked and make them available - but so as not to | alter what happens, merely by output to stderr). I have done that now, and will commit after a test build and run to make sure I made no stupid mistakes (a simple compile test and run on my development system, which is older, and where the test works, reveals no issues). The commit of this meaningless diagnostic only change is so we can see more of what is happening in the b5 test runs. This actually reduces the info in a sense, as some ATF macro calls which would have printed useful info about what failed now will just say "err == 0 failed" or something ... but the extra printf to stderr which will be immediately before that should provide the missing info, and more. kre