Re: UVMHIST logs
> On 15. Jul 2024, at 09:31, Valery Ushakov wrote:
>
> I've just got a long message in my dmesg:
>
> ...e24 uoffset=0 align=0 flags=0x101 entry=0x8f7f2df8 (uvm_map_findspace line 2011)map=0x8ee29e80 hint=0x7def orig_hint=0x76fc3000 length=0x1000 uobj=0x8fdd0e24 uoffset=0 align=0 flags=0x101 entry=0x8f7f2df8 (uvm_map_findspace line 2214)map=0x8ee29e80 hint=0x7deef000 orig_hint=0x76fc3000 length=0x1000 uobj=0x8fdd0a9c...
>
> where, obviously, newlines are missing from the UVMHIST output.  I
> don't want to go off on a tangent to a tangent to a tangent to
> investigate this, so I'd appreciate if someone familiar with UVMHIST
> could kindly take a look.  TIA.

No UVMHIST involved.  Seems to originate from a printf() without "\n" at
sys/uvm/uvm_map.c, function uvm_findspace_invariants() at line 1795.

--
J. Hannken-Illjes - hann...@mailbox.org
Re: Problem with SIMPLEQ_INSERT_TAIL
> On 22. Jan 2024, at 11:52, Stephan wrote:
>
> Hello,
>
> I am working on the semaphore part of my Haiku compat layer.  I took
> uipc_sem.c as a reference, where control structures are organized in a
> dynamically allocated array and not as part of a list.
>
> However, I like to maintain a "free list" in the form of a SIMPLEQ, on
> which unused semaphores are maintained for fast allocation.
>
> There is an issue in the initialization function, where each structure
> is inserted into the free queue.  I spent some time on this, but have
> been unable to find the cause.  Maybe I am overlooking something
> trivial or there is something special I just don't know.
>
> The compiler says:
>
> In file included from /home/stephan/src/sys/sys/siginfo.h:38,
>                  from /home/stephan/src/sys/sys/signal.h:112,
>                  from ./machine/frame.h:75,
>                  from ./x86/cpu.h:56,
>                  from ./machine/cpu.h:42,
>                  from ./machine/param.h:11,
>                  from /home/stephan/src/sys/sys/param.h:142,
>                  from /home/stephan/src/sys/kern/uipc_hsem.c:32:
> /home/stephan/src/sys/kern/uipc_hsem.c: In function 'khsem_init':
> /home/stephan/src/sys/sys/queue.h:347:19: error: assignment to 'struct
> khsem **' from incompatible pointer type 'struct kshem **'
> [-Werror=incompatible-pointer-types]
>   347 | (head)->sqh_last = &(elm)->field.sqe_next; \
>       |                  ^
> /home/stephan/src/sys/kern/uipc_hsem.c:68:9: note: in expansion of
> macro 'SIMPLEQ_INSERT_TAIL'
>    68 | SIMPLEQ_INSERT_TAIL(&khsem_freeq, &hsems[i], khs_entry);
>       | ^~~
>
> The relevant code snippet is this:
>
> const int khsem_max = 8192;
>
> static kmutex_t khsem_mutex __cacheline_aligned;
> static struct khsem *hsems __read_mostly;
> static SIMPLEQ_HEAD(, khsem) khsem_freeq;
>
> int
> khsem_init(void)
> {
> 	int i, sz;
>
> 	SIMPLEQ_INIT(&khsem_freeq);
> 	mutex_init(&khsem_mutex, MUTEX_DEFAULT, IPL_NONE);
>
> 	sz = ALIGN(khsem_max * sizeof(struct khsem));
> 	sz = round_page(sz);
>
> 	// XXX allocate memory
>
> 	for (i = 0; i < khsem_max; i++) {
> 		hsems[i].khs_id = i;
> 		mutex_init(&hsems[i].khs_interlock, MUTEX_DEFAULT, IPL_NONE);
> 		cv_init(&hsems[i].khs_cv, "acquire_sem");
>
> 		SIMPLEQ_INSERT_TAIL(&khsem_freeq, &hsems[i], khs_entry); // << DOES NOT COMPILE
> 	}
> }
>
> The control structure looks like this:
>
> struct khsem {
> 	sem_id		khs_id;		/* id of this semaphore */
> 	SIMPLEQ_ENTRY(kshem) khs_entry;	/* free list entry */

Use SIMPLEQ_ENTRY(khsem) here ( s/sh/hs/ )?

> 	kmutex_t	khs_interlock;	/* lock on this semaphore */
> 	kcondvar_t	khs_cv;		/* CV for wait events */
> 	pid_t		khs_owner;	/* owning process */
> 	char		*khs_name;	/* name of this semaphore */
> 	size_t		khs_namelen;	/* length of name */
> 	int		khs_state;	/* state of this port */
> 	int		khs_waiters;
> 	int		khs_count;	/* current count */
> 	lwpid_t		khs_latest_holder; /* latest holder LWP id */
> 	uid_t		khs_uid;	/* creator uid */
> 	gid_t		khs_gid;	/* creator gid */
> };
>
> Any help is appreciated ;)
>
> Thanks,
>
> Stephan

--
J. Hannken-Illjes - hann...@mailbox.org
Re: veriexec(4) maintenance
On Tue, Dec 19, 2023 at 07:50:25AM +1030, Brett Lymn wrote:
> On Sat, Sep 02, 2023 at 08:57:03AM +0930, Brett Lymn wrote:
>> On Thu, Aug 31, 2023 at 10:24:07AM +0200, J. Hannken-Illjes wrote:
>>>
>>> I'm short on time, some remarks:
>>>
>>
>> Thanks, I will have a closer look at these issues.  This could be
>> where the deadlock I was seeing is coming from.
>>
>
> OK, I have, finally, updated the diff in line with the comments (thanks
> again for the good feedback).

Brett,

first it is difficult to read a diff that also introduces a "notyet"
operation on file pages.

Looks like you replace global and item rw-locks with refcounts/signals.
I tried to understand the rationale behind it and failed.  Your
implementation seems at least vulnerable to livelocks as the writer
side is missing priority.

In general, what problem(s) are you trying to solve here?  What is
wrong with a quite simple hashtable to hold a fileid<->fingerprint
mapping accessed as:

	mutex_enter table lock
	lookup fileid -> item
	mutex_enter item
	mutex_exit table lock
	while item state is busy
		cv_wait item cv
	if item state is unknown
		set item state busy
		mutex_exit item
		compute fingerprint
		mutex_enter item
		set item state valid/invalid
		cv_broadcast item cv
	/* here the item state is either valid or invalid */
	mutex_exit item

Some more problems (line numbers with your diff applied):

 476 veriexec_fp_calc(struct lwp *l, struct vnode *vp, int lock_state,
 477     struct veriexec_file_entry *vfe, u_char *fp)
 478 {

 524 	error = vn_rdwr(UIO_READ, vp, buf, len, offset,
 525 	    UIO_SYSSPACE, lock_state,

lock_state is sometimes VERIEXEC_UNLOCKED, sometimes IO_NODELOCKED.

1629 veriexec_dump(struct lwp *l, prop_array_t rarray)
1630 {
1631 	struct mount *mp;
1632 	mount_iterator_t *mpiter;
1633
1634 	mountlist_iterator_init(&mpiter);
1635
1636 	while ((mp = mountlist_iterator_trynext(mpiter)) != NULL) {
1637 		/* If it fails, the file-system is [being] unmounted. */
1638 		if (vfs_busy(mp) != 0)
1639 			continue;

This is completely wrong.  Operation mountlist_iterator_trynext()
already returns the mount busy and skips mounts currently unmounting or
remounting.  Operation mountlist_iterator_next() was right, or does
this function get called from inside unmount?

1653 veriexec_flush(struct lwp *l)
1654 {
1655 	struct mount *mp;
1656 	mount_iterator_t *mpiter;
1657 	int error = 0;
1658
1659 	mountlist_iterator_init(&mpiter);
1660
1661 	while ((mp = mountlist_iterator_trynext(mpiter)) != NULL) {
1662 		int lerror;
1663
1664 		/* If it fails, the file-system is [being] unmounted. */
1665 		if (vfs_busy(mp) != 0)
1666 			continue;

This is completely wrong, see above.

--
J. Hannken-Illjes
Re: how do I preset ddb's LINES to zero
> On 15. Dec 2023, at 19:44, Andrew Cagney wrote:
>
> On Fri, 15 Dec 2023 at 11:22, J. Hannken-Illjes wrote:
>
>>> Is there a way to stop this without having to rebuild the kernel.
>>
>> sysctl -w ddb.lines=0
>
> thanks, I'll add that (it won't help with my immediate problem of a
> panic during boot though)

From DDB command prompt "set $lines = 0" ...

--
J. Hannken-Illjes - hann...@mailbox.org
Re: how do I preset ddb's LINES to zero
> On 15. Dec 2023, at 17:19, Andrew Cagney wrote:
>
> Hi,
>
> I've the stock 10.0 boot.iso booting within a KVM based test
> framework.  I'd like to set things up so that should there be a panic,
> it dumps registers et.al., without stopping half way waiting for
> someone to hit the space bar vis:
>
> r9          af80b451d080
> r10         81d9a063
> r11         0
> r12         0
> --db_more--
>
> Is there a way to stop this without having to rebuild the kernel.

sysctl -w ddb.lines=0

--
J. Hannken-Illjes - hann...@mailbox.org
Re: veriexec(4) maintenance
> On 30. Aug 2023, at 00:16, Brett Lymn wrote:
>
> On Tue, Aug 15, 2023 at 07:52:46AM +0930, Brett Lymn wrote:
>>
>> I tried sending this with the diff but I think it was a bit big,
>> here is the text, the diff is at ftp.netbsd.org under
>> /pub/NetBSD/misc/blymn/veriexec_update
>>
>
> OK, it has been a couple of weeks since I posted this and I have
> had zero responses.  I have tested the changes and they seem ok
> on my daily usage pattern.
>
> Does anyone have _any_ comments or should I just commit the
> changes?

I'm short on time, some remarks:

- lockstate as an ioflag to vn_rdwr() can't be right.

- waiting for a condition is usually

	mutex_enter()
	while (!cond)
		cv_wait()
	...
	mutex_exit()

  doing it as

	if (!cond)
		cv_wait()

  looks wrong.

- if the reference counting is used to free an unreferenced object it
  is obviously racy

--
J. Hannken-Illjes - hann...@mailbox.org
Re: Maxphys on -current?
> On 4. Aug 2023, at 21:48, Jaromír Doleček wrote:
>
> In a broader view, I have doubts if there is any practical reason to
> even have support for bigger than 64kb block sizes at all.

Having tapes with record size > 64k would be a real benefit though.

--
J. Hannken-Illjes - hann...@mailbox.org
Re: fstrans_start(vp->v_mount) and vgone
> On 11. Apr 2023, at 12:13, Taylor R Campbell wrote:
>
> Last year hannken@ changed fstrans(9) to use a hash table mapping
> struct mount pointers to fstrans info, in order to fix a crash found
> by syzkaller arising from a race between fstrans_start(vp->v_mount)
> and vgone(vp):
>
> https://syzkaller.appspot.com/bug?extid=54dc9ac0804a97b59bc6
> https://mail-index.netbsd.org/source-changes/2022/07/08/msg139622.html
>
> However, while this may avoid a _crash_ from use-after-free of
> vp->v_mount->mnt_..., I'm not sure this is enough to prevent a _race_.
>
> Suppose we have the following sequence of events:
>
> lwp0				lwp1
>
> mp = vp->v_mount;
> 				vgone(vp);
> 				-> vp->v_mount = deadfs
> 				unmount(mp)
> 				-> free(mp)
> 				mount a new file system
> 				-> malloc() returns the same mp
> fstrans_start(mp);
>
> Now lwp0 has started a transaction on some totally unrelated mount
> point -- and may take confused actions under the misapprehension that
> vp is a vnode on that unrelated mount point.

This sequence is obviously racy, we (should) always use fstrans_start*
like:

	for (;;) {
		mp = vp->v_mount;
		fstrans_start(mp);
		if (mp == vp->v_mount)
			break;
		fstrans_done(mp);
	}

or

	mp = vp->v_mount;
	fstrans_start(mp);
	if (mp != vp->v_mount) {
		fstrans_done(mp);
		return ENOENT;
	}

thus avoiding this race and always holding fstrans on the right mount.

Some of the current uses need review though ...

--
J. Hannken-Illjes - hann...@mailbox.org
Re: Fix for PR kern/56713
> On 13. Jul 2022, at 20:33, Taylor R Campbell wrote:
>
> Generally looks reasonable to me, assuming it passes the atf tests,
> but give hannken@ another few days to review it?
>
> Maybe instead of adding vnode_impl.h to vfs_vnops.c, we can put the
> vn_knote_* in vfs_vnode.c or create a new file vfs_knote.c, to avoid
> having tentacles of vnode guts creep back into places we successfully
> cut them out of before?

Looks good to me, please go ahead.

For me it is generally ok to include vnode_impl.h from all of
kern/vfs_*, miscfs/genfs/* and miscfs/specfs/*.

--
J. Hannken-Illjes - hann...@mailbox.org
Re: panic in -current: "vp->v_iflag & VI_TEXT" failed: file "/usr/src/sys/kern/exec_subr.c", line 183
> On 9. Mar 2022, at 02:22, matthew green wrote:
>
> matthew green writes:
>> "J. Hannken-Illjes" writes:
>>> I'm now able to reproduce it here -- takes about six hours to trigger.
>>>
>>> I suppose vrelel() lost a check for new references with my last changes,
>>> currently testing the diff attached.
>>
>> well, this ran until i ran out of file building the native gcc,
>> it had been going for over an hour and was close to finishing.
>>
>> only one crash and not non crash, but this seems to help so far.
>>
>> i've cleaned the objdir and restarted, and i'll try again a couple
>> more times when it's done.
>
> this partial build completed, and 2 more since.
>
> i say this fixes the problem.  thanks!

I built eleven releases without problems, fix committed as
vfs_vnode.c rev. 1.135

--
J. Hannken-Illjes - hann...@mailbox.org
Re: panic in -current: "vp->v_iflag & VI_TEXT" failed: file "/usr/src/sys/kern/exec_subr.c", line 183
> On 8. Mar 2022, at 05:14, matthew green wrote:
>
> that's this:
>
> 175 vmcmd_map_pagedvn(struct lwp *l, struct exec_vmcmd *cmd)
> 176 {
>     [ ... ]
> 181 	vm_prot_t prot, maxprot;
> 182
> 183 	KASSERT(vp->v_iflag & VI_TEXT);
>
> christos said this happened to him on a 8c/16t 64GB machine, using
> build.sh -j40, and i was able reproduce it on a 6c/12th 64GB machine.
> my build only got as far as installing includes (so literally right
> after tools were built.)  the longer panic message is:
>
> panic: kernel diagnostic assertion "vp->v_iflag & VI_TEXT" failed: file
> "/usr/src/sys/kern/exec_subr.c", line 183
> cpu9: Begin traceback...
> vpanic() at netbsd:vpanic+0x156
> kern_assert() at netbsd:kern_assert+0x4b
> vmcmd_map_pagedvn() at netbsd:vmcmd_map_pagedvn+0x137
> execve_runproc() at netbsd:execve_runproc+0x394
> execve1() at netbsd:execve1+0x4f
> sys_execve() at netbsd:sys_execve+0x2a
> syscall() at netbsd:syscall+0x196
> --- syscall (number 59) ---
> netbsd:syscall+0x196:
>
> my panic is the same as christos' -- inside execve1().  i'm getting
> a crash dump now, so i can probably inspect this further, but i'm
> wondering if this is related to the changes to vnode/nfs locking
> changes that came around a week ago:
>
> https://mail-index.netbsd.org/source-changes/2022/02/28/msg137218.html
> https://mail-index.netbsd.org/source-changes/2022/02/28/msg137219.html
>
> i don't know when this broke.  i only really started looking because
> christos said he saw this problem..

I'm now able to reproduce it here -- takes about six hours to trigger.

I suppose vrelel() lost a check for new references with my last changes,
currently testing the diff attached.

--
J. Hannken-Illjes - hann...@mailbox.org

Attachment: vfs_node.c.diff
Re: reboot panic: "error == EOPNOTSUPP" in vfs_vnode.c line 1120
> On 7. Feb 2022, at 07:18, Michael van Elst wrote:
>
> m...@eterna.com.au (matthew green) writes:
>
>> while rebooting a quartz64 with a usb attached disk that just
>> had about 3.5GB of data written to it, the umass gave some
>> errors and then i got a panic shortly later:
>
>> [ 6179.6038961] Skipping crash dump on recursive panic
>> [ 6179.6038961] panic: kernel diagnostic assertion "error == EOPNOTSUPP"
>> failed: file "/usr/src/sys/kern/vfs_vnode.c", line 1120
>
> genfs_suspendctl() may return ENOENT if IMNT_GONE is set in mnt_iflag.
> I'm not sure if mnt_iflag is even protected against concurrent access
> but vrevoke() racing with dounmount() might be enough.

Flag IMNT_GONE is protected with suspension.

Looks like the KASSERT should become
"KASSERT(error == EOPNOTSUPP || error == ENOENT)" here.

--
J. Hannken-Illjes - hann...@mailbox.org
Re: Resolving open/close, attach/detach races
> On 1. Feb 2022, at 02:14, Taylor R Campbell wrote:
>
>> Date: Mon, 31 Jan 2022 11:15:25 +0100
>> From: "J. Hannken-Illjes"
>>
>> Do you have a recipe so I could try it here?
>
> New draft, and script I've been using for testing.  It just does cat
> and revoke on a set of device nodes, repeatedly, in many processes in
> parallel; at the same time, I yank the USB devices and reinsert them.
> My test now survives with ucom(4), at least; other drivers will need
> fixes -- but fixes which will generally be pullable-up to a branch!

As my main testbed is a virtual machine this doesn't work, but it seems
you already found a way to prevent these deadlocks.

> I resolved the issue I was asking about by
>
> (a) having spec_node_lookup_by_dev optionally (when called by
>     vdevgone) wait with vcache_vget even for dying vnodes, so doing it
>     in a loop until none left waits until any revoke finishes; and
>
> (b) moving the hash table removal logic from spec_node_destroy to
>     spec_node_revoke, and making vcache_reclaim call spec_node_revoke
>     unconditionally for device vnodes.
>
> This way each specnode is only in the hash table while it's safe for
> vcache_vget to be waiting for it, so vdevgone can safely wait for any
> concurrent revoke to finish.

Looks good so far.

The new KASSERT(devvp->v_specnode->sn_opencnt) from
spec_node_setmountedfs() fires for unmounts of udf fs as operation
udf_vfsops.c::udf_unmount() closes the device BEFORE setmountedfs().

--
J. Hannken-Illjes - hann...@mailbox.org
Re: Resolving open/close, attach/detach races
> On 29. Jan 2022, at 22:08, Taylor R Campbell wrote:
>
> New draft changes to resolve open/close/attach/detach races.
>
> There's one snag in here that I haven't been able to resolve -- a race
> in concurrent revoke and detach; maybe hannken@ can help?  The snag is
> that vdevgone needs to revoke all existing device nodes _and_ wait for
> the revocation to complete -- even if it is happening concurrently via
> the revoke(2) system call.
>
> But spec_lookup_by_dev will just skip vnodes currently being revoked,
> and possibly return before they have finished being revoked.  I tried
> making spec_lookup_by_dev wait with vdead_check(vp, 0) instead of
> vdead_check(vp, VDEAD_NOWAIT) in some circumstances, but that didn't
> work -- it led to deadlock or livelock (and then kernel lock spinout),
> depending on how I did it.
>
> My attempts to make this work may have failed because vdead_check is
> forbidden if the caller doesn't hold a vnode reference, and it's not
> clear taking a vnode reference is allowed at this point.  Generally,
> taking a new reference to a vnode being reclaimed should not be
> allowed.

Do you have a recipe so I could try it here?

> (I don't entirely understand ad@'s recent(ish) changes to allow it in
> some cases, which strikes me as a regression from the system we had
> before where VOP_INACTIVE's decision is final -- a huge improvement
> over the piles of bug-ridden incoherent gobbledegook we used to have
> to deal with vnodes being revived in the middle of reclamation.)

While ad@ introduced a race to vrelel() where a vnode cycle may miss
the final VOP_INACTIVE() I don't see a regression here.  The vnode lock
already serializes VOP_INACTIVE() and all that is missing is a path to
retry the vrelel() if the vnode gained another reference.

--
J. Hannken-Illjes - hann...@mailbox.org
Re: Resolving open/close, attach/detach races
> On 16. Jan 2022, at 23:41, Taylor R Campbell wrote:
>
>> Date: Sat, 15 Jan 2022 21:23:56 +
>> From: Taylor R Campbell
>>
>> The attached patch series enables us to resolve a large class of races
>> between devsw open/close, autoconf attach/detach, and module unload.
>
> Also introduced some bugs!
>
> - The devsw table management was broken for dynamically attached devsw
>   entries for statically allocated device major numbers.  This broke
>   one of the fast(er) paths for statically linked devsw entries, but
>   it can be repaired without needing any memory barriers for them.
>
> - Holding fstrans across VOP_READ/WRITE prevented suspending file
>   systems with device vnodes for ttys that processes were waiting to
>   read from.  Instead, these vops should take the new kind of specnode
>   I/O reference, while the vnode is unlocked.
>
> Updated patch addresses these.
>
> Unclear whether it is appropriate for VOP_CLOSE to use fstrans.  Maybe
> it too should use new specnode I/O references -- or perhaps it needn't
> drop the vnode lock at all across bdev/cdev_close; generally blocking
> indefinitely in a close routine strikes me as a bad idea.

This doesn't work.  Running the attached program on a current kernel:

	# ./revoke
	pty /dev/tty00, pid 2107
	revoke /dev/tty00
	read /dev/tty00 -> 0
	done

With your patch:

	# ./revoke
	pty /dev/tty00, pid 975
	revoke /dev/tty00

and hangs hard.

The revoke() operation enters spec_node_revoke() and here it waits
forever for sn_iocnt to become zero.  As the read from the other thread
set sn_iocnt to one by entering through spec_io_enter() the system will
hang until this read returns.

Am I missing something?

--
J. Hannken-Illjes - hann...@mailbox.org

Attachment: revoke.c
Re: ZFS L2ARC on NetBSD-9
> On 19. Apr 2021, at 19:59, Andrew Parker wrote:
>
> On Monday, 19 April 2021 13:28:07 EDT Robert Elz wrote:
>> Date: Sun, 18 Apr 2021 18:58:56 +
>> From: Andrew Parker
>> Message-ID: <2245776.bZt9KSGgi3@t470s.local>
>>
>> | Does anyone else have a working L2ARC?
>>
>> Sorry, don't even know what that is, and don't (currently anyway) use
>> zfs, but:
>>
>> | - interval = hz * l2arc_feed_secs;
>> | + interval = mstohz(l2arc_feed_secs);
>>
>> Are you sure about that part of the change (the earlier fragment
>> looked reasonable)?
>>
>> mstohz() when starting with seconds (which the name of that var
>> suggests) looks like it would be much smaller than intended, whereas
>> simply multiplying seconds by hz gives ticks, which looks to be the
>> objective in all of that.  Alternatively multiply secs by 1000 to
>> generate ms, and mstohz() that.
>>
>> Watch out for potential overflow in all of this though.
>>
>> kre
>
> Oops.  I completely misread how the return value of
> l2arc_write_interval is used so that patch doesn't make any sense.
> But adding the printf suggested earlier results in this just after
> boot:
>
> [    14.600107] WARNING: ZFS on NetBSD is under development
> [    14.650039] ZFS filesystem version: 5
> [    14.650039] wait 100
> [    15.690043] wait 96
> [    17.840054] wait 0

As Manuel said this will block forever.  Please try the attached diff
that should prevent waits on negative or zero ticks.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Attachment: arc.c.diff
Re: ZFS L2ARC on NetBSD-9
> On 18. Apr 2021, at 20:58, Andrew Parker wrote:
>
> Hi,
>
> While trying to setup an L2ARC on my amd64 Xen dom0, I noticed that
> almost no data was feeding in to the cache.  The counters in
> kstat.zfs.misc.arcstats.l2* indicate the l2arc feed process must be
> stuck somewhere.

Which counters and which values?

> I think I've traced the problem to l2arc_write_interval() called by
> l2arc_feed_thread().  It looks like the l2arc_feed_thread() loop only
> runs once because the delay set by l2arc_write_interval() is always
> too large.

For me (-current, KVM) l2arc_feed_thread() runs every second or every
0.15 second under load, tested with this patch:

	l2arc_feed_thread(void *dummy __unused)
	...
	 	while (l2arc_thread_exit == 0) {
	 		CALLB_CPR_SAFE_BEGIN();
	+		printf("wait %u\n", next - ddi_get_lbolt());

> Maybe this is a difference between NetBSD and Solaris cv_timedwait?
> Something like this results in a functioning L2ARC on my system:
>
> --- a/external/cddl/osnet/dist/uts/common/fs/zfs/arc.c
> +++ b/external/cddl/osnet/dist/uts/common/fs/zfs/arc.c
> @@ -6520,7 +6520,7 @@ l2arc_write_size(void)
>  static clock_t
>  l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
>  {
> -	clock_t interval, next, now;
> +	clock_t interval;
>
>  	/*
>  	 * If the ARC lists are busy, increase our write rate; if the
> @@ -6529,14 +6529,11 @@ l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
>  	 * what we wanted, schedule the next write much sooner.
>  	 */
>  	if (l2arc_feed_again && wrote > (wanted / 2))
> -		interval = (hz * l2arc_feed_min_ms) / 1000;
> +		interval = mstohz(l2arc_feed_min_ms) / 2;
>  	else
> -		interval = hz * l2arc_feed_secs;
> +		interval = mstohz(l2arc_feed_secs);
>
> -	now = ddi_get_lbolt();
> -	next = MAX(now, MIN(now + interval, began + interval));
> -
> -	return (next);
> +	return (interval);
> }

This is completely wrong, l2arc_write_interval() must return the next
wakeup time, not a fixed interval.

> This is on NetBSD-9.  I hope I haven't missed something obvious in
> setup here.
>
> Does anyone else have a working L2ARC?
>
> Thanks,
> Andrew

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: ZFS: time to drop Big Scary Warning
> On 19. Mar 2021, at 21:18, Michael wrote:
>
> Hello,
>
> On Fri, 19 Mar 2021 15:57:18 -0400 Greg Troxel wrote:
>
>> Even in current, zfs has a Big Scary Warning.  Lots of people are
>> using it and it seems quite solid, especially by -current standards.
>> So it feels time to drop the warning.
>>
>> I am not proposing dropping the warning in 9.
>>
>> Objections/comments?
>
> I've been using it on sparc64 without issues for a while now.
> Does nfs sharing work these days?  I dimly remember problems there.

If you mean misc/55042: Panic when creating a directory on a NFS served
ZFS it should be fixed in -current.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig
Re: zfs panic in zfs:vdev_disk_open.part.4
> On 24. Nov 2020, at 19:16, Yorick Hardy wrote:
>
> Dear David,
>
> On 2020-11-24, Yorick Hardy wrote:
>> Dear David,
>>
>> On 2020-11-22, David Brownlee wrote:
>>> I'm seeing a (new?) panic on netbsd-9 with zfs.  It seems to trigger
>>> when a newly created zfs pool attempts to be mounted:
>>>
>>> panic: vrelel: bad ref count
>>> cpu0: Begin traceback...
>>> vpanic() at netbsd:vpanic+0x160
>>> vcache_reclaim() at netbsd:vcache_reclaim
>>> vrelel() at netbsd:vrelel+0x22e
>>> vdev_disk_open.part.4() at zfs:vdev_disk_open.part.4+0x44e
>>> vdev_open() at zfs:vdev_open+0x9e
>>> vdev_open_children() at zfs:vdev_open_children+0x39
>>> vdev_root_open() at zfs:vdev_root_open+0x33
>>> vdev_open() at zfs:vdev_open+0x9e
>>> vdev_create() at zfs:vdev_create+0x1b
>>> spa_create() at zfs:spa_create+0x28c
>>> zfs_ioc_pool_create() at zfs:zfs_ioc_pool_create+0x19b
>>> zfsdev_ioctl() at zfs:zfsdev_ioctl+0x265
>>> nb_zfsdev_ioctl() at zfs:nb_zfsdev_ioctl+0x38
>>> VOP_IOCTL() at netbsd:VOP_IOCTL+0x54
>>> vn_ioctl() at netbsd:vn_ioctl+0xa5
>>> sys_ioctl() at netbsd:sys_ioctl+0x5ab
>>> syscall() at netbsd:syscall+0x157
>>> --- syscall (number 54) ---
>>> 7e047af6822a:
>>> cpu0: End traceback...
>>>
>>> Anyone seeing anything similar (I continue to have a bunch of other
>>> boxes which use zfs without issue)
>>>
>>> David
>>
>> I don't know if it helps, but it looks like vn_close(vp,..) should be
>> called instead of vrele(vp) at
>>
>> https://nxr.netbsd.org/xref/external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c#218
>>
>> and
>>
>> https://nxr.netbsd.org/xref/external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c#250
>
> Unless I am mistaken, I think you are hitting the error path which
> should probably call vn_close as below.  My machine is a bit old, so
> I have not even compile tested yet and there is still the question
> of why/if you are in the error path.
>
> If it does not help, apologies in advance!
>
> --
> Kind regards,
>
> Yorick Hardy
>
> Index: external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c
> ===
> RCS file: /cvsroot/src/external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c,v
> retrieving revision 1.18
> diff -u -r1.18 vdev_disk.c
> --- src/external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c	25 Jun 2020 09:39:15 -	1.18
> +++ src/external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c	24 Nov 2020 18:10:07 -
> @@ -215,7 +215,8 @@
>  		return (SET_ERROR(error));
>  	}
>  	if (vp->v_type != VBLK) {
> -		vrele(vp);
> +		/* VN_RELE(vp); ?? */
> +		vn_close(vp, FREAD|FWRITE, kcred);
>  		vd->vdev_stat.vs_aux = VDEV_AUX_OPEN_FAILED;
>  		return (SET_ERROR(EINVAL));
>  	}
> @@ -247,7 +248,8 @@
>  	error = workqueue_create(&dvd->vd_wq, "vdevsync",
>  	    vdev_disk_flush, dvd, PRI_NONE, IPL_NONE, WQ_MPSAFE);
>  	if (error != 0) {
> -		vrele(vp);
> +		/* VN_RELE(vp); ?? */
> +		vn_close(vp, FREAD|FWRITE, kcred);
>  		return (SET_ERROR(error));
>  	}

Good catch Yorick, though I prefer the attached diff.  Please commit
and request pullup to -9.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig

Attachment: vdev_disk.diff
Re: Atomic vcache_tryvget()
> On 25. May 2020, at 00:53, Andrew Doran wrote:
>
> On Sat, Apr 04, 2020 at 09:33:26PM +, Andrew Doran wrote:
>
>> This change makes vcache_tryvget() do its work by means of a single
>> atomic operation, which removes the need to take v_interlock
>> beforehand.
>
> I re-did this one more elegantly.  Assuming no problems I plan to
> commit some time this week.
>
> http://www.netbsd.org/~ad/2020/vget.diff
>
> The basic idea: have a "gate" bit in v_usecount that's only set when
> the vnode is LOADED.  vcache_tryvget() needs to see this otherwise it
> fails and we go down the slow path instead.  The state transition for
> reclaiming is then:
>
> 	...
> 	LOADED
> 	-> BLOCKED
> 		confirm that caller still holds the last vnode reference
> 	-> RECLAIMING
> 	...

Looks good to me -- please go ahead ...

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Please review: lookup changes
> On 5. Mar 2020, at 23:48, Andrew Doran wrote:
>
> Hi,
>
> I'd like to merge the changes on the ad-namecache branch, and would
> appreciate any code review.  The motivation for these changes is, as
> you might guess, performance.  I put an overview below.  I won't be
> merging this for a couple of weeks in any case.

[snip]

> vfs_vnode.c:
>
> 	Namecache related changes, and the change to not block new vnode
> 	references across VOP_INACTIVE() mentioned on tech-kern.  I don't
> 	think this blocking is needed at least for well behaved FSes like
> 	ffs & tmpfs.

I suppose this blocking is no longer needed.  It originates from the
early steps to clean up the vnode lifecycle.  If anything goes wrong it
is quite simple to undo this change.

Blocking further references should only be needed for vrecycle().  We
have to terminate vrelel() early if we got new references, diff
attached.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig

Attachment: addendum.diff
Re: Folding vnode_impl_t back into struct vnode
> On 23. Jan 2020, at 17:37, Andrew Doran wrote:
>
> Hi,
>
> I would like to unify these back into a single struct vnode.
>
> I think I understand the motivation for the split in the first place
> and I very much agree code outside the VFS core should not be playing
> around with a lot of the stuff in struct vnode, but I think the split
> causes more trouble than it's worth.

Please do not, I strongly object.

It was hard work to clean up file systems and kernel components using
and abusing vnode fields that should be opaque outside the VFS
implementation.  I'm quite sure people will start again using these
fields because it is so simple.  For me the unified vnode was a
maintenance nightmare.

> Here are my reasons for wanting to go back to a single struct:
>
> 1) Performance.  I want to be able to cluster the fields around the
> kind of usage they get.  For example changing v_usecount on libc.so's
> vnode or the root vnode shouldn't force a cache miss on a different
> CPU that's just trying to look up pages, or get at v_op or v_type or
> whatever; they're different kinds of activity (see below for example).
>
> 2) Memory usage; to do the above in a way that doesn't use more memory
> than needed.  With the structs split in two it's difficult to arrange
> all of this cleanly.
>
> 3) Clarity.  I'm having a hard time working in the code with the split
> and the different macros to convert back and forth.  It's visually
> untidy.  I also think that if NetBSD's is to be a reference
> implementation or teaching aid then well known stuff like this should
> be straightforward.
>
> Thoughts?
>
> Cheers,
> Andrew
>
> struct vnode {
> 	/*
> 	 * more stable stuff goes at the front.
> 	 */
> 	struct uvm_object v_uobj;
> 	...
> 	enum vtype v_type;
> 	...
>
> 	/*
> 	 * counters usecounts etc. these guys get hammered on.
> 	 * think of v_usecount on "rootvnode" for example.
> 	 * (v_usecount will come out of struct uvm_object with uvm
> 	 * rwlock changes.)
> 	 */
> 	int v_usecount __cacheline_aligned(COHERENCY_UNIT);
> 	int v_holdcount;
> 	int v_numoutput;
> 	int v_writecount;
> 	int v_someothercounter;
> 	...

All contained in struct vnode, there are no counters in struct
vnode_impl.

> 	/*
> 	 * then locks at the back in their own line so they don't
> 	 * impinge on any of the above either; different kind of activity
> 	 * again.
> 	 */
> 	krwlock_t v_lock __cacheline_aligned(COHERENCY_UNIT);
> 	krwlock_t v_nc_lock;		/* namecache locks */
> 	krwlock_t v_nc_listlock;

All (v_lock at least) contained in struct vnode_impl, all locks should
go here as they are opaque to any consumer.

> };

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig
Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded
> On 15. Jan 2020, at 18:44, Andrew Doran wrote: > > I don't understand why we do this. I don't think it's needed at all because > if the file really is deleted and the inode is getting cleaned out, then it > shouldn't be getting new refs in any of the usual ways and we don't need to > worry about it. And in any case the vnode lock prevents anyone from messing > with the inode during VOP_INACTIVE(). > > I want to change this because it causes nasty problems on MP. For example I > see threads very regulary thrown out of the namecache by vcache_vtryget(), > and when they finally get the vnode ref'd and locked they do it by digging > through the file system metadata (when everything is already in cache). > > http://www.netbsd.org/~ad/inactive.diff > > What am I missing here? Is there a buggy file system or something like > that? We also come here with a full vnode cache, cleaning vnodes to keep the cache size below its maximum by calling vrecycle() on un-deleted vnodes. I suppose the "vp->v_usecount == 1" assert in vrecycle() may fire with this change, as we drop the interlock before taking the lock. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig
Re: kmem_alloc() vs. VM_MIN_KERNEL_ADDRESS
> On 1. Jul 2019, at 20:12, Maxime Villard wrote: > > Le 01/07/2019 à 19:40, J. Hannken-Illjes a écrit : >> Sometimes kmem_alloc() returns an address below >> VM_MIN_KERNEL_ADDRESS, something like >> kmem_alloc() -> 0xaba25800a5a8 with >> VM_MIN_KERNEL_ADDRESS 0xad80 >> It doesn't happen after every reboot and breaks >> dtrace which treats the region 0.. VM_MIN_KERNEL_ADDRESS >> as toxic. >> How could we test against the lowest address the kernel >> may use? >> -- >> J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig > > Mmh, this seems to be a side effect of 1/2-KASLR which is enabled by default > in GENERIC. Reminder: in GENERIC, we now randomize by default every kernel > memory region. > > It is now possible that the direct map is below the main memory map -- it was > not the case before. kmem_alloc uses the direct map for certain allocations. > > First, please confirm by also dumping PMAP_DIRECT_BASE and PMAP_DIRECT_END to > make sure the area returned by kmem_alloc is indeed in the dmap when dtrace > doesn't work. > > Then, it seems that dtrace should be taught to consider as toxic everything > that is neither in the main map nor in the dmap. That is, only these ranges > are valid: > > VM_MIN_KERNEL_ADDRESS -> VM_MAX_KERNEL_ADDRESS > PMAP_DIRECT_BASE -> PMAP_DIRECT_END > > There is no actual "lowest" address the kernel may use; there are just two > separate regions that are valid. (Plus the kernel image itself, if it > matters.) It would work but unfortunately dtrace is a module and cannot use PMAP_DIRECT_BASE aka. pmap_direct_base. These variables are not visible from module context and don't exist if the kernel is built with KASAN. Solaris uses the toxic areas 0 .. _userlimit and all regions mapped from devices. Will the kernel ever map anything below VM_MIN_KERNEL_ADDRESS_DEFAULT? Do we have a mapping that contains all device-mapped memory? -- J. 
Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig
kmem_alloc() vs. VM_MIN_KERNEL_ADDRESS
Sometimes kmem_alloc() returns an address below VM_MIN_KERNEL_ADDRESS, something like kmem_alloc() -> 0xaba25800a5a8 with VM_MIN_KERNEL_ADDRESS 0xad80 It doesn't happen after every reboot and breaks dtrace which treats the region 0.. VM_MIN_KERNEL_ADDRESS as toxic. How could we test against the lowest address the kernel may use? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig
Re: ZFS works on 8.99.34 but fails on 201905260520Z
> On 29. May 2019, at 01:38, Paul Goyette wrote: > > The commit that introduced the new symbol should also have bumped the kernel > version... That's how we keep modules and kernel in sync... > > > On Tue, 28 May 2019, m...@netbsd.org wrote: > >> Found the commit - looks like newer modules than kernel. >> https://v4.freshbsd.org/commit/netbsd/src/IH8Jag0YCI3N6boB Do we really expect module updates without updating the kernel to work? If yes, I will do it next time. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig
Re: ZFS works on 8.99.34 but fails on 201905260520Z
Petr, your kernel is older than your ZFS module. Please update to a current kernel and try again. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig > On 28. May 2019, at 20:27, Petr Topiarz wrote: > > Hi Tech-kern, > > I run two machines with NetBSD amd64 with ZFS, one is with 8.99.34 kernel > from february, > > the other is the latest today, 201905260520Z, > > It all runs fine with the first one, but as I upgraded the other, ZFS does > not load and tels me: > > modload: zfs: Exec format error > > and to /var/log messages it writes: > > May 28 18:55:46 poweredge /netbsd: [ 236.3881944] kobj_checksyms, 988: [zfs]: > linker error: symbol `disk_rename' not found > May 28 18:55:46 poweredge /netbsd: [ 236.4833169] WARNING: module error: > unable to affix module `zfs', error 8 > May 28 18:55:50 poweredge /netbsd: [ 240.2655954] kobj_checksyms, 988: [zfs]: > linker error: symbol `disk_rename' not found > May 28 18:55:50 poweredge /netbsd: [ 240.3599823] WARNING: module error: > unable to affix module `zfs', error 8 > May 28 18:56:18 poweredge /netbsd: [ 268.0810981] kobj_checksyms, 988: [zfs]: > linker error: symbol `disk_rename' not found > May 28 18:56:18 poweredge /netbsd: [ 268.1715047] WARNING: module error: > unable to affix module `zfs', error 8 > > considering configuration I got: > > cat /etc/modules.conf > solaris > zfs > > and in /etc/rc.conf I got > > modules=YES > > Any hint where to look or what to reconfigure in the kernel? I am using > standard netbsd kernel in both cases. > > thanks > > Petr >
Re: snapshot_mount: already on list
> On 23. Nov 2018, at 11:57, Emmanuel Dreyfus wrote: > > Hello > > Some times ago, I encountered this nasty bug where the kernel > would crash when mounting a filesystem with snapshots. I thought > it was fixed, but it seems there are some problems leftover, at least > on netbsd-8. > > The crash: > > /mail: replaying log to disk > ffs_snapshot_mount: vget failed 2 > panic: ffs_snapshot_mount: 211471882 already on list This error cannot be fixed in the kernel as we had to remove the first inode from the list and when we detect this situation it is too late. Running fsck on the file system should remove duplicates from the list of snapshots to mount. > After another reboot, I ran fsck on the fileystem, inode 211471882 > indeed as troubles: > > ** Phase 1 - Check Blocks and Sizes > INCORRECT BLOCK COUNT I=211471881 (31466368 should be 31466624) > CORRECT? yes > > 845563840 DUP I=211471882 > 845563841 DUP I=211471882 > 845563842 DUP I=211471882 > 845563843 DUP I=211471882 > 845563844 DUP I=211471882 > 845563845 DUP I=211471882 > 845563846 DUP I=211471882 > 845563847 DUP I=211471882 > INCORRECT BLOCK COUNT I=211471882 (11141696 should be 11142208) > CORRECT? yes Did the file system mount OK after fsck? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: panic: ffs_snapshot_mount: already on list
> On 2. Oct 2018, at 12:23, Manuel Bouyer wrote: > > On Tue, Oct 02, 2018 at 12:11:49PM +0200, J. Hannken-Illjes wrote: >> >>> On 18. Sep 2018, at 16:39, J. Hannken-Illjes >>> wrote: >>> >>> >>>> On 18. Sep 2018, at 16:33, Manuel Bouyer wrote: >>>> >>>> On Tue, Sep 18, 2018 at 04:24:28PM +0200, J. Hannken-Illjes wrote: >>>>> There will be no copy-on-write to persistent snapshots so the snapshots >>>>> will contain garbage where the real file system writes data. >>>> >>>> Ha yes, of course. One could argue that if a systadmin boots a >>>> kernel with FFS_NO_SNAPSHOT he should know what he's doing. >>>> >>>> Maybe we could leave out the FFS_NO_SNAPSHOT change, and just clear the >>>> extra fs_snapinum[] entry ? >>> >>> >>> Without looking further I would suggest to print a warning, print the >>> fs_snapinum list and clear the entry as this inode is already an >>> active snapshot. >> >> This will not work as we must remove the first duplicate which >> is already active. > > So maybe we could just print a warning and ignore the duplicates at mount > time ? Are shapshots checked on a read-only mount ? A warning is not sufficient as we may be using the wrong block hints list. Snapshots are checked on read-write mounts only. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: panic: ffs_snapshot_mount: already on list
> On 18. Sep 2018, at 16:39, J. Hannken-Illjes wrote: > > >> On 18. Sep 2018, at 16:33, Manuel Bouyer wrote: >> >> On Tue, Sep 18, 2018 at 04:24:28PM +0200, J. Hannken-Illjes wrote: >>> There will be no copy-on-write to persistent snapshots so the snapshots >>> will contain garbage where the real file system writes data. >> >> Ha yes, of course. One could argue that if a systadmin boots a >> kernel with FFS_NO_SNAPSHOT he should know what he's doing. >> >> Maybe we could leave out the FFS_NO_SNAPSHOT change, and just clear the >> extra fs_snapinum[] entry ? > > > Without looking further I would suggest to print a warning, print the > fs_snapinum list and clear the entry as this inode is already an > active snapshot. This will not work as we must remove the first duplicate which is already active. As this should only happen after a crash or unclean unmount this looks like a job for fsck. The attached diff teaches fsck_ffs to handle this inconsistency. Opinions? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany) 001_fsck_snap Description: Binary data
Re: panic: ffs_snapshot_mount: already on list
> On 18. Sep 2018, at 16:33, Manuel Bouyer wrote: > > On Tue, Sep 18, 2018 at 04:24:28PM +0200, J. Hannken-Illjes wrote: >> There will be no copy-on-write to persistent snapshots so the snapshots >> will contain garbage where the real file system writes data. > > Ha yes, of course. One could argue that if a systadmin boots a > kernel with FFS_NO_SNAPSHOT he should know what he's doing. > > Maybe we could leave out the FFS_NO_SNAPSHOT change, and just clear the > extra fs_snapinum[] entry ? Without looking further I would suggest to print a warning, print the fs_snapinum list and clear the entry as this inode is already an active snapshot. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: panic: ffs_snapshot_mount: already on list
> On 15. Sep 2018, at 17:06, Manuel Bouyer wrote: > > On Sat, Sep 15, 2018 at 04:34:41PM +0200, J. Hannken-Illjes wrote: >> This will damage all persistent snapshots if you boot a kernel with >> FFS_NO_SNAPSHOT >> Pleas don't commit ... > > Can you explain why ? The problem Emmanuel is trying to fix is real ... > I can't see how a ffs_snapshot_mount() which is a NOP could damage > anything. There will be no copy-on-write to persistent snapshots so the snapshots will contain garbage where the real file system writes data. I will be back from vacation in October and take a deeper look. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: panic: ffs_snapshot_mount: already on list
This will damage all persistent snapshots if you boot a kernel with FFS_NO_SNAPSHOT Please don't commit ... > On 13. Sep 2018, at 08:08, Emmanuel Dreyfus wrote: > > Manuel Bouyer wrote: > >> maybe clear fs->fs_snapinum[snaploc] in that case ? > > I am about to commit the patch below, but the problem > is: I cannot test it, since I cleaned up the obsolete snapshot > storage backend file from the machine I had the initial > problem with. > > Index: sys/ufs/ffs/ffs_snapshot.c > === > RCS file: /cvsroot/src/sys/ufs/ffs/ffs_snapshot.c,v > retrieving revision 1.149 > diff -U4 -r1.149 ffs_snapshot.c > --- sys/ufs/ffs/ffs_snapshot.c 1 Jun 2017 02:45:15 - 1.149 > +++ sys/ufs/ffs/ffs_snapshot.c 13 Sep 2018 06:03:32 - > @@ -1734,8 +1734,9 @@ > */ > void > ffs_snapshot_mount(struct mount *mp) > { > +#ifndef FFS_NO_SNAPSHOT >struct vnode *devvp = VFSTOUFS(mp)->um_devvp; >struct fs *fs = VFSTOUFS(mp)->um_fs; >struct lwp *l = curlwp; >struct vnode *vp; > @@ -1823,14 +1824,16 @@ > >/* > * Link it onto the active snapshot list. > */ > - if (is_active_snapshot(si, ip)) > - panic("ffs_snapshot_mount: %"PRIu64" already on list", > + if (is_active_snapshot(si, ip)) { > + printf("ffs_snapshot_mount: %"PRIu64" already on > list", >ip->i_number); > - else > + fs->fs_snapinum[snaploc] = 0; > + } else { >TAILQ_INSERT_TAIL(&si->si_snapshots, ip, i_nextsnap); > - vp->v_vflag |= VV_SYSTEM; > + vp->v_vflag |= VV_SYSTEM; > + } >VOP_UNLOCK(vp); >} >/* > * No usable snapshots found. > @@ -1847,8 +1850,9 @@ >si->si_snapblklist = xp->i_snapblklist; >fscow_establish(mp, ffs_copyonwrite, devvp); > si->si_gen++; >mutex_exit(&si->si_lock); > +#endif /* FFS_NO_SNAPSHOT */ > } > > /* > * Disassociate snapshot files when unmounting. > > > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > m...@netbsd.org -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Finding an available fss device
> On 23. Aug 2018, at 11:59, J. Hannken-Illjes wrote: > > >> On 22. Aug 2018, at 08:50, Emmanuel Dreyfus wrote: >> >> On Mon, Aug 20, 2018 at 10:39:21AM +0200, J. Hannken-Illjes wrote: >>>> I applied that to NetBSD-8.0, and it seems to behave much better. >>> Good. >> >> Will you commit and request a pullup? The change is valuable. > > Sure ... [pullup-8 #999] fss config update [pullup-8 #1000] Fix deadlock with getnewbuf() Should also fix the deadlock VFS_SNAPSHOT->ffs_copyonwrite->biowait. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Finding an available fss device
> On 20. Aug 2018, at 10:34, Emmanuel Dreyfus wrote: > > On Thu, Aug 16, 2018 at 12:18:34PM +0200, J. Hannken-Illjes wrote: >> - 001_add_sc_state replaces the flags FSS_ACTIVE and FSS_ERROR with >> a state field. >> >> - 002_extend_state adds states for construction or destruction of >> a snapshot and fss_ioctl no longer blocks forever waiting for >> construction or destruction of a snapshot to complete. >> >> Opinions? > > I applied that to NetBSD-8.0, and it seems to behave much better. Good. > What about the deadlock scenario you mentionned, e.g. taking > multiple snapshots of / at the same time, or a snapshot of / > and /home at the same time? Should they work with this patch? Suppose you mean the "All processes go tstile" thread ... This patch will change nothing mentioned there. As I already asked: - The first thirty lines of "dumpfs /home" please. - Did you use "dump -x ..." or "dump -X"? - Did the "dump" process hang? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: All processes go tstile
> On 19. Aug 2018, at 20:30, David Holland wrote: > > On Thu, Aug 16, 2018 at 10:03:11AM +0200, J. Hannken-Illjes wrote: >>> sleepq_block >>> cv_wait >>> fstrans_start >>> VOP_BWRITE >>> getnewbuf >>> getblk >>> bio_doread >>> bread >>> ffs_update.part.3 >>> ffs_sync >>> VFS_SYNC >>> sched_sync >>> >>> If I understand correctly, that means a n I/O that never completed, right? >> >> Looks like a deadlock where we sync a file system, need a new buffer >> and try to free a buffer on a currently suspending file system. >> >> VFS_SYNC and VOP_BWRITE should be on different mounts here. > > Side note: what if there isn't a second mount? Not sure that's a good > enough way to deal with this case... This is just a guess -- if they were from the same mount sched_sync() did already call fstrans_start() through mountlist_iterator_trynext() and VOP_BWRITE would not sleep on fstrans_start() as it always succeeds on recursion. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: All processes go tstile
> On 17. Aug 2018, at 04:46, Emmanuel Dreyfus wrote: > > On Thu, Aug 16, 2018 at 10:03:11AM +0200, J. Hannken-Illjes wrote: >> Looks like a deadlock where we sync a file system, need a new buffer >> and try to free a buffer on a currently suspending file system. >> >> VFS_SYNC and VOP_BWRITE should be on different mounts here. > > Here are the mounts: > /dev/raid3a on / type ffs (log, local) > /dev/raid3e on /home type ffs (nodev, noexec, nosuid, NFS exported, local) > > The problem arises while dump is performing a snapshot backup on /home The first thirty lines of "dumpfs /home" please. Did you use "dump -x ..." or "dump -X"? Did the "dump" process hang? > It is worth noting that /home does not have -o log because frequent > fsync from NFS clients kill performances with -o log. At least it did > it on netbsd-7, but I do not see why it could have changed on netbsd-8: > frequent fsync means frequence log flushes on the server. > >> Do you have a crash dump? > > I was not able to get one so far, the dump device is not properly > configured on this system. If it hits you again, from ddb: ps /l call fstrans_dump backtrace of the "dump" process -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Finding an available fss device
> On 14. Aug 2018, at 11:16, J. Hannken-Illjes wrote: > > >> On 13. Aug 2018, at 19:25, Emmanuel Dreyfus wrote: >> >> On Mon, Aug 13, 2018 at 11:56:45AM +, Taylor R Campbell wrote: >>> Unless I misunderstand fss(4), this is an abuse of mutex(9): nothing >>> should sleep while holding the lock, so that nothing trying to acquire >>> the lock will wait for a long time. >> >> Well, the cause is not yet completely clear to me, but the user >> experience is terrible. The first time I used it, I thought >> the system crashed, because fssconfig -l was just hung for >> hours. >> >> And it is very easy to acheive a situation where most processes >> are in tstile awaiting a vnode lock for a name lookup. > > I see two problems here. > > 1) File system internal snapshots take long to create or destroy on > large file systems. I have no solution to this problem. > > Using file system external snapshots for dumps should work fine. > > 2) Fss devices block in ioctl while a snapshot gets created or > destroyed. A possible fix is to replace the current > active/non-active state with idle/creating/active/destroying The attached diffs implement this in a pullup-friendly way. - 001_add_sc_state replaces the flags FSS_ACTIVE and FSS_ERROR with a state field. - 002_extend_state adds states for construction or destruction of a snapshot and fss_ioctl no longer blocks forever waiting for construction or destruction of a snapshot to complete. Opinions? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany) 001_add_sc_state Description: Binary data 002_extend_state Description: Binary data
Re: Finding an available fss device
> On 13. Aug 2018, at 19:25, Emmanuel Dreyfus wrote: > > On Mon, Aug 13, 2018 at 11:56:45AM +, Taylor R Campbell wrote: >> Unless I misunderstand fss(4), this is an abuse of mutex(9): nothing >> should sleep while holding the lock, so that nothing trying to acquire >> the lock will wait for a long time. > > Well, the cause is not yet completely clear to me, but the user > experience is terrible. The first time I used it, I thought > the system crashed, because fssconfig -l was just hung for > hours. > > And it is very easy to acheive a situation where most processes > are in tstile awaiting a vnode lock for a name lookup. I see two problems here.

1) File system internal snapshots take long to create or destroy on large file systems. I have no solution to this problem. Using file system external snapshots for dumps should work fine.

2) Fss devices block in ioctl while a snapshot gets created or destroyed. A possible fix is to replace the current active/non-active state with idle/creating/active/destroying and to change FSSIOCSET to:

	mutex_enter(&sc->sc_lock);
	if (sc->sc_state != FSS_IDLE) {
		mutex_exit(&sc->sc_lock);
		return EBUSY;
	}
	sc->sc_state = FSS_CREATING;
	mutex_exit(&sc->sc_lock);

	error = fss_create_snapshot();

	mutex_enter(&sc->sc_lock);
	if (error)
		sc->sc_state = FSS_IDLE;
	else
		sc->sc_state = FSS_ACTIVE;
	mutex_exit(&sc->sc_lock);
	return error;

Problem here is backwards compatibility. I have no idea what to return for FSSIOCGET when the state is creating or destroying. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Finding an available fss device
> On 13. Aug 2018, at 09:53, Emmanuel Dreyfus wrote: > > On Sun, Aug 12, 2018 at 10:16:48AM +0200, J. Hannken-Illjes wrote: >> While creating a snapshot "/mount0" lookup "/mount0/file", it will block >> as "/mount0" is suspended. The lookup holds a lock on "/". >> >> Now snapshot "/ "and trying to suspend "/" will block as the lookup >> has the root vnode locked. > > This scenario is not the same as the one I asked about, which > was: performing a snapshot of filesystem mounted on /mount0 > using /dev/fss0 and a snapshot of filesystem mounted on /mount1 > using /dev/fss1 while the first one is still active. Is there some > deadlock in this case? Still not sure we are talking about the same thing. 1) Create snapshot of /mount0 with fss0 1a) Open /dev/fss0 1b) Ioctl FSSIOCSET on /dev/fss0 to create the snapshot 1c) Read data from /dev/fss0 1d) Ioctl FSSIOCCLR on /dev/fss0 to delete the snapshot 1e) Close /dev/fss0 The same for a snapshot of /mount1 with fss1. 2) Create snapshot of /mount1 with fss1 2a) Open /dev/fss1 2b) Ioctl FSSIOCSET on /dev/fss1 to create the snapshot 2c) Read data from /dev/fss1 2d) Ioctl FSSIOCCLR on /dev/fss1 to delete the snapshot 2e) Close /dev/fss1 All operations are mutually exclusive, we always run exactly one of 1), 1a) ... 2e), a second operation will block until it gets exclusive access. Of these operations, 1b), 1d), 2b) and 2d) may take a long time to run if the snapshot is file system internal. > But you also raise a deadlock scenario for which there is no > protection in currentn code. I already experienced it in the > past and it would be fair to return EGAIN rather than letting the > administrator set a snapshot that will kill the system later. This scenario is protected by mutual exclusion of fss device operations as explained above. Creating the snapshot of "/" waits for the creation of the snapshot of "/mount0" to finish. 
Additionally VFS_SUSPEND is always exclusive (see mutex vfs_suspend_lock) as it gets used from mount() and unmount() too. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Finding an available fss device
> On 12. Aug 2018, at 10:07, Emmanuel Dreyfus wrote: > > On Sun, Aug 12, 2018 at 09:55:27AM +0200, J. Hannken-Illjes wrote: >>> You mean you cannot at the same tme snapshot /mount0 on fss0 and >>> /mount1 on fss1? >> >> Yes, you have to create the snapshot on /mount0 and once it has been >> created you create the snapshot on /mount1. > > Where is that limitation? I would not exepect to get such a limitation > when working on both a different mount and fss device. The simplest deadlock is: while creating a snapshot of "/mount0", a lookup of "/mount0/file" will block as "/mount0" is suspended. The lookup holds a lock on "/". Now snapshot "/" and the attempt to suspend "/" will block as the lookup has the root vnode locked. There are too many other deadlock scenarios for this to ever work. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Finding an available fss device
> On 12. Aug 2018, at 03:58, Emmanuel Dreyfus wrote: > > On Sat, Aug 11, 2018 at 10:33:04AM +0200, J. Hannken-Illjes wrote: >> When fssconfig "hangs" the dump is creating a snapshot. Creating >> a snapshot (and suspending a file system) is serialized. Allowing >> more than one file system suspension at a time will deadlock most >> of the time. > > You mean you cannot at the same tme snapshot /mount0 on fss0 and > /mount1 on fss1? Yes, you have to create the snapshot on /mount0 and once it has been created you create the snapshot on /mount1. The snapshot on /mount0 is already usable while you create the second snapshot on /mount1. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Finding an available fss device
> On 10. Aug 2018, at 15:46, Emmanuel Dreyfus wrote: > > Hello > > How are user processes supposed to find an unused fss device? > In dump(8) code, there is an iteration on /dev/rfss* trying to > performan an ioctl FSSIOCSET. The code tests for EBUSY on failure, > but in my experience that struggles to happen: if the device is > already in use, the ioctl will sleep in the kernel for ages before > getting a reply. > > This is something I can even experience with fssconfig -l, which > hangs for a while if dump is running. > > Is there another way? I thought about searching vnode in kernel to > check if the device is already used by someone else, but that looks > overkill. > > Perhaps the right way is to add a FSSIOBUSY ioctl that would > use mutex_tryenter and return EBUSY if the device is in use? When fssconfig "hangs" the dump is creating a snapshot. Creating a snapshot (and suspending a file system) is serialized. Allowing more than one file system suspension at a time will deadlock most of the time. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Performance of VSTATE_ASSERT_UNLOCKED
> On 12. Sep 2017, at 20:52, Joerg Sonnenberger <jo...@bec.de> wrote: > > Hello Juergen, > the new VSTATE_ASSERT_UNLOCKED code you added to genfs_lock is very > heavy as it significantly increases lock contention for something > already highly contended. I'd like to push the lock down to critical > cases and try to avoid it for common cases. > > Thoughts? > > Joerg > Looks good to me, cannot test anything as I am on vacation. Please commit and request pullup to -8. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Problem with lots of (perhaps too much?) memory
> On 17. May 2017, at 08:59, Paul Goyette <p...@whooppee.com> wrote: > > On Wed, 17 May 2017, J. Hannken-Illjes wrote: > >>>> Chances are very high that you are hitting the free page lock contention >>>> bug. >>> >>> Further observation: >>> >>> 1. A run of lockstat shows that there's lots of activity on >>> mntvnode_lock from vfs_insmntque >>> >>> 2. The problem can be triggered simply by running 'du -s' on a >>> file system with lots of files (for example, after having run >>> 'build.sh release' for ~40 different architectures, and having >>> kept ALL of the obj/*, dest/*, and release/* output files). >>> >>> 3. The ioflush thread _never_ finishes. Even 12 hours after the >>> trigger, and after an 8-hour sleep window doing nothing (other >>> than receiving a couple dozen Emails), the ioflush thread is >>> still using 5-10% of one CPU core/thread. >>> >>> 4. If I umount the trigger file system, ioflush time goes to near- >>> zero. I can remount without problem, however shortly after >>> re-running the 'du -s' command the problem returns. >>> >>> There was a comment on IRC that yamt@ had been working on a problem where >>> "ioflush wastes a lot of CPU time when there are lots of vnodes" seems to >>> describe this situation. Unfortunately, it seems that yamt never finished >>> working on the problem. :( >> >> Two questions: >> >> - What is the size of your vnode cache (sysctl kern.maxvnodes)? > > kern.maxvnodes = 6706326 What happens if you lower it to 1258268 for example? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Problem with lots of (perhaps too much?) memory
> On 16. May 2017, at 23:25, Paul Goyette <p...@whooppee.com> wrote: > > On Tue, 16 May 2017, Joerg Sonnenberger wrote: > >> On Tue, May 16, 2017 at 08:12:20AM +0800, Paul Goyette wrote: >>> * xosview frequently reports that the system is running at near-100% >>> in SYS time. >> >> Chances are very high that you are hitting the free page lock contention >> bug. > > Further observation: > > 1. A run of lockstat shows that there's lots of activity on > mntvnode_lock from vfs_insmntque > > 2. The problem can be triggered simply by running 'du -s' on a > file system with lots of files (for example, after having run > 'build.sh release' for ~40 different architectures, and having > kept ALL of the obj/*, dest/*, and release/* output files). > > 3. The ioflush thread _never_ finishes. Even 12 hours after the > trigger, and after an 8-hour sleep window doing nothing (other > than receiving a couple dozen Emails), the ioflush thread is > still using 5-10% of one CPU core/thread. > > 4. If I umount the trigger file system, ioflush time goes to near- > zero. I can remount without problem, however shortly after > re-running the 'du -s' command the problem returns. > > There was a comment on IRC that yamt@ had been working on a problem where > "ioflush wastes a lot of CPU time when there are lots of vnodes" seems to > describe this situation. Unfortunately, it seems that yamt never finished > working on the problem. :( Two questions: - What is the size of your vnode cache (sysctl kern.maxvnodes)? - Is the vdrain thread using CPU too? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Add a mountlist iterator (round 2)
> On 9. Apr 2017, at 19:16, Taylor R Campbell > <campbell+netbsd-tech-k...@mumble.net> wrote: > >> Date: Sun, 9 Apr 2017 18:47:25 +0200 >> From: "J. Hannken-Illjes" <hann...@eis.cs.tu-bs.de> >> >>> On 6. Apr 2017, at 11:44, J. Hannken-Illjes <hann...@eis.cs.tu-bs.de> wrote: >>> Good hint. Prepared a partial implementation of >>> >>> int >>> mountlist_iterate(int (*cb)(struct mount *, void *), void *arg) >>> [...] >>> Diffs are here: https://www.netbsd.org/~hannken/ml_iterator2a/ >>> >>> I prefer this API, opinions? >> >> If no one objects I will commit this iterator on Tuesday, April 11. > > I tend to prefer an iterator-style API over a callback-style API where > it works. (E.g., obviously xcall(9) can't be done with an iterator- > style API.) I like to avoid writing my code upside-down unless it's > really necessary. If you want a callback-style API for some reason, > you can always write that in terms of an iterator-style API, but not > vice versa. > > Note the suggestion I made earlier about mountlist_iterator_next_done > isn't necessary as long as anyone trying to break the loop and keep > using the mount does something extra to acquire a reference: > > mountlist_iterator_begin(); > while ((mp = mountlist_iterator_next()) != NULL) { > dostuff(mp); > if (interesting(mp)) { > vfs_busy(mp); > break; > } > } > mountlist_iterator_end(); > > All you have to do to make this work is store the last mp in the > iterator too, so that mountlist_iterator_next can release it before > acquiring the next one. > > But I'm not the one doing the work here, so this is not an objection! ... and I am open to both approaches, I just want it done :-) -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Add a mountlist iterator (round 2)
> On 6. Apr 2017, at 11:44, J. Hannken-Illjes <hann...@eis.cs.tu-bs.de> wrote: > >> >> On 5. Apr 2017, at 05:14, Chuck Silvers <c...@chuq.com> wrote: >> >> have you considered a callback-based interface where the loop >> is inside the iteration API rather than in the caller? >> the vfs_busy/unbusy could also be hidden in the iteration API >> so that the callback would not need need to worry about it at all. >> >> I looked at how the iteration stuff is used in your patches >> (the first one, I haven't looked in the latest one in detail) >> and it looks like most of users would be fine with not being able >> to do anything after vfs_unbusy(). the one place that really does >> want to do more after vfs_unbusy() is vfs_unmountall1(), >> but that could be rewritten to loop unmounting the last mount >> in the list until they're all gone. >> >> I think this would further simplify most (if not all) cases of >> mountlist iteration. > > Good hint. Prepared a partial implementation of > > int > mountlist_iterate(int (*cb)(struct mount *, void *), void *arg) > > to take "mp" busy and call "cb(mp, arg)" for all mounted file systems. > > A non waiting variant "mountlist_iterate_nowait()" could be added > when it gets needed. > > Changed "vfs_unmountall1()" to retrieve mounts in descending > generation order which has the additional benefit that we don't > need reverse traversals any more. > > Diffs are here: https://www.netbsd.org/~hannken/ml_iterator2a/ > > I prefer this API, opinions? If no one objects I will commit this iterator on Tuesday, April 11. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Add a mountlist iterator (round 2)
> On 5. Apr 2017, at 05:14, Chuck Silvers <c...@chuq.com> wrote: > > have you considered a callback-based interface where the loop > is inside the iteration API rather than in the caller? > the vfs_busy/unbusy could also be hidden in the iteration API > so that the callback would not need to worry about it at all. > > I looked at how the iteration stuff is used in your patches > (the first one, I haven't looked in the latest one in detail) > and it looks like most users would be fine with not being able > to do anything after vfs_unbusy(). the one place that really does > want to do more after vfs_unbusy() is vfs_unmountall1(), > but that could be rewritten to loop unmounting the last mount > in the list until they're all gone. > > I think this would further simplify most (if not all) cases of > mountlist iteration. Good hint. Prepared a partial implementation of int mountlist_iterate(int (*cb)(struct mount *, void *), void *arg) to take "mp" busy and call "cb(mp, arg)" for all mounted file systems. A non-waiting variant "mountlist_iterate_nowait()" could be added when it gets needed. Changed "vfs_unmountall1()" to retrieve mounts in descending generation order, which has the additional benefit that we don't need reverse traversals any more. Diffs are here: https://www.netbsd.org/~hannken/ml_iterator2a/ I prefer this API, opinions? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Add a mountlist iterator (round 2)
Time to start a second round. This time the iterator only, vfs_busy() and friends deferred to a new thread once the iterator is done. Changes from the previous proposal: - Removed the "flags" argument from mountlist_iterator_next(), will add mountlist_iterator_trynext() and vfs_trybusy() when needed. - Added "void mountlist_iterator_done(mount_iterator_t *)" to completely hide vfs_busy()/vfs_unbusy() from the API and make it possible to assert that every call to mountlist_iterator_next() gets accompanied by a call to mountlist_iterator_done(). - Use the existing lock "mountlist_lock". A typical usage is: mount_iterator_t *iter; mountlist_iterator_init(&iter); while ((mp = mountlist_iterator_next(iter)) != NULL) { ... mountlist_iterator_done(iter); } mountlist_iterator_destroy(iter); Diffs are here: https://www.netbsd.org/~hannken/ml_iterator2/ Comments or objections anyone? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: locking from VOP_INACTIVE to VOP_RECLAIM
> On 2. Apr 2017, at 18:19, Taylor R Campbell > <campbell+netbsd-tech-k...@mumble.net> wrote: > >> Date: Sun, 2 Apr 2017 16:34:20 +0200 >> From: "J. Hannken-Illjes" <hann...@eis.cs.tu-bs.de> >> >> Looks like your proposal needs some clarification regarding "vnode lock" >> and "the lock". >> >> We have two vnode related locks: >> >> - the "vnode lock", located as field "vi_lock" in "struct vnode_impl" >> used by genfs_*lock() aka VOP_*LOCK. >> >> - the "genfs lock" located as field "g_lock" in "struct genfs_node" >> used by genfs_node_*lock(). >> >> Which lock do you want to be held across VOP_INACTIVE and destroyed >> from VOP_RECLAIM? > > Aha! That would explain why I had trouble tracking down why an assert > I was planning to add kept firing. I was thinking that the genfs node > lock and the vnode lock were one and the same. > > My proposal was about the vnode lock, not the genfs lock. Since > VOP_RECLAIM does *not* destroy the vnode lock, and must leave it > either locked or unlocked, I suggest leaving it locked -- it's always > easier to think about subroutines that preserve the state of a lock > (locked to locked, or unlocked to unlocked) rather than changing it > (locked to unlocked or unlocked to locked). Looks good to me so far and please keep in mind that VOP_RECLAIM must not drop/acquire the lock as it will never get it back for a vnode in state RECLAIMING. > (As an aside, we may have some bugs in ffs concerning use of the > struct inode::i_size field -- some callers seem to think access is > serialized by the genfs lock, while others, e.g. ufs_readwrite.c, seem > to use the vnode lock. This needs to be documented more clearly...) -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Add a mountlist iterator
> On 2. Apr 2017, at 16:34, Taylor R Campbell > <campbell+netbsd-tech-k...@mumble.net> wrote: > >> Date: Sun, 2 Apr 2017 11:09:49 +0200 >> From: "J. Hannken-Illjes" <hann...@eis.cs.tu-bs.de> >> >>> On 1. Apr 2017, at 23:03, Taylor R Campbell >>> <campbell+netbsd-tech-k...@mumble.net> wrote: >>> >>> Any particular reason to use a pointer-to-opaque-pointer here instead >>> of a caller-allocated struct mountlist_iterator object? >> >> Just to make it opaque to the caller. I see no reason to make the >> mountlist internals part of the kernel API and expose them down to >> the file system implementations. > > The caller-allocated struct mountlist_iterator could be made to > contain just a single pointer as the current API uses. > > *shrug* Not a big deal either way. > >> As _mountlist_next() exists for DDB only (no locks, no allocs) I see >> no problems here. > > One might be concerned in ddb with having to chase many pointers in a > possibly corrupt kernel memory state, so there is some small advantage > to avoiding list traversal. Not from the DDB prompt but from operations like "printlockedvnodes()" or "fstrans_dump()". If the mountlist is corrupt you're lost. > >> Embedding the mountlist_entry in "struct mount" >> forces us to use a "struct mount" as a marker and this struct >> has a size of ~2.5 kBytes. > > By `embed' I didn't mean that the marker has to be a whole struct > mount. Rather, something more like: > > struct mountlist_entry { > TAILQ_ENTRY(mountlist_entry)mle_tqe; > enum { MNTENT_MARKER, MNTENT_MOUNT }mle_tag; > }; > > struct mount { > ... > struct mountlist_entry mnt_entry; > ... > }; > > TAILQ_HEAD(, mountlist_entry) mount_list; > > To recover the struct mount from a struct mountlist_entry that is > tagged as a mount, simply do > > struct mountlist_entry *mle = ...; > struct mount *mp = container_of(mle, struct mount, mnt_entry); Any reason you want "mle_tqe" and "mle_tag" public? >>> Candidate for cacheline-aligned struct? 
Why do we need a new lock >>> mount_list_lock for this instead of using the existing mountlist_lock? >> >> For the transition it is better to use a separate lock. After the >> transition "mountlist_lock" gets removed. > > OK, fair enough. Can you add a comment over the declaration stating > that plan? > >>> Beware: TAILQ_PREV (and TAILQ_LAST) violates strict aliasing rules, so >>> we should probably try to phase it out rather than add new users. >>> Maybe we should add a new kind of queue(3) that doesn't violate strict >>> aliasing but supports *_PREV and *_LAST. >>> >>> All that said, it is not clear to me why you need reverse iteration >>> here. None of your patches seem to use it. Are you planning to use >>> it for some future purpose? >> >> Historical the mountlist gets traversed in mount order. Some syscalls >> return in this order so it should be kept. >> >> Unmounting all file system on shutdown has to be done in reverse mount >> order so here (vfs_unmount_forceone / vfs_unmountall1) we need it. > > I see -- your patch just hasn't addressed those yet. Maybe Someone^TM > should just write a replacement for TAILQ so we can use it here now. File 002_ml_iterator_switch around line 453 and 487. >>> Why do we need a new refcnt dance here? Why isn't it enough to just >>> use vfs_busy like all the existing callers? >> >> Because vfs_busy() expects the caller to already hold a reference and >> one of the goals here is to drop the mount_list_lock as early as possible. > > I suppose you want to avoid setting down a lock order for > mount_list_lock and mp->mnt_unmounting. I guess this doesn't matter > much since it's limited to vfs internals. > >>> Can you make another separate commit to drop the keepref parameter to >>> vfs_unbusy, and change all the callers with keepref=true to just use >>> vfs_destroy themselves? >> >> You mean the opposite (keepref=false -> caller does vfs_destroy), yes? 
>> >> Of the 81 usages of vfs_unbusy() only 6 have keepref=true, all others >> would have to be changed to "vfs_unbusy(mp); vfs_destroy(mp);". > > Heh. This confusion on my part suggest two things: > > 1. Perhaps in this change, we ought to add a different function for > use with
Re: locking from VOP_INACTIVE to VOP_RECLAIM
> On 2. Apr 2017, at 16:05, Taylor R Campbell > <campbell+netbsd-tech-k...@mumble.net> wrote: > >> Date: Sun, 2 Apr 2017 10:47:42 +0200 >> From: "J. Hannken-Illjes" <hann...@eis.cs.tu-bs.de> >> >>> On 1. Apr 2017, at 20:41, Taylor R Campbell >>> <campbell+netbsd-tech-k...@mumble.net> wrote: >>> VOP_RECLAIM then *destroys* the lock, so we don't have to say whether >>> it is held or released on exit. This requires a one-line change to >>> vfs_vnode.c, and then mechanically moving VOP_UNLOCK from *_inactive >>> to *_reclaim throughout the file systems. >> >> Changing VOP_RECLAIM to destroy the lock forces a complete >> rewrite of the vnode lock mechanism. > > Huh? That part is not a change -- VOP_RECLAIM already destroys the > lock. E.g., ffs_reclaim calls genfs_node_destroy. > > I just added that sentence (`VOP_RECLAIM then destroys the lock...') > to emphasize that my proposed change doesn't need to address the state > of the lock after VOP_RECLAIM returns. > > Note that the only call to VOP_RECLAIM in the system immediately > follows VOP_INACTIVE, in vcache_reclaim, so that call actually > requires no changes. Only the other call to VOP_INACTIVE, in vrelel, > requires a change, to add a following VOP_UNLOCK. Looks like your proposal needs some clarification regarding "vnode lock" and "the lock". We have two vnode related locks: - the "vnode lock", located as field "vi_lock" in "struct vnode_impl" used by genfs_*lock() aka VOP_*LOCK. - the "genfs lock" located as field "g_lock" in "struct genfs_node" used by genfs_node_*lock(). Which lock do you want to be held across VOP_INACTIVE and destroyed from VOP_RECLAIM? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Add a mountlist iterator
> On 1. Apr 2017, at 23:03, Taylor R Campbell > <campbell+netbsd-tech-k...@mumble.net> wrote: > >> Date: Thu, 30 Mar 2017 11:21:41 +0200 >> From: "J. Hannken-Illjes" <hann...@eis.cs.tu-bs.de> >> >> Plan is to untangle the mountlist processing from vfs_busy() / vfs_unbusy() >> and add an iterator for the mountlist: > > Generally seems like a good idea to me! > >> - void mountlist_iterator_init(mount_iterator_t **) >> to initialize an iterator in mounted order. > > Any particular reason to use a pointer-to-opaque-pointer here instead > of a caller-allocated struct mountlist_iterator object? Just to make it opaque to the caller. I see no reason to make the mountlist internals part of the kernel API and expose them down to the file system implementations. > struct mountlist_iterator it; > > mountlist_iterator_init(&it); > while ((mp = mountlist_iterator_next(&it, ...)) != NULL) > ... > mountlist_iterator_destroy(&it); > > Then there's no need to kmem_alloc any intermediate data structures. > > Embedding the struct mountlist_entry in struct mount would obviate the > need for _mountlist_next to iterate over the entire list from the > start, too. As _mountlist_next() exists for DDB only (no locks, no allocs) I see no problems here. Embedding the mountlist_entry in "struct mount" forces us to use a "struct mount" as a marker and this struct has a size of ~2.5 kBytes. > Some in-line comments on the diff. General comment: can you make your > diffs with the `-p' option so it's easier to follow which function > each diff hunk edits? > > diff -r dab39dcbcb29 -r ff3f89481f2e sys/kern/vfs_mount.c > --- a/sys/kern/vfs_mount.c Thu Mar 30 11:17:56 2017 +0200 > +++ b/sys/kern/vfs_mount.c Thu Mar 30 11:17:56 2017 +0200 > @@ -119,7 +136,9 @@ > ... > +static TAILQ_HEAD(mountlist, mountlist_entry) mount_list; > +static kmutex_t mount_list_lock; > > Candidate for cacheline-aligned struct? 
For the transition it is better to use a separate lock. After the transition "mountlist_lock" gets removed. > @@ -1439,10 +1454,168 @@ > ... > + if (marker->me_type == ME_MARKER) > + me = TAILQ_NEXT(marker, me_list); > + else > + me = TAILQ_PREV(marker, mountlist, me_list); > > Beware: TAILQ_PREV (and TAILQ_LAST) violates strict aliasing rules, so > we should probably try to phase it out rather than add new users. > Maybe we should add a new kind of queue(3) that doesn't violate strict > aliasing but supports *_PREV and *_LAST. > > All that said, it is not clear to me why you need reverse iteration > here. None of your patches seem to use it. Are you planning to use > it for some future purpose? Historical the mountlist gets traversed in mount order. Some syscalls return in this order so it should be kept. Unmounting all file system on shutdown has to be done in reverse mount order so here (vfs_unmount_forceone / vfs_unmountall1) we need it. > + /* Take an initial reference for vfs_busy() below. */ > + mp = me->me_mount; > + KASSERT(mp != NULL); > + atomic_inc_uint(>mnt_refcnt); > + mutex_exit(_list_lock); > + > + /* Try to mark this mount busy and return on success. */ > + if (vfs_busy(mp, NULL) == 0) { > + vfs_destroy(mp); > + return mp; > + } > + vfs_destroy(mp); > + mutex_enter(_list_lock); > > Why do we need a new refcnt dance here? Why isn't it enough to just > use vfs_busy like all the existing callers? Because vfs_busy() expects the caller to already hold a reference and one of the goals here is to drop the mount_list_lock as early as possible. > if (vfs_busy(mp, NULL) == 0) { > mutex_exit(_list_lock); > return mp; > } > > Seems to me we ought to be removing code that names mnt->mnt_refcnt > specifically, rather than adding more code. > > (Then we could also move the `mutex_exit; return;' out of the for loop > to make it look a little more symmetric in mountlist_iterator_next.) 
> > diff -r ff3f89481f2e -r 28fea04e7b15 sys/coda/coda_vfsops.c > --- a/sys/coda/coda_vfsops.cThu Mar 30 11:17:56 2017 +0200 > +++ b/sys/coda/coda_vfsops.cThu Mar 30 11:17:57 2017 +0200 > @@ -624,23 +624,20 @@ > -mutex_enter(_lock); > -TAILQ_FOREACH(mp, , mnt_list) { > - if ((!strcmp(mp->mnt_op->vfs_nam
Re: locking from VOP_INACTIVE to VOP_RECLAIM
> On 1. Apr 2017, at 20:41, Taylor R Campbell > <campbell+netbsd-tech-k...@mumble.net> wrote: > > Currently: > > - For VOP_INACTIVE, vnode lock is held on entry and released on exit. > - For VOP_RECLAIM, vnode lock is not held on entry. > > I would like to change this so that: > > - For VOP_INACTIVE, vnode lock is held on entry and exit. > - For VOP_RECLAIM, vnode lock is held on entry. Changing VOP_INACTIVE to keep the vnode locked looks good to me. > VOP_RECLAIM then *destroys* the lock, so we don't have to say whether > it is held or released on exit. This requires a one-line change to > vfs_vnode.c, and then mechanically moving VOP_UNLOCK from *_inactive > to *_reclaim throughout the file systems. Changing VOP_RECLAIM to destroy the lock forces a complete rewrite of the vnode lock mechanism. Currently we have genfs_lock() that takes care to not lock a reclaimed vnode and genfs_deadlock() that locks unconditionally if called with LK_RETRY. Before VOP_RECLAIM we use genfs_lock() and after VOP_RECLAIM we change the operations vector to dead_vnodeop_p and use genfs_deadlock(). Operation vn_lock() then takes care not to return a dead vnode if it gets called with LK_RETRY. If this is to make the locked state during VOP_RECLAIM assertable it should be changed to either "locked on entry, locked on return" or "locked on entry, unlocked on return". > I did a quick survey of every VOP_INACTIVE in src/sys. With one > exception, every one either (a) does VOP_UNLOCK just before return, or > (b) defers to another VOP_INACTIVE method. The one exception is > nfs_inactive, which also queues something up on a workqueue, but I > can't imagine that it matters whether the vnode lock is held. > > Benefit: Functions that VOP_RECLAIM calls would be able to assert, > rather than vaguely assume, exclusive ownership of the vnode lock. > Currently various code paths, e.g. 
ffs_update, require exclusive > access to the vnode in question, which can't be asserted right now > because of the VOP_RECLAIM exception. > > Thoughts? Objections? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Add a mountlist iterator
Currently vfs_busy() / vfs_unbusy() get used to - Enter/leave a critical section against unmounting - Add a reference to the mount - return the next mount from the mountlist on error Plan is to untangle the mountlist processing from vfs_busy() / vfs_unbusy() and add an iterator for the mountlist: - void mountlist_iterator_init(mount_iterator_t **) to initialize an iterator in mounted order. - void mountlist_iterator_init_reverse(mount_iterator_t **) to initialize an iterator in reverse mounted order. - void mountlist_iterator_destroy(mount_iterator_t *) to destroy an iterator acquired with one of the above. - struct mount *mountlist_iterator_next(mount_iterator_t *, int) to retrieve the next mount from an iterator or NULL if done. The mount is returned vfs_busy'd, caller has to vfs_unbusy. Accepts a flag MNT_NOWAIT to skip mounts whose unmount is in progress. - struct mount *_mountlist_next(struct mount *) a lockless variant to be used from DDB only. Diffs are here: https://www.netbsd.org/~hannken/ml_iterator/ Comments or objections anyone? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Remount read-only and fstrans
> On 19 Feb 2017, at 18:08, Christos Zoulas <chris...@astron.com> wrote: > > In article <f1e96c92-0e24-40b1-b500-92cac4452...@eis.cs.tu-bs.de>, > J. Hannken-Illjes <hann...@eis.cs.tu-bs.de> wrote: >> >> Comments or objections anyone? > > Will that detect open for write file descriptors to removed files? This is the job of vflush(). Since Rev. 1.46 of vfs_mount.c (2017/01/27) it handles both open for read and open for write descriptors to unlinked files. The corresponding test is tests/fs/vfs/t_rwtoro.c -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Remount read-only and fstrans
Updating a mounted file system from read-write to read-only is racy as it is not clear when a mounted file system is read-only. Currently we set MNT_RDONLY before we call VFS_MOUNT(). If VFS_MOUNT() fails some operations may see the mounted file system read-only but this was never the case. Even if we set MNT_RDONLY in xxx_mount() after a successful vflush() there would remain a window where operations would miss MNT_RDONLY. Plan is to suspend the file system while the mounted file system gets updated. This way no operations run on the mounted file system during update and all operations see the state before or after the update. Vfs_suspend() with fstrans_start()/fstrans_done() is a mechanism to give one thread exclusive access to a mounted file system. Other threads trying to run vnode operations on this mounted file system will stall until the file system resumes. On a non-suspended file system fstrans_start()/fstrans_done() work without locks or other atomic operations. Currently fstrans_start()/fstrans_done() is mostly used inside the file system. This has drawbacks: first, it is too late to change the operations vector if the vnode gets revoked, and second, it does not scale well to support it for all file systems. Plan is to move fstrans_start()/fstrans_done() into vnode_if.c and enable vfs_suspend() on all file systems. Diffs are here: https://www.netbsd.org/~hannken/vnode_if/ Comments or objections anyone? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Remove vget from global namespace
With the global vnode cache there is no longer a need for vget() (take the first reference on a vnode) to reside in global namespace. 1) Rename vget() to vcache_vget() and vcache_tryvget() respectively and move the definitions to sys/vnode_impl.h. 2) Vget() increments the v_usecount before it checks the vnode is stable. This behaviour makes it impossible to assert v_usecount being zero or one as v_usecount also reflects other threads trying to vget(). Change vcache_vget() to increment v_holdcnt to prevent the vnode from disappearing while vcache_vget() waits for a stable state and free the vnode if this hold was the last reference to a reclaimed vnode. 3) Add some "v_usecount == 1" assertions. Diffs are here: https://www.netbsd.org/~hannken/vget/ Comments or objections anyone? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: vrele vs. syncer deadlock
> On 11 Dec 2016, at 21:01, David Holland <dholland-t...@netbsd.org> wrote: > > On a low-memory machine Nick ran into the following deadlock: > > (a) rename -> vrele on child -> inactive -> truncate -> getblk -> > no memory in buffer pool -> wait for syncer This prediction seems wrong. It is not the syncers job to free memory to buffer pool - it will just update and write buffers ending up on one of the buffer queues. To release memory to the buffer pool the pagedaemon calls buf_drain() when it detects a low memory condition. Looks like this check from uvm_pageout() fails: bufcnt = uvmexp.freetarg - uvmexp.free; if (bufcnt < 0) bufcnt = 0; > (b) syncer waiting for locked parent vnode from the rename > > This makes it in general unsafe to vrele while holding a vnode lock. > > In theory we could arrange this, but it's not desirable: the tidy > locking model for directory operations is: lock parent, call into fs, > fs handles child, unlock parent on return, so needing to unlock the > parent before releasing the child would bust up the abstraction > layer. (Not that we currently have this, but we're ostensibly working > towards it.) > > Does anyone have any suggestions for a patch to address this problem > without first rewriting the syncer? While the latter is probably a > good idea, it's not entirely trivial to do and get right, making it > e.g. a poor pullup candidate... > > -- > David A. Holland > dholl...@netbsd.org -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: vrele vs. syncer deadlock
> On 11 Dec 2016, at 22:33, Nick Hudson <sk...@netbsd.org> wrote: > > On 12/11/16 21:05, J. Hannken-Illjes wrote: >>> On 11 Dec 2016, at 21:01, David Holland <dholland-t...@netbsd.org> wrote: >>> >>> On a low-memory machine Nick ran into the following deadlock: >>> >>> (a) rename -> vrele on child -> inactive -> truncate -> getblk -> >>> no memory in buffer pool -> wait for syncer >>> (b) syncer waiting for locked parent vnode from the rename >> Where is the syncer waiting for the parent? > > db> bt/a 8ff28060 > pid 0.37 at 0x98041096 > 0x980410961bb0: kernel_text+dc (0,0,0,0) ra 803ad484 sz 0 > 0x980410961bb0: mi_switch+1c4 (0,0,0,0) ra 803a9ef8 sz 96 > 0x980410961c10: sleepq_block+b0 (0,0,0,0) ra 803b8edc sz 64 > 0x980410961c50: turnstile_block+2e4 (0,0,0,0) ra 803a487c sz > 96 > 0x980410961cb0: rw_enter+17c (0,0,0,0) ra 8044862c sz 112 > 0x980410961d20: genfs_lock+8c (0,0,0,0) ra 8043fd60 sz 48 > 0x980410961d50: VOP_LOCK+30 (8049d4c8,2,0,0) ra > 80436c8c sz 48 > 0x980410961d80: vn_lock+94 (8049d4c8,2,0,0) ra > 803367d8 sz 64 > 0x980410961dc0: ffs_sync+c8 (8049d4c8,2,0,0) ra > 80428f4c sz 112 > 0x980410961e30: sched_sync+1c4 (8049d4c8,2,0,0) ra > 80228dd0 sz 112 > 0x980410961ea0: mips64r2_lwp_trampoline+18 (8049d4c8,2,0,0) ra > 0 sz 32 > > > >> Which file system? > > ffs Looks like a bug introduced by myself. Calling ffs_sync() from the syncer (MNT_LAZY set) will write back modified inodes only, fsync is handled by individual synclist entries. Some time ago I unconditionally removed LK_NOWAIT from vn_lock(). Suppose we need this patch: RCS file: /cvsroot/src/sys/ufs/ffs/ffs_vfsops.c,v retrieving revision 1.341 diff -p -u -2 -r1.341 ffs_vfsops.c --- ffs_vfsops.c 20 Oct 2016 19:31:32 -0000 1.341 +++ ffs_vfsops.c 12 Dec 2016 09:45:17 -0000 @@ -1918,5 +1918,6 @@ ffs_sync(struct mount *mp, int waitfor, while ((vp = vfs_vnode_iterator_next(marker, ffs_sync_selector, &c))) { - error = vn_lock(vp, LK_EXCLUSIVE); + error = vn_lock(vp, LK_EXCLUSIVE | + (waitfor == MNT_LAZY ? LK_NOWAIT : 0)); if (error) { vrele(vp); Is it reproducible so you can test it? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: vrele vs. syncer deadlock
> On 11 Dec 2016, at 21:01, David Holland <dholland-t...@netbsd.org> wrote: > > On a low-memory machine Nick ran into the following deadlock: > > (a) rename -> vrele on child -> inactive -> truncate -> getblk -> > no memory in buffer pool -> wait for syncer > (b) syncer waiting for locked parent vnode from the rename Do you have full backtrace? Where is the syncer waiting for the parent? Which file system? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Change vnode free lists to lru lists
Vnodes are kept on freelists which means every vget() has to remove them from the list and the last vrele() has to put them back. Changing the lists to lrulists removes the operation from vget() and speeds up namei() on cached vnodes by ~3 percent. Diff is split into three parts for better readability and to help possible bisection: 1) Remove the redundant argument from vfs_drainvnodes() and merge vrele_flush() into vfs_drainvnodes(). 2) Rename the freelist members to lrulist members and move them to "struct vnode_impl". 3) Actually change the freelists to lrulists, merge vrele_thread() into vdrain_thread() and adapt vfs_drainvnodes(). Diffs are here: https://www.netbsd.org/~hannken/lrulist/ Comments or objections anyone? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Vnode namespace - split sys/vnode.h
It's time to split sys/vnode.h into sys/vnode.h and sys/vnode_impl.h. The latter holds implementation details about the vnode life cycle and must be included only from kern/vfs_*, miscfs/genfs/* and miscfs/specfs/*. It is sufficient to include vnode_impl.h, it already includes sys/vnode.h. 1) Rename the vcache node to vnode_impl: - Rename struct vcache_node to vnode_impl, start its fields with vi_. - Rename enum vcache_state to vnode_state, start its elements with VS_. - Rename macros VN_TO_VP and VP_TO_VN to VIMPL_TO_VNODE and VNODE_TO_VIMPL. - Add typedef struct vnode_impl vnode_impl_t. 2) The actual split: - Split sys/vnode.h into vnode.h and vnode_impl.h. - Move _VFS_VNODE_PRIVATE protected operations into vnode_impl.h. - Move struct vnode_impl definition and operations into vnode_impl.h. - Include vnode_impl.h where we include vnode.h with _VFS_VNODE_PRIVATE defined. - Get rid of _VFS_VNODE_PRIVATE. 3) Vnode printing: - Add a function to print the fields of a vnode including its implementation. - Use it from vprint() and vfs_vnode_print(). - Move vstate_name() to vfs_subr.c. The diffs are here: http://www.netbsd.org/~hannken/vrename/ Comments or objections anyone? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Locking strategy for device deletion (also see PR kern/48536)
> On 08 Jun 2016, at 13:47, Paul Goyette <p...@whooppee.com> wrote: > > On Wed, 8 Jun 2016, Paul Goyette wrote: > >> On Tue, 7 Jun 2016, Taylor R Campbell wrote: >> >>> Date: Tue, 7 Jun 2016 18:28:11 +0800 (PHT) >>> From: Paul Goyette <p...@whooppee.com> >>> >>> Can anyone suggest a reliable way to ensure that a device-driver module >>> can be _really_ safely detached? >>> General approach: >>> 1. Prevent new references (e.g., devsw_detach). >>> 2. Wait for existing references to drain (or roll back with >>> devsw_attach if you don't want to wait, and fail with EBUSY). >>> 3. Unload. >> >> Yes, of course. Thanks, I think this will work. > > Well, it almost works! Just one little problem... :) > > For some reason, the device's open() routine is being called twice, but the > close() routine is only called once. So every iteration leaves the refcount > with a net increment of one. > > I cannot figure out why/how the open routine is called twice. IIRC the device open() operation gets called on every open while the device close() operation gets called on last close only. See miscfs/specfs/spec_vnops.c::spec_close(). -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Vnode life cycle: add vnode state
> On 20 May 2016, at 18:31, Taylor R Campbell > <campbell+netbsd-tech-k...@mumble.net> wrote: > > Date: Fri, 20 May 2016 16:38:22 +0200 > From: J. Hannken-Illjes <hann...@eis.cs.tu-bs.de> > > 1) Objects vnode and vcache_node always exist together so it makes sense > to merge them into one object. This can be done in two ways: > > - Add the current vcache_node elements (hash list, key and flag) to the > vnode. > - Overlay the vnode into vcache_node. This way we also get two views > to vnodes, a public one (struct vnode) visible from file system > implementations and a private one (struct vcache_node) visible from > the VFS subsystem (sys/kern/vfs_*) only. > > I prefer the second way, a diff is here: > http://www.netbsd.org/~hannken/vstate/001_merge_vnode_and_vcache_node > > Mild preference for putting the fields in struct vnode -- making a > bigger structure around it feels sneaky. But I see the motivation to > avoid exposing more members than are an intended part of the API, as > has been done for decades to our constant confusion about what is an > intended part of the API. > > On the other hand, maybe if we do the second option, it should be not > vcache_node but vnode_private or something, and we can use it to > incrementally move more members into it over time to make a usefully > compiler-checked stable VFS API -- perhaps even eventually make struct > vnode opaque altogether. Moving at least the list entries from the public to the private view is on my agenda. For now the vcache_node is local to vfs_vnode.c. Once it has to move to (the new) sys/vnode_private.h I will do the rename to vnode_private. > 2) The state of a vnode is currently determined by its flags > VI_MARKER, VI_CHANGING, VI_XLOCK and VI_CLEAN. These flags are not > sufficient to describe the state a vnode is in. It is for example > impossible to determine if a vnode is currently attaching to a file system. > > Sounds similar to a sketch I wrote a couple years ago. 
I like turning > inscrutable combinations of flags into explicit states and explicitly > designed transitions between them. Can't take a close look right now, > though. One tiny question: > > and these operations to work on the vnode state: > > - VSTATE_GET(vp): Get the current state. > - VSTATE_CHANGE(vp, from, to): Change state. > - VSTATE_WAIT_STABLE(vp): Wait until the vnode is in stable state, one of > VN_ACTIVE or VN_RECLAIMED. > - VSTATE_ASSERT(vp, state): Assert the current state. > > Why uppercase? Are these macros? If so, why not (inline) functions? These are all macros to get better diagnostics. In the DIAGNOSTICS case they expand to functions propagating name and line number of the caller. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Vnode life cycle: add vnode state
1) Objects vnode and vcache_node always exist together so it makes sense to merge them into one object. This can be done in two ways: - Add the current vcache_node elements (hash list, key and flag) to the vnode. - Overlay the vnode into vcache_node. This way we also get two views to vnodes, a public one (struct vnode) visible from file system implementations and a private one (struct vcache_node) visible from the VFS subsystem (sys/kern/vfs_*) only. I prefer the second way, a diff is here: http://www.netbsd.org/~hannken/vstate/001_merge_vnode_and_vcache_node 2) The state of a vnode is currently determined by its flags VI_MARKER, VI_CHANGING, VI_XLOCK and VI_CLEAN. These flags are not sufficient to describe the state a vnode is in. It is for example impossible to determine if a vnode is currently attaching to a file system. I propose to replace these flags with a vnode state, one of: - VN_MARKER: Used as a marker to traverse lists of vnodes, will never change or appear in a regular vnode operation. - VN_LOADING: Vnode is currently attaching to its file system and loading or creating the inode. - VN_ACTIVE: Vnode is fully initialised and useable. - VN_BLOCKED: Vnode is active but cannot get new references through vget(). - VN_RECLAIMING: Vnode is in process to detach from file system. - VN_RECLAIMED: Vnode is (no longer) attached to its file system, it's dead. and these operations to work on the vnode state: - VSTATE_GET(vp): Get the current state. - VSTATE_CHANGE(vp, from, to): Change state. - VSTATE_WAIT_STABLE(vp): Wait until the vnode is in stable state, one of VN_ACTIVE or VN_RECLAIMED. - VSTATE_ASSERT(vp, state): Assert the current state. All operations are protected with vp->v_interlock with one exception: to change state from VN_LOADING vcache.lock is also needed. We can't take vp->v_interlock while the vnode is loading as it may change while the file system initialises the node, see uvm_obj_setlock(). 
Possible state transitions are:

- -> VN_LOADING: Node gets allocated in vcache_alloc().
- VN_LOADING -> VN_ACTIVE: Node has been initialised in vcache_get() or
  vcache_new() and is ready to use.
- VN_ACTIVE -> VN_RECLAIMING: Node starts disassociation from the
  underlying file system in vclean().
- VN_RECLAIMING -> VN_RECLAIMED: Node finished disassociation from the
  underlying file system in vclean().
- VN_ACTIVE -> VN_BLOCKED: Either vcache_rekey*() is changing the vnode
  key or vrelel() is about to call VOP_INACTIVE().
- VN_BLOCKED -> VN_ACTIVE: The blocking condition is over.
- VN_LOADING -> VN_RECLAIMED: Either vcache_get() or vcache_new() failed
  to associate the underlying file system, or vcache_rekey*() drops a
  vnode used as a placeholder.

A diff adding the state operations is here:
http://www.netbsd.org/~hannken/vstate/002_vnode_state_operations

A diff implementing the state with these operations is here:
http://www.netbsd.org/~hannken/vstate/003_use_vnode_state

Comments or objections anyone?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
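[As a reader's aid, the states and transitions listed above can be
condensed into a small C model. This is an illustrative sketch only, not
the posted diffs: the real VSTATE_CHANGE() is a macro, and the helper
below is hypothetical.]

```c
/*
 * Illustrative model of the proposed vnode states and the legal
 * transitions listed above.  Names mirror the proposal; the helper
 * itself is hypothetical and not part of the posted diffs.
 */
#include <assert.h>
#include <stdbool.h>

enum vnode_state {
    VN_MARKER,      /* list marker, never part of a regular operation */
    VN_LOADING,     /* attaching to the file system */
    VN_ACTIVE,      /* fully initialised and usable */
    VN_BLOCKED,     /* active, but vget() may not add references */
    VN_RECLAIMING,  /* detaching from the file system */
    VN_RECLAIMED    /* detached -- dead */
};

/* Return true iff "from -> to" is one of the transitions listed above. */
static bool
vstate_change_ok(enum vnode_state from, enum vnode_state to)
{
    switch (from) {
    case VN_LOADING:     /* vcache_get()/vcache_new() done or failed */
        return to == VN_ACTIVE || to == VN_RECLAIMED;
    case VN_ACTIVE:      /* vclean(), vcache_rekey*() or vrelel() */
        return to == VN_RECLAIMING || to == VN_BLOCKED;
    case VN_BLOCKED:     /* the blocking condition is over */
        return to == VN_ACTIVE;
    case VN_RECLAIMING:  /* disassociation finished in vclean() */
        return to == VN_RECLAIMED;
    default:             /* VN_MARKER and VN_RECLAIMED never change */
        return false;
    }
}
```

[Note that VN_ACTIVE and VN_RECLAIMED are the only states with no
pending transition in progress, which is why VSTATE_WAIT_STABLE() waits
for exactly those two.]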
Re: struct file at VFS level (round 1.5)
> On 03 May 2016, at 14:20, Emmanuel Dreyfus <m...@netbsd.org> wrote:
>
> On Tue, May 03, 2016 at 12:05:01PM +0200, Christoph Badura wrote:
>> Didn't I just explain to you that a struct file is never an owner for
>> these locks? Didn't others explain that to you too?
>
> Well, I need you to explain to me what happens in sys_flock() then.
> Is it a bug?
>
>     error = VOP_ADVLOCK(vp, fp, F_UNLCK, &lf, F_FLOCK);
>
> fp is a struct file. The same thing happens in open_setfp() and
> closef().

The second argument is an identifier: "struct proc *" if the flags (the
last argument) contain F_POSIX, and "struct file *" otherwise. fcntl(2)
uses F_POSIX where flock(2) does not.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
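[The F_POSIX rule above can be made concrete with a few lines of C. The
helper and the flag value below are hypothetical illustrations, not
kernel code; they only show which object serves as the lock owner in
each call form.]

```c
/*
 * Sketch of the lock-owner rule described above: fcntl(2)-style locks
 * (F_POSIX) are owned by the process, flock(2)-style locks by the open
 * file.  The helper and F_POSIX_FLAG are hypothetical, for
 * illustration only.
 */
#include <assert.h>
#include <stddef.h>

struct proc;                    /* stand-ins for the kernel structures */
struct file;

#define F_POSIX_FLAG    0x040   /* hypothetical value for the sketch */

static const void *
advlock_owner(int flags, const struct proc *p, const struct file *fp)
{
    /* F_POSIX: owner is the process; otherwise: the struct file. */
    return (flags & F_POSIX_FLAG) ? (const void *)p : (const void *)fp;
}
```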
Re: struct file reference at VFS level
> On 28 Apr 2016, at 10:19, Emmanuel Dreyfus <m...@netbsd.org> wrote:
>
>> Wouldn't it be better to put the mandatory locking at the file->vnode
>> level, operations vn_read() and vn_write()?
>
> Do you mean that instead of providing support so that filesystems can
> implement mandatory locks, we should just implement it above VFS for
> all filesystems? That would work for local filesystems, but I
> understand it cannot be done for distributed filesystems, as the local
> kernel knows nothing about locks set by other machines: you need to
> let the filesystem handle it.

Advisory locks are already implemented inside the file systems and we
have VOP_ADVLOCK() with "op == F_GETLK". What else is needed to make the
decision here?

>> Could you explain the proposed semantics in detail?
>
> We have this:
> https://www.kernel.org/doc/Documentation/filesystems/mandatory-locking.txt

I strongly object against an implementation whose specification starts:

  0. Why you should avoid mandatory locking

  The Linux implementation is prey to a number of difficult-to-fix race
  conditions which in practice make it not dependable:

  - The write system call checks for a mandatory lock only once at its
    start. It is therefore possible for a lock request to be granted
    after this check but before the data is modified. A process may
    then see file data change even while a mandatory lock was held.

  - Similarly, an exclusive lock may be granted on a file after the
    kernel has decided to proceed with a read, but before the read has
    actually completed, and the reading process may see the file data
    in a state which should not have been visible to it.

  - Similar races make the claimed mutual exclusion between lock and
    mmap similarly unreliable.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: issues to compile a kernel
On 03 Dec 2015, at 19:58, phipo <phipo.resea...@orange.fr> wrote:
>
> Hi,
> I am working on a bi quad-core Opteron (amd64, 8 cores) and want to
> customize my kernel. I've installed NetBSD 7.0rc3 and also downloaded
> syssrc.gz and installed it. When I try to compile the tools like this:
>
> # ./build.sh tools
>
> in the directory /usr/src/, I get errors. This is my output:
>
> In file included from /usr/src/sys/fs/msdosfs/msdosfs_fat.c:72:0:
> /usr/src/tools/makefs/../../usr.sbin/makefs/ffs/buf.h:78:6: note:
> expected 'int' but argument is of type 'void *'
>  int bread(struct vnode *, daddr_t, int, int, struct buf **);
> /usr/src/sys/fs/msdosfs/msdosfs_fat.c:959:8: error: too many arguments
> to function 'bread'
>     NOCRED, 0, );
> In file included from /usr/src/sys/fs/msdosfs/msdosfs_fat.c:72:0:
> /usr/src/tools/makefs/../../usr.sbin/makefs/ffs/buf.h:78:6: note:
> declared here
>  int bread(struct vnode *, daddr_t, int, int, struct buf **);

Looks like your sources got mixed from 7.0 and -current. On 7.0 the file
usr.sbin/makefs/ffs/buf.h is revision 1.9 and has:

  int bread(struct vnode *, daddr_t, int, struct kauth_cred *, int,
      struct buf **);

while on -current its revision is 1.10 and has:

  int bread(struct vnode *, daddr_t, int, int, struct buf **);

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: NFS related panics and hangs
On 05 Nov 2015, at 21:48, Rhialto <rhia...@falu.nl> wrote:
>
> Looking into this:
>
> the occurrences of nfs_reqq are as follows:
>
> fs/nfs/client/nfs_clvnops.c: * nfs_reqq_mtx : Global lock, protects
> the nfs_reqq list.
>
> Since there is no other mention of nfs_reqq_mtx in the whole syssrc
> tarball, this looks wrong. It also immediately causes the suspicion
> that the list isn't in fact protected at all.

This file (fs/nfs/client/nfs_clvnops.c) is part of a second (dead) nfs
implementation from FreeBSD. It is not part of any kernel. Our nfs lives
in sys/nfs.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Very slow transfers to/from micro SD card on a RPi B+
On 18 Aug 2015, at 12:44, Stephan <stephan...@googlemail.com> wrote:
>
> 2015-08-17 21:30 GMT+02:00 Michael van Elst <mlel...@serpens.de>:
>
>> stephan...@googlemail.com (Stephan) writes:
>>
>>> I have just rebooted with WAPBL enabled. Some quick notes:
>>> - Sequential write speed is a little lower, around 5.4 MB/s.
>>
>> WAPBL is rather slow on SD cards because SD cards are very slow when
>> writing small chunks. So even when WAPBL excels, like unpacking lots
>> of files or removing a directory tree, it is slow because the
>> sequential journal is written in small blocks.

The journal is written in chunks of MAXPHYS (64k) bytes.

> That might be all right. However, creating many files becomes worse
> the more files are being created. That is on all kinds of devices I've
> seen. This is from an amd64 server box with an aac raid controller.
>
> /root/test/files time seq 1 1|xargs touch
>         3.10s real   0.01s user   3.07s system
> /root/test/files rm *
> /root/test/files time seq 1 2|xargs touch
>         9.88s real   0.01s user   8.51s system
> /root/test/files rm *
> /root/test/files time seq 1 3|xargs touch
>        23.45s real   0.04s user  20.41s system
> /root/test/files time seq 1 4|xargs touch
>        43.35s real   0.05s user  38.32s system
>
> That is clearly not linear.

I'm quite sure this is the memcmp in ufs_lookup.c:390. For every file we
have to compare its name to all names in the directory so far, leading
to 0.5*n*(n-1) calls to memcmp. And our memcmp is slow ...

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
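[The super-linear growth blamed on the lookup scan checks out with
back-of-the-envelope arithmetic: if every create compares the new name
against all existing entries, n creates cost n*(n-1)/2 comparisons in
total, so doubling n roughly quadruples the work. A small illustrative
sketch, not the ufs_lookup() code:]

```c
/*
 * Count the total name comparisons for creating n files one by one in
 * a directory, where the i-th create scans the i already existing
 * entries.  Illustrative model of the quadratic behaviour described
 * above.
 */
#include <assert.h>

static unsigned long
total_name_compares(unsigned long n)
{
    unsigned long total = 0;

    for (unsigned long i = 0; i < n; i++)
        total += i;     /* create #i compares against i entries */
    return total;       /* equals 0.5*n*(n-1) */
}
```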
Re: Very slow transfers to/from micro SD card on a RPi B+
On 18 Aug 2015, at 15:06, Joerg Sonnenberger <jo...@britannica.bec.de>
wrote:
>
> On Tue, Aug 18, 2015 at 01:06:02PM +0200, J. Hannken-Illjes wrote:
>> And our memcmp is slow ...
>
> Do you have anything substantial to justify this?

This topic has already been discussed last year, see
http://mail-index.netbsd.org/tech-kern/2014/08/25/msg017550.html

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: VOP_PUTPAGE ignores mount_nfs -o soft,intr
On 19 Jun 2015, at 10:36, Emmanuel Dreyfus <m...@netbsd.org> wrote:
>
> Hi
>
> I have encountered a bug with the NetBSD NFS client. Despite a mount
> with -o intr,soft, we can hit a situation where a process remains hung
> in the kernel because the NFS server is gone. This happens when
> ioflush does its duty, with the following code path:
>
> sync_fsync / nfs_sync / VOP_FSYNC / nfs_fsync / nfs_flush /
> VOP_PUTPAGES
>
> VOP_PUTPAGES has flags = PGO_ALLPAGES|PGO_FREE. It then goes through
> genfs_putpages and genfs_do_putpages, and gets stuck in:
>
>     /* Wait for output to complete. */
>     if (!wasclean && !async && vp->v_numoutput != 0) {
>         while (vp->v_numoutput != 0)
>             cv_wait(&vp->v_cv, slock);
>     }

Looks like your problem is the missing nfs-aware implementation of
function nfs_node.c:nfs_gop_write(). Falling back to genfs_gop_write()
ignores nfs timeouts.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: vnd locking
On 22 May 2015, at 14:09, Patrick Welche <pr...@cam.ac.uk> wrote:
>
> There are a load of locking PRs involving vnd. I thought I would have
> a stab at converting vnd to use condvars and mutexes (mutices?) as a
> first step. I cobbled something together which allows me to
>
> vnconfig vnd0 /tmp/foo.img
> disklabel -r -w vnd0 vndtest
> newfs /dev/vnd0b
> mount /dev/vnd0b /mnt
> cd /mnt
> echo hello > foo
> cat foo
> df -h .
> umount /mnt
>
> with a LOCKDEBUG kernel, so I may have guessed right, but I haven't
> found any documentation on The Right Way, so please review carefully,
> assume the worst, and tell me how to do it better!
>
> Things to ponder:
>
> - there was some sort of start kernel thread, thread sets variable,
>   meanwhile loop waiting for variable to be set. I removed that. On
>   the other hand I added a MUSTJOIN for the kernel thread exiting side
>   of things.
> - given that vnd is a pseudo-device, do I need an extra lock to take
>   care of device creation/destruction? (What am I protecting against?
>   One thread creating a device while it gets unconfigured by another
>   thread?)
> - make sure all accesses to sc_flags etc. are protected by mutex.
> - locking is no longer interruptible (no PCATCH / cv_wait_sig), not
>   sure if this is a problem.

- keep the first blank line for functions without locals.
- do the white space change with a separate commit.

Otherwise looks like a good starting point.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Removal of miscfs/syncfs
On 01 May 2015, at 20:02, David Holland <dholland-t...@netbsd.org>
wrote:
>
> On Fri, May 01, 2015 at 10:17:05AM +0200, J. Hannken-Illjes wrote:
>
>> Our miscfs/syncfs originating from the softdep import is a pseudo
>> file system with one VOP. Its vnodes get used as a kind of marker on
>> the syncer worklist so the syncer may run lazy VFS_SYNC for all
>> mounted file systems. For this to work, it creates a vnode attached
>> to the mount point with a special (syncfs) operations vector.
>>
>> This concept breaks some rules, making it nearly impossible to
>> implement with the new vcache operations.
>>
>> I propose to completely remove miscfs/syncfs and
>>
>> - move the syncer sched_sync into kern/vfs_subr.c
>> - change the syncer to process the mountlist and VFS_SYNC as
>>   appropriate.
>> - use an API for mount points similar to the API for vnodes.
>>
>> Diff at http://www.netbsd.org/~hannken/rm-miscfs-syncfs-1.diff
>>
>> Comments or objections anyone?
>
> I think this is a good plan (having a file system to be the syncer is
> horribly weird) but I'm not entirely convinced that having marker
> mounts instead of marker vnodes is really the answer. But maybe it's
> an ok first step.

This time the goal is to remove the file system, not more.

> (also I'd prefer to have a syncer that went by blocks/buffers than by
> files, but that's a much bigger change...)
>
> I don't have time to read the diff properly now (maybe tomorrow? hope
> springs eternal) but I think it would be a good plan to put the syncer
> in its own file.
>
> (and maybe we should put it in sys/vfs? I announced plans to start
> moving vfs-level code in there some time ago, and this seems like an
> opportunity)

Such a move should probably wait until we are able to move files without
losing history.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Removal of miscfs/syncfs
On 01 May 2015, at 14:16, Christos Zoulas <chris...@astron.com> wrote:
>
> In article <f40de9d8-f529-4411-9695-2d347e4de...@eis.cs.tu-bs.de>,
> J. Hannken-Illjes <hann...@eis.cs.tu-bs.de> wrote:
>
>> Our miscfs/syncfs originating from the softdep import is a pseudo
>> file system with one VOP. Its vnodes get used as a kind of marker on
>> the syncer worklist so the syncer may run lazy VFS_SYNC for all
>> mounted file systems. For this to work, it creates a vnode attached
>> to the mount point with a special (syncfs) operations vector.
>>
>> This concept breaks some rules, making it nearly impossible to
>> implement with the new vcache operations.
>>
>> I propose to completely remove miscfs/syncfs and
>>
>> - move the syncer sched_sync into kern/vfs_subr.c
>> - change the syncer to process the mountlist and VFS_SYNC as
>>   appropriate.
>> - use an API for mount points similar to the API for vnodes.
>>
>> Diff at http://www.netbsd.org/~hannken/rm-miscfs-syncfs-1.diff
>>
>> Comments or objections anyone?
>
> Have you tested its performance?

With ten mounts I got no problems. All it adds is once per second:

    mutex_enter(&mountlist_lock);
    for (mp = TAILQ_FIRST(&mountlist); mp != NULL; mp = nmp) {
        if (NOTHING_TO_DO(mp)) {
            nmp = TAILQ_NEXT(mp, mnt_list);
            continue;
        }
    }
    mutex_exit(&mountlist_lock);

> Out of the 4 time_t tunables, two are not being used;

All tunables are used:

kern/vfs_subr.c:461:    delayx = dirdelay;
kern/vfs_subr.c:465:    delayx = metadelay;
kern/vfs_subr.c:470:    delayx = filedelay;
kern/vfs_subr.c:622:    return mp->mnt_wapbl != NULL ? metadelay : syncdelay;
kern/vfs_subr.c:847:    synced ? syncdelay : lockdelay);
miscfs/genfs/genfs_io.c:95:    vn_syncer_add_to_worklist(vp, filedelay);

> The other two are truncated to ints, why use time_t then?

Because I didn't change anything here. As these tunables are sysctl
nodes, changing their size would break backwards compatibility.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Removal of miscfs/syncfs
Our miscfs/syncfs originating from the softdep import is a pseudo file
system with one VOP. Its vnodes get used as a kind of marker on the
syncer worklist so the syncer may run lazy VFS_SYNC for all mounted file
systems. For this to work, it creates a vnode attached to the mount
point with a special (syncfs) operations vector.

This concept breaks some rules, making it nearly impossible to implement
with the new vcache operations.

I propose to completely remove miscfs/syncfs and

- move the syncer sched_sync into kern/vfs_subr.c
- change the syncer to process the mountlist and VFS_SYNC as
  appropriate.
- use an API for mount points similar to the API for vnodes.

Diff at http://www.netbsd.org/~hannken/rm-miscfs-syncfs-1.diff

Comments or objections anyone?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: fss(4) behaviour if backup fs full
On 24 Apr 2015, at 11:24, Edgar Fuß <e...@math.uni-bonn.de> wrote:
>
> How will an fss(4) snapshot behave if the file system used for the
> copy-on-write backup store gets full? Will it drop the snapshot? Will
> all write operations to the original fs fail? Will it panic?

The snapshot becomes invalid: no further copy-on-write, and reading
gives ENXIO (Device not configured). The snapshot device stays
configured.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: tracking P-V for unmanaged device pages
On 26 Mar 2015, at 13:13, Taylor R Campbell
<campbell+netbsd-tech-k...@mumble.net> wrote:
>
> Various DRM graphics drivers, including Intel, Radeon, and Nouveau,
> sometimes need to unmap all virtual mappings of certain physical pages
> for which there is no struct vm_page. The issue is explained in detail
> here:
>
> https://mail-index.netbsd.org/tech-kern/2014/07/23/msg017392.html
>
> It's not desirable to simply add struct vm_pages on a freelist that
> uvm_pagealloc ignores, because struct vm_page is large (120 bytes on
> amd64, for example), most of it is unnecessary for P-V tracking, and
> the physical regions that need P-V tracking are large (hundreds of
> megabytes, or gigabytes).
>
> The attached patch implements the following extension to pmap(9) on
> x86 and uses it in DRM[*]. The implementation uses a linear list of
> pv-tracked ranges, since it is expected to be short (one to three
> elements). The list is managed with pserialize(9) so it adds no
> locking overhead to existing pmap operations that need to look up
> entries in it.

If you plan to use pserialize here it is time to change
pserialize_perform() to not kpause() every xc_broadcast. Currently
pserialize_perform() takes at least two ticks to complete. It should be
modified as:

    i = 0;
    do {
        mutex_spin_exit(&psz_lock);
        xc = xc_broadcast(XC_HIGHPRI, (xcfunc_t)nullop, NULL, NULL);
        xc_wait(xc);
        if (i++ > 1)
            kpause("psrlz", false, 1, NULL);
        mutex_spin_enter(&psz_lock);
    } while (!kcpuset_iszero(psz->psz_target));

to become faster in the usual two-activity-switches case.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Vnode API change: add global vnode cache
On 22 May 2014, at 00:48, David Holland <dholland-t...@netbsd.org>
wrote:

[snip]

> As I discovered while prodding lfs last weekend, this is too
> optimistic, as there's another family of cases: allocating a new inode
> and constructing a vnode for it.
>
> It is *possible* to do this by allocating an inode number and then
> using vcache_get to procure the vnode, such that when the vnode cache
> calls VFS_LOADVNODE it will load a blank inode. This is what ffs does,
> and it works fine there since unallocated inodes have fixed known
> locations that are in a known on-disk state.
>
> It doesn't work for lfs, and it won't work for any other filesystem
> where inode locations aren't fixed, at least not without some ugly
> hackery: you can update the fs state with a location for the new
> inode, and write a buffer with the intended inode contents, and then
> have VFS_LOADVNODE find and use this buffer, but this is inefficient
> and gross. (This isn't itself adequate for lfs because lfs has some
> additional self-inflicted issues.)
>
> After sleeping on this a few times, ISTM that the best way to handle
> this is as follows:
>
> - add an additional call vcache_new to the API;
> - have vcache_new take a type argument;
> - have vcache_new assert that no vnode with the same key already
>   exists (FS-level locking should guarantee this);
> - have vcache_new call a new op VFS_NEWVNODE instead of VFS_LOADVNODE
>   and pass the type in.

[snip]

Here comes the new operation vcache_new() to allocate and initialise a
new vnode/fsnode pair. Passing the type is not sufficient, therefore
pass all information we get from VOP_MKNOD:

    int
    vcache_new(struct mount *mp, struct vnode *dvp, struct vattr *vap,
        kauth_cred_t cred, struct vnode **vpp)

where dvp is the (referenced) directory where we want to create the new
node, vap passes va_type, va_mode and possibly va_rdev, and cred gives
the credentials to set up uid/gid.

The node returned from vcache_new() is referenced, fully initialised and
has link count zero.
A diff showing the implementation is here: http://www.netbsd.org/~hannken/vnode-pass7-1a.diff and a diff showing its usage for ffs is here: http://www.netbsd.org/~hannken/vnode-pass7-1b.diff Comments or objections anyone? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: A brelse() in FFS...
On 13 Feb 2015, at 19:18, Maxime Villard <m...@m00nbsd.net> wrote:
>
> Hi, this may be a stupid question; in ufs/ffs/ffs_vfsops.c, l.1153:
>
>     if (fs->fs_sbsize < SBLOCKSIZE)
>         brelse(bp, BC_INVAL);
>     else
>         brelse(bp, 0);
>
> However 'fs->fs_sbsize' is *always* smaller than SBLOCKSIZE. So only
> the first branch is reached.

It may be equal to SBLOCKSIZE. The BC_INVAL is here to not keep the
buffer in the cache. It may be read later with a size greater than
fs_sbsize, which would break the buffer cache invariant (blocks are
always read with the same size).

> I don't understand what the BC_INVAL flag means, and its definition in
> sys/buf.h is not very enlightening:
>
> #define BC_INVAL    0x2000    /* Does not contain valid info. */
>
> Which branch do I keep?

Both.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
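[The branch choice follows directly from that invariant. As an
illustrative sketch only (the helper and the SBLOCKSIZE_ value are
stand-ins, not the ffs code), the brelse flags can be derived from the
size comparison:]

```c
/*
 * Sketch of the rule explained above: a superblock buffer read with a
 * size smaller than SBLOCKSIZE must not stay in the buffer cache,
 * because a later read with the full size would violate the
 * "blocks are always read with the same size" invariant.
 * Helper and constants are illustrative stand-ins.
 */
#include <assert.h>

#define BC_INVAL_   0x2000  /* value quoted from sys/buf.h above */
#define SBLOCKSIZE_ 8192    /* illustrative superblock size */

static int
sb_brelse_flags(int fs_sbsize)
{
    /* Smaller than the maximum: do not keep the buffer cached. */
    return fs_sbsize < SBLOCKSIZE_ ? BC_INVAL_ : 0;
}
```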
Re: race condition between (specfs) device open and close
On 30 Dec 2014, at 09:57, matthew green <m...@eterna.com.au> wrote:
>
> while trying to fix any midi/sequencer issues i've seen in the last
> year, i noticed that any attempt to call 'midiplay' on more than one
> midi device at the same time, while not allowed, would leave the
> sequencer device always busy. i tracked this down to an ugly race
> condition inside the specfs code. the problem is that:
>
>     t1                          t2
>     opens dev
>     sd_opencnt++
>                                 tries to open dev
>                                 sd_opencnt++
>     closes dev
>     since sd_opencnt is != 1,
>     does not call close
>     sd_opencnt--
>                                 gets EBUSY
>                                 sd_opencnt--
>
> in this case, the device close routine is never called, and in at
> least the midi case, all future opens return EBUSY due to there
> already being an active open.
>
> i've spent some time reviewing the specfs code, and i think that the
> below patch fixes these specific problems. it certainly fixes the
> problem i have with trying to open /dev/music concurrently. it may not
> be entirely correct or have problems i've not seen with it, but atf
> works fine, as does the last couple of days of random testing.
>
> the main changes are:
>
> - introduce a dying flag to the specnode structure, and use it
>   appropriately
> - add a condvar between open/close and revoke, to ensure any revoke
>   succeeds properly.
>
> i'd like to get this into netbsd 7, so any additional testing against
> that branch or against -current would be greatly appreciated.

[snip]

> @@ -397,7 +405,14 @@ spec_node_revoke(vnode_t *vp)
>  	KASSERT(sn->sn_gone == false);
>  
>  	mutex_enter(&device_lock);
> -	KASSERT(sn->sn_opencnt <= sd->sd_opencnt);
> +	sn->sn_dying = true;
> +	while (sn->sn_opencnt < sd->sd_opencnt) {
> +		/*
> +		 * Open is in progress, but not complete, must wait
> +		 * for it to complete.
> +		 */
> +		cv_wait(&specopenclose_cv, &device_lock);
> +	}
>  	if (sn->sn_opencnt != 0) {
>  		sd->sd_opencnt -= (sn->sn_opencnt - 1);
>  		sn->sn_opencnt = 1;

Will this work if we revoke a device whose cdev_open() blocks, a TTY
waiting for carrier for example?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
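[The interleaving in the problem description can be replayed as a toy
model. Everything below is illustrative (stand-in names, not the specfs
code); it only demonstrates why the device-level close is skipped in the
described race:]

```c
/*
 * Toy replay of the race described above: two opens race, the close
 * path skips the device close because the count is not 1, and the
 * device-level close routine is never issued.  All names are
 * illustrative stand-ins for the specfs logic.
 */
#include <assert.h>

struct toy_specdev {
    int sd_opencnt;   /* opens counted, including ones in progress */
    int dev_closed;   /* whether the device close routine ran */
};

static void
toy_open_start(struct toy_specdev *sd)
{
    sd->sd_opencnt++;
}

static void
toy_open_fail(struct toy_specdev *sd)  /* open returns EBUSY */
{
    sd->sd_opencnt--;
}

static void
toy_close(struct toy_specdev *sd)
{
    if (sd->sd_opencnt == 1)    /* last reference: close the device */
        sd->dev_closed = 1;
    sd->sd_opencnt--;
}

static int
toy_race_leaves_device_open(void)
{
    struct toy_specdev sd = { 0, 0 };

    toy_open_start(&sd);   /* t1 opens dev, count = 1 */
    toy_open_start(&sd);   /* t2 tries to open, count = 2 */
    toy_close(&sd);        /* t1 closes: count != 1, close skipped */
    toy_open_fail(&sd);    /* t2 gets EBUSY, count = 0 */
    /* count is back to zero, yet the device was never closed */
    return sd.dev_closed == 0 && sd.sd_opencnt == 0;
}
```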
Re: Should sys_unmount() FOLLOW?
On 23 Nov 2014, at 17:47, Emmanuel Dreyfus <m...@netbsd.org> wrote:
>
> Hi
>
> I ran into this strange bug with the glusterfs NFS server, which is
> possible because it allows the mounted filesystem root vnode to be
> VLNK (NetBSD's native mountd prevents this situation and therefore the
> bug does not happen with our native NFS server):
>
> bacasel# mount
> bacasel.net.espci.fr:/patchy/symlink1 on /mnt/nfs/0 type nfs
> bacasel# ls -l /mnt/nfs
> lrwxrwxrwx  1 root  wheel  4 Nov 23 10:03 0 -> dir1
>
> That is possible because on the exported filesystem, symlink1 is a
> symlink to dir1. It looks funny and harmless, at least until one tries
> to unmount while the NFS server is down. I do this using the most
> reliable way:
>
> umount -f -R /mnt/nfs/0
>
> umount(8) will quickly call unmount(2), passing the /mnt/nfs/0 path
> without trying anything fancy. But at the beginning of sys_unmount()
> in the kernel, we can find:
>
>     NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | TRYEMULROOT, pb);
>     if ((error = namei(&nd)) != 0) {
>         pathbuf_destroy(pb);
>         return error;
>     }
>
> The FOLLOW flag will cause the symlink to be resolved before
> sys_unmount() can proceed. Since the NFS server is down, the namei
> call will never return, and umount(8) gets stuck in the kernel with
> this backtrace:
>
> sleepq_block
> kpause
> nfs_reconnect
> nfs_request
> nfs_readlinkrpc
> nfs_readlink
> VOP_READLINK
> namei_tryemulroot
> namei
> sys_unmount
> syscall
>
> This is annoying because umount -f should really unmount. Moreover,
> stuck processes will crop up on the system because umount(8) holds a
> vnode lock forever.
>
> I see two ways of fixing this:
>
> 1) remove the FOLLOW flag in sys_unmount(). After all, unless the -R
> flag is given, umount(8) should have resolved the path before calling
> unmount(2).

Did you try? Last time I tried a forced unmount with the NFS server down
it didn't work even with the root being a directory, because the namei()
call would hang in VOP_LOOKUP(). Does it work these days?

> 2) Define a new MNT_RAW mount flag.
> If umount(8) is called with -R, pass that flag to unmount(2), and in
> sys_unmount(), do not use FOLLOW if MNT_RAW is set.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: find a file from DDB?
On 13 Nov 2014, at 17:23, Emmanuel Dreyfus <m...@netbsd.org> wrote:
>
> Hello
>
> I am tracking a locking problem in DDB. Given a vnode, how can I find
> what file is involved? This is a UFS filesystem, and as far as I can
> tell, ddb is not able to output the inode, right?

You may call VOP_PRINT(0) with 0 being the vnode address.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: [PATCH] PUFFS backend allocation (round 3)
On 28 Oct 2014, at 14:37, Emmanuel Dreyfus <m...@netbsd.org> wrote:
>
> On Sun, Oct 26, 2014 at 09:21:13AM +0100, J. Hannken-Illjes wrote:
>
>> - You should recover the error in puffs_vnop_close() too.
>
> Should I? puffs_vnop_close() does not cause VOP_STRATEGY to be called,
> it only sends a message to the filesystem. If I recover the
> VOP_STRATEGY error there I will steal it from a write or fsync call.

Doesn't puffs allow async writes? NFS allows them, and therefore
VOP_CLOSE is the last chance to report write errors back to the
application.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: [PATCH] PUFFS backend allocation (round 3)
On 28 Oct 2014, at 15:43, Emmanuel Dreyfus <m...@netbsd.org> wrote:
>
> On Tue, Oct 28, 2014 at 02:52:36PM +0100, J. Hannken-Illjes wrote:
>
>> NFS allows them and therefore VOP_CLOSE is the last chance to report
>> write errors back to the application.
>
> The situation is a bit different: write and fsync call VOP_PUTPAGES
> which calls VOP_STRATEGY and forgets about the write error.

Confused. If write and/or fsync are synchronous (VOP_PUTPAGES with flag
PGO_SYNCIO) no write error will be forgotten. If one of these
VOP_PUTPAGES is async, the write error will be delayed, and if the next
(and last) operation is VOP_CLOSE this is the last chance to report the
error back. If any intermediate operation returns a delayed error to the
application it is cleared and close will no longer see it.

> The write error should be recovered by write and fsync and then
> cleared: I do not want to get it in the next write operation: imagine
> I write at the end of a file, get EDQUOT, then within an allocated
> area in the file, I should not get EDQUOT again.

I don't understand this section.

> But what about close? If I am to be reported a write error when
> calling close, it should _not_ be cleared, as I should get it again if
> I call close again, until the time the write succeeds.

There is no close after close. Once a file is closed it may return a
delayed error but has to close anyway. This is how NFS works and what
applications expect.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
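[The "no close after close" semantics argued for here can be condensed
into a toy model. Names and the scenario helper are illustrative, not
the puffs or NFS code: an async write error is latched on the node and
handed out exactly once, at the latest by close, which closes
regardless.]

```c
/*
 * Toy model of delayed write errors as described above: an async
 * flush failure is remembered on the node; close() reports it once
 * and clears it, since the file is closed regardless of the error.
 * Names and the scenario helper are illustrative only.
 */
#include <assert.h>

struct toy_node {
    int delayed_error;  /* latched async write error, 0 if none */
};

static int
toy_async_flush(struct toy_node *n, int strategy_error)
{
    if (strategy_error != 0)
        n->delayed_error = strategy_error;  /* remember the failure */
    return 0;           /* async: the caller is not told yet */
}

static int
toy_close(struct toy_node *n)
{
    int error = n->delayed_error;

    n->delayed_error = 0;   /* reported once; no close after close */
    return error;           /* the file is closed either way */
}

static int
toy_close_reports_once(void)
{
    struct toy_node n = { 0 };

    (void)toy_async_flush(&n, 69);  /* async write fails (e.g. EDQUOT) */
    /* first close returns the delayed error, a second one returns 0 */
    return toy_close(&n) * 1000 + toy_close(&n);
}
```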
Re: [PATCH] PUFFS backend allocation (round 3)
On 26 Oct 2014, at 03:52, Emmanuel Dreyfus <m...@netbsd.org> wrote:
>
> Chuck Silvers <c...@chuq.com> wrote:
>
>> but more fundamentally, since puffs code cannot prevent changes to
>> the file in the underlying fs (ie. changes that don't go through
>> puffs), any preallocation done by puffs can be undone before it does
>> any good. the puffs code just needs to be fixed to handle such
>> ENOSPC/EDQUOT errors while flushing pages without hanging. NFS has
>> the same fundamental issue but I suspect its error handling is
>> better.
>
> I tried the same approach as NFS: save errors in VOP_STRATEGY and
> recover them in the upper layer:
> http://ftp.espci.fr/shadow/manu/puffs-strategy-error.patch
>
> Opinions? Tests show that it behaves gracefully when encountering
> EDQUOT: no more processes hung in DE state and the correct errno is
> reported.
>
> The patch also contains fallocate and fdiscard implementations, which
> are now of no use, but I think it is good to have them ready for
> later. On the fdiscard front, I wonder if I should hold pn_sizemtx. It
> is not supposed to change file size, but what happens if file size is
> changed during the operation?

- Increment PUFFSVERSION.

- You should recover the error in puffs_vnop_close() too.

- Are you sure pn_sizemtx is still needed? VOP_GETATTR() works on locked
  vnodes since 2011-10-30.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: [PATCH] PUFFS backend allocation (round 3)
On 26 Oct 2014, at 17:55, Emmanuel Dreyfus m...@netbsd.org wrote: J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote: What happens when libpuffs receives an unknown message (fallocate in this case)? The filesystem using libpuffs tells the kernel what operations should be sent to userland when it calls puffs_init(). It can be done - by setting the list of wanted operation, in which case the kernel never sends an unknown message, - by requesting all operations to be sent to userland. In that case an unknown operation is handled in src/lib/libpuffs/dispatcher.c:dispatch() default: printf(inval op %d\n, preq-preq_optype); error = EINVAL; break; Ok -- you don't need it here. Since it was last incremented at Rev. 1.72 many messages got additional fields so the version should be incremented independent of your change. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: [PATCH] PUFFS backend allocation (round 3)
On 25 Oct 2014, at 06:39, Emmanuel Dreyfus <m...@netbsd.org> wrote:
>
> Summary: when writing to a PUFFS filesystem through the page cache, we
> do not know if backend storage is really available. If it is not, the
> cache flush may get EDQUOT or ENOSPC and the process cannot terminate
> (it gets stuck in DE state).
>
> Proposed solution: detect that a write may not have backend storage,
> and if that is the case, try to allocate the backend storage.
>
> Detecting is done on two conditions:
> - if allocated blocks are fewer than the size, the file is sparse and
>   we never know if we are writing in a hole or not: in that case,
>   always write-once
> - if writing beyond EOF
>
> Allocating the backend storage is done
> - through the newly introduced PUFFS fallocate operation (unlikely to
>   work on NetBSD as the system call exists but FFS does not support
>   it)
> - otherwise by reading from the file and rewriting the data just read
>
> The latest patch doing this:
> http://ftp.espci.fr/shadow/manu/puffs-alloc2.patch
>
> Opinions?

At the very least you should increment PUFFSVERSION.

Setting of PNODE_SPARSE looks wrong. You should mark a VREG pnode as
sparse until it is sure not to have holes, which is when we get
va_bytes != VNOVAL && vp->v_size == rvap->va_bytes.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
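[The suggested sparseness test boils down to a one-line predicate. This
is a sketch of the rule in the reply above, with plain longs standing in
for the vattr/vnode fields and an illustrative VNOVAL_ value:]

```c
/*
 * Sketch of the rule above: a VREG node must be treated as (possibly)
 * sparse until va_bytes is known and matches the file size exactly.
 * Plain longs stand in for the kernel types; VNOVAL_ is an
 * illustrative stand-in for VNOVAL from sys/vnode.h.
 */
#include <assert.h>
#include <stdbool.h>

#define VNOVAL_ (-1L)   /* "no value available" */

static bool
pnode_may_be_sparse(long va_bytes, long v_size)
{
    /* Only an exact byte count equal to the size proves no holes. */
    return !(va_bytes != VNOVAL_ && va_bytes == v_size);
}
```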
Re: [PATCH] GOP_ALLOC and fallocate for puffs (round 2)
On 14 Oct 2014, at 19:05, Emmanuel Dreyfus <m...@netbsd.org> wrote:
>
> Hi
>
> Here is the latest iteration of my patch to ensure we avoid sending
> data to the page cache that we cannot flush later.
> http://ftp.espci.fr/shadow/manu/puffs-alloc.patch
>
> - When writing, we make sure backend storage is allocated in two
>   cases: 1) the file is sparse, hence we never know if backend storage
>   is there; 2) the file grows: we know we need to allocate for the new
>   area.
> - Sparse file detection is done when updating the metadata cache: if
>   the size is bigger than the allocated blocks, it is sparse.
> - I add a fallocate FUSE operation (userland support is in another
>   patch) for efficient backend storage allocation.
> - If fallocate is unavailable, I fall back to explicit zero-filling.
> - puffs_vnop_write() is split into two functions:
>   puffs_vnop_write_cache() that writes into the cache, and
>   puffs_vnop_write_fs() that writes to the FS.

First (and for the last time): GOP_ALLOC() should go to the backend,
maybe with a short circuit on non-sparse, not-after-eof files.

Zero filling looks wrong, as the region may contain valid data.

Whatever you do, you should increment the protocol version number.

If you want to make UBC work with PUFFS, fsx may help. Last time I tried
it failed badly on a rump_ffs mount. You may find it here:
http://www.netbsd.org/~hannken/fsx.tgz

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: [PATCH] GOP_ALLOC and fallocate for PUFFS
On 02 Oct 2014, at 06:45, Emmanuel Dreyfus <m...@netbsd.org> wrote:
>
> J. Hannken-Illjes <hann...@eis.cs.tu-bs.de> wrote:
>
>> genfs_gop_write calls genfs_do_io which does "error = biowait(mbp);"
>> near the end. This will catch errors from VOP_STRATEGY.
>
> I ran the code below but VOP_PUTPAGES never returns anything other
> than 0.

Sure -- why should it? GOP_ALLOC() gets called from VOP_GETPAGES() for
missing pages. Here you run VOP_PUTPAGES() on a range known to be
unmapped so it becomes a NOP.

GOP_ALLOC() aka puffs_gop_alloc() has to run on the client to make sure
the pages in question are allocated and may be faulted in. The client
has to fill holes or extend files.

> int
> puffs_gop_alloc(struct vnode *vp, off_t off, off_t len, int flags,
>     kauth_cred_t cred)
> {
> 	struct puffs_mount *pmp = MPTOPUFFSMP(vp->v_mount);
> 	off_t start, end;
> 	int pgo_flags = PGO_CLEANIT|PGO_SYNCIO|PGO_PASTEOF;
> 	int u_flags = PUFFS_UPDATESIZE|PUFFS_UPDATEMTIME|PUFFS_UPDATECTIME;
> 	int error;
>
> 	if (EXISTSOP(pmp, FALLOCATE)) {
> 		error = _puffs_vnop_fallocate(vp, off, len);
> 		goto out;
> 	}
>
> 	start = trunc_page(off);
> 	end = round_page(off + len);
>
> 	if (off + len > vp->v_size)
> 		uvm_vnp_setwritesize(vp, off + len);
>
> 	mutex_enter(vp->v_interlock);
> 	error = VOP_PUTPAGES(vp, start, end, pgo_flags);
>
> 	if (off + len > vp->v_size) {
> 		if (error == 0) {
> 			uvm_vnp_setsize(vp, off + len);
> 			puffs_updatenode(VPTOPP(vp), u_flags, vp->v_size);
> 		} else {
> 			uvm_vnp_setwritesize(vp, vp->v_size);
> 		}
> 	}
> out:
> 	return error;
> }

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: [PATCH] GOP_ALLOC and fallocate for PUFFS
On 02 Oct 2014, at 17:23, Emmanuel Dreyfus m...@netbsd.org wrote:

On Thu, Oct 02, 2014 at 12:02:39PM +0200, J. Hannken-Illjes wrote: GOP_ALLOC() gets called from VOP_GETPAGES() for missing pages. Here you run VOP_PUTPAGES() on a range known to be unmapped so it becomes a NOP. GOP_ALLOC() aka puffs_gop_alloc() has to run on the client to make sure the pages in question are allocated and may be faulted in.

You mean I have to run VOP_GETPAGES at the beginning of a cached write operation?

Please describe "cached write operation" in terms of vnode operations. Which vnode operation finally calls GOP_ALLOC()?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: [PATCH] GOP_ALLOC and fallocate for PUFFS
On 02 Oct 2014, at 19:09, Emmanuel Dreyfus m...@netbsd.org wrote:

J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote: Please describe "cached write operation" in terms of vnode operations.

A write on a mount that uses the page cache, without direct I/O.

Which vnode operation finally calls GOP_ALLOC()?

From genfs, only VOP_GETPAGES, but I understand we should call it on our own. For instance, UFS calls it through ufs_balloc_range() in VOP_WRITE.

Ok -- if you want to use it, you have to implement it on the client side, as puffs has no idea how to allocate blocks -- right?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: [PATCH] GOP_ALLOC and fallocate for PUFFS
On 30 Sep 2014, at 18:24, Emmanuel Dreyfus m...@netbsd.org wrote:

On Tue, Sep 30, 2014 at 03:30:05PM +0000, Antti Kantee wrote: Is it really better to sync fallocate, put stuff in the page cache and flush the page cache some day instead of just having a write-through (or write-first) page cache on the write() path? You also get rid of the fallocate-not-implemented problem that way.

GOP_ALLOC calls puffs_gop_alloc for chunks bigger than pages (I observed 1 MB for now). If we have fallocate implemented in the filesystem, this is really efficient, since fallocate saves us from sending any data to write. Hence IMO fallocate should be the preferred way if available. But if it is not there, indeed, doing a write on the first attempt should do the trick.

Writing zeroes might be a bad emulation for distributed file systems, though I guess you're the expert in that field and can evaluate the risks better than me.

I understand that areas fallocate'd should return zeroes, so it should be fine. The real problem is performance. I am not sure which approach is best. I first thought about a puffs_gop_alloc like below, but that will not work, as VOP_PUTPAGES goes to genfs_putpages, which calls GOP_WRITE (genfs_gop_write), which calls VOP_STRATEGY without checking for failure.

genfs_gop_write calls genfs_do_io which does "error = biowait(mbp);" near the end. This will catch errors from VOP_STRATEGY. But why do you need GOP_ALLOC? Is there a consumer beside genfs_getpages filling holes? Puffs doesn't return holes, as its VOP_BMAP always returns valid (!= -1) block addrs.

Should I directly call VOP_STRATEGY?

int
puffs_gop_alloc(struct vnode *vp, off_t off, off_t len, int flags,
    kauth_cred_t cred)
{
	int error;

	if (EXISTSOP(pmp, FALLOCATE))
		return _puffs_vnop_fallocate(vp, off, len);
	else
		return VOP_PUTPAGES(vp, off, off + len,
		    PGO_CLEANIT | PGO_SYNCIO);
}

-- Emmanuel Dreyfus m...@netbsd.org

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Testing 7.0 Beta: FFS still very slow when creating files
On 24 Aug 2014, at 18:57, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

<snip>

I tried to bisect and got an increase in time from ~15 secs to ~24 secs between the time stamps '2012-09-18 06:00 UTC' and '2012-09-18 09:00 UTC'. Someone should redo this test, as this interval is the import of the compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools. I can't believe this to be a compiler problem.

GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16. Could this be the cause of this big loss in performance?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Testing 7.0 Beta: FFS still very slow when creating files
On 25 Aug 2014, at 17:39, Taylor R Campbell riastr...@netbsd.org wrote:

Date: Mon, 25 Aug 2014 15:55:53 +0200
From: J. Hannken-Illjes hann...@eis.cs.tu-bs.de

GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16. Could this be the cause of this big loss in performance?

Shouldn't be too hard to test this. Perhaps try dropping in the following replacements for the vcache key comparison and running the test for each one? (attachment: memequal.c)

We are talking about a kernel from 2012/09 -- vcache came much later.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Testing 7.0 Beta: FFS still very slow when creating files
On 25 Aug 2014, at 15:55, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

On 24 Aug 2014, at 18:57, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

<snip>

I tried to bisect and got an increase in time from ~15 secs to ~24 secs between the time stamps '2012-09-18 06:00 UTC' and '2012-09-18 09:00 UTC'. Someone should redo this test, as this interval is the import of the compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools. I can't believe this to be a compiler problem.

GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16. Could this be the cause of this big loss in performance?

Short answer: it is -- reverting external/gpl3/gcc/dist/gcc/builtins.c from Rev. 1.3 to 1.2 brings back the old times, which are the same as they were on NetBSD 6. Given that this test makes many calls to ufs_lookup/cache_lookup using memcmp to check for equal filenames, this is not a surprise.

A rather naive implementation of memcmp (see below) drops the running time from ~15 secs to ~9 secs. We should consider improving our memcmp.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Index: libkern.h
===================================================================
RCS file: /cvsroot/src/sys/lib/libkern/libkern.h,v
retrieving revision 1.106
diff -p -u -2 -r1.106 libkern.h
--- libkern.h	30 Aug 2012 12:16:49 -0000	1.106
+++ libkern.h	25 Aug 2014 17:23:35 -0000
@@ -262,5 +262,18 @@ void *memset(void *, int, size_t);
 #if __GNUC_PREREQ__(2, 95) && !defined(_STANDALONE)
 #define	memcpy(d, s, l)		__builtin_memcpy(d, s, l)
-#define	memcmp(a, b, l)		__builtin_memcmp(a, b, l)
+static inline int __memcmp(const void *a, const void *b, size_t l)
+{
+	const unsigned char *pa = a, *pb = b;
+
+	if (l > 8)
+		return memcmp(a, b, l);
+	while (l-- > 0) {
+		if (__predict_false(*pa != *pb))
+			return *pa < *pb ? -1 : 1;
+		pa++; pb++;
+	}
+	return 0;
+}
+#define	memcmp(a, b, l)	__memcmp(a, b, l)
 #endif
 #if __GNUC_PREREQ__(2, 95) && !defined(_STANDALONE)
Re: Testing 7.0 Beta: FFS still very slow when creating files
On 22 Aug 2014, at 18:29, Taylor R Campbell riastr...@netbsd.org wrote:

Date: Fri, 22 Aug 2014 17:59:37 +0200
From: Stephan stephan...@googlemail.com

Has anybody an idea on this or how to track this down? At the moment, I can't even enter ddb using the Ctrl+Alt+Esc keys for some reason. I've also seen people playing with dtrace but that doesn't seem to be included.

Dtrace may be a good idea. You can use it by (a) using a kernel built with `options KDTRACE_HOOKS', (b) using a userland built with MKDTRACE=yes, (c)

modload /stand/ARCH/VERSION/solaris.kmod
modload /stand/ARCH/VERSION/dtrace.kmod
modload /stand/ARCH/VERSION/fbt.kmod
modload /stand/ARCH/VERSION/sdt.kmod

(d)

mkdir /dev/dtrace
mknod /dev/dtrace/dtrace c dtrace

(Yes, this is too much work. Someone^TM should turn it all on by default for netbsd-7...!)

From the lockstat output it looks like there's a lot of use of mntvnode_lock, which suggests this may be related to hannken@'s vnode cache changes. Might be worthwhile to sample stack traces of vfs_insmntque, with something like

dtrace -n 'fbt::vfs_insmntque:entry { @[stack()]++ }'

or perhaps sample stack traces of the mutex_enters of mntvnode_lock.

This was my first guess too ... I tried to bisect and got an increase in time from ~15 secs to ~24 secs between the time stamps '2012-09-18 06:00 UTC' and '2012-09-18 09:00 UTC'. Someone should redo this test, as this interval is the import of the compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools. I can't believe this to be a compiler problem.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Btw.: my test script is:

#! /bin/sh
mdconfig /dev/md0d 2048000 &
P=${!}
newfs /dev/rmd0a
mount -t ffs -o log /dev/md0a /mnt
(cd /mnt && time sh -c 'seq 1 3|xargs touch')
umount /mnt
kill ${P}
Re: Add operation vcache_rekey
On 26 Jun 2014, at 05:33, David Holland dholland-t...@netbsd.org wrote:

On Wed, Jun 25, 2014 at 09:46:16AM +0000, Taylor R Campbell wrote: Also, I wonder whether any file systems will ever change the length of the key for a vnode. Probably not.

As I recall from when I was looking at this a couple weeks ago (feels like a minor eternity ago, but it can't be much longer than that) ... yes, they might.

David, they will. nfs/nfs_vnops.c::nfs_lookitup() will change the file handle in the *npp != NULL case. Here both the key and its length may change. I overlooked this one as it didn't update the rbtree. To make it worse, the memory holding the file handle may get reallocated, so it is impossible to do the rekey with one call.

The attached diff adds

int
vcache_rekey_enter(struct mount *mp, struct vnode *vp,
    const void *old_key, size_t old_key_len,
    const void *new_key, size_t new_key_len)

to prepare the key change. It will allocate a cache node for the new key and lock both the old and the new cache node. It is an error if the new node already exists.

void
vcache_rekey_exit(struct mount *mp, struct vnode *vp,
    const void *old_key, size_t old_key_len,
    const void *new_key, size_t new_key_len)

to finish the key change. It will remove the old cache node and set up and unlock the new cache node.

Comments, suggestions or objections?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

(attachment: vcache.diff)