Re: UVMHIST logs

2024-07-15 Thread J. Hannken-Illjes
> On 15. Jul 2024, at 09:31, Valery Ushakov  wrote:
> 
> I've just got a long message in my dmesg:
> 
> ...e24 uoffset=0 align=0 flags=0x101 entry=0x8f7f2df8 (uvm_map_findspace line 
> 2011)map=0x8ee29e80 hint=0x7def orig_hint=0x76fc3000 length=0x1000 
> uobj=0x8fdd0e24 uoffset=0 align=0 flags=0x101 entry=0x8f7f2df8 
> (uvm_map_findspace line 2214)map=0x8ee29e80 hint=0x7deef000 
> orig_hint=0x76fc3000 length=0x1000 uobj=0x8fdd0a9c...
> 
> where, obviously, newlines are missing from the UVMHIST output.  I
> don't want to go off on a tangent to a tangent to a tangent to
> investigate this, so I'd appreciate it if someone familiar with UVMHIST
> could kindly take a look.  TIA.

No UVMHIST involved.

Seems to originate from a printf() without "\n" at sys/uvm/uvm_map.c
function uvm_findspace_invariants() at line 1795.
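
A sketch of the effect only (hypothetical variables, not the actual
uvm_map.c code): a diagnostic printf() without a trailing "\n" makes
consecutive reports run together on the console, which is exactly what
the dmesg excerpt above shows.

	printf("map=%p hint=%#jx orig_hint=%#jx length=%#jx "
	    "uobj=%p uoffset=%#jx align=%#jx flags=%#x entry=%p "
	    "(uvm_map_findspace line %d)",	/* note: no "\n" */
	    map, (uintmax_t)hint, (uintmax_t)orig_hint, (uintmax_t)length,
	    uobj, (uintmax_t)uoffset, (uintmax_t)align, flags, entry, line);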

--
J. Hannken-Illjes - hann...@mailbox.org


Re: Problem with SIMPLEQ_INSERT_TAIL

2024-01-22 Thread J. Hannken-Illjes
> On 22. Jan 2024, at 11:52, Stephan  wrote:
> 
> Hello,
> 
> I am working on the semaphore part of my Haiku compat layer. I took
> uipc_sem.c as a reference, where control structures are organized in a
> dynamically allocated array and not as part of a list.
> 
> However, I like to maintain a "free list" in the form of a SIMPLEQ, on
> which unused semaphores are maintained for fast allocation.
> 
> There is an issue in the initialization function, where each structure
> is inserted into the free queue. I spent some time on this, but have
> been unable to find the cause. Maybe I am overlooking something
> trivial or there is something special I just don´t know.
> 
> The compiler says:
> 
> In file included from /home/stephan/src/sys/sys/siginfo.h:38,
> from /home/stephan/src/sys/sys/signal.h:112,
> from ./machine/frame.h:75,
> from ./x86/cpu.h:56,
> from ./machine/cpu.h:42,
> from ./machine/param.h:11,
> from /home/stephan/src/sys/sys/param.h:142,
> from /home/stephan/src/sys/kern/uipc_hsem.c:32:
> /home/stephan/src/sys/kern/uipc_hsem.c: In function 'khsem_init':
> /home/stephan/src/sys/sys/queue.h:347:19: error: assignment to 'struct
> khsem **' from incompatible pointer type 'struct kshem **'
> [-Werror=incompatible-pointer-types]
>  347 |  (head)->sqh_last = &(elm)->field.sqe_next;   \
>  |   ^
> /home/stephan/src/sys/kern/uipc_hsem.c:68:9: note: in expansion of
> macro 'SIMPLEQ_INSERT_TAIL'
>   68 | SIMPLEQ_INSERT_TAIL(&khsem_freeq, &hsems[i], khs_entry);
>  | ^~~
> 
> 
> 
> The relevant code snippet is this:
> 
> const int khsem_max = 8192;
> 
> static kmutex_t khsem_mutex __cacheline_aligned;
> static struct khsem *hsems __read_mostly;
> static SIMPLEQ_HEAD(, khsem) khsem_freeq;
> 
> 
> int
> khsem_init(void)
> {
> 	int i, sz;
> 
> 	SIMPLEQ_INIT(&khsem_freeq);
> 	mutex_init(&khsem_mutex, MUTEX_DEFAULT, IPL_NONE);
> 
> 	sz = ALIGN(khsem_max * sizeof(struct khsem));
> 	sz = round_page(sz);
> 
> 	// XXX allocate memory
> 
> 	for (i = 0; i < khsem_max; i++) {
> 		hsems[i].khs_id = i;
> 		mutex_init(&hsems[i].khs_interlock, MUTEX_DEFAULT, IPL_NONE);
> 		cv_init(&hsems[i].khs_cv, "acquire_sem");
> 
> 		SIMPLEQ_INSERT_TAIL(&khsem_freeq, &hsems[i], khs_entry); // << DOES NOT COMPILE
> 	}
> 
> }
> 
> 
> The control structure looks like this:
> 
> struct khsem {
>  sem_id  khs_id; /* id of this semaphore */
>  SIMPLEQ_ENTRY(kshem) khs_entry;  /* free list entry */

Use SIMPLEQ_ENTRY(khsem) here ( s/sh/hs/ )?
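
A corrected sketch of that one field, so the entry's element type
matches the struct khsem objects actually inserted into khsem_freeq:

	SIMPLEQ_ENTRY(khsem)	khs_entry;	/* free list entry */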

>  kmutex_t khs_interlock;  /* lock on this semaphore */
>  kcondvar_t  khs_cv; /* CV for wait events */
>  pid_t   khs_owner;  /* owning process */
>  char*khs_name;  /* name of this semaphore */
>  size_t  khs_namelen;/* length of name */
>  int khs_state;  /* state of this port */
>  int khs_waiters;
>  int khs_count;  /* current count */
>  lwpid_t khs_latest_holder;  /* latest holder LWP id */
>  uid_t   khs_uid;/* creator uid */
>  gid_t   khs_gid;/* creator gid */
> };
> 
> 
> 
> Any help is appreciated ;)
> 
> Thanks,
> 
> Stephan

--
J. Hannken-Illjes - hann...@mailbox.org




Re: veriexec(4) maintenance

2023-12-31 Thread J. Hannken-Illjes
On Tue, Dec 19, 2023 at 07:50:25AM +1030, Brett Lymn wrote:
> On Sat, Sep 02, 2023 at 08:57:03AM +0930, Brett Lymn wrote:
> > On Thu, Aug 31, 2023 at 10:24:07AM +0200, J. Hannken-Illjes wrote:
> > > 
> > > I'm short on time, some remarks:
> > > 
> > 
> > Thanks, I will have a closer look at these issues.  This could be where the 
> > deadlock I was
> > seeing is coming from.
> > 
> 
> OK, I have, finally, updated the diff in line with the comments (thanks
> again for the good feedback).

Brett,

First, it is difficult to read a diff that also introduces a "notyet" operation
on file pages.

It looks like you replace the global and per-item rw-locks with refcounts/signals.
I tried to understand the rationale behind this and failed.  Your implementation
seems at least vulnerable to livelocks as the writer side lacks priority.

In general, what problem(s) are you trying to solve here?

What is wrong with a quite simple hash table holding the
fileid <-> fingerprint mapping, accessed as:

	mutex_enter table lock
	lookup fileid -> item
	mutex_enter item
	mutex_exit table lock

	while item state is busy
		cv_wait item cv

	if item state is unknown
		set item state busy
		mutex_exit item
		compute fingerprint
		mutex_enter item
		set item state valid/invalid
		cv_broadcast item cv

	here the item state is either valid or invalid

	mutex_exit item
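
A rough C sketch of this protocol with mutex(9)/condvar(9) -- hypothetical
types and helper names, an illustration rather than the veriexec code:

	#include <sys/types.h>
	#include <sys/mutex.h>
	#include <sys/condvar.h>
	#include <sys/errno.h>

	enum fp_state { FP_UNKNOWN, FP_BUSY, FP_VALID, FP_INVALID };

	struct fp_item {			/* hypothetical item */
		kmutex_t	fi_lock;
		kcondvar_t	fi_cv;
		enum fp_state	fi_state;
	};

	static kmutex_t fp_table_lock;		/* protects the hash table */

	struct fp_item *fp_table_lookup(uint64_t);	/* hypothetical helpers */
	int fp_compute(uint64_t);

	static int
	fp_validate(uint64_t fileid)
	{
		struct fp_item *fi;
		int error;

		mutex_enter(&fp_table_lock);
		fi = fp_table_lookup(fileid);
		mutex_enter(&fi->fi_lock);
		mutex_exit(&fp_table_lock);

		/* wait while someone else computes the fingerprint */
		while (fi->fi_state == FP_BUSY)
			cv_wait(&fi->fi_cv, &fi->fi_lock);

		if (fi->fi_state == FP_UNKNOWN) {
			fi->fi_state = FP_BUSY;
			mutex_exit(&fi->fi_lock);
			error = fp_compute(fileid);	/* expensive part, unlocked */
			mutex_enter(&fi->fi_lock);
			fi->fi_state = (error == 0) ? FP_VALID : FP_INVALID;
			cv_broadcast(&fi->fi_cv);
		}

		/* here the state is either FP_VALID or FP_INVALID */
		error = (fi->fi_state == FP_VALID) ? 0 : EPERM;
		mutex_exit(&fi->fi_lock);
		return error;
	}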

Some more problems (line numbers with your diff applied):

   476  veriexec_fp_calc(struct lwp *l, struct vnode *vp, int lock_state,
   477  struct veriexec_file_entry *vfe, u_char *fp)
   478  {

   524  error = vn_rdwr(UIO_READ, vp, buf, len, offset,
   525  UIO_SYSSPACE, lock_state,

lock_state is sometimes VERIEXEC_UNLOCKED, sometimes IO_NODELOCKED.

  1629  veriexec_dump(struct lwp *l, prop_array_t rarray)
  1630  {
  1631  struct mount *mp;
  1632  mount_iterator_t *mpiter;
  1633  
   1634  mountlist_iterator_init(&mpiter);
  1635  
  1636  while ((mp = mountlist_iterator_trynext(mpiter)) != NULL) {
  1637  /* If it fails, the file-system is [being] unmounted. */
  1638  if (vfs_busy(mp) != 0)
  1639  continue;

This is completely wrong.  Operation mountlist_iterator_trynext() already
returns the mount busy and skips mounts currently unmounting or remounting.
Operation mountlist_iterator_next() was right or does this function get
called from inside unmount?

  1653  veriexec_flush(struct lwp *l)
  1654  {
  1655  struct mount *mp;
  1656  mount_iterator_t *mpiter;
  1657  int error = 0;
  1658  
   1659  mountlist_iterator_init(&mpiter);
  1660  
  1661  while ((mp = mountlist_iterator_trynext(mpiter)) != NULL) {
  1662  int lerror;
  1663  
  1664  /* If it fails, the file-system is [being] unmounted. */
  1665  if (vfs_busy(mp) != 0)
  1666  continue;

This is completely wrong, see above.

-- 
J. Hannken-Illjes


Re: how do I preset ddb's LINES to zero

2023-12-15 Thread J. Hannken-Illjes
> On 15. Dec 2023, at 19:44, Andrew Cagney  wrote:
> 
> On Fri, 15 Dec 2023 at 11:22, J. Hannken-Illjes  wrote:
> 
>>> Is there a way to stop this without having to rebuild the kernel.
>> 
>> sysctl -w ddb.lines=0
> 
> thanks, I'll add that (it won't help with my immediate problem of a
> panic during boot though)

From DDB command prompt "set $lines = 0" ...

--
J. Hannken-Illjes - hann...@mailbox.org

Re: how do I preset ddb's LINES to zero

2023-12-15 Thread J. Hannken-Illjes
> On 15. Dec 2023, at 17:19, Andrew Cagney  wrote:
> 
> Hi,
> 
> I've the stock 10.0 boot.iso booting within a KVM based test
> framework.  I'd like to set things up so that should there be a panic,
> it dumps registers et.al., without stopping half way waiting for
> someone to hit the space bar vis:
> 
> r9  af80b451d080
> r10 81d9a063
> r11 0
> r12 0
> --db_more--
> 
> Is there a way to stop this without having to rebuild the kernel.

sysctl -w ddb.lines=0

--
J. Hannken-Illjes - hann...@mailbox.org


Re: veriexec(4) maintenance

2023-08-31 Thread J. Hannken-Illjes
> On 30. Aug 2023, at 00:16, Brett Lymn  wrote:
> 
> On Tue, Aug 15, 2023 at 07:52:46AM +0930, Brett Lymn wrote:
>> 
>> I tried sending this with the diff but I think it was a bit big,
>> here is the text, the diff is at ftp.netbsd.org under
>> /pub/NetBSD/misc/blymn/veriexec_update
>> 
> 
> OK, it has been a couple of weeks since I posted this and I have
> had zero responses.  I have tested the changes and they seem ok
> on my daily usage pattern.
> 
> Does anyone have _any_ comments or should I just commit the
> changes?
> 

I'm short on time, some remarks:

- lockstate as an ioflag to vn_rdwr() can't be right.

- waiting for a condition is usually

	mutex_enter()
	while (!cond)
		cv_wait()
	...
	mutex_exit()

  doing it as

	if (!cond)
		cv_wait()

  looks wrong.

- if the reference counting is used to free an unreferenced
  object it is obviously racy
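
To illustrate the last point, a generic sketch with hypothetical names
(not taken from the diff): freeing an object when a bare count reaches
zero races with a lookup handing out a new reference; removing the
object from the lookup structure first, under the lock the lookups
take, closes that window.

	/* racy: a concurrent lookup may still hand out obj between
	 * the decrement and the free */
	if (atomic_dec_uint_nv(&obj->ref) == 0)
		obj_free(obj);

	/* safer */
	mutex_enter(&table_lock);
	if (--obj->ref == 0) {
		table_remove(obj);		/* hypothetical helper */
		mutex_exit(&table_lock);
		obj_free(obj);
	} else
		mutex_exit(&table_lock);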

--
J. Hannken-Illjes - hann...@mailbox.org




Re: Maxphys on -current?

2023-08-04 Thread J. Hannken-Illjes
> On 4. Aug 2023, at 21:48, Jaromír Doleček  wrote:

> In a broader view, I have doubts if there is any practical reason to
> even have support for bigger than 64kb block size support at all.

Having tapes with record size > 64k would be a real benefit though.

--
J. Hannken-Illjes - hann...@mailbox.org





Re: fstrans_start(vp->v_mount) and vgone

2023-04-12 Thread J. Hannken-Illjes
> On 11. Apr 2023, at 12:13, Taylor R Campbell 
>  wrote:
> 
> Last year hannken@ changed fstrans(9) to use a hash table mapping
> struct mount pointers to fstrans info, in order to fix a crash found
> by syzkaller arising from a race between fstrans_start(vp->v_mount)
> and vgone(vp):
> 
> https://syzkaller.appspot.com/bug?extid=54dc9ac0804a97b59bc6
> https://mail-index.netbsd.org/source-changes/2022/07/08/msg139622.html
> 
> However, while this may avoid a _crash_ from use-after-free of
> vp->v_mount->mnt_..., I'm not sure this is enough to prevent a _race_.
> 
> Suppose we have the following sequence of events:
> 
> lwp0lwp1
> 
> mp = vp->v_mount;
>vgone(vp);
>-> vp->v_mount = deadfs
>unmount(mp)
>-> free(mp)
>mount a new file system
>-> malloc() returns the same mp
> fstrans_start(mp);
> 
> Now lwp0 has started a transaction on some totally unrelated mount
> point -- and may take confused actions under the misapprehension that
> vp is a vnode that unrelated mount point.

This sequence is obviously racy; we (should) always use fstrans_start* like:

	for (;;) {
		mp = vp->v_mount;
		fstrans_start(mp);
		if (mp == vp->v_mount)
			break;
		fstrans_done(mp);
	}

or

	mp = vp->v_mount;
	fstrans_start(mp);
	if (mp != vp->v_mount) {
		fstrans_done(mp);
		return ENOENT;
	}

thus avoiding this race and always holding fstrans on the right mount.

Some of the current uses need review though ...

--
J. Hannken-Illjes - hann...@mailbox.org




Re: Fix for PR kern/56713

2022-07-15 Thread J. Hannken-Illjes
> On 13. Jul 2022, at 20:33, Taylor R Campbell 
>  wrote:
> 
> Generally looks reasonable to me, assuming it passes the atf tests,
> but give hannken@ another few days to review it?
> 
> Maybe instead of adding vnode_impl.h to vfs_vnops.c, we can put the
> vn_knote_* in vfs_vnode.c or create a new file vfs_knote.c, to avoid
> having tentacles of vnode guts creep back into places we successfully
> cut them out of before?

Looks good to me, please go ahead.

For me it is generally ok to include vnode_impl.h from all of
kern/vfs_* miscfs/genfs/* and miscfs/specfs/*.

--
J. Hannken-Illjes - hann...@mailbox.org




Re: panic in -current: "vp->v_iflag & VI_TEXT" failed: file "/usr/src/sys/kern/exec_subr.c", line 183

2022-03-09 Thread J. Hannken-Illjes
> On 9. Mar 2022, at 02:22, matthew green  wrote:
> 
> matthew green writes:
>> "J. Hannken-Illjes" writes:
>>> I'm now able to reproduce it here -- takes about six hours to trigger.
>>> 
>>> I suppose vrelel() lost a check for new references with my last changes,
>>> currently testing the diff attached.
>> 
>> well, this ran until i ran out of file building the native gcc,
>> it had been going for over an hour and was close to finishing.
>> 
>> only one crash and not non crash, but this seems to help so far.
>> 
>> i've cleaned the objdir and restarted, and i'll try again a couple
>> more times when it's done.
> 
> this partial build completed, and 2 more since.
> 
> i say this fixes the problem.  thanks!

I built eleven releases without problems,
fix committed as vfs_vnode.c rev. 1.135

--
J. Hannken-Illjes - hann...@mailbox.org




Re: panic in -current: "vp->v_iflag & VI_TEXT" failed: file "/usr/src/sys/kern/exec_subr.c", line 183

2022-03-08 Thread J. Hannken-Illjes
> On 8. Mar 2022, at 05:14, matthew green  wrote:
> 
> that's this:
> 
> 175 vmcmd_map_pagedvn(struct lwp *l, struct exec_vmcmd *cmd)
> 176 {
> [ ... ]
> 181 vm_prot_t prot, maxprot;
> 182
> 183 KASSERT(vp->v_iflag & VI_TEXT);
> 
> christos said this happened to him on a 8c/16t 64GB machine, using
> build.sh -j40, and i was able reproduce it on a 6c/12th 64GB machine.
> my build only got as far as installing includes (so literally right
> after tools were built.)  the longer panic message is:
> 
> panic: kernel diagnostic assertion "vp->v_iflag & VI_TEXT" failed: file 
> "/usr/src/sys/kern/exec_subr.c", line 183
> cpu9: Begin traceback...
> vpanic() at netbsd:vpanic+0x156
> kern_assert() at netbsd:kern_assert+0x4b
> vmcmd_map_pagedvn() at netbsd:vmcmd_map_pagedvn+0x137
> execve_runproc() at netbsd:execve_runproc+0x394
> execve1() at netbsd:execve1+0x4f
> sys_execve() at netbsd:sys_execve+0x2a
> syscall() at netbsd:syscall+0x196
> --- syscall (number 59) ---
> netbsd:syscall+0x196:
> 
> my panic is the same as christos' -- inside execve1().  i'm getting
> a crash dump now, so i can probably inspect this further, but i'm
> wondering if this is related to the changes to vnode/nfs locking
> changes that came around a week ago:
> 
> https://mail-index.netbsd.org/source-changes/2022/02/28/msg137218.html
> https://mail-index.netbsd.org/source-changes/2022/02/28/msg137219.html
> 
> i don't know when this broke.  i only really started looking because
> christos said he saw this problem..

I'm now able to reproduce it here -- takes about six hours to trigger.

I suppose vrelel() lost a check for new references with my last changes,
currently testing the diff attached.

--
J. Hannken-Illjes - hann...@mailbox.org


vfs_node.c.diff
Description: Binary data




Re: reboot panic: "error == EOPNOTSUPP" in vfs_vnode.c line 1120

2022-02-07 Thread J. Hannken-Illjes
> On 7. Feb 2022, at 07:18, Michael van Elst  wrote:
> 
> m...@eterna.com.au (matthew green) writes:
> 
>> while rebooting a quartz64 with a usb attached disk that just
>> had a about 3.5GB of data written to it, i the umass gave some
>> errorse and then i got a panic shortly later:
> 
>> [ 6179.6038961] Skipping crash dump on recursive panic
>> [ 6179.6038961] panic: kernel diagnostic assertion "error == EOPNOTSUPP"
>> failed: file "/usr/src/sys/kern/vfs_vnode.c", line 1120
> 
> genfs_suspendctl() may return ENOENT if IMNT_GONE is set in mnt_iflag.
> I'm not sure if mnt_iflag is even protected against concurrent access
> but vrevoke() racing with dounmount() might be enough.

Flag IMNT_GONE is protected with suspension.  Looks like the KASSERT
should become "KASSERT(error == EOPNOTSUPP || error == ENOENT)" here.

--
J. Hannken-Illjes - hann...@mailbox.org




Re: Resolving open/close, attach/detach races

2022-02-01 Thread J. Hannken-Illjes
> On 1. Feb 2022, at 02:14, Taylor R Campbell  wrote:
> 
>> Date: Mon, 31 Jan 2022 11:15:25 +0100
>> From: "J. Hannken-Illjes" 
>> 
>> Do you have a recipe so I could try it here?
> 
> New draft, and script I've been using for testing.  It just does cat
> and revoke on a set of device nodes, repeatedly, in many processes in
> parallel; at the same time, I yank the USB devices and reinsert them.
> My test now survives with ucom(4), at least; other drivers will need
> fixes -- but fixes which will generally be pullable-up to a branch!

As my main testbed is a virtual machine this doesn't work but it
seems you already found a way to prevent these deadlocks.

> I resolved the issue I was asking about by
> 
> (a) having spec_node_lookup_by_dev optionally (when called by
>vdevgone) wait with vcache_vget even for dying vnodes, so doing it
>in a loop until none left waits until any revoke finishes; and
> 
> (b) moving the hash table removal logic from spec_node_destroy to
>spec_node_revoke, and making vcache_reclaim call spec_node_revoke
>unconditionally for device vnodes.
> 
> This way each specnode is only in the hash table while it's safe for
> vcache_vget to be waiting for it, so vdevgone can safely wait for any
> concurrent revoke to finish.

Looks good so far.

The new KASSERT(devvp->v_specnode->sn_opencnt) from
spec_node_setmountedfs() fires for unmounts of udf fs as operation
udf_vfsops.c::udf_unmount() closes the device BEFORE setmountedfs().



--
J. Hannken-Illjes - hann...@mailbox.org




Re: Resolving open/close, attach/detach races

2022-01-31 Thread J. Hannken-Illjes
> On 29. Jan 2022, at 22:08, Taylor R Campbell  wrote:
> 
> New draft changes to resolve open/close/attach/detach races.
> 
> There's one snag in here that I haven't been able to resolve -- a race
> in concurrent revoke and detach; maybe hannken@ can help?  The snag is
> that vdevgone needs to revoke all existing device nodes _and_ wait for
> the revocation to complete -- even if it is happening concurrently via
> the revoke(2) system call.
> 
> But spec_lookup_by_dev will just skip vnodes currently being revoked,
> and possibly return before they have finished being revoked.  I tried
> making spec_lookup_by_dev wait with vdead_check(vp, 0) instead of
> vdead_check(vp, VDEAD_NOWAIT) in some circumstances, but that didn't
> work -- it led to deadlock or livelock (and then kernel lock spinout),
> depending on how I did it.
> 
> My attempts to make this work may have failed because vdead_check is
> forbidden if the caller doesn't hold a vnode reference, and it's not
> clear taking a vnode reference is allowed at this point.  Generally,
> taking a new reference to a vnode being reclaimed should not be
> allowed.

Do you have a recipe so I could try it here?

> (I don't entirely understand ad@'s recent(ish) changes to allow it in
> some cases, which strikes me as a regression from the system we had
> before where VOP_INACTIVE's decision is final -- a huge improvement
> over the piles of bug-ridden incoherent gobbledegook we used to have
> to deal with vnodes being revived in the middle of reclamation.)

While ad@ introduced a race to vrelel() where a vnode cycle may miss
the final VOP_INACTIVE() I don't see a regression here.
The vnode lock already serializes VOP_INACTIVE() and all that is
missing is a path to retry the vrelel() if the vnode gained another
reference.



--
J. Hannken-Illjes - hann...@mailbox.org




Re: Resolving open/close, attach/detach races

2022-01-18 Thread J. Hannken-Illjes
> On 16. Jan 2022, at 23:41, Taylor R Campbell  wrote:
> 
>> Date: Sat, 15 Jan 2022 21:23:56 +
>> From: Taylor R Campbell 
>> 
>> The attached patch series enables us to resolve a large class of races
>> between devsw open/close, autoconf attach/detach, and module unload.
> 
> Also introduced some bugs!
> 
> - The devsw table management was broken for dynamically attached devsw
>  entries for statically allocated device major numbers.  This broke
>  one of the fast(er) paths for statically linked devsw entries, but
>  it can be repaired without needing any memory barriers for them.
> 
> - Holding fstrans across VOP_READ/WRITE prevented suspending file
>  systems with device vnodes for ttys that processes were waiting to
>  read from.  Instead, these vops should take the new kind of specnode
>  I/O reference, while the vnode is unlocked.
> 
> Updated patch addresses these.
> 
> Unclear whether it is appropriate for VOP_CLOSE to use fstrans.  Maybe
> it too should use new specnode I/O references -- or perhaps it needn't
> drop the vnode lock at all across bdev/cdev_close; generally blocking
> indefinitely in a close routine strikes me as a bad idea.
> 

This doesn't work.  Running the attached program on a current kernel:

# ./revoke
pty /dev/tty00, pid 2107
revoke /dev/tty00
read /dev/tty00 -> 0
done

With your patch:

# ./revoke
pty /dev/tty00, pid 975
revoke /dev/tty00

and hangs hard.  The revoke() operation enters spec_node_revoke()
and here it waits forever for sn_iocnt to become zero.

As the read from the other thread set sn_iocnt to one by entering
through spec_io_enter() the system will hang until this read returns.

Am I missing something?
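
For readers without the attachment, a rough reconstruction of what such
a test does (this is not the attached revoke.c; it uses a pty pair from
openpty(3), link with -lutil, while the real test presumably used a
ucom/tty device):

	#include <sys/wait.h>
	#include <err.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <util.h>

	int
	main(void)
	{
		char tty[64], buf[64];
		int mfd, sfd;
		ssize_t n;
		pid_t pid;

		if (openpty(&mfd, &sfd, tty, NULL, NULL) == -1)
			err(1, "openpty");
		printf("pty %s, pid %d\n", tty, getpid());

		switch ((pid = fork())) {
		case -1:
			err(1, "fork");
		case 0:
			n = read(sfd, buf, sizeof(buf));	/* blocks */
			printf("read %s -> %zd\n", tty, n);
			_exit(0);
		default:
			sleep(1);	/* let the child block in read() */
			printf("revoke %s\n", tty);
			if (revoke(tty) == -1)
				err(1, "revoke");
			(void)waitpid(pid, NULL, 0);
			printf("done\n");
		}
		return 0;
	}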

--
J. Hannken-Illjes - hann...@mailbox.org



revoke.c
Description: Binary data




Re: ZFS L2ARC on NetBSD-9

2021-04-20 Thread J. Hannken-Illjes
> On 19. Apr 2021, at 19:59, Andrew Parker  wrote:
> 
> On Monday, 19 April 2021 13:28:07 EDT Robert Elz wrote:
> > Date:Sun, 18 Apr 2021 18:58:56 +
> > From:Andrew Parker 
> > Message-ID:  <2245776.bZt9KSGgi3@t470s.local>
> >
> >   | Does anyone else have a working L2ARC?
> >
> > Sorry, don't even know what that is, and don't (currently anyway) use zfs,
> >
> > but:
> >   | -   interval = hz * l2arc_feed_secs;
> >   | +   interval = mstohz(l2arc_feed_secs);
> >
> > Are you sure about that part of the change (the earlier fragment looked
> > reasonable) ?
> >
> > mstohz() when starting with seconds (which the name of that var suggests)
> > looks like it would be much smaller than intended, whereas simply
> > multiplying seconds by hz gives ticks, which looks to be the objective in
> > all of that. Alternatively multiply secs by 1000 to generate ms, and
> > mstohz() that.
> >
> > Watch out for potential overflow in all of this though.
> >
> > kre
> 
> Oops.  I completely misread how the return value of l2arc_write_interval is 
> used so that patch doesn't make any sense.  But adding the printf suggested 
> earlier results in this just after boot:
> 
> [14.600107] WARNING: ZFS on NetBSD is under development
> [14.650039] ZFS filesystem version: 5
> [14.650039] wait 100
> [    15.690043] wait 96
> [17.840054] wait 0

As Manuel said this will block forever.  Please try the attached diff
that should prevent waits on negative or zero ticks.
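
The idea, as a rough sketch only (hypothetical, not necessarily what the
attached arc.c.diff does): clamp the absolute wakeup time so the feed
loop never hands cv_timedwait() a tick count in the past, which -- as
Manuel noted -- would otherwise block forever.

	clock_t now = ddi_get_lbolt();

	next = l2arc_write_interval(begin, size, wrote);
	if (next <= now)	/* never wait on a time already passed */
		next = now + 1;
	(void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock, next);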

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)


arc.c.diff
Description: Binary data




Re: ZFS L2ARC on NetBSD-9

2021-04-19 Thread J. Hannken-Illjes
> On 18. Apr 2021, at 20:58, Andrew Parker  wrote:
> 
> Hi,
> 
> While trying to setup an L2ARC on my amd64 Xen dom0, I noticed that almost no
> data was feeding in to the cache.  The counters in kstat.zfs.misc.arcstats.l2*
> indicate the l2arc feed process must be stuck somewhere.

Which counters and which values?

> I think I've traced
> the problem to l2arc_write_interval() called by l2arc_feed_thread(). It looks
> like the l2arc_feed_thread() loop only runs once because the delay set by
> l2arc_write_interval() is always too large.

For me (-current, KVM) l2arc_feed_thread() runs every second or every 0.15
second under load, tested with this patch:

l2arc_feed_thread(void *dummy __unused)
...
while (l2arc_thread_exit == 0) {
CALLB_CPR_SAFE_BEGIN();
+   printf("wait %u\n", next - ddi_get_lbolt());

>  Maybe this is a difference between
> NetBSD and Solaris cv_timedwait?  Something like this results in a functioning
> L2ARC on my system:
> 
> --- a/external/cddl/osnet/dist/uts/common/fs/zfs/arc.c
> +++ b/external/cddl/osnet/dist/uts/common/fs/zfs/arc.c
> @@ -6520,7 +6520,7 @@ l2arc_write_size(void)
> static clock_t
> l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
> {
> -   clock_t interval, next, now;
> +   clock_t interval;
> 
>/*
> * If the ARC lists are busy, increase our write rate; if the
> @@ -6529,14 +6529,11 @@ l2arc_write_interval(clock_t began, uint64_t wanted,
> uint64_t wrote)
> * what we wanted, schedule the next write much sooner.
> */
>if (l2arc_feed_again && wrote > (wanted / 2))
> -   interval = (hz * l2arc_feed_min_ms) / 1000;
> +   interval = mstohz(l2arc_feed_min_ms) / 2;
>else
> -   interval = hz * l2arc_feed_secs;
> +   interval = mstohz(l2arc_feed_secs);
> 
> -   now = ddi_get_lbolt();
> -   next = MAX(now, MIN(now + interval, began + interval));
> -
> -   return (next);
> +   return (interval);
> }

This is completely wrong, l2arc_write_interval() must return next time,
not a fixed interval.
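
For context (quoting the caller from memory, so take the detail as
approximate): the feed thread passes the return value straight to
cv_timedwait() as an absolute wakeup time in ticks,

	next = l2arc_write_interval(begin, size, wrote);
	...
	(void) cv_timedwait(&l2arc_feed_thr_cv,
	    &l2arc_feed_thr_lock, next);

so it has to be "now + interval" (suitably clamped), not a relative
tick count.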

> This is on NetBSD-9.  I hope I haven't missed something obvious in setup here.
> Does anyone else have a working L2ARC?
> 
> Thanks,
> Andrew

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)




Re: ZFS: time to drop Big Scary Warning

2021-03-19 Thread J. Hannken-Illjes
> On 19. Mar 2021, at 21:18, Michael  wrote:
> 
> Hello,
> 
> On Fri, 19 Mar 2021 15:57:18 -0400
> Greg Troxel  wrote:
> 
>> Even in current, zfs has a Big Scary Warning.  Lots of people are using
>> it and it seems quite solid, especially by -current standards.  So it
>> feels times to drop the warning.
>> 
>> I am not proposing dropping the warning in 9.
>> 
>> Objections/comments?
> 
> I've been using it on sparc64 without issues for a while now.
> Does nfs sharing work these days? I dimly remember problems there.

If you mean misc/55042: Panic when creating a directory on a NFS served ZFS
it should be fixed in -current.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig




Re: zfs panic in zfs:vdev_disk_open.part.4

2020-11-28 Thread J. Hannken-Illjes
> On 24. Nov 2020, at 19:16, Yorick Hardy  wrote:
> 
> Dear David,
> 
> On 2020-11-24, Yorick Hardy wrote:
>> Dear David,
>> 
>> On 2020-11-22, David Brownlee wrote:
>>> I'm seeing a (new?) panic on netbsd-9 with zfs. It seems to trigger
>>> when a newly created zfs pool attempts to be mounted:
>>> 
>>> panic: vrelel: bad ref count
>>> cpu0: Begin traceback...
>>> vpanic() at netbsd:vpanic+0x160
>>> vcache_reclaim() at netbsd:vcache_reclaim
>>> vrelel() at netbsd:vrelel+0x22e
>>> vdev_disk_open.part.4() at zfs:vdev_disk_open.part.4+0x44e
>>> vdev_open() at zfs:vdev_open+0x9e
>>> vdev_open_children() at zfs:vdev_open_children+0x39
>>> vdev_root_open() at zfs:vdev_root_open+0x33
>>> vdev_open() at zfs:vdev_open+0x9e
>>> vdev_create() at zfs:vdev_create+0x1b
>>> spa_create() at zfs:spa_create+0x28c
>>> zfs_ioc_pool_create() at zfs:zfs_ioc_pool_create+0x19b
>>> zfsdev_ioctl() at zfs:zfsdev_ioctl+0x265
>>> nb_zfsdev_ioctl() at zfs:nb_zfsdev_ioctl+0x38
>>> VOP_IOCTL() at netbsd:VOP_IOCTL+0x54
>>> vn_ioctl() at netbsd:vn_ioctl+0xa5
>>> sys_ioctl() at netbsd:sys_ioctl+0x5ab
>>> syscall() at netbsd:syscall+0x157
>>> --- syscall (number 54) ---
>>> 7e047af6822a:
>>> cpu0: End traceback...
>>> 
>>> Anyone seeing anything similar (I continue to have a bunch of other
>>> boxes which use zfs without issue)
>>> 
>>> David
>> 
>> I don't know if it helps, but it looks like vn_close(vp,..) should be called 
>> instead
>> of vrele(vp) at
>> 
>> https://nxr.netbsd.org/xref/external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c#218
>> 
>> and
>> 
>> https://nxr.netbsd.org/xref/external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c#250
> 
> Unless I am mistaken, I think you are hitting the error path which
> should probably call vn_close as below. My machine is a bit old, so
> I have not even compile tested yet and there is still the question
> of why/if you are in the error path.
> 
> If it does not help, apologies in advance!
> 
> --
> Kind regards,
> 
> Yorick Hardy
> 
> Index: external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c
> ===
> RCS file: 
> /cvsroot/src/external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c,v
> retrieving revision 1.18
> diff -u -r1.18 vdev_disk.c
> --- src/external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c25 Jun 
> 2020 09:39:15 -  1.18
> +++ src/external/cddl/osnet/dist/uts/common/fs/zfs/vdev_disk.c24 Nov 
> 2020 18:10:07 -
> @@ -215,7 +215,8 @@
>   return (SET_ERROR(error));
>   }
>   if (vp->v_type != VBLK) {
> - vrele(vp);
> + /* VN_RELE(vp); ?? */
> + vn_close(vp, FREAD|FWRITE, kcred);
>   vd->vdev_stat.vs_aux = VDEV_AUX_OPEN_FAILED;
>   return (SET_ERROR(EINVAL));
>   }
> @@ -247,7 +248,8 @@
>   error = workqueue_create(&dvd->vd_wq, "vdevsync",
>   vdev_disk_flush, dvd, PRI_NONE, IPL_NONE, WQ_MPSAFE);
>   if (error != 0) {
> - vrele(vp);
> + /* VN_RELE(vp); ?? */
> + vn_close(vp, FREAD|FWRITE, kcred);
>   return (SET_ERROR(error));
>   }

Good catch Yorick, though I prefer the attached diff.
Please commit and request pullup to -9.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig


vdev_disk.diff
Description: Binary data




Re: Atomic vcache_tryvget()

2020-05-26 Thread J. Hannken-Illjes
> On 25. May 2020, at 00:53, Andrew Doran  wrote:
> 
> On Sat, Apr 04, 2020 at 09:33:26PM +, Andrew Doran wrote:
> 
>> This change makes vcache_tryvget() do its work by means of a single atomic
>> operation, which removes the need to take v_interlock beforehand.
> 
> I re-did this one more elegantly.  Assuming no problems I plan to commit
> some time this week.
> 
>   http://www.netbsd.org/~ad/2020/vget.diff
> 
> The basic idea: have a "gate" bit in v_usecount that's only set when the
> vnode is LOADED.  vcache_tryvget() needs to see this otherwise it fails and
> we go down the slow path instead.  The state transition for reclaiming is
> then:
> 
>   ...
>   LOADED
>   -> BLOCKED
>   confirm that caller still holds the last vnode reference
>       -> RECLAIMING
>   ...

Looks good to me -- please go ahead ...
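
For readers following along, a rough sketch of the gate-bit idea --
hypothetical names, and it assumes the use count is accessed as an
unsigned int; the actual layout is in the vget.diff above:

	#include <sys/types.h>
	#include <sys/atomic.h>

	#define	VUSE_GATE	0x80000000u	/* set only while LOADED */

	static bool
	tryvget_sketch(volatile u_int *usecount)	/* &vp->v_usecount */
	{
		u_int use, next;

		for (use = atomic_load_relaxed(usecount);;) {
			if ((use & VUSE_GATE) == 0)
				return false;	/* not LOADED: take the slow path */
			next = atomic_cas_uint(usecount, use, use + 1);
			if (next == use)
				return true;	/* reference gained */
			use = next;
		}
	}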

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)




Re: Please review: lookup changes

2020-03-11 Thread J. Hannken-Illjes
> On 5. Mar 2020, at 23:48, Andrew Doran  wrote:
> 
> Hi,
> 
> I'd like to merge the changes on the ad-namecache branch, and would
> appreciate any code review.  The motivation for these changes is, as
> you might guess, performance.  I put an overview below.  I won't be
> merging this for a couple of weeks in any case.

[snip]

> vfs_vnode.c:
> 
>   Namecache related changes, and the change to not block new vnode
>   references across VOP_INACTIVE() mentioned on tech-kern.  I don't
>   think this blocking is needed at least for well behaved FSes like
>   ffs & tmpfs.

I suppose this blocking is no longer needed.  It originates from
the early steps to clean up the vnode lifecycle.  If anything
goes wrong it is quite simple to undo this change.

Blocking further references should only be needed for vrecycle().

We have to terminate vrelel() early if we got new references,
diff attached.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig



addendum.diff
Description: Binary data




Re: Folding vnode_impl_t back into struct vnode

2020-01-24 Thread J. Hannken-Illjes
> On 23. Jan 2020, at 17:37, Andrew Doran  wrote:
> 
> Hi,
> 
> I would like to unify these back into a single struct vnode.
> 
> I think I understand the motivation for the split in the first place and I
> very much agree code outside the VFS core should not be playing around with
> a lot of the stuff in struct vnode, but I think the split causes more
> trouble than it's worth.

Please do not, I strongly object.

It was hard work to clean up file systems and kernel components using
and abusing vnode fields that should be opaque outside the VFS implementation.

I'm quite sure people will start using these fields again because
it is so simple.  For me the single, unsplit struct vnode was a
maintenance nightmare.


> Here are my reasons for wanting to go back to a single struct:
> 
> 1) Performance.  I want to be able to cluster the fields around the kind of
> usage they get.  For example changing v_usecount on libc.so's vnode or the
> root vnode shouldn't force a cache miss on a different CPU that's just
> trying to look up pages, or get at v_op or v_type or whatever; they're
> different kinds of activity (see below for example).
> 
> 2) Memory usage; to do the above in a way that doesn't use more memory than
> needed.  With the structs split in two it's difficult to arrange all of
> this cleanly.
> 
> 3) Clarity.  I'm having a hard time working in the code with the split and
> the different macros to convert back and forth.  It's visually untidy.  I
> also think that if NetBSD's is to be a reference implementation or teaching
> aid then well known stuff like this should be straightforward.
> 
> Thoughts?
> 
> Cheers,
> Andrew
> 
> struct vnode {
>/*
> * more stable stuff goes at the front.
> */
>struct uvm_object v_uobj;
>...
>enum vtype  v_type;
>...
> 
>/*
> * counters usecounts etc.  these guys get hammered on.
> * think of v_usecount on "rootvnode" for example.
> * (v_usecount will come out of struct uvm_object with uvm
> * rwlock changes.)
> */
>int v_usecount
>__cacheline_aligned(COHERENCY_UNIT);
>int v_holdcount;
>int v_numoutput;
>int v_writecount;
>int v_someothercounter;
>...

All contained in struct vnode; there are no counters in struct vnode_impl.

>/*
> * then locks at the back in their own line so they don't
> * impinge on any of the above either; different kind of activity
> * again.
> */
>krwlock_t   v_lock
>__cacheline_aligned(COHERENCY_UNIT);
>krwlock_t   v_nc_lock;     /* namecache locks */
>krwlock_t   v_nc_listlock;

All (v_lock at least) are contained in struct vnode_impl; all locks should
go here as they are opaque to any consumer.

> };

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig




Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded

2020-01-15 Thread J. Hannken-Illjes
> On 15. Jan 2020, at 18:44, Andrew Doran  wrote:
> 
> I don't understand why we do this.  I don't think it's needed at all because
> if the file really is deleted and the inode is getting cleaned out, then it
> shouldn't be getting new refs in any of the usual ways and we don't need to
> worry about it.  And in any case the vnode lock prevents anyone from messing
> with the inode during VOP_INACTIVE().
> 
> I want to change this because it causes nasty problems on MP.  For example I
> see threads very regulary thrown out of the namecache by vcache_vtryget(),
> and when they finally get the vnode ref'd and locked they do it by digging
> through the file system metadata (when everything is already in cache).
> 
>   http://www.netbsd.org/~ad/inactive.diff
> 
> What am I missing here?  Is there a buggy file system or something like
> that?

We also come here with a full vnode cache, cleaning vnodes to keep the cache
size below its maximum and calling vrecycle() on un-deleted vnodes.

I suppose the "vp->v_usecount == 1" assert in vrecycle() may fire with
this change as we drop the interlock before taking the lock.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig




Re: kmem_alloc() vs. VM_MIN_KERNEL_ADDRESS

2019-07-02 Thread J. Hannken-Illjes
> On 1. Jul 2019, at 20:12, Maxime Villard  wrote:
> 
> Le 01/07/2019 à 19:40, J. Hannken-Illjes a écrit :
>> Sometimes kmem_alloc() returns an address below
>> VM_MIN_KERNEL_ADDRESS, something like
>>   kmem_alloc() -> 0xaba25800a5a8 with
>>   VM_MIN_KERNEL_ADDRESS 0xad80
>> It doesn't happen after every reboot and breaks
>> dtrace which treats the region 0.. VM_MIN_KERNEL_ADDRESS
>> as toxic.
>> How could we test against the lowest address the kernel
>> may use?
>> --
>> J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig
> 
> Mmh, this seems to be a side effect of 1/2-KASLR which is enabled by default
> in GENERIC. Reminder: in GENERIC, we now randomize by default every kernel
> memory region.
> 
> It is now possible that the direct map is below the main memory map -- it was
> not the case before. kmem_alloc uses the direct map for certain allocations.
> 
> First, please confirm by also dumping PMAP_DIRECT_BASE and PMAP_DIRECT_END to
> make sure the area returned by kmem_alloc is indeed in the dmap when dtrace
> doesn't work.
> 
> Then, it seems that dtrace should be taught to consider as toxic everything
> that is neither in the main map nor in the dmap. That is, only these ranges
> are valid:
> 
>   VM_MIN_KERNEL_ADDRESS -> VM_MAX_KERNEL_ADDRESS
>   PMAP_DIRECT_BASE -> PMAP_DIRECT_END
> 
> There is no actual "lowest" address the kernel may use; there are just two
> separate regions that are valid. (Plus the kernel image itself, if it
> matters.)

It would work but unfortunately dtrace is a module and cannot
use PMAP_DIRECT_BASE aka. pmap_direct_base.  These variables
are not visible from module context and don't exist if the
kernel is built with KASAN.

Solaris uses the toxic areas 0 .. _userlimit and all regions
mapped from devices.

Will the kernel ever map anything below
VM_MIN_KERNEL_ADDRESS_DEFAULT?

Do we have a mapping that contains all device-mapped memory?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig




kmem_alloc() vs. VM_MIN_KERNEL_ADDRESS

2019-07-01 Thread J. Hannken-Illjes
Sometimes kmem_alloc() returns an address below
VM_MIN_KERNEL_ADDRESS, something like

  kmem_alloc() -> 0xaba25800a5a8 with
  VM_MIN_KERNEL_ADDRESS 0xad80

It doesn't happen after every reboot and breaks
dtrace which treats the region 0.. VM_MIN_KERNEL_ADDRESS
as toxic.

How could we test against the lowest address the kernel
may use?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig







Re: ZFS works on 8.99.34 but fails on 201905260520Z

2019-05-29 Thread J. Hannken-Illjes
> On 29. May 2019, at 01:38, Paul Goyette  wrote:
> 
> The commit that introduced the new symbol should also have bumped the kernel 
> version...  That's how we keep modules and kernel in sync...
> 
> 
> On Tue, 28 May 2019, m...@netbsd.org wrote:
> 
>> Found the commit - looks like newer modules than kernel.
>> https://v4.freshbsd.org/commit/netbsd/src/IH8Jag0YCI3N6boB

Do we really expect module updates without updating the kernel
to work?

If yes, I will do it next time.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig




Re: ZFS works on 8.99.34 but fails on 201905260520Z

2019-05-28 Thread J. Hannken-Illjes
Petr,

your kernel is older than your ZFS module.

Please update to a current kernel and try again.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig



> On 28. May 2019, at 20:27, Petr Topiarz  wrote:
> 
> Hi Tech-kern,
> 
> I run two machines with NetBSD amd64 with ZFS, one is with 8.99.34 kernel 
> from february,
> 
> the other is the latest today, 201905260520Z,
> 
> It all runs fine with the first one, but as I upgraded the other, ZFS does 
> not load and tells me:
> 
> modload: zfs: Exec format error
> 
> and to /var/log messages it writes:
> 
> May 28 18:55:46 poweredge /netbsd: [ 236.3881944] kobj_checksyms, 988: [zfs]: 
> linker error: symbol `disk_rename' not found
> May 28 18:55:46 poweredge /netbsd: [ 236.4833169] WARNING: module error: 
> unable to affix module `zfs', error 8
> May 28 18:55:50 poweredge /netbsd: [ 240.2655954] kobj_checksyms, 988: [zfs]: 
> linker error: symbol `disk_rename' not found
> May 28 18:55:50 poweredge /netbsd: [ 240.3599823] WARNING: module error: 
> unable to affix module `zfs', error 8
> May 28 18:56:18 poweredge /netbsd: [ 268.0810981] kobj_checksyms, 988: [zfs]: 
> linker error: symbol `disk_rename' not found
> May 28 18:56:18 poweredge /netbsd: [ 268.1715047] WARNING: module error: 
> unable to affix module `zfs', error 8
> 
> considering configuration I got:
> 
> cat /etc/modules.conf
> solaris
> zfs
> 
> and in /etc/rc.conf I got
> 
> modules=YES
> 
> Any hint where to look or what to reconfigure in the kernel? I am using 
> standard netbsd kernel in both cases.
> 
> thanks
> 
> Petr
> 





Re: snapshot_mount: already on list

2018-11-23 Thread J. Hannken-Illjes


> On 23. Nov 2018, at 11:57, Emmanuel Dreyfus  wrote:
> 
> Hello
> 
> Some times ago, I encountered this nasty bug where the kernel
> would crash when mounting a filesystem with snapshots. I thought
> it was fixed, but it seems there are some problems leftover, at least
> on netbsd-8. 
> 
> The crash:
> 
> /mail: replaying log to disk
> ffs_snapshot_mount: vget failed 2
> panic: ffs_snapshot_mount: 211471882 already on list

This error cannot be fixed in the kernel: we would have to remove
the first inode from the list, and by the time we detect this
situation it is too late.

Running fsck on the file system should remove duplicates from
the list of snapshots to mount.

> After another reboot, I ran fsck on the fileystem, inode 211471882
> indeed as troubles:
> 
> ** Phase 1 - Check Blocks and Sizes
> INCORRECT BLOCK COUNT I=211471881 (31466368 should be 31466624)
> CORRECT? yes
> 
> 845563840 DUP I=211471882
> 845563841 DUP I=211471882
> 845563842 DUP I=211471882
> 845563843 DUP I=211471882
> 845563844 DUP I=211471882
> 845563845 DUP I=211471882
> 845563846 DUP I=211471882
> 845563847 DUP I=211471882
> INCORRECT BLOCK COUNT I=211471882 (11141696 should be 11142208)
> CORRECT? yes

Did the file system mount OK after fsck?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: panic: ffs_snapshot_mount: already on list

2018-10-02 Thread J. Hannken-Illjes


> On 2. Oct 2018, at 12:23, Manuel Bouyer  wrote:
> 
> On Tue, Oct 02, 2018 at 12:11:49PM +0200, J. Hannken-Illjes wrote:
>> 
>>> On 18. Sep 2018, at 16:39, J. Hannken-Illjes  
>>> wrote:
>>> 
>>> 
>>>> On 18. Sep 2018, at 16:33, Manuel Bouyer  wrote:
>>>> 
>>>> On Tue, Sep 18, 2018 at 04:24:28PM +0200, J. Hannken-Illjes wrote:
>>>>> There will be no copy-on-write to persistent snapshots so the snapshots
>>>>> will contain garbage where the real file system writes data.
>>>> 
>>>> Ha yes, of course. One could argue that if a systadmin boots a
>>>> kernel with FFS_NO_SNAPSHOT he should know what he's doing.
>>>> 
>>>> Maybe we could leave out the FFS_NO_SNAPSHOT change, and just clear the
>>>> extra fs_snapinum[] entry ?
>>> 
>>> 
>>> Without looking further I would suggest to print a warning, print the
>>> fs_snapinum list and clear the entry as this inode is already an
>>> active snapshot.
>> 
>> This will not work as we must remove the first duplicate which
>> is already active.
> 
> So maybe we could just print a warning and ignore the duplicates at mount
> time ? Are snapshots checked on a read-only mount ?

A warning is not sufficient as we may be using the wrong block hints list.

Snapshots are checked on read-write mounts only.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: panic: ffs_snapshot_mount: already on list

2018-10-02 Thread J. Hannken-Illjes

> On 18. Sep 2018, at 16:39, J. Hannken-Illjes  wrote:
> 
> 
>> On 18. Sep 2018, at 16:33, Manuel Bouyer  wrote:
>> 
>> On Tue, Sep 18, 2018 at 04:24:28PM +0200, J. Hannken-Illjes wrote:
>>> There will be no copy-on-write to persistent snapshots so the snapshots
>>> will contain garbage where the real file system writes data.
>> 
>> Ha yes, of course. One could argue that if a systadmin boots a
>> kernel with FFS_NO_SNAPSHOT he should know what he's doing.
>> 
>> Maybe we could leave out the FFS_NO_SNAPSHOT change, and just clear the
>> extra fs_snapinum[] entry ?
> 
> 
> Without looking further I would suggest to print a warning, print the
> fs_snapinum list and clear the entry as this inode is already an
> active snapshot.

This will not work as we must remove the first duplicate which
is already active.

As this should only happen after a crash or unclean unmount this looks
like a job for fsck.  The attached diff teaches fsck_ffs to handle this
inconsistency.

Opinions?
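
For illustration, a rough sketch of the idea only -- hypothetical code,
not the attached diff; field names as found in <ufs/ffs/fs.h>:

	#include <string.h>
	#include <ufs/ffs/fs.h>

	/*
	 * Drop duplicate snapshot inodes from fs_snapinum[], keeping the
	 * list compact since the kernel stops at the first zero slot.
	 */
	static void
	dedup_snapinum(struct fs *fs)
	{
		int32_t clean[FSMAXSNAP];
		int i, j, n = 0;

		memset(clean, 0, sizeof(clean));
		for (i = 0; i < FSMAXSNAP; i++) {
			int32_t ino = fs->fs_snapinum[i];

			if (ino == 0)
				continue;
			for (j = 0; j < n; j++)
				if (clean[j] == ino)
					break;
			if (j == n)
				clean[n++] = ino;	/* keep first occurrence */
		}
		memcpy(fs->fs_snapinum, clean, sizeof(clean));
	}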

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)


001_fsck_snap
Description: Binary data


Re: panic: ffs_snapshot_mount: already on list

2018-09-18 Thread J. Hannken-Illjes


> On 18. Sep 2018, at 16:33, Manuel Bouyer  wrote:
> 
> On Tue, Sep 18, 2018 at 04:24:28PM +0200, J. Hannken-Illjes wrote:
>> There will be no copy-on-write to persistent snapshots so the snapshots
>> will contain garbage where the real file system writes data.
> 
> Ha yes, of course. One could argue that if a systadmin boots a
> kernel with FFS_NO_SNAPSHOT he should know what he's doing.
> 
> Maybe we could leave out the FFS_NO_SNAPSHOT change, and just clear the
> extra fs_snapinum[] entry ?


Without looking further I would suggest printing a warning, printing the
fs_snapinum list and clearing the entry, as this inode is already an
active snapshot.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: panic: ffs_snapshot_mount: already on list

2018-09-18 Thread J. Hannken-Illjes


> On 15. Sep 2018, at 17:06, Manuel Bouyer  wrote:
> 
> On Sat, Sep 15, 2018 at 04:34:41PM +0200, J. Hannken-Illjes wrote:
>> This will damage all persistent snapshots if you boot a kernel with 
>> FFS_NO_SNAPSHOT
>> Pleas don't commit ...
> 
> Can you explain why ? The problem Emmanuel is trying to fix is real ...
> I can't see how a ffs_snapshot_mount() which is a NOP could damage
> anything.

There will be no copy-on-write to persistent snapshots so the snapshots
will contain garbage where the real file system writes data.

I will be back from vacation in October and take a deeper look.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: panic: ffs_snapshot_mount: already on list

2018-09-15 Thread J. Hannken-Illjes
This will damage all persistent snapshots if you boot a kernel with
FFS_NO_SNAPSHOT.  Please don't commit ...

> On 13. Sep 2018, at 08:08, Emmanuel Dreyfus  wrote:
> 
> Manuel Bouyer  wrote:
> 
>> maybe clear fs->fs_snapinum[snaploc] in that case ?
> 
> I am about to commit the patch below, but the problem 
> is: I cannot test it, since I cleaned up the obsolete snapshot 
> storage backend file from the machine I had the initial
> problem with.
> 
> Index: sys/ufs/ffs/ffs_snapshot.c
> ===
> RCS file: /cvsroot/src/sys/ufs/ffs/ffs_snapshot.c,v
> retrieving revision 1.149
> diff -U4 -r1.149 ffs_snapshot.c
> --- sys/ufs/ffs/ffs_snapshot.c  1 Jun 2017 02:45:15 -   1.149
> +++ sys/ufs/ffs/ffs_snapshot.c  13 Sep 2018 06:03:32 -
> @@ -1734,8 +1734,9 @@
>  */
> void
> ffs_snapshot_mount(struct mount *mp)
> {
> +#ifndef FFS_NO_SNAPSHOT
>struct vnode *devvp = VFSTOUFS(mp)->um_devvp;
>struct fs *fs = VFSTOUFS(mp)->um_fs;
>struct lwp *l = curlwp;
>struct vnode *vp;
> @@ -1823,14 +1824,16 @@
> 
>/*
> * Link it onto the active snapshot list.
> */
> -   if (is_active_snapshot(si, ip))
> -   panic("ffs_snapshot_mount: %"PRIu64" already on list",
> +   if (is_active_snapshot(si, ip)) {
> +   printf("ffs_snapshot_mount: %"PRIu64" already on 
> list",
>ip->i_number);
> -   else
> +   fs->fs_snapinum[snaploc] = 0;
> +   } else {
>TAILQ_INSERT_TAIL(&si->si_snapshots, ip, i_nextsnap);
> -   vp->v_vflag |= VV_SYSTEM;
> +   vp->v_vflag |= VV_SYSTEM;
> +   }
>VOP_UNLOCK(vp);
>}
>/*
> * No usable snapshots found.
> @@ -1847,8 +1850,9 @@
>si->si_snapblklist = xp->i_snapblklist;
>fscow_establish(mp, ffs_copyonwrite, devvp);
>    si->si_gen++;
>mutex_exit(&si->si_lock);
> +#endif /* FFS_NO_SNAPSHOT */
> }
> 
> /*
>  * Disassociate snapshot files when unmounting.
> 
> 
> 
> -- 
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> m...@netbsd.org

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Finding an available fss device

2018-08-30 Thread J. Hannken-Illjes


> On 23. Aug 2018, at 11:59, J. Hannken-Illjes  wrote:
> 
> 
>> On 22. Aug 2018, at 08:50, Emmanuel Dreyfus  wrote:
>> 
>> On Mon, Aug 20, 2018 at 10:39:21AM +0200, J. Hannken-Illjes wrote:
>>>> I applied that to NetBSD-8.0, and it seems to behave much better.
>>> Good.
>> 
>> Will you commit and request a pullup? The change is valuable.
> 
> Sure ...

[pullup-8 #999] fss config update
[pullup-8 #1000] Fix deadlock with getnewbuf()

Should also fix the deadlock VFS_SNAPSHOT->ffs_copyonwrite->biowait.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Finding an available fss device

2018-08-20 Thread J. Hannken-Illjes


> On 20. Aug 2018, at 10:34, Emmanuel Dreyfus  wrote:
> 
> On Thu, Aug 16, 2018 at 12:18:34PM +0200, J. Hannken-Illjes wrote:
>> - 001_add_sc_state replaces the flags FSS_ACTIVE and FSS_ERROR with
>> a state field.
>> 
>> - 002_extend_state adds states for construction or destruction of
>> a snapshot and fss_ioctl no longer blocks forever waiting for
>> construction or destruction of a snapshot to complete.
>> 
>> Opinions?
> 
> I applied that to NetBSD-8.0, and it seems to behave much better.

Good.

> What about the deadlock scenario you mentionned, e.g. taking 
> multiple snapshots of / at the same time, or a snapshot of / 
> and /home at the same time? Should they work with this patch?

Suppose you mean the "All processes go tstile" thread ...

This patch will change nothing mentioned there.  As I already asked:

- The first thirty lines of "dumpfs /home" please.

- Did you use "dump -x ..." or "dump -X"?

- Did the "dump" process hang?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: All processes go tstile

2018-08-19 Thread J. Hannken-Illjes


> On 19. Aug 2018, at 20:30, David Holland  wrote:
> 
> On Thu, Aug 16, 2018 at 10:03:11AM +0200, J. Hannken-Illjes wrote:
>>> sleepq_block
>>> cv_wait
>>> fstrans_start
>>> VOP_BWRITE
>>> getnewbuf
>>> getblk
>>> bio_doread
>>> bread
>>> ffs_update.part.3
>>> ffs_sync
>>> VFS_SYNC
>>> sched_sync
>>> 
>>> If I understand correctly, that means a n I/O that never completed, right?
>> 
>> Looks like a deadlock where we sync a file system, need a new buffer
>> and try to free a buffer on a currently suspending file system.
>> 
>> VFS_SYNC and VOP_BWRITE should be on different mounts here.
> 
> Side note: what if there isn't a second mount? Not sure that's a good
> enough way to deal with this case...

This is just a guess -- if they were from the same mount, sched_sync()
would already have called fstrans_start() through mountlist_iterator_trynext(),
and VOP_BWRITE would not sleep on fstrans_start() as it always succeeds
on recursion.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: All processes go tstile

2018-08-17 Thread J. Hannken-Illjes


> On 17. Aug 2018, at 04:46, Emmanuel Dreyfus  wrote:
> 
> On Thu, Aug 16, 2018 at 10:03:11AM +0200, J. Hannken-Illjes wrote:
>> Looks like a deadlock where we sync a file system, need a new buffer
>> and try to free a buffer on a currently suspending file system.
>> 
>> VFS_SYNC and VOP_BWRITE should be on different mounts here.
> 
> Here are the mounts:
> /dev/raid3a on / type ffs (log, local)
> /dev/raid3e on /home type ffs (nodev, noexec, nosuid, NFS exported, local)
> 
> The problem arises while dump is performing a snapshot backup on /home

The first thirty lines of "dumpfs /home" please.

Did you use "dump -x ..." or "dump -X"?

Did the "dump" process hang?

> It is worth noting that /home does not have -o log because frequent
> fsync from NFS clients kills performance with -o log. At least it did 
> it on netbsd-7, but I do not see why it could have changed on netbsd-8: 
> frequent fsync means frequent log flushes on the server.
> 
>> Do you have a crash dump?
> 
> I was not able to get one so far, the dump device is not properly
> configured on this system.

If it hits you again, from ddb:

ps /l
call fstrans_dump

backtrace of the "dump" process

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Finding an available fss device

2018-08-16 Thread J. Hannken-Illjes

> On 14. Aug 2018, at 11:16, J. Hannken-Illjes  wrote:
> 
> 
>> On 13. Aug 2018, at 19:25, Emmanuel Dreyfus  wrote:
>> 
>> On Mon, Aug 13, 2018 at 11:56:45AM +, Taylor R Campbell wrote:
>>> Unless I misunderstand fss(4), this is an abuse of mutex(9): nothing
>>> should sleep while holding the lock, so that nothing trying to acquire
>>> the lock will wait for a long time.
>> 
>> Well, the cause is not yet completely clear to me, but the user 
>> experience is terrible. The first time I used it, I thought
>> the system crashed, because fssconfig -l was just hung for 
>> hours.
>> 
> And it is very easy to achieve a situation where most processes
>> are in tstile awaiting a vnode lock for a name lookup.
> 
> I see two problems here.
> 
> 1) File system internal snapshots take long to create or destroy on
>   large file systems.  I have no solution to this problem.
> 
>   Using file system external snapshots for dumps should work fine.
> 
> 2) Fss devices block in ioctl while a snapshot gets created or
>   destroyed.  A possible fix is to replace the current
>   active/non-active state with idle/creating/active/destroying


The attached diffs implement this in a pullup-friendly way.

- 001_add_sc_state replaces the flags FSS_ACTIVE and FSS_ERROR with
a state field.

- 002_extend_state adds states for construction or destruction of
a snapshot and fss_ioctl no longer blocks forever waiting for
construction or destruction of a snapshot to complete.

Opinions?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)


001_add_sc_state
Description: Binary data



002_extend_state
Description: Binary data


Re: Finding an available fss device

2018-08-14 Thread J. Hannken-Illjes


> On 13. Aug 2018, at 19:25, Emmanuel Dreyfus  wrote:
> 
> On Mon, Aug 13, 2018 at 11:56:45AM +, Taylor R Campbell wrote:
>> Unless I misunderstand fss(4), this is an abuse of mutex(9): nothing
>> should sleep while holding the lock, so that nothing trying to acquire
>> the lock will wait for a long time.
> 
> Well, the cause is not yet completely clear to me, but the user 
> experience is terrible. The first time I used it, I thought
> the system crashed, because fssconfig -l was just hung for 
> hours.
> 
> And it is very easy to achieve a situation
> are in tstile awaiting a vnode lock for a name lookup.

I see two problems here.

1) File system internal snapshots take long to create or destroy on
   large file systems.  I have no solution to this problem.

   Using file system external snapshots for dumps should work fine.

2) Fss devices block in ioctl while a snapshot gets created or
   destroyed.  A possible fix is to replace the current
   active/non-active state with idle/creating/active/destroying
   and to change FSSIOCSET to

   mutex_enter(&sc->sc_lock);
   if (sc->sc_state != FSS_IDLE) {
           mutex_exit(&sc->sc_lock);
           return EBUSY;
   }
   sc->sc_state = FSS_CREATING;
   mutex_exit(&sc->sc_lock);

   error = fss_create_snapshot();

   mutex_enter(&sc->sc_lock);
   if (error)
           sc->sc_state = FSS_IDLE;
   else
           sc->sc_state = FSS_ACTIVE;
   mutex_exit(&sc->sc_lock);

   return error;

   Problem here is backwards compatibility.  I have no idea what to
   return for FSSIOCGET when the state is creating or destroying.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Finding an available fss device

2018-08-13 Thread J. Hannken-Illjes


> On 13. Aug 2018, at 09:53, Emmanuel Dreyfus  wrote:
> 
> On Sun, Aug 12, 2018 at 10:16:48AM +0200, J. Hannken-Illjes wrote:
>> While creating a snapshot "/mount0" lookup "/mount0/file", it will block
>> as "/mount0" is suspended.  The lookup holds a lock on "/".
>> 
>> Now snapshot "/ "and trying to suspend "/" will block as the lookup
>> has the root vnode locked.
> 
> This scenario is not the same as the one I asked about, which
> was: performing a snapshot of filesystem mounted on /mount0 
> using /dev/fss0 and a snapshot of filesystem mounted on /mount1
> using /dev/fss1 while the first one is still active. Is there some
> deadlock in this case?

Still not sure we are talking about the same thing.

1) Create snapshot of /mount0 with fss0
1a) Open /dev/fss0
1b) Ioctl FSSIOCSET on /dev/fss0 to create the snapshot
1c) Read data from /dev/fss0
1d) Ioctl FSSIOCCLR on /dev/fss0 to delete the snapshot
1e) Close /dev/fss0

The same for a snapshot of /mount1 with fss1.

2) Create snapshot of /mount1 with fss1
2a) Open /dev/fss1
2b) Ioctl FSSIOCSET on /dev/fss1 to create the snapshot
2c) Read data from /dev/fss1
2d) Ioctl FSSIOCCLR on /dev/fss1 to delete the snapshot
2e) Close /dev/fss1

All operations are mutually exclusive: we always run exactly
one of 1), 1a) ... 2e); a second operation will block until
it gets exclusive access.

Of these operations, 1b), 1d), 2b) and 2d) may take a long time
to run if the snapshot is file system internal.

> But you also raise a deadlock scenario for which there is no
> protection in currentn code. I already experienced it in the
> past and it would be fair to return EGAIN rather than letting the
> administrator set a snapshot that will kill the system later.

This scenario is protected by mutual exclusion of fss device
operations as explained above.  Creating the snapshot of "/"
waits for the creation of the snapshot of "/mount0" to finish.

Additionally VFS_SUSPEND is always exclusive (see mutex
vfs_suspend_lock) as it gets used from mount() and unmount() too.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Finding an available fss device

2018-08-12 Thread J. Hannken-Illjes



> On 12. Aug 2018, at 10:07, Emmanuel Dreyfus  wrote:
> 
> On Sun, Aug 12, 2018 at 09:55:27AM +0200, J. Hannken-Illjes wrote:
>>> You mean you cannot at the same time snapshot /mount0 on fss0 and 
>>> /mount1 on fss1?
>> 
>> Yes, you have to create the snapshot on /mount0 and once it has been
>> created you create the snapshot on /mount1.
> 
> Where is that limitation? I would not exepect to get such a limitation
> when working on both a different mount and fss device.

The simplest deadlock is:

While creating a snapshot of "/mount0", a lookup of "/mount0/file" will block
as "/mount0" is suspended.  The lookup holds a lock on "/".

Now snapshot "/": trying to suspend "/" will block as the lookup
has the root vnode locked.

There are too many other deadlock scenarios for this to ever work.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Finding an available fss device

2018-08-12 Thread J. Hannken-Illjes



> On 12. Aug 2018, at 03:58, Emmanuel Dreyfus  wrote:
> 
> On Sat, Aug 11, 2018 at 10:33:04AM +0200, J. Hannken-Illjes wrote:
>> When fssconfig "hangs" the dump is creating a snapshot.  Creating
>> a snapshot (and suspending a file system) is serialized.  Allowing
>> more than one file system suspension at a time will deadlock most
>> of the time.
> 
> You mean you cannot at the same time snapshot /mount0 on fss0 and 
> /mount1 on fss1?

Yes, you have to create the snapshot on /mount0 and once it has been
created you create the snapshot on /mount1.

The snapshot on /mount0 is already usable while you create the second
snapshot on /mount1.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Finding an available fss device

2018-08-11 Thread J. Hannken-Illjes



> On 10. Aug 2018, at 15:46, Emmanuel Dreyfus  wrote:
> 
> Hello
> 
> How are user processes supposed to find an unused fss device?
> In dump(8) code,  there is an iteration on /dev/rfss* trying to
> performan an ioctl FSSIOCSET. The code tests for EBUSY on failure,
> but in my experience that struggles to happen: if the device is 
> already in use, the ioctl will sleep in the kernel for ages before 
> getting a reply.
> 
> This is something I can even experience with fssconfig -l, which 
> hangs for a while if dump is running.
> 
> Is there another way? I thought about searching vnode in kernel to 
> check if the device is already used by someone else, but that looks 
> overkill. 
> 
> Perhaps the right way is to add a FSSIOBUSY ioctl that would
> use mutex_tryenter and return EBUSY if the device is in use?

When fssconfig "hangs" the dump is creating a snapshot.  Creating
a snapshot (and suspending a file system) is serialized.  Allowing
more than one file system suspension at a time will deadlock most
of the time.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Performance of VSTATE_ASSERT_UNLOCKED

2017-09-17 Thread J. Hannken-Illjes

> On 12. Sep 2017, at 20:52, Joerg Sonnenberger <jo...@bec.de> wrote:
> 
> Hello Juergen,
> the new VSTATE_ASSERT_UNLOCKED code you added to genfs_lock is very
> heavy as it significantly increases lock contention for something
> already highly contended. I'd like to push the lock down to critical
> cases and try to avoid it for common cases.
> 
> Thoughts?
> 
> Joerg
> 

Looks good to me, cannot test anything as I am on vacation.

Please commit and request pullup to -8.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Problem with lots of (perhaps too much?) memory

2017-05-17 Thread J. Hannken-Illjes

> On 17. May 2017, at 08:59, Paul Goyette <p...@whooppee.com> wrote:
> 
> On Wed, 17 May 2017, J. Hannken-Illjes wrote:
> 
>>>> Chances are very high that you are hitting the free page lock contention
>>>> bug.
>>> 
>>> Further observation:
>>> 
>>> 1. A run of lockstat shows that there's lots of activity on
>>>  mntvnode_lock from vfs_insmntque
>>> 
>>> 2. The problem can be triggered simply by running 'du -s' on a
>>>  file system with lots of files (for example, after having run
>>>  'build.sh release' for ~40 different architectures, and having
>>>  kept ALL of the obj/*, dest/*, and release/* output files).
>>> 
>>> 3. The ioflush thread _never_ finishes.  Even 12 hours after the
>>>  trigger, and after an 8-hour sleep window doing nothing (other
>>>  than receiving a couple dozen Emails), the ioflush thread is
>>>  still using 5-10% of one CPU core/thread.
>>> 
>>> 4. If I umount the trigger file system, ioflush time goes to near-
>>>  zero.  I can remount without problem, however shortly after
>>>  re-running the 'du -s' command the problem returns.
>>> 
>>> There was a comment on IRC that yamt@ had been working on a problem where 
>>> "ioflush wastes a lot of CPU time when there are lots of vnodes" seems to 
>>> describe this situation.  Unfortunately, it seems that yamt never finished 
>>> working on the problem.  :(
>> 
>> Two questions:
>> 
>> - What is the size of your vnode cache (sysctl kern.maxvnodes)?
> 
> kern.maxvnodes = 6706326

What happens if you lower it to 1258268 for example?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Problem with lots of (perhaps too much?) memory

2017-05-17 Thread J. Hannken-Illjes

> On 16. May 2017, at 23:25, Paul Goyette <p...@whooppee.com> wrote:
> 
> On Tue, 16 May 2017, Joerg Sonnenberger wrote:
> 
>> On Tue, May 16, 2017 at 08:12:20AM +0800, Paul Goyette wrote:
>>> * xosview frequently reports that the system is running at near-100%
>>>  in SYS time.
>> 
>> Chances are very high that you are hitting the free page lock contention
>> bug.
> 
> Further observation:
> 
> 1. A run of lockstat shows that there's lots of activity on
>   mntvnode_lock from vfs_insmntque
> 
> 2. The problem can be triggered simply by running 'du -s' on a
>   file system with lots of files (for example, after having run
>   'build.sh release' for ~40 different architectures, and having
>   kept ALL of the obj/*, dest/*, and release/* output files).
> 
> 3. The ioflush thread _never_ finishes.  Even 12 hours after the
>   trigger, and after an 8-hour sleep window doing nothing (other
>   than receiving a couple dozen Emails), the ioflush thread is
>   still using 5-10% of one CPU core/thread.
> 
> 4. If I umount the trigger file system, ioflush time goes to near-
>   zero.  I can remount without problem, however shortly after
>   re-running the 'du -s' command the problem returns.
> 
> There was a comment on IRC that yamt@ had been working on a problem where 
> "ioflush wastes a lot of CPU time when there are lots of vnodes" seems to 
> describe this situation.  Unfortunately, it seems that yamt never finished 
> working on the problem.  :(

Two questions:

- What is the size of your vnode cache (sysctl kern.maxvnodes)?

- Is the vdrain thread using CPU too?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Add a mountlist iterator (round 2)

2017-04-10 Thread J. Hannken-Illjes

> On 9. Apr 2017, at 19:16, Taylor R Campbell 
> <campbell+netbsd-tech-k...@mumble.net> wrote:
> 
>> Date: Sun, 9 Apr 2017 18:47:25 +0200
>> From: "J. Hannken-Illjes" <hann...@eis.cs.tu-bs.de>
>> 
>>> On 6. Apr 2017, at 11:44, J. Hannken-Illjes <hann...@eis.cs.tu-bs.de> wrote:
>>> Good hint.  Prepared a partial implementation of
>>> 
>>> int
>>> mountlist_iterate(int (*cb)(struct mount *, void *), void *arg)
>>> [...]
>>> Diffs are here: https://www.netbsd.org/~hannken/ml_iterator2a/
>>> 
>>> I prefer this API, opinions?
>> 
>> If no one objects I will commit this iterator on Tuesday, April 11.
> 
> I tend to prefer an iterator-style API over a callback-style API where
> it works.  (E.g., obviously xcall(9) can't be done with an iterator-
> style API.)  I like to avoid writing my code upside-down unless it's
> really necessary.  If you want a callback-style API for some reason,
> you can always write that in terms of an iterator-style API, but not
> vice versa.
> 
> Note the suggestion I made earlier about mountlist_iterator_next_done
> isn't necessary as long as anyone trying to break the loop and keep
> using the mount does something extra to acquire a reference:
> 
> mountlist_iterator_begin();
> while ((mp = mountlist_iterator_next()) != NULL) {
>   dostuff(mp);
>   if (interesting(mp)) {
>   vfs_busy(mp);
>   break;
>   }
> }
> mountlist_iterator_end();
> 
> All you have to do to make this work is store the last mp in the
> iterator too, so that mountlist_iterator_next can release it before
> acquiring the next one.
> 
> But I'm not the one doing the work here, so this is not an objection!

... and I am open to both approaches, I just want it done :-)

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Add a mountlist iterator (round 2)

2017-04-09 Thread J. Hannken-Illjes

> On 6. Apr 2017, at 11:44, J. Hannken-Illjes <hann...@eis.cs.tu-bs.de> wrote:
> 
>> 
>> On 5. Apr 2017, at 05:14, Chuck Silvers <c...@chuq.com> wrote:
>> 
>> have you considered a callback-based interface where the loop
>> is inside the iteration API rather than in the caller?
>> the vfs_busy/unbusy could also be hidden in the iteration API
>> so that the callback would not need need to worry about it at all.
>> 
>> I looked at how the iteration stuff is used in your patches
>> (the first one, I haven't looked in the latest one in detail)
>> and it looks like most of users would be fine with not being able
>> to do anything after vfs_unbusy().  the one place that really does
>> want to do more after vfs_unbusy() is vfs_unmountall1(),
>> but that could be rewritten to loop unmounting the last mount
>> in the list until they're all gone.
>> 
>> I think this would further simplify most (if not all) cases of
>> mountlist iteration.
> 
> Good hint.  Prepared a partial implementation of
> 
> int
> mountlist_iterate(int (*cb)(struct mount *, void *), void *arg)
> 
> to take "mp" busy and call "cb(mp, arg)" for all mounted file systems.
> 
> A non waiting variant "mountlist_iterate_nowait()" could be added
> when it gets needed.
> 
> Changed "vfs_unmountall1()" to retrieve mounts in descending
> generation order which has the additional benefit that we don't
> need reverse traversals any more.
> 
> Diffs are here: https://www.netbsd.org/~hannken/ml_iterator2a/
> 
> I prefer this API, opinions?

If no one objects I will commit this iterator on Tuesday, April 11.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Add a mountlist iterator (round 2)

2017-04-06 Thread J. Hannken-Illjes

> On 5. Apr 2017, at 05:14, Chuck Silvers <c...@chuq.com> wrote:
> 
> have you considered a callback-based interface where the loop
> is inside the iteration API rather than in the caller?
> the vfs_busy/unbusy could also be hidden in the iteration API
> so that the callback would not need need to worry about it at all.
> 
> I looked at how the iteration stuff is used in your patches
> (the first one, I haven't looked in the latest one in detail)
> and it looks like most of users would be fine with not being able
> to do anything after vfs_unbusy().  the one place that really does
> want to do more after vfs_unbusy() is vfs_unmountall1(),
> but that could be rewritten to loop unmounting the last mount
> in the list until they're all gone.
> 
> I think this would further simplify most (if not all) cases of
> mountlist iteration.

Good hint.  Prepared a partial implementation of

int
mountlist_iterate(int (*cb)(struct mount *, void *), void *arg)

to take "mp" busy and call "cb(mp, arg)" for all mounted file systems.

A non waiting variant "mountlist_iterate_nowait()" could be added
when it gets needed.

Changed "vfs_unmountall1()" to retrieve mounts in descending
generation order which has the additional benefit that we don't
need reverse traversals any more.

Diffs are here: https://www.netbsd.org/~hannken/ml_iterator2a/

I prefer this API, opinions?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Add a mountlist iterator (round 2)

2017-04-04 Thread J. Hannken-Illjes
Time to start a second round.  This time the iterator only, vfs_busy()
and friends deferred to a new thread once the iterator is done.

Changes from the previous proposal:

- Removed the "flags" argument from mountlist_iterator_next(), will
  add mountlist_iterator_trynext() and vfs_trybusy() when needed.

- Added "void mountlist_iterator_done(mount_iterator_t *)" to
  completely hide vfs_busy()/vfs_unbusy() from the API and make it
  possible to assert that every call to mountlist_iterator_next()
  gets accompanied by a call to mountlist_iterator_done().

- Use the existing lock "mountlist_lock".

A typical usage is:

mount_iterator_t *iter;

mountlist_iterator_init(&iter);
while ((mp = mountlist_iterator_next(iter)) != NULL) {
... 
mountlist_iterator_done(iter);
}
mountlist_iterator_destroy(iter);  

Diffs are here: https://www.netbsd.org/~hannken/ml_iterator2/

Comments or objections anyone?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: locking from VOP_INACTIVE to VOP_RECLAIM

2017-04-02 Thread J. Hannken-Illjes

> On 2. Apr 2017, at 18:19, Taylor R Campbell 
> <campbell+netbsd-tech-k...@mumble.net> wrote:
> 
>> Date: Sun, 2 Apr 2017 16:34:20 +0200
>> From: "J. Hannken-Illjes" <hann...@eis.cs.tu-bs.de>
>> 
>> Looks like your proposal needs some clarification regarding "vnode lock"
>> and "the lock".
>> 
>> We have two vnode related locks:
>> 
>> - the "vnode lock", located as field "vi_lock" in "struct vnode_impl"
>>  used by genfs_*lock() aka VOP_*LOCK.
>> 
>> - the "genfs lock" located as field "g_lock" in "struct genfs_node"
>>  used by genfs_node_*lock().
>> 
>> Which lock do you want to be held across VOP_INACTIVE and destroyed
>> from VOP_RECLAIM?
> 
> Aha!  That would explain why I had trouble tracking down why an assert
> I was planning to add kept firing.  I was thinking that the genfs node
> lock and the vnode lock were one and the same.
> 
> My proposal was about the vnode lock, not the genfs lock.  Since
> VOP_RECLAIM does *not* destroy the vnode lock, and must leave it
> either locked or unlocked, I suggest leaving it locked -- it's always
> easier to think about subroutines that preserve the state of a lock
> (locked to locked, or unlocked to unlocked) rather than changing it
> (locked to unlocked or unlocked to locked).

Looks good to me so far, and please keep in mind that VOP_RECLAIM
must not drop/acquire the lock as it will never get it back for a
vnode in state RECLAIMING.

> (As an aside, we may have some bugs in ffs concerning use of the
> struct inode::i_size field -- some callers seem to think access is
> serialized by the genfs lock, while others, e.g. ufs_readwrite.c, seem
> to use the vnode lock.  This needs to be documented more clearly...)

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Add a mountlist iterator

2017-04-02 Thread J. Hannken-Illjes

> On 2. Apr 2017, at 16:34, Taylor R Campbell 
> <campbell+netbsd-tech-k...@mumble.net> wrote:
> 
>> Date: Sun, 2 Apr 2017 11:09:49 +0200
>> From: "J. Hannken-Illjes" <hann...@eis.cs.tu-bs.de>
>> 
>>> On 1. Apr 2017, at 23:03, Taylor R Campbell 
>>> <campbell+netbsd-tech-k...@mumble.net> wrote:
>>> 
>>> Any particular reason to use a pointer-to-opaque-pointer here instead
>>> of a caller-allocated struct mountlist_iterator object?
>> 
>> Just to make it opaque to the caller.  I see no reason to make the
>> mountlist internals part of the kernel API and expose them down to
>> the file system implementations.
> 
> The caller-allocated struct mountlist_iterator could be made to
> contain just a single pointer as the current API uses.
> 
> *shrug* Not a big deal either way.
> 
>> As _mountlist_next() exists for DDB only (no locks, no allocs) I see
>> no problems here.
> 
> One might be concerned in ddb with having to chase many pointers in a
> possibly corrupt kernel memory state, so there is some small advantage
> to avoiding list traversal.

Not from the DDB prompt but from operations like "printlockedvnodes()"
or "fstrans_dump()".  If the mountlist is corrupt you're lost.

> 
>>   Embedding the mountlist_entry in "struct mount"
>> forces us to use a "struct mount" as a marker and this struct
>> has a size of ~2.5 kBytes.
> 
> By `embed' I didn't mean that the marker has to be a whole struct
> mount.  Rather, something more like:
> 
>   struct mountlist_entry {
>   TAILQ_ENTRY(mountlist_entry)    mle_tqe;
>   enum { MNTENT_MARKER, MNTENT_MOUNT }    mle_tag;
>   };
> 
>   struct mount {
>   ...
>   struct mountlist_entry mnt_entry;
>   ...
>   };
> 
>   TAILQ_HEAD(, mountlist_entry) mount_list;
> 
> To recover the struct mount from a struct mountlist_entry that is
> tagged as a mount, simply do
> 
>   struct mountlist_entry *mle = ...;
>   struct mount *mp = container_of(mle, struct mount, mnt_entry);

Any reason you want "mle_tqe" and "mle_tag" public?

>>> Candidate for cacheline-aligned struct?  Why do we need a new lock
>>> mount_list_lock for this instead of using the existing mountlist_lock?
>> 
>> For the transition it is better to use a separate lock.  After the
>> transition "mountlist_lock" gets removed.
> 
> OK, fair enough.  Can you add a comment over the declaration stating
> that plan?
> 
>>> Beware: TAILQ_PREV (and TAILQ_LAST) violates strict aliasing rules, so
>>> we should probably try to phase it out rather than add new users.
>>> Maybe we should add a new kind of queue(3) that doesn't violate strict
>>> aliasing but supports *_PREV and *_LAST.
>>> 
>>> All that said, it is not clear to me why you need reverse iteration
>>> here.  None of your patches seem to use it.  Are you planning to use
>>> it for some future purpose?
>> 
>> Historically the mountlist gets traversed in mount order.  Some syscalls
>> return in this order so it should be kept.
>> 
>> Unmounting all file system on shutdown has to be done in reverse mount
>> order so here (vfs_unmount_forceone / vfs_unmountall1) we need it.
> 
> I see -- your patch just hasn't addressed those yet.  Maybe Someone^TM
> should just write a replacement for TAILQ so we can use it here now.

File 002_ml_iterator_switch around line 453 and 487.

>>> Why do we need a new refcnt dance here?  Why isn't it enough to just
>>> use vfs_busy like all the existing callers?
>> 
>> Because vfs_busy() expects the caller to already hold a reference and
>> one of the goals here is to drop the mount_list_lock as early as possible.
> 
> I suppose you want to avoid setting down a lock order for
> mount_list_lock and mp->mnt_unmounting.  I guess this doesn't matter
> much since it's limited to vfs internals.
> 
>>> Can you make another separate commit to drop the keepref parameter to
>>> vfs_unbusy, and change all the callers with keepref=true to just use
>>> vfs_destroy themselves?
>> 
>> You mean the opposite (keepref=false -> caller does vfs_destroy), yes?
>> 
>> Of the 81 usages of vfs_unbusy() only 6 have keepref=true, all others
>> would have to be changed to "vfs_unbusy(mp); vfs_destroy(mp);".
> 
> Heh.  This confusion on my part suggests two things:
> 
> 1. Perhaps in this change, we ought to add a different function for
> use with

Re: locking from VOP_INACTIVE to VOP_RECLAIM

2017-04-02 Thread J. Hannken-Illjes

> On 2. Apr 2017, at 16:05, Taylor R Campbell 
> <campbell+netbsd-tech-k...@mumble.net> wrote:
> 
>> Date: Sun, 2 Apr 2017 10:47:42 +0200
>> From: "J. Hannken-Illjes" <hann...@eis.cs.tu-bs.de>
>> 
>>> On 1. Apr 2017, at 20:41, Taylor R Campbell 
>>> <campbell+netbsd-tech-k...@mumble.net> wrote:
>>> VOP_RECLAIM then *destroys* the lock, so we don't have to say whether
>>> it is held or released on exit.  This requires a one-line change to
>>> vfs_vnode.c, and then mechanically moving VOP_UNLOCK from *_inactive
>>> to *_reclaim throughout the file systems.
>> 
>> Changing VOP_RECLAIM to destroy the lock forces a complete
>> rewrite of the vnode lock mechanism.
> 
> Huh?  That part is not a change -- VOP_RECLAIM already destroys the
> lock.  E.g., ffs_reclaim calls genfs_node_destroy.
> 
> I just added that sentence (`VOP_RECLAIM then destroys the lock...')
> to emphasize that my proposed change doesn't need to address the state
> of the lock after VOP_RECLAIM returns.
> 
> Note that the only call to VOP_RECLAIM in the system immediately
> follows VOP_INACTIVE, in vcache_reclaim, so that call actually
> requires no changes.  Only the other call to VOP_INACTIVE, in vrelel,
> requires a change, to add a following VOP_UNLOCK.

Looks like your proposal needs some clarification regarding "vnode lock"
and "the lock".

We have two vnode related locks:

- the "vnode lock", located as field "vi_lock" in "struct vnode_impl"
  used by genfs_*lock() aka VOP_*LOCK.

- the "genfs lock" located as field "g_lock" in "struct genfs_node"
  used by genfs_node_*lock().

Which lock do you want to be held across VOP_INACTIVE and destroyed
from VOP_RECLAIM?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Add a mountlist iterator

2017-04-02 Thread J. Hannken-Illjes

> On 1. Apr 2017, at 23:03, Taylor R Campbell 
> <campbell+netbsd-tech-k...@mumble.net> wrote:
> 
>> Date: Thu, 30 Mar 2017 11:21:41 +0200
>> From: "J. Hannken-Illjes" <hann...@eis.cs.tu-bs.de>
>> 
>> Plan is to untangle the mountlist processing from vfs_busy() / vfs_unbusy()
>> and add an iterator for the mountlist:
> 
> Generally seems like a good idea to me!
> 
>> - void mountlist_iterator_init(mount_iterator_t **)
>>  to initialize an iterator in mounted order.
> 
> Any particular reason to use a pointer-to-opaque-pointer here instead
> of a caller-allocated struct mountlist_iterator object?

Just to make it opaque to the caller.  I see no reason to make the
mountlist internals part of the kernel API and expose them down to
the file system implementations.

> struct mountlist_iterator it;
> 
> mountlist_iterator_init(&it);
> while ((mp = mountlist_iterator_next(&it, ...)) != NULL)
>   ...
> mountlist_iterator_destroy(&it);
> 
> Then there's no need to kmem_alloc any intermediate data structures.
> 
> Embedding the struct mountlist_entry in struct mount would obviate the
> need for _mountlist_next to iterate over the entire list from the
> start, too.

As _mountlist_next() exists for DDB only (no locks, no allocs) I see
no problems here.  Embedding the mountlist_entry in "struct mount"
forces us to use a "struct mount" as a marker and this struct
has a size of ~2.5 kBytes.

> Some in-line comments on the diff.  General comment: can you make your
> diffs with the `-p' option so it's easier to follow which function
> each diff hunk edits?
> 
>   diff -r dab39dcbcb29 -r ff3f89481f2e sys/kern/vfs_mount.c
>   --- a/sys/kern/vfs_mount.c  Thu Mar 30 11:17:56 2017 +0200
>   +++ b/sys/kern/vfs_mount.c  Thu Mar 30 11:17:56 2017 +0200
>   @@ -119,7 +136,9 @@
>   ...
>   +static TAILQ_HEAD(mountlist, mountlist_entry) mount_list;
>   +static kmutex_t mount_list_lock;
> 
> Candidate for cacheline-aligned struct?  Why do we need a new lock
> mount_list_lock for this instead of using the existing mountlist_lock?

For the transition it is better to use a separate lock.  After the
transition "mountlist_lock" gets removed.

>   @@ -1439,10 +1454,168 @@
>   ...
>   +   if (marker->me_type == ME_MARKER)
>   +   me = TAILQ_NEXT(marker, me_list);
>   +   else
>   +   me = TAILQ_PREV(marker, mountlist, me_list);
> 
> Beware: TAILQ_PREV (and TAILQ_LAST) violates strict aliasing rules, so
> we should probably try to phase it out rather than add new users.
> Maybe we should add a new kind of queue(3) that doesn't violate strict
> aliasing but supports *_PREV and *_LAST.
> 
> All that said, it is not clear to me why you need reverse iteration
> here.  None of your patches seem to use it.  Are you planning to use
> it for some future purpose?

Historically the mountlist gets traversed in mount order.  Some syscalls
return in this order so it should be kept.

Unmounting all file system on shutdown has to be done in reverse mount
order so here (vfs_unmount_forceone / vfs_unmountall1) we need it.

>   +   /* Take an initial reference for vfs_busy() below. */
>   +   mp = me->me_mount;
>   +   KASSERT(mp != NULL);
>   +   atomic_inc_uint(&mp->mnt_refcnt);
>   +   mutex_exit(&mount_list_lock);
>   +
>   +   /* Try to mark this mount busy and return on success. */
>   +   if (vfs_busy(mp, NULL) == 0) {
>   +   vfs_destroy(mp);
>   +   return mp;
>   +   }
>   +   vfs_destroy(mp);
>   +   mutex_enter(&mount_list_lock);
> 
> Why do we need a new refcnt dance here?  Why isn't it enough to just
> use vfs_busy like all the existing callers?

Because vfs_busy() expects the caller to already hold a reference and
one of the goals here is to drop the mount_list_lock as early as possible.

> if (vfs_busy(mp, NULL) == 0) {
>   mutex_exit(&mount_list_lock);
>   return mp;
> }
> 
> Seems to me we ought to be removing code that names mnt->mnt_refcnt
> specifically, rather than adding more code.
> 
> (Then we could also move the `mutex_exit; return;' out of the for loop
> to make it look a little more symmetric in mountlist_iterator_next.)
> 
>   diff -r ff3f89481f2e -r 28fea04e7b15 sys/coda/coda_vfsops.c
>   --- a/sys/coda/coda_vfsops.cThu Mar 30 11:17:56 2017 +0200
>   +++ b/sys/coda/coda_vfsops.cThu Mar 30 11:17:57 2017 +0200
>   @@ -624,23 +624,20 @@
>   -mutex_enter(&mountlist_lock);
>   -TAILQ_FOREACH(mp, &mountlist, mnt_list) {
>   -   if ((!strcmp(mp->mnt_op->vfs_nam

Re: locking from VOP_INACTIVE to VOP_RECLAIM

2017-04-02 Thread J. Hannken-Illjes

> On 1. Apr 2017, at 20:41, Taylor R Campbell 
> <campbell+netbsd-tech-k...@mumble.net> wrote:
> 
> Currently:
> 
> - For VOP_INACTIVE, vnode lock is held on entry and released on exit.
> - For VOP_RECLAIM, vnode lock is not held on entry.
> 
> I would like to change this so that:
> 
> - For VOP_INACTIVE, vnode lock is held on entry and exit.
> - For VOP_RECLAIM, vnode lock is held on entry.

Changing VOP_INACTIVE to keep the vnode locked looks good to me.

> VOP_RECLAIM then *destroys* the lock, so we don't have to say whether
> it is held or released on exit.  This requires a one-line change to
> vfs_vnode.c, and then mechanically moving VOP_UNLOCK from *_inactive
> to *_reclaim throughout the file systems.

Changing VOP_RECLAIM to destroy the lock forces a complete
rewrite of the vnode lock mechanism.

Currently we have genfs_lock() that takes care to not lock
a reclaimed vnode and genfs_deadlock() that locks unconditionally
if called with LK_RETRY.

Before VOP_RECLAIM we use genfs_lock() and after VOP_RECLAIM
we change the operations vector to dead_vnodeop_p and use
genfs_deadlock().

Operation vn_lock() then takes care not to return a dead vnode unless
it gets called with LK_RETRY.

If this is to make the locked state during VOP_RECLAIM assertable
it should be changed to either "locked on entry, locked on return"
or "locked on entry, unlocked on return".

> I did a quick survey of every VOP_INACTIVE in src/sys.  With one
> exception, every one either (a) does VOP_UNLOCK just before return, or
> (b) defers to another VOP_INACTIVE method.  The one exception is
> nfs_inactive, which also queues something up on a workqueue, but I
> can't imagine that it matters whether the vnode lock is held.
> 
> Benefit: Functions that VOP_RECLAIM calls would be able to assert,
> rather than vaguely assume, exclusive ownership of the vnode lock.
> Currently various code paths, e.g. ffs_update, require exclusive
> access to the vnode in question, which can't be asserted right now
> because of the VOP_RECLAIM exception.
> 
> Thoughts?  Objections?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Add a mountlist iterator

2017-03-30 Thread J. Hannken-Illjes
Currently vfs_busy() / vfs_unbusy() get used to

- Enter/leave a critical section against unmounting

- Add a reference to the mount

- return the next mount from the mountlist on error

Plan is to untangle the mountlist processing from vfs_busy() / vfs_unbusy()
and add an iterator for the mountlist:

- void mountlist_iterator_init(mount_iterator_t **)
  to initialize an iterator in mounted order.

- void mountlist_iterator_init_reverse(mount_iterator_t **)
  to initialize an iterator in reverse mounted order.

- void mountlist_iterator_destroy(mount_iterator_t *)
  to destroy an iterator acquired with one of the above.

- struct mount *mountlist_iterator_next(mount_iterator_t *, int)
  to retrieve the next mount from an iterator or NULL if done.
  The mount is returned vfs_busy'd, caller has to vfs_unbusy.
  Accepts a flag MNT_NOWAIT to skip mounts whose unmount is in progress.

- struct mount *_mountlist_next(struct mount *)
  a lockless variant to be used from DDB only.
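
For illustration, a caller would then look roughly like this (sketch
only; "do_something" stands for the caller's real work):

        mount_iterator_t *iter;
        struct mount *mp;

        mountlist_iterator_init(&iter);
        while ((mp = mountlist_iterator_next(iter, MNT_NOWAIT)) != NULL) {
                /* "mp" is returned vfs_busy'd here. */
                do_something(mp);
                vfs_unbusy(mp, false, NULL);
        }
        mountlist_iterator_destroy(iter);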

Diffs are here: https://www.netbsd.org/~hannken/ml_iterator/

Comments or objections anyone?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Remount read-only and fstrans

2017-02-19 Thread J. Hannken-Illjes

> On 19 Feb 2017, at 18:08, Christos Zoulas <chris...@astron.com> wrote:
> 
> In article <f1e96c92-0e24-40b1-b500-92cac4452...@eis.cs.tu-bs.de>,
> J. Hannken-Illjes <hann...@eis.cs.tu-bs.de> wrote:
>> 
>> Comments or objections anyone?
> 
> Will that detect open for write file descriptors to removed files?

This is the job of vflush().  Since Rev. 1.46 of vfs_mount.c (2017/01/27)
it handles both open for read and open for write descriptors to
unlinked files.

The corresponding test is tests/fs/vfs/t_rwtoro.c

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Remount read-only and fstrans

2017-02-19 Thread J. Hannken-Illjes
Updating a mounted file system from read-write to read-only is
racy as it is not clear when a mounted file system becomes read-only.
Currently we set MNT_RDONLY before we call VFS_MOUNT().  If
VFS_MOUNT() fails, some operations may have seen the mounted file
system as read-only although it never actually was.  Even if we
set MNT_RDONLY in xxx_mount() after a successful vflush() there
would remain a window where operations would miss MNT_RDONLY.

Plan is to suspend the file system while the mounted file system
gets updated.  This way no operations run on the mounted file
system during update and all operations see the state before or
after the update.
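
Roughly, the update path in the mount code would then look like this
(simplified sketch, not the literal diff):

        error = vfs_suspend(mp, 0);
        if (error != 0)
                return error;

        /* No vnode operation can run on "mp" while it is suspended. */
        error = VFS_MOUNT(mp, path, data, &data_len);   /* MNT_UPDATE case */

        vfs_resume(mp);
        return error;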

Vfs_suspend() with fstrans_start()/fstrans_done() is a mechanism
to give one thread exclusive access to a mounted file system.
Other threads trying to run vnode operations on this mounted file
system will stall until the file system resumes.  On a non suspended
file system fstrans_start()/fstrans_done() work without locks or
other atomic operations.

Currently fstrans_start()/fstrans_done() is mostly used inside the
file system.  This has drawbacks: first, it is too late to change the
operations vector if the vnode gets revoked, and second, it does not
scale well to support it for all file systems.

Plan is to move fstrans_start()/fstrans_done() into vnode_if.c and
enable vfs_suspend() on all file systems.
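
In effect every generated VOP_*() stub would bracket the call like the
following hand-written sketch (simplified; the real stubs come from
vnode_if.src and the exact fstrans_start() argument convention is an
assumption here):

int
VOP_READ(struct vnode *vp, struct uio *uio, int ioflag, kauth_cred_t cred)
{
        struct vop_read_args a;
        int error;

        a.a_desc = VDESC(vop_read);
        a.a_vp = vp;
        a.a_uio = uio;
        a.a_ioflag = ioflag;
        a.a_cred = cred;

        fstrans_start(vp->v_mount, FSTRANS_SHARED); /* blocks while suspended */
        error = VCALL(vp, VOFFSET(vop_read), &a);
        fstrans_done(vp->v_mount);

        return error;
}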

Diffs are here: https://www.netbsd.org/~hannken/vnode_if/

Comments or objections anyone?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Remove vget from global namespace

2016-12-28 Thread J. Hannken-Illjes
With the global vnode cache there is no longer a need for vget() (take
the first reference on a vnode) to reside in global namespace.
 
1) Rename vget() to vcache_vget() and vcache_tryvget() respectively and
   move the definitions to sys/vnode_impl.h.
   
2) Vget() increments the v_usecount before it checks the vnode is stable.
   This behaviour makes it impossible to assert v_usecount being zero
   or one as v_usecount also reflects other threads trying to vget().
   
   Change vcache_vget() to increment v_holdcnt to prevent the vnode from
   disappearing while vcache_vget() waits for a stable state, and to
   free the vnode if this hold was the last reference to a reclaimed vnode
   (a rough sketch of this follows after this list).
   
3) Add some "v_usecount == 1" assertions.
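
A very rough sketch of the flow described in 2) (locking simplified, the
counters are really manipulated with atomics, and vcache_free() is only a
placeholder name for the final disposal):

int
vcache_vget(vnode_t *vp)
{

        KASSERT(mutex_owned(vp->v_interlock));

        /* Hold the vnode so it cannot be freed while we wait. */
        vp->v_holdcnt++;
        VSTATE_WAIT_STABLE(vp);
        vp->v_holdcnt--;

        /* If it was reclaimed meanwhile, the caller gets nothing. */
        if (VSTATE_GET(vp) == VS_RECLAIMED) {
                if (vp->v_holdcnt == 0 && vp->v_usecount == 0)
                        vcache_free(VNODE_TO_VIMPL(vp)); /* placeholder */
                else
                        mutex_exit(vp->v_interlock);
                return ENOENT;
        }

        vp->v_usecount++;
        mutex_exit(vp->v_interlock);
        return 0;
}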
 
Diffs are here: https://www.netbsd.org/~hannken/vget/

Comments or objections anyone?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: vrele vs. syncer deadlock

2016-12-22 Thread J. Hannken-Illjes

> On 11 Dec 2016, at 21:01, David Holland <dholland-t...@netbsd.org> wrote:
> 
> On a low-memory machine Nick ran into the following deadlock:
> 
>  (a) rename -> vrele on child -> inactive -> truncate -> getblk ->
>  no memory in buffer pool -> wait for syncer

This prediction seems wrong.  It is not the syncer's job to free
memory to the buffer pool - it will just update and write buffers
ending up on one of the buffer queues.

To release memory to the buffer pool the pagedaemon calls buf_drain()
when it detects a low memory condition.  Looks like this check from
uvm_pageout() fails:

bufcnt = uvmexp.freetarg - uvmexp.free;
if (bufcnt < 0)
bufcnt = 0; 

>  (b) syncer waiting for locked parent vnode from the rename
> 
> This makes it in general unsafe to vrele while holding a vnode lock.
> 
> In theory we could arrange this, but it's not desirable: the tidy
> locking model for directory operations is: lock parent, call into fs,
> fs handles child, unlock parent on return, so needing to unlock the
> parent before releasing the child would bust up the abstraction
> layer. (Not that we currently have this, but we're ostensibly working
> towards it.)
> 
> Does anyone have any suggestions for a patch to address this problem
> without first rewriting the syncer? While the latter is probably a
> good idea, it's not entirely trivial to do and get right, making it
> e.g. a poor pullup candidate...
> 
> -- 
> David A. Holland
> dholl...@netbsd.org

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: vrele vs. syncer deadlock

2016-12-12 Thread J. Hannken-Illjes

> On 11 Dec 2016, at 22:33, Nick Hudson <sk...@netbsd.org> wrote:
> 
> On 12/11/16 21:05, J. Hannken-Illjes wrote:
>>> On 11 Dec 2016, at 21:01, David Holland <dholland-t...@netbsd.org> wrote:
>>> 
>>> On a low-memory machine Nick ran into the following deadlock:
>>> 
>>>  (a) rename -> vrele on child -> inactive -> truncate -> getblk ->
>>>  no memory in buffer pool -> wait for syncer
>>>  (b) syncer waiting for locked parent vnode from the rename

>> Where is the syncer waiting for the parent?
> 
> db>  bt/a  8ff28060
> pid  0.37  at  0x98041096
> 0x980410961bb0:  kernel_text+dc  (0,0,0,0)  ra  803ad484  sz  0
> 0x980410961bb0:  mi_switch+1c4  (0,0,0,0)  ra  803a9ef8  sz  96
> 0x980410961c10:  sleepq_block+b0  (0,0,0,0)  ra  803b8edc  sz  64
> 0x980410961c50:  turnstile_block+2e4  (0,0,0,0)  ra  803a487c  sz 
>  96
> 0x980410961cb0:  rw_enter+17c  (0,0,0,0)  ra  8044862c  sz  112
> 0x980410961d20:  genfs_lock+8c  (0,0,0,0)  ra  8043fd60  sz  48
> 0x980410961d50:  VOP_LOCK+30  (8049d4c8,2,0,0)  ra  
> 80436c8c  sz  48
> 0x980410961d80:  vn_lock+94  (8049d4c8,2,0,0)  ra  
> 803367d8  sz  64
> 0x980410961dc0:  ffs_sync+c8  (8049d4c8,2,0,0)  ra  
> 80428f4c  sz  112
> 0x980410961e30:  sched_sync+1c4  (8049d4c8,2,0,0)  ra  
> 80228dd0  sz  112
> 0x980410961ea0:  mips64r2_lwp_trampoline+18  (8049d4c8,2,0,0)  ra 
>  0  sz  32
> 
> 
> 
>> Which file system?
> 
> ffs

Looks like a bug introduced by myself.  Calling ffs_sync() from the
syncer (MNT_LAZY set) will write back modified inodes only, fsync
is handled by individual synclist entries.

Some time ago I unconditionally removed LK_NOWAIT from vn_lock().
Suppose we need this patch:

RCS file: /cvsroot/src/sys/ufs/ffs/ffs_vfsops.c,v
retrieving revision 1.341
diff -p -u -2 -r1.341 ffs_vfsops.c
--- ffs_vfsops.c20 Oct 2016 19:31:32 -  1.341
+++ ffs_vfsops.c12 Dec 2016 09:45:17 -
@@ -1918,5 +1918,6 @@ ffs_sync(struct mount *mp, int waitfor, 
while ((vp = vfs_vnode_iterator_next(marker, ffs_sync_selector, )))
{
-   error = vn_lock(vp, LK_EXCLUSIVE);
+   error = vn_lock(vp, LK_EXCLUSIVE |
+   (waitfor == MNT_LAZY ? LK_NOWAIT : 0));
if (error) {
vrele(vp);

Is it reproducible so you can test it?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: vrele vs. syncer deadlock

2016-12-11 Thread J. Hannken-Illjes

> On 11 Dec 2016, at 21:01, David Holland <dholland-t...@netbsd.org> wrote:
> 
> On a low-memory machine Nick ran into the following deadlock:
> 
>  (a) rename -> vrele on child -> inactive -> truncate -> getblk ->
>  no memory in buffer pool -> wait for syncer
>  (b) syncer waiting for locked parent vnode from the rename

Do you have full backtrace?
Where is the syncer waiting for the parent?
Which file system? 

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Change vnode free lists to lru lists

2016-12-11 Thread J. Hannken-Illjes
Vnodes are kept on freelists which means every vget() has to remove
them from the list and the last vrele() has to put them back.

Changing the lists to lrulists removes the operation from vget()
and speeds up namei() on cached vnodes by ~3 percent.

Diff is split into three parts for better readability and to help
possible bisection:

1) Remove the redundant argument from vfs_drainvnodes() and merge
   vrele_flush() into vfs_drainvnodes().

2) Rename the freelist members to lrulist members and move them
   to "struct vnode_impl".

3) Actually change the freelists to lrulists, merge vrele_thread()
   into vdrain_thread() and adapt vfs_drainvnodes().

Diffs are here: https://www.netbsd.org/~hannken/lrulist/

Comments or objections anyone?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Vnode namespace - split sys/vnode.h

2016-10-26 Thread J. Hannken-Illjes
Its time to split sys/vnode.h into sys/vnode.h and sys/vnode_impl.h.
 
The latter holds implementation details about the vnode life cycle
and must be included only from kern/vfs_*, miscfs/genfs/* and miscfs/specfs/*.
 
It is sufficient to include vnode_impl.h, it already includes sys/vnode.h.

1) Rename the vcache node to vnode_impl: 
- Rename struct vcache_node to vnode_impl, start its fields with vi_.  
- Rename enum vcache_state to vnode_state, start its elements with VS_.
- Rename macros VN_TO_VP and VP_TO_VN to VIMPL_TO_VNODE and VNODE_TO_VIMPL.
- Add typedef struct vnode_impl vnode_impl_t (layout sketched below).

2) The actual split:
- Split sys/vnode.h into vnode.h and vnode_impl.h.
- Move _VFS_VNODE_PRIVATE protected operations into vnode_impl.h.
- Move struct vnode_impl definition and operations into vnode_impl.h.
- Include vnode_impl.h where we include vnode.h with _VFS_VNODE_PRIVATE defined.
- Get rid of _VFS_VNODE_PRIVATE.

3) Vnode printing:
- Add a function to print the fields of a vnode including its implementation.
- Use it from vprint() and vfs_vnode_print().
- Move vstate_name() to vfs_subr.c.
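
For illustration, the overlay from 1) boils down to something like this
(sketch only; the private members shown here are abbreviated, the exact
set is in the diff):

struct vnode_impl {
        struct vnode    vi_vnode;       /* public part, must come first */
        enum vnode_state vi_state;      /* private life cycle state */
        SLIST_ENTRY(vnode_impl) vi_hash; /* vnode cache hash chain */
        /* ... further private members ... */
};
typedef struct vnode_impl vnode_impl_t;

#define VIMPL_TO_VNODE(node)    (&(node)->vi_vnode)
#define VNODE_TO_VIMPL(vp)      ((vnode_impl_t *)(vp))  /* vi_vnode is first */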

The diffs are here: http://www.netbsd.org/~hannken/vrename/

Comments or objections anyone?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Re: Locking strategy for device deletion (also see PR kern/48536)

2016-06-08 Thread J. Hannken-Illjes

> On 08 Jun 2016, at 13:47, Paul Goyette <p...@whooppee.com> wrote:
> 
> On Wed, 8 Jun 2016, Paul Goyette wrote:
> 
>> On Tue, 7 Jun 2016, Taylor R Campbell wrote:
>> 
>>>  Date: Tue, 7 Jun 2016 18:28:11 +0800 (PHT)
>>>  From: Paul Goyette <p...@whooppee.com>
>>> 
>>>  Can anyone suggest a reliable way to ensure that a device-driver module
>>>  can be _really_ safely detached?
>>> General approach:
>>> 1. Prevent new references (e.g., devsw_detach).
>>> 2. Wait for existing references to drain (or roll back with
>>> devsw_attach if you don't want to wait, and fail with EBUSY).
>>> 3. Unload.
>> 
>> Yes, of course.  Thanks, I think this will work.
> 
> Well, it almost works!  Just one little problem...  :)
> 
> For some reason, the device's open() routine is being called twice, but the 
> close() routine is only called once.  So every iteration leaves the refcount 
> with a net increment of one.
> 
> I cannot figure out why/how the open routine is called twice.

IIRC the device open() operation gets called on every open while
the device close() operation gets called on last close only.

See miscfs/specfs/spec_vnops.c::spec_close().
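
So a per-open reference count kept in the driver has to account for that
asymmetry, roughly like this (sketch only; the names are made up):

static volatile unsigned int mydev_refcnt;      /* hypothetical counter */

static int
mydev_open(dev_t dev, int flag, int mode, struct lwp *l)
{

        atomic_inc_uint(&mydev_refcnt);         /* runs on every open() */
        return 0;
}

static int
mydev_close(dev_t dev, int flag, int mode, struct lwp *l)
{

        /* runs on the last close only, so reset rather than decrement once */
        atomic_swap_uint(&mydev_refcnt, 0);
        return 0;
}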

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Vnode life cycle: add vnode state

2016-05-21 Thread J. Hannken-Illjes

> On 20 May 2016, at 18:31, Taylor R Campbell 
> <campbell+netbsd-tech-k...@mumble.net> wrote:
> 
>   Date: Fri, 20 May 2016 16:38:22 +0200
>   From: J. Hannken-Illjes <hann...@eis.cs.tu-bs.de>
> 
>   1) Objects vnode and vcache_node always exist together so it makes sense
>   to merge them into one object.  This can be done in two ways:
> 
>   - Add the current vcache_node elements (hash list, key and flag) to the
> vnode.
>   - Overlay the vnode into vcache_node.  This way we also get two views
> to vnodes, a public one (struct vnode) visible from file system
> implementations and a private one (struct vcache_node) visible from
> the VFS subsystem (sys/kern/vfs_*) only.
> 
>   I prefer the second way, a diff is here:
> http://www.netbsd.org/~hannken/vstate/001_merge_vnode_and_vcache_node
> 
> Mild preference for putting the fields in struct vnode -- making a
> bigger structure around it feels sneaky.  But I see the motivation to
> avoid exposing more members than are an intended part of the API, as
> has been done for decades to our constant confusion about what is an
> intended part of the API.
> 
> On the other hand, maybe if we do the second option, it should be not
> vcache_node but vnode_private or something, and we can use it to
> incrementally move more members into it over time to make a usefully
> compiler-checked stable VFS API -- perhaps even eventually make struct
> vnode opaque altogether.

Moving at least the list entries from the public to the private view
is on my agenda.
For now the vcache_node is local to vfs_vnode.c.  Once it has to move
to (the new) sys/vnode_private.h I will do the rename to vnode_private.

>   2) The state of a vnode is currently determined by its flags
>   VI_MARKER, VI_CHANGING, VI_XLOCK and VI_CLEAN.  These flags are not
>   sufficient to describe the state a vnode is in.  It is for example
>   impossible to determine if a vnode is currently attaching to a file system.
> 
> Sounds similar to a sketch I wrote a couple years ago.  I like turning
> inscrutable combinations of flags into explicit states and explicitly
> designed transitions between them.  Can't take a close look right now,
> though.  One tiny question:
> 
>   and these operations to work on the vnode state:
> 
>   - VSTATE_GET(vp): Get the current state.
>   - VSTATE_CHANGE(vp, from, to): Change state.
>   - VSTATE_WAIT_STABLE(vp): Wait until the vnode is in stable state, one of
> VN_ACTIVE or VN_RECLAIMED.
>   - VSTATE_ASSERT(vp, state): Assert the current state.
> 
> Why uppercase?  Are these macros?  If so, why not (inline) functions?

These are all macros to get better diagnostics.  In the DIAGNOSTIC
case they expand to functions propagating name and line number of
the caller.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Vnode life cycle: add vnode state

2016-05-20 Thread J . Hannken-Illjes
1) Objects vnode and vcache_node always exist together so it makes sense
to merge them into one object.  This can be done in two ways:

- Add the current vcache_node elements (hash list, key and flag) to the
  vnode.
- Overlay the vnode into vcache_node.  This way we also get two views
  to vnodes, a public one (struct vnode) visible from file system
  implementations and a private one (struct vcache_node) visible from
  the VFS subsystem (sys/kern/vfs_*) only.

I prefer the second way, a diff is here:
  http://www.netbsd.org/~hannken/vstate/001_merge_vnode_and_vcache_node


2) The state of a vnode is currently determined by its flags
VI_MARKER, VI_CHANGING, VI_XLOCK and VI_CLEAN.  These flags are not
sufficient to describe the state a vnode is in.  It is for example
impossible to determine if a vnode is currently attaching to a file system.

I propose to replace these flags with a vnode state, one of:

- VN_MARKER: Used as a marker to traverse lists of vnodes, will never change
  or appear in a regular vnode operation.
- VN_LOADING: Vnode is currently attaching to its file system and loading
  or creating the inode.
- VN_ACTIVE: Vnode is fully initialised and useable.
- VN_BLOCKED: Vnode is active but cannot get new references through vget().
- VN_RECLAIMING: Vnode is in the process of detaching from its file system.
- VN_RECLAIMED: Vnode is no longer attached to its file system, it is dead.

and these operations to work on the vnode state:

- VSTATE_GET(vp): Get the current state.
- VSTATE_CHANGE(vp, from, to): Change state.
- VSTATE_WAIT_STABLE(vp): Wait until the vnode is in stable state, one of
  VN_ACTIVE or VN_RECLAIMED.
- VSTATE_ASSERT(vp, state): Assert the current state.

All operations are protected with vp->v_interlock with one exception: to
change state from VN_LOADING vcache.lock is also needed.  We can't take
vp->v_interlock while the vnode is loading as it may change while
the file system initialises the node, see uvm_obj_setlock().
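
For illustration, the vrelel() path around VOP_INACTIVE() would use these
operations roughly as follows (sketch only; the vnode lock handling and
error paths are omitted):

        bool recycle;

        mutex_enter(vp->v_interlock);
        VSTATE_WAIT_STABLE(vp);
        VSTATE_ASSERT(vp, VN_ACTIVE);
        VSTATE_CHANGE(vp, VN_ACTIVE, VN_BLOCKED);
        mutex_exit(vp->v_interlock);

        VOP_INACTIVE(vp, &recycle);     /* vget() callers wait while BLOCKED */

        mutex_enter(vp->v_interlock);
        VSTATE_CHANGE(vp, VN_BLOCKED, VN_ACTIVE);
        mutex_exit(vp->v_interlock);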

Possible state transitions are:

- (nonexistent) -> VN_LOADING: Node gets allocated in vcache_alloc().
- VN_LOADING -> VN_ACTIVE: Node has been initialised in vcache_get() or
  vcache_new() and is ready to use.
- VN_ACTIVE -> VN_RECLAIMING: Node starts disassociation from underlying
  file system in vclean().
- VN_RECLAIMING -> VN_RECLAIMED: Node finished disassociation from
  underlying file system in vclean().
- VN_ACTIVE -> VN_BLOCKED: Either vcache_rekey*() is changing the vnode key
  or vrelel() is about to call VOP_INACTIVE().
- VN_BLOCKED -> VN_ACTIVE: The blocking condition is over.
- VN_LOADING -> VN_RECLAIMED: Either vcache_get() or vcache_new() failed to
  associate the underlying file system or vcache_rekey*() drops a vnode
  used as placeholder.

A diff adding the state operations is here:
  http://www.netbsd.org/~hannken/vstate/002_vnode_state_operations

A diff implementing the state with these operations is here:
  http://www.netbsd.org/~hannken/vstate/003_use_vnode_state


Comments or objections anyone?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: struct file at VFS level (round 1.5)

2016-05-03 Thread J. Hannken-Illjes

> On 03 May 2016, at 14:20, Emmanuel Dreyfus <m...@netbsd.org> wrote:
> 
> On Tue, May 03, 2016 at 12:05:01PM +0200, Christoph Badura wrote:
>> Didn't I just explain to you that a struct file is never an owner for
>> these locks?  Didn't others explain that to you too?
> 
> Well, I need you to explain me what happens in sys_flock() then.
> Is it a bug?
>error = VOP_ADVLOCK(vp, fp, F_UNLCK, , F_FLOCK);
> 
> fp is struct file. Same thing happens in open_setfp() and closef()

The second argument is an identifier, "struct proc *" if the flags
(the last argument) contain "F_POSIX" and "struct file *" otherwise.

Fcntl(2) uses F_POSIX where flock(2) does not.
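
For illustration (assuming vp, fl, l and fp are in scope), the two flavours
differ only in the identifier and flag passed:

        /* fcntl(2)-style POSIX locks: owner is the process */
        error = VOP_ADVLOCK(vp, l->l_proc, F_SETLK, &fl, F_POSIX);

        /* flock(2)-style locks: owner is the open file */
        error = VOP_ADVLOCK(vp, fp, F_SETLK, &fl, F_FLOCK);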

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: struct file reference at VFS level

2016-04-29 Thread J. Hannken-Illjes

> On 28 Apr 2016, at 10:19, Emmanuel Dreyfus <m...@netbsd.org> wrote:
> 
>> Wouldn't it be better to put the mandatory locking at the file->vnode
>> level, operations vn_read() and vn_write()?
> 
> Do you mean that instead of providing support so that filesystems 
> can implement mandatory locks, we should just implement it above VFS
> for all filesystems? That would work for local filesystems, but I 
> understand it cannot be done for distributed filesystems, as the 
> local kernel knows nothing about locks set by other machines: you 
> need to let the filesystem handle it.

Advisory locks are already implemented inside the file systems and
we have VOP_ADVLOCK() with "op == F_GETLK".  What else is needed to
make the decision here?

>> Could you explain the proposed semantics in detail?
> 
> We have this:
> https://www.kernel.org/doc/Documentation/filesystems/mandatory-locking.txt

I strongly object to an implementation whose specification starts:

  0. Why you should avoid mandatory locking
  -

  The Linux implementation is prey to a number of difficult-to-fix race
  conditions which in practice make it not dependable:

  - The write system call checks for a mandatory lock only once
at its start.  It is therefore possible for a lock request to
be granted after this check but before the data is modified.
A process may then see file data change even while a mandatory
lock was held.
  - Similarly, an exclusive lock may be granted on a file after
the kernel has decided to proceed with a read, but before the
read has actually completed, and the reading process may see
the file data in a state which should not have been visible
to it.
  - Similar races make the claimed mutual exclusion between lock
and mmap similarly unreliable.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: issues to compile a kernel

2015-12-04 Thread J. Hannken-Illjes
On 03 Dec 2015, at 19:58, phipo <phipo.resea...@orange.fr> wrote:
> 
> Hi,
> I working on Bi quad cores opteron (amd64, 8 cores), i want to customize my 
> kernel,
> i've installed NetBSD 7.0rc3, also i download syssrc.gz and installed it.
> When i try to compile tools > like this :
> # ./build.sh tools into the directory /usr/src/.
> i've got issues, this is my ouput :
> -
>^
> In file included from /usr/src/sys/fs/msdosfs/msdosfs_fat.c:72:0:
> /usr/src/tools/makefs/../../usr.sbin/makefs/ffs/buf.h:78:6: note: expected 
> 'int' but argument is of type 'void *'
> int  bread(struct vnode *, daddr_t, int, int, struct buf **);
>  ^
> /usr/src/sys/fs/msdosfs/msdosfs_fat.c:959:8: error: too many arguments to 
> function 'bread'
>NOCRED, 0, );
>^
> In file included from /usr/src/sys/fs/msdosfs/msdosfs_fat.c:72:0:
> /usr/src/tools/makefs/../../usr.sbin/makefs/ffs/buf.h:78:6: note: declared 
> here
> int  bread(struct vnode *, daddr_t, int, int, struct buf **);
>  ^

Looks like your sources got mixed from 7.0 and -current.

On 7.0 file usr.sbin/makefs/ffs/buf.h is revision 1.9 and has:

int bread(struct vnode *, daddr_t, int, struct kauth_cred *, int, 
struct buf **);

while on -current its revision is 1.10 and has:

    int bread(struct vnode *, daddr_t, int, int, struct buf **);

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: NFS related panics and hangs

2015-11-05 Thread J. Hannken-Illjes
On 05 Nov 2015, at 21:48, Rhialto <rhia...@falu.nl> wrote:


> Looking into this:
> 
> the occurrences of nfs_reqq are as follows:
> 
> fs/nfs/client/nfs_clvnops.c: * nfs_reqq_mtx : Global lock, protects the 
> nfs_reqq list.
> 
> Since there is no other mention of nfs_reqq_mtx in the whole syssrc
> tarball, this looks wrong.  It also immediately causes the suspicion
> that the list isn't in fact protected at all.

This file (fs/nfs/client/nfs_clvnops.c) is part of a second (dead) nfs
implementation from FreeBSD.  It is not part of any kernel.

Our nfs lives in sys/nfs.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)





Re: Very slow transfers to/from micro SD card on a RPi B+

2015-08-18 Thread J. Hannken-Illjes
On 18 Aug 2015, at 12:44, Stephan stephan...@googlemail.com wrote:

 2015-08-17 21:30 GMT+02:00 Michael van Elst mlel...@serpens.de:
 stephan...@googlemail.com (Stephan) writes:
 
 I have just rebooted with WAPBL enabled. Some quick notes:
 
 -Sequential write speed is a little lower, around 5,4 MB/s.
 
 
 WAPBL is rather slow on SD cards because SD cards are very slow
 when writing small chunks. So even when WAPBL excels, like unpacking
 lots of files or removing a directory tree, it is slow because the
 sequential journal is written in small blocks.

The journal is written in chunks of MAXPHYS (64k) bytes.

 That might be all right. However, creating many files becomes worse the more 
 files are being created. That is on all kinds of devices I´ve seen.
 
 This is from an amd64 server box with an aac raid controller.
 
 /root/test/files time seq 1 1|xargs touch
 3.10s real 0.01s user 3.07s system
 /root/test/files rm *
 /root/test/files time seq 1 2|xargs touch
 9.88s real 0.01s user 8.51s system
 /root/test/files rm *
 /root/test/files time seq 1 3|xargs touch
23.45s real 0.04s user 20.41s system
 /root/test/files time seq 1 4|xargs touch
43.35s real 0.05s user 38.32s system
 
 That is clearly not linear.

I'm quite sure this is the memcmp in ufs_lookup.c:390.

For every file we have to compare its name to all names in the
directory so far, leading to 0.5*n*(n-1) calls to memcmp.
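
As a hypothetical illustration: with 30000 entries that is already
0.5 * 30000 * 29999, i.e. roughly 450 million memcmp() calls just to
create the files, and doubling the number of files roughly quadruples that.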

And our memcmp is slow ...

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Very slow transfers to/from micro SD card on a RPi B+

2015-08-18 Thread J. Hannken-Illjes
On 18 Aug 2015, at 15:06, Joerg Sonnenberger jo...@britannica.bec.de wrote:

 On Tue, Aug 18, 2015 at 01:06:02PM +0200, J. Hannken-Illjes wrote:
  And our memcmp is slow ...
 
 Do you have any thing substancial to justify this?

This topic has already been discussed last year, see

http://mail-index.netbsd.org/tech-kern/2014/08/25/msg017550.html

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: VOP_PUTPAGE ignores mount_nfs -o soft,intr

2015-06-22 Thread J. Hannken-Illjes
On 19 Jun 2015, at 10:36, Emmanuel Dreyfus m...@netbsd.org wrote:

 Hi
 
 I have encountered a bug with the NetBSD NFS client. Despite a mount with
 -o intr,soft, we can hit a situation where a process can remain hung in
 the kernel because the NFS server is gone.
 
 This happens when the ioflush does its duty, with the following code path:
 sync_fsync / nfs_sync / VOP_FSYNC / nfs_fsync / nfs_flush / VOP_PUTPAGES
 
 VOP_PUTPAGES has flags = PGO_ALLPAGES|PGO_FREE. It then goes through
 genfs_putpages and genfs_do_putpages, and get stuck in:
 
   /* Wait for output to complete. */
   if (!wasclean && !async && vp->v_numoutput != 0) {
           while (vp->v_numoutput != 0)
                   cv_wait(&vp->v_cv, slock);
   }

Looks like your problem is the missing NFS-aware implementation of
function nfs_node.c:nfs_gop_write().  Falling back to genfs_gop_write()
ignores NFS timeouts.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: vnd locking

2015-05-22 Thread J. Hannken-Illjes
On 22 May 2015, at 14:09, Patrick Welche pr...@cam.ac.uk wrote:

 There are a load of locking PRs involving vnd. I thought I would have a
 stab at converting vnd to use condvars and mutexes (mutices?) as a first
 step. I cobbled something together which allows me to
 
  vnconfig vnd0 /tmp/foo.img
  disklabel -r -w vnd0 vndtest
  newfs /dev/vnd0b
  mount /dev/vnd0b /mnt
  cd /mnt
  echo hello  foo
  cat foo
  df -h .
  umount /mnt
 
 with a LOCKDEBUG kernel, so I may have guessed right, but I haven't found
 any documentation on The Right Way, so please review carefully, assume
 the worst, and tell me how to do it better!
 
 Things to ponder:
 - there was some sort of start kernel thread, thread sets variable,
  meanwhile loop waiting for variable to be set. I removed that. On
  the other hand I added a MUSTJOIN for the kernel thread exiting
  side of things.
 - given that vnd is a pseudo-device, do I need an extra lock to care
  of device creation/destruction? (What am I protecting against?

One thread creating a device while it gets unconfigured by other thread?

- make sure all accesses to sc_flags etc. are protected by mutex.

- locking is no longer interruptible (no PCATCH / cv_wait_sig), not
  sure if this is a problem.

- keep the first blank line for functions without locals.

- do the white space change with a separate commit.

Otherwise looks like a good starting point.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Removal of miscfs/syncfs

2015-05-02 Thread J. Hannken-Illjes
On 01 May 2015, at 20:02, David Holland dholland-t...@netbsd.org wrote:

 On Fri, May 01, 2015 at 10:17:05AM +0200, J. Hannken-Illjes wrote:
 Our miscfs/syncfs originating from the softdep import is a pseudo
 file system with one VOP.  Its vnodes get used as a kind of marker
 on the syncer worklist so the syncer may run lazy VFS_SYNC for
 all mounted file systems.
 
 For this to work, it creates a vnode attached to the mount point
 with a special (syncfs) operations vector.  This concept breaks
 some rules making it nearly impossible to implement it with
 the new vcache operations.
 
 I propose to completely remove miscfs/syncfs and
 
 - move the syncer sched_sync into kern/vfs_subr.c
 - change the syncer to process the mountlist and VFS_SYNC as appropriate.
 - use an API for mount points similar to the API for vnodes.
 
 Diff at http://www.netbsd.org/~hannken/rm-miscfs-syncfs-1.diff
 
 Comments or objections anyone?
 
 I think this is a good plan (having a file system to be the syncer
 is horribly weird) but I'm not entirely convinced that having marker
 mounts instead of marker vnodes is really the answer. But maybe it's
 an ok first step.

This time goal is to remove the file system, not more.

 (also I'd prefer to have a syncer that went by blocks/buffers than by
 files, but that's a much bigger change...)
 
 I don't have time to read the diff properly now (maybe tomorrow? hope
 springs eternal) but I think it would be a good plan to put the syncer
 in its own file.
 
 (and maybe we should put it in sys/vfs? I announced plans to start
 moving vfs-level code in there some time ago, and this seems like an
 opportunity)

Such a move should probably wait until we are able to move files without
losing history.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Removal of miscfs/syncfs

2015-05-02 Thread J . Hannken-Illjes
On 01 May 2015, at 14:16, Christos Zoulas chris...@astron.com wrote:

 In article f40de9d8-f529-4411-9695-2d347e4de...@eis.cs.tu-bs.de,
 J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:
 Our miscfs/syncfs originating from the softdep import is a pseudo
 file system with one VOP.  Its vnodes get used as a kind of marker
 on the syncer worklist so the syncer may run lazy VFS_SYNC for
 all mounted file systems.
 
 For this to work, it creates a vnode attached to the mount point
 with a special (syncfs) operations vector.  This concept breaks
 some rules making it nearly impossible to implement it with
 the new vcache operations.
 
 I propose to completely remove miscfs/syncfs and
 
 - move the syncer sched_sync into kern/vfs_subr.c
 - change the syncer to process the mountlist and VFS_SYNC as appropriate.
 - use an API for mount points similar to the API for vnodes.
 
 Diff at http://www.netbsd.org/~hannken/rm-miscfs-syncfs-1.diff
 
 Comments or objections anyone?
 
 Have you tested its performance?

With ten mounts I got no problems.  All it adds is once per second:

  mutex_enter(&mountlist_lock);
  for (mp = TAILQ_FIRST(&mountlist); mp != NULL; mp = nmp) {
          if (NOTHING_TO_DO(mp)) {
                  nmp = TAILQ_NEXT(mp, mnt_list);
                  continue;
          }
  }
  mutex_exit(&mountlist_lock);


 Out of the 4 time_t tunables, two are not being used;

All tunables are used:

kern/vfs_subr.c:461:delayx = dirdelay;
kern/vfs_subr.c:465:delayx = metadelay;
kern/vfs_subr.c:470:delayx = filedelay;
kern/vfs_subr.c:622:return mp->mnt_wapbl != NULL ? metadelay : syncdelay;
kern/vfs_subr.c:847:synced ? syncdelay : lockdelay);
miscfs/genfs/genfs_io.c:95: vn_syncer_add_to_worklist(vp, filedelay);

 The other two are truncated to int's, why use time_t then?

Because I didn't change anything here.  As these tunables are sysctl
nodes, changing their type would change their size and break backwards
compatibility.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Removal of miscfs/syncfs

2015-05-01 Thread J. Hannken-Illjes
Our miscfs/syncfs originating from the softdep import is a pseudo
file system with one VOP.  Its vnodes get used as a kind of marker
on the syncer worklist so the syncer may run lazy VFS_SYNC for
all mounted file systems.

For this to work, it creates a vnode attached to the mount point
with a special (syncfs) operations vector.  This concept breaks
some rules making it nearly impossible to implement it with
the new vcache operations.

I propose to completely remove miscfs/syncfs and

- move the syncer sched_sync into kern/vfs_subr.c
- change the syncer to process the mountlist and VFS_SYNC as appropriate.
- use an API for mount points similar to the API for vnodes.

Diff at http://www.netbsd.org/~hannken/rm-miscfs-syncfs-1.diff

Comments or objections anyone?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: fss(4) behaviour if backup fs full

2015-04-28 Thread J. Hannken-Illjes
On 24 Apr 2015, at 11:24, Edgar Fuß e...@math.uni-bonn.de wrote:

 How will an fss(4) snapshot behave if the file system used for the 
 copy-on-write backup store gets full?
 Will it drop the snapshot? Will all write operations to the original fs fail? 
 Will it panic?

The snapshot becomes invalid, no further copy-on-write and reading
gives ENXIO (Device not configured).

The snapshot device stays configured.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: tracking P-V for unmanaged device pages

2015-03-26 Thread J. Hannken-Illjes

On 26 Mar 2015, at 13:13, Taylor R Campbell 
campbell+netbsd-tech-k...@mumble.net wrote:

 Various DRM graphics drivers, including Intel, Radeon, and Nouveau,
 sometimes need to unmap all virtual mappings of certain physical
 pages for which there is no struct vm_page.  The issue is explained in
 detail here:
 
 https://mail-index.netbsd.org/tech-kern/2014/07/23/msg017392.html
 
 It's not desirable to simply add struct vm_pages on a freelist that
 uvm_pagealloc ignores, because struct vm_page is large (120 bytes on
 amd64, for example), most of it is unnecessary for P-V tracking, and
 the physical regions that need P-V tracking are large (hundreds of
 megabytes, or gigabytes).
 
 The attached patch implements the following extension to pmap(9) on
 x86 and uses it in DRM[*].  The implementation uses a linear list of
 pv-tracked ranges, since it is expected to be short (one to three
 elements).  The list is managed with pserialize(9) so it adds no
 locking overhead to existing pmap operations that need to look up
 entries in it.

If you plan to use pserialize here it is time to change
pserialize_perform() to not kpause() every xc_broadcast.

Currently pserialize_perform() takes at least two ticks to complete.

It should be modified as:

        i = 0;
        do {
                mutex_spin_exit(&psz_lock);
                xc = xc_broadcast(XC_HIGHPRI, (xcfunc_t)nullop, NULL, NULL);
                xc_wait(xc);
                if (i++ > 1)
                        kpause("psrlz", false, 1, NULL);
                mutex_spin_enter(&psz_lock);
        } while (!kcpuset_iszero(psz->psz_target));

to become faster in the usual two activity switches case.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Vnode API change: add global vnode cache

2015-03-09 Thread J. Hannken-Illjes
On 22 May 2014, at 00:48, David Holland dholland-t...@netbsd.org wrote:

snip

 As I discovered while prodding lfs last weekend, this is too
 optimistic, as there's another family of cases: allocating a new inode
 and constructing a vnode for it.
 
 It is *possible* to do this by allocating an inode number and then
 using vcache_get to procure the vnode, such that when the vnode cache
 calls VFS_LOADVNODE it will load a blank inode. This is what ffs does,
 and it works fine there since unallocated inodes have fixed known
 locations that are in a known on-disk state.
 
 It doesn't work for lfs, and it won't work for any other filesystem
 where inode locations aren't fixed, at least not without some ugly
 hackery: you can update the fs state with a location for the new
 inode, and write a buffer with the intended inode contents, and then
 have VFS_LOADVNODE find and use this buffer, but this is inefficient
 and gross. (This isn't itself adequate for lfs because lfs has some
 additional self-inflicted issues.)
 
 After sleeping on this a few times, ISTM that the best way to handle
 this is as follows:
 
   - add an additional call vcache_new to the API;
   - have vcache_new take a type argument;
   - have vcache_new assert that no vnode with the same key already
 exists (FS-level locking should guarantee this);
   - have vcache_new call a new op VFS_NEWVNODE instead of
 VFS_LOADVNODE and pass the type in.

snip

Here comes the new operation vcache_new() to allocate and initialise
a new vnode/fsnode pair.  Passing the type is not sufficient, therefore
pass all information we get from VOP_MKNOD:

int
vcache_new(struct mount *mp, struct vnode *dvp, struct vattr *vap,
kauth_cred_t cred, struct vnode **vpp)

where dvp is the (referenced) directory where we want to create the
new node, vap passes va_type, va_mode and possibly va_rdev, and cred
gives the credentials to set up uid/gid.

The node returned from vcache_new() is referenced, fully initialised
and has link count zero.
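
As a sketch of the intended use (variable names are illustrative and error
handling is trimmed; dvp, vap and cred are whatever the calling
VOP_CREATE/VOP_MKNOD supplies):

        struct vnode *vp;
        int error;

        error = vcache_new(dvp->v_mount, dvp, vap, cred, &vp);
        if (error)
                return error;
        /*
         * vp is referenced, fully initialised and has link count zero;
         * the file system now enters the name into dvp and bumps the
         * link count.
         */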

A diff showing the implementation is here:

http://www.netbsd.org/~hannken/vnode-pass7-1a.diff

and a diff showing its usage for ffs is here:

http://www.netbsd.org/~hannken/vnode-pass7-1b.diff

Comments or objections anyone?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: A brelse() in FFS...

2015-02-13 Thread J. Hannken-Illjes
On 13 Feb 2015, at 19:18, Maxime Villard m...@m00nbsd.net wrote:

 Hi,
 this may be a stupid question; in ufs/ffs/ffs_vfsops.c, l.1153:
 
   if (fs->fs_sbsize < SBLOCKSIZE)
           brelse(bp, BC_INVAL);
   else
           brelse(bp, 0);
 
 However 'fs->fs_sbsize' is *always* smaller than SBLOCKSIZE. So only
 the first branch is reached.

It may be equal to SBLOCKSIZE.  BC_INVAL is here so the buffer is not kept
in the cache.  It might be read later with a size greater than fs_sbsize,
which would break the buffer cache invariant that a block is always read
with the same size.

 I don't understand what the BC_INVAL flag means, and its definition
 in sys/buf.h is not very enlightening:
 
   #define BC_INVAL0x2000  /* Does not contain valid info. 
 */
 
 Which branch do I keep?

Both.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: race condition between (specfs) device open and close

2014-12-30 Thread J. Hannken-Illjes
On 30 Dec 2014, at 09:57, matthew green m...@eterna.com.au wrote:

 while trying to fix any midi/sequencer issues i've seen in the
 last year, noticed that any attempt to call 'midiplay' on more
 than one midi device at the same time, while not allowed, would
 leave the sequencer device always busy.
 
 i tracked this down to an ugly race condition inside the
 specfs code.  the problem is that:
 
   t1                              t2
   opens dev
   sd_opencnt++
                                   tries to open dev
                                   sd_opencnt++
   closes dev
   since sd_opencnt is != 1, does not call close
   sd_opencnt--
                                   gets EBUSY
                                   sd_opencnt--
 
 in this case, the device close routine is never called, and in
 at least the midi case, all future opens return EBUSY due to
 there already being an active open.
 
 i've spent some time reviewing the specfs code, and i think that
 the below patch fixes these specific problems.  it certainly
 fixes the problem i have with trying to open /dev/music
 concurrently.  it may not be entirely correct or have problems
 i've not seen with it, but atf works fine, as does the last
 couple of days of random testing.
 
 the main changes are:
   - introduce a dying flag to the specnode structure,
 and use it appropriately
   - add a condvar between open/close and revoke, to
 ensure any revoke succeeds properly.
 
 i'd like to get this into netbsd 7, so any additional testing
 against that branch or against -current would be greatly
 appreciated.
snip
 @@ -397,7 +405,14 @@ spec_node_revoke(vnode_t *vp)
    KASSERT(sn->sn_gone == false);
 
    mutex_enter(&device_lock);
 - KASSERT(sn->sn_opencnt <= sd->sd_opencnt);
 + sn->sn_dying = true;
 + while (sn->sn_opencnt < sd->sd_opencnt) {
 +         /*
 +          * Open is in progress, but not complete, must wait
 +          * for it to complete.
 +          */
 +         cv_wait(&specopenclose_cv, &device_lock);
 + }
    if (sn->sn_opencnt != 0) {
            sd->sd_opencnt -= (sn->sn_opencnt - 1);
            sn->sn_opencnt = 1;

Will this work if we revoke a device whose cdev_open() blocks,
a TTY waiting for carrier for example?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Should sys_unmount() FOLLOW?

2014-11-23 Thread J. Hannken-Illjes
On 23 Nov 2014, at 17:47, Emmanuel Dreyfus m...@netbsd.org wrote:

 Hi
 
 I ran into this strange bug with glusterfs NFS server, which is possible
 because it allows the mounted filesystem root vnode to be VLNK (NetBSD's
 native mountd prevents this situation and therefore the bug does not
 happen with our native NFS server):
 
 bacasel# mount
 bacasel.net.espci.fr:/patchy/symlink1 on /mnt/nfs/0 type nfs
 
 bacasel# ls -l /mnt/nfs
 lrwxrwxrwx  1 root  wheel  4 Nov 23 10:03 0 -> dir1
 
 That is possible because on the exported filesystem, symlink1 is a
 symlink to dir1.
 
 It looks funny and harmless, at least until one tries to unmount while the
 NFS server is down. I do this using the most reliable way: umount -f -R
 /mnt/nfs/0
  
 
 umount(8) will quickly call unmount(2), passing the /mnt/nfs/0 path
 without trying anything fancy. But at the beginning of sys_unmount() in
 the kernel, we can find:
 
        NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | TRYEMULROOT, pb);
        if ((error = namei(&nd)) != 0) {
                pathbuf_destroy(pb);
                return error;
        }
 
 The FOLLOW flag will cause the symlink to be resolved before
 sys_unmount() can proceed. Since the NFS server is down, the namei call
 will never return, and umount(8) gets stuck in the kernel with this
 backtrace:
 sleepq_block
 kpause
 nfs_reconnect
 nfs_request
 nfs_readlinkrpc
 nfs_readlink
 VOP_READLINK
 namei_tryemulroot
 namei
 sys_unmount
 syscall
 
 This is annoying because umount -f should really unmount. Moreover,
 stuck processes will crop up on the system because umount(8) holds a vnode
 lock forever.
 
 I see two way of fixing this:
 1) remove FOLLOW flag in sys_unmount(). After all, unless the -R flag is
 given, umount(8) should have resolved the path before calling
 unmount(2).

Did you try -- last time I tried a forced unmount with NFS server down
it didn't work even with root being a directory because the namei() call
would hang in VOP_LOOKUP().  Does it work these days?

 2) Define a new MNT_RAW mount flag. If umount(8) is called with -R,
 pass that flag to unmount(2), and in sys_unmount(), do not use FOLLOW if
 MNT_RAW is set.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: find a file from DDB?

2014-11-13 Thread J. Hannken-Illjes
On 13 Nov 2014, at 17:23, Emmanuel Dreyfus m...@netbsd.org wrote:

 Hello
 
 I track a locking problem in DDB. Given a vnode, how can I find what file
 is involved? This is a UFS filesystem, and as far as I can tell, ddb
 is not able to output the inode, right?

You may call VOP_PRINT(0) with 0 being the vnode address.
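
For example (the address is made up, take it from your backtrace or lock
output; for an FFS vnode this ends up in ufs_print() and shows the inode
number):

        db> call VOP_PRINT(0xc12f3a00)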

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: [PATCH] PUFFS backend allocation (round 3)

2014-10-28 Thread J. Hannken-Illjes
On 28 Oct 2014, at 14:37, Emmanuel Dreyfus m...@netbsd.org wrote:

 On Sun, Oct 26, 2014 at 09:21:13AM +0100, J. Hannken-Illjes wrote:
 - You should recover the error in puffs_vnop_close() too.
 
 Should I? puffs_vnop_close() does not cause VOP_STRATEGY to be
 called, it only sends a message to the filesystem: if I recover
 the VOP_STRATEGY error there I will steal it from a write or
 fsync call.

Doesn't puffs allow async writes?

NFS allows them and therefore VOP_CLOSE is the last chance to report
write errors back to the application.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: [PATCH] PUFFS backend allocation (round 3)

2014-10-28 Thread J. Hannken-Illjes
On 28 Oct 2014, at 15:43, Emmanuel Dreyfus m...@netbsd.org wrote:

 On Tue, Oct 28, 2014 at 02:52:36PM +0100, J. Hannken-Illjes wrote:
 NFS allows them and therefore VOP_CLOSE is the last chance to report
 write errors back to the application.
 
 The situation is a bit different: write and fsync call VOP_PUTPAGE
 which calls VOP_STRATEGY and forgets about the write error.

Confused.  If write and/or fsync are synchronous (VOP_PUTPAGES with
flag PGO_SYNCIO) no write error will be forgotten.  If one of these
VOP_PUTPAGES is async the write error will be delayed and if the
next (and last) operation is VOP_CLOSE this is the last chance to
report the error back.  If any intermediate operation returns a delayed
error to the application it is cleared and close will no longer see it.

 The 
 write error should be recovered by write and fsync and then cleared:
 I do not want to get it in the next write operation: imagine I write
 at the end of a file, get EDQUOT, then within an allocated area in the
 file, I should not get EDQUOT again.

I don't understand this section.

 But what about close? If I am to be reported a write error when calling
 close, it should _not_ be cleared as I should get it again if I call
 close again, until the time the write succeeds.

There is no close after close.  Once a file is closed it may return
a delayed error but has to close anyway.  This is how NFS works and
what applications expect.
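
A minimal sketch of that pattern (the pn_werror field is hypothetical, not
part of the current puffs node): the async write path records the error,
and whichever of fsync/close reports it first also clears it.

        /* strategy / iodone error path */
        if (error != 0)
                pn->pn_werror = error;

        /* VOP_FSYNC or VOP_CLOSE */
        if (error == 0 && pn->pn_werror != 0) {
                error = pn->pn_werror;
                pn->pn_werror = 0;      /* reported once, then cleared */
        }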

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: [PATCH] PUFFS backend allocation (round 3)

2014-10-26 Thread J. Hannken-Illjes
On 26 Oct 2014, at 03:52, Emmanuel Dreyfus m...@netbsd.org wrote:

 Chuck Silvers c...@chuq.com wrote:
 
 but more fundamentally, since puffs code cannot prevent changes to the file
 in the underlying fs (ie. changes that don't go through puffs), any
 preallocation done by puffs can be undone before it does any good.
 the puffs code just needs to be fixed to handle such ENOSPC/EDQUOT errors
 while flushing pages without hanging.  NFS has the same fundamental issue
 but I suspect its error handling is better.
 
 I tried the same approach as NFS: save errors in VOP_STRATEGY and
 recover them in upper layer:
 http://ftp.espci.fr/shadow/manu/puffs-strategy-error.patch
 
 Opinions? Tests show that it behaves gracefully when encountering EDQUOT:
 no more processes hung in DE state and the correct errno is reported.
 
 The patch also contains fallocate and fdiscard implementation, which are
 now of no use, but I think it is good to have them ready for later.
 On the fdiscard front, I wonder if I should hold pn_sizemtx. It is not
 supposed to change file size, but what happens if file size is changed
 during the operation?

- Increment PUFFSVERSION.

- You should recover the error in puffs_vnop_close() too.

- Are you sure pn_sizemtx is still needed?  VOP_GETATTR() works on
  locked vnodes since 2011-10-30.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: [PATCH] PUFFS backend allocation (round 3)

2014-10-26 Thread J. Hannken-Illjes
On 26 Oct 2014, at 17:55, Emmanuel Dreyfus m...@netbsd.org wrote:

 J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:
 
 What happens when libpuffs receives an unknown message (fallocate in
 this case)?
 
 The filesystem using libpuffs tells the kernel what operations should be
 sent to userland when it calls puffs_init(). It can be done 
 - by setting the list of wanted operations, in which case the kernel
 never sends an unknown message, 
 - by requesting all operations to be sent to userland. In that case an
 unknown operation is handled in src/lib/libpuffs/dispatcher.c:dispatch()
        default:
                printf("inval op %d\n", preq->preq_optype);
                error = EINVAL;
                break;

Ok -- you don't need it here.

Since it was last incremented at Rev. 1.72 many messages got additional
fields, so the version should be incremented independently of your change.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: [PATCH] PUFFS backend allocation (round 3)

2014-10-25 Thread J. Hannken-Illjes
On 25 Oct 2014, at 06:39, Emmanuel Dreyfus m...@netbsd.org wrote:

 Summary: when writing to a PUFFS filesystem through page cache, we do
 not know if backend storage is really available. If it is not, cache
 flush may get EDQUOT or ENOSPC and the process cannot terminate (it gets
 stuck in DE state).
 
 Proposed solution: detect that a write may not have backend storage, and
 if it is the case, try to allocate the backend storage.
 
 Detecting is done on two conditions:
 - if allocated blocks is shorter than size, the file is sparse and we
 never know if we are writing in a hole or not: in that case, always
 write-once
 - if writing beyond EOF
 
 Allocating the backend storage is done
 - through newly introduced PUFFS fallocate operation (unlikely to work
 on NetBSD as the system call exists but FFS does not support it)
 - otherwise by reading from the file and rewriting the data that was read
 
 The latest patch doing this:
 http://ftp.espci.fr/shadow/manu/puffs-alloc2.patch
 
 Opinions?

At the very least you should increment PUFFSVERSION.

Setting of PNODE_SPARSE looks wrong.  You should mark a VREG pnode
as sparse until it is sure not to have holes, which is when we get
va_bytes != VNOVAL && vp->v_size == rvap->va_bytes.
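
Something along these lines, as a sketch (PNODE_SPARSE and the surrounding
code follow the patch under discussion, so treat the names as illustrative):

        if (vp->v_type == VREG) {
                if (rvap->va_bytes != VNOVAL && vp->v_size == rvap->va_bytes)
                        pn->pn_stat &= ~PNODE_SPARSE;   /* known hole-free */
                else
                        pn->pn_stat |= PNODE_SPARSE;    /* may contain holes */
        }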

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: [PATCH] GOP_ALLOC and fallocate for puffs (round 2)

2014-10-15 Thread J. Hannken-Illjes
On 14 Oct 2014, at 19:05, Emmanuel Dreyfus m...@netbsd.org wrote:

 Hi
 
 Here is the latest iteration of my patch to ensure we avoid sending data
 to the page cache we cannot flush later.
 http://ftp.espci.fr/shadow/manu/puffs-alloc.patch
 
 - When writing, we make sure backend storage is allocated in two cases:
  1) the file is sparse, hence we never know if backend storage is here.
  2) the file grows: we know we need to allocate for the new area.
 - sparse file detection is done when updating the metadata cache: if
  size is bigger than allocated blocks, it is sparse.
 - I add fallocate FUSE operation (userland support is in another patch)
  for efficient backend storage allocation.
 - If fallocate is unavailable, I fall back to explicit zero-filling.
 - puffs_vnop_write() is split into two functions, puffs_vnop_write_cache()
  that writes in cache, puffs_vnop_write_fs that writes to the FS

First (and for the last time) GOP_ALLOC() should go to the backend, maybe
with a short circuit on non-sparse, not-after-eof files.

Zero filling looks wrong, as the region may contain valid data.

Whatever you do you should increment the protocol version number.

If you want to make UBC work with PUFFS fsx may help.  Last time
I tried it failed badly on a rump_ffs mount.  You may find it here:

http://www.netbsd.org/~hannken/fsx.tgz

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: [PATCH] GOP_ALLOC and fallocate for PUFFS

2014-10-02 Thread J. Hannken-Illjes
On 02 Oct 2014, at 06:45, Emmanuel Dreyfus m...@netbsd.org wrote:

 J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:
 
 genfs_gop_write calls genfs_do_io which does error = biowait(mbp);
 near the end.  This will catch errors from VOP_STRATEGY.
 
 I ran the code below but VOP_PUTPAGES never returns anything other than 0.

Sure -- why should it?

GOP_ALLOC() gets called from VOP_GETPAGES() for missing pages.  Here you
run VOP_PUTPAGES() on a range known to be unmapped so it becomes a NOP.

GOP_ALLOC() aka puffs_gop_alloc() has to run on the client to make
sure the pages in question are allocated and may be faulted in.

The client has to fill holes or extend files.

 int
 puffs_gop_alloc(struct vnode *vp, off_t off, off_t len,
     int flags, kauth_cred_t cred)
 {
         struct puffs_mount *pmp = MPTOPUFFSMP(vp->v_mount);
         off_t start, end;
         int pgo_flags = PGO_CLEANIT|PGO_SYNCIO|PGO_PASTEOF;
         int u_flags = PUFFS_UPDATESIZE|PUFFS_UPDATEMTIME|PUFFS_UPDATECTIME;
         int error;

         if (EXISTSOP(pmp, FALLOCATE)) {
                 error = _puffs_vnop_fallocate(vp, off, len);
                 goto out;
         }

         start = trunc_page(off);
         end = round_page(off + len);

         if (off + len > vp->v_size)
                 uvm_vnp_setwritesize(vp, off + len);

         mutex_enter(vp->v_interlock);
         error = VOP_PUTPAGES(vp, start, end, pgo_flags);

         if (off + len > vp->v_size) {
                 if (error == 0) {
                         uvm_vnp_setsize(vp, off + len);
                         puffs_updatenode(VPTOPP(vp), u_flags, vp->v_size);
                 } else {
                         uvm_vnp_setwritesize(vp, vp->v_size);
                 }
         }
  out:
         return error;
 }

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: [PATCH] GOP_ALLOC and fallocate for PUFFS

2014-10-02 Thread J. Hannken-Illjes
On 02 Oct 2014, at 17:23, Emmanuel Dreyfus m...@netbsd.org wrote:

 On Thu, Oct 02, 2014 at 12:02:39PM +0200, J. Hannken-Illjes wrote:
 GOP_ALLOC() gets called from VOP_GETPAGES() for missing pages.  Here you
 run VOP_PUTPAGES() on a range known to be unmapped so it becomes a NOP.
 
 GOP_ALLOC() aka puffs_gop_alloc() has to run on the client to make
 sure the pages in question are allocated and may be faulted in.
 
 You mean I have to run VOP_GETPAGES at the beginning of a cached 
 write operation?

Please describe cached write operation in terms of vnode operations.

Which vnode operation finally calls GOP_ALLOC()?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: [PATCH] GOP_ALLOC and fallocate for PUFFS

2014-10-02 Thread J. Hannken-Illjes

On 02 Oct 2014, at 19:09, Emmanuel Dreyfus m...@netbsd.org wrote:

 J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:
 
 Please describe cached write operation in terms of vnode operations.
 
 A write on a mount that uses page cache, without direct I/O.
 
 Which vnode operation finally calls GOP_ALLOC()?
 
 From genfs, only VOP_GETPAGES, but I understand we should call it on our
 own. For instance, UFS calls it through ufs_balloc_range() in VOP_WRITE.

Ok -- if you want to use it, you have to implement it on the client side
as puffs has no idea how to allocate blocks -- right?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: [PATCH] GOP_ALLOC and fallocate for PUFFS

2014-09-30 Thread J. Hannken-Illjes
On 30 Sep 2014, at 18:24, Emmanuel Dreyfus m...@netbsd.org wrote:

 On Tue, Sep 30, 2014 at 03:30:05PM +, Antti Kantee wrote:
 Is it really better to sync fallocate, put stuff in the page cache and flush
 the page cache some day instead of just having a write-through (or
 write-first) page cache on the write() path?  You also get rid of the
 fallocate-not-implemented problem that way.
 
 GOP_ALLOC calls puffs_gop_alloc for chunks bigger than pages 
 (I observed 1 MB for now).  If we have fallocate implemented in the
 filesystem, this is really efficient, since fallocate saves us from
 sending any data to write. Hence IMO fallocate should be the preferred
 way if available.
 
 But if it is not there, indeed, doing a write on the first attempt should
 do the trick.
 
 Writing zeroes might be a bad emulation for distributed file systems, though
 I guess you're the expert in that field and can evaluate the risks better
 than me.
 
 I understand that areas fallocate'd should return zeroes, so it should be
 fine. The real problem is performance. I am not sure which approach is best.
 
 I first thought about a puffs_gop_alloc like below, but that will not work,
 as VOP_PUTPAGES goes to genfs_putpages, which calls 
 GOP_WRITE (genfs_gop_write), which calls VOP_STRATEGY without checking
 for failure.

genfs_gop_write calls genfs_do_io which does error = biowait(mbp);
near the end.  This will catch errors from VOP_STRATEGY.

But why do you need GOP_ALLOC?  Is there a consumer beside genfs_getpages
filling holes?  Puffs doesn't return holes as its VOP_BMAP always
returns valid ( != -1 ) block addrs.

 Should I directly call VOP_STRATEGY?
 
 int
 puffs_gop_alloc(struct vnode *vp, off_t off, off_t len,
int flags, kauth_cred_t cred)
 {  
   int error;
   
   if (EXISTSOP(pmp, FALLOCATE))
   return _puffs_vnop_fallocate(vp, off, len);
   else
   return VOP_PUTPAGES(vp, off, off + len,
   PGO_CLEANIT|PGO_SYNCIO);
 } 
 
 -- 
 Emmanuel Dreyfus
 m...@netbsd.org

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Testing 7.0 Beta: FFS still very slow when creating files

2014-08-25 Thread J. Hannken-Illjes
On 24 Aug 2014, at 18:57, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

snip

 I tried to bisect and got an increase in time from ~15 secs to ~24 secs
 between the time stamps '2012-09-18 06:00 UTC' and '2012-09-18 09:00 UTC'.
 
 Someone should redo this test as this interval is the import of the
 compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools.  I can't
 believe this to be a compiler problem.

GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16.

Could this be the cause of this big loss in performance?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Testing 7.0 Beta: FFS still very slow when creating files

2014-08-25 Thread J. Hannken-Illjes
On 25 Aug 2014, at 17:39, Taylor R Campbell riastr...@netbsd.org wrote:

   Date: Mon, 25 Aug 2014 15:55:53 +0200
   From: J. Hannken-Illjes hann...@eis.cs.tu-bs.de
 
   GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern.
 
   See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16.
 
   Could this be the cause of this big loss in performance?
 
 Shouldn't be too hard to test this.  Perhaps try dropping in the
 following replacements for the vcache key comparison and running the
 test for each one?
 memequal.c

We are talking about a kernel from 2012/09 -- vcache came much later.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Testing 7.0 Beta: FFS still very slow when creating files

2014-08-25 Thread J. Hannken-Illjes
On 25 Aug 2014, at 15:55, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

 On 24 Aug 2014, at 18:57, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:
 
 snip
 
 I tried to bisect and got an increase in time from ~15 secs to ~24 secs
 between the time stamps '2012-09-18 06:00 UTC' and '2012-09-18 09:00 UTC'.
 
 Someone should redo this test as this interval is the import of the
 compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools.  I can't
 believe this to be a compiler problem.
 
 GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern.
 
 See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16.
 
 Could this be the cause of this big loss in performance?

Short answer: it is -- reverting external/gpl3/gcc/dist/gcc/builtins.c
from Rev. 1.3 to 1.2 brings back the old times which are the same as
they were on NetBSD 6.

Given that this test has many calls to ufs_lookup/cache_lookup using
memcmp to check for equal filenames this is not a surprise.

A rather naive implementation of memcmp (see below) drops the running
time from ~15 sec to ~9 secs.  We should consider improving our memcmp.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Index: libkern.h
===
RCS file: /cvsroot/src/sys/lib/libkern/libkern.h,v
retrieving revision 1.106
diff -p -u -2 -r1.106 libkern.h
--- libkern.h   30 Aug 2012 12:16:49 -0000  1.106
+++ libkern.h   25 Aug 2014 17:23:35 -0000
@@ -262,5 +262,18 @@ void *memset(void *, int, size_t);
 #if __GNUC_PREREQ__(2, 95) && !defined(_STANDALONE)
 #define memcpy(d, s, l)    __builtin_memcpy(d, s, l)
-#define memcmp(a, b, l)    __builtin_memcmp(a, b, l)
+static inline int __memcmp(const void *a, const void *b, size_t l)
+{
+       const unsigned char *pa = a, *pb = b;
+
+       if (l > 8)
+               return memcmp(a, b, l);
+       while (l-- > 0) {
+               if (__predict_false(*pa != *pb))
+                       return *pa < *pb ? -1 : 1;
+               pa++; pb++;
+       }
+       return 0;
+}
+#define memcmp(a, b, l)    __memcmp(a, b, l)
 #endif
 #if __GNUC_PREREQ__(2, 95) && !defined(_STANDALONE)
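
For anyone who wants to check the effect outside the kernel, a hedged
user-space micro-benchmark along the same lines (the 9-byte names and the
iteration count are arbitrary; the per-iteration byte store only keeps the
pure memcmp call from being hoisted out of the loop):

#include <stdio.h>
#include <string.h>
#include <time.h>

static int
naive_memcmp(const void *a, const void *b, size_t l)
{
        const unsigned char *pa = a, *pb = b;

        while (l-- > 0) {
                if (*pa != *pb)
                        return *pa < *pb ? -1 : 1;
                pa++; pb++;
        }
        return 0;
}

int
main(void)
{
        char n1[16] = "file12345", n2[16] = "file12346";
        volatile int r = 0;
        clock_t t0;
        long i, iters = 50000000;

        t0 = clock();
        for (i = 0; i < iters; i++) {
                n1[0] = 'a' + (i & 7);
                n2[0] = n1[0];
                r += memcmp(n1, n2, 9);
        }
        printf("library memcmp: %.2fs\n",
            (double)(clock() - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        for (i = 0; i < iters; i++) {
                n1[0] = 'a' + (i & 7);
                n2[0] = n1[0];
                r += naive_memcmp(n1, n2, 9);
        }
        printf("naive memcmp:   %.2fs\n",
            (double)(clock() - t0) / CLOCKS_PER_SEC);

        return r & 1;
}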



Re: Testing 7.0 Beta: FFS still very slow when creating files

2014-08-24 Thread J. Hannken-Illjes
On 22 Aug 2014, at 18:29, Taylor R Campbell riastr...@netbsd.org wrote:

   Date: Fri, 22 Aug 2014 17:59:37 +0200
   From: Stephan stephan...@googlemail.com
 
   Has anybody an idea on this or how to track this down? At the moment,
    I can't even enter ddb using Ctrl+Alt+Esc keys for some reason. I've
   also seen people playing with dtrace but that doesn't seem to be
   included.
 
 Dtrace may be a good idea.  You can use it by
 
 (a) using a kernel built with `options KDTRACE_HOOKS',
 (b) using a userland built with MKDTRACE=yes,
 (c) modload /stand/ARCH/VERSION/solaris.kmod
modload /stand/ARCH/VERSION/dtrace.kmod
modload /stand/ARCH/VERSION/fbt.kmod
modload /stand/ARCH/VERSION/sdt.kmod
  (d) mkdir /dev/dtrace && mknod /dev/dtrace/dtrace c dtrace
 
 (Yes, this is too much work.  Someone^TM should turn it all on by
 default for netbsd-7...!)
 
 From the lockstat output it looks like there's a lot of use of
 mntvnode_lock, which suggests this may be related to hannken@'s vnode
 cache changes.  Might be worthwhile to sample stack traces of
 vfs_insmntque, with something like
 
 dtrace -n 'fbt::vfs_insmntqueue:entry { @[stack()]++ }'
 
 or perhaps sample stack traces of the mutex_enters of mntvnode_lock.

This was my first guess too ...

I tried to bisect and got an increase in time from ~15 secs to ~24 secs
between the time stamps '2012-09-18 06:00 UTC' and '2012-09-18 09:00 UTC'.

Someone should redo this test as this interval is the import of the
compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools.  I can't
believe this to be a compiler problem.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Btw.: my test script is:

#! /bin/sh

mdconfig /dev/md0d 2048000 &
P=${!}
newfs /dev/rmd0a
mount -t ffs -o log /dev/md0a /mnt

(cd /mnt && time sh -c 'seq 1 3|xargs touch')

umount /mnt
kill ${P}


Re: Add operation vcache_rekey

2014-06-30 Thread J. Hannken-Illjes
On 26 Jun 2014, at 05:33, David Holland dholland-t...@netbsd.org wrote:

 On Wed, Jun 25, 2014 at 09:46:16AM +, Taylor R Campbell wrote:
 Also, I wonder whether any file systems will ever change the length of
 the key for a vnode.  Probably not.
 
 As I recall from when I was looking at this a couple weeks ago (feels
 like a minor eternity ago, but it can't be much longer than that)
 ... yes, they might.

David, they will.  nfs/nfs_vnops.c::nfs_lookitup() will change the
file handle in the *npp != NULL case.  Here both the key and its
length may change.  I overlooked this one as it didn't update the
rbtree.  To make it worse the memory holding the file handle may
get reallocated so it is impossible to do the rekey with one call.

The attached diff adds

int
vcache_rekey_enter(struct mount *mp, struct vnode *vp,
const void *old_key, size_t old_key_len,
const void *new_key, size_t new_key_len)

to prepare the key change.  It will allocate a cache node for the new
key and lock both the old and the new cache node.  It is an error if
the new node already exists.

void
vcache_rekey_exit(struct mount *mp, struct vnode *vp,
const void *old_key, size_t old_key_len,
const void *new_key, size_t new_key_len)

to finish the key change.  It will remove the old cache node and setup
and unlock the new cache node.
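
The intended calling sequence, as a sketch (variable names illustrative,
error handling trimmed):

        error = vcache_rekey_enter(mp, vp, old_key, old_key_len,
            new_key, new_key_len);
        if (error)
                return error;           /* a node with the new key exists */
        /*
         * Switch the fsnode over to the new key; the memory holding the
         * old key may be freed or reallocated here.
         */
        vcache_rekey_exit(mp, vp, old_key, old_key_len,
            new_key, new_key_len);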

Comments, suggestions or objections?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)


vcache.diff
Description: Binary data

