Re: Add a mountlist iterator

2017-04-01 Thread Taylor R Campbell
> Date: Thu, 30 Mar 2017 11:21:41 +0200
> From: "J. Hannken-Illjes" 
> 
> Plan is to untangle the mountlist processing from vfs_busy() / vfs_unbusy()
> and add an iterator for the mountlist:

Generally seems like a good idea to me!

> - void mountlist_iterator_init(mount_iterator_t **)
>   to initialize an iterator in mounted order.

Any particular reason to use a pointer-to-opaque-pointer here instead
of a caller-allocated struct mountlist_iterator object?

struct mountlist_iterator it;

mountlist_iterator_init();
while ((mp = mountlist_iterator_next(, ...)) != NULL)
...
mountlist_iterator_destroy();

Then there's no need to kmem_alloc any intermediate data structures.

Embedding the struct mountlist_entry in struct mount would obviate the
need for _mountlist_next to iterate over the entire list from the
start, too.

Some in-line comments on the diff.  General comment: can you make your
diffs with the `-p' option so it's easier to follow which function
each diff hunk edits?

   diff -r dab39dcbcb29 -r ff3f89481f2e sys/kern/vfs_mount.c
   --- a/sys/kern/vfs_mount.c  Thu Mar 30 11:17:56 2017 +0200
   +++ b/sys/kern/vfs_mount.c  Thu Mar 30 11:17:56 2017 +0200
   @@ -119,7 +136,9 @@
   ...
   +static TAILQ_HEAD(mountlist, mountlist_entry) mount_list;
   +static kmutex_t mount_list_lock;

Candidate for cacheline-aligned struct?  Why do we need a new lock
mount_list_lock for this instead of using the existing mountlist_lock?

   @@ -1439,10 +1454,168 @@
   ...
   +   if (marker->me_type == ME_MARKER)
   +   me = TAILQ_NEXT(marker, me_list);
   +   else
   +   me = TAILQ_PREV(marker, mountlist, me_list);

Beware: TAILQ_PREV (and TAILQ_LAST) violates strict aliasing rules, so
we should probably try to phase it out rather than add new users.
Maybe we should add a new kind of queue(3) that doesn't violate strict
aliasing but supports *_PREV and *_LAST.

All that said, it is not clear to me why you need reverse iteration
here.  None of your patches seem to use it.  Are you planning to use
it for some future purpose?

   +   /* Take an initial reference for vfs_busy() below. */
   +   mp = me->me_mount;
   +   KASSERT(mp != NULL);
   +   atomic_inc_uint(>mnt_refcnt);
   +   mutex_exit(_list_lock);
   +
   +   /* Try to mark this mount busy and return on success. */
   +   if (vfs_busy(mp, NULL) == 0) {
   +   vfs_destroy(mp);
   +   return mp;
   +   }
   +   vfs_destroy(mp);
   +   mutex_enter(_list_lock);

Why do we need a new refcnt dance here?  Why isn't it enough to just
use vfs_busy like all the existing callers?

if (vfs_busy(mp, NULL) == 0) {
mutex_exit(_list_lock);
return mp;
}

Seems to me we ought to be removing code that names mnt->mnt_refcnt
specifically, rather than adding more code.

(Then we could also move the `mutex_exit; return;' out of the for loop
to make it look a little more symmetric in mountlist_iterator_next.)

   diff -r ff3f89481f2e -r 28fea04e7b15 sys/coda/coda_vfsops.c
   --- a/sys/coda/coda_vfsops.cThu Mar 30 11:17:56 2017 +0200
   +++ b/sys/coda/coda_vfsops.cThu Mar 30 11:17:57 2017 +0200
   @@ -624,23 +624,20 @@
   -mutex_enter(_lock);
   -TAILQ_FOREACH(mp, , mnt_list) {
   -   if ((!strcmp(mp->mnt_op->vfs_name, MOUNT_UFS)) &&
   -   ((VFSTOUFS(mp))->um_dev == (dev_t) dev)) {
   -   /* mount corresponds to UFS and the device matches one we want */
   -   break;
   -   }
   +if (spec_node_lookup_by_dev(VBLK, dev, ) == 0) {
   +   mp = spec_node_getmountedfs(vp);
   +   vrele(vp);
   +} else {
   +   mp = NULL;
}
   -mutex_exit(_lock);

Commit this change separately first?

   ml_iterator_vfs_busy

   Drop the mountlist processing from vfs_busy() and vfs_unbusy().

   diff -r 28fea04e7b15 -r bb476fea44cd sys/kern/vfs_mount.c
   --- a/sys/kern/vfs_mount.c  Thu Mar 30 11:17:57 2017 +0200
   +++ b/sys/kern/vfs_mount.c  Thu Mar 30 11:17:57 2017 +0200
   @@ -307,26 +307,26 @@
 * => Will fail if the file system is being unmounted, or is unmounted.
 */
int
   -vfs_busy(struct mount *mp, struct mount **nextp)
   +vfs_busy(struct mount *mp, int flags)
{

   KASSERT(mp->mnt_refcnt > 0);
   +   KASSERT((flags & ~MNT_NOWAIT) == 0);

You seem to have put an unrelated change into this commit -- a change
to add a flags argument to vfs_busy.  Separate commit, please?

Not clear without further explanation what the purpose of this change
is.  I would also be inclined to say vfs_trybusy(mp) rather than
vfs_busy(mp, MNT_NOWAIT), particularly since MNT_NOWAIT is derived
from a different context.

Can you make another separate commit to drop the keepref parameter to
vfs_unbusy, and change all the callers with 

Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-01 Thread Thor Lancelot Simon
On Sat, Apr 01, 2017 at 06:46:24PM +0200, Michael van Elst wrote:
> On Sat, Apr 01, 2017 at 11:12:42AM -0400, Thor Lancelot Simon wrote:
> 
> > That said, very high-latency transports like iSCSI require a lot more
> > data than we can put into flight at once.  We just don't have enough
> > parallelism in our I/O subsystem (and most applications can't supply
> > enough).
> 
> We have tons of parallelism for writing and a small amount for reading.

Unless you've done even more than I noticed, allocation in the filesystems
is going to be a bottleneck -- concurrent access not having been foremost
in anyone's mind when FFS was designed.

XFS is full of tricks for this.  Unfortunately, despite a few early papers,
the source code pretty much is the documentation -- and parts of the code
that were effectively hamstrung by the lesser capabilities of the early
Linux kernel compared to end-of-the-road Irix have in some cases been
removed.

-- 
  Thor Lancelot Simont...@panix.com

  "We cannot usually in social life pursue a single value or a single moral
   aim, untroubled by the need to compromise with others."  - H.L.A. Hart


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-01 Thread Thor Lancelot Simon
On Sat, Apr 01, 2017 at 08:54:40AM +, Michael van Elst wrote:
> t...@panix.com (Thor Lancelot Simon) writes:
> 
> >When SCSI tagged queueing is used properly, it is not necessary to set WCE
> >to get good write performance, and doing so is in fact harmful, since it
> >allows the drive to return ORDERED commands as complete before any of the
> >data for those or prior commands have actually been committed to stable
> >storage.
> 
> Do you think that real world disks agree? WCE is often necessary to
> get any decent performance and yes, data is not committed to stable

I don't agree.  What's sometimes necessary is to adjust the other mode
page bits that allow the drive to arbitrarily reorder SIMPLE commands,
but with an I/O subsystem that can put enough data in flight at once,
there's no performance reason to use WCE and considerable reliability
reason not to.

That said, very high-latency transports like iSCSI require a lot more
data than we can put into flight at once.  We just don't have enough
parallelism in our I/O subsystem (and most applications can't supply
enough).

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-01 Thread Michael van Elst
t...@panix.com (Thor Lancelot Simon) writes:

>When SCSI tagged queueing is used properly, it is not necessary to set WCE
>to get good write performance, and doing so is in fact harmful, since it
>allows the drive to return ORDERED commands as complete before any of the
>data for those or prior commands have actually been committed to stable
>storage.

Do you think that real world disks agree? WCE is often necessary to
get any decent performance and yes, data is not committed to stable
storage when the command returns (but there is a good chance that it
will be even when there is a power outage).


-- 
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-01 Thread Jaromír Doleček
2017-03-31 22:16 GMT+02:00 Thor Lancelot Simon :
> It's not obvious, but in fact ORDERED gets set for writes
> as a default, I believe -- in sd.c, I think?
>
> This confused me for some time when I last looked at it.

It confused me also, that's why I changed the code a while back to be
less confusing :)

It used to be easy to draw conclusion that we use ORDERED all the
time, as that was default if tag flag was unset. But in fact sd(4) and
cd(4) always set XS_CTL_SIMPLE_TAG explicitely, so it was actually
always SIMPLE.

scsipi_base.c
revision 1.166
date: 2016-10-02 21:40:35 +0200;  author: jdolecek;  state: Exp;
lines: +4 -4;  commitid: iiAGFJk9looTyBoz;
change scsipi_execute_xs() to default to simple tags for !XS_CTL_URGENT
if not specified by caller; this is mostly for documentation purposes
only, as sd(4) and cd(4) explicitly use simple tags already

Jaromir