Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-03 Thread Thor Lancelot Simon
On Tue, Apr 04, 2017 at 12:39:46AM +0200, Jaromír Doleček wrote:
> 
> Is there any reason we wouldn't want to set QAM=1 by default for
> sd(4)? Seems like pretty obvious performance improvement tweak.

Supposedly, there are some rather old drives -- mid-1990s or thereabouts --
that may keep some SIMPLE tags pending and *never* finish them unless the
host occasionally issues an ORDERED tag.  I don't know if any of them
still do, but some Linux HBA drivers used to forcibly set 1 in N tags
(for relatively large values of N) to ORDERED to avoid this.
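
The 1-in-N workaround Thor describes can be sketched as follows. The tag message values are the standard SCSI-2 ones; the interval and the per-command decision function are illustrative, not taken from any real HBA driver:

```c
#include <stdint.h>

/*
 * Sketch of the old Linux HBA workaround: some mid-1990s drives could
 * starve SIMPLE-tagged commands forever, so the driver promotes every
 * Nth tag to ORDERED, forcing the drive to drain its pending queue.
 * ORDERED_INTERVAL is a made-up "relatively large N".
 */
#define SIMPLE_TAG	0x20	/* SIMPLE QUEUE TAG message */
#define ORDERED_TAG	0x22	/* ORDERED QUEUE TAG message */
#define ORDERED_INTERVAL 64

static unsigned tag_counter;

uint8_t
pick_tag_type(void)
{
	if (++tag_counter % ORDERED_INTERVAL == 0)
		return ORDERED_TAG;	/* barrier: drain pending SIMPLE tags */
	return SIMPLE_TAG;
}
```
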

I was pondering putting this setting into scsictl (it is sufficiently
SCSI-specific that it seems like it doesn't belong in dkctl).

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-03 Thread Jaromír Doleček
2017-04-02 17:28 GMT+02:00 Thor Lancelot Simon :
> However -- I believe for the 20-30% of SAS drives you mention as shipping
> with WCE set, it should be possible to obtain nearly identical performance
> and more safety by setting the Queue Algorithm Modifier bit in the control
> mode page to 1.  This allows the drive to arbitrarily reorder SIMPLE
> writes so long as the precedence rules with HEAD and ORDERED commands
> are respected.

Is there any reason we wouldn't want to set QAM=1 by default for
sd(4)? Seems like pretty obvious performance improvement tweak.

MODE SENSE has a flag to tell whether this field is settable, so it
should be pretty safe to set it relying on that.
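
The settable-field check could look roughly like this. The byte offsets follow the SPC control mode page (0x0A) layout, with the mode parameter header assumed stripped; the helper itself and its calling convention are hypothetical, and a real implementation would issue MODE SENSE with PC=00b (current) and PC=01b (changeable) and write the result back with MODE SELECT:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * In the changeable-values MODE SENSE reply, a bit set to 1 means the
 * target allows that bit to be modified.  Only set QAM=1 when the
 * QUEUE ALGORITHM MODIFIER field (control mode page byte 3, bits 7:4)
 * is reported changeable.
 */
#define CTRL_PAGE_QAM_BYTE	3
#define QAM_MASK		0xF0	  /* QUEUE ALGORITHM MODIFIER */
#define QAM_UNRESTRICTED	(1u << 4) /* QAM=1: unrestricted reordering */

int
maybe_enable_qam(const uint8_t *changeable, uint8_t *current, size_t len)
{
	if (len <= CTRL_PAGE_QAM_BYTE)
		return 0;
	if ((changeable[CTRL_PAGE_QAM_BYTE] & QAM_MASK) == 0)
		return 0;	/* field not settable on this target */
	current[CTRL_PAGE_QAM_BYTE] =
	    (current[CTRL_PAGE_QAM_BYTE] & ~QAM_MASK) | QAM_UNRESTRICTED;
	return 1;		/* caller would MODE SELECT this back */
}
```
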

I can make this change (maybe with a sysctl to control it?);
unfortunately I don't have any hardware to test whether it actually
makes a measurable difference. Any volunteers to test a patch?

Jaromir


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-02 Thread Thor Lancelot Simon
On Sat, Apr 01, 2017 at 09:26:50AM +, Michael van Elst wrote:
> 
> >Setting WCE on SCSI drives is simply a bad idea.  It is
> >not necessary for performance and creates data integrity
> >issues.
> 
> I don't know details about data integrity issues although
> I'm sure there are some. But unfortunately WCE makes a difference
> on many SCSI drives nowadays. You either run with WCE or
> use a RAID controller with its own stable storage (BBU/Flash)
> or live with a significant speed penalty for writes.

But the RAID controller is just another embedded computer, with its own
SCSI initiator.  If it's not setting WCE on the drive, how does it get better
performance than we could?  The answer must be either:

A) Fewer barriers (ignoring cache flushes or clearing ORDERED tags earlier
   because it has stable "cache")
B) Larger commands (we could do this...sigh).
C) More tags in flight.

These are, in theory, things we could do too.

However -- I believe for the 20-30% of SAS drives you mention as shipping
with WCE set, it should be possible to obtain nearly identical performance
and more safety by setting the Queue Algorithm Modifier bit in the control
mode page to 1.  This allows the drive to arbitrarily reorder SIMPLE
writes so long as the precedence rules with HEAD and ORDERED commands
are respected.  I don't seem to have a drive like the ones you're describing
(all my SAS stuff is several years old at best, and nothing shipped with
WCE turned on as far as I can tell), but if you're able to try this, I'd love
to know what the result is.

Given enough tags in flight, the only difference between using SIMPLE tags
for writes with QAM=1 and running with WCE enabled is that the host should
be able to tell when the writes actually hit stable storage, which is kind
of a big deal...

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-02 Thread Michael van Elst
t...@panix.com (Thor Lancelot Simon) writes:

>> We have tons of parallelism for writing and a small amount for reading.

>Unless you've done even more than I noticed, allocation in the filesystems
>is going to be a bottleneck -- concurrent access not having been foremost
>in anyone's mind when FFS was designed.

When you write the filesystem will queue an infinite amount of data
to the device, and the device will issue as many concurrent commands
as it (and the target) has openings (i.e. max 256 for scsipi and thus iscsi).
The parallelism is then only limited by memory and the pagermap.

Reading sequentially is done by UVM with a read-ahead of (default) up
to 8 * MAXPHYS (I have bumped this locally to 16 * MAXPHYS to make
iscsi saturate a GigE link).

Reading randomly is limited by lock contention in the kernel when you
try to read with many threads.

Reading is obviously also limited by the pagermap. The default size on
amd64 is 16 MByte and that's the amount of I/O you can have in flight.

Whether filesystem allocation (and possibly synchronous writes) is
a limitation depends. WAPBL seems to hide that quite well.


Saying this, on real fast storage (NVME on PCIe) everything seems to
be CPU limited, and the largest overhead comes from UVM. I believe
changing device I/O to use unmapped pages will have the largest impact.
At the same time it will also avoid the pagermap limit.

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-02 Thread Edgar Fuß
> Now, it might be the case that the on-media integrity is not the
> primary goal. Then flush is only a write barrier, not an integrity
> measure. In that case yes, ORDERED does keep the semantics (e.g.
> earlier journal writes are written before later journal writes).
So either I'm completely wrong or there's some fundamental confusion here.

Probably it's due to different interpretations of ``on-media integrity''.

In my world -- save fsync() or fdatasync() (which no doubt require something 
like FUA or a cache flush, but see below) -- the one and only point of not 
writing to disc asynchronously is to ensure that at all points in time 
(where the system may crash) the on-disc data is in a state that can be 
made consistent again by fsck (or, more recently, a log replay). And this, 
with all approaches to the problem known to me, requires guaranteeing a 
write order.
[Of course there's a silent assumption that the ``consistent state'' 
restored by fsck is somewhat close in time to the time of the crash, 
otherwise you could just newfs.]

> It does make stuff much easier to code, too - simply mark I/O as ORDERED 
> and fire, no need to explicitly wait for completion, and can drop e.g. 
> journal locks faster.
Which doesn't surprise me because, in my understanding, it's the solution 
closest to the problem to be solved.

> I do think that it's important to concentrate on case where WCE is on,
> since that is realistically what majority of systems run with.
I still doubt that makes any difference in the design.

> Just for the record, I can see these practical problems with ORDERED:
> 1. only available on SCSI, so still needs fallback barrier logic for
> less awesome hw
Yes, sure. But it would still be nice to have some OS caring about sensible 
hardware. If I need support for commodity PeeCee HW, I know where to find 
Linux or FreeBSD (where I would assume that FB's SCSI support may well be 
more advanced than NB's).

> 3. bufq processing needs special care for MPSAFE SCSI drivers, to
> prevent processing any further commands while I/O with ORDERED tag is
> being submitted to the controller.
I don't get that.

If you have two processes concurrently writing to disc directly, nobody 
guarantees an ordering of the writes issued by them. If the two processes
write through the FS, it's the FS's job to serialize that anyway. I'm 
probably missing something.

> I still see my FUA effort as a more direct replacement of the cache flushes
Yes, sure.


Of course, there's still the problem of too many programs out there issuing
fsync()s. As far as I remember, SQLite issues four syncs for a transactional
update. Firefox keeps a SQLite database for cookies, open tabs, history and
whatnot. Each is updated several times a minute. In the end, a completely 
idling browser causes half a megabyte of NFS traffic per minute and on the 
order of ten journal flushes per minute. Multiply that by 150 clients.


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-02 Thread Edgar Fuß
> When SCSI tagged queueing is used properly, it is not necessary to set WCE
> to get good write performance
I will be eager to test this in Real Life once NetBSD ``uses tagged queueing 
properly''.


> and doing so is in fact harmful, since it allows the drive to return 
> ORDERED commands as complete before any of the data for those or prior 
> commands have actually been committed to stable storage.
Which exactly violates the ordering assumptions in which way?


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-01 Thread Thor Lancelot Simon
On Sat, Apr 01, 2017 at 06:46:24PM +0200, Michael van Elst wrote:
> On Sat, Apr 01, 2017 at 11:12:42AM -0400, Thor Lancelot Simon wrote:
> 
> > That said, very high-latency transports like iSCSI require a lot more
> > data than we can put into flight at once.  We just don't have enough
> > parallelism in our I/O subsystem (and most applications can't supply
> > enough).
> 
> We have tons of parallelism for writing and a small amount for reading.

Unless you've done even more than I noticed, allocation in the filesystems
is going to be a bottleneck -- concurrent access not having been foremost
in anyone's mind when FFS was designed.

XFS is full of tricks for this.  Unfortunately, despite a few early papers,
the source code pretty much is the documentation -- and parts of the code
that were effectively hamstrung by the lesser capabilities of the early
Linux kernel compared to end-of-the-road Irix have in some cases been
removed.

-- 
  Thor Lancelot Simont...@panix.com

  "We cannot usually in social life pursue a single value or a single moral
   aim, untroubled by the need to compromise with others."  - H.L.A. Hart


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-01 Thread Thor Lancelot Simon
On Sat, Apr 01, 2017 at 08:54:40AM +, Michael van Elst wrote:
> t...@panix.com (Thor Lancelot Simon) writes:
> 
> >When SCSI tagged queueing is used properly, it is not necessary to set WCE
> >to get good write performance, and doing so is in fact harmful, since it
> >allows the drive to return ORDERED commands as complete before any of the
> >data for those or prior commands have actually been committed to stable
> >storage.
> 
> Do you think that real world disks agree? WCE is often necessary to
> get any decent performance and yes, data is not committed to stable

I don't agree.  What's sometimes necessary is to adjust the other mode
page bits that allow the drive to arbitrarily reorder SIMPLE commands,
but with an I/O subsystem that can put enough data in flight at once,
there's no performance reason to use WCE and considerable reliability
reason not to.

That said, very high-latency transports like iSCSI require a lot more
data than we can put into flight at once.  We just don't have enough
parallelism in our I/O subsystem (and most applications can't supply
enough).

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-01 Thread Michael van Elst
t...@panix.com (Thor Lancelot Simon) writes:

>When SCSI tagged queueing is used properly, it is not necessary to set WCE
>to get good write performance, and doing so is in fact harmful, since it
>allows the drive to return ORDERED commands as complete before any of the
>data for those or prior commands have actually been committed to stable
>storage.

Do you think that real world disks agree? WCE is often necessary to
get any decent performance and yes, data is not committed to stable
storage when the command returns (but there is a good chance that it
will be even when there is a power outage).


-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-01 Thread Jaromír Doleček
2017-03-31 22:16 GMT+02:00 Thor Lancelot Simon :
> It's not obvious, but in fact ORDERED gets set for writes
> as a default, I believe -- in sd.c, I think?
>
> This confused me for some time when I last looked at it.

It confused me too; that's why I changed the code a while back to be
less confusing :)

It used to be easy to draw the conclusion that we use ORDERED all the
time, as that was the default if the tag flag was unset. But in fact sd(4) and
cd(4) always set XS_CTL_SIMPLE_TAG explicitly, so it was actually
always SIMPLE.

scsipi_base.c
revision 1.166
date: 2016-10-02 21:40:35 +0200;  author: jdolecek;  state: Exp;
lines: +4 -4;  commitid: iiAGFJk9looTyBoz;
change scsipi_execute_xs() to default to simple tags for !XS_CTL_URGENT
if not specified by caller; this is mostly for documentation purposes
only, as sd(4) and cd(4) explicitly use simple tags already
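
The defaulting that log message describes amounts to something like the following. This is a paraphrase, not the actual scsipi_base.c code: the flag names follow the thread, but the numeric values and the URGENT handling are made-up assumptions:

```c
/*
 * Sketch of scsipi_execute_xs() tag defaulting per the commit log:
 * if the caller did not pick a tag and the command is not urgent,
 * default to a SIMPLE tag.  Flag values are illustrative only.
 */
#define XS_CTL_URGENT		0x0010
#define XS_CTL_SIMPLE_TAG	0x0020
#define XS_CTL_ORDERED_TAG	0x0040
#define XS_CTL_ANY_TAG		(XS_CTL_SIMPLE_TAG | XS_CTL_ORDERED_TAG)

int
default_tag_flags(int xs_ctl)
{
	if (xs_ctl & XS_CTL_ANY_TAG)
		return xs_ctl;	/* caller chose a tag explicitly */
	if (xs_ctl & XS_CTL_URGENT)
		return xs_ctl;	/* urgent commands left alone here */
	return xs_ctl | XS_CTL_SIMPLE_TAG;	/* default: SIMPLE */
}
```
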

Jaromir



Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Paul.Koning

> On Mar 31, 2017, at 4:16 PM, Thor Lancelot Simon  wrote:
> 
> On Fri, Mar 31, 2017 at 07:16:25PM +0200, Jaromír Doleček wrote:
>>> The problem is that it does not always use SIMPLE and ORDERED tags in a
>>> way that would facilitate the use of ORDERED tags to enforce barriers.
>> 
>> Our scsipi layer actually never issues ORDERED tags right now as far
>> as I can see, and there is currently no interface to get it set for an
>> I/O.
> 
> It's not obvious, but in fact ORDERED gets set for writes
> as a default, I believe -- in sd.c, I think?

Why would you do that?  I don't know that as standard SCSI practice, and it 
seems like a recipe for slow performance.

paul



Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Thor Lancelot Simon
On Fri, Mar 31, 2017 at 07:16:25PM +0200, Jaromír Doleček wrote:
> > The problem is that it does not always use SIMPLE and ORDERED tags in a
> > way that would facilitate the use of ORDERED tags to enforce barriers.
> 
> Our scsipi layer actually never issues ORDERED tags right now as far
> as I can see, and there is currently no interface to get it set for an
> I/O.

It's not obvious, but in fact ORDERED gets set for writes
as a default, I believe -- in sd.c, I think?

This confused me for some time when I last looked at it.

> I lived under the assumption that SIMPLE tagged commands could be and are
> reordered by the controller/drive at will already, without setting any
> other flags.

They might be -- there are well defined mode page bits
to control this, but I believe targets are free to use
whatever default they like.

> 
> > When SCSI tagged queueing is used properly, it is not necessary to set WCE
> > to get good write performance, and doing so is in fact harmful, since it
> > allows the drive to return ORDERED commands as complete before any of the
> > data for those or prior commands have actually been committed to stable
> > storage.
> 
> This was what I meant when I said "even ordered tags couldn't avoid
> the cache flushes". Using ORDERED tags doesn't provide on-media
> integrity when WCE is set.

Setting WCE on SCSI drives is simply a bad idea.  It is
not necessary for performance and creates data integrity
issues.

> Now, it might be the case that the on-media integrity is not the
> primary goal. Then flush is only a write barrier, not an integrity
> measure. In that case yes, ORDERED does keep the semantics (e.g.
> earlier journal writes are written before later journal writes). It
> does make stuff much easier to code, too - simply mark I/O as ORDERED
> and fire, no need to explicitly wait for completion, and can drop e.g.
> journal locks faster.
> 
> I do think that it's important to concentrate on case where WCE is on,
> since that is realistically what majority of systems run with.

I don't believe most SCSI drives are run with WCE on.

I agree FUA or its equivalent is needed for non-SCSI
drives.

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Jaromír Doleček
> The problem is that it does not always use SIMPLE and ORDERED tags in a
> way that would facilitate the use of ORDERED tags to enforce barriers.

Our scsipi layer actually never issues ORDERED tags right now as far
as I can see, and there is currently no interface to get it set for an
I/O.

> Also, that we may not know enough about the behavior of our filesystems
> in the real world to be 100% sure it's safe to set the other mode page
> bits that allow the drive to arbitrarily reorder SIMPLE commands (which
> under some conditions is necessary to match the performance of running
> with WCE set).

I lived under the assumption that SIMPLE tagged commands could be and are
reordered by the controller/drive at will already, without setting any
other flags.

> When SCSI tagged queueing is used properly, it is not necessary to set WCE
> to get good write performance, and doing so is in fact harmful, since it
> allows the drive to return ORDERED commands as complete before any of the
> data for those or prior commands have actually been committed to stable
> storage.

This was what I meant when I said "even ordered tags couldn't avoid
the cache flushes". Using ORDERED tags doesn't provide on-media
integrity when WCE is set.

Now, it might be the case that the on-media integrity is not the
primary goal. Then flush is only a write barrier, not an integrity
measure. In that case yes, ORDERED does keep the semantics (e.g.
earlier journal writes are written before later journal writes). It
does make stuff much easier to code, too - simply mark I/O as ORDERED
and fire, no need to explicitly wait for completion, and can drop e.g.
journal locks faster.

I do think that it's important to concentrate on case where WCE is on,
since that is realistically what majority of systems run with.

Just for the record, I can see these practical problems with ORDERED:
1. only available on SCSI, so still needs fallback barrier logic for
less awesome hw
2. Windows and Linux used to always use SIMPLE tags and wait for
completion; this suggests this avenue may have already been explored and
found not interesting enough, or too buggy (remember scheduler
activations?)
3. bufq processing needs special care for MPSAFE SCSI drivers, to
prevent processing any further commands while I/O with ORDERED tag is
being submitted to the controller.

I still see my FUA effort as a more direct replacement of the cache
flushes, for it keeps both the logical and on-media integrity. Also,
it will benefit SATA disks too, once/if NCQ is integrated.

I think that implementing barrier/ORDERED can be a parallel effort,
similar to the maxphys branch. I don't think barriers will make FUA
irrelevant, as it's still needed for systems with WCE on.

Jaromir


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Thor Lancelot Simon
On Fri, Mar 31, 2017 at 02:16:44PM +0200, Edgar Fuß wrote:
> Oh well.
> 
> TLS> If the answer is that you're running with WCE on in the mode pages, then
> TLS> don't do that:
> EF> I don't get that. If you turn off the write cache, you need neither cache 
> EF> flushes nor ordering, no?
> MB> You still need ordering. With tagged queuing, you have multiple commands
> MB> running at the same time (up to 256, maybe more for newer SCSI) and the 
> MB> drive is free to complete them in any order.  Unless one of them is an 
> MB> ORDERED command, in which case commands queued before have to complete 
> MB> before.
> 
> I guess we are talking past each other. I should have phrased that ``If you
> don't use any tagging and turn off the write cache, ...''.

But that doesn't make sense.  Why would our SCSI layer not use tagging?

The problem is that it does not always use SIMPLE and ORDERED tags in a
way that would facilitate the use of ORDERED tags to enforce barriers.

Also, that we may not know enough about the behavior of our filesystems
in the real world to be 100% sure it's safe to set the other mode page
bits that allow the drive to arbitrarily reorder SIMPLE commands (which
under some conditions is necessary to match the performance of running
with WCE set).

When SCSI tagged queueing is used properly, it is not necessary to set WCE
to get good write performance, and doing so is in fact harmful, since it
allows the drive to return ORDERED commands as complete before any of the
data for those or prior commands have actually been committed to stable
storage.

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Edgar Fuß
Oh well.

TLS> If the answer is that you're running with WCE on in the mode pages, then
TLS> don't do that:
EF> I don't get that. If you turn off the write cache, you need neither cache 
EF> flushes nor ordering, no?
MB> You still need ordering. With tagged queuing, you have multiple commands
MB> running at the same time (up to 256, maybe more for newer SCSI) and the 
MB> drive is free to complete them in any order.  Unless one of them is an 
MB> ORDERED command, in which case commands queued before have to complete 
MB> before.

I guess we are talking past each other. I should have phrased that ``If you
don't use any tagging and turn off the write cache, ...''.

The course of arguments was:
1. Jaromir wrote about FUA and integrating AHCI NCQ support
2. Edgar remembered each journal commit to cause two cache flushes and asked 
   whether using SCSI TCQ could save the cache flushes
3. Jaromir responded that even ordered tags couldn't avoid the cache flushes
4. Edgar wrote that the point of the flushes seems to be to guarantee an order
   and asked why then ordered tags couldn't make them unnecessary
5. TLS seconded Edgar, asked why tags weren't good enough, and stated ``If 
   the answer is that you're running with WCE on in the mode pages, then 
   don't do that''
6. Edgar (as I now guess) mis-interpreted that as ``current behaviour (no
   tags) and write caching'', while TLS (I guess) meant ``potential future 
   behaviour (using tags) and write caching'' and also phrased his own 
   reply (omitting that he referred to the current situation without any
   queueing) in a way that provoked further mis-understanding
7. Manuel mis-understood Edgar and explained problems arising from using no
   write caching and unordered tagging.

So while I'm still not sure what the SCSI behaviour is with both write caching
and tagged queuing (I would guess turning write caching on/off doesn't make
much of a difference when you queue everything, but I may well be missing 
something fundamental), I still have the impression that using (ordered)
queueing and no cache flushes would be the perfect solution for journalling.


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-30 Thread Thor Lancelot Simon
On Wed, Mar 29, 2017 at 11:53:55AM +0200, Edgar Fuß wrote:
> > It needs to do this [flush disc cache after committing journal] because 
> > it needs to make sure that journal data are saved before we save the 
> > journal commit block.
> So the point is to force an order (data before commit block).
> 
> > Implicitly, the pre-commit flush also makes sure that all asynchronously 
> > written metadata updates are written to the media, before the commit makes
> > them impossible to replay.
> So the point is to force an order (metadata before journal).
> 
> > Even SCSI ORDERED tags wouldn't help to avoid the need for cache flushes.
> So why is that, if the point of the cache flushes is to ensure an order?

It doesn't make sense to me either.  ORDERED tags are required not to complete
until all previously submitted SIMPLE and ORDERED tags have been committed to
stable storage; and if that's not enough, you can, I believe, use a HEAD tag.
Why isn't that good enough?

If the answer is that you're running with WCE on in the mode pages, then
don't do that: use SIMPLE tags for all writes except when you intend a
barrier, and ORDERED when you do.  I must be missing something.
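
Thor's rule above -- SIMPLE tags everywhere, ORDERED only when a barrier is intended -- can be modeled with a toy completion check. This is illustrative only, not NetBSD code:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of the tag semantics described above: SIMPLE-tagged
 * commands may complete in any order, but an ORDERED tag acts as a
 * barrier - it may not complete before any earlier command, and no
 * later command may complete before it.
 */
enum tag { SIMPLE, ORDERED };

struct cmd {
	enum tag tag;
	bool done;
};

/* May the command in queue slot i legally complete now? */
bool
may_complete(const struct cmd *q, size_t i)
{
	if (q[i].done)
		return false;	/* already completed */
	for (size_t j = 0; j < i; j++) {
		/*
		 * An incomplete earlier command blocks us if either
		 * side of the pair is an ORDERED barrier.
		 */
		if (!q[j].done &&
		    (q[i].tag == ORDERED || q[j].tag == ORDERED))
			return false;
	}
	return true;
}
```

With a queue of SIMPLE, SIMPLE, ORDERED, SIMPLE: the two leading SIMPLE writes may complete in either order, the ORDERED one must wait for both, and the trailing SIMPLE write must wait for the ORDERED one -- which is exactly the journal-commit barrier pattern discussed in this thread.
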

-- 
 Thor Lancelot Simon  t...@panix.com

Cry, the beloved country, for the unborn child that is the
inheritor of our fear.  -Alan Paton


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-29 Thread Edgar Fuß
> It needs to do this [flush disc cache after committing journal] because 
> it needs to make sure that journal data are saved before we save the 
> journal commit block.
So the point is to force an order (data before commit block).

> Implicitly, the pre-commit flush also makes sure that all asynchronously 
> written metadata updates are written to the media, before the commit makes
> them impossible to replay.
So the point is to force an order (metadata before journal).

> Even SCSI ORDERED tags wouldn't help to avoid the need for cache flushes.
So why is that, if the point of the cache flushes is to ensure an order?


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-27 Thread Thor Lancelot Simon
On Tue, Mar 28, 2017 at 01:17:18AM +0200, Jaromír Doleček wrote:
> 2017-03-12 11:15 GMT+01:00 Edgar Fuß :
> > Some comments as I probably count as one of the larger WAPBL consumers (we
> > have ~150 employees' Home and Mail on NFS on FFS2+WAPBL on RAIDframe on 
> > SAS):
> 
> I've not changed the code in RF to pass the cache flags, so the patch
> doesn't actually enable FUA there, mainly because disks come and go
> and I'm not aware of a mechanism to make WAPBL aware of such changes. It

I ran into this issue with tls-maxphys and got so frustrated I was actually
considering simply panicking if a less-capable disk were used to replace a
more-capable one.

Just FYI.

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-27 Thread Jaromír Doleček
Attached is the final version of the patch. It uses a MEDIA prefix for the
flags, but keeps FUA/DPO - i.e. the names are B_MEDIA_FUA and B_MEDIA_DPO.
For wapbl it introduces a sysctl to use the feature; the default is off
for now.

I plan to commit this later in the week or early next week, unless
there are some serious objections.

Jaromir

2017-03-05 23:22 GMT+01:00 Jaromír Doleček :
> Here is an updated patch. It was updated to check for the FUA support
> for SCSI, using the MODE SENSE device-specific flag. Code was tested
> with QEMU emulated bha(4) and nvme. WAPBL code was updated to use the
> flag. It keeps the flag naming for now.
>
> In the patch, WAPBL sets the flag for journal writes, and also for the
> metadata buffer for bawrite() call after journal commit.
>
> There is a possible layer violation for the metadata write - b_flags are
> supposed to be set by the owner of the buffer. Not sure how strict we
> want/need to be there - perhaps introduce another flag field? Also the
> flag probably needs to be unset in the biodone hook, so that the code
> guarantees the buffer in the buffer cache doesn't accidentally keep it over
> to another I/O.
>
> Jaromir
? dev/ic/TODO.nvme
Index: sys/buf.h
===
RCS file: /cvsroot/src/sys/sys/buf.h,v
retrieving revision 1.126
diff -u -p -r1.126 buf.h
--- sys/buf.h   26 Dec 2016 23:12:33 -  1.126
+++ sys/buf.h   27 Mar 2017 22:31:22 -
@@ -198,16 +198,21 @@ struct buf {
 #define	B_RAW		0x0008	/* Set by physio for raw transfers. */
 #define	B_READ		0x0010	/* Read buffer. */
 #define	B_DEVPRIVATE	0x0200	/* Device driver private flag. */
+#define	B_MEDIA_FUA	0x0800	/* Set Force Unit Access for media. */
+#define	B_MEDIA_DPO	0x1000	/* Set Disable Page Out for media. */
 
 #define BUF_FLAGBITS \
 "\20\1AGE\3ASYNC\4BAD\5BUSY\10DELWRI" \
 "\12DONE\13COWDONE\15GATHERED\16INVAL\17LOCKED\20NOCACHE" \
-"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH"
+"\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH\34MEDIAFUA\35MEDIADPO"
 
 /* Avoid weird code due to B_WRITE being a "pseudo flag" */
 #define BUF_ISREAD(bp) (((bp)->b_flags & B_READ) == B_READ)
 #define BUF_ISWRITE(bp)(((bp)->b_flags & B_READ) == B_WRITE)
 
+/* Media flags, to be passed for nested I/O */
+#define B_MEDIA_FLAGS  (B_MEDIA_FUA|B_MEDIA_DPO)
+
 /*
  * This structure describes a clustered I/O.  It is stored in the b_saveaddr
  * field of the buffer on which I/O is done.  At I/O completion, cluster
Index: sys/dkio.h
===
RCS file: /cvsroot/src/sys/sys/dkio.h,v
retrieving revision 1.22
diff -u -p -r1.22 dkio.h
--- sys/dkio.h  8 Dec 2015 20:36:15 -   1.22
+++ sys/dkio.h  27 Mar 2017 22:31:22 -
@@ -85,6 +85,8 @@
 #define	DKCACHE_RCHANGE	0x000100 /* read enable is changeable */
 #define	DKCACHE_WCHANGE	0x000200 /* write enable is changeable */
 #define	DKCACHE_SAVE	0x01	/* cache parameters are savable/save them */
+#define	DKCACHE_FUA	0x02	/* Force Unit Access supported */
+#define	DKCACHE_DPO	0x04	/* Disable Page Out supported */
 
	/* sync disk cache */
 #define	DIOCCACHESYNC	_IOW('d', 118, int)	/* sync cache (force?) */
Index: kern/vfs_bio.c
===
RCS file: /cvsroot/src/sys/kern/vfs_bio.c,v
retrieving revision 1.271
diff -u -p -r1.271 vfs_bio.c
--- kern/vfs_bio.c  21 Mar 2017 10:46:49 -  1.271
+++ kern/vfs_bio.c  27 Mar 2017 22:31:22 -
@@ -2027,7 +2027,7 @@ nestiobuf_iodone(buf_t *bp)
 void
 nestiobuf_setup(buf_t *mbp, buf_t *bp, int offset, size_t size)
 {
-   const int b_read = mbp->b_flags & B_READ;
+   const int b_pass = mbp->b_flags & (B_READ|B_MEDIA_FLAGS);
struct vnode *vp = mbp->b_vp;
 
KASSERT(mbp->b_bcount >= offset + size);
@@ -2035,14 +2035,14 @@ nestiobuf_setup(buf_t *mbp, buf_t *bp, i
bp->b_dev = mbp->b_dev;
bp->b_objlock = mbp->b_objlock;
bp->b_cflags = BC_BUSY;
-   bp->b_flags = B_ASYNC | b_read;
+   bp->b_flags = B_ASYNC | b_pass;
bp->b_iodone = nestiobuf_iodone;
bp->b_data = (char *)mbp->b_data + offset;
bp->b_resid = bp->b_bcount = size;
bp->b_bufsize = bp->b_bcount;
bp->b_private = mbp;
BIO_COPYPRIO(bp, mbp);
-   if (!b_read && vp != NULL) {
+   if (BUF_ISWRITE(bp) && vp != NULL) {
mutex_enter(vp->v_interlock);
vp->v_numoutput++;
mutex_exit(vp->v_interlock);
Index: kern/vfs_wapbl.c
===
RCS file: /cvsroot/src/sys/kern/vfs_wapbl.c,v
retrieving revision 1.92
diff -u -p -r1.92 vfs_wapbl.c
--- kern/vfs_wapbl.c17 Mar 2017 03:19:46 -  1.92

Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-12 Thread Edgar Fuß
Some comments as I probably count as one of the larger WAPBL consumers (we 
have ~150 employee's Home and Mail on NFS on FFS2+WAPBL on RAIDframe on SAS):

DH> E.g. I'm not convinced that writing out journal blocks synchronously
DH> one at a time will be faster than flushing the cache at the end of a
DH> journal write, even though the latter inflicts collateral damage in
DH> the sense of waiting for perhaps many blocks that don't need to be
DH> waited for.
I'll be happy to instrument this in Real Life (see above) if that helps.

JD> Indeed - writing journal blocks sync one by one is unlikely to be
JD> faster than sending them all async and doing a cache flush at the end;
JD> that wouldn't make sense.
Journal blocks are not written one by one (you starve a RAID to death with
that), but coalesced into (mostly) 64k chunks aligned with FS blocks (which
normally are aligned with RAID stripes or you're dead anyway).

Also, I remember each journal flush to actually cause two cache syncs, one
before and one after writing the actual data.

JD> it adds (another) incentive to actually integrate AHCI NCQ support
What about SCSI TCQ? I seem to remember all that flushing could be avoided 
if the FS used queueing. DH's comments on barriers seem to second that.


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-05 Thread Jaromír Doleček
Here is an updated patch. It was updated to check for the FUA support
for SCSI, using the MODE SENSE device-specific flag. Code was tested
with QEMU emulated bha(4) and nvme. WAPBL code was updated to use the
flag. It keeps the flag naming for now.

In the patch, WAPBL sets the flag for journal writes, and also for the
metadata buffer for bawrite() call after journal commit.

There is a possible layer violation for the metadata write - b_flags are
supposed to be set by the owner of the buffer. Not sure how strict we
want/need to be there - perhaps introduce another flag field? Also the
flag probably needs to be unset in the biodone hook, so that the code
guarantees the buffer in the buffer cache doesn't accidentally keep it over
to another I/O.

Jaromir
? dev/ic/TODO.nvme
Index: sys/buf.h
===
RCS file: /cvsroot/src/sys/sys/buf.h,v
retrieving revision 1.126
diff -u -p -r1.126 buf.h
--- sys/buf.h	26 Dec 2016 23:12:33 -0000	1.126
+++ sys/buf.h	5 Mar 2017 22:08:35 -0000
@@ -198,11 +198,13 @@ struct buf {
 #define	B_RAW		0x00080000	/* Set by physio for raw transfers. */
 #define	B_READ		0x00100000	/* Read buffer. */
 #define	B_DEVPRIVATE	0x02000000	/* Device driver private flag. */
+#define	B_FUA		0x08000000	/* Force Unit Access flag (mandatory). */
+#define	B_DPO		0x10000000	/* Disable Page Out flag (advisory). */
 
 #define BUF_FLAGBITS \
     "\20\1AGE\3ASYNC\4BAD\5BUSY\10DELWRI" \
     "\12DONE\13COWDONE\15GATHERED\16INVAL\17LOCKED\20NOCACHE" \
-    "\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH"
+    "\23PHYS\24RAW\25READ\32DEVPRIVATE\33VFLUSH\34FUA\35DPO"
 
 /* Avoid weird code due to B_WRITE being a "pseudo flag" */
 #define BUF_ISREAD(bp) (((bp)->b_flags & B_READ) == B_READ)
Index: sys/dkio.h
===
RCS file: /cvsroot/src/sys/sys/dkio.h,v
retrieving revision 1.22
diff -u -p -r1.22 dkio.h
--- sys/dkio.h	8 Dec 2015 20:36:15 -0000	1.22
+++ sys/dkio.h	5 Mar 2017 22:08:35 -0000
@@ -85,6 +85,8 @@
 #define	DKCACHE_RCHANGE	0x000100 /* read enable is changeable */
 #define	DKCACHE_WCHANGE	0x000200 /* write enable is changeable */
 #define	DKCACHE_SAVE	0x010000 /* cache parameters are savable/save them */
+#define	DKCACHE_FUA	0x020000 /* Force Unit Access supported */
+#define	DKCACHE_DPO	0x040000 /* Disable Page Out supported */
 
 	/* sync disk cache */
 #define	DIOCCACHESYNC	_IOW('d', 118, int)	/* sync cache (force?) */
Index: kern/vfs_wapbl.c
===
RCS file: /cvsroot/src/sys/kern/vfs_wapbl.c,v
retrieving revision 1.87
diff -u -p -r1.87 vfs_wapbl.c
--- kern/vfs_wapbl.c	5 Mar 2017 13:57:29 -0000	1.87
+++ kern/vfs_wapbl.c	5 Mar 2017 22:08:35 -0000
@@ -70,6 +70,7 @@ __KERNEL_RCSID(0, "$NetBSD: vfs_wapbl.c,
 static struct sysctllog *wapbl_sysctl;
 static int wapbl_flush_disk_cache = 1;
 static int wapbl_verbose_commit = 0;
+static int wapbl_use_fua = 1;
 
 static inline size_t wapbl_space_free(size_t, off_t, off_t);
 
@@ -229,6 +230,12 @@ struct wapbl {
u_char *wl_buffer;  /* l:   buffer for wapbl_buffered_write() */
daddr_t wl_buffer_dblk; /* l:   buffer disk block address */
size_t wl_buffer_used;  /* l:   buffer current use */
+
+   int wl_dkcache; /* r:   disk cache flags */
+#define WAPBL_USE_FUA(wl)  \
+   (wapbl_use_fua && ISSET(wl->wl_dkcache, DKCACHE_FUA))
+   int wl_jwrite_flags;/* r:   journal write flags */
+   int wl_mwrite_flags;/* r:   metadata write flags */
 };
 
 #ifdef WAPBL_DEBUG_PRINT
@@ -280,6 +287,8 @@ static void wapbl_deallocation_free(stru
 static void wapbl_evcnt_init(struct wapbl *);
 static void wapbl_evcnt_free(struct wapbl *);
 
+static void wapbl_dkcache_init(struct wapbl *);
+
 #if 0
 int wapbl_replay_verify(struct wapbl_replay *, struct vnode *);
 #endif
@@ -390,6 +399,30 @@ wapbl_evcnt_free(struct wapbl *wl)
 	evcnt_detach(&wl->wl_ev_cacheflush);
 }
 
+static void
+wapbl_dkcache_init(struct wapbl *wl)
+{
+   int error;
+
+   /* Get disk cache flags */
+	error = VOP_IOCTL(wl->wl_devvp, DIOCGCACHE, &wl->wl_dkcache,
+   FWRITE, FSCRED);
+   if (error) {
+   /* behave as if there is a write cache */
+   wl->wl_dkcache = DKCACHE_WRITE;
+   }
+
+   /* Use FUA instead of cache flush if available */
+   if (WAPBL_USE_FUA(wl)) {
+   wl->wl_jwrite_flags |= B_FUA;
+   wl->wl_mwrite_flags |= B_FUA;
+   }
+
+   /* Use DPO for journal writes if available */
+   if (ISSET(wl->wl_dkcache, DKCACHE_DPO))
+   wl->wl_jwrite_flags |= B_DPO;
+}
+
 static int
 wapbl_start_flush_inodes(struct wapbl *wl, struct wapbl_replay *wr)
 {
@@ -562,6 +595,8 @@ 

Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-03 Thread Jaromír Doleček
2017-03-03 18:11 GMT+01:00 David Holland :
> Yes and no; there's also standard terminology for talking about
> caches, so my inclination would be to call it something like
> B_MEDIASYNC: synchronous at the media level.

Okay, this might be good. Words better than acronyms :)

>  > For DPO it's not so clear cut maybe. We could reuse B_NOCACHE maybe
>  > for the same functionality, but not sure if it matches with  what swap
>  > is using this flag for. DPO is ideal for journal writes however,
>  > that's why I want to add the support for it now.
>
> What does DPO do?

It tells the hardware not to store the data into its cache. Or more
precisely, not to put it into the cache if that would mean having to
evict something else from it. It should improve general performance
of the disk - the journal writes will not trash the device cache.

B_MEDIANOCACHE?

> Perfect abstractions for any of these would be more complex, but
> barriers serve pretty well.

Perhaps we could start with reworking DIOCCACHESYNC into a barrier :)
Currently it is not actually guaranteed to be executed after the
already queued writes - the ioctl is executed out of band, bypassing
the bufq queue. Hence it doesn't actually quite work if there are any
in-flight async writes, as queued e.g. by the bawrite() calls in
ffs_fsync().

In linux the block write interface accounts for that, there are flags
to ask a sync to be done before or after the I/O, and it is also
possible to send just empty I/O with only the sync flags. Thus the
sync is always queued along with the writes. It would be good to adopt
something like this, but that would require bufq interface changes and
possibly device driver changes, with much broader tree disturbance.

At least with FUA the caller can ensure to have all the writes safely
on media, and wouldn't depend on an out-of-band ioctl.

> Single synchronous block writes are a bad way to implement barriers
> and it maybe makes sense to have two models and force every fs to be
> able to do things two different ways; but single synchronous block
> writes are also a bad way to implement any of the above invariants.
> E.g. I'm not convinced that writing out journal blocks synchronously
> one at a time will be faster than flushing the cache at the end of a
> journal write, even though the latter inflicts collateral damage in
> the sense of waiting for perhaps many blocks that don't need to be
> waited for.

Indeed - writing journal blocks sync one by one is unlikely to be
faster than sending them all async and doing a cache flush at the end;
that wouldn't make sense.

I plan to change WAPBL to do the journal writes partially async. It
will use several bufs, issue the I/O asynchronously and only wait if
it runs out of buffers, or if it needs to do the commit. Usually there
seem to be three or four block writes done as part of the transaction
commit, so there is a decent parallelism opportunity.

> I guess it would help if I knew what you were intending to do with
> wapbl in this regard; have you posted that? (I've been at best
> skimming tech-kern the past few months...)

I haven't posted details on the WAPBL part of the changes. I'll put
together a patch over the weekend, and send it over. It will be useful
to show my thinking how the proposed interface could be used.

> Getting it working first is great but I'm not sure a broadly exposed
> piece of infrastructure should be committed in a preliminary design
> state... especially in a place (the legacy buffer cache) that's
> already a big ol' mess.

That's one of reasons I want to keep the current changes minimal :)

The proposed patch doesn't actually touch the legacy buffer cache code
at all. It only adds another B_* flag, and changes hardware device
drivers to react upon it. The flag is supposed to be set by the
caller, for example by WAPBL itself. Nothing in e.g. ffs would set the
flags.

> I guess what worries me is the possibility of getting interrupted by
> real life and then all this remaining in a half-done state in the long
> term; there are few things worse for maintainability in general than
> half-finished reorgs that end up getting left to bitrot. :-/

There is a semi-good chance this will be finished into a workable state
soon - I picked up journaling improvements as my Bachelor thesis
material, so it either gets done or I will fail :)

> Is there something more generic / less hardware-specific that we can
> put in the fs in the near term?

Well, the FUA support looks like a good candidate for being useful and
could have a direct positive performance impact, so I picked that up. If
we have code taking advantage of FUA, it adds (another) incentive to
actually integrate AHCI NCQ support, as that is the only way to
get FUA support on more contemporary hardware. Also, it's my
understanding that using FUA instead of a full cache sync should be a
huge win for RAID as well, so it's worth it for that avenue too.

> keep in mind that whatever it is might end up in 

Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-03 Thread David Holland
On Thu, Mar 02, 2017 at 09:11:17PM +0100, Jaromír Doleček wrote:
 > > Some quick thoughts, though:
 > >
 > > (1) ultimately it's necessary to patch each driver to crosscheck the
 > > flag, because otherwise eventually there'll be silent problems.
 > 
 > Maybe. I think I like having this as responsibility on the caller for
 > now, avoids too broad tree changes. Ultimately it might indeed be
 > necessary, if we find out that it can't be reasonably be handled by
 > the caller. Like maybe raidframe kicking in spare disk without FUA
 > into set with FUA.

It's more like "in the long run it is hazardous to assume the
upper-level code is perfectly correct".

 > > (2) it would be better not to expose hardware-specific flags in the
 > > buffercache, so it would be better to come up with a name that
 > > reflects the semantics, and a semantic guarantee that's at least
 > > notionally not hardware-specific.
 > 
 > I want to avoid unnecessary private NetBSD nomenclature. If storage
 > industry calls it FUA, it's probably good to just call it FUA.

Yes and no; there's also standard terminology for talking about
caches, so my inclination would be to call it something like
B_MEDIASYNC: synchronous at the media level.

 > For DPO it's not so clear cut maybe. We could reuse B_NOCACHE maybe
 > for the same functionality, but not sure if it matches what swap
 > is using this flag for. DPO is ideal for journal writes however,
 > that's why I want to add the support for it now.

What does DPO do?

 > > (3) as I recall (can you remind those of us not currently embedded in
 > > this stuff what the semantics of FUA actually are?) FUA is *not* a
 > > write barrier (as in, all writes before happen before all writes
 > > after) and since write barriers are a natural expression of the
 > > requirements for many fses, it would be well to make sure the
 > > implementation of this doesn't conflict with that.
 > 
 > FUA doesn't enforce any barriers. It merely changes the semantics of
 > the write request - the hardware will return a success response only
 > after the data is written to non-volatile media.
 > 
 > Any barriers required by filesystem semantics need to be handled by the
 > fs code, same as now with DIOCCACHESYNC.
 > 
 > I've talked about adding some kind of generic barrier support in the
 > previous thread. After thinking about it, and reading more, I'm not
 > convinced it's necessary. Incidentally, Linux has moved away from the
 > generic barriers and pushed the logic into their fs code, which can
 > DTRT with e.g. journal transactions, too.

The reason barriers keep coming up is that barriers express the
requirements of filesystems reasonably well; e.g. for a journaling
filesystem, the requirement is that when you write a bunch of journal
blocks, they must all become permanent before any following blocks.
Similarly, for a snapshot/shadow-paging based fs like zfs, you write a
whole bunch of stuff, then a new superblock (which must come strictly
after) and then you go on. And for log-structured fses, generally you
want all the blocks from one segment to be written before any of the
next.

Perfect abstractions for any of these would be more complex, but
barriers serve pretty well.

Single synchronous block writes are a bad way to implement barriers
and it maybe makes sense to have two models and force every fs to be
able to do things two different ways; but single synchronous block
writes are also a bad way to implement any of the above invariants.
E.g. I'm not convinced that writing out journal blocks synchronously
one at a time will be faster than flushing the cache at the end of a
journal write, even though the latter inflicts collateral damage in
the sense of waiting for perhaps many blocks that don't need to be
waited for.

I guess it would help if I knew what you were intending to do with
wapbl in this regard; have you posted that? (I've been at best
skimming tech-kern the past few months...)

 > > (3a) Also, past discussion of this stuff has centered around trying to
 > > identify a single coherent interface for fs code to use, with the
 > > expansion into whatever hardware semantics are available happening in
 > > the bufferio layer. This would prevent needing conditional logic on
 > > device features in every fs. However, AFAICR these discussions have
 > > never reached any clear conclusion. Do you have any opinion on that?
 > 
 > I think that I'd like to have at least two different places in kernel
 > needing particular interface before generalizing this into a bufferio
 > level. Or at minimum, I'd like to have it working on one place
 > correctly, and then it can be generalized before using it on second
 > place. It would be awesome to use FUA e.g. for fsync(2), but let's not
 > get too ahead of ourselves.

Well, if I counted correctly we have seventeen on-disk filesystems (if
you count wapbl separately) and while one of them's read-only, all the
others need to manipulate the disk cache. Most of them currently 

Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-02 Thread Jaromír Doleček
> Some quick thoughts, though:
>
> (1) ultimately it's necessary to patch each driver to crosscheck the
> flag, because otherwise eventually there'll be silent problems.

Maybe. I think I like having this as responsibility on the caller for
now, avoids too broad tree changes. Ultimately it might indeed be
necessary, if we find out that it can't be reasonably be handled by
the caller. Like maybe raidframe kicking in spare disk without FUA
into set with FUA.

> (2) it would be better not to expose hardware-specific flags in the
> buffercache, so it would be better to come up with a name that
> reflects the semantics, and a semantic guarantee that's at least
> notionally not hardware-specific.

I want to avoid unnecessary private NetBSD nomenclature. If storage
industry calls it FUA, it's probably good to just call it FUA.

For DPO it's not so clear cut maybe. We could reuse B_NOCACHE maybe
for the same functionality, but not sure if it matches what swap
is using this flag for. DPO is ideal for journal writes however,
that's why I want to add the support for it now.

> (3) as I recall (can you remind those of us not currently embedded in
> this stuff what the semantics of FUA actually are?) FUA is *not* a
> write barrier (as in, all writes before happen before all writes
> after) and since write barriers are a natural expression of the
> requirements for many fses, it would be well to make sure the
> implementation of this doesn't conflict with that.

FUA doesn't enforce any barriers. It merely changes the semantics of
the write request - the hardware will return a success response only
after the data is written to non-volatile media.

Any barriers required by filesystem semantics need to be handled by the
fs code, same as now with DIOCCACHESYNC.

I've talked about adding some kind of generic barrier support in the
previous thread. After thinking about it, and reading more, I'm not
convinced it's necessary. Incidentally, Linux has moved away from the
generic barriers and pushed the logic into their fs code, which can
DTRT with e.g. journal transactions, too.

> (3a) Also, past discussion of this stuff has centered around trying to
> identify a single coherent interface for fs code to use, with the
> expansion into whatever hardware semantics are available happening in
> the bufferio layer. This would prevent needing conditional logic on
> device features in every fs. However, AFAICR these discussions have
> never reached any clear conclusion. Do you have any opinion on that?

I think that I'd like to have at least two different places in kernel
needing particular interface before generalizing this into a bufferio
level. Or at minimum, I'd like to have it working on one place
correctly, and then it can be generalized before using it on second
place. It would be awesome to use FUA e.g. for fsync(2), but let's not
get too ahead of ourselves.

We don't commit too much right now besides a B_* flag. I'd rather
keep this raw and lean for now, and concentrate on fixing the device
drivers to work with the flags correctly. Only then maybe come up with
an interface to make it easier for general use.

I want to avoid broadening the scope too much. Especially since I want
to introduce SATA NCQ support within next few months, which might need
some tweaks to the semantics again.

> We don't want to block improvements to wapbl while we figure out the
> one true device interface, but on the other hand I'd rather not
> acquire a new set of long-term hacks. Stuff like the "logic" wapbl
> uses to intercept the synchronous writes issued by the FFS code is
> very expensive to get rid of later.

Yes, that funny bwrite() not being a real bwrite() until issued for a
second time from WAPBL :) Quite ugly. It's a shame the B_LOCKED hack is
not really extensible to also cover data in the journal, as it holds all
transaction data in memory.

Jaromir


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-02 Thread David Holland
On Wed, Mar 01, 2017 at 10:37:00PM +0100, Jaromír Doleček wrote:
 > I'm working on an interface for WAPBL to use Force Unit Access (FUA)
 > feature on compatible hardware (currently SCSI and NVMe), as a
 > replacement to full disk cache flushes. I'd also like to add support
 > for DPO (Disable Page Out), as that is trivial extension of FUA
 > support at least for SCSI.

Good, good :-)

Some quick thoughts, though:

(1) ultimately it's necessary to patch each driver to crosscheck the
flag, because otherwise eventually there'll be silent problems.

(2) it would be better not to expose hardware-specific flags in the
buffercache, so it would be better to come up with a name that
reflects the semantics, and a semantic guarantee that's at least
notionally not hardware-specific.

(3) as I recall (can you remind those of us not currently embedded in
this stuff what the semantics of FUA actually are?) FUA is *not* a
write barrier (as in, all writes before happen before all writes
after) and since write barriers are a natural expression of the
requirements for many fses, it would be well to make sure the
implementation of this doesn't conflict with that.

(3a) Also, past discussion of this stuff has centered around trying to
identify a single coherent interface for fs code to use, with the
expansion into whatever hardware semantics are available happening in
the bufferio layer. This would prevent needing conditional logic on
device features in every fs. However, AFAICR these discussions have
never reached any clear conclusion. Do you have any opinion on that?

We don't want to block improvements to wapbl while we figure out the
one true device interface, but on the other hand I'd rather not
acquire a new set of long-term hacks. Stuff like the "logic" wapbl
uses to intercept the synchronous writes issued by the FFS code is
very expensive to get rid of later.

(4) please update bufferio.9 :-)

-- 
David A. Holland
dholl...@netbsd.org