Re: RAIDframe: passing component capabilities

2017-03-31 Thread Greg Oster
On Fri, 31 Mar 2017 17:15:38 +0200
Edgar Fuß  wrote:

> > given that RAIDframe (nor ccd, nor much else) has a general 'query
> > the underlying layers to ask about this capability' function.  
> Is there a ``neither'' missing between ``that'' and ``RAIDframe''?

Yes, sorry.

> > (NetBSD 8 refusing to configure a RAID set because of this is not an
> > option.)  
> Of course not. With my model, you would need to (re-)configure the
> RAID set with ``all components have SCSI tagged queueing'' in order
> for the RAID device to announce that capability. If one of the drives
> is SATA, that configuration fails. If you later try to replace a SCSI
> drive with a SATA one it fails like it fails when the replacement
> drive has insufficient capacity.
> It's just like with capacities: There's no need to announce the full
> component capacity to the set (well, in fact, you don't use the full
> drive capacity for the partition that constitutes the component), but
> the component needs to have at least the announced capacity (in fact,
> you need to be able to create a partition of sufficient size on the
> drive). With capabilities, there would also be no need to announce
> all the drive's capabilities, but a component (original or
> replacement) needs to have at least the announced capabilities.

That still requires RAIDframe then asking the components (or having
them report to RAIDframe when they are attached) about whether or not
they can do a certain thing, in order to decide whether or not the
reconfiguration succeeds or fails.

Later...

Greg Oster


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Paul.Koning

> On Mar 31, 2017, at 4:16 PM, Thor Lancelot Simon  wrote:
> 
> On Fri, Mar 31, 2017 at 07:16:25PM +0200, Jaromír Doleček wrote:
>>> The problem is that it does not always use SIMPLE and ORDERED tags in a
>>> way that would facilitate the use of ORDERED tags to enforce barriers.
>> 
>> Our scsipi layer actually never issues ORDERED tags right now as far
>> as I can see, and there is currently no interface to get it set for an
>> I/O.
> 
> It's not obvious, but in fact ORDERED gets set for writes
> as a default, I believe -- in sd.c, I think?

Why would you do that?  I don't know of that as standard SCSI practice, and it
seems like a recipe for slow performance.

paul



Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Thor Lancelot Simon
On Fri, Mar 31, 2017 at 07:16:25PM +0200, Jaromír Doleček wrote:
> > The problem is that it does not always use SIMPLE and ORDERED tags in a
> > way that would facilitate the use of ORDERED tags to enforce barriers.
> 
> Our scsipi layer actually never issues ORDERED tags right now as far
> as I can see, and there is currently no interface to get it set for an
> I/O.

It's not obvious, but in fact ORDERED gets set for writes
as a default, I believe -- in sd.c, I think?

This confused me for some time when I last looked at it.

> I was under the assumption that SIMPLE tagged commands could be and are
> reordered by the controller/drive at will already, without setting any
> other flags.

They might be -- there are well defined mode page bits
to control this, but I believe targets are free to use
whatever default they like.

> 
> > When SCSI tagged queueing is used properly, it is not necessary to set WCE
> > to get good write performance, and doing so is in fact harmful, since it
> > allows the drive to return ORDERED commands as complete before any of the
> > data for those or prior commands have actually been committed to stable
> > storage.
> 
> This was what I meant when I said "even ordered tags couldn't avoid
> the cache flushes". Using ORDERED tags doesn't provide on-media
> integrity when WCE is set.

Setting WCE on SCSI drives is simply a bad idea.  It is
not necessary for performance and creates data integrity
issues.

> Now, it might be the case that on-media integrity is not the
> primary goal. Then the flush is only a write barrier, not an integrity
> measure. In that case yes, ORDERED does keep the semantics (e.g.
> earlier journal writes are written before later journal writes). It
> does make stuff much easier to code, too - simply mark I/O as ORDERED
> and fire, no need to explicitly wait for completion, and can drop e.g.
> journal locks faster.
> 
> I do think that it's important to concentrate on the case where WCE is on,
> since that is realistically what the majority of systems run with.

I don't believe most SCSI drives are run with WCE on.

I agree FUA or its equivalent is needed for non-SCSI
drives.

Thor


Re: Restricting rdtsc [was: kernel aslr]

2017-03-31 Thread Andreas Gustafsson
Maxime Villard wrote:
> Having read several papers on the exploitation of cache latency to defeat
> aslr (kernel or not), it appears that disabling the rdtsc instruction is a
> good mitigation on x86. However, some applications can legitimately use it,
> so I would rather suggest restricting it to root instead.

It's ASLR that's broken, not rdtsc, and I strongly object to
restricting the latter just so that people can continue to gain
a false sense of security from the former.
-- 
Andreas Gustafsson, g...@gson.org


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Jaromír Doleček
> The problem is that it does not always use SIMPLE and ORDERED tags in a
> way that would facilitate the use of ORDERED tags to enforce barriers.

Our scsipi layer actually never issues ORDERED tags right now as far
as I can see, and there is currently no interface to get it set for an
I/O.

> Also, that we may not know enough about the behavior of our filesystems
> in the real world to be 100% sure it's safe to set the other mode page
> bits that allow the drive to arbitrarily reorder SIMPLE commands (which
> under some conditions is necessary to match the performance of running
> with WCE set).

I was under the assumption that SIMPLE tagged commands could be and are
reordered by the controller/drive at will already, without setting any
other flags.

> When SCSI tagged queueing is used properly, it is not necessary to set WCE
> to get good write performance, and doing so is in fact harmful, since it
> allows the drive to return ORDERED commands as complete before any of the
> data for those or prior commands have actually been committed to stable
> storage.

This was what I meant when I said "even ordered tags couldn't avoid
the cache flushes". Using ORDERED tags doesn't provide on-media
integrity when WCE is set.

Now, it might be the case that on-media integrity is not the
primary goal. Then the flush is only a write barrier, not an integrity
measure. In that case yes, ORDERED does keep the semantics (e.g.
earlier journal writes are written before later journal writes). It
does make stuff much easier to code, too - simply mark I/O as ORDERED
and fire, no need to explicitly wait for completion, and can drop e.g.
journal locks faster.
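
As an illustration of the barrier semantics discussed above, here is a
minimal sketch in C. It is not NetBSD or scsipi code; the names enum tag
and completion_order_ok() are made up for this example. It assumes the
usual model in which SIMPLE commands may complete in any order relative
to each other, while nothing may overtake, or be overtaken by, an
ORDERED command.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tag types for this sketch, not the scsipi definitions. */
enum tag { TAG_SIMPLE, TAG_ORDERED };

/*
 * tags[i] is the tag of the i-th submitted command.
 * done[k] is the submission index of the k-th command to complete.
 * Returns true if the completion order respects every ORDERED command
 * as a barrier: commands submitted before it complete before it, and
 * commands submitted after it complete after it.
 */
static bool
completion_order_ok(const enum tag *tags, const size_t *done, size_t n)
{
    assert(n == 0 || (tags != NULL && done != NULL));

    for (size_t a = 0; a < n; a++) {
        for (size_t b = a + 1; b < n; b++) {
            size_t first = done[a];   /* completed earlier */
            size_t second = done[b];  /* completed later */

            if (first < second)
                continue;             /* same as submission order */

            /* 'first' overtook 'second'; only legal if both are SIMPLE. */
            if (tags[first] == TAG_ORDERED || tags[second] == TAG_ORDERED)
                return false;
        }
    }
    return true;
}
```

With a journal write tagged ORDERED between two batches of SIMPLE writes,
the check accepts any shuffling inside each batch but rejects any
completion order that crosses the ORDERED command. Note that this only
models ordering of completions; as said above, with WCE set a completed
command is not necessarily on stable storage.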

I do think that it's important to concentrate on the case where WCE is on,
since that is realistically what the majority of systems run with.

Just for the record, I can see these practical problems with ORDERED:
1. only available on SCSI, so still needs fallback barrier logic for
less awesome hw
2. Windows and Linux used to always use SIMPLE tags and wait for
completion; suggests this avenue may have been already explored and
found not interesting enough, or too buggy (remember scheduler
activations?)
3. bufq processing needs special care for MPSAFE SCSI drivers, to
prevent processing any further commands while I/O with ORDERED tag is
being submitted to the controller.

I still see my FUA effort as a more direct replacement of the cache
flushes, for it keeps both the logical and on-media integrity. Also,
it will benefit the SATA disks too, once/if NCQ is integrated.

I think that implementing barrier/ORDERED can be a parallel effort,
similar to the maxphys branch. I don't think barriers will make FUA
irrelevant, as it's still needed for systems with WCE on.

Jaromir


Re: RAIDframe: passing component capabilities

2017-03-31 Thread Edgar Fuß
> given that RAIDframe (nor ccd, nor much else) has a general 'query the
> underlying layers to ask about this capability' function.
Is there a ``neither'' missing between ``that'' and ``RAIDframe''?

> (NetBSD 8 refusing to configure a RAID set because of this is not an
> option.)
Of course not. With my model, you would need to (re-)configure the RAID set 
with ``all components have SCSI tagged queueing'' in order for the RAID 
device to announce that capability. If one of the drives is SATA, that 
configuration fails. If you later try to replace a SCSI drive with a SATA 
one it fails like it fails when the replacement drive has insufficient
capacity.
It's just like with capacities: There's no need to announce the full component 
capacity to the set (well, in fact, you don't use the full drive capacity for 
the partition that constitutes the component), but the component needs to have
at least the announced capacity (in fact, you need to be able to create a 
partition of sufficient size on the drive).
With capabilities, there would also be no need to announce all the drive's 
capabilities, but a component (original or replacement) needs to have at least
the announced capabilities.
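
The capacity analogy above can be sketched in a few lines of C. This is
only an illustration under stated assumptions: RAIDframe has no such
structures or capability bits today, and struct set_config,
struct component, CAP_TAGGED_QUEUEING, CAP_FUA and
component_acceptable() are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits; illustrative names only. */
#define CAP_TAGGED_QUEUEING (UINT32_C(1) << 0)
#define CAP_FUA             (UINT32_C(1) << 1)

/* Hypothetical description of a component (original or replacement). */
struct component {
    uint64_t sectors;  /* size of the partition forming the component */
    uint32_t caps;     /* capabilities the underlying drive offers */
};

/* What the set was configured with and therefore announces upward. */
struct set_config {
    uint64_t sectors;
    uint32_t caps;
};

/*
 * A component is acceptable if it offers at least the announced
 * capacity and at least the announced capabilities. Extra capacity
 * and extra capabilities are simply ignored.
 */
static bool
component_acceptable(const struct set_config *set, const struct component *c)
{
    assert(set != NULL && c != NULL);

    if (c->sectors < set->sectors)
        return false;
    return (c->caps & set->caps) == set->caps;
}
```

A set configured with CAP_TAGGED_QUEUEING would then reject a SATA
replacement lacking that bit in the same way it rejects a drive that is
too small, while a set configured with no capabilities accepts anything
of sufficient size.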


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Thor Lancelot Simon
On Fri, Mar 31, 2017 at 02:16:44PM +0200, Edgar Fuß wrote:
> Oh well.
> 
> TLS> If the answer is that you're running with WCE on in the mode pages, then
> TLS> don't do that:
> EF> I don't get that. If you turn off the write cache, you need neither cache 
> EF> flushes nor ordering, no?
> MB> You still need ordering. With tagged queuing, you have multiple commands
> MB> running at the same time (up to 256, maybe more for newer SCSI) and the
> MB> drive is free to complete them in any order.  Unless one of them is an
> MB> ORDERED command, in which case commands queued before have to complete
> MB> before.
> 
> I guess we are talking past each other. I should have phrased that ``If you
> don't use any tagging and turn off the write cache, ...''.

But that doesn't make sense.  Why would our SCSI layer not use tagging?

The problem is that it does not always use SIMPLE and ORDERED tags in a
way that would facilitate the use of ORDERED tags to enforce barriers.

Also, that we may not know enough about the behavior of our filesystems
in the real world to be 100% sure it's safe to set the other mode page
bits that allow the drive to arbitrarily reorder SIMPLE commands (which
under some conditions is necessary to match the performance of running
with WCE set).

When SCSI tagged queueing is used properly, it is not necessary to set WCE
to get good write performance, and doing so is in fact harmful, since it
allows the drive to return ORDERED commands as complete before any of the
data for those or prior commands have actually been committed to stable
storage.

Thor


Re: RAIDframe: passing component capabilities (was: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL)

2017-03-31 Thread Greg Oster
On Wed, 29 Mar 2017 12:02:23 +0200
Edgar Fuß  wrote:

> EF> Some comments as I probably count as one of the larger WAPBL
> EF> consumers (we have ~150 employees' Home and Mail on NFS on
> EF> FFS2+WAPBL on RAIDframe on SAS):
> JD> I've not changed the code in RF to pass the cache flags, so the
> JD> patch doesn't actually enable FUA there. Mainly because disks
> JD> come and go and I'm not aware of a mechanism to make WAPBL aware of
> JD> such changes. It
> TLS> I ran into this issue with tls-maxphys and got so frustrated I
> TLS> was actually considering simply panicking if a less-capable disk
> TLS> were used to replace a more-capable one.  
> Oops. What did you do in the end? What does Mr. RAIDframe say?
> 
> My (probably simplistic) idea would be to add a capabilities option
> to the configuration file, and just as you can't add a disc with
> insufficient capacity, you can't add one with insufficient
> capabilities. Of course, greater capabilities are to be ignored just
> as a larger capacity is.

FUA/maxphys/anything 'disk'-specific is a bit of a pain to deal with,
given that RAIDframe (nor ccd, nor much else) has a general 'query the
underlying layers to ask about this capability' function.

I see two major things here:
 1) Whatever we do can't break existing setups.  That is, if an
 underlying disk can't do FUA, then upper layers just need to Deal.
 (NetBSD 8 refusing to configure a RAID set because of this is not an
 option.)

 2) Whatever query mechanism is used must be device agnostic at the
 higher levels.  It needs to work for RAID, SAS, SCSI, SATA, HP-IB,
 etc, and leave it up to the lower levels to respond with the correct
 "Yes all devices I talk to (recursively) can do this" or "No,
 at least one of us can't do this" to the query.  And then it's up to
 the drivers to actually pass the appropriate flags and do the Right
 Things.
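
 A rough sketch of what point 2 could look like, in C. This is not an
 existing NetBSD interface: struct dev_node, CAP_FUA and dev_can_do()
 are hypothetical names, and a real mechanism would presumably go
 through the drivers rather than a plain tree in memory. The only idea
 shown is the recursive "all devices I talk to can do this" answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_FUA (UINT32_C(1) << 0)  /* hypothetical capability bit */

/*
 * Hypothetical device node: either a leaf disk (nchildren == 0) or a
 * layer such as RAID or ccd stacked on top of other devices.
 */
struct dev_node {
    uint32_t leaf_caps;                      /* used only for leaves */
    size_t nchildren;
    const struct dev_node *const *children;
};

/*
 * Device-agnostic query: a leaf answers for itself, a layer answers
 * "yes" only if every device below it (recursively) answers "yes".
 */
static bool
dev_can_do(const struct dev_node *dev, uint32_t cap)
{
    assert(dev != NULL);

    if (dev->nchildren == 0)
        return (dev->leaf_caps & cap) == cap;

    for (size_t i = 0; i < dev->nchildren; i++) {
        if (!dev_can_do(dev->children[i], cap))
            return false;
    }
    return true;
}
```

 A "no" here does not make configuration fail (point 1); the upper
 layer just learns that it has to fall back to whatever it does today.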

Later...

Greg Oster


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Edgar Fuß
Oh well.

TLS> If the answer is that you're running with WCE on in the mode pages, then
TLS> don't do that:
EF> I don't get that. If you turn off the write cache, you need neither cache 
EF> flushes nor ordering, no?
MB> You still need ordering. With tagged queuing, you have multiple commands
MB> running at the same time (up to 256, maybe more for newer SCSI) and the
MB> drive is free to complete them in any order.  Unless one of them is an
MB> ORDERED command, in which case commands queued before have to complete
MB> before.

I guess we are talking past each other. I should have phrased that ``If you
don't use any tagging and turn off the write cache, ...''.

The course of arguments was:
1. Jaromir wrote about FUA and integrating AHCI NCQ support
2. Edgar remembered that each journal commit causes two cache flushes and asked
   whether using SCSI TCQ could save the cache flushes
3. Jaromir responded that even ordered tags couldn't avoid the cache flushes
4. Edgar wrote that the point of the flushes seems to be to guarantee an order
   and asked why then ordered tags couldn't make them unnecessary
5. TLS seconded Edgar, asked why tags weren't good enough, and stated ``If 
   the answer is that you're running with WCE on in the mode pages, then 
   don't do that''
6. Edgar (as I now guess) mis-interpreted that as ``current behaviour (no
   tags) and write caching'', while TLS (I guess) meant ``potential future
   behaviour (using tags) and write caching'' and also phrased his own
   reply (omitting that he referred to the current situation without any
   queueing) in a way that provoked further mis-understanding
7. Manuel mis-understood Edgar and explained problems arising from using no
   write caching and unordered tagging.

So while I'm still not sure what the SCSI behaviour is with both write caching
and tagged queuing (I would guess turning write caching on/off doesn't make
much of a difference when you queue everything, but I may well be missing
something fundamental), I still have the impression that using (ordered)
queueing and no cache flushes would be the perfect solution for journalling.