Re: RAIDframe: passing component capabilities
On Fri, 31 Mar 2017 17:15:38 +0200 Edgar Fuß wrote:

> > given that RAIDframe (nor ccd, nor much else) has a general 'query
> > the underlying layers to ask about this capability' function.
> Is there a ``neither'' missing between ``that'' and ``RAIDframe''?

Yes, sorry.

> > (NetBSD 8 refusing to configure a RAID set because of this is not an
> > option.)
> Of course not. With my model, you would need to (re-)configure the
> RAID set with ``all components have SCSI tagged queueing'' in order
> for the RAID device to announce that capability. If one of the drives
> is SATA, that configuration fails. If you later try to replace a SCSI
> drive with a SATA one it fails like it fails when the replacement
> drive has insufficient capacity.
> It's just like with capacities: There's no need to announce the full
> component capacity to the set (well, in fact, you don't use the full
> drive capacity for the partition that constitutes the component), but
> the component needs to have at least the announced capacity (in fact,
> you need to be able to create a partition of sufficient size on the
> drive). With capabilities, there would also be no need to announce
> all the drive's capabilities, but a component (original or
> replacement) needs to have at least the announced capabilities.

That still requires RAIDframe then asking the components (or having them report to RAIDframe when they are attached) about whether or not they can do a certain thing, in order to decide whether or not the reconfiguration succeeds or fails.

Later...

Greg Oster
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
> On Mar 31, 2017, at 4:16 PM, Thor Lancelot Simon wrote:
>
> On Fri, Mar 31, 2017 at 07:16:25PM +0200, Jaromír Doleček wrote:
>>> The problem is that it does not always use SIMPLE and ORDERED tags in a
>>> way that would facilitate the use of ORDERED tags to enforce barriers.
>>
>> Our scsipi layer actually never issues ORDERED tags right now as far
>> as I can see, and there is currently no interface to get it set for an
>> I/O.
>
> It's not obvious, but in fact ORDERED gets set for writes
> as a default, I believe -- in sd.c, I think?

Why would you do that? I don't know that as standard SCSI practice, and it seems like a recipe for slow performance.

paul
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
On Fri, Mar 31, 2017 at 07:16:25PM +0200, Jaromír Doleček wrote:
> > The problem is that it does not always use SIMPLE and ORDERED tags in a
> > way that would facilitate the use of ORDERED tags to enforce barriers.
>
> Our scsipi layer actually never issues ORDERED tags right now as far
> as I can see, and there is currently no interface to get it set for an
> I/O.

It's not obvious, but in fact ORDERED gets set for writes as a default, I believe -- in sd.c, I think? This confused me for some time when I last looked at it.

> I lived under assumption that SIMPLE tagged commands could be and are
> reordered by the controller/drive at will already, without setting any
> other flags.

They might be -- there are well defined mode page bits to control this, but I believe targets are free to use whatever default they like.

> > When SCSI tagged queueing is used properly, it is not necessary to set WCE
> > to get good write performance, and doing so is in fact harmful, since it
> > allows the drive to return ORDERED commands as complete before any of the
> > data for those or prior commands have actually been committed to stable
> > storage.
>
> This was what I meant when I said "even ordered tags couldn't avoid
> the cache flushes". Using ORDERED tags doesn't provide on-media
> integrity when WCE is set.

Setting WCE on SCSI drives is simply a bad idea. It is not necessary for performance and creates data integrity issues.

> Now, it might be the case that the on-media integrity is not the
> primary goal. Then flush is only a write barrier, not integrity
> measure. In that case yes, ORDERED does keep the semantics (e.g.
> earlier journal writes are written before later journal writes). It
> does make stuff much easier to code, too - simply mark I/O as ORDERED
> and fire, no need to explicitly wait for completion, and can drop e.g.
> journal locks faster.
>
> I do think that it's important to concentrate on case where WCE is on,
> since that is realistically what majority of systems run with.

I don't believe most SCSI drives are run with WCE on. I agree FUA or its equivalent is needed for non-SCSI drives.

Thor
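
[Editorial sketch, not part of the original message.] The "mode page bits" mentioned above can be illustrated by decoding raw MODE SENSE page data. The bit positions below are given as I understand them from the SCSI specifications (WCE is bit 2 of byte 2 of the Caching mode page 0x08; the queue algorithm modifier is the upper nibble of byte 3 of the Control mode page 0x0a) and should be checked against the standard before use. The helper names are hypothetical and this is not the code in NetBSD's sd.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Page codes and bit masks, assumed from the SCSI specs (verify before use). */
#define CACHING_PAGE_CODE  0x08  /* Caching mode page */
#define CACHING_WCE_BIT    0x04  /* byte 2, bit 2: write cache enable */
#define CONTROL_PAGE_CODE  0x0a  /* Control mode page */

/* Hypothetical helper: true if a raw Caching mode page has WCE set. */
static bool
page_wce_enabled(const uint8_t *page, size_t len)
{
	assert(page != NULL);
	if (len < 3 || (page[0] & 0x3f) != CACHING_PAGE_CODE)
		return false;
	return (page[2] & CACHING_WCE_BIT) != 0;
}

/*
 * Hypothetical helper: returns the queue algorithm modifier from a raw
 * Control mode page (0 = restricted reordering, 1 = unrestricted
 * reordering), or -1 if the buffer is not a Control mode page.
 */
static int
page_queue_algorithm(const uint8_t *page, size_t len)
{
	assert(page != NULL);
	if (len < 4 || (page[0] & 0x3f) != CONTROL_PAGE_CODE)
		return -1;
	return page[3] >> 4;
}
```

Both helpers take only the page itself, starting at the page-code byte; stripping the mode parameter header and block descriptors from a real MODE SENSE reply is left out.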
Re: Restricting rdtsc [was: kernel aslr]
Maxime Villard wrote:
> Having read several papers on the exploitation of cache latency to defeat
> aslr (kernel or not), it appears that disabling the rdtsc instruction is a
> good mitigation on x86. However, some applications can legitimately use it,
> so I would rather suggest restricting it to root instead.

It's ASLR that's broken, not rdtsc, and I strongly object to restricting the latter just so that people can continue to gain a false sense of security from the former.

--
Andreas Gustafsson, g...@gson.org
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
> The problem is that it does not always use SIMPLE and ORDERED tags in a
> way that would facilitate the use of ORDERED tags to enforce barriers.

Our scsipi layer actually never issues ORDERED tags right now as far as I can see, and there is currently no interface to get it set for an I/O.

> Also, that we may not know enough about the behavior of our filesystems
> in the real world to be 100% sure it's safe to set the other mode page
> bits that allow the drive to arbitrarily reorder SIMPLE commands (which
> under some conditions is necessary to match the performance of running
> with WCE set).

I lived under assumption that SIMPLE tagged commands could be and are reordered by the controller/drive at will already, without setting any other flags.

> When SCSI tagged queueing is used properly, it is not necessary to set WCE
> to get good write performance, and doing so is in fact harmful, since it
> allows the drive to return ORDERED commands as complete before any of the
> data for those or prior commands have actually been committed to stable
> storage.

This was what I meant when I said "even ordered tags couldn't avoid the cache flushes". Using ORDERED tags doesn't provide on-media integrity when WCE is set.

Now, it might be the case that the on-media integrity is not the primary goal. Then flush is only a write barrier, not integrity measure. In that case yes, ORDERED does keep the semantics (e.g. earlier journal writes are written before later journal writes). It does make stuff much easier to code, too - simply mark I/O as ORDERED and fire, no need to explicitly wait for completion, and can drop e.g. journal locks faster.

I do think that it's important to concentrate on case where WCE is on, since that is realistically what majority of systems run with.

Just for record, I can see these practical problems with ORDERED:
1. only available on SCSI, so still needs fallback barrier logic for less awesome hw
2. Windows and Linux used to always use SIMPLE tags and wait for completion; suggests this avenue may have been already explored and found not interesting enough, or too buggy (remember scheduler activations?)
3. bufq processing needs special care for MPSAFE SCSI drivers, to prevent processing any further commands while I/O with ORDERED tag is being submitted to the controller.

I still see my FUA effort as more direct replacement of the cache flushes, for it keeps both the logical and on-media integrity. Also, it will benefit the SATA disks too, once/if NCQ is integrated.

I think that implementing barrier/ORDERED can be parallel effort, similar to the maxphys branch. I don't think barriers will make FUA irrelevant, as it's still needed for systems with WCE on.

Jaromir
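
[Editorial sketch, not part of the original message.] One way to read the argument above is that ordering and on-media durability are two separate choices, each needing a fallback. The following minimal C sketch picks a mechanism for each. All names (`CAP_FUA`, `CAP_ORDERED`, `plan_journal_commit`) are hypothetical and do not come from WAPBL or any NetBSD header. It also assumes, per the discussion in this thread, that with WCE off a completed write is already on stable storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits; not taken from any NetBSD header. */
#define CAP_FUA      0x01u  /* device honours Force Unit Access on writes */
#define CAP_ORDERED  0x02u  /* device honours ORDERED tags (SCSI only) */

enum ordering   { ORDER_BY_WAITING, ORDER_BY_TAG };
enum durability { DURABLE_WRITE_THROUGH, DURABLE_BY_FUA, DURABLE_BY_CACHE_FLUSH };

struct commit_plan {
	enum ordering   ordering;    /* how earlier journal writes stay first */
	enum durability durability;  /* how the data is forced onto the media */
};

/* Choose ordering and durability mechanisms for one journal commit. */
static struct commit_plan
plan_journal_commit(uint32_t caps, bool wce_on)
{
	struct commit_plan plan;

	/* Reject capability bits this sketch does not know about. */
	assert((caps & ~(CAP_FUA | CAP_ORDERED)) == 0);

	/* Without ORDERED tags, fall back to waiting for completion. */
	plan.ordering = (caps & CAP_ORDERED) ? ORDER_BY_TAG : ORDER_BY_WAITING;

	if (!wce_on)
		plan.durability = DURABLE_WRITE_THROUGH;   /* nothing extra needed */
	else if (caps & CAP_FUA)
		plan.durability = DURABLE_BY_FUA;          /* per-write FUA flag */
	else
		plan.durability = DURABLE_BY_CACHE_FLUSH;  /* explicit cache flush */
	return plan;
}
```

In this reading, ORDERED support changes only the `ordering` field, which is why barriers alone would not make FUA irrelevant while WCE is on.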
Re: RAIDframe: passing component capabilities
> given that RAIDframe (nor ccd, nor much else) has a general 'query the
> underlying layers to ask about this capability' function.
Is there a ``neither'' missing between ``that'' and ``RAIDframe''?

> (NetBSD 8 refusing to configure a RAID set because of this is not an
> option.)
Of course not. With my model, you would need to (re-)configure the RAID set with ``all components have SCSI tagged queueing'' in order for the RAID device to announce that capability. If one of the drives is SATA, that configuration fails. If you later try to replace a SCSI drive with a SATA one it fails like it fails when the replacement drive has insufficient capacity.

It's just like with capacities: There's no need to announce the full component capacity to the set (well, in fact, you don't use the full drive capacity for the partition that constitutes the component), but the component needs to have at least the announced capacity (in fact, you need to be able to create a partition of sufficient size on the drive). With capabilities, there would also be no need to announce all the drive's capabilities, but a component (original or replacement) needs to have at least the announced capabilities.
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
On Fri, Mar 31, 2017 at 02:16:44PM +0200, Edgar Fuß wrote:
> Oh well.
>
> TLS> If the answer is that you're running with WCE on in the mode pages, then
> TLS> don't do that:
> EF> I don't get that. If you turn off the write cache, you need neither cache
> EF> flushes nor ordering, no?
> MB> You still need ordering. With tagged queuing, you have multiple commands
> MB> running at the same time (up to 256, maybe more for newer scsi) and the
> MB> drive is free to complete them in any order. Unless one of them is an
> MB> ORDERED command, in which case commands queued before have to complete
> MB> before.
>
> I guess we are talking past each other. I should have phrased that ``If you
> don't use any tagging and turn off the write cache, ...''.

But that doesn't make sense. Why would our SCSI layer not use tagging?

The problem is that it does not always use SIMPLE and ORDERED tags in a way that would facilitate the use of ORDERED tags to enforce barriers.

Also, that we may not know enough about the behavior of our filesystems in the real world to be 100% sure it's safe to set the other mode page bits that allow the drive to arbitrarily reorder SIMPLE commands (which under some conditions is necessary to match the performance of running with WCE set).

When SCSI tagged queueing is used properly, it is not necessary to set WCE to get good write performance, and doing so is in fact harmful, since it allows the drive to return ORDERED commands as complete before any of the data for those or prior commands have actually been committed to stable storage.

Thor
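
[Editorial sketch, not part of the original message.] The barrier semantics described above (SIMPLE commands may complete in any order, but everything queued before an ORDERED command must complete before it, and everything queued after it must complete later) can be modelled with a small checker. This is a toy model with hypothetical names, not scsipi code. It models completion order only: as noted in the thread, with WCE set "complete" does not imply "on stable storage".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum tag { TAG_SIMPLE, TAG_ORDERED };

#define MAX_CMDS 64  /* arbitrary limit for this sketch */

/*
 * tags[i]  : tag of the i-th command in submission order
 * order[p] : index of the command that completed p-th
 * Returns true if the completion order respects every ORDERED barrier.
 */
static bool
completion_order_valid(const enum tag *tags, const size_t *order, size_t n)
{
	size_t pos[MAX_CMDS];

	assert(tags != NULL && order != NULL);
	if (n > MAX_CMDS)
		return false;

	/* Invert the permutation: pos[i] = when command i completed. */
	for (size_t i = 0; i < n; i++)
		pos[i] = SIZE_MAX;
	for (size_t p = 0; p < n; p++) {
		if (order[p] >= n || pos[order[p]] != SIZE_MAX)
			return false;  /* not a permutation */
		pos[order[p]] = p;
	}

	for (size_t k = 0; k < n; k++) {
		if (tags[k] != TAG_ORDERED)
			continue;
		for (size_t i = 0; i < n; i++) {
			if (i < k && pos[i] > pos[k])
				return false;  /* earlier command finished late */
			if (i > k && pos[i] < pos[k])
				return false;  /* later command jumped the barrier */
		}
	}
	return true;
}
```

For a journal, the commit block would carry the ORDERED tag so that all journal data queued before it completes first.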
Re: RAIDframe: passing component capabilities (was: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL)
On Wed, 29 Mar 2017 12:02:23 +0200 Edgar Fuß wrote:

> EF> Some comments as I probably count as one of the larger WAPBL
> EF> consumers (we have ~150 employees' Home and Mail on NFS on
> EF> FFS2+WAPBL on RAIDframe on SAS):
> JD> I've not changed the code in RF to pass the cache flags, so the
> JD> patch doesn't actually enable FUA there. Mainly because disks
> JD> come and go and I'm not aware of mechanism to make WAPBL aware of
> JD> such changes. It
> TLS> I ran into this issue with tls-maxphys and got so frustrated I
> TLS> was actually considering simply panicking if a less-capable disk
> TLS> were used to replace a more-capable one.
> Oops. What did you do in the end? What does Mr. RAIDframe say?
>
> My (probably simplistic) idea would be to add a capabilities option
> to the configuration file, and just as you can't add a disc with
> insufficient capacity, you can't add one with insufficient
> capabilities. Of course, greater capabilities are to be ignored just
> as a larger capacity is.

FUA/maxphys/anything 'disk'-specific is a bit of a pain to deal with, given that RAIDframe (nor ccd, nor much else) has a general 'query the underlying layers to ask about this capability' function. I see two major things here:

1) Whatever we do can't break existing setups. That is, if an underlying disk can't do FUA, then upper layers just need to Deal. (NetBSD 8 refusing to configure a RAID set because of this is not an option.)

2) Whatever query mechanism is used must be device agnostic at the higher levels. It needs to work for RAID, SAS, SCSI, SATA, HP-IB, etc, and leave it up to the lower levels to respond with the correct "Yes all devices I talk to (recursively) can do this" or "No, at least one of us can't do this" to the query. And then it's up to the drivers to actually pass the appropriate flags and do the Right Things.

Later...

Greg Oster
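
[Editorial sketch, not part of the original message.] The recursive "yes, all devices I talk to can do this" query in point 2 amounts to intersecting capability masks down the device tree. The following C sketch assumes a hypothetical `struct dev` with a capability bitmask. No such interface exists in RAIDframe or ccd, which is exactly the gap the message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits for illustration only. */
#define CAP_FUA  0x01u
#define CAP_TCQ  0x02u

struct dev {
	uint32_t own_caps;                  /* what this layer itself can pass on */
	const struct dev *const *children;  /* components; NULL for a real disk */
	size_t nchildren;
};

/*
 * Answer "which capabilities do I and everything below me support?".
 * A capability survives only if this layer and every component,
 * recursively, report it.
 */
static uint32_t
dev_query_caps(const struct dev *d)
{
	uint32_t caps;

	assert(d != NULL);
	caps = d->own_caps;
	for (size_t i = 0; i < d->nchildren; i++)
		caps &= dev_query_caps(d->children[i]);
	return caps;
}
```

A missing bit is not an error here. In line with point 1, the upper layer just sees a smaller mask and has to deal with it, for example by keeping the explicit cache flush.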
Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL
Oh well.

TLS> If the answer is that you're running with WCE on in the mode pages, then
TLS> don't do that:
EF> I don't get that. If you turn off the write cache, you need neither cache
EF> flushes nor ordering, no?
MB> You still need ordering. With tagged queuing, you have multiple commands
MB> running at the same time (up to 256, maybe more for newer scsi) and the
MB> drive is free to complete them in any order. Unless one of them is an
MB> ORDERED command, in which case commands queued before have to complete
MB> before.

I guess we are talking past each other. I should have phrased that ``If you don't use any tagging and turn off the write cache, ...''.

The course of arguments was:
1. Jaromir wrote about FUA and integrating AHCI NCQ support
2. Edgar remembered each journal commit to cause two cache flushes and asked whether using SCSI TCQ could save the cache flushes
3. Jaromir responded that even ordered tags couldn't avoid the cache flushes
4. Edgar wrote that the point of the flushes seems to guarantee an order and asked why then ordered tags couldn't make them unnecessary
5. TLS seconded Edgar, asked why tags weren't good enough, and stated ``If the answer is that you're running with WCE on in the mode pages, then don't do that''
6. Edgar (as I now guess) misinterpreted that as ``current behaviour (no tags) and write caching'', while TLS (I guess) meant ``potential future behaviour (using tags) and write caching'' and also phrased his own reply (omitting he referred to the current situation without any queueing) in a way that provoked further misunderstanding
7. Manuel misunderstood Edgar and explained problems arising from using no write caching and unordered tagging.
So while I'm still not sure what the SCSI behaviour is with both write caching and tagged queuing (I would guess turning write caching on/off doesn't make much of a difference when you queue everything, but I may well be missing something fundamental), I still have the impression that using (ordered) queueing and no cache flushes would be the perfect solution for journalling.