Re: Problem identified: WAPL/RAIDframe performance problems
On Sat, Dec 01, 2012 at 11:38:55PM -0500, Mouse wrote:
 > > > > things.  What I care about is the largest size sector that will (in
 > > > > the ordinary course of things anyway) be written atomically.
 > > > Then those are 512-byte-sector drives [...]
 > > No; because I can do 4K atomic writes, I want to know about that.
 >
 > And, can't you do that with traditional drives, drives which really do
 > have 512-byte sectors?  Do a 4K transfer and you write 8 physical
 > sectors with no opportunity for any other operation to see the write
 > partially done.  Is that wrong, or am I missing something else?

Insert a kernel panic (or power failure(*)) after five sectors and
it's not atomic.  One sector, at least in theory(*), is.

(*) let's ignore for now the various daft things that disks sometimes
do in practice.

--
David A. Holland
dholl...@netbsd.org
Re: Problem identified: WAPL/RAIDframe performance problems
On Mon, Dec 03, 2012 at 12:19:58AM +, Julian Yon wrote:
 > You appear to have just agreed with me, which makes me wonder what
 > I'm missing, given you continue as though you disagree.

You asked why 4096-byte-sector disks accept 512-byte writes.  I was
trying to explain.

 > However, we're talking about hardware here, so you have to also
 > consider the possibility that the drive firmware reports 512 because
 > that's what someone coded up back in 1992 and nobody got around to
 > fixing it.  If that doesn't count as broken, what does?  (Also, gosh,
 > when did 1992 become so long ago?)

By this standard, most hardware is broken.

--
David A. Holland
dholl...@netbsd.org
Re: Problem identified: WAPL/RAIDframe performance problems
On Tue, Dec 04, 2012 at 02:14:27PM +, David Holland wrote:
 > On Sat, Dec 01, 2012 at 11:38:55PM -0500, Mouse wrote:
 >  > > > > things.  What I care about is the largest size sector that will
 >  > > > > (in the ordinary course of things anyway) be written atomically.
 >  > > > Then those are 512-byte-sector drives [...]
 >  > > No; because I can do 4K atomic writes, I want to know about that.
 >  >
 >  > And, can't you do that with traditional drives, drives which really do
 >  > have 512-byte sectors?  Do a 4K transfer and you write 8 physical
 >  > sectors with no opportunity for any other operation to see the write
 >  > partially done.  Is that wrong, or am I missing something else?
 >
 > Insert a kernel panic (or power failure(*)) after five sectors and
 > it's not atomic.

What's a kernel panic got to do with it?  If you hand the controller,
and thus the drive, a 4K write, the kernel panicking won't suddenly
cause you to reverse time and have issued 8 512-byte writes instead.

Given how drives actually write data, I would not be so sanguine that
any sector, of whatever size, in-flight when the power fails, is
actually written with the values you expect, or not written at all.
Re: Problem identified: WAPL/RAIDframe performance problems
On Tue, Dec 04, 2012 at 09:26:17AM -0500, Thor Lancelot Simon wrote:
 >  >  > And, can't you do that with traditional drives, drives which really
 >  >  > do have 512-byte sectors?  Do a 4K transfer and you write 8 physical
 >  >  > sectors with no opportunity for any other operation to see the write
 >  >  > partially done.  Is that wrong, or am I missing something else?
 >  >
 >  > Insert a kernel panic (or power failure(*)) after five sectors and
 >  > it's not atomic.
 >
 > What's a kernel panic got to do with it?  If you hand the controller,
 > and thus the drive, a 4K write, the kernel panicking won't suddenly
 > cause you to reverse time and have issued 8 512-byte writes instead.

That depends on additional properties of the pathway from the FS to
the drive firmware.  It might have sent 1 of 2 2048-byte writes before
the panic, for example.  Or it might be a vintage controller incapable
of handling more than one sector at a time.

Also, if there's a panic while the kernel is in the middle of talking
to the drive, such that the drive receives only part of the data you
intended to send, one can be reasonably certain it will reject a
partial sector... but if it's received 5 of 8 physical sectors and the
6th is partial, it may well write out those 5, which isn't what was
intended.

 > Given how drives actually write data, I would not be so sanguine that
 > any sector, of whatever size, in-flight when the power fails, is
 > actually written with the values you expect, or not written at all.

Yes, I'm aware of that.  It remains a useful approximation, especially
for already-existing FS code.

--
David A. Holland
dholl...@netbsd.org
Re: Problem identified: WAPL/RAIDframe performance problems
On Tue, Dec 04, 2012 at 02:57:52PM +, David Holland wrote:
 >  > What's a kernel panic got to do with it?  If you hand the controller,
 >  > and thus the drive, a 4K write, the kernel panicking won't suddenly
 >  > cause you to reverse time and have issued 8 512-byte writes instead.
 >
 > That depends on additional properties of the pathway from the FS to
 > the drive firmware.  It might have sent 1 of 2 2048-byte writes before
 > the panic, for example.  Or it might be a vintage controller incapable
 > of handling more than one sector at a time.

The ATA command set supports writes of multiple sectors and
multi-sector writes (probably not using those terms though!).

In the first case, although a single command is written, the drive
will (effectively) loop through the sectors writing them 1 by 1.  All
drives support this mode.

For multi-sector writes, the data transfer for each group of sectors
is done as a single burst.  So if the drive supports 8-sector
multi-sector writes, and you are doing PIO transfers, you take a
single 'data' interrupt and then write all 4k bytes at once (assuming
512 byte sectors).  The drive identify response indicates whether
multi-sector writes are supported, and if so how many sectors can be
written at once.  If the data transfer is DMA, it probably makes
little difference to the driver.

For quite a long time the NetBSD ata driver mixed them up - and would
only request writes of multiple sectors if the drive supported
multi-sector writes.  Multi-sector writes are probably quite difficult
to kill part way through since there is only one DMA transfer block.

 > Given how drives actually write data, I would not be so sanguine that
 > any sector, of whatever size, in-flight when the power fails, is
 > actually written with the values you expect, or not written at all.

Given that (AFAIK) a physical sector is not dissimilar from an HDLC
frame, once the write has started the old data is gone; if the write
is actually interrupted you'll get a (correctable) bad sector.  If you
are really unlucky the write will be long - and trash the following
sector (I managed to power off a floppy controller before it wrecked
the rest of a track when I'd reset the writer with write enabled).  If
you are really, really unlucky I think it is possible to destroy
adjacent tracks.

	David

--
David Laight: da...@l8s.co.uk
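The interrupt arithmetic described above can be sketched in a few lines. This is an illustration only (the function name is invented, and it is not NetBSD driver code): sector-at-a-time PIO takes one data interrupt per 512-byte sector, while multi-sector mode takes one per group of up to `multi` sectors.

```c
#include <assert.h>

/*
 * Illustrative sketch (not driver code): number of 'data' interrupts
 * a PIO write of nsectors takes, given the drive's multi-sector count.
 * multi == 1 (or 0, i.e. unsupported) models one interrupt per
 * 512-byte sector.
 */
static unsigned
pio_data_interrupts(unsigned nsectors, unsigned multi)
{
	if (multi == 0)
		multi = 1;		/* treat "unsupported" as 1 sector */
	/* ceil(nsectors / multi) groups, one interrupt each */
	return (nsectors + multi - 1) / multi;
}
```

So a 4k write (8 x 512-byte sectors) on a drive advertising 8-sector multi-sector support takes a single data interrupt instead of eight.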
Re: Problem identified: WAPL/RAIDframe performance problems
mo...@rodents-montreal.org (Mouse) writes:

 > > > things.  What I care about is the largest size sector that will (in
 > > > the ordinary course of things anyway) be written atomically.
 > > Then those are 512-byte-sector drives [...]
 > No; because I can do 4K atomic writes, I want to know about that.
 >
 > And, can't you do that with traditional drives, drives which really do
 > have 512-byte sectors?  Do a 4K transfer and you write 8 physical
 > sectors with no opportunity for any other operation to see the write
 > partially done.  Is that wrong, or am I missing something else?

The drive could partially complete the write, i.e. if one of the
latter sectors has a write error or if the drive is powered down in
the middle of the operation.

Sure, you would know about it.  But in case of a crash you can't rely
on data consistency.

--
Michael van Elst
Internet: mlel...@serpens.de
                  "A potential Snark may lurk in every tree."
Re: Problem identified: WAPL/RAIDframe performance problems
On Sun, 2 Dec 2012 04:04:23 + David Holland
dholland-t...@netbsd.org wrote:

 > On Sun, Dec 02, 2012 at 03:22:24AM +, Julian Yon wrote:
 >  >  > It's not weird, and there is a gain; it's for compatibility with
 >  >  > large amounts of deployed code that assumes all devices have
 >  >  > 512-byte blocks.
 >  > If code makes that assumption, how does the reported block size
 >  > affect that?  Lying is illogical.  Code either assumes a specific
 >  > size (and ignores what you tell it), or it believes what it's told.
 >  > Either way, dishonesty gains nothing.
 >
 > If code just blindly makes that assumption, it's ignoring what's
 > being reported.

You appear to have just agreed with me, which makes me wonder what I'm
missing, given you continue as though you disagree.

I assume there is or was code in Windows (like we used to have code in
NetBSD) that would check the sector size and refuse to run if it
wasn't 512.  IMHO any time you do the same thing as Windows, you're
almost certainly doing it wrong.

However, we're talking about hardware here, so you have to also
consider the possibility that the drive firmware reports 512 because
that's what someone coded up back in 1992 and nobody got around to
fixing it.  If that doesn't count as broken, what does?  (Also, gosh,
when did 1992 become so long ago?)

Julian

--
3072D/F3A66B3A Julian Yon (2012 General Use) pgp.2...@jry.me
Re: Problem identified: WAPL/RAIDframe performance problems
mo...@rodents-montreal.org (Mouse) writes:

 > > > These disks lie about their actual sector size.
 > > These disks just follow their specification.
 > That's as meaningless as...oh, to pick an unreasonably extreme
 > example, a hitman saying I was just following orders.

Apparently as meaningless as saying "lies about".

 > > They also report the true sector size.
 > Not according to the documentation, at least not in the one case I
 > investigated.  The documentation flat-out says the sector size is 4K,
 > but the disk claims to have half-K sectors.
 >
 > The problem is that there are two sizes here,

That's why the disk has multiple attributes that it can report.

 > Neither.  The sector size claimed to the host should equal both the
 > sector size on the media and the granularity of the interface.

Apparently that doesn't work out :)

 > Anything else is IMO a bug in the drive and should be treated as such,
 > which in NetBSD's case I would say means a quirk entry, documented as
 > being a workaround for broken hardware, for it.

Believing the drive that it has standard sector sizes works fine.

--
Michael van Elst
Internet: mlel...@serpens.de
                  "A potential Snark may lurk in every tree."
Re: Problem identified: WAPL/RAIDframe performance problems
jul...@yon.org.uk (Julian Yon) writes:

 > If it's smaller than the atomic write size that's equally weird.
 > Because that implies that the designers have made the explicit
 > decision to sacrifice performance for no gain.

The gain of course is that people can use the drive and will buy it.

--
Michael van Elst
Internet: mlel...@serpens.de
                  "A potential Snark may lurk in every tree."
Re: Problem identified: WAPL/RAIDframe performance problems
On Fri, Nov 30, 2012 at 12:00:52PM +, David Laight wrote:

 > On Fri, Nov 30, 2012 at 08:00:51AM +, Michael van Elst wrote:
 >  > da...@l8s.co.uk (David Laight) writes:
 >  >  > I must look at how to determine that disks have 4k sectors and to
 >  >  > ensure filesystems have 4k fragments - regardless of the fs size.
 >  > newfs should already ensure that fragment >= sector.
 >
 > These disks lie about their actual sector size.
 > The disk's own software does RMW cycles for 512 byte writes.

These disks just follow their specification.  They also report the
true sector size.  The problem is how to interpret it; obviously you
can access the disk in 512 byte units, and the real size and alignment
just affect performance.

So should the disk lie about the blocks you can address, or lie about
some recommended block size for accesses?

The rest of the world just ignores such problems by using some values
that are sufficiently sized/aligned for old and new disks.

Greetings,
--
Michael van Elst
Internet: mlel...@serpens.de
                  "A potential Snark may lurk in every tree."
Re: Problem identified: WAPL/RAIDframe performance problems
 > > These disks lie about their actual sector size.
 > These disks just follow their specification.

That's as meaningless as...oh, to pick an unreasonably extreme
example, a hitman saying I was just following orders.

 > They also report the true sector size.

Not according to the documentation, at least not in the one case I
investigated.  The documentation flat-out says the sector size is 4K,
but the disk claims to have half-K sectors.

The problem is that there are two sizes here, which have historically
been identical: the sector size on the media and the granularity of
the interface.  Trouble is, they were identical for good reason.  I
consider decoupling them slightly broken.  I consider decoupling them
without updating the interface to report both sizes cripplingly
broken.

 > So should the disk lie about the blocks you can address, or lie about
 > some recommended block size for accesses?

Neither.  The sector size claimed to the host should equal both the
sector size on the media and the granularity of the interface.  Either
that or a new interface should be defined which reports both the media
sector size and the interface grain size.  Anything else is IMO a bug
in the drive and should be treated as such, which in NetBSD's case I
would say means a quirk entry, documented as being a workaround for
broken hardware, for it.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: Problem identified: WAPL/RAIDframe performance problems
On Sat, Dec 01, 2012 at 04:27:14PM -0500, Mouse wrote:
 > Neither.  The sector size claimed to the host should equal both the
 > sector size on the media and the granularity of the interface.

As a consumer of block devices, I don't care about either of these
things.  What I care about is the largest size sector that will (in
the ordinary course of things anyway) be written atomically.  I might
also care about larger sizes that the drive considers significant for
alignment purposes; but probably not very much.

I don't care about the block granularity of the interface.  (Unless I
suppose it's larger than the atomic write size; but that would be
weird.)  I care even less about how the media is organized internally;
if it announces that the atomic write size is 1024 bytes, it's 1024
bytes, even if it really means that it is writing one bit each to 8192
steel drum spindles.

Now, we have legacy code that contains additional assumptions, such as
the belief that the atomic write size is the same from device to
device, or that it can be set at newfs time rather than being a
dynamic/run-time property of the block device.  And we have a lot of
code that uses DEV_BSIZE as a convenient unit of measurement and mixes
it indiscriminately with other device size properties.  However, all
this stuff should be cleaned up in the long term.

It may also be necessary for lower-level code (e.g. the scsi layer) to
know more than this, but any of that can be isolated underneath the
block device interface.

--
David A. Holland
dholl...@netbsd.org
Re: Problem identified: WAPL/RAIDframe performance problems
 >  > Neither.  The sector size claimed to the host should equal both the
 >  > sector size on the media and the granularity of the interface.
 >
 > As a consumer of block devices, I don't care about either of these
 > things.  What I care about is the largest size sector that will (in
 > the ordinary course of things anyway) be written atomically.

Then those are 512-byte-sector drives as far as you're concerned; you
can ignore the 4K reality.  At least, absent bugs in the drives, but
that's always a valid caveat.  This is because the RMW cycle that goes
on internally for sub-4K writes is invisible: a 512-byte write always
either has completed in full or has not yet started at all as far as
all other interactions with the drive go.  That is, such writes (and
reads) are atomic.

It's a coherent point of view.  But it's one I don't share; I care
more about performance than that.  This is why I care about visibility
into internal organization.

 > I might also care about larger sizes that the drive considers
 > significant for alignment purposes; but probably not very much.

That depends on whether you care about performance.

 > I don't care about the block granularity of the interface.

Don't you pretty much have to care about it, since that's the unit in
which data addresses are presented to its interface?  Or is that
something you believe should be hidden by...something else?  (It's not
clear to me exactly what the `you' that doesn't care about interface
granularity includes - hardware driver authors?  filesystem authors?
midlayer (eg scsipi) authors?)

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: Problem identified: WAPL/RAIDframe performance problems
On Sat, 1 Dec 2012 23:46:07 + David Holland
dholland-t...@netbsd.org wrote:

 > I don't care about the block granularity of the interface.  (Unless I
 > suppose it's larger than the atomic write size; but that would be
 > weird.)

If it's smaller than the atomic write size that's equally weird.
Because that implies that the designers have made the explicit
decision to sacrifice performance for no gain.  But there is a cost:
they had to write firmware code to emulate that block size.

Julian

--
3072D/F3A66B3A Julian Yon (2012 General Use) pgp.2...@jry.me
Re: Problem identified: WAPL/RAIDframe performance problems
On Sat, Dec 01, 2012 at 07:07:36PM -0500, Mouse wrote:
 >  >  > Neither.  The sector size claimed to the host should equal both
 >  >  > the sector size on the media and the granularity of the interface.
 >  >
 >  > As a consumer of block devices, I don't care about either of these
 >  > things.  What I care about is the largest size sector that will (in
 >  > the ordinary course of things anyway) be written atomically.
 >
 > Then those are 512-byte-sector drives as far as you're concerned; you
 > can ignore the 4K reality.  At least, absent bugs in the drives, but
 > that's always a valid caveat.

No; because I can do 4K atomic writes, I want to know about that.
(Quite apart from any performance issues.)  Physical realities pretty
much guarantee that the largest atomic write is not going to cause a
RMW cycle... at least on items that are actually block-based.

RAIDs where you have to RMW a whole stripe or something but it isn't
atomic might be a somewhat different story.  I'm not sure how one
would build a journaling FS on one of those without having it suck.
(I guess by stuffing the journal into NVRAM.)

 >  > I don't care about the block granularity of the interface.
 >
 > Don't you pretty much have to care about it, since that's the unit in
 > which data addresses are presented to its interface?  Or is that
 > something you believe should be hidden by...something else?

That is something only the device driver should have to be aware of.
(There's an implicit assumption here that block devices should be
addressed with byte offsets, as they are from userland, even though
this typically wastes a dozen or so bits; the minor overhead is far
preferable to the confusion that arises when you have multiple size
units floating around, and the consequences of just one bug that
mixes block offsets measured in different block sizes can be
catastrophic.)

 > (It's not clear to me exactly what the `you' that doesn't care about
 > interface granularity includes - hardware driver authors?  filesystem
 > authors?  midlayer (eg scsipi) authors?)

I'm speaking from a filesystem point of view; but, more specifically,
I'm talking about the abstraction we call a block device, which sits
above stuff like scsipi.

--
David A. Holland
dholl...@netbsd.org
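The unit-mixing hazard mentioned above can be made concrete. A minimal sketch, with macro and function names merely modeled on the kernel's btodb()/dbtob() conversions (this is an illustration, not the real definitions): a block number in DEV_BSIZE units and a byte offset look alike in the type system, so a value interpreted in the wrong unit is silently off by a factor of 512.

```c
#include <assert.h>
#include <stdint.h>

#define DEV_BSHIFT	9			/* log2(512) */
#define DEV_BSIZE	(1 << DEV_BSHIFT)	/* 512 */

/* bytes -> DEV_BSIZE-unit blocks (truncating) */
static int64_t
btodb_sketch(int64_t bytes)
{
	return bytes >> DEV_BSHIFT;
}

/* DEV_BSIZE-unit blocks -> bytes */
static int64_t
dbtob_sketch(int64_t blocks)
{
	return blocks << DEV_BSHIFT;
}
```

Note the round trip is lossy for offsets that aren't a multiple of 512, and a 512-unit block number handed to code expecting 4096-byte units addresses a spot 8x too far into the disk; that is exactly the catastrophic bug class being described.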
Re: Problem identified: WAPL/RAIDframe performance problems
On Sun, Dec 02, 2012 at 01:32:17AM +, Julian Yon wrote:
 >  > I don't care about the block granularity of the interface.  (Unless
 >  > I suppose it's larger than the atomic write size; but that would be
 >  > weird.)
 >
 > If it's smaller than the atomic write size that's equally weird.
 > Because that implies that the designers have made the explicit
 > decision to sacrifice performance for no gain.  But there is a cost:
 > they had to write firmware code to emulate that block size.

It's not weird, and there is a gain; it's for compatibility with large
amounts of deployed code that assumes all devices have 512-byte
blocks.

--
David A. Holland
dholl...@netbsd.org
Re: Problem identified: WAPL/RAIDframe performance problems
 > > > things.  What I care about is the largest size sector that will (in
 > > > the ordinary course of things anyway) be written atomically.
 > > Then those are 512-byte-sector drives [...]
 > No; because I can do 4K atomic writes, I want to know about that.

And, can't you do that with traditional drives, drives which really do
have 512-byte sectors?  Do a 4K transfer and you write 8 physical
sectors with no opportunity for any other operation to see the write
partially done.  Is that wrong, or am I missing something else?

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: Problem identified: WAPL/RAIDframe performance problems
da...@l8s.co.uk (David Laight) writes:

 > I must look at how to determine that disks have 4k sectors and to
 > ensure filesystems have 4k fragments - regardless of the fs size.

newfs should already ensure that fragment >= sector.

 > By the sound of it the log ought to be written in fs frag (or block)
 > sized chunks - even if that means that 'pad' entries get written in
 > order to flush it to disk after a period of inactivity.

WAPBL?

--
Michael van Elst
Internet: mlel...@serpens.de
                  "A potential Snark may lurk in every tree."
Re: Problem identified: WAPL/RAIDframe performance problems
On Fri, Nov 30, 2012 at 08:00:51AM +, Michael van Elst wrote:
 > da...@l8s.co.uk (David Laight) writes:
 >  > I must look at how to determine that disks have 4k sectors and to
 >  > ensure filesystems have 4k fragments - regardless of the fs size.
 >
 > newfs should already ensure that fragment >= sector.

These disks lie about their actual sector size.
The disk's own software does RMW cycles for 512 byte writes.

	David

--
David Laight: da...@l8s.co.uk
Re: Problem identified: WAPL/RAIDframe performance problems
On Fri, Nov 30, 2012 at 12:00:52PM +, David Laight wrote:
 >  >  > I must look at how to determine that disks have 4k sectors and to
 >  >  > ensure filesystems have 4k fragments - regardless of the fs size.
 >  >
 >  > newfs should already ensure that fragment >= sector.
 >
 > These disks lie about their actual sector size.
 > The disk's own software does RMW cycles for 512 byte writes.

Right, and it's important for FS code to be able to figure out what
the right atomic write size is... on the disk it's using, which might
not be the same as the disk that was newfs'd.

--
David A. Holland
dholl...@netbsd.org
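The cost the FS pays for getting this wrong is read-modify-write inside the drive. A minimal sketch of the arithmetic, assuming a "512e"-style drive with 4K physical sectors (illustration only, function name invented): a write forces an RMW on each physical sector it only partially covers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * How many physical sectors of size psize (e.g. 4096) the drive must
 * read-modify-write to service a write of len bytes at byte offset
 * offset: only the head and/or tail sector can be partially covered.
 */
static unsigned
rmw_sectors(uint64_t offset, uint64_t len, uint64_t psize)
{
	uint64_t first, last;
	int head_partial, tail_partial;

	if (len == 0)
		return 0;
	first = offset / psize;			/* first sector touched */
	last = (offset + len - 1) / psize;	/* last sector touched */
	head_partial = (offset % psize) != 0;
	tail_partial = ((offset + len) % psize) != 0;

	if (first == last)			/* one sector touched */
		return (head_partial || tail_partial) ? 1 : 0;
	return (head_partial ? 1 : 0) + (tail_partial ? 1 : 0);
}
```

So a 4K write aligned to a 4K boundary costs no RMW, while the same 4K write at a 2K offset costs two, and a lone 512-byte write always costs one; this is why knowing the true atomic write size (rather than the newfs-time guess) matters.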
Re: Problem identified: WAPL/RAIDframe performance problems
 > I must look at how to determine that disks have 4k sectors and to
 > ensure filesystems have 4k fragments - regardless of the fs size.

Seems to me the right thing is to believe what the disk tells you.  If
you really want to be friendly to broken hardware, add a quirk for
disks known to lie about their sector size.  (Yes, I consider it
broken for a disk to lie about its sector size.)

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: Problem identified: WAPL/RAIDframe performance problems
Edgar Fuß e...@math.uni-bonn.de writes:

 > I seem to be facing two problems:
 >
 > 1. A certain svn update command is ridiculously slow on my to-be file
 >    server.
 > 2. During the svn update, the machine partially locks up and fails to
 >    respond to NFS requests.
 >
 > Thanks to very kind help by hannken@, I now at least know what the
 > problem is.
 >
 > Short form: WAPBL is currently completely unusable on RAIDframe (I
 > always suspected something like that), at least on non-Level 0 sets.
 >
 > The problem turned out to be wapbl_flush() writing non-fsbsize chunks
 > on non-fsbsize boundaries.  So RAIDframe is nearly sure to RMW.
 > That makes the log being written to disc at about 1MB/s with the write
 > lock on the log being held.  So everything else on that fs tstiles on
 > the log's read lock.

Do you see this on RAID-1 too?

I wonder if it's possible (easily) to make the log only use fsbsize
boundaries, (maybe forcing it to be bigger as a side effect.)
Re: Problem identified: WAPL/RAIDframe performance problems
 > Do you see this on RAID-1 too?
Well, I see a performance degradation, albeit not as much as on Level 5.

 > I wonder if it's possible (easily) to make the log only use fsbsize
 > boundaries, (maybe forcing it to be bigger as a side effect.)
Volunteers welcome.
Re: Problem identified: WAPL/RAIDframe performance problems
On Nov 28, 2012, at 6:02 PM, Greg Troxel g...@ir.bbn.com wrote:

 > Edgar Fuß e...@math.uni-bonn.de writes:
 >  > The problem turned out to be wapbl_flush() writing non-fsbsize
 >  > chunks on non-fsbsize boundaries.  So RAIDframe is nearly sure to
 >  > RMW.  That makes the log being written to disc at about 1MB/s with
 >  > the write lock on the log being held.  So everything else on that
 >  > fs tstiles on the log's read lock.
 >
 > Do you see this on RAID-1 too?
 >
 > I wonder if it's possible (easily) to make the log only use fsbsize
 > boundaries, (maybe forcing it to be bigger as a side effect.)

Sure -- add a fsbsize-sized buffer to struct wapbl and teach
wapbl_write() to collect data until the buffer's start or end touches
a fsbsize boundary.  As long as the writes don't cross the log's end
they already come ordered.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
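The collect-until-a-block-boundary idea can be sketched in user-space terms. Everything here is hypothetical (the names `logbuf`, `logbuf_add`, `logbuf_flush` and the fixed 4K block size are invented for illustration; the real work would live in wapbl_write()): accumulate incoming log bytes and only issue whole, block-sized chunks, padding the tail out to a full block on an explicit flush, as the 'pad' entries suggested elsewhere in the thread would.

```c
#include <assert.h>
#include <stddef.h>

enum { FSBSIZE = 4096 };	/* assumed fs block size, for illustration */

/* Hypothetical accumulator: bytes collected but not yet written out. */
struct logbuf {
	size_t pending;
};

/*
 * Add len bytes of log data; return how many bytes are now writable
 * as whole FSBSIZE-sized chunks (the caller would issue them as one
 * aligned write).  The remainder stays buffered.
 */
static size_t
logbuf_add(struct logbuf *lb, size_t len)
{
	size_t writable;

	lb->pending += len;
	writable = lb->pending - (lb->pending % FSBSIZE);
	lb->pending -= writable;
	return writable;
}

/*
 * Explicit flush: pad any buffered tail out to a full block so even
 * the final write is block-sized and block-aligned.  Returns the
 * (padded) number of bytes to write.
 */
static size_t
logbuf_flush(struct logbuf *lb)
{
	if (lb->pending == 0)
		return 0;
	lb->pending = 0;
	return FSBSIZE;		/* one padded block */
}
```

With this shape, RAIDframe only ever sees block-aligned, block-sized log writes, which is the property that avoids the stripe RMW.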
Re: Problem identified: WAPL/RAIDframe performance problems
Hello.  If running 5.1 or 5.2 is acceptable for you, you could run
ffs+softdep since it has all the namei fixes in it.
-Brian

On Nov 28, 5:15pm, Edgar Fuß wrote:
} Subject: Problem identified: WAPL/RAIDframe performance problems
} I seem to be facing two problems:
}
} 1. A certain svn update command is ridiculously slow on my to-be file
}    server.
} 2. During the svn update, the machine partially locks up and fails to
}    respond to NFS requests.
} Thanks to very kind help by hannken@, I now at least know what the
} problem is.
}
} Short form: WAPBL is currently completely unusable on RAIDframe (I
} always suspected something like that), at least on non-Level 0 sets.
}
} The problem turned out to be wapbl_flush() writing non-fsbsize chunks
} on non-fsbsize boundaries.  So RAIDframe is nearly sure to RMW.
} That makes the log being written to disc at about 1MB/s with the write
} lock on the log being held.  So everything else on that fs tstiles on
} the log's read lock.
}
} Anyone in a position to improve that?  I could simply turn off
} logging, but then any non-clean shutdown is sure to take ages.
-- End of excerpt from Edgar Fuß
Re: Problem identified: WAPL/RAIDframe performance problems
On Wed, Nov 28, 2012 at 06:41:28PM +0100, J. Hannken-Illjes wrote:
 > On Nov 28, 2012, at 6:02 PM, Greg Troxel g...@ir.bbn.com wrote:
 >  > Do you see this on RAID-1 too?
 >  >
 >  > I wonder if it's possible (easily) to make the log only use fsbsize
 >  > boundaries, (maybe forcing it to be bigger as a side effect.)
 >
 > Sure -- add a fsbsize-sized buffer to struct wapbl and teach
 > wapbl_write() to collect data until the buffer's start or end touches
 > a fsbsize boundary.

It is worth looking at the extensive work they did on this in XFS.
Re: Problem identified: WAPL/RAIDframe performance problems
g...@ir.bbn.com (Greg Troxel) writes:

 > I wonder if it's possible (easily) to make the log only use fsbsize
 > boundaries, (maybe forcing it to be bigger as a side effect.)

Writing filesystem blocks won't help.  RAIDframe needs writes as large
as a stripe.

The log itself could write much larger chunks but flushing is done in
a series of writes as small as a single physical block.  I think the
only way to improve that is to copy everything first into a large
buffer.  Not very efficient.

--
Michael van Elst
Internet: mlel...@serpens.de
                  "A potential Snark may lurk in every tree."
Re: Problem identified: WAPL/RAIDframe performance problems
On Nov 28, 2012, at 9:20 PM, Brian Buhrow buh...@nfbcal.org wrote:

 > Hello.  If running 5.1 or 5.2 is acceptable for you, you could run
 > ffs+softdep since it has all the namei fixes in it.

I suppose running fsck on a 6 TByte file system will take hours and
softdep needs this after a crash.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Problem identified: WAPL/RAIDframe performance problems
On Nov 28, 2012, at 10:13 PM, Michael van Elst mlel...@serpens.de wrote:

 > g...@ir.bbn.com (Greg Troxel) writes:
 >  > I wonder if it's possible (easily) to make the log only use fsbsize
 >  > boundaries, (maybe forcing it to be bigger as a side effect.)
 >
 > Writing filesystem blocks won't help.  RAIDframe needs writes as large
 > as a stripe.

The file system block size should match the raid stripe size or you
have much bigger problems than flushing the log.

 > The log itself could write much larger chunks but flushing is done in
 > a series of writes as small as a single physical block.  I think the
 > only way to improve that is to copy everything first into a large
 > buffer.  Not very efficient.

Needing to copy say 8 Mbytes of data and writing it in big chunks will
be much faster than writing it in many smaller unaligned segments.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Problem identified: WAPL/RAIDframe performance problems
 > Writing filesystem blocks won't help.  RAIDframe needs writes as large
 > as a stripe.
Nothing prevents one from making both quantities the same value.

 > The log itself could write much larger chunks but flushing is done in
 > a series of writes as small as a single physical block.  I think the
 > only way to improve that is to copy everything first into a large
 > buffer.  Not very efficient.
As far as I understood hannken@, I'm bitten by writing to the log, not
by flushing it.
Re: Problem identified: WAPL/RAIDframe performance problems
 > I suppose running fsck on a 6 TByte file system will take hours

Based on my own experience with a 7T filesystem, I would suggest you
try it rather than making assumptions.  Depending on your use case,
you may be able to speed fsck up dramatically by choosing the
parameters for your filesystem suitably.  I find that fsck on a
filesystem built with -f 8192 -b 65536 -n 1, for example, is a great
deal faster than on a filesystem built on the same amount of disk
space with the defaults.  (I have a few filesystems for which that
combination of parameters is appropriate: a small number of large
files with little churn.)

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: Problem identified: WAPL/RAIDframe performance problems
On Nov 28, 2012, at 10:19 PM, Edgar Fuß e...@math.uni-bonn.de wrote:

 >  > Writing filesystem blocks won't help.  RAIDframe needs writes as
 >  > large as a stripe.
 > Nothing prevents one from making both quantities the same value.
 >
 >  > The log itself could write much larger chunks but flushing is done
 >  > in a series of writes as small as a single physical block.  I think
 >  > the only way to improve that is to copy everything first into a
 >  > large buffer.  Not very efficient.
 > As far as I understood hannken@, I'm bitten by writing to the log,
 > not by flushing it.

Flushing is just writing to the log.  These writes have sizes between
512 bytes and the file system block size.  Problem is these writes are
neither multiples of nor aligned to the file system block size.

Collecting the data and writing MAXPHYS bytes aligned to MAXPHYS
should improve wapbl on raid.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Problem identified: WAPL/RAIDframe performance problems
>> Writing filesystem blocks won't help. RAIDframe needs writes as large
>> as a stripe.
> Nothing prevents one from making both quantities the same value.
That's not always true. For example, I think filesystem block sizes must be powers of two, but a RAID 5 with four members will necessarily have a stripe size that's a multiple of three.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
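Mouse's arithmetic can be checked directly: a RAID 5 set with N members stripes data over N-1 disks, so the stripe size is (N-1) times the stripe unit. With a power-of-two stripe unit, the stripe size is itself a power of two only when N-1 is. This small sketch (function name and 16K stripe unit are invented for illustration) shows which member counts can match a power-of-two filesystem block size:

```python
def stripe_can_match_pow2_bsize(members, stripe_unit=16384):
    """True if (members-1) * stripe_unit is a power of two, i.e. the
    full data stripe could equal a power-of-two filesystem block size."""
    stripe = (members - 1) * stripe_unit
    return stripe & (stripe - 1) == 0   # power-of-two bit trick

for n in (3, 4, 5, 9):
    print(n, stripe_can_match_pow2_bsize(n))
# 3 True   (2 data disks, 32K stripe)
# 4 False  (3 data disks, 48K stripe -- multiple of three)
# 5 True   (4 data disks, 64K stripe)
# 9 True   (8 data disks, 128K stripe)
```

This is exactly why Edgar's five-disc configuration, unlike a four-member set, can make the two quantities equal.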
Re: Problem identified: WAPL/RAIDframe performance problems
Hello. Well, to each his own, but for comparison, I have a system running 5.1 with the latest namei changes with a 13TB filesystem which, if fsck needs to run, takes less than an hour to complete. I've found 5.1 to be very stable, and so haven't had to worry about the penalty of running fsck after a crash very often. I've found raidframe to be invaluable in my installations, and having WAPBL be broken in 6.x in conjunction with raidframe seems like a pretty big deterrent to me.
Re: Problem identified: WAPL/RAIDframe performance problems
> That's not always true.
OK. Nothing prevents me from making these two values equal (I have five discs).
Re: Problem identified: WAPL/RAIDframe performance problems
On Wed, Nov 28, 2012 at 10:14:58PM +0100, J. Hannken-Illjes wrote:
> On Nov 28, 2012, at 9:20 PM, Brian Buhrow buh...@nfbcal.org wrote:
>> Hello. If running 5.1 or 5.2 is acceptable for you, you could run
>> ffs+softdep since it has all the namei fixes in it.
> I suppose running fsck on a 6 TByte file system will take hours and
> softdep needs this after a crash.
Well, the journal doesn't always avoid the fsck; it depends on the kind of crash (if it's a panic in filesystem code I know I want to run fsck anyway :). Also, the fsck time depends a lot on the filesystem's parameters. A 9Tb filesystem formatted -O2 -b 32k -f4k -i100 can be checked in less than one hour.
--
Manuel Bouyer bou...@antioche.eu.org
NetBSD: 26 ans d'experience feront toujours la difference
Re: Problem identified: WAPL/RAIDframe performance problems
On Wed, Nov 28, 2012 at 10:18:04PM +0100, J. Hannken-Illjes wrote:
> On Nov 28, 2012, at 10:13 PM, Michael van Elst mlel...@serpens.de wrote:
>> g...@ir.bbn.com (Greg Troxel) writes:
>>> I wonder if it's possible (easily) to make the log only use fsbize
>>> boundaries, (maybe forcing it to be bigger as a side effect.)
>> Writing filesystem blocks won't help. RAIDframe needs writes as large
>> as a stripe.
> The file system block size should match the raid stripe size or you
> have much more problems than flushing the log.
True. Still difficult to do, in particular for metadata, which is written in frag-sized blocks. Best for speed is probably to use fragsize=blocksize=64k.
> The log itself could write much larger chunks but flushing is done in
> a series of writes as small as a single physical block. I think the
> only way to improve that is to copy everything first into a large
> buffer. Not very efficient.
Needing to copy say 8 Mbytes of data and writing it in big chunks will be much faster than writing it in many smaller unaligned segments. One or two MB is probably good enough. A quick test of unpacking base.tgz produces transactions of ~3MB and 1.5MB and a few smaller ones of 30-50kB.
--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
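The copy-and-coalesce idea being discussed can be sketched in a few lines. This is a hedged illustration only: the function, the in-memory representation of writes, and the assumption of contiguous pieces are all invented for the example; the real wapbl code operates on kernel buffers, not Python tuples.

```python
MAXPHYS = 64 * 1024  # assumed transfer-size limit, as in the thread

def coalesce(writes, maxphys=MAXPHYS):
    """Merge contiguous (offset, data) pieces into a staging buffer,
    then split it back out on maxphys-aligned boundaries, so the disk
    sees a few large aligned writes instead of many small ones."""
    base = writes[0][0]
    buf = bytearray()
    for off, data in writes:   # pieces assumed contiguous in order
        buf += data
    out = []
    pos, i = base, 0
    while i < len(buf):
        chunk = min(maxphys - pos % maxphys, len(buf) - i)
        out.append((pos, bytes(buf[i:i + chunk])))
        pos += chunk
        i += chunk
    return out

# eight contiguous 512-byte log writes become one 4K write
pieces = [(4096 + i * 512, b"x" * 512) for i in range(8)]
print(len(coalesce(pieces)))   # 1
```

A run that straddles a MAXPHYS boundary is split there, so each resulting write stays within the aligned window — the property that lets RAIDframe turn them into full-stripe operations.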
Re: Problem identified: WAPL/RAIDframe performance problems
On Wed, Nov 28, 2012 at 04:28:57PM -0500, Mouse wrote:
>>> Writing filesystem blocks won't help. RAIDframe needs writes as large
>>> as a stripe.
>> Nothing prevents one from making both quantities the same value.
> That's not always true. For example, I think filesystem block sizes
> must be powers of two, but a RAID 5 with four members will necessarily
> have a stripe size that's a multiple of three.
True. But the size of the writes generated by the filesystems, as it turns out, does not relate in the way you might expect to the filesystem block size. For example, in tls-maxphys Manuel and I have eliminated the code that chose readahead and writebehind (clustering) I/O sizes by shifting the filesystem blocksize (which always gave power-of-two sizes) and replaced it with the more relaxed constraint that it must simply write full pages. So you can have a filesystem with a 4K blocksize but, if you're on a RAIDframe RAID5 volume with 4 disks and an underlying MAXPHYS of 64K, find yourself sending 192K transactions to RAIDframe and thus the desired 64K to each data disk. I don't see why -- in theory -- the log code couldn't do the analogous thing. Though at some point, you end up with the LFS problem -- the need to flush partial clusters of transactions because you don't want to let them linger uncommitted for too much time.
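The 192K figure follows directly from the geometry described: a full-stripe transfer hands each of the N-1 data disks one MAXPHYS-sized write. A one-line check (the function name is invented for illustration; tls-maxphys negotiates these sizes in-kernel):

```python
def full_stripe_xfer(members, maxphys=64 * 1024):
    """Transaction size that gives each RAID 5 data disk one full
    MAXPHYS-sized write: (members - 1) data disks times MAXPHYS."""
    return (members - 1) * maxphys

print(full_stripe_xfer(4) // 1024)   # 192 -- the 192K in the example
```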