Re: [zfs-discuss] Re: RAIDZ2 vs. ZFS RAID-10
On Jan 4, 2007, at 3:25 AM, [EMAIL PROTECTED] wrote:

> Is there some reason why a small read on a raidz2 is not statistically very likely to require I/O on only one device? Assuming a non-degraded pool, of course.
>
> ZFS stores its checksums for RAIDZ/RAIDZ2 in such a way that all disks must be read to compute and verify the checksum.
>
> But why do ZFS reads require the computation of the RAIDZ checksum? If the block checksum is fine, then you need not care about the parity.

It's the block checksum that requires reading all of the disks. If ZFS stored sub-block checksums for the RAID-Z case, then short reads could often be satisfied without reading the whole block (and all disks).

So actually I misspoke slightly; rather than all disks, I should have said all data disks. In practice this has the same effect: the group can process no more than one read at a time.

Anton
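A toy model of the consequence (my own sketch, not ZFS code): if every RAID-Z read has to touch all data disks, the whole group behaves roughly like a single spindle for small random reads, while a mirror/RAID-10 layout scales with the number of disks.

    # Toy model (my sketch, not ZFS code): small-random-read IOPS for a mirrored
    # pool vs. a RAID-Z2 group, assuming -- as described above -- that every
    # RAID-Z read must touch all data disks, so the group delivers roughly one
    # small read at a time.

    def small_read_iops(disks, per_disk_iops, layout):
        if layout == "mirror":
            return disks * per_disk_iops   # each spindle serves reads independently
        if layout == "raidz2":
            return per_disk_iops           # one block checksum spans the stripe
        raise ValueError(f"unknown layout: {layout}")

    for layout in ("mirror", "raidz2"):
        print(layout, small_read_iops(disks=8, per_disk_iops=150, layout=layout))
    # mirror 1200 / raidz2 150, with these assumed per-disk numbers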
Re: [zfs-discuss] Re: Re[2]: RAIDZ2 vs. ZFS RAID-10
On Jan 4, 2007, at 10:26 AM, Roch - PAE wrote:

> All filesystems will incur a read-modify-write when an application is updating a portion of a block.

For most Solaris file systems it is the page size, rather than the block size, that determines read-modify-write behavior; hence 8K (SPARC) or 4K (x86/x64) writes do not require a read-modify-write for UFS/QFS, even when larger block sizes are used. When direct I/O is enabled, UFS and QFS will write directly to disk (without reading) for 512-byte-aligned I/O.

> The read I/O only occurs if the block is not already in the memory cache.

Of course.

> ZFS stores files less than 128K (or less than the filesystem recordsize) as a single block. Larger files are stored as multiple recordsize blocks.

So appending to any file smaller than 128K will result in a read-modify-write cycle (modulo read caching), while a write to a larger file which is not record-size-aligned (by default, 128K) also results in a read-modify-write cycle.

> For RAID-Z a block spreads onto all devices of a group.

Which means that all devices are involved in the read and the write; except, as I believe Casper pointed out, that very small blocks (less than 512 bytes per data device) will reside on a smaller set of disks.

Anton
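A back-of-envelope illustration of that read-modify-write cost (a minimal sketch with made-up names, assuming a cold cache and a filesystem that always operates on whole records):

    # Rough illustration (made-up names, not ZFS internals): bytes physically
    # read and written when an application overwrites write_size bytes inside a
    # large file, assuming a cold cache and a filesystem that always operates on
    # whole records.

    def rmw_bytes(write_size, recordsize):
        # A partial-record update forces the containing record to be read in
        # full, modified in memory, and written back in full.
        read = recordsize if write_size < recordsize else 0
        written = max(recordsize, write_size)
        return read, written

    for rs in (8 * 1024, 128 * 1024):
        r, w = rmw_bytes(write_size=8 * 1024, recordsize=rs)
        print(f"recordsize {rs // 1024:>3}K: read {r // 1024:>3}K, write {w // 1024:>3}K")
    # recordsize   8K: read   0K, write   8K  (record-aligned, no read needed)
    # recordsize 128K: read 128K, write 128K  (an 8K update becomes 256K of I/O)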
Re: [zfs-discuss] Re: ZFS and SE 3511
On Dec 19, 2006, at 7:14 AM, Mike Seda wrote:

> Anton B. Rang wrote:
>
> I have a Sun SE 3511 array with 5 x 500 GB SATA-I disks in a RAID 5. This 2 TB logical drive is partitioned into 10 x 200 GB slices. I gave 4 of these slices to a Solaris 10 U2 machine and added each of them to a concat (non-raid) zpool as listed below:
>
> This is certainly a supportable configuration. However, it's not an optimal one.
>
> What would be the optimal configuration that you recommend?

If you don't need ZFS redundancy, I would recommend taking a single slice for your ZFS file system (e.g. 6 x 200 GB for other file systems, and 1 x 800 GB for the ZFS pool). There would still be contention between the various file systems, but at least ZFS would be working with a single contiguous block of space on the array.

Because of the implicit striping in ZFS, what you have right now is analogous to taking a single disk, partitioning it into several partitions, and then striping across those partitions -- it works, and you can use all of the space, but the rearrangement means that logically contiguous blocks on disk are no longer physically contiguous, which hurts performance substantially.

> Yes, I am worried about the lack of redundancy. And, I have some new disks on order, at least one of which will be a hot spare.

Glad to hear it.

Anton
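A toy model of why the rearrangement hurts (my own simplification with a made-up chunk size; real ZFS allocation is far more involved): round-robin striping across slices carved from the same physical LUN turns logically adjacent data into widely separated physical offsets.

    # Toy model (my simplification, made-up chunk size; real ZFS allocation is
    # far more involved): physical offsets, in GB, of consecutive logical chunks
    # when space comes from one large slice vs. round-robin striping over four
    # 200 GB slices carved from the same RAID-5 LUN.

    SLICE_GB = 200
    CHUNK_GB = 1

    def physical_offset_gb(chunk_no, slices_used):
        slice_idx = chunk_no % slices_used               # round-robin across slices
        offset_in_slice = (chunk_no // slices_used) * CHUNK_GB
        return slice_idx * SLICE_GB + offset_in_slice    # offset within the array

    for n in range(4):
        print(n, "single slice:", physical_offset_gb(n, 1),
              "  striped over 4:", physical_offset_gb(n, 4))
    # single slice: 0, 1, 2, 3 (contiguous); striped: 0, 200, 400, 600 (long seeks)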
Re: [zfs-discuss] Re: Self-tuning recordsize
On Oct 17, 2006, at 12:43 PM, Matthew Ahrens wrote:

> Jeremy Teo wrote:
>> Heya Anton,
>> On 10/17/06, Anton B. Rang [EMAIL PROTECTED] wrote:
>>> No, the reason to try to match recordsize to the write size is so that a small write does not turn into a large read + a large write. In configurations where the disk is kept busy, multiplying 8K of data transfer up to 256K hurts.
>
> (Actually ZFS goes up to 128k, not 256k (yet!))

256K = 128K read + 128K write.

> Yes, although actually most non-COW filesystems have this same problem, because they don't write partial blocks either, even though technically they could. (And FYI, checksumming would take away the ability to write partial blocks too.)

In direct I/O mode, though, which is commonly used for databases, writes only affect individual disk blocks, not whole file system blocks. (At least for UFS and QFS, but I presume VxFS is similar.)

In the case of QFS in paged mode, only dirty pages are written, not whole file system blocks (disk allocation units, or DAUs, in QFS terminology). It's common to use 2 MB or larger DAUs to reduce allocation overhead, improve contiguity, and reduce the need for indirect blocks. I'm not sure whether this is the case for UFS with 8K blocks and 4K pages, but I imagine it is.

As you say, checksumming requires either that whole checksum blocks (not necessarily file system blocks!) be processed, or that the checksum function is reversible, in the sense that inverse and composition functions for it exist:

    checksum(ABC) = f(g(A), g(B), g(C))

and there exists g^-1(B) such that we can compute

    checksum(AB'C) = f(g(A), g(B'), g(C))
    or
    checksum(AB'C) = h(checksum(ABC), range(A), range(B), range(C), g^-1(B), g(B'))

[The latter approach comes from a paper I can't track down right now; if anyone's familiar with it, I'd love to get the reference again.]

-- Anton
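As a deliberately trivial illustration of such a reversible checksum (my own example, not anything ZFS actually uses; it drops the range() terms because a plain sum is position-independent):

    # Deliberately trivial illustration (my example, not anything ZFS uses): a
    # "reversible" checksum where g(X) is the byte sum of X, f composes partials
    # by addition, and g has an inverse, so replacing block B with B' needs only
    # the old checksum, g(B) and g(B') -- no need to reread A or C.

    MOD = 1 << 32

    def g(block):
        return sum(block) % MOD                        # per-block partial checksum

    def checksum(*blocks):
        return sum(g(b) for b in blocks) % MOD         # f: compose by addition

    def update(old_checksum, old_block, new_block):
        # h: subtract the old partial (its "inverse") and add the new one.
        return (old_checksum - g(old_block) + g(new_block)) % MOD

    A, B, C, B2 = b"aaaa", b"bbbb", b"cccc", b"BBBB"
    assert update(checksum(A, B, C), B, B2) == checksum(A, B2, C)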
Re: [zfs-discuss] Re: Lots of seeks?
On Aug 9, 2006, at 8:18 AM, Roch wrote:

>> So while I'm feeling optimistic :-) we really ought to be able to do this in two I/O operations. If we have, say, 500K of data to write (including all of the metadata), we should be able to allocate a contiguous 500K block on disk and write that with a single operation. Then we update the Uberblock.
>
> Hi Anton, Optimistic a little, yes. The data blocks should have aggregated quite well into near-recordsize I/Os; are you sure they did not? No O_DSYNC in here, right?

When I repeated this with just 512K written in 1K chunks via dd, I saw six 16K writes. Those were the largest. The others were around 1K-4K. No O_DSYNC.

    dd if=/dev/zero of=xyz bs=1k count=512

So some writes are being aggregated, but we're missing a lot.

> Once the data blocks are on disk we have the information necessary to update the indirect blocks iteratively up to the ueberblock. Those are the smaller I/Os; I guess that because of ditto blocks they go to physically separate locations, by design.

We shouldn't have to wait for the data blocks to reach disk, though. We know where they're going in advance.

One of the key advantages of the überblock scheme is that we can, in a sense, speculatively write to disk. We don't need the tight ordering that UFS requires to avoid security exposures and allow the file system to be repaired. We can lay out all of the data and metadata, write them all to disk, choose new locations if the writes fail, etc., and not worry about any ordering or state issues, because the on-disk image doesn't change until we commit it.

You're right, the ditto block mechanism will mean that some writes will be spread around (at least when using a non-redundant pool like mine), but then we should have at most three writes followed by the überblock update, assuming three degrees of replication.

> All of these though are normally done asynchronously to applications, unless the disks are flooded.

Which is a good thing (I think they're asynchronous anyway, unless the cache is full).

> But I follow you in that it may be remotely possible to reduce the number of iterations in the process by assuming that the I/O will all succeed, then if some fails, fix up the consequences, and when all done, update the ueberblock. I would not hold my breath quite yet for that.

Hmmm. I guess my point is that we shouldn't need to iterate at all. There are no dependencies between these writes; only between the complete set of writes and the überblock update.

-- Anton
Re: [zfs-discuss] Re: Lots of seeks?
On Aug 11, 2006, at 12:38 PM, Jonathan Adams wrote:

> The problem is that you don't know the actual *contents* of the parent block until *all* of its children have been written to their final locations. (This is because the block pointer's value depends on the final location.)

But I know where the children are going before I actually write them. There is a dependency of the parent's contents on the *address* of its children, but not on the actual write. We can compute everything that we are going to write before we start to write. (Yes, in the event of a write failure we have to recover; but that's very rare, and can easily be handled -- we just start over, since no visible state has been changed.)

> The ditto blocks don't really affect this, since they can all be written out in parallel.

They do affect my idea of turning the update into a two-phase commit (make all the changes, then update the überblock), because the ditto blocks are deliberately spread across the disk, so we can't collect them into a single write (for a non-redundant pool, or at least a one-disk pool -- presumably they wind up on different disks for a two-disk pool, in which case we can still do a single write per disk).

> Again, there is; if a block write fails, you have to re-write it and all of its parents. So the best you could do would be:
>
>  1. Assign locations for all blocks, and update the space bitmaps as necessary.
>  2. Update all of the non-uberblock blocks with their actual contents (which requires calculating checksums on all of the child blocks).
>  3. Write everything out in parallel.
>  3a. If any write fails, redo 1+2 for that block, and 2 for all of its parents, then start over at 3 with all of the changed blocks.
>  4. Once everything is on stable storage, update the uberblock.

That's a lot more complicated than the current model, but it certainly seems possible. (3a could actually be simplified to just marking the bad blocks as unallocatable and going back to 1, but it's more efficient as you describe.) The eventual advantage, though, is that we get the performance of a single write (plus, always, the überblock update). In a heavily loaded system, the current approach (lots of small writes) won't scale so well. (Actually, we'd probably want to limit the size of each write to some small value, like 16 MB, simply to allow the first write to start earlier under fairly heavy loads.)

As I pointed out earlier, this would require getting scatter/gather support through the storage subsystem, but the potential win should be quite large. Something to think about for the future. :-)

Incidentally, this is part of how QFS gets its performance for streaming I/O. We use an allocate-forward policy, allow very large allocation blocks, and separate the metadata from the data. This allows us to write (or read) data in fairly large I/O requests, without unnecessary disk head motion.

Anton
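A rough sketch of that scheme, at pseudocode level (the structures and names are invented for illustration, not taken from the OpenSolaris source, and the failure path uses the simplified "just start the batch over" variant rather than step 3a):

    import hashlib
    from dataclasses import dataclass, field

    @dataclass
    class Block:
        data: bytes                         # this block's own payload
        children: list = field(default_factory=list)
        addr: int = None                    # assigned location on disk
        contents: bytes = None              # payload plus child block pointers
        checksum: bytes = None              # checksum stored in the parent's pointer

    def postorder(block, out=None):
        out = [] if out is None else out
        for child in block.children:
            postorder(child, out)
        out.append(block)                   # children always precede their parent
        return out

    def commit(root, allocate, write, write_uberblock):
        blocks = postorder(root)
        while True:
            # Steps 1+2: pick addresses and compute every block's final contents
            # and checksum before a single byte is written.
            for b in blocks:
                ptrs = b"".join(c.addr.to_bytes(8, "big") + c.checksum
                                for c in b.children)
                b.contents = b.data + ptrs
                b.addr = allocate(len(b.contents))
                b.checksum = hashlib.sha256(b.contents).digest()
            # Step 3: all writes can be in flight at once; their order doesn't
            # matter because the uberblock still points at the old tree.
            if all(write(b.addr, b.contents) for b in blocks):
                break
            # Step 3a (simplified): on any failure, just start the batch over.
        # Step 4: the single ordered operation -- commit the new tree root.
        write_uberblock(root.addr, root.checksum)

In this form the only ordering constraint left is between the whole batch of writes and the uberblock update, which is the point being argued above.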
Re: [zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source QFS... / was: Re: Re: Distributed File System for Solaris
On May 31, 2006, at 8:56 AM, Roch Bourbonnais - Performance Engineering wrote:

> I'm not taking a stance on this, but if I keep a controller full of 128K I/Os, and assuming they are targeting contiguous physical blocks, how different is that from issuing a very large I/O?

There are differences at the host, the HBA, the disk or RAID controller, and on the wire.

At the host:

- The SCSI/FC/ATA stack is run once for each I/O. This takes a bit of CPU.
- We generally take one interrupt for each I/O (if the CPU is fast enough), so instead of taking one interrupt for 8 MB (for instance), we take 64.
- We run through the IOMMU or page translation code once per page, but the overhead of initially setting up the IOMMU or starting the translation loop happens once per I/O.

At the HBA:

- There is some overhead each time the controller switches processing from one I/O to another. This isn't too large on a modern system, but it does add up.
- There is overhead on the PCI (or other) bus for the small transfers that make up the command block and scatter/gather list for each I/O. Again, it adds up (faster than you might expect, since PCI Express can move 128 KB very quickly).
- There is a limit on the maximum number of outstanding I/O requests, but we're unlikely to hit this in normal use; it is typically at least 256 and more often 1024 or more on newer hardware. (This is shared for the whole channel in the FC and SCSI case, and may be shared between multiple channels for SAS or multi-port FC cards.)
- There is often a small cache of commands which can be handled quickly; commands outside of this cache (which may hold 4 to 16 or so) are much slower to context-switch in when their data is needed; in particular, the scatter/gather list may need to be read again.

At the disk or RAID:

- There is a fixed overhead for processing each command. This can be fairly readily measured, and roughly reflects the difference between delivered 512-byte IOPS and bandwidth for a large I/O. Some of it is related to parsing the CDB and starting command execution; some of it is related to cache management.
- There is some overhead for switching between data transfers for each command. A typical track on a disk may hold 400K or so of data, and a full-track transfer is optimal (runs at platter speed). A partial-track transfer immediately followed by another may take enough time to switch that we sometimes lose one revolution (particularly on disks which do not have sector headers). Write caching should nearly eliminate this as a concern, however.
- There is a fixed-size window of commands that can be reordered on the device. Data transfer within a command can be reordered arbitrarily (for parallel SCSI and FC, though not for ATA or SAS). It's good to have lots of outstanding commands, but if they are all sequential, there's not much point (no reason to reorder them, except perhaps if you're going backwards, and FC/SCSI can handle this anyway).

On the wire:

- Sending a command and its completion takes time that could be spent moving data instead, but for most protocols this probably isn't significant.

You can actually see most of this with a PCI and protocol analyzer.

-- Anton
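To give a feel for how these per-command costs accumulate, here is a hedged back-of-envelope model; the 100 microsecond per-command overhead is an assumption picked for illustration, not a measurement of any particular HBA or target.

    # Hedged back-of-envelope model (assumed numbers, not measurements): time to
    # move 8 MB as a single command versus 64 x 128 KB commands, charging a fixed
    # per-command cost for the host stack, HBA context switches, and target
    # command processing.

    def transfer_ms(total_mb=8, io_kb=128, bandwidth_mb_s=200.0,
                    per_cmd_overhead_us=100.0):
        n_cmds = (total_mb * 1024) // io_kb          # how many commands we issue
        data_ms = total_mb / bandwidth_mb_s * 1000   # pure data movement time
        overhead_ms = n_cmds * per_cmd_overhead_us / 1000
        return data_ms + overhead_ms

    print("1 x 8 MB   :", transfer_ms(io_kb=8 * 1024), "ms")
    print("64 x 128 KB:", transfer_ms(io_kb=128), "ms")
    # With these assumptions, per-command overhead adds ~6.4 ms on top of 40 ms
    # of data movement -- about 16% more time for the same bytes.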
Re: [zfs-discuss] Re: Re[5]: Re: Re: Due to 128KB limit in ZFS it can't saturate disks
> Ok, so let's consider your 2MB read. You have the option of setting it in one contiguous place on the disk or splitting it into 16 x 128K chunks, somewhat spread all over. Now you issue a read to that 2MB of data. As you noted, you either have to wait for the head to find the 2MB block and stream it, or you dump 16 I/O descriptors into an intelligent controller; wherever the head is, there is data to be gotten from the get-go. I can't swear it wins the game, but it should be real close.

Well, the full specs aren't available, but a little math and studying some models can get us close. :-)

Let's presume we're using an enterprise-class disk, say a 37 GB Seagate Cheetah. This is best-case for seeks, as it uses so little of the platter and runs at 15K RPM.

Large-block case: On average, to reach the 2 MB, we'll take 3.5 ms. Transfer can then proceed at media rate (average 110 MB/sec) and be sent to the host over a 200 MB/sec channel. 3.5 ms seek plus 18.1 ms data transfer gives a total time of 21.6 ms, for a rate of 92.6 MB/sec.

Small-block case: Each seek will be shorter than the average, since we are ordering them optimally. A single-track seek is 0.2 ms; the average is 3.5 ms; if we assume linear scaling (which isn't quite right), then we're looking at 1/8 of 3.7 ms = 0.46 ms. We do 16 seeks, for 7.36 ms, and our data transfer time is the same (18.1 ms), for a total of 25.46 ms -- a rate of 78.5 MB/sec. Not too bad. It's pretty clear why these drives are pricey. :-)

Mmmm, actually it's not that good. There are 50K tracks on this disk, so each track holds about 700 KB. We're only storing 128 KB on each track, so on average we'll need to wait nearly 1/2 of a revolution before we see any of our data under the head. At 15K RPM that's not so bad, only 2 ms, but we've got 16 of those waits, adding 32 ms and dropping our rate to roughly half what we'd get otherwise. (Older disks should, surprisingly, do better, since they have less data packed onto each track!)

Looking at a 250 GB near-line SATA disk, and presuming its controller does the same optimizations, things are different. Average seek time is 8 ms, with a single-track seek time of 0.8 ms, so 15 additional seeks will cost roughly 30 ms. A half-track wait is 4 ms (60 ms in total). Things are going pretty slowly now.

> I just did an experiment and could see 60 MB/sec of data out of a 35G disk using 128K chunks (~450 IOPS).

On the only disk I have handy, I get 36 MB/sec with concurrent 128 KB chunks, 38 MB/sec with non-concurrent 2 MB chunks, and 39 MB/sec with concurrent 2 MB chunks. But I'm issuing all of these I/O operations sequentially -- no seeks.

> Disruptive.

What is? Multiple I/Os outstanding to a device isn't precisely new. ;-)

Honestly, adding seeks is -never- going to improve performance. Giving the drive the opportunity to reorder I/O operations will, but splitting a single operation up can never speed it up, though if you get lucky it won't slow things down.

Anton
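The same arithmetic in a few lines, for anyone who wants to replay it; the parameters are the assumed figures from this discussion (not manufacturer specs), and the contiguous case ignores rotational latency just as the estimate above does.

    # Replay of the arithmetic above (assumed drive parameters, not specs):
    # reading 2 MB as one contiguous extent vs. 16 scattered 128 KB chunks on a
    # 15K RPM enterprise drive.

    def read_time_ms(total_kb, chunks, seek_ms_per_chunk, media_mb_s,
                     rotational_wait_ms):
        seek = chunks * seek_ms_per_chunk
        rotation = chunks * rotational_wait_ms if chunks > 1 else 0
        transfer = total_kb / 1024.0 / media_mb_s * 1000
        return seek + rotation + transfer

    cases = [("1 x 2 MB contiguous", 1, 3.5),       # one average-length seek
             ("16 x 128 KB scattered", 16, 0.46)]   # short, optimally ordered seeks
    for label, chunks, seek_ms in cases:
        t = read_time_ms(total_kb=2048, chunks=chunks, seek_ms_per_chunk=seek_ms,
                         media_mb_s=110, rotational_wait_ms=2.0)
        print(f"{label:>22}: {t:5.1f} ms -> {2048 / 1024 / (t / 1000):5.1f} MB/sec")
    # ~21.7 ms (~92 MB/sec) contiguous vs. ~57.5 ms (~35 MB/sec) scattered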
Re: [zfs-discuss] Re: ZFS and databases
On May 12, 2006, at 11:59 AM, Richard Elling wrote:

>> CPU cycles and memory bandwidth (which both can be in short supply on a database server).
>
> We can throw hardware at that :-) Imagine a machine with lots of extra CPU cycles [ ... ]

Yes, I've heard this story before, and I won't believe it this time. ;-)

Seriously, I believe a database can perform very well on a CMT system, but there won't be any extra CPU cycles or memory bandwidth, because the demand for transaction rates will always exceed what we can supply.

Anton