Re: [zfs-discuss] Re: Self-tuning recordsize
Torrey McMahon writes:
> Reads? Maybe. Writes are another matter, namely the overhead associated
> with turning a large write into a lot of small writes (checksums, for
> example).

Jeremy Teo wrote:
> Hello all,
> Isn't a large block size a simple case of prefetching? In other words,
> if we possessed an intelligent prefetch implementation, would there
> still be a need for large block sizes? (Thinking aloud) :)

What Torrey says, plus: a file stored as multiple small records will still need multiple head seeks to fetch the data (prefetch or not). Given that head seeks are a precious resource, large records are, at times, a goodness. Larger records also reduce the amount of metadata.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Self-tuning recordsize
Hello all,

Isn't a large block size a simple case of prefetching? In other words, if we possessed an intelligent prefetch implementation, would there still be a need for large block sizes? (Thinking aloud) :)

--
Regards,
Jeremy
Re: [zfs-discuss] Re: Self-tuning recordsize
Reads? Maybe. Writes are another matter, namely the overhead associated with turning a large write into a lot of small writes (checksums, for example).

Jeremy Teo wrote:
> Hello all,
> Isn't a large block size a simple case of prefetching? In other words,
> if we possessed an intelligent prefetch implementation, would there
> still be a need for large block sizes? (Thinking aloud) :)
[zfs-discuss] Re: Self-tuning recordsize
No, the reason to try to match recordsize to the write size is so that a small write does not turn into a large read + a large write. In configurations where the disk is kept busy, multiplying 8K of data transfer up to 256K hurts.

This is really orthogonal to the cache; in fact, if we had a switch to disable caching, this problem would get worse instead of better (since we wouldn't amortize the initial large read over multiple small writes).
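[Editorial note: the amplification Anton describes is easy to quantify. A minimal back-of-envelope sketch, assuming an 8K application write against a 128K record that must be read in full and rewritten in full:]

```python
# Toy model of read-modify-write amplification when the filesystem
# recordsize is larger than the application's write size.

def rmw_transfer_bytes(write_size: int, record_size: int) -> int:
    """Bytes moved over the disk for one write.

    A full-record write needs no pre-read; a sub-record write forces
    the whole record to be read in and then written back out.
    """
    if write_size >= record_size:
        return write_size            # full-record write: write only
    return 2 * record_size           # partial write: read record + write record

app_write = 8 * 1024                 # 8K database-style write
record = 128 * 1024                  # 128K record

moved = rmw_transfer_bytes(app_write, record)
print(f"{moved // 1024}K transferred for an {app_write // 1024}K write "
      f"({moved // app_write}x amplification)")
# -> 256K transferred for an 8K write (32x amplification)
```

This reproduces the thread's numbers: the 256K figure is the 128K pre-read plus the 128K rewrite, as Anton notes in his follow-up.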
Re: [zfs-discuss] Re: Self-tuning recordsize
Heya Anton,

On 10/17/06, Anton B. Rang [EMAIL PROTECTED] wrote:
> No, the reason to try to match recordsize to the write size is so that
> a small write does not turn into a large read + a large write. In
> configurations where the disk is kept busy, multiplying 8K of data
> transfer up to 256K hurts.

Ah. I knew I was missing something. What COW giveth, COW taketh away...

> This is really orthogonal to the cache; in fact, if we had a switch to
> disable caching, this problem would get worse instead of better (since
> we wouldn't amortize the initial large read over multiple small writes).

Agreed. It looks to me like there are only two ways to solve this:

1) Set the recordsize manually.
2) Allow the blocksize of a file to be changed even if there are multiple blocks in the file.

--
Regards,
Jeremy
Re: [zfs-discuss] Re: Self-tuning recordsize
Jeremy Teo wrote:
> On 10/17/06, Anton B. Rang [EMAIL PROTECTED] wrote:
>> No, the reason to try to match recordsize to the write size is so that
>> a small write does not turn into a large read + a large write. In
>> configurations where the disk is kept busy, multiplying 8K of data
>> transfer up to 256K hurts.

(Actually ZFS goes up to 128K, not 256K (yet!))

> Ah. I knew I was missing something. What COW giveth, COW taketh away...

Yes, although actually most non-COW filesystems have this same problem, because they don't write partial blocks either, even though technically they could. (And FYI, checksumming would take away the ability to write partial blocks too.)

> 1) Set the recordsize manually.
> 2) Allow the blocksize of a file to be changed even if there are
>    multiple blocks in the file.

Or, as has been suggested, add an API for apps to tell us the recordsize before they populate the file.

--matt
Re: [zfs-discuss] Re: Self-tuning recordsize
Matthew Ahrens wrote:
> Or, as has been suggested, add an API for apps to tell us the
> recordsize before they populate the file.

I'll drop an RFE in and point people at the number.
Re: [zfs-discuss] Re: Self-tuning recordsize
On Oct 17, 2006, at 12:43 PM, Matthew Ahrens wrote:
> (Actually ZFS goes up to 128K, not 256K (yet!))

256K = 128K read + 128K write.

> Yes, although actually most non-COW filesystems have this same problem,
> because they don't write partial blocks either, even though technically
> they could. (And FYI, checksumming would take away the ability to write
> partial blocks too.)

In direct I/O mode, though, which is commonly used for databases, writes only affect individual disk blocks, not whole file system blocks. (At least for UFS and QFS, but I presume VxFS is similar.)

In the case of QFS in paged mode, only dirty pages are written, not whole file system blocks (disk allocation units, or DAUs, in QFS terminology). It's common to use 2 MB or larger DAUs to reduce allocation overhead, improve contiguity, and reduce the need for indirect blocks. I'm not sure whether this is the case for UFS with 8K blocks and 4K pages, but I imagine it is.

As you say, checksumming requires either that whole checksum blocks (not necessarily file system blocks!) be processed, or that the checksum function is reversible, in the sense that inverse and composition functions for it exist:

  checksum(ABC) = f(g(A), g(B), g(C))

and there exists g^-1(B) such that we can compute

  checksum(AB'C) = f(g(A), g(B'), g(C))

or

  checksum(AB'C) = h(checksum(ABC), range(A), range(B), range(C), g^-1(B), g(B'))

[The latter approach comes from a paper I can't track down right now; if anyone's familiar with it, I'd love to get the reference again.]

-- Anton
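[Editorial note: Anton's f/g decomposition can be made concrete with a toy construction. The sketch below is purely illustrative of the algebra, not ZFS's actual checksums (fletcher or SHA-256), and XOR-combined digests have weak collision properties; it only shows how a composable checksum lets chunk B be rewritten without re-reading A or C. Since XOR is its own inverse, g^-1 here is just g:]

```python
import hashlib

def g(chunk: bytes, index: int) -> int:
    """Per-chunk digest, mixed with its position so reordering chunks
    changes the combined checksum."""
    h = hashlib.sha256(index.to_bytes(8, "big") + chunk).digest()
    return int.from_bytes(h, "big")

def f(chunks) -> int:
    """Whole-object checksum: XOR of positional chunk digests."""
    out = 0
    for i, c in enumerate(chunks):
        out ^= g(c, i)
    return out

def update(old_checksum: int, index: int,
           old_chunk: bytes, new_chunk: bytes) -> int:
    """Incremental update: remove the old chunk's contribution and add
    the new one, never touching the other chunks on disk."""
    return old_checksum ^ g(old_chunk, index) ^ g(new_chunk, index)

data = [b"AAAA", b"BBBB", b"CCCC"]
full = f(data)                                  # checksum(ABC)
patched = update(full, 1, b"BBBB", b"bbbb")     # checksum(AB'C), incrementally
data[1] = b"bbbb"
assert patched == f(data)                       # matches a full recompute
```

A real design would need a combining function with better collision resistance than XOR (e.g. a homomorphic or incremental hash), which is exactly the kind of scheme the paper Anton mentions would describe.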
Re: [zfs-discuss] Re: Self-tuning recordsize
Torrey McMahon wrote:
> Matthew Ahrens wrote:
>> Or, as has been suggested, add an API for apps to tell us the
>> recordsize before they populate the file.
> I'll drop an RFE in and point people at the number.

For those playing at home, the RFE is 6483154.
[zfs-discuss] Re: Self-tuning recordsize
One technique would be to keep a histogram of read and write sizes. Presumably one would want to do this only during a "tuning phase" after the file was first created, or when access patterns change. (A shift to smaller record sizes can be detected by a large proportion of write operations which require block pre-reads; a shift to larger record sizes can be detected by a large proportion of write operations which write more than one block.) The ability to change the block size on the fly seems useful here.
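[Editorial note: the histogram idea above can be sketched in a few lines. This is a hypothetical tuning policy, not anything ZFS implements; the bucket sizes and the 90% coverage threshold are illustrative assumptions:]

```python
from collections import Counter

class RecordsizeTuner:
    """Toy histogram of write sizes, per the suggestion above: observe
    writes during a tuning phase, then pick the smallest record size
    that covers most of them (avoiding pre-reads for most writes
    without over-sizing records)."""

    SIZES = [512 << i for i in range(9)]  # 512B .. 128K, powers of two

    def __init__(self):
        self.hist = Counter()

    def record_write(self, nbytes: int) -> None:
        # Bucket each write under the smallest record size that holds it.
        for s in self.SIZES:
            if nbytes <= s:
                self.hist[s] += 1
                return
        self.hist[self.SIZES[-1]] += 1    # clamp oversized writes to 128K

    def suggest(self, coverage: float = 0.9) -> int:
        """Smallest record size covering `coverage` of observed writes."""
        total = sum(self.hist.values())
        running = 0
        for s in self.SIZES:
            running += self.hist[s]
            if running / total >= coverage:
                return s
        return self.SIZES[-1]

t = RecordsizeTuner()
for _ in range(95):
    t.record_write(8 * 1024)      # mostly 8K database-style writes
for _ in range(5):
    t.record_write(64 * 1024)     # occasional large writes
print(t.suggest())                # -> 8192
```

Detecting the shifts the post describes would then amount to watching live counters (pre-reads per write, multi-block writes per write) and re-entering the tuning phase when either ratio grows large.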