Re: [zfs-discuss] Re: Self-tuning recordsize

2006-10-23 Thread Roch
Torrey McMahon writes:
  Reads? Maybe. Writes are another matter. Namely the overhead associated 
  with turning a large write into a lot of small writes. (Checksums for 
  example.)
  
  Jeremy Teo wrote:
   Hello all,
  
   Isn't a large block size a simple case of prefetching? In other words,
   if we possessed an intelligent prefetch implementation, would there
   still be a need for large block sizes? (Thinking aloud)
  
   :)
  
  

What Torrey says, plus: a file stored as multiple small
records will still need multiple head seeks to fetch its data
(prefetch or not). Given that head seeks are a precious
resource, large records are, at times, a good thing.

Larger records also reduce the amount of metadata.
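
To put rough numbers on the record-count and metadata point, here is a
back-of-the-envelope sketch. It assumes 128-byte ZFS block pointers and
counts only the first level of pointers; higher indirect levels,
compression and extra metadata copies are ignored, so read the figures as
orders of magnitude only. The function names are made up for this example.

```python
# Rough count of records and first-level block-pointer bytes for one file.
# Assumption: each record is referenced by one 128-byte block pointer.
BLKPTR_BYTES = 128
GIB = 1 << 30


def records_needed(file_bytes, recordsize):
    """Number of records needed to hold the file (ceiling division)."""
    return -(-file_bytes // recordsize)


def pointer_bytes(file_bytes, recordsize):
    """Bytes of first-level block pointers for the file."""
    return records_needed(file_bytes, recordsize) * BLKPTR_BYTES


for recordsize in (8 * 1024, 128 * 1024):
    print(recordsize,
          records_needed(GIB, recordsize),
          pointer_bytes(GIB, recordsize))
# 8192 131072 16777216
# 131072 8192 1048576
```

Under these assumptions a 1 GiB file needs 16 times as many records, and
16 times as many pointer bytes, at 8K as at 128K. Each record is also a
potential seek if the records are not laid out contiguously.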

-r


  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] Re: Self-tuning recordsize

2006-10-22 Thread Jeremy Teo

Hello all,

Isn't a large block size a simple case of prefetching? In other words,
if we possessed an intelligent prefetch implementation, would there
still be a need for large block sizes? (Thinking aloud)

:)

--
Regards,
Jeremy


Re: [zfs-discuss] Re: Self-tuning recordsize

2006-10-22 Thread Torrey McMahon
Reads? Maybe. Writes are another matter. Namely the overhead associated 
with turning a large write into a lot of small writes. (Checksums for 
example.)


Jeremy Teo wrote:

Hello all,

Isn't a large block size a simple case of prefetching? In other words,
if we possessed an intelligent prefetch implementation, would there
still be a need for large block sizes? (Thinking aloud)

:)





[zfs-discuss] Re: Self-tuning recordsize

2006-10-17 Thread Anton B. Rang
No, the reason to try to match recordsize to the write size is so that a small 
write does not turn into a large read + a large write.  In configurations where 
the disk is kept busy, multiplying 8K of data transfer up to 256K hurts.

This is really orthogonal to the cache — in fact, if we had a switch to disable 
caching, this problem would get worse instead of better (since we wouldn't 
amortize the initial large read over multiple small writes).
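
The arithmetic behind this can be sketched as follows. This is a
simplified model, not a description of the actual ZFS I/O path: it assumes
every small write lands in the same record, that the record is not in the
cache to begin with, and that each write is flushed to disk on its own (no
coalescing of several writes into one flush). The function name and
parameters are invented for the example.

```python
KIB = 1024


def disk_traffic(write_size, recordsize, n_writes=1, cached=True):
    """Bytes moved to and from disk for n_writes writes into one record."""
    if write_size > recordsize:
        raise ValueError("sketch only covers writes no larger than one record")
    if write_size == recordsize:
        # Whole-record overwrite: nothing has to be read first.
        return n_writes * recordsize
    # Partial write: the record is read, modified, and written back whole.
    # With a cache the read happens once; without one it happens every time.
    reads = 1 if cached else n_writes
    return (reads + n_writes) * recordsize


print(disk_traffic(8 * KIB, 128 * KIB) // KIB)                     # 256
print(disk_traffic(8 * KIB, 8 * KIB) // KIB)                       # 8
print(disk_traffic(8 * KIB, 128 * KIB, 4, cached=True) // KIB)     # 640
print(disk_traffic(8 * KIB, 128 * KIB, 4, cached=False) // KIB)    # 1024
```

In this model a single 8K write into a 128K record moves 256K, a matched
8K record moves 8K, and turning the cache off raises the cost of four
small writes from 640K to 1024K.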
 
 
This message posted from opensolaris.org


Re: [zfs-discuss] Re: Self-tuning recordsize

2006-10-17 Thread Jeremy Teo

Heya Anton,

On 10/17/06, Anton B. Rang [EMAIL PROTECTED] wrote:

No, the reason to try to match recordsize to the write size is so that a small 
write does not turn into a large read + a large write.  In configurations where 
the disk is kept busy, multiplying 8K of data transfer up to 256K hurts.

Ah. I knew I was missing something. What COW giveth, COW taketh away...


This is really orthogonal to the cache — in fact, if we had a switch to disable 
caching, this problem would get worse instead of better (since we wouldn't 
amortize the initial large read over multiple small writes).

Agreed.

It looks to me like there are only two ways to solve this:

1) Set recordsize manually
2) Allow the blocksize of a file to be changed even if there are multiple
blocks in the file.
--
Regards,
Jeremy


Re: [zfs-discuss] Re: Self-tuning recordsize

2006-10-17 Thread Matthew Ahrens

Jeremy Teo wrote:

Heya Anton,

On 10/17/06, Anton B. Rang [EMAIL PROTECTED] wrote:
No, the reason to try to match recordsize to the write size is so that 
a small write does not turn into a large read + a large write.  In 
configurations where the disk is kept busy, multiplying 8K of data 
transfer up to 256K hurts.


(Actually ZFS goes up to 128k not 256k (yet!))


Ah. I knew I was missing something. What COW giveth, COW taketh away...


Yes, although actually most non-COW filesystems have this same problem, 
because they don't write partial blocks either, even though technically 
they could.  (And FYI, checksumming would take away the ability to 
write partial blocks too.)



1) Set recordsize manually
2) Allow the blocksize of a file to be changed even if there are multiple
blocks in the file.


Or, as has been suggested, add an API for apps to tell us the recordsize 
before they populate the file.


--matt


Re: [zfs-discuss] Re: Self-tuning recordsize

2006-10-17 Thread Torrey McMahon

Matthew Ahrens wrote:


Or, as has been suggested, add an API for apps to tell us the 
recordsize before they populate the file.



I'll drop an RFE in and point people at the number.


Re: [zfs-discuss] Re: Self-tuning recordsize

2006-10-17 Thread Anton Rang

On Oct 17, 2006, at 12:43 PM, Matthew Ahrens wrote:


Jeremy Teo wrote:

Heya Anton,
On 10/17/06, Anton B. Rang [EMAIL PROTECTED] wrote:
No, the reason to try to match recordsize to the write size is so  
that a small write does not turn into a large read + a large  
write.  In configurations where the disk is kept busy,  
multiplying 8K of data transfer up to 256K hurts.


(Actually ZFS goes up to 128k not 256k (yet!))


256K = 128K read + 128K write.

Yes, although actually most non-COW filesystems have this same  
problem, because they don't write partial blocks either, even  
though technically they could.  (And FYI, checksumming would take  
away the ability to write partial blocks too.)


In direct I/O mode, though, which is commonly used for databases,  
writes only affect individual disk blocks, not the whole file system  
blocks.  (At least for UFS & QFS, but I presume VxFS is similar.)


In the case of QFS in paged mode, only dirty pages are written, not  
whole file system blocks (disk allocation units, or DAUs, in QFS  
terminology).  It's common to use 2 MB or larger DAUs to reduce  
allocation overhead, improve contiguity, and reduce the need for  
indirect blocks.  I'm not sure if this is the case for UFS with 8K  
blocks and 4K pages, but I imagine it is.
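
As a rough illustration of the difference between writing dirty pages and
writing whole allocation units, the sketch below counts the bytes written
when every unit touched by a write has to be written out in full. The 4K
page and 2 MB DAU sizes are the ones mentioned above; the helper name is
invented, and the model ignores everything except unit granularity.

```python
PAGE = 4 * 1024
DAU = 2 * 1024 * 1024


def bytes_written(offset, length, unit):
    """Bytes written if each `unit`-sized chunk touched is written whole.

    Assumes length > 0.
    """
    first = offset // unit
    last = (offset + length - 1) // unit
    return (last - first + 1) * unit


print(bytes_written(0, 8192, PAGE))     # 8192
print(bytes_written(1024, 8192, PAGE))  # 12288
print(bytes_written(0, 8192, DAU))      # 2097152
```

An aligned 8K write dirties two pages (8K written), an unaligned one
dirties three (12K), and writing the whole 2 MB unit instead would move
256 times the aligned amount.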


As you say, checksumming requires that either whole checksum
blocks (not necessarily file system blocks!) be processed, or that
the checksum function is reversible (in the sense that inverse and
composition functions for it exist):

    checksum(ABC) = f(g(A), g(B), g(C))

and there exists g^-1(B) such that we can compute

    checksum(AB'C) = f(g(A), g(B'), g(C))

or

    checksum(AB'C) = h(checksum(ABC), range(A), range(B), range(C),
                       g^-1(B), g(B'))

[The latter approach comes from a paper I can't track down right now;
if anyone's familiar with it, I'd love to get the reference again.]
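
One simple construction with this shape, given purely as an illustration,
is to take f to be XOR over per-block digests. XOR is its own inverse, so
the old block's contribution can be cancelled and the new one added
without touching the other blocks. This is an assumption made for the
example: it is not how ZFS computes its checksums, it is not necessarily
the construction from the paper mentioned above, and XOR-combined hashes
are weaker than hashing the whole range. All names are made up.

```python
import hashlib


def g(index, block):
    """Per-block digest as an integer.

    The block index is mixed in so that swapping two blocks changes
    the combined checksum.
    """
    digest = hashlib.sha256(index.to_bytes(8, "big") + block).digest()
    return int.from_bytes(digest, "big")


def checksum(blocks):
    """f(g(A), g(B), g(C), ...) with f = XOR."""
    acc = 0
    for index, block in enumerate(blocks):
        acc ^= g(index, block)
    return acc


def update(old_sum, index, old_block, new_block):
    """h(...): cancel the old block's digest, then add the new one."""
    return old_sum ^ g(index, old_block) ^ g(index, new_block)


blocks = [b"A" * 8, b"B" * 8, b"C" * 8]
new_b = b"X" * 8
incremental = update(checksum(blocks), 1, blocks[1], new_b)
full = checksum([blocks[0], new_b, blocks[2]])
print(incremental == full)  # True
```

Note that the update still needs either the old contents of the changed
block or its stored digest, so a partial write is only cheap if one of
those is at hand.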


-- Anton



Re: [zfs-discuss] Re: Self-tuning recordsize

2006-10-17 Thread Torrey McMahon

Torrey McMahon wrote:

Matthew Ahrens wrote:


Or, as has been suggested, add an API for apps to tell us the 
recordsize before they populate the file.



I'll drop an RFE in and point people at the number. 



For those playing at home the RFE is 6483154


[zfs-discuss] Re: Self-tuning recordsize

2006-10-13 Thread Anton B. Rang
One technique would be to keep a histogram of read & write sizes.

Presumably one would want to do this only during a “tuning phase” after the 
file was first created, or when access patterns change. (A shift to smaller 
record sizes can be detected by a large proportion of write operations which 
require block pre-reads; a shift to larger record sizes can be detected by a 
large proportion of write operations which write more than one block.)

The ability to change the block size on-the-fly seems useful here.
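
A minimal sketch of the bookkeeping such a tuning phase might do is shown
below. Everything in it is hypothetical: the class, the 0.5 threshold and
the "smaller"/"larger"/"keep" answers are invented for illustration and
do not correspond to any existing ZFS interface. As a simplification, a
write that spans several blocks is counted only as multi-block, even if
its ends are unaligned, and every partial write to a single block is
assumed to need a pre-read.

```python
from collections import Counter


class WriteSizeTracker:
    """Histogram of write sizes plus the two signals described above."""

    def __init__(self, blocksize):
        self.blocksize = blocksize
        self.sizes = Counter()   # write size in bytes -> number of writes
        self.pre_reads = 0       # partial writes confined to one block
        self.multi_block = 0     # writes that span more than one block
        self.total = 0

    def record_write(self, offset, length):
        """Account for one write of `length` > 0 bytes at `offset`."""
        self.sizes[length] += 1
        self.total += 1
        first = offset // self.blocksize
        last = (offset + length - 1) // self.blocksize
        if last > first:
            self.multi_block += 1
        elif length < self.blocksize:
            self.pre_reads += 1

    def suggest(self, threshold=0.5):
        """Return 'smaller', 'larger' or 'keep' for the block size."""
        if self.total == 0:
            return "keep"
        if self.pre_reads / self.total >= threshold:
            return "smaller"
        if self.multi_block / self.total >= threshold:
            return "larger"
        return "keep"


tracker = WriteSizeTracker(128 * 1024)
for i in range(10):
    tracker.record_write(i * 8192, 8192)   # 8K writes inside one 128K block
print(tracker.suggest())                   # smaller
```

The histogram itself (`sizes`) is what one would consult to pick the new
size once a shift has been detected; here it would show ten writes of
8192 bytes.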
 
 
This message posted from opensolaris.org