Re: [zfs-discuss] Re: RAIDZ2 vs. ZFS RAID-10

2007-01-04 Thread Anton Rang

On Jan 4, 2007, at 3:25 AM, [EMAIL PROTECTED] wrote:

Is there some reason why a small read on a raidz2 is not statistically very likely to require I/O on only one device? Assuming a non-degraded pool, of course.


ZFS stores its checksums for RAIDZ/RAIDZ2 in such a way that all disks must be read to compute and verify the checksum.

But why do ZFS reads require the computation of the RAIDZ checksum?

If the block checksum is fine, then you need not care about the parity.


It's the block checksum that requires reading all of the disks.  If ZFS stored sub-block checksums for the RAID-Z case, then short reads could often be satisfied without reading the whole block (and all disks).

So actually I misspoke slightly; rather than all disks, I should have said all data disks. In practice this has the same effect: no more than one read may be processed at a time.
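
To make the cost concrete, here is a minimal Python sketch (my own illustration, not ZFS code, with an assumed group geometry) of why a short read from a RAID-Z record still touches every data disk: the only checksum available covers the whole record, so all columns must be fetched before the data can be verified, whereas hypothetical per-column checksums would let a 4K read touch a single disk.

    # Illustration only -- not ZFS source. Models one RAID-Z block.
    RECORDSIZE = 128 * 1024        # one ZFS block (record)
    DATA_DISKS = 4                 # e.g. a 6-disk raidz2 group has 4 data disks
    COLUMN = RECORDSIZE // DATA_DISKS

    def disks_read_for(read_size, per_column_checksums=False):
        """How many data disks a small read must touch before returning data."""
        if per_column_checksums:
            # Hypothetical sub-block checksums: each column verifiable on its own.
            return max(1, -(-read_size // COLUMN))   # ceiling division
        # Actual scheme: the block checksum covers the whole record, so every
        # data column must be read back to verify it.
        return DATA_DISKS

    print(disks_read_for(4096))                             # 4 -> all data disks
    print(disks_read_for(4096, per_column_checksums=True))  # 1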


Anton



Re: [zfs-discuss] Re: Re[2]: RAIDZ2 vs. ZFS RAID-10

2007-01-04 Thread Anton Rang


On Jan 4, 2007, at 10:26 AM, Roch - PAE wrote:


All filesystems will incur a read-modify-write when the application is updating a portion of a block.


For most Solaris file systems it is the page size, rather than
the block size, that affects read-modify-write; hence 8K (SPARC)
or 4K (x86/x64) writes do not require read-modify-write for
UFS/QFS, even when larger block sizes are used.

When direct I/O is enabled, UFS and QFS will write directly to
disk (without reading) for 512-byte-aligned I/O.


The read I/O only occurs if the block is not already in memory cache.


Of course.


ZFS stores files less than 128K (or less than the filesystem recordsize) as a single block. Larger files are stored as multiple recordsize blocks.


So appending to any file less than 128K will result in a read-modify-write cycle (modulo read caching); while a write to a file which is not record-size-aligned (by default, 128K) also results in a read-modify-write cycle.
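
A minimal sketch of that arithmetic (my own illustration; the recordsize and cold cache are assumptions, and it ignores the small-file case where the block size equals the file size): given a write's offset and length, it reports how much the write must read back and how much it ends up transferring.

    # Illustration only. Assumes a 128K recordsize, a cold cache, and a file
    # that already spans whole records.
    RECORDSIZE = 128 * 1024

    def rmw_cost(offset, length, file_size):
        """Return (bytes read, bytes written) for an overwrite of `length` bytes."""
        start = (offset // RECORDSIZE) * RECORDSIZE
        end = offset + length
        touched = ((end - start + RECORDSIZE - 1) // RECORDSIZE) * RECORDSIZE
        # A record is read back only if it already holds data that the write
        # does not completely replace.
        partial = (offset % RECORDSIZE or end % RECORDSIZE) and offset < file_size
        return (touched if partial else 0), touched

    print(rmw_cost(offset=8192, length=8192, file_size=256 * 1024))
    # -> (131072, 131072): an 8K update becomes 128K read + 128K write.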



For RAID-Z a block spreads onto all devices of a group.


Which means that all devices are involved in the read and the write; except, as I believe Casper pointed out, that very small blocks (less than 512 bytes per data device) will reside on a smaller set of disks.
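
Here is a small sketch of that layout (my own rough model; real RAID-Z also interleaves parity and pads to full stripe rows): each data disk gets about blocksize / data_disks bytes, rounded to 512-byte sectors, so only blocks smaller than one sector per data disk land on fewer devices.

    # Illustration only: rough model of how one block's data spreads in RAID-Z.
    SECTOR = 512

    def raidz_columns(block_size, group_disks, parity=1):
        """Per-data-disk column sizes for one block (parity not shown)."""
        data_disks = group_disks - parity
        sectors = -(-block_size // SECTOR)          # ceiling division
        cols = [0] * data_disks
        for s in range(sectors):                    # deal sectors round-robin
            cols[s % data_disks] += SECTOR
        return cols, sum(1 for c in cols if c)

    print(raidz_columns(128 * 1024, group_disks=5))   # every data disk gets 32K
    print(raidz_columns(1024, group_disks=5))         # a 1K block uses only 2 disks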

Anton



Re: [zfs-discuss] Re: ZFS and SE 3511

2006-12-19 Thread Anton Rang

On Dec 19, 2006, at 7:14 AM, Mike Seda wrote:


Anton B. Rang wrote:
I have a Sun SE 3511 array with 5 x 500 GB SATA-I disks in a RAID 5. This 2 TB logical drive is partitioned into 10 x 200 GB slices. I gave 4 of these slices to a Solaris 10 U2 machine and added each of them to a concat (non-raid) zpool as listed below:




This is certainly a supportable configuration.  However, it's not an optimal one.



What would be the optimal configuration that you recommend?


If you don't need ZFS redundancy, I would recommend taking a single slice for your ZFS file system (e.g. 6 x 200 GB for other file systems, and 1 x 800 GB for the ZFS pool).  There would still be contention between the various file systems, but at least ZFS would be working with a single contiguous block of space on the array.


Because of the implicit striping in ZFS, what you have right now is analogous to taking a single disk, partitioning it into several partitions, then striping across those partitions -- it works, you can use all of the space, but there's a rearrangement which means that logically contiguous blocks on disk are no longer physically contiguous, hurting performance substantially.
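
A toy model of that rearrangement (my own illustration; the fixed round-robin stripe is a simplification of ZFS's dynamic allocator, and the sizes are made up): striping a pool across several slices of the same array turns a logically sequential scan into repeated jumps of roughly a slice's length across the device.

    # Illustration only: four 200 GB slices of one array used as a 4-way stripe.
    SLICE = 200 * 2**30          # bytes per slice
    STRIPE = 128 * 2**10         # bytes placed on one slice before moving on

    def physical_offset(logical):
        """Map a pool-logical offset onto the single underlying array."""
        stripe_no, within = divmod(logical, STRIPE)
        slice_no = stripe_no % 4
        slice_off = (stripe_no // 4) * STRIPE + within
        return slice_no * SLICE + slice_off

    prev = physical_offset(0)
    for i in range(1, 5):                       # read the pool sequentially
        cur = physical_offset(i * STRIPE)
        print(f"head moves {abs(cur - prev) / 2**30:.1f} GiB")
        prev = cur
    # Each logically adjacent 128K chunk is ~200 GiB away on the array.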


Yes, I am worried about the lack of redundancy. And, I have some  
new disks on order, at least one of which will be a hot spare.


Glad to hear it.

Anton




Re: [zfs-discuss] Re: Self-tuning recordsize

2006-10-17 Thread Anton Rang

On Oct 17, 2006, at 12:43 PM, Matthew Ahrens wrote:


Jeremy Teo wrote:

Heya Anton,
On 10/17/06, Anton B. Rang [EMAIL PROTECTED] wrote:
No, the reason to try to match recordsize to the write size is so that a small write does not turn into a large read + a large write.  In configurations where the disk is kept busy, multiplying 8K of data transfer up to 256K hurts.


(Actually ZFS goes up to 128k not 256k (yet!))


256K = 128K read + 128K write.

Yes, although actually most non-COW filesystems have this same problem, because they don't write partial blocks either, even though technically they could.  (And FYI, checksumming would take away the ability to write partial blocks too.)


In direct I/O mode, though, which is commonly used for databases, writes only affect individual disk blocks, not whole file system blocks.  (At least for UFS and QFS, but I presume VxFS is similar.)


In the case of QFS in paged mode, only dirty pages are written, not whole file system blocks (disk allocation units, or DAUs, in QFS terminology).  It's common to use 2 MB or larger DAUs to reduce allocation overhead, improve contiguity, and reduce the need for indirect blocks.  I'm not sure if this is the case for UFS with 8K blocks and 4K pages, but I imagine it is.


As you say, checksumming requires either that whole checksum blocks (not necessarily file system blocks!) be processed, or that the checksum function is reversible, in the sense that inverse and composition functions for it exist [ checksum(ABC) = f(g(A), g(B), g(C)), and there exists g^-1(B) such that we can compute checksum(AB'C) = f(g(A), g(B'), g(C)), or checksum(AB'C) = h(checksum(ABC), range(A), range(B), range(C), g^-1(B), g(B')) ].  [The latter approach comes from a paper I can't track down right now; if anyone's familiar with it, I'd love to get the reference again.]
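
As a concrete (if simplistic) example of such a composable checksum, here is a sketch using plain byte sums modulo 2^32. This is my own illustration, far weaker than the fletcher or SHA-256 checksums ZFS actually uses, but it shows the f/g/g^-1 structure: one block's contribution can be subtracted out and a replacement's added in, without rereading the rest of the data.

    # Illustration only: a trivially composable checksum (byte sum mod 2**32).
    # g() summarizes one chunk, f() composes summaries, and a chunk's summary
    # can be removed again, so checksum(AB'C) follows from checksum(ABC),
    # g(B) and g(B') alone.
    MOD = 2**32

    def g(chunk):
        return sum(chunk) % MOD

    def f(*summaries):
        return sum(summaries) % MOD

    a, b, c, b_new = b"alpha", b"bravo", b"charlie", b"BRAVO"

    full = f(g(a), g(b), g(c))                  # checksum(ABC)
    updated = (full - g(b) + g(b_new)) % MOD    # checksum(AB'C), no reread of A, C
    assert updated == f(g(a), g(b_new), g(c))
    print(hex(full), hex(updated))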


-- Anton



Re: [zfs-discuss] Re: Lots of seeks?

2006-08-11 Thread Anton Rang


On Aug 9, 2006, at 8:18 AM, Roch wrote:




So while I'm feeling optimistic :-) we really ought to be able to do this in two I/O operations. If we have, say, 500K of data to write (including all of the metadata), we should be able to allocate a contiguous 500K block on disk and write that with a single operation. Then we update the Uberblock.

Hi Anton, Optimistic a little yes.

The data blocks should have aggregated quite well into near-recordsize I/Os; are you sure they did not? No O_DSYNC in here, right?


When I repeated this with just 512K written in 1K chunks via dd,
I saw six 16K writes.  Those were the largest.  The others were
around 1K-4K.  No O_DSYNC.

  dd if=/dev/zero of=xyz bs=1k count=512

So some writes are being aggregated, but we're missing a lot.


Once the data blocks are on disk we have the information necessary to update the indirect blocks iteratively up to the ueberblock. Those are the smaller I/Os; I guess that because of ditto blocks they go to physically separate locations, by design.


We shouldn't have to wait for the data blocks to reach disk,
though.  We know where they're going in advance.  One of the
key advantages of the überblock scheme is that we can, in a
sense, speculatively write to disk.  We don't need the tight
ordering that UFS requires to avoid security exposures and
allow the file system to be repaired.  We can lay out all of
the data and metadata, write them all to disk, choose new
locations if the writes fail, etc. and not worry about any
ordering or state issues, because the on-disk image doesn't
change until we commit it.

You're right, the ditto block mechanism will mean that some
writes will be spread around (at least when using a
non-redundant pool like mine), but then we should have at
most three writes followed by the überblock update, assuming
three degrees of replication.


All of these though are normally done asynchronously to
applications, unless the disks are flooded.


Which is a good thing (I think they're asynchronous anyway,
unless the cache is full).


But I follow you in that it may be remotely possible to reduce the number of iterations in the process by assuming that the I/O will all succeed, then if some fails, fix up the consequence, and when all done, update the ueberblock. I would not hold my breath quite yet for that.


Hmmm.  I guess my point is that we shouldn't need to iterate
at all.  There are no dependencies between these writes; only
between the complete set of writes and the überblock update.

-- Anton



Re: [zfs-discuss] Re: Lots of seeks?

2006-08-11 Thread Anton Rang

On Aug 11, 2006, at 12:38 PM, Jonathan Adams wrote:

The problem is that you don't know the actual *contents* of the parent block until *all* of its children have been written to their final locations. (This is because the block pointer's value depends on the final location.)


But I know where the children are going before I actually write them.  There is a dependency of the parent's contents on the *address* of its children, but not on the actual write.  We can compute everything that we are going to write before we start to write.

(Yes, in the event of a write failure we have to recover; but that's
very rare, and can easily be handled -- we just start over, since no
visible state has been changed.)

The ditto blocks don't really affect this, since they can all be written out in parallel.


The reason they affect my desire to turn the update into a two-phase commit (make all the changes, then update the überblock) is that the ditto blocks are deliberately spread across the disk, so we can't collect them into a single write (for a non-redundant pool, or at least a one-disk pool -- presumably they wind up on different disks for a two-disk pool, in which case we can still do a single write per disk).


Again, there is; if a block write fails, you have to re-write it and all of its parents.  So the best you could do would be:

1. assign locations for all blocks, and update the space bitmaps as necessary.
2. update all of the non-Uberdata blocks with their actual contents (which requires calculating checksums on all of the child blocks)
3. write everything out in parallel.
   3a. if any write fails, re-do 1+2 for that block, and 2 for all of its parents, then start over at 3 with all of the changed blocks.
4. once everything is on stable storage, update the uberblock.

That's a lot more complicated than the current model, but certainly seems possible.


(3a could actually be simplified to simply mark the bad blocks as
unallocatable, and go to 1, but it's more efficient as you describe.)
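
A schematic of that flow in Python (my own sketch, not ZFS code; the device, allocator, and block layout are toy stand-ins): addresses and checksums for everything below the uberblock are computed up front, the writes go out as one parallel burst, and only then is the uberblock updated.

    # Illustration only: the "assign, checksum, write in parallel, then commit
    # the uberblock" pipeline from the steps above. All names are made up.
    import hashlib

    class ToyDevice:
        def __init__(self):
            self.storage, self.uberblock = {}, None
        def write_all(self, batch):            # step 3: one parallel burst
            self.storage.update(batch)
            return []                          # pretend no write failed
        def write_uberblock(self, root_addr):  # step 4: the only ordered step
            self.uberblock = root_addr

    def commit(leaves, device):
        """Write a batch of data blocks plus one indirect block, then commit."""
        next_addr, batch, ptrs = 0, {}, []
        for data in leaves:                    # steps 1+2: place and checksum
            batch[next_addr] = data
            ptrs.append((next_addr, hashlib.sha256(data).digest()))
            next_addr += len(data)
        # The indirect (parent) block can be built before any child is on disk:
        # it only needs the children's addresses and checksums.
        root = b"".join(a.to_bytes(8, "big") + h for a, h in ptrs)
        batch[next_addr] = root
        failed = device.write_all(batch)
        assert not failed                      # step 3a (relocate/retry) omitted
        device.write_uberblock(next_addr)
        return next_addr

    dev = ToyDevice()
    print(commit([b"hello", b"world"], dev), dev.uberblock)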

The eventual advantage, though, is that we get the performance of a single write (plus, always, the überblock update).  In a heavily loaded system, the current approach (lots of small writes) won't scale so well.  (Actually we'd probably want to limit the size of each write to some small value, like 16 MB, simply to allow the first write to start earlier under fairly heavy loads.)

As I pointed out earlier, this would require getting scatter/gather support through the storage subsystem, but the potential win should be quite large.

Something to think about for the future.  :-)

Incidentally, this is part of how QFS gets its performance for streaming I/O.  We use an allocate-forward policy, allow very large allocation blocks, and separate the metadata from data.  This allows us to write (or read) data in fairly large I/O requests, without unnecessary disk head motion.

Anton



Re: [zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source QFS... / was: Re: Re: Distributed File System for Solaris

2006-05-31 Thread Anton Rang
On May 31, 2006, at 8:56 AM, Roch Bourbonnais - Performance  
Engineering wrote:



I'm not taking a stance on this, but if I keep a controller full of 128K I/Os, and assuming they are targeting contiguous physical blocks, how different is that to issuing a very large I/O?


There are differences at the host, the HBA, the disk or RAID
controller, and on the wire.

At the host:

  The SCSI/FC/ATA stack is run once for each I/O.  This takes
  a bit of CPU.  We generally take one interrupt for each I/O
  (if the CPU is fast enough), so instead of taking one
  interrupt for 8 MB (for instance), we take 64.

  We run through the IOMMU or page translation code once per
  page, but the overhead of initially setting up the IOMMU or
  starting the translation loop happens once per I/O.

At the HBA:

  There is some overhead each time that the controller switches
  processing from one I/O to another.  This isn't too large on a
  modern system, but it does add up.

  There is overhead on the PCI (or other) bus for the small
  transfers that make up the command block and scatter/gather
  list for each I/O.  Again, it adds up (faster than you might
  expect, since PCI Express can move 128 KB very quickly).

  There is a limit on the maximum number of outstanding I/O
  requests, but we're unlikely to hit this in normal use; it
  is typically at least 256 and more often 1024 or more on
  newer hardware.  (This is shared for the whole channel
  in the FC and SCSI case, and may be shared between multiple
  channels for SAS or multi-port FC cards.)

  There is often a small cache of commands which can be handled
  quickly; commands outside of this cache (which may hold 4 to
  16 or so) are much slower to context-switch in when their
  data is needed; in particular, the scatter/gather list may
  need to be read again.

At the disk or RAID:

  There is a fixed overhead for processing each command.  This
  can be fairly readily measured, and roughly reflects the
  difference between delivered 512-byte IOPs and bandwidth for
  a large I/O.  Some of it is related to parsing the CDB and
  starting command execution; some of it is related to cache
  management.

  There is some overhead for switching between data transfers
  for each command.  A typical track on a disk may hold 400K
  or so of data, and a full-track transfer is optimal (runs at
  platter speed).  A partial-track transfer immediately followed
  by another may take enough time to switch that we sometimes
  lose one revolution (particularly on disks which do not have
  sector headers).  Write caching should nearly eliminate this
  as a concern, however.

  There is a fixed-size window of commands that can be
  reordered on the device.  Data transfer within a command can
  be reordered arbitrarily (for parallel SCSI and FC, though
  not for ATA or SAS).  It's good to have lots of outstanding
  commands, but if they are all sequential, there's not much
  point (no reason to reorder them, except perhaps if you're
  going backwards, and FC/SCSI can handle this anyway).

On the wire:

  Sending a command and its completion takes time that could
  be spent moving data instead; but for most protocols this
  probably isn't significant.

You can actually see most of this with a PCI and protocol
analyzer.
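
To put rough numbers on the overheads above, here is a back-of-the-envelope model in Python (my own sketch; the per-command overhead and channel bandwidth are assumptions, not measurements) comparing one 8 MB transfer against 64 x 128 KB transfers.

    # Illustration only; overhead and bandwidth figures are assumed, not measured.
    def transfer_time_ms(total_bytes, io_size, per_io_overhead_us=150,
                         bandwidth_mb_s=200):
        """Total time when per-command overhead is paid once per I/O."""
        n_ios = total_bytes // io_size
        overhead_ms = n_ios * per_io_overhead_us / 1000.0
        data_ms = total_bytes / (bandwidth_mb_s * 1e6) * 1000.0
        return overhead_ms + data_ms

    total = 8 * 2**20
    for io in (128 * 2**10, 8 * 2**20):
        t = transfer_time_ms(total, io)
        print(f"{io // 1024:>5} KB I/Os: {t:6.1f} ms, "
              f"{total / 1e6 / t * 1000:.0f} MB/s effective")
    # With these assumptions, the 64 small commands add ~10 ms of pure
    # per-I/O overhead on top of ~42 ms of data movement.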

-- Anton



Re: [zfs-discuss] Re: Re[5]: Re: Re: Due to 128KB limit in ZFS it can't saturate disks

2006-05-16 Thread Anton Rang

Ok, so let's consider your 2MB read. You have the option of setting it in one contiguous place on the disk or splitting it into 16 x 128K chunks, somewhat spread all over.

Now you issue a read to that 2MB of data.

As you noted, you either have to wait for the head to find the 2MB block and stream it, or you dump 16 I/O descriptors into an intelligent controller; wherever the head is, there is data to be gotten from the get-go. I can't swear it wins the game, but it should be real close.


Well, the full specs aren't available, but a little math and
studying some models can get us close.  :-)

Let's presume we're using an enterprise-class disk, say a 37 GB
Seagate Cheetah.  This is best-case for seeks as it uses so
little of the platter and runs at 15K RPM.

Large-block case:

On average, to reach the 2 MB, we'll take 3.5ms.  Transfer can
then proceed at media rate (average 110 MB/sec) and be sent to
the host over a 200 MB/sec channel.  3.5 ms seek, 18.1 ms data
transfer, total time 21.6 ms for a rate of 92.6 MB/sec.

Small-block case:

Each seek will be shorter than the average since we are ordering them optimally.  A single-track seek is 0.2 ms; the average is 3.5 ms; if we assume linear scaling (which isn't quite right) then we're looking at 1/8 of 3.7 ms = 0.46 ms.  We do 16 seeks, for 7.36 ms, and our data transfer time is the same (18.1 ms), for a total of 25.46 ms, a rate of 78.5 MB/sec.

Not too bad.  It's pretty clear why these drives are pricey.  :-)

Mmmm, actually it's not that good.  There are 50K tracks on this
35 GB disk, so each track holds 700 KB.  We're only storing 128KB
on each track, so on average we'll need to wait nearly 1/2 of a
revolution before we see any of our data under the head.  At 15K
RPM, that's not so bad, only 2ms, but we've got 16 times to wait,
adding 32 ms, dropping our rate to roughly half what we'd get
otherwise.  (Older disks should, surprisingly, do better since
they have less data packed onto each track!)

Looking at a 250 GB near-line SATA disk, and presuming its controller does the same optimizations, things are different.  Average seek time is 8 ms, with a single-track seek time of 0.8 ms, so 15 additional seeks will cost roughly 30 ms.  A half-revolution wait is 4 ms (60 ms in total).  Things are going pretty slow now.
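
Here is that arithmetic as a small Python sketch (my own; the drive parameters are the assumed figures from above), comparing one contiguous 2 MB read against 16 scattered 128 KB reads, with and without the half-revolution rotational waits.

    # Illustration only; drive parameters are the assumed figures from the text.
    DATA = 2e6                      # the 2 MB request
    MEDIA_RATE = 110e6              # ~110 MB/s at the platter
    AVG_SEEK = 3.5e-3               # average seek on the 15K RPM drive
    SHORT_SEEK = 3.7e-3 / 8         # "1/8 of 3.7 ms" for the nearby chunk seeks
    HALF_REV = 0.5 * 60 / 15000     # half a revolution at 15K RPM (~2 ms)

    def rate_mb_s(n_chunks, rotational_wait=False):
        seek = AVG_SEEK if n_chunks == 1 else n_chunks * SHORT_SEEK
        rot = n_chunks * HALF_REV if rotational_wait else 0.0
        total = seek + rot + DATA / MEDIA_RATE
        return DATA / total / 1e6

    print(f"contiguous 2 MB read:      {rate_mb_s(1):5.1f} MB/s")
    print(f"16 x 128K, seek cost only: {rate_mb_s(16):5.1f} MB/s")
    print(f"16 x 128K, plus rotation:  {rate_mb_s(16, True):5.1f} MB/s")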


I just did an experiment and could see roughly 60 MB/s out of a 35G disk using 128K chunks (about 450 IOPS).


On the only disk I have handy, I get 36 MB/sec with concurrent
128 KB chunks, 38 MB/sec with non-concurrent 2 MB chunks,
39 MB/sec with 2 MB chunks.  But I'm issuing all of these I/O
operations sequentially -- no seeks.


Disruptive.


What is?

Multiple I/Os outstanding to a device isn't precisely new.  ;-)

Honestly, adding seeks is -never- going to improve performance.
Giving the drive the opportunity to reorder I/O operations will,
but splitting a single operation up can never speed it up, though
if you get lucky it won't slow down.

Anton



Re: [zfs-discuss] Re: ZFS and databases

2006-05-12 Thread Anton Rang

On May 12, 2006, at 11:59 AM, Richard Elling wrote:


CPU cycles and memory bandwidth (which both can be in short
supply on a database server).


We can throw hardware at that :-)  Imagine a machine with lots
of extra CPU cycles [ ... ]


Yes, I've heard this story before, and I won't believe it this  
time.  ;-)


Seriously, I believe a database can perform very well on a CMT system,
but there won't be any extra CPU cycles or memory bandwidth, because
the demand for transaction rates will always exceed what we can supply.

Anton
