Brandon High wrote:
> On Tue, Apr 15, 2008 at 2:44 AM, Jayaraman, Bhaskar
> <[EMAIL PROTECTED]> wrote:
>   
>> Thanks Brandon, so basically there is no way of knowing: -
>>  1] How your file will be distributed across the disks
>>  2] What will be the stripe size
>>     
>
> You could look at the source to try to determine it. I'm not sure if
> there's a hard & fast rule however.
>
> The stripe will span all vdevs that have space. For each stripe
> written, more data will land on the emptier vdev. Once the
> previously existing vdevs fill up, writes will go to the new vdev.
>   
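
To make that concrete, here is a toy Python sketch of the idea (this is
only a mental model, not the real metaslab code -- the vdev names,
sizes, and the simple "most free space wins" rule are assumptions for
illustration):

# Toy model: new writes are biased toward the vdev with the most free
# space, so a freshly added empty vdev soaks up most of the new data
# until utilization evens out.

def pick_vdev(vdevs):
    """Pick the vdev with the most free space ('size' and 'used' are
    made-up fields for this example)."""
    return max(vdevs, key=lambda v: v['size'] - v['used'])

vdevs = [
    {'name': 'old1', 'size': 100, 'used': 90},   # nearly full
    {'name': 'old2', 'size': 100, 'used': 85},
    {'name': 'new',  'size': 100, 'used': 0},    # just added, empty
]

for _ in range(30):               # thirty one-unit writes
    pick_vdev(vdevs)['used'] += 1

for v in vdevs:
    print(v['name'], v['used'])   # 'new' receives all of these writes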

In general, the first time a device is filled, space is allocated
from the spacemap one slab at a time.  The default slab size is 1MByte.
So when you look at physical I/O, you may see something like 8
128kByte sequential writes to one vdev concurrent with 8 128kByte
sequential writes to another vdev, and so on.  Reads go where
needed.

As the space fills up, you may see freed blocks in the spacemap
become re-allocated.  How it will look depends on many factors,
but I can map it in 3-D :-)
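
As a rough sketch of what that I/O pattern looks like (toy Python, not
the real spacemap code; the names and the strict eight-writes-per-slab
pacing are assumptions for illustration):

SLAB = 1 << 20          # default slab size: 1 MByte
RECORD = 128 << 10      # 128 kByte physical writes

def slab_writes(vdev, slab_start):
    """Yield (vdev, offset, length) for the sequential writes that
    drain one slab: 8 x 128 kByte per 1 MByte slab."""
    for off in range(slab_start, slab_start + SLAB, RECORD):
        yield (vdev, off, RECORD)

# Two vdevs each filling a slab at the same time: 8 sequential 128k
# writes to vdev0 interleaved with 8 sequential 128k writes to vdev1.
for pair in zip(slab_writes('vdev0', 0), slab_writes('vdev1', 0)):
    print(pair)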

>>  3] If the blocks are redistributed when a new disk is attached to the
>>  storage.
>>     
>
> No existing data is redistributed, so far as I know. If there is a lot
> of churn on the volume, it will eventually balance out, but if you add
> a new vdev and don't add data to the zpool, there will be no data on
> the new vdev.
>
> Unless you've dedicated a device, the ZIL is supposed to stripe across
> all vdevs, so there will be some activity on the new volume.
>
>   
>>  If you happen to know, what does "Every block is its own RAID-Z stripe,
>>  regardless of blocksize" mean? http://blogs.sun.com/bonwick/category/ZFS
>>     
>
> In traditional RAID-5 or RAID-6, the entire stripe has to be read,
> altered, XOR re-calculated, and written back to disk. Because that
> read-modify-write cycle can't be done atomically, there's the RAID-5
> write hole, where data can be lost or corrupted.
>
> By "full stripe", I don't think he meant a write across all vdevs, but
> a write of data and associated XOR information, which would allow
> recovery in the event of a device death.
>
>   
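
A toy sketch of that "every block is its own stripe" idea (plain
Python, nothing like the real raidz code; the 512-byte sector size and
the single-parity layout are just assumptions for illustration): each
logical block is split into however many data columns it needs, gets
its own XOR parity column, and the whole thing goes out as one
full-stripe write, so there is no read-modify-write of an existing
stripe and hence no write hole.

def raidz_stripe(block, sector=512):
    """Split 'block' into sector-sized data columns plus one XOR
    parity column; the stripe width is whatever this block needs."""
    cols = [block[i:i + sector].ljust(sector, b'\0')
            for i in range(0, len(block), sector)]
    parity = bytearray(sector)
    for col in cols:
        for i, b in enumerate(col):
            parity[i] ^= b
    return [bytes(parity)] + cols    # one column per disk, parity first

small = raidz_stripe(b'x' * 512)     # 2 columns: parity + 1 data
large = raidz_stripe(b'y' * 2048)    # 5 columns: parity + 4 data
print(len(small), len(large))
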
>>  4] So if I create files of 64kb or less, then I'm assuming ZFS will
>>  determine some stripe size and stripe my file across the assigned block
>>  pointers (let's assume 4 blocks of 16kb each)???
>>     

No, not normally.  ZFS groups writes to try to do 128kByte writes.
So in a single 128kByte block, there may be parts of different files.
By default the transaction group is flushed every 5 seconds, but there
are many reasons this may change (see other recent threads here).
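
A very rough sketch of that batching behaviour (toy Python, nothing
like the real DMU code; the class name and the exact chunking are
assumptions for illustration):

RECORD = 128 << 10        # ZFS tries to issue writes in 128 kByte chunks

class Txg:
    """Dirty data from many files piles up in memory and is written
    out in big chunks at sync time, not one small write per file."""
    def __init__(self):
        self.dirty = []                       # (filename, data) pairs

    def write(self, filename, data):
        self.dirty.append((filename, data))   # buffered only, no I/O yet

    def sync(self):
        # In the real thing this runs on a timer (5 s default back
        # then) or when enough dirty data accumulates.
        blob = b''.join(d for _, d in self.dirty)
        self.dirty = []
        return [blob[i:i + RECORD] for i in range(0, len(blob), RECORD)]

txg = Txg()
for n in range(10):                        # ten 64 kByte files
    txg.write('file%d' % n, b'\0' * (64 << 10))
print([len(c) for c in txg.sync()])        # five 128 kByte chunks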

It makes better sense if you think of ZFS allocating space across the
devices the way the Solaris VM system allocates memory, rather than
looking at it as writing data to disks in the traditional sense.  See Jeff
Bonwick's papers on the slab allocator and additional commentary
on space maps at http://blogs.sun.com/bonwick (see the comments,
too :-)
 -- richard

>>  5] However if I create a file of 256kb then it may stripe it across 4
>>  blocks again with the stripe size at 64kb this time, but we can never be
>>  sure how this is decided. Is that right?
>>     
>
> For a non-protected stripe, I don't think that's how it works. It'll
> write data to some (but not necessarily all) of the vdevs. I believe
> the cutoff is 512KB. So a write of 1MB will "stripe" across 2 vdevs
> (biased toward under-utilized vdevs), while 4 writes of 64KB each
> could land on the same vdev.
>
> I think RAID-Z is different, since the stripe needs to spread across
> all devices for protection. I'm not sure how it's done.
>
>   
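
Taking that description (and the 512 KByte figure above) at face value,
a toy model might look like the Python below -- the cutoff, the naive
two-way split, and all the names are just this example's assumptions,
not a statement about what the real allocator does:

CUTOFF = 512 << 10       # assumed split threshold from the text above

def place_write(size, vdevs):
    """Return (vdev, nbytes) placements, preferring the vdevs with the
    most free space; small writes land whole on one vdev."""
    order = sorted(vdevs, key=lambda v: v['free'], reverse=True)
    if size <= CUTOFF:
        order[0]['free'] -= size
        return [(order[0]['name'], size)]
    half = size // 2                         # naive split across two vdevs
    order[0]['free'] -= half
    order[1]['free'] -= size - half
    return [(order[0]['name'], half), (order[1]['name'], size - half)]

vdevs = [{'name': 'vdev0', 'free': 10 << 30},
         {'name': 'vdev1', 'free': 2 << 30}]
print(place_write(1 << 20, vdevs))     # 1 MB write -> split across both
print(place_write(64 << 10, vdevs))    # 64 kB write -> single vdev
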
>>  6] Still I don't see how each block becomes its own stripe unless there
>>  is byte-level striping, with each byte on a different disk block, which
>>  would be very wasteful.
>>     
>
> See above.
>
> Again, my answers may not be correct. This is what I've gleaned from
> my own research.
>
> -B
>
>   

