Re: [zfs-discuss] ZFS I/O algorithms

2008-03-15 Thread Bob Friesenhahn
On Sat, 15 Mar 2008, Richard Elling wrote:
>
> My observation is that each metaslab is, by default, 1 MByte in size.
> Each top-level vdev is divided into metaslabs, and ZFS tries to allocate
> from one top-level vdev's metaslab before moving on to another one.  So
> you should see eight 128 kByte allocations per top-level vdev before the
> next top-level vdev is used.
>
> That said, the actual I/Os are issued in parallel, so it is not unusual to
> see many, most, or all of the top-level vdevs busy concurrently.
>
> Does this match your experience?

I do see that all the devices are quite evenly busy.  There is no 
doubt that the load balancing is quite good.  The main question is 
whether there is any actual "striping" going on (breaking the data 
into smaller chunks), or whether the algorithm is simply load 
balancing whole blocks.  Striping trades IOPS for bandwidth.
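
One way to look at this in more detail while the test runs (just a 
sketch, using the pool name from the status output in my original 
message) is per-vdev iostat:

   zpool iostat -v Sun_2540 5

That breaks the read and write bandwidth out per mirror vdev rather 
than per pool, so it should at least show how evenly the writes spread 
across the six mirrors, even if it cannot prove whether a single 128K 
block is being split.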

Using my application, I did some tests today.  The application was 
used to do balanced read/write of about 500GB of data in some tens of 
thousands of reasonably large files.  The application sequentially 
reads a file, then sequentially writes a file.  Several copies (2-6) 
of the application were run at once for concurrency.  What I noticed 
is that with hardly any CPU being used, the read+write bandwidth 
seemed to be bottlenecked at about 280MB/second, with 'zpool iostat' 
showing very balanced I/O between the reads and the writes.
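
To narrow down where the 280MB/second ceiling comes from, I could also 
watch the raw device statistics while the copies run (a sketch; I have 
not done this yet):

   iostat -xnz 5

which should show whether the individual LUNs are saturated or whether 
the limit is somewhere in the path to the 2540.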

The system I set up is performing quite a bit differently than I 
anticipated.  The I/O is the bottleneck, and I find that my application 
can do significant processing of the data without significantly 
increasing the application run time.  So CPU time is almost free.

If I were to set a smaller recordsize on the filesystem, would that 
provide more of the benefits of striping, or would it be detrimental to 
performance due to the larger number of I/Os?
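
For example (a sketch only; as I understand it, recordsize affects only 
blocks written after the change, so existing files would keep their 
current block size, and "somefs" below is just a placeholder):

   zfs set recordsize=32K Sun_2540/somefs
   zfs get recordsize Sun_2540/somefs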

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-15 Thread Richard Elling
Bob Friesenhahn wrote:
> Can someone please describe to me the actual underlying I/O operations 
> which occur when a 128K block of data is written to a storage pool 
> configured as shown below (with default ZFS block sizes)?  I am 
> particularly interested in the degree of "striping" across mirrors 
> which occurs.  This would be for Solaris 10 U4.
>   

My observation is that each metaslab is, by default, 1 MByte in size.
Each top-level vdev is divided into metaslabs, and ZFS tries to allocate
from one top-level vdev's metaslab before moving on to another one.  So
you should see eight 128 kByte allocations per top-level vdev before the
next top-level vdev is used.
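
If you want to see how the metaslabs are laid out on your own pool, zdb 
can dump them read-only (I am going from memory on the option, so treat 
this as a sketch):

   zdb -m Sun_2540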

That said, the actual I/Os are issued in parallel, so it is not unusual to
see many, most, or all of the top-level vdevs busy concurrently.

Does this match your experience?
 -- richard

>
>   NAME   STATE READ WRITE CKSUM
>   Sun_2540   ONLINE   0 0 0
> mirror   ONLINE   0 0 0
>   c4t600A0B80003A8A0B096A47B4559Ed0  ONLINE   0 0 0
>   c4t600A0B800039C9B50AA047B4529Bd0  ONLINE   0 0 0
> mirror   ONLINE   0 0 0
>   c4t600A0B80003A8A0B096E47B456DAd0  ONLINE   0 0 0
>   c4t600A0B800039C9B50AA447B4544Fd0  ONLINE   0 0 0
> mirror   ONLINE   0 0 0
>   c4t600A0B80003A8A0B096147B451BEd0  ONLINE   0 0 0
>   c4t600A0B800039C9B50AA847B45605d0  ONLINE   0 0 0
> mirror   ONLINE   0 0 0
>   c4t600A0B80003A8A0B096647B453CEd0  ONLINE   0 0 0
>   c4t600A0B800039C9B50AAC47B45739d0  ONLINE   0 0 0
> mirror   ONLINE   0 0 0
>   c4t600A0B80003A8A0B097347B457D4d0  ONLINE   0 0 0
>   c4t600A0B800039C9B50AB047B457ADd0  ONLINE   0 0 0
> mirror   ONLINE   0 0 0
>   c4t600A0B800039C9B50A9C47B4522Dd0  ONLINE   0 0 0
>   c4t600A0B800039C9B50AB447B4595Fd0  ONLINE   0 0 0
>
>
> Thanks,
>
> Bob
> ==
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS I/O algorithms

2008-03-15 Thread Bob Friesenhahn
Can someone please describe to me the actual underlying I/O operations 
which occur when a 128K block of data is written to a storage pool 
configured as shown below (with default ZFS block sizes)?  I am 
particularly interested in the degree of "striping" across mirrors 
which occurs.  This would be for Solaris 10 U4.


NAME   STATE READ WRITE CKSUM
Sun_2540   ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B80003A8A0B096A47B4559Ed0  ONLINE   0 0 0
c4t600A0B800039C9B50AA047B4529Bd0  ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B80003A8A0B096E47B456DAd0  ONLINE   0 0 0
c4t600A0B800039C9B50AA447B4544Fd0  ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B80003A8A0B096147B451BEd0  ONLINE   0 0 0
c4t600A0B800039C9B50AA847B45605d0  ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B80003A8A0B096647B453CEd0  ONLINE   0 0 0
c4t600A0B800039C9B50AAC47B45739d0  ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B80003A8A0B097347B457D4d0  ONLINE   0 0 0
c4t600A0B800039C9B50AB047B457ADd0  ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c4t600A0B800039C9B50A9C47B4522Dd0  ONLINE   0 0 0
c4t600A0B800039C9B50AB447B4595Fd0  ONLINE   0 0 0
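
(If it helps, I can also watch the actual I/O sizes issued to each 
device while writing -- an untested sketch using the DTrace io 
provider:

   dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'

but I would still appreciate a description of what the allocator is 
supposed to do.)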


Thanks,

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Max_Payload_Size

2008-03-15 Thread Marc Bevand
Anton B. Rang <... at acm.org> writes:
> Looking at the AMD 690 series manual (well, the family
> register guide), the max payload size value is deliberately
> set to 0 to indicate that the chip only supports 128-byte
> transfers. There is a bit in another register which can be
> set to ignore max-payload errors.  Perhaps that's being set?

Perhaps.  I briefly tried looking for the AMD 690 series manual or
datasheet, but neither seems to be available to the public.
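
For what it's worth, the encoding itself is standard PCI Express rather 
than chip-specific: the Device Capabilities register reports the largest 
supported payload in its low three bits, the Device Control register 
selects the active size in bits 7:5, and both fields decode as 
128 * 2^value bytes, so a capability value of 0 really does mean 
128-byte transfers only.  A throwaway decode in shell (the 0x2810 
register value is just a made-up example):

   devctl=0x2810
   echo $(( 128 << ((devctl >> 5) & 0x7) ))   # prints 128, i.e. 128-byte max payload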

I think I'll go back to the 128-byte setting.  I wouldn't want to
see errors under heavy usage even though my stress tests were all
successful (an aggregate data rate of 610 MB/s sustained by reading
the disks for 24+ hours, 6 million head seeks performed by each
disk, etc.).

Thanks for your much appreciated comments.

-- 
Marc Bevand


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Copying between pools

2008-03-15 Thread Vahid Moghaddasi
On 3/14/08, Vahid Moghaddasi wrote:


On Fri, Mar 14, 2008 at 11:26 PM, Tim wrote:



replace your LUNs one at a time:

zpool replace -f rd_01 c4t6006048187870150525244353543d0 first_lun_off_dmx-3
zpool replace -f rd_01 c4t6006048187870150525244353942d0 second_lun_off_dmx-3


and so on.


Simple enough, thanks.  I assume that as soon as I start the zpool replace 
operation, the original LUNs will no longer be in the rd_01 pool.  Not that I 
will do that, but theoretically I could perform this on a live machine without 
interruption, is that right?
Thank you,

On Fri, Mar 15, 2008 at 12:11 AM, Tim wrote:


Yes, it's the same as replacing a bad drive.  Just make sure you do one, let it 
completely finish, then move on to the next.  Obviously the drives need to 
*rebuild* (resilver) as you replace them.
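
Something like this (just a sketch) will show when each resilver has 
finished:

zpool status rd_01

Wait until the status output reports that the resilver completed and 
the "replacing" vdev is gone before moving on to the next LUN.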

Tim and Jim thank you very much for your help. I will give it a shot.
Vahid.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss