Re: [zfs-discuss] ZFS I/O algorithms
On Sat, 15 Mar 2008, Richard Elling wrote:

> My observation is that each metaslab is, by default, 1 MByte in size. Each
> top-level vdev is allocated by metaslabs. ZFS tries to fill a top-level
> vdev's metaslab before moving on to another one. So you should see eight
> 128 kByte allocations per top-level vdev before the next top-level vdev is
> allocated.
>
> That said, the actual iops are sent in parallel. So it is not unusual to
> see many, most, or all of the top-level vdevs concurrently busy.
>
> Does this match your experience?

I do see that all the devices are quite evenly busy. There is no doubt that the load balancing is quite good. The main question is whether there is any actual "striping" going on (breaking the data into smaller chunks), or if the algorithm is simply load balancing. Striping trades IOPS for bandwidth.

Using my application, I did some tests today. The application was used to do a balanced read/write of about 500 GB of data in some tens of thousands of reasonably large files. The application sequentially reads a file, then sequentially writes a file. Several copies (2-6) of the application were run at once for concurrency. What I noticed is that, with hardly any CPU being used, the read+write bandwidth seemed to be bottlenecked at about 280 MB/second, with 'zpool iostat' showing very balanced I/O between the reads and the writes.

The system I set up is performing quite a bit differently than I anticipated. The I/O is bottlenecked, and I find that my application can do significant processing of the data without significantly increasing the application run time. So CPU time is almost free.

If I were to assign a smaller block size for the filesystem, would that provide more of the benefits of striping, or would it be detrimental to performance due to the number of I/Os?
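As a back-of-the-envelope check on the recordsize question (plain arithmetic, not anything from the ZFS source): halving the recordsize doubles the number of block allocations a large sequential file needs, so a smaller recordsize spreads data more finely across vdevs at the cost of proportionally more I/Os.

```python
# Illustrative arithmetic only: how many block allocations a large
# sequential file needs at various recordsize settings.  Assumes the
# file is written entirely in full records (typical for large
# sequential writes).
def allocations(file_bytes, recordsize):
    """Number of full records needed to hold file_bytes (ceiling division)."""
    return -(-file_bytes // recordsize)

file_size = 64 * 1024 * 1024  # a 64 MB file, for example
for rs_kb in (128, 64, 8):
    n = allocations(file_size, rs_kb * 1024)
    print(f"recordsize={rs_kb:>3}K -> {n:>5} block allocations")
```

At 8K records the same file needs 16x the I/Os of the 128K default, which is the IOPS cost Bob is asking about.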
Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS I/O algorithms
Bob Friesenhahn wrote:

> Can someone please describe to me the actual underlying I/O operations
> which occur when a 128K block of data is written to a storage pool
> configured as shown below (with default ZFS block sizes)? I am
> particularly interested in the degree of "striping" across mirrors
> which occurs. This would be for Solaris 10 U4.

My observation is that each metaslab is, by default, 1 MByte in size. Each top-level vdev is allocated by metaslabs. ZFS tries to fill a top-level vdev's metaslab before moving on to another one. So you should see eight 128 kByte allocations per top-level vdev before the next top-level vdev is allocated.

That said, the actual iops are sent in parallel. So it is not unusual to see many, most, or all of the top-level vdevs concurrently busy.

Does this match your experience?
 -- richard

>     NAME                                     STATE     READ WRITE CKSUM
>     Sun_2540                                 ONLINE       0     0     0
>       mirror                                 ONLINE       0     0     0
>         c4t600A0B80003A8A0B096A47B4559Ed0    ONLINE       0     0     0
>         c4t600A0B800039C9B50AA047B4529Bd0    ONLINE       0     0     0
>       mirror                                 ONLINE       0     0     0
>         c4t600A0B80003A8A0B096E47B456DAd0    ONLINE       0     0     0
>         c4t600A0B800039C9B50AA447B4544Fd0    ONLINE       0     0     0
>       mirror                                 ONLINE       0     0     0
>         c4t600A0B80003A8A0B096147B451BEd0    ONLINE       0     0     0
>         c4t600A0B800039C9B50AA847B45605d0    ONLINE       0     0     0
>       mirror                                 ONLINE       0     0     0
>         c4t600A0B80003A8A0B096647B453CEd0    ONLINE       0     0     0
>         c4t600A0B800039C9B50AAC47B45739d0    ONLINE       0     0     0
>       mirror                                 ONLINE       0     0     0
>         c4t600A0B80003A8A0B097347B457D4d0    ONLINE       0     0     0
>         c4t600A0B800039C9B50AB047B457ADd0    ONLINE       0     0     0
>       mirror                                 ONLINE       0     0     0
>         c4t600A0B800039C9B50A9C47B4522Dd0    ONLINE       0     0     0
>         c4t600A0B800039C9B50AB447B4595Fd0    ONLINE       0     0     0
>
> Thanks,
>
> Bob
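If Richard's description holds (each top-level vdev's active 1 MByte metaslab region is filled before the allocator moves on), the placement pattern for a stream of 128K writes against a six-mirror pool would look roughly like this toy model. It is a sketch of the described behavior only, not the real ZFS allocator (which uses space maps, weighting, and more).

```python
# Toy model of the allocation cadence Richard describes: keep writing
# into the current top-level vdev's 1 MB region, advance to the next
# vdev once the region cannot hold another block.  Shows the
# "eight 128K allocs per vdev" pattern; not actual ZFS code.
REGION = 1024 * 1024   # 1 MByte metaslab region (per Richard's observation)
BLOCK = 128 * 1024     # default 128K recordsize
VDEVS = 6              # six mirror top-level vdevs in Bob's pool

def allocate(n_blocks):
    """Return the top-level vdev index chosen for each of n_blocks writes."""
    placement = []
    vdev, used = 0, 0
    for _ in range(n_blocks):
        if used + BLOCK > REGION:        # region full: advance to next vdev
            vdev = (vdev + 1) % VDEVS
            used = 0
        placement.append(vdev)
        used += BLOCK
    return placement

p = allocate(24)
print(p)   # eight consecutive blocks per vdev: [0]*8 + [1]*8 + [2]*8
```

The point of the model is Bob's distinction: consecutive 128K blocks land on the *same* vdev eight at a time (load balancing over time), rather than one logical block being split into chunks across all vdevs (striping).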
[zfs-discuss] ZFS I/O algorithms
Can someone please describe to me the actual underlying I/O operations which occur when a 128K block of data is written to a storage pool configured as shown below (with default ZFS block sizes)? I am particularly interested in the degree of "striping" across mirrors which occurs. This would be for Solaris 10 U4.

    NAME                                     STATE     READ WRITE CKSUM
    Sun_2540                                 ONLINE       0     0     0
      mirror                                 ONLINE       0     0     0
        c4t600A0B80003A8A0B096A47B4559Ed0    ONLINE       0     0     0
        c4t600A0B800039C9B50AA047B4529Bd0    ONLINE       0     0     0
      mirror                                 ONLINE       0     0     0
        c4t600A0B80003A8A0B096E47B456DAd0    ONLINE       0     0     0
        c4t600A0B800039C9B50AA447B4544Fd0    ONLINE       0     0     0
      mirror                                 ONLINE       0     0     0
        c4t600A0B80003A8A0B096147B451BEd0    ONLINE       0     0     0
        c4t600A0B800039C9B50AA847B45605d0    ONLINE       0     0     0
      mirror                                 ONLINE       0     0     0
        c4t600A0B80003A8A0B096647B453CEd0    ONLINE       0     0     0
        c4t600A0B800039C9B50AAC47B45739d0    ONLINE       0     0     0
      mirror                                 ONLINE       0     0     0
        c4t600A0B80003A8A0B097347B457D4d0    ONLINE       0     0     0
        c4t600A0B800039C9B50AB047B457ADd0    ONLINE       0     0     0
      mirror                                 ONLINE       0     0     0
        c4t600A0B800039C9B50A9C47B4522Dd0    ONLINE       0     0     0
        c4t600A0B800039C9B50AB447B4595Fd0    ONLINE       0     0     0

Thanks,

Bob
Re: [zfs-discuss] Max_Payload_Size
Anton B. Rang (acm.org) writes:

> Looking at the AMD 690 series manual (well, the family
> register guide), the max payload size value is deliberately
> set to 0 to indicate that the chip only supports 128-byte
> transfers. There is a bit in another register which can be
> set to ignore max-payload errors. Perhaps that's being set?

Perhaps. I briefly looked for an AMD 690 series manual or datasheet, but they don't seem to be available to the public.

I think I'll go back to the 128-byte setting. I wouldn't want to see errors happening under heavy usage, even though my stress tests were all successful (aggregate data rate of 610 MB/s generated by reading the disks for 24+ hours, 6 million head seeks performed by each disk, etc.).

Thanks for your much appreciated comments.

-- 
Marc Bevand
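For a rough sense of what the 128-byte setting costs, here is a simple efficiency estimate. The per-packet overhead figure is an assumption (roughly a 12-16 byte TLP header plus several bytes of link-layer framing; the exact number depends on header format and optional ECRC), not something from the AMD documentation.

```python
# Rough PCIe link-efficiency estimate: fraction of wire bytes that are
# payload data for a given Max_Payload_Size.  OVERHEAD is an assumed
# per-TLP cost (header plus DLL/framing); adjust to match your platform.
OVERHEAD = 24  # bytes per packet, assumed

def efficiency(payload):
    """Fraction of transmitted bytes that are actual data."""
    return payload / (payload + OVERHEAD)

for mps in (128, 256, 512):
    print(f"MPS={mps:>3}B -> {efficiency(mps):.1%} of wire bytes are data")
```

Under this assumption a 128-byte payload still moves data at roughly 84% wire efficiency, which is consistent with the link not being the bottleneck in Marc's 610 MB/s stress test.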
Re: [zfs-discuss] Copying between pools
On 3/14/08, Vahid Moghaddasi wrote:

On Fri, Mar 14, 2008 at 11:26 PM, Tim wrote:

> Replace your LUNs one at a time:
>
>   zpool replace -f rd_01 c4t6006048187870150525244353543d0 first_lun_off_dmx-3
>   zpool replace -f rd_01 c4t6006048187870150525244353942d0 second_lun_off_dmx-3
>
> and so on.

Simple enough, thanks. I assume that as I start the zpool replace operation, the original LUNs will no longer be in the rd_01 pool. Not that I will do that, but theoretically I can perform this on a live machine without interruption, is that right?

Thank you,

On Fri, Mar 15, 2008 at 12:11 AM, Tim wrote:

> Yes, it's the same as replacing a bad drive. Just make sure you do one,
> let it completely finish, then move on to the next. Obviously the drives
> need to *rebuild* as you replace them.

Tim and Jim, thank you very much for your help. I will give it a shot.

Vahid.
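Tim's one-at-a-time procedure can be scripted. This sketch only prints the replace commands in order (the target names are the placeholders from the thread, not real device names); in practice you would run each command and wait for `zpool status` to show the resilver complete before issuing the next.

```python
# Sketch: emit the "zpool replace" sequence for migrating a pool's LUNs
# one at a time.  The "*_lun_off_dmx-3" names are placeholders carried
# over from the thread.  Each replace must fully resilver before the
# next one starts, as Tim notes.
POOL = "rd_01"
migrations = [
    ("c4t6006048187870150525244353543d0", "first_lun_off_dmx-3"),
    ("c4t6006048187870150525244353942d0", "second_lun_off_dmx-3"),
]

cmds = [f"zpool replace -f {POOL} {old} {new}" for old, new in migrations]
for cmd in cmds:
    print(cmd)
```

Because each mirror member is rebuilt independently, this is the same live, no-interruption path as replacing a failed drive, which is what makes Vahid's online migration possible.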