[zfs-discuss] Mount External USB cdrom on zfs
Dear support,

When I connect my external USB DVD-ROM to a SPARC machine running Solaris 10u6 with a ZFS-based file system, mounting it returns this error:

    bash-3.00# mount /dev/dsk/c1t0d0s0 /dvd/
    Jan 27 11:08:41 global ufs: NOTICE: mount: not a UFS magic number (0x0)
    mount: /dev/dsk/c1t0d0s0 is not this fstype
    bash-3.00# Jan 27 11:08:41 global ufs: [ID 717476 kern.notice] NOTICE: mount: not a UFS magic number (0x0

I can't mount it and can't see anything on it, but judging from the messages the system does detect it:

    Jan 27 10:52:08 global usba: [ID 349649 kern.info] Cypress Semiconductor USB2.0 Storage Device DEF10AF1F9AD
    Jan 27 10:52:08 global genunix: [ID 936769 kern.info] scsa2usb0 is /p...@1f,0/u...@a/stor...@1
    Jan 27 10:52:08 global genunix: [ID 408114 kern.info] /p...@1f,0/u...@a/stor...@1 (scsa2usb0) online
    Jan 27 10:52:09 global scsi: [ID 193665 kern.info] sd0 at scsa2usb0: target 0 lun 0
    Jan 27 10:52:09 global genunix: [ID 936769 kern.info] sd0 is /p...@1f,0/u...@a/stor...@1/d...@0,0
    Jan 27 10:52:15 global genunix: [ID 408114 kern.info] /p...@1f,0/u...@a/stor...@1/d...@0,0 (sd0) online

How can I mount an external USB CD-ROM on Solaris 10u6 with a ZFS-based file system?

Regards
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
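The mount fails because a pressed CD/DVD carries an ISO 9660 (hsfs) file system rather than UFS, so mount has to be told the fstype. Assuming vold is not already auto-mounting the disc, something like the following should work (the device and slice names are taken from the poster's output and may differ on your system; this is a sketch, not a verified fix):

```
# mount the disc read-only as hsfs; the slice may be s0 or s2
# depending on the media
mount -F hsfs -o ro /dev/dsk/c1t0d0s0 /dvd
```

If vold is running, the disc may instead appear automatically under /cdrom without any manual mount.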
[zfs-discuss] Unusual CIFS write bursts
While doing some performance testing on a pair of X4540s running snv_105, I noticed some odd behavior when using CIFS. I am copying a 6TB database file (yes, a single file) over our GigE network to the X4540, then snapshotting that data to the secondary X4540.

Writing that 6TB file can saturate our gigabit network, with about 95-100MB/sec going over the wire (can't ask for any more, really). However, the disk I/O on the X4540 appears unusual. I would expect the disks to be writing 95-100MB/sec continuously, but instead the system appears to buffer about 1GB of data before committing it to disk. This is in contrast to NFS write behavior: when I write a 1GB file to the NFS server from an NFS client, traffic on the wire correlates closely with the disk writes. For example, 60MB/sec on the wire via NFS triggers 60MB/sec on disk. A single file is involved in both cases.

I wouldn't have a problem with this "buffer" by itself. It seems to be a rolling 10-second buffer: if I copy several small files at lower speeds, the buffer still seems to "purge" after roughly 10 seconds, not when a certain size is reached. The problem is the amount of data that accumulates in the buffer. Writing 1GB out to disk can slow the system down substantially; all network traffic pauses or drops to mere kilobytes a second while this buffer is written. I would like to see smoother handling of this buffer, or a tunable to make the buffer write more often or fill quicker.

This is a 48TB unit with 64GB of RAM, and the arcstat Perl script reports my ARC is 55GB in size, with a near-0% miss rate on reads. Has anyone seen something similar, or know of any undocumented tunables to reduce the effects of this? Here is 'zpool iostat' output, in 1-second intervals, while this "write storm" occurs.
# zpool iostat pdxfilu01 1
                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pdxfilu01   2.09T  36.0T      1     61   143K  7.30M
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0     60      0  7.55M
pdxfilu01   2.09T  36.0T      0  1.70K      0   211M
pdxfilu01   2.09T  36.0T      0  2.56K      0   323M
pdxfilu01   2.09T  36.0T      0  2.97K      0   375M
pdxfilu01   2.09T  36.0T      0  3.15K      0   399M
pdxfilu01   2.09T  36.0T      0  2.22K      0   244M
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0      0      0      0
pdxfilu01   2.09T  36.0T      0      0      0      0

Here is my 'zpool status' output.

# zpool status
  pool: pdxfilu01
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pdxfilu01   ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c9t0d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c9t1d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c9t2d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0
            c9t3d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t4d0  ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c9t4d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t5d0  ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4t6d0  ONLINE       0     0     0
            c5t6d0  ONLINE       0     0     0
            c6t6d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0
            c9t6d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
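The ~10-second cadence described above matches the ZFS transaction group (txg) sync interval, so one experiment is to cap how much dirty data a txg may accumulate before being pushed to disk. This is an unsupported, Evil-Tuning-Guide-style knob, the exact tunable and its behavior vary by build, and the value below is purely illustrative:

```
* /etc/system fragment (assumption: build supports this tunable).
* Cap the per-txg write limit; value here (512MB, in bytes) is a
* guess for illustration only. Unsupported tuning - verify against
* your build before use, and remove if behavior worsens.
set zfs:zfs_write_limit_override = 0x20000000
```

A reboot is required for /etc/system changes to take effect, and the setting should be removed once a supported fix is available.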
[zfs-discuss] Replacing HDD in x4500
The vendor wanted to come in and replace an HDD in the 2nd X4500, as it was "constantly busy", and since our X4500 has always died miserably in the past when an HDD dies, they wanted to replace it before this one actually died.

The usual was done: HDD replaced, resilvering started and ran for about 50 minutes. Then the system hung, same as always; all ZFS-related commands just hang and do nothing. The system is otherwise fine and completely idle.

The vendor for some reason decided to fsck the root fs, not sure why as it is mounted with "logging", and also decided it would be best to do so from a CD-ROM boot. Anyway, that was 12 hours ago and the X4500 is still down. I think they have it at the single-user prompt, resilvering again. (I also noticed they'd decided to break the mirror of the root disks, for some very strange reason.) It still shows:

          raidz1            DEGRADED     0     0     0
            c0t1d0          ONLINE       0     0     0
            replacing       UNAVAIL      0     0     0  insufficient replicas
              c1t1d0s0/o    OFFLINE      0     0     0
              c1t1d0        UNAVAIL     0     0     0  cannot open

So I am pretty sure it'll hang again sometime soon. What is interesting, though, is that this is x4500-02, and all our previous troubles mailed to the list were regarding our first X4500. The hardware is all physically different, but identically configured. Solaris 10 5/08.

Anyway, I think they want to boot from CD-ROM to fsck root again for some reason, but since customers have been without their mail for 12 hours, they can go a little longer, I guess.

What I was really wondering: has there been any progress or patches regarding the system always hanging whenever an HDD dies (or, it seems, is replaced)? It really is rather frustrating.

Lund

--
Jorgen Lundman       | Unix Administrator | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)
Re: [zfs-discuss] [storage-discuss] AVS on opensolaris 2008.11
Richard Elling wrote:
> Jim Dunham wrote:
>> Ahmed,
>>
>>> The setup is not there anymore, however, I will share as much details
>>> as I have documented. Could you please post the commands you have used
>>> and any differences you think might be important. Did you ever test
>>> with 2008.11 ? instead of sxce ?
>>
>> Specific to the following:
>>
>>>> While we should be getting minimal performance hit (hopefully), we got
>>>> a big performance hit, disk throughput was reduced to almost 10% of
>>>> the normal rate.
>>
>> It looks like I need to test on OpenSolaris 2008.11, not Solaris
>> Express CE (b105), since this version does not have access to a
>> version of 'dd' with an oflag= setting.
>>
>> # dd if=/dev/zero of=/dev/zvol/rdsk/gold/xxVolNamexx oflag=dsync bs=256M count=10
>> dd: bad argument: "oflag=dsync"
>
> Congratulations! You've been bit by the gnu-compatibility feature!

Oh, that's what one calls it... a feature?

> SXCE and OpenSolaris have more than one version of dd. The difference
> is that OpenSolaris sets your default PATH to use /usr/gnu/bin/dd, which
> has the oflag option, while SXCE sets your default PATH to use
> /usr/bin/dd.

Thank you, Jim

> -- richard
>
>> Using a setting of 'oflag=dsync' will have performance implications.
>>
>> Also, there is an issue with an I/O of size bs=256M. SNDR's internal
>> architecture has an I/O unit chunk size of one bit per 32KB. Therefore,
>> when doing an I/O of 256MB, this results in the need to set 8192 bits,
>> 1024 bytes, or 1KB of data with 0xFF. Although testing with an I/O size
>> of 256MB is interesting, typical I/O tests are more like the following:
>> http://www.opensolaris.org/os/community/performance/filebench/quick_start/
>>
>> - Jim
Re: [zfs-discuss] [storage-discuss] AVS on opensolaris 2008.11
Jim Dunham wrote:
> Ahmed,
>
>> The setup is not there anymore, however, I will share as much details
>> as I have documented. Could you please post the commands you have used
>> and any differences you think might be important. Did you ever test
>> with 2008.11 ? instead of sxce ?
>
> Specific to the following:
>
>>> While we should be getting minimal performance hit (hopefully), we got
>>> a big performance hit, disk throughput was reduced to almost 10% of
>>> the normal rate.
>
> It looks like I need to test on OpenSolaris 2008.11, not Solaris
> Express CE (b105), since this version does not have access to a
> version of 'dd' with an oflag= setting.
>
> # dd if=/dev/zero of=/dev/zvol/rdsk/gold/xxVolNamexx oflag=dsync bs=256M count=10
> dd: bad argument: "oflag=dsync"

Congratulations! You've been bit by the gnu-compatibility feature!

SXCE and OpenSolaris have more than one version of dd. The difference is that OpenSolaris sets your default PATH to use /usr/gnu/bin/dd, which has the oflag option, while SXCE sets your default PATH to use /usr/bin/dd.

-- richard

> Using a setting of 'oflag=dsync' will have performance implications.
>
> Also, there is an issue with an I/O of size bs=256M. SNDR's internal
> architecture has an I/O unit chunk size of one bit per 32KB. Therefore,
> when doing an I/O of 256MB, this results in the need to set 8192 bits,
> 1024 bytes, or 1KB of data with 0xFF. Although testing with an I/O size
> of 256MB is interesting, typical I/O tests are more like the following:
> http://www.opensolaris.org/os/community/performance/filebench/quick_start/
>
> - Jim
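The PATH difference Richard describes can be sidestepped by picking the dd binary explicitly. A minimal sketch (the two paths are as described in this thread; the fallback logic itself is just an illustration):

```shell
#!/bin/sh
# Prefer the GNU dd (which understands oflag=) when it is installed,
# otherwise fall back to the classic Solaris /usr/bin/dd.
if [ -x /usr/gnu/bin/dd ]; then
    DD=/usr/gnu/bin/dd      # OpenSolaris default PATH resolves here
else
    DD=/usr/bin/dd          # SXCE default; no oflag= support
fi
echo "using $DD"
```

Then run the test as `"$DD" if=... of=... oflag=dsync ...` on either system, dropping oflag= when only the classic dd is available.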
Re: [zfs-discuss] [storage-discuss] AVS on opensolaris 2008.11
Ahmed,

> The setup is not there anymore, however, I will share as much details
> as I have documented. Could you please post the commands you have used
> and any differences you think might be important. Did you ever test
> with 2008.11 ? instead of sxce ?

Specific to the following:

>>> While we should be getting minimal performance hit (hopefully), we
>>> got a big performance hit, disk throughput was reduced to almost 10%
>>> of the normal rate.

It looks like I need to test on OpenSolaris 2008.11, not Solaris Express CE (b105), since this version does not have access to a version of 'dd' with an oflag= setting.

# dd if=/dev/zero of=/dev/zvol/rdsk/gold/xxVolNamexx oflag=dsync bs=256M count=10
dd: bad argument: "oflag=dsync"

Using a setting of 'oflag=dsync' will have performance implications.

Also, there is an issue with an I/O of size bs=256M. SNDR's internal architecture has an I/O unit chunk size of one bit per 32KB. Therefore, when doing an I/O of 256MB, this results in the need to set 8192 bits, 1024 bytes, or 1KB of data with 0xFF. Although testing with an I/O size of 256MB is interesting, typical I/O tests are more like the following:
http://www.opensolaris.org/os/community/performance/filebench/quick_start/

- Jim
Re: [zfs-discuss] how to fix zpool with corrupted disk?
> "js" == Jakov Sosic writes:
> "tt" == Toby Thain writes:

js> Yes but that will do the complete resilvering, and I just want
js> to fix the corrupted blocks... :)

tt> What you are asking for is impossible, since ZFS cannot know
tt> which blocks are corrupted without actually checking them

Yeah, of course you have to read every (occupied) block, but he's still not asking for something completely nonsensical. What if the good drive has a latent sector error in one of the blocks that hasn't been scribbled over on the bad drive? A scrub could heal that error, if not for the ``too many errors'' fault, while 'zpool replace' could not heal it.
Re: [zfs-discuss] zfs send -R slow
BJ Quinn wrote:
> That sounds like a great idea if I can get it to work--

What does?

> I get how to add a drive to a zfs mirror, but for the life of me I
> can't find out how to safely remove a drive from a mirror.

Have you tried "man zpool"? See the entry for detach.

> Also, if I do remove the drive from the mirror, then pop it back up in
> some unsuspecting (and unrelated) Solaris box, will it just see a drive
> with a pool on it and let me mount it up?

You should be able to import it, but I haven't tried.

> What about when I pop in the drive to be resilvered, but right before I
> add it back to the mirror, will Solaris get upset that I have two drives
> both with the same pool name?

No, you have to do a manual import.

-- Ian.
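As a sketch, the sequence discussed here would look like the following. Pool and device names are invented, and note that Ian himself says he has not tried the import step, so treat this as the thread's untested suggestion rather than a verified procedure:

```
# remove one side of the mirror (see zpool(1M), "detach")
zpool detach tank c2t3d0

# ... physically move the disk to the other Solaris box ...

# nothing mounts automatically; a manual import is required,
# and -f may be needed since the pool was never exported
zpool import -f tank
```

If the import of a detached disk turns out not to work, exporting a deliberately created single-disk pool is the safer (if slower) road.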
Re: [zfs-discuss] zfs send -R slow
That sounds like a great idea if I can get it to work--

I get how to add a drive to a ZFS mirror, but for the life of me I can't find out how to safely remove a drive from a mirror.

Also, if I do remove the drive from the mirror, then pop it back up in some unsuspecting (and unrelated) Solaris box, will it just see a drive with a pool on it and let me mount it up? What about when I pop in the drive to be resilvered, but right before I add it back to the mirror, will Solaris get upset that I have two drives both with the same pool name?

--
This message posted from opensolaris.org
Re: [zfs-discuss] how to fix zpool with corrupted disk?
On 26-Jan-09, at 6:21 PM, Jakov Sosic wrote:

>>> So I wonder now, how to fix this up? Why doesn't scrub overwrite
>>> bad data with good data from first disk?
>>
>> ZFS doesn't know why the errors occurred, the most likely scenario
>> would be a bad disk -- in which case you'd need to replace it.
>
> I know and understand that... But, what is then a limit for self-
> healing? 2 errors per vdev? 3 errors? 10 errors? before ZFS decides
> that vdev is irreparable...
>
>> You shouldn't need to attach/detach anything.
>> I think you're looking for 'zpool replace'.
>>    zpool replace tank c0d1s0
>
> Yes but that will do the complete resilvering, and I just want to
> fix the corrupted blocks... :)

What you are asking for is impossible, since ZFS cannot know which blocks are corrupted without actually checking them all (like a scrub). A resilver involves knowing that some set of blocks is out of date, but ZFS need not verify the rest.

--Toby
Re: [zfs-discuss] how to fix zpool with corrupted disk?
Jakov Sosic wrote:
> Hi guys!
>
> I'm doing a series of tests on ZFS before putting it into production on
> several machines, and I've come to a dead end. I have two disks in a
> mirror (rpool). Intentionally, I corrupt data on the second disk:
>
> # dd if=/dev/urandom of=/dev/rdsk/c0d1t0 bs=512 count=20480 seek=10240
>
> So, I've written 10MB of random data after the first 5MB of the hard
> drive. After sync and reboot, ZFS noticed the corruption, and then I ran
> zpool scrub rpool. After that, I've got this state:
>
> unknown# zpool status
>   pool: rpool
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error. An
>         attempt was made to correct the error. Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the
>         errors using 'zpool clear' or replace the device with 'zpool
>         replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub in progress for 0h0m, 5.64% done, 0h5m to go
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         rpool       DEGRADED     0     0     0
>           mirror    DEGRADED     0     0     0
>             c0d1s0  DEGRADED     0     0    26  too many errors
>             c0d0s0  ONLINE       0     0     0
>
> errors: No known data errors
>
> So I wonder now, how to fix this up? Why doesn't scrub overwrite bad
> data with good data from first disk?

The data is already fixed, which is why it says "errors: No known data errors".

> If I run zpool clear, it will only clear the error reports, and it
> won't fix them - I presume that because I don't understand the man page
> for that section clearly.
>
> So, how can I fix this disk, without detach/attach procedure

Be happy, the data is already fixed. The "DEGRADED" state is used when too many errors were found in a short period of time, which one would use as an indicator of a failing device. However, since the device has not actually failed, it is of no practical use in your test case.

-- richard
Re: [zfs-discuss] how to fix zpool with corrupted disk?
Looks like your scrub was not finished yet. Did you check it later? You should not have had to replace the disk. You might have to reinstall the bootblock.

--
This message posted from opensolaris.org
Re: [zfs-discuss] how to fix zpool with corrupted disk?
>> So I wonder now, how to fix this up? Why doesn't scrub overwrite bad
>> data with good data from first disk?
>
> ZFS doesn't know why the errors occurred, the most likely scenario
> would be a bad disk -- in which case you'd need to replace it.

I know and understand that... But what is then the limit for self-healing? 2 errors per vdev? 3 errors? 10 errors? before ZFS decides that the vdev is irreparable...

> You shouldn't need to attach/detach anything.
> I think you're looking for 'zpool replace'.
>    zpool replace tank c0d1s0

Yes, but that will do the complete resilvering, and I just want to fix the corrupted blocks... :)

--
This message posted from opensolaris.org
Re: [zfs-discuss] how to fix zpool with corrupted disk?
Jakov Sosic wrote:
> Hi guys!
>
> I'm doing a series of tests on ZFS before putting it into production on
> several machines, and I've come to a dead end. I have two disks in a
> mirror (rpool). Intentionally, I corrupt data on the second disk:
>
> # dd if=/dev/urandom of=/dev/rdsk/c0d1t0 bs=512 count=20480 seek=10240
>
> So, I've written 10MB of random data after the first 5MB of the hard
> drive. After sync and reboot, ZFS noticed the corruption, and then I ran
> zpool scrub rpool. After that, I've got this state:
>
> unknown# zpool status
>   pool: rpool
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error. An
>         attempt was made to correct the error. Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the
>         errors using 'zpool clear' or replace the device with 'zpool
>         replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub in progress for 0h0m, 5.64% done, 0h5m to go
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         rpool       DEGRADED     0     0     0
>           mirror    DEGRADED     0     0     0
>             c0d1s0  DEGRADED     0     0    26  too many errors
>             c0d0s0  ONLINE       0     0     0
>
> errors: No known data errors
>
> So I wonder now, how to fix this up? Why doesn't scrub overwrite bad
> data with good data from first disk?

ZFS doesn't know why the errors occurred; the most likely scenario would be a bad disk -- in which case you'd need to replace it.

> If I run zpool clear, it will only clear the error reports, and it
> won't fix them - I presume that because I don't understand the man page
> for that section clearly.

The admin guide is great to follow for these tests:
http://docs.sun.com/app/docs/doc/819-5461

> So, how can I fix this disk, without detach/attach procedure?

You shouldn't need to attach/detach anything. I think you're looking for 'zpool replace'.

    zpool replace tank c0d1s0

-Bryant
[zfs-discuss] how to fix zpool with corrupted disk?
Hi guys!

I'm doing a series of tests on ZFS before putting it into production on several machines, and I've come to a dead end. I have two disks in a mirror (rpool). Intentionally, I corrupt data on the second disk:

# dd if=/dev/urandom of=/dev/rdsk/c0d1t0 bs=512 count=20480 seek=10240

So, I've written 10MB of random data after the first 5MB of the hard drive. After a sync and reboot, ZFS noticed the corruption, and then I ran zpool scrub rpool. After that, I've got this state:

unknown# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress for 0h0m, 5.64% done, 0h5m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c0d1s0  DEGRADED     0     0    26  too many errors
            c0d0s0  ONLINE       0     0     0

errors: No known data errors

So I wonder now: how do I fix this up? Why doesn't scrub overwrite the bad data with good data from the first disk?

If I run zpool clear, it will only clear the error reports, and it won't fix them - I presume; that's because I don't understand the man page for that section clearly.

So, how can I fix this disk without the detach/attach procedure?

--
This message posted from opensolaris.org
Re: [zfs-discuss] thoughts on parallel backups, rsync, and send/receive
Richard Elling wrote:
> Ian Collins wrote:
>> One thing I have yet to do is find the optimum number of parallel
>> transfers when there are 100s of filesystems. I'm looking into making
>> this dynamic, based on throughput.
>
> I'm not convinced that a throughput throttle or metric will be
> meaningful. I believe this will need to be iop-based.

OK, I'll check. I was looking at adding jobs until the average send time declined.

>> Are you working with OpenSolaris? I still haven't managed to nail the
>> toxic streams problem in Solaris 10, which has curtailed my project.
>
> I am aware of the bug, but have not seen it. Murphy's Law says it won't
> happen until we roll into production :-(

How many file systems do you have? I hit the problem about 1 in 1500 send/receives. The last time was with a 1TB filesystem with about 600GB of snaps, so I couldn't attach it to the bug!

-- Ian.
Re: [zfs-discuss] [storage-discuss] AVS on opensolaris 2008.11
Hi Jim,

The setup is not there anymore; however, I will share as much detail as I have documented. Could you please post the commands you have used and any differences you think might be important? Did you ever test with 2008.11 instead of SXCE?

I will probably be testing again soon. Any tips or obvious errors are welcome :)

->8-
The Setup

* A 100G zvol has been set up on each node of an AVS replicating pair.
* A "ramdisk" has been set up on each node using:
      ramdiskadm -a ram1 10m
* The replication relationship has been set up using:
      sndradm -E pri /dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 sec /dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 ip async
* The AVS driver was configured not to log the disk bitmap to disk, but rather to keep it in kernel memory and write it to disk only upon machine shutdown. This is configured as such:
      # grep bitmap_mode /usr/kernel/drv/rdc.conf
      rdc_bitmap_mode=2;
* The replication was configured to be in logging mode:
      # sndradm -P
      /dev/zvol/rdsk/gold/myzvol <- pri:/dev/zvol/rdsk/gold/myzvol
      autosync: off, max q writes: 4096, max q fbas: 16384, async threads: 2, mode: async, state: logging

Testing was done with:

      dd if=/dev/zero of=/dev/zvol/rdsk/gold/xxVolNamexx oflag=dsync bs=256M count=10

* Option 'dsync' was chosen to try to avoid ZFS's aggressive caching. Moreover, a couple of runs were usually launched initially to fill the ZFS cache and force real writing to disk.
* Option 'bs=256M' was used in order to avoid the overhead of copying multiple small blocks to kernel memory before disk writes. A larger bs ensures maximum throughput. Smaller values were used without much difference, though.

The results on multiple runs:

      Non-replicated vol throughputs: 42.2, 52.8, 50.9 MB/s
      Replicated vol throughputs: 4.9, 5.5, 4.6 MB/s
-->8-

Regards

On Mon, Jan 26, 2009 at 1:22 AM, Jim Dunham wrote:
> Ahmed,
>
>> Thanks for your informative reply. I am involved with kristof
>> (original poster) in the setup, please allow me to reply below.
>>
>>> Was the following 'test' run during resynchronization mode or
>>> replication mode?
>>
>> Neither, testing was done while in logging mode. This was chosen to
>> simply avoid any network "issues" and to get the setup working as fast
>> as possible. The setup was created with:
>>
>> sndradm -E pri /dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 sec /dev/zvol/rdsk/gold/myzvol /dev/rramdisk/ram1 ip async
>>
>> Note that the logging disks are ramdisks, again trying to avoid disk
>> contention and get the fastest performance (reliability is not a
>> concern in this test). Before running the tests, this was the state:
>>
>> # sndradm -P
>> /dev/zvol/rdsk/gold/myzvol <- pri:/dev/zvol/rdsk/gold/myzvol
>> autosync: off, max q writes: 4096, max q fbas: 16384, async threads: 2, mode: async, state: logging
>>
>> While we should be getting a minimal performance hit (hopefully), we
>> got a big performance hit; disk throughput was reduced to almost 10%
>> of the normal rate.
>
> Is it possible to share information on your ZFS storage pool
> configuration, your testing tool, testing types and resulting data?
>
> I just downloaded Solaris Express CE (b105)
> http://opensolaris.org/os/downloads/sol_ex_dvd_1/, configured ZFS in
> various storage pool types, SNDR with and without RAM disks, and I do
> not see that disk throughput was reduced to almost 10% of the normal
> rate. Yes, there is some performance impact, but nowhere near the
> amount reported.
>
> There are various factors which could come into play here, but the most
> obvious reason that someone may see a serious performance degradation
> as reported is that, prior to SNDR being configured, the existing
> system under test was already maxed out on some system limitation, such
> as CPU and memory. I/O impact should not be a factor, given that a RAM
> disk is used. The addition of both SNDR and a RAM disk in the data
> path, regardless of how small their system cost is, will then have a
> profound impact on disk throughput.
>
> Jim
>
>> Please feel free to ask for any details, thanks for the help
>>
>> Regards
Re: [zfs-discuss] thoughts on parallel backups, rsync, and send/receive
Ahmed Kamal wrote:
> Did anyone share a script to send/recv a zfs filesystem tree in
> parallel, especially if a cap on concurrency can be specified?
> Richard, how fast were you taking those snapshots, and how fast were
> the syncs over the network? For example, assuming a snapshot every 10
> minutes, is it reasonable to expect to sync every snapshot as it is
> created every 10 minutes? What would be the limit when trying to lower
> those 10 minutes even more?

We were snapping every hour, with send/receive times on the order of 25 minutes. I do not believe there will be time to experiment with other combinations.

> Is it catastrophic if a second zfs send launches while an older one is
> still running?

I use a semaphore property to help avoid this, by design. That said, I have not tried to see if there is a lurking bug with ZFS receive that would need to be fixed if it cannot handle concurrent receives.

My send/receive script will incrementally copy from the latest common snapshot to the latest snapshot. For rsync, it will sync from the epoch to the latest snapshot.

-- richard
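On Ahmed's first question, a concurrency cap can be had with nothing more than xargs -P. This is an illustrative sketch only: the pool, snapshot, and host names are invented, and real use would need error handling and the incremental (-i) form:

```
# run at most 4 send|receive pipelines at once, one per filesystem
zfs list -H -o name -r tank | \
  xargs -n 1 -P 4 -I {} /bin/sh -c \
    'zfs send {}@backup | ssh backuphost zfs receive -dF tank2'
```

The -P 4 value is the tunable knob; per the discussion above, the right cap is workload-dependent and may be better chosen on IOPS than on throughput.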
Re: [zfs-discuss] thoughts on parallel backups, rsync, and send/receive
Ian Collins wrote:
> Richard Elling wrote:
>> Recently, I've been working on a project which had aggressive backup
>> requirements. I believe we solved the problem with parallelism. You
>> might consider doing the same. If you get time to do your own
>> experiments, please share your observations with the community.
>> http://richardelling.blogspot.com/2009/01/parallel-zfs-sendreceive.html
>
> You raise some interesting points about rsync getting bogged down over
> time. I have been working with a client with a requirement for
> replication between a number of hosts, and I have found doing several
> send/receives made quite an impact. What I haven't done is try this
> with the latest performance improvements in b105. Have you? My guess is
> the gain will be less.

Unfortunately, the rig was constrained to Solaris 10 10/08, so I don't have any data on this for OpenSolaris.

> One thing I have yet to do is find the optimum number of parallel
> transfers when there are 100s of filesystems. I'm looking into making
> this dynamic, based on throughput.

I'm not convinced that a throughput throttle or metric will be meaningful. I believe this will need to be iop-based.

> Are you working with OpenSolaris? I still haven't managed to nail the
> toxic streams problem in Solaris 10, which has curtailed my project.

I am aware of the bug, but have not seen it. Murphy's Law says it won't happen until we roll into production :-(

-- richard
[zfs-discuss] E2BIG
Hello all...

We are getting this error: "E2BIG - Arg list too long" when trying to send incremental backups (b89 -> b101). Do you know of any bugs related to that? I had a look in the archives and on Google but could not find anything. What I did find was something related to wrong (32-bit) timestamps, and a ZFS check in the code (zfs_vnops.c), but that error is EOVERFLOW...

Thanks a lot for your time!

--
This message posted from opensolaris.org
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
Eric D. Mudama writes: > On Tue, Jan 20 at 21:35, Eric D. Mudama wrote: > > On Tue, Jan 20 at 9:04, Richard Elling wrote: > >> > >> Yes. And I think there are many more use cases which are not > >> yet characterized. What we do know is that using an SSD for > >> the separate ZIL log works very well for a large number of cases. > >> It is not clear to me that the efforts to characterize a large > >> number of cases is worthwhile, when we can simply throw an SSD > >> at the problem and solve it. > >> -- richard > >> > > > > I think the issue is, like a previous poster discovered, there's not a > > lot of available data on exact performance changes of adding ZIL/L2ARC > > devices in a variety of workloads, so people wind up spending money > > and doing lots of trial and error, without clear expectations of > > whether their modifications are working or not. > > Sorry for that terrible last sentence, my brain is fried right now. > > I was trying to say that most people don't know what they're going to > get out of an SSD or other ZIL/L2ARC device ahead of time, since it > varies so much by workload, configuration, etc. and it's an expensive > problem to solve through trial an error since these > performance-improving devices are many times more expensive than the > raw SAS/SATA devices in the main pool. > I agree with you on the L2ARC front but not on the SSD for ZIL. We clearly expect 10X gain for lightly threaded workloads and that's a big satifyer because not everything happen with large amount of concurrency and some high value tasks do not. On the L2ARC the benefit are less direct because of the L1 ARC presence. The gains, if present will be of the similar nature with 8-10X gain to workloads that are lightly threaded and served from L2ARC vs disk. Note that it's possible to configurewhich (higher businessvalue) filesystems are allowed to install in the L2ARC. 
One dirty way to evaluate whether the L2ARC will be effective in your environment is to consider whether the last X GB of added memory had a positive impact on your performance metrics (does nailing down memory reduce performance?). If so, then on the graph of performance vs. caching you are still on a positive slope, and the L2ARC is likely to help. When the requests you care most about are served from caches, or when something else saturates (e.g. total CPU), then it's time to stop. -r > -- > Eric D. Mudama > edmud...@mail.bounceswoosh.org
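As a quick way to read the cache effectiveness discussed above, the ARC hit ratio can be computed from the arcstats kstats on Solaris. This is a minimal sketch; the sample numbers in `arcstats` are made up and stand in for live `kstat -p zfs:0:arcstats` output:

```shell
# Sketch: compute the ARC hit ratio from arcstats kstats.
arcstats() {
  # Real use: kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
  # Sample values below are illustrative assumptions.
  printf 'zfs:0:arcstats:hits\t982734\nzfs:0:arcstats:misses\t1201\n'
}
arcstats | awk '
  /hits/   { hits = $2 }   # hit counter
  /misses/ { miss = $2 }   # miss counter
  END      { printf "ARC hit ratio: %.1f%%\n", 100 * hits / (hits + miss) }
'
```

A ratio near 100% (as in the poster's "near 0% miss" case) suggests the working set already fits in the ARC and an L2ARC device would add little.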
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
Eric D. Mudama writes: > On Mon, Jan 19 at 23:14, Greg Mason wrote: > >So, what we're looking for is a way to improve performance, without > >disabling the ZIL, as it's my understanding that disabling the ZIL > >isn't exactly a safe thing to do. > > > >We're looking for the best way to improve performance, without > >sacrificing too much of the safety of the data. > > > >The current solution we are considering is disabling the cache > >flushing (as per a previous response in this thread), and adding one > >or two SSD log devices, as this is similar to the Sun storage > >appliances based on the Thor. Thoughts? > > In general principles, the evil tuning guide states that the ZIL > should be able to handle 10 seconds of expected synchronous write > workload. > > To me, this implies that it's improving burst behavior, but > potentially at the expense of sustained throughput, like would be > measured in benchmarking type runs. > > If you have a big JBOD array with say 8+ mirror vdevs on multiple > controllers, in theory, each VDEV can commit from 60-80MB/s to disk. > Unless you are attaching a separate ZIL device that can match the > aggregate throughput of that pool, wouldn't it just be better to have > the default behavior of the ZIL contents being inside the pool itself? > > The best practices guide states that the max ZIL device size should be > roughly 50% of main system memory, because that's approximately the > most data that can be in-flight at any given instant. > > "For a target throughput of X MB/sec and given that ZFS pushes > transaction groups every 5 seconds (and have 2 outstanding), we also > expect the ZIL to not grow beyond X MB/sec * 10 sec. So to service > 100MB/sec of synchronous writes, 1 GBytes of log device should be > sufficient." > > But, no comments are made on the performance requirements of the ZIL > device(s) relative to the main pool devices. 
Clicking around finds > this entry: > > http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on > > ...which appears to indicate cases where a significant number of ZILs > were required to match the bandwidth of just throwing them in the pool > itself. > > Big topic. Some write requests are synchronous and some are not; some start as non-synchronous and end up being synced. For non-synchronous loads, ZFS does not commit data to the slog. The presence of the slog is transparent and won't hinder performance. For synchronous loads, performance is normally governed by fewer threads committing more modest amounts of data; performance here is dominated by latency effects, not disk throughput, and this is where a slog greatly helps (10X). Now, you're right to point out that some workloads might end up synchronous while still managing large quantities of data. The Storage 7000 line was tweaked to handle some of those cases. So when committing more than, say, 10MB in a single operation, the first MB will go to the SSD but the rest will actually be sent to the main storage pool, with all these I/Os being issued concurrently. The latency response of a 1MB write to our SSD is expected to be similar to the response of regular disks. -r > --eric > > > -- > Eric D. Mudama > edmud...@mail.bounceswoosh.org
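The sizing rule quoted above (ZFS pushes a txg every 5 seconds with 2 outstanding, so the ZIL holds roughly 10 seconds of synchronous writes) is simple enough to compute directly. A back-of-envelope sketch, with an illustrative 100MB/sec target matching the guide's 1GB figure:

```shell
# Rough slog sizing: ~10 seconds of expected synchronous write load.
target_mbps=100   # expected sync write throughput, MB/s (assumption)
window=10         # seconds of in-flight txg data (2 x 5s txgs)
echo "slog needs ~$((target_mbps * window)) MB"
```

This says nothing about the latency or write bandwidth the slog device itself must sustain, which is the open question in the thread.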
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
Nicholas Lee writes: > Another option to look at is: > set zfs:zfs_nocacheflush=1 > http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide > > Best option is to get a fast ZIL log device. > > > Depends on your pool as well. NFS+ZFS means zfs will wait for write > completes before responding to sync NFS write ops. If you have a RAIDZ > array, writes will be slower than a RAID10 style pool. > Nicholas, Raid-Z requires more complexity in software, but the total amount of I/O to disk is less than raid-10. So the net performance effect is often in favor of Raid-10, but not necessarily so. -r
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
Greg Mason writes: > We're running into a performance problem with ZFS over NFS. When working > with many small files (i.e. unpacking a tar file with source code), a > Thor (over NFS) is about 4 times slower than our aging existing storage > solution, which isn't exactly speedy to begin with (17 minutes versus 3 > minutes). > > We took a rough stab in the dark, and started to examine whether or not > it was the ZIL. > > Performing IO tests locally on the Thor shows no real IO problems, but > running IO tests over NFS, specifically, with many smaller files we see > a significant performance hit. > > Just to rule in or out the ZIL as a factor, we disabled it, and ran the > test again. It completed in just under a minute, around 3 times faster > than our existing storage. This was more like it! > > Are there any tunables for the ZIL to try to speed things up? Or would > it be best to look into using a high-speed SSD for the log device? > > And, yes, I already know that turning off the ZIL is a Really Bad Idea. > We do, however, need to provide our users with a certain level of > performance, and what we've got with the ZIL on the pool is completely > unacceptable. > > Thanks for any pointers you may have... > I think you found out from the replies that this NFS issue is not related to ZFS, nor to a ZIL malfunction in any way. http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine NFS (particularly a lightly threaded load) is much sped up with any form of SSD|NVRAM storage, and that's independent of the backing filesystem used (provided the filesystem is safe). For ZFS, the best way to achieve NFS performance for lightly threaded loads is to have a separate intent log on a low-latency device, such as in the 7000 line.
-r > -- > > Greg Mason > Systems Administrator > Michigan State University > High Performance Computing Center
Re: [zfs-discuss] thoughts on parallel backups, rsync, and send/receive
Did anyone share a script to send/recv a tree of zfs filesystems in parallel, especially one where a cap on concurrency can be specified? Richard, how fast were you taking those snapshots, and how fast were the syncs over the network? For example, assuming a snapshot every 10 minutes, is it reasonable to expect to sync every snapshot as it's created, every 10 minutes? What would be the limit when trying to lower those 10 minutes even more? Is it catastrophic if a second zfs send launches while an older one is still running? Regards On Mon, Jan 26, 2009 at 9:16 AM, Ian Collins wrote: > Richard Elling wrote: >> Recently, I've been working on a project which had aggressive backup >> requirements. I believe we solved the problem with parallelism. You >> might consider doing the same. If you get time to do your own experiments, >> please share your observations with the community. >> http://richardelling.blogspot.com/2009/01/parallel-zfs-sendreceive.html >> > > You raise some interesting points about rsync getting bogged down over > time. I have been working with a client with a requirement for > replication between a number of hosts and I have found doing several > send/receives made quite an impact. What I haven't done is try this > with the latest performance improvements in b105. Have you? My guess > is the gain will be less. > > One thing I have yet to do is find the optimum number of parallel > transfers when there are 100s of filesystems. I'm looking into making > this dynamic, based on throughput. > > Are you working with OpenSolaris? I still haven't managed to nail the > toxic streams problem in Solaris 10, which has curtailed my project. > > -- > Ian.
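For anyone wanting to experiment, the capped fan-out asked about above can be sketched with xargs -P. This is an untested sketch, not a production replicator: the pool, snapshot, and host names (tank, nightly, backuphost) are placeholder assumptions, and the real send/receive pipeline is left commented out in favor of an echo so the fan-out can be previewed safely.

```shell
#!/bin/sh
# Sketch: run one zfs send per filesystem, at most $JOBS at a time.
JOBS=${JOBS:-4}        # concurrency cap
SNAP=${SNAP:-nightly}  # snapshot name (assumption)

filesystems() {
  # Real use: zfs list -H -r -o name tank
  printf '%s\n' tank/home tank/var tank/db
}

# xargs -P caps the number of concurrent workers; -I{} substitutes
# the filesystem name into each worker's command line.
filesystems | xargs -n 1 -P "$JOBS" -I{} sh -c '
  echo "replicating {}@'"$SNAP"'"
  # zfs send {}@'"$SNAP"' | ssh backuphost zfs receive -dF backup
'
```

Note that -P is a GNU/BSD xargs extension rather than POSIX, and output lines from concurrent workers may interleave in any order.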