[zfs-discuss] Re: Re: ZFS Support for remote mirroring
To clarify further: the EMC note "EMC Host Connectivity Guide for Solaris" indicates that ZFS is supported on 11/06 (aka Update 3) and onwards. However, they sneak in a cautionary disclaimer that the snapshot and clone features are supported by Sun. If one reads it carefully it appears that they do support ZFS (not that they should care which filesystem is on their disks) but want to make a big deal of it by inserting this superfluous comment. Jealousy, I suppose; I don't see a comparable disclaimer against VxFS, for example.
[zfs-discuss] Re: Extremely long ZFS destroy operations
I've since stopped making the second clone, having realized that .zfs/snapshot/ still exists after the clone operation is completed, so my need for local access is met by going directly to the snapshot. However, the poor performance of the destroy is still a real issue, and it is quite possible that we will create another clone for reasons beyond my original one. Why is the destroy so slow with the second clone in play? Thanks.
[zfs-discuss] Re: ZFS Support for remote mirroring
For whatever reason, EMC notes (on PowerLink) suggest that ZFS is not supported on their arrays. If you are going to use a ZFS filesystem on top of an EMC array, be warned about support issues.
[zfs-discuss] Extremely long ZFS destroy operations
We have Solaris 10 Update 3 (aka 11/06) running on an E2900 (24 x 96). On this server we run a large SAS environment totalling well over 2TB. We also take daily snapshots of the filesystems and clone them for use by a local zone. This setup has been in use for well over 6 months. Starting Monday I began making a second clone from the same snapshot to give the global zone quick access to a day-old image of the data. I've noticed that my ZFS destroy operations are inordinately long with the second clone in place (I'm using zfs destroy -Rf ). The degradation is close to an order of magnitude; my destroys now take 6-7 minutes while they took under a minute in the past. Any thoughts? Thanks.
[zfs-discuss] Re: ZFS performance with Oracle
I'm sorry dude, I can't make head or tail of your post. What is your point?
[zfs-discuss] Re: Re: Need help making lsof work with ZFS
I think so. After all, there are features shipped which are not fully baked/guaranteed, like send/receive. Isn't shipping the header files better than letting developers guess their structure and possibly make mistakes? Of course a developer can compile against the OpenSolaris source, but it is far easier to compile against the shipped version from Sun. In this age of FOSS a lot of people expect to download and compile the source, and that won't be possible for tools that interact with ZFS, right?
[zfs-discuss] Re: Need help making lsof work with ZFS
I did find zfs.h and libzfs.h (thanks Eric). However, when I try to compile the latest version (4.87C) of lsof it reports the following files missing: dmu.h, zfs_acl.h, zfs_debug.h, zfs_rlock.h, zil.h, spa.h, zfs_context.h, zfs_dir.h, zfs_vfsops.h, zio.h, txg.h, zfs_ctldir.h, zfs_ioctl.h, zfs_znode.h, zio_impl.h. I looked on my server, which has the full cluster of Solaris 10 Update 2, and can't find these files. Thanks.
[zfs-discuss] Need help making lsof work with ZFS
I contacted the author of 'lsof' regarding the missing ZFS support. The command works but fails to display any files that a process has open in a ZFS filesystem. He indicates that the required ZFS kernel structure definitions (header files) are not shipped with the OS. He further indicated that he rummaged through the OpenSolaris source tree and the files there don't match either Solaris 10 Update 2 or 3. Can one of the ZFS maestros point me in the direction of where these files can be found? I find it hard to believe that the header files are not shipped with the OS. Thanks; any help will be appreciated, since y'all will agree that 'lsof' is an invaluable tool. The sooner it is available, the better for ZFS users.
[zfs-discuss] Re: Disk Failure Rates and Error Rates -- ( Off topic: Jim Gray lost at sea)
Here's another website working on his rescue; my prayers are for a safe return of this CS icon. http://www.helpfindjim.com/
[zfs-discuss] Re: Re: ZFS or UFS - what to do?
Agreed, I guess I didn't articulate my point/thought very well. The best config is to present JBODs and let ZFS provide the data protection. This has been a very stimulating conversation thread; it is shedding new light on how to best use ZFS.
[zfs-discuss] Re: Re: Re: ZFS or UFS - what to do?
You're right that storage-level snapshots are filesystem agnostic. I'm not sure why you believe you won't be able to restore individual files from a NetApp snapshot. In the case of ZFS you'd take a periodic snapshot and use it to restore files; in the case of NetApp you can do the same (of course you have the additional step of mounting the new snapshot volume). Is this convenience what is tipping the scales for you towards ZFS?
[zfs-discuss] Re: Re: ZFS or UFS - what to do?
I'm not sure what benefit you foresee by running a COW filesystem (ZFS) on a COW array (NetApp). Back to regularly scheduled programming: I still say you should let ZFS manage JBOD-type storage. I can personally recount the horror of relying upon an intelligent storage array (an EMC DMX3500 in our case). We had in-flight data corruption that EMC faithfully wrote to disk, just as NetApp would in your case. Everybody assumes that corruption or data loss occurs only on disks; it can happen anywhere. In a datacenter SAN you have so many more paths that can introduce data corruption, hence the need to ensure data integrity closest to the use of the data, namely with ZFS. ZFS will not stop alpha-particle-induced memory corruption after data has been received by the server and verified to be correct; sadly I've been hit with that as well.
[zfs-discuss] Re: ZFS or UFS - what to do?
I've used ZFS since July/August 2006 when Sol 10 Update 2 came out (the first release to integrate ZFS). I've used it extensively on three servers (an E25K domain and two E2900s); two of them are production. I've had over 3TB of storage from an EMC SAN under ZFS management for no less than 6 months. Like your configuration, we've deferred data redundancy to the SAN. My observations are:

1. ZFS is stable to a very large extent. There are two known issues that I'm aware of:
   a. You can end up in an endless 'reboot' cycle when you have a corrupt zpool. I came across this when I had data corruption due to an HBA mismatch with the EMC SAN. The mismatch injected corruption in transit and the EMC faithfully wrote the bad data; upon reading it back, ZFS threw up all over the floor for that pool. There is a documented workaround to snap out of the 'reboot' cycle; I've not checked whether this is fixed in 11/06 (Update 3).
   b. Your server will hang when one of the underlying disks disappears. In our case we had a T2000 running 11/06 with a mirrored zpool across two internal drives. When we pulled one of the drives abruptly the server simply hung. I believe this is a known bug; workaround?
2. When you have I/O operations that either request fsync or open files with the O_DSYNC option, coupled with high I/O, ZFS will choke. It won't crash, but filesystem I/O runs like molasses on a cold morning.

All my feedback is based on Solaris 10 Update 2 (aka 06/06) and I've no comments on NFS. I strongly recommend that you use ZFS data redundancy (raidz, raidz2, or mirror) and simply let the Engenio stripe the data for performance.
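If you do go that route, a rough sketch of what I mean (device names are placeholders for whatever LUNs the Engenio presents):

  # let the array stripe for performance, let ZFS own the redundancy
  zpool create dbpool raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0
  zpool status dbpool

With raidz2 (or a mirror) in place ZFS can repair, not just detect, the kind of silent corruption I described above.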
[zfs-discuss] Re: Converting home directory from ufs to zfs
No such facility exists to automagically convert an existing UFS filesystem to ZFS. You have to create a new ZFS pool/filesystem and then move your data.
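For what it's worth, the manual route is short. A rough sketch (the device name, pool name, and paths are placeholders; quiesce and unmount the old UFS home before the final step):

  zpool create homepool c1t2d0
  zfs create homepool/home
  cd /export/home && find . | cpio -pdmu /homepool/home
  # or: ufsdump 0f - /dev/rdsk/c0t0d0s7 | (cd /homepool/home && ufsrestore rf -)
  zfs set mountpoint=/export/home homepool/home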
[zfs-discuss] Re: Can you turn on zfs compression when the fs is already populated?
I've used the compression feature for quite a while and you can flip it back and forth without any problem. When you turn compression on, nothing happens to the existing data. However, as you start updating your files, all newly written blocks will be compressed, so it is possible for a file to be composed of both compressed and uncompressed blocks!
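As a quick illustration (using one of my filesystems as an example; substitute your own dataset):

  zfs set compression=on mtdc/u001      # existing blocks stay uncompressed
  zfs get compression,compressratio mtdc/u001

The compressratio will creep up over time as old data gets rewritten.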
[zfs-discuss] Re: How much do we really want zpool remove?
I can vouch for this situation. I had to go through a long maintenance window to accomplish the following: with 50 x 64GB drives in a zpool, I needed to separate 15 of them out due to performance issues; there was no need to increase storage capacity. Because I couldn't yank 15 drives from the existing pool to create a UFS filesystem, I had to evacuate the entire 50-disk pool, create a new pool plus the UFS filesystem, and then repopulate the filesystems. I think this feature will add to the adoption rate of ZFS. However, I feel that it shouldn't be at the top of the 'to-do' list; I'll trade this feature for some of the performance enhancements that've been discussed on this group.
[zfs-discuss] Re: Re: Heavy writes freezing system
I did some straight up Oracle/ZFS testing but not on Zvols. I'll give it a shot and report back, next week is the earliest.
[zfs-discuss] Re: Heavy writes freezing system
Bag-o-tricks-r-us, I suggest the following in such a case (see the sketch after this list):

- Two ZFS pools: one for Production, one for Education.
- Isolate the LUNs feeding the pools if possible; don't share spindles. Remember that on EMC/Hitachi you have logical LUNs created by striping/concatenating carved-up physical disks, so you could have two LUNs that share the same spindle. Don't believe one word from your storage admin about "we have lots of cache to abstract the physical structure"; Oracle can push any storage sub-system over the edge. Almost all of the storage vendors prevent one LUN from flooding the cache with writes; EMC gives no more than 8x the initial allocation of cache (total cache/total disk space) and after that it will stall your writes until destage is complete.
- At least two ZFS filesystems under the Production pool:
  - One for online redo logs and control files. If need be you can further segregate them onto two separate ZFS filesystems.
  - One for db files. If need be you can isolate further by data, index, temp, archived redo, ...
- Don't host 'temp' on ZFS; just feed it plain old UFS or raw disk.
- Match your ZFS recordsize to your DB blocksize * multiblock read count. Don't do this for the index filesystem, just the filesystem hosting data.

Rinse and repeat for your Education ZFS pool. This will give you substantial isolation and improvement, sufficient to buy you time to plan out a better deployment strategy given that you're under the gun now. Another thought: while ZFS works out its kinks, why not use BCV or ShadowCopy or whatever IBM calls it to create the Education instance? That will reduce a tremendous amount of I/O. Just this past weekend I re-did our SAS server to relocate [b]just[/b] the SAS work area to good ol' UFS and the payback is tremendous; not one complaint about performance 3 days in a row (we used to hear daily complaints). By taking care of your online redo logs and control files (maybe skipping ZFS for them altogether and running them on UFS) you'll breathe easier. BTW, I'm curious: what application using Oracle is creating more than a million files?
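A rough sketch of the layout I'm describing (pool name, device names, and the 128k value are placeholders; 128k assumes an 8k db_block_size with a multiblock read count of 16):

  zpool create prodpool c3t0d0 c3t1d0 c3t2d0 c3t3d0
  zfs create prodpool/redo        # online redo logs + control files, nothing else
  zfs create prodpool/data        # datafiles
  zfs create prodpool/index
  zfs set recordsize=128k prodpool/data

Set up the Education pool the same way on its own LUNs.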
[zfs-discuss] Re: Re: Heavy writes freezing system
Bug 6413510 is the root cause. ZFS maestros please correct me if I'm quoting an incorrect bug.
[zfs-discuss] Re: Extremely poor ZFS perf and other observations
U3 is under consideration; we're going through some rudimentary testing of the update. I ran the following command: gtar cf , i.e. I was just creating a tar file. The remote copying was done as follows: scp -c arcfour . [EMAIL PROTECTED]:/ BTW, the reverse operation of repopulating my FS (by untarring the local tar file) was extremely slow; methinks it was 2x slower, I averaged 5MB/S. Let me run some more experiments before I conclusively say the problems are related to compression. Initial observations suggest that it is, for the following reasons: I ran 4 parallel gtar sessions reading from 4 ZFS filesystems with compression on, writing to a new ZFS filesystem with compression on. My aggregate I/O never changed whether I ran 1 stream or 4 streams; it never exceeded around 20MB/S. My I/O sub-system was idling most of the time, with sub-10ms write response times per 'iostat'. If compression is not the issue, what else can explain the magical 20MB/S ceiling no matter how many write streams I had going?
[zfs-discuss] Re: Heavy writes freezing system
You're probably hitting the same wall/bug that I came across: ZFS in all versions up to and including Sol10U3 generates excessive I/O when it encounters fsync or when any of the files were opened with the O_DSYNC option. I do believe Oracle (or any DB for that matter) opens its files with O_DSYNC. During normal times this results in excessive I/O but is probably well under your system capacity (it was in our case). But when you are doing backups or clones (Oracle clones by using RMAN or copying of db files?) you are going to flood the I/O sub-system, and that's when the excessive ZFS I/O starts to put a hurt on DB performance. Here are a few suggestions that can give you interim relief:

- Segregate your I/O at the filesystem level; the bug is at the filesystem level, not the ZFS pool level. By this I mean ensure the online redo logs are in a ZFS FS that nobody else uses, and the same for control files. As long as the writes to the control files and online redo logs are met, your system will be happy.
- Ensure that your clone and RMAN (if you're going to disk) write to a separate ZFS FS that contains no production files.
- If the above two items don't give you relief, then relocate the online redo logs and control files to a UFS filesystem. No need to downgrade the entire ZFS setup to something else.
- Consider Oracle ASM (DB version permitting); it works very well. Why deal with VxFS?

Feel free to drop me a line; I've over 17 years of Oracle DB experience and love to troubleshoot problems like this. I've another vested interest: we're considering ZFS for widespread use in our environment and any experience is good for us.
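If you want to confirm the synchronous-write theory before moving anything, a couple of quick checks (run as root; the PID is a placeholder for one of your Oracle shadow/background processes):

  # count fdsync (fsync) calls per process over 10 seconds
  dtrace -n 'syscall::fdsync:entry { @[execname] = count(); } tick-10s { exit(0); }'
  # watch a specific process for O_DSYNC opens
  truss -t open,open64 -p <pid>

If the fdsync counts jump during your backup/clone window, that lines up with the behavior I saw.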
[zfs-discuss] Extremely poor ZFS perf and other observations
I'm observing the following behavior in our environment (Sol10U2, E2900, 24x96, 2x2Gbps, ...):
- I've a compressed ZFS filesystem where I'm creating a large tar file. I notice that the tar process is running fine (accumulating CPU, truss shows writes, ...) but for whatever reason the timestamp on the file doesn't change, nor does the file size. The same is true for 'zpool list' output; the usage numbers don't change for minutes at a time.
- I started a tar job writing to the compressed ZFS filesystem, reading from another compressed ZFS filesystem. At the same time I started copying files from another ZFS filesystem (same pool & same attributes) to a remote server (GigE connection) using scp, writing to a UFS filesystem. [b]Guess what? My scp over the wire beat the pants off the local ZFS tar session writing to a 2x2Gbps SAN and EMC disks![/b]
[b]I'm beginning to develop serious reservations about ZFS performance, especially with the compress feature turned on.[/b]
[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
I've some important information that should shed some light on this behavior. This evening I created a new filesystem across the very same 50 disks, including the COMPRESS attribute. My goal was to isolate some workload to the new filesystem, and I started moving a 100GB directory tree over to the new FS. While copying I was averaging around 25MB read and 25MB write, as expected. [b]Then I opened 'vi' and wrote out a new file in the new filesystem, and what I saw was shocking: my reads remained the same but my writes shot up to the 150+MB/S range. This abnormal I/O pattern continued until 'vi' returned from the write request.[/b] Here is the 'zpool iostat mtdc 30' output:

              capacity       operations       bandwidth
pool        used   avail    read   write    read    write
----------  -----  ------  ------  ------  ------  ------
mtdc         806G   2.48T      38     173   1.93M   7.52M
mtdc         806G   2.48T     188     228   15.0M   8.78M
mtdc         807G   2.48T     266     624   14.0M   16.5M
mtdc         807G   2.48T     286     670   17.1M   14.5M
mtdc         807G   2.48T     293   1.21K   18.2M   98.4M  <<-- vi activity, note mismatch in r/w rates
mtdc         808G   2.48T     457     560   35.5M   24.2M
mtdc         809G   2.48T     405     504   31.7M   26.3M
mtdc         809G   2.48T     328   1.37K   25.2M    152M  <<-- vi activity, note mismatch in r/w rates
mtdc         810G   2.48T     428     671   33.0M   48.0M
mtdc         811G   2.48T     463     500   35.9M   26.4M
mtdc         811G   2.48T     207   1.39K   16.5M    154M  <<-- vi activity, note mismatch in r/w rates
mtdc         812G   2.48T     310     878   23.9M   77.7M
mtdc         813G   2.48T     362     494   26.1M   25.3M
mtdc         813G   2.48T     381   1.05K   26.8M    103M
mtdc         814G   2.48T     347   1.33K   25.0M    135M
mtdc         815G   2.48T     288   1.38K   21.7M    150M
mtdc         815G   2.48T     425     513   32.7M   25.8M
mtdc         816G   2.47T     413     515   30.2M   25.1M
mtdc         817G   2.47T     341     512   21.9M   25.1M
mtdc         818G   2.47T     293     529   18.5M   25.5M
mtdc         818G   2.47T     344     508   23.4M   24.7M
mtdc         819G   2.47T     442     512   33.4M   24.1M
mtdc         820G   2.47T     385     483   28.3M   24.4M
mtdc         820G   2.47T     372     483   24.7M   24.7M
mtdc         821G   2.47T     347     535   23.0M   24.2M
mtdc         821G   2.47T     290     497   17.9M   24.9M
mtdc         823G   2.47T     349     517   20.0M   24.1M
mtdc         823G   2.47T     399     512   21.2M   24.5M
mtdc         824G   2.47T     383     612   19.3M   17.7M
mtdc         824G   2.47T     390     614   14.2M   17.5M
[zfs-discuss] Re: ZFS reference
We've been using ZFS for at least 3 months in a production environment. Not only are we using the basic functionality, but we use the snapshot/cloning feature heavily along with Zones. We're running the Solaris 10 Update 2 (aka 06/06) version and are going to Update 3 shortly. Our diskspace is large (> 3TB) and the experience has been positive. Are there a few kinks? Yes, but none that stopped us from using it. I'd encourage you to use it; methinks it is ready for production use (as Joyent likes to say, "Fsck you, if you think ZFS isn't production"). Good luck.
[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
I'll see if I can confirm what you are suggesting. Thanks.
[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
Quick update: since my original post I've confirmed via DTrace (the rwtop script in the DTraceToolkit) that the application is not generating 150MB/S * compressratio of I/O. What, then, is causing this much I/O on our system?
[zfs-discuss] Puzzling ZFS behavior with COMPRESS option
Our setup:
- E2900 (24 x 96); Solaris 10 Update 2 (aka 06/06)
- 2 x 2Gbps FC HBA
- EMC DMX storage
- 50 x 64GB LUNs configured in 1 ZFS pool
- Many filesystems created with COMPRESS enabled; specifically I've one that is 768GB

I'm observing the following puzzling behavior:
- We are currently creating a large (>1.4TB) and sparse dataset; most of the dataset contains repeating blanks (default/standard SAS dataset behavior).
- ls -l reports the file size as 1.4+TB and du -sk reports the actual on-disk usage at around 65GB.
- My I/O on the system is pegged at 150+MB/S as reported by zpool iostat, and I've confirmed the same with iostat. This is very confusing.
- ZFS is doing very good compression, as reported by the ratio of on-disk versus reported size of the file (1.4TB vs 65GB).
- [b]Why on God's green earth am I observing such high I/O when indeed ZFS is compressing?[/b] I can't believe that the program is actually generating I/O at the rate of (150MB/S * compressratio). Any thoughts?
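For anyone who wants to see the same effect, the apparent-versus-on-disk gap is easy to check (the file name here is a placeholder for one of the SAS datasets):

  ls -l /u099/bigtable.sas7bdat      # apparent (uncompressed) size
  du -sk /u099/bigtable.sas7bdat     # blocks actually allocated on disk
  zfs get compressratio mtdc/u099

The puzzle is why the physical write rate reported by iostat/zpool iostat is so much higher than the compressed data would suggest.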
[zfs-discuss] Re: ZFS behavior under heavy load (I/O that is)
Thanks, I just downloaded Update 3 and hopefully the problem will go away.
[zfs-discuss] ZFS behavior under heavy load (I/O that is)
I'm observing the following behavior on our E2900 (24 x 92 config, 2 FCs, ...). I've a large filesystem (~758GB) with compression on. When this filesystem is under heavy load (>150MB/S) I have problems saving files in 'vi'. I posted here about it and recall that the issue is addressed in Sol10U3. This morning I observed another variation of this problem:
- Create a file in 'vi' and save it; the session will hang as if it is waiting for the write to complete.
- In another session you'll observe the write from 'vi' is indeed complete, as evidenced by the contents of the file.
Am I repeating myself here, or is it a different problem altogether?
[zfs-discuss] Performance problems during 'destroy' (and bizarre Zone problem as well)
[b]Setting:[/b] We've been operating in the following setup for well over 60 days.
- E2900 (24 x 92)
- 2 x 2Gbps FC to EMC SAN
- Solaris 10 Update 2 (06/06)
- ZFS with compression turned on
- Global zone + 1 local zone (sparse)
- Local zone is fed ZFS clones from the global zone

[b]Daily Routine:[/b]
- Shut down the local zone
- Recreate the ZFS clones
- Restart the local zone
- End-to-end timing for this refresh is anywhere between 5 and 30 minutes. The bulk of the time is spent in the ZFS 'destroy' phase.

[b]Problem:[/b]
- We had extensive read/write activity in the global and local zones yesterday. I estimate that we wrote 1/4 of one large ZFS filesystem, ~160GB of writes.
- This morning we had a fair amount of activity on the system when the refresh started; zpool was reporting around 150MB/S of writes.
- Our 'zfs destroy' commands took what I consider 'normal'; the FS that was fielding the bulk of the I/O took 15 minutes. During this time everything was crawling or, more accurately, came to a dead stop. A simple 'rm' would hang. I've reported this problem to the forum in the past. I also believe the fix for the problem is in Update 3 for Solaris 10, right?
- [b]Surprisingly, today the ZFS 'snapshot & clone' took an inordinate amount of time. I observed each snapshot & clone together took 10+ minutes. In the past the same activity has taken no more than a few seconds, even during busy times. The total end-to-end timing for all snapshots/clones was a whopping 1:44:00!!![/b]
- Even more surprising was that the local zone refused to start up (zoneadm -z bluenile boot), with no error messages.
- I was able to start the zone only an hour or so after the completion of the ZFS commands.

[b]Questions:[/b]
- Why is the destroy phase taking so long?
- What can explain the unduly long snapshot/clone times?
- Why didn't the zone start up?
- More surprisingly, why did the zone start up after an hour?
Thanks in advance.
[zfs-discuss] Re: Re: Production ZFS Server Death (06/06)
Glad it worked for you. I suspect in your case the corruption happened far down in the tree and you could get around it by pruning the tree (rm the file) below the point of corruption. I suspect this could be due to very localized corruption, like an alpha-particle problem where a bit was flipped on the platter or in the cache of the storage sub-system before destaging to disk. In our case the problem was pervasive because it affected our data path (FC). [b]You do raise a very, very valid point.[/b] It'd be nice if ZFS provided better diagnostics, namely identifying where exactly in the tree it found corruption. At that point we can determine whether the remedy is to contain the damage (similar to fsck discarding all suspect inodes) and continue. For example, I've a very high regard for the space management in the Oracle DB. When it finds a bad block it prints out the address of the block and marks it corrupt. [b]It doesn't put the whole file/tablespace/table/index in 'suspect' mode like ZFS.[/b] The DBA can then either drop the table/index that contains the bad block or extract the data from the table minus the bad block. Oracle handles it very gracefully, giving the user/DBA a chance to recover the known good data. For ZFS to achieve wide acceptance we [b]must[/b] have the ability to pinpoint the problem area and take remedial action (rm, for example), not simply give up. Yes, there are times when the corruption affects a block high up in the chain, making the situation hopeless; in such a case we'd have to discard and restart. ZFS has now solved one part of the problem, namely identifying bad data and doing it reliably, and it provides resiliency in the form of RAID-Z(2) and RAID-1. For it to realize its full potential it must also provide tools to discard corrupt parts (branches) of the tree and give us a chance to save the remaining data. We won't always have the luxury of rebuilding the pool and restoring in a production environment. Easier said than done, methinks. Good night.
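To be fair, there is at least a starting point today; I believe (someone correct me if this isn't in 06/06) that the verbose status output will list the objects with unrecoverable errors:

  zpool status -v mtdc

What I'm asking for is the next step: a supported way to prune just those objects and move on.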
[zfs-discuss] Re: Production ZFS Server Death (06/06)
Oh my, one day after I posted my horror story another one strikes. This is validation of the design objectives of ZFS; it looks like this type of thing happens more often than not. In the past we'd have just attributed this type of problem to some application-induced corruption; now ZFS is pinning the problem squarely on the storage sub-system. If you didn't configure any ZFS redundancy then your data is done for, as the support person indicated. Make sure you follow the instructions in the ZFS FAQ, otherwise your server will end up in an endless 'panic-reboot cycle'. Don't shoot the messenger (ZFS); consider running diagnostics on your storage sub-system. Good luck.
[zfs-discuss] Another win for ZFS
Today ZFS proved its mettle at our site. We've a set of Sun servers (a 25K and 2900s) that are all connected to a DMX3500 via a SAN. Different servers use the storage differently: some of it was configured with ZFS, some as UFS filesystems, and some more was used in 'raw' form by Oracle ASM. In all cases there was no mirroring or protection at the server level; we had delegated that function to the DMX3500. This decision came back to haunt us this morning. One of the 25K domains panicked and ended up in the 'endless panic-reboot cycle'. As it turns out, our trusted SAN was silently corrupting data due to a bad/flaky FC port in the switch. The DMX3500 faithfully wrote the bad data and returned normal ACKs back to the servers, so none of them reported storage problems. ZFS was the first to pick up on the silent corruption this morning. We're still grateful for ZFS even though it put the server in the 'endless panic-reboot cycle', which we fixed by following the ZFS FAQ; it'd have been nicer if that bug were not present. Our data grows rapidly, and the earlier we know of corruption, the shorter the rebuild/restore cycle. [b]Note to self: use RAID-Z or RAID-1 in ZFS next time around.[/b]
[zfs-discuss] Re: Configuring a 3510 for ZFS
Thanks for the stimulating exchange of ideas/thoughts. I've always been a believer in letting software do my RAID functions; for example, in the old days of VxVM I always preferred to do mirroring at the software level. It is my belief that there is more 'meta' information available at the OS level than at the storage level for software to make intelligent decisions; dynamic recordsize in ZFS is one example. Any thoughts on the following approach?
1. Configure the 3511 to present multiple LUNs (mirrored internally) to the OS.
2. Lay down a ZFS pool/filesystem without RAID protection (RAID-Z, ...) in the OS.
With this approach I would enjoy the caching facility of the 3511 and the checksum protection afforded by ZFS.
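In other words, something along these lines (LUN names are placeholders for whatever the 3511 presents; checksums are on by default):

  zpool create tank c4t0d0 c4t1d0 c4t2d0 c4t3d0
  zfs get checksum tank
  zpool scrub tank        # a periodic scrub will detect, though not repair, bad blocks

The caveat, as others have pointed out, is that without raidz/mirroring at the ZFS level a checksum error can only be reported, not healed.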
[zfs-discuss] Re: Configuring a 3510 for ZFS
I'm glad you asked this question. We are currently expecting 3511 storage sub-systems for our servers and were wondering about their configuration as well. This ZFS thing throws a wrench into the old-line thinking ;-) Seriously, we now have to put on a new hat to figure out the best way to leverage both the storage sub-system and ZFS. As a sidebar: if the performance of ZFS keeps improving, then I can tell you the ultra-expensive large arrays will be in trouble. ZFS falls into the category of 'disruptive technologies' as discussed in the book The Innovator's Dilemma. In the short run it'll eat away at the bottom of the performance curve but will trend upwards and beat the incumbents (just like RAM took over from core memory).
[zfs-discuss] Re: Fastest way to send 100gb ( with ZFS send )
You most certainly are hitting the SSH limitation. Note that SSH/SCP sessions are single-threaded and won't utilize all of the system resources even if they are available. Around 4 months back I was doing some testing between 2 fully configured T2000s connected with crossover cables, and the maximum I was able to push across the wire was under 10MB/S. To further ensure that I didn't have other network problems I ran 10-15 simultaneous SCP sessions and was able to push the network utilization up linearly. This told me that a single SSH (v2) transfer cannot saturate a GigE channel.
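If the data doesn't need to be encrypted in flight, one way around the single SSH session is to take ssh out of the path entirely. A rough sketch with netcat (flag syntax differs between nc implementations, so treat this as pseudo-shell):

  # on the receiver
  nc -l -p 9090 | zfs receive tank/fs
  # on the sender
  zfs send tank/fs@snap | nc receiver-host 9090

Otherwise, splitting the transfer into several parallel scp/ssh streams, as in my test above, is the low-tech way to fill the pipe.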
[zfs-discuss] Re: I'm dancin' in the streets
Some people have privately asked me the configuration details when the problem was encountered. Here they are:

zonecfg:bluenile> info
zonepath: /zones/bluenile
autoboot: false
pool:
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
        dir: /sbin
inherit-pkg-dir:
        dir: /usr
net:
        address: a.b.c.d
        physical: ce0
dataset:
        name: mtdc/bluenile/cloneu001
dataset:
        name: mtdc/bluenile/cloneu002
dataset:
        name: mtdc/bluenile/cloneu003
dataset:
        name: mtdc/bluenile/cloneu004
dataset:
        name: mtdc/bluenile/cloneu005
dataset:
        name: mtdc/bluenile/cloneu006
dataset:
        name: mtdc/bluenile/cloneu007
dataset:
        name: mtdc/bluenile/cloneu008
dataset:
        name: mtdc/bluenile/cloneu099
dataset:
        name: zfspool/bluenile/capps   [b]<-- This is the dataset in question; if you replace 'capps' with 'cloneapps' the local zone stops seeing it.[/b]
dataset:
        name: zfspool/bluenile/home
[zfs-discuss] Re: I'm dancin' in the streets
I've found a small bug in the ZFS & Zones integration in the Sol10 06/06 release. This evening I started tweaking my configuration to make it consistent (I like orthogonal naming standards) and hit upon this situation:
- Set up a ZFS clone as /zfspool/bluenile/cloneapps; this is a clone of my global zone's /apps filesystem.
- Updated my zone configuration for bluenile to use /zfspool/bluenile/cloneapps.
- Booted my zone and couldn't see the just-provisioned ZFS filesystem.
On a hunch I recreated the ZFS clone, but this time I named it /zfspool/bluenile/capps to reduce the overall length, and updated my zone config. Upon boot I was able to see the ZFS filesystem! I'm not sure if this is a ZFS, Zones, or ZFS/Zones integration problem. It is not a show stopper, but in the spirit of ZFS being 'unlimited' in all dimensions, why are we limiting the length of the clone name?
[zfs-discuss] I'm dancin' in the streets
Wow! I solved a tricky problem this morning thanks to Zones & ZFS integration. We have a SAS SPDS database environment running on Sol10 06/06. SPDS is unique in that when a table is being updated by one user it is unavailable to the rest of the user community. Our nightly update jobs (occasionally they turn into day jobs when they take longer :-() were getting in the way of normal usage. So I put on my ZFS cap and figured it could be solved simply by deploying the 'clone' feature: I'd create a clone of all the SPDS filesystems and start another instance of SPDS to read/write the cloned data. Unfortunately I hit a wall when I realized there is no way to update the SPDS metadata (a binary file containing a description of the physical structure of the database) with the new directory paths. I was stumped until it occurred to me that I could solve it by simply marrying the clones with a Solaris Zone. Now our problem is solved as follows (see the sketch below):
1. Stop the local zone
2. Reclaim the ZFS clones in the global zone
3. Destroy the clone/snapshot
4. Recreate the clone/snapshot
5. Restart the local zone
6. Start SPDS in the local zone, and it works beautifully because it sees all the files it needs per its metadata!!!
To accomplish the same with traditional methods would have required SAN disk, disk merge/split, ... You get the picture: ugly! Chalk up one more victory for Solaris 10 Zones/ZFS!!! Thanks to the developers of these features for enabling me to elegantly solve a difficult problem. -Anantha-
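For the curious, the nightly refresh boils down to something like this for each filesystem (the snapshot name and the u001 dataset are placeholders; ours is wrapped in a loop over all the SPDS filesystems):

  zoneadm -z bluenile halt
  zfs destroy -Rf mtdc/u001@spds          # drops yesterday's snapshot and its clone
  zfs snapshot mtdc/u001@spds
  zfs clone mtdc/u001@spds mtdc/bluenile/cloneu001
  zoneadm -z bluenile boot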
[zfs-discuss] Re: Re: Bizarre problem with ZFS filesystem
I don't see a patch for this on the SunSolve website. I've opened a service request to get this patch for Sol10 06/06. Stay tuned.
[zfs-discuss] Re: zfs and Oracle ASM
I did a non-scientific benchmark of ASM against ZFS; just look for my posts and you'll see it. To summarize, it was a statistical tie for simple loads of around 2GB of data, and we've chosen to stick with ASM for a variety of reasons, not the least of which is its ability to rebalance when disks are added/removed. Better integration comes to mind too.
[zfs-discuss] Re: Bizarre problem with ZFS filesystem
One more piece of information: I was able to ascertain that the slowdown happens only when ZFS is used heavily, meaning lots of in-flight I/O. This morning, when the system was quiet, writes to the /u099 filesystem were excellent; they have since gone south as I reported earlier. I am currently awaiting the completion of a write to /u099, well over 60 seconds now. At the same time I was able to create/save files in /u001 without any problems. The only difference between /u001 and /u099 is the size of the filesystem (256GB vs 768GB). Per your suggestion I ran a 'zfs set' command and it completed after a wait of around 20 seconds, while my file save from vi against /u099 is still pending!!!
[zfs-discuss] Re: Re: Bizarre problem with ZFS filesystem
I ran the DTrace script and the resulting output is rather large (1 million lines and 65MB), so I won't burden this forum with that much data. Here are the top 100 lines from the DTrace output. Let me know if you need the full output and I'll figure out a way for the group to get it.

dtrace: description 'fbt:zfs::' matched 2404 probes
CPU  FUNCTION
520  -> zfs_lookup                  2929705866442880
520  -> zfs_zaccess                 2929705866448160
520  -> zfs_zaccess_common          2929705866451840
520  -> zfs_acl_node_read           2929705866455040
520  -> zfs_acl_node_read_internal  2929705866458400
520  -> zfs_acl_alloc               2929705866461040
520  <- zfs_acl_alloc               2929705866462880
520  <- zfs_acl_node_read_internal  2929705866464080
520  <- zfs_acl_node_read           2929705866465600
520  -> zfs_ace_access              2929705866467760
520  <- zfs_ace_access              2929705866468880
520  -> zfs_ace_access              2929705866469520
520  <- zfs_ace_access              2929705866470320
520  -> zfs_acl_free                2929705866471920
520  <- zfs_acl_free                2929705866472960
520  <- zfs_zaccess_common          2929705866474720
520  <- zfs_zaccess                 2929705866476320
520  -> zfs_dirlook                 2929705866478320
520  -> zfs_dirent_lock             2929705866480880
520  <- zfs_dirent_lock             2929705866486560
520  -> zfs_dirent_unlock           2929705866489840
520  <- zfs_dirent_unlock           2929705866491600
520  <- zfs_dirlook                 2929705866492560
520  <- zfs_lookup                  2929705866494080
520  -> zfs_getattr                 2929705866499360
520  -> dmu_object_size_from_db     2929705866503520
520  <- dmu_object_size_from_db     2929705866507920
520  <- zfs_getattr                 2929705866509280
520  -> zfs_lookup                  2929705866520400
520  -> zfs_zaccess                 2929705866521200
520  -> zfs_zaccess_common          2929705866521920
520  -> zfs_acl_node_read           2929705866523280
520  -> zfs_acl_node_read_internal  2929705866524800
520  -> zfs_acl_alloc               2929705866526000
520  <- zfs_acl_alloc               2929705866526800
520  <- zfs_acl_node_read_internal  2929705866527280
520  <- zfs_acl_node_read           2929705866528160
520  -> zfs_ace_access              2929705866528720
520  <- zfs_ace_access              2929705866529280
520  -> zfs_ace_access              2929705866529920
520  <- zfs_ace_access              2929705866530800
520  -> zfs_acl_free                2929705866531360
520  <- zfs_acl_free                2929705866531920
520  <- zfs_zaccess_common          2929705866532560
520  <- zfs_zaccess                 2929705866533440
520  -> zfs_dirlook                 2929705866534000
520  -> zfs_dirent_lock             2929705866534640
520  <- zfs_dirent_lock             2929705866535600
520  -> zfs_dirent_unlock           2929705866536480
520  <- zfs_dirent_unlock           2929705866537120
520  <- zfs_dirlook                 2929705866537760
520  <- zfs_lookup                  2929705866538400
520  -> zfs_getsecattr              2929705866543600
520  -> zfs_getacl                  2929705866546240
520  -> zfs_zaccess                 2929705866546960
520  -> zfs_zaccess_common          2929705866547680
520  -> zfs_acl_node_read           2929705866548720
520  -> zfs_acl_node_read_internal  2929705866549440
520  -> zfs_acl_alloc               2929705866550080
520  <- zfs_acl_alloc               2929705866550720
520  <- zfs_acl_node_read_internal  2929705866551600
520  <- zfs_acl_node_read           2929705866552160
520  -> zfs_ace_access              2929705866552720
520  <- zfs_ace_access              2929705866553280
520  -> zfs_ace_access              2929705866554160
520  <- zfs_ace_access              2929705866554720
520  -> zfs_ace_access              2929705866555600
520  <- zfs_ace_access              2929705866556160
520  -> zfs_ace_access              2929705866557040
520  <- zfs_ace_access              2929705866557600
520  -> zfs_ace_access              2929705866558160
520  <- zfs_ace_access              2929705866558720
520  -> zfs_ace_access
[zfs-discuss] Re: Bizarre problem with ZFS filesystem
Here's the information you requested.

Script started on Tue Sep 12 16:46:46 2006
# uname -a
SunOS umt1a-bio-srv2 5.10 Generic_118833-18 sun4u sparc SUNW,Netra-T12
# prtdiag
System Configuration:  Sun Microsystems  sun4u Sun Fire E2900
System clock frequency: 150 MHZ
Memory size: 96GB

=== CPUs ===
                  E$          CPU                  CPU
CPU      Freq     Size        Implementation       Mask   Status    Location
-------  -------- ----------  -------------------  -----  --------  --------
0,512    1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB0/P0
1,513    1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB0/P1
2,514    1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB0/P2
3,515    1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB0/P3
8,520    1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB2/P0
9,521    1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB2/P1
10,522   1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB2/P2
11,523   1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB2/P3
16,528   1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB4/P0
17,529   1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB4/P1
18,530   1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB4/P2
19,531   1500 MHz 32MB        SUNW,UltraSPARC-IV+  2.1    on-line   SB4/P3

# mdb -k
Loading modules: [ unix krtld genunix dtrace specfs ufs sd sgsbbc md sgenv ip sctp usba fcp fctl qlc nca ssd lofs zfs random crypto ptm nfs ipc logindmux cpc sppp fcip wrsmd ]
> arc::stat print
{
    anon = ARC_anon
    mru = ARC_mru
    mru_ghost = ARC_mru_ghost
    mfu = ARC_mfu
    mfu_ghost = ARC_mfu_ghost
    size = 0x11917e1200
    p = 0x116e8a1a40
    c = 0x11917cf428
    c_min = 0xbf77c800
    c_max = 0x17aef9
    hits = 0x489737a8
    misses = 0x8869917
    deleted = 0xc832650
    skipped = 0x15b29b2
    hash_elements = 0x1273d0
    hash_elements_max = 0x17576f
    hash_collisions = 0x4e0ceee
    hash_chains = 0x3a9b2
Segmentation Fault - core dumped
# mdb -k
Loading modules: [ unix krtld genunix dtrace specfs ufs sd sgsbbc md sgenv ip sctp usba fcp fctl qlc nca ssd lofs zfs random crypto ptm nfs ipc logindmux cpc sppp fcip wrsmd ]
> ::kmastat
> ::pgrep vi | ::walk thread
3086600f660
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598d91
[ 02a104598d91 cv_wait_sig+0x114() ]
  02a104598e41 str_cv_wait+0x28()
  02a104598f01 strwaitq+0x238()
  02a104598fc1 strread+0x174()
  02a1045990a1 fop_read+0x20()
  02a104599161 read+0x274()
  02a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  02a104598f61 zil_lwb_commit+0x1ac()
  02a104599011 zil_commit+0x1b0()
  02a1045990c1 zfs_fsync+0xa8()
  02a104599171 fop_fsync+0x14()
  02a104599231 fdsync+0x20()
  02a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598c71
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  02a104598f61 zil_lwb_commit+0x1ac()
  02a104599011 zil_commit+0x1b0()
  02a1045990c1 zfs_fsync+0xa8()
  02a104599171 fop_fsync+0x14()
  02a104599231 fdsync+0x20()
  02a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  02a104598f61 zil_lwb_commit+0x1ac()
  02a104599011 zil_commit+0x1b0()
  02a1045990c1 zfs_fsync+0xa8()
  02a104599171 fop_fsync+0x14()
  02a104599231 fdsync+0x20()
  02a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598bb1
> 3086600f660::findstack
stack pointer for thread 3086600f660: 2a104598e61
  02a104598f61 zil_lwb_commit+0x1ac()
  02a104599011 zil_commit+0x1b0()
  02a1045990c1 zfs_fsync+0xa8()
  02a104599171 fop_fsync+0x14()
  02a104599231 fdsync+0x20()
  02a1045992e1 syscall_trap32+0xcc()
> 3086600f660::findstack
stack pointer for thread 3086600f660 (TS_FREE): 2a104598ba1
  02a104598fe1 segvn_unmap+0x1b8()
  02a1045990d1 as_free+0xf4()
  02a104599181 proc_exit+0x46c()
  02a104599231 exit+8()
  02a1045992e1 syscall_trap32+0xcc()
# df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/md/dsk/d10         32G   6.7G    25G    22%    /
/devices                 0K     0K     0K     0%    /devices
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap
[zfs-discuss] Bizarre problem with ZFS filesystem
I'm experiencing a bizarre write performance problem with a ZFS filesystem. Here are the relevant facts:

[b]# zpool list[/b]
NAME        SIZE    USED   AVAIL    CAP  HEALTH   ALTROOT
mtdc       3.27T    502G   2.78T    14%  ONLINE   -
zfspool    68.5G   30.8G   37.7G    44%  ONLINE   -

[b]# zfs list[/b]
NAME                        USED  AVAIL  REFER  MOUNTPOINT
mtdc                        503G  2.73T  24.5K  /mtdc
mtdc/sasmeta                397M   627M   397M  /sasmeta
mtdc/u001                  30.5G   226G  30.5G  /u001
mtdc/u002                  29.5G   227G  29.5G  /u002
mtdc/u003                  29.5G   226G  29.5G  /u003
mtdc/u004                  28.4G   228G  28.4G  /u004
mtdc/u005                  28.3G   228G  28.3G  /u005
mtdc/u006                  29.8G   226G  29.8G  /u006
mtdc/u007                  30.1G   226G  30.1G  /u007
mtdc/u008                  30.6G   225G  30.6G  /u008
mtdc/u099                   266G   502G   266G  /u099
zfspool                    30.8G  36.6G  24.5K  /zfspool
zfspool/apps               30.8G  33.2G  28.5G  /apps
zfspool/[EMAIL PROTECTED]   2.28G      -  29.8G  -
zfspool/home               15.4M  2.98G  15.4M  /home

[b]# zfs list mtdc/u099[/b]
NAME       PROPERTY       VALUE                  SOURCE
mtdc/u099  type           filesystem             -
mtdc/u099  creation       Thu Aug 17 10:21 2006  -
mtdc/u099  used           267G                   -
mtdc/u099  available      501G                   -
mtdc/u099  referenced     267G                   -
mtdc/u099  compressratio  3.10x                  -
mtdc/u099  mounted        yes                    -
mtdc/u099  quota          768G                   local
mtdc/u099  reservation    none                   default
mtdc/u099  recordsize     128K                   default
mtdc/u099  mountpoint     /u099                  local
mtdc/u099  sharenfs       off                    default
mtdc/u099  checksum       on                     default
mtdc/u099  compression    on                     local
mtdc/u099  atime          off                    local
mtdc/u099  devices        on                     default
mtdc/u099  exec           on                     default
mtdc/u099  setuid         on                     default
mtdc/u099  readonly       off                    default
mtdc/u099  zoned          off                    default
mtdc/u099  snapdir        hidden                 default
mtdc/u099  aclmode        groupmask              default
mtdc/u099  aclinherit     secure                 default

[b]No error messages listed by zpool or /var/adm/messages.[/b]

When I try to save a file the operation takes an inordinate amount of time, in the 30+ second range!!! I truss'd the vi session to see the hangup, and it waits at the write system call.

# truss -p
read(0, 0xFFBFD0AF, 1)   (sleeping...)
read(0, " w", 1)         = 1
write(1, " w", 1)        = 1
read(0, " q", 1)         = 1
write(1, " q", 1)        = 1
read(0, 0xFFBFD00F, 1)   (sleeping...)
read(0, "\r", 1)         = 1
ioctl(0, I_STR, 0x000579F8)  Err#22 EINVAL
write(1, "\r", 1)        = 1
write(1, " " d e l e t e m e "", 10)  = 10
stat64("deleteme", 0xFFBFCFA0)  = 0
creat("deleteme", 0666)  = 4
ioctl(2, TCSETSW, 0x00060C10)  = 0
[b]write(4, " l f f j d\n", 6)  = 6[/b]  <-- still waiting while I type this message!!

This problem manifests itself only on this filesystem and not on the other ZFS filesystems on the same server, built from the same ZFS pool. While I was awaiting completion of the above write I was able to start a new vi session in another window and save a file to the /u001 filesystem without any problem. System loads are very low. Can anybody comment on this bizarre behavior?
[zfs-discuss] Re: Oracle on ZFS
One correction in the interest of full disclosure: the tests were conducted on a machine different from the server configuration indicated in my original post. Here's the server config used in the tests:
- E25K domain (1 board: 4P/8-way x 32GB)
- 2 x 2Gbps FC
- MPxIO
- Solaris 10 Update 2 (06/06); no other patches
[zfs-discuss] Re: Oracle on ZFS
I finally got around to running a 'benchmark' using the AOL clickstream data (2GB of text files and approximately 36 million rows). Here is the setup used during the tests:
- Same Oracle settings for all tests
- All disks in question are 32GB EMC hypers
- I had the standard Oracle tablespaces on one ASM group consisting of 1 disk
- I created a tablespace using ASM on 10 disks
- I created a tablespace using ZFS on 10 disks
- I created a tablespace using ZFS with compression on 10 disks

Test 1 (loading to ASM): I loaded the text file into Oracle using the external table feature. Time 1m20s; system loads were in the 1-1.35 range.
Test 2 (loading to ZFS): I loaded the text file into Oracle using the external table feature. Time 1m16s; system loads were in the 1.13 range.
Test 3 (loading from ASM to ASM): I loaded a new table from the just-loaded Oracle table. Time 1m21s; system loads were in the 1-1.3 range.
Test 4 (loading from ZFS to ZFS): I loaded a new table from the just-loaded Oracle table. Time 1m20s; system loads were in the 1-1.3 range.
Test 5 (loading from ZFS to ZFS, compress=ON): I loaded a new table from the just-loaded Oracle table. Time 1m18s; system loads were in the 1-1.45 range; saw compression in the 3.5-4x range.

Throughout the tests I had other stuff running on the machine as well (1 additional database and a 10g Grid Control repository). [b]All the tests yielded the same results, in my opinion.[/b] We'll probably go with Oracle ASM because of its integration with other Oracle products/features. I'm not comfortable enough with ZFS to bet on it yet (I've only played with it for less than 2 months), while ASM has been around for 3 years. The other contributing factor is ASM's ability to rebalance the data when disks are added/removed. ZFS at this time doesn't provide a facility to remove drives when I'm not using mirrors (my problem is that all our disks are provisioned from EMC and are already protected); ASM does. While performing these tests I came across another (severe?) problem with ZFS that I'll post as a separate entry. -Anantha-
[zfs-discuss] Re: Oracle on ZFS
Good start; I'm now motivated to run the same test on my server. My h/w config for the test will be:
- E2900 (24-way x 96GB)
- 2 x 2Gbps QLogic cards
- 40 x 64GB EMC LUNs
I'll run the AOL de-identified clickstream database; it'll primarily be a write test. I intend to use the following scenarios:
- SVM/UFS (nologging, atime off, directio), data striped across all LUNs
- ZFS (compress=OFF, atime=OFF)
- ZFS (compress=ON, atime=OFF)
- Oracle 10g Automatic Storage Management (ASM)
I'll keep the same Oracle 10g settings for all tests. I'm really interested in the comparison between ASM and ZFS, especially with the compress=ON option. In a DW environment like ours this could lead to HUGE savings.
[zfs-discuss] Re: ZFS compression / space efficiency
We're running ZFS with compress=ON on an E2900. I'm hosting SAS/SPDS datasets (files) on these filesystems and am achieving 3.87x compression (as reported by zfs). Your mileage will vary depending on the data you are writing; if your data is already compressed (zip files) then don't expect any payback.
[zfs-discuss] Re: ZFS write performance problem with compression set to ON
I've a few questions:
- Does 'zpool iostat' report numbers from the top of the ZFS stack or the bottom? I've correlated the zpool iostat numbers with the system iostat numbers and they match up, which tells me the numbers are from the 'bottom' of the ZFS stack, right? Having said that, it'd be nice to have zpool iostat return numbers from the top of the stack as well; this becomes relevant when compression=on.
- Secondly, I did some more tests and I see the same read waves and the same consistent write throughput. I've been reading another thread on this forum about Niagara and compression where Matt Ahrens noted that compression is single-threaded at this time. Further, he stated that there may be a bugfix released to use multiple threads; I eagerly await the fix.
Thanks again for a great feature. Looking forward to more fun stuff out of Sun and you, Mr. Bonwick.
[zfs-discuss] Re: ZFS write performance problem with compression set to ON
Therein lies my dilemma:
- We know the I/O sub-system is capable of much higher I/O rates.
- Under the test setup I have SAS datasets which lend themselves to compression. This should manifest itself as lots of read I/O resulting in much smaller (4x) write I/O due to compression, which means my read rates should be driven higher to keep the compression engine fed. I don't see this; as I said in my original post, the reads come in waves.
I'm beginning to think my write rates are hitting a bottleneck in compression, as follows:
- ZFS issues reads.
- ZFS starts compressing the data before the write and cannot drain the input buffers fast enough; this causes the reads to stop.
- ZFS completes compression and writes out the data at a much smaller rate due to the smaller compressed data stream.
I'm not a filesystem wizard, but shouldn't ZFS take advantage of my available CPUs to drain the input buffer faster (in parallel)? It is possible that you have some internal throttles in place to make ZFS a good citizen in the Solaris landscape, a la the algorithms that prevent cache flooding by one host/device in EMC/Hitachi arrays. I'll perform some more tests with different datasets and report back to the forum. Now if only I can convince my storage administrator to provision me raw disks instead of mirrored disks so I can let ZFS do the mirroring for me; another battle, another day ;-) Thanks.
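One crude way I plan to test the single-threaded-compression theory: watch per-CPU and per-thread utilization while a bulk write is running. If one CPU (or one kernel thread) sits pegged in system time while the rest idle, that's consistent with a serialized compression stage:

  mpstat 5        # look for a single CPU with high %sys
  prstat -mL 5    # per-LWP microstate accounting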
[zfs-discuss] Re: ZFS write performance problem with compression set to ON
Completely forgot to mention the OS in my previous post; Solaris 10 06/06.
[zfs-discuss] ZFS write performance problem with compression set to ON
Test setup:
- E2900 with 12 US-IV+ 1.5GHz processors, 96GB memory, 2 x 2Gbps FC HBAs, MPxIO in round-robin config.
- 50 x 64GB EMC disks presented over both FCs.
- ZFS pool defined using all 50 disks.
- Multiple ZFS filesystems built on the above pool.

I'm observing the following:
- When the filesystems have compress=OFF and I do bulk reads/writes (8 parallel 'cp's running between ZFS filesystems) I observe approximately 200-250MB/S consolidated I/O, with writes in the 100MB/S range. I get these numbers from 'zpool iostat 5'. I see the same read/write ratio for the duration of the test.
- When the filesystems have compress=ON I see the following: reads from compressed filesystems come in waves; zpool will report no read activity for long durations (60+ seconds) while the write activity is consistently reported at 20MB/S (no variation in the write rate throughout the test).
- The machine is mostly idling during the entire test, in both cases.
- ZFS reports a 4:1 compression ratio for my filesystem.

I'm puzzled by the following:
- Why do reads come in waves with compression=ON? It almost feels like ZFS reads a bunch of data and then proceeds to compress it before writing it out. This tells me there is not a read bottleneck, meaning there is no starvation of the compress routine, given that the CPU/machine/IO is not saturated in any shape or form.
- Why then does ZFS generate substantially lower write throughput (a magical 20MB/S spread evenly across the 50 disks, 0.4MB/S each)?

Can anybody shed any light on this anomaly? Mr. Bonwick, I hope you're reading this post. BTW, we love ZFS and are looking forward to rolling it out aggressively in our new project. I'd like to take advantage of the compression since we're mostly I/O bound and we've plenty of CPU/memory. Thanks.