[zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?
Hi folks, (Long time no post...) Only starting to get into this one, so apologies if I'm light on detail, but... I have a shiny SSD I'm using to help make some VirtualBox stuff I'm doing go fast. I have a 240GB Intel 520 series jobbie. Nice. I chopped it into a few slices - p0 (partition table), p1 128GB, p2 60GB. As part of my work, I have used it both as a RAW device (cxtxdxp1) and wrapped partition 1 with a VirtualBox-created VMDK linkage, and it works like a champ. :) Very happy with that. I then tried creating a new zpool using partition 2 of the disk (zpool create c2d0p2) and then carved a zvol out of that (30GB), and wrapped *that* in a vmdk. Still works OK and speed is good(ish) - but there are a couple of things in particular that disturb me: - Sync writes are pretty slow - only about 1/10th of what I thought I might get (about 15MB/s). Async writes are fast - up to 150MB/s or more. - More worryingly, it seems that writes are amplified by 2X, in that if I write 100MB at the guest level, the underlying bare-metal ZFS writes 200MB, as observed by iostat. This doesn't happen on the VMs that are using RAW slices. Anyone have any thoughts on what might be happening here? I can appreciate that if everything comes through as a sync write, it goes to the ZIL first, then to its final resting place - but it seems a little over the top that it really is double. I have also had a play with sync=, primarycache settings and a few other things, but it doesn't seem to change the behaviour. Again - I'm looking for thoughts here - as I have only really just started looking into this. Should I happen across anything interesting, I'll follow up on this post. Cheers, Nathan. :) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
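One way to narrow down where the extra writes come from is to watch the pool and the underlying disk while pushing a known amount of data from the guest. A minimal sketch, assuming (hypothetically) the pool is called ssd and the zvol is ssd/vbox - adjust the names to suit:

# On the host, watch pool-level and disk-level write bandwidth side by side:
zpool iostat -v ssd 5              # per-vdev view of the pool
iostat -xnz 5                      # bare-metal view of the disk underneath (c2d0)

# In the guest, write a fixed, easily recognised amount (e.g. 100MB with dd).

# If the on-disk total roughly halves when the ZIL is taken out of the picture,
# the doubling is the ZIL copy; if it doesn't, suspect volblocksize
# read-modify-write instead:
zfs get volblocksize,sync ssd/vbox
zfs set sync=disabled ssd/vbox     # for the test only - put it back afterwards

If sync=disabled makes both the amplification and the 15MB/s ceiling disappear, that points squarely at every guest write being treated as synchronous and logged before being committed.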
Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)
Hi John, Actually, last time I tried the whole AF (4k) thing, its performance was worse than woeful. But admittedly, that was a little while ago. The drives were the Seagate Barracuda Green IIRC, and performance for just about everything was 20MB/s per spindle or worse, when it should have been closer to 100MB/s when streaming. Things were worse still when doing random... I'm actually looking to put in something larger than the 3*2TB drives (triple mirror for read perf) this pool has in it - preferably 3 * 4TB drives. (I don't want to put in more spindles - just replace the current ones...) I might just have to bite the bullet and try something with current SW. :) Nathan. On 05/29/12 08:54 PM, John Martin wrote: On 05/28/12 08:48, Nathan Kroenert wrote: Looking to get some larger drives for one of my boxes. It runs exclusively ZFS and has been using Seagate 2TB units up until now (which are 512 byte sector). Anyone offer up suggestions of either 3 or preferably 4TB drives that actually work well with ZFS out of the box? (And not perform like rubbish)... I'm using Oracle Solaris 11, and would prefer not to have to use a hacked up zpool to create something with ashift=12. Are you replacing a failed drive or creating a new pool? I had a drive in a mirrored pool recently fail. Both drives were 1TB Seagate ST310005N1A1AS-RK with 512 byte sectors. All the 1TB Seagate boxed drives I could find with the same part number on the box (with factory seals in place) were really ST1000DM003-9YN1 with 512e/4096p. Just being cautious, I ended up migrating the pools over to a pair of the new drives. The pools were created with ashift=12 automatically:
$ zdb -C | grep ashift
ashift: 12
ashift: 12
ashift: 12
Resilvering the three pools concurrently went fairly quickly:
$ zpool status
scan: resilvered 223G in 2h14m with 0 errors on Tue May 22 21:02:32 2012
scan: resilvered 145G in 4h13m with 0 errors on Tue May 22 23:02:38 2012
scan: resilvered 153G in 3h44m with 0 errors on Tue May 22 22:30:51 2012
What performance problem do you expect? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)
Hi folks, Looking to get some larger drives for one of my boxes. It runs exclusively ZFS and has been using Seagate 2TB units up until now (which are 512 byte sector). Can anyone offer up suggestions of either 3 or preferably 4TB drives that actually work well with ZFS out of the box? (And not perform like rubbish)... I'm using Oracle Solaris 11, and would prefer not to have to use a hacked up zpool to create something with ashift=12. Thoughts on the best drives - or is Solaris 11 actually ready to go with whatever I throw at it? :) And - am I doomed to have to use these so called 'advanced format' drives (which as far as I can tell are in no way actually advanced, and only benefit HDD makers and not the end user). Cheers! Nathan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
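For anyone in the same boat, a quick way to see what you actually ended up with - a rough sketch only, with made-up pool and device names:

# What sector size does the OS think the drive has?
prtvtoc /dev/rdsk/c0t1d0s2 | head    # look for the "bytes/sector" line
iostat -En                           # vendor/model/firmware per device

# What alignment did the pool get when it was created?
zdb -C tank | grep ashift            # ashift: 9 = 512B allocations, ashift: 12 = 4KB

If the drive is a 512e/4Kn unit but the pool came up with ashift: 9, every sub-4KB write risks a read-modify-write inside the drive, which is where the 'worse than woeful' numbers tend to come from.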
Re: [zfs-discuss] Convert pool from ashift=12 to ashift=9
Jim Klimov wrote: It is hard enough already to justify to an average wife that...snip That made my night. Thanks, Jim. :) On 03/20/12 10:29 PM, Jim Klimov wrote: 2012-03-18 23:47, Richard Elling wrote: ... Yes, it is wrong to think that. Ok, thanks, we won't try that :) copy out, copy in. Whether this is easy or not depends on how well you plan your storage use ... Home users and personal budgets do tend to have a problem with planning. Any mistake is to be paid for personally, and many are left as is. It is hard enough already to justify to an average wife that a storage box with large X-Tb disks needs raidz3 or mirroring and thus becomes larger and noisier, not to mention almost a thousand bucks more expensive just for the redundancy disks, but it will become two times cheaper in a year. Yup, it is not very easy to find another 10+Tb backup storage (with ZFS reliability) in a typical home I know of. Planning is not easy... But that's a rant... Hoping that in-place BP Rewrite would arrive and magically solve many problems =) Questions are: 1) How bad would a performance hit be with 512b blocks used on a 4kb drive with such efficient emulation? Depends almost exclusively on the workload and hardware. In my experience, most folks who bite the 4KB bullet have low-cost HDDs where one cannot reasonably expect high performance. Is it possible to model/emulate the situation somehow in advance to see if it's worth that change at all? It will be far more cost effective to just make the change. Meaning altogether? That with consumer disks which will suck from a performance standpoint anyway, it was not a good idea to use ashift=12 and it was more cost effective to remain at ashift=9 to begin with? What about real people's tests which seemed to show that there were substantial performance hits with misaligned large-block writes (spanning several 4k sectors at wrong boundaries)? I had an RFE posted sometime last year about making an optimisation for both worlds: use formal ashift=9 and allow writing of small blocks, but align larger blocks at set boundaries (i.e. offset divisible by 4096 for blocks sized 4096+). Perhaps writing of 512b blocks near each other should only be reserved for metadata which is dittoed anyway, so that a whole-sector (4kb) corruption won't be irreversible for some data. In effect, minblocksize for userdata would be enforced (by config) at the same 4kb in such case. This is a zfs-write only change (and some custom pool or dataset attributes), so the on-disk format and compatibility should not suffer with this solution. But I had little feedback whether the idea was at all reasonable. 2) Is it possible to easily estimate the amount of wasted disk space in slack areas of the currently active ZFS allocation (unused portions of 4kb blocks that might become available if the disks were reused with ashift=9)? Detailed space use is available from the zfs_blkstats mdb macro as previously described in such threads. 3) How many parts of a ZFS pool are actually affected by the ashift setting? Everything is impacted. But that isn't a useful answer. From what I gather, it is applied at the top-level vdev level (I read that one can mix ashift=9 and ashift=12 TLVDEVs in one pool spanning several TLVDEVs). Is that a correct impression? Yes If yes, how does ashift size influence the amount of slots in the uberblock ring (128 vs. 32 entries) which is applied at the leaf vdev level (right?) but should be consistent across the pool? It should be consistent across the top-level vdev. 
There is 128KB of space available for the uberblock list. The minimum size of an uberblock entry is 1KB. Obviously, a 4KB disk can't write only 1KB, so for 4KB sectors, there are 32 entries in the uberblock list. So if I have ashift=12 and ashift=9 top-level devices mixed in the pool, is it okay that some of them would remember 4x more of the pool's TXG history than others? As far as I see in the ZFS on-disk format, all sizes and offsets are in either bytes or 512b blocks, and the ashift'ed block size is not actually used anywhere except to set the minimal block size and its implicit alignment during writes. The on-disk format doc is somewhat dated and unclear here. UTSL. Are there any updates, or is the 2006 pdf the latest available? For example, is there an effort in illumos/nexenta/openindiana to publish their version of the current on-disk format? ;) Thanks for all the answers, //Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
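The arithmetic behind the 128-vs-32 entry difference is simple enough to sanity-check. A throwaway shell sketch of the calculation described above (just the sums, not authoritative code):

# 128KB of label space holds the uberblock ring; each entry occupies
# max(1KB, 1 << ashift) bytes.
for ashift in 9 12; do
  entry=$(( 1 << ashift ))
  [ "$entry" -lt 1024 ] && entry=1024
  echo "ashift=$ashift -> $(( 131072 / entry )) uberblock entries"
done
# ashift=9  -> 128 entries
# ashift=12 -> 32 entries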
Re: [zfs-discuss] Bad performance (Seagate drive related?)
Hey there, Few things: - Using /dev/zero is not necessarily a great test. I typically use /dev/urandom to create an initial block-o-stuff - something like a gig or so worth, in /tmp, then use dd to push that to my zpool. (/dev/zero will return dramatically different results depending on pool/dataset settings for compression etc.) - Indeed - getting a total aggregate of 180MB/s seems pretty low on the face of it for the setup you have. What's the controller you are using? Any details on the driver, backplane, expander, array or other you might be using? - Have you tried your dd on individual spindles? You might find that they behave differently. - Does your controller have DRAM on it? Can you put it in passthrough mode rather than cache? - I have done some testing trying to find odd behaviour like this before, and found on different occasions a number of different things: - Drives: Things like the WD 'green' drives getting in my way - Alignment for non-EFI labeled disks (hm - maybe even on EFI... that one was a while ago) (particularly for 4K 'advanced format' (ha!) disks) - The controller was unable to keep up. (In one case, I ended up tossing an HP P400 (IIRC) and using the on-motherboard chipset as it was considerably faster when running four disks.) - Disks with wildly different performance characteristics were also bad (eg: Enterprise SATA mixed with 5400 RPM disks. ;) I'd suggest that you spend a little time validating the basic assumptions around: - speed of individual disks, - speed of individual buses - Whether you are being limited by CPU (ie: If you have compression or dedupe turned on) (view with mpstat and friends) - I'll also note that you are looking close to the number of IOPS I'd expect a consumer disk to supply assuming a somewhat random distribution of IOPS. - Consider that your 180MB/s is actually 360 (well - not quite - but it's a lot more than 180). Remember - in a mirror, you literally need to write the data twice. 8.0 3857.8 64.0 337868.8 0.0 64.5 0.0 16.7 0 704 c5 (Note above is your c5 controller - running at around 337 MB/s) Incidentally - this seems awfully close to 3Gb/s... How did you say all of your external drives were attached? If I didn't know better, I'd be asking serious questions about how many lanes of a SAS connection SATA attached drives were able to use... Actually - I don't know better, so I'd ask anyway... ;) I think this will likely go a long way to helping understand where the holdup is. There is also a heap of great stuff on solarisinternals.com which I'd highly recommend taking a look at after you have validated the basics... Were this one of my systems, (and especially if it's new, and you don't love your data and can re-create the pool) I'd be tempted to do something like a very destructive...
for i in <your disk list>; do
  dd if=/tmp/randomdata.file.I.created.earlier of=/dev/rdsk/${i}
done
and see how much you can stuff down the pipe. (A non-destructive, read-only variant is sketched after the quoted message below.) Remember - this will kill whatever is on the disks, do think twice before you do it. ;) If you can't get at least 80-100MB/s on the outside of the platter, I'd suggest you should be looking at layers below ZFS. If you *can*, then you start looking further up the stack. Hope this helps somewhat. Let us know how you go. Cheers! Nathan. On 02/ 1/12 04:52 AM, Mohammed Naser wrote: Hi list! I have seen less-than-stellar ZFS performance on a setup of one main head connected to a JBOD (using SAS, but drives are SATA). 
There are 16 drives (8 mirrors) in this pool but I'm getting 180ish MB sequential writes (using dd, I know it's not precise, but those numbers should be higher). With some help on IRC, it seems that part of the reason I'm slowing down is some drives seem to be slower than the others. Initially, I had some drives running at 1.5 mode instead of 3.0 -- They are all running at 3.0 now. While running the following dd command, the output of iostat reflects a much higher %b which seems to say that those drives are slower (but could they really be slowing down everything else that much? --- Or am I looking at the wrong spot here?) -- The pool configuration is also included below
dd if=/dev/zero of=4g bs=1M count=4000
extended device statistics
r/s   w/s     kr/s   kw/s      wait  actv  wsvc_t  asvc_t  %w  %b   device
1.0   0.0     8.0    0.0       0.0   0.0   0.0     0.2     0   0    c1
1.0   0.0     8.0    0.0       0.0   0.0   0.0     0.2     0   0    c1t2d0
8.0   3857.8  64.0   337868.8  0.0   64.5  0.0     16.7    0   704  c5
0.0   259.0   0.0    26386.2   0.0   3.6   0.0     14.0    0   37   c5t50014EE0ACE4AEEFd0
1.0   266.0   8.0    27139.2   0.0   3.6   0.0     13.5    0   37   c5t50014EE056EB0356d0
2.0   276.0   16.0   19315.1   0.0   3.7   0.0     13.3    0   40   c5t50014EE00239C976d0
0.0   279.0   0.0    19699.0   0.0   3.6   0.0     13.0    0   37   c5t50014EE0577C459Cd0
1.0   232.0   8.0    23061.9   0.0   3.6   0.0     15.4    0
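The per-spindle check mentioned above, in a non-destructive form - the device names here are simply the ones from the iostat output, and s0 may need to be p0 or the whole-disk device depending on how the disks are labeled:

# Sequential read from the start of each disk -- read-only, safe to run.
for d in c5t50014EE0ACE4AEEFd0 c5t50014EE056EB0356d0 c5t50014EE00239C976d0 c5t50014EE0577C459Cd0; do
  echo "== $d =="
  dd if=/dev/rdsk/${d}s0 of=/dev/null bs=1024k count=2048
done

Anything much under 80-100MB/s at the outer edge of a modern 7200 RPM drive points at the layers below ZFS (controller, expander, cabling, negotiated link speed) rather than at ZFS itself.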
Re: [zfs-discuss] Can I create a mirror for a root rpool?
Do note, that though Frank is correct, you have to be a little careful around what might happen should you drop your original disk, and only the large mirror half is left... ;) On 12/16/11 07:09 PM, Frank Cusack wrote: You can just do fdisk to create a single large partition. The attached mirror doesn't have to be the same size as the first component. On Thu, Dec 15, 2011 at 11:27 PM, Gregg Wonderly gregg...@gmail.com mailto:gregg...@gmail.com wrote: Cindy, will it ever be possible to just have attach mirror the surfaces, including the partition tables? I spent an hour today trying to get a new mirror on my root pool. There was a 250GB disk that failed. I only had a 1.5TB handy as a replacement. prtvtoc ... | fmthard does not work in this case and so you have to do the partitioning by hand, which is just silly to fight with anyway. Gregg Sent from my iPhone On Dec 15, 2011, at 6:13 PM, Tim Cook t...@cook.ms mailto:t...@cook.ms wrote: Do you still need to do the grub install? On Dec 15, 2011 5:40 PM, Cindy Swearingen cindy.swearin...@oracle.com mailto:cindy.swearin...@oracle.com wrote: Hi Anon, The disk that you attach to the root pool will need an SMI label and a slice 0. The syntax to attach a disk to create a mirrored root pool is like this, for example: # zpool attach rpool c1t0d0s0 c1t1d0s0 Thanks, Cindy On 12/15/11 16:20, Anonymous Remailer (austria) wrote: On Solaris 10 If I install using ZFS root on only one drive is there a way to add another drive as a mirror later? Sorry if this was discussed already. I searched the archives and couldn't find the answer. Thank you. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org mailto:zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org mailto:zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org mailto:zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org mailto:zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
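Pulling the thread together, the usual sequence for adding a root mirror looks roughly like this - a hedged sketch with example device names; the installgrub step is the answer to Tim's question on x86 (SPARC uses installboot instead):

# 1. Give the new disk an SMI label and a slice 0 covering the space you want
#    (format -e -> label -> SMI, then partition). prtvtoc | fmthard only helps
#    when both disks have the same geometry.
# 2. Attach the new slice to the existing root slice:
zpool attach rpool c1t0d0s0 c1t1d0s0
# 3. On x86, make the second disk bootable as well:
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0
# 4. Let the resilver finish before trusting (or pulling) either disk:
zpool status rpool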
Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!
I know some others may already have pointed this out - but I can't see it and not say something... Do you realise that losing a single disk in that pool could pretty much render the whole thing busted? At least for me - the rate at which _I_ seem to lose disks, it would be worth considering something different ;) Cheers! Nathan. On 12/19/11 09:05 AM, Jan-Aage Frydenbø-Bruvoll wrote: Hi, On Sun, Dec 18, 2011 at 22:00, Fajar A. Nugraha w...@fajar.net wrote: From http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide (or at least Google's cache of it, since it seems to be inaccessible now): Keep pool space under 80% utilization to maintain pool performance. Currently, pool performance can degrade when a pool is very full and file systems are updated frequently, such as on a busy mail server. Full pools might cause a performance penalty, but no other issues. If the primary workload is immutable files (write once, never remove), then you can keep a pool in the 95-96% utilization range. Keep in mind that even with mostly static content in the 95-96% range, write, read, and resilvering performance might suffer. I'm guessing that your nearly-full disk, combined with your usage pattern, is the cause of the slowdown. Try freeing up some space (e.g. make it about 75% full), just to be sure. I'm aware of the guidelines you refer to, and I have had slowdowns before due to the pool being too full, but that was in the 9x% range and the slowdown was in the order of a few percent. At the moment I am slightly above the recommended limit, and the performance is currently between 1/1000 and 1/2000 of what the other pools achieve - i.e. a few hundred kB/s versus 2GB/s on the other pools - surely allocation above 80% cannot carry such extreme penalties?! For the record - the read/write load on the pool is almost exclusively WORM. Best regards Jan ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
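For anyone wanting to check where they stand relative to that guideline, the stock reporting commands are enough (the pool name here is a placeholder):

zpool list                                  # CAP column = percent of the pool allocated
zfs list -o name,used,available -r tank     # per-dataset breakdown

On a pool that is both very full and very busy, it can also be worth watching these numbers over time - the slowdown usually tracks how hard the allocator has to hunt for free space, not the utilisation figure by itself.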
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote: On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote: Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware. The only vendor I know that can do this is Netapp And you really work at Oracle?:) The answer is definitely yes. ARC caches on-disk blocks and dedup just references those blocks. When you read, dedup code is not involved at all. Let me show it to you with a simple test: Create a file (dedup is on): # dd if=/dev/random of=/foo/a bs=1m count=1024 Copy this file so that it is deduped: # dd if=/foo/a of=/foo/b bs=1m Export the pool so all cache is removed and reimport it: # zpool export foo # zpool import foo Now let's read one file: # dd if=/foo/a of=/dev/null bs=1m 1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec) We read file 'a' and all its blocks are in cache now. The 'b' file shares all the same blocks, so if ARC caches blocks only once, reading 'b' should be much faster: # dd if=/foo/b of=/dev/null bs=1m 1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec) Now look at it, 'b' was read 12.5 times faster than 'a' with no disk activity. Magic?:) Hey all, That reminds me of something I have been wondering about... Why only 12x faster? If we are effectively reading from memory - as compared to a disk reading at approximately 100MB/s (which is about an average PC HDD reading sequentially), I'd have thought it should be a lot faster than 12x. Can we really only pull stuff from cache at only a little over one gigabyte per second if it's dedup data? Cheers! Nathan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import hangs - pls help
Hi Max, Unhelpful questions about your CPU aside, what else is your box doing? Can you run up a second or third shell (ssh or whatever) and watch whether the disks / system are doing any work? Were it Solaris, I'd run:
iostat -x
prstat -a
vmstat
mpstat (though as discussed, you have only a single core CPU)
echo ::memstat | mdb -k
(No idea how you might do that in BSD.) Some other things to think about: - Have you tried removing the extra memory? I have indeed seen crappy PC hardware where more than 3GB caused some really bad behaviour in Solaris. - Have you tried booting into a current Solaris (from CD) and seeing if it can import the pool? (Don't upgrade - just import) ;) I'm aware that there were some long import issues discussed on the list recently - someone had an import take some 12 hours or more - would be worth looking over the last few weeks' posts. Also - getting a truss or pstack (if FreeBSD has that?) of the process trying to initiate the import might help some of the more serious folks on the list to see where it's getting stuck. (Or if indeed, it's actually getting stuck, and not simply catastrophically slow.) (A rough sketch of both is below the quoted message.) Hope this helps at least a little. Cheers, Nathan. On 06/14/11 03:20 PM, Maximilian Sarte wrote: Hi, I am posting here in a tad of desperation. FYI, I am running FreeNAS 8.0. Anyhow, I created a raidz1 (tank1) with 4 x 2Tb WD EARS hdds. All was doing OK until I decided to up the RAM to 4 Gb since it is what was recommended. As soon as I re-started data migration, ZFS issued messages indicating that the pool was unavailable and froze the system. After reboot (FN is based on FreeBSD) and re-installing FN (it did not want to complete booting - probably a corruption on the USB stick it was running from), tank1 was unavailable. Status indicates that there are no pools, as does List. Import indicates that tank1 is OK and all 4 hdds are ONLINE and their status seems OK. When I try either: zpool import tank1, zpool import -f tank1, or zpool import -fF tank1, the commands simply hang forever (FreeNAS seems OK). Any suggestions would be immensely appreciated. Tx! ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
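The 'where is it stuck' check, spelled out - I'm more confident of the Solaris half than the FreeBSD half, so treat the latter as a hint rather than a recipe:

# FreeBSD / FreeNAS: userland state and kernel stack of the hung zpool process
ps axl | grep zpool                  # note the WCHAN (wait channel) column
procstat -kk $(pgrep zpool)          # kernel stack, where procstat is available

# Solaris (e.g. booted from a LiveCD to attempt the import):
pstack $(pgrep zpool)
echo "::pgrep zpool | ::walk thread | ::findstack -v" | mdb -k

A stack that keeps moving means the import is grinding through something (possibly a very long replay); a stack that never changes is the 'genuinely stuck' case worth posting to the list.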
Re: [zfs-discuss] zpool scrub on b123
Hi Karl, Is there any chance at all that some other system is writing to the drives in this pool? You say other things are writing to the same JBOD... Given that the amount flagged as corrupt is so small, I'd imagine not, but thought I'd ask the question anyways. Cheers! Nathan. On 04/16/11 04:52 AM, Karl Rossing wrote: Hi, One of our zfs volumes seems to be having some errors. So I ran zpool scrub and it's currently showing the following.
-bash-3.2$ pfexec /usr/sbin/zpool status -x
pool: vdipool
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub in progress for 3h10m, 13.53% done, 20h16m to go
config:
NAME        STATE  READ WRITE CKSUM
vdipool     ONLINE    0     0     0
  raidz1    ONLINE    0     0     0
    c9t14d0 ONLINE    0     0    12   6K repaired
    c9t15d0 ONLINE    0     0    13   167K repaired
    c9t16d0 ONLINE    0     0    11   5.50K repaired
    c9t17d0 ONLINE    0     0    20   10K repaired
    c9t18d0 ONLINE    0     0    15   7.50K repaired
spares
    c9t19d0 AVAIL
errors: No known data errors
I have another server connected to the same jbod using drives c8t1d0 to c8t13d0 and it doesn't seem to have any errors. I'm wondering how it could have gotten so screwed up? Karl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
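Were it mine, I'd also pull the per-device error counters and the raw ereports before pointing fingers - they usually tell you whether the damage came in over the transport or was found at rest (standard commands; only the pool name is specific here):

iostat -En                          # soft / hard / transport error counts per drive
fmdump -eV | grep -i class          # raw ereports: checksum vs. io vs. probe failures
zpool status -v vdipool             # lists any files touched by unrecoverable errors

Checksum errors spread fairly evenly across every disk in one raidz, while the other server on the same JBOD stays clean, tends to smell like a shared path (cable, expander port, controller) rather than five drives going bad at once.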
Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
Ed - Simple test. Get onto a system where you *can* disable the disk cache, disable it, and watch the carnage. Until you do that, you can pose as many interesting theories as you like. Bottom line is that 75 IOPS per spindle won't impress many people, and that's the sort of rate you get when you disable the disk cache. Nathan. On 8/03/2011 11:53 PM, Edward Ned Harvey wrote: From: Jim Dunham [mailto:james.dun...@oracle.com] ZFS only uses system RAM for read caching, If your email address didn't say oracle, I'd just simply come out and say you're crazy, but I'm trying to keep an open mind here... Correct me where the following statement is wrong: ZFS uses system RAM to buffer async writes. Sync writes must hit the ZIL first, and then the sync writes are put into the write buffer along with all the async writes to be written to the main pool storage. So after sync writes hit the ZIL and the device write cache is flushed, they too are buffered in system RAM. as all writes must be written to some form of stable storage before being acknowledged. If a vdev represents a whole disk, ZFS will attempt to enable write caching. If a device does not support write caching, the attempt to set wce fails silently. Here is an easy analogy to remember basically what you said: format -e can control the cache settings for c0t0d0, but cannot control the cache settings for c0t0d0s0 because s0 is not actually a device. I contend: Suppose you have a disk with on-disk write cache enabled. Suppose a sync write comes along, so ZFS first performs a sync write to some ZIL sectors. Then ZFS will issue the cache flush command and wait for it to complete before acknowledging the sync write; hence the disk write cache does not benefit sync writes. So then we start thinking about async writes, and conclude: The async writes were acknowledged long ago, when the async writes were buffered in ZFS system ram, so there is once again, no benefit from the disk write cache in either situation. That's my argument, unless somebody can tell me where my logic is wrong. Disk write cache offers zero benefit. And disk read cache only offers benefit in unusual cases that I would call esoteric. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
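For anyone who actually wants to run Nathan's experiment, the cache toggles live under format's expert mode on many Solaris setups - a sketch only, since whether the menu appears at all depends on the disk and the driver:

format -e
#   -> select the disk (the whole device, e.g. c0t0d0, not a slice)
#   -> cache -> write_cache -> display   (show the current state)
#   -> cache -> write_cache -> disable   (turn it off for the test)
#   -> cache -> write_cache -> enable    (put it back afterwards)

Then rerun whatever sync-heavy workload you were measuring and compare the per-spindle IOPS with and without the cache.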
Re: [zfs-discuss] How long should an empty destroy take? snv_134
Why wouldn't they try a reboot -d? That would at least get some data in the form of a crash dump if at all possible... A power cycle seems a little medieval to me... At least in the first instance. The other thing I have noted is that sometimes things do get wedged, and if you can find where (mdb -k and take a poke at the stack of some of the zfs/zpool commands that are hung to see what they were operating on), trying a zpool clear on that zpool can help. Not that I'm recommending that you should *need* to, but that has got me unwedged on occasion. (Though usually when I have done something administratively silly... ;) Nathan. On 7/03/2011 12:14 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Yaverot We're heading into the 3rd hour of the zpool destroy on others. The system isn't locked up, as it responds to local keyboard input, and I bet you, you're in a semi-crashed state right now, which will degrade into a full system crash. You'll have no choice but to power cycle. Prove me wrong, please. ;-) I bet, as soon as you type in any zpool or zfs command ... even list or status they will also hang indefinitely. Is your pool still 100% full? That's probably the cause. I suggest if possible, immediately deleting something and destroying an old snapshot to free up a little bit of space. And then you can move onward... While this destroy is running all other zpool/zfs commands appear to be hung. Oh, sorry, didn't see this before I wrote what I wrote above. This just further confirms what I said above. zpool destroy on an empty pool should be on the order of seconds, right? zpool destroy is instant, regardless of how much data there is in a pool. zfs destroy is instant for an empty volume, but zfs destroy takes a long time for a lot of data. But as mentioned above, that's irrelevant to your situation. Because your system is crashed, and even if you try init 0 or init 6... They will fail. You have no choice but to power cycle. For the heck of it, I suggest init 0 first. Then wait half an hour, and power cycle. Just to try and make the crash as graceful as possible. As soon as it comes back up, free up a little bit of space, so you can avoid a repeat. Yes, I've triple checked, I'm not destroying tank. While writing the email, I attempted a new ssh connection, it got to the Last login: line, but hasn't made it to the prompt. Oh, sorry, yet again this is confirming what I said above. semi-crashed and degrading into a full crash. Right now, you cannot open any new command prompts. Soon it will stop responding to ping. (Maybe 2-12 hours.) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
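The 'mdb -k and take a poke' step, spelled out a little - a sketch of the sort of thing I mean rather than a recipe:

# Find the hung zpool/zfs commands and look at where their threads are parked:
echo "::pgrep zpool | ::walk thread | ::findstack -v" | mdb -k
echo "::pgrep zfs   | ::walk thread | ::findstack -v" | mdb -k

# Stacks sitting in txg_wait_synced()/zio_wait() generally mean the pool itself
# is wedged or crawling, not the command. If you really do have to take the box
# down, at least capture a dump on the way:
reboot -d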
Re: [zfs-discuss] sorry everyone was: Re: External SATA drive enclosures + ZFS?
Actually, I find that tremendously encouraging. Lots of internal Oracle folks still subscribed to the list! Much better than none... ;) Nathan. On 02/26/11 03:29 PM, Yaverot wrote: Sorry all, didn't realize that half of Oracle would auto-reply to a public mailing list since they're out of the office 9:30 Friday nights. I'll try to make my initial post each month during daylight hours in the future. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SIL3114 and sparc solaris 10
I can confirm that on *at least* 4 different cards - from different board OEMs - I have seen single bit ZFS checksum errors that went away immediately after removing the 3114 based card. I stepped up to the 3124 (pci-x up to 133mhz) and 3132 (pci-e) and have never looked back. I now throw any 3114 card I find into the bin at the first available opportunity as they are a pile of doom waiting to insert an exploding garden gnome into the unsuspecting chest cavity of your data. I'd also add that I have never made an effort to determine if it was actually the Solaris driver that was at fault - but being that the other two cards I have mentioned are available for about $20 a pop, it's not worth my time. I don't recall if Solaris 10 (Sparc or X86) actually has the si3124 driver, but if it does, for a cheap thrill, they are worth a bash. I have no problems pushing 4 disks pretty much flat out on a PCI-X 133 3124 based card. (note that there was a pci and a pci-x version of the 3124, so watch out.) Cheers! Nathan. On 02/24/11 02:10 AM, Andrew Gabriel wrote: Krunal Desai wrote: On Wed, Feb 23, 2011 at 8:38 AM, Mauricio Tavares raubvo...@gmail.com wrote: I see what you mean; in http://mail.opensolaris.org/pipermail/opensolaris-discuss/2008-September/043024.html they claim it is supported by the uata driver. What would you suggest instead? Also, since I have the card already, how about if I try it out? My experience with SPARC is limited, but perhaps the Option ROM/BIOS for that card is intended for x86, and not SPARC? I might thinking of another controller, but this could be the case. You could always try to boot with the card; the worst that'll probably happen is boot hangs before the OS even comes into play. SPARC won't try to run the BIOS on the card anyway (it will only run OpenFirmware BIOS), but you will have to make sure the card has the non-RAID BIOS so that the PCI class doesn't claim it to be a RAID controller, which will prevent Solaris going anywhere near the card at all. These cards could be bought with either RAID or non-RAID BIOS, but RAID was more common. You can (or could some time back) download the RAID and non-RAID BIOS from Silicon Image and re-flash which also updates the PCI class, and I think you'll need a Windows system to actually flash the BIOS. You might want to do a google search on 3114 data corruption too, although it never hit me back when I used the cards. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] External SATA drive enclosures + ZFS?
I'm with the gang on this one as far as USB being the spawn of the devil for mass storage you want to depend on. I'd rather scoop my eyes out with a red hot spoon than depend on permanently attached USB storage... And - don't even start me on SPARC and USB storage... It's like watching pitch flow... (see http://en.wikipedia.org/wiki/Pitch_drop_experiment). I never spent too much time working out why - but I never seem to get better than about 10MB/s with SPARC+USB... When it comes to cheap... I use cheap external SATA/USB combo enclosures (single drive ones) as I like the flexibility of being able to use them in eSATA mode nice and fast (and reliable considering the $$) or in USB mode should I need to split a mirror off and read it on my laptop, which has no esata port... Also - using the single drive enclosures is by far the cheapest (at least here in Oz), and you get redundant power supplies, as they use their own mini brick AC/DC units. I'm currently very happy using 2TB disks in the external eSATA+USB thingies. I had been using ASTONE external eSATA/USB units - though it seems my local shop has stopped carrying them... I liked them as they had perforated side panels, which allow the disk to stay much cooler than some of my other enclosures... (And have a better 'vertical' stand if you want the disks to stand up, rather than lie on their side.) If your box has PCI-e slots, grab one or two $20 Silicon Image 3132 controllers with eSATA ports and you should be golden... You will then be able to run between 2 and 4 disks - easily pushing them to their maximum platter speed - which for most of the 2TB disks is near enough to 100MB/s at the outer edges. You will also get considerably higher IOPS - particularly when they are sequential - using eSATA. Note: All of this is with the 'cheap' view... You can most certainly buy much better hardware... But bang for buck - I have been happy with the above. Cheers! Nathan. On 02/26/11 01:58 PM, Brandon High wrote: On Fri, Feb 25, 2011 at 4:34 PM, Rich Teer rich.t...@rite-group.com wrote: Space is starting to get a bit tight here, so I'm looking at adding a couple of TB to my home server. I'm considering external USB or FireWire attached drive enclosures. Cost is a real issue, but I also I would avoid USB, since it can be less reliable than other connection methods. That's the impression I get from older posts made by Sun devs, at least. I'm not sure how well Firewire 400 is supported, let alone Firewire 800. You might want to consider eSATA. Port multipliers are supported in recent builds (128+ I think), and will give better performance than USB. I'm not sure if PMP are supported on Sparc though, since it requires support in both the controller and PMP. Consider enclosures from other manufacturers as well. I've heard good things about Sans Digital, but I've never used them. The 2-drive enclosure has the same components as the item you linked but 1/2 the cost via Newegg. The intent would be to put two 1TB or 2TB drives in the enclosure and use ZFS to create a mirrored pool out of them. Assuming this enclosure is set to JBOD mode, would I be able to use this with ZFS? The enclosure Yes, but I think the enclosure has a SiI5744 inside it, so you'll still have one connection from the computer to the enclosure. If that goes, you'll lose both drives. If you're just using two drives, two separate enclosures on separate buses may be better. Look at http://www.sansdigital.com/towerstor/ts1ut.html for instance. There are also larger enclosures with up to 8 drives. 
I can't think of a reason why it wouldn't work, but I also have exactly zero experience with this kind of set up! Like I mentioned, USB is prone to some flakiness. Assuming this would work, given that I can't see to find a 4-drive version of it, would I be correct in thinking that I could buy two of You might be better off using separate enclosures for reliability. Make sure to split the mirrors across the two devices. Use separate USB controllers if possible, so a bus reset doesn't affect both sides. Assuming my proposed enclosure would work, and assuming the use of reasonable quality 7200 RPM disks, how would you expect the performance to compare with the differential UltraSCSI set up I'm currently using? I think the DWIS is rated at either 20MB/sec or 40MB/sec, so on the surface, the USB attached drives would seem to be MUCH faster... USB 2.0 is about 30-40MB/s under ideal conditions, but doesn't support any of the command queuing that SCSI does. I'd expect performance to be slightly lower, and to use slightly more CPU. Most USB controllers don't support DMA, so all I/O requires CPU time. What about an inexpensive SAS card (eg: Supermicro AOC-USAS-L4i) and external SAS enclosure (eg: Sans Digital TowerRAID TR4X). It would cost about $350 for the setup. -B ___ zfs-discuss mailing list zfs-discuss@opensolaris.org
Re: [zfs-discuss] ZFS read/write fairness algorithm for single pool
Thanks for all the thoughts, Richard. One thing that still sticks in my craw is that I'm not wanting to write intermittently. I'm wanting to write flat out, and those writes are being held up... Seems to me that zfs should know and do something about that without me needing to tune zfs_vdev_max_pending... Nonetheless, I'm now at a far more balanced point than when I started, so that's a good thing. :) Cheers, Nathan. On 15/02/2011 6:44 AM, Richard Elling wrote: Hi Nathan, comments below... On Feb 13, 2011, at 8:28 PM, Nathan Kroenert wrote: On 14/02/2011 4:31 AM, Richard Elling wrote: On Feb 13, 2011, at 12:56 AM, Nathan Kroenertnat...@tuneunix.com wrote: Hi all, Exec summary: I have a situation where I'm seeing lots of large reads starving writes from being able to get through to disk. snip What is the average service time of each disk? Multiply that by the average active queue depth. If that number is greater than, say, 100ms, then the ZFS I/O scheduler is not able to be very effective because the disks are too slow. Reducing the active queue depth can help, see zfs_vdev_max_pending in the ZFS Evil Tuning Guide. Faster disks helps, too. NexentaStor fans, note that you can do this easily, on the fly, via the Settings - Preferences - System web GUI. -- richard Hi Richard, Long time no speak! Anyhoo - See below. I'm unconvinced that faster disks would help. I think faster disks, at least in what I'm observing, would make it suck just as bad, just reading faster... ;) Maybe I'm missing something. Faster disks always help :-) Queue depth is around 10 (default and unchanged since install), and average service time is about 25ms... Below are 1 second samples with iostat - while I have included only about 10 seconds, it's representative of what I'm seeing all the time. extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 360.9 13.0 46190.5 351.4 0.0 10.0 26.7 1 100 sd7 342.9 12.0 43887.3 329.9 0.0 10.0 28.1 1 100 ok, we'll take sd6 as an example (the math is easy :-) ... actv = 10 svc_t = 26.7 actv * svc_t = 267 milliseconds This is the queue at the disk. ZFS manages its own queue for the disk, but once it leaves ZFS, there is no way for ZFS to manage it. In the case of the active queue, the I/Os have left the OS, so even the OS is unable to change what is in the queue or directly influence when the I/Os will be finished. In ZFS, the queue has a priority scheduler and does place a higher priority on async writes than async reads (since b130 or so). But what you can see is that the intermittent nature of the async writes get stuck behind the 267 milliseconds as the queue drains the reads. [no, I'm not sure if that makes sense, try again...] If it sends reads continuously and writes occasionally, it will appear that reads have much more domination. In older releases, when the reads and writes had the same priority, this looks even worse. 
extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 422.10.0 54025.00.0 0.0 10.0 23.6 1 100 sd7 422.10.0 54025.00.0 0.0 10.0 23.6 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 370.0 11.0 47360.4 342.0 0.0 10.0 26.2 1 100 sd7 327.0 16.0 41856.4 632.0 0.0 9.6 28.0 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 388.07.0 49406.4 290.0 0.0 9.8 24.8 1 100 sd7 409.01.0 52350.32.0 0.0 9.5 23.2 1 99 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 423.00.0 54148.60.0 0.0 10.0 23.6 1 100 sd7 413.00.0 52868.50.0 0.0 10.0 24.2 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 400.02.0 51081.22.0 0.0 10.0 24.8 1 100 sd7 384.04.0 49153.24.0 0.0 10.0 25.7 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 401.91.0 51448.98.0 0.0 10.0 24.8 1 100 sd7 424.90.0 54392.40.0 0.0 10.0 23.5 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 215.1 208.1 26751.9 25433.5 0.0 9.3 22.1 1 100 sd7 189.1 216.1 24199.1 26833.9 0.0 8.9 22.1 1 91 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 295.0 162.0 37756.8 20610.2 0.0 10.0 21.8 1 100 sd7 307.0 150.0 39292.6 19198.4 0.0 10.0 21.8 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 405.02.0 51843.86.0
[zfs-discuss] ZFS read/write fairness algorithm for single pool
Hi all, Exec summary: I have a situation where I'm seeing lots of large reads starving writes from being able to get through to disk. Some detail: I have a newly constructed box (was an old box, but blew the mobo - different story - sigh). Anyhoo - It's a Gigabyte 890GPA-UD3H - with lots of onboard SATA - and an HP P400 Raid controller (PCI-E, 512MB, Battery Backed, presenting 2 spindles, as single member stripes, so, yeah, the nearest thing to JBOD that this controller gets to) pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x103c device 0x3230 Hewlett-Packard Company Smart Array Controller And it's off this HP controller I'm hanging my data zpool. config:
NAME        STATE  READ WRITE CKSUM
data        ONLINE    0     0     0
  mirror-0  ONLINE    0     0     0
    c0t0d0  ONLINE    0     0     0
    c0t1d0  ONLINE    0     0     0
CPU is an AMD Phenom II, 6 core 1075T, for what it's worth. I guess my problem is more one that the ZFS folks should be aware of rather than something directly impacting me, as the workload I have created is not something I typically see - but it is something I see easily impacting customers - and in a nasty way should they encounter it. It *is* also a case I'll create from time to time - when I'm moving DVD images backwards and forwards... I was stress testing the box, giving the new kit's legs a stretch, and kicked off the following: - create a test file to use as source for my 'full speed streaming write' (lazy way) - dd if=/dev/urandom of=/tmp/1 (and let that run for a few seconds, creating about 100MB of random junk.) - start some jobs - while :; do cat /tmp/1 >> /data/delete.me/2; done (The write workload, which is fine and dandy by itself) - while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done Before I kicked off the read workload, everything looked as expected. I was getting between 40 and 60MB/s to each of the disks and all was good. BUT - As soon as I introduced the read workload, my write throughput dropped to virtually zero, and remained there until the read workload was killed. The starvation is immediate. I can 100% reproducibly go from many MB/s of write throughput with no read workload to virtually 0MB/s write throughput, simply through kicking off that reading dd. Write performance picks up again as soon as I kill the read workload. It also behaves the same way if the file I'm reading is NOT the same one I'm writing to. (eg: cat writing to file 3 and the dd reading file 2) Other things to know about the system: - Disks are Seagate 2TB, 512 byte sector SATA disks - OS is Solaris 11 Express (build 151a) - zpool version is old. I'm still hedging my bets on having to go back to Nevada (sxce, build 124 or so, which is what I was at before installing s11express) Cached configuration: version: 19 - Plenty of space remains in the pool -
bash-4.0$ zpool list
NAME   SIZE  ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
data  1.81T  1.34T  480G   74%  1.00x  ONLINE  -
- The box has 8GB of memory - and ZFS is getting a fair whack at it. 
::memstat
Page Summary       Pages     MB    %Tot
Kernel             211843    827    11%
ZFS File Data      1426054   5570   73%
Anon               106814    417     5%
Exec and libs      9364      36      0%
Page cache         47192     184     2%
Free (cachelist)   31448     122     2%
Free (freelist)    130431    509     7%
Total              1963146   7668
Physical           1963145   7668
- Rest of the zfs dataset properties:
# zfs get all data
NAME  PROPERTY       VALUE                  SOURCE
data  type           filesystem             -
data  creation       Mon May 24 10:46 2010  -
data  used           1.34T                  -
data  available      451G                   -
data  referenced     500G                   -
data  compressratio  1.02x                  -
data  mounted        yes                    -
data  quota          none                   default
data  reservation    none                   default
data  recordsize     128K                   default
data  mountpoint     /data                  default
data  sharenfs       ro,anon=0              local
data  checksum       on                     default
data  compression    off                    local
data  atime          off                    local
data  devices        on                     default
data  exec
Re: [zfs-discuss] ZFS read/write fairness algorithm for single pool
Hi Steve, Thanks for the thoughts - I think that everything you asked about is in the original email - but for reference again, it's 151a (s11 express). Are you really suggesting, for a single user system, I need 16GB of memory just to get ZFS to be able to write when it's reading? (and even then, that would be contingent on you getting repeat, cached hits on the ARC). That's hardly sensible, and anything but enterprise. I know I'm only talking about my little baby box at the moment, but extend that to a large database application, and I'm seeing badness all round. Worse - If I'm reading a 45GB contiguous file (say, HD video), the only way an ARC will help me is if I have 64GB, and have read it in the past... especially if I'm reading it sequentially. That's inconceivable!! (cue reference to the Princess Bride :). I'd also add that for the most part, 8GB is plenty for ZFS, and there are a lot of Sun/Oracle customers using it now in LDOM environments where 8GB is just great in the control/IO domain. I don't think trying to blame the system in this case is the right answer. ZFS schedules the read/write activities, and to me it seems that it's just not doing that fairly. I was suspicious of the impact the HP Raid controller is having - and how it might be reacting to what's being pushed at it, so I re-created exactly this problem again on a different system with native non-cached SATA controllers. The issue is identical. (Though I have since determined that my HP raid controller is actually *slowing* my reads and writes to disk! ;) Cheers! Nathan. On 14/02/2011 4:08 AM, gon...@comcast.net wrote: Hi Nathan, Maybe it is buried somewhere in your email, but I did not see what zfs version you are using. This is rather important, because the 145+ kernels work a lot better in many ways than the early ones (say 134-ish). So whenever you are reporting various ZFS issues, something like `uname -a` to report the kernel rev is most useful. Writes starved by reads has been a complaint in early ZFS; I certainly do not see any evidence of this in the 145+ kernels. There is a fair amount of tuning and configuration that can be done (adding SSDs to your pool, zil vs no zil, how caching is configured, i.e. what to cache...). 8 Gig is not a lot of memory for ZFS; I would recommend double of that. If all goes well, most reads would be satisfied from ARC, and not interfere with writes. Steve ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS read/write fairness algorithm for single pool
On 14/02/2011 4:31 AM, Richard Elling wrote: On Feb 13, 2011, at 12:56 AM, Nathan Kroenertnat...@tuneunix.com wrote: Hi all, Exec summary: I have a situation where I'm seeing lots of large reads starving writes from being able to get through to disk. snip What is the average service time of each disk? Multiply that by the average active queue depth. If that number is greater than, say, 100ms, then the ZFS I/O scheduler is not able to be very effective because the disks are too slow. Reducing the active queue depth can help, see zfs_vdev_max_pending in the ZFS Evil Tuning Guide. Faster disks helps, too. NexentaStor fans, note that you can do this easily, on the fly, via the Settings - Preferences - System web GUI. -- richard Hi Richard, Long time no speak! Anyhoo - See below. I'm unconvinced that faster disks would help. I think faster disks, at least in what I'm observing, would make it suck just as bad, just reading faster... ;) Maybe I'm missing something. Queue depth is around 10 (default and unchanged since install), and average service time is about 25ms... Below are 1 second samples with iostat - while I have included only about 10 seconds, it's representative of what I'm seeing all the time. extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 360.9 13.0 46190.5 351.4 0.0 10.0 26.7 1 100 sd7 342.9 12.0 43887.3 329.9 0.0 10.0 28.1 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 422.10.0 54025.00.0 0.0 10.0 23.6 1 100 sd7 422.10.0 54025.00.0 0.0 10.0 23.6 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 370.0 11.0 47360.4 342.0 0.0 10.0 26.2 1 100 sd7 327.0 16.0 41856.4 632.0 0.0 9.6 28.0 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 388.07.0 49406.4 290.0 0.0 9.8 24.8 1 100 sd7 409.01.0 52350.32.0 0.0 9.5 23.2 1 99 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 423.00.0 54148.60.0 0.0 10.0 23.6 1 100 sd7 413.00.0 52868.50.0 0.0 10.0 24.2 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 400.02.0 51081.22.0 0.0 10.0 24.8 1 100 sd7 384.04.0 49153.24.0 0.0 10.0 25.7 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 401.91.0 51448.98.0 0.0 10.0 24.8 1 100 sd7 424.90.0 54392.40.0 0.0 10.0 23.5 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 215.1 208.1 26751.9 25433.5 0.0 9.3 22.1 1 100 sd7 189.1 216.1 24199.1 26833.9 0.0 8.9 22.1 1 91 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 295.0 162.0 37756.8 20610.2 0.0 10.0 21.8 1 100 sd7 307.0 150.0 39292.6 19198.4 0.0 10.0 21.8 1 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b sd6 405.02.0 51843.86.0 0.0 10.0 24.5 1 100 sd7 408.03.0 52227.8 10.0 0.0 10.0 24.3 1 100 Bottom line is that ZFS does not seem to be caring about getting my writes to disk when there is a heavy read workload. I have also confirmed that it's not the RAID controller either - behaviour is identical with direct attach SATA. But - to your excellent theory: Setting zfs_vdev_max_pending to 1 causes things to swing dramatically! - At 1, writes proceed much more than reads - 20mb/s read per spindle:35mb/s write per spindle - At 2, writes still outstrip reads - 15mb/s read per spindle:44mb/s - At 3, it's starting to lean more heavily to reads again, but writes at least get a whack - 35mb/s per spindle read:15-20mb/s write. 
- At 4, we are closer to 35-40mb/s read, 15mb/s write By the time we get back to the default of 0xa, writes drop off almost completely. The crossover (on the box with no RAID controller) seems to be 5. Anything more than that, and writes get shouldered out the way almost completely. So - aside from the obvious - manually setting zfs_vdev_max_pending - do you have any thoughts on ZFS being able to make this sort of determination by itself? It would be somewhat of a shame to bust out such 'whacky knobs' for plain old direct attach SATA disks to get balance... Also - can I set this property per-vdev? (just in case I have sata and, say, a USP-V connected to the same box)? Thanks again, and good to see you are still playing close by! Cheers! Nathan. pci bus 0x0002 cardnum 0x00 function 0x00: vendor
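For reference, the knob under discussion can be poked live or persisted - this is the standard Evil-Tuning-Guide-style tuning, and as far as I know it is global rather than per-vdev, which is exactly the limitation being complained about here:

# Live, on a running system (affects I/O queued from then on):
echo zfs_vdev_max_pending/W0t4 | mdb -kw

# Check the current value:
echo zfs_vdev_max_pending/D | mdb -k

# Persist across reboots by adding to /etc/system:
set zfs:zfs_vdev_max_pending = 4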
Re: [zfs-discuss] ZFS Honesty after a power failure
Hey, Dennis - I can't help but wonder if the failure is a result of zfs itself finding some problems post restart... Is there anything in your FMA logs? fmstat for a summary and fmdump for a summary of the related errors, eg:
drteeth:/tmp # fmdump
TIME                 UUID                                 SUNW-MSG-ID
Nov 03 13:57:29.4190 e28210d7-b7aa-42e0-a3e8-9ba21332d1c7 ZFS-8000-D3
Nov 03 13:57:29.9921 916ce3e2-0c5c-e335-d317-ba1e8a93742e ZFS-8000-D3
Nov 03 14:04:58.8973 ff2f60f8-2906-676a-bfb7-ccbd9c7f957d ZFS-8000-CS
Mar 05 18:04:40.7116 ff2f60f8-2906-676a-bfb7-ccbd9c7f957d FMD-8000-4M Repaired
Mar 05 18:04:40.7875 ff2f60f8-2906-676a-bfb7-ccbd9c7f957d FMD-8000-6U Resolved
Mar 05 18:04:41.0052 e28210d7-b7aa-42e0-a3e8-9ba21332d1c7 FMD-8000-4M Repaired
Mar 05 18:04:41.0760 e28210d7-b7aa-42e0-a3e8-9ba21332d1c7 FMD-8000-6U Resolved
then for example, fmdump -vu e28210d7-b7aa-42e0-a3e8-9ba21332d1c7 and fmdump -Vvu e28210d7-b7aa-42e0-a3e8-9ba21332d1c7 will show more and more information about the error. Note that some of it might seem like rubbish. The important bits should be obvious though - things like the SUNW message ID (like ZFS-8000-D3), which can be pumped into sun.com/msg to see what exactly it's going on about. Note also that there should be something interesting in the /var/adm/messages log to match any 'faulted' devices. You might also find an fmdump -e and fmdump -eV to be interesting - This is the *error* log as opposed to the *fault* log. (Every 'thing that goes wrong' is an error; only those that are diagnosed are considered a fault.) Note that in all of these fm[dump|stat] commands, you are really only looking at the two sets of data. The errors - that is, the telemetry incoming to FMA - and the faults. If you include a -e, you view the errors; otherwise, you are looking at the faults. By the way - sun.com/msg has a great PDF on it about the predictive self healing technologies in Solaris 10 and will offer more interesting information. Would be interesting to see *why* ZFS / FMA is feeling the need to fault your devices. I was interested to see on one of my boxes that I have actually had a *lot* of errors, which I'm now going to have to investigate... Looks like I have a dud rocket in my system... :) Oh - And I saw this: Nov 03 14:04:31.2783 ereport.fs.zfs.checksum Score one more for ZFS! This box has a measly 300GB mirrored, and I have already seen dud data. (heh... It's also got non-ECC memory... ;) Cheers! Nathan. Dennis Clarke wrote: On Tue, 24 Mar 2009, Dennis Clarke wrote: You would think so eh? But a transient problem that only occurs after a power failure? Transient problems are most common after a power failure or during initialization. Well the issue here is that power was on for ten minutes before I tried to do a boot from the ok prompt. Regardless, the point is that the ZPool shows no faults at boot time and then shows phantom faults *after* I go to init 3. That does seem odd. Dennis ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert nathan.kroen...@sun.com // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax: +61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456 // // Melbourne 3004 Victoria Australia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] reboot when copying large amounts of data
definitely time to bust out some mdb -k and see what it's moaning about. I did not see the screenshot earlier... sorry about that. Nathan. Blake wrote: I start the cp, and then, with prstat -a, watch the cpu load for the cp process climb to 25% on a 4-core machine. Load, measured for example with 'uptime', climbs steadily until the reboot. Note that the machine does not dump properly, panic or hang - rather, it reboots. I attached a screenshot earlier in this thread of the little bit of error message I could see on the console. The machine is trying to dump to the dump zvol, but fails to do so. Only sometimes do I see an error on the machine's local console - mos times, it simply reboots. On Thu, Mar 12, 2009 at 1:55 AM, Nathan Kroenert nathan.kroen...@sun.com wrote: Hm - Crashes, or hangs? Moreover - how do you know a CPU is pegged? Seems like we could do a little more discovery on what the actual problem here is, as I can read it about 4 different ways. By this last piece of information, I'm guessing the system does not crash, but goes really really slow?? Crash == panic == we see stack dump on console and try to take a dump hang == nothing works == no response - might be worth looking at mdb -K or booting with a -k on the boot line. So - are we crashing, hanging, or something different? It might simply be that you are eating up all your memory, and your physical backing storage is taking a while to catch up? Nathan. Blake wrote: My dump device is already on a different controller - the motherboards built-in nVidia SATA controller. The raidz2 vdev is the one I'm having trouble with (copying the same files to the mirrored rpool on the nVidia controller work nicely). I do notice that, when using cp to copy the files to the raidz2 pool, load on the machine climbs steadily until the crash, and one proc core pegs at 100%. Frustrating, yes. On Thu, Mar 12, 2009 at 12:31 AM, Maidak Alexander J maidakalexand...@johndeere.com wrote: If you're having issues with a disk contoller or disk IO driver its highly likely that a savecore to disk after the panic will fail. I'm not sure how to work around this, maybe a dedicated dump device not on a controller that uses a different driver then the one that you're having issues with? -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Blake Sent: Wednesday, March 11, 2009 4:45 PM To: Richard Elling Cc: Marc Bevand; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] reboot when copying large amounts of data I guess I didn't make it clear that I had already tried using savecore to retrieve the core from the dump device. I added a larger zvol for dump, to make sure that I wasn't running out of space on the dump device: r...@host:~# dumpadm Dump content: kernel pages Dump device: /dev/zvol/dsk/rpool/bigdump (dedicated) Savecore directory: /var/crash/host Savecore enabled: yes I was using the -L option only to try to get some idea of why the system load was climbing to 1 during a simple file copy. On Wed, Mar 11, 2009 at 4:58 PM, Richard Elling richard.ell...@gmail.com wrote: Blake wrote: I'm attaching a screenshot of the console just before reboot. The dump doesn't seem to be working, or savecore isn't working. On Wed, Mar 11, 2009 at 11:33 AM, Blake blake.ir...@gmail.com wrote: I'm working on testing this some more by doing a savecore -L right after I start the copy. savecore -L is not what you want. By default, for OpenSolaris, savecore on boot is disabled. 
But the core will have been dumped into the dump slice, which is not used for swap. So you should be able to run savecore at a later time to collect the core from the last dump. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert nathan.kroen...@sun.com // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax:+61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456// // Melbourne 3004 VictoriaAustralia // // -- // // Nathan Kroenert nathan.kroen...@sun.com // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax:+61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456// // Melbourne 3004 VictoriaAustralia
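As a footnote for anyone else stuck at the same point, pulling a dump off the dedicated dump zvol after the next boot is roughly this (directory is an example):

# confirm the dump device and savecore directory
dumpadm

# extract the most recent dump from the dump device by hand
mkdir -p /var/crash/host
savecore -v /var/crash/host

# then poke at the result with mdb
mdb /var/crash/host/unix.0 /var/crash/host/vmcore.0

Of course, none of that helps if the panic never makes it onto the dump device in the first place, which looks like the real problem in this thread.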
Re: [zfs-discuss] reboot when copying large amounts of data
For what it's worth, I have been running Nevada (so, same kernel as opensolaris) for ages (at least 18 months) on a Gigabyte board with the MCP55 chipset and it's been flawless. I liked it so much, I bought it's newer brother, based on the nvidia 750SLI chipset... M750SLI-DS4 Cheers! Nathan. On 13/03/09 09:21 AM, Dave wrote: Tim wrote: On Thu, Mar 12, 2009 at 2:22 PM, Blake blake.ir...@gmail.com mailto:blake.ir...@gmail.com wrote: I've managed to get the data transfer to work by rearranging my disks so that all of them sit on the integrated SATA controller. So, I feel pretty certain that this is either an issue with the Supermicro aoc-sat2-mv8 card, or with PCI-X on the motherboard (though I would think that the integrated SATA would also be using the PCI bus?). The motherboard, for those interested, is an HD8ME-2 (not, I now find after buying this box from Silicon Mechanics, a board that's on the Solaris HCL...) http://www.supermicro.com/Aplus/motherboard/Opteron2000/MCP55/h8dme-2.cfm So I'm not considering one of LSI's HBA's - what do list members think about this device: http://www.provantage.com/lsi-logic-lsi00117~7LSIG03X.htm http://www.provantage.com/lsi-logic-lsi00117%7E7LSIG03X.htm I believe the MCP55's SATA controllers are actually PCI-E based. I use Tyan 2927 motherboards. They have on-board nVidia MCP55 chipsets, which is the same chipset at the X4500 (IIRC). I wouldn't trust the MCP55 chipset in OpenSolaris. I had random disk hangs even while the machine was mostly idle. In Feb 2008 I bought AOC-SAT2-MV8 cards and moved all my drives to these add-in cards. I haven't had any issues with drive hanging since. There does not seem to be any problems with the SAT2-MV8 under heavy load in my servers from what I've seen. When the SuperMicro AOC-USAS-L8i came out later last year, I started using them instead. They work better than the SAT2-MV8s. This card needs a 3U or bigger case: http://www.supermicro.com/products/accessories/addon/AOC-USAS-L8i.cfm This is the low profile card that will fit in a 2U: http://www.supermicro.com/products/accessories/addon/AOC-USASLP-L8i.cfm They both work in normal PCI-E slots on my Tyan 2927 mobos. Finding good non-Sun hardware that works very well under OpenSolaris is frustrating to say the least. Good luck. -- Dave ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert nathan.kroen...@sun.com // // Senior Systems Engineer Phone: +61 3 9869 6255 // // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] reboot when copying large amounts of data
Hm - Crashes, or hangs? Moreover - how do you know a CPU is pegged? Seems like we could do a little more discovery on what the actual problem here is, as I can read it about 4 different ways. By this last piece of information, I'm guessing the system does not crash, but goes really really slow?? Crash == panic == we see stack dump on console and try to take a dump hang == nothing works == no response - might be worth looking at mdb -K or booting with a -k on the boot line. So - are we crashing, hanging, or something different? It might simply be that you are eating up all your memory, and your physical backing storage is taking a while to catch up? Nathan. Blake wrote: My dump device is already on a different controller - the motherboards built-in nVidia SATA controller. The raidz2 vdev is the one I'm having trouble with (copying the same files to the mirrored rpool on the nVidia controller work nicely). I do notice that, when using cp to copy the files to the raidz2 pool, load on the machine climbs steadily until the crash, and one proc core pegs at 100%. Frustrating, yes. On Thu, Mar 12, 2009 at 12:31 AM, Maidak Alexander J maidakalexand...@johndeere.com wrote: If you're having issues with a disk contoller or disk IO driver its highly likely that a savecore to disk after the panic will fail. I'm not sure how to work around this, maybe a dedicated dump device not on a controller that uses a different driver then the one that you're having issues with? -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Blake Sent: Wednesday, March 11, 2009 4:45 PM To: Richard Elling Cc: Marc Bevand; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] reboot when copying large amounts of data I guess I didn't make it clear that I had already tried using savecore to retrieve the core from the dump device. I added a larger zvol for dump, to make sure that I wasn't running out of space on the dump device: r...@host:~# dumpadm Dump content: kernel pages Dump device: /dev/zvol/dsk/rpool/bigdump (dedicated) Savecore directory: /var/crash/host Savecore enabled: yes I was using the -L option only to try to get some idea of why the system load was climbing to 1 during a simple file copy. On Wed, Mar 11, 2009 at 4:58 PM, Richard Elling richard.ell...@gmail.com wrote: Blake wrote: I'm attaching a screenshot of the console just before reboot. The dump doesn't seem to be working, or savecore isn't working. On Wed, Mar 11, 2009 at 11:33 AM, Blake blake.ir...@gmail.com wrote: I'm working on testing this some more by doing a savecore -L right after I start the copy. savecore -L is not what you want. By default, for OpenSolaris, savecore on boot is disabled. But the core will have been dumped into the dump slice, which is not used for swap. So you should be able to run savecore at a later time to collect the core from the last dump. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert nathan.kroen...@sun.com // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax:+61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456// // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] schedulers [was: zfs related google summer of code ideas - your vote]
Hm - a ZilArc?? Or, slarc? Or L2ArZi? I've tried something sort of similar to this when fooling around, adding different *slices* for ZIL / L2ARC but as I'm too poor to afford good SSD's my results were poor at best... ;) Having ZFS manage some 'arbitrary fast stuff' and sorting out its own ZIL and L2ARC would be interesting, though, given the propensity for SSD's to be either fast read or fast write at the moment, you may well require some whacky knobs to get it to do what you actually want it to... hm. Nathan. Bill Sommerfeld wrote: On Wed, 2009-03-04 at 12:49 -0800, Richard Elling wrote: But I'm curious as to why you would want to put both the slog and L2ARC on the same SSD? Reducing part count in a small system. For instance: adding L2ARC+slog to a laptop. I might only have one slot free to allocate to ssd. IMHO the right administrative interface for this is for zpool to allow you to add the same device to a pool as both cache and ssd, and let zfs figure out how to not step on itself when allocating blocks. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- /// // Nathan Kroenert nathan.kroen...@sun.com // // Senior Systems Engineer Phone:+61 3 9869 6255// // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia// /// ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] destroy means destroy, right?
For years, we resisted stopping rm -r / because people should know better, until *finally* someone said - you know what - that's just dumb. Then, just like that, it was fixed. Yes - This is Unix. Yes - Provide the gun and allow the user to point it. Just don't let it go off in their groin or when pointed at their foot, or provide at least some protection when they do. Having even limited amount of restore capability will provide the user with steel capped boots and a codpiece. It won't protect them from herpes or fungus but it might deflect the bullet. On 01/30/09 08:19, Jacob Ritorto wrote: I like that, although it's a bit of an intelligence insulter. Reminds me of the old pdp11 install ( http://charles.the-haleys.org/papers/setting_up_unix_V7.pdf ) -- This step makes an empty file system. 6.The next thing to do is to restore the data onto the new empty file system. To do this you respond to the ':' printed in the last step with (bring in the program restor) : tm(0,4) ('ht(0,4)' for TU16/TE16) tape? tm(0,5) (use 'ht(0,5)' for TU16/TE16) disk? rp(0,0)(use 'hp(0,0)' for RP04/5/6) Last chance before scribbling on disk. (you type return) (the tape moves, perhaps 5-10 minutes pass) end of tape Boot : You now have a UNIX root file system. On Thu, Jan 29, 2009 at 3:42 PM, Orvar Korvar knatte_fnatte_tja...@yahoo.com wrote: Maybe add a timer or something? When doing a destroy, ZFS will keep everything for 1 minute or so, before overwriting. This way the disk won't get as fragmented. And if you had fat fingers and typed wrong, you have up to one minute to undo. That will catch 80% of the mistakes? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert nathan.kroen...@sun.com // // Senior Systems Engineer Phone: +61 3 9869 6255 // // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New RAM disk from ACARD might be interesting
As it presents as standard SATA, there should be no reason for this not to work... It has battery backup, and CF for backup / restore from DDR2 in the event of power loss... Pretty cool. (Would have preferred a super-cap, but oh, well... ;) Should make an excellent ZIL *and* L2ARC style device... Seems a little pricey for what it is though. It's going onto my list of what I'd buy if I had the money... ;) Nathan. On 01/30/09 12:10, Janåke Rönnblom wrote: ACARD have launched a new RAM disk which can take up to 64 GB of ECC RAM while still looking like a standard SATA drive. If anyone remember the Gigabyte I-RAM this might be a new development in this area. Its called ACARD ANS-9010 and up... http://www.acard.com.tw/english/fb01-product.jsp?idno_no=270prod_no=ANS-9010type1_title=%20Solid%20State%20Drivetype1_idno=13 This might be interesting to use as a cheap log instead of SSD cards... This test compares it with both Intel SSD (consumer and pro): http://www.techreport.com/articles.x/16255/1 However the test is more from a homeuser point of view... Anyone got the money and time to test it ;) -J -- // // Nathan Kroenert nathan.kroen...@sun.com // // Senior Systems Engineer Phone: +61 3 9869 6255 // // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New RAM disk from ACARD might be interesting
You could be the first... Man up! ;) Nathan. Will Murnane wrote: On Thu, Jan 29, 2009 at 21:11, Nathan Kroenert nathan.kroen...@sun.com wrote: Seems a little pricey for what it is though. For what it's worth, there's also a 9010B model that has only one sata port and room for six dimms instead of eight at $250 instead of $400. That might fit in your budget a little easier... I'm considering one for a log device. I wish someone else could test it first and report problems, but someone's gotta take the jump first. It looks like this device (the 9010, that is) is also being marketed as the HyperDrive V at the same price point. Will ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] destroy means destroy, right?
I'm no authority, but I believe it's gone. Some of the others on the list might have some funky thoughts, but I would suggest that if you have already done any other I/Os to the disk, you have likely rolled past the point of no return. Anyone else care to comment? As a side note, I had a look for anything that looked like a CR for zfs destroy / undestroy and could not find one. Anyone interested in me submitting an RFE to have something like a zfs undestroy pool/fs capability? Clearly, there would be limitations in how long you would have to get the command to work, but it would have its merits... Cheers! Nathan. Jacob Ritorto wrote: Hi, I just said zfs destroy pool/fs, but meant to say zfs destroy pool/junk. Is 'fs' really gone? thx jake ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert nathan.kroen...@sun.com // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax:+61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456// // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is Disabling ARC on SolarisU4 possible?
Also - My experience with a very small ARC is that your performance will stink. ZFS is an advanced filesystem that IMO makes some assumptions about capability and capacity of current hardware. If you don't give what it's expecting, your results may be equally unexpected. If you are keen to test the *actual* disk performance, you should just use the underlying disk device like /dev/rdsk/c0t0d0s0 Beware, however, that any writes to these devices will indeed result in the loss of the data on those devices, zpools or other. Cheers. Nathan. Richard Elling wrote: Rob Brown wrote: Afternoon, In order to test my storage I want to stop the cacheing effect of the ARC on a ZFS filesystem. I can do similar on UFS by mounting it with the directio flag. No, not really the same concept, which is why Roch wrote http://blogs.sun.com/roch/entry/zfs_and_directio I saw the following two options on a nevada box which presumably control it: primarycache secondarycache Yes, to some degree this offers some capability. But I don't believe they are in any release of Solaris 10. -- richard But I’m running Solaris 10U4 which doesn’t have them -can I disable it? Many thanks Rob *|* *Robert Brown - **ioko *Professional Services *| | **Mobile:* +44 (0)7769 711 885 *| * ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert nathan.kroen...@sun.com // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax:+61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456// // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
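On builds that do have those properties, limiting the cache for a test is just a matter of (dataset name is an example):

zfs set primarycache=metadata tank/test
zfs set primarycache=none tank/test

and for hitting the bare disk, something like:

# read 1GB straight off the raw device, bypassing the filesystem
dd if=/dev/rdsk/c0t0d0s0 of=/dev/null bs=1024k count=1024

Just don't point of= at a raw device you care about - as above, writes there will happily flatten whatever is on it.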
Re: [zfs-discuss] destroy means destroy, right?
He's not trying to recover a pool - Just a filesystem... :) bdebel...@intelesyscorp.com wrote: Recovering Destroyed ZFS Storage Pools. You can use the zpool import -D command to recover a storage pool that has been destroyed. http://docs.sun.com/app/docs/doc/819-5461/gcfhw?a=view -- // // Nathan Kroenert nathan.kroen...@sun.com // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax:+61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456// // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cifs perfomance
Are you able to qualify that a little? I'm using a realtek interface with OpenSolaris and am yet to experience any issues. Nathan. Brandon High wrote: On Wed, Jan 21, 2009 at 5:40 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: Several people reported this same problem. They changed their ethernet adaptor to an Intel ethernet interface and the performance problem went away. It was not ZFS's fault. It may not be a ZFS problem, but it is a OpenSolaris problem. The drivers for hardware Realtek and other NICs are ... not so great. -B -- // // Nathan Kroenert nathan.kroen...@sun.com // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax:+61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456// // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cifs perfomance
Interesting. I'll have a poke... Thanks! Nathan. Brandon High wrote: On Thu, Jan 22, 2009 at 1:29 PM, Nathan Kroenert nathan.kroen...@sun.com wrote: Are you able to qualify that a little? I'm using a realtek interface with OpenSolaris and am yet to experience any issues. There's a lot of anecdotal evidence that replacing the rge driver with the gani driver can fix poor NFS and CIFS performance. Another option is to use an Intel NIC in place of the Realtek. Search the archives for gani or slow CIFS and you'll find several people who resolved poor performance by getting rid of the rge driver. While it's not hard evidence, it seems to indicate that there are problems with the driver (and most likely the hardware). -B -- // // Nathan Kroenert nathan.kroen...@sun.com // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax:+61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456// // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hot spare not so hot ??
An interesting interpretation of using hot spares. Could it be that the hot-spare code only fires if the disk goes down whilst the pool is active? hm. Nathan. Scot Ballard wrote: I have configured a test system with a mirrored rpool and one hot spare. I powered the systems off, pulled one of the disks from rpool to simulate a hardware failure. The hot spare is not activating automatically. Is there something more i should have done to make this work ? pool: rpool state: DEGRADED status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Attach the missing device and online it using 'zpool online'. see: http://www.sun.com/msg/ZFS-8000-2Q scrub: none requested config: NAMESTATE READ WRITE CKSUM rpool DEGRADED 0 0 0 mirrorDEGRADED 0 0 0 c0d0s0 ONLINE 0 0 0 c0d1s0 UNAVAIL 0 0 0 cannot open spares c1d1s0AVAIL errors: No known data errors Thanks -Scot ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert nathan.kroen...@sun.com // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax:+61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456// // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
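In the meantime, the spare can be pressed into service by hand - using the device names from the zpool status above, something along these lines:

# manually replace the missing half of the mirror with the configured spare
zpool replace rpool c0d1s0 c1d1s0
zpool status rpool

# once happy, detach the dead device to make the spare's promotion permanent
zpool detach rpool c0d1s0

That at least gets the redundancy back while the auto-activation question gets answered.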
Re: [zfs-discuss] ZFS tale of woe and fail
Hey, Tom - Correct me if I'm wrong here, but it seems you are not allowing ZFS any sort of redundancy to manage. I'm not sure how you can class it a ZFS fail when the Disk subsystem has failed... Or - did I miss something? :) Nathan. Tom Bird wrote: Morning, For those of you who remember last time, this is a different Solaris, different disk box and different host, but the epic nature of the fail is similar. The RAID box that is the 63T LUN has a hardware fault and has been crashing, up to now the box and host got restarted and both came up fine. However, just now as I have got replacement hardware in position and was ready to start copying, it went bang and my data has all gone. Ideas? r...@cs4:~# zpool list NAME SIZE USED AVAILCAP HEALTH ALTROOT content 62.5T 59.9T 2.63T95% ONLINE - r...@cs4:~# zpool status -v pool: content state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: none requested config: NAMESTATE READ WRITE CKSUM content ONLINE 0 032 c2t8d0ONLINE 0 032 errors: Permanent errors have been detected in the following files: content:0x0 content:0x2c898 r...@cs4:~# find /content /content r...@cs4:~# (yes that really is it) r...@cs4:~# uname -a SunOS cs4.kw 5.11 snv_99 sun4v sparc SUNW,Sun-Fire-T200 from format: 2. c2t8d0 IFT-S12S-G1033-363H-62.76TB /p...@7c0/p...@0/p...@8/LSILogic,s...@0/s...@8,0 Also, content does not show in df output. thanks -- /// // Nathan Kroenert nathan.kroen...@sun.com // // Senior Systems Engineer Phone:+61 3 9869 6255// // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia// /// ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Odd network performance with ZFS/CIFS
2C from Oz: Windows (at least XP - I have thus far been lucky enough to avoid running vista on metal) has packet schedulers, quality of service settings and other crap that can severely impact windows performance on the network. I have found that setting the following made a difference to me: - Disable Jumbo Frames (as I have only a very cheap crappy gig-switch and if I try to drive it hard with jumbo's enabled, it falls in a heap) - Lose the 'deterministic network enhancer' under windows - Lose the QoS packet scheduler - Check the interface properties and go looking for something that sounds like 'optimize for CPU / optimize for speed' and set it to speed - Depending on workload and packet sizes, it might also be worth looking at disabling nagle algorithm on the Solaris box. See http://www.sun.com/servers/coolthreads/tnb/lighttpd.jsp for a quick explanation... It would be interesting to see if you see the same issues using a Solaris or other OS client. Hope this helps somewhat. Let us know how it goes. Nathan. fredrick phol wrote: I'm currently experiencing exactly the same problem and it's been driving me nuts. Tried open soalris and am currently running the latest version of SXCE both with exactly the same results. This issue occurs with both CIFS which shows the speed degrade and ISCSI which just starts off at the lowest speed but exhibits the same peaks and troughs I have 4x500GB drives in RAIDz1 config on an AMD 780G mobo. speed tests using DD have shown read rates of ~140MB/s and write rates of `120MB/s (humourously slightly faster than one of my friends arrays on linux and intel hardware) Currently the transfer will sit at about 18% gige network utilisation for 10 seconds then dip to 0 and come straight back up to 18% this happens at regular predictable intervals, there is no randomness. I've tried two different switches, one a consumer grade switch from linksys and one a low end distribution switch from 3com both exhibit exactly the same behaviour. The only computer accessing the solaris box is w windows vista 64 sp1 machine. Currently I'm guessing that the transfer issues have somethign to do with the onboard realtek network card in the solaris box. Possibly a driver issue? I've got a dual port intel server nic on order to replace it and test with. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
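On the Nagle point, a quick way to try it on the Solaris side (the default for this one is 4095, so note the old value down before changing anything):

# check, then effectively disable Nagle for new TCP connections
ndd -get /dev/tcp tcp_naglim_def
ndd -set /dev/tcp tcp_naglim_def 1

Setting it back to the old value undoes it, and it won't survive a reboot unless you script it.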
Re: [zfs-discuss] Can the new consumer NAS devices run OpenSolaris?
Meh - I doubt you hurt anyone. Most people have kill files for that sort of stuff. heh. ;) On the 'which if these should work' sort of question, if you do happen to try any of those systems, and they work, remember to submit the details to the HCL. :) I'm keen to give it a whack on a small box myself, but have not had the time or the funds. The Atom stuff should work pretty well, and even with 2GB of memory, if it's just acting as a NAS server, it should have plenty of poke. (assuming you are only using it for NAS... ;) Oh - and assuming you don't enable stuff like gzip-9 compression, which might, on the slower Atom style chips, get in the way. Looking forward to any reports. Nathan. On 13/01/09 01:47 PM, JZ wrote: ok, was I too harsh on the list? sorry folks, as I said, I have the biggest ego. no one can hurt that by trying to fight me, but yes, it can be hurt if I have to hurt the friends I love in protecting my ego or my other friends' ego. but no one can get hurt if we don't claim what we have or what we know is the best of all. a contribution to help the problem today can be better than 100% strategically correct in the long run. we use what we have today, but if that usage will impact the life or death of a promising technology branch, as a living thing, maybe we don't want to use the best of today. everyone has their own need and want, and there is no better/worse, right/wrong in the choice of technology. but some technologies can work together in a constructive fashion, and some in a destructive fashion. please, be constructive. and you will hear much less from me. best, z ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert nathan.kroen...@sun.com // // Senior Systems Engineer Phone: +61 3 9869 6255 // // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is a manual zfs scrub neccessary?
The big win for me in doing a periodic scrub is that in normal operation, ZFS only checks data as it's read back from the disks. If you don't periodically scrub, errors that happen over time won't be caught until you next read that actual data, which might be inconvenient if it's a long time since the initial data was written. As I have a lot of data that is pretty much only read once or twice after it's originally written, I could have stuff going bad over time that I don't know about. Scrubbing makes sure there is a limit on the amount of time between each 'surprise!'. :) I scrub once every month or so, depending on the system. So, in direct answer to your question, No - You don't *need* to scrub. But - It's better if you do. ;) My 2c. Nathan. On 10/11/08 11:38 AM, Douglas Walker wrote: Hi, I'm running a 3Tb RAIDZ2 array and was wondering about the zfs scrub function. This server runs as my backup server and receives an rsync every night. I was wondering if I _need_ to explicitly run a zfs scrub on my zpool periodically. There's a lot of info on google about running a scrub but not whether it's actually needed or under what circumstances you might run one - so I thought I'd ask the list it's opinions on this. If zfs does a background scrub continually anyways - is there any need to manually run a scrub? I'd imagine a scrub of a 3Tb array would take quite a while (its 7200rpm SATA disks) and if I ran a scrub this would likely overlap with my nightly rsyncs causing yet more I/O. Wouldn't this stress the disks more? If it is necessary - how often are people running a manually scrub? Once a week? month? regards D -- // // Nathan Kroenert [EMAIL PROTECTED] // // Senior Systems Engineer Phone: +61 3 9869 6255 // // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
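If remembering is the hard part, a cron entry does it for you - for example, 2am on the first of every month (pool name is an example):

0 2 1 * * /usr/sbin/zpool scrub tank

and then an occasional:

zpool status -v tank

shows when the last scrub ran, how long it took, and anything it found.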
Re: [zfs-discuss] boot -L
A quick google shows that it's not so much about the mirror, but the BE... http://opensolaris.org/os/community/zfs/boot/zfsbootFAQ/ Might help? Nathan. On 7/11/08 02:39 PM, Krzys wrote: What am I doing wrong? I have sparc V210 and I am having difficulty with boot -L, I was under the impression that boot -L will give me options to which zfs mirror I could boot my root disk? Anyway but even not that, I am seeing some strange behavior anyway... After trying boot -L I am unabl eto boot my system unless I do reset-all, is that normal? I have Solaris 10 U6 that I just upgraded my box to and I wanted to try all the cool things about zfs root disk mirroring and so on, but so far its quite strange experience with this whole thing... [22:21:25] @adas: /root init 0 [22:21:51] @adas: /root stopping NetWorker daemons: nsr_shutdown -q svc.startd: The system is coming down. Please wait. svc.startd: 90 system services are now being stopped. svc.startd: The system is down. syncing file systems... done Program terminated {0} ok boot -L SC Alert: Host System has Reset Probing system devices Probing memory Probing I/O buses Sun Fire V210, No Keyboard Copyright 2007 Sun Microsystems, Inc. All rights reserved. OpenBoot 4.22.33, 4096 MB memory installed, Serial #64938415. Ethernet address 0:3:ba:de:e1:af, Host ID: 83dee1af. Rebooting with command: boot -L Boot device: /[EMAIL PROTECTED],60/[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a File and args: -L Can't open bootlst Evaluating: The file just loaded does not appear to be executable. {1} ok boot disk0 Boot device: /[EMAIL PROTECTED],60/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 File and args: ERROR: /[EMAIL PROTECTED],60: Last Trap: Fast Data Access MMU Miss {1} ok boot disk1 Boot device: /[EMAIL PROTECTED],60/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 File and args: ERROR: /[EMAIL PROTECTED],60: Last Trap: Fast Data Access MMU Miss {1} ok boot ERROR: /[EMAIL PROTECTED],60: Last Trap: Fast Data Access MMU Miss {1} ok reset-all Probing system devices Probing memory Probing I/O buses Sun Fire V210, No Keyboard Copyright 2007 Sun Microsystems, Inc. All rights reserved. OpenBoot 4.22.33, 4096 MB memory installed, Serial #64938415. Ethernet address 0:3:ba:de:e1:af, Host ID: 83dee1af. Boot device: /[EMAIL PROTECTED],60/[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a File and args: SunOS Release 5.10 Version Generic_137137-09 64-bit Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms. Hardware watchdog enabled Hostname: adas Reading ZFS config: done. Mounting ZFS filesystems: (3/3) adas console login: Nov 6 22:27:13 squid[361]: Squid Parent: child process 363 started Nov 6 22:27:18 adas ufs: NOTICE: mount: not a UFS magic number (0x0) starting NetWorker daemons: nsrexecd console login: Does anyone have any idea why is that happening? what am I doing wrong? Thanks for help. Regards, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert [EMAIL PROTECTED] // // Senior Systems Engineer Phone: +61 3 9869 6255 // // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
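For the record, the sequence the FAQ describes is to give -L against an explicit ZFS boot device, pick the BE from the menu, then boot it with -Z - something like the following (the device path below is made up; use your own devalias or full path):

ok boot /pci@1c,600000/scsi@2/disk@0,0:a -L
  ... choose the BE, and note the 'boot -Z rpool/ROOT/...' line it prints ...
ok boot /pci@1c,600000/scsi@2/disk@0,0:a -Z rpool/ROOT/s10u6

It's also worth making sure both halves of the root mirror actually have a ZFS boot block on them, e.g.:

installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c1t1d0s0

otherwise the second disk has nothing to boot from, whichever flags you use.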
Re: [zfs-discuss] FYI - proposing storage pm project
Not wanting to hijack this thread, but... I'm a simple man with simple needs. I'd like to be able to manually spin down my disks whenever I want to... Anyone come up with a way to do this? ;) Nathan. Jens Elkner wrote: On Mon, Nov 03, 2008 at 02:54:10PM -0800, Yuan Chu wrote: Hi, a disk may take seconds or even tens of seconds to come on line if it needs to be powered up and spin up. Yes - I really hate this on my U40 and tried to disable PM for HDD[s] completely. However, haven't found a way to do this (thought /etc/power.conf is the right place, but either it doesn't work as explained or is not the right place). HDD[s] are HITACHI HDS7225S Revision: A9CA Any hints, how to switch off PM for this HDD? Regards, jel. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
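The only knob I know of for this lives in /etc/power.conf - it's threshold based rather than a 'spin it down right now' button, but it covers both wishes above. A sketch (the device paths are examples - get the real physical path from ls -l /dev/dsk/c1t1d0s0):

# spin this disk down after 5 minutes of idle
device-thresholds /pci@0,0/pci-ide@4/ide@0/cmdk@0,0 5m

# or keep power management away from a disk entirely
device-thresholds /pci@0,0/pci-ide@4/ide@1/cmdk@0,0 always-on

then push the new settings into the kernel with:

pmconfig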
Re: [zfs-discuss] add autocomplete feature for zpool, zfs command
Hm - This caused me to ask the question: Who keeps the capabilities in sync? Is there a programmatic way we can have bash (or other shells) interrogate zpool and zfs to find out what it's capabilities are? I'm thinking something like having bash spawn a zfs command to see what options are available in that current zfs / zpool version... That way, you would never need to do anything to bash/zfs once it was done the first time... do it once, and as ZFS changes, the prompts change automatically... Or - is this old hat, and how we do it already? :) Nathan. On 10/10/08 05:06 PM, Boyd Adamson wrote: Alex Peng [EMAIL PROTECTED] writes: Is it fun to have autocomplete in zpool or zfs command? For instance - zfs cr 'Tab key' will become zfs create zfs clone 'Tab key' will show me the available snapshots zfs set 'Tab key' will show me the available properties, then zfs set com 'Tab key' will become zfs set compression=, another 'Tab key' here would show me on/off/lzjb/gzip/gzip-[1-9] .. Looks like a good RFE. This would be entirely under the control of your shell. The zfs and zpool commands have no control until after you press enter on the command line. Both bash and zsh have programmable completion that could be used to add this (and I'd like to see it for these and other solaris specific commands). I'm sure ksh93 has something similar. Boyd ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert [EMAIL PROTECTED] // // Senior Systems Engineer Phone: +61 3 9869 6255 // // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
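As a proof of concept for the 'ask the binary' idea, a rough bash sketch - it just scrapes the usage message that zfs prints when run with no arguments (the usage lines are tab-indented on my build, hence the pattern), so it only keeps the subcommand names in sync, not their options:

_zfs_complete() {
    local cur=${COMP_WORDS[COMP_CWORD]}
    if [ "$COMP_CWORD" -eq 1 ]; then
        # pull the subcommand names out of the usage summary
        local subs=$(zfs 2>&1 | awk '/^\t[a-z]/ {print $1}' | sort -u)
        COMPREPLY=( $(compgen -W "$subs" -- "$cur") )
    fi
}
complete -F _zfs_complete zfs

The same trick works for zpool, and property names could presumably be scraped out of the 'zfs get' usage in much the same way.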
Re: [zfs-discuss] ZSF Solaris
Actually, the one that'll hurt most is ironically the most closely related to bad database schema design... With a zillion files in the one directory, if someone does an 'ls' in that directory, it'll not only take ages, but steal a whole heap of memory and compute power... Provided the only things that'll be doing *anything* in that directory are using indexed methods, there is no real problem from a ZFS perspective, but if something decides to list (or worse, list and sort) that directory, it won't be that pleasant. Oh - That's of course assuming you have sufficient memory in the system to cache all that metadata somewhere... If you don't then that's another zillion I/O's you need to deal with each time you list the entire directory. an ls -1rt on a directory with about 1.2 million files with names like afile1202899 takes minutes to complete on my box, and we see 'ls' get to in excess of 700MB rss... (and that's not including the memory zfs is using to cache whatever it can.) My box has the ARC limited to about 1GB, so it's obviously undersized for such a workload, but still gives you an indication... I generally look to keep directories to a size that allows the utilities that work on and in it to perform at a reasonable rate... which for the most part is around the 100K files or less... Perhaps you are using larger hardware than I am for some of this stuff? :) Nathan. On 1/10/08 07:29 AM, Toby Thain wrote: On 30-Sep-08, at 7:50 AM, Ram Sharma wrote: Hi, can anyone please tell me what is the maximum number of files that can be there in 1 folder in Solaris with ZSF file system. I am working on an application in which I have to support 1mn users. In my application I am using MySql MyISAM and in MyISAM there is 3 files created for 1 table. I am having application architechture in which each user will be having separate table, so the expected number of files in database folder is 3mn. That sounds like a disastrous schema design. Apart from that, you're going to run into problems on several levels, including O/S resources (file descriptors) and filesystem scalability. --Toby I have read somewhere that there is a limit of each OS to create files in a folder. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert [EMAIL PROTECTED] // // Senior Systems Engineer Phone: +61 3 9869 6255 // // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
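If the schema really can't change, the usual dodge is to fan the files out over a few hundred subdirectories so no single directory ever gets huge. A throwaway sketch of the idea in shell (the bucket scheme and paths are made up for illustration):

# drop each file into a 2-hex-digit bucket derived from its name
f=afile1202899
bucket=$(printf '%s' "$f" | cksum | awk '{printf "%02x", $1 % 256}')
mkdir -p /tank/db/$bucket
mv "$f" /tank/db/$bucket/

Lookups use the same hash, so nothing ever needs to list the big directory at all.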
Re: [zfs-discuss] CF to SATA adapters for boot device
I second that question, and also ask what brand folks like for performance and compatibility? Ebay is killing me with vast choice and no detail... ;) Nathan. Al Hopper wrote: On Wed, Aug 20, 2008 at 12:57 PM, Neal Pollack [EMAIL PROTECTED] wrote: Ian Collins wrote: Brian Hechinger wrote: On Wed, Aug 20, 2008 at 05:17:45PM +1200, Ian Collins wrote: Has anyone here had any luck using a CF to SATA adapter? I've just tried an Addonics ADSACFW CF to SATA adaptor with an 8GB card that I wanted to use for a boot pool and even though the BIOS reports the disk, Solaris B95 (or the installer) doesn't see it. I tried this a while back with an IDE to CF adapter. Real nice looking one too. It would constantly cause OpenBSD to panic. I would recommend against using this, unless you get real lucky. If you want flash to boot from, buy one of the ones that is specifically made for it (not CF, but industrial grade flash meant to be a HDD). Those things work a LOT better. I can look up the details of the ones my friend uses if you'd like. I was looking to run some tests with a CF boot drive before we get an X4540, which has a CF slot. The installer did see the attached USB sticks... My team does some of the testing inside Sun for the CF boot devices. We've used a number of IDE attaced CF adapters, such as; http://www.addonics.com/products/flash_memory_reader/ad44midecf.asp and also some random models from www.frys.com. We also test the CF boot feature on various Sun rack servers and blades that use a CF socket. I have not tested the SATA adapters but would not expect issues. I'd like to know if you find issues. The IDE attached devices use the legacy ATA/IDE device driver software, which had some bugs fixed for DMA and misc CF specific issues. It would be interesting to see if a SATA adapter for CF, set in bios to use AHCI instead of Legacy/IDE mode, would have any issues with the AHCI device driver software. I've had no reason to test this yet, since the Sun HW models build the CF socket right onto the motherboard/bus. I can't find a reason to worry about hot-plug, since removing the boot drive while Solaris is running would be, um, somewhat interesting :-) True, the enterprise grade devices are higher quality and will last longer. But do not underestimate the current (2008) device wear leveling firmware that controls the CF memory usage, and hence life span. Our in house destructive life span testing shows that the commercial grade CF device will last longer than the motherboard will. The consumer grade devices Interesting thread - thanks to all the contributors. I've seen, on several different forums, that many CF users lean towards Sandisk for reliability and longevity. Does anyone else see consensus in terms of CF brands? that you find in the store or on mail order, may or may not be current generation, so your device lifespan will vary. It should still be rather good for a boot device, because Solaris does very little writing to the boot disk. You can review configuration ideas to maximize the life of your CF device in this Solaris white paper for non-volatile memory; http://www.sun.com/bigadmin/features/articles/nvm_boot.jsp I hope this helps. Cheers, Neal Pollack Any further information welcome. Ian Regards, -- // // Nathan Kroenert [EMAIL PROTECTED] // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax:+61 3 9869-6288 // // Level 7, 476 St. 
Kilda Road Mobile: 0419 305 456// // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] help me....
It starts with Z, which makes it one of the last to be considered if it's listed alphabetically? Nathan. Rahul wrote: hi can you give some disadvantages of the ZFS file system?? plzz its urgent... help me. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Supermicro AOC-USAS-L8i
And I can certainly vouch for that series of chipsets... I have a 750a-sli chipset (the one below the 790) and the SATA ports (in AHCI mode) Just Work(tm) under nevada / opensolaris. I'm yet to give it a while on S10, mostly as I pretty much run nevada everywhere... As S10 does indeed have an AHCI driver, I'd expect it would work just fine there too. Oh - and the ports go like stink!* For what it's worth, even with Nevada, you will need the newest NVidia Xorg drivers from nvidia's website to get the video working properly, and will need to add in it's PCI ID's in /etc/driver_aliases (And, as yet, I'm unable to run compiz in a stable way - Tends to hard lock up the machine after about 5 minutes use...), a very new hdaudio driver (I needed a bodgied up one from the Beijing team to make it work) and last I checked, the nvidia ethernet did not work properly without assigning it a valid ethernet address... (The driver misreads the ethernet address and either delivers it backwards, or byte-swaps... I don't remember exactly...) Oh - And just in case you forget, most boards I have seen use IDE mode for the controllers by default, which reeks. Expect less than 15 MB/s if reading and writing at the same time if you forget to change the controller mode to AHCI! For what it's worth, the board I'm using is a giga-byte.. Manufacturer: Gigabyte Technology Co., Ltd. Product: M750SLI-DS4 Which also has the 6 X AHCI ports. It might seem like it'll be a lot of hassle getting it working, but in the ZFS space, it works great pretty much out of the box (plus ethernet address change if the nvidia driver is still busted... ;) Cheers! Nathan. *Going like stink means going like a hairy goat - like lightning - like s*it off a shovel - like a zyrtec - fast. :) Brandon High wrote: On Mon, Aug 4, 2008 at 6:49 AM, Tim [EMAIL PROTECTED] wrote: really had the motivation or the cash to do so yet. I've been keeping my eye out for a board that supports the opteron 165 and the wider lane dual pci-E slots that isn't stricly a *gaming* board. I'm starting to think the combination doesn't exist. The AMD 790GX boards are starting to show up: http://www.newegg.com/Product/Product.aspx?Item=N82E16813128352 Dual 8x PCIe slots, integrated video and 6 AHCI SATA ports. -B ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
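On the driver_aliases front, rather than editing the file (or the miniroot) by hand, update_drv can add the binding - roughly like this, noting the pci ids below are made up, so pull the real ones from prtconf -pv first:

# bind the nvidia Xorg kernel driver to an extra device id
update_drv -a -i '"pci10de,123"' nvidia

# same idea for nv_sata, if/when it grows support for the newer chipset
update_drv -a -i '"pci10de,456"' nv_sata

A reconfigure boot (or a run of devfsadm) afterwards doesn't hurt.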
Re: [zfs-discuss] How to delete hundreds of emtpy snapshots
In one of my prior experiments, I included the names of the snapshots I created in a plain text file. I used this file, and not the zfs list output to determine which snapshots I was going to remove when it came time. I don't even remember *why* I did that in the first place, but it certainly made things easier when it came time to clean up a whole bunch of stuff... (And was not impacted by zfs list being non-snappy...) The snapshot naming scheme meant that it was dead easy to work out which to remove / keep... Right now, I don't have a system (that box was killed in a dreadful xen experiment :) so I'll be watching this thread with renewed interest to see who else is doing what... Nathan. Bob Friesenhahn wrote: On Thu, 17 Jul 2008, Ben Rockwood wrote: zfs list is mighty slow on systems with a large number of objects, but there is no foreseeable plan that I'm aware of to solve that problem. Never the less, you need to do a zfs list, therefore, do it once and work from that. If the snapshots were done from a script then their names are easily predictable and similar logic can be used to re-create the existing names. This avoids the need to do a 'zfs list'. Bob ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
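For completeness, the cleanup end of that file-driven approach is about as dull as it gets - a sketch, assuming one full snapshot name per line was appended to the file at creation time:

# snaps-created.txt contains lines like tank/home@nightly-20080701
while read snap; do
    zfs destroy "$snap" && echo "destroyed $snap"
done < /var/tmp/snaps-created.txt

No zfs list needed at any point, which is the whole attraction.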
Re: [zfs-discuss] ZFS deduplication
Even better would be using the ZFS block checksums (assuming we are only summing the data, not its position or time :)... Then we could have two files that have 90% the same blocks, and still get some dedup value... ;) Nathan. Charles Soto wrote: A really smart nexus for dedup is right when archiving takes place. For systems like EMC Centera, dedup is basically a byproduct of checksumming. Two files with similar metadata that have the same hash? They're identical. Charles On 7/7/08 4:25 PM, Neil Perrin [EMAIL PROTECTED] wrote: Mertol, Yes, dedup is certainly on our list and has been actively discussed recently, so there's hope and some forward progress. It would be interesting to see where it fits into our customers priorities for ZFS. We have a long laundry list of projects. In addition there's bug fixes and performance changes that customers are demanding. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS write / read speed and traps for beginners
Further followup to this thread... After being beaten sufficiently with a clue-bat, it was determined that the nforce 750a could do ahci mode for it's SATA stuff. I set it to ahci, and redid the devlinks etc and cranked it up as AHCI. I'm now regularly peaking at 100MB/s, though spending most of the time around 70MB/s. *much better* The lesson here is: when in ahci mode in the bios, *don't* match that PCI-ID with the nv-sata driver. It's not what you want. heh. *blush*. Once I removed the extra nv_sata entries I had added to the driver_aliases in my miniroot, all was good. On the NGE front, it turns out that solaris does not seem to like the ethernet address of the card. Trying to set it's OWN ethernet address using ifconfig yielded this: # ifconfig nge0 ether 63:d0:b:7d:1d:0 ifconfig: dlpi_set_physaddr failed nge0: DLSAP address in improper format or invalid ifconfig: failed setting mac address on nge0 using ifconfig nge0 ether 0:e:c:5b:54:45 worked just fine, and the interface now passes traffic and sees responses just fine. So, the workaround here is adding ether a working ether address in the hostname.nge0 I guess I'll log a bug on that on Monday... Awesome. Now to work on audio... heh. Nathan. Nathan Kroenert wrote: Hey all - Just spent quite some time trying to work out why my 2 disk mirrored ZFS pool was running so slow, and found an interesting answer... System: new Gigabyte M750sli-DS4, AMD 9550, 4GB memory and 2 X Seagate 500GB SATA-II 32mb cache disks. The SATA ports on the nfoce 750asli chipset don't yet seem to be supported by the nv_sata driver (I'm only running nv_89 at the mo, though I'm not aware of new support going in just yet). I *can* get the driver to attach, but not to see any disks. interesting, but I digress... Anyhoo, - I'm stuck in IDE compatability mode for the moment. So - using plain dd to the zfs filesystem on said disk dd if=/dev/zero of=delete.me bs=65536 I could achieve only about 35-40MB/s write speed, whereas, if I dd to the slice directly, I can get around 90-95MB/s I tried using whole disks versus a slice and it made no appreciable difference. It turns out that when you are in IDE compatability mode, having two disks on the same 'controller' (c# in solaris) behaves just like real IDE... Crap! Moving the second disk onto from c1 to c2 got be back to at least 50MB/s with higher peaks, up to 60/70MB/s. Also of note, on the gigabyte board (and I guess other nforce 750asli based chipsets) only 4 of the 6 SATA ports work when in IDE mode. Other thoughts on the Nforce 750a: - nge plumbs up OK and can send and 'see' packets, but does not seem to know itself... In promiscuous mode, you can see returning icmp echo requests, but they don't make it to the top of the stack. I had to use an e1000g in a PCI slot to get my networking working properly... - Onboard Video works, including compiz, but you need to create an xorg.conf and update the nvidia driver with the latest from the nvidia website Seems snappy enough. With 4 cores @ 2.2Ghz (phenom 9550) it's looking like it'll do what I wanted quite nicely. Later... Nathan. -- // // Nathan Kroenert [EMAIL PROTECTED] // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax:+61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456// // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
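To spell out that workaround a little: the contents of /etc/hostname.nge0 get handed to ifconfig at boot, so sticking the ether override in there alongside the address does the trick on this box - something like the following single line (hostname and MAC are examples, and this is only a sketch of what worked here, not gospel):

myhost ether 0:e:c:5b:54:45

with myhost resolving to the address you want on nge0.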
[zfs-discuss] ZFS write / read speed and traps for beginners
Hey all - Just spent quite some time trying to work out why my 2 disk mirrored ZFS pool was running so slow, and found an interesting answer... System: new Gigabyte M750sli-DS4, AMD 9550, 4GB memory and 2 x Seagate 500GB SATA-II 32MB cache disks. The SATA ports on the nforce 750a SLI chipset don't yet seem to be supported by the nv_sata driver (I'm only running nv_89 at the mo, though I'm not aware of new support going in just yet). I *can* get the driver to attach, but not to see any disks. Interesting, but I digress... Anyhoo - I'm stuck in IDE compatibility mode for the moment. So - using plain dd to the ZFS filesystem on said disk (dd if=/dev/zero of=delete.me bs=65536) I could achieve only about 35-40MB/s write speed, whereas if I dd to the slice directly, I can get around 90-95MB/s. I tried using whole disks versus a slice and it made no appreciable difference. It turns out that when you are in IDE compatibility mode, having two disks on the same 'controller' (c# in Solaris) behaves just like real IDE... Crap! Moving the second disk from c1 to c2 got me back to at least 50MB/s with higher peaks, up to 60/70MB/s. Also of note, on the Gigabyte board (and I guess other nforce 750a SLI based chipsets) only 4 of the 6 SATA ports work when in IDE mode. Other thoughts on the Nforce 750a: - nge plumbs up OK and can send and 'see' packets, but does not seem to know itself... In promiscuous mode, you can see returning icmp echo requests, but they don't make it to the top of the stack. I had to use an e1000g in a PCI slot to get my networking working properly... - Onboard video works, including compiz, but you need to create an xorg.conf and update the nvidia driver with the latest from the nvidia website. Seems snappy enough. With 4 cores @ 2.2GHz (Phenom 9550) it's looking like it'll do what I wanted quite nicely. Later... Nathan. -- // // Nathan Kroenert [EMAIL PROTECTED] // // Systems Engineer Phone: +61 3 9869-6255 // // Sun Microsystems Fax: +61 3 9869-6288 // // Level 7, 476 St. Kilda Road Mobile: 0419 305 456 // // Melbourne 3004 Victoria Australia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
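For reference, the sort of back-to-back comparison being described looks roughly like this. File and device names here are invented, and the second command writes straight over a slice, so only point it at one you don't care about:

  (write through the ZFS filesystem)
  # dd if=/dev/zero of=/tank/test/delete.me bs=65536 count=16384
  (write to the raw slice directly - destroys whatever is on it)
  # dd if=/dev/zero of=/dev/rdsk/c2d0s0 bs=65536 count=16384

Watching zpool iostat 1 or iostat -xnz 1 while each runs shows whether the gap is in the filesystem or the plumbing underneath it.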
Re: [zfs-discuss] SATA controller suggestion
Tim wrote: **pci or pci-x. Yes, you might see *SOME* loss in speed from a pci interface, but let's be honest, there aren't a whole lot of users on this list that have the infrastructure to use greater than 100MB/sec who are asking this sort of question. A PCI bus should have no issues pushing that. Hm. If it's a system with only 1 PCI bus, there are still a few things to consider here. If it's plain old 33MHz, 32-bit PCI, your 100MB/s(ish) of usable bandwidth is actually total bandwidth. That's 50MB/s in and 50MB/s out, if you are copying disk to disk... I am about to update my home server for exactly the issue of saturating my PCI bus... It's even worse for me, as I'm mirroring, so that works out to closer to 33MB/s read, 33MB/s write + 33MB/s write to the mirror. All in all, it blows. I'm looking into one of the new Gigabyte NVIDIA-based systems with the 750a SLI chipsets. I'm *hoping* the Solaris nv_sata drivers will work with the new chipset (or that we are on the way to updating them...). My other box that's using the Nforce 570 works like a champ, and I'm hoping to recapture that magic. (I actually wanted to buy some more 570-based MBs but cannot get 'em in Australia any more... :) Cheers! Nathan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
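To put some rough numbers on that (assuming a single shared 32-bit/33MHz bus with no other traffic on it):

  33MHz x 4 bytes     = ~133MB/s theoretical, call it ~100MB/s in practice
  disk-to-disk copy   = ~50MB/s read + ~50MB/s write
  copy onto a mirror  = ~33MB/s read + ~33MB/s write + ~33MB/s mirror write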
Re: [zfs-discuss] More USB Storage Issues
For what it's worth, I started playing with USB + flash + ZFS and was most unhappy for quite a while. I was suffering with things hanging, going slow or just going away and breaking, and thought I was witnessing something zfs was doing as I was trying to do mirror recovery and all that sort of stuff. On a hunch, I tried doing UFS and RAW instead and saw the same issues. It's starting to look like my USB hubs. Once they are under any reasonable read/write load, they just make bunches of things go offline. Yep - They are powered and plugged in. So, at this stage, I'll be grabbing a couple of 'better' USB hubs (Mine are pretty much the cheapest I could buy) and see how that goes. For gags, take ZFS out of the equation and validate that your hardware is actually providing a stable platform for ZFS... Mine wasn't... Nathan. Evan Geller wrote: So, I've been stuck in kind of an ugly pattern. I zpool create and nothing goes wrong for a while, and then eventually I'll zpool status, which doesn't respond to ^C or kill -9s or anything. Also, setting NOINUSE_CHECK=1 doesn't appear to make a difference. I'll try and truss it next time I get a chance if that helps. Anywho, other problem is I get a huge storm of these around the same time zpool hangs. Jun 4 23:17:59 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd5): Jun 4 23:17:59 cakeoffline or reservation conflict Jun 4 23:18:00 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd5): Jun 4 23:18:00 cakeoffline or reservation conflict Jun 4 23:18:01 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd6): Jun 4 23:18:01 cakeoffline or reservation conflict Jun 4 23:18:02 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd6): Jun 4 23:18:02 cakeoffline or reservation conflict Jun 4 23:18:03 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd6): Jun 4 23:18:03 cakeoffline or reservation conflict Jun 4 23:18:04 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd6): Jun 4 23:18:04 cakeoffline or reservation conflict Jun 4 23:18:04 cake zfs: [ID 664491 kern.warning] WARNING: Pool 'tank' has encountered an uncorrectable I/O error. Manual intervention is required. Sorry if this isn't enough information, but if there's anything else I can provide that'll help please let me know. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert [EMAIL PROTECTED] // // Technical Support Engineer Phone: +61 3 9869-6255 // // Sun Services Fax:+61 3 9869-6288 // // Level 3, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
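A rough way to do that sanity check with no filesystem in the picture at all - device names here are examples only, and the read is non-destructive:

  (stream reads through the hub and watch for resets or devices going offline)
  # dd if=/dev/rdsk/c5t0d0p0 of=/dev/null bs=1024k &
  # iostat -xnz 5
  # tail -f /var/adm/messages

If the device drops off or throws reservation conflicts under nothing more than a raw sequential read, the hub or enclosure is the problem, not ZFS.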
Re: [zfs-discuss] Get your SXCE on ZFS here!
format -e is your window to cache settings. As for the auto-enabling, I'm not sure, as IIRC, we do different things based on disk technology. eg: IDE + SATA - Always enabled SCSI - Disabled by default, unless you give ZFS the whole disk. I think. On a couple of my systems, this seems to ring true. Not at all sure about SAS. If I'm wrong here, hopefully someone else will provide the complete set of logic for determining cache enabling semantics. :) Nathan. Brian Hechinger wrote: On Wed, Jun 04, 2008 at 09:17:05PM -0400, Ellis, Mike wrote: The FAQ document ( http://opensolaris.org/os/community/zfs/boot/zfsbootFAQ/ ) has a jumpstart profile example: Speaking of the FAQ and mentioning the need to use slices, how does that affect the ability of Solaris/ZFS to automatically enable the disk's cache? Does it need to be manually over-ridden (unlike giving ZFS the whole disk where it automatically turns the disk cache on)? Also, how can you check if the disk's cache has been enabled or not? Thanks, -brian -- // // Nathan Kroenert [EMAIL PROTECTED] // // Technical Support Engineer Phone: +61 3 9869-6255 // // Sun Services Fax:+61 3 9869-6288 // // Level 3, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
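Something like the following is what is meant by format -e being your window to the cache settings. The exact menus (and whether the cache entry shows up at all) depend on the disk and driver, so this is only a sketch:

  # format -e
  (select the disk)
  format> cache
  cache> write_cache
  write_cache> display
  Write Cache is enabled
  write_cache> quit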
Re: [zfs-discuss] ZFS root finally here in SNV90
I'd expect it's the old standard. If /var/tmp is filled, and that's part of /, then bad things happen. There are often other places in /var that are writable by more than root, and always the possibility that something barfs heavily into syslog. Since the advent of reasonably sized disks, I know many don't consider this an issue these days, but I'd still be inclined to keep /var (and especially /var/tmp) separated from /. In ZFS, this is, of course, just two filesystems in the same pool, with differing quotas... :) Nathan. Rich Teer wrote: On Wed, 4 Jun 2008, Bob Friesenhahn wrote: Did you actually choose to keep / and /var combined? That's what I'd do... Is there any reason to do that with a ZFS root since both are sharing the same pool and so there is no longer any disk space advantage? If / and /var are not combined can they have different assigned quotas without one inheriting limits from the other? Why would one do that? Just keep an eye on the root pool and all is good. -- // // Nathan Kroenert [EMAIL PROTECTED] // // Technical Support Engineer Phone: +61 3 9869-6255 // // Sun Services Fax: +61 3 9869-6288 // // Level 3, 476 St. Kilda Road // // Melbourne 3004 Victoria Australia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
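In ZFS-root terms that is just something along these lines - the dataset names are invented for illustration and the quota values are arbitrary, so adjust to taste:

  # zfs create -o quota=8g rpool/ROOT/snv_90/var
  # zfs create -o quota=2g rpool/ROOT/snv_90/var/tmp
  # zfs get -r quota rpool/ROOT/snv_90

A runaway syslog or a full /var/tmp then hits its own quota rather than filling /.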
Re: [zfs-discuss] x4500 Thumper panic
Dumping to /dev/dsk/c6t0d0s1 certainly looks like a non-mirrored dump dev... You might try a manual savecore telling it to ignore the dump valid header and see what you get... savecore -d and perhaps try telling it to look directly at the dump device... savecore -f device You should also, when you get the chance, deliberately panic the box to make sure you can actually capture a dump... dumpadm is your friend as far as checking where you are going to dump to, and it it's one side of your swap mirror, that's bad, M'Kay? :) Nathan. Jorgen Lundman wrote: OK, this is a pretty damn poor panic report if I may say no, not had much sleep. Solaris Express Developer Edition 9/07 snv_70b X86 Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 30 August 2007 SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc Even though it dumped, it wrote nothing to /var/crash/. Perhaps because swap is mirrored. Jorgen Lundman wrote: We had a panic around noon on Saturday, which it mostly recovered itself. All ZFS NFS exports just remounted, but the UFS on zdev NFS exports did not, needed manual umount mount on all clients for some reason. Is this a known bug we should consider a patch for? May 10 11:49:46 x4500-01.unix ufs: [ID 912200 kern.notice] quota_ufs: over hard disk limit (pid 477, uid 127409, inum 1047211, fs /export/zero1) May 10 11:51:26 x4500-01.unix unix: [ID 836849 kern.notice] May 10 11:51:26 x4500-01.unix ^Mpanic[cpu3]/thread=17b8c820: May 10 11:51:26 x4500-01.unix genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ff001f4ca220 addr=0 occurred in module unknown due t o a NULL pointer dereference May 10 11:51:26 x4500-01.unix unix: [ID 10 kern.notice] May 10 11:51:26 x4500-01.unix unix: [ID 839527 kern.notice] nfsd: May 10 11:51:26 x4500-01.unix unix: [ID 753105 kern.notice] #pf Page fault May 10 11:51:26 x4500-01.unix unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x0 May 10 11:51:26 x4500-01.unix unix: [ID 243837 kern.notice] pid=477, pc=0x0, sp= 0xff001f4ca318, eflags=0x10246 May 10 11:51:26 x4500-01.unix unix: [ID 211416 kern.notice] cr0: 8005003bpg,wp, ne,et,ts,mp,pe cr4: 6f8xmme,fxsr,pge,mce,pae,pse,de May 10 11:51:26 x4500-01.unix unix: [ID 354241 kern.notice] cr2: 0 cr3: 1fcbbc00 0 cr8: c May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rdi: fffedef ea000 rsi:9 rdx:0 May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rcx: 17b 8c820 r8:0 r9: ff054797dc48 May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rax: 0 rbx: 97eaffc rbp: ff001f4ca350 May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] r10: 0 r11: fffec8b93868 r12: 27991000 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] r13: fffed1b 59c00 r14: fffecf8d8cc0 r15: 1000 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] fsb: 0 gsb: fffec3d5a580 ds: 4b May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] es: 4b fs:0 gs: 1c3 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] trp: e err: 10 rip:0 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] cs: 30 rfl:10246 rsp: ff001f4ca318 May 10 11:51:27 x4500-01.unix unix: [ID 266532 kern.notice] ss: 38 May 10 11:51:27 x4500-01.unix unix: [ID 10 kern.notice] May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca100 unix:die+c8 () May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca210 unix:trap+135b () May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca220 unix:_cmntrap+e9 () May 10 
11:51:27 x4500-01.unix genunix: [ID 802836 kern.notice] ff001f4ca350 0 () May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca3d0 ufs:top_end_sync+cb () May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca440 ufs:ufs_fsync+1cb () May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca490 genunix:fop_fsync+51 () May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4ca770 nfssrv:rfs3_create+604 () May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4caa70 nfssrv:common_dispatch+444 () May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4caa90 nfssrv:rfs_dispatch+2d () May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4cab80 rpcmod:svc_getreq+1c6 () May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice] ff001f4cabf0
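For the archives, the sort of checks being suggested look roughly like this. The output is abbreviated and from memory, and the device is simply the one named above:

  # dumpadm
        Dump content: kernel pages
         Dump device: /dev/dsk/c6t0d0s1 (swap)
  Savecore directory: /var/crash/x4500-01
    Savecore enabled: yes

  (tell savecore to ignore the dump-valid flag, or point it straight at the device)
  # savecore -vd
  # savecore -vf /dev/dsk/c6t0d0s1 /var/crash/x4500-01

  (a deliberate test dump can be taken with reboot -d, when the box can stand a bounce)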
Re: [zfs-discuss] zfs data corruption
Note: IANATZD (I Am Not A Team-ZFS Dude) Speaking as a Hardware Guy, knowing that something is happening, has happened or is indicated to happen is a Good Thing (tm). Begin unlikely, but possible scenario: If, for instance, I'm getting a cluster of read errors (or, perhaps bad blocks), I could: - See it as it's happening - See the block number for each error - already know the rate at which the errors are happening - Be able to determine that it's not good, and it's time to replace the disk. - You get the picture... And based on this information, I could feel confident that I have the right information at hand to be able to determine that it is or is not time to replace this disk. Of course, that assumes: - I know anything about disks - I know anything about the error messages - I have some sort of logging tool that recognises the errors (and does not just throw out the 'retryable ones', as most I have seen are configured to do) - I care - The folks watching the logs in the enterprise management tool care - My storage even bothers to report the errors Certainly, for some organisations, all of the above are exactly how it works, and it works well for them. Looking at the ZFS/FMA approach, it certainly is somewhat different. The (very) rough concept is that FMA gets pretty much all errors reported to it. It logs them, in a persistent store, which is always available to view. It also makes diagnoses on the errors, based on the rules that exist for that particular style of error. Once enough (or the right type of) errors happen, it'll then make a Fault Diagnosis for that component, and log a message, loud and proud into the syslog. It may also take other actions, like, retire a page of memory, offline a CPU, panic the box, etc. So - That's the rough overview. It's worth noting up front that we can *observe* every event that has happened. Using fmdump and fmstat we can immediately see if anything interesting has been happening, or we can wait for a Fault Diagnosis, in which case, we can just watch /var/adm/messages. I also *believe* (though am not certain - Perhaps someone else on the list might be?) it would be possible to have each *event* (so - the individual events that lead to a Fault Diagnosis) generate a message if it was required, though I have never taken the time to do that one... There are many advantages to this approach - It does not rely on logfiles, offsets into logfiles, counters of previously processes messages and all of the other doom and gloom that comes with scraping logfiles. It's something you can simply ask: Any issues, chief? The answer is there in a flash. You will also be less likely to have the messages rolled out of the logs before you get to them (another classic...). And - You get some great details from fmdump showing you what's really going on, and it's something that's really easy to parse to look for patterns. All of this said, I understand if you feel things are being 'hidden' from you until it's *actually* busted that you are having some of your forward vision obscured 'in the name of a quiet logfile'. I felt much the same way for a period of time. (Though, I live more in the CPU / Memory camp...) But - Once I realised what I could do with fmstat and fmdump, I was not the slightest bit unhappy (Actually, that's not quite true... Even once I knew what they could do, it still took me a while to work out the options I cared about for fmdump / fmstat), but I now trust FMA to look after my CPU / Memory issues better than I would in real life. 
I can still get what I need when I want to, and the data is actually more accessible and interesting. I just needed to know where to go looking. All this being said, I was not actually aware that many of our disk / target drivers were actually FMA'd up yet. heh - Shows what I know. Does any of this make you feel any better (or worse)? Nathan. Mark A. Carlson wrote: fmd(1M) can log faults to syslogd that are already diagnosed. Why would you want the random spew as well? -- mark Carson Gaspar wrote: [EMAIL PROTECTED] wrote: It's not safe to jump to this conclusion. Disk drivers that support FMA won't log error messages to /var/adm/messages. As more support for I/O FMA shows up, you won't see random spew in the messages file any more. mode=large financial institution paying support customer That is a Very Bad Idea. Please convey this to whoever thinks that they're helping by not sysloging I/O errors. If this shows up in Solaris 11, we will Not Be Amused. Lack of off-box error logging will directly cause loss of revenue. /mode ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss
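For anyone who wants the short version of the incantations, these are the stock FMA observability commands, nothing ZFS-specific:

  # fmstat            (per-module event and diagnosis-engine activity at a glance)
  # fmdump            (list of diagnosed faults)
  # fmdump -e         (the underlying error events - ereport.io.*, ereport.fs.zfs.*, ...)
  # fmdump -eV        (the same, in full gory detail)
  # fmadm faulty      (anything currently diagnosed as faulty)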
Re: [zfs-discuss] zfs data corruption
c4t60A9800043346859444A476B2D485872d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D485758d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D485642d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D485471d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D485357d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D485241d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D485071d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D484F56d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D484E41d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D484C70d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D484B56d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D484A2Dd0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D484870d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D484755d0 ONLINE 0 0 0 c4t60A9800043346859444A476B2D48462Dd0 ONLINE 0 0 0 errors: The following persistent errors have been detected: DATASET OBJECT RANGE zpool1 17 2428895232-2429026304 zpool1 17 2429026304-2429157376 zpool1 17 2429157376-2429288448 zpool1 17 2429288448-2429419520 zpool1 17 2429419520-2429550592 zpool1 17 2463629312-2463760384 zpool1 17 2463760384-2463891456 zpool1 17 2463891456-2464022528 zpool1 17 2464022528-2464153600 zpool1 17 2464153600-2464284672 zpool1 18 2397700096-2397831168 zpool1 18 2397831168-2397962240 zpool1 18 2397962240-2398093312 zpool1 18 2398093312-2398224384 zpool1 18 2398224384-2398355456 zpool1 18 2432434176-2432565248 zpool1 18 2432565248-2432696320 zpool1 18 2432696320-2432827392 zpool1 18 2432827392-2432958464 zpool1 18 2432958464-2433089536 zpool1 19 2418933760-2419064832 zpool1 19 2419064832-2419195904 zpool1 19 2419195904-2419326976 zpool1 19 2419326976-2419458048 zpool1 19 2453798912-2453929984 zpool1 19 2453929984-2454061056 zpool1 19 2454061056-2454192128 zpool1 19 2454192128-2454323200 This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert [EMAIL PROTECTED] // // Technical Support Engineer Phone: +61 3 9869-6255 // // Sun Services Fax:+61 3 9869-6288 // // Level 3, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zfs send takes 3 days for 1TB?
Indeed - If it was 100Mb/s ethernet, 1TB would take near enough 24 hours just to push that much data... Would be great to see some details of the setup and where the bottleneck was. I'd be surprised if ZFS has anything to do with the transfer rate... But an interesting read anyways. :) Nathan. Nicolas Williams wrote: On Wed, Apr 09, 2008 at 11:38:03PM -0400, Jignesh K. Shah wrote: Can zfs send utilize multiple-streams of data transmission (or some sort of multipleness)? Interesting read for background http://people.planetpostgresql.org/xzilla/index.php?/archives/338-guid.html Note: zfs send takes 3 days for 1TB to another system Huh? That article doesn't describe how they were moving the zfs send stream, whether the limit was the network, ZFS or disk I/O. In fact, it's bereft of numbers. It even says that the transfer time is not actually three days but upwards of 24 hours. Nico -- // // Nathan Kroenert [EMAIL PROTECTED] // // Technical Support Engineer Phone: +61 3 9869-6255 // // Sun Services Fax:+61 3 9869-6288 // // Level 3, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
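Back-of-the-envelope: 100Mb/s ethernet is ~12.5MB/s of payload, so 1TB / 12.5MB/s is ~80,000 seconds, or a bit over 22 hours before ZFS even enters the picture. A sketch of the usual way such a stream gets moved between hosts follows - host and dataset names are invented, and ssh itself can easily become the bottleneck:

  # zfs snapshot tank/data@migrate
  # zfs send tank/data@migrate | ssh otherhost zfs receive -d backup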
Re: [zfs-discuss] ZFS on Sun X2100?
Did you do anything specific with the drive caches? How is your ZFS performance? Nathan. :) Rich Teer wrote: On Wed, 19 Mar 2008, Terence Ng wrote: I am new to Solaris. I have Sun X2100 with 2 x 80G harddisks (run as email server, run tomcat, jboss and postgresql) and want to run as mirror to secure the data. Since ZFS cannot be used as a root file system , does that mean I am no way can benefit from using ZFS? Nope, in fact I have set up an X2100 pretty much exactly as you want. set up 5 partitions: /, swap, space for live upgrade, a small partition for the SVM metadbs, and the rest of the disk. This last one is used as that machines zdev for its ZFS pool. So, root and swap mirrored using SVM, and everything else on a mirrored ZFS pool. HTH, ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs 32bits
Paul - Don't substitute redundancy for backup... If your data is important to you, for the love of steak, make sure you have a backup that would not be destroyed by, say, a lightning strike, fire or stray 747. For what it's worth, I'm also using ZFS on 32-bit and have yet to experience any sort of issue. An external 500GB disk + external USB enclosure runs for what - $150? That's what I use anyways. :) Nathan. Paul Kraus wrote: On Thu, Mar 6, 2008 at 10:22 AM, Brian D. Horn [EMAIL PROTECTED] wrote: ZFS is not 32-bit safe. There are a number of places in the ZFS code where it is assumed that a 64-bit data object is being read atomically (or set atomically). It simply isn't true and can lead to weird bugs. This is disturbing, especially as I have not seen this documented anywhere. I have a dual P-III 550 Intel system with 1 GB of RAM (Intel L440GX+ motherboard). I am running Solaris 10U4 and am using ZFS (mirrors and stripes only, no RAIDz). While this is 'only' a home server, I still cannot afford to lose over 500 GB of data. If ZFS isn't supported under 32-bit systems then I need to start migrating to UFS/SLVM as soon as I can. I specifically went with 10U4 so that I would have a stable, supportable environment. Under what conditions are the 32-bit / 64-bit problems likely to occur? I have been running this system for 6 months (a migration from OpenSuSE 10.1) without any issues. The NFS server performance is at least an order of magnitude better than the SuSE server was. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
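A minimal sketch of that kind of poor-man's offsite backup, assuming a single external disk (device, pool and snapshot names are made up):

  # zpool create extbackup c6t0d0
  # zfs snapshot tank@offsite-20080306
  # zfs send tank@offsite-20080306 | zfs receive extbackup/tank
  # zpool export extbackup      (then unplug it and store it somewhere else)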
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Bob Friesenhahn wrote: On Tue, 4 Mar 2008, Nathan Kroenert wrote: It does seem that some of us are getting a little caught up in disks and their magnificence in what they write to the platter and read back, and overlooking the potential value of a simple (though potentially computationally expensive) circus trick, which might, just might, make your broken 1TB archive useful again... The circus trick can be handled via a user-contributed utility. In fact, people can compete with their various repair utilities. There are only 1048576 1-bit permuations to try, and then the various two-bit permutations can be tried. That does not sound 'easy', and I consider that ZFS should be... :) and IMO it's something that should really be built in, not attacked with an addon. I had (as did Jeff in his initial response) considered that we only need to actually try to flip 128KB worth of bits once... That many flips means that we in a way 'processing' some 128GB in the worse case when re-generating checksums. Internal to a CPU, depending on Cache Aliasing, competing workloads, threadedness, etc, this could be dramatically variable... something I guess the ZFS team would want to keep out of the 'standard' filesystem operation... hm. :\ I don't think it's a good idea for us to assume that it's OK to 'leave out' potential goodness for the masses that want to use ZFS in non-enterprise environments like laptops / home PC's, or use commodity components in conjunction with the Big Stuff... (Like white box PC's connected to an EMC or HDS box... ) It seems that goodness for the masses has not been left out. The forthcoming ability to request duplicate ZFS blocks is very good news indeed. We are entering an age where the entry level SATA disk is 1TB and users have more space than they know what to do with. A little replication gives these users something useful to do with their new disk while avoiding the need for unreliable circus tricks to recover data. ZFS goes far beyond MS-DOS's recover command (which should have been called destroy). I never have enough space on my laptop... I guess I'm a freak. But - I am sure that we are *both* right for some subsets of ZFS users, and that the more choice we have built into the filesystem, the better. Thanks again for the comments! Nathan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Hey, Bob My perspective on Big reasons for it *to* be integrated would be: - It's tested - By the folks charged with making ZFS good - It's kept in sync with the differing Zpool versions - It's documented - When the system *is* patched, any changes the patch brings are synced with the recovery mechanism - Being integrated, it has options that can be persistently set if required - It's there when you actually need it - It could be integrated with Solaris FMA to take some funky actions based on the nature of the failure, including cool messages telling you what you need to run to attempt a repair etc - It's integrated (recursive, self fulfilling benefit... ;) As for the separate utility for different failure modes, I agree, *development* of these might be faster if everyone chases their own pet failure mode and contributes it, but I still think getting them integrated either as optional actions on error, or as part of zdb or other would be far better than having to go looking for the utility and 'give it a whirl'. But - I'm sure that's a personal preference, and I'm sure that there are those that would love the opportunity to roll their own. OK - I'm going to shutup now. I think I have done this to death, and I don't want to end up in everyone's kill filter. Cheers! Nathan. Bob Friesenhahn wrote: On Tue, 4 Mar 2008, Nathan Kroenert wrote: The circus trick can be handled via a user-contributed utility. In fact, people can compete with their various repair utilities. There are only 1048576 1-bit permuations to try, and then the various two-bit permutations can be tried. That does not sound 'easy', and I consider that ZFS should be... :) and IMO it's something that should really be built in, not attacked with an addon. There are several reasons why this sort of thing should not be in ZFS itself. A big reason is that if it is in ZFS itself, it can only be updated via an OS patch or upgrade, along with a required reboot. If it is in a utility, it can be downloaded and used as the user sees fit without any additional disruption to the system. While some errors are random, others follow well defined patterns, so it may be that one utility is better than another or that user-provided options can help achieve success faster. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS vs. Novell NSS
Hm - Based on this detail from the page: Change lever for switching between Rotation + Hammering , Neutral and Hammering only I'd hope it could still hammer... Though I'd suspect the size of nails it would hammer would be somewhat limited... ;) Nathan. Boyd Adamson wrote: Richard Elling [EMAIL PROTECTED] writes: Tim wrote: The greatest hammer in the world will be inferior to a drill when driving a screw :) The greatest hammer in the world is a rotary hammer, and it works quite well for driving screws or digging through degenerate granite ;-) Need a better analogy. Here's what I use (quite often) on the ranch: http://www.hitachi-koki.com/powertools/products/hammer/dh40mr/dh40mr.html Hasn't the greatest hammer in the world lost the ability to drive nails? I'll have to start belting them in with the handle of a screwdriver... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
Are you indicating that the filesystem know's or should know what an application is doing?? It seems to me that to achieve what you are suggesting, that's exactly what it would take. Or, you are assuming that there are no co-dependent files in applications that are out there... Whichever the case, I'm confused...! Unless you are perhaps suggesting perhaps an IOCTL that an application could call to indicate I'm done for this round, please snapshot or something to that effect. Even then, I'm still confused as to how I would do anything much useful with this over and above, say, 1 minute snapshots. Nathan. Uwe Dippel wrote: atomic view? Your post was on the gory details on how ZFS writes. Atomic View here is, that 'save' of a file is an 'atomic' operation: at one moment in time you click 'save', and some other moment in time it is done. It means indivisible, and from the perspective of the user this is how it ought to look. The rub is this: how do you know when a file edit/modify has completed? Not to me, I'm sorry, this is task of the engineer, the implementer. (See 'atomic', as above.) It would be a shame if a file system never knew if the operation was completed. If an application has many files then an edit/modify may include updates and/or removals of more than one file. So once again: how do you know when an edit/modify has completed? So an 'edit' fires off a few child processes to do this and that and then you forget about them, hoping for them to do a proper job. Oh, this gives me confidence ;) No, seriously, let's look at some applications: A. User works in Office (Star-Office, sure!) and clicks 'Save' for a current work before making major modifications. So the last state of the document (odt) is being stored. Currently we can set some Backup option to be done regularly. Meaning that the backup could have happened at the very wrong moment; while saving the state on each user request 'Save' is much better. B. A bunch of e-mails are read from the Inbox and stored locally (think Maildir). The user sees the sender, doesn't know her, and deletes all of them. Of course, the deletion process will have fired up the CDP-engine ('event') and retire the files instead of deletion. So when the sender calls, and the user learns that he made a big mistake, he can roll back to before the deletion (event). C. (Sticking with /home/) I agree with you, that the rather continuous changes of the dot-files and dot-directories in the users HOME that serve JDS, and many more, do eventually not necessarily allow to reconstitute a valid state of the settings at all and any moment. Still, chances are high, that they will. In the worst case, the unlucky user can roll back to when he last took a break, if only for grabbing another coffee, because it took a minute, the writes (see above) will hopefully have completed. oh, s***, already messed up the settings? Then try to roll back to lunch break. Works? Okay! But when you roll back to lunch break, where is the stuff done in between? The backup solution means that they are lost. The event-driven (CDP) not: you can roll over all the states of files or directories between lunch break and recover the third latest version of your tendering document (see above), within the settings of the desktop that were valid this morning. Maybe SUN can't do this, but wait for Apple, and OSX10-dot-something (using ZFS as default!) will know how to do it. (And they probably also know, when their 'writes' are done.) 
Uwe This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
It occurred to me that we are likely missing the point here because Uwe is thinking of this as a One User on a System sort of perspective, whereas most of the rest of us are thinking of it from a 'Solaris' perspective, where we are typically expecting the system to be running many applications / DB's / users all at the same time. In Uwe's use cases thus far, it seems that he is interested in only the simple single user style applications, if I'm not mistaken, so he's not considering the consequences of what it *really* means to have CDP in the way he wishes. Uwe - am I close here? Nathan. Nicolas Williams wrote: On Tue, Feb 26, 2008 at 06:34:04PM -0800, Uwe Dippel wrote: The rub is this: how do you know when a file edit/modify has completed? Not to me, I'm sorry, this is task of the engineer, the implementer. (See 'atomic', as above.) It would be a shame if a file system never knew if the operation was completed. The filesystem knows if a filesystem operation completed. It can't know application state. You keep missing that. If an application has many files then an edit/modify may include updates and/or removals of more than one file. So once again: how do you know when an edit/modify has completed? So an 'edit' fires off a few child processes to do this and that and then you forget about them, hoping for them to do a proper job. Oh, this gives me confidence ;) You'd rather the filesystem guess application state than have the app tell it? Weird. Your other alternative -- saving a history of every write -- doesn't work because you can't tell what point in time is safe to restore to. No, seriously, let's look at some applications: A. User works in Office (Star-Office, sure!) and clicks 'Save' for a current work before making major modifications. So the last state of the document (odt) is being stored. Currently we can set some Backup option to be done regularly. Meaning that the backup could have happened at the very wrong moment; while saving the state on each user request 'Save' is much better. So modify the office suite to call a new syscall that says I'm internally consistent in all these files and boom, the filesystem can now take a useful snapshot. B. A bunch of e-mails are read from the Inbox and stored locally (think Maildir). The user sees the sender, doesn't know her, and deletes all of them. Of course, the deletion process will have fired up the CDP-engine ('event') and retire the files instead of deletion. So when the sender calls, and the user learns that he made a big mistake, he can roll back to before the deletion (event). Now think of an application like this but which uses, say, SQLite (e.g., Firefox 3.x, Thunderbird, ...). The app might never close the database file, just fsync() once in a while. The DB might have multiple files (in the SQLite case that might be multiple DBs ATTACHed into one database connection). Also, an fsync of a SQLite journal file is not as useful to CDP as an fsync() of a SQLite DB proper. Now add any of a large number of databases and apps to the mix and forget it -- the heuristics become impossible or mostly useless. C. (Sticking with /home/) I agree with you, that the rather continuous changes of the dot-files and dot-directories in the users HOME that serve JDS, and many more, do eventually not necessarily allow to reconstitute a valid state of the settings at all and any moment. Still, chances are high, that they will. In the worst case, the Chances? 
So what, we tell the user try restoring to this snapshot, login again and if stuff doesn't work, then try another snapshot? What if the user discovers too late that the selected snapshot was inconsistent and by then they've made other changes? unlucky user can roll back to when he last took a break, if only for grabbing another coffee, because it took a minute, the writes (see That sounds mighty painful. I'd rather modify some high-profile apps to tell the filesystem that their state is consistent, so take a snapshot. Maybe SUN can't do this, but wait for Apple, and OSX10-dot-something (using ZFS as default!) will know how to do it. (And they probably also know, when their 'writes' are done.) I'm giving you the best answer -- modify the apps -- and you reject it. Given how many important apps Apple controls it wouldn't surprise me if they did what I suggest. We should do it too. But one step at a time. We need to setup a project, gather requirements, design a solution, ... And since the solution will almost certainly entail modifications to apps where heuristics won't help, well, I think this would be a project with fairly wide scope, which means it likely won't go fast. Nico ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cause for data corruption?
My guess is that you have some defective hardware in the system that's causing bit flips in the checksum or the data payload. I'd suggest running some sort of system diagnostics for a few hours to see if you can locate the bad piece of hardware. My suspicion would be your memory or CPU, but that's just a wild guess, based on the number of errors you have and the number of devices it's spread over. Could it be that you have been corrupting data for some time and now known it? Oh - And i'd also look around based on your disk controller and ensure that there are no newer patches for it, just in case it's one for which there was a known problem. (which was worked around in the driver) I *think* there was an issue with at least one or two... Cheers! Nathan. Sandro wrote: hi folks I've been running my fileserver at home with linux for a couple of years and last week I finally reinstalled it with solaris 10 u4. I borrowed a bunch of disks from a friend, copied over all the files, reinstalled my fileserver and copied the data back. Everything went fine, but after a few days now, quite a lot of files got corrupted. here's the output: # zpool status data pool: data state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: scrub completed with 422 errors on Mon Feb 25 00:32:18 2008 config: NAMESTATE READ WRITE CKSUM dataONLINE 0 0 5.52K raidz1ONLINE 0 0 5.52K c0t0d0 ONLINE 0 0 10.72 c0t1d0 ONLINE 0 0 4.59K c0t2d0 ONLINE 0 0 5.18K c0t3d0 ONLINE 0 0 9.10K c1t0d0 ONLINE 0 0 7.64K c1t1d0 ONLINE 0 0 3.75K c1t2d0 ONLINE 0 0 4.39K c1t3d0 ONLINE 0 0 6.04K errors: 388 data errors, use '-v' for a list Last night I found out about this, it told me there were errors in like 50 files. So I scrubbed the whole pool and it found a lot more corrupted files. The temporary system which I used to hold the data while I'm installing solaris on my fileserver is running nv build 80 and no errors on there. What could be the cause of these errors?? I don't see any hw errors on my disks.. # iostat -En | grep -i error c3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 c4d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 c0t0d0 Soft Errors: 574 Hard Errors: 0 Transport Errors: 0 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 c1t0d0 Soft Errors: 549 Hard Errors: 0 Transport Errors: 0 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 c0t1d0 Soft Errors: 14 Hard Errors: 0 Transport Errors: 0 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 c0t2d0 Soft Errors: 549 Hard Errors: 0 Transport Errors: 0 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 c0t3d0 Soft Errors: 549 Hard Errors: 0 Transport Errors: 0 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 c1t1d0 Soft Errors: 548 Hard Errors: 0 Transport Errors: 0 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 c1t2d0 Soft Errors: 14 Hard Errors: 0 Transport Errors: 0 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 c1t3d0 Soft Errors: 548 Hard Errors: 0 Transport Errors: 0 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 although a lot of soft errors. 
Linux said that one disk had gone bad, but I figured the sata cable was somehow broken, so I replaced that before installing solaris. And solaris didn't and doesn't see any actual hw errors on the disks, does it? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
And would drive storage requirements through the roof!! I like it! ;) Nathan. Jonathan Loran wrote: David Magda wrote: On Feb 24, 2008, at 01:49, Jonathan Loran wrote: In some circles, CDP is big business. It would be a great ZFS offering. ZFS doesn't have it built-in, but AVS made be an option in some cases: http://opensolaris.org/os/project/avs/ Point in time copy (as AVS offers) is not the same thing as CDP. When you snapshot data as in point in time copies, you predict the future, knowing the time slice at which your data will be needed. Continuous data protection is based on the premise that you don't have a clue ahead of time which point in time you want to recover to. Essentially, for CDP, you need to save every storage block that has ever been written, so you can put them back in place if you so desire. Anyone else on the list think it is worthwhile adding CDP to the ZFS list of capabilities? It causes space management issues, but it's an interesting, useful idea. Jon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 100% random writes coming out as 50/50 reads/writes
What about new blocks written to an existing file? Perhaps we could make that clearer in the manpage too... hm. Mattias Pantzare wrote: If you created them after, then no worries, but if I understand correctly, if the *file* was created with 128K recordsize, then it'll keep that forever... Files have nothing to do with it. The recordsize is a file system parameter. It gets a little more complicated because the recordsize is actually the maximum recordsize, not the minimum. Please read the manpage: Changing the file system's recordsize only affects files created afterward; existing files are unaffected. Nothing is rewritten in the file system when you change recordsize so is stays the same for existing files. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 100% random writes coming out as 50/50 reads/writes
Hey, Richard - I'm confused now. My understanding was that any files created after the recordsize was set would use that as the new maximum recordsize, but files already created would continue to use the old recordsize. Though I'm now a little hazy on what will happen when those existing files are updated as well... hm. Cheers! Nathan. Richard Elling wrote: Nathan Kroenert wrote: And something I was told only recently - It makes a difference if you created the file *before* you set the recordsize property. Actually, it has always been true for RAID-0, RAID-5, RAID-6. If your I/O strides over two sets then you end up doing more I/O, perhaps twice as much. If you created them after, then no worries, but if I understand correctly, if the *file* was created with 128K recordsize, then it'll keep that forever... Files have nothing to do with it. The recordsize is a file system parameter. It gets a little more complicated because the recordsize is actually the maximum recordsize, not the minimum. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 100% random writes coming out as 50/50 reads/writes
And something I was told only recently - It makes a difference if you created the file *before* you set the recordsize property. If you created them after, then no worries, but if I understand correctly, if the *file* was created with 128K recordsize, then it'll keep that forever... Assuming I understand correctly. Hopefully someone else on the list will be able to confirm. Cheers! Nathan. Richard Elling wrote: Anton B. Rang wrote: Create a pool [ ... ] Write a 100GB file to the filesystem [ ... ] Run I/O against that file, doing 100% random writes with an 8K block size. Did you set the record size of the filesystem to 8K? If not, each 8K write will first read 128K, then write 128K. Also check to see that your 8kByte random writes are aligned on 8kByte boundaries, otherwise you'll be doing a read-modify-write. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
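For the record, the usual dance for a database filesystem, given the manpage behaviour quoted above - pool and dataset names are invented:

  # zfs create tank/oradata
  # zfs set recordsize=8k tank/oradata
  # zfs get recordsize tank/oradata

  (datafiles created from here on use 8k records; anything written to the
   dataset before the change keeps its old record size until it is copied
   or recreated)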
[zfs-discuss] ZFS taking up to 80 seconds to flush a single 8KB O_SYNC block.
Hey all - I'm working on an interesting issue where I'm seeing ZFS being quite cranky about writing O_SYNC written blocks. Bottom line is that I have a small test case that does essentially this: open file for writing -- O_SYNC loop( write() 8KB of random data print time taken to write data } It's taking anywhere up to 80 seconds per 8KB block. When the 'problem' is not in evidence, (and it's not always happening), I can do around 1200 O_SYNC writes per second... It seems to be waiting here virtually all of the time: 0t11021::pid2proc | ::print proc_t p_tlist|::findstack -v stack pointer for thread 30171352960: 2a118052df1 [ 02a118052df1 cv_wait+0x38() ] 02a118052ea1 zil_commit+0x44(1, 6b50516, 193, 60005db66bc, 6b50570, 60005db6640) 02a118052f51 zfs_write+0x554(0, 14000, 2a1180539e8, 6000af22840, 2000, 2a1180539d8) 02a118053071 fop_write+0x20(304898cd100, 2a1180539d8, 10, 300a27a9e48, 0, 7b7462d0) 02a118053121 write+0x268(4, 8058, 60051a3d738, 2000, 113, 1) 02a118053221 dtrace_systrace_syscall32+0xac(4, ffbfdaf0, 2000, 21e80, ff3a00c0, ff3a0100) 02a1180532e1 syscall_trap32+0xcc(4, ffbfdaf0, 2000, 21e80, ff3a00c0, ff3a0100) And this also evident in a dtrace of it, following the write in... ... 28- zil_commit 28 - cv_wait 28- thread_lock 28- thread_lock 28- cv_block 28 - ts_sleep 28 - ts_sleep 28 - new_mstate 28- cpu_update_pct 28 - cpu_grow 28- cpu_decay 28 - exp_x 28 - exp_x 28- cpu_decay 28 - cpu_grow 28- cpu_update_pct 28 - new_mstate 28 - disp_lock_enter_high 28 - disp_lock_enter_high 28 - disp_lock_exit_high 28 - disp_lock_exit_high 28- cv_block 28- sleepq_insert 28- sleepq_insert 28- disp_lock_exit_nopreempt 28- disp_lock_exit_nopreempt 28- swtch 28 - disp 28- disp_lock_enter 28- disp_lock_enter 28- disp_lock_exit 28- disp_lock_exit 28- disp_getwork 28- disp_getwork 28- restore_mstate 28- restore_mstate 28 - disp 28 - pg_cmt_load 28 - pg_cmt_load 28- swtch 28- resume 28 - savectx 28- schedctl_save 28- schedctl_save 28 - savectx ... At this point, it waits for up to 80 seconds. I'm also seeing zil_commit() being called around 7-15 times per second. For kicks, I disabled the ZIL: zil_disable/W0t1, and that made not a pinch of difference. :) For what it's worth, this is a T2000, running Oracle, connected to an HDS 9990 (using 2GB fibre), with 8KB record sizes for the oracle filesystems, and I'm only seeing the issue on the ZFS filesystems that have the active oracle tables on them. The O_SYNC test case is just trying to help me understand what's happening. The *real* problem is that oracle is running like rubbish when it's trying to roll forward archive logs from another server. It's an almost 100% write workload. At the moment, it cannot even keep up with the other server's log creation rate, and it's barely doing anything. (The other box is quite different, so not really valid for direct comparison at this point). 6513020 looked interesting for a while, but I already have 120011-14 and 127111-03 and installed. I'm looking into the cache flush settings of the 9990 array to see if it's that killing me, but I'm also looking for any other ideas on what might be hurting me. I also have set zfs:zfs_nocacheflush = 1 in /etc/system The Oracle Logs are on a separate Zpool and I'm not seeing the issue on those filesystems. The lockstats I have run are not yet all that interesting. If anyone has ideas on specific incantations I should use or some specific D or anything else, I'd be most appreciative. Cheers! Nathan. 
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
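In case it helps anyone chasing something similar, one possible incantation for timing zil_commit() calls - assuming fbt can see that function on the build in question, and that this really is where the time is going:

  # dtrace -n '
    fbt::zil_commit:entry  { self->ts = timestamp; }
    fbt::zil_commit:return /self->ts/
    {
            @["zil_commit (ns)"] = quantize(timestamp - self->ts);
            self->ts = 0;
    }'

A distribution with a long tail out past a few seconds per call points at whatever the ZIL is waiting on (cache flushes at the array, in this sort of setup) rather than at the writes themselves.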
Re: [zfs-discuss] Sun 5220 as a ZFS Server?
For what it's worth, I configured a T5220 this week with a 6 disk, three mirror zpool. (three top level mirror vdevs...). Used only internal disks... When pushing to disk, I was seeing bursts of 70 odd MB/s per spindle, with all 6 spindles making the 70MB/s, so 350MB/s ish. Read performance was about the same for large files. (did not do anything with small files, though I expect that with the 2.5 SAS disks, it should be pretty good...). I was not seeing a consistent 70MB/s per spindle, which I put down the the fact that I was only using a single thread to generate the writes. (A single thread of an N2 is only so fast... Just think of what you could do with 64 of them ;) I'll be interested to see what the others have to say. :) Hope this helps. Nathan. Michael Stalnaker wrote: We’re looking at building out sever ZFS servers, and are considering an x86 platform vs a Sun 5520 as the base platform. Any comments from the floor on comparative performance as a ZFS server? We’d be using the LSI 3801 controllers in either case. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
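If anyone wants to repeat the experiment with more than one writer, a quick-and-dirty way is a handful of dd streams in parallel (filesystem path invented):

  # for i in 0 1 2 3 4 5 6 7; do
  >     dd if=/dev/zero of=/tank/fs/stream.$i bs=1024k count=4096 &
  > done; wait

Then watch zpool iostat 1 to see how close the aggregate gets to the per-spindle numbers quoted above.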
Re: [zfs-discuss] 30 seond hang, ls command....
Any chance the disks are being powered down, and you are waiting for them to power back up? Nathan. :) Neal Pollack wrote: I'm running Nevada build 81 on x86 on an Ultra 40. # uname -a SunOS zbit 5.11 snv_81 i86pc i386 i86pc Memory size: 8191 Megabytes I started with this zfs pool many dozens of builds ago, approx a year ago. I do live upgrade and zfs upgrade every few builds. When I have not accessed the zfs file systems for a long time, if I cd there and do an ls command, nothing happens for approx 30 seconds. Any clues how I would find out what is wrong? -- # zpool status -v pool: tank state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM tankONLINE 0 0 0 raidz2ONLINE 0 0 0 c2d0ONLINE 0 0 0 c3d0ONLINE 0 0 0 c4d0ONLINE 0 0 0 c5d0ONLINE 0 0 0 c6d0ONLINE 0 0 0 c7d0ONLINE 0 0 0 c8d0ONLINE 0 0 0 errors: No known data errors # zfs list NAME USED AVAIL REFER MOUNTPOINT tank 172G 2.04T 52.3K /tank tank/arc 172G 2.04T 172G /zfs/arc # zpool list NAME SIZE USED AVAILCAP HEALTH ALTROOT tank 3.16T 242G 2.92T 7% ONLINE - ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hardware for zfs home storage
I see a business opportunity for someone... Backups for the masses... of Unix / VMS and other OS/s out there. any takers? :) Nathan. Jonathan Loran wrote: eric kustarz wrote: On Jan 14, 2008, at 11:08 AM, Tim Cook wrote: www.mozy.com appears to have unlimited backups for 4.95 a month. Hard to beat that. And they're owned by EMC now so you know they aren't going anywhere anytime soon. I just signed on and am trying Mozy out. Note, its $5 per computer and its *not* archival. If you delete something on your computer, then 30 days later it is not going to be backed up anymore. eric And they don't support Solaris or Linux, so that means I would have to transfer everything indirectly from my Mac. Or worse yet, run windoz in a VM. Hardly practical. Why is it we always have to be second class citizens! Power to the (*x) people! Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Clearing partition/label info
format -e then from there, re-label using SMI label, versus EFI. Cheers Al Slater wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi, What is the quickest way of clearing the label information on a disk that has been previously used in a zpool? regards - -- Al Slater Technical Director SCL Phone : +44 (0)1273 07 Fax : +44 (0)1273 01 email : [EMAIL PROTECTED] Stanton Consultancy Ltd Pavilion House, 6-7 Old Steine, Brighton, East Sussex, BN1 1EJ Registered in England Company number: 1957652 VAT number: GB 760 2433 55 -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHZoluz4fTOFL/EDYRAnr5AJ4ie+xFNCi6gA5HLZ8IqI1wHItEEwCgj0ru EwSc9B16io3kBz2wS0LGoEQ= =eaZc -END PGP SIGNATURE- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
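For completeness, the dialogue looks roughly like this - the menus are from memory, so details may differ by build:

  # format -e
  (select the previously-used disk)
  format> label
  [0] SMI Label
  [1] EFI Label
  Specify Label type[1]: 0
  Ready to label disk, continue? y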
Re: [zfs-discuss] ZFS + DB + fragments
This question triggered some silly questions in my mind: Lots of folks are determined that the whole COW to different locations are a Bad Thing(tm), and in some cases, I guess it might actually be... What if ZFS had a pool / filesystem property that caused zfs to do a journaled, but non-COW update so the data's relative location for databases is always the same? Or - What if it did a double update: One to a staged area, and another immediately after that to the 'old' data blocks. Still always have on-disk consistency etc, at a cost of double the I/O's... Of course, both of these would require non-sparse file creation for the DB etc, but would it be plausible? For very read intensive and position sensitive applications, I guess this sort of capability might make a difference? Just some stabs in the dark... Cheers! Nathan. Louwtjie Burger wrote: Hi After a clean database load a database would (should?) look like this, if a random stab at the data is taken... [8KB-m][8KB-n][8KB-o][8KB-p]... The data should be fairly (100%) sequential in layout ... after some days though that same spot (using ZFS) would problably look like: [8KB-m][ ][8KB-o][ ] Is this pseudo logical-physical view correct (if blocks n and p was updated and with COW relocated somewhere else)? Could a utility be constructed to show the level of fragmentation ? (50% in above example) IF the above theory is flawed... how would fragmentation look/be observed/calculated under ZFS with large Oracle tablespaces? Does it even matter what the fragmentation is from a performance perspective? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS very slow under xVM
I observed something like this a while ago, but assumed it was something I did. (It usually is... ;) Tell me - If you watch with an iostat -x 1, do you see bursts of I/O then periods of nothing, or just a slow stream of data? I was seeing intermittent stoppages in I/O, with bursts of data on occasion... Maybe it's not just me... Unfortunately, I'm still running old nv and xen bits, so I can't speak to the 'current' situation... Cheers. Nathan. Martin wrote: Hello I've got Solaris Express Community Edition build 75 (75a) installed on an Asus P5K-E/WiFI-AP (ip35/ICH9R based) board. CPU=Q6700, RAM=8Gb, disk=Samsung HD501LJ and (older) Maxtor 6H500F0. When the O/S is running on bare metal, ie no xVM/Xen hypervisor, then everything is fine. When it's booted up running xVM and the hypervisor, then unlike plain disk I/O, and unlike svm volumes, zfs is around 20 time slower. Specifically, with either a plain ufs on a raw/block disk device, or ufs on a svn meta device, a command such as dd if=/dev/zero of=2g.5ish.dat bs=16k count=15 takes less than a minute, with an I/O rate of around 30-50Mb/s. Similary, when running on bare metal, output to a zfs volume, as reported by zpool iostat, shows a similar high output rate. (also takes less than a minute to complete). But, when running under xVM and a hypervisor, although the ufs rates are still good, the zfs rate drops after around 500Mb. For instance, if a window is left running zpool iostat 1 1000, then after the dd command above has been run, there are about 7 lines showing a rate of 70Mbs, then the rate drops to around 2.5Mb/s until the entire file is written. Since the dd command initially completes and returns control back to the shell in around 5 seconds, the 2 gig of data is cached and is being written out. It's similar with either the Samsung or Maxtor disks (though the Samsung are slightly faster). Although previous releases running on bare metal (with xVM/Xen) have been fine, the same problem exists with the earlier b66-0624-xen drop of Open Solaris This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
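If it helps anyone reproduce the comparison, the sort of thing I'd run in each case looks like this - pool and file names are examples only:
==
# timestamp zpool iostat output so stalls stand out in the log
zpool iostat tank 1 | while read line
do
    echo "$(date '+%H:%M:%S') $line"
done > /tmp/zpool_iostat.log &

# generate the load (about 2GB at 16k, matching the test above)
dd if=/dev/zero of=/tank/fs/2g.dat bs=16k count=131072
==
Long runs of near-zero write bandwidth after the initial burst would line up with what Martin is describing.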
[zfs-discuss] characterizing I/O on a per zvol basis.
Hey all - Time for my silly question of the day, and before I bust out vi and dtrace... Is there a simple, existing way I can observe the read / write / IOPS on a per-zvol basis? If not, is there interest in having one? Cheers! Nathan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
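In the meantime, here's the kind of DTrace I had in mind - very much a sketch: it assumes the zvol entry points (zvol_read/zvol_write) exist in the zfs module on your build and take a uio as their second argument, and it lumps all zvols together rather than splitting them out by name:
==
# see what's actually there on your build first
dtrace -ln 'fbt:zfs:zvol_*:entry'

# rough call counts and requested bytes over 10 seconds
dtrace -qn '
fbt:zfs:zvol_read:entry,
fbt:zfs:zvol_write:entry
{
    @calls[probefunc] = count();
    @bytes[probefunc] = sum(args[1]->uio_resid);
}
tick-10s { printa("%-12s %@12d calls\n", @calls); printa("%-12s %@12d bytes\n", @bytes); exit(0); }'
==
Mapping each call back to a particular zvol name (via the dev_t) would take a bit more digging.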
Re: [zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?
I think it's a little more sinister than that... I'm only just trying to import the pool. Not even yet doing any I/O to it... Perhaps it's the same cause, I don't know... But I'm certainly not convinced that I'd be happy with a 25K, for example, panicing just because I tried to import a dud pool... I'm ok(ish) with the panic on a failed write to a non-redundant storage. I expect it by now... Cheers! Nathan. Victor Engle wrote: Wouldn't this be the known feature where a write error to zfs forces a panic? Vic On 10/4/07, Ben Rockwood [EMAIL PROTECTED] wrote: Dick Davies wrote: On 04/10/2007, Nathan Kroenert [EMAIL PROTECTED] wrote: Client A - import pool make couple-o-changes Client B - import pool -f (heh) Oct 4 15:03:12 fozzie ^Mpanic[cpu0]/thread=ff0002b51c80: Oct 4 15:03:12 fozzie genunix: [ID 603766 kern.notice] assertion failed: dmu_read(os, smo-smo_object, offset, size, entry_map) == 0 (0x5 == 0x0) , file: ../../common/fs/zfs/space_map.c, line: 339 Oct 4 15:03:12 fozzie unix: [ID 10 kern.notice] Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51160 genunix:assfail3+b9 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51200 zfs:space_map_load+2ef () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51240 zfs:metaslab_activate+66 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51300 zfs:metaslab_group_alloc+24e () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b513d0 zfs:metaslab_alloc_dva+192 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51470 zfs:metaslab_alloc+82 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514c0 zfs:zio_dva_allocate+68 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514e0 zfs:zio_next_stage+b3 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51510 zfs:zio_checksum_generate+6e () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51530 zfs:zio_next_stage+b3 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515a0 zfs:zio_write_compress+239 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515c0 zfs:zio_next_stage+b3 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51610 zfs:zio_wait_for_children+5d () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51630 zfs:zio_wait_children_ready+20 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51650 zfs:zio_next_stage_async+bb () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51670 zfs:zio_nowait+11 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51960 zfs:dbuf_sync_leaf+1ac () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b519a0 zfs:dbuf_sync_list+51 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a10 zfs:dnode_sync+23b () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a50 zfs:dmu_objset_sync_dnodes+55 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51ad0 zfs:dmu_objset_sync+13d () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51b40 zfs:dsl_pool_sync+199 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51bd0 zfs:spa_sync+1c5 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c60 zfs:txg_sync_thread+19a () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c70 unix:thread_start+8 () Oct 4 15:03:12 fozzie unix: [ID 10 kern.notice] Is this a known issue, already fixed in a later build, or should I bug it? It shouldn't panic the machine, no. 
I'd raise a bug. After spending a little time playing with iscsi, I have to say it's almost inevitable that someone is going to do this by accident and panic a big box for what I see as no good reason. (though I'm happy to be educated... ;) You use ACLs and TPGT groups to ensure 2 hosts can't simultaneously access the same LUN by accident. You'd have the same problem with Fibre Channel SANs. I ran into similar problems when replicating via AVS. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?
Erik - Thanks for that, but I know the pool is corrupted - That was kind of the point of the exercise. The bug (at least to me) is ZFS panicking Solaris just trying to import the dud pool. But, maybe I'm missing your point? Nathan. eric kustarz wrote: Client A - import pool make couple-o-changes Client B - import pool -f (heh) Client A + B - With both mounting the same pool, touched a couple of files, and removed a couple of files from each client Client A + B - zpool export Client A - Attempted import and dropped the panic. ZFS is not a clustered file system. It cannot handle multiple readers (or multiple writers). By importing the pool on multiple machines, you have corrupted the pool. You purposely did that by adding the '-f' option to 'zpool import'. Without the '-f' option ZFS would have told you that it's already imported on another machine (A). There is no bug here (besides admin error :) ). eric ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?
Awesome. Thanks, Eric. :) This type of feature / fix is quite important to a number of the guys in our local OSUG. In particular, they are adamant that they cannot use ZFS in production until it stops panicking the whole box for isolated filesystem / zpool failures. This will be a big step. :) Cheers. Nathan. Eric Schrock wrote: On Fri, Oct 05, 2007 at 08:20:13AM +1000, Nathan Kroenert wrote: Erik - Thanks for that, but I know the pool is corrupted - That was kind of the point of the exercise. The bug (at least to me) is ZFS panicking Solaris just trying to import the dud pool. But, maybe I'm missing your point? Nathan. This is a variation on the read error while writing problem. It is a known issue and a generic solution (to handle any kind of non-replicated writes failing) is in the works (see PSARC 2007/567). - Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?
Some people are just dumb. Take me, for instance... :) Was just looking into ZFS on iscsi and doing some painful and unnatural things to my boxes and dropped a panic I was not expecting. Here is what I did. Server: (S10_u4 sparc) - zpool create usb /dev/dsk/c4t0d0s0 (on a 4gb USB stick, if it matters) - zfs create -s -V 200mb usb/is0 - zfs set shareiscsi=on usb/is0 On Client A (nv_72 amd64) - iscsiadm stuff to enable sendtarget and set discovery-address to the server above - svcadm enable iscsiinitator - zpool create server_usb iscsi_target_created_above - created a few files - exported pool On Client B (nv_65 amd64 xen dom0) - iscsiadm stuff and enable service and import pool - import failed due to newer pool version... dang. - re-create pool - create some other files and stuff - export pool Client A - import pool make couple-o-changes Client B - import pool -f (heh) Client A + B - With both mounting the same pool, touched a couple of files, and removed a couple of files from each client Client A + B - zpool export Client A - Attempted import and dropped the panic. Oct 4 15:03:12 fozzie ^Mpanic[cpu0]/thread=ff0002b51c80: Oct 4 15:03:12 fozzie genunix: [ID 603766 kern.notice] assertion failed: dmu_read(os, smo-smo_object, offset, size, entry_map) == 0 (0x5 == 0x0) , file: ../../common/fs/zfs/space_map.c, line: 339 Oct 4 15:03:12 fozzie unix: [ID 10 kern.notice] Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51160 genunix:assfail3+b9 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51200 zfs:space_map_load+2ef () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51240 zfs:metaslab_activate+66 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51300 zfs:metaslab_group_alloc+24e () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b513d0 zfs:metaslab_alloc_dva+192 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51470 zfs:metaslab_alloc+82 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514c0 zfs:zio_dva_allocate+68 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514e0 zfs:zio_next_stage+b3 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51510 zfs:zio_checksum_generate+6e () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51530 zfs:zio_next_stage+b3 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515a0 zfs:zio_write_compress+239 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515c0 zfs:zio_next_stage+b3 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51610 zfs:zio_wait_for_children+5d () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51630 zfs:zio_wait_children_ready+20 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51650 zfs:zio_next_stage_async+bb () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51670 zfs:zio_nowait+11 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51960 zfs:dbuf_sync_leaf+1ac () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b519a0 zfs:dbuf_sync_list+51 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a10 zfs:dnode_sync+23b () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a50 zfs:dmu_objset_sync_dnodes+55 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51ad0 zfs:dmu_objset_sync+13d () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51b40 zfs:dsl_pool_sync+199 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51bd0 
zfs:spa_sync+1c5 () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c60 zfs:txg_sync_thread+19a () Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c70 unix:thread_start+8 () Oct 4 15:03:12 fozzie unix: [ID 10 kern.notice] Yep - Sure I did some boneheaded things here (grin) and deserved a good kick in the groin, however, should I panic a whole box just because I have attempted to import a dud pool?? Without re-creating the pool, I can now panic the system reliably just through attempting to import the pool I was a little surprised, as I would have though that there should have been no chance for really nasty things to have happened at a systemwide level, and we should have just bailed on the mount / import. I see a few bugs that were closeish to this, but not a great match... Is this a known issue, already fixed in a later build, or should I bug it? After spending a little time playing with iscsi, I have to say it's almost inevitable that someone is going to do this by accident and panic a big box for what I see as no good reason. (though I'm happy to be educated... ;) Oh - and also - Kudos to the ZFS team and the other involved in the whole iSCSI thing. So easy
Re: [zfs-discuss] pool is full and cant delete files
And if there is a rubbish file somewhere, I *think* you should be able to cat /dev/null > thatfile which would free up its blocks. Assuming you don't have snapshots... ;) Nathan. Anton B. Rang wrote: At least three alternatives -- 1. If you don't have the latest patches installed, apply them. There have been bugs in this area which have been fixed. 2. If you still can't remove files with the latest patches, and you have a service contract with Sun, open a service request to get help. 3. Add a new device (or RAID group) to the pool, which will give you free space again. You can then delete files, at the cost of having your pool larger, since you can't remove it again. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
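A tiny example of what I mean - paths are made up, and as above, a snapshot still holding the blocks defeats it:
==
# truncate in place instead of rm - no new blocks are needed to free the old ones
cat /dev/null > /tank/home/bigfile
sync
zfs list tank/home    # AVAIL should start creeping back up
==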
[zfs-discuss] NV_65 AMD64 - ZFS seems to write fast and slow to a single spindle
Hey all - Just saw something really weird. I have been playing with by box for a little while now, and just noticed something whilst checking how fast / slow my IDE ports were on a newish motherboard... I had been copying around an image. Not a particularly large one - 500M ISO... I had been observing the read speed off disk, and write speed to disk. When reading from one disk and writing to another, I was seeing about 60MB/s and all was as expected. But, then, I thought I'd do one more run, and copied the *same* image as my last run... Of course, the image was in memory, so I expected there would be no reads and lots of writes. What I saw was lots of not very impressive speed (cmdk1 is the target): extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b cmdk0 0.00.00.00.0 0.0 0.00.0 0 0 cmdk1 0.0 35.30.0 4514.1 32.4 2.0 975.3 100 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b cmdk0 0.00.00.00.0 0.0 0.00.0 0 0 cmdk1 0.0 36.70.0 4697.6 32.4 2.0 936.8 100 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b cmdk0 0.00.00.00.0 0.0 0.00.0 0 0 cmdk1 0.0 36.80.0 4650.6 32.4 2.0 935.1 100 100 extended device statistics devicer/sw/s kr/s kw/s wait actv svc_t %w %b cmdk0 0.00.00.00.0 0.1 0.00.0 0 0 cmdk1 0.0 37.50.0 4424.9 32.3 2.0 913.8 100 100 So my target disk, which is owned exclusively by ZFS, was apparently flat out writing at 4.4MB/s. At the time it goes bad, the svc_t jumps from about 125ms to 950ms. Ouch!! On closer inspection, I see that - - The cp returns almost immediately. (somewhat expected) - ZFS starts writing at about 60MB/s, but only for about 2 seconds (This is changable. Sometimes, it writes the whole image at the slower rate.) - the write rate drops back to 4 - 5MB/s - CPU usage is only 8% - I still have 1.5GB of 4GB free memory (Though I *am* running Xen at the moment. Not sure if that matters) - If I kick off a second copy to a different filename whilst the first is running it does not get any faster. - If I kick off a write to a raw zvol on the same pool, the write rate to the disk jumps back up to the expected 60MB/s, but drops again as soon as it's completed the write to the raw zvol... So, it seems it's not the disk itself. - The zpool *has* been exported and imported this boot. Not sure if that matters either. - I had a hunch that memory availability might be playing a part, so I forced a whole heap to be freed up with a honking big malloc and walk of the pages. I freed up 3GB (box has 4GB total) and it seems that I start to see the problem much more frequently as I get to about 1.5GB free. - It's not entirely predictable. Sometimes, it'll write at 50-60MB/s for up to 8 or so seconds, and others, it'll only write fast for a burst right at the start, then take quite some time to write out the rest. It's almost as if we are being throttled on the rate at which we can push data through the ZFS in-memory cache when writing previously read and written data. Or something equally bogus like me expecting that ZFS would write as fast as it can all the time, which I guess might be an invalid assumption? Now: This is running NV_65 with the Xen bits from back then. Not sure if that really matters. 
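For reference, the comparison that shows it off boils down to something like this - dataset and file names are examples from this box, and the zvol is a throwaway:
==
# the slow case: copy an already-cached image into the pool and watch the spindle
cp /var/tmp/some.iso /disk2/crap/some.iso.copy &
iostat -xn 1    # the target disk sits at ~4-5MB/s with svc_t up around 950ms

# the fast case: push roughly the same amount straight at a raw zvol on the same pool
zfs create -V 1g disk2/testvol
dd if=/dev/zero of=/dev/zvol/rdsk/disk2/testvol bs=1024k count=500 &
iostat -xn 1    # jumps back up to ~60MB/s while the zvol write runs
==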
Does not seem that the disk is having problem - beaker:/disk2/crap # zpool status pool: disk2 state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM disk2 ONLINE 0 0 0 c3d0 ONLINE 0 0 0 errors: No known data errors pool: zfs state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM zfs ONLINE 0 0 0 c1d0s7ONLINE 0 0 0 errors: No known data errors c1d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 Model: ST3320620AS Revision: Serial No: 6QF Size: 320.07GB 320070352896 bytes Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 c3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 Model: ST3320620AS Revision: Serial No: 6QF Size: 320.07GB 320070352896 bytes Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 c0t1d0 Soft Errors: 4 Hard Errors: 302 Transport Errors: 0 If anyone within SWAN on the ZFS team wanted to take a look at this box and see if it's a new bug or just me being a bonehead and not understanding what I'm seeing, please respond to me directly, and I can provide access. (I'll make an effort not to reboot the box just in case it's only this
Re: [zfs-discuss] Is there _any_ suitable motherboard?
For what it's worth, I bought a Gigabyte GA-M57SLI-S4 a couple of months ago and it rocks on a reasonably current Nevada. Certainly not the cheapest or most expensive, but I felt a good choice for multiple PCI-E slots and a couple of PCI slots. http://www.gigabyte.com.tw/Products/Motherboard/Products_Spec.aspx?ClassValue=MotherboardProductID=2287ProductName=GA-M57SLI-S4 Everything on it worked a treat for me, and paired with an Nvidia 7900GS, has handled pretty much whatever I have thrown at it, including Second Life on Solaris. :) On Nevada (at the time I build it, it was NV_65), everything just worked straight out of the box. Gig Ethernet, IDE ports, SATA ports (in compatability mode), USB, 1394, audio, dual core stuff, the lot. SATA ports work fine and dandy (up to about 70MB/S per port on the outer edge of the disk using Seagate 320GB 16MB cache 7200RPM disks) using the IDE emulation. I'm waiting for the build of nevada that provides the Nvidia MCP55 SATA controller support for native SATA stuff. Not long now... ZFS seems to be able to write down the channel's at about 80MB/s... (At least on a brand spanking new Zpool. Seems closer to 60MB/s now...) Once the Nvidia SATA stuff goes back, you'll have 6 ports of NVidia SATA goodness straight off the board. (and even now, you have 6 ports of reasonable speedyness in good old ATA mode.) From what I can tell, the Nvidia SATA devices hang straight off the PCI-E bus, so you might even be able to get 'em all running flat out. (Though, I'm basing this on the output of the prtconf, I could be completely wrong.) See bottom of post for the prtconf -D output. I'm also running it as a Solaris Xen Dom0 with other OS/s lurking on top of that, so the HVM support also works great. I have just submitted this board to the HCL for SXDE, and if I get a chance, I'll pull the latest S10 and give that a whirl too. Hope this helps. (and excuse the prtconf being from the Xen boot, rather than bare metal... got a bit of stuff happening on the box at the moment and did not feel like rebooting. 
;) /root # prtconf -D System Configuration: Sun Microsystems i86pc Memory size: 3895 Megabytes System Peripherals (Software Nodes): i86xpv (driver name: rootnex) scsi_vhci, instance #0 (driver name: scsi_vhci) isa, instance #0 (driver name: isa) fdc, instance #0 (driver name: fdc) fd, instance #0 (driver name: fd) asy, instance #0 (driver name: asy) lp, instance #0 (driver name: ecpp) i8042, instance #0 (driver name: i8042) keyboard, instance #0 (driver name: kb8042) motherboard xpvd, instance #0 (driver name: xpvd) xencons, instance #0 (driver name: xencons) xenbus, instance #0 (driver name: xenbus) domcaps, instance #0 (driver name: domcaps) balloon, instance #0 (driver name: balloon) evtchn, instance #0 (driver name: evtchn) privcmd, instance #0 (driver name: privcmd) pci, instance #0 (driver name: npe) pci1458,5001 pci1458,c11 pci1458,c11 pci1458,c11 pci1458,5004, instance #0 (driver name: ohci) mouse, instance #1 (driver name: hid) pci1458,5004, instance #0 (driver name: ehci) pci-ide, instance #0 (driver name: pci-ide) ide, instance #0 (driver name: ata) sd, instance #1 (driver name: sd) sd, instance #0 (driver name: sd) ide (driver name: ata) pci-ide, instance #1 (driver name: pci-ide) ide, instance #2 (driver name: ata) cmdk, instance #0 (driver name: cmdk) ide, instance #3 (driver name: ata) cmdk, instance #1 (driver name: cmdk) pci-ide, instance #2 (driver name: pci-ide) ide (driver name: ata) ide (driver name: ata) pci-ide, instance #3 (driver name: pci-ide) ide (driver name: ata) ide (driver name: ata) pci10de,370, instance #0 (driver name: pci_pci) pci8086,1e, instance #0 (driver name: e1000g) pci1458,1000, instance #0 (driver name: hci1394) pci1458,a002, instance #0 (driver name: audiohd) pci1458,e000, instance #0 (driver name: nge) pci10de,377, instance #0 (driver name: pcie_pci) display, instance #0 (driver name: nvidia) pci1022,1100 (driver name: mc-amd) pci1022,1101 (driver name: mc-amd) pci1022,1102 (driver name: mc-amd) pci1022,1103, instance #0 (driver name: amd64_gart) iscsi, instance #0 (driver name: iscsi) pseudo, instance #0 (driver name: pseudo) options, instance #0 (driver name: options) agpgart, instance #0 (driver name: agpgart) xsvc, instance #0 (driver name: xsvc) used-resources cpus cpu, instance #0 cpu, instance #1 Nathan. Ben Middleton wrote:
Re: [zfs-discuss] SiI 3114 Chipset on Syba Card - Solaris Hangs
Some time ago I encountered issues using the odd numbered ports on my SIL3114 based card. I currently use ports 0 and 2 without issue. I never did get ports 1 and 3 working... If I have a disk connected to ports 1 or 3, it just conks out on the way up when it's initializing the disks. (Unfortunately, I don't remember for sure, but I think it was a hard hang). I should likely have bugged it, but the box on which I was doing the work fills the role of my gateway to the internet, so I was disinclined to spend lots of time trying to break it, when I only needed two disks working anyways... My 2c... Nathan. Blake wrote: I have re-flashed the BIOS. Blake On 8/7/07, *Ian Collins* [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: Blake wrote: Hi. I'm running snv 65 and having an issue much like this: http://osdir.com/ml/solaris.opensolaris.help/2006-11/msg00047.html http://osdir.com/ml/solaris.opensolaris.help/2006-11/msg00047.html http://osdir.com/ml/solaris.opensolaris.help/2006-11/msg00047.html Has anyone found a workaround? Or is this the issue with the BIOS not liking EFI information that ZFS uses? Are you sure the card doesn't have a RAID BIOS? If it does, it will have to be re-flashed. Ian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS, iSCSI + Mac OS X Tiger (globalSAN iSCSI)
Hey there - This is very likely completely unrelated, but here goes anyhoo... I have noticed with some particular ethernet adapters (e1000g in my case) and large MTU sizes (8K) that things (most anything that really pushes the interface) sometimes stop for no good reason on my x86 Solaris boxes. After it stops, I'm able to re-connect after a short time and it works for a while again... (Really must get around to properly reproducing the problem and logging a bug too...) I'd be curious to know if setting the MTU to 1500 on both systems makes any difference at all. Note that I have only observed this with my super cheap adapters at home. I'm yet to see if (though also yet to try really hard) on the more expensive ones at work... Again - Likely nothing to do with your problem, but hey. It has made a difference for me before... Cheers. Nathan. George wrote: I have set up an iSCSI ZFS target that seems to connect properly from the Microsoft Windows initiator in that I can see the volume in MMC Disk Management. When I shift over to Mac OS X Tiger with globalSAN iSCSI, I am able to set up the Targets with the target name shown by `iscsitadm list target` and when I actually connect or Log On I see that one connection exists on the Solaris server. I then go on to the Sessions tab in globalSAN and I see the session details and it appears that data is being transferred via the PDUs Sent, PDUs Received, Bytes, etc. HOWEVER the connection then appears to terminate on the Solaris side if I check it a few minutes later it shows no connections, but the Mac OS X initiator still shows connected although no more traffic appears to be flowing in the Session Statistics dialog area. Additionally, when I then disconnect the Mac OS X initiator it seems to drop fine on the Mac OS X side, even though the Solaris side has shown it gone for a while, however when I reconnect or Log On again, it seems to spin infinitely on the Target Connect... dialog. Solaris is, interestingly, showing 1 connection while this apparent issue (spinning beachball of death) is going on with globalSAN. Even killing the Mac OS X process doesn't seem to get me full control again as I have to restart the system to kill all processes (unless I can hunt them down and `kill -9` them which I've not successfully done thus far). Has anyone dealt with this before and perhaps be able to assist or at least throw some further information towards me to troubleshoot this? Thanks much, -George ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
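If you want to try it, the change is quick and non-persistent - the interface name is an example from my boxes:
==
# check what the interface is currently running
ifconfig e1000g0

# drop back to a standard 1500 byte MTU for the test (reverts on reboot)
ifconfig e1000g0 mtu 1500
==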
Re: [zfs-discuss] ZFS not utilizing all disks
Simple test - mkfile 8gb now and see where the data goes... :) Victor Latushkin wrote: Robert Milkowski wrote: Hello Leon, Thursday, May 10, 2007, 10:43:27 AM, you wrote: LM Hello, LM I've got some weird problem: ZFS does not seem to be utilizing LM all disks in my pool properly. For some reason, it's only using 2 of the 3 disks in my pool: LMcapacity operationsbandwidth LM pool used avail read write read write LM -- - - - - - - LM database8.48G 1.35T202 0 12.4M 0 LM c0t1d04.30G 460G103 0 6.21M 0 LM c0t3d04.12G 460G 96 0 6.00M 0 LM c0t2d054.9M 464G 2 0 190K 0 LM -- - - - - - - LM I've added all the disks at the same time, so it's not like the LM last disk was added later. Any ideas on what might be causing this ? I'm using solaris express b62. LM Your third disks is 4GB larger that first two disks and ZFS tries to load-balance data so that you can fill up all devices. As you've already have about 4GB on each of the first two disks ZFS should start to use third disks after copying addtitional data. No, it is not - other two disks have 4G out of 464G used, and disk in question has only 55M used. So for me it does not look like weighting problem. This is something else I believe. I'm not sure but i suspect this may be somehow related to meta data allocation, given that ZFS stores two copies for file system meta data. But this is nothing more than a wild guess. Leon, What kind of data is stored in this pool? What Solaris version are you using? How is your pool configured? Cheers, Victor ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
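i.e. something along these lines, assuming the pool is mounted at /database:
==
mkfile 8g /database/mkfile.test &
zpool iostat -v database 5
==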
Re: [zfs-discuss] Re: [osol-help] How to recover from rm *?
begin crackly, broken record :) I, for one, would love to have similar functionality that we had in good old netware, where we could 'salvage' deleted files. The concept was that when the files were deleted, they were not actually removed, nor were the all important references to the files to allow undeleting them. In the event that a user had an oops, they could just run salvage (or later, filer) and pick the files from the directory in question, and *whammo*, undelete it. I don't recall ever having to do whole directories, nor if it was actually possible... IIRC, you could set policy that determined when the deleted files data blocks became available for overwriting (and hence permanent deletion of the file). If it happened that there was too much space 'used up' by the deleted files, you could run a purge. (I was not a fan of that part). I'd have preferred that the deleted files were simply overwritten in a fifo manner, and left purge out of it. Yes - Snapshots are great, but how often do you run a snapshot? Every 60 seconds? That's going to get real ugly if you have a filesystem per user... I once cobbled up a poor man's version of this sort of thing, aliasing rm to a scripted mv, and pushing everything into a /fs/deleted/* area when someone ran rm (maintaining filesystem directory structure). I then had occasional rm's run through it, once the filesystem reached a certain highwater mark. Something under the covers of ZFS that provided dumb dumb protection would be very cool. I was saved a number of times by the hackery above... cheers! Nathan. Robert Milkowski wrote: Hello Jeremy, Monday, February 19, 2007, 1:58:18 PM, you wrote: Something similar was proposed here before and IIRC someone even has a working implementation. I don't know what happened to it. JT That would be me. AFAIK, no one really wanted it. The problem that it JT solves can be solved by putting snapshots in a cronjob. Not exactly the same. But if people really do not want it... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
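The wrapper was nothing fancier than something like this - from memory, so treat it as a sketch: no handling of rm's flags, name collisions or quotas:
==
#!/usr/bin/ksh
# "deleted files" holding area instead of a real unlink
TRASH=/fs/deleted
for f in "$@"
do
    d=$(dirname "$f")
    mkdir -p "${TRASH}/${d}"
    mv "$f" "${TRASH}/${d}/"
done
==
Alias rm to the script, and have a cron job prune the oldest entries once /fs/deleted crosses a high-water mark.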
Re: [zfs-discuss] Re: [osol-help] How to recover from rm *?
I'd usually agree with that, but - if we have an opportunity to make users love ZFS even more, why not at least investigate it. A perfect example might be exactly what I did on one occasion, where I copied a bunch of photos off a CF card. I then reformatted the CF card, and cleaned up the crappy photos on my hard disk, but unfortunately, (and stupidly) removed all of them. My photos were gone forever. :( Even a snapshot would not have helped here... I know; stupid stupid stupid, but it happens, and I would have *really* liked to have been able to recover those photos... A salvage / undelete would have been gold. Nathan. James Dickens wrote: Yes - Snapshots are great, but how often do you run a snapshot? Every 60 seconds? That's going to get real ugly if you have a filesystem per user... I'm sure every 15 minutes is sufficient; if the worker doesn't have a slight penalty he won't ever learn to be careful. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Cheap ZFS homeserver.
Urk! Where is this documented? And - is it something you can do nothing about, or are we ultimately trying to address it somewhere / somehow? Thanks!! Nathan. Bill Moore wrote: On Wed, Jan 31, 2007 at 05:01:19AM -0800, Tom Buskey wrote: As a followup, the system I'm trying to use this on is a dual PII 400 with 512MB. Real low budget. 2 500 GB drives with 2 120 GB in a RAIDZ. The idea is that I can get 2 more 500 GB drives later to get full capacity. I tested going from a 20GB to a 120GB and that worked well. I'm finding the throughput just isn't there. 2MB/s compared to 20MB/s on a similar Linux system. Anyone else going this low budget? There are many folks (including myself) that have done similar super-low budget setups. Which SATA controller are you using? If it's the SI3112, it has a documented problem when you try to use both SATA channels simultaneously - it gets less than 2MB/s, compared to about 50MB/s on each drive individually. I'm not sure if the SI3114 has similar problems, or not. This may not be your problem, but I know the SI3112 was popular on machines in that timeframe. --Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hot spares - in standby?
Random thoughts: If we were to use some intelligence in the design, we could perhaps have a monitor that profiles the workload on the system (a pool, for example) over a [week|month|whatever] and selects a point in time, based on history, when it would expect the disks to be quiet, and can 'pre-build' the spare with the contents of the disk it's about to swap out. At the point of switch-over, it could be pretty much instantaneous... It could also bail if it happened that the system actually started to get genuinely busy... That might actually be quite cool, though, if all disks are rotated, we end up with a whole bunch of disks that are evenly worn out again, which is just what we are really trying to avoid! ;) Nathan. Wee Yeh Tan wrote: On 1/30/07, David Magda [EMAIL PROTECTED] wrote: What about a rotating spare? When setting up a pool a lot of people would (say) balance things around buses and controllers to minimize single points of failure, and a rotating spare could disrupt this organization, but would it be useful at all? The costs involved in rotating spares in terms of IOPS reduction may not be worth it. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on a damaged disk
On a recent journey of pain and frustration, I had to recover a UFS filesystem from a broken disk. The disk had many bad blocks and more were going bad over time. Sadly, there were just a few files that I wanted, but I could not mount the disk without it killing my system. (PATA disks... PITA if you ask me...) My recovery method, though painful, might be of value in you locating the bad regions of the disk. What I did was to kick off a script that used dd, and did something like this... == #! /usr/bin/ksh SEEK=0 while : do dd if=/dev/rdsk/c0d1s7 of=backup.ufs.s7 bs=8192 \ oseek=${SEEK} iseek=${SEEK} count=1 conv=noerror,sync SEEK=$((SEEK + 1)) done == (Or something to that effect.) Anyhoo - the point is that this hit the disk one block at a time(I chose 8kb, as it was the ufs block size, and 512 byte blocks looked like it would take 3 weeks), and I was ultimately able to get my data back (at least the bits I cared about...) after futzing with fsck and some other novelties. If you were to do something similar to this, but instead of copying the block, send it to /dev/null, and log the result of dd, you could get a complete list of broken blocks. A few botnotes: - Yes. This is slow. WAY slow, and there are thousands of different ways that could have done this better and faster. However, it saved me from having to do anything else, and at the time, I did not feel like breaking out a compiler. Due to the massively large number of bad blocks on my disk, the size of the disk, 160GB, and the number of retries my system made for each bad block, it took 10 days (!!) to read through the whole disk 8kb at a time. - If you are happy to throw away larger blocks of disk, you could use a larger block size, which would speed things up. - If you disk really does have bad blocks that are getting in the way, chances are that it's going to get worse, and pain will ensue. I'd suggest that a new disk might be a better option. - On the new disk front, note that many hard disks come with 5 year warranties these days. If the disk is not super old, you might be able to get it replaced under warranty if you send it directly to the manufacturer... Hope this helps at least provide some ideas. :) Oh - and get a new disk. ;) Nathan. Patrick P Korsnick wrote: i have a machine with a disk that has some sort of defect and i've found that if i partition only half of the disk that the machine will still work. i tried to use 'format' to scan the disk and find the bad blocks, but it didn't work. so as i don't know where the bad blocks are but i'd still like to use some of the rest of the disk, i thought ZFS might be able to help. i partitioned the disk so slices 4,5,6 and 7 are each 5GB. i thought i'd make one or multiple zpools on those slices and then i'd be able to narrow down where the bad sections are. so my question is can i declare a zpool that spans multiple c0d0sXX but isn't a mirror and if i can, then will zfs be able to detect where the problem c0d0sXX is and not use it? if not, i'll have to make 4 different zpools and experiment with storing stuff on each to find the approximate location of the bad blocks. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
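If I were doing it again, I'd add an exit test and a bad-block log to the loop - something like this sketch (device, output file and block size are examples, and I haven't re-run it against a dying disk):
==
#!/usr/bin/ksh
# Block-by-block salvage with a bad-block log.
DEV=/dev/rdsk/c0d1s7
OUT=backup.ufs.s7
BS=8192
SEEK=0
while :
do
    MSG=$(dd if=${DEV} of=${OUT} bs=${BS} iseek=${SEEK} oseek=${SEEK} \
        count=1 conv=noerror,sync 2>&1)
    # past the end of the slice, dd copies nothing - time to stop
    echo "${MSG}" | grep '^0+0 records in' > /dev/null && break
    # a read error still completes thanks to noerror,sync, but shows up on stderr
    echo "${MSG}" | grep -i 'error' > /dev/null && \
        echo "bad block at ${BS}-byte offset ${SEEK}" >> badblocks.log
    SEEK=$((SEEK + 1))
done
==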
Re: [zfs-discuss] weird thing with zfs
Hm. If the disk has no label, why would it have an s0? Or, did you mean p0? Nathan. On Wed, 2006-12-06 at 04:45, Krzys wrote: Does not work :( dd if=/dev/zero of=/dev/rdsk/c3t6d0s0 bs=1024k count=1024 dd: opening `/dev/rdsk/c3t6d0s0': I/O error That is so strange... it seems like I lost another disk... I will try to reboot and see what I get, but I guess I need to order another disk then and give it a try... Chris On Tue, 5 Dec 2006, Al Hopper wrote: On Tue, 5 Dec 2006, Krzys wrote: Thanks, ah another weird thing is that when I run format on that drive I get a coredump :( ... snip Try zeroing out the disk label with something like: dd if=/dev/zero of=/dev/rdsk/c?t?d?p0 bs=1024k count=1024 Regards, Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_51 hangs
Hm. If the system is hung, it's unlikely that a reboot -d will help. You want to be booting into kmdb, then using the F1-A interrupt sequence, then dumping using $<systemdump at the kmdb prompt. See the following documents: Index of lots of useful stuff: http://docs.sun.com/app/docs/doc/817-1985/6mhm8o5p3?a=view Forcing a crashdump on x86 boxes: http://docs.sun.com/app/docs/doc/817-1985/6mhm8o5q5?a=view And booting from grub into kmdb: http://docs.sun.com/app/docs/doc/817-1985/6mhm8o5q2?a=view I'm not sure how the serial console is going to impact you - I believe a BREAK sent over the serial line does the same job as F1-A to drop to the debugger... That's assuming it's not a hard hang. :) Cheers. Nathan. On Wed, 2006-11-15 at 14:16, Sean Ye wrote: Hi, Chris, You may force a panic by reboot -d. Thanks, Sean On Tue, Nov 14, 2006 at 09:11:58PM -0600, Chris Csanady wrote: I have experienced two hangs so far with snv_51. I was running snv_46 until recently, and it was rock solid, as were earlier builds. Is there a way for me to force a panic? It is an x86 machine, with only a serial console. Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
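From memory, the moving parts look roughly like this - double-check the grub syntax for your build against the doc links above, as the boot line below is only an example:
==
# 1. In grub, edit the kernel line and add -k so kmdb is loaded at boot,
#    plus the console setting if you're on serial, e.g. something like:
#      kernel /platform/i86pc/multiboot -k -B console=ttya
#
# 2. When it hangs, break into the debugger:
#      F1-A on the local keyboard, or send a BREAK over the serial line
#
# 3. At the kmdb prompt, force the dump:
[0]> $<systemdump
==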
Re: [zfs-discuss] Best Practices recommendation on x4200
On Thu, 2006-11-09 at 10:21, Richard Elling - PAE wrote: One way to populate an ABE is to mirror slices. However, you cannot mirror between a device that starts at cylinder 0 and one that does not. Where is this restriction documented? It doesn't make sense to me. Maybe you have a scar from running Sybase in a previous life? ;-) IIRC, that's a part of the history of disksuite / SVM. Moreover, it was that you cannot mirror a slice that has a VTOC label on it to one that does not... (hence the understanding of it being a cylinder 0 issue). http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/lvm/mirror/mirror_ioctl.c#887 Or, perhaps I need more coffee... Cheers! Nathan. ;) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris vs. Solaris 10 11/06 (S10u3) for NFS ZFS Server
For me, it came down to - Do I want to patch, or upgrade? My gateway to the internet is a solaris 10 box, patched whenever required. I like that as soon as a security patch is available, I can apply it and reboot. Simple. My laptop runs nevada. I upgrade from network / dvd when I see a new feature that excites me. As far as whiz-bang things that would excite you, only you will know that for sure. :) Cheers! Nathan. On Thu, 2006-11-09 at 08:58, Wes Williams wrote: I'm in the process of building a Solaris NFS server with ZFS and was wondering if any gurus here have any comments as to choosing the upcoming Solairs 10 11/06 [presumably] or OpenSolaris bXX/Solairs Express for this use. Even with my use of OpenSolaris I maintain a service contract to show my support, so bug fixes in a static supported version shouldn't be an issue in picking a version. So, the short question is are there any super-cool must-have ZFS/NFS features in OpenSolaris now that S10u3 won't have right away? Thanks! This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Where is the ZFS configuration data stored?
I'll take a crack at this. First off, I'm assuming that the RAID you are talking about is provided by the hardware and not by ZFS. If that's the case, then it will depend on the way you created the raid set, the BIOS of the controller, and whether or not these two things match up with any other systems. A few of the RAID controllers I have played with have an option to 'rebuild' a raid set, which I get the impression (though have never tried) allows you to essentially tell the controller there is a raid set there, and if you set it up the same way as before, it will just work. Personally, unless I was moving the disks to another system with the same RAID controller and BIOS, I would have no expectation it would work. It might, but I would not be surprised (or disappointed) if it did not. If you are talking about using ZFS's raid, then you won't need to do anything. It should just work, as ZFS will be able to just import the zpool. I hope I understood your question. (And I hope I'm telling no lies... ;) Nathan. Sergey wrote: + a little addition to the original question: Imagine that you have a RAID attached to a Solaris server. There's ZFS on the RAID. And someday you lose your server completely (fried motherboard, physical crash, ...). Is there any way to connect the RAID to another server and restore the ZFS layout (not losing all data on the RAID)? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
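For the ZFS-on-RAID case, "just work" amounts to this on the replacement server once the array is cabled up - pool name is an example:
==
zpool import            # scans the attached devices and lists any pools it finds
zpool import tank       # clean import if the dead server managed to export it
zpool import -f tank    # force it if the old host died without exporting
==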
Re: [zfs-discuss] zpool list No known data errors
I might be wrong here, but I think it's telling you that there are no errors. Something like: errors: none or errors: None that we know of, but we'll let you know if there are any. At least that is how I'd read it. :) Do you have an actual problem other than the text? Nathan. On Tue, 2006-10-10 at 10:05, ttoulliu2002 wrote: Hi: I have zpool created # zpool list NAME SIZE USED AVAIL CAP HEALTH ALTROOT ktspool 34,5G 33,5K 34,5G 0% ONLINE - However, zpool status shows no known data error. May I know what is the problem # zpool status pool: ktspool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM ktspool ONLINE 0 0 0 c0t1d0s6 ONLINE 0 0 0 errors: No known data errors This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Significant pauses during zfs writes
Hey, Bob - It might be worth exploring where your data stream for the writes was coming from. Moreover, it might be worth exploring how fast it was filling up caches for writing. Were you delivering enough data to keep the disks busy 100% of the time? I have been tricked by this before... :) Nathan. On Tue, 2006-08-15 at 01:38, James C. McPherson wrote: Bob Evans wrote: Just getting my feet wet with zfs. I set up a test system (Sunblade 1000, dual channel scsi card, disk array with 14x18GB 15K RPM SCSI disks) and was trying to write a large file (10 GB) to the array to see how it performed. I configured the raid using raidz. During the write, I saw the disk access lights come on, but I noticed a peculiar behavior. The system would write to the disk, but then pause for a few seconds, then contineu, then pause for a few seconds. I saw the same behavior when I made a smaller raidz using 4x36 GB scsi drives in a different enclosure. Since I'm new to zfs, and realize that I'm probably missing something, I was hoping somebody might help shed some light on my problem. Hi Bob, I'm pretty sure that's not a problem that you're seeing, just ZFS' normal behaviour. Writes are coalesced as much as possible, so the pauses that you observed are most likely going to be the system waiting for suitable IOs to be gathered up and sent out to your storage. If you want to examine this a bit more then might I suggest the DTrace Toolkit's iosnoop utility. best regards, James C. McPherson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs sucking down my memory!?
Something I often do when I'm a little suspicious of this sort of activity is to run something that steals vast quantities of memory... eg: something like this:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    int memsize = 0;
    char *input_string;
    char *memory;
    long i = 0;

    input_string = malloc(256 * sizeof(char));
    printf("How much memory? :");
    input_string = fgets(input_string, 255, stdin);
    memsize = atoi(input_string);
    printf("mem_size=%d\n", memsize);
    memory = calloc(memsize * 1024 * 1024, 1);
    printf("Pausing: hit enter to exit\n");
    input_string = fgets(input_string, 255, stdin);
    exit(0);
}

which allows me to request, say, 500mb of memory. Watching vmstat whilst doing this is interesting. It then runs and uses lots of memory, and causes some pressure. If, at the end when it exits, you have lots of memory free, and nothing swapped out, it's all good. :) quick, dirty, possibly even smelly, with no error checking at all... :) Nathan. On Fri, 2006-07-21 at 09:28, Eric Schrock wrote: There are two things to note here: 1. The vast majority of the memory is being used by the ZFS cache, but appears under 'kernel heap'. If you actually need the memory, it _should_ be released. Under UFS, this cache appears as the 'page cache', and users understand that it can be released when needed. The same is true of ZFS, but it's just not accounted for as separate memory. Now, the VM hooks needed to do this are somewhat ad hoc at the moment, but the ZFS cache should keep itself from consuming 100% of the available memory. 2. There is a difference between VA (virtual addressing) and physical memory. See the following thread for a more complete discussion: http://www.opensolaris.org/jive/thread.jspa?threadID=10774&tstart=45&start=15 So the (apparent) high kernel memory consumption is expected, and does not indicate any type of problem. Applications actually receiving ENOMEM should not happen, and may indicate that there are some circumstances where the VM interfaces are currently inadequate. Someone else on the ZFS team may be able to get some more specifics from you to figure out what's really going on. - Eric On Thu, Jul 20, 2006 at 04:03:50PM -0700, Joseph Mocker wrote: So what's going on! Please help. I want my memory back! This is essentially by design, due to the way that ZFS uses kernel memory for caching and other stuff. You can alleviate this somewhat by running a 64bit processor, which has a significantly larger address space to play with. Uhh. If I don't have any more physical memory, how does a 64bit processor help? FWIW, this is on a SunBlade 2000 running in 64bit mode: [EMAIL PROTECTED]: uname -a SunOS watt 5.10 Generic_118833-17 sun4u sparc SUNW,Sun-Blade-1000 [EMAIL PROTECTED]: isainfo sparcv9 sparc -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss