Re: [zfs-discuss] Raidz1 p
Sure, and thanks for the quick reply.

Controller: Supermicro AOC-SAT2-MV8 plugged into a 64-bit PCI-X 133 bus
Drives: 5 x Seagate 7200.11 1.5TB disks for the raidz1. Single 36GB Western Digital 10krpm Raptor as system disk. Its mate is installed but not yet mirrored.
Motherboard: Tyan Thunder K8W S2885 (dual AMD CPU) with 1GB ECC RAM

Anything else I can provide? (thanks again)
Re: [zfs-discuss] replace same sized disk fails with too small error
So the place we are arriving at is to push the RFE for shrinkable pools? Warning the user about the difference in actual drive size, then offering to shrink the pool to allow a smaller device, seems like a nice solution to this problem.

The ability to shrink pools might be very useful in other situations. Say I built a server that once did a decent amount of IOPS using SATA disks, and now that the workload's IOPS demand has greatly increased (busy database?), I need SAS disks. If I'd originally bought 500GB SATA disks (the current sweet spot), I might have a lot of empty space in my pool. Shrinking the pool would allow me to migrate to smaller (capacity) SAS disks with much better seek times, without being forced to buy 2x as many disks due to the higher cost/GB of SAS.

I think I remember an RFE for shrinkable pools, but can't find it - can someone post a link if they know where it is?

cheers,

Blake
Re: [zfs-discuss] Raidz1 p
Can you share your hardware configuration?

cheers,

Blake

On Mon, Jan 19, 2009 at 11:56 PM, Brad Hill wrote:
> Greetings!
>
> I lost one out of five disks on a machine with a raidz1 and I'm not sure
> exactly how to recover from it. The pool is marked as FAULTED which I
> certainly wasn't expecting with only one bum disk.
>
> r...@blitz:/# zpool status -v tank
>   pool: tank
>  state: FAULTED
> status: One or more devices could not be opened. There are insufficient
>         replicas for the pool to continue functioning.
> action: Attach the missing device and online it using 'zpool online'.
>    see: http://www.sun.com/msg/ZFS-8000-3C
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        FAULTED      0     0     1  corrupted data
>           raidz1    DEGRADED     0     0     6
>             c6t0d0  ONLINE       0     0     0
>             c6t1d0  ONLINE       0     0     0
>             c6t2d0  ONLINE       0     0     0
>             c6t3d0  UNAVAIL      0     0     0  cannot open
>             c6t4d0  ONLINE       0     0     0
>
> Any recovery guidance I may gain from the esteemed experts of this group
> would be extremely appreciated. I recently migrated to opensolaris + zfs
> on the impassioned advice of a coworker and will lose some data that has
> been modified since the move, but not yet backed up.
>
> Many thanks in advance...
Re: [zfs-discuss] zfs null pointer deref,
If you've got enough space on /var, and you had a dump partition configured, you should find a bunch of "vmcore.[n]" files in /var/crash by now. The system normally dumps the kernel core into the dump partition (which can be the swap partition) and then copies it into /var/crash on the next successful reboot.

There's likely also a stack printed at the time of the crash; that might be enough for the ZFS developers to determine if this is a known (or even fixed) bug. It's also retrievable from the core. If it's not a known bug, or if more data is needed, the developers might want a copy of the core.
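A quick sketch of the retrieval steps, in case it helps (the crash directory and file numbers below are the usual Solaris defaults, not something reported in this thread):

  # Where is the dump device, and where does savecore put the files?
  dumpadm

  # After the post-panic reboot, look for the numbered core files:
  ls /var/crash/`hostname`        # unix.0, vmcore.0, ...

  # Pull the panic message and stack back out of the core:
  mdb unix.0 vmcore.0
  > ::status                      # prints the panic string
  > ::stack                       # stack of the panicking thread
  > $q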
Re: [zfs-discuss] zfs panic
Looks like a corrupted pool -- you appear to have a mirror block pointer with no valid children. From the dump, you could probably determine which file is bad, but I doubt you could delete it; you might need to recreate your pool.
[zfs-discuss] Raidz1 p
Greetings!

I lost one out of five disks on a machine with a raidz1 and I'm not sure exactly how to recover from it. The pool is marked as FAULTED which I certainly wasn't expecting with only one bum disk.

r...@blitz:/# zpool status -v tank
  pool: tank
 state: FAULTED
status: One or more devices could not be opened. There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        FAULTED      0     0     1  corrupted data
          raidz1    DEGRADED     0     0     6
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c6t3d0  UNAVAIL      0     0     0  cannot open
            c6t4d0  ONLINE       0     0     0

Any recovery guidance I may gain from the esteemed experts of this group would be extremely appreciated. I recently migrated to opensolaris + zfs on the impassioned advice of a coworker and will lose some data that has been modified since the move, but not yet backed up.

Many thanks in advance...
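Since the raidz1 vdev itself is only DEGRADED, the usual first step is the one the status output suggests: get c6t3d0 (or a replacement in the same slot) visible to the system again and let the pool resilver. A rough sketch, assuming the drive or its cabling can be brought back:

  # After reseating or replacing the disk, try to bring it back online:
  zpool online tank c6t3d0

  # If it attaches, clear the old error counters and watch the resilver:
  zpool clear tank
  zpool status -v tank

  # With a brand-new disk in the same slot, resilver onto it instead:
  zpool replace tank c6t3d0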
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
On Mon, Jan 19 at 23:14, Greg Mason wrote:
> So, what we're looking for is a way to improve performance, without
> disabling the ZIL, as it's my understanding that disabling the ZIL
> isn't exactly a safe thing to do.
>
> We're looking for the best way to improve performance, without
> sacrificing too much of the safety of the data.
>
> The current solution we are considering is disabling the cache
> flushing (as per a previous response in this thread), and adding one
> or two SSD log devices, as this is similar to the Sun storage
> appliances based on the Thor. Thoughts?

As a general principle, the Evil Tuning Guide states that the ZIL should be able to handle 10 seconds of expected synchronous write workload. To me, this implies that it's improving burst behavior, but potentially at the expense of sustained throughput, as would be measured in benchmarking-type runs.

If you have a big JBOD array with, say, 8+ mirror vdevs on multiple controllers, in theory each vdev can commit 60-80MB/s to disk. Unless you are attaching a separate ZIL device that can match the aggregate throughput of that pool, wouldn't it just be better to have the default behavior of the ZIL contents being inside the pool itself?

The best practices guide states that the max ZIL device size should be roughly 50% of main system memory, because that's approximately the most data that can be in flight at any given instant. "For a target throughput of X MB/sec and given that ZFS pushes transaction groups every 5 seconds (and have 2 outstanding), we also expect the ZIL to not grow beyond X MB/sec * 10 sec. So to service 100MB/sec of synchronous writes, 1 GBytes of log device should be sufficient." But no comments are made on the performance requirements of the ZIL device(s) relative to the main pool devices.

Clicking around finds this entry:

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

...which appears to indicate cases where a significant number of ZIL devices were required to match the bandwidth of just throwing them in the pool itself.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org
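To make the quoted sizing rule concrete (device names below are illustrative, not from this thread):

  # 10-second rule: slog capacity >= target sync throughput * 10 sec,
  #   e.g. 100 MB/s of synchronous writes -> ~1000 MB, so a ~1 GB slog.
  # Capacity is rarely the limit; the device must also sustain the write
  # bandwidth and, for small sync writes, the IOPS.
  zpool add tank log c4t0d0

  # A mirrored slog protects the intent log itself:
  zpool add tank log mirror c4t0d0 c4t1d0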
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
> Good idea. Thor has a CF slot, too, if you can find a high speed
> CF card.
> -- richard

We're already using the CF slot for the OS. We haven't really found any CF cards that would be fast enough anyways :)
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
On Mon, 19 Jan 2009, Greg Mason wrote:
> The current solution we are considering is disabling the cache
> flushing (as per a previous response in this thread), and adding one
> or two SSD log devices, as this is similar to the Sun storage
> appliances based on the Thor. Thoughts?

You need to add some sort of fast non-volatile cache. The Sun storage appliances are actually using battery-backed DRAM for their write caches. This sort of hardware is quite rare. Fast SSD log devices are apparently pretty expensive. Some of the ones for sale are actually pretty slow.

Bob
==
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
Greg Mason wrote:
> So, what we're looking for is a way to improve performance, without
> disabling the ZIL, as it's my understanding that disabling the ZIL isn't
> exactly a safe thing to do.
>
> We're looking for the best way to improve performance, without
> sacrificing too much of the safety of the data.
>
> The current solution we are considering is disabling the cache flushing
> (as per a previous response in this thread), and adding one or two SSD
> log devices, as this is similar to the Sun storage appliances based on
> the Thor. Thoughts?

Good idea. Thor has a CF slot, too, if you can find a high speed CF card.
-- richard

> -Greg
>
> On Jan 19, 2009, at 6:24 PM, Richard Elling wrote:
>>> We took a rough stab in the dark, and started to examine whether or
>>> not it was the ZIL.
>>
>> It is. I've recently added some clarification to this section in the
>> Evil Tuning Guide which might help you to arrive at a better solution.
>> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29
>>
>> Feedback is welcome.
>> -- richard
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
So, what we're looking for is a way to improve performance, without disabling the ZIL, as it's my understanding that disabling the ZIL isn't exactly a safe thing to do.

We're looking for the best way to improve performance, without sacrificing too much of the safety of the data.

The current solution we are considering is disabling the cache flushing (as per a previous response in this thread), and adding one or two SSD log devices, as this is similar to the Sun storage appliances based on the Thor. Thoughts?

-Greg

On Jan 19, 2009, at 6:24 PM, Richard Elling wrote:
>> We took a rough stab in the dark, and started to examine whether or
>> not it was the ZIL.
>
> It is. I've recently added some clarification to this section in the
> Evil Tuning Guide which might help you to arrive at a better solution.
> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29
>
> Feedback is welcome.
> -- richard
Re: [zfs-discuss] CIFS and zfs
I switched to the CIFS file-sharing system. All that I needed to do was to disable the Samba, WINS, and SWAT services, then I started the smb/server service. I followed the CIFS administration guide. Almost everything worked without problems.

The only problem I got was a WINS(?) resolution error: if I tried to access \\"hostname" I got an error message that Windows was not able to log on. I tried the IP instead of the hostname, suffixed by the share (-r resource-name) I had just created, which worked. I restarted the smb/server service again. No changes. I restarted the machine, and now everything works without problems.

Somehow there were problems with the Windows name resolution until the restart. Maybe that's because I used the WINS daemon before?

Greets, Louis Hoefler.
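For anyone making the same switch, the sequence boils down to a few SMF and ZFS commands. This is a sketch; the service FMRIs are the stock OpenSolaris ones, and the workgroup, share, and dataset names are placeholders:

  # Stop the old Samba-based stack:
  svcadm disable network/samba network/wins network/swat

  # Start the in-kernel CIFS server (with its dependencies):
  svcadm enable -r smb/server

  # Join a workgroup and share a dataset over SMB:
  smbadm join -w WORKGROUP
  zfs set sharesmb=name=myshare tank/data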
Re: [zfs-discuss] replace same sized disk fails with too small error
> And again, I say take a look at the market today, figure out a percentage,
> and call it done. I don't think you'll find a lot of users crying foul over
> losing 1% of their drive space when they don't already cry foul over the
> false advertising that is drive sizes today.

Perhaps it's quaint, but 5GB still seems like a lot to me to throw away.

> In any case, you might as well can ZFS entirely because it's not really fair
> that users are losing disk space to raid and metadata... see where this
> argument is going?

Well, I see where this _specious_ argument is going.

> I have two disks in one of my systems... both maxtor 500GB drives, purchased
> at the same time shortly after the buyout. One is a rebadged Seagate, one
> is a true, made in China Maxtor. Different block numbers... same model
> drive, purchased at the same time.
>
> Wasn't zfs supposed to be about using software to make up for deficiencies
> in hardware? It would seem this request is exactly that...

That's a fair point, and I do encourage you to file an RFE, but a) Sun has already solved this problem in a different way as a company with our products and b) users already have the ability to right-size drives.

Perhaps a better solution would be to handle the procedure of replacing a disk with a slightly smaller one by migrating data and then treating the extant disks as slightly smaller as well. This would have the advantage of being far more dynamic and of only applying the space tax in situations where it actually applies.

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
Greg Mason wrote:
> We're running into a performance problem with ZFS over NFS. When working
> with many small files (i.e. unpacking a tar file with source code), a
> Thor (over NFS) is about 4 times slower than our aging existing storage
> solution, which isn't exactly speedy to begin with (17 minutes versus 3
> minutes).
>
> We took a rough stab in the dark, and started to examine whether or not
> it was the ZIL.

It is. I've recently added some clarification to this section in the Evil Tuning Guide which might help you to arrive at a better solution:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

Feedback is welcome.
-- richard

> Performing IO tests locally on the Thor shows no real IO problems, but
> running IO tests over NFS, specifically with many smaller files, we see
> a significant performance hit.
>
> Just to rule in or out the ZIL as a factor, we disabled it, and ran the
> test again. It completed in just under a minute, around 3 times faster
> than our existing storage. This was more like it!
>
> Are there any tunables for the ZIL to try to speed things up? Or would
> it be best to look into using a high-speed SSD for the log device?
>
> And, yes, I already know that turning off the ZIL is a Really Bad Idea.
> We do, however, need to provide our users with a certain level of
> performance, and what we've got with the ZIL on the pool is completely
> unacceptable.
>
> Thanks for any pointers you may have...
Re: [zfs-discuss] replace same sized disk fails with too small error
On Mon, Jan 19, 2009 at 2:55 PM, Adam Leventhal wrote:
> Drive vendors, it would seem, have an incentive to make their "500GB" drives
> as small as possible. Should ZFS then choose some amount of padding at the
> end of each device and chop it off as insurance against a slightly smaller
> drive? How much of the device should it chop off? Conversely, should users
> have the option to use the full extent of the drives they've paid for, say,
> if they're using a vendor that already provides that guarantee?

Drive vendors, it would seem, have incentive to make their 500GB drives as cheap as possible. The two are not necessarily one and the same.

And again, I say take a look at the market today, figure out a percentage, and call it done. I don't think you'll find a lot of users crying foul over losing 1% of their drive space when they don't already cry foul over the false advertising that is drive sizes today. In any case, you might as well can ZFS entirely because it's not really fair that users are losing disk space to raid and metadata... see where this argument is going?

I really, REALLY doubt you're going to have users screaming at you for losing 1% (or whatever the figure ends up being) to a right-sizing algorithm. In fact, I would bet the average user will NEVER notice if you don't tell them ahead of time. Sort of like the average user had absolutely no clue that 500GB drives were of slightly differing block numbers, and he'd end up screwed six months down the road if he couldn't source an identical drive.

I have two disks in one of my systems... both Maxtor 500GB drives, purchased at the same time shortly after the buyout. One is a rebadged Seagate, one is a true, made-in-China Maxtor. Different block numbers... same model drive, purchased at the same time.

Wasn't zfs supposed to be about using software to make up for deficiencies in hardware? It would seem this request is exactly that...

>> You know, sort of like you not letting people choose their raid layout...
>
> Yes, I'm not saying it shouldn't be done. I'm asking what the right answer
> might be.

The *right answer* in simplifying storage is not "manually slice up every disk you insert into the system to avoid this issue". The right answer is "right-size by default, give admins the option to skip it if they really want". Sort of like I'd argue the right answer on the 7000 is to give users the raid options you do today by default, and allow them to lay it out themselves from some sort of advanced *at your own risk* mode, whether that be command line (the best place, I'd argue) or something else.

--Tim
[zfs-discuss] ZFS Recovery after SAN Corruption
Hello,

We recently had SAN corruption (hard power outage), and we lost a few transactions that were waiting to be written to real disk. The end result, as we all know, is CKSUM errors on the zpool from a scrub, and we also had a few corrupted files reported by ZFS.

My question is, what is the proper way to recover from this? Create a new
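(The message breaks off above, but the standard cleanup for scrub-reported file corruption looks roughly like this -- a sketch, not advice specific to this pool:)

  # Let a scrub enumerate the damaged files by path:
  zpool scrub tank
  zpool status -v tank     # lists "errors: Permanent errors ..." with file paths

  # Restore (or delete) each listed file from backup, then reset the
  # error counters and re-scrub to confirm the pool is clean:
  zpool clear tank
  zpool scrub tank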
Re: [zfs-discuss] ZFS over NFS, poor performance with many small files
Another option to look at is:

  set zfs:zfs_nocacheflush=1

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

The best option is to get a fast ZIL log device. It depends on your pool as well. NFS+ZFS means ZFS will wait for write completes before responding to sync NFS write ops. If you have a RAIDZ array, writes will be slower than with a RAID10-style pool.

On Tue, Jan 20, 2009 at 11:08 AM, Greg Mason wrote:
> We're running into a performance problem with ZFS over NFS. When working
> with many small files (i.e. unpacking a tar file with source code), a
> Thor (over NFS) is about 4 times slower than our aging existing storage
> solution, which isn't exactly speedy to begin with (17 minutes versus 3
> minutes).
>
> We took a rough stab in the dark, and started to examine whether or not
> it was the ZIL.
>
> Performing IO tests locally on the Thor shows no real IO problems, but
> running IO tests over NFS, specifically with many smaller files, we see
> a significant performance hit.
>
> Just to rule in or out the ZIL as a factor, we disabled it, and ran the
> test again. It completed in just under a minute, around 3 times faster
> than our existing storage. This was more like it!
>
> Are there any tunables for the ZIL to try to speed things up? Or would
> it be best to look into using a high-speed SSD for the log device?
>
> And, yes, I already know that turning off the ZIL is a Really Bad Idea.
> We do, however, need to provide our users with a certain level of
> performance, and what we've got with the ZIL on the pool is completely
> unacceptable.
>
> Thanks for any pointers you may have...
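For reference, that tunable goes in /etc/system and takes effect at the next boot -- and, per the Evil Tuning Guide, it is only appropriate when every device behind the pool has a non-volatile (e.g. battery-backed) write cache:

  # /etc/system
  set zfs:zfs_nocacheflush = 1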
[zfs-discuss] ZFS over NFS, poor performance with many small files
We're running into a performance problem with ZFS over NFS. When working with many small files (i.e. unpacking a tar file with source code), a Thor (over NFS) is about 4 times slower than our aging existing storage solution, which isn't exactly speedy to begin with (17 minutes versus 3 minutes).

We took a rough stab in the dark, and started to examine whether or not it was the ZIL.

Performing IO tests locally on the Thor shows no real IO problems, but running IO tests over NFS, specifically with many smaller files, we see a significant performance hit.

Just to rule in or out the ZIL as a factor, we disabled it, and ran the test again. It completed in just under a minute, around 3 times faster than our existing storage. This was more like it!

Are there any tunables for the ZIL to try to speed things up? Or would it be best to look into using a high-speed SSD for the log device?

And, yes, I already know that turning off the ZIL is a Really Bad Idea. We do, however, need to provide our users with a certain level of performance, and what we've got with the ZIL on the pool is completely unacceptable.

Thanks for any pointers you may have...

--
Greg Mason
Systems Administrator
Michigan State University
High Performance Computing Center
Re: [zfs-discuss] replace same sized disk fails with too small error
Tim wrote:
> On Mon, Jan 19, 2009 at 1:12 PM, Bob Friesenhahn
> <bfrie...@simple.dallas.tx.us> wrote:
>
>     On Mon, 19 Jan 2009, Adam Leventhal wrote:
>
>         Are you telling me zfs is deficient to the point it can't handle
>         basic right-sizing like a 15$ sata raid adapter?
>
>         How do their $15 sata raid adapters solve the problem? The more
>         details you could provide the better obviously.

Note that for the LSI RAID controllers Sun uses on many products, if you take a disk that was JBOD and tell the controller to make it RAIDed, then the controller will relabel the disk for you and will cause you to lose the data. As best I can tell, ZFS is better in that it will protect your data rather than just relabeling and clobbering your data. AFAIK, NVidia and others do likewise.

>     It is really quite simple. If the disk is resilvered but the new
>     drive is a bit too small, then the RAID card might tell you that a
>     bit of data might have been lost in the last sectors, or it may just
>     assume that you didn't need that data, or maybe a bit of cryptic
>     message text scrolls off the screen a split second after it has been
>     issued. Or if you try to write at the end of the volume and one of
>     the replacement drives is a bit too short, then the RAID card may
>     return a hard read or write error. Most filesystems won't try to
>     use that last bit of space anyway since they run real slow when the
>     disk is completely full, or their flimsy formatting algorithm always
>     wastes a bit of the end of the disk. Only ZFS is rash enough to use
>     all of the space provided to it, and actually expect that the space
>     continues to be usable.
>
> It's a horribly *bad thing* to not use the entire disk and right-size it
> for sanity's sake. That's why Sun currently sells arrays that do JUST
> THAT.

??

> I'd wager fishworks does just that as well. Why don't you open source
> that code and prove me wrong ;)

I don't think so, because fishworks is an engineering team and I don't think I can reserve space on a person... at least not legally where I live :-)

But this is not a problem for the Sun Storage 7000 systems because the supported disks are already "right-sized."

> I'm wondering why they don't come right out with it and say "we want to
> intentionally make this painful to our end users so that they buy our
> packaged products". It'd be far more honest and productive than this
> pissing match.

I think that if there is enough real desire for this feature, then someone would file an RFE on http://bugs.opensolaris.org It would help to attach diffs to the bug, and it would help to reach a consensus on the amount of space to be reserved prior to filing.

This is not an intractable problem and easy workarounds already exist, but if ease of use is more valuable than squeezing every last block, then the RFE should fly.
-- richard
Re: [zfs-discuss] replace same sized disk fails with too small error
On Mon, Jan 19, 2009 at 01:35:22PM -0600, Tim wrote:
> > > Are you telling me zfs is deficient to the point it can't handle basic
> > > right-sizing like a 15$ sata raid adapter?
> >
> > How do their $15 sata raid adapters solve the problem? The more details
> > you could provide the better obviously.
>
> They short stroke the disk so that when you buy a new 500GB drive that isn't
> the exact same number of blocks you aren't screwed. It's a design choice to
> be both sane, and to make the end-user's life easier.

Drive vendors, it would seem, have an incentive to make their "500GB" drives as small as possible. Should ZFS then choose some amount of padding at the end of each device and chop it off as insurance against a slightly smaller drive? How much of the device should it chop off? Conversely, should users have the option to use the full extent of the drives they've paid for, say, if they're using a vendor that already provides that guarantee?

> You know, sort of like you not letting people choose their raid layout...

Yes, I'm not saying it shouldn't be done. I'm asking what the right answer might be.

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Re: [zfs-discuss] Feature Request Discussion (Was: Understanding ZFS replication)
So personally I find ZFS to be fantastic; it's only missing three features from my ideal filesystem:

1) The ability to easily recover the portions of a filesystem that are still intact after a catastrophic failure (it looks like zfs scrub can do this as long as a damaged pool can be imported, so this is almost there, or it's hackable at the moment if a bit of drive information has been kept around)

2) The ability to push the data off a device and safely remove it from a non-mirrored pool (marked as a future feature)

3) File-system-level mirroring across devices, rather than device-level mirroring, so...

To raise this issue for discussion (pros/cons/not worth the effort, ideas):

It would be fantastic if ZFS could support another option for copies that **guarantees** that it writes copies to different devices, and if it cannot (due to free space constraints or a failing/failed device), writes to the same device but raises an error/warning that could be checked in zpool status or similar fashion (similar to a RAID5 losing a disk... it's workable, simply degraded).

zpool scrub strikes me as the perfect tool to attempt to enforce the copies=X attribute, as a way to bring the entire filesystem into line with the current settings and ensure that old data meets the requirement, rather than only affecting new data. An issue I immediately see here would involve possibly needing to move data from one disk to another in order to free up space for replication across devices, which is likely non-trivial.

-Tim

Miles Nordin wrote:
>> "tr" == Timothy Renner writes:
>
>     tr> zfs set copies=2 zfspool/test2
>
> 'copies=2' says things will be written twice, but regardless of
> discussion about where the two copies are written, copies=2 says
> nothing at all about being able to *read back* your data if one of the
> copies disappears. It only promises that the two copies will be
> written. This does you no good at all if you can't import the pool,
> which is probably what will happen to anyone who has relied on
> copies=2 for redundancy.
>
> The discussion about *where* the copies tend to be written is really
> impractical and distracting, IMO.
>
> The chance that the copies won't be written to separate vdev's is not
> where the problem comes from. You can't import a pool unless it has
> enough redundancy at vdev-level to get all your data, so copies=2
> doesn't add much. The best copies=2 will do is give you a slightly
> better shot at evacuating the data from a slowly-failing drive. If
> anyone at all should be using it, certainly I don't think someone with
> more than one drive should be using it.
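A small illustration of the "only affects new data" point, using the pool and dataset names from the quoted example:

  zfs set copies=2 zfspool/test2
  zfs get copies zfspool/test2
  # NAME           PROPERTY  VALUE  SOURCE
  # zfspool/test2  copies    2      local

  # Blocks written before the property change still carry one copy;
  # today only rewriting the files re-stores them, which is exactly
  # the gap the scrub-driven enforcement proposed above would close.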
Re: [zfs-discuss] ZFS tale of woe and fail
> "b" == Blake writes:

    b> removing the zfs cache file located at /etc/zfs/zpool.cache
    b> might be an emergency workaround?

just the opposite. There seem to be fewer checks blocking the autoimport of pools listed in zpool.cache than on 'zpool import' manual imports. I'd expect the reverse, for some forceable 'zpool import' to accept pools that don't autoimport, but at least Ross found zpool.cache could auto-import a pool with a missing slog, while 'zpool import' tells you: recreate from backup.
Re: [zfs-discuss] replace same sized disk fails with too small error
On Mon, Jan 19, 2009 at 1:12 PM, Bob Friesenhahn <bfrie...@simple.dallas.tx.us> wrote:
> On Mon, 19 Jan 2009, Adam Leventhal wrote:
>>> Are you telling me zfs is deficient to the point it can't handle basic
>>> right-sizing like a 15$ sata raid adapter?
>>
>> How do their $15 sata raid adapters solve the problem? The more details
>> you could provide the better obviously.
>
> It is really quite simple. If the disk is resilvered but the new drive is
> a bit too small, then the RAID card might tell you that a bit of data
> might have been lost in the last sectors, or it may just assume that you
> didn't need that data, or maybe a bit of cryptic message text scrolls off
> the screen a split second after it has been issued. Or if you try to
> write at the end of the volume and one of the replacement drives is a bit
> too short, then the RAID card may return a hard read or write error. Most
> filesystems won't try to use that last bit of space anyway since they run
> real slow when the disk is completely full, or their flimsy formatting
> algorithm always wastes a bit of the end of the disk. Only ZFS is rash
> enough to use all of the space provided to it, and actually expect that
> the space continues to be usable.

It's a horribly *bad thing* to not use the entire disk and right-size it for sanity's sake. That's why Sun currently sells arrays that do JUST THAT. I'd wager fishworks does just that as well. Why don't you open source that code and prove me wrong ;)

I'm wondering why they don't come right out with it and say "we want to intentionally make this painful to our end users so that they buy our packaged products". It'd be far more honest and productive than this pissing match.

--Tim
Re: [zfs-discuss] replace same sized disk fails with too small error
On Mon, Jan 19, 2009 at 12:39 PM, Adam Leventhal wrote:
> Sorry, I must have missed your point. I thought that you were saying that
> HDS, NetApp, and EMC had a different model. Were you merely saying that the
> software in those vendors' products operates differently than ZFS?

Gosh, was the point that hard to get? Let me state it a fourth time: they all short stroke the disks to avoid the CF that results from all drives not adhering to a strict sizing standard.

> > Are you telling me zfs is deficient to the point it can't handle basic
> > right-sizing like a 15$ sata raid adapter?
>
> How do their $15 sata raid adapters solve the problem? The more details you
> could provide the better obviously.

They short stroke the disk so that when you buy a new 500GB drive that isn't the exact same number of blocks you aren't screwed. It's a design choice to be both sane, and to make the end-user's life easier. You know, sort of like you not letting people choose their raid layout...

--Tim
Re: [zfs-discuss] Disks in each RAIDZ group
I think this is probably true, and I suspect that Sun is also targeting media warehousing shops like some of the big social networking/video sites, where storage is coming online too fast to make manual tuning a sensible thing to do. Look at many enterprise storage graphs showing bytes on the x and time on the y axis, and you see a very scary picture for an admin - unless you can slap storage into a rack with minimal setup time. Plus, one can yell at the 7000-series when the stress gets to be too much:

www.youtube.com/watch?v=tDacjrSCeq4

From another point of view, making ZFS as friendly as possible to 'regular' users - those who *do* buy drives from Fry's - will certainly help drive adoption. Some of these people become the buyers in IT departments later.

Lots of Linux tools, while horrendous to administer (LVM), have worked nicely with junky hardware for a long time. So now we have Linux even in places where it may not make the most sense. My last enterprise job had many terabytes of data sitting on LVM. I'm glad I wasn't the admin for that nightmare.

cheers,

Blake

On Mon, Jan 19, 2009 at 2:00 PM, Bob Friesenhahn wrote:
> It seems likely that Sun discovered that raw Solaris and system
> configuration is too difficult for many "Windows" shops to grasp so
> they introduced a simplified appliance product line which is
> engineered entirely by Sun, and with a simplified administration
> interface which does not require a lot of training to understand.
>
> Since the product is entirely engineered by Sun, they can ensure that
> the provided disk drives and configuration are carefully matched
> (based on testing and analysis) in order to offer the best
> price/performance ratio.
>
> Bob
Re: [zfs-discuss] Understanding ZFS replication
> "tr" == Timothy Renner writes:

    tr> zfs set copies=2 zfspool/test2

'copies=2' says things will be written twice, but regardless of discussion about where the two copies are written, copies=2 says nothing at all about being able to *read back* your data if one of the copies disappears. It only promises that the two copies will be written. This does you no good at all if you can't import the pool, which is probably what will happen to anyone who has relied on copies=2 for redundancy.

The discussion about *where* the copies tend to be written is really impractical and distracting, IMO.

The chance that the copies won't be written to separate vdev's is not where the problem comes from. You can't import a pool unless it has enough redundancy at vdev-level to get all your data, so copies=2 doesn't add much. The best copies=2 will do is give you a slightly better shot at evacuating the data from a slowly-failing drive. If anyone at all should be using it, certainly I don't think someone with more than one drive should be using it.
Re: [zfs-discuss] replace same sized disk fails with too small error
On Mon, 19 Jan 2009, Adam Leventhal wrote:
>> Are you telling me zfs is deficient to the point it can't handle basic
>> right-sizing like a 15$ sata raid adapter?
>
> How do their $15 sata raid adapters solve the problem? The more details you
> could provide the better obviously.

It is really quite simple. If the disk is resilvered but the new drive is a bit too small, then the RAID card might tell you that a bit of data might have been lost in the last sectors, or it may just assume that you didn't need that data, or maybe a bit of cryptic message text scrolls off the screen a split second after it has been issued. Or if you try to write at the end of the volume and one of the replacement drives is a bit too short, then the RAID card may return a hard read or write error.

Most filesystems won't try to use that last bit of space anyway since they run real slow when the disk is completely full, or their flimsy formatting algorithm always wastes a bit of the end of the disk. Only ZFS is rash enough to use all of the space provided to it, and actually expect that the space continues to be usable.

Bob
==
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Disks in each RAIDZ group
On Mon, 19 Jan 2009, Tim wrote:
> Remember that one time when I talked about limiting snapshots to protect a
> user from themselves, and you joined into the fray of people calling me a
> troll? Can you feel the irony oozing out between your lips, or are you
> completely oblivious to it?

Tim, I admire that you are able to keep your trolls on topic, quite unlike JZ. This is a clear indication that you are not yet qualified for the rubber room.

If you don't like the simplified nature of the Sun Unified Storage 7000 products, you always have the option of building your own system based on your preferred hardware, and OpenSolaris (or Linux), just as you did before. The mere existence of these products should not annoy you.

It seems likely that Sun discovered that raw Solaris and system configuration is too difficult for many "Windows" shops to grasp, so they introduced a simplified appliance product line which is engineered entirely by Sun, and with a simplified administration interface which does not require a lot of training to understand.

Since the product is entirely engineered by Sun, they can ensure that the provided disk drives and configuration are carefully matched (based on testing and analysis) in order to offer the best price/performance ratio.

Bob
==
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS tale of woe and fail
Miles, that's correct - I got muddled in the details of the thread.

I'm not necessarily suggesting this, but is this an occasion when removing the zfs cache file located at /etc/zfs/zpool.cache might be an emergency workaround? Tom, please don't try this until someone more expert replies to my question.

cheers,

Blake

On Mon, Jan 19, 2009 at 1:43 PM, Miles Nordin wrote:
>     b> You can get a sort of redundancy by creating multiple
>     b> filesystems with 'copies' enabled on the ones that need some
>     b> sort of self-healing in case of bad blocks.
>
> Won't work here. The pool won't import at all. The type of bad block
> fixing you're talking about applies to cases where the pool imports,
> but 'zpool status' reports files with bad blocks in them.
Re: [zfs-discuss] Disks in each RAIDZ group
> BWAHAHAHAHA. That's a good one. "You don't need to setup your raid, that's
> micro-managing, we'll do that."
>
> Remember that one time when I talked about limiting snapshots to protect a
> user from themselves, and you joined into the fray of people calling me a
> troll?

I don't remember this, but I don't doubt it.

> Can you feel the irony oozing out between your lips, or are you
> completely oblivious to it?

The irony would be that on one hand I object to artificial limitations to business-critical features while on the other hand I think that users don't need to tweak settings that add complexity and little to no value? They seem very different to me, so I suppose the answer to your question is: no, I cannot feel the irony oozing out between my lips, and yes, I'm oblivious to the same.

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Re: [zfs-discuss] replace same sized disk fails with too small error
>>> Since it's done in software by HDS, NetApp, and EMC, that's complete
>>> bullshit. Forcing people to spend 3x the money for a "Sun" drive that's
>>> identical to the seagate OEM version is also bullshit and a piss-poor
>>> answer.
>>
>> I didn't know that HDS, NetApp, and EMC all allow users to replace their
>> drives with stuff they've bought at Fry's. Is this still covered by their
>> service plan or would this only be in an unsupported config?
>
> So because an enterprise vendor requires you to use their drives in their
> array, suddenly zfs can't right-size? Vendor requirements have absolutely
> nothing to do with their right-sizing, and everything to do with them
> wanting your money.

Sorry, I must have missed your point. I thought that you were saying that HDS, NetApp, and EMC had a different model. Were you merely saying that the software in those vendors' products operates differently than ZFS?

> Are you telling me zfs is deficient to the point it can't handle basic
> right-sizing like a 15$ sata raid adapter?

How do their $15 sata raid adapters solve the problem? The more details you could provide the better obviously.

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Re: [zfs-discuss] replace same sized disk fails with too small error
>> Creating a slice, instead of using the whole disk, will cause ZFS to
>> not enable write-caching on the underlying device.
>
> Correct. Engineering trade-off. Since most folks don't read the manual,
> or the best practices guide, until after they've hit a problem, it is
> really just a CYA entry :-(

It seems this trade-off can now be mitigated, per Roch Bourbonnais's comment on another thread on this list:

http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/054587.html

In particular: "If ZFS owns a disk it will enable the write cache on the drive but I'm not positive this has a great performance impact today. It used to, but that was before we had a proper NCQ implementation. Today I don't know that it helps much. This is because we always flush the cache when consistency requires it."

--
julien.
http://blog.thilelli.net/
Re: [zfs-discuss] ZFS tale of woe and fail
> "nk" == Nathan Kroenert writes:
> "b" == Blake writes:

    nk> I'm not sure how you can class it a ZFS fail when the Disk
    nk> subsystem has failed...

The disk subsystem did not fail and lose all its contents. It just rebooted a few times.

    b> You can get a sort of redundancy by creating multiple
    b> filesystems with 'copies' enabled on the ones that need some
    b> sort of self-healing in case of bad blocks.

Won't work here. The pool won't import at all. The type of bad block fixing you're talking about applies to cases where the pool imports, but 'zpool status' reports files with bad blocks in them.
Re: [zfs-discuss] Disks in each RAIDZ group
On Mon, Jan 19, 2009 at 11:02 AM, Adam Leventhal wrote:
>> "The recommended number of disks per group is between 3 and 9. If you have
>> more disks, use multiple groups."
>>
>> Odd that the Sun Unified Storage 7000 products do not allow you to control
>> this, it appears to put all the hdd's into one group. At least on the 7110
>> we are evaluating there is no control to allow multiple groups/different
>> raid types.
>
> Our experience has shown that that initial guess of 3-9 per parity device
> was surprisingly narrow. We see similar performance out to much wider
> stripes which, of course, offer the user more usable capacity.
>
> We don't allow you to manually set the RAID stripe widths on the 7000
> series boxes because frankly the stripe width is an implementation detail.
> If you want the best performance, choose mirroring; capacity, double-parity
> RAID; for something in the middle, we offer 3+1 single-parity RAID. Other
> than that you're micro-optimizing for gains that would hardly be measurable
> given the architecture of the Hybrid Storage Pool. Recall that unlike other
> products in the same space, we get our IOPS from flash rather than from
> a bazillion spindles spinning at 15,000 RPM.
>
> Adam

BWAHAHAHAHA. That's a good one. "You don't need to setup your raid, that's micro-managing, we'll do that."

Remember that one time when I talked about limiting snapshots to protect a user from themselves, and you joined into the fray of people calling me a troll? Can you feel the irony oozing out between your lips, or are you completely oblivious to it?

--Tim
Re: [zfs-discuss] replace same sized disk fails with too small error
On Mon, Jan 19, 2009 at 11:05 AM, Adam Leventhal wrote:
>> Since it's done in software by HDS, NetApp, and EMC, that's complete
>> bullshit. Forcing people to spend 3x the money for a "Sun" drive that's
>> identical to the seagate OEM version is also bullshit and a piss-poor
>> answer.
>
> I didn't know that HDS, NetApp, and EMC all allow users to replace their
> drives with stuff they've bought at Fry's. Is this still covered by their
> service plan or would this only be in an unsupported config?
>
> Thanks.
>
> Adam

So because an enterprise vendor requires you to use their drives in their array, suddenly zfs can't right-size? Vendor requirements have absolutely nothing to do with their right-sizing, and everything to do with them wanting your money.

Are you telling me zfs is deficient to the point it can't handle basic right-sizing like a 15$ sata raid adapter?

--Tim
Re: [zfs-discuss] replace same sized disk fails with too small error
> "edm" == Eric D Mudama writes:

    edm> If, instead of having ZFS manage these differences, a user
    edm> simply created slices that were, say, 98%

if you're willing to manually create slices, you should be able to manually enable the write cache, too, while you're in there, so I wouldn't worry about that.

I'd worry a little about the confusion over this write cache bit in general---where the write cache setting is stored, when it's enabled and when (if?) it's disabled, whether the rules differ on each type of disk attachment, and if you plug the disk into Linux will Linux screw up the setting by auto-enabling at boot or auto-disabling at shutdown, or does Linux use stateless versions (analogous to sdparm without --save) when it prints that boot-time message about enabling write caches?

For example weirdness, on iSCSI I get this, on a disk to which I've let ZFS write a GPT/EFI label:

  write_cache> display
  Write Cache is disabled
  write_cache> enable
  Write cache setting is not changeable

so is that a bug of my iSCSI target, and is there another implicit write cache inside the iSCSI initiator or not? The Linux hdparm man page says:

  -W   Disable/enable the IDE drive's write-caching feature (default
       state is undeterminable; manufacturer/model specific).

so is the write_cache 'display' feature in 'format -e' actually reliable? Or is it impossible to reliably read this setting on an ATA drive, and 'format -e' is making stuff up?

With Linux I can get all kinds of crazy caching data from a SATA disk:

  r...@node0 ~ # sdparm --page=ca --long /dev/sda
      /dev/sda: ATA       WDC WD1000FYPS-0  02.0
  Caching (SBC) [PS=0] mode page:
    IC      0  Initiator control
    ABPF    0  Abort pre-fetch
    CAP     0  Caching analysis permitted
    DISC    0  Discontinuity
    SIZE    0  Size (1->CSS valid, 0->NCS valid)
    WCE     1  Write cache enable
    MF      0  Multiplication factor
    RCD     0  Read cache disable
    DRRP    0  Demand read retension priority
    WRP     0  Write retension priority
    DPTL    0  Disable pre-fetch transfer length
    MIPF    0  Minimum pre-fetch
    MAPF    0  Maximum pre-fetch
    MAPFC   0  Maximum pre-fetch ceiling
    FSW     0  Force sequential write
    LBCSS   0  Logical block cache segment size
    DRA     0  Disable read ahead
    NV_DIS  0  Non-volatile cache disable
    NCS     0  Number of cache segments
    CSS     0  Cache segment size

but what's actually coming from the drive, and what's fabricated by the SCSI-to-SATA translator built into Garzik's libata? Because I think Solaris has such a translator, too, if it's attaching sd to SATA disks. I'm guessing it's all a fantasy because:

  r...@node0 ~ # sdparm --clear=WCE /dev/sda
      /dev/sda: ATA       WDC WD1000FYPS-0  02.0
  change_mode_page: failed setting page: Caching (SBC)

but neverminding the write cache, I'd be happy saying ``just round down disk sizes using the labeling tool instead of giving ZFS the whole disk, if you care,'' IF the following things were true:

 * doing so were written up as a best practice. because, I think it's a
   best practice if the rest of the storage industry from EMC to $15
   promise cards is doing it, though maybe it's not important any more
   because of IDEMA. And right now very few people are likely to have
   done it because of the way they've been guided into the setup process.

 * it were possible to do this label-sizing to bootable mirrors in the
   various traditional/IPS/flar/jumpstart installers

 * there weren't a proliferation of >= 4 labeling tools in Solaris, each
   riddled with assertion bailouts and slightly different capabilities.
Linux also has a mess of labeling tools, but they're less assertion-riddled, and usually you can pick one and use it for everything---you don't have to drag out a different tool for USB sticks because they're considered ``removeable.'' Also it's always possible to write to the unpartitioned block device with 'dd' on Linux (and FreeBSD and Mac OS X), no matter what label is on the disk, while Solaris doesn't seem to have an unpartitioned device. And finally the Linux formatting tools work by writing to this unpartitioned device, not by calling into a rat's nest of ioctl's, so they're much easier for me to get along with.

Part of the attraction of ZFS should be avoiding this messy part of Solaris, but we still have to use format/fmthard/fdisk/rmformat, to swap label types because ZFS won't, to frob the write cache because ZFS's user interface is too simple and does that semi-automatically though I'm not sure all the rules it's using, to enumerate the installed disks, to determine in which of the several states working / connected-but-not-identified / disconnected / disconnected-but-refcounted the iSCSI initiator is. And while ZFS will do special things to an UNlabeled disk, I'm not sure there i
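Following the point above about enabling the write cache yourself when you hand ZFS a slice: on Solaris the toggle lives in the expert mode of format. A sketch of the session, matching the transcript fragments quoted earlier (disk selection omitted):

  # format -e, select the disk, then:
  format> cache
  cache> write_cache
  write_cache> display
  Write Cache is disabled
  write_cache> enable
  write_cache> display
  Write Cache is enabled
  write_cache> quit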
Re: [zfs-discuss] replace same sized disk fails with too small error
Jim Dunham wrote:
> Richard,
>
>> Ross wrote:
>>> The problem is they might publish these numbers, but we really have
>>> no way of controlling what number manufacturers will choose to use
>>> in the future.
>>>
>>> If for some reason future 500GB drives all turn out to be slightly
>>> smaller than the current ones you're going to be stuck. Reserving
>>> 1-2% of space in exchange for greater flexibility in replacing
>>> drives sounds like a good idea to me. As others have said, RAID
>>> controllers have been doing this for long enough that even the very
>>> basic models do it now, and I don't understand why such simple
>>> features like this would be left out of ZFS.
>>
>> I have added the following text to the best practices guide:
>>
>> * When a vdev is replaced, the size of the replacement vdev, measured
>>   by usable sectors, must be the same or greater than the vdev being
>>   replaced. This can be confusing when whole disks are used because
>>   different models of disks may provide a different number of usable
>>   sectors. For example, if a pool was created with a "500 GByte" drive
>>   and you need to replace it with another "500 GByte" drive, then you
>>   may not be able to do so if the drives are not of the same make,
>>   model, and firmware revision. Consider planning ahead and reserving
>>   some space by creating a slice which is smaller than the whole disk
>>   instead of the whole disk.
>
> Creating a slice, instead of using the whole disk, will cause ZFS to
> not enable write-caching on the underlying device.

Correct. Engineering trade-off. Since most folks don't read the manual, or the best practices guide, until after they've hit a problem, it is really just a CYA entry :-(

BTW, I also added a quick link to CR 4852783, reduce pool capacity, which is the feature which has a good chance of making this point moot.
-- richard
Re: [zfs-discuss] replace same sized disk fails with too small error
> Since it's done in software by HDS, NetApp, and EMC, that's complete
> bullshit. Forcing people to spend 3x the money for a "Sun" drive that's
> identical to the seagate OEM version is also bullshit and a piss-poor
> answer.

I didn't know that HDS, NetApp, and EMC all allow users to replace their drives with stuff they've bought at Fry's. Is this still covered by their service plan or would this only be in an unsupported config?

Thanks.

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Re: [zfs-discuss] Disks in each RAIDZ group
> "The recommended number of disks per group is between 3 and 9. If you have
> more disks, use multiple groups."
>
> Odd that the Sun Unified Storage 7000 products do not allow you to control
> this, it appears to put all the hdd's into one group. At least on the 7110
> we are evaluating there is no control to allow multiple groups/different
> raid types.

Our experience has shown that that initial guess of 3-9 per parity device was surprisingly narrow. We see similar performance out to much wider stripes which, of course, offer the user more usable capacity.

We don't allow you to manually set the RAID stripe widths on the 7000 series boxes because frankly the stripe width is an implementation detail. If you want the best performance, choose mirroring; capacity, double-parity RAID; for something in the middle, we offer 3+1 single-parity RAID. Other than that you're micro-optimizing for gains that would hardly be measurable given the architecture of the Hybrid Storage Pool. Recall that unlike other products in the same space, we get our IOPS from flash rather than from a bazillion spindles spinning at 15,000 RPM.

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Re: [zfs-discuss] Disks in each RAIDZ group
Asif Iqbal wrote:
> On Mon, Jan 19, 2009 at 10:47 AM, Andrew Gabriel wrote:
>> I've seen a webpage (a blog, IIRC) which compares the performance of
>> RAIDZ with differing numbers of disks in each RAIDZ group. I can't now
>> find this, and can't seem to find the right things to get google to
>> search on. Does anyone recall where this is? ISTR the optimum number of
>> disks was 5-6.
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
>
> section: RAID-Z Configuration Requirements and Recommendations

Thanks. I had found that, but there is another blog somewhere which compared the performance of RAIDZ groups built with different numbers of disks.

--
Cheers
Andrew
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Disks in each RAIDZ group
Andrew,

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#RAID-Z_Configuration_Requirements_and_Recommendations

"The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups."

Odd that the Sun Unified Storage 7000 products do not allow you to control this; it appears to put all the HDDs into one group. At least on the 7110 we are evaluating there is no control to allow multiple groups/different raid types.

Tom

- "Andrew Gabriel" wrote:
> I've seen a webpage (a blog, IIRC) which compares the performance of
> RAIDZ with differing numbers of disks in each RAIDZ group. I can't now
> find this, and can't seem to find the right things to get google to
> search on. Does anyone recall where this is? ISTR the optimum number of
> disks was 5-6.
>
> --
> Cheers
> Andrew
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
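To make "use multiple groups" concrete, a hypothetical layout for ten disks might look like the sketch below (device names are illustrative); this is exactly the knob the 7110 does not expose:

    # Two 5-disk raidz groups in one pool, rather than a single 10-disk group:
    zpool create tank \
        raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
        raidz c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0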
Re: [zfs-discuss] replace same sized disk fails with too small error
Richard,

> Ross wrote:
>> The problem is they might publish these numbers, but we really have
>> no way of controlling what number manufacturers will choose to use
>> in the future.
>>
>> If for some reason future 500GB drives all turn out to be slightly
>> smaller than the current ones you're going to be stuck. Reserving
>> 1-2% of space in exchange for greater flexibility in replacing
>> drives sounds like a good idea to me. As others have said, RAID
>> controllers have been doing this for long enough that even the very
>> basic models do it now, and I don't understand why such a simple
>> feature would be left out of ZFS.
>
> I have added the following text to the best practices guide:
>
> * When a vdev is replaced, the size of the replacement vdev, measured
> by usable sectors, must be the same or greater than the vdev being
> replaced. This can be confusing when whole disks are used, because
> different models of disks may provide a different number of usable
> sectors. For example, if a pool was created with a "500 GByte" drive
> and you need to replace it with another "500 GByte" drive, then you
> may not be able to do so if the drives are not of the same make,
> model, and firmware revision. Consider planning ahead and reserving
> some space by creating a slice which is smaller than the whole disk,
> and using the slice instead of the whole disk.

Creating a slice, instead of using the whole disk, will cause ZFS to not enable write-caching on the underlying device.

- Jim

>> Fair enough, for high-end enterprise kit where you want to squeeze
>> every byte out of the system (and know you'll be buying Sun drives),
>> you might not want this, but it would have been trivial to turn this
>> off for kit like that. It's certainly a lot easier to expand a pool
>> than shrink it!
>
> Actually, enterprise customers do not ever want to squeeze every byte;
> they would rather have enough margin to avoid such issues entirely.
> This is what I was referring to earlier in this thread wrt planning.
> -- richard
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
Ross wrote:
> The problem is they might publish these numbers, but we really have no
> way of controlling what number manufacturers will choose to use in the
> future.
>
> If for some reason future 500GB drives all turn out to be slightly
> smaller than the current ones you're going to be stuck. Reserving 1-2%
> of space in exchange for greater flexibility in replacing drives sounds
> like a good idea to me. As others have said, RAID controllers have been
> doing this for long enough that even the very basic models do it now,
> and I don't understand why such a simple feature would be left out of
> ZFS.

I have added the following text to the best practices guide:

* When a vdev is replaced, the size of the replacement vdev, measured by usable sectors, must be the same or greater than the vdev being replaced. This can be confusing when whole disks are used, because different models of disks may provide a different number of usable sectors. For example, if a pool was created with a "500 GByte" drive and you need to replace it with another "500 GByte" drive, then you may not be able to do so if the drives are not of the same make, model, and firmware revision. Consider planning ahead and reserving some space by creating a slice which is smaller than the whole disk, and using the slice instead of the whole disk.

> Fair enough, for high-end enterprise kit where you want to squeeze
> every byte out of the system (and know you'll be buying Sun drives),
> you might not want this, but it would have been trivial to turn this
> off for kit like that. It's certainly a lot easier to expand a pool
> than shrink it!

Actually, enterprise customers do not ever want to squeeze every byte; they would rather have enough margin to avoid such issues entirely. This is what I was referring to earlier in this thread wrt planning.
-- richard
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (no subject)
It really is sad when you have to start filtering technical mailing lists to weed out the junk. On Sun, Jan 18, 2009 at 4:17 PM, JZ wrote: > Obama just made a good speech. > I hope you were watching TV... > > Best, > z > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding ZFS replication
This makes sense. Given a set of devices, ZFS can only write to free blocks. If the only free blocks are close together or on the same device, then the protection can't be as great. This is quite likely to happen on a fullish disk. copies > 1, however, is still better than none (a single dropped block in the right place can wreak havoc).

I personally like to use the 'copies' feature on machines where allocation priority for devices is for my storage pool, rather than my root pool. I like the idea that I can have multiple copies of blocks on my (single) boot device. This also works nicely because on a Solaris machine with lots of memory, I don't have to write to the disk much after boot, so the performance penalty seems fairly small. I have this running right now in one case. When I get the ability to mirror my rpool, I can remove the copies property if I wish.

One other important caveat is that ZFS properties only apply to newly-written data. So setting copies > 1 after an install won't make copies of the blocks the initial install wrote, just the blocks written going forward.

cheers, Blake

On Mon, Jan 19, 2009 at 1:04 AM, Carson Gaspar wrote:
> Bob Friesenhahn wrote:
>> On Sun, 18 Jan 2009, Tim wrote:
>>
>>> Honestly, I believe this list... when other people have asked if they can
>>> use the copies= to avoid mirroring everything. I can't say I've saved any
>>> of the threads because they didn't seem of any particular importance to me
>>> at the time.
>>
>> The extra copies help avoid data loss, but if a disk is lost and there
>> is no disk-wise redundancy, then the pool will be lost.
>
> I'm reading a lot of posts where folks don't seem to be understanding
> each other, so let me try and re-phrase things.
>
> If you set copies=n, where n > 1, ZFS will _attempt_ to put the copies
> on different block devices. If it can't, it will _attempt_ to place the
> copies "far" away from each other on the same block device.
>
> The key word above is "attempt". Previous posters have shot this down
> for "poor man's mirroring" because of the lack of guarantees. I suspect
> these naysayers (and rightly so) are what Tim is recalling.
>
> --
> Carson
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
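A minimal sketch of the caveat above, with hypothetical dataset names (remember that copies only affects blocks written after the property is set):

    zfs set copies=2 rpool/export/home    # new writes get two copies
    zfs get copies rpool/export/home      # verify the setting
    # Existing blocks stay single-copy until rewritten; one option is to
    # copy data into a dataset created with the property already in place:
    zfs create -o copies=2 rpool/export/home2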
Re: [zfs-discuss] Disks in each RAIDZ group
On Mon, Jan 19, 2009 at 10:47 AM, Andrew Gabriel wrote:
> I've seen a webpage (a blog, IIRC) which compares the performance of
> RAIDZ with differing numbers of disks in each RAIDZ group. I can't now
> find this, and can't seem to find the right things to get google to
> search on. Does anyone recall where this is? ISTR the optimum number of
> disks was 5-6.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

section: RAID-Z Configuration Requirements and Recommendations

--
Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Disks in each RAIDZ group
I've seen a webpage (a blog, IIRC) which compares the performance of RAIDZ with differing numbers of disks in each RAIDZ group. I can't now find this, and can't seem to find the right things to get google to search on. Does anyone recall where this is? ISTR the optimum number of disks was 5-6. -- Cheers Andrew ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS tale of woe and fail
You can get a measure of redundancy by creating multiple filesystems, with 'copies' enabled on the ones that need self-healing in case of bad blocks.

Is it possible to at least present your disks as several LUNs? If you must have an abstraction layer between ZFS and the block device, presenting ZFS with a plurality of abstracted devices would let you get some sort of parity... or is this device live and in production?

I do think that, though ZFS doesn't need fsck in the traditional sense, some sort of recovery tool would make storage admins even happier about using ZFS.

cheers, Blake

On Mon, Jan 19, 2009 at 4:09 AM, Tom Bird wrote:
> Toby Thain wrote:
>> On 18-Jan-09, at 6:12 PM, Nathan Kroenert wrote:
>>
>>> Hey, Tom -
>>>
>>> Correct me if I'm wrong here, but it seems you are not allowing ZFS any
>>> sort of redundancy to manage.
>
> Every other file system out there runs fine on a single LUN; when things
> go wrong you have a fsck utility that patches it up and the world keeps
> on turning.
>
> I can't find anywhere that will sell me a 48 drive SATA JBOD with all
> the drives presented on a single SAS channel, so running on a single
> giant LUN is a real world scenario that ZFS should be able to cope with,
> as this is how the hardware I am stuck with is arranged.
>
>> Which is particularly catastrophic when one's 'content' is organized as
>> a monolithic file, as it is here - unless, of course, you have some way
>> of scavenging that file based on internal structure.
>
> No, it's not a monolithic file; the point I was making there is that no
> files are showing up.
> r...@cs4:~# find /content
> /content
> r...@cs4:~# (yes that really is it)
>
> thanks
> --
> Tom
>
> // www.portfast.co.uk -- internet services and consultancy
> // hosting from 1.65 per domain
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
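A sketch of the selective-'copies' idea, with hypothetical dataset names (this guards against bad blocks, not against losing the whole LUN):

    zfs create -o copies=2 content/critical   # self-healing where it matters
    zfs create content/scratch                # inherits the default, copies=1
    zfs get -r copies content                 # confirm per-dataset settings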
Re: [zfs-discuss] replace same sized disk fails with too small error
I'm going waaay out on a limb here, as a non-programmer... but...

Since the source is open, maybe community members should organize and work on some sort of sizing algorithm? I can certainly imagine Sun deciding to do this in the future - I can also imagine that it's not at the top of Sun's priority list (most of the devices they deal with are their own, and perhaps not subject to the right-sizing issue). If it matters to the community, why not, as a community, try to fix/improve ZFS in this way? Again, I've not even looked at the code for block allocation, or whatever it might be called in this case, so I could be *way* off here :)

Lastly, Antonius, you can try the zpool trick to get this disk relabeled, I think. Try 'zpool create temp_pool [problem_disk]' then 'zpool destroy temp_pool' - this should relabel the disk in question and set up the defaults that ZFS uses.

Can you also run format > partition > print on one of the existing disks and send the output so that we can see what the existing disk looks like? (Off-list directly to me if you prefer.)

cheers, Blake
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
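Spelled out, the relabel trick might look like this (pool and device names are hypothetical, and creating the pool destroys any data on that disk, so only do this to the replacement):

    zpool create temp_pool c6t3d0   # lets ZFS write its default label
    zpool destroy temp_pool         # pool is gone; the label layout remains
    # Then compare against an existing pool member:
    # format -> select the disk -> partition -> print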
Re: [zfs-discuss] Understanding ZFS replication
Z, > Beloved Tim, > You challenged me a while ago, as a friend. > I did what you asked me to do, in the honor of my father. > > Best, > z Please don't post personal stuff like this or links to wikipedia or other ephemera/apocrypha to this/any list unless they are relevant. Thanks... Sean. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
The problem is they might publish these numbers, but we really have no way of controlling what number manufacturers will choose to use in the future.

If for some reason future 500GB drives all turn out to be slightly smaller than the current ones you're going to be stuck. Reserving 1-2% of space in exchange for greater flexibility in replacing drives sounds like a good idea to me. As others have said, RAID controllers have been doing this for long enough that even the very basic models do it now, and I don't understand why such a simple feature would be left out of ZFS.

Fair enough, for high-end enterprise kit where you want to squeeze every byte out of the system (and know you'll be buying Sun drives), you might not want this, but it would have been trivial to turn this off for kit like that. It's certainly a lot easier to expand a pool than shrink it!
-- This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS size is different ?
Chookiex writes:
> Hi all,
>
> I have 2 questions about ZFS.
>
> 1. I have created a snapshot of my pool1/data1, and zfs send/recv it to
> pool2/data2, but I found the USED in zfs list is different:
> NAME USED AVAIL REFER MOUNTPOINT
> pool2/data2 160G 1.44T 159G /pool2/data2
> pool1/data 176G 638G 175G /pool1/data1
>
> It keeps about 30,000,000 files.
> The content of p_pool/p1 and backup/p_backup is almost the same. But why
> is the size different?

160G for 30M files means your average file size is 5333 bytes. Pick one such file just for illustration: 5333 bytes to be stored on a raidz2 of 5 disks (3+2). You have to store 5333 bytes of data onto 3 data disks, so you will need a stripe of 4 x 512B sectors on each of the 3 data disks. That's 6K of data. Over a single volume, you'd need 11 sectors of 512B to store the same 5333 bytes (5632 bytes allocated). For this average file size you thus have either 12 or 11 sectors to store the data, a 9% difference. You then need to tack on the extra parity blocks: raidz2 is a double-parity scheme, whereas RAID-5 is single parity (and will only survive a single disk failure). Depending on how these parity blocks are accounted for and your exact file size distribution, the difference you note does not appear unwarranted.

> 2. /pool2/data2 is a RAID5 Disk Array with 8 disks, and /pool1/data1
> is a RAIDZ2 with 5 disks.
> The configuration is like this:
>
> NAME STATE READ WRITE CKSUM
> pool2 ONLINE 0 0 0
> c7t10d0 ONLINE 0 0 0
>
> NAME STATE READ WRITE CKSUM
> pool1 ONLINE 0 0 0
> raidz2 ONLINE 0 0 0
> c3t2d0 ONLINE 0 0 0
> c3t1d0 ONLINE 0 0 0
> c3t3d0 ONLINE 0 0 0
> c3t4d0 ONLINE 0 0 0
> c3t5d0 ONLINE 0 0 0
>
> We found that pool1 is slower than pool2, even with the same number of
> disks.
> So, which is better between RAID5 + ZFS and RAIDZ + ZFS?

Uncached RAID-5 random read is expected to deliver more total random read IOPS than uncached RAID-Z. The downside of using a single RAID-5 volume is that if a checksum error is ever detected by ZFS, ZFS will report the error but will not be able to correct data blocks (metadata blocks are stored redundantly and will be corrected).

-r
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
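The sector arithmetic above can be checked mechanically; here is a small sketch (illustrative only, assuming 512-byte sectors and the 3+2 raidz2 layout described):

    avg=5333                                              # ~160e9 bytes / 30e6 files
    data_disks=3
    per_disk=$(( (avg + data_disks - 1) / data_disks ))   # 1778 bytes per data disk
    raidz_sectors=$(( ((per_disk + 511) / 512) * data_disks ))  # 4 sectors x 3 = 12
    single_sectors=$(( (avg + 511) / 512 ))               # 11 sectors
    echo "raidz2 data sectors: $raidz_sectors; single volume: $single_sectors"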
Re: [zfs-discuss] replace same sized disk fails with too small error
yes, it's the same make and model as most of the other disks in the zpool and reports the same number of sectors -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS tale of woe and fail
Toby Thain wrote:
> On 18-Jan-09, at 6:12 PM, Nathan Kroenert wrote:
>
>> Hey, Tom -
>>
>> Correct me if I'm wrong here, but it seems you are not allowing ZFS any
>> sort of redundancy to manage.

Every other file system out there runs fine on a single LUN; when things go wrong you have a fsck utility that patches it up and the world keeps on turning.

I can't find anywhere that will sell me a 48 drive SATA JBOD with all the drives presented on a single SAS channel, so running on a single giant LUN is a real world scenario that ZFS should be able to cope with, as this is how the hardware I am stuck with is arranged.

> Which is particularly catastrophic when one's 'content' is organized as
> a monolithic file, as it is here - unless, of course, you have some way
> of scavenging that file based on internal structure.

No, it's not a monolithic file; the point I was making there is that no files are showing up.

>>> r...@cs4:~# find /content
>>> /content
>>> r...@cs4:~# (yes that really is it)

thanks
--
Tom

// www.portfast.co.uk -- internet services and consultancy
// hosting from 1.65 per domain
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss