Re: [zfs-discuss] ZFS, Smashing Baby a fake???
I think we (the ZFS team) all generally agree with you. The current nevada code is much better at handling device failures than it was just a few months ago. And there are additional changes that were made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000) product line that will make things even better once the FishWorks team has a chance to catch its breath and integrate those changes into nevada. And then we've got further improvements in the pipeline.

The reason this is all so much harder than it sounds is that we're trying to provide increasingly optimal behavior given a collection of devices whose failure modes are largely ill-defined. (Is the disk dead or just slow? Gone or just temporarily disconnected? Does this burst of bad sectors indicate catastrophic failure, or just localized media errors?) The disks' SMART data is notoriously unreliable, BTW. So there's a lot of work underway to model the physical topology of the hardware and gather telemetry from the devices, the enclosures, the environmental sensors, etc., so that we can generate an accurate FMA fault diagnosis and then tell ZFS to take appropriate action. We have some of this today; it's just a lot of work to complete it.

Oh, and regarding the original post -- as several readers correctly surmised, we weren't faking anything, we just didn't want to wait for all the device timeouts. Because the disks were on USB, which is a hotplug-capable bus, unplugging the dead disk generated an interrupt that bypassed the timeout. We could have waited it out, but 60 seconds is an eternity on stage.

Jeff

On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:

But that's exactly the problem Richard: AFAIK. Can you state, absolutely and categorically, that there is no failure mode out there (caused by hardware faults, or bad drivers) that will lock a drive up for hours? You can't, obviously, which is why we keep saying that ZFS should have this kind of timeout feature. For once I agree with Miles, I think he's written a really good writeup of the problem here.

My simple view on it would be this: drives are only aware of themselves as an individual entity. Their job is to save / restore data to themselves, and drivers are written to minimise any chance of data loss. So when a drive starts to fail, it makes complete sense for the driver and hardware to be very, very thorough about trying to read or write that data, and to only fail as a last resort. I'm not at all surprised that drives take 30 seconds to time out, nor that they could slow a pool for hours. That's their job. They know nothing else about the storage; they just have to do their level best to do as they're told, and will only fail if they absolutely can't store the data.

The raid controller on the other hand (Netapp / ZFS, etc.) knows all about the pool. It knows if you have half a dozen good drives online, it knows if there are hot spares available, and it *should* also know how quickly the drives under its care usually respond to requests. ZFS is perfectly placed to spot when a drive is starting to fail, and to take the appropriate action to safeguard your data. It has far more information available than a single drive ever will, and should be designed accordingly. Expecting the firmware and drivers of individual drives to control the failure modes of your redundant pool is just crazy imo. You're throwing away some of the biggest benefits of using multiple drives in the first place.
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
Hey Jeff,

Good to hear there's work going on to address this. What did you guys think of my idea of ZFS supporting a "waiting for a response" status for disks as an interim solution, one that allows the pool to continue operation while it's waiting for FMA or the driver to fault the drive?

I do appreciate that it's hard to come up with a definitive "it's dead, Jim" answer, and I agree that long term the FMA approach will pay dividends. But I still feel this is a good short term solution, and one that would also complement your long term plans.

My justification for this is that it seems to me that you can split disk behavior into two states:
- returns data ok
- doesn't return data ok

And for the state where it's not returning data, you can again split that in two:
- returns wrong data
- doesn't return data

The first of these is already covered by ZFS with its checksums (with FMA doing the extra work to fault drives), so it's just the second that needs immediate attention, and for the life of me I can't think of any situation that a simple timeout wouldn't catch.

Personally I'd love to see two parameters, allowing this behavior to be turned on if desired, and allowing timeouts to be configured:

zfs-auto-device-timeout
zfs-auto-device-timeout-fail-delay

The first sets whether to use this feature, and configures the maximum time ZFS will wait for a response from a device before putting it in a "waiting" status. The second would be optional and is the maximum time ZFS will wait before faulting a device (at which point it's replaced by a hot spare). (See the sketch at the end of this message.)

The reason I think this will work well with the FMA work is that you can implement this now and have a real improvement in ZFS availability. Then, as the other work starts bringing better modeling for drive timeouts, the parameters can be either removed, or set automatically by ZFS. Long term I guess there's also the potential to remove the second setting if you felt FMA etc. ever got reliable enough, but personally I would always want to have the final fail delay set. I'd maybe set it to a long value such as 1-2 minutes to give FMA, etc. a fair chance to find the fault. But I'd be much happier knowing that the system will *always* be able to replace a faulty device within a minute or two, no matter what the FMA system finds.

The key thing is that you're not faulting devices early, so FMA is still vital. The idea is purely to let ZFS keep the pool active by removing the need for the entire pool to wait on the FMA diagnosis.

As I said before, the driver and firmware are only aware of a single disk, and I would imagine that FMA has the same limitation - it's only going to be looking at a single item and trying to determine whether it's faulty or not. Because of that, FMA is going to be designed to be very careful to avoid false positives, and will likely take its time to reach an answer in some situations. ZFS however has the benefit of knowing more about the pool, and in the vast majority of situations, it should be possible for ZFS to read or write from other devices while it's waiting for an 'official' result from any one faulty component.

Ross

On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick [EMAIL PROTECTED] wrote: I think we (the ZFS team) all generally agree with you. The current nevada code is much better at handling device failures than it was just a few months ago. ...
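A minimal sketch of how the two proposed tunables might be evaluated per device. The names (zfs_auto_device_timeout, zfs_auto_device_timeout_fail_delay), the states, and the function are hypothetical illustrations of the proposal above, not existing ZFS code:

    /* Hypothetical sketch only -- none of these symbols exist in ZFS today. */
    #include <sys/types.h>
    #include <sys/time.h>

    typedef enum { VDEV_OK, VDEV_WAITING, VDEV_FAULTED } vdev_health_t;

    extern hrtime_t zfs_auto_device_timeout;            /* 0 = feature off */
    extern hrtime_t zfs_auto_device_timeout_fail_delay; /* 0 = never auto-fault */

    /*
     * Called for a device with outstanding I/O; 'oldest' is the start time
     * of its oldest incomplete request.
     */
    vdev_health_t
    vdev_check_response(hrtime_t oldest, hrtime_t now)
    {
        hrtime_t waited = now - oldest;

        if (zfs_auto_device_timeout == 0 || waited < zfs_auto_device_timeout)
            return (VDEV_OK);       /* keep using the device normally */

        if (zfs_auto_device_timeout_fail_delay != 0 &&
            waited >= zfs_auto_device_timeout_fail_delay)
            return (VDEV_FAULTED);  /* fault it; a hot spare can take over */

        return (VDEV_WAITING);      /* serve reads/writes from other devices */
    }

The point of the "waiting" state is exactly what the message argues: the pool keeps going, and the final fault decision is still left to FMA (or to the fail-delay as a backstop).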
Re: [zfs-discuss] Race condition yields to kernel panic (u3, u4) or hanging zfs commands (u5)
Hello Matt,

you wrote about the panic in u3/u4:

> These stack traces look like 6569719 (fixed in s10u5).

Then I suppose it's also fixed by 127127-11, because that patch mentions 6569719. According to my zfs-hardness-test script this is true. Instead of crashing with a panic, with 127127-11 these servers now show hanging zfs commands, like update 5 does. Please try my test script on a test server, or see below.

> For update 5, you could start with the kernel stack of the hung commands.
> (use ::pgrep and ::findstack) We might also need the sync thread's stack
> (something like ::walk spa | ::print spa_t spa_dsl_pool->dp_txg.tx_sync_thread | ::findstack)

Okay, I'll give it a try.

$ uname -a
SunOS qacult10 5.10 Generic_137111-08 sun4u sparc SUNW,Ultra-5_10
$ head -1 /etc/release
Solaris 10 5/08 s10s_u5wos_10 SPARC
$ ps -ef | grep zfs
    root 23795 23466   0 11:02:45 pts/1  0:00 ssh localhost zfs receive hardness-test/received
    root 23782 23779   0 11:02:45 ?      0:01 zfs receive hardness-test/received
    root 23807 23804   0 11:02:52 ?      0:00 zfs receive hardness-test/received
    root 23466 23145   0 11:00:35 pts/1  0:00 /usr/bin/bash ./zfs-hardness-test.sh
    root 23793 23466   0 11:02:45 pts/1  0:00 /usr/bin/bash ./zfs-hardness-test.sh
    root 23804 23797   0 11:02:52 ?      0:00 sh -c zfs receive hardness-test/received
    root 23779     1   0 11:02:45 ?      0:00 sh -c zfs receive hardness-test/received

It seems that a receiving process (pid 23782), already killed, has not yet finished. After killing it and aborting the data transmission, the script retries the send-receive pipe (with the same arguments) with pid 23807 on the receiving end. There must be a deadlock/race condition.

$ mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs pcipsy ip hook neti sctp arp usba fcp fctl zfs random nfs audiosup md lofs logindmux sd ptm fcip crypto ipc ]
> ::pgrep zfs$
S    PID   PPID   PGID    SID    UID      FLAGS          ADDR NAME
R  23782  23779  23779  23779      0 0x4a004000  03000171cc90 zfs
R  23807  23804  23804  23804      0 0x4a004000  030001728058 zfs
> ::pgrep zfs$ | ::walk thread | ::findstack -v
stack pointer for thread 3d24480: 2a1007fc8c1
[ 02a1007fc8c1 cv_wait+0x38() ]
  02a1007fc971 delay+0x90(1, 183f000, 17cdef7, 17cdef8, 1, 18c0578)
  02a1007fca21 dnode_special_close+0x20(300221e0a58, 7, 1, 300221e0c68, 7, 300221e0a58)
  02a1007fcad1 dmu_objset_evict+0xb8(30003a8dc40, 300027cf500, 7b652000, 70407538, 7b652000, 70407400)
  02a1007fcb91 dsl_dataset_evict+0x34(30003a8dc40, 30003a8dc40, 0, 300027cf500, 3000418c2c0, 30022366200)
  02a1007fcc41 dbuf_evict_user+0x48(7b6140b0, 30022366200, 30003a8dc48, 0, 0, 30022355e20)
  02a1007fccf1 dbuf_rele+0x8c(30022355e78, 30022355e20, 70400400, 3, 3, 3)
  02a1007fcda1 dmu_recvbackup+0x94c(300017c7400, 300017c7d80, 300017c7c28, 300017c7416, 16, 1)
  02a1007fcf71 zfs_ioc_recvbackup+0x74(300017c7000, 0, 30004320150, 0, 0, 300017c7400)
  02a1007fd031 zfsdev_ioctl+0x15c(70401400, 57, ffbfee20, 1d, 74, ef0)
  02a1007fd0e1 fop_ioctl+0x20(30001d7a0c0, 5a1d, ffbfee20, 13, 300027da0c0, 12247f8)
  02a1007fd191 ioctl+0x184(3, 300043216f8, ffbfee20, 0, 1ec08, 5a1d)
  02a1007fd2e1 syscall_trap32+0xcc(3, 5a1d, ffbfee20, 0, 1ec08, ff34774c)
stack pointer for thread 30003d12e00: 2a1009dca41
[ 02a1009dca41 turnstile_block+0x600() ]
  02a1009dcaf1 mutex_vector_enter+0x3f0(0, 0, 30022355e78, 3d24480, 3d24480, 0)
  02a1009dcba1 dbuf_read+0x6c(30022355e20, 0, 1, 1, 0, 300220f1cf8)
  02a1009dcc61 dmu_bonus_hold+0xec(0, 15, 30022355e20, 2a1009dd5d8, 8, 0)
  02a1009dcd21 dsl_dataset_open_obj+0x2c(3000418c2c0, 15, 0, 9, 300043ebe88, 2a1009dd6a8)
  02a1009dcde1 dsl_dataset_open_spa+0x140(0, 7b64d000, 3000418c488, 300043ebe88, 2a1009dd768, 9)
  02a1009dceb1 dmu_objset_open+0x20(30003ca9000, 5, 9, 2a1009dd828, 1, 300043ebe88)
  02a1009dcf71 zfs_ioc_objset_stats+0x18(30003ca9000, 0, 0, 0, 70401400, 39)
  02a1009dd031 zfsdev_ioctl+0x15c(70401400, 39, ffbfc468, 13, 4c, ef0)
  02a1009dd0e1 fop_ioctl+0x20(30001d7a0c0, 5a13, ffbfc468, 13, 300027da010, 12247f8)
  02a1009dd191 ioctl+0x184(3, 300043208f8, ffbfc468, 0, 1010101, 5a13)
  02a1009dd2e1 syscall_trap32+0xcc(3, 5a13, ffbfc468, 0, 1010101, 7cb88)
> ::walk spa | ::print spa_t
{
    spa_name = 0x30022613108 "hardness-test"
    spa_avl = {
        avl_child = [ 0, 0 ]
        avl_pcb = 0x1
    }
    spa_config = 0x3002244abd0
    spa_config_syncing = 0
    spa_config_txg = 0x4
    spa_config_cache_lock = {
        _opaque = [ 0 ]
    }
    spa_sync_pass = 0x1
    spa_state = 0
    spa_inject_ref = 0
    spa_traverse_wanted = 0
    spa_sync_on = 0x1
    spa_load_state = 0 (SPA_LOAD_NONE)
    spa_zio_issue_taskq = [ 0x300225e5528, 0x300225e56d8, 0x300225e5888, 0x300225e5a38, 0x300225e5be8, 0x300225e5d98 ]
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
PS. I think this also gives you a chance at making the whole problem much simpler. Instead of the hard question of "is this faulty?", you're just trying to say "is it working right now?".

In fact, I'm now wondering if the "waiting for a response" flag wouldn't be better as "possibly faulty". That way you could use it with checksum errors too, possibly with settings as simple as errors per minute or error percentage. As with the timeouts, you could have it off by default (or provide sensible defaults), and let administrators tweak it for their particular needs.

Imagine a pool with the following settings (see the sketch at the end of this message):
- zfs-auto-device-timeout = 5s
- zfs-auto-device-checksum-fail-limit-epm = 20
- zfs-auto-device-checksum-fail-limit-percent = 10
- zfs-auto-device-fail-delay = 120s

That would allow the pool to flag a device as "possibly faulty" regardless of the type of fault, and take immediate proactive action to safeguard data (generally long before the device is actually faulted). A device triggering any of these flags would be enough for ZFS to start reading from (or writing to) other devices first, and should you get multiple failures, or problems on a non-redundant pool, you always just revert back to ZFS' current behaviour.

Ross

On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick [EMAIL PROTECTED] wrote: I think we (the ZFS team) all generally agree with you. The current nevada code is much better at handling device failures than it was just a few months ago. ...
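A rough sketch of how a "possibly faulty" flag driven by those settings could be evaluated. The tunables, counters and function below simply mirror the example values in the message; none of them exist in ZFS:

    /* Hypothetical sketch; these tunables and counters do not exist in ZFS. */
    #include <sys/types.h>
    #include <sys/time.h>

    typedef struct vdev_watch {
        hrtime_t vw_oldest_io;          /* start of oldest outstanding I/O, 0 if none */
        uint64_t vw_cksum_errors_1min;  /* checksum errors seen in the last minute */
        uint64_t vw_reads_1min;         /* total reads in the last minute */
    } vdev_watch_t;

    /* Values from the example settings above. */
    static hrtime_t auto_device_timeout = 5LL * NANOSEC;  /* 5s */
    static uint64_t cksum_fail_limit_epm = 20;             /* errors per minute */
    static uint64_t cksum_fail_limit_percent = 10;         /* error percentage */

    boolean_t
    vdev_possibly_faulty(const vdev_watch_t *vw, hrtime_t now)
    {
        if (vw->vw_oldest_io != 0 && now - vw->vw_oldest_io > auto_device_timeout)
            return (B_TRUE);    /* slow to respond */
        if (vw->vw_cksum_errors_1min > cksum_fail_limit_epm)
            return (B_TRUE);    /* too many checksum errors per minute */
        if (vw->vw_reads_1min != 0 &&
            vw->vw_cksum_errors_1min * 100 > cksum_fail_limit_percent * vw->vw_reads_1min)
            return (B_TRUE);    /* error rate above the percentage limit */
        return (B_FALSE);       /* keep treating the device as healthy */
    }

A device flagged this way would be read from (or written to) last, and only faulted for real once FMA, or the separate fail-delay, says so.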
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
No, I count that as "doesn't return data ok", but my post wasn't very clear at all on that. Even for a write, the disk will return something to indicate that the action has completed, so that can also be covered by just those two scenarios, and right now ZFS can lock the whole pool up if it's waiting for that response.

My idea is simply to allow the pool to continue operation while waiting for the drive to fault, even if that's a faulty write. It just means that the rest of the operations (reads and writes) can keep working for the minute (or three) it takes for FMA and the rest of the chain to flag a device as faulty. For write operations, the data can be safely committed to the rest of the pool, with just the outstanding writes for the drive left waiting. Then as soon as the device is faulted, the hot spare can kick in, and the outstanding writes are quickly written to the spare.

For single parity, or non-redundant volumes there's some benefit in this. For dual parity pools there's a massive benefit, as your pool stays available and your data is still well protected.

Ross

On Tue, Nov 25, 2008 at 10:44 AM, [EMAIL PROTECTED] wrote:

> My justification for this is that it seems to me that you can split disk
> behavior into two states:
> - returns data ok
> - doesn't return data ok

I think you're missing "won't write". There's clearly a difference between "get data from a different copy", which you can fix by retrying data to a different part of the redundant data, and writing data: the data which can't be written must be kept until the drive is faulted.

Casper
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
> My idea is simply to allow the pool to continue operation while waiting for
> the drive to fault, even if that's a faulty write. It just means that the
> rest of the operations (reads and writes) can keep working for the minute
> (or three) it takes for FMA and the rest of the chain to flag a device as
> faulty.

Except when you're writing a lot; 3 minutes can cause a 20GB backlog for a single disk.

Casper
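For scale, assuming a single disk sustains on the order of 110 MB/s of sequential writes (an assumed figure, not one from the thread):

    110 MB/s x 180 s = 19,800 MB, i.e. roughly the 20GB backlog mentioned above.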
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
> My justification for this is that it seems to me that you can split disk
> behavior into two states:
> - returns data ok
> - doesn't return data ok

I think you're missing "won't write". There's clearly a difference between "get data from a different copy", which you can fix by retrying data to a different part of the redundant data, and writing data: the data which can't be written must be kept until the drive is faulted.

Casper
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
Hmm, true. The idea doesn't work so well if you have a lot of writes, so there needs to be some thought as to how you handle that.

Just thinking aloud, could the missing writes be written to the log file on the rest of the pool? Or temporarily stored somewhere else in the pool? Would it be an option to allow up to a certain amount of writes to be cached in this way while waiting for FMA, and only suspend writes once that cache is full?

With a large SSD slog device, would it be possible to just stream all writes to the log? As a further enhancement, might it be possible to commit writes to the working drives, and just leave the writes for the bad drive(s) in the slog (potentially saving a lot of space)?

For pools without log devices, I suspect that you would probably need the administrator to specify the behavior, as I can see several options depending on the raid level and that pool's priorities for data availability / integrity. Drive fault write cache settings (see the sketch after this message):

default - pool waits for the device; no writes occur until the device or a spare comes online.

slog - writes are cached to the slog device until it is full, then the pool reverts to default behavior (could this be the default when slog devices are present?).

pool - writes are cached to the pool itself, up to a set maximum, and are written to the device or spare as soon as possible. This assumes a single parity pool with the other devices available. If the upper limit is reached, or another device goes faulty, the pool reverts to default behaviour.

Storing directly to the rest of the pool would probably want to be off by default on single parity pools, but I would imagine that it could be on by default on dual parity pools. Would that be enough to allow writes to continue in most circumstances while the pool waits for FMA?

Ross

On Tue, Nov 25, 2008 at 10:55 AM, [EMAIL PROTECTED] wrote:
> Except when you're writing a lot; 3 minutes can cause a 20GB backlog for a
> single disk.
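A minimal sketch of the decision the proposed modes imply for a write whose target device is in the "waiting" state. The mode names and limits come from the list above and are purely illustrative, not existing ZFS behavior:

    /* Hypothetical sketch of the proposed drive-fault write cache modes. */
    #include <sys/types.h>

    typedef enum {
        FAULT_CACHE_DEFAULT,  /* suspend writes until the device or a spare is back */
        FAULT_CACHE_SLOG,     /* stage writes for the waiting device on the slog */
        FAULT_CACHE_POOL      /* stage writes elsewhere in the pool, up to a limit */
    } fault_cache_mode_t;

    /* Returns B_TRUE if the write can be accepted now, B_FALSE if it must block. */
    boolean_t
    fault_cache_accept_write(fault_cache_mode_t mode, uint64_t bytes_pending,
        uint64_t slog_free, uint64_t pool_cache_limit)
    {
        switch (mode) {
        case FAULT_CACHE_SLOG:
            /* fall back to blocking once the slog fills */
            return (bytes_pending < slog_free ? B_TRUE : B_FALSE);
        case FAULT_CACHE_POOL:
            /* bounded backlog staged in the rest of the pool */
            return (bytes_pending < pool_cache_limit ? B_TRUE : B_FALSE);
        case FAULT_CACHE_DEFAULT:
        default:
            return (B_FALSE);   /* wait for the device or a hot spare */
        }
    }

Once the device is faulted (or comes back), the staged writes would be replayed to the spare or to the original device and the staging space freed.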
Re: [zfs-discuss] So close to better, faster, cheaper....
marko b wrote:
> Let me see if I'm understanding your suggestion. A stripe of mirrored
> pairs. I can grow by resizing an existing mirrored pair, or just attaching
> another mirrored pair to the stripe?

Both: by adding an additional mirrored pair to the stripe, and by replacing the sides of an existing mirror with larger disks.

-- Darren J Moffat
Re: [zfs-discuss] MIgrating to ZFS root/boot with system in several datasets
Lori Alt wrote:
> The SXCE code base really only supports BEs that are either all in one
> dataset, or have everything but /var in one dataset and /var in its own
> dataset (the reason for supporting a separate /var is to be able to set a
> quota on it so growth in log files, etc. can't fill up a root pool).

OK. I have a unified root dataset now. I want to segregate /var. How is it done by hand? Must I use a legacy ZFS mountpoint, or what? Is there an option for that in Live Upgrade? We are talking about Solaris 10 Update 6.

Thanks in advance.

- Jesus Cea Avion - [EMAIL PROTECTED] - http://www.jcea.es/
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
On 25-Nov-08, at 5:10 AM, Ross Smith wrote:

> Hey Jeff,
>
> Good to hear there's work going on to address this. What did you guys
> think of my idea of ZFS supporting a "waiting for a response" status for
> disks as an interim solution that allows the pool to continue operation
> while it's waiting for FMA or the driver to fault the drive?
> ...
> The first of these is already covered by ZFS with its checksums (with FMA
> doing the extra work to fault drives), so it's just the second that needs
> immediate attention, and for the life of me I can't think of any situation
> that a simple timeout wouldn't catch.
>
> Personally I'd love to see two parameters, allowing this behavior to be
> turned on if desired, and allowing timeouts to be configured:
>
> zfs-auto-device-timeout
> zfs-auto-device-timeout-fail-delay
>
> The first sets whether to use this feature, and configures the maximum
> time ZFS will wait for a response from a device before putting it in a
> "waiting" status.

The shortcomings of timeouts have been discussed on this list before. How do you tell the difference between a drive that is dead and a path that is just highly loaded? I seem to recall the argument strongly made in the past that making decisions based on a timeout alone can provoke various undesirable cascade effects.

> The second would be optional and is the maximum time ZFS will wait before
> faulting a device (at which point it's replaced by a hot spare). The
> reason I think this will work well with the FMA work is that you can
> implement this now and have a real improvement in ZFS availability. Then,
> as the other work starts bringing better modeling for drive timeouts, the
> parameters can be either removed, or set automatically by ZFS.
> ...
> it should be possible for ZFS to read or write from other devices while
> it's waiting for an 'official' result from any one faulty component.

Sounds good - devil, meet details, etc.

--Toby

> Ross
>
> On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick [EMAIL PROTECTED] wrote:
> I think we (the ZFS team) all generally agree with you. ...
Re: [zfs-discuss] replacing disk
Anyway I did not get any help, but I was able to figure it out.

[12:58:08] [EMAIL PROTECTED]: /root > zpool status mypooladas
  pool: mypooladas
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed after 0h34m with 0 errors on Tue Nov 25 03:59:23 2008
config:

        NAME                      STATE     READ WRITE CKSUM
        mypooladas                DEGRADED     0     0     0
          raidz2                  DEGRADED     0     0     0
            c4t2d0                ONLINE       0     0     0
            c4t3d0                ONLINE       0     0     0
            c4t4d0                ONLINE       0     0     0
            c4t5d0                ONLINE       0     0     0
            c4t8d0                ONLINE       0     0     0
            c4t9d0                ONLINE       0     0     0
            c4t10d0               ONLINE       0     0     0
            c4t11d0               ONLINE       0     0     0
            c4t12d0               ONLINE       0     0     0
            16858115878292111089  FAULTED      0     0     0  was /dev/dsk/c4t13d0s0
            c4t14d0               ONLINE       0     0     0
            c4t15d0               ONLINE       0     0     0

errors: No known data errors
[12:58:23] [EMAIL PROTECTED]: /root

Anyway, the way I fixed my problem was that I exported my pool so it did not exist, then I took that disk which had to be manually imported and I just created a test pool out of it, with the -f option, on just that one disk. Then I destroyed that test pool, then I imported my original pool, and I was able to replace my bad disk with the old disk from that particular pool... It is kind of a workaround, but it sucks that there is no easier way of doing it than going around this way.

format -e and changing the label on that disk did not help. I even recreated the partition table and made a huge file, and I was trying to dd to that disk hoping it would overwrite any zfs info, but I was unable to do any of that... so my workaround trick did work, and I have one extra disk to go, just need to buy it as I am short one disk at this moment.

On Mon, 24 Nov 2008, Krzys wrote:

somehow I have an issue replacing my disk.

[20:09:29] [EMAIL PROTECTED]: /root > zpool status mypooladas
  pool: mypooladas
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Mon Nov 24 20:06:48 2008
config:

        NAME                      STATE     READ WRITE CKSUM
        mypooladas                DEGRADED     0     0     0
          raidz2                  DEGRADED     0     0     0
            c4t2d0                ONLINE       0     0     0
            c4t3d0                ONLINE       0     0     0
            c4t4d0                ONLINE       0     0     0
            c4t5d0                ONLINE       0     0     0
            c4t8d0                UNAVAIL      0     0     0  cannot open
            c4t9d0                ONLINE       0     0     0
            c4t10d0               ONLINE       0     0     0
            c4t11d0               ONLINE       0     0     0
            c4t12d0               ONLINE       0     0     0
            16858115878292111089  FAULTED      0     0     0  was /dev/dsk/c4t13d0s0
            c4t14d0               ONLINE       0     0     0
            c4t15d0               ONLINE       0     0     0

errors: No known data errors
[20:09:38] [EMAIL PROTECTED]: /root

I am trying to replace the c4t13d0 disk.

[20:09:38] [EMAIL PROTECTED]: /root > zpool replace -f mypooladas c4t13d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c4t13d0s0 is part of active ZFS pool mypooladas. Please see zpool(1M).

[20:10:13] [EMAIL PROTECTED]: /root > zpool online mypooladas c4t13d0 ; zpool replace -f mypooladas c4t13d0
warning: device 'c4t13d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

[20:11:14] [EMAIL PROTECTED]: /root > zpool replace -f mypooladas c4t13d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c4t13d0s0 is part of active ZFS pool mypooladas. Please see zpool(1M).

[20:11:45] [EMAIL PROTECTED]: /root > zpool replace -f mypooladas c4t8d0 c4t13d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c4t13d0s0 is part of active ZFS pool mypooladas. Please see zpool(1M).

[20:13:24] [EMAIL
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
> The shortcomings of timeouts have been discussed on this list before. How
> do you tell the difference between a drive that is dead and a path that is
> just highly loaded?

A path that is dead is either returning bad data, or isn't returning anything. A highly loaded path is by definition reading and writing lots of data. I think you're assuming that these are file level timeouts, when this would actually need to be much lower level.

> Sounds good - devil, meet details, etc.

Yup, I imagine there are going to be a few details to iron out, many of which will need looking at by somebody a lot more technical than myself. Despite that I still think this is a discussion worth having.

So far I don't think I've seen any situation where this would make things worse than they are now, and I can think of plenty of cases where it would be a huge improvement. Of course, it also probably means a huge amount of work to implement. I'm just hoping that it's not prohibitively difficult, and that the ZFS team see the benefits as being worth it.
[zfs-discuss] HELP!!! Need to disable zfs
My root drive is ufs. I have corrupted my zpool, which is on a different drive than the root drive. My system panicked, and now it core dumps when it boots up and hits zfs start. I have an alt root drive that I can boot the system up with, but how can I disable zfs from starting on a different drive?

HELP HELP HELP
Re: [zfs-discuss] HELP!!! Need to disable zfs
Boot from the other root drive, mount up the bad one at /mnt. Then:

# mv /mnt/etc/zfs/zpool.cache /mnt/etc/zpool.cache.bad

On Tue, Nov 25, 2008 at 8:18 AM, Mike DeMarco [EMAIL PROTECTED] wrote:
> My root drive is ufs. I have corrupted my zpool, which is on a different
> drive than the root drive. My system panicked and now it core dumps when
> it boots up and hits zfs start. I have an alt root drive that I can boot
> the system up with, but how can I disable zfs from starting on a different
> drive? HELP HELP HELP

-- Mike Gerdts http://mgerdts.blogspot.com/
Re: [zfs-discuss] HELP!!! Need to disable zfs
Mike DeMarco wrote:
> My root drive is ufs. I have corrupted my zpool, which is on a different
> drive than the root drive. My system panicked and now it core dumps when
> it boots up and hits zfs start. I have an alt root drive that I can boot
> the system up with, but how can I disable zfs from starting on a different
> drive? HELP HELP HELP

Boot the working alt root drive, mount the other drive to /a, then

mv /a/etc/zfs/zpool.cache /a/etc/zfs/zpool.cache.corrupt

and reboot.

Enda
[zfs-discuss] Odd filename in zpool status -v output
My non-redundant rpool (2 replacement disks have been ordered :-) is reporting errors:

canopus% pfexec zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 4h18m, 72.07% done, 1h40m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       8     0     0
          c5d0s0    ONLINE     818     0     0  540K repaired

errors: Permanent errors have been detected in the following files:

        rpool/ROOT/opensolaris-101:/var/tmp/stmAAAaXaWkb.0015
        rpool/canopus1:<0x0>

So I don't think I care about the damage to /var/tmp/stmAAAaXaWkb.0015, but what's the second filename printed there? The pool has an rpool/canopus1 filesystem, so I guess it is somehow related to that. I'm running the current public build (101b) of OpenSolaris.

Cheers,
Chris
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
> Oh, and regarding the original post -- as several readers correctly
> surmised, we weren't faking anything, we just didn't want to wait for all
> the device timeouts. Because the disks were on USB, which is a
> hotplug-capable bus, unplugging the dead disk generated an interrupt that
> bypassed the timeout. We could have waited it out, but 60 seconds is an
> eternity on stage.

I'm sorry, I didn't mean to sound offensive. Anyway, I think that people should know that their drives can stall the system for minutes, despite ZFS. I mean: there are a lot of writings about how ZFS is great for recovery in case a drive fails, but there's nothing regarding this problem. I know now it's not ZFS's fault; but I wonder how many people set up their drives with ZFS assuming that as soon as something goes bad, ZFS will fix it.

Is there any way to test these cases other than smashing the drive with a hammer? Having a failover policy where the failover can't be tested sounds scary...
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
Ross Smith wrote:
> My justification for this is that it seems to me that you can split disk
> behavior into two states:
> - returns data ok
> - doesn't return data ok
> And for the state where it's not returning data, you can again split that
> in two:
> - returns wrong data
> - doesn't return data

The state in discussion in this thread is "the I/O requested by ZFS hasn't finished after 60, 120, 180, 3600, etc. seconds". The pool is waiting (for device timeouts) to distinguish between the first two states. More accurate state descriptions are:
- The I/O has returned data
- The I/O hasn't yet returned data and the user (admin) is justifiably impatient.

For the first state, the data is either correct (verified by the ZFS checksums, or ESUCCESS on write) or incorrect and retried.

> The first of these is already covered by ZFS with its checksums (with FMA
> doing the extra work to fault drives), so it's just the second that needs
> immediate attention, and for the life of me I can't think of any situation
> that a simple timeout wouldn't catch.
>
> Personally I'd love to see two parameters, allowing this behavior to be
> turned on if desired, and allowing timeouts to be configured:
>
> zfs-auto-device-timeout
> zfs-auto-device-timeout-fail-delay

I'd prefer these be set at the (default) pool level:

zpool-device-timeout
zpool-device-timeout-fail-delay

with specific per-VDEV overrides possible:

vdev-device-timeout and vdev-device-fail-delay

This would allow, but not require, slower VDEVs to be tuned specifically for that case without hindering the default pool behavior on the local fast disks. Specifically, consider where I'm using mirrored VDEVs with one half over iSCSI, and want to have the iSCSI retry logic still apply. Writes that failed while the iSCSI link is down would have to be resilvered, but at least reads would switch to the local devices faster.

Set them to the default magic 0 value to have the system use the current behavior, of relying on the device drivers to report failures. Set to a number (in ms probably) and the pool would consider an I/O that takes longer than that as "returns invalid data".

When the FMA work discussed below is done, these could be augmented by the pool's best heuristic guess as to what the proper timeouts should be, which could be saved in (kstat?) vdev-device-autotimeout. If you set the timeout to the magic -1 value, the pool would use vdev-device-autotimeout.

All that would be required is for the I/O that caused the disk to take a long time to be given a deadline (now + (vdev-device-timeout ?: (zpool-device-timeout ?: forever)))* and to consider the I/O complete with whatever data has returned after that deadline: if that's a bunch of 0's in a read, it would have a bad checksum; if it's a partially-completed write, it would have to be committed somewhere else.

Unfortunately, I'm not enough of a programmer to implement this.

--Joe

* with the -1 magic, it would be a little more complicated calculation.
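A small sketch of the deadline calculation in Joe's footnote, including the magic 0 and -1 values; the parameter names are the hypothetical ones from the message, not real ZFS tunables:

    /* Hypothetical sketch of the proposed deadline calculation (times in ns). */
    #include <stdint.h>

    #define TIMEOUT_USE_DRIVER  0        /* magic 0: rely on the device driver */
    #define TIMEOUT_AUTO        (-1)     /* magic -1: use the learned autotimeout */
    #define TIMEOUT_FOREVER     INT64_MAX

    static int64_t
    effective_timeout(int64_t vdev_timeout_ms, int64_t pool_timeout_ms,
        int64_t autotimeout_ns)
    {
        /* vdev-device-timeout ?: zpool-device-timeout ?: forever */
        int64_t t = (vdev_timeout_ms != TIMEOUT_USE_DRIVER) ?
            vdev_timeout_ms : pool_timeout_ms;

        if (t == TIMEOUT_USE_DRIVER)
            return (TIMEOUT_FOREVER);    /* current behavior: wait for the driver */
        if (t == TIMEOUT_AUTO)
            return (autotimeout_ns);     /* pool's heuristic guess (kstat) */
        return (t * 1000000LL);          /* configured in ms, convert to ns */
    }

    static int64_t
    io_deadline(int64_t now_ns, int64_t vdev_timeout_ms, int64_t pool_timeout_ms,
        int64_t autotimeout_ns)
    {
        int64_t t = effective_timeout(vdev_timeout_ms, pool_timeout_ms, autotimeout_ns);
        return (t == TIMEOUT_FOREVER ? TIMEOUT_FOREVER : now_ns + t);
    }

Any I/O still outstanding at io_deadline() would then be treated as having "returned invalid data" and handled by the normal checksum/retry path.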
Re: [zfs-discuss] HELP!!! Need to disable zfs
> Boot from the other root drive, mount up the bad one at /mnt. Then:
>
> # mv /mnt/etc/zfs/zpool.cache /mnt/etc/zpool.cache.bad

That got it. Thanks
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
On Tue, 25 Nov 2008, Ross Smith wrote:
> Good to hear there's work going on to address this. What did you guys
> think of my idea of ZFS supporting a "waiting for a response" status for
> disks as an interim solution that allows the pool to continue operation
> while it's waiting for FMA or the driver to fault the drive?

A stable and sane system never comes with two brains. It is wrong to put this sort of logic into ZFS when ZFS is already depending on FMA to make the decisions and Solaris already has an infrastructure to handle faults. The more appropriate solution is that this feature should be in FMA.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
Scara Maccai wrote:
>> Oh, and regarding the original post -- as several readers correctly
>> surmised, we weren't faking anything, we just didn't want to wait for
>> all the device timeouts. ...
>
> I'm sorry, I didn't mean to sound offensive. Anyway, I think that people
> should know that their drives can stall the system for minutes, despite
> ZFS. I mean: there are a lot of writings about how ZFS is great for
> recovery in case a drive fails, but there's nothing regarding this
> problem. I know now it's not ZFS's fault; but I wonder how many people
> set up their drives with ZFS assuming that as soon as something goes bad,
> ZFS will fix it.
>
> Is there any way to test these cases other than smashing the drive with a
> hammer? Having a failover policy where the failover can't be tested
> sounds scary...

It is with this idea in mind that I wrote part of Chapter 1 of the book Designing Enterprise Solutions with Sun Cluster 3.0. For convenience, I also published chapter 1 as a Sun BluePrint Online article.
http://www.sun.com/blueprints/1101/clstrcomplex.pdf

False positives are very expensive in highly available systems, so we really do want to avoid them. One thing that we can do, and I've already (again[1]) started down the path to document, is to show where and how the various (common) timeouts are in the system. Once you know how sd, cmdk, dbus, and friends work, you can make better decisions on where to look when the behaviour is not as you expect. But this is a very tedious path because there are so many different failure modes, and real-world devices can react ambiguously when they fail.

[1] We developed a method to benchmark cluster dependability. The description of the benchmark was published in several papers, but is now available in the new IEEE book on Dependability Benchmarking. This is really the first book of its kind and the first step toward making dependability benchmarks more mainstream. Anyway, the work done for that effort included methods to improve failure detection and handling, so we have a detailed understanding of those things for SPARC, in lab form. Expanding that work to cover the random-device-bought-at-Frys will be a substantial undertaking. Co-conspirators welcome.

-- richard
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
On Tue, Nov 25, 2008 at 11:55:17AM +0100, [EMAIL PROTECTED] wrote:
>> My idea is simply to allow the pool to continue operation while waiting
>> for the drive to fault, even if that's a faulty write. It just means that
>> the rest of the operations (reads and writes) can keep working for the
>> minute (or three) it takes for FMA and the rest of the chain to flag a
>> device as faulty.
>
> Except when you're writing a lot; 3 minutes can cause a 20GB backlog for a
> single disk.

If we're talking isolated, or even clumped-but-relatively-few bad sectors, then having a short timeout for writes and remapping should be possible to do without running out of memory to cache those writes. But...

...writes to bad sectors will happen when txgs flush, and depending on how bad sector remapping is done (say, by picking a new block address and changing the blkptrs that referred to the old one), that might mean redoing large chunks of the txg in the next one, which might mean that fsync() could be delayed an additional 5 seconds or so. And even if that's not the case, writes to mirrors are supposed to be synchronous, so one would think that bad block remapping should be synchronous also; thus there must be a delay on writes to bad blocks no matter what -- though that delay could be tuned to be no more than a few seconds.

That points to a possibly decent heuristic on writes: vdev-level timeouts that result in bad block remapping, but if the queue of outstanding bad block remappings grows too large, treat the disk as faulted and degrade the pool. Sounds simple, but it needs to be combined at a higher layer with information from other vdevs. Unplugging a whole jbod shouldn't necessarily fault all the vdevs on it -- perhaps it should cause pool operation to pause until the jbod is plugged back in... which should then cause those outstanding bad block remappings to be rolled back, since they weren't bad blocks after all. That's a lot of fault detection and handling logic across many layers.

Incidentally, cables do fall out, or, rather, get pulled out accidentally. What should be the failure mode of a jbod disappearing due to a pulled cable (or power supply failure)? A pause in operation (hangs)? Or faulting of all affected vdevs, and if you're mirrored across different jbods, incurring the need to re-silver later, with degraded operation for hours on end? I bet answers will vary. The best answer is to provide enough redundancy (multiple power supplies, multi-pathing, ...) to make such situations less likely, but that's not a complete answer.

Nico
--
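A toy sketch of the write-side heuristic described above (a timeout leads to a bad-block remap; too large a remap backlog faults the disk; a reappearing jbod rolls the remaps back). The threshold and every name are made up for illustration only:

    /* Toy illustration of the heuristic; nothing here exists in ZFS. */
    #include <sys/types.h>

    #define REMAP_QUEUE_FAULT_THRESHOLD 256  /* arbitrary example limit */

    typedef struct vdev_remap_state {
        uint64_t vr_outstanding_remaps;      /* bad-block remaps not yet resolved */
    } vdev_remap_state_t;

    /*
     * A write timed out: queue a remap and decide whether to keep trusting
     * the disk. Returns B_FALSE if the disk should be faulted and the pool
     * degraded.
     */
    boolean_t
    vdev_note_write_timeout(vdev_remap_state_t *vr)
    {
        vr->vr_outstanding_remaps++;
        if (vr->vr_outstanding_remaps > REMAP_QUEUE_FAULT_THRESHOLD)
            return (B_FALSE);                /* too many outstanding remaps */
        return (B_TRUE);                     /* remap the block and move on */
    }

    /* The enclosure came back (e.g. a re-plugged cable): the blocks weren't bad. */
    void
    vdev_cancel_pending_remaps(vdev_remap_state_t *vr)
    {
        vr->vr_outstanding_remaps = 0;
    }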
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
I disagree Bob, I think this is a very different function to that which FMA provides.

As far as I know, FMA doesn't have access to the big picture of pool configuration that ZFS has, so why shouldn't ZFS use that information to increase the reliability of the pool, while still using FMA to handle device failures? The flip side of the argument is that ZFS already checks the data returned by the hardware. You might as well say that FMA should deal with that too, since it's responsible for all hardware failures. The role of ZFS is to manage the pool; availability should be part and parcel of that.

On Tue, Nov 25, 2008 at 3:57 PM, Bob Friesenhahn [EMAIL PROTECTED] wrote:
> A stable and sane system never comes with two brains. It is wrong to put
> this sort of logic into ZFS when ZFS is already depending on FMA to make
> the decisions and Solaris already has an infrastructure to handle faults.
> The more appropriate solution is that this feature should be in FMA.
>
> Bob
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
On Tue, 25 Nov 2008, Ross Smith wrote:
> I disagree Bob, I think this is a very different function to that which
> FMA provides. As far as I know, FMA doesn't have access to the big picture
> of pool configuration that ZFS has, so why shouldn't ZFS use that
> information to increase the reliability of the pool while still using FMA
> to handle device failures?

If FMA does not currently have knowledge of the redundancy model but needs it to make well-informed decisions, then it should be updated to incorporate this information. FMA sees all the hardware in the system, including devices used for UFS and other types of filesystems, and even tape devices. It is able to see hardware at a much more detailed level than ZFS does. ZFS only sees an abstracted level of the hardware. If an HBA or part of the backplane fails, FMA should be able to determine the failing area (at least as far out as it can see based on available paths), whereas all ZFS knows is that it is having difficulty getting there from here.

> The flip side of the argument is that ZFS already checks the data returned
> by the hardware. You might as well say that FMA should deal with that too
> since it's responsible for all hardware failures.

If bad data is returned, then I assume that there is a peg to FMA's error statistics counters.

> The role of ZFS is to manage the pool, availability should be part and
> parcel of that.

Too much complexity tends to clog up the works and keep other areas of ZFS from being enhanced expediently. ZFS would soon become a chunk of source code that no mortal could understand, and as such it would be put under maintenance with no more hope of moving forward and an inability to address new requirements.

A rational system really does not want to have multiple brains. Otherwise some parts of the system will think that the device is fine while other parts believe that it has failed. None of us want to deal with an insane system like that. There is also the matter of fault isolation. If a drive can not be reached, is it because the drive failed, or because an HBA supporting multiple drives failed, or a cable got pulled? This sort of information is extremely important for large reliable systems.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
It's hard to tell exactly what you are asking for, but this sounds similar to how ZFS already works. If ZFS decides that a device is pathologically broken (as evidenced by vdev_probe() failure), it knows that FMA will come back and diagnose the drive as faulty (because we generate a probe_failure ereport). So ZFS pre-emptively short-circuits all I/O and treats the drive as faulted, even though the diagnosis hasn't come back yet. We can only do this for errors that have a 1:1 correspondence with faults.

- Eric

On Tue, Nov 25, 2008 at 04:10:13PM +, Ross Smith wrote:
> I disagree Bob, I think this is a very different function to that which
> FMA provides. As far as I know, FMA doesn't have access to the big picture
> of pool configuration that ZFS has, so why shouldn't ZFS use that
> information to increase the reliability of the pool while still using FMA
> to handle device failures?
> ...

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock
Re: [zfs-discuss] ESX integration
Hi Ahmed,

I'm part of the team that is working on such integration, and snapshot integration (and SRM) is definitely on the roadmap. Right now there is nothing official, but as others have mentioned, some simple scripting wouldn't be too hard. I like to use the Remote Command Line appliance and run my scripts from there. That makes it easy to have one location to quiesce the VMs, run the ssh script to snapshot the 7000 (Amber Road) array, and then return the VMs to full operation. Stay tuned for lots of work in this area.

-ryan

Ahmed Kamal wrote:
> Hi,
> Not sure if this is the best place to ask, but do Sun's new Amber Road
> storage boxes have any kind of integration with ESX? Most importantly,
> quiescing the VMs before snapshotting the zvols, and/or some level of
> management integration thru either the web UI or ESX's console? If
> there's nothing official, did anyone hack any scripts for that?
> Regards
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
On 11/23/08 12:14, Paweł Tęcza wrote:
>> As others here have said, just issue 'zfs list -t snapshot' if you just
>> want to see the snapshots, or 'zfs list -t all' to see both filesystems
>> and snapshots.
>
> OK, I can use that, but my dreamed `zfs list` syntax is like below:
>
> zfs list [all|snapshots]
>
> zfs list: displays all filesystems and snapshots too only if listsnaps=on
> zfs list all: displays all filesystems and snapshots too even if listsnaps=off
> zfs list snapshots: displays all snapshots, without filesystems
>
> Do you agree with me that it's simple and beautiful? ;)

Pawel,

With http://bugs.opensolaris.org/view_bug.do?bug_id=6734907 ("zfs list -t all would be useful once snapshots are omitted by default"), the syntax of zfs list is very close to the one you have dreamed of:

zfs list -t [filesystem|volume|snapshot|all]

zfs list               displays all filesystems and volumes;
                       displays snapshots only if listsnaps=on
zfs list -t filesystem displays all filesystems;
                       does not display volumes or snapshots
zfs list -t volume     displays all volumes;
                       does not display filesystems or snapshots
zfs list -t snapshot   displays all snapshots;
                       does not display filesystems or volumes
zfs list -t all        displays all filesystems, volumes, and snapshots
                       (even if listsnaps=off)

-- Rich
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
I did a fresh install a week ago. Because of Time Slider / auto-snapshot being installed, I have 15 pages of snapshots.

Malachi

On Sun, Nov 23, 2008 at 8:53 AM, Paweł Tęcza [EMAIL PROTECTED] wrote:
> On Sun, 2008-11-23 at 13:41 +0530, Sanjeev Bagewadi wrote:
>
> Thank you very much for your feedback!
>
> What is a large number of snapshots? 100? 1000? 10000? Do people really
> need so many snapshots? I think that if some user has a large number of
> snapshots, then it's not a `zfs list` problem, it's the user's problem.
Re: [zfs-discuss] ESX integration
Will this be for Sun's xVM Server as well as for ESX? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
On 2008-11-25, Tue at 10:16 -0800, Malachi de Ælfweald wrote: I did a fresh install a week ago. Because of Time Slider / auto-snapshot being installed, I have 15 pages of snapshots. Malachi, You only wrote that you have a lot of snapshots. You didn't write whether you really need all of them. I doubt it. So if you don't want to litter your pool any more, then the best thing is to remove most of your snapshots. Cheers, Pawel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
On 2008-11-25, Tue at 13:11 -0500, Richard Morris - Sun Microsystems - Burlington United States wrote: Pawel, with http://bugs.opensolaris.org/view_bug.do?bug_id=6734907 ("zfs list -t all would be useful once snapshots are omitted by default"), the syntax of zfs list is very close to the one you have dreamed of:
zfs list -t [filesystem|volume|snapshot|all]
  zfs list: displays all filesystems and volumes; displays snapshots only if listsnaps=on
  zfs list -t filesystem: displays all filesystems; does not display volumes or snapshots
  zfs list -t volume: displays all volumes; does not display filesystems or snapshots
  zfs list -t snapshot: displays all snapshots; does not display filesystems or volumes
  zfs list -t all: displays all filesystems, volumes, and snapshots (even if listsnaps=off)
Hi Rich, Thanks a lot for your feedback! I was thinking that the `zfs list` thread was already dead ;) The syntax above is pretty nice for me, but IMHO the -t switch is rather needless here :) I also asked Sun people about the well-known backward compatibility concern, but unfortunately nobody commented on it :( My best regards, Pawel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
I think you are missing the point. They are auto-generated due to having Time Slider set up. It does auto-snapshots of the entire drive every hour. It removes old ones when the drive reaches 80% utilization. http://blogs.sun.com/erwann/entry/zfs_on_the_desktop_zfs Hope that helps, Malachi On Tue, Nov 25, 2008 at 1:24 PM, Paweł Tęcza [EMAIL PROTECTED] wrote: On 2008-11-25, Tue at 10:16 -0800, Malachi de Ælfweald wrote: I did a fresh install a week ago. Because of Time Slider / auto-snapshot being installed, I have 15 pages of snapshots. Malachi, You only wrote that you have a lot of snapshots. You didn't write whether you really need all of them. I doubt it. So if you don't want to litter your pool any more, then the best thing is to remove most of your snapshots. Cheers, Pawel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
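If you want to see where those 15 pages are coming from, the Time Slider auto-snapshot feature runs as a handful of SMF service instances and names its snapshots with a zfs-auto-snap prefix (exact service and snapshot naming may vary slightly between builds); a quick sketch:

    # the Time Slider / auto-snapshot service instances
    svcs | grep auto-snapshot

    # just the automatic snapshots, with size and creation time
    zfs list -t snapshot -o name,used,creation | grep zfs-auto-snap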
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
On 2008-11-25, Tue at 13:46 -0800, Malachi de Ælfweald wrote: I think you are missing the point. They are auto-generated due to having Time Slider set up. It does auto-snapshots of the entire drive every hour. It removes old ones when the drive reaches 80% utilization. http://blogs.sun.com/erwann/entry/zfs_on_the_desktop_zfs Thanks a lot for the link! That blog entry is really very useful. Well, I haven't used Time Slider yet, but I can see in the screenshot that you can decrease the % of file system capacity. The default of 80% is too big a value for me. Also I'm very curious whether I can configure Time Slider to take a backup every 2, 4, or 8 hours, for example. Maybe it's an advanced option? Unfortunately I can't check it now, because I'm writing this on an Ubuntu box :P Good night, Pawel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
On 2008-11-25, Tue at 23:16 +0100, Paweł Tęcza wrote: Also I'm very curious whether I can configure Time Slider to take a backup every 2, 4, or 8 hours, for example. Or set the max number of snapshots? Pawel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] 'zeroing out' unused blocks on a ZFS?
I have RTFM'd through this list and a number of Sun docs at docs.sun and can't find any information on how I might be able to write out 'hard zeros' to the unused blocks on a ZFS. The reason I'd like to do this: the storage (LUN/s) I'm providing to ZFS is thin-provisioned and knows nothing about the host O.S. file system, so it can't tell whether a previously written disk block still holds data; it only knows a block is free if all zeros are written to it. So how would I go about doing that with ZFS? I was looking at the mkfile command, which looked like it might do it, but I wasn't sure. Has anyone done this before and can provide the instructions? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
Paweł Tęcza wrote: On 2008-11-25, Tue at 23:16 +0100, Paweł Tęcza wrote: Also I'm very curious whether I can configure Time Slider to take a backup every 2, 4, or 8 hours, for example. Or set the max number of snapshots? UTSL http://src.opensolaris.org/source/xref/jds/zfs-snapshot/src/ The service offers a way to manage cronjobs, but you can manage them in other ways, too. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 'zeroing out' unused blocks on a ZFS?
On 25 November, 2008 - Dave Brown sent me these 0,8K bytes: I have RTFM'd through this list and a number of Sun docs at docs.sun and can't find any information on how I might be able to write out 'hard zeros' to the unused blocks on a ZFS. The reason I'd like to do this: the storage (LUN/s) I'm providing to ZFS is thin-provisioned and knows nothing about the host O.S. file system, so it can't tell whether a previously written disk block still holds data; it only knows a block is free if all zeros are written to it. So how would I go about doing that with ZFS? I was looking at the mkfile command, which looked like it might do it, but I wasn't sure. Has anyone done this before and can provide the instructions? Try turning compression off, then create a huge file (all free space) with mkfile. Not sure about the exact bits that end up on disk with regard to checksums etc.; you might want to try with checksum off as well. /Tomas -- Tomas Ögren, [EMAIL PROTECTED], http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
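A sketch of what Tomas is suggesting, with 'tank/fs' as a placeholder dataset -- and his caveat still applies: ZFS metadata and checksums mean this is not guaranteed to leave every freed block literally zeroed, so test it against your thin-provisioning array before relying on it:

    zfs set compression=off tank/fs
    # fill the free space with zeros (mkfile with an explicit size works too),
    # then delete the filler file
    dd if=/dev/zero of=/tank/fs/zerofill bs=1024k
    sync
    rm /tank/fs/zerofill
    # restore whatever compression setting you had before
    zfs set compression=on tank/fs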
[zfs-discuss] ZFS ACL/ACE issues with Samba - Access Denied
Solaris 10u4 x64, using the included Samba 3.0.28. Samba is AD integrated, and I have a share configured as follows:
[crlib1]
   comment = Creative Lib1
   path = /pool/creative/lib1
   read only = No
   vfs objects = zfsacl
   acl check permissions = No
   unix extensions = No
   inherit permissions = Yes
   map acl inherit = Yes
I have set both aclmode and aclinherit to be passthrough for the LIB1 filesystem:
pool/creative/lib1  aclmode     passthrough  local
pool/creative/lib1  aclinherit  passthrough  local
I have a user, Tom. Tom is a member of Editors. Another test user, Sue, is a member of Readers. Both users are members of other groups as well. I configured the permissions on LIB1 as 777, and created a test subfolder to which I have applied permissions through Windows XP. Windows complained about reordering the permissions when I first set them, and now doesn't complain when opening the security tab, so I assume they're ordered correctly.
[EMAIL PROTECTED]:/pool/creative/lib1# ls -dV Test/
d-+ 2 eric domain users 4 Nov 25 21:36 Test/
   group:editors:rwxpd-aARWc--s:fd:allow
   group:readers:r-x---a-R-c--s:fd:allow
   group:domain admins:rwxpdDaARWcCos:fd:allow
   user:eric:rwxpd-aARWc--s:fd:allow
[EMAIL PROTECTED]:/pool/creative/lib1#
The server can see the group (group ID 15130) and can verify that the user in AD is a member of the group:
[EMAIL PROTECTED]:/pool/creative/lib1# wbinfo --group-info=editors
editors:x:15130
[EMAIL PROTECTED]:/pool/creative/lib1# wbinfo -r tom
15129 15018 15130 15166 15200 15127 15132 15027 15010 15120 15004 15041 15082 15133 15202 15001
[EMAIL PROTECTED]:/pool/creative/lib1#
My problem is that Tom is a member of Editors, but he gets an Access Denied message while trying to put a file into the Test folder. The samba log for the client shows the following trace:
[2008/11/25 22:42:18, 3] smbd/process.c:(1068) Transaction 966323 of length 1604
[2008/11/25 22:42:18, 3] smbd/process.c:(926) switch message SMBsesssetupX (pid 7616) conn 0x0
[2008/11/25 22:42:18, 3] smbd/sec_ctx.c:(241) setting sec ctx (0, 0) - sec_ctx_stack_ndx = 0
[2008/11/25 22:42:18, 3] smbd/sesssetup.c:(1244) wct=12 flg2=0xc807
[2008/11/25 22:42:18, 3] smbd/sesssetup.c:(1029) Doing spnego session setup
[2008/11/25 22:42:18, 3] smbd/sesssetup.c:(1060) NativeOS=[] NativeLanMan=[] PrimaryDomain=[]
[2008/11/25 22:42:18, 3] smbd/sesssetup.c:(697) reply_spnego_negotiate: Got secblob of size 1471
[2008/11/25 22:42:18, 3] libads/kerberos_verify.c:(469) ads_verify_ticket: did not retrieve auth data.
continuing without PAC
[2008/11/25 22:42:18, 3] smbd/sesssetup.c:(321) Ticket name is [EMAIL PROTECTED]
[2008/11/25 22:42:18, 4] lib/substitute.c:(407) Home server: vault
[2008/11/25 22:42:18, 4] lib/substitute.c:(407) Home server: vault
[2008/11/25 22:42:18, 3] smbd/sec_ctx.c:(208) push_sec_ctx(0, 0) : sec_ctx_stack_ndx = 1
[2008/11/25 22:42:18, 3] smbd/uid.c:(358) push_conn_ctx(0) : conn_ctx_stack_ndx = 0
[2008/11/25 22:42:18, 3] smbd/sec_ctx.c:(241) setting sec ctx (0, 0) - sec_ctx_stack_ndx = 1
[2008/11/25 22:42:18, 3] smbd/sec_ctx.c:(356) pop_sec_ctx (0, 0) - sec_ctx_stack_ndx = 0
[2008/11/25 22:42:18, 3] passdb/lookup_sid.c:(1069) fetch sid from gid cache 15004 - S-1-5-21-1409556225-1798326808-5522801-513
[2008/11/25 22:42:18, 3] passdb/lookup_sid.c:(1089) fetch gid from cache 15000 - S-1-5-32-544
[2008/11/25 22:42:18, 3] passdb/lookup_sid.c:(1089) fetch gid from cache 15001 - S-1-5-32-545
[2008/11/25 22:42:18, 3] smbd/sec_ctx.c:(208) push_sec_ctx(0, 0) : sec_ctx_stack_ndx = 1
[2008/11/25 22:42:18, 3] smbd/uid.c:(358) push_conn_ctx(0) : conn_ctx_stack_ndx = 0
[2008/11/25 22:42:18, 3] smbd/sec_ctx.c:(241) setting sec ctx (0, 0) - sec_ctx_stack_ndx = 1
[2008/11/25 22:42:18, 3] smbd/sec_ctx.c:(356) pop_sec_ctx (0, 0) - sec_ctx_stack_ndx = 0
[2008/11/25 22:42:18, 3] lib/privileges.c:(261) get_privileges: No privileges assigned to SID [S-1-5-21-2469361529-1303801020-868054103-32338]
[2008/11/25 22:42:18, 3] lib/privileges.c:(261) get_privileges: No privileges assigned to SID [S-1-5-21-1409556225-1798326808-5522801-513]
[2008/11/25 22:42:18, 3] lib/privileges.c:(261) get_privileges: No privileges assigned to SID [S-1-5-2]
[2008/11/25 22:42:18, 3] lib/privileges.c:(261) get_privileges: No privileges assigned to SID [S-1-5-11]
[2008/11/25 22:42:18, 3] lib/privileges.c:(261) get_privileges: No privileges assigned to SID [S-1-5-21-1409556225-1798326808-5522801-5503]
[2008/11/25 22:42:18, 3] lib/privileges.c:(261) get_privileges: No privileges assigned to SID [S-1-5-32-545]
[2008/11/25 22:42:18, 3] passdb/lookup_sid.c:(1089) fetch gid from cache 15004 - S-1-5-21-1409556225-1798326808-5522801-513
[2008/11/25 22:42:18, 3]
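Not an answer, but when chasing this kind of Access Denied it can help to rule out the ZFS side before digging further into the winbind/SID mapping. A minimal sketch using the Solaris NFSv4-ACL chmod syntax, with paths and ACE strings copied from the listing above (the su test assumes winbind is providing the AD user to the OS):

    # confirm the dataset really is in passthrough mode
    zfs get aclmode,aclinherit pool/creative/lib1

    # re-check the ACEs and their order as ZFS sees them
    ls -dV /pool/creative/lib1/Test

    # experiment: add an explicit write/add-file ACE for the group and retry the copy
    chmod A+group:editors:rwxpd-aARWc--s:fd:allow /pool/creative/lib1/Test

    # then try the same operation as the user locally, bypassing Samba
    su - tom -c 'touch /pool/creative/lib1/Test/smoketest'

If the local touch works but the CIFS copy still fails, the problem is more likely in Samba's token/SID handling than in the ZFS ACL itself.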
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
Paweł Tęcza wrote: On 2008-11-25, Tue at 23:16 +0100, Paweł Tęcza wrote: Also I'm very curious whether I can configure Time Slider to take a backup every 2, 4, or 8 hours, for example. Or set the max number of snapshots? Yes you can (though not in the time-slider GUI yet). Have a read of http://src.opensolaris.org/source/xref/jds/zfs-snapshot/README.zfs-auto-snapshot.txt In particular, look for the zfs/keep setting to configure the maximum number of snapshots you'd like each instance to keep, and zfs/period to set how many intervals you want to wait between snapshots. cheers, tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
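A rough sketch of what that looks like in practice. The service FMRI and the zfs/keep and zfs/period property names below follow the README Tim points at, but double-check them against your build before pasting:

    # keep 12 hourly snapshots and take one every 2 hours, for example
    svccfg -s svc:/system/filesystem/zfs/auto-snapshot:hourly setprop zfs/keep = 12
    svccfg -s svc:/system/filesystem/zfs/auto-snapshot:hourly setprop zfs/period = 2
    svcadm refresh svc:/system/filesystem/zfs/auto-snapshot:hourly
    svcadm restart svc:/system/filesystem/zfs/auto-snapshot:hourly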
[zfs-discuss] Can a zpool cachefile be copied between systems?
Suppose that you have a SAN environment with a lot of LUNs. In the normal course of events this means that 'zpool import' is very slow, because it has to probe all of the LUNs all of the time. In S10U6, the theoretical 'obvious' way to get around this for your SAN filesystems seems to be to use a non-default cachefile (likely one cachefile per virtual fileserver, although you could go all the way to one cachefile per pool) and then copy this cachefile from the master host to all of your other hosts. When you need to rapidly bring up a virtual fileserver on a non-default host, you can just run zpool import -c /where/ever/host.cache -a However, the S10U6 zpool documentation doesn't say if zpool cachefiles can be copied between systems and used like this. Does anyone know if this is a guaranteed property that is sure to keep working, something that works right now but there's no guarantees that it will keep working in future versions of Solaris and patches, or something that doesn't work reliably in general? (I have done basic tests with my S10U6 test machine, and it seems to work ... but I might easily be missing something that makes it not reliable.) - cks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
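Not an answer to the guarantee question, but for anyone wanting to try the same thing, a minimal sketch of the mechanics (pool names, cachefile paths, and devices are placeholders):

    # on the master host: create the pool with a private cachefile ...
    zpool create -o cachefile=/etc/zfs/fs1.cache fs1pool <devices>
    # ... or point an existing pool at one
    zpool set cachefile=/etc/zfs/fs1.cache fs1pool

    # copy /etc/zfs/fs1.cache to the standby hosts; then, on failover:
    zpool import -c /etc/zfs/fs1.cache -a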
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
On 11/25/08 16:41, Paweł Tęcza wrote: On 2008-11-25, Tue at 13:11 -0500, Richard Morris - Sun Microsystems - Burlington United States wrote: Pawel, with http://bugs.opensolaris.org/view_bug.do?bug_id=6734907 ("zfs list -t all would be useful once snapshots are omitted by default"), the syntax of zfs list is very close to the one you have dreamed of:
zfs list -t [filesystem|volume|snapshot|all]
  zfs list: displays all filesystems and volumes; displays snapshots only if listsnaps=on
  zfs list -t filesystem: displays all filesystems; does not display volumes or snapshots
  zfs list -t volume: displays all volumes; does not display filesystems or snapshots
  zfs list -t snapshot: displays all snapshots; does not display filesystems or volumes
  zfs list -t all: displays all filesystems, volumes, and snapshots (even if listsnaps=off)
Hi Rich, Thanks a lot for your feedback! I was thinking that the `zfs list` thread was already dead ;) The syntax above is pretty nice for me, but IMHO the -t switch is rather needless here :) I also asked Sun people about the well-known backward compatibility concern, but unfortunately nobody commented on it :( Pawel, The fix for 6734907 did not add the -t option to zfs list. That option already existed, so there's no issue with backward compatibility. Listing all datasets could be done with zfs list -t filesystem,volume,snapshot, which produced the same output as zfs list. Now that snapshots are not displayed by default (unless listsnaps=on), there's some benefit in having a shorter option to list all datasets. So 6734907 added -t all, which produces the same output as -t filesystem,volume,snapshot. -- Rich ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] `zfs list` doesn't show my snapshot
On Tue, Nov 25, 2008 at 06:34:47PM -0500, Richard Morris - Sun Microsystems - Burlington United States wrote: option to list all datasets. So 6734907 added -t all which produces the same output as -t filesystem,volume,snapshot. [1] http://bugs.opensolaris.org/view_bug.do?bug_id=6734907 Hmmm - very strange, when I run 'zfs list -t all' on b101 it says: invalid type 'all' ... But the bug report says: Fixed In: snv_99, Release Fixed: solaris_nevada(snv_99). So, what do those fields really mean? Regards, jel. -- Otto-von-Guericke University http://www.cs.uni-magdeburg.de/ Department of Computer Science Geb. 29 R 027, Universitaetsplatz 2 39106 Magdeburg, Germany Tel: +49 391 67 12768 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss