Re: [zfs-discuss] Thin device support in ZFS?
On Thu, Dec 31 at 16:53, David Magda wrote:
> Just as the first 4096-byte block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say "no really, I'm talking about the /actual/ LBA 123456".

What, exactly, is the "/actual/ LBA 123456" on a modern SSD?

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Thu, Dec 31 at 10:18, Bob Friesenhahn wrote:
> There are of course SSDs with hardly any (or no) reserve space, but while we might be willing to sacrifice an image or two to SSD block failure in our digital camera, that is just not acceptable for serious computer use.

Some people are doing serious computing on devices with 6-7% reserve. Devices with less enforced reserve will be significantly cheaper per exposed gigabyte, independent of all other factors, and always give the user the flexibility to increase their effective reserve by destroking the working area a little or a lot.

If someone just needs blazing fast read access and isn't expecting to put more than a few cycles/day on their devices, small-reserve MLC drives may be very cost effective and just as fast as their 20-30% reserve SLC counterparts.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org
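The arithmetic behind trading exposed capacity for reserve is simple enough to sketch. The figures below are illustrative, not vendor specifications; the assumption (stated, not guaranteed) is that unwritten exposed space acts as extra reserve for the controller:

```python
# Hypothetical sketch: how "destroking" (using only part of the exposed
# capacity) raises an SSD's effective over-provisioning reserve.
# All figures are illustrative, not vendor specifications.

def effective_reserve(raw_gb, exposed_gb, used_fraction):
    """Reserve the controller can use, as a fraction of raw flash,
    assuming unwritten exposed space also serves as reserve."""
    used_gb = exposed_gb * used_fraction
    return (raw_gb - used_gb) / raw_gb

raw = 80.0          # raw flash on the device (assumed)
exposed = 74.5      # capacity presented to the host (~7% factory reserve)

for frac in (1.0, 0.9, 0.8):
    r = effective_reserve(raw, exposed, frac)
    print(f"using {frac:.0%} of exposed space -> {r:.1%} effective reserve")
```

Under these made-up numbers, leaving 20% of the exposed area untouched takes a ~7%-reserve drive to roughly the 25% reserve territory of the enterprise parts discussed above.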
Re: [zfs-discuss] Thin device support in ZFS?
On 31 dec 2009, at 22.53, David Magda wrote:
> On Dec 31, 2009, at 13:44, Joerg Schilling wrote:
>> ZFS is COW, but does the SSD know which block is "in use" and which is not?
>> If the SSD did know whether a block is in use, it could erase unused blocks in advance. But what is an "unused block" on a filesystem that supports snapshots?

Snapshots make no difference - when you delete the last dataset/snapshot that references a file, you also delete the data. Snapshots are a way to keep more files around; they are not a way to keep the disk entirely full or anything like that. There is obviously no problem distinguishing between used and unused blocks, and zfs (or btrfs or similar) makes no difference here.

> Personally, I think that at some point in the future there will need to be a command telling SSDs that the file system will take care of handling blocks, as new FS designs will be COW. ZFS is the first "mainstream" one to do it, but Btrfs is there as well, and it looks like Apple will be making its own FS.

That could be an idea, but there still will be holes after deleted files that need to be reclaimed. Do you mean it would be a major win to have the file system take care of the space reclaiming instead of the drive?

> Just as the first 4096-byte block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say "no really, I'm talking about the /actual/ LBA 123456".

A typical flash page size is 512 KB. You probably don't want to use all the physical pages, since those could be worn out or bad, so they need to be remapped (or otherwise avoided) at some level anyway. These days disks typically do the remapping without the host computer knowing (both SSDs and rotating rust). I see the possible win that you could always use all the working blocks on the disk, and when blocks go bad your disk will shrink.

I am not sure that is really what people expect, though. Apart from that, I am not sure what the gain would be. Could you elaborate on why this would be called for?

/ragge
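The copy-and-compact reclamation being debated above can be sketched in a few lines. This is a toy model only; the page count and policy are invented for illustration and do not describe any particular controller:

```python
# Toy model of flash erase-block compaction ("reclaiming the holes").
# An erase block holds many pages; pages can only be written after the
# whole block is erased, so live pages must first be copied elsewhere.
# Sizes and policy are invented for illustration.

PAGES_PER_BLOCK = 128

def compact(block, free_blocks):
    """Copy live pages out of `block` so it can be erased.
    Returns (pages_copied, erased_block)."""
    live = [p for p in block if p is not None]    # still-referenced data
    target = free_blocks.pop()
    target[:len(live)] = live                     # rewrite live data contiguously
    return len(live), [None] * PAGES_PER_BLOCK    # the block after erase

# A mostly-stale block: only 16 of 128 pages still hold live data.
block = [f"data{i}" if i < 16 else None for i in range(PAGES_PER_BLOCK)]
free = [[None] * PAGES_PER_BLOCK]

copied, erased = compact(block, free)
print(f"copied {copied} live pages to reclaim {PAGES_PER_BLOCK - copied} stale pages")
```

The cost of reclamation is proportional to the live data that must be copied, which is why compaction "when the disk is nearly full" is the slow case mentioned elsewhere in this thread, regardless of whether the drive or the filesystem does it.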
[zfs-discuss] (snv_129, snv_130) can't import zfs pool
Hi,

(snv_130)
- created zfs pool storage (a mirror of two whole disks)
- zfs created storage/iscsivol, made some tests, wrote some GBs
- zfs created storage/mynas filesystem (sharesmb dedup=on compression=on)
- FILLED the storage/mynas
- tried to ZFS DESTROY my storage/iscsivol, but the system has HUNG...

This system now tends to boot to maintenance mode due to the boot-archive corruption. The pool can't be imported -f by the recent EON storage (snv_129); it hangs also and doesn't return to the CLI.

Any help is appreciated.

--
This message posted from opensolaris.org
Re: [zfs-discuss] Thin device support in ZFS?
On Jan 1, 2010, at 03:30, Eric D. Mudama wrote:
> On Thu, Dec 31 at 16:53, David Magda wrote:
>> Just as the first 4096-byte block disks are silently emulating 4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes. Perhaps in the future there will be a setting to say "no really, I'm talking about the /actual/ LBA 123456".
>
> What, exactly, is the "/actual/ LBA 123456" on a modern SSD?

It doesn't exist currently because of the behind-the-scenes re-mapping that's being done by the SSD's firmware. While arbitrary to some extent, an "actual" LBA would presumably be the number of a particular cell in the SSD.
Re: [zfs-discuss] Thin device support in ZFS?
On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote:
> I see the possible win that you could always use all the working blocks on the disk, and when blocks go bad your disk will shrink. I am not sure that is really what people expect, though. Apart from that, I am not sure what the gain would be. Could you elaborate on why this would be called for?

Currently you have SSDs that look like disks, but under certain circumstances the OS / FS knows that it isn't rotating rust--in which case the TRIM command is then used by the OS to help the SSD's allocation algorithm(s). If the file system is COW, and knows about SSDs via TRIM, why not just skip the middle-man and tell the SSD "I'll take care of managing blocks"?

In the ZFS case, I think it's a logical extension of how RAID is handled: ZFS's system is much more helpful in most cases than hardware- / firmware-based RAID, so it's generally best just to expose the underlying hardware to ZFS. In the same way, ZFS already does COW, so why bother with the SSD's firmware doing it when giving the extra knowledge to ZFS could be more useful?
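The piece of this that already exists is the bookkeeping: a COW filesystem knows exactly which extents a committed transaction freed, and can hand that list to the device. A minimal, hypothetical sketch of that idea follows; it is not how ZFS was implemented at the time, and the `trim` callback stands in for whatever the real block layer would provide:

```python
# Hypothetical sketch: a COW-style allocator that queues extents freed by
# a transaction and passes them to the device as TRIM ranges once the
# transaction is on stable storage. Names and the `trim` callback are
# illustrative, not a real filesystem API.

class CowAllocator:
    def __init__(self, trim):
        self.trim = trim        # callback: (lba, nblocks) -> None
        self.pending = []       # extents freed by the in-flight transaction

    def free_extent(self, lba, nblocks):
        # Old copies can't be trimmed yet: until the transaction commits,
        # they may still be needed for crash recovery.
        self.pending.append((lba, nblocks))

    def commit_txg(self):
        # After commit the old copies are genuinely dead and safe to TRIM.
        for lba, nblocks in self.pending:
            self.trim(lba, nblocks)
        self.pending.clear()

trimmed = []
alloc = CowAllocator(trim=lambda lba, n: trimmed.append((lba, n)))
alloc.free_extent(1000, 8)      # old copy of a rewritten block
alloc.free_extent(2048, 16)
alloc.commit_txg()
print(trimmed)
```

The deferral until commit is the interesting design point: trimming a block the instant it is overwritten would destroy the old copy that COW semantics still depend on for recovery.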
[zfs-discuss] preview of new SSD based on SandForce controller
Interesting article - rumor has it that this is the same controller that Seagate will use in its upcoming enterprise-level SSDs: http://anandtech.com/storage/showdoc.aspx?i=3702

It reads like SandForce has implemented a bunch of ZFS-like functionality in firmware. Hmm, I wonder if they used any ZFS source code??

Happy new year.

--
Al Hopper
Logical Approach Inc, Plano, TX
a...@logical-approach.com
Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Re: [zfs-discuss] Thin device support in ZFS?
On 1 jan 2010, at 14.14, David Magda wrote:
> On Jan 1, 2010, at 04:33, Ragnar Sundblad wrote:
>> I see the possible win that you could always use all the working blocks on the disk, and when blocks go bad your disk will shrink. I am not sure that is really what people expect, though. Apart from that, I am not sure what the gain would be.
>> Could you elaborate on why this would be called for?
>
> Currently you have SSDs that look like disks, but under certain circumstances the OS / FS knows that it isn't rotating rust--in which case the TRIM command is then used by the OS to help the SSD's allocation algorithm(s).

(Note that TRIM and equivalents are not only useful on SSDs, but on other storage too, such as when using sparse/thin storage.)

> If the file system is COW, and knows about SSDs via TRIM, why not just skip the middle-man and tell the SSD "I'll take care of managing blocks".
>
> In the ZFS case, I think it's a logical extension of how RAID is handled: ZFS's system is much more helpful in most cases than hardware- / firmware-based RAID, so it's generally best just to expose the underlying hardware to ZFS. In the same way ZFS already does COW, so why bother with the SSD's firmware doing it when giving extra knowledge to ZFS could be more useful?

But that would only move the hardware-specific and -dependent flash chip handling code into the file system code, wouldn't it? What is won with that? As long as the flash chips have larger pages than the file system blocks, someone will have to shuffle around blocks to reclaim space, so why not let the one thing that knows the hardware and also is very close to the hardware do it? And if this is good for SSDs, why isn't it as good for rotating rust?

/ragge s
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
On Dec 31, 2009, at 6:14 PM, Richard Elling wrote:
> Some nits: disks aren't marked as semi-bad, but if ZFS has trouble with a block, it will try to not use the block again. So there are two levels of recovery at work: whole device and block.

Ah. I hadn't found that yet.

> The "one more and you're dead" is really N errors in T time.

I'm interpreting this as "OS/S/zfs/drivers will not mark a disk as failed until it returns N errors in T time," which means - check me on this - that the window for a second real-or-fake disk failure is T, where T is the time during which a second soft-failing disk may fail while the system is balled up worrying about the first disk not responding.

This is based on a paper I read online about the increasing need for raidz3 or similar over raidz2 or similar, because throughput from disks has not increased concomitantly with their size, leading to increasing times to recover from first failures using the stored checking data in the array to rebuild. The notice-an-error time plus the rebuild-the-array time is the window in which losing another disk, soft or hard, will lead to the inability to resilver the array.

> For disks which don't return when there is an error, you can reasonably expect that T will be a long time (multiples of 60 seconds) and therefore the N in T threshold will not be triggered.

The scenario I had in mind was two disks ready to fail, either soft (long time to return data) or hard (bang! That sector/block or disk is not coming back, period). The first fails and starts trying to recover in desktop-disk fashion, maybe taking hours. This leaves the system with no error report (i.e. the N-count is zero) and the T-timer ticking. Meanwhile the array is spinning. The second fragile disk is going to hit its own personal pothole at some point soon in this scenario. What happens next is not clear to me.

Is OS/S/zfs going to suspend disk operations until it finally does hear from failing disk 1, based on N still being at 0 because the disk hasn't reported back yet? Or will the array continue with other operations, noting that the operation involving failing disk 1 has not completed, and either stack another request on failing disk 1, or access failing disk 2 and get its error too at some point? Or both?

If the timeout is truly N errors in T time, and N is never reported back because the disk spends some hours retrying, then it looks like this is a zfs hang if not a system hang. If there is a timeout of some kind which takes place even if N never gets over 0, that would at least unhang the file system/system, but it opens you to the second failing-disk fault having occurred, and you're in for another of either hung-forever or failed-array in the case of raidz.

> The term "degraded" does not have a consistent definition across the industry.

Of course not! 8-) Maybe we should use "depraved" 8-)

> See the zpool man page for the definition used for ZFS. In particular, DEGRADED != FAULTED. Issues are logged, for sure. If you want to monitor them proactively, you need to configure SNMP traps for FMA.

Ok, can deal with that.

> It already does this, as long as there are N errors in T time.

OK, I can work that one out. I'm still puzzled on what happens with the "N=0 forever" case. The net result on that one seems to be that you need raid-specific disks to get some kind of timeout to happen at the disk level ever (depending on the disk firmware, which as you note later, is likely to have been written by a junior EE as his first assignment 8-) )

> There is room for improvement here, but I'm not sure how one can set a rule that would explicitly take care of the I/O never returning from a disk while a different I/O to the same disk returns. More research required here...

Yep. I'm thinking that it might be possible to do a policy-based setup section for an array where you could select one of a number of rule-sets for what to do, based on your experience and/or paranoia about the disks in your array. I had good luck with that in a primitive whole-machine hardware diagnosis system I worked with at one point in the dim past. Kind of "if you can't do the right/perfect thing, then ensure that *something* happens."

One of the rules scenarios might be "if one seek to a disk never returns and other actions to that disk do work, then halt the pending action(s) to disk and/or array, increment N, restart that disk or the entire array as needed, and retry that action in a diagnostic loop, which decides whether it's a soft fail, hard block fail, or hard disk fail" and then take the proper action based on the diagnostic. Or it could be "map that disk out and run diagnostics on it while the hot spare is swapped in" based on whether there's a hot spare or not. But yes, some thought is needed. I always tend to pick the side of "let the user/admin pick the way they want to fail" which m
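The "N errors in T time" threshold under discussion is essentially a sliding-window counter, and the "N=0 forever" pathology falls straight out of it: a disk that hangs instead of returning an error never advances the count. A sketch with invented thresholds (this is an illustration of the concept, not the actual FMA diagnosis engine):

```python
# Sketch of an "N errors in T seconds" fault trigger. Thresholds are
# invented for illustration; this is not the real FMA diagnosis engine.
# A disk that hangs instead of returning errors never calls
# record_error(), so its count stays at 0 forever - the TLER pathology
# discussed above.
from collections import deque

class ErrorWindow:
    def __init__(self, n=10, t=600.0):
        self.n, self.t = n, t
        self.events = deque()       # timestamps of reported errors

    def record_error(self, now):
        """Returns True when the device should be diagnosed as faulted."""
        self.events.append(now)
        # Drop errors that fell out of the T-second window.
        while self.events and now - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) >= self.n

w = ErrorWindow(n=3, t=60.0)
print(w.record_error(0.0))     # first error: below threshold
print(w.record_error(10.0))    # second error: still below
print(w.record_error(20.0))    # third error inside 60 s: faulted
```

Errors spread further apart than T never accumulate, which is the intended behavior for transient glitches but also exactly why a retry-forever desktop drive defeats the mechanism.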
Re: [zfs-discuss] Thin device support in ZFS?
On Jan 1, 2010, at 11:04, Ragnar Sundblad wrote:
> But that would only move the hardware specific and dependent flash chip handling code into the file system code, wouldn't it? What is won with that? As long as the flash chips have larger pages than the file system blocks, someone will have to shuffle around blocks to reclaim space, why not let the one thing that knows the hardware and also is very close to the hardware do it? And if this is good for SSDs, why isn't it as good for rotating rust?

Don't really see how things are either hardware specific or dependent. COW is COW. Am I missing something? It's done by code somewhere in the stack; if the FS knows about it, it can lay things out in sequential writes. If we're talking about 512 KB blocks, ZFS in particular would create four 128 KB txgs--and 128 KB is simply the currently #define'd size, which can be changed in the future.

One thing you gain is perhaps not requiring as much of a reserve. At most you have some hidden bad-block re-mapping, similar to rotating rust nowadays. If you're shuffling blocks around, you're doing a read-modify-write, which, if done in the file system, you could use as a mechanism to defrag on-the-fly or to group many small files together.

Not quite sure what you mean by your last question.
Re: [zfs-discuss] Thin device support in ZFS?
On Dec 31, 2009, at 12:59 PM, Ragnar Sundblad wrote:
> Flash SSDs actually always remap new writes into an only-append-to-new-pages style, pretty much as ZFS does itself. So for a SSD there is no big difference between ZFS and filesystems such as UFS, NTFS, HFS+ et al; on the flash level they all work the same. The reason is that there is no way for it to rewrite single disk blocks; it can only fill up already-erased pages of 512K (for example). When the old blocks get mixed with unused blocks (because of block rewrites, TRIM or Write Many/UNMAP), it needs to compact the data by copying all active blocks from those pages into previously erased pages, and there write the active data compacted/contiguous. (When this happens, things tend to get really slow.)

However, the quantity of small, overwritten pages is vastly different. I am not convinced that a workload that generates few overwrites will be penalized as much as a workload that generates a large number of overwrites. I think most folks here will welcome good, empirical studies, but thus far the only one I've found is from STEC, and their disks behave very well after they've been filled and subjected to a rewrite workload. You get what you pay for. Additional pointers are always appreciated :-)
http://www.stec-inc.com/ssd/videos/ssdvideo1.php
-- richard
Re: [zfs-discuss] Thin device support in ZFS?
On Fri, 1 Jan 2010, David Magda wrote:
> It doesn't exist currently because of the behind-the-scenes re-mapping that's being done by the SSD's firmware. While arbitrary to some extent, an "actual" LBA would presumably be the number of a particular cell in the SSD.

There seems to be some severe misunderstanding of what a SSD is. This severe misunderstanding leads one to assume that a SSD has a "native" blocksize. SSDs (as used in computer drives) are comprised of many tens of FLASH memory chips which can be laid out and mapped in whatever fashion the designers choose. They could be mapped sequentially, in parallel, a combination of the two, or perhaps even change behavior depending on use. Individual FLASH devices usually have a much smaller page size than 4K. A 4K write would likely be striped across several/many FLASH devices.

The construction of any given SSD is typically a closely-held trade secret, and the vendor will not reveal how it is designed. You would have to chip away the epoxy yourself and reverse-engineer it in order to gain some understanding of how a given SSD operates, and even then it would be mostly guesswork.

It would be wrong for anyone here, including someone who has participated in the design of an SSD, to claim that they know how a "SSD" will behave unless they have access to the design of that particular SSD.

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
[zfs-discuss] ZFS not working in non-global zone after upgrading to snv_130
Hi,

After upgrading OpenSolaris from snv_111 to snv_130

r...@t61p:/export/home/xtrnaw7# cat /etc/release
OpenSolaris Development snv_130 X86
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 18 December 2009

my zone does not boot anymore. Inside the zone:

-bash-3.2$ svcs -x
svc:/system/filesystem/local:default (local file system mounts)
 State: maintenance since Fri Jan 01 16:18:54 2010
 Reason: Start method exited with $SMF_EXIT_ERR_FATAL.
 See: http://sun.com/msg/SMF-8000-KS
 See: /var/svc/log/system-filesystem-local:default.log
 Impact: 16 dependent services are not running. (Use -v for list.)

bash-3.2$ tail /var/svc/log/system-filesystem-local:default.log
[ Sep 3 19:43:08 Executing start method ("/lib/svc/method/fs-local"). ]
[ Sep 3 19:43:08 Method "start" exited with status 0. ]
[ Oct 17 18:00:54 Enabled. ]
[ Oct 17 18:01:25 Executing start method ("/lib/svc/method/fs-local"). ]
[ Oct 17 18:01:27 Method "start" exited with status 0. ]
[ Jan 1 16:18:45 Enabled. ]
[ Jan 1 16:18:53 Executing start method ("/lib/svc/method/fs-local"). ]
/lib/svc/method/fs-local: line 91: 12888: Abort(coredump)
WARNING: /usr/sbin/zfs mount -a failed: exit status 262

But there is no ZFS filesystem configured for the zone:

r...@t61p:/export/home/xtrnaw7# zonecfg -z develop001 info
zonename: develop001
zonepath: /zones/develop001
brand: ipkg
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
hostid:
fs:
 dir: /tools
 special: /tools
 raw not specified
 type: lofs
 options: [ro]
fs:
 dir: /data/develop
 special: /data/develop
 raw not specified
 type: lofs
 options: [rw]
fs:
 dir: /data/img
 special: /data/img
 raw not specified
 type: lofs
 options: [ro]
fs:
 dir: /opt/SunStudioExpress
 special: /opt/SunStudioExpress
 raw not specified
 type: lofs
 options: [ro]
net:
 address not specified
 physical: vnic0
 defrouter not specified

Looks like a general problem with ZFS in the zone:

bash-3.2$ /usr/sbin/zfs list
internal error: Unknown error
Abort

ZFS in the global zone works without problems.

regards

Bernd

--
Bernd Schemmer, Frankfurt am Main, Germany
http://bnsmb.de/

Sooner rather than later, the world will change. Fidel Castro
Re: [zfs-discuss] preview of new SSD based on SandForce controller
On Fri, 1 Jan 2010, Al Hopper wrote:
> Interesting article - rumor has it that this is the same controller that Seagate will use in its upcoming enterprise level SSDs: http://anandtech.com/storage/showdoc.aspx?i=3702
> It reads like SandForce has implemented a bunch of ZFS like functionality in firmware. Hmm, I wonder if they used any ZFS source code??

The article (and product) seem interesting, but (in usual form) the article is written as a sort of unsubstantiated guesswork propped up by vendor charts and graphs, and with links so the gentle reader can purchase the product on-line. It is good to see that Intel is seeing some competition.

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Thin device support in ZFS?
On Fri, Jan 1, 2010 at 11:17 AM, Bob Friesenhahn wrote:
> On Fri, 1 Jan 2010, David Magda wrote:
>> It doesn't exist currently because of the behind-the-scenes re-mapping that's being done by the SSD's firmware. While arbitrary to some extent, an "actual" LBA would presumably be the number of a particular cell in the SSD.
>
> There seems to be some severe misunderstanding of what a SSD is. This severe misunderstanding leads one to assume that a SSD has a "native" blocksize. SSDs (as used in computer drives) are comprised of many tens of FLASH memory chips which can be laid out and mapped in whatever fashion the designers choose. They could be mapped sequentially, in parallel, a combination of the two, or perhaps even change behavior depending on use. Individual FLASH devices usually have a much smaller page size than 4K. A 4K write would likely be striped across several/many FLASH devices.
>
> The construction of any given SSD is typically a closely-held trade secret and the vendor will not reveal how it is designed. You would have to chip away the epoxy yourself and reverse-engineer in order to gain some understanding of how a given SSD operates and even then it would be mostly guesswork.
>
> It would be wrong for anyone here, including someone who has participated in the design of an SSD, to claim that they know how a "SSD" will behave unless they have access to the design of that particular SSD.

The main issue is that most flash devices support 128k-byte pages, and the smallest "chunk" (for want of a better word) of flash memory that can be written is a page - or 128kb. So if you have a write to an SSD that only changes 1 byte in one 512-byte "disk" sector, the SSD controller has to either read/re-write the affected page or figure out how to update the flash memory with the minimum effect on flash wear. If one didn't have to worry about flash wear levelling, one could read/update/write the affected page all day long. And, to date, flash writes are much slower than flash reads - which is another basic property of the current generation of flash devices.

For anyone who is interested in getting more details of the challenges with flash memory, when used to build solid state drives, reading the tech data sheets on the flash memory devices will give you a feel for the basic issues that must be solved.

Bob's point is well made. The specifics of a given SSD implementation will make the performance characteristics of the resulting SSD very difficult to predict or even describe - especially as the device hardware and firmware continue to evolve. And some SSDs change the algorithms they implement on-the-fly - depending on the characteristics of the current workload and of the (inbound) data being written.

There are some links to well-written articles in the URL I posted earlier this morning: http://www.anandtech.com/storage/showdoc.aspx?i=3702

Regards,

--
Al Hopper
Logical Approach Inc, Plano, TX
a...@logical-approach.com
Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
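The read/modify/write cost described above is easy to put rough numbers on. A back-of-envelope sketch, taking the 128 KB page figure from the post and assuming the naive whole-page update it describes (real controllers are far cleverer, as the thread points out):

```python
# Back-of-envelope write amplification for small host writes landing in a
# 128 KB flash page, assuming a naive read-whole-page / rewrite-whole-page
# update. The page size is taken from the discussion above; real
# controllers use much cleverer strategies.

PAGE = 128 * 1024     # bytes per flash page (per the post above)
SECTOR = 512          # bytes per emulated "disk" sector

def write_amplification(host_bytes, page=PAGE):
    """Bytes physically written per byte the host asked to write."""
    pages_touched = -(-host_bytes // page)     # ceiling division
    return (pages_touched * page) / host_bytes

print(write_amplification(SECTOR))   # one 512-byte sector costs a full page
print(write_amplification(PAGE))     # an aligned full-page write costs nothing extra
```

Under these assumptions a single-sector update is amplified 256x, which is why controllers go to such lengths to coalesce small writes rather than rewrite pages in place.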
[zfs-discuss] trying to buy an Intel MLC SSD
The 80GB Intel MLC SSDs have been hard to find in stock, and prices keep varying. The original list price on the X25-M 80GB MLC drive was $230 - and it was *supposed* to be available for less than that. Demand has been high, and a lot of on-line sellers have taken advantage of the demand to keep prices high. In particular newegg.com, who usually have very keen pricing, has been selling the Intel SSDs at way above list and is not competitive with other online retailers.

A possible workaround is to shop for the 1.8" version of the drive - which is more widely available with better pricing. You'll need a cable with a micro-SATA connector for this drive. You can find a micro-SATA-to-standard-SATA cable assembly here: http://www.satacables.com/micro-sata-cables.html

Bear in mind that the 1.8" (X18-M) drive is 5mm high - whereas the 2.5" (X25-M) drive is 7mm tall - and it comes with a plastic frame to increase the height to 9.5mm for compatibility with some laptop mountings.

Here's a reference for physical form factor info on the Intel drives:
wget http://download.intel.com/design/flash/nand/mainstream/322296.pdf
and here is the spec for the micro SATA connector:
wget ftp://ftp.seagate.com/pub/sff/SFF-8144.PDF

NB: Make sure you get the G2 version of the Intel drive - regardless of the form factor.

No affiliation with Intel etc.

--
Al Hopper
Logical Approach Inc, Plano, TX
a...@logical-approach.com
Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
[zfs-discuss] Supermicro AOC-SAT2-MV8 -- cfgadm won't create attach point (dsk/xxxx)
I have a Supermicro AOC-SAT2-MV8 running on snv_130. I have a 6-disk raidz2 pool that has been running great. Today I added a Western Digital Green 1.5TB WD15EADS so I could create some scratch space. But cfgadm will not assign the drive a dsk/xxx ... I have tried unconfigure/configure and disconnect/connect/configure with cfgadm without luck.

[r...@solaris:~]$ uname -a
SunOS solaris 5.11 snv_130 i86pc i386 i86pc
[r...@solaris:~]$ cfgadm -alv | grep sata
sata0/0::dsk/c0t0d0  connected  configured  ok
 Mod: ST3750330AS FRev: SD1A SN: 3QK0382L
sata0/1  connected  configured  ok
 Mod: WDC WD15EADS-00P8B0 FRev: 01.00A01 SN: WD-WCAVU0382812
sata0/2::dsk/c0t2d0  connected  configured  ok
 Mod: ST3750330AS FRev: SD1A SN: 3QK0382M
sata0/3::dsk/c0t3d0  connected  configured  ok
 Mod: ST3750330AS FRev: SD1A SN: 3QK03DEP
sata0/4::dsk/c0t4d0  connected  configured  ok
 Mod: ST3750330AS FRev: SD1A SN: 3QK038A6
sata0/5::dsk/c0t5d0  connected  configured  ok
 Mod: ST3750330AS FRev: SD1A SN: 3QK0313K
sata0/6::dsk/c0t6d0  connected  configured  ok
 Mod: ST3750330AS FRev: SD1A SN: 3QK037X5
sata0/7  empty  unconfigured  ok
 unavailable  sata-port  n  /devices/p...@0,0/pci8086,2...@1/pci8086,3...@0/pci11ab,1...@1:7

Any tips?
Re: [zfs-discuss] preview of new SSD based on SandForce controller
On Jan 1, 2010, at 11:28 AM, Bob Friesenhahn wrote:
> On Fri, 1 Jan 2010, Al Hopper wrote:
>> Interesting article - rumor has it that this is the same controller that Seagate will use in its upcoming enterprise level SSDs: http://anandtech.com/storage/showdoc.aspx?i=3702
>> It reads like SandForce has implemented a bunch of ZFS like functionality in firmware. Hmm, I wonder if they used any ZFS source code??
>
> The article (and product) seem interesting, but (in usual form) the article is written as a sort of unsubstantiated guess-work propped up by vendor charts and graphs and with links so the gentle reader can purchase the product on-line. It is good to see that Intel is seeing some competition.

Yep, it is good to see that people who are being creative are finding design wins. IMHO, the rate of change in the SSD world right now is about 1000x the rate of change in the HDD world.
-- richard
Re: [zfs-discuss] (snv_129, snv_130) can't import zfs pool
On Jan 1, 2010, at 4:57 AM, LevT wrote:
> Hi
> (snv_130) created zfs pool storage (a mirror of two whole disks) zfs created storage/iscsivol, made some tests, wrote some GBs zfs created storage/mynas filesystem (sharesmb dedup=on compression=on) FILLED the storage/mynas tried to ZFS DESTROY my storage/iscsivol, but the system has HUNG...

dedup is still new, and several people have reported that destroying deduped datasets can take a long time. Plenty of memory or cache devices seems to help, as does having high-IOPS drives in the main pool. Otherwise, you'll have to wait for it to finish.

> this system now tends to boot to maintenance mode due to the boot-archive corruption

This is unrelated to the above problem. More likely it occurred when you gave up and forced a restart. Follow the standard instructions for rebuilding the boot archive.
-- richard

> The pool can't be imported -f by the recent EON storage (snv_129), it hangs also and doesn't return to the CLI
> Any help is appreciated
Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL
On Jan 1, 2010, at 8:11 AM, R.G. Keen wrote: On Dec 31, 2009, at 6:14 PM, Richard Elling wrote: Some nits: disks aren't marked as semi-bad, but if ZFS has trouble with a block, it will try to not use the block again. So there is two levels of recovery at work: whole device and block. Ah. I hadn't found that yet. The "one more and you're dead" is really N errors in T time. I'm interpreting this as "OS/S/zfs/drivers will not mark a disk as failed until it returns N errors in T time," which means - check me on this - that to get a second failed disk, the time to get a second real-or-fake failed disk is T, where T is the time a second soft-failing disk may happen while the system is balled up in worrying about the first disk not responding in T time. Perhaps I am not being clear. If a disk is really dead, then there are several different failure modes that can be responsible. For example, if a disk does not respond to selection, then it is diagnosed as failed very quickly. But that is not the TLER case. The TLER case is when the disk cannot read from media without error, so it will continue to retry... perhaps forever or until reset. If a disk does not complete an I/O operation in (default) 60 seconds (for sd driver), then it will be reset and the I/O operation retried. If a disk returns bogus data (failed ZFS checksum), then the N in T algorithm may kick in. I have seen this failure mode many times. This based on a paper I read on line about the increasing need for raidz3 or similar over raidz2 or similar because throughput from disks has not increased concomitantly with their size; this leading to increasing times to recover from first failures using the stored checking data in the array to to rebuild. The notice-an-error time plus the rebuild-the-array time is the window in which losing another disk, soft or hard, will lead to the inability to resilver the array. 
A similar observation is that the error rate (errors/bit) has not changed, but the number of bits continues to increase. For disks which don't return when there is an error, you can reasonably expect that T will be a long time (multiples of 60 seconds) and therefore the N in T threshold will not be triggered. The scenario I had in mind was two disks ready to fail, either soft (long time to return data) or hard (bang! That sector/block or disk is not coming back, period). The first fails and starts trying to recover in desktop-disk fashion, maybe taking hours. Yes, this is the case for TLER. The only way around this is to use disks that return failures when they occur. This leaves the system with no error report (i.e. the N-count is zero) and the T-timer ticking. Meanwhile the array is spinning. The second fragile disk is going to hit its own personal pothole at some point soon in this scenario. What happens next is not clear to me. Is OS/S/zfs going to suspend disk operations until it finally does hear from the first failing disk, based on N still being at 0 because the disk hasn't reported back yet? Or will the array continue with other operations, noting that the operation involving failing disk 1 has not completed, and either stack another request on failing disk 1, or access failing disk 2 and get its error too at some point? Or both? ZFS issues I/O in parallel. However, that does not prevent an application or ZFS metadata transactions from waiting on a sequence of I/O. If the timeout is truly N errors in T time, and N is never reported back because the disk spends some hours retrying, then it looks like this is a zfs hang if not a system hang. The drivers will retry and fail the I/O. By default, for SATA disks using the sd driver, there are 5 retries of 60 seconds. After 5 minutes, the I/O will be declared failed and that info is passed back up the stack to ZFS, which will start its recovery. 
This is why the T part of N in T doesn't work so well for the TLER case. If there is a timeout of some kind which takes place even if N never gets over 0, that would at least unhang the file system/system, but it opens you to the second failing disk fault having occurred, and you're in for another of either hung-forever or failed-array in the case of raidz. I don't think the second disk scenario adds value to this analysis. The term "degraded" does not have a consistent definition across the industry. Of course not! 8-) Maybe we should use "depraved" 8-) See the zpool man page for the definition used for ZFS. In particular, DEGRADED != FAULTED Issues are logged, for sure. If you want to monitor them proactively, you need to configure SNMP traps for FMA. Ok, can deal with that. It already does this, as long as there are N errors in T time. OK, I can work that one out. I'm still puzzled on what happens with the "N=0 forever" case. The net result on that one seems to be that you need raid specific disks to get some kind of timeout to happen at the disk level ever (depending on the disk firmware, which as y
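The 60-second/5-retry window Richard describes comes from sd driver tunables; shortening them trades faster failure detection for more spurious retries on marginal disks. The values below are purely illustrative, and the tunable names are from OpenSolaris-era sd - check your release before using them:

```shell
# /etc/system fragment -- illustrative values, not recommendations.
# Defaults are 60 s per attempt with 5 retries, i.e. roughly
# 5 minutes before the I/O is failed up the stack to ZFS.
set sd:sd_io_time=10       # seconds allowed per I/O attempt (default 60)
set sd:sd_retry_count=3    # retries before declaring the I/O failed (default 5)
```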
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
Yeah, still no joy. I moved the disks to another machine altogether, with 8 GB and a quad-core Intel versus the dual-core AMD I was using, and it still just hangs the box on import. This time I did a nohup zpool import -fFX vault after booting off the b130 live DVD on this machine into single-user text mode so I'd have minimal processes, and the machine still hangs tighter than a drum. Can't even hit Enter and get a newline this way, probably because the bash process is locked. I've left it for 24 hours like this and will leave it for another day or two to see if it is actually doing anything behind the scenes. I guess my plan B will be to leave these disks in a closet and try again some time in the future; hopefully in some later build the dedup kinks will be worked out enough to deal with my pool, as I'd really rather not lose the data in it.
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
On Jan 1, 2010, at 2:23 PM, tom wagner wrote: > Yeah, still no joy. I moved the disks to another machine altogether [...] I've left it for 24 hours like this and will leave it for another day or two to see if it is actually doing anything behind the scenes. Are the drive lights blinking? If so, then let it do its work. Rebooting won't help because when the pool is imported, the destroy will continue. See other recent threads in this forum on the subject for more insight. http://opensolaris.org/jive/forum.jspa?forumID=80&start=0 -- richard
Re: [zfs-discuss] best way to configure raidz groups
raidz2 is recommended. As discs get large, it can take a long time to repair a raidz: maybe several days. With raidz1, if another disc fails during the repair, you are screwed.
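A back-of-the-envelope check of the "several days" claim, assuming a 2 TB disk and two effective rebuild rates (resilver often runs far below sequential disk speed because ZFS walks block pointers rather than streaming LBAs; the rates here are assumptions):

```shell
# resilver time = capacity / effective rebuild rate (integer arithmetic)
echo "2 TB at 100 MB/s: $(( 2 * 1000 * 1000 / 100 / 3600 )) hours"
echo "2 TB at  10 MB/s: $(( 2 * 1000 * 1000 / 10 / 3600 / 24 )) days"
```

At healthy streaming rates a rebuild finishes in hours; at fragmented-pool rates it stretches into days, which is exactly the exposure window raidz2 is meant to cover.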
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
That's the thing, the drive lights aren't blinking, but I was thinking maybe the writes are going so slowly that it's possible they aren't registering. And since I can't keep a running iostat, I can't tell if anything is going on. I can, however, get into kmdb. Is there something in there that can monitor storage activity or anything? Probably not, but it's worth asking.
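On the kmdb question: kmdb itself has no storage monitor, but if you can get any responsive shell at all, the live-kernel mdb dcmds below (present in the ZFS mdb module of this era; the output format is not a stable interface) show whether zios are still moving. A sketch, not a supported procedure:

```shell
# Requires a working shell; mdb -k inspects the running kernel.
echo "::spa -v" | mdb -k          # pool state and vdev tree
echo "::zio_state" | mdb -k       # outstanding zios and their pipeline stage
echo "::threadlist -v" | mdb -k   # look for threads parked in zio_wait
```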
Re: [zfs-discuss] (snv_129, snv_130) can't import zfs pool
You might want to check out another thread that some of the others and I started on this topic. Some of the guys in that thread got their pool back, but I haven't been able to. I have SSDs for my log and cache and it hasn't helped me, because my system hangs hard on import the way you are describing. Thus far I still haven't been able to regain my pool after switching to a totally different system with more memory, but some of the other guys have.
Re: [zfs-discuss] preview of new SSD based on SandForce controller
Bob Friesenhahn wrote: On Fri, 1 Jan 2010, Al Hopper wrote: Interesting article - rumor has it that this is the same controller that Seagate will use in its upcoming enterprise-level SSDs: http://anandtech.com/storage/showdoc.aspx?i=3702 It reads like SandForce has implemented a bunch of ZFS-like functionality in firmware. Hmm, I wonder if they used any ZFS source code?? The article (and product) seem interesting, but (in usual form) the article is written as a sort of unsubstantiated guesswork propped up by vendor charts and graphs and with links so the gentle reader can purchase the product on-line. It is good to see that Intel is seeing some competition. Bob -- Yeah, there were a bunch more "maybe" and "looks like" and "might be" than I'm really comfortable with in that article. The one thing it does bring up is the old problem of Where Intelligence Belongs. You most typically see this in the CPU/coprocessor cycle, where the fight over whether a task is worth offloading to a separate chip or better left to the main CPU never ends. One of ZFS's founding ideas is that intelligence belongs up in the main system (i.e. running in the OS, on the primary CPU(s)), and that all devices are stupid and unreliable. I'm looking at all the (purported) features in this SandForce controller, and wondering how they'll interact with a "smart" filesystem like ZFS, rather than a traditional "stupid" filesystem a la UFS. I see a lot of overlap, which I'm not sure is a good thing. Maybe it's approaching time for vendors to just produce really stupid SSDs: that is, ones that just do wear-leveling, and expose their true page-size info (e.g. for MLC, how many blocks of X size have to be written at once) and that's about it. Let filesystem makers worry about scheduling writes appropriately, doing redundancy, etc. Oooh! Oooh! a whole cluster of USB thumb drives! Yeah! 
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] preview of new SSD based on SandForce controller
On Fri, 1 Jan 2010, Erik Trimble wrote: > Maybe it's approaching time for vendors to just produce really stupid SSDs: that is, ones that just do wear-leveling, and expose their true page-size info (e.g. for MLC, how many blocks of X size have to be written at once) and that's about it. Let filesystem makers worry about scheduling writes appropriately, doing redundancy, etc. From the benchmarks, it is clear that the drive interface is already often the bottleneck for these new SSDs. That implies that the current development path is in the wrong direction unless we are willing to accept legacy-sized devices implementing a complex legacy protocol. If the devices remain the same physical size with more storage, then we are faced with the same situation we have today with rotating media: huge media density and relatively slow I/O performance. We do need stupider SSDs which fit in a small form factor, offer considerable bandwidth (e.g. 300MB/second) per device, and use a specialized communication protocol which is not defined by legacy disk drives. This allows more I/O to occur in parallel, for much better I/O rates. > Oooh! Oooh! a whole cluster of USB thumb drives! Yeah! That is not far from what we should have (small chassis-oriented modules), but without the crummy USB. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] preview of new SSD based on SandForce controller
On Jan 1, 2010, at 6:33 PM, Bob Friesenhahn wrote: > We do need stupider SSDs which fit in a small form factor, offer considerable bandwidth (e.g. 300MB/second) per device, and use a specialized communication protocol which is not defined by legacy disk drives. This allows more I/O to occur in parallel, for much better I/O rates. You can already see this affecting the design of high-throughput storage. The Sun Storage F5100 Flash Array has 80 SSDs and uses 64 SAS channels for host connection. Some folks think that 6 Gbps SATA/SAS connections are the Next Great Thing^TM, but that only means you need 32 host connections. It is quite amazing to have 1M IOPS and 12.8 GB/s in 1 RU. Perhaps this is the DAS of the future? -- richard
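The connector arithmetic is easy to sanity-check: after 8b/10b encoding, a 3 Gbps SAS lane carries roughly 300 MB/s of payload, so halving the lane count at 6 Gbps keeps the aggregate the same:

```shell
# aggregate payload bandwidth for the two host-connection options
echo "64 x 3 Gbps: $(( 64 * 300 / 1000 )) GB/s"
echo "32 x 6 Gbps: $(( 32 * 600 / 1000 )) GB/s"
```

Either way the fabric has headroom over the 12.8 GB/s the array itself is quoted as sustaining.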
Re: [zfs-discuss] preview of new SSD based on SandForce controller
Bob Friesenhahn wrote: > We do need stupider SSDs which fit in a small form factor, offer considerable bandwidth (e.g. 300MB/second) per device, and use a specialized communication protocol which is not defined by legacy disk drives. [...] That is not far from what we should have (small chassis-oriented modules), but without the crummy USB. I tend to like the 2.5" form factor, for a lot of reasons (economies of scale, and all). And the new SATA III (i.e. 6Gbit/s) interface is really sufficient for reasonable I/O, at least until 12Gbit SAS comes along in a year or so. The 1.8" drive form factor might be useful as Flash densities go up (in order to keep down the GB-to-drive-interface ratio), but physically, that size is a bit of a pain (it's actually too small for reliability reasons, and makes chassis design harder). I'm actually all for adding a second SATA/SAS I/O connector on a 2.5" drive (it's just possible, physically). 
That all said, it certainly would be really nice to get an SSD controller which can really push the bandwidth, and the only way I see this happening now is to go the "stupid" route, and dumb down the controller as much as possible. I really think we just want the controller to Do What I Say, and not try any optimizations or such. There's simply much more benefit to doing the optimization up at the filesystem level than down at the device level. For a trivial case, consider the dreaded "read-modify-write" problem of MLCs: to write a single bit, a whole page has to be read, then the page recomposed with the changed bits, before writing again. If the filesystem was aware that the drive had this kind of issue, then in-RAM caching would almost always allow for the avoidance of the first "read" cycle, and performance goes back to a typical Copy-on-Write style stripe write. I can see why having "dumb" controllers might not appeal to the consumer/desktop market, but certainly, for the Enterprise market, I think it's actually /more/ likely that they start showing up soon. Which would be a neat reversal of sorts: Consumer drives using a complex controller with cheap flash (and a large "spare" capacity area), while Enterprise drives use a simple controller, higher-quality flash chips, and likely a much smaller spare capacity area. Which means, I expect price parity between the two. Whee! -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
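Erik's MLC example is easy to quantify. Taking the 512 KB flash erase block mentioned earlier in the thread and a 4 KB logical write, a controller that blindly rewrites the whole block pays:

```shell
# worst-case write amplification for a small in-place update
erase_block_kb=512
write_kb=4
echo "write amplification: $(( erase_block_kb / write_kb ))x"
```

That factor is exactly what a filesystem-level cache of the untouched pages can avoid, which is the argument for pushing the intelligence up the stack.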