Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?
> It seems they kind of rushed the appliance into the market. We've a few
> 7410s and replication (with zfs send/receive) doesn't work after shares
> reach ~1TB (broken pipe error).

While the 7000 series is a relatively new product, the characterization of "rushed to market" is inaccurate. The product has certainly had bugs, but we've been quick to address them (for example, the issue you described).

> It's frustrating and we can't do anything because every time we type
> "shell" in the CLI, it freaks us out with a message saying the warranty
> will be voided if we continue. I bet that we could work around that bug
> but we're not allowed, and the workarounds provided by Sun haven't
> worked.

I can understand why it might be frustrating to feel shut out of your customary Solaris interfaces, but it's not Solaris: it's an appliance. Arbitrary actions that might seem benign to someone familiar with Solaris can have disastrous consequences -- I'd be happy to give some examples of the amusing ways our customers have taken careful aim and shot themselves in the foot.

> Regarding dedup, Oracle is very courageous for including it in the
> 2010.Q1 release if this comes to be true. But I understand the pressure
> on them. Every other vendor out there is releasing products with
> deduplication. Personally, I would just wait 2-3 releases before using
> it in a black box like the 7000s.

We're including dedup in the 2010.Q1 release, and as always we would not release a product we didn't stand behind. ZFS dedup still has some performance pathologies and surprising results at times; we're working with our customers to ensure that their deployments are successful, and fixing problems as they come up.

> The hardware on the other hand is incredible in terms of resilience and
> performance, no doubt. Which makes me think the pretty interface becomes
> an annoyance sometimes.
> Let's wait for 2010.Q1 :)

As always, we welcome feedback (although zfs-discuss is not the appropriate forum), and we are eager to improve the product.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hey Karsten,

Very interesting data. Your test is inherently single-threaded so I'm not surprised that the benefits aren't more impressive -- the flash modules on the F20 card are optimized more for concurrent IOPS than for single-threaded latency.

Adam

On Mar 30, 2010, at 3:30 AM, Karsten Weiss wrote:

> Hi, I did some tests on a Sun Fire x4540 with an external J4500 array
> (connected via two HBA ports). I.e. there are 96 disks in total,
> configured as seven 12-disk raidz2 vdevs (plus system, spares, unused
> disks) providing a ~63 TB pool with fletcher4 checksums. The system was
> recently equipped with a Sun Flash Accelerator F20 with 4 FMod modules
> to be used as log devices (ZIL). I was using the latest snv_134
> software release.
>
> Here are some first performance numbers for the extraction of an
> uncompressed 50 MB tarball on a Linux (CentOS 5.4 x86_64) NFS client
> which mounted the test filesystem (no compression or dedup) via NFSv3
> (rsize=wsize=32k,sync,tcp,hard).
>
> standard ZIL:         7m40s  (ZFS default)
> 1x SSD ZIL:           4m07s  (Flash Accelerator F20)
> 2x SSD ZIL:           2m42s  (Flash Accelerator F20)
> 2x SSD mirrored ZIL:  3m59s  (Flash Accelerator F20)
> 3x SSD ZIL:           2m47s  (Flash Accelerator F20)
> 4x SSD ZIL:           2m57s  (Flash Accelerator F20)
> disabled ZIL:         0m15s
> (local extraction:    0m0.269s)
>
> I was not so much interested in the absolute numbers but rather in the
> relative performance differences between the standard ZIL, the SSD ZIL
> and the disabled ZIL cases.
>
> Any opinions on the results? I wish the SSD ZIL performance was closer
> to the disabled ZIL case than it is right now.
>
> ATM I tend to use two F20 FMods for the log and the two other FMods as
> L2ARC cache devices (although the system has lots of system memory,
> i.e. the L2ARC is not really necessary). But the speedup of disabling
> the ZIL altogether is appealing (and would probably be acceptable in
> this environment).
--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
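The log-device configurations Karsten compared map to zpool commands along these lines (a sketch; the pool name and `c5tXd0` device names are hypothetical and will differ on your system):

```shell
# Striped slog across two F20 FMods
zpool add tank log c5t0d0 c5t1d0

# Mirrored slog instead, trading some throughput for redundancy
zpool add tank log mirror c5t0d0 c5t1d0

# Use the remaining two FMods as L2ARC cache devices
zpool add tank cache c5t2d0 c5t3d0
```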
Re: [zfs-discuss] raid-z - not even iops distribution
Hey Robert,

How big of a file are you making? RAID-Z does not explicitly do the parity distribution that RAID-5 does. Instead, it relies on non-uniform stripe widths to distribute IOPS.

Adam

On Jun 18, 2010, at 7:26 AM, Robert Milkowski wrote:

> Hi,
>
> zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
>                   raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
>                   raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
>                   raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
>                   [...]
>                   raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0
>
> zfs set atime=off test
> zfs set recordsize=16k test
> (I know...)
>
> Now if I create one large file with filebench and simulate a randomread
> workload with 1 or more threads, then disks on the c2 and c3
> controllers are getting about 80% more reads. This happens both on 111b
> and snv_134. I would rather expect all of them to get about the same
> number of iops.
>
> Any idea why?
>
> --
> Robert Milkowski
> http://milek.blogspot.com

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] raid-z - not even iops distribution
> Does it mean that for datasets used for databases and similar
> environments, where basically all blocks have a fixed size and there is
> no other data, all parity information will end up on one (z1) or two
> (z2) specific disks?

No. There are always smaller writes to metadata that will distribute parity. What is the total width of your raidz1 stripe?

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] raid-z - not even iops distribution
Hey Robert,

I've filed a bug to track this issue. We'll try to reproduce the problem and evaluate the cause. Thanks for bringing this to our attention.

Adam

On Jun 24, 2010, at 2:40 AM, Robert Milkowski wrote:

> On 23/06/2010 18:50, Adam Leventhal wrote:
>>> Does it mean that for dataset used for databases and similar
>>> environments where basically all blocks have fixed size and there is
>>> no other data all parity information will end-up on one (z1) or two
>>> (z2) specific disks?
>>
>> No. There are always smaller writes to metadata that will distribute
>> parity. What is the total width of your raidz1 stripe?
>
> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
>
> --
> Robert Milkowski
> http://milek.blogspot.com

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
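The reported skew is plausible under a simplified model of the RAID-Z mapping. The sketch below is a toy model only loosely inspired by the real allocator: it assumes parity occupies a column chosen by the allocation's starting offset, that allocations are rounded up to a multiple of nparity + 1 sectors, and it ignores metadata, aggregation, and gang blocks. Under those assumptions, a stream of identically sized 16K blocks on a 4-wide raidz1 never rotates its parity column:

```python
import math

SECTOR = 512

def raidz1_alloc(offset_sectors, data_sectors, ndisks):
    """Return (allocated_sectors, parity_disk) for one block.

    Toy model: the parity column lands on the disk implied by the
    allocation's starting offset, and the total allocation is rounded
    up to a multiple of (nparity + 1) = 2 sectors.
    """
    parity_sectors = math.ceil(data_sectors / (ndisks - 1))
    total = data_sectors + parity_sectors
    total += total % 2  # round up to a multiple of 2
    return total, offset_sectors % ndisks

ndisks = 4
data_sectors = 16 * 1024 // SECTOR  # 16K recordsize = 32 sectors

offset = 0
parity_disks = []
for _ in range(1000):
    size, pdisk = raidz1_alloc(offset, data_sectors, ndisks)
    parity_disks.append(pdisk)
    offset += size

# Each 16K block allocates 44 sectors -- a multiple of 4 -- so the
# starting column, and with it the parity column, never rotates.
print(sorted(set(parity_disks)))  # every block's parity on one disk
```

Smaller metadata writes have varying sizes, which shift the starting offsets and break this alignment -- which is why mixed workloads don't show the effect.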
Re: [zfs-discuss] ZFS compression
>> I've read a small amount about compression, enough to find that it'll
>> affect performance (not a problem for me) and that once you enable
>> compression it only affects new files written to the file system.
>
> Yes, that's true. Compression defaults to lzjb, which is fast; but
> gzip-9 can be twice as good. (I've just done some tests on the MacZFS
> port -- see my blog for more info.)

Here's a good blog post comparing some ZFS compression modes in the context of the Sun Storage 7000:

http://blogs.sun.com/dap/entry/zfs_compression

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
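The speed-versus-ratio tradeoff is easy to get a feel for with ordinary zlib compression levels. This is an analogy for the lzjb-versus-gzip-9 choice, not lzjb itself (which isn't exposed in Python); the sample data is made up:

```python
import time
import zlib

# Compressible sample data: repetitive text, like a log file.
data = (b"2010-03-30 12:00:00 nfsd[1234]: client mounted /export/test\n"
        * 20000)

for level in (1, 9):
    start = time.time()
    out = zlib.compress(data, level)
    elapsed = time.time() - start
    print("level %d: ratio %.1fx, %.3fs" %
          (level, len(data) / len(out), elapsed))
```

As with ZFS, the higher level buys a better ratio at the cost of CPU time, and whether that trade is worth it depends entirely on the data and the workload.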
Re: [zfs-discuss] Raidz - what is stored in parity?
> In my case, it gives an error that I need at least 11 disks (which I
> don't have), but the point is that raidz parity does not seem to be
> limited to 3. Is this not true?

RAID-Z is limited to 3 parity disks. The error message is giving you false hope, and that's a bug. If you had plugged in 11 or more disks in the example you provided, you would simply have gotten a different error.

- ahl
Re: [zfs-discuss] Checksums
On Fri, Oct 23, 2009 at 06:55:41PM -0500, Tim Cook wrote:

> So, from what I gather, even though the documentation appears to state
> otherwise, default checksums have been changed to SHA256. Making that
> assumption, I have two questions.

That's false. The default checksum has changed from fletcher2 to fletcher4; that is to say, the definition of the value 'on' has changed.

> First, is the default updated from fletcher2 to SHA256 automatically
> for a pool that was created with an older version of zfs and then
> upgraded to the latest? Second, would all of the blocks be
> re-checksummed with a zfs send/receive on the receiving side?

As with all property changes, new writes get the new properties. Old data is not rewritten.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] Checksums
> Thank you for the correction. My next question is: do you happen to
> know what the overhead difference between fletcher4 and SHA256 is? Is
> the checksumming multi-threaded in nature? I know my fileserver has a
> lot of spare CPU cycles, but it would be good to know if I'm going to
> take a substantial hit in throughput moving from one to the other.

Tim,

That all really depends on your specific system and workload. As with any performance-related matter, experimentation is vital for making your final decision.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
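One way to start that experimentation off-box is a micro-benchmark of the two algorithms. The fletcher4 below is a straightforward transliteration of the four-accumulator scheme over little-endian 32-bit words; it's a sketch for relative comparison only, and the in-kernel implementation is C and far faster in absolute terms:

```python
import hashlib
import struct
import time

def fletcher4(data):
    # Four running sums over little-endian 32-bit words, truncated to
    # 64 bits, in the style of the ZFS fletcher4 checksum.
    a = b = c = d = 0
    for (w,) in struct.iter_unpack("<I", data):
        a = (a + w) & 0xFFFFFFFFFFFFFFFF
        b = (b + a) & 0xFFFFFFFFFFFFFFFF
        c = (c + b) & 0xFFFFFFFFFFFFFFFF
        d = (d + c) & 0xFFFFFFFFFFFFFFFF
    return (a, b, c, d)

block = bytes(range(256)) * 512  # one 128K block

start = time.time()
f = fletcher4(block)
t_fletcher = time.time() - start

start = time.time()
s = hashlib.sha256(block).hexdigest()
t_sha = time.time() - start

print("fletcher4: %.4fs  sha256: %.4fs" % (t_fletcher, t_sha))
```

Ironically, in this harness sha256 will likely win because hashlib runs in C while the fletcher4 loop is pure Python; the point is the measurement approach, not the absolute numbers.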
Re: [zfs-discuss] zfs code and fishworks "fork"
> With that said, I'm concerned that there appears to be a fork between
> the open source version of ZFS and the ZFS that is part of the
> Sun/Oracle FishWorks 7nnn series appliances. I understand (implicitly)
> that Sun (/Oracle), as a commercial concern, is free to choose its own
> priorities in terms of how it uses its own IP (intellectual property)
> -- in this case, the source for the ZFS filesystem.

Hey Al,

I'm unaware of specific plans from management either at Sun or at Oracle, but from an engineering perspective, suffice it to say that it is simpler and therefore more cost effective to develop for a single, unified code base, to amortize the cost of testing those modifications, and to leverage the enthusiastic ZFS community to assist with the development and testing of ZFS. Again, this isn't official policy, just the simple facts on the ground from engineering.

I'm not sure what would lead you to believe that there is a fork between the open source / OpenSolaris ZFS and what we have in Fishworks. Indeed, we've made efforts to ensure there is a single ZFS for the reason stated above. Any differences that exist are quickly migrated to ON, as you can see from the consistent work of Eric Schrock.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] will deduplication know about old blocks?
Hi Kjetil,

Unfortunately, dedup will only apply to data written after the setting is enabled. That also means that new blocks cannot dedup against old blocks regardless of how they were written. There is therefore no way to "prepare" your pool for dedup -- you just have to enable it when you have the new bits.

Adam

On Dec 9, 2009, at 3:40 AM, Kjetil Torgrim Homme wrote:

> I'm planning to try out deduplication in the near future, but started
> wondering if I can prepare for it on my servers. One thing which struck
> me was that I should change the checksum algorithm to sha256 as soon as
> possible. But I wonder -- is that sufficient? Will the dedup code know
> about old blocks when I store new data?
>
> Let's say I have an existing file img0.jpg. I turn on dedup, and copy
> it twice, to img0a.jpg and img0b.jpg. Will all three files refer to the
> same block(s), or will only img0a and img0b share blocks?
>
> --
> Kjetil T. Homme
> Redpill Linpro AS - Changing the game

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
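In Kjetil's example, the sequence would look something like this (a sketch; the pool and path names are made up):

```shell
zfs set dedup=on tank/photos

# img0.jpg was written before dedup was enabled, so its blocks are not
# in the dedup table. These two new copies dedup against each other,
# but not against the original:
cp /tank/photos/img0.jpg /tank/photos/img0a.jpg
cp /tank/photos/img0.jpg /tank/photos/img0b.jpg

# Only rewriting the original after enabling dedup would bring its
# blocks into the dedup table as well.
```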
Re: [zfs-discuss] will deduplication know about old blocks?
> What happens if you snapshot, send, destroy, recreate (with dedup on
> this time around) and then write the contents of the cloned snapshot
> to the various places in the pool -- which properties are in the
> ascendancy here? The "host pool" or the contents of the clone? The
> host pool, I assume, because the clone contents are (in this scenario)
> "just some new data"?

The dedup property applies to all writes, so the settings for the pool of origin don't matter, just those on the destination pool.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] ZFS Dedupe reporting incorrect savings
Hi Giridhar,

The size reported by ls can include things like holes in the file. What space usage does the zfs(1M) command report for the filesystem?

Adam

On Dec 16, 2009, at 10:33 PM, Giridhar K R wrote:

> Hi,
>
> Reposting as I have not gotten any response.
>
> Here is the issue. I created a zpool with 64k recordsize and enabled
> dedup on it.
>
> --> zpool create -O recordsize=64k TestPool device1
> --> zfs set dedup=on TestPool
>
> I copied files onto this pool over NFS from a Windows client.
>
> Here is the output of zpool list:
>
> --> zpool list
> NAME      SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
> TestPool  696G  19.1G  677G   2%  1.13x  ONLINE  -
>
> I ran "ls -l /TestPool" and saw the total size reported as
> 51,193,782,290 bytes. The alloc size reported by zpool along with the
> DEDUP of 1.13x does not add up to 51,193,782,290 bytes.
>
> According to the DEDUP (dedup ratio) the amount of data copied is
> 21.58G (19.1G * 1.13).
>
> Here is the output from zdb -DD:
>
> --> zdb -DD TestPool
> DDT-sha256-zap-duplicate: 33536 entries, size 272 on disk, 140 in core
> DDT-sha256-zap-unique: 278241 entries, size 274 on disk, 142 in core
>
> DDT histogram (aggregated over all DDTs):
>
> bucket           allocated                     referenced
> ------  ---------------------------  ---------------------------
> refcnt  blocks  LSIZE  PSIZE  DSIZE  blocks  LSIZE  PSIZE  DSIZE
> ------  ------  -----  -----  -----  ------  -----  -----  -----
>      1    272K  17.0G  17.0G  17.0G    272K  17.0G  17.0G  17.0G
>      2   32.7K  2.05G  2.05G  2.05G   65.6K  4.10G  4.10G  4.10G
>      4      15   960K   960K   960K      71  4.44M  4.44M  4.44M
>      8       4   256K   256K   256K      53  3.31M  3.31M  3.31M
>     16       1    64K    64K    64K      16     1M     1M     1M
>    512       1    64K    64K    64K     854  53.4M  53.4M  53.4M
>     1K       1    64K    64K    64K   1.08K  69.1M  69.1M  69.1M
>     4K       1    64K    64K    64K   5.33K   341M   341M   341M
>  Total    304K  19.0G  19.0G  19.0G    345K  21.5G  21.5G  21.5G
>
> dedup = 1.13, compress = 1.00, copies = 1.00,
> dedup * compress / copies = 1.13
>
> Am I missing something?
>
> Your inputs are much appreciated.
> Thanks,
> Giri

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] ZFS Dedupe reporting incorrect savings
> Thanks for the response, Adam.
>
> Are you talking about zfs list?
>
> It displays 19.6G as allocated space.
>
> What does ZFS treat as a hole, and how does it identify one?

ZFS will compress blocks of zeros down to nothing and treat them like sparse files. 19.6G is pretty close to your computed value. Does your pool happen to be a 10+1 RAID-Z?

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
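The arithmetic in this thread can be checked directly from the zdb totals quoted above:

```python
# Totals from the "zdb -DD" histogram: allocated vs. referenced.
allocated_gb = 19.0   # unique data actually stored
referenced_gb = 21.5  # logical data before dedup

ratio = referenced_gb / allocated_gb
print("dedup ratio: %.2fx" % ratio)  # matches zpool's reported 1.13x

# Logical bytes implied by "zpool list": ALLOC * DEDUP
print("logical data: %.1fG" % (19.1 * 1.13))  # ~21.6G

# The ~47.7G (51,193,782,290 bytes) that ls reported is far larger;
# the gap is holes and zero-filled regions that ZFS never stores.
print("ls total: %.1fG" % (51_193_782_290 / 2**30))
```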
Re: [zfs-discuss] raidz data loss stories?
Hey James,

> Personally, I think mirroring (and 3-way mirroring) is safer than
> raidz/z2/5. All my "boot from zfs" systems have 3-way mirrored
> root/usr/var disks (using 9 disks) but all my data partitions are
> 2-way mirrors (usually 8 disks or more and a spare).

Double-parity (or triple-parity) RAID is certainly more resilient than 2-way mirroring against some failure modes. For example, bit errors arise at a certain rate from disks. In the case of a disk failure in a mirror, it's possible to encounter a bit error on the surviving side such that data is lost.

I recently wrote an article for ACM Queue that examines recent trends in hard drives and makes the case for triple-parity RAID. It's at least peripherally relevant to this conversation:

http://blogs.sun.com/ahl/entry/acm_triple_parity_raid

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
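The bit-error exposure during a mirror resilver can be estimated in a couple of lines. The figures below are illustrative assumptions (a 10^-14 per-bit unrecoverable read error rate is a commonly quoted spec for consumer drives; 1 TB surviving side), not numbers from the article:

```python
import math

ure_per_bit = 1e-14   # assumed unrecoverable read error rate
bits_read = 1e12 * 8  # resilver reads the full 1 TB surviving side

# Probability of at least one unrecoverable error during the resilver,
# modeling errors as independent per-bit events.
p_loss = 1 - math.exp(-ure_per_bit * bits_read)
print("P(bit error during resilver) = %.1f%%" % (p_loss * 100))
```

Under these assumptions the resilver of a 2-way mirror has a several-percent chance of hitting an unreadable sector with no remaining redundancy to repair it, which is exactly the exposure additional parity removes.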
Re: [zfs-discuss] raidz data loss stories?
>> Applying classic RAID terms to zfs is just plain wrong and misleading
>> since zfs does not directly implement these classic RAID approaches
>> even though it re-uses some of the algorithms for data recovery.
>> Details do matter.
>
> That's not entirely true, is it?
> * RAIDZ is RAID5 + checksum + COW
> * RAIDZ2 is RAID6 + checksum + COW
> * A stack of mirror vdevs is RAID10 + checksum + COW

Others have noted that RAID-Z isn't really the same as RAID-5, and RAID-Z2 isn't the same as RAID-6, because RAID-5 and RAID-6 define not just the number of parity disks (which would have made far more sense in my mind) but also a notion of how the data and parity are laid out. The RAID levels were used to describe groupings of existing implementations, and they conflate things like the number of parity devices with, say, how parity is distributed across devices. For example, RAID-Z1 lays out data most like RAID-3 -- a single block is carved up and spread across many disks -- but distributes parity as required for RAID-5, albeit in a different manner. It's an unfortunate state of affairs, which is why further RAID levels should identify only the most salient aspect (the number of parity devices), or we should use unambiguous terms like single-parity and double-parity RAID.

> If we can compare apples and oranges, would your recommendation ("use
> raidz2 and/or raidz3") be the same when comparing to mirrors with the
> same number of drives? In other words, does a 2-drive mirror compare
> to raidz1 the same way a 3-drive mirror compares to raidz2 and a
> 4-drive mirror compares to raidz3? If you were enterprise (in other
> words, cared about perf), why would you ever use raidz instead of
> throwing more drives at the problem and doing mirroring with identical
> parity?

You're right that a mirror is a degenerate form of raidz1, for example, but mirrors allow for specific optimizations.
While the redundancy would be the same, the performance would not.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] raidz stripe size (not stripe width)
Hi Brad,

RAID-Z will carve up the 8K block into chunks at the granularity of the sector size -- today 512 bytes, but soon going to 4K. In this case a 9-disk RAID-Z vdev will look like this:

| P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |
| P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |

That's 1K per device, with an additional 1K for parity.

Adam

On Jan 4, 2010, at 3:17 PM, Brad wrote:

> If an 8K file system block is written on a 9-disk raidz vdev, how is
> the data distributed (written) between all devices in the vdev, since
> a zfs write is one continuous IO operation?
>
> Is it distributed evenly (1.125KB) per device?
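The layout above can be computed mechanically (a sketch of the arithmetic only, with an assumed 512-byte sector, and assuming the data sectors divide evenly across the data disks as they do in this example):

```python
import math

SECTOR = 512

def raidz1_chunks(block_bytes, ndisks):
    """Split a block into per-device byte counts for one raidz1 stripe."""
    data_sectors = block_bytes // SECTOR           # 8K -> 16 sectors
    rows = math.ceil(data_sectors / (ndisks - 1))  # 16 / 8 -> 2 rows
    parity_bytes = rows * SECTOR                   # one parity sector/row
    per_data_disk = (data_sectors // (ndisks - 1)) * SECTOR
    return per_data_disk, parity_bytes

per_disk, parity = raidz1_chunks(8 * 1024, 9)
print("data per device: %dK, parity: %dK"
      % (per_disk // 1024, parity // 1024))
```

Note the per-device chunk is 1K, not the 1.125K you'd get by dividing 9K evenly across 9 devices: parity gets its own column rather than being blended into the data.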
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
Hey Chris,

> The DDRdrive X1 OpenSolaris device driver is now complete; please join
> us in our first-ever ZFS Intent Log (ZIL) beta test program. A select
> number of X1s are available for loan. Preferred candidates would have
> a validation background and/or a true passion for torturing new
> hardware/drivers :-)
>
> We are singularly focused on the ZIL device market, so a test
> environment bound by synchronous writes is required. The beta program
> will provide extensive technical support and a unique opportunity to
> have direct interaction with the product designers.

Congratulations! This is great news for ZFS. I'll be very interested to see the results members of the community can get with your device as part of their pool. COMSTAR iSCSI performance in particular should be dramatically improved.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] Hybrid storage ... thing
> I saw this on /. and thought I'd point it out to this list. It appears
> to act as an L2 cache for a single drive, in theory providing better
> performance.
>
> http://www.silverstonetek.com/products/p_contents.php?pno=HDDBOOST&area

It's a neat device, but the notion of a hybrid drive is nothing new. As with any block-based caching, this device has no notion of the semantic meaning of a given block, so there's only so much intelligence it can bring to bear on the problem.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] Proposed idea for enhancement - damage control
Hey Bob,

> My own conclusions (supported by Adam Leventhal's excellent paper) are
> that
>
> - maximum device size should be constrained based on its time to
>   resilver.
>
> - devices are growing too large and it is about time to transition to
>   the next smaller physical size.

I don't disagree with those conclusions necessarily, but the HDD vendors have significant momentum built up in their efforts to improve density -- that's not going to change in the next 5 years. If the industry did transition to a smaller physical size while improving density, that would imply many more end-points to deal with, bigger switches, etc. All reasonable, but there are some significant implications.

> It is unreasonable to spend more than 24 hours to resilver a single
> drive.

Why?

> It is unreasonable to spend very much time at all on resilvering
> (using current rotating media) since the resilvering process kills
> performance.

Maybe, but then it depends on how much you rely on your disks for performance.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
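For a sense of scale on the 24-hour claim: the best case for a resilver is a pure sequential pass over the whole drive, while a fragmented pool rebuilt block-by-block approaches random I/O. A back-of-the-envelope calculation with assumed figures (2 TB drive, 100 MB/s sustained streaming, ~200 random IOPS of 128K each):

```python
capacity = 2e12  # 2 TB drive (assumed)

# Best case: sequential rebuild at full streaming bandwidth.
seq_hours = capacity / 100e6 / 3600
print("sequential: %.1f hours" % seq_hours)

# Worst case: block-by-block rebuild at random-I/O rates.
iops, blk = 200, 128 * 1024  # assumed
rand_hours = capacity / (iops * blk) / 3600
print("random:     %.1f hours" % rand_hours)
```

Even the sequential best case runs to several hours, and the random case lands near the 24-hour bound Bob proposes -- before accounting for competing foreground I/O.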
Re: [zfs-discuss] ZFS dedup for VAX COFF data type
> Hi. Any idea why zfs does not dedup files with this format?
>
> file /opt/XXX/XXX/data
> VAX COFF executable - version 7926

With dedup enabled, ZFS will identify and remove duplicates regardless of the data format. Dedup operates at the block level, so the file type reported by file(1) is irrelevant.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] Ideal Setup: RAID-5, Areca, etc!
>> But is there a performance boost with mirroring the drives? That is
>> what I'm unsure of.
>
> Mirroring will provide a boost on reads, since the system can read
> from both sides of the mirror. It will not provide an increase on
> writes, since the system needs to wait for both halves of the mirror
> to finish. It could be slightly slower than a single raid5.

That's not strictly correct. Mirroring will, in fact, deliver better IOPS for both reads and writes. For reads, as Brandon stated, mirroring will deliver better performance because it can distribute the reads between both devices. RAID-Z with an N+1 wide stripe, however, divides each block into N chunks plus parity, so every read and write must touch the whole stripe. This reduces the total IOPS by a factor of N+1 for both reads and writes, whereas mirroring reduces the IOPS by a factor of 2 for writes and not at all for reads.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
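A toy model makes the comparison concrete. The assumptions are mine: identical disks at 200 random IOPS each, a small-block random workload, and no caching or write aggregation:

```python
DISK_IOPS = 200  # assumed per-spindle random IOPS

def mirror_iops(ndisks):
    # Reads are spread across all copies; a write must hit every copy.
    return {"read": ndisks * DISK_IOPS, "write": DISK_IOPS}

def raidz_iops(ndisks):
    # Every block spans the whole stripe, so each logical I/O occupies
    # every spindle: the vdev performs like a single disk.
    return {"read": DISK_IOPS, "write": DISK_IOPS}

print("2-way mirror:", mirror_iops(2))
print("3+1 raidz:   ", raidz_iops(4))
```

Bandwidth is a different story: for large streaming I/O, RAID-Z can bring all of its spindles to bear on a single block, which is why the IOPS comparison matters most for random workloads.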
Re: [zfs-discuss] Which is better for root ZFS: mlc or slc SSD?
For a root device it doesn't matter that much. You're not going to be writing to the device at a high data rate, so write/erase cycles don't factor in much (SLC can sustain about a factor of 10 more). With MLC you'll get 2-4x the capacity for the same price, but again, that doesn't matter much for a root device. Performance is typically a bit better with SLC -- especially on the write side -- but it's not such a huge difference.

The reason you'd use a flash SSD for a boot device is power (with maybe a dash of performance), and either SLC or MLC will do just fine.

Adam

On Sep 24, 2008, at 11:41 AM, Erik Trimble wrote:

> I was under the impression that MLC is the preferred type of SSD, but
> I want to prevent myself from having a think-o.
>
> I'm looking to get (2) SSDs to use as my boot drive. It looks like I
> can get 32GB SSDs composed of either SLC or MLC for roughly equal
> pricing. Which would be the better technology? (I'll worry about rated
> access times/etc. of the drives; I'm just wondering about the general
> tech for an OS boot drive usage...)
>
> --
> Erik Trimble
> Java System Support
> Mailstop: usca22-123
> Phone: x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] An slog experiment (my NAS can beat up your NAS)
> So what are the downsides to this? If both nodes were to crash and I
> used the same technique to recreate the ramdisk, I would lose any
> transactions in the slog at the time of the crash, but the physical
> disk image is still in a consistent state, right (just not from my
> app's point of view)?

You would lose transactions, but the pool would still reflect a consistent state.

> So is this idea completely crazy?

On the contrary; it's very clever.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
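For reference, the experiment under discussion looks roughly like this on OpenSolaris (a sketch; the pool name and sizes are made up, and a log device that evaporates on a crash is exactly the risk being weighed above):

```shell
# Create a 1 GB ramdisk and use it as a separate intent log
ramdiskadm -a slogdisk 1g
zpool add tank log /dev/ramdisk/slogdisk

# After a crash, recreate the ramdisk before bringing the pool back;
# any transactions that were only in the slog are gone.
ramdiskadm -a slogdisk 1g
zpool import tank
```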
Re: [zfs-discuss] Sun Storage 7000
On Nov 10, 2008, at 10:55 AM, Tim wrote:

> Just got an email about this today. Fishworks finally unveiled?

Yup, that's us! On behalf of the Fishworks team, I'd like to extend a big thank you to the ZFS team and the ZFS community here, who have contributed such a huge building block to our new line of storage appliances.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] OpenStorage GUI
On Nov 11, 2008, at 9:38 AM, Bryan Cantrill wrote:

> Just to throw some ice-cold water on this:
>
> 1. It's highly unlikely that we will ever support the x4500 -- only
>    the x4540 is a real possibility.

And to warm things up a bit: there's already an upgrade path from the x4500 to the x4540, so that would be required before any upgrade to the equivalent of the Sun Storage 7210.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] OpenStorage GUI
On Nov 11, 2008, at 10:41 AM, Brent Jones wrote:

> Wish I could get my hands on a beta of this GUI...

Take a look at the VMware version that you can run on any machine:

http://www.sun.com/storage/disk_systems/unified_storage/resources.jsp

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] OpenStorage GUI
> Is this software available for people who already have thumpers?

We're considering offering an upgrade path for people with existing thumpers. Given the feedback we've been hearing, it seems very likely that we will. No word yet on pricing or availability.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] continuous replication
On Fri, Nov 14, 2008 at 10:48:25PM +0100, Mattias Pantzare wrote:

> That is _not_ active-active, that is active-passive.
>
> If you have an active-active system, I can access the same data via
> both controllers at the same time. I can't if it works like you just
> described. You can't call it active-active just because different
> volumes are controlled by different controllers. Most active-passive
> RAID controllers can do that.
>
> The data sheet talks about active-active clusters; how does that work?

What the Sun Storage 7000 Series does would more accurately be described as dual active-passive.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] Storage 7000
On Mon, Nov 17, 2008 at 12:35:38PM -0600, Tim wrote:

> I'm not sure if this is the right place for the question or not, but
> I'll throw it out there anyways. Does anyone know: if you create your
> pool(s) with a system running fishworks, can that pool later be
> imported by a standard Solaris system? I.e., if for some reason the
> head running fishworks were to go away, could I attach the JBOD/disks
> to a system running snv/mainline Solaris/whatever, and import the pool
> to get at the data? Or is the zfs underneath fishworks proprietary as
> well?

Yes. The Sun Storage 7000 Series uses the same ZFS that's in OpenSolaris today. A pool created on the appliance could potentially be imported on an OpenSolaris system; that is, of course, not explicitly supported in the service contract.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] Storage 7000
> Would be interesting to hear more about how Fishworks differs from
> OpenSolaris, what build it is based on, what package mechanism you are
> using (IPS already?), and other differences...

I'm sure these details will be examined in the coming weeks on the blogs of members of the Fishworks team. Keep an eye on blogs.sun.com/fishworks.

> A little off topic: do you know when the SSDs used in the Storage 7000
> will be available for the rest of us?

I don't think they will be, but it will be possible to purchase them as replacement parts.

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl
Re: [zfs-discuss] Storage 7000
On Tue, Nov 18, 2008 at 09:09:07AM -0800, Andre Lue wrote: > Is the web interface on the appliance available for download or will it make > it to opensolaris sometime in the near future? It's not, and it's unlikely to make it to OpenSolaris. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Comparison between the S-TEC Zeus and the Intel X25-E ??
The Intel part does about a fourth as many synchronous write IOPS at best. Adam On Jan 16, 2009, at 5:34 PM, Erik Trimble wrote: > I'm looking at the newly-orderable (via Sun) STEC Zeus SSDs, and they're > outrageously priced. > > http://www.stec-inc.com/product/zeusssd.php > > I just looked at the Intel X25-E series, and they look comparable in > performance. At about 20% of the cost. > > http://www.intel.com/design/flash/nand/extreme/index.htm > > Can anyone enlighten me as to any possible difference between an STEC > Zeus and an Intel X25-E? I mean, other than those associated with the > fact that you can't get the Intel one orderable through Sun right now. > > -- > Erik Trimble > Java System Support > Mailstop: usca22-123 > Phone: x17195 > Santa Clara, CA > Timezone: US/Pacific (GMT-0800) -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
> Right, which is an absolutely piss poor design decision and why > every major storage vendor right-sizes drives. What happens if I > have an old maxtor drive in my pool whose "500g" is just slightly > larger than every other mfg on the market? You know, the one who is > no longer making their own drives since being purchased by seagate. > I can't replace the drive anymore? *GREAT*. Sun does "right size" our drives. Are we talking about replacing a device bought from Sun with another device bought from Sun? If these are just drives that fell off the back of some truck, you may not have that assurance. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disks in each RAIDZ group
> "The recommended number of disks per group is between 3 and 9. If you have > more disks, use multiple groups." > > Odd that the Sun Unified Storage 7000 products do not allow you to control > this, it appears to put all the hdd's into one group. At least on the 7110 > we are evaluating there is no control to allow multiple groups/different > raid types. Our experience has shown that that initial guess of 3-9 per parity device was surprisingly narrow. We see similar performance out to much wider stripes which, of course, offer the user more usable capacity. We don't allow you to manually set the RAID stripe widths on the 7000 series boxes because frankly the stripe width is an implementation detail. If you want the best performance, choose mirroring; capacity, double-parity RAID; for something in the middle, we offer 3+1 single-parity RAID. Other than that you're micro-optimizing for gains that would hardly be measurable given the architecture of the Hybrid Storage Pool. Recall that unlike other products in the same space, we get our IOPS from flash rather than from a bazillion spindles spinning at 15,000 RPM. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
> Since it's done in software by HDS, NetApp, and EMC, that's complete > bullshit. Forcing people to spend 3x the money for a "Sun" drive that's > identical to the seagate OEM version is also bullshit and a piss-poor > answer. I didn't know that HDS, NetApp, and EMC all allow users to replace their drives with stuff they've bought at Fry's. Is this still covered by their service plan or would this only be in an unsupported config? Thanks. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
> > > Since it's done in software by HDS, NetApp, and EMC, that's complete > > > bullshit. Forcing people to spend 3x the money for a "Sun" drive that's > > > identical to the seagate OEM version is also bullshit and a piss-poor > > > answer. > > > > I didn't know that HDS, NetApp, and EMC all allow users to replace their > > drives with stuff they've bought at Fry's. Is this still covered by their > > service plan or would this only be in an unsupported config? > > So because an enterprise vendor requires you to use their drives in their > array, suddenly zfs can't right-size? Vendor requirements have absolutely > nothing to do with their right-sizing, and everything to do with them > wanting your money. Sorry, I must have missed your point. I thought that you were saying that HDS, NetApp, and EMC had a different model. Were you merely saying that the software in those vendors' products operates differently than ZFS? > Are you telling me zfs is deficient to the point it can't handle basic > right-sizing like a 15$ sata raid adapter? How do these $15 SATA RAID adapters solve the problem? The more details you can provide the better, obviously. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disks in each RAIDZ group
> BWAHAHAHAHA. That's a good one. "You don't need to setup your raid, that's > micro-managing, we'll do that." > > Remember that one time when I talked about limiting snapshots to protect a > user from themselves, and you joined into the fray of people calling me a > troll? I don't remember this, but I don't doubt it. > Can you feel the irony oozing out between your lips, or are you > completely oblivious to it? The irony would be that on one hand I object to artificial limitations to business-critical features while on the other hand I think that users don't need to tweak settings that add complexity and little to no value? They seem very different to me, so I suppose the answer to your question is: no I cannot feel the irony oozing out between my lips, and yes I'm oblivious to the same. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
On Mon, Jan 19, 2009 at 01:35:22PM -0600, Tim wrote: > > > Are you telling me zfs is deficient to the point it can't handle basic > > > right-sizing like a 15$ sata raid adapter? > > > > How do there $15 sata raid adapters solve the problem? The more details you > > could provide the better obviously. > > They short stroke the disk so that when you buy a new 500GB drive that isn't > the exact same number of blocks you aren't screwed. It's a design choice to > be both sane, and to make the end-users life easier. You know, sort of like > you not letting people choose their raid layout... Drive vendors, it would seem, have an incentive to make their "500GB" drives as small as possible. Should ZFS then choose some amount of padding at the end of each device and chop it off as insurance against a slightly smaller drive? How much of the device should it chop off? Conversely, should users have the option to use the full extent of the drives they've paid for, say, if they're using a vendor that already provides that guarantee? > You know, sort of like you not letting people choose their raid layout... Yes, I'm not saying it shouldn't be done. I'm asking what the right answer might be. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
> And again, I say take a look at the market today, figure out a percentage, > and call it done. I don't think you'll find a lot of users crying foul over > losing 1% of their drive space when they don't already cry foul over the > false advertising that is drive sizes today. Perhaps it's quaint, but 5GB still seems like a lot to me to throw away. > In any case, you might as well can ZFS entirely because it's not really fair > that users are losing disk space to raid and metadata... see where this > argument is going? Well, I see where this _specious_ argument is going. > I have two disks in one of my systems... both maxtor 500GB drives, purchased > at the same time shortly after the buyout. One is a rebadged Seagate, one > is a true, made in China Maxtor. Different block numbers... same model > drive, purchased at the same time. > > Wasn't zfs supposed to be about using software to make up for deficiencies > in hardware? It would seem this request is exactly that... That's a fair point, and I do encourage you to file an RFE, but a) Sun has already solved this problem in a different way as a company with our products and b) users already have the ability to right-size drives. Perhaps a better solution would be to handle the procedure of replacing a disk with a slightly smaller one by migrating data and then treating the extant disks as slightly smaller as well. This would have the advantage of being far more dynamic and of only applying the space tax in situations where it actually applies. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
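One way to picture the "figure out a percentage and call it done" proposal from this thread: reserve a fixed margin at the end of every device so a slightly smaller replacement still fits. A minimal sketch; the function name and drive sizes are hypothetical, not from ZFS.

```python
def right_size(raw_bytes: int, margin_pct: float = 1.0) -> int:
    """Usable bytes after chopping off a fixed percentage as insurance
    against a replacement drive that is slightly smaller."""
    return int(raw_bytes * (1 - margin_pct / 100.0))

# Two hypothetical "500GB" drives from different vendors:
drive_a = 500_107_862_016
drive_b = 500_000_000_000

# With a 1% margin the usable size of either drive fits on the other,
# so replacement works in both directions.
assert right_size(drive_a) <= drive_b
assert right_size(drive_b) <= drive_a
```

The open question in the thread remains how large the margin should be, and whether users who buy guaranteed-size drives should be able to opt out.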
Re: [zfs-discuss] SSD drives in Sun Fire X4540 or X4500 for dedicated ZIL device
This is correct, and you can read about it here: http://blogs.sun.com/ahl/entry/fishworks_launch Adam On Fri, Jan 23, 2009 at 05:03:57PM +, Ross Smith wrote: > That's my understanding too. One (STEC?) drive as a write cache, > basically a write optimised SSD. And cheaper, larger, read optimised > SSD's for the read cache. > > I thought it was an odd strategy until I read into SSD's a little more > and realised you really do have to think about your usage cases with > these. SSD's are very definitely not all alike. > > > On Fri, Jan 23, 2009 at 4:33 PM, Greg Mason wrote: > > If i'm not mistaken (and somebody please correct me if i'm wrong), the Sun > > 7000 series storage appliances (the Fishworks boxes) use enterprise SSDs, > > with dram caching. One such product is made by STEC. > > > > My understanding is that the Sun appliances use one SSD for the ZIL, and one > > as a read cache. For the 7210 (which is basically a Sun Fire X4540), that > > gives you 46 disks and 2 SSDs. > > > > -Greg > > > > > > Bob Friesenhahn wrote: > >> > >> On Thu, 22 Jan 2009, Ross wrote: > >> > >>> However, now I've written that, Sun use SATA (SAS?) SSD's in their high > >>> end fishworks storage, so I guess it definately works for some use cases. > >> > >> But the "fishworks" (Fishworks is a development team, not a product) write > >> cache device is not based on FLASH. It is based on DRAM. The difference > >> is > >> like night and day. Apparently there can also be a read cache which is > >> based > >> on FLASH. 
> >> > >> Bob > >> == > >> Bob Friesenhahn > >> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > >> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ > >> > >> ___ > >> zfs-discuss mailing list > >> zfs-discuss@opensolaris.org > >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > >> > >> > > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD - slow down with age
On Feb 14, 2009, at 12:45 PM, Nicholas Lee wrote: > A useful article about long term use of the Intel SSD X25-M: > http://www.pcper.com/article.php?aid=669 - Long-term performance analysis > of Intel Mainstream SSDs. Would a zfs cache (ZIL or ARC) based on a SSD > device see this kind of issue? Maybe a periodic scrub via a full disk erase > would be a useful process. Indeed SSDs can have certain properties that would cause their performance to degrade over time. We've seen this to varying degrees with different devices we've tested in our lab. We're working on adapting our use of SSDs with ZFS as a ZIL device, an L2ARC device, and eventually as primary storage. We'll first focus on the specific SSDs we certify for use in our general purpose servers and the Sun Storage 7000 series, and help influence the industry to move to standards that we can then use. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SAS 15K drives as L2ARC
>> After all this discussion, I am not sure if anyone adequately answered the >> original poster's question as to whether at 2540 with SAS 15K drives would >> provide substantial synchronous write throughput improvement when used as >> a L2ARC device. > > I was under the impression that the L2ARC was to speed up reads, as it > allows things to be cached on something faster than disks (usually MLC > SSDs). Offloading the ZIL is what handles synchronous writes, isn't it? > > How would adding an L2ARC speed up writes? You're absolutely right. The L2ARC is for accelerating reads only and will not affect write performance. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 7110 questions
On Thu, Jun 18, 2009 at 11:51:44AM -0400, Dan Pritts wrote: > I'm curious about a couple things that would be "unsupported." > > Specifically, whether they are "not supported" if they have specifically > been crippled in the software. We have not crippled the software in any way, but we have designed an appliance with some specific uses. Doing things from the Solaris shell by hand may damage your system and void your support contract. > 1) SSD's > > I can imagine buying an intel SSD, slotting it into the 7110, and using > it as a ZFS L2ARC (? i mean the equivalent of "readzilla") That's not supported, it won't work easily, and if you get it working you'll be out of luck if you have a problem. > 2) expandability > > I can imagine buying a SAS card and a JBOD and hooking it up to > the 7110; it has plenty of PCI slots. Ditto. > finally, one question - I presume that I need to devote a pair of disks > to the OS, so I really only get 14 disks for data. Correct? That's right. We market the 7110 as either 2TB = 146GB x 14 or 4.2TB = 300GB x 14 raw capacity. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 7110 questions
Hey Lawrence, Make sure you're running the latest software update. Note that this forum is not the appropriate place to discuss support issues. Please contact your official Sun support channel. Adam On Thu, Jun 18, 2009 at 12:06:02PM -0700, lawrence ho wrote: > We have a 7110 on try and buy program. > > We tried using the 7110 with XEN Server 5 over iSCSI and NFS. Nothing seems > to solve the slow write problem. Within the VM, we observed around 8MB/s on > writes. Read performance is fantastic. Some troubleshooting was done with > local SUN rep. The conclusion is that 7110 does not have write cache in forms > of SSD or controller DRAM write cache. The solution from SUN is to buy > StorageTek or 7000 series model with SSD write cache. > > Adam, please advise if there any fixes for 7110. I am still shopping for SAN > and would rather buy a 7110 than a StorageTek or something else. > -- > This message posted from opensolaris.org -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] triple-parity: RAID-Z3
Hey Bob, > MTTDL analysis shows that given normal environmental conditions, the MTTDL > of RAID-Z2 is already much longer than the life of the computer or the > attendant human. Of course sometimes one encounters unusual conditions > where additional redundancy is desired. To what analysis are you referring? Today the absolute fastest you can resilver a 1TB drive is about 4 hours. Real-world speeds might be half that. In 2010 we'll have 3TB drives meaning it may take a full day to resilver. The odds of hitting a latent bit error are already reasonably high, especially with a large pool that's infrequently scrubbed. What then are the odds of a second drive failing in the 24 hours it takes to resilver? > I do think that it is worthwhile to be able to add another parity disk to > an existing raidz vdev but I don't know how much work that entails. It entails a bunch of work: http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z Matt Ahrens is working on a key component after which it should all be possible. > Zfs development seems to be overwhelmed with marketing-driven requirements > lately and it is time to get back to brass tacks and make sure that the > parts already developed are truly enterprise-grade. While I don't disagree that the focus for ZFS should be ensuring enterprise-class reliability and performance, let me assure you that requirements are driven by the market and not by marketing. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
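The "odds of a second drive failing during a 24-hour resilver" question above can be approximated with a simple independent-failure model. This is a back-of-the-envelope sketch with assumed inputs (10 surviving drives, a 3% annualized failure rate), not a real MTTDL analysis; the function name is invented.

```python
import math

def p_any_failure(n_drives: int, afr: float, hours: float) -> float:
    """Probability that at least one of n_drives fails within the window,
    modeling each drive as an independent exponential failure process
    whose annualized failure probability is `afr`."""
    rate = -math.log(1.0 - afr) / (365 * 24)   # per-hour hazard rate
    return 1.0 - math.exp(-n_drives * rate * hours)

# Assumed: 10 surviving drives, 3% AFR, 24-hour resilver window.
p = p_any_failure(10, 0.03, 24)   # roughly 0.1% per resilver event
```

Small per-event, but it compounds over many resilvers and many pools, and a latent bit error during the window has the same effect as a whole-drive failure for that stripe.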
Re: [zfs-discuss] triple-parity: RAID-Z3
> > which gap? 'RAID-Z should mind the gap on writes'? > I believe this is in reference to the raid 5 write hole, described here: > http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance It's not. > So I'm not sure what the 'RAID-Z should mind the gap on writes' comment is > getting at either. Clarification? I'm planning to write a blog post describing this, but the basic problem is that RAID-Z, by virtue of supporting variable stripe writes (the insight that allows us to avoid the RAID-5 write hole), must round the number of sectors up to a multiple of nparity+1. This means that we may have sectors that are effectively skipped. ZFS generally lays down data in large contiguous streams, but these skipped sectors can stymie both ZFS's write aggregation as well as the hard drive's ability to group I/Os and write them quickly. Jeff Bonwick added some code to mind these gaps on reads. The key insight there is that if we're going to read 64K, say, with a 512 byte hole in the middle, we might as well do one big read rather than two smaller reads and just throw out the data that we don't care about. Of course, doing this for writes is a bit trickier since we can't just blithely write over gaps as those might contain live data on the disk. To solve this we push the knowledge of those skipped sectors down to the I/O aggregation layer in the form of 'optional' I/Os purely for the purpose of coalescing writes into larger chunks. I hope that's clear; if it's not, stay tuned for the aforementioned blog post. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
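The rounding rule described above can be sketched numerically. The helper below is illustrative, not the actual ZFS code: it follows the arithmetic from this thread, where a block's data sectors plus parity sectors are rounded up to a multiple of nparity + 1, possibly leaving skipped sectors.

```python
def raidz_alloc(size_bytes, ndisks, nparity, sector=512):
    """Sectors a RAID-Z write occupies: data sectors, parity sectors
    (one parity sector per row of ndisks - nparity data sectors), and
    the skipped sectors needed to round the total up to a multiple of
    nparity + 1."""
    data = -(-size_bytes // sector)              # ceil(size / sector)
    parity = -(-data // (ndisks - nparity))      # ceil(data / (N - nparity))
    skip = (-(data + parity)) % (nparity + 1)
    return data, parity, skip

# A 1536-byte write on a 3-wide raidz1: 3 data + 2 parity = 5 sectors,
# so one skipped sector rounds the allocation up to 6.
print(raidz_alloc(1536, 3, 1))  # (3, 2, 1)
```

It's these skip sectors, appearing between otherwise contiguous writes, that the 'optional' I/Os paper over so aggregation can still produce one large write.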
Re: [zfs-discuss] triple-parity: RAID-Z3
> Don't hear about triple-parity RAID that often: > > Author: Adam Leventhal > Repository: /hg/onnv/onnv-gate > Latest revision: 17811c723fb4f9fce50616cb740a92c8f6f97651 > Total changesets: 1 > Log message: 6854612 triple-parity RAID-Z > > http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html > http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6854612 > > (Via Blog O' Matty.) > > Would be curious to see performance characteristics. I just blogged about triple-parity RAID-Z (raidz3): http://blogs.sun.com/ahl/entry/triple_parity_raid_z As for performance, on the system I was using (a max config Sun Storage 7410), I saw about a 25% improvement to 1GB/s for a streaming write workload. YMMV, but I'd be interested in hearing your results. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] triple-parity: RAID-Z3
> > Don't hear about triple-parity RAID that often: > > I agree completely. In fact, I have wondered (probably in these forums), > why we don't bite the bullet and make a generic raidzN, where N is any > number >=0. I agree, but raidzN isn't simple to implement and it's potentially difficult to get it to perform well. That said, it's something I intend to bring to ZFS in the next year or so. > If memory serves, the second parity is calculated using Reed-Solomon which > implies that any number of parity devices is possible. True; it's a degenerate case. > In fact, get rid of mirroring, because it clearly is a variant of raidz > with two devices. Want three way mirroring? Call that raidz2 with three > devices. The truth is that a generic raidzN would roll up everything: > striping, mirroring, parity raid, double parity, etc. into a single format > with one parameter. That's an interesting thought, but there are some advantages to calling out mirroring for example as its own vdev type. As has been pointed out, reading from either side of the mirror involves no computation whereas reading from a RAID-Z 1+2 for example would involve more computation. This would complicate the calculus of balancing read operations over the mirror devices. > Let's not stop there, though. Once we have any number of parity devices, > why can't I add a parity device to an array? That should be simple enough > with a scrub to set the parity. In fact, what is to stop me from removing > a parity device? Once again, I think the code would make this rather easy. With RAID-Z, stripes can be of variable width meaning that, say, a single row in a 4+2 configuration might have two stripes of 1+2. In other words, there might not be enough space in the new parity device. I did write up the steps that would be needed to support RAID-Z expansion; you can find it here: http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z > Ok, back to the real world. The one downside to triple parity is that I > recall the code discovered the corrupt block by excluding it from the > stripe, reconstructing the stripe and comparing that with the checksum. In > other words, for a given cost of X to compute a stripe and a number P of > corrupt blocks, the cost of reading a stripe is approximately X^P. More > corrupt blocks would radically slow down the system. With raidz2, the > maximum number of corrupt blocks would be two, putting a cap on how costly > the read can be. Computing the additional parity of triple-parity RAID-Z is slightly more expensive, but not much -- it's just bitwise operations. Recovering from a read failure is identical (and performs identically) to raidz1 or raidz2 until you actually have sustained three failures. In that case, performance is slower as more computation is involved -- but aren't you just happy to get your data back? If there is silent data corruption, then and only then can you encounter the O(n^3) algorithm that you alluded to, but only as a last resort. If we don't know what drives failed, we try to reconstruct your data by assuming that one drive, then two drives, then three drives are returning bad data. For raidz1, this was a linear operation; raidz2, quadratic; now raidz3 is N-cubed. There's really no way around it. Fortunately with proper scrubbing, encountering data corruption in one stripe on three different drives is highly unlikely. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
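The linear/quadratic/cubed recovery cost described above comes from trying every combination of drives that might be silently returning bad data. A small sketch (the function name is invented) that counts those candidate combinations:

```python
from itertools import combinations

def candidate_bad_sets(ndisks, nparity):
    """Every set of up to nparity drives that combinatorial
    reconstruction may have to assume is returning bad data,
    tried smallest-first."""
    for k in range(1, nparity + 1):
        yield from combinations(range(ndisks), k)

# A 10-wide raidz3 tries at most C(10,1) + C(10,2) + C(10,3)
# = 10 + 45 + 120 = 175 combinations; raidz1 would try only 10.
print(sum(1 for _ in candidate_bad_sets(10, 3)))  # 175
```

The cost is only paid on silent corruption as a last resort, which is why regular scrubbing keeps the worst case from mattering in practice.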
Re: [zfs-discuss] triple-parity: RAID-Z3
Robert, On Fri, Jul 24, 2009 at 12:59:01AM +0100, Robert Milkowski wrote: >> To what analysis are you referring? Today the absolute fastest you can >> resilver a 1TB drive is about 4 hours. Real-world speeds might be half >> that. In 2010 we'll have 3TB drives meaning it may take a full day to >> resilver. The odds of hitting a latent bit error are already reasonably >> high especially with a large pool that's infrequently scrubbed. What then >> are the odds of a second drive failing in the 24 hours it takes to >> resilver? > > I wish it was so good with raid-zN. > In real life, at least from my experience, it can take several days to > resilver a disk for vdevs in raid-z2 made of 11x sata disk drives with real > data. > While the way zfs synchronizes data is way faster under some circumstances > it is also much slower under others. > IIRC some builds ago there were some fixes integrated so maybe it is > different now. Absolutely. I was talking more or less about optimal timing. I realize that due to the priorities within ZFS and real-world loads it can take far longer. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD (SLC) for cache...
> > My question is about SSD, and the differences between using SLC for > > readzillas instead of MLC. > Sun uses MLCs for Readzillas for their 7000 series. I would think that if > SLCs (which are generally more expensive) were really needed, they would > be used. That's not entirely accurate. In the 7410 and 7310 today (the members of the Sun Storage 7000 series that support Readzilla) we use SLC SSDs. We're exploring the use of MLC. > Perhaps someone on the Fishworks team could give more details, but going > by what I've read and seen, MLCs should be sufficient for the L2ARC. Save > your money. That's our assessment, but it's highly dependent on the specific characteristics of the MLC NAND itself, the SSD controller, and, of course, the workload. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Hey Gary, There appears to be a bug in the RAID-Z code that can generate spurious checksum errors. I'm looking into it now and hope to have it fixed in build 123 or 124. Apologies for the inconvenience. Adam On Aug 25, 2009, at 5:29 AM, Gary Gendel wrote: > I have a 5-500GB disk Raid-Z pool that has been producing checksum errors > right after upgrading SXCE to build 121. They seem to be randomly occurring > on all 5 disks, so it doesn't look like a disk failure situation. > Repeatedly running a scrub on the pools randomly repairs between 20 and a > few hundred checksum errors. Since I hadn't physically touched the machine, > it seems a very strong coincidence that it started right after I upgraded > to 121. This machine is a SunFire v20z with a Marvell SATA 8-port controller > (the same one as in the original thumper). I've seen this kind of problem > way back around build 40-50 ish, but haven't seen it after that until now. > Anyone else experiencing this problem or knows how to isolate the problem > definitively? Thanks, Gary > -- > This message posted from opensolaris.org -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] change raidz1 to raidz2 with BP rewrite?
> Will BP rewrite allow adding a drive to raidz1 to get raidz2? And how is > status on BP rewrite? Far away? Not started yet? Planning? BP rewrite is an important component technology, but there's a bunch beyond that. It's not a high priority right now for us at Sun. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] change raidz1 to raidz2 with BP rewrite?
Hi David, > > BP rewrite is an important component technology, but there's a bunch > > beyond that. It's not a high priority right now for us at Sun. > What's the bug / RFE number for it? (So those of us with contracts can add > a request for it.) I don't have the number handy, but while it might be satisfying to add another request for it, Matt is already cranking on it as fast as he can and more requests for it are likely to have the opposite of the intended effect. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Hi James, After investigating this problem a bit I'd suggest avoiding deploying RAID-Z until this issue is resolved. I anticipate having it fixed in build 124. Apologies for the inconvenience. Adam On Aug 28, 2009, at 8:20 PM, James Lever wrote: > On 28/08/2009, at 3:23 AM, Adam Leventhal wrote: >> There appears to be a bug in the RAID-Z code that can generate spurious >> checksum errors. I'm looking into it now and hope to have it fixed in >> build 123 or 124. Apologies for the inconvenience. > Are the errors being generated likely to cause any significant problem > running 121 with a RAID-Z volume or should users of RAID-Z* wait until this > issue is resolved? > cheers, James -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
Hey Bob, > I have seen few people more prone to unsubstantiated conjecture than you. > The raidz checksum code was recently reworked to add raidz3. It seems > likely that a subtle bug was added at that time. That appears to be the case. I'm investigating the problem and hope to have an update to the list either later today or tomorrow. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 7110: Would it self upgrade the system zpool?
Hi Trevor, We intentionally install the system pool with an old ZFS version and don't provide the ability to upgrade. We don't need or use (or even expose) any of the features of the newer versions so using a newer version would only create problems rolling back to earlier releases. Adam On Sep 2, 2009, at 7:01 PM, Trevor Pretty wrote: Just Curious The 7110 I've on loan has an old zpool. I *assume* because it's been upgraded and it gives me the ability to downgrade. Anybody know if I delete the old version of Amber Road whether the pool would then upgrade (I don't want to do it as I want to show the up/downgrade feature). OS pool:- pool: system state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. And yes I may have invalidated my support. If you have a 7000 box don't ask me how to access the system like this, you can see the warning. Remember I've a loan box and are just being nosey, a sort of looking under the bonnet and going "OOOHHH" an engine, but being too scared to even pull the dip stick :-) "You are entering the operating system shell. By confirming this action in the appliance shell you have agreed that THIS ACTION MAY VOID ANY SUPPORT AGREEMENT. If you do not agree to this -- or do not otherwise understand what you are doing -- you should type "exit" at the shell prompt. EVERY COMMAND THAT YOU EXECUTE HERE IS AUDITED, and support personnel may use this audit trail to substantiate invalidating your support contract. The operating system shell is NOT a supported mechanism for managing this appliance, and COMMANDS EXECUTED HERE MAY DO IRREPARABLE HARM. NOTHING SHOULD BE ATTEMPTED HERE BY UNTRAINED SUPPORT PERSONNEL UNDER ANY CIRCUMSTANCES. This appliance is a non-traditional operating system environment, and expertise in a traditional operating system environment in NO WAY constitutes training for supporting this appliance. THOSE WITH EXPERTISE IN OTHER SYSTEMS -- HOWEVER SUPERFICIALLY SIMILAR -- ARE MORE LIKELY TO MISTAKENLY EXECUTE OPERATIONS HERE THAT WILL DO IRREPARABLE HARM. Unless you have been explicitly trained on supporting this appliance via the operating system shell, you should immediately return to the appliance shell. Type "exit" now to return to the appliance shell." Trevor www.eagle.co.nz This email is confidential and may be legally privileged. If received in error please destroy and immediately notify us. -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
  disks
    0   1   2
   ___ ___ ___
  |   |   |   |        P = parity
  | P | D | D |  LBAs  D = data
  |___|___|___|   |    X = skipped sector
  |   |   |   |   |
  | X | P | D |   v
  |___|___|___|
  |   |   |   |
  | D | X |   |
  |___|___|___|

The logic for the optional IOs would, in this case, effectively (though not literally) fill in the next LBA on the disk with a 0:

   ___ ___ ___
  |   |   |   |        P = parity
  | P | D | D |  LBAs  D = data
  |___|___|___|   |    X = skipped sector
  |   |   |   |   |    0 = zero-data from aggregation
  | 0 | P | D |   v
  |___|___|___|
  |   |   |   |
  | D | X |   |
  |___|___|___|

We can see the problem when the parity undergoes the swap described above:

  disks
    0   1   2
   ___ ___ ___
  |   |   |   |        P = parity
  | D | P | D |  LBAs  D = data
  |___|___|___|   |    X = skipped sector
  |   |   |   |   |    0 = zero-data from aggregation
  | X | 0 | P |   v
  |___|___|___|
  |   |   |   |
  | D | X |   |
  |___|___|___|

Note that the 0 is incorrectly swapped as well, inadvertently overwriting a data sector in the subsequent stripe.

This only occurs if there is IO aggregation, which makes it much more likely with small, synchronous IOs. It's also only possible with an odd number (N) of child vdevs since, to induce the problem, the size of the data written must consume a multiple of N-1 sectors _and_ the total number of sectors used for data and parity must be odd (to create the need for a skipped sector). The number of data sectors is simply size / 512 and the number of parity sectors is ceil(size / 512 / (N-1)).

  1) size / 512 = K * (N-1)
  2) size / 512 + ceil(size / 512 / (N-1)) is odd

  therefore  K * (N-1) + K = K * N  is odd

If N is even, K * N cannot be odd and therefore the situation cannot arise. If N is odd, it is possible to satisfy (1) and (2).

--
Adam Leventhal, Fishworks  http://blogs.sun.com/ahl

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
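The two conditions at the end of that analysis can be checked mechanically. Below is a small sketch (the function name and structure are mine, not from the ZFS source) that flags which raidz1 write sizes can hit the skipped-sector swap for a given child-vdev count:

```python
import math

SECTOR = 512  # bytes per sector, as assumed in the analysis above

def can_trigger(n_disks: int, size: int) -> bool:
    """True if a write of `size` bytes to an N-way raidz1 vdev satisfies
    both conditions from the analysis:
      1) the data fills a whole number of (N-1)-sector rows, and
      2) data + parity sectors is odd, forcing a skipped sector."""
    if size % SECTOR or size == 0:
        return False
    data = size // SECTOR
    parity = math.ceil(data / (n_disks - 1))
    fills_rows = data % (n_disks - 1) == 0   # condition (1)
    needs_skip = (data + parity) % 2 == 1    # condition (2)
    return fills_rows and needs_skip

# With an even number of children, K*(N-1) + K = K*N is always even,
# so condition (2) can never hold when condition (1) does.
```

For example, a 1 KB write to a 3-wide raidz1 qualifies (2 data + 1 parity sectors = 3, odd), while no aligned write to a 4-wide vdev ever can.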
Re: [zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
Hey Simon, > Thanks for the info on this. Some people, including myself, reported seeing > checksum errors within mirrors too. Is it considered that these checksum > errors within mirrors could also be related to this bug, or is there another > bug related to checksum errors within mirrors that I should take a look at? Absolutely not. That is an unrelated issue. This problem is isolated to RAID-Z. > And good luck with the fix for build 124. Are talking days or weeks for the > fix to be available, do you think? :) -- Days or hours. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAIDZ versus mirrored
On Thu, Sep 17, 2009 at 01:32:43PM +0200, Eugen Leitl wrote:

> > reasons), you will lose 2 disks worth of storage to parity leaving 12
> > disks worth of data. With raid10 you will lose half, 7 disks to
> > parity/redundancy. With two raidz2 sets, you will get (5+2)+(5+2), that
> > is 5+5 disks worth of storage and 2+2 disks worth of redundancy. The
> > actual redundancy/parity is spread over all disks, not like raid3 which
> > has a dedicated parity disk.
>
> So raidz3 has a dedicated parity disk? I couldn't see that from
> skimming http://blogs.sun.com/ahl/entry/triple_parity_raid_z

Note that Tomas was talking about RAID-3, not raidz3. To summarize the RAID levels:

  RAID-0  striping
  RAID-1  mirror
  RAID-2  ECC (basically not used)
  RAID-3  bit-interleaved parity (basically not used)
  RAID-4  block-interleaved parity
  RAID-5  block-interleaved distributed parity
  RAID-6  block-interleaved double distributed parity

raidz1 is most like RAID-5; raidz2 is most like RAID-6. There's no standard RAID level that covers more than two parity disks; raidz3 is most like RAID-6, but with triple distributed parity.

Adam

--
Adam Leventhal, Fishworks  http://blogs.sun.com/ahl

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can you turn on zfs compression when the fs is already populated?
For what it's worth, there is a plan to allow data to be scrubbed so that you can enable compression for extant data. No ETA, but it's on the roadmap. In fact, I was recently reminded that I filed a bug on this in 2004: 5029294 there should be a way to compress an extant file system Adam On Wed, Jan 24, 2007 at 06:50:22PM +0100, [EMAIL PROTECTED] wrote: > > >I have an 800GB raidz2 zfs filesystem. It already has approx 142Gb of data. > >Can I simply turn on compression at this point, or do you need to start > >with compression > >at the creation time? If I turn on compression now, what happens to the > >existing data? > > Yes. Nothing. > > Casper > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Adding my own compression to zfs
On Mon, Jan 29, 2007 at 02:39:13PM -0800, roland wrote: > > # zfs get compressratio > > NAME PROPERTY VALUE SOURCE > > pool/gzip compressratio 3.27x - > > pool/lzjb compressratio 1.89x - > > this looks MUCH better than i would have ever expected for smaller files. > > any real-world data how good or bad compressratio goes with lots of very > small but good compressible files , for example some (evil for those solaris > evangelists) untarred linux-source tree ? > > i'm rather excited how effective gzip will compress here. > > for comparison: > > sun1:/comptest # bzcat /tmp/linux-2.6.19.2.tar.bz2 |tar xvf - > --snipp-- > > sun1:/comptest # du -s -k * > 143895 linux-2.6.19.2 > 1 pax_global_header > > sun1:/comptest # du -s -k --apparent-size * > 224282 linux-2.6.19.2 > 1 pax_global_header > > sun1:/comptest # zfs get compressratio comptest > NAME PROPERTY VALUE SOURCE > comptest tank compressratio 1.79x - Don't start sending me your favorite files to compress (it really should work about the same as gzip), but here's the result for the above (I found a tar file that's about 235M uncompressed): # du -ks linux-2.6.19.2/ 80087 linux-2.6.19.2 # zfs get compressratio pool/gzip NAME PROPERTY VALUE SOURCE pool/gzip compressratio 3.40x - Doing a gzip with the default compression level (6 -- the same setting I'm using in ZFS) yields a file that's about 52M. The small files are hurting a bit here, but it's still pretty good -- and considerably better than LZJB. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
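The level-6 default mentioned above is the same knob zlib itself exposes. A quick sketch of the size-versus-level trade-off on repetitive, source-like input (illustrative only -- this is userland zlib, not the ZFS code path, and the sample data is made up):

```python
import zlib

# Repetitive, source-code-like input compresses well at any level.
data = b"static int zfs_compress(void *src, void *dst, size_t len);\n" * 2000

# Compare the fast, default, and best-compression levels, as gzip(1) would.
sizes = {level: len(zlib.compress(data, level)) for level in (1, 6, 9)}

# Compression ratio at the default level, analogous to 'compressratio'.
ratio = len(data) / sizes[6]
```

On input like this, higher levels typically produce output no larger than lower ones, and even level 1 shrinks the data dramatically; the interesting cost difference is CPU time, not space.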
Re: [zfs-discuss] Re: Need help making lsof work with ZFS
On Wed, Feb 14, 2007 at 01:56:33PM -0700, Matthew Ahrens wrote: > These files are not shipped with Solaris 10. You can find them in > opensolaris: usr/src/uts/common/fs/zfs/sys/ > > The interfaces in these files are not supported, and may change without > notice at any time. Even if they're not supported, shouldn't the header files be shipped so people can make sense of kernel data structure types? Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs received vol not appearing on iscsi target list
On Sat, Feb 24, 2007 at 09:29:48PM +1300, Nicholas Lee wrote: > I'm not really a Solaris expert, but I would have expected vol4 to appear on > the iscsi target list automatically. Is there a way to refresh the target > list? Or is this a bug. Hi Nicholas, This is a bug either in ZFS or in the iSCSI target. Please file a bug. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS overhead killed my ZVOL
On Tue, Mar 20, 2007 at 06:01:28PM -0400, Brian H. Nelson wrote: > Why does this happen? Is it a bug? I know there is a recommendation of > 20% free space for good performance, but that thought never occurred to > me when this machine was set up (zvols only, no zfs proper). It sounds like this bug: 6430003 record size needs to affect zvol reservation size on RAID-Z Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS overhead killed my ZVOL
On Wed, Mar 21, 2007 at 01:23:06AM +0100, Robert Milkowski wrote: > Adam, while you are here, what about gzip compression in ZFS? > I mean are you going to integrate changes soon? I submitted the RTI today. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS overhead killed my ZVOL
On Wed, Mar 21, 2007 at 01:36:10AM +0100, Robert Milkowski wrote: > btw: I assume that compression level will be hard coded after all, > right? Nope. You'll be able to choose from gzip-N with N ranging from 1 to 9 just like gzip(1). Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] gzip compression support
I recently integrated this fix into ON:

  6536606 gzip compression for ZFS

With this, ZFS now supports gzip compression. To enable gzip compression just set the 'compression' property to 'gzip' (or 'gzip-N' where N=1..9). Existing pools will need to upgrade in order to use this feature, and, yes, this is the second ZFS version number update this week. Recall that once you've upgraded a pool, older software will no longer be able to access it regardless of whether you're using the gzip compression algorithm.

I did some very simple tests to look at relative size and time requirements:

  http://blogs.sun.com/ahl/entry/gzip_for_zfs_update

I've also asked Roch Bourbonnais and Richard Elling to do some more extensive tests.

Adam

From zfs(1M):

     compression=on | off | lzjb | gzip | gzip-N

         Controls the compression algorithm used for this dataset. The
         "lzjb" compression algorithm is optimized for performance while
         providing decent data compression. Setting compression to "on"
         uses the "lzjb" compression algorithm. The "gzip" compression
         algorithm uses the same compression as the gzip(1) command. You
         can specify the gzip level by using the value "gzip-N", where N
         is an integer from 1 (fastest) to 9 (best compression ratio).
         Currently, "gzip" is equivalent to "gzip-6" (which is also the
         default for gzip(1)).

         This property can also be referred to by its shortened column
         name "compress".

--
Adam Leventhal, Solaris Kernel Development  http://blogs.sun.com/ahl

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] gzip compression support
On Fri, Mar 23, 2007 at 11:41:21AM -0700, Rich Teer wrote: > > I recently integrated this fix into ON: > > > > 6536606 gzip compression for ZFS > > Cool! Can you recall into which build it went? I put it back yesterday so it will be in build 62. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: ZFS layout for 10 disk?
I'd take your 10 data disks and make a single raidz2 stripe. You can sustain two disk failures before losing data, and presumably you'd replace the failed disks before that was likely to happen. If you're very concerned about failures, I'd have a single 9-wide raidz2 stripe with a hot spare. Adam On Fri, Mar 23, 2007 at 01:44:06PM -0700, John-Paul Drawneek wrote: > Just to clarify > > pool1 -> 5 disk raidz2 > pool2 -> 4 disk raid 10 > > spare for both pools > > Is that correct? > > > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
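The capacity trade-off between these suggestions is simple arithmetic. A sketch (helper name is mine) comparing the layouts discussed in this thread, assuming equal-sized disks:

```python
def usable(disks: int, parity: int, spares: int = 0) -> int:
    """Disks left holding data after parity and hot spares (equal-size disks)."""
    return disks - parity - spares

TOTAL = 10
layouts = {
    "single raidz2 (10-wide)":       usable(TOTAL, parity=2),
    "9-wide raidz2 + hot spare":     usable(TOTAL, parity=2, spares=1),
    # The split proposed in the quoted mail: raidz2 plus a 4-disk mirror pair.
    "5-disk raidz2 + 4-disk mirror": usable(5, parity=2) + 4 // 2,
}
```

The single wide raidz2 keeps 8 of 10 disks for data; adding the hot spare costs one more, and the raidz2/mirror split costs the most.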
Re: [zfs-discuss] ZFS over iSCSI question
On Fri, Mar 23, 2007 at 11:28:19AM -0700, Frank Cusack wrote: > >I'm in a way still hoping that it's a iSCSI related Problem as detecting > >dead hosts in a network can be a non trivial problem and it takes quite > >some time for TCP to timeout and inform the upper layers. Just a > >guess/hope here that FC-AL, ... do better in this case > > iscsi doesn't use TCP, does it? Anyway, the problem is really transport > independent. It does use TCP. Were you thinking UDP? Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Convert raidz
On Mon, Apr 02, 2007 at 12:37:24AM -0700, homerun wrote:

> Is it possible to convert live 3 disks zpool from raidz to raidz2
> And is it possible to add 1 new disk to raidz configuration without
> backups and recreating zpool from scratch.

The reason that's not possible is that RAID-Z uses a variable stripe width. This solves some problems (notably the RAID-5 write hole [1]), but it means that a given 'stripe' over N disks in a raidz1 configuration may contain as many as floor(N/2) parity blocks -- clearly a single additional disk wouldn't be sufficient to grow the stripe properly.

It would be possible to have a different type of RAID-Z where stripes were variable-width to avoid the RAID-5 write hole, but the remainder of the stripe was left unused. This would allow users to add an additional parity disk (or several, if we ever implement further redundancy) to an existing configuration, BUT would potentially make much less efficient use of storage.

Adam

[1] http://blogs.sun.com/bonwick/entry/raid_z

--
Adam Leventhal, Solaris Kernel Development  http://blogs.sun.com/ahl

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
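The variable stripe width comes from per-row parity accounting: raidz1 adds one parity sector for every row of up to N-1 data sectors. A sketch of the fraction of a stripe consumed by parity (my own illustration, not ZFS code; padding and allocation rounding ignored):

```python
import math

def parity_fraction(data_sectors: int, n_disks: int) -> float:
    """Fraction of a raidz1 stripe that is parity: one parity sector
    per row of up to (n_disks - 1) data sectors."""
    parity = math.ceil(data_sectors / (n_disks - 1))
    return parity / (parity + data_sectors)

# On 5 disks, a full-width write pays 1 parity sector in 5 (20%),
# but a single-sector write pays 50% -- parity consumption depends on
# the write, which is why the geometry can't be grown by one disk.
```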
Re: [zfs-discuss] Setting up for zfsboot
On Wed, Apr 04, 2007 at 03:34:13PM +0200, Constantin Gonzalez wrote: > - RAID-Z is _very_ slow when one disk is broken. Do you have data on this? The reconstruction should be relatively cheap especially when compared with the initial disk access. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Gzip compression for ZFS
On Wed, Apr 04, 2007 at 07:57:21PM +1000, Darren Reed wrote: > From: "Darren J Moffat" <[EMAIL PROTECTED]> > ... > >The other problem is that you basically need a global unique registry > >anyway so that compress algorithm 1 is always lzjb, 2 is gzip, 3 is > >etc etc. Similarly for crypto and any other transform. > > I've two thoughts on that: > 1) if there is to be a registry, it should be hosted by OpenSolaris > and be open to all and I think there already is such a registry: http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/zio.h#89 Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Setting up for zfsboot
On Wed, Apr 04, 2007 at 11:04:06PM +0200, Robert Milkowski wrote: > If I stop all activity to x4500 with a pool made of several raidz2 and > then I issue spare attach I get really poor performance (1-2MB/s) on a > pool with lot of relatively small files. Does that mean the spare is resilvering when you collect the performance data? I think a fair test would be to compare the performance of a fully functional RAID-Z stripe against a one with a missing (absent) device. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and Linux
On Thu, Apr 12, 2007 at 06:59:45PM -0300, Toby Thain wrote:

> >Hey, then just don't *keep on* asking to relicense ZFS (and anything
> >else) to GPL.
>
> I never would. But it would be horrifying to imagine it relicensed to
> BSD. (Hello, Microsoft, you just got yourself a competitive filesystem.)

There's nothing today preventing Microsoft (or Apple) from sticking ZFS into their OS. They'd just have to release the (minimal) diffs to ZFS-specific files.

Adam

--
Adam Leventhal, Solaris Kernel Development  http://blogs.sun.com/ahl

___ zfs-discuss mailing list [EMAIL PROTECTED] http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Status Update before Reinstall?
On Wed, Apr 25, 2007 at 09:30:12PM -0700, Richard Elling wrote: > IMHO, only a few people in the world care about dumps at all (and you > know who you are :-). If you care, setup dump to an NFS server somewhere, > no need to have it local. Well IMHO, every Solaris customer cares about crash dumps (although they may not know it). There are failures that occur once -- no dump means no solution. And you're not going to be dumping directly over NFS if you care about your crash dump (see previous point). Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] software RAID vs. HW RAID - part III
Hey Robert, This is very cool. Thanks for doing the analysis. What a terrific validation of software RAID and of RAID-Z in particular. Adam On Tue, Apr 24, 2007 at 11:35:32PM +0200, Robert Milkowski wrote: > Hello zfs-discuss, > > http://milek.blogspot.com/2007/04/hw-raid-vs-zfs-software-raid-part-iii.html > > > > -- > Best regards, > Robert Milkowskimailto:[EMAIL PROTECTED] > http://milek.blogspot.com > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: gzip compression throttles system?
On Thu, May 03, 2007 at 11:43:49AM -0500, [EMAIL PROTECTED] wrote:

> I think this may be a premature leap -- It is still undetermined if we are
> running up against a yet unknown bug in the kernel implementation of gzip
> used for this compression type. From my understanding the gzip code has
> been reused from an older kernel implementation, it may be possible that
> this code has some issues with kernel stuttering when used for zfs
> compression that may have not been exposed with its original usage. If it
> turns out that it is just a case of high cpu trade-off for buying faster
> compression times, then the talk of a tunable may make sense (if it is even
> possible given the constraints of the gzip code in kernelspace).

The in-kernel version of zlib is the latest (1.2.3). It's not surprising that we're spending all of our time in zlib if the machine is being driven by I/O. There are outstanding problems with compression in the ZIO pipeline that may contribute to the bursty behavior.

Adam

--
Adam Leventhal, Solaris Kernel Development  http://blogs.sun.com/ahl

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: gzip compression throttles system?
On Wed, May 09, 2007 at 11:52:06AM +0100, Darren J Moffat wrote:

> Can you give some more info on what these problems are.

I was thinking of this bug:

  6460622 zio_nowait() doesn't live up to its name

Which I was surprised to find was fixed by Eric in build 59.

Adam

--
Adam Leventhal, Solaris Kernel Development  http://blogs.sun.com/ahl

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] iscsitadm local_name in ZFS
That would be a great RFE. Currently the iSCSI Alias is the dataset name which should help with identification. Adam On Fri, May 04, 2007 at 02:02:34PM +0200, cedric briner wrote: > cedric briner wrote: > >hello dear community, > > > >Is there a way to have a ``local_name'' as define in iscsitadm.1m when > >you shareiscsi a zvol. This way, it will give even easier > >way to identify an device through IQN. > > > >Ced. > > > > Okay no reply from you so... maybe I didn't make myself well understandable. > > Let me try to re-explain you what I mean: > when you use zvol and enable shareiscsi, could you add a suffix to the > IQN (Iscsi Qualified Name). This suffix will be given by myself and will > help me to identify which IQN correspond to which zvol : this is just a > more human readable tag on an IQN. > > Similarly, this tag is also given when you do an iscsitadm. And in the > man page of iscsitadm it is called a . > > iscsitadm iscsitadm create target -b /dev/dsk/c0d0s5 tiger > or > iscsitadm iscsitadm create target -b /dev/dsk/c0d0s5 hd-1 > > tiger and hd-1 are > > Ced. > > -- > > Cedric BRINER > Geneva - Switzerland > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS over a layered driver interface
Try 'trace((int)arg1);' -- 4294967295 is the unsigned representation of -1. Adam On Mon, May 14, 2007 at 09:57:23AM -0700, Shweta Krishnan wrote: > Thanks Eric and Manoj. > > Here's what ldi_get_size() returns: > bash-3.00# dtrace -n 'fbt::ldi_get_size:return{trace(arg1);}' -c 'zpool > create adsl-pool /dev/layerzfsminor1' dtrace: description > 'fbt::ldi_get_size:return' matched 1 probe > cannot create 'adsl-pool': invalid argument for this pool operation > dtrace: pid 2582 has exited > CPU IDFUNCTION:NAME > 0 20927 ldi_get_size:return4294967295 > > > This is strange because I looked at the code for ldi_get_size() and the only > possible return values in the code are DDI_SUCCESS (0) and DDI_FAILURE(-1). > > Looks like what I'm looking at either isn't the return value, or some bad > address is being reached. Any hints? > > Thanks, > Swetha. > > > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
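For the curious, the reinterpretation that the `(int)` cast performs can be demonstrated outside of DTrace; a sketch using Python's ctypes:

```python
import ctypes

raw = 4294967295                         # what trace(arg1) printed (arg1 is unsigned)
as_int32 = ctypes.c_int32(raw).value     # what trace((int)arg1) would show

# 0xFFFFFFFF reinterpreted as a signed 32-bit integer is -1,
# i.e. the DDI_FAILURE return from ldi_get_size().
```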
Re: [zfs-discuss] RFE: ISCSI alias when shareiscsi=on
Right now -- as I'm sure you have noticed -- we use the dataset name for the alias. To let users explicitly set the alias we could add a new property as you suggest or allow other options for the existing shareiscsi property:

  shareiscsi='alias=potato'

This would sort of match what we do for the sharenfs property.

Adam

On Thu, May 24, 2007 at 02:39:24PM +0200, cedric briner wrote:

> Starting from this thread:
> http://www.opensolaris.org/jive/thread.jspa?messageID=118786
>
> I would love to have the possibility to set an ISCSI alias when doing a
> shareiscsi=on on ZFS. This will greatly facilitate identifying where an
> IQN is hosted.
>
> the ISCSI alias is defined in rfc 3721
> e.g. http://www.apps.ietf.org/rfc/rfc3721.html#sec-2
>
> and the CLI could be something like:
> zfs set shareiscsi=on shareisicsiname= tank
>
> Ced.
> --
> Cedric BRINER
> Geneva - Switzerland

--
Adam Leventhal, Solaris Kernel Development  http://blogs.sun.com/ahl

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mac OS X "Leopard" to use ZFS
On Thu, Jun 07, 2007 at 08:38:10PM -0300, Toby Thain wrote: > When should we expect Solaris kernel under OS X? 10.6? 10.7? :-) I'm sure Jonathan will be announcing that soon. ;-) Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: LZO compression?
Those are interesting results. Does this mean you've already written lzo support into ZFS? If not, that would be a great next step -- licensing issues can be sorted out later... Adam On Sat, Jun 16, 2007 at 04:40:48AM -0700, roland wrote: > btw - is there some way to directly compare lzjb vs lzo compression - to see > which performs better and using less cpu ? > > here those numbers from my little benchmark: > > |lzo |6m39.603s |2.99x > |gzip |7m46.875s |3.41x > |lzjb |7m7.600s |1.79x > > i`m just curious about these numbers - with lzo i got better speed and better > compression in comparison to lzjb > > nothing against lzjb compression - it's pretty nice - but why not taking a > closer look here? maybe here is some room for improvement > > roland > > > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Mac OS X 10.5 read-only support for ZFS
On Sun, Jun 17, 2007 at 09:38:51PM -0700, Anton B. Rang wrote: > Sector errors on DVD are not uncommon. Writing a DVD in ZFS format > with duplicated data blocks would help protect against that problem, at > the cost of 50% or so disk space. That sounds like a lot, but with > BluRay etc. coming along, maybe paying a 50% penalty isn't too bad. > (And if ZFS eventually supports RAID on a single disk, the penalty > would be less.) It would be an interesting project to create some software that took a directory (or ZFS filesystem) to be written to a CD or DVD and optimized the layout for redundancy. That is, choose the compression method (if any), and then, in effect, partition the CD for RAID-Z or mirroring to stretch the data to fill the entire disc. It wouldn't necessarily be all that efficient to access, but it would give you resiliency against media errors. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to take advantage of PSARC 2007/171: ZFS Separate Intent Log
Flash SSDs typically boast a huge number of _read_ IOPS (thousands), but very few write IOPS (tens). The write throughput numbers quoted are almost certainly for non-synchronous writes, whose latency can easily be in the millisecond range. STEC makes an interesting device which offers fast _synchronous_ writes on an SSD, but at a pretty steep cost.

Adam

--
Adam Leventhal, Solaris Kernel Development  http://blogs.sun.com/ahl

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Compression algorithms - Project Proposal
This is a great idea. I'd like to add a couple of suggestions:

- It might be interesting to focus on compression algorithms which are
  optimized for particular workloads and data types, an Oracle database
  for example.

- It might be worthwhile to have some sort of adaptive compression whereby
  ZFS could choose a compression algorithm based on its detection of the
  type of data being stored.

Adam

On Thu, Jul 05, 2007 at 08:29:38PM -0300, Domingos Soares wrote:
> Below follows a proposal for a new OpenSolaris project. Of course, this
> is open to change, since I just wrote down some ideas I had months ago
> while researching the topic as a graduate student in Computer Science,
> and since I'm not an OpenSolaris/ZFS expert at all. I would really
> appreciate any suggestions or comments.
>
> PROJECT PROPOSAL: ZFS Compression Algorithms.
>
> The main purpose of this project is the development of new compression
> schemes for the ZFS file system. We plan to start with the development
> of a fast implementation of a Burrows-Wheeler Transform (BWT) based
> algorithm. BWT is an outstanding tool, and the currently known lossless
> compression algorithms based on it outperform the compression ratio of
> algorithms derived from the well-known Ziv-Lempel algorithm, while being
> a little more expensive in time and space. Therefore, there is room for
> improvement: recent results show that the running time and space needs
> of such algorithms can be significantly reduced, and the same results
> suggest that BWT is likely to become the new standard in compression
> algorithms [1]. Suffix sorting (i.e. the problem of sorting the suffixes
> of a given string) is the main bottleneck of BWT, and really significant
> progress has been made in this area since the first algorithms of Manber
> and Myers [2] and Larsson and Sadakane [3], notably the new linear-time
> algorithms of Karkkainen and Sanders [4], Kim, Sim and Park [5], and Ko
> and Aluru [6], and also the promising O(n log n) algorithm of Karkkainen
> and Burkhardt [7].
>
> As a conjecture, we believe that some intrinsic properties of ZFS and
> file systems in general (e.g. sparseness and data entropy in blocks)
> could be exploited in order to produce brand new and really efficient
> compression algorithms, as well as to adapt existing ones to the task.
> The study might be extended to the analysis of data in specific
> applications (e.g. web servers, mail servers and others) in order to
> develop compression schemes for specific environments and/or modify the
> existing Ziv-Lempel based scheme to deal better with such environments.
>
> [1] "The Burrows-Wheeler Transform: Theory and Practice". Manzini,
> Giovanni. Proc. 24th Int. Symposium on Mathematical Foundations of
> Computer Science.
>
> [2] "Suffix Arrays: A New Method for On-Line String Searches". Manber,
> Udi and Myers, Eugene W. SIAM Journal on Computing, Vol. 22, Issue 5,
> 1990.
>
> [3] "Faster Suffix Sorting". Larsson, N. Jesper and Sadakane, Kunihiko.
> Technical report, Department of Computer Science, Lund University, 1999.
>
> [4] "Simple Linear Work Suffix Array Construction". Karkkainen, Juha and
> Sanders, Peter. Proc. International Conference on Automata, Languages
> and Programming, 2003.
>
> [5] "Linear-Time Construction of Suffix Arrays". D.K. Kim, J.S. Sim,
> H. Park, K. Park. CPM, LNCS, Vol. 2676, 2003.
>
> [6] "Space Efficient Linear Time Construction of Suffix Arrays". P. Ko
> and S. Aluru. CPM 2003.
>
> [7] "Fast Lightweight Suffix Array Construction and Checking".
> Burkhardt, Stefan and Karkkainen, Juha. 14th Annual Symposium, CPM 2003.
>
> Domingos Soares Neto
> University of Sao Paulo
> Institute of Mathematics and Statistics

-- 
Adam Leventhal, Solaris Kernel Development  http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
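For anyone curious what the proposed transform actually computes, here's a
minimal sketch of the forward BWT. It uses a naive rotation sort rather than
a suffix array, and the names are illustrative only, not from the proposal
or any ZFS code:

```python
# Forward Burrows-Wheeler Transform via naive rotation sorting.
# This is O(n^2 log n); the suffix-array algorithms cited in the proposal
# exist precisely to make this step fast. Shown only to illustrate what
# the transform produces.

def bwt(s: str, sentinel: str = "\0") -> str:
    """Return the BWT of s, with a unique sentinel appended so that
    rotations sort unambiguously."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

# BWT clusters similar characters together, which the downstream stages
# (move-to-front, run-length, entropy coding) then exploit.
result = bwt("banana")  # "annb\0aa"
```

The clustering of repeated characters in the output is what a subsequent
move-to-front and entropy-coding stage turns into the compression ratio
advantage the proposal describes.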
Re: [zfs-discuss] zfs vol issue?.
On Thu, Aug 16, 2007 at 05:20:25AM -0700, ramprakash wrote:
> # zfs mount -a
> 1. mounts "c" again.
> 2. but not "vol1" [i.e. /dev/zvol/dsk/mytank/b/c does not contain "vol1"]
>
> Is this the normal behavior or is it a bug?

That looks like a bug. Please file it.

Adam

-- 
Adam Leventhal, Solaris Kernel Development  http://blogs.sun.com/ahl
Re: [zfs-discuss] Mirrored zpool across network
On Sun, Aug 19, 2007 at 05:45:18PM -0700, Mark wrote:
> Basically, the setup is a large volume of Hi-Def video being streamed
> from a camera onto an editing timeline. This will be written to a
> network share. Due to the large amounts of data, ZFS is a really good
> option for us. But we need a backup. We need to do it on generic
> hardware (I was thinking AMD64 with an array of large 7200 rpm hard
> drives), and therefore I think I'm going to have one box mirroring the
> other box. They will be connected by gigabit ethernet. So my question
> is: how do I mirror one raidz array across the network to the other?

One big decision you need to make in this scenario is whether you want true
synchronous replication, or if asynchronous replication (possibly with some
time bound) is acceptable. For the former, each byte must traverse the
network before it is acknowledged to the client; for the latter, data is
written locally and then transmitted shortly after that. Synchronous
replication obviously imposes a much larger performance hit, but
asynchronous replication means you may lose data over some recent period
(though the data will always be consistent).

Adam

-- 
Adam Leventhal, Solaris Kernel Development  http://blogs.sun.com/ahl
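To make the synchronous/asynchronous distinction concrete, here's a toy
model (hypothetical class names, not ZFS code) of the one difference that
matters to the client: when the acknowledgment comes back relative to the
remote write.

```python
# Toy model of the two replication modes described above.

class SyncMirror:
    """Ack only after data is on both sides: nothing can be lost, but
    every write pays the network round trip."""
    def __init__(self):
        self.local, self.remote = [], []

    def write(self, block):
        self.local.append(block)
        self.remote.append(block)   # must complete before the ack
        return "ack"

class AsyncMirror:
    """Ack after the local write; the remote catches up later (e.g. via
    a periodic incremental send), so a crash can lose recent writes."""
    def __init__(self):
        self.local, self.remote, self.pending = [], [], []

    def write(self, block):
        self.local.append(block)
        self.pending.append(block)  # shipped after the ack
        return "ack"

    def flush(self):
        # In ZFS terms this step would be something like snapshotting and
        # sending the delta to the backup box.
        self.remote.extend(self.pending)
        self.pending.clear()
```

In the asynchronous case, the window of potential data loss is whatever sits
in `pending` at the moment of failure, which is exactly what a time bound on
replication limits.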
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
> And here are the results:
>
> RAIDZ:
>
> Number of READ requests: 4.
> Number of WRITE requests: 0.
> Number of bytes to transmit: 695678976.
> Number of processes: 8.
> Bytes per second: 1305213
> Requests per second: 75
>
> RAID5:
>
> Number of READ requests: 4.
> Number of WRITE requests: 0.
> Number of bytes to transmit: 695678976.
> Number of processes: 8.
> Bytes per second: 2749719
> Requests per second: 158

I'm a bit surprised by these results. Assuming relatively large blocks were
written, RAID-Z and RAID-5 should be laid out on disk very similarly,
resulting in similar read performance. Did you compare the I/O
characteristics of both? Was the bottleneck in the software or the hardware?

Very interesting experiment...

Adam

-- 
Adam Leventhal, FishWorks  http://blogs.sun.com/ahl
Re: [zfs-discuss] ZFS gzip compression
On Sat, Sep 29, 2007 at 05:03:29PM -0700, Scott wrote:
> Thanks for the reply. I suppose my next question, then, is how difficult
> would it be for me to apply a patch against U4 to gain the gzip
> compression functionality in ZFS? I come from a FreeBSD background, so I
> have no problems with compiling OpenSolaris source, but I would like to
> retain as much of the code from the production S10U4 as I can for
> stability reasons.

Unfortunately, that's going to be quite difficult because we don't release
the source code for Solaris 10 updates (a position I personally find a bit
dubious). The ZFS team may be able to give you some guidance about which
build of ON most closely corresponds to what's in Solaris 10 U4, and you
could try to work from there.

Adam

-- 
Adam Leventhal, FishWorks  http://blogs.sun.com/ahl
Re: [zfs-discuss] Yager on ZFS
On Wed, Nov 07, 2007 at 01:47:04PM -0800, can you guess? wrote:
> I do consider the RAID-Z design to be somewhat brain-damaged [...]

How so? In my opinion, it seems like a cure for the brain damage of RAID-5.

Adam

-- 
Adam Leventhal, FishWorks  http://blogs.sun.com/ahl
Re: [zfs-discuss] Yager on ZFS
On Thu, Nov 08, 2007 at 07:28:47PM -0800, can you guess? wrote:
> > How so? In my opinion, it seems like a cure for the brain damage of
> > RAID-5.
>
> Nope.
>
> A decent RAID-5 hardware implementation has no 'write hole' to worry
> about, and one can make a software implementation similarly robust with
> some effort (e.g., by using a transaction log to protect the
> data-plus-parity double-update or by using COW mechanisms like ZFS's in
> a more intelligent manner).

Can you reference a software RAID implementation which implements a
solution to the write hole and performs well? My understanding (and this is
based on what I've been told by people more knowledgeable in this domain
than I am) is that software RAID has suffered from being unable to provide
both correctness and acceptable performance.

> The part of RAID-Z that's brain-damaged is its
> concurrent-small-to-medium-sized-access performance (at least up to
> request sizes equal to the largest block size that ZFS supports, and
> arguably somewhat beyond that): while conventional RAID-5 can satisfy
> N+1 small-to-medium read accesses or (N+1)/2 small-to-medium write
> accesses in parallel (though the latter also take an extra rev to
> complete), RAID-Z can satisfy only one small-to-medium access request at
> a time (well, plus a smidge for read accesses if it doesn't verify the
> parity) - effectively providing RAID-3-style performance.

"Brain damage" seems a bit of an alarmist label. While you're certainly
right that for a given block we do need to access all disks in the given
stripe, it seems like a rather quaint argument: aren't most environments
that matter trying to avoid waiting for the disk at all? Intelligent
prefetch and large caches, I'd argue, are far more important for
performance these days.

> The easiest way to fix ZFS's deficiency in this area would probably be
> to map each group of N blocks in a file as a stripe with its own parity
> - which would have the added benefit of removing any need to handle
> parity groups at the disk level (this would, incidentally, not be a bad
> idea to use for mirroring as well, if my impression is correct that
> there's a remnant of LVM-style internal management there). While this
> wouldn't allow use of parity RAID for very small files, in most
> installations they really don't occupy much space compared to that used
> by large files, so this should not constitute a significant drawback.

I don't really think this would be feasible given how ZFS is stratified
today, but go ahead and prove me wrong; here are the instructions for
bringing over a copy of the source code:

http://www.opensolaris.org/os/community/tools/scm

- ahl

-- 
Adam Leventhal, FishWorks  http://blogs.sun.com/ahl
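For readers following the write-hole debate: the single-parity scheme
underlying both RAID-5 and RAID-Z is plain XOR, as in this sketch
(illustrative only; real RAID-Z stripes are variable-width and protected by
checksums in the block pointers):

```python
# XOR parity: the parity block is the XOR of the data blocks, so any one
# missing block is recoverable as the XOR of everything that survives.

def parity(blocks: list[bytes]) -> bytes:
    p = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            p[i] ^= byte
    return bytes(p)

def reconstruct(surviving: list[bytes], p: bytes) -> bytes:
    # XOR of the parity with the surviving blocks yields the lost block.
    return parity(surviving + [p])

stripe = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(stripe)
assert reconstruct([stripe[0], stripe[2]], p) == stripe[1]  # disk 1 lost
```

The write hole arises because the data blocks and the parity block must be
updated together; a crash between the two updates leaves a stripe whose
parity silently disagrees with its data. ZFS's copy-on-write design avoids
overwriting a live stripe in place, which is how RAID-Z sidesteps the
problem without NVRAM.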
Re: [zfs-discuss] Expanding a RAIDZ based Pool...
On Mon, Dec 10, 2007 at 03:59:22PM +, Karl Pielorz wrote:
> e.g. If I build a RAIDZ pool with 5 * 400Gb drives, and later add a 6th
> 400Gb drive to this pool, will its space instantly be available to
> volumes using that pool? (I can't quite see this working myself)

Hi Karl,

You can't currently expand the width of a RAID-Z stripe. It has been
considered, but implementing it would require a fairly substantial change
to the way RAID-Z works. Sun's current ZFS priorities are elsewhere, but
there's nothing preventing an interested member of the community from
undertaking this project...

Adam

-- 
Adam Leventhal, FishWorks  http://blogs.sun.com/ahl
Re: [zfs-discuss] raidz in zfs questions
>> 2. in a raidz do all the disks have to be the same size?

Disks don't have to be the same size, but only as much space will be used
on the larger disks as is available on the smallest disk. In other words,
there's no benefit to be gained from this approach.

> Related question:
> Does a raidz have to be either only full disks or only slices, or can
> it be mixed? E.g., can you do a 3-way raidz with 2 complete disks and
> one slice (of equal size as the disks) on a 3rd, larger, disk?

Sure. One could do this, but it's kind of a hack. I imagine you'd like to
do something like match a disk of size N with another disk of size 2N and
use RAID-Z to turn them into a single vdev. At that point it's probably a
better idea to build a striped vdev and use ditto blocks for your data
redundancy by setting copies=2.

Adam

-- 
Adam Leventhal, Fishworks  http://blogs.sun.com/ahl
Re: [zfs-discuss] Mixing RAIDZ and RAIDZ2 zvols in the same zpool
On Wed, Mar 12, 2008 at 09:59:53PM +, A Darren Dunham wrote:
> It's not *bad*, but as far as I'm concerned, it's wasted space.
>
> You have to deal with the pool as a whole as having single-disk
> redundancy for failure modes. So the fact that one section of it has
> two-disk redundancy doesn't give you anything in failure planning.
>
> And you can't assign filesystems or particular data to that vdev, so
> the added redundancy can't be concentrated anywhere.

Well, one can imagine a situation where two different types of disks have
different failure probabilities such that the same reliability could be
garnered with one using single-parity RAID as with the other using
double-parity RAID. That said, it would be a fairly uncommon scenario.

Adam

-- 
Adam Leventhal, Fishworks  http://blogs.sun.com/ahl
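The equivalence argued here is easy to sketch numerically. Assuming an
independent per-disk failure probability f over some repair window (a big
assumption, and the numbers below are purely illustrative), a vdev loses
data when more disks fail than it has parity:

```python
# Rough reliability sketch for the single- vs. double-parity scenario.
from math import comb

def p_loss(n_disks: int, n_parity: int, f: float) -> float:
    """P(more than n_parity of n_disks fail), assuming independent
    per-disk failure probability f."""
    return sum(comb(n_disks, k) * f**k * (1 - f)**(n_disks - k)
               for k in range(n_parity + 1, n_disks + 1))

# Reliable disks under single parity vs. flakier disks under double parity
# can land in the same ballpark -- the situation described above:
single = p_loss(5, 1, 0.01)  # single parity, 1% per-disk failure
double = p_loss(5, 2, 0.05)  # double parity, 5% per-disk failure
```

With these illustrative inputs both probabilities come out on the order of
1e-3, which is the sense in which one vdev's extra parity can buy the same
reliability as the other's better disks.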
Re: [zfs-discuss] Per filesystem scrub
On Mar 31, 2008, at 10:41 AM, kristof wrote:
> I would be very happy having a filesystem-based zfs scrub.
>
> We have an 18TB zpool; it takes more than 2 days to do the scrub.
>
> Since we cannot take snapshots during the scrub, this is unacceptable.

While per-dataset scrubbing would certainly be a coarse-grained solution to
your problem, work is underway to address the problematic interaction
between scrubs and snapshots.

Adam

-- 
Adam Leventhal, Fishworks  http://blogs.sun.com/ahl
[zfs-discuss] Algorithm for expanding RAID-Z
After hearing many vehement requests for expanding RAID-Z vdevs, Matt
Ahrens and I sat down a few weeks ago to figure out a mechanism that would
work. While Sun isn't committing resources to implementing a solution, I've
written up our ideas here:

http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

I'd encourage anyone interested in getting involved with ZFS development to
take a look.

Adam

-- 
Adam Leventhal, Fishworks  http://blogs.sun.com/ahl
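As a rough illustration of the reflow idea in that write-up (a toy model
only; the real mechanism must also preserve parity and survive interruption
mid-expansion), expansion amounts to rewriting the same logical sequence of
sectors onto a wider geometry:

```python
# Toy sketch: sectors laid out round-robin across the disks of a vdev.

def layout(n_sectors: int, width: int) -> list[tuple[int, int]]:
    """Map logical sector i to a (disk, row) position."""
    return [(i % width, i // width) for i in range(n_sectors)]

def free_slots(n_sectors: int, width: int) -> int:
    """Unused slots in the last row, i.e. capacity exposed at this width."""
    rows = -(-n_sectors // width)  # ceiling division
    return rows * width - n_sectors

# Reflowing 12 sectors from 4 disks to 5 keeps their logical order but
# places them on the new geometry, exposing the added disk's capacity.
before = layout(12, 4)  # fills 3 rows exactly
after = layout(12, 5)   # same sectors; slots now free at the end
```

The point of the model is that each sector's new home depends only on its
logical position and the new width, which is what lets the expansion proceed
as a simple sequential pass over the existing data.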
Re: [zfs-discuss] Periodic ZFS maintenance?
On Mon, Apr 21, 2008 at 10:41:35AM +1200, Ian Collins wrote:
> Sam wrote:
> > I have a 10x500 disc file server with ZFS+, do I need to perform any
> > sort of periodic maintenance to the filesystem to keep it in tip-top
> > shape?
>
> No, but if there are problems, a periodic scrub will tip you off sooner
> rather than later.

Well, tip you off _and_ correct the problems if possible. I believe a
long-standing RFE has been to scrub periodically in the background to
ensure that correctable problems don't turn into uncorrectable ones.

Adam

-- 
Adam Leventhal, Fishworks  http://blogs.sun.com/ahl