Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
On 11/26/2012 12:54 PM, Grégory Giannoni wrote:
> [snip] I switched a few months ago from Sun X45x0 to HP gear: my fast NAS boxes are now DL180 G6. I got better perf using the LSI 9240-8i rather than HP SmartArray (tried P410 & P812). I'm using only 600GB SSD drives.

That LSI controller supports SATA III, i.e. 6Gb/s SATA. The Px1x controllers do 6Gb/s SAS, but only 3Gb/s SATA, so that's your likely perf difference. The SmartArray Px2x series should do both SATA and SAS at 6Gb/s.

That said, I do think you're right that the LSI controller is probably a better fit for connections requiring a SATA SSD. The only exception is having to give up the 1GB of NVRAM on the HP controller. :-(

> In one of the servers I replaced the 25-disk bay with 3 8-disk bays, allowing me to connect 3 LSI 9240-8i rather than only one. This NAS achieved 4.4GB/sec reading and 4.1GB/sec writing with 48 io/s, running Solaris 11. Using raidz2, perfs dropped to 3.1 / 3.0 GB/sec.

Is the bottleneck the LSI controller, or the SAS/SATA bus, or the PCI-E bus itself? That is, have you tested with the LSI 9240-4i (one per 8-drive cage, which I *believe* can use the HP multi-lane cable), and with an LSI 9260-16i or LSI 9280-24i? My instinct would be to say it's the PCI-E bus, and you could probably get away with the 4-channel cards. I.e. 4 channels @ 6Gbit/s = 3GB/s > 4x PCI-E 2.0 at 2GB/s.

Also, the HP H220 is simply the OEM version of the LSI 9240-8i.

-Erik

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
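The "my instinct says it's the PCI-E bus" hunch is easy to sanity-check with link-rate arithmetic. A small sketch (my own illustration, assuming 8b/10b coding on both SATA III and PCIe 2.0, and ignoring higher-level protocol overhead, so the numbers come out a bit below Erik's rounder figures but support the same conclusion):

```python
# Back-of-the-envelope bus math, all numbers in decimal MB/s.

def sata3_MBps(ports):
    # SATA III: 6 Gbit/s line rate, 8b/10b coding -> 600 MB/s usable per port
    return ports * 600

def pcie2_MBps(lanes):
    # PCIe 2.0: 5 GT/s, 8b/10b coding -> 500 MB/s usable per lane
    return lanes * 500

# A 4-port card with all ports saturated needs 2400 MB/s of host bandwidth,
# but a PCIe 2.0 x4 slot tops out at 2000 MB/s -> the slot is the bottleneck.
print(sata3_MBps(4), pcie2_MBps(4))

# Even an 8-port card in an x8 slot is slot-bound: 4800 vs 4000 MB/s.
print(sata3_MBps(8), pcie2_MBps(8))
```

Either way the PCIe slot, not the SATA links, caps a fully loaded card, which is consistent with spreading the drives across three controllers helping.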
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
On 11/24/2012 5:17 AM, Edmund White wrote:
> Heh, I wouldn't be using G5s for ZFS purposes now. G6 and better ProLiants are a better deal for RAM capacity and CPU core count… Either way, I also use HP systems as the basis for my ZFS/Nexenta storage systems. Typically DL380s, since I have expansion room for either 16 drive bays, or for using them as a head unit to a D2700 or D2600 JBOD. The right replacement for the old DL320s storage server is the DL180 G6. This model was available in a number of configurations, but the best solutions for storage were the 2U 12-bay 3.5" model and the 2U 25-bay 2.5" model. Both models have a SAS expander on the backplane, but with a nice controller (LSI 9211-4i), make good ZFS storage servers.

Really? I mean, sure, the G6 is beefier, but I can still get 8 cores of decently fast CPU and 64GB of RAM in a G5, which, unless I'm doing dedup and need a *stupid* amount of RAM, is more than sufficient for anything I've ever seen as a ZFS appliance. I'd agree that the 64GB RAM limit can be annoying if you really want to run a Super App Server + ZFS server on them, but they're so much more powerful than the X4500/X4540 that I'd think they make an excellent drop-in replacement when paired with an MSA70, particularly on cost. The G6 is over double the cost of the G5.

One thing that I do know about the G6 is that they have Nehalem CPUs (X5500-series), which support VT-d, Intel's virtualization I/O acceleration technology, while the G5's X5400-series Harpertowns don't. If you're running zones on the system, it won't matter, but VirtualBox will care.

---

Thanks for the DL180 link. Once again, I think I'd go for the G5 rather than the G6 - it's roughly half the cost (or less, as the 2.5"-enabled G6s seem to be expensive), and these boxes make nice log servers, not app servers.
The DL180 G5 seems to be pretty much a DL380 G5 with a different hard-drive layout (12x 3.5" rather than 8x 2.5").

---

One word here for everyone getting HP equipment: you want the Px1x or Px2x (e.g. P812) series of SmartArray controllers if you plan on running SATA drives attached to them. The older Px0x series only supports SATA I (1.5Gb/s) alongside 3Gb/s SAS, which is a serious handicap if you want to put SSDs on that channel. The newer series do SATA II (3Gb/s) and SAS at 6Gb/s.

http://h18004.www1.hp.com/products/servers/proliantstorage/arraycontrollers/index.html

-Erik
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
On 11/23/2012 5:50 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
>>
>> I wonder if it would make weird sense to get the boxes, forfeit the cool-looking Fishworks, and install Solaris/OI/Nexenta/whatever to get the most flexibility and bang for a buck from the owned hardware...
>
> This is what we decided to do at work, and this is the reason why. But we didn't buy the appliance-branded boxes; we just bought normal servers running Solaris.

I gave up and am now buying HP-branded hardware for running Solaris on it. Particularly if you get off-lease used hardware (for which HP is still very happy to let you buy a HW support contract), it's cheap, and HP has a lot of Solaris drivers for their branded stuff. Their whole SmartArray line of adapters has much better Solaris driver coverage than the generic stuff or the equivalent IBM or Dell items.

For instance, I just got a couple of DL380 G5 systems with dual Harpertown CPUs, fully loaded with 8 2.5" SAS drives and 32GB of RAM, for about $800 total. You can attach their MSA30/50/70-series (or D2700-series, if you want new) as dumb JBODs via SAS, and the nicer SmartArray controllers have 1GB of NVRAM, which is sufficient for many purposes, so you don't even have to cough up the dough for a nice ZIL SSD.

HP even made a sweet little "appliance" thing that was designed for Windows, but happens to run Solaris really, really well: the DL320s (the "s" is part of the model designation). 14x 3.5" SAS/SATA hot-swap bays, a Xeon 3070 dual-core CPU, a SmartArray controller, 2x GbE NICs, LOM, and a free 1x PCI-E expansion slot. The only drawback is that it only takes up to 8GB of RAM. It makes a *fabulous* little backup system for logs and stuff, and it's under about $2000 even after you splurge for 1TB drives and an SSD for the thing. I am in the market for something newer than that, though.
Anyone know what HP's using as a replacement for the DL320s?

-Erik
Re: [zfs-discuss] FC HBA for openindiana
Do make sure you're getting one that has the proper firmware. Those with a BIOS don't work in SPARC boxes, and those with OpenBoot don't work in x64 stuff.

A quick "Sun FC HBA" search on eBay turns up a whole list of "official" Sun HBAs, which will give you an idea of the (max) pricing you'll be paying. There's currently a *huge* price difference between the 4Gb and 2Gb adapters. Also, keep in mind that PCI-X adapters are far more common at the 1/2Gb range, while PCI-E starts to be the most common choice at 4Gb+.

Here's a list of all the old Sun FC HBAs (which can help you sort out which are for x64 systems, and which were for SPARC systems):

http://www.oracle.com/technetwork/documentation/oracle-storage-networking-190061.html

As Tim said, these should all have built-in drivers in the Illumos codebase.

-Erik

On 10/20/2012 4:24 PM, Tim Cook wrote:
> The built-in drivers support MPxIO, so you're good to go.
>
> On Friday, October 19, 2012, Christof Haemmerle wrote:
>> Yep, I need 4Gb with multipathing if possible.
>>
>> On Oct 19, 2012, at 10:34 PM, Tim Cook wrote:
>>> On Friday, October 19, 2012, Christof Haemmerle wrote:
>>>> hi there, i need to connect some old raid subsystems to an opensolaris box via fibre channel. can you recommend any FC HBA? thanx
>>>
>>> How old? If it's 1Gbit you'll need a 4Gb or slower HBA. Qlogic would be my preference. You should be able to find a 2340 for cheap on eBay. Or a 2460 if you want 4Gb.
Re: [zfs-discuss] what have you been buying for slog and l2arc?
On 8/6/2012 2:53 PM, Bob Friesenhahn wrote:
> On Mon, 6 Aug 2012, Stefan Ring wrote:
>>> Intel's brief also clears up a prior controversy of what types of data are actually cached; per the brief it's both user and system data!
>>
>> So you're saying that SSDs don't generally flush data to stable medium when instructed to? So data written before an fsync is not guaranteed to be seen after a power-down? If that -- ignoring cache flush requests -- is the whole reason why SSDs are so fast, I'm glad I haven't got one yet.
>
> Testing has shown that many SSDs do not flush the data prior to claiming that they have done so. The flush request may hasten the time until the next actual cache flush.

Honestly, I don't think this last point can be emphasized enough. SSDs of all flavors and manufacturers have a track record of *consistently* lying when returning from a cache-flush command. There might exist somebody out there who actually does it across all products, but I've tested and used enough of the variety (both consumer and enterprise) NOT to trust any SSD that tells you it actually flushed its local cache. ALWAYS insist on some form of power protection, whether it be a supercap, battery, or external power supply. That way, even if they lie to you, you're covered from a power loss.

-Erik
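The "testing has shown" above refers to pull-the-plug experiments. A minimal sketch of that kind of test harness (my own illustration, not an existing tool's interface; the idea is the same as the well-known diskchecker-style scripts):

```python
import os
import struct
import tempfile
import zlib

REC = struct.Struct("<IQ")  # crc32 of the seq field, then the seq number

def write_records(path, n):
    """Append sequenced, checksummed records, fsyncing after each one.

    Once fsync() returns, the drive has *claimed* the record is on
    stable media; that claim is exactly what gets audited after a
    power cut.
    """
    with open(path, "wb") as f:
        for seq in range(n):
            payload = struct.pack("<Q", seq)
            f.write(struct.pack("<I", zlib.crc32(payload)) + payload)
            f.flush()
            os.fsync(f.fileno())

def count_intact(path):
    """Count records that survived, stopping at the first hole or corruption."""
    with open(path, "rb") as f:
        data = f.read()
    good = 0
    for off in range(0, len(data) - REC.size + 1, REC.size):
        crc, seq = REC.unpack_from(data, off)
        if seq != good or crc != zlib.crc32(struct.pack("<Q", seq)):
            break
        good += 1
    return good
```

Usage: run the writer against the SSD under test, cut power mid-run while noting the last sequence number fsync() acknowledged, then run count_intact() after reboot. Any acknowledged record that's missing means the drive lied about the flush.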
Re: [zfs-discuss] IOzone benchmarking
On 5/5/2012 8:04 AM, Bob Friesenhahn wrote:
> On Fri, 4 May 2012, Erik Trimble wrote:
>> predictable, and the backing store is still only giving 1 disk's IOPS. The RAIDZ* may, however, give you significantly more throughput (in MB/s) than a single disk if you do a lot of sequential read or write.
>
> Has someone done real-world measurements which indicate that raidz* actually provides better sequential read or write than simple mirroring with the same number of disks? While it seems that there should be an advantage, I don't recall seeing posted evidence of such. If there was a measurable advantage, it would be under conditions which are unlikely in the real world. The only thing totally clear to me is that raidz* provides better storage efficiency than mirroring, and that raidz1 is dangerous with large disks. Provided that the media reliability is sufficiently high, there are still many performance and operational advantages obtained from simple mirroring (duplex mirroring) with zfs.
>
> Bob

I'll see what I can do about actual measurements. Given that we're really recommending a minimum of RAIDZ2 nowadays (with disks > 1TB), that means, for N disks, you get N-2 data disks in a RAIDZ2, and N/2 data disks in a standard striped mirror. My brain says that even with the overhead of parity calculation, for sequential read/write of at least the slab size (i.e. involving all the data drives in a RAIDZ2), performance for the RAIDZ2 should be better for N >= 6. But that's my theoretical brain, and we should do some decent benchmarking to put some hard facts to that.

-Erik
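The N >= 6 crossover claimed above follows from simple disk counting. A sketch (my assumption: sequential throughput scales with the number of data disks, and parity computation is free):

```python
def raidz2_data_disks(n):
    # RAIDZ2: two parity disks per vdev, the rest carry data
    return n - 2

def mirror_data_disks(n):
    # striped 2-way mirrors: half the disks hold unique data
    return n // 2

# Sequential-throughput model: MB/s scales with the data-disk count.
# For even N, a 4-disk pool ties (2 vs 2); RAIDZ2 first pulls ahead
# at N = 6 (4 data disks vs 3) and the gap widens from there.
for n in (4, 6, 8, 10):
    print(n, raidz2_data_disks(n), mirror_data_disks(n))
```

This is exactly the "theoretical brain" model; whether real hardware tracks it is what the proposed benchmarking would have to show.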
Re: [zfs-discuss] IOzone benchmarking
On 5/4/2012 1:24 PM, Peter Tribble wrote:
> On Thu, May 3, 2012 at 3:35 PM, Edward Ned Harvey wrote:
>> I think you'll get better, both performance & reliability, if you break each of those 15-disk raidz3's into three 5-disk raidz1's. Here's why:
>
> Incorrect on reliability; see below.
>
>> Now, to put some numbers on this... A single 1T disk can sustain (let's assume) 1.0 Gbit/sec read/write sequential. This means resilvering the entire disk sequentially, including unused space (which is not what ZFS does), would require 2.2 hours. In practice, on my 1T disks, which are in a mirrored configuration, I find resilvering takes 12 hours. I would expect this to be ~4 days if I were using 5-disk raidz1, and I would expect it to be ~12 days if I were using 15-disk raidz3.
>
> Based on your use of "I would expect", I'm guessing you haven't done the actual measurement. I see ~12-16 hour resilver times on pools using 1TB drives in raidz configurations. The resilver times don't seem to vary with whether I'm using raidz1 or raidz2.
>
>> Suddenly the prospect of multiple failures overlapping doesn't seem so unlikely.
>
> Which is *exactly* why you need multiple-parity solutions. Put simply, if you're using single-parity redundancy with 1TB drives or larger (raidz1 or 2-way mirroring), then you're putting your data at risk. I'm seeing - at a very low level, but clearly non-zero - occasional read errors during rebuild of raidz1 vdevs, leading to data loss. Usually just one file, so it's not too bad (and zfs will tell you which file has been lost). And the observed error rates we're seeing in terms of uncorrectable (and undetectable) errors from drives are actually slightly better than you would expect from the manufacturers' spec sheets. So you definitely need raidz2 rather than raidz1; I'm looking at going to raidz3 for solutions using current high-capacity (i.e. 3TB) drives.
>
> (On performance, I know what the theory says about getting one disk's worth of IOPS out of each vdev in a raidz configuration. In practice we're finding that our raidz systems actually perform pretty well when compared with dynamic stripes, mirrors, and hardware raid LUNs.)

Really, guys: Richard, myself, and several others have covered how ZFS does resilvering (and disk reliability, a related issue), and included very detailed calculations on IOPS required and discussions about slabs, recordsize, and how disks operate with regard to seek/access times and OS caching. Please search the archives, as it's not fruitful to repost the exact same thing repeatedly.

Short version: assuming identical drives and the exact same usage pattern and /amount/ of data, the time it takes the various ZFS configurations to resilver is N for ANY mirrored config and a bit less than N*M for an M-disk RAIDZ*, where M = the number of data disks in the RAIDZ* - thus a 6-drive (total) RAIDZ2 will have the same resilver time as a 5-drive (total) RAIDZ1. Calculating what N is depends entirely on the pattern in which the data was written on the drive. You're always going to be IOPS-bound on the disk being resilvered.

Which RAIDZ* config to use (assuming you have a fixed tolerance for data loss) depends entirely on what your data usage pattern does to resilver times; configurations needing very long resilver times had better have more redundancy. And remember, larger configs allow for more data to be stored, which also increases resilver time.

Oh, and a RAIDZ* will /only/ ever get you slightly more than 1 disk's worth of IOPS (averaged over a reasonable time period). Caching may make it appear to give more IOPS in certain cases, but that's neither sustainable nor predictable, and the backing store is still only giving 1 disk's IOPS. The RAIDZ* may, however, give you significantly more throughput (in MB/s) than a single disk if you do a lot of sequential read or write.

-Erik
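The "short version" resilver rule stated above can be written down directly. A sketch (N is the time to resilver the given amount of data in a mirror; the N*M figure is treated as the upper bound Erik describes):

```python
def resilver_bound(n, total_disks=None, parity=None):
    """Upper-bound resilver time under the model in this thread.

    n is the mirror resilver time for the same data; a RAIDZ* vdev
    takes a bit less than n times its data-disk count M.
    """
    if total_disks is None:            # a mirror of any width
        return n
    return n * (total_disks - parity)  # M = data disks in the RAIDZ*

# A 6-drive RAIDZ2 and a 5-drive RAIDZ1 both have 4 data disks,
# so the model gives them identical resilver bounds.
print(resilver_bound(12, 6, 2), resilver_bound(12, 5, 1))
```

Note the model deliberately says nothing about what N is; as the post says, that depends entirely on the on-disk write pattern, and the resilvering disk stays IOPS-bound throughout.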
Re: [zfs-discuss] Good tower server for around 1,250 USD?
On 3/24/2012 4:54 PM, The Honorable Senator and Mrs. John Blutarsky wrote:
> laotsu said:
>> well check this link
>> https://shop.oracle.com/pls/ostore/product?p1=SunFireX4270M2server&p2=&p3=&p4=&sc=ocom_x86_SunFireX4270M2server&tz=-4:00
>> you may not like the price
>
> Hahahah! Thanks for the laugh. The dual 10GbE PCI card breaks my budget. I'm not going to try to configure a server and see how much it costs... I can't even get to the site from my country, btw. I had to use a proxy through my company in America to get pricing. Oracle doesn't want to sell certain things everywhere, or they don't know how to run a website.

I posted this a while ago when people were asking for a good recommendation. I suggest pretty much any of the IBM stuff - the vast majority of them seem to have really good compatibility (excepting the ServeRAID controllers) with Solaris/IllumOS. The baseboard controllers are usually some flavor of well-supported SATA or LSI SAS controller. They're otherwise quite well put together, and parts are easy to come by (and IBM's support site for information is fabulous, even if you DON'T have a contract). You can get reconditioned/used stuff for really cheap, and it's even possible to get support for IBM-label stuff if it's not out of warranty (or buy a new warranty, if you so want, from your local IBM reseller, of which there are a lot, world-wide). You can also likely get a Solaris contract for this, either through IBM or through Oracle (that is, if Oracle hasn't completely stopped selling support contracts for Solaris on non-Oracle hardware already).

I personally own an X3500, which uses E5[1,3]00-series CPUs (dual or quad-core) and DDR2 RAM. They usually come with a systems-management card (or you can get one for cheap, under $50).
Parts are easy to come by, cheap, and covered by any IBM warranty if you buy an IBM-labeled part (even if it was a 3rd-party, non-"authorized" reseller that sold you the part). The X3500 and X3500 M2, plus the X3400 or X3400 M2, are likely your best bets. The IBM Xref for Withdrawn Hardware is an excellent place to start looking for a compatible system (plus it gives you the IBM part numbers for everything):

http://www.redbooks.ibm.com/abstracts/redpxref.html

Here's what you want for under $1000:
http://www.ebay.com/itm/IBM-x3500-Tower-2x-Quad-Core-2-66GHz-8GB-4x73GB-8K-Raid-/140730930501?pt=COMP_EN_Servers&hash=item20c4379545

Cheaper, and a better CPU, but smaller:
http://www.ebay.com/itm/IBM-x3400-M3-737942U-5U-Tower-Entry-level-Server-Intel-Xeon-E5507-2-26-GHz-/390390337076?pt=COMP_EN_Servers&hash=item5ae513ce34

-Erik
Re: [zfs-discuss] 2.5" to 3.5" bracket for SSD
On 1/14/2012 8:15 AM, Anil Jangity wrote:
> I have a couple of Sun/Oracle X2270 boxes and am planning to get some 2.5" Intel 320 SSDs for the rpool. Do you happen to know what kind of bracket is required to get the 2.5" SSD to fit into the 3.5" slots? Thanks

Anything that looks like this:

http://www.amazon.com/2-5-3-5-Ssd-sata-Convert/dp/B002Z2QDNE/ref=dp_cp_ob_e_title_0

The X2270 takes a standard low-profile 3.5" hard drive, so any converter that puts the power/SATA connectors in the same location as a 3.5" drive's will work.

-Erik
Re: [zfs-discuss] Stress test zfs
On 1/4/2012 2:59 PM, grant lowe wrote:
> Hi all, I've got Solaris 10 9/10 running on a T3. It's an Oracle box with 128GB memory. I've been trying to load test the box with bonnie++. I can seem to get 80 to 90K reads, but can't seem to get more than a couple K for writes. Any suggestions? Or should I take this to a bonnie++ mailing list? Any help is appreciated. I'm kinda new to load testing. Thanks.

Also, note that bonnie++ is single-threaded, and a T3's single-thread performance isn't stellar, by any means. It's entirely possible you're CPU-bound during the test. Though a listing of your ZFS config would be nice, as previously mentioned...

-Erik
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On 12/12/2011 12:23 PM, Richard Elling wrote:
> On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:
>> Not exactly. What is dedup'ed is the stream only, which is in fact not very efficient. Real dedup-aware replication is taking the necessary steps to avoid sending a block that exists on the other storage system.
>
> These exist outside of ZFS (eg rsync) and scale poorly. Given that dedup is done at the pool level and ZFS send/receive is done at the dataset level, how would you propose implementing a dedup-aware ZFS send command?
> -- richard

I'm with Richard. There is no practical "optimally efficient" way to dedup a stream from one system to another. The only way to do so would be to have total information about the pool composition on BOTH the receiver and sender side. That would involve sending the checksums for the complete pool's blocks between the receiver and sender, which is a non-trivial overhead and, indeed, would usually be far worse than simply doing what 'zfs send -D' does now (dedup the sending stream itself).

The only possible way such a scheme would work would be if the receiver and sender were the same machine (note: not VMs or zones on the same machine, but the same OS instance, since you would need the DDT to be shared). And that's not a use case that 'zfs send' is generally optimized for - that is, while it's entirely possible, it's not the primary use case for 'zfs send'.

Given the overhead of network communications, there's no way that sending block checksums between hosts can ever be more efficient than just sending the self-deduped whole stream (except in pedantic cases). Let's look at possible implementations (all assume that the local sending machine does its own dedup - that is, the stream-to-be-sent is already deduped within itself):

(1) When constructing the stream, every time a block is read from a fileset (or volume), its checksum is sent to the receiving machine. The receiving machine then looks up that checksum in its DDT and sends back a "needed" or "not-needed" reply to the sender. While this lookup is being done, the sender must hold the original block in RAM and cannot write it out to the to-be-sent stream.

(2) The sending machine reads all the to-be-sent blocks, creates a stream, AND creates a checksum table (a mini-DDT, if you will). The sender communicates this mini-DDT to the receiver. The receiver diffs it against its own master pool DDT, and then sends back an edited mini-DDT containing only the checksums that match blocks which aren't on the receiver. The original sending machine must then go back and reconstruct the stream (either as a whole, or by parsing the stream as it is being sent) to leave out the unneeded blocks.

(3) Some combo of #1 and #2 where several checksums are stuffed into a packet and sent over the wire to be checked at the destination, with the receiver sending back only those to be included in the stream.

In the first scenario, you produce a huge amount of small-packet network traffic, which trashes network throughput, with no real expectation that the reduction in the send stream will be worth it. In the second case, you induce a huge amount of latency into the construction of the sending stream - that is, the sender has to wait around and then spend a non-trivial amount of processing power essentially double-processing the send stream when, in the current implementation, it just sends out stuff as soon as it gets it. The third scenario is only an optimization of #1 and #2, and doesn't avoid the pitfalls of either.

That is, even if ZFS did pool-level sends, you're still trapped by the need to share the DDT, which induces an overhead that can't reasonably be made up vs simply sending an internally-deduped source stream in the first place. I'm sure I can construct an instance where such DDT sharing would be better than the current 'zfs send' implementation; I'm just as sure that such an instance would be the small minority of usage, and that such a required implementation would radically alter the "typical" use case's performance for the worse.

In any case, as 'zfs send' works on filesets and volumes, and ZFS maintains DDT information at the pool level, there's no way to share an existing whole DDT between two systems (and, given the potential size of a pool-level DDT, that's a bad idea anyway). I see no ability to optimize the 'zfs send/receive' concept beyond what is currently done.

-Erik
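Scenario #2 above can be sketched in a few lines (purely illustrative: sha256 stands in for the ZFS block checksum, the "stream" is just a list of blocks, and none of this is real 'zfs send' machinery):

```python
import hashlib

def checksum(block: bytes) -> bytes:
    # stand-in for the per-block checksum ZFS already computes
    return hashlib.sha256(block).digest()

def build_mini_ddt(blocks):
    """Sender pass 1: hash every block in the would-be stream."""
    return {checksum(b) for b in blocks}

def receiver_diff(mini_ddt, receiver_ddt):
    """Receiver: report the checksums it already holds (not needed)."""
    return mini_ddt & receiver_ddt

def strip_stream(blocks, not_needed):
    """Sender pass 2: re-walk the blocks, dropping ones the receiver has.

    This second pass is exactly the 'double-processing' objection above:
    the sender can't emit anything until the round-trip completes.
    """
    return [b for b in blocks if checksum(b) not in not_needed]

sender_blocks = [b"block-A", b"block-B", b"block-C"]
receiver_ddt = build_mini_ddt([b"block-B", b"block-Z"])
skip = receiver_diff(build_mini_ddt(sender_blocks), receiver_ddt)
print(strip_stream(sender_blocks, skip))  # only block-A and block-C get sent
```

Even in this toy form, the costs Erik lists are visible: a full extra hashing pass, a synchronous exchange sized by the whole stream's checksum table, and a second walk over the data before the first byte goes out.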
Re: [zfs-discuss] questions about the DDT and other things
On 12/1/2011 6:44 PM, Ragnar Sundblad wrote:
> Thanks for your answers!
>
> On 2 dec 2011, at 02:54, Erik Trimble wrote:
>> On 12/1/2011 4:59 PM, Ragnar Sundblad wrote:
>>> 1. It has been said that when the DDT entries, some 376 bytes or so, are rolled out on L2ARC, there still is some 170 bytes in the ARC to reference them (or rather the ZAP objects, I believe). In some places it sounds like those 170 bytes refer to ZAP objects that contain several DDT entries. In other cases it sounds like for each DDT entry in the L2ARC there must be one 170-byte reference in the ARC. What is the story here really?
>>
>> Yup. Each entry (not just a DDT entry, but any cached reference) in the L2ARC requires a pointer record in the ARC, so the DDT entries held in L2ARC also consume ARC space. It's a bad situation.
>
> Yes, it is a bad situation. But how many DDT entries can there be in each ZAP object? Some have suggested a 1:1 relationship; others have suggested that it isn't.

I'm pretty sure it's NOT 1:1, but I'd have to go look at the code. In any case, it's not a very big number, so you're still looking at the same O(n) as the number of DDT entries (n).

>>> 2. Deletion with dedup enabled is a lot heavier for some reason that I don't understand. It is said that the DDT entries have to be updated for each deleted reference to that block. Since zfs already has a mechanism for sharing blocks (for example with snapshots), I don't understand why the DDT has to contain any more block references at all, or why deletion should be much harder just because there are checksums (DDT entries) tied to those blocks, and even if they have to, why it would be much harder than the other block-reference mechanism. If anyone could explain this (or give me a pointer to an explanation), I'd be very happy!
>>
>> Remember that, when using dedup, each block can potentially be part of a very large number of files. So, when you delete a file, you have to go look at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the appropriate DDT updates. It's essentially the same problem that erasing snapshots has - for each block you delete, you have to find and update the metadata for all the other files that share that block's usage. Dedup and snapshot deletion share the same problem; it's just usually worse for dedup, since there's a much larger number of blocks that have to be updated.
>
> What is it that must be updated in the DDT entries - a ref count? And how does that differ from the snapshot case, which seems like a very similar mechanism?

It is similar to the snapshot case, in that the block itself has a reference count in its structure (for use in both dedup and snapshots) that would get updated upon delete, but you also have to consider that the DDT entry itself, which is a separate structure from the block structure, also has to be updated. That's a whole extra IOP to get at that additional structure. So, more or less, a dedup delete has to do two operations for every one that a snapshot delete does.

>> Plus, the problem is that you really need to have the entire DDT in some form of high-speed random-access memory in order for things to be efficient. If you have to search the entire hard drive to get the proper DDT entry every time you delete a block, then your IOPS limits are going to get hammered hard.
>
> Indeed!
>
>>> 3. I, as many others, would of course like to be able to have very large datasets deduped without having to have enormous amounts of RAM. Since the DDT is an AVL tree, couldn't just that entire tree be cached on, for example, an SSD and be searched there without necessarily having to store anything of it in RAM?
>>
>> L2ARC typically sits on an SSD, and the DDT is usually held there, if the L2ARC device exists.
>
> Well, it rather seems to be ZAP objects, referenced from the ARC, which happen to contain DDT entries, that are in the L2ARC. I mean that you could just move the entire AVL tree onto the SSD, completely outside of zfs if you will, and have it be searched there, not dependent on what is in RAM at all. Every DDT lookup would take up to [tree depth] number of reads, but that could be OK if you have an SSD which is fast on reading (which many are).

ZFS currently treats all metadata (of which DDT entries are a part) and data slabs the same when it comes to choosing to migrate them from ARC to L2ARC, so the most-frequently-accessed info is in the ARC (regardless of what that info is), and everything else sits in the L2ARC.
Re: [zfs-discuss] questions about the DDT and other things
On 12/1/2011 4:59 PM, Ragnar Sundblad wrote: I am sorry if these are dumb questions. If there are explanations available somewhere for those questions that I just haven't found, please let me know! :-) 1. It has been said that when the DDT entries, some 376 bytes or so, are rolled out on L2ARC, there still is some 170 bytes in the ARC to reference them (or rather the ZAP objects I believe). In some places it sounds like those 170 bytes refers to ZAP objects that contain several DDT entries. In other cases it sounds like for each DDT entry in the L2ARC there must be one 170 byte reference in the ARC. What is the story here really? Yup. Each entry (not just a DDT entry, but any cached reference) in the L2ARC requires a pointer record in the ARC, so the DDT entries held in L2ARC also consume ARC space. It's a bad situation. 2. Deletion with dedup enabled is a lot heavier for some reason that I don't understand. It is said that the DDT entries have to be updated for each deleted reference to that block. Since zfs already have a mechanism for sharing blocks (for example with snapshots), I don't understand why the DDT has to contain any more block references at all, or why deletion should be much harder just because there are checksums (DDT entries) tied to those blocks, and even if they have to, why it would be much harder than the other block reference mechanism. If anyone could explain this (or give me a pointer to an explanation), I'd be very happy! Remember that, when using Dedup, each block can potentially be part of a very large number of files. So, when you delete a file, you have to go look at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the appropriate DDT updates. It's essentially the same problem that erasing snapshots has - for each block you delete, you have to find and update the metadata for all the other files that share that block usage. 
Dedup and snapshot deletion share the same problem, it's just usually worse for dedup, since there's a much larger number of blocks that have to be updated. The problem is that you really need to have the entire DDT in some form of high-speed random-access memory in order for things to be efficient. If you have to search the entire hard drive to get the proper DDT entry every time you delete a block, then your IOPs limits are going to get hammered hard. 3. I, as many others, would of course like to be able to have very large datasets deduped without having to have enormous amounts of RAM. Since the DDT is a AVL tree, couldn't just that entire tree be cached on for example a SSD and be searched there without necessarily having to store anything of it in RAM? That would probably require some changes to the DDT lookup code, and some mechanism to gather the tree to be able to lift it over to the SSD cache, and some other stuff, but still that sounds - with my very basic (non-)understanding of zfs - like a not to overwhelming change. L2ARC typically sits on an SSD, and the DDT is usually held there, if the L2ARC device exists. There does need to be serious work on changing how the DDT in the L2ARC is referenced, however; the ARC memory requirements for DDT-in-L2ARC definitely need to be removed (which requires a non-trivial rearchitecting of dedup). There are some other changes that have to happen for Dedup to be really usable. Unfortunately, I can't see anyone around willing to do those changes, and my understanding of the code says that it is much more likely that we will simply remove and replace the entire dedup feature rather than trying to fix the existing design. 4. Now and then people mention that the problem with bp_rewrite has been explained, on this very mailing list I believe, but I haven't found that explanation. Could someone please give me a pointer to that description (or perhaps explain it again :-) )? Thanks for any enlightenment! 
/ragge bp_rewrite refers to the (as yet unimplemented) system call of the same name, which does Block Pointer re-writing. That is, it would allow ZFS to change the physical location on media of an existing ZFS data slab. In other words, bp_rewrite is necessary to allow ZFS to change the Physical layout of data on media without changing the Conceptual arrangement of that data. It's been the #1 most-wanted feature of ZFS for as long as I can remember, probably 10 years now. -Erik ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
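To put the ARC-pointer overhead Erik describes into numbers, here is a back-of-envelope sketch. The per-entry byte counts (~376 bytes per DDT entry, ~170 bytes of ARC pointer per L2ARC-resident entry) are the figures quoted in the question; the pool size and record size are assumptions chosen purely for illustration:

```shell
# Rough DDT memory sizing, using the per-entry figures quoted in the
# thread (~376 bytes per DDT entry in L2ARC, ~170 bytes of ARC pointer
# per L2ARC-resident entry). Pool and record sizes are assumptions.
POOL_BYTES=$((1024 * 1024 * 1024 * 1024))   # 1 TB of unique data
RECORD=$((128 * 1024))                       # 128K default recordsize
ENTRIES=$((POOL_BYTES / RECORD))             # one DDT entry per block
L2ARC_DDT_MB=$((ENTRIES * 376 / 1024 / 1024))
ARC_PTR_MB=$((ENTRIES * 170 / 1024 / 1024))
echo "DDT entries:              $ENTRIES"
echo "L2ARC consumed by DDT:    ${L2ARC_DDT_MB} MB"
echo "ARC consumed by pointers: ${ARC_PTR_MB} MB"
```

Shrink the record size to 8K and the entry count grows 16-fold - roughly 47 GB of L2ARC and 21 GB of ARC for the same 1 TB of unique data - which is why dedup of small-record datasets is so punishing.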
Re: [zfs-discuss] Poor relative performance of SAS over SATA drives
It occurs to me that your filesystems may not be in the same state. That is, destroy both pools, recreate them, and run the tests. This will eliminate any possibility of allocation issues. -Erik On 10/27/2011 10:37 AM, weiliam.hong wrote: Hi, Thanks for the replies. In the beginning, I only had SAS drives installed when I observed the behavior; the SATA drives were added later for comparison and troubleshooting. The slow behavior is observed only after 10-15 mins of running dd, where the file size is about 15GB; the throughput drops suddenly from 70 to 50 to 20 to <10MB/s in a matter of seconds and never recovers. This couldn't be right no matter how I look at it. Regards, WL On 10/27/2011 9:59 PM, Brian Wilson wrote: On 10/27/11 07:03 AM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of weiliam.hong 3. All 4 drives are connected to a single HBA, so I assume the mpt_sas driver is used. Are SAS and SATA drives handled differently? If they're all on the same HBA, they may be all on the same bus. It may be *because* you're mixing SATA and SAS disks on the same bus. I'll suggest separating the tests - don't run them concurrently - and see if there's any difference. Also, the HBA might have different defaults for SAS vs SATA; look in the HBA to see if the write-back / write-through settings are the same... I don't know if the HBA gives you some way to enable/disable the on-disk cache, but take a look and see. Also, maybe the SAS disks are only doing SATA. If the HBA is only able to do SATA, then SAS disks will work, but might not work as optimally as they would if they were connected to a real SAS HBA. And one final thing - if you're planning to run ZFS (as I suspect you are, posting on this list running OI) ... it actually works *better* without any HBA.* *Footnote: ZFS works the worst if you have ZIL enabled, no log device, and no HBA. 
It's a significant improvement if you add a battery-backed or nonvolatile HBA with writeback. It's a significant improvement again if you get rid of the HBA and add a log device. It's a significant improvement yet again if you get rid of the HBA and log device, and run with ZIL disabled (if your workload is compatible with a disabled ZIL). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss First, ditto everything Edward says above. I'd add that your "dd" test creates a lot of straight sequential IO, not anything that's likely to be random IO. I can't speak to why your SAS might not be performing any better than Edward did, but your SATAs are probably screaming on straight sequential IO, where on something more random I would bet they won't perform as well as they do in this test. The tool I've seen used for that sort of testing is iozone - I'm sure there are others as well, and I can't attest to what's better or worse. cheers, Brian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
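For the random-vs-sequential comparison Brian suggests, a minimal iozone sketch follows. The mount point is hypothetical, the file sizes are arbitrary, and the guard makes the script a no-op on systems without iozone installed:

```shell
# Compare sequential vs. random I/O on the pool under test (sketch).
# -i 0 = write/rewrite, -i 1 = sequential read, -i 2 = random read/write
# (-i 0 must always be included so the test file exists); -O reports
# results in operations/sec rather than KB/sec.
command -v iozone >/dev/null 2>&1 || { echo "iozone not installed"; exit 0; }
cd /tank/testfs || exit 1                # hypothetical test filesystem
iozone -i 0 -i 1 -r 128k -s 2g           # sequential, 128K records
iozone -i 0 -i 2 -r 8k -s 2g -O          # random 8K, reported in ops/sec
```

Comparing the two runs separates streaming bandwidth (where the SATA drives look heroic) from random IOPS (where they usually don't).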
Re: [zfs-discuss] Thumper (X4500), and CF SSD for L2ARC = ?
On 10/14/2011 5:49 AM, Darren J Moffat wrote: On 10/14/11 13:39, Jim Klimov wrote: Hello, I was asked if the CF port in Thumpers can be accessed by the OS? In particular, would it be a good idea to use a modern 600x CF card (some reliable one intended for professional photography) as an L2ARC device using this port? I don't know about the Thumpers internal CF slot. I can say I have tried using a fast (at the time, this was about 3 years ago) CF card via a CF to IDE adaptor before and it turned out to be a really bad idea because the spinning rust disk (which was SATA) was actually faster to access. Same went for USB to CF adaptors at the time too. Last I'd checked, the CF port was fully functional. However, I'd not use it as L2ARC (and, certainly not ZIL). CF is not good in terms of either random write or read - professional-grade CF cards are optimized for STREAMING write - that is, the ability to write a big-ass JPG or BMP or TIFF as quickly as possible. The CF controller isn't good on lots of little read/write ops. In Casper's case, the CF->IDE adapter makes this even worse, since IDE is spectacularly bad at IOPS. I can't remember - does the X4500 have any extra SATA ports free on the motherboard? And, does it have any extra HD power connectors? http://www.amazon.com/dp/B002MWDRD6/ref=asc_df_B002MWDRD61280186?smid=A2YLYLTN75J8LR&tag=shopzilla_mp_1382-20&linkCode=asn&creative=395105&creativeASIN=B002MWDRD6 Is a great way to add a 2.5" drive slot, but it's just a physical slot adapter - you need to attach a standard SATA cable and HD power connector to it. If that's not an option, find yourself a cheap PCI-E adapter with eSATA ports on it, then use an external HD enclosure with eSATA for a small SSD. As a last resort, remove one of the 3.5" SATA drives, and put in an SSD in a 2.5"->3.5" converter enclosure. Remember, you can generally get by fine with a lower-end SSD as L2ARC, so a 60GB SSD should be $100 or less. 
-Erik ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] All (pure) SSD pool rehash
On 9/27/2011 10:39 AM, Bob Friesenhahn wrote: On Tue, 27 Sep 2011, Matt Banks wrote: Also, maybe I read it wrong, but why is it that (in the previous thread about hw raid and zpools) zpools with large numbers of physical drives (eg 20+) were frowned upon? I know that ZFS!=WAFL There is no concern with a large number of physical drives in a pool. The primary concern is with the number of drives per vdev. Any variation in the latency of the drives hinders performance, and each I/O to a vdev consumes 1 "IOP" across all of the drives in the vdev (or stripe) when raidzN is used. Having more vdevs is better for consistent performance and more available IOPS. Bob To expound just a bit on Bob's reply: the reason that large numbers of disks in a RAIDZ* vdev are frowned upon has to do with the fact that IOPS for a RAIDZ vdev are pretty much O(C) - constant - regardless of how many disks are in the actual vdev. So, the IOPS throughput of a 20-disk vdev is the same as that of a 5-disk vdev. Streaming throughput is significantly higher (i.e. it scales as O(N)), but you're unlikely to get that for the vast majority of workloads. Given that resilvering a RAIDZ* is IOPS-bound, you quickly run into the situation where the time to resilver X amount of data on a 5-drive RAIDZ is the same as on a 30-drive RAIDZ. Given that you're highly likely to store much more data on a larger vdev, your resilver time to replace a drive goes up linearly with the number of drives in a RAIDZ vdev. 
This leads to this situation: if I have 20 x 1TB drives, here are several possible configurations and the relative resilver times (relative, because without knowing the exact configuration of the data itself, I can't estimate wall-clock resilver times):

(a) 5 x 4-disk RAIDZ: 15TB usable, takes N amount of time to replace a failed disk
(b) 4 x 5-disk RAIDZ: 16TB usable, takes 1.25N time to replace a disk
(c) 2 x 10-disk RAIDZ: 18TB usable, takes 2.5N time to replace a disk
(d) 1 x 20-disk RAIDZ: 19TB usable, takes 5N time to replace a disk

Notice that by doubling the number of drives in a RAIDZ, you double the resilver time for the same amount of data in the zpool. The above also applies to RAIDZ[23], as the additional parity disk doesn't materially impact resilver times in either direction (and, yes, it's not really a "parity disk"; I'm just being sloppy). Also, the other main reason is that larger numbers of drives in a single vdev mean there is a higher probability that multiple disk failures will result in loss of data. Richard Elling had some data on the exact calculations, but it boils down to the fact that your chance of total data loss from multiple drive failures goes up MORE THAN LINEARLY as you add drives to a vdev. Thus, a 1 x 10-disk RAIDZ has well over 2x the chance of failure that a 2 x 5-disk RAIDZ zpool has. -Erik ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
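Erik's relative resilver times follow directly from the IOPS-bound argument: with the total data fixed, the data per vdev (and hence resilver time) scales with the inverse of the vdev count. A sketch of the arithmetic for 20 x 1TB drives, taking the 5 x 4-disk layout as the baseline N:

```shell
# Relative resilver times for 20 x 1TB drives split into single-parity
# RAIDZ vdevs of width W, reproducing the (a)-(d) figures above.
# Resilver time is proportional to data per vdev, i.e. 1/(vdev count).
DRIVES=20
for W in 4 5 10 20; do
    VDEVS=$((DRIVES / W))
    USABLE=$((VDEVS * (W - 1)))        # TB usable: one parity disk/vdev
    REL=$((5 * 100 / VDEVS))           # resilver vs. 5-vdev baseline, x100
    echo "${VDEVS} x ${W}-disk raidz: ${USABLE} TB usable, resilver ~$((REL / 100)).$((REL % 100))N"
done
```

The usable-capacity gain from wider vdevs is marginal (15 to 19 TB) while the resilver penalty is 5x, which is the whole argument in two columns.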
[zfs-discuss] I'm back!
Hi folks. I'm now no longer at Oracle, and the past couple of weeks have been a bit of a mess for me as I disentangle myself from it. I apologize to those who may have tried to contact me during August, as my @oracle.com email is no longer being read by myself, and I didn't have a lot of extra time to devote to things like making sure my email subscription lists pointed to my personal email. I've done that now. I now have a free(er) hand to do some work in IllumOS (hopefully, in ZFS in particular), so I'm looking forward to getting back into the swing of things. And, hopefully, not be too much of a PITA. :-) -Erik Trimble tr...@netdemons.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NexentaCore 3.1 - ZFS V. 28
On 7/31/2011 4:29 AM, Eugen Leitl wrote: On Sat, Jul 30, 2011 at 12:56:38PM +0200, Eugen Leitl wrote: apt-get update apt-clone upgrade Any first impressions? I finally came around to installing NexentaCore 3.1 along with napp-it and AMP on an HP N36L with 8 GBytes RAM. I'm testing it with 4x 1 and 1.5 TByte consumer SATA drives (Seagate) with raidz2 and raidz3 and like what I see so far. Given http://opensolaris.org/jive/thread.jspa?threadID=139315 I've ordered an Intel 311 series for ZIL/L2ARC. I hope to use the above with 4x 3 TByte Hitachi Deskstar 5K3000 HDS5C3030ALA630 drives, given the data from Backblaze in regard to their reliability. For the above layout (8 GByte RAM, 4x 3 TByte as raidz2), I should go with 4 GByte for slog and 16 GByte for L2ARC, right? Is it possible to attach slog/L2ARC to a pool after the fact? I'd rather not wear out the small SSD with ~5 TByte of avoidable writes. Yes. You can attach a ZIL or L2ARC device anytime after the pool is created. Also, I think you want an Intel 320, NOT the 311, for use as a ZIL. The 320 includes capacitors, so if you lose power, your ZIL doesn't lose data. The 311 DOESN'T include capacitors. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
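For reference, attaching the devices after the fact is one command each. A sketch with hypothetical pool and slice names - here the SSD is assumed to be partitioned into a 4 GByte slice for the slog and a 16 GByte slice for L2ARC, matching the layout Eugen describes:

```shell
# Hypothetical names: pool "tank", SSD slices c2t1d0s0 (4 GB slog)
# and c2t1d0s1 (16 GB L2ARC). Verify your device names before running.
zpool add tank log c2t1d0s0      # dedicated ZIL (slog)
zpool add tank cache c2t1d0s1    # L2ARC
zpool status tank                # both should now appear in the layout
```

The cache device can later be dropped with zpool remove; log device removal requires pool version 19 or later, which NexentaCore 3.1's v28 pools satisfy.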
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On 7/25/2011 8:03 AM, Orvar Korvar wrote: "There is at least a common perception (misperception?) that devices cannot process TRIM requests while they are 100% busy processing other tasks." Just to confirm; SSD disks can do TRIM while processing other tasks? I heard that Illumos is working on TRIM support for ZFS and will release something soon. Anyone knows more? SSDs do Garbage Collection when the controller has spare cycles. I'm not certain if there is a time factor (i.e. is it periodic, or just when there's time in the controller's queue). So, theoretically, TRIM helps GC when the drive is at low utilization, but not when the SSD is under significant load. Under high load, the SSD doesn't have the luxury of searching the NAND for "unused" blocks, aggregating them, writing them to a new page, and then erasing the old location. It has to allocate stuff NOW, so it goes right to the dreaded read-modify-erase-write cycle. Even under high load, the SSD can "process" the TRIM request (i.e. it will mark a block as unused), but that's not useful until a GC is performed (unless you are so lucky as to mark an *entire* page as unused), so, it doesn't really matter. The GC run is what "fixes" the NAND allocation, not the TRIM command itself. I can't speak for the ZFS developers as to TRIM support. I *believe* this would have to happen both at the device level and the filesystem level. But, I certainly could be wrong. (IllumOS currently supports TRIM in the SATA framework - not sure about the SAS framework) -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On 7/25/2011 4:49 AM, joerg.schill...@fokus.fraunhofer.de wrote: Erik Trimble wrote: On 7/25/2011 3:32 AM, Orvar Korvar wrote: How long have you been using a SSD? Do you see any performance decrease? I mean, ZFS does not support TRIM, so I wonder about long term effects... Frankly, for the kind of use that ZFS puts on a SSD, TRIM makes no impact whatsoever. TRIM is primarily useful for low-volume changes - that is, for a filesystem that generally has few deletes over time (i.e. rate of change is low). Using a SSD as a ZIL or L2ARC device puts a very high write load on the device (even as an L2ARC, there is a considerably higher write load than a "typical" filesystem use). SSDs in such a configuration can't really make use of TRIM, and depend on the internal SSD controller block re-allocation algorithms to improve block layout. Now, if you're using the SSD as primary media (i.e. in place of a Hard Drive), there is a possibility that TRIM could help. I honestly can't be sure that it would help, however, as ZFS's Copy-on-Write nature means that it tends to write entire pages of blocks, rather than just small blocks. Which is fine from the SSD's standpoint. Writing to an SSD is: clear + write + verify. As the SSD cannot know that the rewritten blocks have been unused for a while, it cannot handle the clear operation at a time when there is no interest in the block; the TRIM command is needed to give this knowledge to the SSD. Jörg Except that in many cases with ZFS, that data is irrelevant by the time it can be used, or is much less useful than with other filesystems. Copy-on-Write tends to end up with whole SSD pages of blocks being rendered "unused", rather than individual blocks inside pages. So, the SSD can often avoid the read-erase-modify-write cycle, and just do erase-write instead. TRIM *might* help somewhat when you have a relatively quiet ZFS filesystem, but I'm not really convinced of how much of a benefit it would be. 
As I've mentioned in other posts, ZIL and L2ARC are too "hot" for TRIM to have any noticeable impact - the SSD is constantly being used, and has no time for GC. It's stuck in the read-erase-modify-write cycle even with TRIM. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On 7/25/2011 4:28 AM, Tomas Ögren wrote: On 25 July, 2011 - Erik Trimble sent me these 2,0K bytes: On 7/25/2011 3:32 AM, Orvar Korvar wrote: How long have you been using a SSD? Do you see any performance decrease? I mean, ZFS does not support TRIM, so I wonder about long term effects... Frankly, for the kind of use that ZFS puts on a SSD, TRIM makes no impact whatsoever. TRIM is primarily useful for low-volume changes - that is, for a filesystem that generally has few deletes over time (i.e. rate of change is low). Using a SSD as a ZIL or L2ARC device puts a very high write load on the device (even as an L2ARC, there is a considerably higher write load than a "typical" filesystem use). SSDs in such a configuration can't really make use of TRIM, and depend on the internal SSD controller block re-allocation algorithms to improve block layout. Now, if you're using the SSD as primary media (i.e. in place of a Hard Drive), there is a possibility that TRIM could help. I honestly can't be sure that it would help, however, as ZFS's Copy-on-Write nature means that it tends to write entire pages of blocks, rather than just small blocks. Which is fine from the SSD's standpoint. You still need the flash erase cycle. On a related note: I've been using an OCZ Vertex 2 as my primary drive in a laptop, which runs Windows XP (no TRIM support). I haven't noticed any dropoff in performance in the year it's been in service. I'm doing typical productivity laptop-ish things (no compiling, etc.), so it appears that the internal SSD controller is more than smart enough to compensate even without TRIM. Honestly, I think TRIM isn't really useful for anyone. It took too long to get pushed out to the OSes, and the SSD vendors seem to have just compensated by making a smarter controller able to do better reallocation. Which, to me, is the better idea, in any case. Bullshit. I just got an OCZ Vertex 3, and the first fill was 450-500MB/s. Second and subsequent fills are at half that speed. 
I'm quite confident that it's due to the flash erase cycle that's needed, and if stuff can be TRIM:ed (and thus flash erased as well), speed would be regained. Overwriting a previously used block requires a flash erase, and if that can be done in the background when the timing is not critical, instead of just before you can actually write the block you want, performance will increase. /Tomas I should have been more clear: I consider the "native" speed of an SSD to be that which is obtained AFTER you've filled the entire drive once. That is, once you've blown through the extra reserve NAND, and are now into the full read/erase/write cycle. IMHO, that's really what the sustained performance of an SSD is, not the bogus numbers reported by vendors. TRIM is really only useful for drives which have a low enough load factor to do background GC on unused blocks. For ZFS, that *might* be the case when the SSD is used as primary backing store, but certainly isn't the case when it's used as ZIL or L2ARC. Even with TRIM, performance after a complete fill of the SSD will drop noticeably, as the SSD has to do GC sometime. You might not notice it right away given your usage pattern, but, with OR without TRIM, a "used" SSD under load will perform the same. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On 7/25/2011 6:43 AM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Erik Trimble Honestly, I think TRIM isn't really useful for anyone. I'm going to have to disagree. There are only two times when TRIM isn't useful: 1) Your demand of the system is consistently so low that it never adds up to anything meaningful... Basically you always have free unused blocks so adding more unused blocks to the pile doesn't matter at all, or you never bother to delete anything... Or it's just a lightweight server processing requests where network latency greatly outweighs any disk latency, etc. AKA your demand is very low. or 2) Your demand of the system is consistently so high that even with TRIM, the device would never be able to find any idle time to perform an erase cycle on blocks marked for TRIM. In case #2, it is at least theoretically possible for devices to become smart enough to process the TRIM block erasures in parallel even while there are other operations taking place simultaneously. I don't know if device mfgrs implement things that way today. There is at least a common perception (misperception?) that devices cannot process TRIM requests while they are 100% busy processing other tasks. Or your disk is always 100% full. I guess that makes 3 cases, but the 3rd one is esoteric. What I'm saying is that #2 occurs all the time with ZFS, at least as a ZIL or L2ARC. TRIM is really only useful when the SSD has some "downtime" to work. As a ZIL or L2ARC, the SSD *has* no pauses, and can't do GC in the background usefully (which is what TRIM helps). Instead, what I've seen is that the increased "smarts" of the new generation SSD controllers do a better job of on-the-fly reallocation. 
-- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On 7/25/2011 3:32 AM, Orvar Korvar wrote: How long have you been using a SSD? Do you see any performance decrease? I mean, ZFS does not support TRIM, so I wonder about long term effects... Frankly, for the kind of use that ZFS puts on a SSD, TRIM makes no impact whatsoever. TRIM is primarily useful for low-volume changes - that is, for a filesystem that generally has few deletes over time (i.e. rate of change is low). Using a SSD as a ZIL or L2ARC device puts a very high write load on the device (even as an L2ARC, there is a considerably higher write load than a "typical" filesystem use). SSDs in such a configuration can't really make use of TRIM, and depend on the internal SSD controller block re-allocation algorithms to improve block layout. Now, if you're using the SSD as primary media (i.e. in place of a Hard Drive), there is a possibility that TRIM could help. I honestly can't be sure that it would help, however, as ZFS's Copy-on-Write nature means that it tends to write entire pages of blocks, rather than just small blocks. Which is fine from the SSD's standpoint. On a related note: I've been using an OCZ Vertex 2 as my primary drive in a laptop, which runs Windows XP (no TRIM support). I haven't noticed any dropoff in performance in the year it's been in service. I'm doing typical productivity laptop-ish things (no compiling, etc.), so it appears that the internal SSD controller is more than smart enough to compensate even without TRIM. Honestly, I think TRIM isn't really useful for anyone. It took too long to get pushed out to the OSes, and the SSD vendors seem to have just compensated by making a smarter controller able to do better reallocation. Which, to me, is the better idea, in any case. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pure SSD Pool
FYI - virtually all non-super-low-end SSDs are already significantly over-provisioned, for GC and scratch use inside the controller. In fact, the only difference between the OCZ "extended" models and the non-extended models (e.g. Vertex 2 50G (OCZSSD2-2VTX50G) and Vertex 2 Extended 60G (OCZSSD2-2VTXE60G)) is the amount of extra flash dedicated to scratch. Both the aforementioned drives have 64G of flash chips - it's just that the 50G one uses significantly more for scratch, and thus will perform better under heavy use. Over-provisioning at the filesystem level is unlikely to significantly improve things, as the SSD controller generally only uses what it considers "scratch" as such - that is, while not using 10G at the filesystem level might seem useful, my understanding of SSD controller usage patterns is that this generally isn't that much of a performance gain. E.g. you'd be better off buying the 50G Vertex 2 and fully using it than the 60G model and only using 50G on it. -Erik On Tue, 2011-07-12 at 10:10 -0700, Henry Lau wrote: > It is hard to say, 90% or 80%. SSD has already reserved overprovisioning > places > for garbage collection and wear leveling. The OS level only knows file LBA, > not > the physical LBA mapping to flash pages/block. Uberblock updates and COW from > ZFS will use a new page/block each time. A TRIM command from ZFS level should > be > a better solution but RAID is still a problem for TRIM at the OS level. > > Henry > > > > - Original Message From: Jim Klimov > Cc: ZFS Discussions > Sent: Tue, July 12, 2011 4:18:28 AM > Subject: Re: [zfs-discuss] Pure SSD Pool > > 2011-07-12 9:06, Brandon High wrote: > > On Mon, Jul 11, 2011 at 7:03 AM, Eric Sproul wrote: > >> Interesting-- what is the suspected impact of not having TRIM support? > > There shouldn't be much, since zfs isn't changing data in place. 
Any > > drive with reasonable garbage collection (which is pretty much > > everything these days) should be fine until the volume gets very full. > > I wonder if in this case it would be beneficial to slice i.e. 90% > of an SSD for use in ZFS pool(s) and leave the rest of the > disk unassigned to any partition or slice? This would reserve > some sectors as never-written-to-by-OS. Would this ease the > life for SSD devices without TRIM between them and the OS? > > Curious, > //Jim > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Erik Trimble Java Platform Group - Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
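The spare-area difference Erik describes for the two Vertex 2 models can be roughed out numerically - both carry 64G of raw flash and differ only in exposed capacity (integer GB figures here, ignoring the GB/GiB marketing wrinkle):

```shell
# Approximate over-provisioning for the two OCZ Vertex 2 models named
# above: same 64G of raw flash, different exposed capacity.
RAW=64
for EXPOSED in 50 60; do
    SPARE=$((RAW - EXPOSED))
    PCT=$((SPARE * 100 / RAW))
    echo "${EXPOSED}G model: ${SPARE} GB reserve (~${PCT}% of flash)"
done
```

Which is why the 50G model holds up better under sustained load: its controller has roughly 3.5x the scratch area available for garbage collection.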
Re: [zfs-discuss] Move rpool from external hd to internal hd
On 6/29/2011 12:51 AM, Stephan Budach wrote: Hi, what are the steps necessary to move the OS rpool from an external USB drive to an internal drive? I thought about adding the internal hd as a mirror to the rpool and then detaching the USB drive, but I am unsure if I'll have to mess with Grub as well. Cheers, budy -- Yup. Look at the 'installgrub' man page. It's rather straightforward, particularly since you're already booting off the USB drive. Remember that the boot drive can't be the whole disk - you have to partition it, and then mirror onto the partition. E.g. if your internal drive is c1t0d0, you'll have to use c1t0d0s0. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
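The full sequence Erik describes, sketched with hypothetical device names (USB rpool on c3t0d0s0, internal disk already carrying an SMI label with a slice 0):

```shell
# Mirror the rpool onto the internal slice, wait for the resilver,
# make the new disk bootable, then drop the USB side.
# Device names here are examples only - substitute your own.
zpool attach rpool c3t0d0s0 c1t0d0s0
zpool status rpool          # repeat until the resilver shows complete
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t0d0s0
zpool detach rpool c3t0d0s0
```

After the detach, point the BIOS boot order at the internal disk before removing the USB drive.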
Re: [zfs-discuss] Encryption accelerator card recommendations.
On 6/27/2011 1:13 PM, David Magda wrote: On Mon, June 27, 2011 15:24, Erik Trimble wrote: [...] I'm always kind of surprised that there hasn't been a movement to create standardized crypto commands, like the various FP-specific commands that are part of MMX/SSE/etc. That way, most of this could be done in hardware seamlessly. The (Ultra)SPARC T-series processors do, but to a certain extent it goes against a CPU manufacturer's best (financial) interest to provide this: crypto is very CPU intensive using 'regular' instructions, so if you need to do a lot of it, it would force you to purchase a manufacturer's top-of-the-line CPUs, and to have as many sockets as you can to handle the load (and presumably you need to do "useful" work besides just en/decrypting traffic). If you have special instructions that do the operations efficiently, it means that you're not chewing up cycles as much, so a less powerful (and cheaper) processor can be purchased. I'm sure all the Web 2.0 companies would love to have these (and OpenSSL linked to use the instructions), so they could simply enable HTTPS for everything. (Of course it'd also be helpful for data-at-rest, on-disk encryption as well.) The last benchmarks I saw indicated that the SPARC T-series could do 45 Gb/s AES or some such, with gobs of RSA operations as well. The T-series crypto isn't what I'm thinking of. AFAIK, you still need to use the Crypto framework in Solaris to access the on-chip functionality. Which makes the T-series no different than CPUs without a crypto module but with a crypto add-in board instead. What I'm thinking of is something along the lines of what AMD proposed a while ago, in combination with how we used to handle hardware that had FP optional. That is, you continue to make CPUs without any crypto functionality, EXCEPT that they support certain extensions a la MMX. If no Crypto accelerator was available, the CPU would trap any Crypto calls, and force them to be done in software. 
You could then stick a crypto accelerator in a second CPU socket, and the CPU would recognize that it was there, and pipe crypto calls to the dedicated co-processor. Think about how things were done with the i386 and i387. That's what I'm after. With modern CPU buses like those AMD & Intel support, plopping a "co-processor" into another CPU socket would really, really help. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Encryption accelerator card recommendations.
On 6/27/2011 9:55 AM, Roberto Waltman wrote: I recently bought an HP Proliant Microserver for a home file server. (pics and more here: http://arstechnica.com/civis/viewtopic.php?p=20968192 ) I installed 5 1.5TB (5900 RPM) drives, upgraded the memory to 8GB, and installed Solaris 11 Express without a hitch. A few simple tests using "dd" with 1gb and 2gb files showed excellent transfer rates: ~200 MB/sec on a 5 drive raidz2 pool, ~310 MB/sec on a five drive pool with no redundancy. That is, until I enabled encryption, which brought the transfer rates down to around 20 MB/sec... Obviously the CPU is the bottleneck here, and I'm wondering what to do next. I can split the storage into file systems with and without encryption and allocate data accordingly. No need, for example, to encrypt open source code, or music. But I would like to have everything encrypted by default. My concern is not industrial espionage from a hacker in Belarus, but having a disk fail and sending it for repair with my credit card statements easily readable on it, etc. I am new to (open or closed) Solaris. I found there is something called the Encryption Framework, and that there is hardware support for encryption. This server has two unused PCI-e slots, so I thought a card could be the solution, but the few I found seem to be geared to protecting SSH and VPN connections, etc., not the file system. Cost is a factor also. I could build a similar server with a much faster processor for a few hundred dollars more, so a $1000 card for a < $1000 file server is not a reasonable option. Is there anything out there I could use? Thanks, Roberto Waltman You're out of luck. The hardware-encryption device is seen as a small market by the vendors, and they price accordingly. All the solutions are FIPS-compliant, which makes them non-trivially expensive to build/test/verify. I have yet to see the "basic" crypto accelerator - which should be as simple as an FPGA with downloadable (and updateable) firmware. 
The other major cost point is the crypto plugins - sadly, there is no way to simply have the CPU farm off crypto jobs to a co-processor. That is, there's no way for the CPU to go "oh, that looks like I'm running a crypto algorithm - I should hand it over to the crypto co-processor". Instead, you have to write custom plugin/drivers/libraries for each OS, and pray that each OS has some standardized crypto framework. Otherwise, you have to link apps with custom libraries. I'm always kind of surprised that there hasn't been a movement to create standardized crypto commands, like the various FP-specific commands that are part of MMX/SSE/etc. That way, most of this could be done in hardware seamlessly. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
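On Roberto's plan to encrypt by default with carve-outs: on Solaris 11 Express this maps naturally onto ZFS dataset inheritance, since the encryption property is set at creation time and inherited by children. A sketch with hypothetical pool and dataset names:

```shell
# Hypothetical pool/dataset names. "encryption" can only be set at
# dataset creation and is inherited, so everything created under
# "secure" is encrypted by default; "media" stays cleartext/full-speed.
zfs create -o encryption=on tank/secure   # prompts for a passphrase
zfs create tank/secure/documents          # inherits encryption=on
zfs create tank/media                     # unencrypted, full throughput
zfs get -r encryption tank                # verify what got which setting
```

This keeps credit-card statements and the like under the encrypted subtree while music and source code avoid the CPU penalty entirely.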
Re: [zfs-discuss] # disks per vdev
On 6/17/2011 6:52 AM, Marty Scholes wrote: Lights. Good. Agreed. In a fit of desperation and stupidity I once enumerated disks by pulling them one by one from the array to see which zfs device faulted. On a busy array it is hard even to use the leds as indicators. It makes me wonder how large shops with thousands of spindles handle this. We pay for the brand-name disk enclosures or servers where the fault-management stuff is supported by Solaris. Including the blinky lights. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] # disks per vdev
On 6/17/2011 12:55 AM, Lanky Doodle wrote: Thanks Richard. How does ZFS enumerate the disks? In terms of listing them does it do them logically, i.e.: controller #1 (motherboard) | |--- disk1 |--- disk2 controller #3 |--- disk3 |--- disk4 |--- disk5 |--- disk6 |--- disk7 |--- disk8 |--- disk9 |--- disk10 controller #4 |--- disk11 |--- disk12 |--- disk13 |--- disk14 |--- disk15 |--- disk16 |--- disk17 |--- disk18 or is it completely random, leaving me with some trial and error to work out what disk is on what port? This is not a ZFS issue; this is a Solaris device driver issue. Solaris uses a location-based disk naming scheme, NOT the BSD/Linux style of simply incrementing the disk numbers. I.e. drives are usually named something like c#t#d# (controller number, target ID, and disk/LUN number). In most cases, the on-board controllers receive a lower controller number than any add-in adapters, and add-in adapters are enumerated in PCI ID order. However, there is no good explanation of exactly *what* number a given controller may be assigned. After receiving a controller number, disks are enumerated in ascending order by ATA ID, SCSI ID, SAS WWN, or FC WWN. The naming rules can get a bit complex. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
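For readers unused to the c#t#d# scheme, the name itself encodes the location. A small sketch of pulling it apart (the device name is made up; on a live system you would see these names from format(1M) or under /dev/dsk):

```shell
# Split a Solaris-style c#t#d# device name into its components.
parse_ctd() {
    echo "$1" | sed -n \
        's/^c\([0-9]\{1,\}\)t\([0-9]\{1,\}\)d\([0-9]\{1,\}\)$/controller=\1 target=\2 disk=\3/p'
}

parse_ctd c3t2d0   # controller 3, target (SCSI/SAS ID) 2, disk 0
```

The point is that the name stays stable for a given physical slot, which is exactly what you want when mapping pool members back to drive bays.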
Re: [zfs-discuss] question about COW and snapshots
On 6/16/2011 1:32 PM, Paul Kraus wrote: On Thu, Jun 16, 2011 at 4:20 PM, Richard Elling wrote: You can run OpenVMS :-) Since *you* brought it up (I was not going to :-), how does VMS' versioning FS handle those issues ? It doesn't, per se. VMS's filesystem has a "versioning" concept (i.e. every time you do a close() on a file, it creates a new file with the version number appended, e.g. foo;1 and foo;2 are the same file, different versions). However, it is completely missing the rest of the features we're talking about, like data *consistency* in that file. It's still up to the app using the file to figure out what data consistency means, and such. Really, all VMS adds is versioning, nothing else (no API, no additional features, etc.). I know that SAM-FS has rules for _when_ copies of a file are made, so that intermediate states are not captured. The last time I touched SAM-FS there was _not_ a nice user interface to the previous version; you had to trudge through log files and then pull the version you wanted directly from secondary storage (but they did teach us how to do that in the SAM-FS / QFS class). I'd have to look, but I *think* there is a better way to get to the file history/version information now. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] question about COW and snapshots
On 6/16/2011 12:09 AM, Simon Walter wrote: On 06/16/2011 09:09 AM, Erik Trimble wrote: We had a similar discussion a couple of years ago here, under the title "A Versioning FS". Look through the archives for the full discussion. The gist is that application-level versioning (and consistency) is completely orthogonal to filesystem-level snapshots and consistency. IMHO, they should never be mixed together - there are way too many corner cases and application-specific memes for a filesystem to ever fully handle file-level versioning and *application*-level data consistency. Don't mistake one for the other, and don't try to *use* one for the other. They're completely different creatures. I guess that is true of the current FSs available. Though it would be nice to essentially have a versioning FS in the kernel rather than an application in userspace. But I digress. I'll use SVN and WebDAV. Thanks for the advice everyone. It's not really a technical problem, it's a knowledge locality problem. The *knowledge* of where to checkpoint, where to version, and what data consistency means is held at the application level, and can ONLY be known by each individual application. There's no way a filesystem (or anything like that) can make the proper decisions without the application telling it what those decisions should be. So, what would the point be in having a "smart" versioning FS, since the intelligence can't be built into the FS - it would still have to be built into each and every application. So, if your apps have to be programmed to be versioning/consistency/checkpointing-aware in any case, how would having a fancy versioning filesystem be any better than using what we do now? (i.e. svn/hg/cvs/git on top of ZFS/btrfs/et al.) ZFS at least makes significant practical advances by rolling the logical volume manager into the filesystem level, but I can't see any such advantage for a versioning FS. 
-- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] question about COW and snapshots
We had a similar discussion a couple of years ago here, under the title "A Versioning FS". Look through the archives for the full discussion. The gist is that application-level versioning (and consistency) is completely orthogonal to filesystem-level snapshots and consistency. IMHO, they should never be mixed together - there are way too many corner cases and application-specific memes for a filesystem to ever fully handle file-level versioning and *application*-level data consistency. Don't mistake one for the other, and don't try to *use* one for the other. They're completely different creatures. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for Linux?
On 6/14/2011 12:50 PM, Roy Sigurd Karlsbakk wrote: Are there estimates on how performant and stable would it be to run VirtualBox with a Solaris-derived NAS with dedicated hardware disks, and use that from the same desktop? I did actually suggest this as a considered variant as well ;) I am going to try and build such a VirtualBox for my ailing HomeNAS as well - so it would import that iSCSI "dcpool" and try to process its defer-free blocks. At least if the hardware box doesn't stall so that a human has to be around to go and push reset, this would be a more viable solution for my repair-reboot cycles... If you want good performance and ZFS, I'd suggest using something like OpenIndiana or Solaris 11 Express or perhaps FreeBSD for the host and VirtualBox for a Linux guest if that's needed. Doing so, you'll get good I/O performance, and you can use the operating system or distro you like for the rest of the services. Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ The other option is to make sure you have a newer CPU that supports virtualized I/O. I'd have to look at the desktop CPUs, but all Intel Nehalem and later CPUs have this feature, and I'm pretty sure all AMD Magny-Cours and later CPUs do also. Without V-IO, doing anything that pounds on a disk under *any* virtualization product is sure to make you cry. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On 6/12/2011 5:08 AM, Dimitar Hadjiev wrote: I can lay them out as 4*3-disk raidz1, 3*4-disk raidz1 or a 1*12-disk raidz3 with nearly the same capacity (8-9 data disks plus parity). I see that with more vdevs the IOPS will grow - does this translate to better resilver and scrub times as well? Yes, it would translate into better resilver times, as any failure will affect only one of the vdevs, leading to a shorter parity restore time as opposed to rebuilding the whole raidz3. As for scrubbing, it would be as fast as the scrub of each vdev, since the whole pool does not have parity data to synchronize. Go look through the mail archives, and there's at least a couple of posts from me and Richard Elling (amongst others) about the workload that a resilver requires on a raidz* vdev. Essentially, "typical" usage of a vdev will result in resilver times linearly degrading with each additional DATA disk in the raidz*, as a resilver is IOPS-bound on the single replaced disk. So, a 3-disk raidz1 (2 data disks) should, on average, resilver 4.5 times faster than a 12-disk raidz3 (9 data disks). How good or bad is the expected reliability of 3*4-disk raidz1 vs 1*12-disk raidz3, so which of the tradeoffs is better - more vdevs, or more parity to survive loss of ANY 3 disks vs. the "right" 3 disks? I'd say the chances of losing a whole vdev in a 4*3 configuration equal the chances of losing 4 drives in a 1*12 raidz3 configuration - it might happen, nothing is foolproof. No, the reliability of a 1x12 raidz3 is *significantly* better than that of 4x3 raidz1 (or, frankly, ANY raidz1 configuration using 12 disks). Richard has some stats around here somewhere... basically, the math (singular, you damn Brits! 
:-) says that while a 3-disk raidz1 will certainly take less time to resilver after a loss than a 12-disk raidz3, this is more than counterbalanced by the ability of a 12-disk raidz3 to handle additional disk losses, where the 4x3 config is only *probabilistically* likely to handle a 2nd or 3rd drive failure. I'd have to re-look at the exact numbers, but I'd generally say that 2x6raidz2 vdevs would be better than either 1x12raidz3 or 4x3raidz1 (or 3x4raidz1) for a home server not looking for super-critical protection (in which case, you should be using mirrors with spares, not raidz*). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
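The linear-scaling argument above reduces to a one-line calculation, and the 4.5x figure falls straight out. This is only a model sketch, assuming (as the post does) that resilver work is IOPS-bound on the replaced disk and grows linearly with the number of data disks in the vdev:

```shell
# Expected resilver-time ratio between two raidz configs, given the
# number of DATA disks (i.e. excluding parity) in each vdev.
resilver_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1fx\n", b / a }'
}

# 3-disk raidz1 has 2 data disks; 12-disk raidz3 has 9.
resilver_ratio 2 9   # -> 4.5x
```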
Re: [zfs-discuss] SATA disk perf question
On 6/2/2011 5:12 PM, Jens Elkner wrote: On Wed, Jun 01, 2011 at 06:17:08PM -0700, Erik Trimble wrote: On Wed, 2011-06-01 at 12:54 -0400, Paul Kraus wrote: Here's how you calculate (average) how long a random IOPs takes: seek time + ((60 / RPM) / 2) A truly sequential IOPs is: (60 / RPM) / 2 For that series of drives, seek time averages 8.5ms (per Seagate). So, you get 1 Random IOPs takes [8.5ms + 4.13ms] = 12.6ms, which translates to 78 IOPS 1 Sequential IOPs takes 4.13ms, which gives 120 IOPS. Note that due to averaging, the above numbers may be slightly higher or lower for any actual workload. Nahh, shouldn't it read "numbers may be _significant_ higher or lower" ...? ;-) Regards, jel. Nope. In terms of actual, obtainable IOPS, a 7200RPM drive isn't going to be able to do more than 200 under ideal conditions, and should be able to manage 50 under anything other than the pedantically worst-case situation. That's only about a 50% deviation, not like an order of magnitude or so. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SATA disk perf question
On Wed, 2011-06-01 at 12:54 -0400, Paul Kraus wrote: > I figure this group will know better than any other I have contact > with, is 700-800 I/Ops reasonable for a 7200 RPM SATA drive (1 TB Sun > badged Seagate ST31000N in a J4400) ? I have a resilver running and am > seeing about 700-800 writes/sec. on the hot spare as it resilvers. > There is no other I/O activity on this box, as this is a remote > replication target for production data. I have the replication > disabled until the resilver completes. > > Solaris 10U9 > zpool version 22 > Server is a T2000 > Here's how you calculate (average) how long a random IOPs takes: seek time + ((60 / RPM) / 2) A truly sequential IOPs is: (60 / RPM) / 2 For that series of drives, seek time averages 8.5ms (per Seagate). So, you get 1 Random IOPs takes [8.5ms + 4.13ms] = 12.6ms, which translates to 78 IOPS 1 Sequential IOPs takes 4.13ms, which gives 120 IOPS. Note that due to averaging, the above numbers may be slightly higher or lower for any actual workload. In your case, since ZFS does write aggregation (turning multiple write requests into a single larger one), you might see what appears to be more than the above number from something like 'iostat', which is measuring not the *actual* writes to physical disk, but the *requested* write operations. -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
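Erik's average-case model is easy to replay numerically. Note the half-rotation term at 7200 RPM actually works out to ~4.17ms (close to the 4.13ms quoted), and the resulting random-IOPS figure matches the post:

```shell
# Average random-IOP latency = average seek time + half a rotation,
# per the model in the post above.
random_iops() {
    awk -v seek_ms="$1" -v rpm="$2" 'BEGIN {
        rot_ms = (60 / rpm / 2) * 1000    # average rotational latency in ms
        t = seek_ms + rot_ms
        printf "%.1f ms -> %d IOPS\n", t, 1000 / t
    }'
}

random_iops 8.5 7200   # -> 12.7 ms -> 78 IOPS
```

Plugging in other drives (15k RPM SAS, with ~3.5ms seeks) shows why enterprise spindles post 3-4x the random IOPS of consumer SATA.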
Re: [zfs-discuss] Compatibility between Sun-Oracle Fishworks appliance zfs and other zfs implementations
On Thu, 2011-05-26 at 09:36 -0700, Freddie Cash wrote: > On Wed, May 25, 2011 at 9:30 PM, Matthew Ahrens > wrote: > On Wed, May 25, 2011 at 8:01 PM, Matt Weatherford > wrote: > pike# zpool get version internal > NAME PROPERTY VALUE SOURCE > internal version 28 default > pike# zpool get version external-J4400-12x1TB > NAME PROPERTY VALUE SOURCE > external-J4400-12x1TB version 28 default > pike# > > Can I expect to move my JBOD over to a different OS > such as FreeBSD, Illumos, or Solaris and be able to > get my data off still? (by this I mean perform a > zpool import on another platform) > > > Yes, because zpool version 28 is supported in Illumos. I'm > sure Oracle Solaris does or will soon support it too. > According to Wikipedia, "the 9-current development branch [of > FreeBSD] uses ZFS Pool version 28". > > Correct. FreeBSD 9-CURRENT (dev branch that will be released as 9.0 > at some point) as of March or April includes support for ZFSv28. > > And FreeBSD 8-STABLE (dev branch that will be released as 8.3 at some > point) has patches available to support ZFSv28 here: > http://people.freebsd.org/~mm/patches/zfs/v28/ > > > ZFS-on-FUSE for Linux currently only supports ZFSv23. > > So you can "safely" use Illumos, Nexenta, FreeBSD, etc with ZFSv28. > You can also use Solaris 11 Express, so long as you don't upgrade the > pool version (SolE includes ZFSv31). > > -- > Freddie Cash > fjwc...@gmail.com [Note: none of this is proprietary Oracle knowledge, and I do not speak as an Oracle representative in any way] The Fishworks stuff runs on a version of Solaris 11 - it's not some forked Solaris branch. So, when Solaris 11 finally ships, you can expect that the latest Solaris 11 Update + CRU (patches) should always be able to fully utilize any Fishworks-written volume. 
As pointed out above by others, however, I would never count on a Fishworks volume (particularly one which used the latest-available Fishworks software update) being able to be read on *anything* other than Solaris 11. As S11 advances, Fishworks gets updated too, so the ZFS/zpool version number advances. Solaris 11 is due out RSN, which means probably sometime before the end of the calendar year. But who knows, and Oracle hasn't officially announced a launch date for S11. -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
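The portability rule running through this thread boils down to one comparison: the pool's on-disk version must not exceed the highest version the destination OS can import. A sketch (the version numbers are illustrative; on a real system they'd come from `zpool get version <pool>` on the source and `zpool upgrade -v` on the destination):

```shell
# Can a pool at on-disk version $1 be imported on an OS whose maximum
# supported zpool version is $2?
can_import() {
    if [ "$1" -le "$2" ]; then
        echo "import OK (pool v$1 <= supported v$2)"
    else
        echo "REFUSED: pool v$1 is newer than supported v$2"
    fi
}

can_import 28 28   # v28 pool vs. FreeBSD 9 / Illumos (both support v28)
can_import 31 28   # a pool upgraded under Solaris 11 Express (v31) won't import elsewhere
```

This is why the advice above is to leave a pool at the lowest version you might ever need to import it on: `zpool upgrade` is a one-way door.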
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On 5/25/2011 4:37 AM, Frank Van Damme wrote: On 24-05-11 22:58, LaoTsao wrote: With the various forks of the open-source project (e.g. ZFS, OpenSolaris, OpenIndiana, etc.), they are all different; there is no guarantee they will be compatible. I hope at least they'll try. Just in case I want to import/export zpools between Nexenta and OpenIndiana Given the new "versioning" governing board, I think that's highly likely. However, do remember that you might not be able to import a pool from another system, simply because your system can't support the featureset. Ideally, it would be nice if you could just import the pool and use the features your current OS supports, but that's pretty darned dicey, and I'd be very happy if importing worked when both systems supported the same featureset. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On 5/24/2011 8:28 AM, Orvar Korvar wrote: The netapp lawsuit is solved. No conflicts there. Regarding ZFS, it is open under CDDL license. The leaked source code that is already open is open. Nexenta is using the open sourced version of ZFS. Oracle might close future ZFS versions, but Nexenta's ZFS is open and can not be closed. There is no threat to Nexenta from the ZFS code itself; the license that it was made available under explicitly has Oracle allow use for any patents *Oracle* might have. However, since the terms of the NetApp/Oracle suit aren't available publicly, and I seriously doubt that NetApp gave up its patent claims, it could still be feasible for NetApp to sue Nexenta or whomever for alleged violations of *NetApp's* patents in the ZFS code. That is, ZFS has no copyright infringement issues for 3rd parties. It has no patent issues from Oracle. It *could* have patent issues from NetApp. The possible impact of that is beyond my knowledge. IANAL. Nor do I speak for Oracle in any manner, official or unofficial. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 5/7/2011 6:47 AM, Edward Ned Harvey wrote: See below. Right around 400,000 blocks, dedup is suddenly an order of magnitude slower than without dedup. 40 10.7sec 136.7sec 143 MB 195 MB 80 21.0sec 465.6sec 287 MB 391 MB The interesting thing is - In all these cases, the complete DDT and the complete data file itself should fit entirely in ARC comfortably. So it makes no sense for performance to be so terrible at this level. So I need to start figuring out exactly what's going on. Unfortunately I don't know how to do that very well. I'm looking for advice from anyone - how to poke around and see how much memory is being consumed for what purposes. I know how to lookup c_min and c and c_max... But that didn't do me much good. The actual value for c barely changes at all over time... Even when I rm the file, c does not change immediately. All the other metrics from kstat ... have less than obvious names ... so I don't know what to look for... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Some minor issues that might affect the above: (1) I'm assuming you run your script repeatedly in the same pool, without deleting the pool. If that is the case, that means that a run of X+1 should dedup completely with the run of X. E.g. a run with 12 blocks will dedup the first 11 blocks with the prior run of 11. (2) can you NOT enable "verify" ? Verify *requires* a disk read before writing for any potential dedup-able block. If case #1 above applies, then by turning on dedup, you *rapidly* increase the amount of disk I/O you require on each subsequent run. E.g. the run of 10 requires no disk I/O due to verify, but the run of 11 requires 10 I/O requests, while the run of 12 requires 11 requests, etc. This will skew your results as the ARC buffering of file info changes over time. (3) fflush is NOT the same as fsync. 
If you're running the script in a loop, it's entirely possible that ZFS hasn't completely committed things to disk yet, which means that you get I/O requests to flush out the ARC write buffer in the middle of your runs. Honestly, I'd do the following for benchmarking:

i=0
while [ $i -lt 80 ]; do
    j=$((10 + i))
    ./run_your_script $j
    sync
    sleep 10
    i=$((i + 1))
done

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 5/6/2011 5:46 PM, Richard Elling wrote: On May 6, 2011, at 3:24 AM, Erik Trimble wrote: Casper and Richard are correct - RAM starvation seriously impacts snapshot or dataset deletion when a pool has dedup enabled. The reason behind this is that ZFS needs to scan the entire DDT to check to see if it can actually delete each block in the to-be-deleted snapshot/dataset, or if it just needs to update the dedup reference count. AIUI, the issue is not that the DDT is scanned - it is an AVL tree for a reason. The issue is that each reference update means that one, small bit of data is changed. If the reference is not already in ARC, then a small, probably random read is needed. If you have a typical consumer disk, especially a "green" disk, and have not tuned zfs_vdev_max_pending, then that itty bitty read can easily take more than 100 milliseconds(!) Consider that you can have thousands or millions of reference updates to do during a zfs destroy, and the math gets ugly. This is why fast SSDs make good dedup candidates. Just out of curiosity - I'm assuming that a delete works like this: (1) find the list of blocks associated with the file to be deleted (2) using the DDT, find out if any other files are using those blocks (3) delete/update any metadata associated with the file (dirents, ACLs, etc.) (4) for each block in the file (4a) if the DDT indicates there ARE other files using this block, update the DDT entry to change the refcount (4b) if the DDT indicates there AREN'T any other files, move the physical block to the free list, and delete the DDT entry In a bulk delete scenario (not just snapshot deletion), I'd presume #1 above almost always causes a random I/O request to disk, as all the relevant metadata for every (to-be-deleted) file is unlikely to be stored in ARC. If you can't fit the DDT in ARC/L2ARC, #2 above would require you to pull in the remainder of the DDT info from disk, right? #3 and #4 can be batched up, so they don't hurt that much. 
Is that a (roughly) correct deletion methodology? Or can someone give a more accurate view of what's actually going on? If it can't store the entire DDT in either the ARC or L2ARC, it will be forced to do considerable I/O to disk, as it brings in the appropriate DDT entry. Worst case for insufficient ARC/L2ARC space can increase deletion times by many orders of magnitude. E.g. days, weeks, or even months to do a deletion. I've never seen months, but I have seen days, especially for low-perf disks. I've seen an estimate of 5 weeks for removing a snapshot on a 1TB dedup pool made up of 1 disk. Not an optimal set up. :-) If dedup isn't enabled, snapshot and data deletion is very light on RAM requirements, and generally won't need to do much (if any) disk I/O. Such deletion should take milliseconds to a minute or so. Yes, perhaps a bit longer for recursive destruction, but everyone here knows recursion is evil, right? :-) -- richard You, my friend, have obviously never worshipped at the Temple of the Lambda Calculus, nor been exposed to the Holy Writ that is "Structure and Interpretation of Computer Programs" (http://mitpress.mit.edu/sicp/full-text/book/book.html). I sentence you to a semester of 6.001 problem sets, written by Prof Sussman sometime in the 1980s. (yes, I went to MIT.) -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
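The decrement-or-free logic in step (4) of the methodology above can be sketched in a few lines. This is a toy model only - the real DDT is an on-disk AVL-tree structure keyed by block checksum, and (as Richard notes) the expensive part is that touching an entry not already in ARC means a random read:

```shell
# Toy model of dedup-aware deletion: a DDT maps block checksums to
# refcounts; deleting a file decrements each, and only a refcount of
# zero moves the physical block to the free list.
out=$(awk 'BEGIN {
    ddt["blkA"] = 2; ddt["blkB"] = 1      # blkA is shared with another file
    n = split("blkA blkB", del, " ")      # blocks of the file being deleted
    for (i = 1; i <= n; i++) {
        b = del[i]
        if (--ddt[b] == 0) print b ": freed"
        else               print b ": still referenced (" ddt[b] " refs left)"
    }
}')
echo "$out"
```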
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 5/6/2011 1:37 AM, casper@oracle.com wrote: Op 06-05-11 05:44, Richard Elling schreef: As the size of the data grows, the need to have the whole DDT in RAM or L2ARC decreases. With one notable exception, destroying a dataset or snapshot requires the DDT entries for the destroyed blocks to be updated. This is why people can go for months or years and not see a problem, until they try to destroy a dataset. So what you are saying is "you with your ram-starved system, don't even try to start using snapshots on that system". Right? I think it's more like "don't use dedup when you don't have RAM". (It is not possible to not use snapshots in Solaris; they are used for everything) Casper Casper and Richard are correct - RAM starvation seriously impacts snapshot or dataset deletion when a pool has dedup enabled. The reason behind this is that ZFS needs to scan the entire DDT to check to see if it can actually delete each block in the to-be-deleted snapshot/dataset, or if it just needs to update the dedup reference count. If it can't store the entire DDT in either the ARC or L2ARC, it will be forced to do considerable I/O to disk, as it brings in the appropriate DDT entry. Worst case for insufficient ARC/L2ARC space can increase deletion times by many orders of magnitude. E.g. days, weeks, or even months to do a deletion. If dedup isn't enabled, snapshot and data deletion is very light on RAM requirements, and generally won't need to do much (if any) disk I/O. Such deletion should take milliseconds to a minute or so. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
Using the standard c_max value of 80%, remember that this is 80% of the TOTAL system RAM, including that RAM normally dedicated to other purposes. So long as the total amount of RAM you expect to dedicate to ARC usage (for all ZFS uses, not just dedup) is less than 4 times that of all other RAM consumption, you don't need to "overprovision". So the end result is: On my test system I guess the OS and processes consume 1G. (I'm making that up without any reason.) On my test system I guess I need 8G in the system to get reasonable performance without dedup or L2ARC. (Again, I'm just making that up.) We calculated that I need 7G for DDT and 3.4G for L2ARC. That is 10.4G. Multiply by 5/4 and it means I need 13G. My system needs to be built with at least 8G + 13G = 21G. Of this, 20% (4.2G) is more than enough to run the OS and processes, while 80% (16.8G) is available for ARC. Of the 16.8G ARC, the DDT and L2ARC references will consume 10.4G, which leaves 6.4G for "normal" ARC caching. These numbers are all fuzzy. Anything from 16G to 24G might be reasonable. That's it. I'm done. P.S. I'll just throw this out there: It is my personal opinion that you probably won't have the whole DDT in ARC and L2ARC at the same time. Because the L2ARC is populated from the soon-to-expire list of the ARC, it seems unlikely that all the DDT entries will get into ARC, and then onto the soon-to-expire list and then pulled back into ARC and stay there. The above calculation is a sort of worst case. I think the following is likely to be a more realistic actual case: There is a *very* low probability that a DDT entry will exist in both the ARC and L2ARC at the same time. That is, such a condition will occur ONLY in the very short period of time when the DDT entry is being migrated from the ARC to the L2ARC. Each DDT entry is tracked separately, so each can be migrated from ARC to L2ARC as needed. 
Any entry that is migrated back from L2ARC into ARC is considered "stale" data in the L2ARC, and thus, is no longer tracked in the ARC's reference table for L2ARC. As such, you can safely assume the DDT-related memory requirements for the ARC are (maximally) just slightly bigger than size of the DDT itself. Even then, that is a worst-case scenario; a typical use case would have the actual ARC consumption somewhere closer to the case where the entire DDT is in the L2ARC. Using your numbers, that would mean the worst-case ARC usage would be a bit over 7G, and the more likely case would be somewhere in the 3.3-3.5G range. Personally, I would model the ARC memory consumption of the L2ARC entries using the average block size of the data pool, and just neglect the DDT entries in the L2ARC. Well ... inflate some. Say 10% of the DDT is in the L2ARC and the ARC at the same time. I'm making up this number from thin air. My revised end result is: On my test system I guess the OS and processes consume 1G. (I'm making that up without any reason.) On my test system I guess I need 8G in the system to get reasonable performance without dedup or L2ARC. (Again, I'm just making that up.) We calculated that I need 7G for DDT and (96M + 10% of 3.3G = 430M) for L2ARC. Multiply by 5/4 and it means I need 7.5G * 1.25 = 9.4G My system needs to be built with at least 8G + 9.4G = 17.4G. Of this, 20% (3.5G) is more than enough to run the OS and processes, while 80% (13.9G) is available for ARC. Of the 13.9G ARC, the DDT and L2ARC references will consume 7.5G, which leaves 6.4G for "normal" ARC caching. I personally think that's likely to be more accurate in the observable world. My revised end result is still basically the same: These numbers are all fuzzy. Anything from 16G to 24G might be reasonable. 
For total system RAM, you need the GREATER of these two values: (1) the sum of your OS & application requirements, plus your standard ZFS-related ARC requirements, plus the DDT size (2) 1.25 times the sum of size of your standard ARC needs and the DDT size Redoing your calculations based on my adjustments: (a) worst case scenario is that you need 7GB for dedup-related ARC requirements (b) you presume to need 8GB for standard ARC caching not related to dedup (c) your system needs 1GB for basic operation According to those numbers: Case #1: 1 + 8 + 7 = 16GB Case #2: 1.25 * (8 + 7) =~ 19GB Thus, you should have 19GB of RAM in your system, with 16GB being a likely reasonable amount under most conditions (e.g. typical dedup ARC size is going to be ~3.5G, not the 7G maximum used above). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
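The two-case rule above can be wrapped into one small calculation (all figures in GB, matching the worked numbers; the 1.25 factor is just the inverse of the 80% c_max default):

```shell
# Total RAM = max( base OS + normal ARC + DDT, 1.25 * (normal ARC + DDT) )
ram_needed() {
    awk -v base="$1" -v arc="$2" -v ddt="$3" 'BEGIN {
        c1 = base + arc + ddt      # everything must fit alongside the OS
        c2 = 1.25 * (arc + ddt)    # ARC may only use ~80% of system RAM
        printf "%.0fG\n", (c1 > c2 ? c1 : c2)
    }'
}

ram_needed 1 8 7   # 1G OS + 8G normal ARC + 7G worst-case DDT -> 19G
```

With the more typical ~3.5G dedup ARC estimate substituted for the 7G worst case, the same function lands near the 16G lower bound mentioned above.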
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 5:11 PM, Brandon High wrote: On Wed, May 4, 2011 at 4:36 PM, Erik Trimble wrote: If so, I'm almost certain NetApp is doing post-write dedup. That way, the strictly controlled max FlexVol size helps with keeping the resource limits down, as it will be able to round-robin the post-write dedup to each FlexVol in turn. They are, it's in their docs. A volume is dedup'd when 20% of non-deduped data is added to it, or something similar. 8 volumes can be processed at once though, I believe, and it could be that weaker systems are not able to do as many in parallel. Sounds rational. Block usage has a significant 4k presence. One way I reduced this initially was to have the VMdisk image stored on local disk, then copied the *entire* image to the ZFS server, so the server saw a single large file, which meant it tended to write full 128k blocks. Do note, that my 30 images only takes Wouldn't you have been better off cloning datasets that contain an unconfigured install and customizing from there? -B Given that my "OS" installs include a fair amount of 3rd-party add-ons (compilers, SDKs, et al), I generally find the best method for me is to fully configure a client (with the VMdisk on local storage), then copy that VMdisk to the ZFS server as a "golden image". I can then clone that image for my other clients of that type, and only have to change the network information. Initially, each new VM image consumes about 1MB of space. :-) Overall, I've found that as I have to patch each image, it's worthwhile to take a new "golden-image" snapshot every so often, and then reconfigure each client machine again from that new golden image. I'm sure I could do some optimization here, but the method works well enough. What you want to avoid is having the OS image written to the ZFS server first and then doing the configuration and customization after it's there - that's sub-optimal. 
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 4:44 PM, Tim Cook wrote:
On Wed, May 4, 2011 at 6:36 PM, Erik Trimble wrote:
On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
On Wed, May 4, 2011 at 12:29 PM, Erik Trimble wrote:
I suspect that NetApp does the following to limit their resource usage: they presume the presence of some sort of cache that can be dedicated to the DDT (and, since they also control the hardware, they can make sure there is always one present). Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data can be dedup'd on each type of hardware. See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller pieces of hardware can only dedup 1TB volumes, and even the big-daddy filers will only dedup up to 16TB per volume, even if the volume size is 32TB (the largest volume available for dedup). NetApp solves the problem by putting rigid constraints around the problem, whereas ZFS lets you enable dedup for any size dataset. Both approaches have limitations, and it sucks when you hit them. -B

That is very true, although it's worth mentioning you can have quite a few of the dedupe/SIS-enabled FlexVols on even the lower-end filers (our FAS2050 has a bunch of 2TB SIS-enabled FlexVols).

Stupid question - can you hit all the various SIS volumes at once, and not get horrid performance penalties? If so, I'm almost certain NetApp is doing post-write dedup. That way, the strictly controlled max FlexVol size helps with keeping the resource limits down, as it will be able to round-robin the post-write dedup to each FlexVol in turn. ZFS's problem is that it needs ALL the resources for EACH pool ALL the time, and can't really share them well if it expects to keep performance from tanking... (no pun intended)

On a 2050? Probably not. It's got a single-core mobile Celeron CPU and 2GB of RAM.
You couldn't even run ZFS on that box, much less ZFS+dedup. Can you do it on a model that isn't 4 years old without tanking performance? Absolutely. Outside of those two 2000 series, the reason there are dedup limits isn't performance. --Tim

Indirectly, yes, it's performance, since NetApp has plainly chosen post-write dedup as a method to restrict the required hardware capabilities. The dedup limits on Volsize are almost certainly driven by the local RAM requirements for post-write dedup. It also looks like NetApp isn't providing for a dedicated DDT cache, which means that when the NetApp is doing dedup, it's consuming the normal filesystem cache (i.e. chewing through RAM). Frankly, I'd be very surprised if you didn't see a noticeable performance hit during the period that the NetApp appliance is performing the dedup scans.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 4:17 PM, Ray Van Dolson wrote:
On Wed, May 04, 2011 at 03:49:12PM -0700, Erik Trimble wrote:
On 5/4/2011 2:54 PM, Ray Van Dolson wrote:
On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:

(2) Block size: a 4k block size will yield better dedup than a 128k block size, presuming reasonable data turnover. This is inherent, as any single bit change in a block will make it non-duplicated. With 32x the block size, there is a much greater chance that a small change in data will require a large loss of dedup ratio. That is, 4k blocks should almost always yield much better dedup ratios than larger ones. Also, remember that the ZFS block size is a SUGGESTION for zfs filesystems (i.e. it will use UP TO that block size, but not always that size), but is FIXED for zvols.

(3) Method of storing (and data stored in) the dedup table. ZFS's current design is (IMHO) rather piggy on DDT and L2ARC lookup requirements. Right now, ZFS requires a record in the ARC (RAM) for each L2ARC (cache) entry, PLUS the actual L2ARC entry. So, it boils down to 500+ bytes of combined L2ARC & RAM usage per block entry in the DDT. Also, the actual DDT entry itself is perhaps larger than absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for memory (at least not much, if you're talking about 500 bytes combined)? I was hoping we could slap in 80GB of SSD L2ARC and get away with "only" 16GB of RAM, for example.

It reduces *somewhat* the need for RAM. Basically, if you have no L2ARC cache device, the DDT must be stored in RAM. That's about 376 bytes per dedup block. If you have an L2ARC cache device, then the ARC must contain a reference to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT entry reference. So, adding an L2ARC reduces the ARC consumption by about 55%. Of course, the other benefit from an L2ARC is the data/metadata caching, which is likely worth it just by itself.

Great info. Thanks Erik.
For dedupe workloads on larger file systems (8TB+), I wonder if it makes sense to use SLC / enterprise-class SSD (or better) devices for L2ARC instead of lower-end MLC stuff? Seems like we'd be seeing more writes to the device than in a non-dedupe scenario. Thanks, Ray

I'm using enterprise-class MLC drives (without a supercap), and they work fine with dedup. I'd have to test, but I don't think that the increase in writes is that much, so I don't expect SLC to really make much of a difference. (The fill rate of the L2ARC is limited, so I can't imagine we'd bump up against the MLC's limits.)

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
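The 376-byte vs. 176-byte figures quoted above imply the following per-entry arithmetic (a quick sketch; the struct sizes are the ones measured in this thread for that particular build, not fixed constants of ZFS):

```python
DDT_ENTRY = 376  # bytes: ddt_entry_t, the in-ARC cost per deduped block with no L2ARC
ARC_REF   = 176  # bytes: arc_buf_hdr_t, the ARC cost per DDT entry held in the L2ARC

savings = 1 - ARC_REF / DDT_ENTRY
print(f"ARC bytes per deduped block: {DDT_ENTRY} -> {ARC_REF} ({savings:.0%} less)")
```

Strictly this works out to about 53%, which Erik rounds up to "about 55%" in the post above; either way, an L2ARC roughly halves the ARC footprint of the DDT rather than eliminating it.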
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
On Wed, May 4, 2011 at 12:29 PM, Erik Trimble wrote:
I suspect that NetApp does the following to limit their resource usage: they presume the presence of some sort of cache that can be dedicated to the DDT (and, since they also control the hardware, they can make sure there is always one present). Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data can be dedup'd on each type of hardware. See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller pieces of hardware can only dedup 1TB volumes, and even the big-daddy filers will only dedup up to 16TB per volume, even if the volume size is 32TB (the largest volume available for dedup). NetApp solves the problem by putting rigid constraints around the problem, whereas ZFS lets you enable dedup for any size dataset. Both approaches have limitations, and it sucks when you hit them. -B

That is very true, although it's worth mentioning you can have quite a few of the dedupe/SIS-enabled FlexVols on even the lower-end filers (our FAS2050 has a bunch of 2TB SIS-enabled FlexVols).

Stupid question - can you hit all the various SIS volumes at once, and not get horrid performance penalties? If so, I'm almost certain NetApp is doing post-write dedup. That way, the strictly controlled max FlexVol size helps with keeping the resource limits down, as it will be able to round-robin the post-write dedup to each FlexVol in turn. ZFS's problem is that it needs ALL the resources for EACH pool ALL the time, and can't really share them well if it expects to keep performance from tanking... (no pun intended)

The FAS2050 of course has a fairly small memory footprint... I do like the additional flexibility you have with ZFS, just trying to get a handle on the memory requirements. Are any of you out there using dedupe ZFS file systems to store VMware VMDK (or any VM tech, really)?
Curious what recordsize you use and what your hardware specs / experiences have been. Ray

Right now, I use it for my Solaris 8 containers and VirtualBox images. The VB images are mostly Windows (XP and Win2003). I tend to put the OS image in one VMdisk, and my scratch disks in another. That is, I generally don't want my apps writing much to my OS images. My scratch/data disks aren't deduped.

Overall, I'm running about 30 deduped images served out over NFS. My recordsize is set to 128k, but, given that they're OS images, my actual disk block usage has a significant 4k presence. One way I reduced this initially was to have the VMdisk image stored on local disk, then copy the *entire* image to the ZFS server, so the server saw a single large file, which meant it tended to write full 128k blocks. Do note that my 30 images only take about 20GB of actual space, after dedup. I figure about 5GB of dedup space per OS type (and, I have 4 different setups).

My data VMdisks, however, chew through about 4TB of disk space, which is nondeduped. I'm still trying to determine if I'm better off serving those data disks as NFS mounts to my clients, or as VMdisk images available over iSCSI or NFS. Right now, I'm doing VMdisks over NFS.

The setup I'm using is an older X4200 (non-M2), with 3rd-party SSDs as L2ARC, hooked to an old 3500FC array. It has 8GB of RAM in total, and runs just fine with that. I definitely am going to upgrade to something much larger in the near future, since I expect to up my number of VM images by at least a factor of 5.

That all said, if you're relatively careful about separating OS installs from active data, you can get really impressive dedup ratios using a relatively small amount of actual space. In my case, I expect to eventually be serving about 10 different configs out to a total of maybe 100 clients, and probably never exceed 100GB max on the deduped end. Which means that I'll be able to get away with 16GB of RAM for the whole server, comfortably.
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 2:54 PM, Ray Van Dolson wrote:
On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:

(2) Block size: a 4k block size will yield better dedup than a 128k block size, presuming reasonable data turnover. This is inherent, as any single bit change in a block will make it non-duplicated. With 32x the block size, there is a much greater chance that a small change in data will require a large loss of dedup ratio. That is, 4k blocks should almost always yield much better dedup ratios than larger ones. Also, remember that the ZFS block size is a SUGGESTION for zfs filesystems (i.e. it will use UP TO that block size, but not always that size), but is FIXED for zvols.

(3) Method of storing (and data stored in) the dedup table. ZFS's current design is (IMHO) rather piggy on DDT and L2ARC lookup requirements. Right now, ZFS requires a record in the ARC (RAM) for each L2ARC (cache) entry, PLUS the actual L2ARC entry. So, it boils down to 500+ bytes of combined L2ARC & RAM usage per block entry in the DDT. Also, the actual DDT entry itself is perhaps larger than absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for memory (at least not much, if you're talking about 500 bytes combined)? I was hoping we could slap in 80GB of SSD L2ARC and get away with "only" 16GB of RAM, for example.

It reduces *somewhat* the need for RAM. Basically, if you have no L2ARC cache device, the DDT must be stored in RAM. That's about 376 bytes per dedup block. If you have an L2ARC cache device, then the ARC must contain a reference to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT entry reference. So, adding an L2ARC reduces the ARC consumption by about 55%. Of course, the other benefit from an L2ARC is the data/metadata caching, which is likely worth it just by itself.
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 9:57 AM, Ray Van Dolson wrote:

There are a number of threads (this one[1] for example) that describe memory requirements for deduplication. They're pretty high. I'm trying to get a better understanding... on our NetApps we use 4K block sizes with their post-process deduplication and get pretty good dedupe ratios for VM content. Using ZFS we are using 128K record sizes by default, which nets us less impressive savings... however, to drop to a 4K record size would theoretically require that we have nearly 40GB of memory for only 1TB of storage (based on 150 bytes per block for the DDT). This obviously becomes prohibitively higher for 10+ TB file systems. I will note that our NetApps are using only 2TB FlexVols, but I would like to better understand ZFS's (apparently) higher memory requirements... or maybe I'm missing something entirely. Thanks, Ray

[1] http://markmail.org/message/wile6kawka6qnjdw

I'm not familiar with NetApp's implementation, so I can't speak to why it might appear to use fewer resources. However, there are a couple of possible issues here:

(1) Pre-write vs. post-write deduplication. ZFS does pre-write dedup, where it looks for duplicates before it writes anything to disk. In order to do pre-write dedup, you really have to store the ENTIRE deduplication block lookup table in some sort of fast (random) access media, realistically Flash or RAM. The win is that you get significantly lower disk utilization (i.e. better I/O performance), as (potentially) much less data is actually written to disk. Post-write dedup is done via batch processing - that is, such a design has the system periodically scan the saved data, looking for duplicates. While this method also greatly benefits from being able to store the dedup table in fast random storage, it's not anywhere near as critical.
The downside here is that you see much higher disk utilization - the system must first write all new data to disk (without looking for dedup), and then must also perform significant I/O later on to do the dedup.

(2) Block size: a 4k block size will yield better dedup than a 128k block size, presuming reasonable data turnover. This is inherent, as any single bit change in a block will make it non-duplicated. With 32x the block size, there is a much greater chance that a small change in data will require a large loss of dedup ratio. That is, 4k blocks should almost always yield much better dedup ratios than larger ones. Also, remember that the ZFS block size is a SUGGESTION for zfs filesystems (i.e. it will use UP TO that block size, but not always that size), but is FIXED for zvols.

(3) Method of storing (and data stored in) the dedup table. ZFS's current design is (IMHO) rather piggy on DDT and L2ARC lookup requirements. Right now, ZFS requires a record in the ARC (RAM) for each L2ARC (cache) entry, PLUS the actual L2ARC entry. So, it boils down to 500+ bytes of combined L2ARC & RAM usage per block entry in the DDT. Also, the actual DDT entry itself is perhaps larger than absolutely necessary.

I suspect that NetApp does the following to limit their resource usage: they presume the presence of some sort of cache that can be dedicated to the DDT (and, since they also control the hardware, they can make sure there is always one present). Thus, they can make their code completely avoid the need for an equivalent to the ARC-based lookup. In addition, I suspect they have a smaller DDT entry itself. Which boils down to probably needing 50% of the total resource consumption of ZFS, and NO (or an extremely small, fixed) RAM requirement.

Honestly, ZFS's cache (L2ARC) requirements aren't really a problem.
The big issue is the ARC requirements, which, until they can be seriously reduced (or, best case, simply eliminated), really are a significant barrier to adoption of ZFS dedup. Right now, ZFS treats DDT entries like any other data or metadata in how it ages from ARC to L2ARC to gone. IMHO, the better way to do this is simply to require the DDT to be entirely stored on the L2ARC (if present), and not ever keep any DDT info in the ARC at all (that is, the ARC should contain a pointer to the DDT in the L2ARC, and that's it, regardless of the amount or frequency of access of the DDT). Frankly, at this point, I'd almost change the design to REQUIRE an L2ARC device in order to turn on dedup.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
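Ray's back-of-envelope figure from the start of the thread (~40GB of DDT for 1TB of 4K blocks) can be checked directly. A sketch of the arithmetic, where 150 bytes is the rough per-block DDT estimate Ray cites and 376 bytes is the ddt_entry_t size measured later in this thread:

```python
TIB = 2**40  # 1TB of (post-dedup) data

def ddt_bytes(data_bytes, block_size, entry_size):
    # One DDT entry per unique block; pre-write dedup wants all of
    # these in fast random-access media (RAM or flash).
    blocks = data_bytes // block_size
    return blocks * entry_size

# 1TB of 4k blocks at ~150 bytes/entry -> close to the ~40GB Ray quotes
print(ddt_bytes(TIB, 4096, 150) / 2**30)  # -> 37.5 (GiB)
# Same data at the 376-byte ddt_entry_t measured with mdb later in the thread
print(ddt_bytes(TIB, 4096, 376) / 2**30)  # -> 94.0 (GiB)
```

So the per-entry structure size is the whole game: the same 1TB of data costs 2.5x more DDT space under the measured entry size than under the 150-byte estimate.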
Re: [zfs-discuss] Faster copy from UFS to ZFS
On 5/3/2011 8:55 AM, Brandon High wrote:
On Tue, May 3, 2011 at 5:47 AM, Joerg Schilling wrote:
But this is most likely slower than star, and does rsync support sparse files?

'rsync -ASHXavP'
-A: ACLs
-S: Sparse files
-H: Hard links
-X: Xattrs
-a: archive mode; equals -rlptgoD (no -H,-A,-X)

You don't need to specify --whole-file, it's implied when copying on the same system. --inplace can play badly with hard links and shouldn't be used. It probably will be slower than other options, but it may be more accurate, especially with -H. -B

rsync is indeed slower than star; so far as I can tell, this is due almost exclusively to the fact that rsync needs to build an in-memory table of all work being done *before* it starts to copy. After that, it copies at about the same rate as star (my observations). I'd have to look at the code, but rsync appears to internally buffer a significant amount (due to its expected network use pattern), which helps for ZFS copying. The one thing I'm not sure of is whether rsync uses a socket, pipe, or semaphore method when doing same-host copying. I presume socket (which would slightly slow it down vs. star).

That said, rsync is really the only solution if you have a partial or interrupted copy. It's also really the best method to do verification.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On 4/29/2011 9:44 AM, Brandon High wrote:
On Fri, Apr 29, 2011 at 7:10 AM, Roy Sigurd Karlsbakk wrote:
This was fletcher4 earlier, and still is in opensolaris/openindiana. Given a combination with verify (which I would use anyway, since there are always tiny chances of collisions), why would sha256 be a better choice?

fletcher4 was only an option for snv_128, which was quickly pulled and replaced with snv_128b, which removed fletcher4 as an option. The official post is here:
http://www.opensolaris.org/jive/thread.jspa?threadID=118519&tstart=0#437431

It looks like fletcher4 is still an option in snv_151a for non-dedup datasets, and is in fact the default. As an aside: Erik, any idea when the 159 bits will make it to the public? -B

Yup, fletcher4 is still the default for any fileset not using dedup. It's "good enough", and I can't see any reason to change it for those purposes (since its collision problems aren't much of an issue when just doing data integrity checks).

Sorry, no idea on release date stuff. I'm completely out of the loop on release info. I'm lucky if I can get a heads up before it actually gets published internally. :-( I'm just a lowly Java Platform Group dude. Solaris ain't my silo.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Finding where dedup'd files are
On Thu, 2011-04-28 at 15:50 -0700, Brandon High wrote:
> On Thu, Apr 28, 2011 at 3:48 PM, Ian Collins wrote:
> > Dedup is at the block, not file level.
>
> Files are usually composed of blocks.
>
> -B

I think the point was that it may not be easy to determine which file a given block is part of. I don't think stuff is stored as a doubly-linked list. That is, the file metadata lists the blocks associated with that file, but the block itself doesn't refer back to the file metadata. Which means that, while I can get a list of blocks which are deduped, it may not be possible to generate a list of files from that list of blocks.

-- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Thu, 2011-04-28 at 14:33 -0700, Brandon High wrote:
> On Wed, Apr 27, 2011 at 9:26 PM, Edward Ned Harvey wrote:
> > Correct me if I'm wrong, but the dedup sha256 checksum happens in addition
> > to (not instead of) the fletcher2 integrity checksum. So after bootup,
>
> My understanding is that enabling dedup forces sha256.
>
> "The default checksum used for deduplication is sha256 (subject to
> change). When dedup is enabled, the dedup checksum algorithm overrides
> the checksum property."
>
> -B

From the man page for zfs(1):

dedup=on | off | verify | sha256[,verify]

Controls whether deduplication is in effect for a dataset. The default value is off. The default checksum used for deduplication is sha256 (subject to change). When dedup is enabled, the dedup checksum algorithm overrides the checksum property. Setting the value to verify is equivalent to specifying sha256,verify. If the property is set to verify, then, whenever two blocks have the same signature, ZFS will do a byte-for-byte comparison with the existing block to ensure that the contents are identical.

This is from b159. A careful reading of the man page seems to imply that there's no way to change the dedup checksum algorithm from sha256, as the dedup property ignores the checksum property, and there's no provided way to explicitly set a checksum algorithm specific to dedup (i.e. there's no way to override the default for dedup).

-- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Thu, 2011-04-28 at 13:59 -0600, Neil Perrin wrote:
> On 4/28/11 12:45 PM, Edward Ned Harvey wrote:
> > In any event, thank you both for your input. Can anyone answer these
> > authoritatively? (Neil?) I'll send you a pizza. ;-)
>
> I wouldn't consider myself an authority on the dedup code.
> The size of these structures will vary according to the release you're
> running. You can always find out the size for a particular system using
> ::sizeof within mdb. For example, as super user:
>
> : xvm-4200m2-02 ; echo ::sizeof ddt_entry_t | mdb -k
> sizeof (ddt_entry_t) = 0x178
> : xvm-4200m2-02 ; echo ::sizeof arc_buf_hdr_t | mdb -k
> sizeof (arc_buf_hdr_t) = 0x100
> : xvm-4200m2-02 ;

Yup, that's how I got them. Just to add to the confusion, there are typedefs in the code which can make names slightly different:

typedef struct arc_buf_hdr arc_buf_hdr_t;
typedef struct ddt_entry ddt_entry_t;

I got my values from an x86 box running b159, and a SPARC box running S10u9. The values were the same from both. E.g.:

root@invisible:~# uname -a
SunOS invisible 5.11 snv_159 i86pc i386 i86pc Solaris
root@invisible:~# isainfo
amd64 i386
root@invisible:~# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs ip hook neti arp usba uhci fctl stmf kssl stmf_sbd sockfs lofs random sata sd fcip cpc crypto nfs logindmux ptm ufs sppp ipc ]
> ::sizeof struct arc_buf_hdr
sizeof (struct arc_buf_hdr) = 0xb0
> ::sizeof struct ddt_entry
sizeof (struct ddt_entry) = 0x178

> This shows yet another size. Also there are more changes planned within
> the arc. Sorry, I can't talk about those changes nor when you'll
> see them.
>
> However, that's not the whole story. It looks like the arc_buf_hdr_t
> use their own kmem cache so there should be little wastage, but the
> ddt_entry_t are allocated from the generic kmem caches and so will
> probably have some roundup and unused space. Caches for small buffers
> are aligned to 64 bytes.
See kmem_alloc_sizes[] and comment:
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c#920

Ugg. I hadn't even thought of memory alignment/allocation issues.

> Pizza: Mushroom and anchovy - er, just kidding.
>
> Neil.

And, let me say: Yuck! What is that, an ISO-standard pizza? Disgusting. ANSI-standard pizza, all the way! (pepperoni & mushrooms)

-- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
OK, I just re-looked at a couple of things, and here's what I /think/ are the correct numbers.

A single entry in the DDT is defined in the struct "ddt_entry":
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/ddt.h#108
I just checked, and the current size of this structure is 0x178, or 376 bytes.

Each ARC entry, which points to either an L2ARC item (of any kind, cached data, metadata, or a DDT line) or actual data/metadata/etc., is defined in the struct "arc_buf_hdr":
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#431
Its current size is 0xb0, or 176 bytes.

These are fixed-size structures. PLEASE - someone correct me if these two structures AREN'T what we should be looking at.

So, our estimate calculations have to be based on these new numbers. Back to the original scenario: 1TB (after dedup) of 4k blocks: how much space is needed for the DDT, and how much ARC space is needed if the DDT is kept in an L2ARC cache device?

Step 1) 1TB (2^40 bytes) stored in blocks of 4k (2^12) = 2^28 blocks total, which is about 268 million.
Step 2) 2^28 blocks of information in the DDT requires 376 bytes/block * 2^28 blocks = 94 * 2^30 bytes = 94GB of space.
Step 3) Storing a reference to 268 million (2^28) DDT entries in the L2ARC will consume the following amount of ARC space: 176 bytes/entry * 2^28 entries = 44GB of RAM.

That's pretty ugly. So, to summarize, for 1TB of data, broken into the following block sizes:

Block size   DDT size          ARC consumption
512b         752GB   (73%)     352GB  (34%)
4k           94GB    (9%)      44GB   (4.3%)
8k           47GB    (4.5%)    22GB   (2.1%)
32k          11.75GB (1.1%)    5.5GB  (0.5%)
64k          5.9GB   (0.6%)    2.75GB (0.3%)
128k         2.9GB   (0.3%)    1.4GB  (0.1%)

ARC consumption presumes the whole DDT is stored in the L2ARC. Percentage size is relative to the original 1TB total data size.

Of course, the trickier proposition here is that we DON'T KNOW what our dedup value is ahead of time on a given data set.
That is, given a data set of X size, we don't know how big the deduped data size will be. The above calculations are for DDT/ARC size for a data set that has already been deduped down to 1TB in size. Perhaps it would be nice to have some sort of userland utility that builds its own DDT as a test and does all the above calculations, to see how dedup would work on a given dataset. 'zdb -S' sorta, kinda does that, but...

-- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
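The summary table above can be regenerated from just the two struct sizes (this is only the arithmetic from the post, not the output of any ZFS tool):

```python
DDT_ENTRY, ARC_REF = 376, 176  # bytes, per the mdb ::sizeof results in this thread
TIB = 2**40                    # 1TB of already-deduped data

print(f"{'block':>6}  {'DDT size':>10}  {'ARC (L2ARC refs)':>16}")
for bs in (512, 4096, 8192, 32768, 65536, 131072):
    blocks = TIB // bs                      # one DDT entry per block
    ddt_gb = blocks * DDT_ENTRY / 2**30     # RAM if there is no L2ARC
    arc_gb = blocks * ARC_REF / 2**30       # RAM if the DDT lives in the L2ARC
    print(f"{bs:>6}  {ddt_gb:>8.2f}GB  {arc_gb:>14.2f}GB")
```

Running this reproduces the 752GB/352GB row for 512-byte blocks down through the 2.94GB/1.38GB row for 128k blocks, which is where the rounded 2.9GB/1.4GB figures in the table come from.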
Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
On 4/26/2011 9:29 AM, Fred Liu wrote:
> From: Erik Trimble [mailto:erik.trim...@oracle.com]
>> It is true, quota is in charge of logical data, not physical data.
>> Let's assume an interesting scenario -- say the pool is 100% full in logical data
>> (such as 'df' tells you 100% used) but not full in physical data (such as 'zpool list' tells
>> you still some space available): can we continue writing data into this pool?
>
> Sure, you can keep writing to the volume. What matters to the OS is what
> *it* thinks, not what some userland app thinks.
>
> OK. And then what will the output of 'df' be?
>
> Thanks.
>
> Fred

110% full. Or whatever. df will just keep reporting what it sees. Even if what it *thinks* doesn't make sense to the human reading it.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
On 4/26/2011 3:59 AM, Fred Liu wrote:
>> -Original Message-
>> From: Erik Trimble [mailto:erik.trim...@oracle.com]
>> Sent: Tuesday, April 26, 2011 12:47
>> To: Ian Collins
>> Cc: Fred Liu; ZFS discuss
>> Subject: Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
>>
>> On 4/25/2011 6:23 PM, Ian Collins wrote:
>>> On 04/26/11 01:13 PM, Fred Liu wrote:
>>>> Hmm, it seems dedup is pool-based, not filesystem-based.
>>> That's correct. Although it can be turned off and on at the filesystem
>>> level (assuming it is enabled for the pool).
>> Which is effectively the same as choosing per-filesystem dedup. Just
>> the inverse. You turn it on at the pool level, and off at the filesystem
>> level, which is identical to the "off at the pool level, on at the
>> filesystem level" that NetApp does.
> My original thought was just enabling dedup on one file system to check whether it
> is mature enough for the production env. And I have only one pool.
> If dedup were filesystem-based, the effect of dedup would be throttled within
> one file system and wouldn't propagate to the whole pool. Just disabling dedup
> cannot get rid of all the effects (such as the possible performance degradation, etc.),
> because the already-dedup'd data is still there and the DDT is still there. The only
> thorough way I can think of is totally removing all the dedup'd data. But is that
> really the thorough way?

You can do that now. Enable dedup at the pool level. Turn it OFF on all the existing filesystems. Make a new "test" filesystem, and run your tests.

Remember, only data written AFTER dedup is turned on will be de-duped. Existing data will NOT. And, though dedup is enabled at the pool level, it will only consider data written into filesystems that have the dedup value as ON. Thus, in your case, writing to the single filesystem with dedup on will NOT have ZFS check for duplicates from the other filesystems. It will check only inside itself, as it's the only filesystem with dedup enabled.
If the experiment fails, you can safely destroy your test dedup filesystem, then unset dedup at the pool level, and you're fine.

> And also the dedup space saving is kind of indirect.
> We cannot directly get the space saving in the file system where
> dedup is actually enabled, for it is pool-based. Even from the pool perspective,
> it is still sort of indirect and obscure in my opinion; the real space saving
> is the absolute delta between the output of 'zpool list' and the sum of 'du' on
> all the folders in the pool
> (or 'df' on the mount point folder; not sure if a percentage like 123% will
> occur or not... grinning ^:^ ).
>
> But in NetApp, we can use 'df -s' to directly and easily get the space saving.

That is true. Honestly, however, it would be hard to do this on a per-filesystem basis. ZFS allows for the creation of an arbitrary number of filesystems in a pool, far more than NetApp does. The result is that the "filesystem" concept is much more flexible in ZFS. The downside is that keeping dedup statistics for a given arbitrary set of data is logistically difficult. An analogy with NetApp is thus: can you use any tool to find the dedup ratio of an arbitrary directory tree INSIDE a NetApp filesystem?

> It is true, quota is in charge of logical data, not physical data.
> Let's assume an interesting scenario -- say the pool is 100% full in logical data
> (such as 'df' tells you 100% used) but not full in physical data (such as 'zpool list' tells
> you still some space available): can we continue writing data into this pool?

Sure, you can keep writing to the volume. What matters to the OS is what *it* thinks, not what some userland app thinks.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
On 4/25/2011 6:23 PM, Ian Collins wrote: On 04/26/11 01:13 PM, Fred Liu wrote: Hmm, it seems dedup is pool-based, not filesystem-based. That's correct. Although it can be turned off and on at the filesystem level (assuming it is enabled for the pool). Which is effectively the same as choosing per-filesystem dedup, just the inverse. You turn it on at the pool level, and off at the filesystem level, which is identical to the "off at the pool level, on at the filesystem level" that NetApp does. If it could have finer granularity (per filesystem), that would be great! It is a pity! NetApp is sweet in this aspect.

So what happens to user B's quota if user B stores a ton of data that is a duplicate of user A's data, and then user A deletes the original?

Actually, right now, nothing happens to B's quota. He's always charged the un-deduped amount for his quota usage, whether or not dedup is enabled, and regardless of how much of his data is actually deduped. Which is as it should be, as quotas are about limiting how much a user is consuming, not how much the backend needs to store that data.

E.g. A, B, C, & D each have 100MB of data in the pool, with dedup on. 20MB of storage has a dedup factor of 3:1 (common to A, B, & C). 50MB of storage has a dedup factor of 2:1 (common to A & B). Thus, the amount of unique data would be:

A: 100 - 20 - 50 = 30MB
B: 100 - 20 - 50 = 30MB
C: 100 - 20 = 80MB
D: 100MB

Summing it all up, you would have an actual storage consumption of 70MB (50+20 deduped) + 30+30+80+100 (unique data) = 310MB of actual storage, for 400MB of apparent storage (i.e. a dedup ratio of 1.29:1). A, B, C, & D would each still have a quota usage of 100MB.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
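The quota accounting in that A/B/C/D example can be checked with a few lines of arithmetic. This is just a sketch of the bookkeeping in the post, with the shared-block layout hard-coded to match the scenario above:

```python
# Four users each store 100MB of logical data; some of it is shared.
# Quotas charge the un-deduped (logical) amount, so dedup never
# changes a user's quota usage -- only the pool's physical consumption.
shared = [
    (20, ["A", "B", "C"]),  # 20MB common to A, B, C -> stored once (3:1)
    (50, ["A", "B"]),       # 50MB common to A, B    -> stored once (2:1)
]
users = ["A", "B", "C", "D"]
quota_per_user = 100  # MB, charged to each user regardless of dedup

apparent = quota_per_user * len(users)                    # 400MB logical
savings = sum(mb * (len(who) - 1) for mb, who in shared)  # copies not stored
actual = apparent - savings                               # physical storage
ratio = apparent / actual

print(f"apparent={apparent}MB actual={actual}MB dedup ratio={ratio:.2f}:1")
```

Running it reproduces the post's figures: 310MB of actual storage for 400MB of apparent storage, a 1.29:1 dedup ratio.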
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On 4/25/2011 8:20 AM, Edward Ned Harvey wrote: There are a lot of conflicting references on the Internet, so I'd really like to solicit actual experts (ZFS developers or people who have physical evidence) to weigh in on this... After searching around, the reference I found to be the most seemingly useful was Erik's post here: http://opensolaris.org/jive/thread.jspa?threadID=131296 Unfortunately it looks like there's an arithmetic error (1TB of 4k blocks means 268 million blocks, not 1 billion). Also, IMHO it seems important to make the distinction, #files != #blocks. Due to the existence of larger files, there will sometimes be more than one block per file; and if I'm not mistaken, thanks to write aggregation, there will sometimes be more than one file per block. YMMV. Average block size could be anywhere between 1 byte and 128k assuming default recordsize. (BTW, recordsize seems to be a zfs property, not a zpool property. So how can you know or configure the blocksize for something like a zvol iscsi target?)

I meant 2^28, which is roughly a quarter billion. But I should have been more exact. And the file != block difference is important to note.

zvols also take a recordsize attribute. And zvols tend to be sticklers about all blocks being /exactly/ the recordsize value, unlike filesystems, which use it as a *maximum* block size. Min block size is 512 bytes.

(BTW, is there any way to get a measurement of the number of blocks consumed per zpool? Per vdev? Per zfs filesystem?) The calculations below are based on the assumption of 4KB blocks adding up to a known total data consumption. The actual thing that matters is the number of blocks consumed, so the conclusions drawn will vary enormously when people actually have average block sizes != 4KB.

You need to use zdb to see what the current block usage is for a filesystem. I'd have to look up the particular CLI usage for that, as I don't know it off the top of my head.
And one more comment: Based on what's below, it seems that the DDT gets stored on the cache device and also in RAM. Is that correct? What if you didn't have a cache device? Shouldn't it *always* be in RAM? And doesn't the cache device get wiped every time you reboot? It seems to me like putting the DDT on the cache device would be harmful... Is that really how it is?

Nope. The DDT is stored in only one place: the cache device if present, /or/ RAM otherwise (technically, ARC, but that's in RAM). If a cache device is present, the DDT is stored there, BUT RAM also must store a basic lookup table for the DDT (yeah, I know: a lookup table for a lookup table).

My minor corrections here: the rule of thumb is 270 bytes per DDT entry, and 200 bytes of ARC for every L2ARC entry; since the DDT is stored on the cache device, the DDT itself doesn't consume any ARC space when stored on an L2ARC device.

E.g.: I have 1TB of 4k blocks that are to be deduped, and it turns out that I have about a 5:1 dedup ratio. I'd also like to see how much ARC usage I eat up by using a 160GB L2ARC to store my DDT.

(1) How many entries are there in the DDT? 1TB of 4k blocks means there are 268 million blocks. However, at a 5:1 dedup ratio, I'm only actually storing 20% of that, so I have about 54 million blocks. Thus, I need a DDT of about 270 bytes * 54 million =~ 14GB in size.

(2) How much ARC space does this DDT take up? The 54 million entries in my DDT take up about 200 bytes * 54 million =~ 10GB of ARC space, so I need to have 10GB of RAM dedicated just to storing the references to the DDT in the L2ARC.

(3) How much space do I have left on the L2ARC device, and how many blocks can that hold? Well, I have 160GB - 14GB (DDT) = 146GB of cache space left on the device, which, assuming I'm still using 4k blocks, means I can cache about 37 million 4k blocks, or about two-thirds of my deduped data. This extra cache of blocks in the L2ARC would eat up 200 bytes * 37 million =~ 7.5GB of ARC entries.
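The sizing steps above can be rolled into a small calculator. The 270-bytes-per-DDT-entry and 200-bytes-per-L2ARC-reference figures are the rules of thumb from the post; the function itself is just an illustrative sketch, and it reproduces the post's round numbers to within rounding:

```python
def dedup_sizing(logical_bytes, block_size, dedup_ratio, l2arc_bytes,
                 ddt_entry=270, arc_ref=200):
    """Rule-of-thumb DDT/L2ARC/ARC sizing per the figures above:
    270 B per DDT entry on the cache device, 200 B of ARC per L2ARC entry."""
    blocks = logical_bytes // block_size        # logical block count
    unique = int(blocks / dedup_ratio)          # blocks actually stored
    ddt_bytes = unique * ddt_entry              # DDT size on the L2ARC
    arc_for_ddt = unique * arc_ref              # RAM to reference the DDT
    spare = l2arc_bytes - ddt_bytes             # L2ARC left for data caching
    cached = spare // block_size                # extra blocks it can cache
    arc_for_cache = cached * arc_ref            # RAM to reference that cache
    return blocks, unique, ddt_bytes, arc_for_ddt, cached, arc_for_cache

GB = 10**9
blocks, unique, ddt, arc_ddt, cached, arc_cache = dedup_sizing(
    2**40, 4096, 5.0, 160 * GB)  # 1TB of 4k blocks, 5:1, 160GB L2ARC
print(f"{blocks/1e6:.0f}M blocks, {unique/1e6:.0f}M unique, "
      f"DDT {ddt/GB:.1f}GB, ARC for DDT {arc_ddt/GB:.1f}GB, "
      f"{cached/1e6:.0f}M extra cached blocks, ARC {arc_cache/GB:.1f}GB")
```

The output matches the post's back-of-envelope numbers: ~268M logical blocks, ~54M unique, a ~14.5GB DDT, ~10.7GB of ARC for DDT references, and roughly 35-36M additional cacheable blocks costing another ~7GB of ARC.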
Thus, for the aforementioned dedup scenario, I'd better spec the machine with (whatever base RAM the OS, ordinary ZFS caching, and applications require) plus at least a 14GB L2ARC device for the DDT, plus 10GB more of RAM for the DDT's L2ARC references, plus 1GB of RAM for every 20GB of additional space in the L2ARC cache beyond that used by the DDT. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] X4540 no next-gen product?
On 4/8/2011 9:19 PM, Ian Collins wrote: On 04/ 9/11 03:53 PM, Mark Sandrock wrote: I'm not arguing. If it were up to me, we'd still be selling those boxes. Maybe you could whisper in the right ear? :) Three little words are all that Oracle Product Managers hear: "Business case justification" I want my J4000's back, too. And, I still want something like HP's MSA 70 (25 x 2.5" drive JBOD in a 2U formfactor) -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4540 no next-gen product?
On 4/8/2011 4:50 PM, Bob Friesenhahn wrote: On Fri, 8 Apr 2011, J.P. King wrote: I can't speak for this particular situation or solution, but I think in principle you are wrong. Networks are fast. Hard drives are slow. Put a But memory is much faster than either. In most situations the data would already be buffered in the X4540's memory so that it is instantly available. Bob

Certainly, as a low-end product, the X4540 (and X4500) offered unmatched flexibility and performance per dollar. It *is* sad to see them go. But, given Oracle's strategic direction, is anyone really surprised?

PS - Nexenta, I think you've got a product positioning opportunity here...

PPS - about the closest thing Oracle makes to the X4540 now is the X4270 M2 in the 2.5" drive config - 24 x 2.5" drives, 2 x Westmere-EP CPUs, in a 2U rack cabinet, somewhere around $25k (list) for the 24x500GB SATA model with (2) 6-core Westmeres + 16GB RAM.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] X4540 no next-gen product?
On 4/8/2011 1:58 PM, Chris Banal wrote: Sounds like many of us are in a similar situation. To clarify my original post: the goal here was to continue with what was a cost-effective solution to some of our storage requirements. I'm looking for hardware that wouldn't cause me to get the runaround from the Oracle support folks, finger-pointing between vendors, or lots of grief from an untested combination of parts. If this isn't possible we'll certainly find another solution. I already know it won't be the 7000 series. Thank you, Chris Banal

Talk to HP then. They still sell Officially Supported Solaris servers and disk storage systems in more varieties than Oracle does. The StorageWorks 600 Modular Disk System may be what you're looking for (70 x 3.5" drives per enclosure, 5U, SAS/SATA/FC attachment to any server, $35k list price for 70TB). Or the StorageWorks MSA70 (25 x 2.5" drives, 2U, SAS attachment, $11k list price for 12.5TB). -Erik

Marion Hakanson wrote: jp...@cam.ac.uk said: I can't speak for this particular situation or solution, but I think in principle you are wrong. Networks are fast. Hard drives are slow. Put a 10G connection between your storage and your front ends and you'll have the bandwidth[1]. Actually if you really were hitting 1000x8Mbits I'd put 2, but that is just a question of scale. In a different situation I have boxes which peak at around 7 Gb/s down a 10G link (in reality I don't need that much because it is all about the IOPS for me). That is with just twelve 15k disks. Your situation appears to be pretty ideal for storage hardware, so perfectly achievable from an appliance. Depending on usage, I disagree with your bandwidth and latency figures above. An X4540, or an X4170 with J4000 JBODs, has more bandwidth to its disks than 10Gbit Ethernet. You would need three 10GbE interfaces between your CPU and the storage appliance to equal the bandwidth of a single 8-port 3Gb/s SAS HBA (five of them for 6Gb/s SAS).
It's also the case that the Unified Storage platform doesn't have enough bandwidth to drive more than four 10GbE ports at their full speed: http://dtrace.org/blogs/brendan/2009/09/22/7410-hardware-update-and-analyzing-the-hypertransport/

We have a customer (internal to the university here) that does high-throughput gene sequencing. They like a server which can hold the large amounts of data, do a first-pass analysis on it, and then serve it up over the network to a compute cluster for further computation. Oracle has nothing in their product line (anymore) to meet that need. They ended up ordering an 8U chassis w/ 40x 2TB drives in it, and are willing to pay the $2k/yr retail ransom to Oracle to run Solaris (ZFS) on it, at least for the first year. Maybe OpenIndiana next year, we'll see. Bye Oracle. Regards, Marion

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] X4540 no next-gen product?
On 4/8/2011 12:37 AM, Ian Collins wrote: On 04/ 8/11 06:30 PM, Erik Trimble wrote: On 4/7/2011 10:25 AM, Chris Banal wrote: While I understand everything at Oracle is "top secret" these days. Does anyone have any insight into a next-gen X4500 / X4540? Does some other Oracle / Sun partner make a comparable system that is fully supported by Oracle / Sun? http://www.oracle.com/us/products/servers-storage/servers/previous-products/index.html What do X4500 / X4540 owners use if they'd like more comparable zfs based storage and full Oracle support? I'm aware of Nexenta and other cloned products but am specifically asking about Oracle supported hardware. However, does anyone know if these type of vendors will be at NAB this year? I'd like to talk to a few if they are... The move seems to be to the Unified Storage (aka ZFS Storage) line, which is a successor to the 7000-series OpenStorage stuff. http://www.oracle.com/us/products/servers-storage/storage/unified-storage/index.html Which is not a lot of use to those of us who use X4540s for what they were intended: storage appliances. We have had to take the retrograde step of adding more, smaller servers (like the ones we consolidated on the X4540s!). Sorry, I read the question differently, as in "I have X4500/X4540 now, and want more of them, but Oracle doesn't sell them anymore, what can I buy?". The 7000-series (now: Unified Storage) *are* storage appliances. If you have an X4540/X4500 (and some cash burning a hole in your pocket), Oracle will be happy to sell you a support license (which should include later versions of ZFS software). But, don't quote me on that - talk to a Sales Rep if you want a Quote. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4540 no next-gen product?
On 4/7/2011 10:25 AM, Chris Banal wrote: While I understand everything at Oracle is "top secret" these days. Does anyone have any insight into a next-gen X4500 / X4540? Does some other Oracle / Sun partner make a comparable system that is fully supported by Oracle / Sun? http://www.oracle.com/us/products/servers-storage/servers/previous-products/index.html What do X4500 / X4540 owners use if they'd like more comparable zfs based storage and full Oracle support? I'm aware of Nexenta and other cloned products but am specifically asking about Oracle supported hardware. However, does anyone know if these type of vendors will be at NAB this year? I'd like to talk to a few if they are... The move seems to be to the Unified Storage (aka ZFS Storage) line, which is a successor to the 7000-series OpenStorage stuff. http://www.oracle.com/us/products/servers-storage/storage/unified-storage/index.html -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best practice for boot partition layout in ZFS
On 4/6/2011 7:50 AM, Lori Alt wrote: On 04/ 6/11 07:59 AM, Arjun YK wrote: Hi, I am trying to use ZFS for boot, and am kind of confused about how boot partitions like /var are to be laid out. With old UFS, we create /var as a separate filesystem to avoid various logs filling up the / filesystem. I believe that creating /var as a separate file system was a common practice, but not a universal one. It really depended on the environment and local requirements. With ZFS, during the OS install it gives the option to "Put /var on a separate dataset", but no option is given to set a quota. Maybe others set quotas manually. Having a separate /var dataset gives you the option of setting a quota on it later. That's why we provided the option. It was a way of enabling administrators to get the same effect as having a separate /var slice did with ufs. Administrators can choose to use it or not, depending on local requirements. So, I am trying to understand what's the best practice for /var in ZFS. Is it exactly the same as in UFS, or is there anything different? I'm not sure there's a defined "best practice". Maybe someone else can answer that question. My guess is that in environments where, before, a separate ufs /var slice was used, a separate zfs /var dataset with a quota might now be appropriate. Lori Could someone share some thoughts? Thanks Arjun

Traditionally, the reason for a separate /var was one of two major items:

(a) /var was writable, and / wasn't - this was typical of diskless or minimal local-disk configurations. Modern packaging systems are making this kind of configuration increasingly difficult.

(b) /var held a substantial amount of data, which needed to be handled separately from / - mail and news servers are a classic example.

For typical machines nowadays, with large root disks, there is very little chance of /var suddenly exploding and filling / (the classic example of being screwed...).
Outside of the above two cases, about the only other place I can see that having /var separate is a good idea is for certain test machines, where you expect frequent memory dumps (in /var/crash) - if you have a large amount of RAM, you'll need a lot of disk space, so it might be good to limit /var in this case by making it a separate dataset. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cannot remove zil device
On 3/27/2011 7:48 AM, Jordan McQuown wrote: The pool was originally created on 2008.11, then upgraded to 2009.06 and again upgraded to snv-134. Originally I was using Intel X25-Es as the ZIL, but recently purchased a ZeusRAM; when trying to remove the Intel, I issue the zpool remove command and it appears to work. Yet when I do a zpool status, the Intel device is still a pool member. Any thoughts?

Did you remember to upgrade the pool itself? Not just the OS, but the pool, too, needs to be upgraded. You'll need to run a 'zpool upgrade' on the pool with the log device. Only later versions of ZFS supported removal of log devices. IIRC, the zpool version supported by 2008.11 definitely didn't, and I'm pretty sure the 2009.06 version didn't either, but b134 definitely does.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] best migration path from Solaris 10
On 3/23/2011 6:14 AM, Deano wrote: OpenIndiana and others (e.g. BeleniX) are distributions that actively support full desktop workstations based on the Illumos base. It is true that the storage server application is a popular one, and so it has supporters both commercial and otherwise. ZFS is amazing, and quite rightly it stands out; it works even better when used with zones, crossbow, dtrace, etc., so it's obvious why it's a focus and often seems the only priority. However, it isn't the only interest, by a long shot. The SFE package repositories have many packages available to install for when the binary packages aren't up to date. OpenIndiana is hard at work trying to build bigger binary repositories with more apps and newer versions. A simple "pkg install APPLICATION" is the aim for the majority of main applications. Is it not moving fast enough, or missing the packages you need? Well, that's the beauty of Open Source: we welcome newcomers, and have systems to help them add and update the packages and applications they want, so we all benefit. Ultimately I'd (and I'm sure many would) like to have binary repositories on a level similar to Debian's, with stable and faster-changing repos and support for many different applications; however, that requires a lot of work and manpower. Bye, Deano

Honestly (and I say this from purely personal preference and bias, not any official statement), I see the long-term future of Solaris (and Illumos-based distros) as the new engine for appliances, supplanting Linux and the *BSDs in that space. For a lot of reasons, Solaris has a long list of very superior functionality that makes it very appealing for appliance makers. Right now, we see that in two areas: ZFS for storage, and high scalability for DBs (the various Oracle Exadata stuff). I'm expecting to see a whole raft of things start to show up - JVM container systems (Run Your App Server in SUPERMAN MODE!
), online backup devices, firewall appliances, spam and mail filter systems, intrusion detection systems, maybe even software routers, etc... It's here that I think Solaris' strengths can beat its competitors, and where its weaknesses aren't significant. Sadly, I think Solaris' future as a general-purpose OS is likely finished. Of course, that's just my reading of the tea leaves... -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A resilver record?
On 3/21/2011 3:25 PM, Richard Elling wrote: On Mar 21, 2011, at 5:09 AM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Richard Elling How many times do we have to rehash this? The speed of resilver is dependent on the amount of data, the distribution of data on the resilvering device, speed of the resilvering device, and the throttle. It is NOT dependent on the number of drives in the vdev. What the heck? Yes it is. Indirectly. When you say it depends on the amount of data, speed of resilvering device, etc, what you really mean (correctly) is that it depends on the total number of used blocks that must be resilvered on the resilvering device, multiplied by the access time for the resilvering device. And of course, throttling and usage during resilver can have a big impact. And various other factors. But the controllable big factor is the number of blocks used in the degraded vdev. There is no direct correlation between the number of blocks and resilver time. Just to be clear here, remember block != slab. Slab is the allocation unit often seen through the "recordsize" attribute. The number of data *slabs* directly correlates to resilver time. So here is how the number of devices in the vdev matter: If you have your whole pool made of one vdev, then every block in the pool will be on the resilvering device. You must spend time resilvering every single block in the whole pool. If you have the same amount of data, on a pool broken into N smaller vdev's, then approximately speaking, 1/N of the blocks in the pool must be resilvered on the resilvering vdev. And therefore the resilver goes approximately N times faster. Nope. The resilver time is dependent on the speed of the resilvering disk. Well, unless my previous posts are completely wrong, I can't see how resilver time is primarily bounded by speed (i.e bandwidth/throughput) of the HD for the vast majority of use cases. 
The IOPS and raw speed of the underlying backing store help define how fast the workload (i.e. total used slabs) gets processed. The layout of the vdev, and the on-disk data distribution, define the total IOPS required to resilver the slab workload. Most data distribution/vdev layout combinations will result in an IOPS-bound resilver disk, not a bandwidth-saturated resilver disk.

So if you assume the size of the pool or the number of total disks is a given, determined by outside constraints and design requirements, and you then face the decision of how to architect the vdevs in your pool, then yes, the number of devices in a vdev does dramatically impact the resilver time - only because the number of blocks written in each vdev depends on these decisions you made earlier.

I do not think it is wise to set the vdev configuration based on a model for resilver time. Choose the configuration to get the best data protection. -- richard

Depends on the needs of the end-user. I can certainly see places where it would be better to build a pool out of RAIDZ2 devices rather than RAIDZ3 devices. And, of course, the converse. Resilver times should be a consideration in building your pool, just like performance and disk costs are. How much you value it, of course, is up to the end-user.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] A resilver record?
On 3/20/2011 2:23 PM, Richard Elling wrote: On Mar 20, 2011, at 12:48 PM, David Magda wrote: On Mar 20, 2011, at 14:24, Roy Sigurd Karlsbakk wrote: It all depends on the number of drives in the VDEV(s), traffic patterns during resilver, speed VDEV fill, of drives etc. Still, close to 6 days is a lot. Can you detail your configuration? How many times do we have to rehash this? The speed of resilver is dependent on the amount of data, the distribution of data on the resilvering device, speed of the resilvering device, and the throttle. It is NOT dependent on the number of drives in the vdev. Thanks for clearing this up - I've been told large VDEVs lead to long resilver times, but then, I guess that was wrong. There was a thread ("Suggested RaidZ configuration...") a little while back where the topic of IOps and resilver time came up: http://mail.opensolaris.org/pipermail/zfs-discuss/2010-September/thread.html#44633 I think this message by Erik Trimble is a good summary: hmmm... I must've missed that one, otherwise I would have said... Scenario 1:I have 5 1TB disks in a raidz1, and I assume I have 128k slab sizes. Thus, I have 32k of data for each slab written to each disk. (4x32k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 32k of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to reconstruct the full 1TB drive. Here, the IOPS doesn't matter because the limit will be the media write speed of the resilvering disk -- bandwidth. Scenario 2:I have 10 1TB drives in a raidz1, with the same 128k slab sizes. In this case, there's only about 14k of data on each drive for a slab. This means, each IOPS to the failed drive only write 14k. So, it takes 1TB/14k = 71e6 IOPS to complete. Here, IOPS might matter, but I doubt it. Where we see IOPS matter is when the block sizes are small (eg. metadata). In some cases you can see widely varying resilver times when the data is large versus small. 
These changes follow the temporal distribution of the original data. For example, if a pool's life begins with someone loading their MP3 collection (large blocks, mostly sequential) and then working on source code (small blocks, more random, lots of creates/unlinks), then the resilver will be bandwidth-bound as it resilvers the MP3s and then IOPS-bound as it resilvers the source. Hence, the prediction of when resilver will finish is not very accurate.

From this, it can be pretty easy to see that the number of required IOPS to the resilvered disk goes up linearly with the number of data drives in a vdev. Since you're always going to be IOPS-bound by the single disk resilvering, you have a fixed limit.

You will not always be IOPS bound by the resilvering disk. You will be speed bound by the resilvering disk, where speed is either write bandwidth or random write IOPS. -- richard

Really? Can you really be bandwidth-limited on a (typical) RAIDZ resilver? I can see where you might be on a mirror, with large slabs and essentially sequential read/write - that is, since the drivers can queue up several read/write requests at a time, you have the potential to be reading/writing several (let's say 4) 128k slabs per single IOPS. That means you read/write at 512k per IOPS for a mirror (best-case scenario). For a 7200RPM drive, that's 100 IOPS x .5MB/IOPS = 50MB/s, which is lower than the maximum throughput of a modern SATA drive. For one of the 15k SAS drives able to do 300 IOPS, you get 150MB/s, which indeed exceeds a SAS drive's write bandwidth.

For RAIDZn configs, however, you're going to be limited on the size of an individual read/write. As Roy pointed out before, the max size of an individual portion of a slab is 128k/X, where X = number of data drives in RAIDZn.
So, for a typical 4-data-drive RAIDZn, even in the best-case scenario where I can queue multiple slab requests (say 4) into a single IOPS, that means I'm likely to top out at about 128k of data to write to the resilvered drive per IOPS. Which leads to 12MB/s for the 7200RPM drive, and 36MB/s for the 15k drive, both well under their respective bandwidth capability. Even with large slab sizes, I really can't see any place where a RAIDZ resilver isn't going to be IOPS-bound when using HDs as backing store. Mirrors are more likely, but still, even in that case, I think you're going to hit the IOPS barrier far more often than the bandwidth barrier.

Now, with SSDs as backing store, yes, you become bandwidth-limited, because the IOPS values of SSDs are at least an order of magnitude greater than HDs, though both have similar max bandwidth characteristics.

Now, the *total* time it takes to resilver either a mirror or RAIDZ is indeed primarily dependent on the number of allocated slabs in the vdev, and the level of fragmentation of those slabs. That essentially sets the size of the resilver workload; the vdev layout and the speed of the resilvering disk then determine how fast that workload can be processed.
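The back-of-envelope throughput figures in the last two posts can be sketched as follows. The 4-slabs-per-I/O queuing assumption comes from the discussion above; the exact MB/s values differ slightly from the rounded 50/12/36 figures in the posts:

```python
def resilver_mb_per_s(iops, slab_bytes, data_disks, queued_slabs=4):
    """Back-of-envelope write rate to the resilvering disk.

    Each slab contributes slab_bytes/data_disks to the rebuilt disk
    (data_disks=1 models a mirror), and we assume the driver can queue
    queued_slabs slabs per physical I/O, as in the discussion above."""
    per_io = (slab_bytes // data_disks) * queued_slabs
    return iops * per_io / 1e6  # decimal MB/s

slab = 128 * 1024
# Mirror: whole 128k slabs, 4 per I/O -> ~50MB/s at 7200rpm (~100 IOPS)
mirror_7200 = resilver_mb_per_s(100, slab, 1)
# RAIDZ, 4 data disks: 32k per slab per disk -> ~13MB/s at 7200rpm
raidz_7200 = resilver_mb_per_s(100, slab, 4)
# Same layout on a 15k drive (~300 IOPS) -> ~39MB/s
raidz_15k = resilver_mb_per_s(300, slab, 4)
print(f"mirror@7200rpm: {mirror_7200:.1f} MB/s, "
      f"raidz@7200rpm: {raidz_7200:.1f} MB/s, "
      f"raidz@15k: {raidz_15k:.1f} MB/s")
```

All three RAIDZ figures stay well under a modern drive's sequential write bandwidth, which is the point of the argument: on spinning disks, a RAIDZ resilver is IOPS-bound, not bandwidth-bound.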
Re: [zfs-discuss] sorry everyone was: Re: External SATA drive enclosures + ZFS?
On Fri, 2011-02-25 at 20:29 -0800, Yaverot wrote: > Sorry all, didn't realize that half of Oracle would auto-reply to a public > mailing list since they're out of the office 9:30 Friday nights. I'll try to > make my initial post each month during daylight hours in the future. > ___ Nah, probably just a Beehive (our mail system) burp. Happens a lot. Besides, it's 8:45 PST here, and I'm still at work. :-) -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS send/recv initial data load
On 2/16/2011 8:08 AM, Richard Elling wrote: On Feb 16, 2011, at 7:38 AM, white...@gmail.com wrote: Hi, I have a very limited amount of bandwidth between main office and a colocated rack of servers in a managed datacenter. My hope is to be able to zfs send/recv small incremental changes on a nightly basis as a secondary offsite backup strategy. My question is about the initial "seed" of the data. Is it possible to use a portable drive to copy the initial zfs filesystem(s) to the remote location and then make the subsequent incrementals over the network? If so, what would I need to do to make sure it is an exact copy? Thank you, Yes, and this is a good idea. Once you have replicated a snapshot, it will be an exact replica -- it is an all-or-nothing operation. You can then make more replicas or incrementally add snapshots. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss To follow up on Richard's post, what you want to do is a perfectly good way to deal with moving large amounts of data via Sneakernet. :-) I'd suggest that you create a full zfs filesystem on the external drive, and use 'zfs send/receive' to copy a snapshot from the production box to there, rather than try to store just a file from the output of 'zfs send'. You can then 'zfs send/receive' that backup snapshot from the external drive onto your remote backup machine when you carry the drive over there later. As Richard mentioned, that snapshot is unique, and it doesn't matter that you "recovered" it onto an external drive first, then copied that snapshot over to the backup machine. It's a frozen snapshot, so you're all good for future incrementals. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] One LUN per RAID group
On 2/15/2011 1:37 PM, Torrey McMahon wrote: On 2/14/2011 10:37 PM, Erik Trimble wrote: That said, given that SAN NVRAM caches are true write caches (and not a ZIL-like thing), it should be relatively simple to swamp one with write requests (most SANs have little more than 1GB of cache), at which point, the SAN will be blocking on flushing its cache to disk. Actually, most array controllers now have 10s if not 100s of GB of cache. The 6780 has 32GB, DMX-4 has - if I remember correctly - 256. The latest HDS box is probably close if not more. Of course you still have to flush to disk and the cache flush algorithms of the boxes themselves come into play but 1GB was a long time ago. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss STK2540 and the STK6140 have at most 1GB. STK6180 has 4GB. The move to large GB caches is only recent - only large setups (i.e. big arrays with a dedicated SAN head) have had multi-GB NVRAM cache for any length of time. In particular, pretty much all base arrays still have 4GB or less on the enclosure controller - only in the SAN heads do you find big multi-GB caches. And, lots (I'm going to be brave and say the vast majority) of ZFS deployments use direct-attach arrays or internal storage, rather than large SAN configs. Lots of places with older SAN heads are also going to have much smaller caches. Given the price tag of most large SANs, I'm thinking that there are still huge numbers of 5+ year-old SANs out there, and practically all of them have only a dozen GB or less of cache. So, yes, big modern SAN configurations have lots of cache. But they're also the ones most likely to be hammered with huge amounts of I/O from multiple machines. All of which makes it relatively easy to blow through the cache capacity and slow I/O back down to the disk speed. 
Once you get back down to raw disk speed, having multiple LUNs per raid array is almost certainly going to perform worse than a single LUN, due to thrashing. That is, it would certainly be better (i.e. faster) for an array to have to commit 1 128k slab than 4 32k slabs. So, the original recommendation is interesting, but needs the caveat that you'd really only use it if you can either limit the amount of sustained I/O you have, or are using very-large-cache disk setups. I would think the idea might also apply (i.e. be useful) for something like the F5100 or similar RAM/Flash arrays. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] One LUN per RAID group
On 2/14/2011 3:52 PM, Gary Mills wrote: On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote: On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills wrote: Is there any reason not to use one LUN per RAID group? [...] In other words, if you build a zpool with one vdev of 10GB and another with two vdev's each of 5GB (both coming from the same array and raid set) you get almost exactly twice the random read performance from the 2x5 zpool vs. the 1x10 zpool. This finding is surprising to me. How do you explain it? Is it simply that you get twice as many outstanding I/O requests with two LUNs? Is it limited by the default I/O queue depth in ZFS? After all, all of the I/O requests must be handled by the same RAID group once they reach the storage device. Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot spares), you get substantially better random read performance using 10 LUNs vs. 1 LUN. While inconvenient, this just reflects the scaling of ZFS with the number of vdevs, not the number of "spindles". I'm going to go out on a limb here and say that you get the extra performance under one condition: you don't overwhelm the NVRAM write cache on the SAN device head. So long as the SAN's NVRAM cache can acknowledge the write immediately (i.e. it isn't full with pending commits to backing store), then, yes, having multiple write commits coming from different ZFS vdevs will obviously give more performance than a single ZFS vdev. That said, given that SAN NVRAM caches are true write caches (and not a ZIL-like thing), it should be relatively simple to swamp one with write requests (most SANs have little more than 1GB of cache), at which point, the SAN will be blocking on flushing its cache to disk. So, if you can arrange your workload to avoid exceeding the maximum write load of the SAN's raid array over a defined period, then, yes, go with the multiple LUN/array setup. 
In particular, I would think this would be excellent for small-write/latency-sensitive applications, where the total amount of data written (over several seconds) isn't large, but where latency is critical. For larger I/O requests (or for consistent, sustained I/O of more than small amounts), all bets are off as far as any possible advantage of multiple LUNs/array. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
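For reference, the 1x10 vs. 2x5 experiment described in this thread boils down to something like the following. The device names are placeholders for two LUN layouts exported from the same RAID group:

```shell
# Pool A: one 10GB LUN as a single top-level vdev
zpool create poolA c2t0d0

# Pool B: two 5GB LUNs from the same RAID group, giving two top-level vdevs
# (and therefore twice the concurrent I/O queues from ZFS's point of view)
zpool create poolB c2t1d0 c2t2d0

# Watch per-vdev throughput while a random-read benchmark runs against each
zpool iostat -v poolA 5
zpool iostat -v poolB 5
```

The layout is the only variable; the backing spindles are identical, which is what makes the 2x result about ZFS queueing rather than disk count.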
Re: [zfs-discuss] CPU Limited on Checksums?
On 2/8/2011 8:41 AM, Krunal Desai wrote: Hi all, My system is powered by an Intel Core 2 Duo (E6600) with 8GB of RAM. Running into some very heavy CPU usage. First, a copy from one zpool to another (cp -aRv /oldtank/documents* /tank/documents/*), both in the same system. Load averages are around ~4.8. I think I used lockstat correctly, and found the following:

movax@megatron:/tank# lockstat -kIW -D 20 sleep 30
Profiling interrupt: 2960 events in 30.516 seconds (97 events/sec)
Count indv cuml rcnt   nsec Hottest CPU+PIL  Caller
---
 1518  51%  51% 0.00   1800 cpu[0]           SHA256TransformBlocks
  334  11%  63% 0.00   2820 cpu[0]           vdev_raidz_generate_parity_pq
  261   9%  71% 0.00   3493 cpu[0]           bcopy_altentry
  119   4%  75% 0.00   3033 cpu[0]           mutex_enter
   73   2%  78% 0.00   2818 cpu[0]           i86_mwait

So, obviously here it seems checksum calculation is, to put it mildly, eating up CPU cycles like none other. I believe it's bad(TM) to turn off checksums? (zfs property just has checksum=on; I guess it has defaulted to SHA256 checksums?) Second, a copy from my desktop PC to my new zpool (5900rpm drive over GigE to 2 6-drive RAID-Z2s). Load averages are around ~3. Again, with lockstat:

movax@megatron:/tank# lockstat -kIW -D 20 sleep 30
Profiling interrupt: 2919 events in 30.089 seconds (97 events/sec)
Count indv cuml rcnt   nsec Hottest CPU+PIL  Caller
---
 1298  44%  44% 0.00   1853 cpu[0]           i86_mwait
  301  10%  55% 0.00   2700 cpu[0]           vdev_raidz_generate_parity_pq
  144   5%  60% 0.00   3569 cpu[0]           bcopy_altentry
  103   4%  63% 0.00   3933 cpu[0]           ddi_getl
   83   3%  66% 0.00   2465 cpu[0]           mutex_enter

Here it seems as if 'i86_mwait' is occupying the top spot (is this because I have power-management set to poll my CPU?). Is something odd happening drive-buffer-wise? (i.e. coming in on NIC, buffered in the HBA somehow, and then flushed to disks?) In either case, it seems I'm hitting a ceiling of around 65MB/s. I assume CPU is bottlenecking, since bonnie++ benches resulted in much better performance for the vdev. 
In the latter case, though, it may just be a limitation of the source drive (if it can't read data faster than 65MB/s, I can't write faster than that...). e: E6600 is a first-generation 65nm LGA775 CPU, clocked at 2.40GHz. Dual-core, no hyper-threading. In the second case, you're almost certainly hitting a network bottleneck. If you don't have Jumbo Frames turned on at both ends of the connection, and a switch that understands JF, then the CPU is spending a horrible amount of time doing interrupts, trying to re-assemble small packets. I also think you might be running into an erratum around old Xeons, CR 6588054. This was fixed in kernel patch 127128-11, included in s10u5 or later. Otherwise, it might be an issue with the powersave functionality of certain Intel CPUs. In either case, try putting this in your /etc/system: set idle_cpu_prefer_mwait = 0 If that fix causes an issue (and there are reports that it occasionally does), you'll need to boot without your /etc/system: append the '-a' flag to the end of the GRUB menu entry that you boot from. This will push you into an interactive boot, where, when it asks you for a /etc/system to use, specify /dev/null. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
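For anyone wanting to reproduce the diagnosis and then apply the workaround from this thread, the steps look roughly like this (the lockstat invocation is the one used in the original post; the rest is a sketch, not a verified recipe):

```shell
# Sample kernel profiling interrupts for 30 seconds, showing the top 20 call
# sites; a dominant i86_mwait entry points at the idle-loop issue, not real work
lockstat -kIW -D 20 sleep 30

# Workaround from the thread: disable the mwait idle-loop preference.
# Requires root; takes effect on the next boot.
echo 'set idle_cpu_prefer_mwait = 0' >> /etc/system
init 6
```

If the machine then misbehaves at boot, the recovery path described above (boot with '-a' and give /dev/null as the /etc/system to use) undoes the setting for that boot.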
Re: [zfs-discuss] Sil3124 Sata controller for ZFS on Sparc OpenSolaris Nevada b130
On 2/8/2011 2:17 AM, Tomas Ögren wrote: On 08 February, 2011 - Robert Soubie sent me these 1,1K bytes: On 08/02/2011 07:10, Jerry Kemp wrote: As part of a small home project, I have purchased a SIL3124 hba in hopes of attaching an external drive/drive enclosure via eSATA. The host in question is an old Sun Netra T1 currently running OpenSolaris Nevada b130. The card in question is this Sil3124 card: http://www.newegg.com/product/product.aspx?item=N82E16816124003 although I did not purchase it from Newegg. I specifically purchased this card as I have seen specific reports of it working under Solaris/OpenSolaris distros on several Solaris mailing lists. I use a non-eSATA version of this card under Solaris Express 11 for a boot mirrored ZFS pool. And another one for a Windows 7 machine that does backups of the server. BIOS and drivers are available from the Silicon Image site, but nothing for Solaris. The problem itself is SPARC vs x86 and firmware for the card. AFAIK, there is no SATA card with drivers for Solaris SPARC. Use a SAS card. /Tomas Tomas is correct. This is a hardware issue, not an OS driver one. In order to use a card with SPARC, its firmware must be OpenBoot-aware. Pretty much all consumer SATA cards only have PC BIOS firmware, as there is no market for sales to SPARC folks. However, several of the low-end SAS cards ($100 or so) also have OpenBoot firmware available, in addition to PC BIOS firmware. In particular, the LSI1068 series-based HBAs are a good place to look. Note that you *might* have to flash the new OpenBoot firmware onto the card - cards don't come with both PC-BIOS and OpenBoot firmware. Be sure to check the OEM's web site to make sure that the card is explicitly supported for SPARC, not just "Solaris". -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 2/7/2011 1:10 PM, Yi Zhang wrote: [snip] This is actually what I did for 2.a) in my original post. My concern there is that ZFS' internal write buffering makes it hard to get a grip on my application's behavior. I want to present my application's "raw" I/O performance without too many outside factors... UFS plus directio gives me exactly (or close to) that but ZFS doesn't... Of course, in the final deployment, it would be great to be able to take advantage of ZFS' advanced features such as I/O optimization. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss And there's your answer. You seem to care about doing "bare-metal" I/O for tuning of your application, so you can do consistent measurements - not for actual usage in production. Therefore, do what's implied in the above: develop your app using UFS w/directio to work out the application issues and tune it. When you deploy it, use ZFS and its caching techniques to get maximum (though not absolutely consistently measurable) performance for the already-tuned app. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
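A minimal sketch of the tuning-phase setup suggested here - UFS with forced direct I/O so the filesystem cache stays out of the measurements. The device path and mount point are hypothetical:

```shell
# Mount a UFS filesystem with directio for "bare-metal" application benchmarks
mount -F ufs -o forcedirectio /dev/dsk/c0t1d0s6 /appdata

# Or flip the option on an already-mounted UFS filesystem via remount
mount -F ufs -o remount,forcedirectio /dev/dsk/c0t1d0s6 /appdata
```

With forcedirectio, reads and writes bypass the page cache, so the numbers you measure are much closer to what the application actually issues.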
Re: [zfs-discuss] deduplication requirements
On 2/7/2011 1:06 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Michael Core i7 2600 CPU 16gb DDR3 Memory 64GB SSD for ZIL (optional) Would this produce decent results for deduplication of 16TB worth of pools or would I need more RAM still? What matters is the amount of unique data in your pool. I'll just assume it's all unique, but of course that's ridiculous because if it's all unique then why would you want to enable dedup. But anyway, I'm assuming 16T of unique data. The rule is a little less than 3G of ram for every 1T of unique data. In your case, 16*2.8 = 44.8G ram required in addition to your base ram configuration. You need at least 48G of ram. Or less unique data. To follow up on Ned's estimation, please let us know what kind of data you're planning on putting in the Dedup'd zpool. That can really give us a better idea as to the number of slabs that the pool will have, which is what drives dedup RAM and L2ARC usage. You also want to use an SSD for L2ARC, NOT for ZIL (though, you *might* also want one for ZIL, depending on your write patterns). In all honesty, these days, it doesn't pay to dedup a pool unless you can count on large amounts of common data. Virtual Machine images, incremental backups, ISO images of data CD/DVDs, and some Video are your best bet. Pretty much everything else is going to cost you more in RAM/L2ARC than it's worth. IMHO, you don't want Dedup unless you can *count* on a 10x savings factor. Also, for reasons discussed here before, I would not recommend a Core i7 for use as a fileserver CPU. It's an Intel Desktop CPU, and almost certainly won't support ECC RAM on your motherboard, and it's seriously overpowered for your use. See if you can find a nice socket AM3+ motherboard for a low-range Athlon X3/X4. You can get ECC RAM for it (even in a desktop motherboard), it will cost less, and perform at least as well. Dedup is not CPU intensive. 
Compression is, and you may very well want to enable that, but you're still very unlikely to hit a CPU bottleneck before RAM starvation or disk wait occurs. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
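Ned's "~3G of ram for every 1T" rule is just arithmetic over the expected block count. A sketch of the estimate, assuming the commonly quoted figure of roughly 320 bytes of DDT entry per block and a 128K average block size (both are rules of thumb from this era, not exact):

```shell
# Rough DDT memory estimate for 16TB of unique data at 128K average block size
pool_tb=16
blocks=$(( pool_tb * 1024 * 1024 * 1024 / 128 ))   # KB of data / 128KB per block
ddt_bytes=$(( blocks * 320 ))                      # ~320 bytes per DDT entry
ddt_gb=$(( ddt_bytes / 1024 / 1024 / 1024 ))
echo "~${ddt_gb} GB of RAM/L2ARC needed for the DDT"
```

This comes out around 40GB, versus Ned's 44.8GB; he used a slightly fatter per-TB constant. Either way the answer is "tens of GB on top of your base RAM."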
Re: [zfs-discuss] ZFS and TRIM - No need for TRIM
On 2/6/2011 3:51 AM, Orvar Korvar wrote: Ok, so can we say that the conclusion for a home user is: 1) Using SSD without TRIM is acceptable. The only drawback is that without TRIM, the SSD will write much more, which affects lifetime. Because when the SSD has written enough, it will break. I don't have high demands for my OS disk, so battery backup is overkill for my needs. So I can happily settle for the next-gen Intel G3 SSD disk, without worrying the SSD will break because Solaris has no TRIM support yet? Yes. All modern SSDs will wear out, but, even without TRIM support, it will be a significant time (5+ years) before they do. Internal wear-leveling by the SSD controller results in an expected lifespan about the same as hard drives. TRIM really only impacts performance. For the ZFS ZIL use case, TRIM has only a small impact on performance - SSD performance for ZIL drops off quickly from peak, and supporting TRIM would only slightly mitigate this. For home use, lack of TRIM support won't noticeably change your performance as a ZIL cache or lower the lifespan of the SSD. The Intel X25-M (either G3 or G2) would be sufficient for your purposes. In general, we do strongly recommend you put a UPS on your system, to avoid cache corruption in case of power outages. 2) And later, when Solaris gets TRIM support, should I reformat or is there no need to reformat? I mean, maybe I must format and reinstall to get TRIM all over the disk. Or will TRIM immediately start to do its magic? If/when TRIM is supported by ZFS, I would expect this to be transparent to you, the end-user. You'd have to upgrade the OS to the proper new patchlevel, and *possibly* run a 'zpool upgrade' to update the various pools to the latest version, but I suspect the latter will be completely unnecessary. TRIM support would come in ZFS's guts, not in the pool format. 
Worst case is that you'd have to enable TRIM at the device layer, which would probably entail either editing a config file and rebooting, or just running some command to enable the feature. I can't imagine it would require any reformatting or reinstalling. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and TRIM - No need for TRIM
On 2/5/2011 5:44 AM, Orvar Korvar wrote: So... Sun's SSD used for ZIL and L2ARC does not use TRIM, so how big a problem is lack of TRIM in ZFS really? It should not hinder anyone from running without TRIM? I didn't really understand the answer to this question. Because Sun's SSD does not use TRIM - and it is not considered a hindrance? A home user could use SSD without greater problems? If I format the disk every year and reinstall, would that help? I am concerned as a home user... As a home user, you don't really care about support for TRIM. Even a reasonable SSD (i.e. not "Enterprise" level) will provide a very significant boost when used as a ZIL, over just having bare drives (in particular, if you just have SATA 7200 or 5400 RPM drives, you will really notice the boost). Ideally, you want an SSD with a battery or supercapacitor, but they're pretty expensive right now (i.e. OCZ's Vertex 2 Pro series). If you have a dependable UPS for the system, and can accept the (small) risk that you might lose the last write commit in certain power-loss scenarios, a mid-line SSD is entirely OK. Note that a ZIL cache device is NOT A WRITE CACHE. ZIL is there for synchronous writes only, so that ZFS can essentially turn synchronous writes into asynchronous writes. NFS is a big sync() write user, but Samba is not. You won't notice any improvement over Samba when using a ZIL cache. Don't bother with a reformat of the SSD. It won't help - at most, a yearly reformat would reset absolutely everything inside it, but you won't notice any real difference even after you do. Regarding the DDRAM SSDs that don't need TRIM, they seem interesting. I got a mail from the CTO of DDRdrive who did not want to mail publicly to this list (advertisement, he said), but it seems expensive; the DDRdrive X1, which is targeted at the Enterprise, costs $2000 USD. Are there any home user variants? 
But reading this presentation, it seems to have a point of using DDRAM variants: http://www.ddrdrive.com/zil_accelerator.pdf Quite interesting info. There are a couple of things similar to the DDRdrive, but they're all Enterprise-class, and going to cost you. There's no $200 solution. Which is kind of sad, since building a DDRdrive-like thing really just requires two 4Gbit DRAM chips, two 4Gbit NAND chips, a small battery, and an FPGA that does some basic PCI-E to Memory mapping (and a little bit of extra DRAM->NAND copying if power is lost). Really, it's entirely doable for someone with decent VLSI experience. The DDRdrive folks have got the experience to do it, but, honestly, it's not a complex product. Bottom line, it's maybe $50 in parts, plus a $100k VLSI Engineer to do the design. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and TRIM
On 2/4/2011 7:39 AM, Christopher George wrote: So, the bottom line is that Solaris 11 Express can not use TRIM and SSD? Correct. So, it might not be a good idea to use a SSD? It is true that a Flash based SSD will be adversely impacted by ZFS not supporting TRIM, especially for the ZIL accelerator. But a DRAM based SSD is immune to TRIM support status and thus unaffected. Actually, TRIM support would only add unnecessary overhead to the DDRdrive X1's device driver. Best regards, Christopher George Founder/CTO www.ddrdrive.com Bottom line here is this: for a ZIL, you have a hierarchy of performance, each about two orders of magnitude faster than the prior: 1. hard drive 2. NAND-based SSD 3. DRAM-based SSD You'll still get a very noticeable improvement from using a NAND (flash) SSD over not using a dedicated ZIL device. It just won't be the improvement "promised" by the SSD packaging. If that performance isn't sufficient for you, then a DRAM SSD is your best bet. Note that even if TRIM were supported, it wouldn't remove the whole penalty that a fully-written-to NAND SSD suffers. NAND requires that any block that was previously written to be erased BEFORE it can be written to again. TRIM only helps with using unwritten blocks inside pages, and with scheduling whole-page erasures inside the SSD controller. I can't put real numbers on it, but I would suspect that rather than suffer a 10x loss of performance, you might only lose 5x or so if TRIM were properly usable. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS not usable (was ZFS Dedup question)
On 1/28/2011 2:24 PM, Roy Sigurd Karlsbakk wrote: I created a zfs pool with dedup with the following settings: zpool create data c8t1d0 zfs create data/shared zfs set dedup=on data/shared The thing I was wondering about was it seems like ZFS only dedup at the file level and not the block. When I make multiple copies of a file to the store I see an increase in the dedup ratio, but when I copy similar files the ratio stays at 1.00x. I've done some rather intensive tests on zfs dedup on this 12TB test system we have. I have concluded that with some 150GB worth of L2ARC and 8GB ARC, ZFS dedup is unusable for volumes even at 2TB storage. It works, but it's dead slow in write terms, and the time to remove a dataset is still very long. I wouldn't recommend using ZFS dedup unless your name were Ahmed Nazif or Silvio Berlusconi, where the damage might be used for some good. Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- If you want Dedup to perform well, you *absolutely* must have an L2ARC device which can hold the *entire* Dedup Table. Remember, the size of the DDT is not dependent on the size of your data pool, but on the number of zfs slabs which are contained in that pool (slab = record, for this purpose). Thus, 12TB worth of DVD iso images (record size about 128k) will consume 256 times less DDT space than will 12TB filled with text configuration files (average record size < 512b). And, I doubt 8GB for ARC is sufficient, either, for a DDT consuming over 100GB of space. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
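The "256 times less DDT space" figure above falls directly out of the record counts; a quick sanity check in shell, using the record sizes quoted in the message:

```shell
# Number of DDT entries for 12TB of data at two different average record sizes
tb=12
bytes=$(( tb * 1024 * 1024 * 1024 * 1024 ))
iso_entries=$(( bytes / (128 * 1024) ))   # 128K records (DVD ISO images)
txt_entries=$(( bytes / 512 ))            # 512-byte records (tiny text files)
ratio=$(( txt_entries / iso_entries ))
echo "DDT entry ratio: ${ratio}x"         # 131072 / 512 = 256
```

Same pool size, same data volume: the DDT cost is driven entirely by how many records that volume is cut into.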
Re: [zfs-discuss] ZFS Dedup question
On 1/28/2011 1:48 PM, Nicolas Williams wrote: On Fri, Jan 28, 2011 at 01:38:11PM -0800, Igor P wrote: I created a zfs pool with dedup with the following settings: zpool create data c8t1d0 zfs create data/shared zfs set dedup=on data/shared The thing I was wondering about was it seems like ZFS only dedup at the file level and not the block. When I make multiple copies of a file to the store I see an increase in the deup ratio, but when I copy similar files the ratio stays at 1.00x. Dedup is done at the block level, not file level. "Similar files" does not mean that they actually share common blocks. You'll have to look more closely to determine if they do. Nico What Nico said. The big reason here is that blocks have to be ALIGNED on the same block boundaries to be dedup'd. That is, if I have a file which contains: AAABBCCDD if I have 4-character wide blocks, then if I copy the file, and append an "X" to the above file, making it look like: XAAABBCCDD There will be NO DEDUP in that case. This is what trips people up most of the time - they see "similar" files, but don't realize that "similar" for dedup has to mean aligned on block boundaries, not just "I've got the same 3k of data in both files". -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
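The AAABBCCDD example above can be mimicked mechanically. The toy sketch below chops strings into fixed 4-character "blocks" and counts matches; it models only the alignment effect, nothing about real ZFS checksumming or record sizes:

```shell
# Chop a string into fixed-width 4-character "blocks", one per line
blocks() { printf '%s\n' "$1" | fold -w 4; }

a="AAABBCCDD"      # original file contents
b="XAAABBCCDD"     # same bytes with one character prepended

# Blocks of a: AAAB BCCD D   Blocks of b: XAAA BBCC DD
# The one-byte shift means no block matches, so dedup would save nothing
common=$(printf '%s\n%s\n' "$(blocks "$a")" "$(blocks "$b")" | sort | uniq -d | wc -l)
echo "shared blocks: ${common}"
```

Copy the file unmodified instead and every block lines up, which is why exact copies dedup perfectly while "similar" files often don't.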
Re: [zfs-discuss] Shrinking a pool, Increasing hotspares
On Mon, 2011-01-24 at 13:56 -0800, Phillip V wrote:
> Hey all,
>
> I have a 10 TB root pool setup like so:
>   pool: s78
>  state: ONLINE
>  scrub: resilver completed after 2h0m with 0 errors on Wed Jan 19 22:04:39 2011
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         s78          ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c14t0d0  ONLINE       0     0     0
>             c7t0d0   ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t1d0   ONLINE       0     0     0
>             c14t1d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t2d0   ONLINE       0     0     0
>             c14t2d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t3d0   ONLINE       0     0     0
>             c14t3d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t4d0   ONLINE       0     0     0
>             c15t6d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t5d0   ONLINE       0     0     0
>             c14t5d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t6d0   ONLINE       0     0     0
>             c14t6d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t7d0   ONLINE       0     0     0
>             c14t7d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c15t7d0  ONLINE       0     0     0
>             c14t4d0  ONLINE       0     0 1.11M  251G resilvered
>         logs         ONLINE       0     0     0
>           c15t2d0    ONLINE       0     0     0
>         spares
>           c15t0d0    AVAIL
>
> errors: No known data errors
>
> Only 2 TB of the pool are used and I would like to remove one mirror set from the pool and use the two drives in that mirror set as hot spares (for a total of 3 hot spares). Is this possible?

No. You cannot shrink the size of a pool, regardless of what type of vdev it is composed of. -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?
On Tue, 2011-01-18 at 13:34 -0800, Philip Brown wrote: > > On Tue, 2011-01-18 at 14:51 -0500, Torrey McMahon > > wrote: > > > > ZFS's ability to handle "short-term" interruptions > > depend heavily on the > > underlying device driver. > > > > If the device driver reports the device as > > "dead/missing/etc" at any > > point, then ZFS is going to require a "zpool replace" > > action before it > > re-accepts the device. If the underlying driver > > simply stalls, then > > it's more graceful (and no user interaction is > > required). > > > > As far as what the resync does: ZFS does "smart" > > resilvering, in that > > it compares what the "good" side of the mirror has > > against what the > > "bad" side has, and only copies the differences over > > to sync them up. > > > > Hmm. Well, we're talking fibre, so we're very concerned with the recovery > mode when the fibre drivers have marked it as "failed". (except it hasnt > "really" failed, we've just had a switch drop out) > > I THINK what you are saying, is that we could, in this situation, do: > > zpool replace (old drive) (new drive) > > and then your "smart" recovery, should do the limited resilvering only. Even > for potentially long outages. > > Is that what you are saying? Yes. It will always look at the "replaced" drive to see if it was a prior member of the mirror, and do smart resilvering if possible. If the device path stays the same (which, hopefully, it should), you can even do: zpool replace (old device) (old device) -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
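A sketch of the recovery described above, once the fibre path is restored. The pool name 'tank' and device name are placeholders; the same-device form of zpool replace is exactly what the reply suggests:

```shell
# Reattach the same device; ZFS recognizes it as a former mirror member
# and resilvers only the blocks that changed during the outage
zpool replace tank c1t0d0 c1t0d0

# Monitor the (incremental) resilver progress
zpool status tank
```

If the outage renamed the device path, give the new path as the second argument instead and the smart resilver still applies, since ZFS identifies the disk by its label, not its path.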
Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?
On Tue, 2011-01-18 at 14:51 -0500, Torrey McMahon wrote:
> On 1/18/2011 2:46 PM, Philip Brown wrote:
> > My specific question is, how easily does ZFS handle *temporary* SAN
> > disconnects, to one side of the mirror? What if the outage is only
> > 60 seconds? 3 minutes? 10 minutes? an hour?
>
> Depends on the multipath drivers and the failure mode. For example, if
> the link drops completely at the host hba connection, some failover
> drivers will mark the path down immediately, which will propagate up the
> stack faster than an intermittent connection or something farther
> downstream failing.
>
> > If we have 2x1TB drives in a simple zfs mirror, if one side goes
> > temporarily off line, will zfs attempt to resync **1 TB** when it comes
> > back? Or does it have enough intelligence to say, "oh hey I know this
> > disk... and I know [these bits] are still good, so I just need to resync
> > [that bit]"?
>
> My understanding is yes, though I can't find the reference for this. (I'm
> sure someone else will find it in short order.)

ZFS's ability to handle "short-term" interruptions depends heavily on the underlying device driver.

If the device driver reports the device as "dead/missing/etc" at any point, then ZFS is going to require a "zpool replace" action before it re-accepts the device. If the underlying driver simply stalls, then it's more graceful (and no user interaction is required).

As far as what the resync does: ZFS does "smart" resilvering, in that it compares what the "good" side of the mirror has against what the "bad" side has, and only copies the differences over to sync them up. This is one of ZFS's great strengths, in that most other RAID systems can't do this.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-317
Phone: x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
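The "smart" resilvering described above can be sketched as a toy model: give every block a birth transaction group (txg) and copy back only blocks born, or freed, while the mirror half was absent. This is only an illustration of the idea; it is not ZFS code, and the function and field names are invented for the sketch (ZFS itself tracks the outage window with its dirty-time logs).

```python
# Toy model of ZFS "smart" resilvering: after a temporary outage, only
# blocks written (or freed) while the device was absent are copied, so
# the work is proportional to the delta, not to the device size.
# Illustrative only -- not ZFS code; names are invented for the sketch.

def smart_resilver(good_side, stale_side, detach_txg):
    """Bring stale_side back in sync with good_side.

    Each side maps block_id -> (birth_txg, data); detach_txg is the
    transaction group at which the device dropped out of the mirror.
    """
    touched = 0
    for block_id, (birth_txg, data) in good_side.items():
        # Copy only blocks born after the device went away (or never seen).
        if birth_txg > detach_txg or block_id not in stale_side:
            stale_side[block_id] = (birth_txg, data)
            touched += 1
    # Drop blocks freed on the good side during the outage.
    for block_id in list(stale_side):
        if block_id not in good_side:
            del stale_side[block_id]
            touched += 1
    return touched

# Seven old blocks survive untouched; only the three post-outage writes move.
good = {i: (5, "data%d" % i) for i in range(7)}
good.update({i: (9, "new%d" % i) for i in range(7, 10)})
stale = {i: (5, "data%d" % i) for i in range(7)}
print(smart_resilver(good, stale, detach_txg=5))  # -> 3
```

A full resync would have copied all ten blocks; the diff-based pass touches only the three written during the outage, which is why a "zpool replace" of a briefly-absent mirror half completes quickly.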
Re: [zfs-discuss] Is my bottleneck RAM?
You can't really do that. Adding an SSD for L2ARC will help a bit, but L2ARC storage also consumes RAM to maintain a cache table of what's in the L2ARC. Using 2GB of RAM with an SSD-based L2ARC (even without Dedup) likely won't help you too much vs not having the SSD. If you're going to turn on Dedup, you need at least 8GB of RAM to go with the SSD. -Erik On Tue, 2011-01-18 at 18:35 +, Michael Armstrong wrote: > Thanks everyone, I think overtime I'm gonna update the system to include an > ssd for sure. Memory may come later though. Thanks for everyone's responses > > Erik Trimble wrote: > > >On Tue, 2011-01-18 at 15:11 +, Michael Armstrong wrote: > >> I've since turned off dedup, added another 3 drives and results have > >> improved to around 148388K/sec on average, would turning on compression > >> make things more CPU bound and improve performance further? > >> > >> On 18 Jan 2011, at 15:07, Richard Elling wrote: > >> > >> > On Jan 15, 2011, at 4:21 PM, Michael Armstrong wrote: > >> > > >> >> Hi guys, sorry in advance if this is somewhat a lowly question, I've > >> >> recently built a zfs test box based on nexentastor with 4x samsung 2tb > >> >> drives connected via SATA-II in a raidz1 configuration with dedup > >> >> enabled compression off and pool version 23. 
From running bonnie++ I > >> >> get the following results: > >> >> > >> >> Version 1.03b --Sequential Output-- --Sequential Input- > >> >> --Random- > >> >> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- > >> >> --Seeks-- > >> >> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP > >> >> /sec %CP > >> >> nexentastor 4G 60582 54 20502 4 12385 3 53901 57 105290 10 > >> >> 429.8 1 > >> >> --Sequential Create-- Random > >> >> Create > >> >> -Create-- --Read--- -Delete-- -Create-- --Read--- > >> >> -Delete-- > >> >> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP > >> >> /sec %CP > >> >>16 7181 29 + +++ + +++ 21477 97 + +++ > >> >> + +++ > >> >> nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++ > >> >> > >> >> > >> >> I'd expect more than 105290K/s on a sequential read as a peak for a > >> >> single drive, let alone a striped set. The system has a relatively > >> >> decent CPU, however only 2GB memory, do you think increasing this to > >> >> 4GB would noticeably affect performance of my zpool? The memory is only > >> >> DDR1. > >> > > >> > 2GB or 4GB of RAM + dedup is a recipe for pain. Do yourself a favor, > >> > turn off dedup > >> > and enable compression. > >> > -- richard > >> > > >> > >> ___ > >> zfs-discuss mailing list > >> zfs-discuss@opensolaris.org > >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > > > > >Compression will help speed things up (I/O, that is), presuming that > >you're not already CPU-bound, which it doesn't seem you are. > > > >If you want Dedup, you pretty much are required to buy an SSD for L2ARC, > >*and* get more RAM. > > > > > >These days, I really don't recommend running ZFS as a fileserver without > >a bare minimum of 4GB of RAM (8GB for anything other than light use), > >even with Dedup turned off. 
> > > > > >-- > >Erik Trimble > >Java System Support > >Mailstop: usca22-317 > >Phone: x67195 > >Santa Clara, CA > >Timezone: US/Pacific (GMT-0800) > > -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
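The RAM cost of an L2ARC mentioned above can be put in rough numbers: every record cached on the SSD keeps a header in main memory. The per-header figure used below (~180 bytes) is an assumed round number often quoted for ZFS builds of that era, and the function is invented for illustration; check your release for the real value.

```python
# Back-of-envelope for the L2ARC/RAM trade-off: each record on the
# L2ARC SSD needs an ARC header held in RAM.  The 180-byte header size
# is an assumed, era-typical figure, not a constant from the source.

def l2arc_ram_overhead(l2arc_bytes, avg_record_bytes, header_bytes=180):
    """RAM (bytes) consumed indexing an L2ARC of l2arc_bytes."""
    records = l2arc_bytes // avg_record_bytes
    return records * header_bytes

GiB = 1024 ** 3
# A 100 GiB SSD full of 8 KiB records (random-read workload):
print(l2arc_ram_overhead(100 * GiB, 8 * 1024) / GiB)    # ~2.2 GiB of RAM
# The same SSD holding 128 KiB records is far cheaper to index:
print(l2arc_ram_overhead(100 * GiB, 128 * 1024) / GiB)  # ~0.14 GiB
```

On a 2 GB machine the first case alone would consume all of RAM, which is why an L2ARC device cannot substitute for memory.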
Re: [zfs-discuss] Is my bottleneck RAM?
On Tue, 2011-01-18 at 15:11 +, Michael Armstrong wrote: > I've since turned off dedup, added another 3 drives and results have improved > to around 148388K/sec on average, would turning on compression make things > more CPU bound and improve performance further? > > On 18 Jan 2011, at 15:07, Richard Elling wrote: > > > On Jan 15, 2011, at 4:21 PM, Michael Armstrong wrote: > > > >> Hi guys, sorry in advance if this is somewhat a lowly question, I've > >> recently built a zfs test box based on nexentastor with 4x samsung 2tb > >> drives connected via SATA-II in a raidz1 configuration with dedup enabled > >> compression off and pool version 23. From running bonnie++ I get the > >> following results: > >> > >> Version 1.03b --Sequential Output-- --Sequential Input- > >> --Random- > >> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- > >> --Seeks-- > >> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP > >> /sec %CP > >> nexentastor 4G 60582 54 20502 4 12385 3 53901 57 105290 10 > >> 429.8 1 > >> --Sequential Create-- Random > >> Create > >> -Create-- --Read--- -Delete-- -Create-- --Read--- > >> -Delete-- > >> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec > >> %CP > >>16 7181 29 + +++ + +++ 21477 97 + +++ + > >> +++ > >> nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++ > >> > >> > >> I'd expect more than 105290K/s on a sequential read as a peak for a single > >> drive, let alone a striped set. The system has a relatively decent CPU, > >> however only 2GB memory, do you think increasing this to 4GB would > >> noticeably affect performance of my zpool? The memory is only DDR1. > > > > 2GB or 4GB of RAM + dedup is a recipe for pain. Do yourself a favor, turn > > off dedup > > and enable compression. 
> > -- richard > > > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Compression will help speed things up (I/O, that is), presuming that you're not already CPU-bound, which it doesn't seem you are. If you want Dedup, you pretty much are required to buy an SSD for L2ARC, *and* get more RAM. These days, I really don't recommend running ZFS as a fileserver without a bare minimum of 4GB of RAM (8GB for anything other than light use), even with Dedup turned off. -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On 1/3/2011 8:28 AM, Richard Elling wrote:
> On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
> > On 12/26/10 05:40 AM, Tim Cook wrote:
> > > On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling <richard.ell...@gmail.com> wrote:
> > > > There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
> > >
> > > Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing out of "outside of Oracle" in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
> >
> > Exactly my observation as well. I haven't seen any ZFS related development happening at Illumos or Nexenta, at least not yet.
>
> I am quite sure you understand how pipelines work :-)
>  -- richard

I'm getting pretty close to my pain threshold on the BP_rewrite stuff, since not having that feature's holding up a big chunk of work I'd like to push.

If anyone outside of Oracle is working on some sort of change to ZFS that will allow arbitrary movement/placement of pre-written slabs, can they please contact me? I'm pretty much at the point where I'm going to start diving into that chunk of the source to see if there's something little old me can do, and I'd far rather help on someone else's implementation than have to do it myself from scratch. I'd prefer a private contact, as I realize that such work may not be ready for public discussion yet.

Thanks, folks! Oh, and this is completely just me, not Oracle talking in any way.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL
On 12/29/2010 4:55 PM, Jason Warr wrote: HyperDrive5 = ACard ANS9010 I have personally been wanting to try one of these for some time as a ZIL device. Yes, but do remember these require a half-height 5.25" drive bay, and you really, really should buy the extra CF card for backup. Also, stay away from the ANS-9010S with LVD SCSI interface. As (I think) Bob pointed out a long time ago, parallel SCSI isn't good for a high-IOPS interface. It (the LVD interface) will throttle long before the drive does... I've been waiting for them to come out with a 3.5" version, one which I can plug directly into a standard 3.5" SAS/SATA hotswap bay... And, of course, the ANS9010 is limited to the SATA2 interface speed, so it is cheaper and lower-performing (but still better than an SSD) than the DDRdrive. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
On 12/25/2010 12:16 PM, joerg.schill...@fokus.fraunhofer.de wrote:
> Erik Trimble wrote:
> > I've read Joerg's paper, and I've read several of the patents in question, and nowhere around is there any real code. A bit of
>
> Netapp filed patents (without code) in 1993. I of course have working code for SunOS-4.9 from 1991. See below for more information.
>
> > Joerg - your paper used to be available here (which is where I read it awhile ago), but not anymore:
> > http://www.fokus.gmd.de/research/cc/glone/employees/joerg.schilling/private/wofs.ps.gz

I just re-looked, and I now remember that I got it from that URL via the Internet Archive (archive.org). It's still available at the URL above from 2001 at the Archive.

> This address did go away in 2001 when the German government enforced integration of GMD into Fraunhofer. The old postscript version (created from troff) is here: http://cdrecord.berlios.de/private/wofs.ps.gz
>
> A few years ago, a friend helped me to add the images that originally had been created outside of troff and inserted the old way (using glue). Since 2006, there is a pdf version that includes the images: http://cdrecord.berlios.de/private/WoFS.pdf
>
> > Is there a better location? (and, a full English translation? I read it in German, but my German is maybe at 7th-grade level, so I might have missed some subtleties...)
>
> There is currently no English translation, and as a result of the legal situation in 1991, I could not publish the related implementation. Even getting the SunOS-4.0 source code in 1988 in order to allow the implementation was a bit tricky. Horst Winterhoff (Chief, Sun Germany and Sun Europe) asked Bill Joy for permission to give away the source for my Diploma Thesis. As a result of this, and the fact that there was no official howto from Sun for writing filesystems, I was forced to keep the implementation unpublished (as for the implementation of mmap() in wofs, I was forced to copy approx. 100 lines from the UFS code).
If the code you copied is currently still in the OpenSolaris codebase, then you're OK. But the SunOS codebase is significantly different from the Solaris one, so I wouldn't automatically assume that you can publish that code. Though, if your borrowing was restricted to the UFS implementation (and not the Virtual Memory/Filesystem caching stuff), your chances are good that it's still in the OpenSolaris codebase.

> Since June 2005, I would assume that the situation is different and there is no longer a problem with publishing the WOFS source. If people are interested, I could publish the unedited original state from 1991 (including the SCCS history for my implementation) even though it looks a bit messy.

For at least historical reasons, that would be nice. Though, I don't want to offer legal advice as to the possibility of problems, particularly for someone outside the US system. :-)

> I tried to verify whether the submission of the diploma thesis in 1991 is an official publication, and in theory it should be, as a copy is stored in the university library. Unfortunately, the university library is unable to find the paper. There are, however, many people who could confirm that the development really happened between 1988 and 1991.

If your thesis paper was available via LexisNexis, then it certainly should count as officially published for any legal system. If not, I suspect that different countries would have different standards for university theses.

> Maybe it is a good idea to send a mail to someone from eff.org?
>
> Jörg

Yup. They'd be the right people to talk to.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
On 12/25/2010 11:19 AM, Tim Cook wrote:
> On Sat, Dec 25, 2010 at 1:10 PM, Erik Trimble <erik.trim...@oracle.com> wrote:
> > On 12/25/2010 6:25 AM, Edward Ned Harvey wrote:
> > > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Joerg Schilling
> > > >
> > > > And people should note that Netapp filed their patents starting from 1993. This is 5 years after I started to develop WOFS, which is copy on write. This still
> > > >
> > > > In any case, this is 20 year old technology. Aren't patents something to protect new ideas?
> > >
> > > Boy, those guys must be really dumb to waste their time filing billion dollar lawsuits, protecting 20-year old technology, when it's so obvious that you and other people clearly invented it before them, and all the money they waste on lawyers can never achieve anything. They should all fire themselves. And anybody who defends against it can safely hire a law student for $20/hr to represent them, and just pull out your documents as defense, because that's so easy. Plus, as you said, the technology is so old, it should be worthless by now. Why are we all wasting our time in this list talking about irrelevant old technology, anyway?
> >
> > While that's a bit sarcastic there, Ned, it *should* be the literal truth. But, as the SCO/Linux suit showed, having no realistic basis for a lawsuit doesn't prevent one from being dragged through the (U.S.) courts for the better part of a decade. Why can't we have a loser-pays civil system like every other civilized country?
> >
> > --
> > Erik Trimble
> > Java System Support
> > Mailstop: usca22-123
> > Phone: x17195
> > Santa Clara, CA
> > Timezone: US/Pacific (GMT-0800)
>
> If you've got enough money, we do. You just have to make it to the end of the trial, and have a judge who feels similarly. They often award monetary settlements for the cost of legal defense to the victor.
>
> --Tim

Which is completely useless as a system.
I'm still significantly out-of-pocket for a suit that I shouldn't have had to fight in the first place, and the likelihood that I get to recover that money isn't good (defense cost awards aren't common). There's no disincentive to trolling the legal system, forcing settlements on those unable to fight a protracted suit, even if they're sure to win the case. Using the US legal system as a business strategy is evil, pure and simple, and one all too common nowadays. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
On 12/25/2010 10:59 AM, Tim Cook wrote:
> On Sat, Dec 25, 2010 at 8:25 AM, Edward Ned Harvey <opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:
> > > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Joerg Schilling
> > > >
> > > > And people should note that Netapp filed their patents starting from 1993. This is 5 years after I started to develop WOFS, which is copy on write. This still
> > > >
> > > > In any case, this is 20 year old technology. Aren't patents something to protect new ideas?
> >
> > Boy, those guys must be really dumb to waste their time filing billion dollar lawsuits, protecting 20-year old technology, when it's so obvious that you and other people clearly invented it before them, and all the money they waste on lawyers can never achieve anything. They should all fire themselves. And anybody who defends against it can safely hire a law student for $20/hr to represent them, and just pull out your documents as defense, because that's so easy. Plus, as you said, the technology is so old, it should be worthless by now. Why are we all wasting our time in this list talking about irrelevant old technology, anyway?
>
> Indeed. Isn't the Oracle database itself at least 20 years old? And Windows? And Solaris itself? All the employees of those companies should probably just start donating their time for free instead of collecting a paycheck since it's quite obvious they should no longer be able to charge for their product.
>
> What I find most entertaining is all the armchair lawyers on this mailing list that think they've got prior art when THEY'VE NEVER EVEN SEEN THE CODE IN QUESTION!
>
> --Tim

Well... I've read Joerg's paper, and I've read several of the patents in question, and nowhere around is there any real code. A bit of pseudo-code and some math, but no full, working code.
And, granted, I'm not an IP lawyer, but it does look like Joerg's work is prior art (and, given that the standard is supposed to be what someone in the industry would consider obvious, based on their knowledge, I think I qualify).

Which all points to the real problem of software patents - they're really patents on IDEAS, not on a specific implementation. Whoever the moron was that really thought that was OK (yes, I know who specifically, but in general...) should be shot. Copyright is fine for protecting software work, but patents?

Joerg - your paper used to be available here (which is where I read it awhile ago), but not anymore:
http://www.fokus.gmd.de/research/cc/glone/employees/joerg.schilling/private/wofs.ps.gz

Is there a better location? (and, a full English translation? I read it in German, but my German is maybe at 7th-grade level, so I might have missed some subtleties...)

[As obvious as it is, it should be pointed out, I'm making these statements as a very personal opinion, and I'm certain Oracle wouldn't have the same one. I in no way represent Oracle.]

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
On 12/25/2010 6:25 AM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Joerg Schilling And people should note that Netapp filed their patents starting from 1993. This is 5 years after I started to develop WOFS, which is copy on write. This still In any case, this is 20 year old technology. Aren't patents something to protect new ideas? Boy, those guys must be really dumb to waste their time filing billion dollar lawsuits, protecting 20-year old technology, when it's so obvious that you and other people clearly invented it before them, and all the money they waste on lawyers can never achieve anything. They should all fire themselves. And anybody who defends against it can safely hire a law student for $20/hr to represent them, and just pull out your documents as defense, because that's so easy. Plus, as you said, the technology is so old, it should be worthless by now. Why are we all wasting our time in this list talking about irrelevant old technology, anyway? While that's a bit sarcastic there Ned, it *should* be the literal truth. But, as the SCO/Linux suit showed, having no realistic basis for a lawsuit doesn't prevent one from being dragged through the (U.S.) courts for the better part of a decade. Why can't we have a loser-pays civil system like every other civilized country? -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Looking for 3.5" SSD for ZIL
On 12/23/2010 7:57 AM, Deano wrote:
> In an ideal world, if we could obtain details on how to reset/format blocks of an SSD, we could do it automatically, running behind the ZIL. As a log, it's going in one direction, so a background task could clean up behind it, making the performance lowering over time a non-issue for the ZIL. A first start may be calling unmap/trim on those blocks (which I was surprised to find in the source is already coded up in the SATA driver, just not used yet), but really a reset would be better. But as you say, a tool to say if it needs doing would be a good start. They certainly exist in closed source form...
>
> Deano
>
> -Original Message-
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ray Van Dolson
> Sent: 23 December 2010 15:46
> To: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Looking for 3.5" SSD for ZIL
>
> On Thu, Dec 23, 2010 at 07:35:29AM -0800, Deano wrote:
> > If anybody does know of any source to the secure erase/reformatters, I'll happily volunteer to do the port and then maintain it. I'm currently in talks with several SSD and driver chip hardware peeps with regard to getting datasheets for some SSD products etc. for the purpose of better support under the OI/Solaris driver model, but these things can take a while to obtain, so if anybody knows of existing open source versions I'll jump on it.
> >
> > Thanks, Deano
>
> A tool to help the end user know *when* they should run the reformatter tool would be helpful too. I know we can just wait until performance "degrades", but it would be nice to see what % of blocks are in use, etc.
>
> Ray

AFAIK, all the reformatter utilities are closed-source, direct from the SSD manufacturer. They talk directly to the drive firmware, so they're decidedly implementation-specific (I'd be flabbergasted if one worked on two different manufacturers' SSDs, even if they used the same basic controller). Many are DOS-based.
As Christopher noted, you'll get a drop-off in performance as soon as you collect enough sync writes to have written (in the aggregate) slightly more than the total capacity of the SSD (including the "extra" that most SSDs now have). That said, I would expect full TRIM support to make this somewhat better, as it could free up partially-used pages more frequently, and thus increase the time before performance drops (which is due to the page remapping/reshuffling demands on the SSD controller).

But, yes, SSDs are inherently less fast than DRAM. Their utility is entirely dependent on what your use case (and performance demands) are.

The longer-term solution is to have SSDs change how they are designed, moving away from the current one-page-of-multiple-blocks as the atomic unit of writing, and straight to a one-block-per-page setup. Don't hold your breath.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
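The performance cliff described above can be shown with a toy model: while the drive still has pre-erased pages, writes are cheap; once cumulative writes exceed raw capacity and nothing has been TRIMmed or low-level formatted, every write first pays for an erase. The timing constants are invented for illustration, not measurements of any real drive.

```python
# Toy model of SSD write-performance fall-off: writes into pre-erased
# pages are fast; once the free list is exhausted (aggregate writes >
# raw capacity, no TRIM/reformat), each write pays an erase first.
# The microsecond costs are made-up, illustrative figures.

FAST_WRITE_US = 50    # program a pre-erased page (assumed)
SLOW_WRITE_US = 2000  # erase-then-program cycle (assumed)

def mean_write_cost(total_pages, writes):
    """Average cost per page write over `writes` writes, starting fresh."""
    erased = total_pages  # factory-fresh: every page pre-erased
    cost = 0
    for _ in range(writes):
        if erased > 0:
            erased -= 1
            cost += FAST_WRITE_US
        else:
            cost += SLOW_WRITE_US  # free list gone: erase before program
    return cost / writes

print(mean_write_cost(1000, 1000))  # within capacity: 50.0 us/write
print(mean_write_cost(1000, 2000))  # past capacity: 1025.0 us/write
```

A vendor low-level reformat (or an effective TRIM) corresponds to resetting the erased-page count back up, which is why reconditioning restores the original write speed.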
Re: [zfs-discuss] Looking for 3.5" SSD for ZIL
On 12/22/2010 7:05 AM, Christopher George wrote:
> > I'm not sure if TRIM will work with ZFS.
>
> Neither ZFS nor the ZIL code in particular support TRIM.
>
> > I was concerned that with trim support the SSD life and write throughput will get affected.
>
> Your concerns about sustainable write performance (IOPS) for a Flash based SSD are valid; the resulting degradation will vary depending on the controller used.
>
> Best regards,
> Christopher George
> Founder/CTO
> www.ddrdrive.com

Christopher is correct, in that SSDs will suffer from (non-trivial) performance degradation after they've exhausted their free list and haven't been told to reclaim emptied space. True battery-backed DRAM is the only permanent solution currently available which never runs into this problem. Even TRIM-supported SSDs eventually need reconditioning.

However, this *can* be overcome by frequently re-formatting the SSD (not the Solaris format, but a low-level format using a vendor-supplied utility). It's generally a simple thing, but requires pulling the SSD from the server, connecting it to either a Linux or Windows box, running the reformatter, then replacing the SSD. Which is a PITA. But still a bit cheaper than buying a DDRdrive.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Looking for 3.5" SSD for ZIL
On 12/22/2010 10:04 PM, Christopher George wrote: How about comparing a non-battery backed ZIL to running a ZFS dataset with sync=disabled. Which is more risky? Most likely, the 3.5" SSD's on-board volatile (not power protected) memory would be small relative to the transaction group (txg) size and thus less "risky" than sync=disabled. Best regards, Christopher George Founder/CTO www.ddrdrive.com To the OP: First off, what do you mean by "sync=disabled"??? There is no such parameter for a mount option or attribute for ZFS, nor is there for exporting anything in NFS, nor for client-side NFS mounts. If you meant "disable the ZIL", well, DON'T. http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29 Moreover, disabling the ZIL on a per-dataset basis is not possible. As noted in the ETG, disabling ZIL can cause possible NFS-client-side corruption. If you absolutely must turn it off, however, you will get More Reliable transactions than a non-SuperCap'd SSD, by virtue that any sync-write on such a fileserver will not return as complete until the data has reach backing store. Which, in most cases, will tank (no pun intended) your synchronous performance. About the only case it won't cripple performance is when your backing store is using some sort of NVRAM to buffer writes to the disks (as most large array controllers do - but make sure that cache is battery backed). But even there, it can be a relatively simple thing to overwhelm the very limited cache on such a controller, in which case your performance tanks again. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] stupid ZFS question - floating point operations
On 12/22/2010 11:49 AM, Tomas Ögren wrote:
> On 22 December, 2010 - Jerry Kemp sent me these 1,0K bytes:
> > I have a coworker whose primary expertise is in another flavor of Unix. This coworker lists floating point operations as one of ZFS's detriments. I'm not really sure what he means specifically, or where he got this reference from.
>
> Then maybe ask him first? Guilty until proven innocent isn't the regular path...
>
> > In an effort to refute what I believe is an error or misunderstanding on his part, I have spent time on Yahoo, Google, the ZFS section of OpenSolaris.org, etc. I really haven't turned up much of anything that would prove or disprove his comments. The one thing I haven't done is to go through the ZFS source code, but it's been years since I have done any serious programming. If someone from Oracle, or anyone on this mailing list, could point me towards any documentation, or give me a definitive word, I would sure appreciate it. If there were floating point operations going on within ZFS, at this point I am uncertain as to what they would be.
> >
> > TIA for any comments,
> > Jerry
>
> /Tomas

So far as my understanding of the codebase goes (and, while I've read a significant portion, I'm not really an expert here): assuming he means that ZFS has a weakness of heavy floating-point calculation requirements (i.e., using ZFS requires heavy FP usage), that's wrong. Like all normal filesystems, the "ordinary" operations are all integer, load, and store. The ordinary work of caching, block allocation, and fetching/writing is of course all integer-based. I can't imagine someone writing a filesystem which does such operations using floating point. A quick grep through the main ZFS sources doesn't find anything of type "double" or "float". I think he might be confused by what is happening with checksums (which is still all integer, but looks/sounds "expensive").
Yes, ZFS is considerably *more* compute intensive than other filesystems. However, it's all Integer, and one of the base assumptions of ZFS is that modern systems have lots of excess CPU cycles around, so stealing 5% for use with ZFS won't impact performance much, and the added features of ZFS more than make up for any CPU cycles lost. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
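The integer-only point can be made concrete with fletcher4, one of ZFS's standard block checksums: it is nothing but integer additions of 32-bit words into four 64-bit accumulators. The sketch below follows the published algorithm in simplified form (the real C code deals with byte order and buffer alignment); it is an illustration, not the ZFS implementation.

```python
# fletcher4, a ZFS block checksum, in sketch form: four 64-bit integer
# accumulators updated once per 32-bit word -- no floating point
# anywhere.  Simplified relative to the real C implementation.

import struct

MASK64 = (1 << 64) - 1  # the uint64_t accumulators are allowed to wrap

def fletcher4(data: bytes):
    a = b = c = d = 0
    # Treat the buffer as a stream of 32-bit words (little-endian assumed).
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d

print(fletcher4(b"\x01\x00\x00\x00\x02\x00\x00\x00"))  # -> (3, 4, 5, 6)
```

Four integer adds per 4-byte word is cheap on any modern CPU, which is exactly the "steal a few percent of CPU" trade-off described above.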
Re: [zfs-discuss] A few questions
On 12/20/2010 11:56 AM, Mark Sandrock wrote: Erik, just a hypothetical what-if ... In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right. After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data. Not knowing enough about the inner workings of ZFS snapshots, I don't know why this would not be doable. (I'm biased towards mirrors for busy filesystems.) I'm supposing that a block-level snapshot is not doable -- or is it? Mark Snapshots on ZFS are true snapshots - they take a picture of the current state of the system. They DON'T copy any data around when created. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time. The problem is this: Let's say I write block A, B, C, and D on a clean zpool (what kind, it doesn't matter). I now delete block C. Later on, I write block E. There is a probability (increasing dramatically as times goes on), that the on-disk layout will now look like: A, B, E, D rather than A, B, [space], D, E So, in the first case, I can do a sequential read to get A & B, but then must do a seek to get D, and a seek to get E. The "fragmentation" problem is mainly due to file deletion, NOT to file re-writing. (though, in ZFS, being a C-O-W filesystem, re-writing generally looks like a delete-then-write process, rather than a modify process). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
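The A/B/C/D example above can be run as a toy first-fit allocator: deleting C opens a hole in the middle, the next write (E) fills it, and on-disk order stops matching logical write order, so a later sweep through A, B, D, E must seek. Purely illustrative; this is not how ZFS's real allocator is coded.

```python
# Toy first-fit allocator reproducing the fragmentation example:
# write A,B,C,D; delete C; write E.  E lands in C's hole, so the
# on-disk order becomes A,B,E,D rather than A,B,_,D,E.

def first_fit_write(disk, name):
    disk[disk.index(None)] = name  # earliest free slot wins

disk = [None] * 5
for name in "ABCD":
    first_fit_write(disk, name)
disk[disk.index("C")] = None  # delete C: hole in the middle
first_fit_write(disk, "E")    # new write fills the hole
print(disk)                   # -> ['A', 'B', 'E', 'D', None]
```

Reading the live data in logical order (A, B, D, E) now requires jumping over E and back, which is the seek penalty the deletion-driven fragmentation causes.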