Re: [zfs-discuss] Nice chassis for ZFS server
On Dec 14, 2007 1:12 AM, can you guess? [EMAIL PROTECTED] wrote: yes. far rarer and yet home users still see them. I'd need to see evidence of that for current hardware. What would constitute evidence? Do anecdotal tales from home users qualify? I have two disks (and one controller!) that generate several checksum errors per day each. I assume that you're referring to ZFS checksum errors rather than to transfer errors caught by the CRC resulting in retries. If so, then the next obvious question is, what is causing the ZFS checksum errors? And (possibly of some help in answering that question) is the disk seeing CRC transfer errors (which show up in its SMART data)? If the disk is not seeing CRC errors, then the likelihood that data is being 'silently' corrupted as it crosses the wire is negligible (1 in 65,536 if you're using ATA disks, given your correction below, else 1 in 4.3 billion for SATA). Controller or disk firmware bugs have been known to cause otherwise undetected errors (though I'm not familiar with any recent examples in normal desktop environments - e.g., the CERN study discussed earlier found a disk firmware bug that seemed only activated by the unusual demands placed on the disk by a RAID controller, and exacerbated by that controller's propensity just to ignore disk time-outs). So, for that matter, have buggy file systems. Flaky RAM can result in ZFS checksum errors (the CERN study found correlations there when it used its own checksum mechanisms). I've also seen intermittent checksum fails that go away once all the cables are wiggled. Once again, a significant question is whether the checksum errors are accompanied by a lot of CRC transfer errors. If not, that would strongly suggest that they're not coming from bad transfers (and while they could conceivably be the result of commands corrupted on the wire, so much more data is transferred compared to command bandwidth that you'd really expect to see data CRC errors too if commands were getting mangled). When you wiggle the cables, other things wiggle as well (I assume you've checked that your RAM is solidly seated). On the other hand, if you're getting a whole bunch of CRC errors, then with only a 16-bit CRC it's entirely conceivable that a few are sneaking by unnoticed. Unlikely, since transfers over those connections have been protected by 32-bit CRCs since ATA busses went to 33 or 66 MB/sec. (SATA has even stronger protection) The ATA/7 spec specifies a 32-bit CRC (older ones used a 16-bit CRC) [1]. Yup - my error: the CRC was indeed introduced in ATA-4 (33 MB/sec. version), but was only 16 bits wide back then. The Serial ATA protocol also specifies 32-bit CRCs beneath 8b/10b coding (1.0a p. 159) [2]. That's not much stronger at all. The extra strength comes more from its additional coverage (commands as well as data). - bill
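To put rough numbers on the detection odds cited above, here's a back-of-the-envelope Python sketch using the usual uniform-random approximation (about one in 2^n random corruptions happens to leave an n-bit CRC unchanged; a real CRC additionally catches *all* error bursts up to its width deterministically, so this is a worst-case view):

def undetected_fraction(crc_bits: int) -> float:
    """Fraction of random corruptions that leave the CRC unchanged by chance."""
    return 1.0 / (2 ** crc_bits)

for name, bits in [("ATA-4 16-bit CRC", 16), ("ATA/7 & SATA 32-bit CRC", 32)]:
    print(f"{name}: ~1 in {2 ** bits:,} corrupted transfers slips through")

That prints the 1-in-65,536 and 1-in-4.3-billion figures used in the message above.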
Re: [zfs-discuss] Nice chassis for ZFS server
... though I'm not familiar with any recent examples in normal desktop environments One example found during early use of zfs in Solaris engineering was a system with a flaky power supply. It seemed to work just fine with ufs but when zfs was installed the sata drives started to show many ZFS checksum errors. After replacing the power supply, the system did not detect any more errors. Flaky power supplies are an important contributor to PC unreliability; they also tend to fail a lot in various ways. Thanks - now that you mention it, I think I remember reading about that here somewhere. But did anyone delve into these errors sufficiently to know that they were specifically due to controller or disk firmware bugs (since you seem to be suggesting by the construction of your response above that they were) rather than, say, to RAM errors (if the system in question didn't have ECC RAM, anyway) between checksum generation and disk access on either reads or writes (the CERN study found a correlation even using ECC RAM between detected RAM errors and silent data corruption)? Not that the generation of such otherwise undetected errors due to a flaky PSU isn't interesting in its own right, but this specific sub-thread was about whether poor connections were a significant source of such errors (my comment about controller and disk firmware bugs having been a suggested potential alternative source) - so identifying the underlying mechanisms is of interest as well. - bill
Re: [zfs-discuss] Nice chassis for ZFS server
the next obvious question is, what is causing the ZFS checksum errors? And (possibly of some help in answering that question) is the disk seeing CRC transfer errors (which show up in its SMART data)? The memory is ECC in this machine, and Memtest passed it for five days. The disk was indeed getting some pretty lousy SMART scores, Seagate ATA disks (if that's what you were using) are notorious for this in a couple of specific metrics: they ship from the factory that way. This does not appear to be indicative of any actual problem but rather of error tabulation which they perform differently than other vendors do (e.g., I could imagine that they did something unusual in their burn-in exercising that generated nominal errors, but that's not even speculation, just a random guess). but that doesn't explain the controller issue. This particular controller is a SIIG-branded Silicon Image 0680 chipset (which is, apparently, a piece of junk - if I'd done my homework I would've bought something else)... but the premise stands. I bought a piece of consumer-level hardware off the shelf, it had corruption issues, and ZFS told me about it when XFS had been silent. Then we've been talking at cross-purposes. Your original response was to my request for evidence that *platter errors that escape detection by the disk's ECC mechanisms* occurred sufficiently frequently to be a cause for concern - and that's why I asked specifically what was causing the errors you saw (to see whether they were in fact the kind for which I had requested evidence). Not that detecting silent errors due to buggy firmware is useless: it clearly saved you from continuing corruption in this case. My impression is that in conventional consumer installations (typical consumers never crack open their case at all, let alone to add a RAID card) controller and disk firmware is sufficiently stable (especially for the limited set of functions demanded of it) that ZFS's added integrity checks may not count for a great deal (save perhaps peace of mind, but typical consumers aren't sufficiently aware of potential dangers to suffer from deficits in that area) - but your experience indicates that when you stray from that mold ZFS's added protection may sometimes be as significant as it was for Robert's mid-range array firmware bugs. And since there indeed was a RAID card involved in the original hypothetical situation under discussion, the fact that I was specifically referring to undetectable *disk* errors was only implied by my subsequent discussion of disk error rates, rather than explicit. The bottom line appears to be that introducing non-standard components into the path between RAM and disk has, at least for some specific subset of those components, the potential to introduce silent errors of the form that ZFS can catch - quite possibly in considerably greater numbers than the kinds of undetected disk errors that I was talking about ever would (that RAID card you were using has a relatively popular low-end chipset, and Robert's mid-range arrays were hardly fly-by-night). So while I'm still not convinced that ZFS offers significant features in the reliability area compared with other open-source *software* solutions, the evidence that it may do so in more sophisticated (but not quite high-end) hardware environments is becoming more persuasive. - bill
Re: [zfs-discuss] Yager on ZFS
Hello can, Thursday, December 13, 2007, 12:02:56 AM, you wrote: cyg On the other hand, there's always the possibility that someone cyg else learned something useful out of this. And my question about To be honest - there's basically nothing useful in the thread, perhaps except one thing - doesn't make any sense to listen to you. I'm afraid you don't qualify to have an opinion on that, Robert - because you so obviously *haven't* really listened. Until it became obvious that you never would, I was willing to continue to attempt to carry on a technical discussion with you, while ignoring the morons here who had nothing whatsoever in the way of technical comments to offer (but continued to babble on anyway). - bill
Re: [zfs-discuss] Yager on ZFS
Would you two please SHUT THE F$%K UP. Just for future reference, if you're attempting to squelch a public conversation it's often more effective to use private email to do it rather than contribute to the continuance of that public conversation yourself. Have a nice day! - bill
Re: [zfs-discuss] Nice chassis for ZFS server
Are there benchmarks somewhere showing a RAID10 implemented on an LSI card with, say, 128MB of cache being beaten in terms of performance by a similar zraid configuration with no cache on the drive controller? Somehow I don't think they exist. I'm all for data scrubbing, but this anti-raid-card movement is puzzling. Oh, for joy - a chance for me to say something *good* about ZFS, rather than just try to balance out excessive enthusiasm. Save for speeding up synchronous writes (if it has enough on-board NVRAM to hold them until it's convenient to destage them to disk), a RAID-10 card should not enjoy any noticeable performance advantage over ZFS mirroring. By contrast, if extremely rare undetected and (other than via ZFS checksums) undetectable (or considerably more common undetected but detectable via disk ECC codes, *if* the data is accessed) corruption occurs, if the RAID card is used to mirror the data there's a good chance that even ZFS's validation scans won't see the problem (because the card happens to access the good copy for the scan rather than the bad one) - in which case you'll lose that data if the disk with the good data fails. And in the case of (extremely rare) otherwise-undetectable corruption, if the card *does* return the bad copy then IIRC ZFS (not knowing that a good copy also exists) will just claim that the data is gone (though I don't know if it will then flag it such that you'll never have an opportunity to find the good copy). If the RAID card scrubs its disks the difference (now limited to the extremely rare undetectable-via-disk-ECC corruption) becomes pretty negligible - but I'm not sure how many RAIDs below the near-enterprise category perform such scrubs. In other words, if you *don't* otherwise scrub your disks then ZFS's checksums-plus-internal-scrubbing mechanisms assume greater importance: it's only the contention that other solutions that *do* offer scrubbing can't compete with ZFS in effectively protecting your data that's somewhat over the top. - bill
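To illustrate the difference concretely, here's a minimal Python sketch (a hypothetical block format, not ZFS's actual on-disk layout) of why a separately stored per-block checksum lets a mirror identify and serve the good copy, while a plain RAID-1 scrub can only observe that the two sides differ:

import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def mirror_read(copies, expected_sum):
    """Serve the first copy whose checksum matches; the other side can
    then be repaired from it."""
    for side, data in enumerate(copies):
        if checksum(data) == expected_sum:
            return side, data
    raise IOError("all copies fail verification")

good = b"important data"
bad = b"important dXta"                  # silent single-byte corruption
side, data = mirror_read([bad, good], checksum(good))
print(f"served good copy from side {side}")      # -> side 1
# A plain RAID-1 read consults no checksum: it may serve either side, and
# a compare-the-mirrors scrub sees a mismatch but not which copy is right.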
Re: [zfs-discuss] Nice chassis for ZFS server
... when the difference between an unrecoverable single bit error is not just 1 bit but the entire file, or corruption of an entire database row (etc), those small and infrequent errors are an extremely big deal. You are confusing unrecoverable disk errors (which are rare but orders of magnitude more common) with otherwise *undetectable* errors (the occurrence of which is at most once in petabytes by the studies I've seen, rather than once in terabytes), despite my attempt to delineate the difference clearly. Conventional approaches using scrubbing provide as complete protection against unrecoverable disk errors as ZFS does: it's only the far rarer otherwise *undetectable* errors that ZFS catches and they don't. - bill
Re: [zfs-discuss] Nice chassis for ZFS server
... If the RAID card scrubs its disks A scrub without checksum puts a huge burden on disk firmware and error reporting paths :-) Actually, a scrub without checksum places far less burden on the disks and their firmware than ZFS-style scrubbing does, because it merely has to scan the disk sectors sequentially rather than follow a tree path to each relatively small leaf block. Thus it compromises runtime operation a lot less as well (though in both cases doing it infrequently in the background should usually reduce any impact to acceptable levels). - bill
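A rough cost model of that point in Python (illustrative drive parameters, not measurements) shows why a seek-per-leaf-block tree walk costs so much more than a sequential surface scan:

DISK_BYTES = 500e9        # 500 GB drive (assumed)
BANDWIDTH  = 60e6         # 60 MB/s sustained media rate (assumed)
SEEK_TIME  = 0.008        # 8 ms average seek + rotational delay (assumed)
BLOCK_SIZE = 128 * 1024   # ZFS's maximum block size

sequential_scan = DISK_BYTES / BANDWIDTH
blocks = DISK_BYTES / BLOCK_SIZE
tree_walk = blocks * (SEEK_TIME + BLOCK_SIZE / BANDWIDTH)   # worst case:
                                                            # one seek per leaf
print(f"sequential surface scan: {sequential_scan / 3600:4.1f} hours")
print(f"seek-per-block tree walk: {tree_walk / 3600:4.1f} hours")

With these made-up but plausible numbers the sequential scan finishes in about 2.3 hours while the fully scattered tree walk takes closer to 11 - and the seeks it spends are exactly the resource that competing foreground I/O wants.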
Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)
Great questions.

1) First issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary. Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently? ZFS already has to deal with potential uberblock partial writes if it contains multiple disk sectors (and it might be prudent even if it doesn't, as Richard's response seems to suggest). Common ways of dealing with this problem include dumping it into the log (in which case the log with its own internal recovery procedure becomes the real root of all evil) or cycling around at least two locations per mirror copy (Richard's response suggests that there are considerably more, and that perhaps each one is written in quadruplicate) such that the previous uberblock would still be available if the new write tanked. ZFS-style snapshots complicate both approaches unless special provisions are taken - e.g., copying the current uberblock on each snapshot and hanging a list of these snapshot uberblock addresses off the current uberblock, though even that might run into interesting complications under the scenario which you describe below. Just using the 'queue' that Richard describes to accumulate snapshot uberblocks would limit the number of concurrent snapshots to less than the size of that queue. In any event, as long as writes to the secondary copy don't continue after a write failure of the kind that you describe has occurred (save for the kind of catch-up procedure that you mention later), ZFS's internal facilities should not be confused by encountering a partial uberblock update at the secondary, any more than they'd be confused by encountering it on an unreplicated system after restart.

2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc. Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example secondary site down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write-ordering is lost until the two sites are once again in sync and routine replication restarts. I can see this as having major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent. If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?
My inclination is to say "By repopulating your environment from backups": it is not reasonable to expect *any* file system to operate correctly, or to attempt any kind of comprehensive recovery (other than via something like fsck, with no guarantee of how much you'll get back), when the underlying hardware transparently reorders updates which the file system has explicitly ordered when it presented them. But you may well be correct in suspecting that there's more potential for data loss should this occur in a ZFS environment than in update-in-place environments where only portions of the tree structure that were explicitly changed during the connection hiatus would likely be affected by such a recovery interruption (though even there if a directory changed enough to change its block structure on disk you could be in more trouble). Obviously all filesystems can suffer with this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well. There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting resync, as a fallback. You're talking about an HDS- or EMC-level snapshot, right? This isn't always easy to do, especially since the resync is usually automatic; there is no clear
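For the curious, here's a minimal Python sketch of the cycled-uberblock recovery idea mentioned above (greatly simplified - per Richard's response the real mechanism apparently keeps many more copies, in quadruplicate): at mount time, pick the entry with the highest transaction number whose checksum verifies, so a torn write of the newest entry just falls back one generation:

import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class Uberblock:
    txg: int        # transaction group number, monotonically increasing
    root: bytes     # stand-in for the pointer to the block-tree root
    cksum: bytes    # checksum over the rest of the entry

def seal(txg: int, root: bytes) -> Uberblock:
    body = txg.to_bytes(8, "big") + root
    return Uberblock(txg, root, hashlib.sha256(body).digest())

def valid(ub: Uberblock) -> bool:
    body = ub.txg.to_bytes(8, "big") + ub.root
    return hashlib.sha256(body).digest() == ub.cksum

def newest_valid(ring) -> Optional[Uberblock]:
    """Recovery rule: the newest entry that verifies wins; a torn write of
    the latest slot merely loses that txg and falls back one generation."""
    return max((ub for ub in ring if valid(ub)),
               key=lambda ub: ub.txg, default=None)

ring = [seal(41, b"old-root"), seal(42, b"new-root")]
ring[1].root = b"torn-writ"          # simulate a partial overwrite of txg 42
print(newest_valid(ring).txg)        # -> 41: the last fully written state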
Re: [zfs-discuss] Nice chassis for ZFS server
... Now it seems to me that without parity/replication, there's not much point in doing the scrubbing, because you could just wait for the error to be detected when someone tries to read the data for real. It's only if you can repair such an error (before the data is needed) that such scrubbing is useful. Pretty much. I think I've read (possibly in the 'MAID' descriptions) the contention that at least some unreadable sectors get there in stages, such that if you catch them early they will be only difficult to read rather than completely unreadable. In such a case, scrubbing is worthwhile even without replication, because it finds the problem early enough that the disk itself (or higher-level mechanisms if the disk gives up but the higher level is more persistent) will revector the sector when it finds it difficult (but not impossible) to read. - bill
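A sketch of that policy in Python, with a toy simulated device (read_sector/rewrite_sector are hypothetical hooks, not a real driver API): any sector that reads successfully only after retries gets rewritten on the spot, which gives the drive the chance to refresh or revector it before it degrades into a fully unreadable one:

class FlakyDisk:
    """Toy device: sector 7 currently needs retries but is still recoverable."""
    def __init__(self):
        self.marginal = {7}
        self.refreshed = set()

    def read_sector(self, lba, attempt):
        if lba in self.marginal and lba not in self.refreshed and attempt < 2:
            return None                    # read fails until retried enough
        return b"data"

    def rewrite_sector(self, lba, data):
        self.refreshed.add(lba)            # rewrite in place or revector

def scrub_pass(dev, sectors, max_retries=5):
    for lba in range(sectors):
        for attempt in range(max_retries):
            data = dev.read_sector(lba, attempt)
            if data is not None:
                if attempt > 0:            # readable, but only with effort:
                    dev.rewrite_sector(lba, data)   # fix it now, while we can
                break
        else:
            print(f"sector {lba}: too far gone; only redundancy helps now")

dev = FlakyDisk()
scrub_pass(dev, 16)
print("refreshed:", dev.refreshed)         # {7}: caught before it died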
Re: [zfs-discuss] Yager on ZFS
Hello can, Tuesday, December 11, 2007, 6:57:43 PM, you wrote: Monday, December 10, 2007, 3:35:27 AM, you wrote: cyg and it made them slower cyg That's the second time you've claimed that, so you'll really at cyg least have to describe *how* you measured this even if the cyg detailed results of those measurements may be lost in the mists of time. cyg So far you don't really have much of a position to defend at cyg all: rather, you sound like a lot of the disgruntled TOPS users cyg of that era. Not that they didn't have good reasons to feel cyg disgruntled - but they frequently weren't very careful about aiming their ire accurately. cyg Given that RMS really was *capable* of coming very close to the cyg performance capabilities of the underlying hardware, your cyg allegations just don't ring true. Not being able to jump into And where is your proof that it was capable of coming very close to the...? cyg It's simple: I *know* it, because I worked *with*, and *on*, it cyg - for many years. So when some bozo who worked with people with cyg a major known chip on their shoulder over two decades ago comes cyg along and knocks its capabilities, asking for specifics (not even cyg hard evidence, just specific allegations which could be evaluated cyg and if appropriate confronted) is hardly unreasonable. Bill, you openly criticize people (their work) who have worked on ZFS for years... not that there's anything wrong with that, just please realize that because you were working on it it doesn't mean it is/was perfect - just the same as with ZFS. Of course it doesn't - and I never claimed that RMS was anything close to 'perfect' (I even gave specific examples of areas in which it was *far* from perfect). Just as I've given specific examples of where ZFS is far from perfect. What I challenged was David's assertion that RMS was severely deficient in its *capabilities* - and demanded not 'proof' of any kind but only specific examples (comparable in specificity to the examples of ZFS's deficiencies that *I* have provided) that could actually be discussed. I know, everyone loves their baby... No, you don't know: you just assume that everyone is as biased as you and others here seem to be. Nevertheless just because you were working on and with it, it's not a proof. The person you were replying to was also working with it (but not on it I guess). Not that I'm interested in such a proof. Just noticed that you're demanding some proof, while you also just write some statements on its performance without any actual proof. You really ought to spend a lot more time understanding what you've read before responding to it, Robert. I *never* asked for anything like 'proof': I asked for *examples* specific enough to address - and repeated that explicitly in responding to your previous demand for 'proof'. Perhaps I should at that time have observed that your demand for 'proof' (your use of quotes suggesting that it was something that *I* had demanded) was ridiculous, but I thought my response made that obvious. Let me use your own words: In other words, you've got nothing, but you'd like people to believe it's something. The phrase Put up or shut up comes to mind. Where are your proofs on some of your claims about ZFS?
cyg Well, aside from the fact that anyone with even half a clue cyg knows what the effects of uncontrolled file fragmentation are on cyg sequential access performance (and can even estimate those cyg effects within moderately small error bounds if they know what cyg the disk characteristics are and how bad the fragmentation is), cyg if you're looking for additional evidence that even someone cyg otherwise totally ignorant could appreciate there's the fact that I've never said there are not fragmentation problems with ZFS. Not having made a study of your collected ZFS contributions here I didn't know that. But some of ZFS's developers are on record stating that they believe there is no need to defragment (unless they've changed their views since and not bothered to make us aware of it), and in the entire discussion in the recent 'ZFS + DB + fragments' thread there were only three contributors (Roch, Anton, and I) who seemed willing to admit that any problem existed. So since one of my 'claims' for which you requested substantiation involved fragmentation problems, it seemed appropriate to address them. Well, actually I've been hit by the issue in one environment. But didn't feel any impulse to mention that during all the preceding discussion, I guess. Also you haven't done your homework properly, as one of ZFS developers actually stated they are going to work on ZFS de-fragmentation and disk removal (pool shrinking). See http://www.opensolaris.org/jive/thread.jspa?messageID=139680 Hmmm - there were at least two Sun ZFS personnel participating in the database thread, and they never mentioned
Re: [zfs-discuss] Yager on ZFS
... Bill - I don't think there's a point in continuing that discussion. I think you've finally found something upon which we can agree. I still haven't figured out exactly where on the stupid/intellectually dishonest spectrum you fall (lazy is probably out: you have put some effort into responding), but it is clear that you're hopeless. On the other hand, there's always the possibility that someone else learned something useful out of this. And my question about just how committed you were to your ignorance has been answered. It's difficult to imagine how someone so incompetent in the specific area that he's debating can be so self-assured - I suspect that just not listening has a lot to do with it - but also kind of interesting to see that in action. - bill
Re: [zfs-discuss] Yager on ZFS
Monday, December 10, 2007, 3:35:27 AM, you wrote: cyg and it made them slower cyg That's the second time you've claimed that, so you'll really at cyg least have to describe *how* you measured this even if the cyg detailed results of those measurements may be lost in the mists of time. cyg So far you don't really have much of a position to defend at cyg all: rather, you sound like a lot of the disgruntled TOPS users cyg of that era. Not that they didn't have good reasons to feel cyg disgruntled - but they frequently weren't very careful about aiming their ire accurately. cyg Given that RMS really was *capable* of coming very close to the cyg performance capabilities of the underlying hardware, your cyg allegations just don't ring true. Not being able to jump into And where is your proof that it was capable of coming very close to the...? It's simple: I *know* it, because I worked *with*, and *on*, it - for many years. So when some bozo who worked with people with a major known chip on their shoulder over two decades ago comes along and knocks its capabilities, asking for specifics (not even hard evidence, just specific allegations which could be evaluated and if appropriate confronted) is hardly unreasonable. Hell, *I* gave more specific reasons why someone might dislike RMS in particular and VMS in general (complex and therefore user-unfriendly low-level interfaces and sometimes poor *default* performance) than David did: they just didn't happen to match those that he pulled out of (wherever) and that I challenged. Let me use your own words: In other words, you've got nothing, but you'd like people to believe it's something. The phrase Put up or shut up comes to mind. Where are your proofs on some of your claims about ZFS? Well, aside from the fact that anyone with even half a clue knows what the effects of uncontrolled file fragmentation are on sequential access performance (and can even estimate those effects within moderately small error bounds if they know what the disk characteristics are and how bad the fragmentation is), if you're looking for additional evidence that even someone otherwise totally ignorant could appreciate there's the fact that Unix has for over two decades been constantly moving in the direction of less file fragmentation on disk - starting with the efforts that FFS made to at least increase proximity and begin to remedy the complete disregard for contiguity that the early Unix file system displayed and to which ZFS has apparently regressed, through the additional modifications that Kleiman and McVoy introduced in the early '90s to group 56 KB of blocks adjacently when possible, through the extent-based architectures of VxFS, XFS, JFS, and soon-to-be ext4 file systems (I'm probably missing others here): given the relative changes between disk access times and bandwidth over the past decade and a half, ZFS with its max 128 KB blocks in splendid isolation offers significantly worse sequential performance relative to what's attainable than the systems that used 56 KB aggregates back then did (and they weren't all that great in that respect). Given how slow Unix was to understand and start to deal with this issue, perhaps it's not surprising how ignorant some Unix people still are - despite the fact that other platforms fully understood the problem over three decades ago.
Last I knew, ZFS was still claiming that it needed nothing like defragmentation, while describing write allocation mechanisms that could allow disastrous degrees of fragmentation under conditions that I've described quite clearly. If ZFS made no efforts whatsoever in this respect the potential for unacceptable performance would probably already have been obvious even to its blindest supporters, so I suspect that when ZFS is given the opportunity by a sequentially-writing application that doesn't force every write (or by use of the ZIL in some cases) it aggregates blocks in a file together in cache and destages them in one contiguous chunk to disk (rather than just mixing blocks willy-nilly in its batch disk writes) - and a lot of the time there's probably not enough other system write activity to make this infeasible, so that people haven't found sequential streaming performance to be all that bad most of the time (especially on the read end if their systems are lightly loaded and the fact that their disks may be working a lot harder than they ought to have to is not a problem). But the potential remains for severe fragmentation under heavily parallel access conditions, or when a file is updated at fine grain but then read sequentially (the whole basis of the recent database thread), and with that fragmentation comes commensurate performance degradation. And even if you're not capable of understanding why yourself you should consider it significant that no one on the ZFS development team has piped up to say
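For anyone who wants to see the kind of estimate referred to above, here's the standard seek-plus-transfer arithmetic in Python (illustrative disk parameters; the point is the shape of the curve, not the exact figures):

SEEK = 0.008          # average seek + rotational latency, seconds (assumed)
BANDWIDTH = 60e6      # sustained media rate, bytes/second (assumed)

def effective_rate(fragment_bytes: float) -> float:
    """Sequential-read throughput if the file is laid out in contiguous
    fragments of the given size, each costing one seek to reach."""
    transfer = fragment_bytes / BANDWIDTH
    return fragment_bytes / (SEEK + transfer)

for kb in (56, 128, 1024, 16 * 1024):
    print(f"{kb:6d} KB fragments: {effective_rate(kb * 1024) / 1e6:5.1f} MB/s")

With these assumptions, 128 KB fragments deliver roughly 13 MB/s of a 60 MB/s disk - barely a fifth of the sustained media rate - which is the heart of the complaint about 128 KB blocks in splendid isolation.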
Re: [zfs-discuss] Yager on ZFS
... I remember trying to help customers move their applications from TOPS-20 to VMS, back in the early 1980s, and finding that the VMS I/O capabilities were really badly lacking. Funny how that works: when you're not familiar with something, you often mistake your own ignorance for actual deficiencies. Of course, the TOPS-20 crowd was extremely unhappy at being forced to migrate at all, and this hardly improved their perception of the situation. If you'd like to provide specifics about exactly what was supposedly lacking, it would be possible to evaluate the accuracy of your recollection. I've played this game before, and it's off-topic and too much work to be worth it. In other words, you've got nothing, but you'd like people to believe it's something. The phrase Put up or shut up comes to mind. Researching exactly when specific features were released into VMS RMS from this distance would be a total pain, I wasn't asking for anything like that: I was simply asking for specific examples of the VMS I/O capabilities that you allegedly 'found' were really badly lacking in the early 1980s. Even if the porting efforts you were involved in predated the pivotal cancellation of Jupiter in 1983, that was still close enough to the VMS cluster release that most VMS development effort had turned in that direction (i.e., the single-system VMS I/O subsystem had pretty well reached maturity), so there won't be any need to quibble about what shipped when. Surely if you had a sufficiently strong recollection to be willing to make such a definitive assertion you can remember *something* specific. and then we'd argue about which ones were beneficial for which situations, which people didn't much agree about then or since. No, no, no: you're reading far more generality into this than I ever suggested. I'm not asking you to judge what was useful, and I couldn't care less whether you thought the features that VMS had and TOPS lacked were valuable: I'm just asking you to be specific about what VMS I/O capabilities you claim were seriously deficient. My experience at the time was that RMS was another layer of abstraction and performance loss between the application and the OS, Ah - your 'experience'. So you actually measured RMS's effect on performance, rather than just SWAGged that adding a layer that you found unappealing in a product that your customers were angry about having to move to Must Be A Bad Idea? What was the quantitative result of that measurement, and how was RMS configured for the relevant workload? After all, the extra layer wasn't introduced just to give you something to complain about: it was there to provide additional features and configuration flexibility (much of it performance-related), as described above. If you didn't take advantage of those facilities, that could be a legitimate *complexity* knock against the environment but it's not a legitimate *capability* or *performance* knock (rather the opposite, in fact). and it made it harder to do things If you were using the RMS API itself rather than accessing RMS through a higher-level language that provided simple I/O handling for simple I/O needs, that was undoubtedly the case: as I observed above, that's a price that VMS was happy to pay for providing complete control to applications that wanted it.
RMS was designed from the start to provide that alternative with the understanding that access via higher-level language mechanisms would usually be used by those people who didn't need the low-level control that the native RMS API provided. and it made them slower That's the second time you've claimed that, so you'll really at least have to describe *how* you measured this even if the detailed results of those measurements may be lost in the mists of time. and it made files less interchangeable between applications; That would have been some trick, given that RMS supported pure byte-stream files as well as its many more structured types (and I'm pretty sure that the C run-time system took this approach, using RMS direct I/O and doing its own deblocking to ensure that some of the more idiomatic C activities like single-character reads and writes would not inadvertently perform poorly). So at worst you could have used precisely the same in-file formats that were being used in the TOPS-20 environment and achieved the same degree of portability (unless you were actually encountering peculiarities in language access rather than in RMS itself: I'm considerably less familiar with that end of the environment). but I'm not interested in trying to defend this position for weeks based on 25-year-old memories. So far you don't really have much of a position to defend at all: rather, you sound like a lot of the disgruntled TOPS users of that era. Not that they didn't have good reasons to feel disgruntled - but
Re: [zfs-discuss] Yager on ZFS
why don't you put your immense experience and knowledge to contribute to what is going to be the next and only filesystems in modern operating systems, Ah - the pungent aroma of teenage fanboy wafts across the Net. ZFS is not nearly good enough to become what you suggest above, nor is it amenable to some of the changes necessary to make it good enough. So while I'm happy to give people who have some personal reason to care about it pointers on how it could be improved, I have no interest in working on it myself. instead of spending your time asking for specifics You'll really need to learn to pay a lot more attention to specifics yourself if you have any desire to become technically competent when you grow up. and treating everyone as ignorant I make some effort only to treat the ignorant as ignorant. It's hardly my fault that they are so common around here, but I'd like to think that there's a silent majority of more competent individuals in the forum who just look on quietly (and perhaps somewhat askance). It used to be that the ignorant felt motivated to improve themselves, but now they seem more inclined to engage in aggressive denial (which may be easier on the intellect but seems a less productive use of energy). - bill
Re: [zfs-discuss] OT: NTFS Single Instance Storage (Re: Yager on ZFS
[EMAIL PROTECTED] wrote: Darren, Do you happen to have any links for this? I have not seen anything about NTFS and CAS/dedupe besides some of the third party apps/services that just use NTFS as their backing store. Single Instance Storage is what Microsoft uses to refer to this: http://research.microsoft.com/sn/Farsite/WSS2000.pdf While SIS is likely useful in certain environments, it is actually layered on top of NTFS rather than part of it - and in fact could in principle be layered on top of just about any underlying file system in any OS that supported layered 'filter' drivers. File access to a shared file via SIS runs through an additional phase of directory look-up similar to that involved in following a symbolic link, and its described copy-on-close semantics require divided data access within the updater's version of the file (fetching unchanged data from the shared copy and changed data from the to-be-fleshed-out-after-close copy) with apparently no mechanism to avoid the need to copy the entire file after close even if only a single byte within it has been changed (which could compromise its applicability in some environments). Nonetheless, unlike most dedupe products it does apply to on-line rather than backup storage, and Microsoft deserves credit for fielding it well in advance of the dedupe startups: once in a while they actually do produce something that qualifies as at least moderately innovative. NTFS was at least respectable if not ground-breaking as well when it first appeared, and it's too bad that it has largely stagnated since while MS pursued its 'structured storage' and similar dreams (one might suspect in part to try to create a de facto storage standard that competitors couldn't easily duplicate, limiting the portability of applications built to take advantage of its features without attracting undue attention from trust-busters, such as they are these days - but perhaps I'm just too cynical). - bill
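To make the shared-copy/per-file-reference structure concrete, here's a toy Python sketch of file-level single-instancing (content hash mapping to one stored payload, plus per-name links - real SIS is an NTFS filter driver with reparse-point-style links and the copy-on-close semantics described above, not anything like this):

import hashlib

class SingleInstanceStore:
    """Toy file-level dedup in the SIS spirit: identical contents are kept
    once in a common store; each logical file is just a reference."""
    def __init__(self):
        self.store = {}    # content hash -> bytes (the single instance)
        self.links = {}    # filename -> content hash

    def write(self, name: str, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        self.store.setdefault(digest, data)   # payload stored only once
        self.links[name] = digest

    def read(self, name: str) -> bytes:
        return self.store[self.links[name]]

sis = SingleInstanceStore()
sis.write("a.doc", b"same bytes")
sis.write("b.doc", b"same bytes")             # no second copy stored
print(len(sis.store), "stored instance(s) for 2 files")   # -> 1

Note that this toy dedups at write time; SIS as described instead shares whole files after the fact and un-shares them via copy-on-close when one is modified.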
Re: [zfs-discuss] Yager on ZFS
from the description here http://www.djesys.com/vms/freevms/mentor/rms.html so who cares here ? RMS is not a filesystem, but more a CAS type of data repository Since David begins his description with the statement "RMS stands for Record Management Services. It is the underlying file system of OpenVMS," I'll suggest that your citation fails a priori to support your allegation above. Perhaps you're confused by the fact that RMS/Files-11 is a great deal *more* of a file system than most Unix examples (though ReiserFS was at least heading in somewhat similar directions). You might also be confused by the fact that VMS separates its file system facilities into an underlying block storage and directory layer specific to disk storage and the upper RMS deblocking/interpretation/pan-device layer, whereas Unix combines the two. Better acquainting yourself with what CAS means in the context of contemporary disk storage solutions might be a good idea as well, since it bears no relation to RMS (nor to virtually any Unix file system). - bill
Re: [zfs-discuss] Mail system errors (On Topic).
Yet another prime example. Ah - yet another brave denizen (and top-poster) who's more than happy to dish it out but squeals for administrative protection when receiving a response in kind. The fact that your pleas seem to be going unanswered actually reflects rather well on whoever is managing this forum: even if they don't particularly care for my attitude, they appear to recognize that there's a good reason why I deal with some of you as I have. Do have a nice day. - bill
Re: [zfs-discuss] Yager on ZFS
can you run a database on RMS? As well as you could on most Unix file systems. And you've been able to do so for almost three decades now (whereas features like asynchronous and direct I/O are relative newcomers in the Unix environment). I guess it's not suited And you guess wrong: that's what happens when you speak from ignorance rather than from something more substantial. we are already trying to get rid of a 15 years old filesystem called wafl, Whatever for? Please be specific about exactly what you expect will work better with whatever you're planning to replace it with - and why you expect it to be anywhere nearly as solid. and a 10 years old file system called Centera, My, you must have been one of the *very* early adopters, since EMC launched it only 5 1/2 years ago. so do you think we are going to consider a 35 years old filesystem now... computer science made a lot of improvement since Well yes, and no. For example, most Unix platforms are still struggling to match the features which VMS clusters had over two decades ago: when you start as far behind as Unix did, even continual advances may still not be enough to match such 'old' technology. Not that anyone was suggesting that you replace your current environment with RMS: if it's your data, knock yourself out using whatever you feel like using. On the other hand, if someone else is entrusting you with *their* data, they might be better off looking for someone with more experience and sense. - bill
Re: [zfs-discuss] Yager on ZFS
can you guess? wrote: can you run a database on RMS? As well as you could on most Unix file systems. And you've been able to do so for almost three decades now (whereas features like asynchronous and direct I/O are relative newcomers in the Unix environment). I remember trying to help customers move their applications from TOPS-20 to VMS, back in the early 1980s, and finding that the VMS I/O capabilities were really badly lacking. Funny how that works: when you're not familiar with something, you often mistake your own ignorance for actual deficiencies. Of course, the TOPS-20 crowd was extremely unhappy at being forced to migrate at all, and this hardly improved their perception of the situation. If you'd like to provide specifics about exactly what was supposedly lacking, it would be possible to evaluate the accuracy of your recollection. RMS was an abomination -- nothing but trouble, Again, specifics would allow an assessment of that opinion. and another layer to keep you away from your data. Real men use raw disks, of course. And with RMS (unlike Unix systems of that era) you could get very close to that point if you wanted to without abandoning the file level of abstraction - or work at a considerably more civilized level if you wanted that with minimal sacrifice in performance (again, unlike the Unix systems of that era, where storage performance was a joke until FFS began to improve things - slowly). VMS and RMS represented a very different philosophy than Unix: you could do anything, and therefore were exposed to the complexity that this flexibility entailed. Unix let you do things one simple way - whether it actually met your needs or not. Back then, efficient use of processing cycles (even in storage applications) could be important - and VMS and RMS gave you that option. Nowadays, trading off cycles to obtain simplicity is a lot more feasible, and the reasons for the complex interfaces of yesteryear can be difficult to remember. - bill
Re: [zfs-discuss] Yager on ZFS
You have me at a disadvantage here, because I'm not even a Unix (let alone Solaris and Linux) aficionado. But don't Linux snapshots in conjunction with rsync (leaving aside other possibilities that I've never heard of) provide rather similar capabilities (e.g., incremental backup or re-synching), especially when used in conjunction with scripts and cron? Which explains why you keep ranting without knowing what you're talking about. Au contraire, cookie: I present things in detail to make it possible for anyone capable of understanding the discussion to respond substantively if there's something that requires clarification or further debate. You, by contrast, babble on without saying anything substantive at all - which makes you kind of amusing, but otherwise useless. You could at least have tried to answer my question above, since you took the trouble to quote it - but of course you didn't, just babbled some more. Copy-on-write. Even a bookworm with 0 real-life-experience should be able to apply this one to a working situation. As I may well have been designing and implementing file systems since before you were born (or not: you just have a conspicuously callow air about you), my 'real-life' experience with things like COW is rather extensive. And while I don't have experience with Linux adjuncts like rsync, unlike some people I'm readily able to learn from the experience of others (who seem far more credible when describing their successful use of rsync and snapshots on Linux than anything I've seen you offer up here). There's a reason ZFS (and netapp) can take snapshots galore without destroying their filesystem performance. Indeed: it's because ZFS already sacrificed a significant portion of that performance by disregarding on-disk contiguity, so there's relatively little left to lose. By contrast, systems that respect the effects of contiguity on performance (and WAFL does to a greater degree than ZFS) reap its benefits all the time (whether snapshots exist or not) while only paying a penalty when data is changed (and they don't have to change as much data as ZFS does because they don't have to propagate changes right back to the root superblock on every update). It is possible to have nearly all of the best of both worlds, but unfortunately not with any current implementations that I know of. ZFS could at least come considerably closer, though, if it reorganized opportunistically as discussed in the database thread. (By the way, since we're talking about snapshots here rather than about clones it doesn't matter at all how many there are, so your 'snapshots galore' bluster above is just more evidence of your technical incompetence: with any reasonable implementation the only run-time overhead occurs in keeping the most recent snapshot up to date, regardless of how many older snapshots may also be present.) But let's see if you can, for once, actually step up to the plate and discuss something technically, rather than spout buzzwords that you apparently don't come even close to understanding: Are you claiming that writing snapshot before-images of modified data (as, e.g., Linux LVM snapshots do) for the relatively brief period that it takes to transfer incremental updates to another system 'destroys' performance? First of all, that's clearly dependent upon the update rate during that interval, so if it happens at a quiet time (which presumably would be arranged if its performance impact actually *was* a significant issue) your assertion is flat-out-wrong. 
Even if the snapshot must be processed during normal operation, maintaining it still won't be any problem if the run-time workload is read-dominated. And I suppose Sun must be lying in its documentation for fssnap (which Sun has offered since Solaris 8 with good old update-in-place UFS) where it says "While the snapshot is active, users of the file system might notice a slight performance impact [as contrasted with your contention that performance is 'destroyed'] when the file system is written to, but they see no impact when the file system is read" (http://docsun.cites.uiuc.edu/sun_docs/C/solaris_9/SUNWaadm/SYSADV1/p185.html). You'd really better contact them right away and set them straight. Normal system cache mechanisms should typically keep about-to-be-modified data around long enough to avoid the need to read it back in from disk to create the before-image for modified data used in a snapshot, and using a log-structured approach to storing these BIs in the snapshot file or volume (though I don't know what specific approaches are used in fssnap and LVM: do you?) would be extremely efficient - resulting in minimal impact on normal system operation regardless of write activity. C'mon, cookie: surprise us for once - say something intelligent. With guidance and practice, you might even be able to make a habit of it. - bill
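For reference, a minimal Python sketch of the before-image mechanism under discussion (my reading of the general fssnap/LVM-style approach, not either one's actual implementation): the only write-path cost is one saved copy the *first* time a block changes after the snapshot, and reads of the live volume are entirely untouched:

class BeforeImageSnapshot:
    """Before-image snapshot sketch: on the first write to a block after
    the snapshot, save the old contents; later writes to the same block,
    and all reads of the live volume, pay nothing extra."""
    def __init__(self, volume: dict):
        self.volume = volume
        self.before_images = {}   # block number -> pre-snapshot contents

    def write(self, block: int, data: bytes):
        if block not in self.before_images:               # first touch only
            self.before_images[block] = self.volume.get(block)
        self.volume[block] = data                          # update in place

    def snapshot_read(self, block: int) -> bytes:
        # Snapshot view: before-image if the block changed, else live data.
        return self.before_images.get(block, self.volume.get(block))

vol = {0: b"v0", 1: b"v1"}
snap = BeforeImageSnapshot(vol)
snap.write(0, b"v0'")
snap.write(0, b"v0''")                      # second write: no extra copy
print(snap.snapshot_read(0), vol[0])        # -> b"v0" b"v0''"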
Re: [zfs-discuss] Mail system errors (On Topic).
I keep getting ETOOMUCHTROLL errors thrown while reading this list, is there a list admin that can clean up the mess? I would hope that repeated personal attacks could be considered grounds for removal/blocking. Actually, most of your more unpleasant associates here seem to suffer primarily from blind and misguided loyalty and/or an excess of testosterone - so there's always hope that they'll grow up over time and become productive contributors. And if I'm not complaining about their attacks but just dealing with them in kind while carrying on more substantive conversations, it's not clear that they should pose a serious problem for others. - bill
Re: [zfs-discuss] Yager on ZFS
Once again, profuse apologies for having taken so long (well over 24 hours by now - though I'm not sure it actually appeared in the forum until a few hours after its timestamp) to respond to this. can you guess? wrote: Primarily its checksumming features, since other open source solutions support simple disk scrubbing (which given its ability to catch most deteriorating disk sectors before they become unreadable probably has a greater effect on reliability than checksums in any environment where the hardware hasn't been slapped together so sloppily that connections are flaky). From what I've read on the subject, That premise seems bad from the start. Then you need to read more or understand it better. I don't believe that scrubbing will catch all the types of errors that checksumming will. That's absolutely correct, but it in no way contradicts what I said (and you quoted) above. Perhaps you should read that again, more carefully: it merely states that disk scrubbing probably has a *greater* effect on reliability than checksums do, not that it completely subsumes their features. There are a category of errors that are not caused by firmware, or any type of software. The hardware just doesn't write or read the correct bit value this time around. Without a checksum there's no way for the firmware to know, and next time it very well may write or read the correct bit value from the exact same spot on the disk, so scrubbing is not going to flag this sector as 'bad'. It doesn't have to, because that's a *correctable* error that the disk's extensive correction codes (which correct *all* single-bit errors as well as most considerably longer error bursts) resolve automatically. Now you may claim that this type of error happens so infrequently No, it's actually one of the most common forms, due to the desire to pack data on the platter as tightly as possible: that's why those long correction codes were created. Rather than comment on the rest of your confused presentation about disk error rates, I'll just present a capsule review of the various kinds:

1. Correctable errors (which I just described above). If a disk notices that a sector *consistently* requires correction it may deal with it as described in the next paragraph.

2. Errors that can be corrected only with retries (i.e., the sector is not *consistently* readable even after the ECC codes have been applied, but can be successfully read after multiple attempts which can do things like fiddle slightly with the head position over the track and signal amplification to try to get a better response). A disk may try to rewrite such a sector in place to see if its readability improves as a result, and if it doesn't will then transparently revector the data to a spare sector if one exists and mark the original sector as 'bad'. Background scrubbing gives the disk an opportunity to discover such sectors *before* they become completely unreadable, thus significantly improving reliability even in non-redundant environments.

3. Uncorrectable errors (bursts too long for the ECC codes to handle even after the kinds of retries described above, but which the ECC codes can still detect): scrubbing catches these as well, and if suitable redundancy exists it can correct them by rewriting the offending sector (the disk may transparently revector it if necessary, or the LVM or file system can if the disk can't).
Disk vendor specs nominally state that one such error may occur for every 10^14 bits transferred for a contemporary commodity (ATA or SATA) drive (i.e., about once in every 12.5 TB), but studies suggest that in practice they're much rarer. 4. Undetectable errors (errors which the ECC codes don't detect but which ZFS's checksums presumably would). Disk vendors no longer provide specs for this reliability metric. My recollection from a decade or more ago is that back when they used to it was three orders of magnitude lower than the uncorrectable error rate: if that still obtained it would mean about once in every 12.5 petabytes transferred, but given that the real-world incidence of uncorrectable errors is so much lower than specified and that ECC codes keep increasing in length it might be far lower than that now. ... Aside from the problems that scrubbing handles (and you need scrubbing even if you have checksums, because scrubbing is what helps you *avoid* data loss rather than just discover it after it's too late to do anything about it), and aside from problems Again I think you're wrong on the basis for your point. No: you're just confused again. The checksumming in ZFS (if I understand it correctly) isn't used for only detecting the problem. If the ZFS pool has any redundancy at all, those same checksums can be used to repair that same data, thus *avoiding* the data loss. 1. Unlike things like disk ECC codes, ZFS's checksums can't
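For concreteness, here's a quick back-of-the-envelope sketch in Python of the rates quoted above (the 10^14 figure is the nominal vendor spec; the petabyte figure merely scales it by my possibly-stale recollection of a factor of 1000):

    # Back-of-the-envelope: how often the quoted error rates bite.
    # Assumes the nominal vendor spec of one uncorrectable error per
    # 1e14 bits and a (recalled, possibly stale) factor-of-1000 gap
    # down to undetectable errors.
    uncorrectable_per_bit = 1 / 1e14
    bytes_per_uncorrectable = 1 / (uncorrectable_per_bit * 8)
    print(bytes_per_uncorrectable / 1e12)  # ~12.5 TB transferred per error
    bytes_per_undetectable = bytes_per_uncorrectable * 1000
    print(bytes_per_undetectable / 1e15)   # ~12.5 PB, if that old ratio held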
Re: [zfs-discuss] zfs rollback without unmounting a file system
Allowing a filesystem to be rolled back without unmounting it sounds unwise, given the potentially confusing effect on any application with a file currently open there. And if a user can't roll back their home directory filesystem, is that so bad? Presumably they can still access snapshot versions of individual files or even entire directory sub-trees and copy them to their current state if they want to - or whistle up someone else to perform a rollback of their home directory if they really need to. I'm not normally one to advocate protecting users from themselves, but I do think that applications have some right to believe that there are some guarantees about stability as long as they have a file open (and that the system should terminate that access if it can't sustain those guarantees). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
So name these mystery alternatives that come anywhere close to the protection, If you ever progress beyond counting on your fingers you might (with a lot of coaching from someone who actually cares about your intellectual development) be able to follow Anton's recent explanation of this (given that the higher-level overviews which I've provided apparently flew completely over your head). functionality, I discussed that in detail elsewhere here yesterday (in more detail than previously in an effort to help the slower members of the class keep up). and ease of use That actually may be a legitimate (though hardly decisive) ZFS advantage: it's too bad its developers didn't extend it farther (e.g., by eliminating the vestiges of LVM redundancy management and supporting seamless expansion to multi-node server systems). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
apologies in advance for prolonging this thread .. Why do you feel any need to? If you were contributing posts as completely devoid of technical content as some of the morons here have recently been submitting I could understand it, but my impression is that the purpose of this forum is to explore the kind of questions that you're interested in discussing. i had considered taking this completely offline, but thought of a few people at least who might find this discussion somewhat interesting And any who don't are free to ignore it, so no harm done there either. .. at the least i haven't seen any mention of Merkle trees yet as the nerd in me yearns for I'd never heard of them myself until recently, despite having independently come up with the idea of using a checksumming mechanism very similar to ZFS's. Merkle seems to be an interesting guy - his home page is worth a visit. On Dec 5, 2007, at 19:42, bill todd - aka can you guess? wrote: what are you terming as ZFS' incremental risk reduction? .. (seems like a leading statement toward a particular assumption) Primarily its checksumming features, since other open source solutions support simple disk scrubbing (which given its ability to catch most deteriorating disk sectors before they become unreadable probably has a greater effect on reliability than checksums in any environment where the hardware hasn't been slapped together so sloppily that connections are flaky). ah .. okay - at first reading incremental risk reduction seems to imply an incomplete approach to risk The intent was to suggest a step-wise approach to risk, where some steps are far more significant than others (though of course some degree of overlap between steps is also possible). *All* approaches to risk are incomplete. ... i do believe that an interesting use of the merkle tree with a sha256 hash is somewhat of an improvement over conventional volume based data scrubbing techniques Of course it is: that's why I described it as 'incremental' rather than as 'redundant'. The question is just how *significant* an improvement it offers. since there can be a unique integration between the hash tree for the filesystem block layout and a hierarchical data validation method. In addition to finding unknown problem areas with the scrub, you're also doing relatively inexpensive data validation checks on every read. Yup. ... sure - we've seen many transport errors, I'm curious what you mean by that, since CRCs on the transports usually virtually eliminate them as problems. Unless you mean that you've seen many *corrected* transport errors (indicating that the CRC and retry mechanisms are doing their job and that additional ZFS protection in this area is probably redundant). as well as firmware implementation errors Quantitative and specific examples are always good for this kind of thing; the specific hardware involved is especially significant to discussions of the sort that we're having (given ZFS's emphasis on eliminating the need for much special-purpose hardware). .. in fact with many arrays we've seen data corruption issues with the scrub I'm not sure exactly what you're saying here: is it that the scrub has *uncovered* many apparent instances of data corruption (as distinct from, e.g., merely unreadable disk sectors)? (particularly if the checksum is singly stored along with the data block) Since (with the possible exception of the superblock) ZFS never stores a checksum 'along with the data block', I'm not sure what you're saying there either. 
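Since Merkle trees finally got their mention, here's a minimal sketch in Python of the idea under discussion - the structure is illustrative, emphatically *not* ZFS's actual on-disk layout - in which each pointer carries the hash of the child it points to, so corruption anywhere below the root gets caught on read against the *parent's* stored checksum rather than against anything stored with the (possibly corrupted) data itself:

    import hashlib

    # Minimal sketch of a Merkle-tree-style block hierarchy: each node
    # records the SHA-256 of every child it points to, so a bad read
    # anywhere in the tree fails validation against its *parent*.
    # Illustrative structure only, not ZFS's on-disk format.

    def sha256(data):
        return hashlib.sha256(data).digest()

    class Leaf:
        def __init__(self, data):
            self.data = data

    class Node:
        def __init__(self, children):
            self.children = children
            self.child_sums = [sha256(serialize(c)) for c in children]

    def serialize(block):
        if isinstance(block, Leaf):
            return block.data
        return b"".join(block.child_sums)   # a node is just its pointers

    def read_child(node, i):
        raw = serialize(node.children[i])   # stands in for a disk read
        if sha256(raw) != node.child_sums[i]:
            raise IOError("checksum mismatch: corruption caught on read")
        return raw

    leaves = [Leaf(b"sector 0"), Leaf(b"sector 1")]
    root = Node(leaves)
    leaves[1].data = b"bit-flipped"         # silent corruption below
    try:
        read_child(root, 1)
    except IOError as e:
        print(e)                            # detected by the parent's sum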
- just like spam you really want to eliminate false positives that could indicate corruption where there isn't any. The only risk that ZFS's checksums run is the infinitesimal possibility that corruption won't be detected, not that they'll return a false positive. if you take some time to read the on disk format for ZFS you'll see that there's a tradeoff that's done in favor of storing more checksums in many different areas instead of making more room for direct block pointers. While I haven't read that yet, I'm familiar with the trade-off between using extremely wide checksums (as ZFS does - I'm not really sure why, since cryptographic-level security doesn't seem necessary in this application) and limiting the depth of the indirect block tree. But (yet again) I'm not sure what you're trying to get at here. ... on this list we've seen a number of consumer level products including sata controllers, and raid cards (which are also becoming more commonplace in the consumer realm) that can be confirmed to throw data errors. Your phrasing here is a bit unusual ('throwing errors' - or exceptions - is not commonly related to corrupting data). If you're referring to some kind of silent data corruption, once again specifics are important: to put
Re: [zfs-discuss] Yager on ZFS
can you guess? wrote: There aren't free alternatives in linux or freebsd that do what zfs does, period. No one said that there were: the real issue is that there's not much reason to care, since the available solutions don't need to be *identical* to offer *comparable* value (i.e., they each have different strengths and weaknesses and the net result yields no clear winner - much as some of you would like to believe otherwise). I see you carefully snipped You would think the fact zfs was ported to freebsd so quickly would've been a good first indicator that the functionality wasn't already there. A point you appear keen to avoid discussing. Hmmm - do I detect yet another psychic-in-training here? Simply ignoring something that one considers irrelevant does not necessarily imply any active desire to *avoid* discussing it. I suspect that whoever ported ZFS to FreeBSD was a fairly uncritical enthusiast just as so many here appear to be (and I'll observe once again that it's very easy to be one, because ZFS does sound impressive until you really begin looking at it closely). Not to mention the fact that open-source operating systems often gather optional features more just because they can than because they necessarily offer significant value: all it takes is one individual who (for whatever reason) feels like doing the work. Linux, for example, is up to its ears in file systems, all of which someone presumably felt it worthwhile to introduce there. Perhaps FreeBSD proponents saw an opportunity to narrow the gap in this area (especially since incorporating ZFS into Linux appears to have licensing obstacles). In any event, the subject under discussion here is not popularity but utility - *quantifiable* utility - and hence the porting of ZFS to FreeBSD is not directly relevant. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
(Can we declare this thread dead already?) Many have already tried, but it seems to have a great deal of staying power. You, for example, have just contributed to its continued vitality. Others seem to care. *identical* to offer *comparable* value (i.e., they each have different strengths and weaknesses and the net result yields no clear winner - much as some of you would like to believe otherwise). Interoperability counts for a lot for some people. Then you'd better work harder on resolving the licensing issues with Linux. Fewer filesystems to learn about can count too. And since ZFS differs significantly from its more conventional competitors, that's something of an impediment to acceptance. ZFS provides peace of mind that you tell us doesn't matter. Sure it matters, if it gives that to you: just don't pretend that it's of any *objective* significance, because *that* requires actual quantification. And it's actively developed and you and everyone else can see that this is so, Sort of like ext2/3/4, and XFS/JFS (though the latter have the advantage of already being very mature, hence need somewhat less 'active' development). and that recent ZFS improvements and others that are in the pipe (and discussed publicly) are very good improvements, which all portends an even better future for ZFS down the line. Hey, it could even become a leadership product someday. Or not - time will tell. Whatever you do not like about ZFS today may be fixed tomorrow, There'd be more hope for that if its developers and users seemed less obtuse. except for the parts about it being ZFS, opensource, Sun-developed, ..., the parts that really seem to bother you. Specific citations of material that I've posted that gave you that impression would be useful: otherwise, you just look like another self-professed psychic (is this a general characteristic of Sun worshipers, or just of ZFS fanboys?). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
can you guess? wrote: There aren't free alternatives in linux or freebsd that do what zfs does, period. No one said that there were: the real issue is that there's not much reason to care, since the available solutions don't need to be *identical* to offer *comparable* value (i.e., they each have different strengths and weaknesses and the net result yields no clear winner - much as some of you would like to believe otherwise). Ok. So according to you, most of what ZFS does is available elsewhere, and the features it has that nothing else has aren't really a value add, at least not enough to produce a 'clear winner'. Ok, assume for a second that I believe that. Unlike so many here I don't assume things lightly - and this one seems particularly shaky. can you list one other software raid/filesystem that has any feature (small or large) that ZFS lacks? Well, duh. Because if all else is really equal, and ZFS is the only one with any advantages then, whether those advantages are small or not (and I don't agree with how small you think they are - see my other post that you've ignored so far.) Sorry - I do need to sleep sometimes. But I'll get right to it, I assure you (or at worst soon: time has gotten away from me again and I've got an appointment to keep this afternoon). I think there is a 'clear winner' - at least at the moment - Things can change at any time. You don't get out much, do you? How does ZFS fall short of other open-source competitors (I'll limit myself to them, because once you get into proprietary systems - and away from the quaint limitations of Unix file systems - the list becomes utterly unmanageable)? Let us count the ways (well, at least the ones that someone as uninformed as I am about open-source features can come up with off the top of his head), starting in the volume-management arena: 1. RAID-Z, as I've explained elsewhere, is brain-damaged when it comes to effective disk utilization for small accesses (especially reads): RAID-5 offers the same space efficiency with N times the throughput for such workloads (used to be provided by mdadm on Linux unless the Linux LVM now supports it too). 2. DRBD on Linux supports remote replication (IIRC there was an earlier, simpler mechanism that also did). 3. Can you yet shuffle data off a disk such that it can be removed from a zpool? LVM on Linux supports this. 4. Last I knew, you couldn't change the number of disks in a RAID-Z stripe at all, let alone reorganize existing stripe layout on the fly. Typical hardware RAIDs can do this and I thought that Linux RAID support could as well, but can't find verification now - so I may have been remembering a proposed enhancement. And in the file system arena: 5. No user/group quotas? What *were* they thinking? The discussions about quotas here make it clear that per-filesystem quotas are not an adequate alternative: leaving aside the difficulty of simulating both user *and* group quotas using that mechanism, using it raises mount problems when very large numbers of users are involved, plus hard-link and NFS issues crossing mount points. 6. ZFS's total disregard of on-disk file contiguity can torpedo sequential-access performance by well over a decimal order of magnitude in situations where files either start out severely fragmented (due to heavily parallel write activity during their population) or become so due to fine-grained random updates. 7. ZFS's full-path COW approach increases the space overhead of snapshots compared with conventional file systems. 8. Not available on Linux. 
Damn - I've got to run. Perhaps others more familiar with open-source alternatives will add to this list while I'm out. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
I suspect ZFS will change that game in the future. In particular for someone doing lots of editing, snapshots can help recover from user error. Ah - so now the rationalization has changed to snapshot support. Unfortunately for ZFS, snapshot support is pretty commonly available We can cherry pick features all day. People choose ZFS for the combination (as well as its unique features). Actually, based on the self-selected and decidedly unscientific sample of ZFS proponents that I've encountered around the Web lately, it appears that people choose ZFS in large part because a) they've swallowed the Last Word In File Systems viral marketing mantra hook, line, and sinker (that's in itself not all that surprising, because the really nitty-gritty details of file system implementation aren't exactly prime topics of household conversation - even among the technically inclined), b) they've incorporated this mantra into their own self-image (the 'fanboy' phenomenon - but at least in the case of existing Sun customers this is also not very surprising, because dependency on a vendor always tends to engender loyalty - especially if that vendor is not doing all that well and its remaining customers have become increasingly desperate for good news that will reassure them), and/or c) they're open-source zealots who've been sucked in by Jonathan's recent attempt to turn the patent dispute with NetApp into something more profound than the mundane inter-corporation spat which it so clearly is. All of which certainly helps explain why so many of those proponents are so resistant to rational argument: their zeal is not technically based, just technically rationalized (as I was pointing out in the post to which you responded) - much more like the approach of a (volunteer) marketeer with an agenda than like that of an objective analyst (not to suggest that *no one* uses ZFS based on an objective appreciation of the trade-offs involved in doing so, of course - just that a lot of its more vociferous supporters apparently don't). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
my personal-professional data are important (this is my valuation, and it's an assumption you can't dispute). Nor was I attempting to: I was trying to get you to evaluate ZFS's incremental risk reduction *quantitatively* (and if you actually did so you'd likely be surprised at how little difference it makes - at least if you're at all rational about assessing it). ... I think for every fully digital person their own data are vital, and almost everyone would reply NONE to your question about what level of risk a user is willing to tolerate. The fact that appears to escape people like you is that there is *always* some risk, and you *have* to tolerate it (or not save anything at all). Therefore the issue changes to just how *much* risk you're willing to tolerate for a given amount of effort. (There's also always the possibility of silent data corruption, even if you use ZFS - because it only eliminates *some* of the causes of such corruption. If your data is corrupted in RAM during the period when ZFS is not watching over it, for example, you're SOL.) How to *really* protect valuable data has already been thoroughly discussed in this thread, though you don't appear to have understood it. It takes multiple copies (most of them off-line), in multiple locations, with verification of every copy operation and occasional re-verification of the stored content - and ZFS helps with only part of one of these strategies (reverifying the integrity of your on-line copy). If you don't take the rest of the steps, ZFS's incremental protection is virtually useless, because the risk of data loss from causes that ZFS doesn't protect against is so much higher than the incremental protection that it provides (i.e., you may *feel* noticeably better protected but you're just kidding yourself). If you *do* take the rest of the steps, then it takes little additional effort to revalidate your on-line content as well as the off-line copies, so all ZFS provides is a small reduction in effort to achieve the same (very respectable) level of protection that other solutions can achieve when manual steps are taken to reverify the on-line copy as well as the off-line copies. Try to step out of your 'my data is valuable' rut and wrap your mind around the fact that ZFS's marginal contribution to its protection, real though it may be, just isn't very significant in most environments compared to the rest of the protection solution that it *doesn't* help with. That's why I encouraged you to *quantify* the effect that ZFS's protection features have in *your* environment (along with its other risks that ZFS can't ameliorate): until you do that, you're just another fanboy (not that there's anything wrong with that, as long as you don't try to present your personal beliefs as something of more objective validity). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
I was trying to get you to evaluate ZFS's incremental risk reduction *quantitatively* (and if you actually did so you'd likely be surprised at how little difference it makes - at least if you're at all rational about assessing it). ok .. i'll bite since there's no ignore feature on the list yet: what are you terming as ZFS' incremental risk reduction? .. (seems like a leading statement toward a particular assumption) Primarily its checksumming features, since other open source solutions support simple disk scrubbing (which given its ability to catch most deteriorating disk sectors before they become unreadable probably has a greater effect on reliability than checksums in any environment where the hardware hasn't been slapped together so sloppily that connections are flaky). Aside from the problems that scrubbing handles (and you need scrubbing even if you have checksums, because scrubbing is what helps you *avoid* data loss rather than just discover it after it's too late to do anything about it), and aside from problems deriving from sloppy assembly (which tend to become obvious fairly quickly, though it's certainly possible for some to be more subtle), checksums primarily catch things like bugs in storage firmware and otherwise undetected disk read errors (which occur orders of magnitude less frequently than uncorrectable read errors). Robert Milkowski cited some sobering evidence that mid-range arrays may have non-negligible firmware problems that ZFS could often catch, but a) those are hardly 'consumer' products (to address that sub-thread, which I think is what applies in Stefano's case) and b) ZFS's claimed attraction for higher-end (corporate) use is its ability to *eliminate* the need for such products (hence its ability to catch their bugs would not apply - though I can understand why people who needed to use them anyway might like to have ZFS's integrity checks along for the ride, especially when using less-than-fully-mature firmware). And otherwise undetected disk errors occur with negligible frequency compared with software errors that can silently trash your data in ZFS cache or in application buffers (especially in PC environments: enterprise software at least tends to be more stable and more carefully controlled - not to mention their typical use of ECC RAM). So depending upon ZFS's checksums to protect your data in most PC environments is sort of like leaving on a vacation and locking and bolting the back door of your house while leaving the front door wide open: yes, a burglar is less likely to enter by the back door, but thinking that the extra bolt there made you much safer is likely foolish. .. are you just trying to say that without multiple copies of data in multiple physical locations you're not really accomplishing a more complete risk reduction What I'm saying is that if you *really* care about your data, then you need to be willing to make the effort to lock and bolt the front door as well as the back door and install an alarm system: if you do that, *then* ZFS's additional protection mechanisms may start to become significant (because you've eliminated the higher-probability risks and ZFS's extra protection then actually reduces the *remaining* risk by a significant percentage). Conversely, if you don't care enough about your data to take those extra steps, then adding ZFS's incremental protection won't reduce your net risk by a significant percentage (because the other risks that still remain are so much larger). Was my point really that unclear before? 
It seems as if this must be at least the third or fourth time that I've explained it. yes i have read this thread, as well as many of your other posts around usenet and such .. in general i find your tone to be somewhat demeaning (slightly rude too - but - eh, who's counting? i'm not one to judge) As I've said multiple times before, I respond to people in the manner they seem to deserve. This thread has gone on long enough that there's little excuse for continued obtuseness at this point, but I still attempt to be pleasant as long as I'm not responding to something verging on being hostile. - now, you do know that we are currently in an era of collaboration instead of deconstruction, right? Can't tell it from the political climate, and corporations seem to be following that lead (I guess they've finally stopped just gazing in slack-jawed disbelief at what this administration is getting away with and decided to cash in on the opportunity themselves). Or were you referring to something else? .. so i'd love to see the improvements on the many shortcomings you're pointing to and passionate about written up, proposed, and freely implemented :) Then ask the ZFS developers to get on the stick: fixing the fragmentation problem discussed elsewhere should be easy, and RAID-Z is at least amenable to a
Re: [zfs-discuss] Yager on ZFS
he isn't being paid by NetApp.. think bigger O frabjous day! Yet *another* self-professed psychic, but one whose internal voices offer different counsel. While I don't have to be psychic myself to know that they're *all* wrong (that's an advantage of fact-based rather than faith-based opinions), a battle-of-the-incompetents would be amusing to watch (unless it took place in a realm which no mere mortals could visit). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
... Hi bill, only a question: I'm an ex linux user migrated to solaris for zfs and its checksumming; So the question is: do you really need that feature (please quantify that need if you think you do), or do you just like it because it makes you feel all warm and safe? Warm and safe is definitely a nice feeling, of course, but out in the real world of corporate purchasing it's just one feature out of many 'nice to haves' - and not necessarily the most important. In particular, if the *actual* risk reduction turns out to be relatively minor, that nice 'feeling' doesn't carry all that much weight. On the other hand, it's hard to argue for risk *increase* (using something else)... And no one that I'm aware of was doing anything like that: what part of the All things being equal paragraph (I've left it in below in case you missed it the first time around) did you find difficult to understand? - bill ... All things being equal, of course users would opt for even marginally higher reliability - but all things are never equal. If using ZFS would require changing platforms or changing code, that's almost certainly a show-stopper for enterprise users. If using ZFS would compromise performance or require changes in management practices (e.g., to accommodate file-system-level quotas), those are at least significant impediments. In other words, ZFS has its pluses and minuses just as other open-source file systems do, and they *all* have the potential to start edging out expensive proprietary solutions in *some* applications (and in fact have already started to do so). This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
On Tue, 4 Dec 2007, Stefano Spinucci wrote: On 11/7/07, can you guess? [EMAIL PROTECTED] wrote: However, ZFS is not the *only* open-source approach which may allow that to happen, so the real question becomes just how it compares with equally inexpensive current and potential alternatives (and that would make for an interesting discussion that I'm not sure I have time to initiate tonight). - bill Hi bill, only a question: I'm an ex linux user migrated to solaris for zfs and its checksumming; you say there are other open-source alternatives but, for a linux end user, I'm aware only of Oracle btrfs (http://oss.oracle.com/projects/btrfs/), which is a Checksumming Copy on Write Filesystem not in a final state. what *real* alternatives are you referring to??? if I missed something tell me, and I'll happily stay with linux with my data checksummed and snapshotted. bye --- Stefano Spinucci Hi Stefano, Did you get a *real* answer to your question? Do you think that this (quoted) message is a *real* answer? Hi, Al - I see that you're still having difficulty understanding basic English, and your other recent technical-content-free drivel here suggests that you might be better off considering a career in janitorial work than in anything requiring even basic analytical competence. But I remain willing to help you out with English until you can find the time to take a remedial course (though for help with finding a vocation more consonant with your abilities you'll have to look elsewhere). Let's begin by repeating the question at issue, since failing to understand that may be at the core of your problem: what *real* alternatives are you referring to??? Despite a similar misunderstanding by your equally-illiterate associate Mr. Cook, that was not a question about what alternatives provided the specific support in which Stefano was particularly interested (though in another part of my response to him I did attempt to help him understand why that interest might be misplaced). Rather, it was a question about what *I* had referred to in an earlier post of mine, as you might also have gleaned from the first sentence of my response to that question (As I said in the post to which you responded...) had what passes for your brain been even minimally engaged when you read it. My response to that question continued by listing some specific features (snapshots, disk scrubbing, software RAID) available in Linux and FreeBSD that made them viable alternatives to ZFS for enterprise use (the context of that earlier post that I was being questioned about). Whether Linux and FreeBSD also offer management aids I admitted I didn't know - though given ZFS's own limitations in this area such as the need to define mirror pairs and parity groups explicitly and the inability to expand parity groups it's not clear that lack thereof would constitute a significant drawback (especially since the management activities that their file systems require are comparable to what such enterprise installations are already used to dealing with). And, in an attempt to forestall yet another round of babble, I then addressed the relative importance (or lack thereof) of several predictable Yes, but ZFS also offers wonderful feature X... responses. Now, not being a psychic myself, I can't state with authority that Stefano really meant to ask the question that he posed rather than something else. 
In retrospect, I suppose that some of his surrounding phrasing *might* suggest that he was attempting (however unskillfully) to twist my comment about other open source solutions being similarly enterprise-capable into a provably-false assertion that those other solutions offered the *same* features that he apparently considers so critical in ZFS rather than just comparably-useful ones. But that didn't cross my mind at the time: I simply answered the question that he asked, and in passing also pointed out that those features which he apparently considered so critical might well not be. Once again, though, I've reached the limit of my ability to dumb down the discussion in an attempt to reach your level: if you still can't grasp it, perhaps a friend will lend a hand. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
Literacy has nothing to do with the glaringly obvious BS you keep spewing. Actually, it's central to the issue: if you were capable of understanding what I've been talking about (or at least sufficiently humble to recognize the depths of your ignorance), you'd stop polluting this forum with posts lacking any technical content whatsoever. Rather than answer a question, which couldn't be answered, The question that was asked was answered - it's hardly my problem if you could not competently parse the question, or the answer, or the subsequent explanation (though your continuing drivel after those three strikes suggests that you may simply be ineducable). because you were full of it, you tried to convince us all he really didn't know what he wanted. No: I answered his question and *also* observed that he probably really didn't know what he wanted (at least insofar as being able to *justify* the intensity of his desire for it). ... There aren't free alternatives in linux or freebsd that do what zfs does, period. No one said that there were: the real issue is that there's not much reason to care, since the available solutions don't need to be *identical* to offer *comparable* value (i.e., they each have different strengths and weaknesses and the net result yields no clear winner - much as some of you would like to believe otherwise). You can keep talking in circles till you're blue in the face, or I suppose your fingers go numb in this case, but the fact isn't going to change. Yes, people do want zfs for any number of reasons, that's why they're here. Indeed, but it has become obvious that most of the reasons are non-technical in nature. This place is fanboy heaven, where never is heard a discouraging word (and you're hip-deep in buffalo sh!t). Hell, I came here myself 18 months ago because ZFS seemed interesting, but found out that the closer I looked, the less interesting it got. Perhaps it's not surprising that so many of you never took that second step: it does require actual technical insight, which seems to be in extremely short supply here. So short that it's not worth spending time here from any technical standpoint: at this point I'm mostly here for the entertainment, and even that is starting to get a little tedious. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
Now, not being a psychic myself, I can't state with authority that Stefano really meant to ask the question that he posed rather than something else. In retrospect, I suppose that some of his surrounding phrasing *might* suggest that he was attempting (however unskillfully) to twist my comment about other open source solutions being similarly enterprise-capable into a provably-false assertion that those other solutions offered the *same* features that he apparently considers so critical in ZFS rather than just comparably-useful ones. But that didn't cross my mind at the time: I simply answered the question that he asked, and in passing also pointed out that those features which he apparently considered so critical might well not be. dear bill, my question was honest That's how I originally accepted it, and I wouldn't have revisited the issue looking for other interpretations if two people hadn't obviously thought it meant something else. For that matter, even if you actually intended it to mean something else that doesn't imply that there was any devious intent. In any event, what you actually asked was what I had referred to, and I told you: it may not have met your personal goals for your own storage, but that wasn't relevant to the question that you asked (and that I answered). Your English is so good that the possibility that it might be a second language had not occurred to me - but if so it would help explain any subtle miscommunication. ... if there are no alternatives to zfs, As I explained, there are eminently acceptable alternatives to ZFS from any objective standpoint. I'd gladly stick with it, And you're welcome to, without any argument from me - unless you try to convince other people that there are strong technical reasons to do so, in which case I'll challenge you to justify them in detail so that any hidden assumptions can be brought out into the open. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
I suppose we're all just wrong. By George, you've got it! - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
Your response here appears to refer to a different post in this thread. I never said I was a typical consumer. Then it's unclear how your comment related to the material which you quoted (and hence to which it was apparently responding). If you look around photo forums, you'll see an interest in the digital workflow which includes long term storage and archiving. A chunk of these users will opt for an external RAID box (10%? 20%?). I suspect ZFS will change that game in the future. In particular for someone doing lots of editing, snapshots can help recover from user error. Ah - so now the rationalization has changed to snapshot support. Unfortunately for ZFS, snapshot support is pretty commonly available (e.g., in Linux's LVM - and IIRC BSD's as well - if you're looking at open-source solutions) so anyone who actually found this feature important has had access to it for quite a while already. And my original comment which you quoted still obtains as far as typical consumers are concerned. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS write time performance question
And some results (for OLTP workload): http://przemol.blogspot.com/2007/08/zfs-vs-vxfs-vs-ufs-on-scsi-array.html While I was initially hardly surprised that ZFS offered only 11% - 15% of the throughput of UFS or VxFS, a quick glance at Filebench's OLTP workload seems to indicate that it's completely random-access in nature without any of the sequential-scan activity that can *really* give ZFS fits. The fact that you were using an underlying hardware RAID really shouldn't have affected these relationships, given that it was configured as RAID-10. It would be interesting to see your test results reconciled with a detailed description of the tests generated by the Kernel Performance Engineering group which are touted as indicating that ZFS performs comparably with other file systems in database use: I actually don't find that too hard to believe (without having put all that much thought into it) when it comes to straight OLTP without queries that might result in sequential scans, but your observations seem to suggest otherwise (and the little that I have been able to infer about the methodology used to generate some of the rosy-looking ZFS performance numbers does not inspire confidence in the real-world applicability of those internally-generated results). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
On 11/7/07, can you guess? [EMAIL PROTECTED] wrote: However, ZFS is not the *only* open-source approach which may allow that to happen, so the real question becomes just how it compares with equally inexpensive current and potential alternatives (and that would make for an interesting discussion that I'm not sure I have time to initiate tonight). - bill Hi bill, only a question: I'm an ex linux user migrated to solaris for zfs and its checksumming; So the question is: do you really need that feature (please quantify that need if you think you do), or do you just like it because it makes you feel all warm and safe? Warm and safe is definitely a nice feeling, of course, but out in the real world of corporate purchasing it's just one feature out of many 'nice to haves' - and not necessarily the most important. In particular, if the *actual* risk reduction turns out to be relatively minor, that nice 'feeling' doesn't carry all that much weight. you say there are other open-source alternatives but, for a linux end user, I'm aware only of Oracle btrfs (http://oss.oracle.com/projects/btrfs/), which is a Checksumming Copy on Write Filesystem not in a final state. what *real* alternatives are you referring to??? As I said in the post to which you responded, I consider ZFS's ease of management to be more important (given that even in high-end installations storage management costs dwarf storage equipment costs) than its real but relatively marginal reliability edge, and that's the context in which I made my comment about alternatives (though even there if ZFS continues to require definition of mirror pairs and parity groups for redundancy that reduces its ease-of-management edge, as does its limitation to a single host system in terms of ease-of-scaling). Specifically, features like snapshots, disk scrubbing (to improve reliability by dramatically reducing the likelihood of encountering an unreadable sector during a RAID rebuild), and software RAID (to reduce hardware costs) have been available for some time in Linux and FreeBSD, and canned management aids would not be difficult to develop if they don't exist already. The dreaded 'write hole' in software RAID is a relatively minor exposure (since it only compromises data if a system crash or UPS failure - both rare events in an enterprise setting - sneaks in between a data write and the corresponding parity update and then, before the array has restored parity consistency in the background, a disk dies) - and that exposure can be reduced to seconds by a minuscule amount of NVRAM that remembers which writes were active (a toy sketch of such a write-intent record appears below), or to zero with somewhat more NVRAM to remember the updates themselves in an inexpensive hardware solution. The real question is usually what level of risk an enterprise storage user is willing to tolerate. At the paranoid end of the scale reside the users who will accept nothing less than z-series or Tandem-/Stratus-style end-to-end hardware checking from the processor traces on out - which rules out most environments that ZFS runs in (unless Sun's N-series telco products might fill the bill: I'm not very familiar with them). And once you get down into users of commodity processors, the risk level of using stable and robust file systems that lack ZFS's additional integrity checks is comparable to the risk inherent in the rest of the system (at least if the systems are carefully constructed, which should be a given in an enterprise setting) - so other open-source solutions are definitely in play there. 
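To make that NVRAM point concrete, here's a toy Python sketch of a write-intent record (the names and stripe granularity are mine for illustration, not any particular product's; md's write-intent bitmap on Linux is the real-world analogue):

    # Toy write-intent log, sketching the 'minuscule amount of NVRAM
    # that remembers which writes were active' idea. Granularity and
    # names are invented for illustration.

    class WriteIntentBitmap:
        def __init__(self):
            self.dirty = set()            # stands in for NVRAM-backed bits

        def begin_write(self, stripe_no):
            self.dirty.add(stripe_no)     # persist intent *before* writing

        def end_write(self, stripe_no):
            self.dirty.discard(stripe_no) # data and parity both on disk

        def recover(self):
            # After a crash, only stripes still marked dirty need their
            # parity rebuilt - seconds of work rather than a whole-array
            # resync, which is what shrinks the write-hole window.
            return sorted(self.dirty)

    bitmap = WriteIntentBitmap()
    bitmap.begin_write(42)                # crash here -> stripe 42 suspect
    print(bitmap.recover())               # [42]: resync just this stripe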
All things being equal, of course users would opt for even marginally higher reliability - but all things are never equal. If using ZFS would require changing platforms or changing code, that's almost certainly a show-stopper for enterprise users. If using ZFS would compromise performance or require changes in management practices (e.g., to accommodate file-system-level quotas), those are at least significant impediments. In other words, ZFS has its pluses and minuses just as other open-source file systems do, and they *all* have the potential to start edging out expensive proprietary solutions in *some* applications (and in fact have already started to do so). When we move from 'current' to 'potential' alternatives, the scope for competition widens. Because it's certainly possible to create a file system that has all of ZFS's added reliability but runs faster, scales better, incorporates additional useful features, and is easier to manage. That discussion is the one that would take a lot of time to delve into adequately (and might be considered off topic for this forum - which is why I've tried to concentrate here on improvements that ZFS could actually incorporate without turning it upside down). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss
Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?
We will be using Cyrus to store mail on 2540 arrays. We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both connected to same host, and mirror and stripe the LUNs. So a ZFS RAID-10 set composed of 4 LUNs. Multi-pathing also in use for redundancy. Sounds good so far: lots of small files in a largish system with presumably significant access parallelism makes RAID-Z a non-starter, but RAID-5 should be OK, especially if the workload is read-dominated. ZFS might aggregate small writes such that their performance would be good as well if Cyrus doesn't force them to be performed synchronously (and ZFS doesn't force them to disk synchronously on file close); even synchronous small writes could perform well if you mirror the ZFS small-update log: flash - at least the kind with decent write performance - might be ideal for this, but if you want to steer clear of a specialized configuration just carving one small LUN for mirroring out of each array (you could use a RAID-0 stripe on each array if you were compulsive about keeping usage balanced; it would be nice to be able to 'center' it on the disks, but probably not worth the management overhead unless the array makes it easy to do so) should still offer a noticeable improvement over just placing the ZIL on the RAID-5 LUNs. My question is any guidance on best choice in CAM for stripe size in the LUNs? Default is 128K right now, can go up to 512K, should we go higher? By 'stripe size' do you mean the size of the entire stripe (i.e., your default above reflects 32 KB on each data disk, plus a 32 KB parity segment) or the amount of contiguous data on each disk (i.e., your default above reflects 128 KB on each data disk for a total of 512 KB in the entire stripe, exclusive of the 128 KB parity segment)? If the former, by all means increase it to 512 KB: this will keep the largest ZFS block on a single disk (assuming that ZFS aligns them on 'natural' boundaries) and help read-access parallelism significantly in large-block cases (I'm guessing that ZFS would use small blocks for small files but still quite possibly use large blocks for its metadata). Given ZFS's attitude toward multi-block on-disk contiguity there might not be much benefit in going to even larger stripe sizes, though it probably wouldn't hurt noticeably either as long as the entire stripe (ignoring parity) didn't exceed 4 - 16 MB in size (all the above numbers assume the 4 + 1 stripe configuration that you described). In general, having less than 1 MB per-disk stripe segments doesn't make sense for *any* workload: it only takes 10 - 20 milliseconds to transfer 1 MB from a contemporary SATA drive (the analysis for high-performance SCSI/FC/SAS drives is similar, since both bandwidth and latency performance improve), which is comparable to the 12 - 13 ms. that it takes on average just to position to it - and you can still stream data at high bandwidths in parallel from the disks in an array as long as you have a client buffer as large in MB as the number of disks you need to stream from to reach the required bandwidth (you want 1 GB/sec? no problem: just use a 10 - 20 MB buffer and stream from 10 - 20 disks in parallel). Of course, this assumes that higher software layers organize data storage to provide that level of contiguity to leverage... - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
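Spelling out the seek-versus-transfer arithmetic behind that rule of thumb in Python (the drive figures are the rough contemporary values assumed above, not measurements of any particular drive):

    # Rough seek-vs-transfer arithmetic behind the ~1 MB per-disk rule
    # of thumb. Drive figures are approximate assumptions only.
    position_ms = 12.5           # average positioning time, commodity SATA
    bandwidth_mb_per_s = 50.0    # sustained transfer rate

    transfer_ms = 1.0 / bandwidth_mb_per_s * 1000
    print(transfer_ms)           # ~20 ms to move 1 MB - comparable to
                                 # the ~12.5 ms spent just getting there

    # Streaming 1 GB/s from such disks: ~20 in parallel, each handing
    # over 1 MB contiguous chunks into a ~20 MB client buffer.
    disks_needed = 1000.0 / bandwidth_mb_per_s
    print(disks_needed)          # ~20 disks (and ~20 MB of buffer)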
Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?
Any reason why you are using a mirror of raid-5 lun's? Some people aren't willing to run the risk of a double failure - especially when recovery from a single failure may take a long time. E.g., if you've created a disaster-tolerant configuration that separates your two arrays and a fire completely destroys one of them, you'd really like to be able to run the survivor without worrying too much until you can replace its twin (hence each must be robust in its own right). The above situation is probably one reason why 'RAID-6' and similar approaches (like 'RAID-Z2') haven't generated more interest: if continuous on-line access to your data is sufficiently critical to consider them, then it's also probably sufficiently critical to require such a disaster-tolerant approach (which dual-parity RAIDs can't address). It would still be nice to be able to recover from a bad sector on the single surviving site, of course, but you don't necessarily need full-blown RAID-6 for that: you can quite probably get by with using large blocks and appending a private parity sector to them (maybe two private sectors just to accommodate a situation where a defect hits both the last sector in the block and the parity sector that immediately follows it; it would also be nice to know that the block size is significantly smaller than a disk track size, for similar reasons). This would, however, tend to require file-system involvement such that all data was organized into such large blocks: otherwise, all writes for smaller blocks would turn into read/modify/writes. Panasas (I always tend to put an extra 's' into that name, and to judge from Google so do a hell of a lot of other people: is it because of the resemblance to 'parnassas'?) has been crowing about something that it calls 'tiered parity' recently, and it may be something like the above. ... How about running a ZFS mirror over RAID-0 luns? Then again, the downside is that you need intervention to fix a LUN after a disk goes boom! But you don't waste all that space :) 'Wasting' 20% of your disk space (in the current example) doesn't seem all that alarming - especially since you're getting more for that expense than just faster and more automated recovery if a disk (or even just a sector) fails. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
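Here's a toy Python rendering of that private-parity-sector layout (my own sketch of the general idea, not Panasas's actual scheme): each large block carries one trailing XOR sector, enough to rebuild any single lost sector within the block:

    # Toy per-block 'private parity sector': a large block carries one
    # extra XOR sector, letting any single unreadable sector within the
    # block be rebuilt locally, with no array-level redundancy needed.
    # The layout is invented for illustration.
    SECTOR = 512

    def xor_sectors(sectors):
        out = bytearray(SECTOR)
        for s in sectors:
            for i in range(SECTOR):
                out[i] ^= s[i]
        return bytes(out)

    def write_block(data_sectors):
        return list(data_sectors) + [xor_sectors(data_sectors)]

    def recover(block, bad_index):
        survivors = [s for i, s in enumerate(block) if i != bad_index]
        return xor_sectors(survivors)   # XOR of the rest = missing sector

    data = [bytes([i]) * SECTOR for i in range(4)]   # 4-sector block
    block = write_block(data)                        # ...plus parity tail
    assert recover(block, 2) == data[2]              # any one sector back

Note that this buys only single-sector recovery within a block (a second trailing parity sector would cover the adjacent-defect case mentioned above), and it requires the file system to write whole blocks to avoid read/modify/writes.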
Re: [zfs-discuss] x4500 w/ small random encrypted text files
If it's just performance you're after for small writes, I wonder if you've considered putting the ZIL on an NVRAM card? It looks like this can give something like a 20x performance increase in some situations: http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on That's certainly interesting reading, but it may be just a tad optimistic. For example, it lists a throughput of 211 MB/sec with only *one* disk in the main pool - which unless that's also a solid-state disk is clearly unsustainable (i.e., you're just seeing the performance while the solid-state log is filling up, rather than what the performance will eventually stabilize at: my guess is that the solid-state log may be larger than the file being updated, in which case updates just keep accumulating there without *ever* being forced to disk, which is unlikely to occur in most normal environments). The numbers are a bit strange in other areas as well. In the case of a single pool disk and no slog, 11 MB/sec represents about 1400 synchronous 8 KB updates per second on a disk with only about 1/10th that IOPS capacity even with queuing enabled (and when you take into account the need to propagate each such synchronous update all the way back to the superblock it begins to look somewhat questionable even from the bandwidth point of view). One might suspect that what's happening is that once the first synchronous write has been submitted a whole bunch of additional ones accumulate while waiting for the disk to finish the first, and that ZFS is smart enough not to queue them up to the disk (which would require full-path updates for every one of them) but instead to gather them in its own cache and write them all back at once in one fell swoop (including a single update for the ancestor path) when the disk is free again. This would explain not only the otherwise suspicious performance but also why adding the slog provides so little improvement; it's also a tribute to the care that the ZFS developers put into this aspect of their implementation. On the other hand, when an slog is introduced performance actually *declines* in systems with more than one pool disk, suggesting that the developers paid somewhat less attention to this aspect of the implementation (where if the updates are held and batched similarly to my conjecture above they ought to be able to reach something close to the disk's streaming-sequential bandwidth, unless there's some pathological interaction with the pool-disk updates that should have been avoidable). Unless I'm missing something the bottom line appears to be that in the absence of an NVRAM-based slog you might be just as well (and sometimes better) off not using one at all. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
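For anyone who wants to check that arithmetic, here it is in Python (the disk figures are the rough assumptions stated above, not benchmark data):

    # Sanity check of the 11 MB/s synchronous-update figure. Disk
    # numbers are rough assumptions, not benchmark data.
    mb_per_s = 11.0
    update_bytes = 8 * 1024
    sync_updates_per_s = mb_per_s * 1e6 / update_bytes
    print(sync_updates_per_s)    # ~1340 synchronous updates/s claimed

    queued_disk_iops = 140       # generous for one commodity disk
    print(sync_updates_per_s / queued_disk_iops)
    # ~10x the disk's random-IOPS capacity - hence the conjecture that
    # ZFS gathers the updates that pile up while one write is in flight
    # and retires them in a single batch.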
Re: [zfs-discuss] Yager on ZFS
[Zombie thread returns from the grave...] Getting back to 'consumer' use for a moment, though, given that something like 90% of consumers entrust their PC data to the tender mercies of Windows, and a large percentage of those neither back up their data, nor use RAID to guard against media failures, nor protect it effectively from the perils of Internet infection, it would seem difficult to assert that whatever additional protection ZFS may provide would make any noticeable difference in the consumer space - and that was the kind of reasoning behind my comment that began this sub-discussion. As a consumer at home, IT guy at work and amateur photographer, I think ZFS will help change that. Let's see, now: Consumer at home? OK so far. IT guy at work? Nope, nothing like a mainstream consumer, who doesn't want to know about anything like the level of detail under discussion here. Amateur photographer? Well, sort of - except that you seem to be claiming to have reached the *final* stage of evolution that you lay out below, which - again - tends to place you *well* out of the mainstream. Try reading my paragraph above again and seeing just how closely it applies to people like you. Here's what I think photogs evolve through: 1) What are negatives? - Mom/dad taking holiday photos 2) Keep negatives in the envelope - average snapshot photog 3) Keep them filed in boxes - started snapping with a SLR? Might be doing darkroom work 4) Get acid free boxes - pro/am. 5) Store slides in archival environment (humidity, temp, etc). - obsessive In the digital world: 1) keeps them on the card until printed. Only keeps the print 2) copies them to disk, erases them off the card. Gets burned when system disk dies 2a) puts them on CD/DVD. Gets burned a little when the disk dies and some photos not on CD/DVDs yet. OK so far. My wife is an amateur photographer and that's the stage where she's at. Her parents, however, are both retired *professional* photographers - and that's where they're at as well. 3a) gets an external USB drive to store things. Gets burned when that disk dies. That sounds as if it should have been called '2b' rather than '3a', since there's still only one copy of the data. 3b) run raid in the box. 3c) gets an external RAID disk (buffalo/ReadyNAS, etc). While these (finally) give you some redundancy, they don't protect against loss due to user errors, system errors, or virii (well, an external NAS might help some with the last two, but not a simple external RAID). They also cost significantly more (and are considerably less accessible to the average consumer) than simply keeping a live copy on your system plus an archive copy (better yet, *two* archive copies) on DVDs (the latter is what my wife and her folks do for any photos they care about). 4) archives to multiple places. etc... At which point you find out that you didn't need RAID after all (see above): you just leave the photos on a flash card (which are dirt-cheap these days) and your system disk until they've been copied to the archive media. 5) gets ZFS and does transfer direct to local disk from flash card. 
Which doesn't give you any data redundancy at all unless you're using multiple drives (guess how many typical consumers do) and doesn't protect you from user errors, system errors, or virii (unless you use an external NAS to help with the last two - again, guess how many typical consumers do) - and you'd *still* arguably be better off using the approach I described in my previous paragraph (since there's nothing like off-site storage if you want *real* protection).

In other words, you can't even make the ZFS case for the final-stage semi-professional photographer above, let alone anything remotely resembling a 'consumer': you'd just really, really like to justify something that you've become convinced is hot. There's obviously been some highly-effective viral marketing at work here.

Today I can build a Solaris file server for a reasonable price with off-the-shelf parts ($300 + disks).

*Build* a file server? You must be joking: if a typical consumer wants to *buy* a file server they can do so (though I'm not sure that a large percentage of 'typical' consumers actually *have* done so) - but expecting them to go out and shop for one running ZFS is - well, 'hopelessly naive' doesn't begin to do the idea justice.

I can't get near that for a WAFL-based system.

Please don't try to reintroduce WAFL into the consumer part of this discussion: I thought we'd finally succeeded in separating the sub-threads.

... I can see ZFS coming to a ready-made networked RAID box that a pro-am photographer could purchase.

*If* s/he had any interest in ZFS per se - see above.

I don't ever see that with WAFL. And either FS on a network RAID box will be less error prone than a box running ext3/xfs as is typical now.

'Less error
Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?
Hi Bill, ... lots of small files in a largish system with presumably significant access parallelism makes RAID-Z a non-starter, Why do lots of small files in a largish system with presumably significant access parallelism make RAID-Z a non-starter? thanks, max

Every ZFS block in a RAID-Z system is split across the N + 1 disks in a stripe - so not only do N + 1 disks get written for every block update, but N disks get *read* on every block *read*. Normally, small files can be read in a single I/O request to one disk (even in conventional parity-RAID implementations). RAID-Z requires N I/O requests spread across N disks, so for parallel-access reads to small files RAID-Z provides only about 1/Nth the throughput of conventional implementations - unless the disks are sufficiently lightly loaded that they can absorb the additional load that RAID-Z places on them without reducing throughput commensurately.

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
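A minimal sketch of the arithmetic behind that claim, assuming a 4+1 group and 150 random-read IOPS per spindle (both invented, illustrative figures):

DISKS = 5            # assumed 4 data + 1 parity
N_DATA = DISKS - 1
PER_DISK_IOPS = 150  # assumed random-read capacity per spindle

def small_file_reads_per_sec(ios_per_file):
    # aggregate small-file read throughput across the whole group
    return DISKS * PER_DISK_IOPS / ios_per_file

print("conventional parity RAID: %.0f files/sec" % small_file_reads_per_sec(1))       # one I/O per file
print("RAID-Z:                   %.0f files/sec" % small_file_reads_per_sec(N_DATA))  # N I/Os per file

The conventional layout satisfies each small read from a single spindle (750 files/sec under these assumptions); RAID-Z busies every data spindle per read (about 187/sec) - the roughly 1/Nth figure above.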
Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?
We are running Solaris 10u4 - is the log option in there?

Someone more familiar with the specifics of the ZFS releases will have to answer that.

If this ZIL disk also goes dead, what is the failure mode and recovery option then?

The ZIL should at a minimum be mirrored. But since that won't give you as much redundancy as your main pool has, perhaps you should create a small 5-disk RAID-0 LUN sharing the disks of each RAID-5 LUN and mirror the log to all four of them: even if one entire array box is lost, the other will still have a mirrored ZIL and all the RAID-5 LUNs will be the same size (not that I'd expect a small variation in size between the two pairs of LUNs to be a problem that ZFS couldn't handle: can't it handle multiple disk sizes in a mirrored pool as long as each individual *pair* of disks matches?). Having 4 copies of the ZIL on disks shared with the RAID-5 activity will compromise the log's performance, since each log write won't complete until the slowest copy finishes (i.e., congestion in either of the RAID-5 pairs could delay it). It should still usually be faster than just throwing the log in with the rest of the RAID-5 data, though.

Then again, I see from your later comment that you have the same questions that I had about whether the results reported in http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on suggest that having a ZIL may not help much anyway (at least for your specific workload: I can imagine circumstances in which performance of small, synchronous writes might be more critical than other performance, in which case separating them out could be useful).

We did get the 2540 fully populated with 15K 146-gig drives. With 12 disks, and wanting to have at least ONE hot global spare in each array, and needing to keep LUNs the same size, you end up doing 2 5-disk RAID-5 LUNs and 2 hot spares in each array. Not that I really need 2 spares; I just didn't see any way to make good use of an extra disk in each array. If we wanted to dedicate them instead to this ZIL need, what is the best way to go about that?

As I noted above, you might not want to have less redundancy in the ZIL than you have in the main pool: while the data in the ZIL is only temporary (until it gets written back to the main pool), there's a good chance that there will *always* be *some* data in it, so if you lost one array box entirely at least that small amount of data would be at the mercy of any failure on the log disk that made any portion of the log unreadable. Now, if you could dedicate all four spare disks to the log (mirroring it 4 ways) and make each box understand that it was OK to steal one of them to use as a hot spare should the need arise, that might give you reasonable protection (since then any increased exposure would only exist until the failed disk was manually replaced - and normally the other box would still hold two copies as well). But I have no idea whether the box provides anything like that level of configurability.

... Hundreds of POP and IMAP user processes coming and going from users reading their mail. Hundreds more LMTP processes from mail being delivered to the Cyrus mail-store.

And with 10K or more users a *lot* of parallelism in the workload - which is what I assumed given that you had over 1 TB of net email storage space (but I probably should have made that assumption more explicit, just in case it was incorrect).

Sometimes writes predominate over reads, depending on time of day, whether backups are running, etc.
The servers are T2000s with 16 gigs of RAM, so no shortage of room for the ARC cache. I have also turned off cache flush in pursuit of performance.

From Neil's comment in the blog entry that you referenced, that sounds *very* dicey (at least by comparison with the level of redundancy that you've built into the rest of your system) - even if you have rock-solid UPSs (which have still been known to fail). Allowing a disk to lie to higher levels of the system (if indeed that's what you did by 'turning off cache flush') by saying that it's completed a write when it really hasn't is usually a very bad idea, because those higher levels really *do* make important assumptions based on that information.

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
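On the 'slowest copy' point a couple of messages back, a small Monte Carlo sketch of why a four-way mirrored log sharing busy RAID-5 disks pays a latency penalty (the latency distribution is invented purely for illustration):

import random

def copy_latency_ms():
    # assumed per-copy service time: 2 ms base plus ~3 ms mean queueing
    # delay from competing RAID-5 traffic on the shared spindles
    return 2.0 + random.expovariate(1 / 3.0)

def mirrored_log_write_ms(ways):
    # a mirrored log write completes only when its slowest copy finishes
    return max(copy_latency_ms() for _ in range(ways))

TRIALS = 100000
for ways in (1, 2, 4):
    avg = sum(mirrored_log_write_ms(ways) for _ in range(TRIALS)) / TRIALS
    print("%d-way mirrored log: avg write ~%.1f ms" % (ways, avg))

Taking the max over more copies drags the average up - though, as noted above, it should still beat mixing the log writes in with the general RAID-5 data.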
Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?
I think the point of dual battery-backed controllers is that data should never be lost. Am I wrong?

That depends upon exactly what effect turning off the ZFS cache-flush mechanism has. If all data is still sent to the controllers as 'normal' disk writes, and they neither resort to, say, storing it in *volatile* RAM when higher levels enable the disk's write-back cache nor blithely pass such cache-enabling requests along to their underlying disks (which of course would subvert any controller-level guarantees, since the disks can evict data from their own write-back caches as soon as the write request completes), then presumably as long as they get the data they guarantee that it will eventually get to the platters, and the ZFS cache-flush mechanism is a no-op.

Of course, if that's true then disabling cache-flush should have no noticeable effect on performance (the controller just answers 'Done' as soon as it receives a cache-flush request, because there's no applicable cache to flush), so you might as well just leave it enabled. Conversely, if you found that disabling it *did* improve performance, then it probably opened up a significant reliability hole.

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?
Bill, you have a long-winded way of saying I don't know. But thanks for elucidating the possibilities. Hmmm - I didn't mean to be *quite* as noncommittal as that suggests: I was trying to say (without intending to offend) FOR GOD'S SAKE, MAN: TURN IT BACK ON!, and explaining why (i.e., that either disabling it made no difference and thus it might as well be enabled, or that if indeed it made a difference that indicated that it was very likely dangerous). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?
That depends upon exactly what effect turning off the ZFS cache-flush mechanism has.

The only difference is that ZFS won't send a SYNCHRONIZE CACHE command at the end of a transaction group (or ZIL write). It doesn't change the actual read or write commands (which are always sent as ordinary writes -- for the ZIL, I suspect that setting the FUA bit on writes rather than flushing the whole cache might provide better performance in some cases, but I'm not sure, since it probably depends on what other I/O might be outstanding.)

It's a bit difficult to imagine a situation where flushing the entire cache unnecessarily just to force the ZIL would be preferable - especially if ZFS makes any attempt to cluster small transaction groups together into larger aggregates (in which case you'd like to let them continue to accumulate until the aggregate is large enough to be worth forcing to disk in a single I/O).

Of course, if that's true then disabling cache-flush should have no noticeable effect on performance (the controller just answers 'Done' as soon as it receives a cache-flush request, because there's no applicable cache to flush), so you might as well just leave it enabled.

The problem with SYNCHRONIZE CACHE is that its semantics weren't defined as precisely as one would want until a fairly recent update. Some controllers interpret it as 'push all data to disk' even if they have battery-backed NVRAM. That seems silly, given that for most other situations they consider data in NVRAM to be equivalent to data on the platter. But silly or not, if that's the way some arrays interpret the command, then it does have performance implications (and the other reply I just wrote would be unduly alarmist in such cases).

Thanks for adding some actual experience with the hardware to what had been a purely theoretical discussion.

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
In order to be reasonably representative of a real-world situation, I'd suggest the following additions:

1) Create a large file (bigger than main memory) on an empty ZFS pool.
1a. The pool should include entire disks, not small partitions (else seeks will be artificially short).
1b. The file needs to be a *lot* bigger than the cache available to it, else caching effects on the reads will be non-negligible.
1c. Unless the file fills up a large percentage of the pool, the rest of the pool needs to be fairly full (else the seeks that updating the file generates will, again, be artificially short ones).

2) Time a sequential scan of the file.

3) Random write I/O over, say, 50% of the file (either with or without matching blocksize).
3a. Unless the file itself fills up a large percentage of the pool, do this while significant other updating activity is also occurring in the pool, so that the local holes in the original file layout created by some of its updates don't get favored for use by subsequent updates to the same file (again, artificially shortening seeks).

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
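A minimal sketch of that test sequence, with the pool-level caveats (1a-1c, 3a) left to the test setup; the path and sizes below are placeholders, and the file must be far larger than the cache available to it for the scan timings to mean anything:

import os, random, time

FILE = "/tank/testfile"   # hypothetical path on the pool under test
SIZE = 8 << 30            # 8 GiB, assumed much larger than cache
BLOCK = 128 << 10         # 128 KiB

def create_file():                 # step 1
    buf = os.urandom(BLOCK)
    with open(FILE, "wb") as f:
        for _ in range(SIZE // BLOCK):
            f.write(buf)

def sequential_scan_mb_s():        # step 2
    t0 = time.time()
    with open(FILE, "rb") as f:
        while f.read(BLOCK):
            pass
    return SIZE / (time.time() - t0) / 2**20

def random_overwrite(fraction=0.5):   # step 3
    nblocks = SIZE // BLOCK
    buf = os.urandom(BLOCK)
    with open(FILE, "r+b") as f:
        for i in random.sample(range(nblocks), int(nblocks * fraction)):
            f.seek(i * BLOCK)
            f.write(buf)
            os.fsync(f.fileno())   # make each update a separate COW relocation

create_file()
print("scan before fragmentation: %.1f MB/s" % sequential_scan_mb_s())
random_overwrite()
print("scan after fragmentation:  %.1f MB/s" % sequential_scan_mb_s())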
Re: [zfs-discuss] ZFS + DB + fragments
... This needs to be proven with a reproducible, real-world workload before it makes sense to try to solve it. After all, if we cannot measure where we are, how can we prove that we've improved?

Ah - Tests & Measurements types: you've just gotta love 'em.

Wife: Darling, is there really supposed to be that much water in the bottom of our boat?
T&M: There's almost always a little water in the bottom of a boat, Love.
Wife: But I think it's getting deeper!
T&M: I suppose you *could* be right: I'll just put this mark where the water is now, and then after a few minutes we can see if it really has gotten deeper and, if so, just how much we really may need to worry about it.
Wife: I think I'll use this bucket to get rid of some of it, just in case.
T&M: No, don't do that: then we won't be able to see how bad the problem is!
Wife: But -
T&M: And try not to rock the boat: it changes the level of the water at the mark that I just made.
Wife: I'm really not a very good swimmer, dear: let's just head for shore.
T&M: That would be silly if there turns out not to be any problem, wouldn't it?
(Wife hits T&M over head with bucket, grabs oars, and starts rowing.)

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] User-visible non-blocking / atomic ops in ZFS
I'm going to combine three posts here because they all involve jcone:

First, as to my message heading: The 'search forum' mechanism can't find his posts under the 'jcone' name (I was curious, because they're interesting/strange, depending on how one looks at them). I've also noticed (once in his case, once in Louwtjie's) that the 'last post' column of one thread may reflect a post made to a different thread.

Second, in response to your 'Indexing other than hash tables' post: The only way you could get a file system like ZFS to perform indexed look-ups for you would be to make each of your 'records' an entire file with the appropriate look-up name, and ReiserFS may be the only current file system that could handle this reasonably well. This is an outgrowth of the Unix mindset that files must only be byte-streams rather than anything more powerful (such as the single- and multi-key indexed files of traditional minicomputer and mainframe systems) - and that's especially unfortunate in ZFS's case, because system-managed COW mechanisms just happen to be a dynamite way to handle B-trees (you could do so at the application level on top of ZFS via use of a sparse file plus a facility to deallocate space in it explicitly, but you'd still need an entire separate level of in-file space-allocation/deallocation mechanism). B-trees are the obvious solution to the kind of partial-key and/or key-range queries that you described.

Finally, in response to your current post (which sounds more as if it had come from a hardware engineer than from a database type): All the facilities that you describe are traditionally handled by transactions of one form or another, and only read-only transactions can normally be non-blocking (because they simply capture a consistent point-in-time database state and operate upon that, ignoring any subsequent changes that may occur during their lifetimes). Other, less-popular but more general non-blocking approaches exist which simply abort upon detecting a conflict rather than attempting to wait for the conflict to evaporate; this tends not to scale very well because (unlike the case with non-blocking low-level hardware synchronization) restarting a transaction when you don't have to can very often result in a *lot* of redundant work being performed. They include some multi-version approaches that implement more general 'time domain addressing' than that just described for read-only transactions, and the rare implementations based upon 'optimistic' concurrency control that let conflicts occur and then decide whether to abort someone when they attempt to commit.

ZFS supports transactions only for its internal use, and cannot feasibly support arbitrarily complex transactions because its atomicity approach depends upon gathering all transaction updates in RAM before writing them back atomically to disk (yes, it could perhaps do so in stages, since the entire new tree structure doesn't become visible until its root has been made persistent, but that could arbitrarily delay other write activity in the system). While I think that supporting user-level transactions is a useful file-system feature and a few file systems such as Transarc's Structured File System have actually done so, ZFS would have to change significantly to do so for anything other than *very* limited user-level transactions - hence I wouldn't hold my breath waiting for such support in ZFS.
- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
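For anyone who hasn't run into 'path copying' before, here's a toy sketch of the COW-tree update pattern referred to above (the general technique, not ZFS's actual structures): updating one leaf copies only the root-to-leaf path, everything else is shared, and a snapshot is nothing more than keeping the old root around.

class Node:
    def __init__(self, children=None, value=None):
        self.children = dict(children or {})
        self.value = value

def update(node, path, value):
    # return a NEW root; the old root still describes the old tree intact
    new = Node(node.children, node.value)
    if not path:
        new.value = value
    else:
        new.children[path[0]] = update(node.children[path[0]], path[1:], value)
    return new

leaf = Node(value="old data")
root_v1 = Node({"a": Node({"b": leaf})})
root_v2 = update(root_v1, ["a", "b"], "new data")

print(root_v1.children["a"].children["b"].value)   # old data (snapshot's view)
print(root_v2.children["a"].children["b"].value)   # new data (live view)

This is also exactly why every ZFS block write turns into a full-path update: the new leaf forces new copies (and new checksums) of every ancestor up to the superblock.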
Re: [zfs-discuss] User-visible non-blocking / atomic ops in ZFS
The B-trees I'm used to divide in arbitrary places across the whole key, so doing partial-key queries is painful.

While the B-trees in DEC's Record Management Services (RMS) allowed multi-segment keys, they treated the entire key as a byte-string as far as prefix searches went (i.e., the segmentation wasn't significant to that, and there's no obvious reason why it should have been in other implementations).

I can't find 'Structured File System Transarc' usefully in Google. Do you have a link handy? If not, never mind.

Well, transarc.com now leads to a porn site, so that's not much help. And Wikipedia's entry for Transarc is regrettably sparse.

Transarc was a Pittsburgh R&D company formed by some *very* bright CMU people. It's probably best known for its 'Encina' distributed transaction environment (SFS was actually part of Encina, but IIRC a separable one), for having developed the distributed file system (DFS) component of the Open Group's Distributed Computing Environment (DCE), and for AFS, the productized (and now open source) version of CMU's distributed Andrew file system; my own acquaintance with Transarc became closer when I was helping develop a distributed transactional object system in the mid-'90s and we were using their book Camelot and Avalon for high-level design inspiration. They were always closely associated with IBM, which absorbed them as a wholly-owned subsidiary in 1994 (and I've heard relatively little about them since).

SFS was one of their lesser-known achievements: a record-oriented transactional file system. I've always felt that system-managed record-oriented files were useful, in part because a lot of the nitty-gritty space management that's required (e.g., to handle the structured pages that tend to be necessary to accommodate data that's allowed to change its size or is required to remain in some key order under insertion/update/deletion activity) duplicates similar space-management required of the system to manage conventional byte-stream files, and in part because any kind of system-wide lock- and deadlock-management facilities tend to want to tie into such data at a higher-than-byte-stream level (e.g., because the locked entities may have to move around) - so SFS was interesting to me. Unfortunately, it's been long enough that I can't remember too many details about it - e.g., it may or may not have supported interlocked access at the record field level - and at least after a quick search I can't find any papers about it that I may have downloaded (that era was before I really recognized how evanescent Web material often may be).

I actually did get a Google hit at position 19 with the search terms you used (after a plethora of hits on 'log structured file system', of course), but it wasn't very enlightening. Nor were several later ones, until hit 42 at the University of Waterloo - a .pdf that contains at least a brief description starting on page 21 (including a thinly-disguised rip-off of a figure in Gray & Reuter's classic Transaction Processing - but it's not quite *identical*...).

Aha - good old reliable IBM *does* still have some SFS documentation online that hit 75 noticed; munging that URL a bit led to http://publib.boulder.ibm.com/infocenter/txformp/v5r1/index.jsp?noscript=1 (expand 'Encina Books' in the left-hand frame and start digging...).

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
... My understanding of ZFS (in short: an upside down tree) is that each block is referenced by its parent. So regardless of how many snapshots you take, each block is only ever referenced by one other, and I'm guessing that the pointer and checksum are both stored there. If that's the case, to move a block it's just a case of:
- read the data
- write to the new location
- update the pointer in the parent block

Which changes the contents of the parent block (the change in the data checksum changed it as well), and thus requires that this parent also be rewritten (using COW), which changes the pointer to it (and of course its checksum as well) in *its* parent block, which thus also must be re-written... and finally a new copy of the superblock is written to reflect the new underlying tree structure - all this in a single batch-written 'transaction'. The old version of each of these blocks need only be *saved* if a snapshot exists and it hasn't previously been updated since that snapshot was created. But all the blocks need to be COWed even if no snapshot exists (in which case the old versions are simply discarded).

... PS. 1. You'd still need an initial defragmentation pass to ensure that the file was reasonably piece-wise contiguous to begin with.

No, not necessarily. If you were using a zpool configured like this I'd hope you were planning on creating the file as a contiguous block in the first place :)

I'm not certain that you could ensure this if other updates in the system were occurring concurrently. Furthermore, the file may be extended dynamically as new data is inserted, and you'd like to have some mechanism that could restore reasonable contiguity to the result (which can be difficult to accomplish in the foreground if, for example, free space doesn't happen to exist on the disk right after the existing portion of the file).

... Any zpool with this option would probably be dedicated to the database file and nothing else. In fact, even with multiple databases I think I'd have a single pool per database.

It's nice if you can afford such dedicated resources, but it seems a bit cavalier to ignore users who just want decent performance from a database that has to share its resources with other activity.

Your prompt response is probably what prevented me from editing my previous post after I re-read it and realized I had overlooked the fact that over-writing the old data complicates things. So I'll just post the revised portion here:

3. Now you must make the above transaction persistent, and then randomly over-write the old data block with the new data (since that data must be in place before you update the path to it below, and unfortunately since its location is not arbitrary you can't combine this update with either the transaction above or the transaction below).

4. You can't just slide in the new version of the block using the old version's existing set of ancestors because a) you just deallocated that path above (introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism). However, this is just the normal situation whenever you update a data block (save for the fact that the block itself was already written above): all the *additional* overhead occurred in the previous steps.
So instead of a single full-path update that fragments the file, you have two full-path updates, a random write, and possibly a random read initially to fetch the old data. And you still need an initial defrag pass to establish initial contiguity. Furthermore, these additional resources are consumed at normal priority rather than at the reduced priority at which a background reorg can operate. On the plus side, though, the file would be kept contiguous all the time rather than just returned to contiguity whenever there was time to do so.

... Taking it a stage further, I wonder if this would work well with the prioritized write feature request (caching writes to a solid state disk)? http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List That could potentially mean there's very little slowdown:
- Read the original block
- Save that to solid state disk
- Write the new block in the original location
- Periodically stream writes from the solid state disk to the main storage

I'm not sure this would confer much benefit if things in fact need to be handled as I described above. In particular, if a snapshot exists you almost certainly must establish the old version in its new location in the snapshot rather than just capture it in the log; if no snapshot exists you could capture the old version in the log and
Re: [zfs-discuss] ZFS + DB + fragments
... With regards to sharing the disk resources with other programs, obviously it's down to the individual admins how they would configure this,

Only if they have an unconstrained budget.

but I would suggest that if you have a database with heavy enough requirements to be suffering noticeable read performance issues due to fragmentation, then that database really should have its own dedicated drives and shouldn't be competing with other programs.

You're not looking at it from a whole-system viewpoint (which if you're accustomed to having your own dedicated storage devices is understandable). Even if your database performance is acceptable, if it's performing 50x as many disk seeks as it would otherwise need to when scanning a table, that's affecting the performance of *other* applications.

Also, I'm not saying defrag is bad (it may be the better solution here), just that if you're looking at performance in this kind of depth, you're probably experienced enough to have created the database in a contiguous chunk in the first place :-)

As I noted, ZFS may not allow you to ensure that, and in any event if the database grows that contiguity may need to be reestablished. You could grow the db in separate files, each of which was preallocated in full (though again ZFS may not allow you to ensure that each is created contiguously on disk), but while databases may include such facilities as a matter of course it would still (all other things being equal) be easier to manage everything if it could just extend a single existing file (or one file per table, if they needed to be kept separate) as it needed additional space.

I do agree that doing these writes now sounds like a lot of work. I'm guessing that needing two full-path updates to achieve this means you're talking about a much greater write penalty.

Not all that much. Each full-path update is still only a single write request to the disk, since all the path blocks (again, possibly excepting the superblock) are batch-written together, thus mostly increasing only streaming bandwidth consumption.

... It may be that ZFS is not a good fit for this kind of use, and that if you're really concerned about this kind of performance you should be looking at other file systems.

I suspect that while it may not be a great fit now, with relatively minor changes it could be at least an acceptable one.

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
Rats - I was right the first time: there's a messy problem with snapshots.

The problem is that the parent of the child that you're about to update in place may *already* be in one or more snapshots because one or more of its *other* children was updated since each snapshot was created. If so, then each snapshot copy of the parent is pointing to the location of the existing copy of the child you now want to update in place, and unless you change the snapshot copy of the parent (as well as the current copy of the parent) the snapshot will point to the *new* copy of the child you are now about to update (with an incorrect checksum to boot). With enough snapshots, enough children, and bad enough luck, you might have to change the parent (and of course all its ancestors...) in every snapshot.

In other words, Nathan's approach is pretty much infeasible in the presence of snapshots. Background defragmentation works as long as you move the entire region (which often has a single common parent) to a new location, which if the source region isn't excessively fragmented may not be all that expensive; it's probably not something you'd want to try at normal priority *during* an update to make Nathan's approach work, though, especially since you'd then wind up moving the entire region on every such update rather than in one batch in the background.

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
But the whole point of snapshots is that they don't take up extra space on the disk. If a file (and hence a block) is in every snapshot it doesn't mean you've got multiple copies of it. You only have one copy of that block, it's just referenced by many snapshots.

I used the wording 'copies of a parent' loosely, to mean previous states of the parent that also contain pointers to the current state of the child about to be updated in place.

The thing is, the location of that block isn't saved separately in every snapshot either - the location is just stored in its parent.

And in every earlier version of the parent that was updated for some *other* reason and still contains a pointer to the current child that someone using that snapshot must be able to follow correctly.

So moving a block is just a case of updating one parent.

No: every version of the parent that points to the current version of the child must be updated.

... If you think about it, that has to work for the old data since as I said before, ZFS already has this functionality. If ZFS detects a bad block, it moves it to a new location on disk. If it can already do that without affecting any of the existing snapshots, there's no reason to think we couldn't use the same code for a different purpose.

Only if it works the way you think it works, rather than, say, by using a look-aside list of moved blocks (there shouldn't be that many of them), or by just leaving the bad block in the snapshot (if it's mirrored or parity-protected, it'll still be usable there unless a second failure occurs; if not, then it was lost anyway).

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
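For concreteness, a sketch of how a 'look-aside list' of the kind hypothesized above could work in principle (nothing here claims to describe what ZFS actually does):

remap = {}   # old physical location -> new physical location

def relocate(old_loc, new_loc):
    # one table entry; no snapshot metadata is touched
    remap[old_loc] = new_loc

def read_block(loc, media):
    # reads consult the exception table before the normal block pointer,
    # so stale pointers in every snapshot keep working
    return media[remap.get(loc, loc)]

media = {100: b"payload"}
media[200] = media.pop(100)   # physically move the block
relocate(100, 200)
print(read_block(100, media))   # b'payload' - old pointers still resolve

The cost is an extra lookup on every read plus a table that must itself be made persistent and kept small - tolerable for the occasional relocated bad block, but a poor fit for wholesale defragmentation.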
Re: [zfs-discuss] ZFS + DB + fragments
... just rearrange your blocks sensibly - and to at least some degree you could do that while they're still cache-resident

Lots of discussion has passed under the bridge since that observation above, but it may have contained the core of a virtually free solution: let your table become fragmented, but each time that a sequential scan is performed on it determine whether the region that you're currently scanning is *sufficiently* fragmented that you should retain the sequential blocks that you've just had to access anyway in cache until you've built up around 1 MB of them and then (in a background thread) flush the result contiguously back to a new location in a single bulk 'update' that changes only their location rather than their contents.

1. You don't incur any extra reads, since you were reading sequentially anyway and already have the relevant blocks in cache. Yes, if you had reorganized earlier in the background the current scan would have gone faster, but if scans occur sufficiently frequently for their performance to be a significant issue then the *previous* scan will probably not have left things *all* that fragmented. This is why you choose a fragmentation threshold to trigger reorg rather than just do it whenever there's any fragmentation at all, since the latter would probably not be cost-effective in some circumstances; conversely, if you only perform sequential scans once in a blue moon, every one may be completely fragmented but it probably wouldn't have been worth defragmenting constantly in the background to avoid this, and the occasional reorg triggered by the rare scan won't constitute enough additional overhead to justify heroic efforts to avoid it. Such a 'threshold' is a crude but possibly adequate metric; a better but more complex one would perhaps nudge up the threshold value every time a sequential scan took place without an intervening update, such that rarely-updated but frequently-scanned files would eventually approach full contiguity, and an even finer-grained metric would maintain such information about each individual *region* in a file, but absent evidence that the single, crude, unchanging threshold (probably set to defragment moderately aggressively - e.g., whenever it takes more than 3 or 5 disk seeks to inhale a 1 MB region) is inadequate these sound a bit like over-kill.

2. You don't defragment data that's never sequentially scanned, avoiding unnecessary system activity and snapshot space consumption.

3. You still incur additional snapshot overhead for data that you do decide to defragment for each block that hadn't already been modified since the most recent snapshot, but performing the local reorg as a batch operation means that only a single copy of all affected ancestor blocks will wind up in the snapshot due to the reorg (rather than potentially multiple copies in multiple snapshots if snapshots were frequent and movement was performed one block at a time).

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
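A minimal sketch of the scan-time trigger described above, using the post's illustrative region size and threshold:

REGION_MB = 1        # examine fragmentation per ~1 MB region
SEEK_THRESHOLD = 4   # reorganize when a region needs more than ~3-5 seeks

def scan_and_queue(extents_per_region):
    # extents_per_region: number of discontiguous extents each region
    # required during the sequential scan we were doing anyway
    reorg_queue = []
    for region_no, extents in enumerate(extents_per_region):
        seeks = extents - 1   # the first extent isn't an extra seek
        if seeks >= SEEK_THRESHOLD:
            # the region's blocks are already in cache from the scan, so a
            # background thread can rewrite them contiguously, no extra reads
            reorg_queue.append(region_no)
    return reorg_queue

print(scan_and_queue([1, 2, 8, 1, 6]))   # -> [2, 4]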
Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and
OTOH, when someone whom I don't know comes across as a pushover, he loses credibility. It may come as a shock to you, but some people couldn't care less about those who assess 'credibility' on the basis of form rather than on the basis of content - which means that you can either lose out on potentially useful knowledge by ignoring them due to their form, or change your own attitude. I'd expect a senior engineer to show not only technical expertise but also the ability to handle difficult situations, *not* adding to the difficulties by his comments. Another surprise for you, I'm afraid: some people just don't meet your expectations in this area. In particular, I only 'show my ability to handle difficult situations' in the manner that you suggest when I have some actual interest in the outcome - otherwise, I simply do what I consider appropriate and let the chips fall where they may. Deal with that in whatever manner you see fit (just as I do). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and
: Big talk from someone who seems so intent on hiding : their credentials. : Say, what? Not that credentials mean much to me since I evaluate people : on their actual merit, but I've not been shy about who I am (when I : responded 'can you guess?' in registering after giving billtodd as my : member name I was being facetious). You're using a web-based interface to a mailing list and the 'billtodd' bit doesn't appear to any users (such as me) subscribed via that mechanism. Then perhaps Sun should make more of a point of this in their Web-based registration procedure. So yes, 'can you guess?' is unhelpful and makes you look as if you're being deliberately unhelpful. Appearances can be deceiving, in large part because they're so subjective. That's why sensible people dig beneath them before forming any significant impressions. ... : If you're still referring to your incompetent alleged research, [...] : [...] right out of the : same orifice from which you've pulled the rest of your crap. It's language like that that is causing the problem. No, it's ignorant loudmouths like cook and al who are causing the problem: I'm simply responding to them as I see fit. IMHO you're being a tad rude. I'm being rude as hell to people who truly earned it, and intend to continue until they shape up or shut up. So if you feel that there's a problem here that you'd like to help fix, I suggest that you try tackling it at its source. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
Regardless of the merit of the rest of your proposal, I think you have put your finger on the core of the problem: aside from some apparent reluctance on the part of some of the ZFS developers to believe that any problem exists here at all (and leaving aside the additional monkey wrench that using RAID-Z here would introduce, because one could argue that files used in this manner are poor candidates for RAID-Z anyway, hence that there's no need to consider reorganizing RAID-Z files), the *only* down-side (other than a small matter of coding) to defragmenting files in the background in ZFS is the impact that would have on run-time performance (which should be minimal if the defragmentation is performed at lower priority) and the impact it would have on the space consumed by a snapshot that existed while the defragmentation was being done.

One way to eliminate the latter would be simply not to reorganize while any snapshot (or clone) existed: no worse than the situation today, and better whenever no snapshot or clone is present. That would change the perceived 'expense' of a snapshot, though, since you'd know you were potentially giving up some run-time performance whenever one existed - and it's easy to imagine installations which might otherwise like to run things such that a snapshot was *always* present.

Another approach would be just to accept any increased snapshot space overhead. So many sequentially-accessed files are just written once and read-only thereafter that a lot of installations might not see any increased snapshot overhead at all. Some files are never accessed sequentially (or are accessed sequentially only in situations where performance is unimportant), and if they could be marked 'Don't reorganize' then they wouldn't contribute any increased snapshot overhead either. One could introduce controls to limit the times when reorganization was done, though my inclination is to suspect that such additional knobs ought to be unnecessary.

One way to eliminate almost completely the overhead of the additional disk accesses consumed by background defragmentation would be to do it as part of the existing background scrubbing activity, but for actively-updated files one might want to defragment more often than one needed to scrub.

In any event, background defragmentation should be a relatively easy feature to introduce and try out if suitable multi-block contiguous allocation mechanisms already exist to support ZFS's existing batch writes. Use of the ZIL to perform opportunistic defragmentation while updated data was still present in the cache might be a bit more complex, but could still be worth investigating.

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] I was going to send you an email
until I remembered that you said that you were speaking for others as well and decided that I'd like to speak to them too. As I said in a different thread, I really do try to respond to people in the manner that they deserve (and believe that in most cases here I have done so): even though I recognize that this may be off-putting it's sometimes the only way to break through bias and complacency, and since I came to zfs-discuss in search of technical interaction rather than a warm, fuzzy feeling of belonging I don't see too much of a down-side (unless I've managed to scare off anyone who might otherwise have contributed some technical insight, which would be unfortunate). But I do apologize if I've managed to offend any less-rabid bystanders (I was beginning to wonder whether there *were* any less-rabid bystanders) in the process, since that was not my intent. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and
You've been trolling from the get-go and continue to do so.

Y'know, cookie, before letting the drool onto your keyboard you really ought to learn to research it. I said a good deal of what I've said recently well over a year ago here (and in fact had forgotten how much detail I went into back then, else I might just have given a pointer to it).

First it's 'I have the magical fix', which wasn't a fix at all.

Just because you can't understand something doesn't mean it isn't feasible, dearie.

You claim to want to better the project,

I'll have to see a reference for that, I'm afraid: while I have some interest in ZFS from a technical standpoint, I've never had any kind of commitment to it.

... You rant and rave about how this is so much like wafl from a technical perspective, but then claim to not work for netapp or even KNOW anyone from netapp. Yet a quick search of the net has you claiming to have worked with netapp hardware for years.

Another reference required, I'm afraid: I've never touched a NetApp box, nor to the best of my knowledge used one (of course, when you're interacting with the Internet, you don't know what hardware may be on the other end).

You must be the ONLY person on this planet to have used a vendor's wares for YEARS, have an intimate technical knowledge of the wares, but not know a SINGLE person who works for the company selling or supporting such wares.

No, cookie: you're just as incompetent a researcher as you are technically.

... ^^did you see that paragraph, it was a list of names of all the people on this list who care what you have to say.

Ah - I see that you responded to this post before responding to the poster who just proved you wrong (yet again).

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and
Ah - no references to back up your drivel, I see. No surprise there, of course - but thanks for playing. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and
Big talk from someone who seems so intent on hiding their credentials. Say, what? Not that credentials mean much to me since I evaluate people on their actual merit, but I've not been shy about who I am (when I responded 'can you guess?' in registering after giving billtodd as my member name I was being facetious). If you're still referring to your incompetent alleged research, I'm still waiting for something I can look at: I do happen to know that another Bill Todd has long been associated with Interbase/Firebird (which is kind of ironic since Jim Starkey is an old friend of mine from my eleven years at DEC, though it's been a few years since we managed to get together), but aside from the possibility that you confused him with me I can only suspect that you pulled your 'discovery' right out of the same orifice from which you've pulled the rest of your crap. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] pls discontinue troll bait was: Yager on ZFS and ZFS
I've been observing two threads on zfs-discuss with the following Subject lines: 'Yager on ZFS' and 'ZFS + DB + fragments', and have reached the rather obvious conclusion that the author 'can you guess?' is a professional spinmeister,

Ah - I see we have another incompetent psychic chiming in - and judging by his drivel below a technical incompetent as well. While I really can't help him with the former area, I can at least try to educate him in the latter.

... Excerpt 1: Is this premium technical BullShit (BS) or what?

Since you asked: no, it's just clearly beyond your grade level, so I'll try to dumb it down enough for you to follow.

- BS 301 'grad level technical BS' --- Still, it does drive up snapshot overhead, and if you start trying to use snapshots to simulate 'continuous data protection' rather than more sparingly the problem becomes more significant (because each snapshot will catch any background defragmentation activity at a different point, such that common parent blocks may appear in more than one snapshot even if no child data has actually been updated). Once you introduce CDP into the process (and it's tempting to, since the file system is in a better position to handle it efficiently than some add-on product), rethinking how one approaches snapshots (and COW in general) starts to make more sense.

Do you by any chance not even know what 'continuous data protection' is? It's considered a fairly desirable item these days and was the basis for several hot start-ups (some since gobbled up by bigger fish that apparently agreed that they were onto something significant), since it allows you to roll back the state of individual files or the system as a whole to *any* historical point you might want to (unlike snapshots, which require that you anticipate points you might want to roll back to and capture them explicitly - or take such frequent snapshots that you'll probably be able to get at least somewhere near any point you might want to, a second-class simulation of CDP which some vendors offer because it's the best they can do and is precisely the activity which I outlined above, expecting that anyone sufficiently familiar with file systems to be able to follow the discussion would be familiar with it). But given your obvious limitations I guess I should spell it out in words of even fewer syllables:

1. Simulating CDP without actually implementing it means taking very frequent snapshots.

2. Taking very frequent snapshots means that you're likely to interrupt background defragmentation activity such that one child of a parent is moved *before* the snapshot is taken while another is moved *after* the snapshot is taken, resulting in the need to capture a before-image of the parent (because at least one of its pointers is about to change) *and all ancestors of the parent* (because the pointer change will propagate through all the ancestral checksums - and pointers, with COW) in every snapshot that occurs immediately prior to moving *any* of its children, rather than just having to capture a single before-image of the parent and all its ancestors after which all its child pointers will likely get changed before the next snapshot is taken.

So that's what any competent reader should have been able to glean from the comments that stymied you.
The paragraph's concluding comments were considerably more general in nature and thus legitimately harder to follow: had you asked for clarification rather than just assumed that they were BS simply because you couldn't understand them you would not have looked like such an idiot, but since you did call them into question I'll now put a bit more flesh on them for those who may be able to follow a discussion at that level of detail: 3. The file system is in a better position to handle CDP than some external mechanism because a) the file system knows (right down to the byte level if it wants to) exactly what any individual update is changing, b) the file system knows which updates are significant (e.g., there's probably no intrinsic need to capture rollback information for lazy writes because the application didn't care whether they were made persistent at that time, but for any explicitly-forced writes or syncs a rollback point should be established), and c) the file system is already performing log forces (where a log is involved) or batch disk updates (a la ZFS) to honor such application-requested persistence, and can piggyback the required CDP before-image persistence on them rather than requiring separate synchronous log or disk accesses to do so. 4. If you've got full-fledged CDP, it's questionable whether you need snapshots as well (unless you have really, really inflexible requirements for virtually instantaneous rollback and/or for high-performance writable-clone access) - and if CDP turns out to be this decade's important new file
Re: [zfs-discuss] ZFS + DB + fragments
... I personally believe that since most people will have hardware LUN's (with underlying RAID) and cache, it will be difficult to notice anything. Given that those hardware LUN's might be busy with their own wizardry ;) You will also have to minimize the effect of the database cache ...

By definition, once you've got the entire database in cache, none of this matters (though filling up the cache itself takes some added time if the table is fragmented). Most real-world databases don't manage to fit entirely - or even mostly - in cache, because people aren't willing to dedicate that much RAM to running them. Instead, they either use a lot less RAM than the database size or share the system with other activity that shares use of the RAM. In other words, they use a cost-effective rather than a money-is-no-object configuration, but then would still like to get the best performance they can from it.

It will be a tough assignment ... maybe someone has already done this? Thinking about this (very abstract) ... does it really matter? [8KB-a][8KB-b][8KB-c] So what if 8KB-b gets updated and moved somewhere else? If the DB gets a request to read 8KB-a, it needs to do an I/O (eliminate all caching). If it gets a request to read 8KB-b, it needs to do an I/O. Does it matter that b is somewhere else ...

Yes, with any competently-designed database.

it still needs to go get it ... only in a very abstract world with read-ahead (whether hardware or db) would 8KB-b be in cache after 8KB-a was read.

1. If there's no other activity on the disk, then the disk's track cache will acquire the following data when the first block is read, because it has nothing better to do. But if all the disks are just sitting around waiting for this table scan to get to them, then if ZFS has a sufficiently intelligent read-ahead mechanism it could help out a lot here as well: the differences become greater when the system is busier.

2. Even a moderately smart disk will detect a sequential access pattern if one exists and may read ahead at least modestly after having detected that pattern even if it *does* have other requests pending.

3. But in any event any competent database will explicitly issue prefetches when it knows (and it *does* know) that it is scanning a table sequentially - and will also have taken pains to try to ensure that the table data is laid out such that it can be scanned efficiently. If it's using disks that support tagged command queuing it may just issue a bunch of single-database-block requests at once, and the disk will organize them such that they can all be satisfied by a single streaming access; with disks that don't support queuing, the database can elect to issue a single large I/O request covering many database blocks, accomplishing the same thing as long as the table is in fact laid out contiguously on the medium (the database knows this if it's handling the layout directly, but when it's using a file system as an intermediary it usually can only hope that the file system has minimized file fragmentation).

Hmmm... the only way is to get some data :) *hehe*

Data is good, as long as you successfully analyze what it actually means: it either tends to confirm one's understanding or to refine it.

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
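As a sketch of what 'explicitly issue prefetches' can look like from the application side - here using the POSIX fadvise hint, which Python exposes as os.posix_fadvise on platforms that support it (a real database engine would more likely use async I/O or large multi-block reads to the same effect):

import os

DB_BLOCK = 16 << 10    # 16 KB database block, per this thread
WINDOW = 64            # hint ~1 MB ahead of the scan position

def table_scan(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            # tell the OS we'll want the next stretch soon, so the device
            # can service those blocks as one streaming access
            os.posix_fadvise(fd, offset, DB_BLOCK * WINDOW,
                             os.POSIX_FADV_WILLNEED)
            for _ in range(WINDOW):
                block = os.pread(fd, DB_BLOCK, offset)
                if not block:
                    return
                offset += DB_BLOCK
    finally:
        os.close(fd)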
Re: [zfs-discuss] Yager on ZFS
can you guess? billtodd at metrocast.net writes: You really ought to read a post before responding to it: the CERN study did encounter bad RAM (and my post mentioned that) - but ZFS usually can't do a damn thing about bad RAM, because errors tend to arise either before ZFS ever gets the data or after it has already returned and checked it (and in both cases, ZFS will think that everything's just fine).

According to the memtest86 author, corruption most often occurs at the moment memory cells are written to, by causing bitflips in adjacent cells. So when a disk DMAs data to RAM, and corruption occurs when the DMA operation writes to the memory cells, and ZFS then verifies the checksum, it will detect the corruption. Therefore ZFS is perfectly capable of detecting (and even likely to detect) memory corruption during simple read operations from a ZFS pool. Of course there are other cases where neither ZFS nor any other checksumming filesystem is capable of detecting anything (e.g. the sequence of events: data is corrupted, checksummed, written to disk).

Indeed - the latter was the first of the two scenarios that I sketched out. But at least on the read end of things ZFS should have a good chance of catching errors due to marginal RAM. That must mean that most of the worrisome alpha-particle problems of yore have finally been put to rest (since they'd be similarly likely to trash data on the read side after ZFS had verified it). I think I remember reading that somewhere at some point, but I'd never gotten around to reading that far in the admirably-detailed documentation that accompanies memtest: thanks for enlightening me.

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
can you guess? wrote: For very read intensive and position sensitive applications, I guess this sort of capability might make a difference?

No question about it. And sequential table scans in databases are among the most significant examples, because (unlike things like streaming video files which just get laid down initially and non-synchronously in a manner that at least potentially allows ZFS to accumulate them in large, contiguous chunks - though ISTR some discussion about just how well ZFS managed this when it was accommodating multiple such write streams in parallel) the tables are also subject to fine-grained, often-random update activity. Background defragmentation can help, though it generates a boatload of additional space overhead in any applicable snapshot.

The reason that this is hard to characterize is that there are really two very different configurations used to address different performance requirements: cheap and fast. It seems that when most people first consider this problem, they do so from the cheap perspective: single disk view. Anyone who strives for database performance will choose the fast perspective: stripes. And anyone who *really* understands the situation will do both. Note: data redundancy isn't really an issue for this analysis, but consider it done in real life. When you have a striped storage device under a file system, then the database or file system's view of contiguous data is not contiguous on the media.

The best solution is to make the data piece-wise contiguous on the media at the appropriate granularity - which is largely determined by disk access characteristics (the following assumes that the database table is large enough to be spread across a lot of disks at moderately coarse granularity, since otherwise it's often small enough to cache in the generous amounts of RAM that are inexpensively available today).

A single chunk on an (S)ATA disk today (the analysis is similar for high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to yield over 80% of the disk's maximum possible (fully-contiguous layout) sequential streaming performance (after the overhead of an 'average' - 1/3 stroke - initial seek and partial rotation are figured in: the latter could be avoided by using a chunk size that's an integral multiple of the track size, but on today's zoned disks that's a bit awkward). A 1 MB chunk yields around 50% of the maximum streaming performance. ZFS's maximum 128 KB 'chunk size', if effectively used as the disk chunk size as you seem to be suggesting, yields only about 15% of the disk's maximum streaming performance (leaving aside an additional degradation to a small fraction of even that should you use RAID-Z). And if you match the ZFS block size to a 16 KB database block size and use that as the effective unit of distribution across the set of disks, you'll obtain a mighty 2% of the potential streaming performance (again, we'll be charitable and ignore the further degradation if RAID-Z is used).

Now, if your system is doing nothing else but sequentially scanning this one database table, this may not be so bad: you get truly awful disk utilization (2% of its potential in the last case, ignoring RAID-Z), but you can still read ahead through the entire disk set and obtain decent sequential scanning performance by reading from all the disks in parallel.
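The percentages above fall out of a simple back-of-envelope model - assumed round numbers for a 7200 RPM (S)ATA drive of the period, not measurements:

SEEK_MS = 8.0                     # ~1/3-stroke average seek
HALF_ROT_MS = 60000 / 7200 / 2    # ~4.2 ms average rotational latency
STREAM_MB_S = 60.0                # assumed media transfer rate

def streaming_efficiency(chunk_kb):
    transfer_ms = chunk_kb / 1024 / STREAM_MB_S * 1000
    return transfer_ms / (SEEK_MS + HALF_ROT_MS + transfer_ms)

for kb in (16, 128, 1024, 4096):
    print("%5d KB chunk: %4.0f%% of streaming bandwidth"
          % (kb, 100 * streaming_efficiency(kb)))

which yields roughly 2% at 16 KB, 15% at 128 KB, 58% at 1 MB, and 85% at 4 MB - in line with the figures quoted above.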
But if your database table scan is only one small part of a workload which is (perhaps the worst case) performing many other such scans in parallel, your overall system throughput will be only around 4% of what it could be had you used 1 MB chunks (and the individual scan performances will also suck commensurately, of course). Using 1 MB chunks still spreads out your database admirably for parallel random-access throughput: even if the table is only 1 GB in size (eminently cachable in RAM, should that be preferable), that'll spread it out across 1,000 disks (2,000, if you mirror it and load-balance to spread out the accesses), and for much smaller database tables if they're accessed sufficiently heavily for throughput to be an issue they'll be wholly cache-resident. Or another way to look at it is in terms of how many disks you have in your system: if it's less than the number of MB in your table size, then the table will be spread across all of them regardless of what chunk size is used, so you might as well use one that's large enough to give you decent sequential scanning performance (and if your table is too small to spread across all the disks, then it may well all wind up in cache anyway). ZFS's problem (well, the one specific to this issue, anyway) is that it tries to use its 'block size' to cover two different needs: performance for moderately fine-grained updates (though its need to propagate those updates upward to the root of the applicable tree
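A back-of-envelope model reproduces the chunk-size percentages quoted above; the parameters are assumptions (roughly 70 MB/s sustained media rate, an 8 ms 1/3-stroke seek, half a rotation at 7,200 rpm), not measurements:

SEEK_MS = 8.0                      # assumed average (1/3-stroke) seek
ROT_MS = 0.5 * 60000.0 / 7200.0    # half a rotation at 7,200 rpm (~4.17 ms)
MEDIA_MBPS = 70.0                  # assumed sustained media transfer rate

def streaming_efficiency(chunk_mb):
    """Fraction of the pure media rate delivered when every chunk
    pays one average seek plus half a rotation before transferring."""
    transfer_ms = chunk_mb / MEDIA_MBPS * 1000.0
    return transfer_ms / (SEEK_MS + ROT_MS + transfer_ms)

for chunk_mb in (4.0, 1.0, 0.125, 0.015625):   # 4 MB, 1 MB, 128 KB, 16 KB
    print("%7.0f KB chunk: %4.1f%% of streaming rate"
          % (chunk_mb * 1024, streaming_efficiency(chunk_mb) * 100))

With these assumed parameters the model yields roughly 82%, 54%, 13%, and 2% - within rounding distance of the 80% / 50% / 15% / 2% figures above.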
Re: [zfs-discuss] Yager on ZFS
... Well, ZFS allows you to put its ZIL on a separate device which could be NVRAM. And that's a GOOD thing (especially because it's optional rather than requiring that special hardware be present). But if I understand the ZIL correctly, it's not as effective as using NVRAM as a more general kind of log for a wider range of data sizes and types, as WAFL does. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for consumers WAS:Yager on ZFS
... At home the biggest reason I went with ZFS for my data is ease of management. I split my data up based on what it is ... media (photos, movies, etc.), vendor stuff (software, datasheets, etc.), home directories, and other misc. data. This gives me a good way to control backups based on the data type. It's not immediately clear why simply segregating the different data types into different directory sub-trees wouldn't allow you to do pretty much the same thing. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
Richard Elling wrote: ... there are really two very different configurations used to address different performance requirements: cheap and fast. It seems that when most people first consider this problem, they do so from the cheap perspective: single disk view. Anyone who strives for database performance will choose the fast perspective: stripes. And anyone who *really* understands the situation will do both. I'm not sure I follow. Many people who do high performance databases use hardware RAID arrays which often do not expose single disks. They don't have to expose single disks: they just have to use reasonable chunk sizes on each disk, as I explained later. Only very early (or very low-end) RAID used very small per-disk chunks (up to 64 KB max). Before the mid-'90s chunk sizes had grown to 128 - 256 KB per disk on mid-range arrays in order to improve disk utilization in the array. From talking with one of its architects years ago my impression is that HP's (now somewhat aging) EVA series uses 1 MB as its chunk size (the same size I used as an example, though today one could argue for as much as 4 MB and soon perhaps even more). The array chunk size is not the unit of update, just the unit of distribution across the array: RAID-5 will happily update a single 4 KB file block within a given array chunk and the associated 4 KB of parity within the parity chunk. But the larger chunk size does allow files to retain the option of using logical contiguity to attain better streaming sequential performance, rather than splintering that logical contiguity at fine grain across multiple disks. ... A single chunk on an (S)ATA disk today (the analysis is similar for high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to yield over 80% of the disk's maximum possible (fully-contiguous layout) sequential streaming performance (after the overhead of an 'average' - 1/3 stroke - initial seek and partial rotation are figured in: the latter could be avoided by using a chunk size that's an integral multiple of the track size, but on today's zoned disks that's a bit awkward). A 1 MB chunk yields around 50% of the maximum streaming performance. ZFS's maximum 128 KB 'chunk size', if effectively used as the disk chunk size as you seem to be suggesting, yields only about 15% of the disk's maximum streaming performance (leaving aside an additional degradation to a small fraction of even that should you use RAID-Z). And if you match the ZFS block size to a 16 KB database block size and use that as the effective unit of distribution across the set of disks, you'll obtain a mighty 2% of the potential streaming performance (again, we'll be charitable and ignore the further degradation if RAID-Z is used). You do not seem to be considering the track cache, which for modern disks is 16-32 MBytes. If those disks are in a RAID array, then there are often larger read caches as well. Are you talking about hardware RAID in that last comment? I thought ZFS was supposed to eliminate the need for that. Expecting a seek and read for each iop is a bad assumption. The bad assumption is that the disks are otherwise idle and therefore have the luxury of filling up their track caches - especially when I explicitly assumed otherwise in the following paragraph in that post. 
If the system is heavily loaded the disks will usually have other requests queued up (even if the next request comes in immediately rather than being queued at the disk itself, an even half-smart disk will abort any current read-ahead activity so that it can satisfy the new request). Not that it would necessarily do much good for the case currently under discussion even if the disks weren't otherwise busy and they did fill up the track caches: ZFS's COW policies tend to encourage data that's updated randomly at fine grain (as a database table often is) to be splattered across the storage rather than neatly arranged such that the next data requested from a given disk will just happen to reside right after the previous data requested from that disk. Now, if your system is doing nothing else but sequentially scanning this one database table, this may not be so bad: you get truly awful disk utilization (2% of its potential in the last case, ignoring RAID-Z), but you can still read ahead through the entire disk set and obtain decent sequential scanning performance by reading from all the disks in parallel. But if your database table scan is only one small part of a workload which is (perhaps the worst case) performing many other such scans in parallel, your overall system throughput will be only around 4% of what it could be had you used 1 MB chunks (and the individual scan performances will also suck commensurately, of course). ... Real data would be greatly appreciated. In my tests, I see reasonable media bandwidth speeds
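As an illustration of the earlier point that a large per-disk chunk is only the unit of distribution, not the unit of update, here is a toy sketch of the standard RAID-5 small-write path (hypothetical 4 KB blocks on a 3+1 group):

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(old_data, old_parity, new_data):
    """Read-modify-write: new parity = old parity XOR old data XOR new
    data. Only the one 4 KB data block and the matching 4 KB of parity
    are rewritten, no matter how large the per-disk chunk is."""
    return new_data, xor(old_parity, xor(old_data, new_data))

# Three data blocks and their parity, as on a 3+1 array.
d = [bytes([i]) * 4096 for i in (1, 2, 3)]
p = xor(xor(d[0], d[1]), d[2])

new_d1, new_p = raid5_small_write(d[1], p, bytes([9]) * 4096)
assert new_p == xor(xor(d[0], new_d1), d[2])   # parity remains consistent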
Re: [zfs-discuss] Yager on ZFS
Adam Leventhal wrote: On Thu, Nov 08, 2007 at 07:28:47PM -0800, can you guess? wrote: How so? In my opinion, it seems like a cure for the brain damage of RAID-5. Nope. A decent RAID-5 hardware implementation has no 'write hole' to worry about, and one can make a software implementation similarly robust with some effort (e.g., by using a transaction log to protect the data-plus-parity double-update or by using COW mechanisms like ZFS's in a more intelligent manner). Can you reference a software RAID implementation which implements a solution to the write hole and performs well? No, but I described how to use a transaction log to do so and later on in the post how ZFS could implement a different solution more consistent with its current behavior. In the case of the transaction log, the key is to use the log not only to protect the RAID update but to protect the associated higher-level file operation as well, such that a single log force satisfies both (otherwise, logging the RAID update separately would indeed slow things down - unless you had NVRAM to use for it, in which case you've effectively just reimplemented a low-end RAID controller - which is probably why no one has implemented that kind of solution in a stand-alone software RAID product). ... The part of RAID-Z that's brain-damaged is its concurrent-small-to-medium-sized-access performance (at least up to request sizes equal to the largest block size that ZFS supports, and arguably somewhat beyond that): while conventional RAID-5 can satisfy N+1 small-to-medium read accesses or (N+1)/2 small-to-medium write accesses in parallel (though the latter also take an extra rev to complete), RAID-Z can satisfy only one small-to-medium access request at a time (well, plus a smidge for read accesses if it doesn't verify the parity) - effectively providing RAID-3-style performance. Brain damage seems a bit of an alarmist label. I consider 'brain damage' to be if anything a charitable characterization. While you're certainly right that for a given block we do need to access all disks in the given stripe, it seems like a rather quaint argument: aren't most environments that matter trying to avoid waiting for the disk at all? Everyone tries to avoid waiting for the disk at all. Remarkably few succeed very well. Intelligent prefetch and large caches -- I'd argue -- are far more important for performance these days. Intelligent prefetch doesn't do squat if your problem is disk throughput (which in server environments it frequently is). And all caching does (if you're lucky and your workload benefits much at all from caching) is improve your system throughput at the point where you hit the disk throughput wall. Improving your disk utilization, by contrast, pushes back that wall. And as I just observed in another thread, not by 20% or 50% but potentially by around two decimal orders of magnitude if you compare the sequential scan performance of multiple randomly-updated database tables between a moderately coarsely-chunked conventional RAID and a fine-grained ZFS block size (e.g., the 16 KB used by the example database) with each block sprayed across several disks. Sure, that's a worst-case scenario. But two orders of magnitude is a hell of a lot, even if it doesn't happen often - and suggests that in more typical cases you're still likely leaving a considerable amount of performance on the table even if that amount is a lot less than a factor of 100. 
The easiest way to fix ZFS's deficiency in this area would probably be to map each group of N blocks in a file as a stripe with its own parity - which would have the added benefit of removing any need to handle parity groups at the disk level (this would, incidentally, not be a bad idea to use for mirroring as well, if my impression is correct that there's a remnant of LVM-style internal management there). While this wouldn't allow use of parity RAID for very small files, in most installations they really don't occupy much space compared to that used by large files so this should not constitute a significant drawback. I don't really think this would be feasible given how ZFS is stratified today, but go ahead and prove me wrong: here are the instructions for bringing over a copy of the source code: http://www.opensolaris.org/os/community/tools/scm Now you want me not only to design the fix but code it for you? I'm afraid that you vastly overestimate my commitment to ZFS: while I'm somewhat interested in discussing it and happy to provide what insights I can, I really don't personally care whether it succeeds or fails. But I sort of assumed that you might. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
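To put a rough number on the concurrency argument above, here is a toy model (the ~12.4 ms seek-plus-half-rotation service time per small request is an assumption, and conventional RAID-5 is assumed not to read parity on reads):

SERVICE_MS = 12.4   # assumed seek + half-rotation per small random read

def small_read_throughput(disks, raidz=False):
    """Aggregate small random reads per second. Conventional RAID-5
    serves independent requests on each disk in parallel; RAID-Z reads
    every disk in the group for every block, so the whole group
    delivers roughly one disk's worth of requests."""
    per_disk = 1000.0 / SERVICE_MS
    return per_disk * (1 if raidz else disks)

print("8+1 RAID-5: %4.0f small reads/s" % small_read_throughput(9))
print("8+1 RAID-Z: %4.0f small reads/s" % small_read_throughput(9, raidz=True))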
Re: [zfs-discuss] ZFS + DB + fragments
... For modern disks, media bandwidths are now getting to be 100 MBytes/s. If you need 500 MBytes/s of sequential read, you'll never get it from one disk. And no one here even came remotely close to suggesting that you should try to. You can get it from multiple disks, so the questions are: 1. How to avoid other bottlenecks, such as a shared fibre channel path? Diversity. 2. How to predict the data layout such that you can guarantee a wide spread? You've missed at least one more significant question: 3. How to lay out the data such that this 500 MB/s drain doesn't cripple *other* concurrent activity going on in the system (that's what increasing the amount laid down on each drive to around 1 MB accomplishes - otherwise, you can easily wind up using all the system's disk resources to satisfy that one application, or even fall short if you have fewer than 50 disks available, since if you spread the data out relatively randomly in 128 KB chunks on a system with disks reasonably well-filled with data you'll only be obtaining around 10 MB/s from each disk, whereas with 1 MB chunks similarly spread about each disk can contribute more like 35 MB/s and you'll need only 14 - 15 disks to meet your requirement). Use smaller ZFS block sizes and/or RAID-Z and things get rapidly worse. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
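Plugging the per-disk figures above back into the 500 MB/s requirement (the 10 MB/s and 35 MB/s per-disk yields for busy disks are the estimates quoted in the post, not measurements):

import math

TARGET_MBPS = 500.0
PER_DISK_MBPS = {"128 KB chunks": 10.0, "1 MB chunks": 35.0}

for layout, mbps in PER_DISK_MBPS.items():
    print("%s: about %d disks to sustain %.0f MB/s"
          % (layout, math.ceil(TARGET_MBPS / mbps), TARGET_MBPS))
# 128 KB chunks: about 50 disks; 1 MB chunks: about 15 disks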
Re: [zfs-discuss] ZFS + DB + fragments
This question triggered some silly questions in my mind: Actually, they're not silly at all. Lots of folks are determined that the whole COW-to-different-locations approach is a Bad Thing(tm), and in some cases, I guess it might actually be... What if ZFS had a pool / filesystem property that caused zfs to do a journaled, but non-COW update so the data's relative location for databases is always the same? That's just what a conventional file system (no need even for a journal, when you're updating in place) does when it's not guaranteeing write atomicity (you address the latter below). Or - What if it did a double update: One to a staged area, and another immediately after that to the 'old' data blocks. Still always have on-disk consistency etc, at a cost of double the I/O's... It only requires an extra disk access if the new data is too large to dump right into the journal itself (which guarantees that the subsequent in-place update can complete). Whether the new data is dumped into the log or into a temporary location the pointer to which is logged, the subsequent in-place update can be deferred until it's convenient (e.g., until after any additional updates to the same data have also been accumulated, activity has cooled off, and the modified blocks are getting ready to be evicted from the system cache - and, optionally, until the target disks are idle or have their heads positioned conveniently near the target location). ZFS's small-synchronous-write log can do something similar as long as the writes aren't too large to place in it. However, data that's only persisted in the journal isn't accessible via the normal snapshot mechanisms (well, if an entire file block was dumped into the journal I guess it could be, at the cost of some additional complexity in journal space reuse), so I'm guessing that ZFS writes back any dirty data that's in the small-update journal whenever a snapshot is created. And if you start actually updating in place as described above, then you can't use ZFS-style snapshotting at all: instead of capturing the current state as the snapshot with the knowledge that any subsequent updates will not disturb it, you have to capture the old state that you're about to over-write and stuff it somewhere else - and then figure out how to maintain appropriate access to it while the rest of the system moves on. Snapshots make life a lot more complex for file systems than it used to be, and COW techniques make snapshotting easy at the expense of normal run-time performance - not just because they make update-in-place infeasible for preserving on-disk contiguity but because of the significant increase in disk bandwidth (and snapshot storage space) required to write back changes all the way up to whatever root structure is applicable: I suspect that ZFS does this on every synchronous update save for those that it can leave temporarily in its small-update journal, and it *has* to do it whenever a snapshot is created. Of course, both of these would require non-sparse file creation for the DB etc, but would it be plausible? Update-in-place files can still be sparse: it's only data that already exists that must be present (and updated in place to preserve sequential access performance to it). For very read intensive and position sensitive applications, I guess this sort of capability might make a difference? No question about it. 
And sequential table scans in databases are among the most significant examples, because (unlike things like streaming video files which just get laid down initially and non-synchronously in a manner that at least potentially allows ZFS to accumulate them in large, contiguous chunks - though ISTR some discussion about just how well ZFS managed this when it was accommodating multiple such write streams in parallel) the tables are also subject to fine-grained, often-random update activity. Background defragmentation can help, though it generates a boatload of additional space overhead in any applicable snapshot. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
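Here is a minimal sketch of the journal-then-update-in-place scheme discussed above (a toy format, not ZFS's ZIL or any real database log; it assumes the target file already exists non-sparse, as the original poster suggested). The synchronous cost is one fsync'd log append; the in-place write is deferred and can coalesce repeated updates to the same blocks:

import os, struct

class JournaledFile:
    """Toy write-ahead journal: an update is durable once its log
    record is fsync'd; the in-place write happens later, preserving
    the file's on-disk layout for sequential readers."""

    def __init__(self, path):
        self.path, self.pending = path, []
        self.log = open(path + ".journal", "ab")

    def update(self, offset, data):
        rec = struct.pack(">QI", offset, len(data)) + data
        self.log.write(rec)
        self.log.flush()
        os.fsync(self.log.fileno())           # durable from here on
        self.pending.append((offset, data))   # apply in place later

    def checkpoint(self):
        with open(self.path, "r+b") as f:
            for offset, data in self.pending:   # deferred, coalescible
                f.seek(offset)
                f.write(data)
            f.flush()
            os.fsync(f.fileno())
        self.log.truncate(0)                    # journal space reclaimed
        self.pending.clear()

# jf = JournaledFile("table.db")     # table.db must already exist
# jf.update(4096, b"\x00" * 4096)    # durable as soon as this returns
# jf.checkpoint()                    # later: apply in place, empty the log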
Re: [zfs-discuss] Yager on ZFS
some businesses do not accept any kind of risk Businesses *always* accept risk: they just try to minimize it within the constraints of being cost-effective. Which is a good thing for ZFS, because it can't eliminate risk either, just help to minimize it cost-effectively. However, the subject here is not business use but 'consumer' use. ... at the moment only ZFS can give this assurance, plus the ability to self correct detected errors. You clearly aren't very familiar with WAFL (which can do the same). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
... And how about FAULTS? hw/firmware/cable/controller/ram/... If you had read either the CERN study or what I already said about it, you would have realized that it included the effects of such faults. ...and ZFS is the only prophylactic available. You don't *need* a prophylactic if you're not having sex: the CERN study found *no* clear instances of faults that would occur in consumer systems and that could be attributed to the kinds of errors that ZFS can catch and more conventional file systems can't. It found faults in the interaction of its add-on RAID controller (not a normal 'consumer' component) with its WD disks, it found single-bit errors that appeared to correlate with ECC RAM errors (i.e., likely occurred in RAM rather than at any point where ZFS would be involved), it found block-sized errors that appeared to correlate with misplaced virtual memory allocation (again, outside ZFS's sphere of influence). ... but I had a box that was randomly corrupting blocks during DMA. The errors showed up when doing a ZFS scrub and I caught the problem in time. Yup - that's exactly the kind of error that ZFS and WAFL do a perhaps uniquely good job of catching. WAFL can't catch all: It's distantly isolated from the CPU end. WAFL will catch everything that ZFS catches, including the kind of DMA error described above: it contains validating information outside the data blocks just as ZFS does. Explain how it can do that, when it is isolated from the application by several layers including the network? Darrell covered one aspect of this (i.e., that ZFS couldn't either if it were being used in a server), but there's another as well: as long as the NFS messages between client RAM and server RAM are checksummed in RAM on both ends, then that extends the checking all the way to client RAM (the same place where local ZFS checks end) save for any problems occurring *in* RAM at one end or the other (and ZFS can't deal with in-RAM problems either: all it can do is protect the data until it gets to RAM). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
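The end-to-end claim above can be sketched as follows (hypothetical framing, not the actual NFS/RPC wire format; real deployments get analogous coverage from RPC or transport checksums computed in RAM at both ends):

import hashlib

def frame(payload):
    """Client side: digest computed while the data sits in client RAM."""
    return hashlib.sha256(payload).digest() + payload

def unframe(message):
    """Server side: verify in server RAM before anything hits disk, so
    coverage spans client RAM, the wire, and server RAM end to end."""
    digest, payload = message[:32], message[32:]
    if hashlib.sha256(payload).digest() != digest:
        raise IOError("payload corrupted in flight")
    return payload

assert unframe(frame(b"write this block")) == b"write this block"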
Re: [zfs-discuss] Yager on ZFS
can you guess? wrote: at the moment only ZFS can give this assurance, plus the ability to self correct detected errors. You clearly aren't very familiar with WAFL (which can do the same). ... so far as I can tell it's quite irrelevant to me at home; I can't afford it. Neither can I - but the poster above was (however irrelevantly) talking about ZFS's supposedly unique features for *businesses*, so I answered in that context. (By the way, something has gone West with my email and I'm temporarily unable to send the response I wrote to your message night before last. If you meant to copy it here as well, just do so and I'll respond to it here.) - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
Nathan Kroenert wrote: ... What if it did a double update: One to a staged area, and another immediately after that to the 'old' data blocks. Still always have on-disk consistency etc, at a cost of double the I/O's... This is a non-starter. Two I/Os is worse than one. Well, that attitude may be supportable for a write-only workload, but then so is the position that you really don't even need *one* I/O (since no one will ever need to read the data and you might as well just drop it on the floor). In the real world, data (especially database data) does usually get read after being written, and the entire reason the original poster raised the question was because sometimes it's well worth taking on some additional write overhead to reduce read overhead. In such a situation, if you need to protect the database from partial-block updates as well as to keep it reasonably laid out for sequential table access, then performing the two writes described is about as good a solution as one can get (especially if the first of them can be logged - even better, logged in NVRAM - such that its overhead can be amortized across multiple such updates by otherwise independent processes, and even more especially if, as is often the case, the same data gets updated multiple times in sufficiently close succession that instead of 2N writes you wind up only needing to perform N+1 writes, the last being the only one that updates the data in place after the activity has cooled down). Of course, both of these would require non-sparse file creation for the DB etc, but would it be plausible? For very read intensive and position sensitive applications, I guess this sort of capability might make a difference? We are all anxiously awaiting data... Then you might find it instructive to learn more about the evolution of file systems on Unix: In The Beginning there was the block, and the block was small, and it was isolated from its brethren, and darkness was upon the face of the deep because any kind of sequential performance well and truly sucked. Then (after an inexcusably lengthy period of such abject suckage lasting into the '80s) there came into the world FFS, and while there was still only the block the block was at least a bit larger, and it was at least somewhat less isolated from its brethren, and once in a while it actually lived right next to them, and while sequential performance still usually sucked at least it sucked somewhat less. And then the disciples Kleiman and McVoy looked upon FFS and decided that mere proximity was still insufficient, and they arranged that blocks should (at least when convenient) be aggregated into small groups (56 KB actually not being all that small at the time, given the disk characteristics back then), and the Great Sucking Sound of Unix sequential-access performance was finally reduced to something at least somewhat quieter than a dull roar. But other disciples had (finally) taken a look at commercial file systems that had been out in the real world for decades and that had had sequential performance down pretty well pat for nearly that long. 
And so it came to pass that corporations like Veritas (VxFS), and SGI (EFS, XFS), and IBM (JFS) imported the concept of extents into the Unix pantheon, and the Gods of Throughput looked upon it, and it was good, and (at least in those systems) Unix sequential performance no longer sucked at all, and even non-corporate developers whose faith was strong nearly to the point of being blind could not help but see the virtues revealed there, and began incorporating extents into their own work, yea, even unto ext4. And the disciple Hitz (for it was he, with a few others) took a somewhat different tack, and came up with a 'write anywhere file layout' but had the foresight to recognize that it needed some mechanism to address sequential performance (not to mention parity-RAID performance). So he abandoned general-purpose approaches in favor of the Appliance, and gave it most uncommodity-like but yet virtuous NVRAM to allow many consecutive updates to be aggregated into not only stripes but adjacent stripes before being dumped to disk, and the Gods of Throughput smiled upon his efforts, and they became known throughout the land. Now comes back Sun with ZFS, apparently ignorant of the last decade-plus of Unix file system development (let alone development in other systems dating back to the '60s). Blocks, while larger (though not necessarily proportionally larger, due to dramatic increases in disk bandwidth), are once again often isolated from their brethren. True, this makes the COW approach a lot easier to implement, but (leaving aside the debate about whether COW as implemented in ZFS is a good idea at all) there is *no question whatsoever* that it returns a significant degree of suckage to sequential performance - especially for data subject to small, random
Re: [zfs-discuss] Yager on ZFS
On 14-Nov-07, at 7:06 AM, can you guess? wrote: ... And how about FAULTS? hw/firmware/cable/controller/ram/... If you had read either the CERN study or what I already said about it, you would have realized that it included the effects of such faults. ...and ZFS is the only prophylactic available. You don't *need* a prophylactic if you're not having sex: the CERN study found *no* clear instances of faults that would occur in consumer systems and that could be attributed to the kinds of errors that ZFS can catch and more conventional file systems can't. Hmm, that's odd, because I've certainly had such faults myself. (Bad RAM is a very common one, You really ought to read a post before responding to it: the CERN study did encounter bad RAM (and my post mentioned that) - but ZFS usually can't do a damn thing about bad RAM, because errors tend to arise either before ZFS ever gets the data or after it has already returned and checked it (and in both cases, ZFS will think that everything's just fine). that nobody even thinks to check.) Speak for yourself: I've run memtest86+ on all our home systems, and I run it again whenever encountering any problem that might be RAM-related. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
... Well, single-bit error rates may be rare in hard drives in normal operation, but from a systems perspective, data can be corrupted anywhere between disk and CPU. The CERN study found that such errors (if they found any at all, which they couldn't really be sure of) were far less common than ... I will note from multiple personal experiences these issues _do_ happen with netapp and emc (symm and clariion) And Robert already noted that they've occurred in his mid-range arrays. In both cases, however, you're talking about decidedly non-consumer hardware, and had you looked more carefully at the material to which you were responding you would have found that its comments were in the context of experiences with consumer hardware (and in particular what *quantitative* level of additional protection ZFS's 'special sauce' can be considered to add to its reliability). Errors introduced by mid-range and high-end arrays don't enter into that discussion (though they're interesting for other reasons). - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Response to phantom dd-b post
In the previous and current responses, you seem quite determined of others misconceptions. I'm afraid that your sentence above cannot be parsed grammatically. If you meant that I *have* determined that some people here are suffering from various misconceptions, that's correct. Given that fact and the first paragraph of your response below, I think you can figure out why nobody on this list will reply to you again. Predicting the future (especially the actions of others) is usually a feat reserved for psychics: are you claiming to be one (perhaps like the poster who found it 'clear' that I was a paid NetApp troll - one of the aforementioned misconceptions)? Oh, well - what can one expect from someone who not only top-posts but completely fails to trim quotations? I see that you appear to be posting from a .edu domain, so perhaps next year you will at least mature to the point of becoming sophomoric. Whether people here find it sufficiently uncomfortable to have their beliefs (I'm almost tempted to say 'faith', in some cases) challenged that they'll indeed just shut up I really wouldn't presume to guess. As for my own attitude, if you actually examine my responses rather than just go with your gut (which doesn't seem to be a very reliable guide in your case) you'll find that I tend to treat people pretty much as they deserve. If they don't pay attention to what they're purportedly responding to or misrepresent what I've said, I do chide them a bit (since I invariably *do* pay attention to what *they* say and make sincere efforts to respond to exactly that), and if they're confrontational and/or derogatory then they'll find me very much right back in their face. Perhaps it's some kind of territorial thing - that people bridle when they find a seriously divergent viewpoint popping up in a cozy little community where most of them have congregated because they already share the beliefs of the group. Such in-bred communities do provide a kind of sanctuary and feeling of belonging: perhaps it's unrealistic to expect most people to be able to rise above that and deal rationally with the wider world's entry into their little one. Or not: we'll see. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
Thanks for taking the time to flesh these points out. Comments below: ... The compression I see varies from something like 30% to 50%, very roughly (files reduced *by* 30%, not files reduced *to* 30%). This is with the Nikon D200, compressed NEF option. On some of the lower-level bodies, I believe the compression can't be turned off. Smaller files will of course get hit less often -- or it'll take longer to accumulate the terabyte, is how I'd prefer to think of it. Either viewpoint works. And since the compression is not that great, you still wind up consuming a lot of space. Effectively, you're trading (at least if compression is an option rather than something that you're stuck with) the possibility that a picture will become completely useless should a bit get flipped for a storage space reduction of 30% - 50% - and that's a good trade, since it effectively allows you to maintain a complete backup copy on disk (for archiving, preferably off line) almost for free compared with the uncompressed option. Damage that's fixable is still damage; I think of this in archivist mindset, with the disadvantage of not having an external budget to be my own archivist. There will *always* be the potential for damage, so the key is to make sure that any damage is easily fixable. The best way to do this is to a) keep multiple copies, b) keep them isolated from each other (that's why RAID is not a suitable approach to archiving), and c) check (scrub) them periodically to ensure that if you lose a piece (whether a bit or a sector) you can restore the affected data from another copy and thus return your redundancy to full strength. For serious archiving, you probably want to maintain at least 3 such copies (possibly more if some are on media of questionable longevity). For normal use, there's probably negligible risk of losing any data if you maintain only two on reasonably reliable media: 'MAID' experience suggests that scrubbing as little as every few months reduces the likelihood of encountering detectable errors while restoring redundancy by several orders of magnitude (i.e., down to something like once in a PB at worst for disks - becoming comparable to the levels of bit-flip errors that the disk fails to detect at all). Which is what I've been getting at w.r.t. ZFS in this particular application (leaving aside whether it can reasonably be termed a 'consumer' application - because bulk video storage is becoming one and it not only uses a similar amount of storage space but should probably be protected using similar strategies): unless you're seriously worried about errors in the once-per-PB range, ZFS primarily just gives you automated (rather than manually-scheduled) scrubbing (and only for your on-line copy). Yes, it will help detect hardware faults as well if they happen to occur between RAM and the disk (and aren't otherwise detected - I'd still like to know whether the 'bad cable' experiences reported here occurred before ATA started CRCing its transfers), but while there's anecdotal evidence of such problems presented here it doesn't seem to be corroborated by the few actual studies that I'm familiar with, so that risk is difficult to quantify. 
Getting back to 'consumer' use for a moment, though, given that something like 90% of consumers entrust their PC data to the tender mercies of Windows, and a large percentage of those neither back up their data, nor use RAID to guard against media failures, nor protect it effectively from the perils of Internet infection, it would seem difficult to assert that whatever additional protection ZFS may provide would make any noticeable difference in the consumer space - and that was the kind of reasoning behind my comment that began this sub-discussion. By George, we've managed to get around to having a substantive discussion after all: thanks for persisting until that occurred. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Response to phantom dd-b post
Well, I guess we're going to remain stuck in this sub-topic for a bit longer: The vast majority of what ZFS can detect (save for *extremely* rare undetectable bit-rot and for real hardware (path-related) errors that studies like CERN's have found to be very rare - and you have yet to provide even anecdotal evidence to the contrary) You wanted anecdotal evidence: To be accurate, the above was not a solicitation for just any kind of anecdotal evidence but for anecdotal evidence that specifically contradicted the notion that otherwise undetected path-related hardware errors are 'very rare'. During my personal experience with only two home machines, ZFS has helped me detect corruption at least three times in a period of a few months. One due to silent corruption due to a controller bug (and a driver that did not work around it). If that experience occurred using what could be considered normal consumer hardware and software, that's relevant (and disturbing). As I noted earlier, the only path-related problem that the CERN study unearthed involved their (hardly consumer-typical) use of RAID cards, the unusual demands that those cards placed on the WD disk firmware (to the point where it produced on-disk errors), and the cards' failure to report accompanying disk time-outs. Another time corruption during hotswapping (though this does not necessarily count since I did it on hardware that I did not know was supposed to support it, and I would not have attempted it to begin with otherwise). Using ZFS as a test platform to see whether you could get away with using hardware in a manner that it may not have been intended to be used may not really qualify as 'consumer' use. As I've noted before, consumer relevance remains the point in question here (since that's the point that fired off this lengthy sub-discussion). ... In my professional life I have seen bitflips a few times in the middle of real live data running on real servers that are used for important data. As a result I have become pretty paranoid about it all, making heavy use of par2. And well you should - but, again, that's hardly 'consumer' use. ... can also be detected by scrubbing, and it's arguably a lot easier to apply brute-force scrubbing (e.g., by scheduling a job that periodically copies your data to the null device if your system does not otherwise support the mechanism) than to switch your file system. How would your magic scrubbing detect arbitrary data corruption without checksumming The assertion is that it would catch the large majority of errors that ZFS would catch (i.e., all the otherwise detectable errors, most of them detected by the disk when it attempts to read a sector), leaving a residue of no noticeable consequence to consumers (especially as one could make a reasonable case that most consumers would not experience any noticeable problem even if *none* of these errors were noticed). or redundancy? Redundancy is necessary if you want to fix (not just catch) errors, but conventional mechanisms provide redundancy just as effective as ZFS's. (With the minor exception of ZFS's added metadata redundancy, but the likelihood that an error will happen to hit the relatively minuscule amount of metadata on a disk rather than the sea of data on it is, for consumers, certainly negligible, especially considering all the far more likely potential risks in the use of their PCs.) A lot of the data people save does not have checksumming. 
*All* disk data is checksummed, right at the disk - and according to the studies I'm familiar with this detects most errors (certainly enough of those that ZFS also catches to satisfy most consumers). If you've got any quantitative evidence to the contrary, by all means present it. ... I think one needs to stop making excuses by observing properties of specific file types and similar. I'm afraid that's incorrect: given the statistical incidence of the errors in question here, in normal consumer use only humongous files will ever experience them with non-negligible probability. So those are the kinds of files at issue. When such a file experiences one of these errors, then either it will be one that ZFS is uniquely (save for WAFL) capable of detecting, or it will be one that more conventional mechanisms can detect. The latter are, according to the studies I keep mentioning, far more frequent (only relatively, of course: we're still only talking about one in every 10 TB or so, on average and according to manufacturers' specs, which seem to be if anything pessimistic in this area), and comprise primarily unreadable disk sectors which (as long as they're detected in a timely manner by scrubbing, whether ZFS's or some manually-scheduled mechanism) simply require that the bad sector (or file) be replaced by a good copy to restore the desired level of redundancy. When we get into the
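For concreteness, the 'brute-force scrub' mentioned above amounts to no more than this sketch (hypothetical paths; the assumption, per the argument above, is that merely forcing every sector through the drive's own per-sector ECC surfaces latent unreadable sectors as I/O errors, which a mirror or backup copy can then repair):

import os

def scrub(root):
    """Read every byte of every file to the bit bucket; the drive's
    per-sector ECC reports any latent unreadable sector as an I/O
    error, which is the cue to restore that file from redundancy."""
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while f.read(1 << 20):   # 1 MB at a time, discarded
                        pass
            except OSError as exc:
                bad.append((path, exc))
    return bad   # files needing repair from a mirror or backup

# e.g., run scrub("/data") from a monthly cron job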
Re: [zfs-discuss] Response to phantom dd-b post
Chill. It's a filesystem. If you don't like it, don't use it. Hey, I'm cool - it's mid-November, after all. And it's not about liking or not liking ZFS: it's about actual merits vs. imagined ones, and about legitimate praise vs. illegitimate hype. Some of us have a professional interest in such things. If you don't, by all means feel free to ignore the discussion. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
Hallelujah! I don't know when this post actually appeared in the forum, but it wasn't one I'd seen until right now. If it didn't just appear due to whatever kind of fluke made the 'disappeared' post appear right now too, I apologize for having missed it earlier. In a compressed raw file, it'll affect the rest of the file generally; so it essentially renders the whole thing useless, unless it happens to hit towards the end and you can crop around it. If it hits in metadata (statistically unlikely, the bulk of the file is image data) it's probably at worst annoying, but it *might* hit one of the bits software uses to recognize and validate the file, too. In an uncompressed raw file, if it hits in image data it'll affect probably 9 pixels; it's easily fixed. That's what I figured (and the above is the first time you've mentioned *compressed* RAW files, so the obvious next observation is that if they compress well - and if not, why bother compressing them? - then the amount of room that they occupy is significantly smaller and the likelihood of getting an error in one is similarly smaller). ... Even assuming that you meant 'MB' rather than 'Mb' above, that suggests that it would take you well over a decade to amass 1 TB of RAW data (assuming that, as you suggest both above and later, you didn't accumulate several hundred MB of pictures *every* day but just on those days when you were traveling, at a sporting event, etc.). I seem to come up with a DVD full every month or two these days, myself. I mean, it varies; there was this one weekend I filled 4 or some such; but it varies both ways, and that average isn't too far off. 25GB a year seems to take 40 years to reach 1TB. However, my rate has increased so dramatically in the last 7 years that I'm not at all sure what to expect; is it time for the curve to level off yet, for me? Who knows! Well, it still looks as if you're taking well over a decade to fill 1 TB at present, as I estimated. Then again, I'm *also* working on scanning in the *last* 40 years worth of photos, and those tend to be bigger (scans are less good pixels so you need more of them), and *that* runs the numbers up, in chunks when I take time to do a big scanning batch. OK - that's another new input, though not yet a quantitative one. ... Even if you've got your original file archived, you still need your working copies available, and Adobe Photoshop can turn that RAW file into a PSD of nearly 60Mb in some cases. If you really amass all your pictures this way (rather than, e.g., use Photoshop on some of them and then save the result in a less verbose format), I'll suggest that this takes you well beyond the 'consumer' range of behavior. It's not snapshot usage, but it's common amateur usage. Amateurs tend to do lots of the same things professionals do (and sometimes better, though not usually). Hobbies are like that. The argument for the full Photoshop file is the concept of nondestructive editing. I do retouching on new layers instead of erasing what I already have with the new stuff. I use adjustment layers with layer masks for curve adjustments. I can go back and improve the mask, or nudge the curves, without having to start over from scratch. It's a huge win. And it may be more valuable for amateurs, actually; professionals tend to have the experience to know their minds better and know when they have it right, so many of them may do less revisiting old stuff and improving it a bit. Also, when the job is done and sent to the client, they tend not to care about it any more. 
OK - but at a *maximum* of 60 MB per shot you're still talking about having to manually massage at least 20,000 shots in Photoshop before the result consumes 1 TB of space. That's a *lot* of manual labor: do you really perform it on anything like that number of shots? - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
... Having had my MP3 collection fucked up thanks to neither Windows nor NTFS being able to properly detect and report in-flight data corruption (i.e. bad cable), after copying it from one drive to another to replace one of them, I'm really glad that I've got ZFS to manage my data these days. Hmmm. All this talk about bad cables by you and others sounds more like older ATA (before transfers over the cable got CRC protection) than like contemporary drives. Was your experience with a recent drive and controller? ... As far as all these reliability studies go, my practical experience is quite the opposite. I'm fixing computers of friends and acquaintances left and right, and bad sectors are pretty common. I certainly haven't found them to be common, unless a drive was on the verge of major failure. Though if a drive is used beyond its service life (usually 3 - 5 years) they may become more common. In any case, if a conventional scrub would detect the bad sector then ZFS per se wouldn't add unique value (save that the check would be automated rather than something that the user, or system assembler, had to set up to be scheduled). I really meant it, though, when I said that I don't completely discount anecdotal experience: I just like to get more particulars before deciding how much to weigh it against more formal analyses. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
On 9-Nov-07, at 3:23 PM, Scott Laird wrote: Most video formats are designed to handle errors--they'll drop a frame or two, but they'll resync quickly. So, depending on the size of the error, there may be a visible glitch, but it'll keep working. Interestingly enough, this applies to a lot of MPEG-derived formats as well, like MP3. I had a couple bad copies of MP3s that I tried to listen to on my computer a few weeks ago (podcasts copied via bluetooth off of my phone, apparently with no error checking), and it made the story hard to follow when a few seconds would disappear out of the middle, but it didn't destroy the file. Well that's nice. How about your database, your source code, your ZIP file, your encrypted file, ... They won't be affected, because they're so much smaller that (at something like 1 error per 10 TB) the chance of an error hitting them is negligible: that was the whole point of singling out huge video files as the only likely candidates to worry about. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
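The 'negligible' claim is straightforward arithmetic at the assumed rate of roughly one error per 10 TB:

TEN_TB = 10 * 10**12   # assumed rate: about one error per 10 TB

def odds_of_one_error(file_bytes):
    """Approximate chance that a random error lands in this file."""
    return file_bytes / float(TEN_TB)

for name, size in (("5 MB MP3", 5 * 10**6),
                   ("100 MB database", 100 * 10**6),
                   ("500 GB video library", 500 * 10**9)):
    print("%s: about 1 in %.0f" % (name, 1 / odds_of_one_error(size)))
# prints roughly: 1 in 2000000, 1 in 100000, 1 in 20

So the exposure concentrates almost entirely in the huge media stores, which is the point being made above.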
Re: [zfs-discuss] Yager on ZFS
On 9-Nov-07, at 2:45 AM, can you guess? wrote: ... This suggests that in a ZFS-style installation without a hardware RAID controller they would have experienced at worst a bit error about every 10^14 bits or 12 TB And how about FAULTS? hw/firmware/cable/controller/ram/... If you had read either the CERN study or what I already said about it, you would have realized that it included the effects of such faults. ... but I had a box that was randomly corrupting blocks during DMA. The errors showed up when doing a ZFS scrub and I caught the problem in time. Yup - that's exactly the kind of error that ZFS and WAFL do a perhaps uniquely good job of catching. WAFL can't catch all: It's distantly isolated from the CPU end. WAFL will catch everything that ZFS catches, including the kind of DMA error described above: it contains validating information outside the data blocks just as ZFS does. ... CERN was using relatively cheap disks Don't forget every other component in the chain. I didn't, and they didn't: read the study. ... Your position is similar to that of an audiophile enthused about a measurable but marginal increase in music quality and trying to convince the hoi polloi that no other system will do: while other audiophiles may agree with you, most people just won't consider it important - and in fact won't even be able to distinguish it at all. Data integrity *is* important. You clearly need to spend a lot more time trying to understand what you've read before responding to it. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Response to phantom dd-b post
Just to note here as well as earlier that some of the confusion about what you had and had not said was related to my not having seen the post where you talked about RAW and compressed RAW errors until this morning. Since your other mysteriously 'disappeared' post also appeared recently, I suspect that the RAW/compressed post was not present earlier when we were talking about its contents, but it is also possible that I just missed it. In any case, my response to you was based on your claim below (by selective quoting) that this content had been in a post that I had responded to. - bill can you guess? wrote: ... Most of the balance of your post isn't addressed in any detail because it carefully avoids the fundamental issues that I raised: Not true; and by selective quoting you have removed my specific responses to most of these issues. While I'm naturally reluctant to call you an outright liar, David, you have hardly so far in this discussion impressed me as someone whose presentation is so well-organized and responsive to specific points that I can easily assume that I simply missed those responses. If you happen to have a copy of that earlier post, I'd like to see it resubmitted (without modification). Oh, dear: I got one post/response pair out of phase with the above - the post which I claimed did not address the issues that I raised *is* present here (and indeed does not address them). I still won't call you an outright liar: you're obviously just *very* confused about what qualifies as responding to specific points. And, just for the record, if you do have a copy of the post that disappeared, I'd still like to see it. 1. How much visible damage does a single-bit error actually do to the kind of large photographic (e.g., RAW) file you are describing? If it trashes the rest of the file, as you state is the case with jpeg, then you might have a point (though you'd still have to address my second issue below), but if it results in a virtually invisible blemish they you most certainly don't. I addressed this quite specifically, for two cases (compressed raw vs. uncompressed raw) with different results. Then please do so where we all can see it. Especially since there's no evidence of it in the post (still right here, up above) where you appear to be claiming that you did. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Response to phantom dd-b post
No, you aren't cool, and no it isn't about zfs or your interest in it. It was clear from the get-go that netapp was paying you to troll any discussion on it, It's (quite literally) amazing how the most incompetent individuals turn out to be those who are the most certain of their misconceptions. In fact, there have been studies done that establish this as a statistically-significant trait among that portion of the population - so at least you aren't alone in this respect. For the record, I have no connection with NetApp, I have never had any connection with NetApp (save for appreciating the elegance of their products), they never in any way asked me to take any part in any discussion on any subject whatsoever (let alone offered to pay me to do so), and I don't even *know* anyone at NetApp (at least that I'm aware of) save by professional reputation. In other words, you've got your head so far up your ass that you're not only ready to make accusations that you do not (and in fact could not) have any evidence to support, you're ready to make accusations that are factually flat wrong. Simply because an individual of your caliber apparently cannot conceive of the possibility that someone might take sufficient personal and professional interest in a topic to devote actual time and effort to attempting to cut through the hype that mostly well-meaning but less-than-objective and largely-uncritical supporters are shoveling out? Sheesh. ... Yes, every point you've made could be refuted. Rather than drool about it, try taking an actual shot at doing so: though I'd normally consider talking with you to be a waste of my time, I'll make an exception in this case. Call it a grudge match, if you want: I *really* don't like the kind of incompetence that someone who behaves as you just did represents and also consider it something in the nature of a civic duty to expose it for what it is. ... I suggest getting a blog and ranting there, you have no audience here. Another demonstrably incorrect statement, I'm afraid: the contents of this thread make it clear that some people here, despite their preconceptions, do consider a detailed analysis of ZFS's relative strengths to be a fit subject for discussion. And since it's only human for them to resist changing those preconceptions, it's hardly surprising that the discussion gets slightly heated at times. Education frequently can only occur through confrontation: existing biases make it difficult for milder forms to get through. I'd like to help people here learn something, but I'm at least equally interested in learning things myself - and since there are areas in which I consider ZFS's design to be significantly sub-optimal, where better to test that opinion than here? Unfortunately, so far the discussion has largely bogged down in debate over just how important ZFS's unique (save for WAFL) checksum protection mechanisms may be, and has not been very productive given the reluctance of many here to tackle that question quantitatively (though David eventually started to do so) - so there's been very little opportunity for learning on my part save for a few details about media-file internals. 
I'm more interested in discussing things like whether my suggested fix for RAID-Z's poor parallel-small-access performance overlooked some real obstacle, or why ZFS was presented as a highly-scalable file system when its largest files can require up to 6 levels of indirect blocks (making performance for random-access operations suck and causing snapshot data for updated large files to balloon) and it offers no obvious extension path to clustered operation (a single node - especially a single *commodity* node of the type that ZFS otherwise favors - runs out of steam in the PB range, or even lower for some workloads, and even breaking control out into a separate metadata server doesn't get you that much farther), or whether ZFS's apparently-centralized block-allocation mechanisms can scale well (using preallocation to predistribute large chunks that can be managed independently helps, but again doesn't get you beyond the PB range at best), or the blind spot that some of the developers appear to have about the importance of on-disk contiguity for streaming access performance (128 KB chunks just don't cut it in terms of efficient disk utilization in parallel environments unless they're grouped together), or its trade-off of run-time performance and space use for performance when accessing snapshots (I'm guessing that it was more faith in the virtue of full-tree-path updating as compared with using a transaction log that actually caused that decision, so perhaps that's the real subject for discussion). Of course, given that ZFS is what it is, there's a natural tendency just to plow forward and not 'waste time' revisiting already-made decisions - so the people best able to discuss them may not want to. But you
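The indirect-block depth claim is easy to sanity-check with a little arithmetic (assumptions here: 128-byte block pointers, 128 KB indirect blocks - i.e., 1,024 child pointers per indirect block - and a 2^64-byte maximum object size):

import math

BLKPTR = 128                    # assumed block pointer size in bytes
IND_BLOCK = 128 * 1024          # assumed indirect block size
FANOUT = IND_BLOCK // BLKPTR    # 1,024 child pointers per indirect block

def indirect_levels(data_block, file_size=2**64):
    """Levels of indirection needed to map a maximally large object."""
    blocks = math.ceil(file_size / data_block)
    levels = 0
    while blocks > 1:           # until a single root pointer suffices
        blocks = math.ceil(blocks / FANOUT)
        levels += 1
    return levels

for bs in (128 * 1024, 8 * 1024, 512):
    print("%6d-byte blocks: %d levels" % (bs, indirect_levels(bs)))
# 131072-byte blocks: 5 levels; 8192: 6 levels; 512: 6 levels

Under these assumptions the worst case (small block sizes) does indeed reach the 6 levels cited above, and a random update must rewrite one block per level on its way to the root.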
[zfs-discuss] Response to phantom dd-b post
This is a bit weird: I just wrote the following response to a dd-b post that now seems to have disappeared from the thread. Just in case that's a temporary aberration, I'll submit it anyway as a new post. can you guess? wrote: Ah - thanks to both of you. My own knowledge of video format internals is so limited that I assumed most people here would be at least equally familiar with the notion that a flipped bit or two in a video would hardly qualify as any kind of disaster (or often even as being noticeable, unless one were searching for it, in the case of commercial-quality video). But also, you're thinking like a consumer, Well, yes - since that's the context of my comment to which you originally responded. Did you manage to miss that, even after I repeated it above in the post to which you're responding *this* time? not like an archivist. A bit lost in an archival video *is* a disaster, or at least a serious degradation. Or not, unless you're really, really obsessive-compulsive about it - certainly *far* beyond the point of being reasonably characterized as a 'consumer'. ... And since the CERN study seems to suggest that the vast majority of errors likely to be encountered at this level of incidence (and which could be caught by ZFS) are *detectable* errors, they'll (in the unlikely event that you encounter them at all) typically only result in requiring use of a RAID (or backup) copy (surely one wouldn't be entrusting data of any real value to a single disk). They'll only be detected when the files are *read*; ZFS has the scrub concept, but most RAID systems don't, Perhaps you're just not very familiar with other systems, David. For example, see http://gentoo-wiki.com/HOWTO_Gentoo_Install_on_Software_RAID#Data_Scrubbing, where it tells you how to run a software RAID scrub manually (or presumably in a cron job if it can't be configured to be more automatic). Or take any of a variety of Adaptec RAID cards which support two different forms of scanning/fixup (which presumably could also be scheduled externally if an internal scheduling mechanism is not included). I seriously doubt that these are the only such facilities out there: they're just ones I happen to be able to cite with minimal effort. ... So I see no reason to change my suggestion that consumers just won't notice the level of increased reliability that ZFS offers in this area: not only would the difference be nearly invisible even if the systems they ran on were otherwise perfect, but in the real world consumers have other reliability issues to worry about that occur multiple orders of magnitude more frequently than the kinds that ZFS protects against. And yet I know many people who have lost data in ways that ZFS would have prevented. Specifics would be helpful here. How many? Can they reasonably be characterized as consumers (I'll remind you once more: *that's* the subject to which your comments purport to be responding)? Can the data loss reasonably be characterized as significant (to 'consumers')? Were the causes hardware problems that could reasonably have been avoided ('bad cables' might translate to 'improperly inserted, overly long, or severely kinked cables', for example - and such a poorly-constructed system will tend to have other problems that ZFS cannot address)? - bill
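For anyone who'd rather automate the software-RAID scrub mentioned above than run it by hand: on Linux md arrays the mechanism behind that wiki page is the md 'sync_action' sysfs interface. A minimal sketch suitable for a cron job - the array name is an assumption, and you need root:

    # Sketch: trigger a scrub ('check') of a Linux md RAID array via
    # sysfs and report the mismatch count when it finishes. The array
    # name below is an assumption - substitute your own.
    import time

    ARRAY = "md0"
    base = f"/sys/block/{ARRAY}/md"

    with open(f"{base}/sync_action", "w") as f:
        f.write("check\n")                # start a read/compare pass

    while True:                           # wait for the array to go idle
        with open(f"{base}/sync_action") as f:
            if f.read().strip() == "idle":
                break
        time.sleep(60)

    with open(f"{base}/mismatch_cnt") as f:
        print("mismatches found:", f.read().strip())

Nothing ZFS-specific about the idea - which was rather the point.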
Re: [zfs-discuss] Yager on ZFS
can you guess? wrote: ... If you include 'image files of various sorts', as he did (though this also raises the question of whether we're still talking about 'consumers'), then you also have to specify exactly how damaging single-bit errors are to those various 'sorts' (one might guess not very for the uncompressed formats that might well be taking up most of the space). And since the CERN study seems to suggest that the vast majority of errors likely to be encountered at this level of incidence (and which could be caught by ZFS) are *detectable* errors, they'll (in the unlikely event that you encounter them at all) typically only result in requiring use of a RAID (or backup) copy (surely one wouldn't be entrusting data of any real value to a single disk). I have to comment here. As a bloke with a bit of a photography habit - I have a 10Mpx camera and I shoot in RAW mode - it is very, very easy to acquire 1Tb of image files in short order. So please respond to the question that I raised above (and that you yourself quoted): just how much damage will a single-bit error do to such a RAW file? Each of the photos I take is between 8 and 11Mb, and if I'm at a sporting event or I'm travelling for work or pleasure, it is *incredibly* easy to amass several hundred Mb of photos every single day. Even assuming that you meant 'MB' rather than 'Mb' above, that suggests that it would take you well over a decade to amass 1 TB of RAW data (assuming that, as you suggest both above and later, you didn't accumulate several hundred MB of pictures *every* day but just on those days when you were traveling, at a sporting event, etc.). I'm by no means a professional photographer (so I'm not out taking photos every single day), although a very close friend of mine is. My photo storage is protected by ZFS with mirroring and backups to dvd media. My profotog friend has 3 copies of all her data - working set, immediate copy on usb-attached disk, and second backup also on usb-attached disk but disconnected. Sounds wise on both your parts - and probably makes ZFS's extra protection pretty irrelevant (I won't bother repeating why here). Even if you've got your original file archived, you still need your working copies available, and Adobe Photoshop can turn that RAW file into a PSD of nearly 60Mb in some cases. If you really amass all your pictures this way (rather than, e.g., use Photoshop on some of them and then save the result in a less verbose format), I'll suggest that this takes you well beyond the 'consumer' range of behavior. It is very easy for the storage medium to acquire some degree of corruption - whether it's a CF or SD card, they all use FAT32. I have been in the position of losing photos due to this. Not many - perhaps a dozen over the course of 12 months. So in those cases you didn't maintain multiple copies. Bad move, and usually nothing that using ZFS could help with. While I'm not intimately acquainted with flash storage, my impression is that data loss usually occurs due to bad writes (since once written the data just sits there persistently and AFAIK is not subject to the kinds of 'bit rot' that disk and tape data can experience). So if the loss occurs to the original image captured on flash before it can be copied elsewhere, you're just SOL and nothing ZFS offers could help you. That flipped bit which you seem to be dismissing as hardly...
a disaster can in fact make your photo file totally useless, because not only will you probably not be able to get the file off the media card, but whatever software you're using to keep track of your catalog will also be unable to show you the entire contents. That might be the image itself, or it might be the equally important EXIF information. Here come those pesky numbers again, I'm afraid. Because given that the size difference between your image data and the metadata (including EXIF information, if that's what I suspect it is) is at least several orders of magnitude, the chance that the bad bit will be in something other than the image data is pretty negligible. So even if you can format your card to use ZFS (can you? if not, what possible relevance does your comment above have to this discussion?), doing so won't help at all: the affected file will still be inaccessible (unless you use ZFS to create a redundant pool across multiple such cards: is that really what you're suggesting should be done?) both to normal extraction (though couldn't dd normally get off everything but the bad sector?) and to your cataloging software. I don't depend on FAT32-formatted media cards to make my living, fortunately, but if I did I imagine I'd probably end up only using each card for about a month before exercising caution and purchasing a new one rather than depending on the card itself to be reliable any more. The 'wear
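Rather than trade assertions about how much visible damage one flipped bit does to a RAW file, anyone with a spare image can simply measure it: flip a single randomly chosen bit in a *copy* of the file and open the copy in the usual converter. A sketch (the file names are placeholders):

    # Sketch of the single-bit-damage experiment: copy an image file,
    # flip one randomly chosen bit in the copy, then open the copy in
    # your RAW converter / viewer and judge the damage for yourself.
    import random
    import shutil

    SRC = "IMG_0001.CR2"             # placeholder names - use your own
    DST = "IMG_0001_flipped.CR2"

    shutil.copyfile(SRC, DST)
    with open(DST, "rb") as f:
        data = bytearray(f.read())
    offset = random.randrange(len(data))
    data[offset] ^= 1 << random.randrange(8)     # flip exactly one bit
    with open(DST, "wb") as f:
        f.write(data)
    print(f"flipped one bit at byte {offset} of {len(data)}")

It also makes the metadata-odds point above concrete: with, say, 10 KB of EXIF/catalog metadata in a 10 MB file, a uniformly random bad bit lands in the metadata only about 0.1% of the time.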
Re: [zfs-discuss] Response to phantom dd-b post
can you guess? wrote: This is a bit weird: I just wrote the following response to a dd-b post that now seems to have disappeared from the thread. Just in case that's a temporary aberration, I'll submit it anyway as a new post. Strange things certainly happen here now and then. The post you're replying to is one I definitely did send in. Could I have messed up and sent it just to you, thus causing confusion when you read it, deleted it, remembered it as in the group rather than direct? I used the forum's 'quote original' feature in replying and then received a screen-full of Java errors saying that the parent post didn't exist when I attempted to submit it. Most of the balance of your post isn't addressed in any detail because it carefully avoids the fundamental issues that I raised: 1. How much visible damage does a single-bit error actually do to the kind of large photographic (e.g., RAW) file you are describing? If it trashes the rest of the file, as you state is the case with jpeg, then you might have a point (though you'd still have to address my second issue below), but if it results in a virtually invisible blemish then you most certainly don't. 2. If you actually care about your data, you'd have to be a fool to entrust it to *any* single copy, regardless of medium. And once you've got more than one copy, then you're protected (at the cost of very minor redundancy restoration effort in the unlikely event that any problem occurs) against the loss of any one copy due to a minor error - the only loss of non-negligible likelihood that ZFS protects against better than other file systems. If you're relying upon RAID to provide the multiple copies - though this would also arguably be foolish, if only due to the potential for trashing all the copies simultaneously - you'd probably want to schedule occasional scrubs, just in case you lost a disk. But using RAID as a substitute for off-line redundancy is hardly suitable in the kind of archiving situations that you describe - and therefore ZFS has absolutely nothing of value to offer there: you should be using off-line copies, and occasionally checking all copies for readability (e.g., by copying them to the null device - again, something you could do for your on-line copy with a cron job and which you should do for your off-line copy/copies once in a while as well). In sum, your support of ZFS in this specific area seems very much knee-jerk in nature rather than carefully thought out - exactly the kind of 'over-hyping' that I pointed out in my first post in this thread. ... And yet I know many people who have lost data in ways that ZFS would have prevented. Specifics would be helpful here. How many? Can they reasonably be characterized as consumers (I'll remind you once more: *that's* the subject to which your comments purport to be responding)? Can the data loss reasonably be characterized as significant (to 'consumers')? Were the causes hardware problems that could reasonably have been avoided ('bad cables' might translate to 'improperly inserted, overly long, or severely kinked cables', for example - and such a poorly-constructed system will tend to have other problems that ZFS cannot address)? Reasonably avoided is irrelevant; they *weren't* avoided. While that observation has at least some merit, I'll observe that you jumped directly to the last of my questions above while carefully ignoring the three questions that preceded it. ...
Nearly everybody I can think of who's used a computer for more than a couple of years has stories of stuff they've lost. Of course they have - and usually in ways that ZFS would have been no help whatsoever in mitigating. I knew a lot of people who lost their entire hard drive at one point or other, especially in the 1985-1995 timeframe. Fine example of a situation where only redundancy can save you, and where good old vanilla-flavored RAID (with scrubbing - but, as I noted, that's hardly something that ZFS has any corner on) provides comparable protection to ZFS-with-mirroring. The people were quite upset by the loss; I'm not going to accept somebody else deciding it's not significant. I never said such situations were not significant, David: I simply observed (and did so again above) that in virtually all of them ZFS offered no particular advantage over more conventional means of protection. You need to get a grip and try to understand the *specifics* of what's being discussed here if you want to carry on a coherent discussion about it. - bill
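Since 'copy it to the null device' keeps coming up, here's the whole trick made explicit - a sketch of the brute-force read scrub a cron job could run against any filesystem (the archive path is an assumption):

    # Sketch: read every byte of every file under a tree and throw the
    # data away. Any I/O error surfaced here is a latent fault (e.g., an
    # unreadable sector) caught while your redundant copy still exists,
    # instead of years later when you finally open the file.
    import os
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "/archive"  # assumption
    errors = 0
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while f.read(1 << 20):   # 1 MiB at a time, to nowhere
                        pass
            except OSError as exc:
                errors += 1
                print(f"READ ERROR: {path}: {exc}")
    print(f"scrub complete: {errors} unreadable file(s)")

This catches exactly the class of errors the disk itself can detect and report - which, per the CERN numbers discussed earlier, is the overwhelming majority.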
Re: [zfs-discuss] Response to phantom dd-b post
can you guess? wrote: ... Most of the balance of your post isn't addressed in any detail because it carefully avoids the fundamental issues that I raised: Not true; and by selective quoting you have removed my specific responses to most of these issues. While I'm naturally reluctant to call you an outright liar, David, you have hardly so far in this discussion impressed me as someone whose presentation is so well-organized and responsive to specific points that I can easily assume that I simply missed those responses. If you happen to have a copy of that earlier post, I'd like to see it resubmitted (without modification). 1. How much visible damage does a single-bit error actually do to the kind of large photographic (e.g., RAW) file you are describing? If it trashes the rest of the file, as you state is the case with jpeg, then you might have a point (though you'd still have to address my second issue below), but if it results in a virtually invisible blemish then you most certainly don't. I addressed this quite specifically, for two cases (compressed raw vs. uncompressed raw) with different results. Then please do so where we all can see it. 2. If you actually care about your data, you'd have to be a fool to entrust it to *any* single copy, regardless of medium. And once you've got more than one copy, then you're protected (at the cost of very minor redundancy restoration effort in the unlikely event that any problem occurs) against the loss of any one copy due to a minor error - the only loss of non-negligible likelihood that ZFS protects against better than other file systems. You have to detect the problem first. ZFS is in a much better position to detect the problem due to block checksums. Bulls***, to quote another poster here who has since been strangely quiet. The vast majority of what ZFS can detect (save for *extremely* rare undetectable bit-rot and for real hardware (path-related) errors that studies like CERN's have found to be very rare - and you have yet to provide even anecdotal evidence to the contrary) can also be detected by scrubbing, and it's arguably a lot easier to apply brute-force scrubbing (e.g., by scheduling a job that periodically copies your data to the null device if your system does not otherwise support the mechanism) than to switch your file system. If you're relying upon RAID to provide the multiple copies - though this would also arguably be foolish, if only due to the potential for trashing all the copies simultaneously - you'd probably want to schedule occasional scrubs, just in case you lost a disk. But using RAID as a substitute for off-line redundancy is hardly suitable in the kind of archiving situations that you describe - and therefore ZFS has absolutely nothing of value to offer there: you should be using off-line copies, and occasionally checking all copies for readability (e.g., by copying them to the null device - again, something you could do for your on-line copy with a cron job and which you should do for your off-line copy/copies once in a while as well). You have to detect the problem first. And I just described how to above - in a manner that also handles the off-line storage that you *should* be using for archival purposes (where ZFS scrubbing is useless). ZFS block checksums will detect problems that a simple read-only pass through most other filesystems will not detect.
The only problems that ZFS will detect that a simple read-through pass will not are those that I just enumerated above: *extremely* rare undetectable bit-rot and real hardware (path-related) errors that studies like CERN's have found to be very rare (like, none in the TB-sized installation under discussion here). In sum, your support of ZFS in this specific area seems very much knee-jerk in nature rather than carefully thought out - exactly the kind of 'over-hyping' that I pointed out in my first post in this thread. And your opposition to ZFS appears knee-jerk and irrational, from this end. But telling you that will have no beneficial effect, any more than what you just told me about how my opinions appear to you. Couldn't we leave personalities out of this, in future? When someone appears to be arguing irrationally, it's at least worth trying to straighten him out. But I'll stop - *if* you start addressing the very specific and quantitative issues that you've been so assiduously skirting until now. ... And yet I know many people who have lost data in ways that ZFS would have prevented. Specifics would be helpful here. How many? Can they reasonably be characterized as consumers (I'll remind you once more: *that's* the subject to which your comments purport to be responding)? Can the data loss reasonably be characterized as significant
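And for completeness, even that residual gap - *silent* corruption, which a plain read pass cannot see - can be closed on any filesystem by keeping an application-level checksum manifest: roughly the detection (though not the self-healing) that ZFS block checksums provide. A sketch, with the paths as assumptions:

    # Sketch: build/verify a SHA-256 manifest over an archive tree. The
    # first run records a digest per file; later runs flag any file
    # whose bytes have silently changed since it was recorded.
    import hashlib
    import json
    import os

    ROOT = "/archive"                       # assumptions - adjust to taste
    MANIFEST = "/archive/.manifest.json"

    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    old = {}
    if os.path.exists(MANIFEST):
        with open(MANIFEST) as f:
            old = json.load(f)

    new = {}
    for dirpath, _, names in os.walk(ROOT):
        for name in names:
            path = os.path.join(dirpath, name)
            if path == MANIFEST:
                continue
            new[path] = digest(path)
            if path in old and old[path] != new[path]:
                print("CHANGED (corruption, or a legitimate edit?):", path)

    with open(MANIFEST, "w") as f:
        json.dump(new, f)
    print(f"{len(new)} file(s) checked")

The obvious caveat: unlike ZFS, the manifest can't tell corruption from a deliberate rewrite of the file, so it's really only suited to write-once archives - which happens to be exactly the use case under discussion.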
Re: [zfs-discuss] Response to phantom dd-b post
can you guess? wrote: ... Most of the balance of your post isn't addressed in any detail because it carefully avoids the fundamental issues that I raised: Not true; and by selective quoting you have removed my specific responses to most of these issues. While I'm naturally reluctant to call you an outright liar, David, you have hardly so far in this discussion impressed me as someone whose presentation is so well-organized and responsive to specific points that I can easily assume that I simply missed those responses. If you happen to have a copy of that earlier post, I'd like to see it resubmitted (without modification). Oh, dear: I got one post/response pair out of phase with the above - the post which I claimed did not address the issues that I raised *is* present here (and indeed does not address them). I still won't call you an outright liar: you're obviously just *very* confused about what qualifies as responding to specific points. And, just for the record, if you do have a copy of the post that disappeared, I'd still like to see it. 1. How much visible damage does a single-bit error actually do to the kind of large photographic (e.g., RAW) file you are describing? If it trashes the rest of the file, as you state is the case with jpeg, then you might have a point (though you'd still have to address my second issue below), but if it results in a virtually invisible blemish then you most certainly don't. I addressed this quite specifically, for two cases (compressed raw vs. uncompressed raw) with different results. Then please do so where we all can see it. Especially since there's no evidence of it in the post (still right here, up above) where you appear to be claiming that you did. - bill
Re: [zfs-discuss] Yager on ZFS
Thanks for the detailed reply, Robert. A significant part of it seems to be suggesting that high-end array hardware from multiple vendors may be *introducing* error sources that studies like CERN's (and Google's, and CMU's) never encountered (based, as they were, on low-end hardware). If so, then at least a major part of your improved experience is not due to using ZFS per se but to getting rid of the high-end equipment and using more reliable commodity parts: a remarkable thought - I wonder if anyone has ever done that kind of a study. A quick Google of ext3 fsck did not yield obvious examples of why people needed to run fsck on ext3, though it did remind me that by default ext3 runs fsck just for the hell of it every N (20?) mounts - could that have been part of what you were seeing? There are two problems with over-hyping a product: it gives competitors something legitimate to refute, and it leaves the impression that the product has to be over-sold because it doesn't have enough *real* merits to stand on. Well, for people like me there's a third problem: we just don't like spin. When a product has legitimate strengths that set it apart from the pack, it seems a shame to over-sell it the same way that a mediocre product is sold and waste the opportunity to take the high ground that it actually does own. I corrected your misunderstanding about WAFL's separate checksums in my October 26th response to you in http://storagemojo.com/2007/10/25/sun-fires-back-at-netapp/ - though in that response I made a reference to something that I seem to have said somewhere (I have no idea where) other than in that thread. In any event, one NetApp paper detailing their use is 3356.pdf (first hit if you Google Introduction to Data ONTAP 7G) - search for 'checksum' and read about block and zone checksums in locations separate from the data that they protect. As just acknowledged above, I occasionally recall something incorrectly. I now believe that the mechanisms described there were put in place more to allow use of disks with standard 512-byte sector sizes than specifically to separate the checksums from the data, and that while thus separating the checksums may achieve a result similar to ZFS's in-parent checksums the quote that you provided may indicate the primary mechanism that WAFL uses to validate its data: whether the 'checksums' reside with the data or elsewhere, I now remember reading (I found the note that I made years ago, but it didn't provide a specific reference and I just spent an hour searching NetApp's Web site for it without success) that the in-block (or near-to-block) 'checksums' include not only file identity and offset information but a block generation number (I think this is what the author meant by the 'identity' of the block) that increments each time the block is updated, and that this generation number is kept in the metadata block that points to the file block, thus allowing the metadata block to verify with a high degree of certainty that the target block is indeed not only the right file block, containing the right data, but the right *version* of that block. As I said, thanks (again) for the detailed response, - bill
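As I understand that scheme, the point of keeping the generation number in the *parent* is that it catches failures an in-block checksum alone cannot: a 'lost' write leaves the old version of the block on disk, internally self-consistent and passing its own checksum, but stale. A toy model (mine - not NetApp's or Sun's actual format) of how the parent-side check works:

    # Toy model: each block carries its own checksum (catching random
    # corruption), while the parent records the generation number it
    # expects - catching a lost write, where the old, self-consistent
    # version of the block still sits on disk.
    import hashlib

    def make_block(data, generation):
        return {"data": data, "gen": generation,
                "sum": hashlib.sha256(data).hexdigest()}

    def read_via_parent(disk, addr, expected_gen):
        blk = disk[addr]
        if hashlib.sha256(blk["data"]).hexdigest() != blk["sum"]:
            raise IOError("in-block checksum mismatch: corruption")
        if blk["gen"] != expected_gen:
            raise IOError("generation mismatch: stale block (lost write)")
        return blk["data"]

    disk = {7: make_block(b"version 1", generation=1)}
    # The file system believes it rewrote block 7 as generation 2, but
    # the write was silently dropped; the parent now expects gen 2.
    try:
        read_via_parent(disk, 7, expected_gen=2)
    except IOError as exc:
        print("caught:", exc)      # the in-block checksum alone passes

ZFS's in-parent checksum of the child's *contents* reaches the same end by a slightly different route: the stale block's checksum simply fails to match the one the parent recorded at the last update.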
Re: [zfs-discuss] Yager on ZFS
bull* -- richard Hmmm. Was that bull* as in Numbers? We don't need no stinking numbers! We're so cool that we work for a guy who thinks he's Steve Jobs! or Silly engineer! Can't you see that I've got my rakish Marketing hat on? Backwards! or I jes got back from an early start on my weekend an you better [hic] watch what you say, buddy, if you [Hic] don't want to get a gallon of [HIC] slightly-used beer and nachos all over your [HIC!] shoes [HIIICCC!!!] oh, sh- [BLARRG] Inquiring minds want to know. - bill