Re: [zfs-discuss] ZFS Dedup question
On Fri, Jan 28, 2011 at 01:38:11PM -0800, Igor P wrote:
> I created a zfs pool with dedup with the following settings:
> zpool create data c8t1d0
> zfs create data/shared
> zfs set dedup=on data/shared
>
> The thing I was wondering about was it seems like ZFS only dedups at
> the file level and not the block. When I make multiple copies of a
> file to the store I see an increase in the dedup ratio, but when I
> copy similar files the ratio stays at 1.00x.

Dedup is done at the block level, not the file level. "Similar files" does not mean that they actually share common blocks. You'll have to look more closely to determine whether they do.

Nico
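P.S.: You can see the block-vs-file distinction for yourself. A sketch using the OP's pool "data"; the file names are made up and the ratios are illustrative, not measured:

$ dd if=/dev/urandom of=/data/shared/big bs=128k count=1024 2>/dev/null
$ cp /data/shared/big /data/shared/copy    # identical blocks: dedups fully
$ zpool list -H -o dedupratio data
2.00x
$ (printf X; cat /data/shared/big) > /data/shared/shifted
$ zpool list -H -o dedupratio data
1.50x
# The shifted file has the "same" content, but offset by one byte it no
# longer aligns on the 128K block boundaries, so almost nothing dedups.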
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Tue, Jan 18, 2011 at 07:16:04AM -0800, Orvar Korvar wrote:
> BTW, I thought about this. What do you say?
>
> Assume I want to compress data and I succeed in doing so. And then I
> transfer the compressed data. So all the information I transferred is
> the compressed data. But then you don't count all the information:
> knowledge about which algorithm was used, which number system, laws of
> math, etc. So there's lots of other information that is implicit
> when you compress/decompress -- not just the data.
>
> So, if you add the data and all implicit information you get a certain
> bit size X. Do this again on the same set of data, with another
> algorithm, and you get another bit size Y.
>
> You compress the data, using lots of implicit information. If you use
> less implicit information (a simple algorithm relying on simple math),
> will X be smaller than if you use lots of implicit information
> (an advanced algorithm relying on a large body of advanced math)? What
> can you say about the numbers X and Y? Advanced math requires many
> math books that you need to transfer as well.

Just as the laws of thermodynamics preclude perpetual motion machines, so do they preclude infinite, lossless data compression. Yes, thermodynamics and information theory are linked, amazingly enough.

Data compression algorithms work by identifying certain types of patterns, then replacing the input with notes such as "pattern 1 is ... and appears at offsets 12345 and 1234567" (I'm simplifying a lot). Data that has few or no patterns observable by the compression algorithm in question will not compress, and will expand if you insist on compressing it: randomly-generated data (e.g., the output of /dev/urandom) will not compress at all. Even the single bit needed to indicate whether a file is compressed or not means expansion when you fail to compress and store the original instead of the "compressed" version.

Data compression reduces repetition, thus making it harder to further compress already-compressed data. Try it yourself: build a pipeline of all the compression tools you have and see how many rounds of compression you can apply to typical data before further compression fails.

Nico
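P.S.: A quick experiment along those lines (file names arbitrary; exact sizes will vary by gzip version):

$ dd if=/dev/urandom of=rand bs=1024k count=10 2>/dev/null
$ gzip -c rand > rand.gz          # round 1: output slightly *larger* than rand
$ gzip -c rand.gz > rand.gz.gz    # round 2: larger still
$ ls -l rand rand.gz rand.gz.gz

Text, by contrast, compresses well exactly once: the first round squeezes out the repetition, leaving the second round nothing to find.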
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Sat, Jan 15, 2011 at 10:19:23AM -0600, Bob Friesenhahn wrote:
> On Fri, 14 Jan 2011, Peter Taps wrote:
>
> >Thank you for sharing the calculations. In lay terms, for Sha256,
> >how many blocks of data would be needed to have one collision?
>
> Two.

Pretty funny. In this thread some of you are treating SHA-256 as an idealized hash function. The odds of accidentally finding collisions in an idealized 256-bit hash function are minute because the distribution of hash function outputs over inputs is random (or, rather, pseudo-random). But cryptographic hash functions are generally only approximations of idealized hash functions. There's nothing to say that there aren't pathological corner cases where a given hash function produces lots of collisions that would be semantically meaningful to people -- i.e., a set of inputs over which the outputs are not randomly distributed. Now, of course, we don't know of such pathological corner cases for SHA-256, but not that long ago we didn't know of any for SHA-1 or MD5 either.

The question of whether disabling verification would improve performance is pretty simple: if you have highly deduplicatious, _synchronous_ (or nearly so, due to frequent fsync()s or NFS close operations) writes, and the working set does not fit in the ARC or L2ARC, then yes, disabling verification will help significantly, by removing an average of at least half a disk rotation from the write latency. The same goes for a workload with asynchronous writes that might as well be synchronous due to an undersized cache (relative to the workload). Otherwise the cost of verification should be hidden by caching.

Another way to put this: first determine that verification is actually affecting performance, and only _then_ consider disabling it. But if you want to have the freedom to disable verification, then you should be using SHA-256 (or switch to it when disabling verification). Safety features that cost nothing are not worth turning off, so make sure their cost is significant before even thinking of turning them off.

Similarly, the cost of SHA-256 vs. Fletcher should also be lost in the noise if the system has enough CPU, but if the choice of hash function could make the system CPU-bound instead of I/O-bound, then the choice of hash function would make an impact on performance. The choice of hash function also has a different performance impact than verification: a slower hash function will affect non-deduplicatious workloads more than highly deduplicatious workloads (since the latter will require more I/O for verification, which will overwhelm the cost of the hash function). Again, measure first.

Nico
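P.S.: For reference, both knobs are per-dataset properties, so measuring the difference is cheap (dataset name hypothetical):

$ zfs set dedup=sha256,verify tank/data    # hash plus byte-for-byte verify
$ zfs set dedup=sha256 tank/data           # hash only, no verify

Benchmark your actual workload under each before deciding.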
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Fri, Jan 07, 2011 at 06:39:51AM -0800, Michael DeMan wrote:
> On Jan 7, 2011, at 6:13 AM, David Magda wrote:
> > The other thing to note is that by default (with de-dupe disabled), ZFS
> > uses Fletcher checksums to prevent data corruption. Add also the fact that all
> > other file systems don't have any checksums, and simply rely on the fact
> > that disks have a bit error rate of (at best) 10^-16.
>
> Agreed - but I think it is still missing the point of what the
> original poster was asking about.
>
> In all honesty I think the debate is a business decision - the highly
> improbable vs. certainty.

The OP seemed to be concerned that SHA-256 is particularly slow, so the business decision here would involve a performance vs. error rate trade-off. Now, unless you have highly deduplicatious data, a workload with a high ARC hit ratio for DDT entries, and a fast ZIL device, I suspect that the I/O costs of dedup dominate the cost of the hash function, which means the above business trade-off is not worthwhile: one would be trading a tiny uptick in error rates for a small uptick in performance. Before you even get to the point of making such a decision you'll want to have invested in plenty of RAM, L2ARC, and fast ZIL device capacity -- and for those making that investment I suspect that the OP's trade-off won't seem worthwhile.

BTW, note that verification isn't guaranteed to have a zero error rate either... Imagine: a) a block being written collides with a different block already in the pool, b) bit rot on disk in that colliding block such that the on-disk block matches the new block, c) on a mirrored vdev, such that you might get one or another version of the block in question, randomly. Such an error requires monumentally bad luck to happen at all.

Nico
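P.S.: You can estimate how deduplicatious your data actually is before investing in any of this: zdb can simulate dedup on an existing pool without changing anything (pool name hypothetical):

$ zdb -S tank    # prints a simulated DDT histogram and overall dedup ratio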
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Thu, Jan 06, 2011 at 06:07:47PM -0500, David Magda wrote:
> On Jan 6, 2011, at 15:57, Nicolas Williams wrote:
>
> > Fletcher is faster than SHA-256, so I think that must be what you're
> > asking about: "can Fletcher+Verification be faster than
> > Sha256+NoVerification?" Or do you have some other goal?
>
> Would running on recent T-series servers, which have on-die
> crypto units, help any in this regard?

Yes, particularly for larger blocks.

Hash collisions don't matter as long as ZFS verifies dups, so the real question is: what is the false positive dup rate (i.e., the accidental collision rate)? That's going to vary a lot by {hash function, working data set}, so it's not possible to make exact determinations, just estimates.

For me the biggest issue is that, as good as Fletcher is for a CRC, I'd rather have a cryptographic hash function because I've seen incredibly odd CRC failures before. There's a famous case from within SWAN a few years ago where a switch flipped pairs of bits such that all too often the various CRCs that applied to the moving packets failed to detect the bit flips; we discovered this when an SCCS file in a clone of the ON gate got corrupted. Such failures (collisions) wouldn't affect dedup, but they would mask corruption of non-deduped blocks.

Nico
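P.S.: On Solaris you can check whether such hardware crypto providers are present (output abbreviated; n2cp is what you'd see on an UltraSPARC T2):

$ cryptoadm list
...
Kernel hardware providers:
n2cp/0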
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Thu, Jan 06, 2011 at 11:44:31AM -0800, Peter Taps wrote:
> I have been told that the checksum value returned by Sha256 is almost
> guaranteed to be unique.

All hash functions are guaranteed to have collisions (for inputs larger than their output, anyway).

> In fact, if Sha256 fails in some case, we
> have a bigger problem such as memory corruption, etc. Essentially,
> adding verification to sha256 is an overkill.

What makes a hash function cryptographically secure is not the impossibility of collisions, but the computational difficulty of finding arbitrary colliding input pairs, collisions for known inputs, second pre-images, and first pre-images. Just because you can't easily find collisions on purpose doesn't mean that you can't accidentally find collisions. That said, if the distribution of SHA-256 is even enough, then your chances of finding a collision by accident are so remote (on the order of one in 2^128) that you could reasonably decide that you don't care.

> Perhaps (Sha256+NoVerification) would work 99.99% of the time. But
> (Fletcher+Verification) would work 100% of the time.

Fletcher is faster than SHA-256, so I think that must be what you're asking about: "can Fletcher+Verification be faster than Sha256+NoVerification?" Or do you have some other goal?

Assuming I guessed correctly: the speed of the hash function isn't significant compared to the cost of the verification I/O, period, end of story. So SHA-256 without verification will be faster than Fletcher + verification -- lots faster if you have particularly deduplicatious data to write. Moreover, SHA-256 + verification will likely be somewhat faster than Fletcher + verification, because SHA-256 will likely have fewer collisions than Fletcher, and the cost of I/O dominates the cost of the hash functions.

> Which one of the two is a better deduplication strategy?
>
> If we do not use verification with Sha256, what is the worst case
> scenario? Is it just more disk space occupied (because of failure to
> detect duplicate blocks) or is there a chance of actual data
> corruption (because two blocks were assumed to be duplicates although
> they are not)?

If you don't verify, then you run the risk of corruption on collision, NOT the risk of using too much disk space.

> Or, if I go with (Sha256+Verification), how much is the overhead of
> verification on the overall process?
>
> If I do go with verification, it seems (Fletcher+Verification) is more
> efficient than (Sha256+Verification). And both are 100% accurate in
> detecting duplicate blocks.

You're confused. Fletcher may be faster to compute than SHA-256, but the run-time of both is as nothing compared to the latency of the disk I/O needed for verification, which means that the hash function's rate of collisions is more important than its computational cost. (Now, Fletcher is not thought to be a cryptographically secure hash function, while SHA-256 is, for now, considered cryptographically secure. That probably means that the distribution of Fletcher's outputs over random inputs is not as even as that of SHA-256, which probably means you can expect more collisions with Fletcher than with SHA-256. Note that I made no absolute statements in the previous sentence -- that's because I've not read any studies of Fletcher's collision behavior relative to SHA-256, thus I'm not certain of anything stated in the previous sentence.)

David Magda's advice is spot on.

Nico
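P.S.: To put "so remote" in perspective, a birthday-bound sketch (the pool size is made up): with N random blocks, the odds of any accidental collision in an ideal 256-bit hash are roughly N^2 / 2^257. For 2^35 blocks -- a 4PB pool of 128KB blocks:

$ echo 'scale=80; (2^35 * 2^35) / 2^257' | bc

which is about 2^-187, i.e., on the order of 10^-57. Disk bit error rates dwarf that.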
Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL
On Mon, Dec 27, 2010 at 09:06:45PM -0500, Edward Ned Harvey wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Nicolas Williams
> >
> > > Actually I'd say that latency has a direct relationship to IOPS because
> > > it's the time it takes to perform an IO that determines how many IOs
> > > Per Second that can be performed.
> >
> > Assuming you have enough synchronous writes and that you can organize
> > them so as to keep the drive at max sustained sequential write
> > bandwidth, then IOPS == bandwidth / logical I/O size. Latency doesn't
>
> Ok, what we've hit here is two people using the same word to talk about
> different things. Apples to oranges, as it were. Both meanings of "IOPS"
> are ok, but context is everything.
>
> There are drive random IOPS, which is dependent on latency and seek time,
> and there is also measured random IOPS above the filesystem layer, which is
> not always related to latency or seek time, as described above.

Clearly the application cares about _synchronous_ operations that are meaningful to it. In the case of an NFS application that would be open() with O_CREAT (and particularly O_EXCL), close(), fsync(), and so on. For a POSIX (but not NFS) application the set of synchronous operations is smaller. The rate of asynchronous operations is less important to the application because those are subject to caching, and thus less predictable. But to the filesystem, IOPS are not just about synchronous I/O but about how many distinct I/O operations can be completed per unit of time. I tried to keep this distinction clear; sorry for any confusion.

Nico
Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL
On Sat, Dec 25, 2010 at 08:37:42PM -0500, Ross Walker wrote:
> On Dec 24, 2010, at 1:21 PM, Richard Elling wrote:
>
> > Latency is what matters most. While there is a loose relationship
> > between IOPS and latency, you really want low latency. For 15krpm
> > drives, the average latency is 2ms for zero seeks. A decent SSD will
> > beat that by an order of magnitude.
>
> Actually I'd say that latency has a direct relationship to IOPS because
> it's the time it takes to perform an IO that determines how many IOs Per
> Second that can be performed.

Assuming you have enough synchronous writes, and that you can organize them so as to keep the drive at max sustained sequential write bandwidth, then IOPS == bandwidth / logical I/O size. Latency doesn't enter into that formula.

Latency does remain, though, and will be noticeable to apps doing synchronous operations. At, say, 100MB/s sustained sequential write bandwidth and, say, 2KB average ZIL entries, you'd get 51200 logical sync write operations per second. The latency for each such operation would still be 2ms (or whatever it is for the given disk). Since you'd likely have to batch many ZIL writes, you'd end up making the latency for some ops longer than 2ms and others shorter, but if you can keep the drive at max sustained sequential write bandwidth then the average latency will be 2ms. SSDs are clearly a better choice.

BTW, a parallelized tar would greatly help reduce the impact of high-latency open()/close() (over NFS) operations...

Nico
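P.S.: The arithmetic, for the record:

$ echo '100 * 1024 * 1024 / 2048' | bc    # 100MB/s of 2KB ZIL entries
51200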
Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL
On Thu, Dec 23, 2010 at 11:25:43AM +0100, Stephan Budach wrote:
> as I have learned from the discussion about which SSD to use as ZIL
> drives, I stumbled across this article, that discusses short
> stroking for increasing IOPs on SAS and SATA drives:

There was a thread on this a while back; I forget when, or the subject. But yes, you could even use 7200 rpm drives to make a fast ZIL device. The trick is the on-disk format, and the pseudo-device driver that you would have to layer on top of the actual device(s) to get such performance. The key is that sustained sequential I/O rates for disks can be quite large, so if you organize the disk in log form and use the outer tracks only, then you can pretend to have awesome write IOPS for a disk (but NOT read IOPS).

But it's not necessarily as cheap as you might think. You'd be making very inefficient use of an expensive disk (in the case of a 15k rpm SAS disk), or disks, and if plural then you are also using more ports (oops). Disks used this way probably also consume more power than SSDs (OK, this part of my analysis is very iffy), and you still need to do something about ensuring syncs to disk on power failure (such as just disabling the cache on the disk, but that would lower performance, increasing the cost). When you factor all the costs in, I suspect you'll find that SSDs are priced reasonably well. That's not to say that one could not put together a disk-based log device that could eat SSDs' lunch, but SSD prices would then just come down to match -- and you can expect SSD prices to come down anyway, as with any new technology.

I don't mean to discourage you, just to point out that there's plenty of work to do to make "short-stroked disks as ZILs" a workable reality, while the economics of doing that work versus waiting for SSD prices to come down don't seem appealing.

Caveat emptor: my analysis is off-the-cuff; I could be wrong.

Nico
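P.S.: For comparison, the SSD route is a one-liner today (device names hypothetical); mirroring the slog is prudent:

$ zpool add tank log mirror c4t0d0 c5t0d0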
Re: [zfs-discuss] stupid ZFS question - floating point operations
On Thu, Dec 23, 2010 at 09:32:13AM +0000, Darren J Moffat wrote:
> On 22/12/2010 20:27, Garrett D'Amore wrote:
> > That said, some operations -- and cryptographic ones in particular --
> > may use floating point registers and operations because for some
> > architectures (sun4u rings a bell) this can make certain expensive
>
> Well remembered! There are sun4u optimisations that use the
> floating point unit but those only apply to the bignum code which in
> kernel is only used by RSA.
>
> > operations go faster. I don't think this is the case for secure
> > hash/message digest algorithms, but if you use ZFS encryption as found
> > in Solaris 11 Express you might find that on certain systems these
> > registers are used for performance reasons, either on the bulk crypto or
> > on the keying operations. (More likely the latter, but my memory of
> > these optimizations is still hazy.)
>
> RSA isn't used at all by ZFS encryption, everything is AES
> (including key wrapping) and SHA256.
>
> So those optimisations for floating point don't come into play for
> ZFS encryption.

Moreover, we have platform-specific crypto optimizations. If there were FPU operations that helped speed up symmetric crypto on an M4000 but not on UltraSPARC T2s, then we'd use them on the one but not on the other.

Nico
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
Also, when the IV is stored you can more easily look for accidental IV re-use, and if you can find hash collisions, then you can even cause IV re-use (if you can write to the filesystem in question). For GCM, IV re-use is rather fatal (for CCM it's bad, but IIRC not fatal), so I'd not use GCM with dedup either.

Nico
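P.S.: If you do combine the two anyway, that argues for the CCM modes (these are Solaris 11 Express property values; the dataset name is hypothetical):

$ zfs create -o encryption=aes-256-ccm -o dedup=on tank/dd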
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
On Wed, Nov 17, 2010 at 01:58:06PM -0800, Bill Sommerfeld wrote:
> On 11/17/10 12:04, Miles Nordin wrote:
> >black-box crypto is snake oil at any level, IMNSHO.
>
> Absolutely.

As Darren said, much of the design has been discussed in public, and reviewed by cryptographers. It'd be nicer if we had a detailed paper, though.

> >Congrats again on finishing your project, but every other disk
> >encryption framework I've seen taken remotely seriously has a detailed
> >paper describing the algorithm, not just a list of features and a
> >configuration guide. It should be a requirement for anything treated
> >as more than a toy. I might have missed yours, or maybe it's coming
> >soon.
>
> In particular, the mechanism by which dedup-friendly block IVs are
> chosen based on the plaintext needs public scrutiny. Knowing
> Darren, it's very likely that he got it right, but in crypto all
> the details matter, and if a spec detailed enough to allow for
> interoperability isn't available, it's safest to assume that some of
> the details are wrong.

Dedup + crypto does have security implications. Specifically: it facilitates "traffic" analysis, and then known- and even chosen-plaintext attacks (if there were any practical such attacks on the cipher). For example, IIUC, the ratio of dedup vs. non-dedup blocks + analysis of dnodes and their data sizes (in blocks) + per-dnode dedup ratios can probably be used to identify OS images, which would then help mount known-plaintext attacks. For a mailstore you'd be able to distinguish mail sent or kept by a single local user vs. mail sent to and kept by more than one local user, and by sending mail you could help mount chosen-plaintext attacks. And so on.

My advice would be to not bother encrypting OS images, and if you encrypt only documents, then dedup is likely of less or no interest to you -- in general, you may not want to bother with dedup + crypto. However, it is fantastic that crypto and dedup can work together.

Nico
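P.S.: For the "encrypt documents, skip dedup" case, that's just (dataset name hypothetical):

$ zfs create -o encryption=on -o dedup=off tank/docs    # prompts for a passphrase by default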
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
On Sat, Oct 09, 2010 at 09:52:51PM -0700, Richard Elling wrote:
> Are we living in the past?
>
> In the bad old days, UNIX systems spoke NFS and Windows systems spoke
> CIFS. The cost of creating a file system was expensive -- slices,
> partitions, etc.
>
> With ZFS, file systems (datasets) are relatively inexpensive.
>
> So, are we putting too many constraints into a system (ZFS) which is
> busy trying to remove constraints? Is it reasonable to expect that
> ZPL is the only kind of "file system" ZFS customers need? Is it high
> time for a ZCIFS dataset?

I don't quite understand what you mean. ZPL is just a POSIX layer. It _happens_ to be used not just by the system call layer in Solaris, but also by the SMB and NFS servers, but you could also imagine the SMB and NFS servers using the DMU directly while maintaining on-disk compatibility with the ZPL. Not using the ZPL does not necessitate having a different on-disk format, or different semantics.

Now, if you were asking about dataset properties that make a dataset behave more like what Windows expects or more like what Unix expects, that's different, but that wouldn't require junking the ZPL.

Nico
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
On Wed, Oct 06, 2010 at 05:19:25PM -0400, Miles Nordin wrote:
> >>>>> "nw" == Nicolas Williams writes:
>
> nw> *You* stated that your proposal wouldn't allow Windows users
> nw> full control over file permissions.
>
> me: I have a proposal
>
> you: op! OP op, wait! DOES YOUR PROPOSAL blah blah WINDOWS blah blah
> COMPLETELY AND EXACTLY LIKE THE CURRENT ONE.
>
> me: no, but what it does is...

The correct quote is: "no, not under my proposal." That's from a post from you on September 30, 2010, with Message-Id: . That was a direct answer to a direct question.

Now, maybe you wish to change your view. That'd be fine. Do not, however, imply that I'm a liar, not if you want to be taken seriously.

Please re-write your proposal _clearly_ and refrain from personal attacks.

Cheers,

Nico
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
On Wed, Oct 06, 2010 at 04:38:02PM -0400, Miles Nordin wrote:
> >>>>> "nw" == Nicolas Williams writes:
>
> nw> The current system fails closed
>
> wrong.
>
> $ touch t0
> $ chmod 444 t0
> $ chmod A0+user:$(id -nu):write_data:allow t0
> $ ls -l t0
> -r--r--r--+ 1 carton carton 0 Oct 6 20:22 t0
>
> now go to an NFSv3 client:
>
> $ ls -l t0
> -r--r--r-- 1 carton 405 0 2010-10-06 16:26 t0
> $ echo lala > t0
> $
>
> wide open.

The system does what the ACL says. The mode fails to accurately represent the actual access because... the mode can't. Now, we could have chosen (and still could choose) to represent the presence of ACEs for subjects other than owner@/group@/everyone@ by using the group bits of the mode to represent the maximal set of permissions granted. But I don't consider the above "failing open".

> nw> You seem to be in denial. You continue to ignore the
> nw> constraint that Windows clients must be able to fully control
> nw> permissions in spite of their inability to perceive and modify
> nw> file modes.
>
> You remain unshakably certain that this is true of my proposal in
> spite of the fact that you've said clearly that you don't understand
> my proposal. That's bad science.

*You* stated that your proposal wouldn't allow Windows users full control over file permissions.

> It may be my fault that you don't understand it: maybe I need to write
> something shorter but just as expressive to fit within mailing list
> attention spans, or maybe my examples are unclear. However that
> doesn't mean that I'm in denial nor make you right---that just makes
> me annoying.

Yes, that may be. I encourage you to find a clearer way to express your proposal.

Nico
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
On Mon, Oct 04, 2010 at 02:28:18PM -0400, Miles Nordin wrote:
> >>>>> "nw" == Nicolas Williams writes:
>
> nw> I would think that 777 would invite chmods. I think you are
> nw> handwaving.
>
> it is how AFS worked. Since no file on a normal unix box besides /tmp

But would the AFS experience translate into double-plus happiness for us?

> ever had 777 it would send a SIGWTF to any AFS-unaware graybeards that
> stumbled onto the directory, alerting them that they needed to go
> learn something and come back.

A signal?! How would that work when the entity doing a chmod is on a remote NFS client?

> I understand that everything:everyone on windows doesn't send SIGWTF,
> but 777 on unix for AFS sites it did. You realize it's not
> hypothetical, right? AFS was actually implemented, widely, and
> there's experience with it.

Yes... but I'm skeptical about the universality of that experience's applicability. Specifically: I don't think it could work for us. AFS developers had fewer constraints than Solaris developers. It is no surprise that they were able to find happy solutions to these sorts of problems long ago. OpenAFS has a Windows native client and an Explorer shell extension (which surely handles chmod?). However, we don't have the luxury of telling customers to install third-party (possibly ours, whatever) Windows native clients for protocols other than SMB, nor can we tell them to install Explorer shell extensions. Solaris' SMB server needs to work out of the box and without the limitations implied by having a separate ACL and mode (well, we have that now, but we always compute a new mode from the new ACL when ACLs are changed).

> If they failed to act on the SIGWTF, the overall system enforced the
> tighter of the unix permissions and the AFS ACL, so it fails closed.
> The current system fails open.

The current system fails closed (by discarding the ACL and replacing it with a new one based entirely on the new mode).

> Also AFS did no translation between unix permissions and AFS ACL's so
> it was easy to undo such a mistake when it happened: double-check the
> AFS ACL is not too wide on the directories where you see unix people
> mucking around in case the muckers were responding to a real problem,
> then set the unix modes back to 777.

Right, but with SMB in the picture we don't have this luxury. You seem unwilling to accept that one constraint.

> nw> When chmod()ing an object... ZFS would search for the most
> nw> specific matching file in .zfs/ACLs/ and, if found, would
> nw> replace the chmod()ed object's ACL with that of the
> nw> .zfs/ACLs/... file found. The .inherit suffix would indicate
> nw> that if the chmod() target's parent directory has inheritable
> nw> ACEs then they will be groupmasked and added to the ACEs from
> nw> the .zfs/ACLs/... file to produce a final ACL.
>
> This proposal, like the current situation, seems to make chmod
> configurable to act like ``not chmod'' which IMHO is exactly what's
> unpopular about the current regime. You've tried to leave chmod
> active on windows trees and guess at the intent of whoever invokes
> chmod, providing no warning that you're secretly doing
> ``approximately'' what he asked for rather than exactly. Maybe that
> flies on Windows, but on Unix people expect more precision: thorough
> abstractions that survive corner cases and have good exception
> handling.

To some degree, yes. It's different though, and might conceivably be acceptable, though I don't think it will be (I was illustrating potential alternatives). But I really like one thing about it: most apps shouldn't care about ACL contents, they should care about context-specific permissions changes. In a directory containing shared documents the intention should typically be "share with all these people", while in home directories the intention should typically be "don't share with anyone" (but this will vary; e.g., ~/.ssh/authorized_keys needs to be reachable and readable by everyone). Add in executable versus not-executable, and you have a pretty complete picture -- just a few "named" ACLs at most, per-dataset.

If we could replace chmod(2) with a version that takes actual names for pre-configured ACLs, _that_ would be great. But we can't, for the same reason that we can't remove chmod(2): it's a widely used interface.

Look, mode is a pretty lame hammer -- ACLs are far, far more granular -- but it's a hammer that many apps use. Given the lack of granularity of modes, I think an approximation of intent is the best we can do. Consider: both aclmode=discard and aclmode=groupmask beh
Re: [zfs-discuss] Migrating to an aclmode-less world
On Mon, Oct 04, 2010 at 04:30:05PM -0600, Cindy Swearingen wrote:
> Hi Simon,
>
> I don't think you will see much difference for these reasons:
>
> 1. The CIFS server ignores the aclinherit/aclmode properties.

Because CIFS/SMB has no chmod operation. :)

> 2. Your aclinherit=passthrough setting overrides the aclmode
> property anyway.

aclinherit=passthrough-x is a better choice. Also, aclinherit doesn't override aclmode: aclinherit applies on create, and aclmode used to apply on chmod.

> 3. The only difference is that if you use chmod on these files
> to manually change the permissions, you will lose the ACL values.

Right. That only happens from NFSv3 clients (that don't instead edit the POSIX Draft ACL translated from the ZFS ACL), from non-Windows NFSv4 clients (that don't instead edit the ACL), and from local applications (that don't instead edit the ZFS ACL).

Nico
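P.S.: Concretely (dataset name hypothetical):

$ zfs set aclinherit=passthrough-x tank/shared
$ zfs get aclinherit tank/shared

passthrough-x behaves like passthrough except that inherited ACEs grant execute only when the creating application asked for it in the file creation mode.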
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
On Thu, Sep 30, 2010 at 08:14:24PM -0400, Miles Nordin wrote:
> >> Can the user in (3) fix the permissions from Windows?
>
> no, not under my proposal.

Let's give it a whirl anyway:

> but it sounds like currently people cannot ``fix'' permissions through
> the quirky autotranslation anyway, certainly not to the point where
> neither unix nor windows users are confused: windows users are always
> confused, and unix users don't get to see all the permissions.

No, that's not right. Today you can fix permissions from any NFSv4 client that exports an NFSv4-style ACL interface to users. You can fix permissions from Windows. You can fix permissions from a local Solaris shell. You can also fix permissions from NFSv3 clients (but you get POSIX Draft -> ZFS translated ACLs, which are confusing because they tend to result in DENY ACEs being scattered all over). You can also chmod, but you lose your ACL if you do that.

> >> Now what?
>
> set the unix perms to 777 as a sign to the unix people to either (a)
> leave it alone, or (b) learn to use 'chmod A...'. This will actually
> work: it's not a hand-waving hypothetical that just doesn't play out.

I would think that 777 would invite chmods. I think you are handwaving.

> What I provide, which we don't have now, is a way to make:
>
> /tub/dataset/a subtree
>
> -rwxrwxrwx                        in old unix
> [working, changeable permissions] in windows
>
> /tub/dataset/b subtree
>
> -rw-r--r--                        in old unix
> [everything: everyone]            in windows, but unix permissions
>                                   still enforced
>
> this means:
>
> * unix writers and windows writers can cooperate even within a single
>   dataset
>
> * an intuitive warning sign when non-native permissions are in effect,
>
> * fewer leaked-data surprises

I don't understand what exactly you're proposing. You've not said anything about how chmod is to be handled.

> If you accept that the autotranslation between the two permissions
> regimes is total shit, which it is, then what I offer is the best you
> can hope for.

If I could understand what you're proposing I might agree, who knows. But I do think there are other possibilities, some probably better than what you propose (whatever that is). Here's a crazy alternative that might work (or not): allow users to pre-configure named ACLs where the names are {owner, group, mode}. E.g., we could have:

.zfs/ACLs/<owner>/[<group>:][d|-]<mode>[.inherit]
           ^                  ^             ^
           |                  |             +-- see below
           |                  +-- applies to directory or other objects
           +-- owned by

When chmod()ing an object, ZFS would search for the most specific matching file in .zfs/ACLs/ and, if found, would replace the chmod()ed object's ACL with that of the .zfs/ACLs/... file found. The .inherit suffix would indicate that if the chmod() target's parent directory has inheritable ACEs then they will be groupmasked and added to the ACEs from the .zfs/ACLs/... file to produce a final ACL.

E.g., a chmod(0644) of /a/b/c/foo (say, a file owned by 'joe' with group 'staff', with /, /a, /a/b, and /a/b/c all being datasets), where c has inheritable ACEs, would cause ZFS to search for .zfs/ACLs/joe/staff:-rw-r--r--.inherit, .zfs/ACLs/joe/-rw-r--r--.inherit, .zfs/ACLs/joe/staff:-rw-r--r--, and .zfs/ACLs/joe/-rw-r--r--, first in /a/b/c, then /a/b, then /a, then /.

I said this is "crazy". Is it? I think it probably is. This would almost certainly prove to be a hard-to-use design. Users would need to be educated in order not to be surprised... OTOH, it puts much more control in the hands of the user. These named ACLs could be inherited from parent datasets as a way to avoid having to set them up too many times. And with the .inherit twist it probably has enough granularity of control to be useful (particularly if users are dataset-happy). Finally, these could even be managed remotely.

I see zero chance of such a design being adopted. It'd be better, IMO, to go for non-POSIX-equivalent groupmasking and translations of POSIX mode_t and POSIX Draft ACLs to ZFS ACLs. For example: take the current translations, remove all owner@ and group DENY ACEs, then sort any remaining user DENY ACEs to be first, and any everyone@ DENY ACEs to be last. The results would surely be surprising to some users, but the kinds of mode_t and POSIX Draft ACLs where surprise is likely are rare.

That's two alternatives right there.

Nico
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
On Thu, Sep 30, 2010 at 08:14:24PM -0400, Miles Nordin wrote:
> >> Can the user in (3) fix the permissions from Windows?
>
> no, not under my proposal.

Then your proposal is a non-starter. Support for multiple remote filesystem access protocols is key for ZFS and Solaris. The impedance mismatches between these various protocols mean that we need to make some trade-offs. In this case I think the business (as well as the engineers involved) would assert that being a good SMB server is critical, and that being able to authoritatively edit file permissions via SMB clients is part of what it means to be a good SMB server.

Now, you could argue that we should bring aclmode back and let the user choose which trade-offs to make. And you might propose new values for aclmode or enhancements to the groupmask setting of aclmode.

> but it sounds like currently people cannot ``fix'' permissions through
> the quirky autotranslation anyway, certainly not to the point where
> neither unix nor windows users are confused: windows users are always
> confused, and unix users don't get to see all the permissions.

Thus the current behavior is the same as the old aclmode=discard setting.

> >> Now what?
>
> set the unix perms to 777 as a sign to the unix people to either (a)
> leave it alone, or (b) learn to use 'chmod A...'. This will actually
> work: it's not a hand-waving hypothetical that just doesn't play out.

That's not an option, not for a default behavior anyway.

Nico
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
On Thu, Sep 30, 2010 at 03:28:14PM -0500, Nicolas Williams wrote:
> Consider this chronologically-ordered sequence of events:
>
> 1) File is created via Windows, gets an SMB/ZFS/NFSv4-style ACL, including
>    inheritable ACEs. A mode computed from this ACL might be 664, say.
>
> 2) A Unix user does chmod(644) on that file, and one way or another this
>    effectively reduces permissions otherwise granted by the ACL.
>
> 3) Another Windows user now fails to get write permission that they should
>    have, so they complain, and then the owner tries to view/change the
>    ACL from a Windows desktop.
>
> Now what?
>
> Can the user in (3) fix the permissions from Windows? For that to be
> possible the mode must implicitly get recomputed when the ACL is
> modified.

Also, even if in (3) the user can fix the perms from Windows because we'd recompute the mode from the ACL, the user wouldn't be able to see the "effective" ACL (as "reduced" by the mode_t that Windows can't see). The only way to address that is... to do groupmasking. And that gets us back to the problems we had with groupmasking.

Nico
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side
On Thu, Sep 30, 2010 at 02:55:26PM -0400, Miles Nordin wrote:
> >>>>> "nw" == Nicolas Williams writes:
>
> nw> Keep in mind that Windows lacks a mode_t. We need to interop
> nw> with Windows. If a Windows user cannot completely change file
> nw> perms because there's a mode_t completely out of their
> nw> reach... they'll be frustrated.
>
> well...AIUI this already works very badly, so keep that in mind, too.
>
> In AFS this is handled by most files having 777, and we could do the
> same if we had an AND-based system. This is both less frustrating and
> more self-documenting than the current system.
>
> In an AND-based system, some unix users will be able to edit the
> windows permissions with 'chmod A...'. In shops using older unixes
> where users can only set mode bits, the rule becomes ``enforced
> permissions are the lesser of what Unix people and Windows people
> apply.'' This rule is easy to understand, not frustrating, and
> readily encourages ad-hoc cooperation (``can you please set
> everything-everyone on your subtree? we'll handle it in unix.'' /
> ``can you please set 777 on your subtree? or 770 group windows? we
> want to add windows silly-sid-permissions.''). This is a big step
> better than existing systems with subtrees where Unix and Windows
> users are forced to cooperate.

Consider this chronologically-ordered sequence of events:

1) File is created via Windows, gets an SMB/ZFS/NFSv4-style ACL, including inheritable ACEs. A mode computed from this ACL might be 664, say.

2) A Unix user does chmod(644) on that file, and one way or another this effectively reduces permissions otherwise granted by the ACL.

3) Another Windows user now fails to get write permission that they should have, so they complain, and then the owner tries to view/change the ACL from a Windows desktop.

Now what? Can the user in (3) fix the permissions from Windows? For that to be possible the mode must implicitly get recomputed when the ACL is modified. What if (2) happens again?

But, OK, this is a problem no matter what, whether we do groupmasking, discard, or keep the mode separate from the ACL and AND the two. ZFS does, in fact, keep a separate mode, and it does recompute it when ACLs are modified. So this may just be a matter of doing the AND thing and not touching the ACL on chmod. Is that what you have in mind?

> It would certainly work much better than the current system, where you
> look at your permissions and don't have any idea whether you've got
> more, less, or exactly the same permission as what your software is
> telling you: the crappy autotranslation teaches users that all bets
> are off.

No, currently you look at the permissions and they reflect the ACL (with the group bits being the max of all non-owner@ and non-everyone@ ACEs).

> It would be nice if, under my proposal, we could delete the unix
> tagspace entirely:
>
> chpacl '(unix)' chmod -R A- .

Huh?

> but unfortunately, deletion of ACL's is special-cased by Solaris's
> chmod to ``rewrite ACL's that match the UNIX permissions bits,'' so it
> would probably have to stay special-cased in a tagspace system.

Nico
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side (was: zfs proerty aclmode gone in 147?)
On Wed, Sep 29, 2010 at 05:21:51PM -0500, Nicolas Williams wrote:
> On Wed, Sep 29, 2010 at 03:09:22PM -0700, Ralph Böhme wrote:
> > > Keep in mind that Windows lacks a mode_t. We need to
> > > interop with Windows.
> >
> > Oh my, I see. Another itch to scratch. Now at least Windows users are
> > happy while me and mabye others are not.
>
> Yes. Pardon me for forgetting to mention this earlier. There's so many
> wrinkles here... But this is one of the biggers; I should not have

s/biggers/biggest/

> forgotten it.
>
> Nico
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side (was: zfs proerty aclmode gone in 147?)
On Wed, Sep 29, 2010 at 03:09:22PM -0700, Ralph Böhme wrote:
> > Keep in mind that Windows lacks a mode_t. We need to
> > interop with Windows.
>
> Oh my, I see. Another itch to scratch. Now at least Windows users are
> happy while me and mabye others are not.

Yes. Pardon me for forgetting to mention this earlier. There's so many wrinkles here... But this is one of the biggers; I should not have forgotten it.

Nico
Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side (was: zfs proerty aclmode gone in 147?)
Keep in mind that Windows lacks a mode_t. We need to interop with Windows. If a Windows user cannot completely change file perms because there's a mode_t completely out of their reach... they'll be frustrated.

Thus an ACL-and-mode model where both are applied doesn't work. It'd be nice, but it won't work. The mode has to be entirely encoded by the ACL. But we can't resort to interesting encoding tricks, as Windows users won't understand them.

Nico
Re: [zfs-discuss] zfs proerty aclmode gone in 147?
On Wed, Sep 29, 2010 at 03:44:57AM -0700, Ralph Böhme wrote:
> > On 9/28/2010 2:13 PM, Nicolas Williams wrote:
> > The version of samba bundled with Solaris 10 seems to
> > insist on
> > chmod'ing stuff. I've tried all of the various

Just in case it's not clear, I did not write the quoted text. (One can tell from the level of quotation that an attribution is missing and that none of my text was quoted.)

Nico
Re: [zfs-discuss] zfs proerty aclmode gone in 147?
On Wed, Sep 29, 2010 at 10:15:32AM +1300, Ian Collins wrote:
> Based on my own research, experimentation and client requests, I
> agree with all of the above.

Good to know.

> I have been re-ordering and cleaning (deny) ACEs for one client for a
> couple of years now and we haven't seen any user complaints. In
> their environment, all ACLs started life as POSIX (from a Solaris 9
> host) and, with the benefit of hindsight, I would have cleaned them
> up on import to ZFS rather than simply reading the POSIX ACL and
> writing back to ZFS.

The saddest scenario would be when you have to interop with NFSv3 clients whose users (or their apps) are POSIX-ACL-happy, but whose files also need to be accessible from NFSv4, SMB, and local ZPL clients where the users (possibly the same users, or their apps) are also ZFS-ACL-happy. Particularly if you also have Windows clients and the users edit file ACLs there too!

Thankfully this is relatively easy to avoid: apps that edit ACLs are few and far between, and thus easy to remediate, and users should not really be manually setting POSIX Draft and ZFS/NFSv4/SMB ACLs on the same files.

Nico
Re: [zfs-discuss] zfs proerty aclmode gone in 147?
On Tue, Sep 28, 2010 at 02:03:30PM -0700, Paul B. Henson wrote:
> On Tue, 28 Sep 2010, Nicolas Williams wrote:
>
> > I've researched this enough (mainly by reading most of the ~240 or so
> > relevant zfs-discuss posts and several bug reports)
>
> And I think some fair fraction of those posts were from me, so I'll try not
> to start rehashing old discussions ;).

:)

> > That only leaves aclmode=discard and some variant of aclmode=groupmask
> > that is less confusing.
>
> Or aclmode=deny, which is pretty simple, not very confusing, and basically
> the only paradigm that will prevent chmod from breaking your ACL.

That can potentially render many applications unusable.

> > So one might wonder: can one determine user intent from the ACL prior to
> > the change and the mode/POSIX ACL being set, and then edit the ZFS ACL
> > in a way that approximates the user's intention?
>
> You're assuming the user is intentionally executing the chmod, or even
> *aware* of it happening. Probably at least 99% of the chmod calls executed
> on a file with a ZFS ACL at my site are the result of non-ACL aware legacy
> apps being stupid. In which case the *user* intent is to *leave the damn
> ACL alone* :)...

But that's not really clear. The user is running some app. The app does some chmod(2)ing on behalf of the user. The user may also use chmod(1). Now what? Suppose you make chmod(1) not use chmod(2), so as to be able to say that all chmod(2) calls are made by "apps", not the user. But then what about scripts that use chmod(1)?

Basically, I think intent can be estimated in some cases, and combined with some simplifying assumptions (that will sometimes be wrong), such as "security entities are all distinct, non-overlapping" (a way to minimize the number of DENY ACEs needed), this can yield a groupmasking algorithm that doesn't suck. However, it'd still not be easy to explain, and it'd still result in surprises (since the above assumption will often be wrong, leading to more permissive ACLs than the user might have intended!). It seems like a lot of work for little gain, and a high support-call generation rate.

> > But much better than that would be if we just move to a ZFS ACL world
> > (which, among other things, means we'll need a simple libc API for
> > editing ACLs).
>
> Yep. And a good first step towards an ACL world would be providing a way to
> keep chmod from destroying ACLs in the current world...

I don't think that will happen...

Nico
Re: [zfs-discuss] zfs proerty aclmode gone in 147?
On Tue, Sep 28, 2010 at 12:18:49PM -0700, Paul B. Henson wrote:
> On Sat, 25 Sep 2010, Ralph Böhme wrote:
>
> > Darwin's ACL model is nice and slick, the new NFSv4 one in 147 is just
> > braindead. chmod resulting in ACLs being discarded is a bizarre design
> > decision.
>
> Agreed. What's the point of ACLs that disappear? Sun didn't want to fix
> acl/chmod interaction, maybe one of the new OpenSolaris forks will do the
> right thing...

I've researched this enough (mainly by reading most of the ~240 or so relevant zfs-discuss posts and several bug reports) to conclude the following:

- ACLs derived from POSIX mode_t and/or POSIX Draft ACLs that result in DENY ACEs are enormously confusing to users.

- ACLs derived from POSIX mode_t and/or POSIX Draft ACLs that result in DENY ACEs are susceptible to ACL re-ordering when modified from Windows clients -- which insist on DENY ACEs first -- leading to much confusion.

- This all gets more confusing when hand-crafted ZFS inheritable ACEs are mixed with chmod(2)s under the old aclmode=groupmask setting.

The old aclmode=passthrough setting was dangerous and had to be removed, period. (Doing chmod(600) would not necessarily deny other users/groups access -- that's very, very broken.) That only leaves aclmode=discard and some variant of aclmode=groupmask that is less confusing.

But here's the thing: the only time that groupmasking results in sensible ACLs is when it doesn't require DENY ACEs, which in turn is only when mode_t bits and/or POSIX ACLs are strictly non-increasing (e.g., 777, 775, 771, 750, 755, 751, etc., would all be OK, but 757 would not be). The problem then is this: if you have an aclmode setting that sometimes groupmasks and sometimes discards... that'll be confusing too!

So one might wonder: can one determine user intent from the ACL prior to the change and the mode/POSIX ACL being set, and then edit the ZFS ACL in a way that approximates the user's intention? I believe that would be possible, but risky too, as the need to avoid DENY ACEs (see the Windows issue above) would often result in more permissive ACLs than the user actually intended.

Taken altogether, I believe that aclmode=discard is the simplest setting to explain and understand. Perhaps eventually a variant of groupmasking will be developed that is also simple to explain and understand, but right now I very much doubt it (and yes, I've tried myself). But much better than that would be if we just moved to a ZFS ACL world (which, among other things, means we'll need a simple libc API for editing ACLs).

Note, incidentally, that there's one groupmasking behavior left in ZFS at this time: on create of objects in directories with inheritable ACEs and with aclinherit=passthrough*.

Nico
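P.S.: The b147 behavior under discussion, as a transcript sketch (the user name is arbitrary):

$ touch f
$ chmod A0+user:webservd:read_data:allow f
$ ls -v f        # shows the extra ACE
$ chmod 644 f    # no aclmode anymore: the ACL is discarded...
$ ls -v f        # ...replaced by a trivial ACL equivalent to 0644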
Re: [zfs-discuss] Pools inside pools
On Thu, Sep 23, 2010 at 06:58:29AM +0000, Markus Kovero wrote:
> > What is an example of where a checksummed outside pool would not be able
> > to protect a non-checksummed inside pool? Would an intermittent
> > RAM/motherboard/CPU failure that only corrupted the inner pool's block
> > before it was passed to the outer pool (and did not corrupt the outer
> > pool's block) be a valid example?
>
> > If checksums are desirable in this scenario, then redundancy would also
> > be needed to recover from checksum failures.
>
> That is an excellent point also: what is the point of checksumming if
> you cannot recover from it? In this kind of configuration one would
> benefit performance-wise from not having to calculate checksums again.

The benefit of checksumming in the "inner tunnel", as it were (the inner pool), is to provide one more layer of protection relative to iSCSI. But without redundancy in the inner pool you cannot recover from failures, as you point out. And you must have checksumming in the outer pool, so that it can be scrubbed.

It's tempting to say that the inner pool should not checksum at all, and that iSCSI and IPsec should be configured correctly to provide sufficient protection to the inner pool. Another possibility is to have a remote ZFS protocol of sorts, but then you begin to wonder whether something like Lustre (married to ZFS) isn't better.

> Checksums in outer pools effectively protect from disk issues; if
> hardware fails such that data is corrupted, isn't the outer pool's
> redundancy going to handle it for the inner pool also?

Yes.

Nico
Re: [zfs-discuss] ZFS COW and simultaneous read & write of files
On Wed, Sep 22, 2010 at 12:30:58PM -0600, Neil Perrin wrote:
> On 09/22/10 11:22, Moazam Raja wrote:
> >Hi all, I have a ZFS question related to COW and scope.
> >
> >If user A is reading a file while user B is writing to the same file,
> >when do the changes introduced by user B become visible to everyone?
> >Is there a block-level scope, or file-level, or something else?
> >
> >Thanks!
>
> Assuming the user is using read and write against zfs files:
> ZFS has reader/writer range locking within files.
> If thread A is trying to read the same section that thread B is
> writing, it will block until the data is written. Note, "written" in
> this case means written into the zfs cache and not to the disks. If
> thread A requires that changes to the file be stable (on disk) before
> reading, it can use the little-known O_RSYNC flag.

That's assuming local access (i.e., POSIX semantics). It's different if NFS is involved (because of NFS' close-to-open semantics). It might be different if SMB is involved (dunno).

Nico
Re: [zfs-discuss] Please warn a home user against OpenSolaris under VirtualBox under WinXP ; )
On Wed, Sep 22, 2010 at 07:14:43AM -0700, Orvar Korvar wrote:
> There was a guy doing that: Windows as host and OpenSolaris as guest
> with raw access to his disks. He lost his 12 TB data. It turned out
> that VirtualBox dont honor the write flush flag (or something
> similar).

VirtualBox has an option to honor flushes. Also, recent versions of ZFS can recover by throwing out the last N transactions that were not committed fully.

> In other words, I would never ever do that. Your data is safer with
> Windows only and a Windows raid solution.
>
> Use OpenSolaris as host instead, and Win as guest.

I don't think your advice is correct. If you're going to run production services on VirtualBox VMs then you should enable cache flushes in VBox:

http://www.virtualbox.org/manual/ch12.html#id2692517

"To enable flushing for IDE disks, issue the following command:

VBoxManage setextradata "VM name" "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0

The value [x] that selects the disk is 0 for the master device on the first channel, 1 for the slave device on the first channel, 2 for the master device on the second channel or 3 for the slave device on the second channel.

To enable flushing for SATA disks, issue the following command:

VBoxManage setextradata "VM name" "VBoxInternal/Devices/ahci/0/LUN#[x]/Config/IgnoreFlush" 0

The value [x] that selects the disk can be a value between 0 and 29."

IMO VBox should have a simple toggle for this in either its disk or VM manager UI. And the flush commands should be honored by default. What VBox could do is have some radio buttons or checkboxes for indicating the purpose of a given VM, and then derive default flush behavior from that (e.g., test and gaming VMs need not honor flushes, dev VMs might, and prod VMs do).

Nico
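P.S.: To check what a VM currently has set, VBoxManage can enumerate the extra-data keys:

$ VBoxManage getextradata "VM name" enumerate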
Re: [zfs-discuss] resilver = defrag?
On Wed, Sep 15, 2010 at 05:18:08PM -0400, Edward Ned Harvey wrote: > It is absolutely not difficult to avoid fragmentation on a spindle drive, at > the level I described. Just keep plenty of empty space in your drive, and > you won't have a fragmentation problem. (Except as required by COW.) How > on earth do you conclude this is "practically impossible?" That's expensive. It's also approaching short-stroking (which is expensive). Which is what Richard said (in so many words, that it's expensive). Can you make HDDs perform awesome? Yes, but you'll need lots of them, and you'll need to use them very inefficiently. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What is the "1000 bit"?
On Tue, Sep 14, 2010 at 04:13:31PM -0400, Linder, Doug wrote:
> I recently created a test zpool (RAIDZ) on some iSCSI shares. I made
> a few test directories and files. When I do a listing, I see
> something I've never seen before:
>
> [r...@hostname anewdir] # ls -la
> total 6160
> drwxr-xr-x   2 root  other        4 Sep 14 14:16 .
> drwxr-xr-x   4 root  root         5 Sep 14 15:04 ..
> -rw------T   1 root  other  2097152 Sep 14 14:16 barfile1
> -rw------T   1 root  other  1048576 Sep 14 14:16 foofile1
>
> I looked up the "T" bit in the man page for ls, and it says that "T"
> means "The 1000 bit is turned on, and execution is off (undefined
> bit-state)." Which is as clear as mud.

It's the sticky bit. Nowadays it's only useful on directories, and really it's generally only used with 777 permissions. The chmod(1) (man -M/usr/man chmod) and chmod(2) (man -s 2 chmod) manpages describe the sticky bit. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
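To see the 1000 bit in action (paths are illustrative):

  ls -ld /tmp                  # drwxrwxrwt: a lowercase 't' is the sticky bit with execute on
  chmod 1777 /export/scratch   # shared scratch space: anyone may create, only owners may delete
  chmod 600 barfile1           # clears the 1000 bit; the 'T' disappears from the listing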
Re: [zfs-discuss] zfs set readonly=on does not entirely go into read-only mode
On Sat, Aug 28, 2010 at 12:05:53PM +1200, Ian Collins wrote:
> Think of this from the perspective of an application. How would
> write failure be reported? open(2) returns EACCES if the file can
> not be written but there isn't a corresponding return from write(2).
> Any open file descriptors would have to be updated to reflect the
> change of access and the application would end up with an unexpected
> error return (EBADF?).

EROFS. But write(2) isn't supposed to return EROFS. NFSv3's and v4's write ops are allowed to return the NFS equivalent of EROFS, and so typically NFS clients do cause write(2) to return EROFS in such cases (but then, NFS isn't fully POSIX). write(2) can return EIO though, and, IIRC, the BSD revoke(2) syscall arranges for just that to be returned by write(2) calls on revoked fildes. IMO EROFS and EIO would both be OK. It might be a good idea to require a force option to make a change that would cause non-POSIX behavior. I'd think that there are many possible ways to handle this:

a) disallow setting readonly=on on mounted datasets that are readonly=false;
b) disallow ... but only if there are any fildes open for write (doesn't matter if shared with NFS as NFS writes are allowed to return EROFS);
c) allow the change but make it take effect on next mount;
d) force umount the dataset, make the change, mount again;
e) have write(2), to fildes open for write before the change to readonly=on, return EROFS after the change;
f) same as (e) but only if you force the prop change;
g) have write(2), to fildes open for write before the change to readonly=on, return EIO after the change;
h) allow write(2)s to fildes open for write before the change to readonly=on.

(h) is current behavior. (a) and (b) would be reasonable, but if EBUSY, the user may not be able to change the property without drastic steps (such as rebooting, if there are lots of datasets below). (c) would be confusing, and not that useful. (d) would be unreasonable (plus what if there are datasets below this one?!). (e) may be reasonable if you think that we're well outside POSIX the moment you change the readonly prop to on. (f) is reasonable (by forcing the change you'd be saying that you're happy to leave POSIX land). (h) is reasonable.

> If the application has been given permission to open a file for
> writing and this permission is unexpectedly revoked, strange things
> may happen. The file being written would be in an inconsistent
> state.

Well, there's always the BSD revoke(2) system call. Use it and ...

> I think it is better to let the write operation complete and leave the
> file in a consistent state.

There is that too. But you could, too, just power off... The application should use fsync(2) (or fdatasync()) carefully to ensure that failed write(2)s and power failures don't leave the application in an unrecoverable state. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
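Behavior (h) is easy to observe today (dataset and path names are illustrative):

  zfs set readonly=on tank/shared
  touch /tank/shared/new       # fails: new opens for write return EROFS
  # a process that already held a descriptor open for write keeps writing,
  # unaffected, until it closes and tries to reopen
  zfs set readonly=off tank/shared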
Re: [zfs-discuss] 64-bit vs 32-bit applications
On Fri, Aug 20, 2010 at 10:17:38AM +1200, Ian Collins wrote: > On 08/20/10 09:48 AM, Nicolas Williams wrote: > >And anyways, the temptation to build classes that can be used > >elsewhere becomes rather strong. IMO C++ in the kernel is asking for > >trouble. And C++ in user-land? Same thing: you'll end up wanting to > >turn parts of your application into libraries, and then some other > >developer will want to use those in their C++ app, and then you run into > >the ABI issues all over again. > > There are a couple of simple solutions to that. Either make library > code header only (which is most common for template based code) or > provide CC and gcc libraries, just like we have 32 and 64 bit > versions of other system libraries. Or just stick to one compiler, > like Solaris did before the big gcc build project kicked off. Or wait for a standard ABI to be formulated and widely adopted. Or don't use C++. Use Java or a JVM-hosted language. Use Python. Use C. Use C#. Use whatever. Anything, anything other than C++. But more than anything: we don't need a language flame war on a ZFS list. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 64-bit vs 32-bit applications
On Fri, Aug 20, 2010 at 09:38:51AM +1200, Ian Collins wrote: > On 08/20/10 09:33 AM, Nicolas Williams wrote: > >Any driver C++ code would still need a C++ run-time. Either you must > >statically link it in, or you'll have a problem with multiple drivers > >using different C++ run-times. If you statically link in the run-time, > >then you're bloating the text of the kernel. If you're not then you > >have a problem. C++ is bad because of its ABI issues, really. > > > You snipped the bit where I said > > "Drivers and kernel modules are a good example; in that world you > have to live without the runtime library (which is dynamic only). > So you are effectively just using C++ as a superset of C with all > the benefits that offers." > > So you basically lose the C++-specific parts of the standard > library, and exceptions. But you still have the built-in features of > the language. I'm not sure it's that easy to avoid the C++ run-time when you're coding. And anyways, the temptation to build classes that can be used elsewhere becomes rather strong. IMO C++ in the kernel is asking for trouble. And C++ in user-land? Same thing: you'll end up wanting to turn parts of your application into libraries, and then some other developer will want to use those in their C++ app, and then you run into the ABI issues all over again. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 64-bit vs 32-bit applications
On Fri, Aug 20, 2010 at 09:23:56AM +1200, Ian Collins wrote: > On 08/20/10 08:30 AM, Garrett D'Amore wrote: > >There is no common C++ ABI. So you get into compatibility concerns > >between code built with different compilers (like Studio vs. g++). > >Fail. > > Which is why we have extern "C". Just about any Solaris driver, > library or kernel module could be implemented in C++ behind the C > compatibility layer and no one would notice. Any driver C++ code would still need a C++ run-time. Either you must statically link it in, or you'll have a problem with multiple drivers using different C++ run-times. If you statically link in the run-time, then you're bloating the text of the kernel. If you're not then you have a problem. C++ is bad because of its ABI issues, really. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] User level transactional API
On Thu, Aug 12, 2010 at 07:48:10PM -0500, Norm Jacobs wrote: > For single file updates, this is commonly solved by writing data to > a temp file and using rename(2) to move it in place when it's ready. For anything more complicated you need... a more complicated approach. Note that "transactional API" means, among other things, "rollback" -- easy at the whole dataset level, hard in more granular form. Dataset-level rollback is nowhere near granular enough for applications. Application transactions consisting of more than one atomic filesystem operation require application-level recovery code. SQLite3 is a good (though maybe extreme?) example of such an application; there are many others. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
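Norm's single-file idiom, sketched in shell; names are illustrative, and note that a careful implementation also fsync(2)s the temp file (and its directory) before the rename, which plain shell cannot express:

  tmp=$(mktemp /tank/app/.config.XXXXXX) || exit 1
  generate_config > "$tmp"        # hypothetical producer of the new contents
  mv "$tmp" /tank/app/config      # rename(2): readers see old or new, never a mix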
Re: [zfs-discuss] Solaris Filesystem
On Wed, Jul 14, 2010 at 03:07:59PM -0600, Beau J. Bechdol wrote: > So not sure if this is the correct list to email to or not. I am curious to > know: on my machine I have two hard drives (c8t0d0 and c8t1d0). Can someone > explain to me what this exactly means? What does "c8" "t0" and "d0" actually > mean. I might have to go back to Solaris 101 to understand what this all > means. The 'c' is for "controller", and the number that follows is one that is assigned to the given controller (not necessarily on a first-come-first-served 0-based basis!). The controller number should be considered unpredictable at install time. Once installed it shouldn't change, except for removable disks, where the controller number might vary according to which slot you plugged the disk into. The 't' is for "target". The 'd' is for "disk" -- think LUN. The 'p' is for "partition", and is used in Solaris on x86. The 's' is for "slice". Slices are like partitions, but only used in SOLARIS2 partitions, of which you're allowed no more than one per disk. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Hash functions (was Re: Hashing files rapidly on ZFS)
On Thu, Jul 08, 2010 at 08:42:33PM -0700, Garrett D'Amore wrote: > On Fri, 2010-07-09 at 10:23 +1000, Peter Jeremy wrote: > > In theory, collisions happen. In practice, given a cryptographic hash, > > if you can find two different blocks or files that produce the same > > output, please publicise it widely as you have broken that hash function. > > Not necessarily. While you *should* publicize it widely, given all the > possible text that we have, and all the other variants, it's > theoretically possible to get lucky. Like winning a lottery where > everyone else has a million tickets, but you only have one. > > Such an occurrence -- if isolated -- would not, IMO, constitute a > 'breaking' of the hash function. A hash function is broken when we know how to create colliding inputs. A random collision does not a break make, though it might, perhaps, help figure out how to break the hash function later. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?
On Wed, Jun 30, 2010 at 01:35:31PM -0700, valrh...@gmail.com wrote: > Finally, for my purposes, it doesn't seem like a ZIL is necessary? I'm > the only user of the fileserver, so there probably won't be more than > two or three computers, maximum, accessing stuff (and writing stuff) > remotely. It depends on what you're doing. The perennial complaint about NFS is the synchronous open()/close() operations and the fact that archivers (tar, ...) will generally unpack archives in a single-threaded manner, which means all those synchronous ops punctuate the archiver's performance with pauses. This is a load type for which ZIL devices come in quite handy. If you write lots of small files often and in single-threaded ways _and_ want to guarantee you don't lose transactions, then you want a ZIL device. (The recent knob for controlling whether synchronous I/O gets done asynchronously would help you if you don't care about losing a few seconds worth of writes, assuming that feature makes it into any release of Solaris.) > But, from what I can gather, by spending a little under $400, I should > substantially increase the performance of my system with dedup? Many > thanks, again, in advance. If you have deduplicatious data, yes. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
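If measurement says a dedicated log (or cache) device is worth it, attaching one is a one-liner; device names here are made up:

  zpool add tank log c9t0d0     # separate intent-log device for the synchronous ops described above
  zpool add tank cache c9t1d0   # L2ARC device; also helps keep the dedup table warm
  zpool status tank             # the new "logs" and "cache" sections should appear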
Re: [zfs-discuss] SSDs adequate ZIL devices?
On Wed, Jun 16, 2010 at 04:44:07PM +0200, Arne Jansen wrote: > Please keep in mind I'm talking about a usage as ZIL, not as L2ARC or main > pool. Because ZIL issues nearly sequential writes, due to the NVRAM-protection > of the RAID-controller the disk can leave the write cache enabled. This means > the disk can write essentially with full speed, meaning 150MB/s for a 15k > drive. > 114000 4k writes/s are 456MB/s, so 3 spindles should do. You'd still have to flush those caches at the end of each transaction, which would tend to come every few seconds, so you'd need to factor that in. You can definitely do with disk what you can do with SSDs, but not necessarily with the same SWAP (space, wattage and price), and you'd have a more complex system no matter what. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Deduplication and ISO files
On Fri, Jun 04, 2010 at 12:37:01PM -0700, Ray Van Dolson wrote: > On Fri, Jun 04, 2010 at 11:16:40AM -0700, Brandon High wrote: > > On Fri, Jun 4, 2010 at 9:30 AM, Ray Van Dolson wrote: > > > The ISO's I'm testing with are the 32-bit and 64-bit versions of the > > > RHEL5 DVD ISO's. While both have their differences, they do contain a > > > lot of similar data as well. > > > > Similar != identical. > > > > Dedup works on blocks in zfs, so unless the iso files have identical > > data aligned at 128k boundaries you won't see any savings. > > > > > If I explode both ISO files and copy them to my ZFS filesystem I see > > > about a 1.24x dedup ratio. > > > > Each file starts a new block, so the identical files can be deduped. > > > > -B > > Makes sense. So, as someone else suggested, decreasing my block size > may improve the deduplication ratio. > > recordsize I presume is the value to tweak? Yes, but I'd not expect that much commonality between 32-bit and 64-bit Linux ISOs... Do the same check again with the ISOs "exploded", as you say. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
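The experiment, sketched (dataset and file names are illustrative; recordsize only affects blocks written after it is set, so set it before copying):

  zfs create -o dedup=on -o recordsize=8k tank/isos
  cp rhel5-i386.iso rhel5-x86_64.iso /tank/isos/
  sync
  zpool list tank   # watch the DEDUP column; smaller records raise the odds of aligned matches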
Re: [zfs-discuss] questions about zil
On Mon, May 24, 2010 at 05:48:56PM -0400, Thomas Burgess wrote: > I recently got a new SSD (ocz vertex LE 50gb) > > It seems to work really well as a ZIL, performance-wise. My question is, how > safe is it? I know it doesn't have a supercap so let's say data loss > occurs... is it just data loss or is it pool loss? Just data loss. > also, does the fact that I have a UPS matter? Relative to power loss, yes. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] send/recv over ssh
On Thu, May 20, 2010 at 04:23:49PM -0400, Thomas Burgess wrote: > I know i'm probably doing something REALLY stupid.but for some reason i > can't get send/recv to work over ssh. I just built a new media server and > i'd like to move a few filesystem from my old server to my new server but > for some reason i keep getting strange errors... > > At first i'd see something like this: > > pfexec: can't get real path of ``/usr/bin/zfs'' > > or something like this: > > zfs: Command not found Add /usr/sbin to your PATH or use /usr/sbin/zfs as the full path of the zfs(1M) command. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
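Concretely, either of these forms works (host and dataset names are illustrative):

  zfs send data/media@move | ssh newbox 'PATH=/usr/sbin:/sbin:$PATH zfs recv -d tank'
  # or simply spell out the path (and privileges) on the remote side:
  zfs send data/media@move | ssh newbox pfexec /usr/sbin/zfs recv -d tank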
Re: [zfs-discuss] New SSD options
On Wed, May 19, 2010 at 02:29:24PM -0700, Don wrote: > "Since it ignores Cache Flush command and it doesn't have any > persistant buffer storage, disabling the write cache is the best you > can do." > > This actually brings up another question I had: What is the risk, > beyond a few seconds of lost writes, if I lose power, there is no > capacitor and the cache is not disabled? You can lose all writes from the last committed transaction (i.e., the one before the currently open transaction). (You also lose writes from the currently open transaction, but that's unavoidable in any system.) Nowadays the system will let you know at boot time that the last transaction was not committed properly and you'll have a chance to go back to the previous transaction. For me, getting much-better-than-disk performance out of an SSD with cache disabled is enough to make that SSD worthwhile, provided the price is right of course. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in campus clusters
On Wed, May 19, 2010 at 07:50:13AM -0700, John Hoogerdijk wrote: > Think about the potential problems if I don't mirror the log devices > across the WAN. If you don't mirror the log devices then your disaster recovery semantics will be that you'll miss any transactions that hadn't been committed to disk yet at the time of the disaster. Which means that the log devices' effects is purely local: for recovery from local power failures (not extending to local disasters) and for acceleration. This may or may not be acceptable to you. If not, then mirror the log devices. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
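In zpool terms the mirrored-slog option is one line; device names are illustrative, with one side a local SSD and the other an iSCSI LUN at the remote site:

  zpool add tank log mirror c5t0d0 c6t0d0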
Re: [zfs-discuss] inodes in snapshots
On Wed, May 19, 2010 at 05:33:05AM -0700, Chris Gerhard wrote: > The reason for wanting to know is to try and find versions of a file. No, there's no such guarantee. The same inode and generation number pair is extremely unlikely to be re-used, but the inode number itself is likely to be re-used. > If a file is renamed then the only way to know that the renamed file > was the same as a file in a snapshot would be if the inode numbers > matched. However for that to be reliable it would require the i-nodes > are not reused. There's also the crtime (creation time, not to be confused with ctime), which you can get with ls(1). > If they are able to be reused then when an inode number matches I > would also have to compare the real creation time which requires > looking at the extended attributes. Right, that's what you'll have to do. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...
On Thu, May 06, 2010 at 03:30:05PM -0500, Wes Felter wrote: > On 5/6/10 5:28 AM, Robert Milkowski wrote: > > >sync=disabled > >Synchronous requests are disabled. File system transactions > >only commit to stable storage on the next DMU transaction group > >commit which can be many seconds. > > Is there a way (short of DTrace) to write() some data and get > notified when the corresponding txg is committed? Think of it as a > poor man's group commit. fsync(2) is it. Of course, if you disable sync writes then there's no way to find out for sure. If you need to know when a write is durable, then don't disable sync writes. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
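For reference, the knob in question (dataset names illustrative); there is no completion notification short of fsync(2) itself:

  zfs set sync=disabled tank/scratch   # fsync()/O_DSYNC become no-ops: only for data you can afford to lose
  zfs set sync=standard tank/db        # restores the guarantee: fsync() returns only when data is durable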
Re: [zfs-discuss] Making ZFS better: zfshistory
On Wed, Apr 21, 2010 at 01:03:39PM -0500, Jason King wrote: > ISTR POSIX also doesn't allow a number of features that can be turned > on with zfs (even ignoring the current issues that prevent ZFS from > being fully POSIX compliant today). I think an additional option for > the snapdir property ('directory' ?) that provides this behavior (with > suitable warnings about posix compliance) would be reasonable. > > I believe it's sufficient that zfs provide the necessary options to > act in a posix compliant manner (much like you have to set $PATH > correctly to get POSIX conforming behavior, even though that might not > be the default), though I'm happy to be corrected about this. Yes, that's true. But you couldn't rely on this behavior, whereas you can rely on dataset roots having .zfs. If you're going to script this, then you'll want to rely on the current (POSIX-compliant) behavior. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Making ZFS better: zfshistory
POSIX doesn't allow us to have special dot files/directories outside filesystem root directories. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Making ZFS better: zfshistory
On Wed, Apr 21, 2010 at 10:45:24AM -0400, Edward Ned Harvey wrote: > > From: Mark Shellenbaum [mailto:mark.shellenb...@oracle.com] > > > > > > You can create/destroy/rename snapshots via mkdir, rmdir, mv inside > > the > > > .zfs/snapshot directory, however, it will only work if you're running > > the > > > command locally. It will not work from a NFS client. > > > > > > > It will work over NFS or SMB, but you will need to allow it via the > > necessary delegated administration permissions. > > Go on? > I tried it over NFS and it didn't work. So ... what are the "necessary > permissions?" See zfs(1M), search for "delegate". > I did it from a NFS client as root, where root maps to root. Huh; dunno why that didn't work. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
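The delegation step looks like this (user and dataset names are illustrative); creating a snapshot with mkdir needs the snapshot and mount permissions, removing one with rmdir needs destroy and mount:

  zfs allow -u alice snapshot,destroy,mount tank/home/alice
  zfs allow tank/home/alice      # displays the delegations now in effect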
Re: [zfs-discuss] Making ZFS better: zfshistory
On Tue, Apr 20, 2010 at 04:28:02PM +0000, A Darren Dunham wrote: > On Sat, Apr 17, 2010 at 09:03:33AM -0400, Edward Ned Harvey wrote: > > > "zfs list -t snapshot" lists in time order. > > > > Good to know. I'll keep that in mind for my "zfs send" scripts but it's not > > relevant for the case at hand. Because "zfs list" isn't available on the > > NFS client, where the users are trying to do this sort of stuff. > > I'll note for comparison that the Netapp snapshots do expose this in one > way. > > The actual snapshot directory access time is set to the time of the > snapshot. That makes it visible over NFS. Would be handy to do > something similar in ZFS. The .zfs/snapshot directory is most certainly available over NFS. But note that .zfs does not appear in directory listings of dataset roots -- you have to actually refer to it:

% ls -f | fgrep .zfs
% ls -f .zfs
.         ..        snapshot
% ls .zfs/snapshot
% nfsstat -m $PWD
/net/.../pool/nico from ...:/pool/nico
 Flags: vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,mirrormount,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache: acregmin=3,acregmax=60,acdirmin=30,acdirmax=60
%

And you can even create, rename and destroy snapshots by creating, renaming and removing directories in .zfs/snapshot:

% mkdir .zfs/snapshot/foo
% mv .zfs/snapshot/foo .zfs/snapshot/bar
% rmdir .zfs/snapshot/bar

(All this also works locally, not just over NFS.) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Making ZFS better: rm files/directories from snapshots
On Fri, Apr 16, 2010 at 01:56:07PM -0400, Edward Ned Harvey wrote: > The typical problem scenario is: Some user or users fill up the filesystem. > They rm some files, but disk space is not freed. You need to destroy all > the snapshots that contain the deleted files, before disk space is available > again. > > It would be nice if you could rm files from snapshots, without needing to > destroy the whole snapshot. > > Is there any existing work or solution for this? See the archives. See the other replies to you already. Short version: no. However, a script to find all the snapshots that you'd have to delete in order to delete some file might be useful, but really, only marginally so: you should send your snapshots to backup and clean them out from time to time anyways. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
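A rough sketch of such a helper (the mountpoint and file path are illustrative):

  #!/bin/ksh
  # list the snapshots of /tank/home that still hold a copy of the named file
  f=$1                                      # e.g. alice/budget.ods
  for s in /tank/home/.zfs/snapshot/*; do
      [[ -e $s/$f ]] && print "${s##*/}"
  done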
Re: [zfs-discuss] Making ZFS better: zfshistory
On Fri, Apr 16, 2010 at 02:19:47PM -0700, Richard Elling wrote: > On Apr 16, 2010, at 1:37 PM, Nicolas Williams wrote: > > I've a ksh93 script that lists all the snapshotted versions of a file... > > Works over NFS too. > > > > % zfshist /usr/bin/ls > > History for /usr/bin/ls (/.zfs/snapshot/*/usr/bin/ls): > > -r-xr-xr-x 1 root bin33416 Jul 9 2008 > > /.zfs/snapshot/install/usr/bin/ls > > -r-xr-xr-x 1 root bin37612 Nov 21 2008 > > /.zfs/snapshot/2009-12-07-20:47:58/usr/bin/ls > > -r-xr-xr-x 1 root bin37612 Nov 21 2008 > > /.zfs/snapshot/2009-12-01-00:42:30/usr/bin/ls > > -r-xr-xr-x 1 root bin37612 Nov 21 2008 > > /.zfs/snapshot/2009-07-17-21:08:45/usr/bin/ls > > -r-xr-xr-x 1 root bin37612 Nov 21 2008 > > /.zfs/snapshot/2009-06-03-03:44:34/usr/bin/ls > > % > > > > It's not perfect (e.g., it doesn't properly canonicalize its arguments, > > so it doesn't handle symlinks and ..s in paths), but it's a start. > > There are some interesting design challenges here. For the general case, you > can't rely on the snapshot name to be in time order, so you need to sort by > the > mtime of the destination. I'm using ls -ltr. > It would be cool to only list files which are different. True. That'd not be hard. > If you mv a file to another directory, you might want to search by filename > or a partial directory+filename. Or even inode number. > Or maybe you just setup your tracker.cfg and be happy? Exactly. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Making ZFS better: zfshistory
On Fri, Apr 16, 2010 at 01:54:45PM -0400, Edward Ned Harvey wrote: > If you've got nested zfs filesystems, and you're in some subdirectory where > there's a file or something you want to rollback, it's presently difficult > to know how far back up the tree you need to go, to find the correct ".zfs" > subdirectory, and then you need to figure out the name of the snapshots > available, and then you need to perform the restore, even after you figure > all that out. I've a ksh93 script that lists all the snapshotted versions of a file... Works over NFS too.

% zfshist /usr/bin/ls
History for /usr/bin/ls (/.zfs/snapshot/*/usr/bin/ls):
-r-xr-xr-x 1 root bin 33416 Jul  9  2008 /.zfs/snapshot/install/usr/bin/ls
-r-xr-xr-x 1 root bin 37612 Nov 21  2008 /.zfs/snapshot/2009-12-07-20:47:58/usr/bin/ls
-r-xr-xr-x 1 root bin 37612 Nov 21  2008 /.zfs/snapshot/2009-12-01-00:42:30/usr/bin/ls
-r-xr-xr-x 1 root bin 37612 Nov 21  2008 /.zfs/snapshot/2009-07-17-21:08:45/usr/bin/ls
-r-xr-xr-x 1 root bin 37612 Nov 21  2008 /.zfs/snapshot/2009-06-03-03:44:34/usr/bin/ls
%

It's not perfect (e.g., it doesn't properly canonicalize its arguments, so it doesn't handle symlinks and ..s in paths), but it's a start. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Rollback From ZFS Send
On Tue, Apr 06, 2010 at 11:53:23AM -0400, Tony MacDoodle wrote: > Can I rollback a snapshot that I did a zfs send on? > > ie: zfs send testpool/w...@april6 > /backups/w...@april6_2010 That you did a zfs send does not prevent you from rolling back to a previous snapshot. Similarly for zfs recv -- that you went from one snapshot to another by zfs receiving a send does not stop you from rolling back to an earlier snapshot. You do need to have an earlier snapshot to rollback to, if you want to rollback. Also, if you are using zfs send for backups, or for replication, and you rollback the primary dataset, then you'll need to update your backups/ replicas accordingly. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
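Spelled out with made-up names:

  zfs send tank/work@april6 > /backups/work@april6_2010   # archive the stream
  zfs rollback -r tank/work@april1   # fine; -r also destroys @april6 and anything newer
  # replicas built from @april6 must now be refreshed from @april1 or earlier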
Re: [zfs-discuss] zfs diff
zfs diff is incredibly cool. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs diff
One really good use for zfs diff would be: as a way to index zfs send backups by contents. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
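For example, one could archive a diff listing next to each incremental stream and grep the listings to find which stream holds a wanted file (all names illustrative):

  zfs diff tank/home@mon tank/home@tue > /backups/home@tue.idx
  zfs send -i @mon tank/home@tue > /backups/home@tue.zs
  grep 'projects/report.odt' /backups/*.idx   # which streams touched the file?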
Re: [zfs-discuss] zfs send and ARC
On Thu, Mar 25, 2010 at 04:23:38PM +0000, Darren J Moffat wrote: > If the data is in the L2ARC that is still better than going out to > the main pool disks to get the compressed version. Well, one could just compress it... If you'd otherwise put compression in the ssh pipe (or elsewhere) then you could stop doing that. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
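That is, something along these lines (names illustrative):

  zfs send tank/fs@snap | ssh -C backhost 'zfs recv -d backup'               # ssh's builtin compression
  zfs send tank/fs@snap | gzip | ssh backhost 'gunzip | zfs recv -d backup'  # or an explicit pipe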
Re: [zfs-discuss] ZFS send and receive corruption across a WAN link?
On Thu, Mar 18, 2010 at 10:38:00PM -0700, Rob wrote: > Can a ZFS send stream become corrupt when piped between two hosts > across a WAN link using 'ssh'? No. SSHv2 uses HMAC-MD5 and/or HMAC-SHA-1, depending on what gets negotiated, for integrity protection. The chances of random on-the-wire corruption going undetected by link-layer CRCs, TCP's checksum and SSHv2's MACs are infinitesimally small. I suspect the chances of local bit flips due to cosmic rays and what not are higher. A bigger problem is that SSHv2 connections do not survive corruption on the wire. That is, if corruption is detected then the connection gets aborted. If you were zfs send'ing 1TB across a long, narrow link and corruption hit the wire while sending the last block you'd have to re-send the whole thing (but even then such corruption would still have to get past link-layer and TCP checksums -- I've seen it happen, so it is possible, but it is also unlikely). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
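If you want belt and braces anyway, land the stream in a file and compare digests before receiving (paths and hostname illustrative); note that zfs recv itself fails loudly on a damaged stream, since send streams carry their own checksums:

  zfs send tank/data@snap > /var/tmp/data.zs
  digest -a sha256 /var/tmp/data.zs     # note the digest
  scp /var/tmp/data.zs backup:/var/tmp/
  ssh backup 'digest -a sha256 /var/tmp/data.zs && zfs recv -d tank < /var/tmp/data.zs'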
Re: [zfs-discuss] Who is using ZFS ACL's in production?
BTW, it should be relatively easy to implement aclmode=ignore and aclmode=deny, if you like.

- $SRC/common/zfs/zfs_prop.c needs to be updated to know about the new values of aclmode.

- $SRC/uts/common/fs/zfs/zfs_acl.c:zfs_acl_chmod()'s callers need to be modified:
  - in the create path, if zfs_acl_chmod() gets called then you can't ignore nor deny the mode;
  - zfs_acl_chmod_setattr() should call neither zfs_acl_node_read() nor zfs_acl_chmod() if aclmode=ignore or aclmode=deny;
  - in all other paths zfs_acl_chmod() should do what it should do.

- $SRC/uts/common/fs/zfs/zfs_vnops.c:zfs_setattr() may need some updates too, e.g., to not call zfs_aclset_common() in the case of aclmode=ignore -- you'll probably have to play around to figure out what else.

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On Mon, Mar 01, 2010 at 09:04:58PM -0800, Paul B. Henson wrote: > On Mon, 1 Mar 2010, Nicolas Williams wrote: > > Yes, that sounds useful. (Group modebits could be applied to all ACEs > > that are neither owner@ nor everyone@ ACEs.) > > That sounds an awful lot like the POSIX mask_obj, which was the bane of my > previous filesystem, DFS, and which, as it seems history repeats itself, I > was also unable to get an option implemented to ignore it and allow ACL's > to work without impediment. Alternatively group modebits apply to only the group@ ACEs. This could be just yet another option. If no modebits were to apply to ACEs with subjects other than owner@/group@/everyone@ (what about subjects that match the file's owner/group but aren't owner@/group@?) then there'd be no way to use modebits as a big filter for ACLs. This is why I proposed the above. > > If users have private primary groups then you can have them run with > > umask 007 or 002 and use set-gid and/or inheritable ACLs to ensure that > > users can share files in specific directories. (This is one reason that > > I recommend always giving users their own private primary groups.) > > The only reason for the recommendation to give users their own private > primary groups is because of the lack of flexibility of the umask/mode bits > security model. In an environment with inheritable ACL's (that aren't > subject to being violated by that legacy security model) there's no real > need. All reasons I have for it really come back to this: the idea of a primary group and file group is an anachronism from back when ACLs (and supplementary group memberships!) were overkill. Think back to the days when the AT&T labs were the only place where Unix ran and Unix had a user base in the tens of users. We're stuck with the notion of a primary group (Windows seems to have it for interop with POSIX). The way to make the best of that situation is to give every user their own private group. > > Alternatively we could have a new mode bit to indicate that the group > > bits of umask are to be treated as zero, or maybe assign this behavior > > to the set-gid bit on ZFS. > > So rather than a nice simple option granting ACL's immunity from umask/mode > bits baggage, another attempted mapping/interaction? You have a good idea of what is "simple" for your use case. Your use case also appears to be greatly influenced by what we could (should, do) consider to be a bug in Samba. Your idea of "simple" may not match everyone else's. And your idea of "simple" might well differ if that one application didn't use chmod() at all. Personally I don't see a simple, non-surprising solution. I see a set of solutions that one could pick from. In all cases I think we need a way to synthesize modebits from ACLs (e.g., for objects created via protocols that have no conception of modebits but have a conception of ACLs) -- that's a difficult problem because any algorithm for doing that will necessarily be lossy in many cases. > If you only ever access ZFS via CIFS from windows clients, you can have a > pure ACL model. Why should access via local shell or NFSv4 be a poor > stepchild and chained down with legacy semantics that make it exceedingly > difficult to actually use ACL's for their intended purpose? I am certainly not advocating that. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On Tue, Mar 02, 2010 at 11:10:52AM -0800, Bill Sommerfeld wrote: > On 03/02/10 08:13, Fredrich Maney wrote: > >Why not do the same sort of thing and use that extra bit to flag a > >file, or directory, as being an ACL only file and will negate the rest > >of the mask? That accomplishes what Paul is looking for, without > >breaking the existing model for those that need/wish to continue to > >use it? > > While we're designing on the fly: Heh. > Another possibility would be to use an > additional umask bit or two to influence the mode-bit - acl interaction. Well, I think the bit, if we must have one, belongs in the filesystem objects that have ACLs, as opposed to processes. There may be no umask to apply in remote access cases, so using a process attribute is likely to result in different behavior according to the access protocol and client. That might not be surprising for the CIFS case, but it certainly would be for the NFS case. But also I think it's the owner of an object that should decide what happens to the object's ACL on chmod rather than random programs and user environments. We might need multiple bits, but we do have multiple bits to play with in mode_t. The main issue with adding mode_t bits is going to be: will apps handle the appearance of new mode_t bits correctly? I suspect that they will, or at least that we'd consider it a bug if they didn't. Or we could add a new file attribute. But given cheap datasets, why not settle for a suitable dataset property as a starting point? I.e., maybe we could play with aclmode a little more. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On Fri, Feb 26, 2010 at 03:00:29PM -0500, Miles Nordin wrote: > >>>>> "nw" == Nicolas Williams writes: > > nw> What could we do to make it easier to use ACLs? > > 1. how about AFS-style ones where the effective permission is the AND >of the ACL and the unix permission? You might have to combine this Yes, that sounds useful. (Group modebits could be applied to all ACEs that are neither owner@ nor everyone@ ACEs.) >with an inheritable-by-subdirectories umask setting so you could >create ACL-dominated lands of files that are all unix 777, but this >would stop clobbering difficult-to-recreate ACL's as well as >unintended information leaking. If users have private primary groups then you can have them run with umask 007 or 002 and use set-gid and/or inheritable ACLs to ensure that users can share files in specific directories. (This is one reason that I recommend always giving users their own private primary groups.) Alternatively we could have a new mode bit to indicate that the group bits of umask are to be treated as zero, or maybe assign this behavior to the set-gid bit on ZFS. > 2. define a standard API for them, add ability to replicate them to >[...] That'd be nice. > Maybe we're beyond the point of no return for the first suggestion. Why? It can just be another value of the aclmode property. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
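The private-group sharing scheme in practice (group, dataset, and the ACE below are all illustrative):

  mkdir /tank/proj
  chgrp proj /tank/proj
  chmod 2770 /tank/proj    # set-gid directory: new files land in group proj
  chmod A+group:proj:read_data/write_data/execute:file_inherit/dir_inherit:allow /tank/proj
  ls -dv /tank/proj        # shows the inheritable ACE alongside the trivial entries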
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On Fri, Feb 26, 2010 at 04:26:43PM -0800, Paul B. Henson wrote: > On Fri, 26 Feb 2010, Nicolas Williams wrote: > > I believe we can do a bit better. > > > > A chmod that adds (see below) or removes one of r, w or x for owner is a > > simple ACL edit (the bit may turn into multiple ACE bits, but whatever) > > modifying / replacing / adding owner@ ACEs (if there is one). A similar > > chmod that affecting group bits should probably apply to group@ ACEs. A > > similar chmod that affecting other should apply to any everyone@ ACEs. > > I don't necessarily think that's better; and I believe that's approximately > the behavior you can already get with aclmode=passthrough. > > If something is trying to change permissions on an object with a > non-trivial ACL using chmod, I think it's safe to assume that's not what > the original user who configured the ACL wants. At least, that would be > safe to assume if the user had explicitly configured the hypothetical > aclmode=deny or aclmode=ignore :). Suppose you deny or ignore chmods. Well, how would you ever set or reset set-uid/gid and sticky bits? chmod(2) deals only in absolute modes, not relative changes, which means that in order to distinguish those bits from the rwx bits the filesystem would have to know the file's current mode bits in order to compare them to the new bits -- but this is hard (see my other e-mail in a new sub-thread). You'd have to remove the ACL then chmod; oof. > Take, for example, a problem I'm currently having on Linux clients mounting > ZFS over NFSv4. Linux supports NFSv4, and even has a utility to manipulate > NFSv4 ACL's that works ok (but isn't nearly as nice as the ACL integrated > chmod command in Solaris). However, the default behavior of the linux cp > command is to try and copy the mode bits along with the file. So, I copy > a file into zfs over the NFSv4 mount from some local location. The file is > created and inherits the explicitly configured ACL from the parent > directory; the cp command then does a chmod() on it and the ACL is broken. > That's not what I want, I configured that inheritable ACL for a reason, and > I want it respected regardless of the permissions of the file in its > original location. Can you make that utility avoid the chmod? The mode bits should come from the open(2)/creat(2), and there should be no need to set them again after setting the ACL. > Another instance is an application that doesn't seem to trust creat() and > umask to do the right thing, after creating a file it explicitly chmod's it > to match the permissions it thinks it should have had based on the > requested mode and the current umask. If the file inherited an explicitly > specified non-trivial ACL, there's really nothing that can be done about > that chmod, other than ignore or deny it, that will result in the > permissions intended by the user who configured the ACL. Such an app is broken. > > For set-uid/gid and the sticky bits being set/cleared on non-directories > > chmod should not affect the ACL at all. > > Agreed. But see above, below. > > For directories the sticky and setgid bits may require editing the > > inherittable ACEs of the ACL. > > Sticky bit yes; in fact, as it affects permissions I think I'd lump that in > to the ignore/deny category. sgid on directory though? That doesn't > explicitly affect permission, it just potentially changes the group > ownership of new files/directories. 
I suppose that indirectly affects permissions, as the implicit group@ ACE would be applied to a different group, but that's probably the intention of the person setting the sgid bit, and I don't think any actual ACL entry changes should occur from it. I think both can be implemented as inheritable ACLs. > > chmod(2) always takes an absolute mode. ZFS would have to reconstruct > > the relative change based on the previous mode... > Or perhaps some interface extension allowing relative changes to the > non-permission mode bits? But we'd have to extend NFSv4 and get the extension adopted and deployed. There's no chance of such a change being made in a short period of time -- we're talking years. > For example, chown(2) allows you to specify -1 > for either the user or group, meaning don't change that one. mode_t is > unsigned, so negative values won't work there, but there are a ton of > extra bits in an unsigned int not relevant to the mode, perhaps setting one > of them to signify only non-permission-related mode bits should be > manipulated: True, there's enough unused bits there that you could add ignore bits (and mode4 is an unsigned 32-bit integer in NFSv4). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] chmod(2) vs. ACLs (Re: Who is using ZFS ACL's in production?)
On Fri, Feb 26, 2010 at 05:02:34PM -0600, David Dyer-Bennet wrote: > > On Fri, February 26, 2010 12:45, Paul B. Henson wrote: > > > I've already posited as to an approach that I think would make a pure-ACL > > deployment possible: > > > > > > http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037206.html > > > > Via this concept or something else, there needs to be a way to configure > > ZFS to prevent the attempted manipulation of legacy permission mode bits > > from breaking the security policy of the ACL. > > It seems to me that it should depend. > > chown ddb /path/to/file > chmod 640 /path/to/file > > constitutes explicit instructions to give read-write access to ddb, read access to people in the group, and no access to others. Now, how should that be combined with an ACL? The chown is irrelevant (well, it's relevant to you in terms of your intentions, but it's very hard for the filesystem to consider a chmod in relation to earlier chowns and chgrps). I see four ways to handle the mode mask vs. ACL conflict:

a) clobber the ACL;
b) map the change as best you can to an ACL change;
c) ignore the rwx bits in the mode mask (except on create from a POSIX open(2)/creat(2), in which case the ACL has to be derived from the initial mode);
d) fail the chmod().

All four can be surprising! (d) may be the least surprising, but it may disrupt some apps. (b) is the next least surprising, but it has some dangerous effects. (b) is tricky because the filesystem needs to figure out what the change actually was by tracking mode bits from the beginning. For (b) IMO the right thing to do would be to always track a mode mask whose rwx bits are not actually used for authorization, but which are used to detect changes on chmod(2), and then the changes should be applied as best effort edits of the ACLs. On create via non-POSIX methods the mode mask would have to be constructed synthetically. When the ACL is edited the current mode bits have to be brought in sync with owner@/group@/everyone@ ACEs. All methods of synchronizing or synthesizing a mode mask from/to an ACL are going to be lossy. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
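Option (b)-style behavior can be watched today through the aclmode property (names illustrative; groupmask is the current default):

  zfs set aclmode=groupmask tank/home
  chmod A+user:alice:read_data/write_data:allow /tank/home/doc
  ls -v /tank/home/doc       # non-trivial ACL in place
  chmod 640 /tank/home/doc   # the mode change is mapped onto the ACL
  ls -v /tank/home/doc       # alice's ACE is cut down to fit the group bits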
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On Fri, Feb 26, 2010 at 02:50:05PM -0800, Paul B. Henson wrote: > On Fri, 26 Feb 2010, Bill Sommerfeld wrote: > > > I believe this proposal is sound. > > Mere words can not express the sheer joy with which I receive this opinion > from an @sun.com address ;). I believe we can do a bit better. A chmod that adds (see below) or removes one of r, w or x for owner is a simple ACL edit (the bit may turn into multiple ACE bits, but whatever) modifying / replacing / adding owner@ ACEs (if there is one). A similar chmod affecting group bits should probably apply to group@ ACEs. A similar chmod affecting other should apply to any everyone@ ACEs. For set-uid/gid and the sticky bits being set/cleared on non-directories, chmod should not affect the ACL at all. For directories the sticky and setgid bits may require editing the inheritable ACEs of the ACL. > There's also the question of what to do with the non-access-control pieces > of the legacy mode bits that have no ACL equivalent (suid, sgid, sticky > bit, et al). I think the only way to set those is with an absolute chmod, chmod(2) always takes an absolute mode. ZFS would have to reconstruct the relative change based on the previous mode... but how to know what the "previous mode" was? ZFS would have to construct one from the owner@/group@/everyone@ + set-uid/gid + sticky bits, if any. Best effort will do. > so there'd be no way to manipulate them in the current implementation > without whacking the ACL. That's likely done relatively infrequently, those > bits could always be set before the ACL is applied. In our current > deployment the only one we use is sgid on directories, which is inherited, > not directly applied. You should probably stop using the set-gid bit on directories and use inheritable ACLs instead... Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On Fri, Feb 26, 2010 at 08:23:40AM -0800, Paul B. Henson wrote: > So far it's been quite a struggle to deploy ACL's on an enterprise central > file services platform with access via multiple protocols and have them > actually be functional and reliable. I can see why the average consumer > might give up. Can you describe your struggles? What could we do to make it easier to use ACLs? Is this about chmod [and so random apps] clobbering ACLs? or something more fundamental about ACLs? Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Wed, Feb 24, 2010 at 03:31:51PM -0600, Bob Friesenhahn wrote: > With millions of such tiny files, it makes sense to put the small > files in a separate zfs filesystem which has its recordsize property > set to a size not much larger than the size of the files. This should > reduce waste, resulting in reduced potential for fragmentation in the > rest of the pool. Tuning the dataset recordsize down does not help in this case. The files are already small, so their recordsize is already small. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with hundreds of millions of files
On Wed, Feb 24, 2010 at 02:09:42PM -0600, Bob Friesenhahn wrote: > I have a directory here containing a million files and it has not > caused any strain for zfs at all although it can cause considerable > stress on applications. The biggest problem is always the apps. For example, ls by default sorts, and if you're using a locale with a non-trivial collation (e.g., any UTF-8 locales) then the sort gets very expensive. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
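The effect is easy to measure (directory name illustrative):

  time ls /tank/bigdir > /dev/null            # sorts with the locale's collation rules
  time LC_ALL=C ls /tank/bigdir > /dev/null   # byte-wise collation: much cheaper
  time ls -f /tank/bigdir > /dev/null         # no sort at all; entries in directory order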
Re: [zfs-discuss] ZFS 'secure erase'
On Mon, Feb 08, 2010 at 03:41:16PM -0500, Miles Nordin wrote: > ch> In our particular case, there won't be > ch> snapshots of destroyed filesystems (I create the snapshots, > ch> and destroy them with the filesystem). > > Right, but if your zpool is above a zvol vdev (ex COMSTAR on another > box), then someone might take a snapshot of the encrypted zvol. Then > after you ``securely delete'' a filesystem by overwriting various > intermediate keys or whatever, they might roll back the zvol snapshot > to undelete. > > Yes, you still need the passphrase to reach what they've undeleted, > but that's always true---what's ``secure delete'' supposed to mean > besides the ability to permanently remove one dataset but not others, > even from those who possess the passphrase? Otherwise it would not be > a feature. It would just be a suggestion: ``forget your passphrase.'' Correct. Secure erasure through "forgetting the keys" really does depend on "forgetting the keys", which does include "forgetting the passphrase". The only way to avoid that would be to store the wrapped keys in local keystores (i.e., a TPM or a smartcard) that do support secure erasure, so that "forgetting the keys" can be done without having to forget passphrases. > nw> ZFS crypto over zvols and what not presents no additional > nw> problems. > > If you are counting on the ability to forget a key by overwriting the > block of vdev in which the key's stored, then doing it over zvol's is > an additional problem. True, but this could happen regardless of whether the underlying storage is a zvol or not. I stand by the statement that "ZFS crypto over zvols and what not presents no additional problems". Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS 'secure erase'
On Fri, Feb 05, 2010 at 05:08:02PM -0500, c.hanover wrote: > In our particular case, there won't be snapshots of destroyed > filesystems (I create the snapshots, and destroy them with the > filesystem). OK. > I'm not too sure on the particulars of NFS/ZFS, but would it be > possible to create a 1GB file without writing any data to it, and then > use a hex editor to access the data stored on those blocks previously? Absolutely not. That is, you can create a 1GB file without writing to it, but it will appear to contain all zeros. > Any chance someone could make any kind of sense of the contents > (allocated in the same order they were before, or what have you)? No. See above. > ZFS crypto will be nice when we get either NFSv4 or NFSv3 w/krb5 for > over the wire encryption. Until then, not much point. You can use NFS with krb5 over the wire encryption _now_. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
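This is easy to demonstrate (paths illustrative):

  mkfile -n 1g /tank/share/big    # -n: sets the size without allocating blocks
  du -h /tank/share/big           # next to no space consumed
  dd if=/tank/share/big bs=1024k count=1 | od -c | head -2   # reads back as all zeros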
Re: [zfs-discuss] ZFS 'secure erase'
On Fri, Feb 05, 2010 at 04:41:08PM -0500, Miles Nordin wrote: > > "ch" == c hanover writes: > > ch> is there a way to a) securely destroy a filesystem, > > AIUI zfs crypto will include this, some day, by forgetting the key. Right. > but for SSD, zfs above a zvol, or zfs above a SAN that may do > snapshots without your consent, I think it's just logically not a > solvable problem, period, unless you have a writeable keystore > outside the vdev structure. IIRC ZFS crypto will store encrypted blocks in L2ARC and ZIL, so forgetting the key is sufficient to obtain a high degree of security. ZFS crypto over zvols and what not presents no additional problems. However, if your passphrase is guessable then the key might be recoverable even after it's "forgotten". Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS 'secure erase'
On Fri, Feb 05, 2010 at 03:49:15PM -0500, c.hanover wrote: > Two things, mostly related, that I'm trying to find answers to for our > security team. > > Does this scenario make sense: > * Create a filesystem at /users/nfsshare1, user uses it for a while, > asks for the filesystem to be deleted > * New user asks for a filesystem and is given /users/nfsshare2. What > are the chances that they could use some tool or other to read > unallocated blocks to view the previous user's data? If the tool isn't accessing the raw disks, then the answer is "no chance". (There's no way to access the raw disks over NFS.) > Related to that, when files are deleted on a ZFS volume over an NFS > share, how are they wiped out? Are they zeroed or anything. Same > question for destroying ZFS filesystems, does the data lay about in > any way? (That's largely answered by the first scenario.) Deleting a file does not guarantee that data blocks are released: snapshots might exist that retain references to the data blocks of a file that is being deleted. Nor are blocks wiped when released. > If the data is retrievable in any way, is there a way to a) securely > destroy a filesystem, or b) securely erase empty space on a > filesystem. When ZFS crypto ships you'll be able to securely destroy encrypted datasets. Until then the only form of secure erasure is to destroy the pool and then wipe the individual disks. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] unionfs help
On Thu, Feb 04, 2010 at 04:03:19PM -0500, Frank Cusack wrote: > On 2/4/10 2:46 PM -0600 Nicolas Williams wrote: > >In Frank's case, IIUC, the better solution is to avoid the need for > >unionfs in the first place by not placing pkg content in directories > >that one might want to be writable from zones. If there's anything > >about Perl5 (or anything else) that causes this need to arise, then I > >suggest filing a bug. > > Right, and thanks for chiming in. Problem is that perl wants to install > add-on packages in places that the coincide with the system install. > Most stuff is limited to the site_perl directory, which is easily > redirected, but it also has some other locations it likes to meddle with. Maybe we need a zone_perl location. Judicious use of the search paths will get you out of this bind, I think. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] unionfs help
On Thu, Feb 04, 2010 at 03:19:15PM -0500, Frank Cusack wrote: > BTW, I could just install everything in the global zone and use the > default "inheritance" of /usr into each local zone to see the data. > But then my zones are not independent portable entities; they would > depend on some non-default software installed in the global zone. > > Just wanted to explain why this is valuable to me and not just some > crazy way to do something simple. There's no unionfs for Solaris. (For those of you who don't know, unionfs is a BSDism and is a pseudo-filesystem which presents the union of two underlying filesystems, but with all changes being made only to one of the two filesystems. The idea is that one of the underlying filesystems cannot be modified through the union, with all changes made through the union being recorded in an overlay fs. Think, for example, of unionfs-mounting read-only media containing sources: you could cd to the mount point and build the sources, with all intermediate files and results placed in the overlay.) In Frank's case, IIUC, the better solution is to avoid the need for unionfs in the first place by not placing pkg content in directories that one might want to be writable from zones. If there's anything about Perl5 (or anything else) that causes this need to arise, then I suggest filing a bug. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device
On Thu, Jan 21, 2010 at 02:11:31PM -0800, Moshe Vainer wrote: > >PS: For data that you want to mostly archive, consider using Amazon > >Web Services (AWS) S3 service. Right now there is no charge to push > >data into the cloud and it's $0.15/gigabyte to keep it there. Do a > >quick (back of the napkin) calculation on what storage you can get for > >$30/month and factor in bandwidth costs (to pull the data when/if you > >need it). My "napkin" calculations tell me that I cannot compete > >with AWS S3 for up to 100GB of storage available 7x24. Even the > >electric utility bill would be more than AWS charges - especially when > >you consider UPS and air conditioning. And that's not including any > >hardware (capital equipment) costs! see: http://aws.amazon.com/s3/ > > When going the amazon route, you always need to take into account > retrieval time/bandwidth cost. If you were to store 100GB on Amazon - > how fast can you get your data back, or how much would bandwidth cost > you to retrieve it in a timely manner? It is all a matter of > requirements of course. Don't forget asymmetric upload/download bandwidth. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
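For reference, the napkin math on the storage side (assuming the $0.15 figure is per gigabyte-month, as S3 charged at the time): 100 GB x $0.15/GB-month = $15/month, half the $30/month budget mentioned above -- and that is before the retrieval bandwidth charges that the follow-up rightly flags.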
Re: [zfs-discuss] DeDup and Compression - Reverse Order?
On Thu, Dec 17, 2009 at 03:32:21PM +0100, Kjetil Torgrim Homme wrote: > if the hash used for dedup is completely separate from the hash used for > data protection, I don't see any downsides to computing the dedup hash > from uncompressed data. why isn't it? Hash and checksum functions are slow (hash functions are slower, but either way you'll be loading large blocks of data, which sets a floor for cost). Duplicating work is bad for performance. Using the same checksum for integrity protection and dedup is an optimization, and a very nice one at that. Having separate checksums would require making blkptr_t larger, which imposes its own costs. There are lots of trade-offs here. Using the same checksum/hash for integrity protection and dedup is a great solution. If you use a non-cryptographic checksum algorithm then you'll want to enable verification for dedup. That's all. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
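In zfs(1M) terms the two ends of that trade-off look roughly like this (pool name hypothetical; the property syntax is from memory, so check your build's manpage):

    # Cryptographic hash; verification is optional:
    zfs set dedup=sha256 tank
    # Fast non-cryptographic checksum; then you want verification:
    zfs set dedup=fletcher4,verify tank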
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 03, 2009 at 12:44:16PM -0800, Per Baatrup wrote: > >if any of f2..f5 have different block sizes from f1 > > This restriction does not sound so bad to me if this only refers to > changes to the blocksize of a particular ZFS filesystem or copying > between different ZFSes in the same pool. This can probably be managed > with a "-f" switch on the userland app to force the copy when it would > fail. Why expose such details? If you have dedup on and if the file blocks and sizes align then cat f1 f2 f3 f4 f5 > f6 will do the right thing and consume only space for new metadata. If the file blocks and sizes do not align then cat f1 f2 f3 f4 f5 > f6 will still work correctly. Or do you mean that you want a way to do that cat ONLY if it would consume no new space for data? (That might actually be a good justification for a ZFS cat command, though I think, too, that one could script it; see the sketch below.) > >any of f1..f5's last blocks are partial > > Does this mean that f1,f2,f3,f4 need to be an exact multiple of the ZFS > blocksize? This is a severe restriction that will fail except in very > special cases. Say f1 is 1MB, f2 is 128KB, f3 is 510 bytes, f4 is 514 bytes, and f5 is 10MB, and the recordsize for their containing datasets is 128KB; then the new file will consume 10MB + 128KB more than f1..f5 did, but 1MB + 128KB will be de-duplicated. This is not really "a severe restriction". To make ZFS do better than that would require much extra metadata and complexity in the filesystem that users who don't need to do space-efficient file concatenation (most users, that is) won't want to pay for. > Is this related to the disk format or is it a restriction in the > implementation? (do you know where to look in the source code?). Both. > >...but also ZFS most likely could not do any better with any other, more > >specific non-dedup solution > > Probably lots of I/O traffic, digest calculation+lookups, could be > saved as we already know it will be a duplicate. (In our case the > files are gigabyte sizes.) ZFS hashes, and records hashes of, whole blocks, not sub-blocks. Look at my above example. To efficiently dedup the concatenation of the 10MB of f5 would require being able to have something like "sub-block pointers". Alternatively, if you want a concatenation-specific feature ZFS would have to have a metadata notion of concatenation, but then the Unix way of concatenating files couldn't be used for this since the necessary context is lost in the I/O redirection. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
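A sketch of that scripting approach (ksh; the 128K recordsize and the file names are assumptions, not taken from the thread):

    #!/bin/ksh
    # Concatenate only when dedup can share every input's blocks:
    # all files except the last must be a multiple of the recordsize.
    rs=131072
    n=$#
    i=0
    for f in "$@"; do
            i=$((i + 1))
            [ "$i" -eq "$n" ] && break      # last file's tail may be partial
            sz=$(wc -c < "$f")
            if [ $((sz % rs)) -ne 0 ]; then
                    echo "$f: $sz bytes, not recordsize-aligned; cat would not dedup" >&2
                    exit 1
            fi
    done
    cat "$@" > f6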
Re: [zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 03, 2009 at 03:57:28AM -0800, Per Baatrup wrote: > I would like to concatenate N files into one big file taking > advantage of ZFS copy-on-write semantics so that the file > concatenation is done without actually copying any (large amount of) > file content. > cat f1 f2 f3 f4 f5 > f15 > Is this already possible when source and target are on the same ZFS > filesystem? > > I am looking into the ZFS source code to understand if there are > sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2 > f3 f4 f5" userland application in C code. Does anybody have advice on > this? There have been plenty of answers already. Quite aside from dedup, the fact that all blocks in a file must have the same uncompressed size means that if any of f2..f5 have different block sizes from f1, or any of f1..f5's last blocks are partial, then ZFS could not perform this concatenation as efficiently as you wish. In other words: dedup _is_ what you're looking for... ...but also ZFS most likely could not do any better with any other, more specific non-dedup solution. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard
On Mon, Sep 07, 2009 at 09:58:19AM -0700, Richard Elling wrote: > I only know of "hole punching" in the context of networking. ZFS doesn't > do networking, so the pedantic answer is no. But a VDEV may be an iSCSI device, thus there can be networking below ZFS. For some iSCSI targets (including ZVOL-based ones) a hole-punching operation can be very useful since it explicitly tells the backend that some contiguous block of space can be released for allocation to others. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] PSARC recover files?
On Tue, Nov 10, 2009 at 03:33:22PM -0600, Tim Cook wrote: > You're telling me a scrub won't actively clean up corruption in snapshots? > That sounds absolutely absurd to me. Depends on how much redundancy you have in your pool. If you have no mirrors, no RAID-Z, and no ditto blocks for data, well, you have no redundancy, and ZFS won't be able to recover affected files. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
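Ditto blocks for data are the one form of redundancy you can add without changing the pool layout (dataset name hypothetical; this affects newly written blocks only):

    # Store two copies of each data block, even on a single-disk pool:
    zfs set copies=2 tank/home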
Re: [zfs-discuss] dedup question
On Mon, Nov 02, 2009 at 11:01:34AM -0800, Jeremy Kitchen wrote: > forgive my ignorance, but what's the advantage of this new dedup over > the existing compression option? Wouldn't full-filesystem compression > naturally de-dupe? If you snapshot/clone as you go, then yes, dedup will do little for you because you'll already have done the deduplication via snapshots and clones. But dedup will give you that benefit even if you don't snapshot/clone all your data. Not all data can be managed hierarchically, with a single dataset at the root of a history tree. For example, suppose you want to create two VirtualBox VMs running the same guest OS, sharing as much on-disk storage as possible. Before dedup you had to: create one VM, then snapshot and clone that VM's VDI files, use an undocumented command to change the UUID in the clones, import them into VirtualBox, and set up the cloned VM using the cloned VDI files. (I know because that's how I manage my VMs; it's a pain, really.) With dedup you need only enable dedup and then install the two VMs. Clearly the dedup approach is far, far easier to use than the snapshot/clone approach. And since you can't always snapshot/clone... There are many examples where snapshot/clone isn't feasible but dedup can help. For example: mail stores (though they can do dedup at the application layer by using message IDs and hashes). For example: home directories (think of users saving documents sent via e-mail). For example: source code workspaces (ONNV, Xorg, Linux, whatever), where users might not think ahead to snapshot/clone a local clone (I also tend to maintain a local SCM clone that I then snapshot/clone to get workspaces for bug fixes and projects; it's a pain, really). I'm sure there are many, many other examples. The workspace example is particularly interesting: with the snapshot/clone approach you get to deduplicate the _source code_, but not the _object code_, while with dedup you get both dedup'ed automatically. As for compression, that helps whether you dedup or not, and it helps by about the same factor either way -- dedup and compression are unrelated, really. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
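By contrast, the dedup workflow really is this short (dataset name hypothetical):

    zfs set dedup=on tank/vm
    # Install both guests under tank/vm; identical blocks are stored
    # once, and the ratio shows up in the DEDUP column of zpool list.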
Re: [zfs-discuss] dedupe is in
On Mon, Nov 02, 2009 at 12:58:32PM -0500, Dennis Clarke wrote: > Looking at FIPS-180-3 in sections 4.1.2 and 4.1.3 I was thinking that the > major leap from SHA256 to SHA512 was a 32-bit to 64-bit step. ZFS doesn't have enough room in blkptr_t for 512-bit hashes. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs inotify?
On Mon, Oct 26, 2009 at 08:53:50PM -0700, Anil wrote: > I haven't tried this, but this must be very easy with dtrace. How come > no one mentioned it yet? :) You would have to monitor some specific > syscalls... DTrace is not reliable in this sense: it will drop events rather than overburden the system. Also, system calls are not the only thing you want to watch for -- you should really trace the VFS/fop rather than syscalls for this. In any case, port_create(3C) and gamin are the way forward. port_create(3C) is rather easy to use. Searching the web for PORT_SOURCE_FILE you'll find useful docs like: http://blogs.sun.com/praks/entry/file_events_notification which has example code too. I do think it'd be useful to have a command-line utility in core Solaris that uses this facility, something like the example in Prakash's blog (which, incidentally, _works_), but perhaps a bit more complete. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can't rm file when "No space left on device"...
On Thu, Oct 01, 2009 at 11:03:06AM -0700, Rudolf Potucek wrote: > Hmm ... I understand this is a bug, but only in the sense that the > message is not sufficiently descriptive. Removing the file from the > source filesystem will not necessarily free any space because the > blocks have to be retained in the snapshots. The same problem exists > for zeroing the file with >file as suggested earlier. > > It seems like the appropriate solution would be to have a tool that > allows removing a file from one or more snapshots at the same time as > removing the source ... That would make them not really snapshots. And such a tool would have to "fix" clones too. Snapshots and clones are great. They are also great ways to consume too much space. One must do some spring cleaning once in a while. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
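The spring cleaning usually starts by seeing which snapshots uniquely hold space (pool name hypothetical):

    # USED on a snapshot is the space that destroying just it would free:
    zfs list -r -t snapshot -o name,used -s used tank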
Re: [zfs-discuss] zfs compression algorithm : jpeg ??
On Fri, Sep 04, 2009 at 01:41:15PM -0700, Richard Elling wrote: > On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote: > >We have groups generating terabytes a day of image data from lab > >instruments and saving them to an X4500. > > Wouldn't it be easier to compress at the application, or between the > application and the archiving file system? Especially when it comes to reading the images back! ZFS compression is transparent. You can't write uncompressed data then read back compressed data. And compression is at the block level, not for the whole file, so even if you could read it back compressed, it wouldn't be in a useful format. Most people want to transfer data compressed, particularly images. So compressing at the application level in this case seems best to me. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] utf8only and normalization properties
So, the manpage seems to have a bug in it. The valid values for the normalization property are: none | formC | formD | formKC | formKD Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] *Almost* empty ZFS filesystem - 14GB?
On Fri, Aug 21, 2009 at 06:46:32AM -0700, Chris Murray wrote: > Nico, what is a zero-link file, and how would I go about finding > whether I have one? You'll have to bear with me, I'm afraid, as I'm > still building my Solaris knowledge at the minute - I was brought up > on Windows. I use Solaris for my storage needs now though, and slowly > improving on my knowledge so I can move away from Windows one day :) I see that Mark S. thinks this may be a specific ZFS bug, and there's a followup with instructions on how to detect if that's the case. However, it can also be a zero-link file. I've certainly run into that problem before myself, on UFS and other filesystems. A zero-link file is a file that has been removed (unlink(2)ed), but which remains open in some process(es). Such a file continues to consume space until the processes that have it open are killed. Typically you'd use pfiles(1) or lsof to find such files. > If it makes any difference, the problem persists after a full reboot, Yeah, if you rebooted and there are no 14GB .nfs* files, then this is not a zero-link file. See the followups. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
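If lsof is installed, zero-link files can be listed directly rather than hunting process by process with pfiles(1):

    # Open files whose link count is zero (deleted but still held open):
    lsof +L1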
Re: [zfs-discuss] zfs send speed
On Tue, Aug 18, 2009 at 04:22:19PM -0400, Paul Kraus wrote: > We have a system with some large datasets (3.3 TB and about 35 > million files) and conventional backups take a long time (using > Netbackup 6.5 a FULL takes between two and three days, differential > incrementals, even with very few files changing, take between 15 and > 20 hours). We already use snapshots for day to day restores, but we > need the 'real' backups for DR. zfs send will be very fast for "differential incrementals ... with very few files changing" since zfs send is a block-level diff based on the differences between the selected snapshots. Where a traditional backup tool would have to traverse the entire filesystem (modulo pruning based on ctime/mtime), zfs send simply traverses a list of changed blocks that's maintained by ZFS as you make changes in the first place. For a *full* backup zfs send and traditional backup tools will have similar results as both will be I/O bound and both will have more or less the same number of I/Os to do. Caveat: zfs send formats are not guaranteed to be backwards compatible, therefore zfs send is not suitable for long-term backups. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
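The incremental form, for reference (dataset, snapshot, and host names hypothetical):

    # Send only the blocks that changed between the two snapshots:
    zfs send -i tank/data@mon tank/data@tue | ssh drhost zfs recv -d drpool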
Re: [zfs-discuss] *Almost* empty ZFS filesystem - 14GB?
Perhaps an open 14GB, zero-link file? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] utf8only and normalization properties
On Thu, Aug 13, 2009 at 05:57:57PM -0500, Haudy Kazemi wrote: > >Therefore, if you need to interoperate with MacOS X then you should > >enable the normalization feature. > > > Thank you for the reply. My goal is to configure the filesystem for the > lowest common denominator without knowing up front which clients will be > used. OS X and Win XP are listed because they are commonly used as > desktop OSes. Ubuntu Linux is a third potential desktop OS. Right, so set normalization=formD. > The normalization property documentation says "this property indicates > whether a file system should perform a unicode normalization of file > names whenever two file names are compared. File names are always > stored unmodified, names are normalized as part of any comparison > process." Where does the file system use filename comparisons and what > does it use them for? Filename collision checking? Sorting? The system does filename comparisons when doing lookups (open("/foo/bar/baz", ...) does at least three such lookups, for example), and on create (since that involves a lookup). Yes, this is about collisions. Consider a file named "á" (that's "a" with an acute accent). There are _two_ possible encodings for that name in UTF-8. That means that you could have two files in the same directory and with the same name, though they'd have different names if you looked at the bytes that make up the names. That would be confusing, at the very least. To avoid such collisions you can enable normalization. You can find more here: http://blogs.sun.com/nico/entry/filesystem_i18n > Is it used for any other operation, say when returning a filename to an > application? Would applications reading/writing files to a ZFS No, directory listings always return the filename as given when the file was created, without any normalization. > filesystem ever notice the difference in normalization settings as long > as they produce filenames that do not conflict with existing names or > create invalid UTF8? The documentation says filenames are stored > unmodified, which sounds like things should be transparent to applications. Applications shouldn't notice normalization being enabled. The only reasons to disable normalization are: a) you don't want to force the use of UTF-8, or b) you consistently use a single normalization form and you don't want to pay a penalty for normalizing on lookup. (b) is probably not a problem -- the normalization code is fast if you use all US-ASCII strings, and it's linear with the number of non-ASCII Unicode codepoints in file names. But I don't have performance numbers to share. I think that normalization should be enabled by default if you enable utf8only, and utf8only should probably be enabled by default in Solaris, but that's just my personal opinion. > (In regard to filename collision checking, if non-normalized unmodified > filenames are always stored on disk, and they don't conflict in > non-normalized form, what would the point be of normalizing the > filenames for a comparison? To verify there isn't conflict in > normalized forms, and if there is no conflict with an existing file to > allow the filename to be written unmodified?) Yes. > The ZFS documentation doesn't list the valid values for the > normalization property other than 'none'. From your reply and from the The zfs(1M) manpage lists them: normalization = none | formD | formKC That's not all existing Unicode normalization forms, no. The reason for this is that we only normalize on lookup (the file names returned by readdir are not normalized), and for that the forms C and D are semantically equivalent, but K and non-K forms are not semantically equivalent, so we need one K form and one non-K form. NFD is faster than NFC, but the K forms require a trip through form C, so NFKC is faster than NFKD (at least if I remember correctly). Which means that NFD and NFKC were sufficient, and there's no reason to ever want NFC or NFKD. > suggest they be added to the documentation at > http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html Yes, that's a good point. PS: ZFS directories are hashed. When normalization is enabled, the hash keys are normalized on create, but the hash contents are not, so filenames remain unnormalized. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
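To make the "á" example concrete, the two valid UTF-8 encodings that can name the same-looking file are:

    NFC (precomposed):  0xC3 0xA1        U+00E1 LATIN SMALL LETTER A WITH ACUTE
    NFD (decomposed):   0x61 0xCC 0x81   'a' followed by U+0301 COMBINING ACUTE ACCENT

Without normalization those are two distinct directory entries; with it, a lookup of either form finds the same file.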
Re: [zfs-discuss] utf8only and normalization properties
On Wed, Aug 12, 2009 at 06:17:44PM -0500, Haudy Kazemi wrote: > I'm wondering what are some use cases for ZFS's utf8only and > normalization properties. They are off/none by default, and can only be > set when the filesystem is created. When should they specifically be > enabled and/or disabled? (i.e. Where is using them a really good idea? > Where is using them a really bad idea?) These are for interoperability. The world is converging on Unicode for filesystem object naming. If you want to exclude non-Unicode strings then you should set utf8only (some non-Unicode strings in some codesets can look like valid UTF-8 though). But Unicode has multiple canonical and non-canonical ways of representing certain characters (e.g., ´). Solaris and Windows input methods tend to conform to NFKC, so they will interop even if you don't enable the normalization feature. But MacOS X normalizes to NFD. Therefore, if you need to interoperate with MacOS X then you should enable the normalization feature. > Looking forward, starting with Windows XP and OS X 10.5 clients, is > there any reason to change the defaults in order to minimize problems? You should definitely enable normalization (see above). It doesn't matter what normalization form you use, but "nfd" runs faster than "nfc". The normalization feature doesn't cost much if you use all US-ASCII file names. And it doesn't cost much if your file names are mostly US-ASCII. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
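Both properties can only be set at filesystem creation time, so in practice (dataset name hypothetical):

    zfs create -o utf8only=on -o normalization=formD tank/share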
Re: [zfs-discuss] feature proposal
On Wed, Jul 29, 2009 at 03:35:06PM +0100, Darren J Moffat wrote: > Andriy Gapon wrote: > >What do you think about the following feature? > > > >"Subdirectory is automatically a new filesystem" property - an > >administrator turns > >on this magic property of a filesystem, after that every mkdir *in the > >root* of > >that filesystem creates a new filesystem. The new filesystems have > >default/inherited properties except for the magic property which is off. > > This has been brought up before and I thought there was an open CR for > it but I can't find it. I'd want this to be something one could set per-directory, and I'd want it to not be inheritable (or to have control over whether it is inheritable). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
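In shell terms the proposal would read roughly like this (the property name here is invented purely for illustration; no such property exists):

    zfs set autochild=on tank/home    # hypothetical magic property
    mkdir /tank/home/alice            # would create dataset tank/home/alice
    zfs list -r tank/home             # ...which shows up as a filesystem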
Re: [zfs-discuss] The importance of ECC RAM for ZFS
On Fri, Jul 24, 2009 at 05:01:15PM +0200, dick hoogendijk wrote: > On Fri, 24 Jul 2009 10:44:36 -0400 > Kyle McDonald wrote: > > ... then it seems like a shame (or a waste?) not to equally > > protect the data both before it's given to ZFS for writing, and after > > ZFS reads it back and returns it to you. > > But that was not the question. > The question was: [quote] "My question is: is there any technical > reason, in ZFS's design, that makes it particularly important for ZFS > to require ECC RAM?" The only thing I can think of is this: if a cosmic ray flips a bit in memory holding a ZFS transaction that's already had all its checksums computed, but hasn't hit disk yet, then you'll have a checksum verification failure later when you read back the affected file (or directory). Using ECC memory avoids that. You still have the processor to worry about though. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] virtualization, alignment and zfs variation stripes
On Wed, Jul 22, 2009 at 02:45:52PM -0500, Bob Friesenhahn wrote: > On Wed, 22 Jul 2009, t. johnson wrote: > >Let's say I have a simple-ish setup that uses vmware files for > >virtual disks on an NFS share from zfs. I'm wondering how zfs' > >variable block size comes into play? Does it make the alignment > >problem go away? Does it make it worse? Or should we perhaps be > > My understanding is that zfs uses fixed block sizes except for the > tail block of a file, or if the filesystem has compression enabled. For one-block files, the block is variable, between 512 bytes and the smaller of the dataset's recordsize or 128KB. For multi-block files all blocks are the same size, except the tail block. But these are sizes in file data, not actual on-disk sizes (which can be less because of compression). > Zfs's large blocks can definitely cause performance problems if the > system has insufficient memory to cache the blocks which are accessed, > or only part of the block is updated. You should set the virtual disk image files' recordsize (or, rather, the containing dataset's recordsize) to match the preferred block size of the filesystem types (or data) that you'll put on those virtual disks. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
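Matching the recordsize to the guest filesystem's block size is one line (dataset name hypothetical; NTFS and ext3 commonly use 4K blocks):

    zfs set recordsize=4k tank/vmdisks
    # Only virtual disk files written after this pick up the new size.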
Re: [zfs-discuss] SSDs get faster and less expensive
On Tue, Jul 21, 2009 at 02:45:57PM -0700, Richard Elling wrote: > But to put this in perspective, you would have to *delete* 20 GBytes Or overwrite (since overwrites turn into COW writes of new blocks, and the old blocks are released if not referred to from a snapshot). > of data a day on a ZFS file system for 5 years (according to Intel) to > reach the expected endurance. I don't know many people who delete > that much data continuously (I suspect that the satellite data vendors > might in their staging servers... not exactly a market for SSDs) Don't forget atime updates. If you just read, you're still writing. Of course, the writes from atime updates will generally be less than the number of data blocks read, so you might have to read many times that much data to get the same effect. (Speaking of atime updates, I run my root datasets with atime updates disabled. I don't have hard data, but it stands to reason that things go faster that way. I also mount filesystems in VMs with atime disabled. Yes, I'm picking nits; sorry.) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
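Turning atime off is a single property (dataset name hypothetical):

    zfs set atime=off rpool/ROOT
    # Reads then stop generating deferred access-time writes.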
Re: [zfs-discuss] APPLE: ZFS need bug corrections instead of new func! Or?
On Fri, Jun 19, 2009 at 04:09:29PM -0400, Miles Nordin wrote: > Also, as I said elsewhere, there's a barrier controlled by Sun to > getting bugs accepted. This is a useful barrier: the bug database is > a more useful drive toward improvement if it's not cluttered. It also > means, like I said, sometimes the mailing list is a more useful place > for information. There are two bug databases, sadly. bugs.opensolaris.org is like you describe, whereas defect.opensolaris.org is not. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zfs send speed. Was: User quota design discussion..
On Fri, May 22, 2009 at 04:40:43PM -0600, Eric D. Mudama wrote: > As another datapoint, the 111a opensolaris preview got me ~29MB/s > through an SSH tunnel with no tuning on a 40GB dataset. > > Sender was a Core2Duo E4500 reading from SSDs and receiver was a Xeon > E5520 writing to a few mirrored 7200RPM SATA vdevs in a single pool. > Network was a $35 8-port gigabit netgear switch. Unfortunately SunSSH doesn't know how to grow SSHv2 channel windows to take full advantage of the TCP BDP (bandwidth-delay product), so you could probably have gone faster. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss