Re: [zfs-discuss] ZFS Dedup question

2011-01-28 Thread Nicolas Williams
On Fri, Jan 28, 2011 at 01:38:11PM -0800, Igor P wrote:
> I created a zfs pool with dedup with the following settings:
> zpool create data c8t1d0
> zfs create data/shared
> zfs set dedup=on data/shared
> 
> The thing I was wondering about was that it seems like ZFS only dedups
> at the file level and not the block level. When I make multiple copies
> of a file to the store I see an increase in the dedup ratio, but when I
> copy similar files the ratio stays at 1.00x.

Dedup is done at the block level, not file level.  "Similar files" does
not mean that they actually share common blocks.  You'll have to look
more closely to determine if they do.
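
A quick way to see this empirically (a sketch against the pool from your
commands above; "bigfile" stands for any file bigger than one record):

  $ cp bigfile /data/shared/a
  $ cp bigfile /data/shared/b       # identical blocks; DDT refcounts go up
  $ zpool get dedupratio data       # ratio rises above 1.00x
  $ zdb -DD data                    # DDT histogram: blocks by reference count

Two "similar" files will only show up there if some of their
record-aligned blocks are byte-for-byte identical.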

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-18 Thread Nicolas Williams
On Tue, Jan 18, 2011 at 07:16:04AM -0800, Orvar Korvar wrote:
> BTW, I thought about this. What do you say?
> 
> Assume I want to compress data and I succeed in doing so. And then I
> transfer the compressed data. So all the information I transferred is
> the compressed data. But, then you don't count all the information:
> knowledge about which algorithm was used, which number system, laws of
> math, etc. So there is lots of other information that is implicit
> when you compress/decompress - not just the data.
> 
> So, if you add data and all implicit information you get a certain bit
> size X. Do this again on the same set of data, with another algorithm
> and you get another bit size Y. 
> 
> You compress the data, using lots of implicit information. If you use
> less implicit information (simple algorithm relying on simple math),
> will X be smaller than if you use lots of implicit information
> (advanced algorithm relying on a large body of advanced math)? What
> can you say about the numbers X and Y? Advanced math requires many
> math books that you need to transfer as well.

Just as the laws of thermodynamics preclude perpetual motion machines,
so do they preclude infinite, loss-less data compression.  Yes,
thermodynamics and information theory are linked, amazingly enough.

Data compression algorithms work by identifying certain types of
patterns, then replacing the input with notes such as "pattern 1 is ...
and appears at offsets 12345 and 1234567" (I'm simplifying a lot).  Data
that has few or no patterns observable by the compression algorithm in
question will not compress, and will in fact expand if you insist on
compressing it -- randomly-generated data (e.g., the output of
/dev/urandom) is the extreme case.  Even the single bit needed to
indicate whether a file is compressed will mean expansion when you fail
to compress and store the original instead of the "compressed" version.
Data compression reduces repetition, thus making it harder to further
compress already-compressed data.

Try it yourself.  Try building a pipeline of all the compression tools
you have, see how many rounds of compression you can apply to typical
data before further compression fails.
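
For instance (a sketch; any large text file will do in place of
/var/adm/messages):

  $ dd if=/dev/urandom of=rand bs=1024 count=128 2>/dev/null
  $ gzip -c rand | wc -c                          # > 131072: pure framing overhead
  $ gzip -c /var/adm/messages | wc -c             # round one: big win on text
  $ gzip -c /var/adm/messages | gzip -c | wc -c   # round two: slightly larger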

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-17 Thread Nicolas Williams
On Sat, Jan 15, 2011 at 10:19:23AM -0600, Bob Friesenhahn wrote:
> On Fri, 14 Jan 2011, Peter Taps wrote:
> 
> >Thank you for sharing the calculations. In lay terms, for Sha256,
> >how many blocks of data would be needed to have one collision?
> 
> Two.

Pretty funny.

In this thread some of you are treating SHA-256 as an idealized hash
function.  The odds of accidentally finding collisions in an idealized
256-bit hash function are minute because the distribution of hash
function outputs over inputs is random (or, rather, pseudo-random).

But cryptographic hash functions are generally only approximations of
idealized hash functions.  There's nothing to say that there aren't
pathological corner cases where a given hash function produces lots of
collisions that would be semantically meaningful to people -- i.e., a
set of inputs over which the outputs are not randomly distributed.  Now,
of course, we don't know of such pathological corner cases for SHA-256,
but not that long ago we didn't know of any for SHA-1 or MD5 either.

The question of whether disabling verification would improve performance
is pretty simple: if you have highly deduplicatious, _synchronous_ (or
nearly so, due to frequent fsync()s or NFS close operations) writes, and
the "working set" does not fit in the ARC or L2ARC, then yes, disabling
verification will help significantly, by removing an average of at least
half a disk rotation from the write latency.  Or if you have the same
workload but with asynchronous writes that might as well be synchronous
due to an undersized cache (relative to the workload).  Otherwise the
cost of verification should be hidden by caching.

Another way to put this would be that you should first determine that
verification is actually affecting performance, and only _then_ should
you consider disabling it.  But if you want to have the freedom to
disable verification, then you should be using SHA-256 (or switch to it
when disabling verification).

Safety features that cost nothing are not worth turning off,
so make sure their cost is significant before even thinking
of turning them off.
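
(To experiment with this, the dedup property takes the checksum and the
verify flag together -- a sketch, dataset name assumed:)

  $ zfs set dedup=sha256,verify tank/data   # hash match plus byte-for-byte verify
  $ zpool iostat -v tank 5                  # watch for extra verify reads under load
  $ zfs set dedup=sha256 tank/data          # trust the hash alone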

Similarly, the cost of SHA-256 vs. Fletcher should also be lost in the
noise if the system has enough CPU, but if the choice of hash function
could make the system CPU-bound instead of I/O-bound, then the choice of
hash function would make an impact on performance.  The choice of hash
functions will have a different performance impact than verification: a
slower hash function will affect non-deduplicatious workloads more than
highly deduplicatious workloads (since the latter will require more I/O
for verification, which will overwhelm the cost of the hash function).
Again, measure first.
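
A crude way to bound the CPU side (assuming the userland OpenSSL numbers
are representative of the kernel implementation):

  $ openssl speed sha256    # hash throughput at various block sizes

If that throughput comfortably exceeds your pool's write bandwidth, the
choice of hash function is unlikely to be what makes you CPU-bound.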

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-07 Thread Nicolas Williams
On Fri, Jan 07, 2011 at 06:39:51AM -0800, Michael DeMan wrote:
> On Jan 7, 2011, at 6:13 AM, David Magda wrote:
> > The other thing to note is that by default (with de-dupe disabled), ZFS
> > uses Fletcher checksums to prevent data corruption. Add also the fact all
> > other file systems don't have any checksums, and simply rely on the fact
> > that disks have a bit error rate of (at best) 10^-16.
> 
> Agreed - but I think it is still missing the point of what the
> original poster was asking about.
> 
> In all honesty I think the debate is a business decision - the highly
> improbable vs. certainty.

The OP seemed to be concerned that SHA-256 is particularly slow, so the
business decision here would involve a performance vs. error rate
trade-off.

Now, unless you have highly deduplicatious data, a workload with a high
cache hit ratio in the ARC for DDT entries, and a fast ZIL device, I
suspect that the I/O costs of dedup dominate the cost of the hash
function, which means: the above business trade-off is not worthwhile,
as one would be trading a tiny uptick in error rates for a small uptick
in performance.  Before you even get to where you're making such a
decision you'll want to have invested in plenty of RAM, L2ARC and fast
ZIL device capacity -- and for those making such an investment I suspect
that the OP's trade-off won't seem worthwhile.

BTW, note that verification isn't guaranteed to have a zero error
rate...  Imagine: a) a block being written collides with a different
block already in the pool; b) bit rot on disk corrupts that colliding
block such that the on-disk block now matches the new block; c) on a
mirrored vdev you might then read one or the other version of the block
in question, at random.  Such an error requires monumentally bad luck to
happen at all.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-06 Thread Nicolas Williams
On Thu, Jan 06, 2011 at 06:07:47PM -0500, David Magda wrote:
> On Jan 6, 2011, at 15:57, Nicolas Williams wrote:
> 
> > Fletcher is faster than SHA-256, so I think that must be what you're
> > asking about: "can Fletcher+Verification be faster than
> > Sha256+NoVerification?"  Or do you have some other goal?
> 
> Would running on recent T-series servers, which have have on-die
> crypto units, help any in this regard?

Yes, particularly for larger blocks.

Hash collisions don't matter as long as ZFS verifies dups, so the real
question is: what is the false positive dup rate (i.e., the accidental
collision rate)?  But that's going to vary a lot by {hash function,
working data set}, thus it's not possible to make exact determinations,
just estimates.

For me the biggest issue is that as good as Fletcher is for a CRC, I'd
rather have a cryptographic hash function because I've seen incredibly
odd CRC failures before.  There's a famous case from within SWAN a few
years ago where a switch flipped pairs of bits such that all too often
the various CRCs that applied to the moving packets failed to detect the
bit flips, and we discovered this when an SCCS file in a clone of the ON
gate got corrupted.  Such failures (collisions) wouldn't affect dedup,
but they would mask corruption of non-deduped blocks.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-06 Thread Nicolas Williams
On Thu, Jan 06, 2011 at 11:44:31AM -0800, Peter Taps wrote:
> I have been told that the checksum value returned by Sha256 is almost
> guaranteed to be unique.

All hash functions are guaranteed to have collisions [for inputs larger
than their output anyways].

>  In fact, if Sha256 fails in some case, we
> have a bigger problem such as memory corruption, etc. Essentially,
> adding verification to sha256 is an overkill.

What makes a hash function cryptographically secure is not impossibility
of collisions, but computational difficulty of finding arbitrary
colliding input pairs, collisions for known inputs, second pre-images,
and first pre-images.  Just because you can't easily find collisions on
purpose doesn't mean that you can't accidentally find collisions.

That said, if the distribution of SHA-256 is even enough then your
chances of finding a collision by accident are so remote (one in 2^128)
that you could reasonably decide that you don't care.
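
For scale, a back-of-the-envelope birthday estimate: among n random
blocks, the odds of any accidental pair colliding in an idealized
256-bit hash are roughly n^2 / 2^257.  A petabyte of 128KB blocks is
n = 2^33, for odds around 2^-191; even an absurd n = 2^64 blocks only
gets you to about 2^-129.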

> Perhaps (Sha256+NoVerification) would work 99.99% of the time. But
> (Fletcher+Verification) would work 100% of the time.

Fletcher is faster than SHA-256, so I think that must be what you're
asking about: "can Fletcher+Verification be faster than
Sha256+NoVerification?"  Or do you have some other goal?

Assuming I guessed correctly...  The speed of the hash function isn't
significant compared to the cost of the verification I/O, period, end of
story.  So, SHA-256 w/o verification will be faster than Fletcher +
Verification -- lots faster if you have particularly deduplicatious data
to write.  Moreover, SHA-256 + verification will likely be somewhat
faster than Fletcher + verification because SHA-256 will likely have
fewer collisions than Fletcher, and the cost of I/O dominates the cost
of the hash functions.

> Which one of the two is a better deduplication strategy?
> 
> If we do not use verification with Sha256, what is the worst case
> scenario? Is it just more disk space occupied (because of failure to
> detect duplicate blocks) or there is a chance of actual data
> corruption (because two blocks were assumed to be duplicate although
> they are not)?

If you don't verify then you run the risk of corruption on collision,
NOT the risk of using too much disk space.

> Or, if I go with (Sha256+Verification), how much is the overhead of
> verification on the overall process?
> 
> If I do go with verification, it seems (Fletcher+Verification) is more
> efficient than (Sha256+Verification). And both are 100% accurate in
> detecting duplicate blocks.

You're confused.  Fletcher may be faster to compute than SHA-256, but
the run-time of both is as nothing compared to latency of the disk I/O
needed for verification, which means that the hash function's rate of
collisions is more important than its computational cost.
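
Rough numbers, assuming circa-2010 hardware: SHA-256 at ~200MB/s spends
about 0.6ms of pipelinable CPU on a 128KB block, while a verification
read that misses cache is a 5-10ms random disk I/O sitting squarely in
the synchronous write path.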

(Now, Fletcher is thought to not be a cryptographically secure hash
function, while SHA-256 is, for now, considered cryptographically
secure.  That probably means that the distribution of Fletcher's outputs
over random inputs is not as even as that of SHA-256, which probably
means you can expect more collisions with Fletcher than with SHA-256.
Note that I made no absolute statements in the previous sentence --
that's because I've not read any studies of Fletcher's performance
relative to SHA-256, thus I'm not certain of anything stated in the
previous sentence.)

David Magda's advice is spot on.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL

2010-12-27 Thread Nicolas Williams
On Mon, Dec 27, 2010 at 09:06:45PM -0500, Edward Ned Harvey wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Nicolas Williams
> > 
> > > Actually I'd say that latency has a direct relationship to IOPS because
> it's the
> > time it takes to perform an IO that determines how many IOs Per Second
> > that can be performed.
> > 
> > Assuming you have enough synchronous writes and that you can organize
> > them so as to keep the drive at max sustained sequential write
> > bandwidth, then IOPS == bandwidth / logical I/O size.  Latency doesn't
> 
> Ok, what we've hit here is two people using the same word to talk about
> different things.  Apples to oranges, as it were.  Both meanings of "IOPS"
> are ok, but context is everything.  
> 
> There are drive random IOPS, which is dependent on latency and seek time,
> and there is also measured random IOPS above the filesystem layer, which is
> not always related to latency or seek time, as described above.

Clearly the application cares about _synchronous_ operations that are
meaningful to it.  In the case of an NFS application that would be
open() with O_CREAT (and particularly O_EXCL), close(), fsync() and so
on.  For a POSIX (but not NFS) application the number of synchronous
operations is smaller.  The rate of asynchronous operations is less
important to the application because those are subject to caching, thus
less predictable.  But to the filesystem the IOPS are not just about
synchronous I/O but about how many distinct I/O operations can be
completed per unit of time.  I tried to keep this clear; sorry for any
confusion.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL

2010-12-26 Thread Nicolas Williams
On Sat, Dec 25, 2010 at 08:37:42PM -0500, Ross Walker wrote:
> On Dec 24, 2010, at 1:21 PM, Richard Elling  wrote:
> 
> > Latency is what matters most.  While there is a loose relationship
> > between IOPS and latency, you really want low latency.  For 15krpm
> > drives, the average latency is 2ms for zero seeks.  A decent SSD will
> > beat that by an order of magnitude.
> 
> Actually I'd say that latency has a direct relationship to IOPS because it's 
> the time it takes to perform an IO that determines how many IOs Per Second 
> that can be performed.

Assuming you have enough synchronous writes and that you can organize
them so as to keep the drive at max sustained sequential write
bandwidth, then IOPS == bandwidth / logical I/O size.  Latency doesn't
enter into that formula.  Latency does remain though, and will be
noticeable to apps doing synchronous operations.

Thus at, say, 100MB/s sustained sequential write bandwidth and 2KB
average ZIL entries, you'd get 51,200 logical sync write operations per
second.  The latency for each such operation would still be 2ms (or
whatever it is for the given disk).  Since you'd likely have to batch
many ZIL writes you'd end up making the latency for some ops longer than
2ms and others shorter, but if you can keep the drive at max sustained
seq write bandwidth then the average latency will be 2ms.

SSDs are clearly a better choice.

BTW, a parallelized tar would greatly help reduce the impact of high
latency open()/close() (over NFS) operations...

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL

2010-12-23 Thread Nicolas Williams
On Thu, Dec 23, 2010 at 11:25:43AM +0100, Stephan Budach wrote:
> as I have learned from the discussion about which SSD to use as ZIL
> drives, I stumbled across this article, which discusses short
> stroking for increasing IOPS on SAS and SATA drives:

There was a thread on this a while back.  I forget when or the subject.
But yes, you could even use 7200 rpm drives to make a fast ZIL device.
The trick is the on-disk format, and the pseudo-device driver that you
would have to layer on top of the actual device(s) to get such
performance.  The key is that sustained sequential I/O rates for disks
can be quite large, so if you organize the disk in a log form and use
the outer tracks only, then you can pretend to have awesome write
IOPS for a disk (but NOT read IOPS).
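
(For contrast, attaching an ordinary device as a dedicated log needs
none of that machinery; a sketch, device names assumed:)

  $ zpool add tank log c2t0d0                  # single dedicated slog
  $ zpool add tank log mirror c2t0d0 c2t1d0    # mirrored slog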

But it's not necessarily as cheap as you might think.  You'd be making
very inefficient use of an expensive disk (in the case of a 15k rpm SAS
disk) or disks, and if plural then you are also using more ports
(oops).  Disks used this way probably also consume more power than SSDs
(OK, this part of my analysis is very iffy), and you still need to do
something about ensuring syncs to disk on power failure (such as just
disabling the cache on the disk, but this would lower performance,
increasing the cost).  When you factor all the costs in I suspect you'll
find that SSDs are priced reasonably well.  That's not to say that one
could not put together a disk-based log device that could eat SSDs'
lunch, but SSD prices would then just come down to match that -- and you
can expect SSD prices to come down anyways, as with any new
technologies.

I don't mean to discourage you, just to point out that there's plenty of
work to do to make "short-stroked disks as ZILs" a workable reality,
while the economics of doing that work versus waiting for SSD prices to
come down don't seem appealing.  Caveat emptor: my analysis is
off-the-cuff; I could be wrong.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] stupid ZFS question - floating point operations

2010-12-23 Thread Nicolas Williams
On Thu, Dec 23, 2010 at 09:32:13AM +, Darren J Moffat wrote:
> On 22/12/2010 20:27, Garrett D'Amore wrote:
> >That said, some operations -- and cryptographic ones in particular --
> >may use floating point registers and operations because for some
> >architectures (sun4u rings a bell) this can make certain expensive
> 
> Well remembered!  There are sun4u optimisations that use the
> floating point unit but those only apply to the bignum code which in
> kernel is only used by RSA.
> 
> >operations go faster. I don't think this is the case for secure
> >hash/message digest algorithms, but if you use ZFS encryption as found
> >in Solaris 11 Express you might find that on certain systems these
> >registers are used for performance reasons, either on the bulk crypto or
> >on the keying operations. (More likely the latter, but my memory of
> >these optimizations is still hazy.)
> 
> RSA isn't used at all by ZFS encryption, everything is AES
> (including key wrapping) and SHA256.
> 
> So those optimisations for floating point don't come into play for
> ZFS encryption.

Moreover, we have platform-specific crypto optimizations.  If there were
FPU operations that help speed up symmetric crypto on an M4000 but not
on UltraSPARC T2s, then we'd use that on the one but not on the other.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express

2010-12-02 Thread Nicolas Williams
Also, when the IV is stored you can more easily look for accidental IV
re-use, and if you can find hash collisions, then you can even cause IV
re-use (if you can write to the filesystem in question).  For GCM, IV
re-use is rather fatal (for CCM it's bad, but IIRC not fatal), so I'd
not use GCM with dedup either.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express

2010-12-02 Thread Nicolas Williams
On Wed, Nov 17, 2010 at 01:58:06PM -0800, Bill Sommerfeld wrote:
> On 11/17/10 12:04, Miles Nordin wrote:
> >black-box crypto is snake oil at any level, IMNSHO.
> 
> Absolutely.

As Darren said, much of the design has been discussed in public, and
reviewed by cryptographers.  It'd be nicer if we had a detailed paper
though.

> >Congrats again on finishing your project, but every other disk
> >encryption framework I've seen taken remotely seriously has a detailed
> >paper describing the algorithm, not just a list of features and a
> >configuration guide.  It should be a requirement for anything treated
> >as more than a toy.  I might have missed yours, or maybe it's coming
> >soon.
> 
> In particular, the mechanism by which dedup-friendly block IV's are
> chosen based on the plaintext needs public scrutiny.  Knowing
> Darren, it's very likely that he got it right, but in crypto, all
> the details matter and if a spec detailed enough to allow for
> interoperability isn't available, it's safest to assume that some of
> the details are wrong.

Dedup + crypto does have security implications.  Specifically: it
facilitates "traffic" analysis, and then known- and even
chosen-plaintext attacks (if there were any practical such attacks on
the cipher).

For example, IIUC, the ratio of dedup vs.  non-dedup blocks + analysis
of dnodes and their data sizes (in blocks) + per-dnode dedup ratios can
probably be used to identify OS images, which would then help mount
known-plaintext attacks.  For a mailstore you'd be able to distinguish
mail sent or kept by a single local user vs. mail sent to and kept by
more than one local user, and by sending mail you could help mount
chosen-plaintext attacks.  And so on.

My advice would be to not bother encrypting OS images, and if you
encrypt only documents, then dedup is likely of less or no interest to
you -- in general, you may not want to bother with dedup + crypto.
However, it is fantastic that crypto and dedup can work together.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-10-10 Thread Nicolas Williams
On Sat, Oct 09, 2010 at 09:52:51PM -0700, Richard Elling wrote:
> Are we living in the past?
> 
> In the bad old days, UNIX systems spoke NFS and Windows systems spoke
> CIFS. The cost of creating a file system was expensive -- slices,
> partitions, etc.
> 
> With ZFS, file systems (datasets) are relatively inexpensive.
> 
> So, are we putting too many constraints into a system (ZFS) which is
> busy trying to remove constraints?  Is it reasonable to expect that
> ZPL is the only kind of "file system" ZFS customers need?  Is it high
> time for a ZCIFS dataset?

I don't quite understand what you mean.  ZPL is just a POSIX layer.  It
_happens_ to be used not just by the system call layer in Solaris, but
also by the SMB and NFS servers, but you could also imagine the SMB and
NFS servers using the DMU directly while maintaining on-disk
compatibility with the ZPL.  Not using the ZPL does not necessitate
having a different on-disk format, or different semantics.

Now, if you were asking about dataset properties that make a dataset
behave more like what Windows expects or more like what Unix expects,
that's different, but that wouldn't require junking the ZPL.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-10-06 Thread Nicolas Williams
On Wed, Oct 06, 2010 at 05:19:25PM -0400, Miles Nordin wrote:
> >>>>> "nw" == Nicolas Williams  writes:
> 
> nw> *You* stated that your proposal wouldn't allow Windows users
> nw> full control over file permissions.
> 
> me: I have a proposal
> 
> you: op!  OP op, wait!  DOES YOUR PROPOSAL blah blah WINDOWS blah blah
>  COMPLETELY AND EXACTLY LIKE THE CURRENT ONE.
> 
> me: no, but what it does is...

The correct quote is:

"no, not under my proposal."

That's from a post from you on September 30, 2010, with Message-Id:
.  That was a direct answer to a
direct question.

Now, maybe you wish to change your view.  That'd be fine.  Do not,
however, imply that I'm a liar, not if you want to be taken seriously.
Please re-write your proposal _clearly_ and refrain from personal
attacks.

Cheers,

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-10-06 Thread Nicolas Williams
On Wed, Oct 06, 2010 at 04:38:02PM -0400, Miles Nordin wrote:
> >>>>> "nw" == Nicolas Williams  writes:
> 
> nw> The current system fails closed 
> 
> wrong.
> 
> $ touch t0
> $ chmod 444 t0
> $ chmod A0+user:$(id -nu):write_data:allow t0
> $ ls -l t0
> -r--r--r--+  1 carton   carton 0 Oct  6 20:22 t0
> 
> now go to an NFSv3 client:
> $ ls -l t0
> -r--r--r-- 1 carton 405 0 2010-10-06 16:26 t0
> $ echo lala > t0
> $ 
> 
> wide open.

The system does what the ACL says.  The mode fails to accurately
represent the actual access because... the mode can't.  Now, we could
have chosen (and still could choose) to represent the presence of ACEs
for subjects other than owner@/group@/everyone@ by using the group bits
of the mode to represent the maximal set of permissions granted.

But I don't consider the above "failing open".
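
(On the Solaris side the full picture is visible; a sketch against the
same file:)

  $ ls -V t0    # compact ACE listing, including the user:carton allow entry

The NFSv3 client sees only the mode because the mode is all that
protocol can carry.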

> nw> You seem to be in denial.  You continue to ignore the
> nw> constraint that Windows clients must be able to fully control
> nw> permissions in spite of their inability to perceive and modify
> nw> file modes.
> 
> You remain unshakably certain that this is true of my proposal in
> spite of the fact that you've said clearly that you don't understand
> my proposal.  That's bad science.

*You* stated that your proposal wouldn't allow Windows users full
control over file permissions.

> It may be my fault that you don't understand it: maybe I need to write
> something shorter but just as expressive to fit within mailing list
> attention spans, or maybe my examples are unclear.  However that
> doesn't mean that I'm in denial nor make you right---that just makes
> me annoying.

Yes, that may be.  I encourage you to find a clearer way to express your
proposal.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-10-05 Thread Nicolas Williams
On Mon, Oct 04, 2010 at 02:28:18PM -0400, Miles Nordin wrote:
> >>>>> "nw" == Nicolas Williams  writes:
> 
> nw> I would think that 777 would invite chmods.  I think you are
> nw> handwaving.
> 
> it is how AFS worked.  Since no file on a normal unix box besides /tmp

But would the AFS experience translate into double plus happiness for us?

> ever had 777 it would send a SIGWTF to any AFS-unaware graybeards that
> stumbled onto the directory, alerting them that they needed to go
> learn something and come back.

A signal?!  How would that work when the entity doing a chmod is on a
remote NFS client?

> I understand that everything:everyone on windows doesn't send SIGWTF,
> but 777 on unix for AFS sites it did.  You realize it's not
> hypothetical, right?  AFS was actually implemented, widely, and
> there's experience with it.

Yes... but I'm skeptical about the universality of that experience's
applicability.  Specifically: I don't think it could work for us.

AFS developers had fewer constraints than Solaris developers.  It is no
surprise that they were able to find happy solutions to these sorts of
problems long ago.

OpenAFS has a Windows native client and an Explorer shell extension
(which surely handles chmod?).  However, we don't have the luxury of
telling customers to install third-party (possibly ours, whatever)
Windows native clients for protocols other than SMB, nor can we tell
them to install Explorer shell extensions.  Solaris' SMB server needs to
work out of the box and without the limitations implied by having a
separate ACL and mode (well, we have that now, but we always compute a
new mode from the new ACL when ACLs are changed).

> If they failed to act on the SIGWTF, the overall system enforced the
> tighter of the unix permissions and the AFS ACL, so it fails closed.
> The current system fails open.

The current system fails closed (by discarding the ACL and replacing it
with a new one based entirely on the new mode).

> Also AFS did no translation between unix permissions and AFS ACL's so
> it was easy to undo such a mistake when it happened: double-check the
> AFS ACL is not too wide on the directories where you see unix people
> mucking around in case the muckers were responding to a real problem,
> then set the unix modes back to 777.

Right, but with SMB in the picture we don't have this luxury.  You seem
unwilling to accept that one constraint.

> nw> When chmod()ing an object... ZFS would search for the most
> nw> specific matching file in .zfs/ACLs/ and, if found, would
> nw> replace the chmod()ed object's ACL with that of the
> nw> .zfs/ACLs/... file found.  The .inherit suffix would indicate
> nw> that if the chmod() target's parent directory has inherittable
> nw> ACEs then they will be groupmasked and added to the ACEs from
> nw> the .zfs/ACLs/... file to produce a final ACL.
> 
> This proposal, like the current situation, seems to make chmod
> configurable to act like ``not chmod'' which IMHO is exactly what's
> unpopular about the current regime.  You've tried to leave chmod

To some degree, yes.  It's different, and might conceivably be
acceptable, though I don't think it will be (I was illustrating
potential alternatives).

But I really like one thing about it: most apps shouldn't care about ACL
contents, they should care about context-specific permissions changes.
In a directory containing shared documents the intention should
typically be "share with all these people", while in home directories
the intention should typically be "don't share with anyone" (but this
will vary; e.g., ~/.ssh/authorized_keys needs to be reachable and
readable by everyone).  Add in executable versus not-executable, and
you have a pretty complete picture -- just a few "named" ACLs at most,
per-dataset.

If we could replace chmod(2) with a version that takes actual names for
pre-configured ACLs, _that_ would be great.  But we can't for the same
reason that we can't remove chmod(2): it's a widely used interface.

> active on windows trees and guess at the intent of whoever invokes
> chmod, providing no warning that you're secretly doing
> ``approximately'' what he asked for rather than exactly.  Maybe that
> flies on Windows, but on Unix people expect more precision: thorough
> abstractions that survive corner cases and have good exception
> handling.

Look, mode is a pretty lame hammer -- ACLs are far, far more granular --
but it's a hammer that many apps use.  Given the lack of granularity of
modes, I think an approximation of intent is the best we can do.

Consider: both aclmode=discard and aclmode=groupmask beh

Re: [zfs-discuss] Migrating to an aclmode-less world

2010-10-05 Thread Nicolas Williams
On Mon, Oct 04, 2010 at 04:30:05PM -0600, Cindy Swearingen wrote:
> Hi Simon,
> 
> I don't think you will see much difference for these reasons:
> 
> 1. The CIFS server ignores the aclinherit/aclmode properties.

Because CIFS/SMB has no chmod operation :)

> 2. Your aclinherit=passthrough setting overrides the aclmode
> property anyway.

aclinherit=passthrough-x is a better choice.

Also, aclinherit doesn't override aclmode.  aclinherit applies on create
and aclmode used to apply on chmod.
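
(A sketch of inspecting and setting these, dataset name assumed:)

  $ zfs set aclinherit=passthrough-x tank/share   # inherit ACEs; execute only when requested
  $ zfs get aclinherit tank/share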

> 3. The only difference is that if you use chmod on these files
> to manually change the permissions, you will lose the ACL values.

Right.  That only happens from NFSv3 clients [that don't instead edit
the POSIX Draft ACL translated from the ZFS ACL], from non-Windows NFSv4
clients [that don't instead edit the ACL], and from local applications
[that don't instead edit the ZFS ACL].

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-10-04 Thread Nicolas Williams
On Thu, Sep 30, 2010 at 08:14:24PM -0400, Miles Nordin wrote:
> >> Can the user in (3) fix the permissions from Windows?
> 
> no, not under my proposal.

Let's give it a whirl anyways:

> but it sounds like currently people cannot ``fix'' permissions through
> the quirky autotranslation anyway, certainly not to the point where
> neither unix nor windows users are confused: windows users are always
> confused, and unix users don't get to see all the permissions.

No, that's not right.  Today you can fix permissions from any NFSv4
client that exports an NFSv4-style ACL interface to users.  You can fix
permissions from Windows.  You can fix permissions from a local Solaris
shell.  You can also fix permissions from NFSv3 clients (but you get
POSIX Draft -> ZFS translated ACLs, which are confusing because they
tend to result in DENY ACEs being scattered all over).  You can also
chmod, but you lose your ACL if you do that.

> >> Now what?
> 
> set the unix perms to 777 as a sign to the unix people to either (a)
> leave it alone, or (b) learn to use 'chmod A...'.  This will actually
> work: it's not a hand-waving hypothetical that just doesn't play out.

I would think that 777 would invite chmods.  I think you are handwaving.

> What I provide, which we don't have now, is a way to make:
> 
>   /tub/dataset/a subtree
> 
> -rwxrwxrwx                          in old unix
> [working, changeable permissions]   in windows
> 
>   /tub/dataset/b subtree
> 
> -rw-r--r--                          in old unix
> [everything: everyone]              in windows, but unix permissions
>                                     still enforced
> 
> this means:
> 
>  * unix writers and windows writers can cooperate even within a single
>dataset
> 
>  * an intuitive warning sign when non-native permissions are in effect, 
> 
>  * fewer leaked-data surprises

I don't understand what exactly you're proposing.  You've not said
anything about how chmod is to be handled.

> If you accept that the autotranslation between the two permissions
> regimes is total shit, which it is, then what I offer is the best you
> can hope for.

If I could understand what you're proposing I might agree, who knows.
But I do think there's other possibilities, some probably better than
what you propose (whatever that is).

Here's a crazy alternative that might work (or not): allow users to
pre-configure named ACLs where the names are {owner, group, mode}.
E.g., we could have:

.zfs/ACLs/<owner>/[<group>:][d|-]<rwx bits>[.inherit]
           ^          ^        ^                ^
           |          |        |                |
           +-- owned by        |                |
                      +-- applies to            |
                               directory        |
                               or other         |
                               objects          |
                                                |
                                       see below

When chmod()ing an object... ZFS would search for the most specific
matching file in .zfs/ACLs/ and, if found, would replace the chmod()ed
object's ACL with that of the .zfs/ACLs/... file found.  The .inherit
suffix would indicate that if the chmod() target's parent directory has
inheritable ACEs then they will be groupmasked and added to the ACEs
from the .zfs/ACLs/... file to produce a final ACL.

E.g., a chmod(0644) of /a/b/c/foo (say, a file owned by 'joe' with group
'staff', with /, /a, /a/b, and /a/b/c all being datasets), where c has
inheritable ACEs would cause ZFS to search for
.zfs/ACLs/joe/staff:-rw-r--r--.inherit, .zfs/ACLs/joe/-rw-r--r--.inherit, 
zfs/ACLs/joe/staff:-rw-r--r--, and .zfs/ACLs/joe/-rw-r--r--, first in
/a/b/c, then /a/b, then /a, then /.

I said this is "crazy".  Is it?  I think it probably is.  This would
almost certainly prove to be a hard-to-use design.  Users would need to
be educated in order to not be surprised...  OTOH, it puts much more
control in the hands of the user.  These named ACLs could be inherited
from parent datasets as a way to avoid having to set them up too many
times.  And with the .inherit twist it probably has enough granularity
of control to be useful (particularly if users are dataset-happy).
Finally, these could even be managed remotely.

I see zero chance of such a design being adopted.

It'd be better, IMO, to go for non-POSIX-equivalent groupmasking and
translations of POSIX mode_t and POSIX Draft ACLs to ZFS ACLs.  For
example: take the current translations, remove all owner@ and group DENY
ACEs, then sort any remaining user DENY ACEs to be first, and any
everyone@ DENY ACEs to be last.  The results would surely be surprising
to some users, but the kinds of mode_t and POSIX Draft ACLs where
surprise is likely are rare.

That's two alternatives right there.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-09-30 Thread Nicolas Williams
On Thu, Sep 30, 2010 at 08:14:24PM -0400, Miles Nordin wrote:
> >> Can the user in (3) fix the permissions from Windows?
> 
> no, not under my proposal.

Then your proposal is a non-starter.  Support for multiple remote
filesystem access protocols is key for ZFS and Solaris.

The impedance mismatches between these various protocols means that we
need to make some trade-offs.  In this case I think the business (as
well as the engineers involved) would assert that being a good SMB
server is critical, and that being able to authoritatively edit file
permissions via SMB clients is part of what it means to be a good SMB
server.

Now, you could argue that we should bring aclmode back and let the user
choose which trade-offs to make.  And you might propose new values for
aclmode or enhancements to the groupmask setting of aclmode.

> but it sounds like currently people cannot ``fix'' permissions through
> the quirky autotranslation anyway, certainly not to the point where
> neither unix nor windows users are confused: windows users are always
> confused, and unix users don't get to see all the permissions.

Thus the current behavior is the same as the old aclmode=discard
setting.

> >> Now what?
> 
> set the unix perms to 777 as a sign to the unix people to either (a)
> leave it alone, or (b) learn to use 'chmod A...'.  This will actually
> work: it's not a hand-waving hypothetical that just doesn't play out.

That's not an option, not for a default behavior anyways.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-09-30 Thread Nicolas Williams
On Thu, Sep 30, 2010 at 03:28:14PM -0500, Nicolas Williams wrote:
> Consider this chronologically-ordered sequence of events:
> 
> 1) File is created via Windows, gets SMB/ZFS/NFSv4-style ACL, including
>    inheritable ACEs.  A mode computed from this ACL might be 664, say.
> 
> 2) A Unix user does chmod(644) on that file, and one way or another this
>    effectively reduces permissions otherwise granted by the ACL.
> 
> 3) Another Windows user now fails to get write perm that they should
>    have, so they complain, and then the owner tries to view/change the
>    ACL from a Windows desktop.
> 
> Now what?
> 
> Can the user in (3) fix the permissions from Windows?  For that to be
> possible the mode must implicitly get recomputed when the ACL is
> modified.

Also, even if in (3) the user can fix the perms from Windows because
we'd recompute the mode from the ACL, the user wouldn't be able to see
the "effective" ACL (as "reduced" by the mode_t that Windows can't see).
The only way to address that is... to do groupmasking.  And that gets us
back to the problems we had with groupmasking.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side

2010-09-30 Thread Nicolas Williams
On Thu, Sep 30, 2010 at 02:55:26PM -0400, Miles Nordin wrote:
> >>>>> "nw" == Nicolas Williams  writes:
> nw> Keep in mind that Windows lacks a mode_t.  We need to interop
> nw> with Windows.  If a Windows user cannot completely change file
> nw> perms because there's a mode_t completely out of their
> nw> reach... they'll be frustrated.
> 
> well...AIUI this already works very badly, so keep that in mind, too.
> 
> In AFS this is handled by most files having 777, and we could do the
> same if we had an AND-based system.  This is both less frustrating and
> more self-documenting than the current system.
> 
> In an AND-based system, some unix users will be able to edit the
> windows permissions with 'chmod A...'.  In shops using older unixes
> where users can only set mode bits, the rule becomes ``enforced
> permissions are the lesser of what Unix people and Windows people
> apply.''  This rule is easy to understand, not frustrating, and
> readily encourages ad-hoc cooperation (``can you please set
> everything-everyone on your subtree?  we'll handle it in unix.'' /
> ``can you please set 777 on your subtree?  or 770 group windows?  we
> want to add windows silly-sid-permissions.'').  This is a big step
> better than existing systems with subtrees where Unix and Windows
> users are forced to cooperate.

Consider this chronologically-ordered sequence of events:

1) File is created via Windows, gets SMB/ZFS/NFSv4-style ACL, including
   inheritable ACEs.  A mode computed from this ACL might be 664, say.

2) A Unix user does chmod(644) on that file, and one way or another this
   effectively reduces permissions otherwise granted by the ACL.

3) Another Windows user now fails to get write perm that they should
   have, so they complain, and then the owner tries to view/change the
   ACL from a Windows desktop.

Now what?

Can the user in (3) fix the permissions from Windows?  For that to be
possible the mode must implicitly get recomputed when the ACL is
modified.

What if (2) happens again?  But, OK, this is a problem no matter what,
whether we do groupmasking, discard, or keep mode separate from the ACL
and AND the two.

ZFS does, in fact, keep a separate mode, and it does recompute it when
ACLs are modified.  So this may just be a matter of doing the AND thing
and not touching the ACL on chmod.  Is that what you have in mind?

> It would certainly work much better than the current system, where you
> look at your permissions and don't have any idea whether you've got
> more, less, or exactly the same permission as what your software is
> telling you: the crappy autotranslation teaches users that all bets
> are off.

No, currently you look at permissions that reflect the ACL (with
the group bits being the max of all non-owner@ and non-everyone@ ACEs).

> It would be nice if, under my proposal, we could delete the unix
> tagspace entirely:
> 
>  chpacl '(unix)' chmod -R A- .

Huh?

> but unfortunately, deletion of ACL's is special-cased by Solaris's
> chmod to ``rewrite ACL's that match the UNIX permissions bits,'' so it
> would probably have to stay special-cased in a tagspace system.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side (was: zfs property aclmode gone in 147?)

2010-09-29 Thread Nicolas Williams
On Wed, Sep 29, 2010 at 05:21:51PM -0500, Nicolas Williams wrote:
> On Wed, Sep 29, 2010 at 03:09:22PM -0700, Ralph Böhme wrote:
> > > Keep in mind that Windows lacks a mode_t.  We need to
> > > interop with Windows.
> > 
> > Oh my, I see. Another itch to scratch. Now at least Windows users are
> > happy while me and maybe others are not.
> 
> Yes.  Pardon me for forgetting to mention this earlier.  There's so many
> wrinkles here...  But this is one of the biggers; I should not have

s/biggers/biggest/

> forgotten it.
> 
> Nico
> -- 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side (was: zfs property aclmode gone in 147?)

2010-09-29 Thread Nicolas Williams
On Wed, Sep 29, 2010 at 03:09:22PM -0700, Ralph Böhme wrote:
> > Keep in mind that Windows lacks a mode_t.  We need to
> > interop with Windows.
> 
> Oh my, I see. Another itch to scratch. Now at least Windows users are
> happy while me and maybe others are not.

Yes.  Pardon me for forgetting to mention this earlier.  There's so many
wrinkles here...  But this is one of the biggers; I should not have
forgotten it.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] tagged ACL groups: let's just keep digging until we come out the other side (was: zfs property aclmode gone in 147?)

2010-09-29 Thread Nicolas Williams
Keep in mind that Windows lacks a mode_t.  We need to interop with
Windows.  If a Windows user cannot completely change file perms because
there's a mode_t completely out of their reach... they'll be frustrated.

Thus an ACL-and-mode model where both are applied doesn't work.  It'd be
nice, but it won't work.

The mode has to be entirely encoded by the ACL.  But we can't resort to
interesting encoding tricks as Windows users won't understand them.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs property aclmode gone in 147?

2010-09-29 Thread Nicolas Williams
On Wed, Sep 29, 2010 at 03:44:57AM -0700, Ralph Böhme wrote:
> > On 9/28/2010 2:13 PM, Nicolas Williams wrote:
> > The version of samba bundled with Solaris 10 seems to
> > insist on 
> > chmod'ing stuff. I've tried all of the various

Just in case it's not clear, I did not write the quoted text.  (One can
tell from the level of quotation that an attribution is missing and that
none of my text was quoted.)

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs property aclmode gone in 147?

2010-09-28 Thread Nicolas Williams
On Wed, Sep 29, 2010 at 10:15:32AM +1300, Ian Collins wrote:
> Based on my own research, experimentation and client requests, I
> agree with all of the above.

Good to know.

> I have been re-ordering and cleaning (deny) ACEs for one client for a
> couple of years now and we haven't seen any user complaints.  In
> their environment, all ACLs started life as POSIX (from a Solaris 9
> host) and with the benefit of hindsight, I would have cleaned them
> up on import to ZFS rather than simply reading the POSIX ACL and
> writing back to ZFS.

The saddest scenario would be when you have to interop with NFSv3
clients whose users (or their apps) are POSIX ACL happy, but whose files
also need to be accessible from NFSv4, SMB, and local ZPL clients where
the users (possibly the same users, or their apps) are also ZFS ACL
happy.  Particularly if you also have Windows clients and the users edit
file ACLs there too!  Thankfully this is relatively easy to avoid
because: apps that edit ACLs are few and far between, thus easy to
remediate, and users should not really be manually setting POSIX Draft
and ZFS/NFSv4/SMB ACLs on the same files.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs property aclmode gone in 147?

2010-09-28 Thread Nicolas Williams
On Tue, Sep 28, 2010 at 02:03:30PM -0700, Paul B. Henson wrote:
> On Tue, 28 Sep 2010, Nicolas Williams wrote:
> 
> > I've researched this enough (mainly by reading most of the ~240 or so
> > relevant zfs-discuss posts and several bug reports)
> 
> And I think some fair fraction of those posts were from me, so I'll try not
> to start rehashing old discussions ;).

:)

> > That only leaves aclmode=discard and some variant of aclmode=groupmask
> > that is less confusing.
> 
> Or aclmode=deny, which is pretty simple, not very confusing, and basically
> the only paradigm that will prevent chmod from breaking your ACL.

That can potentially render many applications unusable.

> > So one might wonder: can one determine user intent from the ACL prior to
> > the change and the mode/POSIX ACL being set, and then edit the ZFS ACL
> > in a way that approximates the user's intention?
> 
> You're assuming the user is intentionally executing the chmod, or even
> *aware* of it happening. Probably at least 99% of the chmod calls executed
> on a file with a ZFS ACL at my site are the result of non-ACL aware legacy
> apps being stupid. In which case the *user* intent is to *leave the damn
> ACL alone* :)...

But that's not really clear.  The user is running some app.  The app
does some chmod(2)ing on behalf of the user.  The user may also use
chmod(1).  Now what?  Suppose you make chmod(1) not use chmod(2), so as
to be able to say that all chmod(2) calls are made by "apps", not the
user.   But then what about scripts that use chmod(1)?

Basically, I think intent can be estimated in some cases, and combined
with some simplifying assumptions (that will sometimes be wrong), such
as "security entities are all distinct, non-overlapping" (as a way to
minimize the number of DENY ACEs needed) can yield a groupmasking
algorithm that doesn't suck.  However, it'd still not be easy to
explain, and it'd still result in surprises (since the above assumption
will often be wrong, leading to more permissive ACLs than the user might
have intended!).  Seems like a lot of work for little gain, and high
support call generation rate.

> > But much better than that would be if we just move to a ZFS ACL world
> > (which, among other things, means we'll need a simple libc API for
> > editing ACLs).
> 
> Yep. And a good first step towards an ACL world would be providing a way to
> keep chmod from destroying ACLs in the current world...

I don't think that will happen...

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs property aclmode gone in 147?

2010-09-28 Thread Nicolas Williams
On Tue, Sep 28, 2010 at 12:18:49PM -0700, Paul B. Henson wrote:
> On Sat, 25 Sep 2010, [iso-8859-1] Ralph Böhme wrote:
> 
> > Darwin ACL model is nice and slick, the new NFSv4 one in 147 is just
> > braindead. chmod resulting in ACLs being discarded is a bizarre design
> > decision.
> 
> Agreed. What's the point of ACLs that disappear? Sun didn't want to fix
> acl/chmod interaction, maybe one of the new OpenSolaris forks will do the
> right thing...

I've researched this enough (mainly by reading most of the ~240 or so
relevant zfs-discuss posts and several bug reports) to conclude the
following:

 - ACLs derived from POSIX mode_t and/or POSIX Draft ACLs that result in
   DENY ACEs are enormously confusing to users.

 - ACLs derived from POSIX mode_t and/or POSIX Draft ACLs that result in
   DENY ACEs are susceptible to ACL re-ordering when modified from
   Windows clients -which insist on DENY ACEs first-, leading to much
   confusion.

 - This all gets more confusing when hand-crafted ZFS inheritable ACEs
   are mixed with chmod(2)s with the old aclmode=groupmask setting.

The old aclmode=passthrough setting was dangerous and had to be removed,
period.  (Doing chmod(600) would not necessarily deny other users/groups
access -- that's very, very broken.)

That only leaves aclmode=discard and some variant of aclmode=groupmask
that is less confusing.

But here's the thing: the only time that groupmasking results in
sensible ACLs is when it doesn't require DENY ACEs, which in turn is
only when mode_t bits and/or POSIX ACLs are strictly non-increasing
(e.g., 777, 775, 771, 750, 755, 751, etcetera, would all be OK, but 757
would not be).

The problem then is this: if you have an aclmode setting that sometimes
groupmasks and sometimes discards... that'll be confusing too!

So one might wonder: can one determine user intent from the ACL prior to
the change and the mode/POSIX ACL being set, and then edit the ZFS ACL
in a way that approximates the user's intention?  I believe that would
be possible, but risky too, as the need to avoid DENY ACEs (see Windows
issue) would often result in more permissive ACLs than the user actually
intended.

Taken altogether I believe that aclmode=discard is the simplest setting
to explain and understand.  Perhaps eventually a variant of groupmasking
will be developed that is also simple to explain and understand, but
right now I very much doubt it (and yes, I've tried myself).  But much
better than that would be if we just move to a ZFS ACL world (which,
among other things, means we'll need a simple libc API for editing
ACLs).

Note, incidentally, that there's a groupmasking behavior left in ZFS at
this time: on create of objects in directories with inheritable ACEs
and with aclinherit=passthrough*.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pools inside pools

2010-09-23 Thread Nicolas Williams
On Thu, Sep 23, 2010 at 06:58:29AM +, Markus Kovero wrote:
> > What is an example of where a checksummed outside pool would not be able 
> > to protect a non-checksummed inside pool?  Would an intermittent 
> > RAM/motherboard/CPU failure that only corrupted the inner pool's block 
> > before it was passed to the outer pool (and did not corrupt the outer 
> > pool's block) be a valid example?
> 
> > If checksums are desirable in this scenario, then redundancy would also 
> > be needed to recover from checksum failures.
> 
> That is an excellent point also: what is the point of checksumming if
> you cannot recover from it?  In this kind of configuration one would
> benefit performance-wise from not having to calculate checksums again.

The benefit of checksumming in the "inner tunnel", as it were (the inner
pool), is to provide one more layer of protection relative to iSCSI.
But without redundancy in the inner pool you cannot recover from
failures, as you point out.  And you must have checksumming in the outer
pool, so that it can be scrubbed.

It's tempting to say that the inner pool should not checksum at all, and
that iSCSI and IPsec should be configured correctly to provide
sufficient protection to the inner pool.  Another possibility is to have
a remote ZFS protocol of sorts, but then you begin to wonder if
something like Lustre (married to ZFS) isn't better.
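
If one did go that route, the property settings themselves are simple
(a sketch, pool names assumed):

  $ zfs set checksum=off inner    # lean on the outer pool (and IPsec) for integrity
  $ zfs get checksum outer        # the outer pool stays checksummed, hence scrubbable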

> Checksums in outer pools effectively protect from disk issues; if
> hardware fails and data is corrupted, isn't the outer pool's redundancy
> going to handle it for the inner pool also?

Yes.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS COW and simultaneous read & write of files

2010-09-22 Thread Nicolas Williams
On Wed, Sep 22, 2010 at 12:30:58PM -0600, Neil Perrin wrote:
> On 09/22/10 11:22, Moazam Raja wrote:
> >Hi all, I have a ZFS question related to COW and scope.
> >
> >If user A is reading a file while user B is writing to the same file,
> >when do the changes introduced by user B become visible to everyone?
> >Is there a block level scope, or file level, or something else?
> >
> >Thanks!
> 
> Assuming the user is using read and write against zfs files.
> ZFS has reader/writer range locking within files.
> If thread A is trying to read the same section that thread B is
> writing it will
> block until the data is written. Note, written in this case means
> written into the zfs
> cache and not to the disks. If thread A requires that changes to the
> file be stable (on disk)
> before reading it can use the little known O_RSYNC flag.

That's assuming local access (i.e., POSIX semantics).  It's different if
NFS is involved (because of NFS' close-to-open semantics).  It might be
different if SMB is involved (dunno).
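
As an aside, here's a minimal sketch of the little-known O_RSYNC usage
mentioned above (the path is hypothetical, and note that O_RSYNC is
combined with O_SYNC so that reads wait for any pending synchronous
writes to reach stable storage):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        char buf[512];
        int fd = open("/tank/data/file", O_RDONLY | O_RSYNC | O_SYNC);

        if (fd == -1) {
                perror("open");
                return (1);
        }
        /* this read completes only once the data it returns is stable */
        if (read(fd, buf, sizeof (buf)) == -1)
                perror("read");
        (void) close(fd);
        return (0);
}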

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Please warn a home user against OpenSolaris under VirtualBox under WinXP ; )

2010-09-22 Thread Nicolas Williams
On Wed, Sep 22, 2010 at 07:14:43AM -0700, Orvar Korvar wrote:
> There was a guy doing that: Windows as host and OpenSolaris as guest
> with raw access to his disks. He lost his 12 TB data. It turned out
> that VirtualBox doesn't honor the write flush flag (or something
> similar).

VirtualBox has an option to honor flushes.

Also, recent versions of ZFS can recover by throwing out the last N
transactions that were not committed fully.

> In other words, I would never ever do that. Your data is safer with
> Windows only and a Windows raid solution.
> 
> Use OpenSolaris as host instead, and Win as guest.

I don't think your advice is correct.  If you're going to run production
services on VirtualBox VMs then you should enable cache flushes in VBox:

http://www.virtualbox.org/manual/ch12.html#id2692517

"
To enable flushing for IDE disks, issue the following command:

VBoxManage setextradata "VM name"
  "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0

The value [x] that selects the disk is 0 for the master device on the
first channel, 1 for the slave device on the first channel, 2 for the
master device on the second channel, or 3 for the slave device on the
second channel.

To enable flushing for SATA disks, issue the following command:

VBoxManage setextradata "VM name"
  "VBoxInternal/Devices/ahci/0/LUN#[x]/Config/IgnoreFlush" 0

The value [x] that selects the disk can be a value between 0 and 29.
"

IMO VBox should have a simple toggle for this in either its disk or vm
manager UI.  And the flush commands should be honored by default.  What
VBox could do is have some radio buttons or checkboxes for indicating
the purpose of a given VM, and then derive default flush behavior from
that (e.g., test and gaming VMs need not honor flushes, dev VMs might,
and prod VMs do).

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-15 Thread Nicolas Williams
On Wed, Sep 15, 2010 at 05:18:08PM -0400, Edward Ned Harvey wrote:
> It is absolutely not difficult to avoid fragmentation on a spindle drive, at
> the level I described.  Just keep plenty of empty space in your drive, and
> you won't have a fragmentation problem.  (Except as required by COW.)  How
> on earth do you conclude this is "practically impossible?"

That's expensive.  It's also approaching short-stroking (which is
expensive).  Which is what Richard said (in so many words, that it's
expensive).  Can you make HDDs perform awesome?  Yes, but you'll need
lots of them, and you'll need to use them very inefficiently.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What is the "1000 bit"?

2010-09-14 Thread Nicolas Williams
On Tue, Sep 14, 2010 at 04:13:31PM -0400, Linder, Doug wrote:
> I recently created a test zpool (RAIDZ) on some iSCSI shares.  I made
> a few test directories and files.  When I do a listing, I see
> something I've never seen before:
> 
> [r...@hostname anewdir] # ls -la
> total 6160
> drwxr-xr-x   2 root     other          4 Sep 14 14:16 .
> drwxr-xr-x   4 root     root           5 Sep 14 15:04 ..
> -rw------T   1 root     other    2097152 Sep 14 14:16 barfile1
> -rw------T   1 root     other    1048576 Sep 14 14:16 foofile1
> 
> I looked up the "T" bit in the man page for ls, and it says that "T"
> means  "The 1000 bit is turned on, and execution is off (undefined
> bit-state)."  Which is as clear as mud.

It's the sticky bit.  ls(1) shows an uppercase 'T' (rather than a
lowercase 't') when the sticky bit is set but the execute bit for other
is off -- hence the manpage's "execution is off" wording.  Nowadays the
sticky bit is only useful on directories, and really it's generally only
used with 777 permissions (as on /tmp).  The chmod(1) (man -M/usr/man
chmod) and chmod(2) (man -s 2 chmod) manpages describe the sticky bit.
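
For illustration, a chmod(2) call that would produce the listing above
(using one of the filenames from your output; 0600 plus the sticky bit
yields the -rw------T display):

#include <sys/stat.h>
#include <stdio.h>

int
main(void)
{
        /* S_ISVTX is the sticky ("1000") bit; with no execute bit for
         * other, ls(1) renders it as an uppercase T */
        if (chmod("barfile1", S_ISVTX | 0600) == -1) {
                perror("chmod");
                return (1);
        }
        return (0);
}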

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs set readonly=on does not entirely go into read-only mode

2010-08-27 Thread Nicolas Williams
On Sat, Aug 28, 2010 at 12:05:53PM +1200, Ian Collins wrote:
> Think of this from the perspective of an application. How would
> write failure be reported?  open(2) returns EACCES if the file can
> not be written but there isn't a corresponding return from write(2).
> Any open file descriptors would have to be updated to reflect the
> change of access and the application would end up with an unexpected
> error return (EBADF?).

EROFS.  But write(2) isn't supposed to return EROFS.  NFSv3's and v4's
write ops are allowed to return the NFS equivalent of EROFS, and so
typically NFS clients do cause write(2) to return EROFS in such cases
(but then, NFS isn't fully POSIX).

write(2) can return EIO though, and, IIRC, the BSD revoke(2) syscall
arranges for just that to be returned by write(2) calls on revoked
fildes.

IMO EROFS and EIO would both be OK.  It might be a good idea to require
a force option to make a change that would cause non-POSIX behavior.

I'd think that there's many possible ways to handle this:

a) disallow setting readonly=on on mounted datasets that are
   readonly=false;

b) disallow ... but only if there are any fildes open for write (doesn't
   matter if shared with NFS as NFS writes are allowed to return EROFS);

c) allow the change but make it take effect on next mount;

d) force umount the dataset, make the change, mount again;

e) have write(2), to fildes open for write before the change to
   readonly=on, return EROFS after the change;

f) same as (e) but only if you force the prop change;

g) have write(2), to fildes open for write before the change to
   readonly=on, return EIO after the change;

h) allow write(2)s to fildes open for write before the change to
   readonly=on;

(h) is current behavior.  (a) and (b) would be reasonable, but if EBUSY,
the user may not be able to change the property without drastic steps
(such as rebooting, if there's lots of datasets below).  (c) would be
confusing, and not that useful.  (d) would be unreasonable (plus what if
there's datasets below this one?!).  (e)...  may be reasonable if you
think that we're well outside POSIX the moment you change the readonly
prop to on.  (f) is reasonable (by forcing the change you'd be saying
that you're happy to leave POSIX land).  (h) is reasonable.

> If the application has been given permission to open a file for
> writing and this permission is unexpectedly revoked, strange things
> my happen.  The file being written would be in an inconsistent
> state.

Well, there's always the BSD revoke(2) system call.  Use it and ...

> I think it is better to let write operation complete and leave the
> file in a consistent state.

There is that too.  But you could, too, just power off...  The
application should use fsync(2) (or fdatasync()) carefully to ensure
that failed write(2)s and power failures don't leave the application in
an unrecoverable state.
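
For what it's worth, a minimal sketch of that careful pattern (the path
and record are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        const char *rec = "committed record\n";
        int fd = open("/tank/app/journal", O_WRONLY | O_APPEND | O_CREAT,
            0644);

        if (fd == -1) {
                perror("open");
                return (1);
        }
        if (write(fd, rec, strlen(rec)) != (ssize_t)strlen(rec)) {
                perror("write");        /* EROFS/EIO would surface here */
                return (1);
        }
        if (fsync(fd) == -1) {          /* not durable until this succeeds */
                perror("fsync");
                return (1);
        }
        (void) close(fd);
        return (0);
}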

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 64-bit vs 32-bit applications

2010-08-19 Thread Nicolas Williams
On Fri, Aug 20, 2010 at 10:17:38AM +1200, Ian Collins wrote:
> On 08/20/10 09:48 AM, Nicolas Williams wrote:
> >And anyways, the temptation to build classes that can be used
> >elsewhere becomes rather strong.  IMO C++ in the kernel is asking for
> >trouble.  And C++ in user-land?  Same thing: you'll end up wanting to
> >turn parts of your application into libraries, and then some other
> >developer will want to use those in their C++ app, and then you run into
> >the ABI issues all over again.
> 
> There are a couple of simple solutions to that.  Either make library
> code header only (which is most common for template based code) or
> provide CC and gcc libraries, just like we have 32 and 64 bit
> versions of other system libraries.  Or just stick to one compiler,
> like Solaris did before the big gcc build project kicked off.

Or wait for a standard ABI to be formulated and widely adopted.

Or don't use C++.  Use Java or a JVM-hosted language.  Use Python.  Use
C.  Use C#.  Use whatever.  Anything, anything other than C++.

But more than anything: we don't need a language flame war on a ZFS list.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 64-bit vs 32-bit applications

2010-08-19 Thread Nicolas Williams
On Fri, Aug 20, 2010 at 09:38:51AM +1200, Ian Collins wrote:
> On 08/20/10 09:33 AM, Nicolas Williams wrote:
> >Any driver C++ code would still need a C++ run-time.  Either you must
> >statically link it in, or you'll have a problem with multiple drivers
> >using different C++ run-times.  If you statically link in the run-time,
> >then you're bloating the text of the kernel.  If you're not then you
> >have a problem.  C++ is bad because of its ABI issues, really.
> >
> You snipped the bit where I said
> 
> "Drivers and kernel modules are a good example; in that world you
> have to live without the runtime library (which is dynamic only).
> So you are effectively just using C++ as a superset of C with all
> the benefits that offers."
> 
> So you basically lose the C++ specific parts of the standard
> library and exceptions.  But you still have the built in features of
> the language.

I'm not sure it's that easy to avoid the C++ run-time when you're
coding.  And anyways, the temptation to build classes that can be used
elsewhere becomes rather strong.  IMO C++ in the kernel is asking for
trouble.  And C++ in user-land?  Same thing: you'll end up wanting to
turn parts of your application into libraries, and then some other
developer will want to use those in their C++ app, and then you run into
the ABI issues all over again.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 64-bit vs 32-bit applications

2010-08-19 Thread Nicolas Williams
On Fri, Aug 20, 2010 at 09:23:56AM +1200, Ian Collins wrote:
> On 08/20/10 08:30 AM, Garrett D'Amore wrote:
> >There is no common C++ ABI.  So you get into compatibility concerns
> >between code built with different compilers (like Studio vs. g++).
> >Fail.
> 
> Which is why we have extern "C".  Just about any Solaris driver,
> library or kernel module could be implemented in C++ behind the C
> compatibility layer and no one would notice.

Any driver C++ code would still need a C++ run-time.  Either you must
statically link it in, or you'll have a problem with multiple drivers
using different C++ run-times.  If you statically link in the run-time,
then you're bloating the text of the kernel.  If you're not then you
have a problem.  C++ is bad because of its ABI issues, really.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] User level transactional API

2010-08-12 Thread Nicolas Williams
On Thu, Aug 12, 2010 at 07:48:10PM -0500, Norm Jacobs wrote:
> For single file updates, this is commonly solved by writing data to
> a temp file and using rename(2) to move it in place when it's ready.

For anything more complicated you need... a more complicated approach.

Note that "transactional API" means, among other things, "rollback" --
easy at the whole dataset level, hard in more granular form.  Dataset-
level rollback is nowhere near granular enough for applications.

Application transactions consisting of more than one atomic filesystem
operation require application-level recovery code.  SQLite3 is a good
(though maybe extreme?) example of such an application; there are many
others.
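
For reference, a minimal sketch of the temp-file-plus-rename(2) idiom
from the quoted text (names are illustrative; rename(2) is atomic within
a filesystem, so readers see either the old contents or the new, never a
mix):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
update_file_atomically(const char *path, const char *tmp,
    const char *data, size_t len)
{
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd == -1)
                return (-1);
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) == -1) {
                (void) close(fd);
                (void) unlink(tmp);     /* clean up the partial temp file */
                return (-1);
        }
        if (close(fd) == -1)
                return (-1);
        return (rename(tmp, path));     /* the atomic switch-over */
}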

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris Filesystem

2010-07-14 Thread Nicolas Williams
On Wed, Jul 14, 2010 at 03:07:59PM -0600, Beau J. Bechdol wrote:
> So not sure if this is the correct list to email to or not. I am curious to
> know on my machine I have two hard drive (c8t0d0 and c8t1d0). Can some one
> explain to me what this exactly means? What does "c8" "t0" and "d0" actually
> mean. I might have to go back to solaris 101 to understand what this all
> means.

The 'c' is for "controller", and the number that follows is one that is
assigned to the given controller (not necessarily on a first-come-
first-served 0-based basis!).  The controller number should be
considered unpredictable at install time.  Once installed it shouldn't
change, except for removable disks, where the controller number might
vary according to which slot you plugged the disk into.

The 't' is for "target".

The 'd' is for "disk" -- think LUN.

The 'p' is for "partition", and is used in Solaris on x86.

The 's' is for "slice".  Slices are like partitions, but only used in
SOLARIS2 partitions, of which you're allowed no more than one per disk.

So c8t0d0 and c8t1d0 are LUN 0 of targets 0 and 1, respectively, on
controller 8.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Hash functions (was Re: Hashing files rapidly on ZFS)

2010-07-09 Thread Nicolas Williams
On Thu, Jul 08, 2010 at 08:42:33PM -0700, Garrett D'Amore wrote:
> On Fri, 2010-07-09 at 10:23 +1000, Peter Jeremy wrote:
> > In theory, collisions happen.  In practice, given a cryptographic hash,
> > if you can find two different blocks or files that produce the same
> > output, please publicise it widely as you have broken that hash function.
> 
> Not necessarily.  While you *should* publicize it widely, given all the
> possible text that we have, and all the other variants, it's
> theoretically possible to get lucky.  Like winning a lottery where
> everyone else has a million tickets, but you only have one.
> 
> Such an occurrence -- if isolated -- would not, IMO, constitute a
> 'breaking' of the hash function.

A hash function is broken when we know how to create colliding
inputs.  A random collision does not a break make, though it might,
perhaps, help figure out how to break the hash function later.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup RAM requirements, vs. L2ARC?

2010-06-30 Thread Nicolas Williams
On Wed, Jun 30, 2010 at 01:35:31PM -0700, valrh...@gmail.com wrote:
> Finally, for my purposes, it doesn't seem like a ZIL is necessary? I'm
> the only user of the fileserver, so there probably won't be more than
> two or three computers, maximum, accessing stuff (and writing stuff)
> remotely.

It depends on what you're doing.

The perennial complaint about NFS is the synchronous open()/close()
operations and the fact that archivers (tar, ...) will generally unpack
archives in a single-threaded manner, which means all those synchronous
ops punctuate the archiver's performance with pauses.  This is a load
type for which ZIL devices come in quite handy.  If you write lots of
small files often and in single-threaded ways _and_ want to guarantee
you don't lose transactions, then you want a ZIL device.  (The recent
knob for controlling whether synchronous I/O gets done asynchronously
would help you if you don't care about losing a few seconds worth of
writes, assuming that feature makes it into any release of Solaris.)

> But, from what I can gather, by spending a little under $400, I should
> substantially increase the performance of my system with dedup? Many
> thanks, again, in advance.

If you have deduplicatious data, yes.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSDs adequate ZIL devices?

2010-06-16 Thread Nicolas Williams
On Wed, Jun 16, 2010 at 04:44:07PM +0200, Arne Jansen wrote:
> Please keep in mind I'm talking about a usage as ZIL, not as L2ARC or main
> pool. Because ZIL issues nearly sequential writes, due to the NVRAM-protection
> of the RAID-controller the disk can leave the write cache enabled. This means
> the disk can write essentially with full speed, meaning 150MB/s for a 15k 
> drive.
> 114000 4k writes/s are 456MB/s, so 3 spindles should do.

You'd still have to flush those caches at the end of each transaction,
which would tend to come every few seconds, so you'd need to factor that
in.  You can definitely do with disk what you can do with SSDs, but not
necessarily with the same SWAP (space, wattage and price), and you'd
have a more complex system no matter what.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication and ISO files

2010-06-04 Thread Nicolas Williams
On Fri, Jun 04, 2010 at 12:37:01PM -0700, Ray Van Dolson wrote:
> On Fri, Jun 04, 2010 at 11:16:40AM -0700, Brandon High wrote:
> > On Fri, Jun 4, 2010 at 9:30 AM, Ray Van Dolson  wrote:
> > > The ISO's I'm testing with are the 32-bit and 64-bit versions of the
> > > RHEL5 DVD ISO's.  While both have their differences, they do contain a
> > > lot of similar data as well.
> > 
> > Similar != identical.
> > 
> > Dedup works on blocks in zfs, so unless the iso files have identical
> > data aligned at 128k boundaries you won't see any savings.
> > 
> > > If I explode both ISO files and copy them to my ZFS filesystem I see
> > > about a 1.24x dedup ratio.
> > 
> > Each file starts a new block, so the identical files can be deduped.
> > 
> > -B
> 
> Makes sense.  So, as someone else suggested, decreasing my block size
> may improve the deduplication ratio.
> 
> recordsize I presume is the value to tweak?

Yes -- recordsize is the one to tweak, though note that changing it only
affects newly written files.  But I'd not expect that much commonality
between 32-bit and 64-bit Linux ISOs...

Do the same check again with the ISOs "exploded", as you say.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions about zil

2010-05-24 Thread Nicolas Williams
On Mon, May 24, 2010 at 05:48:56PM -0400, Thomas Burgess wrote:
> I recently got a new SSD (ocz vertex LE 50gb)
> 
> It seems to work really well as a ZIL performance wise.  My question is, how
> safe is it?  I know it doesn't have a supercap so let's say dataloss
> occurs... is it just dataloss or is it pool loss?

Just dataloss.

> also, does the fact that i have a UPS matter?

Relative to power loss, yes.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] send/recv over ssh

2010-05-20 Thread Nicolas Williams
On Thu, May 20, 2010 at 04:23:49PM -0400, Thomas Burgess wrote:
> I know i'm probably doing something REALLY stupid.but for some reason i
> can't get send/recv to work over ssh.  I just built a new media server and
> i'd like to move a few filesystem from my old server to my new server but
> for some reason i keep getting strange errors...
> 
> At first i'd see something like this:
> 
> pfexec: can't get real path of ``/usr/bin/zfs''
> 
> or something like this:
> 
> zfs: Command not found

Add /usr/sbin to your PATH or use /usr/sbin/zfs as the full path of the
zfs(1M) command.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Nicolas Williams
On Wed, May 19, 2010 at 02:29:24PM -0700, Don wrote:
> "Since it ignores Cache Flush command and it doesn't have any
> persistant buffer storage, disabling the write cache is the best you
> can do."
> 
> This actually brings up another question I had: What is the risk,
> beyond a few seconds of lost writes, if I lose power, there is no
> capacitor and the cache is not disabled?

You can lose all writes from the last committed transaction (i.e., the
one before the currently open transaction).  (You also lose writes from
the currently open transaction, but that's unavoidable in any system.)

Nowadays the system will let you know at boot time that the last
transaction was not committed properly and you'll have a chance to go
back to the previous transaction.

For me, getting much-better-than-disk performance out of an SSD with
cache disabled is enough to make that SSD worthwhile, provided the price
is right of course.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in campus clusters

2010-05-19 Thread Nicolas Williams
On Wed, May 19, 2010 at 07:50:13AM -0700, John Hoogerdijk wrote:
> Think about the potential problems if I don't mirror the log devices
> across the WAN.

If you don't mirror the log devices then your disaster recovery
semantics will be that you'll miss any transactions that hadn't been
committed to disk yet at the time of the disaster.  Which means that the
log devices' effects is purely local: for recovery from local power
failures (not extending to local disasters) and for acceleration.

This may or may not be acceptable to you.  If not, then mirror the log
devices.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] inodes in snapshots

2010-05-19 Thread Nicolas Williams
On Wed, May 19, 2010 at 05:33:05AM -0700, Chris Gerhard wrote:
> The reason for wanting to know is to try and find versions of a file.

No, there's no such guarantee.  The same inode and generation number
pair is extremely unlikely to be re-used, but the inode number itself is
likely to be re-used.

> If a file is renamed then the only way to know that the renamed file
> was the same as a file in a snapshot would be if the inode numbers
> matched. However for that to be reliable it would require the i-nodes
> are not reused.

There's also the crtime (creation time, not to be confused with ctime),
which you can get with ls(1).

>  If they are able to be reused then when an inode number matches I
>  would also have to compare the real creation time which requires
>  looking at the extended attributes.

Right, that's what you'll have to do.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Nicolas Williams
On Thu, May 06, 2010 at 03:30:05PM -0500, Wes Felter wrote:
> On 5/6/10 5:28 AM, Robert Milkowski wrote:
> 
> >sync=disabled
> >Synchronous requests are disabled. File system transactions
> >only commit to stable storage on the next DMU transaction group
> >commit which can be many seconds.
> 
> Is there a way (short of DTrace) to write() some data and get
> notified when the corresponding txg is committed? Think of it as a
> poor man's group commit.

fsync(2) is it.  Of course, if you disable sync writes then there's no
way to find out for sure.  If you need to know when a write is durable,
then don't disable sync writes.
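
A sketch of a poor man's group commit using fsync(2) (illustrative
only): buffer a batch of writes, then issue a single fsync; when it
returns, everything written to the descriptor before it is durable.

#include <sys/types.h>
#include <unistd.h>

/* returns 0 once every record in the batch is on stable storage */
static int
commit_batch(int fd, const char **recs, const size_t *lens, int nrecs)
{
        int i;

        for (i = 0; i < nrecs; i++) {
                if (write(fd, recs[i], lens[i]) != (ssize_t)lens[i])
                        return (-1);
        }
        /* one fsync covers the whole batch -- the group commit */
        return (fsync(fd));
}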

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: zfshistory

2010-04-21 Thread Nicolas Williams
On Wed, Apr 21, 2010 at 01:03:39PM -0500, Jason King wrote:
> ISTR POSIX also doesn't allow a number of features that can be turned
> on with zfs (even ignoring the current issues that prevent ZFS from
> being fully POSIX compliant today).  I think an additional option for
> the snapdir property ('directory' ?) that provides this behavior (with
> suitable warnings about posix compliance) would be reasonable.
> 
> I believe it's sufficient that zfs provide the necessary options to
> act in a posix compliant manner (much like you have to set $PATH
> correctly to get POSIX conforming behavior, even though that might not
> be the default), though I'm happy to be corrected about this.

Yes, that's true.  But you couldn't rely on this behavior, whereas you
can rely on dataset roots having .zfs.  If you're going to script this,
then you'll want to rely on the current (POSIX-compliant) behavior.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: zfshistory

2010-04-21 Thread Nicolas Williams
POSIX doesn't allow us to have special dot files/directories outside
filesystem root directories.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: zfshistory

2010-04-21 Thread Nicolas Williams
On Wed, Apr 21, 2010 at 10:45:24AM -0400, Edward Ned Harvey wrote:
> > From: Mark Shellenbaum [mailto:mark.shellenb...@oracle.com]
> > >
> > > You can create/destroy/rename snapshots via mkdir, rmdir, mv inside
> > the
> > > .zfs/snapshot directory, however, it will only work if you're running
> > the
> > > command locally.  It will not work from a NFS client.
> > >
> > 
> > It will work over NFS or SMB, but you will need to allow it via the
> > necessary delegated administration permissions.
> 
> Go on?
> I tried it over NFS and it didn't work.  So ... what are the "necessary
> permissions?"

See zfs(1M), search for "delegate".

> I did it from a NFS client as root, where root maps to root.

Huh; dunno why that didn't work.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: zfshistory

2010-04-20 Thread Nicolas Williams
On Tue, Apr 20, 2010 at 04:28:02PM +, A Darren Dunham wrote:
> On Sat, Apr 17, 2010 at 09:03:33AM -0400, Edward Ned Harvey wrote:
> > > "zfs list -t snapshot" lists in time order.
> > 
> > Good to know.  I'll keep that in mind for my "zfs send" scripts but it's not
> > relevant for the case at hand.  Because "zfs list" isn't available on the
> > NFS client, where the users are trying to do this sort of stuff.
> 
> I'll note for comparison that the Netapp shapshots do expose this in one
> way.
> 
> The actual snapshot directory access time is set to the time of the
> snapshot. That makes it visible over NFS.  Would be handy to do
> something similar in ZFS.

The .zfs/snapshot directory is most certainly available over NFS.

But note that .zfs does not appear in directory listings of dataset
roots -- you have to actually refer to it:

% ls -f|fgrep .zfs
% ls -f .zfs
.   ..   snapshot
% ls .zfs/snapshot

% nfsstat -m $PWD
/net/.../pool/nico from ...:/pool/nico
 Flags: vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,mirrormount,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:acregmin=3,acregmax=60,acdirmin=30,acdirmax=60

%

And you can even create, rename and destroy snapshots by creating,
renaming and removing directories in .zfs/snapshot:

% mkdir .zfs/snapshot/foo
% mv .zfs/snapshot/foo .zfs/snapshot/bar
% rmdir .zfs/snapshot/bar

(All this also works locally, not just over NFS.)

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: rm files/directories from snapshots

2010-04-16 Thread Nicolas Williams
On Fri, Apr 16, 2010 at 01:56:07PM -0400, Edward Ned Harvey wrote:
> The typical problem scenario is:  Some user or users fill up the filesystem.
> They rm some files, but disk space is not freed.  You need to destroy all
> the snapshots that contain the deleted files, before disk space is available
> again.
> 
> It would be nice if you could rm files from snapshots, without needing to
> destroy the whole snapshot.
> 
> Is there any existing work or solution for this?  

See the archives.  See the other replies to you already.  Short version: no.

However, a script to find all the snapshots that you'd have to delete in
order to delete some file might be useful, but really, only marginally
so: you should send your snapshots to backup and clean them out from
time to time anyways.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: zfshistory

2010-04-16 Thread Nicolas Williams
On Fri, Apr 16, 2010 at 02:19:47PM -0700, Richard Elling wrote:
> On Apr 16, 2010, at 1:37 PM, Nicolas Williams wrote:
> > I've a ksh93 script that lists all the snapshotted versions of a file...
> > Works over NFS too.
> > 
> > % zfshist /usr/bin/ls
> > History for /usr/bin/ls (/.zfs/snapshot/*/usr/bin/ls):
> > -r-xr-xr-x   1 root     bin        33416 Jul  9  2008 /.zfs/snapshot/install/usr/bin/ls
> > -r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-12-07-20:47:58/usr/bin/ls
> > -r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-12-01-00:42:30/usr/bin/ls
> > -r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-07-17-21:08:45/usr/bin/ls
> > -r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-06-03-03:44:34/usr/bin/ls
> > % 
> > 
> > It's not perfect (e.g., it doesn't properly canonicalize its arguments,
> > so it doesn't handle symlinks and ..s in paths), but it's a start.
> 
> There are some interesting design challenges here.  For the general case, you 
> can't rely on the snapshot name to be in time order, so you need to
> sort by the mtime of the destination.

I'm using ls -ltr.

> It would be cool to only list files which are different.

True.  That'd not be hard.

> If you mv a file to another directory, you might want to search by filename
> or a partial directory+filename.

Or even inode number.

> Or maybe you just setup your tracker.cfg and be happy? 

Exactly.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZFS better: zfshistory

2010-04-16 Thread Nicolas Williams
On Fri, Apr 16, 2010 at 01:54:45PM -0400, Edward Ned Harvey wrote:
> If you've got nested zfs filesystems, and you're in some subdirectory where
> there's a file or something you want to rollback, it's presently difficult
> to know how far back up the tree you need to go, to find the correct ".zfs"
> subdirectory, and then you need to figure out the name of the snapshots
> available, and then you need to perform the restore, even after you figure
> all that out.

I've a ksh93 script that lists all the snapshotted versions of a file...
Works over NFS too.

% zfshist /usr/bin/ls
History for /usr/bin/ls (/.zfs/snapshot/*/usr/bin/ls):
-r-xr-xr-x   1 root     bin        33416 Jul  9  2008 /.zfs/snapshot/install/usr/bin/ls
-r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-12-07-20:47:58/usr/bin/ls
-r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-12-01-00:42:30/usr/bin/ls
-r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-07-17-21:08:45/usr/bin/ls
-r-xr-xr-x   1 root     bin        37612 Nov 21  2008 /.zfs/snapshot/2009-06-03-03:44:34/usr/bin/ls
% 

It's not perfect (e.g., it doesn't properly canonicalize its arguments,
so it doesn't handle symlinks and ..s in paths), but it's a start.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Rollback From ZFS Send

2010-04-06 Thread Nicolas Williams
On Tue, Apr 06, 2010 at 11:53:23AM -0400, Tony MacDoodle wrote:
> Can I rollback a snapshot that I did a zfs send on?
> 
> ie: zfs send testpool/w...@april6 > /backups/w...@april6_2010

That you did a zfs send does not prevent you from rolling back to a
previous snapshot.  Similarly for zfs recv -- that you went from one
snapshot to another by zfs receiving a send does not stop you from
rolling back to an earlier snapshot.

You do need to have an earlier snapshot to rollback to, if you want to
rollback.

Also, if you are using zfs send for backups, or for replication, and you
rollback the primary dataset, then you'll need to update your backups/
replicas accordingly.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs diff

2010-03-29 Thread Nicolas Williams
zfs diff is incredibly cool.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs diff

2010-03-29 Thread Nicolas Williams
One really good use for zfs diff would be: as a way to index zfs send
backups by contents.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send and ARC

2010-03-25 Thread Nicolas Williams
On Thu, Mar 25, 2010 at 04:23:38PM +0000, Darren J Moffat wrote:
> If the data is in the L2ARC that is still better than going out to
> the main pool disks to get the compressed version.

Well, one could just compress it...  If you'd otherwise put compression
in the ssh pipe (or elsewhere) then you could stop doing that.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS send and receive corruption across a WAN link?

2010-03-22 Thread Nicolas Williams
On Thu, Mar 18, 2010 at 10:38:00PM -0700, Rob wrote:
> Can a ZFS send stream become corrupt when piped between two hosts
> across a WAN link using 'ssh'?

No.  SSHv2 uses HMAC-MD5 and/or HMAC-SHA-1, depending on what gets
negotiated, for integrity protection.  The chances of random on-the-wire
corruption going undetected by link-layer CRCs, TCP's checksum and SSHv2's
MACs are infinitesimally small.  I suspect the chances of local bit
flips due to cosmic rays and what not are higher.

A bigger problem is that SSHv2 connections do not survive corruption on
the wire.  That is, if corruption is detected then the connection gets
aborted.  If you were zfs send'ing 1TB across a long, narrow link and
corruption hit the wire while sending the last block you'd have to
re-send the whole thing (but even then such corruption would still have
to get past link-layer and TCP checksums -- I've seen it happen, so it
is possible, but it is also unlikely).

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Nicolas Williams
BTW, it should be relatively easy to implement aclmode=ignore and
aclmode=deny, if you like.

 - $SRC/common/zfs/zfs_prop.c needs to be updated to know about the new
   values of aclmode.

 - $SRC/uts/common/fs/zfs/zfs_acl.c:zfs_acl_chmod()'s callers need to be
   modified:

- in the create path if zfs_acl_chmod() gets called then you can't
  ignore nor deny the mode;
- zfs_acl_chmod_setattr() should call neither zfs_acl_node_read()
  nor zfs_acl_chmod() if aclmode=ignore or aclmode=deny
- in all other paths zfs_acl_chmod() should do what it should do

 - $SRC/uts/common/fs/zfs/zfs_vnops.c:zfs_setattr() may need some
   updates too, e.g., to not call zfs_aclset_common() in the case of
   aclmode=ignore -- you'll probably have to play around to figure out
   what else.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Nicolas Williams
On Mon, Mar 01, 2010 at 09:04:58PM -0800, Paul B. Henson wrote:
> On Mon, 1 Mar 2010, Nicolas Williams wrote:
> > Yes, that sounds useful.  (Group modebits could be applied to all ACEs
> > that are neither owner@ nor everyone@ ACEs.)
> 
> That sounds an awful lot like the POSIX mask_obj, which was the bane of my
> previous filesystem, DFS, and which, as it seems history repeats itself, I
> was also unable to get an option implemented to ignore it and allow ACL's
> to work without impediment.

Alternatively group modebits apply to only the group@ ACEs.  This could
be just yet another option.

If no modebits were to apply to ACEs with subjects other than
owner@/group@/everyone@ (what about subjects that match the file's
owner/group but aren't owner@/group@?) then there'd be no way to use
modebits as a big filter for ACLs.  This is why I proposed the above.

> > If users have private primary groups then you can have them run with
> > umask 007 or 002 and use set-gid and/or inherittable ACLs to ensure that
> > users can share files in specific directories.  (This is one reason that
> > I recommend always giving users their own private primary groups.)
> 
> The only reason for the recommendation to give users their own private
> primary groups is because of the lack of flexibility of the umask/mode bits
> security model. In an environment with inheritable ACL's (that aren't
> subject to being violated by that legacy security model) there's no real
> need.

All reasons I have for it really come back to this: the idea of a
primary group and file group is an anachronism from back when ACLs (and
supplementary group memberships!) were overkill.  Think back to the days
when the AT&T labs were the only place where Unix ran and Unix had a
user base in the tens of users.  We're stuck with the notion of a
primary group (Windows seems to have it for interop with POSIX).  The
way to make the best of that situation is to give every user their own
private group.

> > Alternatively we could have a new mode bit to indicate that the group
> > bits of umask are to be treated as zero, or maybe assign this behavior
> > to the set-gid bit on ZFS.
> 
> So rather than a nice simple option granting ACL's immunity from umask/mode
> bits baggage, another attempted mapping/interaction?

You have a good idea of what is "simple" for your use case.  Your use
case also appears to be greatly influenced by what we could (should, do)
consider to be a bug in Samba.  Your idea of "simple" may not match
every one else's.  And your idea of "simple" might well differ if that
one application didn't use chmod() at all.

Personally I don't see a simple, non-surprising solution.  I see a set
of solutions that one could pick from.  In all cases I think we need a
way to synthesize modebits from ACLs (e.g., for objects created via
protocols that have no conception of modebits but have a conception of
ACLs) -- that's a difficult problem because any algorithm for doing that
will necessarily be lossy in many cases.

> If you only ever access ZFS via CIFS from windows clients, you can have a
> pure ACL model. Why should access via local shell or NFSv4 be a poor
> stepchild and chained down with legacy semantics that make it exceedingly
> difficult to actually use ACL's for their intended purpose?

I am certainly not advocating that.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-02 Thread Nicolas Williams
On Tue, Mar 02, 2010 at 11:10:52AM -0800, Bill Sommerfeld wrote:
> On 03/02/10 08:13, Fredrich Maney wrote:
> >Why not do the same sort of thing and use that extra bit to flag a
> >file, or directory, as being an ACL only file and will negate the rest
> >of the mask? That accomplishes what Paul is looking for, without
> >breaking the existing model for those that need/wish to continue to
> >use it?
> 
> While we're designing on the fly:

Heh.

>   Another possibility would be to use an 
> additional umask bit or two to influence the mode-bit - acl interaction.

Well, I think the bit, if we must have one, belongs in the filesystem
objects that have ACLs, as opposed to processes.  There may be no umask
to apply in remote access cases, so using a process attribute is likely
to result in different behavior according to the access protocol and
client.  That might not be surprising for the CIFS case, but it
certainly would be for the NFS case.

But also I think it's the owner of an object that should decide what
happens to the object's ACL on chmod rather than random programs and
user environments.

We might need multiple bits, but we do have multiple bits to play with
in mode_t.  The main issue with adding mode_t bits is going to be: will
apps handle the appearance of new mode_t bits correctly?  I suspect that
they will, or at least that we'd consider it a bug if they didn't.  Or we
could add a new file attribute.

But given cheap datasets, why not settle for a suitable dataset property
as a starting point.  I.e., maybe we could play with aclmode a little
more.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-03-01 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 03:00:29PM -0500, Miles Nordin wrote:
> >>>>> "nw" == Nicolas Williams  writes:
> 
> nw> What could we do to make it easier to use ACLs?
> 
> 1. how about AFS-style ones where the effective permission is the AND
>of the ACL and the unix permission?  You might have to combine this

Yes, that sounds useful.  (Group modebits could be applied to all ACEs
that are neither owner@ nor everyone@ ACEs.)

>with an inheritable-by-subdirectories umask setting so you could
>create ACL-dominated lands of files that are all unix 777, but this
>would stop clobbering difficult-to-recreate ACL's as well as
>unintended information leaking.

If users have private primary groups then you can have them run with
umask 007 or 002 and use set-gid and/or inheritable ACLs to ensure that
users can share files in specific directories.  (This is one reason that
I recommend always giving users their own private primary groups.)

Alternatively we could have a new mode bit to indicate that the group
bits of umask are to be treated as zero, or maybe assign this behavior
to the set-gid bit on ZFS.

> 2. define a standard API for them, add ability to replicate them to
>[...]

That'd be nice.

> Maybe we're beyond the point of no return for the first suggestion.

Why?  It can just be another value of the aclmode property.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 04:26:43PM -0800, Paul B. Henson wrote:
> On Fri, 26 Feb 2010, Nicolas Williams wrote:
> > I believe we can do a bit better.
> >
> > A chmod that adds (see below) or removes one of r, w or x for owner is a
> > simple ACL edit (the bit may turn into multiple ACE bits, but whatever)
> > modifying / replacing / adding owner@ ACEs (if there is one).  A similar
> > chmod affecting group bits should probably apply to group@ ACEs.  A
> > similar chmod affecting other should apply to any everyone@ ACEs.
> 
> I don't necessarily think that's better; and I believe that's approximately
> the behavior you can already get with aclmode=passthrough.
> 
> If something is trying to change permissions on an object with a
> non-trivial ACL using chmod, I think it's safe to assume that's not what
> the original user who configured the ACL wants. At least, that would be
> safe to assume if the user had explicitly configured the hypothetical
> aclmode=deny or aclmode=ignore :).

Suppose you deny or ignore chmods.  Well, how would you ever set or
reset set-uid/gid and sticky bits?  chmod(2) deals only in absolute
modes, not relative changes, which means that in order to distinguish
those bits from the rwx bits the filesystem would have to know the
file's current mode bits in order to compare them to the new bits -- but
this is hard (see my other e-mail in a new sub-thread).  You'd have to
remove the ACL then chmod; oof.

> Take, for example, a problem I'm currently having on Linux clients mounting
> ZFS over NFSv4. Linux supports NFSv4, and even has a utility to manipulate
> NFSv4 ACL's that works ok (but isn't nearly as nice as the ACL integrated
> chmod command in Solaris). However, the default behavior of the linux cp
> command is to try and copy the mode bits along with the file. So, I copy
> a file into zfs over the NFSv4 mount from some local location. The file is
> created and inherits the explicitly configured ACL from the parent
> directory; the cp command then does a chmod() on it and the ACL is broken.
> That's not what I want, I configured that inheritable ACL for a reason, and
> I want it respected regardless of the permissions of the file in its
> original location.

Can you make that utility avoid the chmod?  The mode bits should come
from the open(2)/creat(2), and there should be no need to set them again
after setting the ACL.

> Another instance is an application that doesn't seem to trust creat() and
> umask to do the right thing, after creating a file it explicitly chmod's it
> to match the permissions it thinks it should have had based on the
> requested mode and the current umask. If the file inherited an explicitly
> specified non-trivial ACL, there's really nothing that can be done about
> that chmod, other than ignore or deny it, that will result in the
> permissions intended by the user who configured the ACL.

Such an app is broken.

> > For set-uid/gid and the sticky bits being set/cleared on non-directories
> > chmod should not affect the ACL at all.
> 
> Agreed.

But see above, below.

> > For directories the sticky and setgid bits may require editing the
> > inherittable ACEs of the ACL.
> 
> Sticky bit yes; in fact, as it affects permissions I think I'd lump that in
> to the ignore/deny category. sgid on directory though? That doesn't
> explicitly affect permission, it just potentially changes the group
> ownership of new files/directories. I suppose that indirectly affects
> permissions, as the implicit group@ ACE would be applied to a different
> group, but that's probably the intention of the person setting the sgid
> bit, and I don't think any actual ACL entry changes should occur from it.

I think both can be implemented as inherittable ACLs.

> > chmod(2) always takes an absolute mode.  ZFS would have to reconstruct
> > the relative change based on the previous mode...
> 
> Or perhaps some interface extension allowing relative changes to the
> non-permission mode bits?

But we'd have to extend NFSv4 and get the extension adopted and
deployed.  There's no chance of such a change being made in a short
period of time -- we're talking years.

>   For example, chown(2) allows you to specify -1
> for either the user or group, meaning don't change that one. mode_t is
> unsigned, so negative values won't work there, but there are a ton of
> extra bits in an unsigned int not relevant to the mode, perhaps setting one
> of them to signify only non permission related mode bits should be
> manipulated:

True, there's enough unused bits there that you could add ignore bits
(and mode4 is an unsigned 32-bit integer in NFSv4).

[zfs-discuss] chmod(2) vs. ACLs (Re: Who is using ZFS ACL's in production?)

2010-02-26 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 05:02:34PM -0600, David Dyer-Bennet wrote:
> 
> On Fri, February 26, 2010 12:45, Paul B. Henson wrote:
> 
> > I've already posited as to an approach that I think would make a pure-ACL
> > deployment possible:
> >
> > 
> > http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037206.html
> >
> > Via this concept or something else, there needs to be a way to configure
> > ZFS to prevent the attempted manipulation of legacy permission mode bits
> > from breaking the security policy of the ACL.
> 
> It seems to me that it should depend.
> 
> chown ddb /path/to/file
> chmod 640 /path/to/file
> 
> constitutes explicit instructions to give read-write access to ddb, read
> access to people in the group, and no access to others.  Now,  how should
> that be combined with an ACL?

The chown is irrelevant (well, it's relevant to you in terms of your
intentions, but it's very hard for the filesystem to consider a chmod in
relation to earlier chowns and chgrps).

I see four ways to handle the mode mask vs. ACL conflict:

a) clobber the ACL;
b) map the change as best you can to an ACL change;
c) ignore the rwx bits in the mode mask (except on create from a POSIX
   open(2)/creat(2), in which case the ACL has to be derived from the
   initial mode);
d) fail the chmod().

All four can be surprising!  (d) may be the least surprising, but it
may disrupt some apps.  (b) is the next least surprising, but it has
some dangerous effects.  (b) is tricky because the filesystem needs to
figure out what the change actually was by tracking mode bits from the
beginning.

For (b) IMO the right thing to do would be to always track a mode mask
whose rwx bits are not actually used for authorization, but which are
used to detect changes on chmod(2), and then the changes should be
applied as best effort edits of the ACLs.  On create via non-POSIX
methods the mode mask would have to be constructed synthetically.  When
the ACL is edited the current mode bits have to be brought in sync with
owner@/group@/everyone@ ACEs.  All methods of synchronizing or
synthesizing a mode mask from/to an ACL are going to be lossy.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 02:50:05PM -0800, Paul B. Henson wrote:
> On Fri, 26 Feb 2010, Bill Sommerfeld wrote:
> 
> > I believe this proposal is sound.
> 
> Mere words can not express the sheer joy with which I receive this opinion
> from an @sun.com address ;).

I believe we can do a bit better.

A chmod that adds (see below) or removes one of r, w or x for owner is a
simple ACL edit (the bit may turn into multiple ACE bits, but whatever)
modifying / replacing / adding owner@ ACEs (if there is one).  A similar
chmod affecting group bits should probably apply to group@ ACEs.  A
similar chmod affecting other should apply to any everyone@ ACEs.

For set-uid/gid and the sticky bits being set/cleared on non-directories
chmod should not affect the ACL at all.  For directories the sticky and
setgid bits may require editing the inheritable ACEs of the ACL.

> There's also the question of what to do with the non-access-control pieces
> of the legacy mode bits that have no ACL equivilent (suid, sgid, sticky
> bit, et al). I think the only way to set those is with an absolute chmod,

chmod(2) always takes an absolute mode.  ZFS would have to reconstruct
the relative change based on the previous mode... but how to know what
the "previous mode" was?  ZFS would have to construct one from the
owner@/group@/everyone@ + set-uid/gid + sticky bits, if any.  Best
effort will do.
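
A sketch of that reconstruction (illustrative): given a best-effort
notion of the previous mode, the relative change falls out of two
bitwise operations.

#include <sys/types.h>

/*
 * chmod(2) only passes the new absolute mode, so the relative change
 * has to be derived: bits present in the new mode but not the old were
 * added, and vice versa.
 */
static void
mode_delta(mode_t oldm, mode_t newm, mode_t *added, mode_t *removed)
{
        *added = newm & ~oldm;          /* bits turned on by the chmod */
        *removed = oldm & ~newm;        /* bits turned off by the chmod */
}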

> so there'd be no way to manipulate them in the current implementation
> without whacking the ACL. That's likely done relatively infrequently, those
> bits could always be set before the ACL is applied. In our current
> deployment the only one we use is sgid on directories, which is inherited,
> not directly applied.

You should probably stop using the set-gid bit on directories and use
inheritable ACLs instead...

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Who is using ZFS ACL's in production?

2010-02-26 Thread Nicolas Williams
On Fri, Feb 26, 2010 at 08:23:40AM -0800, Paul B. Henson wrote:
> So far it's been quite a struggle to deploy ACL's on an enterprise central
> file services platform with access via multiple protocols and have them
> actually be functional and reliable. I can see why the average consumer
> might give up.

Can you describe your struggles?  What could we do to make it easier to
use ACLs?  Is this about chmod [and so random apps] clobbering ACLs? or
something more fundamental about ACLs?

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Nicolas Williams
On Wed, Feb 24, 2010 at 03:31:51PM -0600, Bob Friesenhahn wrote:
> With millions of such tiny files, it makes sense to put the small 
> files in a separate zfs filesystem which has its recordsize property 
> set to a size not much larger than the size of the files.  This should 
> reduce waste, resulting in reduced potential for fragmentation in the 
> rest of the pool.

Tuning the dataset recordsize down does not help in this case.  The
files are already small, so their recordsize is already small.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Nicolas Williams
On Wed, Feb 24, 2010 at 02:09:42PM -0600, Bob Friesenhahn wrote:
> I have a directory here containing a million files and it has not 
> caused any strain for zfs at all although it can cause considerable 
> stress on applications.

The biggest problem is always the apps.  For example, ls by default
sorts, and if you're using a locale with a non-trivial collation (e.g.,
any UTF-8 locales) then the sort gets very expensive.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS 'secure erase'

2010-02-08 Thread Nicolas Williams
On Mon, Feb 08, 2010 at 03:41:16PM -0500, Miles Nordin wrote:
> ch> In our particular case, there won't be
> ch> snapshots of destroyed filesystems (I create the snapshots,
> ch> and destroy them with the filesystem).
> 
> Right, but if your zpool is above a zvol vdev (ex COMSTAR on another
> box), then someone might take a snapshot of the encrypted zvol.  Then
> after you ``securely delete'' a filesystem by overwriting various
> intermediate keys or whatever, they might roll back the zvol snapshot
> to undelete.
> 
> Yes, you still need the passphrase to reach what they've undeleted,
> but that's always true---what's ``secure delete'' supposed to mean
> besides the ability to permanently remove one dataset but not others,
> even from those who posess the passphrase?  Otherwise it would not be
> a feature.  It would just be a suggestion: ``forget your passphrase.''

Correct.  Secure erasure through "forgetting the keys" really does
depend on "forgetting the keys", which does include "forgetting the
passphrase".  The only way to avoid that would be to store the wrapped
keys in local keystores (i.e., a TPM or a smartcard) that do support
secure erasure, so that "forgetting the keys" can be done without having
to forget passphrases.

> nw> ZFS crypto over zvols and what not presents no additional
> nw> problems.
> 
> If you are counting on the ability to forget a key by overwriting the
> block of vdev in which the key's stored, then doing it over zvol's is
> an additional problem.

True, but this could happen regardless of whether the underlying storage
is a zvol or not.  I stand by the statement that "ZFS crypto over zvols
and what not presents no additional problems".

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS 'secure erase'

2010-02-05 Thread Nicolas Williams
On Fri, Feb 05, 2010 at 05:08:02PM -0500, c.hanover wrote:
> In our particular case, there won't be snapshots of destroyed
> filesystems (I create the snapshots, and destroy them with the
> filesystem).

OK.

> I'm not too sure on the particulars of NFS/ZFS, but would it be
> possible to create a 1GB file without writing any data to it, and then
> use a hex editor to access the data stored on those blocks previously?

Absolutely not.

That is, you can create a 1GB file without writing to it, but it will
appear to contain all zeros.
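
You can see this for yourself (untested sketch; mkfile -n allocates no
blocks, and the file name is hypothetical):

    mkfile -n 1g sparse.dat                            # 1GB file, nothing written
    dd if=sparse.dat bs=1024k count=1 | od -c | head   # prints only zeros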

> Any chance someone could make any kind of sense of the contents
> (allocated in the same order they were before, or what have you)?

No.  See above.

> ZFS crypto will be nice when we get either NFSv4 or NFSv3 w/krb5 for
> over the wire encryption.  Until then, not much point.

You can use NFS with krb5 over the wire encryption _now_.
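
For example (server and share names hypothetical; krb5p gives integrity
plus privacy protection on the wire):

    mount -o vers=4,sec=krb5p server:/export/users /mnt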

Nico
-- 


Re: [zfs-discuss] ZFS 'secure erase'

2010-02-05 Thread Nicolas Williams
On Fri, Feb 05, 2010 at 04:41:08PM -0500, Miles Nordin wrote:
> > "ch" == c hanover  writes:
> 
> ch> is there a way to a) securely destroy a filesystem,
> 
> AIUI zfs crypto will include this, some day, by forgetting the key.

Right.

> but for SSD, zfs above a zvol, or zfs above a SAN that may do
> snapshots without your consent, I think it's just logically not a
> solveable problem, period, unless you have a writeable keystore
> outside the vdev structure.

IIRC ZFS crypto will store encrypted blocks in L2ARC and ZIL, so
forgetting the key is sufficient to obtain a high degree of security.

ZFS crypto over zvols and what not presents no additional problems.
However, if your passphrase is guessable then the key might be
recoverable even after it's "forgotten".

Nico
-- 


Re: [zfs-discuss] ZFS 'secure erase'

2010-02-05 Thread Nicolas Williams
On Fri, Feb 05, 2010 at 03:49:15PM -0500, c.hanover wrote:
> Two things, mostly related, that I'm trying to find answers to for our
> security team.
> 
> Does this scenario make sense:
> * Create a filesystem at /users/nfsshare1, user uses it for a while,
> asks for the filesystem to be deleted
> * New user asks for a filesystem and is given /users/nfsshare2.  What
> are the chances that they could use some tool or other to read
> unallocated blocks to view the previous user's data?

If the tool isn't accessing the raw disks, then the answer is "no
chance".  (There's no way to access the raw disks over NFS.)

> Related to that, when files are deleted on a ZFS volume over an NFS
> share, how are they wiped out?  Are they zeroed or anything.  Same
> question for destroying ZFS filesystems, does the data lay about in
> any way?  (That's largely answered by the first scenario.)

Deleting a file does not guarantee that data blocks are released:
snapshots might exist that retain references to the data blocks of a
file that is being deleted.  Nor are blocks wiped when released.

> If the data is retrievable in any way, is there a way to a) securely
> destroy a filesystem, or b) securely erase empty space on a
> filesystem.

When ZFS crypto ships you'll be able to securely destroy encrypted
datasets.  Until then the only form of secure erasure is to destroy the
pool and then wipe the individual disks.
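
For example (pool name hypothetical; format(1M)'s analyze/purge writes
patterns over the whole disk):

    zpool destroy tank
    # then for each disk, in format(1M): analyze -> purge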

Nico
-- 


Re: [zfs-discuss] unionfs help

2010-02-04 Thread Nicolas Williams
On Thu, Feb 04, 2010 at 04:03:19PM -0500, Frank Cusack wrote:
> On 2/4/10 2:46 PM -0600 Nicolas Williams wrote:
> >In Frank's case, IIUC, the better solution is to avoid the need for
> >unionfs in the first place by not placing pkg content in directories
> >that one might want to be writable from zones.  If there's anything
> >about Perl5 (or anything else) that causes this need to arise, then I
> >suggest filing a bug.
> 
> Right, and thanks for chiming in.  Problem is that perl wants to install
> add-on packages in places that coincide with the system install.
> Most stuff is limited to the site_perl directory, which is easily
> redirected, but it also has some other locations it likes to meddle with.

Maybe we need a zone_perl location.  Judicious use of the search paths
will get you out of this bind, I think.

Nico
-- 


Re: [zfs-discuss] unionfs help

2010-02-04 Thread Nicolas Williams
On Thu, Feb 04, 2010 at 03:19:15PM -0500, Frank Cusack wrote:
> BTW, I could just install everything in the global zone and use the
> default "inheritance" of /usr into each local zone to see the data.
> But then my zones are not independent portable entities; they would
> depend on some non-default software installed in the global zone.
> 
> Just wanted to explain why this is valuable to me and not just some
> crazy way to do something simple.

There's no unionfs for Solaris.

(For those of you who don't know, unionfs is a BSDism and is a
pseudo-filesystem which presents the union of two underlying
filesystems, but with all changes being made only to one of the two
filesystems.  The idea is that one of the underlying filesystems cannot
be modified through the union, with all changes made through the union
being recorded in an overlay fs.  Think, for example, of unionfs-
mounting read-only media containing sources: you could cd to the mount
point and build the sources, with all intermediate files and results
placed in the overlay.)

In Frank's case, IIUC, the better solution is to avoid the need for
unionfs in the first place by not placing pkg content in directories
that one might want to be writable from zones.  If there's anything
about Perl5 (or anything else) that causes this need to arise, then I
suggest filing a bug.

Nico
-- 


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-21 Thread Nicolas Williams
On Thu, Jan 21, 2010 at 02:11:31PM -0800, Moshe Vainer wrote:
> >PS: For data that you want to mostly archive, consider using Amazon
> >Web Services (AWS) S3 service. Right now there is no charge to push
> >data into the cloud and its $0.15/gigabyte to keep it there. Do a
> >quick (back of the napkin) calculation on what storage you can get for
> >$30/month and factor in bandwidth costs (to pull the data when/if you
> >need it). My "napkin" calculations tell me that I cannot compete
> >with AWS S3 for up to 100Gb of storage available 7x24. Even the
> >electric utility bill would be more than AWS charges - especially when
> >you consider UPS and air conditioning. And thats not including any
> >hardware (capital equipment) costs! see: http://aws.amazon.com/s3/
> 
> When going the amazon route, you always need to take into account
> retrieval time/bandwidth cost.  If you were to store 100GB on Amazon -
> how fast can you get your data back, or how much would bandwidth cost
> you to retrieve it in a timely manner. It is all a matter of
> requirements of course.

Don't forget asymmetric upload/download bandwidth.


Re: [zfs-discuss] DeDup and Compression - Reverse Order?

2009-12-17 Thread Nicolas Williams
On Thu, Dec 17, 2009 at 03:32:21PM +0100, Kjetil Torgrim Homme wrote:
> if the hash used for dedup is completely separate from the hash used for
> data protection, I don't see any downsides to computing the dedup hash
> from uncompressed data.  why isn't it?

Hash and checksum functions are slow (hash functions are slower, but
either way you'll be loading large blocks of data, which sets a floor
for cost).  Duplicating work is bad for performance.  Using the same
checksum for integrity protection and dedup is an optimization, and a
very nice one at that.  Having separate checksums would require making
blkptr_t larger, which imposes its own costs.

There's lots of trade-offs here.  Using the same checksum/hash for
integrity protection and dedup is a great solution.

If you use a non-cryptographic checksum algorithm then you'll
want to enable verification for dedup.  That's all.
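
For example (pool name hypothetical):

    zfs set dedup=verify tank    # dedup, with byte-for-byte verification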

Nico
-- 


Re: [zfs-discuss] file concatenation with ZFS copy-on-write

2009-12-03 Thread Nicolas Williams
On Thu, Dec 03, 2009 at 12:44:16PM -0800, Per Baatrup wrote:
> >if any of f2..f5 have different block sizes from f1
> 
> This restriction does not sound so bad to me if this only refers to
> changes to the blocksize of a particular ZFS filesystem or copying
> between different ZFSes in the same pool. This can properly be managed
> with a "-f" switch on the userlan app to force the copy when it would
> fail.

Why expose such details?

If you have dedup on and if the file blocks and sizes align then

cat f1 f2 f3 f4 f5 > f6

will do the right thing and consume only space for new metadata.

If the file blocks and sizes do not align then

cat f1 f2 f3 f4 f5 > f6

will still work correctly.

Or do you mean that you want a way to do that cat ONLY if it would
consume no new space for data?  (That might actually be a good
justification for a ZFS cat command, though I think, too, that one could
script it.)
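
A rough sketch of such a script (hypothetical and untested; it only
checks the size precondition against an assumed 128K recordsize rather
than inspecting the actual block layout):

    #!/bin/ksh
    # zcat_dedup out f1 f2 ...: concatenate only when every input but
    # the last is a whole multiple of the recordsize, so that dedup can
    # reclaim the copied blocks.
    rs=131072
    out=$1; shift
    i=0
    for f in "$@"; do
        i=$((i + 1))
        sz=$(ls -l "$f" | awk '{print $5}')
        if [ "$i" -lt "$#" ] && [ $((sz % rs)) -ne 0 ]; then
            echo "$f: size $sz is not a multiple of $rs" >&2
            exit 1
        fi
    done
    cat "$@" > "$out"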

> >any of f1..f5's last blocks are partial
> 
> Does this mean that f1,f2,f3,f4 need to be exact multiples of the ZFS
> blocksize? This is a severe restriction that will fail except in very
> special cases.

Say f1 is 1MB, f2 is 128KB, f3 is 510 bytes, f4 is 514 bytes, and f5 is
10MB, and the recordsize for their containing datasets is 128KB, then
the new file will consume 10MB + 128KB more than f1..f5 did, but 1MB +
128KB will be de-duplicated.

This is not really "a severe restriction".  To make ZFS do better than
that would require much extra metadata and complexity in the filesystem
that users who don't need to do space-efficient file concatenation (most
users, that is) won't want to pay for.

> Is this related to the disk format or is it restriction in the
> implrmentation? (do you know where to look in the source code?).

Both.

> >...but also ZFS most likely could not do any better with any other, more
> >specific non-dedup solution
> 
> Probably lots of I/O traffic (digest calculations and lookups) could be
> saved as we already know it will be a duplicate.  (In our case the
> files are gigabyte sizes)

ZFS hashes, and records hashes of blocks, not sub-blocks.  Look at my
above example.  To efficiently dedup the concatenation of the 10MB of f5
would require being able to have something like "sub-block pointers".
Alternatively, if you want a concatenation-specific feature ZFS would
have to have a metadata notion of concatenation, but then the Unix way
of concatenating files couldn't be used for this since the necessary
context is lost in the I/O redirection.

Nico
-- 


Re: [zfs-discuss] file concatenation with ZFS copy-on-write

2009-12-03 Thread Nicolas Williams
On Thu, Dec 03, 2009 at 03:57:28AM -0800, Per Baatrup wrote:
> I would like to to concatenate N files into one big file taking
> advantage of ZFS copy-on-write semantics so that the file
> concatenation is done without actually copying any (large amount of)
> file content.
>   cat f1 f2 f3 f4 f5 > f15
> Is this already possible when source and target are on the same ZFS
> filesystem?
> 
> Am looking into the ZFS source code to understand if there are
> sufficient (private) interfaces to make a simple "zcat -o f15   f1 f2
> f3 f4 f5" userland application in C code. Does anybody have advice on
> this?

There have been plenty of answers already.

Quite aside from dedup, the fact that all blocks in a file must have the
same uncompressed size means that if any of f2..f5 have different block
sizes from f1, or any of f1..f5's last blocks are partial then ZFS could
not perform this concatenation as efficiently as you wish.

In other words: dedup _is_ what you're looking for...

...but also ZFS most likely could not do any better with any other, more
specific non-dedup solution.

Nico
-- 


Re: [zfs-discuss] Fwd: [ilugb] Does ZFS support Hole Punching/Discard

2009-11-11 Thread Nicolas Williams
On Mon, Sep 07, 2009 at 09:58:19AM -0700, Richard Elling wrote:
> I only know of "hole punching" in the context of networking. ZFS doesn't
> do networking, so the pedantic answer is no.

But a VDEV may be an iSCSI device, thus there can be networking below
ZFS.

For some iSCSI targets (including ZVOL-based ones) a hole punching
operation can be very useful since it explicitly tells the backend that
some contiguous block of space can be released for allocation to others.

Nico
-- 


Re: [zfs-discuss] PSARC recover files?

2009-11-10 Thread Nicolas Williams
On Tue, Nov 10, 2009 at 03:33:22PM -0600, Tim Cook wrote:
> You're telling me a scrub won't actively clean up corruption in snapshots?
> That sounds absolutely absurd to me.

Depends on how much redundancy you have in your pool.  If you have no
mirrors, no RAID-Z, and no ditto blocks for data, well, you have no
redundancy, and ZFS won't be able to recover affected files.
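
Even on a single disk you can give data blocks some redundancy with the
copies property, e.g. (dataset name hypothetical; this affects only
newly written blocks):

    zfs set copies=2 tank/important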

Nico
-- 


Re: [zfs-discuss] dedup question

2009-11-02 Thread Nicolas Williams
On Mon, Nov 02, 2009 at 11:01:34AM -0800, Jeremy Kitchen wrote:
> forgive my ignorance, but what's the advantage of this new dedup over  
> the existing compression option?  Wouldn't full-filesystem compression  
> naturally de-dupe?

If you snapshot/clone as you go, then yes, dedup will do little for you
because you'll already have done the deduplication via snapshots and
clones.  But dedup will give you that benefit even if you don't
snapshot/clone all your data.  Not all data can be managed
hierarchically, with a single dataset at the root of a history tree.

For example, suppose you want to create two VirtualBox VMs running the
same guest OS, sharing as much on-disk storage as possible.  Before
dedup you had to: create one VM, then snapshot and clone that VM's VDI
files, use an undocumented command to change the UUID in the clones,
import them into VirtualBox, and setup the cloned VM using the cloned
VDI files.  (I know because that's how I manage my VMs; it's a pain,
really.)  With dedup you need only enable dedup and then install the two
VMs.
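
Something like this (pool and dataset names hypothetical):

    zfs set dedup=on tank/vms
    # install both guests under /tank/vms; their common blocks are
    # stored only once, automatically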

Clearly the dedup approach is far, far easier to use than the
snapshot/clone approach.  And since you can't always snapshot/clone...

There are many examples where snapshot/clone isn't feasible but dedup
can help.  For example: mail stores (though they can do dedup at the
application layer by using message IDs and hashes).  For example: home
directories (think of users saving documents sent via e-mail).  For
example: source code workspaces (ONNV, Xorg, Linux, whatever), where
users might not think ahead to snapshot/clone a local clone (I also tend
to maintain a local SCM clone that I then snapshot/clone to get
workspaces for bug fixes and projects; it's a pain, really).  I'm sure
there are many, many other examples.

The workspace example is particularly interesting: with the
snapshot/clone approach you get to deduplicate the _source code_, but
not the _object code_, while with dedup you get both dedup'ed
automatically.

As for compression, that helps whether you dedup or not, and it helps by
about the same factor either way -- dedup and compression are unrelated,
really.

Nico
-- 


Re: [zfs-discuss] dedupe is in

2009-11-02 Thread Nicolas Williams
On Mon, Nov 02, 2009 at 12:58:32PM -0500, Dennis Clarke wrote:
> Looking at FIPS-180-3 in sections 4.1.2 and 4.1.3 I was thinking that the
> major leap from SHA256 to SHA512 was a 32-bit to 64-bit step.

ZFS doesn't have enough room in blkptr_t for 512-bit hashes.

Nico
-- 


Re: [zfs-discuss] zfs inotify?

2009-10-26 Thread Nicolas Williams
On Mon, Oct 26, 2009 at 08:53:50PM -0700, Anil wrote:
> I haven't tried this, but this must be very easy with dtrace. How come
> no one mentioned it yet? :) You would have to monitor some specific
> syscalls...

DTrace is not reliable in this sense: it will drop events rather than
overburden the system.  Also, system calls are not the only thing you
want to watch for -- you should really trace the VFS/fop rather than
syscalls for this.  In any case, port_create(3C) and gamin are the way
forward.

port_create(3C) is rather easy to use.  Searching the web for
PORT_SOURCE_FILE you'll find useful docs like:

http://blogs.sun.com/praks/entry/file_events_notification

which has example code too.

I do think it'd be useful to have a command-line utility in core Solaris
that uses this facility, something like the example in Prakash's blog
(which, incidentally, _works_), but perhaps a bit more complete.

Nico
-- 


Re: [zfs-discuss] Can't rm file when "No space left on device"...

2009-10-01 Thread Nicolas Williams
On Thu, Oct 01, 2009 at 11:03:06AM -0700, Rudolf Potucek wrote:
> Hmm ... I understand this is a bug, but only in the sense that the
> message is not sufficiently descriptive. Removing the file from the
> source filesystem will not necessarily free any space because the
> blocks have to be retained in the snapshots. The same problem exists
> for zeroing the file with >file as suggested earlier.
> 
> It seems like the appropriate solution would be to have a tool that
> allows removing a file from one or more snapshots at the same time as
> removing the source ... 

That would make them not really snapshots.  And such a tool would have
to "fix" clones too.

Snapshot and clones are great.  They are also great ways to consume too
much space.  One must do some spring cleaning once in a while.
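
For example (snapshot names hypothetical), find the snapshots holding
the most space and destroy the ones you no longer need:

    zfs list -t snapshot -o name,used,referenced -s used
    zfs destroy data/shared@2009-06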

Nico
-- 


Re: [zfs-discuss] zfs compression algorithm : jpeg ??

2009-09-04 Thread Nicolas Williams
On Fri, Sep 04, 2009 at 01:41:15PM -0700, Richard Elling wrote:
> On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote:
> >We have groups generating terabytes a day of image data  from lab  
> >instruments and saving them to an X4500.
> 
> Wouldn't it be easier to compress at the application, or between the
> application and the archiving file system?

Especially when it comes to reading the images back!

ZFS compression is transparent.  You can't write uncompressed data then
read back compressed data.  And compression is at the block level, not
for the whole file, so even if you could read it back compressed, it
wouldn't be in a useful format.

Most people want to transfer data compressed, particularly images.  So
compressing at the application level in this case seems best to me.
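
E.g., something as simple as compressing in the capture pipeline, so
that the compressed form is what gets stored and shipped (file and host
names hypothetical):

    gzip -1 scan-20090904.raw
    scp scan-20090904.raw.gz archive:/incoming/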

Nico
-- 


Re: [zfs-discuss] utf8only and normalization properties

2009-08-27 Thread Nicolas Williams
So, the manpage seems to have a bug in it.  The valid values for the
normalization property are:

none | formC | formD | formKC | formKD

Nico
-- 


Re: [zfs-discuss] *Almost* empty ZFS filesystem - 14GB?

2009-08-21 Thread Nicolas Williams
On Fri, Aug 21, 2009 at 06:46:32AM -0700, Chris Murray wrote:
> Nico, what is a zero-link file, and how would I go about finding
> whether I have one? You'll have to bear with me, I'm afraid, as I'm
> still building my Solaris knowledge at the minute - I was brought up
> on Windows. I use Solaris for my storage needs now though, and slowly
> improving on my knowledge so I can move away from Windows one day  :)

I see that Mark S. thinks this may be a specific ZFS bug, and there's a
followup with instructions on how to detect if that's the case.

However, it can also be a zero-link file.  I've certainly run into that
problem before myself, on UFS and other filesystems.

A zero-link file is a file that has been removed (unlink(2)ed), but
which remains open in some process(es).  Such a file continues to
consume space until the processes that have it open are killed.

Typically you'd use pfiles(1) or lsof to find such files.
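
If you have lsof installed, the usual idiom is (untested here):

    lsof +L1    # open files with a link count below 1, i.e. unlinked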

> If it makes any difference, the problem persists after a full reboot,

Yeah, if you rebooted and there's no 14GB .nfs* files, then this is not
a zero-link file.  See the followups.

Nico
-- 


Re: [zfs-discuss] zfs send speed

2009-08-18 Thread Nicolas Williams
On Tue, Aug 18, 2009 at 04:22:19PM -0400, Paul Kraus wrote:
>        We have a system with some large datasets (3.3 TB and about 35
> million files) and conventional backups take a long time (using
> Netbackup 6.5 a FULL takes between two and three days, differential
> incrementals, even with very few files changing, take between 15 and
> 20 hours). We already use snapshots for day to day restores, but we
> need the 'real' backups for DR.

zfs send will be very fast for "differential incrementals ... with very
few files changing" since zfs send is a block-level diff based on the
differences between the selected snapshots.  Where a traditional backup
tool would have to traverse the entire filesystem (modulo pruning based
on ctime/mtime), zfs send simply traverses a list of changed blocks
that's kept up by ZFS as you make changes in the first place.
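
An incremental send between two daily snapshots looks like this
(dataset, snapshot, and host names hypothetical):

    zfs snapshot data/fs@tue
    zfs send -i data/fs@mon data/fs@tue | ssh drhost zfs recv -F backup/fs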

For a *full* backup zfs send and traditional backup tools will have
similar results as both will be I/O bound and both will have more or
less the same number of I/Os to do.

Caveat: zfs send formats are not guaranteed to be backwards
compatible, therefore zfs send is not suitable for long-term backups.

Nico
-- 


Re: [zfs-discuss] *Almost* empty ZFS filesystem - 14GB?

2009-08-18 Thread Nicolas Williams
Perhaps an open 14GB, zero-link file?


Re: [zfs-discuss] utf8only and normalization properties

2009-08-13 Thread Nicolas Williams
On Thu, Aug 13, 2009 at 05:57:57PM -0500, Haudy Kazemi wrote:
> >Therefore, if you need to interoperate with MacOS X then you should
> >enable the normalization feature.
> >  
> Thank you for the reply. My goal is to configure the filesystem for the 
> lowest common denominator without knowing up front which clients will be 
> used. OS X and Win XP are listed because they are commonly used as 
> desktop OSes.  Ubuntu Linux is a third potential desktop OS.

Right, so set normalization=formD.

> The normalization property documentation says "this property indicates 
> whether a file system should perform a unicode normalization of file 
> names whenever two file names are compared.  File names are always 
> stored unmodified, names are normalized as part of any comparison 
> process."  Where does the file system use filename comparisons and what 
> does it use them for?  Filename collision checking?  Sorting?

The system does filename comparisons when doing lookups
(open("/foo/bar/baz", ...) does at least three such lookups, for
example), and on create (since that involves a lookup).

Yes, this is about collisions.  Consider a file named "á" (that's "a"
with an acute accent).  There are _two_ possible encodings for that name
in UTF-8.  That means that you could have two files in the same
directory and with the same name, though they'd have different names if
you looked at the bytes that make up the names.  That would be
confusing, at the very least.

To avoid such collisions you can enable normalization.
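
For example (dataset name hypothetical; both properties can only be set
at creation time):

    zfs create -o utf8only=on -o normalization=formD tank/export

With that in place the two encodings below name the same file, so a
directory listing shows a single entry (bash/ksh93 $'...' quoting):

    touch $'\xc3\xa1'      # U+00E1, precomposed a-acute
    touch $'a\xcc\x81'     # 'a' plus U+0301 combining acute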

You can find more here:

http://blogs.sun.com/nico/entry/filesystem_i18n

> Is it used for any other operation, say when returning a filename to an 
> application?  Would applications reading/writing files to a ZFS 

No, directory listings always return the filename used when the file
name was created, without any normalization.

> filesystem ever notice the difference in normalization settings as long 
> as they produce filenames that do not conflict with existing names or 
> create invalid UTF8?  The documentation says filenames are stored 
> unmodified, which sounds like things should be transparent to applications.

Applications shouldn't notice normalization being enabled.  The only
reasons to disable normalization are: a) you don't want to force the use
of UTF-8, or b) you consistently use a single normalization form and you
don't want to pay a penalty for normalizing on lookup.

(b) is probably not a problem -- the normalization code is fast if you
use all US-ASCII strings, and it's linear with the number of non-ASCII,
Unicode codepoints in file names.  But I don't have performance numbers
to share.  I think that normalization should be enabled by default if
you enable utf8only, and utf8only should probably be enabled by default
in Solaris, but that's just my personal opinion.

> (In regard to filename collision checking, if non-normalized unmodified 
> filenames are always stored on disk, and they don't conflict in 
> non-normalized form, what would the point be of normalizing the 
> filenames for a comparison?  To verify there isn't conflict in 
> normalized forms, and if there is no conflict with an existing file to 
> allow the filename to be written unmodified?)

Yes.

> The ZFS documentation doesn't list the valid values for the 
> normalization property other than 'none'.  From your reply and from the

The zfs(1M) manpage lists them:

 normalization = none | formD | formKC

That's not all existing Unicode normalization forms, no.  The reason for
this is that we only normalize on lookup (the file names returned by
readdir are not normalized), and for that the forms C and D are
semantically equivalent, but K and non-K forms are not semantically
equivalent, so we need one K form and one non-K form.  NFD is faster
than NFC, but the K forms require a trip through form C, so NFKC is
faster than NFKD (at least if I remember correctly).  Which means that
NFD and NFKC were sufficient, and there's no reason to ever want NFC or
NFKD.

> suggest they be added to the documentation at
> http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html

Yes, that's a good point.

PS:  ZFS directories are hashed.  When normalization is enabled, the
 hash keys are normalized on create, but the hash contents are not,
 so filenames remain unnormalized.


Re: [zfs-discuss] utf8only and normalization properties

2009-08-12 Thread Nicolas Williams
On Wed, Aug 12, 2009 at 06:17:44PM -0500, Haudy Kazemi wrote:
> I'm wondering what are some use cases for ZFS's utf8only and 
> normalization properties.  They are off/none by default, and can only be 
> set when the filesystem is created.  When should they specifically be 
> enabled and/or disabled?  (i.e. Where is using them a really good idea?  
> Where is using them a really bad idea?)

These are for interoperability.

The world is converging on Unicode for filesystem object naming.  If you
want to exclude non-Unicode strings then you should set utf8only (some
non-Unicode strings in some codesets can look like valid UTF-8 though).

But Unicode has multiple canonical and non-canonical ways of
representing certain characters (e.g., á).  Solaris and Windows
input methods tend to conform to NFKC, so they will interop even if you
don't enable the normalization feature.  But MacOS X normalizes to NFD.

Therefore, if you need to interoperate with MacOS X then you should
enable the normalization feature.

> Looking forward, starting with Windows XP and OS X 10.5 clients, is 
> there any reason to change the defaults in order to minimize problems?

You should definitely enable normalization (see above).

It doesn't matter what normalization form you use, but "formD" runs
faster than "formC".

The normalization feature doesn't cost much if you use all US-ASCII file
names.  And it doesn't cost much if your file names are mostly US-ASCII.

Nico
-- 


Re: [zfs-discuss] feature proposal

2009-07-29 Thread Nicolas Williams
On Wed, Jul 29, 2009 at 03:35:06PM +0100, Darren J Moffat wrote:
> Andriy Gapon wrote:
> >What do you think about the following feature?
> >
> >"Subdirectory is automatically a new filesystem" property - an 
> >administrator turns
> >on this magic property of a filesystem, after that every mkdir *in the 
> >root* of
> >that filesystem creates a new filesystem. The new filesystems have
> >default/inherited properties except for the magic property which is off.
> 
> This has been brought up before and I thought there was an open CR for 
> it but I can't find it.

I'd want this to be something one could set per-directory, and I'd want
it to not be inheritable (or to have control over whether it is
inheritable).

Nico
-- 


Re: [zfs-discuss] The importance of ECC RAM for ZFS

2009-07-24 Thread Nicolas Williams
On Fri, Jul 24, 2009 at 05:01:15PM +0200, dick hoogendijk wrote:
> On Fri, 24 Jul 2009 10:44:36 -0400
> Kyle McDonald  wrote:
> > ... then it seems  like a shame (or a waste?)  not to equally
> > protect the data both before it's given to ZFS for writing, and after
> > ZFS reads it back and returns it to you.
> 
> But that was not the question.
> The question was: [quote] "My question is: is there any technical
> reason, in ZFS's design, that makes it particularly important for ZFS
> to require ECC RAM?"

The only thing I can think of is this: if a cosmic ray flips a bit in
memory holding a ZFS transaction that's already had all its checksums
computed, but hasn't hit disk yet, then you'll have a checksum
verification failure later when you read back the affected file (or
directory).  Using ECC memory avoids that.  You still have the processor
to worry about though.

Nico
-- 


Re: [zfs-discuss] virtualization, alignment and zfs variation stripes

2009-07-22 Thread Nicolas Williams
On Wed, Jul 22, 2009 at 02:45:52PM -0500, Bob Friesenhahn wrote:
> On Wed, 22 Jul 2009, t. johnson wrote:
> >Lets say I have a simple-ish setup that uses vmware files for 
> >virtual disks on an NFS share from zfs. I'm wondering how zfs' 
> >variable block size comes into play? Does it make the alignment 
> >problem go away? Does it make it worse? Or should we perhaps be
> 
> My understanding is that zfs uses fixed block sizes except for the 
> tail block of a file, or if the filesystem has compression enabled.

For one block files, the block is variable, between 512 bytes and the
smaller of the dataset's recordsize or 128KB.  For multi-block files all
blocks are the same size, except the tail block.  But these are sizes in
file data, not actual on-disk sizes (which can be less because of
compression).

> Zfs's large blocks can definitely cause performance problems if the 
> system has insufficient memory to cache the blocks which are accessed, 
> or only part of the block is updated.

You should set the virtual disk image files' recordsize (or, rather, the
containing dataset's recordsize) to match the preferred block size of
the filesystem types (or data) that you'll put on those virtual disks.

Nico
-- 


Re: [zfs-discuss] SSDs get faster and less expensive

2009-07-21 Thread Nicolas Williams
On Tue, Jul 21, 2009 at 02:45:57PM -0700, Richard Elling wrote:
> But to put this in perspective, you would have to *delete* 20 GBytes

Or overwrite (since the overwrites turn into COW writes of new blocks
and the old blocks are released if not referred to from snapshot).

> of data a day on a ZFS file system for 5 years (according to Intel) to
> reach the expected endurance.  I don't know many people who delete
> that much data continuously (I suspect that the satellite data vendors
> might in their staging servers... not exactly a market for SSDs)

Don't forget atime updates.  If you just read, you're still writing.

Of course, the writes from atime updates will generally be far fewer
than the data blocks read, so you might have to read many times that
amount of data to get the same effect.

(Speaking of atime updates, I run my root datasets with atime updates
disabled.  I don't have hard data, but it stands to reason that things
can go fast that way.  I also mount filesystems in VMs with atime
disabled.)
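
E.g. (dataset name hypothetical; only future reads stop generating
writes, existing data is untouched):

    zfs set atime=off rpool/export/home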

Yes, I'm picking nits; sorry.

Nico
-- 


Re: [zfs-discuss] APPLE: ZFS need bug corrections instead of new func! Or?

2009-06-19 Thread Nicolas Williams
On Fri, Jun 19, 2009 at 04:09:29PM -0400, Miles Nordin wrote:
> Also, as I said elsewhere, there's a barrier controlled by Sun to
> getting bugs accepted.  This is a useful barrier: the bug database is
> a more useful drive toward improvement if it's not cluttered.  It also
> means, like I said, sometimes the mailing list is a more useful place
> for information.

There's two bug databases, sadly.  bugs.opensolaris.org is like you
describe, whereas defect.opensolaris.org is not.

Nico
-- 


Re: [zfs-discuss] Zfs send speed. Was: User quota design discussion..

2009-05-22 Thread Nicolas Williams
On Fri, May 22, 2009 at 04:40:43PM -0600, Eric D. Mudama wrote:
> As another datapoint, the 111a opensolaris preview got me ~29MB/s
> through an SSH tunnel with no tuning on a 40GB dataset.
> 
> Sender was a Core2Duo E4500 reading from SSDs and receiver was a Xeon
> E5520 writing to a few mirrored 7200RPM SATA vdevs in a single pool.
> Network was a $35 8-port gigabit netgear switch.

Unfortunately the SunSSH doesn't know how to grow SSHv2 channel windows
to take full advantage of the TCP BDP (bandwidth-delay product), so you
could probably have gone
faster.

