Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Nico Williams
IIRC dump is special.

As for swap... really, you don't want to swap.  If you're swapping you
have problems.  Any swap space you have is to help you detect those
problems and correct them before apps start getting ENOMEM.  There
*are* exceptions to this, such as Varnish.  For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no
nothing.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Nico Williams
Bloom filters are very small, that's the difference.  You might only need a
few bits per block for a Bloom filter.  Compare to the size of a DDT entry.
 A Bloom filter could be cached entirely in main memory.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-19 Thread Nico Williams
I've wanted a system where dedup applies only to blocks being written
that have a good chance of being dups of others.

I think one way to do this would be to keep a scalable Bloom filter
(on disk) into which one inserts block hashes.

To decide if a block needs dedup one would first check the Bloom
filter, then if the block is in it, use the dedup code path, else the
non-dedup codepath and insert the block in the Bloom filter.  This
means that the filesystem would store *two* copies of any
deduplicatious block, with one of those not being in the DDT.
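
A minimal sketch of that write-path decision, assuming a plain
in-memory bit-array filter keyed off the block's existing checksum.
All names here are made up for illustration, not ZFS code, and a
scalable on-disk filter would obviously need more machinery (growth,
persistence):

    /* Sketch of the proposed write path: not ZFS code, just the idea.
     * The k "hash functions" are derived by slicing the block's
     * 256-bit checksum. */
    #include <stdint.h>
    #include <string.h>

    #define BLOOM_BITS  (1ULL << 27)        /* 16 MB of bits, for example */
    #define BLOOM_K     4                   /* bit positions per block    */

    static uint8_t bloom[BLOOM_BITS / 8];

    /* Derive the i-th bit index from the block checksum (e.g. SHA-256). */
    static uint64_t
    bloom_index(const uint8_t cksum[32], int i)
    {
        uint64_t h;
        memcpy(&h, cksum + 8 * i, sizeof (h));
        return (h % BLOOM_BITS);
    }

    static int
    bloom_test_and_set(const uint8_t cksum[32])
    {
        int present = 1;
        for (int i = 0; i < BLOOM_K; i++) {
            uint64_t b = bloom_index(cksum, i);
            if (!(bloom[b / 8] & (1 << (b % 8))))
                present = 0;
            bloom[b / 8] |= (1 << (b % 8));
        }
        return (present);   /* 1 = "maybe seen before", 0 = definitely new */
    }

    /* Write path: only blocks the filter has (probably) seen go through
     * the DDT; first-time blocks take the cheap non-dedup path. */
    void
    write_block(const uint8_t cksum[32] /* , block data ... */)
    {
        if (bloom_test_and_set(cksum)) {
            /* dedup code path: DDT lookup/insert */
        } else {
            /* ordinary non-dedup write, then the block is in the filter */
        }
    }

A false positive only costs an unnecessary DDT lookup; false negatives
cannot happen, so a dedup opportunity is never lost beyond the first
copy, exactly as described above.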

This would allow most writes of non-duplicate blocks to be faster than
normal dedup writes, but still slower than normal non-dedup writes:
the Bloom filter will add some cost.

The nice thing about this is that Bloom filters can be sized to fit in
main memory, and will be much smaller than the DDT.

It's very likely that this is a bit too obvious to just work.

Of course, it is easier to just use flash.  It's also easier to just
not dedup: the most highly deduplicatious data (VM images) is
relatively easy to manage using clones and snapshots, to a point
anyways.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 11 System Reboots Continuously Because of a ZFS-Related Panic (7191375)

2013-01-14 Thread Nico Williams
On Mon, Jan 14, 2013 at 1:48 PM, Tomas Forsman  wrote:
>> https://bug.oraclecorp.com/pls/bug/webbug_print.show?c_rptno=15852599
>
> Host oraclecorp.com not found: 3(NXDOMAIN)
>
> Would oracle.internal be a better domain name?

Things like that cannot be changed easily.  They (Oracle) are stuck
with that domainname for the foreseeable future.  Also, whoever thought
it up probably didn't consider leakage of internal URIs to the
outside.  *shrug*
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?

2012-07-30 Thread Nico Williams
The copies thing is really only for laptops, where the likelihood of
redundancy is very low (there are some high-end laptops with multiple
drives, but those are relatively rare) and where this idea is better
than nothing.  It's also nice that copies can be set on a per-dataset
manner (whereas RAID-Zn and mirroring are for pool-wide redundancy,
not per-dataset), so you could set it > 1 on home directories but not
/.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Nico Williams
You can treat any given hash function as an idealized one, but actual
hash functions aren't.  There may well be as-yet-undiscovered input
bit pattern ranges where there's a large density of collisions in some
hash function, and indeed, since our hash functions aren't ideal,
there must be.  We just don't know where these potential collisions
are -- for cryptographically secure hash functions that's enough (plus
2nd pre-image and 1st pre-image resistance, but allow me to handwave),
but for dedup?  *shudder*.

Now, for some content types collisions may not be a problem at all.
Think of security camera recordings: collisions will show up as bad
frames in a video stream that no one is ever going to look at, and if
they should need it, well, too bad.

And for other content types collisions can be horrible.  Us ZFS lovers
love to talk about how silent bit rot means you may never know about
serious corruption in other filesystems until it's too late.  Now, if
you disable verification in dedup, what do you get?  The same
situation as other filesystems are in relative to bit rot, only with
different likelihoods.

Disabling verification is something to do after careful deliberation,
not something to do by default.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Nico Williams
On Wed, Jul 11, 2012 at 3:45 AM, Sašo Kiselkov  wrote:
> It's also possible to set "dedup=verify" with "checksum=sha256",
> however, that makes little sense (as the chances of getting a random
> hash collision are essentially nil).

IMO dedup should always verify.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Nico Williams
On Wed, Jul 11, 2012 at 9:48 AM,   wrote:
>>Huge space, but still finite…
>
> Dan Brown seems to think so in "Digital Fortress" but it just means he
> has no grasp on "big numbers".

I couldn't get past that.  I had to put the book down.  I'm guessing
it was as awful as it threatened to be.

IMO, FWIW, yes, do add SHA-512 (truncated to 256 bits, of course).
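
To put "big numbers" in perspective, a back-of-the-envelope
birthday-bound estimate (my arithmetic, purely illustrative): with a
b-bit hash and n unique blocks, the chance of any random collision is
roughly n^2 / 2^(b+1).

    /* Back-of-the-envelope birthday bound: P(collision) ~= n^2 / 2^(b+1)
     * for n uniformly distributed b-bit hashes.  Illustrative only. */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double bits = 256.0;                    /* truncated SHA-512 */
        double n[] = { 1e9, 1e12, 1e15 };       /* unique blocks stored */

        for (int i = 0; i < 3; i++) {
            double log2p = 2.0 * log2(n[i]) - (bits + 1.0);
            printf("n = %.0e blocks -> P(collision) ~= 2^%.0f\n",
                n[i], log2p);
        }
        return (0);
    }

Even at 10^15 unique blocks the odds come out around 2^-157, which is
the sense in which the space is huge but still finite.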

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files

2012-07-04 Thread Nico Williams
On Wed, Jul 4, 2012 at 11:14 AM, Bob Friesenhahn
 wrote:
> On Tue, 3 Jul 2012, James Litchfield wrote:
>> Agreed - msync/munmap is the only guarantee.
>
> I don't see that the munmap definition assures that anything is written to
> "disk".  The system is free to buffer the data in RAM as long as it likes
> without writing anything at all.

Oddly enough the manpages at the Open Group don't make this clear.  So
I think it may well be advisable to use msync(3C) before munmap() on
MAP_SHARED mappings.  However, I think all implementors should, and
probably all do (Linux even documents that it does) have an implied
msync(2) when doing a munmap(2).  It really makes no sense at all to
have munmap(2) not imply msync(3C).

(That's another thing, I don't see where the standard requires that
munmap(2) be synchronous.  I think it'd be nice to have an mmap(2)
option for requesting whether munmap(2) of the same mapping be
synchronous or asynchronous.  Async munmap(2) -> no need to issue
cross-calls, instead allowing the mapping to be torn down over time.
Doing a synchronous msync(3C), then a munmap(2) is a recipe for going
real slow, but if munmap(2) does not portably guarantee an implied
msync(3C), then would it be safe to do an async msync(2) then
munmap(2)??)
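
To make the conservative recommendation concrete, a minimal sketch of
the explicit msync(3C)-with-MS_SYNC-before-munmap(2) pattern for a
MAP_SHARED mapping (illustrative only, error handling trimmed):

    /* Conservative pattern: flush a MAP_SHARED mapping explicitly before
     * unmapping, rather than relying on munmap() to imply msync(). */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    update_mapped_file(const char *path, const char *msg)
    {
        int fd = open(path, O_RDWR);
        if (fd == -1)
            return (-1);

        struct stat st;
        if (fstat(fd, &st) == -1 || st.st_size == 0) {
            (void) close(fd);
            return (-1);
        }

        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            (void) close(fd);
            return (-1);
        }

        /* Dirty some pages through the mapping. */
        (void) strncpy(p, msg, st.st_size);

        /* Synchronously push the dirty pages out before tearing the
         * mapping down; portable code should not assume munmap() alone
         * does this. */
        int rc = msync(p, st.st_size, MS_SYNC);

        (void) munmap(p, st.st_size);
        (void) close(fd);
        return (rc);
    }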

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files

2012-07-03 Thread Nico Williams
On Tue, Jul 3, 2012 at 9:48 AM, James Litchfield
 wrote:
> On 07/02/12 15:00, Nico Williams wrote:
>> You can't count on any writes to mmap(2)ed files hitting disk until
>> you msync(2) with MS_SYNC.  The system should want to wait as long as
>> possible before committing any mmap(2)ed file writes to disk.
>> Conversely you can't expect that no writes will hit disk until you
>> msync(2) or munmap(2).
>
> Driven by fsflush which will scan memory (in chunks) looking for dirty,
> unlocked, non-kernel pages to flush to disk.

Right, but one just cannot count on that -- it's not part of the API
specification.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files

2012-07-02 Thread Nico Williams
On Mon, Jul 2, 2012 at 3:32 PM, Bob Friesenhahn
 wrote:
> On Mon, 2 Jul 2012, Iwan Aucamp wrote:
>> I'm interested in some more detail on how ZFS intent log behaves for
>> updates done via a memory mapped file - i.e. will the ZIL log updates done
>> to an mmap'd file or not ?
>
>
> I would expect these writes to go into the intent log unless msync(2) is
> used on the mapping with the MS_SYNC option.

You can't count on any writes to mmap(2)ed files hitting disk until
you msync(2) with MS_SYNC.  The system should want to wait as long as
possible before committing any mmap(2)ed file writes to disk.
Conversely you can't expect that no writes will hit disk until you
msync(2) or munmap(2).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?

2012-06-26 Thread Nico Williams
On Tue, Jun 26, 2012 at 8:12 AM, Lionel Cons
 wrote:
> On 26 June 2012 14:51,   wrote:
> We've already asked our Netapp representative. She said it's not hard
> to add that.

Did NetApp tell you that they'll add support for using the NFSv4 LINK
operation on source objects that are directories?!  I'd be extremely
surprised!  Or did they only tell you that they'll add support for
using the NFSv4 REMOVE operation on non-empty directories?  The latter
is definitely feasible (although it could fail due to share deny OPENs
of files below, say, but hey).  The former is... not sane.

>> I'd suggest whether you can restructure your code and work without this.
>
> It would require touching code for which we don't have sources anymore
> (people gone, too). It would also require to create hard links to the
> results files directly, which means linking 15000+ files per directory
> with a minimum of 3 directories. Each day (this is CERN after
> all).

Oh, I see.  But you still don't want hardlinks to directories!
Instead you might be able to use LD_PRELOAD to emulate the behavior
that the application wants.  The app is probably implementing
rename(), so just detect the sequence and map it to an actual
rename(2).
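
As a sketch of that LD_PRELOAD idea (purely hypothetical and greatly
simplified: one pending pair, no locking, no error handling), assuming
the app's "rename" shows up as link(olddir, newdir) followed by
unlink(olddir):

    /* Interposer that turns the application's link(dir, new) + unlink(dir)
     * sequence into a real rename(2).  Sketch only. */
    #define _GNU_SOURCE     /* RTLD_NEXT on Linux; Solaris has it by default */
    #include <dlfcn.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static char pending_old[PATH_MAX];
    static char pending_new[PATH_MAX];

    static int
    is_dir(const char *path)
    {
        struct stat st;
        return (stat(path, &st) == 0 && S_ISDIR(st.st_mode));
    }

    int
    link(const char *oldpath, const char *newpath)
    {
        static int (*real_link)(const char *, const char *);

        if (real_link == NULL)
            real_link = (int (*)(const char *, const char *))
                dlsym(RTLD_NEXT, "link");

        if (is_dir(oldpath)) {
            /* Don't hardlink a directory; remember the pair and wait
             * for the matching unlink of the source. */
            (void) snprintf(pending_old, sizeof (pending_old), "%s", oldpath);
            (void) snprintf(pending_new, sizeof (pending_new), "%s", newpath);
            return (0);
        }
        return (real_link(oldpath, newpath));
    }

    int
    unlink(const char *path)
    {
        static int (*real_unlink)(const char *);

        if (real_unlink == NULL)
            real_unlink = (int (*)(const char *))dlsym(RTLD_NEXT, "unlink");

        if (pending_old[0] != '\0' && strcmp(path, pending_old) == 0) {
            pending_old[0] = '\0';
            return (rename(path, pending_new));  /* what the app meant */
        }
        return (real_unlink(path));
    }

Built as a shared object (something like cc -shared -fPIC -o
interpose.so interpose.c -ldl) and injected with LD_PRELOAD; a real
interposer would need to handle threads, multiple pending pairs, and
the app's error-checking between the two calls.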

> The other way around would be to throw the SPARC machines away and go
> with Netapp.

So Solaris is just a fileserver here?

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [developer] Re: History of EPERM for unlink() of directories on ZFS?

2012-06-26 Thread Nico Williams
On Tue, Jun 26, 2012 at 9:44 AM, Alan Coopersmith
 wrote:
> On 06/26/12 05:46 AM, Lionel Cons wrote:
>> On 25 June 2012 11:33,   wrote:
>>> To be honest, I think we should also remove this from all other
>>> filesystems and I think ZFS was created this way because all modern
>>> filesystems do it that way.
>>
>> This may be wrong way to go if it breaks existing applications which
>> rely on this feature. It does break applications in our case.
>
> Existing applications rely on the ability to corrupt UFS filesystems?
> Sounds horrible.

My guess is that the OP just wants unlink() of an empty directory to
be the same as rmdir() of the same.  Or perhaps they want unlink() of
a non-empty directory to result in a recursive rm...  But if they
really want hardlinks to directories, then yeah, that's horrible.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is there an actual newsgroup for zfs-discuss?

2012-06-11 Thread Nico Williams
On Mon, Jun 11, 2012 at 5:05 PM, Tomas Forsman  wrote:
> .. or use a mail reader that doesn't suck.

Or the mailman thread view.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Terminology question on ZFS COW

2012-06-05 Thread Nico Williams
COW goes back at least to the early days of virtual memory and fork().
 On fork() the kernel would arrange for writable pages in the parent
process to be made read-only so that writes to them could be caught
and then the page fault handler would copy the page (and restore write
access) so the parent and child each have their own private copies.
COW as used in ZFS is not the same, but the concept was introduced
very early also, IIRC in the mid-80s -- certainly no later than
4.4BSD's log-structured filesystem (which ZFS resembles in many ways).
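
A tiny demo of the fork()-era meaning of the term (illustrative only):
after fork() the pages are shared, and the physical copy is deferred
until one side writes, at which point each process ends up with its
own private page.

    /* Demonstrates copy-on-write at fork(): parent and child start out
     * sharing pages; the child's write lands in a private copy, so the
     * parent still sees the original contents. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        char *page = malloc(4096);
        (void) strcpy(page, "original");

        pid_t pid = fork();
        if (pid == 0) {                      /* child */
            (void) strcpy(page, "child's copy");
            printf("child  sees: %s\n", page);
            _exit(0);
        }
        (void) waitpid(pid, NULL, 0);
        printf("parent sees: %s\n", page);   /* still "original" */
        free(page);
        return (0);
    }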

So, is COW a misnomer?  Yes and no, and anyways, it's irrelevant.  The
important thing is that when you say COW people understand that you're
not saving a copy of the old thing but rather writing the new thing to
a new location.  (The old version of whatever was copied-on-write is
stranded, unless -of course- you have references left to it from
things like snapshots.)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] current status of SAM-QFS?

2012-05-03 Thread Nico Williams
On Wed, May 2, 2012 at 7:59 AM, Paul Kraus  wrote:
> On Wed, May 2, 2012 at 7:46 AM, Darren J Moffat  
> wrote:
> If Oracle is only willing to share (public) information about the
> roadmap for products via official sales channels then there will be
> lots of FUD in the market. Now, as to sharing futures and NDA
> material, that _should_ only be available via direct Oracle channels
> (as it was under Sun as well).

Sun was tight lipped too, yes, but information leaked through the open
or semi-open software development practices in Solaris.  If you saw
some feature pushed to some gate you had no guarantee that it would
remain there or be supported, but you had a pretty good inkling as to
whether the engineers working on it intended it to remain there.

If you can't get something out of your rep, you might try reading the
tea leaves (sketchy business).  But ultimately you need to be prepared
for any product's EOL.  You can expect some amount of warning time
about EOLs, but legacy has a way of sticking around, so write a plan for
how to migrate data and where to, then put the plan in a drawer
somewhere (and update it as necessary).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Nico Williams
On Thu, Apr 26, 2012 at 12:37 PM, Richard Elling
 wrote:
> [...]

NFSv4 had migration in the protocol (excluding protocols between
servers) from the get-go, but it was missing a lot (FedFS) and was not
implemented until recently.  I've no idea what clients and servers
support it adequately besides Solaris 11, though that's just my fault
(not being informed).  It's taken over a decade to get to where we
have any implementations of NFSv4 migration.

>> For me one of the exciting things about Lustre was/is the idea that
>> you could just have a single volume where all new data (and metadata)
>> is distributed evenly as you go.  Need more storage?  Plug it in,
>> either to an existing head or via a new head, then flip a switch and
>> there it is.  No need to manage allocation.  Migration may still be
>> needed, both within a cluster and between clusters, but that's much
>> more manageable when you have a protocol where data locations can be
>> all over the place in a completely transparent manner.
>
>
> Many distributed file systems do this, at the cost of being not quite
> POSIX-ish.

Well, Lustre does POSIX semantics just fine, including cache coherency
(as opposed to NFS' close-to-open coherency, which is decidedly
non-POSIX).

> In the brave new world of storage vmotion, nosql, and distributed object
> stores,
> it is not clear to me that coding to a POSIX file system is a strong
> requirement.

Well, I don't quite agree.  I'm very suspicious of
eventually-consistent.  I'm not saying that the enormous DBs that eBay
and such run should sport SQL and ACID semantics -- I'm saying that I
think we can do much better than eventually-consistent (and
no-language) while not paying the steep price that ACID requires.  I'm
not alone in this either.

The trick is to find the right compromise.  Close-to-open semantics
works out fine for NFS, but O_APPEND is too wonderful not to have
(ditto O_EXCL, which NFSv2 did not have; v4 has O_EXCL, but not
O_APPEND).

Whoever first delivers the right compromise in distributed DB
semantics stands to make a fortune.

> Perhaps people are so tainted by experiences with v2 and v3 that we can
> explain
> the non-migration to v4 as being due to poor marketing? As a leader of NFS,
> Sun
> had unimpressive marketing.

Sun did not do too much to improve NFS in the 90s, not compared to the
v4 work that only really started paying off too recently.  And since
Sun had lost the client space by then, having the best server doesn't
mean all that much if the clients aren't able to take advantage of the
server's best features for lack of client implementation.  Basically,
Sun's ZFS, DTrace, SMF, NFSv4, Zones, and
other amazing innovations came a few years too late to make up for the
awful management that Sun was saddled with.  But for all the decidedly
awful things Sun management did (or didn't do), the worst was
terminating Sun PS (yes, worse than all the non-marketing, poor
marketing, poor acquisitions, poor strategy, and all the rest
including truly epic mistakes like icing Solaris on x86 a decade ago).
 One of the worst outcomes of the Sun debacle is that now there's a
bevy of senior execs who think the worst thing Sun did was to open
source Solaris and Java -- which isn't to say that Sun should have
open sourced as much as it did, or that open source is an end in
itself, but that open sourcing these things was a legitimate business
tool with very specific goals in mind in each case, and which had
nothing to do with the sinking of the company.  Or maybe that's one of
the best outcomes, because the good news about it is that those who
learn the right lessons (in that case: that open source is a
legitimate business tool that is sometimes, often even, a great
mind-share building tool) will be in the minority, and thus will have
a huge advantage over their competition.  That's another thing Sun did
not learn until it was too late: mind-share matters enormously to a
software company.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Nico Williams
On Thu, Apr 26, 2012 at 5:45 PM, Carson Gaspar  wrote:
> On 4/26/12 2:17 PM, J.P. King wrote:
>> I don't know SnapMirror, so I may be mistaken, but I don't see how you
>> can have non-synchronous replication which can allow for seamless client
>> failover (in the general case). Technically this doesn't have to be
>> block based, but I've not seen anything which wasn't. Synchronous
>> replication pretty much precludes DR (again, I can think of theoretical
>> ways around this, but have never come across anything in practice).
>
> "seamless" is an over-statement, I agree. NetApp has synchronous SnapMirror
> (which is only mostly synchronous...). Worst case, clients may see a
> filesystem go backwards in time, but to a point-in-time consistent state.

Sure, if we assume apps make proper use of O_EXCL, O_APPEND,
link(2)/unlink(2)/rename(2), sync(2), fsync(2), and fdatasync(3C) and
can roll their own state back on their own.  Databases typically know
how to do that (e.g., SQLite3).  Most apps?  Doubtful.
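
The sort of discipline being assumed is the classic write-temp, fsync,
rename pattern; a sketch (hypothetical helper, error handling
abbreviated):

    /* Update a file so that a crash (or a replica rolling back to a
     * point-in-time consistent state) leaves either the old or the new
     * contents, never a torn mix. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    atomic_replace(const char *dir, const char *name,
        const char *buf, size_t len)
    {
        char tmp[1024], dst[1024];

        (void) snprintf(tmp, sizeof (tmp), "%s/.%s.tmp", dir, name);
        (void) snprintf(dst, sizeof (dst), "%s/%s", dir, name);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd == -1)
            return (-1);

        if (write(fd, buf, len) != (ssize_t)len ||
            fsync(fd) == -1) {               /* data durable before rename */
            (void) close(fd);
            (void) unlink(tmp);
            return (-1);
        }
        if (close(fd) == -1 ||
            rename(tmp, dst) == -1) {        /* atomic switch-over */
            (void) unlink(tmp);
            return (-1);
        }

        /* Make the rename itself durable by syncing the directory. */
        int dfd = open(dir, O_RDONLY);
        if (dfd != -1) {
            (void) fsync(dfd);
            (void) close(dfd);
        }
        return (0);
    }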

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs

2012-04-25 Thread Nico Williams
On Thu, Apr 26, 2012 at 12:10 AM, Richard Elling
 wrote:
> On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:
> Reboot requirement is a lame client implementation.

And lame protocol design.  You could possibly migrate read-write NFSv3
on the fly by preserving FHs and somehow updating the clients to go to
the new server (with a hiccup in between, no doubt), but only entire
shares at a time -- you could not migrate only part of a volume with
NFSv3.

Of course, having migration support in the protocol does not equate to
getting it in the implementation, but it's certainly a good step in
that direction.

> You are correct, a ZFS send/receive will result in different file handles on
> the receiver, just like
> rsync, tar, ufsdump+ufsrestore, etc.

That's understandable for NFSv2 and v3, but for v4 there's no reason
that an NFSv4 server stack and ZFS could not arrange to preserve FHs
(if, perhaps, at the price of making the v4 FHs rather large).
Although even for v3 it should be possible for servers in a cluster to
arrange to preserve devids...

Bottom line: live migration needs to be built right into the protocol.

For me one of the exciting things about Lustre was/is the idea that
you could just have a single volume where all new data (and metadata)
is distributed evenly as you go.  Need more storage?  Plug it in,
either to an existing head or via a new head, then flip a switch and
there it is.  No need to manage allocation.  Migration may still be
needed, both within a cluster and between clusters, but that's much
more manageable when you have a protocol where data locations can be
all over the place in a completely transparent manner.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 8:57 PM, Paul Kraus  wrote:
> On Wed, Apr 25, 2012 at 9:07 PM, Nico Williams  wrote:
>> Nothing's changed.  Automounter + data migration -> rebooting clients
>> (or close enough to rebooting).  I.e., outage.
>
>    Uhhh, not if you design your automounter architecture correctly
> and (as Richard said) have NFS clients that are not lame to which I'll
> add, automunters that actually work as advertised. I was designing
> automount architectures that permitted dynamic changes with minimal to
> no outages in the late 1990's. I only had a little over 100 clients
> (most of which were also servers) and NIS+ (NIS ver. 3) to distribute
> the indirect automount maps.

Further below you admit that you're talking about read-only data,
effectively.  But the world is not static.  Sure, *code* is by and
large static, and indeed, we segregated data by whether it was
read-only (code, historical data) or not (application data, home
directories).  We were able to migrate *read-only* data with no
outages.  But for the rest?  Yeah, there were always outages.  Of
course, we had a periodic maintenance window, with all systems
rebooting within a short period, and this meant that some data
migration outages were not noticeable, but they were real.

>    I also had to _redesign_ a number of automount strategies that
> were built by people who thought that using direct maps for everything
> was a good idea. That _was_ a pain in the a** due to the changes
> needed at the applications to point at a different hierarchy.

We used indirect maps almost exclusively.  Moreover, we used
hierarchical automount entries, and even -autofs mounts.  We also used
environment variables to control various things, such as which servers
to mount what from (this was particularly useful for spreading the
load on read-only static data).  We used practically every feature of
the automounter except for executable maps (and direct maps, when we
eventually stopped using those).

>    It all depends on _what_ the application is doing. Something that
> opens and locks a file and never releases the lock or closes the file
> until the application exits will require a restart of the application
> with an automounter / NFS approach.

No kidding!  In the real world such applications exist and get used.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling
 wrote:
> On Apr 25, 2012, at 3:36 PM, Nico Williams wrote:
> > I disagree vehemently.  automount is a disaster because you need to
> > synchronize changes with all those clients.  That's not realistic.
>
> Really?  I did it with NIS automount maps and 600+ clients back in 1991.
> Other than the obvious problems with open files, has it gotten worse since
> then?

Nothing's changed.  Automounter + data migration -> rebooting clients
(or close enough to rebooting).  I.e., outage.

> Storage migration is much more difficult with NFSv2, NFSv3, NetWare, etc.

But not with AFS.  And spec-wise not with NFSv4 (though I don't know
if/when all NFSv4 clients will properly support migration, just that
the protocol and some servers do).

> With server-side, referral-based namespace construction that problem
> goes away, and the whole thing can be transparent w.r.t. migrations.

Yes.

> Agree, but we didn't have NFSv4 back in 1991 :-)  Today, of course, this
> is how one would design it if you had to design a new DFS today.

Indeed, that's why I built an automounter solution in 1996 (that's
still in use, I'm told).  Although to be fair AFS existed back then,
had a global namespace and data migration, and was mature.  It's taken
NFS that long to catch up...

> >[...]
>
> Almost any of the popular nosql databases offer this and more.
> The movement away from POSIX-ish DFS and storing data in
> traditional "files" is inevitable. Even ZFS is a object store at its core.

I agree.  Except that there are applications where large octet streams
are needed.  HPC, media come to mind.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 5:42 PM, Ian Collins  wrote:
> Aren't those general considerations when specifying a file server?

There are Lustre clusters with thousands of nodes, hundreds of them
being servers, and high utilization rates.  Whatever specs you might
have for one server head will not meet the demand that hundreds of the
same can.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 5:22 PM, Richard Elling
 wrote:
> Unified namespace doesn't relieve you of 240 cross-mounts (or equivalents).
> FWIW,
> automounters were invented 20+ years ago to handle this in a nearly seamless
> manner.
> Today, we have DFS from Microsoft and NFS referrals that almost eliminate
> the need
> for automounter-like solutions.

I disagree vehemently.  automount is a disaster because you need to
synchronize changes with all those clients.  That's not realistic.
I've built a large automount-based namespace, replete with a
distributed configuration system for setting the environment variables
available to the automounter.  I can tell you this: the automounter
does not scale, and it certainly does not avoid the need for outages
when storage migrates.

With server-side, referral-based namespace construction that problem
goes away, and the whole thing can be transparent w.r.t. migrations.

For my money the key features a DFS must have are:

 - server-driven namespace construction
 - data migration without having to restart clients,
   reconfigure them, or do anything at all to them
 - aggressive caching

 - striping of file data for HPC and media environments

 - semantics that ultimately allow multiple processes
   on disparate clients to cooperate (i.e., byte range
   locking), but I don't think full POSIX semantics are
   needed

   (that said, I think O_EXCL is necessary, and it'd be
   very nice to have O_APPEND, though the latter is
   particularly difficult to implement and painful when
   there's contention if you stripe file data across
   multiple servers)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 4:26 PM, Paul Archer  wrote:
> 2:20pm, Richard Elling wrote:
>> Ignoring lame NFS clients, how is that architecture different than what
>> you would have
>> with any other distributed file system? If all nodes share data to all
>> other nodes, then...?
>
> Simple. With a distributed FS, all nodes mount from a single DFS. With NFS,
> each node would have to mount from each other node. With 16 nodes, that's
> what, 240 mounts? Not to mention your data is in 16 different
> mounts/directory structures, instead of being in a unified filespace.

To be fair NFSv4 now has a distributed namespace scheme so you could
still have a single mount on the client.  That said, some DFSes have
better properties, such as striping of data across sets of servers,
aggressive caching, and various choices of semantics (e.g., Lustre
tries hard to give you POSIX cache coherency semantics).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
I agree, you need something like AFS, Lustre, or pNFS.  And/or an NFS
proxy to those.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on Linux vs FreeBSD

2012-04-25 Thread Nico Williams
As I understand it LLNL has very large datasets on ZFS on Linux.  You
could inquire with them, as well as
http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/topics?pli=1
.  My guess is that it's quite stable for at least some use cases
(most likely: LLNL's!), but that may not be yours.  You could
always... test it, but if you do then please tell us how it went :)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + Dell MD1200's - MD3200 necessary?

2012-02-04 Thread Nico
Hi.
Going with Dell PV MD1200 and Dell PE R510 or R710 is no problem at all. 

But you should be aware if you go with a Dell R510 or R710 that the internal
storage PCIe slot will not work with any HBAs other than the Dell H200 or H700,
and also remember to order extra Intel NICs and disable the Broadcom NICs in
the BIOS.

So if you plan to flash the H200 with IT firmware, be aware that you have to
move the controller to another PCIe slot, and also that your support is no
longer valid.

Maybe you should keep the IR firmware on the H200 controller, use hardware
RAID for the syspool, and order a Dell 6Gbps SAS HBA with IT firmware.
This way you'll keep the hardware support for your Dell equipment.

/Nico

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data loss by memory corruption?

2012-01-18 Thread Nico Williams
On Wed, Jan 18, 2012 at 4:53 AM, Jim Klimov  wrote:
> 2012-01-18 1:20, Stefan Ring wrote:
>> I don’t care too much if a single document gets corrupted – there’ll
>> always be a good copy in a snapshot. I do care however if a whole
>> directory branch or old snapshots were to disappear.
>
> Well, as far as this problem "relies" on random memory corruptions,
> you don't get to choose whether your document gets broken or some
> low-level part of metadata tree ;)

Other filesystems tend to be much more tolerant of bit rot of all
types precisely because they have no block checksums.

But I'd rather have ZFS -- *with* redundancy, of course, and with ECC.

It might be useful to have a way to recover from checksum mismatches
by involving a human.  I'm imagining a tool that tests whether
accepting a block's actual contents results in making data available
that the human thinks checks out, and if so, then rewriting that
block.  Some bit errors might simply result in meaningless metadata,
but in some cases this can be corrected (e.g., ridiculous block
addresses).  But if ECC takes care of the problem then why waste the
effort?  (Partial answer: because it'd be a very neat GSoC type
project!)

> Besides, what if that document you don't care about is your account's
> entry in a banking system (as if they had no other redundancy and
> double-checks)? And suddenly you "don't exist" because of some EIOIO,
> or your balance is zeroed (or worse, highly negative)? ;)

This is why we have paper trails, logs, backups, redundancy at various
levels, ...

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-11 Thread Nico Williams
On Wed, Jan 11, 2012 at 9:16 AM, Jim Klimov  wrote:
> I've recently had a sort of an opposite thought: yes,
> ZFS redundancy is good - but also expensive in terms
> of raw disk space. This is especially bad for hardware
> space-constrained systems like laptops and home-NASes,
> where doubling the number of HDDs (for mirrors) or
> adding tens of percent of storage for raidZ is often
> not practical for whatever reason.

Redundancy through RAID-Z and mirroring is expensive for home systems
and laptops, but mostly due to the cost of SATA/SAS ports, not the
cost of the drives.  The drives are cheap, but getting an extra disk
in a laptop is either impossible or expensive.  But that doesn't mean
you can't mirror slices or use ditto blocks.  For laptops just use
ditto blocks and either zfs send or external mirror that you
attach/detach.

> Current ZFS checksums allow us to detect errors, but
> in order for recovery to actually work, there should be
> a redundant copy and/or parity block available and valid.
>
> Hence the question: why not put ECC info into ZFS blocks?

RAID-Zn *is* an error correction system.  But what you are asking for
is a same-device error correction method that costs less than ditto
blocks, with error correction data baked into the blkptr_t.  Are there
enough free bits left in the block pointer for error correction codes
for large blocks?  (128KB blocks, but eventually ZFS needs to support
even larger blocks, so keep that in mind.)  My guess is: no.  Error
correction data might have to get stored elsewhere.

I don't find this terribly attractive, but maybe I'm just not looking
at it the right way.  Perhaps there is a killer enterprise feature for
ECC here: stretching MTTDL in the face of a device failure in a mirror
or raid-z configuration (but if failures are typically of whole drives
rather than individual blocks, then this wouldn't help).  But without
a good answer for where to store the ECC for the largest blocks, I
don't see this happening.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2012-01-05 Thread Nico Williams
On Thu, Jan 5, 2012 at 8:53 AM, sol  wrote:
>> if a bug fixed in Illumos is never reported to Oracle by a customer,
>> it would likely never get fixed in Solaris either
>
> :-(
>
> I would have liked to think that there was some good-will between the ex- and 
> current-members of the zfs team, in the sense that the people who created zfs 
> but then left Oracle still care about it enough to want the Oracle version to 
> be as bug-free as possible.

My intention was to encourage users to report bugs to both, Oracle and
Illumos.  It's possible that Oracle engineers pay attention to the
Illumos bug database, but I expect that for legal reasons the will not
look at Illumos code that has any new copyright notices relative to
Oracle code.  The simplest way for Oracle engineers to avoid all
possible legal problems is to simply ignore at least the Illumos
source repositories, possibly more.  I'm speculating, sure; I might be
wrong.

As for good will, I'm certain that there is, at least at the engineer
level, and probably beyond.  But that doesn't mean that there will be
bug parity, much less feature parity.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Nico Williams
On Thu, Dec 29, 2011 at 6:44 PM, Matthew Ahrens  wrote:
> On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble  wrote:
>> (1) when constructing the stream, every time a block is read from a fileset
>> (or volume), its checksum is sent to the receiving machine. The receiving
>> machine then looks up that checksum in its DDT, and sends back a "needed" or
>> "not-needed" reply to the sender. While this lookup is being done, the
>> sender must hold the original block in RAM, and cannot write it out to the
>> to-be-sent-stream.
> ...
>> you produce a huge amount of small network packet
>> traffic, which trashes network throughput
>
> This seems like a valid approach to me.  When constructing the stream,
> the sender need not read the actual data, just the checksum in the
> indirect block.  So there is nothing that the sender "must hold in
> RAM".  There is no need to create small (or synchronous) network
> packets, because sender need not wait for the receiver to determine if
> it needs the block or not.  There can be multiple asynchronous
> communication streams:  one where the sender sends all the checksums
> to the receiver; another where the receiver requests blocks that it
> does not have from the sender; and another where the sender sends
> requested blocks back to the receiver.  Implementing this may not be
> trivial, and in some cases it will not improve on the current
> implementation.  But in others it would be a considerable improvement.

Right, you'd want to let the socket/transport buffer/flow control
writes of "I have this new block checksum" messages from the zfs
sender and "I need the block with this checksum" messages from the zfs
receiver.

I like this.

A separate channel for bulk data definitely comes recommended for flow
control reasons, but if you do that then securing the transport gets
complicated: you couldn't just zfs send .. | ssh ... zfs receive.  You
could use SSH channel multiplexing, but that will net you lousy
performance (well, no lousier than one already gets with SSH
anyways)[*].  (And SunSSH lacks this feature anyways)  It'd then begin
to pay to have a bona fide zfs send network protocol, and now
we're talking about significantly more work.  Another option would be
to have send/receive options to create the two separate channels, so
one would do something like:

% zfs send --dedup-control-channel ... | ssh-or-netcat-or... zfs
receive --dedup-control-channel ... &
% zfs send --dedup-bulk-channel ... | ssh-or-netcat-or... zfs receive
--dedup-bulk-channel
% wait

The second zfs receive would rendezvous with the first and go from there.

[*] The problem with SSHv2 is that it has flow controlled channels
layered over a flow controlled congestion channel (TCP), and there's
not enough information flowing from TCP to SSHv2 to make this work
well, but also, the SSHv2 channels cannot have their window shrink
except by the sender consuming it, which makes it impossible to mix
high-bandwidth bulk and small control data over a congested link.
This means that in practice SSHv2 channels have to have relatively
small windows, which then forces the protocol to work very
synchronously (i.e., with effectively synchronous ACKs of bulk data).
I now believe the idea of mixing bulk and non-bulk data over a single
TCP connection in SSHv2 is a failure.  SSHv2 over SCTP, or over
multiple TCP connections, would be much better.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2011-12-29 Thread Nico Williams
On Thu, Dec 29, 2011 at 2:06 PM, sol  wrote:
> Richard Elling wrote:
>> many of the former Sun ZFS team
>> regularly contribute to ZFS through the illumos developer community.
>
> Does this mean that if they provide a bug fix via illumos then the fix won't
> make it into the Oracle code?

If you're an Oracle customer you should report any ZFS bugs you find
to Oracle if you want fixes in Solaris.  You may want to (and I
encourage you to) report such bugs to Illumos if at all possible
(i.e., unless your agreement with Oracle or your employer's policies
somehow prevent you from doing so).

The following is complete speculation.  Take it with salt.

With reference to your question, it may mean that Oracle's ZFS team
would have to come up with their own fixes to the same bugs.  Oracle's
legal department would almost certainly have to clear the copying of
any non-trivial/obvious fix from Illumos into Oracle's ON tree.  And
if taking a fix from Illumos were to require opening the affected
files (because they are under CDDL in Illumos) then executive
management approval would also be required.  But the most likely case
is that the issue simply wouldn't come up in the first place because
Oracle's ZFS team would almost certainly ignore the Illumos repository
(perhaps not the Illumos bug tracker, but probably that too) as that's
simply the easiest way for them to avoid legal messes.  Think about
it.  Besides, I suspect that from Oracle's point of view what matters
are bug reports by Oracle customers to Oracle, so if a bug fixed in
Illumos is never reported to Oracle by a customer, it would likely
never get fixed in Solaris either except by accident, as a result of
another change.

Also, the Oracle ZFS team is not exactly devoid of clue, even with the
departures from it to date.  I suspect they will be able to fix bugs
in Oracle's ZFS and completely independently of the open ZFS
community, even if it means duplicating effort.

That said, Illumos is a fork of OpenSolaris, and as such it and
Solaris will necessarily diverge as at least one of the two (and
probably both, for a while) gets plenty of bug fixes and enhancements.
 This is a good thing, not a bad thing, at least for now.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Nico Williams
On Thu, Dec 29, 2011 at 9:53 AM, Brad Diggs  wrote:
> Jim,
>
> You are spot on.  I was hoping that the writes would be close enough to 
> identical that
> there would be a high ratio of duplicate data since I use the same record 
> size, page size,
> compression algorithm, … etc.  However, that was not the case.  The main 
> thing that I
> wanted to prove though was that if the data was the same the L1 ARC only 
> caches the
> data that was actually written to storage.  That is a really cool thing!  I 
> am sure there will
> be future study on this topic as it applies to other scenarios.
>
> With regards to directory engineering investing any energy into optimizing 
> ODSEE DS
> to more effectively leverage this caching potential, that won't happen.  OUD 
> far out
> performs ODSEE.  That said OUD may get some focus in this area.  However, 
> time will
> tell on that one.

Databases are not as likely to benefit from dedup as virtual machines,
indeed, DBs are likely to not benefit at all from dedup.  The VM use
case benefits from dedup for the obvious reason that many VMs will
have the same exact software installed most of the time, using the
same filesystems, and the same patch/update installation order, so if
you keep data out of their root filesystems then you can expect
enormous deduplicatiousness.  But databases, not so much.  The unit of
deduplicable data in a VM use case is the guest's preferred block
size, while in a DB the unit of deduplicable data might be a
variable-sized table row, or even smaller: a single row/column value
-- and you have no way to ensure alignment of individual deduplicable
units nor ordering of sets of deduplicable units into larger ones.

When it comes to databases your best bets will be: a) database-level
compression or dedup features (e.g., Oracle's column-level compression
feature) or b) ZFS compression.

(Dedup makes VM management easier, because the alternative is to patch
one master guest VM [per-guest type] then re-clone and re-configure
all instances of that guest type, in the process possibly losing any
customizations in those guests.  But even before dedup, the ability to
snapshot and clone datasets was an impressive dedup-like tool for the
VM use-case, just not as convenient as dedup.)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-28 Thread Nico Williams
On Wed, Dec 28, 2011 at 3:14 PM, Brad Diggs  wrote:
>
> The two key takeaways from this exercise were as follows.  There is 
> tremendous caching potential
> through the use of ZFS deduplication.  However, the current block level 
> deduplication does not
> benefit directory as much as it perhaps could if deduplication occurred at 
> the byte level rather than
> the block level.  It very well could be that even byte level deduplication doesn't
> work as well either.
> Until that option is available, we won't know for sure.

How would byte-level dedup even work?  My best idea would be to apply
the rsync algorithm and then start searching for little chunks of data
with matching rsync CRCs, rolling the rsync CRC over the data until a
match is found for some chunk (which then has to be read and
compared), and so on.  The result would be incredibly slow on write
and would have huge storage overhead.  On the read side you could have
many more I/Os too, so read would get much slower as well.  I suspect
any other byte-level dedup solutions would be similarly lousy.
There'd be no real savings to be had, making the idea not worthwhile.
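
For reference, the rolling weak checksum that scheme would lean on
looks roughly like this (a from-memory sketch of the rsync-style
two-part sum, not rsync's actual source):

    /* Rolling weak checksum over a window of n bytes: the sum can be
     * slid one byte at a time without re-reading the whole window,
     * which is what makes "rolling the CRC over the data" cheap per
     * byte -- but a match still forces a read-and-compare. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint32_t a;      /* sum of bytes in the window (mod 2^16) */
        uint32_t b;      /* position-weighted sum (mod 2^16)      */
        size_t   n;      /* window length                         */
    } rollsum_t;

    void
    rollsum_init(rollsum_t *rs, const uint8_t *buf, size_t n)
    {
        rs->a = rs->b = 0;
        rs->n = n;
        for (size_t i = 0; i < n; i++) {
            rs->a = (rs->a + buf[i]) & 0xffff;
            rs->b = (rs->b + (uint32_t)(n - i) * buf[i]) & 0xffff;
        }
    }

    /* Slide the window one byte: drop `out`, take in `in`. */
    void
    rollsum_roll(rollsum_t *rs, uint8_t out, uint8_t in)
    {
        rs->a = (rs->a - out + in) & 0xffff;
        rs->b = (rs->b - (uint32_t)rs->n * out + rs->a) & 0xffff;
    }

    uint32_t
    rollsum_digest(const rollsum_t *rs)
    {
        return (rs->a | (rs->b << 16));
    }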

Dedup is for very specific use cases.  If your use case doesn't
benefit from block-level dedup, then don't bother with dedup.  (The
same applies to compression, but compression is much more likely to be
useful in general, which is why it should generally be on.)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2011-12-27 Thread Nico Williams
On Tue, Dec 27, 2011 at 8:44 PM, Frank Cusack  wrote:
> So with a de facto fork (illumos) now in place, is it possible that two
> zpools will report the same version yet be incompatible across
> implementations?

Not likely: the Illumos community has developed a method for managing
ZFS extensions in a way other than linear chronology.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2011-12-27 Thread Nico Williams
On Tue, Dec 27, 2011 at 2:20 PM, Frank Cusack  wrote:
> <http://sparcv9.blogspot.com/2011/12/solaris-11-illumos-and-source.html>
>
>> If I "upgrade" ZFS to use the new features in Solaris 11 I will be unable
>> to import my pool using the free ZFS implementation that is available in
>> illumos based distributions
>
>
> Is that accurate?  I understand if the S11 version is ahead of illumos, of
> course I can't use the same pools in both places, but that is the same
> problem as using an S11 pool on S10.  The author is implying a much worse
> situation, that there are zfs "tracks" in addition to versions and that S11
> is now on a different track and an S11 pool will not be usable elsewhere,
> "ever".  I hope it's just a misrepresentation.

Hard to say.  Suppose Oracle releases no details on any additions to
the on-disk ZFS format since build 147...  then either the rest of the
ZFS developer community forks for good, or they have to reverse
engineer Oracle's additions.  Even if Oracle does release details on
their additions, what if the external ZFS developer community
disagrees vehemently with any of those?  And what if the open source
community adds extensions that Oracle never adopts?  A fork is not yet
a reality, but IMO it sure looks likely.

Of course, you can still manage to have pools that will work on all
implementations -- until the day that implementations start removing
older formats anyways, which not only could happen, but I think will
happen, though probably not until S10 is EOLed, and in any case
probably not for a few years yet, likely not even within the next half
decade.  It's hard to predict such things though, so take the above
with some (or lots!) of salt.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-13 Thread Nico Williams
On Dec 11, 2011 5:12 AM, "Nathan Kroenert"  wrote:
>
>  On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote:
>>
>> On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:
>>>
>>> Unfortunetly the answer is no. Neither l1 nor l2 cache is dedup aware.
>>>
>>> The only vendor i know that can do this is Netapp
>>
>> And you really work at Oracle?:)
>>
>> The answer is definiately yes. ARC caches on-disk blocks and dedup just
>> reference those blocks. When you read dedup code is not involved at all.
>> Let me show it to you with simple test:
>>
>> Create a file (dedup is on):
>>
>># dd if=/dev/random of=/foo/a bs=1m count=1024
>>
>> Copy this file so that it is deduped:
>>
>># dd if=/foo/a of=/foo/b bs=1m
>>
>> Export the pool so all cache is removed and reimport it:
>>
>># zpool export foo
>># zpool import foo
>>
>> Now let's read one file:
>>
>># dd if=/foo/a of=/dev/null bs=1m
>>    1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)
>>
>> We read file 'a' and all its blocks are in cache now. The 'b' file
>> shares all the same blocks, so if ARC caches blocks only once, reading
>> 'b' should be much faster:
>>
>># dd if=/foo/b of=/dev/null bs=1m
>>    1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)
>>
>> Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
>> activity. Magic?:)
>>
>
> Hey all,
>
> That reminds me of something I have been wondering about... Why only 12x
> faster? If we are effectively reading from memory - as compared to a disk
> reading at approximately 100MB/s (which is about an average PC HDD reading
> sequentially), I'd have thought it should be a lot faster than 12x.
>
> Can we really only pull stuff from cache at only a little over one
> gigabyte per second if it's dedup data?

The second file may have the same data, but not the same metadata (the
inode number at least must be different), so the znode for it must get
read in, and that will slow reading the copy down a bit.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bug moving files between two zfs filesystems (too many open files)

2011-11-29 Thread Nico Williams
On Tue, Nov 29, 2011 at 12:17 PM, Cindy Swearingen
 wrote:
> I think the "too many open files" is a generic error message about running
> out of file descriptors. You should check your shell ulimit
> information.

Also, see how many open files you have: echo /proc/self/fd/*

It'd be quite weird though to have a very low fd limit or a very large
number of file descriptors open in the shell.

That said, as Casper says, utilities like mv(1) should be able to cope
with reasonably small fd limits (i.e., not as small as 3, but perhaps
as small as 10).
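
For what it's worth, the limit the shell reports via ulimit -n can
also be checked programmatically; a trivial sketch:

    /* Print the current and maximum file-descriptor limits for the
     * process, i.e. what `ulimit -n` reports from the shell. */
    #include <sys/resource.h>
    #include <stdio.h>

    int
    main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return (1);
        }
        printf("soft fd limit: %llu\n", (unsigned long long)rl.rlim_cur);
        printf("hard fd limit: %llu\n", (unsigned long long)rl.rlim_max);
        return (0);
    }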

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] grrr, How to get rid of mis-touched file named `-c'

2011-11-28 Thread Nico Williams
On Mon, Nov 28, 2011 at 11:28 AM, Smith, David W.  wrote:
> You could list by inode, then use find with rm.
>
> # ls -i
> 7223 -O
>
> # find . -inum 7223 -exec rm {} \;

This is the one solution I'd recommend against, since it would remove
hardlinks that you might care about.

Also, this thread is getting long, repetitive, tiring.  Please stop.
This is a standard issue Unix beginner question, just like "my test
program does nothing".

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] virtualbox rawdisk discrepancy

2011-11-21 Thread Nico Williams
Moving boot disks from one machine to another used to work as long as
the machines were of the same architecture.  I don't recall if it was
*supported* (and wouldn't want to pretend to speak for Oracle now),
but it was meant to work (unless you minimized the install and removed
drivers not needed on the first system that are needed on the other
system).  You did have to do a reconfigure boot though!

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] aclmode=mask

2011-11-14 Thread Nico Williams
On Mon, Nov 14, 2011 at 6:20 PM, Nico Williams  wrote:
> I see, with great pleasure, that ZFS in Solaris 11 has a new
> aclmode=mask property.

Also, congratulations on shipping.  And thank you for implementing aclmode=mask.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] aclmode=mask

2011-11-14 Thread Nico Williams
I see, with great pleasure, that ZFS in Solaris 11 has a new
aclmode=mask property.

http://download.oracle.com/docs/cd/E23824_01/html/821-1448/gbscy.html#gkkkp
http://download.oracle.com/docs/cd/E23824_01/html/821-1448/gbchf.html#gljyz
http://download.oracle.com/docs/cd/E23824_01/html/821-1462/zfs-1m.html#scrolltoc
(search for aclmode)

May this be the last word in ACL/chmod interactions (knocks on wood,
crosses fingers, ...).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] about btrfs and zfs

2011-11-14 Thread Nico Williams
On Mon, Nov 14, 2011 at 8:33 AM, Edward Ned Harvey
 wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Paul Kraus
>>
>> Is it really B-Tree based? Apple's HFS+ is B-Tree based and falls
>> apart (in terms of performance) when you get too many objects in one
>> FS, which is specifically what drove us to ZFS. We had 4.5 TB of data
>
> According to wikipedia, btrfs is a b-tree.
> I know in ZFS, the DDT is an AVL tree, but what about the rest of the
> filesystem?

ZFS directories are hashed.  Aside from this, the filesystem (and
volume) have a tree structure, but that's not what's interesting here
-- what's interesting is how directories are indexed.

> B-trees should be logarithmic time, which is the best O() you can possibly
> achieve.  So if HFS+ is dog slow, it's an implementation detail and not a
> general fault of b-trees.

Hash tables can do much better than O(log N) for searching: O(1) for
best case, and O(n) for the worst case.

Also, b-trees are O(log_b N), where b is the number of entries
per node.  6e7 entries/directory probably works out to 2-5 reads
(assuming 0% cache hit rate) depending on the size of each directory
entry and the size of the b-tree blocks.
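
A quick sanity check of that 2-5 reads figure (my arithmetic, assuming
a cold cache and a clean b-tree): the depth is about ceil(log_b N).

    /* Depth of a b-ary search tree over 6e7 directory entries, for a
     * few plausible fanouts; ceil(log_b N) node reads with a cold cache. */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double entries = 6e7;
        int fanout[] = { 64, 256, 1024, 4096 };

        for (int i = 0; i < 4; i++) {
            double depth = ceil(log(entries) / log((double)fanout[i]));
            printf("fanout %4d -> %.0f reads\n", fanout[i], depth);
        }
        return (0);
    }

That lands at 5, 4, 3, and 3 reads respectively, consistent with the
estimate above.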

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] about btrfs and zfs

2011-11-11 Thread Nico Williams
On Fri, Nov 11, 2011 at 4:27 PM, Paul Kraus  wrote:
> The command syntax paradigm of zfs (command sub-command object
> parameters) is not unique to zfs, but seems to have been the "way of
> doing things" in Solaris 10. The _new_ functions of Solaris 10 were
> all this way (to the best of my knowledge)...
>
> zonecfg
> zoneadm
> svcadm
> svccfg
> ... and many others are written this way. To boot the zone named foo
> you use the command "zoneadm -z foo boot", to disable the service
> named sendmail, "svcadm disable sendmail", etc. Someone at Sun was
> thinking :-)

I'd have preferred "zoneadm boot foo".  The -z zone command thing is a
bit of a sore point, IMO.

But yes, all these new *adm(1M) and *cfg(1M) commands in S10 are
wonderful, especially when compared to past and present alternatives
in the Unix/Linux world.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Nico Williams
To some people "active-active" means all cluster members serve the
same filesystems.

To others "active-active" means all cluster members serve some
filesystems and can serve all filesystems ultimately by taking over
failed cluster members.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] about btrfs and zfs

2011-10-19 Thread Nico Williams
On Wed, Oct 19, 2011 at 7:24 AM, Garrett D'Amore
 wrote:
> I'd argue that from a *developer* point of view, an fsck tool for ZFS might 
> well be useful.  Isn't that what zdb is for? :-)
>
> But ordinary administrative users should never need something like this, 
> unless they have encountered a bug in ZFS itself.  (And bugs are as likely to 
> exist in the checker tool as in the filesystem. ;-)

zdb can be useful for admins -- say, to gather stats not reported by
the system, to explore the fs/vol layout, for educational purposes,
and so on.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] about btrfs and zfs

2011-10-18 Thread Nico Williams
On Tue, Oct 18, 2011 at 9:35 AM, Brian Wilson  wrote:
> I just wanted to add something on fsck on ZFS - because for me that used to
> make ZFS 'not ready for prime-time' in 24x7 5+ 9s uptime environments.
> Where ZFS doesn't have an fsck command - and that really used to bug me - it
> does now have a -F option on zpool import.  To me it's the same
> functionality for my environment - the ability to try to roll back to a
> 'hopefully' good state and get the filesystem mounted up, leaving the
> corrupted data objects corrupted.  [...]

Yes, that's exactly what it is.  There's no point calling it fsck
because fsck fixes individual filesystems, while ZFS fixups need to
happen at the volume level (at volume import time).

It's true that this should have been in ZFS from the word go.  But
it's there now, and that's what matters, IMO.

It's also true that this was never necessary with hardware that
doesn't lie, but it's good to have it anyways, and is critical for
personal systems such as laptops.
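
For anyone who hasn't tried it yet, the recovery import Brian mentions looks
like this (pool name made up):

zpool import -nF tank    # dry run: reports whether discarding the last few txgs would help
zpool import -F tank     # roll back to the last consistent txgs and import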

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-14 Thread Nico Williams
Also, it's not worth doing a clustered ZFS thing that is too
application-specific.  You really want to nail down your choices of
semantics, explore what design options those yield (or approach from
the other direction, or both), and so on.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-14 Thread Nico Williams
On Thu, Oct 13, 2011 at 9:13 PM, Jim Klimov  wrote:
> Thanks to Nico for concerns about POSIX locking. However,
> hopefully, in the usecase I described - serving images of
> VMs in a manner where storage, access and migration are
> efficient - whole datasets (be it volumes or FS datasets)
> can be dedicated to one VM host server at a time, just like
> whole pools are dedicated to one host nowadays. In this
> case POSIX compliance can be disregarded - access
> is locked by one host, not avaialble to others, period.
> Of course, there is a problem of capturing storage from
> hosts which died, and avoiding corruptions - but this is
> hopefully solved in the past decades of clustering tech's.

It sounds to me like you need horizontal scaling more than anything
else.  In that case, why not use pNFS or Lustre?  Even if you want
snapshots, a VM should be able to handle that on its own -- probably
not as nicely as ZFS in some respects, but having the application in
control of the exact snapshot boundaries means you don't have to
quiesce your VMs just to snapshot safely.

> Nico also confirmed that "one node has to be a master of
> all TXGs" - which is conveyed in both ideas of my original
> post.

Well, at any one time one node would have to be the master of the next
TXG, but it doesn't mean that you couldn't have some cooperation.
There are lots of other much more interesting questions.  I think the
biggest problem lies in requiring full connectivity from every server
to every LUN.  I'd much rather take the Lustre / pNFS model (which,
incidentally, don't preclude having snapshots).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-11 Thread Nico Williams
On Sun, Oct 9, 2011 at 12:28 PM, Jim Klimov  wrote:
> So, one version of the solution would be to have a single host
> which imports the pool in read-write mode (i.e. the first one
> which boots), and other hosts would write thru it (like iSCSI
> or whatever; maybe using SAS or FC to connect between
> "reader" and "writer" hosts). However they would read directly
> from the ZFS pool using the full SAN bandwidth.

You need to do more than simply assign a node for writes.  You need to
send write and lock requests to one node.  And then you need to figure
out what to do about POSIX write visibility rules (i.e., when a write
should be visible to other readers).  I think you'd basically end up
not meeting POSIX in this regard, just like NFS, though perhaps not
with close-to-open semantics.

I don't think ZFS is the beast you're looking for.  You want something
more like Lustre, GPFS, and so on.  I suppose someone might surprise
us one day with properly clustered ZFS, but I think it'd be more
likely that the filesystem would be ZFS-like, not ZFS proper.

> Second version of the solution is more or less the same, except
> that all nodes can write to the pool hardware directly using some
> dedicated block ranges "owned" by one node at a time. This
> would work like much a ZIL containing both data and metadata.
> Perhaps these ranges would be whole metaslabs or some other
> ranges as "agreed" between the master node and other nodes.

This is much hairier.  You need consistency.  If two processes on
different nodes are writing to the same file, then you need to
*internally* lock around all those writes so that the on-disk
structure ends up being sane.  There's a number of things you could do
here, such as, for example, having a per-node log and one node
coalescing them (possibly one node per-file, but even then one node
has to be the master of every txg).

And still you need to be careful about POSIX semantics.  That does not
come for free in any design -- you will need something like the Lustre
DLM (distributed lock manager).  Or else you'll have to give up on
POSIX.

There's a hefty price to be paid for POSIX semantics in a clustered
environment.  You'd do well to read up on Lustre's experience in
detail.  And not just Lustre -- that would be just to start.  I
caution you that this is not a simple project.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-11 Thread Nico Williams
On Tue, Oct 11, 2011 at 11:15 PM, Richard Elling
 wrote:
> On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
>> ZFS developers have for a long time stated that ZFS is not intended,
>> at least not in near term, for clustered environments (that is, having
>> a pool safely imported by several nodes simultaneously). However,
>> many people on forums have wished having ZFS features in clusters.
>
> ...and UFS before ZFS… I'd wager that every file system has this RFE in its
> wish list :-)

Except the ones that already have it!  :)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] "zfs diff" performance disappointing

2011-09-26 Thread Nico Williams
Ah yes, of course.  I'd misread your original post.  Yes, disabling
atime updates will reduce the number of superfluous transactions.
It's *all* transactions that count, not just the ones the app
explicitly caused, and atime implies lots of transactions.
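
For the record, turning atime off is a one-liner, per dataset (name made up):

zfs set atime=off tank/export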

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] "zfs diff" performance disappointing

2011-09-26 Thread Nico Williams
On Mon, Sep 26, 2011 at 1:55 PM, Jesus Cea  wrote:
> I just upgraded to Solaris 10 Update 10, and one of the improvements
> is "zfs diff".
>
> Using the "birthtime" of the sectors, I would expect very high
> performance. The actual performance doesn't seems better that an
> standard "rdiff", though. Quite disappointing...
>
> Should I disable "atime" to improve "zfs diff" performance? (most data
> doesn't change, but "atime" of most files would change).

atime has nothing to do with it.

How much work zfs diff has to do depends on how much has changed
between snapshots.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs scripts

2011-09-09 Thread Nico Williams
On Fri, Sep 9, 2011 at 5:33 AM, Sriram Narayanan  wrote:
> Plus, you'll need an & character at the end of each command.

And a wait command, if you want the script to wait for the sends to
finish (which you should).
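
A minimal sketch of what I mean (dataset, snapshot and host names made up):

#!/bin/sh
# start the sends in parallel...
zfs send tank/a@today | ssh backuphost zfs receive -d backup &
zfs send tank/b@today | ssh backuphost zfs receive -d backup &
# ...and don't let the script exit until both have finished
wait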

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?

2011-07-28 Thread Nico Williams
On Wed, Jul 27, 2011 at 9:22 PM, Daniel Carosone  wrote:
> Absent TRIM support, there's another way to do this, too.  It's pretty
> easy to dd /dev/zero to a file now and then.  Just make sure zfs
> doesn't prevent these being written to the SSD (compress and dedup are
> off).  I have a separate "fill" dataset for this purpose, to avoid
> keeping these zeros in auto-snapshots too.

Nice.
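
For reference, a minimal sketch of the zero-fill trick Daniel describes
(pool/dataset names made up, default mountpoints assumed, and the
auto-snapshot property assumes the time-slider/auto-snapshot service;
compression and dedup have to stay off or the zeros never reach the SSD):

zfs create -o compression=off -o dedup=off -o com.sun:auto-snapshot=false rpool/fill
dd if=/dev/zero of=/rpool/fill/zeros bs=1024k count=10240   # ~10GB of zeros, size to taste
rm /rpool/fill/zeros    # free the space again; the SSD has now seen zeroed LBAs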

Seems to me that it'd be nicer to have an interface to raw flash (no
wear leveling, direct access to erasure, read, write,
read-modify-write [as an optimization]).  Then the filesystem could do
a much better job of using flash efficiently.  But a raw interface
wouldn't be a disk-compatible interface.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)

2011-07-24 Thread Nico Williams
On Jul 9, 2011 1:56 PM, "Edward Ned Harvey" <
opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:
>
> Given the abysmal performance, I have to assume there is a significant
> number of "overhead" reads or writes in order to maintain the DDT for each
> "actual" block write operation.  Something I didn't mention in the other
> email is that I also tracked iostat throughout the whole operation.  It's
> all writes (or at least 99.9% writes.)  So I am forced to conclude it's a
> bunch of small DDT maintenance writes taking place and incurring access time
> penalties in addition to each intended single block access time penalty.
>
> The nature of the DDT is that it's a bunch of small blocks, that tend to be
> scattered randomly, and require maintenance in order to do anything else.
> This sounds like precisely the usage pattern that benefits from low latency
> devices such as SSD's.

The DDT should be written to in COW fashion, and asynchronously, so there
should be no access time penalty.  Or so ISTM it should be.

Dedup is necessarily slower for writing because of the deduplication table
lookups.  Those are synchronous lookups, but for async writes you'd think
that total write throughput would only be affected by a) the additional read
load (which is zero in your case) and b) any inability to put together large
transactions due to the high latency of each logical write -- and (b)
shouldn't happen, particularly if the DDT fits in RAM or L2ARC, as it does
in your case.

So, at first glance my guess is ZFS is leaving dedup write performance on
the table most likely due to implementation reasons, not design reasons.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption accelerator card recommendations.

2011-06-27 Thread Nico Williams
On Jun 27, 2011 4:15 PM, "David Magda"  wrote:
> The (Ultra)SPARC T-series processors do, but to a certain extent it goes
> against a CPU manufacturers best (financial) interest to provide this:
> crypto is very CPU intensive using 'regular' instructions, so if you need
> to do a lot of it, it would force you to purchase a manufacturer's
> top-of-the-line CPUs, and to have as many sockets as you can to handle a
> load (and presumably you need to do "useful" work besides just
> en/decrypting traffic).

I hope no CPU vendor thinks about the economics of chip making that way.  I
actually doubt any do.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption accelerator card recommendations.

2011-06-27 Thread Nico Williams
On Jun 27, 2011 9:24 PM, "David Magda"  wrote:
> AESNI is certain better than nothing, but RSA, SHA, and the RNG would be
nice as well. It'd also be handy for ZFS crypto in addition to all the
network IO stuff.

The most important reason for AES-NI might be not performance but defeating
side-channel attacks.

Also, really fast AES HW makes AES-based hash functions quite tempting.

No, AES-NI is nothing to sneeze at.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption accelerator card recommendations.

2011-06-27 Thread Nico Williams
IMO a faster processor with built-in AES and other crypto support is
most likely to give you the most bang for your buck, particularly if
you're using closed Solaris 11, as Solaris engineering is likely to
add support for new crypto instructions faster than Illumos (but I
don't really know enough about Illumos to say for sure).
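
(A quick way to see whether a given box already has the AES instructions,
going from memory on the exact feature name:

isainfo -v | grep -i aes

should list aes among the extensions on capable CPUs.)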

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-16 Thread Nico Williams
That said, losing committed transactions when you needed and thought
you had ACID semantics... is bad.  But that's implied in any
restore-from-backups situation.  So you replicate/distribute
transactions so that restore from backups (or snapshots) is an
absolutely last resort matter, and if you ever have to restore from
backups you also spend time manually tracking down (from
counterparties, "paper" trails kept elsewhere, ...) any missing
transactions.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-16 Thread Nico Williams
On Thu, Jun 16, 2011 at 8:51 AM,   wrote:
> If a database engine or another application keeps both the data and the
> log in the same filesystem, a snapshot wouldn't create inconsistent data
> (I think this would be true with vim and a large number of database
> engines; vim will detect the swap file and datbase should be able to
> detect the inconsistency and rollback and re-apply the log file.)

Correct.  SQLite3 will be able to recover automatically from restores
of mid-transaction snapshots.

VIM does not recover automatically, but it does notice the swap file
and warns the user and gives them a way to handle the problem.

(When you save a file, VIM renames the old one out of the way, creates
a new file with the original name, writes the new contents to it,
closes it, then unlinks the swap file.  On recovery VIM notices the
swap file and gives the user a menu of choices.)
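
In shell terms the same crash-safe pattern looks roughly like this (file
names made up):

# build the new contents under a temporary name, then atomically replace
printf '%s\n' "new contents" > config.tmp.$$ &&
mv config.tmp.$$ config   # rename(): readers see either the old or the new file, never half of one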

I believe this is the best solution: write applications so they can
recover from being restarted with data restored from a mid-transaction
snapshot.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Versioning FS was: question about COW and snapshots

2011-06-16 Thread Nico Williams
As Casper pointed out, the right thing to do is to build applications
such that they can detect mid-transaction state and roll it back (or
forward, if there's enough data).  Then mid-transaction snapshots are
fine, and the lack of APIs by which to inform the filesystem of
application transaction boundaries becomes much less of an issue
(adding such APIs is not a good solution, since it'd take many years
for apps to take advantage of them and more years still for legacy
apps to be upgraded or decommissioned).

The existing FS interfaces provide enough that one can build
applications this way.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-13 Thread Nico Williams
And, without a sub-shell:

find . -type f \! -links 1 | xargs stat -c " %b %B *+p" /dev/null | dc 2>/dev/null | tail -1

(The stderr redirection is because otherwise dc whines once that the
stack is empty, and the tail is because we print interim totals as we
go.)

Also, this doesn't quite work, since it counts every link, when we want
to count all links but one.  This, then, is what will tell you how
much space you saved due to hardlinks:

find . -type f \! -links 1 | xargs stat -c " 8k %b %B * %h 1 - * %h /+p" /dev/null 2>/dev/null | dc

Excuse my earlier brainfarts :)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-13 Thread Nico Williams
On Mon, Jun 13, 2011 at 12:59 PM, Nico Williams  wrote:
> Try this instead:
>
> (echo 0; find . -type f \! -links 1 | xargs stat -c " %b %B *+" $p; echo p) | 
> dc

s/\$p//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-13 Thread Nico Williams
On Mon, Jun 13, 2011 at 5:50 AM, Roy Sigurd Karlsbakk  
wrote:
>> If anyone has any ideas be it ZFS based or any useful scripts that
>> could help here, I am all ears.
>
> Something like this one-liner will show what would be allocated by everything 
> if hardlinks weren't used:
>
> # size=0; for i in `find . -type f -exec du {} \; | awk '{ print $1 }'`; do 
> size=$(( $size + $i )); done; echo $size

Oh, you don't want to do that: you'll run into max argument list size issues.

Try this instead:

(echo 0; find . -type f \! -links 1 | xargs stat -c " %b %B *+" $p; echo p) | dc

;)

xargs is your friend (and so is dc... RPN FTW!).  Note that I'm not
multiplying by the number of links, because find will print a name for
every link (well, if you run the find from the root of the relevant
filesystem), so including the link count would count too much space.

You'll need the GNU stat(1).  Or you could do something like this
using the ksh stat builtin:

(
  echo 0
  find . -type f \! -links 1 | while read p; do
    stat -c " %b %B *+" "$p"
  done
  echo p
) | dc

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Nico Williams
On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson
 wrote:
> I have an interesting question that may or may not be answerable from some
> internal
> ZFS semantics.

This is really standard Unix filesystem semantics.

> [...]
>
> So total storage used is around ~7.5MB due to the hard linking taking place
> on each store.
>
> If hard linking capability had been turned off, this same message would have
> used 1500 x 2MB =3GB
> worth of storage.
>
> My question is there any simple ways of determining the space savings on
> each of the stores from the usage of hard links?  [...]

But... you just did!  :)  The inflated (no-hard-links) size is: number of
hard links * (file size + sum(size of link names and/or directory slot
size)).  For sufficiently large files (say, larger than one disk block) you
could approximate that as: number of hard links * file size, which makes the
savings roughly: (number of hard links - 1) * file size.  The key is the
number of hard links, which will typically vary, but for e-mails that go to
all users, well, you know the number of links then is the number of users.

You could write a script to do this -- just look at the size and
hard-link count of every file in the store, apply the above formula,
add up the inflated sizes, and you're done.
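
A sketch of such a script, assuming GNU findutils and GNU stat(1) (for
-print0/-0 and the %i/%h/%s format characters), a store that lives in a
single filesystem, and a made-up path:

find /var/mailstore -type f -links +1 -print0 |
  xargs -0 stat -c '%i %h %s' |   # inode, link count, size for every path
  sort -u |                       # all links to a file produce the same line; keep one
  awk '{ saved += ($2 - 1) * $3 }
       END { printf("%.0f bytes saved by hard links\n", saved) }'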

Nico

PS: Is it really the case that Exchange still doesn't deduplicate
e-mails?  Really?  It's much simpler to implement dedup in a mail
store than in a filesystem...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Oracle and Nexenta

2011-05-26 Thread Nico Williams
On May 25, 2011 7:15 AM, "Garrett D'Amore"  wrote:
>
> You are welcome to your beliefs.   There are many groups that do standards
that do not meet in public.  [...]

True.

> [...] In fact, I can't think of any standards bodies that *do* hold open
meetings.

I can: the IETF, for example.  All business of the IETF is transacted or
confirmed on open-participation mailing lists, and IETF meetings are known
as NOTE WELL meetings because of the notice given at their opening that the
meeting is public, along with the considerations that follow from that
(regarding, e.g., trade secrets).

Mind you, there are many more standards setting organizations that don't
have open participation, such as OASIS, ISO, and so on.  I don't begrudge
you starting closed, our even staying closed, though I would prefer that at
least the output of any ZFS standards org be open.  Also, I would recommend
that you eventually consider creating a new open participation list for
non-members (separate from any members-only list).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.

2011-05-22 Thread Nico Williams
On Sun, May 22, 2011 at 1:52 PM, Nico Williams  wrote:
> [...] Or perhaps you'll argue that no one should ever need bi-di
> replication, that if one finds oneself wanting that then one has taken
> a wrong turn somewhere.

You could also grant the premise and argue instead that nothing the
filesystem can do to speed up remote bi-di sync is worth the cost --
an argument that requires a lot more analysis.  For example, if the
filesystem were to compute and store rsync rolling CRC signatures,
well, that would require significant compute and storage resources,
and it might not speed up synchronization enough to ever be
worthwhile.  Similarly, a Merkle hash tree based on rolling hash
functions (and excluding physical block pointer details) might require
each hash output to grow linearly with block size in order to retain
the rolling hash property (I'm not sure about this; I know very little
about rolling hash functions), in which case the added complexity
would be a complete non-starter.  Whereas a Merkle hash tree built
with regular hash functions would not be able to resolve
insertions/deletions of data chunks of size that is not a whole
multiple of block size.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.

2011-05-22 Thread Nico Williams
On Sun, May 22, 2011 at 10:20 AM, Richard Elling
 wrote:
> ZFS already tracks the blocks that have been written, and the time that
> they were written. So we already know when something was writtem, though
> that does not answer the question of whether the data was changed. I think
> it is a pretty good bet that newly written data is different :-)

Not really.  There's bp rewrite (assuming that ever ships, or gets
implemented elsewhere), for example.

>> Then, the filesystem should make this Merkle Tree available to
>> applications through a simple query.
>
> Something like "zfs diff" ?

That works within a filesystem.  And zfs send/recv works when you have
one filesystem faithfully tracking another.

When you have two filesystems with similar contents, and the history
of each is useless in deciding how to do a bi-directional
synchronization, then you need a way to diff files that is not based
on intra-filesystem history.  The rsync algorithm is the best
high-performance algorithm that we have for determining differences
between files separated by a network.  My proposal (back then, and
Zooko's now) is to leverage work that the filesystem does anyways to
implement a high-performance remote diff that is faster than rsync for
the simple reason that some of the rsync algorithm essentially gets
pre-computed.

>> This would enable applications—without needing any further
>> in-filesystem code—to perform a Merkle Tree sync, which would range
>> from "noticeably more efficient" to "dramatically more efficient" than
>> rsync or zfs send. :-)
>
> Since ZFS send already has an option to only send the changed blocks,
> I disagree with your assertion that your solution will be "dramatically
> more efficient" than zsf send. We already know zfs send is much more
> efficient than rsync for large file systems.

You missed Zooko's point completely.  It might help to know that Zooko
works on a project called Tahoe Least-Authority Filesystem, which is
by nature distributed.  Once you lose the constraints of not having a
network or having uni-directional replication only, I think you'll get
it.  Or perhaps you'll argue that no one should ever need bi-di
replication, that if one finds oneself wanting that then one has taken
a wrong turn somewhere.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Nico Williams
Then again, Windows apps may be doing seek+write to pre-allocate storage.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Nico Williams
On Mon, May 2, 2011 at 3:56 PM, Eric D. Mudama
 wrote:
> Yea, kept googling and it makes sense.  I guess I am simply surprised
> that the application would have done the seek+write combination, since
> on NTFS (which doesn't support sparse) these would have been real
> 1.5GB files, and there would be hundreds or thousands of them in
> normal usage.

It could have been smbd compressing long runs of zeros.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Nico Williams
Also, sparseness need not be apparent to applications.  Until recent
improvements to lseek(2) to expose hole/non-hole offsets, the only way
to know about sparseness was to notice that a file's reported size is
more than the file's reported filesystem blocks times the block size.
Sparse files in Unix go back at least to the early 80s.
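
On Solaris you can see the effect with mkfile's -n flag (path made up):

mkfile -n 1g /var/tmp/sparse   # size is recorded, but no blocks are allocated
ls -l /var/tmp/sparse          # reports 1 GB
du -k /var/tmp/sparse          # reports a few KB: the rest is a hole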

If a filesystem protocol, such as CIFS (I've no idea if it supports
sparse files), were to not support sparse files, all that would mean
is that the server must report a number of blocks that matches a
file's size (assuming the protocol in question even supports any
notion of reporting a file's size in blocks).

There are really two ways in which a filesystem protocol could support
sparse files: a) by reporting file size in bytes and blocks, b) by
reporting lists of file offsets demarcating holes from non-holes.  (b)
is a very new idea; Lustre may be the only filesystem I know of that
supports this (see the Linux FIEMAP APIs), though work is in progress
to add this to NFSv4.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] disable zfs/zpool destroy for root user

2011-02-17 Thread Nico Williams
On Thu, Feb 17, 2011 at 3:07 PM, Richard Elling
 wrote:
> On Feb 17, 2011, at 12:44 PM, Stefan Dormayer wrote:
>
>> Hi all,
>>
>> is there a way to disable the subcommand destroy of zpool/zfs for the root 
>> user?
>
> Which OS?

Heheh.  Great answer.  The real answer depends also on what the OP
meant by "root".

"root" in Solaris isn't the all-powerful thing it used to be, or, rather, it is,
but its power can be limited.  And not just on Solaris either.

The OP's question is difficult to answer because the question isn't the one
the OP really wants to ask -- we must tease out that real question, or guess.
I'd start with: just what is it that you want to accomplish?

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)

2011-02-14 Thread Nico Williams
On Feb 14, 2011 6:56 AM, "Paul Kraus"  wrote:
> P.S. I am measuring number of objects via `zdb -d` as that is faster
> than trying to count files and directories and I expect is a much
> better measure of what the underlying zfs code is dealing with (a
> particular dataset may have lots of snapshot data that does not
> (easily) show up).

It's faster because: a) no atime updates, b) no ZPL overhead.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Nico Williams
On Mon, Feb 7, 2011 at 1:17 PM, Yi Zhang  wrote:
> On Mon, Feb 7, 2011 at 1:51 PM, Brandon High  wrote:
> Maybe I didn't make my intention clear. UFS with directio is
> reasonably close to a raw disk from my application's perspective: when
> the app writes to a file location, no buffering happens. My goal is to
> find a way to duplicate this on ZFS.

You're still mixing directio and O_DSYNC.

O_DSYNC is like calling fsync(2) after every write(2).  fsync(2) is useful
to obtain some limited transactional semantics, as well as for durability
semantics.  In ZFS you don't need to call fsync(2) to get those transactional
semantics, but you do need to call fsync(2) to get those durability semantics.

Now, in ZFS fsync(2) implies a synchronous I/O operation involving
significantly more than just the data blocks you wrote to, which means that
O_DSYNC on ZFS is significantly slower than on UFS.  You can address this in
one of two ways: a) you might realize that you don't need every write(2) to
be durable, and stop using O_DSYNC, or b) you might get a fast ZIL device.
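
For (b), adding a dedicated log device is a one-liner (pool and device names
made up):

zpool add tank log c4t2d0    # synchronous writes now hit the slog instead of the main disks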

I'm betting that if you look carefully at your application's requirements you'll
probably conclude that you don't need O_DSYNC at all.  Perhaps you can tell us
more about your application.

> Setting primarycache didn't eliminate the buffering, using O_DSYNC
> (whose side effects include elimination of buffering) made it
> ridiculously slow: none of the things I tried eliminated buffering,
> and just buffering, on ZFS.
>
> From the discussion so far my feeling is that ZFS is too different
> from UFS that there's simply no way to achieve this goal...

You've not really stated your application's requirements.  You may be convinced
that you need O_DSYNC, but chances are that you don't.  And yes, it's possible
that you'd need O_DSYNC on UFS but not on ZFS.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool export+import doesn't maintain snapshot

2009-01-14 Thread Nico Sabbi
On Wednesday 14 January 2009 16:49:48 cindy.swearin...@sun.com wrote:
> Nico,
>
> If you want to enable snapshot display as in previous releases,
> then set this parameter on the pool:
>
> # zpool set listsnapshots=on pool-name
>
> Cind
>

thanks, it works as I need.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool export+import doesn't maintain snapshot

2009-01-14 Thread Nico Sabbi
On Wednesday 14 January 2009 11:44:56 Peter Tribble wrote:
> On Wed, Jan 14, 2009 at 10:11 AM, Nico Sabbi  
wrote:
> > Hi,
> > I wanted to migrate a virtual disk from a S10U6 to OpenSolaris
> > 2008.11.
> > In the first machine I rebooted to single-user and ran
> > $ zpool export disco
> >
> > then copied the disk files to the target VM, rebooted as
> > single-user and ran
> > $ zpool import disco
> >
> > The disc was mounted, but none of the hundreds of snapshots was
> > there.
> >
> > Did Imiss something?
>
> How do you know the snapshots are gone?
>
> Note that the zfs list command no longer shows snapshots by
> default. You need 'zfs list -t all' for that.

Now I see them, but why this change?  What do I have to do to list them
by default, as on the old server?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool export+import doesn't maintain snapshot

2009-01-14 Thread Nico Sabbi
Hi,
I wanted to migrate a virtual disk from a S10U6 to OpenSolaris 
2008.11.
In the first machine I rebooted to single-user and ran
$ zpool export disco

then copied the disk files to the target VM, rebooted as single-user
and ran
$ zpool import disco

The disc was mounted, but none of the hundreds of snapshots was there.

Did I miss something?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS boot and data on same disk - is this supported?

2008-12-19 Thread Nico Sabbi
On Friday 19 December 2008 03:32:01 Ian Collins wrote:
> On Fri 19/12/08 14:52 , Shawn Joy shawn@sun.com sent:
> > I have read the ZFS best practice guide located at
> > http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
> > However I have questions whether we support using slices
> > for data on the same disk as we use for ZFS boot.
>
> Why would you want to do this instead of giving ZFS the whole disk?
>  Do you have compelling reasons to use UFS rather than ZFS
> filesystems for data?

I find ZFS's eagerness to monopolize the disk quite irritating: sometimes
there are other OSs on the same disk.
BTW, how much does ZFS slow down (on average) when using
slices instead of the whole disk?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Some basic questions about getting the best performance for database usage

2008-06-30 Thread Nico Sabbi
On Monday 30 June 2008 11:14:10 James C. McPherson wrote:
> Christiaan Willemsen wrote:
> ...
>
> > And that is exactly where ZFS  comes in, at least as far as I
> > read.
> >
> > The question is: how can we maximize IO by using the best
> > possible combination of hardware and ZFS RAID?
>
> ...
>
> > For what I read, mirroring and striping should get me better
> > performance than raidz of RAID5. But I guess you might give me
> > some pointer on how to distribute the disk. My biggest question
> > is what I should leave to the HW raid, and what to ZFS?
>
> Hi Christiaan,
> If you haven't found it already, I highly recommend going
> through the information at these three urls:
>
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Configuration_Guide
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
>
>
> I'll defer to Richard Elling and Roch Bourbonnais for specific
> suggestions based on your email - as far as I'm concerned they're
> the experts when it comes to ZFS tuning and database performance.
>
>
> James C. McPherson
> --

I want to save you some time and suffering: I had to add
set zfs:zfs_nocacheflush = 1
to /etc/system and reboot to cure the horrible slowness I experienced
with all MySQL storage engines on ZFS, especially InnoDB.
I had never seen a DB go so slow until that moment.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] memory hog

2008-06-23 Thread Nico Sabbi
On Monday 23 June 2008 09:39:13 Kaiwai Gardiner wrote:
> Erik Trimble wrote:
> > Edward wrote:
> >> So does that mean ZFS is not for consumer computer?
> >> If ZFS require 4GB of Ram for operation, that means i will need 
> >> 8GB+ Ram if i were to use Photoshop or any other memory
> >> intensive application?
> >
> > No.  It works fine on desktops - I'm writing this on an older
> > Athlon64 with 1GB.   Memory pressure does seem to become a bit
> > more of an issue when I'm doing more I/O on the box (which, I'm
> > assuming, is due to the various caches), so for things like
> > compiling, I feel a little cramped.
> >
> > Personally, (in my experience only), I'd say that ZFS works well
> > for use on the desktop, ASSUMING you dedicate 1GB of RAM to
> > solely the OS (and ZFS).  For very heavy I/O work, I think at
> > least 2GB is a better idea.
> >
> > So, size your total memory accordingly.
>
> I've got a Dell Dimension 8400 w/ 2.5gb ram and p4 3.2Ghz
> processor; I haven't noticed any slow downs either. Memory is so
> cheap, adding an extra 2gb is only around NZ$100 these days anyway.
>
> Matthew

This is the kind of reasoning that hides problems rather than
correcting them.  Sooner or later the problems will show up in
other - maybe worse - forms.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss