Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-07 Thread Bakul Shah
On Thu, 06 Jan 2011 22:42:15 PST Michael DeMan   wrote:
> To be quite honest, I too am skeptical about using de-dupe just based on
> SHA256.  In prior posts it was asked that the potential adopter of the
> technology provide the mathematical reason to NOT use SHA-256 only.
> However, if Oracle believes that it is adequate to do that, would it be
> possible for somebody to provide:
> 
> (A) The theoretical documents and associated mathematics specific to say
> one simple use case?

See http://en.wikipedia.org/wiki/Birthday_problem -- in
particular see section 5.1 and the probability table of
section 3.4.

> On Jan 6, 2011, at 10:05 PM, Edward Ned Harvey wrote:
> 
> >> I have been told that the checksum value returned by Sha256 is almost
> >> guaranteed to be unique. In fact, if Sha256 fails in some case, we have a
> >> bigger problem such as memory corruption, etc. Essentially, adding
> >> verification to sha256 is overkill.

Agreed.

> > Someone please correct me if I'm wrong.

OK :-)

> > Suppose you have 128TB of data.  That is ... you have 2^35 unique 4k
> > blocks of uniformly sized data.  Then the probability you have any
> > collision in your whole dataset is (sum(1 thru 2^35))*2^-256
> > Note: sum of integers from 1 to N is (N*(N+1))/2
> > Note: 2^35 * (2^35+1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35
> > Note: (N*(N+1))/2 in this case = 2^69 + 2^34
> > So the probability of data corruption in this case, is 2^-187 + 2^-222 ~=
> > 5.1E-57 + 1.5E-67
> > 
> > ~= 5.1E-57

I believe this is wrong. See the wikipedia article referenced
above.

p(n,d) = 1 - d!/(d^n*(d-n)!)

In your example n = 2^35, d = 2^256.  If you extrapolate the
256 bits row of the probability table referenced above, it is
somewhere between 10^-48 and 10^-51.

This may be easier to grasp: to get a 50% probability of a
collision with sha256, you need 4*10^38 blocks. For a
probability similar to disk error rates (10^-15), you need
1.5*10^31 blocks.
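
A quick sanity check of those two figures, using the birthday
approximation p ~= 1 - exp(-n^2/(2d)), i.e. n ~= sqrt(2*d*ln(1/(1-p))).
This is a back-of-the-envelope sketch in Python (mine, not from the
original thread):

  import math

  d = 2.0 ** 256                      # number of distinct sha256 values

  def blocks_for(p):
      # blocks needed for a collision probability of p with a d-valued hash
      return math.sqrt(2.0 * d * -math.log1p(-p))

  print("%.1e" % blocks_for(0.5))     # ~4.0e+38 blocks for a 50% chance
  print("%.1e" % blocks_for(1e-15))   # ~1.5e+31 blocks for disk-error-rate odds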


Re: [zfs-discuss] A few questions

2010-12-20 Thread Bakul Shah
On Mon, 20 Dec 2010 11:27:41 PST Erik Trimble   wrote:
> 
> The problem boils down to this:
> 
> When ZFS does a resilver, it walks the METADATA tree to determine what 
> order to rebuild things from. That means, it resilvers the very first 
> slab ever written, then the next oldest, etc.   The problem here is that 
> slab "age" has nothing to do with where that data physically resides on 
> the actual disks. If you've used the zpool as a WORM device, then, sure, 
> there should be a strict correlation between increasing slab age and 
> locality on the disk.  However, in any reasonable case, files get 
> deleted regularly. This means that for a slab B, written immediately
> after slab A, the probability is high that it WON'T be physically near slab A.
> 
> In the end, the problem is that using metadata order, while reducing the 
> total amount of work to do in the resilver (as you only resilver live 
> data, not every bit on the drive), increases the physical inefficiency 
> for each slab.  That is, seek time between cylinders begins to dominate 
> your slab reconstruction time.  In RAIDZ, this problem is magnified by 
> both the much larger average vdev size vs mirrors, and the necessity 
> that all drives containing that slab's information return their data before 
> the corrected data can be written to the resilvering drive.
> 
> Thus, current ZFS resilvering tends to be seek-time limited, NOT 
> throughput limited.  This is really the "fault" of the underlying media, 
> not ZFS.  For instance, if you have a raidZ of SSDs (where seek time is 
> negligible, but throughput isn't),  they resilver really, really fast. 
> In fact, they resilver at the maximum write throughput rate.   However, 
> HDs are severely seek-limited, so that dominates HD resilver time.

You guys may be interested in a solution I used in a totally
different situation.  There an identical tree data structure
had to be maintained on every node of a distributed system.
When a new node was added, it needed to be initialized with
an identical copy before it could be put in operation. But
this had to be done while the rest of the system was
operational, and there might even be updates from a central node
during the `mirroring' operation. Some of these updates could
completely change the tree!  Starting at the root was not
going to work since a subtree being copied might cease to
exist partway through and have its space reused! In a way this is
a similar problem (but worse!). I needed something foolproof
and simple.

My algorithm started copying sequentially from the start.  If
N blocks had already been copied when an update came along,
updates of any block with block# > N were ignored (since the
sequential copy would get to them eventually).  Updates of
any block# <= N were queued up (a further update of the same
block would overwrite the older queued update, to reduce work).
Periodically the queue would be flushed out to the new node. This
was paced so as not to affect normal operation much.
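
To make that concrete, here is a minimal sketch of the scheme in Python
(illustrative only -- the class and names are invented, and real block
I/O, locking and pacing are omitted):

  class MirrorCopy:
      # Sequentially copy blocks 0..nblocks-1 from src to dst while the
      # normal write path keeps updating src.
      def __init__(self, src, dst, nblocks):
          self.src, self.dst, self.nblocks = src, dst, nblocks
          self.copied = 0      # blocks [0, copied) are already on dst
          self.pending = {}    # block# -> latest data, for already-copied blocks

      def on_update(self, blkno, data):
          # The normal write path has already applied this update to src.
          # Updates ahead of the copy cursor are ignored (the sequential
          # pass will pick them up later); updates behind it are queued,
          # and a newer update to the same block replaces the older one.
          if blkno < self.copied:
              self.pending[blkno] = data

      def copy_some(self, batch):
          # One paced step of the sequential copy.
          end = min(self.copied + batch, self.nblocks)
          for b in range(self.copied, end):
              self.dst[b] = self.src[b]
          self.copied = end

      def flush_pending(self):
          # Called periodically, paced so as not to disturb normal operation.
          for b, data in self.pending.items():
              self.dst[b] = data
          self.pending.clear()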

I should think a variation would work for active filesystems.
You sequentially read some amount of data from all the disks
from which data for the new disk is to be prepared, and write it
out sequentially. Each time, read enough data that reading
time dominates any seek time. Handle concurrent updates as
above. If you dedicate N% of the time to resilvering, the total
time to complete the resilver will be 100/N times the sequential
read time of the whole disk. (For example: 1TB disk, 100MB/s IO
speed, 25% for resilver => under 12 hours.)  How much worse
this gets depends on the amount of updates during
resilvering.
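
A quick back-of-the-envelope check of that figure (my own arithmetic, not
from the original post):

  capacity = 1e12            # 1TB disk, in bytes
  rate = 100e6               # 100MB/s sequential throughput
  share = 0.25               # fraction of time dedicated to resilvering
  hours = capacity / rate / share / 3600
  print(round(hours, 1))     # -> 11.1, i.e. "under 12 hours"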

At the time of resilvering your FS is more likely to be near
full than near empty so I wouldn't worry about optimizing the
mostly empty FS case.

Bakul


Re: [zfs-discuss] Yager on ZFS

2007-12-06 Thread Bakul Shah
> The 45 byte score is the checksum of the top of the tree, isn't that
> right?

Yes. Plus an optional label.

> ZFS snapshots and clones save a lot of space, but the
> 'content-hash == address' trick means you could potentially save
> much more.

Especially if you carry around large files (disk images,
databases) that change.

> Though I'm still not sure how well it scales up -
> Bigger working set means you need longer (more expensive) hashes
> to avoid a collision, and even then it's not guaranteed.

> When I last looked they were still using SHA-160
> and I ran away screaming at that point :)

You need 2^80 blocks for a 50%+ probability that a pair will
have the same SHA-160 hash (by the birthday paradox).  Crypto
attacks are not relevant.  For my personal use I am willing
to live with these odds until my backups cross 2^40 distinct
blocks (greater than 8 Petabytes)!


Re: [zfs-discuss] Yager on ZFS

2007-12-05 Thread Bakul Shah
> I have budget constraints then I can use only user-level storage.
> 
> until I discovered zfs I used subversion and git, but none of them is
> designed to manage gigabytes of data, some to be versioned, some to be
> unversioned.
> 
> I can't afford silent data corruption and, if the final response is "*now*
> there is no *real* opensource software alternative to zfs automatic
> checksumming and simple snapshotting" I'll be a happy solaris user (for
> data storage), a happy linux user (for everyday work), and an unhappy
> offline windows user (for some video-related activity I can't do with
> linux).

Note that I don't wish to argue for/against zfs/billtodd but
the comment above about "no *real* opensource software alternative to
zfs automatic checksumming and simple snapshotting" caught my eye.

There is an open source alternative for archiving that works
quite well.  venti has been available for a few years now.
It runs on *BSD, linux, macOS & plan9 (its native os).  It
uses strong crypto checksums, stored separately from the data
(in the pointer blocks), so you get a guarantee against silent
data corruption similar to what ZFS provides.

You can back up a variety of filesystems (ufs, hfs, ext2fs,
fat) or use it to back up a file tree.  Each backup results
in a single 45 byte "score" containing the checksum of the root
pointer block.  Using this score you can retrieve the entire
backup.  Further, it stores only one copy of a data block
regardless of what files or which backup it may belong to. In
effect every "full backup" is an incremental backup (only
changed data blocks and changed or new ptr blocks are
stored).
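
To make the "only one copy of each data block" idea concrete, here is a
toy content-addressed block store in Python (an illustration of the
general hash-as-address technique only, not of venti's actual protocol or
on-disk format):

  import hashlib

  class BlockStore:
      def __init__(self):
          self.blocks = {}             # score (hash of contents) -> block data

      def put(self, data):
          score = hashlib.sha1(data).hexdigest()   # 160-bit score, as in venti
          # A block that is already present is not stored again, so a
          # repeated "full backup" only adds blocks that actually changed.
          self.blocks.setdefault(score, data)
          return score

      def get(self, score):
          return self.blocks[score]

  store = BlockStore()
  s1 = store.put(b"same block")
  s2 = store.put(b"same block")
  assert s1 == s2 and len(store.blocks) == 1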

So it is really an "archival" server.  You don't take
snapshots but you do a backup.  However you can NFS-mount a
venti and all your backups will show up under directories
named for the filesystem and the backup date.

Ideally you'd store a venti on RAID storage.  You can even
copy a bunch of ventis into another one, store its
arenas on CDs or DVDs, and so on.

It is not as fast as ZFS, nor anywhere near as easy to use, and
its intended use is not the same as ZFS's (it is not a primary
filesystem). But for what it does, it is not bad at all!

Unlike ZFS, it fits best where you have a fast filesystem for
speed critical use, venti for backups and RAID for
redundancy.

Google for "venti sean dorward".  If interested, go to
http://swtch.com/plan9port/ and pick up plan9port (a
collection of programs from plan9, not just venti).  See
http://swtch.com/plan9port/man/man8/index.html for how to use
venti.

> I think for every fully digital person one's own data is vital, and almost
> everyone would reply "NONE" to your question "what level of risk user is
> willing to tolerate".

NONE is not possible.  It is a question of how much risk you
are willing to tolerate for what cost.  Thankfully, these
days you have a variety of choices and much much lower cost
for a given degree of risk compared to just a few years ago!


Re: [zfs-discuss] An Academic Sysadmin's Lament for ZFS ?

2007-09-08 Thread Bakul Shah
> why are you providing disk space to students?
> 
> When you solve this problem, the quota problem is moot.
> 
> NB. I managed a large University network for several years, and
> am fully aware of the costs involved.  I do not believe that the
> 1960s timeshare model will survive in such environments.

So are you saying you don't believe the network is the computer?
:-)


Re: [zfs-discuss] Resilvering speed?

2007-05-10 Thread Bakul Shah
> > It seems to me that once you copy meta data, you can indeed
> > copy all live data sequentially.
> 
> I don't see this, given the top down strategy. For instance, if I 
> understand the transactional update process, you can't commit the 
> metadata until the data is in place.
> 
> Can you explain in more detail your reasoning?

First realize that this is just a thought experiment -- I
haven't read much source code in any detail as yet and it is
entirely likely what I am suggesting can not work or there is
no need for it!  With that caveat...

http://blogs.sun.com/bonwick/entry/smokin_mirrors talks about
top-down resilvering.   That is, copy the root blocks first
(for example the uberblock), then the blocks they point to
and so on [1].  A major goal here is to minimize data loss in
case of a second failure.  Completely losing a metadata block
means you can't access anything it points to, so metadata
blocks are far more precious than data blocks.  This is
different from a normal update transaction, where the
copy-on-write proceeds bottom up -- which is what you are
talking about.  A major goal for a normal update is to ensure
that at all times a consistent filesystem structure is seen
by all.

All I was suggesting was that once all the metadata is copied
(or "resilvered"), switch to sequential copying to maximize
performance.  This does make checking the validity of a data
block more complicated.  So instead of copying the data of file1
and then file2 and so on, just copy blocks in the most
efficient order, save their checksums and periodically
validate a whole bunch.  In fact, since metadata is read
first, you can roughly figure out which metadata blocks will be
needed, and when, to check data block validity (because you know
where data blocks are stored).
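
A rough sketch of that "copy in disk order, validate checksums in batches"
idea in Python (purely illustrative -- the checksum function, data
structures and batch size are assumptions for the example, not ZFS
internals):

  import zlib

  def sequential_resilver(src, dst, expected_crc, batch=1024):
      # src, dst: block sequences in on-disk order; expected_crc maps
      # block# -> the checksum recorded in the already-resilvered metadata.
      pending = {}                        # block# -> checksum of copied data
      bad = []
      for blkno, data in enumerate(src):  # one sequential pass, few seeks
          dst[blkno] = data
          pending[blkno] = zlib.crc32(data)
          if len(pending) >= batch:       # validate a whole bunch at once
              bad.extend(b for b, c in pending.items() if c != expected_crc[b])
              pending.clear()
      bad.extend(b for b, c in pending.items() if c != expected_crc[b])
      return bad                          # blocks that need reconstruction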

> >   Given that a vast majority
> > of disk blocks in use will typically contain data, this is a
> > winning strategy from a performance point of view and still
> > allows you to retrieve a fair bit of data in case of a second
> > disk failure (checksumming will catch a case where good
> > metadata points to as yet uncopied data block).  If amount of
> > live data is > 50% of disk space you may as well do a disk
> > copy, perhaps skipping over already copied meta data.
> >
> > Not only that, you can even start using the disk being
> > resilvered right away for writes.  The new write will be
> > either to a) an already copied block
> 
> How can that be, under a COW regime?

I was talking about resilvering, not a normal update.  Copy
on write happens only for a normal update.  I was speculating
that you can do normal updates during resilvering.

Not sure if this is clear to anyone!

-- bakul

[1] Top down resilvering seems very much like a copying
garbage collector.  That similarity makes me wonder if the
physical layout can be rearranged in some way for a more
efficient access to data -- the idea is to resilver and
compactify at the same time on one of the mirrors and then
make it the master and resilver the other mirrors.  Nah...
probably not worth the hassle.  [Again, I suspect no one else
understands what I am talking about:-)]


Re: [zfs-discuss] Resilvering speed?

2007-05-09 Thread Bakul Shah
> Robert Milkowski wrote:
> > Hello Mario,
> > 
> > Wednesday, May 9, 2007, 5:56:18 PM, you wrote:
> > 
> > MG> I've read that it's supposed to go at full speed, i.e. as fast as
> > MG> possible. I'm doing a disk replace and what zpool reports kind of
> > MG> surprises me. The resilver goes on at 1.6MB/s. Did resilvering get
> > MG> throttled at some point between the builds, or is my ATA controller
> > MG> having bigger issues?
> > 
> > Lot of small files perhaps? What kind of protection have you used?
> 
> Good question.  Remember that resilvering is done in time order and from
> the top-level metadata down, not by sequentially blasting bits.  Jeff
> Bonwick describes this as top-down resilvering.
>   http://blogs.sun.com/bonwick/entry/smokin_mirrors
> 
>  From a MTTR and performance perspective this means that ZFS recovery time
> is a function of the amount of space used, where it is located (!), and the
> validity of the surviving or regenerated data.  The big win is the amount of
> space used, as most file systems are not full.
>   -- richard

It seems to me that once you copy meta data, you can indeed
copy all live data sequentially.  Given that a vast majority
of disk blocks in use will typically contain data, this is a
winning strategy from a performance point of view and still
allows you to retrieve a fair bit of data in case of a second
disk failure (checksumming will catch a case where good
metadata points to an as yet uncopied data block).  If the amount of
live data is > 50% of disk space you may as well do a straight disk
copy, perhaps skipping over already copied metadata.

Not only that, you can even start using the disk being
resilvered right away for writes.  The new write will be
either to a) an already copied block or b) an as yet uncopied
block.  In case a) there is nothing more to do.  In case b)
the copied-from block will have the new data, so in both cases
the right thing happens.  Any potential window between
reading a copied-from block and writing to the copied-to block
can be closed with careful coding/locking.
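
A minimal sketch of that write-path decision in Python (my own
illustration; the names and data structures are invented, not ZFS code):

  def write_during_resilver(src, dst, copied_upto, blkno, data):
      # Normal write path while a sequential resilver is in progress.
      src[blkno] = data           # the surviving copy always gets the write
      if blkno < copied_upto:
          dst[blkno] = data       # case a) already copied: mirror it now
      # case b) not yet copied: nothing extra to do -- the sequential pass
      # will pick up the new data from src when it reaches this block.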

If a second disk fails during copying, the current strategy
doesn't buy you much in almost any case.  You really don't want
to go through a zillion files looking for survivors.  If you
have a backup, you will restore from that rather than look
through the debris.  Not to mention that the window for a
potentially catastrophic failure becomes much larger if
resilvering is significantly slower.

Comments?


Re: [zfs-discuss] ZFS vs UFS2 overhead and may be a bug?

2007-05-07 Thread Bakul Shah
> Pawel Jakub Dawidek wrote:
> > This is what I see on Solaris (hole is 4GB):
> > 
> > # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=128k
> > real   23.7
> > # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=128k
> > real   21.2
> > 
> > # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=4k
> > real   31.4
> > # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=4k
> > real 7:32.2
> 
> This is probably because the time to execute this on ZFS is dominated by 
> per-systemcall costs, rather than per-byte costs.  You are doing 32x more 
> system calls with the 4k blocksize, and it is taking 20x longer.
> 
> That said, I could be wrong, and yowtch, that's much slower than I'd like!

You missed my earlier post where I showed that accessing a holey
file takes much longer than accessing a regular data file for
block sizes of 4k and below.  I will repeat the most dramatic
difference:

                     ZFS                UFS2
                Elapsed  System    Elapsed  System
md5 SPACY        210.01   77.46     337.51   25.54
md5 HOLEY        856.39  801.21      82.11   28.31

I used md5 because all but a couple of syscalls are for
reading the file (with a buffer of 1K).  dd would make
an equal number of calls for writing.

For both file systems and both cases the filesize is the same
but SPACY has 10GB allocated while HOLEY was created with
truncate -s 10G HOLEY.

Look at the system times.  On UFS2 system time is a little
bit more for the HOLEY case because it has to clear a block.
On ZFS it is over 10 times more!  Something is very wrong.


[zfs-discuss] ZFS vs UFS2 overhead and may be a bug?

2007-05-03 Thread Bakul Shah
[originally reported for ZFS on FreeBSD but Pawel Jakub Dawidek
 says this problem also exists on Solaris, hence this email.]

Summary: on ZFS, the overhead for reading a hole seems far worse
than actually reading from a disk.  Small buffers are used to
make this overhead more visible.

I ran the following script on both ZFS and UFS2 filesystems.

[Note that on FreeBSD cat uses a 4k buffer and md5 uses a 1k
 buffer. On Solaris you can replace them with dd with
 respective buffer sizes for this test and you should see
 similar results.]

$ dd </dev/zero >SPACY bs=1m count=10240   # 10G zero bytes allocated
$ truncate -s 10G HOLEY                    # no space allocated

$ time dd <SPACY >/dev/null bs=1m   # A1
$ time dd <HOLEY >/dev/null bs=1m   # A2
$ time cat SPACY >/dev/null         # B1
$ time cat HOLEY >/dev/null         # B2
$ time md5 SPACY                    # C1
$ time md5 HOLEY                    # C2

I have summarized the results below.

                        ZFS                UFS2
                   Elapsed  System    Elapsed  System   Test
dd SPACY bs=1m      110.26   22.52     340.38   19.11   A1
dd HOLEY bs=1m       22.44   22.41      24.24   24.13   A2

cat SPACY           119.64   33.04     342.77   17.30   B1
cat HOLEY           222.85  222.08      22.91   22.41   B2

md5 SPACY           210.01   77.46     337.51   25.54   C1
md5 HOLEY           856.39  801.21      82.11   28.31   C2


A1, A2:
Numbers are more or less as expected.  When doing large
reads, reading from "holes" takes far less time than from a
real disk.  We also see that the UFS2 disk is about 3 times
slower for sequential reads.

B1, B2:
UFS2 numbers are as expected but ZFS numbers for the HOLEY
file are much too high.  Why should *not* going to a real
disk cost more?  We also see that UFS2 handles holey files 10
times more efficiently than ZFS!

C1, C2:
Again, UFS2 numbers and C1 numbers for ZFS are as expected,
but C2 numbers for ZFS are very high.  md5 uses BLKSIZ (==
1k) sized reads and does hardly any other system calls.  For
ZFS each syscall takes 76.4 microseconds while UFS2 syscalls
are 2.7 us each!  zpool iostat shows there is no IO to the
real disk so this implies that for the HOLEY case zfs read
calls have a significantly higher overhead or there is a bug.
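
Those per-syscall figures are just the system time divided by the number
of 1k reads needed to cover a 10G file; a quick check (my own arithmetic,
for illustration):

  reads = 10 * 2**30 // 1024         # number of 1k reads over a 10G file
  print(801.21 / reads * 1e6)        # ZFS  HOLEY: ~76.4 us per syscall
  print(28.31 / reads * 1e6)         # UFS2 HOLEY: ~2.7 us per syscall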

Basically, the C tests just confirm what we find in the B tests.