The interesting collision is going to be file system level encryption vs. de-duplication as the former makes the latter pretty difficult.

dave johnson wrote:
Other storage systems do it by calculating a hash value for each file (or block), storing that value in a db, and then checking every newly committed file (or block) against the db for a match. If a match is found, the new file (or block) is replaced with a reference to the existing entry in the db.

The most common non-proprietary hash calculation for file-level deduplication seems to be the combination of SHA1 and MD5 together. Collisions have been shown to exist in MD5 and are theorized to exist in SHA1 by extrapolation, but the probability of collisions occurring simultaneously in both is to "small" as the capacity of ZFS is to "large" :)
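
As a rough illustration of the hash-and-lookup scheme described above, here is a minimal in-memory sketch. The `DedupStore` class, its `put`/`get` methods, and the reference counting are assumptions made for this example; they do not describe how ZFS or any particular storage product implements deduplication.

```python
import hashlib


class DedupStore:
    """Toy block store that keeps one copy of each distinct block.

    Hypothetical illustration only; not any real filesystem's design.
    """

    def __init__(self):
        self.blocks = {}    # fingerprint -> block data (stored once)
        self.refcount = {}  # fingerprint -> number of references to it

    @staticmethod
    def fingerprint(data: bytes) -> str:
        # Concatenate SHA1 and MD5 digests, so a false match would
        # require a simultaneous collision in both hashes.
        return hashlib.sha1(data).hexdigest() + hashlib.md5(data).hexdigest()

    def put(self, data: bytes) -> str:
        """Commit a block; store it only if its fingerprint is new."""
        key = self.fingerprint(data)
        if key not in self.blocks:
            self.blocks[key] = data
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key

    def get(self, key: str) -> bytes:
        """Return the block data for a previously returned fingerprint."""
        return self.blocks[key]
```

Committing the same block twice returns the same key and stores the data only once; the caller keeps the key in place of the data. A real system would also have to persist the table and decide whether to verify matching blocks byte-for-byte.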

While computationally intense, this would be a VERY welcome feature addition to ZFS, and given the infrastructure that already exists within the filesystem, it seems a prime candidate, though by no means a trivial one. I am not a programmer, so I do not have the expertise to spearhead such a movement, but I would think getting at least a placeholder "Goals and Objectives" page into the OZFS community pages would be a good start, even if movement on this doesn't come for a year or more.

Thoughts ?

-=dave

----- Original Message ----- From: "Gary Mills" <[EMAIL PROTECTED]>
To: "Erik Trimble" <[EMAIL PROTECTED]>
Cc: "Matthew Ahrens" <[EMAIL PROTECTED]>; "roland" <[EMAIL PROTECTED]>; <zfs-discuss@opensolaris.org>
Sent: Sunday, June 24, 2007 3:58 PM
Subject: Re: [zfs-discuss] zfs space efficiency


On Sun, Jun 24, 2007 at 03:39:40PM -0700, Erik Trimble wrote:
Matthew Ahrens wrote:
>Will Murnane wrote:
>>On 6/23/07, Erik Trimble <[EMAIL PROTECTED]> wrote:
>>>Now, wouldn't it be nice to have syscalls which would implement "cp"
>>>and
>>>"mv", thus abstracting it away from the userland app?

>A "copyfile" primitive would be great!  It would solve the problem of
>having all those "friends" to deal with -- stat(), extended
>attributes, UFS ACLs, NFSv4 ACLs, CIFS attributes, etc.  That isn't to
>say that it would have to be implemented in the kernel; it could
>easily be a library function.
>
I'm with Matt.  Having a "copyfile" library/sys call would be of
significant advantage.  In this case, we can't currently take advantage
of the CoW ability of ZFS when doing 'cp A B'  (as has been pointed out
to me).  'cp' simply opens file A with read(), opens a new file B with
write(), and then shuffles the data between the two.  Now, if we had a
copyfile(A,B) primitive, then the 'cp' binary would simply call this
function, and, depending on the underlying FS, it would get implemented
differently.  In UFS, it would work as it does now. For ZFS, it would
work like a snapshot, where file A and B share data blocks (at least
until someone starts to update either A or B).
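
The dispatch idea above can be sketched as follows. This is only an illustration under stated assumptions: `copyfile`, `copy_by_read_write`, and the `clone` parameter are hypothetical names, and `clone` stands in for whatever block-sharing operation a CoW filesystem might offer. No such ZFS interface is implied.

```python
CHUNK = 64 * 1024  # bytes moved per read/write round trip


def copy_by_read_write(src, dst):
    """Plain copy: open both files and shuffle the data across."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(CHUNK)
            if not chunk:
                break
            fout.write(chunk)


def copyfile(src, dst, clone=None):
    """Hypothetical copyfile(A, B) primitive.

    `clone` is an assumed, filesystem-specific callable that makes dst
    share src's data blocks. If it is missing or raises OSError, fall
    back to the ordinary read/write copy.
    """
    if clone is not None:
        try:
            clone(src, dst)
            return "cloned"
        except OSError:
            pass  # filesystem cannot share blocks; copy the bytes instead
    copy_by_read_write(src, dst)
    return "copied"
```

A 'cp' built on this would call copyfile(A, B) and never need to know which path was taken; the attribute and ACL handling mentioned earlier in the thread is left out of the sketch.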

Isn't this technique an instance of `deduplication', which seems to be
a hot idea in storage these days?  I wonder if it could be done
automatically, behind the scenes, in some fashion.

--
-Gary Mills- -Unix Support- -U of M Academic Computing and Networking-
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

