Does anyone know of a tool that can scan a dataset and report duplication 
statistics? I'm not looking for something incredibly efficient, but I'd 
like to know how much dedup would actually benefit our dataset: HiRISE 
has a large set of spacecraft data (images) that could potentially 
contain large amounts of redundancy, or none at all. Other up-and-coming 
missions also have large data volumes with a lot of duplicate image 
information and small budgets; with dedup in OpenSolaris there is a good 
business case to invest in Sun/OpenSolaris rather than buy the cheaper 
storage (+ Linux?) that can simply hold everything as is.

If someone feels like coding up a tool that basically builds a file of 
checksums and counts how many times each checksum gets hit across a 
dataset, I would be willing to run it and provide feedback. :)

-Tim

Charles Soto wrote:
> Oh, I agree.  Much of the duplication described is clearly the result of
> "bad design" in many of our systems.  After all, most of an OS can be served
> off the network (diskless systems, etc.).  But much of the duplication I'm
> talking about isn't a matter of failing to use the most efficient system
> administration tricks.  Rather, it's about the fact that software (e.g.
> Samba) is used by people, and people don't always do things efficiently.
>
> Case in point:  students in one of our courses were hitting their quota by
> growing around 8GB per day.  Rather than simply agree that "these kids need
> more space," we had a look at the files.  Turns out just about every student
> copied a 600MB file into their own directories, as it was created by another
> student to be used as a "template" for many of their projects.  Nobody
> understood that they could use the file right where it sat.  Nope. 7GB of
> dupe data.  And these students are even familiar with our practice of
> putting "class media" on a read-only share (these files serve as similar
> "templates" for their own projects - you can create a full video project
> with just a few MB in your "project file" this way).
>
> So, while much of the situation is caused by "bad data management," there
> aren't always systems we can employ that prevent it.  Done right, dedup can
> certainly be "worth it" for my operations.  Yes, teaching the user the
> "right thing" is useful, but that user isn't there to know how to "manage
> data" for my benefit.  They're there to learn how to be filmmakers,
> journalists, speech pathologists, etc.
>
> Charles
>
>
> On 7/7/08 9:24 PM, "Bob Friesenhahn" <[EMAIL PROTECTED]> wrote:
>
>   
>> On Mon, 7 Jul 2008, Mike Gerdts wrote:
>>     
>>> As I have considered deduplication for application data, I see several
>>> things happening in various areas.
>>>       
>> You have provided an excellent description of gross inefficiencies in
>> the way systems and software are deployed today, resulting in massive
>> duplication.  Massive duplication is used to ease service deployment
>> and management.  Most of this massive duplication is not technically
>> necessary.
>>     
>
>

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
