Be careful about assuming too much on this.
 
When I started working with S3, the system required an MD5 sum on upload and 
would return that value as the "etag" header on requests as well. I therefore 
assumed that this was integral to the system and was a good way to compare 
local files against the remote copies.
 
Then, maybe a year or two ago, Amazon introduced chunked uploads, so that you 
could send files in pieces and reassemble them once they got to S3. This was 
good, because it eliminated problems with huge files failing to upload due to 
network hiccups. I went ahead and implemented it in my scripts. Then, all of a 
sudden, I started getting invalid checksums. It turns out that for multipart 
uploads, Amazon now creates ETag identifiers that are not the MD5 sum of the 
underlying file.
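
For anyone who runs into the same thing, my understanding is that those 
multipart ETags are computed roughly as in the sketch below (Python): MD5 each 
part, take the MD5 of the concatenated binary digests, and append the number 
of parts. The part size here is an assumption; it has to match whatever chunk 
size the upload actually used, or the result won't line up.

import hashlib

def multipart_etag(path, part_size=8 * 1024 * 1024):
    # MD5 each part, then MD5 the concatenation of those binary
    # digests, and append "-<number of parts>". part_size must match
    # the chunk size that was actually used for the upload.
    part_digests = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b''.join(part_digests))
    return "%s-%d" % (combined.hexdigest(), len(part_digests))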
 
I now store the checksum as a separate piece of header metadata, and my sync 
script does periodically compare against this. But since this is just metadata, 
checking it doesn't really prove anything about the underlying file that Amazon 
has. To verify that, I would need a script that actually retrieves the file and 
reruns the checksum. I have not done this yet, although it is on my to-do list. 
Ideally this would happen on an Amazon server so that I wouldn't have to send 
the file back and forth.
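
For what it's worth, the metadata approach boils down to something like this 
(a boto sketch; the bucket and key names are made up). It stashes the local 
MD5 as an x-amz-meta header at upload time and compares against it later. 
Again, a match only proves the metadata agrees with my local file, not that 
Amazon's stored bytes do.

import hashlib
from boto.s3.connection import S3Connection

def local_md5(path):
    m = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            m.update(chunk)
    return m.hexdigest()

conn = S3Connection()  # credentials come from the usual boto config
bucket = conn.get_bucket('my-preservation-bucket')  # hypothetical bucket

# at upload time, record our own checksum as user metadata
key = bucket.new_key('masters/example.tif')  # hypothetical key name
key.set_metadata('md5', local_md5('example.tif'))
key.set_contents_from_filename('example.tif')

# later, the sync check compares the local MD5 to the stored metadata
remote = bucket.get_key('masters/example.tif')
if remote.get_metadata('md5') != local_md5('example.tif'):
    print('checksum metadata mismatch for example.tif')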
 
In any case, my main point is: don't assume that you can just check against a 
checksum from the API to verify a file for digital preservation purposes.
 
-David
 
 
 
 
 
__________
 
David Dwiggins
Systems Librarian/Archivist, Historic New England
141 Cambridge Street, Boston, MA 02114
(617) 994-5948
ddwigg...@historicnewengland.org
http://www.historicnewengland.org
>>> Joshua Welker <jwel...@sbuniv.edu> 1/11/2013 2:45 PM >>>
Thanks for bringing up the issue of the cost of making sure the data is 
consistent. We will be using DSpace for now, and I know DSpace has some 
checksum functionality built in out-of-the-box. It shouldn't be too difficult 
to write a script that loops through DSpace's checksum data and compares it 
against the files in Glacier. From the Glacier FAQ on Amazon's site, it looks 
like they provide an archive inventory (updated daily) that can be downloaded 
as JSON, and I've seen users report that this inventory includes checksum data. 
So hopefully it will just be a matter of comparing the local 
checksum to the Glacier checksum, and that would be easy enough to script.
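
If it helps, the comparison I have in mind would look roughly like the sketch 
below (Python; the inventory field names are from Amazon's documentation, and 
the mapping from ArchiveDescription to local path is an assumption about how 
we would describe archives at upload time). One wrinkle: the Glacier inventory 
checksum is a SHA-256 "tree hash" computed over 1 MB chunks rather than a 
plain MD5, so it has to be recomputed from the local file rather than taken 
straight from DSpace's checksum checker.

import hashlib
import json

MB = 1024 * 1024

def sha256_tree_hash(path):
    # Glacier-style tree hash: SHA-256 of each 1 MB chunk, then hash
    # pairs of digests together until a single root digest remains.
    digests = []
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(MB), b''):
            digests.append(hashlib.sha256(block).digest())
    if not digests:
        digests = [hashlib.sha256(b'').digest()]
    while len(digests) > 1:
        paired = []
        for i in range(0, len(digests), 2):
            if i + 1 < len(digests):
                paired.append(hashlib.sha256(digests[i] + digests[i + 1]).digest())
            else:
                paired.append(digests[i])
        digests = paired
    return digests[0].hex()

# inventory.json is the daily archive inventory retrieved from Glacier;
# local_paths maps each ArchiveDescription to a local file (hypothetical).
inventory = json.load(open('inventory.json'))
local_paths = {'example.tif': '/dspace/assetstore/example.tif'}

for archive in inventory['ArchiveList']:
    path = local_paths.get(archive['ArchiveDescription'])
    if path and sha256_tree_hash(path) != archive['SHA256TreeHash']:
        print('MISMATCH: ' + archive['ArchiveDescription'])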

Josh Welker


-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ryan Eby
Sent: Friday, January 11, 2013 11:37 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Digital collection backups

As Aaron alludes to, your decision should be based on your real needs, and the 
options might not be mutually exclusive.

LOCKSS/MetaArchive might be worth the money if it is the community archival 
aspect you are going for. Depending on your institution, being a participant 
might make political/mission sense regardless of the storage needs, and it 
could be just a specific collection that makes sense to include.

Glacier is a great choice if you are looking to spread a backup across regions. 
S3 is similar if you also want to benefit from CloudFront (the CDN setup) to 
take load off your institution's server (you can now use CloudFront with your 
own origin server as well). Depending on your bandwidth this might be worth the 
money regardless of LOCKSS participation (which can be more of a dark archive). 
Amazon also tends to drop prices over time rather than raise them, but as with 
any outsourced service you have to plan for the possibility that it won't exist 
in the future. Also look closely at Glacier's pricing for checking your data 
for consistency: there have been a few papers on the cost of making sure Amazon 
really has the proper data, depending on how often your requirements call for 
you to check.

Another option, if you are just looking for more geographic placement, is 
finding an institution or service provider that will colocate. There may be 
another small institution that would love to shove a cheap box with hard drives 
on your network in exchange for the same. It's not as involved/formal as 
LOCKSS, but it gives you something you control to satisfy your requirements. It 
could also be as low-tech as shipping SSDs to another institution that then 
runs some BagIt checksums on the drive, etc.
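
For the BagIt piece, the receiving end's check could be as simple as something 
like this (using the Library of Congress bagit Python module; the drive path 
is made up):

import bagit

# hypothetical mount point for the shipped drive
bag = bagit.Bag('/mnt/shipped_drive/collection_001')
if bag.is_valid():
    print('fixity checks out')
else:
    print('validation failed, compare manifests by hand')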

All of the above should be scriptable in your workflow. Just need to decide 
what you really want out of it.

Eby


On Fri, Jan 11, 2013 at 11:52 AM, Aaron Trehub <treh...@auburn.edu> wrote:

> Hello Josh,
>
> Auburn University is a member of two Private LOCKSS Networks: the 
> MetaArchive Cooperative and the Alabama Digital Preservation Network 
> (ADPNet).  Here's a link to a recent conference paper that describes 
> both networks, including their current pricing structures:
>
> http://conference.ifla.org/past/ifla78/216-trehub-en.pdf
>
> LOCKSS has worked well for us so far, in part because supporting 
> community-based solutions is important to us.  As you point out, 
> however, Glacier is an attractive alternative, especially for 
> institutions that may be more interested in low-cost, low-throughput 
> storage and less concerned about entrusting their content to a 
> commercial outfit or having to pay extra to get it back out.  As with 
> most things, you pay your money--more or less, depending--and make your 
> choice.  And take your risks.
>
> Good luck with whatever solution(s) you decide on.  They need not be 
> mutually exclusive.
>
> Best,
>
> Aaron
>
> Aaron Trehub
> Assistant Dean for Technology and Technical Services Auburn University 
> Libraries
> 231 Mell Street, RBD Library
> Auburn, AL 36849-5606
> Phone: (334) 844-1716
> Skype: ajtrehub
> E-mail: treh...@auburn.edu
> URL: http://lib.auburn.edu/
>
>
