The Library of Congress has several tools for creating and working with BagIt
bags.

A Java command-line tool and library:
https://github.com/LibraryOfCongress/bagit-java

A Python command-line tool and library:
https://github.com/LibraryOfCongress/bagit-python

Or a standalone Java desktop application (GUI-based):
https://github.com/LibraryOfCongress/bagger

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Joe 
Hourcle
Sent: Saturday, January 24, 2015 10:07 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Checksums for objects and not embedded metadata

On Jan 23, 2015, at 5:35 PM, Kyle Banerjee wrote:

> Howdy all,
> 
> I've been toying with the idea of embedding DOI's in all our digital 
> assets and possibly inserting/updating other metadata as well. 
> However, doing this would alter checksums created using normal methods.
> 
> Is there a practical/easy way to checksum only the objects themselves 
> without the metadata? If the metadata in a tiff or other kind of file 
> is modified, it does nothing to the actual object. Since providing 
> more complete metadata within objects makes them more 
> usable/identifiable and might simplify migrations down the road, it 
> seems like this wouldn't be a bad way to go.


The only file format that I'm aware of that has a provision for this is FITS 
(Flexible Image Transport System), which has a concept of a 'CHECKSUM' and a 
'DATASUM'.  (The 'DATASUM' is the checksum for only the payload portion; the 
'CHECKSUM' includes the metadata.)[1]  It's possible that there are others, but 
I suspect that most consumer file formats won't have specific provisions for 
this.

The problem with 'metadata' in a lot of file formats is that it's just stored 
in arbitrary segments -- you'd have to have a program that knew which segments 
were considered 'headers' vs. not.  It might be easier to have the program 
compute a separate checksum for each segment, so that if the modifications 
change the segments' order, the file would still be considered valid.
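
As a rough sketch of the per-segment idea (not tied to any real file format), 
the comparison could treat the checksums as an unordered collection.  The 
segment contents and function names below are made up for illustration; a real 
tool would need a format-aware parser to split the file into segments first.

```python
import hashlib
from collections import Counter

def segment_digests(segments):
    """Return a multiset of SHA-256 hex digests, one per segment."""
    return Counter(hashlib.sha256(seg).hexdigest() for seg in segments)

def same_segments(a, b):
    """True if both files hold the same segments, in any order."""
    return segment_digests(a) == segment_digests(b)

# Hypothetical segments standing in for the parsed pieces of a file.
original = [b"HEADER:title=foo", b"PIXELS:\x00\x01\x02", b"TRAILER"]
reordered = [b"PIXELS:\x00\x01\x02", b"TRAILER", b"HEADER:title=foo"]
modified = [b"HEADER:title=foo", b"PIXELS:\x00\x01\x03", b"TRAILER"]

print(same_segments(original, reordered))  # True
print(same_segments(original, modified))   # False
```

Using a Counter rather than a set means duplicated segments still have to 
match in number, not just in content.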

Of course, I personally don't like changing files if I can help it.  If it were 
me, I'd keep the metadata outside the file; if you're using BagIt, you could 
easily add additional metadata outside of the data directory.[2]
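
To illustrate, here is a hand-rolled sketch of a minimal bag layout using only 
the Python standard library.  It is not the bagit-python API; the function 
name, the file name and the DOI are placeholders, and the version string is 
simply the draft that [2] describes.  The point is that the payload manifest 
only covers files under data/, so changing the DOI in bag-info.txt leaves the 
payload checksums alone.

```python
import hashlib
import tempfile
from pathlib import Path

def write_minimal_bag(bag_dir, payload, info):
    """Write payload files under data/ and metadata in bag-info.txt.

    Returns the text of the payload manifest (manifest-sha256.txt).
    """
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    # Bag declaration; version assumed to match the draft spec.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    manifest = "\n".join(manifest_lines) + "\n"
    (bag / "manifest-sha256.txt").write_text(manifest, encoding="utf-8")
    # Metadata lives here, outside the data directory.
    (bag / "bag-info.txt").write_text(
        "".join(f"{key}: {value}\n" for key, value in info.items()),
        encoding="utf-8",
    )
    return manifest

def demo():
    payload = {"image.tif": b"pretend these are TIFF bytes"}
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder DOIs, for illustration only.
        before = write_minimal_bag(
            tmp, payload, {"External-Identifier": "doi:10.1234/example"}
        )
        after = write_minimal_bag(
            tmp, payload, {"External-Identifier": "doi:10.1234/changed"}
        )
    return before, after

before, after = demo()
print(before == after)  # True
```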

If you're just doing this internally, and don't need the DOI to be attached to 
the file when it's served, you could also look into file systems that support 
arbitrary metadata.  Older Macs did this with a 'data fork' and a 'resource 
fork', but you had to have a service that knew to only send the data fork.  
Other OSes support forks, but some also have 'extended file attributes', which 
allow you to attach a few key/value pairs to the file (exact limits depend on 
the OS).
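
For what it's worth, a minimal sketch of the extended-attribute approach in 
Python might look like the following.  os.setxattr and os.getxattr are only 
available on Linux, and the file system has to allow user attributes, so the 
sketch falls back quietly when they aren't supported.  The attribute name 
'user.doi' and the DOI are made up for the example.

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    """Checksum of the file contents only; xattrs are not part of them."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()

def tag_with_doi(path, doi):
    """Try to store the DOI in a user xattr; return True on success."""
    if not hasattr(os, "setxattr"):  # e.g. not running on Linux
        return False
    try:
        # 'user.doi' is a hypothetical attribute name.
        os.setxattr(path, "user.doi", doi.encode("utf-8"))
    except OSError:  # file system does not allow user attributes
        return False
    return True

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"payload bytes")
    path = tmp.name

try:
    before = sha256_of(path)
    tagged = tag_with_doi(path, "doi:10.1234/example")  # placeholder DOI
    after = sha256_of(path)
    if tagged:
        print(os.getxattr(path, "user.doi").decode("utf-8"))
    print(before == after)  # True
finally:
    os.remove(path)
```

Keep in mind that attributes like this are easily lost when a file is copied 
to another file system or served over HTTP, which is why this only makes sense 
for internal use.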

-Joe


[1] http://fits.gsfc.nasa.gov/registry/checksum.html
[2] https://tools.ietf.org/html/draft-kunze-bagit ; 
http://en.wikipedia.org/wiki/BagIt
