When exposing sets of MARC records as linked data, do you think it is better to
expose them in batch (collection) files or as individual RDF serializations? To
bastardize the Bard — “To batch or not to batch? That is the question.”
Suppose I am a medium-sized academic research library. Suppose my collection is
composed of approximately 3.5 million bibliographic records. Suppose I want to
expose those records via linked data. Suppose further that this will be done by
“simply” making RDF serialization files (XML, Turtle, etc.) accessible via an
HTTP filesystem. No scripts. No programs. No triple stores. Just files on an
HTTP file system coupled with content negotiation. Given these assumptions,
would you:
1. create batches of MARC records, convert them to MARCXML
   and then to RDF, and save these files to disc, or
2. parse the batches of MARC records into individual
   records, convert each into MARCXML and then RDF, and
   save these files to disc? (A rough sketch of option #2
   follows this list.)
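For what it's worth, a minimal sketch of the first half of option #2 might look
like the following, assuming Python and the pymarc library (my choices, not
anything implied above). It reads a batch of MARC records in transmission
format and writes one MARCXML file per record; converting each MARCXML file to
RDF would be a separate, subsequent step. The file and directory names are made
up for illustration.

```python
# split.py -- split a batch of MARC records into one MARCXML file per record
# (a sketch only; assumes pymarc is installed and every record carries a
#  control number in field 001)

import os
from pymarc import MARCReader, record_to_xml

BATCH = 'batches/records.mrc'   # hypothetical input batch
OUT = 'marcxml'                 # hypothetical output directory

os.makedirs(OUT, exist_ok=True)

with open(BATCH, 'rb') as handle:
    for record in MARCReader(handle):
        # use the control number (field 001) as the file name
        identifier = record['001'].value().strip()
        path = os.path.join(OUT, identifier + '.xml')
        # record_to_xml returns the serialized MARCXML for the record
        with open(path, 'wb') as out:
            out.write(record_to_xml(record))
```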
Option #1 would require heavy lifting against large files, but the number of
resulting files to save to disc would be relatively few and reasonably managed
in a single directory on disc. On the other hand, individual serializations
would not be directly accessible via their own URIs; they would only be
accessible by retrieving the collection file in which they reside. Moreover, a
mapping of individual URIs to collection files would need to be maintained.
Option #2 would be easier on computing resources, because processing many small
files is generally easier than processing a few very large ones. On the other
hand, the number of files generated by this option could not easily be managed
without a more sophisticated directory structure. (It is not feasible to put
3.5 million files in a single directory.) But I would still need to create a
mapping from each URI to its directory.
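Then again, the URI-to-directory mapping in option #2 does not have to be a
table that is maintained by hand; a common convention (my suggestion, nothing
required by linked data) is to derive the directory from a hash of the record's
identifier, so the mapping is just a function that any script or server
configuration can recompute. A hypothetical sketch:

```python
# shard.py -- compute a shallow, hash-based directory for a record identifier
# (a sketch; two hex characters per level gives 256 x 256 = 65,536 buckets,
#  or roughly fifty to sixty files per bucket for 3.5 million records)

import hashlib
import os

def path_for(identifier, extension, root='rdf'):
    digest = hashlib.md5(identifier.encode('utf-8')).hexdigest()
    return os.path.join(root, digest[0:2], digest[2:4],
                        identifier + '.' + extension)

# example: where the Turtle serialization of record 'abc123' would live
print(path_for('abc123', 'ttl'))   # -> rdf/e9/9a/abc123.ttl
```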
In either case, I would probably create a bunch of site map files denoting the
locations of my serializations — YAP (Yet Another Mapping).
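Since the sitemap protocol allows at most 50,000 URLs per sitemap file, 3.5
million records would translate into roughly seventy sitemap files plus a
sitemap index pointing at them. A rough sketch of generating them, with a
made-up base URL:

```python
# sitemaps.py -- write sitemap files, 50,000 URLs apiece
# (a sketch; BASE is a made-up URI prefix, and identifiers is assumed to be
#  an iterable of record identifiers)

BASE = 'https://example.org/records/'   # hypothetical URI prefix
LIMIT = 50000                           # per-file limit in the sitemap protocol

def write_sitemaps(identifiers):
    batch, count = [], 0
    for identifier in identifiers:
        batch.append('  <url><loc>%s%s</loc></url>' % (BASE, identifier))
        if len(batch) == LIMIT:
            count += 1
            write_one(count, batch)
            batch = []
    if batch:
        write_one(count + 1, batch)

def write_one(number, urls):
    with open('sitemap-%04d.xml' % number, 'w') as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        out.write('\n'.join(urls) + '\n')
        out.write('</urlset>\n')
```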
I’m leaning towards Option #2 because individual URIs could be resolved more
easily with “simple” content negotiation.
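To make that concrete, here is roughly the decision the negotiation layer would
make, expressed as a Python sketch even though, per the assumptions above, the
real work would be done by web server configuration rather than a script. The
media types, extensions, and directory layout are the hypothetical ones from
the sketches above:

```python
# negotiate.py -- conceptually, map a requested identifier plus an Accept
# header onto one of the serialization files saved to disc
# (a sketch; in practice this is server configuration, not a program)

import hashlib
import os

EXTENSIONS = {
    'application/rdf+xml': 'rdf',
    'text/turtle': 'ttl',
}

def negotiate(identifier, accept, root='rdf'):
    extension = EXTENSIONS.get(accept, 'rdf')   # default to RDF/XML
    digest = hashlib.md5(identifier.encode('utf-8')).hexdigest()
    return os.path.join(root, digest[0:2], digest[2:4],
                        identifier + '.' + extension)

print(negotiate('abc123', 'text/turtle'))   # -> rdf/e9/9a/abc123.ttl
```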
(Given my particular use case — archival MARC records — I don’t think I’d
really have more than a few thousand items, but I’m asking the question on a
large scale anyway.)
—
Eric Morgan