-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
Sent: Wednesday, April 06, 2011 9:44 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file
Can't you have a legal "MARC" file that does NOT have 4500 in those
leader positions? It's just not legal "Marc21", right? Other marc
formats may specify or even allow flexibility in the things these bytes
specify:
* Length of the length-of-field portion
* Number of characters in the starting-character-position portion of a
Directory entry
* Number of characters in the implementation-defined portion of a Directory
entry
Or, um, 23, which I guess is left to the specific Marc implementation (i.e.,
Marc21 is one such) to use for its own purposes.
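To make the four positions above concrete, here is a minimal Python sketch that pulls leader positions 20-23 out of a 24-byte leader and names them. The class and field names (EntryMap, length_of_field_len and so on) are made up for illustration and are not taken from any MARC library; the only check applied is the MARC 21 convention that these four characters read "4500".

```python
from typing import NamedTuple


class EntryMap(NamedTuple):
    """Leader positions 20-23 (field names are illustrative, not standard)."""
    length_of_field_len: str   # leader/20: length of the length-of-field portion
    start_pos_len: str         # leader/21: length of the starting-character-position portion
    impl_defined_len: str      # leader/22: length of the implementation-defined portion
    undefined: str             # leader/23: left to the specific MARC flavour


def parse_entry_map(leader: bytes) -> EntryMap:
    """Split the last four leader characters into named parts."""
    if len(leader) < 24:
        raise ValueError("a MARC leader is 24 bytes long")
    # errors="replace" keeps one character per byte even for non-ASCII junk.
    tail = leader[20:24].decode("ascii", errors="replace")
    return EntryMap(*tail)


def is_marc21_entry_map(entry_map: EntryMap) -> bool:
    """MARC 21 fixes these four characters to 4, 5, 0, 0."""
    return "".join(entry_map) == "4500"
```

A flavour of MARC that allowed other values here would pass parse_entry_map but fail is_marc21_entry_map, which is the distinction being asked about.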
I have no idea how that should inform the 'marc magic'.
Is mime-type application/marc defined as specifically Marc21, or as any
Marc?
Jonathan
On 4/6/2011 12:28 PM, Ford, Kevin wrote:
Well, this brings us right up against the issue of files that adhere to their
specifications versus forgiving applications. Think of browsers and HTML.
Suffice it to say, MARC applications are quite likely to be forgiving of leader
positions 20-23. In my non-conforming MARC file and in Bill's, the leader
positions 20-21 ("45") seemed constant, but things could fall apart for
positions 22-23. So...
I present the following (in-line and attached, to preserve tabs) in an
attempt to straddle the two sides of this issue: applications forgiving of
non-conforming files. Should the two characters following 45 (at position 20)
*not* be 00, then the identification will be noted as "non-conforming." We
could classify this as reasonable identification but hardly ironclad (indeed,
simply checking to confirm that part of the first 24 positions match the
specification hardly constitutes a robust identification, but it's something).
It will also give you a mimetype now.
Would anyone like to test it out more fully on their own files?
#--------------------------------------------
# MARC 21 Magic (Third cut)
# Set at position 0
0 byte x
# leader position 20-21 must be 45
20 string 45
# leader starts with 5 digits, followed by codes specific to MARC format
0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z] MARC Bibliographic
!:mime application/marc
0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
!:mime application/marc
0 regex/1 (^[0-9]{5})[cdn][uvxy] MARC Holdings
!:mime application/marc
0 regex/1 (^[0-9]{5})[acdn][w] MARC Classification
!:mime application/marc
0 regex/1 (^[0-9]{5})[cdn][q] MARC Community
!:mime application/marc
# leader position 22-23, should be "00" but is it?
0 regex/1 (^.{21})([^0]{2}) (non-conforming)
!:mime application/marc
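For anyone who wants to try the same logic outside of file(1), here is a rough Python sketch of the rules above. It is not a translation of how file(1) evaluates magic entries; the function name identify and the rule table are invented for illustration. One deliberate difference: it compares leader positions 22-23 to "00" directly, as the description says, so it may not flag exactly the same records as the regex in the magic file.

```python
import re
from typing import Optional

# (pattern, label) pairs mirroring the status/type character classes
# used in the magic rules; order matters, first match wins.
_RULES = [
    (re.compile(rb"[0-9]{5}[acdnp][^bhlnqsu-z]"), "MARC Bibliographic"),
    (re.compile(rb"[0-9]{5}[acdnosx]z"), "MARC Authority"),
    (re.compile(rb"[0-9]{5}[cdn][uvxy]"), "MARC Holdings"),
    (re.compile(rb"[0-9]{5}[acdn]w"), "MARC Classification"),
    (re.compile(rb"[0-9]{5}[cdn]q"), "MARC Community"),
]


def identify(leader: bytes) -> Optional[str]:
    """Return a label for a 24-byte leader, or None if it does not look like MARC."""
    # Leader positions 20-21 must be "45".
    if len(leader) < 24 or leader[20:22] != b"45":
        return None
    for pattern, label in _RULES:
        if pattern.match(leader):
            # Leader positions 22-23 should be "00"; note it when they are not.
            if leader[22:24] != b"00":
                label += " (non-conforming)"
            return label
    return None
```

Reading the first 24 bytes of a file and passing them to identify gives roughly what the magic entry reports, minus the mimetype.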
If this works, I'll see about submitting this copy. Thanks to all for your
efforts already.
Warmly,
Kevin
--
Library of Congress
Network Development and MARC Standards Office
________________________________________
From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero [s...@unc.edu]
Sent: Sunday, April 03, 2011 14:01
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file
I am pretty sure that the marc4j standard reader ignores them; the
tolerant reader definitely does. Otherwise JHU might have about two
parseable records based on the mangled leaders that J-Rock gets stuck
with :-)
An analysis of the ~7M LC bib records from the scriblio.net data files
(~ Dec 2006) indicated that the leader has less than 8 bits of
information in it (Shannon-Weaver definition). This excludes the
initial length value, which is redundant given the end-of-record marker.
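The message does not say how that figure was computed, so the following is only one plausible reading: treat everything after the record length (leader positions 5-23) as a single symbol and take the Shannon entropy of its empirical distribution over a set of records. The function name leader_entropy_bits and the start/end parameters are assumptions for illustration; a different choice of positions would give a different number.

```python
import math
from collections import Counter
from typing import Iterable


def leader_entropy_bits(leaders: Iterable[bytes], start: int = 5, end: int = 24) -> float:
    """Empirical Shannon entropy, in bits, of leader[start:end] across records.

    The default slice skips positions 0-4 (the record length), as the
    message describes; which other positions to include is an assumption.
    """
    counts = Counter(leader[start:end] for leader in leaders)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)) over the distinct leader values seen.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

If every record carried the same leader apart from its length, this would come out as zero bits; two equally common leader values would give one bit.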
The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader.
The final characters of the leader are "450".
Also, I object to the phrase "decent MARC tool". Any tool capable of
dealing with MARC as it exists cannot afford the luxury of decency :-)
[ HA: "A clear conscience?"
BW: "Yes, Sir Humphrey."
HA: "When did you acquire this taste for luxuries?"]
Simon
On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens <o...@ostephens.com> wrote:
"I'm sure any decent MARC tool can deal with them, since decent MARC
tools are certainly going to be forgiving enough to deal with four
characters that apparently don't even really matter."
You say that, but I'm pretty sure Marc4J throws errors on MARC records
where these characters are incorrect.
Owen
On Fri, Apr 1, 2011 at 3:51 AM, William Denton<w...@pobox.com> wrote:
On 28 March 2011, Ford, Kevin wrote:
I couldn't get Simon's MARC 21 Magic file to work. Among other issues, I
received "line too long" errors. But, since I've been curious about this
for some time, I figured I'd take a whack at it myself. Try this:
This is very nice! Thanks. I tried it on a bunch of MARC files I
have, and it recognized almost all of them. A few it didn't, so I
had a closer look, and they're invalid.
For example, the Internet Archive's Binghamton catalogue dump:
http://ia600307.us.archive.org/6/items/marc_binghamton_univ/
$ file -m marc.magic bgm*mrc
bgm_openlib_final_0-5.mrc: data
bgm_openlib_final_10-15.mrc: MARC Bibliographic
bgm_openlib_final_15-18.mrc: data
bgm_openlib_final_5-10.mrc: MARC Bibliographic
But why? Aha:
$ head -c 25 bgm_openlib_final_*mrc
==> bgm_openlib_final_0-5.mrc<==
01812cas 2200457 45x00
==> bgm_openlib_final_10-15.mrc<==
01008nam 2200289ua 45000
==> bgm_openlib_final_15-18.mrc<==
01614cam 00385 45 0
==> bgm_openlib_final_5-10.mrc<==
00887nam 2200265v 45000
As you say, the leader should end with 4500 (as defined at
http://www.loc.gov/marc/authority/adleader.html) but two of those
files don't. So they're not valid MARC. I'm sure any decent MARC
tool can deal with them, since decent MARC tools are certainly going to be
forgiving enough to deal with four characters that apparently don't
even really matter.
So on the one hand they're usable MARC but file wouldn't say so, and
on the other that's a good indication that the files have failed a basic
validity test. I wonder if there are similar situations for JPEGs or MP3s.
I think you should definitely submit this for inclusion in the magic file.
It would be very useful for us all!
Bill
P.S. I'd never used head -c (to show a fixed number of bytes) before.
Always nice to find a new useful option to an old command.
#--------------------------------------------
# MARC 21 Magic (Second cut)
# Set at position 0
0 short>0x0000
# leader ends with 4500
20 string 4500
# leader starts with 5 digits, followed by codes specific to MARC format
0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z] MARC Bibliographic
0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
0 regex/1 (^[0-9]{5})[cdn][uvxy] MARC Holdings
0 regex/1 (^[0-9]{5})[acdn][w] MARC Classification
0 regex/1 (^[0-9]{5})[cdn][q] MARC Community
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
--
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com