Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Prettyman, Timothy
Just as a historical note, this non-standard use of LDR/22 is likely due to 
OCLC's use of the character as a hexadecimal flag, back in the days when 
MARC records were mostly schlepped around on tapes.  They referred to it as the 
"Transaction type code".  When records were sent to OCLC for processing, 
various values of the flag indicated whether a catalog card was to be produced, 
whether the record was an update, whether the user location symbol was to be 
set, etc.  I'm sure others have used it for their own nefarious purposes as 
well.

Tim Prettyman
University of Michigan/LIT

On Apr 6, 2011, at 12:28 PM, Ford, Kevin wrote:

 Well, this brings us right up against the issue of files that adhere to their 
 specifications versus forgiving applications.  Think of browsers and HTML.  
 Suffice it to say, MARC applications are quite likely to be forgiving of 
 leader positions 20-23.  In my non-conforming MARC file and in Bill's, the 
 leader positions 20-21 ("45") seemed constant, but things could fall apart 
 for positions 22-23.  So...
 
 I present the following (in-line and attached, to preserve tabs) in an 
 attempt to straddle the two sides of this issue: applications forgiving of 
 non-conforming files.  Should the two characters following "45" (at position 
 20) *not* be "00", then the identification will be noted as "non-conforming".  
 We could classify this as reasonable identification but hardly ironclad 
 (indeed, simply checking to confirm that part of the first 24 positions match 
 the specification hardly constitutes a robust identification, but it's 
 something).
 
 It will now give you a mimetype, too.
 
 Would anyone like to test it out more fully on their own files?
 
 
 #
 # MARC 21 Magic  (Third cut)
 
 # Set at position 0
 0	byte	x
 
 # leader position 20-21 must be 45
 >20	string	45
 
 # leader starts with 5 digits, followed by codes specific to MARC format
 >>0	regex/1	(^[0-9]{5})[acdnp][^bhlnqsu-z]	MARC Bibliographic
 !:mime	application/marc
 >>0	regex/1	(^[0-9]{5})[acdnosx][z]	MARC Authority
 !:mime	application/marc
 >>0	regex/1	(^[0-9]{5})[cdn][uvxy]	MARC Holdings
 !:mime	application/marc
 >>0	regex/1	(^[0-9]{5})[acdn][w]	MARC Classification
 !:mime	application/marc
 >>0	regex/1	(^[0-9]{5})[cdn][q]	MARC Community
 !:mime	application/marc
 
 # leader position 22-23, should be "00" but is it?
 >>0	regex/1	(^.{21})([^0]{2})	(non-conforming)
 !:mime	application/marc
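
The same leader checks can be sketched outside of file(1). The Python below is a minimal illustration and not part of the magic file: the function name `classify_leader` and its return strings are hypothetical, and it mirrors only the tests listed above (five leading digits, "45" at positions 20-21, the status and type codes at positions 5-6, and "00" at positions 22-23).

```python
import re

# Status/type code pairs for leader positions 5-6, copied from the magic
# entries above. The type codes do not overlap, so at most one pattern matches.
_FORMATS = [
    (re.compile(rb"[acdnp][^bhlnqsu-z]"), "MARC Bibliographic"),
    (re.compile(rb"[acdnosx]z"), "MARC Authority"),
    (re.compile(rb"[cdn][uvxy]"), "MARC Holdings"),
    (re.compile(rb"[acdn]w"), "MARC Classification"),
    (re.compile(rb"[cdn]q"), "MARC Community"),
]


def classify_leader(record: bytes):
    """Describe a MARC leader, or return None if it does not look like MARC."""
    leader = record[:24]
    if len(leader) < 24:
        return None
    # Positions 0-4: record length as five ASCII digits.
    if not re.fullmatch(rb"[0-9]{5}", leader[0:5]):
        return None
    # Positions 20-21 must be "45".
    if leader[20:22] != b"45":
        return None
    for pattern, name in _FORMATS:
        if pattern.fullmatch(leader[5:7]):
            # Positions 22-23 should be "00"; flag the record if they are not.
            if leader[22:24] != b"00":
                return name + " (non-conforming)"
            return name
    return None
```

Like the magic entries, this is a reasonable guess rather than a validation: it inspects only the first 24 bytes and never reads the directory or the fields.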
 
 
 If this works, I'll see about submitting this copy.  Thanks to all your 
 efforts already.
 
 Warmly,
 
 Kevin
 
 --
 Library of Congress
 Network Development and MARC Standards Office
 
 
 
 
 
 
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero 
 [s...@unc.edu]
 Sent: Sunday, April 03, 2011 14:01
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MARC magic for file
 
 I am pretty sure that the marc4j standard reader ignores them; the tolerant
 reader definitely does. Otherwise JHU might have about two parseable records
 based on the mangled leaders that J-Rock  gets stuck with :-)
 
 An analysis of the ~7M LC bib records from the scriblio.net data files (~
 Dec 2006) indicated that the leader has less than 8 bits of information in it
 (Shannon-Weaver definition). This excludes the initial length value, which
 is redundant given the end of record marker.
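
One way to arrive at a figure of this kind is to sum the per-position Shannon entropy of leader positions 5-23 over a set of records. The sketch below only illustrates that calculation; how the analysis above was actually carried out is not stated, and the function names and the choice to treat positions independently are assumptions.

```python
import math
from collections import Counter


def position_entropy(values):
    """Shannon entropy, in bits, of an iterable of observed symbols."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def leader_entropy(leaders, start=5, end=24):
    """Sum of per-position entropies over leader positions start..end-1.

    Treating positions independently gives an upper bound on the joint
    entropy, because correlations between positions can only lower it.
    Every leader is expected to be at least `end` characters long.
    """
    return sum(
        position_entropy(leader[i] for leader in leaders)
        for i in range(start, end)
    )
```

A position that holds the same value in every record contributes nothing to the total, which is why a 24-character leader can carry so few bits in practice.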
 
 
 The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader.
 The final characters of the leader are 450.
 
 Also, I object to the phrase "decent MARC tool".  Any tool capable of
 dealing with MARC as it exists cannot afford the luxury of decency :-)
 
 [ HA: A clear conscience?
 BW: Yes, Sir Humphrey.
 HA: When did you acquire this taste for luxuries?]
 
 Simon
 
 On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens o...@ostephens.com wrote:
 
 I'm sure any decent MARC tool can deal with them, since decent MARC tools
 are certainly going to be forgiving enough to deal with four characters that
 apparently don't even really matter.
 
 You say that, but I'm pretty sure Marc4J throws errors on MARC records where
 these characters are incorrect
 
 Owen
 
 On Fri, Apr 1, 2011 at 3:51 AM, William Denton w...@pobox.com wrote:
 
 On 28 March 2011, Ford, Kevin wrote:
 
 I couldn't get Simon's MARC 21 Magic file to work.  Among other issues, I
 received "line too long" errors.  But, since I've been curious about this
 for some time, I figured I'd take a whack at it myself.  Try this:
 
 
 This is very nice!  Thanks.  I tried it on a bunch of MARC files I have,
 and it recognized almost all of them.  A few it didn't, so I had a closer
 look, and they're invalid.
 
 For example, the Internet Archive's Binghamton catalogue dump:
 
 http://ia600307.us.archive.org/6/items/marc_binghamton_univ/
 
 $ file -m marc.magic bgm*mrc
 bgm_openlib_final_0-5.mrc: data
 

Re: [CODE4LIB] unwanted (bogus) characters in marc

2010-10-07 Thread Prettyman, Timothy
The marcxml version of the record looks fine:

http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_marc.xml

Perhaps there was a problem when the record was converted from marcxml to marc 
at some point.

-Tim

Tim Prettyman
University of Michigan/LIT



On 10/7/10 9:58 AM, Cowles, Esme escow...@ucsd.edu wrote:

This record has the classic signs of Unicode treated as Latin-1 by mistake.  
The multibyte characters often show up as Ã followed by some other random 
character.  This actually happened to my conference badge in Asheville, which 
read "EsmÃ© Cowles".
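
The effect is easy to reproduce: encode a string as UTF-8, then decode those bytes as Latin-1. A minimal sketch, with variable names chosen only for illustration:

```python
name = "Esmé"

# "é" is a single code point but two bytes (0xC3 0xA9) in UTF-8.
utf8_bytes = name.encode("utf-8")

# Decoding those bytes as Latin-1 turns each byte into its own character.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # EsmÃ©

# The damage is reversible as long as no bytes were lost along the way.
repaired = garbled.encode("latin-1").decode("utf-8")
```

The 0xC3 lead byte is shared by most accented Latin letters in UTF-8, which is why the same first character keeps appearing in front of a different second one.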

-Esmé
--
Esme Cowles escow...@ucsd.edu

Necessity is the plea for every infringement of human freedom. It is the
 argument of tyrants; it is the creed of slaves. -- William Pitt, 1783

On Oct 7, 2010, at 9:39 AM, Ross Singer wrote:

 Eric, is this your source file?

 http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_meta.mrc

 I have nothing really much to offer with regard to MARC.pm and its
 ilk, but I thought it might help people track down your problem.

 FWIW, yaz-marcdump spits out this on that record:

 $ yaz-marcdump bienfait.mrc
 00795cam a2200277 a 4500
 001 1556719
 003 CaOTULAS
 005 19931129144435.0
 008 780210s1842fr  fre d
 035$a (Sirsi) AZF-9578
 040$a NUC $c NUC $d otsm
 049$a otstm $b eng
 050 04 $a BX946 $b .P5
 055  3 $a BX946 $b .P55 1842
 090  8 $a BX 946 .P55 1842 $b SMRS
 100 10 $a Pinard, Clovis, $d d.1865.
 245 10 $a Bienfaits du Catholicisme dans la société / $c par l'abbé P
 (No separator at end of field length=71)
 260 na $d .
 (Separator but not at end of field length=26)
 300 18 $2 .
 (Separator but not at end of field length=11)
 490 00 $p .
 (Separator but not at end of field length=45)
 (Bad indicator data. Skipping 2 bytes)
 596 ?t $e nn
 (No separator at end of field length=7)
 610 ne
 (Separator but not at end of field length=30)
 948 xH $s tory.
 (Separator but not at end of field length=27)
 039 0/ $6 /199
 (No separator at end of field length=9)
 (Bad indicator data. Skipping 1 bytes)
 093 0  $f mcsk
 (Separator but not at end of field length=21)
 926 12 $1 44434
 (Separator but not at end of field length=48)

 The diacritics definitely look pretty sketchy there.

 In fact, I just tried this with every encoding in yaz-marcdump, and
 the diacritics never properly converted to UTF-8.

 They seem ok here:

 http://ia341306.us.archive.org/1/items/bienfaitsducatho00pina/bienfaitsducatho00pina_marc.xml

 though, so you might want to grab both binary marc and marcxml and
 fall back to the latter in case of encoding errors.
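
 A fallback along those lines might look like the sketch below. It is an illustration only: `pick_record` is a hypothetical helper, both inputs are assumed to have been downloaded already, and the check covers byte-level UTF-8 validity alone, not the directory and field-length problems that yaz-marcdump reports above.

```python
def pick_record(binary_marc: bytes, marcxml: bytes):
    """Return ("marc", data) if the binary record decodes cleanly,
    otherwise ("marcxml", data)."""
    # Leader position 9 is "a" when the record declares UCS/Unicode (UTF-8).
    declares_utf8 = binary_marc[9:10] == b"a"
    if declares_utf8:
        try:
            # Strict decoding raises on any invalid byte sequence.
            binary_marc.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return ("marcxml", marcxml)
    # Records that do not declare UTF-8 (e.g. MARC-8) are passed through.
    return ("marc", binary_marc)
```

 Records that do not declare UTF-8 are passed through untouched here, since judging their encoding would need a MARC-8 aware tool.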

 -Ross.

 On Thu, Oct 7, 2010 at 6:51 AM, Eric Lease Morgan emor...@nd.edu wrote:
 How do I trap for unwanted (bogus) characters in MARC records?

 I have a set of Internet Archive identifiers, and have written the 
 following Perl loop to get the MARC records associated with each one:

   # process each identifier
   my $ua = LWP::UserAgent->new( agent => AGENT );
   while ( <DATA> ) {

     # get the identifier
     chop;
     my $identifier = $_;
     print $identifier, "\n";

     # get its corresponding MARC record
     my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
     if ( ! $response->is_success ) {

       warn $response->status_line;
       next;

     }

     # save it
     open MARC, " > $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
     binmode MARC, ":utf8";
     print MARC $response->content;
     close MARC;

   }

 I then use the venerable marcdump to see the fruits of my labors: marcdump 
 *.mrc. Unfortunately, marcdump returns the following error against (at 
 least) one of my files:

  bienfaitsducatho00pina.mrc
   utf8 "\xC3" does not map to Unicode at /System/Library/
  Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.

 What is going on here? Am I saving my files incorrectly? Is the original 
 MARC data inherently incorrect? Is there some way I can fix the MARC record 
 in question?

 --
 Eric Lease Morgan