RE: Corrupt MARC records

2005-05-07 Thread Houghton,Andrew
 
Most MARC utilities like MARC::Record depend upon the actual directory lengths 
and having well formed structure.  Isn't that what standards are for?  But 
sometimes you really do get badly formed MARC records and need to recover the 
data.  The presented code does have two caveats, which I pointed out: the 
field and record delimiters must still be in the records and, as Ed 
reiterates, the directory *must* be in the same order as the fields.

However, even if the fields are not in the same order as the directory, code 
could be written to take that into account so long as you can make the 
assumption that the start positions for each directory entry give the "nearest" 
position to the data.  If we take the directory and sort on the start position 
field, we will have the directory in the order necessary for extraction by the 
presented code.

Of course, you would probably want to keep track of the original directory and 
the sorted directory order so you can output the MARC record with the fields in 
the same order as the original.  Things are never ideal when you have corrupt 
MARC records...
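
A minimal sketch of the sorting idea above, assuming each directory entry is the usual 12 characters (tag, 4-digit length, 5-digit start position). The helper name dir_order_by_start and the sample entries are made up for illustration and are not part of the code posted earlier in the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: given the 12-character directory entries in their
# original order, return the original indexes sorted by start position.
sub dir_order_by_start {
    my @dir = @_;
    # Characters 7-11 of an entry hold the start position.
    return sort { substr($dir[$a], 7, 5) <=> substr($dir[$b], 7, 5)
                  or $a <=> $b } 0 .. $#dir;
}

# Made-up directory: 245 is listed first, but its data comes second.
my @dir   = ('245001000005', '001000500000');
my @order = dir_order_by_start(@dir);    # indexes into @dir, in data order

# After split(), the i-th piece of field data belongs to entry $order[i].
# Assigning through the slice files each piece under its original entry,
# so the record can be written out in the original tag order.
my @flds = ('ocm1', '10$aTitle');        # data as it appears in the record
my @data_for_entry;
@data_for_entry[@order] = @flds;

for my $i (0 .. $#dir) {
    print substr($dir[$i], 0, 3), ' => ', $data_for_entry[$i], "\n";
}
```

Keeping the sorted indexes rather than sorting the entries themselves means nothing has to be "unsorted" afterwards; the original directory is never reordered.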


Andy.


Re: Corrupt MARC records

2005-05-07 Thread Ed Summers
> I wondered if any of you had run into similar problems, or if you had 
> any thoughts on how to tackle this particular issue.

It's ironic that MARC::Record *used* to do what Andrew suggests: using 
split() rather than substr() with the actual directory lengths. The reason 
for the switch was just as Andrew pointed out: the order of the tags in the 
directory is not necessarily the order of the field data.
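
To make the trade-off concrete, here is a rough illustration of the two approaches. It is not MARC::Record's actual code; the sub names and the sample data are invented for the example:

```perl
use strict;
use warnings;

my $FT = "\x1E";    # field terminator
my $SF = "\x1F";    # subfield delimiter

# Made-up data section: two fields, each ending in a field terminator.
my $data = "ocm1${FT}10${SF}aTitle${FT}";

# Directory-driven: trust the length and start position in the entry.
# Works whatever order the data is in, but only if the numbers are right.
sub field_by_substr {
    my ($data, $entry) = @_;
    my ($tag, $len, $start) = unpack('A3 A4 A5', $entry);
    return substr($data, $start, $len - 1);    # -1 drops the terminator
}

# Delimiter-driven: ignore the lengths and cut at each field terminator.
# Survives bad lengths, but assumes data order matches directory order.
sub fields_by_split {
    my ($data) = @_;
    return split(/\x1E/, $data);
}

my $good    = field_by_substr($data, '245001000005');  # correct length
my $clipped = field_by_substr($data, '245000900005');  # length one too short
my @pieces  = fields_by_split($data);

print length($good), ' ', length($clipped), ' ', scalar(@pieces), "\n";
```

With the length one too short, the substr() version silently loses the last character of the field, while the split() version never consults the length at all.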

If you need to you could try downloading MARC::Record v1.17 and try 
using that. Or you could roll your own code and cut and paste it 
everywhere like Andrew ;-)

//Ed


RE: Corrupt MARC records

2005-05-07 Thread Houghton,Andrew
 
It's amazing that when you read your own response after sending it, you 
discover mistakes...  OK, here is an addendum to what I said below:

You will probably need this line at the beginning:

  use Carp;

The croaking should be:

or croak("Cannot open input file $FileMARC21\n");

To avoid reusing a variable inappropriately the for loop should be:

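  foreach my $entry (@dir) {

    my $fld = shift(@flds);
    my $len = length($fld) + length($usmFldD);

    # Update directory's field length and start position.  Record where
    # the field begins *before* moving past it.  MARC start positions
    # count from the base address of data, so $start should begin at 0
    # and the base address be added back for the record length.
    substr($entry, 3, 4) = sprintf('%4.4d', $len);
    substr($entry, 7, 5) = sprintf('%5.5d', $start);

    $start += $len;
  }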

Finally, writing the MARC record out should be:

  print STDOUT $ldr, join('', @dir), $usmFldD,
               join($usmFldD, @fields, ''), $usmRecD;

I did say it was written off the top of my head and not tested, didn't I?  It's 
still not tested, but the above mistakes were obvious after reading what I sent...


Andy.


RE: Corrupt MARC records

2005-05-07 Thread Houghton,Andrew
 
MARC records contain a field delimiter after each field and a record delimiter 
at the end.  Assuming that those delimiters are still in your MARC records and 
that the directory entries are in the same order as the fields, then you can do 
the following:

1 Set Perl's record delimiter to the MARC record delimiter.
2 For each record from the MARC file
2.1 Use Perl's split function on the record data with the field delimiter
2.2 Use Perl's shift function to move the split array down 1 and capture the 
first field which contains the leader and directory
2.3 Use Perl's substr function to split the leader and directory
2.4 Use Perl's split function on the directory to separate each 12 characters
2.5 For each directory entry
2.5.1 Calculate the length from the field array and update the directory entry
2.6 Update the leader's record length
2.7 Piece the leader, directory and fields back together
2.8 Write out the new MARC record

The above is just off the top of my head and I probably overlooked some things. 
 So here is a sketch of what I think the code would look like:

use strict;
use warnings;
use Carp;

my $usmRecD = "\x1D";    # MARC21 record terminator.
my $usmFldD = "\x1E";    # MARC21 field terminator.
my $usmSubD = "\x1F";    # MARC21 subfield delimiter.

my $anyRecQ = quotemeta($usmRecD);
my $anyFldQ = quotemeta($usmFldD);
my $anySfdQ = quotemeta($usmSubD);

# Change Perl's default record delimiters...
$/ = $usmRecD;
$\ = '';

# Process each command line file...
foreach my $FileMARC21 (@ARGV) {

  # Open file from command line...
  open(MARC21, '<', $FileMARC21)
    or croak("Cannot open input file $FileMARC21\n");
  binmode(MARC21);

  # Process each record in the file...
  while (<MARC21>) {
    my $rec = $_;
    chomp($rec);    # strip the trailing record delimiter ($/)

    substr($rec, 20, 4) = '4500';  #HACK to correct for OCLC MARC...

    # No split limit: we need every field, not just the first break
    my @flds = split($anyFldQ, $rec);

    my $ldr  = substr($flds[0], 0, 24);
    my $dir  = substr($flds[0], 24);
    my $dirL = length($dir);

    # Cut the directory into 12-character entries
    my @dir  = $dir =~ /(.{12})/g;

    shift(@flds);

    my @fields = @flds;
    my $base   = length($ldr) + $dirL + length($usmFldD);
    my $start  = 0;    # start positions count from the base address

    foreach my $entry (@dir) {

      my $fld = shift(@flds);
      my $len = length($fld) + length($usmFldD);

      # Update directory's field length and start position
      substr($entry, 3, 4) = sprintf('%4.4d', $len);
      substr($entry, 7, 5) = sprintf('%5.5d', $start);

      $start += $len;
    }

    # Update leader's record length and base address of data
    substr($ldr, 0, 5)  = sprintf('%5.5d', $base + $start + length($usmRecD));
    substr($ldr, 12, 5) = sprintf('%5.5d', $base);

    # Write out leader, directory, fields
    print STDOUT $ldr, join('', @dir), $usmFldD,
                 join($usmFldD, @fields, ''), $usmRecD;
  }

  # Close file from command line...
  close(MARC21);
}

The code is off the top of my head and parts have been copied from a variety of 
Perl scripts I had hanging around.  It isn't tested, but hopefully a start for 
your work.


Andy.



Corrupt MARC records

2005-05-07 Thread Ron Davies
I have been having some problems with a client's catalogue that contains 
quite a few corrupt MARC records. These are for the most part records that 
have been kicking around since as long ago as 1965, and that have been 
transferred between various systems and converted between different formats 
over the years.

The common problem seems to be that the value for the length of a field in 
the directory no longer matches the length of the data that is actually in 
that field, and hence the record length does not match what is in the 
leader. The difference is typically one character in one field and may be 
related to what were once hidden control characters or blanks at the end of 
a formatted line during the data conversion to MARC a number of years ago. 
The records can be searched and displayed (e.g. within the client's ILS), 
and they can even be updated within the ILS, so they are useful, but they 
can't be updated and written by MARC::Record. This creates problems when 
doing large-scale global changes.
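
One way to find the affected records before attempting a repair might look like the sketch below. It assumes the field terminators are intact and that the directory entries are in the same order as the field data; the sub name mismatched_tags and the sample record are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical checker: returns the tags whose directory length disagrees
# with the data actually found between field terminators.
sub mismatched_tags {
    my ($rec) = @_;
    $rec =~ s/\x1D\z//;                           # drop the record terminator
    my ($head, @flds) = split(/\x1E/, $rec);      # leader+directory, then fields
    my @dir = substr($head, 24) =~ /(.{12})/g;    # 12-character entries
    my @bad;
    for my $i (0 .. $#dir) {
        my ($tag, $len) = unpack('A3 A4', $dir[$i]);
        # The directory length includes the field terminator, hence the +1.
        my $actual = defined $flds[$i] ? length($flds[$i]) + 1 : 0;
        push @bad, $tag if $len != $actual;
    }
    return @bad;
}

# Made-up record: the 245 entry claims 11 characters, the data has 10.
my $rec = ('0' x 24) . '001000500000' . '245001100005' . "\x1E"
        . "ocm1\x1E" . "10\x1FaTitle\x1E" . "\x1D";

print join(',', mismatched_tags($rec)), "\n";
```

The placeholder leader of 24 zeros is only there to keep the offsets right; the check never reads the leader's own length fields.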

My thought on how to deal with this would be to read in sequence each field 
of the offending record, and readjust field lengths, based on the shorter 
of (a) the value in the field length in the directory, or (b) the actual 
string in the field up to the field separator character. As you went 
through the record, you would readjust the index into the data section as 
you came across an inconsistent field length. The worst that could happen 
is that a few characters might be lost from the end of a field, but at the 
end of the process the record would be "clean" again. I would suspect it 
would still need a human eye to scan over it to ensure that no egregious 
errors were introduced, but this would be a lot easier than trying to 
identify the offending field, manually deleting the field, and then 
re-entering it.
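
The approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the directory entries are in data order, the field terminators are still present, and the sub name readjust_fields is invented here:

```perl
use strict;
use warnings;

# Hypothetical helper: takes the data section and the 12-character
# directory entries; returns references to the rebuilt entries and to the
# cleaned field data (without terminators).
sub readjust_fields {
    my ($data, @dir) = @_;
    my $FT    = "\x1E";
    my $index = 0;    # where the next field really starts in $data
    my $start = 0;    # start position to write into the rebuilt entry
    my (@new_dir, @fields);

    foreach my $entry (@dir) {
        last if $index >= length($data);    # ran out of data

        my ($tag, $dir_len) = unpack('A3 A4', $entry);

        # (b) the actual string up to the field terminator
        my $end = index($data, $FT, $index);
        $end = length($data) if $end < 0;   # terminator missing
        my $actual = $end - $index + 1;     # data plus terminator

        # (a) the directory length, ignored if it is not a usable number
        $dir_len = $actual unless $dir_len =~ /^\d{4}$/ && $dir_len > 0;

        # Keep the shorter of the two
        my $len = $dir_len < $actual ? $dir_len : $actual;

        push @fields,  substr($data, $index, $len - 1);
        push @new_dir, sprintf('%3s%04d%05d', $tag, $len, $start);

        $start += $len;
        $index  = $end + 1;    # resynchronise on the real terminator
    }
    return (\@new_dir, \@fields);
}

# Made-up data: the 245 field picked up a trailing blank in conversion.
my ($dir, $flds) = readjust_fields("ocm1\x1E10\x1FaTitle \x1E",
                                   '001000500000', '245001000005');
print "$_\n" for @$dir;
```

The data section can then be rebuilt with join("\x1E", @$flds, ''). As noted above, characters beyond the shorter length are dropped, so the output still deserves a look by a human eye.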

I wondered if any of you had run into similar problems, or if you had any 
thoughts on how to tackle this particular issue.

Thanks in advance,
Ron
Ron Davies
Information and documentation systems consultant
Av. Baden-Powell 1  Bte 2, 1200 Brussels, Belgium
Email:  ron(at)rondavies.be
Tel:+32 (0)2 770 33 51
GSM:+32 (0)484 502 393