Thank you for all the input, and I think I have resolved my particular issue.
Battle won. War still raging.
Using the script suggested by Galen as a starting point, I wrote the following
hack outputting integers denoting MARC records containing non-UTF-8 characters,
but the script output nothing; all the data in all of my records was encoded as
UTF-8:

#!/usr/bin/perl

# require
use strict;
use Encode;

# initialize
binmode STDIN, ':bytes';
$/ = "\035";    # MARC record terminator
my $i = 0;

# read STDIN
while ( <STDIN> ) {

    # increment
    $i++;

    # check validity; decode() croaks on malformed UTF-8
    eval { Encode::decode( 'UTF-8', $_, Encode::FB_CROAK | Encode::LEAVE_SRC ); };

    # check for error
    if ( $@ ) { print "Record $i contains non-UTF-8 characters\n"; }

}

# done
exit;

Since all of the data in all of my records was UTF-8, the leaders of all of
the records need to have a value of "a" in position #9. So I wrote the
following hack (circumventing MARC::Batch):

#!/usr/bin/perl

# require
use strict;

# initialize
binmode STDIN, ':bytes';
binmode STDOUT, ':bytes';
$/ = "\035";    # MARC record terminator

# loop through the input
while ( <STDIN> ) {

    # do the work and output
    substr( $_, 9, 1 ) = 'a';
    print $_;

}

# done
exit;

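A quick way to double-check a fix like this is to tally what actually sits in
position #9 of each leader. The following is only a sketch: the subroutine name
leader_encoding_counts and the two sample leaders are made up for illustration,
and it assumes the raw records are already in a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally the values found in leader position 9 of each record in a string
# of raw MARC data. (leader_encoding_counts is a hypothetical helper.)
sub leader_encoding_counts {
    my ($marc) = @_;
    my %count;

    # records end with the record terminator, octal 035
    for my $record ( split /\035/, $marc ) {
        next if length($record) < 24;    # too short to hold a leader
        $count{ substr( $record, 9, 1 ) }++;
    }
    return %count;
}

# two made-up 24-byte leaders: one flagged "a", one left blank
my $sample = "00026nam a2200000 a 4500\035" . "00026nam  2200000 a 4500\035";
my %count  = leader_encoding_counts($sample);
printf "leader/09 '%s': %d\n", $_, $count{$_} for sort keys %count;
```

Anything other than "a" in the tally would point at records the fix missed.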
I then fed the output of my fix routine to my indexing routine, and all of my
problems seemed to go away. GIGO?
I'm still not sure, but I think deep within MARC::Batch some sort of encoding
is observed, honored, and output. And when the denoted encoding is not true and
things like binmode( FILE, ':utf8' ) get called, output gets munged. Again, I'm
not sure. It is almost exhausting.
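
For what it is worth, here is a minimal sketch of the kind of munging I have in
mind. It does not call MARC::Batch at all, and I am not claiming this is what
the module does internally; it only shows what Perl itself does when bytes that
are already UTF-8 get encoded as UTF-8 a second time, which is the effect a
:utf8 output layer has on a string that was never decoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( encode decode );

# "e" with an acute accent as stored in a UTF-8 record: two bytes
my $utf8_bytes = "\xC3\xA9";

# mislabeled path (an assumption, for illustration): the bytes are never
# decoded, so each byte is treated as its own character and encoded again
my $munged = encode( 'UTF-8', $utf8_bytes );

# correctly labeled path: decode first, then encode on output
my $round_trip = encode( 'UTF-8', decode( 'UTF-8', $utf8_bytes ) );

# show both results as hex bytes
printf "munged:     %s\n", join ' ', map { sprintf '%02X', ord } split //, $munged;
printf "round trip: %s\n", join ' ', map { sprintf '%02X', ord } split //, $round_trip;
```

The munged value comes out as four bytes (C3 83 C2 A9) instead of the original
two (C3 A9), which is the classic double-encoding pattern.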
--
Eric Morgan
University of Notre Dame