Thank you for all the input, and I think I have resolved my particular issue. 
Battle won. War still raging.

Using the script suggested by Galen as a starting point, I wrote the following 
hack to print the numbers of any MARC records containing non-UTF-8 characters. 
The script output nothing; all the data in all of my records was encoded as 
UTF-8:

  #!/usr/bin/perl

  # require
  use strict;
  use Encode;

  # initialize
  binmode STDIN, ":bytes";
  $/    = "\035"; 
  my $i = 0;

  # read STDIN
  while ( <> ) {

      # increment
      $i++;
    
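      # check validity; decode() dies on malformed UTF-8 when given FB_CROAK,
      # whereas is_utf8() only tests Perl's internal flag and never dies
      my $bytes = $_;   # decode() may modify its argument when CHECK is set
      eval { Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK ); };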
    
      # check for error
      if ( $@ ) { print "Record $i contains non-UTF-8 characters\n"; }
    
  }

  # done
  exit;
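
One caveat worth checking: as far as I can tell from the Encode documentation, 
is_utf8() only reports Perl's internal UTF-8 flag (plus well-formedness when 
CHECK is true) and returns a boolean rather than dying. A check that relies on 
eval catching an exception therefore needs decode() with FB_CROAK. Below is a 
minimal sketch; the helper name is_valid_utf8 and the sample bytes are made up 
for illustration:

```perl
#!/usr/bin/perl

use strict;
use warnings;
use Encode ();

# hypothetical helper: returns 1 if the byte string is well-formed UTF-8
sub is_valid_utf8 {
    my ( $bytes ) = @_;    # a copy, since decode() may modify its argument
    return eval { Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK ); 1 } ? 1 : 0;
}

# made-up sample data, not taken from real records
my $good = "caf\xC3\xA9 au lait";    # "é" encoded as UTF-8
my $bad  = "caf\xE9 au lait";        # "é" encoded as Latin-1

printf "good: %d, bad: %d\n", is_valid_utf8( $good ), is_valid_utf8( $bad );
# prints "good: 1, bad: 0"
```

Running a known-bad string like $bad through whatever check is used is a cheap 
way to confirm that silence really means "valid".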


Since all of the data in all of my records was UTF-8, the leader of every 
record needs the value "a" in position 9 (the character coding scheme). So I 
wrote the following hack (circumventing MARC::Batch):

  #!/usr/bin/perl

  # require
  use strict;

  # initialize
  binmode STDIN,  ":bytes";
  binmode STDOUT, ":bytes";
  $/ = "\035"; 

  # loop through the input
  while ( <> ) {

      # do the work and output
      substr( $_, 9, 1 ) = "a";
      print $_;
        
  }

  # done
  exit;
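
The same edit can be made a little more defensive. An lvalue substr() past the 
end of a string is fatal in Perl, so a trailing newline after the last record 
terminator would kill the loop above. This is only a sketch: the helper name 
set_unicode_flag and the sample leader are made up, and the 24-byte minimum 
is simply the fixed length of a MARC leader:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# hypothetical helper: return a copy of a raw MARC record with leader
# position 9 set to "a"; anything shorter than a 24-byte leader is
# returned unchanged instead of triggering "substr outside of string"
sub set_unicode_flag {
    my ( $record ) = @_;
    return $record if length( $record ) < 24;
    substr( $record, 9, 1 ) = 'a';
    return $record;
}

# made-up 24-byte leader followed by the record terminator
my $sample = "00025nam  2200025   4500\035";
my $fixed  = set_unicode_flag( $sample );

# show only the leader, not the control character
print substr( $fixed, 0, 24 ), "\n";
# prints "00025nam a2200025   4500"
```

In the loop above, the body would then become: print set_unicode_flag( $_ );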


I then fed the output of my fix routine to my indexing routine, and all of my 
problems seemed to go away. GIGO?

I'm still not sure, but I think deep within MARC::Batch some sort of encoding 
is observed, honored, and output. And when the denoted encoding is not true and 
things like binmode( FILE, ":utf8" ) get called, output gets munged. Again, I'm 
not sure. It is almost exhausting.
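
The munging half of that guess is at least easy to reproduce outside of 
MARC::Batch. The following is a minimal sketch, not a trace of what MARC::Batch 
actually does: it prints the same undecoded UTF-8 bytes (a made-up sample) 
through a raw layer and through a ":utf8" layer, using in-memory filehandles 
so the results can be compared:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# made-up sample: "é" as two raw UTF-8 bytes, never decoded by Perl
my $bytes = "\xC3\xA9";

# write the same bytes through a raw layer and through a :utf8 layer
my ( $raw, $munged ) = ( '', '' );

open( my $out_raw, '>:raw', \$raw ) or die $!;
print {$out_raw} $bytes;
close $out_raw;

open( my $out_utf8, '>:utf8', \$munged ) or die $!;
print {$out_utf8} $bytes;
close $out_utf8;

# hex dump of each result
printf "raw:  %s\n", unpack( 'H*', $raw );
printf "utf8: %s\n", unpack( 'H*', $munged );
# prints "raw:  c3a9" and "utf8: c383c2a9"
```

Perl treats each undecoded byte as a Latin-1 character, so the ":utf8" layer 
encodes the two bytes a second time and the output doubles in size.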


-- 
Eric Morgan
University of Notre Dame