Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-28 Thread Ashley Sanders
Eric,

> How can I figure out whether or not a MARC record contains ONLY characters
> from the UTF-8 character set?

You can use a regex to check whether a string is valid UTF-8. There are various
examples floating around the internet; one is here:

   http://www.w3.org/International/questions/qa-forms-utf-8

You'll need to add the MARC control characters ^_ (0x1F), ^^ (0x1E), and ^]
(0x1D) to the ASCII part of the expression on the above page. (I think the W3C
example is aimed at XML 1.0, in which the MARC control characters are not
allowed.)
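
As a rough sketch of that suggestion (not a tested production check), the
W3C-style pattern might look like this in Perl once the three MARC delimiters
are added to the ASCII class. The sub name is_marc_utf8 is made up for
illustration, and the pattern must be applied to raw bytes, not decoded
characters:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Well-formed UTF-8 byte sequences, following the W3C example, with the
# MARC delimiters 0x1D, 0x1E and 0x1F added to the ASCII alternative.
my $marc_utf8 = qr{\A(?:
    [\x09\x0A\x0D\x1D-\x1F\x20-\x7E]    # ASCII plus MARC delimiters
  | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
)*\z}x;

# Hypothetical helper: returns 1 if the byte string matches, 0 otherwise.
sub is_marc_utf8 {
    my ($bytes) = @_;
    return $bytes =~ $marc_utf8 ? 1 : 0;
}
```

Something like is_marc_utf8($record) could then be called on each raw record
read with binmode ':bytes'. It is worth trying this against your longest
records first, since I have not checked how the starred group behaves on very
long input in older Perl versions.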

Ashley.
--
Ashley Sanders a.sand...@manchester.ac.uk
http://copac.ac.uk -- A Mimas service funded by JISC at the University of 
Manchester



Re: reading and writing of utf-8 with marc::batch [resolved; gigo]

2013-03-28 Thread Eric Lease Morgan

Thank you for all the input, and I think I have resolved my particular issue. 
Battle won. War still raging.

Using the script suggested by Galen as a starting point, I wrote the following
hack, which prints the number of each MARC record containing non-UTF-8
characters. The script output nothing; all the data in all of my records was
encoded as UTF-8:

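  #!/usr/bin/perl

  # require
  use strict;
  use warnings;
  use Encode;

  # initialize
  binmode STDIN, ':bytes';
  $/ = "\035";    # MARC record terminator (0x1D)
  my $i = 0;

  # read STDIN
  while ( <STDIN> ) {

    # increment
    $i++;

    # check validity; decode() dies on malformed UTF-8 under FB_CROAK
    # (Encode::is_utf8 only tests Perl's internal flag and never dies)
    eval { my $utf8str = Encode::decode( 'UTF-8', $_, Encode::FB_CROAK ); };

    # check for error
    if ( $@ ) { print "Record $i contains non-UTF-8 characters\n"; }

  }

  # done
  exit;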


Since all of the data in all of my records was UTF-8, the leader of every
record needs to have a value of "a" in position 9. So I wrote the following
hack (circumventing MARC::Batch):

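  #!/usr/bin/perl

  # require
  use strict;
  use warnings;

  # initialize
  binmode STDIN,  ':bytes';
  binmode STDOUT, ':bytes';
  $/ = "\035";    # MARC record terminator (0x1D)

  # loop through the input
  while ( <STDIN> ) {

    # do the work and output; "a" in leader position 9 flags the record as UTF-8
    substr( $_, 9, 1 ) = 'a';
    print $_;

  }

  # done
  exit;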
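
A slightly more cautious variant, offered only as a sketch, would combine the
two hacks above and set position 9 only when a record really does decode as
UTF-8. The sub name flag_if_utf8 is made up for illustration, and the 24-byte
minimum assumes the standard MARC leader length:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the record with leader position 9 set to
# "a" if (and only if) the whole record is well-formed UTF-8.
sub flag_if_utf8 {
    my ($record) = @_;

    # too short to hold a 24-byte leader; pass it through untouched
    return $record if length($record) < 24;

    # decode() may modify its argument when a CHECK is given, so use a copy
    my $copy = $record;
    my $ok   = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK ); 1 };

    substr( $record, 9, 1 ) = 'a' if $ok;
    return $record;
}
```

Inside the same read loop as above, "print flag_if_utf8( $_ );" would replace
the substr and print lines. Records that fail the check keep whatever value
they already had in position 9.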


I then fed the output of my fix routine to my indexing routine, and all of my 
problems seemed to go away. GIGO?

I'm still not sure, but I think deep within MARC::Batch some sort of encoding 
is observed, honored, and output. And when the denoted encoding is not true and 
things like binmode( FILE, ':utf8' ) get called, output gets munged. Again, I'm 
not sure. It is almost exhausting.
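
For what it is worth, the munging itself can be reproduced outside of
MARC::Batch with Encode alone. The sketch below shows generic Perl behavior,
not what MARC::Batch does internally (which I have not verified), and both sub
names are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Correct round trip: decode the UTF-8 bytes into characters, then encode
# the characters again on the way out.
sub round_trip {
    my ($utf8_bytes) = @_;
    my $chars = Encode::decode( 'UTF-8', $utf8_bytes );
    return Encode::encode( 'UTF-8', $chars );
}

# Double encoding: the bytes are never decoded, so each byte is treated as
# a Latin-1 character and encoded a second time. Printing undecoded bytes
# through a ':utf8' layer has the same effect.
sub double_encode {
    my ($utf8_bytes) = @_;
    return Encode::encode( 'UTF-8', $utf8_bytes );
}

my $e_acute = "\xC3\xA9";    # "e" with acute accent, as UTF-8 bytes
printf "%vX\n", round_trip($e_acute);       # C3.A9
printf "%vX\n", double_encode($e_acute);    # C3.83.C2.A9
```

Two bytes in, four bytes out is the usual signature of double encoding.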


-- 
Eric Morgan
University of Notre Dame