Re: printing UTF-8 encoded MARC records with as_usmarc

Shelley Doljack Tue, 31 Jul 2012 12:18:08 -0700

The problem was that I wasn't telling Perl to output UTF-8. Now that I have added 
binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds 
like once I set binmode to UTF-8, everything written to that filehandle will be 
encoded as such, even when the record is in MARC-8. Is that right? If so, I can 
only use my script with a file in which every record is encoded in UTF-8. If I 
want to run the script against a file in which every record is MARC-8, then I'd 
need to remove the binmode line.


It doesn't seem possible to say: 

if ($record->encoding() eq 'UTF-8') { 
    binmode(FILE, ':utf8'); 
    print FILE $record->as_usmarc(); 
} 
else { 
    print FILE $record->as_usmarc(); 
} 

This will result in messing up the diacritics if a file has a mixture of 
records in MARC-8 and UTF-8. Is that correct? 
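
One possible way around this, as a minimal sketch only: leave the output handle in raw (byte) mode and encode each UTF-8 record explicitly before printing, so MARC-8 records pass through untouched. The helper name record_bytes and the sample strings are hypothetical, and the sketch assumes that as_usmarc() returns a character string for UTF-8 records and a plain byte string for MARC-8 records. It uses in-memory filehandles and only the core Encode module, so it runs without MARC::Record installed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of MARC::Record): return the bytes to write
# for one serialized record. $is_utf8 stands in for a per-record check such
# as $record->encoding() eq 'UTF-8'.
sub record_bytes {
    my ($serialized, $is_utf8) = @_;
    # UTF-8 record: a character string, so encode it to UTF-8 bytes here.
    # MARC-8 record: assumed to be raw bytes already, so pass it through.
    return $is_utf8 ? encode('UTF-8', $serialized) : $serialized;
}

my $utf8_sample  = "\x{151}";  # one wide character (o with double acute)
my $marc8_sample = "\xE2e";    # MARC-8 style: combining acute byte, then "e"

# 1. A :utf8 layer re-encodes the MARC-8 byte 0xE2 as a two-byte sequence.
open(my $layered, '>:utf8', \my $mangled) or die "open failed: $!";
print {$layered} $marc8_sample;
close($layered);

# 2. A raw handle plus explicit encoding leaves the MARC-8 bytes alone.
open(my $raw, '>:raw', \my $mixed) or die "open failed: $!";
print {$raw} record_bytes($utf8_sample, 1);
print {$raw} record_bytes($marc8_sample, 0);
close($raw);

# Print both buffers as hex so the difference is visible.
printf "%s\n", join ' ', map { sprintf '%02X', ord } split //, $mangled;
printf "%s\n", join ' ', map { sprintf '%02X', ord } split //, $mixed;
```

In the real script, the equivalent would be to open FILE without a :utf8 layer and print record_bytes($record->as_usmarc(), $record->encoding() eq 'UTF-8') for each record, again assuming as_usmarc() behaves as described above.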

Thanks, 
Shelley 

----- Original Message -----

> From: "William Dueber" <dueb...@umich.edu>
> To: "Shelley Doljack" <sdolj...@stanford.edu>
> Cc: perl4lib@perl.org
> Sent: Monday, July 30, 2012 5:13:41 PM
> Subject: Re: printing UTF-8 encoded MARC records with as_usmarc

> First off, it's entirely possible that you have bad UTF-8 (perhaps
> rogue MARC-8, perhaps just lousy characters) in your MARC. I know we
> have plenty of that crap.

> You need to tell Perl that you'll be outputting UTF-8 using 'binmode':

> binmode(FILE, ':utf8');

> In general, you'll want to do this to basically every file you open
> for reading or writing.

> A great overview of Perl and UTF-8 can be found at:

> http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack <
> sdolj...@stanford.edu > wrote:

> > Hi,
> 

> > I wrote a script that extracts MARC records from a file given certain
> > conditions and puts them in a new file. When my input record is
> > correctly encoded in UTF-8 and I run my script from the Windows command
> > prompt, this warning message appears: "Wide character in print at
> > record_extraction.pl line 99" (the line in my script where I print
> > to a new file using as_usmarc). I compared the extracted record
> > before and after in MarcEdit and the diacritic was changed. I tried
> > marcdump newfile.mrc to see what happens and I get this error:
> > "utf8 \xF4 does not map to Unicode at C:/Perl64/lib/Encode.pm line 176."
> > When I run my extraction script again with MARC-8 encoded data then
> > I don't have the same problem.
> 

> > The basic outline of my script is:
> 

> > my $batch = MARC::Batch->new('USMARC', $input_file);
> > while (my $record = $batch->next()) {
> >     # do some checks
> >     # if checks ok then
> >     print FILE $record->as_usmarc();
> > }
> 

> > Do I need to add something that specifies to interpret the data as
> > UTF-8? Does MARC::Record not handle UTF-8 at all?
> 

> > Thanks,
> 
> > Shelley
> 

> > ----
> > Shelley Doljack
> > E-Resources Metadata Librarian
> > Metadata and Library Systems
> > Stanford University Libraries
> > sdolj...@stanford.edu
> > 650-725-0167
> 

> --

> Bill Dueber
> Programmer -- Library Systems
> University of Michigan
