The problem was that I wasn't telling Perl to output UTF-8. Now that I've added
binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds
like once I set binmode to UTF-8 everything will be interpreted as such, even
when the record is in MARC-8. Is that right? So this means that I can only use
my script with a file of records where all of them are encoded in UTF-8. If I
want to run the script against a file with all MARC-8 encoding, then I'd need
to remove the binmode line.
It doesn't seem possible to say:
    if ($record->encoding() eq 'UTF-8') {
        binmode(FILE, ':utf8');
        print FILE $record->as_usmarc();
    }
    else {
        print FILE $record->as_usmarc();
    }
This will result in messing up the diacritics if a file has a mixture of
records in MARC-8 and UTF-8. Is that correct?
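One workaround I could try is to leave FILE without a :utf8 layer and encode only the UTF-8 records by hand before printing. Here is a rough sketch of that idea. It assumes that as_usmarc() returns a Perl character string for UTF-8 records and a plain byte string for MARC-8 records, which I have not verified against the MARC::Record source. record_bytes is a made-up helper name, not part of MARC::Record.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the bytes to write for one serialized record.
# $encoding is the value reported by $record->encoding();
# $marc is the string returned by $record->as_usmarc().
sub record_bytes {
    my ($encoding, $marc) = @_;

    # Assumption: a UTF-8 record arrives as a character string, so it is
    # encoded to UTF-8 bytes here. Anything else is treated as MARC-8
    # bytes and passed through unchanged.
    return $encoding eq 'UTF-8' ? encode('UTF-8', $marc) : $marc;
}

# Intended use inside the extraction loop, with the output handle in raw mode:
#   binmode(FILE, ':raw');
#   print FILE record_bytes($record->encoding(), $record->as_usmarc());
```

If that assumption holds, the output handle only ever sees bytes, so the MARC-8 records should come through untouched and the "Wide character in print" warning should not come back for the UTF-8 ones.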
Thanks,
Shelley
----- Original Message -----
> From: "William Dueber" <[email protected]>
> To: "Shelley Doljack" <[email protected]>
> Cc: [email protected]
> Sent: Monday, July 30, 2012 5:13:41 PM
> Subject: Re: printing UTF-8 encoded MARC records with as_usmarc
> First off, it's entirely possible that you have bad UTF-8 (perhaps
> rogue MARC-8, perhaps just lousy characters) in your MARC. I know we
> have plenty of that crap.
> You need to tell Perl that you'll be outputting UTF-8 using 'binmode':
> binmode(FILE, ':utf8');
> In general, you'll want to do this to basically every file you open
> for reading or writing.
> A great overview of Perl and UTF-8 can be found at:
> http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack <
> [email protected] > wrote:
> > Hi,
>
> > I wrote a script that extracts MARC records from a file given certain
> > conditions and puts them in a new file. When my input record is
> > correctly encoded in UTF-8 and I run my script from windows command
> > prompt, this warning message appears: "Wide character in print at
> > record_extraction.pl line 99" (the line in my script where I print
> > to a new file using as_usmarc). I compared the extracted record
> > before and after in MarcEdit and the diacritic was changed. I tried
> > marcdump newfile.mrc to see what happens and I get this error:
> > "utf8 \xF4 does not map to Unicode at C:/Perl64/lib/Encode.pm line 176."
> > When I run my extraction script again with MARC-8 encoded data then
> > I don't have the same problem.
>
> > The basic outline of my script is:
>
> > my $batch = MARC::Batch->new('USMARC', $input_file);
>
> > while (my $record = $batch->next()) {
>
> > #do some checks
>
> > #if checks ok then
>
> > print FILE $record->as_usmarc();
>
> > }
>
> > Do I need to add something that specifies to interpret the data as
> > UTF-8? Does MARC::Record not handle UTF-8 at all?
>
> > Thanks,
>
> > Shelley
>
> > ----
>
> > Shelley Doljack
>
> > E-Resources Metadata Librarian
>
> > Metadata and Library Systems
>
> > Stanford University Libraries
>
> > [email protected]
>
> > 650-725-0167
>
> --
> Bill Dueber
> Programmer -- Library Systems
> University of Michigan