The problem was I wasn't telling perl to output UTF-8. Now that I added binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds like once I set binmode to UTF-8 everything will be interpreted as such, even when the record is in MARC-8. Is that right? So this means that I can only use my script with a file of records where all of them are encoded in UTF-8. If I want to run the script against a file with all MARC-8 encoding, then I'd need to remove the binmode line.
It doesn't seem possible to say:

    if ($record->encoding() eq 'UTF-8') {
        binmode(FILE, ':utf8');
        print FILE $record->as_usmarc();
    } else {
        print FILE $record->as_usmarc();
    }

This will result in messing up the diacritics if a file has a mixture of records in MARC-8 and UTF-8. Is that correct?

Thanks,
Shelley

----- Original Message -----
> From: "William Dueber" <dueb...@umich.edu>
> To: "Shelley Doljack" <sdolj...@stanford.edu>
> Cc: perl4lib@perl.org
> Sent: Monday, July 30, 2012 5:13:41 PM
> Subject: Re: printing UTF-8 encoded MARC records with as_usmarc
>
> First off, it's entirely possible that you have bad UTF-8 (perhaps
> rogue MARC-8, perhaps just lousy characters) in your MARC. I know we
> have plenty of that crap.
>
> You need to tell perl that you'll be outputting UTF-8 using 'binmode'
>
> binmode(FILE, ':utf8');
>
> In general, you'll want to do this to basically every file you open
> for reading or writing.
>
> A great overview of Perl and UTF-8 can be found at:
>
> http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
>
> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack <sdolj...@stanford.edu> wrote:
>
> > Hi,
> >
> > I wrote a script that extracts marc records from a file given certain
> > conditions and puts them in a new file. When my input record is
> > correctly encoded in UTF-8 and I run my script from windows command
> > prompt, this warning message appears: "Wide character in print at
> > record_extraction.pl line 99" (the line in my script where I print
> > to a new file using as_usmarc). I compared the extracted record
> > before and after in MarcEdit and the diacritic was changed. I tried
> > marcdump newfile.mrc to see what happens and I get this error:
> > "utf8 \xF4 does not map to Unicode at C:/Perl64/lib/Encode.pm line 176."
> > When I run my extraction script again with MARC-8 encoded data then
> > I don't have the same problem.
> >
> > The basic outline of my script is:
> >
> > my $batch = MARC::Batch->new('USMARC', $input_file);
> > while (my $record = $batch->next()) {
> >     #do some checks
> >     #if checks ok then
> >     print FILE $record->as_usmarc();
> > }
> >
> > Do I need to add something that specifies to interpret the data as
> > UTF-8? Does MARC::Record not handle UTF-8 at all?
> >
> > Thanks,
> > Shelley
> >
> > ----
> > Shelley Doljack
> > E-Resources Metadata Librarian
> > Metadata and Library Systems
> > Stanford University Libraries
> > sdolj...@stanford.edu
> > 650-725-0167
>
> --
> Bill Dueber
> Programmer -- Library Systems
> University of Michigan
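
On the mixed-file question at the top of this message: once binmode(FILE, ':utf8') is set, the layer stays on the handle for every later print, so any MARC-8 bytes above 0x7F that pass through it are re-encoded and corrupted. One possible way around that is to leave the handle in raw mode and encode per record instead. The following is only a minimal sketch under stated assumptions, not a tested MARC::Record recipe: bytes_for_output and mixed_output.mrc are hypothetical names, the two sample strings stand in for what $record->as_usmarc() returns, and it assumes UTF-8 records arrive as Perl character strings (which the "Wide character in print" warning suggests) while MARC-8 records arrive as plain bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of MARC::Record): returns the bytes to write
# for one serialized record, given the value reported by $record->encoding().
sub bytes_for_output {
    my ($data, $encoding) = @_;
    # Assumption: UTF-8 records are character strings and must be encoded
    # explicitly; anything else (e.g. MARC-8) is already raw bytes and is
    # passed through unchanged.
    return $encoding eq 'UTF-8' ? encode('UTF-8', $data) : $data;
}

# Stand-ins for the output of $record->as_usmarc().
my $utf8_record  = "caf\x{e9} \x{151}";   # character string with a wide character
my $marc8_record = "caf\xE2e";            # raw bytes containing a high byte (0xE2)

# The handle stays in raw mode, so no layer re-encodes what is printed.
open(my $out, '>:raw', 'mixed_output.mrc')
    or die "Cannot open mixed_output.mrc: $!";
print {$out} bytes_for_output($utf8_record,  'UTF-8');
print {$out} bytes_for_output($marc8_record, 'MARC-8');
close($out) or die "Cannot close mixed_output.mrc: $!";
```

In the extraction loop this would amount to replacing the print line with something like print FILE bytes_for_output($record->as_usmarc(), $record->encoding()) and dropping the binmode(FILE, ':utf8') call in favour of binmode(FILE), which also prevents newline translation on Windows. Whether the resulting mixed file is usable by downstream tools is a separate question.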