I recently came across this presentation, which lays out pretty much all the issues with Unicode in Perl and makes some recommendations for best practices. Going over it may give you some general insight into the whole situation.
http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012

/dev

--
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm

-----Original Message-----
From: William Dueber [mailto:dueb...@umich.edu]
Sent: Monday, July 30, 2012 8:14 PM
To: Shelley Doljack
Cc: perl4lib@perl.org
Subject: Re: printing UTF-8 encoded MARC records with as_usmarc

First off, it's entirely possible that you have bad UTF-8 (perhaps rogue MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty of that crap.

You need to tell perl that you'll be outputting UTF-8 using 'binmode':

    binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for reading or writing.

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack <sdolj...@stanford.edu> wrote:

> Hi,
>
> I wrote a script that extracts marc records from a file given certain
> conditions and puts them in a new file. When my input record is correctly
> encoded in UTF-8 and I run my script from windows command prompt, this
> warning message appears: "Wide character in print at
> record_extraction.pl line 99" (the line in my script where I print to a
> new file using as_usmarc). I compared the extracted record before and
> after in MarcEdit and the diacritic was changed. I tried marcdump
> newfile.mrc to see what happens and I get this error: "utf8 \xF4 does
> not map to Unicode at C:/Perl64/lib/Encode.pm line 176." When I run my
> extraction script again with MARC-8 encoded data then I don't have the
> same problem.
>
> The basic outline of my script is:
>
> my $batch = MARC::Batch->new('USMARC', $input_file);
>
> while (my $record = $batch->next()) {
>     #do some checks
>     #if checks ok then
>     print FILE $record->as_usmarc();
> }
>
> Do I need to add something that specifies to interpret the data as UTF-8?
> Does MARC::Record not handle UTF-8 at all?
>
> Thanks,
> Shelley
>
> ----
> Shelley Doljack
> E-Resources Metadata Librarian
> Metadata and Library Systems
> Stanford University Libraries
> sdolj...@stanford.edu
> 650-725-0167
>

--
Bill Dueber
Programmer -- Library Systems
University of Michigan
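
For anyone who wants to try the binmode advice above without a MARC file at hand, here is a minimal sketch using only core Perl modules. The sample string and the temporary file are assumptions made up for illustration; the string stands in for whatever $record->as_usmarc() returns in the real script. The sketch uses ':encoding(UTF-8)', which the Perl documentation describes as the stricter relative of ':utf8'; as far as I know either one tells perl how to encode the output and so avoids the "Wide character in print" warning.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the string returned by $record->as_usmarc():
# a Perl character string containing a character above U+00FF.
my $text = "Dvo\x{159}\x{E1}k";    # "Dvorak" with its two diacritics

# Hypothetical scratch file; a real script would use its own output path.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Declare the output encoding when opening the handle ...
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot open $path: $!";

# ... or set it on a handle that is already open, as in the advice above:
# binmode($out, ':encoding(UTF-8)');

print {$out} $text;
close($out) or die "Cannot close $path: $!";

# Read the file back as raw bytes to see what was actually written.
open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

printf "%d characters became %d bytes\n", length($text), length($bytes);
# prints "6 characters became 8 bytes"
```

Each of the two accented letters is written as two bytes, which is what a UTF-8 aware reader expects to find. Without the encoding layer, perl has to guess how to write characters above U+00FF and emits the warning quoted in the original question.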