Hi Devon,

> I just recently came across this presentation which lays out pretty much
> all the issues with Unicode in perl, and makes some recommendations for
> best practices.

While Nick Patch's presentation is excellent, I'm not sure that it "lays out pretty much all the issues with Unicode in perl". ;-)

To fit that bill, I highly recommend this series of talks given by Tom Christiansen at OSCON 2011:

1. Perl Unicode Essentials
2. Unicode in Perl Regexes
3. Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly

http://training.perl.com/OSCON2011/index.html
(resolves to http://98.245.80.27/tcpc/OSCON2011/index.html)

If you read through those presentations and disagree, I promise to buy you a beer at the next conference (code4lib?) we both attend.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Smith,Devon [mailto:smit...@oclc.org]
> Sent: Tuesday, July 31, 2012 8:26 AM
> To: William Dueber; Shelley Doljack
> Cc: perl4lib@perl.org
> Subject: RE: printing UTF-8 encoded MARC records with as_usmarc
>
> I just recently came across this presentation which lays out pretty much
> all the issues with Unicode in perl, and makes some recommendations for
> best practices. You may find some general insight into the whole
> situation by going over it.
>
> http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012
>
> /dev
> --
> Devon Smith
> Consulting Software Engineer
> OCLC Research
> http://www.oclc.org/research/people/smith.htm
>
>
> -----Original Message-----
> From: William Dueber [mailto:dueb...@umich.edu]
> Sent: Monday, July 30, 2012 8:14 PM
> To: Shelley Doljack
> Cc: perl4lib@perl.org
> Subject: Re: printing UTF-8 encoded MARC records with as_usmarc
>
> First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
> MARC-8, perhaps just lousy characters) in your MARC. I know we have
> plenty of that crap.
>
> You need to tell perl that you'll be outputting UTF-8 using 'binmode'
>
> binmode(FILE, ':utf8');
>
> In general, you'll want to do this to basically every file you open for
> reading or writing.
>
> A great overview of Perl and UTF-8 can be found at:
>
> http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
>
> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack
> <sdolj...@stanford.edu> wrote:
>
> > Hi,
> >
> > I wrote a script that extracts marc records from a file given certain
> > conditions and puts them in a new file. When my input record is
> > correctly encoded in UTF-8 and I run my script from windows command
> > prompt, this warning message appears: "Wide character in print at
> > record_extraction.pl line 99" (the line in my script where I print to
> > a new file using as_usmarc). I compared the extracted record before
> > and after in MarcEdit and the diacritic was changed. I tried marcdump
> > newfile.mrc to see what happens and I get this error: "utf8 \xF4 does
> > not map to Unicode at C:/Perl64/lib/Encode.pm line 176." When I run my
> > extraction script again with MARC-8 encoded data then I don't have the
> > same problem.
> >
> > The basic outline of my script is:
> >
> > my $batch = MARC::Batch->new('USMARC', $input_file);
> >
> > while (my $record = $batch->next()) {
> >     #do some checks
> >     #if checks ok then
> >     print FILE $record->as_usmarc();
> > }
> >
> > Do I need to add something that specifies to interpret the data as
> > UTF-8? Does MARC::Record not handle UTF-8 at all?
> >
> > Thanks,
> > Shelley
> >
> > ----
> > Shelley Doljack
> > E-Resources Metadata Librarian
> > Metadata and Library Systems
> > Stanford University Libraries
> > sdolj...@stanford.edu
> > 650-725-0167
>
> --
> Bill Dueber
> Programmer -- Library Systems
> University of Michigan
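
For anyone who wants to try the binmode suggestion quoted above in isolation, here is a minimal sketch. It is not taken from the thread: the sample string and the temporary file are stand-ins for real MARC::Batch output, and it uses the stricter ':encoding(UTF-8)' layer in place of the ':utf8' layer given in the quoted message, so treat that as a substitution rather than a correction.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: a character string holding a code point
# above 0xFF ("Dvorak" with diacritics). It stands in for record data;
# it is not real as_usmarc() output.
my $text = "Dvo\x{0159}\x{00E1}k";

my ($fh, $path) = tempfile();

# Declare the output encoding before printing. Without this layer,
# printing $text under "use warnings" triggers the
# "Wide character in print" warning.
binmode($fh, ':encoding(UTF-8)');
print {$fh} $text;
close($fh) or die "close failed: $!";

# Read the file back as raw bytes to see what actually reached disk.
open(my $in, '<:raw', $path) or die "open failed: $!";
my $bytes = do { local $/; <$in> };
close($in);

printf "%d characters written as %d bytes\n", length($text), length($bytes);
unlink $path;
```

The character count and the byte count differ because each accented letter takes more than one byte in UTF-8. Whether this matches what as_usmarc() returns in a given MARC::Record version is something to check against your own data.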