I recently came across this presentation, which lays out pretty much all
the issues with Unicode in Perl and makes some recommendations for best
practices. You may find some general insight into the whole situation by going
over it.

http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012

/dev
-- 
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm


-----Original Message-----
From: William Dueber [mailto:dueb...@umich.edu] 
Sent: Monday, July 30, 2012 8:14 PM
To: Shelley Doljack
Cc: perl4lib@perl.org
Subject: Re: printing UTF-8 encoded MARC records with as_usmarc

First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty
of that crap.

You need to tell Perl that you'll be outputting UTF-8, using 'binmode':

  binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for
reading or writing.
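
To see what the layer actually changes, here is a minimal sketch using only
core Perl. The helper name bytes_written and the file name demo.tmp are made
up for illustration, and it uses ':encoding(UTF-8)', the stricter variant of
':utf8' that also validates the data. Without a layer, a character such as
U+00F4 goes out as the single byte F4, which is not valid UTF-8 on its own.
That may be (an assumption, not something verified against the actual file)
where the "\xF4 does not map to Unicode" error comes from.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text to $path, optionally through an I/O
# layer, then return the raw bytes that landed on disk as a hex string.
sub bytes_written {
    my ($path, $text, $layer) = @_;

    open(my $out, '>', $path) or die "Cannot open $path: $!";
    binmode($out, $layer) if defined $layer;
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";

    # Read back with :raw so no decoding or newline translation happens.
    open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$in> };
    close($in);
    unlink $path;

    return unpack('H*', $bytes);
}

my $text = "c\x{F4}te";    # U+00F4, o with circumflex

# No layer: the character is written as the lone byte F4.
print bytes_written('demo.tmp', $text), "\n";                      # 63f47465

# With the layer: the character is written as the UTF-8 pair C3 B4.
print bytes_written('demo.tmp', $text, ':encoding(UTF-8)'), "\n";  # 63c3b47465
```

Characters above U+00FF cannot be written as a single byte at all, and those
are the ones that trigger the "Wide character in print" warning when no
layer is set.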

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default





On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack <sdolj...@stanford.edu> wrote:

> Hi,
>
> I wrote a script that extracts marc records from a file given certain
> conditions and puts them in a new file. When my input record is correctly
> encoded in UTF-8 and I run my script from windows command prompt, this
> warning message appears: "Wide character in print at record_extraction.pl
> line 99" (the line in my script where I print to a new file using
> as_usmarc). I compared the extracted record before and after in MarcEdit
> and the diacritic was changed. I tried marcdump newfile.mrc to see what
> happens and I get this error: "utf8 \xF4 does not map to Unicode at
> C:/Perl64/lib/Encode.pm line 176." When I run my extraction script again
> with MARC-8 encoded data then I don't have the same problem.
>
> The basic outline of my script is:
>
> my $batch = MARC::Batch->new('USMARC', $input_file);
>
> while (my $record = $batch->next()) {
>      #do some checks
>      #if checks ok then
>      print FILE $record->as_usmarc();
> }
>
> Do I need to add something that specifies to interpret the data as UTF-8?
> Does MARC::Record not handle UTF-8 at all?
>
> Thanks,
> Shelley
>
> ----
> Shelley Doljack
> E-Resources Metadata Librarian
> Metadata and Library Systems
> Stanford University Libraries
> sdolj...@stanford.edu
> 650-725-0167
>



-- 

Bill Dueber
Programmer -- Library Systems
University of Michigan
