It's hilarious that Terry has had to deal with enough Marc encoding ugliness that he can recognize 0xC2 off the bat as the Marc8 character it represents! I am in awe, as well as sympathy.

If the record is in Marc8, then you need to know whether Perl MARC::Batch can handle Marc8. If it's supposed to be able to handle it, you need to figure out why it isn't (for example, does the leader byte say UTF-8 even though the record is really Marc8?).

If MARC::Batch can't handle Marc8, you need to convert to UTF-8 first. The only software package I know of that can convert to and from Marc8 is the Java library Marc4J, but I wouldn't be shocked if there were something in Perl to do it. (But yes, as you can tell from the name, "Marc8" is a character encoding used ONLY in Marc; nobody but library people writes software for dealing with it.)
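
For what it's worth, here is a rough way to check whether the leader and the actual bytes disagree. It is a sketch in plain Python, not Perl, and it has nothing to do with how MARC::Batch works internally. It assumes the MARC 21 convention that leader position 09 is "a" for UCS/Unicode and blank for Marc8; the function names are made up for illustration. A lone 0xC2 followed by an ordinary ASCII byte is not valid UTF-8, which is what this would flag.

```python
# Sketch only: stdlib Python, independent of MARC::Batch.
# Assumption: leader position 09 is b"a" for UCS/Unicode and b" " for MARC-8.
# All function names here are hypothetical.


def declared_encoding(record: bytes) -> str:
    """Return the encoding that leader position 09 claims for this record."""
    flag = record[9:10]
    if flag == b"a":
        return "UTF-8"
    if flag == b" ":
        return "MARC-8"
    return "unknown"


def is_valid_utf8(record: bytes) -> bool:
    """Return True if the raw record bytes decode cleanly as UTF-8."""
    try:
        record.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def diagnose(record: bytes) -> str:
    """Compare what the leader claims with what the bytes look like."""
    declared = declared_encoding(record)
    valid = is_valid_utf8(record)
    has_high_bytes = any(b > 0x7F for b in record)
    if declared == "UTF-8" and not valid:
        return "leader says UTF-8 but the bytes are not valid UTF-8"
    if declared == "MARC-8" and valid and has_high_bytes:
        return "leader says MARC-8 but the bytes look like UTF-8"
    return f"leader says {declared}; nothing obviously inconsistent"
```

Calling diagnose() on the raw bytes of each record (records in a Marc file end with the byte 0x1D) would at least tell you which of the two situations you are in before you go digging in the Perl module.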

On 4/6/2011 5:01 PM, Reese, Terry wrote:
I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker in 
MARC-8.  I'd guess the file isn't in UTF8.

--TR

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Jonathan Rochkind
Sent: Wednesday, April 06, 2011 1:28 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] utf8 "\xC2" does not map to Unicode

I am not familiar with that Perl module. But I'm more familiar than I'd want
with char encoding in Marc.

I don't recognize the byte 0xC2 (there are some bytes I became pathetically
familiar with in past debugging, but I've forgotten them), but the first things
to look at:

1. Is your Marc file encoded in Marc8 or UTF-8?  I'm betting Marc8.
Theoretically there is a Marc leader byte that tells you whether it's
Marc8 or UTF-8, but the leader byte is often wrong in real world records.  Is it
wrong?

2. Does Perl MARC::Batch  have a function to convert from Marc8 to
UTF-8?   If so, how does it decide whether to convert? Is it trying to
do that?  Is it assuming that the leader byte of the record accurately
identifies the encoding, and if so, is the leader byte wrong?   Is it
trying to convert from Marc8 to UTF-8, when the source was UTF-8 in the
first place?  Or is it assuming the source was UTF-8 in the first place, when in
fact it was Marc8?

Not the answer you wanted; maybe someone else will have that. Debugging
char encoding is hands down the most annoying kind of debugging I ever do.

On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:
Ack! While using the venerable Perl MARC::Batch module I get the
following error while trying to read a MARC record:
    utf8 "\xC2" does not map to Unicode

This is a real pain, and I'm hoping someone here can help me either: 1) trap
this error allowing me to move on, or 2) figure out how to open the file
"correctly".
