yaz-marcdump does a really good job of charset and format conversion for MARC records, and is blindingly fast.
But yaz-marcdump seems to think there are a lot of separators in the wrong place and bad indicator data, whether treating the records as UTF-8 or MARC-8. The leaders in the records say they are UTF-8, but looking at the data, the byte sequences that Jon G. noticed remind me of UTF-8 data that was UTF-8-encoded a second time. I wonder if they got re-encoded in transmission somewhere along the way. Maybe just in the download from zoila.

-Tod

On Apr 6, 2011, at 4:11 PM, Jonathan Rochkind wrote:

> That's hilarious, that Terry has had to do enough ugliness with Marc
> encodings that he indeed can recognize 0xC2 off the bat as the Marc8
> encoding it represents! I am in awe, as well as sympathy.
>
> If the record is in Marc8, then you need to know if Perl MARC::Batch can
> handle Marc8. If it's supposed to be able to handle it, you need to
> figure out why it's not. (leader byte says UTF-8 even though it's really
> Marc8?).
>
> If MARC::Batch can't handle Marc8, you need to convert to UTF-8 first.
> The only software package I know of that can convert from and to Marc8
> encoding is Java Marc4J, but I wouldn't be shocked if there was
> something in Perl to do it. (But yes, as you can tell by the name,
> "Marc8" is a character encoding ONLY used in Marc, nobody but library
> people write software for dealing with it).
>
> On 4/6/2011 5:01 PM, Reese, Terry wrote:
>> I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker
>> in MARC-8. I'd guess the file isn't in UTF8.
>>
>> --TR
>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
>>> Jonathan Rochkind
>>> Sent: Wednesday, April 06, 2011 1:28 PM
>>> To: CODE4LIB@LISTSERV.ND.EDU
>>> Subject: Re: [CODE4LIB] utf8 "\xC2" does not map to Unicode
>>>
>>> I am not familiar with that Perl module. But I'm more familiar than I'd want
>>> with char encoding in Marc.
>>>
>>> I don't recognize the bytes 0xC2 (there are some bytes I became pathetically
>>> familiar with in past debugging, but I've forgotten em), but the first
>>> things to look at:
>>>
>>> 1. Is your Marc file encoded in Marc8 or UTF-8? I'm betting Marc8.
>>> Theoretically there is a Marc leader byte that tells you whether it's
>>> Marc8 or UTF-8, but the leader byte is often wrong in real world records.
>>> Is it wrong?
>>>
>>> 2. Does Perl MARC::Batch have a function to convert from Marc8 to
>>> UTF-8? If so, how does it decide whether to convert? Is it trying to
>>> do that? Is it assuming that the leader byte the record accurately
>>> identifies the encoding, and if so, is the leader byte wrong? Is it
>>> trying to convert from Marc8 to UTF-8, when the source was UTF-8 in the
>>> first place? Or is it assuming the source was UTF-8 in the first place,
>>> when in fact it was Marc8?
>>>
>>> Not the answer you wanted, maybe someone else will have that. Debugging
>>> char encoding is hands down the most annoying kind of debugging I ever do.
>>>
>>> On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:
>>>> Ack! While using the venerable Perl MARC::Batch module I get the
>>>> following error while trying to read a MARC record:
>>>>
>>>> utf8 "\xC2" does not map to Unicode
>>>>
>>>> This is a real pain, and I'm hoping someone here can help me either: 1)
>>>> trap this error allowing me to move on, or 2) figure out how to open the
>>>> file "correctly".

Tod Olson <t...@uchicago.edu>
Systems Librarian
University of Chicago Library
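
P.S. One way to test the "UTF-8 that was UTF-8-encoded a second time" theory on a suspect field value is to try peeling off one layer of encoding and see whether valid UTF-8 comes out. The following is only a minimal sketch in Python, not anything yaz-marcdump or MARC::Batch does. The function name `undo_double_utf8` is made up for illustration, and it assumes the second encoding pass treated the original bytes as Latin-1 (ISO-8859-1); if the culprit assumed Windows-1252 or something else, the middle step would need to change.

```python
from typing import Optional


def undo_double_utf8(raw: bytes) -> Optional[bytes]:
    """Return singly-encoded UTF-8 if `raw` looks double-encoded, else None.

    Assumption (not confirmed for these records): the extra pass read the
    original UTF-8 bytes as Latin-1 and then wrote them out as UTF-8 again.
    """
    try:
        text = raw.decode("utf-8")        # outer layer must be valid UTF-8
        inner = text.encode("latin-1")    # fails if any code point is above U+00FF
        inner.decode("utf-8")             # inner layer must also be valid UTF-8
    except UnicodeError:
        return None
    if inner == raw:
        # Pure ASCII survives both passes unchanged, so it proves nothing.
        return None
    return inner


# Build a double-encoded sample under the same Latin-1 assumption.
once = "café".encode("utf-8")
twice = once.decode("latin-1").encode("utf-8")

print(twice)                    # the doubled form contains a 0xC2 byte
print(undo_double_utf8(twice))  # one layer peeled off
print(undo_double_utf8(once))   # correctly encoded data is left alone
```

This would need to be run on individual field values rather than on a whole raw record, since changing the byte length of the data invalidates the lengths and offsets recorded in the MARC leader and directory.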