Hi Leif, I really appreciate you taking a look at this and responding. Although I consider myself somewhat knowledgeable about character sets, I still find these kinds of problems to be confusing.
> In this case the leader and actual length will not agree, as
> your utf8 characters have turned into latin1.

I was under the impression that the MARC record length in the Leader was the record length in bytes rather than the number of characters. Is that your understanding?

Also, I am still troubleshooting my particular set of records (I was out of town last week), since this problem only appears to manifest itself for records with non-ASCII characters in the 100 and 245 fields. Records with a note field containing non-ASCII characters don't cause a problem.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Leif Andersson [mailto:[EMAIL PROTECTED]
> Sent: Saturday, March 01, 2008 2:51 PM
> To: Doran, Michael D; perl4lib@perl.org; [EMAIL PROTECTED]
> Subject: Re: Help for utf-8 output
>
> It seems there is a little bug (by design) kicking in.
>
> The leader gets wrong and some characters get wrong in this case:
> + Reading a raw marc record (utf8) from file
> + Turning it into a MARC::Record object
> + Without modification writing it out to file.
> Yes. Even without modification the bug manifests itself!
>
> Let's start with code simply copying one record from a file
> utf8.mrc containing one or more marc records. This basic
> operation not involving MARC::Record is OK.
>
> #!perl -w
> use strict;
> #
> open(IN, "utf8.mrc") || die "1";
> open(OUT, ">out_good.mrc") || die "2";
> binmode IN;
> binmode OUT;
> #
> # Read in raw MARC
> $/ = "\x1D";
> my $marc = <IN>;
> print OUT $marc;
> __END__
>
> Now, we're adding MARC::Record to the process, along with
> some debug info.
> Example code producing *faulty* record:
>
> #!perl -w
> use strict;
> use MARC::Record;
> use Devel::Peek;
> #
> open(IN, "utf8.mrc") || die "1";
> open(OUT, ">out_bad.mrc") || die "2";
> binmode IN;
> binmode OUT;
> #
> # Read in raw MARC
> $/ = "\x1D";
> my $marc = <IN>;
> Dump($marc); # the utf8-flag is not on
> my $obj = MARC::Record->new_from_usmarc( $marc );
> # Convert back to raw MARC
> my $marc2 = $obj->as_usmarc();
> Dump($marc2); # the utf8-flag IS on
> print OUT $marc2;
> __END__
>
> In this case the leader and actual length will not agree, as
> your utf8 characters have turned into latin1.
> The problem is that $marc2 has the utf8 flag set internally by Perl.
> And the conversion on output is made in spite of binmode.
>
> We can get around the problem by either (for instance)
>   use bytes;
> or
>   Encode::_utf8_off($marc2);
> before printing to file.
>
> But shouldn't MARC::Record take care of this for us?
> A file of MARC records may contain records in different encodings.
> The text parts of a MARC record can be treated as made up by
> certain encodings, but the "blob" itself, I suppose, should
> be exposed to the caller as pure binary.
>
> Are there any drawbacks in letting MARC::Record strip off any
> eventual utf8 flag before returning the record as_usmarc() ?
> If not I suggest this change be made to a future release of
> MARC::Record.
>
> I shall also add that this character mess only sets in when doing IO.
> If you are updating your databases through one API or another
> you are probably OK!
>
> Leif
> ======================================
> Leif Andersson, Systems Librarian
> Stockholm University Library
> SE-106 91 Stockholm
> SWEDEN
> Phone : +46 8 162769
> Mobile: +46 70 6904281
>
> -----Original Message-----
> From: Doran, Michael D [mailto:[EMAIL PROTECTED]
> Sent: 21 February 2008 18:49
> To: perl4lib@perl.org
> Subject: RE: Help for utf-8 output
>
> Hi Jackie,
>
> I'm working on a very similar problem...
> converting theses/dissertations records (in XML) to MARC records. I'm
> still in the testing stage, but have had similar problems
> with records with diacritics in the 100 or 245 fields
> (however diacritics in a 520a field don't seem to cause any
> problems). Since our records are not "diacritic rich" it's
> hard to determine the exact extent of the problem.
>
> I am using these versions:
>   Perl v5.8.8
>   MARC::Charset 0.98
>   MARC::Lint 1.43
>   MARC::Record 2.0
>   XML::LibXML 1.66
>
> Here's an example "bad" record (which I have minimized to
> just the 245 field):
>
> marcdump test.mrc
> test.mrc
> LDR 00127cam a2200037 4500
> 245 13 _aAn Empirical Test Of The Situational Leadership®
>        Model In Japan /
>        _cRiho Yoshioka.
>
>  Recs  Errs Filename
> ----- ----- --------
>     1     1 test.mrc
>
> When I run test.mrc through MARC::Lint, I get this message:
>
> Invalid record length in record 1: Leader says 00127 bytes but it's actually 125
> Invalid length in directory for tag 245 in record 1
> field does not end in end of field character in tag 245 in record 1
>
> When examined in vi the character in question, a Registered
> Sign, appears to be correctly UTF-8 encoded C2AE, and the bib
> Leader (position 09=a) indicates that it is Unicode encoded.
> I've attached the MARC record.
>
> I noticed that when I run your record (ck245.dat) through
> MARC::Lint, I get the same invalid record length message:
>
> Invalid record length in record 3: Leader says 00567 bytes but it's actually 569
> field does not end in end of field character in tag 100 in record 3
> field does not end in end of field character in tag 245 in record 3
> Invalid indicators ".10" forced to blanks in record 3 for tag 245
> field does not end in end of field character in tag 260 in record 3
> Invalid indicators ". " forced to blanks in record 3 for tag 260
> field does not end in end of field character in tag 300 in record 3
> Invalid indicators ". " forced to blanks in record 3 for tag 300
> field does not end in end of field character in tag 502 in record 3
> Invalid indicators ". " forced to blanks in record 3 for tag 502
> field does not end in end of field character in tag 504 in record 3
> Invalid indicators ". " forced to blanks in record 3 for tag 504
> field does not end in end of field character in tag 690 in record 3
> Invalid indicators ". 4" forced to blanks in record 3 for tag 690
>
> Anybody have any ideas?
>
> -- Michael
>
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
>
> > -----Original Message-----
> > From: Shieh, Jackie [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, February 19, 2008 10:50 AM
> > To: perl4lib@perl.org
> > Subject: Help for utf-8 output
> >
> > I was wondering if anyone has had a similar experience and has come up
> > with good solutions to help solve the challenge below?!
> >
> > What I have is an Excel spreadsheet for dissertations which I have
> > saved as a tab delimited file (examining the file in TextPad, the
> > diacritics appear to be fine), then read in and output the file as a
> > utf-8 MARC file. I <print> title field confirming author field that
> > contains diacritics with the title showing proper indicator values.
> >
> > But when I looked at the MARC itself, the fields that follow the field
> > containing diacritics are all off their original positions. See attached
> > zip file. Examples below (first two have diacritics in a 100 field,
> > last one diacritic is in 245 subfield b):
> >
> > 001 diss 34001
> > 100 1 _aP<E9>rez, Nancy L.
> > 245 _aSynchronic and Diachronic Matlatzinkan Phonology.
> >
> > 001 diss 34042
> > 100 1 _aValent<ED>n-M<E1>rquez, Wilfredo
> > 245 _aDoing being boricua :
> >
> > 001 diss 33892
> > 100 1 _aDavis, Jennifer M.
> > 245 14 _aThe Functional Complexities of Inherited Cardiac Troponin I
> >        Mutations :
> >        _bIdentification of Ca<B2>+ Independent Contractile
> >        Dysfunction.
> >
> > I would greatly appreciate any suggestion to solve this.
> > Thank you most kindly.
> >
> > Regards,
> >
> > --Jackie
> >
> > |Jackie Shieh
> > |Data Loads & Development
> > |Harlan Hatcher Graduate Library
> > |University of Michigan
> > |920 North University
> > |Ann Arbor, MI 48109-1205
> > |Phone: 734.763.6070 FAX: 734.615.9788
> > |E-mail: JShieh [AT] umich [DOT] edu
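
The byte-versus-character mismatch discussed in this thread can be reproduced without MARC::Record. What follows is only a minimal sketch, not a description of how MARC::Record works internally. It assumes a string holding the Registered Sign from the example record, decoded so that Perl's utf8 flag is on, and it writes to an in-memory filehandle in place of a real .mrc file. The helper names bytes_written and as_utf8_bytes are hypothetical. Calling utf8::encode is swapped in here as an alternative to the Encode::_utf8_off workaround quoted above, and it assumes that a string without the flag already holds raw UTF-8 bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes as they would sit in a MARC file: "Leadership" plus
# the Registered Sign encoded as C2 AE.
my $raw = "Leadership\xC2\xAE";

# A decoded (character) string, standing in for a value that comes back
# with Perl's internal utf8 flag switched on.
my $chars = decode('UTF-8', $raw);

# Hypothetical helper: print a string to an in-memory handle in binmode
# and return the bytes that were actually written.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open(my $out, '>', \$buffer) or die "cannot open in-memory handle: $!";
    binmode $out;
    print {$out} $string;
    close $out;
    return $buffer;
}

# Hypothetical helper: return a copy holding UTF-8 bytes, encoding only
# when the utf8 flag is on. ASSUMPTION: an unflagged string is already
# raw UTF-8 bytes, as a record read with binmode would be.
sub as_utf8_bytes {
    my ($string) = @_;
    utf8::encode($string) if utf8::is_utf8($string);
    return $string;
}

printf "raw bytes in file:     %d\n", length($raw);
printf "characters in string:  %d\n", length($chars);
printf "bytes written as-is:   %d\n", length(bytes_written($chars));
printf "bytes written encoded: %d\n", length(bytes_written(as_utf8_bytes($chars)));
```

If the flagged string is printed as-is, the handle receives one Latin-1 byte for the Registered Sign instead of the two UTF-8 bytes, so a length computed from the UTF-8 bytes no longer matches what lands in the file. Encoding to bytes before printing keeps the two in step.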